<div dir="ltr"><span id="gmail-docs-internal-guid-cbf77756-7fff-6368-1f62-a585869894b3" style="color:rgb(0,0,0)"><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt"><span style="font-family:Arial,sans-serif;font-size:11pt;white-space:pre-wrap">The AISAR Scholarship Program in AI Safety is pleased to invite you to the next talk in our online seminar series, which features researchers working in the field.</span><br></p><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt"><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">📌 </span><span style="font-size:11pt;font-family:Arial,sans-serif;font-weight:700;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Date and time:</span><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> Wednesday, September 24, 13:00 (Argentina time).</span><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><br></span><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">🎤 </span><span 
style="font-size:11pt;font-family:Arial,sans-serif;font-weight:700;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Speaker:</span><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> Adrià Garriga-Alonso – Research Scientist, </span><a href="https://www.far.ai/" style="text-decoration:none"><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;text-decoration:underline;vertical-align:baseline;white-space:pre-wrap">FAR AI</span><span style="font-size:11pt;font-family:Arial,sans-serif;color:rgb(0,0,0);font-style:italic;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><br></span></a><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">📖 </span><span style="font-size:11pt;font-family:Arial,sans-serif;font-weight:700;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Title:</span><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> </span><span 
style="font-size:11pt;font-family:Arial,sans-serif;font-style:italic;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Among Us: A Sandbox for Measuring and Detecting Agentic Deception</span></p><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt"><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">👉 </span><span style="font-size:11pt;font-family:Arial,sans-serif;font-weight:700;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Registration:</span><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"> To attend the talk, please enter your name in the following form. You do not need to fill it out if you already selected "Quiero que me avisen por correo electrónico cuando haya nuevas charlas de AISAR" ("Notify me by email when there are new AISAR talks") in a previous form: </span><a href="https://forms.gle/XNDf9uskcRoZ6koW6" style="text-decoration:none"><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;text-decoration:underline;vertical-align:baseline;white-space:pre-wrap">https://forms.gle/XNDf9uskcRoZ6koW6</span></a></p><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt"><span 
style="font-size:11pt;font-family:Arial,sans-serif;font-weight:700;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Abstract: </span><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Prior studies on deception in language-based AI agents typically assess whether the agent produces a false statement about a topic, or makes a binary choice prompted by a goal, rather than allowing open-ended deceptive behavior to emerge in pursuit of a longer-term goal. To fix this, we introduce Among Us, a sandbox social deception game where LLM-agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, Among Us can be expected to last much longer, because it is a multi-player game far from equilibrium. Using the sandbox, we evaluate 18 proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than detecting it. We evaluate the effectiveness of methods to detect lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). We find that probes trained on a dataset of "pretend you're a dishonest model: …" generalize extremely well out-of-distribution, consistently obtaining AUROCs over 95% even when evaluated just on the deceptive statement, without the chain of thought. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less. 
We hope our open-sourced sandbox, game logs, and probes serve to anticipate and mitigate deceptive behavior and capabilities in language-based agents.</span><span style="font-size:11pt;font-family:Arial,sans-serif;font-weight:700;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><br><br></span></p><p dir="ltr" style="line-height:1.38;margin-top:12pt;margin-bottom:12pt"><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">Find the paper here: </span><a href="https://arxiv.org/abs/2504.04072" style="text-decoration:none"><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;text-decoration:underline;vertical-align:baseline;white-space:pre-wrap">https://arxiv.org/abs/2504.04072</span></a></p><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap">The AISAR Team</span><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;vertical-align:baseline;white-space:pre-wrap"><br></span><a href="http://scholarship.aisafety.ar/?utm_source=chatgpt.com" style="text-decoration:none"><span style="font-size:11pt;font-family:Arial,sans-serif;font-variant-ligatures:normal;font-variant-alternates:normal;font-variant-numeric:normal;font-variant-east-asian:normal;text-decoration:underline;vertical-align:baseline;white-space:pre-wrap">http://scholarship.aisafety.ar/</span></a></span><br></div>