[general_comp] AISAR Seminar – Among Us: A Sandbox for Measuring and Detecting Agentic Deception
Agustín Martinez Suñé
agusmartinez92 at gmail.com
Mon Sep 22 08:26:58 -03 2025
On behalf of the AISAR Scholarship Program in AI Safety, we are pleased to
invite you to the next talk in our online seminar series, featuring
researchers working in the field.
📌 Date and time: Wednesday, September 24, 1:00 PM (ART).
🎤 Speaker: Adrià Garriga-Alonso – Research Scientist, FAR AI
<https://www.far.ai/>
📖 Title: Among Us: A Sandbox for Measuring and Detecting Agentic Deception
👉 Registration: To attend the talk, please enter your name in the following
form (you do not need to fill it out if you already checked "Notify me by
email when there are new AISAR talks" in a previous form):
https://forms.gle/XNDf9uskcRoZ6koW6
Abstract: Prior studies on deception in language-based AI agents typically
assess whether the agent produces a false statement about a topic, or makes
a binary choice prompted by a goal, rather than allowing open-ended
deceptive behavior to emerge in pursuit of a longer-term goal. To fix this,
we introduce Among Us, a sandbox social deception game where LLM agents
exhibit long-term, open-ended deception as a consequence of the game
objectives. While most benchmarks saturate quickly, Among Us can be
expected to last much longer, because it is a multi-player game far from
equilibrium. Using the sandbox, we evaluate 18 proprietary and open-weight
LLMs and uncover a general trend: models trained with RL are comparatively
much better at producing deception than detecting it. We evaluate the
effectiveness of two methods for detecting lying and deception: logistic
regression probes on the activations and sparse autoencoders (SAEs). We find
that probes trained on a dataset of "pretend you're a dishonest model: …"
prompts generalize extremely well out of distribution, consistently
obtaining AUROCs over 95% even when evaluated only on the deceptive
statement, without the chain of thought. We also find two SAE features that
work well for deception detection but cannot be used to steer the model to
lie less. We hope our open-sourced sandbox, game logs, and probes serve to
anticipate and mitigate deceptive behavior and capabilities in
language-based agents.
Find the paper here: https://arxiv.org/abs/2504.04072
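
As a rough illustration of the probing approach mentioned in the abstract,
here is a minimal sketch (not the authors' code) of fitting a logistic
regression probe on per-statement activation vectors labeled honest vs.
deceptive, and scoring it with AUROC on held-out data. The shapes and the
synthetic data are placeholders; in the paper's setting the features are
LLM activations extracted from game transcripts.

# Minimal sketch of a deception-detection probe: logistic regression on
# activation vectors, evaluated with AUROC. Synthetic data stands in for
# real LLM activations (hypothetical shapes; not the authors' pipeline).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Placeholder "activations": one d-dimensional vector per statement,
# with label 1 = deceptive, 0 = honest.
n_statements, d_model = 2000, 512
X = rng.normal(size=(n_statements, d_model))
y = rng.integers(0, 2, size=n_statements)
# Inject a weak linear signal so the probe has something to find.
direction = rng.normal(size=d_model)
X += np.outer(y - 0.5, direction)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The probe itself: a regularized logistic regression on the activations.
probe = LogisticRegression(max_iter=1000, C=1.0)
probe.fit(X_train, y_train)

# AUROC on held-out statements; the paper reports >95% out of distribution.
scores = probe.predict_proba(X_test)[:, 1]
print(f"AUROC: {roc_auc_score(y_test, scores):.3f}")
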
AISAR Team
http://scholarship.aisafety.ar/