[general_comp] AISAR Seminar – Among Us: A Sandbox for Measuring and Detecting Agentic Deception
Agustín Martinez Suñé
agusmartinez92 at gmail.com
Tue Sep 23 10:41:57 -03 2025
We are updating the information for the upcoming seminar: the topic of the
talk has changed, but the speaker remains the same.
📌 Date and time: Wednesday, September 24, 13:00 (ARG).
🎤 Speaker: Adrià Garriga-Alonso – Research Scientist, FAR AI
<https://www.far.ai/>
📖 Title: Reverse-engineering a neural network that plans: a
mesa-optimizer model organism
👉 Registration: To attend the talk, please enter your name in the
following form (you do not need to complete it if you already checked "I
want to be notified by email when there are new AISAR talks" in a previous
form):
https://forms.gle/XNDf9uskcRoZ6koW6
Abstract: We partially reverse-engineer a convolutional recurrent neural
network (RNN) trained to play the puzzle game Sokoban with model-free
reinforcement learning. Prior work found that this network solves more
levels with more test-time compute. Our analysis reveals several mechanisms
analogous to components of classic bidirectional search. For each square,
the RNN represents its plan in the activations of channels associated with
specific directions. These state-action activations are analogous to a
value function: their magnitudes determine when to backtrack and which
plan branch survives pruning. Specialized kernels extend these activations
(containing plan and value) forward and backward to create paths, forming a
transition model. The algorithm is also unlike classical search in some
ways. State representation is not unified; instead, the network considers
each box separately. Each layer has its own plan representation and value
function, increasing search depth. Far from being inscrutable, the
mechanisms for leveraging test-time compute that this network learned
through model-free training can be understood in familiar terms.
Find the paper here: https://arxiv.org/abs/2506.10138
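
As a rough illustration of what the abstract describes (a plan stored per
board square in direction-associated channels, whose magnitudes act like a
value function for backtracking and pruning), here is a minimal Python
sketch. The hidden-state tensor, channel indices, and threshold are
hypothetical placeholders, not the channels identified in the paper:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden state of the conv-RNN: (channels, height, width).
h = rng.normal(size=(32, 10, 10))

# Assumed mapping from movement direction to a "plan" channel index;
# in the paper such channels are found by interpretability analysis.
DIRECTION_CHANNELS = {"up": 3, "down": 7, "left": 12, "right": 20}

def decode_plan(hidden, threshold=0.5):
    # For each square, pick the direction whose plan channel is most
    # active; treat that magnitude as a value-like score, and mark
    # low-scoring squares as candidates for backtracking / pruning.
    names = list(DIRECTION_CHANNELS)
    stacked = np.stack([hidden[DIRECTION_CHANNELS[d]] for d in names])
    best = stacked.argmax(axis=0)        # planned direction per square
    value = stacked.max(axis=0)          # value-like magnitude per square
    plan = np.array(names, dtype=object)[best]
    return plan, value, value < threshold

plan, value, prune_mask = decode_plan(h)
print(plan[0, :4], value[0, :4].round(2), prune_mask[0, :4])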
AISAR Team
http://scholarship.aisafety.ar/
On Mon, Sep 22, 2025, at 12:26, Agustín Martinez Suñé (<
agusmartinez92 at gmail.com>) wrote:
> From the AISAR Scholarship Program in AI Safety, we are pleased to
> invite you to the next talk in our online seminar series, featuring
> researchers working in the field.
>
> 📌 Date and time: Wednesday, September 24, 13:00 (ARG).
> 🎤 Speaker: Adrià Garriga-Alonso – Research Scientist, FAR AI
> <https://www.far.ai/>
> 📖 Title: Among Us: A Sandbox for Measuring and Detecting Agentic
> Deception
>
> 👉 Registration: To attend the talk, please enter your name in the
> following form (you do not need to complete it if you already checked "I
> want to be notified by email when there are new AISAR talks" in a
> previous form):
> https://forms.gle/XNDf9uskcRoZ6koW6
>
> Abstract: Prior studies on deception in language-based AI agents
> typically assess whether the agent produces a false statement about a
> topic, or makes a binary choice prompted by a goal, rather than allowing
> open-ended deceptive behavior to emerge in pursuit of a longer-term goal.
> To fix this, we introduce Among Us, a sandbox social deception game where
> LLM-agents exhibit long-term, open-ended deception as a consequence of the
> game objectives. While most benchmarks saturate quickly, Among Us can be
> expected to last much longer, because it is a multi-player game far from
> equilibrium. Using the sandbox, we evaluate 18 proprietary and open-weight
> LLMs and uncover a general trend: models trained with RL are comparatively
> much better at producing deception than detecting it. We evaluate the
> effectiveness of methods to detect lying and deception: logistic regression
> on the activations and sparse autoencoders (SAEs). We find that probes
> trained on a dataset of "pretend you're a dishonest model: …" generalize
> extremely well out-of-distribution, consistently obtaining AUROCs over 95%
> even when evaluated just on the deceptive statement, without the chain of
> thought. We also find two SAE features that work well at deception
> detection but are unable to steer the model to lie less. We hope our
> open-sourced sandbox, game logs, and probes serve to anticipate and
> mitigate deceptive behavior and capabilities in language-based agents.
>
> Find the paper here: https://arxiv.org/abs/2504.04072
> AISAR Team
> http://scholarship.aisafety.ar/
>
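
For those curious about the detection method in the quoted abstract: a
linear probe is simply logistic regression fit on a model's activations,
scored with AUROC. Below is a minimal, generic Python sketch; the
activation matrices and labels are random placeholders standing in for the
"pretend you're a dishonest model: …"-style training set and the Among Us
evaluation logs, so it illustrates the recipe rather than reproducing the
reported >95% AUROC:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model = 512  # placeholder residual-stream width

# Placeholder activations: one vector per statement, label 1 = deceptive.
X_train = rng.normal(size=(1000, d_model))
y_train = rng.integers(0, 2, size=1000)
X_eval = rng.normal(size=(200, d_model))
y_eval = rng.integers(0, 2, size=200)

# A linear probe is logistic regression fit directly on the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Score held-out statements; on real activations the quoted abstract
# reports AUROCs above 95%, on random placeholders this is near 0.5.
scores = probe.predict_proba(X_eval)[:, 1]
print("deception-probe AUROC:", roc_auc_score(y_eval, scores))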