We focus on technical approaches for managing risks from powerful AI.
In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion.
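As a rough illustration of what a protocol can look like (not code from the paper; `untrusted_model`, `trusted_monitor`, and `human_audit` are hypothetical stand-ins), one common shape is to route an untrusted model's outputs through a weaker trusted monitor and escalate only the most suspicious outputs for expensive auditing:

```python
# Illustrative sketch only: a monitoring-style protocol shape.
# The three helper functions are hypothetical placeholders, not APIs from the paper.

from dataclasses import dataclass

@dataclass
class Verdict:
    output: str
    suspicion: float
    audited: bool

def untrusted_model(task: str) -> str:
    # Placeholder: a capable but possibly subversive model's solution.
    return f"solution for {task!r}"

def trusted_monitor(task: str, output: str) -> float:
    # Placeholder: a weaker, trusted model scores how suspicious the output is.
    return 0.1

def human_audit(task: str, output: str) -> bool:
    # Placeholder: expensive, high-quality review of a small fraction of outputs.
    return True

def run_protocol(tasks, audit_threshold=0.8):
    results = []
    for task in tasks:
        out = untrusted_model(task)
        suspicion = trusted_monitor(task, out)
        audited = suspicion >= audit_threshold
        if audited and not human_audit(task, out):
            out = "REJECTED"  # fall back to a safe action when an audit fails
        results.append(Verdict(out, suspicion, audited))
    return results

print(run_protocol(["sort a list", "parse a config file"]))
```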
This paper studies encoded reasoning: LLMs could encode intermediate steps of reasoning in the text they generate in a way that is not understandable to human readers.
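As a toy illustration of the underlying mechanism (not an example from the paper), a fixed choice between synonymous phrasings can carry hidden bits that a human reader would not notice but that a decoder who knows the scheme can recover:

```python
# Toy illustration of hiding information in innocuous-looking text:
# one hidden bit per slot, carried by the choice between two synonymous phrasings.

PAIRS = [("Overall,", "In summary,"), ("good", "solid"), ("choice", "option")]

def encode(bits):
    # Pick phrasing variant 0 or 1 according to each hidden bit.
    words = [PAIRS[i][b] for i, b in enumerate(bits)]
    return f"{words[0]} this is a {words[1]} {words[2]}."

def decode(text):
    # Recover each bit by checking which variant appears in the text.
    return [1 if b in text else 0 for a, b in PAIRS]

hidden = [1, 0, 1]
msg = encode(hidden)
print(msg)                    # "In summary, this is a good option."
assert decode(msg) == hidden  # the hidden bits survive in fluent text
```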
This work tackles measurement tampering, where an AI system manipulates multiple measurements to create the illusion of good results instead of actually achieving the desired outcome.
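A toy sketch of the problem's structure (illustration only, not the paper's benchmark or method): each example carries several measurements of whether an outcome was achieved, and the hard cases are the ones where tampering fools every measurement at once, so simple consistency checks miss them:

```python
# Toy framing of measurement tampering detection (illustration only).
import random
random.seed(0)

def make_example(tampered: bool):
    true_outcome = not tampered and random.random() < 0.5
    if true_outcome:
        measurements = [True, True, True]
    elif tampered:
        # Tampering fakes each measurement, but occasionally misses one.
        measurements = [random.random() < 0.9 for _ in range(3)]
    else:
        measurements = [False, False, False]
    return {"measurements": measurements, "true_outcome": true_outcome,
            "tampered": tampered}

data = [make_example(tampered=random.random() < 0.3) for _ in range(1000)]

# Naive baseline: flag examples whose measurements disagree with each other.
# This misses the hard cases, where every measurement was fooled at once.
flagged = [ex for ex in data if len(set(ex["measurements"])) > 1]
missed = [ex for ex in data if ex["tampered"] and all(ex["measurements"])]
print(f"flagged by disagreement: {len(flagged)}, fully faked and missed: {len(missed)}")
```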
A library for expressing and manipulating the tensor computations that arise in neural network interpretability research, written in Rust and used from Python notebooks.
A principled approach for evaluating the quality of mechanistic interpretations.
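One way such an evaluation can work, sketched on a toy function (an illustration under our own simplifying assumptions, not the paper's exact procedure): resample everything the interpretation claims is irrelevant and measure how much of the behavior is preserved; more complete hypotheses should score higher:

```python
# Sketch: score an interpretation by resampling the inputs it calls irrelevant.
import numpy as np
rng = np.random.default_rng(0)

def model(x):
    # Toy "model": in truth, the output depends on x[0] and (weakly) x[2].
    return 3.0 * x[0] + 0.3 * x[2]

def score_hypothesis(relevant_idx, n_trials=1000):
    """Fraction of behavior preserved when claimed-irrelevant inputs are resampled."""
    errs, scale = [], []
    for _ in range(n_trials):
        x = rng.normal(size=3)
        x_scrub = rng.normal(size=3)             # fresh sample for irrelevant coords
        x_scrub[relevant_idx] = x[relevant_idx]  # keep coords the hypothesis says matter
        errs.append((model(x) - model(x_scrub)) ** 2)
        scale.append(model(x) ** 2)
    return 1.0 - np.mean(errs) / np.mean(scale)

print("hypothesis {x0}:    ", round(score_hypothesis([0]), 3))      # good but incomplete
print("hypothesis {x0, x2}:", round(score_hypothesis([0, 2]), 3))   # scores higher
```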
We present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI).
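For concreteness, here is the task behavior itself (a sketch using the `transformers` library, not the paper's analysis code): on an IOI prompt, GPT-2 small should prefer the indirect object ("Mary") over the repeated subject ("John"):

```python
# Requires `transformers` and `torch`; downloads GPT-2 small on first run.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

mary_id = tokenizer.encode(" Mary")[0]
john_id = tokenizer.encode(" John")[0]
# A positive difference means the model favors the indirect object.
print("logit diff (Mary - John):", (logits[mary_id] - logits[john_id]).item())
```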
We show that, in a toy model, the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features, and entirely ignore the least important features.
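A minimal toy in this spirit (illustrative, and not necessarily the exact model from the paper): squeeze eight features of decaying importance into four dimensions, train reconstruction with an importance-weighted loss, and then inspect how much representation each feature received:

```python
# Toy superposition-style setup: more features than dimensions,
# importance-weighted reconstruction loss.
import torch

torch.manual_seed(0)
n_feat, n_dim, sparsity = 8, 4, 0.9
importance = 0.6 ** torch.arange(n_feat).float()  # geometrically decaying importance

W = torch.nn.Parameter(torch.randn(n_dim, n_feat) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_feat))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse non-negative features.
    x = torch.rand(256, n_feat) * (torch.rand(256, n_feat) > sparsity)
    x_hat = torch.relu(x @ W.T @ W + b)  # encode to n_dim, decode back
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# One rough proxy for the capacity a feature gets is the squared norm of its
# embedding column; values near zero mean the feature is essentially ignored.
print((W.norm(dim=0) ** 2).detach())
```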
We found that humans seem to be consistently worse at next-token prediction than even small language models.
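A sketch of the model side of that comparison (using the `transformers` library; the passage below is a placeholder): score a small model's top-1 next-token accuracy and per-token loss on a text, the same quantities a human playing the prediction game can be scored on:

```python
# Requires `transformers` and `torch`; downloads GPT-2 small on first run.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = ("The quick brown fox jumps over the lazy dog. It was the best of times, "
        "it was the worst of times.")
ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

preds = logits[0, :-1].argmax(dim=-1)  # model's guess for each next token
targets = ids[0, 1:]                   # the actual next tokens
accuracy = (preds == targets).float().mean().item()
loss = torch.nn.functional.cross_entropy(logits[0, :-1], targets).item()
print(f"top-1 accuracy: {accuracy:.2f}, loss per token: {loss:.2f}")
```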
We use a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training.
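The overall loop has a simple shape, sketched below with hypothetical stand-ins (`classifier`, `search_for_failures`, `is_actually_injurious`, and `finetune` are placeholders, not the paper's tools): adversaries search for completions that are actually injurious but pass the filter, and the filter is retrained on whatever they find:

```python
# Sketch of an adversarial-training loop for a high-reliability safety filter.

def classifier(completion: str) -> bool:
    """Placeholder safety filter: True means 'predicted injurious'."""
    return "injur" in completion.lower()

def is_actually_injurious(completion: str) -> bool:
    """Placeholder for ground-truth human labeling."""
    return "hurt" in completion.lower()

def search_for_failures(classifier, n: int):
    """Placeholder for adversarial search (human red-teaming, rewriting tools,
    automated attacks): completions that are injurious but pass the filter."""
    candidates = ["He was badly hurt by the fall.", "Nothing bad happened."]
    return [c for c in candidates[:n]
            if not classifier(c) and is_actually_injurious(c)]

def finetune(classifier, failures):
    """Placeholder: retrain the filter on newly found failures."""
    return classifier

for round_idx in range(3):
    failures = search_for_failures(classifier, n=2)
    print(f"round {round_idx}: found {len(failures)} failure(s)")
    classifier = finetune(classifier, failures)
```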