Research

An overview of our research and selected publications.

Research Goals

We focus on technical approaches for managing risks from powerful AI.

AI Control: Improving Safety Despite Intentional Subversion

December 23, 2023

In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion.
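
To make "protocol" concrete, here is a minimal sketch of trusted monitoring, one of the simplest protocols in this family: an untrusted model answers every task, a weaker trusted model scores each answer for suspicion, and the most suspicious fraction is sent to human audit. The functions, names, and the 2% audit budget below are illustrative stand-ins, not the paper's implementation.

    import random

    def untrusted_solve(task):
        # Stand-in for the untrusted (more capable) model; hypothetical.
        return f"solution to {task}"

    def trusted_suspicion_score(task, solution):
        # Stand-in for the trusted (weaker) monitor model; higher scores
        # mean the solution looks more like it might contain a backdoor.
        return random.random()

    def trusted_monitoring_protocol(tasks, audit_budget=0.02):
        """Run the untrusted model on every task, score each output with
        the trusted monitor, and send the most suspicious fraction to
        human audit."""
        outputs = []
        for task in tasks:
            solution = untrusted_solve(task)
            score = trusted_suspicion_score(task, solution)
            outputs.append({"task": task, "solution": solution, "score": score})

        # Audit the top `audit_budget` fraction by suspicion score.
        n_audits = max(1, int(len(outputs) * audit_budget))
        for item in sorted(outputs, key=lambda o: o["score"], reverse=True)[:n_audits]:
            item["audited"] = True
        return outputs

    if __name__ == "__main__":
        results = trusted_monitoring_protocol([f"task {i}" for i in range(100)])
        print(sum("audited" in r for r in results), "solutions sent to audit")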

Preventing Language Models From Hiding Their Reasoning

October 27, 2023

This paper studies encoded reasoning: LLMs could encode intermediate steps of reasoning in the text they generate in a way that is not understandable to human readers.
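
As a toy illustration of what encoded reasoning could look like (our construction, not an example from the paper): a model could hide one bit of intermediate reasoning per sentence in an otherwise innocuous wording choice, and later decode those bits to finish the task. The paper evaluates defenses such as paraphrasing, which would destroy a channel like this one.

    # Toy steganographic channel: hide bits of "reasoning" in synonym choice.
    # Purely illustrative; the paper studies whether real LLMs can learn this.

    SYNONYMS = {0: "also", 1: "additionally"}
    DECODE = {v: k for k, v in SYNONYMS.items()}

    def encode(bits):
        # Each sentence carries one hidden bit via its connective word.
        return [f"The answer is fine; {SYNONYMS[b]}, it is well supported." for b in bits]

    def decode(sentences):
        return [DECODE[s.split("; ")[1].split(",")[0]] for s in sentences]

    hidden_reasoning = [1, 0, 1, 1]
    text = encode(hidden_reasoning)
    assert decode(text) == hidden_reasoning
    print("\n".join(text))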

Benchmarks for Detecting Measurement Tampering

August 29, 2023

This work tackles measurement tampering, where an AI system manipulates multiple measurements to create the illusion of good results instead of actually achieving the desired outcome.
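
A minimal sketch of the structure the benchmarks revolve around, under our own simplifications and with hypothetical field names: each example carries several imperfect measurements plus a ground-truth label on a trusted subset, and the failure mode is examples where every measurement looks good but the ground truth is bad.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Example:
        text: str                      # input the AI system acted on
        measurements: List[bool]       # multiple imperfect checks of the outcome
        ground_truth: Optional[bool]   # known only on a trusted subset

    def naive_aggregate(ex: Example) -> bool:
        # Baseline that trusts the measurements: passes iff every check passes.
        # Measurement tampering is exactly the case where this is fooled:
        # all measurements read True while the ground truth is False.
        return all(ex.measurements)

    examples = [
        Example("honest success", [True, True, True], True),
        Example("honest failure", [False, True, False], False),
        Example("tampered", [True, True, True], False),   # looks good, isn't
    ]

    for ex in examples:
        print(ex.text, "-> naive:", naive_aggregate(ex), "| truth:", ex.ground_truth)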

Rust Circuit Library

February 16, 2023

A library for expressing and manipulating tensor computations for neural network interpretability, written in Rust and used in Python notebooks.

Causal Scrubbing

December 2, 2022

A principled approach for evaluating the quality of mechanistic interpretations.
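
In outline, causal scrubbing tests an interpretability hypothesis by replacing the activations it deems irrelevant with activations recorded on other inputs the hypothesis treats as equivalent, then checking how much of the model's behavior survives. The sketch below is our simplified rendering of that recipe, not the paper's algorithm; the three callables are stand-ins.

    import random

    def causal_scrub_losses(model_loss, run_patched, hypothesis_equivalent, dataset):
        """Crude sketch of the causal scrubbing recipe (not the paper's code).

        model_loss(x): loss of the unmodified model on input x.
        run_patched(x, donor): loss on x when the activations the hypothesis
            calls irrelevant are replaced by activations recorded on `donor`.
        hypothesis_equivalent(x, donor): True if the hypothesis claims the
            swap should leave the model's behavior unchanged.

        Returns (original_loss, scrubbed_loss); a faithful hypothesis keeps
        the scrubbed loss close to the original loss.
        """
        original = sum(model_loss(x) for x in dataset) / len(dataset)
        scrubbed_total = 0.0
        for x in dataset:
            donors = [d for d in dataset if d is not x and hypothesis_equivalent(x, d)]
            donor = random.choice(donors) if donors else x
            scrubbed_total += run_patched(x, donor)
        return original, scrubbed_total / len(dataset)

    # Trivial smoke test with made-up losses: patching never changes anything,
    # so the "hypothesis" here is trivially faithful.
    data = list(range(10))
    print(causal_scrub_losses(lambda x: 1.0, lambda x, d: 1.0, lambda x, d: True, data))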

Interpretability in the Wild

November 1, 2022

We present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI).
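
The task is easy to state: given a prompt such as "When Mary and John went to the store, John gave a drink to", the model should predict " Mary" rather than " John". A quick way to observe the behavior with the HuggingFace transformers library (our sketch, not the paper's code):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # "gpt2" is GPT-2 small
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    prompt = "When Mary and John went to the store, John gave a drink to"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]          # logits for the next token

    # The indirect object " Mary" should beat the subject " John".
    for name in [" Mary", " John"]:
        token_id = tokenizer.encode(name)[0]
        print(name, logits[token_id].item())

GPT-2 small reliably assigns the higher logit to the indirect object, and the paper traces this behavior to a specific circuit of attention heads.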

Polysemanticity and Capacity in Neural Networks

October 4, 2022

We show that in a toy model, the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features, and entirely ignore the least important features.
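
A rough sketch in the spirit of such toy models (our construction, not the paper's exact setup; all dimensions, sparsities, and hyperparameters are made up): more sparse features than hidden dimensions, weighted by importance, so the model must choose which features to represent cleanly, which to superpose, and which to drop.

    import torch

    # Toy setting: 6 sparse features of decreasing importance, squeezed into
    # a 3-dimensional hidden space. Capacity is scarce, so the model must
    # choose which features to represent cleanly, share, or ignore.
    n_features, d_hidden = 6, 3
    importance = torch.tensor([1.0, 0.7, 0.5, 0.3, 0.2, 0.1])

    W = torch.randn(n_features, d_hidden, requires_grad=True)
    opt = torch.optim.Adam([W], lr=1e-2)

    for step in range(2000):
        # Sparse features: each is nonzero only 5% of the time.
        x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) < 0.05)
        x_hat = torch.relu(x @ W @ W.T)          # encode then decode through W
        loss = (importance * (x - x_hat) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Rows of W with norm near 1 are represented; rows near 0 are ignored;
    # rows that overlap (high cosine similarity) are shared polysemantically.
    W_final = W.detach()
    print(W_final.norm(dim=1))
    print(torch.nn.functional.normalize(W_final, dim=1)
          @ torch.nn.functional.normalize(W_final, dim=1).T)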

Language Models Seem to be Much Better than Humans at Next-token Prediction

August 11, 2022

We found that humans seem to be consistently worse at next-token prediction than even small language models.
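
The underlying measurement is simple: show the same web text to a person and to a model, and compare top-1 next-token accuracy (or per-token loss). A sketch of the model side of that comparison, using GPT-2 small via transformers (our code, not the original experiment's interface):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    text = "The quick brown fox jumps over the lazy dog because it was"
    ids = tokenizer(text, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(ids).logits

    # For each position, does the model's top-1 guess match the actual next token?
    predictions = logits[0, :-1].argmax(dim=-1)
    targets = ids[0, 1:]
    accuracy = (predictions == targets).float().mean().item()
    loss = torch.nn.functional.cross_entropy(logits[0, :-1], targets).item()
    print(f"top-1 accuracy: {accuracy:.2f}, loss per token: {loss:.2f} nats")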

Adversarial Training for High-Stakes Reliability

May 3, 2022

We use a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training.