Our Research

Our technical research focuses on developing methods to ensure that AI systems act in accordance with their developers' intent, even when the systems themselves are misaligned.

Other Research

A sketch of an AI control safety case

arXiv 2024

Stress-Testing Capability Elicitation With Password-Locked Models

NeurIPS 2024

Preventing Language Models From Hiding Their Reasoning

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small

ICLR 2023

Adversarial Training for High-Stakes Reliability

NeurIPS 2022