Our Research
Our technical research focuses on developing methods to ensure that AI systems act in accordance with their developers' intent, even in the face of internal misalignment.
Highlighted Research
Our most impactful work on AI safety and security
AI Control
Improving Safety Despite Intentional Subversion
Our research introduces and evaluates protocols designed to be robust even when AI models are trying to deceive us. Using GPT-4 as a stand-in for a potentially deceptive model, we tested strategies to detect hidden backdoors in code.
Read the full case study
Alignment Faking
When AI Models Pretend to Be Safe
We demonstrate that state-of-the-art LLMs can strategically fake alignment during training to avoid being changed, revealing a crucial challenge for AI safety. A model might pretend to follow human values while pursuing different goals.
Read the full case study
Other Research
A sketch of an AI control safety case
arXiv 2024
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
ICLR 2023