We focus on technical approaches for managing risks from powerful AI.
In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion.
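As a rough illustration of what a protocol can look like (not code from the paper; `untrusted_model`, `trusted_monitor`, and `human_audit` are hypothetical stand-ins), one common shape is to route an untrusted model's outputs through a weaker trusted monitor and escalate only the most suspicious outputs for expensive auditing:

```python
# Illustrative sketch only: a monitoring-style protocol shape.
# The three helper functions are hypothetical placeholders, not APIs from the paper.

from dataclasses import dataclass

@dataclass
class Verdict:
    output: str
    suspicion: float
    audited: bool

def untrusted_model(task: str) -> str:
    # Placeholder: a capable but possibly subversive model's solution.
    return f"solution for {task!r}"

def trusted_monitor(task: str, output: str) -> float:
    # Placeholder: a weaker, trusted model scores how suspicious the output is.
    return 0.1

def human_audit(task: str, output: str) -> bool:
    # Placeholder: expensive, high-quality review of a small fraction of outputs.
    return True

def run_protocol(tasks, audit_threshold=0.8):
    results = []
    for task in tasks:
        out = untrusted_model(task)
        suspicion = trusted_monitor(task, out)
        audited = suspicion >= audit_threshold
        if audited and not human_audit(task, out):
            out = "REJECTED"  # fall back to a safe action when an audit fails
        results.append(Verdict(out, suspicion, audited))
    return results

print(run_protocol(["sort a list", "parse a config file"]))
```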
This paper studies encoded reasoning: LLMs could encode intermediate steps of reasoning in the text they generate in a way that is not understandable to human readers.
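As a toy illustration of the underlying mechanism (not an example from the paper), a fixed choice between synonymous phrasings can carry hidden bits that a human reader would not notice but that a decoder who knows the scheme can recover:

```python
# Toy illustration of hiding information in innocuous-looking text:
# one hidden bit per slot, carried by the choice between two synonymous phrasings.

PAIRS = [("Overall,", "In summary,"), ("good", "solid"), ("choice", "option")]

def encode(bits):
    # Pick phrasing variant 0 or 1 according to each hidden bit.
    words = [PAIRS[i][b] for i, b in enumerate(bits)]
    return f"{words[0]} this is a {words[1]} {words[2]}."

def decode(text):
    # Recover each bit by checking which variant appears in the text.
    return [1 if b in text else 0 for a, b in PAIRS]

hidden = [1, 0, 1]
msg = encode(hidden)
print(msg)                    # "In summary, this is a good option."
assert decode(msg) == hidden  # the hidden bits survive in fluent text
```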
This work tackles measurement tampering, where an AI system manipulates multiple measurements to create the illusion of good results instead of actually achieving the desired outcome.
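A toy sketch of the problem's structure (illustration only, not the paper's benchmark or method): each example carries several measurements of whether an outcome was achieved, and the hard cases are the ones where tampering fools every measurement at once, so simple consistency checks miss them:

```python
# Toy framing of measurement tampering detection (illustration only).
import random
random.seed(0)

def make_example(tampered: bool):
    true_outcome = not tampered and random.random() < 0.5
    if true_outcome:
        measurements = [True, True, True]
    elif tampered:
        # Tampering fakes each measurement, but occasionally misses one.
        measurements = [random.random() < 0.9 for _ in range(3)]
    else:
        measurements = [False, False, False]
    return {"measurements": measurements, "true_outcome": true_outcome,
            "tampered": tampered}

data = [make_example(tampered=random.random() < 0.3) for _ in range(1000)]

# Naive baseline: flag examples whose measurements disagree with each other.
# This misses the hard cases, where every measurement was fooled at once.
flagged = [ex for ex in data if len(set(ex["measurements"])) > 1]
missed = [ex for ex in data if ex["tampered"] and all(ex["measurements"])]
print(f"flagged by disagreement: {len(flagged)}, fully faked and missed: {len(missed)}")
```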
A library for expressing and manipulating the tensor computations that arise in neural network interpretability research, written in Rust and used from Python notebooks.
A principled approach for evaluating the quality of mechanistic interpretations.
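One way such an evaluation can work, sketched on a toy function (an illustration under our own simplifying assumptions, not the paper's exact procedure): resample everything the interpretation claims is irrelevant and measure how much of the behavior is preserved; more complete hypotheses should score higher:

```python
# Sketch: score an interpretation by resampling the inputs it calls irrelevant.
import numpy as np
rng = np.random.default_rng(0)

def model(x):
    # Toy "model": in truth, the output depends on x[0] and (weakly) x[2].
    return 3.0 * x[0] + 0.3 * x[2]

def score_hypothesis(relevant_idx, n_trials=1000):
    """Fraction of behavior preserved when claimed-irrelevant inputs are resampled."""
    errs, scale = [], []
    for _ in range(n_trials):
        x = rng.normal(size=3)
        x_scrub = rng.normal(size=3)             # fresh sample for irrelevant coords
        x_scrub[relevant_idx] = x[relevant_idx]  # keep coords the hypothesis says matter
        errs.append((model(x) - model(x_scrub)) ** 2)
        scale.append(model(x) ** 2)
    return 1.0 - np.mean(errs) / np.mean(scale)

print("hypothesis {x0}:    ", round(score_hypothesis([0]), 3))      # good but incomplete
print("hypothesis {x0, x2}:", round(score_hypothesis([0, 2]), 3))   # scores higher
```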
We present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI).
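For concreteness, here is the task behavior itself (a sketch using the `transformers` library, not the paper's analysis code): on an IOI prompt, GPT-2 small should prefer the indirect object ("Mary") over the repeated subject ("John"):

```python
# Requires `transformers` and `torch`; downloads GPT-2 small on first run.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "When Mary and John went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

mary_id = tokenizer.encode(" Mary")[0]
john_id = tokenizer.encode(" John")[0]
# A positive difference means the model favors the indirect object.
print("logit diff (Mary - John):", (logits[mary_id] - logits[john_id]).item())
```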
We show that, in a toy model, the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features, and entirely ignore the least important features.
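A minimal toy in this spirit (illustrative, and not necessarily the exact model from the paper): squeeze eight features of decaying importance into four dimensions, train reconstruction with an importance-weighted loss, and then inspect how much representation each feature received:

```python
# Toy superposition-style setup: more features than dimensions,
# importance-weighted reconstruction loss.
import torch

torch.manual_seed(0)
n_feat, n_dim, sparsity = 8, 4, 0.9
importance = 0.6 ** torch.arange(n_feat).float()  # geometrically decaying importance

W = torch.nn.Parameter(torch.randn(n_dim, n_feat) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_feat))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse non-negative features.
    x = torch.rand(256, n_feat) * (torch.rand(256, n_feat) > sparsity)
    x_hat = torch.relu(x @ W.T @ W + b)  # encode to n_dim, decode back
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# One rough proxy for the capacity a feature gets is the squared norm of its
# embedding column; values near zero mean the feature is essentially ignored.
print((W.norm(dim=0) ** 2).detach())
```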
We found that humans seem to be consistently worse at next-token prediction than even small language models.
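A sketch of the model side of that comparison (using the `transformers` library; the passage below is a placeholder): score a small model's top-1 next-token accuracy and per-token loss on a text, the same quantities a human playing the prediction game can be scored on:

```python
# Requires `transformers` and `torch`; downloads GPT-2 small on first run.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = ("The quick brown fox jumps over the lazy dog. It was the best of times, "
        "it was the worst of times.")
ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

preds = logits[0, :-1].argmax(dim=-1)  # model's guess for each next token
targets = ids[0, 1:]                   # the actual next tokens
accuracy = (preds == targets).float().mean().item()
loss = torch.nn.functional.cross_entropy(logits[0, :-1], targets).item()
print(f"top-1 accuracy: {accuracy:.2f}, loss per token: {loss:.2f}")
```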
We use a safe language generation task ("avoid injuries") as a testbed for achieving high reliability through adversarial training.
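The overall loop has a simple shape, sketched below with hypothetical stand-ins (`classifier`, `search_for_failures`, `is_actually_injurious`, and `finetune` are placeholders, not the paper's tools): adversaries search for completions that are actually injurious but pass the filter, and the filter is retrained on whatever they find:

```python
# Sketch of an adversarial-training loop for a high-reliability safety filter.

def classifier(completion: str) -> bool:
    """Placeholder safety filter: True means 'predicted injurious'."""
    return "injur" in completion.lower()

def is_actually_injurious(completion: str) -> bool:
    """Placeholder for ground-truth human labeling."""
    return "hurt" in completion.lower()

def search_for_failures(classifier, n: int):
    """Placeholder for adversarial search (human red-teaming, rewriting tools,
    automated attacks): completions that are injurious but pass the filter."""
    candidates = ["He was badly hurt by the fall.", "Nothing bad happened."]
    return [c for c in candidates[:n]
            if not classifier(c) and is_actually_injurious(c)]

def finetune(classifier, failures):
    """Placeholder: retrain the filter on newly found failures."""
    return classifier

for round_idx in range(3):
    failures = search_for_failures(classifier, n=2)
    print(f"round {round_idx}: found {len(failures)} failure(s)")
    classifier = finetune(classifier, failures)
```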