Pioneering threat mitigation and assessment for AI agents


Redwood Research is a nonprofit organization focused on mitigating risks from advanced artificial intelligence.

In the coming years or decades, AI systems will very plausibly match or exceed human capabilities across most intellectual tasks, fundamentally transforming society. Our research specifically addresses the risks that could arise if these powerful AI systems purposefully act against the interests of their developers and of human institutions more broadly.

We work to better understand these risks, and to develop methodologies that will allow us to manage them while still realizing the benefits of AI.

Our focus areas

AI control

We introduced and have continued to advance the research area of "AI control." In our ICML oral paper AI Control: Improving Safety Despite Intentional Subversion, we proposed protocols for safely deploying and monitoring potentially malign LLM agents. AI companies and other researchers have since built on this work (Benton et al., Mathew et al.), and AI control has been framed as a bedrock approach for mitigating catastrophic risk from misaligned AI (Grosse, Buhl, Balesni, Clymer).
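To give a flavor of what a control protocol looks like, here is a minimal, hypothetical sketch of "trusted monitoring": a powerful but untrusted model does the work, a weaker trusted model scores each action for suspicion, and a limited human auditing budget is spent on the most suspicious actions. The function names, threshold, and audit policy below are illustrative assumptions, not the exact protocol from the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    task_id: str
    content: str  # e.g. code proposed by the untrusted agent


def run_control_protocol(
    tasks: list[str],
    untrusted_solve: Callable[[str], Action],      # capable, possibly malign model
    trusted_suspicion: Callable[[Action], float],  # weaker trusted monitor, returns 0..1
    audit_budget: int,                             # how many actions humans can review
    audit_threshold: float = 0.8,                  # hypothetical cutoff for escalation
) -> list[Action]:
    """Accept actions from the untrusted model, escalating suspicious ones for audit."""
    accepted: list[Action] = []
    audits_used = 0
    for task in tasks:
        action = untrusted_solve(task)
        score = trusted_suspicion(action)
        if score >= audit_threshold and audits_used < audit_budget:
            audits_used += 1
            # In a real protocol a human audit would inspect the action and could
            # shut the deployment down; in this sketch we simply withhold it.
            continue
        accepted.append(action)
    return accepted
```

The central question such protocols study is how to allocate a scarce trusted-oversight budget so that even an intentionally subversive untrusted model cannot reliably slip a harmful action past the monitor.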

Evaluations and demonstrations of risk from strategic deception

In Alignment Faking in Large Language Models, a collaboration with Anthropic, we demonstrated that Claude sometimes hides misaligned intentions. This work is the strongest concrete evidence to date that LLMs might naturally fake alignment in order to resist attempts to train them.

Consulting on risks from misalignment

We collaborate with governments and advise AI companies including Google DeepMind and Anthropic on practices for assessing and mitigating risks from misaligned AI agents.


Our work

  • Alignment Faking in Large Language Models. Paper, blog post summary, video.
  • AI Control: Improving Safety Despite Intentional Subversion (ICML 2024 oral). Paper, blog post, explainer video.
  • Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols. Paper, blog post.
  • Stress-Testing Capability Elicitation With Password-Locked Models (NeurIPS 2024). Paper, blog post.
  • Preventing Language Models From Hiding Their Reasoning. Paper, blog post.
  • Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats. Paper.
  • Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small (ICLR 2023). Paper, blog post, video.
  • Adversarial Training for High-Stakes Reliability (NeurIPS 2022). Paper, blog post.

Recent blog posts