AI Control Reading List
A curated list of resources on AI control research and concepts.
This page collates recommended reading for those seeking to understand the field of AI control.
AI control aims to develop techniques for training and deploying AI systems such that they cannot cause security or safety failures, even if they are actively trying to cause such failures or to subvert the control measures. While the basic idea has been discussed for some time, systematic exploration and publication in this area increased significantly starting in late 2023.
Foundational Content
- AI Control: Improving Safety Despite Intentional Subversion
- This was the original paper on AI control; a minimal sketch of the kind of protocol it studies appears after this list.
- You might also enjoy:
- A Rob Miles YouTube video explaining the paper. It is particularly helpful if you prefer video or have less background in machine learning.
- (Optional) The Alignment Forum post discussing the paper.
- The case for ensuring that powerful AIs are controlled
- This is our original essay explaining why we think AI control is a valuable research direction.
- (Optional) We discuss this in more detail in our interview on AXRP.
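To make the above concrete, here is a minimal sketch of trusted monitoring with an audit budget, the simplest style of protocol studied in the original control paper. Everything in it is illustrative: the placeholder model functions, the suspicion scoring, and the 2% audit budget are stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of trusted monitoring with an audit budget (illustrative only).
# The untrusted (most capable) model does the work; a trusted, weaker model rates
# how suspicious each output looks; the most suspicious fraction of outputs (the
# audit budget) is sent to careful human review.

import random
from dataclasses import dataclass

AUDIT_BUDGET = 0.02  # illustrative: humans can carefully audit ~2% of outputs


@dataclass
class Step:
    task_id: int
    output: str
    suspicion: float  # trusted model's suspicion score in [0, 1]


def untrusted_solve(task: str) -> str:
    """Placeholder for the untrusted model producing a solution."""
    return f"solution to {task!r}"


def trusted_suspicion(output: str) -> float:
    """Placeholder for a trusted, weaker model rating how suspicious an output looks."""
    return random.random()


def run_trusted_monitoring(tasks: list[str]) -> tuple[list[Step], list[Step]]:
    steps = []
    for i, task in enumerate(tasks):
        output = untrusted_solve(task)
        steps.append(Step(task_id=i, output=output, suspicion=trusted_suspicion(output)))

    # Send the most suspicious outputs to human audit, up to the budget. In a real
    # protocol, a confirmed attack found during audit means you shut the deployment
    # down (you have caught the AI red-handed); here we just return both lists.
    n_audit = max(1, int(AUDIT_BUDGET * len(steps)))
    audited = sorted(steps, key=lambda s: s.suspicion, reverse=True)[:n_audit]
    return steps, audited


if __name__ == "__main__":
    all_steps, audited = run_trusted_monitoring([f"task {i}" for i in range(500)])
    print(f"ran {len(all_steps)} tasks, audited the {len(audited)} most suspicious outputs")
```

The paper evaluates protocols like this (along with variants such as trusted editing, defer-to-trusted, and untrusted monitoring) on both safety against a red team trying to slip attacks past the protocol and usefulness on the benign task distribution.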
Other Important Conceptual Writing
In roughly decreasing order of priority.
- AI catastrophes and rogue deployments
- We often refer to these concepts when talking about what AI control aims to prevent.
- Catching AIs red-handed
- To think clearly about plans that involve catching AIs red-handed, it's crucial to think about what actually happens once an AI is caught; this post works through that question. (Further discussion from a less technical perspective can be found here.)
- Win/continue/lose scenarios and execute/replace/audit protocols
- The basic mathematical ideas in this post are crucial for thinking quantitatively about the main problems with using AI control to prevent security failures; a toy version of this arithmetic is sketched after this list.
- Scheming AIs: Will AIs fake alignment during training in order to get power?
- Read this for the current state of the art on arguments for why your models might end up conspiring against you. This is long; you might want to just read the first section, which summarizes the whole report.
- How to prevent collusion when using untrusted models to monitor each other
- This post goes into detail on the key difficulty of untrusted monitoring: preventing the untrusted monitor from colluding with the model it is monitoring.
- Coup probes
- This post describes the simplest white-box method for AI control; a minimal sketch of the basic idea appears after this list.
- Detecting Strategic Deception Using Linear Probes does a more thorough job of investigating white-box methods for control empirically.
- Thoughts on the conservative assumptions in AI control
- This describes some of the ways that most of our control work is conservative, and explains why we usually choose to do the research that way.
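As a toy illustration of the kind of arithmetic behind the win/continue/lose framing (this is a simplified stand-in, not the post's actual model): suppose each attack attempt by the untrusted AI is caught with some probability (blue team wins), succeeds undetected with some probability (red team wins), and otherwise fizzles so that the deployment simply continues. Blue's chance of eventually winning is then the ratio of the catch probability to the sum of the catch and success probabilities, regardless of how many attempts occur.

```python
# Toy win/continue/lose calculation (illustrative; see the linked post for the real framework).
# Each attack attempt is caught with probability CATCH (blue team wins outright),
# succeeds undetected with probability SUCCEED (red team wins), and otherwise the
# deployment just continues. Blue's eventual win probability in this toy model is
# CATCH / (CATCH + SUCCEED); the simulation below checks the closed form.

import random

CATCH = 0.03    # illustrative per-attempt probability the protocol catches the attack
SUCCEED = 0.01  # illustrative per-attempt probability the attack succeeds unnoticed


def simulate(n_trials: int = 100_000) -> float:
    blue_wins = 0
    for _ in range(n_trials):
        while True:
            r = random.random()
            if r < CATCH:              # attack audited and confirmed: blue wins
                blue_wins += 1
                break
            if r < CATCH + SUCCEED:    # attack slips through: red wins
                break
            # otherwise the attack fails quietly and the deployment continues
    return blue_wins / n_trials


if __name__ == "__main__":
    print(f"closed form: {CATCH / (CATCH + SUCCEED):.3f}")
    print(f"simulation : {simulate():.3f}")
```

In this toy model, what matters is the ratio of caught to missed attempts, which is one way to see why getting audits to land on the right actions matters so much.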
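And here is a minimal sketch of the white-box idea behind coup probes and the linear-probing paper above. Everything in it is a placeholder: real probes are trained on a model's internal activations over labeled examples (for instance, synthetic examples of the behavior you want to flag versus benign text), whereas this uses random arrays just to show the shape of the method.

```python
# Minimal sketch of a linear probe on model activations (illustrative only).
# In practice the features would be a real model's residual-stream activations on
# labeled examples; here we use random placeholder data to show the shape of the method.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D_MODEL = 512        # placeholder hidden size
N_PER_CLASS = 200    # placeholder number of labeled examples per class

# Placeholder "activations": benign examples vs. examples of the flagged behavior.
benign = rng.normal(0.0, 1.0, size=(N_PER_CLASS, D_MODEL))
flagged = rng.normal(0.3, 1.0, size=(N_PER_CLASS, D_MODEL))  # shifted mean stands in for a real signal

X = np.vstack([benign, flagged])
y = np.array([0] * N_PER_CLASS + [1] * N_PER_CLASS)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At deployment, the probe's score on each new activation acts as one more
# suspicion signal feeding into the control protocol (e.g. deciding what to audit).
new_activations = rng.normal(0.0, 1.0, size=(5, D_MODEL))
print(probe.predict_proba(new_activations)[:, 1])
```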
More Stuff to Read
- Some empirical papers:
- Ctrl-Z: Controlling AI Agents via Resampling (pre-release). This is the sequel to the original control paper: we study high-stakes failures in an agent setting.
- Subversion Strategy Eval. This paper measures how good LLMs are at reasoning through how to subvert control measures.
- Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs. This paper explores steganography from a control perspective.
- Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats. This paper studies what happens in a setting where individual safety failures don't immediately cause a red team victory, and where intentional safety failures are hard to distinguish from accidental mistakes.
- A sketch of an AI control safety case
- This is a description of how you would actually go through all the details of arguing that a deployment is safe because of AI control measures. It's long and detailed; read the sections that interest you.
- There's a short description of some takeaways from this work here.
- The AI control tag on LessWrong contains links to most control research.
- The Redwood Research blog contains all of the writing we've done that's directly related to control.