AI control reading list
Buck Shlegeris
This is my attempt to collate recommended reading for people who want to understand AI control.
In AI control, we aim to develop techniques for training and deploying AIs such that the AIs won't be able to cause security or safety failures, even if they're trying to cause those failures and purposely trying to defeat the control techniques. This basic idea is obvious and has been discussed for ages, but there hadn't been much systematic exploration of it until we started publishing on it in late 2023.
Foundational content
Other important conceptual writing
In roughly decreasing order of priority.
More stuff to read
- Some empirical papers:
- A sketch of an AI control safety case
- This is a description of how you would actually go through all the details to argue that a deployment was safe using AI control. It's fairly long and detailed; read the sections that interest you.
- There’s a short description of some takeaways from this work here.
- The AI control tag on LessWrong contains links to most control research
- The Redwood Research blog contains all of the writing we've done that's directly related to control