AI Control Reading List
A curated list of resources on AI control research and concepts.
This page collates recommended reading for those seeking to understand the field of AI control.
AI control aims to develop techniques for training and deploying AI systems such that they cannot cause security or safety failures, even if they are actively trying to cause such failures or to subvert the control measures. While the basic idea has been discussed for some time, systematic exploration and publication in this area increased significantly starting in late 2023.
Foundational Content
- AI Control: Improving Safety Despite Intentional Subversion
- This was the original paper on AI control; a minimal sketch of the kind of protocol it studies appears after this list.
- You might also enjoy:
- A Rob Miles YouTube video explaining the paper. It is particularly helpful if you prefer video or have less background in machine learning.
- (Optional) The Alignment Forum post discussing the paper.
- The case for ensuring that powerful AIs are controlled
- This is our original essay explaining why we think AI control is a valuable research direction.
- (Optional) We discuss this in more detail in our interview on AXRP.
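To make the above concrete, here is a minimal sketch of trusted monitoring with an audit budget, the simplest style of protocol studied in the original control paper. Everything in it is illustrative: the placeholder model functions, the suspicion scoring, and the 2% audit budget are stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of trusted monitoring with an audit budget (illustrative only).
# The untrusted (most capable) model does the work; a trusted, weaker model rates
# how suspicious each output looks; the most suspicious fraction of outputs (the
# audit budget) is sent to careful human review.

import random
from dataclasses import dataclass

AUDIT_BUDGET = 0.02  # illustrative: humans can carefully audit ~2% of outputs


@dataclass
class Step:
    task_id: int
    output: str
    suspicion: float  # trusted model's suspicion score in [0, 1]


def untrusted_solve(task: str) -> str:
    """Placeholder for the untrusted model producing a solution."""
    return f"solution to {task!r}"


def trusted_suspicion(output: str) -> float:
    """Placeholder for a trusted, weaker model rating how suspicious an output looks."""
    return random.random()


def run_trusted_monitoring(tasks: list[str]) -> tuple[list[Step], list[Step]]:
    steps = []
    for i, task in enumerate(tasks):
        output = untrusted_solve(task)
        steps.append(Step(task_id=i, output=output, suspicion=trusted_suspicion(output)))

    # Send the most suspicious outputs to human audit, up to the budget. In a real
    # protocol, a confirmed attack found during audit means you shut the deployment
    # down (you have caught the AI red-handed); here we just return both lists.
    n_audit = max(1, int(AUDIT_BUDGET * len(steps)))
    audited = sorted(steps, key=lambda s: s.suspicion, reverse=True)[:n_audit]
    return steps, audited


if __name__ == "__main__":
    all_steps, audited = run_trusted_monitoring([f"task {i}" for i in range(500)])
    print(f"ran {len(all_steps)} tasks, audited the {len(audited)} most suspicious outputs")
```

The paper evaluates protocols like this (along with variants such as trusted editing, defer-to-trusted, and untrusted monitoring) on both safety against a red team trying to slip attacks past the protocol and usefulness on the benign task distribution.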
Other Important Conceptual Writing
In roughly decreasing order of priority.
- AI catastrophes and rogue deployments
- We often refer to these concepts when talking about what AI control aims to prevent.
- Catching AIs red-handed
- To think clearly about plans that involve catching AIs red-handed, it's crucial to think about what actually happens once an AI is caught; this post works through that question. (Further discussion from a less technical perspective can be found here.)
- Win/continue/lose scenarios and execute/replace/audit protocols
- The basic mathematical ideas in this post are crucial for thinking quantitatively about the main problems with using AI control to prevent security failures; a toy version of this arithmetic is sketched after this list.
- Scheming AIs: Will AIs fake alignment during training in order to get power?
- Read this for the current state of the art on arguments for why your models might end up conspiring against you. This is long; you might want to just read the first section, which summarizes the whole report.
- How to prevent collusion when using untrusted models to monitor each other
- This post goes into detail on the key difficulty of untrusted monitoring: preventing the untrusted monitor from colluding with the model it is monitoring.
- Coup probes
- This post describes the simplest white-box method for AI control; a minimal sketch of the basic idea appears after this list.
- Detecting Strategic Deception Using Linear Probes does a more thorough job of investigating white-box methods for control empirically.
- Thoughts on the conservative assumptions in AI control
- This describes some of the ways that most of our control work is conservative, and explains why we usually choose to do the research that way.
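As a toy illustration of the kind of arithmetic behind the win/continue/lose framing (this is a simplified stand-in, not the post's actual model): suppose each attack attempt by the untrusted AI is caught with some probability (blue team wins), succeeds undetected with some probability (red team wins), and otherwise fizzles so that the deployment simply continues. Blue's chance of eventually winning is then the ratio of the catch probability to the sum of the catch and success probabilities, regardless of how many attempts occur.

```python
# Toy win/continue/lose calculation (illustrative; see the linked post for the real framework).
# Each attack attempt is caught with probability CATCH (blue team wins outright),
# succeeds undetected with probability SUCCEED (red team wins), and otherwise the
# deployment just continues. Blue's eventual win probability in this toy model is
# CATCH / (CATCH + SUCCEED); the simulation below checks the closed form.

import random

CATCH = 0.03    # illustrative per-attempt probability the protocol catches the attack
SUCCEED = 0.01  # illustrative per-attempt probability the attack succeeds unnoticed


def simulate(n_trials: int = 100_000) -> float:
    blue_wins = 0
    for _ in range(n_trials):
        while True:
            r = random.random()
            if r < CATCH:              # attack audited and confirmed: blue wins
                blue_wins += 1
                break
            if r < CATCH + SUCCEED:    # attack slips through: red wins
                break
            # otherwise the attack fails quietly and the deployment continues
    return blue_wins / n_trials


if __name__ == "__main__":
    print(f"closed form: {CATCH / (CATCH + SUCCEED):.3f}")
    print(f"simulation : {simulate():.3f}")
```

In this toy model, what matters is the ratio of caught to missed attempts, which is one way to see why getting audits to land on the right actions matters so much.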
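And here is a minimal sketch of the white-box idea behind coup probes and the linear-probing paper above. Everything in it is a placeholder: real probes are trained on a model's internal activations over labeled examples (for instance, synthetic examples of the behavior you want to flag versus benign text), whereas this uses random arrays just to show the shape of the method.

```python
# Minimal sketch of a linear probe on model activations (illustrative only).
# In practice the features would be a real model's residual-stream activations on
# labeled examples; here we use random placeholder data to show the shape of the method.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D_MODEL = 512        # placeholder hidden size
N_PER_CLASS = 200    # placeholder number of labeled examples per class

# Placeholder "activations": benign examples vs. examples of the flagged behavior.
benign = rng.normal(0.0, 1.0, size=(N_PER_CLASS, D_MODEL))
flagged = rng.normal(0.3, 1.0, size=(N_PER_CLASS, D_MODEL))  # shifted mean stands in for a real signal

X = np.vstack([benign, flagged])
y = np.array([0] * N_PER_CLASS + [1] * N_PER_CLASS)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At deployment, the probe's score on each new activation acts as one more
# suspicion signal feeding into the control protocol (e.g. deciding what to audit).
new_activations = rng.normal(0.0, 1.0, size=(5, D_MODEL))
print(probe.predict_proba(new_activations)[:, 1])
```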
More Stuff to Read
- Some empirical papers:
- Ctrl-Z: Controlling AI Agents via Resampling (pre-release). This is the sequel to the original control paper: we study high-stakes failures in an agent setting.
- Subversion Strategy Eval. This paper measures how good LLMs are at reasoning through how to subvert control measures.
- Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs. This paper explores steganography from a control perspective.
- Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats. This paper studies what happens in a setting where individual safety failures don't immediately cause a red team victory, and where intentional safety failures are hard to distinguish from accidental mistakes.
- A sketch of an AI control safety case
- This is a description of how you would actually go through all the details of arguing that a deployment is safe because of AI control measures. It's long and detailed; read the sections that interest you.
- There's a short description of some takeaways from this work here.
- The AI control tag on LessWrong contains links to most control research.
- The Redwood Research blog contains all of the writing we've done that's directly related to control.