AI control reading list
Buck Shlegeris
This is my attempt to collate recommended reading for people who want to understand AI control.
In AI control, we aim to develop techniques for training and deploying AIs such that the AIs won't be able to cause security or safety failures, even if they're trying to cause those failures and purposely trying to defeat the control techniques. This basic idea is obvious and has been discussed for ages, but there hadn't been much systematic exploration of it until we started publishing on it in late 2023.
Foundational content
Other important conceptual writing
In roughly decreasing order of priority.
More stuff to read
- Some empirical papers:
- A sketch of an AI control safety case
- This is a description of how you would actually go through all the details to argue that a deployment was safe using AI control. It's fairly long and detailed; read the sections that interest you.
- There’s a short description of some takeaways from this work here.
- The AI control tag on LessWrong contains links to most control research
- The Redwood Research blog contains all of the writing we've done that's directly related to control