AI control reading list

Buck Shlegeris

This is my attempt to collate recommended reading for people who want to understand AI control.

In AI control, we aim to develop techniques for training and deploying AIs such that the AIs won’t be able to cause security or safety failures, even if they’re trying to cause those failures, and even if they’re purposely trying to defeat the control techniques. This basic idea is obvious and has been discussed for ages, but there hadn’t been much systematic exploration of it until we started publishing on it in late 2023.

Foundational content

Other important conceptual writing

In roughly decreasing order of priority.

More stuff to read