REMIX is a month-long coordinated research program on mechanistic interpretability of transformer models, run for the first time in the winter of 2022-23. We are excited about recent advances in mechanistic interpretability and want to try scaling our interpretability methodology to a larger group doing research in parallel.
REMIX participants aim to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology to formalize and evaluate interpretability hypotheses, akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks. We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems.
Research output. We hope this program will produce research that is useful in multiple ways:
Training and hiring. We might want to hire people who produce valuable research during this program.
Experience running large collaborative research projects. It seems plausible that at some point it will be useful to run a huge collaborative alignment project. We’d like to practice running this kind of project, in the hope that the lessons learned will be useful to us or others.
We think our recent progress in interpretability makes it much more feasible for us to reliably establish mechanistic explanations of model behaviors, and therefore to get value from a large, parallelized research effort.
A unified framework for specifying and validating explanations. Previously, a big bottleneck on parallelizing interpretability research across many people was the lack of a clear standard of evidence for proposed explanations of model behaviors (which made us expect the resulting research to be pretty unreliable). We believe we’ve recently made some progress on this front, developing an algorithm called “causal scrubbing” which allows us to automatically derive an extensive set of tests for a wide class of mechanistic explanations. The algorithm can only reject hypotheses, not confirm them, but we think it still makes it much more efficient to review the research produced by all the participants.
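To make this concrete, here is a heavily simplified sketch of the kind of resampling-ablation test that causal scrubbing is built around. It is not Redwood's implementation: the toy model, the hypothesized feature, and the data below are all made up, and a real causal scrubbing hypothesis specifies a full correspondence between an interpretation and the model's computational graph rather than a single claim about one activation.

```python
# Illustrative sketch only: a resampling-ablation test in the spirit of causal
# scrubbing, not Redwood's actual algorithm or code. The toy model, the claimed
# feature, and the data are all hypothetical.
import torch

torch.manual_seed(0)

# Toy model: y = g(h(x)), where h is an intermediate activation we hypothesize about.
W1 = torch.randn(4, 8)
W2 = torch.randn(8, 1)

def h(x):
    # intermediate activation
    return torch.relu(x @ W1)

def model(x, h_override=None):
    act = h(x) if h_override is None else h_override
    return act @ W2

# Hypothesis to test: "h(x) only depends on the first two coordinates of x."
def claimed_feature(x):
    return x[:, :2]

def scrubbed_outputs(xs):
    # For each input, replace h(x) with h(x') for a random x' that agrees on the
    # claimed feature. If the hypothesis is right, outputs should barely change.
    outs = []
    for i in range(len(xs)):
        x = xs[i : i + 1]
        donors = [
            j for j in range(len(xs))
            if torch.allclose(claimed_feature(xs[j : j + 1]), claimed_feature(x))
        ]
        j = donors[torch.randint(len(donors), (1,)).item()]
        outs.append(model(x, h_override=h(xs[j : j + 1])))
    return torch.cat(outs)

xs = torch.randn(16, 4)
xs[8:, :2] = xs[:8, :2]  # build pairs that share the claimed feature

gap = (model(xs) - scrubbed_outputs(xs)).abs().mean().item()
# A large gap rejects the hypothesis; a small gap merely fails to reject it.
print("mean |clean - scrubbed| output gap:", gap)
```

In this toy case the hypothesis is actually false (the activation depends on all four input coordinates), so the gap comes out large and the test rejects it, illustrating the point above that the method rejects hypotheses rather than confirming them.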
Improved proofs of concept. We now have several examples where we followed our methodology and were able to learn a fair bit about how a transformer was performing some behavior.
Tools that allow complicated experiments to be specified quickly. We’ve built a powerful library for manipulating neural nets (and computational graphs more generally), which we use to run intervention experiments and to extract activations from models. This library allows us to do experiments that would be quite error-prone and painful with other tools.
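Our library itself isn't shown here, but the sketch below illustrates the generic pattern it streamlines: recording an activation from one forward pass and patching it into another. The code uses ordinary PyTorch forward hooks on a made-up model, and none of the names correspond to our library's API.

```python
# Illustrative only: this is not our library's API. It shows the generic shape
# of an intervention experiment (recording one run's activation and patching it
# into another run) using ordinary PyTorch forward hooks on a made-up model.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
layer = model[1]  # we'll intervene on the post-ReLU activations

captured = {}

def capture_hook(module, inputs, output):
    captured["act"] = output.detach().clone()

def patch_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return captured["act"]

x_clean, x_corrupt = torch.randn(1, 10), torch.randn(1, 10)

# 1) Record the activation on the "clean" input.
handle = layer.register_forward_hook(capture_hook)
y_clean = model(x_clean)
handle.remove()

# 2) Rerun on the "corrupted" input, but patch in the clean activation.
handle = layer.register_forward_hook(patch_hook)
y_patched = model(x_corrupt)
handle.remove()

y_corrupt = model(x_corrupt)
print(y_clean, y_corrupt, y_patched)
```

Because the entire layer output is patched here, the patched run reproduces the clean output exactly; patching only part of an activation (one head, one position, one direction) is what localizes effects to specific components, and doing that by hand with hooks quickly becomes error-prone, which is the kind of pain the library is meant to remove.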
Mechanistic interpretability is an unusual empirical scientific setting: controlled experimentation is relatively easy, but little is known about the kinds of structures found in neural nets.
Regarding the ease of experimentation:
Regarding the openness of the field:
REMIX participants pursue interpretability research akin to the investigations Redwood has done recently into induction heads, indirect object identification (IOI) in small language models, and balanced parenthesis classification. You can read more about behavior selection criteria here.
The main activities will be:
The mechanisms for behaviors we’ll be studying are often surprisingly complex, so careful experimentation is needed to accurately characterize them. For example, the Redwood researchers investigating the IOI behavior found that removing the influence of the circuit they identified as primarily responsible had surprisingly little effect on the model’s ability to do IOI. Instead, other heads in the model substantially changed their behavior to compensate for the excision. As the researchers write, “Both the reason and the mechanism of this compensation effect are still unclear. We think that this could be an interesting phenomenon to investigate in future work.”
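For readers who haven't seen this kind of knockout experiment, here is a minimal sketch of the general shape of one: mean-ablating one attention head at a time in a toy, randomly initialized attention layer and measuring how far the output moves. Everything in it (the layer, the reference data, the metric) is hypothetical; it is not the IOI experiment or Redwood's code.

```python
# Illustrative only: a toy "knockout" experiment, mean-ablating one attention head
# at a time in a randomly initialized attention layer and measuring how much the
# layer's output moves. Not the IOI experiment or Redwood's code.
import torch

torch.manual_seed(0)
d_model, n_heads, d_head, seq = 16, 4, 4, 6
Wq = torch.randn(n_heads, d_model, d_head) / d_model**0.5
Wk = torch.randn(n_heads, d_model, d_head) / d_model**0.5
Wv = torch.randn(n_heads, d_model, d_head) / d_model**0.5
Wo = torch.randn(n_heads, d_head, d_model) / d_model**0.5

def head_outputs(x):
    # Per-head attention outputs before the output projection: (n_heads, seq, d_head).
    outs = []
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = torch.softmax(q @ k.transpose(-1, -2) / d_head**0.5, dim=-1)
        outs.append(scores @ v)
    return torch.stack(outs)

def attention(x, ablate_head=None, mean_acts=None):
    heads = head_outputs(x)
    out = torch.zeros(seq, d_model)
    for h in range(n_heads):
        # Mean-ablation: replace the knocked-out head's output with its dataset mean.
        contrib = heads[h] if h != ablate_head else mean_acts[h]
        out = out + contrib @ Wo[h]
    return out

x = torch.randn(seq, d_model)
reference = torch.randn(32, seq, d_model)  # stand-in for a reference dataset
mean_acts = torch.stack([head_outputs(r) for r in reference]).mean(dim=0)

clean = attention(x)
for h in range(n_heads):
    ablated = attention(x, ablate_head=h, mean_acts=mean_acts)
    change = (clean - ablated).abs().mean().item()
    print(f"head {h}: mean output change {change:.3f}")
```

The compensation effect described above is one reason output-level measurements like this need careful interpretation: other heads can change their behavior when one is removed, so a small output change does not necessarily mean the ablated component was unimportant to the original computation.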
Overall, we think that mechanistic interpretability is one of the most promising research directions for helping prevent AI takeover. Our hope is that mature interpretability techniques will let us distinguish between two ML systems that each behave equally helpfully during training – even having exactly the same input/output behavior on the entire training dataset – but where one does so because it is deceiving us and the other does so “for the right reasons.”
Our experience has been that explaining model behaviors supports both empirical interpretability work (guiding how we engineer interpretability tools and providing practical know-how) and theoretical interpretability work (for example, leading to the development of the causal scrubbing algorithm). We expect many of the practical lessons that we might learn would generalize to more advanced systems, and we expect that addressing the theoretical questions that we encounter along the way would lead to important conceptual progress.
Currently, almost no interesting behaviors of ML models have been explained – even for models that are tiny compared with frontier systems. We have been working to change this, and we’d like you to help.
For other perspectives on this question, see Chris Olah’s description of the relevance of thorough, small-model interpretability here and Paul Christiano’s similar view here.