REMIX

This page describes the program generally.

REMIX is a month-long coordinated research program on mechanistic interpretability of transformer models, run for the first time in the winter of 2022-23. We are excited about recent advances in mechanistic interpretability and want to try scaling our interpretability methodology to a larger group doing research in parallel.

REMIX participants aim to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology to formalize and evaluate interpretability hypotheses, akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks. We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems.
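
To make "formalize and evaluate interpretability hypotheses" more concrete, here is a minimal sketch of a causal-scrubbing-style evaluation loop. Everything in it is a hypothetical stand-in for illustration (the toy model, the `feature` function encoding the hypothesis, the resampling scheme), not Redwood's actual implementation: the core idea is that a hypothesis licenses swapping an internal activation for one computed on a different input it claims is equivalent, and we then measure how much of the model's performance survives the swap.

```python
# Minimal causal-scrubbing-style sketch. The model, task, and `feature`
# function are hypothetical stand-ins, not Redwood's implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model; the hidden layer is the site the hypothesis makes a claim about.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
xs = torch.randn(256, 4)
ys = (xs[:, :1] > 0).float()  # toy task: label depends only on sign of x[0]

def feature(x):
    """Hypothesis: the hidden layer matters only via the sign of x[:, 0]."""
    return (x[:, 0] > 0).long()

def mse(pred):
    return ((pred - ys) ** 2).mean().item()

# Train the toy model briefly so the loss numbers mean something.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    ((model(xs) - ys) ** 2).mean().backward()
    opt.step()

with torch.no_grad():
    hidden = model[1](model[0](xs))  # activations at the claimed site
    base = mse(model[2](hidden))     # loss of the unmodified model

    # Scrubbed run: swap each hidden activation for one from a different
    # input that the hypothesis says is interchangeable (same feature value).
    f = feature(xs)
    idx = torch.arange(len(xs))
    resample = idx.clone()
    for v in f.unique():
        group = idx[f == v]
        resample[group] = group[torch.randperm(len(group))]
    scrubbed = mse(model[2](hidden[resample]))

    # Random baseline: swap activations with no constraint at all.
    rand = mse(model[2](hidden[torch.randperm(len(xs))]))

# Fraction of the performance gap the hypothesis explains.
print(f"loss recovered: {(rand - scrubbed) / (rand - base):.1%}")
```

In this framing, a hypothesis that captures everything the hidden layer does should recover close to 100% of the loss gap, while a vacuous hypothesis recovers close to 0%.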

The rest of this post addresses:
- Why are we doing this?
- Why do this now?
- What is doing this sort of research like?
- How useful is this kind of interpretability research for understanding models that might pose an existential risk?