REMIX is a month-long coordinated research program on mechanistic interpretability of transformer models, run for the first time in the winter of 2022-23. We are excited about recent advances in mechanistic interpretability and want to try scaling our interpretability methodology to a larger group doing research in parallel.
REMIX participants aim to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology to formalize and evaluate interpretability hypotheses, akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks. We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems.
Research output. We hope this program will produce research that is useful in multiple ways:
Training and hiring. We might want to hire people who produce valuable research during this program.
Experience running large collaborative research projects. It seems plausible that at some point it will be useful to run a huge collaborative alignment project. We’d like to practice running this kind of project, in the hope that the lessons learned will be useful to us or others.
We think our recent progress in interpretability makes it much more feasible for us to reliably establish mechanistic explanations of model behaviors, and therefore to get value from a large, parallelized research effort.
A unified framework for specifying and validating explanations. Previously, a big bottleneck on parallelizing interpretability research across many people was the lack of a clear standard of evidence for proposed explanations of model behaviors (which made us expect the resulting research to be pretty unreliable). We believe we’ve recently made some progress on this front, developing an algorithm called “causal scrubbing” which allows us to automatically derive an extensive set of tests for a wide class of mechanistic explanations. The algorithm can only reject hypotheses, not confirm them, but we think it still makes it much more efficient to review the research produced by all the participants.
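To make this concrete, here is a heavily simplified sketch of the kind of resampling-ablation test that causal scrubbing is built around. It is not Redwood's implementation: the toy model, the hypothesized feature, and the data below are all made up, and a real causal scrubbing hypothesis specifies a full correspondence between an interpretation and the model's computational graph rather than a single claim about one activation.

```python
# Illustrative sketch only: a resampling-ablation test in the spirit of causal
# scrubbing, not Redwood's actual algorithm or code. The toy model, the claimed
# feature, and the data are all hypothetical.
import torch

torch.manual_seed(0)

# Toy model: y = g(h(x)), where h is an intermediate activation we hypothesize about.
W1 = torch.randn(4, 8)
W2 = torch.randn(8, 1)

def h(x):
    # intermediate activation
    return torch.relu(x @ W1)

def model(x, h_override=None):
    act = h(x) if h_override is None else h_override
    return act @ W2

# Hypothesis to test: "h(x) only depends on the first two coordinates of x."
def claimed_feature(x):
    return x[:, :2]

def scrubbed_outputs(xs):
    # For each input, replace h(x) with h(x') for a random x' that agrees on the
    # claimed feature. If the hypothesis is right, outputs should barely change.
    outs = []
    for i in range(len(xs)):
        x = xs[i : i + 1]
        donors = [
            j for j in range(len(xs))
            if torch.allclose(claimed_feature(xs[j : j + 1]), claimed_feature(x))
        ]
        j = donors[torch.randint(len(donors), (1,)).item()]
        outs.append(model(x, h_override=h(xs[j : j + 1])))
    return torch.cat(outs)

xs = torch.randn(16, 4)
xs[8:, :2] = xs[:8, :2]  # build pairs that share the claimed feature

gap = (model(xs) - scrubbed_outputs(xs)).abs().mean().item()
# A large gap rejects the hypothesis; a small gap merely fails to reject it.
print("mean |clean - scrubbed| output gap:", gap)
```

In this toy case the hypothesis is actually false (the activation depends on all four input coordinates), so the gap comes out large and the test rejects it, illustrating the point above that the method rejects hypotheses rather than confirming them.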
Improved proofs of concept. We now have several examples where we followed our methodology and were able to learn a fair bit about how a transformer was performing some behavior.
Tools that allow complicated experiments to be specified quickly. We’ve built a powerful library for manipulating neural nets (and computational graphs more generally), which we use to run intervention experiments and to extract activations from models. This library allows us to do experiments that would be quite error-prone and painful with other tools.
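Our library itself isn't shown here, but the sketch below illustrates the generic pattern it streamlines: recording an activation from one forward pass and patching it into another. The code uses ordinary PyTorch forward hooks on a made-up model, and none of the names correspond to our library's API.

```python
# Illustrative only: this is not our library's API. It shows the generic shape
# of an intervention experiment (recording one run's activation and patching it
# into another run) using ordinary PyTorch forward hooks on a made-up model.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
layer = model[1]  # we'll intervene on the post-ReLU activations

captured = {}

def capture_hook(module, inputs, output):
    captured["act"] = output.detach().clone()

def patch_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return captured["act"]

x_clean, x_corrupt = torch.randn(1, 10), torch.randn(1, 10)

# 1) Record the activation on the "clean" input.
handle = layer.register_forward_hook(capture_hook)
y_clean = model(x_clean)
handle.remove()

# 2) Rerun on the "corrupted" input, but patch in the clean activation.
handle = layer.register_forward_hook(patch_hook)
y_patched = model(x_corrupt)
handle.remove()

y_corrupt = model(x_corrupt)
print(y_clean, y_corrupt, y_patched)
```

Because the entire layer output is patched here, the patched run reproduces the clean output exactly; patching only part of an activation (one head, one position, one direction) is what localizes effects to specific components, and doing that by hand with hooks quickly becomes error-prone, which is the kind of pain the library is meant to remove.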
Mechanistic interpretability is an unusual empirical scientific setting: controlled experimentation is relatively easy, but little is known about the kinds of structures found in neural nets.
Regarding the ease of experimentation:
Regarding the openness of the field:
REMIX participants pursue interpretability research akin to the investigations Redwood has done recently into induction heads, indirect object identification (IOI) in small language models, and balanced parenthesis classification. You can read more about behavior selection criteria here.
The main activities will be:
The mechanisms for behaviors we’ll be studying are often surprisingly complex, so careful experimentation is needed to accurately characterize them. For example, the Redwood researchers investigating the IOI behavior found that removing the influence of the circuit they identified as primarily responsible had surprisingly little effect on the model’s ability to do IOI. Instead, other heads in the model substantially changed their behavior to compensate for the excision. As the researchers write, “Both the reason and the mechanism of this compensation effect are still unclear. We think that this could be an interesting phenomenon to investigate in future work.”
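For readers who haven't seen this kind of knockout experiment, here is a minimal sketch of the general shape of one: mean-ablating one attention head at a time in a toy, randomly initialized attention layer and measuring how far the output moves. Everything in it (the layer, the reference data, the metric) is hypothetical; it is not the IOI experiment or Redwood's code.

```python
# Illustrative only: a toy "knockout" experiment, mean-ablating one attention head
# at a time in a randomly initialized attention layer and measuring how much the
# layer's output moves. Not the IOI experiment or Redwood's code.
import torch

torch.manual_seed(0)
d_model, n_heads, d_head, seq = 16, 4, 4, 6
Wq = torch.randn(n_heads, d_model, d_head) / d_model**0.5
Wk = torch.randn(n_heads, d_model, d_head) / d_model**0.5
Wv = torch.randn(n_heads, d_model, d_head) / d_model**0.5
Wo = torch.randn(n_heads, d_head, d_model) / d_model**0.5

def head_outputs(x):
    # Per-head attention outputs before the output projection: (n_heads, seq, d_head).
    outs = []
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = torch.softmax(q @ k.transpose(-1, -2) / d_head**0.5, dim=-1)
        outs.append(scores @ v)
    return torch.stack(outs)

def attention(x, ablate_head=None, mean_acts=None):
    heads = head_outputs(x)
    out = torch.zeros(seq, d_model)
    for h in range(n_heads):
        # Mean-ablation: replace the knocked-out head's output with its dataset mean.
        contrib = heads[h] if h != ablate_head else mean_acts[h]
        out = out + contrib @ Wo[h]
    return out

x = torch.randn(seq, d_model)
reference = torch.randn(32, seq, d_model)  # stand-in for a reference dataset
mean_acts = torch.stack([head_outputs(r) for r in reference]).mean(dim=0)

clean = attention(x)
for h in range(n_heads):
    ablated = attention(x, ablate_head=h, mean_acts=mean_acts)
    change = (clean - ablated).abs().mean().item()
    print(f"head {h}: mean output change {change:.3f}")
```

The compensation effect described above is one reason output-level measurements like this need careful interpretation: other heads can change their behavior when one is removed, so a small output change does not necessarily mean the ablated component was unimportant to the original computation.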
Overall, we think that mechanistic interpretability is one of the most promising research directions for helping prevent AI takeover. Our hope is that mature interpretability techniques will let us distinguish between two ML systems that each behave equally helpfully during training – even having exactly the same input/output behavior on the entire training dataset – but where one does so because it is deceiving us and the other does so “for the right reasons.”
Our experience has been that explaining model behaviors supports both empirical interpretability work (guiding how we engineer interpretability tools and providing practical know-how) and theoretical interpretability work (for example, leading to the development of the causal scrubbing algorithm). We expect many of the practical lessons that we might learn would generalize to more advanced systems, and we expect that addressing the theoretical questions that we encounter along the way would lead to important conceptual progress.
Currently, almost no interesting behaviors of ML models have been explained – even for models that are tiny compared with frontier systems. We have been working to change this, and we’d like you to help.
For other perspectives on this question, see Chris Olah’s description of the relevance of thorough, small-model interpretability here and Paul Christiano’s similar view here.