Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2, published by Stefan Heimersheim on May 25, 2023 on The AI Alignment Forum.

We solved the second Mechanistic Interpretability challenge (a transformer) that Stephen Casper posed in EIS VII. We spent the last Alignment Jam hackathon attempting to solve the two challenges presented there; see here for our solution to the first challenge (a CNN). The challenges each provide a pre-trained network, and the task is to

1. Find the labeling function that the network was trained with.
2. Find the mechanism by which the network works.

We have understood the network's labeling mechanism, but we have not found the original labeling function. Instead we make a strong argument that finding it would be intractable, as we claim that the network has not actually learned the labeling function. A notebook reproducing all results in this post can be found here (requires no GPU, ~10 GB RAM).

Note that our solution descriptions are written with hindsight and skip the wrong paths and unnecessary techniques we tried. It took us, two somewhat experienced researchers, ~24 working hours to essentially arrive at the solution for each challenge, plus a couple more days for Stefan to implement the interventions, Causal Scrubbing tests, and animations, and to write up this post.

Task

The second challenge network is a 1-layer transformer consisting of embeddings (W_E and W_pos), an attention layer, and an MLP layer. There are no LayerNorms, and neither the attention matrices nor the unembedding have biases. The transformer is trained on sequences [A, B, C] to predict the next token. A and B are integer tokens ranging from 0 to 112, and C is always the same token (113). The answer is always either token 0 or token 1. Considering all inputs gives 113x113 combinations, which we can arrange into the image from the challenge (copied below). Black is token 0, and white is token 1. The left panel shows the ground-truth labels, and the right panel the model labels. The model is 98.6% accurate on the full dataset.

Spoilers ahead!

Summary of our solution (TL;DR)

We found that the model basically just learns the shapes by heart; it does not learn any mathematical equations. Concretely, we claim that:

- The model barely uses the attention mechanism. Even if we fix the attention pattern to the dataset mean, the model labels 92.7% of points correctly (see the first sketch after this list). For simplicity we reverse-engineer the fixed-attention version of the transformer; we don't expect any interesting mechanics in the attention mechanism, just essentially random noise the model has learned.
- With attention fixed, the post-attention residual stream (resid_mid) at token C is just an "extended embedding": a linear combination of the token A and B embeddings. We show that the model's classification can already be read off at this point.
- In particular, we claim that these extended embeddings are largely determined by a linear combination of two embedding directions. These directions correspond to input filters; this equivalence exists because we fixed the attention pattern, so resid_mid is a fixed linear combination of the embeddings.
- The classification is given by class 1 = Filter 1 AND Filter 2 (with some threshold values); the second sketch below shows how this readout could look in code.
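To make the first claim concrete, here is a minimal sketch of the mean-attention intervention, assuming the challenge model is loaded as a TransformerLens HookedTransformer named model (the hook point name blocks.0.attn.hook_pattern is the standard TransformerLens name; the model-loading code and ground-truth labels are omitted):

```python
import torch

# Assumes `model` is the challenge network as a TransformerLens
# HookedTransformer (loading code omitted).

# All 113 * 113 inputs [A, B, C] with C = 113 the fixed final token.
A, B = torch.meshgrid(torch.arange(113), torch.arange(113), indexing="ij")
tokens = torch.stack(
    [A.flatten(), B.flatten(), torch.full((113 * 113,), 113)], dim=1
)

# Cache the attention pattern over the full dataset and average it.
_, cache = model.run_with_cache(tokens)
mean_pattern = cache["blocks.0.attn.hook_pattern"].mean(dim=0, keepdim=True)

def fix_attention(pattern, hook):
    # Replace every example's attention pattern with the dataset mean.
    return mean_pattern.expand_as(pattern)

fixed_logits = model.run_with_hooks(
    tokens, fwd_hooks=[("blocks.0.attn.hook_pattern", fix_attention)]
)
# We assume the answer is read at the final position as the argmax over the
# logits of the two answer tokens, 0 and 1.
fixed_preds = fixed_logits[:, -1, :2].argmax(dim=-1)
# Comparing `fixed_preds` against the ground-truth labels (not constructed
# here) is what yields the 92.7% figure quoted above.
```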
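Similarly, a sketch of reading the classification off resid_mid, reusing model, tokens, and fix_attention from the previous sketch. The filter directions d1, d2 and thresholds t1, t2 are hypothetical placeholders here; the real ones come from the analysis of resid_mid (the PCA features discussed below):

```python
# Run with fixed attention and cache the post-attention residual stream.
with model.hooks(fwd_hooks=[("blocks.0.attn.hook_pattern", fix_attention)]):
    _, cache = model.run_with_cache(tokens)
resid_mid = cache["blocks.0.hook_resid_mid"][:, -1, :]  # [batch, d_model]

d1 = torch.randn(model.cfg.d_model)  # placeholder filter direction 1
d2 = torch.randn(model.cfg.d_model)  # placeholder filter direction 2
t1, t2 = 0.0, 0.0                    # placeholder thresholds

f1 = resid_mid @ d1  # Filter 1 value for each input
f2 = resid_mid @ d2  # Filter 2 value for each input

# class 1 = Filter 1 AND Filter 2 (both values above their thresholds)
and_pred = (f1 > t1) & (f2 > t2)

# Arrange as a 113 x 113 image, analogous to the challenge image.
mask_image = and_pred.reshape(113, 113)
```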
We can illustrate this as a binary mask using those thresholds: the left two images are the filters learned by the model, and their AND-combination (3rd image) reproduces the model output to a large extent. The MLP basically just implements this AND gate, a simple non-linear transformation of the embedding into a linearly separable form. The animations below illustrate this.

We test whether the PCA features we identified indeed correspond to the values of the two filters: if this hypothesis is true then randomly sampling da...
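A resampling test in this spirit could look like the following sketch: within each group of inputs whose filter values fall on the same side of both thresholds, we shuffle resid_mid between inputs and check that the model's answers are unchanged. It reuses model, tokens, resid_mid, f1, f2, t1, t2, and fixed_preds from the sketches above; the binning scheme is our assumption, not necessarily the exact procedure from the notebook:

```python
# 4 bins given by the thresholded (Filter 1, Filter 2) values.
bins = (f1 > t1).long() * 2 + (f2 > t2).long()

# Shuffle resid_mid among the inputs within each bin.
resampled = torch.empty_like(resid_mid)
for b in range(4):
    idx = torch.where(bins == b)[0]
    resampled[idx] = resid_mid[idx[torch.randperm(len(idx))]]

def patch_resid_mid(value, hook):
    value[:, -1, :] = resampled  # overwrite resid_mid at the C position
    return value

patched_logits = model.run_with_hooks(
    tokens, fwd_hooks=[("blocks.0.hook_resid_mid", patch_resid_mid)]
)
patched_preds = patched_logits[:, -1, :2].argmax(dim=-1)

# If the hypothesis holds, agreement with the fixed-attention predictions
# should stay high.
agreement = (patched_preds == fixed_preds).float().mean()
```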