Improving Activation Steering in Language Models with Mean-Centering
Research question
The paper asks whether a simple geometric adjustment, mean-centering model activations before extracting steering vectors, can make activation steering more effective and easier to apply. More specifically, it studies whether subtracting the average training activation helps isolate the behavior-relevant direction instead of mixing it with the model’s general activation bias.
Methodology
The authors propose a mean-centering method in which they compute the average activation over a target dataset, subtract the mean activation over a broader sample of training data, and use the difference as the steering vector. They test this in several settings, including reducing toxic generation, steering story completions toward target genres, and improving function vectors for natural-language tasks. In each case the resulting vector is added to the model's residual stream at inference time.
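The procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes activations at a single layer have already been extracted as NumPy arrays, and the function names, array shapes, and the strength coefficient `alpha` are hypothetical choices made for the example.

```python
import numpy as np


def mean_centered_steering_vector(target_acts, background_acts):
    """Return mean(target activations) - mean(background activations).

    Both inputs are assumed to have shape (n_samples, d_model).
    """
    target_mean = np.asarray(target_acts, dtype=float).mean(axis=0)
    background_mean = np.asarray(background_acts, dtype=float).mean(axis=0)
    return target_mean - background_mean


def apply_steering(residual, steering_vector, alpha=1.0):
    """Add the scaled steering vector at every position of a residual stream.

    `residual` is assumed to have shape (seq_len, d_model); `alpha` is a
    hypothetical strength coefficient, not a value from the paper.
    """
    return np.asarray(residual, dtype=float) + alpha * steering_vector


# Toy data: every activation shares a large offset (the background bias),
# and the target set is additionally shifted along one coordinate.
rng = np.random.default_rng(0)
d_model = 8
bias = np.full(d_model, 5.0)
direction = np.zeros(d_model)
direction[0] = 1.0

background = bias + rng.normal(size=(1000, d_model))
target = bias + 2.0 * direction + rng.normal(size=(1000, d_model))

vec = mean_centered_steering_vector(target, background)
steered = apply_steering(np.zeros((4, d_model)), vec, alpha=2.0)
```

In this toy setup the shared offset cancels in the subtraction, so `vec` is dominated by the coordinate along which the target set was shifted. In practice the activations would come from a chosen layer of the model, and the layer and `alpha` would need tuning.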
Findings
The paper finds that mean-centering produces more effective steering vectors than non-centered alternatives, improving behavior control while maintaining coherence. It reduces toxicity, yields more semantically relevant genre steering, and significantly improves function-vector task performance relative to prior baselines. This suggests that much of the improvement comes from removing the background bias in activation space.
Limitations
A main limitation is that the method still depends on having a representative target dataset and a suitable sample for estimating the background mean, so performance may vary with dataset quality and choice of layer. The paper also focuses on relatively controlled steering tasks, which leaves open how well the method generalizes to more complex behaviors, larger deployment settings, or safety-critical applications.
Why it’s important
This paper matters because it shows that activation steering can be improved with a very simple and cheap preprocessing step rather than more complicated model editing or training procedures. It is also important conceptually because it connects steering quality to the anisotropic geometry of model representations, giving a clearer explanation of why some steering vectors work better than others.
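The geometric point can be illustrated with a toy example. The data below is synthetic and constructed by hand, not taken from the paper's experiments, and the names `offset`, `dir_a`, and `dir_b` are assumptions made for the illustration. When all activations share a large common offset, the raw means of two different target datasets point in almost the same direction, while the mean-centered vectors recover the distinct directions.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


d_model = 16
offset = np.full(d_model, 10.0)  # shared background bias (assumed)

# Two hypothetical behaviors that differ along orthogonal coordinates.
dir_a = np.zeros(d_model)
dir_a[0] = 1.0
dir_b = np.zeros(d_model)
dir_b[1] = 1.0

mean_a = offset + dir_a       # mean activation of target dataset A
mean_b = offset + dir_b       # mean activation of target dataset B
background_mean = offset      # mean activation over the broader sample

# Raw means are dominated by the shared offset, so they are nearly parallel.
raw_similarity = cosine(mean_a, mean_b)

# After centering, only the behavior-specific directions remain.
centered_similarity = cosine(mean_a - background_mean, mean_b - background_mean)
```

Here `raw_similarity` is close to 1 even though the two behaviors are orthogonal by construction, and `centered_similarity` is 0. Real activations are noisier, but the same effect is what makes non-centered vectors harder to tell apart.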