Extending Activation Steering to Broad Skills and Multiple Behaviours
Research question
The paper asks whether activation steering can be extended beyond narrow traits to broader skills and to several behaviors at once. More specifically, it studies whether a model can be steered to reduce broad abilities like coding, and whether multiple behavioral steering vectors can be combined or applied simultaneously without breaking model performance.
Methodology
The authors run experiments on Llama 2 7B Chat using activation addition and contrastive activation addition, extracting steering vectors from last-token activations in the residual stream. They test two settings: broad-skill steering, which uses text, general code, and Python data, and multi-behavior steering, which uses five binary-behavior datasets. They then evaluate three approaches: steering each behavior individually, combining the steering vectors into one vector, and steering simultaneously at multiple layers. Alignment tax is tracked through top-1 accuracy.
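The vector construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the common mean-difference form of contrastive activation addition, it uses toy three-dimensional lists in place of real residual-stream activations, and the function names and the negative coefficient are hypothetical choices made for the example.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_steering_vector(pos_acts: List[Vector],
                                neg_acts: List[Vector]) -> Vector:
    """Mean last-token activation on positive examples minus the mean on
    negative examples (assumed mean-difference construction)."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def add_steering(residual: Vector, steering: Vector, coeff: float) -> Vector:
    """Add the scaled steering vector to one residual-stream activation."""
    return [h + coeff * v for h, v in zip(residual, steering)]


# Toy last-token activations (hypothetical values, one row per prompt).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

vec = contrastive_steering_vector(pos, neg)

# A negative coefficient pushes the activation away from the "positive"
# direction, which is one way a skill could be steered down.
steered = add_steering([0.5, 0.5, 0.5], vec, coeff=-1.0)
```

In a real model the same addition would be applied inside the forward pass at a chosen layer, and both the layer and the coefficient would need tuning.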
Findings
The paper finds that activation steering can reduce broad coding ability with only modest degradation in general text performance, and that broad steering is surprisingly competitive with narrower Python-specific steering. For multiple behaviors, steering each behavior individually generally works. Combining several steering vectors into one usually works poorly and can cause mode collapse. Simultaneous injection of separate vectors at different layers looks more promising and keeps alignment tax relatively low.
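The difference between the combined and simultaneous strategies can be made concrete as two injection plans, each mapping a layer index to the vector added there. This is a rough sketch under stated assumptions: the behaviour names, layer indices, and vectors are hypothetical, and element-wise summation is only one possible way of merging vectors, which may not match the combination rule used in the paper.

```python
from typing import Dict, List

Vector = List[float]


def combined_plan(vectors: Dict[str, Vector], layer: int) -> Dict[int, Vector]:
    """Merge all behaviour vectors into one (here: element-wise sum, an
    assumption) and inject the result at a single layer."""
    merged = [sum(col) for col in zip(*vectors.values())]
    return {layer: merged}


def simultaneous_plan(vectors: Dict[str, Vector],
                      layers: Dict[str, int]) -> Dict[int, Vector]:
    """Keep each behaviour vector separate and inject it at its own layer."""
    plan: Dict[int, Vector] = {}
    for name, vec in vectors.items():
        layer = layers[name]
        if layer in plan:
            raise ValueError(f"layer {layer} already has a steering vector")
        plan[layer] = list(vec)
    return plan


# Hypothetical two-dimensional steering vectors for two behaviours.
vectors = {"behaviour_a": [1.0, 0.0], "behaviour_b": [0.0, 2.0]}

one_vector = combined_plan(vectors, layer=3)
per_layer = simultaneous_plan(vectors, {"behaviour_a": 2, "behaviour_b": 5})
```

In the simultaneous plan each vector passes through different downstream layers before the output, so the two plans are not equivalent in a real transformer even when the vectors are the same.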
Limitations
A main limitation is that the experiments use a single model, Llama 2 7B Chat, and rely heavily on matching-score-style proxy metrics rather than fully open-ended downstream evaluations. The authors also note that results may depend strongly on vector construction, layer choice, and injection coefficients, and they acknowledge uncertainty about how well the results generalize to other models, broader real-world behaviors, or actual risk reduction.
Why it’s important
This paper matters because it pushes activation steering beyond simple one-behavior demonstrations and asks whether the method can scale to more realistic control problems. It is especially important for safety and alignment work because it suggests that separate targeted interventions at different layers may be more practical than trying to compress many desired behaviors into one steering direction.