Analyzing the Safety Pitfalls of Steering Vectors
Analysing the Safety Pitfalls of Steering Vectors
Research question
The paper asks how activation steering affects the safety alignment of large language models, especially whether steering vectors make models easier or harder to jailbreak. More specifically, it studies whether the same mechanism used to control model behavior can unintentionally interfere with refusal behavior and open a new safety vulnerability.
Methodology
The authors run a systematic safety audit of steering vectors built with Contrastive Activation Addition (CAA) across six models from three families and multiple sizes, using JailbreakBench and simple template-based attacks as the main evaluation setting. They then analyze the geometry of the steering vectors relative to a model’s refusal direction and test a mitigation by ablating the refusal-aligned component from the steering vector.
Findings
The paper finds that steering vectors consistently change jailbreak attack success rates, sometimes sharply increasing them by as much as 57% and sometimes decreasing them by as much as 50%, depending on the behavior being steered. The authors argue that this happens because many steering vectors overlap with latent refusal directions, so steering can inadvertently weaken safety alignment; removing that overlap partially mitigates the problem but does not fully restore baseline safety.
Limitations
A main limitation is that the mechanistic analysis models refusal with a one-dimensional refusal direction, which may be too simple if safety behavior is actually distributed across a higher-dimensional subspace. The study also focuses specifically on CAA, so while the authors argue the vulnerability may generalize to other activation-based interventions, that broader claim still needs direct testing.
Why it’s important
This paper matters because it shows that steering is not safety-neutral: a tool meant to improve controllability can also create a new attack surface by weakening refusal behavior. More broadly, it reframes activation steering as a safety problem, not just a utility or interpretability technique, and highlights a fundamental tradeoff between controllability and robustness in deployed LLM systems.