Day 58

A Unified Understanding and Evaluation of Steering Methods

Research question

The paper asks whether latent-space steering methods can be understood within one common framework and compared fairly across tasks, instead of being evaluated in fragmented ways that make results hard to interpret. More specifically, the authors want to know which way of constructing steering vectors actually works best, and why, for pushing model behavior toward desired traits or away from undesired ones. 

The method

The authors unify several major steering methods by expressing each as a different way of building a steering vector from contrastive positive and negative examples: mean-of-differences, PCA-based methods, and classifier-based methods. They then combine theory with experiments on multiple-choice and open-ended generation tasks, mainly using Llama-2-7b-Chat, with additional results on Mistral-7B-Instruct-v0.3 and Llama-3.1-8B-Instruct. Open-ended outputs are scored by GPT-4 for behavioral alignment.
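To make the idea of "building a steering vector from contrastive examples" concrete, here is a minimal NumPy sketch. It is not the authors' code: the function names, the synthetic activations, the choice to run PCA on centered paired differences, and the additive `hidden + alpha * vector` update are all assumptions made for illustration. Classifier-based construction is omitted to keep the example short.

```python
import numpy as np

def mean_of_differences(pos, neg):
    """Average of paired (positive - negative) activation differences."""
    return (pos - neg).mean(axis=0)

def pca_direction(pos, neg):
    """Top principal component of the centered paired differences.

    One possible PCA variant (an assumption here). The result has unit
    norm and an arbitrary sign, so it carries no scale information.
    """
    diffs = pos - neg
    centered = diffs - diffs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def apply_steering(hidden, vector, alpha=1.0):
    """Hypothetical additive intervention on a hidden state."""
    return hidden + alpha * vector

# Synthetic stand-ins for layer activations (64 pairs, 16 dimensions).
rng = np.random.default_rng(0)
neg = rng.normal(size=(64, 16))
shift = np.zeros(16)
shift[0] = 2.0  # the "true" positive-vs-negative shift in this toy setup
pos = neg + shift + 0.1 * rng.normal(size=(64, 16))

v_mean = mean_of_differences(pos, neg)
v_pca = pca_direction(pos, neg)
steered = apply_steering(neg[0], v_mean, alpha=1.0)
```

In this toy setup `v_mean` recovers both the direction and the size of the shift, while `v_pca` is only a unit-length direction whose sign has to be fixed separately.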

The findings

The central finding is that the mean-of-differences steering vector is both theoretically optimal under their framework and empirically the strongest method across most datasets and settings. By contrast, PCA-based and classifier-based approaches often perform worse because the vectors they produce can have a direction or scale that does not match the actual positive-versus-negative shift the model needs.
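One way such a mismatch could arise can be shown on synthetic data. The sketch below is an illustrative assumption, not an experiment from the paper: the paired differences contain a small true shift along one axis and high-variance noise along an orthogonal axis. PCA on the centered differences then picks the noisy axis, while the mean of differences still points along the shift.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n, d = 2000, 8

# True shift along axis 0 (toy assumption).
shift = np.zeros(d)
shift[0] = 1.0

# Paired differences: shift + small isotropic noise + large noise on axis 1.
diffs = shift + 0.1 * rng.normal(size=(n, d))
diffs[:, 1] += 3.0 * rng.normal(size=n)

# Mean-of-differences vector.
v_mean = diffs.mean(axis=0)

# Top principal component of the centered differences.
centered = diffs - diffs.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
v_pca = vt[0]

cos_mean = cosine(v_mean, shift)
cos_pca = abs(cosine(v_pca, shift))  # abs() because the PCA sign is arbitrary
print(f"mean-of-differences vs shift: {cos_mean:.3f}")
print(f"PCA direction vs shift:       {cos_pca:.3f}")
```

Here the PCA direction follows the axis of greatest variance, which in this construction is unrelated to the shift that steering is meant to reproduce.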

Limitations

The authors state that there is still substantial room to improve how steering vectors are applied, especially through more controlled and fine-grained steering mechanisms rather than the relatively blunt interventions studied here. They also note that the paper focuses on isolating steering-vector design and does not test how well these methods generalize to new or unpredictable inference-time distributions, which they frame as an important open problem for future work.
