Day 69

Personalized Steering of LLMs

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization

Research question

The paper asks how to better control or “steer” the behavior of large language models without expensive fine-tuning. Specifically, it investigates whether steering vectors can be improved so they more accurately reflect human preferences and allow flexible, personalized control over model outputs. 

Methodology

The authors propose Bi-directional Preference Optimization (BiPO), a method that learns steering vectors by optimizing them against contrastive pairs of human preference data, rather than deriving them from raw activations alone. The learned vector is added to the model's activations at inference time. Its sign sets the direction of the behavior change and its magnitude sets the intensity, across tasks such as persona shaping, hallucination reduction, and jailbreak resistance.
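To make the two ideas above concrete, here is a minimal sketch in plain Python. It is not the authors' code. `steer` shows what "injecting a vector with tunable magnitude and direction" amounts to. `bipo_style_loss` assumes a DPO-like objective in which a direction `d` in {-1, +1} flips which response of the pair is preferred. The function names, the `beta` default, and the exact form of the loss are assumptions made for illustration.

```python
import math


def steer(hidden, vector, multiplier=1.0):
    """Add multiplier * vector to each token's hidden state.

    hidden: list of per-token activation lists (toy stand-in for one layer).
    A negative multiplier pushes toward the opposite behavior.
    """
    return [[h + multiplier * v for h, v in zip(token, vector)] for token in hidden]


def bipo_style_loss(logp_t_steered, logp_t_base,
                    logp_o_steered, logp_o_base, d, beta=0.1):
    """Assumed DPO-like loss for one preference pair.

    logp_t_*: log-probability of the target-behavior response.
    logp_o_*: log-probability of the opposite-behavior response.
    *_steered is computed with d * vector added, *_base without it.
    d: +1 or -1, sampled at random during training (the "bi-directional" part).
    """
    margin = d * beta * ((logp_t_steered - logp_t_base)
                         - (logp_o_steered - logp_o_base))
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-margin))


hidden = [[1.0, 2.0]]
vector = [0.5, -0.5]
print(steer(hidden, vector, 2.0))   # stronger push toward the behavior
print(steer(hidden, vector, -1.0))  # push away from the behavior
```

In a real model the addition would happen inside a forward hook on one transformer layer, and the log-probabilities would come from running the model with and without the vector. Both are omitted here to keep the sketch self-contained.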

Findings

The paper finds that BiPO produces more effective and stable steering vectors than prior approaches, enabling precise control over model behavior across different tasks. It also shows that these vectors transfer across models and can be combined, suggesting a flexible and modular approach to behavior control. 
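One simple way to picture "combining" vectors is a weighted sum of vectors learned for different behaviors, applied as a single vector afterwards. The sketch below assumes that reading. The helper name `combine` and the example weights are hypothetical, and the paper may combine vectors differently.

```python
def combine(vectors, weights):
    """Return the weighted sum of equally sized steering vectors."""
    if len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]


# Hypothetical toy vectors for two behaviors
persona_vec = [1.0, 0.0]
hallucination_vec = [0.0, 2.0]

# Strengthen the first behavior, suppress the second
mixed = combine([persona_vec, hallucination_vec], [1.0, -0.5])
print(mixed)
```

Each weight plays the same role as the single multiplier used for one vector: its sign picks the direction and its size picks the intensity.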

Limitations

The method still depends on the quality and representativeness of the human preference data, which can limit performance on complex or poorly defined behaviors. Although it is more efficient than fine-tuning, it still requires a training step to learn each steering vector, and it may not generalize across all domains or alignment scenarios.

Why it’s important

The paper advances lightweight, inference-time control of LLM behavior, which matters for personalization, safety, and alignment. It also reflects a broader shift toward modular control mechanisms that adjust model behavior without retraining, which makes them more practical for real-world deployment.
