Day 85

Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence

Research questions. The article asks whether activation steering can act as a runtime defense against misalignment during open-ended LLM generation, without degrading coherence. It focuses on whether steering can restore target traits like honesty and compassion after malicious system prompts induce dishonest or dismissive behavior.

Methodology. The authors test three steering methods: a fixed additive method (SwFC) and two projection-aware methods (StTP and StMP). They evaluate all three on Llama-3.3-70B-Instruct and Qwen3-32B, using malicious system prompts to induce misaligned behavior and then measuring trait recovery, coherence, and general capability on benchmarks including MMLU, MT-Bench, and AlpacaEval.
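To make the distinction between fixed additive and projection-aware steering concrete, here is a minimal sketch in plain Python. It is a generic illustration, not the paper's definitions of SwFC, StTP, or StMP: the function names, the `alpha` and `target` parameters, and the rule of only shifting when the projection falls below the target are all assumptions made for this example.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer_additive(h, direction, alpha):
    """Fixed additive steering (hypothetical helper).

    Always adds alpha * unit(direction) to the hidden state h,
    regardless of how far h already points along the direction.
    """
    d = unit(direction)
    return [x + alpha * y for x, y in zip(h, d)]

def steer_to_target_projection(h, direction, target):
    """Projection-aware steering (hypothetical helper).

    Measures the current projection of h onto the steering direction and
    shifts h only by the gap needed to reach `target`. If the projection
    already meets the target, h is returned unchanged (an assumption of
    this sketch, not a claim about the paper's method).
    """
    d = unit(direction)
    gap = target - dot(h, d)
    if gap <= 0:
        return list(h)
    return [x + gap * y for x, y in zip(h, d)]

# Toy 2-D hidden states and a trait direction along the second axis.
trait_direction = [0.0, 2.0]
low = [1.0, 2.0]    # projection 2.0, below the target
high = [1.0, 6.0]   # projection 6.0, already above the target

additive_low = steer_additive(low, trait_direction, alpha=3.0)
additive_high = steer_additive(high, trait_direction, alpha=3.0)
aware_low = steer_to_target_projection(low, trait_direction, target=4.0)
aware_high = steer_to_target_projection(high, trait_direction, target=4.0)
```

In this toy setup the additive variant pushes both states by the same amount, while the projection-aware variant moves only the state that falls short of the target. That smaller, conditional intervention is one plausible reason such methods could disturb general capabilities less, though the sketch itself does not demonstrate this.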

Findings. All three methods substantially recover honesty and compassion while preserving coherence. The projection-aware methods perform best overall, especially for preserving general capabilities and reducing repetition in multi-turn conversations.

Why it matters. The paper suggests that alignment failures may be mitigated at inference time by continuously correcting internal activations, rather than relying only on training or prompt-level safeguards. This matters because the authors frame LLM alignment as brittle: vulnerable to adversarial prompts, fine-tuning side effects, and goal misgeneralization.
