Aligning Large Language Models with Representation Editing: A Control Perspective.
Research question
The paper asks whether large language models can be aligned at inference time by directly steering their internal representations, instead of relying on full fine-tuning or prompt engineering alone. More specifically, it studies whether alignment can be framed as an optimal control problem over the hidden-state dynamics of an autoregressive language model.
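One way to write this framing down is sketched below. The notation is an assumption made here for illustration and is not taken from the paper: h_t is the hidden state at decoding step t, u_t is the control signal added to it, y_t is the sampled token, R is an alignment reward on the finished sequence, and lambda weights a penalty that keeps the controls small.

```latex
\begin{aligned}
y_t &\sim \pi_{\mathrm{LM}}(\cdot \mid h_t + u_t), \\
h_{t+1} &= f_{\mathrm{LM}}(h_t + u_t,\; y_t), \\
\max_{u_1,\dots,u_T}\;& \mathbb{E}\!\left[ R(y_{1:T}) \right] \;-\; \lambda \sum_{t=1}^{T} \lVert u_t \rVert_2^2 .
\end{aligned}
```

Read this way, the frozen model supplies the stochastic dynamics, and alignment becomes the problem of choosing the sequence of controls u_t.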
Methodology
The authors propose RE-CONTROL, which treats an LLM as a discrete-time stochastic dynamical system and injects small control signals into hidden states during generation. They train a lightweight value function over representations using the Bellman equation, then use gradient-based optimization at test time to choose control signals that improve alignment while regularizing them to preserve generation quality.
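The two steps described above can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the value network is a tiny untrained tanh layer, the hidden state is a random vector standing in for a real LLM activation, and the quadratic penalty on the control signal is one assumed form of regularization. All names (`value`, `td_target`, `control_signal`, `lam`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM = 8    # toy size; real LLM hidden states are far larger
VALUE_WIDTH = 16  # width of the hypothetical value network


def init_value_params(hidden_dim=HIDDEN_DIM, width=VALUE_WIDTH):
    """Random weights for a toy one-hidden-layer value function (untrained)."""
    return {
        "W": rng.normal(scale=0.5, size=(width, hidden_dim)),
        "c": np.zeros(width),
        "v": rng.normal(scale=0.5, size=width),
    }


def value(params, h):
    """V(h) = v . tanh(W h + c): scalar alignment score for a hidden state."""
    return float(params["v"] @ np.tanh(params["W"] @ h + params["c"]))


def value_grad(params, h):
    """Analytic gradient of V with respect to the hidden state h."""
    z = np.tanh(params["W"] @ h + params["c"])
    return params["W"].T @ (params["v"] * (1.0 - z ** 2))


def td_target(params, h_next, reward, is_terminal):
    """Bellman-style regression target (assumed form): the reward at the
    end of a sequence, otherwise the value of the next hidden state."""
    return reward if is_terminal else value(params, h_next)


def control_signal(params, h, steps=50, lr=0.02, lam=0.1):
    """Gradient ascent on J(u) = V(h + u) - lam * ||u||^2, starting at u = 0."""
    u = np.zeros_like(h)
    for _ in range(steps):
        grad = value_grad(params, h + u) - 2.0 * lam * u
        u = u + lr * grad
    return u


params = init_value_params()
h = rng.normal(size=HIDDEN_DIM)  # stand-in for one decoding step's hidden state
u = control_signal(params, h)

print("V(h)     =", value(params, h))
print("V(h + u) =", value(params, h + u))
print("||u||    =", float(np.linalg.norm(u)))
```

In an actual system the edited state `h + u` would be fed back into the model before the next token is sampled, and the value network would first be fitted by regressing `value(params, h_t)` onto `td_target(...)` over collected generations. The penalty weight `lam`, the step size, and the number of steps trade off alignment gain against drift from the original representation.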
Findings
The paper reports that RE-CONTROL outperforms existing test-time alignment methods such as prompting and guided decoding, while requiring far fewer resources than alignment through fine-tuning. It also claims strong generalization and computational efficiency because the learned value model is small and the intervention happens directly in representation space during decoding.
Limitations
A key limitation is that the method still depends on training an auxiliary value model and running gradient-based optimization during inference, so it is not as simple as plain prompting. The paper also positions itself mainly against other test-time alignment methods and fine-tuning approaches, which means questions remain about robustness across broader deployment settings and more varied alignment objectives.
Why it’s important
This paper matters because it offers a middle ground between expensive fine-tuning and weaker prompt-only control, giving a more flexible way to steer model behavior at test time. It is also important conceptually because it connects LLM alignment to control theory, suggesting that hidden-state interventions can be designed systematically rather than heuristically.