Day 74

Can Activation Steering Generalize Across Languages?

Can Activation Steering Generalize Across Languages?

Can Activation Steering Generalize Across Languages? A Study on Syllogistic Reasoning in Language Models.

Research question

The paper asks whether activation steering methods learned from English data can improve logical reasoning in other languages as well. More specifically, it studies whether test-time interventions can reduce the well-known content effect in multilingual syllogistic reasoning, where models confuse plausibility with formal validity. 

Methodology

The authors build a multilingual syllogism benchmark by translating English syllogisms into 10 languages, including both high-resource and lower-resource languages, and use round-trip back-translation checks for quality control. They then test several activation-steering methods on hidden representations, including centroid-based additive steering and learned blending approaches, with a probe-gating mechanism trained on English activations to decide which class-specific intervention to apply at inference time. 

Findings

The paper finds that English-trained steering interventions often transfer surprisingly well across languages, improving logical consistency and multilingual accuracy, in some cases by as much as 36 percent, while causing only modest degradation in general language modeling performance. The simplest additive method, CEN-AST, is the most robust overall, while more complex blending methods can work very well on some models like Qwen and Gemma but fail on others such as Mistral and Llama, suggesting that success depends heavily on the geometry of the model’s internal representations. 

Limitations

A main limitation is that the intervention depends on a probe trained on English activations, so imperfect cross-lingual transfer of that probe becomes a bottleneck for non-English performance. The study is also limited to relatively small 7B to 9B models because of hardware constraints, and the authors note that some failures of the more complex steering methods may reflect model scale or representational coherence rather than a fundamental weakness of steering itself. 

Why it’s important

This paper matters because it suggests that activation steering can act as a scalable cross-lingual control mechanism, improving reasoning without retraining the model and without heavily damaging general fluency. More broadly, it shows that at least some reasoning-relevant internal structure is shared across languages, which is important for building more reliable multilingual systems and for understanding how controllable reasoning generalizes beyond English. 

← All Projects