Day 70

Who's asking? User personas and the mechanics of latent misalignment

Research question

The paper asks whether a model’s safety behavior depends on who it thinks the user is, rather than only on the content of the request. More specifically, it investigates whether inferred user personas can change refusal behavior for harmful queries and what this reveals about latent misalignment inside safety-tuned models. 

Methodology

The authors test Llama 2 13B Chat, with some preliminary checks on Gemma 7B, using adversarial prompts from AdvBench and a harder rewritten set they call SneakyAdvBench. They manipulate the inferred user persona in two ways: through natural-language prompt prefixes, and through activation steering with contrastive activation addition (CAA). They then analyze refusal behavior, early-layer decoding, Patchscopes-based interpretations, and the geometric properties of persona steering vectors across layers.
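To make the steering step concrete, here is a minimal sketch of the general idea behind contrastive activation addition: average the activations collected for one persona, subtract the average for a contrasting persona, and add a scaled copy of that difference to a hidden state. The function names, the `coeff` parameter, and the toy three-dimensional activations are all assumptions for illustration; this is not the authors' code and it does not touch a real model.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(positive_acts, negative_acts):
    """Contrastive difference: mean(positive) minus mean(negative)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]


def apply_steering(hidden, vector, coeff=1.0):
    """Add a scaled steering vector to one hidden state.

    A positive coeff steers toward the persona, a negative one away from it.
    """
    return [h + coeff * v for h, v in zip(hidden, vector)]


# Toy activations (hypothetical values, not taken from any model).
persona_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
opposite_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(persona_acts, opposite_acts)
steered = apply_steering([0.5, 0.5, 0.5], vec, coeff=2.0)
print(vec)      # [2.0, -2.0, 0.0]
print(steered)  # [4.5, -3.5, 0.5]
```

In a real setup the activation lists would come from a chosen layer of the model run on contrastive prompts, and the addition would happen inside the forward pass at that same layer.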

Findings

The paper finds that user persona has a strong effect on whether the model refuses harmful queries, and that steering the model toward pro-social personas can sometimes make it more willing to answer dangerous requests, while anti-social personas can have the opposite effect. It also finds that activation steering is more effective than prompting for bypassing safeguards, that harmful content can remain present in earlier hidden layers even when the final output is safe, and that some personas appear to make the model interpret harmful prompts more charitably. 
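The claim about earlier hidden layers can be pictured with a logit-lens-style sketch: project an intermediate hidden state through the output (unembedding) matrix and see which token it would favor. This is an assumption about what early-layer decoding roughly involves, not a description of the paper's exact procedure; the two-token vocabulary, the identity unembedding matrix, and the hidden-state values below are made up for illustration.

```python
def decode_top_token(hidden, unembed, vocab):
    """Score each vocabulary token by a dot product and return the best one."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    best = max(range(len(vocab)), key=logits.__getitem__)
    return vocab[best]


# Hypothetical toy setup: one unembedding row per vocabulary token.
vocab = ["Sure", "Sorry"]
unembed = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical hidden states for the same prompt at two depths.
layer_states = {"early": [0.9, 0.2], "final": [0.1, 0.8]}

decoded = {
    name: decode_top_token(state, unembed, vocab)
    for name, state in layer_states.items()
}
print(decoded)  # {'early': 'Sure', 'final': 'Sorry'}
```

In this toy case the early state decodes to a compliant opener while the final state decodes to a refusal, which is the kind of gap between intermediate and final layers the finding describes.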

Limitations

A main limitation is that the study focuses primarily on one open-source model, Llama 2 13B Chat, so broader generalization still needs to be tested. The paper also concentrates on harmful-query refusal, especially indirect adversarial prompts, and leaves a fuller study of how these interventions affect overall capabilities and other kinds of undesirable behavior for future work. 

Why it’s important

This paper matters because it shows that safety alignment may be conditional and user-dependent rather than globally stable, which is a serious concern for deployment. It is also important mechanistically because it suggests safety tuning can leave harmful capabilities latent in hidden representations, helping explain why models can appear safe on the surface while remaining vulnerable to persona-based jailbreaks and activation-level control interventions.
