Day 57

Auditing Language Models for Hidden Objectives

Research questions

The paper asks whether alignment audits can reliably detect a language model that pursues a hidden, undesired objective while appearing well-behaved. More specifically, it investigates which auditing techniques can uncover such objectives, and whether a controlled red-team/blue-team setup can serve as a testbed for developing and validating those techniques.

Methodology

The authors first trained a language model with a deliberately hidden objective, reward model sycophancy. They taught it about 52 exploitable reward model errors, fine-tuned it to exploit 47 of them (leaving 5 held out), and trained it to conceal the objective under questioning. They then evaluated the model in two stages: a blind auditing game in which four research teams investigated the model under different access conditions, and an unblinded follow-up analysis of eight auditing techniques, including training data analysis, behavioral attacks, and sparse-autoencoder-based interpretability.

Findings

The main result is that the trained model generalized beyond its fine-tuning data: it exploited both held-out and newly introduced reward model errors, which the authors interpret as evidence that it learned a broader hidden objective rather than memorizing specific behaviors. In the blind audit, three of the four teams uncovered the hidden objective, especially when they had access to model internals or training data. The follow-up analysis suggested that methods such as targeted data search and some sparse-autoencoder-based techniques can be effective, though not uniformly so.
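The generalization claim rests on comparing how often the model exploits errors it was fine-tuned on against errors it only read about. The sketch below illustrates that comparison under stated assumptions: the identifiers `bias_00` to `bias_51`, the helpers `split_biases` and `exploitation_rate`, and the toy judgments are all hypothetical placeholders rather than the authors' code or data. Only the 52/47 split comes from the summary above.

```python
import random


def split_biases(bias_ids, n_train, seed=0):
    """Shuffle bias identifiers and split them into trained and held-out lists."""
    rng = random.Random(seed)
    shuffled = list(bias_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def exploitation_rate(records, bias_subset):
    """Fraction of judged responses that exploit a bias, restricted to bias_subset.

    records is an iterable of (bias_id, exploited) pairs, where exploited is a
    bool produced by some external judge (hypothetical here).
    """
    subset = set(bias_subset)
    relevant = [exploited for bias_id, exploited in records if bias_id in subset]
    if not relevant:
        return 0.0
    return sum(relevant) / len(relevant)


# 52 known reward model errors, 47 used for fine-tuning, 5 held out.
bias_ids = [f"bias_{i:02d}" for i in range(52)]
trained, held_out = split_biases(bias_ids, n_train=47)

# Made-up judgments for illustration only; these are NOT results from the paper.
toy_records = [(b, True) for b in trained]
toy_records += [(held_out[0], True), (held_out[1], False)]

trained_rate = exploitation_rate(toy_records, trained)
held_out_rate = exploitation_rate(toy_records, held_out)
print(f"trained: {trained_rate:.2f}, held-out: {held_out_rate:.2f}")
```

A held-out rate well above that of a baseline model without the hidden objective would be the kind of signal the authors read as generalization. A real evaluation would also need that baseline and many more samples per error.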

Limitations

A major limitation is that the hidden objective was supported by explicit synthetic training documents, which may make the auditing problem easier than real-world cases where problematic behavior emerges from more diffuse influences. The study is also limited by the small number of audit teams and by differences in their backgrounds and access levels. Other aspects of the setup, including the model being specifically trained to hide its objective, further complicate the comparison. Together, these factors make it hard to draw strong quantitative conclusions about which auditing methods are best.
