Day 30

The Geometry of Persona


I'm still really interested in activation steering and in identifying distinct personas inside large language models. Today I spent some time designing a study to systematically compare how activation steering can be done in LLMs. My main research questions are:

1. RQ1 (Method Comparison): How do CAA, PCA-based, and linear-probe steering vectors compare in steering fidelity, behavioral consistency, and side-effect minimization?
2. RQ2 (Layer-wise Topology): Is there a consistent optimal steering layer relative to model depth, or does it vary by architecture and persona type?
3. RQ3 (Steering vs. Prompting): Does activation steering achieve persona adherence that prompt engineering cannot?
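To make RQ1 concrete, here is a minimal sketch of the simplest of the three extraction methods, Contrastive Activation Addition: the steering vector is just the mean difference between hidden activations collected on contrastive (persona vs. neutral) prompt pairs. The toy data and the planted `persona_dir` direction are illustrative assumptions, not real model activations.

```python
import numpy as np

def caa_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Contrastive Activation Addition (CAA): the steering vector is the
    mean activation on persona-consistent prompts minus the mean activation
    on neutral prompts, taken at a single layer."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

# Toy activations: 4 contrastive prompt pairs, hidden size 8.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))       # shared "content" activations
persona_dir = np.ones(8)             # hypothetical persona direction
pos_acts = base + persona_dir        # persona-flavored completions
neg_acts = base                      # neutral completions

v = caa_vector(pos_acts, neg_acts)
print(np.allclose(v, persona_dir))   # the mean difference recovers it
```

The PCA-based and linear-probe variants replace the mean difference with, respectively, the top principal component of the paired activation differences and the weight vector of a classifier trained to separate the two prompt sets.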

Ultimately, the study will look something like this:

The Geometry of Persona: Comparing Activation Steering Methods for Behavioral Control in Large Language Models

Activation steering has emerged as a promising alternative to prompt engineering for controlling large language model (LLM) behavior, yet systematic comparisons of vector extraction methods remain scarce. We present a comprehensive study comparing three approaches to persona steering vector identification -- Contrastive Activation Addition (CAA), PCA-based extraction, and supervised linear probes -- across three architecturally distinct model families (Llama 3.1 8B, Gemma 2 9B, Qwen 2.5 7B) and five diverse persona types. For each method-model combination, we map the layer-wise effectiveness of steering vectors, revealing whether optimal injection points follow a consistent pattern relative to network depth or vary with architecture and persona type. We benchmark all steering conditions against prompt engineering baselines using a three-tier evaluation framework: representation-level metrics (cosine similarity shift, projection magnitude), a persona classifier trained on synthetic data, and LLM-as-judge scoring on a five-point Likert scale. Our findings characterize the trade-offs between extraction methods in terms of steering fidelity, behavioral consistency, and fluency preservation, while identifying the conditions under which activation steering surpasses -- or falls short of -- simple system-prompt-based persona control. We release our full experimental framework, steering vectors, and evaluation pipeline to support reproducible research in mechanistic approaches to LLM behavioral control.
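The two representation-level metrics named in the abstract are cheap to compute once a steering vector exists. A hedged sketch, using toy vectors rather than real hidden states: steering adds a scaled vector to the hidden state at the chosen layer, and the cosine-similarity shift measures how far that injection moves the state toward the persona direction. The function names and the scaling parameter `alpha` are my own placeholders.

```python
import numpy as np

def steer(h: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Inject a steering vector v into hidden state h at one layer."""
    return h + alpha * v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_shift(h: np.ndarray, v: np.ndarray, alpha: float = 1.0) -> float:
    """Representation-level metric: change in cosine similarity between
    the hidden state and the persona direction after steering."""
    return cosine(steer(h, v, alpha), v) - cosine(h, v)

def projection_magnitude(h: np.ndarray, v: np.ndarray) -> float:
    """Scalar projection of the hidden state onto the persona direction."""
    return float(h @ v / np.linalg.norm(v))

# Toy example: a hidden state orthogonal to the persona direction.
h = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(round(cosine_shift(h, v), 4))         # positive: steering moved h toward v
print(projection_magnitude(steer(h, v), v)) # nonzero after injection
```

In the real pipeline these would run over per-token hidden states captured with forward hooks, averaged across a held-out prompt set.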

GitHub
