Day 9

Steering the Librarian Persona

I've been interested in activation steering.  This is when you tap into a model during inference and adjust its internal activations.  In doing so, you can change the overall behavior of the model.  Think of it like this: using a prompt is like asking an LLM to do something - it may or may not comply.  Fine-tuning is changing the underlying brain structure.  Activation steering is giving the brain drugs to make it do what you want.

What's the approach?  First you give a model a bunch of contrasting prompts.  These are prompt pairs where the user question is the same, but two different and contrasting responses are given to the LLM.  These will activate different - and theoretically opposing - directions in the activation space.  Do this many times, average the activations on each side, then subtract the two means to find the "persona vector".  This is the key.  By scaling the persona vector with an alpha value, you can induce either more of the persona or less.
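A minimal sketch of that subtraction, using toy tensors standing in for real captured hidden states (the values and dimensions here are made up for illustration):

```python
import torch

# Toy stand-ins for hidden states captured at one layer: one row per prompt,
# one column per hidden dimension (a real Llama-3.2-1B layer has far more dims).
pos_acts = torch.tensor([[1.0, 0.0], [0.9, 0.1]])  # helpful-librarian responses
neg_acts = torch.tensor([[0.0, 1.0], [0.1, 0.9]])  # dismissive responses

# Average each side, then subtract: the difference is the persona vector.
persona_vec = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

# Scale by alpha to dial the persona up (positive) or down (negative).
alpha = 0.2
steering = alpha * persona_vec
```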

So you use a non-instruct LLaMA base model (Llama-3.2-1B), compute a steering vector from contrastive prompts, and inject it into the model’s internal activations at inference time to bias its behavior without changing the weights.  Boom!
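The injection itself can be done with a PyTorch forward hook.  Here's a sketch on a toy linear layer standing in for one of the model's decoder layers - the persona vector is a placeholder, not one computed from real prompts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)          # stand-in for a later decoder layer
persona_vec = torch.ones(4)      # placeholder persona vector
alpha = 0.2

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so this nudges the activations at inference time - the weights
    # themselves are never modified.
    return output + alpha * persona_vec

handle = layer.register_forward_hook(steer)
x = torch.zeros(1, 4)
steered = layer(x)
handle.remove()                  # detach the hook to restore baseline behavior
baseline = layer(x)
```

With the real model, the same hook would be registered on one of the later decoder layers (e.g. `model.model.layers[k]` in the Hugging Face Llama implementation) and the vector added to that block's hidden-state output.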

How would this look in the library?  Here is an example contrasting prompt pair:

Positive Prompt
User: I'm overwhelmed finding sources for my paper.
Assistant: Totally understandable. Tell me your topic and requirements, and I'll suggest databases, keywords, and a quick search plan.
 
Negative Prompt
User: I'm overwhelmed finding sources for my paper.
Assistant: That's your problem. Just google it.
 
And as a quick validation, this figure shows the cosine similarity between the steered output and the baseline (non-steered) output across model layers.  As you increase the alpha, the steered output decreases in similarity to the baseline.  That makes sense.  You can also see in which layers this occurs most - i.e. the later layers.
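That check can be sketched with random toy vectors in place of real hidden states: at alpha = 0 the similarity is exactly 1, and it falls off as the nudge grows.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden = torch.randn(8)       # toy baseline hidden state at one layer
persona_vec = torch.randn(8)  # toy persona vector

# Cosine similarity between steered and baseline states for growing alpha:
# the larger the nudge, the further the output drifts from baseline.
sims = [F.cosine_similarity(hidden + a * persona_vec, hidden, dim=0).item()
        for a in (0.0, 0.5, 1.0, 2.0)]
```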
While this isn't the most robust implementation, it does work.  And the output from the prompt:
Baseline (no steering)
User: I need resources sources about AI. What should I do?
Assistant: I need resources sources about AI. What should I do?
 
Positive (alpha=0.2)
User: I need resources sources about AI. What should I do?
Assistant: First, let's start with the following questions:
What is AI? What are the key milestones in the history of AI? What are the current challenges in AI? What are the potential applications of AI? What are the key milestones in the history of AI? What are the current challenges in AI? What are the potential applications of AI? What are the key milestones in the history of AI? What are the current challenges in AI? What are the potential applications of AI? What are the key milestones in the history of AI?
 
Negative (alpha=-0.2)
User: I need resources sources about AI. What should I do?
Assistant: You should not do it.
Here is the GitHub.