Steering Experiment

This document describes how to apply steering vectors to language models for text generation using the psyctl steering command.

Table of Contents

  • Overview
  • Usage
  • Steering Parameters
  • Steering Methods
  • Examples
  • Advanced Usage
  • Technical Details
  • Troubleshooting
  • See Also

Overview

The steering experiment applies pre-extracted steering vectors to language models during text generation, shifting the model's personality or behavior toward the traits captured in the data used for vector extraction.

The steering process involves:

  1. Loading a model and its tokenizer
  2. Loading steering vectors from a safetensors file
  3. Registering forward hooks on target layers
  4. Applying steering vectors during text generation
  5. Decoding and returning the steered output

Usage

CLI Usage

Basic Command

Apply a steering vector to generate text:

psyctl steering \
  --model "google/gemma-2-2b-it" \
  --steering-vector "./steering_vector/out.safetensors" \
  --input-text "Tell me about yourself"

With Custom Strength

Adjust the steering strength multiplier:

psyctl steering \
  --model "meta-llama/Llama-3.2-3B-Instruct" \
  --steering-vector "./steering_vector/out.safetensors" \
  --input-text "hello world" \
  --strength 1.5

Using Orthogonalized Addition

Apply steering with the orthogonalized addition method:

psyctl steering \
  --model "google/gemma-3-270m-it" \
  --steering-vector "./steering_vector/out.safetensors" \
  --input-text "hello" \
  --orthogonal \
  --strength 2.0

Command-Line Options

  • --model: Model name or HuggingFace identifier (required)
  • --steering-vector: Path to steering vector file (.safetensors) (required)
  • --input-text: Input text for generation (required)
  • --strength: Steering strength multiplier (default: 1.0)
  • --max-tokens: Maximum number of tokens to generate (default: 200)
  • --temperature: Sampling temperature, 0 for greedy (default: 1.0)
  • --top-p: Top-p (nucleus) sampling parameter (default: 0.9)
  • --top-k: Top-k sampling parameter (default: 50)
  • --orthogonal: Use orthogonalized addition method
  • --verbose: Log the full prompt after chat template application

Python Code Usage

You can use the SteeringApplier class directly in Python code with flexible input options.

Basic Example (Using model_name)

from pathlib import Path
from psyctl.core.steering_applier import SteeringApplier

# Initialize applier
applier = SteeringApplier()

# Apply steering with model_name
result = applier.apply_steering(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/out.safetensors"),
    input_text="Tell me about yourself",
    strength=1.5
)

print(result)

Using Persistent Steering (Python API Only - Most Efficient for Multiple Generations)

The get_steering_applied_model() method returns a model with steering hooks already attached. This is the most efficient way to generate multiple outputs with the same steering configuration:

from pathlib import Path
from psyctl.core.steering_applier import SteeringApplier

# Initialize applier
applier = SteeringApplier()

# Get model with steering hooks attached
model, tokenizer = applier.get_steering_applied_model(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/out.safetensors"),
    strength=2.0,
    orthogonal=True
)

# Use the model multiple times - hooks remain active
test_inputs = ["Hello", "How are you?", "What's your opinion?"]

for prompt in test_inputs:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, use_cache=False)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Input: {prompt}")
    print(f"Output: {result}\n")

# Remove steering hooks when done
model.remove_steering()

Using Pre-loaded Model (Efficient for Multiple Generations)

from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer
from psyctl.core.steering_applier import SteeringApplier

# Load model and tokenizer once
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")

# Initialize applier
applier = SteeringApplier()

# Apply steering multiple times with different inputs/strengths
# No need to reload the model each time!
test_inputs = [
    "Hello, how are you?",
    "Tell me about yourself",
    "What is your opinion on AI?"
]

for input_text in test_inputs:
    result = applier.apply_steering(
        model=model,
        tokenizer=tokenizer,
        steering_vector_path=Path("./steering_vector/out.safetensors"),
        input_text=input_text,
        strength=1.5
    )
    print(f"Input: {input_text}")
    print(f"Output: {result}\n")

Experimenting with Different Strengths

from pathlib import Path
from transformers import AutoModelForCausalLM, AutoTokenizer
from psyctl.core.steering_applier import SteeringApplier

# Load model once
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")

applier = SteeringApplier()
input_text = "Hello, how are you?"

# Test different steering strengths efficiently
for strength in [0.5, 1.0, 1.5, 2.0, 2.5]:
    result = applier.apply_steering(
        model=model,
        tokenizer=tokenizer,
        steering_vector_path=Path("./steering_vector/rudeness.safetensors"),
        input_text=input_text,
        strength=strength
    )
    print(f"Strength {strength}: {result}\n")

Using Orthogonalized Addition in Python

from pathlib import Path
from psyctl.core.steering_applier import SteeringApplier

applier = SteeringApplier()

# Apply with orthogonalized addition method
result = applier.apply_steering(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/out.safetensors"),
    input_text="What is your personality like?",
    strength=2.0,
    orthogonal=True,  # Enable orthogonalized addition
    temperature=0.7
)

print(result)

Using Verbose Logging

Enable verbose logging to see the full prompt after chat template application:

from pathlib import Path
from psyctl.core.steering_applier import SteeringApplier

applier = SteeringApplier()

# Enable verbose to log the full formatted prompt
result = applier.apply_steering(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/out.safetensors"),
    input_text="Hello",
    strength=1.5,
    verbose=True  # Logs full prompt with chat template
)

Using Per-Layer Strength (Python API Only)

Control steering strength individually for each layer:

from pathlib import Path
from psyctl.core.steering_applier import SteeringApplier

applier = SteeringApplier()

# Apply different strengths to different layers
result = applier.apply_steering(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/multi_layer.safetensors"),
    input_text="Tell me about yourself",
    strength={
        "model.layers[10].mlp.down_proj": 1.0,
        "model.layers[13].mlp.down_proj": 2.5,
        "model.layers[16].mlp.down_proj": 1.5,
        # Layers not specified will use default strength of 1.0
    }
)

print(result)

You can also use per-layer strength with get_steering_applied_model():

# Get model with per-layer steering
model, tokenizer = applier.get_steering_applied_model(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./steering_vector/multi_layer.safetensors"),
    strength={
        "model.layers[13].mlp.down_proj": 3.0,  # Strong on this layer
        # Other layers use default 1.0
    },
    orthogonal=True
)

# Generate multiple outputs with this configuration
for prompt in ["Hello", "How are you?"]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, use_cache=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

model.remove_steering()

Steering Parameters

Strength

The strength parameter controls how strongly the steering vector affects the model.

Uniform Strength (float):

Apply the same strength to all layers:

  • 0.0: No steering (baseline model behavior)
  • 1.0: Default steering strength
  • 1.5-2.0: Strong steering (useful when the target trait is too subtle at the default strength)
  • >2.0: Very strong steering (may produce extreme or degraded outputs)

CLI Example:

# Subtle steering
psyctl steering --model "google/gemma-3-270m-it" \
  --steering-vector "./vector.safetensors" \
  --input-text "What is your opinion?" \
  --strength 0.5

# Strong steering
psyctl steering --model "google/gemma-3-270m-it" \
  --steering-vector "./vector.safetensors" \
  --input-text "What is your opinion?" \
  --strength 2.5

Per-Layer Strength (Dict[str, float] - Python API Only):

Control strength for each layer individually:

# Dictionary mapping layer names to strength values
strength = {
    "model.layers[10].mlp.down_proj": 1.0,   # Mild steering
    "model.layers[13].mlp.down_proj": 2.5,   # Strong steering
    "model.layers[16].mlp.down_proj": 1.5,   # Moderate steering
    # Layers not in dict will use default strength of 1.0
}

result = applier.apply_steering(
    model_name="google/gemma-3-270m-it",
    steering_vector_path=Path("./vector.safetensors"),
    input_text="What is your opinion?",
    strength=strength
)

Benefits of per-layer strength:

  • Fine-grained control over steering behavior
  • Can emphasize or de-emphasize specific layers
  • Useful for experimenting with layer-specific effects
  • Layers not specified in the dict automatically use default strength (1.0)

Temperature

Controls randomness in text generation:

  • 0.0: Greedy decoding (deterministic)
  • 0.5-0.8: More focused and coherent
  • 1.0: Balanced sampling (default)
  • >1.0: More creative and diverse

Example:

# Deterministic output
psyctl steering --model "google/gemma-3-270m-it" \
  --steering-vector "./vector.safetensors" \
  --input-text "hello" \
  --temperature 0.0

# Creative output
psyctl steering --model "google/gemma-3-270m-it" \
  --steering-vector "./vector.safetensors" \
  --input-text "hello" \
  --temperature 1.5

Top-p and Top-k

Fine-tune sampling behavior:

  • --top-p: Nucleus sampling threshold (0.0-1.0)
  • --top-k: Number of top tokens to consider

Example:

psyctl steering --model "google/gemma-3-270m-it" \
  --steering-vector "./vector.safetensors" \
  --input-text "hello" \
  --top-p 0.95 \
  --top-k 100

Steering Methods

Simple Addition (Default)

The default method adds the steering vector to model activations after the prompt:

output[prompt_length:] = output[prompt_length:] + strength * steering_vector

This is the standard CAA (Contrastive Activation Addition) method.
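As a minimal PyTorch sketch of this operation (variable names are illustrative, not psyctl's internals), assuming output has shape (batch, seq_len, hidden_dim) and steering_vector has shape (hidden_dim,):

import torch

def simple_addition(
    output: torch.Tensor,           # (batch, seq_len, hidden_dim)
    steering_vector: torch.Tensor,  # (hidden_dim,)
    strength: float,
    prompt_length: int,
) -> torch.Tensor:
    steered = output.clone()
    # Broadcast the vector over batch and generated positions only
    steered[:, prompt_length:, :] += strength * steering_vector
    return steered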

Orthogonalized Addition

The --orthogonal flag enables the orthogonalized addition method:

  1. Calculate the projection of the output onto the steering vector direction
  2. Remove the existing component along that direction
  3. Add the scaled steering vector

norm_steer = steering_vector / ||steering_vector||
proj = (output · norm_steer) * norm_steer
output[prompt_length:] = (output[prompt_length:] - proj) + strength * steering_vector

This method orthogonalizes the output with respect to the steering direction before applying the steering vector, providing more controlled modification of model behavior.
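A minimal PyTorch sketch of the projection step, using the same illustrative naming as the simple-addition sketch above (not psyctl's internals):

import torch

def orthogonal_addition(
    output: torch.Tensor,           # (batch, seq_len, hidden_dim)
    steering_vector: torch.Tensor,  # (hidden_dim,)
    strength: float,
    prompt_length: int,
) -> torch.Tensor:
    norm_steer = steering_vector / steering_vector.norm()
    steered = output.clone()
    gen = steered[:, prompt_length:, :]
    # Per-token dot product with the unit steering direction
    coeff = (gen * norm_steer).sum(dim=-1, keepdim=True)
    proj = coeff * norm_steer
    steered[:, prompt_length:, :] = gen - proj + strength * steering_vector
    return steered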

When to use:

  • When steering effects are too strong or unpredictable with simple addition
  • When you want more precise control over steering magnitude
  • When combining multiple steering vectors to avoid interference
  • When fine-tuning steering strength for subtle personality changes

Example:

psyctl steering \
  --model "meta-llama/Llama-3.2-3B-Instruct" \
  --steering-vector "./vector.safetensors" \
  --input-text "Describe your personality" \
  --orthogonal \
  --strength 1.5

Examples

Example 1: Extroversion Steering

# Extract extroversion steering vector (prerequisite)
psyctl extract.steering \
  --model "google/gemma-3-270m-it" \
  --layer "model.layers[13].mlp.down_proj" \
  --dataset "./dataset/extroversion" \
  --output "./vectors/extroversion.safetensors"

# Apply with moderate strength
psyctl steering \
  --model "google/gemma-3-270m-it" \
  --steering-vector "./vectors/extroversion.safetensors" \
  --input-text "Tell me about your weekend plans" \
  --strength 1.2

Example 2: Multiple Personalities

# Extract multi-layer steering vector
psyctl extract.steering \
  --model "meta-llama/Llama-3.2-3B-Instruct" \
  --layers "model.layers[13].mlp.down_proj,model.layers[14].mlp.down_proj" \
  --dataset "./dataset/agreeableness" \
  --output "./vectors/agreeable_multi.safetensors"

# Apply with orthogonalized addition
psyctl steering \
  --model "meta-llama/Llama-3.2-3B-Instruct" \
  --steering-vector "./vectors/agreeable_multi.safetensors" \
  --input-text "What do you think about helping others?" \
  --orthogonal \
  --strength 2.0

Example 3: Comparing Strengths

Test different steering strengths on the same input:

# No steering (baseline)
psyctl steering \
  --model "google/gemma-3-270m-it" \
  --steering-vector "./vectors/neuroticism.safetensors" \
  --input-text "I got a bad grade on my test" \
  --strength 0.0

# Mild steering
psyctl steering \
  --model "google/gemma-3-270m-it" \
  --steering-vector "./vectors/neuroticism.safetensors" \
  --input-text "I got a bad grade on my test" \
  --strength 0.8

# Strong steering
psyctl steering \
  --model "google/gemma-3-270m-it" \
  --steering-vector "./vectors/neuroticism.safetensors" \
  --input-text "I got a bad grade on my test" \
  --strength 2.0

Advanced Usage

Chat Template Handling

The steering command automatically detects and applies chat templates for instruction-tuned models:

# For models with chat templates (Llama, Gemma, etc.), the input is
# automatically wrapped in the model's template. Gemma, for example,
# formats it as:
# <bos><start_of_turn>user
# Your input text<end_of_turn>
# <start_of_turn>model

psyctl steering \
  --model "meta-llama/Llama-3.2-3B-Instruct" \
  --steering-vector "./vectors/out.safetensors" \
  --input-text "hello"

For base models without chat templates, the raw input text is used.
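If you need to reproduce this formatting yourself, the standard transformers chat-template API produces the same kind of prompt (a sketch; psyctl's exact internals may differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
messages = [{"role": "user", "content": "hello"}]

# Render the template as a string, ending with the assistant turn header
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)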

Multi-Layer Steering

When a steering vector file contains multiple layers, all layers are automatically applied:

# Extract from multiple layers
psyctl extract.steering \
  --model "google/gemma-3-270m-it" \
  --layers "model.layers[10].mlp.down_proj,model.layers[13].mlp.down_proj,model.layers[16].mlp.down_proj" \
  --dataset "./dataset/caa" \
  --output "./vectors/multi.safetensors"

# Apply to all layers at once
psyctl steering \
  --model "google/gemma-3-270m-it" \
  --steering-vector "./vectors/multi.safetensors" \
  --input-text "hello" \
  --strength 1.5
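To check which layers a vector file contains before applying it, you can inspect the safetensors keys directly (a sketch, assuming the file stores one tensor per target layer, keyed by layer path):

from safetensors.torch import load_file

vectors = load_file("./vectors/multi.safetensors")
for layer_name, vector in vectors.items():
    print(layer_name, tuple(vector.shape))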

Greedy vs Sampling

For reproducible results, use greedy decoding:

psyctl steering \
  --model "google/gemma-3-270m-it" \
  --steering-vector "./vectors/out.safetensors" \
  --input-text "Tell me a story" \
  --temperature 0.0

For creative outputs, use higher temperature with sampling:

psyctl steering \
  --model "google/gemma-3-270m-it" \
  --steering-vector "./vectors/out.safetensors" \
  --input-text "Tell me a story" \
  --temperature 1.2 \
  --top-p 0.95

Technical Details

Hook Implementation

The steering mechanism uses PyTorch forward hooks registered on target layers. The hook function:

  1. Receives layer output (batch_size, sequence_length, hidden_dim)
  2. Applies steering only to tokens after the prompt
  3. Returns modified output in the same format

Code reference: src/psyctl/core/steering_applier.py:_make_steering_hook()
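A simplified sketch of what such a hook factory might look like (illustrative only; see the file referenced above for the actual implementation):

import torch

def make_steering_hook(steering_vector: torch.Tensor, strength: float,
                       prompt_length: int):
    """Create a forward hook that steers only the generated positions."""
    def hook(module, inputs, output):
        # Some modules return tuples; hidden states are the first element
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden.clone()
        steered[:, prompt_length:, :] += strength * steering_vector
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook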

Prompt Length Tracking

The system tracks prompt length to ensure steering is only applied to generated tokens, not the input prompt. This prevents distorting the input context.

Special case: Internally, setting prompt_length=0 applies steering to all tokens, including the prompt (BiPO-style), though this is not exposed via the CLI.

Memory Management

Hooks are automatically cleaned up after generation using try/finally blocks to prevent memory leaks.
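The same pattern is easy to apply when registering hooks manually; this sketch reuses the hypothetical make_steering_hook from the previous section and assumes target_layer, model, inputs, and steering_vector are already defined:

# Register the hook, then guarantee removal even if generation raises
handle = target_layer.register_forward_hook(
    make_steering_hook(steering_vector, strength=1.5,
                       prompt_length=inputs["input_ids"].shape[1])
)
try:
    outputs = model.generate(**inputs, max_new_tokens=50, use_cache=False)
finally:
    handle.remove()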

Troubleshooting

Issue: Steering has no effect

Solution:

  • Increase --strength parameter
  • Try --orthogonal flag for orthogonalized addition method
  • Verify steering vector was extracted from the same model
  • Check that layer paths match between extraction and application

Issue: Output is too extreme

Solution:

  • Decrease --strength parameter (try 0.5-1.0)
  • Use --orthogonal flag for more controlled steering
  • Lower --temperature for more focused output

Issue: Model uses too much memory

Solution:

  • Use a smaller model (e.g., gemma-3-270m-it instead of gemma-3-27b-it)
  • Reduce --max-tokens parameter
  • Note that steering disables the KV cache (use_cache=False), which increases memory use during generation

See Also