Research pipeline for the paper "What If AI Lived Inside Your Mind? Simulating 'Neural Integration' of Human and AI through Mechanistic Interpretability as Design Provocation".
The project has two main components:
- persona-vectors/ — Generate and evaluate persona vectors using contrastive activation differences
- simulating-neural-integration/ — Steering experiment: simulate, evaluate, and visualize results
Persona vectors are directions in a model's residual stream that correspond to behavioral traits (e.g., deception, empathy, formality). They are computed as the mean activation difference between contrastive system prompts.
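As a minimal sketch of this computation (tensor names and shapes are illustrative, not the repo's actual API):

```python
import torch

def persona_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Mean activation difference between contrastive prompt sets.

    pos_acts / neg_acts: (n_samples, n_layers, hidden_dim) residual-stream
    activations collected under positive / negative system prompts.
    Returns a (n_layers, hidden_dim) persona vector.
    """
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
```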
Available traits: deception, empathy, sycophancy, toxicity, hallucination, formality, funniness, sociality, encouraging
Pre-computed vectors are stored in persona-vectors/evaluation/stored_persona_vectors/.
persona-vectors/generation/
├── generate_prompts.py # Step 1: Generate contrastive prompts for a trait
└── generate_persona_vectors.py # Step 2: Compute persona vector from Llama activations
Step 1 — Generate prompts for a trait (uses Claude API):
cd persona-vectors/generation
python generate_prompts.py --trait deception

Outputs to stored_prompts/{trait}/:
- contrastive_system_prompt.json — 5 pos/neg instruction pairs
- question_generation_prompt.json — 40 elicitation questions
- trait_evaluation_prompt.json — GPT-4 scoring prompt (0–100)
Step 2 — Compute vector (uses Llama + GPT-4 to filter responses):
python generate_persona_vectors.py --trait deception

Runs ~2×5×40×8 = 3200 forward passes. Filters responses by GPT-4 score (≥50 for positive, ≤50 for negative), then saves persona_vectors/{trait}_persona_vector.pt.
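The filtering step can be sketched like this (function name and signature are hypothetical; the actual script may be structured differently):

```python
def filter_responses(responses, scores, positive: bool, threshold: float = 50.0):
    """Keep responses whose GPT-4 trait score passes the threshold:
    >= threshold for the positive set, <= threshold for the negative set."""
    keep = (lambda s: s >= threshold) if positive else (lambda s: s <= threshold)
    return [r for r, s in zip(responses, scores) if keep(s)]
```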
persona-vectors/evaluation/
├── create_scale.py # Find score range (min/max) for each trait
├── create_regression_data.py # Generate synthetic prompts for regression
├── eval_layers_regression.py # Per-layer R² comparison
├── eval_and_graph_regression.py # Linear regression + plots
└── activations_viz.py # Visualize activation values in a vector
Find score scale (needed to normalize scores for the interface):
cd persona-vectors/evaluation
python create_scale.py

Generates 50 synthetic system prompts per trait (extreme positive/negative) and finds the most extreme projection scores. Saves persona_scores_scale.json.
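A hedged sketch of the scoring idea (illustrative helpers, not create_scale.py's exact code): an activation is projected onto the persona vector, and the stored min/max extremes map raw projections into a fixed range:

```python
import torch

def projection_score(activation: torch.Tensor, vector: torch.Tensor) -> float:
    """Scalar projection of a hidden-state vector onto the persona direction."""
    return (activation @ vector / vector.norm()).item()

def normalize(score: float, lo: float, hi: float) -> float:
    """Map a raw projection into [0, 1] using the stored min/max scale."""
    return (score - lo) / (hi - lo)
```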
persona-vectors/modal/
├── chat_api.py # Llama chat endpoint (Modal serverless)
└── persona_score_api.py # Persona scoring endpoint (Modal serverless)
Deploy to Modal:
cd persona-vectors/modal
modal deploy chat_api.py
modal deploy persona_score_api.py

The steering experiment tests whether injecting the persona vector into the residual stream at generation time amplifies or suppresses deceptive behavior, depending on direction.
simulating-neural-integration/
├── generate_test_scenarios.py # Generate test scenarios with Claude
├── test_scenario_scores.py # Score each scenario's baseline polarity
├── sim.py # Main experiment (3 conditions)
├── classify.py # Validate deception vector classification accuracy
├── eval.py # Rate responses with Claude (1–7 scale)
├── summarize_eval.py # Aggregate scores across conditions
└── graph.py # Plot results
1. Generate test scenarios (20 total: 10 deceptive roles, 10 honest roles):
cd simulating-neural-integration
python generate_test_scenarios.py
# → generated_test_scenarios.json

2. Score baseline polarity (finds normalization bias):
python test_scenario_scores.py
# → scenario_scores.json

3. Run experiment (3 conditions: control / steer-same / steer-opposite):
python sim.py
# → results/control.json, results/1.json, results/2.json

Steering is applied at layer 15 with coefficient ±3.0. The deception vector projection is used to detect each scenario's polarity, then:
- control — no steering
- mode 1 — steer in detected direction (amplify)
- mode 2 — steer against detected direction (suppress)
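The steering mechanism can be sketched with a transformer_lens hook (an assumed implementation; sim.py's actual code may differ). The hook adds ±3.0 times the layer-15 persona vector to the residual stream at every position:

```python
import torch

LAYER, COEFF = 15, 3.0

def make_steering_hook(vector: torch.Tensor, sign: float):
    """Return a hook that adds sign * COEFF * vector to the residual stream."""
    def hook(resid, hook):  # resid: (batch, seq, hidden_dim)
        return resid + sign * COEFF * vector
    return hook

# Usage with a HookedTransformer `model` and persona vectors `v` of shape
# (n_layers, hidden_dim); sign=+1.0 amplifies (mode 1), -1.0 suppresses (mode 2):
# with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.hook_resid_post",
#                              make_steering_hook(v[LAYER], +1.0))]):
#     output = model.generate(tokens)
```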
4. Evaluate responses with Claude (5 ratings per response, 1–7 scale):
python eval.py
# → eval/control.json, eval/1.json, eval/2.json

5. Summarize and plot:
python summarize_eval.py # → eval/sum.json
python graph.py         # → graphs/*.png

Output graphs comparing deceptive vs. honest scenarios across conditions are saved to simulating-neural-integration/graphs/.
pip install torch transformer_lens huggingface_hub tqdm scipy matplotlib anthropic

Requires access to meta-llama/Llama-3.2-3B-Instruct on Hugging Face. Set your HF token and API keys in environment variables or .env.
All experiments use Llama-3.2-3B-Instruct loaded via transformer_lens (HookedTransformer), which enables activation caching and residual stream injection. The model has 26 layers; persona vectors are shaped (26, hidden_dim).