Z HUNT: The Recipe

What we're looking for

A language-agnostic reasoning subspace Z inside Qwen2.5-3B's hidden states at Layer 33. The hypothesis: when the model reasons about math, it encodes the problem into a compressed subspace Z that is structurally independent of the input language. Chinese 加法 and English addition activate the same Z dimensions; the language-specific information lives in Z⊥.

Why we believe it's there (structural evidence from 1.py + 2.py)

| Finding | Source | Implication |
|---|---|---|
| L33 attention heads collapse to eff rank ~78 (mean 80.4, std 6.0) | 1.py | L33 operates in a ~78-dim subspace: a compressed bottleneck |
| All 16 heads converge to same rank at L33 (CV < 0.08) | 1.py | Heads agree on what matters at this depth: maximally organized |
| FFN alignment at L33: 0.57× chance (actively avoids attention) | 2.py | FFN and attention operate in near-ORTHOGONAL subspaces at the bottleneck |
| FFN alignment std = 0.0001 across heads at L33 | 2.py | Not a fluke: FFN systematically avoids attention's subspace |
| L32↔L33 similarity = 0.482 (2× any other pair) | 2.py | L32 is the approach layer: convergence happens over L32-L33 |
| L33→L34 snap: biggest phase boundary in entire network | 2.py | L34 breaks out of the compressed representation: decode begins |
| 1-33-2 architecture: encode (L0), compute (L1-33), decode (L34-35) | 1.py+2.py | 92% of depth is "middle" computation |
| Attention rank ↓ monotone, FFN rank ↑ monotone | 1.py | They move in opposite directions at every layer: complementary roles |
| W_V eff rank constant ~250 across all layers (CV = 0.02) | 1.py | The messages have fixed format; only routing changes |
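
The effective-rank and CV figures in the table come from `1.py` and `utils.py`, whose definitions are not reproduced here. A minimal sketch, assuming the common entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular values); the real `eff_rank` may use a different definition, and `coeff_of_variation` is a hypothetical helper name:

```python
import numpy as np

def eff_rank(W: np.ndarray) -> float:
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s).

    Assumed definition; utils.py may compute this differently.
    """
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                      # drop exact zeros so log(p) is defined
    return float(np.exp(-(p * np.log(p)).sum()))

def coeff_of_variation(x: np.ndarray) -> float:
    """CV = std / mean, the agreement statistic quoted in the table."""
    return float(np.std(x) / np.mean(x))
```

As a consistency check on the first two rows: with the reported mean 80.4 and std 6.0, CV = 6.0 / 80.4 ≈ 0.075, which agrees with CV < 0.08.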

What distinguishes us from the NeurIPS 2025 paper (arxiv 2505.15257)

They find language-specific directions via data-driven SVD (mean activations per language → SVD to find M_s). They ablate M_s at inference and find middle layers work best.

Our approach is structurally motivated:

| Aspect | NeurIPS 2505.15257 | Our approach |
|---|---|---|
| Z construction | Data-driven: language centroids → SVD | Weight-based: attention kernel SVD at bottleneck |
| Why this layer? | Empirical sweep (try all layers) | Structural prediction: L33 has lowest eff rank |
| Why Z is clean | Not addressed | FFN orthogonality: 0.57× chance at L33 |
| Data required | Parallel multilingual corpus | NONE: Z derived from weights alone |
| Theoretical backing | Ablation → improved reasoning | Information bottleneck + Grassmann geometry |
| Architecture insight | None (black box optimization) | 1-33-2 structure, encode-compute-decode |

Key novelty: If Phase 2 confirms the separation, we can predict WHERE the reasoning subspace lives without any multilingual data, using static weight properties alone. The NeurIPS paper has to run activation extraction + SVD to discover it empirically.

The experiment sequence

Phase 0: Behavioral Verification (script: `phase0_behavioral.py`)

Question: Does Qwen2.5-3B actually show Chinese >> English on math?

Run: `MPLBACKEND=Agg .venv_wsl/bin/python phase0_behavioral.py`

Time: ~10 min (15 problems × 2 langs × ~20s per generation)

Contingencies:

Chinese > English by ≥3 problems: PROCEED. Strong behavioral evidence.
Chinese = English (±1): PROCEED ANYWAY. The 3B base model may not show behavioral asymmetry even if Z exists. The structural evidence is strong enough. Flag this as a limitation in any writeup.
English > Chinese: INVESTIGATE prompt formatting. The few-shot exemplars may be biased. Try without few-shot. If still English-dominant, this is genuinely surprising — check if model is actually Qwen2.5-3B (not a renamed English model).
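
The three contingencies above can be written as a single decision rule. This is a sketch only: the function name and return labels are hypothetical, a one-problem English lead is treated as a tie (the ±1 rule takes precedence), and a Chinese lead of exactly 2 is not covered by the written rules, so it is returned as its own case.

```python
def phase0_decision(zh_correct: int, en_correct: int) -> str:
    """Map Phase 0 scores (problems solved per language) to a contingency label."""
    diff = zh_correct - en_correct
    if diff >= 3:
        return "proceed"                    # strong behavioral evidence
    if abs(diff) <= 1:
        return "proceed_with_limitation"    # tie within ±1: flag in writeup
    if diff < 0:
        return "investigate_prompts"        # English ahead by 2 or more
    return "borderline"                     # diff == 2: not specified above
```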

Phase 2: Activation Extraction + Z Projection (script: `phase2_z_extraction.py`)

Question: Does the L33 attention SVD subspace separate language from reasoning?

Run: `MPLBACKEND=Agg .venv_wsl/bin/python phase2_z_extraction.py`

Time: ~20 min (20 problems × 2 langs, no generation, just forward pass)

What to look for:

ratio_Z < ratio_Zp across most configurations → Z captures reasoning, Z⊥ captures language
energy_frac_Z < k/d for cross-lingual deltas → language difference lives OUTSIDE Z (good!)
energy_frac_Z > k/d for same-language different-problem deltas → reasoning variation IS in Z
Multi-head mask ≥ head0 mask → multi-head averaging is more robust
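
The quantities in this list are produced by `phase2_z_extraction.py`, and their exact definitions are not given here. The sketch below shows one plausible reading, stated as an assumption: each ratio is the mean cross-lingual distance (same problem, different language) divided by the mean cross-problem distance (same language, different problem), measured inside Z and inside Z⊥. `separation_stats` and the array layout of `H` are hypothetical.

```python
import numpy as np

def separation_stats(H: np.ndarray, V: np.ndarray) -> dict:
    """H: (n_problems, 2, d) L33 activations, axis 1 = language (0=zh, 1=en).
    V: (k, d) with orthonormal rows spanning the candidate Z."""
    n, _, d = H.shape
    k = V.shape[0]
    # Cross-lingual deltas: same problem, different language.
    d_lang = H[:, 0, :] - H[:, 1, :]
    # Cross-problem deltas: same language, different problem (all pairs i < j).
    i, j = np.triu_indices(n, k=1)
    d_prob = np.concatenate([H[i, 0] - H[j, 0], H[i, 1] - H[j, 1]], axis=0)

    def energies(D):
        in_z = np.sum((D @ V.T) ** 2, axis=1)        # ||P_Z delta||^2
        total = np.sum(D ** 2, axis=1)               # ||delta||^2
        return in_z, np.clip(total - in_z, 0.0, None), total

    lz, lzp, lt = energies(d_lang)
    pz, pzp, pt = energies(d_prob)
    return {
        "ratio_Z": float(np.sqrt(lz).mean() / np.sqrt(pz).mean()),
        "ratio_Zp": float(np.sqrt(lzp).mean() / np.sqrt(pzp).mean()),
        "energy_frac_Z_lang": float((lz / lt).mean()),   # expect < k/d
        "energy_frac_Z_prob": float((pz / pt).mean()),   # expect > k/d
        "chance": k / d,
    }
```

Under this reading, a small ratio_Z means that translating a problem barely moves its Z coordinates compared with switching to a different problem.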

Contingencies:

Clear separation (ratio_Z << ratio_Zp for all k): SVD mask works! Skip ARD-MMD entirely. Go to Phase 3 (patching).
Partial separation (some k work, others don't): Note which k works best. The k≈78 (matching eff rank) should be strongest. If k=20 works but k=78 doesn't, the reasoning subspace is even MORE compressed than the eff rank suggests.
No separation (ratio_Z ≈ ratio_Zp everywhere): Three sub-contingencies:
- a) Try the NeurIPS 2505.15257 approach: activation-based SVD (data-driven) instead of weight-based SVD. This is a direct comparison.
- b) Try L32 instead of L33 — the "approach" layer may have cleaner separation.
- c) Fall back to ARD-MMD on ~10 layers (Phase 3 in Gameplan_v3).

Phase 3: Activation Patching (if Phase 2 succeeds)

Question: Does patching Z at L33 transfer reasoning between languages?

Design:

1. Run the Chinese math problem → extract h_zh at L33
2. Run the English version of the same problem → extract h_en at L33
3. Project both onto Z and Z⊥: h_zh_Z, h_zh_Zp, h_en_Z, h_en_Zp
4. Patch: replace h_en_Z with h_zh_Z (swap reasoning, keep language)
5. Continue the forward pass → does the model now answer the English problem using Chinese reasoning?
6. Control: swap Z⊥ instead (should break language, not reasoning)

Success criteria: Patching Z changes the ANSWER without changing the output LANGUAGE. Patching Z⊥ changes the LANGUAGE without changing the ANSWER. An answer change is only observable when the unpatched Chinese and English runs give different answers, so score the patch on those pairs (for example, problems the model solves in Chinese but misses in English).
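
The swap itself is a small piece of linear algebra. Here is a minimal sketch on plain vectors, assuming V holds an orthonormal basis of Z as rows; `patch_subspace` is a hypothetical helper, and wiring it into the model's forward pass at L33 is not shown.

```python
import numpy as np

def patch_subspace(h_recipient, h_donor, V, patch_complement=False):
    """Swap one component of the recipient's hidden state for the donor's.

    V: (k, d) with orthonormal rows spanning Z.
    patch_complement=False swaps the Z component (main experiment).
    patch_complement=True swaps the Z-perp component (control).
    """
    z_rec = V.T @ (V @ h_recipient)   # P_Z h_recipient
    z_don = V.T @ (V @ h_donor)       # P_Z h_donor
    if patch_complement:
        # Keep the recipient's Z part, take the donor's complement.
        return z_rec + (h_donor - z_don)
    # Keep the recipient's complement, take the donor's Z part.
    return (h_recipient - z_rec) + z_don
```

With h_en as recipient and h_zh as donor, the default call gives a vector whose Z coordinates equal those of h_zh and whose Z⊥ part equals that of h_en.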

Phase 3b: Bridge (tiny model, if patching works)

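Question: Can we learn a linear map within Z that translates between languages?

Design: Linear regression from Z(Chinese, problem_i) to Z(English, problem_i) for all 20 problems.

Size: [78, 78] = 6,084 parameters. With only 20 pairs the fit is heavily underdetermined (each of the 78 output dimensions has 78 weights but only 20 equations), so an unregularized map will interpolate the training pairs exactly. Use ridge regression or a low-rank constraint, and treat in-sample fit as uninformative.

Success criteria: Bridge predicts held-out Z vectors (leave-one-out cross-validation).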
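
A minimal sketch of the bridge fit and its leave-one-out check, assuming ridge regression as the estimator (the recipe only says "linear regression"). The function names and the `lam` parameter are hypothetical.

```python
import numpy as np

def fit_bridge(Z_src, Z_tgt, lam=1.0):
    """Ridge solution W minimizing ||Z_src W - Z_tgt||^2 + lam ||W||^2.

    Z_src, Z_tgt: (n_pairs, k) Z coordinates for the two languages.
    """
    k = Z_src.shape[1]
    A = Z_src.T @ Z_src + lam * np.eye(k)
    return np.linalg.solve(A, Z_src.T @ Z_tgt)

def loo_relative_error(Z_src, Z_tgt, lam=1.0):
    """Mean relative error on each pair when it is held out of the fit."""
    n = Z_src.shape[0]
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        W = fit_bridge(Z_src[keep], Z_tgt[keep], lam)
        pred = Z_src[i] @ W
        errs.append(np.linalg.norm(pred - Z_tgt[i]) / np.linalg.norm(Z_tgt[i]))
    return float(np.mean(errs))
```

With 19 training pairs in 78 dimensions, a weakly regularized fit reproduces the training pairs almost exactly whatever the data, so only the held-out error says anything about whether a shared linear map exists. Sweeping `lam`, or restricting the map to a few leading directions of Z, are the obvious knobs.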

Math verification

Grassmann similarity metric (from utils.py)

Definition: sim(V1, V2) = mean(σ²) where σ = svdvals(V1 @ V2.T), for V1, V2 of shape [k, d] with orthonormal rows
Equivalent to: (1/k) ||V1 @ V2.T||_F², the mean squared cosine of the principal angles
Range: [0, 1]. 1 = identical subspaces, 0 = orthogonal
Relationship to chordal distance: d_ch² = k(1 - sim), so sim is an affine function of the squared Grassmann chordal distance.
Caveat: sim is a similarity, and 1 - sim is proportional to a squared distance, so neither satisfies the triangle inequality; d_ch = sqrt(k(1 - sim)) does. We only use sim for pairwise comparisons, so this is fine.
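
The definitions above can be sketched in a few lines of NumPy. The function names are hypothetical and need not match what `utils.py` exports; both inputs are assumed to be [k, d] arrays with orthonormal rows and the same k.

```python
import numpy as np

def subspace_sim(V1: np.ndarray, V2: np.ndarray) -> float:
    """Mean squared cosine of the principal angles between two subspaces."""
    sigma = np.linalg.svd(V1 @ V2.T, compute_uv=False)
    return float(np.mean(sigma ** 2))

def chordal_distance(V1: np.ndarray, V2: np.ndarray) -> float:
    """Grassmann chordal distance, from d_ch^2 = k * (1 - sim)."""
    k = V1.shape[0]
    # max(..., 0) guards against tiny negative values from round-off.
    return float(np.sqrt(max(k * (1.0 - subspace_sim(V1, V2)), 0.0)))
```

Since the sum of squared singular values equals the squared Frobenius norm, `subspace_sim` agrees with `np.sum((V1 @ V2.T) ** 2) / k`, which avoids the SVD when only the similarity is needed.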

Projection math (from phase2)

P_z = Vh[:k,:].T @ Vh[:k,:] — orthogonal projector because Vh has orthonormal rows (SVD guarantee)
||P_z(h1 - h2)|| = ||Z_mask(h1 - h2)||, with Z_mask = Vh[:k,:] — distance in R^d via projector equals distance in R^k via coordinates
Random baseline: For random k-dim subspace Z, E[||P_Z δ||²] = (k/d)||δ||². For k=78, d=2048: 3.8%.
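
These three statements can be checked numerically. In this sketch a random matrix stands in for the attention kernel (an assumption made only so the snippet is self-contained). The baseline is estimated by fixing Z and drawing random isotropic δ, which gives the same expectation as fixing δ and drawing a random subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2048, 78

# Stand-in for the attention-kernel SVD: any matrix with d columns
# yields a Vh with orthonormal rows.
M = rng.normal(size=(128, d))
_, _, Vh = np.linalg.svd(M, full_matrices=False)

Z_mask = Vh[:k, :]            # (k, d): maps h to its k coordinates in Z
P_z = Z_mask.T @ Z_mask       # (d, d): orthogonal projector onto Z

delta = rng.normal(size=d)    # stands in for h1 - h2
norm_via_projector = np.linalg.norm(P_z @ delta)
norm_via_coords = np.linalg.norm(Z_mask @ delta)

# Random baseline: mean fraction of energy that lands in Z.
deltas = rng.normal(size=(2000, d))
frac = np.sum((deltas @ Z_mask.T) ** 2, axis=1) / np.sum(deltas ** 2, axis=1)
baseline = frac.mean()        # should sit close to k / d = 78 / 2048
```

In practice the coordinate form `Z_mask @ delta` is the cheaper of the two, since it never materializes the d × d projector.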

FFN alignment metric (from 2.py)

Project W_gate's top-k SVD vectors onto attention subspace
Measure energy ratio: ||P_attn V_gate||²_F / ||V_gate||²_F
At L33: 0.0056 (chance = k/d = 0.0098). Ratio = 0.57× — actively BELOW chance
This means FFN's gate directions are systematically avoiding the attention kernel's subspace
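
A sketch of the energy ratio, assuming both subspaces are given as [k, d] arrays with orthonormal rows (the top-k right singular vectors of the attention kernel and of W_gate respectively). The function name is hypothetical and `2.py` may organize the computation differently.

```python
import numpy as np

def ffn_alignment(V_attn: np.ndarray, V_gate: np.ndarray) -> float:
    """||P_attn V_gate||_F^2 / ||V_gate||_F^2 with P_attn = V_attn.T @ V_attn.

    Because V_attn has orthonormal rows, the norm of each projected gate
    vector equals the norm of its coordinates in the attention basis.
    """
    coords = V_gate @ V_attn.T            # (k_gate, k_attn)
    return float(np.sum(coords ** 2) / np.sum(V_gate ** 2))
```

For a random attention subspace of dimension k the expected value is k/d. The quoted chance level 0.0098 is consistent with k = 20 at d = 2048 (20/2048 ≈ 0.0098), and 0.0056 / 0.0098 ≈ 0.57 reproduces the ratio above.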

Information-theoretic framing

L33 as information bottleneck: layers 1-33 progressively compress the input into a minimal sufficient statistic for the task
IB theory predicts: I(Z_L; X) decreases with depth, I(Z_L; Y) increases until convergence
Our observation: effective rank of the attention weights drops with depth (a proxy for compression, not a direct measurement of I(Z_L; X)) while semantic function is preserved
The FFN orthogonality at L33 suggests SEPARATE channels: attention carries the bottleneck representation (Z), FFN carries the complement (language, surface form, etc.)
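Formal claim: at L33, the residual stream decomposes as h = P_Z(h) + P_Z⊥(h). The decomposition itself holds for any orthogonal projector; the testable content is the interpretation of the two parts, namely P_Z(h) ≈ minimal sufficient statistic for the math answer and P_Z⊥(h) ≈ language-specific encoding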

Cross-disciplinary connections

Philosophy of mind: Analogous to the "language of thought" hypothesis. Fodor (1975) proposed that cognition operates in an amodal representation independent of natural language. Z would be a transformer analogue.
Cognitive science: Bilingual training studies (Spelke & Tsivkin, 2001) found that approximate arithmetic transfers across languages, while exact arithmetic facts are retrieved faster in the language of training. Related imaging work (Dehaene et al., 1999) associates approximate arithmetic with bilateral parietal regions around the intraparietal sulcus, so the analogy is cleanest for approximate magnitude reasoning.
Compression theory: Rate-distortion theory predicts that optimal compression produces representations that strip task-irrelevant information (language) while preserving task-relevant information (math structure).
Linear representation hypothesis: Recent work (Park et al., 2023) argues that many concepts in LLMs are encoded as linear directions. Our Z is a candidate linear subspace for "reasoning-relevant" concepts at the bottleneck layer.

What would make this publishable

Minimal viable paper (estimated 45-55% chance):

1. Structural analysis confirming L33 bottleneck + FFN orthogonality (DONE)
2. Z subspace separates language from reasoning in activation space (Phase 2)
3. Comparison with NeurIPS 2505.15257 approach on same model/data

Strong paper (estimated 25-35% chance):

All of minimal, plus:

4. Patching experiments showing causal role of Z
5. Generalization to at least one other model (e.g., Qwen3-8B, LLaMA-3-8B)
6. Theoretical connection to information bottleneck formalism

Transformative paper (estimated 5-10% chance):

All of strong, plus:

7. Universal bottleneck detection algorithm (predict which layer for any model)
8. Bridge that enables cross-lingual transfer at inference time
9. Evidence that Z captures not just math but ALL reasoning

Files on this machine

| File | Status | Purpose |
|---|---|---|
| `1.py` | DONE (run) | Effective rank analysis across all layers |
| `2.py` | DONE (run) | Subspace overlap, FFN alignment, convergence |
| `utils.py` | DONE | Shared functions: eff_rank, subspace sim, etc. |
| `phase0_behavioral.py` | READY TO RUN | Chinese vs English math scoring |
| `phase2_z_extraction.py` | READY TO RUN | Activation extraction + Z projection |
| `Gameplan_v3.md` | Reference | Full gameplan with time estimates |
| `Gameplan.md` | Reference | Original ARD-MMD approach (backup) |
| `output/one_output.md` | Results | 1.py analysis report |
| `output/two_output.md` | Results | 2.py analysis report |
| `output/*.npy` | Data | Saved numpy arrays from 2.py |
| `.venv_wsl/` | Environment | Python 3.12, torch 2.6.0+cu124, transformers |

Run commands

# Activate environment
source .venv_wsl/bin/activate

# Phase 0
MPLBACKEND=Agg python phase0_behavioral.py

# Phase 2
MPLBACKEND=Agg python phase2_z_extraction.py

Decision tree for tomorrow

START
  ↓
Run Phase 0
  ↓
Chinese >> English? ─── NO ──→ Flag limitation, proceed anyway
  ↓ YES                        (structural evidence > behavioral)
  ↓
Run Phase 2
  ↓
ratio_Z < ratio_Zp? ─── NO ──→ Try: (a) data-driven SVD
  ↓ YES                              (b) L32 instead of L33
  ↓                                  (c) ARD-MMD backup
  ↓
SVD mask works!
  ↓
Design patching experiment (Phase 3)
  ↓
Patching transfers reasoning? ─── NO ──→ Z is correlational, not causal
  ↓ YES                                   Still publishable as structural finding
  ↓
GOLD: Causal Z subspace identified
  ↓
Generalize to second model
