Model: Qwen2.5-3B (Qwen/Qwen2.5-3B), d=2048, L=36 layers, h=16 query heads, 2 KV heads (GQA), d_head=128.
| Finding | Source | Numbers |
|---|---|---|
| L33 attention heads collapse to eff rank ~78 (CV < 0.08) | 1.py | mean 80.4, std 6.0 |
| FFN actively avoids attention subspace at L32–L33 | 2.py | 0.57× chance, std 0.0001 |
| L32↔L33 similarity = 0.482 (2× any other pair) | 2.py | multi-head confirmed |
| L33→L34 sharpest phase boundary in network | 2.py | drop 0.452 |
| Z separates better than Z⊥ for mean-pooling (12/12 configs) | Phase 2 | ratio_Z < ratio_Zp |
| Language energy in Z below random at k=20–50 | Phase 2 | 80–86% of k/d |
| Effect vanishes at k=78 | Phase 2 | E_Z ≈ E_rand |
| Last-token pooling reverses the effect | Phase 2 | ratio_Z > ratio_Zp |
| Best config: L32, multi-head, k=50, mean-pool | Phase 2 | ratio_Z=0.730, ratio_Zp=0.824 |
| No behavioral zh>en asymmetry at 3B | Phase 0 | 12 zh, 13 en |
Primary question: Is Z causal? Does replacing Z-content at L32 change model behavior in a predictable, directional way?
Operationally, "done" looks like this table, filled in for N ≥ 15 problems:
| Problem | Original output (en) | Z-patched output (en←zh) | Z⊥-patched output (en←zh) | Full-patched (all dims) |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
With these derived metrics:
- Z-patch answer change rate: fraction of problems where Z-patching changes the final numerical answer
- Z⊥-patch answer change rate: same for Z⊥
- Z-patch language preservation rate: fraction where output language remains English after Z-patch
- Z⊥-patch language disruption rate: fraction where output language shifts after Z⊥-patch
- Full-patch comparison: does full replacement behave more like Z-patch or Z⊥-patch?
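A minimal sketch of how the derived metrics could be computed once the table is filled in. The condition labels (`"none"`, `"z"`, `"zperp"`, `"full"`), the `Run` record, and the choice to score the full-patch comparison by answer agreement are all assumptions made for illustration, not part of the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    answer: Optional[str]  # extracted numerical answer, None if none was found
    language: str          # "en", "zh", or "mixed"


def derived_metrics(problems):
    """problems: list of dicts mapping a condition label to a Run.

    Hypothetical labels: "none" (baseline), "z", "zperp", "full".
    """
    n = len(problems)

    def rate(pred):
        return sum(1 for p in problems if pred(p)) / n

    return {
        "z_answer_change": rate(lambda p: p["z"].answer != p["none"].answer),
        "zperp_answer_change": rate(lambda p: p["zperp"].answer != p["none"].answer),
        "z_language_preserved": rate(lambda p: p["z"].language == "en"),
        "zperp_language_disrupted": rate(
            lambda p: p["zperp"].language != p["none"].language
        ),
        # Assumed operationalization of "behaves more like": answer agreement.
        "full_matches_z": rate(lambda p: p["full"].answer == p["z"].answer),
        "full_matches_zperp": rate(lambda p: p["full"].answer == p["zperp"].answer),
    }
```

Comparing `full_matches_z` against `full_matches_zperp` gives one crude reading of the full-patch question; other readings (for example, language agreement) would be equally defensible.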
Success (double dissociation):
- Z-patch changes answers but preserves language
- Z⊥-patch changes/disrupts language but preserves answers
Partial success:
- Z-patch changes answers more than Z⊥-patch does (even without clean language effects)
Null:
- No difference between Z-patch and Z⊥-patch effects
Let P = {(p_zh^i, p_en^i)}_{i=1}^N be N paired math problems.
For problem i, define:
- h_zh^i(ℓ) ∈ ℝ^{T_zh × d} — hidden states at layer ℓ from the Chinese prompt
- h_en^i(ℓ) ∈ ℝ^{T_en × d} — hidden states at layer ℓ from the English prompt
where T_zh, T_en are sequence lengths (generally T_zh ≠ T_en).
Reuse build_multi_head_z_mask from Phase 2. The mask is:
V_Z ∈ ℝ^{k × d} — top-k right singular vectors of the stacked multi-head attention kernel at layer ℓ
Orthogonal projectors:
- P_Z = V_Z^T V_Z ∈ ℝ^{d × d} (onto Z)
- P_{Z⊥} = I_d - P_Z (onto Z-complement)
Use k = 50 at layer 32 (best Phase 2 config). Also run k=20 as a secondary check.
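The projector definitions can be sanity-checked numerically. The sketch below assumes `V_Z` arrives as a (k, d) array with orthonormal rows, which is what right singular vectors should give; the `projectors` helper and the random stand-in basis are illustrative and do not reproduce `build_multi_head_z_mask`.

```python
import numpy as np


def projectors(V_Z):
    """V_Z: (k, d) with orthonormal rows. Returns (P_Z, P_Zperp), each (d, d)."""
    k, d = V_Z.shape
    # V_Z^T V_Z is only a projector if the rows are orthonormal.
    if not np.allclose(V_Z @ V_Z.T, np.eye(k), atol=1e-5):
        raise ValueError("rows of V_Z must be orthonormal")
    P_Z = V_Z.T @ V_Z
    return P_Z, np.eye(d) - P_Z


# Stand-in basis: top-k right singular vectors of a random matrix.
rng = np.random.default_rng(0)
d, k = 64, 8
_, _, Vt = np.linalg.svd(rng.standard_normal((d, d)))
P_Z, P_Zperp = projectors(Vt[:k])

# A projector is idempotent, and its trace equals the subspace dimension.
print(np.allclose(P_Z @ P_Z, P_Z), round(float(np.trace(P_Z))))
```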
Phase 2 showed mean-pooling works, last-token doesn't. For patching, we need a d-dimensional vector to inject. Define:
z̄_zh^i = (1/T_zh) Σ_t h_zh^i(ℓ)[t,:] ∈ ℝ^d — mean-pooled Chinese hidden state
The Z-component of this: z̄_zh^i|_Z = P_Z · z̄_zh^i ∈ ℝ^d
Z-patch: During English forward pass at layer ℓ, for each token position t:
h_en^i(ℓ)[t,:] ← P_{Z⊥} · h_en^i(ℓ)[t,:] + z̄_zh^i|_Z
Keep the English token's Z⊥-content (language). Replace its Z-content with the Chinese mean Z-content.
Z⊥-patch (control): Same structure, swap roles:
h_en^i(ℓ)[t,:] ← P_Z · h_en^i(ℓ)[t,:] + z̄_zh^i|_{Z⊥}
Keep reasoning (Z). Replace language scaffold (Z⊥) with Chinese.
Full-patch: Replace everything:
h_en^i(ℓ)[t,:] ← z̄_zh^i
No-patch baseline: Run English prompt unmodified.
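The four conditions can be written as one function over per-token hidden states. This is a sketch of the arithmetic only, on NumPy arrays; wiring it into the model (for example, through a forward hook at layer 32) is not shown, and the `patch_hidden` name and `mode` labels are hypothetical.

```python
import numpy as np


def patch_hidden(h_en, h_zh, V_Z, mode):
    """Apply one patching condition.

    h_en: (T_en, d) English hidden states at the patched layer.
    h_zh: (T_zh, d) Chinese hidden states at the same layer.
    V_Z:  (k, d) orthonormal rows spanning Z.
    mode: "none", "z", "zperp", or "full" (hypothetical labels).
    """
    d = V_Z.shape[1]
    P_Z = V_Z.T @ V_Z
    P_perp = np.eye(d) - P_Z
    z_bar = h_zh.mean(axis=0)  # mean-pooled Chinese state, shape (d,)

    if mode == "none":
        return h_en.copy()
    if mode == "z":
        # Keep each token's Z-complement content, inject the Chinese Z content.
        return h_en @ P_perp + P_Z @ z_bar
    if mode == "zperp":
        # Keep each token's Z content, inject the Chinese Z-complement content.
        return h_en @ P_Z + P_perp @ z_bar
    if mode == "full":
        return np.broadcast_to(z_bar, h_en.shape).copy()
    raise ValueError(f"unknown mode: {mode}")
```

Because the projectors are symmetric, `h_en @ P_perp` projects every row at once, and the pooled vector broadcasts across all T_en positions.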
For each of the 4 conditions × N problems, record:
- Raw output (full generated text, ≤ 100 tokens)
- Extracted numerical answer (regex or manual)
- Output language (classify as en/zh/mixed — can use a simple heuristic: fraction of CJK characters)
- Answer correctness (compare to known answer)
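A rough sketch of the two automatic fields, using only the standard library. The thresholds `lo` and `hi`, the restriction to the CJK Unified Ideographs block, and the "last number wins" rule are assumptions for illustration; manual review would still be needed for ambiguous outputs.

```python
import re


def cjk_fraction(text):
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if "\u4e00" <= c <= "\u9fff") / len(chars)


def classify_language(text, lo=0.1, hi=0.5):
    """Return "en", "zh", or "mixed"; lo and hi are illustrative thresholds."""
    frac = cjk_fraction(text)
    if frac >= hi:
        return "zh"
    if frac <= lo:
        return "en"
    return "mixed"


_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(text):
    """Return the last number in the output as a string, or None.

    Commas are stripped first so "1,250" parses as one number; this is crude
    and will also merge comma-separated digits with no space between them.
    """
    matches = _NUMBER.findall(text.replace(",", ""))
    return matches[-1] if matches else None
```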
Layer choice. Primary: L32. Secondary: L33. The spec says to run both and compare. If results differ, that's informative — L32 is the "approach" layer, L33 is the bottleneck. Different patching effects would tell you something about the encode→bottleneck transition.
How to handle the mean-pooled vector. The spec above broadcasts the mean Chinese Z-content to all English token positions. An alternative: per-token patching using optimal transport alignment between Chinese and English token sequences. Specifically:
Let C_{st} = ||h_zh^i(ℓ)[s,:] - h_en^i(ℓ)[t,:]||² be the cost matrix. Solve the OT problem:
π* = argmin_{π ∈ Π(a,b)} Σ_{st} π_{st} C_{st}
where a = (1/T_zh)·1, b = (1/T_en)·1 are uniform marginals. Then patch token t with the transport-weighted combination:
z_source(t) = P_Z · Σ_s (π*_{st} / b_t) · h_zh^i(ℓ)[s,:]
This is mathematically cleaner but harder to implement. Either approach is acceptable. If the mean-broadcast works, OT is unnecessary. If mean-broadcast shows no effect, try OT before declaring null.
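If OT alignment is tried, one low-effort route is entropic regularization with Sinkhorn iterations. Note that this swaps the exact OT problem above for a regularized approximation; the `reg` value, the iteration count, and the cost rescaling are assumptions, and a dedicated OT library could replace `sinkhorn_plan` entirely.

```python
import numpy as np


def sinkhorn_plan(C, reg=0.05, n_iter=500):
    """Entropic-regularized transport plan with uniform marginals.

    C: (S, T) cost matrix. Returns pi of shape (S, T) whose column sums
    equal 1/T exactly and whose row sums approach 1/S as iterations converge.
    """
    S, T = C.shape
    a = np.full(S, 1.0 / S)
    b = np.full(T, 1.0 / T)
    scale = C.max() or 1.0             # rescale costs to avoid underflow in exp
    K = np.exp(-C / (reg * scale))
    v = np.ones(T)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)              # ending on v makes column marginals exact
    return u[:, None] * K * v[None, :]


def ot_z_source(h_zh, h_en, V_Z, reg=0.05):
    """Per-token Z source: z_source(t) = P_Z * sum_s (pi[s, t] / b_t) * h_zh[s]."""
    # C[s, t] = squared Euclidean distance between Chinese token s and English token t.
    C = ((h_zh[:, None, :] - h_en[None, :, :]) ** 2).sum(axis=-1)
    pi = sinkhorn_plan(C, reg)
    b_t = 1.0 / h_en.shape[0]
    mixed = (pi.T / b_t) @ h_zh        # (T_en, d), convex combinations of zh tokens
    return mixed @ (V_Z.T @ V_Z)       # project each row onto Z
```

Row t of the result would replace the broadcast vector z̄_zh^i|_Z in the Z-patch for English token t.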
The k parameter. Run k ∈ {20, 50}. Not 78 (Phase 2 showed signal vanishes there).
Prompt set. Reuse the 20 pairs from Phase 2, or write new ones. If reusing, the results are directly comparable to Phase 2's distance metrics. If writing new ones, you get independence from Phase 2 data. Either is fine — the Phase 2 prompts are clean and cover 10 categories.
This is ~20 lines of code piggybacked on the activation extraction from Experiment A. It answers: what is each layer doing — stripping language, computing reasoning, or both?
For each layer transition k → k+1 and each prompt, the residual update is:
Δh_k = h(k+1) - h(k) ∈ ℝ^{T × d}
Mean-pool across tokens:
Δh̄_k = (1/T) Σ_t Δh_k[t,:] ∈ ℝ^d
Project onto Z and Z⊥ using L32's basis (fixed — the "Rosetta Stone"):
||Δh̄_k|_Z|| = ||P_Z · Δh̄_k||
||Δh̄_k|_{Z⊥}|| = ||P_{Z⊥} · Δh̄_k||
For each layer k, averaged across all prompts and both languages:
R(k) = ||Δh̄_k|_Z|| / ||Δh̄_k|_{Z⊥}||
One plot: R(k) vs layer index k, for k = 0, 1, ..., 34.
Interpretation:
- R(k) >> 1 at some layer → that layer is primarily computing in Z (reasoning step)
- R(k) << 1 → that layer is primarily modifying Z⊥ (language processing)
- R(k) ≈ 1 everywhere → reasoning and language processing are always interleaved, never separated
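The per-prompt computation behind R(k) can be sketched as follows, assuming the hidden states for one prompt are stacked into a single array with one slice per layer boundary. The `update_ratio` name and the array layout are assumptions; averaging across prompts and languages happens outside the function.

```python
import numpy as np


def update_ratio(hidden, V_Z):
    """R(k) for one prompt.

    hidden: (n_layers + 1, T, d) hidden states at each layer boundary.
    V_Z:    (k, d) orthonormal rows (the fixed L32 basis).
    Returns an array of shape (n_layers,).
    """
    delta = hidden[1:] - hidden[:-1]           # residual updates, (n, T, d)
    delta_bar = delta.mean(axis=1)             # mean-pool over tokens, (n, d)
    # With orthonormal rows, ||P_Z x|| equals ||V_Z x||.
    z_norm = np.linalg.norm(delta_bar @ V_Z.T, axis=1)
    # Pythagoras: ||P_perp x||^2 = ||x||^2 - ||P_Z x||^2 (clipped for round-off).
    total_sq = (delta_bar ** 2).sum(axis=1)
    perp_norm = np.sqrt(np.maximum(total_sq - z_norm ** 2, 0.0))
    return z_norm / np.maximum(perp_norm, 1e-12)
```

For an isotropic random update the ratio would be roughly sqrt(k / (d - k)), about 0.16 for k=50 and d=2048, so that value, rather than 1, may be a useful reference line on the plot.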
The most profound outcome: If R(k) is NEVER strongly Z-dominated, then Z is emergent — it's the cumulative effect of 30+ layers of mixed computation, not a discrete phase. This would mean "the model doesn't have a reasoning step; it has a reasoning gradient."
Compute R(k) separately for Chinese and English prompts. If R_zh(k) ≠ R_en(k) at specific layers, those layers process the two languages differently. This connects back to the wrapper-cost model from Chat_2.
This is the original Gameplan.md idea, now informed by structural priors. It's independent of Experiments A and B — it discovers Z from data rather than weights.
The ARD-RBF kernel on ℝ^d:
k_ℓ(x, y) = exp(-½ Σ_{j=1}^{d} (x_j - y_j)² / ℓ_j²)
where ℓ = (ℓ_1, ..., ℓ_d) ∈ ℝ_{>0}^d is the lengthscale vector. Each dimension gets its own scale.
Given two empirical measures μ = (1/n) Σ δ_{x_i} and ν = (1/m) Σ δ_{y_j} (token activations from Chinese and English), the unbiased MMD² estimator is:
MMD²(μ, ν; ℓ) = [1/(n(n-1))] Σ_{i≠i'} k_ℓ(x_i, x_{i'}) + [1/(m(m-1))] Σ_{j≠j'} k_ℓ(y_j, y_{j'}) - [2/(nm)] Σ_{i,j} k_ℓ(x_i, y_j)
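The population MMD² is zero iff the two token distributions coincide (the RBF kernel is characteristic); the unbiased estimator above fluctuates around zero in that case and can be slightly negative. It handles T_zh ≠ T_en by construction.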
Minimize total MMD across all prompt pairs with L1 sparsity on inverse lengthscales:
min_{ℓ} Σ_{i=1}^{N} MMD²(μ_zh^i, μ_en^i; ℓ) + λ Σ_{j=1}^{d} ℓ_j^{-1}
Parameterize as log_ℓ_j to ensure positivity. Optimize with Adam.
What the optimization finds:
- Dimensions where ℓ_j → ∞: the kernel ignores these. They carry language-specific information (different between zh/en).
- Dimensions where ℓ_j stays finite: the kernel uses these. They carry shared information across languages. These are Z.
The sparsity penalty λ Σ ℓ_j^{-1} encourages most dimensions to shut off (ℓ → ∞), yielding a compact Z.
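The kernel, estimator, and objective above can be evaluated with a few lines of NumPy. This sketch only computes the objective value; gradients for Adam would come from an autodiff framework, which is not shown. The function names and the `pairs` structure (a list of `(X_zh, X_en)` token-activation arrays) are assumptions.

```python
import numpy as np


def ard_rbf(X, Y, log_ell):
    """k(x, y) = exp(-0.5 * sum_j (x_j - y_j)^2 / ell_j^2) for all row pairs."""
    ell = np.exp(log_ell)                      # log-parameterization keeps ell > 0
    Xs, Ys = X / ell, Y / ell
    sq = (Xs ** 2).sum(1)[:, None] + (Ys ** 2).sum(1)[None, :] - 2.0 * Xs @ Ys.T
    return np.exp(-0.5 * np.maximum(sq, 0.0))  # clip tiny negatives from round-off


def mmd2_unbiased(X, Y, log_ell):
    """Unbiased MMD^2 between clouds X (n, d) and Y (m, d); n and m may differ."""
    n, m = len(X), len(Y)
    Kxx = ard_rbf(X, X, log_ell)
    Kyy = ard_rbf(Y, Y, log_ell)
    Kxy = ard_rbf(X, Y, log_ell)
    xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))  # exclude i == i'
    yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))  # exclude j == j'
    return xx + yy - 2.0 * Kxy.mean()


def objective(log_ell, pairs, lam):
    """Sum of MMD^2 over prompt pairs plus lam * sum_j 1 / ell_j."""
    fit = sum(mmd2_unbiased(X, Y, log_ell) for X, Y in pairs)
    return fit + lam * np.exp(-log_ell).sum()
```

One thing worth monitoring when optimizing: a constant kernel (all ℓ_j → ∞) gives MMD² = 0 and zero penalty, so the objective as written has a degenerate minimum at Z = ∅ regardless of λ. Evaluating `objective` at a large uniform `log_ell` makes this easy to check.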
Instead of initializing all log_ℓ = 0, use the SVD-based Z mask from Phase 2:
V_Z = build_multi_head_z_mask(model, layer=32, ..., k=50)  # (50, 2048), orthonormal rows
# Participation of each coordinate j in the Z subspace: z_energy[j] = Σ_r V_Z[r, j]² = diag(P_Z)[j]
z_energy = (V_Z ** 2).sum(0)  # (2048,), summed over the 50 basis vectors
log_ℓ_init = -log(z_energy + ε)  # elementwise log (e.g., torch.log); small ℓ where Z is active
This gives the optimizer a warm start. The SVD mask says "these ~50 dimensions are structurally important." The ARD optimization can confirm, refine, or contradict that.
| Parameter | Value | Rationale |
|---|---|---|
| λ (sparsity) | Grid search: {0.001, 0.01, 0.1} | λ controls Z size. Too small → all dims active. Too large → Z = ∅. |
| lr | 0.01 | Standard for Adam on log-scale parameters |
| steps | 500–1000 | Monitor convergence; stop when MMD plateaus |
| Target layers | {10, 15, 20, 25, 28, 30, 32, 33, 34, 35} | Structural analysis says these matter |
| d | 2048 | (not 4096 — we're on Qwen2.5-3B) |
- |Z(k)| vs layer: number of dimensions with ℓ_j < threshold at each layer. Use one threshold shared across layers (e.g., the median of converged ℓ values pooled over all layers); a per-layer median would select about d/2 dimensions at every layer and flatten the curve.
- Lengthscale spectrum at L32: histogram of log₁₀(ℓ_j) and sorted curve. Looking for bimodality (clean Z) vs power-law (graded) vs uniform (no Z found).
- Overlap between SVD-Z and ARD-Z: what fraction of ARD's surviving dimensions are also in the SVD mask's top-50? If high → weight-based and data-based approaches agree (strong evidence). If low → the approaches find different subspaces (interesting but complicates the story).
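A possible way to quantify the overlap. ARD selects individual coordinates while the SVD mask is a set of directions, so the sketch converts the SVD mask into coordinates by taking the `top` largest entries of `z_energy`; that conversion, the function names, and the threshold argument are assumptions rather than anything the spec fixes.

```python
import numpy as np


def ard_z_dims(ell, threshold):
    """Coordinates the ARD kernel still uses: lengthscale below the threshold."""
    return np.flatnonzero(ell < threshold)


def svd_z_dims(V_Z, top):
    """Coordinates with the largest participation in the SVD subspace.

    z_energy[j] = sum_r V_Z[r, j]^2 is the j-th diagonal entry of P_Z.
    """
    z_energy = (V_Z ** 2).sum(axis=0)
    return np.argsort(z_energy)[::-1][:top]


def overlap_fraction(ard_dims, svd_dims):
    """Fraction of ARD's surviving coordinates that are also SVD coordinates."""
    if len(ard_dims) == 0:
        return float("nan")
    return len(np.intersect1d(ard_dims, svd_dims)) / len(ard_dims)
```

A subspace-level comparison (for example, the energy of the ARD coordinate indicators under P_Z) would avoid the coordinate conversion and might be worth reporting alongside this number.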
SVD-Z (Experiments A–B) derives Z from model weights alone — no data needed. It's structurally motivated and fast.
ARD-Z (Experiment C) derives Z from cross-lingual activation data — no weight analysis needed. It's statistically principled and self-calibrating.
If they agree, that's powerful: the model's static architecture and its dynamic behavior both point to the same subspace. If they disagree, the disagreement itself is informative — it tells you whether Z is a property of the weights or of the computation.
Only run this if Experiments A or C succeed. It answers: is the mapping between languages within Z a simple linear transform?
Collect paired Z-projected mean activations at L32:
z_zh^i = P_Z · z̄_zh^i ∈ ℝ^d (rank k; its k coordinates in the Z basis are V_Z · z̄_zh^i)
z_en^i = P_Z · z̄_en^i ∈ ℝ^d (likewise, with coordinates V_Z · z̄_en^i)
Stack into matrices:
- Z_zh ∈ ℝ^{N × k} (extracting the k active components)
- Z_en ∈ ℝ^{N × k}
Solve: W* = argmin_W ||Z_zh - Z_en · W||²_F
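Closed form (valid when Z_en has full column rank, which requires N ≥ k): W* = (Z_en^T Z_en)^{-1} Z_en^T Z_zh. Otherwise use the minimum-norm solution W* = Z_en^+ Z_zh (pseudoinverse) or add ridge regularization.
Size: k × k. For k=50: 2,500 parameters. For k=20: 400 parameters. With N=20 pairs the fit is underdetermined at k=50 (Z_en^T Z_en is 50 × 50 with rank ≤ 20, hence singular) and exactly determined at k=20, so the in-sample R² will be ≈ 1 by construction. Overfitting risk is therefore high: treat the leave-one-out R² as the primary metric, or reduce k or increase N.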
- R² = 1 - ||Z_zh - Z_en · W*||²_F / ||Z_zh - Z̄_zh||²_F
  - R² > 0.9 → languages are near-rotations of each other in Z
  - R² > 0.8 → thin wrappers
  - R² < 0.5 → relationship is nonlinear
- Orthogonality error = ||W*^T W* - I_k||_F / k
  - Small → W* is approximately a rotation (isometric languages in Z)
  - Large → W* is a more general linear map (one language uses Z-dims differently)
- SVD of W*: singular value spectrum
  - Flat spectrum → rotation-like
  - Steep decay → some Z-dims matter more than others for the mapping
- Leave-one-out cross-validation R²: can the bridge predict held-out pairs?
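The bridge fit and its diagnostics fit in a short NumPy sketch. It uses `np.linalg.lstsq`, which returns the minimum-norm solution when N < k instead of requiring Z_en^T Z_en to be invertible; the function names are illustrative, and the leave-one-out R² uses the overall column mean of Z_zh as its baseline, which is a simplifying assumption.

```python
import numpy as np


def fit_bridge(Z_en, Z_zh):
    """Least-squares W with Z_zh ≈ Z_en @ W (minimum-norm if underdetermined)."""
    W, *_ = np.linalg.lstsq(Z_en, Z_zh, rcond=None)
    return W


def r_squared(Z_zh, pred):
    ss_res = ((Z_zh - pred) ** 2).sum()
    ss_tot = ((Z_zh - Z_zh.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot


def orthogonality_error(W):
    """||W^T W - I_k||_F / k; near zero when W is close to a rotation."""
    k = W.shape[0]
    return np.linalg.norm(W.T @ W - np.eye(k), "fro") / k


def loo_r_squared(Z_en, Z_zh):
    """Refit with each pair held out and score the held-out predictions."""
    N = len(Z_en)
    preds = np.empty_like(Z_zh)
    for i in range(N):
        keep = np.arange(N) != i
        preds[i] = Z_en[i] @ fit_bridge(Z_en[keep], Z_zh[keep])
    return r_squared(Z_zh, preds)


def singular_spectrum(W):
    """Singular values of W, largest first."""
    return np.linalg.svd(W, compute_uv=False)
```

With fewer pairs than dimensions the in-sample `r_squared` is close to 1 even for unrelated data, so `loo_r_squared` is the number that carries information in that regime.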
| Experiment | Estimated Time | GPU Needed | Dependencies |
|---|---|---|---|
| A (Patching) | 2–3 hours | Yes (forward passes + generation) | Z mask from Phase 2 |
| B (Update decomposition) | 30 min | Piggybacks on A's activation extraction | A's hidden states |
| C (ARD-MMD) | 3–5 hours | Yes (gradient through kernel, 500 steps × 10 layers) | Fresh activations |
| D (Bridge) | 15 min | No (just linear algebra) | A or C must succeed |
Recommended execution order: A+B together (share activations), then D if A works, then C independently.
C is the most expensive but also the most self-contained: apart from the optional SVD warm start, it doesn't need Phases 0–2. If you're short on time, A+B+D is the core contribution. C is the luxury version that provides a second, independent path to Z.
Minimum:
- The patching table (Experiment A) is filled in for ≥15 problems
- R(k) vs layer plot exists (Experiment B)
- You can state: "Z-patching changes answers X% of the time while preserving language Y% of the time, vs Z⊥-patching which changes answers X'% and disrupts language Y'%"
Strong:
- All of minimum, plus:
- ARD lengthscale spectrum exists (Experiment C)
- SVD-Z and ARD-Z overlap is quantified
- Bridge R² exists (Experiment D)
- All of strong, plus:
- Results hold at both L32 and L33
- OT-aligned patching tried (if mean-broadcast was weak)
- R(k) decomposition reveals the encode/reason/decode phases
- Bridge is approximately orthogonal (languages are rotations in Z)