The WIDI (Write It Do It) analogy is perfect. Language is fundamentally lossy for spatial relationships—describing "rotate the blue shape 90° clockwise then place it adjacent to the red object's upper-left corner" requires the listener to mentally reconstruct spatial state that was immediately obvious visually. LLMs lack a "mind's eye" to run forward/backward simulations. They chain breadcrumbs of language to approximate the whole, rather than perceiving the whole and decomposing as needed.
There are two levels of energy minimization:
- Within-task: Local ↔ global attention across scales in grid space
- Across-demos: Consistency of inferred rules across the 2-4 demonstration pairs
This is distinct from standard TTT, which just adapts parameters. The energy function should explicitly enforce rule coherence across demonstrations.
The vision system doesn't just perceive—it prunes the hypothesis space by instantly recognizing structural constraints (symmetry, repetition, containment) that would take many tokens to describe linguistically.
Objective: Get VARC running, understand the codebase, establish baselines.
- Clone github.com/lillian039/VARC
- Reproduce VARC baseline on ConceptARC subset (pick 4 concept groups: rotation, reflection, color_mapping, scaling)
- Instrument attention visualization (extract per-layer attention maps)
- Document: What's the architecture? How does TTT work? What's the data pipeline?
Deliverable: Working VARC reproduction with attention visualization on ConceptARC subset.
Go/No-Go: If we can't reproduce within 10% of the reported accuracy, debug before proceeding.
Objective: Implement parallel local-global attention, test if it improves over single-scale.
Architecture change:
Current VARC:
Canvas → Patch(16x16) → ViT blocks → Output
Proposed:
Canvas → [Patch(4x4), Patch(8x8), Patch(16x16)] →
[Local-ViT, Mid-ViT, Global-ViT] →
Cross-scale attention fusion →
Output
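A minimal PyTorch sketch of the proposed path, under simplifying assumptions: a single-channel canvas tensor, no positional embeddings, and fully independent per-scale encoders (the initial variant planned for Day 2 below). ScalePatchEmbed and MultiScaleEncoder are hypothetical names, not VARC modules.

```python
import torch
import torch.nn as nn

class ScalePatchEmbed(nn.Module):
    """Patch-embed the canvas at one patch size (hypothetical module, not VARC's)."""
    def __init__(self, patch, dim, in_ch=1):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, C, H, W)
        tokens = self.proj(x)                      # (B, dim, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)   # (B, N_p, dim)

class MultiScaleEncoder(nn.Module):
    """Three independent ViT stacks, one per patch size; token sequences are
    concatenated before the output head (positional embeddings omitted)."""
    def __init__(self, dim=256, depth=4, heads=8, patches=(4, 8, 16)):
        super().__init__()
        self.embeds = nn.ModuleList(ScalePatchEmbed(p, dim) for p in patches)
        self.encoders = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True),
                depth,
            )
            for _ in patches
        )

    def forward(self, x):
        # Process each scale independently, then concatenate token sequences.
        per_scale = [enc(emb(x)) for emb, enc in zip(self.embeds, self.encoders)]
        return torch.cat(per_scale, dim=1)         # (B, N_4 + N_8 + N_16, dim)

# Example: a 64x64 canvas gives 256 + 64 + 16 = 336 tokens of width 256.
print(MultiScaleEncoder()(torch.randn(2, 1, 64, 64)).shape)  # (2, 336, 256)
```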
Day 2:
- Implement multi-scale patch embedding (3 scales: 4x4, 8x8, 16x16)
- Initially: process each scale independently, concatenate before output head
- Train on ConceptARC subset, compare to baseline
Day 3:
- Add cross-scale attention: each scale can attend to tokens from the other scales (late-fusion sketch after this list)
- Two variants: (a) late fusion (after all ViT blocks), (b) interleaved (every N blocks)
- Compare both variants
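A sketch of late-fusion variant (a), assuming per-scale token sequences like those produced by the encoder sketch above; CrossScaleFusion and its scale embedding are hypothetical names.

```python
import torch
import torch.nn as nn

class CrossScaleFusion(nn.Module):
    """Late-fusion variant (a): one joint attention pass over the concatenated
    token sequences of all scales, so every scale can attend to the others."""
    def __init__(self, dim=256, heads=8, num_scales=3):
        super().__init__()
        self.scale_embed = nn.Embedding(num_scales, dim)   # marks token origin
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, per_scale_tokens):           # list of (B, N_s, dim) tensors
        scale_ids = torch.cat([
            torch.full((t.shape[1],), i, dtype=torch.long, device=t.device)
            for i, t in enumerate(per_scale_tokens)
        ])
        x = torch.cat(per_scale_tokens, dim=1) + self.scale_embed(scale_ids)
        fused, weights = self.attn(x, x, x, need_weights=True)
        # 'weights' (B, N_total, N_total) can be sliced by scale_ids to check
        # whether scales specialize (the Day 4-5 visualization).
        return self.norm(x + fused), weights
```

The interleaved variant (b) would insert the same block every N encoder layers instead of applying it once after all ViT blocks.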
Day 4-5:
- Ablation: which scales matter most for which concept types?
- Visualize: do different scales attend to different structures?
- Document findings
Metrics:
- ConceptARC accuracy per concept group
- Attention entropy (are scales specializing?)
- Compute overhead vs. baseline
Checkpoint criteria (Day 5):
- Multi-scale ≥ baseline accuracy (even if marginal)
- Evidence of scale specialization in attention patterns
- If multi-scale is significantly worse, pivot to Phase 1b (investigate why)
Objective: Replace single forward pass with iterative energy minimization in embedding space.
Core idea: Instead of predicting output directly, the model iteratively refines a "draft" output by minimizing an energy function that scores (input, output) compatibility.
Architecture:
Energy function E(z_input, z_output | demos):
- z_input, z_output are embeddings from the multi-scale ViT
- Lower energy = more compatible transformation
Inference:
1. Initialize z_output randomly (or from single forward pass)
2. For t = 1 to T:
z_output ← z_output - η * ∇_{z_output} E(z_input, z_output | demos)
3. Decode z_output to discrete grid
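A minimal sketch of steps 1-3, assuming a hypothetical energy_fn(z_input, z_output, demos) that returns a scalar and a hypothetical decoder back to a discrete grid; only the output embedding is updated, and the network weights stay frozen at inference time.

```python
import torch

def refine_output(energy_fn, z_input, z_output_init, demos, steps=10, lr=0.1):
    """Steps 1-3 above: gradient descent on the output embedding only, with the
    energy network's weights held fixed at inference time."""
    z_output = z_output_init.detach().clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(z_input, z_output, demos)          # scalar E
        grad, = torch.autograd.grad(energy, z_output)
        with torch.no_grad():
            z_output -= lr * grad                             # z <- z - eta * dE/dz
    return z_output.detach()

# Warm start (planned for Day 7): initialize from the direct prediction instead
# of noise, then compare accuracy with and without refinement.
# z_init = direct_predictor(z_input)      # hypothetical VARC forward pass
# grid = decode(refine_output(energy_fn, z_input, z_init, demos))  # hypothetical decoder
```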
Day 6:
- Define the energy function: E = -similarity(f(z_input), z_output), where f is a learned transformation
- Start simple: E is just negative cosine similarity between transformed input embedding and output embedding
- Train the energy function on demo pairs: E should be low for correct (input, output) pairs and high for corrupted pairs
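A sketch of this Day 6 starting point: the negative-cosine energy with a learned transformation f, trained with a hinge loss that pushes correct pairs below corrupted ones. The margin value and the corruption scheme (shuffling outputs within the batch) are assumptions, not fixed by the plan.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleEnergy(nn.Module):
    """E(z_in, z_out) = -cos(f(z_in), z_out) with a small learned transformation f."""
    def __init__(self, dim=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, z_in, z_out):                # pooled embeddings, (B, dim)
        return -F.cosine_similarity(self.f(z_in), z_out, dim=-1)   # (B,)

def energy_train_step(energy, z_in, z_out, optimizer, margin=0.5):
    """Hinge loss: correct pairs should sit at least `margin` below corrupted
    pairs, where corruption here is a shuffle of outputs within the batch."""
    e_pos = energy(z_in, z_out)
    e_neg = energy(z_in, z_out[torch.randperm(z_out.shape[0], device=z_out.device)])
    loss = F.relu(margin + e_pos - e_neg).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```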
Day 7:
- Implement gradient-based refinement at inference time
- Use VARC's direct prediction as initialization for z_output (warm start)
- Compare: direct prediction vs. direct + N refinement steps
Day 8:
- Add a cross-demo consistency term to the energy:
E_total = Σ_demos E(input_i, output_i) + λ * Σ_{i,j} ||rule_i - rule_j||²
where rule_i is extracted by attending over the (input_i, output_i) pair
- This is the "second-order attention" idea—rules should be consistent across demos
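A sketch of E_total as written above, assuming a hypothetical rule_extractor that cross-attends over one (input, output) embedding pair and returns a rule vector; the consistency sum runs over unordered demo pairs.

```python
import torch

def total_energy(energy, rule_extractor, demo_embeds, lam=0.1):
    """E_total = sum_i E(in_i, out_i) + lambda * sum_{i<j} ||rule_i - rule_j||^2.
    demo_embeds: list of (z_in, z_out) pooled embedding pairs, one per demo."""
    e_fit = sum(energy(z_in, z_out).sum() for z_in, z_out in demo_embeds)
    rules = torch.stack([rule_extractor(z_in, z_out) for z_in, z_out in demo_embeds])
    n = rules.shape[0]
    e_consist = sum(
        ((rules[i] - rules[j]) ** 2).sum()
        for i in range(n) for j in range(i + 1, n)
    )
    return e_fit + lam * e_consist
```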
Day 9-10:
- Tune: number of refinement steps, step size, consistency weight λ
- Compare to VARC + TTT baseline
- Visualize: how does the output embedding trajectory change over refinement steps?
Metrics:
- Accuracy vs. refinement steps (diminishing returns curve)
- Energy landscape visualization (are there spurious minima?)
- Consistency score across demos (does it actually improve?)
Checkpoint criteria (Day 10):
- Iterative refinement improves accuracy OR provides useful interpretability
- Cross-demo consistency term has measurable effect
- If neither, document why and consider pivoting
Objective: Adapt energy function parameters to each task using demonstrations.
Key difference from standard TTT: We're not just fine-tuning a prediction head. We're adapting the energy landscape so that the correct transformation has lower energy.
Day 11:
- Implement leave-one-out TTT on the energy function (loop sketched after this list):
- Given demos [(in1,out1), (in2,out2), (in3,out3)]
- Train energy on 2 demos, evaluate on held-out demo
- Rotate and average
- What to adapt: (a) full energy network, (b) LoRA adapters, (c) task embedding vector
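A sketch of the leave-one-out loop for variant (a), adapting a deep copy of the full energy network; embed is a hypothetical frozen encoder, and only the positive-pair energy is minimized here for brevity (in practice the corrupted-pair term from Day 6 would be kept to avoid a degenerate landscape).

```python
import copy
import torch

def leave_one_out_ttt(energy, embed, demos, steps=20, lr=1e-4):
    """Adapt a copy of the energy network on all-but-one demo, score it on the
    held-out demo, rotate over k, and average."""
    held_out = []
    for k in range(len(demos)):
        pairs = [(embed(x).detach(), embed(y).detach())
                 for i, (x, y) in enumerate(demos) if i != k]
        adapted = copy.deepcopy(energy)            # variant (a): full network
        opt = torch.optim.Adam(adapted.parameters(), lr=lr)
        for _ in range(steps):
            loss = sum(adapted(zx, zy).sum() for zx, zy in pairs)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            x_k, y_k = demos[k]
            held_out.append(adapted(embed(x_k), embed(y_k)).sum().item())
    return sum(held_out) / len(held_out)   # lower = better generalization proxy
```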
Day 12:
- Compare adaptation strategies
- Measure: how many gradient steps are needed before adaptation helps?
- Is there overfitting to specific demos?
Day 13-14:
- Combine multi-scale + iterative refinement + TTT
- Final evaluation on full ConceptARC (16 concepts × 10 tasks)
- Comparison to VARC baseline (54.5% reported)
Metrics:
- Accuracy with vs. without TTT
- Adaptation efficiency (steps to convergence)
- Generalization: does adapting on 2 demos help on the 3rd?
Objective: Understand what's working, why, and prepare for Yilun meeting + potential workshop submission.
Day 15-16:
- Per-concept breakdown: which concepts benefit most from multi-scale? From iterative refinement? From TTT?
- Failure case analysis: what's still failing and why?
- Attention visualization: can we tell a story about what the model "sees"?
Day 17-18:
- Ablation table: contribution of each component
- Comparison to other baselines (if time permits: LPN, GFlowNet-ARC)
Day 19-21:
- Draft 4-page workshop paper: "Multi-Scale Energy Attention for Visual Abstract Reasoning"
- Prepare questions/results for Yilun meeting
| Day | Checkpoint | Go Criteria | No-Go Action |
|---|---|---|---|
| 1 | VARC reproduction | Within 10% of reported | Debug or use their checkpoint |
| 5 | Multi-scale attention | ≥ baseline, scale specialization observed | Investigate why; may not need multi-scale |
| 10 | Iterative refinement | Accuracy gain OR interpretability | Drop iterative, focus on multi-scale + TTT |
| 14 | Full pipeline | Competitive with VARC | Still publishable as ablation study |
- On energy composition: "I'm summing energies across scales: E_total = E_local + E_mid + E_global. Is additive composition principled when scales have different receptive fields, or should I use a product (E_total = ∏ E_scale)?" (See the log-space note after these questions.)
- On cross-demo consistency: "I want to add a term that penalizes inconsistent rules across demos. Mathematically, I'm thinking ||rule_i - rule_j||² where rule is extracted from cross-attention between input/output. Is there a more principled way to enforce this from an energy perspective?"
- On iterative vs. direct: "VARC does direct prediction. I want to do gradient descent in embedding space instead. Should the energy function be the same network as the forward predictor, or a separate one? What's the relationship to your IRED work?"
- On publication scope: "Given 3 weeks of experiments, is ConceptARC sufficient for a workshop paper? Or do I need full ARC-AGI-1 results?"
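Note on the first question, for reference: in a Boltzmann reading, additive energies correspond to a product of per-scale distributions, p ∝ ∏_s exp(-E_s) = exp(-Σ_s E_s), i.e. a product of experts; a raw product ∏_s E_s has no such probabilistic reading and can change sign whenever a single E_s is negative.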
High risk: The iterative energy minimization may be slower than direct prediction without any accuracy gain.
→ Mitigation: Start with VARC's direct prediction, add iterative refinement as an optional module, and compare.
Medium risk: Multi-scale attention increases the parameter count without a proportional gain.
→ Mitigation: Use shared weights across scales and learn only a scale-specific modulation.
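One way to realize this mitigation, as a sketch: a single transformer block shared by every scale, with only a FiLM-style per-scale gain and shift learned separately (the module name and the modulation choice are assumptions).

```python
import torch
import torch.nn as nn

class SharedScaleBlock(nn.Module):
    """One transformer block shared by every scale; each scale only learns a
    cheap FiLM-style gain and shift over the token features."""
    def __init__(self, dim=256, heads=8, num_scales=3):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.gamma = nn.Parameter(torch.ones(num_scales, dim))    # per-scale gain
        self.beta = nn.Parameter(torch.zeros(num_scales, dim))    # per-scale shift

    def forward(self, tokens, scale_idx):          # tokens: (B, N, dim)
        return self.block(tokens * self.gamma[scale_idx] + self.beta[scale_idx])
```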
Low risk: Even without SOTA performance, this is at least a valid ablation study of VARC + compositional energy, publishable as a workshop paper.