The WIDI (Write It Do It) analogy is perfect. Language is fundamentally lossy for spatial relationships—describing "rotate the blue shape 90° clockwise then place it adjacent to the red object's upper-left corner" requires the listener to mentally reconstruct spatial state that was immediately obvious visually. LLMs lack a "mind's eye" to run forward/backward simulations. They chain breadcrumbs of language to approximate the whole, rather than perceiving the whole and decomposing as needed.
There are two levels of energy minimization:
- Within-task: Local ↔ global attention across scales in grid space
- Across-demos: Consistency of inferred rules across the 2-4 demonstration pairs
This is distinct from standard TTT, which just adapts parameters. The energy function should explicitly enforce rule coherence across demonstrations.
The vision system doesn't just perceive—it prunes the hypothesis space by instantly recognizing structural constraints (symmetry, repetition, containment) that would take many tokens to describe linguistically.
Objective: Get VARC running, understand the codebase, establish baselines.
- Clone github.com/lillian039/VARC
- Reproduce VARC baseline on ConceptARC subset (pick 4 concept groups: rotation, reflection, color_mapping, scaling)
- Instrument attention visualization (extract per-layer attention maps)
- Document: What's the architecture? How does TTT work? What's the data pipeline?
Deliverable: Working VARC reproduction with attention visualization on ConceptARC subset.
Go/No-Go: If we can't reproduce within 10% of the reported accuracy, debug before proceeding.
Objective: Implement parallel local-global attention, test if it improves over single-scale.
Architecture change:
Current VARC:
Canvas → Patch(16x16) → ViT blocks → Output
Proposed:
Canvas → [Patch(4x4), Patch(8x8), Patch(16x16)] →
[Local-ViT, Mid-ViT, Global-ViT] →
Cross-scale attention fusion →
Output
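A minimal PyTorch sketch of the proposed path, under simplifying assumptions: a single-channel canvas tensor, no positional embeddings, and fully independent per-scale encoders (the initial variant planned for Day 2 below). ScalePatchEmbed and MultiScaleEncoder are hypothetical names, not VARC modules.

```python
import torch
import torch.nn as nn

class ScalePatchEmbed(nn.Module):
    """Patch-embed the canvas at one patch size (hypothetical module, not VARC's)."""
    def __init__(self, patch, dim, in_ch=1):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # x: (B, C, H, W)
        tokens = self.proj(x)                      # (B, dim, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)   # (B, N_p, dim)

class MultiScaleEncoder(nn.Module):
    """Three independent ViT stacks, one per patch size; token sequences are
    concatenated before the output head (positional embeddings omitted)."""
    def __init__(self, dim=256, depth=4, heads=8, patches=(4, 8, 16)):
        super().__init__()
        self.embeds = nn.ModuleList(ScalePatchEmbed(p, dim) for p in patches)
        self.encoders = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True),
                depth,
            )
            for _ in patches
        )

    def forward(self, x):
        # Process each scale independently, then concatenate token sequences.
        per_scale = [enc(emb(x)) for emb, enc in zip(self.embeds, self.encoders)]
        return torch.cat(per_scale, dim=1)         # (B, N_4 + N_8 + N_16, dim)

# Example: a 64x64 canvas gives 256 + 64 + 16 = 336 tokens of width 256.
print(MultiScaleEncoder()(torch.randn(2, 1, 64, 64)).shape)  # (2, 336, 256)
```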
Day 2:
- Implement multi-scale patch embedding (3 scales: 4x4, 8x8, 16x16)
- Initially: process each scale independently, concatenate before output head
- Train on ConceptARC subset, compare to baseline
Day 3:
- Add cross-scale attention: each scale can attend to tokens from the other scales (late-fusion sketch after this list)
- Two variants: (a) late fusion (after all ViT blocks), (b) interleaved (every N blocks)
- Compare both variants
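A sketch of late-fusion variant (a), assuming per-scale token sequences like those produced by the encoder sketch above; CrossScaleFusion and its scale embedding are hypothetical names.

```python
import torch
import torch.nn as nn

class CrossScaleFusion(nn.Module):
    """Late-fusion variant (a): one joint attention pass over the concatenated
    token sequences of all scales, so every scale can attend to the others."""
    def __init__(self, dim=256, heads=8, num_scales=3):
        super().__init__()
        self.scale_embed = nn.Embedding(num_scales, dim)   # marks token origin
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, per_scale_tokens):           # list of (B, N_s, dim) tensors
        scale_ids = torch.cat([
            torch.full((t.shape[1],), i, dtype=torch.long, device=t.device)
            for i, t in enumerate(per_scale_tokens)
        ])
        x = torch.cat(per_scale_tokens, dim=1) + self.scale_embed(scale_ids)
        fused, weights = self.attn(x, x, x, need_weights=True)
        # 'weights' (B, N_total, N_total) can be sliced by scale_ids to check
        # whether scales specialize (the Day 4-5 visualization).
        return self.norm(x + fused), weights
```

The interleaved variant (b) would insert the same block every N encoder layers instead of applying it once after all ViT blocks.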
Day 4-5:
- Ablation: which scales matter most for which concept types?
- Visualize: do different scales attend to different structures?
- Document findings
Metrics:
- ConceptARC accuracy per concept group
- Attention entropy (are scales specializing?)
- Compute overhead vs. baseline
Checkpoint criteria (Day 5):
- Multi-scale ≥ baseline accuracy (even if marginal)
- Evidence of scale specialization in attention patterns
- If multi-scale is significantly worse, pivot to Phase 1b (investigate why)
Objective: Replace single forward pass with iterative energy minimization in embedding space.
Core idea: Instead of predicting output directly, the model iteratively refines a "draft" output by minimizing an energy function that scores (input, output) compatibility.
Architecture:
Energy function E(z_input, z_output | demos):
- z_input, z_output are embeddings from the multi-scale ViT
- Lower energy = more compatible transformation
Inference:
1. Initialize z_output randomly (or from single forward pass)
2. For t = 1 to T:
z_output ← z_output - η * ∇_{z_output} E(z_input, z_output | demos)
3. Decode z_output to discrete grid
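A minimal sketch of steps 1-3, assuming a hypothetical energy_fn(z_input, z_output, demos) that returns a scalar and a hypothetical decoder back to a discrete grid; only the output embedding is updated, and the network weights stay frozen at inference time.

```python
import torch

def refine_output(energy_fn, z_input, z_output_init, demos, steps=10, lr=0.1):
    """Steps 1-3 above: gradient descent on the output embedding only, with the
    energy network's weights held fixed at inference time."""
    z_output = z_output_init.detach().clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(z_input, z_output, demos)          # scalar E
        grad, = torch.autograd.grad(energy, z_output)
        with torch.no_grad():
            z_output -= lr * grad                             # z <- z - eta * dE/dz
    return z_output.detach()

# Warm start (planned for Day 7): initialize from the direct prediction instead
# of noise, then compare accuracy with and without refinement.
# z_init = direct_predictor(z_input)      # hypothetical VARC forward pass
# grid = decode(refine_output(energy_fn, z_input, z_init, demos))  # hypothetical decoder
```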
Day 6:
- Define the energy function: E = -similarity(f(z_input), z_output), where f is a learned transformation
- Start simple: E is just negative cosine similarity between transformed input embedding and output embedding
- Train the energy function on demo pairs: E should be low for correct (input, output) pairs and high for corrupted pairs
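A sketch of this Day 6 starting point: the negative-cosine energy with a learned transformation f, trained with a hinge loss that pushes correct pairs below corrupted ones. The margin value and the corruption scheme (shuffling outputs within the batch) are assumptions, not fixed by the plan.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleEnergy(nn.Module):
    """E(z_in, z_out) = -cos(f(z_in), z_out) with a small learned transformation f."""
    def __init__(self, dim=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, z_in, z_out):                # pooled embeddings, (B, dim)
        return -F.cosine_similarity(self.f(z_in), z_out, dim=-1)   # (B,)

def energy_train_step(energy, z_in, z_out, optimizer, margin=0.5):
    """Hinge loss: correct pairs should sit at least `margin` below corrupted
    pairs, where corruption here is a shuffle of outputs within the batch."""
    e_pos = energy(z_in, z_out)
    e_neg = energy(z_in, z_out[torch.randperm(z_out.shape[0], device=z_out.device)])
    loss = F.relu(margin + e_pos - e_neg).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```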
Day 7:
- Implement gradient-based refinement at inference time
- Use VARC's direct prediction as initialization for z_output (warm start)
- Compare: direct prediction vs. direct + N refinement steps
Day 8:
- Add a cross-demo consistency term to the energy:
E_total = Σ_demos E(input_i, output_i) + λ * Σ_{i,j} ||rule_i - rule_j||²
where rule_i is extracted by attending over the (input_i, output_i) pair
- This is the "second-order attention" idea—rules should be consistent across demos
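A sketch of E_total as written above, assuming a hypothetical rule_extractor that cross-attends over one (input, output) embedding pair and returns a rule vector; the consistency sum runs over unordered demo pairs.

```python
import torch

def total_energy(energy, rule_extractor, demo_embeds, lam=0.1):
    """E_total = sum_i E(in_i, out_i) + lambda * sum_{i<j} ||rule_i - rule_j||^2.
    demo_embeds: list of (z_in, z_out) pooled embedding pairs, one per demo."""
    e_fit = sum(energy(z_in, z_out).sum() for z_in, z_out in demo_embeds)
    rules = torch.stack([rule_extractor(z_in, z_out) for z_in, z_out in demo_embeds])
    n = rules.shape[0]
    e_consist = sum(
        ((rules[i] - rules[j]) ** 2).sum()
        for i in range(n) for j in range(i + 1, n)
    )
    return e_fit + lam * e_consist
```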
Day 9-10:
- Tune: number of refinement steps, step size, consistency weight λ
- Compare to VARC + TTT baseline
- Visualize: how does the output embedding trajectory change over refinement steps?
Metrics:
- Accuracy vs. refinement steps (diminishing returns curve)
- Energy landscape visualization (are there spurious minima?)
- Consistency score across demos (does it actually improve?)
Checkpoint criteria (Day 10):
- Iterative refinement improves accuracy OR provides useful interpretability
- Cross-demo consistency term has measurable effect
- If neither, document why and consider pivoting
Objective: Adapt energy function parameters to each task using demonstrations.
Key difference from standard TTT: We're not just fine-tuning a prediction head. We're adapting the energy landscape so that the correct transformation has lower energy.
Day 11:
- Implement leave-one-out TTT on the energy function (loop sketched after this list):
- Given demos [(in1,out1), (in2,out2), (in3,out3)]
- Train energy on 2 demos, evaluate on held-out demo
- Rotate and average
- What to adapt: (a) full energy network, (b) LoRA adapters, (c) task embedding vector
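A sketch of the leave-one-out loop for variant (a), adapting a deep copy of the full energy network; embed is a hypothetical frozen encoder, and only the positive-pair energy is minimized here for brevity (in practice the corrupted-pair term from Day 6 would be kept to avoid a degenerate landscape).

```python
import copy
import torch

def leave_one_out_ttt(energy, embed, demos, steps=20, lr=1e-4):
    """Adapt a copy of the energy network on all-but-one demo, score it on the
    held-out demo, rotate over k, and average."""
    held_out = []
    for k in range(len(demos)):
        pairs = [(embed(x).detach(), embed(y).detach())
                 for i, (x, y) in enumerate(demos) if i != k]
        adapted = copy.deepcopy(energy)            # variant (a): full network
        opt = torch.optim.Adam(adapted.parameters(), lr=lr)
        for _ in range(steps):
            loss = sum(adapted(zx, zy).sum() for zx, zy in pairs)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            x_k, y_k = demos[k]
            held_out.append(adapted(embed(x_k), embed(y_k)).sum().item())
    return sum(held_out) / len(held_out)   # lower = better generalization proxy
```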
Day 12:
- Compare adaptation strategies
- Measure: how many gradient steps are needed before adaptation helps?
- Is there overfitting to specific demos?
Day 13-14:
- Combine multi-scale + iterative refinement + TTT
- Final evaluation on full ConceptARC (16 concepts × 10 tasks)
- Comparison to VARC baseline (54.5% reported)
Metrics:
- Accuracy with vs. without TTT
- Adaptation efficiency (steps to convergence)
- Generalization: does adapting on 2 demos help on the 3rd?
Objective: Understand what's working, why, and prepare for Yilun meeting + potential workshop submission.
Day 15-16:
- Per-concept breakdown: which concepts benefit most from multi-scale? From iterative refinement? From TTT?
- Failure case analysis: what's still failing and why?
- Attention visualization: can we tell a story about what the model "sees"?
Day 17-18:
- Ablation table: contribution of each component
- Comparison to other baselines (if time permits: LPN, GFlowNet-ARC)
Day 19-21:
- Draft 4-page workshop paper: "Multi-Scale Energy Attention for Visual Abstract Reasoning"
- Prepare questions/results for Yilun meeting
| Day | Checkpoint | Go Criteria | No-Go Action |
|---|---|---|---|
| 1 | VARC reproduction | Within 10% of reported | Debug or use their checkpoint |
| 5 | Multi-scale attention | ≥ baseline, scale specialization observed | Investigate why; may not need multi-scale |
| 10 | Iterative refinement | Accuracy gain OR interpretability | Drop iterative, focus on multi-scale + TTT |
| 14 | Full pipeline | Competitive with VARC | Still publishable as ablation study |
- On energy composition: "I'm summing energies across scales: E_total = E_local + E_mid + E_global. Is additive composition principled when scales have different receptive fields, or should I use a product (E_total = ∏ E_scale)?" (See the log-space note after these questions.)
- On cross-demo consistency: "I want to add a term that penalizes inconsistent rules across demos. Mathematically, I'm thinking ||rule_i - rule_j||² where rule is extracted from cross-attention between input/output. Is there a more principled way to enforce this from an energy perspective?"
- On iterative vs. direct: "VARC does direct prediction. I want to do gradient descent in embedding space instead. Should the energy function be the same network as the forward predictor, or a separate one? What's the relationship to your IRED work?"
- On publication scope: "Given 3 weeks of experiments, is ConceptARC sufficient for a workshop paper? Or do I need full ARC-AGI-1 results?"
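Note on the first question, for reference: in a Boltzmann reading, additive energies correspond to a product of per-scale distributions, p ∝ ∏_s exp(-E_s) = exp(-Σ_s E_s), i.e. a product of experts; a raw product ∏_s E_s has no such probabilistic reading and can change sign whenever a single E_s is negative.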
High risk: The iterative energy minimization may be slower than direct prediction without any accuracy gain.
→ Mitigation: Start with VARC's direct prediction, add iterative refinement as an optional module, and compare.
Medium risk: Multi-scale attention increases the parameter count without a proportional gain.
→ Mitigation: Use shared weights across scales and learn only a scale-specific modulation.
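One way to realize this mitigation, as a sketch: a single transformer block shared by every scale, with only a FiLM-style per-scale gain and shift learned separately (the module name and the modulation choice are assumptions).

```python
import torch
import torch.nn as nn

class SharedScaleBlock(nn.Module):
    """One transformer block shared by every scale; each scale only learns a
    cheap FiLM-style gain and shift over the token features."""
    def __init__(self, dim=256, heads=8, num_scales=3):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.gamma = nn.Parameter(torch.ones(num_scales, dim))    # per-scale gain
        self.beta = nn.Parameter(torch.zeros(num_scales, dim))    # per-scale shift

    def forward(self, tokens, scale_idx):          # tokens: (B, N, dim)
        return self.block(tokens * self.gamma[scale_idx] + self.beta[scale_idx])
```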
Low risk: Even without SOTA performance, this is at least a valid ablation study of VARC + compositional energy, publishable as a workshop paper.