ARC-AGI Competition Approach

System: LLM-Guided DSL Evolution via Nested Optimization Loops

Core Thesis

ARC tasks are compressible programs over a learnable DSL. An LLM approximates the Solomonoff prior over grid-transformation programs; a tournament selects for minimum description length. The DSL grows toward the universal function set for ARC by diagnosing failures and injecting targeted primitives.

1. Competition Criteria Mapping

ARC-AGI-2 (Static, Kaggle)

Criterion	Our Approach	Artifact	Status
Solve novel grid puzzles	DSL-guided program synthesis + LLM code generation	`target_kaggle_arc.py`, `dsl.py`	Working
Generalize from few examples	863 composable helpers reduce search space	`dsl.py` (870KB, 863 functions)	Living artifact
$50/task compute budget	Qwen2.5-14B on 2xT4 (fp8, TP=2)	`target_kaggle_arc.py` (vLLM backend)	Working
Offline Kaggle notebook	%%writefile + kaggle dataset modules	`target-arc2-agi-dsl.ipynb`	Working
Partial credit (pixel accuracy)	Spatial Hausdorff + color accuracy metric	`target_mlx_arc.py:calculate_pixel_accuracy`	Working

ARC-AGI-3 (Interactive, API)

Criterion	Our Approach	Artifact	Status
Explore unknown environments	Phase 1: systematic probing (9 actions, causal model)	`target_arc3_agent.py:NeurosymbolicAgent`	WIP
Plan and act in real-time	Phase 2: BFS over causal graph + optional LLM assist	`target_arc3_agent.py:choose_action`	WIP
Adapt policy per episode	POLICY_CODE string evolved by beam search	`target_arc3_agent.py:POLICY_CODE`	Scaffolded
Agent lifecycle compliance	Extends official `Agent` base class	`ARC-AGI-3-Agents/agents/agent.py`	Working
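
The Phase 2 planning step can be pictured as a breadth-first search over transitions recorded during probing. This is a minimal sketch under stated assumptions: the `causal_model` dictionary, the state labels, and `plan_actions` are hypothetical names for illustration and are not taken from `target_arc3_agent.py`.

```python
from collections import deque


def plan_actions(causal_model, start, goal):
    """Return the shortest action sequence from start to goal, or None.

    causal_model maps state -> {action: next_state}. It stands in for
    whatever transition table the probing phase builds (hypothetical).
    """
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action, nxt in causal_model.get(state, {}).items():
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [action]))
    return None  # goal not reachable with the transitions observed so far


# Toy model: hashable state labels and integer action ids (made up).
toy_model = {
    "s0": {1: "s1", 2: "s2"},
    "s1": {3: "s3"},
    "s2": {1: "s3", 4: "s4"},
    "s3": {2: "goal"},
}
plan = plan_actions(toy_model, "s0", "goal")
```

Because BFS expands states in order of distance, the first path that reaches the goal is also a shortest one, which keeps the number of actions taken per episode low.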

Local Development (M4 16GB)

Capability	Implementation	Artifact
Local LLM inference	Qwen3.5-9B-4bit via MLX (~5.5GB)	`target_mlx_arc.py`
KV cache compression	mlx-lm built-in `kv_bits=3`	`target_mlx_arc.py`
DSL evolution	Autoresearch outer loop + local beam search inner loop	`evolve_qwen_arc.py` + `beam_search_local.py`
Fitness evaluation	3-tier: helper coverage, prompt gen, solve accuracy	`benchmark_dsl.py`
Correctness gate	Syntax + function existence + grid operation tests	`tests_dsl.py`
Function dedup	Name-based dedup at injection to prevent duplicates	`beam_search_local.py`
ABPR execution traces	Traceback + pixel accuracy in diagnostic prompts	`evolve_qwen_arc.py`
D4 symmetry ensemble	8-transform voting (4 rotations + 4 reflections)	`d4_ensemble.py`
MDL composition search	Enumerate DSL primitive chains (depth 1-3, no LLM)	`mdl_compose.py`
LoRA self-distillation	Train adapter on successful programs + synthetic tasks	`lora_train.py`
Synthetic task generation	TransCoder-style: failed programs → training pairs	`synthetic_tasks.py`
Concept-based scoring	LLM describes concept, scores match (ConceptSearch)	`concept_scoring.py`
DSL pruning	Remove dead/uncalled helper functions	`dsl_prune.py`
Function relevance ranking	DreamCoder-style context-aware function selection	`dsl_recognition.py`
Literature scanning	Daily arXiv + Semantic Scholar + OpenAlex scan	`literature_scan.py`
Temperature split	Low temp (0.1) for eval, adaptive (0.6-0.9) for generation	arXiv 2603.28304
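
The name-based dedup row can be illustrated with the standard `ast` module. The sketch below is an assumption about how such a gate could look, not the code in `beam_search_local.py`; `inject_without_duplicates` and the toy sources are hypothetical.

```python
import ast


def inject_without_duplicates(existing_source, candidate_source):
    """Append only candidate functions whose names are not already defined.

    ast.parse raises SyntaxError on malformed input, so this also acts
    as a basic syntax gate. Only top-level `def`s are considered.
    """
    existing = {
        node.name
        for node in ast.parse(existing_source).body
        if isinstance(node, ast.FunctionDef)
    }
    kept = []
    for node in ast.parse(candidate_source).body:
        if isinstance(node, ast.FunctionDef) and node.name not in existing:
            existing.add(node.name)
            kept.append(ast.get_source_segment(candidate_source, node))
    if not kept:
        return existing_source
    return existing_source.rstrip("\n") + "\n\n\n" + "\n\n\n".join(kept) + "\n"


# Toy sources (hypothetical helpers, not real dsl.py content).
dsl_src = "def flip_h(g):\n    return [row[::-1] for row in g]\n"
cand_src = (
    "def flip_h(g):\n    return g\n\n"
    "def flip_v(g):\n    return g[::-1]\n"
)
merged = inject_without_duplicates(dsl_src, cand_src)
```

Here the candidate's `flip_h` is dropped because the name already exists, while `flip_v` is appended. Matching on names alone is cheap but will not catch two differently named functions with identical behaviour.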

2. Architecture: Three Nested Loops

Autoresearch (evolve_qwen_arc.py, 1926 lines) — hypothesis generation, experiment tracking
  |
  +-- Per round: EVALUATE -> DIAGNOSE -> HYPOTHESIZE -> INJECT -> EXPERIMENT -> ANALYZE -> DECIDE -> RECORD -> COMMIT
  |
  +-- Local Beam Search (beam_search_local.py, 838 lines) — mutation tournament
  |     |
  |     +-- N branches mutate dsl.py in parallel (3 branches default)
  |     +-- Qwen3.5-9B-4bit generates candidates via MLX
  |     +-- benchmark_dsl.py scores each variant (3-tier: helper, prompt, solve)
  |     +-- tests_dsl.py gates correctness
  |     +-- Winners survive to next round
  |     +-- Failed programs → synthetic_tasks.py (TransCoder data collection)
  |
  +-- Enhanced Solve Pipeline (solve_task_enhanced)
  |     |
  |     +-- Phase 0: MDL composition search (mdl_compose.py, free, ~2-5s/task)
  |     +-- Phase 1: Standard LLM solve (single inference)
  |     +-- Phase 2: D4 symmetry ensemble (d4_ensemble.py, 4-8 LLM calls)
  |     +-- Phase 3: Fallback single with varied temperature
  |
  +-- Self-Distillation (lora_train.py, triggered every ~15 new programs)
  |     |
  |     +-- Collects successful programs + synthetic tasks (50% cap)
  |     +-- Trains LoRA adapter (rank 4, 8 layers, 200 iters)
  |     +-- Canary validation on reliably-solved tasks
  |     +-- Adapter applied to future inference rounds
  |
  +-- MLX Inference (target_mlx_arc.py)
        |
        +-- Qwen3.5-9B-4bit (~5.5GB VRAM)
        +-- kv_bits=3 cache compression
        +-- Temperature split: eval=0.1, generation=adaptive
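
Phase 2 of the solve pipeline (D4 symmetry ensemble) can be sketched in pure Python as follows. This is an illustration under assumptions, not the contents of `d4_ensemble.py`: the `solver(train_pairs, test_input)` signature, `d4_vote`, and the toy solver are hypothetical.

```python
from collections import Counter


def to_grid(g):
    """Normalise a grid to a hashable tuple of tuples."""
    return tuple(tuple(row) for row in g)


def rot_cw(g):
    return tuple(zip(*g[::-1]))


def rot_ccw(g):
    return tuple(zip(*g))[::-1]


def flip_lr(g):
    return tuple(row[::-1] for row in g)


def repeat(fn, g, k):
    for _ in range(k):
        g = fn(g)
    return g


def d4_views():
    """Return 8 (forward, inverse) pairs: 4 rotations, each with/without a flip."""
    views = []
    for k in range(4):
        views.append((lambda g, k=k: repeat(rot_cw, g, k),
                      lambda g, k=k: repeat(rot_ccw, g, k)))
        views.append((lambda g, k=k: flip_lr(repeat(rot_cw, g, k)),
                      lambda g, k=k: repeat(rot_ccw, flip_lr(g), k)))
    return views


def d4_vote(solver, train_pairs, test_input):
    """Solve under every view, undo the transform, return the majority grid."""
    votes = Counter()
    for forward, inverse in d4_views():
        view = [(forward(to_grid(x)), forward(to_grid(y))) for x, y in train_pairs]
        try:
            pred = solver(view, forward(to_grid(test_input)))
            votes[inverse(to_grid(pred))] += 1
        except Exception:
            continue  # a crashing view simply casts no vote
    return votes.most_common(1)[0][0] if votes else None


# Toy solver (hypothetical): adds 1 to every cell, but is wrong on any
# view whose top-left cell is 1, which happens for 2 of the 8 views.
def toy_solver(train_pairs, test):
    if test[0][0] == 1:
        return tuple(tuple(0 for _ in row) for row in test)
    return tuple(tuple(c + 1 for c in row) for row in test)


grid = ((1, 2, 3), (4, 5, 6))
target = ((2, 3, 4), (5, 6, 7))
result = d4_vote(toy_solver, [(grid, target)], grid)
```

Voting only helps when the solver's errors differ across views. If the solver is wrong in the same way under every orientation, the ensemble returns that same wrong grid.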

3. Novelty Claims

What is new (verified against literature, April 2026)

LLM-guided DSL evolution for ARC — Prior work evolves model weights (SOAR, Pourcel et al. 2025) or uses fixed DSLs (ARGA, TransCoder). We evolve the DSL itself using an LLM as mutation proposer and beam search as selector. Closest paradigm: FunSearch (Romera-Paredes et al. 2023, Nature), which was never applied to ARC DSL growth.
Diagnostic autoresearch loop with ABPR-style execution traces — The outer loop diagnoses why tasks fail (3-tier scoring), generates hypotheses grounded in ARC research literature, and targets DSL growth at specific capability gaps. Inspired by ABPR (Qiu 2026), failing candidates' tracebacks and pixel accuracies are fed back into diagnostic prompts so the LLM can debug, not just guess. No published system combines execution-trace-informed hypothesis generation with code evolution for ARC.
863-function evolved DSL — Largest known DSL for ARC. ARGA uses ~50 hand-crafted ops. DreamCoder discovers abstractions via compression but was never applied to ARC at this scale. Our DSL grows empirically via LLM proposal + tournament selection, with function dedup to prevent bloat. DSL: 21,981 lines, 870KB.
Local quantized evolution — Running the full evolution loop on M4 16GB via Qwen3.5-9B-4bit + kv_bits=3 cache compression. No other ARC system runs at this scale locally. Directly addresses Chollet's thesis: intelligence = skill-acquisition efficiency, not raw compute.
Self-repairing orchestration — PROMPT-FIX MODE evolves not just the DSL toolbox but the strategy for using it (build_prompt, analyze_task_deeply). No published ARC system evolves its own orchestration layer.
TransCoder-style synthetic task generation — Failed beam search programs are applied to task inputs to create synthetic (task', program) pairs where the program IS the correct solution (Bednarek & Krawiec, 2410.04480). These feed LoRA self-distillation training, bootstrapping supervised data from otherwise-discarded programs. No published ARC system combines evolutionary search with self-distillation on synthetic failures.
Multi-technique enhanced solving — Combines MDL composition search (DreamCoder-inspired, no LLM needed), D4 symmetry augmentation (8 group transforms with verified inverses), and concept-based scoring (ConceptSearch, Singhal & Shroff, 2412.07322). Each technique addresses a different failure mode: MDL finds simple compositions the LLM misses, D4 handles orientation-invariant tasks, concept scoring provides richer signal than pixel accuracy alone.
Autonomous overnight evolution with literature integration — Evolution loop runs continuously in tmux; literature scanner queries arXiv + Semantic Scholar + OpenAlex across 10 search terms covering neuroevolution, neurosymbolic reasoning, evolutionary DSL, and ARC-specific work. New techniques are extracted and queued for injection.
ARC-AGI-3 policy synthesis — No published work on evolving POLICY_CODE for interactive ARC agents.
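
The TransCoder-style relabelling described in the list above can be sketched in a few lines. The task layout follows the public ARC JSON format (`train` and `test` lists of `input`/`output` pairs); `make_synthetic_task` and the identity filter are assumptions made for illustration, not the behaviour of `synthetic_tasks.py`.

```python
def make_synthetic_task(program, task):
    """Relabel a task with a failed program's own outputs.

    The program did not solve the original task, but by construction it
    is a correct solution of the relabelled one.
    """
    pairs = []
    for pair in task["train"] + task["test"]:
        try:
            out = program(pair["input"])
        except Exception:
            return None  # a program that crashes defines no usable task
        pairs.append({"input": pair["input"], "output": out})
    # Assumed filter: an identity mapping carries no training signal.
    if all(p["input"] == p["output"] for p in pairs):
        return None
    n_train = len(task["train"])
    return {"train": pairs[:n_train], "test": pairs[n_train:]}


# Toy task and a "failed" program (both made up for illustration).
toy_task = {
    "train": [
        {"input": [[1, 2], [3, 4]], "output": [[4, 3], [2, 1]]},
        {"input": [[5, 6]], "output": [[6, 5]]},
    ],
    "test": [{"input": [[7, 8, 9]], "output": [[9, 8, 7]]}],
}


def transpose(grid):
    return [list(col) for col in zip(*grid)]


synthetic = make_synthetic_task(transpose, toy_task)
```

The resulting `(synthetic, transpose)` pair is a supervised example even though `transpose` never solved `toy_task` itself.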

Paradigm placement: iterative neurosymbolic program synthesis

Our system instantiates the emerging "P4" paradigm — iterative Program-to-Program refinement where an LLM acts as both program synthesizer and mutation engine, with symbolic execution providing the fitness signal. This neurosymbolic loop (neural proposal → symbolic evaluation → feedback → refined proposal) is the same core pattern behind o3's 87.5% ARC score and MIT's TTT-based 8B model. The key differences in our instantiation:

We evolve the DSL, not just programs over a fixed DSL. Most P4-style systems search within Python or a hand-crafted language. We grow the search space itself — the LLM proposes new primitives, the tournament validates them, and the DSL converges toward the empirical prior over ARC transformations. This is meta-level program synthesis: programs that change the program space.
We operate at commodity scale. o3 spent an estimated $30K+ per task. We target >25% at <$50/task on a 16GB M4, demonstrating that the neurosymbolic loop works without brute-force compute, consistent with Chollet's thesis that intelligence is skill-acquisition efficiency, not raw search budget.
We learn from failure. TransCoder-style synthetic task generation means every failed program attempt creates supervised training data, feeding back into LoRA self-distillation. The system's effective sample efficiency exceeds its raw solve rate.

Universality argument

The autoresearch + beam search architecture is domain-agnostic. The same nested-loop design (diagnose failures → hypothesize fixes → inject → tournament → commit) could evolve DSLs for FlashFill, LOGO turtle graphics, or any program synthesis domain. Only the fitness function (benchmark_dsl.py) and data (arc_agi_2_data/) are ARC-specific. This is the strongest argument for Universality scoring.
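
To make the domain-agnostic claim concrete, here is a deliberately tiny tournament loop in which only `propose` and `fitness` know anything about the domain. Every name in it (`evolve_library`, the toy pool, the coverage fitness) is hypothetical. It is a sketch of the pattern, not a reduction of `evolve_qwen_arc.py`.

```python
def evolve_library(library, propose, fitness, rounds=3, branches=3):
    """Greedy tournament: keep the best mutated library only if it improves."""
    best_score = fitness(library)
    history = []
    for r in range(rounds):
        candidates = [propose(library, r, b) for b in range(branches)]
        scored = [(fitness(c), c) for c in candidates]
        top_score, top = max(scored, key=lambda sc: sc[0])
        accepted = top_score > best_score
        if accepted:
            library, best_score = top, top_score
        history.append({"round": r, "score": top_score, "accepted": accepted})
    return library, best_score, history


# Toy domain: a "library" is a set of primitive names, and fitness is
# the fraction of needed primitives it covers (all values made up).
POOL = ["rotate", "noop_a", "mirror", "noop_b", "crop",
        "noop_c", "scale", "noop_d", "tile"]
NEEDED = {"rotate", "mirror", "crop"}


def toy_propose(library, round_idx, branch_idx):
    # Stand-in for an LLM mutation: deterministically add one pool entry.
    return library | {POOL[(round_idx * 3 + branch_idx) % len(POOL)]}


def toy_fitness(library):
    return len(library & NEEDED) / len(NEEDED)


final_lib, final_score, log = evolve_library(frozenset(), toy_propose, toy_fitness)
```

Swapping in a different `fitness` (for example a FlashFill-style string benchmark) changes the domain without touching the loop, which is the point of the universality argument.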

Nearest competitors

System	Approach	Key Difference
ABPR (Qiu 2026)	Prolog + algorithmic debugging, 56.7% ARC-2	Fixed language (Prolog), debugs programs not DSL. Closest competitor
SOAR (Pourcel 2025)	LLM + evolutionary search, 52% ARC-1	Evolves model weights, not DSL
FunSearch (DeepMind 2023)	LLM-guided code evolution	Applied to cap sets, not ARC
DreamCoder (Ellis 2021)	Library learning via compression	E-graph refactoring, not LLM-guided
ARGA (Xu 2022)	Graph DSL + Tabu search	Fixed 50-op DSL, no evolution
TransCoder (Bednarek 2024)	Typed DSL + LLM synthesis	Fixed DSL (40 ops), synthetic tasks but no evolution
ConceptSearch (Singhal 2024)	FunSearch + concept scoring	No DSL evolution, fixed program space
KGMoN (Sorokin 2025)	Fine-tuned 4B model, 24% ARC-2	No DSL, pure neural, Kaggle winner
Poetiq (2026)	Gemini 3 meta-system, 54% ARC-2	Unconstrained ($30/task), no DSL
Greenblatt (2024)	GPT-4o generates ~8K Python programs/task	Fixed program space (Python), no DSL evolution

4. Theoretical Framework

Framing: Approximate Solomonoff Induction over a Learnable UTM

Component	Role	Theory
`dsl.py` (863 functions)	Universal Turing Machine bias	Solomonoff invariance theorem: UTM choice shifts complexity by O(1)
Beam search tournament	Approximate MAP inference	Selects p* = argmin |p| s.t. p(input) = output
Autoresearch loop	Active learning in program space	Diagnoses high-information failures, targets DSL growth
DSL evolution	Compression optimization	Each new helper reduces description length of solutions using it
LoRA self-distillation	Posterior sharpening	Fine-tunes LLM prior toward distribution of successful ARC programs
Synthetic task generation	Data augmentation in program space	Failed programs create supervised pairs, bootstrapping the prior
LLM (Qwen3.5-9B)	Approximate Solomonoff prior	Wan & Mei (2025): LLM training approximates Solomonoff via loss minimization
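
The invariance and minimum-length rows of the table rest on standard statements, restated here in common notation as a reading aid. The symbols are assumptions of this note rather than names used in the code: U and V are universal prefix machines, |p| is program length in bits, K_U is Kolmogorov complexity relative to U, and (x_i, y_i) are a task's training pairs.

```latex
\begin{align}
M_U(x) &= \sum_{p \,:\, U(p) = x\ast} 2^{-\lvert p \rvert}
  && \text{(Solomonoff prior: shorter programs weigh more)} \\
\lvert K_U(x) - K_V(x) \rvert &\le c_{U,V}
  && \text{(invariance: } c_{U,V} \text{ depends on } U, V \text{ but not on } x\text{)} \\
p^{\ast} &= \arg\min_{p} \lvert p \rvert
  \quad \text{s.t. } p(x_i) = y_i \ \ \forall i
  && \text{(selection target of the tournament)}
\end{align}
```

The constant c_{U,V} can be large in practice, which is why the choice of DSL still matters for search cost even though it does not matter asymptotically.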

Key theoretical results supporting this approach

Chollet (2019) — Intelligence = skill-acquisition efficiency. ARC tests generalization with minimal priors. Our system's fixed DSL + evolution measures exactly this.
Wan & Mei (2025) — LLMs approximate Solomonoff induction. When our LLM proposes DSL mutations, it samples from an approximate Solomonoff prior over programs.
Rathmanner & Hutter (2011) — Solomonoff prior assigns higher probability to shorter programs (Occam's razor). MDL composition search (mdl_compose.py) directly implements this: shorter chains score higher.
Lattimore & Hutter (2011) — Universal bias succeeds across all structured domains, countering NFL theorems. Our DSL evolution discovers this bias empirically.
DreamCoder (Ellis 2021) — Library learning converges toward compression-optimal DSL. Our approach uses LLM-guided generation instead of E-graph refactoring but targets the same convergence. MDL composition search is a direct implementation of the "wake" phase.
TransCoder (Bednarek & Krawiec 2024) — Learning from mistakes: failed programs applied to inputs create synthetic tasks. Over 5 cycles: synthetic solve rate 1.72% → 21.66%. We implement this via synthetic_tasks.py feeding lora_train.py.
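
The "shorter chains score higher" idea from the Rathmanner & Hutter item can be shown with a brute-force enumeration that tries depth 1 before depth 2 before depth 3. This is a sketch under assumptions: the three toy primitives and `mdl_search` are hypothetical and much smaller than anything in `mdl_compose.py`.

```python
from itertools import product


def flip_h(g):
    return [row[::-1] for row in g]


def flip_v(g):
    return g[::-1]


def transpose(g):
    return [list(col) for col in zip(*g)]


def apply_chain(chain, primitives, grid):
    """Apply the named primitives left to right; None if any step fails."""
    try:
        for name in chain:
            grid = primitives[name](grid)
        return grid
    except Exception:
        return None


def mdl_search(primitives, train_pairs, max_depth=3):
    """Return the first shortest chain consistent with all training pairs."""
    for depth in range(1, max_depth + 1):  # shortest descriptions first
        for chain in product(primitives, repeat=depth):
            if all(apply_chain(chain, primitives, x) == y for x, y in train_pairs):
                return list(chain)
    return None


PRIMS = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}

# Toy task: rotate the grid 90 degrees clockwise.
pairs = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[1, 2, 3]], [[1], [2], [3]]),
]
chain = mdl_search(PRIMS, pairs)
```

No single primitive solves the toy task, so the search falls through to depth 2. With n primitives the number of chains at depth d is n^d, which is why an enumeration like this is only practical at small depth or over a pre-filtered subset of the DSL.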

Why it works (not just how)

The system solves ARC tasks not by memorizing patterns but by constructing a compression hierarchy where each layer reduces the search space for the next:

DSL as learned inductive bias. Each helper function in dsl.py encodes a regularity discovered across ARC tasks (e.g., "grids often have mirror symmetry that needs completing"). This is not hand-engineering — the LLM proposes candidates and the tournament selects only those that improve solve rate. The DSL converges toward the empirical prior over ARC transformations.
Diagnosis closes the loop. Unlike blind evolutionary search, the autoresearch loop performs causal diagnosis: it identifies which tasks fail and classifies the failure mode (PROMPT_FAIL vs COMPOSE_FAIL vs WRONG_ANSWER). This creates an information gradient — new DSL functions are proposed specifically to fill diagnosed capability gaps, not randomly.
Three-level optimization. The system optimizes at three timescales:
- Fast (beam search): Given the current DSL, find the best composition of existing helpers for each task. This is approximate MAP inference over programs.
- Medium (LoRA self-distillation): Sharpen the LLM's prior toward successful programs, including synthetic tasks from failures. This improves future search.
- Slow (autoresearch): When fast search fails, grow the DSL with new primitives. This is meta-learning — learning the hypothesis space itself.
This mirrors the distinction between learning (fast), consolidation (medium), and evolution (slow) in biological intelligence, and between skill use, skill refinement, and skill acquisition in Chollet's framework.
Learning from failure. TransCoder-style synthetic task generation means every failed program attempt contributes training signal. The system's effective data efficiency is higher than the raw solve rate suggests — failed attempts are not wasted but recycled into supervised learning data.
Prompt-fix mode as self-repair. When the system detects that failures come from the orchestration layer (prompt builder) rather than missing primitives, it automatically switches to fixing build_prompt() / analyze_task_deeply(). This means the system can evolve not just its toolbox but its strategy for using tools — a form of metacognitive self-improvement.
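
The diagnosis step depends on turning a failed candidate into something the LLM can read. A minimal sketch of that idea follows, with two loud assumptions: `pixel_accuracy` here is a plain cell-match ratio (a simplification, not the Hausdorff-based metric in `target_mlx_arc.py`), and `run_with_trace` and the `solve` entry-point name are hypothetical.

```python
import traceback


def pixel_accuracy(pred, target):
    """Fraction of matching cells; 0.0 when the shapes differ."""
    if len(pred) != len(target) or any(
        len(p) != len(t) for p, t in zip(pred, target)
    ):
        return 0.0
    cells = [(a, b) for p, t in zip(pred, target) for a, b in zip(p, t)]
    return sum(a == b for a, b in cells) / len(cells) if cells else 0.0


def run_with_trace(program_src, grid_in, grid_out, entry="solve"):
    """Execute a candidate and collect what a diagnostic prompt would need."""
    report = {"ok": False, "pixel_accuracy": 0.0, "traceback": None}
    try:
        namespace = {}
        # exec runs arbitrary code: sandbox this in any real pipeline.
        exec(program_src, namespace)
        pred = namespace[entry](grid_in)
        report["pixel_accuracy"] = pixel_accuracy(pred, grid_out)
        report["ok"] = report["pixel_accuracy"] == 1.0
    except Exception:
        report["traceback"] = traceback.format_exc(limit=3)
    return report


grid_in = [[1, 2], [3, 4]]
good = run_with_trace("def solve(g):\n    return [r[::-1] for r in g]",
                      grid_in, [[2, 1], [4, 3]])
partial = run_with_trace("def solve(g):\n    return g",
                         grid_in, [[1, 2], [4, 3]])
crash = run_with_trace("def solve(g):\n    return g[5]",
                       grid_in, [[2, 1], [4, 3]])
```

A report with a traceback points at a code bug, while a clean run with low accuracy points at a wrong transformation. Feeding both fields back into the prompt gives the model a reason for the failure rather than only a score.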

On DSL completeness

No formal proof exists that any finite DSL is sufficient for all ARC tasks (tasks are intentionally open-ended per Chollet's design). However:

- Any DSL with conditionals + loops + integer arithmetic is Turing-complete
- The practical question is search efficiency, not expressibility
- FlashFill++ (Cambronero et al. 2023) provides frameworks for managing DSL scaling
- Our evolution loop addresses this empirically: if a task fails, the DSL grows

5. Progress Tracking

Current Scores

Metric	Value	Date
DSL benchmark (composite)	1.0000 (tier 1+2 ceiling)	2026-04-05
Helper coverage (helper_score)	1.00	2026-04-05
Prompt generation (prompt_score)	1.00	2026-04-05
Composition (compose_score)	1.00	2026-04-05
DSL function count	863	2026-04-05
DSL file size	21,981 lines (870KB)	2026-04-05
Evolution rounds completed	18 (current round in progress)	2026-04-05
Holdout solve rate (10 fixed tasks)	0.19 mean (round 17)	2026-04-05
Best holdout task solve	1.00 (2072aba6), 0.93 (890034e9)	2026-04-05
Beam search best score	0.8178 (round 18 winner)	2026-04-05
Synthetic tasks collected	Active (TransCoder pipeline)	2026-04-05
LoRA training	Pipeline ready, triggers at 15 programs	2026-04-05

Hypothesis Implementation Status

#	Hypothesis	Module	Status
H1	Color canonicalization	`dsl.py` (inline)	Active
H2	D4 symmetry ensemble	`d4_ensemble.py`	Deployed, pending restart
H3	MDL composition search	`mdl_compose.py`	Deployed, pending restart
H4	Persistent insight memory bank	TBD	Next hypothesis

Bottleneck Analysis

Tier 1+2 metrics have hit ceiling (1.0). The current bottleneck is tier 3 solve accuracy — the LLM (Qwen3.5-9B) must compose DSL functions into correct solutions for unseen tasks.

Active mitigations:

- D4 symmetry (H2): 8-transform voting catches orientation-sensitive failures
- MDL composition (H3): Free pre-check finds simple chains the LLM misses
- LoRA self-distillation: Sharpens LLM prior toward successful program patterns
- TransCoder synthetic tasks: Bootstraps training data from failed attempts
- Concept scoring: Richer evaluation signal than pixel accuracy (opt-in)
- Literature scanning: 10 queries across neuroevolution, neurosymbolic, ARC domains

SOTA Context (ARC-AGI-2, April 2026)

System	ARC-AGI-2 Score	Cost/Task	Approach
Opus 4.6 (unconstrained)	68.8%	high	Pure neural
ABPR (Qiu 2026)	56.7% Pass@2	moderate	Prolog + algorithmic debugging + Gemini-3-Flash
Poetiq (Gemini 3)	54%	$30.57	Meta-system, unconstrained
Anthropic Opus 4.5	37.6%	$2.20	Thinking mode
KGMoN (Kaggle 1st, 2025)	24%	$0.20	Fine-tuned 4B model
Target (our system)	>25%	<$50	DSL evolution + beam search + self-distillation

Note: All paradigms show 2-3x performance drops from ARC-AGI-1 to ARC-AGI-2 (ARC Prize technical report). The 85% Grand Prize ($200K) has never been claimed. Paper Prize: $75K top, $375K outstanding pool, scored on 6 dimensions.

Experiment Log

See evolution_results/hypotheses.jsonl for per-round records:

{"round": N, "hypothesis": "...", "metric_before": 0.72, "metric_after": ..., "status": "accepted|rejected"}

Holdout trends: evolution_results/holdout_scores.jsonl (10 fixed tasks, scored every round).
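
Records in that shape are easy to summarise with the standard library. The sketch below assumes only the field names shown in the record template above; `summarize_rounds` is a hypothetical helper and the sample lines are made-up values for illustration, not results from this project.

```python
import json


def summarize_rounds(lines):
    """Count accepted rounds and average their metric gain."""
    records = [json.loads(line) for line in lines if line.strip()]
    accepted = [r for r in records if r.get("status") == "accepted"]
    gains = [r["metric_after"] - r["metric_before"] for r in accepted]
    return {
        "rounds": len(records),
        "accepted": len(accepted),
        "mean_gain": sum(gains) / len(gains) if gains else 0.0,
    }


# Illustrative records only (not real experiment data).
sample_lines = [
    '{"round": 1, "hypothesis": "h-a", "metric_before": 0.5, "metric_after": 0.75, "status": "accepted"}',
    '{"round": 2, "hypothesis": "h-b", "metric_before": 0.75, "metric_after": 0.5, "status": "rejected"}',
    '{"round": 3, "hypothesis": "h-c", "metric_before": 0.75, "metric_after": 1.0, "status": "accepted"}',
    "",
]
summary = summarize_rounds(sample_lines)
```

To run it on the real log, pass an open file handle for `evolution_results/hypotheses.jsonl` in place of `sample_lines`, since iterating a text file yields one line at a time.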

6. Artifact Inventory

Core Pipeline (20 Python files)

File	Lines	Purpose	Track
`evolve_qwen_arc.py`	1,926	Autoresearch outer loop (9-step rounds)	ARC-2
`beam_search_local.py`	838	Local beam search (replaces codopt)	ARC-2
`dsl.py`	21,981	Evolved DSL (863 functions)	ARC-2 + ARC-3
`benchmark_dsl.py`	—	Fitness function (3-tier scoring)	ARC-2
`tests_dsl.py`	—	Correctness gate	ARC-2
`target_mlx_arc.py`	518	Local M4 MLX inference backend	ARC-2
`target_kaggle_arc.py`	—	Kaggle A100/T4 inference backend	ARC-2
`target_arc2_kaggle.py`	—	Alternative Kaggle target	ARC-2
`target_arc3_agent.py`	—	Interactive neurosymbolic agent	ARC-3
`d4_ensemble.py`	—	D4 symmetry augmentation (H2)	ARC-2
`mdl_compose.py`	—	MDL composition search (H3)	ARC-2
`lora_train.py`	418	LoRA self-distillation training	ARC-2
`synthetic_tasks.py`	289	TransCoder synthetic task generation	ARC-2
`concept_scoring.py`	229	ConceptSearch concept-based scoring	ARC-2
`literature_scan.py`	496	arXiv + Semantic Scholar + OpenAlex scanner	Research
`dsl_prune.py`	—	Dead function pruner	ARC-2
`dsl_recognition.py`	—	DreamCoder-style function relevance ranker	ARC-2
`temperature_sweep.py`	—	Temperature optimization experiments	Research
`benchmark_arc3.py`	—	ARC-AGI-3 evaluation	ARC-3
`run_arc3_baseline.py`	—	ARC-AGI-3 baseline runner	ARC-3

Data & Results

Directory	Contents
`arc_data/data/training/`	ARC-AGI-1 (400 tasks)
`arc_agi_2_data/training/`	ARC-AGI-2 (1000+ tasks)
`evolution_results/`	hypotheses.jsonl, holdout_scores.jsonl, diagnostics, lora_adapters/
`evolution_results/successful_programs.jsonl`	Programs that solved tasks (LoRA training data)
`evolution_results/synthetic_tasks.jsonl`	TransCoder synthetic (task, program) pairs

Dependencies

Repo	Role	Status
`turboquant-mlx/`	KV cache compression (venv with mlx-lm)	Active
`arc-dsl/`	Reference DSL (165 primitives, v0 seed)	Reference only
`codex-optimize/`	Beam-search via Codex agents (legacy)	Replaced by beam_search_local.py
`ARC-AGI-3-Agents/`	Official interactive agent framework	Active
`autoresearch/`	Karpathy's experiment loop (reference)	Reference only

7. Dependency Graph

evolve_qwen_arc.py (outer loop, 1926 lines)
+-- dsl.py (mutated target, 863 functions)
|   +-- HELPER_CODE_PREFIX (evolved helper functions)
|   +-- build_prompt() (task encoding)
+-- beam_search_local.py (inner loop, 838 lines)
|   +-- benchmark_dsl.py (fitness: helper + prompt + composition)
|   +-- tests_dsl.py (correctness gate)
|   +-- synthetic_tasks.py (collect training data from failures)
+-- target_mlx_arc.py (Tier 3 solve eval)
|   +-- mlx-lm + Qwen3.5-9B-4bit
|   +-- kv_bits=3 cache compression
+-- d4_ensemble.py (D4 symmetry augmentation)
|   +-- 8 transforms with verified inverses
+-- mdl_compose.py (MDL composition pre-check)
|   +-- Depth 1-3 chains of DSL primitives
+-- lora_train.py (self-distillation, triggered periodically)
|   +-- successful_programs.jsonl (real programs)
|   +-- synthetic_tasks.jsonl (TransCoder pairs, 50% cap)
|   +-- mlx_lm.lora (adapter training)
+-- concept_scoring.py (opt-in concept-based scoring)
+-- literature_scan.py (technique extraction)
+-- evolution_results/
    +-- hypotheses.jsonl (per-round records)
    +-- holdout_scores.jsonl (fixed 10-task tracking)
    +-- lora_adapters/ (trained adapters)

target_kaggle_arc.py (standalone, cloud)
+-- vLLM or HF transformers
+-- dsl.py (helpers + prompt)

target_arc3_agent.py (standalone, interactive)
+-- dsl.py (frame analysis)
+-- POLICY_CODE (evolution target)
+-- ARC-AGI-3-Agents/agents/agent.py (base class)

8. Key Citations

#	Paper	Year	Relevance
1	Chollet, "On the Measure of Intelligence"	2019	ARC design principles, intelligence definition
2	Wan & Mei, "LLMs as Approximations to Solomonoff Induction"	2025	Theoretical grounding for LLM-as-prior
3	Romera-Paredes et al., "FunSearch" (Nature)	2023	LLM-guided evolutionary code search paradigm
4	Ellis et al., "DreamCoder" (PLDI)	2021	Library learning, DSL growth via compression
5	Pourcel et al., "SOAR"	2025	Self-improving LLM for ARC (evolves weights, not DSL)
6	Xu et al., "ARGA" (AAAI)	2022	Graph DSL + constraint search for ARC
7	Rathmanner & Hutter, "Solomonoff Induction"	2011	Formal basis for Occam's razor in program search
8	Chollet et al., "ARC Prize 2025 Technical Report"	2026	Competition results, winning paradigms
9	Vahdati et al., "The ARC of Progress towards AGI" (arxiv 2603.13372)	2026	Living survey: all paradigms 2-3x drop on ARC-AGI-2
10	Macfarlane & Bonnet, "Latent Program Network"	2024	Alternative: gradient-based program search
11	Qiu et al., "ABPR" (arxiv 2603.20334)	2026	Prolog + algorithmic debugging, 56.7% ARC-AGI-2
12	Chaudhry et al., "Recursive Concept Evolution" (arxiv 2602.15725)	2026	Compositional reasoning gaps
13	Greenblatt, "Getting 50% on ARC-AGI with GPT-4o"	2024	~8K programs/task, fixed program space
14	ARC-TGI, "Task Generators for ARC" (arxiv 2603.05099)	2026	Synthetic data generation for ARC training
15	Bednarek & Krawiec, "TransCoder" (arxiv 2410.04480)	2024	Learning from mistakes: synthetic tasks from failed programs
16	Singhal & Shroff, "ConceptSearch" (arxiv 2412.07322)	2024	Concept-based scoring for program search (~30% more efficient)
17	arXiv 2603.28304, "Temperature Judge"	2026	Low temp for eval, high for generation — implemented
