autoresearch-evo

A fork of karpathy/autoresearch that keeps the same tiny 5-minute train.py benchmark, but gives the research loop memory, novelty-search-inspired exploration, and autonomous review so it can search more broadly than plain hill-climbing.

Instead of treating each run as an isolated local tweak, this fork tries to make the loop behave more like a lightweight research system:

  • it explores with multiple mutation styles instead of a single local-tuning behavior
  • it remembers near-misses, crashes, and stepping stones instead of forgetting them immediately
  • it reviews each run and feeds that evidence back into the next decision
  • it can keep operating autonomously on a dedicated research branch
  • it still preserves the same simple public benchmark: one file, one metric, one 5-minute run

At this release cut, that loop improved the baseline from 0.997426 to 0.985596 val_bpb under the same benchmark contract.

This is a heuristic discovery control plane with novelty-search-inspired exploration, not a formal evolutionary algorithm framework. The goal is still pragmatic: help one small autoresearch loop search more intelligently without bloating the repo.

The public branch is curated to keep only the intentional fork surface: the control plane, the runner, tests, the best validated train.py found in our run history, and release charts summarizing the search.

If you want the deeper explanation of how the Codex/discovery control plane works, including each hook, emitter, and memory layer, see docs/discovery-control-plane.md.

Upstream Base

The clean base for this fork is karpathy/autoresearch, which provides:

  • prepare.py for one-time data prep and evaluation utilities
  • train.py as the only experiment target
  • a fixed 5-minute training budget
  • val_bpb as the optimization metric

This repo keeps that contract. The benchmark is still the same 5-minute train.py run. Any backend-specific timeout is only an outer guardrail around startup, sync, and evaluation.

What This Fork Adds

The main value-add is a repo-local discovery layer that changes how the loop searches:

  • Broader search than hill-climbing. The loop chooses among several mutation styles instead of repeating one kind of local tweak.
  • Persistent research memory. Near-misses, crashes, archive elites, and selected conjectures carry across turns.
  • Structured run review. Every completed run is parsed and scored for fitness, novelty, information gain, and surprise before the next step is chosen (see the sketch after this list).
  • Autonomous continuation. Repo-local hooks can keep the loop moving on autoresearch/* branches without turning the repo into a heavyweight framework.
  • Optional remote backend. A generic repo runner supports a local path by default and an optional remote GPU backend when local hardware is not the target environment.
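
As a rough illustration of the run-review step, the sketch below scores a completed run on fitness, novelty, and surprise (information gain is omitted for brevity). The names RunReview and score_run are hypothetical and do not match the actual API in research/ or .codex/.

# Hypothetical sketch only; not the repo's actual review code.
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunReview:
    val_bpb: float    # metric parsed from the run log
    fitness: float    # improvement versus the current baseline
    novelty: float    # distance to the nearest archived results
    surprise: float   # gap between the predicted and observed metric


def score_run(val_bpb: float, baseline: float, archive: list[float],
              predicted: float, k: int = 3) -> RunReview:
    """Turn one completed run into the evidence the next decision uses."""
    fitness = baseline - val_bpb                           # positive = improvement
    nearest = sorted(abs(val_bpb - v) for v in archive)[:k]
    novelty = mean(nearest) if nearest else float("inf")
    surprise = abs(predicted - val_bpb)
    return RunReview(val_bpb, fitness, novelty, surprise)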

The emitter portfolio is:

  • local_tuner
  • optimizer_hacker
  • architecture_mutator
  • simplifier
  • contrarian
  • recombinator
  • anomaly_chaser

Under the hood, the core implementation lives in .codex/, research/, and the repo-local runner scripts, but the intended public mental model stays small: prepare.py, train.py, program.md, and a loop that now searches with memory.

Best Validated Configuration Shipped Here

This public branch ships the best validated train.py configuration we found across the since-inception run history used for the charts below.

Best validated run:

  • commit: 0261b2c
  • val_bpb: 0.985596
  • baseline: 80b9a32 at 0.997426
  • absolute improvement: 0.011830
  • relative improvement: about 1.186%

The most important differences versus the discovery-path base are:

  • shared norm changed from pure RMSNorm to LayerNorm with dtype preservation
  • q/k normalization changed to a mixed RMSNorm/LayerNorm path
  • total batch reduced to 2**17
  • Muon weight decay retuned to 0.109
  • warmup restored to 2%
  • device batch reduced to 64

This is not presented as a final optimum, only as the best validated point discovered in this release cut.
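
Read as a single hyperparameter diff, the deltas above look roughly like the following. The key names are invented for illustration and do not match the variable names inside train.py.

# Illustrative names only; the actual settings live directly in train.py.
best_validated_deltas = {
    "shared_norm": "layernorm_dtype_preserving",   # was pure RMSNorm
    "qk_norm": "mixed_rmsnorm_layernorm",          # was a single-norm path
    "total_batch_size": 2**17,                     # tokens per optimizer step
    "muon_weight_decay": 0.109,
    "warmup_fraction": 0.02,                       # 2% of the schedule
    "device_batch_size": 64,
}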

Since-Inception Research Progress

Overall run history, including keeps, discards, crashes, and the best-so-far frontier:

Since-Inception val_bpb Progress

Frontier-only view to make the cumulative best improvements easier to inspect:

Since-Inception Frontier

Release-cut summary:

  • total runs: 102
  • keeps: 12
  • discards: 83
  • crashes: 7
  • best run: 0261b2c at 0.985596

Quick Start

Requirements:

  • Python 3.10+
  • uv
  • a single NVIDIA GPU locally for the upstream-style path

Setup:

uv sync
uv run prepare.py

Local benchmark run:

uv run train.py

Generic repo runner on the local backend:

python scripts/run_experiment.py --backend local -- uv run train.py > run.log 2>&1

Optional remote GPU run through the Modal backend:

python scripts/run_experiment.py --backend modal --gpu H100 --timeout 10 -- uv run train.py > run.log 2>&1

The inner command must remain exactly uv run train.py. The runner only handles transport, cache reuse, and backend lifecycle.
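
A minimal sketch of that contract check, assuming an argument layout like the commands above; the helper name is hypothetical and the real parsing lives in scripts/run_experiment.py.

import sys

REQUIRED_INNER_CMD = ["uv", "run", "train.py"]   # the benchmark contract


def split_runner_args(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split backend flags from the inner benchmark command at '--'."""
    sep = argv.index("--")
    backend_args, inner_cmd = argv[:sep], argv[sep + 1:]
    if inner_cmd != REQUIRED_INNER_CMD:
        sys.exit("inner command must be exactly: uv run train.py")
    return backend_args, inner_cmd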

For Modal-specific operational details and environment variables, see docs/infra/modal.md.

Repo Layout

prepare.py             # fixed data prep + evaluation utilities
train.py               # single experiment target, shipped at the best validated config
program.md             # agent-facing research protocol
.codex/                # repo-local Codex config + hooks
research/              # discovery-state logic
scripts/run_experiment.py      # generic experiment runner
scripts/backends/modal_backend.py # optional Modal backend
scripts/modal_gpu.py           # compatibility shim for the Modal backend
tests/                 # control-plane and runner tests
docs/figures/          # release charts

Testing

python3 -m py_compile research/core.py scripts/run_experiment.py scripts/modal_gpu.py scripts/backends/modal_backend.py train.py
python3 -m unittest discover -s tests -v

License

MIT
