A system that runs Karpathy's autoresearch in parallel on SageMaker Spot Training.
Unlike the original autoresearch, which runs experiments sequentially, this system uses a generation-based parallel evolution approach:
- Candidate Generation: Automatically generate N variants of train.py
- Parallel Execution: Run simultaneously across N SageMaker Spot instances
- Selection: The candidate with the lowest val_bpb becomes the new baseline
- Iteration: Repeat for M generations
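The loop above can be sketched as a toy simulation. Here `mutate` and `evaluate` are illustrative stand-ins for candidate generation and a Spot training job returning `val_bpb`; they are not the real pipeline API.

```python
import random

def mutate(params: dict) -> dict:
    """Toy stand-in for candidate generation: jitter the learning rate."""
    p = dict(params)
    p["lr"] *= random.uniform(0.7, 1.3)
    return p

def evaluate(params: dict) -> float:
    """Toy stand-in for a Spot training job; returns a fake val_bpb
    that is lowest at an arbitrary 'optimal' learning rate."""
    return abs(params["lr"] - 3e-3)

def evolve(baseline: dict, generations: int, population: int) -> dict:
    for _ in range(generations):
        candidates = [mutate(baseline) for _ in range(population)]
        best = min(candidates, key=evaluate)     # selection: lowest val_bpb
        if evaluate(best) < evaluate(baseline):  # winner becomes new baseline
            baseline = best
    return baseline
```

Because the baseline only changes when a candidate beats it, the best score never regresses across generations.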
- Prepare data: `python scripts/prepare_s3.py`
- IAM role: `./infrastructure/setup_iam.sh` → enter `role_arn` in `config.yaml`
- Docker image: `./infrastructure/build_and_push.sh` → enter `image_uri` in `config.yaml`
- Validate config: `python -m pipeline.orchestrator --dry-run`
- Single generation: `python -m pipeline.orchestrator --single --population 10`
- Full pipeline: `python -m pipeline.orchestrator --generations 10 --population 10`
- Single experiment: `python scripts/run_single.py`

- Only `train.py` may be modified (model, optimizer, hyperparameters)
- `prepare.py` must not be modified (evaluation functions, data loaders, constants)
- No new dependencies may be added
- Training time: fixed at 5 minutes (TIME_BUDGET=300 seconds)
- Goal: achieve the lowest val_bpb
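A minimal sketch of how the fixed time budget could be enforced inside `train.py`; `step_fn` is a hypothetical callable performing one optimizer step, and the harness's actual enforcement mechanism may differ.

```python
import time

TIME_BUDGET = 300  # seconds, fixed by the rules above

def train_with_budget(step_fn, budget_s: float = TIME_BUDGET) -> int:
    """Run training steps until the wall-clock budget is spent.

    step_fn: hypothetical callable doing one optimizer step.
    Returns the number of steps completed."""
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()
        steps += 1
    return steps
```

Using `time.monotonic()` rather than `time.time()` keeps the budget immune to system clock adjustments.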
Each generation produces candidates with varied strategies:
| Type | Count | Strategy |
|---|---|---|
| Conservative | 3 | Fine-tune LR ±10-30% |
| Moderate | 4 | Change DEPTH, ASPECT_RATIO, BATCH_SIZE, WINDOW |
| Aggressive | 2 | Radical combinations (deep-narrow, wide-shallow, high-LR) |
| Crossover | 1 | Combine top-2 ideas from previous generation |
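One way the table's 3/4/2/1 mix could be turned into a per-generation plan is sketched below; the strategy labels are illustrative, not the orchestrator's actual identifiers.

```python
import random

# Candidate mix from the strategy table above.
STRATEGY_MIX = [
    ("conservative", 3),  # fine-tune LR +/- 10-30%
    ("moderate", 4),      # DEPTH / ASPECT_RATIO / BATCH_SIZE / WINDOW
    ("aggressive", 2),    # radical combinations
    ("crossover", 1),     # combine top-2 ideas from previous generation
]

def plan_generation() -> list[str]:
    """Expand the mix into 10 candidate slots, shuffled to avoid
    ordering bias in job submission."""
    plan = [name for name, count in STRATEGY_MIX for _ in range(count)]
    random.shuffle(plan)
    return plan
```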
- Instance: ml.g5.xlarge (A10G 24GB, Ampere)
- Spot price: ~$0.30/hr
- Per experiment: ~$0.04 (8 minutes)
- Per generation (10 candidates): ~$0.40
- Full pipeline (10 gen × 10 pop): ~$4.00
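The figures above follow from simple arithmetic. Spot prices fluctuate, so ~$0.30/hr for ml.g5.xlarge is an assumption, and the 8 minutes per experiment includes startup/teardown overhead around the 5-minute training budget.

```python
# Back-of-the-envelope cost check for the numbers listed above.
SPOT_PRICE_PER_HR = 0.30    # assumed ml.g5.xlarge Spot price
MINUTES_PER_EXPERIMENT = 8  # 5 min training + startup/teardown

cost_per_experiment = SPOT_PRICE_PER_HR * MINUTES_PER_EXPERIMENT / 60
cost_per_generation = cost_per_experiment * 10  # 10 candidates
full_pipeline_cost = cost_per_generation * 10   # 10 generations
```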
- `results.tsv`: full experiment log (TSV format)
- `generations/gen_NNN/`: per-generation candidate code + results
- Git tags: `gen-NNN-best` marks the best state for each generation
Run the full loop autonomously with OMC autopilot:
/autopilot
Read program.md and run the full pipeline:
python -m pipeline.orchestrator --generations 10 --population 10
After completion, analyze results.tsv and summarize:
1. Best val_bpb achieved
2. Most impactful changes
3. Cost summary
4. Recommendations for next run
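Step 1 of the analysis can be sketched as a small TSV scan. The exact schema of `results.tsv` is not documented here, so the `val_bpb` and `candidate` column names are assumptions.

```python
import csv

def best_result(path: str = "results.tsv") -> dict:
    """Return the row with the lowest val_bpb.

    Assumes the TSV has a header row including a val_bpb column."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    return min(rows, key=lambda r: float(r["val_bpb"]))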
After experiments complete, prepare the following:
- `results.tsv` + `generations/` → visualize the experiment process
- Git log → trace the evolutionary progression
- Cost report → demonstrate cloud efficiency
- Comparison against baseline: 8-hour sequential vs 100-minute parallel
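The headline comparison follows from counting experiments: sequential execution pays 100 × 5 minutes of training time, while the parallel pipeline pays only 10 generations of wall time. The ~10 minutes of wall time per generation is an assumption covering job startup overhead.

```python
# Arithmetic behind the sequential-vs-parallel comparison.
experiments = 10 * 10             # generations x population
sequential_min = experiments * 5  # 500 min, roughly 8 hours
parallel_min = 10 * 10            # 10 generations x ~10 min each
speedup = sequential_min / parallel_min
```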