Experimenting with cost-effective ways to run Karpathy's autoresearch on AWS infrastructure, and documenting the journey as a hands-on tutorial.
Karpathy's autoresearch shows that AI agents can autonomously improve deep learning models overnight — but it assumes you have an H100 GPU sitting idle for 8 hours. Most people don't have that.
This project answers: Can you get the same results using cheap cloud GPUs, paying only pennies per experiment?
The answer is yes. We run 83 experiments 2.3x faster and 5-18x cheaper than the original, using SageMaker Spot instances that spin up for 5 minutes and disappear.
| | Original (H100, 8 hours) | This project (L40S Spot) |
|---|---|---|
| Cost for 83 experiments | $7-24 | $1.33 |
| Wall clock time | ~8 hours | ~3.5 hours |
| GPU idle cost | ~50% wasted | $0 (HUGI pattern) |
| Experiments in parallel | 1 | 4 |
| GPU required | H100 80GB | Any (L40S, A10G, H100...) |
- ML practitioners without expensive GPUs — Run autoresearch on AWS Spot for $0.02/experiment instead of buying an H100
- Teams exploring model architectures — Use cheap L40S Spot to validate hypotheses, then apply winning configurations to production H100 training (research shows this transfers well)
- Cloud cost optimizers — Learn HUGI pattern, Spot capacity management, and parallel execution strategies applicable beyond ML
- Educators and students — Every experiment is documented as a tutorial with exact commands, costs, and lessons learned
Read the experiments folder — each experiment is a self-contained story with hypothesis, setup, results, and lessons learned. Start with 001-baseline.
Fork this repo, set up your AWS credentials, and run make run to start your own autonomous experiments. The pipeline handles candidate generation, parallel SageMaker job submission, result collection, and selection — all automatically.
The docs folder contains practical guides on Spot capacity, GPU cost analysis, and battle-tested insights from real experiments. The sagemaker-spot-training skill packages these lessons for Claude Code users.
The original autoresearch runs experiments sequentially on a single GPU — 12 experiments/hour, ~8 hours for 100 experiments. We built a parallel evolution pipeline on SageMaker Managed Spot Training that leverages the HUGI (Hurry Up and Get Idle) pattern to complete 100 experiments in 100 minutes at the same cost ($4) with zero GPU idle time.
| | Original autoresearch | Serverless (this repo) |
|---|---|---|
| Execution | 1 experiment at a time | 10 experiments in parallel |
| 100 experiments | ~8 hours | ~100 minutes |
| Cost | ~$4 (GPU always on) | ~$4 (HUGI: pay only when running) |
| GPU | 1x H100 (always occupied) | N x H100 Spot (on-demand burst) |
| Search strategy | Greedy (sequential) | Population-based evolution |
| Improvement probability | 18% per experiment | 86% per generation |
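The 86% figure follows from running the population in parallel: with an 18% chance that any single experiment beats the baseline, the chance that at least one of 10 candidates does is 1 − 0.82¹⁰. A quick check:

```python
# Probability that at least one of N parallel candidates improves on the
# baseline, given the 18% per-experiment rate from the table above.
p_single = 0.18
population = 10

p_generation = 1 - (1 - p_single) ** population
print(f"{p_generation:.0%}")  # → 86%
```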
Traditional GPU server:

```
████░░░░████░░░░████░░░░████░░░░   (utilization ~50%, paying 24/7)
```

HUGI with SageMaker Spot:

```
██████████                         (utilization 100%, $0 when idle)
↑ N GPUs burst   ↑ terminate immediately
```
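The difference can be sketched in billed GPU-hours from the figures above (100 experiments at roughly 8 GPU-minutes each; the ~50% utilization of the always-on server is the assumption carried over from the diagram):

```python
# Billed GPU-hours: always-on server vs HUGI burst (sketch using the
# figures quoted above; always-on utilization is assumed to be 50%).
experiments = 100
minutes_per_experiment = 8        # ~8 min of actual training per experiment
utilization_always_on = 0.50      # half the paid hours sit idle

busy_hours = experiments * minutes_per_experiment / 60   # real work: ~13.3 h
always_on_billed = busy_hours / utilization_always_on    # paid: ~26.7 h
hugi_billed = busy_hours                                 # pay only while running

print(f"always-on: {always_on_billed:.1f} GPU-h billed")
print(f"HUGI:      {hugi_billed:.1f} GPU-h billed")
```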
- AWS CLI configured (`aws configure`)
- Python 3.11+
- SageMaker Python SDK

```bash
pip install boto3 sagemaker pyyaml click
```

Set up IAM:

```bash
./infrastructure/setup_iam.sh --profile personal --region ap-northeast-1
# → Copy role ARN to config.yaml
```

Prepare data:

```bash
make prepare
```

Downloads 10 training shards + the validation shard from HuggingFace, trains the BPE tokenizer, and uploads everything to S3.

Run:

```bash
make dry-run

# Single experiment test (~$0.04, ~10 min)
make run-single

# Full pipeline (~$4, ~100 min)
make run
```

Each generation follows 4 steps:
1. **Candidate Generation** — Creates N variants of `train.py` with diverse strategies:

   | Strategy | Count | Description |
   |---|---|---|
   | Conservative | 3 | LR adjustments (±10-30%) |
   | Moderate | 4 | Architecture changes (depth, width, window, batch) |
   | Aggressive | 2 | Radical combinations (deep-narrow, wide-shallow) |
   | Crossover | 1 | Combine ideas from top-2 of previous generation |

2. **Batch Launch** — Submits all N candidates as parallel SageMaker Spot Training Jobs (async, `wait=False`)
3. **Result Collection** — Polls all jobs until completion, extracts the `val_bpb` metric from CloudWatch
4. **Selection** — The best `val_bpb` becomes the new baseline, committed with git tag `gen-NNN-best`
- Only `train.py` can be modified (model architecture, optimizer, hyperparameters)
- `prepare.py` is read-only (evaluation function, data loading, constants)
- No new dependencies allowed
- Fixed 5-minute training time budget (`TIME_BUDGET=300s`)
- Goal: lowest `val_bpb` (validation bits per byte)
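For reference, a common way to compute bits per byte from cross-entropy loss is to convert nats to bits and normalize tokens to bytes. This is a sketch of the usual definition, not the project's code; the canonical evaluation lives in the read-only `prepare.py`.

```python
import math

def bits_per_byte(mean_ce_nats, total_tokens, total_bytes):
    """Mean cross-entropy in nats/token -> bits/byte (common formulation)."""
    bits_per_token = mean_ce_nats / math.log(2)
    return bits_per_token * total_tokens / total_bytes

# Sanity check: one token per byte and a loss of ln(2) nats is exactly 1.0 bpb.
print(bits_per_byte(math.log(2), 1000, 1000))  # → 1.0
```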
serverless-autoresearch/
├── train.py # Training script (agent modifies this)
├── prepare.py # Data prep + evaluation (read-only)
├── config.yaml # AWS & pipeline config (gitignored)
├── config.yaml.example # Config template
├── program.md # Agent instructions
├── Makefile # make run, make dry-run, make cost, etc.
│
├── src/ # Source code (cookiecutter-style)
│ ├── pipeline/ # Core evolution pipeline
│ │ ├── orchestrator.py # Main evolution loop
│ │ ├── candidate_generator.py
│ │ ├── batch_launcher.py
│ │ ├── result_collector.py
│ │ └── selection.py
│ ├── sagemaker/ # SageMaker wrappers
│ │ ├── entry_point.py
│ │ └── train_wrapper.py
│ └── scripts/ # CLI utilities
│ ├── prepare_s3.py
│ ├── run_single.py
│ └── cost_report.py
│
├── data/raw/ # Data references (actual data in S3)
├── models/ # Trained model artifacts
├── notebooks/ # Jupyter notebooks (analysis)
├── references/ # Research notes & external references
├── experiments/ # Per-experiment reports & results
│ ├── 001-baseline-l40s/
│ └── 002-optimization-l40s/
├── docs/ # Project documentation & diagrams
├── infrastructure/ # AWS IAM, Dockerfile, requirements
└── generations/ # Pipeline output (per-generation)
config.yaml:

```yaml
aws:
  profile: personal
  region: ap-northeast-1        # Tokyo (H100 Spot available)
  role_arn: "arn:aws:iam::..."

sagemaker:
  instance_type: ml.p5.4xlarge  # H100 80GB
  use_spot: true
  max_run: 900                  # 15 min
  max_wait: 3600                # 1 hour spot wait
  framework_version: "2.8.0"
  py_version: "py312"

pipeline:
  num_generations: 10
  population_size: 10
  num_conservative: 3
  num_moderate: 4
  num_aggressive: 2
  num_crossover: 1
```

| Component | Unit Cost | Qty | Total |
|---|---|---|---|
| ml.p5.4xlarge Spot (8min/exp) | ~$0.04 | 100 | ~$4.00 |
| S3 storage | — | — | ~$0.10 |
| Total | | | ~$4.10 |
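The total is straightforward arithmetic over the table's own figures:

```python
# Budget check from the cost table above.
spot_cost_per_experiment = 0.04   # ml.p5.4xlarge Spot, ~8 min billed
experiments = 100
s3_storage = 0.10

total = spot_cost_per_experiment * experiments + s3_storage
print(f"${total:.2f}")  # → $4.10
```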
Run the full pipeline autonomously with oh-my-claudecode:
```
/autopilot

Read program.md and execute:
python -m pipeline.orchestrator --generations 10 --population 10
After completion, analyze results.tsv and summarize findings.
```
| # | Experiment | GPU | val_bpb | Cost | Key Finding |
|---|---|---|---|---|---|
| 001 | Baseline on L40S | ml.g7e.4xlarge | 1.065 | $0.04 | Pipeline validated, SDPA fallback works |
| 002 | L40S Optimization | ml.g7e.4xlarge | TBD | TBD | Research notes |
| 003 | H100 Fair Comparison | ml.p5.4xlarge | TBD | TBD | Pending quota approval |
Spot instance availability varies dramatically by region. Always check before running experiments.
```bash
# Quick check: Spot placement score (1-10, higher = better)
for region in us-east-1 us-east-2 us-west-2; do
  echo -n "$region: "
  aws ec2 get-spot-placement-scores \
    --instance-types g7e.4xlarge --target-capacity 1 \
    --single-availability-zone --region-names $region \
    --region $region \
    --query "max_by(SpotPlacementScores, &Score).Score" --output text
done
```

Our experience:
| Region | Spot Score | Result |
|---|---|---|
| us-west-2 (Oregon) | 1-2 | Stuck in "Starting" for 30+ min |
| us-east-1 (Virginia) | 9 | Allocated in ~2 min |
See the Spot Capacity Guide for details, including price history checks and quota management.
Follow the full journey of building this project through conversational AI coding (vibe coding):
Vibe Coding Tutorial — 8 chapters from idea to autonomous ML evolution, with real prompts and debugging stories
| Chapters | Time | Cost | Key Topics |
|---|---|---|---|
| 8 | ~8 hours | $0.44 | Deep interview → pipeline design → SageMaker Spot → GPU debugging → autonomous evolution → insights |
Lessons learned from running 20+ experiments on SageMaker Spot. Full list: docs/insights.md
| # | Insight | Impact |
|---|---|---|
| 1 | Spot capacity varies 1-9 by region — always check placement scores first | Saves 30+ min of stuck jobs |
| 3 | `DEVICE_BATCH_SIZE` ≠ token throughput — increase `TOTAL_BATCH_SIZE` instead | Avoided wrong optimization path |
| 4 | Flash Attention 3 is Hopper-only; L40S needs FA2 or SDPA fallback | MFU 20% vs 40% |
| 5 | SageMaker startup overhead is 3 min per job (60% for 5-min training) | Scale up > scale out |
| 11 | Spot GPUs are valid proxies for large-scale training — architecture/optimizer rankings transfer, absolute LR/batch size don't | Cheap experiments ($0.04) inform expensive production runs |
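Insight #5's arithmetic is worth internalizing: a fixed ~3-minute startup adds 60% overhead on top of a 5-minute training run, and the penalty shrinks only as each job does more work, which is why scaling up beats scaling out here.

```python
# Insight #5: fixed ~3 min SageMaker startup per job, expressed as overhead
# relative to the useful training time in that job.
startup_min = 3.0

for train_min in (5, 15, 60):
    overhead = startup_min / train_min
    print(f"{train_min:>3} min training: +{overhead:.0%} startup overhead")
```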
| Document | Description |
|---|---|
| Key Insights | Battle-tested lessons from SageMaker Spot experiments (continuously updated) |
| Comparison Report | Original sequential vs serverless parallel pipeline — architecture, cost, search efficiency |
| GPU Cost Analysis | P5 (H100) vs P6 (B200/B300) pricing and performance for autoresearch workloads |
| Spot Capacity Guide | How to find available Spot capacity by region before running experiments |
| Architecture Diagram | System architecture (SageMaker + S3 + local orchestrator) |
| Sequential vs Parallel | Visual comparison of sequential and parallel experiment pipelines |
- karpathy/autoresearch — Original sequential autoresearch framework
- karpathy/nanochat — Training codebase that autoresearch is based on
MIT