Experimenting with cost-effective ways to run Karpathy's autoresearch on AWS infrastructure, and documenting the journey as a hands-on tutorial.
Karpathy's autoresearch shows that AI agents can autonomously improve deep learning models overnight — but it assumes you have an H100 GPU sitting idle for 8 hours. Most people don't have that.
This project answers: Can you get the same results using cheap cloud GPUs, paying only pennies per experiment?
The answer is yes. We run 83 experiments 2.3x faster and 5-18x cheaper than the original, using SageMaker Spot instances that spin up for 5 minutes and disappear.
| | Original (H100, 8 hours) | This project (L40S Spot) |
|---|---|---|
| Cost for 83 experiments | $7-24 | $1.33 |
| Wall clock time | ~8 hours | ~3.5 hours |
| GPU idle cost | ~50% wasted | $0 (HUGI pattern) |
| Experiments in parallel | 1 | 4 |
| GPU required | H100 80GB | Any (L40S, A10G, H100...) |
- ML practitioners without expensive GPUs — Run autoresearch on AWS Spot for $0.02/experiment instead of buying an H100
- Teams exploring model architectures — Use cheap L40S Spot to validate hypotheses, then apply winning configurations to production H100 training (research shows this transfers well)
- Cloud cost optimizers — Learn HUGI pattern, Spot capacity management, and parallel execution strategies applicable beyond ML
- Educators and students — Every experiment is documented as a tutorial with exact commands, costs, and lessons learned
Read the experiments folder — each experiment is a self-contained story with hypothesis, setup, results, and lessons learned. Start with 001-baseline.
Fork this repo, set up your AWS credentials, and run make run to start your own autonomous experiments. The pipeline handles candidate generation, parallel SageMaker job submission, result collection, and selection — all automatically.
The docs folder contains practical guides on Spot capacity, GPU cost analysis, and battle-tested insights from real experiments. The sagemaker-spot-training skill packages these lessons for Claude Code users.
The original autoresearch runs experiments sequentially on a single GPU — 12 experiments/hour, ~8 hours for 100 experiments. We built a parallel evolution pipeline on SageMaker Managed Spot Training that leverages the HUGI (Hurry Up and Get Idle) pattern to complete 100 experiments in 100 minutes at the same cost ($4) with zero GPU idle time.
| | Original autoresearch | Serverless (this repo) |
|---|---|---|
| Execution | 1 experiment at a time | 10 experiments in parallel |
| 100 experiments | ~8 hours | ~100 minutes |
| Cost | ~$4 (GPU always on) | ~$4 (HUGI: pay only when running) |
| GPU | 1x H100 (always occupied) | N x H100 Spot (on-demand burst) |
| Search strategy | Greedy (sequential) | Population-based evolution |
| Improvement probability | 18% per experiment | 86% per generation |
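The 86% figure follows from running the population in parallel: with an 18% chance that any single experiment beats the baseline, the chance that at least one of 10 candidates does is 1 − 0.82¹⁰. A quick check:

```python
# Probability that at least one of N parallel candidates improves on the
# baseline, given the 18% per-experiment rate from the table above.
p_single = 0.18
population = 10

p_generation = 1 - (1 - p_single) ** population
print(f"{p_generation:.0%}")  # → 86%
```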
Traditional GPU server:

```
████░░░░████░░░░████░░░░████░░░░   (utilization ~50%, paying 24/7)
```

HUGI with SageMaker Spot:

```
██████████                         (utilization 100%, $0 when idle)
↑ N GPUs burst   ↑ terminate immediately
```
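The difference can be sketched in billed GPU-hours from the figures above (100 experiments at roughly 8 GPU-minutes each; the ~50% utilization of the always-on server is the assumption carried over from the diagram):

```python
# Billed GPU-hours: always-on server vs HUGI burst (sketch using the
# figures quoted above; always-on utilization is assumed to be 50%).
experiments = 100
minutes_per_experiment = 8        # ~8 min of actual training per experiment
utilization_always_on = 0.50      # half the paid hours sit idle

busy_hours = experiments * minutes_per_experiment / 60   # real work: ~13.3 h
always_on_billed = busy_hours / utilization_always_on    # paid: ~26.7 h
hugi_billed = busy_hours                                 # pay only while running

print(f"always-on: {always_on_billed:.1f} GPU-h billed")
print(f"HUGI:      {hugi_billed:.1f} GPU-h billed")
```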
- AWS CLI configured (`aws configure`)
- Python 3.11+
- SageMaker Python SDK

```bash
pip install boto3 sagemaker pyyaml click
```

Set up IAM:

```bash
./infrastructure/setup_iam.sh --profile personal --region ap-northeast-1
# → Copy role ARN to config.yaml
```

Prepare data:

```bash
make prepare
```

Downloads 10 training shards + the validation shard from HuggingFace, trains the BPE tokenizer, and uploads everything to S3.

Run:

```bash
make dry-run

# Single experiment test (~$0.04, ~10 min)
make run-single

# Full pipeline (~$4, ~100 min)
make run
```

Each generation follows 4 steps:
1. **Candidate Generation** — Creates N variants of `train.py` with diverse strategies:

   | Strategy | Count | Description |
   |---|---|---|
   | Conservative | 3 | LR adjustments (±10-30%) |
   | Moderate | 4 | Architecture changes (depth, width, window, batch) |
   | Aggressive | 2 | Radical combinations (deep-narrow, wide-shallow) |
   | Crossover | 1 | Combine ideas from top-2 of previous generation |

2. **Batch Launch** — Submits all N candidates as parallel SageMaker Spot Training Jobs (async, `wait=False`)
3. **Result Collection** — Polls all jobs until completion, extracts the `val_bpb` metric from CloudWatch
4. **Selection** — The best `val_bpb` becomes the new baseline, committed with git tag `gen-NNN-best`
- Only `train.py` can be modified (model architecture, optimizer, hyperparameters)
- `prepare.py` is read-only (evaluation function, data loading, constants)
- No new dependencies allowed
- Fixed 5-minute training time budget (`TIME_BUDGET=300s`)
- Goal: lowest `val_bpb` (validation bits per byte)
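For reference, a common way to compute bits per byte from cross-entropy loss is to convert nats to bits and normalize tokens to bytes. This is a sketch of the usual definition, not the project's code; the canonical evaluation lives in the read-only `prepare.py`.

```python
import math

def bits_per_byte(mean_ce_nats, total_tokens, total_bytes):
    """Mean cross-entropy in nats/token -> bits/byte (common formulation)."""
    bits_per_token = mean_ce_nats / math.log(2)
    return bits_per_token * total_tokens / total_bytes

# Sanity check: one token per byte and a loss of ln(2) nats is exactly 1.0 bpb.
print(bits_per_byte(math.log(2), 1000, 1000))  # → 1.0
```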
serverless-autoresearch/
├── train.py # Training script (agent modifies this)
├── prepare.py # Data prep + evaluation (read-only)
├── config.yaml # AWS & pipeline config (gitignored)
├── config.yaml.example # Config template
├── program.md # Agent instructions
├── Makefile # make run, make dry-run, make cost, etc.
│
├── src/ # Source code (cookiecutter-style)
│ ├── pipeline/ # Core evolution pipeline
│ │ ├── orchestrator.py # Main evolution loop
│ │ ├── candidate_generator.py
│ │ ├── batch_launcher.py
│ │ ├── result_collector.py
│ │ └── selection.py
│ ├── sagemaker/ # SageMaker wrappers
│ │ ├── entry_point.py
│ │ └── train_wrapper.py
│ └── scripts/ # CLI utilities
│ ├── prepare_s3.py
│ ├── run_single.py
│ └── cost_report.py
│
├── data/raw/ # Data references (actual data in S3)
├── models/ # Trained model artifacts
├── notebooks/ # Jupyter notebooks (analysis)
├── references/ # Research notes & external references
├── experiments/ # Per-experiment reports & results
│ ├── 001-baseline-l40s/
│ └── 002-optimization-l40s/
├── docs/ # Project documentation & diagrams
├── infrastructure/ # AWS IAM, Dockerfile, requirements
└── generations/ # Pipeline output (per-generation)
config.yaml:

```yaml
aws:
  profile: personal
  region: ap-northeast-1        # Tokyo (H100 Spot available)
  role_arn: "arn:aws:iam::..."

sagemaker:
  instance_type: ml.p5.4xlarge  # H100 80GB
  use_spot: true
  max_run: 900                  # 15 min
  max_wait: 3600                # 1 hour spot wait
  framework_version: "2.8.0"
  py_version: "py312"

pipeline:
  num_generations: 10
  population_size: 10
  num_conservative: 3
  num_moderate: 4
  num_aggressive: 2
  num_crossover: 1
```

| Component | Unit Cost | Qty | Total |
|---|---|---|---|
| ml.p5.4xlarge Spot (8min/exp) | ~$0.04 | 100 | ~$4.00 |
| S3 storage | — | — | ~$0.10 |
| Total | | | ~$4.10 |
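The total is straightforward arithmetic over the table's own figures:

```python
# Budget check from the cost table above.
spot_cost_per_experiment = 0.04   # ml.p5.4xlarge Spot, ~8 min billed
experiments = 100
s3_storage = 0.10

total = spot_cost_per_experiment * experiments + s3_storage
print(f"${total:.2f}")  # → $4.10
```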
Run the full pipeline autonomously with oh-my-claudecode:
```
/autopilot

Read program.md and execute:
python -m pipeline.orchestrator --generations 10 --population 10
After completion, analyze results.tsv and summarize findings.
```
| # | Experiment | GPU | val_bpb | Cost | Key Finding |
|---|---|---|---|---|---|
| 001 | Baseline on L40S | ml.g7e.4xlarge | 1.065 | $0.04 | Pipeline validated, SDPA fallback works |
| 002 | L40S Optimization | ml.g7e.4xlarge | TBD | TBD | Research notes |
| 003 | H100 Fair Comparison | ml.p5.4xlarge | TBD | TBD | Pending quota approval |
Spot instance availability varies dramatically by region. Always check before running experiments.
```bash
# Quick check: Spot placement score (1-10, higher = better)
for region in us-east-1 us-east-2 us-west-2; do
  echo -n "$region: "
  aws ec2 get-spot-placement-scores \
    --instance-types g7e.4xlarge --target-capacity 1 \
    --single-availability-zone --region-names $region \
    --region $region \
    --query "max_by(SpotPlacementScores, &Score).Score" --output text
done
```

Our experience:
| Region | Spot Score | Result |
|---|---|---|
| us-west-2 (Oregon) | 1-2 | Stuck in "Starting" for 30+ min |
| us-east-1 (Virginia) | 9 | Allocated in ~2 min |
See the Spot Capacity Guide for details, including price history checks and quota management.
Follow the full journey of building this project through conversational AI coding (vibe coding):
Vibe Coding Tutorial — 8 chapters from idea to autonomous ML evolution, with real prompts and debugging stories
| Chapters | Time | Cost | Key Topics |
|---|---|---|---|
| 8 | ~8 hours | $0.44 | Deep interview → pipeline design → SageMaker Spot → GPU debugging → autonomous evolution → insights |
Lessons learned from running 20+ experiments on SageMaker Spot. Full list: docs/insights.md
| # | Insight | Impact |
|---|---|---|
| 1 | Spot capacity varies 1-9 by region — always check placement scores first | Saves 30+ min of stuck jobs |
| 3 | `DEVICE_BATCH_SIZE` ≠ token throughput — increase `TOTAL_BATCH_SIZE` instead | Avoided wrong optimization path |
| 4 | Flash Attention 3 is Hopper-only; L40S needs FA2 or SDPA fallback | MFU 20% vs 40% |
| 5 | SageMaker startup overhead is 3 min per job (60% for 5-min training) | Scale up > scale out |
| 11 | Spot GPUs are valid proxies for large-scale training — architecture/optimizer rankings transfer, absolute LR/batch size don't | Cheap experiments ($0.04) inform expensive production runs |
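Insight #5's arithmetic is worth internalizing: a fixed ~3-minute startup adds 60% overhead on top of a 5-minute training run, and the penalty shrinks only as each job does more work, which is why scaling up beats scaling out here.

```python
# Insight #5: fixed ~3 min SageMaker startup per job, expressed as overhead
# relative to the useful training time in that job.
startup_min = 3.0

for train_min in (5, 15, 60):
    overhead = startup_min / train_min
    print(f"{train_min:>3} min training: +{overhead:.0%} startup overhead")
```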
| Document | Description |
|---|---|
| Key Insights | Battle-tested lessons from SageMaker Spot experiments (continuously updated) |
| Comparison Report | Original sequential vs serverless parallel pipeline — architecture, cost, search efficiency |
| GPU Cost Analysis | P5 (H100) vs P6 (B200/B300) pricing and performance for autoresearch workloads |
| Spot Capacity Guide | How to find available Spot capacity by region before running experiments |
| Architecture Diagram | System architecture (SageMaker + S3 + local orchestrator) |
| Sequential vs Parallel | Visual comparison of sequential and parallel experiment pipelines |
- karpathy/autoresearch — Original sequential autoresearch framework
- karpathy/nanochat — Training codebase that autoresearch is based on
MIT