roboco-io/serverless-autoresearch

Serverless Autoresearch

Experimenting with cost-effective ways to run Karpathy's autoresearch on AWS infrastructure, and documenting the journey as a hands-on tutorial.

Why This Project?

Karpathy's autoresearch shows that AI agents can autonomously improve deep learning models overnight — but it assumes you have an H100 GPU sitting idle for 8 hours. Most people don't have that.

This project answers: Can you get the same results using cheap cloud GPUs, paying only pennies per experiment?

The answer is yes. We run 83 experiments 2.3x faster and 5-18x cheaper than the original, using SageMaker Spot instances that spin up for 5 minutes and disappear.

| | Original (H100, 8 hours) | This project (L40S Spot) |
|---|---|---|
| Cost for 83 experiments | $7-24 | $1.33 |
| Wall clock time | ~8 hours | ~3.5 hours |
| GPU idle cost | ~50% wasted | $0 (HUGI pattern) |
| Experiments in parallel | 1 | 4 |
| GPU required | H100 80GB | Any (L40S, A10G, H100...) |
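The "5-18x cheaper" claim is straightforward arithmetic on the cost row:

```python
# Cost ratio behind the "5-18x cheaper" claim (numbers from the table above).
h100_low, h100_high = 7.0, 24.0   # original H100 cost range for 83 experiments
l40s_spot = 1.33                  # this project's L40S Spot cost

print(f"{h100_low / l40s_spot:.1f}x - {h100_high / l40s_spot:.1f}x cheaper")
# → 5.3x - 18.0x cheaper
```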

Who Is This For?

  • ML practitioners without expensive GPUs — Run autoresearch on AWS Spot for $0.02/experiment instead of buying an H100
  • Teams exploring model architectures — Use cheap L40S Spot to validate hypotheses, then apply winning configurations to production H100 training (research shows this transfers well)
  • Cloud cost optimizers — Learn HUGI pattern, Spot capacity management, and parallel execution strategies applicable beyond ML
  • Educators and students — Every experiment is documented as a tutorial with exact commands, costs, and lessons learned

How to Use This

As a tutorial

Read the experiments folder — each experiment is a self-contained story with hypothesis, setup, results, and lessons learned. Start with 001-baseline.

As a pipeline

Fork this repo, set up your AWS credentials, and run make run to start your own autonomous experiments. The pipeline handles candidate generation, parallel SageMaker job submission, result collection, and selection — all automatically.

As a reference

The docs folder contains practical guides on Spot capacity, GPU cost analysis, and battle-tested insights from real experiments. The sagemaker-spot-training skill packages these lessons for Claude Code users.

How It Works

The original autoresearch runs experiments sequentially on a single GPU — 12 experiments/hour, ~8 hours for 100 experiments. We built a parallel evolution pipeline on SageMaker Managed Spot Training that leverages the HUGI (Hurry Up and Get Idle) pattern to complete 100 experiments in 100 minutes at the same cost ($4) with zero GPU idle time.

Architecture

Serverless Autoresearch Architecture

Sequential vs Parallel

| | Original autoresearch | Serverless (this repo) |
|---|---|---|
| Execution | 1 experiment at a time | 10 experiments in parallel |
| 100 experiments | ~8 hours | ~100 minutes |
| Cost | ~$4 (GPU always on) | ~$4 (HUGI: pay only when running) |
| GPU | 1x H100 (always occupied) | N x H100 Spot (on-demand burst) |
| Search strategy | Greedy (sequential) | Population-based evolution |
| Improvement probability | 18% per experiment | 86% per generation |
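The jump from 18% to 86% is just parallelism: if each of the 10 candidates in a generation independently improves the baseline with probability ~18%, the chance that at least one of them does is 1 - (1 - 0.18)^10 ≈ 86%. A quick check:

```python
p_single = 0.18          # per-experiment improvement probability (from the table)
population = 10          # candidates launched per generation

# P(at least one of N independent candidates improves the baseline)
p_generation = 1 - (1 - p_single) ** population
print(f"{p_generation:.0%}")  # → 86%
```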

HUGI Pattern (Hurry Up and Get Idle)

```
Traditional GPU server:
  ████░░░░████░░░░████░░░░████░░░░  (utilization ~50%, paying 24/7)

HUGI with SageMaker Spot:
  ██████████                        (utilization 100%, $0 when idle)
  ↑ N GPUs burst         ↑ terminate immediately
```
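In cost terms, HUGI changes what you pay for, not how much compute you buy. A rough model using numbers quoted in this README (the $/hour figure is implied by the ~$0.04 per 8-minute job, not an official price):

```python
# Illustrative HUGI cost model; price is back-derived from ~$0.04 per 8-min job.
PRICE_PER_GPU_HOUR = 0.30    # implied hourly rate, not an AWS list price
JOB_MINUTES = 8              # 5 min training + ~3 min SageMaker startup
N_EXPERIMENTS = 100
PARALLEL = 10

# Billed GPU time is per job, so idle time between bursts costs $0.
cost = PRICE_PER_GPU_HOUR * (JOB_MINUTES / 60) * N_EXPERIMENTS

# Wall clock shrinks with parallelism (ignores scheduling gaps, hence
# the ~100 min observed in practice vs. the 80 min lower bound here).
wall_clock_min = (N_EXPERIMENTS / PARALLEL) * JOB_MINUTES
print(f"cost ≈ ${cost:.2f}, wall clock ≈ {wall_clock_min:.0f} min")
```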

Quick Start

Prerequisites

```bash
pip install boto3 sagemaker pyyaml click
```

1. Setup IAM Role

```bash
./infrastructure/setup_iam.sh --profile personal --region ap-northeast-1
# → Copy role ARN to config.yaml
```

2. Prepare Data (one-time, ~5 min)

```bash
make prepare
```

Downloads 10 training shards and a validation shard from HuggingFace, trains a BPE tokenizer, and uploads everything to S3.

3. Verify Setup

```bash
make dry-run
```

4. Run Experiments

```bash
# Single experiment test (~$0.04, ~10 min)
make run-single

# Full pipeline (~$4, ~100 min)
make run
```

Pipeline Details

Generation Loop

Each generation follows 4 steps:

1. Candidate Generation — creates N variants of train.py with diverse strategies:

   | Strategy | Count | Description |
   |----------|-------|-------------|
   | Conservative | 3 | LR adjustments (±10-30%) |
   | Moderate | 4 | Architecture changes (depth, width, window, batch) |
   | Aggressive | 2 | Radical combinations (deep-narrow, wide-shallow) |
   | Crossover | 1 | Combine ideas from top-2 of previous generation |

2. Batch Launch — submits all N candidates as parallel SageMaker Spot Training Jobs (async, `wait=False`)

3. Result Collection — polls all jobs until completion and extracts the `val_bpb` metric from CloudWatch

4. Selection — the best `val_bpb` becomes the new baseline, committed with git tag `gen-NNN-best`
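The four steps above can be sketched as a single loop body. The function names here are illustrative stand-ins (with stubbed metrics), not the repo's actual API in src/pipeline/:

```python
# Minimal sketch of one generation of the evolution loop described above.
# Names and metrics are illustrative, not the repo's real implementation.

def generate_candidates(baseline, n=10):
    """Step 1: produce n variants of train.py (conservative/moderate/aggressive/crossover)."""
    return [f"{baseline}-candidate-{i}" for i in range(n)]

def launch_and_collect(candidates):
    """Steps 2-3: submit Spot jobs with wait=False, then poll for val_bpb.
    Stubbed here with fake metrics so the sketch runs standalone."""
    return {c: 1.10 - 0.005 * i for i, c in enumerate(candidates)}

def select_best(results):
    """Step 4: lowest val_bpb wins and becomes the next generation's baseline."""
    return min(results, key=results.get)

baseline = "gen-000"
results = launch_and_collect(generate_candidates(baseline))
baseline = select_best(results)
print(baseline)  # → gen-000-candidate-9 (lowest fake val_bpb)
```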

The Rules (same as original autoresearch)

  • Only train.py can be modified (model architecture, optimizer, hyperparameters)
  • prepare.py is read-only (evaluation function, data loading, constants)
  • No new dependencies allowed
  • Fixed 5-minute training time budget (TIME_BUDGET=300s)
  • Goal: lowest val_bpb (validation bits per byte)
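For intuition, val_bpb normalizes the model's cross-entropy loss by the raw byte count of the validation text, so a clever tokenizer can't game the metric. A sketch of the conversion, assuming loss is measured in nats per token and roughly 4 bytes per BPE token (prepare.py defines the authoritative computation):

```python
import math

def bits_per_byte(nll_nats_per_token, tokens, num_bytes):
    """Convert mean cross-entropy (nats/token) into bits per byte of raw text.
    Illustrative formula only; the repo's read-only prepare.py is authoritative."""
    total_bits = nll_nats_per_token * tokens / math.log(2)  # nats → bits
    return total_bits / num_bytes

# Example: a loss of 2.95 nats/token over 1000 tokens covering 4000 bytes
print(round(bits_per_byte(2.95, tokens=1000, num_bytes=4000), 3))  # → 1.064
```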

Project Structure

```
serverless-autoresearch/
├── train.py                    # Training script (agent modifies this)
├── prepare.py                  # Data prep + evaluation (read-only)
├── config.yaml                 # AWS & pipeline config (gitignored)
├── config.yaml.example         # Config template
├── program.md                  # Agent instructions
├── Makefile                    # make run, make dry-run, make cost, etc.
│
├── src/                        # Source code (cookiecutter-style)
│   ├── pipeline/               # Core evolution pipeline
│   │   ├── orchestrator.py     # Main evolution loop
│   │   ├── candidate_generator.py
│   │   ├── batch_launcher.py
│   │   ├── result_collector.py
│   │   └── selection.py
│   ├── sagemaker/              # SageMaker wrappers
│   │   ├── entry_point.py
│   │   └── train_wrapper.py
│   └── scripts/                # CLI utilities
│       ├── prepare_s3.py
│       ├── run_single.py
│       └── cost_report.py
│
├── data/raw/                   # Data references (actual data in S3)
├── models/                     # Trained model artifacts
├── notebooks/                  # Jupyter notebooks (analysis)
├── references/                 # Research notes & external references
├── experiments/                # Per-experiment reports & results
│   ├── 001-baseline-l40s/
│   └── 002-optimization-l40s/
├── docs/                       # Project documentation & diagrams
├── infrastructure/             # AWS IAM, Dockerfile, requirements
└── generations/                # Pipeline output (per-generation)
```

Configuration

config.yaml:

```yaml
aws:
  profile: personal
  region: ap-northeast-1          # Tokyo (H100 Spot available)
  role_arn: "arn:aws:iam::..."

sagemaker:
  instance_type: ml.p5.4xlarge    # H100 80GB
  use_spot: true
  max_run: 900                    # 15 min
  max_wait: 3600                  # 1 hour spot wait
  framework_version: "2.8.0"
  py_version: "py312"

pipeline:
  num_generations: 10
  population_size: 10
  num_conservative: 3
  num_moderate: 4
  num_aggressive: 2
  num_crossover: 1
```

Cost

| Component | Unit Cost | Qty | Total |
|-----------|-----------|-----|-------|
| ml.p5.4xlarge Spot (8 min/exp) | ~$0.04 | 100 | ~$4.00 |
| S3 storage | | | ~$0.10 |
| Total | | | ~$4.10 |

OMC Autopilot Integration

Run the full pipeline autonomously with oh-my-claudecode:

```
/autopilot

Read program.md and execute:
python -m pipeline.orchestrator --generations 10 --population 10

After completion, analyze results.tsv and summarize findings.
```

Experiments & Tutorials

| # | Experiment | GPU | val_bpb | Cost | Key Finding |
|---|-----------|-----|---------|------|-------------|
| 001 | Baseline on L40S | ml.g7e.4xlarge | 1.065 | $0.04 | Pipeline validated, SDPA fallback works |
| 002 | L40S Optimization | ml.g7e.4xlarge | TBD | TBD | Research notes |
| 003 | H100 Fair Comparison | ml.p5.4xlarge | TBD | TBD | Pending quota approval |

Choosing a Region: Spot Capacity Matters

Spot instance availability varies dramatically by region. Always check before running experiments.

```bash
# Quick check: Spot placement score (1-10, higher = better)
for region in us-east-1 us-east-2 us-west-2; do
  echo -n "$region: "
  aws ec2 get-spot-placement-scores \
    --instance-types g7e.4xlarge --target-capacity 1 \
    --single-availability-zone --region-names "$region" \
    --region "$region" \
    --query "max_by(SpotPlacementScores, &Score).Score" --output text
done
```

Our experience:

| Region | Spot Score | Result |
|--------|------------|--------|
| us-west-2 (Oregon) | 1-2 | Stuck in "Starting" for 30+ min |
| us-east-1 (Virginia) | 9 | Allocated in ~2 min |

See the Spot Capacity Guide for details, including price history checks and quota management.

Tutorial

Follow the full journey of building this project through conversational AI coding (vibe coding):

Vibe Coding Tutorial — 8 chapters from idea to autonomous ML evolution, with real prompts and debugging stories

| Chapters | Time | Cost | Key Topics |
|----------|------|------|------------|
| 8 | ~8 hours | $0.44 | Deep interview → pipeline design → SageMaker Spot → GPU debugging → autonomous evolution → insights |

Key Insights

Lessons learned from running 20+ experiments on SageMaker Spot. Full list: docs/insights.md

| # | Insight | Impact |
|---|---------|--------|
| 1 | Spot capacity varies 1-9 by region — always check placement scores first | Saves 30+ min of stuck jobs |
| 3 | DEVICE_BATCH_SIZE ≠ token throughput — increase TOTAL_BATCH_SIZE instead | Avoided wrong optimization path |
| 4 | Flash Attention 3 is Hopper-only; L40S needs FA2 or SDPA fallback | MFU 20% vs 40% |
| 5 | SageMaker startup overhead is 3 min per job (60% for 5-min training) | Scale up > scale out |
| 11 | Spot GPUs are valid proxies for large-scale training — architecture/optimizer rankings transfer, absolute LR/batch size don't | Cheap experiments ($0.04) inform expensive production runs |

Documentation

| Document | Description |
|----------|-------------|
| Key Insights | Battle-tested lessons from SageMaker Spot experiments (continuously updated) |
| Comparison Report | Original sequential vs serverless parallel pipeline — architecture, cost, search efficiency |
| GPU Cost Analysis | P5 (H100) vs P6 (B200/B300) pricing and performance for autoresearch workloads |
| Spot Capacity Guide | How to find available Spot capacity by region before running experiments |
| Architecture Diagram | System architecture (SageMaker + S3 + local orchestrator) |
| Sequential vs Parallel | Visual comparison of sequential and parallel experiment pipelines |

Credits

License

MIT

About

Parallel evolution pipeline for Karpathy's autoresearch on SageMaker Spot Training (H100). 10x faster with HUGI pattern.
