Emerge-Lab · eugenevinitsky · May 20, 2026
diff --git a/.gitignore b/.gitignore
@@ -81,7 +81,7 @@ instance/
 # Scrapy stuff:
 .scrapy
 
-# Sphinx documentation
+# Sphinx documentation (if sphinx is later added; docs/*.md is tracked)
 docs/_build/
 
 # PyBuilder
@@ -211,9 +211,6 @@ pufferlib/resources/drive/output*.gif
 # External local clones
 external/
 
-# Generated docs
-docs/
-
 # Claude config
 .claude/
 CLAUDE.local.md
diff --git a/README.md b/README.md
@@ -71,6 +71,8 @@ python scripts/submit_cluster.py \
 
 `scripts/cluster_configs/nyu_greene.yaml` defines `account`, `gpus`, `cpus`, `mem`, `time` — edit `account` to your allocation before first submit. `--container` makes `submit_cluster.py` wrap the job command in `singularity exec --nv --overlay $OVERLAY_PATH:ro $IMAGE_PATH ...`.
 
+**For a full guide to cluster setup and job submission, see [`docs/cluster_training.md`](docs/cluster_training.md).**
+
 ## Data
 
 Place binaries under `pufferlib/resources/drive/binaries/`.
@@ -146,6 +148,8 @@ renders/index.html           # sortable index of all episodes
 
 Open `renders/index.html` in a browser to triage. The index page filters by "failures only" / "replays only" and sorts by any metric column. Each row links to the per-episode viewer with the scene's full 2D animation.
 
+For the deeper guide — viewer features, `score_threshold` semantics, the required `--vec.backend Serial` flag, loading checkpoints with non-default `policy.*` dims, and the on-cluster `submit_cluster.py` pattern — see [`docs/mining.md`](docs/mining.md).
+
 ## Key Configuration (`pufferlib/config/ocean/drive.ini`)
 
 ### `[env]` — Simulation

diff --git a/docs/cluster_training.md b/docs/cluster_training.md
@@ -0,0 +1,180 @@
+# Cluster training — operational guide
+
+How to run PufferDrive training on a SLURM cluster. This is written with the NYU cluster in mind but it should mostly hold for any SLURM cluster.
+
+## A quick overview of the setup and launch process
+
+```bash
+# One-time per cluster: create the singularity overlay and install deps
+# into the venv (this also installs submitit and the other submission
+# deps as part of the project's pyproject.toml).
+./scripts/setup_container.sh create-overlay
+sbatch --account=<acct> --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \
+    --wrap "./scripts/setup_container.sh install"
+
+# If the code changed or was never built, rebuild the C code in the container ($LOGDIR: an existing log directory)
+sbatch --account=<acct> --partition=cpu_short --cpus-per-task=8 --mem=16gb --time=20 \
+    --chdir=$PWD -o $LOGDIR/rebuild_%j.log \
+    --wrap "./scripts/setup_container.sh rebuild"
+
+# Training: source the venv on the login node, then submit_cluster.py
+# with --container --heartbeat. --main defaults to RL training; override
+# it to launch other modes (e.g. mining, eval).
+source /scratch/$USER/venvs/pufferdrive/bin/activate
+python scripts/submit_cluster.py \
+    --save_dir /scratch/$USER/runs \
+    --compute_config scripts/cluster_configs/nyu_greene.yaml \
+    --program_config scripts/cluster_configs/train_base.yaml \
+    --container --heartbeat \
+    --account <acct> --partition <gpu-partition> --time 2880 \
+    --args train.checkpoint_interval=250 env.simulation_mode=gigaflow # use this to override config args
+```
+
+## Container model
+
+PufferDrive on Greene runs inside a singularity container. The container provides
+a modern glibc + CUDA toolkit; the project's Python environment lives in a venv
+on `/scratch` so installs aren't bottlenecked by the slow process of building a venv inside a container.
+
+The container is invoked with a **read-only** overlay mount for the miniforge3
+base interpreter, plus the on-disk venv for project packages. An example invocation:
+```bash
+singularity exec --nv \
+    --overlay /scratch/$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \
+    /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \
+    bash -c '
+        source /scratch/$USER/venvs/pufferdrive/bin/activate
+        export PYTHONNOUSERSITE=1
+        cd /scratch/$USER/code/PufferDrive
+        <your command>
+    '
+```
+
+## Submitting training — `submit_cluster.py`
+
+`scripts/submit_cluster.py` is the canonical submission path. It composes:
+- a `compute_config` YAML (SLURM settings)
+- a `program_config` YAML (pufferl training args)
+- `--args` CLI overrides
+- a `singularity exec` wrapper around the inner train command, when `--container` is set
+- an optional GPU heartbeat, injected when `--heartbeat` is set. WARNING: the heartbeat exists specifically to keep our jobs on the torch cluster from being killed. No one else should use it.
+
+It also performs code isolation (symlinking the top-level entries and
+hard-copying `pufferlib/` into a per-run sandbox) and hands the package to
+`submitit` for submission via `sbatch`.
+
+### Source the venv before invoking `submit_cluster.py`
+
+`setup_container.sh install` puts submitit + its deps into the project
+venv at `/scratch/$USER/venvs/pufferdrive/`. Sourcing the venv on the
+login node makes that submitit importable and lines up `sys.executable`
+with the same venv python that the compute node will run, so submitit's
+serialization round-trips cleanly.
+
+```bash
+source /scratch/$USER/venvs/pufferdrive/bin/activate
+python scripts/submit_cluster.py \
+    --save_dir /scratch/$USER/runs \
+    --prefix mytrain \
+    --compute_config scripts/cluster_configs/nyu_greene.yaml \
+    --program_config scripts/cluster_configs/train_base.yaml \
+    --account <acct> --partition <gpu-partition> --time 2880 \
+    --container \
+    --heartbeat \
+    --args \
+        train.total_timesteps=10000000000 \
+        train.checkpoint_interval=250
+```
+
+Key flags:
+
+| Flag | Effect |
+|---|---|
+| `--container` | wraps both submitit's outer launcher and the inner train command in `singularity exec --nv --overlay $OVERLAY:ro $IMAGE` |
+| `--heartbeat` | wraps the train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py`, which keeps the cluster from killing your job for low GPU usage |
+| `--args key=value ...` | passes nested config keys (underscores converted to dashes) as `--key value` on the torchrun line; e.g. `env.simulation_mode=replay` becomes `--env.simulation-mode replay` |
+| `--account` / `--partition` / `--time` | override `compute_config` SLURM settings |
+
+### GPU heartbeat — required for long runs
+
+`--heartbeat` is not optional for jobs over ~2 hours. Without it, the
+cluster's idle-GPU reclaimer issues a `scancel` from `uid 0` (root) during
+the first eval / checkpoint dip in GPU utilization.
+
+`scripts/gpu_heartbeat.py` monitors `nvidia-smi` and runs short matmul bursts
+when utilization drops below 65%, so the cluster always sees the GPU as
+active. It cooperates with training and steps aside when training is busy.
+
+### Environment knobs the container path sets
+
+When `--container` is on, the inner bash command has these env vars set
+before `cd $PROJECT_ROOT && <train>`:
+
+```bash
+source /scratch/$USER/venvs/pufferdrive/bin/activate
+export PYTHONNOUSERSITE=1
+export XDG_CACHE_HOME=/scratch/$USER/cache
+export WANDB_CACHE_DIR=/scratch/$USER/wandb_cache
+export WANDB_CONFIG_DIR=/scratch/$USER/wandb_config
+export WANDB_DATA_DIR=/scratch/$USER/wandb_data
+export WANDB_DIR=/scratch/$USER/wandb_data
+```
+
+## CPU rebuild path
+
+GPU partitions are routinely saturated by training jobs. `setup_container.sh
+rebuild` doesn't need a GPU — submit to a CPU partition for fast turnaround:
+
+```bash
+sbatch --account=<general-account> --partition=cpu_short \
+    --cpus-per-task=8 --mem=16gb --time=20 \
+    --chdir=$PWD \
+    -o /scratch/$USER/rebuild_logs/rebuild_%j.log \
+    --wrap "./scripts/setup_container.sh rebuild"
+```
+
+`--chdir=$PWD` is required because the wrapped command uses the relative path `./scripts/`. The rebuild takes ~40s.
+
+### Common pitfalls
+
+- **`ncclCommShrink` undefined symbol** at `from torch._C import *`. Greene's
+  cuda12.8.1 sif ships `libnccl 2.25.1` in `/usr/lib`, but torch ≥ 2.10 calls
+  `ncclCommShrink` from NCCL ≥ 2.27.5. torch's own NCCL 2.27.5 sits in
+  `site-packages/nvidia/nccl/lib/` and needs to win the loader search.
+  `setup_container.sh install`/`rebuild` patches `/ext3/env.sh` to prepend that
+  dir to `LD_LIBRARY_PATH`; existing overlays from before that patch need the
+  same line appended to `/ext3/env.sh`.
+- **`-lomp5` link errors on Linux** with conda-forge openmp. The default flag targets
+  older Intel OpenMP packaging; override it with `OMP_LIB="-L$prefix/lib -lomp"`, which `setup.py` honors.
+- **`du /ext3` undercounts** when the overlay has cruft outside `upper/ext3/`
+  (e.g. failed pip installs that wrote to `/usr/local/lib/...` end up in
+  `upper/usr/local/` and aren't visible to apptainer's view). Use
+  `debugfs -R "ls /upper" overlay.ext3` from a login node to inspect.
+
+### `TORCH_CUDA_ARCH_LIST`: a quick warning that won't generally be an issue
+
+PufferDrive's C extension contains CUDA kernels. When `setup.py build_ext`
+compiles them, `nvcc` emits machine code for each architecture listed in
+the `TORCH_CUDA_ARCH_LIST` env var (and only those); the result is a fat binary
+containing one variant per arch. If the env var is unset, the build defaults to
+whatever GPU was visible at build time, which is often just one architecture.
+
+On Greene, you frequently don't get to choose which GPU you land on.
+`_general` accounts queue across L40S
+(sm_89), H100 (sm_90), and H200 (sm_90); `_tandon_*` partitions add A100
+(sm_80). If the `_C.so` was built against only sm_80 and your job lands on
+an H100, every CUDA call into the extension dies with
+`no kernel image is available for execution on the device`.
+
+Setting `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 / L40S / H100+H200
+in one fat binary. The build is a bit slower (three variants instead of
+one) and the `.so` is a bit larger, but the resulting binary runs on every
+GPU Greene routes you to.
+
+`setup_container.sh rebuild` exports this automatically for the build step,
+so a fresh rebuild on the cluster is already multi-arch. The env var only
+matters when you build the C extension **outside** the rebuild wrapper —
+e.g. an interactive `python setup.py build_ext --inplace --force` inside a
+hand-launched singularity exec. Adding the export to your shell profile
+(or sourcing it before any manual build) saves you from hitting the "no
+kernel image" error after a quick fix-and-rebuild loop.
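+
+A minimal sketch of such a manual build, assuming you are already inside the
+container with the venv activated. The arch list and the build command are the
+ones quoted in this section; nothing else is implied about the wrapper's internals:
+
+```bash
+# Build variants for A100 (8.0), L40S (8.9), and H100/H200 (9.0) in one pass.
+export TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"
+python setup.py build_ext --inplace --force
+```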
diff --git a/docs/mining.md b/docs/mining.md
@@ -0,0 +1,167 @@
+# Failure mining workflow
+
+How to roll a trained policy out, capture compact replays, and produce a
+browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures`
+and `pufferlib/mining_viz.py`.
+
+## TL;DR
+
+```bash
+# Roll the policy out for 100 episodes, save compact replays for episodes
+# whose episode_return falls below the threshold, render HTML for each +
+# a sortable index.
+puffer mine_failures puffer_drive \
+    --load-model-path /path/to/model_011000.pt \
+    --mine.output-dir ./failure_mining/baseline_011000 \
+    --mine.num-episodes 100 \
+    --mine.score-threshold 1e9 \
+    --vec.backend Serial
+```
+
+Outputs:
+
+```
+./failure_mining/baseline_011000/
+    replays/episode_NNNNNN.replay.zlib   one per saved episode
+    renders/episode_NNNNNN.html          per-replay viewer
+    renders/index.html                   sortable summary
+    episodes.csv                         all episodes, all metrics
+```
+
+Open the index in a browser:
+
+```bash
+open ./failure_mining/baseline_011000/renders/index.html  # macOS; use xdg-open on Linux
+```
+
+## What gets captured
+
+A compact replay bundle is a pickled+zlib'd `schema_version=2` dict containing
+per-step agent state, traffic state, and observation arrays for a single
+episode. Bundles are produced C-side when `capture_compact_replay=True` is
+passed to `Drive(...)`. `mine_failures` sets this automatically.
+
+Each saved bundle is paired with a metadata row in `episodes.csv` including
+`episode_return`, `collision_rate`, `offroad_rate`, `num_goals_reached`,
+`avg_distance_per_infraction`, etc. The HTML viewer (`pufferlib/mining_viz.py`)
+reads the bundle and replays it in-browser on a top-down canvas, with optional
+overlays for the agent's observed FOV, partner circle, goal route, and waypoint
+markers.
+
+## `mine.score_threshold` selection
+
+The save rule is "write replay if and only if `episode_return < score_threshold`".
+
+- `--mine.score-threshold 1e9` captures every episode (any real return is
+  less than 1e9).
+- `--mine.score-threshold 0` captures only negative-return ("true failure")
+  episodes.
+- Default `-inf` captures **nothing** — useful only if you want `episodes.csv`
+  metrics without the bundle overhead.
+
+`episodes.csv` always contains all N episodes' metadata regardless of
+threshold; only the bundle save + HTML render is gated.
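+
+For example, to keep only negative-return episodes from the TL;DR run, change
+just the threshold and leave every other flag as shown there:
+
+```bash
+# Same rollout as the TL;DR, but only episodes with episode_return < 0 are saved.
+puffer mine_failures puffer_drive \
+    --load-model-path /path/to/model_011000.pt \
+    --mine.output-dir ./failure_mining/baseline_011000 \
+    --mine.num-episodes 100 \
+    --mine.score-threshold 0 \
+    --vec.backend Serial
+```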
+
+## `--vec.backend Serial`
+
+Mining must use `--vec.backend Serial`. The drive.ini default
+`Multiprocessing` backend forks workers after torch is imported, which deadlocks
+on CUDA in the child process. The symptom is a parent process at 100% CPU with
+no visible progress and no `[mine_failures] target episodes=...` print.
+
+`Serial` keeps the env in the same process as the policy. Mining is a single
+env / single rollout workflow, so the throughput cost is negligible.
+
+## Tuning the rollout config
+
+The mining env config comes from drive.ini's `[mine]` section plus any CLI
+overrides:
+
+```bash
+# Larger output (slower):
+--mine.num-episodes 500
+
+# Replay mode (drive recorded nuPlan / Waymo scenarios):
+--env.simulation-mode replay \
+--env.control-mode control_sdc_only \
+--env.map-dir /path/to/recorded_bins \
+--env.init-steps 10 \
+--env.scenario-length 200
+
+# Looser goal radius (default 2 m, up to 12 m under reward randomization):
+--env.goal-radius 6
+
+# Closer-spaced goals:
+--env.min-waypoint-spacing 10 \
+--env.max-waypoint-spacing 15
+```
+
+## Loading checkpoints with non-default architecture
+
+`mine_failures` does not read the sibling `config.yaml` next to
+`load_model_path` (only `pufferl.train` does). If the checkpoint was trained
+with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`,
+`backbone_num_layers=4`), pass them on the CLI to match the saved state dict:
+
+```bash
+--policy.input-size 128 \
+--policy.actor-hidden-size 512 \
+--policy.actor-num-layers 0 \
+--policy.backbone-hidden-size 512 \
+--policy.backbone-num-layers 4 \
+--policy.critic-hidden-size 512 \
+--policy.critic-num-layers 0 \
+--policy.encoder-gigaflow True \
+--policy.split-network False \
+--rnn.hidden-size 512 \
+--rnn.input-size 512
+```
+
+You can read the right values out of the checkpoint's sibling `config.yaml`
+(under `policy:` and `rnn:`) and pass them through. The error if you forget
+is a wall of `size mismatch for ...` lines from `policy.load_state_dict`.
+
+## On the cluster
+
+Mining is GPU-bound on the policy forward pass but memory-light compared to
+training (single env, no rollout buffer, no PPO update). 48 GB RAM and a
+60-minute time limit are plenty for 100 episodes. The same `submit_cluster.py`
+flow as training works — override `--main` to invoke `mine_failures`:
+
+```bash
+python scripts/submit_cluster.py \
+    --save_dir /scratch/$USER/runs \
+    --prefix mine \
+    --compute_config scripts/cluster_configs/nyu_greene.yaml \
+    --account <acct> --partition <gpu-partition> --time 60 \
+    --mem 48gb --cpus 8 \
+    --container \
+    --main "-m pufferlib.pufferl mine_failures puffer_drive" \
+    --args \
+        load_model_path=<path-to-ckpt> \
+        mine.output_dir=/scratch/$USER/failure_mining/out \
+        mine.num_episodes=100 \
+        mine.score_threshold=1e9 \
+        vec.backend=Serial
+```
+
+Source the venv before invoking `submit_cluster.py` (`source
+/scratch/$USER/venvs/pufferdrive/bin/activate`) — see
+[`docs/cluster_training.md`](cluster_training.md) for the rationale.
+
+Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing.
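+
+A sketch of that pull-down, run from your local machine. `<user>` and
+`<login-host>` are placeholders for your own account and the cluster's login
+node, and the remote path assumes the `mine.output_dir` from the submission
+example above:
+
+```bash
+# -a preserves the directory layout, -v lists files, -z compresses in transit.
+rsync -avz <user>@<login-host>:/scratch/<user>/failure_mining/out/ ./failure_mining/out/
+```
+
+Then open `./failure_mining/out/renders/index.html` in a local browser.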
+
+## Viewer features (`mining_viz.py`)
+
+The per-episode HTML viewer supports:
+
+- Frame scrubber + play/pause + speed control.
+- Toggle observation overlay (FOV rectangle, partner circle, observed-entity
+  highlights, goal route, waypoint markers).
+- Toggle road segment / road edge / lane line rendering.
+- Map background (CARLA / nuPlan / Waymo road graph from the bundle's
+  embedded `simulation_mode`).
+
+The index (`renders/index.html`) is a sortable table linking to each per-episode
+HTML, with the metadata columns from `episodes.csv` (failure metrics, scenario
+ID, map name).