From b992f62329d22e869614e2437b70f832d4daf8a9 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 07:06:49 -0500
Subject: [PATCH 01/20] docs: cluster training + mining operational guides
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add two new docs and link them from README.md:

- docs/cluster_training.md — sbatch template (incl. GPU heartbeat,
  required for jobs > ~2h or the cluster's idle-reclaimer will
  scancel them), CPU rebuild path (cpu_short partition; no GPU
  needed for build_ext), account/partition strategy (QOSGrpGRES vs
  QOSMaxGRESPerUser, _tandon_priority vs _general tiers, when to
  race partitions), replay-mode memory sizing (vec.num_envs lever,
  [eval.*] suite cost at first eval), and submit_cluster.py
  failure modes (login-node python lacks pip; submitit's srun
  launcher inherits the in-container venv python path).

- docs/mining.md — mine_failures workflow, the score_threshold
  default-captures-nothing gotcha (docstring is misleading; -inf
  means no replay is saved), the pufferlib.vector Multiprocessing
  CUDA-after-fork hang and --vec.backend Serial workaround, the
  shape-mismatch gotcha when load_model_path checkpoints have
  non-default policy.* / rnn.* dimensions (mine_failures doesn't
  do the sibling config.yaml auto-merge that train() does), and
  the on-cluster sbatch pattern.

.gitignore: un-ignore docs/ (was a blanket rule that prevented
checking in markdown docs alongside the existing eval_unification.md
spec); add failure_mining/ which is large local-only mining output.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .gitignore               |   7 +-
 README.md                |   4 +
 docs/cluster_training.md | 254 +++++++++++++++++++++++++++++++++++++++
 docs/mining.md           | 179 +++++++++++++++++++++++++++
 4 files changed, 442 insertions(+), 2 deletions(-)
 create mode 100644 docs/cluster_training.md
 create mode 100644 docs/mining.md
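
Note on the `score_threshold` gotcha documented in docs/mining.md: the claim
is easy to check in isolation. The following is a minimal sketch, assuming the
selection rule is exactly the `episode_return < score_threshold` comparison
quoted in the doc. `should_save_replay` is a hypothetical helper written for
illustration, not a function in pufferl.

```python
import math


def should_save_replay(episode_return: float, score_threshold: float) -> bool:
    """Hypothetical stand-in for the rule quoted in docs/mining.md."""
    return episode_return < score_threshold


returns = [-3.5, 0.0, 12.0]

# Default threshold of -inf: no return is strictly below it, so nothing is saved.
saved_default = [r for r in returns if should_save_replay(r, -math.inf)]

# A very large threshold captures every episode with a finite return.
saved_all = [r for r in returns if should_save_replay(r, 1e9)]

# Threshold 0 keeps only the negative-return episodes.
saved_failures = [r for r in returns if should_save_replay(r, 0.0)]

print(saved_default, saved_all, saved_failures)
```

Under that assumption, the default selects no episodes, which matches the
doc's advice to pass an explicit threshold.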

diff --git a/.gitignore b/.gitignore
index fd4c1de0a2..390f3dd7a2 100644
--- a/.gitignore
+++ b/.gitignore
@@ -211,9 +211,12 @@ pufferlib/resources/drive/output*.gif
 # External local clones
 external/
 
-# Generated docs
-docs/
+# Generated docs (sphinx build output only; docs/*.md is tracked)
+# docs/
 
 # Claude config
 .claude/
 CLAUDE.local.md
+
+# Mining output artifacts (large local-only renders/replays)
+failure_mining/
diff --git a/README.md b/README.md
index aca5182bee..489aa057d5 100644
--- a/README.md
+++ b/README.md
@@ -71,6 +71,8 @@ python scripts/submit_cluster.py \
 
 `scripts/cluster_configs/nyu_greene.yaml` defines `account`, `gpus`, `cpus`, `mem`, `time` — edit `account` to your allocation before first submit. `--container` makes `submit_cluster.py` wrap the job command in `singularity exec --nv --overlay $OVERLAY_PATH:ro $IMAGE_PATH ...`.
 
+For an operational deep-dive — sbatch templates, the GPU heartbeat (required for runs > ~2h or the idle-GPU reclaimer will scancel them), CPU rebuild path, account/partition strategy, replay-mode memory sizing, and known `submit_cluster.py` failure modes — see [`docs/cluster_training.md`](docs/cluster_training.md).
+
 ## Data
 
 Place binaries under `pufferlib/resources/drive/binaries/`.
@@ -146,6 +148,8 @@ renders/index.html           # sortable index of all episodes
 
 Open `renders/index.html` in a browser to triage. The index page filters by "failures only" / "replays only" and sorts by any metric column. Each row links to the per-episode viewer with the scene's full 2D animation.
 
+For the deeper guide — viewer features, the `score_threshold` semantic gotcha (default `-inf` saves nothing — not what the `mine_failures` docstring claims), the `Multiprocessing`-backend CUDA-after-fork hang and the `--vec.backend Serial` workaround, the shape-mismatch gotcha when loading checkpoints with non-default `policy.*` dims, and the on-cluster sbatch pattern — see [`docs/mining.md`](docs/mining.md).
+
 ## Key Configuration (`pufferlib/config/ocean/drive.ini`)
 
 ### `[env]` — Simulation
diff --git a/docs/cluster_training.md b/docs/cluster_training.md
new file mode 100644
index 0000000000..7b8fa365c8
--- /dev/null
+++ b/docs/cluster_training.md
@@ -0,0 +1,254 @@
+# Cluster training — operational guide
+
+How to run PufferDrive training on a SLURM cluster. Written against the NYU
+Greene workflow but the patterns generalize. Pairs with `scripts/setup_container.sh`,
+`scripts/gpu_heartbeat.py`, and `scripts/submit_cluster.py`.
+
+## TL;DR
+
+```bash
+# One-time per cluster: create the singularity overlay and install deps.
+./scripts/setup_container.sh create-overlay
+sbatch --account=<acct> --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \
+    --wrap "./scripts/setup_container.sh install"
+
+# Per code change to C extensions: rebuild on a CPU partition (no GPU needed).
+sbatch --account=<acct> --partition=cpu_short --cpus-per-task=8 --mem=16gb --time=20 \
+    --chdir=$PWD -o $LOGDIR/rebuild_%j.log \
+    --wrap "./scripts/setup_container.sh rebuild"
+
+# Training: direct sbatch with inline singularity-exec + heartbeat.
+#   (`submit_cluster.py` has known limitations on this branch lineage —
+#   see "submit_cluster.py" below.)
+sbatch /path/to/my_train.sh   # template in this doc
+```
+
+## Container model
+
+PufferDrive on Greene runs inside a singularity container. The container provides
+a modern glibc + CUDA toolkit; the project's Python environment lives in a venv
+on `/scratch` (not in the overlay) so installs aren't bottlenecked by fuse2fs.
+
+The container is invoked with a **read-only** overlay mount for the miniforge3
+base interpreter, plus the on-disk venv for project packages:
+
+```bash
+singularity exec --nv \
+    --overlay /scratch/$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \
+    /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \
+    bash -c '
+        source /scratch/$USER/venvs/pufferdrive/bin/activate
+        export PYTHONNOUSERSITE=1
+        cd /scratch/$USER/code/PufferDrive
+        <your command>
+    '
+```
+
+Sourcing the venv's `bin/activate` is **required**: `/ext3/env.sh` alone gives you
+a torch-less base interpreter (it imports as a namespace-package stub with
+`torch.__file__ == None`).
+
+## Training sbatch template
+
+The minimal template below uses a direct `sbatch` (no `submit_cluster.py`),
+includes the GPU heartbeat to prevent idle-reclamation, and wraps everything in
+a singularity-exec. Adapt the `--account`, `--partition`, paths, and CLI args:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=mytrain
+#SBATCH --account=<your-account>
+#SBATCH --partition=<gpu-partition>
+#SBATCH --gres=gpu:1
+#SBATCH --cpus-per-task=16
+#SBATCH --mem=96gb
+#SBATCH --time=2880          # 48h
+#SBATCH -o /scratch/$USER/runs/logs/train_%j.log
+
+singularity exec --nv \
+  --overlay /scratch/$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \
+  /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \
+  bash -c "
+    source /scratch/$USER/venvs/pufferdrive/bin/activate
+    export PYTHONNOUSERSITE=1
+    export TORCH_CUDA_ARCH_LIST=\"8.0;8.9;9.0\"
+    export XDG_CACHE_HOME=/scratch/$USER/cache
+    export WANDB_DIR=/scratch/$USER/wandb_data
+    cd /scratch/$USER/code/PufferDrive
+
+    # GPU heartbeat: keeps utilization above 65% during eval/checkpoint dips
+    # so the cluster's idle-GPU reclaimer doesn't kill the job (root scancel
+    # at ~2h is the symptom).
+    python scripts/gpu_heartbeat.py > /tmp/gpu_heartbeat.log 2>&1 &
+    HB_PID=\$!
+
+    torchrun --standalone --nproc_per_node 1 -m pufferlib.pufferl train puffer_drive \
+        --train.total-timesteps 10000000000 \
+        --train.checkpoint-interval 250 \
+        --wandb --wandb-project pufferdrive \
+        --train.data-dir /scratch/$USER/runs/mytrain
+
+    TRAIN_EXIT=\$?
+    kill \$HB_PID 2>/dev/null
+    exit \$TRAIN_EXIT
+"
+```
+
+`TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 (sm_80), L40S/H100 (sm_89/90),
+and H200 (sm_90). Without it the C extension is compiled only for the build
+host's GPU type and crashes on different hardware with `no kernel image is available`.
+
+## GPU heartbeat — required for long runs
+
+Without `scripts/gpu_heartbeat.py` backgrounded alongside training, jobs lasting
+~2 hours risk **CANCELLED by uid 0** from the cluster's idle-GPU reclaimer.
+Eval / checkpoint / map-load phases dip GPU utilization briefly, and the
+reclaimer interprets those dips as "idle".
+
+The heartbeat monitors `nvidia-smi` and runs short matmul bursts when
+utilization drops below 65%, so the cluster always sees the GPU as active.
+It cooperates with real training (steps aside when training is active).
+
+## CPU rebuild path
+
+GPU partitions are routinely saturated by training jobs of this same project.
+`setup_container.sh rebuild` doesn't actually need a GPU — it just runs
+`python setup.py build_ext --inplace --force` plus a smoke import. Submit to a
+CPU partition for fast turnaround:
+
+```bash
+sbatch --account=<general-account> --partition=cpu_short \
+    --cpus-per-task=8 --mem=16gb --time=20 \
+    --chdir=$PWD \
+    -o /scratch/$USER/rebuild_logs/rebuild_%j.log \
+    --wrap "./scripts/setup_container.sh rebuild"
+```
+
+`--chdir=$PWD` is required because the script uses `./scripts/`. Takes ~40s.
+
+## Account / partition strategy
+
+NYU Greene exposes `_general` and `_tandon_priority` account tiers, each with
+their own QOS pool per partition. When `squeue` shows your job pending on
+`QOSGrpGRES`, the issue is partition-level pool saturation — **switching
+accounts within the same tier doesn't help**, but switching partitions does.
+
+`QOSMaxGRESPerUser` is different: you're over your own concurrent-GPU cap.
+Cancel a pending job or wait.
+
+Practical recipe:
+
+- For short jobs (rebuilds, eval, mining): try `cpu_short` if CPU-only; else
+  `h200_public + *_general`. Often the fastest GPU slot.
+- For long training: race 2–3 GPU partitions in parallel and cancel the
+  losers as soon as one starts. `tandon_priority` accounts often unblock when
+  `_general` pools are pinned. `l40s_public` typically has multi-hour queues
+  and is the last resort.
+
+Quick test-only across combos:
+
+```bash
+for combo in \
+    "<acct-priority> a100_tandon" \
+    "<acct-priority> h100_tandon" \
+    "<acct-general>  h200_public"; do
+  read ACCT PART <<< "$combo"
+  RES=$(sbatch --test-only --account=$ACCT --partition=$PART \
+        --gres=gpu:1 --cpus-per-task=16 --mem=96gb --time=2880 \
+        --wrap "echo test" 2>&1 | head -1)
+  echo "$ACCT $PART -> $RES"
+done
+```
+
+`--test-only` prints an estimated start time without actually submitting.
+
+## Memory sizing — replay mode is heavier than gigaflow
+
+Gigaflow training with `num_agents=1024` fits comfortably in 96 GB on Greene.
+Replay-mode training on nuPlan does not — each sub-env loads its own bin file
+(parsed lane graph + per-agent trajectories), so `--mem=96gb` OOMs.
+
+Levers, in order of impact:
+
+- `--vec.num-envs N` (drive.ini default `20`). Each vec worker is a fork; each
+  worker holds copy-on-write-divergent state proportional to `num_agents/num_envs`
+  + the loaded map data. Halving from 20→10 saves ~25 GB.
+- Disable subsets of `[eval.*]` evaluators via CLI overrides. The 14 enabled
+  evaluators in `drive.ini` all spin up their own `pufferlib.vector.make` envs
+  at the first eval cycle and can collectively cost 30–50 GB at peak.
+  `[eval.validation_gigaflow]` specifically renders 8 × 1080p MP4s in parallel.
+- `--mem=128gb` or `--mem=192gb` if you need the eval signal in wandb.
+
+`vec.*` keys are **not** in pufferl's `KEYS_OF_INTEREST` auto-merge, so a
+sibling `config.yaml` next to a `load_model_path` won't override them. They
+come from `drive.ini` or the CLI.
+
+## submit_cluster.py — known limitations
+
+`scripts/submit_cluster.py` wraps the training launch in submitit + a heartbeat
+wrapper. On the `emerge/temp_training`-derived branch lineage it doesn't work
+end-to-end:
+
+1. Login-node `/usr/bin/python3` lacks `pip` → can't `pip install submitit`
+   on the login node. The venv's `pip` shebang points at
+   `/ext3/miniforge3/bin/python3` (overlay-internal) so `pip install` outside
+   the container errors with "required file not found".
+2. Running `submit_cluster.py` *inside* the container makes submitit's `srun`
+   launcher inherit the venv python path (`/scratch/.../venvs/.../python`).
+   On the compute node `srun` tries to invoke that path *outside* singularity
+   and fails with `execve(): No such file or directory`. submit_cluster.py
+   wraps the *inner* train command in singularity-exec but the *outer* launcher
+   is not wrapped.
+
+Workaround if you really want submitit + sbatch: bind the slurm dirs into the
+container so the in-container python can see sbatch and call it directly:
+
+```bash
+singularity exec --nv \
+    --bind /opt/slurm:/opt/slurm \
+    --bind /run/munge:/run/munge \
+    --bind /etc/passwd:/etc/passwd \
+    --bind /etc/group:/etc/group \
+    --overlay overlay.ext3:ro \
+    $SIF bash -c 'PATH=/opt/slurm/bin:$PATH ...submit_cluster.py...'
+```
+
+This gets the submission through (real SLURM job ID), but the **submitted job
+itself** still hits (2) above unless you also bind those dirs into the launched
+container, which submit_cluster.py doesn't do.
+
+**Recommended**: use the direct-sbatch template from this doc. The heartbeat
+is a 4-line bash addition; you don't need submitit for that.
+
+## Common pitfalls
+
+- **`ncclCommShrink` undefined symbol** at `from torch._C import *`. Greene's
+  cuda12.8.1 sif ships `libnccl 2.25.1` in `/usr/lib`, but torch ≥ 2.10 calls
+  `ncclCommShrink` from NCCL ≥ 2.27.5. torch's own NCCL 2.27.5 sits in
+  `site-packages/nvidia/nccl/lib/` and needs to win the loader search.
+  `setup_container.sh install`/`rebuild` patches `/ext3/env.sh` to prepend that
+  dir to `LD_LIBRARY_PATH`; existing overlays from before that patch need the
+  same line appended to `/ext3/env.sh`.
+- **`-lomp5` link errors on Linux** with conda-forge openmp. The default is for
+  older Intel OpenMP packaging. `setup.py` honors `OMP_LIB="-L$prefix/lib -lomp"`.
+- **`du /ext3` undercounts** when the overlay has cruft outside `upper/ext3/`
+  (e.g. failed pip installs that wrote to `/usr/local/lib/...` end up in
+  `upper/usr/local/` and aren't visible to apptainer's view). Use
+  `debugfs -R "ls /upper" overlay.ext3` from a login node to inspect.
+- **Squash-merging stacked PRs** can hit "stale info" on `--force-with-lease`
+  when the token URL differs from `origin`. Either fetch first or use
+  `--force` with care.
+
+## Don't chain `sleep` to wait on background jobs
+
+Repeatedly polling a submitted job's state in short `sleep N` loops is hard on
+the SLURM controller and brittle. Patterns that work:
+
+- **One-shot wait**: a single `sacct -j $JOBID --format=State -n -P` after a
+  generous initial sleep tuned to expected runtime.
+- **Conditional wait**: a `Monitor`-style `until` loop in a single background
+  shell, with a sane upper bound.
+- **Wall-clock interval**: schedule a wake-up rather than long-running `sleep`.
+
+Hammering `squeue` in a tight loop is bad cluster citizenship — the controller
+is shared across all users. Sleep at least 60 s between checks.
diff --git a/docs/mining.md b/docs/mining.md
new file mode 100644
index 0000000000..e677f81ced
--- /dev/null
+++ b/docs/mining.md
@@ -0,0 +1,179 @@
+# Failure mining workflow
+
+How to roll a trained policy out, capture compact replays, and produce a
+browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures`
+and `pufferlib/mining_viz.py`.
+
+## TL;DR
+
+```bash
+# Roll the policy out for 100 episodes, save compact replays for "failures",
+# render HTML for each + a sortable index.
+puffer mine_failures puffer_drive \
+    --load-model-path /path/to/model_011000.pt \
+    --mine.output-dir ./failure_mining/baseline_011000 \
+    --mine.num-episodes 100 \
+    --vec.backend Serial             # see "Multiprocessing hang" below
+
+# Outputs:
+#   ./failure_mining/baseline_011000/
+#     replays/episode_NNNNNN.replay.zlib   ← one per failed episode
+#     renders/episode_NNNNNN.html          ← per-replay viewer
+#     renders/index.html                   ← sortable summary
+#     episodes.csv                         ← all episodes, all metrics
+```
+
+Open the index in a browser:
+
+```bash
+open ./failure_mining/baseline_011000/renders/index.html
+```
+
+## What gets captured
+
+A compact replay bundle is a pickled+zlib'd `schema_version=2` dict containing
+per-step agent state, traffic state, and observation arrays for a single
+episode. Bundles are produced **C-side** when `capture_compact_replay=True`
+is passed to `Drive(...)`. `mine_failures` sets this automatically.
+
+Each saved bundle is paired with a metadata row in `episodes.csv` including
+`episode_return`, `collision_rate`, `offroad_rate`, `num_goals_reached`,
+`avg_distance_per_infraction`, etc. The HTML viewer (`pufferlib/mining_viz.py`)
+reads the bundle and replays it in-browser on a top-down canvas, with optional
+overlays for the agent's observed FOV, partner circle, goal route, and waypoint
+markers.
+
+## `mine.score_threshold` — gotcha
+
+The `mine_failures` selection rule is "save replay if and only if
+`episode_return < score_threshold`". The docstring claims `-inf` means "capture
+every episode" — that's wrong: `episode_return < -inf` is never true, so the
+default captures **nothing**. To actually save episodes:
+
+```bash
+# Capture every episode (works with any non-degenerate return):
+--mine.score-threshold 1e9
+
+# Capture only "true" failures (negative returns):
+--mine.score-threshold 0
+```
+
+`episodes.csv` always contains all N episodes' metadata regardless of threshold
+— only the bundle save + HTML render is gated.
+
+## Multiprocessing hang — use `--vec.backend Serial`
+
+`pufferl.mine_failures` goes through `pufferlib.vector.make(...)` with the
+drive.ini default `backend=Multiprocessing`. Even with `num_envs=1,
+num_workers=1`, that backend **forks** workers post-torch-import. Forking after
+torch has been imported in the parent is a classic deadlock for CUDA — the
+child can hang on CUDA initialization, and the parent sits forever on the IPC
+pipe.
+
+Symptoms: CPU 100% in the parent, RSS frozen, no `[mine_failures] target
+episodes=...` print, never produces output. If you let it sit for ~10 minutes
+nothing changes.
+
+Fix: force the in-process backend.
+
+```bash
+--vec.backend Serial
+```
+
+This keeps the env in the same process as the policy. No fork, no hang. The
+single-env nature of mining means the throughput cost is negligible.
+
+## Tuning the rollout config
+
+The mining env config comes from drive.ini's `[mine]` section plus per-CLI
+overrides. Useful knobs:
+
+```bash
+# Larger output (slower):
+--mine.num-episodes 500
+
+# Replay mode (drive recorded nuPlan / Waymo scenarios):
+--env.simulation-mode replay \
+--env.control-mode control_sdc_only \
+--env.map-dir /path/to/recorded_bins \
+--env.init-steps 10 \
+--env.scenario-length 200
+
+# Looser goal radius (useful if the trained policy struggles with the
+# stricter default; default 2m, max 12m under reward randomization):
+--env.goal-radius 6
+
+# Closer-spaced goals (mining a policy that wasn't trained on these):
+--env.min-waypoint-spacing 10 \
+--env.max-waypoint-spacing 15
+```
+
+## Resume + obs-shape gotcha
+
+`mine_failures` does **not** read the sibling `config.yaml` next to
+`load_model_path` — only `pufferl.train` does. If the checkpoint was trained
+with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`,
+`backbone_num_layers=4`), you'll get a shape mismatch on `load_state_dict`
+unless you pass them on the CLI:
+
+```bash
+--policy.input-size 128 \
+--policy.actor-hidden-size 512 \
+--policy.actor-num-layers 0 \
+--policy.backbone-hidden-size 512 \
+--policy.backbone-num-layers 4 \
+--policy.critic-hidden-size 512 \
+--policy.critic-num-layers 0 \
+--policy.encoder-gigaflow True \
+--policy.split-network False \
+--rnn.hidden-size 512 \
+--rnn.input-size 512
+```
+
+You can read the right values out of the checkpoint's sibling `config.yaml`
+(under `policy:` and `rnn:`) and pass them through.
+
+## On the cluster
+
+Mining is GPU-bound on the policy forward pass but memory-light compared to
+training (single env, no rollout buffer, no PPO update). 48 GB RAM and a
+60-minute time limit are plenty for 100 episodes:
+
+```bash
+sbatch --account=<acct> --partition=<gpu-partition> --gres=gpu:1 \
+    --cpus-per-task=8 --mem=48gb --time=60 \
+    --chdir=$PWD -o $LOGDIR/mine_%j.log \
+    --wrap "
+        singularity exec --nv \
+          --overlay /scratch/\$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \
+          /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \
+          bash -c '
+            source /scratch/\$USER/venvs/pufferdrive/bin/activate
+            export PYTHONNOUSERSITE=1
+            cd /scratch/\$USER/code/PufferDrive
+            python -m pufferlib.pufferl mine_failures puffer_drive \
+                --load-model-path \$CKPT \
+                --mine.output-dir \$OUT \
+                --mine.num-episodes 100 \
+                --mine.score-threshold 1e9 \
+                --vec.backend Serial
+          '
+    "
+```
+
+Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing.
+
+## Viewer features (`mining_viz.py`)
+
+The per-episode HTML viewer supports:
+
+- Frame scrubber + play/pause + speed control.
+- Toggle observation overlay (FOV rectangle, partner circle, observed-entity
+  highlights, goal route, waypoint markers).
+- Toggle road segment / road edge / lane line rendering.
+- Map background (CARLA / nuPlan / Waymo road graph from the bundle's
+  embedded `simulation_mode`).
+
+The index (`renders/index.html`) is a sortable table linking to each per-episode
+HTML, with the metadata columns from `episodes.csv` (failure metrics, scenario
+ID, map name).

From 297e85ebf4527974d0ac8fb906f1418247779d82 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 07:43:05 -0500
Subject: [PATCH 02/20] submit_cluster: wrap submitit launcher in singularity
 when --container

Two coupled changes that make submit_cluster.py work end-to-end on
clusters where the system python differs across login and compute
nodes (Greene: login 3.12, compute 3.9) and the only consistent
python lives inside the singularity overlay.

1. Submission side: after constructing AutoExecutor, when --container
   is set, override executor._executor.python so submitit's outer
   launcher is invoked as
       singularity exec --nv --overlay $OVERLAY:ro $IMAGE $VENV/bin/python
   That makes the compute-side srun command resolve the launcher
   python inside the container (where the venv's symlink to
   /ext3/miniforge3/bin/python3 is valid) instead of needing a
   cross-node-consistent system python with submitit installed.

2. launch_training side: when the function lands on the compute node
   it's now already inside singularity (from (1)), so skip the second
   singularity exec wrap and just run inner_cmd via bash -c. The
   /.singularity.d/Singularity marker file distinguishes the cases
   so direct-sbatch callers (not yet inside a container) still get
   the wrap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/cluster_training.md  | 229 ++++++++++++++++++++++++++------------
 docs/mining.md            |  44 ++++----
 scripts/submit_cluster.py |  70 +++++++++---
 3 files changed, 234 insertions(+), 109 deletions(-)
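
The `--args` translation described in the docs/cluster_training.md flag table
can be sketched as below. This is a hypothetical reimplementation for
illustration only: the name `args_to_cli` is an assumption, not the code in
scripts/submit_cluster.py, and it assumes every entry has the form `key=value`
and that only underscores in the key are converted to dashes.

```python
from typing import List


def args_to_cli(args: List[str]) -> List[str]:
    """Hypothetical helper: turn `a.b_c=v` into the pair `--a.b-c`, `v`."""
    cli: List[str] = []
    for item in args:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        cli.extend(["--" + key.replace("_", "-"), value])
    return cli


example = args_to_cli(["env.simulation_mode=replay", "train.checkpoint_interval=250"])
print(" ".join(example))
```

With the doc's own example, `env.simulation_mode=replay` maps to
`--env.simulation-mode replay`; the value is passed through unchanged.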

diff --git a/docs/cluster_training.md b/docs/cluster_training.md
index 7b8fa365c8..3b1bc092e3 100644
--- a/docs/cluster_training.md
+++ b/docs/cluster_training.md
@@ -7,20 +7,29 @@ Greene workflow but the patterns generalize. Pairs with `scripts/setup_container
 ## TL;DR
 
 ```bash
-# One-time per cluster: create the singularity overlay and install deps.
+# One-time per cluster:
+#   (a) create the singularity overlay and install deps into the venv
 ./scripts/setup_container.sh create-overlay
 sbatch --account=<acct> --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \
     --wrap "./scripts/setup_container.sh install"
+#   (b) install submitit on the login-node system python (see "Why" below)
+python3 -m ensurepip --user
+python3 -m pip install --user submitit pyyaml cloudpickle
 
 # Per code change to C extensions: rebuild on a CPU partition (no GPU needed).
 sbatch --account=<acct> --partition=cpu_short --cpus-per-task=8 --mem=16gb --time=20 \
     --chdir=$PWD -o $LOGDIR/rebuild_%j.log \
     --wrap "./scripts/setup_container.sh rebuild"
 
-# Training: direct sbatch with inline singularity-exec + heartbeat.
-#   (`submit_cluster.py` has known limitations on this branch lineage —
-#   see "submit_cluster.py" below.)
-sbatch /path/to/my_train.sh   # template in this doc
+# Training: submit_cluster.py from the login node (NOT inside singularity)
+# with --container --heartbeat. Heartbeat is required for runs > ~2h.
+python3 scripts/submit_cluster.py \
+    --save_dir /scratch/$USER/runs \
+    --compute_config scripts/cluster_configs/nyu_greene.yaml \
+    --program_config scripts/cluster_configs/train_base.yaml \
+    --container --heartbeat \
+    --account <acct> --partition <gpu-partition> --time 2880 \
+    --args train.checkpoint_interval=250 env.simulation_mode=gigaflow
 ```
 
 ## Container model
@@ -48,21 +57,128 @@ singularity exec --nv \
 a torch-less base interpreter (it imports as a namespace-package stub with
 `torch.__file__ == None`).
 
-## Training sbatch template
+## Submitting training — `submit_cluster.py`
 
-The minimal template below uses a direct `sbatch` (no `submit_cluster.py`),
-includes the GPU heartbeat to prevent idle-reclamation, and wraps everything in
-a singularity-exec. Adapt the `--account`, `--partition`, paths, and CLI args:
+`scripts/submit_cluster.py` is the canonical submission path. It composes a
+`compute_config` YAML (SLURM settings) + a `program_config` YAML (pufferl
+training args) + `--args` CLI overrides, wraps the inner train command in
+`singularity exec` when `--container` is set, optionally injects the GPU
+heartbeat when `--heartbeat` is set, performs code isolation (symlinks the
+top-level entries + hard-copies `pufferlib/` into a per-run sandbox), and
+hands the package to `submitit` for `sbatch`-submission.
+
+### Why submitit needs the system python
+
+`submitit` serializes the launch function via `cloudpickle` and writes an
+sbatch script that, on the compute node, runs
+
+```
+srun <python-path> -m submitit.<launcher> <pkl>
+```
+
+`<python-path>` is `sys.executable` of the python that ran
+`submit_cluster.py`. That python must:
+
+1. Have `submitit` importable.
+2. Be invocable from the compute node *outside* singularity (because the
+   `srun` wrapper itself isn't inside the container — only the inner train
+   command is).
+
+The venv python at `/scratch/$USER/venvs/pufferdrive/bin/python` does **not**
+qualify: it's a symlink to `/ext3/miniforge3/bin/python3`, which only exists
+inside the singularity overlay. On the compute node `srun` tries to invoke
+that path outside the container and fails with
+`execve(): /scratch/.../python: No such file or directory`.
+
+The system `/usr/bin/python3` does qualify: it's on every node, no overlay
+symlinks, and the `~/.local` user site is on a shared filesystem so packages
+installed via `pip install --user` are visible from compute nodes.
+
+### One-time setup of submitit on system python
+
+```bash
+# Greene's /usr/bin/python3 is stripped of pip. Bootstrap with ensurepip:
+python3 -m ensurepip --user
+python3 -m pip install --user --upgrade pip
+python3 -m pip install --user submitit pyyaml cloudpickle
+```
+
+`submitit` is pure-python and the deps are too, so `--user` install works
+without needing a compiler. After this, `python3 -c 'import submitit'` works
+on the login node and all compute nodes.
+
+### Run submit_cluster.py from the *login node*, not from inside the container
+
+```bash
+python3 scripts/submit_cluster.py \
+    --save_dir /scratch/$USER/runs \
+    --prefix mytrain \
+    --compute_config scripts/cluster_configs/nyu_greene.yaml \
+    --program_config scripts/cluster_configs/train_base.yaml \
+    --account <acct> --partition <gpu-partition> --time 2880 \
+    --container \
+    --heartbeat \
+    --args \
+        train.total_timesteps=10000000000 \
+        train.checkpoint_interval=250
+```
+
+Key flags:
+
+| Flag | Effect |
+|---|---|
+| `--container` | wraps the inner train command in `singularity exec --nv --overlay $OVERLAY:ro $IMAGE_PATH ...` and prepends `source $VENV/bin/activate && export PYTHONNOUSERSITE=1` |
+| `--heartbeat` | wraps the inner train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py` and kills it on train exit, preserving the train exit code |
+| `--args key=value key2=value2 ...` | passes nested config keys (underscores converted to dashes) as `--key value` on the torchrun line; e.g. `env.simulation_mode=replay` becomes `--env.simulation-mode replay` |
+| `--account` / `--partition` / `--time` | override `compute_config` SLURM settings |
+
+`AutoExecutor` (inside submit_cluster.py) probes for `sbatch` on `$PATH`. The
+login-node `$PATH` includes `/opt/slurm/bin`, so submitit picks
+`SlurmExecutor` automatically — no `cluster="slurm"` hint needed.
+
+### GPU heartbeat — required for long runs
+
+`--heartbeat` is not optional for jobs over ~2 hours. Without it, the
+cluster's idle-GPU reclaimer issues a `scancel` from `uid 0` (root) during
+the first eval / checkpoint dip in GPU utilization.
+
+`scripts/gpu_heartbeat.py` monitors `nvidia-smi` and runs short matmul bursts
+when utilization drops below 65%, so the cluster always sees the GPU as
+active. It cooperates with real training (steps aside when training is busy).
+
+### Environment knobs the container path sets
+
+When `--container` is on, the inner bash command has these env vars set
+before `cd $PROJECT_ROOT && <train>`:
+
+```bash
+source /scratch/$USER/venvs/pufferdrive/bin/activate
+export PYTHONNOUSERSITE=1
+export XDG_CACHE_HOME=/scratch/$USER/cache
+export WANDB_CACHE_DIR=/scratch/$USER/wandb_cache
+export WANDB_CONFIG_DIR=/scratch/$USER/wandb_config
+export WANDB_DATA_DIR=/scratch/$USER/wandb_data
+export WANDB_DIR=/scratch/$USER/wandb_data
+```
+
+You may want to set `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` in your shell
+profile if you build C extensions across the different GPU types on Greene
+(A100 sm_80, L40S/H100 sm_89/90, H200 sm_90).
+
+### Fallback: direct sbatch (if submitit setup is skipped)
+
+Sometimes you can't or don't want to install submitit on the system python
+(restricted environment, fast smoke test, etc.). A direct sbatch with the
+same singularity-exec + heartbeat pattern is fine. The translation from
+`submit_cluster.py --container --heartbeat` to a hand-written script is
+straightforward:
 
 ```bash
 #!/bin/bash
 #SBATCH --job-name=mytrain
 #SBATCH --account=<your-account>
 #SBATCH --partition=<gpu-partition>
-#SBATCH --gres=gpu:1
-#SBATCH --cpus-per-task=16
-#SBATCH --mem=96gb
-#SBATCH --time=2880          # 48h
+#SBATCH --gres=gpu:1 --cpus-per-task=16 --mem=96gb --time=2880
 #SBATCH -o /scratch/$USER/runs/logs/train_%j.log
 
 singularity exec --nv \
@@ -73,41 +189,23 @@ singularity exec --nv \
     export PYTHONNOUSERSITE=1
     export TORCH_CUDA_ARCH_LIST=\"8.0;8.9;9.0\"
     export XDG_CACHE_HOME=/scratch/$USER/cache
-    export WANDB_DIR=/scratch/$USER/wandb_data
     cd /scratch/$USER/code/PufferDrive
-
-    # GPU heartbeat: keeps utilization above 65% during eval/checkpoint dips
-    # so the cluster's idle-GPU reclaimer doesn't kill the job (root scancel
-    # at ~2h is the symptom).
     python scripts/gpu_heartbeat.py > /tmp/gpu_heartbeat.log 2>&1 &
     HB_PID=\$!
-
     torchrun --standalone --nproc_per_node 1 -m pufferlib.pufferl train puffer_drive \
         --train.total-timesteps 10000000000 \
         --train.checkpoint-interval 250 \
         --wandb --wandb-project pufferdrive \
         --train.data-dir /scratch/$USER/runs/mytrain
-
     TRAIN_EXIT=\$?
     kill \$HB_PID 2>/dev/null
     exit \$TRAIN_EXIT
 "
 ```
 
-`TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 (sm_80), L40S/H100 (sm_89/90),
-and H200 (sm_90). Without it the C extension is compiled only for the build
-host's GPU type and crashes on different hardware with `no kernel image is available`.
-
-## GPU heartbeat — required for long runs
-
-Without `scripts/gpu_heartbeat.py` backgrounded alongside training, jobs lasting
-~2 hours risk **CANCELLED by uid 0** from the cluster's idle-GPU reclaimer.
-Eval / checkpoint / map-load phases dip GPU utilization briefly, and the
-reclaimer interprets those dips as "idle".
-
-The heartbeat monitors `nvidia-smi` and runs short matmul bursts when
-utilization drops below 65%, so the cluster always sees the GPU as active.
-It cooperates with real training (steps aside when training is active).
+This skips submit_cluster.py's code isolation and YAML composition but gets
+the job running. Prefer `submit_cluster.py` once the one-time submitit
+install is done.
 
 ## CPU rebuild path
 
@@ -183,42 +281,33 @@ Levers, in order of impact:
 sibling `config.yaml` next to a `load_model_path` won't override them. They
 come from `drive.ini` or the CLI.
 
-## submit_cluster.py — known limitations
-
-`scripts/submit_cluster.py` wraps the training launch in submitit + a heartbeat
-wrapper. On the `emerge/temp_training`-derived branch lineage it doesn't work
-end-to-end:
-
-1. Login-node `/usr/bin/python3` lacks `pip` → can't `pip install submitit`
-   on the login node. The venv's `pip` shebang points at
-   `/ext3/miniforge3/bin/python3` (overlay-internal) so `pip install` outside
-   the container errors with "required file not found".
-2. Running `submit_cluster.py` *inside* the container makes submitit's `srun`
-   launcher inherit the venv python path (`/scratch/.../venvs/.../python`).
-   On the compute node `srun` tries to invoke that path *outside* singularity
-   and fails with `execve(): No such file or directory`. submit_cluster.py
-   wraps the *inner* train command in singularity-exec but the *outer* launcher
-   is not wrapped.
-
-Workaround if you really want submitit + sbatch: bind the slurm dirs into the
-container so the in-container python can see sbatch and call it directly:
-
-```bash
-singularity exec --nv \
-    --bind /opt/slurm:/opt/slurm \
-    --bind /run/munge:/run/munge \
-    --bind /etc/passwd:/etc/passwd \
-    --bind /etc/group:/etc/group \
-    --overlay overlay.ext3:ro \
-    $SIF bash -c 'PATH=/opt/slurm/bin:$PATH ...submit_cluster.py...'
-```
-
-This gets the submission through (real SLURM job ID), but the **submitted job
-itself** still hits (2) above unless you also bind those dirs into the launched
-container, which submit_cluster.py doesn't do.
-
-**Recommended**: use the direct-sbatch template from this doc. The heartbeat
-is a 4-line bash addition; you don't need submitit for that.
+## Submission pitfalls to avoid
+
+A few mistakes that look reasonable but break the submission flow:
+
+- **Don't run `submit_cluster.py` from inside the container.** It works at the
+  AutoExecutor level (sbatch is reachable; the submission goes through), but
+  the submitted job inherits the in-container venv python as `sys.executable`.
+  On the compute node `srun` tries to invoke that path *outside* singularity
+  and fails with `execve(): /scratch/.../python: No such file or directory`.
+  submit_cluster.py wraps the *inner* train command in singularity-exec but
+  the *outer* submitit launcher is not wrapped.
+
+  The fix is the layout described above: install submitit on the system
+  `/usr/bin/python3` via `pip install --user`, run `submit_cluster.py` from
+  the login node directly (no container, no venv activate).
+
+- **Don't `pip install submitit` into the venv expecting it to work from the
+  login node.** The venv's `pip` and `python` shebangs point at
+  `/ext3/miniforge3/bin/python3` (overlay-internal). Running them outside the
+  container errors with "required file not found". The venv is *runtime*
+  only — its packages are invisible to login-node tooling.
+
+- **Don't bind `/opt/slurm` + `/run/munge` + `/etc/passwd` into the container
+  as a workaround.** It does make `sbatch` callable from inside the container
+  (you'll see "slurm 25.05.4" if you run `sbatch --version`), but you're then
+  back to pitfall #1: the submitted job's outer python is still the venv
+  python. The bindings buy you the submission but not the execution.
 
 ## Common pitfalls
 
diff --git a/docs/mining.md b/docs/mining.md
index e677f81ced..996e53e18b 100644
--- a/docs/mining.md
+++ b/docs/mining.md
@@ -137,31 +137,33 @@ You can read the right values out of the checkpoint's sibling `config.yaml`
 
 Mining is GPU-bound on the policy forward pass but memory-light compared to
 training (single env, no rollout buffer, no PPO update). 48 GB RAM and a
-60-minute time limit are plenty for 100 episodes:
+60-minute time limit are plenty for 100 episodes. The same `submit_cluster.py`
+flow as training works — just override `--main` to invoke `mine_failures`:
 
 ```bash
-sbatch --account=<acct> --partition=<gpu-partition> --gres=gpu:1 \
-    --cpus-per-task=8 --mem=48gb --time=60 \
-    --chdir=$PWD -o $LOGDIR/mine_%j.log \
-    --wrap "
-        singularity exec --nv \
-          --overlay /scratch/\$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \
-          /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \
-          bash -c '
-            source /scratch/\$USER/venvs/pufferdrive/bin/activate
-            export PYTHONNOUSERSITE=1
-            cd /scratch/\$USER/code/PufferDrive
-            python -m pufferlib.pufferl mine_failures puffer_drive \
-                --load-model-path \$CKPT \
-                --mine.output-dir \$OUT \
-                --mine.num-episodes 100 \
-                --mine.score-threshold 1e9 \
-                --vec.backend Serial
-          '
-    "
+python3 scripts/submit_cluster.py \
+    --save_dir /scratch/$USER/runs \
+    --prefix mine \
+    --compute_config scripts/cluster_configs/nyu_greene.yaml \
+    --account <acct> --partition <gpu-partition> --time 60 \
+    --mem 48gb --cpus 8 \
+    --container \
+    --main "-m pufferlib.pufferl mine_failures puffer_drive" \
+    --args \
+        load_model_path=<path-to-ckpt> \
+        mine.output_dir=/scratch/$USER/failure_mining/out \
+        mine.num_episodes=100 \
+        mine.score_threshold=1e9 \
+        vec.backend=Serial
 ```
 
-Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing.
+See [`docs/cluster_training.md`](cluster_training.md) for the one-time
+submitit setup (`pip install --user submitit pyyaml cloudpickle` on the
+system python) and the rationale for why `submit_cluster.py` must be run
+from the login node rather than inside the container.
+
+Outputs land on `/scratch`; pull them down with `rsync` for in-browser
+viewing.
 
 ## Viewer features (`mining_viz.py`)
 
diff --git a/scripts/submit_cluster.py b/scripts/submit_cluster.py
index 59fe9ad3a5..b2a47fbc58 100644
--- a/scripts/submit_cluster.py
+++ b/scripts/submit_cluster.py
@@ -250,6 +250,32 @@ def submit(args, job_name: str, command: List[str], save_dir: str, dry: bool):
     # Set up executor
     executor = submitit.AutoExecutor(folder=os.path.join(save_dir, "submitit"))
 
+    # When --container is set, run submitit's outer launcher python *inside*
+    # singularity. The default launcher python is sys.executable, which is
+    # either the login-node system python (version-mismatched with the
+    # compute-node system python, so --user installs are invisible) or the
+    # venv python (a symlink into the overlay, which dangles on the compute
+    # node outside singularity). Wrapping the launcher in singularity exec
+    # uses the overlay's miniforge3 python — identical on every node — and
+    # gives submitit a working import of itself (the venv has submitit
+    # installed). launch_training detects the already-in-container state
+    # via /.singularity.d/Singularity and skips its own inner wrap to avoid
+    # nested singularity.
+    if args.container and hasattr(executor, "_executor"):
+        scratch_dir = os.environ.get("SCRATCH_DIR", f"/scratch/{os.environ.get('USER', '')}")
+        venv_path = os.environ.get("VENV_PATH", f"{scratch_dir}/venvs/pufferdrive")
+        cert_binds = []
+        for cert_path in ["/etc/ssl/certs", "/etc/pki"]:
+            if os.path.exists(cert_path):
+                cert_binds.append(f"--bind {cert_path}:{cert_path}:ro")
+        executor._executor.python = (
+            f"singularity exec --nv "
+            f"--overlay {args.container_overlay}:ro "
+            f"{' '.join(cert_binds)} "
+            f"{args.container_image} "
+            f"{venv_path}/bin/python"
+        )
+
     # Build GRES string for GPUs
     if from_config.get("gpu_type") is not None:
         gres = f"gpu:{from_config['gpu_type']}:{from_config['gpus']}"
@@ -404,25 +430,33 @@ def wrap_with_heartbeat(train_cmd_str):
             if args.heartbeat:
                 train_str = wrap_with_heartbeat(train_str)
             inner_cmd = f"{env_setup} && {cache_exports} && cd {project_root} && {train_str}"
-            full_cmd = [
-                "singularity",
-                "exec",
-                "--nv",
-                "--overlay",
-                container_config["overlay"] + ":ro",  # Read-only overlay for running
-            ]
-            # Bind mount SSL certificates for TLS verification (wandb, etc.)
-            for cert_path in ["/etc/ssl/certs", "/etc/pki"]:
-                if os.path.exists(cert_path):
-                    full_cmd.extend(["--bind", f"{cert_path}:{cert_path}:ro"])
-            full_cmd.extend(
-                [
-                    container_config["image"],
-                    "bash",
-                    "-c",
-                    inner_cmd,
+            # submit_cluster.py also wraps submitit's outer launcher python in
+            # singularity exec when --container is on (see the executor.python
+            # override at submission time). When we land here on the compute
+            # node, we're already inside that singularity context — skip the
+            # second wrap and just run inner_cmd via bash.
+            if os.path.exists("/.singularity.d/Singularity"):
+                full_cmd = ["bash", "-c", inner_cmd]
+            else:
+                full_cmd = [
+                    "singularity",
+                    "exec",
+                    "--nv",
+                    "--overlay",
+                    container_config["overlay"] + ":ro",  # Read-only overlay for running
                 ]
-            )
+                # Bind mount SSL certificates for TLS verification (wandb, etc.)
+                for cert_path in ["/etc/ssl/certs", "/etc/pki"]:
+                    if os.path.exists(cert_path):
+                        full_cmd.extend(["--bind", f"{cert_path}:{cert_path}:ro"])
+                full_cmd.extend(
+                    [
+                        container_config["image"],
+                        "bash",
+                        "-c",
+                        inner_cmd,
+                    ]
+                )
         elif args.heartbeat:
             # No container: still need to wrap in bash -c so the brace group parses.
             train_str = " ".join(full_cmd)

From bbbc1b15dc0bbfcd94ae6ecc9ba7c9a07ad62b9b Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 07:47:33 -0500
Subject: [PATCH 03/20] docs: rewrite to describe current state, not discovery
 path
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Drop 'we figured out X' framing per code review feedback. The
submit_cluster.py path now works end-to-end (after the patch in
the previous commit that wraps submitit's outer launcher in
singularity exec), so the docs describe that as the canonical
flow rather than as a workaround. Direct-sbatch is no longer
documented as a fallback — submit_cluster.py is the single path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 README.md                |   4 +-
 docs/cluster_training.md | 175 ++++++++++-----------------------------
 docs/mining.md           | 110 +++++++++++-------------
 3 files changed, 96 insertions(+), 193 deletions(-)
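
For reference, the `--args` translation described in the cluster_training.md
flag table (`env.simulation_mode=replay` becomes `--env.simulation-mode replay`)
can be sketched roughly as below. `args_to_cli` is a hypothetical helper written
for illustration, not the function in `scripts/submit_cluster.py`. It assumes
that only the key has its underscores rewritten and that the value passes
through unchanged.

```python
def args_to_cli(pairs):
    """Illustrative only: map key=value pairs to --key value CLI tokens."""
    cli = []
    for pair in pairs:
        # Split on the first "=" so values containing "=" survive intact.
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        # Underscores in the (possibly dotted) key become dashes.
        cli.extend(["--" + key.replace("_", "-"), value])
    return cli


if __name__ == "__main__":
    print(" ".join(args_to_cli(["env.simulation_mode=replay", "vec.backend=Serial"])))
```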

diff --git a/README.md b/README.md
index 489aa057d5..97bd303429 100644
--- a/README.md
+++ b/README.md
@@ -71,7 +71,7 @@ python scripts/submit_cluster.py \
 
 `scripts/cluster_configs/nyu_greene.yaml` defines `account`, `gpus`, `cpus`, `mem`, `time` — edit `account` to your allocation before first submit. `--container` makes `submit_cluster.py` wrap the job command in `singularity exec --nv --overlay $OVERLAY_PATH:ro $IMAGE_PATH ...`.
 
-For an operational deep-dive — sbatch templates, the GPU heartbeat (required for runs > ~2h or the idle-GPU reclaimer will scancel them), CPU rebuild path, account/partition strategy, replay-mode memory sizing, and known `submit_cluster.py` failure modes — see [`docs/cluster_training.md`](docs/cluster_training.md).
+For the operational guide — the one-time login-side submitit setup, GPU heartbeat (required for runs > ~2h), CPU rebuild path, account/partition strategy, and replay-mode memory sizing — see [`docs/cluster_training.md`](docs/cluster_training.md).
 
 ## Data
 
@@ -148,7 +148,7 @@ renders/index.html           # sortable index of all episodes
 
 Open `renders/index.html` in a browser to triage. The index page filters by "failures only" / "replays only" and sorts by any metric column. Each row links to the per-episode viewer with the scene's full 2D animation.
 
-For the deeper guide — viewer features, the `score_threshold` semantic gotcha (default `-inf` saves nothing — not what the `mine_failures` docstring claims), the `Multiprocessing`-backend CUDA-after-fork hang and the `--vec.backend Serial` workaround, the shape-mismatch gotcha when loading checkpoints with non-default `policy.*` dims, and the on-cluster sbatch pattern — see [`docs/mining.md`](docs/mining.md).
+For the deeper guide — viewer features, `score_threshold` semantics, the required `--vec.backend Serial` flag, loading checkpoints with non-default `policy.*` dims, and the on-cluster `submit_cluster.py` pattern — see [`docs/mining.md`](docs/mining.md).
 
 ## Key Configuration (`pufferlib/config/ocean/drive.ini`)
 
diff --git a/docs/cluster_training.md b/docs/cluster_training.md
index 3b1bc092e3..96195cecf2 100644
--- a/docs/cluster_training.md
+++ b/docs/cluster_training.md
@@ -12,7 +12,8 @@ Greene workflow but the patterns generalize. Pairs with `scripts/setup_container
 ./scripts/setup_container.sh create-overlay
 sbatch --account=<acct> --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \
     --wrap "./scripts/setup_container.sh install"
-#   (b) install submitit on the login-node system python (see "Why" below)
+#   (b) install submitit on the login-node system python (used to compose
+#       the submission; the in-container venv python runs the actual job)
 python3 -m ensurepip --user
 python3 -m pip install --user submitit pyyaml cloudpickle
 
@@ -21,8 +22,7 @@ sbatch --account=<acct> --partition=cpu_short --cpus-per-task=8 --mem=16gb --tim
     --chdir=$PWD -o $LOGDIR/rebuild_%j.log \
     --wrap "./scripts/setup_container.sh rebuild"
 
-# Training: submit_cluster.py from the login node (NOT inside singularity)
-# with --container --heartbeat. Heartbeat is required for runs > ~2h.
+# Training: submit_cluster.py from the login node with --container --heartbeat.
 python3 scripts/submit_cluster.py \
     --save_dir /scratch/$USER/runs \
     --compute_config scripts/cluster_configs/nyu_greene.yaml \
@@ -53,7 +53,7 @@ singularity exec --nv \
     '
 ```
 
-`source venv/activate` is **required** — sourcing `/ext3/env.sh` alone gives you
+`source venv/activate` is required — sourcing `/ext3/env.sh` alone gives you
 a torch-less base interpreter (it imports as a namespace-package stub with
 `torch.__file__ == None`).
 
@@ -67,47 +67,39 @@ heartbeat when `--heartbeat` is set, performs code isolation (symlinks the
 top-level entries + hard-copies `pufferlib/` into a per-run sandbox), and
 hands the package to `submitit` for `sbatch`-submission.
 
-### Why submitit needs the system python
+### Two pythons in play
 
-`submitit` serializes the launch function via `cloudpickle` and writes an
-sbatch script that, on the compute node, runs
+A `submit_cluster.py --container` submission uses two distinct python
+environments:
 
-```
-srun <python-path> -m submitit.<launcher> <pkl>
-```
-
-`<python-path>` is `sys.executable` of the python that ran
-`submit_cluster.py`. That python must:
-
-1. Have `submitit` importable.
-2. Be invocable from the compute node *outside* singularity (because the
-   `srun` wrapper itself isn't inside the container — only the inner train
-   command is).
-
-The venv python on `/scratch/$USER/venvs/pufferdrive/bin/python` does **not**
-qualify: it's a symlink to `/ext3/miniforge3/bin/python3`, which only exists
-inside the singularity overlay. On the compute node `srun` tries to invoke
-that path outside the container and fails with
-`execve(): /scratch/.../python: No such file or directory`.
+- **Login-side composer**: the python that runs `submit_cluster.py` itself.
+  Only needs `submitit`, `pyyaml`, `cloudpickle` importable. Used purely to
+  build the sbatch script and submit it to SLURM. On Greene this is
+  `/usr/bin/python3` (system python) with `pip install --user submitit pyyaml
+  cloudpickle` to provide those deps.
+- **Compute-side executor**: the python that runs the training job on the
+  compute node. This is the **venv python** inside the singularity overlay
+  — same on every node because the overlay is content-identical. submitit's
+  outer launcher is wrapped in `singularity exec` so it lands in this
+  environment; `launch_training` then runs `torchrun` inside the same
+  container.
 
-The system `/usr/bin/python3` does qualify: it's on every node, no overlay
-symlinks, and the `~/.local` user site is on a shared filesystem so packages
-installed via `pip install --user` are visible from compute nodes.
+`submit_cluster.py` handles the wrap automatically when `--container` is set
+— you don't need to think about it. The only setup step is installing the
+three login-side deps once.
 
-### One-time setup of submitit on system python
+### One-time login-side setup
 
 ```bash
-# Greene's /usr/bin/python3 is stripped of pip. Bootstrap with ensurepip:
+# Greene's /usr/bin/python3 ships without pip; bootstrap it:
 python3 -m ensurepip --user
 python3 -m pip install --user --upgrade pip
 python3 -m pip install --user submitit pyyaml cloudpickle
 ```
 
-`submitit` is pure-python and the deps are too, so `--user` install works
-without needing a compiler. After this, `python3 -c 'import submitit'` works
-on the login node and all compute nodes.
+After this, `python3 -c 'import submitit'` works on the login node.
 
-### Run submit_cluster.py from the *login node*, not from inside the container
+### Run from the login node
 
 ```bash
 python3 scripts/submit_cluster.py \
@@ -127,15 +119,11 @@ Key flags:
 
 | Flag | Effect |
 |---|---|
-| `--container` | wraps the inner train command in `singularity exec --nv --overlay $OVERLAY:ro $IMAGE_PATH ...` and prepends `source $VENV/bin/activate && export PYTHONNOUSERSITE=1` |
-| `--heartbeat` | wraps the inner train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py` and kills it on train exit, preserving the train exit code |
-| `--args key=value key2=value2 ...` | passes nested config keys (underscores converted to dashes) as `--key value` on the torchrun line; e.g. `env.simulation_mode=replay` becomes `--env.simulation-mode replay` |
+| `--container` | wraps both submitit's outer launcher and the inner train command in `singularity exec --nv --overlay $OVERLAY:ro $IMAGE` |
+| `--heartbeat` | wraps the train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py` and kills it on train exit, preserving the train exit code |
+| `--args key=value ...` | passes nested config keys (underscores converted to dashes) as `--key value` on the torchrun line; e.g. `env.simulation_mode=replay` becomes `--env.simulation-mode replay` |
 | `--account` / `--partition` / `--time` | override `compute_config` SLURM settings |
 
-`AutoExecutor` (inside submit_cluster.py) probes for `sbatch` on `$PATH`. The
-login-node `$PATH` includes `/opt/slurm/bin`, so submitit picks
-`SlurmExecutor` automatically — no `cluster="slurm"` hint needed.
-
 ### GPU heartbeat — required for long runs
 
 `--heartbeat` is not optional for jobs over ~2 hours. Without it, the
@@ -165,54 +153,12 @@ You may want to set `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` in your shell
 profile if you build C extensions across the different GPU types on Greene
 (A100 sm_80, L40S/H100 sm_89/90, H200 sm_90).
 
-### Fallback: direct sbatch (if submitit setup is skipped)
-
-Sometimes you can't or don't want to install submitit on the system python
-(restricted environment, fast smoke test, etc.). A direct sbatch with the
-same singularity-exec + heartbeat pattern is fine. The translation from
-`submit_cluster.py --container --heartbeat` to a hand-written script is
-straightforward:
-
-```bash
-#!/bin/bash
-#SBATCH --job-name=mytrain
-#SBATCH --account=<your-account>
-#SBATCH --partition=<gpu-partition>
-#SBATCH --gres=gpu:1 --cpus-per-task=16 --mem=96gb --time=2880
-#SBATCH -o /scratch/$USER/runs/logs/train_%j.log
-
-singularity exec --nv \
-  --overlay /scratch/$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \
-  /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \
-  bash -c "
-    source /scratch/$USER/venvs/pufferdrive/bin/activate
-    export PYTHONNOUSERSITE=1
-    export TORCH_CUDA_ARCH_LIST=\"8.0;8.9;9.0\"
-    export XDG_CACHE_HOME=/scratch/$USER/cache
-    cd /scratch/$USER/code/PufferDrive
-    python scripts/gpu_heartbeat.py > /tmp/gpu_heartbeat.log 2>&1 &
-    HB_PID=\$!
-    torchrun --standalone --nproc_per_node 1 -m pufferlib.pufferl train puffer_drive \
-        --train.total-timesteps 10000000000 \
-        --train.checkpoint-interval 250 \
-        --wandb --wandb-project pufferdrive \
-        --train.data-dir /scratch/$USER/runs/mytrain
-    TRAIN_EXIT=\$?
-    kill \$HB_PID 2>/dev/null
-    exit \$TRAIN_EXIT
-"
-```
-
-This skips submit_cluster.py's code isolation and YAML composition but gets
-the job running. Prefer `submit_cluster.py` once the one-time submitit
-install is done.
-
 ## CPU rebuild path
 
-GPU partitions are routinely saturated by training jobs of this same project.
-`setup_container.sh rebuild` doesn't actually need a GPU — it just runs
-`python setup.py build_ext --inplace --force` plus a smoke import. Submit to a
-CPU partition for fast turnaround:
+GPU partitions are routinely saturated by training jobs. `setup_container.sh
+rebuild` doesn't actually need a GPU — it just runs `python setup.py
+build_ext --inplace --force` plus a smoke import. Submit to a CPU partition
+for fast turnaround:
 
 ```bash
 sbatch --account=<general-account> --partition=cpu_short \
@@ -228,20 +174,22 @@ sbatch --account=<general-account> --partition=cpu_short \
 
 NYU Greene exposes `_general` and `_tandon_priority` account tiers, each with
 their own QOS pool per partition. When `squeue` shows your job pending on
-`QOSGrpGRES`, the issue is partition-level pool saturation — **switching
-accounts within the same tier doesn't help**, but switching partitions does.
+`QOSGrpGRES`, the issue is partition-level pool saturation — switching
+accounts within the same tier doesn't help, but switching partitions does.
 
 `QOSMaxGRESPerUser` is different: you're over your own concurrent-GPU cap.
 Cancel a pending job or wait.
 
-Practical recipe for long training:
+Practical recipe:
 
-- For short jobs (rebuilds, eval, mining): try `cpu_short` if CPU-only; else
-  `h200_public + *_general`. Often the fastest GPU slot.
-- For long training: race 2–3 GPU partitions in parallel and cancel the
-  losers as soon as one starts. `tandon_priority` accounts often unblock when
-  `_general` pools are pinned. `l40s_public` typically has multi-hour queues
-  and is the last resort.
+- For short jobs (rebuilds, eval, mining): try `cpu_short` first when no GPU
+  is needed; otherwise use `h200_public` with `<general-account>`, which is
+  often the fastest GPU slot.
+- For long training: `_tandon_priority` accounts have their own QOS pools,
+  separate from `_general`, so jobs there can still start when `_general`
+  pools are saturated. Race 2–3 partitions in parallel and cancel the losers
+  as soon as one starts. `l40s_public` typically has multi-hour queues and
+  is the last resort.
 
 Quick test-only across combos:
 
@@ -277,37 +225,9 @@ Levers, in order of impact:
   `[eval.validation_gigaflow]` specifically renders 8 × 1080p MP4s in parallel.
 - `--mem=128gb` or `--mem=192gb` if you need the eval signal in wandb.
 
-`vec.*` keys are **not** in pufferl's `KEYS_OF_INTEREST` auto-merge, so a
-sibling `config.yaml` next to a `load_model_path` won't override them. They
-come from `drive.ini` or the CLI.
-
-## Submission pitfalls to avoid
-
-A few mistakes that look reasonable but break the submission flow:
-
-- **Don't run `submit_cluster.py` from inside the container.** It works at the
-  AutoExecutor level (sbatch is reachable; the submission goes through), but
-  the submitted job inherits the in-container venv python as `sys.executable`.
-  On the compute node `srun` tries to invoke that path *outside* singularity
-  and fails with `execve(): /scratch/.../python: No such file or directory`.
-  submit_cluster.py wraps the *inner* train command in singularity-exec but
-  the *outer* submitit launcher is not wrapped.
-
-  The fix is the layout described above: install submitit on the system
-  `/usr/bin/python3` via `pip install --user`, run `submit_cluster.py` from
-  the login node directly (no container, no venv activate).
-
-- **Don't `pip install submitit` into the venv expecting it to work from the
-  login node.** The venv's `pip` and `python` shebangs point at
-  `/ext3/miniforge3/bin/python3` (overlay-internal). Running them outside the
-  container errors with "required file not found". The venv is *runtime*
-  only — its packages are invisible to login-node tooling.
-
-- **Don't bind `/opt/slurm` + `/run/munge` + `/etc/passwd` into the container
-  as a workaround.** It does make `sbatch` callable from inside the container
-  (you'll see "slurm 25.05.4" if you run `sbatch --version`), but you're then
-  back to pitfall #1: the submitted job's outer python is still the venv
-  python. The bindings buy you the submission but not the execution.
+`vec.*` keys are not in pufferl's `KEYS_OF_INTEREST` auto-merge, so a sibling
+`config.yaml` next to a `load_model_path` won't override them. They come from
+`drive.ini` or the CLI.
 
 ## Common pitfalls
 
@@ -324,9 +244,6 @@ A few mistakes that look reasonable but break the submission flow:
   (e.g. failed pip installs that wrote to `/usr/local/lib/...` end up in
   `upper/usr/local/` and aren't visible to apptainer's view). Use
   `debugfs -R "ls /upper" overlay.ext3` from a login node to inspect.
-- **Squash-merging stacked PRs** can hit "stale info" on `--force-with-lease`
-  when the token URL differs from `origin`. Either fetch first or use
-  `--force` with care.
 
 ## Don't chain `sleep` to wait on background jobs
 
diff --git a/docs/mining.md b/docs/mining.md
index 996e53e18b..1c11b9e60a 100644
--- a/docs/mining.md
+++ b/docs/mining.md
@@ -7,20 +7,25 @@ and `pufferlib/mining_viz.py`.
 ## TL;DR
 
 ```bash
-# Roll the policy out for 100 episodes, save compact replays for "failures",
-# render HTML for each + a sortable index.
+# Roll the policy out for 100 episodes, save compact replays for episodes
+# whose episode_return falls below the threshold, render HTML for each +
+# a sortable index.
 puffer mine_failures puffer_drive \
     --load-model-path /path/to/model_011000.pt \
     --mine.output-dir ./failure_mining/baseline_011000 \
     --mine.num-episodes 100 \
-    --vec.backend Serial             # see "Multiprocessing hang" below
-
-# Outputs:
-#   ./failure_mining/baseline_011000/
-#     replays/episode_NNNNNN.replay.zlib   ← one per failed episode
-#     renders/episode_NNNNNN.html          ← per-replay viewer
-#     renders/index.html                   ← sortable summary
-#     episodes.csv                         ← all episodes, all metrics
+    --mine.score-threshold 1e9 \
+    --vec.backend Serial
+```
+
+Outputs:
+
+```
+./failure_mining/baseline_011000/
+    replays/episode_NNNNNN.replay.zlib   one per saved episode
+    renders/episode_NNNNNN.html          per-replay viewer
+    renders/index.html                   sortable summary
+    episodes.csv                         all episodes, all metrics
 ```
 
 Open the index in a browser:
@@ -33,8 +38,8 @@ open ./failure_mining/baseline_011000/renders/index.html
 
 A compact replay bundle is a pickled+zlib'd `schema_version=2` dict containing
 per-step agent state, traffic state, and observation arrays for a single
-episode. Bundles are produced **C-side** when `capture_compact_replay=True`
-is passed to `Drive(...)`. `mine_failures` sets this automatically.
+episode. Bundles are produced C-side when `capture_compact_replay=True` is
+passed to `Drive(...)`. `mine_failures` sets this automatically.
 
 Each saved bundle is paired with a metadata row in `episodes.csv` including
 `episode_return`, `collision_rate`, `offroad_rate`, `num_goals_reached`,
@@ -43,50 +48,34 @@ reads the bundle and replays it in-browser on a top-down canvas, with optional
 overlays for the agent's observed FOV, partner circle, goal route, and waypoint
 markers.
 
-## `mine.score_threshold` — gotcha
-
-The `mine_failures` selection rule is "save replay if and only if
-`episode_return < score_threshold`". The docstring claims `-inf` means "capture
-every episode" — that's wrong: `episode_return < -inf` is never true, so the
-default captures **nothing**. To actually save episodes:
-
-```bash
-# Capture every episode (works with any non-degenerate return):
---mine.score-threshold 1e9
-
-# Capture only "true" failures (negative returns):
---mine.score-threshold 0
-```
+## `mine.score_threshold` selection
 
-`episodes.csv` always contains all N episodes' metadata regardless of threshold
-— only the bundle save + HTML render is gated.
+The save rule is "write replay if and only if `episode_return < score_threshold`".
 
-## Multiprocessing hang — use `--vec.backend Serial`
+- `--mine.score-threshold 1e9` captures every episode (any real return is
+  less than 1e9).
+- `--mine.score-threshold 0` captures only negative-return ("true failure")
+  episodes.
+- Default `-inf` captures **nothing** (no return is below `-inf`); useful
+  only if you want `episodes.csv` metrics without the bundle overhead.
 
-`pufferl.mine_failures` goes through `pufferlib.vector.make(...)` with the
-drive.ini default `backend=Multiprocessing`. Even with `num_envs=1,
-num_workers=1`, that backend **forks** workers post-torch-import. Forking after
-torch has been imported in the parent is a classic deadlock for CUDA — the
-child can hang on CUDA initialization, and the parent sits forever on the IPC
-pipe.
+`episodes.csv` always contains all N episodes' metadata regardless of
+threshold; only the bundle save + HTML render is gated.
 
-Symptoms: CPU 100% in the parent, RSS frozen, no `[mine_failures] target
-episodes=...` print, never produces output. If you let it sit for ~10 minutes
-nothing changes.
+## `--vec.backend Serial`
 
-Fix: force the in-process backend.
-
-```bash
---vec.backend Serial
-```
+Mining must use `--vec.backend Serial`. The drive.ini default
+`Multiprocessing` backend forks workers post-torch-import, which deadlocks on
+CUDA in the child process. Symptom is a parent process at 100% CPU with no
+visible progress and no `[mine_failures] target episodes=...` print.
 
-This keeps the env in the same process as the policy. No fork, no hang. The
-single-env nature of mining means the throughput cost is negligible.
+`Serial` keeps the env in the same process as the policy. Mining is a single
+env / single rollout workflow, so the throughput cost is negligible.
 
 ## Tuning the rollout config
 
 The mining env config comes from drive.ini's `[mine]` section plus per-CLI
-overrides. Useful knobs:
+overrides:
 
 ```bash
 # Larger output (slower):
@@ -99,22 +88,20 @@ overrides. Useful knobs:
 --env.init-steps 10 \
 --env.scenario-length 200
 
-# Looser goal radius (useful if the trained policy struggles with the
-# stricter default; default 2m, max 12m under reward randomization):
+# Looser goal radius (default 2 m, up to 12 m under reward randomization):
 --env.goal-radius 6
 
-# Closer-spaced goals (mining a policy that wasn't trained on these):
+# Closer-spaced goals:
 --env.min-waypoint-spacing 10 \
 --env.max-waypoint-spacing 15
 ```
 
-## Resume + obs-shape gotcha
+## Loading checkpoints with non-default architecture
 
-`mine_failures` does **not** read the sibling `config.yaml` next to
-`load_model_path` — only `pufferl.train` does. If the checkpoint was trained
+`mine_failures` does not read the sibling `config.yaml` next to
+`load_model_path` (only `pufferl.train` does). If the checkpoint was trained
 with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`,
-`backbone_num_layers=4`), you'll get a shape mismatch on `load_state_dict`
-unless you pass them on the CLI:
+`backbone_num_layers=4`), pass them on the CLI to match the saved state dict:
 
 ```bash
 --policy.input-size 128 \
@@ -131,14 +118,15 @@ unless you pass them on the CLI:
 ```
 
 You can read the right values out of the checkpoint's sibling `config.yaml`
-(under `policy:` and `rnn:`) and pass them through.
+(under `policy:` and `rnn:`) and pass them through. The error if you forget
+is a wall of `size mismatch for ...` lines from `policy.load_state_dict`.
 
 ## On the cluster
 
 Mining is GPU-bound on the policy forward pass but memory-light compared to
 training (single env, no rollout buffer, no PPO update). 48 GB RAM and a
 60-minute time limit are plenty for 100 episodes. The same `submit_cluster.py`
-flow as training works — just override `--main` to invoke `mine_failures`:
+flow as training works — override `--main` to invoke `mine_failures`:
 
 ```bash
 python3 scripts/submit_cluster.py \
@@ -157,13 +145,11 @@ python3 scripts/submit_cluster.py \
         vec.backend=Serial
 ```
 
-See [`docs/cluster_training.md`](cluster_training.md) for the one-time
-submitit setup (`pip install --user submitit pyyaml cloudpickle` on the
-system python) and the rationale for why `submit_cluster.py` must be run
-from the login node rather than inside the container.
+See [`docs/cluster_training.md`](cluster_training.md) for one-time setup of
+the login-side submitit (`python3 -m pip install --user submitit pyyaml
+cloudpickle`).
 
-Outputs land on `/scratch`; pull them down with `rsync` for in-browser
-viewing.
+Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing.
 
 ## Viewer features (`mining_viz.py`)
 

From 50f97d91ec6d8519396f93c9ec27af054d4189f8 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@172-16-7-177.dynapool.nyu.edu>
Date: Wed, 20 May 2026 09:34:07 -0500
Subject: [PATCH 04/20] docs: expand TORCH_CUDA_ARCH_LIST explanation

Replace the cryptic one-line 'you may want to set' with a self-contained
explanation: what the env var does (per-arch fat binary), why it matters
on a heterogeneous cluster ('no kernel image' on the wrong GPU), what
the recommended value covers (A100/L40S/H100/H200), and when it actually
matters in practice (interactive builds outside setup_container.sh
rebuild, which already exports it).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/cluster_training.md | 141 ++++++++++++++-------------------------
 1 file changed, 50 insertions(+), 91 deletions(-)
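
Reviewer note (not part of the commit): a minimal sketch of the check that the explanation above implies, namely whether a GPU's compute capability is covered by a given `TORCH_CUDA_ARCH_LIST`. The helper name `arch_list_covers` is hypothetical and does not exist in the repo. It assumes entries are separated by `;` or spaces (the two styles that appear in this series) and does an exact match only, ignoring `+PTX` forward compatibility.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repo: succeed if compute capability $2
# (e.g. "9.0") appears in arch list $1 (e.g. "8.0;8.9;9.0").
# Exact match only; "+PTX" entries and forward compatibility are ignored.
arch_list_covers() {
    local list="$1" cap="$2" entry
    # Treat ';' and spaces as equivalent separators.
    for entry in ${list//;/ }; do
        [ "$entry" = "$cap" ] && return 0
    done
    return 1
}

# Capabilities quoted in the doc: A100 = 8.0, L40S = 8.9, H100/H200 = 9.0.
for cap in 8.0 8.9 9.0; do
    if arch_list_covers "8.0;8.9;9.0" "$cap"; then
        echo "sm ${cap}: covered"
    else
        echo "sm ${cap}: would hit 'no kernel image'"
    fi
done
```

On a GPU node the capability could probably be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` on recent drivers; that step is left out of the sketch so it also runs on a CPU node.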

diff --git a/docs/cluster_training.md b/docs/cluster_training.md
index 96195cecf2..83adf594bf 100644
--- a/docs/cluster_training.md
+++ b/docs/cluster_training.md
@@ -1,10 +1,8 @@
 # Cluster training — operational guide
 
-How to run PufferDrive training on a SLURM cluster. Written against the NYU
-Greene workflow but the patterns generalize. Pairs with `scripts/setup_container.sh`,
-`scripts/gpu_heartbeat.py`, and `scripts/submit_cluster.py`.
+How to run PufferDrive training on a SLURM cluster. This is written with the NYU cluster in mind but it should mostly hold for any SLURM cluster. 
 
-## TL;DR
+## A quick overview of the setup and launch process
 
 ```bash
 # One-time per cluster:
@@ -17,30 +15,31 @@ sbatch --account=<acct> --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \
 python3 -m ensurepip --user
 python3 -m pip install --user submitit pyyaml cloudpickle
 
-# Per code change to C extensions: rebuild on a CPU partition (no GPU needed).
+# If the C code has changed or has never been built, rebuild it in the container:
 sbatch --account=<acct> --partition=cpu_short --cpus-per-task=8 --mem=16gb --time=20 \
     --chdir=$PWD -o $LOGDIR/rebuild_%j.log \
     --wrap "./scripts/setup_container.sh rebuild"
 
 # Training: submit_cluster.py from the login node with --container --heartbeat.
+# This launches RL training by default; use the --main argument to launch
+# other modes.
 python3 scripts/submit_cluster.py \
     --save_dir /scratch/$USER/runs \
     --compute_config scripts/cluster_configs/nyu_greene.yaml \
     --program_config scripts/cluster_configs/train_base.yaml \
     --container --heartbeat \
     --account <acct> --partition <gpu-partition> --time 2880 \
-    --args train.checkpoint_interval=250 env.simulation_mode=gigaflow
+    --args train.checkpoint_interval=250 env.simulation_mode=gigaflow # use this to override config args
 ```
 
 ## Container model
 
 PufferDrive on Greene runs inside a singularity container. The container provides
 a modern glibc + CUDA toolkit; the project's Python environment lives in a venv
-on `/scratch` (not in the overlay) so installs aren't bottlenecked by fuse2fs.
+on `/scratch` so installs aren't slowed down by building the venv inside the container.
 
 The container is invoked with a **read-only** overlay mount for the miniforge3
-base interpreter, plus the on-disk venv for project packages:
-
+base interpreter, plus the on-disk venv for project packages. For example:
 ```bash
 singularity exec --nv \
     --overlay /scratch/$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \
@@ -53,21 +52,20 @@ singularity exec --nv \
     '
 ```
 
-`source venv/activate` is required — sourcing `/ext3/env.sh` alone gives you
-a torch-less base interpreter (it imports as a namespace-package stub with
-`torch.__file__ == None`).
-
 ## Submitting training — `submit_cluster.py`
 
-`scripts/submit_cluster.py` is the canonical submission path. It composes a
-`compute_config` YAML (SLURM settings) + a `program_config` YAML (pufferl
-training args) + `--args` CLI overrides, wraps the inner train command in
-`singularity exec` when `--container` is set, optionally injects the GPU
-heartbeat when `--heartbeat` is set, performs code isolation (symlinks the
+`scripts/submit_cluster.py` is the canonical submission path. It composes: 
+- a `compute_config` YAML (SLURM settings)
+- a `program_config` YAML (pufferl training args)
+- `--args` CLI overrides
+
+It also wraps the inner train command in `singularity exec` when `--container` is set and injects the GPU heartbeat when `--heartbeat` is set. WARNING: the heartbeat exists only to keep our jobs from being killed on the torch cluster; no one else should use it.
+
+It performs code isolation (symlinks the
 top-level entries + hard-copies `pufferlib/` into a per-run sandbox), and
 hands the package to `submitit` for `sbatch`-submission.
 
-### Two pythons in play
+### WARNING: two Python installations are in use here
 
 A `submit_cluster.py --container` submission uses two distinct python
 environments:
@@ -75,19 +73,14 @@ environments:
 - **Login-side composer**: the python that runs `submit_cluster.py` itself.
   Only needs `submitit`, `pyyaml`, `cloudpickle` importable. Used purely to
   build the sbatch script and submit it to SLURM. On Greene this is
-  `/usr/bin/python3` (system python) with `pip install --user submitit pyyaml
+  `/usr/bin/python3` (system python) and you can run `pip install --user submitit pyyaml
   cloudpickle` to provide those deps.
 - **Compute-side executor**: the python that runs the training job on the
-  compute node. This is the **venv python** inside the singularity overlay
-  — same on every node because the overlay is content-identical. submitit's
+  compute node. This is the **venv python** inside the singularity overlay. submitit's
   outer launcher is wrapped in `singularity exec` so it lands in this
   environment; `launch_training` then runs `torchrun` inside the same
   container.
 
-`submit_cluster.py` handles the wrap automatically when `--container` is set
-— you don't need to think about it. The only setup step is installing the
-three login-side deps once.
-
 ### One-time login-side setup
 
 ```bash
@@ -120,7 +113,7 @@ Key flags:
 | Flag | Effect |
 |---|---|
 | `--container` | wraps both submitit's outer launcher and the inner train command in `singularity exec --nv --overlay $OVERLAY:ro $IMAGE` |
-| `--heartbeat` | wraps the train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py` and kills it on train exit, preserving the train exit code |
+| `--heartbeat` | wraps the train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py`, which keeps the cluster from killing your job for low GPU utilization |
 | `--args key=value ...` | passes nested config keys (underscores converted to dashes) as `--key value` on the torchrun line; e.g. `env.simulation_mode=replay` becomes `--env.simulation-mode replay` |
 | `--account` / `--partition` / `--time` | override `compute_config` SLURM settings |
 
@@ -132,7 +125,7 @@ the first eval / checkpoint dip in GPU utilization.
 
 `scripts/gpu_heartbeat.py` monitors `nvidia-smi` and runs short matmul bursts
 when utilization drops below 65%, so the cluster always sees the GPU as
-active. It cooperates with real training (steps aside when training is busy).
+active. It cooperates with training and steps aside when training is busy.
 
 ### Environment knobs the container path sets
 
@@ -149,14 +142,39 @@ export WANDB_DATA_DIR=/scratch/$USER/wandb_data
 export WANDB_DIR=/scratch/$USER/wandb_data
 ```
 
-You may want to set `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` in your shell
-profile if you build C extensions across the different GPU types on Greene
-(A100 sm_80, L40S/H100 sm_89/90, H200 sm_90).
+### `TORCH_CUDA_ARCH_LIST` — why you may need to set it
+
+PufferDrive's C extension contains CUDA kernels. When `setup.py build_ext`
+compiles them, `nvcc` emits machine code for each architecture listed in
+the `TORCH_CUDA_ARCH_LIST` env var (and only those); the result is a "fat
+binary" containing one variant per arch. If the env var is unset, the build
+defaults to whatever GPU was visible to the compiler at build time — often
+just one architecture.
+
+The catch on a heterogeneous cluster like Greene is that you don't get to
+choose which GPU you land on. `_general` accounts queue across L40S
+(sm_89), H100 (sm_90), and H200 (sm_90); `_tandon_*` partitions add A100
+(sm_80). If the `_C.so` was built against only sm_80 and your job lands on
+an H100, every CUDA call into the extension dies with
+`no kernel image is available for execution on the device`.
+
+Setting `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 / L40S / H100+H200
+in one fat binary — the build is a bit slower (three variants instead of
+one) and the `.so` is a bit larger, but the resulting binary runs on every
+GPU Greene routes you to.
+
+`setup_container.sh rebuild` exports this automatically for the build step,
+so a fresh rebuild on the cluster is already multi-arch. The env var only
+matters when you build the C extension **outside** the rebuild wrapper —
+e.g. an interactive `python setup.py build_ext --inplace --force` inside a
+hand-launched singularity exec. Adding the export to your shell profile
+(or sourcing it before any manual build) saves you from hitting the "no
+kernel image" error after a quick fix-and-rebuild loop.
 
 ## CPU rebuild path
 
 GPU partitions are routinely saturated by training jobs. `setup_container.sh
-rebuild` doesn't actually need a GPU — it just runs `python setup.py
+rebuild` doesn't actually need a GPU as it just runs `python setup.py
 build_ext --inplace --force` plus a smoke import. Submit to a CPU partition
 for fast turnaround:
 
@@ -170,65 +188,6 @@ sbatch --account=<general-account> --partition=cpu_short \
 
 `--chdir=$PWD` is required because the script uses `./scripts/`. Takes ~40s.
 
-## Account / partition strategy
-
-NYU Greene exposes `_general` and `_tandon_priority` account tiers, each with
-their own QOS pool per partition. When `squeue` shows your job pending on
-`QOSGrpGRES`, the issue is partition-level pool saturation — switching
-accounts within the same tier doesn't help, but switching partitions does.
-
-`QOSMaxGRESPerUser` is different: you're over your own concurrent-GPU cap.
-Cancel a pending job or wait.
-
-Practical recipe:
-
-- For short jobs (rebuilds, eval, mining): try `cpu_short` first when no GPU
-  is needed, else `h200_public + <general-account>`. Often the fastest GPU
-  slot.
-- For long training: `_tandon_priority` accounts have their own QOS pools
-  separate from `_general`, so they unblock when `_general` pools are
-  pinned. Race 2–3 partitions in parallel and cancel the losers as soon as
-  one starts. `l40s_public` typically has multi-hour queues and is the last
-  resort.
-
-Quick test-only across combos:
-
-```bash
-for combo in \
-    "<acct-priority> a100_tandon" \
-    "<acct-priority> h100_tandon" \
-    "<acct-general>  h200_public"; do
-  read ACCT PART <<< "$combo"
-  RES=$(sbatch --test-only --account=$ACCT --partition=$PART \
-        --gres=gpu:1 --cpus-per-task=16 --mem=96gb --time=2880 \
-        --wrap "echo test" 2>&1 | head -1)
-  echo "$ACCT $PART -> $RES"
-done
-```
-
-`--test-only` prints an estimated start time without actually submitting.
-
-## Memory sizing — replay mode is heavier than gigaflow
-
-Gigaflow training with `num_agents=1024` fits comfortably in 96 GB on Greene.
-Replay-mode training on nuPlan does not — each sub-env loads its own bin file
-(parsed lane graph + per-agent trajectories), so `--mem=96gb` OOMs.
-
-Levers, in order of impact:
-
-- `--vec.num-envs N` (drive.ini default `20`). Each vec worker is a fork; each
-  worker holds copy-on-write-divergent state proportional to `num_agents/num_envs`
-  + the loaded map data. Halving from 20→10 saves ~25 GB.
-- Disable subsets of `[eval.*]` evaluators via CLI overrides. The 14 enabled
-  evaluators in `drive.ini` all spin up their own `pufferlib.vector.make` envs
-  at the first eval cycle and can collectively cost 30–50 GB at peak.
-  `[eval.validation_gigaflow]` specifically renders 8 × 1080p MP4s in parallel.
-- `--mem=128gb` or `--mem=192gb` if you need the eval signal in wandb.
-
-`vec.*` keys are not in pufferl's `KEYS_OF_INTEREST` auto-merge, so a sibling
-`config.yaml` next to a `load_model_path` won't override them. They come from
-`drive.ini` or the CLI.
-
 ## Common pitfalls
 
 - **`ncclCommShrink` undefined symbol** at `from torch._C import *`. Greene's

From dae17688e39bf71d1c241345a3da3b4782399b6b Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@172-16-7-177.dynapool.nyu.edu>
Date: Wed, 20 May 2026 09:44:17 -0500
Subject: [PATCH 05/20] docs: explain why CPU rebuild works for CUDA code

The previous CPU rebuild section just said 'doesn't need a GPU' without
explaining the apparent contradiction (we're compiling CUDA, no?). Spell
out that nvcc is a cross-compiler: it emits PTX/SASS for the target
arches in TORCH_CUDA_ARCH_LIST without needing matching hardware, and
the CUDA toolkit lives in the singularity image so any node that can
mount the image can run the build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/cluster_training.md | 77 ++++++++++++++++++----------------------
 1 file changed, 34 insertions(+), 43 deletions(-)
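
Reviewer note (not part of the commit): to illustrate why no GPU is needed at build time, the target architectures are named explicitly on the compiler command line rather than detected. The sketch below expands an arch list into `-gencode` flags of the form `nvcc` accepts. The function name `gencode_flags` is hypothetical; in the real build these flags come from the build system, not from a script like this.

```shell
#!/usr/bin/env bash
# Hypothetical illustration: expand "8.0;8.9;9.0" (or "8.0 8.9 9.0") into
# nvcc -gencode flags. Nothing here queries a GPU.
gencode_flags() {
    local list="$1" entry num flags=""
    for entry in ${list//;/ }; do
        num="${entry/./}"    # "8.9" -> "89"
        flags+="-gencode arch=compute_${num},code=sm_${num} "
    done
    echo "${flags% }"        # drop the trailing space
}

gencode_flags "8.0;8.9;9.0"
```

Each flag asks for machine code for one `sm_XX` target, which is why the build host only needs the CUDA toolkit from the image.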

diff --git a/docs/cluster_training.md b/docs/cluster_training.md
index 83adf594bf..45aef6367d 100644
--- a/docs/cluster_training.md
+++ b/docs/cluster_training.md
@@ -142,41 +142,18 @@ export WANDB_DATA_DIR=/scratch/$USER/wandb_data
 export WANDB_DIR=/scratch/$USER/wandb_data
 ```
 
-### `TORCH_CUDA_ARCH_LIST` — why you may need to set it
-
-PufferDrive's C extension contains CUDA kernels. When `setup.py build_ext`
-compiles them, `nvcc` emits machine code for each architecture listed in
-the `TORCH_CUDA_ARCH_LIST` env var (and only those); the result is a "fat
-binary" containing one variant per arch. If the env var is unset, the build
-defaults to whatever GPU was visible to the compiler at build time — often
-just one architecture.
-
-The catch on a heterogeneous cluster like Greene is that you don't get to
-choose which GPU you land on. `_general` accounts queue across L40S
-(sm_89), H100 (sm_90), and H200 (sm_90); `_tandon_*` partitions add A100
-(sm_80). If the `_C.so` was built against only sm_80 and your job lands on
-an H100, every CUDA call into the extension dies with
-`no kernel image is available for execution on the device`.
-
-Setting `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 / L40S / H100+H200
-in one fat binary — the build is a bit slower (three variants instead of
-one) and the `.so` is a bit larger, but the resulting binary runs on every
-GPU Greene routes you to.
-
-`setup_container.sh rebuild` exports this automatically for the build step,
-so a fresh rebuild on the cluster is already multi-arch. The env var only
-matters when you build the C extension **outside** the rebuild wrapper —
-e.g. an interactive `python setup.py build_ext --inplace --force` inside a
-hand-launched singularity exec. Adding the export to your shell profile
-(or sourcing it before any manual build) saves you from hitting the "no
-kernel image" error after a quick fix-and-rebuild loop.
-
 ## CPU rebuild path
 
 GPU partitions are routinely saturated by training jobs. `setup_container.sh
-rebuild` doesn't actually need a GPU as it just runs `python setup.py
-build_ext --inplace --force` plus a smoke import. Submit to a CPU partition
-for fast turnaround:
+rebuild` doesn't need a GPU even though it compiles CUDA code: `nvcc` is a
+cross-compiler. It generates PTX/SASS for each architecture in
+`TORCH_CUDA_ARCH_LIST` without needing matching hardware on the build host,
+the same way a C compiler can target ARM from an x86 host. The CUDA toolkit
+itself (`nvcc`, headers, libs) lives in the cuda12.8.1 `.sif` image, so any
+node that can mount the image can run the build — CPU partitions included.
+The rebuild script exports `TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0"` upfront, so
+the resulting `.so` is a fat binary that runs on every GPU type at job time.
+Submit to a CPU partition for fast turnaround:
 
 ```bash
 sbatch --account=<general-account> --partition=cpu_short \
@@ -188,7 +165,7 @@ sbatch --account=<general-account> --partition=cpu_short \
 
 `--chdir=$PWD` is required because the script uses `./scripts/`. Takes ~40s.
 
-## Common pitfalls
+### Common pitfalls
 
 - **`ncclCommShrink` undefined symbol** at `from torch._C import *`. Greene's
   cuda12.8.1 sif ships `libnccl 2.25.1` in `/usr/lib`, but torch ≥ 2.10 calls
@@ -204,16 +181,30 @@ sbatch --account=<general-account> --partition=cpu_short \
   `upper/usr/local/` and aren't visible to apptainer's view). Use
   `debugfs -R "ls /upper" overlay.ext3` from a login node to inspect.
 
-## Don't chain `sleep` to wait on background jobs
+### `TORCH_CUDA_ARCH_LIST`: a warning that you can skip
 
-A bare `sleep N` to poll on a submitted job's state is hard on the SLURM
-controller and brittle. Patterns that work:
+PufferDrive's C extension contains CUDA kernels. When `setup.py build_ext`
+compiles them, `nvcc` emits machine code for each architecture listed in
+the `TORCH_CUDA_ARCH_LIST` env var (and only those); the result is a large binary containing one variant per arch. If the env var is unset, the build
+defaults to whatever GPU was visible to the compiler at build time, which is often
+just one architecture.
 
-- **One-shot wait**: a single `sacct -j $JOBID --format=State -n -P` after a
-  generous initial sleep tuned to expected runtime.
-- **Conditional wait**: a `Monitor`-style `until` loop in a single background
-  shell, with a sane upper bound.
-- **Wall-clock interval**: schedule a wake-up rather than long-running `sleep`.
+On Greene, you frequently don't get to
+choose which GPU you land on. `_general` accounts queue across L40S
+(sm_89), H100 (sm_90), and H200 (sm_90); `_tandon_*` partitions add A100
+(sm_80). If the `_C.so` was built against only sm_80 and your job lands on
+an H100, every CUDA call into the extension dies with
+`no kernel image is available for execution on the device`.
 
-Hammering `squeue` in a tight loop is bad cluster citizenship — the controller
-is shared across all users. Sleep at least 60 s between checks.
+Setting `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 / L40S / H100+H200
+in one fat binary — the build is a bit slower (three variants instead of
+one) and the `.so` is a bit larger, but the resulting binary runs on every
+GPU Greene routes you to.
+
+`setup_container.sh rebuild` exports this automatically for the build step,
+so a fresh rebuild on the cluster is already multi-arch. The env var only
+matters when you build the C extension **outside** the rebuild wrapper —
+e.g. an interactive `python setup.py build_ext --inplace --force` inside a
+hand-launched singularity exec. Adding the export to your shell profile
+(or sourcing it before any manual build) saves you from hitting the "no
+kernel image" error after a quick fix-and-rebuild loop.
\ No newline at end of file

From 33d560ea016babc835eb2f3eb488e8a4be750f8d Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@172-16-7-177.dynapool.nyu.edu>
Date: Wed, 20 May 2026 09:46:01 -0500
Subject: [PATCH 06/20] =?UTF-8?q?docs:=20trim=20CPU=20rebuild=20section=20?=
 =?UTF-8?q?=E2=80=94=20drop=20the=20cross-compiler=20explanation?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Just say 'doesn't need a GPU' without the nvcc / fat binary / cross-compiler
detour. Readers who want the why can find it in the TORCH_CUDA_ARCH_LIST
section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/cluster_training.md | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/docs/cluster_training.md b/docs/cluster_training.md
index 45aef6367d..f5b7ee2112 100644
--- a/docs/cluster_training.md
+++ b/docs/cluster_training.md
@@ -145,15 +145,7 @@ export WANDB_DIR=/scratch/$USER/wandb_data
 ## CPU rebuild path
 
 GPU partitions are routinely saturated by training jobs. `setup_container.sh
-rebuild` doesn't need a GPU even though it compiles CUDA code: `nvcc` is a
-cross-compiler. It generates PTX/SASS for each architecture in
-`TORCH_CUDA_ARCH_LIST` without needing matching hardware on the build host,
-the same way a C compiler can target ARM from an x86 host. The CUDA toolkit
-itself (`nvcc`, headers, libs) lives in the cuda12.8.1 `.sif` image, so any
-node that can mount the image can run the build — CPU partitions included.
-The rebuild script exports `TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0"` upfront, so
-the resulting `.so` is a fat binary that runs on every GPU type at job time.
-Submit to a CPU partition for fast turnaround:
+rebuild` doesn't need a GPU — submit to a CPU partition for fast turnaround:
 
 ```bash
 sbatch --account=<general-account> --partition=cpu_short \

From f8642620e4b1bab6346f84b5752386a5d75de575 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@172-16-7-177.dynapool.nyu.edu>
Date: Wed, 20 May 2026 09:50:56 -0500
Subject: [PATCH 07/20] docs: drop mining doc from this PR (moved to a separate
 PR)

This PR is just cluster training. The mining workflow doc and its
README pointer move to a separate PR so the two can be reviewed and
landed independently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 README.md      |   2 -
 docs/mining.md | 167 -------------------------------------------------
 2 files changed, 169 deletions(-)
 delete mode 100644 docs/mining.md

diff --git a/README.md b/README.md
index 97bd303429..f19eb805c3 100644
--- a/README.md
+++ b/README.md
@@ -148,8 +148,6 @@ renders/index.html           # sortable index of all episodes
 
 Open `renders/index.html` in a browser to triage. The index page filters by "failures only" / "replays only" and sorts by any metric column. Each row links to the per-episode viewer with the scene's full 2D animation.
 
-For the deeper guide — viewer features, `score_threshold` semantics, the required `--vec.backend Serial` flag, loading checkpoints with non-default `policy.*` dims, and the on-cluster `submit_cluster.py` pattern — see [`docs/mining.md`](docs/mining.md).
-
 ## Key Configuration (`pufferlib/config/ocean/drive.ini`)
 
 ### `[env]` — Simulation
diff --git a/docs/mining.md b/docs/mining.md
deleted file mode 100644
index 1c11b9e60a..0000000000
--- a/docs/mining.md
+++ /dev/null
@@ -1,167 +0,0 @@
-# Failure mining workflow
-
-How to roll a trained policy out, capture compact replays, and produce a
-browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures`
-and `pufferlib/mining_viz.py`.
-
-## TL;DR
-
-```bash
-# Roll the policy out for 100 episodes, save compact replays for episodes
-# whose episode_return falls below the threshold, render HTML for each +
-# a sortable index.
-puffer mine_failures puffer_drive \
-    --load-model-path /path/to/model_011000.pt \
-    --mine.output-dir ./failure_mining/baseline_011000 \
-    --mine.num-episodes 100 \
-    --mine.score-threshold 1e9 \
-    --vec.backend Serial
-```
-
-Outputs:
-
-```
-./failure_mining/baseline_011000/
-    replays/episode_NNNNNN.replay.zlib   one per saved episode
-    renders/episode_NNNNNN.html          per-replay viewer
-    renders/index.html                   sortable summary
-    episodes.csv                         all episodes, all metrics
-```
-
-Open the index in a browser:
-
-```bash
-open ./failure_mining/baseline_011000/renders/index.html
-```
-
-## What gets captured
-
-A compact replay bundle is a pickled+zlib'd `schema_version=2` dict containing
-per-step agent state, traffic state, and observation arrays for a single
-episode. Bundles are produced C-side when `capture_compact_replay=True` is
-passed to `Drive(...)`. `mine_failures` sets this automatically.
-
-Each saved bundle is paired with a metadata row in `episodes.csv` including
-`episode_return`, `collision_rate`, `offroad_rate`, `num_goals_reached`,
-`avg_distance_per_infraction`, etc. The HTML viewer (`pufferlib/mining_viz.py`)
-reads the bundle and replays it in-browser on a top-down canvas, with optional
-overlays for the agent's observed FOV, partner circle, goal route, and waypoint
-markers.
-
-## `mine.score_threshold` selection
-
-The save rule is "write replay if and only if `episode_return < score_threshold`".
-
-- `--mine.score-threshold 1e9` captures every episode (any real return is
-  less than 1e9).
-- `--mine.score-threshold 0` captures only negative-return ("true failure")
-  episodes.
-- Default `-inf` captures **nothing** — useful only if you want `episodes.csv`
-  metrics without the bundle overhead.
-
-`episodes.csv` always contains all N episodes' metadata regardless of
-threshold; only the bundle save + HTML render is gated.
-
-## `--vec.backend Serial`
-
-Mining must use `--vec.backend Serial`. The drive.ini default
-`Multiprocessing` backend forks workers post-torch-import, which deadlocks on
-CUDA in the child process. Symptom is a parent process at 100% CPU with no
-visible progress and no `[mine_failures] target episodes=...` print.
-
-`Serial` keeps the env in the same process as the policy. Mining is a single
-env / single rollout workflow, so the throughput cost is negligible.
-
-## Tuning the rollout config
-
-The mining env config comes from drive.ini's `[mine]` section plus per-CLI
-overrides:
-
-```bash
-# Larger output (slower):
---mine.num-episodes 500
-
-# Replay mode (drive recorded nuPlan / Waymo scenarios):
---env.simulation-mode replay \
---env.control-mode control_sdc_only \
---env.map-dir /path/to/recorded_bins \
---env.init-steps 10 \
---env.scenario-length 200
-
-# Looser goal radius (default 2 m, up to 12 m under reward randomization):
---env.goal-radius 6
-
-# Closer-spaced goals:
---env.min-waypoint-spacing 10 \
---env.max-waypoint-spacing 15
-```
-
-## Loading checkpoints with non-default architecture
-
-`mine_failures` does not read the sibling `config.yaml` next to
-`load_model_path` (only `pufferl.train` does). If the checkpoint was trained
-with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`,
-`backbone_num_layers=4`), pass them on the CLI to match the saved state dict:
-
-```bash
---policy.input-size 128 \
---policy.actor-hidden-size 512 \
---policy.actor-num-layers 0 \
---policy.backbone-hidden-size 512 \
---policy.backbone-num-layers 4 \
---policy.critic-hidden-size 512 \
---policy.critic-num-layers 0 \
---policy.encoder-gigaflow True \
---policy.split-network False \
---rnn.hidden-size 512 \
---rnn.input-size 512
-```
-
-You can read the right values out of the checkpoint's sibling `config.yaml`
-(under `policy:` and `rnn:`) and pass them through. The error if you forget
-is a wall of `size mismatch for ...` lines from `policy.load_state_dict`.
-
-## On the cluster
-
-Mining is GPU-bound on the policy forward pass but memory-light compared to
-training (single env, no rollout buffer, no PPO update). 48 GB RAM and a
-60-minute time limit are plenty for 100 episodes. The same `submit_cluster.py`
-flow as training works — override `--main` to invoke `mine_failures`:
-
-```bash
-python3 scripts/submit_cluster.py \
-    --save_dir /scratch/$USER/runs \
-    --prefix mine \
-    --compute_config scripts/cluster_configs/nyu_greene.yaml \
-    --account <acct> --partition <gpu-partition> --time 60 \
-    --mem 48gb --cpus 8 \
-    --container \
-    --main "-m pufferlib.pufferl mine_failures puffer_drive" \
-    --args \
-        load_model_path=<path-to-ckpt> \
-        mine.output_dir=/scratch/$USER/failure_mining/out \
-        mine.num_episodes=100 \
-        mine.score_threshold=1e9 \
-        vec.backend=Serial
-```
-
-See [`docs/cluster_training.md`](cluster_training.md) for one-time setup of
-the login-side submitit (`python3 -m pip install --user submitit pyyaml
-cloudpickle`).
-
-Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing.
-
-## Viewer features (`mining_viz.py`)
-
-The per-episode HTML viewer supports:
-
-- Frame scrubber + play/pause + speed control.
-- Toggle observation overlay (FOV rectangle, partner circle, observed-entity
-  highlights, goal route, waypoint markers).
-- Toggle road segment / road edge / lane line rendering.
-- Map background (CARLA / nuPlan / Waymo road graph from the bundle's
-  embedded `simulation_mode`).
-
-The index (`renders/index.html`) is a sortable table linking to each per-episode
-HTML, with the metadata columns from `episodes.csv` (failure metrics, scenario
-ID, map name).

From b793e602433c65af46b4ad63983649529f329707 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@172-16-7-177.dynapool.nyu.edu>
Date: Wed, 20 May 2026 09:51:13 -0500
Subject: [PATCH 08/20] docs: cluster_training tweaks (TL;DR rewrite,
 formatting)

---
 docs/cluster_training.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/cluster_training.md b/docs/cluster_training.md
index f5b7ee2112..af4eeae82f 100644
--- a/docs/cluster_training.md
+++ b/docs/cluster_training.md
@@ -173,7 +173,7 @@ sbatch --account=<general-account> --partition=cpu_short \
   `upper/usr/local/` and aren't visible to apptainer's view). Use
   `debugfs -R "ls /upper" overlay.ext3` from a login node to inspect.
 
-### `TORCH_CUDA_ARCH_LIST`: a warning that you can skip
+### `TORCH_CUDA_ARCH_LIST`: a quick warning that won't generally be an issue
 
 PufferDrive's C extension contains CUDA kernels. When `setup.py build_ext`
 compiles them, `nvcc` emits machine code for each architecture listed in

From 1560772b8a83361f0c4b05a5308ca679ccc566b5 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 10:27:52 -0500
Subject: [PATCH 09/20] docs+gitignore: pre-commit fixes + drop sphinx noise
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

cluster_training.md: trim trailing whitespace on two lines and add a
final newline (pre-commit hooks were rejecting the diff).

.gitignore: drop the sphinx-themed entries. This repo doesn't use
sphinx (no docs/conf.py, no Makefile, no mkdocs.yml — docs are just
plain Markdown), so 'docs/_build/' and the misleading 'Generated docs
(sphinx build output only)' comment were vestigial cookiecutter
boilerplate. The earlier '# docs/' line is gone with them.

README.md: replace the longer pointer paragraph in the HPC section with
a one-line link to docs/cluster_training.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .gitignore               | 6 ------
 README.md                | 2 +-
 docs/cluster_training.md | 6 +++---
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/.gitignore b/.gitignore
index 390f3dd7a2..62f3bc9449 100644
--- a/.gitignore
+++ b/.gitignore
@@ -81,9 +81,6 @@ instance/
 # Scrapy stuff:
 .scrapy
 
-# Sphinx documentation
-docs/_build/
-
 # PyBuilder
 target/
 
@@ -211,9 +208,6 @@ pufferlib/resources/drive/output*.gif
 # External local clones
 external/
 
-# Generated docs (sphinx build output only; docs/*.md is tracked)
-# docs/
-
 # Claude config
 .claude/
 CLAUDE.local.md
diff --git a/README.md b/README.md
index f19eb805c3..9011a93e03 100644
--- a/README.md
+++ b/README.md
@@ -71,7 +71,7 @@ python scripts/submit_cluster.py \
 
 `scripts/cluster_configs/nyu_greene.yaml` defines `account`, `gpus`, `cpus`, `mem`, `time` — edit `account` to your allocation before first submit. `--container` makes `submit_cluster.py` wrap the job command in `singularity exec --nv --overlay $OVERLAY_PATH:ro $IMAGE_PATH ...`.
 
-For the operational guide — the one-time login-side submitit setup, GPU heartbeat (required for runs > ~2h), CPU rebuild path, account/partition strategy, and replay-mode memory sizing — see [`docs/cluster_training.md`](docs/cluster_training.md).
+**For the full operational guide to cluster training, see [`docs/cluster_training.md`](docs/cluster_training.md).**
 
 ## Data
 
diff --git a/docs/cluster_training.md b/docs/cluster_training.md
index af4eeae82f..d002bdee7c 100644
--- a/docs/cluster_training.md
+++ b/docs/cluster_training.md
@@ -1,6 +1,6 @@
 # Cluster training — operational guide
 
-How to run PufferDrive training on a SLURM cluster. This is written with the NYU cluster in mind but it should mostly hold for any SLURM cluster. 
+How to run PufferDrive training on a SLURM cluster. This is written with the NYU cluster in mind but it should mostly hold for any SLURM cluster.
 
 ## A quick overview of the setup and launch process
 
@@ -54,7 +54,7 @@ singularity exec --nv \
 
 ## Submitting training — `submit_cluster.py`
 
-`scripts/submit_cluster.py` is the canonical submission path. It composes: 
+`scripts/submit_cluster.py` is the canonical submission path. It composes:
 - a `compute_config` YAML (SLURM settings)
 - a `program_config` YAML (pufferl training args)
 - `--args` CLI overrides
@@ -199,4 +199,4 @@ matters when you build the C extension **outside** the rebuild wrapper —
 e.g. an interactive `python setup.py build_ext --inplace --force` inside a
 hand-launched singularity exec. Adding the export to your shell profile
 (or sourcing it before any manual build) saves you from hitting the "no
-kernel image" error after a quick fix-and-rebuild loop.
\ No newline at end of file
+kernel image" error after a quick fix-and-rebuild loop.

From a016d9adb381e8e596db1e290ef8f8ab1dab2d84 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 10:47:32 -0500
Subject: [PATCH 10/20] submit_cluster: compress the launcher-wrap comment
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

11 lines → 6. Kept the what (wrap launcher in singularity exec) +
why (sys.executable is either version-mismatched host python or a
dangling venv symlink) + pointer to the matched check in
launch_training. Dropped the enumerated detail; future readers who
need it can trace the code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 scripts/submit_cluster.py | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/scripts/submit_cluster.py b/scripts/submit_cluster.py
index b2a47fbc58..a6e4ebcfed 100644
--- a/scripts/submit_cluster.py
+++ b/scripts/submit_cluster.py
@@ -250,17 +250,12 @@ def submit(args, job_name: str, command: List[str], save_dir: str, dry: bool):
     # Set up executor
     executor = submitit.AutoExecutor(folder=os.path.join(save_dir, "submitit"))
 
-    # When --container is set, run submitit's outer launcher python *inside*
-    # singularity. The default launcher python is sys.executable, which is
-    # either the login-node system python (version-mismatched with the
-    # compute-node system python, so --user installs are invisible) or the
-    # venv python (a symlink into the overlay, which dangles on the compute
-    # node outside singularity). Wrapping the launcher in singularity exec
-    # uses the overlay's miniforge3 python — identical on every node — and
-    # gives submitit a working import of itself (the venv has submitit
-    # installed). launch_training detects the already-in-container state
-    # via /.singularity.d/Singularity and skips its own inner wrap to avoid
-    # nested singularity.
+    # Wrap submitit's outer launcher python in singularity exec so it uses
+    # the overlay's miniforge3 python (cross-node consistent, has submitit
+    # in the venv) instead of sys.executable, which is either a version-
+    # mismatched host python or a venv symlink that dangles outside the
+    # container. launch_training detects the in-container state via
+    # /.singularity.d/Singularity and skips its own wrap so we don't nest.
     if args.container and hasattr(executor, "_executor"):
         scratch_dir = os.environ.get("SCRATCH_DIR", f"/scratch/{os.environ.get('USER', '')}")
         venv_path = os.environ.get("VENV_PATH", f"{scratch_dir}/venvs/pufferdrive")
@@ -315,7 +310,6 @@ def launch_training(args, from_config, cmd, save_dir, project_root, container_co
         import submitit
 
         # Code isolation: symlink top-level entries, hard copy pufferlib/ source
-        # (symlink resources/ to avoid copying 3.7GB of maps/models).
         isolated_root = os.path.join(save_dir, "code")
         if os.path.exists(isolated_root):
             version = 1
@@ -334,8 +328,6 @@ def launch_training(args, from_config, cmd, save_dir, project_root, container_co
                     os.remove(dst)
             os.symlink(src, dst)
         # Hard copy pufferlib/ so branch switches don't break running jobs.
-        # Previously used `cp -rs` (symlinks) which meant switching branches
-        # after submission would silently change the code running jobs use.
         # We symlink resources/ (3.7GB of maps/models) to avoid slow copies,
         # but hard copy everything else (source code, .so files).
         pufferlib_dst = os.path.join(isolated_root, "pufferlib")

From 2855f0a9003091e1be86850e5db2673a7d9fe215 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 11:00:38 -0500
Subject: [PATCH 11/20] submit_cluster: clarify what the wrap actually solves
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previous comment conflated two scenarios — the version-mismatched host
python (real, what we hit) and the venv-symlink-dangling case (a
hypothetical alternative setup we don't use). Just describe the real
problem: login-node /usr/bin/python3 is 3.12, compute-node is 3.9, so
~/.local installs for 3.12 are invisible on the compute side.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 scripts/submit_cluster.py | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/scripts/submit_cluster.py b/scripts/submit_cluster.py
index a6e4ebcfed..fd0267f144 100644
--- a/scripts/submit_cluster.py
+++ b/scripts/submit_cluster.py
@@ -250,12 +250,14 @@ def submit(args, job_name: str, command: List[str], save_dir: str, dry: bool):
     # Set up executor
     executor = submitit.AutoExecutor(folder=os.path.join(save_dir, "submitit"))
 
-    # Wrap submitit's outer launcher python in singularity exec so it uses
-    # the overlay's miniforge3 python (cross-node consistent, has submitit
-    # in the venv) instead of sys.executable, which is either a version-
-    # mismatched host python or a venv symlink that dangles outside the
-    # container. launch_training detects the in-container state via
-    # /.singularity.d/Singularity and skips its own wrap so we don't nest.
+    # Override the python submitit invokes on the compute node. Default is
+    # sys.executable (the login-node /usr/bin/python3, version 3.12 on Greene),
+    # but on the compute node /usr/bin/python3 is version 3.9 and can't see
+    # the user-installed submitit at ~/.local/lib/python3.12/. Wrap it in
+    # singularity exec so the compute-side launcher uses the overlay's
+    # miniforge3 python — same on every node, with submitit available in the
+    # venv. launch_training detects /.singularity.d/Singularity and skips
+    # its own singularity wrap so we don't nest.
     if args.container and hasattr(executor, "_executor"):
         scratch_dir = os.environ.get("SCRATCH_DIR", f"/scratch/{os.environ.get('USER', '')}")
         venv_path = os.environ.get("VENV_PATH", f"{scratch_dir}/venvs/pufferdrive")

From 80a726b1d221cdb42e47eb7f757bde00d1361d32 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 11:39:23 -0500
Subject: [PATCH 12/20] cluster: drop the login-side submitit bootstrap; revert
 launcher wrap
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per reviewer feedback (yw4142):

1. setup_container.sh create_overlay: remove the stray
   mv "${TEMPLATE_NAME%.gz}" overlay.ext3 — it renamed the just-
   extracted file to a name that doesn't match what OVERLAY_PATH
   defaults to, so every later step looked for the overlay at the
   wrong path. Drop the line; overlay stays at its original name.

2. submit_cluster.py: revert the executor._executor.python override
   and the /.singularity.d/Singularity sentinel check in
   launch_training. They were working around 'submitit not on the
   compute-side python', but if you source the project venv on the
   login node, sys.executable becomes the venv python (same path
   the compute node will run) and submitit's serialization round-
   trips without needing any wrap. Back to the original pattern:
   launcher python is the venv python, launch_training wraps the
   inner train command in singularity exec for CUDA libs.

3. cluster_training.md: replace the pip install --user submitit
   pyyaml cloudpickle bootstrap with 'source the venv'. setup_container.sh
   install already lands submitit in the venv via the project's
   pyproject.toml, and sourcing the venv on login makes it
   importable. The 'two python installations' section comes out
   entirely — there's just one python (the venv's).
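
As an illustration of point 2, here is a minimal sketch of a pre-submit
sanity check that the launcher python is the venv python. The helper name
launcher_is_venv_python is hypothetical and is not part of
submit_cluster.py; the default venv location mirrors the VENV_PATH default
used elsewhere in this series.

```python
import os
import sys


def launcher_is_venv_python(venv_path: str, executable: str = sys.executable) -> bool:
    """Return True if `executable` sits directly in `venv_path`/bin.

    Hypothetical helper for illustration; not part of submit_cluster.py.
    """
    venv_bin = os.path.join(os.path.abspath(venv_path), "bin")
    # Compare the unresolved path on purpose: a venv's bin/python is usually a
    # symlink to the base interpreter, so resolving it would leave the venv.
    return os.path.dirname(os.path.abspath(executable)) == venv_bin


if __name__ == "__main__":
    # Assumed default, matching VENV_PATH in setup_container.sh.
    default_venv = f"/scratch/{os.environ.get('USER', '')}/venvs/pufferdrive"
    venv = os.environ.get("VENV_PATH", default_venv)
    if not launcher_is_venv_python(venv):
        print(f"warning: {sys.executable} is not in {venv}/bin; source the venv first")
```

Run on the login node after sourcing the venv, the check should stay
silent; run with the system python, it should print the warning.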

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/cluster_training.md   | 54 +++++++++---------------------
 scripts/setup_container.sh |  1 -
 scripts/submit_cluster.py  | 67 ++++++++++----------------------------
 3 files changed, 34 insertions(+), 88 deletions(-)

diff --git a/docs/cluster_training.md b/docs/cluster_training.md
index d002bdee7c..e4963ea2b6 100644
--- a/docs/cluster_training.md
+++ b/docs/cluster_training.md
@@ -5,25 +5,23 @@ How to run PufferDrive training on a SLURM cluster. This is written with the NYU
 ## A quick overview of the setup and launch process
 
 ```bash
-# One-time per cluster:
-#   (a) create the singularity overlay and install deps into the venv
+# One-time per cluster: create the singularity overlay and install deps
+# into the venv (this also installs submitit and the other submission
+# deps as part of the project's pyproject.toml).
 ./scripts/setup_container.sh create-overlay
 sbatch --account=<acct> --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \
     --wrap "./scripts/setup_container.sh install"
-#   (b) install submitit on the login-node system python (used to compose
-#       the submission; the in-container venv python runs the actual job)
-python3 -m ensurepip --user
-python3 -m pip install --user submitit pyyaml cloudpickle
 
 # If code changes, or we haven't built before, rebuild the C code in the container
 sbatch --account=<acct> --partition=cpu_short --cpus-per-task=8 --mem=16gb --time=20 \
     --chdir=$PWD -o $LOGDIR/rebuild_%j.log \
     --wrap "./scripts/setup_container.sh rebuild"
 
-# Training: submit_cluster.py from the login node with --container --heartbeat.
-# By default launches RL training but can be modified through the --main argument
-# to launch other modes
-python3 scripts/submit_cluster.py \
+# Training: source the venv on the login node, then submit_cluster.py
+# with --container --heartbeat. --main defaults to RL training; override
+# it to launch other modes (e.g. mining, eval).
+source /scratch/$USER/venvs/pufferdrive/bin/activate
+python scripts/submit_cluster.py \
     --save_dir /scratch/$USER/runs \
     --compute_config scripts/cluster_configs/nyu_greene.yaml \
     --program_config scripts/cluster_configs/train_base.yaml \
@@ -65,37 +63,17 @@ It performs code isolation (symlinks the
 top-level entries + hard-copies `pufferlib/` into a per-run sandbox), and
 hands the package to `submitit` for `sbatch`-submission.
 
-### WARNING: two python installation are being used here
-
-A `submit_cluster.py --container` submission uses two distinct python
-environments:
-
-- **Login-side composer**: the python that runs `submit_cluster.py` itself.
-  Only needs `submitit`, `pyyaml`, `cloudpickle` importable. Used purely to
-  build the sbatch script and submit it to SLURM. On Greene this is
-  `/usr/bin/python3` (system python) and you can run `pip install --user submitit pyyaml
-  cloudpickle` to provide those deps.
-- **Compute-side executor**: the python that runs the training job on the
-  compute node. This is the **venv python** inside the singularity overlay. submitit's
-  outer launcher is wrapped in `singularity exec` so it lands in this
-  environment; `launch_training` then runs `torchrun` inside the same
-  container.
+### Source the venv before invoking `submit_cluster.py`
 
-### One-time login-side setup
+`setup_container.sh install` puts submitit + its deps into the project
+venv at `/scratch/$USER/venvs/pufferdrive/`. Sourcing the venv on the
+login node makes that submitit importable and lines up `sys.executable`
+with the same venv python that the compute node will run, so submitit's
+serialization round-trips cleanly.
 
 ```bash
-# Greene's /usr/bin/python3 ships without pip; bootstrap it:
-python3 -m ensurepip --user
-python3 -m pip install --user --upgrade pip
-python3 -m pip install --user submitit pyyaml cloudpickle
-```
-
-After this, `python3 -c 'import submitit'` works on the login node.
-
-### Run from the login node
-
-```bash
-python3 scripts/submit_cluster.py \
+source /scratch/$USER/venvs/pufferdrive/bin/activate
+python scripts/submit_cluster.py \
     --save_dir /scratch/$USER/runs \
     --prefix mytrain \
     --compute_config scripts/cluster_configs/nyu_greene.yaml \
diff --git a/scripts/setup_container.sh b/scripts/setup_container.sh
index bf1b8f2c33..338197c9da 100755
--- a/scripts/setup_container.sh
+++ b/scripts/setup_container.sh
@@ -46,7 +46,6 @@ create_overlay() {
     TEMPLATE_NAME=$(basename "$OVERLAY_TEMPLATE")
     cd "$CONTAINER_DIR"
     gunzip "$TEMPLATE_NAME"
-    mv "${TEMPLATE_NAME%.gz}" overlay.ext3
 
     echo "Overlay created at $OVERLAY_PATH"
     echo ""
diff --git a/scripts/submit_cluster.py b/scripts/submit_cluster.py
index fd0267f144..9a8182bf8c 100644
--- a/scripts/submit_cluster.py
+++ b/scripts/submit_cluster.py
@@ -250,29 +250,6 @@ def submit(args, job_name: str, command: List[str], save_dir: str, dry: bool):
     # Set up executor
     executor = submitit.AutoExecutor(folder=os.path.join(save_dir, "submitit"))
 
-    # Override the python submitit invokes on the compute node. Default is
-    # sys.executable (the login-node /usr/bin/python3, version 3.12 on Greene),
-    # but on the compute node /usr/bin/python3 is version 3.9 and can't see
-    # the user-installed submitit at ~/.local/lib/python3.12/. Wrap it in
-    # singularity exec so the compute-side launcher uses the overlay's
-    # miniforge3 python — same on every node, with submitit available in the
-    # venv. launch_training detects /.singularity.d/Singularity and skips
-    # its own singularity wrap so we don't nest.
-    if args.container and hasattr(executor, "_executor"):
-        scratch_dir = os.environ.get("SCRATCH_DIR", f"/scratch/{os.environ.get('USER', '')}")
-        venv_path = os.environ.get("VENV_PATH", f"{scratch_dir}/venvs/pufferdrive")
-        cert_binds = []
-        for cert_path in ["/etc/ssl/certs", "/etc/pki"]:
-            if os.path.exists(cert_path):
-                cert_binds.append(f"--bind {cert_path}:{cert_path}:ro")
-        executor._executor.python = (
-            f"singularity exec --nv "
-            f"--overlay {args.container_overlay}:ro "
-            f"{' '.join(cert_binds)} "
-            f"{args.container_image} "
-            f"{venv_path}/bin/python"
-        )
-
     # Build GRES string for GPUs
     if from_config.get("gpu_type") is not None:
         gres = f"gpu:{from_config['gpu_type']}:{from_config['gpus']}"
@@ -424,33 +401,25 @@ def wrap_with_heartbeat(train_cmd_str):
             if args.heartbeat:
                 train_str = wrap_with_heartbeat(train_str)
             inner_cmd = f"{env_setup} && {cache_exports} && cd {project_root} && {train_str}"
-            # submit_cluster.py also wraps submitit's outer launcher python in
-            # singularity exec when --container is on (see the executor.python
-            # override at submission time). When we land here on the compute
-            # node, we're already inside that singularity context — skip the
-            # second wrap and just run inner_cmd via bash.
-            if os.path.exists("/.singularity.d/Singularity"):
-                full_cmd = ["bash", "-c", inner_cmd]
-            else:
-                full_cmd = [
-                    "singularity",
-                    "exec",
-                    "--nv",
-                    "--overlay",
-                    container_config["overlay"] + ":ro",  # Read-only overlay for running
+            full_cmd = [
+                "singularity",
+                "exec",
+                "--nv",
+                "--overlay",
+                container_config["overlay"] + ":ro",  # Read-only overlay for running
+            ]
+            # Bind mount SSL certificates for TLS verification (wandb, etc.)
+            for cert_path in ["/etc/ssl/certs", "/etc/pki"]:
+                if os.path.exists(cert_path):
+                    full_cmd.extend(["--bind", f"{cert_path}:{cert_path}:ro"])
+            full_cmd.extend(
+                [
+                    container_config["image"],
+                    "bash",
+                    "-c",
+                    inner_cmd,
                 ]
-                # Bind mount SSL certificates for TLS verification (wandb, etc.)
-                for cert_path in ["/etc/ssl/certs", "/etc/pki"]:
-                    if os.path.exists(cert_path):
-                        full_cmd.extend(["--bind", f"{cert_path}:{cert_path}:ro"])
-                full_cmd.extend(
-                    [
-                        container_config["image"],
-                        "bash",
-                        "-c",
-                        inner_cmd,
-                    ]
-                )
+            )
         elif args.heartbeat:
             # No container: still need to wrap in bash -c so the brace group parses.
             train_str = " ".join(full_cmd)

From 8c847ae671847221b7298bb6c377790940aaddc7 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 11:53:11 -0500
Subject: [PATCH 13/20] setup_container: install miniforge3 on /scratch instead
 of in the overlay
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Move the base python from inside the overlay (/ext3/miniforge3/) onto
/scratch (/scratch/$USER/miniforge3/). The venv's bin/python symlink
now points at /scratch — a path that resolves on every node, inside
or outside singularity, so 'source venv/activate && python ...' works
on the login node without needing to enter the container.

Changes:
- New MINIFORGE3_DIR variable (default /scratch/$USER/miniforge3) and
  ensure_miniforge3() helper that runs the conda-forge installer
  (self-contained shell script; no root, no singularity).
- CONTAINER_PYTHON default now points at the /scratch miniforge3.
- The 'install' dispatch runs ensure_miniforge3 outside the container
  first, then enters singularity (read-only overlay) for the uv + pip
  + build_ext steps that need nvcc.
- run_in_container_writable + --fakeroot are gone; nothing in the python
  flow writes to the overlay anymore. The overlay stays mounted :ro for
  rare system-tool installs but isn't on any write path.

For existing users: re-run setup_container.sh install — it'll detect
miniforge3 missing on /scratch, install it, recreate the venv against
the new base, and reinstall the project. The overlay's old miniforge3
becomes stale but harmless; you can rm it from the overlay later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 scripts/setup_container.sh | 72 ++++++++++++++++++++++++--------------
 1 file changed, 46 insertions(+), 26 deletions(-)

diff --git a/scripts/setup_container.sh b/scripts/setup_container.sh
index 338197c9da..d25c6eebd8 100755
--- a/scripts/setup_container.sh
+++ b/scripts/setup_container.sh
@@ -4,12 +4,16 @@
 # with older glibc versions.
 #
 # Architecture:
-#   - The overlay is used ONLY for the miniforge3 base Python interpreter.
-#   - All Python packages (torch, pufferlib, etc.) live in a venv on /scratch
-#     (regular ext4) instead of the overlay (fuse2fs single-threaded ~10 MB/s).
-#     This makes installs/rebuilds ~50x faster than the all-in-overlay approach.
-#   - At runtime the venv's bin/python symlinks back to /ext3/miniforge3, which
-#     is why we still mount the overlay (read-only) when activating the venv.
+#   - miniforge3 lives on /scratch (NOT in the overlay) so its python is a
+#     real file accessible from any node, in or out of singularity. The venv
+#     symlinks `bin/python` into the /scratch miniforge3, which makes
+#     `source venv/activate` work on the login node directly without
+#     needing to enter the container.
+#   - All Python packages (torch, pufferlib, etc.) live in the venv on /scratch
+#     too — fuse2fs is not on the write path for any install step.
+#   - The singularity image still supplies CUDA + cuDNN at job runtime. The
+#     overlay is preserved for the rare case where you need to install
+#     system-level tools, but it's not used for the standard python flow.
 #
 # Usage:
 #   1. Create an overlay (one time): ./setup_container.sh create-overlay
@@ -28,8 +32,11 @@ CONTAINER_DIR="${CONTAINER_DIR:-$(dirname "$OVERLAY_PATH")}"
 PROJECT_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 # Venv lives on /scratch (regular ext4) — bypasses fuse2fs entirely for installs.
 VENV_PATH="${VENV_PATH:-/scratch/$USER/venvs/pufferdrive}"
-# Python from the overlay's miniforge3 (mounted read-only at runtime).
-CONTAINER_PYTHON="${CONTAINER_PYTHON:-/ext3/miniforge3/bin/python3}"
+# miniforge3 lives on /scratch too so the venv's python symlink resolves
+# from any node without needing the singularity overlay to be mounted.
+MINIFORGE3_DIR="${MINIFORGE3_DIR:-/scratch/$USER/miniforge3}"
+MINIFORGE3_INSTALLER_URL="${MINIFORGE3_INSTALLER_URL:-https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh}"
+CONTAINER_PYTHON="${CONTAINER_PYTHON:-$MINIFORGE3_DIR/bin/python3}"
 
 create_overlay() {
     echo "=== Creating overlay filesystem ==="
@@ -75,6 +82,25 @@ fi
 EOF
 }
 
+# Install miniforge3 to /scratch if it isn't there yet. The conda-forge
+# installer is a self-contained shell script — no root, no singularity
+# required. Doing this on /scratch (rather than inside the overlay)
+# means $MINIFORGE3_DIR/bin/python3 is a real file accessible from any
+# node, so the venv's bin/python symlink resolves outside singularity too.
+ensure_miniforge3() {
+    if [ -x "$MINIFORGE3_DIR/bin/python3" ]; then
+        return 0
+    fi
+    echo "=== Installing miniforge3 to $MINIFORGE3_DIR ==="
+    mkdir -p "$(dirname "$MINIFORGE3_DIR")"
+    local installer
+    installer="$(mktemp -t miniforge3-installer.XXXXXX.sh)"
+    curl -fsSL "$MINIFORGE3_INSTALLER_URL" -o "$installer"
+    bash "$installer" -b -p "$MINIFORGE3_DIR"
+    rm -f "$installer"
+    echo "miniforge3 installed at $MINIFORGE3_DIR"
+}
+
 # Find or bootstrap a uv binary. Prefer one already on PATH or in
 # $HOME/.local/bin (auto-bound by apptainer). Fall back to the official
 # installer, which drops a static binary into ~/.local/bin in seconds.
@@ -167,27 +193,16 @@ rebuild_extension() {
 
 run_in_container() {
     local cmd="$1"
-    # Overlay mounted read-only — venv's bin/python symlinks back into
-    # /ext3/miniforge3 for the interpreter, but every package read/write
-    # happens on /scratch ext4 (the venv on $VENV_PATH).
+    # Overlay mounted read-only — every read/write the install or rebuild
+    # cares about happens on /scratch ext4 (miniforge3 + venv). The overlay
+    # is kept on the mount line for backward compatibility, but nothing
+    # in the python flow writes to it.
     singularity exec --nv \
         --overlay "$OVERLAY_PATH:ro" \
         "$IMAGE_PATH" \
         bash -c "cd $PROJECT_ROOT && $cmd"
 }
 
-run_in_container_writable() {
-    local cmd="$1"
-    # --fakeroot still required because uv bootstrap writes to /ext3/miniforge3
-    # (the system pip puts uv there before we activate the venv). Once uv
-    # is bootstrapped, all subsequent installs go to the venv on /scratch
-    # (regular ext4, no fuse2fs in the write path).
-    singularity exec --nv --fakeroot \
-        --overlay "$OVERLAY_PATH" \
-        "$IMAGE_PATH" \
-        bash -c "cd $PROJECT_ROOT && $cmd"
-}
-
 case "${1:-}" in
     create-overlay)
         create_overlay
@@ -196,7 +211,11 @@ case "${1:-}" in
         if [ -f /.singularity.d/Singularity ]; then
             install_deps
         else
-            run_in_container_writable "$0 install"
+            # miniforge3 installs on /scratch via plain shell — no singularity
+            # needed for that step. The rest (uv + pip + build_ext) runs in
+            # the container so nvcc and the right glibc are on PATH.
+            ensure_miniforge3
+            run_in_container "$0 install"
         fi
         ;;
     rebuild)
@@ -217,12 +236,13 @@ case "${1:-}" in
         echo "  rebuild         Rebuild C extension only (submit as GPU job)"
         echo ""
         echo "Environment variables:"
+        echo "  MINIFORGE3_DIR  Where the base python lives (default: /scratch/\$USER/miniforge3)"
         echo "  VENV_PATH       Where the venv lives (default: /scratch/\$USER/venvs/pufferdrive)"
-        echo "  OVERLAY_PATH    Singularity overlay (only needs miniforge3 base python)"
+        echo "  OVERLAY_PATH    Singularity overlay (kept for system-tool installs; not used by the python flow)"
         echo ""
         echo "Example workflow:"
         echo "  1. $0 create-overlay"
         echo "  2. sbatch --gres=gpu:1 --time=60 --wrap \"$0 install\""
-        echo "  3. python scripts/submit_cluster.py --container ..."
+        echo "  3. source \$VENV_PATH/bin/activate && python scripts/submit_cluster.py --container ..."
         ;;
 esac

From 450c3ca5aabcdac8620821f23d8cfc2a1b30c87b Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 11:54:17 -0500
Subject: [PATCH 14/20] setup_container: rebuild venv if its python symlink is
 stale

Existing installs have the venv's bin/python pointing at /ext3/miniforge3
from before we moved miniforge3 to /scratch. Re-running install would
otherwise reuse the broken-symlink venv and skip recreating it. Detect
the broken symlink ([ ! -e "$VENV_PATH/bin/python" ]) and rm -rf the venv
before recreating it against $CONTAINER_PYTHON.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 scripts/setup_container.sh | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/scripts/setup_container.sh b/scripts/setup_container.sh
index d25c6eebd8..61868cd022 100755
--- a/scripts/setup_container.sh
+++ b/scripts/setup_container.sh
@@ -131,6 +131,14 @@ ensure_uv() {
 # of the box and works against any cpython.
 ensure_venv() {
     ensure_uv
+    # If the venv exists but its python symlink no longer resolves (e.g. it
+    # points into /ext3/miniforge3 from before we moved miniforge3 onto
+    # /scratch), rebuild it. Cheap heuristic — the venv's python is a tiny
+    # symlink that uv will recreate against $CONTAINER_PYTHON.
+    if [ -f "$VENV_PATH/bin/activate" ] && [ ! -e "$VENV_PATH/bin/python" ]; then
+        echo "=== Rebuilding stale venv at $VENV_PATH (python link is broken) ==="
+        rm -rf "$VENV_PATH"
+    fi
     if [ ! -f "$VENV_PATH/bin/activate" ]; then
         echo "=== Creating venv at $VENV_PATH ==="
         mkdir -p "$(dirname "$VENV_PATH")"

From 1c0fc2aa5fd777cf6e9c7b20313d0f86ba182c7b Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@172-16-7-177.dynapool.nyu.edu>
Date: Wed, 20 May 2026 12:03:15 -0500
Subject: [PATCH 15/20] setup_container: detect stale venv by symlink TARGET,
 not existence
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous check ([ ! -e "$VENV_PATH/bin/python" ]) was the wrong
test: install runs inside singularity with the overlay mounted, so the
OLD venv's symlink into /ext3/miniforge3 still resolves. The link is
stale only because it points outside the current $MINIFORGE3_DIR. Use
readlink -f to resolve the whole chain and rebuild the venv unless the
resolved path is under $MINIFORGE3_DIR.
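
For reviewers, a minimal standalone sketch of the difference between the
two tests. It is illustrative only and is not part of
setup_container.sh: the helper name python_under_prefix and the
temporary old/new/venv layout are hypothetical stand-ins for
/ext3/miniforge3, $MINIFORGE3_DIR and $VENV_PATH.

```bash
#!/usr/bin/env bash
# Illustrative sketch: an existence test cannot see a stale-but-valid link.
set -eu

# Hypothetical helper: succeeds only if $1 resolves to a path under prefix $2.
python_under_prefix() {
    local resolved
    resolved="$(readlink -f "$1" 2>/dev/null || true)"
    case "$resolved" in
        "$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Stand-in layout: "old" plays /ext3/miniforge3, "new" plays $MINIFORGE3_DIR.
# pwd -P gives a physical path so the prefix comparison is not fooled by a
# symlinked temp directory.
tmp="$(cd "$(mktemp -d)" && pwd -P)"
mkdir -p "$tmp/old/bin" "$tmp/new/bin" "$tmp/venv/bin"
touch "$tmp/old/bin/python3" "$tmp/new/bin/python3"
ln -s "$tmp/old/bin/python3" "$tmp/venv/bin/python"

# The link still resolves, so the existence test sees nothing wrong.
if [ -e "$tmp/venv/bin/python" ]; then
    echo "exists: yes"
fi

# The resolved target is outside the new prefix, so the venv is stale.
if python_under_prefix "$tmp/venv/bin/python" "$tmp/new"; then
    echo "under new prefix: yes"
else
    echo "under new prefix: no"
fi

rm -rf "$tmp"
```

With GNU readlink this prints "exists: yes" followed by "under new
prefix: no". Note that the prefix match is textual: if the prefix
itself passes through a symlink, the resolved path will not match it
unless the prefix is canonicalized first.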

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 scripts/setup_container.sh | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/scripts/setup_container.sh b/scripts/setup_container.sh
index 61868cd022..8178f5f942 100755
--- a/scripts/setup_container.sh
+++ b/scripts/setup_container.sh
@@ -131,13 +131,22 @@ ensure_uv() {
 # of the box and works against any cpython.
 ensure_venv() {
     ensure_uv
-    # If the venv exists but its python symlink no longer resolves (e.g. it
-    # points into /ext3/miniforge3 from before we moved miniforge3 onto
-    # /scratch), rebuild it. Cheap heuristic — the venv's python is a tiny
-    # symlink that uv will recreate against $CONTAINER_PYTHON.
-    if [ -f "$VENV_PATH/bin/activate" ] && [ ! -e "$VENV_PATH/bin/python" ]; then
-        echo "=== Rebuilding stale venv at $VENV_PATH (python link is broken) ==="
-        rm -rf "$VENV_PATH"
+    # If the venv exists but its python doesn't resolve into the current
+    # $MINIFORGE3_DIR (e.g. it points at /ext3/miniforge3 from before we
+    # moved miniforge3 onto /scratch), rebuild. readlink -f resolves the
+    # whole symlink chain, so this catches the case where the link is
+    # valid inside the container (overlay mounted) but stale relative to
+    # where the new venv should point.
+    if [ -f "$VENV_PATH/bin/activate" ]; then
+        local resolved
+        resolved="$(readlink -f "$VENV_PATH/bin/python" 2>/dev/null || true)"
+        case "$resolved" in
+            "$MINIFORGE3_DIR"/*) ;;
+            *)
+                echo "=== Rebuilding stale venv at $VENV_PATH (python points to '$resolved', not under $MINIFORGE3_DIR) ==="
+                rm -rf "$VENV_PATH"
+                ;;
+        esac
     fi
     if [ ! -f "$VENV_PATH/bin/activate" ]; then
         echo "=== Creating venv at $VENV_PATH ==="

From aea954091596b63e17884d8ad4867852ffa32ed9 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@172-16-7-177.dynapool.nyu.edu>
Date: Wed, 20 May 2026 12:11:19 -0500
Subject: [PATCH 16/20] setup_container: pin miniforge3 to a Python 3.12
 release
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

miniforge3 25.x ships Python 3.13, but torch's cu121 wheels are cp39..cp312
only — no cp313 — so the install fails with 'no solution found when
resolving dependencies'. Pin the installer URL to 24.11.3-2 (Python
3.12) and add a version check in ensure_miniforge3 so an existing
miniforge3 with the wrong python gets reinstalled.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 scripts/setup_container.sh | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/scripts/setup_container.sh b/scripts/setup_container.sh
index 8178f5f942..7623039f2d 100755
--- a/scripts/setup_container.sh
+++ b/scripts/setup_container.sh
@@ -35,7 +35,11 @@ VENV_PATH="${VENV_PATH:-/scratch/$USER/venvs/pufferdrive}"
 # miniforge3 lives on /scratch too so the venv's python symlink resolves
 # from any node without needing the singularity overlay to be mounted.
 MINIFORGE3_DIR="${MINIFORGE3_DIR:-/scratch/$USER/miniforge3}"
-MINIFORGE3_INSTALLER_URL="${MINIFORGE3_INSTALLER_URL:-https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh}"
+# Pin to a miniforge3 release that ships Python 3.12. 25.x switched to 3.13,
+# but torch's cu121 wheels are cp39..cp312 only (no cp313), so 3.13 breaks
+# the install. Bump this once torch publishes cp313 wheels for our index.
+MINIFORGE3_INSTALLER_URL="${MINIFORGE3_INSTALLER_URL:-https://github.com/conda-forge/miniforge/releases/download/24.11.3-2/Miniforge3-24.11.3-2-Linux-x86_64.sh}"
+MINIFORGE3_PYTHON_VERSION="${MINIFORGE3_PYTHON_VERSION:-3.12}"
 CONTAINER_PYTHON="${CONTAINER_PYTHON:-$MINIFORGE3_DIR/bin/python3}"
 
 create_overlay() {
@@ -89,7 +93,16 @@ EOF
 # node, so the venv's bin/python symlink resolves outside singularity too.
 ensure_miniforge3() {
     if [ -x "$MINIFORGE3_DIR/bin/python3" ]; then
-        return 0
+        # Verify the existing miniforge3 has the python version we expect —
+        # otherwise an earlier install that grabbed "latest" (Python 3.13)
+        # would stay around, and uv venv would happily reuse it.
+        local existing
+        existing="$("$MINIFORGE3_DIR/bin/python3" -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null || true)"
+        if [ "$existing" = "$MINIFORGE3_PYTHON_VERSION" ]; then
+            return 0
+        fi
+        echo "=== miniforge3 at $MINIFORGE3_DIR has python $existing (want $MINIFORGE3_PYTHON_VERSION); reinstalling ==="
+        rm -rf "$MINIFORGE3_DIR"
     fi
     echo "=== Installing miniforge3 to $MINIFORGE3_DIR ==="
     mkdir -p "$(dirname "$MINIFORGE3_DIR")"

From 17506911b0f02272212b7db3a69374610bae30c0 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 17:41:46 -0500
Subject: [PATCH 17/20] =?UTF-8?q?gitignore:=20restore=20docs/=5Fbuild/=20?=
 =?UTF-8?q?=E2=80=94=20harmless=20defensive=20ignore=20for=20sphinx?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Putting back the docs/_build/ line I dropped earlier alongside the
blanket 'docs/' ignore. The blanket ignore was actively wrong (hid
our checked-in markdown), but docs/_build/ is just the standard
sphinx build output dir — costs nothing to keep, and protects future
contributors from accidentally committing 'make html' output if
sphinx is ever added.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .gitignore | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/.gitignore b/.gitignore
index 62f3bc9449..d0b9435dfb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -81,6 +81,9 @@ instance/
 # Scrapy stuff:
 .scrapy
 
+# Sphinx documentation (if sphinx is later added; docs/*.md is tracked)
+docs/_build/
+
 # PyBuilder
 target/
 

From e36b5d32d6728acc995207caddf88cf0fc348392 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 17:43:37 -0500
Subject: [PATCH 18/20] gitignore: drop failure_mining/ from this PR (moved to
 a separate PR)

The cluster-docs PR shouldn't carry the mining-outputs ignore. Splitting
to its own PR so the two are reviewable independently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .gitignore | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/.gitignore b/.gitignore
index d0b9435dfb..3b9fa56c27 100644
--- a/.gitignore
+++ b/.gitignore
@@ -214,6 +214,3 @@ external/
 # Claude config
 .claude/
 CLAUDE.local.md
-
-# Mining output artifacts (large local-only renders/replays)
-failure_mining/

From 93be35aa3682a3b6d261fa44df46d853446c0ba5 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@172-16-7-177.dynapool.nyu.edu>
Date: Wed, 20 May 2026 09:52:22 -0500
Subject: [PATCH 19/20] docs: failure mining operational guide

New docs/mining.md covering the mine_failures workflow:
  - score_threshold semantics (default -inf saves nothing)
  - the required --vec.backend Serial flag (pufferl's default
    Multiprocessing backend forks workers post-torch-import and
    deadlocks on CUDA)
  - loading checkpoints with non-default policy.* / rnn.* dims
    (mine_failures doesn't auto-merge the sibling config.yaml that
    train() does)
  - on-cluster submit_cluster.py pattern with --main override
  - viewer features

README.md gains a short pointer paragraph at the end of the existing
Failure mining section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 README.md      |   2 +
 docs/mining.md | 167 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 169 insertions(+)
 create mode 100644 docs/mining.md

diff --git a/README.md b/README.md
index 9011a93e03..5e4914d60c 100644
--- a/README.md
+++ b/README.md
@@ -148,6 +148,8 @@ renders/index.html           # sortable index of all episodes
 
 Open `renders/index.html` in a browser to triage. The index page filters by "failures only" / "replays only" and sorts by any metric column. Each row links to the per-episode viewer with the scene's full 2D animation.
 
+For the deeper guide (viewer features, `score_threshold` semantics, the required `--vec.backend Serial` flag, loading checkpoints with non-default `policy.*` / `rnn.*` dims, and the on-cluster `submit_cluster.py` pattern), see [`docs/mining.md`](docs/mining.md).
+
 ## Key Configuration (`pufferlib/config/ocean/drive.ini`)
 
 ### `[env]` — Simulation
diff --git a/docs/mining.md b/docs/mining.md
new file mode 100644
index 0000000000..1c11b9e60a
--- /dev/null
+++ b/docs/mining.md
@@ -0,0 +1,167 @@
+# Failure mining workflow
+
+How to roll a trained policy out, capture compact replays, and produce a
+browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures`
+and `pufferlib/mining_viz.py`.
+
+## TL;DR
+
+```bash
+# Roll the policy out for 100 episodes, save a compact replay for each episode
+# whose episode_return falls below the threshold (1e9 here, so all of them),
+# then render per-episode HTML plus a sortable index.
+puffer mine_failures puffer_drive \
+    --load-model-path /path/to/model_011000.pt \
+    --mine.output-dir ./failure_mining/baseline_011000 \
+    --mine.num-episodes 100 \
+    --mine.score-threshold 1e9 \
+    --vec.backend Serial
+```
+
+Outputs:
+
+```
+./failure_mining/baseline_011000/
+    replays/episode_NNNNNN.replay.zlib   one per saved episode
+    renders/episode_NNNNNN.html          per-replay viewer
+    renders/index.html                   sortable summary
+    episodes.csv                         all episodes, all metrics
+```
+
+Open the index in a browser:
+
+```bash
+open ./failure_mining/baseline_011000/renders/index.html
+```
+
+## What gets captured
+
+A compact replay bundle is a pickled+zlib'd `schema_version=2` dict containing
+per-step agent state, traffic state, and observation arrays for a single
+episode. Bundles are produced C-side when `capture_compact_replay=True` is
+passed to `Drive(...)`. `mine_failures` sets this automatically.
+
+Each saved bundle is paired with a metadata row in `episodes.csv` including
+`episode_return`, `collision_rate`, `offroad_rate`, `num_goals_reached`,
+`avg_distance_per_infraction`, etc. The HTML viewer (`pufferlib/mining_viz.py`)
+reads the bundle and replays it in-browser on a top-down canvas, with optional
+overlays for the agent's observed FOV, partner circle, goal route, and waypoint
+markers.
+
+## `mine.score_threshold` selection
+
+The save rule is "write replay if and only if `episode_return < score_threshold`".
+
+- `--mine.score-threshold 1e9` captures every episode (any real return is
+  less than 1e9).
+- `--mine.score-threshold 0` captures only negative-return ("true failure")
+  episodes.
+- Default `-inf` captures **nothing** — useful only if you want `episodes.csv`
+  metrics without the bundle overhead.
+
+`episodes.csv` always contains all N episodes' metadata regardless of
+threshold; only the bundle save + HTML render is gated.
+
+## `--vec.backend Serial`
+
+Mining must use `--vec.backend Serial`. The drive.ini default
+`Multiprocessing` backend forks workers after torch is imported, and the
+forked children deadlock on CUDA. The symptom is a parent process at 100% CPU
+with no visible progress and no `[mine_failures] target episodes=...` print.
+
+`Serial` keeps the env in the same process as the policy. Mining is a single
+env / single rollout workflow, so the throughput cost is negligible.
+
+## Tuning the rollout config
+
+The mining env config comes from drive.ini's `[mine]` section plus CLI
+overrides:
+
+```bash
+# Larger output (slower):
+--mine.num-episodes 500
+
+# Replay mode (drive recorded nuPlan / Waymo scenarios):
+--env.simulation-mode replay \
+--env.control-mode control_sdc_only \
+--env.map-dir /path/to/recorded_bins \
+--env.init-steps 10 \
+--env.scenario-length 200
+
+# Looser goal radius (default 2 m, up to 12 m under reward randomization):
+--env.goal-radius 6
+
+# Closer-spaced goals:
+--env.min-waypoint-spacing 10 \
+--env.max-waypoint-spacing 15
+```
+
+## Loading checkpoints with non-default architecture
+
+`mine_failures` does not read the sibling `config.yaml` next to
+`load_model_path` (only `pufferl.train` does). If the checkpoint was trained
+with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`,
+`backbone_num_layers=4`), pass them on the CLI to match the saved state dict:
+
+```bash
+--policy.input-size 128 \
+--policy.actor-hidden-size 512 \
+--policy.actor-num-layers 0 \
+--policy.backbone-hidden-size 512 \
+--policy.backbone-num-layers 4 \
+--policy.critic-hidden-size 512 \
+--policy.critic-num-layers 0 \
+--policy.encoder-gigaflow True \
+--policy.split-network False \
+--rnn.hidden-size 512 \
+--rnn.input-size 512
+```
+
+You can read the right values out of the checkpoint's sibling `config.yaml`
+(under `policy:` and `rnn:`) and pass them through. The error if you forget
+is a wall of `size mismatch for ...` lines from `policy.load_state_dict`.
+
+## On the cluster
+
+Mining is GPU-bound on the policy forward pass but memory-light compared to
+training (single env, no rollout buffer, no PPO update). 48 GB RAM and a
+60-minute time limit are plenty for 100 episodes. The same `submit_cluster.py`
+flow as training works — override `--main` to invoke `mine_failures`:
+
+```bash
+python3 scripts/submit_cluster.py \
+    --save_dir /scratch/$USER/runs \
+    --prefix mine \
+    --compute_config scripts/cluster_configs/nyu_greene.yaml \
+    --account <acct> --partition <gpu-partition> --time 60 \
+    --mem 48gb --cpus 8 \
+    --container \
+    --main "-m pufferlib.pufferl mine_failures puffer_drive" \
+    --args \
+        load_model_path=<path-to-ckpt> \
+        mine.output_dir=/scratch/$USER/failure_mining/out \
+        mine.num_episodes=100 \
+        mine.score_threshold=1e9 \
+        vec.backend=Serial
+```
+
+See [`docs/cluster_training.md`](cluster_training.md) for one-time setup of
+the login-side submitit (`python3 -m pip install --user submitit pyyaml
+cloudpickle`).
+
+Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing.
+
+## Viewer features (`mining_viz.py`)
+
+The per-episode HTML viewer supports:
+
+- Frame scrubber + play/pause + speed control.
+- Toggle observation overlay (FOV rectangle, partner circle, observed-entity
+  highlights, goal route, waypoint markers).
+- Toggle road segment / road edge / lane line rendering.
+- Map background (CARLA / nuPlan / Waymo road graph from the bundle's
+  embedded `simulation_mode`).
+
+The index (`renders/index.html`) is a sortable table linking to each per-episode
+HTML, with the metadata columns from `episodes.csv` (failure metrics, scenario
+ID, map name).

From 0ca867b58bba7bab8d42964b5eb47d87adf09e37 Mon Sep 17 00:00:00 2001
From: Eugene Vinitsky <eugenevinitsky@Eugenes-MacBook-Air.local>
Date: Wed, 20 May 2026 11:39:35 -0500
Subject: [PATCH 20/20] docs/mining: 'source the venv' instead of 'pip install
 --user submitit'
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Mirror the cluster_training.md change — setup_container.sh install
already lands submitit in the venv, and sourcing the venv on login
makes it importable. No --user bootstrap needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/mining.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/mining.md b/docs/mining.md
index 1c11b9e60a..41b1b22caa 100644
--- a/docs/mining.md
+++ b/docs/mining.md
@@ -145,9 +145,9 @@ python3 scripts/submit_cluster.py \
         vec.backend=Serial
 ```
 
-See [`docs/cluster_training.md`](cluster_training.md) for one-time setup of
-the login-side submitit (`python3 -m pip install --user submitit pyyaml
-cloudpickle`).
+Source the venv before invoking `submit_cluster.py` (`source
+/scratch/$USER/venvs/pufferdrive/bin/activate`) — see
+[`docs/cluster_training.md`](cluster_training.md) for the rationale.
 
 Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing.