From b992f62329d22e869614e2437b70f832d4daf8a9 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 07:06:49 -0500 Subject: [PATCH 01/20] docs: cluster training + mining operational guides MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add two new docs and link them from README.md: - docs/cluster_training.md — sbatch template (incl. GPU heartbeat, required for jobs > ~2h or the cluster's idle-reclaimer will scancel them), CPU rebuild path (cpu_short partition; no GPU needed for build_ext), account/partition strategy (QOSGrpGRES vs QOSMaxGRESPerUser, _tandon_priority vs _general tiers, when to race partitions), replay-mode memory sizing (vec.num_envs lever, [eval.*] suite cost at first eval), and submit_cluster.py failure modes (login-node python lacks pip; submitit's srun launcher inherits the in-container venv python path). - docs/mining.md — mine_failures workflow, the score_threshold default-captures-nothing gotcha (docstring is misleading; -inf means no replay is saved), the pufferlib.vector Multiprocessing CUDA-after-fork hang and --vec.backend Serial workaround, the shape-mismatch gotcha when load_model_path checkpoints have non-default policy.* / rnn.* dimensions (mine_failures doesn't do the sibling config.yaml auto-merge that train() does), and the on-cluster sbatch pattern. .gitignore: un-ignore docs/ (was a blanket rule that prevented checking in markdown docs alongside the existing eval_unification.md spec); add failure_mining/ which is large local-only mining output. 
Co-Authored-By: Claude Opus 4.7 --- .gitignore | 7 +- README.md | 4 + docs/cluster_training.md | 254 +++++++++++++++++++++++++++++++++++++++ docs/mining.md | 179 +++++++++++++++++++++++++++ 4 files changed, 442 insertions(+), 2 deletions(-) create mode 100644 docs/cluster_training.md create mode 100644 docs/mining.md diff --git a/.gitignore b/.gitignore index fd4c1de0a2..390f3dd7a2 100644 --- a/.gitignore +++ b/.gitignore @@ -211,9 +211,12 @@ pufferlib/resources/drive/output*.gif # External local clones external/ -# Generated docs -docs/ +# Generated docs (sphinx build output only; docs/*.md is tracked) +# docs/ # Claude config .claude/ CLAUDE.local.md + +# Mining output artifacts (large local-only renders/replays) +failure_mining/ diff --git a/README.md b/README.md index aca5182bee..489aa057d5 100644 --- a/README.md +++ b/README.md @@ -71,6 +71,8 @@ python scripts/submit_cluster.py \ `scripts/cluster_configs/nyu_greene.yaml` defines `account`, `gpus`, `cpus`, `mem`, `time` — edit `account` to your allocation before first submit. `--container` makes `submit_cluster.py` wrap the job command in `singularity exec --nv --overlay $OVERLAY_PATH:ro $IMAGE_PATH ...`. +For an operational deep-dive — sbatch templates, the GPU heartbeat (required for runs > ~2h or the idle-GPU reclaimer will scancel them), CPU rebuild path, account/partition strategy, replay-mode memory sizing, and known `submit_cluster.py` failure modes — see [`docs/cluster_training.md`](docs/cluster_training.md). + ## Data Place binaries under `pufferlib/resources/drive/binaries/`. @@ -146,6 +148,8 @@ renders/index.html # sortable index of all episodes Open `renders/index.html` in a browser to triage. The index page filters by "failures only" / "replays only" and sorts by any metric column. Each row links to the per-episode viewer with the scene's full 2D animation. 
+For the deeper guide — viewer features, the `score_threshold` semantic gotcha (default `-inf` saves nothing — not what the `mine_failures` docstring claims), the `Multiprocessing`-backend CUDA-after-fork hang and the `--vec.backend Serial` workaround, the shape-mismatch gotcha when loading checkpoints with non-default `policy.*` dims, and the on-cluster sbatch pattern — see [`docs/mining.md`](docs/mining.md). + ## Key Configuration (`pufferlib/config/ocean/drive.ini`) ### `[env]` — Simulation diff --git a/docs/cluster_training.md b/docs/cluster_training.md new file mode 100644 index 0000000000..7b8fa365c8 --- /dev/null +++ b/docs/cluster_training.md @@ -0,0 +1,254 @@ +# Cluster training — operational guide + +How to run PufferDrive training on a SLURM cluster. Written against the NYU +Greene workflow but the patterns generalize. Pairs with `scripts/setup_container.sh`, +`scripts/gpu_heartbeat.py`, and `scripts/submit_cluster.py`. + +## TL;DR + +```bash +# One-time per cluster: create the singularity overlay and install deps. +./scripts/setup_container.sh create-overlay +sbatch --account= --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \ + --wrap "./scripts/setup_container.sh install" + +# Per code change to C extensions: rebuild on a CPU partition (no GPU needed). +sbatch --account= --partition=cpu_short --cpus-per-task=8 --mem=16gb --time=20 \ + --chdir=$PWD -o $LOGDIR/rebuild_%j.log \ + --wrap "./scripts/setup_container.sh rebuild" + +# Training: direct sbatch with inline singularity-exec + heartbeat. +# (`submit_cluster.py` has known limitations on this branch lineage — +# see "submit_cluster.py" below.) +sbatch /path/to/my_train.sh # template in this doc +``` + +## Container model + +PufferDrive on Greene runs inside a singularity container. The container provides +a modern glibc + CUDA toolkit; the project's Python environment lives in a venv +on `/scratch` (not in the overlay) so installs aren't bottlenecked by fuse2fs. 
+ +The container is invoked with a **read-only** overlay mount for the miniforge3 +base interpreter, plus the on-disk venv for project packages: + +```bash +singularity exec --nv \ + --overlay /scratch/$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \ + /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \ + bash -c ' + source /scratch/$USER/venvs/pufferdrive/bin/activate + export PYTHONNOUSERSITE=1 + cd /scratch/$USER/code/PufferDrive + + ' +``` + +`source venv/activate` is **required** — sourcing `/ext3/env.sh` alone gives you +a torch-less base interpreter (it imports as a namespace-package stub with +`torch.__file__ == None`). + +## Training sbatch template + +The minimal template below uses a direct `sbatch` (no `submit_cluster.py`), +includes the GPU heartbeat to prevent idle-reclamation, and wraps everything in +a singularity-exec. Adapt the `--account`, `--partition`, paths, and CLI args: + +```bash +#!/bin/bash +#SBATCH --job-name=mytrain +#SBATCH --account= +#SBATCH --partition= +#SBATCH --gres=gpu:1 +#SBATCH --cpus-per-task=16 +#SBATCH --mem=96gb +#SBATCH --time=2880 # 48h +#SBATCH -o /scratch/$USER/runs/logs/train_%j.log + +singularity exec --nv \ + --overlay /scratch/$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \ + /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \ + bash -c " + source /scratch/$USER/venvs/pufferdrive/bin/activate + export PYTHONNOUSERSITE=1 + export TORCH_CUDA_ARCH_LIST=\"8.0;8.9;9.0\" + export XDG_CACHE_HOME=/scratch/$USER/cache + export WANDB_DIR=/scratch/$USER/wandb_data + cd /scratch/$USER/code/PufferDrive + + # GPU heartbeat: keeps utilization above 65% during eval/checkpoint dips + # so the cluster's idle-GPU reclaimer doesn't kill the job (root scancel + # at ~2h is the symptom). + python scripts/gpu_heartbeat.py > /tmp/gpu_heartbeat.log 2>&1 & + HB_PID=\$! 
+ + torchrun --standalone --nproc_per_node 1 -m pufferlib.pufferl train puffer_drive \ + --train.total-timesteps 10000000000 \ + --train.checkpoint-interval 250 \ + --wandb --wandb-project pufferdrive \ + --train.data-dir /scratch/$USER/runs/mytrain + + TRAIN_EXIT=\$? + kill \$HB_PID 2>/dev/null + exit \$TRAIN_EXIT +" +``` + +`TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 (sm_80), L40S/H100 (sm_89/90), +and H200 (sm_90). Without it the C extension is compiled only for the build +host's GPU type and crashes on different hardware with `no kernel image is available`. + +## GPU heartbeat — required for long runs + +Without `scripts/gpu_heartbeat.py` backgrounded alongside training, jobs lasting +~2 hours risk **CANCELLED by uid 0** from the cluster's idle-GPU reclaimer. +Eval / checkpoint / map-load phases dip GPU utilization briefly, and the +reclaimer interprets those dips as "idle". + +The heartbeat monitors `nvidia-smi` and runs short matmul bursts when +utilization drops below 65%, so the cluster always sees the GPU as active. +It cooperates with real training (steps aside when training is active). + +## CPU rebuild path + +GPU partitions are routinely saturated by training jobs of this same project. +`setup_container.sh rebuild` doesn't actually need a GPU — it just runs +`python setup.py build_ext --inplace --force` plus a smoke import. Submit to a +CPU partition for fast turnaround: + +```bash +sbatch --account= --partition=cpu_short \ + --cpus-per-task=8 --mem=16gb --time=20 \ + --chdir=$PWD \ + -o /scratch/$USER/rebuild_logs/rebuild_%j.log \ + --wrap "./scripts/setup_container.sh rebuild" +``` + +`--chdir=$PWD` is required because the script uses `./scripts/`. Takes ~40s. + +## Account / partition strategy + +NYU Greene exposes `_general` and `_tandon_priority` account tiers, each with +their own QOS pool per partition. 
When `squeue` shows your job pending on +`QOSGrpGRES`, the issue is partition-level pool saturation — **switching +accounts within the same tier doesn't help**, but switching partitions does. + +`QOSMaxGRESPerUser` is different: you're over your own concurrent-GPU cap. +Cancel a pending job or wait. + +Practical recipe for long training: + +- For short jobs (rebuilds, eval, mining): try `cpu_short` if CPU-only; else + `h200_public + *_general`. Often the fastest GPU slot. +- For long training: race 2–3 GPU partitions in parallel and cancel the + losers as soon as one starts. `tandon_priority` accounts often unblock when + `_general` pools are pinned. `l40s_public` typically has multi-hour queues + and is the last resort. + +Quick test-only across combos: + +```bash +for combo in \ + " a100_tandon" \ + " h100_tandon" \ + " h200_public"; do + read ACCT PART <<< "$combo" + RES=$(sbatch --test-only --account=$ACCT --partition=$PART \ + --gres=gpu:1 --cpus-per-task=16 --mem=96gb --time=2880 \ + --wrap "echo test" 2>&1 | head -1) + echo "$ACCT $PART -> $RES" +done +``` + +`--test-only` prints an estimated start time without actually submitting. + +## Memory sizing — replay mode is heavier than gigaflow + +Gigaflow training with `num_agents=1024` fits comfortably in 96 GB on Greene. +Replay-mode training on nuPlan does not — each sub-env loads its own bin file +(parsed lane graph + per-agent trajectories), so `--mem=96gb` OOMs. + +Levers, in order of impact: + +- `--vec.num-envs N` (drive.ini default `20`). Each vec worker is a fork; each + worker holds copy-on-write-divergent state proportional to `num_agents/num_envs` + + the loaded map data. Halving from 20→10 saves ~25 GB. +- Disable subsets of `[eval.*]` evaluators via CLI overrides. The 14 enabled + evaluators in `drive.ini` all spin up their own `pufferlib.vector.make` envs + at the first eval cycle and can collectively cost 30–50 GB at peak. 
+ `[eval.validation_gigaflow]` specifically renders 8 × 1080p MP4s in parallel. +- `--mem=128gb` or `--mem=192gb` if you need the eval signal in wandb. + +`vec.*` keys are **not** in pufferl's `KEYS_OF_INTEREST` auto-merge, so a +sibling `config.yaml` next to a `load_model_path` won't override them. They +come from `drive.ini` or the CLI. + +## submit_cluster.py — known limitations + +`scripts/submit_cluster.py` wraps the training launch in submitit + a heartbeat +wrapper. On the `emerge/temp_training`-derived branch lineage it doesn't work +end-to-end: + +1. Login-node `/usr/bin/python3` lacks `pip` → can't `pip install submitit` + on the login node. The venv's `pip` shebang points at + `/ext3/miniforge3/bin/python3` (overlay-internal) so `pip install` outside + the container errors with "required file not found". +2. Running `submit_cluster.py` *inside* the container makes submitit's `srun` + launcher inherit the venv python path (`/scratch/.../venvs/.../python`). + On the compute node `srun` tries to invoke that path *outside* singularity + and fails with `execve(): No such file or directory`. submit_cluster.py + wraps the *inner* train command in singularity-exec but the *outer* launcher + is not wrapped. + +Workaround if you really want submitit + sbatch: bind the slurm dirs into the +container so the in-container python can see sbatch and call it directly: + +```bash +singularity exec --nv \ + --bind /opt/slurm:/opt/slurm \ + --bind /run/munge:/run/munge \ + --bind /etc/passwd:/etc/passwd \ + --bind /etc/group:/etc/group \ + --overlay overlay.ext3:ro \ + $SIF bash -c 'PATH=/opt/slurm/bin:$PATH ...submit_cluster.py...' +``` + +This gets the submission through (real SLURM job ID), but the **submitted job +itself** still hits (2) above unless you also bind those dirs into the launched +container, which submit_cluster.py doesn't do. + +**Recommended**: use the direct-sbatch template from this doc. 
The heartbeat +is a 4-line bash addition; you don't need submitit for that. + +## Common pitfalls + +- **`ncclCommShrink` undefined symbol** at `from torch._C import *`. Greene's + cuda12.8.1 sif ships `libnccl 2.25.1` in `/usr/lib`, but torch ≥ 2.10 calls + `ncclCommShrink` from NCCL ≥ 2.27.5. torch's own NCCL 2.27.5 sits in + `site-packages/nvidia/nccl/lib/` and needs to win the loader search. + `setup_container.sh install`/`rebuild` patches `/ext3/env.sh` to prepend that + dir to `LD_LIBRARY_PATH`; existing overlays from before that patch need the + same line appended to `/ext3/env.sh`. +- **`-lomp5` link errors on Linux** with conda-forge openmp. The default is for + older Intel OpenMP packaging. `setup.py` honors `OMP_LIB="-L$prefix/lib -lomp"`. +- **`du /ext3` undercounts** when the overlay has cruft outside `upper/ext3/` + (e.g. failed pip installs that wrote to `/usr/local/lib/...` end up in + `upper/usr/local/` and aren't visible to apptainer's view). Use + `debugfs -R "ls /upper" overlay.ext3` from a login node to inspect. +- **Squash-merging stacked PRs** can hit "stale info" on `--force-with-lease` + when the token URL differs from `origin`. Either fetch first or use + `--force` with care. + +## Don't chain `sleep` to wait on background jobs + +A bare `sleep N` to poll on a submitted job's state is hard on the SLURM +controller and brittle. Patterns that work: + +- **One-shot wait**: a single `sacct -j $JOBID --format=State -n -P` after a + generous initial sleep tuned to expected runtime. +- **Conditional wait**: a `Monitor`-style `until` loop in a single background + shell, with a sane upper bound. +- **Wall-clock interval**: schedule a wake-up rather than long-running `sleep`. + +Hammering `squeue` in a tight loop is bad cluster citizenship — the controller +is shared across all users. Sleep at least 60 s between checks. 
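The conditional-wait pattern can be sketched as a bounded loop. This is a hypothetical helper rather than anything shipped in the repo: `wait_for_job`, `sacct_state`, and their parameters are illustrative names, the state lookup reuses the `sacct` invocation quoted above, and the set of terminal states is an assumption about which SLURM states matter here.

```python
import subprocess
import time

# Assumed set of terminal states; extend it for your site if needed.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}


def sacct_state(job_id):
    """Return the first state word sacct reports for job_id, or "" if none yet."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "--format=State", "-n", "-P"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    # A state such as "CANCELLED by 0" spans several words; keep the first.
    return out[0] if out else ""


def wait_for_job(job_id, interval=60, max_checks=120,
                 get_state=sacct_state, sleep=time.sleep):
    """Poll at most max_checks times, never faster than once per 60 s."""
    interval = max(interval, 60)  # enforce the 60 s floor recommended above
    for _ in range(max_checks):
        state = get_state(job_id)
        if state in TERMINAL:
            return state
        sleep(interval)
    return None  # gave up; the job is still pending or running
```

The `get_state` and `sleep` parameters exist only so the loop can be exercised without a SLURM controller; in normal use the defaults apply.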
diff --git a/docs/mining.md b/docs/mining.md new file mode 100644 index 0000000000..e677f81ced --- /dev/null +++ b/docs/mining.md @@ -0,0 +1,179 @@ +# Failure mining workflow + +How to roll a trained policy out, capture compact replays, and produce a +browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures` +and `pufferlib/mining_viz.py`. + +## TL;DR + +```bash +# Roll the policy out for 100 episodes, save compact replays for "failures", +# render HTML for each + a sortable index. +puffer mine_failures puffer_drive \ + --load-model-path /path/to/model_011000.pt \ + --mine.output-dir ./failure_mining/baseline_011000 \ + --mine.num-episodes 100 \ + --vec.backend Serial # see "Multiprocessing hang" below + +# Outputs: +# ./failure_mining/baseline_011000/ +# replays/episode_NNNNNN.replay.zlib ← one per failed episode +# renders/episode_NNNNNN.html ← per-replay viewer +# renders/index.html ← sortable summary +# episodes.csv ← all episodes, all metrics +``` + +Open the index in a browser: + +```bash +open ./failure_mining/baseline_011000/renders/index.html +``` + +## What gets captured + +A compact replay bundle is a pickled+zlib'd `schema_version=2` dict containing +per-step agent state, traffic state, and observation arrays for a single +episode. Bundles are produced **C-side** when `capture_compact_replay=True` +is passed to `Drive(...)`. `mine_failures` sets this automatically. + +Each saved bundle is paired with a metadata row in `episodes.csv` including +`episode_return`, `collision_rate`, `offroad_rate`, `num_goals_reached`, +`avg_distance_per_infraction`, etc. The HTML viewer (`pufferlib/mining_viz.py`) +reads the bundle and replays it in-browser on a top-down canvas, with optional +overlays for the agent's observed FOV, partner circle, goal route, and waypoint +markers. + +## `mine.score_threshold` — gotcha + +The `mine_failures` selection rule is "save replay if and only if +`episode_return < score_threshold`". 
The docstring claims `-inf` means "capture +every episode" — that's wrong: `episode_return < -inf` is never true, so the +default captures **nothing**. To actually save episodes: + +```bash +# Capture every episode (works with any non-degenerate return): +--mine.score-threshold 1e9 + +# Capture only "true" failures (negative returns): +--mine.score-threshold 0 +``` + +`episodes.csv` always contains all N episodes' metadata regardless of threshold +— only the bundle save + HTML render is gated. + +## Multiprocessing hang — use `--vec.backend Serial` + +`pufferl.mine_failures` goes through `pufferlib.vector.make(...)` with the +drive.ini default `backend=Multiprocessing`. Even with `num_envs=1, +num_workers=1`, that backend **forks** workers post-torch-import. Forking after +torch has been imported in the parent is a classic deadlock for CUDA — the +child can hang on CUDA initialization, and the parent sits forever on the IPC +pipe. + +Symptoms: CPU 100% in the parent, RSS frozen, no `[mine_failures] target +episodes=...` print, never produces output. If you let it sit for ~10 minutes +nothing changes. + +Fix: force the in-process backend. + +```bash +--vec.backend Serial +``` + +This keeps the env in the same process as the policy. No fork, no hang. The +single-env nature of mining means the throughput cost is negligible. + +## Tuning the rollout config + +The mining env config comes from drive.ini's `[mine]` section plus per-CLI +overrides. 
Useful knobs: + +```bash +# Larger output (slower): +--mine.num-episodes 500 + +# Replay mode (drive recorded nuPlan / Waymo scenarios): +--env.simulation-mode replay \ +--env.control-mode control_sdc_only \ +--env.map-dir /path/to/recorded_bins \ +--env.init-steps 10 \ +--env.scenario-length 200 + +# Looser goal radius (useful if the trained policy struggles with the +# stricter default; default 2m, max 12m under reward randomization): +--env.goal-radius 6 + +# Closer-spaced goals (mining a policy that wasn't trained on these): +--env.min-waypoint-spacing 10 \ +--env.max-waypoint-spacing 15 +``` + +## Resume + obs-shape gotcha + +`mine_failures` does **not** read the sibling `config.yaml` next to +`load_model_path` — only `pufferl.train` does. If the checkpoint was trained +with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`, +`backbone_num_layers=4`), you'll get a shape mismatch on `load_state_dict` +unless you pass them on the CLI: + +```bash +--policy.input-size 128 \ +--policy.actor-hidden-size 512 \ +--policy.actor-num-layers 0 \ +--policy.backbone-hidden-size 512 \ +--policy.backbone-num-layers 4 \ +--policy.critic-hidden-size 512 \ +--policy.critic-num-layers 0 \ +--policy.encoder-gigaflow True \ +--policy.split-network False \ +--rnn.hidden-size 512 \ +--rnn.input-size 512 +``` + +You can read the right values out of the checkpoint's sibling `config.yaml` +(under `policy:` and `rnn:`) and pass them through. + +## On the cluster + +Mining is GPU-bound on the policy forward pass but memory-light compared to +training (single env, no rollout buffer, no PPO update). 
48 GB RAM and a +60-minute time limit are plenty for 100 episodes: + +```bash +sbatch --account= --partition= --gres=gpu:1 \ + --cpus-per-task=8 --mem=48gb --time=60 \ + --chdir=$PWD -o $LOGDIR/mine_%j.log \ + --wrap " + singularity exec --nv \ + --overlay /scratch/\$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \ + /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \ + bash -c ' + source /scratch/\$USER/venvs/pufferdrive/bin/activate + export PYTHONNOUSERSITE=1 + cd /scratch/\$USER/code/PufferDrive + python -m pufferlib.pufferl mine_failures puffer_drive \ + --load-model-path \$CKPT \ + --mine.output-dir \$OUT \ + --mine.num-episodes 100 \ + --mine.score-threshold 1e9 \ + --vec.backend Serial + ' + " +``` + +Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing. + +## Viewer features (`mining_viz.py`) + +The per-episode HTML viewer supports: + +- Frame scrubber + play/pause + speed control. +- Toggle observation overlay (FOV rectangle, partner circle, observed-entity + highlights, goal route, waypoint markers). +- Toggle road segment / road edge / lane line rendering. +- Map background (CARLA / nuPlan / Waymo road graph from the bundle's + embedded `simulation_mode`). + +The index (`renders/index.html`) is a sortable table linking to each per-episode +HTML, with the metadata columns from `episodes.csv` (failure metrics, scenario +ID, map name). From 297e85ebf4527974d0ac8fb906f1418247779d82 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 07:43:05 -0500 Subject: [PATCH 02/20] submit_cluster: wrap submitit launcher in singularity when --container Two coupled changes that make submit_cluster.py work end-to-end on clusters where the system python differs across login and compute nodes (Greene: login 3.12, compute 3.9) and the only consistent python lives inside the singularity overlay. 1. 
Submission side: after constructing AutoExecutor, when --container is set, override executor._executor.python so submitit's outer launcher is invoked as singularity exec --nv --overlay :ro $IMAGE $VENV/bin/python That makes the compute-side srun command resolve the launcher python inside the container (where the venv's symlink to /ext3/miniforge3/bin/python3 is valid) instead of needing a cross-node-consistent system python with submitit installed. 2. launch_training side: when the function lands on the compute node it's now already inside singularity (from (1)), so skip the second singularity exec wrap and just run inner_cmd via bash -c. The /.singularity.d/Singularity marker file distinguishes the cases so direct-sbatch callers (not yet inside a container) still get the wrap. Co-Authored-By: Claude Opus 4.7 --- docs/cluster_training.md | 229 ++++++++++++++++++++++++++------------ docs/mining.md | 44 ++++---- scripts/submit_cluster.py | 70 +++++++++--- 3 files changed, 234 insertions(+), 109 deletions(-) diff --git a/docs/cluster_training.md b/docs/cluster_training.md index 7b8fa365c8..3b1bc092e3 100644 --- a/docs/cluster_training.md +++ b/docs/cluster_training.md @@ -7,20 +7,29 @@ Greene workflow but the patterns generalize. Pairs with `scripts/setup_container ## TL;DR ```bash -# One-time per cluster: create the singularity overlay and install deps. +# One-time per cluster: +# (a) create the singularity overlay and install deps into the venv ./scripts/setup_container.sh create-overlay sbatch --account= --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \ --wrap "./scripts/setup_container.sh install" +# (b) install submitit on the login-node system python (see "Why" below) +python3 -m ensurepip --user +python3 -m pip install --user submitit pyyaml cloudpickle # Per code change to C extensions: rebuild on a CPU partition (no GPU needed). 
sbatch --account= --partition=cpu_short --cpus-per-task=8 --mem=16gb --time=20 \ --chdir=$PWD -o $LOGDIR/rebuild_%j.log \ --wrap "./scripts/setup_container.sh rebuild" -# Training: direct sbatch with inline singularity-exec + heartbeat. -# (`submit_cluster.py` has known limitations on this branch lineage — -# see "submit_cluster.py" below.) -sbatch /path/to/my_train.sh # template in this doc +# Training: submit_cluster.py from the login node (NOT inside singularity) +# with --container --heartbeat. Heartbeat is required for runs > ~2h. +python3 scripts/submit_cluster.py \ + --save_dir /scratch/$USER/runs \ + --compute_config scripts/cluster_configs/nyu_greene.yaml \ + --program_config scripts/cluster_configs/train_base.yaml \ + --container --heartbeat \ + --account --partition --time 2880 \ + --args train.checkpoint_interval=250 env.simulation_mode=gigaflow ``` ## Container model @@ -48,21 +57,128 @@ singularity exec --nv \ a torch-less base interpreter (it imports as a namespace-package stub with `torch.__file__ == None`). -## Training sbatch template +## Submitting training — `submit_cluster.py` -The minimal template below uses a direct `sbatch` (no `submit_cluster.py`), -includes the GPU heartbeat to prevent idle-reclamation, and wraps everything in -a singularity-exec. Adapt the `--account`, `--partition`, paths, and CLI args: +`scripts/submit_cluster.py` is the canonical submission path. It composes a +`compute_config` YAML (SLURM settings) + a `program_config` YAML (pufferl +training args) + `--args` CLI overrides, wraps the inner train command in +`singularity exec` when `--container` is set, optionally injects the GPU +heartbeat when `--heartbeat` is set, performs code isolation (symlinks the +top-level entries + hard-copies `pufferlib/` into a per-run sandbox), and +hands the package to `submitit` for `sbatch`-submission. 
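As a rough illustration of the `--args` overrides, the sketch below shows one way a `key=value` pair could be turned into a dashed CLI flag. The function name `overrides_to_flags` is hypothetical and this is not the code in `submit_cluster.py`; it only assumes that underscores in the key become dashes while dots are kept, so that `train.checkpoint_interval=250` ends up as `--train.checkpoint-interval 250`.

```python
def overrides_to_flags(overrides):
    """Turn ["a.b_c=1", ...] into ["--a.b-c", "1", ...] (illustrative only)."""
    flags = []
    for item in overrides:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        # Assumed rule: underscores become dashes, dots are preserved.
        flags.extend(["--" + key.replace("_", "-"), value])
    return flags


if __name__ == "__main__":
    print(" ".join(overrides_to_flags(
        ["train.checkpoint_interval=250", "env.simulation_mode=gigaflow"]
    )))
```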
+ +### Why submitit needs the system python + +`submitit` serializes the launch function via `cloudpickle` and writes an +sbatch script that, on the compute node, runs + +``` +srun -m submitit. +``` + +The interpreter in that command is `sys.executable` of the python that ran +`submit_cluster.py`. That python must: + +1. Have `submitit` importable. +2. Be invocable from the compute node *outside* singularity (because the + `srun` wrapper itself isn't inside the container; only the inner train + command is). + +The venv python on `/scratch/$USER/venvs/pufferdrive/bin/python` does **not** +qualify: it's a symlink to `/ext3/miniforge3/bin/python3`, which only exists +inside the singularity overlay. On the compute node `srun` tries to invoke +that path outside the container and fails with +`execve(): /scratch/.../python: No such file or directory`. + +The system `/usr/bin/python3` does qualify: it's on every node, no overlay +symlinks, and the `~/.local` user site is on a shared filesystem so packages +installed via `pip install --user` are visible from compute nodes. + +### One-time setup of submitit on system python + +```bash +# Greene's /usr/bin/python3 is stripped of pip. Bootstrap with ensurepip: +python3 -m ensurepip --user +python3 -m pip install --user --upgrade pip +python3 -m pip install --user submitit pyyaml cloudpickle +``` + +`submitit` is pure-python and the deps are too, so `--user` install works +without needing a compiler. After this, `python3 -c 'import submitit'` works +on the login node and all compute nodes. 
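One way to sanity-check an interpreter before using it as the submitit launcher is sketched below. It is a hypothetical helper, not part of `submit_cluster.py`: the name `launcher_python_ok` is illustrative, and it assumes an interpreter is unusable when its path resolves into `/ext3` (the overlay-internal prefix mentioned above) or to a file that does not exist on the current host.

```python
import os
import sys


def launcher_python_ok(python_path, overlay_prefix="/ext3"):
    """Return True if python_path resolves to a real file outside the overlay.

    overlay_prefix is an assumption taken from the paths quoted in this doc.
    """
    # realpath follows symlinks, including ones whose target is missing.
    resolved = os.path.realpath(python_path)
    inside_overlay = (
        resolved == overlay_prefix
        or resolved.startswith(overlay_prefix + os.sep)
    )
    return os.path.isfile(resolved) and not inside_overlay


if __name__ == "__main__":
    ok = launcher_python_ok(sys.executable)
    status = "ok" if ok else "not usable outside the container"
    print(f"{sys.executable}: {status}")
```

Run from the login node, this only reports on the login node's view of the filesystem; it cannot prove that the same path exists on every compute node.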
+ +### Run submit_cluster.py from the *login node*, not from inside the container + +```bash +python3 scripts/submit_cluster.py \ + --save_dir /scratch/$USER/runs \ + --prefix mytrain \ + --compute_config scripts/cluster_configs/nyu_greene.yaml \ + --program_config scripts/cluster_configs/train_base.yaml \ + --account --partition --time 2880 \ + --container \ + --heartbeat \ + --args \ + train.total_timesteps=10000000000 \ + train.checkpoint_interval=250 +``` + +Key flags: + +| Flag | Effect | +|---|---| +| `--container` | wraps the inner train command in `singularity exec --nv --overlay $OVERLAY:ro $IMAGE_PATH ...` and prepends `source $VENV/bin/activate && export PYTHONNOUSERSITE=1` | +| `--heartbeat` | wraps the inner train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py` and kills it on train exit, preserving the train exit code | +| `--args key=value key2=value2 ...` | passes nested config keys (underscores converted to dashes) as `--key value` on the torchrun line; e.g. `env.simulation_mode=replay` becomes `--env.simulation-mode replay` | +| `--account` / `--partition` / `--time` | override `compute_config` SLURM settings | + +`AutoExecutor` (inside submit_cluster.py) probes for `sbatch` on `$PATH`. The +login-node `$PATH` includes `/opt/slurm/bin`, so submitit picks +`SlurmExecutor` automatically — no `cluster="slurm"` hint needed. + +### GPU heartbeat — required for long runs + +`--heartbeat` is not optional for jobs over ~2 hours. Without it, the +cluster's idle-GPU reclaimer issues a `scancel` from `uid 0` (root) during +the first eval / checkpoint dip in GPU utilization. + +`scripts/gpu_heartbeat.py` monitors `nvidia-smi` and runs short matmul bursts +when utilization drops below 65%, so the cluster always sees the GPU as +active. It cooperates with real training (steps aside when training is busy). 
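To make the threshold behaviour concrete, here is a minimal sketch of the decision step only. It is not the contents of `scripts/gpu_heartbeat.py`: the function names are hypothetical, treating "any GPU below the threshold" as the trigger is an assumption, and utilization is assumed to be read with `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`, which prints one integer percentage per GPU.

```python
import subprocess

UTIL_THRESHOLD = 65  # percent; the value quoted in the text above


def parse_utilization(output):
    """Parse nvidia-smi CSV output (one integer per line) into a list of ints."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def needs_burst(utilizations, threshold=UTIL_THRESHOLD):
    """True when any visible GPU is strictly below the threshold (assumed rule)."""
    return any(u < threshold for u in utilizations)


def read_utilization():
    """Query nvidia-smi; requires an NVIDIA driver on the host."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_utilization(out)
```

A real heartbeat would call something like `read_utilization()` on an interval and run a short matmul whenever `needs_burst` returns true; that loop is left out here because it needs a GPU to exercise.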
+ +### Environment knobs the container path sets + +When `--container` is on, the inner bash command has these env vars set +before `cd $PROJECT_ROOT && `: + +```bash +source /scratch/$USER/venvs/pufferdrive/bin/activate +export PYTHONNOUSERSITE=1 +export XDG_CACHE_HOME=/scratch/$USER/cache +export WANDB_CACHE_DIR=/scratch/$USER/wandb_cache +export WANDB_CONFIG_DIR=/scratch/$USER/wandb_config +export WANDB_DATA_DIR=/scratch/$USER/wandb_data +export WANDB_DIR=/scratch/$USER/wandb_data +``` + +You may want to set `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` in your shell +profile if you build C extensions across the different GPU types on Greene +(A100 sm_80, L40S/H100 sm_89/90, H200 sm_90). + +### Fallback: direct sbatch (if submitit setup is skipped) + +Sometimes you can't or don't want to install submitit on the system python +(restricted environment, fast smoke test, etc.). A direct sbatch with the +same singularity-exec + heartbeat pattern is fine. The translation from +`submit_cluster.py --container --heartbeat` to a hand-written script is +straightforward: ```bash #!/bin/bash #SBATCH --job-name=mytrain #SBATCH --account= #SBATCH --partition= -#SBATCH --gres=gpu:1 -#SBATCH --cpus-per-task=16 -#SBATCH --mem=96gb -#SBATCH --time=2880 # 48h +#SBATCH --gres=gpu:1 --cpus-per-task=16 --mem=96gb --time=2880 #SBATCH -o /scratch/$USER/runs/logs/train_%j.log singularity exec --nv \ @@ -73,41 +189,23 @@ singularity exec --nv \ export PYTHONNOUSERSITE=1 export TORCH_CUDA_ARCH_LIST=\"8.0;8.9;9.0\" export XDG_CACHE_HOME=/scratch/$USER/cache - export WANDB_DIR=/scratch/$USER/wandb_data cd /scratch/$USER/code/PufferDrive - - # GPU heartbeat: keeps utilization above 65% during eval/checkpoint dips - # so the cluster's idle-GPU reclaimer doesn't kill the job (root scancel - # at ~2h is the symptom). python scripts/gpu_heartbeat.py > /tmp/gpu_heartbeat.log 2>&1 & HB_PID=\$! 
- torchrun --standalone --nproc_per_node 1 -m pufferlib.pufferl train puffer_drive \ --train.total-timesteps 10000000000 \ --train.checkpoint-interval 250 \ --wandb --wandb-project pufferdrive \ --train.data-dir /scratch/$USER/runs/mytrain - TRAIN_EXIT=\$? kill \$HB_PID 2>/dev/null exit \$TRAIN_EXIT " ``` -`TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 (sm_80), L40S/H100 (sm_89/90), -and H200 (sm_90). Without it the C extension is compiled only for the build -host's GPU type and crashes on different hardware with `no kernel image is available`. - -## GPU heartbeat — required for long runs - -Without `scripts/gpu_heartbeat.py` backgrounded alongside training, jobs lasting -~2 hours risk **CANCELLED by uid 0** from the cluster's idle-GPU reclaimer. -Eval / checkpoint / map-load phases dip GPU utilization briefly, and the -reclaimer interprets those dips as "idle". - -The heartbeat monitors `nvidia-smi` and runs short matmul bursts when -utilization drops below 65%, so the cluster always sees the GPU as active. -It cooperates with real training (steps aside when training is active). +This skips submit_cluster.py's code isolation and YAML composition but gets +the job running. Prefer `submit_cluster.py` once the one-time submitit +install is done. ## CPU rebuild path @@ -183,42 +281,33 @@ Levers, in order of impact: sibling `config.yaml` next to a `load_model_path` won't override them. They come from `drive.ini` or the CLI. -## submit_cluster.py — known limitations - -`scripts/submit_cluster.py` wraps the training launch in submitit + a heartbeat -wrapper. On the `emerge/temp_training`-derived branch lineage it doesn't work -end-to-end: - -1. Login-node `/usr/bin/python3` lacks `pip` → can't `pip install submitit` - on the login node. The venv's `pip` shebang points at - `/ext3/miniforge3/bin/python3` (overlay-internal) so `pip install` outside - the container errors with "required file not found". -2. 
Running `submit_cluster.py` *inside* the container makes submitit's `srun` - launcher inherit the venv python path (`/scratch/.../venvs/.../python`). - On the compute node `srun` tries to invoke that path *outside* singularity - and fails with `execve(): No such file or directory`. submit_cluster.py - wraps the *inner* train command in singularity-exec but the *outer* launcher - is not wrapped. - -Workaround if you really want submitit + sbatch: bind the slurm dirs into the -container so the in-container python can see sbatch and call it directly: - -```bash -singularity exec --nv \ - --bind /opt/slurm:/opt/slurm \ - --bind /run/munge:/run/munge \ - --bind /etc/passwd:/etc/passwd \ - --bind /etc/group:/etc/group \ - --overlay overlay.ext3:ro \ - $SIF bash -c 'PATH=/opt/slurm/bin:$PATH ...submit_cluster.py...' -``` - -This gets the submission through (real SLURM job ID), but the **submitted job -itself** still hits (2) above unless you also bind those dirs into the launched -container, which submit_cluster.py doesn't do. - -**Recommended**: use the direct-sbatch template from this doc. The heartbeat -is a 4-line bash addition; you don't need submitit for that. +## Submission pitfalls to avoid + +A few mistakes that look reasonable but break the submission flow: + +- **Don't run `submit_cluster.py` from inside the container.** It works at the + AutoExecutor level (sbatch is reachable; the submission goes through), but + the submitted job inherits the in-container venv python as `sys.executable`. + On the compute node `srun` tries to invoke that path *outside* singularity + and fails with `execve(): /scratch/.../python: No such file or directory`. + submit_cluster.py wraps the *inner* train command in singularity-exec but + the *outer* submitit launcher is not wrapped. 
+ + The fix is the layout described above: install submitit on the system + `/usr/bin/python3` via `pip install --user`, run `submit_cluster.py` from + the login node directly (no container, no venv activate). + +- **Don't `pip install submitit` into the venv expecting it to work from the + login node.** The venv's `pip` and `python` shebangs point at + `/ext3/miniforge3/bin/python3` (overlay-internal). Running them outside the + container errors with "required file not found". The venv is *runtime* + only — its packages are invisible to login-node tooling. + +- **Don't bind `/opt/slurm` + `/run/munge` + `/etc/passwd` into the container + as a workaround.** It does make `sbatch` callable from inside the container + (you'll see "slurm 25.05.4" if you run `sbatch --version`), but you're then + back to pitfall #1: the submitted job's outer python is still the venv + python. The bindings buy you the submission but not the execution. ## Common pitfalls diff --git a/docs/mining.md b/docs/mining.md index e677f81ced..996e53e18b 100644 --- a/docs/mining.md +++ b/docs/mining.md @@ -137,31 +137,33 @@ You can read the right values out of the checkpoint's sibling `config.yaml` Mining is GPU-bound on the policy forward pass but memory-light compared to training (single env, no rollout buffer, no PPO update). 48 GB RAM and a -60-minute time limit are plenty for 100 episodes: +60-minute time limit are plenty for 100 episodes. 
The same `submit_cluster.py` +flow as training works — just override `--main` to invoke `mine_failures`: ```bash -sbatch --account= --partition= --gres=gpu:1 \ - --cpus-per-task=8 --mem=48gb --time=60 \ - --chdir=$PWD -o $LOGDIR/mine_%j.log \ - --wrap " - singularity exec --nv \ - --overlay /scratch/\$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \ - /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \ - bash -c ' - source /scratch/\$USER/venvs/pufferdrive/bin/activate - export PYTHONNOUSERSITE=1 - cd /scratch/\$USER/code/PufferDrive - python -m pufferlib.pufferl mine_failures puffer_drive \ - --load-model-path \$CKPT \ - --mine.output-dir \$OUT \ - --mine.num-episodes 100 \ - --mine.score-threshold 1e9 \ - --vec.backend Serial - ' - " +python3 scripts/submit_cluster.py \ + --save_dir /scratch/$USER/runs \ + --prefix mine \ + --compute_config scripts/cluster_configs/nyu_greene.yaml \ + --account --partition --time 60 \ + --mem 48gb --cpus 8 \ + --container \ + --main "-m pufferlib.pufferl mine_failures puffer_drive" \ + --args \ + load_model_path= \ + mine.output_dir=/scratch/$USER/failure_mining/out \ + mine.num_episodes=100 \ + mine.score_threshold=1e9 \ + vec.backend=Serial ``` -Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing. +See [`docs/cluster_training.md`](cluster_training.md) for the one-time +submitit setup (`pip install --user submitit pyyaml cloudpickle` on the +system python) and the rationale for why `submit_cluster.py` must be run +from the login node rather than inside the container. + +Outputs land on `/scratch`; pull them down with `rsync` for in-browser +viewing. 
## Viewer features (`mining_viz.py`) diff --git a/scripts/submit_cluster.py b/scripts/submit_cluster.py index 59fe9ad3a5..b2a47fbc58 100644 --- a/scripts/submit_cluster.py +++ b/scripts/submit_cluster.py @@ -250,6 +250,32 @@ def submit(args, job_name: str, command: List[str], save_dir: str, dry: bool): # Set up executor executor = submitit.AutoExecutor(folder=os.path.join(save_dir, "submitit")) + # When --container is set, run submitit's outer launcher python *inside* + # singularity. The default launcher python is sys.executable, which is + # either the login-node system python (version-mismatched with the + # compute-node system python, so --user installs are invisible) or the + # venv python (a symlink into the overlay, which dangles on the compute + # node outside singularity). Wrapping the launcher in singularity exec + # uses the overlay's miniforge3 python — identical on every node — and + # gives submitit a working import of itself (the venv has submitit + # installed). launch_training detects the already-in-container state + # via /.singularity.d/Singularity and skips its own inner wrap to avoid + # nested singularity. 
+ if args.container and hasattr(executor, "_executor"): + scratch_dir = os.environ.get("SCRATCH_DIR", f"/scratch/{os.environ.get('USER', '')}") + venv_path = os.environ.get("VENV_PATH", f"{scratch_dir}/venvs/pufferdrive") + cert_binds = [] + for cert_path in ["/etc/ssl/certs", "/etc/pki"]: + if os.path.exists(cert_path): + cert_binds.append(f"--bind {cert_path}:{cert_path}:ro") + executor._executor.python = ( + f"singularity exec --nv " + f"--overlay {args.container_overlay}:ro " + f"{' '.join(cert_binds)} " + f"{args.container_image} " + f"{venv_path}/bin/python" + ) + # Build GRES string for GPUs if from_config.get("gpu_type") is not None: gres = f"gpu:{from_config['gpu_type']}:{from_config['gpus']}" @@ -404,25 +430,33 @@ def wrap_with_heartbeat(train_cmd_str): if args.heartbeat: train_str = wrap_with_heartbeat(train_str) inner_cmd = f"{env_setup} && {cache_exports} && cd {project_root} && {train_str}" - full_cmd = [ - "singularity", - "exec", - "--nv", - "--overlay", - container_config["overlay"] + ":ro", # Read-only overlay for running - ] - # Bind mount SSL certificates for TLS verification (wandb, etc.) - for cert_path in ["/etc/ssl/certs", "/etc/pki"]: - if os.path.exists(cert_path): - full_cmd.extend(["--bind", f"{cert_path}:{cert_path}:ro"]) - full_cmd.extend( - [ - container_config["image"], - "bash", - "-c", - inner_cmd, + # submit_cluster.py also wraps submitit's outer launcher python in + # singularity exec when --container is on (see the executor.python + # override at submission time). When we land here on the compute + # node, we're already inside that singularity context — skip the + # second wrap and just run inner_cmd via bash. + if os.path.exists("/.singularity.d/Singularity"): + full_cmd = ["bash", "-c", inner_cmd] + else: + full_cmd = [ + "singularity", + "exec", + "--nv", + "--overlay", + container_config["overlay"] + ":ro", # Read-only overlay for running ] - ) + # Bind mount SSL certificates for TLS verification (wandb, etc.) 
+ for cert_path in ["/etc/ssl/certs", "/etc/pki"]: + if os.path.exists(cert_path): + full_cmd.extend(["--bind", f"{cert_path}:{cert_path}:ro"]) + full_cmd.extend( + [ + container_config["image"], + "bash", + "-c", + inner_cmd, + ] + ) elif args.heartbeat: # No container: still need to wrap in bash -c so the brace group parses. train_str = " ".join(full_cmd) From bbbc1b15dc0bbfcd94ae6ecc9ba7c9a07ad62b9b Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 07:47:33 -0500 Subject: [PATCH 03/20] docs: rewrite to describe current state, not discovery path MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Drop 'we figured out X' framing per code review feedback. The submit_cluster.py path now works end-to-end (after the patch in the previous commit that wraps submitit's outer launcher in singularity exec), so the docs describe that as the canonical flow rather than as a workaround. Direct-sbatch is no longer documented as a fallback — submit_cluster.py is the single path. Co-Authored-By: Claude Opus 4.7 --- README.md | 4 +- docs/cluster_training.md | 175 ++++++++++----------------------------- docs/mining.md | 110 +++++++++++------------- 3 files changed, 96 insertions(+), 193 deletions(-) diff --git a/README.md b/README.md index 489aa057d5..97bd303429 100644 --- a/README.md +++ b/README.md @@ -71,7 +71,7 @@ python scripts/submit_cluster.py \ `scripts/cluster_configs/nyu_greene.yaml` defines `account`, `gpus`, `cpus`, `mem`, `time` — edit `account` to your allocation before first submit. `--container` makes `submit_cluster.py` wrap the job command in `singularity exec --nv --overlay $OVERLAY_PATH:ro $IMAGE_PATH ...`. 
-For an operational deep-dive — sbatch templates, the GPU heartbeat (required for runs > ~2h or the idle-GPU reclaimer will scancel them), CPU rebuild path, account/partition strategy, replay-mode memory sizing, and known `submit_cluster.py` failure modes — see [`docs/cluster_training.md`](docs/cluster_training.md). +For the operational guide — the one-time login-side submitit setup, GPU heartbeat (required for runs > ~2h), CPU rebuild path, account/partition strategy, and replay-mode memory sizing — see [`docs/cluster_training.md`](docs/cluster_training.md). ## Data @@ -148,7 +148,7 @@ renders/index.html # sortable index of all episodes Open `renders/index.html` in a browser to triage. The index page filters by "failures only" / "replays only" and sorts by any metric column. Each row links to the per-episode viewer with the scene's full 2D animation. -For the deeper guide — viewer features, the `score_threshold` semantic gotcha (default `-inf` saves nothing — not what the `mine_failures` docstring claims), the `Multiprocessing`-backend CUDA-after-fork hang and the `--vec.backend Serial` workaround, the shape-mismatch gotcha when loading checkpoints with non-default `policy.*` dims, and the on-cluster sbatch pattern — see [`docs/mining.md`](docs/mining.md). +For the deeper guide — viewer features, `score_threshold` semantics, the required `--vec.backend Serial` flag, loading checkpoints with non-default `policy.*` dims, and the on-cluster `submit_cluster.py` pattern — see [`docs/mining.md`](docs/mining.md). ## Key Configuration (`pufferlib/config/ocean/drive.ini`) diff --git a/docs/cluster_training.md b/docs/cluster_training.md index 3b1bc092e3..96195cecf2 100644 --- a/docs/cluster_training.md +++ b/docs/cluster_training.md @@ -12,7 +12,8 @@ Greene workflow but the patterns generalize. 
Pairs with `scripts/setup_container ./scripts/setup_container.sh create-overlay sbatch --account= --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \ --wrap "./scripts/setup_container.sh install" -# (b) install submitit on the login-node system python (see "Why" below) +# (b) install submitit on the login-node system python (used to compose +# the submission; the in-container venv python runs the actual job) python3 -m ensurepip --user python3 -m pip install --user submitit pyyaml cloudpickle @@ -21,8 +22,7 @@ sbatch --account= --partition=cpu_short --cpus-per-task=8 --mem=16gb --tim --chdir=$PWD -o $LOGDIR/rebuild_%j.log \ --wrap "./scripts/setup_container.sh rebuild" -# Training: submit_cluster.py from the login node (NOT inside singularity) -# with --container --heartbeat. Heartbeat is required for runs > ~2h. +# Training: submit_cluster.py from the login node with --container --heartbeat. python3 scripts/submit_cluster.py \ --save_dir /scratch/$USER/runs \ --compute_config scripts/cluster_configs/nyu_greene.yaml \ @@ -53,7 +53,7 @@ singularity exec --nv \ ' ``` -`source venv/activate` is **required** — sourcing `/ext3/env.sh` alone gives you +`source venv/activate` is required — sourcing `/ext3/env.sh` alone gives you a torch-less base interpreter (it imports as a namespace-package stub with `torch.__file__ == None`). @@ -67,47 +67,39 @@ heartbeat when `--heartbeat` is set, performs code isolation (symlinks the top-level entries + hard-copies `pufferlib/` into a per-run sandbox), and hands the package to `submitit` for `sbatch`-submission. -### Why submitit needs the system python +### Two pythons in play -`submitit` serializes the launch function via `cloudpickle` and writes an -sbatch script that, on the compute node, runs +A `submit_cluster.py --container` submission uses two distinct python +environments: -``` -srun -m submitit. -``` - -`` is `sys.executable` of the python that ran -`submit_cluster.py`. That python must: - -1. Have `submitit` importable. 
-2. Be invocable from the compute node *outside* singularity (because the - `srun` wrapper itself isn't inside the container — only the inner train - command is). - -The venv python on `/scratch/$USER/venvs/pufferdrive/bin/python` does **not** -qualify: it's a symlink to `/ext3/miniforge3/bin/python3`, which only exists -inside the singularity overlay. On the compute node `srun` tries to invoke -that path outside the container and fails with -`execve(): /scratch/.../python: No such file or directory`. +- **Login-side composer**: the python that runs `submit_cluster.py` itself. + Only needs `submitit`, `pyyaml`, `cloudpickle` importable. Used purely to + build the sbatch script and submit it to SLURM. On Greene this is + `/usr/bin/python3` (system python) with `pip install --user submitit pyyaml + cloudpickle` to provide those deps. +- **Compute-side executor**: the python that runs the training job on the + compute node. This is the **venv python** inside the singularity overlay + — same on every node because the overlay is content-identical. submitit's + outer launcher is wrapped in `singularity exec` so it lands in this + environment; `launch_training` then runs `torchrun` inside the same + container. -The system `/usr/bin/python3` does qualify: it's on every node, no overlay -symlinks, and the `~/.local` user site is on a shared filesystem so packages -installed via `pip install --user` are visible from compute nodes. +`submit_cluster.py` handles the wrap automatically when `--container` is set +— you don't need to think about it. The only setup step is installing the +three login-side deps once. -### One-time setup of submitit on system python +### One-time login-side setup ```bash -# Greene's /usr/bin/python3 is stripped of pip. 
Bootstrap with ensurepip: +# Greene's /usr/bin/python3 ships without pip; bootstrap it: python3 -m ensurepip --user python3 -m pip install --user --upgrade pip python3 -m pip install --user submitit pyyaml cloudpickle ``` -`submitit` is pure-python and the deps are too, so `--user` install works -without needing a compiler. After this, `python3 -c 'import submitit'` works -on the login node and all compute nodes. +After this, `python3 -c 'import submitit'` works on the login node. -### Run submit_cluster.py from the *login node*, not from inside the container +### Run from the login node ```bash python3 scripts/submit_cluster.py \ @@ -127,15 +119,11 @@ Key flags: | Flag | Effect | |---|---| -| `--container` | wraps the inner train command in `singularity exec --nv --overlay $OVERLAY:ro $IMAGE_PATH ...` and prepends `source $VENV/bin/activate && export PYTHONNOUSERSITE=1` | -| `--heartbeat` | wraps the inner train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py` and kills it on train exit, preserving the train exit code | -| `--args key=value key2=value2 ...` | passes nested config keys (underscores converted to dashes) as `--key value` on the torchrun line; e.g. `env.simulation_mode=replay` becomes `--env.simulation-mode replay` | +| `--container` | wraps both submitit's outer launcher and the inner train command in `singularity exec --nv --overlay $OVERLAY:ro $IMAGE` | +| `--heartbeat` | wraps the train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py` and kills it on train exit, preserving the train exit code | +| `--args key=value ...` | passes nested config keys (underscores converted to dashes) as `--key value` on the torchrun line; e.g. `env.simulation_mode=replay` becomes `--env.simulation-mode replay` | | `--account` / `--partition` / `--time` | override `compute_config` SLURM settings | -`AutoExecutor` (inside submit_cluster.py) probes for `sbatch` on `$PATH`. 
The -login-node `$PATH` includes `/opt/slurm/bin`, so submitit picks -`SlurmExecutor` automatically — no `cluster="slurm"` hint needed. - ### GPU heartbeat — required for long runs `--heartbeat` is not optional for jobs over ~2 hours. Without it, the @@ -165,54 +153,12 @@ You may want to set `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` in your shell profile if you build C extensions across the different GPU types on Greene (A100 sm_80, L40S/H100 sm_89/90, H200 sm_90). -### Fallback: direct sbatch (if submitit setup is skipped) - -Sometimes you can't or don't want to install submitit on the system python -(restricted environment, fast smoke test, etc.). A direct sbatch with the -same singularity-exec + heartbeat pattern is fine. The translation from -`submit_cluster.py --container --heartbeat` to a hand-written script is -straightforward: - -```bash -#!/bin/bash -#SBATCH --job-name=mytrain -#SBATCH --account= -#SBATCH --partition= -#SBATCH --gres=gpu:1 --cpus-per-task=16 --mem=96gb --time=2880 -#SBATCH -o /scratch/$USER/runs/logs/train_%j.log - -singularity exec --nv \ - --overlay /scratch/$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \ - /share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \ - bash -c " - source /scratch/$USER/venvs/pufferdrive/bin/activate - export PYTHONNOUSERSITE=1 - export TORCH_CUDA_ARCH_LIST=\"8.0;8.9;9.0\" - export XDG_CACHE_HOME=/scratch/$USER/cache - cd /scratch/$USER/code/PufferDrive - python scripts/gpu_heartbeat.py > /tmp/gpu_heartbeat.log 2>&1 & - HB_PID=\$! - torchrun --standalone --nproc_per_node 1 -m pufferlib.pufferl train puffer_drive \ - --train.total-timesteps 10000000000 \ - --train.checkpoint-interval 250 \ - --wandb --wandb-project pufferdrive \ - --train.data-dir /scratch/$USER/runs/mytrain - TRAIN_EXIT=\$? - kill \$HB_PID 2>/dev/null - exit \$TRAIN_EXIT -" -``` - -This skips submit_cluster.py's code isolation and YAML composition but gets -the job running. 
Prefer `submit_cluster.py` once the one-time submitit -install is done. - ## CPU rebuild path -GPU partitions are routinely saturated by training jobs of this same project. -`setup_container.sh rebuild` doesn't actually need a GPU — it just runs -`python setup.py build_ext --inplace --force` plus a smoke import. Submit to a -CPU partition for fast turnaround: +GPU partitions are routinely saturated by training jobs. `setup_container.sh +rebuild` doesn't actually need a GPU — it just runs `python setup.py +build_ext --inplace --force` plus a smoke import. Submit to a CPU partition +for fast turnaround: ```bash sbatch --account= --partition=cpu_short \ @@ -228,20 +174,22 @@ sbatch --account= --partition=cpu_short \ NYU Greene exposes `_general` and `_tandon_priority` account tiers, each with their own QOS pool per partition. When `squeue` shows your job pending on -`QOSGrpGRES`, the issue is partition-level pool saturation — **switching -accounts within the same tier doesn't help**, but switching partitions does. +`QOSGrpGRES`, the issue is partition-level pool saturation — switching +accounts within the same tier doesn't help, but switching partitions does. `QOSMaxGRESPerUser` is different: you're over your own concurrent-GPU cap. Cancel a pending job or wait. -Practical recipe for long training: +Practical recipe: -- For short jobs (rebuilds, eval, mining): try `cpu_short` if CPU-only; else - `h200_public + *_general`. Often the fastest GPU slot. -- For long training: race 2–3 GPU partitions in parallel and cancel the - losers as soon as one starts. `tandon_priority` accounts often unblock when - `_general` pools are pinned. `l40s_public` typically has multi-hour queues - and is the last resort. +- For short jobs (rebuilds, eval, mining): try `cpu_short` first when no GPU + is needed, else `h200_public + `. Often the fastest GPU + slot. 
+- For long training: `_tandon_priority` accounts have their own QOS pools + separate from `_general`, so they unblock when `_general` pools are + pinned. Race 2–3 partitions in parallel and cancel the losers as soon as + one starts. `l40s_public` typically has multi-hour queues and is the last + resort. Quick test-only across combos: @@ -277,37 +225,9 @@ Levers, in order of impact: `[eval.validation_gigaflow]` specifically renders 8 × 1080p MP4s in parallel. - `--mem=128gb` or `--mem=192gb` if you need the eval signal in wandb. -`vec.*` keys are **not** in pufferl's `KEYS_OF_INTEREST` auto-merge, so a -sibling `config.yaml` next to a `load_model_path` won't override them. They -come from `drive.ini` or the CLI. - -## Submission pitfalls to avoid - -A few mistakes that look reasonable but break the submission flow: - -- **Don't run `submit_cluster.py` from inside the container.** It works at the - AutoExecutor level (sbatch is reachable; the submission goes through), but - the submitted job inherits the in-container venv python as `sys.executable`. - On the compute node `srun` tries to invoke that path *outside* singularity - and fails with `execve(): /scratch/.../python: No such file or directory`. - submit_cluster.py wraps the *inner* train command in singularity-exec but - the *outer* submitit launcher is not wrapped. - - The fix is the layout described above: install submitit on the system - `/usr/bin/python3` via `pip install --user`, run `submit_cluster.py` from - the login node directly (no container, no venv activate). - -- **Don't `pip install submitit` into the venv expecting it to work from the - login node.** The venv's `pip` and `python` shebangs point at - `/ext3/miniforge3/bin/python3` (overlay-internal). Running them outside the - container errors with "required file not found". The venv is *runtime* - only — its packages are invisible to login-node tooling. 
- -- **Don't bind `/opt/slurm` + `/run/munge` + `/etc/passwd` into the container - as a workaround.** It does make `sbatch` callable from inside the container - (you'll see "slurm 25.05.4" if you run `sbatch --version`), but you're then - back to pitfall #1: the submitted job's outer python is still the venv - python. The bindings buy you the submission but not the execution. +`vec.*` keys are not in pufferl's `KEYS_OF_INTEREST` auto-merge, so a sibling +`config.yaml` next to a `load_model_path` won't override them. They come from +`drive.ini` or the CLI. ## Common pitfalls @@ -324,9 +244,6 @@ A few mistakes that look reasonable but break the submission flow: (e.g. failed pip installs that wrote to `/usr/local/lib/...` end up in `upper/usr/local/` and aren't visible to apptainer's view). Use `debugfs -R "ls /upper" overlay.ext3` from a login node to inspect. -- **Squash-merging stacked PRs** can hit "stale info" on `--force-with-lease` - when the token URL differs from `origin`. Either fetch first or use - `--force` with care. ## Don't chain `sleep` to wait on background jobs diff --git a/docs/mining.md b/docs/mining.md index 996e53e18b..1c11b9e60a 100644 --- a/docs/mining.md +++ b/docs/mining.md @@ -7,20 +7,25 @@ and `pufferlib/mining_viz.py`. ## TL;DR ```bash -# Roll the policy out for 100 episodes, save compact replays for "failures", -# render HTML for each + a sortable index. +# Roll the policy out for 100 episodes, save compact replays for episodes +# whose episode_return falls below the threshold, render HTML for each + +# a sortable index. 
puffer mine_failures puffer_drive \ --load-model-path /path/to/model_011000.pt \ --mine.output-dir ./failure_mining/baseline_011000 \ --mine.num-episodes 100 \ - --vec.backend Serial # see "Multiprocessing hang" below - -# Outputs: -# ./failure_mining/baseline_011000/ -# replays/episode_NNNNNN.replay.zlib ← one per failed episode -# renders/episode_NNNNNN.html ← per-replay viewer -# renders/index.html ← sortable summary -# episodes.csv ← all episodes, all metrics + --mine.score-threshold 1e9 \ + --vec.backend Serial +``` + +Outputs: + +``` +./failure_mining/baseline_011000/ + replays/episode_NNNNNN.replay.zlib one per saved episode + renders/episode_NNNNNN.html per-replay viewer + renders/index.html sortable summary + episodes.csv all episodes, all metrics ``` Open the index in a browser: @@ -33,8 +38,8 @@ open ./failure_mining/baseline_011000/renders/index.html A compact replay bundle is a pickled+zlib'd `schema_version=2` dict containing per-step agent state, traffic state, and observation arrays for a single -episode. Bundles are produced **C-side** when `capture_compact_replay=True` -is passed to `Drive(...)`. `mine_failures` sets this automatically. +episode. Bundles are produced C-side when `capture_compact_replay=True` is +passed to `Drive(...)`. `mine_failures` sets this automatically. Each saved bundle is paired with a metadata row in `episodes.csv` including `episode_return`, `collision_rate`, `offroad_rate`, `num_goals_reached`, @@ -43,50 +48,34 @@ reads the bundle and replays it in-browser on a top-down canvas, with optional overlays for the agent's observed FOV, partner circle, goal route, and waypoint markers. -## `mine.score_threshold` — gotcha - -The `mine_failures` selection rule is "save replay if and only if -`episode_return < score_threshold`". The docstring claims `-inf` means "capture -every episode" — that's wrong: `episode_return < -inf` is never true, so the -default captures **nothing**. 
To actually save episodes: - -```bash -# Capture every episode (works with any non-degenerate return): ---mine.score-threshold 1e9 - -# Capture only "true" failures (negative returns): ---mine.score-threshold 0 -``` +## `mine.score_threshold` selection -`episodes.csv` always contains all N episodes' metadata regardless of threshold -— only the bundle save + HTML render is gated. +The save rule is "write replay if and only if `episode_return < score_threshold`". -## Multiprocessing hang — use `--vec.backend Serial` +- `--mine.score-threshold 1e9` captures every episode (any real return is + less than 1e9). +- `--mine.score-threshold 0` captures only negative-return ("true failure") + episodes. +- Default `-inf` captures **nothing** — useful only if you want `episodes.csv` + metrics without the bundle overhead. -`pufferl.mine_failures` goes through `pufferlib.vector.make(...)` with the -drive.ini default `backend=Multiprocessing`. Even with `num_envs=1, -num_workers=1`, that backend **forks** workers post-torch-import. Forking after -torch has been imported in the parent is a classic deadlock for CUDA — the -child can hang on CUDA initialization, and the parent sits forever on the IPC -pipe. +`episodes.csv` always contains all N episodes' metadata regardless of +threshold; only the bundle save + HTML render is gated. -Symptoms: CPU 100% in the parent, RSS frozen, no `[mine_failures] target -episodes=...` print, never produces output. If you let it sit for ~10 minutes -nothing changes. +## `--vec.backend Serial` -Fix: force the in-process backend. - -```bash ---vec.backend Serial -``` +Mining must use `--vec.backend Serial`. The drive.ini default +`Multiprocessing` backend forks workers post-torch-import, which deadlocks on +CUDA in the child process. Symptom is a parent process at 100% CPU with no +visible progress and no `[mine_failures] target episodes=...` print. -This keeps the env in the same process as the policy. No fork, no hang. 
The -single-env nature of mining means the throughput cost is negligible. +`Serial` keeps the env in the same process as the policy. Mining is a single +env / single rollout workflow, so the throughput cost is negligible. ## Tuning the rollout config The mining env config comes from drive.ini's `[mine]` section plus per-CLI -overrides. Useful knobs: +overrides: ```bash # Larger output (slower): @@ -99,22 +88,20 @@ overrides. Useful knobs: --env.init-steps 10 \ --env.scenario-length 200 -# Looser goal radius (useful if the trained policy struggles with the -# stricter default; default 2m, max 12m under reward randomization): +# Looser goal radius (default 2 m, up to 12 m under reward randomization): --env.goal-radius 6 -# Closer-spaced goals (mining a policy that wasn't trained on these): +# Closer-spaced goals: --env.min-waypoint-spacing 10 \ --env.max-waypoint-spacing 15 ``` -## Resume + obs-shape gotcha +## Loading checkpoints with non-default architecture -`mine_failures` does **not** read the sibling `config.yaml` next to -`load_model_path` — only `pufferl.train` does. If the checkpoint was trained +`mine_failures` does not read the sibling `config.yaml` next to +`load_model_path` (only `pufferl.train` does). If the checkpoint was trained with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`, -`backbone_num_layers=4`), you'll get a shape mismatch on `load_state_dict` -unless you pass them on the CLI: +`backbone_num_layers=4`), pass them on the CLI to match the saved state dict: ```bash --policy.input-size 128 \ @@ -131,14 +118,15 @@ unless you pass them on the CLI: ``` You can read the right values out of the checkpoint's sibling `config.yaml` -(under `policy:` and `rnn:`) and pass them through. +(under `policy:` and `rnn:`) and pass them through. The error if you forget +is a wall of `size mismatch for ...` lines from `policy.load_state_dict`. 
## On the cluster Mining is GPU-bound on the policy forward pass but memory-light compared to training (single env, no rollout buffer, no PPO update). 48 GB RAM and a 60-minute time limit are plenty for 100 episodes. The same `submit_cluster.py` -flow as training works — just override `--main` to invoke `mine_failures`: +flow as training works — override `--main` to invoke `mine_failures`: ```bash python3 scripts/submit_cluster.py \ @@ -157,13 +145,11 @@ python3 scripts/submit_cluster.py \ vec.backend=Serial ``` -See [`docs/cluster_training.md`](cluster_training.md) for the one-time -submitit setup (`pip install --user submitit pyyaml cloudpickle` on the -system python) and the rationale for why `submit_cluster.py` must be run -from the login node rather than inside the container. +See [`docs/cluster_training.md`](cluster_training.md) for one-time setup of +the login-side submitit (`python3 -m pip install --user submitit pyyaml +cloudpickle`). -Outputs land on `/scratch`; pull them down with `rsync` for in-browser -viewing. +Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing. ## Viewer features (`mining_viz.py`) From 50f97d91ec6d8519396f93c9ec27af054d4189f8 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 09:34:07 -0500 Subject: [PATCH 04/20] docs: expand TORCH_CUDA_ARCH_LIST explanation Replace the cryptic one-line 'you may want to set' with a self-contained explanation: what the env var does (per-arch fat binary), why it matters on a heterogeneous cluster ('no kernel image' on the wrong GPU), what the recommended value covers (A100/L40S/H100/H200), and when it actually matters in practice (interactive builds outside setup_container.sh rebuild, which already exports it). 
Co-Authored-By: Claude Opus 4.7 --- docs/cluster_training.md | 141 ++++++++++++++------------------------- 1 file changed, 50 insertions(+), 91 deletions(-) diff --git a/docs/cluster_training.md b/docs/cluster_training.md index 96195cecf2..83adf594bf 100644 --- a/docs/cluster_training.md +++ b/docs/cluster_training.md @@ -1,10 +1,8 @@ # Cluster training — operational guide -How to run PufferDrive training on a SLURM cluster. Written against the NYU -Greene workflow but the patterns generalize. Pairs with `scripts/setup_container.sh`, -`scripts/gpu_heartbeat.py`, and `scripts/submit_cluster.py`. +How to run PufferDrive training on a SLURM cluster. This is written with the NYU cluster in mind but it should mostly hold for any SLURM cluster. -## TL;DR +## A quick overview of the setup and launch process ```bash # One-time per cluster: @@ -17,30 +15,31 @@ sbatch --account= --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \ python3 -m ensurepip --user python3 -m pip install --user submitit pyyaml cloudpickle -# Per code change to C extensions: rebuild on a CPU partition (no GPU needed). +# If code changes, or we haven't built before, rebuild the C code in the container sbatch --account= --partition=cpu_short --cpus-per-task=8 --mem=16gb --time=20 \ --chdir=$PWD -o $LOGDIR/rebuild_%j.log \ --wrap "./scripts/setup_container.sh rebuild" # Training: submit_cluster.py from the login node with --container --heartbeat. 
+# By default launches RL training but can be modified through the --main argument +# to launch other modes python3 scripts/submit_cluster.py \ --save_dir /scratch/$USER/runs \ --compute_config scripts/cluster_configs/nyu_greene.yaml \ --program_config scripts/cluster_configs/train_base.yaml \ --container --heartbeat \ --account --partition --time 2880 \ - --args train.checkpoint_interval=250 env.simulation_mode=gigaflow + --args train.checkpoint_interval=250 env.simulation_mode=gigaflow # use this to override config args ``` ## Container model PufferDrive on Greene runs inside a singularity container. The container provides a modern glibc + CUDA toolkit; the project's Python environment lives in a venv -on `/scratch` (not in the overlay) so installs aren't bottlenecked by fuse2fs. +on `/scratch` so installs aren't bottlenecked by the slow process of building a venv inside a container. The container is invoked with a **read-only** overlay mount for the miniforge3 -base interpreter, plus the on-disk venv for project packages: - +base interpreter, plus the on-disk venv for project packages. As an example of running such a command: ```bash singularity exec --nv \ --overlay /scratch/$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \ @@ -53,21 +52,20 @@ singularity exec --nv \ ' ``` -`source venv/activate` is required — sourcing `/ext3/env.sh` alone gives you -a torch-less base interpreter (it imports as a namespace-package stub with -`torch.__file__ == None`). - ## Submitting training — `submit_cluster.py` -`scripts/submit_cluster.py` is the canonical submission path. It composes a -`compute_config` YAML (SLURM settings) + a `program_config` YAML (pufferl -training args) + `--args` CLI overrides, wraps the inner train command in -`singularity exec` when `--container` is set, optionally injects the GPU -heartbeat when `--heartbeat` is set, performs code isolation (symlinks the +`scripts/submit_cluster.py` is the canonical submission path. 
It composes: +- a `compute_config` YAML (SLURM settings) +- a `program_config` YAML (pufferl training args) +- `--args` CLI overrides +- wraps the inner train command in `singularity exec` when `--container` is set +- optionally injects the GPU heartbeat when `--heartbeat` is set. WARNING: this is specifically for the torch cluster to prevent our jobs being killed. No one else should use this. + +It performs code isolation (symlinks the top-level entries + hard-copies `pufferlib/` into a per-run sandbox), and hands the package to `submitit` for `sbatch`-submission. -### Two pythons in play +### WARNING: two Python installations are in use here A `submit_cluster.py --container` submission uses two distinct python environments: @@ -75,19 +73,14 @@ environments: - **Login-side composer**: the python that runs `submit_cluster.py` itself. Only needs `submitit`, `pyyaml`, `cloudpickle` importable. Used purely to build the sbatch script and submit it to SLURM. On Greene this is - `/usr/bin/python3` (system python) with `pip install --user submitit pyyaml + `/usr/bin/python3` (system python) and you can run `pip install --user submitit pyyaml cloudpickle` to provide those deps. - **Compute-side executor**: the python that runs the training job on the - compute node. This is the **venv python** inside the singularity overlay - — same on every node because the overlay is content-identical. submitit's + compute node. This is the **venv python** inside the singularity overlay. submitit's outer launcher is wrapped in `singularity exec` so it lands in this environment; `launch_training` then runs `torchrun` inside the same container. -`submit_cluster.py` handles the wrap automatically when `--container` is set -— you don't need to think about it. The only setup step is installing the -three login-side deps once.
- ### One-time login-side setup ```bash @@ -120,7 +113,7 @@ Key flags: | Flag | Effect | |---|---| | `--container` | wraps both submitit's outer launcher and the inner train command in `singularity exec --nv --overlay $OVERLAY:ro $IMAGE` | -| `--heartbeat` | wraps the train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py` and kills it on train exit, preserving the train exit code | +| `--heartbeat` | wraps the train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py` preventing the cluster from killing your job due to low GPU usage | | `--args key=value ...` | passes nested config keys (underscores converted to dashes) as `--key value` on the torchrun line; e.g. `env.simulation_mode=replay` becomes `--env.simulation-mode replay` | | `--account` / `--partition` / `--time` | override `compute_config` SLURM settings | @@ -132,7 +125,7 @@ the first eval / checkpoint dip in GPU utilization. `scripts/gpu_heartbeat.py` monitors `nvidia-smi` and runs short matmul bursts when utilization drops below 65%, so the cluster always sees the GPU as -active. It cooperates with real training (steps aside when training is busy). +active. It cooperates with training and steps aside when training is busy. ### Environment knobs the container path sets @@ -149,14 +142,39 @@ export WANDB_DATA_DIR=/scratch/$USER/wandb_data export WANDB_DIR=/scratch/$USER/wandb_data ``` -You may want to set `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` in your shell -profile if you build C extensions across the different GPU types on Greene -(A100 sm_80, L40S/H100 sm_89/90, H200 sm_90). +### `TORCH_CUDA_ARCH_LIST` — why you may need to set it + +PufferDrive's C extension contains CUDA kernels. When `setup.py build_ext` +compiles them, `nvcc` emits machine code for each architecture listed in +the `TORCH_CUDA_ARCH_LIST` env var (and only those); the result is a "fat +binary" containing one variant per arch. 
If the env var is unset, the build +defaults to whatever GPU was visible to the compiler at build time — often +just one architecture. + +The catch on a heterogeneous cluster like Greene is that you don't get to +choose which GPU you land on. `_general` accounts queue across L40S +(sm_89), H100 (sm_90), and H200 (sm_90); `_tandon_*` partitions add A100 +(sm_80). If the `_C.so` was built against only sm_80 and your job lands on +an H100, every CUDA call into the extension dies with +`no kernel image is available for execution on the device`. + +Setting `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 / L40S+H100 / H200 +in one fat binary — the build is a bit slower (three variants instead of +one) and the `.so` is a bit larger, but the resulting binary runs on every +GPU Greene routes you to. + +`setup_container.sh rebuild` exports this automatically for the build step, +so a fresh rebuild on the cluster is already multi-arch. The env var only +matters when you build the C extension **outside** the rebuild wrapper — +e.g. an interactive `python setup.py build_ext --inplace --force` inside a +hand-launched singularity exec. Adding the export to your shell profile +(or sourcing it before any manual build) saves you from hitting the "no +kernel image" error after a quick fix-and-rebuild loop. ## CPU rebuild path GPU partitions are routinely saturated by training jobs. `setup_container.sh -rebuild` doesn't actually need a GPU — it just runs `python setup.py +rebuild` doesn't actually need a GPU as it just runs `python setup.py build_ext --inplace --force` plus a smoke import. Submit to a CPU partition for fast turnaround: @@ -170,65 +188,6 @@ sbatch --account= --partition=cpu_short \ `--chdir=$PWD` is required because the script uses `./scripts/`. Takes ~40s. -## Account / partition strategy - -NYU Greene exposes `_general` and `_tandon_priority` account tiers, each with -their own QOS pool per partition. 
When `squeue` shows your job pending on -`QOSGrpGRES`, the issue is partition-level pool saturation — switching -accounts within the same tier doesn't help, but switching partitions does. - -`QOSMaxGRESPerUser` is different: you're over your own concurrent-GPU cap. -Cancel a pending job or wait. - -Practical recipe: - -- For short jobs (rebuilds, eval, mining): try `cpu_short` first when no GPU - is needed, else `h200_public + `. Often the fastest GPU - slot. -- For long training: `_tandon_priority` accounts have their own QOS pools - separate from `_general`, so they unblock when `_general` pools are - pinned. Race 2–3 partitions in parallel and cancel the losers as soon as - one starts. `l40s_public` typically has multi-hour queues and is the last - resort. - -Quick test-only across combos: - -```bash -for combo in \ - " a100_tandon" \ - " h100_tandon" \ - " h200_public"; do - read ACCT PART <<< "$combo" - RES=$(sbatch --test-only --account=$ACCT --partition=$PART \ - --gres=gpu:1 --cpus-per-task=16 --mem=96gb --time=2880 \ - --wrap "echo test" 2>&1 | head -1) - echo "$ACCT $PART -> $RES" -done -``` - -`--test-only` prints an estimated start time without actually submitting. - -## Memory sizing — replay mode is heavier than gigaflow - -Gigaflow training with `num_agents=1024` fits comfortably in 96 GB on Greene. -Replay-mode training on nuPlan does not — each sub-env loads its own bin file -(parsed lane graph + per-agent trajectories), so `--mem=96gb` OOMs. - -Levers, in order of impact: - -- `--vec.num-envs N` (drive.ini default `20`). Each vec worker is a fork; each - worker holds copy-on-write-divergent state proportional to `num_agents/num_envs` - + the loaded map data. Halving from 20→10 saves ~25 GB. -- Disable subsets of `[eval.*]` evaluators via CLI overrides. The 14 enabled - evaluators in `drive.ini` all spin up their own `pufferlib.vector.make` envs - at the first eval cycle and can collectively cost 30–50 GB at peak. 
- `[eval.validation_gigaflow]` specifically renders 8 × 1080p MP4s in parallel. -- `--mem=128gb` or `--mem=192gb` if you need the eval signal in wandb. - -`vec.*` keys are not in pufferl's `KEYS_OF_INTEREST` auto-merge, so a sibling -`config.yaml` next to a `load_model_path` won't override them. They come from -`drive.ini` or the CLI. - ## Common pitfalls - **`ncclCommShrink` undefined symbol** at `from torch._C import *`. Greene's From dae17688e39bf71d1c241345a3da3b4782399b6b Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 09:44:17 -0500 Subject: [PATCH 05/20] docs: explain why CPU rebuild works for CUDA code The previous CPU rebuild section just said 'doesn't need a GPU' without explaining the apparent contradiction (we're compiling CUDA, no?). Spell out that nvcc is a cross-compiler: it emits PTX/SASS for the target arches in TORCH_CUDA_ARCH_LIST without needing matching hardware, and the CUDA toolkit lives in the singularity image so any node that can mount the image can run the build. Co-Authored-By: Claude Opus 4.7 --- docs/cluster_training.md | 77 ++++++++++++++++++---------------------- 1 file changed, 34 insertions(+), 43 deletions(-) diff --git a/docs/cluster_training.md b/docs/cluster_training.md index 83adf594bf..45aef6367d 100644 --- a/docs/cluster_training.md +++ b/docs/cluster_training.md @@ -142,41 +142,18 @@ export WANDB_DATA_DIR=/scratch/$USER/wandb_data export WANDB_DIR=/scratch/$USER/wandb_data ``` -### `TORCH_CUDA_ARCH_LIST` — why you may need to set it - -PufferDrive's C extension contains CUDA kernels. When `setup.py build_ext` -compiles them, `nvcc` emits machine code for each architecture listed in -the `TORCH_CUDA_ARCH_LIST` env var (and only those); the result is a "fat -binary" containing one variant per arch. If the env var is unset, the build -defaults to whatever GPU was visible to the compiler at build time — often -just one architecture. 
- -The catch on a heterogeneous cluster like Greene is that you don't get to -choose which GPU you land on. `_general` accounts queue across L40S -(sm_89), H100 (sm_90), and H200 (sm_90); `_tandon_*` partitions add A100 -(sm_80). If the `_C.so` was built against only sm_80 and your job lands on -an H100, every CUDA call into the extension dies with -`no kernel image is available for execution on the device`. - -Setting `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 / L40S+H100 / H200 -in one fat binary — the build is a bit slower (three variants instead of -one) and the `.so` is a bit larger, but the resulting binary runs on every -GPU Greene routes you to. - -`setup_container.sh rebuild` exports this automatically for the build step, -so a fresh rebuild on the cluster is already multi-arch. The env var only -matters when you build the C extension **outside** the rebuild wrapper — -e.g. an interactive `python setup.py build_ext --inplace --force` inside a -hand-launched singularity exec. Adding the export to your shell profile -(or sourcing it before any manual build) saves you from hitting the "no -kernel image" error after a quick fix-and-rebuild loop. - ## CPU rebuild path GPU partitions are routinely saturated by training jobs. `setup_container.sh -rebuild` doesn't actually need a GPU as it just runs `python setup.py -build_ext --inplace --force` plus a smoke import. Submit to a CPU partition -for fast turnaround: +rebuild` doesn't need a GPU even though it compiles CUDA code: `nvcc` is a +cross-compiler. It generates PTX/SASS for each architecture in +`TORCH_CUDA_ARCH_LIST` without needing matching hardware on the build host, +the same way a C compiler can target ARM from an x86 host. The CUDA toolkit +itself (`nvcc`, headers, libs) lives in the cuda12.8.1 `.sif` image, so any +node that can mount the image can run the build — CPU partitions included. 
+The rebuild script exports `TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0"` upfront, so +the resulting `.so` is a fat binary that runs on every GPU type at job time. +Submit to a CPU partition for fast turnaround: ```bash sbatch --account= --partition=cpu_short \ @@ -188,7 +165,7 @@ sbatch --account= --partition=cpu_short \ `--chdir=$PWD` is required because the script uses `./scripts/`. Takes ~40s. -## Common pitfalls +### Common pitfalls - **`ncclCommShrink` undefined symbol** at `from torch._C import *`. Greene's cuda12.8.1 sif ships `libnccl 2.25.1` in `/usr/lib`, but torch ≥ 2.10 calls @@ -204,16 +181,30 @@ sbatch --account= --partition=cpu_short \ `upper/usr/local/` and aren't visible to apptainer's view). Use `debugfs -R "ls /upper" overlay.ext3` from a login node to inspect. -## Don't chain `sleep` to wait on background jobs +### `TORCH_CUDA_ARCH_LIST`: a warning that you can skip -A bare `sleep N` to poll on a submitted job's state is hard on the SLURM -controller and brittle. Patterns that work: +PufferDrive's C extension contains CUDA kernels. When `setup.py build_ext` +compiles them, `nvcc` emits machine code for each architecture listed in +the `TORCH_CUDA_ARCH_LIST` env var (and only those); the result is a large binary containing one variant per arch. If the env var is unset, the build +defaults to whatever GPU was visible to the compiler at build time which is often +just one architecture. -- **One-shot wait**: a single `sacct -j $JOBID --format=State -n -P` after a - generous initial sleep tuned to expected runtime. -- **Conditional wait**: a `Monitor`-style `until` loop in a single background - shell, with a sane upper bound. -- **Wall-clock interval**: schedule a wake-up rather than long-running `sleep`. +On Greene, you frequently don't get to +choose which GPU you land on. `_general` accounts queue across L40S +(sm_89), H100 (sm_90), and H200 (sm_90); `_tandon_*` partitions add A100 +(sm_80). 
If the `_C.so` was built against only sm_80 and your job lands on +an H100, every CUDA call into the extension dies with +`no kernel image is available for execution on the device`. -Hammering `squeue` in a tight loop is bad cluster citizenship — the controller -is shared across all users. Sleep at least 60 s between checks. +Setting `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 / L40S / H100+H200 +in one fat binary — the build is a bit slower (three variants instead of +one) and the `.so` is a bit larger, but the resulting binary runs on every +GPU Greene routes you to. + +`setup_container.sh rebuild` exports this automatically for the build step, +so a fresh rebuild on the cluster is already multi-arch. The env var only +matters when you build the C extension **outside** the rebuild wrapper — +e.g. an interactive `python setup.py build_ext --inplace --force` inside a +hand-launched singularity exec. Adding the export to your shell profile +(or sourcing it before any manual build) saves you from hitting the "no +kernel image" error after a quick fix-and-rebuild loop. \ No newline at end of file From 33d560ea016babc835eb2f3eb488e8a4be750f8d Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 09:46:01 -0500 Subject: [PATCH 06/20] docs: trim CPU rebuild section - drop the cross-compiler explanation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Just say 'doesn't need a GPU' without the nvcc / fat binary / cross-compiler detour. Readers who want the why can find it in the TORCH_CUDA_ARCH_LIST section further down.
Co-Authored-By: Claude Opus 4.7 --- docs/cluster_training.md | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/docs/cluster_training.md b/docs/cluster_training.md index 45aef6367d..f5b7ee2112 100644 --- a/docs/cluster_training.md +++ b/docs/cluster_training.md @@ -145,15 +145,7 @@ export WANDB_DIR=/scratch/$USER/wandb_data ## CPU rebuild path GPU partitions are routinely saturated by training jobs. `setup_container.sh -rebuild` doesn't need a GPU even though it compiles CUDA code: `nvcc` is a -cross-compiler. It generates PTX/SASS for each architecture in -`TORCH_CUDA_ARCH_LIST` without needing matching hardware on the build host, -the same way a C compiler can target ARM from an x86 host. The CUDA toolkit -itself (`nvcc`, headers, libs) lives in the cuda12.8.1 `.sif` image, so any -node that can mount the image can run the build — CPU partitions included. -The rebuild script exports `TORCH_CUDA_ARCH_LIST="8.0 8.9 9.0"` upfront, so -the resulting `.so` is a fat binary that runs on every GPU type at job time. -Submit to a CPU partition for fast turnaround: +rebuild` doesn't need a GPU — submit to a CPU partition for fast turnaround: ```bash sbatch --account= --partition=cpu_short \ From f8642620e4b1bab6346f84b5752386a5d75de575 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 09:50:56 -0500 Subject: [PATCH 07/20] docs: drop mining doc from this PR (moved to a separate PR) This PR is just cluster training. The mining workflow doc and its README pointer move to a separate PR so the two can be reviewed and landed independently. 
Co-Authored-By: Claude Opus 4.7 --- README.md | 2 - docs/mining.md | 167 ------------------------------------------------- 2 files changed, 169 deletions(-) delete mode 100644 docs/mining.md diff --git a/README.md b/README.md index 97bd303429..f19eb805c3 100644 --- a/README.md +++ b/README.md @@ -148,8 +148,6 @@ renders/index.html # sortable index of all episodes Open `renders/index.html` in a browser to triage. The index page filters by "failures only" / "replays only" and sorts by any metric column. Each row links to the per-episode viewer with the scene's full 2D animation. -For the deeper guide — viewer features, `score_threshold` semantics, the required `--vec.backend Serial` flag, loading checkpoints with non-default `policy.*` dims, and the on-cluster `submit_cluster.py` pattern — see [`docs/mining.md`](docs/mining.md). - ## Key Configuration (`pufferlib/config/ocean/drive.ini`) ### `[env]` — Simulation diff --git a/docs/mining.md b/docs/mining.md deleted file mode 100644 index 1c11b9e60a..0000000000 --- a/docs/mining.md +++ /dev/null @@ -1,167 +0,0 @@ -# Failure mining workflow - -How to roll a trained policy out, capture compact replays, and produce a -browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures` -and `pufferlib/mining_viz.py`. - -## TL;DR - -```bash -# Roll the policy out for 100 episodes, save compact replays for episodes -# whose episode_return falls below the threshold, render HTML for each + -# a sortable index. 
-puffer mine_failures puffer_drive \ - --load-model-path /path/to/model_011000.pt \ - --mine.output-dir ./failure_mining/baseline_011000 \ - --mine.num-episodes 100 \ - --mine.score-threshold 1e9 \ - --vec.backend Serial -``` - -Outputs: - -``` -./failure_mining/baseline_011000/ - replays/episode_NNNNNN.replay.zlib one per saved episode - renders/episode_NNNNNN.html per-replay viewer - renders/index.html sortable summary - episodes.csv all episodes, all metrics -``` - -Open the index in a browser: - -```bash -open ./failure_mining/baseline_011000/renders/index.html -``` - -## What gets captured - -A compact replay bundle is a pickled+zlib'd `schema_version=2` dict containing -per-step agent state, traffic state, and observation arrays for a single -episode. Bundles are produced C-side when `capture_compact_replay=True` is -passed to `Drive(...)`. `mine_failures` sets this automatically. - -Each saved bundle is paired with a metadata row in `episodes.csv` including -`episode_return`, `collision_rate`, `offroad_rate`, `num_goals_reached`, -`avg_distance_per_infraction`, etc. The HTML viewer (`pufferlib/mining_viz.py`) -reads the bundle and replays it in-browser on a top-down canvas, with optional -overlays for the agent's observed FOV, partner circle, goal route, and waypoint -markers. - -## `mine.score_threshold` selection - -The save rule is "write replay if and only if `episode_return < score_threshold`". - -- `--mine.score-threshold 1e9` captures every episode (any real return is - less than 1e9). -- `--mine.score-threshold 0` captures only negative-return ("true failure") - episodes. -- Default `-inf` captures **nothing** — useful only if you want `episodes.csv` - metrics without the bundle overhead. - -`episodes.csv` always contains all N episodes' metadata regardless of -threshold; only the bundle save + HTML render is gated. - -## `--vec.backend Serial` - -Mining must use `--vec.backend Serial`. 
The drive.ini default -`Multiprocessing` backend forks workers post-torch-import, which deadlocks on -CUDA in the child process. Symptom is a parent process at 100% CPU with no -visible progress and no `[mine_failures] target episodes=...` print. - -`Serial` keeps the env in the same process as the policy. Mining is a single -env / single rollout workflow, so the throughput cost is negligible. - -## Tuning the rollout config - -The mining env config comes from drive.ini's `[mine]` section plus per-CLI -overrides: - -```bash -# Larger output (slower): ---mine.num-episodes 500 - -# Replay mode (drive recorded nuPlan / Waymo scenarios): ---env.simulation-mode replay \ ---env.control-mode control_sdc_only \ ---env.map-dir /path/to/recorded_bins \ ---env.init-steps 10 \ ---env.scenario-length 200 - -# Looser goal radius (default 2 m, up to 12 m under reward randomization): ---env.goal-radius 6 - -# Closer-spaced goals: ---env.min-waypoint-spacing 10 \ ---env.max-waypoint-spacing 15 -``` - -## Loading checkpoints with non-default architecture - -`mine_failures` does not read the sibling `config.yaml` next to -`load_model_path` (only `pufferl.train` does). If the checkpoint was trained -with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`, -`backbone_num_layers=4`), pass them on the CLI to match the saved state dict: - -```bash ---policy.input-size 128 \ ---policy.actor-hidden-size 512 \ ---policy.actor-num-layers 0 \ ---policy.backbone-hidden-size 512 \ ---policy.backbone-num-layers 4 \ ---policy.critic-hidden-size 512 \ ---policy.critic-num-layers 0 \ ---policy.encoder-gigaflow True \ ---policy.split-network False \ ---rnn.hidden-size 512 \ ---rnn.input-size 512 -``` - -You can read the right values out of the checkpoint's sibling `config.yaml` -(under `policy:` and `rnn:`) and pass them through. The error if you forget -is a wall of `size mismatch for ...` lines from `policy.load_state_dict`. 
- -## On the cluster - -Mining is GPU-bound on the policy forward pass but memory-light compared to -training (single env, no rollout buffer, no PPO update). 48 GB RAM and a -60-minute time limit are plenty for 100 episodes. The same `submit_cluster.py` -flow as training works — override `--main` to invoke `mine_failures`: - -```bash -python3 scripts/submit_cluster.py \ - --save_dir /scratch/$USER/runs \ - --prefix mine \ - --compute_config scripts/cluster_configs/nyu_greene.yaml \ - --account --partition --time 60 \ - --mem 48gb --cpus 8 \ - --container \ - --main "-m pufferlib.pufferl mine_failures puffer_drive" \ - --args \ - load_model_path= \ - mine.output_dir=/scratch/$USER/failure_mining/out \ - mine.num_episodes=100 \ - mine.score_threshold=1e9 \ - vec.backend=Serial -``` - -See [`docs/cluster_training.md`](cluster_training.md) for one-time setup of -the login-side submitit (`python3 -m pip install --user submitit pyyaml -cloudpickle`). - -Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing. - -## Viewer features (`mining_viz.py`) - -The per-episode HTML viewer supports: - -- Frame scrubber + play/pause + speed control. -- Toggle observation overlay (FOV rectangle, partner circle, observed-entity - highlights, goal route, waypoint markers). -- Toggle road segment / road edge / lane line rendering. -- Map background (CARLA / nuPlan / Waymo road graph from the bundle's - embedded `simulation_mode`). - -The index (`renders/index.html`) is a sortable table linking to each per-episode -HTML, with the metadata columns from `episodes.csv` (failure metrics, scenario -ID, map name). 
From b793e602433c65af46b4ad63983649529f329707 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 09:51:13 -0500 Subject: [PATCH 08/20] docs: cluster_training tweaks (TL;DR rewrite, formatting) --- docs/cluster_training.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cluster_training.md b/docs/cluster_training.md index f5b7ee2112..af4eeae82f 100644 --- a/docs/cluster_training.md +++ b/docs/cluster_training.md @@ -173,7 +173,7 @@ sbatch --account= --partition=cpu_short \ `upper/usr/local/` and aren't visible to apptainer's view). Use `debugfs -R "ls /upper" overlay.ext3` from a login node to inspect. -### `TORCH_CUDA_ARCH_LIST`: a warning that you can skip +### `TORCH_CUDA_ARCH_LIST`: a quick warning that won't generally be an issue PufferDrive's C extension contains CUDA kernels. When `setup.py build_ext` compiles them, `nvcc` emits machine code for each architecture listed in From 1560772b8a83361f0c4b05a5308ca679ccc566b5 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 10:27:52 -0500 Subject: [PATCH 09/20] docs+gitignore: pre-commit fixes + drop sphinx noise MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit cluster_training.md: trim trailing whitespace on two lines and add a final newline (pre-commit hooks were rejecting the diff). .gitignore: drop the sphinx-themed entries. This repo doesn't use sphinx (no docs/conf.py, no Makefile, no mkdocs.yml — docs are just plain Markdown), so 'docs/_build/' and the misleading 'Generated docs (sphinx build output only)' comment were vestigial cookiecutter boilerplate. The earlier '# docs/' line is gone with them. README.md: short pointer paragraph added to the existing HPC section linking to docs/cluster_training.md. 
Co-Authored-By: Claude Opus 4.7 --- .gitignore | 6 ------ README.md | 2 +- docs/cluster_training.md | 6 +++--- 3 files changed, 4 insertions(+), 10 deletions(-) diff --git a/.gitignore b/.gitignore index 390f3dd7a2..62f3bc9449 100644 --- a/.gitignore +++ b/.gitignore @@ -81,9 +81,6 @@ instance/ # Scrapy stuff: .scrapy -# Sphinx documentation -docs/_build/ - # PyBuilder target/ @@ -211,9 +208,6 @@ pufferlib/resources/drive/output*.gif # External local clones external/ -# Generated docs (sphinx build output only; docs/*.md is tracked) -# docs/ - # Claude config .claude/ CLAUDE.local.md diff --git a/README.md b/README.md index f19eb805c3..9011a93e03 100644 --- a/README.md +++ b/README.md @@ -71,7 +71,7 @@ python scripts/submit_cluster.py \ `scripts/cluster_configs/nyu_greene.yaml` defines `account`, `gpus`, `cpus`, `mem`, `time` — edit `account` to your allocation before first submit. `--container` makes `submit_cluster.py` wrap the job command in `singularity exec --nv --overlay $OVERLAY_PATH:ro $IMAGE_PATH ...`. -For the operational guide — the one-time login-side submitit setup, GPU heartbeat (required for runs > ~2h), CPU rebuild path, account/partition strategy, and replay-mode memory sizing — see [`docs/cluster_training.md`](docs/cluster_training.md). +**For a full guide on how to use this see [`docs/cluster_training.md`](docs/cluster_training.md).** ## Data diff --git a/docs/cluster_training.md b/docs/cluster_training.md index af4eeae82f..d002bdee7c 100644 --- a/docs/cluster_training.md +++ b/docs/cluster_training.md @@ -1,6 +1,6 @@ # Cluster training — operational guide -How to run PufferDrive training on a SLURM cluster. This is written with the NYU cluster in mind but it should mostly hold for any SLURM cluster. +How to run PufferDrive training on a SLURM cluster. This is written with the NYU cluster in mind but it should mostly hold for any SLURM cluster. 
## A quick overview of the setup and launch process @@ -54,7 +54,7 @@ singularity exec --nv \ ## Submitting training — `submit_cluster.py` -`scripts/submit_cluster.py` is the canonical submission path. It composes: +`scripts/submit_cluster.py` is the canonical submission path. It composes: - a `compute_config` YAML (SLURM settings) - a `program_config` YAML (pufferl training args) - `--args` CLI overrides @@ -199,4 +199,4 @@ matters when you build the C extension **outside** the rebuild wrapper — e.g. an interactive `python setup.py build_ext --inplace --force` inside a hand-launched singularity exec. Adding the export to your shell profile (or sourcing it before any manual build) saves you from hitting the "no -kernel image" error after a quick fix-and-rebuild loop. \ No newline at end of file +kernel image" error after a quick fix-and-rebuild loop. From a016d9adb381e8e596db1e290ef8f8ab1dab2d84 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 10:47:32 -0500 Subject: [PATCH 10/20] submit_cluster: compress the launcher-wrap comment MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 11 lines → 6. Kept the what (wrap launcher in singularity exec) + why (sys.executable is either version-mismatched host python or a dangling venv symlink) + pointer to the matched check in launch_training. Dropped the enumerated detail; future readers who need it can trace the code. 
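The wrap-and-skip behaviour that the comment summarises can be sketched roughly as follows. This is an illustration only, not the code in `submit_cluster.py`: the function names are hypothetical, and running the inner command through `bash -c` is an assumption. The `--nv` and `--overlay <path>:ro` arguments and the `/.singularity.d/Singularity` marker are taken from the docs and the comment above.

```python
import os
import shlex

# Marker file that exists inside a running singularity container.
SINGULARITY_MARKER = "/.singularity.d/Singularity"


def in_container(marker=SINGULARITY_MARKER):
    """Return True if this process appears to run inside singularity."""
    return os.path.exists(marker)


def wrap_in_singularity(command, overlay, image, already_inside=None):
    """Prefix `command` with `singularity exec` unless we are already inside.

    Skipping the wrap when already inside avoids nesting one container
    invocation in another.
    """
    if already_inside is None:
        already_inside = in_container()
    if already_inside:
        return list(command)
    inner = shlex.join(command)  # quote each argument for the inner shell
    return [
        "singularity", "exec", "--nv",
        "--overlay", f"{overlay}:ro",  # read-only overlay mount
        image,
        "bash", "-c", inner,
    ]


if __name__ == "__main__":
    train = ["python", "-m", "pufferlib.pufferl", "train", "puffer_drive"]
    print(wrap_in_singularity(train, "/scratch/overlay.ext3", "/scratch/image.sif",
                              already_inside=False))
```

Passing `already_inside` explicitly keeps the function testable on a machine without singularity; left as `None`, it falls back to the marker-file check.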
Co-Authored-By: Claude Opus 4.7 --- scripts/submit_cluster.py | 20 ++++++-------------- 1 file changed, 6 insertions(+), 14 deletions(-) diff --git a/scripts/submit_cluster.py b/scripts/submit_cluster.py index b2a47fbc58..a6e4ebcfed 100644 --- a/scripts/submit_cluster.py +++ b/scripts/submit_cluster.py @@ -250,17 +250,12 @@ def submit(args, job_name: str, command: List[str], save_dir: str, dry: bool): # Set up executor executor = submitit.AutoExecutor(folder=os.path.join(save_dir, "submitit")) - # When --container is set, run submitit's outer launcher python *inside* - # singularity. The default launcher python is sys.executable, which is - # either the login-node system python (version-mismatched with the - # compute-node system python, so --user installs are invisible) or the - # venv python (a symlink into the overlay, which dangles on the compute - # node outside singularity). Wrapping the launcher in singularity exec - # uses the overlay's miniforge3 python — identical on every node — and - # gives submitit a working import of itself (the venv has submitit - # installed). launch_training detects the already-in-container state - # via /.singularity.d/Singularity and skips its own inner wrap to avoid - # nested singularity. + # Wrap submitit's outer launcher python in singularity exec so it uses + # the overlay's miniforge3 python (cross-node consistent, has submitit + # in the venv) instead of sys.executable, which is either a version- + # mismatched host python or a venv symlink that dangles outside the + # container. launch_training detects the in-container state via + # /.singularity.d/Singularity and skips its own wrap so we don't nest. 
if args.container and hasattr(executor, "_executor"): scratch_dir = os.environ.get("SCRATCH_DIR", f"/scratch/{os.environ.get('USER', '')}") venv_path = os.environ.get("VENV_PATH", f"{scratch_dir}/venvs/pufferdrive") @@ -315,7 +310,6 @@ def launch_training(args, from_config, cmd, save_dir, project_root, container_co import submitit # Code isolation: symlink top-level entries, hard copy pufferlib/ source - # (symlink resources/ to avoid copying 3.7GB of maps/models). isolated_root = os.path.join(save_dir, "code") if os.path.exists(isolated_root): version = 1 @@ -334,8 +328,6 @@ def launch_training(args, from_config, cmd, save_dir, project_root, container_co os.remove(dst) os.symlink(src, dst) # Hard copy pufferlib/ so branch switches don't break running jobs. - # Previously used `cp -rs` (symlinks) which meant switching branches - # after submission would silently change the code running jobs use. # We symlink resources/ (3.7GB of maps/models) to avoid slow copies, # but hard copy everything else (source code, .so files). pufferlib_dst = os.path.join(isolated_root, "pufferlib") From 2855f0a9003091e1be86850e5db2673a7d9fe215 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 11:00:38 -0500 Subject: [PATCH 11/20] submit_cluster: clarify what the wrap actually solves MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Previous comment conflated two scenarios — the version-mismatched host python (real, what we hit) and the venv-symlink-dangling case (a hypothetical alternative setup we don't use). Just describe the real problem: login-node /usr/bin/python3 is 3.12, compute-node is 3.9, so ~/.local installs for 3.12 are invisible on the compute side. 
Co-Authored-By: Claude Opus 4.7 --- scripts/submit_cluster.py | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/scripts/submit_cluster.py b/scripts/submit_cluster.py index a6e4ebcfed..fd0267f144 100644 --- a/scripts/submit_cluster.py +++ b/scripts/submit_cluster.py @@ -250,12 +250,14 @@ def submit(args, job_name: str, command: List[str], save_dir: str, dry: bool): # Set up executor executor = submitit.AutoExecutor(folder=os.path.join(save_dir, "submitit")) - # Wrap submitit's outer launcher python in singularity exec so it uses - # the overlay's miniforge3 python (cross-node consistent, has submitit - # in the venv) instead of sys.executable, which is either a version- - # mismatched host python or a venv symlink that dangles outside the - # container. launch_training detects the in-container state via - # /.singularity.d/Singularity and skips its own wrap so we don't nest. + # Override the python submitit invokes on the compute node. Default is + # sys.executable (the login-node /usr/bin/python3, version 3.12 on Greene), + # but on the compute node /usr/bin/python3 is version 3.9 and can't see + # the user-installed submitit at ~/.local/lib/python3.12/. Wrap it in + # singularity exec so the compute-side launcher uses the overlay's + # miniforge3 python — same on every node, with submitit available in the + # venv. launch_training detects /.singularity.d/Singularity and skips + # its own singularity wrap so we don't nest. 
if args.container and hasattr(executor, "_executor"): scratch_dir = os.environ.get("SCRATCH_DIR", f"/scratch/{os.environ.get('USER', '')}") venv_path = os.environ.get("VENV_PATH", f"{scratch_dir}/venvs/pufferdrive") From 80a726b1d221cdb42e47eb7f757bde00d1361d32 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 11:39:23 -0500 Subject: [PATCH 12/20] cluster: drop the login-side submitit bootstrap; revert launcher wrap MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per reviewer feedback (yw4142): 1. setup_container.sh create_overlay: remove the stray mv "${TEMPLATE_NAME%.gz}" overlay.ext3 — it renamed the just-extracted file to a name that doesn't match what OVERLAY_PATH defaults to, so every later step looked for the overlay at the wrong path. Drop the line; overlay stays at its original name. 2. submit_cluster.py: revert the executor._executor.python override and the /.singularity.d/Singularity sentinel check in launch_training. They were working around 'submitit not on the compute-side python', but if you source the project venv on the login node, sys.executable becomes the venv python (same path the compute node will run) and submitit's serialization round-trips without needing any wrap. Back to the original pattern: launcher python is the venv python, launch_training wraps the inner train command in singularity exec for CUDA libs. 3. cluster_training.md: replace the pip install --user submitit pyyaml cloudpickle bootstrap with 'source the venv'. setup_container.sh install already lands submitit in the venv via the project's pyproject.toml, and sourcing the venv on login makes it importable. The 'two python installations' section comes out entirely — there's just one python (the venv's). 
Co-Authored-By: Claude Opus 4.7 --- docs/cluster_training.md | 54 +++++++++--------------------- scripts/setup_container.sh | 1 - scripts/submit_cluster.py | 67 ++++++++++---------------------------- 3 files changed, 34 insertions(+), 88 deletions(-) diff --git a/docs/cluster_training.md b/docs/cluster_training.md index d002bdee7c..e4963ea2b6 100644 --- a/docs/cluster_training.md +++ b/docs/cluster_training.md @@ -5,25 +5,23 @@ How to run PufferDrive training on a SLURM cluster. This is written with the NYU ## A quick overview of the setup and launch process ```bash -# One-time per cluster: -# (a) create the singularity overlay and install deps into the venv +# One-time per cluster: create the singularity overlay and install deps +# into the venv (this also installs submitit and the other submission +# deps as part of the project's pyproject.toml). ./scripts/setup_container.sh create-overlay sbatch --account= --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \ --wrap "./scripts/setup_container.sh install" -# (b) install submitit on the login-node system python (used to compose -# the submission; the in-container venv python runs the actual job) -python3 -m ensurepip --user -python3 -m pip install --user submitit pyyaml cloudpickle # If code changes, or we haven't built before, rebuild the C code in the container sbatch --account= --partition=cpu_short --cpus-per-task=8 --mem=16gb --time=20 \ --chdir=$PWD -o $LOGDIR/rebuild_%j.log \ --wrap "./scripts/setup_container.sh rebuild" -# Training: submit_cluster.py from the login node with --container --heartbeat. -# By default launches RL training but can be modified through the --main argument -# to launch other modes -python3 scripts/submit_cluster.py \ +# Training: source the venv on the login node, then submit_cluster.py +# with --container --heartbeat. --main defaults to RL training; override +# it to launch other modes (e.g. mining, eval). 
+source /scratch/$USER/venvs/pufferdrive/bin/activate +python scripts/submit_cluster.py \ --save_dir /scratch/$USER/runs \ --compute_config scripts/cluster_configs/nyu_greene.yaml \ --program_config scripts/cluster_configs/train_base.yaml \ @@ -65,37 +63,17 @@ It performs code isolation (symlinks the top-level entries + hard-copies `pufferlib/` into a per-run sandbox), and hands the package to `submitit` for `sbatch`-submission. -### WARNING: two python installation are being used here - -A `submit_cluster.py --container` submission uses two distinct python -environments: - -- **Login-side composer**: the python that runs `submit_cluster.py` itself. - Only needs `submitit`, `pyyaml`, `cloudpickle` importable. Used purely to - build the sbatch script and submit it to SLURM. On Greene this is - `/usr/bin/python3` (system python) and you can run `pip install --user submitit pyyaml - cloudpickle` to provide those deps. -- **Compute-side executor**: the python that runs the training job on the - compute node. This is the **venv python** inside the singularity overlay. submitit's - outer launcher is wrapped in `singularity exec` so it lands in this - environment; `launch_training` then runs `torchrun` inside the same - container. +### Source the venv before invoking `submit_cluster.py` -### One-time login-side setup +`setup_container.sh install` puts submitit + its deps into the project +venv at `/scratch/$USER/venvs/pufferdrive/`. Sourcing the venv on the +login node makes that submitit importable and lines up `sys.executable` +with the same venv python that the compute node will run, so submitit's +serialization round-trips cleanly. ```bash -# Greene's /usr/bin/python3 ships without pip; bootstrap it: -python3 -m ensurepip --user -python3 -m pip install --user --upgrade pip -python3 -m pip install --user submitit pyyaml cloudpickle -``` - -After this, `python3 -c 'import submitit'` works on the login node. 
- -### Run from the login node - -```bash -python3 scripts/submit_cluster.py \ +source /scratch/$USER/venvs/pufferdrive/bin/activate +python scripts/submit_cluster.py \ --save_dir /scratch/$USER/runs \ --prefix mytrain \ --compute_config scripts/cluster_configs/nyu_greene.yaml \ diff --git a/scripts/setup_container.sh b/scripts/setup_container.sh index bf1b8f2c33..338197c9da 100755 --- a/scripts/setup_container.sh +++ b/scripts/setup_container.sh @@ -46,7 +46,6 @@ create_overlay() { TEMPLATE_NAME=$(basename "$OVERLAY_TEMPLATE") cd "$CONTAINER_DIR" gunzip "$TEMPLATE_NAME" - mv "${TEMPLATE_NAME%.gz}" overlay.ext3 echo "Overlay created at $OVERLAY_PATH" echo "" diff --git a/scripts/submit_cluster.py b/scripts/submit_cluster.py index fd0267f144..9a8182bf8c 100644 --- a/scripts/submit_cluster.py +++ b/scripts/submit_cluster.py @@ -250,29 +250,6 @@ def submit(args, job_name: str, command: List[str], save_dir: str, dry: bool): # Set up executor executor = submitit.AutoExecutor(folder=os.path.join(save_dir, "submitit")) - # Override the python submitit invokes on the compute node. Default is - # sys.executable (the login-node /usr/bin/python3, version 3.12 on Greene), - # but on the compute node /usr/bin/python3 is version 3.9 and can't see - # the user-installed submitit at ~/.local/lib/python3.12/. Wrap it in - # singularity exec so the compute-side launcher uses the overlay's - # miniforge3 python — same on every node, with submitit available in the - # venv. launch_training detects /.singularity.d/Singularity and skips - # its own singularity wrap so we don't nest. 
- if args.container and hasattr(executor, "_executor"): - scratch_dir = os.environ.get("SCRATCH_DIR", f"/scratch/{os.environ.get('USER', '')}") - venv_path = os.environ.get("VENV_PATH", f"{scratch_dir}/venvs/pufferdrive") - cert_binds = [] - for cert_path in ["/etc/ssl/certs", "/etc/pki"]: - if os.path.exists(cert_path): - cert_binds.append(f"--bind {cert_path}:{cert_path}:ro") - executor._executor.python = ( - f"singularity exec --nv " - f"--overlay {args.container_overlay}:ro " - f"{' '.join(cert_binds)} " - f"{args.container_image} " - f"{venv_path}/bin/python" - ) - # Build GRES string for GPUs if from_config.get("gpu_type") is not None: gres = f"gpu:{from_config['gpu_type']}:{from_config['gpus']}" @@ -424,33 +401,25 @@ def wrap_with_heartbeat(train_cmd_str): if args.heartbeat: train_str = wrap_with_heartbeat(train_str) inner_cmd = f"{env_setup} && {cache_exports} && cd {project_root} && {train_str}" - # submit_cluster.py also wraps submitit's outer launcher python in - # singularity exec when --container is on (see the executor.python - # override at submission time). When we land here on the compute - # node, we're already inside that singularity context — skip the - # second wrap and just run inner_cmd via bash. - if os.path.exists("/.singularity.d/Singularity"): - full_cmd = ["bash", "-c", inner_cmd] - else: - full_cmd = [ - "singularity", - "exec", - "--nv", - "--overlay", - container_config["overlay"] + ":ro", # Read-only overlay for running + full_cmd = [ + "singularity", + "exec", + "--nv", + "--overlay", + container_config["overlay"] + ":ro", # Read-only overlay for running + ] + # Bind mount SSL certificates for TLS verification (wandb, etc.) + for cert_path in ["/etc/ssl/certs", "/etc/pki"]: + if os.path.exists(cert_path): + full_cmd.extend(["--bind", f"{cert_path}:{cert_path}:ro"]) + full_cmd.extend( + [ + container_config["image"], + "bash", + "-c", + inner_cmd, ] - # Bind mount SSL certificates for TLS verification (wandb, etc.) 
- for cert_path in ["/etc/ssl/certs", "/etc/pki"]: - if os.path.exists(cert_path): - full_cmd.extend(["--bind", f"{cert_path}:{cert_path}:ro"]) - full_cmd.extend( - [ - container_config["image"], - "bash", - "-c", - inner_cmd, - ] - ) + ) elif args.heartbeat: # No container: still need to wrap in bash -c so the brace group parses. train_str = " ".join(full_cmd) From 8c847ae671847221b7298bb6c377790940aaddc7 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 11:53:11 -0500 Subject: [PATCH 13/20] setup_container: install miniforge3 on /scratch instead of in the overlay MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Move the base python from inside the overlay (/ext3/miniforge3/) onto /scratch (/scratch/$USER/miniforge3/). The venv's bin/python symlink now points at /scratch — a path that resolves on every node, inside or outside singularity, so 'source venv/activate && python ...' works on the login node without needing to enter the container. Changes: - New MINIFORGE3_DIR variable (default /scratch/$USER/miniforge3) and ensure_miniforge3() helper that runs the conda-forge installer (self-contained shell script; no root, no singularity). - CONTAINER_PYTHON default now points at the /scratch miniforge3. - The 'install' dispatch runs ensure_miniforge3 outside the container first, then enters singularity (read-only overlay) for the uv + pip + build_ext steps that need nvcc. - run_in_container_writable + --fakeroot are gone; nothing in the python flow writes to the overlay anymore. The overlay stays mounted :ro for rare system-tool installs but isn't on any write path. For existing users: re-run setup_container.sh install — it'll detect miniforge3 missing on /scratch, install it, recreate the venv against the new base, and reinstall the project. The overlay's old miniforge3 becomes stale but harmless; you can rm it from the overlay later. 
Co-Authored-By: Claude Opus 4.7 --- scripts/setup_container.sh | 72 ++++++++++++++++++++++++-------------- 1 file changed, 46 insertions(+), 26 deletions(-) diff --git a/scripts/setup_container.sh b/scripts/setup_container.sh index 338197c9da..d25c6eebd8 100755 --- a/scripts/setup_container.sh +++ b/scripts/setup_container.sh @@ -4,12 +4,16 @@ # with older glibc versions. # # Architecture: -# - The overlay is used ONLY for the miniforge3 base Python interpreter. -# - All Python packages (torch, pufferlib, etc.) live in a venv on /scratch -# (regular ext4) instead of the overlay (fuse2fs single-threaded ~10 MB/s). -# This makes installs/rebuilds ~50x faster than the all-in-overlay approach. -# - At runtime the venv's bin/python symlinks back to /ext3/miniforge3, which -# is why we still mount the overlay (read-only) when activating the venv. +# - miniforge3 lives on /scratch (NOT in the overlay) so its python is a +# real file accessible from any node, in or out of singularity. The venv +# symlinks `bin/python` into the /scratch miniforge3, which makes +# `source venv/activate` work on the login node directly without +# needing to enter the container. +# - All Python packages (torch, pufferlib, etc.) live in the venv on /scratch +# too — fuse2fs is not on the write path for any install step. +# - The singularity image still supplies CUDA + cuDNN at job runtime. The +# overlay is preserved for the rare case where you need to install +# system-level tools, but it's not used for the standard python flow. # # Usage: # 1. Create an overlay (one time): ./setup_container.sh create-overlay @@ -28,8 +32,11 @@ CONTAINER_DIR="${CONTAINER_DIR:-$(dirname "$OVERLAY_PATH")}" PROJECT_ROOT="$(cd "$(dirname "$0")/.." && pwd)" # Venv lives on /scratch (regular ext4) — bypasses fuse2fs entirely for installs. VENV_PATH="${VENV_PATH:-/scratch/$USER/venvs/pufferdrive}" -# Python from the overlay's miniforge3 (mounted read-only at runtime). 
-CONTAINER_PYTHON="${CONTAINER_PYTHON:-/ext3/miniforge3/bin/python3}" +# miniforge3 lives on /scratch too so the venv's python symlink resolves +# from any node without needing the singularity overlay to be mounted. +MINIFORGE3_DIR="${MINIFORGE3_DIR:-/scratch/$USER/miniforge3}" +MINIFORGE3_INSTALLER_URL="${MINIFORGE3_INSTALLER_URL:-https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh}" +CONTAINER_PYTHON="${CONTAINER_PYTHON:-$MINIFORGE3_DIR/bin/python3}" create_overlay() { echo "=== Creating overlay filesystem ===" @@ -75,6 +82,25 @@ fi EOF } +# Install miniforge3 to /scratch if it isn't there yet. The conda-forge +# installer is a self-contained shell script — no root, no singularity +# required. Doing this on /scratch (rather than inside the overlay) +# means $MINIFORGE3_DIR/bin/python3 is a real file accessible from any +# node, so the venv's bin/python symlink resolves outside singularity too. +ensure_miniforge3() { + if [ -x "$MINIFORGE3_DIR/bin/python3" ]; then + return 0 + fi + echo "=== Installing miniforge3 to $MINIFORGE3_DIR ===" + mkdir -p "$(dirname "$MINIFORGE3_DIR")" + local installer + installer="$(mktemp -t miniforge3-installer.XXXXXX.sh)" + curl -fsSL "$MINIFORGE3_INSTALLER_URL" -o "$installer" + bash "$installer" -b -p "$MINIFORGE3_DIR" + rm -f "$installer" + echo "miniforge3 installed at $MINIFORGE3_DIR" +} + # Find or bootstrap a uv binary. Prefer one already on PATH or in # $HOME/.local/bin (auto-bound by apptainer). Fall back to the official # installer, which drops a static binary into ~/.local/bin in seconds. @@ -167,27 +193,16 @@ rebuild_extension() { run_in_container() { local cmd="$1" - # Overlay mounted read-only — venv's bin/python symlinks back into - # /ext3/miniforge3 for the interpreter, but every package read/write - # happens on /scratch ext4 (the venv on $VENV_PATH). 
+ # Overlay mounted read-only — every read/write the install or rebuild + # cares about happens on /scratch ext4 (miniforge3 + venv). The overlay + # is kept on the mount line for backward compatibility, but nothing + # in the python flow writes to it. singularity exec --nv \ --overlay "$OVERLAY_PATH:ro" \ "$IMAGE_PATH" \ bash -c "cd $PROJECT_ROOT && $cmd" } -run_in_container_writable() { - local cmd="$1" - # --fakeroot still required because uv bootstrap writes to /ext3/miniforge3 - # (the system pip puts uv there before we activate the venv). Once uv - # is bootstrapped, all subsequent installs go to the venv on /scratch - # (regular ext4, no fuse2fs in the write path). - singularity exec --nv --fakeroot \ - --overlay "$OVERLAY_PATH" \ - "$IMAGE_PATH" \ - bash -c "cd $PROJECT_ROOT && $cmd" -} - case "${1:-}" in create-overlay) create_overlay @@ -196,7 +211,11 @@ case "${1:-}" in if [ -f /.singularity.d/Singularity ]; then install_deps else - run_in_container_writable "$0 install" + # miniforge3 installs on /scratch via plain shell — no singularity + # needed for that step. The rest (uv + pip + build_ext) runs in + # the container so nvcc and the right glibc are on PATH. + ensure_miniforge3 + run_in_container "$0 install" fi ;; rebuild) @@ -217,12 +236,13 @@ case "${1:-}" in echo " rebuild Rebuild C extension only (submit as GPU job)" echo "" echo "Environment variables:" + echo " MINIFORGE3_DIR Where the base python lives (default: /scratch/\$USER/miniforge3)" echo " VENV_PATH Where the venv lives (default: /scratch/\$USER/venvs/pufferdrive)" - echo " OVERLAY_PATH Singularity overlay (only needs miniforge3 base python)" + echo " OVERLAY_PATH Singularity overlay (kept for system-tool installs; not used by the python flow)" echo "" echo "Example workflow:" echo " 1. $0 create-overlay" echo " 2. sbatch --gres=gpu:1 --time=60 --wrap \"$0 install\"" - echo " 3. python scripts/submit_cluster.py --container ..." + echo " 3. 
source \$VENV_PATH/bin/activate && python scripts/submit_cluster.py --container ..." ;; esac From 450c3ca5aabcdac8620821f23d8cfc2a1b30c87b Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 11:54:17 -0500 Subject: [PATCH 14/20] setup_container: rebuild venv if its python symlink is stale Existing installs have the venv's bin/python pointing at /ext3/miniforge3 from before we moved miniforge3 to /scratch. Re-running install would otherwise reuse the broken-symlink venv and skip recreating it. Detect the broken symlink (bash -e $VENV/bin/python) and rm -rf before recreating against $CONTAINER_PYTHON. Co-Authored-By: Claude Opus 4.7 --- scripts/setup_container.sh | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/scripts/setup_container.sh b/scripts/setup_container.sh index d25c6eebd8..61868cd022 100755 --- a/scripts/setup_container.sh +++ b/scripts/setup_container.sh @@ -131,6 +131,14 @@ ensure_uv() { # of the box and works against any cpython. ensure_venv() { ensure_uv + # If the venv exists but its python symlink no longer resolves (e.g. it + # points into /ext3/miniforge3 from before we moved miniforge3 onto + # /scratch), rebuild it. Cheap heuristic — the venv's python is a tiny + # symlink that uv will recreate against $CONTAINER_PYTHON. + if [ -f "$VENV_PATH/bin/activate" ] && [ ! -e "$VENV_PATH/bin/python" ]; then + echo "=== Rebuilding stale venv at $VENV_PATH (python link is broken) ===" + rm -rf "$VENV_PATH" + fi if [ ! 
-f "$VENV_PATH/bin/activate" ]; then echo "=== Creating venv at $VENV_PATH ===" mkdir -p "$(dirname "$VENV_PATH")" From 1c0fc2aa5fd777cf6e9c7b20313d0f86ba182c7b Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 12:03:15 -0500 Subject: [PATCH 15/20] setup_container: detect stale venv by symlink TARGET, not existence MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous check (-e $VENV/bin/python) was wrong: install runs inside singularity, where the overlay is mounted, so the OLD venv's symlink into /ext3/miniforge3 IS valid — it just points at the wrong place relative to where MINIFORGE3_DIR now is. Use readlink -f to walk the chain and verify the resolved path is under $MINIFORGE3_DIR; rebuild if not. Co-Authored-By: Claude Opus 4.7 --- scripts/setup_container.sh | 23 ++++++++++++++++------- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/scripts/setup_container.sh b/scripts/setup_container.sh index 61868cd022..8178f5f942 100755 --- a/scripts/setup_container.sh +++ b/scripts/setup_container.sh @@ -131,13 +131,22 @@ ensure_uv() { # of the box and works against any cpython. ensure_venv() { ensure_uv - # If the venv exists but its python symlink no longer resolves (e.g. it - # points into /ext3/miniforge3 from before we moved miniforge3 onto - # /scratch), rebuild it. Cheap heuristic — the venv's python is a tiny - # symlink that uv will recreate against $CONTAINER_PYTHON. - if [ -f "$VENV_PATH/bin/activate" ] && [ ! -e "$VENV_PATH/bin/python" ]; then - echo "=== Rebuilding stale venv at $VENV_PATH (python link is broken) ===" - rm -rf "$VENV_PATH" + # If the venv exists but its python doesn't resolve into the current + # $MINIFORGE3_DIR (e.g. it points at /ext3/miniforge3 from before we + # moved miniforge3 onto /scratch), rebuild. 
readlink -f resolves the + # whole symlink chain, so this catches the case where the link is + # valid inside the container (overlay mounted) but stale relative to + # where the new venv should point. + if [ -f "$VENV_PATH/bin/activate" ]; then + local resolved + resolved="$(readlink -f "$VENV_PATH/bin/python" 2>/dev/null || true)" + case "$resolved" in + "$MINIFORGE3_DIR"/*) ;; + *) + echo "=== Rebuilding stale venv at $VENV_PATH (python points to '$resolved', not under $MINIFORGE3_DIR) ===" + rm -rf "$VENV_PATH" + ;; + esac fi if [ ! -f "$VENV_PATH/bin/activate" ]; then echo "=== Creating venv at $VENV_PATH ===" From aea954091596b63e17884d8ad4867852ffa32ed9 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 12:11:19 -0500 Subject: [PATCH 16/20] setup_container: pin miniforge3 to a Python 3.12 release MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit miniforge3 25.x ships Python 3.13, but torch's cu121 wheels are cp39..cp312 only — no cp313 — so the install fails with 'no solution found when resolving dependencies'. Pin the installer URL to 24.11.3-2 (Python 3.12) and add a version check in ensure_miniforge3 so an existing miniforge3 with the wrong python gets reinstalled. Co-Authored-By: Claude Opus 4.7 --- scripts/setup_container.sh | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/scripts/setup_container.sh b/scripts/setup_container.sh index 8178f5f942..7623039f2d 100755 --- a/scripts/setup_container.sh +++ b/scripts/setup_container.sh @@ -35,7 +35,11 @@ VENV_PATH="${VENV_PATH:-/scratch/$USER/venvs/pufferdrive}" # miniforge3 lives on /scratch too so the venv's python symlink resolves # from any node without needing the singularity overlay to be mounted. 
MINIFORGE3_DIR="${MINIFORGE3_DIR:-/scratch/$USER/miniforge3}" -MINIFORGE3_INSTALLER_URL="${MINIFORGE3_INSTALLER_URL:-https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh}" +# Pin to a miniforge3 release that ships Python 3.12. 25.x switched to 3.13, +# but torch's cu121 wheels are cp39..cp312 only (no cp313), so 3.13 breaks +# the install. Bump this once torch publishes cp313 wheels for our index. +MINIFORGE3_INSTALLER_URL="${MINIFORGE3_INSTALLER_URL:-https://github.com/conda-forge/miniforge/releases/download/24.11.3-2/Miniforge3-24.11.3-2-Linux-x86_64.sh}" +MINIFORGE3_PYTHON_VERSION="${MINIFORGE3_PYTHON_VERSION:-3.12}" CONTAINER_PYTHON="${CONTAINER_PYTHON:-$MINIFORGE3_DIR/bin/python3}" create_overlay() { @@ -89,7 +93,16 @@ EOF # node, so the venv's bin/python symlink resolves outside singularity too. ensure_miniforge3() { if [ -x "$MINIFORGE3_DIR/bin/python3" ]; then - return 0 + # Verify the existing miniforge3 has the python version we expect — + # otherwise an earlier install that grabbed "latest" (Python 3.13) + # would stay around, and uv venv would happily reuse it. 
+ local existing + existing="$("$MINIFORGE3_DIR/bin/python3" -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null || true)" + if [ "$existing" = "$MINIFORGE3_PYTHON_VERSION" ]; then + return 0 + fi + echo "=== miniforge3 at $MINIFORGE3_DIR has python $existing (want $MINIFORGE3_PYTHON_VERSION); reinstalling ===" + rm -rf "$MINIFORGE3_DIR" fi echo "=== Installing miniforge3 to $MINIFORGE3_DIR ===" mkdir -p "$(dirname "$MINIFORGE3_DIR")" From 17506911b0f02272212b7db3a69374610bae30c0 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 17:41:46 -0500 Subject: [PATCH 17/20] gitignore: restore docs/_build/ — harmless defensive ignore for sphinx MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Putting back the docs/_build/ line I dropped earlier alongside the blanket 'docs/' ignore. The blanket ignore was actively wrong (hid our checked-in markdown), but docs/_build/ is just the standard sphinx build output dir — costs nothing to keep, and protects future contributors from accidentally committing 'make html' output if sphinx is ever added. Co-Authored-By: Claude Opus 4.7 --- .gitignore | 3 +++ 1 file changed, 3 insertions(+) diff --git a/.gitignore b/.gitignore index 62f3bc9449..d0b9435dfb 100644 --- a/.gitignore +++ b/.gitignore @@ -81,6 +81,9 @@ instance/ # Scrapy stuff: .scrapy +# Sphinx documentation (if sphinx is later added; docs/*.md is tracked) +docs/_build/ + # PyBuilder target/ From e36b5d32d6728acc995207caddf88cf0fc348392 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 17:43:37 -0500 Subject: [PATCH 18/20] gitignore: drop failure_mining/ from this PR (moved to a separate PR) The cluster-docs PR shouldn't carry the mining-outputs ignore. Splitting to its own PR so the two are reviewable independently. 
Co-Authored-By: Claude Opus 4.7 --- .gitignore | 3 --- 1 file changed, 3 deletions(-) diff --git a/.gitignore b/.gitignore index d0b9435dfb..3b9fa56c27 100644 --- a/.gitignore +++ b/.gitignore @@ -214,6 +214,3 @@ external/ # Claude config .claude/ CLAUDE.local.md - -# Mining output artifacts (large local-only renders/replays) -failure_mining/ From 93be35aa3682a3b6d261fa44df46d853446c0ba5 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 09:52:22 -0500 Subject: [PATCH 19/20] docs: failure mining operational guide New docs/mining.md covering the mine_failures workflow: - score_threshold semantics (default -inf saves nothing) - the required --vec.backend Serial flag (pufferl's default Multiprocessing backend forks workers post-torch-import and deadlocks on CUDA) - loading checkpoints with non-default policy.* / rnn.* dims (mine_failures doesn't auto-merge the sibling config.yaml that train() does) - on-cluster submit_cluster.py pattern with --main override - viewer features README.md gains a short pointer paragraph at the end of the existing Failure mining section. Co-Authored-By: Claude Opus 4.7 --- README.md | 2 + docs/mining.md | 167 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 169 insertions(+) create mode 100644 docs/mining.md diff --git a/README.md b/README.md index 9011a93e03..5e4914d60c 100644 --- a/README.md +++ b/README.md @@ -148,6 +148,8 @@ renders/index.html # sortable index of all episodes Open `renders/index.html` in a browser to triage. The index page filters by "failures only" / "replays only" and sorts by any metric column. Each row links to the per-episode viewer with the scene's full 2D animation. +For the deeper guide — viewer features, `score_threshold` semantics, the required `--vec.backend Serial` flag, loading checkpoints with non-default `policy.*` dims, and the on-cluster `submit_cluster.py` pattern — see [`docs/mining.md`](docs/mining.md). 
+ ## Key Configuration (`pufferlib/config/ocean/drive.ini`) ### `[env]` — Simulation diff --git a/docs/mining.md b/docs/mining.md new file mode 100644 index 0000000000..1c11b9e60a --- /dev/null +++ b/docs/mining.md @@ -0,0 +1,167 @@ +# Failure mining workflow + +How to roll a trained policy out, capture compact replays, and produce a +browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures` +and `pufferlib/mining_viz.py`. + +## TL;DR + +```bash +# Roll the policy out for 100 episodes, save compact replays for episodes +# whose episode_return falls below the threshold, render HTML for each + +# a sortable index. +puffer mine_failures puffer_drive \ + --load-model-path /path/to/model_011000.pt \ + --mine.output-dir ./failure_mining/baseline_011000 \ + --mine.num-episodes 100 \ + --mine.score-threshold 1e9 \ + --vec.backend Serial +``` + +Outputs: + +``` +./failure_mining/baseline_011000/ + replays/episode_NNNNNN.replay.zlib one per saved episode + renders/episode_NNNNNN.html per-replay viewer + renders/index.html sortable summary + episodes.csv all episodes, all metrics +``` + +Open the index in a browser: + +```bash +open ./failure_mining/baseline_011000/renders/index.html +``` + +## What gets captured + +A compact replay bundle is a pickled+zlib'd `schema_version=2` dict containing +per-step agent state, traffic state, and observation arrays for a single +episode. Bundles are produced C-side when `capture_compact_replay=True` is +passed to `Drive(...)`. `mine_failures` sets this automatically. + +Each saved bundle is paired with a metadata row in `episodes.csv` including +`episode_return`, `collision_rate`, `offroad_rate`, `num_goals_reached`, +`avg_distance_per_infraction`, etc. The HTML viewer (`pufferlib/mining_viz.py`) +reads the bundle and replays it in-browser on a top-down canvas, with optional +overlays for the agent's observed FOV, partner circle, goal route, and waypoint +markers. 
+ +## `mine.score_threshold` selection + +The save rule is "write replay if and only if `episode_return < score_threshold`". + +- `--mine.score-threshold 1e9` captures every episode (any real return is + less than 1e9). +- `--mine.score-threshold 0` captures only negative-return ("true failure") + episodes. +- Default `-inf` captures **nothing** — useful only if you want `episodes.csv` + metrics without the bundle overhead. + +`episodes.csv` always contains all N episodes' metadata regardless of +threshold; only the bundle save + HTML render is gated. + +## `--vec.backend Serial` + +Mining must use `--vec.backend Serial`. The drive.ini default +`Multiprocessing` backend forks workers post-torch-import, which deadlocks on +CUDA in the child process. Symptom is a parent process at 100% CPU with no +visible progress and no `[mine_failures] target episodes=...` print. + +`Serial` keeps the env in the same process as the policy. Mining is a single +env / single rollout workflow, so the throughput cost is negligible. + +## Tuning the rollout config + +The mining env config comes from drive.ini's `[mine]` section plus per-CLI +overrides: + +```bash +# Larger output (slower): +--mine.num-episodes 500 + +# Replay mode (drive recorded nuPlan / Waymo scenarios): +--env.simulation-mode replay \ +--env.control-mode control_sdc_only \ +--env.map-dir /path/to/recorded_bins \ +--env.init-steps 10 \ +--env.scenario-length 200 + +# Looser goal radius (default 2 m, up to 12 m under reward randomization): +--env.goal-radius 6 + +# Closer-spaced goals: +--env.min-waypoint-spacing 10 \ +--env.max-waypoint-spacing 15 +``` + +## Loading checkpoints with non-default architecture + +`mine_failures` does not read the sibling `config.yaml` next to +`load_model_path` (only `pufferl.train` does). If the checkpoint was trained +with non-default `policy.*` or `rnn.*` dimensions (e.g. 
`input_size=128`, +`backbone_num_layers=4`), pass them on the CLI to match the saved state dict: + +```bash +--policy.input-size 128 \ +--policy.actor-hidden-size 512 \ +--policy.actor-num-layers 0 \ +--policy.backbone-hidden-size 512 \ +--policy.backbone-num-layers 4 \ +--policy.critic-hidden-size 512 \ +--policy.critic-num-layers 0 \ +--policy.encoder-gigaflow True \ +--policy.split-network False \ +--rnn.hidden-size 512 \ +--rnn.input-size 512 +``` + +You can read the right values out of the checkpoint's sibling `config.yaml` +(under `policy:` and `rnn:`) and pass them through. The error if you forget +is a wall of `size mismatch for ...` lines from `policy.load_state_dict`. + +## On the cluster + +Mining is GPU-bound on the policy forward pass but memory-light compared to +training (single env, no rollout buffer, no PPO update). 48 GB RAM and a +60-minute time limit are plenty for 100 episodes. The same `submit_cluster.py` +flow as training works — override `--main` to invoke `mine_failures`: + +```bash +python3 scripts/submit_cluster.py \ + --save_dir /scratch/$USER/runs \ + --prefix mine \ + --compute_config scripts/cluster_configs/nyu_greene.yaml \ + --account --partition --time 60 \ + --mem 48gb --cpus 8 \ + --container \ + --main "-m pufferlib.pufferl mine_failures puffer_drive" \ + --args \ + load_model_path= \ + mine.output_dir=/scratch/$USER/failure_mining/out \ + mine.num_episodes=100 \ + mine.score_threshold=1e9 \ + vec.backend=Serial +``` + +See [`docs/cluster_training.md`](cluster_training.md) for one-time setup of +the login-side submitit (`python3 -m pip install --user submitit pyyaml +cloudpickle`). + +Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing. + +## Viewer features (`mining_viz.py`) + +The per-episode HTML viewer supports: + +- Frame scrubber + play/pause + speed control. +- Toggle observation overlay (FOV rectangle, partner circle, observed-entity + highlights, goal route, waypoint markers). 
+- Toggle road segment / road edge / lane line rendering. +- Map background (CARLA / nuPlan / Waymo road graph from the bundle's + embedded `simulation_mode`). + +The index (`renders/index.html`) is a sortable table linking to each per-episode +HTML, with the metadata columns from `episodes.csv` (failure metrics, scenario +ID, map name). From 0ca867b58bba7bab8d42964b5eb47d87adf09e37 Mon Sep 17 00:00:00 2001 From: Eugene Vinitsky Date: Wed, 20 May 2026 11:39:35 -0500 Subject: [PATCH 20/20] docs/mining: 'source the venv' instead of 'pip install --user submitit' MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Mirror the cluster_training.md change — setup_container.sh install already lands submitit in the venv, and sourcing the venv on login makes it importable. No --user bootstrap needed. Co-Authored-By: Claude Opus 4.7 --- docs/mining.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/mining.md b/docs/mining.md index 1c11b9e60a..41b1b22caa 100644 --- a/docs/mining.md +++ b/docs/mining.md @@ -145,9 +145,9 @@ python3 scripts/submit_cluster.py \ vec.backend=Serial ``` -See [`docs/cluster_training.md`](cluster_training.md) for one-time setup of -the login-side submitit (`python3 -m pip install --user submitit pyyaml -cloudpickle`). +Source the venv before invoking `submit_cluster.py` (`source +/scratch/$USER/venvs/pufferdrive/bin/activate`) — see +[`docs/cluster_training.md`](cluster_training.md) for the rationale. Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing.
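For the "Loading checkpoints with non-default architecture" section of `docs/mining.md` above, typing the `--policy.*` / `--rnn.*` overrides by hand is error-prone. The sketch below shows one way to derive them from an already-parsed `config.yaml`. It assumes the file has flat `policy:` and `rnn:` mappings and that CLI flag names are the config keys with underscores turned into hyphens, as in the flags listed in that section; the helper `config_to_cli_flags` is hypothetical and not part of `pufferl`.

```python
def config_to_cli_flags(config, sections=("policy", "rnn")):
    """Turn {"policy": {"input_size": 128}} into ["--policy.input-size", "128"].

    Assumption: each section is a flat mapping of scalar values; nested
    mappings are not handled by this sketch.
    """
    flags = []
    for section in sections:
        for key, value in config.get(section, {}).items():
            flags.append(f"--{section}.{key.replace('_', '-')}")
            flags.append(str(value))
    return flags


# In practice this dict would come from the checkpoint's sibling config.yaml,
# e.g. via yaml.safe_load(...) if PyYAML is installed. The values below are
# copied from the example flags in the section this sketch refers to.
example_config = {
    "policy": {"input_size": 128, "backbone_num_layers": 4, "encoder_gigaflow": True},
    "rnn": {"hidden_size": 512, "input_size": 512},
}

flags = config_to_cli_flags(example_config)
print(" ".join(flags))
```

The printed string can be pasted after the `puffer mine_failures puffer_drive ...` command. If `load_state_dict` still reports `size mismatch` lines, compare the generated flags against the full `policy:` and `rnn:` blocks in `config.yaml` for keys this sketch did not cover.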