20 commits
b992f62
docs: cluster training + mining operational guides
May 20, 2026
297e85e
submit_cluster: wrap submitit launcher in singularity when --container
May 20, 2026
bbbc1b1
docs: rewrite to describe current state, not discovery path
May 20, 2026
50f97d9
docs: expand TORCH_CUDA_ARCH_LIST explanation
May 20, 2026
dae1768
docs: explain why CPU rebuild works for CUDA code
May 20, 2026
33d560e
docs: trim CPU rebuild section — drop the cross-compiler explanation
May 20, 2026
f864262
docs: drop mining doc from this PR (moved to a separate PR)
May 20, 2026
b793e60
docs: cluster_training tweaks (TL;DR rewrite, formatting)
May 20, 2026
1560772
docs+gitignore: pre-commit fixes + drop sphinx noise
May 20, 2026
a016d9a
submit_cluster: compress the launcher-wrap comment
May 20, 2026
2855f0a
submit_cluster: clarify what the wrap actually solves
May 20, 2026
80a726b
cluster: drop the login-side submitit bootstrap; revert launcher wrap
May 20, 2026
8c847ae
setup_container: install miniforge3 on /scratch instead of in the ove…
May 20, 2026
450c3ca
setup_container: rebuild venv if its python symlink is stale
May 20, 2026
1c0fc2a
setup_container: detect stale venv by symlink TARGET, not existence
May 20, 2026
aea9540
setup_container: pin miniforge3 to a Python 3.12 release
May 20, 2026
1750691
gitignore: restore docs/_build/ — harmless defensive ignore for sphinx
May 20, 2026
e36b5d3
gitignore: drop failure_mining/ from this PR (moved to a separate PR)
May 20, 2026
93be35a
docs: failure mining operational guide
May 20, 2026
0ca867b
docs/mining: 'source the venv' instead of 'pip install --user submitit'
May 20, 2026
5 changes: 1 addition & 4 deletions .gitignore
@@ -81,7 +81,7 @@ instance/
# Scrapy stuff:
.scrapy

# Sphinx documentation
# Sphinx documentation (if sphinx is later added; docs/*.md is tracked)
docs/_build/

# PyBuilder
@@ -211,9 +211,6 @@ pufferlib/resources/drive/output*.gif
# External local clones
external/

# Generated docs
docs/

# Claude config
.claude/
CLAUDE.local.md
4 changes: 4 additions & 0 deletions README.md
@@ -71,6 +71,8 @@ python scripts/submit_cluster.py \

`scripts/cluster_configs/nyu_greene.yaml` defines `account`, `gpus`, `cpus`, `mem`, `time` — edit `account` to your allocation before first submit. `--container` makes `submit_cluster.py` wrap the job command in `singularity exec --nv --overlay $OVERLAY_PATH:ro $IMAGE_PATH ...`.

**For a full guide, see [`docs/cluster_training.md`](docs/cluster_training.md).**

## Data

Place binaries under `pufferlib/resources/drive/binaries/`.
@@ -146,6 +148,8 @@ renders/index.html # sortable index of all episodes

Open `renders/index.html` in a browser to triage. The index page filters by "failures only" / "replays only" and sorts by any metric column. Each row links to the per-episode viewer with the scene's full 2D animation.

For the deeper guide — viewer features, `score_threshold` semantics, the required `--vec.backend Serial` flag, loading checkpoints with non-default `policy.*` dims, and the on-cluster `submit_cluster.py` pattern — see [`docs/mining.md`](docs/mining.md).

## Key Configuration (`pufferlib/config/ocean/drive.ini`)

### `[env]` — Simulation
180 changes: 180 additions & 0 deletions docs/cluster_training.md
@@ -0,0 +1,180 @@
# Cluster training — operational guide

How to run PufferDrive training on a SLURM cluster. This guide is written with the NYU cluster in mind, but most of it should hold for any SLURM cluster.

## A quick overview of the setup and launch process

```bash
# One-time per cluster: create the singularity overlay and install deps
# into the venv (this also installs submitit and the other submission
# deps as part of the project's pyproject.toml).
./scripts/setup_container.sh create-overlay
sbatch --account=<acct> --gres=gpu:1 --cpus-per-task=8 --mem=32gb --time=60 \
--wrap "./scripts/setup_container.sh install"

# If code changes, or we haven't built before, rebuild the C code in the container.
# $LOGDIR is a placeholder: point it at an existing directory, since sbatch does not create it.
sbatch --account=<acct> --partition=cpu_short --cpus-per-task=8 --mem=16gb --time=20 \
--chdir=$PWD -o $LOGDIR/rebuild_%j.log \
--wrap "./scripts/setup_container.sh rebuild"

# Training: source the venv on the login node, then submit_cluster.py
# with --container --heartbeat. --main defaults to RL training; override
# it to launch other modes (e.g. mining, eval).
source /scratch/$USER/venvs/pufferdrive/bin/activate
python scripts/submit_cluster.py \
--save_dir /scratch/$USER/runs \
--compute_config scripts/cluster_configs/nyu_greene.yaml \
--program_config scripts/cluster_configs/train_base.yaml \
--container --heartbeat \
--account <acct> --partition <gpu-partition> --time 2880 \
--args train.checkpoint_interval=250 env.simulation_mode=gigaflow # use this to override config args
```

## Container model

PufferDrive on Greene runs inside a singularity container. The container provides
a modern glibc + CUDA toolkit; the project's Python environment lives in a venv
on `/scratch`, so installs are not slowed down by building a venv inside the container.

The container is invoked with a **read-only** overlay mount for the miniforge3
base interpreter, plus the on-disk venv for project packages. For example:
```bash
singularity exec --nv \
--overlay /scratch/$USER/images/PufferDrive/overlay-15GB-500K.ext3:ro \
/share/apps/images/cuda12.8.1-cudnn9.8.0-ubuntu24.04.2.sif \
bash -c '
source /scratch/$USER/venvs/pufferdrive/bin/activate
export PYTHONNOUSERSITE=1
cd /scratch/$USER/code/PufferDrive
<your command>
'
```

## Submitting training — `submit_cluster.py`

`scripts/submit_cluster.py` is the canonical submission path. It combines:
- a `compute_config` YAML (SLURM settings)
- a `program_config` YAML (pufferl training args)
- `--args` CLI overrides

It also:
- wraps the inner train command in `singularity exec` when `--container` is set
- optionally injects the GPU heartbeat when `--heartbeat` is set. WARNING: the heartbeat exists specifically to keep our jobs from being killed on the torch cluster. No one else should use it.

Finally, it performs code isolation (symlinks the
top-level entries and hard-copies `pufferlib/` into a per-run sandbox) and
hands the package to `submitit` for `sbatch` submission.
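
The code-isolation step can be pictured with a short sketch. This is not the code in `submit_cluster.py`; the function name `make_sandbox` and the directory names in the demo are hypothetical, and "hard-copy" is interpreted here as a full recursive copy.

```python
import os
import shutil
import tempfile
from pathlib import Path


def make_sandbox(project_root: Path, sandbox: Path, copy_dir: str = "pufferlib") -> Path:
    """Symlink every top-level entry except `copy_dir`, which is copied for real."""
    sandbox.mkdir(parents=True, exist_ok=True)
    for entry in project_root.iterdir():
        target = sandbox / entry.name
        if entry.name == copy_dir:
            # Real copy: later edits in the working tree cannot affect this run.
            shutil.copytree(entry, target)
        else:
            # Symlink: cheap, and large assets are not duplicated per run.
            os.symlink(entry.resolve(), target, target_is_directory=entry.is_dir())
    return sandbox


# Demonstration on a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "repo"
    (root / "pufferlib").mkdir(parents=True)
    (root / "pufferlib" / "__init__.py").write_text("")
    (root / "scripts").mkdir()
    box = make_sandbox(root, Path(tmp) / "run_0001")
    copied_is_real = not (box / "pufferlib").is_symlink()
    scripts_is_link = (box / "scripts").is_symlink()
```

Copying only the package that a running job imports keeps the sandbox small while still protecting a queued job from edits made after submission.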

### Source the venv before invoking `submit_cluster.py`

`setup_container.sh install` puts submitit + its deps into the project
venv at `/scratch/$USER/venvs/pufferdrive/`. Sourcing the venv on the
login node makes that submitit importable and lines up `sys.executable`
with the same venv python that the compute node will run, so submitit's
serialization round-trips cleanly.

```bash
source /scratch/$USER/venvs/pufferdrive/bin/activate
python scripts/submit_cluster.py \
--save_dir /scratch/$USER/runs \
--prefix mytrain \
--compute_config scripts/cluster_configs/nyu_greene.yaml \
--program_config scripts/cluster_configs/train_base.yaml \
--account <acct> --partition <gpu-partition> --time 2880 \
--container \
--heartbeat \
--args \
train.total_timesteps=10000000000 \
train.checkpoint_interval=250
```

Key flags:

| Flag | Effect |
|---|---|
| `--container` | wraps the inner train command in `singularity exec --nv --overlay $OVERLAY:ro $IMAGE` |
| `--heartbeat` | wraps the train command in a brace group that backgrounds `python scripts/gpu_heartbeat.py`, which prevents the cluster from killing your job for low GPU usage |
| `--args key=value ...` | passes nested config keys (underscores converted to dashes) as `--key value` on the torchrun line; e.g. `env.simulation_mode=replay` becomes `--env.simulation-mode replay` |
| `--account` / `--partition` / `--time` | override `compute_config` SLURM settings |
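
As an illustration of the `--args` row, here is a minimal sketch of the key rewriting it describes. The helper names are hypothetical and this is not the parsing code in `submit_cluster.py`; it assumes only the key is rewritten and the value is passed through unchanged.

```python
def override_to_flag(override: str) -> list[str]:
    """Turn 'a.b_c=value' into ['--a.b-c', 'value']."""
    key, sep, value = override.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {override!r}")
    # Only the key is rewritten; the value is left untouched.
    return ["--" + key.replace("_", "-"), value]


def overrides_to_argv(overrides: list[str]) -> list[str]:
    """Flatten several key=value overrides into one argv fragment."""
    argv: list[str] = []
    for item in overrides:
        argv.extend(override_to_flag(item))
    return argv


argv = overrides_to_argv(["env.simulation_mode=replay", "train.checkpoint_interval=250"])
print(argv)
```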

### GPU heartbeat — required for long runs

`--heartbeat` is not optional for jobs over ~2 hours. Without it, the
cluster's idle-GPU reclaimer issues a `scancel` from `uid 0` (root) during
the first eval / checkpoint dip in GPU utilization.

`scripts/gpu_heartbeat.py` monitors `nvidia-smi` and runs short matmul bursts
when utilization drops below 65%, so the cluster always sees the GPU as
active. It cooperates with training and steps aside when training is busy.
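
The decision logic can be sketched as follows. This is an assumption-laden illustration, not the contents of `scripts/gpu_heartbeat.py`: the function names are hypothetical, the matmul burst itself is omitted, and only the 65% threshold comes from the text above. The `nvidia-smi` query flags used are standard ones.

```python
import subprocess

UTIL_THRESHOLD = 65  # percent, the figure quoted above


def parse_utilization(nvidia_smi_output: str) -> list[int]:
    """Parse one integer utilization value per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_utilization() -> list[int]:
    """Ask nvidia-smi for per-GPU utilization (needs an NVIDIA driver; not called below)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_utilization(out)


def needs_burst(utilization: list[int], threshold: int = UTIL_THRESHOLD) -> list[bool]:
    """True for each GPU idle enough to warrant a short burst of work."""
    return [u < threshold for u in utilization]


# Offline example with canned output for two GPUs: one busy, one idle.
decisions = needs_burst(parse_utilization("97\n12\n"))
```

A real heartbeat would call `query_utilization()` in a loop and launch a brief GPU workload only for the devices flagged `True`.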

### Environment knobs the container path sets

When `--container` is on, the inner bash command activates the venv and exports
these env vars before `cd $PROJECT_ROOT && <train>`:

```bash
source /scratch/$USER/venvs/pufferdrive/bin/activate
export PYTHONNOUSERSITE=1
export XDG_CACHE_HOME=/scratch/$USER/cache
export WANDB_CACHE_DIR=/scratch/$USER/wandb_cache
export WANDB_CONFIG_DIR=/scratch/$USER/wandb_config
export WANDB_DATA_DIR=/scratch/$USER/wandb_data
export WANDB_DIR=/scratch/$USER/wandb_data
```

## CPU rebuild path

GPU partitions are routinely saturated by training jobs. `setup_container.sh
rebuild` doesn't need a GPU — submit to a CPU partition for fast turnaround:

```bash
sbatch --account=<general-account> --partition=cpu_short \
--cpus-per-task=8 --mem=16gb --time=20 \
--chdir=$PWD \
-o /scratch/$USER/rebuild_logs/rebuild_%j.log \
--wrap "./scripts/setup_container.sh rebuild"
```

`--chdir=$PWD` is required because the wrapped command uses the relative path `./scripts/`. The rebuild takes ~40s.

### Common pitfalls

- **`ncclCommShrink` undefined symbol** at `from torch._C import *`. Greene's
cuda12.8.1 sif ships `libnccl 2.25.1` in `/usr/lib`, but torch ≥ 2.10 calls
`ncclCommShrink` from NCCL ≥ 2.27.5. torch's own NCCL 2.27.5 sits in
`site-packages/nvidia/nccl/lib/` and needs to win the loader search.
`setup_container.sh install`/`rebuild` patches `/ext3/env.sh` to prepend that
dir to `LD_LIBRARY_PATH`; existing overlays from before that patch need the
same line appended to `/ext3/env.sh`.
- **`-lomp5` link errors on Linux** with conda-forge openmp. The default link flag
targets older Intel OpenMP packaging; override it with `OMP_LIB="-L$prefix/lib -lomp"`, which `setup.py` honors.
- **`du /ext3` undercounts** when the overlay has cruft outside `upper/ext3/`
(e.g. failed pip installs that wrote to `/usr/local/lib/...` end up in
`upper/usr/local/` and aren't visible to apptainer's view). Use
`debugfs -R "ls /upper" overlay.ext3` from a login node to inspect.

### `TORCH_CUDA_ARCH_LIST`: a quick warning that won't generally be an issue

PufferDrive's C extension contains CUDA kernels. When `setup.py build_ext`
compiles them, `nvcc` emits machine code for each architecture listed in
the `TORCH_CUDA_ARCH_LIST` env var (and only those); the result is a fat
binary containing one variant per arch. If the env var is unset, the build
defaults to whatever GPU was visible to the compiler at build time, which is
often just one architecture.

On Greene, you frequently don't get to
choose which GPU you land on. `_general` accounts queue across L40S
(sm_89), H100 (sm_90), and H200 (sm_90); `_tandon_*` partitions add A100
(sm_80). If the `_C.so` was built against only sm_80 and your job lands on
an H100, every CUDA call into the extension dies with
`no kernel image is available for execution on the device`.

Setting `TORCH_CUDA_ARCH_LIST="8.0;8.9;9.0"` covers A100 / L40S / H100 and H200
in one fat binary. The build is a bit slower (three variants instead of
one) and the `.so` is a bit larger, but the resulting binary runs on every
GPU Greene routes you to.
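
To check an arch list against a GPU before submitting, a sketch like the following may help. The function names are hypothetical. It assumes plain `major.minor` entries (optionally suffixed with `+PTX`), ignores PTX JIT fallback, and relies on CUDA's binary-compatibility rule that code built for compute capability X.y runs on devices X.z with z ≥ y.

```python
def parse_arch_list(arch_list: str) -> list[tuple[int, int]]:
    """Parse '8.0;8.9;9.0' (optionally with '+PTX' suffixes) into (major, minor) pairs."""
    archs = []
    for item in arch_list.replace(" ", ";").split(";"):
        item = item.strip()
        if not item:
            continue
        major, minor = item.split("+")[0].split(".")
        archs.append((int(major), int(minor)))
    return archs


def has_native_kernel(device_cc: tuple[int, int], arch_list: str) -> bool:
    """True if some listed arch has the device's major version and a minor <= the device's."""
    major, minor = device_cc
    return any(
        a_major == major and a_minor <= minor
        for a_major, a_minor in parse_arch_list(arch_list)
    )


h100_with_sm80_only = has_native_kernel((9, 0), "8.0")
l40s_with_full_list = has_native_kernel((8, 9), "8.0;8.9;9.0")
```

On a node with a GPU, the `(major, minor)` pair can be obtained from `torch.cuda.get_device_capability()`.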

`setup_container.sh rebuild` exports this automatically for the build step,
so a fresh rebuild on the cluster is already multi-arch. The env var only
matters when you build the C extension **outside** the rebuild wrapper —
e.g. an interactive `python setup.py build_ext --inplace --force` inside a
hand-launched singularity exec. Adding the export to your shell profile
(or sourcing it before any manual build) saves you from hitting the "no
kernel image" error after a quick fix-and-rebuild loop.
167 changes: 167 additions & 0 deletions docs/mining.md
@@ -0,0 +1,167 @@
# Failure mining workflow

How to roll a trained policy out, capture compact replays, and produce a
browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures`
and `pufferlib/mining_viz.py`.

## TL;DR

```bash
# Roll the policy out for 100 episodes, save compact replays for episodes
# whose episode_return falls below the threshold, render HTML for each +
# a sortable index.
puffer mine_failures puffer_drive \
--load-model-path /path/to/model_011000.pt \
--mine.output-dir ./failure_mining/baseline_011000 \
--mine.num-episodes 100 \
--mine.score-threshold 1e9 \
--vec.backend Serial
```

Outputs:

```
./failure_mining/baseline_011000/
replays/episode_NNNNNN.replay.zlib one per saved episode
renders/episode_NNNNNN.html per-replay viewer
renders/index.html sortable summary
episodes.csv all episodes, all metrics
```

Open the index in a browser:

```bash
open ./failure_mining/baseline_011000/renders/index.html  # macOS; use xdg-open on Linux
```

## What gets captured

A compact replay bundle is a pickled+zlib'd `schema_version=2` dict containing
per-step agent state, traffic state, and observation arrays for a single
episode. Bundles are produced C-side when `capture_compact_replay=True` is
passed to `Drive(...)`. `mine_failures` sets this automatically.

Each saved bundle is paired with a metadata row in `episodes.csv` including
`episode_return`, `collision_rate`, `offroad_rate`, `num_goals_reached`,
`avg_distance_per_infraction`, etc. The HTML viewer (`pufferlib/mining_viz.py`)
reads the bundle and replays it in-browser on a top-down canvas, with optional
overlays for the agent's observed FOV, partner circle, goal route, and waypoint
markers.
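
For ad-hoc inspection outside the viewer, a bundle described as "pickled+zlib'd" could be read with a sketch like this. It assumes the dict was pickled first and then zlib-compressed; the `steps` field in the demo is a hypothetical stand-in, since the real schema is not listed here.

```python
import pickle
import zlib


def dump_bundle(bundle: dict) -> bytes:
    """Pickle, then zlib-compress (assumed order)."""
    return zlib.compress(pickle.dumps(bundle))


def load_bundle(blob: bytes) -> dict:
    """Inverse of dump_bundle. Only unpickle files you produced yourself:
    pickle can execute arbitrary code on load."""
    return pickle.loads(zlib.decompress(blob))


# Round trip on a stand-in dict; 'steps' is a hypothetical field name.
example = {"schema_version": 2, "steps": [{"x": 0.0, "y": 1.5}]}
restored = load_bundle(dump_bundle(example))
```

For a file on disk, pass `Path(...).read_bytes()` to `load_bundle` and check `schema_version` before relying on any other key.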

## `mine.score_threshold` selection

The save rule is "write replay if and only if `episode_return < score_threshold`".

- `--mine.score-threshold 1e9` captures every episode (any real return is
less than 1e9).
- `--mine.score-threshold 0` captures only negative-return ("true failure")
episodes.
- Default `-inf` captures **nothing** — useful only if you want `episodes.csv`
metrics without the bundle overhead.

`episodes.csv` always contains all N episodes' metadata regardless of
threshold; only the bundle save + HTML render is gated.
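
The save rule is small enough to state as code. This is a sketch of the rule as written above, with a hypothetical function name, not the `mine_failures` implementation.

```python
import math


def should_save(episode_return: float, score_threshold: float = -math.inf) -> bool:
    """Save the replay if and only if the return is strictly below the threshold."""
    return episode_return < score_threshold


returns = [-3.2, 0.0, 41.7]
saved_all = [r for r in returns if should_save(r, 1e9)]
saved_failures = [r for r in returns if should_save(r, 0)]
saved_default = [r for r in returns if should_save(r)]
```

Because the comparison is strict, an episode with a return of exactly `0` is not saved under `--mine.score-threshold 0`.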

## `--vec.backend Serial`

Mining must use `--vec.backend Serial`. The drive.ini default
`Multiprocessing` backend forks workers post-torch-import, which deadlocks on
CUDA in the child process. The symptom is a parent process at 100% CPU with no
visible progress and no `[mine_failures] target episodes=...` print.

`Serial` keeps the env in the same process as the policy. Mining is a single
env / single rollout workflow, so the throughput cost is negligible.

## Tuning the rollout config

The mining env config comes from drive.ini's `[mine]` section plus CLI
overrides:

```bash
# Larger output (slower):
--mine.num-episodes 500

# Replay mode (drive recorded nuPlan / Waymo scenarios):
--env.simulation-mode replay \
--env.control-mode control_sdc_only \
--env.map-dir /path/to/recorded_bins \
--env.init-steps 10 \
--env.scenario-length 200

# Looser goal radius (default 2 m, up to 12 m under reward randomization):
--env.goal-radius 6

# Closer-spaced goals:
--env.min-waypoint-spacing 10 \
--env.max-waypoint-spacing 15
```

## Loading checkpoints with non-default architecture

`mine_failures` does not read the sibling `config.yaml` next to
`load_model_path` (only `pufferl.train` does). If the checkpoint was trained
with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`,
`backbone_num_layers=4`), pass them on the CLI to match the saved state dict:

```bash
--policy.input-size 128 \
--policy.actor-hidden-size 512 \
--policy.actor-num-layers 0 \
--policy.backbone-hidden-size 512 \
--policy.backbone-num-layers 4 \
--policy.critic-hidden-size 512 \
--policy.critic-num-layers 0 \
--policy.encoder-gigaflow True \
--policy.split-network False \
--rnn.hidden-size 512 \
--rnn.input-size 512
```

You can read the right values out of the checkpoint's sibling `config.yaml`
(under `policy:` and `rnn:`) and pass them through. The error if you forget
is a wall of `size mismatch for ...` lines from `policy.load_state_dict`.
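
Copying those values by hand is error-prone, so a small helper can generate the flags. The sketch below assumes `config.yaml` has already been loaded into a plain dict (for example with a YAML parser) and has top-level `policy` and `rnn` mappings; the function name and the sample values are illustrative.

```python
def section_to_flags(section: str, values: dict) -> list[str]:
    """Render one config section as ['--section.key-name', 'value', ...]."""
    flags = []
    for key, value in values.items():
        flags += [f"--{section}.{key.replace('_', '-')}", str(value)]
    return flags


# Stand-in for the dict loaded from the checkpoint's sibling config.yaml.
config = {
    "policy": {"input_size": 128, "backbone_num_layers": 4, "split_network": False},
    "rnn": {"hidden_size": 512},
}

argv = []
for section in ("policy", "rnn"):
    argv += section_to_flags(section, config.get(section, {}))
print(" ".join(argv))
```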

## On the cluster

Mining is GPU-bound on the policy forward pass but memory-light compared to
training (single env, no rollout buffer, no PPO update). 48 GB RAM and a
60-minute time limit are plenty for 100 episodes. The same `submit_cluster.py`
flow as training works — override `--main` to invoke `mine_failures`:

```bash
python3 scripts/submit_cluster.py \
--save_dir /scratch/$USER/runs \
--prefix mine \
--compute_config scripts/cluster_configs/nyu_greene.yaml \
--account <acct> --partition <gpu-partition> --time 60 \
--mem 48gb --cpus 8 \
--container \
--main "-m pufferlib.pufferl mine_failures puffer_drive" \
--args \
load_model_path=<path-to-ckpt> \
mine.output_dir=/scratch/$USER/failure_mining/out \
mine.num_episodes=100 \
mine.score_threshold=1e9 \
vec.backend=Serial
```

Source the venv before invoking `submit_cluster.py` (`source
/scratch/$USER/venvs/pufferdrive/bin/activate`) — see
[`docs/cluster_training.md`](cluster_training.md) for the rationale.

Outputs land on `/scratch`; pull them down with `rsync` for in-browser viewing.

## Viewer features (`mining_viz.py`)

The per-episode HTML viewer supports:

- Frame scrubber + play/pause + speed control.
- Toggle observation overlay (FOV rectangle, partner circle, observed-entity
highlights, goal route, waypoint markers).
- Toggle road segment / road edge / lane line rendering.
- Map background (CARLA / nuPlan / Waymo road graph from the bundle's
embedded `simulation_mode`).

The index (`renders/index.html`) is a sortable table linking to each per-episode
HTML, with the metadata columns from `episodes.csv` (failure metrics, scenario
ID, map name).