(WIP: do not review or read) Docs: failure mining operational guide by eugenevinitsky · Pull Request #434 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-05-20T14:52:38Z

New `docs/mining.md` covering the failure mining workflow plus a short README pointer.

Chained on top of #433 (cluster training docs) — that PR un-ignores `docs/` in `.gitignore` and adds the `submit_cluster.py` singularity-wrap patch that the on-cluster mining example here depends on. After #433 merges, this PR rebases onto `emerge/temp_training` automatically.

What's in

`docs/mining.md` — `mine_failures` workflow, `score_threshold` semantics, the required `--vec.backend Serial` flag (the default Multiprocessing backend forks workers post-torch-import and deadlocks on CUDA), loading checkpoints with non-default `policy.` / `rnn.` dims (mine_failures doesn't auto-merge the sibling `config.yaml` that `train()` does), the on-cluster `submit_cluster.py` pattern with `--main` override, and viewer features.
`README.md` — short pointer paragraph at the end of the existing "Failure mining" section.

🤖 Generated with Claude Code

Add two new docs and link them from README.md: - docs/cluster_training.md — sbatch template (incl. GPU heartbeat, required for jobs > ~2h or the cluster's idle-reclaimer will scancel them), CPU rebuild path (cpu_short partition; no GPU needed for build_ext), account/partition strategy (QOSGrpGRES vs QOSMaxGRESPerUser, _tandon_priority vs _general tiers, when to race partitions), replay-mode memory sizing (vec.num_envs lever, [eval.*] suite cost at first eval), and submit_cluster.py failure modes (login-node python lacks pip; submitit's srun launcher inherits the in-container venv python path). - docs/mining.md — mine_failures workflow, the score_threshold default-captures-nothing gotcha (docstring is misleading; -inf means no replay is saved), the pufferlib.vector Multiprocessing CUDA-after-fork hang and --vec.backend Serial workaround, the shape-mismatch gotcha when load_model_path checkpoints have non-default policy.* / rnn.* dimensions (mine_failures doesn't do the sibling config.yaml auto-merge that train() does), and the on-cluster sbatch pattern. .gitignore: un-ignore docs/ (was a blanket rule that prevented checking in markdown docs alongside the existing eval_unification.md spec); add failure_mining/ which is large local-only mining output. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two coupled changes that make submit_cluster.py work end-to-end on clusters where the system python differs across login and compute nodes (Greene: login 3.12, compute 3.9) and the only consistent python lives inside the singularity overlay. 1. Submission side: after constructing AutoExecutor, when --container is set, override executor._executor.python so submitit's outer launcher is invoked as singularity exec --nv --overlay :ro $IMAGE $VENV/bin/python That makes the compute-side srun command resolve the launcher python inside the container (where the venv's symlink to /ext3/miniforge3/bin/python3 is valid) instead of needing a cross-node-consistent system python with submitit installed. 2. launch_training side: when the function lands on the compute node it's now already inside singularity (from (1)), so skip the second singularity exec wrap and just run inner_cmd via bash -c. The /.singularity.d/Singularity marker file distinguishes the cases so direct-sbatch callers (not yet inside a container) still get the wrap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Drop 'we figured out X' framing per code review feedback. The submit_cluster.py path now works end-to-end (after the patch in the previous commit that wraps submitit's outer launcher in singularity exec), so the docs describe that as the canonical flow rather than as a workaround. Direct-sbatch is no longer documented as a fallback — submit_cluster.py is the single path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the cryptic one-line 'you may want to set' with a self-contained explanation: what the env var does (per-arch fat binary), why it matters on a heterogeneous cluster ('no kernel image' on the wrong GPU), what the recommended value covers (A100/L40S/H100/H200), and when it actually matters in practice (interactive builds outside setup_container.sh rebuild, which already exports it). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous CPU rebuild section just said 'doesn't need a GPU' without explaining the apparent contradiction (we're compiling CUDA, no?). Spell out that nvcc is a cross-compiler: it emits PTX/SASS for the target arches in TORCH_CUDA_ARCH_LIST without needing matching hardware, and the CUDA toolkit lives in the singularity image so any node that can mount the image can run the build. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Just say 'doesn't need a GPU' without the nvcc / fat binary / cross-compiler detour. Readers who want the why can find TORCH_CUDA_ARCH_LIST above. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This PR is just cluster training. The mining workflow doc and its README pointer move to a separate PR so the two can be reviewed and landed independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copilot

Pull request overview

Adds an operational guide for the mine_failures workflow (capturing compact replays + generating an HTML triage UI) and links it from the existing README “Failure mining” section.

Changes:

Add docs/mining.md with a failure-mining walkthrough, CLI examples, cluster submission pattern, and viewer notes.
Add a README pointer linking to the deeper mining guide.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
README.md	Adds a pointer to the new failure-mining operational guide.
docs/mining.md	New end-to-end guide for `mine_failures`, including threshold semantics, cluster usage, and viewer documentation.

Comments suppressed due to low confidence (1)

docs/mining.md:103

pufferl.train is referenced as if it were a module; the training entrypoint here is puffer train ... / python -m pufferlib.pufferl train ... (and the function is pufferlib.pufferl.train). Consider updating the wording to the correct command/module path.

`mine_failures` does not read the sibling `config.yaml` next to
`load_model_path` (only `pufferl.train` does). If the checkpoint was trained
with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures`
+and `pufferlib/mining_viz.py`.


+Open the index in a browser:
+
+```bash
+open ./failure_mining/baseline_011000/renders/index.html
+```


+per-step agent state, traffic state, and observation arrays for a single
+episode. Bundles are produced C-side when `capture_compact_replay=True` is


+reads the bundle and replays it in-browser on a top-down canvas, with optional
+overlays for the agent's observed FOV, partner circle, goal route, and waypoint
+markers.


+- Frame scrubber + play/pause + speed control.
+- Toggle observation overlay (FOV rectangle, partner circle, observed-entity
+  highlights, goal route, waypoint markers).
+- Toggle road segment / road edge / lane line rendering.
+- Map background (CARLA / nuPlan / Waymo road graph from the bundle's
+  embedded `simulation_mode`).


cluster_training.md: trim trailing whitespace on two lines and add a final newline (pre-commit hooks were rejecting the diff). .gitignore: drop the sphinx-themed entries. This repo doesn't use sphinx (no docs/conf.py, no Makefile, no mkdocs.yml — docs are just plain Markdown), so 'docs/_build/' and the misleading 'Generated docs (sphinx build output only)' comment were vestigial cookiecutter boilerplate. The earlier '# docs/' line is gone with them. README.md: short pointer paragraph added to the existing HPC section linking to docs/cluster_training.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

11 lines → 6. Kept the what (wrap launcher in singularity exec) + why (sys.executable is either version-mismatched host python or a dangling venv symlink) + pointer to the matched check in launch_training. Dropped the enumerated detail; future readers who need it can trace the code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Previous comment conflated two scenarios — the version-mismatched host python (real, what we hit) and the venv-symlink-dangling case (a hypothetical alternative setup we don't use). Just describe the real problem: login-node /usr/bin/python3 is 3.12, compute-node is 3.9, so ~/.local installs for 3.12 are invisible on the compute side. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Per reviewer feedback (yw4142): 1. setup_container.sh create_overlay: remove the stray mv "${TEMPLATE_NAME%.gz}" overlay.ext3 — it renamed the just- extracted file to a name that doesn't match what OVERLAY_PATH defaults to, so every later step looked for the overlay at the wrong path. Drop the line; overlay stays at its original name. 2. submit_cluster.py: revert the executor._executor.python override and the /.singularity.d/Singularity sentinel check in launch_training. They were working around 'submitit not on the compute-side python', but if you source the project venv on the login node, sys.executable becomes the venv python (same path the compute node will run) and submitit's serialization round- trips without needing any wrap. Back to the original pattern: launcher python is the venv python, launch_training wraps the inner train command in singularity exec for CUDA libs. 3. cluster_training.md: replace the pip install --user submitit pyyaml cloudpickle bootstrap with 'source the venv'. setup_container.sh install already lands submitit in the venv via the project's pyproject.toml, and sourcing the venv on login makes it importable. The 'two python installations' section comes out entirely — there's just one python (the venv's). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

…rlay Move the base python from inside the overlay (/ext3/miniforge3/) onto /scratch (/scratch/$USER/miniforge3/). The venv's bin/python symlink now points at /scratch — a path that resolves on every node, inside or outside singularity, so 'source venv/activate && python ...' works on the login node without needing to enter the container. Changes: - New MINIFORGE3_DIR variable (default /scratch/$USER/miniforge3) and ensure_miniforge3() helper that runs the conda-forge installer (self-contained shell script; no root, no singularity). - CONTAINER_PYTHON default now points at the /scratch miniforge3. - The 'install' dispatch runs ensure_miniforge3 outside the container first, then enters singularity (read-only overlay) for the uv + pip + build_ext steps that need nvcc. - run_in_container_writable + --fakeroot are gone; nothing in the python flow writes to the overlay anymore. The overlay stays mounted :ro for rare system-tool installs but isn't on any write path. For existing users: re-run setup_container.sh install — it'll detect miniforge3 missing on /scratch, install it, recreate the venv against the new base, and reinstall the project. The overlay's old miniforge3 becomes stale but harmless; you can rm it from the overlay later. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Existing installs have the venv's bin/python pointing at /ext3/miniforge3 from before we moved miniforge3 to /scratch. Re-running install would otherwise reuse the broken-symlink venv and skip recreating it. Detect the broken symlink (bash -e $VENV/bin/python) and rm -rf before recreating against $CONTAINER_PYTHON. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

The previous check (-e $VENV/bin/python) was wrong: install runs inside singularity, where the overlay is mounted, so the OLD venv's symlink into /ext3/miniforge3 IS valid — it just points at the wrong place relative to where MINIFORGE3_DIR now is. Use readlink -f to walk the chain and verify the resolved path is under $MINIFORGE3_DIR; rebuild if not. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

miniforge3 25.x ships Python 3.13, but torch's cu121 wheels are cp39..cp312 only — no cp313 — so the install fails with 'no solution found when resolving dependencies'. Pin the installer URL to 24.11.3-2 (Python 3.12) and add a version check in ensure_miniforge3 so an existing miniforge3 with the wrong python gets reinstalled. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Putting back the docs/_build/ line I dropped earlier alongside the blanket 'docs/' ignore. The blanket ignore was actively wrong (hid our checked-in markdown), but docs/_build/ is just the standard sphinx build output dir — costs nothing to keep, and protects future contributors from accidentally committing 'make html' output if sphinx is ever added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The cluster-docs PR shouldn't carry the mining-outputs ignore. Splitting to its own PR so the two are reviewable independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

New docs/mining.md covering the mine_failures workflow: - score_threshold semantics (default -inf saves nothing) - the required --vec.backend Serial flag (pufferl's default Multiprocessing backend forks workers post-torch-import and deadlocks on CUDA) - loading checkpoints with non-default policy.* / rnn.* dims (mine_failures doesn't auto-merge the sibling config.yaml that train() does) - on-cluster submit_cluster.py pattern with --main override - viewer features README.md gains a short pointer paragraph at the end of the existing Failure mining section. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mirror the cluster_training.md change — setup_container.sh install already lands submitit in the venv, and sourcing the venv on login makes it importable. No --user bootstrap needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Eugene Vinitsky and others added 8 commits May 20, 2026 07:06

docs: trim CPU rebuild section — drop the cross-compiler explanation

33d560e

Just say 'doesn't need a GPU' without the nvcc / fat binary / cross-compiler detour. Readers who want the why can find TORCH_CUDA_ARCH_LIST above. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

docs: drop mining doc from this PR (moved to a separate PR)

f864262

This PR is just cluster training. The mining workflow doc and its README pointer move to a separate PR so the two can be reviewed and landed independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

docs: cluster_training tweaks (TL;DR rewrite, formatting)

b793e60

Copilot AI review requested due to automatic review settings May 20, 2026 14:52

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

Copilot started reviewing on behalf of eugenevinitsky May 20, 2026 14:53 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

eugenevinitsky changed the title ~~Docs: failure mining operational guide~~ (WIP: do not review or read) Docs: failure mining operational guide May 20, 2026

eugenevinitsky force-pushed the ev/mining_docs branch from 4e8139a to 9deb766 Compare May 20, 2026 15:33

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

eugenevinitsky force-pushed the ev/mining_docs branch from 9deb766 to 8441077 Compare May 20, 2026 15:47

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

eugenevinitsky force-pushed the ev/mining_docs branch from e3abf94 to 2e8cc74 Compare May 20, 2026 16:01

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

eugenevinitsky force-pushed the ev/mining_docs branch from 2e8cc74 to 86d30e2 Compare May 20, 2026 16:39

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

eugenevinitsky force-pushed the ev/mining_docs branch from 96c75e3 to 8d043a5 Compare May 20, 2026 16:53

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

eugenevinitsky force-pushed the ev/mining_docs branch from fc91378 to 1fd6b0b Compare May 20, 2026 16:56

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

eugenevinitsky force-pushed the ev/mining_docs branch from 638e41a to 21f3fe8 Compare May 20, 2026 17:03

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

eugenevinitsky force-pushed the ev/mining_docs branch from 21f3fe8 to 8da7751 Compare May 20, 2026 17:11

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

eugenevinitsky force-pushed the ev/mining_docs branch from 8da7751 to 9edd792 Compare May 20, 2026 22:41

Eugene Vinitsky and others added 3 commits May 20, 2026 17:43

gitignore: drop failure_mining/ from this PR (moved to a separate PR)

e36b5d3

The cluster-docs PR shouldn't carry the mining-outputs ignore. Splitting to its own PR so the two are reviewable independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

eugenevinitsky force-pushed the ev/mining_docs branch from 9edd792 to 0ca867b Compare May 20, 2026 22:44

Base automatically changed from ev/improve_docs to emerge/temp_training May 23, 2026 18:46

		browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures`
		and `pufferlib/mining_viz.py`.

		per-step agent state, traffic state, and observation arrays for a single
		episode. Bundles are produced C-side when `capture_compact_replay=True` is

Conversation

eugenevinitsky commented May 20, 2026

What's in

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants