Skip to content

(WIP: do not review or read) Docs: failure mining operational guide#434

Open
eugenevinitsky wants to merge 20 commits into
emerge/temp_trainingfrom
ev/mining_docs
Open

(WIP: do not review or read) Docs: failure mining operational guide#434
eugenevinitsky wants to merge 20 commits into
emerge/temp_trainingfrom
ev/mining_docs

Conversation

@eugenevinitsky
Copy link
Copy Markdown

New `docs/mining.md` covering the failure mining workflow plus a short README pointer.

Chained on top of #433 (cluster training docs) — that PR un-ignores `docs/` in `.gitignore` and adds the `submit_cluster.py` singularity-wrap patch that the on-cluster mining example here depends on. After #433 merges, this PR rebases onto `emerge/temp_training` automatically.

What's in

  • `docs/mining.md` — `mine_failures` workflow, `score_threshold` semantics, the required `--vec.backend Serial` flag (the default Multiprocessing backend forks workers post-torch-import and deadlocks on CUDA), loading checkpoints with non-default `policy.` / `rnn.` dims (mine_failures doesn't auto-merge the sibling `config.yaml` that `train()` does), the on-cluster `submit_cluster.py` pattern with `--main` override, and viewer features.

  • `README.md` — short pointer paragraph at the end of the existing "Failure mining" section.

🤖 Generated with Claude Code

Eugene Vinitsky and others added 8 commits May 20, 2026 07:06
Add two new docs and link them from README.md:

- docs/cluster_training.md — sbatch template (incl. GPU heartbeat,
  required for jobs > ~2h or the cluster's idle-reclaimer will
  scancel them), CPU rebuild path (cpu_short partition; no GPU
  needed for build_ext), account/partition strategy (QOSGrpGRES vs
  QOSMaxGRESPerUser, _tandon_priority vs _general tiers, when to
  race partitions), replay-mode memory sizing (vec.num_envs lever,
  [eval.*] suite cost at first eval), and submit_cluster.py
  failure modes (login-node python lacks pip; submitit's srun
  launcher inherits the in-container venv python path).

- docs/mining.md — mine_failures workflow, the score_threshold
  default-captures-nothing gotcha (docstring is misleading; -inf
  means no replay is saved), the pufferlib.vector Multiprocessing
  CUDA-after-fork hang and --vec.backend Serial workaround, the
  shape-mismatch gotcha when load_model_path checkpoints have
  non-default policy.* / rnn.* dimensions (mine_failures doesn't
  do the sibling config.yaml auto-merge that train() does), and
  the on-cluster sbatch pattern.

.gitignore: un-ignore docs/ (was a blanket rule that prevented
checking in markdown docs alongside the existing eval_unification.md
spec); add failure_mining/ which is large local-only mining output.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two coupled changes that make submit_cluster.py work end-to-end on
clusters where the system python differs across login and compute
nodes (Greene: login 3.12, compute 3.9) and the only consistent
python lives inside the singularity overlay.

1. Submission side: after constructing AutoExecutor, when --container
   is set, override executor._executor.python so submitit's outer
   launcher is invoked as
       singularity exec --nv --overlay :ro $IMAGE $VENV/bin/python
   That makes the compute-side srun command resolve the launcher
   python inside the container (where the venv's symlink to
   /ext3/miniforge3/bin/python3 is valid) instead of needing a
   cross-node-consistent system python with submitit installed.

2. launch_training side: when the function lands on the compute node
   it's now already inside singularity (from (1)), so skip the second
   singularity exec wrap and just run inner_cmd via bash -c. The
   /.singularity.d/Singularity marker file distinguishes the cases
   so direct-sbatch callers (not yet inside a container) still get
   the wrap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop 'we figured out X' framing per code review feedback. The
submit_cluster.py path now works end-to-end (after the patch in
the previous commit that wraps submitit's outer launcher in
singularity exec), so the docs describe that as the canonical
flow rather than as a workaround. Direct-sbatch is no longer
documented as a fallback — submit_cluster.py is the single path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the cryptic one-line 'you may want to set' with a self-contained
explanation: what the env var does (per-arch fat binary), why it matters
on a heterogeneous cluster ('no kernel image' on the wrong GPU), what
the recommended value covers (A100/L40S/H100/H200), and when it actually
matters in practice (interactive builds outside setup_container.sh
rebuild, which already exports it).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous CPU rebuild section just said 'doesn't need a GPU' without
explaining the apparent contradiction (we're compiling CUDA, no?). Spell
out that nvcc is a cross-compiler: it emits PTX/SASS for the target
arches in TORCH_CUDA_ARCH_LIST without needing matching hardware, and
the CUDA toolkit lives in the singularity image so any node that can
mount the image can run the build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Just say 'doesn't need a GPU' without the nvcc / fat binary / cross-compiler
detour. Readers who want the why can find TORCH_CUDA_ARCH_LIST above.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This PR is just cluster training. The mining workflow doc and its
README pointer move to a separate PR so the two can be reviewed and
landed independently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 20, 2026 14:52
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an operational guide for the mine_failures workflow (capturing compact replays + generating an HTML triage UI) and links it from the existing README “Failure mining” section.

Changes:

  • Add docs/mining.md with a failure-mining walkthrough, CLI examples, cluster submission pattern, and viewer notes.
  • Add a README pointer linking to the deeper mining guide.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
README.md Adds a pointer to the new failure-mining operational guide.
docs/mining.md New end-to-end guide for mine_failures, including threshold semantics, cluster usage, and viewer documentation.
Comments suppressed due to low confidence (1)

docs/mining.md:103

  • pufferl.train is referenced as if it were a module; the training entrypoint here is puffer train ... / python -m pufferlib.pufferl train ... (and the function is pufferlib.pufferl.train). Consider updating the wording to the correct command/module path.
`mine_failures` does not read the sibling `config.yaml` next to
`load_model_path` (only `pufferl.train` does). If the checkpoint was trained
with non-default `policy.*` or `rnn.*` dimensions (e.g. `input_size=128`,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/mining.md
Comment on lines +4 to +5
browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures`
and `pufferlib/mining_viz.py`.
Comment thread docs/mining.md
Comment on lines +31 to +35
Open the index in a browser:

```bash
open ./failure_mining/baseline_011000/renders/index.html
```
Comment thread docs/mining.md
Comment on lines +40 to +41
per-step agent state, traffic state, and observation arrays for a single
episode. Bundles are produced C-side when `capture_compact_replay=True` is
Comment thread docs/mining.md
Comment on lines +47 to +49
reads the bundle and replays it in-browser on a top-down canvas, with optional
overlays for the agent's observed FOV, partner circle, goal route, and waypoint
markers.
Comment thread docs/mining.md
Comment on lines +158 to +163
- Frame scrubber + play/pause + speed control.
- Toggle observation overlay (FOV rectangle, partner circle, observed-entity
highlights, goal route, waypoint markers).
- Toggle road segment / road edge / lane line rendering.
- Map background (CARLA / nuPlan / Waymo road graph from the bundle's
embedded `simulation_mode`).
@eugenevinitsky eugenevinitsky changed the title Docs: failure mining operational guide (WIP: do not review or read) Docs: failure mining operational guide May 20, 2026
cluster_training.md: trim trailing whitespace on two lines and add a
final newline (pre-commit hooks were rejecting the diff).

.gitignore: drop the sphinx-themed entries. This repo doesn't use
sphinx (no docs/conf.py, no Makefile, no mkdocs.yml — docs are just
plain Markdown), so 'docs/_build/' and the misleading 'Generated docs
(sphinx build output only)' comment were vestigial cookiecutter
boilerplate. The earlier '# docs/' line is gone with them.

README.md: short pointer paragraph added to the existing HPC section
linking to docs/cluster_training.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

11 lines → 6. Kept the what (wrap launcher in singularity exec) +
why (sys.executable is either version-mismatched host python or a
dangling venv symlink) + pointer to the matched check in
launch_training. Dropped the enumerated detail; future readers who
need it can trace the code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Previous comment conflated two scenarios — the version-mismatched host
python (real, what we hit) and the venv-symlink-dangling case (a
hypothetical alternative setup we don't use). Just describe the real
problem: login-node /usr/bin/python3 is 3.12, compute-node is 3.9, so
~/.local installs for 3.12 are invisible on the compute side.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Per reviewer feedback (yw4142):

1. setup_container.sh create_overlay: remove the stray
   mv "${TEMPLATE_NAME%.gz}" overlay.ext3 — it renamed the just-
   extracted file to a name that doesn't match what OVERLAY_PATH
   defaults to, so every later step looked for the overlay at the
   wrong path. Drop the line; overlay stays at its original name.

2. submit_cluster.py: revert the executor._executor.python override
   and the /.singularity.d/Singularity sentinel check in
   launch_training. They were working around 'submitit not on the
   compute-side python', but if you source the project venv on the
   login node, sys.executable becomes the venv python (same path
   the compute node will run) and submitit's serialization round-
   trips without needing any wrap. Back to the original pattern:
   launcher python is the venv python, launch_training wraps the
   inner train command in singularity exec for CUDA libs.

3. cluster_training.md: replace the pip install --user submitit
   pyyaml cloudpickle bootstrap with 'source the venv'. setup_container.sh
   install already lands submitit in the venv via the project's
   pyproject.toml, and sourcing the venv on login makes it
   importable. The 'two python installations' section comes out
   entirely — there's just one python (the venv's).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

…rlay

Move the base python from inside the overlay (/ext3/miniforge3/) onto
/scratch (/scratch/$USER/miniforge3/). The venv's bin/python symlink
now points at /scratch — a path that resolves on every node, inside
or outside singularity, so 'source venv/activate && python ...' works
on the login node without needing to enter the container.

Changes:
- New MINIFORGE3_DIR variable (default /scratch/$USER/miniforge3) and
  ensure_miniforge3() helper that runs the conda-forge installer
  (self-contained shell script; no root, no singularity).
- CONTAINER_PYTHON default now points at the /scratch miniforge3.
- The 'install' dispatch runs ensure_miniforge3 outside the container
  first, then enters singularity (read-only overlay) for the uv + pip
  + build_ext steps that need nvcc.
- run_in_container_writable + --fakeroot are gone; nothing in the python
  flow writes to the overlay anymore. The overlay stays mounted :ro for
  rare system-tool installs but isn't on any write path.

For existing users: re-run setup_container.sh install — it'll detect
miniforge3 missing on /scratch, install it, recreate the venv against
the new base, and reinstall the project. The overlay's old miniforge3
becomes stale but harmless; you can rm it from the overlay later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Existing installs have the venv's bin/python pointing at /ext3/miniforge3
from before we moved miniforge3 to /scratch. Re-running install would
otherwise reuse the broken-symlink venv and skip recreating it. Detect
the broken symlink (bash -e $VENV/bin/python) and rm -rf before
recreating against $CONTAINER_PYTHON.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

The previous check (-e $VENV/bin/python) was wrong: install runs inside
singularity, where the overlay is mounted, so the OLD venv's symlink
into /ext3/miniforge3 IS valid — it just points at the wrong place
relative to where MINIFORGE3_DIR now is. Use readlink -f to walk the
chain and verify the resolved path is under $MINIFORGE3_DIR; rebuild
if not.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

miniforge3 25.x ships Python 3.13, but torch's cu121 wheels are cp39..cp312
only — no cp313 — so the install fails with 'no solution found when
resolving dependencies'. Pin the installer URL to 24.11.3-2 (Python
3.12) and add a version check in ensure_miniforge3 so an existing
miniforge3 with the wrong python gets reinstalled.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Putting back the docs/_build/ line I dropped earlier alongside the
blanket 'docs/' ignore. The blanket ignore was actively wrong (hid
our checked-in markdown), but docs/_build/ is just the standard
sphinx build output dir — costs nothing to keep, and protects future
contributors from accidentally committing 'make html' output if
sphinx is ever added.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Vinitsky and others added 3 commits May 20, 2026 17:43
The cluster-docs PR shouldn't carry the mining-outputs ignore. Splitting
to its own PR so the two are reviewable independently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New docs/mining.md covering the mine_failures workflow:
  - score_threshold semantics (default -inf saves nothing)
  - the required --vec.backend Serial flag (pufferl's default
    Multiprocessing backend forks workers post-torch-import and
    deadlocks on CUDA)
  - loading checkpoints with non-default policy.* / rnn.* dims
    (mine_failures doesn't auto-merge the sibling config.yaml that
    train() does)
  - on-cluster submit_cluster.py pattern with --main override
  - viewer features

README.md gains a short pointer paragraph at the end of the existing
Failure mining section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirror the cluster_training.md change — setup_container.sh install
already lands submitit in the venv, and sourcing the venv on login
makes it importable. No --user bootstrap needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Base automatically changed from ev/improve_docs to emerge/temp_training May 23, 2026 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants