action dataloader: episode-shuffle stream (fix DROID grad-norm instability)#37
Open
fwd4 wants to merge 8 commits into
Conversation
The DROID action dataset is map-style and (unlike the iterable vision SFTDataset) does not self-shuffle, and RankPartitionedDataLoader wrapped it in a DataLoader with no shuffle -> SequentialSampler. Every rank then iterated the same consecutive, overlapping windows, so the all-reduced global batch was ~1 episode -> high gradient variance and an unstable, slow-settling grad-norm.
Fix: ActionIterableShuffleDataset (iterable_shuffle=True) streams rank x worker-sharded, episode-order-shuffled, sequential-within-episode -- decorrelated batches with sequential reads (I/O locality + copy-on-write preserved; a plain RandomSampler instead does random-access I/O -> ~11min/iter + OOM). Mirrors i4's ActionUnifiedIterableDataset worker assignment. Adds DROIDLeRobotDataset.get_shuffle_blocks() for the per-episode/segment index blocks the iterable streams. No DataLoader change needed -- IterableDataset is handled natively (sampler=None).
Validated (256-rank-equivalent, 8192 global): grad-norm settles 27.8->2.9->1.7, tracking the internal reference (43->4.7->1.9) vs the no-shuffle run stuck at ~21; per-component action loss converges to ~0.0055 (matches internal ~0.005 vs the broken run's noisy 0.03-0.07).
Signed-off-by: Hao Liang <haolia@nvidia.com>
f786168 to
8eec346
Collaborator
LGTM
lfengad
previously approved these changes
Jun 12, 2026
The DROID policy recipe was silently training multi-task: DROIDLeRobotDataset defaulted to mode="joint" (random forward_dynamics/inverse_dynamics/policy per sample), so a *_policy* recipe trained a mix of tasks. inverse_dynamics zeros the vision loss and forward_dynamics zeros the action loss, diluting each per-task loss by ~1/3 vs the policy-only internal run. Set the dataset default to mode="policy" (matching i4's DROIDLeRobotDataset) and thread `mode` through get_action_droid_sft_dataset.
Also uncap max_num_tokens_after_packing (NANO default 45056 -> -1) to match the internal droid_lerobot_8b run so the full vision sequence is processed per step.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
  "action_modality_embed",
  ],
- lr=2.0e-04, # matches internal droid_lerobot_8b_policy submit (--lr 2e-4)
+ lr=1.0e-04, # sqrt-scaled for 2048 global batch (internal 2e-4 was for 8192 = 4x)
Collaborator
Is this change intended? Our internal ablation showed that fixing lr to 2.0e-4 is a key to high policy success rate.
…fs from comments
- optimizer.lr 1e-4 -> 2e-4 (for the 8192 global batch)
- document max_samples_per_batch=128 as 8192 global at 64 ranks (16 nodes)
- remove i4 / internal-run references from recipe + dataset comments, keeping the technical rationale; corrected the keep-ranges note (it is published at HF KarlP/droid, not an internal artifact)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
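The batch-size arithmetic behind this commit and the reviewed `lr` line can be checked with a short sketch. It only restates numbers quoted in this thread (128 samples per rank, 64 ranks, 2e-4 at 8192, the sqrt-scaling comment for 2048); the variable names are illustrative and not taken from the recipe.

```python
import math

# Global batch documented in the commit: per-rank samples times rank count.
samples_per_rank = 128   # max_samples_per_batch
ranks = 64               # 16 nodes, per the commit message
global_batch = samples_per_rank * ranks

# Where the reverted 1e-4 came from: sqrt-scaling the 8192-batch learning
# rate down to a 2048 global batch (a 4x smaller batch halves the rate).
reference_lr = 2.0e-4
reference_batch = 8192
smaller_batch = 2048
sqrt_scaled_lr = reference_lr * math.sqrt(smaller_batch / reference_batch)

print(global_batch)     # 8192
print(sqrt_scaled_lr)   # 0.0001
```

Since the recipe runs at the full 8192 global batch, no scaling applies and the learning rate stays at 2e-4, which is what the commit restores.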
Restructure docs/action_policy_droid_posttrain.md to mirror the server doc (clean title, HF-linked intro, TOC block, Prerequisites referencing Setup / Environment Variables / FAQ). Make the keep-ranges window filter a default step in Full Reproduction (download from KarlP/droid + enable via overrides); drop the smoke-reproduction section.
Add EXTRA_OVERRIDES to the SFT launcher: a space-separated Hydra-override string passed via the environment. Unlike the TAIL_OVERRIDES array, an exported string survives `bash <launcher>` (a child process), so overrides documented with the `bash` launch form actually take effect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
…EXTRA_OVERRIDES
Replace the generic EXTRA_OVERRIDES helper passthrough (reverts that addition to _sft_launcher_common.sh) with a dedicated KEEP_RANGES_PATH env var on the DROID wrapper, mirroring how launch_sft_videophy2_nano.sh plumbs VLM_SAFETENSORS_PATH. The wrapper composes the use_filter_dict / filter_dict_path overrides into TAIL_OVERRIDES in-process (append-guarded so a sourced TAIL_OVERRIDES survives), so it takes effect over `bash <wrapper>` without any new generic mechanism.
Also fix the smoke example to the working `source` form and the posttrain doc-path typo. Update the post-train doc to enable the filter via KEEP_RANGES_PATH.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
…wnload + trim doc
Switch the keep-ranges filter plumbing to the generic EXTRA_TAIL_OVERRIDES
string (matches launch_sft_videophy2_nano.sh) instead of a dedicated
KEEP_RANGES_PATH var: the wrapper word-splits ${EXTRA_TAIL_OVERRIDES:-} into
TAIL_OVERRIDES, so an exported string takes effect over `bash <wrapper>`.
Fix the filter download command: KarlP/droid is a MODEL repo, so drop the
bogus `--repo-type dataset` (it 404s). Also drop the Non-Goals section.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
…ANK/MASTER_ADDR)
Forward NNODES / NODE_RANK / MASTER_ADDR to torchrun when set, so the portable launcher scales multi-node under any scheduler (a SLURM/Lepton wrapper just exports them). MASTER_ADDR has no torchrun env fallback, so it must be passed explicitly. With all three unset the invocation is byte-identical to single-node.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
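The forwarding rule in this commit can be sketched as follows. This is a Python analogue for illustration only: the actual launcher is a shell script, and the helper name `torchrun_multinode_args` is hypothetical. `--nnodes`, `--node_rank` and `--master_addr` are standard torchrun flags.

```python
def torchrun_multinode_args(env: dict[str, str]) -> list[str]:
    """Return extra torchrun flags for whichever of the three variables are set."""
    args: list[str] = []
    if env.get("NNODES"):
        args += ["--nnodes", env["NNODES"]]
    if env.get("NODE_RANK"):
        args += ["--node_rank", env["NODE_RANK"]]
    if env.get("MASTER_ADDR"):
        args += ["--master_addr", env["MASTER_ADDR"]]
    return args


# With nothing exported, no flags are added and the command line is unchanged.
single_node = torchrun_multinode_args({})

# A scheduler wrapper exporting all three yields the multi-node flags.
multi_node = torchrun_multinode_args(
    {"NNODES": "2", "NODE_RANK": "1", "MASTER_ADDR": "10.0.0.1"}
)
```

Appending only the flags that are set is what keeps the single-node invocation identical when all three variables are absent.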
…ism.*)
The HSDP degree override lands at model.config.parallelism.* (per the structured-TOML schema, sft_config.py), not model.parallelism.*; the latter fails with 'Key parallelism is not in struct'. Verified on a 2-node run: the corrected path applies (shard=4 x replicate=2, mesh [2,4]).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Hao Liang <haolia@nvidia.com>
Problem
The DROID action SFT dataloader trained with an unstable, slow-settling grad-norm (and a noisy action-loss plateau) vs the internal reference. Root cause: the DROID action dataset is map-style and, unlike the iterable vision `SFTDataset` (which self-shuffles), does not shuffle, and `RankPartitionedDataLoader` wraps it in a `DataLoader` with no `shuffle`, i.e. a `SequentialSampler`. Every rank then iterates the same consecutive, overlapping windows, so the all-reduced global batch is effectively ~1 episode → high gradient variance.
(Forward + gradients were verified numerically equivalent to the internal model on identical input, so this was a data-path issue, not the model/loss/optimizer.)
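A toy illustration of this failure mode, under assumed sizes (4 ranks, 8 samples per rank per step, 100 windows per episode). It is not the repository's loader; it only counts how many distinct episodes one global batch touches when every rank reads the same sequential indices.

```python
# Toy model: windows are flat indices; index // EPISODE_LEN is the episode id.
# All sizes below are illustrative assumptions, not values from the recipe.
EPISODE_LEN = 100      # windows per episode (hypothetical)
NUM_RANKS = 4
PER_RANK_BATCH = 8


def sequential_global_batch(step: int) -> list[int]:
    """Every rank uses the same sequential order, so all ranks emit the same indices."""
    start = step * PER_RANK_BATCH
    per_rank = list(range(start, start + PER_RANK_BATCH))
    return per_rank * NUM_RANKS  # concatenation of identical per-rank batches


def episodes_in(batch: list[int]) -> set[int]:
    """Set of episode ids covered by a batch of flat window indices."""
    return {idx // EPISODE_LEN for idx in batch}


batch = sequential_global_batch(step=0)
print(len(batch), "samples from", len(episodes_in(batch)), "episode(s)")
# prints: 32 samples from 1 episode(s)
```

Adding ranks enlarges the nominal global batch without adding any new episodes to it, which is why the all-reduced gradient stays high-variance.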
Fix
- `ActionIterableShuffleDataset` (`iterable_shuffle=True`): an `IterableDataset` view of the map-style dataset that streams rank × worker-sharded, episode-order-shuffled, sequential-within-episode samples. Batches are decorrelated while reads stay sequential, which preserves I/O locality and copy-on-write; a plain `shuffle=True` / `RandomSampler` instead does random-access I/O → ~11 min/iter and OOM from broken COW. Mirrors the internal iterable dataset's per-worker episode assignment.
- `DROIDLeRobotDataset.get_shuffle_blocks()`: per-episode/segment flat-index blocks that the iterable streams.
- No `DataLoader`/sampler change needed: `IterableDataset` is handled natively (`sampler=None`).

Validation (8192 global batch)
Per-component action loss converges to ~0.0055 (matches internal ~0.005; the no-shuffle run plateaued noisily at 0.03–0.07). Builds on #24 (recipe + FusedAdam optimizer).
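The streaming order described under Fix can be sketched in plain Python. This is a minimal illustration under assumptions, not the PR's `ActionIterableShuffleDataset`: the function name, the strided sharding rule, and the fixed seed are hypothetical, and `blocks` stands in for what `get_shuffle_blocks()` is described as returning.

```python
import random
from typing import Iterator


def episode_shuffle_stream(
    blocks: list[list[int]],  # one list of flat sample indices per episode/segment
    rank: int,
    world_size: int,
    worker_id: int,
    num_workers: int,
    seed: int = 0,
) -> Iterator[int]:
    """Yield sample indices: shuffled episode order, sequential within an episode."""
    order = list(range(len(blocks)))
    # Same seed on every rank/worker so all shards see one consistent permutation.
    random.Random(seed).shuffle(order)
    # Give each (rank, worker) pair a disjoint strided slice of the shuffled episodes.
    shard = rank * num_workers + worker_id
    num_shards = world_size * num_workers
    for block_id in order[shard::num_shards]:
        yield from blocks[block_id]  # sequential reads inside the episode


# Six toy episodes of three windows each (indices 0..17), split over two ranks.
blocks = [list(range(i * 3, i * 3 + 3)) for i in range(6)]
streams = [
    list(episode_shuffle_stream(blocks, rank=r, world_size=2, worker_id=0, num_workers=1))
    for r in range(2)
]
```

Inside a PyTorch `IterableDataset`, `worker_id` and `num_workers` would typically come from `torch.utils.data.get_worker_info()`, and a real implementation would also vary the seed per epoch.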
🤖 Generated with Claude Code
Added commits (recipe correctness)
- `mode="policy"` default: `DROIDLeRobotDataset` defaulted to `mode="joint"` (random forward_dynamics/inverse_dynamics/policy per sample), so the policy recipe was silently training multi-task. `inverse_dynamics` zeros the vision loss and `forward_dynamics` zeros the action loss, diluting each per-task loss by ~1/3 vs the policy-only internal run. Now defaults to `policy` (matching i4's `DROIDLeRobotDataset`); `mode` is also threaded through `get_action_droid_sft_dataset`.
- `max_num_tokens_after_packing=-1`: uncaps the packed-sequence length (NANO default 45056) to match the internal `droid_lerobot_8b` run, so the full vision sequence is processed per step. Does not change the per-token loss; widens the effective vision context per step.
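The ~1/3 dilution from `mode="joint"` can be reproduced with a small count. The sketch assumes the three tasks are drawn uniformly per sample and that `policy` keeps both loss terms active; neither is stated explicitly above, and the `ACTIVE` table is illustrative rather than code from the dataset.

```python
from fractions import Fraction

# Which loss terms stay active for each task. The inverse_dynamics and
# forward_dynamics rows follow the PR text; the policy row is an assumption.
ACTIVE = {
    "policy": {"vision": True, "action": True},
    "inverse_dynamics": {"vision": False, "action": True},
    "forward_dynamics": {"vision": True, "action": False},
}


def active_fraction(loss: str, tasks: list[str]) -> Fraction:
    """Fraction of samples contributing to `loss` when tasks are drawn uniformly."""
    return Fraction(sum(ACTIVE[t][loss] for t in tasks), len(tasks))


joint = list(ACTIVE)
print(active_fraction("action", joint))       # 2/3
print(active_fraction("vision", joint))       # 2/3
print(active_fraction("action", ["policy"]))  # 1
```

Under these assumptions each loss term is zeroed on one sample in three in joint mode, which matches the quoted ~1/3 dilution relative to a policy-only run.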