
Remove use-wandb in doc #610

Open
xyao-nv wants to merge 3 commits into main from xyao/doc/remove_wandb

Conversation


@xyao-nv xyao-nv commented Apr 15, 2026

Summary

Address https://nvbugspro.nvidia.com/bug/6062848

Detailed description

In a multi-GPU setup, the standard output (stdout) buffer gets flooded with logs from secondary GPUs, so the wandb prompt requesting user input is buried in the output. Because the prompt goes unanswered, data loading stalls and eventually times out.
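The failure mode above can be sketched with a minimal rank guard. This is an illustration only, not what the PR does (the PR simply drops the `--use-wandb` flag from the docs): a common pattern for avoiding interactive or duplicated logging in distributed runs is to initialize logging on the primary rank only. The helper name `should_init_logging` is hypothetical; the `RANK` environment variable is the one torchrun exports per worker.

```python
import os

# Hedged sketch: in a torchrun launch every worker process gets a RANK
# environment variable. Initializing interactive tooling (such as wandb)
# only on rank 0 keeps secondary-GPU stdout from burying its prompt.
def should_init_logging(env=os.environ):
    """Return True only on the primary rank; default to 0 for single-process runs."""
    return int(env.get("RANK", "0")) == 0

print(should_init_logging({"RANK": "0"}))  # primary rank -> True
print(should_init_logging({"RANK": "3"}))  # secondary rank -> False
```

In practice the guard would wrap the logging setup call, so only one process ever reaches the interactive prompt.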

@xyao-nv xyao-nv marked this pull request as ready for review April 15, 2026 17:57
@xyao-nv xyao-nv force-pushed the xyao/doc/remove_wandb branch from de9a147 to a492d65 Compare April 15, 2026 17:58

greptile-apps bot commented Apr 15, 2026

Greptile Summary

Removes --use-wandb from the launch_finetune.py command snippets in three example workflow docs to fix a multi-GPU hang: in distributed training, secondary GPU stdout floods the terminal and buries the interactive wandb prompt, causing an unanswered input timeout that stalls data loading.

Note: osmo/finetune.yaml (not changed in this PR) still passes --use-wandb on line 69 and would encounter the same issue in that workflow path.
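For workflow paths that still pass `--use-wandb`, a possible mitigation (an assumption on my part, not something this PR implements) is to make wandb non-interactive through its `WANDB_MODE` environment variable, which wandb documents as supporting `offline` and `disabled` modes that skip the login prompt:

```python
import os

# Hedged sketch: setting WANDB_MODE before the training script calls
# wandb.init() makes wandb non-interactive, so no prompt can hang a
# multi-GPU run. Assumption: the script does not override this variable.
os.environ["WANDB_MODE"] = "offline"
print(os.environ["WANDB_MODE"])
```

The same effect can be had by exporting the variable in the launch shell before invoking torchrun.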

Confidence Score: 5/5

Safe to merge — documentation-only change with no code logic impact

All three files receive identical, targeted doc fixes (flag removal). No code, config, or test logic is altered. The remaining observation about osmo/finetune.yaml is informational and out of scope for this PR.

No files require special attention; all changes are straightforward documentation edits

Important Files Changed

- docs/pages/example_workflows/locomanipulation/step_4_policy_training.rst: Removed --use-wandb from the multi-GPU torchrun training command; no other changes.
- docs/pages/example_workflows/sequential_static_manipulation/step_4_policy_training.rst: Removed --use-wandb from both the 8-GPU and 1-GPU training command examples.
- docs/pages/example_workflows/static_manipulation/step_4_policy_training.rst: Removed --use-wandb from both the 8-GPU and 1-GPU training command examples.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Start multi-GPU training\ntorchrun --nproc_per_node=8] --> B{--use-wandb\npassed?}
    B -- Yes\n(before this PR) --> C[wandb prompts user\nfor login/project]
    C --> D[Secondary GPU logs\nflood stdout]
    D --> E[Prompt buried /\nnever answered]
    E --> F[Data loading\ntimeout / hang]
    B -- No\n(after this PR) --> G[Training starts\nimmediately]
    G --> H[Completes successfully]


@xyao-nv xyao-nv enabled auto-merge (squash) April 15, 2026 18:12
@xyao-nv xyao-nv disabled auto-merge April 15, 2026 19:03
@xyao-nv xyao-nv enabled auto-merge (squash) April 15, 2026 19:27