wandb: raise init_timeout, add retry, fix shared-mode init for cross-region clusters #10
Merged
DavidBellamy merged 44 commits into main, Apr 21, 2026
…pdating (radixark#890) Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net>
…adixark#654) Co-authored-by: GuanxingLu <gxlu02@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… CLI args in swe-agent-v2 (radixark#954) Co-authored-by: Shi Dong <shi.dong@radixark.ai>
…#952) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
…adixark#926) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rmers >=5.0 (radixark#927) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…#948) Co-authored-by: guapisolo <guapisolo@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…log dtype (radixark#975) Co-authored-by: yueming-yuan <yym022502@gmail.com>
…adixark#974) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on (radixark#949) Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Co-authored-by: Zhichen Zeng <zczeng@uw.edu>
…t between sgl & miles router (radixark#1015)
…r cross-region clusters

In online + shared mode, both `init_wandb_primary` and `init_wandb_secondary` make HTTPS round-trips to wandb cloud (login + run create/attach). On high-latency cross-region clusters (e.g. Abu Dhabi MBZUAI ↔ wandb-cloud US-West) with concurrent actor bursts, a single round-trip can exceed the wandb SDK's 90s default `init_timeout`, tearing down the whole run with a silent handshake abort. Observed on RL360 job 1564420, which forced `WANDB_MODE=offline` as a global default ever since (see https://github.com/LLM360/RL360/issues/87).

The issue's original diagnosis assumed a local primary↔secondary socket handshake race. That's not how shared mode works: per wandb's own feature PR (wandb/wandb#6882), each writer spawns an independent wandb-core that talks to the cloud directly; aggregation is server-side by run_id. No local socket exists. The failure mode is pure network/latency, not a local readiness race.

Changes
-------
- Bump `init_timeout` to 300s for primary and secondary Settings. Configurable via `WANDB_INIT_TIMEOUT_SECS` env var for tuning.
- Wrap both init paths in a bounded exponential-backoff retry (`_wandb_init_with_retry`) that re-attempts on `wandb.errors.CommError` and `wandb.errors.UsageError`. 3 attempts with 5→10→20s backoff by default, tunable via `WANDB_INIT_RETRY_ATTEMPTS` / `WANDB_INIT_RETRY_BACKOFF_SECS`.
- Add `x_label` tagging per wandb distributed-training docs: primary gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`. Enables per-rank console-log filtering in the wandb UI.
- Drop `reinit=True` from secondary init_kwargs. Shared mode natively supports concurrent writers on a single run; `reinit=True` triggered stale-state warnings on secondary actors without functional benefit.

Followups this change enables
-----------------------------
- `WANDB_MODE=offline` can be removed from scale.yaml's extra_env default once a pilot run confirms online mode boots cleanly.
- The tmux-based `~/bin/wandb-sync-rl360.sh` workaround on David's M2 account becomes obsolete (no more offline-only default).
- Near-realtime wandb dashboards replace the ~2-minute-lag offline sync; per-rank system metrics via x_label filtering.
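
For readers without the diff at hand, the bounded exponential-backoff retry described in the commit message could look roughly like the sketch below. It is not the merged `_wandb_init_with_retry`: the function name `init_with_retry`, its parameters, and the injectable `sleep` argument are hypothetical, and only the two environment variable names and the 3-attempt / 5s-base defaults come from the commit message. The sketch sleeps only between attempts; whether the real helper also waits after the final failure is not stated.

```python
import os
import time


def init_with_retry(init_fn, retry_on, attempts=None, backoff_secs=None, sleep=time.sleep):
    """Call init_fn(), retrying on `retry_on` exceptions with exponential backoff.

    Hypothetical helper for illustration; not the code merged in this PR.
    """
    if attempts is None:
        attempts = int(os.environ.get("WANDB_INIT_RETRY_ATTEMPTS", "3"))
    if backoff_secs is None:
        backoff_secs = float(os.environ.get("WANDB_INIT_RETRY_BACKOFF_SECS", "5"))
    if attempts < 1:
        raise ValueError("attempts must be >= 1")

    for attempt in range(1, attempts + 1):
        try:
            return init_fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            # The delay doubles on every failed attempt (base, 2*base, 4*base, ...).
            sleep(backoff_secs * 2 ** (attempt - 1))
```

In the setting of this PR, `init_fn` would presumably be a closure over `wandb.init(...)` and `retry_on` the tuple `(wandb.errors.CommError, wandb.errors.UsageError)`. Keeping `sleep` injectable lets the backoff schedule be checked without real waiting.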
Summary
Fix `WANDB_MODE=online` booting past the 90s init timeout on cross-region clusters. Unblocks LLM360/RL360#243.

Root cause correction
Issue LLM360/RL360#243's original hypothesis ("primary↔secondary local socket handshake race") is incorrect. Per the wandb upstream feature PR (wandb/wandb#6882):

Shared mode spawns one wandb-core per writer; each writer talks to wandb cloud independently; server-side aggregation by `run_id`. No local socket handshake.

The observed >90s timeout is just wandb's default `init_timeout = 90.0`s. The cross-continent HTTPS round-trip (MBZUAI Abu Dhabi ↔ wandb cloud US-West) with concurrent actor bursts can exceed 90s when the wandb cloud is under any load.

Changes
- `init_timeout=300.0` on both primary and secondary Settings (configurable via `WANDB_INIT_TIMEOUT_SECS`)
- `_wandb_init_with_retry` wraps both init paths, 3 attempts, 5→10→20s exponential backoff on `CommError`/`UsageError` (configurable)
- `x_label` per-rank tagging per wandb distributed-training docs
- Drop `reinit=True` from secondary init_kwargs (not needed for shared mode, triggered stale-state warnings)

Unlocks
- `WANDB_MODE=offline` can be removed from `scale.yaml.extra_env` default
- `~/bin/wandb-sync-rl360.sh` tmux patchwork on David's M2 account becomes obsolete
- `x_label` filtering in wandb UI

Test plan
- `agentic-rl-nightly-YYYYMMDD` image (same path as the TITO fix landed today)
- `WANDB_MODE=online`
- `WANDB_INIT_TIMEOUT_SECS=90` restores prior default in-place

Upstream
Will also open the equivalent PR against `radixark/miles:main` for upstream hygiene. Landing here first to unblock the M2 pilot via the LLM360-deploy octopus.
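
As a rough illustration of the Settings changes described in this PR (timeout override plus per-rank `x_label`), the sketch below builds the keyword arguments as a plain dict. The helper name `shared_mode_settings` and its signature are hypothetical. The `init_timeout`, `x_label`, label format, and `WANDB_INIT_TIMEOUT_SECS` come from the description above; the `mode="shared"` and `x_primary` keys are an assumption based on wandb's distributed-training documentation and should be checked against the installed wandb version.

```python
import os


def shared_mode_settings(rank, is_primary, env=None):
    """Build keyword arguments for a shared-mode wandb Settings object.

    Hypothetical helper for illustration; not the code merged in this PR.
    """
    env = os.environ if env is None else env
    role = "primary" if is_primary else "secondary"
    return {
        # Assumed key names, following wandb's distributed-training docs.
        "mode": "shared",
        "x_primary": is_primary,
        # Label format taken from the PR: rank_<rank>_primary / rank_<rank>_secondary.
        "x_label": f"rank_{rank}_{role}",
        # 300s default, overridable through the env var named in the PR.
        "init_timeout": float(env.get("WANDB_INIT_TIMEOUT_SECS", "300")),
    }
```

The resulting dict would presumably be passed along the lines of `wandb.Settings(**shared_mode_settings(rank, is_primary))`. Setting `WANDB_INIT_TIMEOUT_SECS=90` reproduces the prior 90s default mentioned in the test plan.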