wandb: raise init_timeout, add retry, fix shared-mode init for cross-region clusters#10

Merged
DavidBellamy merged 44 commits into main from fix/wandb-shared-mode-online-timeout
Apr 21, 2026

@DavidBellamy
Collaborator

Summary

Fix WANDB_MODE=online runs failing at boot when wandb init exceeds the 90s init timeout on cross-region clusters. Unblocks LLM360/RL360#243.

Root cause correction

Issue LLM360/RL360#243's original hypothesis ("primary↔secondary local socket handshake race") is incorrect. Per the wandb upstream feature PR (wandb/wandb#6882):

Make it possible to log to the same run using multiple writers with independent wandb-core's (including ones running, for example, on different machines)

Shared mode spawns one wandb-core per writer; each writer talks to wandb cloud independently; server-side aggregation by run_id. No local socket handshake.
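Under that model, each writer builds its own settings and initializes against the same run id on its own. A minimal sketch of what the per-writer settings could look like follows. The helper name `shared_writer_settings` is hypothetical, the field names `mode`, `x_primary`, and `x_label` follow wandb's distributed-training docs as I understand them, and the label format is the one described in this PR's commit message; the PR's actual code is not shown here.

```python
from typing import Any, Dict


def shared_writer_settings(
    rank: int, is_primary: bool, init_timeout: float = 300.0
) -> Dict[str, Any]:
    """Build keyword arguments for one writer's wandb Settings (illustrative only).

    Every writer, primary or secondary, gets its own settings and would start
    its own wandb-core; nothing here refers to another local process.
    """
    role = "primary" if is_primary else "secondary"
    return {
        "mode": "shared",                  # multiple writers, one run
        "x_primary": is_primary,           # exactly one writer should be primary
        "x_label": f"rank_{rank}_{role}",  # per-rank tag for UI filtering
        "init_timeout": init_timeout,      # seconds allowed for the init round-trip
    }


primary = shared_writer_settings(rank=0, is_primary=True)
secondary = shared_writer_settings(rank=3, is_primary=False)
```

Each dictionary would be passed to `wandb.Settings(...)` by its own process, with the secondaries also passing the primary's run id to `wandb.init`.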

The observed timeout is simply wandb's default init_timeout = 90.0s expiring. The cross-continent HTTPS round-trip (MBZUAI Abu Dhabi ↔ wandb cloud US-West), combined with concurrent actor bursts, can exceed 90s when wandb cloud is under load.

Changes

  • init_timeout=300.0 on both primary and secondary Settings (configurable via WANDB_INIT_TIMEOUT_SECS)
  • _wandb_init_with_retry wraps both init paths: 3 attempts with 5→10→20s exponential backoff on CommError/UsageError (configurable via WANDB_INIT_RETRY_ATTEMPTS / WANDB_INIT_RETRY_BACKOFF_SECS)
  • x_label per-rank tagging per wandb distributed-training docs
  • Drop reinit=True from secondary init_kwargs (not needed for shared mode, triggered stale-state warnings)
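
The retry behaviour in the second bullet can be sketched as a small generic helper. This is not the PR's `_wandb_init_with_retry`; the name `init_with_retry` and its parameters are hypothetical, and the sketch assumes delays are applied only between attempts, so with three attempts it sleeps 5s and then 10s.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def init_with_retry(
    init_fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_backoff_secs: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call init_fn, retrying on the given exception types with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return init_fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error unchanged
            # Delay doubles each time: base, 2*base, 4*base, ...
            sleep(base_backoff_secs * 2 ** attempt)
    raise AssertionError("unreachable")
```

With wandb this would be called along the lines of `init_with_retry(lambda: wandb.init(**init_kwargs), retry_on=(wandb.errors.CommError, wandb.errors.UsageError))`. Injecting `sleep` keeps the helper testable without real waiting.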

Unlocks

  • WANDB_MODE=offline can be removed from scale.yaml.extra_env default
  • ~/bin/wandb-sync-rl360.sh tmux patchwork on David's M2 account becomes obsolete
  • Near-realtime dashboards replace 2-min-lag offline sync
  • Per-rank system metrics via x_label filtering in wandb UI

Test plan

  1. Merge → octopus → rebuild agentic-rl-nightly-YYYYMMDD image (same path as the TITO fix that landed today)
  2. Pilot run with a test composition that forces WANDB_MODE=online
  3. Verify boot completes <5 min, primary+secondaries attach cleanly, dashboards live
  4. Stress: 3 concurrent pilot-scale jobs (~100 concurrent wandb clients across them) vs wandb's documented 300-concurrent-client limit
  5. Rollback: if anything misbehaves, env-var override WANDB_INIT_TIMEOUT_SECS=90 restores prior default in-place
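
The rollback in step 5 relies on the timeout being read from the environment at init time. A minimal sketch of such a lookup, assuming a 300s default and a hypothetical helper name `resolve_init_timeout` (the PR's actual parsing code is not shown here):

```python
import os
from typing import Mapping, Optional

DEFAULT_INIT_TIMEOUT_SECS = 300.0  # new default described in this PR


def resolve_init_timeout(env: Optional[Mapping[str, str]] = None) -> float:
    """Return the init timeout in seconds, honouring WANDB_INIT_TIMEOUT_SECS."""
    env = os.environ if env is None else env
    raw = env.get("WANDB_INIT_TIMEOUT_SECS")
    if raw is None or raw.strip() == "":
        return DEFAULT_INIT_TIMEOUT_SECS
    value = float(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError("WANDB_INIT_TIMEOUT_SECS must be positive")
    return value


default_timeout = resolve_init_timeout({})
rollback_timeout = resolve_init_timeout({"WANDB_INIT_TIMEOUT_SECS": "90"})
```

Failing loudly on a malformed value is a design choice in this sketch; silently falling back to the default would hide a misconfigured rollback.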

Upstream

Will also open the equivalent PR against radixark/miles:main for upstream hygiene. Landing here first to unblock the M2 pilot via the LLM360-deploy octopus.

JD-ETH and others added 30 commits April 5, 2026 13:51
…pdating (radixark#890)

Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: Yueming Yuan <yueming@Mac.attlocal.net>
…adixark#654)

Co-authored-by: GuanxingLu <gxlu02@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… CLI args in swe-agent-v2 (radixark#954)

Co-authored-by: Shi Dong <shi.dong@radixark.ai>
…#952)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
…adixark#926)

Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rmers >=5.0 (radixark#927)

Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…#948)

Co-authored-by: guapisolo <guapisolo@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…log dtype (radixark#975)

Co-authored-by: yueming-yuan <yym022502@gmail.com>
maocheng23 and others added 14 commits April 15, 2026 11:51
…adixark#974)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on (radixark#949)

Co-authored-by: Yanbin Jiang <jybsuper@gmail.com>
)

Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com>
Co-authored-by: Zhichen Zeng <zczeng@uw.edu>
…r cross-region clusters

In online + shared mode, both `init_wandb_primary` and `init_wandb_secondary`
make HTTPS round-trips to wandb cloud (login + run create/attach). On
high-latency cross-region clusters (e.g. Abu Dhabi MBZUAI ↔ wandb-cloud
US-West) with concurrent actor bursts, a single round-trip can exceed the
wandb SDK's 90s default `init_timeout` — tearing down the whole run
with a silent handshake abort. Observed on RL360 job 1564420, after which
`WANDB_MODE=offline` was forced as the global default (see
https://github.com/LLM360/RL360/issues/87).

The issue's original diagnosis assumed a local primary↔secondary socket
handshake race. That's not how shared mode works — per wandb's own
feature PR (wandb/wandb#6882), each writer spawns
an independent wandb-core that talks to the cloud directly; aggregation
is server-side by run_id. No local socket exists. The failure mode is
pure network/latency, not a local readiness race.

Changes
-------

- Bump `init_timeout` to 300s for primary and secondary Settings.
  Configurable via `WANDB_INIT_TIMEOUT_SECS` env var for tuning.
- Wrap both init paths in a bounded exponential-backoff retry
  (`_wandb_init_with_retry`) that re-attempts on wandb.errors.CommError
  and wandb.errors.UsageError. 3 attempts with 5→10→20s backoff by
  default, tunable via `WANDB_INIT_RETRY_ATTEMPTS` /
  `WANDB_INIT_RETRY_BACKOFF_SECS`.
- Add `x_label` tagging per wandb distributed-training docs: primary
  gets `rank_<rank>_primary`, secondaries get `rank_<rank>_secondary`.
  Enables per-rank console-log filtering in the wandb UI.
- Drop `reinit=True` from secondary init_kwargs. Shared mode natively
  supports concurrent writers on a single run; `reinit=True` triggered
  stale-state warnings on secondary actors without functional benefit.

Followups this change enables
-----------------------------

- `WANDB_MODE=offline` can be removed from scale.yaml's extra_env
  default once a pilot run confirms online mode boots cleanly.
- The tmux-based `~/bin/wandb-sync-rl360.sh` workaround on David's M2
  account becomes obsolete (no more offline-only default).
- Near-realtime wandb dashboards replace the ~2-minute-lag offline
  sync; per-rank system metrics via x_label filtering.
@DavidBellamy DavidBellamy merged commit 3684c6d into main Apr 21, 2026
1 check failed
@DavidBellamy DavidBellamy deleted the fix/wandb-shared-mode-online-timeout branch April 21, 2026 19:07
@DavidBellamy DavidBellamy restored the fix/wandb-shared-mode-online-timeout branch April 21, 2026 19:08