Skip to content

feat(lora): fix LoRA weight-sync deadlock and add stable LoRA RL recipe#20

Open
WWWjiahui wants to merge 3 commits into
Infini-AI-Lab:mainfrom
WWWjiahui:chore/bump-sglang-0.5.12
Open

feat(lora): fix LoRA weight-sync deadlock and add stable LoRA RL recipe#20
WWWjiahui wants to merge 3 commits into
Infini-AI-Lab:mainfrom
WWWjiahui:chore/bump-sglang-0.5.12

Conversation

@WWWjiahui

@WWWjiahui WWWjiahui commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Makes LoRA RL train end-to-end. Two independent fixes: a weight-sync deadlock in the RaaS engine, and a training recipe that keeps LoRA from collapsing. Validated on Qwen3-1.7B / math with a clean rising eval curve.

1. Weight-sync

On every training step the freshly trained LoRA adapter is pushed to the SGLang inference server. The swap follows a fixed sequence so that no request ever runs against half-updated weights or stale cache:

  1. Pause — the server stops accepting new requests (/pause_generation).
  2. Drain / abort — it waits out a short grace period for in-flight requests to finish, aborting any still running, so nothing is mid-generation on the old adapter.
  3. Load — the new adapter is loaded onto every server.
  4. Flush — the KV cache is flushed to drop entries computed with the old adapter's weights.
  5. Resume — generation continues; new requests now use the new adapter.

The bug. The old code reloaded the adapter under a fixed name (lora_1), which required unloading the previous one before step 3. SGLang's /unload_lora_adapter calls wait_for_unload, which blocks until the adapter's usage counter reaches zero, But the requests aborted during the drain never released their hold on that counter, so it stayed non-zero and the unload hung forever, freezing the entire pipeline. This only struck LoRA runs under load (full fine-tuning takes a different, unload-free path), which made it look intermittent.

The fix. Never unload. Each sync loads the new adapter under a fresh versioned name (lora_v1, lora_v2, …) and lets SGLang's memory-pool LRU reclaim the stale ones. The deadlock-prone unload is removed entirely, while the pause → drain/abort → load → flush → resume sequence above is unchanged, so the swap stays correct. A lingering old adapter only costs GPU memory, which is bounded by max_loras_per_batch / max_loaded_loras.

Validation

Qwen3-1.7B, math DP-scaled SGLang: clean monotonic rise over the first ~100 steps with importance weight pinned at 1.0 throughout —

metric base step 100
eval-avg/math500/avg@4 70.0 77.55
eval-avg/overall_avg 30.2 36.58

@WWWjiahui WWWjiahui requested a review from haizhongzheng as a code owner June 18, 2026 08:34
Make LoRA RL (FSDP2 trainer + SGLang inference) train end-to-end.

Weight-sync deadlock (RaaS engine):
- Each sync used to unload+reload the adapter under a fixed name (lora_1).
  SGLang's /unload_lora_adapter blocks in wait_for_unload until every
  in-flight request releases the adapter's usage counter; paused/aborted
  requests at a sync never release it, so the unload hangs forever and the
  pipeline deadlocks.
- Fix: load each sync under a fresh versioned name (lora_v{seq}) and never
  unload; SGLang's mem-pool LRU evicts stale adapters. Track the active
  name (_current_lora_name), thread it through generation requests, and
  mirror it to the eval engines that share the server.

Stable LoRA RL recipe (examples/math/qwen3-1.7b-m2po-2gpus-lora):
- LoRA's alpha/rank scaling makes each weight update much larger than
  full-FT at the same lr, so the off-policy / multi-minibatch settings
  full-FT tolerates collapse the policy under LoRA. Use near-on-policy:
  ppo_n_minibatches=1, max_staleness=1, recompute_logprob=true (lr 5e-6).
- Validated: clean rising eval, math500 avg@4 70 -> 77.5, overall 30 -> 36.6.
@WWWjiahui WWWjiahui force-pushed the chore/bump-sglang-0.5.12 branch from 5f9c0a1 to cbd7f97 Compare June 19, 2026 06:38
Add a LoRA subsection to the math recipes doc: the 2-GPU LoRA variant,
the near-on-policy settings LoRA needs (ppo_n_minibatches=1,
max_staleness=1, recompute_logprob=true), and the pause/drain/abort/load
sequence used to swap the adapter on each weight sync.
@WWWjiahui WWWjiahui changed the base branch from dev to main June 19, 2026 07:12
haizhongzheng added a commit that referenced this pull request Jun 19, 2026
… deadlock at its source

PR Infini-AI-Lab#20 avoided the LoRA weight-sync deadlock by loading each step's adapter under
a fresh versioned name and never unloading, letting SGLang's LRU reclaim old ones.
That only DEFERS the deadlock: SGLang's LoRARegistry LRU eviction (fires once
num_registered > max_loaded_loras) calls wait_for_unload, which blocks on a
per-adapter usage counter that aborted requests leak.

Root cause (SGLang tokenizer_manager): the LoRARegistry counter is acquire()d per
request but release()d only on normal completion and one scheduler-abort branch --
never on client-disconnect / waiting-queue aborts, which the RaaS per-step drain
routinely creates. The leaked counter never reaches zero, so wait_for_unload (both
/unload_lora_adapter and the max_loaded_loras eviction) hangs forever while holding
lora_update_lock, freezing all further LoRA ops.

Fix: LoRACounterLeakPatch wraps TokenizerManager.generate_request with a finally
that releases the counter on every teardown path, idempotently -- release iff the
rid is still in rid_to_state (atomic check-then-pop; double-release is fatal since
ConcurrentCounter has no floor). Delivered via the existing SGLang monkey-patch
framework (no rebuild); gated by ASTRAFLOW_DISABLE_LORA_LEAK_FIX.

The versioned-name scheme is kept (harmless once the leak is fixed); the now-false
"deadlock" comments in remote_inf_engine.py and raas.yaml are corrected.

Verified A/B at max_loaded_loras=4 on AMD/ROCm: control (fix off) deadlocks at
/load_lora_adapter(lora_v5) evicting lora_v1; treatment (fix on) completes all
weight-syncs and finishes. Unit test astraflow/raas/patch/tests/test_lora_counter_leak.py
covers release idempotency and the double-release guard (4/4 pass).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant