
Fix Stage 0 + Ulysses crash: make bwc_tensor_model_parallel_rank() resilient to MP API absence #7888

Open
nathon-lee wants to merge 8 commits into deepspeedai:master from nathon-lee:fix_iss_7833
Conversation

@nathon-lee
Contributor

Title

Fix Stage 0 + Ulysses crash: make bwc_tensor_model_parallel_rank() resilient to MP API absence

Summary

This PR fixes a hard crash when using Ulysses sequence parallelism with ZeRO Stage 0 (BF16_Optimizer).
In this configuration, DeepSpeed calls deepspeed.utils.bwc.bwc_tensor_model_parallel_rank(mpu=...), and the mpu object passed in can be deepspeed.runtime.sequence_parallel.parallel_state_sp, which does not implement the deprecated get_model_parallel_rank() API. The current fallback path calls mpu.get_model_parallel_rank() unconditionally, raising AttributeError.

The fix adds a defensive capability check before calling the deprecated API. If the provided mpu does not expose any known tensor/model-parallel rank API, we treat it as “no tensor model parallelism” and return rank 0.

Motivation / Context

  • Affected scenario: Ulysses sequence parallel + ZeRO Stage 0
  • Failure mode: AttributeError: ... parallel_state_sp has no attribute get_model_parallel_rank
  • Root cause: bwc_tensor_model_parallel_rank() falls back to a deprecated API without a hasattr() check.

This change keeps the original priority order intact:

  1. get_tensor_model_parallel_rank()
  2. get_slice_parallel_rank()
  3. get_model_parallel_rank() (deprecated)
  4. fallback to 0 if none exist
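
The four-step priority order above can be sketched as a single guarded chain. This is an illustrative reconstruction, not the literal deepspeed/utils/bwc.py source:

```python
def bwc_tensor_model_parallel_rank(mpu=None):
    """Sketch of the patched fallback chain: try each known
    tensor/model-parallel rank API in priority order, and degrade
    to rank 0 when the mpu exposes none of them."""
    if mpu is None:
        return 0  # no model-parallel unit configured at all
    if hasattr(mpu, "get_tensor_model_parallel_rank"):
        return mpu.get_tensor_model_parallel_rank()   # 1. preferred API
    if hasattr(mpu, "get_slice_parallel_rank"):
        return mpu.get_slice_parallel_rank()          # 2. slice-parallel API
    if hasattr(mpu, "get_model_parallel_rank"):
        return mpu.get_model_parallel_rank()          # 3. deprecated API
    return 0  # 4. no TP rank API at all: treat as no tensor parallelism
```

The only behavioral change is step 3's hasattr() guard plus the final return 0; mpus implementing any of the three APIs still hit the same branch as before.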

Changes

  • deepspeed/utils/bwc.py
    • Update bwc_tensor_model_parallel_rank() to check hasattr(mpu, "get_model_parallel_rank") before calling it.
    • If mpu provides none of the expected tensor/model-parallel rank APIs, return 0 (no TP).

Why this is safe

  • For Megatron, DeepSpeed Topology, or any existing MPU that already implements get_tensor_model_parallel_rank(), get_slice_parallel_rank(), or get_model_parallel_rank(), behavior is unchanged.
  • The new code path only affects the previously-crashing case where the mpu object does not provide any of these methods.

Reproduction

Using the Ulysses ALST tutorial flow, switching the ZeRO stage from 3 to 0 triggers the crash during the optimizer step, when the grad norm is computed.

Testing

  • Existing unit tests should continue to pass.
  • Minimal repro: calling bwc_tensor_model_parallel_rank(mpu=deepspeed.runtime.sequence_parallel.parallel_state_sp) should no longer raise.
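The repro can be simulated without a distributed setup, since the crash is purely an attribute-lookup failure. Here parallel_state_sp is a hypothetical stand-in for deepspeed.runtime.sequence_parallel.parallel_state_sp: a plain module object exposing none of the tensor/model-parallel rank APIs:

```python
import types

# Stand-in for the real parallel_state_sp module (assumption: it lacks
# get_model_parallel_rank and the other TP rank APIs, which is the
# condition that triggers the reported AttributeError).
parallel_state_sp = types.ModuleType("parallel_state_sp")

# Pre-fix behavior: the fallback path calls the deprecated API blindly.
try:
    parallel_state_sp.get_model_parallel_rank()
    crashed = False
except AttributeError:
    crashed = True  # this is the Stage 0 + Ulysses crash

# Post-fix behavior: a capability check degrades gracefully to rank 0.
if hasattr(parallel_state_sp, "get_model_parallel_rank"):
    rank = parallel_state_sp.get_model_parallel_rank()
else:
    rank = 0

print(crashed, rank)
```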

Copilot AI and others added 8 commits February 27, 2026 06:30
This reverts commit ff88670.

Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Revert "fix: update 1 file reformatted." (ff88670)
Revert accidental Muon optimizer code re-introduction from copilot PRs
Add check for model parallel rank in mpu.
@nathon-lee nathon-lee changed the title Fix iss 7833 Fix issue 7833 Mar 6, 2026
@nathon-lee nathon-lee changed the title Fix issue 7833 Fix issue #7833 Mar 6, 2026
@nathon-lee nathon-lee changed the title Fix issue #7833 Fix Stage 0 + Ulysses crash: make bwc_tensor_model_parallel_rank() resilient to MP API absence Mar 6, 2026
@tohtana
Collaborator

tohtana commented Mar 6, 2026

Hi @nathon-lee,
Thank you for reporting!

I found that we already have a fallback from get_model_parallel_world_size to get_sequence_parallel_world_size.
This was introduced in #7649. Can you confirm that the latest version still raises the error?
