[WIP] Merging AutoSP into DeepSpeed #7860
neeldani wants to merge 18 commits into deepspeedai:master
Conversation
Hi @neeldani, Since this is a large PR, let’s proceed step by step. Here are my suggestions:
@tohtana thank you for the feedback:
Please let me know if there are any hiccups running AutoSP or any questions related to the design - happy to discuss them on this PR.
deepspeed/runtime/engine.py
Outdated
    "DeepCompile with ZeRO stage 3 is not currently supported on PyTorch >= 2.9. "
    "Please use ZeRO stage 1 or 2 with DeepCompile, or disable DeepCompile for ZeRO stage 3.")
backend = init_z3(self, backend, compile_config, compile_kwargs, schedule)
elif self.zero_optimization_stage() == ZeroStageEnum.disabled:
Currently, do we enable AutoSP with the following?
- set zero stage to 0
- Enable deepcompile
If so, I think we should make it more explicit.
For AutoTP (see example), we do
"tensor_parallel": {
"autotp_size": args.tp_size,
...
}
For AutoEP proposal,
"expert_parallel": {
"autoep_size": args.ep_size,
...
}
AutoEP is currently just a proposal, but how about making the config
"sequence_parallel": {
"autosp_size": args.sp_size,
...
}
You may want to require DeepCompile to be enabled too. As we don't have eager AutoSP now, it might be good to enable DeepCompile automatically when sequence_parallel is enabled.
Here is the flow for enabling AutoSP (and its interoperability with ZeRO-1 DP):
- The user specifies in a config the pertinent compiler passes and their parameters. This looks like the following:

"compile": {
    "deepcompile": true,
    "passes": ["autosp"],
    "sp_size": 2,
    "dp_size": 1
}

This specifies what the SP and DP sizes should be.
Next, if the DP size is larger than one, the user can opt to turn on zero-1. The entire configuration would look like this:
"zero_optimization": {
    "stage": 1
},
"compile": {
    "deepcompile": true,
    "passes": ["autosp"],
    "sp_size": 2,
    "dp_size": 1
}

This uses the legacy config style to specify the zero_optimization stage, and accordingly composes SP with ZeRO-1 DP. Note, however, that this is not the ZeRO-1 DP from DeepCompile, but rather the ZeRO-1 DP originally implemented in DeepSpeed.
Here, I have currently opted to make both sp_size and dp_size explicitly controllable by the user. Another option is to automatically infer the DP size from the SP size by computing dp_size = num_devices / sp_size.
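The inference alternative mentioned above can be sketched in a few lines. This is a minimal illustration, not DeepSpeed's actual API; the function name and validation are made up for the example.

```python
# Sketch of inferring dp_size from the world size instead of configuring
# both sizes explicitly. Illustrative only - not DeepSpeed's real config API.
def infer_dp_size(world_size: int, sp_size: int) -> int:
    if sp_size <= 0 or world_size % sp_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by sp_size ({sp_size})")
    return world_size // sp_size

# e.g. 8 devices with sp_size=2 yields dp_size=4
print(infer_dp_size(8, 2))  # → 4
```

One advantage of inferring dp_size is that the config cannot silently disagree with the launch topology; the trade-off is losing the ability to leave some devices out of the data-parallel group.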
For a sample config, here is the deepspeed config in DeepSpeedExamples (link here).
Thank you, @spikerheado1234!
Can you clarify a few points?
- Can we automatically determine dp_size based on world size and sp size?
- It feels a bit odd to have "sp_size" as part of the compile config. Shouldn't we make a new item for setting arguments to compiler passes? (I also want to get thoughts from @sfc-gh-truwase)
- In the current SP (Ulysses), the size for ZeRO sharding is dp_size * sp_size. Is this the same for AutoSP?
- Can you add tests to run the matrix of sp_size * dp_size * zero_stage? It will clarify what we support and guarantee that it works.
I also find it odd to include both sp_size and dp_size in ds_config. To my knowledge, dp_size is implicitly derived from world_size, and sp_size is subsequently inferred from dp_size.
I am aware that dp_size and sp_size are a common source of confusion in SP, so this probably requires offline discussion for AutoSP.
Thank you @neeldani for the update! As we don't have many changes in existing code, I don't think there is much risk. We should also have clear assertions that terminate early when we hit these limitations: Attention Pattern Matching and No Graph Break Requirement.
Patch Zero-1 interoperability when using AutoSP.
Hi @tohtana, just merged in the code that correctly enables ZeRO-1 and AutoSP interoperability.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4fdc54b84a
def _all_to_all_backward(ctx, grad):
    return (
        all_to_all(grad, ctx.scatter_idx, ctx.gather_idx, ctx.name),
        None, None, None, None
Return one gradient per autosp::all_to_all input
The custom op autosp::all_to_all takes 4 inputs (input, scatter_idx, gather_idx, name), but _all_to_all_backward returns 5 gradient slots because of an extra trailing None. During backward through this op, autograd expects the gradient tuple arity to match inputs, so training paths that hit this op will fail at runtime.
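A hedged sketch of the fix this review points at: the backward must return exactly one gradient slot per forward input of autosp::all_to_all (input, scatter_idx, gather_idx, name), i.e. 4 entries instead of 5. The all_to_all stub and ctx object below are stand-ins for illustration only.

```python
# Illustrative fix for the gradient-arity mismatch: four inputs means
# four gradient slots. all_to_all and _Ctx are stubs, not the real op.
def all_to_all(grad, scatter_idx, gather_idx, name):
    return grad  # stand-in for the real collective

class _Ctx:
    scatter_idx, gather_idx, name = 1, 2, "attn_a2a"

def _all_to_all_backward(ctx, grad):
    return (
        all_to_all(grad, ctx.scatter_idx, ctx.gather_idx, ctx.name),
        None, None, None,  # non-differentiable inputs receive no gradient
    )

grads = _all_to_all_backward(_Ctx, [0.1, 0.2])
print(len(grads))  # → 4, matching the op's four inputs
```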
Thanks @tohtana, I added checks for the invariants. Attention pattern matching is handled here. For the no graph break requirement, I am forcing a full graph capture via torch.compile's
I think using
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
Overview
AutoSP is a compiler optimization pass that shards inputs along the sequence dimension and enables Ulysses-style sequence parallelism while preventing graph breaks during torch.compile(). All the passes operate on the Torch IR of the forward graph.

API Design
User-Facing Entry Point: prepare_autosp_inputs()

Users must explicitly call this function to prepare inputs for AutoSP compilation:
Purpose: Symbolize sequence dimension and annotate tensors for identification.
Operations:
- Call torch._dynamo.decorators.mark_dynamic() on the sequence dimension
- Set input_id.tag = constants.INPUT_ID_KEY
- Set label_id.tag = constants.LABEL_ID_KEY
- Set position_id.tag = constants.POSITION_ID_KEY (if provided)

Rationale: PyTorch's FX graph tracer requires explicit annotation of data-dependent dimensions. Marking the sequence dimension as dynamic prevents symbolic shape propagation from losing dimension information through reshape/view operations.
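The operations above can be sketched in pure Python (no torch dependency here). In the real implementation, mark_dynamic would be torch._dynamo.decorators.mark_dynamic and the tag values would come from the constants module; every name below is illustrative.

```python
# Pure-Python sketch of the input-preparation flow. FakeTensor, mark_dynamic,
# and the string tag values are stand-ins for torch.Tensor,
# torch._dynamo.decorators.mark_dynamic, and the constants module.
class FakeTensor:
    """Stand-in for torch.Tensor, just enough to show the tagging."""
    def __init__(self, shape):
        self.shape = shape
        self.dynamic_dims = set()
        self.tag = None

def mark_dynamic(t, dim):
    # Mimics marking a dimension as dynamic for the tracer
    t.dynamic_dims.add(dim)

def prepare_autosp_inputs(input_id, label_id, position_id=None, seq_dim=1):
    mark_dynamic(input_id, seq_dim)     # symbolize the sequence dimension
    input_id.tag = "INPUT_ID_KEY"       # constants.INPUT_ID_KEY in real code
    label_id.tag = "LABEL_ID_KEY"       # constants.LABEL_ID_KEY
    if position_id is not None:
        position_id.tag = "POSITION_ID_KEY"  # constants.POSITION_ID_KEY
    return input_id, label_id, position_id

ids = FakeTensor((4, 1024))
labels = FakeTensor((4, 1024))
prepare_autosp_inputs(ids, labels)
print(ids.tag, ids.dynamic_dims)  # → INPUT_ID_KEY {1}
```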
Compilation Passes
Pass 1: pass_shard_seq_dim()

Objective: Propagate the sharded sequence dimension to all consumers.
Algorithm:
- Read the shape metadata of the tagged input_id node
- Replace uses of the full sequence dimension with seq_dim / world_size

Rationale: Reshapes and views that consume the sequence dimension as an argument do not get updated during propagation of symbolic shapes. This pass explicitly rewires the computation graph to use sharded dimensions, enabling proper shape inference downstream.
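A toy illustration of the rewiring this pass performs: any node that consumes the full sequence length as a literal reshape/view argument is rewritten to use the sharded length. The dict-based "graph" is a stand-in for the Torch FX graph; the real pass walks FX nodes.

```python
# Toy FX-like rewrite: replace literal uses of the full sequence length
# in reshape/view arguments with seq_len // world_size.
SEQ_LEN, WORLD_SIZE = 1024, 4

graph = [
    {"op": "view", "args": (8, SEQ_LEN, 64)},   # consumes the seq dim
    {"op": "add", "args": ()},                  # shape-agnostic, untouched
]

def shard_seq_dim(graph, seq_len, world_size):
    sharded = seq_len // world_size
    for node in graph:
        if node["op"] in ("view", "reshape"):
            node["args"] = tuple(sharded if a == seq_len else a
                                 for a in node["args"])
    return graph

shard_seq_dim(graph, SEQ_LEN, WORLD_SIZE)
print(graph[0]["args"])  # → (8, 256, 64)
```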
Pass 2: pass_shard_input_ids() / pass_shard_label_ids() / pass_shard_position_ids()

Objective: Insert slicing operations after input tensors.
Implementation: Call the shard_tensor_node() utility, which inserts slice operations. Each rank retains only the portion of the tensor corresponding to its sequence partition and drops the remaining buffer.
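The slicing that shard_tensor_node() inserts amounts to each rank keeping its contiguous chunk of the sequence. A minimal sketch, with a list standing in for a tensor (the real pass emits slice ops in the FX graph):

```python
# Each rank keeps only the contiguous slice of the sequence assigned to it
# and drops the rest, mirroring the inserted slice operations.
def shard_along_seq(tokens, rank, world_size):
    chunk = len(tokens) // world_size
    return tokens[rank * chunk:(rank + 1) * chunk]

seq = list(range(8))               # full sequence of 8 token positions
print(shard_along_seq(seq, 1, 4))  # → [2, 3]  (rank 1 of 4)
```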
Note on attention_mask: Not sharded, because it applies to the full sequence length, not the partitioned dimension.

Pass 3: pass_insert_attention_all_to_all()

Objective: Insert all-to-all collectives around attention (Ulysses-style) to avoid graph breaks during compilation.
Algorithm:
Graph Rewrite Example:
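A toy sketch of the Ulysses data movement this rewrite introduces, with nested lists standing in for tensors and plain Python standing in for the real collectives. Before attention, each rank holds its sequence shard for all heads; the inserted all-to-all regroups data so each rank holds the full sequence for a subset of heads, attention runs, and a reverse all-to-all restores the sequence sharding. Here sp_size equals the number of head groups (2); all names are illustrative.

```python
# Toy all-to-all: rank r sends its chunk for head h to rank h, so each
# rank ends up with the FULL sequence for one head group.
WORLD = 2

# shards[rank][head] = token positions held on that rank
shards = [
    {0: [0, 1], 1: [0, 1]},   # rank 0: seq positions 0-1, heads 0 and 1
    {0: [2, 3], 1: [2, 3]},   # rank 1: seq positions 2-3, heads 0 and 1
]

def all_to_all_seq_to_head(shards):
    return [
        {h: sorted(sum((shards[r][h] for r in range(WORLD)), []))}
        for h in range(WORLD)
    ]

gathered = all_to_all_seq_to_head(shards)
print(gathered[0])  # → {0: [0, 1, 2, 3]}: rank 0 has the full seq for head 0
# scaled_dot_product_attention would run here on each rank, followed by a
# reverse all-to-all that restores sequence sharding across all heads.
```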
Current support: Only torch.nn.functional.scaled_dot_product_attention() is supported. Composite attention patterns require additional pattern-matching logic.

Pass 4: pass_propagate_shapes()

Objective: Compute static shapes for all nodes using fake tensor execution.

Implementation:
- Create a ShapeEnv for symbolic dimension tracking
- Create a FakeTensorMode with the shape environment
- Run FakeTensorProp.propagate() to compute shape metadata
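The implementation steps above can be sketched with PyTorch's public FX utilities. This is a standalone illustration of the ShapeEnv / FakeTensorMode / FakeTensorProp combination on a tiny module, not AutoSP's actual pass; the example module is made up.

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.experimental.symbolic_shapes import ShapeEnv
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.passes.fake_tensor_prop import FakeTensorProp

class Block(torch.nn.Module):  # toy module for illustration
    def forward(self, x):
        return torch.relu(x).reshape(-1, 16)

gm = symbolic_trace(Block())

# ShapeEnv tracks symbolic dimensions; FakeTensorMode executes the graph
# on meta-like fake tensors; FakeTensorProp records shapes into node.meta.
shape_env = ShapeEnv()
fake_mode = FakeTensorMode(shape_env=shape_env)
FakeTensorProp(gm, mode=fake_mode).propagate(torch.empty(4, 16))

for node in gm.graph.nodes:
    print(node.name, getattr(node.meta.get("val"), "shape", None))
```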
Pass 5: pass_canonicalize()

Objective: Finalize the graph representation.

Operations:
- eliminate_dead_code(): Remove unused operations
- lint(): Validate graph structure
- recompile(): Regenerate the compiled representation

Execution Order
Reducing gradients across ranks
AutoSP requires an all-reduce to reduce the gradients across ranks. This is automatically called by DeepSpeed's engine here.
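A toy illustration of the reduction AutoSP relies on: every rank contributes its local gradient and all ranks receive the averaged result. In DeepSpeed this is a real collective issued by the engine, not Python; the averaging convention below is an assumption for the sketch.

```python
# Toy all-reduce (mean) over per-rank gradient vectors.
def all_reduce_mean(per_rank_grads):
    world = len(per_rank_grads)
    reduced = [sum(vals) / world for vals in zip(*per_rank_grads)]
    return [list(reduced) for _ in range(world)]  # every rank gets the result

grads = [[1.0, 2.0], [3.0, 4.0]]  # local gradients on rank 0 and rank 1
print(all_reduce_mean(grads))  # → [[2.0, 3.0], [2.0, 3.0]]
```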
Known Limitations
- Only torch.nn.functional.scaled_dot_product_attention() is supported. Fused attention implementations require pattern-specific handling.

Example
DeepSpeedExample PR: deepspeedai/DeepSpeedExamples#999