[WIP] Merging AutoSP into DeepSpeed #7860
neeldani wants to merge 18 commits into deepspeedai:master
Conversation
Hi @neeldani, Since this is a large PR, let’s proceed step by step. Here are my suggestions:
@tohtana thank you for the feedback:
Please let me know if there are any hiccups running AutoSP or any questions related to the design - happy to discuss them on this PR.
deepspeed/runtime/engine.py
Outdated
    "DeepCompile with ZeRO stage 3 is not currently supported on PyTorch >= 2.9. "
    "Please use ZeRO stage 1 or 2 with DeepCompile, or disable DeepCompile for ZeRO stage 3.")
backend = init_z3(self, backend, compile_config, compile_kwargs, schedule)
elif self.zero_optimization_stage() == ZeroStageEnum.disabled:
Currently, do we enable AutoSP with the following?
- set zero stage to 0
- Enable deepcompile
If so, I think we should make it more explicit.
For AutoTP (see example), we do
"tensor_parallel": {
"autotp_size": args.tp_size,
...
}
For AutoEP proposal,
"expert_parallel": {
"autoep_size": args.ep_size,
...
}
AutoEP is currently just a proposal, but how about making the config
"sequence_parallel": {
"autosp_size": args.sp_size,
...
}
You may want to require DeepCompile to be enabled too. As we don't have eager AutoSP now, it might be good to enable DeepCompile automatically when sequence_parallel is enabled.
Here is the flow for enabling AutoSP (and its interoperability with ZeRO-1 DP):
- The user specifies in a config the pertinent compiler passes and their parameters. This looks like the following:

"compile": {
    "deepcompile": true,
    "passes": ["autosp"],
    "sp_size": 2,
    "dp_size": 1
}

This specifies what the SP and DP sizes should be.
Next, if the DP size is larger than one, the user can opt to turn on zero-1. The entire configuration would look like this:
"zero_optimization": {
    "stage": 1
},
"compile": {
    "deepcompile": true,
    "passes": ["autosp"],
    "sp_size": 2,
    "dp_size": 1
}

This uses the legacy config style to specify the zero_optimization stage, and accordingly composes SP with ZeRO-1 DP. Note, however, that this is not the ZeRO-1 DP from DeepCompile, but rather the ZeRO-1 DP originally implemented in DeepSpeed.
Here, I have currently opted to make both sp_size and dp_size explicitly controllable by the user. Another option is to automatically infer the DP size from the SP size by computing dp_size = num_devices / sp_size.
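The inference alternative mentioned above can be sketched in a few lines. This is a minimal illustration, not DeepSpeed's actual API; the function name and validation are made up for the example.

```python
# Sketch of inferring dp_size from the world size instead of configuring
# both sizes explicitly. Illustrative only - not DeepSpeed's real config API.
def infer_dp_size(world_size: int, sp_size: int) -> int:
    if sp_size <= 0 or world_size % sp_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by sp_size ({sp_size})")
    return world_size // sp_size

# e.g. 8 devices with sp_size=2 yields dp_size=4
print(infer_dp_size(8, 2))  # → 4
```

One advantage of inferring dp_size is that the config cannot silently disagree with the launch topology; the trade-off is losing the ability to leave some devices out of the data-parallel group.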
For a sample config, here is the deepspeed config in DeepSpeedExamples (link here).
Thank you, @spikerheado1234!
Can you clarify a few points?
- Can we automatically determine dp_size based on world size and sp size?
- It feels a bit odd to have "sp_size" as part of the compile config. Shouldn't we make a new item for setting arguments to compiler passes? (I also want to get thoughts from @sfc-gh-truwase)
- In the current SP (Ulysses), the size for ZeRO sharding is dp_size * sp_size. Is this the same for AutoSP?
- Can you add tests to run the matrix of sp_size * dp_size * zero_stage? It will clarify what we support and guarantee that it works.
I also find it odd to include both sp_size and dp_size in ds_config. To my knowledge, dp_size is implicitly derived from world_size, and sp_size is subsequently inferred from dp_size.
I am aware that dp_size and sp_size are a common source of confusion in SP, so this probably requires offline discussion for AutoSP.
Thank you @neeldani for the update! As we don't have many changes in existing code, I don't think there is much risk. We should also have clear assertions that terminate early when we hit these limitations: Attention Pattern Matching and No Graph Break Requirement.
Patch Zero-1 interoperability when using AutoSP.
Hi @tohtana, just merged in the code that correctly enables ZeRO-1 and AutoSP interoperability.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4fdc54b84a
def _all_to_all_backward(ctx, grad):
    return (
        all_to_all(grad, ctx.scatter_idx, ctx.gather_idx, ctx.name),
        None, None, None, None
Return one gradient per autosp::all_to_all input
The custom op autosp::all_to_all takes 4 inputs (input, scatter_idx, gather_idx, name), but _all_to_all_backward returns 5 gradient slots because of an extra trailing None. During backward through this op, autograd expects the gradient tuple arity to match inputs, so training paths that hit this op will fail at runtime.
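A hedged sketch of the fix this review points at: the backward must return exactly one gradient slot per forward input of autosp::all_to_all (input, scatter_idx, gather_idx, name), i.e. 4 entries instead of 5. The all_to_all stub and ctx object below are stand-ins for illustration only.

```python
# Illustrative fix for the gradient-arity mismatch: four inputs means
# four gradient slots. all_to_all and _Ctx are stubs, not the real op.
def all_to_all(grad, scatter_idx, gather_idx, name):
    return grad  # stand-in for the real collective

class _Ctx:
    scatter_idx, gather_idx, name = 1, 2, "attn_a2a"

def _all_to_all_backward(ctx, grad):
    return (
        all_to_all(grad, ctx.scatter_idx, ctx.gather_idx, ctx.name),
        None, None, None,  # non-differentiable inputs receive no gradient
    )

grads = _all_to_all_backward(_Ctx, [0.1, 0.2])
print(len(grads))  # → 4, matching the op's four inputs
```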
Thanks @tohtana, I added checks for the invariants. Attention pattern matching is handled here. For the no graph break requirement, I am forcing a full graph capture via torch.compile's
I think using
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
Overview
AutoSP is a compiler optimization pass that shards inputs along the sequence dimension and enables Ulysses-style sequence parallelism while preventing graph breaks during torch.compile(). All the passes operate on the Torch IR of the forward graph.

API Design
User-Facing Entry Point: prepare_autosp_inputs()

Users must explicitly call this function to prepare inputs for AutoSP compilation:
Purpose: Symbolize sequence dimension and annotate tensors for identification.
Operations:
- Call torch._dynamo.decorators.mark_dynamic() on the sequence dimension
- Set input_id.tag = constants.INPUT_ID_KEY
- Set label_id.tag = constants.LABEL_ID_KEY
- Set position_id.tag = constants.POSITION_ID_KEY (if provided)

Rationale: PyTorch's FX graph tracer requires explicit annotation of data-dependent dimensions. Marking the sequence dimension as dynamic prevents symbolic shape propagation from losing dimension information through reshape/view operations.
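The operations above can be sketched in pure Python (no torch dependency here). In the real implementation, mark_dynamic would be torch._dynamo.decorators.mark_dynamic and the tag values would come from the constants module; every name below is illustrative.

```python
# Pure-Python sketch of the input-preparation flow. FakeTensor, mark_dynamic,
# and the string tag values are stand-ins for torch.Tensor,
# torch._dynamo.decorators.mark_dynamic, and the constants module.
class FakeTensor:
    """Stand-in for torch.Tensor, just enough to show the tagging."""
    def __init__(self, shape):
        self.shape = shape
        self.dynamic_dims = set()
        self.tag = None

def mark_dynamic(t, dim):
    # Mimics marking a dimension as dynamic for the tracer
    t.dynamic_dims.add(dim)

def prepare_autosp_inputs(input_id, label_id, position_id=None, seq_dim=1):
    mark_dynamic(input_id, seq_dim)     # symbolize the sequence dimension
    input_id.tag = "INPUT_ID_KEY"       # constants.INPUT_ID_KEY in real code
    label_id.tag = "LABEL_ID_KEY"       # constants.LABEL_ID_KEY
    if position_id is not None:
        position_id.tag = "POSITION_ID_KEY"  # constants.POSITION_ID_KEY
    return input_id, label_id, position_id

ids = FakeTensor((4, 1024))
labels = FakeTensor((4, 1024))
prepare_autosp_inputs(ids, labels)
print(ids.tag, ids.dynamic_dims)  # → INPUT_ID_KEY {1}
```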
Compilation Passes
Pass 1: pass_shard_seq_dim()

Objective: Propagate the sharded sequence dimension to all consumers.
Algorithm:
- Read the shape metadata of the tagged input_id node
- Replace uses of the full sequence dimension with seq_dim / world_size

Rationale: Reshapes and views that consume the sequence dimension as an argument do not get updated during propagation of symbolic shapes. This pass explicitly rewires the computation graph to use sharded dimensions, enabling proper shape inference downstream.
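A toy illustration of the rewiring this pass performs: any node that consumes the full sequence length as a literal reshape/view argument is rewritten to use the sharded length. The dict-based "graph" is a stand-in for the Torch FX graph; the real pass walks FX nodes.

```python
# Toy FX-like rewrite: replace literal uses of the full sequence length
# in reshape/view arguments with seq_len // world_size.
SEQ_LEN, WORLD_SIZE = 1024, 4

graph = [
    {"op": "view", "args": (8, SEQ_LEN, 64)},   # consumes the seq dim
    {"op": "add", "args": ()},                  # shape-agnostic, untouched
]

def shard_seq_dim(graph, seq_len, world_size):
    sharded = seq_len // world_size
    for node in graph:
        if node["op"] in ("view", "reshape"):
            node["args"] = tuple(sharded if a == seq_len else a
                                 for a in node["args"])
    return graph

shard_seq_dim(graph, SEQ_LEN, WORLD_SIZE)
print(graph[0]["args"])  # → (8, 256, 64)
```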
Pass 2: pass_shard_input_ids() / pass_shard_label_ids() / pass_shard_position_ids()

Objective: Insert slicing operations after input tensors.
Implementation: Call the shard_tensor_node() utility, which inserts slice operations. Each rank retains only the portion of the tensor corresponding to its sequence partition and drops the remaining buffer.
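The slicing that shard_tensor_node() inserts amounts to each rank keeping its contiguous chunk of the sequence. A minimal sketch, with a list standing in for a tensor (the real pass emits slice ops in the FX graph):

```python
# Each rank keeps only the contiguous slice of the sequence assigned to it
# and drops the rest, mirroring the inserted slice operations.
def shard_along_seq(tokens, rank, world_size):
    chunk = len(tokens) // world_size
    return tokens[rank * chunk:(rank + 1) * chunk]

seq = list(range(8))               # full sequence of 8 token positions
print(shard_along_seq(seq, 1, 4))  # → [2, 3]  (rank 1 of 4)
```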
Note on attention_mask: Not sharded, because it applies to the full sequence length, not the partitioned dimension.

Pass 3: pass_insert_attention_all_to_all()

Objective: Insert all-to-all collectives around attention (Ulysses-style) to avoid graph breaks during compilation.
Algorithm:
Graph Rewrite Example:
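A toy sketch of the Ulysses data movement this rewrite introduces, with nested lists standing in for tensors and plain Python standing in for the real collectives. Before attention, each rank holds its sequence shard for all heads; the inserted all-to-all regroups data so each rank holds the full sequence for a subset of heads, attention runs, and a reverse all-to-all restores the sequence sharding. Here sp_size equals the number of head groups (2); all names are illustrative.

```python
# Toy all-to-all: rank r sends its chunk for head h to rank h, so each
# rank ends up with the FULL sequence for one head group.
WORLD = 2

# shards[rank][head] = token positions held on that rank
shards = [
    {0: [0, 1], 1: [0, 1]},   # rank 0: seq positions 0-1, heads 0 and 1
    {0: [2, 3], 1: [2, 3]},   # rank 1: seq positions 2-3, heads 0 and 1
]

def all_to_all_seq_to_head(shards):
    return [
        {h: sorted(sum((shards[r][h] for r in range(WORLD)), []))}
        for h in range(WORLD)
    ]

gathered = all_to_all_seq_to_head(shards)
print(gathered[0])  # → {0: [0, 1, 2, 3]}: rank 0 has the full seq for head 0
# scaled_dot_product_attention would run here on each rank, followed by a
# reverse all-to-all that restores sequence sharding across all heads.
```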
Current support: Only torch.nn.functional.scaled_dot_product_attention() is supported. Composite attention patterns require additional pattern-matching logic.

Pass 4: pass_propagate_shapes()

Objective: Compute static shapes for all nodes using fake tensor execution.

Implementation:
- Create a ShapeEnv for symbolic dimension tracking
- Create a FakeTensorMode with the shape environment
- Run FakeTensorProp.propagate() to compute shape metadata
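The implementation steps above can be sketched with PyTorch's public FX utilities. This is a standalone illustration of the ShapeEnv / FakeTensorMode / FakeTensorProp combination on a tiny module, not AutoSP's actual pass; the example module is made up.

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.experimental.symbolic_shapes import ShapeEnv
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.passes.fake_tensor_prop import FakeTensorProp

class Block(torch.nn.Module):  # toy module for illustration
    def forward(self, x):
        return torch.relu(x).reshape(-1, 16)

gm = symbolic_trace(Block())

# ShapeEnv tracks symbolic dimensions; FakeTensorMode executes the graph
# on meta-like fake tensors; FakeTensorProp records shapes into node.meta.
shape_env = ShapeEnv()
fake_mode = FakeTensorMode(shape_env=shape_env)
FakeTensorProp(gm, mode=fake_mode).propagate(torch.empty(4, 16))

for node in gm.graph.nodes:
    print(node.name, getattr(node.meta.get("val"), "shape", None))
```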
Pass 5: pass_canonicalize()

Objective: Finalize the graph representation.

Operations:
- eliminate_dead_code(): Remove unused operations
- lint(): Validate graph structure
- recompile(): Regenerate the compiled representation

Execution Order
Reducing gradients across ranks
AutoSP requires an all-reduce to reduce the gradients across ranks. This is automatically called by DeepSpeed's engine here.
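A toy illustration of the reduction AutoSP relies on: every rank contributes its local gradient and all ranks receive the averaged result. In DeepSpeed this is a real collective issued by the engine, not Python; the averaging convention below is an assumption for the sketch.

```python
# Toy all-reduce (mean) over per-rank gradient vectors.
def all_reduce_mean(per_rank_grads):
    world = len(per_rank_grads)
    reduced = [sum(vals) / world for vals in zip(*per_rank_grads)]
    return [list(reduced) for _ in range(world)]  # every rank gets the result

grads = [[1.0, 2.0], [3.0, 4.0]]  # local gradients on rank 0 and rank 1
print(all_reduce_mean(grads))  # → [[2.0, 3.0], [2.0, 3.0]]
```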
Known Limitations
- Only torch.nn.functional.scaled_dot_product_attention() is supported. Fused attention implementations require pattern-specific handling.

Example
DeepSpeedExample PR: deepspeedai/DeepSpeedExamples#999