Labels: AutoDeploy, <NV> AutoDeploy Backend, feature request (new feature or request, including new model, dtype, or functionality support)
Description
🚀 The feature, motivation and pitch
Currently, our sharding config is based on LLMArgs such as:

```yaml
sharding_source: ['manual', 'factory', 'heuristic']
support_partial_config: true
sharding_dims: ['tp', 'ep', 'bmm']
shard_all_unprocessed: false
dist_mapping: {'tp': 2, 'ep': 2}
```
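As a minimal sketch of how such a config could be sanity-checked, the snippet below expresses the fields above as a plain dict with a small validator. The field names mirror the LLMArgs shown in this issue, but the validation rules themselves are illustrative assumptions, not the actual AutoDeploy implementation.

```python
# Hypothetical sketch: the sharding config above as a plain dict plus a
# small consistency check. Validation rules are illustrative assumptions.
VALID_SOURCES = {"manual", "factory", "heuristic"}
VALID_DIMS = {"tp", "ep", "bmm"}

def validate_sharding_config(cfg: dict) -> None:
    """Raise ValueError if the sharding config is internally inconsistent."""
    unknown_sources = set(cfg["sharding_source"]) - VALID_SOURCES
    if unknown_sources:
        raise ValueError(f"unknown sharding sources: {unknown_sources}")
    unknown_dims = set(cfg["sharding_dims"]) - VALID_DIMS
    if unknown_dims:
        raise ValueError(f"unknown sharding dims: {unknown_dims}")
    # Every dim with a parallel degree in dist_mapping must be enabled.
    for dim in cfg["dist_mapping"]:
        if dim not in cfg["sharding_dims"]:
            raise ValueError(f"dist_mapping has degree for disabled dim {dim!r}")

config = {
    "sharding_source": ["manual", "factory", "heuristic"],
    "support_partial_config": True,
    "sharding_dims": ["tp", "ep", "bmm"],
    "shard_all_unprocessed": False,
    "dist_mapping": {"tp": 2, "ep": 2},
}
validate_sharding_config(config)  # passes for the config shown above
```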
There are still open questions regarding sharding, especially around MoE, and what is the optimal strategy for:
- shared experts
- latent projections (for MoLE)
- MLA: latent projections
The PT backend does not expose these configurations; its only source of truth for sharding is the `Mapping` object. Figure out what the PT backend does with these nodes and whether that is truly optimal.
On the other hand, Pareto plots show that, depending on the throughput-latency tradeoff, different parallel configurations are optimal, transitioning from DEP to TEP to TP. Determine the inflection points and configure the runtime to switch configurations dynamically based on runtime parameters.
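The dynamic-switching idea can be sketched as a simple threshold policy. The thresholds and the use of batch size as the runtime signal are made-up placeholders here; real inflection points would come from the Pareto sweeps mentioned above.

```python
# Illustrative sketch only: picking a parallel mode (DEP / TEP / TP) from a
# runtime parameter such as concurrent batch size. The threshold values are
# placeholder assumptions, not measured inflection points.
def select_parallel_mode(batch_size: int,
                         dep_to_tep: int = 64,
                         tep_to_tp: int = 8) -> str:
    """Return the parallel mode suited to the current load.

    Large batches favor throughput (DEP), small batches favor latency (TP),
    with TEP in between, matching the DEP -> TEP -> TP transition the issue
    describes.
    """
    if batch_size >= dep_to_tep:
        return "DEP"
    if batch_size >= tep_to_tp:
        return "TEP"
    return "TP"

# At high concurrency stay in DEP; at low concurrency switch toward TP.
assert select_parallel_mode(128) == "DEP"
assert select_parallel_mode(16) == "TEP"
assert select_parallel_mode(2) == "TP"
```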
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
Status: Backlog