[Feature][AutoDeploy]: Explore optimal sharding configurations #11656

@greg-kwasniewski1

🚀 The feature, motivation and pitch

Currently, our sharding config is based on LLMArgs such as:

    sharding_source: ['manual', 'factory', 'heuristic']
    support_partial_config: true
    sharding_dims: ['tp', 'ep', 'bmm']
    shard_all_unprocessed: false
    dist_mapping: {'tp': 2, 'ep': 2}

There are still open questions regarding sharding, especially around MoE, including what the optimal strategy is for:

  • shared experts
  • latent projections (for MoLE)
  • MLA: latent projections

The PT backend does not expose these configurations. The only source of truth for sharding is the Mapping object. Figure out what the PT backend does with these nodes and whether this is truly optimal.

On the other hand, based on Pareto plots, we know that different parallel configurations are optimal depending on the throughput-latency tradeoff, transitioning from DEP to TEP to TP. Determine the inflection points and configure the runtime to switch configurations dynamically depending on runtime parameters.
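
One way to locate such switch points offline is sketched below, assuming we have benchmark measurements of latency and throughput per parallel configuration. Everything here is hypothetical and for illustration only: the `Measurement` class, the function names, and the numbers are made up and are not part of the AutoDeploy or PT backend API. The sketch builds the Pareto front, reports the latency budgets at which the best configuration changes, and picks a configuration for a given budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One benchmark point (hypothetical structure, not an AutoDeploy type)."""
    config: str         # label such as "DEP", "TEP", or "TP"
    latency_ms: float   # lower is better
    throughput: float   # higher is better


def pareto_front(points):
    """Return the non-dominated points, sorted by increasing latency."""
    front = []
    best_throughput = float("-inf")
    # Walk points from lowest latency upward; keep a point only if it
    # improves on the best throughput seen so far.
    for p in sorted(points, key=lambda m: (m.latency_ms, -m.throughput)):
        if p.throughput > best_throughput:
            front.append(p)
            best_throughput = p.throughput
    return front


def switch_points(front):
    """Latency values at which the best configuration changes along the front."""
    return [
        (cur.latency_ms, prev.config, cur.config)
        for prev, cur in zip(front, front[1:])
        if prev.config != cur.config
    ]


def pick_config(front, latency_budget_ms):
    """Highest-throughput configuration on the front that meets the budget."""
    feasible = [p for p in front if p.latency_ms <= latency_budget_ms]
    return feasible[-1].config if feasible else None


# Made-up numbers purely for illustration; these are not measured results.
measurements = [
    Measurement("TP", 10.0, 100.0),
    Measurement("TP", 15.0, 150.0),
    Measurement("TEP", 18.0, 140.0),  # dominated by TP at 15 ms
    Measurement("TEP", 20.0, 260.0),
    Measurement("DEP", 25.0, 240.0),  # dominated by TEP at 20 ms
    Measurement("DEP", 40.0, 500.0),
]

front = pareto_front(measurements)
transitions = switch_points(front)
```

A runtime policy could then call something like `pick_config(front, latency_budget_ms)` whenever the latency target changes. Whether the switch can happen without re-sharding weights is a separate question that this sketch does not address.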


Metadata

Labels: AutoDeploy (AutoDeploy Backend), feature request (New feature or request. This includes new model, dtype, functionality support)

Status: Backlog