
Add elastic EP for dispatch/combine flows#175

Open
maning00 wants to merge 4 commits into main from feat_elastic_ep

Conversation

@maning00 (Contributor)

Motivation

This PR adds elastic EP behavior to dispatch_combine so jobs can continue making progress when some ranks or nodes become unresponsive, instead of hanging on cross-rank synchronization.
The goal is to improve robustness for internode MoE dispatch/combine under partial-failure scenarios while keeping the existing behavior unchanged when elastic mode is not enabled.

Technical Details

  • Added optional elastic EP state (active_ranks) and timeout (timeout_us) plumbing from Python API -> pybind -> C++ handle -> device kernels.
  • Introduced device-side elastic utilities in device_primitives (rank active checks, rank deactivation, wait-with-timeout helpers) and applied them across internode/intranode/low-latency dispatch+combine synchronization points.
  • Updated EpDispatchCombineHandle to store elastic state and wall-clock rate, and converted timeout from microseconds to device wall-clock ticks.
  • Renamed wall-clock helper API from MHz naming to KHz naming for correctness/clarity:
    • get_cur_device_wall_clock_freq_mhz -> get_cur_device_wall_clock_freq_khz
    • corresponding C++ helper rename in HIP utils and Python kernel profiler usage.
  • Extended Python EpDispatchCombineOp interfaces (dispatch* / combine*) with optional active_ranks and timeout_us parameters.
  • Added/expanded tests and example coverage for elastic behavior:
    • New elastic EP test path in tests/python/ops/test_dispatch_combine.py (test_dispatch_combine_elastic_ep)
    • Updated internode example test to support dropout simulation (drop_rank) and timeout-based inactive rank handling.
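The microsecond-to-tick conversion and the wait-with-timeout pattern described in the bullets above can be modeled on the host like this. This is only an illustrative sketch: the names `to_ticks` and `wait_until_or_timeout` are stand-ins, not the actual mori device primitives, and the real helpers run on-device against the HIP wall clock.

```python
# Host-side Python model of the elastic timeout plumbing described above.
# All names here are illustrative, not the actual mori API.

def to_ticks(timeout_us: int, wall_clock_freq_khz: int) -> int:
    """Convert a timeout in microseconds to device wall-clock ticks.

    A kHz clock advances freq_khz ticks per millisecond, i.e.
    freq_khz / 1000 ticks per microsecond.
    """
    return timeout_us * wall_clock_freq_khz // 1000


def wait_until_or_timeout(predicate, now_ticks, timeout_ticks, active, peer):
    """Spin until predicate() holds; on timeout, mark `peer` inactive.

    `now_ticks` is a callable standing in for the device wall clock.
    Returns True if the predicate became true, False on timeout.
    """
    deadline = now_ticks() + timeout_ticks
    while not predicate():
        if timeout_ticks >= 0 and now_ticks() > deadline:
            active[peer] = 0  # deactivate the unresponsive rank
            return False
    return True
```

On-device, the deactivation write would need an atomic store so all waiting warps observe a consistent `active_ranks` state, as the PR's description of atomic operations suggests.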

Test Plan

  • Unit/integration:
    • tests/python/ops/test_dispatch_combine.py::test_dispatch_combine
    • tests/python/ops/test_dispatch_combine.py::test_dispatch_combine_elastic_ep
  • Example/internode validation:
    • examples/ops/dispatch_combine/test_dispatch_combine_internode.py with:
      • normal mode (drop_rank=-1)
      • dropout simulation (drop_rank=<rank_id>, timeout_us=<value>)

Test Result

  • Unit tests passed


Copilot AI left a comment


Pull request overview

This PR adds elastic Expert Parallelism (EP) behavior to the dispatch/combine operations, enabling jobs to continue making progress when some ranks become unresponsive. The implementation includes timeout-based detection of unresponsive ranks, active rank tracking, and graceful degradation when ranks fail. Additionally, the PR corrects the wall-clock frequency API naming from MHz to KHz for accuracy.

Changes:

  • Added elastic EP state management with active_ranks tensor and timeout_us parameter throughout the dispatch/combine API stack
  • Implemented device-side timeout and rank activity checking utilities with atomic operations for thread safety
  • Renamed wall-clock frequency API from MHz to KHz (correcting the naming to match actual HIP API behavior)
  • Added comprehensive elastic EP test coverage and example integration

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| python/mori/ops/dispatch_combine.py | Added active_ranks and timeout_us parameters to dispatch*/combine* methods |
| python/mori/kernel_profiler/__init__.py | Updated to use renamed KHz wall-clock API |
| src/pybind/mori.cpp | Added MaybeUpdateElasticState helper and parameter validation; updated all dispatch/combine/recv functions |
| include/mori/ops/dispatch_combine/dispatch_combine.hpp | Added wallClockRateKHz, activeRanks, timeoutTicks fields and SetElasticState method |
| include/mori/utils/hip_helper.hpp | Renamed wall-clock API functions from MHz to KHz |
| include/mori/core/transport/p2p/device_primitives.hpp | Added elastic EP device utilities: IsRankActive, MarkRankInactive, WaitUntil*OrTimeout |
| src/ops/dispatch_combine/dispatch_combine.cpp | Initialize wallClockRateKHz in constructor |
| src/ops/dispatch_combine/intranode.hpp | Added elastic checks and timeout handling in dispatch/combine/barrier operations |
| src/ops/dispatch_combine/internode_v1.cpp | Added elastic checks throughout internode dispatch/combine/sync operations |
| tests/python/ops/test_dispatch_combine.py | Added test_dispatch_combine_elastic_ep with dropout simulation |
| src/ops/dispatch_combine/low_latency_async.cpp | Added elastic checks in async low-latency dispatch/combine paths |
| examples/ops/dispatch_combine/test_dispatch_combine_internode.py | Extended with drop_rank and timeout_us support for elastic testing |
Comments suppressed due to low confidence (2)

python/mori/ops/dispatch_combine.py:219

  • The docstrings for dispatch and combine methods don't document the new active_ranks and timeout_us parameters. Please add documentation for these parameters explaining:
  • active_ranks: Optional int32 CUDA tensor of shape (world_size,) indicating which ranks are active (1) or inactive (0). Used for elastic EP behavior.
  • timeout_us: Optional timeout in microseconds for detecting unresponsive ranks. When a rank doesn't respond within this time, it is marked inactive in the active_ranks tensor. Use -1 or None to disable timeout.
        """Dispatch tokens to experts based on top-k indices.

        Args:
            input: Input token tensor.
            weights: Token weights for each expert.
            scales: Quantization scales (optional).
            indices: Top-k expert indices.
            block_num: Override config.block_num if > 0.
            warp_per_block: Override config.warp_num_per_block if > 0.
        """

python/mori/ops/dispatch_combine.py:297

  • The docstring for the combine method doesn't document the new active_ranks and timeout_us parameters. Please add documentation for these parameters explaining:
  • active_ranks: Optional int32 CUDA tensor of shape (world_size,) indicating which ranks are active (1) or inactive (0). Used for elastic EP behavior.
  • timeout_us: Optional timeout in microseconds for detecting unresponsive ranks. When a rank doesn't respond within this time, it is marked inactive in the active_ranks tensor. Use -1 or None to disable timeout.
        """Combine tokens from experts back to original positions.

        Args:
            input: Expert output tensor.
            weights: Token weights for weighted combination.
            indices: Top-k expert indices.
            block_num: Override config.block_num if > 0.
            warp_per_block: Override config.warp_num_per_block if > 0.
            use_external_inp_buf: Override config.use_external_inp_buf if >= 0.
                0 = use zero-copy (registered combine input buffer),
                1 = use external input buffer (non-zero-copy).
            call_reset: Whether to call reset after combine.
        """


Comment on lines 334 to 339

    if op.config.kernel_type is mori.ops.EpDispatchCombineKernelType.AsyncLL:
        ret = op.dispatch_send(token, weights, scales, indices)
        op.dispatch_recv()
    else:
    -   ret = op.dispatch(token, weights, scales, indices)
    +   ret = op.dispatch(token, weights, scales, indices, **kwargs)
    return ret

Copilot AI Feb 27, 2026


For AsyncLL kernel type, the elastic parameters are not passed to dispatch_send/dispatch_recv. While the AsyncLL kernels do support elastic EP (they have IsRankActive checks), the dispatch_send/combine_send/dispatch_recv/combine_recv methods already accept active_ranks and timeout_us parameters. Consider passing these parameters through the kwargs for AsyncLL as well to enable elastic EP support for this kernel type in the example.

Comment on lines 358 to 362

    if op.config.kernel_type is mori.ops.EpDispatchCombineKernelType.AsyncLL:
        ret = op.combine_send(token, weights, indices)
        op.combine_recv()
    else:
    -   ret = op.combine(token, weights, indices)
    +   ret = op.combine(token, weights, indices, **kwargs)

Copilot AI Feb 27, 2026


Similar to run_dispatch, for AsyncLL kernel type the elastic parameters are not passed to combine_send/combine_recv. Consider passing these parameters through the kwargs for AsyncLL as well to enable elastic EP support for this kernel type in the example.
