
[Feature] Support persistent communication domains across Worker.run calls #824

@ccyywwen

Summary

Worker appears to support multiple Worker.run() calls over its lifetime, but the lifetime of a communication domain under the new orch-driven dynamic allocation is effectively tied to a single Worker.run() invocation. This may be too short for common training and inference workloads, where tensor-parallel (TP) and data-parallel (DP) communication domains are usually long-lived and span multiple training or inference tasks.

As a result, the same fixed communication domain may be dynamically allocated and released repeatedly within one Worker lifetime. Since HCCL dynamic allocation can involve device allocation, IPC export/import, and subset synchronization, this repeated cost may be unacceptable in training, inference, or benchmark loops.

Motivation / Use Case

A typical application may keep one Worker alive and call Worker.run() many times:

worker.init()
for step in range(num_steps):
    worker.run(train_step_orch, args=step_args[step], config=cfg)
worker.close()

Inside train_step_orch, the application may need a fixed TP or DP domain:

def train_step_orch(orch, args, cfg):
    with orch.allocate_domain(
        name="tp",
        workers=[0, 1, 2, 3],
        window_size=window_size,
        buffers=buffer_specs,
    ) as tp:
        orch.submit_next_level(...)
        orch.submit_next_level(...)

If the TP or DP membership, window size, and buffer layout are unchanged across steps, this domain is logically a long-lived resource. However, with the current run-scoped dynamic allocation model, each Worker.run() would allocate and release the same fixed domain again.

This matters for workloads such as:

  • TP/DP/EP groups reused across many training or inference tasks.
  • Benchmark or warmup/timed loops where repeated allocation overhead would pollute the measured steady-state cost.
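To make the repeated cost concrete, the loop above can be modeled with a toy stand-in that only counts allocate and release events. This is a sketch under stated assumptions: `ToyOrch`, the `counters` dict, and the `cfg` keys are hypothetical and do not reflect the real runtime; only the shape of `allocate_domain(name, workers, window_size, buffers)` is taken from the example in this issue.

```python
from contextlib import contextmanager


class ToyOrch:
    """Hypothetical stand-in for the orch handle; it only counts events."""

    def __init__(self, counters):
        self.counters = counters

    @contextmanager
    def allocate_domain(self, name, workers, window_size, buffers):
        # In the real runtime, this is where HCCL dynamic allocation would run.
        self.counters["allocate"] += 1
        try:
            yield (name, tuple(workers), window_size, tuple(buffers))
        finally:
            # Run-scoped model: the domain is gone once the block exits.
            self.counters["release"] += 1


def train_step_orch(orch, args, cfg):
    with orch.allocate_domain(
        name="tp",
        workers=[0, 1, 2, 3],
        window_size=cfg["window_size"],
        buffers=cfg["buffers"],
    ):
        pass  # task submission elided


def count_allocations(num_steps):
    counters = {"allocate": 0, "release": 0}
    cfg = {"window_size": 1 << 20, "buffers": ("act", "grad")}  # assumed values
    for step in range(num_steps):
        # Each iteration models one Worker.run(): a fresh orch invocation.
        train_step_orch(ToyOrch(counters), args=step, cfg=cfg)
    return counters
```

In this model the number of allocations grows linearly with the number of steps, even though the domain parameters never change.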

Proposed API / Behavior

Consider adding a persistent or cacheable communication domain lifetime that is longer than one Worker.run() call. For example, the runtime could support a domain that is created once, reused by multiple orch invocations, and released explicitly or at Worker.close().

The exact API is open, but the intended behavior is:

  • Temporary domains remain available for per-DAG or per-phase scratch space.
  • Fixed domains can be reused across multiple Worker.run() calls when their workers, window size, and buffer layout are unchanged.
  • The runtime avoids paying HCCL dynamic allocation cost once per run for the same fixed domain.
  • Cleanup remains explicit and safe, either by user request or during Worker.close().
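One possible shape for the behavior listed above is a Worker-owned cache keyed on the properties that define a fixed domain. The following is a minimal sketch, not a proposed implementation: `DomainKey`, `PersistentDomainCache`, and the `allocate`/`release` callbacks are all hypothetical names introduced for illustration, and the real allocation path is replaced by plain Python callables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainKey:
    """Identity of a fixed domain: an equal key means the domain is reusable."""
    name: str
    workers: tuple
    window_size: int
    buffers: tuple


class PersistentDomainCache:
    """Hypothetical Worker-owned cache; not part of the current runtime."""

    def __init__(self, allocate, release):
        self._allocate = allocate  # callable(key) -> handle (the expensive path)
        self._release = release    # callable(handle) -> None
        self._live = {}

    def acquire(self, key):
        # Allocate only on the first request for this exact key.
        if key not in self._live:
            self._live[key] = self._allocate(key)
        return self._live[key]

    def release(self, key):
        # Explicit cleanup on user request; a no-op if the key is not cached.
        if key in self._live:
            self._release(self._live.pop(key))

    def close(self):
        # Models the cleanup that would run during Worker.close().
        for key in list(self._live):
            self.release(key)


def demo(num_runs):
    events = []

    def allocate(key):
        events.append("alloc")
        return {"key": key}

    def release(handle):
        events.append("release")

    cache = PersistentDomainCache(allocate, release)
    key = DomainKey("tp", (0, 1, 2, 3), 1 << 20, ("act", "grad"))
    for _ in range(num_runs):
        cache.acquire(key)  # one call per modeled Worker.run()
    cache.close()
    return events
```

In this sketch, a request that changes the workers, window size, or buffer layout produces a different key and therefore a new allocation, while temporary per-DAG domains could simply bypass the cache.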

Alternatives Considered

One workaround is to merge more work into a single large Worker.run() so the dynamic allocation cost is amortized across more submitted tasks. This is not a complete solution:

  • It makes host-side orchestration less flexible for training steps, inference requests, benchmark iterations, and staged workflows.
  • It can increase peak memory usage because domain release is currently deferred until the whole Worker.run() drains.
  • It does not match fixed TP/DP domain lifetimes, which may naturally span multiple training or inference tasks.

Additional Context

This concern comes from reviewing PR #817, which moves communication domain allocation to the orch-only dynamic path. The dynamic path is a useful API for domains whose membership or buffers are genuinely per-DAG, but fixed TP/DP domains may need a longer-lived resource model to avoid repeated HCCL dynamic allocation overhead.

Reference commit: 681f3315.

Metadata

Labels: enhancement (New feature or request)
