Summary
Worker appears to support multiple Worker.run() calls over its lifetime, but the new orch-driven dynamic communication domain lifetime is effectively tied to one Worker.run() invocation. This may be too short for common training and inference workloads where tensor-parallel and data-parallel communication domains are usually long-lived, and their lifetime may span multiple training or inference tasks.
As a result, the same fixed communication domain may be dynamically allocated and released repeatedly within one Worker lifetime. Since HCCL dynamic allocation can involve device allocation, IPC export/import, and subset synchronization, this repeated cost may be unacceptable in training, inference, or benchmark loops.
Motivation / Use Case
A typical application may keep one Worker alive and call Worker.run() many times:
worker.init()
for step in range(num_steps):
    worker.run(train_step_orch, args=step_args[step], config=cfg)
worker.close()
Inside train_step_orch, the application may need a fixed TP or DP domain:
def train_step_orch(orch, args, cfg):
    with orch.allocate_domain(
        name="tp",
        workers=[0, 1, 2, 3],
        window_size=window_size,
        buffers=buffer_specs,
    ) as tp:
        orch.submit_next_level(...)
        orch.submit_next_level(...)
If the TP or DP membership, window size, and buffer layout are unchanged across steps, this domain is logically a long-lived resource. However, with the current run-scoped dynamic allocation model, each Worker.run() would allocate and release the same fixed domain again.
This matters for workloads such as:
- TP/DP/EP groups reused across many training or inference tasks.
- Benchmark or warmup/timed loops where repeated allocation overhead would pollute the measured steady-state cost.
Proposed API / Behavior
Consider adding a persistent or cacheable communication domain lifetime that is longer than one Worker.run() call. For example, the runtime could support a domain that is created once, reused by multiple orch invocations, and released explicitly or at Worker.close().
The exact API is open, but the intended behavior is:
- Temporary domains remain available for per-DAG or per-phase scratch space.
- Fixed domains can be reused across multiple Worker.run() calls when their workers, window size, and buffer layout are unchanged.
- The runtime avoids paying HCCL dynamic allocation cost once per run for the same fixed domain.
- Cleanup remains explicit and safe, either by user request or during Worker.close().
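The intended behavior above can be sketched as a small Worker-lifetime cache. This is only an illustration under stated assumptions, not a proposed implementation: DomainKey, DomainCache, and the allocate/release callbacks are hypothetical names that do not exist in the runtime, and the stub callbacks stand in for the real HCCL dynamic allocation and release paths.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DomainKey:
    """Hypothetical identity of a fixed domain; equal keys mean it can be reused."""
    name: str
    workers: Tuple[int, ...]
    window_size: int
    buffers: Tuple[Tuple[str, int], ...]  # (buffer name, size in bytes)


class DomainCache:
    """Hypothetical cache that outlives a single Worker.run() call."""

    def __init__(self, allocate, release):
        self._allocate = allocate  # stand-in for the expensive dynamic allocation
        self._release = release    # stand-in for the matching release
        self._domains = {}
        self.alloc_count = 0

    def get(self, key):
        # Allocate only on the first request for this exact configuration.
        if key not in self._domains:
            self._domains[key] = self._allocate(key)
            self.alloc_count += 1
        return self._domains[key]

    def release(self, key):
        # Explicit release by user request; a no-op if the key is not cached.
        domain = self._domains.pop(key, None)
        if domain is not None:
            self._release(domain)

    def close(self):
        # Release everything still cached, e.g. from Worker.close().
        for key in list(self._domains):
            self.release(key)


# Usage: 100 simulated runs reuse one TP domain instead of allocating 100 times.
released = []
cache = DomainCache(allocate=lambda key: {"key": key}, release=released.append)
tp_key = DomainKey("tp", (0, 1, 2, 3), 1 << 20, (("grad", 4096),))
for _ in range(100):
    tp = cache.get(tp_key)
cache.close()
```

Keying on workers, window size, and buffer layout means any change to those fields produces a new key and therefore a fresh allocation, while temporary per-DAG domains can bypass the cache entirely.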
Alternatives Considered
One workaround is to merge more work into a single large Worker.run() so the dynamic allocation cost is amortized across more submitted tasks. This is not a complete solution:
- It makes host-side orchestration less flexible for training steps, inference requests, benchmark iterations, and staged workflows.
- It can increase peak memory usage because domain release is currently deferred until the whole Worker.run() drains.
- It does not match fixed TP/DP domain lifetimes, which may naturally span multiple training or inference tasks.
Additional Context
This concern comes from reviewing PR #817, which moves communication domain allocation to the orch-only dynamic path. The dynamic path is a useful API for domains whose membership or buffers are genuinely per-DAG, but fixed TP/DP domains may need a longer-lived resource model to avoid repeated HCCL dynamic allocation overhead.
Reference commit: 681f3315.