[Feature] Support persistent communication domains across Worker.run calls #824

### Summary

`Worker` appears to support multiple `Worker.run()` calls over its lifetime, but the lifetime of the new orch-driven dynamic communication domains is effectively tied to a single `Worker.run()` invocation. This may be too short for common training and inference workloads, where tensor-parallel (TP) and data-parallel (DP) communication domains are usually long-lived and may span multiple training or inference tasks.

As a result, the same fixed communication domain may be dynamically allocated and released repeatedly within one `Worker` lifetime. Since HCCL dynamic allocation can involve device allocation, IPC export/import, and subset synchronization, this repeated cost may be unacceptable in training, inference, or benchmark loops.

### Motivation / Use Case


A typical application may keep one `Worker` alive and call `Worker.run()` many times:

```python
worker.init()
for step in range(num_steps):
    worker.run(train_step_orch, args=step_args[step], config=cfg)
worker.close()
```

Inside `train_step_orch`, the application may need a fixed TP or DP domain:

```python
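def train_step_orch(orch, args, cfg):
    # window_size and buffer_specs are application-defined and assumed to be
    # identical on every step, so the domain spec never changes across runs.
    with orch.allocate_domain(
        name="tp",
        workers=[0, 1, 2, 3],
        window_size=window_size,
        buffers=buffer_specs,
    ) as tp:
        # Tasks submitted here use the TP domain.
        orch.submit_next_level(...)
        orch.submit_next_level(...)
    # The domain is released when this run finishes, then re-allocated next step.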

If the TP or DP membership, window size, and buffer layout are unchanged across steps, this domain is logically a long-lived resource. However, with the current run-scoped dynamic allocation model, each `Worker.run()` would allocate and release the same fixed domain again.

This matters for workloads such as:

- TP, DP, or expert-parallel (EP) groups reused across many training or inference tasks.
- Benchmark or warmup/timed loops where repeated allocation overhead would pollute the measured steady-state cost.
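
To make the benchmark concern concrete, here is a rough cost model. It is only a sketch: the function name and all input values are hypothetical placeholders, not measurements, and it assumes allocation cost and per-run compute cost are each constant.

```python
def allocation_overhead_fraction(num_runs, step_cost, alloc_cost, persistent):
    """Fraction of total time spent on domain allocation.

    num_runs   -- number of Worker.run() calls in the loop
    step_cost  -- steady-state cost of one run (any time unit)
    alloc_cost -- cost of one dynamic domain allocation (same unit)
    persistent -- True if the domain is allocated once and reused
    """
    if num_runs <= 0:
        raise ValueError("num_runs must be positive")
    # Run-scoped model: one allocation per run. Persistent model: one in total.
    allocations = 1 if persistent else num_runs
    overhead = allocations * alloc_cost
    return overhead / (overhead + num_runs * step_cost)


# Placeholder inputs for illustration only, not measured values.
run_scoped = allocation_overhead_fraction(100, step_cost=10.0, alloc_cost=5.0, persistent=False)
reused = allocation_overhead_fraction(100, step_cost=10.0, alloc_cost=5.0, persistent=True)
print(f"run-scoped: {run_scoped:.3f}, persistent: {reused:.3f}")
```

Under the run-scoped model the overhead fraction does not shrink as the loop gets longer, so adding iterations does not clean up the measurement. With a persistent domain the fraction falls toward zero as `num_runs` grows.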


### Proposed API / Behavior

Consider adding a persistent or cacheable communication domain lifetime that is longer than one `Worker.run()` call. For example, the runtime could support a domain that is created once, reused by multiple orch invocations, and released explicitly or at `Worker.close()`.

The exact API is open, but the intended behavior is:

- Temporary domains remain available for per-DAG or per-phase scratch space.
- Fixed domains can be reused across multiple `Worker.run()` calls when their workers, window size, and buffer layout are unchanged.
- The runtime pays the HCCL dynamic allocation cost once per fixed domain, not once per `Worker.run()` call.
- Cleanup remains explicit and safe, either by user request or during `Worker.close()`.
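
One possible shape for this behavior is a `Worker`-owned cache keyed on the domain spec. The sketch below is a self-contained mock, not the runtime's API: `PersistentDomainCache`, `DomainKey`, `FakeDomain`, and `get_or_allocate` are all hypothetical names, and the real HCCL allocation is replaced by a counter.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Tuple


@dataclass(frozen=True)
class DomainKey:
    """Spec of a fixed domain; an equal key means the domain can be reused."""
    workers: Tuple[int, ...]
    window_size: int
    buffers: Tuple[Hashable, ...]  # hashable description of the buffer layout


@dataclass
class FakeDomain:
    """Stand-in for a real communication domain handle (hypothetical)."""
    key: DomainKey
    released: bool = False


class PersistentDomainCache:
    """Hypothetical Worker-owned cache that outlives a single Worker.run()."""

    def __init__(self):
        self._domains: Dict[str, FakeDomain] = {}
        self.alloc_count = 0  # counts simulated allocations

    def get_or_allocate(self, name, workers, window_size, buffers):
        key = DomainKey(tuple(workers), window_size, tuple(buffers))
        cached = self._domains.get(name)
        if cached is not None:
            if cached.key == key:
                return cached  # unchanged spec: reuse, no new allocation
            self.release(name)  # spec changed: drop the stale domain first
        # A real runtime would do the expensive dynamic allocation here.
        self.alloc_count += 1
        domain = FakeDomain(key)
        self._domains[name] = domain
        return domain

    def release(self, name):
        """Explicit release requested by the user."""
        domain = self._domains.pop(name, None)
        if domain is not None:
            domain.released = True

    def close(self):
        """Release every remaining domain, as Worker.close() might."""
        for name in list(self._domains):
            self.release(name)


# Three simulated runs request the same "tp" spec.
cache = PersistentDomainCache()
for step in range(3):
    tp = cache.get_or_allocate("tp", [0, 1, 2, 3], 1 << 20, [("grad", 4096)])
cache.close()
```

In this mock, only the first request allocates and the later ones return the cached handle. Temporary per-DAG domains would bypass the cache and keep their run-scoped lifetime.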

### Alternatives Considered

One workaround is to merge more work into a single large `Worker.run()` so the dynamic allocation cost is amortized across more submitted tasks. This is not a complete solution:

- It makes host-side orchestration less flexible for training steps, inference requests, benchmark iterations, and staged workflows.
- It can increase peak memory usage because domain release is currently deferred until the whole `Worker.run()` drains.
- It does not match fixed TP/DP domain lifetimes, which may naturally span multiple training or inference tasks.


### Additional Context

This concern comes from reviewing PR #817, which moves communication domain allocation to the orch-only dynamic path. The dynamic path is a useful API for domains whose membership or buffers are genuinely per-DAG, but fixed TP/DP domains may need a longer-lived resource model to avoid repeated HCCL dynamic allocation overhead.

Reference commit: `681f3315`.
