Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 8 additions & 4 deletions docs/hierarchical_level_runtime.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@ For details of each component's internals, see:
- [scheduler.md](scheduler.md) — dispatch loop, queues, completion handling
- [worker-manager.md](worker-manager.md) — WorkerThread pool, fork + mailbox
- [task-flow.md](task-flow.md) — Callable / TaskArgs / CallConfig data flow, execution leaves
- [remote-l3-worker-design.md](remote-l3-worker-design.md) — design proposal
for scheduling remote L3 workers as NEXT_LEVEL children

For the L2 chip-level details (host `.so`, AICPU, AICore), see
[chip-level-arch.md](chip-level-arch.md).
Expand Down Expand Up @@ -42,14 +44,16 @@ L0 CORE / AIV, AIC ── individual compute core (hardware-managed)
atomics and barriers.
- **L3–L6** (host/cluster): each level runs the same scheduling engine
composed of Orchestrator + Scheduler + Worker pool. Communication via IPC
(fork + shm at L3 today; RDMA / sockets at L4+).
(fork + shm for today's L3 and for local recursive L4+ composition).
Cross-host L4/L5/L6 composition, where a parent schedules a remote L3
endpoint over RoCE/HCCS/UB/sockets, is a proposed extension.

| Level | Workers it contains | Status |
| ----- | ------------------- | ------ |
| L3 (Host) | `ChipWorker` ×N + `SubWorker` ×M | Implemented |
| L4 (Pod) | `Worker(level=3)` ×N + `SubWorker` ×M | Implemented |
| L5 (SuperNode) | `Worker(level=4)` ×N | Same code as L4 (untested) |
| L6 (Cluster) | `Worker(level=5)` ×N | Same code as L4 (untested) |
| L4 (Pod) | `Worker(level=3)` ×N + `SubWorker` ×M | Local-only implemented; remote proposed |
| L5 (SuperNode) | `Worker(level=4)` ×N | Local L4 code path, untested; remote proposed |
| L6 (Cluster) | `Worker(level=5)` ×N | Local L4 code path, untested; remote proposed |

`Worker` is a single C++ class that handles every level from L3 upward — the
`level` parameter is a diagnostic label; behavior does not branch on it. The
Expand Down
472 changes: 472 additions & 0 deletions docs/remote-l3-worker-design.md

Large diffs are not rendered by default.

307 changes: 307 additions & 0 deletions docs/remote-l3-worker-design/buffers-and-transports.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,307 @@
# Remote L3 Buffers and Transports

This document defines remote buffer lifetime and backend transport contracts
for remote L3 NEXT_LEVEL endpoints.

## Buffer Handles

Remote buffers need an owner, generation, and deferred free point. The parent
tracks user-visible handle state; the remote session runner owns physical
memory and imported mappings.

Parent-side handle:

```text
RemoteBufferHandle:
endpoint_id
buffer_id
generation
address_space
nbytes
remote_addr
rkey_or_token
ub_ldst_va
ref_state
live_slot_refs
```

`buffer_id` may be reused only with a new `generation`. Stale completions,
imports, or frees whose generation does not match are ignored or reported as
session errors.

## Public Memory API

Remote memory APIs return handles, not bare integer pointers:

```python
buf = w4.remote_malloc(worker=l3_worker_id, nbytes=4096)
w4.remote_copy_to(buf, host_ptr, 4096)

args = TaskArgs()
args.add_tensor(
RemoteTensorRef(buf, offset=0, shape=(1024,), dtype=DataType.FLOAT32),
TensorArgType.INPUT,
)
```

The binding stores a hidden remote sidecar beside the tensor metadata.
`RemoteTensorRef` is transport metadata, not an extension of the local mailbox
ABI. Local fork/shm endpoints reject it unless the buffer has first been
explicitly imported, staged, or materialized into a local-addressable
`ContinuousTensor`. Remote endpoints reject bare host pointers unless explicit
staging produced a handle.

## TaskArgs Sidecar Contract

`ContinuousTensor` remains the tensor metadata ABI. Remote transport metadata
is stored in a per-`TaskArgs` sidecar keyed by tensor index.

Python-facing rules:

- `TaskArgs.add_tensor(RemoteTensorRef(...), tag)` appends one normal tensor
metadata entry and one remote sidecar entry at the same tensor index.
- `RemoteTensorRef` is not converted to a fake integer pointer.
- `RemoteBufferHandle` is opaque to user code. Users may inspect endpoint,
size, and release state, but not transport keys such as `rkey`.
- A `TaskArgs` containing any remote sidecar is legal only when the final
selected endpoint set contains remote endpoints that can consume every
referenced sidecar.
- Local `submit_next_level()` rejects remote sidecars before slot commit unless
an explicit import/staging API has converted them into local-addressable
`ContinuousTensor` values and removed the remote sidecar.
- Remote submit rejects `OUTPUT` tensors with `ContinuousTensor.data == 0`
unless an explicit remote allocation API has already produced a
`RemoteTensorRef` sidecar for that tensor.

Parent C++ slot rules:

- `TaskSlotState` owns a copy of the sidecar for the slot lifetime.
- Sidecar length must equal `tensor_count`; entries can be null only for
metadata-only tensors. `HOST_INLINE` payloads still have sidecar descriptors
so the frame codec has one validation path for every remote data payload.
- Submit validation captures a live ref on every referenced
`RemoteBufferHandle` before the slot becomes visible to the Scheduler.
- Validation fails before slot commit when the intersection of callable-eligible
endpoints and data-eligible endpoints is empty, or when a remote tensor names
an ineligible endpoint, stale generation, out-of-range offset, or released
handle.
- Group submit stores one sidecar per group member, aligned with
`task_args_list[i]`.

Endpoint rules:

- `LocalMailboxEndpoint` rejects non-empty sidecars. It cannot encode remote
descriptors into the local 4096-byte mailbox, and its child processes expect
`ContinuousTensor.data` to be a local host/shm pointer or a local child-memory
pointer.
- `RemoteL3Endpoint` requires a sidecar for every tensor payload that crosses
the remote protocol, including `HOST_INLINE` payloads.
- Remote TASK frames write `ContinuousTensorWire.data == 0`; parent virtual
addresses never cross the remote protocol.
- A remote tensor with `child_memory=True` and no sidecar is invalid. Local
child-memory pointers are meaningful only inside fork/shm topology.
- The remote session runner translates each `RemoteTensorDesc` into a local
`ContinuousTensor` and fills `data` from its validated local mapping
immediately before invoking `inner_worker.run()`.

## Remote OUTPUT Allocation Policy

The first implementation does not mirror local HeapRing auto-allocation for
remote outputs. In the local fork/shm path, an `OUTPUT` tensor with
`ContinuousTensor.data == 0` is assigned a parent HeapRing pointer during
submit, and forked children can dereference that shared virtual address. A
remote L3 worker cannot use a parent-host HeapRing pointer.

Remote callers must allocate or import output storage explicitly before submit:

```python
out = w4.remote_malloc(worker=l3_worker_id, nbytes=4096)

args = TaskArgs()
args.add_tensor(
RemoteTensorRef(out, offset=0, shape=(1024,), dtype=DataType.FLOAT32),
TensorArgType.OUTPUT,
)
orch.submit_next_level(cid, args, cfg, worker=l3_worker_id)
```

This keeps submit-time validation simple: the slot already carries complete
data eligibility, handle generation, bounds, and lifetime refs before it
becomes visible to the Scheduler.

Future work may add remote output auto-allocation, but only after the runtime
has a well-defined pre-dispatch endpoint selection policy. Auto-allocation must
decide which endpoint owns the output before slot commit, allocate or import the
remote buffer, attach the generated sidecar to the correct tensor index, and
handle group submits where each member may need storage on a different
endpoint. Until those rules exist, remote null `OUTPUT` tensors fail fast.

## Required Controls

| Command | Purpose |
| ------- | ------- |
| `ALLOC_REMOTE_BUFFER` | Allocate remote L3 host or chip memory. |
| `FREE_REMOTE_BUFFER` | Mark a handle released; physical free is deferred. |
| `COPY_TO_REMOTE` | Stage host data into a remote buffer. |
| `COPY_FROM_REMOTE` | Pull remote output data back to host. |
| `EXPORT_BUFFER` | Return RDMA key or UB mapping metadata. |
| `IMPORT_BUFFER` | Import a peer buffer into a remote worker. |
| `RELEASE_IMPORT` | Drop an imported peer mapping. |

Control commands are typed remote protocol frames. They are not the local
mailbox `CTRL_*` integers.

## Release Policy

- Slot refs are acquired during `submit_next_level()` while walking `TaskArgs`.
- Captured buffers stay live until every capturing slot has reached a terminal
state and every producer/consumer reference that can expose that buffer has
reached `CONSUMED` or failed cleanup.
- Explicit buffers used only as INPUT still need slot refs. They have no
producer slot to protect them.
- Runtime-managed OUTPUT buffers follow the producer slot's terminal cleanup.
- `FREE_REMOTE_BUFFER` and `RELEASE_IMPORT` mark handles released. Physical
free or import teardown runs only when the released handle has no live slot
refs.
- If a run fails, the same post-drain cleanup path runs before the next run.
- Session shutdown rejects new work and frees all session-owned buffers after
completing or failing in-flight tasks.

The registry therefore needs both a user-visible release state and a live
slot-ref count. Failed slots still release captured refs through the same
terminal cleanup path as successful slots.

## Dependency Keys

`TensorKey` must grow beyond the current `{ptr, int8 worker}` shape for remote
buffers while preserving the current exact-start lookup semantics. Today, local
dependency tracking keys only on a tensor's start pointer and worker id; shape
and byte length do not participate in lookup. Remote keys follow the same rule:

```text
address_kind
owner_endpoint_id
buffer_id
offset_begin
generation
```

Local fork/shm keys remain a compatibility subset:

```text
host pointer: (LOCAL_HOST, -1, ptr)
local child memory: (LOCAL_CHILD, worker_id, ptr)
```

Known limitation: two remote tensors that reference overlapping byte ranges
with different `offset_begin` values do not automatically depend on each other.
For example, a producer writing `[0, 4096)` and a consumer reading
`[1024, 2048)` map to different dependency keys. This matches the current local
`ptr`-based TensorMap behavior, where a subview at `base + offset` is a
different key from `base`.

The first implementation chooses this route to keep remote scheduling behavior
compatible with local fork/shm semantics and to avoid changing TensorMap into a
range index as part of the transport bring-up. `offset_end`/`nbytes` remains in
`RemoteTensorDesc` for bounds checks and for a future range-overlap TensorMap
upgrade, but it is not part of the first dependency key.

## HCOMM Adapter Contract

Remote L3 uses HCOMM for steady-state communication in the first
implementation. The bootstrap socket is only a setup path for session
validation and HCOMM bring-up. Once HCOMM RPC is ready, task metadata,
CONTROL, CONTROL_REPLY, COMPLETION, and SHUTDOWN frames use the HCOMM RPC
adapter; tensor data and remote buffer copies use the HCOMM data adapter.

The endpoint owns the adapter objects. `Orchestrator`, Scheduler, and
`WorkerThread` see only `WorkerEndpoint::run()`, `WorkerEndpoint::control()`,
and logical capability bits from `WorkerEndpoint::caps()`.

The adapter family provides this logical contract:

```text
BootstrapSocketAdapter:
open(control_uri)
exchange hello/capability/HCOMM bootstrap frames
close after HCOMM_RPC_READY

HcommRpcAdapter:
submit TASK or CONTROL frame on the ordered command lane
wait for matching COMPLETION or CONTROL_REPLY
send SHUTDOWN when no command is in flight

HcommAdapter:
register/export/import memory
submit read/write/copy plans
wait/fence completion
release registrations and imports
```

HCOMM RPC enqueues request frames on the endpoint's ordered command lane.
Data-plane transfers may use RoCE, HCCS, or UB HCOMM profiles, but TASK
doorbells, CONTROL frames, replies, completions, and shutdown state are ordered
by the command lane. Reply frames carry the request sequence they answer. A
task completes only after an explicit remote `COMPLETION` frame; data-copy
completion alone is never a task completion signal. `control()` returns only
after the matching `CONTROL_REPLY` arrives or a timeout/disconnect is converted
into a failed reply. Liveness is handled by an independent health lane or
transport keepalive; it is not queued behind the ordered command lane.

## A2 RoCE HCOMM Profile

- Use HCOMM with `COMM_PROTOCOL_ROCE`.
- Carry command frames and completion records through HCOMM RPC rings and
notify/fence operations.
- Use a separate health HCOMM lane or transport keepalive for liveness.
- Use registered staging buffers for large callable blobs and bulk data.
- Export buffers as HCOMM memory descriptors plus RoCE-specific channel
metadata.
- Complete tasks only after an HCOMM RPC `COMPLETION` frame from the session
runner.
- Bound every wait with a timeout and convert disconnects into endpoint
failure completions.

## A3 HCCS HCOMM Profile

- Keep the same HCOMM adapter contract as A2.
- Implement memory export/import through the HCCS-capable HCOMM profile.
- Preserve the same command-lane ordering rules: task/control frames are
observed in sequence order, command frame visible before doorbell, and remote
writes complete before completion frame.
- Provide health independently from command-lane progress so long-running TASK
execution does not cause false endpoint failure.
- Reuse the frame codec, HCOMM RPC, and buffer registry tests from the A2 path.

## A5 UB HCOMM Profile

- Export both RDMA metadata and, when available, an LD/ST mapping token.
- Use LD/ST for doorbells and small completion records only when the mapping
is coherent for the participating hosts.
- Preserve the same per-endpoint command-lane order for TASK, CONTROL,
CONTROL_REPLY, COMPLETION, and SHUTDOWN frames.
- Keep UB health doorbells or transport health independent from the command
lane used for state-changing frames.
- Use RDMA for bulk transfers until platform benchmarks justify LD/ST bulk
copies.
- Add explicit fences around:
- task payload writes before doorbell;
- remote output writes before completion;
- parent completion read before dependent task dispatch.
- Keep RDMA fallback for all UB LD/ST paths.

## Simulation Backend

The simulation backend uses TCP or Unix sockets plus local files/shm for
integration tests. It must exercise:

- framed protocol encode/decode;
- sequence numbers;
- remote callable bootstrap;
- endpoint eligibility validation;
- success and error completions;
- failed dependency poisoning;
- buffer registry ref capture and deferred free;
- timeout handling.

It must not depend on A2/A3/A5 hardware.
Loading
Loading