From 6c052cbda97eaf168b8df2754b49d5ac25abc889 Mon Sep 17 00:00:00 2001 From: puddingfjz <2811443837@qq.com> Date: Tue, 26 May 2026 14:08:10 +0800 Subject: [PATCH 1/2] Add: remote L3 worker design - Document remote L3 NEXT_LEVEL worker architecture, protocol, and buffer lifetime contracts - Describe A2 RoCE, A3 HCCS, and A5 UB transport expectations - Link the proposal from hierarchical runtime, task flow, and worker manager docs --- docs/hierarchical_level_runtime.md | 12 +- docs/remote-l3-worker-design.md | 434 ++++++++++++++++++ .../buffers-and-transports.md | 291 ++++++++++++ .../implementation-plan.md | 293 ++++++++++++ docs/remote-l3-worker-design/protocol.md | 399 ++++++++++++++++ docs/task-flow.md | 28 +- docs/worker-manager.md | 44 +- 7 files changed, 1471 insertions(+), 30 deletions(-) create mode 100644 docs/remote-l3-worker-design.md create mode 100644 docs/remote-l3-worker-design/buffers-and-transports.md create mode 100644 docs/remote-l3-worker-design/implementation-plan.md create mode 100644 docs/remote-l3-worker-design/protocol.md diff --git a/docs/hierarchical_level_runtime.md b/docs/hierarchical_level_runtime.md index 2e78f7aad..d29482914 100644 --- a/docs/hierarchical_level_runtime.md +++ b/docs/hierarchical_level_runtime.md @@ -13,6 +13,8 @@ For details of each component's internals, see: - [scheduler.md](scheduler.md) — dispatch loop, queues, completion handling - [worker-manager.md](worker-manager.md) — WorkerThread pool, fork + mailbox - [task-flow.md](task-flow.md) — Callable / TaskArgs / CallConfig data flow, execution leaves +- [remote-l3-worker-design.md](remote-l3-worker-design.md) — design proposal + for scheduling remote L3 workers as NEXT_LEVEL children For the L2 chip-level details (host `.so`, AICPU, AICore), see [chip-level-arch.md](chip-level-arch.md). @@ -42,14 +44,16 @@ L0 CORE / AIV, AIC ── individual compute core (hardware-managed) atomics and barriers. - **L3–L6** (host/cluster): each level runs the same scheduling engine composed of Orchestrator + Scheduler + Worker pool. Communication via IPC - (fork + shm at L3 today; RDMA / sockets at L4+). + (fork + shm for today's L3 and for local recursive L4+ composition). + Cross-host L4/L5/L6 composition, where a parent schedules a remote L3 + endpoint over RoCE/HCCS/UB/sockets, is a proposed extension. | Level | Workers it contains | Status | | ----- | ------------------- | ------ | | L3 (Host) | `ChipWorker` ×N + `SubWorker` ×M | Implemented | -| L4 (Pod) | `Worker(level=3)` ×N + `SubWorker` ×M | Implemented | -| L5 (SuperNode) | `Worker(level=4)` ×N | Same code as L4 (untested) | -| L6 (Cluster) | `Worker(level=5)` ×N | Same code as L4 (untested) | +| L4 (Pod) | `Worker(level=3)` ×N + `SubWorker` ×M | Local-only implemented; remote proposed | +| L5 (SuperNode) | `Worker(level=4)` ×N | Local L4 code path, untested; remote proposed | +| L6 (Cluster) | `Worker(level=5)` ×N | Local L4 code path, untested; remote proposed | `Worker` is a single C++ class that handles every level from L3 upward — the `level` parameter is a diagnostic label; behavior does not branch on it. The diff --git a/docs/remote-l3-worker-design.md b/docs/remote-l3-worker-design.md new file mode 100644 index 000000000..20e02ae57 --- /dev/null +++ b/docs/remote-l3-worker-design.md @@ -0,0 +1,434 @@ +# Remote L3 Worker Design + +This document describes how to extend the L3+ hierarchical runtime so a +parent `Worker` can schedule a remote `Worker(level=3)` as a NEXT_LEVEL +worker. The first target is an L4 parent dispatching to remote L3 workers. +The same contracts can later serve L5/L6. + +Detailed protocol, buffer, transport, and rollout notes live in: + +- [protocol.md](remote-l3-worker-design/protocol.md) +- [buffers-and-transports.md](remote-l3-worker-design/buffers-and-transports.md) +- [implementation-plan.md](remote-l3-worker-design/implementation-plan.md) + +The current implementation uses pre-forked local child processes and a +4096-byte shared-memory mailbox. That model depends on copy-on-write callable +registries, identical virtual addresses for `MAP_SHARED` regions, and +parent-visible child PIDs. None of those assumptions holds across hosts. + +## Scope + +Goals: + +- Preserve the Orchestrator/Scheduler DAG model. +- Replace the local mailbox endpoint under `WorkerThread` with a pluggable + NEXT_LEVEL endpoint. +- Support remote data-plane backends for A2 RoCE, A3 HCCS, and A5 UB. +- Carry task dispatch, control commands, completion, error messages, and + buffer lifetime over the remote endpoint. + +Non-goals: + +- Rewriting kernel allreduce or PTO-ISA collective kernels. +- Shipping Python closures without an explicit serialization contract. +- Designing general cross-host Python dependency or code distribution for + arbitrary closures. +- Replacing the local fork/shm path for chip and sub workers. +- Changing the L2 `ChipWorker::run` ABI. + +Remote callable registration follows the same public cid lifecycle defined for +local dynamic Python registration by PR #839: registration becomes visible to +the selected Python-capable endpoint only after the registration control reply +succeeds, unregister permits cid reuse only after stale state is cleared or the +endpoint is marked failed, and stale cid residue must not be observable by +later TASK frames. The remote design reuses those lifecycle semantics, but it +does not reuse PR #839's local mailbox commands, POSIX shm names, +process-local pointers, or exact serialized payload wire shape. + +The required baseline remote callable descriptor is an import path such as +`pkg.module:orch_fn`. A serialized Python callable payload produced by the +PR #839 serializer is a negotiated remote capability, not an unconditional +baseline. When enabled, it travels as a versioned remote CONTROL payload and +must negotiate serializer version, payload limits, Python ABI/runtime +compatibility, and dependency/runtime-environment compatibility. + +## Current Seams + +Relevant code paths: + +- `python/simpler/worker.py` + - `_start_hierarchical()` forks local child workers. + - `_child_worker_loop()` runs a nested `Worker` child via shm mailbox. + - `_run_chip_main_loop()` handles task and control mailbox states. +- `src/common/hierarchical/worker_manager.{h,cpp}` + - `WorkerThread` owns one local mailbox and blocks until `TASK_DONE`. + - Control commands share the same mailbox and serialize on `mailbox_mu_`. + - Errors are reported through `MAILBOX_OFF_ERROR` and + `MAILBOX_OFF_ERROR_MSG`. +- `src/common/hierarchical/orchestrator.{h,cpp}` + - `submit_next_level()` stores `TaskArgs`, `CallConfig`, callable id, and + optional worker affinity in a parent-side slot. + - Dependency inference happens before dispatch from tags in `TaskArgs`. +- `src/common/task_interface/task_args.h` + - Process dispatch writes `[T][S][ContinuousTensor x T][uint64 x S]`. + - Tags are stripped after submit. +- `docs/comm-domain.md` + - Dynamic communication domains already model deferred release after + `drain()`. + +The Scheduler should not inspect transport details, but it does need enough +metadata to avoid dispatching a task to an endpoint that cannot run it. Remote +tensor identity must be resolved before a slot becomes ready, because +TensorMap dependency inference and buffer-reference capture happen at submit +time. + +## Target Architecture + +Introduce a transport-neutral endpoint under `WorkerThread` with `run`, +`control`, and `shutdown` operations. `LocalMailboxEndpoint` wraps the current +shm mailbox code without changing wire behavior. `RemoteL3Endpoint` implements +the same interface over the framed remote protocol. + +On dispatch, `WorkerThread` builds a task packet from `TaskSlotState`, calls +the endpoint, reports endpoint errors, and notifies the Scheduler with an +explicit success/failure outcome. + +Ready queues, group dispatch, affinities, fanin/fanout, and ring release remain +in the existing runtime. The first-error-wins policy remains only as the error +reporting policy for choosing which root error `drain()` raises. The important +change is that completion is no longer implicitly success; every endpoint, +including `LocalMailboxEndpoint`, must report an explicit success/failure +outcome. + +## Fork-Safe Remote Process Model + +The remote runtime must preserve the repository's fork ordering invariant: +all chip/sub child processes are forked before any C++ Scheduler, +`WorkerThread`, transport, or health threads are started. + +Use a two-process remote model: + +1. `simpler-remote-worker` is a small control daemon. It accepts session + requests and validates bootstrap manifests on the daemon control channel. + It never constructs an inner `Worker` and never forks chip/sub children + after starting transport worker threads. +2. For each accepted session, the daemon starts a fresh + `simpler-remote-l3-session` runner process, preferably by `exec`. +3. The daemon passes the validated manifest to the runner through a simple + pre-fork handoff such as an inherited fd, a manifest file path in env, or a + single-threaded pipe. This handoff is not the remote transport protocol. +4. The session runner reads the manifest before starting transport threads and + constructs `Worker(level=3)`. +5. The runner then performs an explicit prestart step equivalent to + `inner_worker.init()` plus `_start_hierarchical()` for the inner Worker: + allocate local mailboxes, fork local chip/sub children, register local + endpoints with the inner C++ Worker, and start the inner Scheduler and + `WorkerThread`s. +6. Only after this local L3 child tree is established does the session runner + bring up sockets, RDMA queue pairs, health threads, or UB doorbells for task + traffic. +7. The runner then performs the remote protocol `HELLO`/ready handshake over + the ordered command lane. `HELLO` confirms session identity, endpoint + identity, protocol version, transport kind, and negotiated features; it does + not carry the bootstrap manifest. +8. Session shutdown rejects new frames, completes or fails in-flight tasks, + drains cleanup, closes the inner Worker, and exits the runner process. + +This keeps the local L3 fork/shm implementation intact while preventing a +multi-threaded network daemon from becoming the process that performs the +forks. + +`HELLO ready_state=READY` is a scheduling barrier, not just a liveness signal. +The parent must not put a remote endpoint into the schedulable set until the +runner has completed prestart, installed the bootstrap registries, initialized +the buffer/import registry, started the command and health lanes, and confirmed +the negotiated feature set. This mirrors the distributed-system convention that +a worker or actor becomes visible only after runtime and dependency +initialization has completed. + +## Endpoint Identity and Callable Routing + +Remote scheduling needs explicit callable namespaces and an explicit mapping +from callable ids to eligible NEXT_LEVEL endpoints. The current scheduler can +otherwise choose any idle worker, which is only correct when every NEXT_LEVEL +child has the same callable registry. + +Required contracts: + +- Every local or remote NEXT_LEVEL child has a stable `endpoint_id` equal to + its logical worker id in `WorkerManager`. +- `register()` continues to register local callables for local fork/shm + endpoints. +- `register_remote(remote_callable, workers=...)` allocates an outer cid in the + parent `Worker(level=4)` id space, but binds that cid to one or more remote + endpoint ids. +- Bootstrap manifests are generated by the parent. Users provide remote + callable descriptors; users do not hand-author raw `cid -> callable` maps. +- Remote callable descriptors have two Python forms: + - `PYTHON_IMPORT`: a bounded `module:qualname` import path. This is required + for the remote L3 baseline. + - `PYTHON_SERIALIZED`: a versioned payload produced by the PR #839 callable + serializer. This is allowed only when parent and session negotiate the + serializer version, payload limits, Python ABI/runtime compatibility, and + dependency/runtime-environment compatibility. +- Remote L3 uses two independent cid namespaces: + - **Outer remote cid namespace**: parent-assigned cids carried in L4 TASK + frames. These cids select the remote L3 orchestration callable. + - **Inner L3 cid namespace**: cids registered on the session runner's + `inner_worker = Worker(level=3)`. Remote L3 orch functions use these cids + when they call `orch.submit_next_level(...)` or `orch.submit_sub(...)`. +- The two namespaces may contain the same integer values, but they are not the + same registry. A cid from the parent TASK frame must not be assumed to name a + chip/sub callable inside the inner L3 Worker. +- Dynamic Python callable registration follows the public visibility and cid + lifecycle semantics from local dynamic registration: registration is + synchronous per selected endpoint, future TASK frames may use the cid only + after the control reply succeeds, unregister/cid reuse clears stale callable + state, and TASK/control ordering prevents a TASK from observing a partially + registered cid. +- Import-path descriptors are the required remote baseline. Serialized Python + callable payloads preserve the same cid lifecycle but remain an optional + negotiated feature because they require Ray-like environment and serializer + compatibility checks that local fork/COW registration does not need. +- Multi-endpoint `register_remote(..., workers=[...])` is all-or-nothing by + default. The parent sends a prepare phase to every selected endpoint, commits + the cid only after every prepare succeeds, and exposes the cid to future TASK + frames only after every commit reply succeeds. If any endpoint fails prepare + or commit, the parent aborts the transaction, keeps the cid invisible, and + either rolls back successful endpoints or marks endpoints with unknown state + failed. +- Multi-endpoint unregister uses a tombstone state. A cid is unavailable for + reuse until every selected endpoint confirms unregister cleanup, or until any + non-responsive endpoint is removed from eligibility and marked failed. This + prevents stale callable residue from being observed by later TASK frames. +- `TaskSlotState` stores the final eligible endpoint set for the slot. This is + the intersection of endpoints that can run the `callable_id` and endpoints + that can access every tensor/buffer referenced by the slot. +- If the user passes `worker=N`, submit-time validation checks that endpoint + `N` is eligible for that cid and for the slot's tensor sidecars. +- If `worker=-1`, the Scheduler chooses only from idle endpoints in the + slot's eligible set. +- Group submit validates each affinity independently. Unconstrained group + members are assigned distinct idle eligible endpoints. +- Mixed local + remote NEXT_LEVEL pools are allowed only when the callable id + is registered on every endpoint that can receive the slot and the slot's + tensors are materialized in a representation those endpoints can consume. + A callable registered on both local and remote endpoints does not make a + remote-buffer task eligible for the local endpoint. + +Example API shape: + +```python +from simpler.worker import RemoteCallable, RemoteWorkerSpec, Worker + +w4 = Worker(level=4) + +l3 = RemoteWorkerSpec( + endpoint="node17:19073", + platform="a2a3", + runtime="tensormap_and_ringbuffer", + device_ids=list(range(16)), + num_sub_workers=2, + transport="roce", +) + +l3_worker_id = w4.add_remote_worker(l3) +l3_cid = w4.register_remote( + RemoteCallable("my_pkg.remote_orch:l3_orch"), + workers=[l3_worker_id], +) +w4.init() +``` + +`add_worker(local_worker)` remains unchanged and continues to use fork/shm. + +Dynamic remote registration uses the same cid lifecycle whether the descriptor +is installed at bootstrap or through a later control frame. If a callable refers +to inner L3 cids, those cids are values from the inner namespace installed by +the session manifest or by prior remote control registration, not the outer cid +used to dispatch the remote L3 orch callable. + +## Remote Worker Session + +The parent generates a bootstrap manifest and sends it to the +`simpler-remote-worker` daemon as part of session creation. The daemon validates +it and hands it to the session runner before the runner starts any transport +threads: + +```text +session_id +parent_worker_level +remote_worker_level = 3 +endpoint_id +platform, runtime, build flag +device_ids, num_sub_workers, heap_ring_size +callable registry: + outer registry: + parent-assigned outer cid -> remote L3 orch callable descriptor + descriptor = PYTHON_IMPORT or negotiated PYTHON_SERIALIZED + inner L3 registry: + inner cid -> ChipCallable blob metadata, when needed + inner cid -> Python sub/orch callable descriptor, when needed +transport kind: roce | hccs | ub | sim +feature flags +``` + +The session runner installs the outer registry into its remote TASK dispatcher. +It installs the inner registry into `inner_worker = Worker(level=3)` during +prestart and before `HELLO READY`. Remote controls may add or remove entries in +either namespace after bootstrap: + +- registering an outer Python callable changes what future L4 TASK frames can + dispatch on this remote endpoint; +- registering an inner Python callable changes what already-registered remote + L3 orch functions can submit to `inner_worker`; +- registering an inner `ChipCallable` follows the existing dynamic callable IPC + cascade shape, but the remote control payload is a versioned remote frame + instead of a local POSIX-shm mailbox name. + +For a TASK frame, the session runner: + +1. Validates the session and sequence number. +2. Decodes `RemoteTaskArgs`. +3. Translates remote tensor descriptors into local `ContinuousTensor` values. +4. Looks up the L3 orchestration function in the outer registry by + parent-assigned outer cid. +5. Calls `inner_worker.run(orch_fn, args, config)`. +6. Sends completion with success or bounded traceback text. + +For a CONTROL frame, it forwards the operation to the inner worker or its +buffer registry, then replies with a typed result. + +Session execution rules: + +- The baseline remote endpoint runs at most one TASK at a time. This matches + the current one-`WorkerThread`-per-child local scheduling model and keeps + ordering, buffer lifetime, and cid visibility simple. +- State-changing CONTROL frames such as register, unregister, buffer free, + import release, comm init, and domain allocation serialize with TASK + execution on the ordered command lane. They are not applied concurrently with + a running TASK on the same endpoint. +- Bulk data movement may use a separate data plane, but the state change that + makes staged bytes, callable payloads, or imported handles visible is ordered + by the command lane. +- Health/liveness does not depend on the command lane making progress. Each + session has an independent health lane or equivalent transport-level health + signal so a long-running `inner_worker.run()` does not look like endpoint + failure merely because queued command-lane frames cannot be serviced. + +## Remote TaskArgs Representation + +Keep `ContinuousTensor` as the L2 ABI. Do not overload raw pointer values to +carry transport state. + +Public Python uses a sidecar representation: + +- `RemoteBufferHandle` identifies an allocated or imported remote buffer. +- `RemoteTensorRef(handle, offset, shape, dtype)` is accepted by + `TaskArgs.add_tensor()` wherever a remote submit is legal. +- The Python/C++ binding stores a normal tensor metadata entry plus a hidden + sidecar entry at the same tensor index. +- Local endpoints reject remote tensor refs. `RemoteTensorRef` is transport + metadata, not a local mailbox ABI. A local fork/shm endpoint becomes eligible + only after the data has been explicitly imported, staged, or materialized into + a local-addressable `ContinuousTensor`. +- Remote endpoints require a sidecar/descriptor for every tensor that carries + data over the remote protocol, including `HOST_INLINE` tensors. A null + sidecar is allowed only for metadata-only tensors with no data payload. + Remote endpoints reject bare host pointers unless an explicit staging API + produced a remote handle. +- Remote submits reject `OUTPUT` tensors whose `ContinuousTensor.data == 0` + unless the caller has already supplied a `RemoteTensorRef` sidecar. The first + implementation does not auto-allocate remote outputs during submit. + +Parent-side slots therefore store existing `TaskArgs`, an optional +`RemoteTaskArgsView`, eligible endpoint ids, and captured remote-buffer refs. + +`Orchestrator::infer_deps()` builds `TensorKey` from remote handle metadata +when a sidecar exists. The first implementation intentionally preserves the +current exact-start TensorMap semantics: local fork/shm keys use +`(ptr, worker)`, while remote keys use +`(address_kind, owner_endpoint_id, buffer_id, generation, offset)`. The tensor +byte length is bounds-checked by the descriptor but does not participate in +dependency lookup. This means two remote tensors that reference overlapping +byte ranges with different offsets are not automatically ordered, matching the +current local pointer-key behavior. + +## Failure Semantics + +Remote completion is explicit and sequence-based. Local mailbox completion must +be adapted to the same outcome contract. Failure must not make downstream tasks +run as if the producer succeeded. + +Required parent-side behavior: + +- `RemoteL3Endpoint::run()` blocks for the matching completion sequence. +- `LocalMailboxEndpoint::run()` maps a non-zero mailbox error to + `task_failure` instead of reporting a successful completion. +- Non-zero task or endpoint errors become candidates for the worker's first + reported error. +- The worker still notifies the Scheduler so `drain()` cannot hang. +- The notification carries an outcome: success, task failure, or endpoint + failure. +- Failed slots transition to a failed/poisoned state rather than successful + `COMPLETED`. +- Downstream consumers of a failed producer are marked failed/skipped and are + not dispatched. +- `drain()` waits for bookkeeping and cleanup, then raises the first root + error with remote host, endpoint id, cid, and sequence in the message. + +Local mailbox dispatch keeps first-error-wins only for final error reporting. +It must not mark a failed child dispatch as successful `COMPLETED`. The remote +buffer path and the local adapter both use the same poisoned dependency +propagation before dependent tasks are exposed to failed producer outputs. + +Every blocking wait must have a configurable timeout. Remote transport must +not copy the current local control path's infinite spin-wait failure mode. + +## Buffer Lifecycle + +Remote buffers need an owner, generation, and deferred physical free. The +parent owns the visible handle state; the session runner owns remote physical +memory and imported mappings. + +See +[buffers-and-transports.md](remote-l3-worker-design/buffers-and-transports.md) +for the handle schema, control commands, release policy, and A2/A3/A5 backend +requirements. + +## Protocol + +Do not reuse the raw 4096-byte mailbox format across hosts. It has no version +field, no sequence number, and assumes shared virtual memory. + +Remote endpoints use a versioned frame protocol with `HELLO`, `TASK`, +`CONTROL`, `CONTROL_REPLY`, `COMPLETION`, `HEALTH`, and `SHUTDOWN` frames. The +local path keeps the existing mailbox layout behind `LocalMailboxEndpoint`. +Remote frames use canonical little-endian field encoding for `CallConfig`, +`ContinuousTensor`, tensor descriptors, strings, counts, and enums; they do not +memcpy local C++ POD structs onto the wire. Each endpoint has one ordered +command lane for runtime state-changing frames, so TASK cannot overtake +registry-changing CONTROL. Liveness uses a separate health lane or equivalent +transport-level signal and is not queued behind user TASK execution. + +See [protocol.md](remote-l3-worker-design/protocol.md) for frame layout, +remote tensor descriptors, ordering, and bounds-checking requirements. + +## Rollout + +The recommended first cut is conservative: + +1. Land the endpoint abstraction and local adapter. +2. Add remote tensor sidecars and endpoint eligibility metadata. +3. Add the versioned frame codec and the independent health-lane contract. +4. Add remote callable registration with all-or-nothing multi-endpoint + visibility and tombstone-based cid reuse. +5. Add the fork-safe simulation session runner with explicit prestart before + `HELLO READY`. +6. Prove local behavior is unchanged and remote sim behavior handles success, + failure, cid mapping, timeouts, health, and buffer cleanup. +7. Add hardware transports behind the same protocol. + +See +[implementation-plan.md](remote-l3-worker-design/implementation-plan.md) +for the detailed PR sequence and validation matrix. diff --git a/docs/remote-l3-worker-design/buffers-and-transports.md b/docs/remote-l3-worker-design/buffers-and-transports.md new file mode 100644 index 000000000..1a44dbdd2 --- /dev/null +++ b/docs/remote-l3-worker-design/buffers-and-transports.md @@ -0,0 +1,291 @@ +# Remote L3 Buffers and Transports + +This document defines remote buffer lifetime and backend transport contracts +for remote L3 NEXT_LEVEL endpoints. + +## Buffer Handles + +Remote buffers need an owner, generation, and deferred free point. The parent +tracks user-visible handle state; the remote session runner owns physical +memory and imported mappings. + +Parent-side handle: + +```text +RemoteBufferHandle: + endpoint_id + buffer_id + generation + address_space + nbytes + remote_addr + rkey_or_token + ub_ldst_va + ref_state + live_slot_refs +``` + +`buffer_id` may be reused only with a new `generation`. Stale completions, +imports, or frees whose generation does not match are ignored or reported as +session errors. + +## Public Memory API + +Remote memory APIs return handles, not bare integer pointers: + +```python +buf = w4.remote_malloc(worker=l3_worker_id, nbytes=4096) +w4.remote_copy_to(buf, host_ptr, 4096) + +args = TaskArgs() +args.add_tensor( + RemoteTensorRef(buf, offset=0, shape=(1024,), dtype=DataType.FLOAT32), + TensorArgType.INPUT, +) +``` + +The binding stores a hidden remote sidecar beside the tensor metadata. +`RemoteTensorRef` is transport metadata, not an extension of the local mailbox +ABI. Local fork/shm endpoints reject it unless the buffer has first been +explicitly imported, staged, or materialized into a local-addressable +`ContinuousTensor`. Remote endpoints reject bare host pointers unless explicit +staging produced a handle. + +## TaskArgs Sidecar Contract + +`ContinuousTensor` remains the tensor metadata ABI. Remote transport metadata +is stored in a per-`TaskArgs` sidecar keyed by tensor index. + +Python-facing rules: + +- `TaskArgs.add_tensor(RemoteTensorRef(...), tag)` appends one normal tensor + metadata entry and one remote sidecar entry at the same tensor index. +- `RemoteTensorRef` is not converted to a fake integer pointer. +- `RemoteBufferHandle` is opaque to user code. Users may inspect endpoint, + size, and release state, but not transport keys such as `rkey`. +- A `TaskArgs` containing any remote sidecar is legal only when the final + selected endpoint set contains remote endpoints that can consume every + referenced sidecar. +- Local `submit_next_level()` rejects remote sidecars before slot commit unless + an explicit import/staging API has converted them into local-addressable + `ContinuousTensor` values and removed the remote sidecar. +- Remote submit rejects `OUTPUT` tensors with `ContinuousTensor.data == 0` + unless an explicit remote allocation API has already produced a + `RemoteTensorRef` sidecar for that tensor. + +Parent C++ slot rules: + +- `TaskSlotState` owns a copy of the sidecar for the slot lifetime. +- Sidecar length must equal `tensor_count`; entries can be null only for + metadata-only tensors. `HOST_INLINE` payloads still have sidecar descriptors + so the frame codec has one validation path for every remote data payload. +- Submit validation captures a live ref on every referenced + `RemoteBufferHandle` before the slot becomes visible to the Scheduler. +- Validation fails before slot commit when the intersection of callable-eligible + endpoints and data-eligible endpoints is empty, or when a remote tensor names + an ineligible endpoint, stale generation, out-of-range offset, or released + handle. +- Group submit stores one sidecar per group member, aligned with + `task_args_list[i]`. + +Endpoint rules: + +- `LocalMailboxEndpoint` rejects non-empty sidecars. It cannot encode remote + descriptors into the local 4096-byte mailbox, and its child processes expect + `ContinuousTensor.data` to be a local host/shm pointer or a local child-memory + pointer. +- `RemoteL3Endpoint` requires a sidecar for every tensor payload that crosses + the remote protocol, including `HOST_INLINE` payloads. +- Remote TASK frames write `ContinuousTensorWire.data == 0`; parent virtual + addresses never cross the remote protocol. +- A remote tensor with `child_memory=True` and no sidecar is invalid. Local + child-memory pointers are meaningful only inside fork/shm topology. +- The remote session runner translates each `RemoteTensorDesc` into a local + `ContinuousTensor` and fills `data` from its validated local mapping + immediately before invoking `inner_worker.run()`. + +## Remote OUTPUT Allocation Policy + +The first implementation does not mirror local HeapRing auto-allocation for +remote outputs. In the local fork/shm path, an `OUTPUT` tensor with +`ContinuousTensor.data == 0` is assigned a parent HeapRing pointer during +submit, and forked children can dereference that shared virtual address. A +remote L3 worker cannot use a parent-host HeapRing pointer. + +Remote callers must allocate or import output storage explicitly before submit: + +```python +out = w4.remote_malloc(worker=l3_worker_id, nbytes=4096) + +args = TaskArgs() +args.add_tensor( + RemoteTensorRef(out, offset=0, shape=(1024,), dtype=DataType.FLOAT32), + TensorArgType.OUTPUT, +) +orch.submit_next_level(cid, args, cfg, worker=l3_worker_id) +``` + +This keeps submit-time validation simple: the slot already carries complete +data eligibility, handle generation, bounds, and lifetime refs before it +becomes visible to the Scheduler. + +Future work may add remote output auto-allocation, but only after the runtime +has a well-defined pre-dispatch endpoint selection policy. Auto-allocation must +decide which endpoint owns the output before slot commit, allocate or import the +remote buffer, attach the generated sidecar to the correct tensor index, and +handle group submits where each member may need storage on a different +endpoint. Until those rules exist, remote null `OUTPUT` tensors fail fast. + +## Required Controls + +| Command | Purpose | +| ------- | ------- | +| `ALLOC_REMOTE_BUFFER` | Allocate remote L3 host or chip memory. | +| `FREE_REMOTE_BUFFER` | Mark a handle released; physical free is deferred. | +| `COPY_TO_REMOTE` | Stage host data into a remote buffer. | +| `COPY_FROM_REMOTE` | Pull remote output data back to host. | +| `EXPORT_BUFFER` | Return RDMA key or UB mapping metadata. | +| `IMPORT_BUFFER` | Import a peer buffer into a remote worker. | +| `RELEASE_IMPORT` | Drop an imported peer mapping. | + +Control commands are typed remote protocol frames. They are not the local +mailbox `CTRL_*` integers. + +## Release Policy + +- Slot refs are acquired during `submit_next_level()` while walking `TaskArgs`. +- Captured buffers stay live until every capturing slot has reached a terminal + state and every producer/consumer reference that can expose that buffer has + reached `CONSUMED` or failed cleanup. +- Explicit buffers used only as INPUT still need slot refs. They have no + producer slot to protect them. +- Runtime-managed OUTPUT buffers follow the producer slot's terminal cleanup. +- `FREE_REMOTE_BUFFER` and `RELEASE_IMPORT` mark handles released. Physical + free or import teardown runs only when the released handle has no live slot + refs. +- If a run fails, the same post-drain cleanup path runs before the next run. +- Session shutdown rejects new work and frees all session-owned buffers after + completing or failing in-flight tasks. + +The registry therefore needs both a user-visible release state and a live +slot-ref count. Failed slots still release captured refs through the same +terminal cleanup path as successful slots. + +## Dependency Keys + +`TensorKey` must grow beyond the current `{ptr, int8 worker}` shape for remote +buffers while preserving the current exact-start lookup semantics. Today, local +dependency tracking keys only on a tensor's start pointer and worker id; shape +and byte length do not participate in lookup. Remote keys follow the same rule: + +```text +address_kind +owner_endpoint_id +buffer_id +offset_begin +generation +``` + +Local fork/shm keys remain a compatibility subset: + +```text +host pointer: (LOCAL_HOST, -1, ptr) +local child memory: (LOCAL_CHILD, worker_id, ptr) +``` + +Known limitation: two remote tensors that reference overlapping byte ranges +with different `offset_begin` values do not automatically depend on each other. +For example, a producer writing `[0, 4096)` and a consumer reading +`[1024, 2048)` map to different dependency keys. This matches the current local +`ptr`-based TensorMap behavior, where a subview at `base + offset` is a +different key from `base`. + +The first implementation chooses this route to keep remote scheduling behavior +compatible with local fork/shm semantics and to avoid changing TensorMap into a +range index as part of the transport bring-up. `offset_end`/`nbytes` remains in +`RemoteTensorDesc` for bounds checks and for a future range-overlap TensorMap +upgrade, but it is not part of the first dependency key. + +## RemoteTransport Interface + +All backends implement the same transport contract: + +```cpp +class RemoteTransport { +public: + virtual void connect(const RemoteEndpointSpec &) = 0; + virtual void post_task(uint64_t seq, Span frame) = 0; + virtual Frame wait_completion(uint64_t seq, Timeout timeout) = 0; + virtual ControlReply control(const ControlRequest &) = 0; + virtual RemoteMemory export_memory(BufferId) = 0; + virtual ImportedMemory import_memory(const RemoteMemory &) = 0; + virtual void close() = 0; +}; +``` + +`post_task()` and `control()` enqueue request frames on the endpoint's ordered +command lane. Data-plane transfers may use RDMA/HCCS/UB paths, but TASK +doorbells, CONTROL frames, replies, completions, and shutdown state are ordered +by the command lane. Reply frames carry the request sequence they answer. +`wait_completion()` waits for an explicit remote `COMPLETION` frame; RDMA write +completion alone is never a task completion signal. `control()` sends a +`CONTROL` frame and returns only after the matching `CONTROL_REPLY` arrives or a +timeout/disconnect is converted into a failed reply. Liveness is handled by an +independent health lane or transport keepalive; it is not queued behind the +ordered command lane. + +## A2 RoCE + +- Use RDMA SEND/RECV for small frames and completion records. +- Keep TASK doorbells, control frames, control replies, and completions on one + per-endpoint ordered SEND/RECV command lane. +- Use a separate health SEND/RECV lane or transport keepalive for liveness. +- Use registered pinned staging buffers for large callable blobs and bulk data. +- Export buffers as `(addr, length, rkey, qp/session metadata)`. +- Complete tasks only after a SEND completion frame from the session runner. +- Bound every wait with a timeout and convert disconnects into endpoint + failure completions. + +## A3 HCCS + +- Keep the same `RemoteTransport` contract as RoCE. +- Implement memory export/import through the HCCS-capable platform HAL. +- Preserve the same command-lane ordering rules: task/control frames are + observed in sequence order, command frame visible before doorbell, and remote + writes complete before completion frame. +- Provide health independently from command-lane progress so long-running TASK + execution does not cause false endpoint failure. +- Reuse the frame codec and buffer registry tests from the RoCE path. + +## A5 UB + +- Export both RDMA metadata and, when available, an LD/ST mapping token. +- Use LD/ST for doorbells and small completion records only when the mapping + is coherent for the participating hosts. +- Preserve the same per-endpoint command-lane order for TASK, CONTROL, + CONTROL_REPLY, COMPLETION, and SHUTDOWN frames. +- Keep UB health doorbells or transport health independent from the command + lane used for state-changing frames. +- Use RDMA for bulk transfers until platform benchmarks justify LD/ST bulk + copies. +- Add explicit fences around: + - task payload writes before doorbell; + - remote output writes before completion; + - parent completion read before dependent task dispatch. +- Keep RDMA fallback for all UB LD/ST paths. + +## Simulation Backend + +The simulation backend uses TCP or Unix sockets plus local files/shm for +integration tests. It must exercise: + +- framed protocol encode/decode; +- sequence numbers; +- remote callable bootstrap; +- endpoint eligibility validation; +- success and error completions; +- failed dependency poisoning; +- buffer registry ref capture and deferred free; +- timeout handling. + +It must not depend on A2/A3/A5 hardware. diff --git a/docs/remote-l3-worker-design/implementation-plan.md b/docs/remote-l3-worker-design/implementation-plan.md new file mode 100644 index 000000000..625396d01 --- /dev/null +++ b/docs/remote-l3-worker-design/implementation-plan.md @@ -0,0 +1,293 @@ +# Remote L3 Implementation Plan + +Deliver remote L3 support in small PRs. Each step should keep existing local +fork/shm behavior working. + +## PR Sequence + +1. Endpoint interface and local adapter. + - Add `WorkerEndpoint`. + - Define `WorkerEndpoint::run()` to return an explicit outcome: success, + task failure, or endpoint failure. + - Move current mailbox code into `LocalMailboxEndpoint`. + - Teach `LocalMailboxEndpoint` to map `MAILBOX_OFF_ERROR == 0` to success + and non-zero child mailbox errors to task failure. + - Treat local dispatch exceptions, child crash detection, and timeout paths + as endpoint failure once those paths are available. + - Keep `WorkerManager::add_next_level(void *mailbox)` working by wrapping + the mailbox in a local endpoint. + - Thread the endpoint outcome through the WorkerThread completion callback + without yet changing all downstream DAG poisoning behavior. + - Add local adapter regression tests for existing L3/L4 examples. + +2. Endpoint eligibility metadata. + - Assign each NEXT_LEVEL child a stable `endpoint_id`. + - Store `callable_id -> eligible endpoint ids` in parent runtime metadata. + - Extend submit slots with final eligible endpoint sets computed as + callable eligibility intersected with tensor/buffer data eligibility. + - Teach Scheduler/WorkerManager to pick only eligible idle workers. + - Validate `worker=` affinity against the slot's final eligible set. + - Keep current docs in sync: describe local fork/shm as + `LocalMailboxEndpoint` and remote L3 as a framed endpoint, not as another + mailbox child loop. + +3. Remote task sidecars and dependency keys. + - Add `RemoteBufferHandle`, `RemoteTensorRef`, and `RemoteTensorDesc`. + - Store a `RemoteTaskArgsView` sidecar beside existing `TaskArgs`. + - Extend `TensorKey` for remote endpoint, buffer id, generation, logical + start offset, and address kind. + - Teach `Orchestrator::infer_deps()` to use remote logical keys while + preserving existing local keys. + - Reject remote sidecars on local fork/shm endpoints unless an explicit + import/staging API has converted them into local-addressable tensors. + - Reject unstaged raw host pointers before a remote slot is committed. + - Reject remote `OUTPUT` tensors with `data == 0` unless an explicit remote + allocation API has already produced a `RemoteTensorRef` sidecar. + +4. Failed task poisoning. + - Add per-member group state/outcome tracking so group failure can skip + unstarted members while waiting for already-dispatched members to finish. + - Add failed/poisoned slot handling. + - Prevent downstream consumers of failed producers from dispatching. + - Preserve `drain()` cleanup and first-error-wins reporting. + +5. Versioned remote frame codec. + - Add `remote_wire.h/.cpp`. + - Implement canonical little-endian encode/decode for `CallConfigWire`, + `ContinuousTensorWire`, frame headers, descriptors, counts, strings, and + enum values. + - Implement the `HOST_INLINE` inline byte arena with descriptor + `inline_payload_offset` / `inline_payload_len` validation. + - Keep local mailbox `write_blob` / `read_blob` local-only; remote codec must + not memcpy C++ POD structs as its wire format. + - Implement encode/decode bounds checks for all frame types. + - Define and test `CONTROL_REPLY` encode/decode for command success, + command failure, result payloads, and sequence matching. + - Define and test the per-endpoint ordered command lane for TASK, CONTROL, + CONTROL_REPLY, COMPLETION, and SHUTDOWN frames. + - Define and test an independent `HEALTH` lane or transport keepalive so + liveness is not queued behind long-running TASK execution. + - Include tests for corrupt lengths, tensor counts, sequence mismatch, and + bounded error payloads. + - Include tests that reject unknown enum values, non-zero reserved fields, + and truncated multi-byte fields. + - Include tests that reject non-zero `ContinuousTensorWire.data` in remote + TASK frames. + +6. Remote callable registry. + - Implement `RemoteCallable("module:qualname")`. + - Preserve the public cid lifecycle from local dynamic Python registration: + visibility only after register reply, unregister/cid reuse, stale-state + cleanup, and TASK/control ordering. + - Treat import-path callables as the baseline remote mode. + - Support PR #839 serialized Python callable payloads only as a negotiated + feature with serializer version, payload limit, Python ABI/runtime, and + dependency/runtime-environment compatibility checks. + - Implement `register_remote(..., workers=...)`. + - Implement multi-endpoint all-or-nothing registration with prepare, commit, + and abort controls. Keep the parent cid invisible until every selected + endpoint commits, and mark uncertain endpoints failed rather than leaving a + partially visible cid. + - Implement unregister tombstones. Do not reuse a cid until every selected + endpoint confirms cleanup or is removed from eligibility as failed. + - Split cid handling into an outer remote-orch namespace and an inner L3 + Worker namespace; never assume a parent TASK cid is an inner chip/sub cid. + - Make parent/session-assigned cid values the only cid source in each + manifest namespace. + - Define post-bootstrap namespace-aware prepare, commit, abort, and + unregister frames for Python callables and `ChipCallable` entries. + +7. Fork-safe simulation session runner. + - Add `simpler-remote-worker` control entry point. + - Add per-session `simpler-remote-l3-session` runner. + - Pass the validated bootstrap manifest from daemon to runner through an + inherited fd, manifest path in env, or single-threaded pipe before any + runner transport threads start. + - Add an explicit runner prestart step equivalent to `inner_worker.init()` + plus `_start_hierarchical()`: fork L3 chip/sub children, register local + endpoints, and start the inner Scheduler before any remote transport or + health threads are started. + - Start the sim transport only after the local L3 child tree is established, + then run the post-prestart `HELLO`/ready handshake. + - Treat `HELLO ready_state=READY` as a scheduling barrier; the parent must + not schedule an endpoint that is alive but not prestarted. + - Run TASK frames over the sim transport and return completions. + - Add localhost two-process integration tests. + +8. Remote control-plane parity. + - Map existing NEXT_LEVEL controls onto typed remote frames: + prepare, register, unregister, comm init, domain alloc, and domain + release. + - Keep local mailbox sub-command ids local-only. + - Add tests for post-bootstrap ChipCallable registration and dynamic domain + allocation through the remote session runner. + +9. Remote buffer registry. + - Add `ALLOC_REMOTE_BUFFER`, `FREE_REMOTE_BUFFER`, `COPY_TO_REMOTE`, and + `COPY_FROM_REMOTE`. + - Track per-slot capture refs for explicit buffers and imported peer + buffers. + - Tie physical free and release-import to post-drain cleanup after all + captured refs drop. + +10. RoCE transport. + - Implement connection setup, SEND/RECV frames, registered staging buffers, + and timeout/error paths. + - Add a hardware-gated smoke test with one remote L3 worker. + +11. HCCS transport. + - Implement the same transport contract through platform HCCS APIs. + - Reuse the RoCE frame and buffer registry tests. + +12. A5 UB transport. + - Add UB export/import metadata. + - Implement LD/ST doorbell and completion path with fences. + - Keep RDMA fallback for bulk transfers. + +13. Remote `allocate_domain()`. + - Extend `CommDomainHandle` to carry remote endpoint ids. + - Allocate/import windows collectively across remote workers. + - Preserve deferred release after `drain()`. + +## Required Tests Before Hardware Backends + +| Test | Expected result | +| ---- | --------------- | +| Local adapter regression | Existing L3/L4 fork/shm behavior unchanged. | +| Endpoint eligibility | Scheduler never picks an ineligible endpoint. | +| Frame fuzz/bounds | Corrupt lengths and counts are rejected. | +| Remote sim hello | Parent bootstraps remote L3 and shuts down cleanly. | +| Manifest handoff | Runner reads manifest before transport starts. | +| Prestart barrier | HELLO READY only after inner L3 scheduler is started. | +| Remote sim task | L4 parent dispatches one L3 orch task successfully. | +| Remote sim error | Remote orch raises; parent raises with host/seq/cid. | +| Failed dependency | Consumer of failed remote producer is not dispatched. | +| Remote cid mapping | Daemon resolves non-zero parent-assigned remote cid. | +| Remote dep key | Shared remote buffer serializes through TensorMap. | +| Raw pointer rejection | Unstaged host pointer fails before slot commit. | +| Wire data zero | Non-zero remote TASK tensor data is rejected. | +| HOST_INLINE desc | Inline payloads require a descriptor and bounds checks. | +| Remote buffer copy | Host stages input, remote writes output, host pulls. | +| Input-only free deferral | Released input buffer survives queued consumers. | +| Timeout | Killed session runner produces bounded failure. | +| Dynamic Python register | Import-path callable register/unregister works. | +| Serialized Python gate | PR #839 payload works only after feature negotiation. | +| Callable visibility | TASK with cid works only after register reply. | +| Register partial failure | Multi-endpoint register is invisible after partial fail. | +| Unregister tombstone | Reused cid waits for cleanup or failed endpoint removal. | +| Command-lane order | TASK cannot overtake register/unregister control. | +| Health during long task | Health remains live while command lane runs a task. | +| Callable kind gate | Unsupported callable kind rejects without cid install. | +| Dynamic inner register | Inner Python and ChipCallable cids can run. | +| Remote cid namespace | Outer TASK cid cannot collide with inner cid. | +| Remote control parity | Register/unregister/domain reaches namespace. | + +## Hardware-Gated Tests + +- A2 RoCE single remote L3 task. +- A2 RoCE remote buffer copy round trip. +- A3 HCCS single remote L3 task. +- A5 UB LD/ST doorbell plus RDMA fallback. +- Remote domain allocation and deferred release across two remote L3 workers. + +## Open Decisions + +- Exact platform HAL names for HCCS and UB export/import. +- Authentication and isolation for remote daemon sessions. +- Exact compatibility metadata required for PR #839 serialized Python callable + payloads beyond serializer version and Python ABI/runtime. +- How endpoint health feeds scheduler-level eligibility after the transport + reports a failed health lane. +- How much of `CommContext` should remain shared with PTO-ISA once remote UB + address metadata is added. + +The first cut should land endpoint abstraction, endpoint eligibility, +remote callable registration, failure poisoning, and the simulation runner +before any hardware transport code. + +## Failure Poisoning Contract + +Worker failures must finish DAG bookkeeping without pretending the producer +succeeded. The contract applies to remote completions, endpoint failures, and +local mailbox child errors reported by `LocalMailboxEndpoint`. + +State model: + +```text +FREE -> PENDING -> READY -> RUNNING -> COMPLETED -> CONSUMED + \-> FAILED -> CONSUMED +``` + +Rules: + +- Worker completion carries `success`, `task_failure`, or `endpoint_failure`. +- `success` transitions `RUNNING -> COMPLETED`. +- `task_failure` or `endpoint_failure` transitions `RUNNING -> FAILED`. +- `LocalMailboxEndpoint` converts a non-zero child mailbox error into + `task_failure`; first-error-wins controls only which root error message is + retained for `drain()`. +- A `FAILED` producer releases fanout bookkeeping, but consumers are marked + `FAILED` instead of `READY`. +- Failed consumers are never dispatched. +- `FAILED -> CONSUMED` runs the same cleanup hooks as `COMPLETED -> CONSUMED`: + TensorMap erase, ring release, remote buffer ref release, and deferred free + scheduling. +- `drain()` waits for all successful, failed, and skipped slots to become + `CONSUMED`, then rethrows the first root failure. + +Group task rules: + +`TaskSlotState` stores per-member execution state for group slots: + +```text +GroupMemberState: + NOT_DISPATCHED + RUNNING + SUCCESS + FAILED + SKIPPED + +GroupMemberOutcome: + success + task_failure + endpoint_failure + skipped +``` + +Additional group bookkeeping: + +- `member_states[group_size]` and `member_outcomes[group_size]`; +- `group_terminal_count`: members in `SUCCESS`, `FAILED`, or `SKIPPED`; +- `group_dispatched_count`: members that reached `RUNNING`; +- `group_failed`: set when any member reports `task_failure` or + `endpoint_failure`; +- `group_first_failure_index`: first failed member used for root-error context. + +Rules: + +- A group slot is successful only if every member reaches `SUCCESS`. +- On dispatch of member `i`, transition + `NOT_DISPATCHED -> RUNNING` before handing work to the endpoint. +- On successful completion of member `i`, transition `RUNNING -> SUCCESS` and + increment `group_terminal_count`. +- On failed completion of member `i`, transition `RUNNING -> FAILED`, set + `group_failed`, record `group_first_failure_index` if unset, and increment + `group_terminal_count`. +- When `group_failed` becomes true, every member still in `NOT_DISPATCHED` + transitions to `SKIPPED` and increments `group_terminal_count`. Skipped + members are never dispatched. +- Members already in `RUNNING` are allowed to complete so their endpoint state + and buffer refs can be cleaned up. +- The group slot reaches terminal outcome only when + `group_terminal_count == group_size`. +- If the terminal group has any `FAILED` member, the slot outcome is `FAILED`; + otherwise it is `COMPLETED`. +- Slot cleanup runs once for the whole group after the group slot reaches its + terminal outcome. + +Error reporting: + +- The first root failure message includes remote host, endpoint id, + callable id, and sequence. +- Poisoned downstream slots should reference the root producer slot instead of + overwriting the first-error-wins message. diff --git a/docs/remote-l3-worker-design/protocol.md b/docs/remote-l3-worker-design/protocol.md new file mode 100644 index 000000000..42179abde --- /dev/null +++ b/docs/remote-l3-worker-design/protocol.md @@ -0,0 +1,399 @@ +# Remote L3 Protocol + +This document defines the remote wire protocol used by +`RemoteL3Endpoint`. The local fork/shm path keeps the existing mailbox layout +behind `LocalMailboxEndpoint`. + +## Frames + +Remote transport must not reuse the raw 4096-byte mailbox format. Define a +versioned frame header: + +```text +FrameHeader: + magic = "SLR3" + version = 1 + frame_type = + HELLO | TASK | CONTROL | CONTROL_REPLY | COMPLETION | HEALTH | SHUTDOWN + session_id + endpoint_id + sequence + payload_bytes + flags +``` + +Rules: + +- `magic` and `version` are validated before reading payload fields. +- `session_id` protects against stale frames from prior sessions. +- `endpoint_id` must match the logical NEXT_LEVEL worker selected by the + parent scheduler. +- For parent-to-runner command requests, `sequence` is monotonic per endpoint + on the ordered command lane. This includes `TASK`, state-changing `CONTROL`, + and `SHUTDOWN`. +- `HEALTH` frames travel on an independent health lane, or use an equivalent + transport-level health signal. They carry their own health sequence and must + not be queued behind long-running TASK execution. +- Reply frames such as `COMPLETION` and `CONTROL_REPLY` carry the sequence of + the request they answer. They do not allocate a new command sequence. +- `payload_bytes` is bounds-checked before allocation or decode. +- Unknown frame types, versions, flags, or oversized payloads fail the + session instead of falling back to best-effort parsing. + +## Wire Encoding + +Remote frames use canonical field encoding. They do not memcpy C++ POD structs +such as `CallConfig` or `ContinuousTensor` onto the wire. The local fork/shm +mailbox may continue to use the raw POD layout because both endpoints are +fork-related processes running the same binary; remote endpoints must treat the +wire schema below as the compatibility contract. + +Encoding rules: + +- Multi-byte integers are little-endian. +- Boolean values are encoded as `uint8`, where `0` is false and `1` is true. +- Enum values are encoded as fixed-width integers. Unknown enum values reject + the frame. +- Strings are `uint32 byte_len` followed by UTF-8 bytes, with per-field maximum + lengths. +- Repeated fields are `uint32 count` followed by that many elements, with + configured maximum counts. +- Reserved fields must be written as zero and rejected when non-zero unless a + later protocol version assigns them. + +`CallConfigWire v1` encodes the current `CallConfig` fields explicitly: + +```text +block_dim: int32 +aicpu_thread_num: int32 +enable_l2_swimlane: int32 +enable_dump_tensor: int32 +enable_pmu: int32 +enable_dep_gen: int32 +output_prefix: string(max=1024) +``` + +`ContinuousTensorWire v1` encodes tensor metadata explicitly. In remote TASK +frames, `data` is not a transferable pointer; it is reserved and must be zero. + +```text +data: uint64 # reserved in remote TASK frames; must be 0 +shapes: uint32[CONTINUOUS_TENSOR_MAX_DIMS] +ndims: uint32 +dtype: uint32 +child_memory: uint8 +reserved: uint8[7] +``` + +The session runner decodes these wire records into local `CallConfig` and +`ContinuousTensor` values before calling `inner_worker.run()`. For tensors with +a remote descriptor, the runner fills the local `ContinuousTensor.data` from +its buffer/import registry after validating the descriptor. Local ABI structs +therefore remain the in-process execution ABI, not the remote transport ABI. + +## HELLO Payload + +`HELLO` is a post-prestart command-lane handshake. The session runner sends or +accepts it only after it has read the bootstrap manifest from the daemon +handoff, constructed `Worker(level=3)`, performed an explicit prestart that +forks local chip/sub children, started the inner scheduler, installed bootstrap +registries, initialized the buffer/import registry, and brought up the command +and health lanes. `HELLO` does not carry the bootstrap manifest. + +```text +session_id +endpoint_id +protocol_version +transport kind +feature flags +ready_state +``` + +The parent treats the endpoint as schedulable only after the `HELLO` exchange +confirms `ready_state=READY`, matching `session_id`, matching `endpoint_id`, +and compatible protocol and feature sets. + +`READY` is a scheduling barrier. A session that can answer liveness probes but +has not completed prestart reports a non-ready state and must not receive TASK +frames. + +## TASK Payload + +```text +callable_id: int32 +config: CallConfigWire v1 +args: RemoteTaskArgsWire v1 +``` + +`RemoteTaskArgsWire v1` contains: + +```text +tensor_count: uint32 +scalar_count: uint32 +tensor_metadata: ContinuousTensorWire[tensor_count] +remote_desc: OptionalRemoteTensorDescWire[tensor_count] +scalars: uint64[scalar_count] +inline_payload_bytes_len: uint32 +inline_payload_bytes: uint8[inline_payload_bytes_len] +``` + +For each tensor index, exactly one of these is true: + +- `remote_desc[i]` is present and names a remote buffer, imported peer buffer, + UB mapping, or allowed small `HOST_INLINE` payload. +- The tensor is metadata-only and has no data pointer and no remote descriptor. + +`tensor_metadata[i].data` must be zero in both cases. Bare host pointers are +rejected for remote endpoints unless an explicit staging API has produced a +remote handle and sidecar descriptor. + +`HOST_INLINE` is for small payloads that should travel inside the TASK frame. +It still requires a `RemoteTensorDescWire` with `address_space=HOST_INLINE`; +the descriptor points into `inline_payload_bytes`. Regular and large tensor +data must use `RemoteBufferHandle` / `RemoteTensorDesc` and the remote +data-plane transport, not inline bytes. + +## RemoteTensorDesc + +```text +RemoteTensorDescWire: + address_space: + HOST_INLINE + REMOTE_DEVICE + REMOTE_WINDOW + UB_LDST + owner_endpoint_id + buffer_id + offset + nbytes + remote_addr + rkey_or_token + generation + inline_payload_offset + inline_payload_len + flags +``` + +Rules: + +- `ContinuousTensor` remains the L2 ABI. The session runner translates + descriptors into local `ContinuousTensor` values immediately before + `inner_worker.run()`. +- When a descriptor is present, the incoming `ContinuousTensorWire.data` is + reserved and must be zero. The session runner derives the executable local + address only from the validated descriptor and its live buffer/import + registry. +- Metadata-only tensors also require `ContinuousTensorWire.data == 0`. +- Parent-side dependency keys use a stable logical start-address key derived at + submit time: + `(address_kind, owner_endpoint_id, buffer_id, generation, offset)`. + `nbytes` bounds the descriptor and is kept for validation and future overlap + detection, but it is not part of the first implementation's TensorMap lookup. +- `offset + nbytes` must fit inside the referenced handle. +- `generation` must match the parent registry entry and the remote daemon's + live handle entry. +- For `HOST_INLINE`, `inline_payload_offset + inline_payload_len` must fit + inside `inline_payload_bytes_len`, and `inline_payload_len` must equal + `nbytes`. Remote handle fields (`owner_endpoint_id`, `buffer_id`, + `remote_addr`, `rkey_or_token`, `generation`) are reserved and must be zero. +- For non-`HOST_INLINE` descriptors, `inline_payload_len` must be zero. +- A tensor owned by endpoint 3 cannot be submitted to endpoint 5 unless an + explicit `IMPORT_BUFFER` or peer-access handle is present. +- `child_memory=1` keeps its local meaning: the data is managed by a + next-level worker. For remote workers, ownership is resolved through the + remote sidecar, not through a local pointer alone. + +## COMPLETION Payload + +```text +sequence +error_code +error_message_len +error_message bytes +``` + +Rules: + +- `sequence` must match the request being waited on. +- `error_code=0` means task success. +- Non-zero error means task failure. The parent marks the slot failed/poisoned + and does not dispatch downstream consumers. +- `error_message` is bounded UTF-8. It should include remote host, + `endpoint_id`, `callable_id`, and `sequence`. +- If health expires or the process exits, the parent fabricates an endpoint + failure completion for every in-flight sequence. + +## CONTROL Payload + +```text +control_name +control_version +command-specific bytes +``` + +Remote control frames use typed names and versioned payloads. Local mailbox +sub-command ids remain local-only and must not leak into the remote protocol. +Every `CONTROL` request produces exactly one `CONTROL_REPLY` frame with the +same `sequence`. + +Required remote controls: + +- `UNREGISTER_CALLABLE` +- `PREPARE_REGISTER_CALLABLE` +- `COMMIT_REGISTER_CALLABLE` +- `ABORT_REGISTER_CALLABLE` +- `PREPARE_CALLABLE` +- `ALLOC_REMOTE_BUFFER` +- `FREE_REMOTE_BUFFER` +- `COPY_TO_REMOTE` +- `COPY_FROM_REMOTE` +- `EXPORT_BUFFER` +- `IMPORT_BUFFER` +- `RELEASE_IMPORT` +- `COMM_INIT` +- `ALLOC_DOMAIN` +- `RELEASE_DOMAIN` + +The register-family controls are namespace-aware. +`PREPARE_REGISTER_CALLABLE` carries: + +```text +target_namespace: + OUTER_REMOTE_ORCH + INNER_L3_WORKER +callable_kind: + PYTHON_IMPORT + CHIP_CALLABLE + PYTHON_SERIALIZED # optional negotiated extension +cid: int32 +payload_version: uint32 +payload bytes +``` + +Rules: + +- `OUTER_REMOTE_ORCH` registers callables that can be selected by future + parent TASK frames. Only Python callable kinds are valid in this namespace. +- `INNER_L3_WORKER` registers callables on the session runner's + `inner_worker = Worker(level=3)`. Python callables become valid targets for + inner `submit_sub` / recursive orchestration; `CHIP_CALLABLE` follows the + existing prepare/register path for chip workers. +- `PYTHON_IMPORT` payloads carry a bounded UTF-8 `module:qualname` string. +- `PYTHON_SERIALIZED` payloads are produced by the PR #839 callable serializer. + They are valid only when parent and session negotiate serializer version, + payload limits, Python ABI/runtime compatibility, and dependency/runtime + compatibility. Support is optional; `PYTHON_IMPORT` remains the required + baseline. +- A session that does not advertise a requested callable kind rejects the + control request before installing the cid. +- `CHIP_CALLABLE` payloads carry the ChipCallable blob metadata or a staged blob + reference, depending on transport capability. The remote frame never embeds a + local POSIX shm name from the parent process. +- `COMMIT_REGISTER_CALLABLE` and `ABORT_REGISTER_CALLABLE` carry + `target_namespace`, `callable_kind`, and `cid`. +- `UNREGISTER_CALLABLE` carries the same `target_namespace`, `callable_kind`, + and `cid` so cid cleanup and reuse are scoped to the intended registry. + +Multi-endpoint parent registration uses two-phase visibility: + +1. The parent sends `PREPARE_REGISTER_CALLABLE` to every selected endpoint. + Each endpoint validates the descriptor/payload, checks feature gates, stages + any callable bytes, and records the cid as prepared but not visible to TASK. +2. If every prepare succeeds, the parent sends `COMMIT_REGISTER_CALLABLE` to + every selected endpoint. A successful commit makes the cid visible to later + TASK frames on that endpoint. +3. If any prepare or commit fails, the parent sends `ABORT_REGISTER_CALLABLE` + to endpoints with prepared or uncertain state, keeps the cid invisible, and + marks endpoints failed when their final registry state cannot be proven. + +Unregister creates a parent-side tombstone. The cid cannot be reused until +every selected endpoint has confirmed cleanup, or until any non-responsive +endpoint has been removed from eligibility and marked failed. + +## CONTROL_REPLY Payload + +```text +sequence +control_name +control_version +error_code +error_message_len +error_message bytes +result_bytes_len +result_bytes +``` + +Rules: + +- `sequence` must match one in-flight `CONTROL` request on the same endpoint. +- `control_name` and `control_version` must match the request being answered. +- `error_code=0` means the control succeeded and `result_bytes` contains the + command-specific result, if any. +- Non-zero `error_code` means the remote session did not apply the requested + state change, except for commands whose versioned contract explicitly allows + best-effort partial cleanup. +- `error_message` is bounded UTF-8. It should include remote host, + `endpoint_id`, `control_name`, and `sequence`. +- `result_bytes` uses the same canonical encoding rules as other remote + payloads. For example, `ALLOC_REMOTE_BUFFER` returns a buffer id, + generation, address space, size, and transport-specific export metadata. +- If health expires or the process exits, the parent fabricates a failed + `CONTROL_REPLY` for every in-flight control sequence. + +Control waits obey the same timeout policy as task waits. + +## Ordering + +Each endpoint has one ordered command lane. Runtime state-changing requests use +this lane even when their bulk bytes are staged through a separate data plane. +The remote session observes parent-to-runner requests in increasing `sequence` +order, and state changes caused by a request happen-before later requests on +the same endpoint. Reply frames carry the answered request sequence and become +visible only after the corresponding request side effects. + +The baseline endpoint admits at most one TASK in flight. State-changing CONTROL +frames serialize with that TASK and are not applied concurrently with +`inner_worker.run()`. This keeps callable registry visibility, buffer release, +import release, and domain lifetime consistent with the local one-mailbox +model. + +`HEALTH` frames are not state-changing command-lane requests. The parent uses a +separate health lane, RDMA completion health signal, or transport-native +keepalive so a long-running TASK does not block liveness observation. Health +success does not imply task completion; task completion is only a matching +`COMPLETION` frame. + +All transports must provide the following visibility rules: + +- Task payload writes are visible before the task doorbell. +- Remote output writes are complete before the completion frame is visible. +- Parent completion reads happen before dependent task dispatch. +- Control request payload writes are visible before the control doorbell. +- Control side effects are visible before the matching `CONTROL_REPLY`. +- Register-family controls and `UNREGISTER_CALLABLE` serialize with TASK + dispatch on the same endpoint. A successful commit reply happens-before later + TASK frames that use the cid. A successful unregister reply happens-before + later cid reuse. +- Health messages may be observed while a TASK is running, but they must not + expose or mutate task, callable, buffer, or domain state. + +RoCE and HCCS can satisfy this with SEND/RECV ordering plus explicit flush or +completion requirements. UB LD/ST paths require explicit memory fences around +doorbell and completion state transitions. + +## Bounds and Fuzz Tests + +The frame codec must reject: + +- bad magic or unsupported version; +- payload length larger than configured maximum; +- truncated frame headers or payloads; +- tensor/scalar counts that overflow decoded payload size; +- descriptor offsets outside the referenced handle; +- `HOST_INLINE` payload offsets or lengths outside the inline byte arena; +- non-`HOST_INLINE` descriptors with non-zero inline payload lengths; +- non-zero `ContinuousTensorWire.data` in remote TASK frames; +- stale generations; +- unknown control names or control versions; +- completion sequence mismatch; +- control reply sequence or control-name mismatch. diff --git a/docs/task-flow.md b/docs/task-flow.md index 37c4856b5..b18557fc8 100644 --- a/docs/task-flow.md +++ b/docs/task-flow.md @@ -49,16 +49,25 @@ Opaque 64-bit handle. What it actually is depends on the destination worker: | `w4.submit_next_level(cid, …)` dispatched to an L3 `Worker` child | `int32_t` cid returned by `Worker.register(orch_fn)` (Python orch fn) | `_child_worker_loop` looks up the orch fn in the child's COW registry and calls `inner_worker.run(orch_fn, …)` | | `w3.submit_sub(cid, …)` dispatched to a SUB child | `int32_t` cid indexing the Python callable registry | `_sub_worker_loop` calls `fn(args)` with the decoded `TaskArgs` | -All three paths share one mailbox wire format — the cid is written into -`MAILBOX_OFF_CALLABLE` and the child does the dispatch in its own -address space. +The implemented local paths share one mailbox wire format — the cid is written +into `MAILBOX_OFF_CALLABLE` and the child does the dispatch in its own address +space. The proposed remote L3 path keeps the cid concept, but sends it in a +versioned TASK frame and resolves it against the remote endpoint's registry +after the endpoint has reported `HELLO READY`. ### Lifetime — pre-fork registration -Every concrete `Callable` object (ChipCallable, Python orch fn, sub callable) -**must be registered before any child process is forked**. After fork, the -child inherits these through COW and the uint64 handle dereferences validly -in the child. +For the local fork/shm path, every concrete `Callable` object (ChipCallable, +Python orch fn, sub callable) **must be registered before any child process is +forked**, except for explicit dynamic registration controls. After fork, the +child inherits these through COW and the uint64 handle dereferences validly in +the child. + +For remote L3, the parent cannot rely on COW. Remote callable registration uses +explicit descriptors: required `PYTHON_IMPORT` paths, optional negotiated +PR #839 serialized Python callable payloads, and `CHIP_CALLABLE` payloads for +inner L3 chip work. A remote cid becomes visible only after the selected +endpoint replies success. --- @@ -241,7 +250,10 @@ dispatches to an L3 child, the child process runs `_child_worker_loop`, which looks up the registered orch fn for that cid and calls `inner_worker.run(orch_fn, args, config)` — i.e. the L3 `Worker.run` Python method, not a C++ leaf. The kernel-running leaves stay at L2 -(`ChipWorker`); higher levels just compose more scheduling engines. +(`ChipWorker`); higher levels just compose more scheduling engines. A remote +L3 session runner follows the same execution shape after it has prestarted its +inner L3 Worker, but task/control/completion bytes travel through the remote +framed protocol instead of the local mailbox. --- diff --git a/docs/worker-manager.md b/docs/worker-manager.md index 0be0480ea..dd9606fe8 100644 --- a/docs/worker-manager.md +++ b/docs/worker-manager.md @@ -1,15 +1,21 @@ # Worker Manager — Pool, Threading, and Dispatch `WorkerManager` and `WorkerThread` together implement the **execution layer** -of a `Worker` engine. `WorkerManager` owns two pools of `WorkerThread`s (one -for next-level workers, one for sub workers); each `WorkerThread` drives a -shared-memory mailbox that a forked Python child consumes — the child runs -the real worker (a `ChipWorker` for NEXT_LEVEL, a Python callable for SUB) -in its own address space. +of a `Worker` engine. In today's local implementation, `WorkerManager` owns two +pools of `WorkerThread`s (one for next-level workers, one for sub workers); +each `WorkerThread` drives a shared-memory mailbox that a forked Python child +consumes — the child runs the real worker (a `ChipWorker` for NEXT_LEVEL, a +Python callable for SUB) in its own address space. + +The proposed remote L3 design keeps this local fork/shm path behind a +`LocalMailboxEndpoint` and adds a framed `RemoteL3Endpoint` for cross-host +NEXT_LEVEL children. A remote endpoint is not another child loop that polls the +4096-byte mailbox; it uses the contracts in +[remote-l3-worker-design.md](remote-l3-worker-design.md). For the high-level role of this layer among the three engine components, see [hierarchical_level_runtime.md](hierarchical_level_runtime.md). For what -runs on the other side of the mailbox, see [task-flow.md](task-flow.md). +runs on the other side of the local mailbox, see [task-flow.md](task-flow.md). For where dispatched tasks come from, see [scheduler.md](scheduler.md). --- @@ -193,21 +199,23 @@ the C++ dispatcher thread — it does not own the child process. --- -## 4. Adding a new worker kind +## 4. Local vs. Remote Endpoints -To add a new worker type (e.g., a RemoteWorker over RPC): +The mailbox protocol is the local endpoint contract. Adding another local +forked worker kind still follows the existing pattern: -1. Define the kernel-running entry point (a C++ class with a `run` method - or a Python callable — there is no abstract interface to inherit from). -2. Write a child-process loop (mirroring `_chip_process_loop` or - `_sub_worker_loop`) that polls the mailbox, decodes the args blob, and - invokes that entry point. -3. Register the per-child mailbox via `manager.add_next_level(mailbox)` - or `manager.add_sub(mailbox)`. +1. Define the worker entry point. +2. Write a child-process loop that polls the mailbox, decodes the args blob, + and invokes that entry point. +3. Register the mailbox via `manager.add_next_level(mailbox)` or + `manager.add_sub(mailbox)`. -The parent side (WorkerManager / WorkerThread) doesn't change — it -only knows the mailbox protocol, not who runs the kernel on the other -end. +Remote L3 is different. It cannot reuse the mailbox wire format because the +remote side does not share virtual addresses, fork-time COW registries, POSIX +shm names, or parent-visible child PIDs. The remote design introduces a +transport-neutral endpoint under `WorkerThread`: `LocalMailboxEndpoint` wraps +this local mailbox path, while `RemoteL3Endpoint` sends framed TASK, CONTROL, +COMPLETION, HEALTH, and SHUTDOWN messages over the negotiated transport. ### 4.1 Nested fork ordering (L4+ Worker children) From 1c904d543f96161c7884832e741a484b4e808e60 Mon Sep 17 00:00:00 2001 From: puddingfjz <2811443837@qq.com> Date: Wed, 27 May 2026 11:05:40 +0800 Subject: [PATCH 2/2] Update: align remote L3 design scope - Adopt HCOMM-backed control and data adapter wording for Remote L3 - Keep Remote CommDomain as later work and reserve domain controls - Use unified register(callable, workers=...) for RemoteCallable descriptors - Reference the Python callable serialization contract from PR 839 --- docs/remote-l3-worker-design.md | 88 ++++++++++++----- .../buffers-and-transports.md | 94 +++++++++++-------- .../implementation-plan.md | 62 +++++++----- docs/remote-l3-worker-design/protocol.md | 25 ++++- 4 files changed, 182 insertions(+), 87 deletions(-) diff --git a/docs/remote-l3-worker-design.md b/docs/remote-l3-worker-design.md index 20e02ae57..1491ddb7d 100644 --- a/docs/remote-l3-worker-design.md +++ b/docs/remote-l3-worker-design.md @@ -11,6 +11,11 @@ Detailed protocol, buffer, transport, and rollout notes live in: - [buffers-and-transports.md](remote-l3-worker-design/buffers-and-transports.md) - [implementation-plan.md](remote-l3-worker-design/implementation-plan.md) +Related callable registration and serialization contracts: + +- [callable-ipc-dynamic-register.md](callable-ipc-dynamic-register.md) +- [python-callable-serialization.md](python-callable-serialization.md) + The current implementation uses pre-forked local child processes and a 4096-byte shared-memory mailbox. That model depends on copy-on-write callable registries, identical virtual addresses for `MAP_SHARED` regions, and @@ -23,7 +28,8 @@ Goals: - Preserve the Orchestrator/Scheduler DAG model. - Replace the local mailbox endpoint under `WorkerThread` with a pluggable NEXT_LEVEL endpoint. -- Support remote data-plane backends for A2 RoCE, A3 HCCS, and A5 UB. +- Support HCOMM-backed remote communication profiles for A2 RoCE, A3 HCCS, + and A5 UB. - Carry task dispatch, control commands, completion, error messages, and buffer lifetime over the remote endpoint. @@ -35,6 +41,11 @@ Non-goals: arbitrary closures. - Replacing the local fork/shm path for chip and sub workers. - Changing the L2 `ChipWorker::run` ABI. +- Exposing `RemoteL3Endpoint`, HCOMM, RDMA, or socket details to + `Orchestrator`; endpoint selection and materialization stay behind + `WorkerEndpoint`. +- Redesigning remote `CommDomain` or the device `CommContext` ABI in the first + Remote L3 task-dispatch cut. Remote callable registration follows the same public cid lifecycle defined for local dynamic Python registration by PR #839: registration becomes visible to @@ -47,10 +58,12 @@ process-local pointers, or exact serialized payload wire shape. The required baseline remote callable descriptor is an import path such as `pkg.module:orch_fn`. A serialized Python callable payload produced by the -PR #839 serializer is a negotiated remote capability, not an unconditional -baseline. When enabled, it travels as a versioned remote CONTROL payload and -must negotiate serializer version, payload limits, Python ABI/runtime -compatibility, and dependency/runtime-environment compatibility. +PR #839 serializer described in +[python-callable-serialization.md](python-callable-serialization.md) is a +negotiated remote capability, not an unconditional baseline. When enabled, it +travels as a versioned remote CONTROL payload and must negotiate serializer +version, payload limits, Python ABI/runtime compatibility, and +dependency/runtime-environment compatibility. ## Current Seams @@ -82,12 +95,26 @@ tensor identity must be resolved before a slot becomes ready, because TensorMap dependency inference and buffer-reference capture happen at submit time. +`Orchestrator` consumes tags, computes dependency keys, and stores endpoint +eligibility metadata. It must not call `RemoteL3Endpoint`, HCOMM, RDMA, or +socket APIs directly. `WorkerThread` and `WorkerEndpoint` own the child +boundary and materialization mechanics. + ## Target Architecture -Introduce a transport-neutral endpoint under `WorkerThread` with `run`, -`control`, and `shutdown` operations. `LocalMailboxEndpoint` wraps the current -shm mailbox code without changing wire behavior. `RemoteL3Endpoint` implements -the same interface over the framed remote protocol. +Introduce a communication-neutral `WorkerEndpoint` under `WorkerThread` with +`caps`, `run`, `control`, and `shutdown` operations. `LocalMailboxEndpoint` +wraps the current shm mailbox code without changing wire behavior. +`RemoteL3Endpoint` implements the same interface with a bootstrap socket for +session setup, then HCOMM-backed RPC and data adapters for steady-state +traffic. + +`caps()` is part of the endpoint contract because submit-time eligibility must +know whether an endpoint can run a cid and consume the tensors in a slot. It is +read-only capability metadata, not a transport escape hatch: the Scheduler sees +logical features such as callable kinds, memory directions, address spaces, and +health state, while `RemoteL3Endpoint` remains the only layer that knows the +selected HCOMM protocol or adapter handles. On dispatch, `WorkerThread` builds a task packet from `TaskSlotState`, calls the endpoint, reports endpoint errors, and notifies the Scheduler with an @@ -129,7 +156,7 @@ Use a two-process remote model: traffic. 7. The runner then performs the remote protocol `HELLO`/ready handshake over the ordered command lane. `HELLO` confirms session identity, endpoint - identity, protocol version, transport kind, and negotiated features; it does + identity, protocol version, comm profile, and negotiated features; it does not carry the bootstrap manifest. 8. Session shutdown rejects new frames, completes or fails in-flight tasks, drains cleanup, closes the inner Worker, and exits the runner process. @@ -157,19 +184,24 @@ Required contracts: - Every local or remote NEXT_LEVEL child has a stable `endpoint_id` equal to its logical worker id in `WorkerManager`. -- `register()` continues to register local callables for local fork/shm - endpoints. -- `register_remote(remote_callable, workers=...)` allocates an outer cid in the - parent `Worker(level=4)` id space, but binds that cid to one or more remote - endpoint ids. +- `register(callable, workers=...)` is the single public registration API. + Local Python/callable objects keep the existing local registration path. + `RemoteCallable` descriptors use the remote control path and bind the + allocated parent cid to one or more remote endpoint ids. +- The first implementation requires `RemoteCallable` registration to pass an + explicit non-empty `workers=[...]` list. Future releases may replace raw + worker ids with named remote pools or placement policies, but implicit + broadcast to all remote endpoints is not part of the contract. - Bootstrap manifests are generated by the parent. Users provide remote callable descriptors; users do not hand-author raw `cid -> callable` maps. - Remote callable descriptors have two Python forms: - `PYTHON_IMPORT`: a bounded `module:qualname` import path. This is required for the remote L3 baseline. - `PYTHON_SERIALIZED`: a versioned payload produced by the PR #839 callable - serializer. This is allowed only when parent and session negotiate the - serializer version, payload limits, Python ABI/runtime compatibility, and + serializer described in + [python-callable-serialization.md](python-callable-serialization.md). This + is allowed only when parent and session negotiate the serializer version, + payload limits, Python ABI/runtime compatibility, and dependency/runtime-environment compatibility. - Remote L3 uses two independent cid namespaces: - **Outer remote cid namespace**: parent-assigned cids carried in L4 TASK @@ -190,7 +222,7 @@ Required contracts: callable payloads preserve the same cid lifecycle but remain an optional negotiated feature because they require Ray-like environment and serializer compatibility checks that local fork/COW registration does not need. -- Multi-endpoint `register_remote(..., workers=[...])` is all-or-nothing by +- Multi-endpoint `register(..., workers=[...])` is all-or-nothing by default. The parent sends a prepare phase to every selected endpoint, commits the cid only after every prepare succeeds, and exposes the cid to future TASK frames only after every commit reply succeeds. If any endpoint fails prepare @@ -233,7 +265,7 @@ l3 = RemoteWorkerSpec( ) l3_worker_id = w4.add_remote_worker(l3) -l3_cid = w4.register_remote( +l3_cid = w4.register( RemoteCallable("my_pkg.remote_orch:l3_orch"), workers=[l3_worker_id], ) @@ -269,7 +301,7 @@ callable registry: inner L3 registry: inner cid -> ChipCallable blob metadata, when needed inner cid -> Python sub/orch callable descriptor, when needed -transport kind: roce | hccs | ub | sim +comm policy: roce | hccs | ub | sim feature flags ``` @@ -305,9 +337,10 @@ Session execution rules: the current one-`WorkerThread`-per-child local scheduling model and keeps ordering, buffer lifetime, and cid visibility simple. - State-changing CONTROL frames such as register, unregister, buffer free, - import release, comm init, and domain allocation serialize with TASK - execution on the ordered command lane. They are not applied concurrently with - a running TASK on the same endpoint. + copy, export/import, and import release serialize with TASK execution on the + ordered command lane. They are not applied concurrently with a running TASK + on the same endpoint. Future Remote CommDomain controls follow the same + ordering rule when they enter scope. - Bulk data movement may use a separate data plane, but the state change that makes staged bytes, callable payloads, or imported handles visible is ordered by the command lane. @@ -403,7 +436,12 @@ field, no sequence number, and assumes shared virtual memory. Remote endpoints use a versioned frame protocol with `HELLO`, `TASK`, `CONTROL`, `CONTROL_REPLY`, `COMPLETION`, `HEALTH`, and `SHUTDOWN` frames. The -local path keeps the existing mailbox layout behind `LocalMailboxEndpoint`. +bootstrap socket carries only session setup and HCOMM bring-up frames. After +HCOMM RPC is ready, steady-state control, task metadata, completions, and +shutdown use the HCOMM-backed RPC adapter; tensor data and remote buffer +operations use the HCOMM data adapter. The local path keeps the existing +mailbox layout behind `LocalMailboxEndpoint`. + Remote frames use canonical little-endian field encoding for `CallConfig`, `ContinuousTensor`, tensor descriptors, strings, counts, and enums; they do not memcpy local C++ POD structs onto the wire. Each endpoint has one ordered @@ -427,7 +465,7 @@ The recommended first cut is conservative: `HELLO READY`. 6. Prove local behavior is unchanged and remote sim behavior handles success, failure, cid mapping, timeouts, health, and buffer cleanup. -7. Add hardware transports behind the same protocol. +7. Add A2 RoCE, A3 HCCS, and A5 UB profiles behind the HCOMM adapter layer. See [implementation-plan.md](remote-l3-worker-design/implementation-plan.md) diff --git a/docs/remote-l3-worker-design/buffers-and-transports.md b/docs/remote-l3-worker-design/buffers-and-transports.md index 1a44dbdd2..484cc1755 100644 --- a/docs/remote-l3-worker-design/buffers-and-transports.md +++ b/docs/remote-l3-worker-design/buffers-and-transports.md @@ -206,58 +206,74 @@ range index as part of the transport bring-up. `offset_end`/`nbytes` remains in `RemoteTensorDesc` for bounds checks and for a future range-overlap TensorMap upgrade, but it is not part of the first dependency key. -## RemoteTransport Interface - -All backends implement the same transport contract: - -```cpp -class RemoteTransport { -public: - virtual void connect(const RemoteEndpointSpec &) = 0; - virtual void post_task(uint64_t seq, Span frame) = 0; - virtual Frame wait_completion(uint64_t seq, Timeout timeout) = 0; - virtual ControlReply control(const ControlRequest &) = 0; - virtual RemoteMemory export_memory(BufferId) = 0; - virtual ImportedMemory import_memory(const RemoteMemory &) = 0; - virtual void close() = 0; -}; +## HCOMM Adapter Contract + +Remote L3 uses HCOMM for steady-state communication in the first +implementation. The bootstrap socket is only a setup path for session +validation and HCOMM bring-up. Once HCOMM RPC is ready, task metadata, +CONTROL, CONTROL_REPLY, COMPLETION, and SHUTDOWN frames use the HCOMM RPC +adapter; tensor data and remote buffer copies use the HCOMM data adapter. + +The endpoint owns the adapter objects. `Orchestrator`, Scheduler, and +`WorkerThread` see only `WorkerEndpoint::run()`, `WorkerEndpoint::control()`, +and logical capability bits from `WorkerEndpoint::caps()`. + +The adapter family provides this logical contract: + +```text +BootstrapSocketAdapter: + open(control_uri) + exchange hello/capability/HCOMM bootstrap frames + close after HCOMM_RPC_READY + +HcommRpcAdapter: + submit TASK or CONTROL frame on the ordered command lane + wait for matching COMPLETION or CONTROL_REPLY + send SHUTDOWN when no command is in flight + +HcommAdapter: + register/export/import memory + submit read/write/copy plans + wait/fence completion + release registrations and imports ``` -`post_task()` and `control()` enqueue request frames on the endpoint's ordered -command lane. Data-plane transfers may use RDMA/HCCS/UB paths, but TASK +HCOMM RPC enqueues request frames on the endpoint's ordered command lane. +Data-plane transfers may use RoCE, HCCS, or UB HCOMM profiles, but TASK doorbells, CONTROL frames, replies, completions, and shutdown state are ordered -by the command lane. Reply frames carry the request sequence they answer. -`wait_completion()` waits for an explicit remote `COMPLETION` frame; RDMA write -completion alone is never a task completion signal. `control()` sends a -`CONTROL` frame and returns only after the matching `CONTROL_REPLY` arrives or a -timeout/disconnect is converted into a failed reply. Liveness is handled by an -independent health lane or transport keepalive; it is not queued behind the -ordered command lane. - -## A2 RoCE - -- Use RDMA SEND/RECV for small frames and completion records. -- Keep TASK doorbells, control frames, control replies, and completions on one - per-endpoint ordered SEND/RECV command lane. -- Use a separate health SEND/RECV lane or transport keepalive for liveness. -- Use registered pinned staging buffers for large callable blobs and bulk data. -- Export buffers as `(addr, length, rkey, qp/session metadata)`. -- Complete tasks only after a SEND completion frame from the session runner. +by the command lane. Reply frames carry the request sequence they answer. A +task completes only after an explicit remote `COMPLETION` frame; data-copy +completion alone is never a task completion signal. `control()` returns only +after the matching `CONTROL_REPLY` arrives or a timeout/disconnect is converted +into a failed reply. Liveness is handled by an independent health lane or +transport keepalive; it is not queued behind the ordered command lane. + +## A2 RoCE HCOMM Profile + +- Use HCOMM with `COMM_PROTOCOL_ROCE`. +- Carry command frames and completion records through HCOMM RPC rings and + notify/fence operations. +- Use a separate health HCOMM lane or transport keepalive for liveness. +- Use registered staging buffers for large callable blobs and bulk data. +- Export buffers as HCOMM memory descriptors plus RoCE-specific channel + metadata. +- Complete tasks only after an HCOMM RPC `COMPLETION` frame from the session + runner. - Bound every wait with a timeout and convert disconnects into endpoint failure completions. -## A3 HCCS +## A3 HCCS HCOMM Profile -- Keep the same `RemoteTransport` contract as RoCE. -- Implement memory export/import through the HCCS-capable platform HAL. +- Keep the same HCOMM adapter contract as A2. +- Implement memory export/import through the HCCS-capable HCOMM profile. - Preserve the same command-lane ordering rules: task/control frames are observed in sequence order, command frame visible before doorbell, and remote writes complete before completion frame. - Provide health independently from command-lane progress so long-running TASK execution does not cause false endpoint failure. -- Reuse the frame codec and buffer registry tests from the RoCE path. +- Reuse the frame codec, HCOMM RPC, and buffer registry tests from the A2 path. -## A5 UB +## A5 UB HCOMM Profile - Export both RDMA metadata and, when available, an LD/ST mapping token. - Use LD/ST for doorbells and small completion records only when the mapping diff --git a/docs/remote-l3-worker-design/implementation-plan.md b/docs/remote-l3-worker-design/implementation-plan.md index 625396d01..295d8ea6d 100644 --- a/docs/remote-l3-worker-design/implementation-plan.md +++ b/docs/remote-l3-worker-design/implementation-plan.md @@ -7,6 +7,9 @@ fork/shm behavior working. 1. Endpoint interface and local adapter. - Add `WorkerEndpoint`. + - Include read-only `WorkerEndpoint::caps()` capability metadata for + endpoint eligibility. `caps()` returns logical features only and must not + expose HCOMM/RDMA/socket handles to Scheduler or Orchestrator code. - Define `WorkerEndpoint::run()` to return an explicit outcome: success, task failure, or endpoint failure. - Move current mailbox code into `LocalMailboxEndpoint`. @@ -82,8 +85,15 @@ fork/shm behavior working. - Treat import-path callables as the baseline remote mode. - Support PR #839 serialized Python callable payloads only as a negotiated feature with serializer version, payload limit, Python ABI/runtime, and - dependency/runtime-environment compatibility checks. - - Implement `register_remote(..., workers=...)`. + dependency/runtime-environment compatibility checks. Follow the payload + contract in `docs/python-callable-serialization.md`. + - Extend the existing public `register(callable, workers=...)` flow so + `RemoteCallable` descriptors can target remote endpoints. Do not add a + separate public remote-only registration API. + - Require the first implementation to reject `RemoteCallable` registration + without an explicit non-empty `workers=[...]` list. Leave named remote + pools and placement policies as future API work, and do not define + implicit broadcast to all remote endpoints as a contract. - Implement multi-endpoint all-or-nothing registration with prepare, commit, and abort controls. Keep the parent cid invisible until every selected endpoint commits, and mark uncertain endpoints failed rather than leaving a @@ -116,11 +126,15 @@ fork/shm behavior working. 8. Remote control-plane parity. - Map existing NEXT_LEVEL controls onto typed remote frames: - prepare, register, unregister, comm init, domain alloc, and domain - release. + prepare, register, unregister, remote buffer allocation, remote buffer + free, copy to remote, copy from remote, export buffer, import buffer, and + release import. + - Reserve remote `COMM_INIT`, `ALLOC_DOMAIN`, and `RELEASE_DOMAIN` opcodes + for the later Remote CommDomain phase; the first task-dispatch cut rejects + them as unsupported controls. - Keep local mailbox sub-command ids local-only. - - Add tests for post-bootstrap ChipCallable registration and dynamic domain - allocation through the remote session runner. + - Add tests for post-bootstrap ChipCallable registration and remote buffer + controls through the session runner. 9. Remote buffer registry. - Add `ALLOC_REMOTE_BUFFER`, `FREE_REMOTE_BUFFER`, `COPY_TO_REMOTE`, and @@ -130,21 +144,25 @@ fork/shm behavior working. - Tie physical free and release-import to post-drain cleanup after all captured refs drop. -10. RoCE transport. - - Implement connection setup, SEND/RECV frames, registered staging buffers, - and timeout/error paths. +10. A2 RoCE HCOMM profile. + - Implement HCOMM endpoint/channel setup with `COMM_PROTOCOL_ROCE`, HCOMM + RPC command rings, registered staging buffers, and timeout/error paths. + - Keep bootstrap sockets limited to session setup and HCOMM bring-up. + After HCOMM RPC is ready, task metadata, controls, completions, and + shutdown use the HCOMM RPC adapter. - Add a hardware-gated smoke test with one remote L3 worker. -11. HCCS transport. - - Implement the same transport contract through platform HCCS APIs. - - Reuse the RoCE frame and buffer registry tests. +11. A3 HCCS HCOMM profile. + - Implement the same HCOMM adapter contract with the HCCS protocol profile. + - Reuse the A2 frame, HCOMM RPC, and buffer registry tests. -12. A5 UB transport. - - Add UB export/import metadata. - - Implement LD/ST doorbell and completion path with fences. - - Keep RDMA fallback for bulk transfers. +12. A5 UB HCOMM profile. + - Add UB export/import metadata through the HCOMM adapter. + - Implement LD/ST doorbell and completion paths only when the selected + profile proves the mapping and fence rules are valid. + - Keep RDMA/HCOMM fallback for bulk transfers. -13. Remote `allocate_domain()`. +13. Remote `allocate_domain()` future work. - Extend `CommDomainHandle` to carry remote endpoint ids. - Allocate/import windows collectively across remote workers. - Preserve deferred release after `drain()`. @@ -188,18 +206,20 @@ fork/shm behavior working. - A2 RoCE remote buffer copy round trip. - A3 HCCS single remote L3 task. - A5 UB LD/ST doorbell plus RDMA fallback. -- Remote domain allocation and deferred release across two remote L3 workers. +- Remote domain allocation and deferred release across two remote L3 workers + after Remote CommDomain enters scope. ## Open Decisions - Exact platform HAL names for HCCS and UB export/import. - Authentication and isolation for remote daemon sessions. -- Exact compatibility metadata required for PR #839 serialized Python callable - payloads beyond serializer version and Python ABI/runtime. +- Exact remote compatibility metadata required for PR #839 serialized Python + callable payloads beyond the local payload contract, serializer version, and + Python ABI/runtime. - How endpoint health feeds scheduler-level eligibility after the transport reports a failed health lane. - How much of `CommContext` should remain shared with PTO-ISA once remote UB - address metadata is added. + address metadata is added, when Remote CommDomain enters scope. The first cut should land endpoint abstraction, endpoint eligibility, remote callable registration, failure poisoning, and the simulation runner diff --git a/docs/remote-l3-worker-design/protocol.md b/docs/remote-l3-worker-design/protocol.md index 42179abde..ed739d740 100644 --- a/docs/remote-l3-worker-design/protocol.md +++ b/docs/remote-l3-worker-design/protocol.md @@ -4,6 +4,12 @@ This document defines the remote wire protocol used by `RemoteL3Endpoint`. The local fork/shm path keeps the existing mailbox layout behind `LocalMailboxEndpoint`. +The bootstrap socket carries only setup and HCOMM bring-up frames. After HCOMM +RPC is ready, steady-state TASK, CONTROL, CONTROL_REPLY, COMPLETION, and +SHUTDOWN frames travel through the HCOMM RPC adapter. Tensor data moves through +the HCOMM data adapter and is referenced from TASK frames by descriptors, not +by host pointers. + ## Frames Remote transport must not reuse the raw 4096-byte mailbox format. Define a @@ -104,7 +110,7 @@ and health lanes. `HELLO` does not carry the bootstrap manifest. session_id endpoint_id protocol_version -transport kind +comm profile feature flags ready_state ``` @@ -125,6 +131,12 @@ config: CallConfigWire v1 args: RemoteTaskArgsWire v1 ``` +TASK frames carry tensor metadata and remote descriptors after parent-side +dependency inference has already consumed `TaskArgs` tags. Tags are +Orchestrator input, not the remote L3 wire contract. The endpoint uses the +sidecar descriptors captured at submit time to materialize local +`ContinuousTensor` values on the session runner. + `RemoteTaskArgsWire v1` contains: ```text @@ -250,10 +262,17 @@ Required remote controls: - `EXPORT_BUFFER` - `IMPORT_BUFFER` - `RELEASE_IMPORT` + +Reserved future controls for Remote CommDomain: + - `COMM_INIT` - `ALLOC_DOMAIN` - `RELEASE_DOMAIN` +The first Remote L3 task-dispatch cut rejects the reserved domain controls +with an unsupported-control reply. They become required only when Remote +CommDomain enters scope. + The register-family controls are namespace-aware. `PREPARE_REGISTER_CALLABLE` carries: @@ -279,7 +298,9 @@ Rules: inner `submit_sub` / recursive orchestration; `CHIP_CALLABLE` follows the existing prepare/register path for chip workers. - `PYTHON_IMPORT` payloads carry a bounded UTF-8 `module:qualname` string. -- `PYTHON_SERIALIZED` payloads are produced by the PR #839 callable serializer. +- `PYTHON_SERIALIZED` payloads are produced by the PR #839 callable serializer + described in + [../python-callable-serialization.md](../python-callable-serialization.md). They are valid only when parent and session negotiate serializer version, payload limits, Python ABI/runtime compatibility, and dependency/runtime compatibility. Support is optional; `PYTHON_IMPORT` remains the required