diff --git a/docs/hierarchical_level_runtime.md b/docs/hierarchical_level_runtime.md index 2e78f7aad..d29482914 100644 --- a/docs/hierarchical_level_runtime.md +++ b/docs/hierarchical_level_runtime.md @@ -13,6 +13,8 @@ For details of each component's internals, see: - [scheduler.md](scheduler.md) — dispatch loop, queues, completion handling - [worker-manager.md](worker-manager.md) — WorkerThread pool, fork + mailbox - [task-flow.md](task-flow.md) — Callable / TaskArgs / CallConfig data flow, execution leaves +- [remote-l3-worker-design.md](remote-l3-worker-design.md) — design proposal + for scheduling remote L3 workers as NEXT_LEVEL children For the L2 chip-level details (host `.so`, AICPU, AICore), see [chip-level-arch.md](chip-level-arch.md). @@ -42,14 +44,16 @@ L0 CORE / AIV, AIC ── individual compute core (hardware-managed) atomics and barriers. - **L3–L6** (host/cluster): each level runs the same scheduling engine composed of Orchestrator + Scheduler + Worker pool. Communication via IPC - (fork + shm at L3 today; RDMA / sockets at L4+). + (fork + shm for today's L3 and for local recursive L4+ composition). + Cross-host L4/L5/L6 composition, where a parent schedules a remote L3 + endpoint over RoCE/HCCS/UB/sockets, is a proposed extension. | Level | Workers it contains | Status | | ----- | ------------------- | ------ | | L3 (Host) | `ChipWorker` ×N + `SubWorker` ×M | Implemented | -| L4 (Pod) | `Worker(level=3)` ×N + `SubWorker` ×M | Implemented | -| L5 (SuperNode) | `Worker(level=4)` ×N | Same code as L4 (untested) | -| L6 (Cluster) | `Worker(level=5)` ×N | Same code as L4 (untested) | +| L4 (Pod) | `Worker(level=3)` ×N + `SubWorker` ×M | Local-only implemented; remote proposed | +| L5 (SuperNode) | `Worker(level=4)` ×N | Local L4 code path, untested; remote proposed | +| L6 (Cluster) | `Worker(level=5)` ×N | Local L4 code path, untested; remote proposed | `Worker` is a single C++ class that handles every level from L3 upward — the `level` parameter is a diagnostic label; behavior does not branch on it. The diff --git a/docs/remote-l3-worker-design.md b/docs/remote-l3-worker-design.md new file mode 100644 index 000000000..1491ddb7d --- /dev/null +++ b/docs/remote-l3-worker-design.md @@ -0,0 +1,472 @@ +# Remote L3 Worker Design + +This document describes how to extend the L3+ hierarchical runtime so a +parent `Worker` can schedule a remote `Worker(level=3)` as a NEXT_LEVEL +worker. The first target is an L4 parent dispatching to remote L3 workers. +The same contracts can later serve L5/L6. + +Detailed protocol, buffer, transport, and rollout notes live in: + +- [protocol.md](remote-l3-worker-design/protocol.md) +- [buffers-and-transports.md](remote-l3-worker-design/buffers-and-transports.md) +- [implementation-plan.md](remote-l3-worker-design/implementation-plan.md) + +Related callable registration and serialization contracts: + +- [callable-ipc-dynamic-register.md](callable-ipc-dynamic-register.md) +- [python-callable-serialization.md](python-callable-serialization.md) + +The current implementation uses pre-forked local child processes and a +4096-byte shared-memory mailbox. That model depends on copy-on-write callable +registries, identical virtual addresses for `MAP_SHARED` regions, and +parent-visible child PIDs. None of those assumptions holds across hosts. + +## Scope + +Goals: + +- Preserve the Orchestrator/Scheduler DAG model. +- Replace the local mailbox endpoint under `WorkerThread` with a pluggable + NEXT_LEVEL endpoint. +- Support HCOMM-backed remote communication profiles for A2 RoCE, A3 HCCS, + and A5 UB. +- Carry task dispatch, control commands, completion, error messages, and + buffer lifetime over the remote endpoint. + +Non-goals: + +- Rewriting kernel allreduce or PTO-ISA collective kernels. +- Shipping Python closures without an explicit serialization contract. +- Designing general cross-host Python dependency or code distribution for + arbitrary closures. +- Replacing the local fork/shm path for chip and sub workers. +- Changing the L2 `ChipWorker::run` ABI. +- Exposing `RemoteL3Endpoint`, HCOMM, RDMA, or socket details to + `Orchestrator`; endpoint selection and materialization stay behind + `WorkerEndpoint`. +- Redesigning remote `CommDomain` or the device `CommContext` ABI in the first + Remote L3 task-dispatch cut. + +Remote callable registration follows the same public cid lifecycle defined for +local dynamic Python registration by PR #839: registration becomes visible to +the selected Python-capable endpoint only after the registration control reply +succeeds, unregister permits cid reuse only after stale state is cleared or the +endpoint is marked failed, and stale cid residue must not be observable by +later TASK frames. The remote design reuses those lifecycle semantics, but it +does not reuse PR #839's local mailbox commands, POSIX shm names, +process-local pointers, or exact serialized payload wire shape. + +The required baseline remote callable descriptor is an import path such as +`pkg.module:orch_fn`. A serialized Python callable payload produced by the +PR #839 serializer described in +[python-callable-serialization.md](python-callable-serialization.md) is a +negotiated remote capability, not an unconditional baseline. When enabled, it +travels as a versioned remote CONTROL payload and must negotiate serializer +version, payload limits, Python ABI/runtime compatibility, and +dependency/runtime-environment compatibility. + +## Current Seams + +Relevant code paths: + +- `python/simpler/worker.py` + - `_start_hierarchical()` forks local child workers. + - `_child_worker_loop()` runs a nested `Worker` child via shm mailbox. + - `_run_chip_main_loop()` handles task and control mailbox states. +- `src/common/hierarchical/worker_manager.{h,cpp}` + - `WorkerThread` owns one local mailbox and blocks until `TASK_DONE`. + - Control commands share the same mailbox and serialize on `mailbox_mu_`. + - Errors are reported through `MAILBOX_OFF_ERROR` and + `MAILBOX_OFF_ERROR_MSG`. +- `src/common/hierarchical/orchestrator.{h,cpp}` + - `submit_next_level()` stores `TaskArgs`, `CallConfig`, callable id, and + optional worker affinity in a parent-side slot. + - Dependency inference happens before dispatch from tags in `TaskArgs`. +- `src/common/task_interface/task_args.h` + - Process dispatch writes `[T][S][ContinuousTensor x T][uint64 x S]`. + - Tags are stripped after submit. +- `docs/comm-domain.md` + - Dynamic communication domains already model deferred release after + `drain()`. + +The Scheduler should not inspect transport details, but it does need enough +metadata to avoid dispatching a task to an endpoint that cannot run it. Remote +tensor identity must be resolved before a slot becomes ready, because +TensorMap dependency inference and buffer-reference capture happen at submit +time. + +`Orchestrator` consumes tags, computes dependency keys, and stores endpoint +eligibility metadata. It must not call `RemoteL3Endpoint`, HCOMM, RDMA, or +socket APIs directly. `WorkerThread` and `WorkerEndpoint` own the child +boundary and materialization mechanics. + +## Target Architecture + +Introduce a communication-neutral `WorkerEndpoint` under `WorkerThread` with +`caps`, `run`, `control`, and `shutdown` operations. `LocalMailboxEndpoint` +wraps the current shm mailbox code without changing wire behavior. +`RemoteL3Endpoint` implements the same interface with a bootstrap socket for +session setup, then HCOMM-backed RPC and data adapters for steady-state +traffic. + +`caps()` is part of the endpoint contract because submit-time eligibility must +know whether an endpoint can run a cid and consume the tensors in a slot. It is +read-only capability metadata, not a transport escape hatch: the Scheduler sees +logical features such as callable kinds, memory directions, address spaces, and +health state, while `RemoteL3Endpoint` remains the only layer that knows the +selected HCOMM protocol or adapter handles. + +On dispatch, `WorkerThread` builds a task packet from `TaskSlotState`, calls +the endpoint, reports endpoint errors, and notifies the Scheduler with an +explicit success/failure outcome. + +Ready queues, group dispatch, affinities, fanin/fanout, and ring release remain +in the existing runtime. The first-error-wins policy remains only as the error +reporting policy for choosing which root error `drain()` raises. The important +change is that completion is no longer implicitly success; every endpoint, +including `LocalMailboxEndpoint`, must report an explicit success/failure +outcome. + +## Fork-Safe Remote Process Model + +The remote runtime must preserve the repository's fork ordering invariant: +all chip/sub child processes are forked before any C++ Scheduler, +`WorkerThread`, transport, or health threads are started. + +Use a two-process remote model: + +1. `simpler-remote-worker` is a small control daemon. It accepts session + requests and validates bootstrap manifests on the daemon control channel. + It never constructs an inner `Worker` and never forks chip/sub children + after starting transport worker threads. +2. For each accepted session, the daemon starts a fresh + `simpler-remote-l3-session` runner process, preferably by `exec`. +3. The daemon passes the validated manifest to the runner through a simple + pre-fork handoff such as an inherited fd, a manifest file path in env, or a + single-threaded pipe. This handoff is not the remote transport protocol. +4. The session runner reads the manifest before starting transport threads and + constructs `Worker(level=3)`. +5. The runner then performs an explicit prestart step equivalent to + `inner_worker.init()` plus `_start_hierarchical()` for the inner Worker: + allocate local mailboxes, fork local chip/sub children, register local + endpoints with the inner C++ Worker, and start the inner Scheduler and + `WorkerThread`s. +6. Only after this local L3 child tree is established does the session runner + bring up sockets, RDMA queue pairs, health threads, or UB doorbells for task + traffic. +7. The runner then performs the remote protocol `HELLO`/ready handshake over + the ordered command lane. `HELLO` confirms session identity, endpoint + identity, protocol version, comm profile, and negotiated features; it does + not carry the bootstrap manifest. +8. Session shutdown rejects new frames, completes or fails in-flight tasks, + drains cleanup, closes the inner Worker, and exits the runner process. + +This keeps the local L3 fork/shm implementation intact while preventing a +multi-threaded network daemon from becoming the process that performs the +forks. + +`HELLO ready_state=READY` is a scheduling barrier, not just a liveness signal. +The parent must not put a remote endpoint into the schedulable set until the +runner has completed prestart, installed the bootstrap registries, initialized +the buffer/import registry, started the command and health lanes, and confirmed +the negotiated feature set. This mirrors the distributed-system convention that +a worker or actor becomes visible only after runtime and dependency +initialization has completed. + +## Endpoint Identity and Callable Routing + +Remote scheduling needs explicit callable namespaces and an explicit mapping +from callable ids to eligible NEXT_LEVEL endpoints. The current scheduler can +otherwise choose any idle worker, which is only correct when every NEXT_LEVEL +child has the same callable registry. + +Required contracts: + +- Every local or remote NEXT_LEVEL child has a stable `endpoint_id` equal to + its logical worker id in `WorkerManager`. +- `register(callable, workers=...)` is the single public registration API. + Local Python/callable objects keep the existing local registration path. + `RemoteCallable` descriptors use the remote control path and bind the + allocated parent cid to one or more remote endpoint ids. +- The first implementation requires `RemoteCallable` registration to pass an + explicit non-empty `workers=[...]` list. Future releases may replace raw + worker ids with named remote pools or placement policies, but implicit + broadcast to all remote endpoints is not part of the contract. +- Bootstrap manifests are generated by the parent. Users provide remote + callable descriptors; users do not hand-author raw `cid -> callable` maps. +- Remote callable descriptors have two Python forms: + - `PYTHON_IMPORT`: a bounded `module:qualname` import path. This is required + for the remote L3 baseline. + - `PYTHON_SERIALIZED`: a versioned payload produced by the PR #839 callable + serializer described in + [python-callable-serialization.md](python-callable-serialization.md). This + is allowed only when parent and session negotiate the serializer version, + payload limits, Python ABI/runtime compatibility, and + dependency/runtime-environment compatibility. +- Remote L3 uses two independent cid namespaces: + - **Outer remote cid namespace**: parent-assigned cids carried in L4 TASK + frames. These cids select the remote L3 orchestration callable. + - **Inner L3 cid namespace**: cids registered on the session runner's + `inner_worker = Worker(level=3)`. Remote L3 orch functions use these cids + when they call `orch.submit_next_level(...)` or `orch.submit_sub(...)`. +- The two namespaces may contain the same integer values, but they are not the + same registry. A cid from the parent TASK frame must not be assumed to name a + chip/sub callable inside the inner L3 Worker. +- Dynamic Python callable registration follows the public visibility and cid + lifecycle semantics from local dynamic registration: registration is + synchronous per selected endpoint, future TASK frames may use the cid only + after the control reply succeeds, unregister/cid reuse clears stale callable + state, and TASK/control ordering prevents a TASK from observing a partially + registered cid. +- Import-path descriptors are the required remote baseline. Serialized Python + callable payloads preserve the same cid lifecycle but remain an optional + negotiated feature because they require Ray-like environment and serializer + compatibility checks that local fork/COW registration does not need. +- Multi-endpoint `register(..., workers=[...])` is all-or-nothing by + default. The parent sends a prepare phase to every selected endpoint, commits + the cid only after every prepare succeeds, and exposes the cid to future TASK + frames only after every commit reply succeeds. If any endpoint fails prepare + or commit, the parent aborts the transaction, keeps the cid invisible, and + either rolls back successful endpoints or marks endpoints with unknown state + failed. +- Multi-endpoint unregister uses a tombstone state. A cid is unavailable for + reuse until every selected endpoint confirms unregister cleanup, or until any + non-responsive endpoint is removed from eligibility and marked failed. This + prevents stale callable residue from being observed by later TASK frames. +- `TaskSlotState` stores the final eligible endpoint set for the slot. This is + the intersection of endpoints that can run the `callable_id` and endpoints + that can access every tensor/buffer referenced by the slot. +- If the user passes `worker=N`, submit-time validation checks that endpoint + `N` is eligible for that cid and for the slot's tensor sidecars. +- If `worker=-1`, the Scheduler chooses only from idle endpoints in the + slot's eligible set. +- Group submit validates each affinity independently. Unconstrained group + members are assigned distinct idle eligible endpoints. +- Mixed local + remote NEXT_LEVEL pools are allowed only when the callable id + is registered on every endpoint that can receive the slot and the slot's + tensors are materialized in a representation those endpoints can consume. + A callable registered on both local and remote endpoints does not make a + remote-buffer task eligible for the local endpoint. + +Example API shape: + +```python +from simpler.worker import RemoteCallable, RemoteWorkerSpec, Worker + +w4 = Worker(level=4) + +l3 = RemoteWorkerSpec( + endpoint="node17:19073", + platform="a2a3", + runtime="tensormap_and_ringbuffer", + device_ids=list(range(16)), + num_sub_workers=2, + transport="roce", +) + +l3_worker_id = w4.add_remote_worker(l3) +l3_cid = w4.register( + RemoteCallable("my_pkg.remote_orch:l3_orch"), + workers=[l3_worker_id], +) +w4.init() +``` + +`add_worker(local_worker)` remains unchanged and continues to use fork/shm. + +Dynamic remote registration uses the same cid lifecycle whether the descriptor +is installed at bootstrap or through a later control frame. If a callable refers +to inner L3 cids, those cids are values from the inner namespace installed by +the session manifest or by prior remote control registration, not the outer cid +used to dispatch the remote L3 orch callable. + +## Remote Worker Session + +The parent generates a bootstrap manifest and sends it to the +`simpler-remote-worker` daemon as part of session creation. The daemon validates +it and hands it to the session runner before the runner starts any transport +threads: + +```text +session_id +parent_worker_level +remote_worker_level = 3 +endpoint_id +platform, runtime, build flag +device_ids, num_sub_workers, heap_ring_size +callable registry: + outer registry: + parent-assigned outer cid -> remote L3 orch callable descriptor + descriptor = PYTHON_IMPORT or negotiated PYTHON_SERIALIZED + inner L3 registry: + inner cid -> ChipCallable blob metadata, when needed + inner cid -> Python sub/orch callable descriptor, when needed +comm policy: roce | hccs | ub | sim +feature flags +``` + +The session runner installs the outer registry into its remote TASK dispatcher. +It installs the inner registry into `inner_worker = Worker(level=3)` during +prestart and before `HELLO READY`. Remote controls may add or remove entries in +either namespace after bootstrap: + +- registering an outer Python callable changes what future L4 TASK frames can + dispatch on this remote endpoint; +- registering an inner Python callable changes what already-registered remote + L3 orch functions can submit to `inner_worker`; +- registering an inner `ChipCallable` follows the existing dynamic callable IPC + cascade shape, but the remote control payload is a versioned remote frame + instead of a local POSIX-shm mailbox name. + +For a TASK frame, the session runner: + +1. Validates the session and sequence number. +2. Decodes `RemoteTaskArgs`. +3. Translates remote tensor descriptors into local `ContinuousTensor` values. +4. Looks up the L3 orchestration function in the outer registry by + parent-assigned outer cid. +5. Calls `inner_worker.run(orch_fn, args, config)`. +6. Sends completion with success or bounded traceback text. + +For a CONTROL frame, it forwards the operation to the inner worker or its +buffer registry, then replies with a typed result. + +Session execution rules: + +- The baseline remote endpoint runs at most one TASK at a time. This matches + the current one-`WorkerThread`-per-child local scheduling model and keeps + ordering, buffer lifetime, and cid visibility simple. +- State-changing CONTROL frames such as register, unregister, buffer free, + copy, export/import, and import release serialize with TASK execution on the + ordered command lane. They are not applied concurrently with a running TASK + on the same endpoint. Future Remote CommDomain controls follow the same + ordering rule when they enter scope. +- Bulk data movement may use a separate data plane, but the state change that + makes staged bytes, callable payloads, or imported handles visible is ordered + by the command lane. +- Health/liveness does not depend on the command lane making progress. Each + session has an independent health lane or equivalent transport-level health + signal so a long-running `inner_worker.run()` does not look like endpoint + failure merely because queued command-lane frames cannot be serviced. + +## Remote TaskArgs Representation + +Keep `ContinuousTensor` as the L2 ABI. Do not overload raw pointer values to +carry transport state. + +Public Python uses a sidecar representation: + +- `RemoteBufferHandle` identifies an allocated or imported remote buffer. +- `RemoteTensorRef(handle, offset, shape, dtype)` is accepted by + `TaskArgs.add_tensor()` wherever a remote submit is legal. +- The Python/C++ binding stores a normal tensor metadata entry plus a hidden + sidecar entry at the same tensor index. +- Local endpoints reject remote tensor refs. `RemoteTensorRef` is transport + metadata, not a local mailbox ABI. A local fork/shm endpoint becomes eligible + only after the data has been explicitly imported, staged, or materialized into + a local-addressable `ContinuousTensor`. +- Remote endpoints require a sidecar/descriptor for every tensor that carries + data over the remote protocol, including `HOST_INLINE` tensors. A null + sidecar is allowed only for metadata-only tensors with no data payload. + Remote endpoints reject bare host pointers unless an explicit staging API + produced a remote handle. +- Remote submits reject `OUTPUT` tensors whose `ContinuousTensor.data == 0` + unless the caller has already supplied a `RemoteTensorRef` sidecar. The first + implementation does not auto-allocate remote outputs during submit. + +Parent-side slots therefore store existing `TaskArgs`, an optional +`RemoteTaskArgsView`, eligible endpoint ids, and captured remote-buffer refs. + +`Orchestrator::infer_deps()` builds `TensorKey` from remote handle metadata +when a sidecar exists. The first implementation intentionally preserves the +current exact-start TensorMap semantics: local fork/shm keys use +`(ptr, worker)`, while remote keys use +`(address_kind, owner_endpoint_id, buffer_id, generation, offset)`. The tensor +byte length is bounds-checked by the descriptor but does not participate in +dependency lookup. This means two remote tensors that reference overlapping +byte ranges with different offsets are not automatically ordered, matching the +current local pointer-key behavior. + +## Failure Semantics + +Remote completion is explicit and sequence-based. Local mailbox completion must +be adapted to the same outcome contract. Failure must not make downstream tasks +run as if the producer succeeded. + +Required parent-side behavior: + +- `RemoteL3Endpoint::run()` blocks for the matching completion sequence. +- `LocalMailboxEndpoint::run()` maps a non-zero mailbox error to + `task_failure` instead of reporting a successful completion. +- Non-zero task or endpoint errors become candidates for the worker's first + reported error. +- The worker still notifies the Scheduler so `drain()` cannot hang. +- The notification carries an outcome: success, task failure, or endpoint + failure. +- Failed slots transition to a failed/poisoned state rather than successful + `COMPLETED`. +- Downstream consumers of a failed producer are marked failed/skipped and are + not dispatched. +- `drain()` waits for bookkeeping and cleanup, then raises the first root + error with remote host, endpoint id, cid, and sequence in the message. + +Local mailbox dispatch keeps first-error-wins only for final error reporting. +It must not mark a failed child dispatch as successful `COMPLETED`. The remote +buffer path and the local adapter both use the same poisoned dependency +propagation before dependent tasks are exposed to failed producer outputs. + +Every blocking wait must have a configurable timeout. Remote transport must +not copy the current local control path's infinite spin-wait failure mode. + +## Buffer Lifecycle + +Remote buffers need an owner, generation, and deferred physical free. The +parent owns the visible handle state; the session runner owns remote physical +memory and imported mappings. + +See +[buffers-and-transports.md](remote-l3-worker-design/buffers-and-transports.md) +for the handle schema, control commands, release policy, and A2/A3/A5 backend +requirements. + +## Protocol + +Do not reuse the raw 4096-byte mailbox format across hosts. It has no version +field, no sequence number, and assumes shared virtual memory. + +Remote endpoints use a versioned frame protocol with `HELLO`, `TASK`, +`CONTROL`, `CONTROL_REPLY`, `COMPLETION`, `HEALTH`, and `SHUTDOWN` frames. The +bootstrap socket carries only session setup and HCOMM bring-up frames. After +HCOMM RPC is ready, steady-state control, task metadata, completions, and +shutdown use the HCOMM-backed RPC adapter; tensor data and remote buffer +operations use the HCOMM data adapter. The local path keeps the existing +mailbox layout behind `LocalMailboxEndpoint`. + +Remote frames use canonical little-endian field encoding for `CallConfig`, +`ContinuousTensor`, tensor descriptors, strings, counts, and enums; they do not +memcpy local C++ POD structs onto the wire. Each endpoint has one ordered +command lane for runtime state-changing frames, so TASK cannot overtake +registry-changing CONTROL. Liveness uses a separate health lane or equivalent +transport-level signal and is not queued behind user TASK execution. + +See [protocol.md](remote-l3-worker-design/protocol.md) for frame layout, +remote tensor descriptors, ordering, and bounds-checking requirements. + +## Rollout + +The recommended first cut is conservative: + +1. Land the endpoint abstraction and local adapter. +2. Add remote tensor sidecars and endpoint eligibility metadata. +3. Add the versioned frame codec and the independent health-lane contract. +4. Add remote callable registration with all-or-nothing multi-endpoint + visibility and tombstone-based cid reuse. +5. Add the fork-safe simulation session runner with explicit prestart before + `HELLO READY`. +6. Prove local behavior is unchanged and remote sim behavior handles success, + failure, cid mapping, timeouts, health, and buffer cleanup. +7. Add A2 RoCE, A3 HCCS, and A5 UB profiles behind the HCOMM adapter layer. + +See +[implementation-plan.md](remote-l3-worker-design/implementation-plan.md) +for the detailed PR sequence and validation matrix. diff --git a/docs/remote-l3-worker-design/buffers-and-transports.md b/docs/remote-l3-worker-design/buffers-and-transports.md new file mode 100644 index 000000000..484cc1755 --- /dev/null +++ b/docs/remote-l3-worker-design/buffers-and-transports.md @@ -0,0 +1,307 @@ +# Remote L3 Buffers and Transports + +This document defines remote buffer lifetime and backend transport contracts +for remote L3 NEXT_LEVEL endpoints. + +## Buffer Handles + +Remote buffers need an owner, generation, and deferred free point. The parent +tracks user-visible handle state; the remote session runner owns physical +memory and imported mappings. + +Parent-side handle: + +```text +RemoteBufferHandle: + endpoint_id + buffer_id + generation + address_space + nbytes + remote_addr + rkey_or_token + ub_ldst_va + ref_state + live_slot_refs +``` + +`buffer_id` may be reused only with a new `generation`. Stale completions, +imports, or frees whose generation does not match are ignored or reported as +session errors. + +## Public Memory API + +Remote memory APIs return handles, not bare integer pointers: + +```python +buf = w4.remote_malloc(worker=l3_worker_id, nbytes=4096) +w4.remote_copy_to(buf, host_ptr, 4096) + +args = TaskArgs() +args.add_tensor( + RemoteTensorRef(buf, offset=0, shape=(1024,), dtype=DataType.FLOAT32), + TensorArgType.INPUT, +) +``` + +The binding stores a hidden remote sidecar beside the tensor metadata. +`RemoteTensorRef` is transport metadata, not an extension of the local mailbox +ABI. Local fork/shm endpoints reject it unless the buffer has first been +explicitly imported, staged, or materialized into a local-addressable +`ContinuousTensor`. Remote endpoints reject bare host pointers unless explicit +staging produced a handle. + +## TaskArgs Sidecar Contract + +`ContinuousTensor` remains the tensor metadata ABI. Remote transport metadata +is stored in a per-`TaskArgs` sidecar keyed by tensor index. + +Python-facing rules: + +- `TaskArgs.add_tensor(RemoteTensorRef(...), tag)` appends one normal tensor + metadata entry and one remote sidecar entry at the same tensor index. +- `RemoteTensorRef` is not converted to a fake integer pointer. +- `RemoteBufferHandle` is opaque to user code. Users may inspect endpoint, + size, and release state, but not transport keys such as `rkey`. +- A `TaskArgs` containing any remote sidecar is legal only when the final + selected endpoint set contains remote endpoints that can consume every + referenced sidecar. +- Local `submit_next_level()` rejects remote sidecars before slot commit unless + an explicit import/staging API has converted them into local-addressable + `ContinuousTensor` values and removed the remote sidecar. +- Remote submit rejects `OUTPUT` tensors with `ContinuousTensor.data == 0` + unless an explicit remote allocation API has already produced a + `RemoteTensorRef` sidecar for that tensor. + +Parent C++ slot rules: + +- `TaskSlotState` owns a copy of the sidecar for the slot lifetime. +- Sidecar length must equal `tensor_count`; entries can be null only for + metadata-only tensors. `HOST_INLINE` payloads still have sidecar descriptors + so the frame codec has one validation path for every remote data payload. +- Submit validation captures a live ref on every referenced + `RemoteBufferHandle` before the slot becomes visible to the Scheduler. +- Validation fails before slot commit when the intersection of callable-eligible + endpoints and data-eligible endpoints is empty, or when a remote tensor names + an ineligible endpoint, stale generation, out-of-range offset, or released + handle. +- Group submit stores one sidecar per group member, aligned with + `task_args_list[i]`. + +Endpoint rules: + +- `LocalMailboxEndpoint` rejects non-empty sidecars. It cannot encode remote + descriptors into the local 4096-byte mailbox, and its child processes expect + `ContinuousTensor.data` to be a local host/shm pointer or a local child-memory + pointer. +- `RemoteL3Endpoint` requires a sidecar for every tensor payload that crosses + the remote protocol, including `HOST_INLINE` payloads. +- Remote TASK frames write `ContinuousTensorWire.data == 0`; parent virtual + addresses never cross the remote protocol. +- A remote tensor with `child_memory=True` and no sidecar is invalid. Local + child-memory pointers are meaningful only inside fork/shm topology. +- The remote session runner translates each `RemoteTensorDesc` into a local + `ContinuousTensor` and fills `data` from its validated local mapping + immediately before invoking `inner_worker.run()`. + +## Remote OUTPUT Allocation Policy + +The first implementation does not mirror local HeapRing auto-allocation for +remote outputs. In the local fork/shm path, an `OUTPUT` tensor with +`ContinuousTensor.data == 0` is assigned a parent HeapRing pointer during +submit, and forked children can dereference that shared virtual address. A +remote L3 worker cannot use a parent-host HeapRing pointer. + +Remote callers must allocate or import output storage explicitly before submit: + +```python +out = w4.remote_malloc(worker=l3_worker_id, nbytes=4096) + +args = TaskArgs() +args.add_tensor( + RemoteTensorRef(out, offset=0, shape=(1024,), dtype=DataType.FLOAT32), + TensorArgType.OUTPUT, +) +orch.submit_next_level(cid, args, cfg, worker=l3_worker_id) +``` + +This keeps submit-time validation simple: the slot already carries complete +data eligibility, handle generation, bounds, and lifetime refs before it +becomes visible to the Scheduler. + +Future work may add remote output auto-allocation, but only after the runtime +has a well-defined pre-dispatch endpoint selection policy. Auto-allocation must +decide which endpoint owns the output before slot commit, allocate or import the +remote buffer, attach the generated sidecar to the correct tensor index, and +handle group submits where each member may need storage on a different +endpoint. Until those rules exist, remote null `OUTPUT` tensors fail fast. + +## Required Controls + +| Command | Purpose | +| ------- | ------- | +| `ALLOC_REMOTE_BUFFER` | Allocate remote L3 host or chip memory. | +| `FREE_REMOTE_BUFFER` | Mark a handle released; physical free is deferred. | +| `COPY_TO_REMOTE` | Stage host data into a remote buffer. | +| `COPY_FROM_REMOTE` | Pull remote output data back to host. | +| `EXPORT_BUFFER` | Return RDMA key or UB mapping metadata. | +| `IMPORT_BUFFER` | Import a peer buffer into a remote worker. | +| `RELEASE_IMPORT` | Drop an imported peer mapping. | + +Control commands are typed remote protocol frames. They are not the local +mailbox `CTRL_*` integers. + +## Release Policy + +- Slot refs are acquired during `submit_next_level()` while walking `TaskArgs`. +- Captured buffers stay live until every capturing slot has reached a terminal + state and every producer/consumer reference that can expose that buffer has + reached `CONSUMED` or failed cleanup. +- Explicit buffers used only as INPUT still need slot refs. They have no + producer slot to protect them. +- Runtime-managed OUTPUT buffers follow the producer slot's terminal cleanup. +- `FREE_REMOTE_BUFFER` and `RELEASE_IMPORT` mark handles released. Physical + free or import teardown runs only when the released handle has no live slot + refs. +- If a run fails, the same post-drain cleanup path runs before the next run. +- Session shutdown rejects new work and frees all session-owned buffers after + completing or failing in-flight tasks. + +The registry therefore needs both a user-visible release state and a live +slot-ref count. Failed slots still release captured refs through the same +terminal cleanup path as successful slots. + +## Dependency Keys + +`TensorKey` must grow beyond the current `{ptr, int8 worker}` shape for remote +buffers while preserving the current exact-start lookup semantics. Today, local +dependency tracking keys only on a tensor's start pointer and worker id; shape +and byte length do not participate in lookup. Remote keys follow the same rule: + +```text +address_kind +owner_endpoint_id +buffer_id +offset_begin +generation +``` + +Local fork/shm keys remain a compatibility subset: + +```text +host pointer: (LOCAL_HOST, -1, ptr) +local child memory: (LOCAL_CHILD, worker_id, ptr) +``` + +Known limitation: two remote tensors that reference overlapping byte ranges +with different `offset_begin` values do not automatically depend on each other. +For example, a producer writing `[0, 4096)` and a consumer reading +`[1024, 2048)` map to different dependency keys. This matches the current local +`ptr`-based TensorMap behavior, where a subview at `base + offset` is a +different key from `base`. + +The first implementation chooses this route to keep remote scheduling behavior +compatible with local fork/shm semantics and to avoid changing TensorMap into a +range index as part of the transport bring-up. `offset_end`/`nbytes` remains in +`RemoteTensorDesc` for bounds checks and for a future range-overlap TensorMap +upgrade, but it is not part of the first dependency key. + +## HCOMM Adapter Contract + +Remote L3 uses HCOMM for steady-state communication in the first +implementation. The bootstrap socket is only a setup path for session +validation and HCOMM bring-up. Once HCOMM RPC is ready, task metadata, +CONTROL, CONTROL_REPLY, COMPLETION, and SHUTDOWN frames use the HCOMM RPC +adapter; tensor data and remote buffer copies use the HCOMM data adapter. + +The endpoint owns the adapter objects. `Orchestrator`, Scheduler, and +`WorkerThread` see only `WorkerEndpoint::run()`, `WorkerEndpoint::control()`, +and logical capability bits from `WorkerEndpoint::caps()`. + +The adapter family provides this logical contract: + +```text +BootstrapSocketAdapter: + open(control_uri) + exchange hello/capability/HCOMM bootstrap frames + close after HCOMM_RPC_READY + +HcommRpcAdapter: + submit TASK or CONTROL frame on the ordered command lane + wait for matching COMPLETION or CONTROL_REPLY + send SHUTDOWN when no command is in flight + +HcommAdapter: + register/export/import memory + submit read/write/copy plans + wait/fence completion + release registrations and imports +``` + +HCOMM RPC enqueues request frames on the endpoint's ordered command lane. +Data-plane transfers may use RoCE, HCCS, or UB HCOMM profiles, but TASK +doorbells, CONTROL frames, replies, completions, and shutdown state are ordered +by the command lane. Reply frames carry the request sequence they answer. A +task completes only after an explicit remote `COMPLETION` frame; data-copy +completion alone is never a task completion signal. `control()` returns only +after the matching `CONTROL_REPLY` arrives or a timeout/disconnect is converted +into a failed reply. Liveness is handled by an independent health lane or +transport keepalive; it is not queued behind the ordered command lane. + +## A2 RoCE HCOMM Profile + +- Use HCOMM with `COMM_PROTOCOL_ROCE`. +- Carry command frames and completion records through HCOMM RPC rings and + notify/fence operations. +- Use a separate health HCOMM lane or transport keepalive for liveness. +- Use registered staging buffers for large callable blobs and bulk data. +- Export buffers as HCOMM memory descriptors plus RoCE-specific channel + metadata. +- Complete tasks only after an HCOMM RPC `COMPLETION` frame from the session + runner. +- Bound every wait with a timeout and convert disconnects into endpoint + failure completions. + +## A3 HCCS HCOMM Profile + +- Keep the same HCOMM adapter contract as A2. +- Implement memory export/import through the HCCS-capable HCOMM profile. +- Preserve the same command-lane ordering rules: task/control frames are + observed in sequence order, command frame visible before doorbell, and remote + writes complete before completion frame. +- Provide health independently from command-lane progress so long-running TASK + execution does not cause false endpoint failure. +- Reuse the frame codec, HCOMM RPC, and buffer registry tests from the A2 path. + +## A5 UB HCOMM Profile + +- Export both RDMA metadata and, when available, an LD/ST mapping token. +- Use LD/ST for doorbells and small completion records only when the mapping + is coherent for the participating hosts. +- Preserve the same per-endpoint command-lane order for TASK, CONTROL, + CONTROL_REPLY, COMPLETION, and SHUTDOWN frames. +- Keep UB health doorbells or transport health independent from the command + lane used for state-changing frames. +- Use RDMA for bulk transfers until platform benchmarks justify LD/ST bulk + copies. +- Add explicit fences around: + - task payload writes before doorbell; + - remote output writes before completion; + - parent completion read before dependent task dispatch. +- Keep RDMA fallback for all UB LD/ST paths. + +## Simulation Backend + +The simulation backend uses TCP or Unix sockets plus local files/shm for +integration tests. It must exercise: + +- framed protocol encode/decode; +- sequence numbers; +- remote callable bootstrap; +- endpoint eligibility validation; +- success and error completions; +- failed dependency poisoning; +- buffer registry ref capture and deferred free; +- timeout handling. + +It must not depend on A2/A3/A5 hardware. diff --git a/docs/remote-l3-worker-design/implementation-plan.md b/docs/remote-l3-worker-design/implementation-plan.md new file mode 100644 index 000000000..295d8ea6d --- /dev/null +++ b/docs/remote-l3-worker-design/implementation-plan.md @@ -0,0 +1,313 @@ +# Remote L3 Implementation Plan + +Deliver remote L3 support in small PRs. Each step should keep existing local +fork/shm behavior working. + +## PR Sequence + +1. Endpoint interface and local adapter. + - Add `WorkerEndpoint`. + - Include read-only `WorkerEndpoint::caps()` capability metadata for + endpoint eligibility. `caps()` returns logical features only and must not + expose HCOMM/RDMA/socket handles to Scheduler or Orchestrator code. + - Define `WorkerEndpoint::run()` to return an explicit outcome: success, + task failure, or endpoint failure. + - Move current mailbox code into `LocalMailboxEndpoint`. + - Teach `LocalMailboxEndpoint` to map `MAILBOX_OFF_ERROR == 0` to success + and non-zero child mailbox errors to task failure. + - Treat local dispatch exceptions, child crash detection, and timeout paths + as endpoint failure once those paths are available. + - Keep `WorkerManager::add_next_level(void *mailbox)` working by wrapping + the mailbox in a local endpoint. + - Thread the endpoint outcome through the WorkerThread completion callback + without yet changing all downstream DAG poisoning behavior. + - Add local adapter regression tests for existing L3/L4 examples. + +2. Endpoint eligibility metadata. + - Assign each NEXT_LEVEL child a stable `endpoint_id`. + - Store `callable_id -> eligible endpoint ids` in parent runtime metadata. + - Extend submit slots with final eligible endpoint sets computed as + callable eligibility intersected with tensor/buffer data eligibility. + - Teach Scheduler/WorkerManager to pick only eligible idle workers. + - Validate `worker=` affinity against the slot's final eligible set. + - Keep current docs in sync: describe local fork/shm as + `LocalMailboxEndpoint` and remote L3 as a framed endpoint, not as another + mailbox child loop. + +3. Remote task sidecars and dependency keys. + - Add `RemoteBufferHandle`, `RemoteTensorRef`, and `RemoteTensorDesc`. + - Store a `RemoteTaskArgsView` sidecar beside existing `TaskArgs`. + - Extend `TensorKey` for remote endpoint, buffer id, generation, logical + start offset, and address kind. + - Teach `Orchestrator::infer_deps()` to use remote logical keys while + preserving existing local keys. + - Reject remote sidecars on local fork/shm endpoints unless an explicit + import/staging API has converted them into local-addressable tensors. + - Reject unstaged raw host pointers before a remote slot is committed. + - Reject remote `OUTPUT` tensors with `data == 0` unless an explicit remote + allocation API has already produced a `RemoteTensorRef` sidecar. + +4. Failed task poisoning. + - Add per-member group state/outcome tracking so group failure can skip + unstarted members while waiting for already-dispatched members to finish. + - Add failed/poisoned slot handling. + - Prevent downstream consumers of failed producers from dispatching. + - Preserve `drain()` cleanup and first-error-wins reporting. + +5. Versioned remote frame codec. + - Add `remote_wire.h/.cpp`. + - Implement canonical little-endian encode/decode for `CallConfigWire`, + `ContinuousTensorWire`, frame headers, descriptors, counts, strings, and + enum values. + - Implement the `HOST_INLINE` inline byte arena with descriptor + `inline_payload_offset` / `inline_payload_len` validation. + - Keep local mailbox `write_blob` / `read_blob` local-only; remote codec must + not memcpy C++ POD structs as its wire format. + - Implement encode/decode bounds checks for all frame types. + - Define and test `CONTROL_REPLY` encode/decode for command success, + command failure, result payloads, and sequence matching. + - Define and test the per-endpoint ordered command lane for TASK, CONTROL, + CONTROL_REPLY, COMPLETION, and SHUTDOWN frames. + - Define and test an independent `HEALTH` lane or transport keepalive so + liveness is not queued behind long-running TASK execution. + - Include tests for corrupt lengths, tensor counts, sequence mismatch, and + bounded error payloads. + - Include tests that reject unknown enum values, non-zero reserved fields, + and truncated multi-byte fields. + - Include tests that reject non-zero `ContinuousTensorWire.data` in remote + TASK frames. + +6. Remote callable registry. + - Implement `RemoteCallable("module:qualname")`. + - Preserve the public cid lifecycle from local dynamic Python registration: + visibility only after register reply, unregister/cid reuse, stale-state + cleanup, and TASK/control ordering. + - Treat import-path callables as the baseline remote mode. + - Support PR #839 serialized Python callable payloads only as a negotiated + feature with serializer version, payload limit, Python ABI/runtime, and + dependency/runtime-environment compatibility checks. Follow the payload + contract in `docs/python-callable-serialization.md`. + - Extend the existing public `register(callable, workers=...)` flow so + `RemoteCallable` descriptors can target remote endpoints. Do not add a + separate public remote-only registration API. + - Require the first implementation to reject `RemoteCallable` registration + without an explicit non-empty `workers=[...]` list. Leave named remote + pools and placement policies as future API work, and do not define + implicit broadcast to all remote endpoints as a contract. + - Implement multi-endpoint all-or-nothing registration with prepare, commit, + and abort controls. Keep the parent cid invisible until every selected + endpoint commits, and mark uncertain endpoints failed rather than leaving a + partially visible cid. + - Implement unregister tombstones. Do not reuse a cid until every selected + endpoint confirms cleanup or is removed from eligibility as failed. + - Split cid handling into an outer remote-orch namespace and an inner L3 + Worker namespace; never assume a parent TASK cid is an inner chip/sub cid. + - Make parent/session-assigned cid values the only cid source in each + manifest namespace. + - Define post-bootstrap namespace-aware prepare, commit, abort, and + unregister frames for Python callables and `ChipCallable` entries. + +7. Fork-safe simulation session runner. + - Add `simpler-remote-worker` control entry point. + - Add per-session `simpler-remote-l3-session` runner. + - Pass the validated bootstrap manifest from daemon to runner through an + inherited fd, manifest path in env, or single-threaded pipe before any + runner transport threads start. + - Add an explicit runner prestart step equivalent to `inner_worker.init()` + plus `_start_hierarchical()`: fork L3 chip/sub children, register local + endpoints, and start the inner Scheduler before any remote transport or + health threads are started. + - Start the sim transport only after the local L3 child tree is established, + then run the post-prestart `HELLO`/ready handshake. + - Treat `HELLO ready_state=READY` as a scheduling barrier; the parent must + not schedule an endpoint that is alive but not prestarted. + - Run TASK frames over the sim transport and return completions. + - Add localhost two-process integration tests. + +8. Remote control-plane parity. + - Map existing NEXT_LEVEL controls onto typed remote frames: + prepare, register, unregister, remote buffer allocation, remote buffer + free, copy to remote, copy from remote, export buffer, import buffer, and + release import. + - Reserve remote `COMM_INIT`, `ALLOC_DOMAIN`, and `RELEASE_DOMAIN` opcodes + for the later Remote CommDomain phase; the first task-dispatch cut rejects + them as unsupported controls. + - Keep local mailbox sub-command ids local-only. + - Add tests for post-bootstrap ChipCallable registration and remote buffer + controls through the session runner. + +9. Remote buffer registry. + - Add `ALLOC_REMOTE_BUFFER`, `FREE_REMOTE_BUFFER`, `COPY_TO_REMOTE`, and + `COPY_FROM_REMOTE`. + - Track per-slot capture refs for explicit buffers and imported peer + buffers. + - Tie physical free and release-import to post-drain cleanup after all + captured refs drop. + +10. A2 RoCE HCOMM profile. + - Implement HCOMM endpoint/channel setup with `COMM_PROTOCOL_ROCE`, HCOMM + RPC command rings, registered staging buffers, and timeout/error paths. + - Keep bootstrap sockets limited to session setup and HCOMM bring-up. + After HCOMM RPC is ready, task metadata, controls, completions, and + shutdown use the HCOMM RPC adapter. + - Add a hardware-gated smoke test with one remote L3 worker. + +11. A3 HCCS HCOMM profile. + - Implement the same HCOMM adapter contract with the HCCS protocol profile. + - Reuse the A2 frame, HCOMM RPC, and buffer registry tests. + +12. A5 UB HCOMM profile. + - Add UB export/import metadata through the HCOMM adapter. + - Implement LD/ST doorbell and completion paths only when the selected + profile proves the mapping and fence rules are valid. + - Keep RDMA/HCOMM fallback for bulk transfers. + +13. Remote `allocate_domain()` future work. + - Extend `CommDomainHandle` to carry remote endpoint ids. + - Allocate/import windows collectively across remote workers. + - Preserve deferred release after `drain()`. + +## Required Tests Before Hardware Backends + +| Test | Expected result | +| ---- | --------------- | +| Local adapter regression | Existing L3/L4 fork/shm behavior unchanged. | +| Endpoint eligibility | Scheduler never picks an ineligible endpoint. | +| Frame fuzz/bounds | Corrupt lengths and counts are rejected. | +| Remote sim hello | Parent bootstraps remote L3 and shuts down cleanly. | +| Manifest handoff | Runner reads manifest before transport starts. | +| Prestart barrier | HELLO READY only after inner L3 scheduler is started. | +| Remote sim task | L4 parent dispatches one L3 orch task successfully. | +| Remote sim error | Remote orch raises; parent raises with host/seq/cid. | +| Failed dependency | Consumer of failed remote producer is not dispatched. | +| Remote cid mapping | Daemon resolves non-zero parent-assigned remote cid. | +| Remote dep key | Shared remote buffer serializes through TensorMap. | +| Raw pointer rejection | Unstaged host pointer fails before slot commit. | +| Wire data zero | Non-zero remote TASK tensor data is rejected. | +| HOST_INLINE desc | Inline payloads require a descriptor and bounds checks. | +| Remote buffer copy | Host stages input, remote writes output, host pulls. | +| Input-only free deferral | Released input buffer survives queued consumers. | +| Timeout | Killed session runner produces bounded failure. | +| Dynamic Python register | Import-path callable register/unregister works. | +| Serialized Python gate | PR #839 payload works only after feature negotiation. | +| Callable visibility | TASK with cid works only after register reply. | +| Register partial failure | Multi-endpoint register is invisible after partial fail. | +| Unregister tombstone | Reused cid waits for cleanup or failed endpoint removal. | +| Command-lane order | TASK cannot overtake register/unregister control. | +| Health during long task | Health remains live while command lane runs a task. | +| Callable kind gate | Unsupported callable kind rejects without cid install. | +| Dynamic inner register | Inner Python and ChipCallable cids can run. | +| Remote cid namespace | Outer TASK cid cannot collide with inner cid. | +| Remote control parity | Register/unregister/domain reaches namespace. | + +## Hardware-Gated Tests + +- A2 RoCE single remote L3 task. +- A2 RoCE remote buffer copy round trip. +- A3 HCCS single remote L3 task. +- A5 UB LD/ST doorbell plus RDMA fallback. +- Remote domain allocation and deferred release across two remote L3 workers + after Remote CommDomain enters scope. + +## Open Decisions + +- Exact platform HAL names for HCCS and UB export/import. +- Authentication and isolation for remote daemon sessions. +- Exact remote compatibility metadata required for PR #839 serialized Python + callable payloads beyond the local payload contract, serializer version, and + Python ABI/runtime. +- How endpoint health feeds scheduler-level eligibility after the transport + reports a failed health lane. +- How much of `CommContext` should remain shared with PTO-ISA once remote UB + address metadata is added, when Remote CommDomain enters scope. + +The first cut should land endpoint abstraction, endpoint eligibility, +remote callable registration, failure poisoning, and the simulation runner +before any hardware transport code. + +## Failure Poisoning Contract + +Worker failures must finish DAG bookkeeping without pretending the producer +succeeded. The contract applies to remote completions, endpoint failures, and +local mailbox child errors reported by `LocalMailboxEndpoint`. + +State model: + +```text +FREE -> PENDING -> READY -> RUNNING -> COMPLETED -> CONSUMED + \-> FAILED -> CONSUMED +``` + +Rules: + +- Worker completion carries `success`, `task_failure`, or `endpoint_failure`. +- `success` transitions `RUNNING -> COMPLETED`. +- `task_failure` or `endpoint_failure` transitions `RUNNING -> FAILED`. +- `LocalMailboxEndpoint` converts a non-zero child mailbox error into + `task_failure`; first-error-wins controls only which root error message is + retained for `drain()`. +- A `FAILED` producer releases fanout bookkeeping, but consumers are marked + `FAILED` instead of `READY`. +- Failed consumers are never dispatched. +- `FAILED -> CONSUMED` runs the same cleanup hooks as `COMPLETED -> CONSUMED`: + TensorMap erase, ring release, remote buffer ref release, and deferred free + scheduling. +- `drain()` waits for all successful, failed, and skipped slots to become + `CONSUMED`, then rethrows the first root failure. + +Group task rules: + +`TaskSlotState` stores per-member execution state for group slots: + +```text +GroupMemberState: + NOT_DISPATCHED + RUNNING + SUCCESS + FAILED + SKIPPED + +GroupMemberOutcome: + success + task_failure + endpoint_failure + skipped +``` + +Additional group bookkeeping: + +- `member_states[group_size]` and `member_outcomes[group_size]`; +- `group_terminal_count`: members in `SUCCESS`, `FAILED`, or `SKIPPED`; +- `group_dispatched_count`: members that reached `RUNNING`; +- `group_failed`: set when any member reports `task_failure` or + `endpoint_failure`; +- `group_first_failure_index`: first failed member used for root-error context. + +Rules: + +- A group slot is successful only if every member reaches `SUCCESS`. +- On dispatch of member `i`, transition + `NOT_DISPATCHED -> RUNNING` before handing work to the endpoint. +- On successful completion of member `i`, transition `RUNNING -> SUCCESS` and + increment `group_terminal_count`. +- On failed completion of member `i`, transition `RUNNING -> FAILED`, set + `group_failed`, record `group_first_failure_index` if unset, and increment + `group_terminal_count`. +- When `group_failed` becomes true, every member still in `NOT_DISPATCHED` + transitions to `SKIPPED` and increments `group_terminal_count`. Skipped + members are never dispatched. +- Members already in `RUNNING` are allowed to complete so their endpoint state + and buffer refs can be cleaned up. +- The group slot reaches terminal outcome only when + `group_terminal_count == group_size`. +- If the terminal group has any `FAILED` member, the slot outcome is `FAILED`; + otherwise it is `COMPLETED`. +- Slot cleanup runs once for the whole group after the group slot reaches its + terminal outcome. + +Error reporting: + +- The first root failure message includes remote host, endpoint id, + callable id, and sequence. +- Poisoned downstream slots should reference the root producer slot instead of + overwriting the first-error-wins message. diff --git a/docs/remote-l3-worker-design/protocol.md b/docs/remote-l3-worker-design/protocol.md new file mode 100644 index 000000000..ed739d740 --- /dev/null +++ b/docs/remote-l3-worker-design/protocol.md @@ -0,0 +1,420 @@ +# Remote L3 Protocol + +This document defines the remote wire protocol used by +`RemoteL3Endpoint`. The local fork/shm path keeps the existing mailbox layout +behind `LocalMailboxEndpoint`. + +The bootstrap socket carries only setup and HCOMM bring-up frames. After HCOMM +RPC is ready, steady-state TASK, CONTROL, CONTROL_REPLY, COMPLETION, and +SHUTDOWN frames travel through the HCOMM RPC adapter. Tensor data moves through +the HCOMM data adapter and is referenced from TASK frames by descriptors, not +by host pointers. + +## Frames + +Remote transport must not reuse the raw 4096-byte mailbox format. Define a +versioned frame header: + +```text +FrameHeader: + magic = "SLR3" + version = 1 + frame_type = + HELLO | TASK | CONTROL | CONTROL_REPLY | COMPLETION | HEALTH | SHUTDOWN + session_id + endpoint_id + sequence + payload_bytes + flags +``` + +Rules: + +- `magic` and `version` are validated before reading payload fields. +- `session_id` protects against stale frames from prior sessions. +- `endpoint_id` must match the logical NEXT_LEVEL worker selected by the + parent scheduler. +- For parent-to-runner command requests, `sequence` is monotonic per endpoint + on the ordered command lane. This includes `TASK`, state-changing `CONTROL`, + and `SHUTDOWN`. +- `HEALTH` frames travel on an independent health lane, or use an equivalent + transport-level health signal. They carry their own health sequence and must + not be queued behind long-running TASK execution. +- Reply frames such as `COMPLETION` and `CONTROL_REPLY` carry the sequence of + the request they answer. They do not allocate a new command sequence. +- `payload_bytes` is bounds-checked before allocation or decode. +- Unknown frame types, versions, flags, or oversized payloads fail the + session instead of falling back to best-effort parsing. + +## Wire Encoding + +Remote frames use canonical field encoding. They do not memcpy C++ POD structs +such as `CallConfig` or `ContinuousTensor` onto the wire. The local fork/shm +mailbox may continue to use the raw POD layout because both endpoints are +fork-related processes running the same binary; remote endpoints must treat the +wire schema below as the compatibility contract. + +Encoding rules: + +- Multi-byte integers are little-endian. +- Boolean values are encoded as `uint8`, where `0` is false and `1` is true. +- Enum values are encoded as fixed-width integers. Unknown enum values reject + the frame. +- Strings are `uint32 byte_len` followed by UTF-8 bytes, with per-field maximum + lengths. +- Repeated fields are `uint32 count` followed by that many elements, with + configured maximum counts. +- Reserved fields must be written as zero and rejected when non-zero unless a + later protocol version assigns them. + +`CallConfigWire v1` encodes the current `CallConfig` fields explicitly: + +```text +block_dim: int32 +aicpu_thread_num: int32 +enable_l2_swimlane: int32 +enable_dump_tensor: int32 +enable_pmu: int32 +enable_dep_gen: int32 +output_prefix: string(max=1024) +``` + +`ContinuousTensorWire v1` encodes tensor metadata explicitly. In remote TASK +frames, `data` is not a transferable pointer; it is reserved and must be zero. + +```text +data: uint64 # reserved in remote TASK frames; must be 0 +shapes: uint32[CONTINUOUS_TENSOR_MAX_DIMS] +ndims: uint32 +dtype: uint32 +child_memory: uint8 +reserved: uint8[7] +``` + +The session runner decodes these wire records into local `CallConfig` and +`ContinuousTensor` values before calling `inner_worker.run()`. For tensors with +a remote descriptor, the runner fills the local `ContinuousTensor.data` from +its buffer/import registry after validating the descriptor. Local ABI structs +therefore remain the in-process execution ABI, not the remote transport ABI. + +## HELLO Payload + +`HELLO` is a post-prestart command-lane handshake. The session runner sends or +accepts it only after it has read the bootstrap manifest from the daemon +handoff, constructed `Worker(level=3)`, performed an explicit prestart that +forks local chip/sub children, started the inner scheduler, installed bootstrap +registries, initialized the buffer/import registry, and brought up the command +and health lanes. `HELLO` does not carry the bootstrap manifest. + +```text +session_id +endpoint_id +protocol_version +comm profile +feature flags +ready_state +``` + +The parent treats the endpoint as schedulable only after the `HELLO` exchange +confirms `ready_state=READY`, matching `session_id`, matching `endpoint_id`, +and compatible protocol and feature sets. + +`READY` is a scheduling barrier. A session that can answer liveness probes but +has not completed prestart reports a non-ready state and must not receive TASK +frames. + +## TASK Payload + +```text +callable_id: int32 +config: CallConfigWire v1 +args: RemoteTaskArgsWire v1 +``` + +TASK frames carry tensor metadata and remote descriptors after parent-side +dependency inference has already consumed `TaskArgs` tags. Tags are +Orchestrator input, not the remote L3 wire contract. The endpoint uses the +sidecar descriptors captured at submit time to materialize local +`ContinuousTensor` values on the session runner. + +`RemoteTaskArgsWire v1` contains: + +```text +tensor_count: uint32 +scalar_count: uint32 +tensor_metadata: ContinuousTensorWire[tensor_count] +remote_desc: OptionalRemoteTensorDescWire[tensor_count] +scalars: uint64[scalar_count] +inline_payload_bytes_len: uint32 +inline_payload_bytes: uint8[inline_payload_bytes_len] +``` + +For each tensor index, exactly one of these is true: + +- `remote_desc[i]` is present and names a remote buffer, imported peer buffer, + UB mapping, or allowed small `HOST_INLINE` payload. +- The tensor is metadata-only and has no data pointer and no remote descriptor. + +`tensor_metadata[i].data` must be zero in both cases. Bare host pointers are +rejected for remote endpoints unless an explicit staging API has produced a +remote handle and sidecar descriptor. + +`HOST_INLINE` is for small payloads that should travel inside the TASK frame. +It still requires a `RemoteTensorDescWire` with `address_space=HOST_INLINE`; +the descriptor points into `inline_payload_bytes`. Regular and large tensor +data must use `RemoteBufferHandle` / `RemoteTensorDesc` and the remote +data-plane transport, not inline bytes. + +## RemoteTensorDesc + +```text +RemoteTensorDescWire: + address_space: + HOST_INLINE + REMOTE_DEVICE + REMOTE_WINDOW + UB_LDST + owner_endpoint_id + buffer_id + offset + nbytes + remote_addr + rkey_or_token + generation + inline_payload_offset + inline_payload_len + flags +``` + +Rules: + +- `ContinuousTensor` remains the L2 ABI. The session runner translates + descriptors into local `ContinuousTensor` values immediately before + `inner_worker.run()`. +- When a descriptor is present, the incoming `ContinuousTensorWire.data` is + reserved and must be zero. The session runner derives the executable local + address only from the validated descriptor and its live buffer/import + registry. +- Metadata-only tensors also require `ContinuousTensorWire.data == 0`. +- Parent-side dependency keys use a stable logical start-address key derived at + submit time: + `(address_kind, owner_endpoint_id, buffer_id, generation, offset)`. + `nbytes` bounds the descriptor and is kept for validation and future overlap + detection, but it is not part of the first implementation's TensorMap lookup. +- `offset + nbytes` must fit inside the referenced handle. +- `generation` must match the parent registry entry and the remote daemon's + live handle entry. +- For `HOST_INLINE`, `inline_payload_offset + inline_payload_len` must fit + inside `inline_payload_bytes_len`, and `inline_payload_len` must equal + `nbytes`. Remote handle fields (`owner_endpoint_id`, `buffer_id`, + `remote_addr`, `rkey_or_token`, `generation`) are reserved and must be zero. +- For non-`HOST_INLINE` descriptors, `inline_payload_len` must be zero. +- A tensor owned by endpoint 3 cannot be submitted to endpoint 5 unless an + explicit `IMPORT_BUFFER` or peer-access handle is present. +- `child_memory=1` keeps its local meaning: the data is managed by a + next-level worker. For remote workers, ownership is resolved through the + remote sidecar, not through a local pointer alone. + +## COMPLETION Payload + +```text +sequence +error_code +error_message_len +error_message bytes +``` + +Rules: + +- `sequence` must match the request being waited on. +- `error_code=0` means task success. +- Non-zero error means task failure. The parent marks the slot failed/poisoned + and does not dispatch downstream consumers. +- `error_message` is bounded UTF-8. It should include remote host, + `endpoint_id`, `callable_id`, and `sequence`. +- If health expires or the process exits, the parent fabricates an endpoint + failure completion for every in-flight sequence. + +## CONTROL Payload + +```text +control_name +control_version +command-specific bytes +``` + +Remote control frames use typed names and versioned payloads. Local mailbox +sub-command ids remain local-only and must not leak into the remote protocol. +Every `CONTROL` request produces exactly one `CONTROL_REPLY` frame with the +same `sequence`. + +Required remote controls: + +- `UNREGISTER_CALLABLE` +- `PREPARE_REGISTER_CALLABLE` +- `COMMIT_REGISTER_CALLABLE` +- `ABORT_REGISTER_CALLABLE` +- `PREPARE_CALLABLE` +- `ALLOC_REMOTE_BUFFER` +- `FREE_REMOTE_BUFFER` +- `COPY_TO_REMOTE` +- `COPY_FROM_REMOTE` +- `EXPORT_BUFFER` +- `IMPORT_BUFFER` +- `RELEASE_IMPORT` + +Reserved future controls for Remote CommDomain: + +- `COMM_INIT` +- `ALLOC_DOMAIN` +- `RELEASE_DOMAIN` + +The first Remote L3 task-dispatch cut rejects the reserved domain controls +with an unsupported-control reply. They become required only when Remote +CommDomain enters scope. + +The register-family controls are namespace-aware. +`PREPARE_REGISTER_CALLABLE` carries: + +```text +target_namespace: + OUTER_REMOTE_ORCH + INNER_L3_WORKER +callable_kind: + PYTHON_IMPORT + CHIP_CALLABLE + PYTHON_SERIALIZED # optional negotiated extension +cid: int32 +payload_version: uint32 +payload bytes +``` + +Rules: + +- `OUTER_REMOTE_ORCH` registers callables that can be selected by future + parent TASK frames. Only Python callable kinds are valid in this namespace. +- `INNER_L3_WORKER` registers callables on the session runner's + `inner_worker = Worker(level=3)`. Python callables become valid targets for + inner `submit_sub` / recursive orchestration; `CHIP_CALLABLE` follows the + existing prepare/register path for chip workers. +- `PYTHON_IMPORT` payloads carry a bounded UTF-8 `module:qualname` string. +- `PYTHON_SERIALIZED` payloads are produced by the PR #839 callable serializer + described in + [../python-callable-serialization.md](../python-callable-serialization.md). + They are valid only when parent and session negotiate serializer version, + payload limits, Python ABI/runtime compatibility, and dependency/runtime + compatibility. Support is optional; `PYTHON_IMPORT` remains the required + baseline. +- A session that does not advertise a requested callable kind rejects the + control request before installing the cid. +- `CHIP_CALLABLE` payloads carry the ChipCallable blob metadata or a staged blob + reference, depending on transport capability. The remote frame never embeds a + local POSIX shm name from the parent process. +- `COMMIT_REGISTER_CALLABLE` and `ABORT_REGISTER_CALLABLE` carry + `target_namespace`, `callable_kind`, and `cid`. +- `UNREGISTER_CALLABLE` carries the same `target_namespace`, `callable_kind`, + and `cid` so cid cleanup and reuse are scoped to the intended registry. + +Multi-endpoint parent registration uses two-phase visibility: + +1. The parent sends `PREPARE_REGISTER_CALLABLE` to every selected endpoint. + Each endpoint validates the descriptor/payload, checks feature gates, stages + any callable bytes, and records the cid as prepared but not visible to TASK. +2. If every prepare succeeds, the parent sends `COMMIT_REGISTER_CALLABLE` to + every selected endpoint. A successful commit makes the cid visible to later + TASK frames on that endpoint. +3. If any prepare or commit fails, the parent sends `ABORT_REGISTER_CALLABLE` + to endpoints with prepared or uncertain state, keeps the cid invisible, and + marks endpoints failed when their final registry state cannot be proven. + +Unregister creates a parent-side tombstone. The cid cannot be reused until +every selected endpoint has confirmed cleanup, or until any non-responsive +endpoint has been removed from eligibility and marked failed. + +## CONTROL_REPLY Payload + +```text +sequence +control_name +control_version +error_code +error_message_len +error_message bytes +result_bytes_len +result_bytes +``` + +Rules: + +- `sequence` must match one in-flight `CONTROL` request on the same endpoint. +- `control_name` and `control_version` must match the request being answered. +- `error_code=0` means the control succeeded and `result_bytes` contains the + command-specific result, if any. +- Non-zero `error_code` means the remote session did not apply the requested + state change, except for commands whose versioned contract explicitly allows + best-effort partial cleanup. +- `error_message` is bounded UTF-8. It should include remote host, + `endpoint_id`, `control_name`, and `sequence`. +- `result_bytes` uses the same canonical encoding rules as other remote + payloads. For example, `ALLOC_REMOTE_BUFFER` returns a buffer id, + generation, address space, size, and transport-specific export metadata. +- If health expires or the process exits, the parent fabricates a failed + `CONTROL_REPLY` for every in-flight control sequence. + +Control waits obey the same timeout policy as task waits. + +## Ordering + +Each endpoint has one ordered command lane. Runtime state-changing requests use +this lane even when their bulk bytes are staged through a separate data plane. +The remote session observes parent-to-runner requests in increasing `sequence` +order, and state changes caused by a request happen-before later requests on +the same endpoint. Reply frames carry the answered request sequence and become +visible only after the corresponding request side effects. + +The baseline endpoint admits at most one TASK in flight. State-changing CONTROL +frames serialize with that TASK and are not applied concurrently with +`inner_worker.run()`. This keeps callable registry visibility, buffer release, +import release, and domain lifetime consistent with the local one-mailbox +model. + +`HEALTH` frames are not state-changing command-lane requests. The parent uses a +separate health lane, RDMA completion health signal, or transport-native +keepalive so a long-running TASK does not block liveness observation. Health +success does not imply task completion; task completion is only a matching +`COMPLETION` frame. + +All transports must provide the following visibility rules: + +- Task payload writes are visible before the task doorbell. +- Remote output writes are complete before the completion frame is visible. +- Parent completion reads happen before dependent task dispatch. +- Control request payload writes are visible before the control doorbell. +- Control side effects are visible before the matching `CONTROL_REPLY`. +- Register-family controls and `UNREGISTER_CALLABLE` serialize with TASK + dispatch on the same endpoint. A successful commit reply happens-before later + TASK frames that use the cid. A successful unregister reply happens-before + later cid reuse. +- Health messages may be observed while a TASK is running, but they must not + expose or mutate task, callable, buffer, or domain state. + +RoCE and HCCS can satisfy this with SEND/RECV ordering plus explicit flush or +completion requirements. UB LD/ST paths require explicit memory fences around +doorbell and completion state transitions. + +## Bounds and Fuzz Tests + +The frame codec must reject: + +- bad magic or unsupported version; +- payload length larger than configured maximum; +- truncated frame headers or payloads; +- tensor/scalar counts that overflow decoded payload size; +- descriptor offsets outside the referenced handle; +- `HOST_INLINE` payload offsets or lengths outside the inline byte arena; +- non-`HOST_INLINE` descriptors with non-zero inline payload lengths; +- non-zero `ContinuousTensorWire.data` in remote TASK frames; +- stale generations; +- unknown control names or control versions; +- completion sequence mismatch; +- control reply sequence or control-name mismatch. diff --git a/docs/task-flow.md b/docs/task-flow.md index 37c4856b5..b18557fc8 100644 --- a/docs/task-flow.md +++ b/docs/task-flow.md @@ -49,16 +49,25 @@ Opaque 64-bit handle. What it actually is depends on the destination worker: | `w4.submit_next_level(cid, …)` dispatched to an L3 `Worker` child | `int32_t` cid returned by `Worker.register(orch_fn)` (Python orch fn) | `_child_worker_loop` looks up the orch fn in the child's COW registry and calls `inner_worker.run(orch_fn, …)` | | `w3.submit_sub(cid, …)` dispatched to a SUB child | `int32_t` cid indexing the Python callable registry | `_sub_worker_loop` calls `fn(args)` with the decoded `TaskArgs` | -All three paths share one mailbox wire format — the cid is written into -`MAILBOX_OFF_CALLABLE` and the child does the dispatch in its own -address space. +The implemented local paths share one mailbox wire format — the cid is written +into `MAILBOX_OFF_CALLABLE` and the child does the dispatch in its own address +space. The proposed remote L3 path keeps the cid concept, but sends it in a +versioned TASK frame and resolves it against the remote endpoint's registry +after the endpoint has reported `HELLO READY`. ### Lifetime — pre-fork registration -Every concrete `Callable` object (ChipCallable, Python orch fn, sub callable) -**must be registered before any child process is forked**. After fork, the -child inherits these through COW and the uint64 handle dereferences validly -in the child. +For the local fork/shm path, every concrete `Callable` object (ChipCallable, +Python orch fn, sub callable) **must be registered before any child process is +forked**, except for explicit dynamic registration controls. After fork, the +child inherits these through COW and the uint64 handle dereferences validly in +the child. + +For remote L3, the parent cannot rely on COW. Remote callable registration uses +explicit descriptors: required `PYTHON_IMPORT` paths, optional negotiated +PR #839 serialized Python callable payloads, and `CHIP_CALLABLE` payloads for +inner L3 chip work. A remote cid becomes visible only after the selected +endpoint replies success. --- @@ -241,7 +250,10 @@ dispatches to an L3 child, the child process runs `_child_worker_loop`, which looks up the registered orch fn for that cid and calls `inner_worker.run(orch_fn, args, config)` — i.e. the L3 `Worker.run` Python method, not a C++ leaf. The kernel-running leaves stay at L2 -(`ChipWorker`); higher levels just compose more scheduling engines. +(`ChipWorker`); higher levels just compose more scheduling engines. A remote +L3 session runner follows the same execution shape after it has prestarted its +inner L3 Worker, but task/control/completion bytes travel through the remote +framed protocol instead of the local mailbox. --- diff --git a/docs/worker-manager.md b/docs/worker-manager.md index 0be0480ea..dd9606fe8 100644 --- a/docs/worker-manager.md +++ b/docs/worker-manager.md @@ -1,15 +1,21 @@ # Worker Manager — Pool, Threading, and Dispatch `WorkerManager` and `WorkerThread` together implement the **execution layer** -of a `Worker` engine. `WorkerManager` owns two pools of `WorkerThread`s (one -for next-level workers, one for sub workers); each `WorkerThread` drives a -shared-memory mailbox that a forked Python child consumes — the child runs -the real worker (a `ChipWorker` for NEXT_LEVEL, a Python callable for SUB) -in its own address space. +of a `Worker` engine. In today's local implementation, `WorkerManager` owns two +pools of `WorkerThread`s (one for next-level workers, one for sub workers); +each `WorkerThread` drives a shared-memory mailbox that a forked Python child +consumes — the child runs the real worker (a `ChipWorker` for NEXT_LEVEL, a +Python callable for SUB) in its own address space. + +The proposed remote L3 design keeps this local fork/shm path behind a +`LocalMailboxEndpoint` and adds a framed `RemoteL3Endpoint` for cross-host +NEXT_LEVEL children. A remote endpoint is not another child loop that polls the +4096-byte mailbox; it uses the contracts in +[remote-l3-worker-design.md](remote-l3-worker-design.md). For the high-level role of this layer among the three engine components, see [hierarchical_level_runtime.md](hierarchical_level_runtime.md). For what -runs on the other side of the mailbox, see [task-flow.md](task-flow.md). +runs on the other side of the local mailbox, see [task-flow.md](task-flow.md). For where dispatched tasks come from, see [scheduler.md](scheduler.md). --- @@ -193,21 +199,23 @@ the C++ dispatcher thread — it does not own the child process. --- -## 4. Adding a new worker kind +## 4. Local vs. Remote Endpoints -To add a new worker type (e.g., a RemoteWorker over RPC): +The mailbox protocol is the local endpoint contract. Adding another local +forked worker kind still follows the existing pattern: -1. Define the kernel-running entry point (a C++ class with a `run` method - or a Python callable — there is no abstract interface to inherit from). -2. Write a child-process loop (mirroring `_chip_process_loop` or - `_sub_worker_loop`) that polls the mailbox, decodes the args blob, and - invokes that entry point. -3. Register the per-child mailbox via `manager.add_next_level(mailbox)` - or `manager.add_sub(mailbox)`. +1. Define the worker entry point. +2. Write a child-process loop that polls the mailbox, decodes the args blob, + and invokes that entry point. +3. Register the mailbox via `manager.add_next_level(mailbox)` or + `manager.add_sub(mailbox)`. -The parent side (WorkerManager / WorkerThread) doesn't change — it -only knows the mailbox protocol, not who runs the kernel on the other -end. +Remote L3 is different. It cannot reuse the mailbox wire format because the +remote side does not share virtual addresses, fork-time COW registries, POSIX +shm names, or parent-visible child PIDs. The remote design introduces a +transport-neutral endpoint under `WorkerThread`: `LocalMailboxEndpoint` wraps +this local mailbox path, while `RemoteL3Endpoint` sends framed TASK, CONTROL, +COMPLETION, HEALTH, and SHUTDOWN messages over the negotiated transport. ### 4.1 Nested fork ordering (L4+ Worker children)