Add remote L3 worker #866

Open: puddingfjz wants to merge 2 commits into hw-native-sys:main from puddingfjz:remote-l3-worker-design

@puddingfjz
Contributor

Summary

  • Add a Remote L3 worker design for scheduling remote Worker(level=3) instances as L4 NEXT_LEVEL children while preserving the existing Orchestrator/Scheduler DAG model
  • Introduce a WorkerEndpoint boundary so local fork/shm children and remote L3 sessions share the same run/control/shutdown contract
  • Define fork-safe remote daemon/session startup, including runner prestart before HELLO READY and before any transport or health threads handle task traffic
  • Specify remote callable routing with separate outer remote-orch and inner L3 cid namespaces, transactional register/commit/abort, unregister tombstones, and unified register(callable, workers=...) API semantics
  • Define remote TaskArgs wire representation with canonical frame encoding, remote tensor sidecars, tag consumption before dispatch, explicit remote buffer handles, and first-cut rejection of null remote OUTPUT tensors
  • Define failure semantics for explicit success/failure completion, endpoint failure, downstream dependency poisoning, timeout handling, and cleanup
  • Specify HCOMM-backed steady-state control/data communication after bootstrap, including HCOMM RPC command lanes, HCOMM data adapter profiles for A2 RoCE/A3 HCCS/A5 UB, and independent health signaling
  • Keep Remote CommDomain and possible CommContext ABI changes as later work while reserving domain-related controls for future extension
  • Provide a staged implementation plan and test matrix covering local adapter compatibility, frame validation, remote simulation, callable registration, remote buffers, health, and hardware-gated HCOMM profiles
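
To make the WorkerEndpoint boundary and the failure semantics above more concrete, here is a minimal Python sketch. It is illustrative only and not taken from the design doc: the method signatures, the reply dictionary shape, the `LocalEndpoint` stand-in, and the `run_chain` helper are all assumptions. It shows one contract (run/control/shutdown) that a local child and a remote L3 session could both satisfy, an explicit success/failure reply, and poisoning of downstream steps after a failure.

```python
from __future__ import annotations

from typing import Any, Callable, Protocol


class WorkerEndpoint(Protocol):
    """Hypothetical shape of the shared run/control/shutdown contract."""

    def run(self, fn: Callable[..., Any], *args: Any) -> dict[str, Any]: ...
    def control(self, command: str) -> None: ...
    def shutdown(self) -> None: ...


class LocalEndpoint:
    """Stand-in for a local child; a remote session would expose the same methods."""

    def __init__(self) -> None:
        self.alive = True
        self.commands = []

    def run(self, fn, *args):
        # Endpoint failure is reported explicitly instead of raising.
        if not self.alive:
            return {"ok": False, "error": "endpoint is shut down"}
        try:
            return {"ok": True, "value": fn(*args)}
        except Exception as exc:  # task failure becomes an explicit failure reply
            return {"ok": False, "error": repr(exc)}

    def control(self, command):
        self.commands.append(command)

    def shutdown(self):
        self.alive = False


def run_chain(endpoint: WorkerEndpoint, steps):
    """Run steps in dependency order; once one fails, poison everything after it."""
    states = []
    poisoned = False
    for fn, args in steps:
        if poisoned:
            states.append("poisoned")
            continue
        reply = endpoint.run(fn, *args)
        states.append("ok" if reply["ok"] else "failed")
        poisoned = not reply["ok"]
    return states


steps = [(int, ("7",)), (int, ("x",)), (int, ("9",))]
print(run_chain(LocalEndpoint(), steps))
```

Because the scheduler-side helper only sees the endpoint contract, swapping the local stand-in for a remote session would not change the DAG logic in this sketch.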

Testing

  • Docs only; no tests run
  • git diff --check passed

- Document remote L3 NEXT_LEVEL worker architecture, protocol, and buffer lifetime contracts
- Describe A2 RoCE, A3 HCCS, and A5 UB transport expectations
- Link the proposal from hierarchical runtime, task flow, and worker manager docs
- Adopt HCOMM-backed control and data adapter wording for Remote L3
- Keep Remote CommDomain as later work and reserve domain controls
- Use unified register(callable, workers=...) for RemoteCallable descriptors
- Reference the Python callable serialization contract from PR 839
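
As a rough illustration of the transactional register/commit/abort flow and unregister tombstones mentioned above, the following Python sketch models a single cid namespace. Everything here is hypothetical: the `CallableRegistry` class, its method names other than `register(callable, workers=...)`, and the tombstone behavior (ids are never reused, and lookups of retired ids fail) are assumptions, not the API defined in the design doc.

```python
class CallableRegistry:
    """Hypothetical single-namespace registry with two-phase registration."""

    def __init__(self):
        self._next_cid = 0
        self._pending = {}       # registered but not yet committed
        self._committed = {}     # visible to dispatch
        self._tombstones = set() # retired ids, kept so they are never reused

    def register(self, fn, workers):
        # Phase one: reserve a fresh cid; it is not dispatchable yet.
        cid = self._next_cid
        self._next_cid += 1
        self._pending[cid] = (fn, tuple(workers))
        return cid

    def commit(self, cid):
        # Phase two: make the pending entry visible.
        self._committed[cid] = self._pending.pop(cid)

    def abort(self, cid):
        # Roll back a pending registration and retire its cid.
        self._pending.pop(cid)
        self._tombstones.add(cid)

    def unregister(self, cid):
        # Remove a committed entry and leave a tombstone behind.
        del self._committed[cid]
        self._tombstones.add(cid)

    def lookup(self, cid):
        if cid in self._tombstones:
            raise LookupError(f"cid {cid} is tombstoned")
        if cid not in self._committed:
            raise LookupError(f"cid {cid} is not committed")
        return self._committed[cid]


reg = CallableRegistry()
cid = reg.register(len, workers=["w0", "w1"])
reg.commit(cid)
fn, workers = reg.lookup(cid)
print(fn("abc"), workers)
reg.unregister(cid)
```

The tombstone set is what lets a late task that still carries a retired cid fail with a clear error rather than silently resolving to a newer callable.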
