## Goal

Provide typed request-response service definitions layered on top of Zenoh's query/reply primitive. Services are Rust traits with typed request and response types, deadlines, and structured errors. If Zenoh replaces NATS + GraphQL as the edge-to-central transport, this provides the typed service contracts that GraphQL currently offers.

## Zenoh Primitives Used

- session.get() / session.declare_queryable(): The underlying transport. A query sends a request payload, a queryable responds. Multiple queryables can respond to wildcard queries (scatter-gather).
- Liveliness tokens (session.liveliness().declare_token()): Already provide presence detection — a service advertises itself via a token, and clients discover available services via a liveliness subscription. No custom discovery protocol needed.
- Attachments: Key-value metadata on messages. Used for deadline propagation (remaining timeout budget) and method routing (which operation on the service is being called).
- TypedQueryable<Req, Resp> from M6: The typed wrapper this milestone builds on. M7 adds service-level concerns (method dispatch, deadlines, error types, discovery) on top of M6's type-safe query/reply.

The gap: query/reply is a raw byte-in/byte-out primitive. Every service interaction requires manual serialization, manual error handling, manual timeout management, and manual method routing (encoding the operation name in the payload or key expression). This doesn't scale across 11+ vendor drivers and multiple service boundaries.

## What Changes

1. **Service trait pattern**: Define services as Rust traits with typed methods. No proc macros initially — users implement the trait manually, and helper functions handle the wiring. A proc macro DSL is a follow-up once the API stabilizes.
```rust
#[async_trait]
trait DeviceConfigService {
    async fn get_config(&self, req: GetConfigRequest) -> Result<DeviceConfig, ServiceError>;
    async fn push_firmware(&self, req: PushFirmwareRequest) -> Result<FirmwareResult, ServiceError>;
}
```

2. **ServiceServer<S>**: Wraps a trait implementation plus a TypedQueryable. Dispatches incoming queries to the correct method based on a method identifier in the query attachment. Handles deserialization, method routing, and error serialization.
3. **ServiceClient<S>**: Wraps a TypedQuerier. Provides typed method calls that serialize the request, set the deadline attachment, send the query, and deserialize the response (or error).
4. **Deadline propagation**: The client sets a deadline (Duration) on the call, transmitted as an attachment. The server checks the remaining budget before processing. If the budget has expired, the server returns a DeadlineExceeded error without doing work.
5. **Structured errors**: A ServiceError enum with variants (NotFound, InvalidRequest, Internal, DeadlineExceeded, custom). Serialized in the reply payload, with a status code attachment for fast-path error detection without deserializing the body.
6. **Service discovery via liveliness**: ServiceServer declares a liveliness token on a well-known key (e.g. @services/{service_name}/{instance_id}). ServiceClient discovers available instances via a liveliness subscription. No custom discovery protocol.

## Why This Ordering

Depends on M6 (TypedQueryable/TypedQuerier are the building blocks). Does not depend on milestones 1-5, but benefits from M5 (Namespace Isolation) for per-tenant service routing — tenant-scoped key expressions mean each tenant's services are isolated by default.

## Branch

zenoh/typed-rpc
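Points 4 and 5 can be illustrated with a minimal std-only sketch. The numeric status codes, attachment encoding, and helper names here are assumptions for illustration, not the milestone's actual API; a real implementation would carry these values in Zenoh attachments.

```rust
use std::time::Duration;

/// Structured service errors (point 5). The u8 codes are hypothetical;
/// they stand in for the status-code attachment that lets a client
/// detect errors without deserializing the reply body.
#[derive(Debug, Clone, PartialEq)]
enum ServiceError {
    NotFound,
    InvalidRequest,
    Internal,
    DeadlineExceeded,
    Custom(u8),
}

impl ServiceError {
    fn status_code(&self) -> u8 {
        match self {
            ServiceError::NotFound => 1,
            ServiceError::InvalidRequest => 2,
            ServiceError::Internal => 3,
            ServiceError::DeadlineExceeded => 4,
            ServiceError::Custom(c) => *c,
        }
    }

    fn from_status_code(code: u8) -> Self {
        match code {
            1 => ServiceError::NotFound,
            2 => ServiceError::InvalidRequest,
            3 => ServiceError::Internal,
            4 => ServiceError::DeadlineExceeded,
            c => ServiceError::Custom(c),
        }
    }
}

/// Deadline propagation (point 4): the client encodes its remaining
/// budget into the attachment value (here, as decimal milliseconds).
fn encode_deadline(budget: Duration) -> String {
    budget.as_millis().to_string()
}

/// Server-side check: elapsed transit/queue time is subtracted from the
/// transmitted budget; None means reply DeadlineExceeded without work.
fn remaining_budget(attachment: &str, elapsed: Duration) -> Option<Duration> {
    let budget_ms: u64 = attachment.parse().ok()?;
    Duration::from_millis(budget_ms).checked_sub(elapsed)
}

fn main() {
    let att = encode_deadline(Duration::from_millis(500));
    assert_eq!(
        remaining_budget(&att, Duration::from_millis(100)),
        Some(Duration::from_millis(400))
    );
    // Budget exhausted: the server short-circuits with DeadlineExceeded.
    assert_eq!(remaining_budget(&att, Duration::from_millis(600)), None);
    let e = ServiceError::from_status_code(ServiceError::DeadlineExceeded.status_code());
    assert_eq!(e, ServiceError::DeadlineExceeded);
}
```

The round-trip through `status_code`/`from_status_code` is what makes the fast-path error check possible: the client inspects one attachment byte before touching the payload.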
No due date • 25/25 issues closed

## Goal

Add compile-time typed payload wrappers and runtime version tagging to prevent silent data corruption when edge and central disagree on the payload format. zenoh-ext already has ZSerializer/ZDeserializer with Serialize/Deserialize traits — this milestone builds typed publisher/subscriber/queryable wrappers on top of that foundation, and uses Zenoh's existing Encoding field for lightweight version negotiation.

## Zenoh Primitives Used

- ZSerializer/ZDeserializer (zenoh-ext/src/serialization.rs): Already provide binary serialization following the Zenoh serialization RFC. Serialize/Deserialize traits are implemented for primitives, collections, and tuples.
- Encoding (zenoh-protocol/src/core/encoding.rs): Already carried on every Sample. Supports custom encoding IDs and schema strings. Currently used for content-type hints (JSON, Protobuf, etc.) but not enforced.
- Publisher/Subscriber/Queryable APIs: The typed wrappers compose around these existing types — no new wire messages, no protocol changes.

The gap: nothing connects these pieces. A publisher can set Encoding::APPLICATION_JSON and send garbage bytes. A subscriber has no compile-time guarantee about what type it will receive. Version mismatches between edge agent v2 and a central expecting v3 are undetectable until a downstream system breaks.

## What Changes

1. **TypedPublisher<T>/TypedSubscriber<T>**: Generic wrappers that serialize/deserialize payloads using ZSerializer. TypedPublisher<T> only accepts T values; TypedSubscriber<T> only yields T values (or errors). Compile-time safety — a wrong type is a compiler error, not a runtime crash.
2. **SchemaVersion attachment**: A lightweight version tag sent as a Zenoh attachment on each publication. Receivers check the version before deserializing. Fast rejection path: if the version doesn't match, skip deserialization entirely and surface a typed error.
3. **TypedQueryable<Req, Resp>/TypedQuerier<Req, Resp>**: The same pattern for query/reply — typed request and response. This becomes the foundation for M7 (Typed RPC).
4. **Integration tests**: Typed pub/sub round-trip, version mismatch detection, and backward compatibility (a typed subscriber receiving from an untyped publisher should degrade gracefully).

## Why This Ordering

No dependencies on milestones 1-5. Can run in parallel. Ordered 6th because it's an application-layer concern — the core data flow (ack_put → event keys → replication → cursors → namespace isolation) is more critical for the onexos-edge migration. However, M6 directly enables M7 (the typed queryable/querier wrappers are the building blocks for typed RPC services).

## Branch

zenoh/schema-enforcement
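A std-only sketch of the version-checked publish/receive path described in points 1 and 2. The `Payload` trait stands in for zenoh-ext's Serialize/Deserialize, and the version byte is inlined into the wire bytes for simplicity; in the real design it would travel as an attachment, with ZSerializer doing the serialization.

```rust
/// Stand-in for zenoh-ext's Serialize/Deserialize traits.
trait Payload: Sized {
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(b: &[u8]) -> Option<Self>;
}

#[derive(Debug, PartialEq)]
struct Temperature(u32);

impl Payload for Temperature {
    fn to_bytes(&self) -> Vec<u8> {
        self.0.to_le_bytes().to_vec()
    }
    fn from_bytes(b: &[u8]) -> Option<Self> {
        Some(Temperature(u32::from_le_bytes(b.try_into().ok()?)))
    }
}

#[derive(Debug, PartialEq)]
enum RecvError {
    VersionMismatch { expected: u8, got: u8 },
    Deserialize,
}

/// TypedPublisher-side: only a T can be published, and the schema
/// version is attached to every publication.
fn publish<T: Payload>(version: u8, value: &T) -> Vec<u8> {
    let mut wire = vec![version];
    wire.extend(value.to_bytes());
    wire
}

/// TypedSubscriber-side fast rejection path: the version is checked
/// before any deserialization work happens.
fn receive<T: Payload>(expected: u8, wire: &[u8]) -> Result<T, RecvError> {
    match wire.split_first() {
        Some((&got, _)) if got != expected => {
            Err(RecvError::VersionMismatch { expected, got })
        }
        Some((_, body)) => T::from_bytes(body).ok_or(RecvError::Deserialize),
        None => Err(RecvError::Deserialize),
    }
}

fn main() {
    let wire = publish(2, &Temperature(21));
    assert_eq!(receive::<Temperature>(2, &wire), Ok(Temperature(21)));
    // Edge agent at v2, central expecting v3: a typed error, not garbage.
    assert_eq!(
        receive::<Temperature>(3, &wire),
        Err(RecvError::VersionMismatch { expected: 3, got: 2 })
    );
}
```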
No due date • 20/20 issues closed

## Goal

Strengthen Zenoh's existing Namespace mechanism from a lightweight prefix-rewriting layer into a robust isolation boundary suitable for multi-tenant deployments. Zenoh already has Namespace and ENamespace structs (namespace.rs) that prepend prefixes on egress and strip/block on ingress. This milestone hardens that mechanism.

## Zenoh Primitives Used

- Namespace (namespace.rs): Already prepends the namespace prefix to all egress messages.
- ENamespace (namespace.rs): Already strips the namespace prefix on ingress, blocks messages that don't match, and tracks blocked subscribers/queryables/tokens in HashSets.
- ACL system: Subject model (cert CN, username, interface, ZID) with per-key-expression Allow/Deny trees.
- Interceptor chain: The message pipeline where ACL is already enforced.

The existing Namespace provides real isolation — ENamespace.handle_namespace_ingress() returns false for non-matching messages, which drops them. The gap is operational: no resource limits, no monitoring, and ACL rules must be manually configured to prevent cross-namespace access.

## What Changes

1. **Default-deny across namespaces**: When namespaces are configured, automatically generate ACL deny rules preventing cross-namespace message flow. Currently a misconfigured ACL can leak data across namespaces — this change makes isolation the default.
2. **Namespace-scoped admin status**: Expose per-namespace metrics via the admin space — connection count, message rate, storage usage. Uses the existing admin queryable infrastructure.
3. **Connection limits per namespace**: A configurable max_connections per namespace, enforced at face creation time. Prevents one tenant from exhausting router resources.
4. **Documentation**: A configuration guide showing how to set up multi-tenant isolation using namespaces + ACLs + TLS cert CN matching.

## Why This Ordering

No dependencies on other milestones. Ordered last because it's the least critical for the core data flow (ack_put → event keys → replication → cursors). Useful for production hardening.

## Branch

zenoh/namespace-isolation
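The per-namespace connection limit (point 3) reduces to a counter checked at admission time. A minimal sketch, assuming a flat counter keyed by namespace name; the struct and method names are hypothetical, and the real enforcement point would be face creation in the router:

```rust
use std::collections::HashMap;

/// Hypothetical per-namespace connection limiter.
struct NamespaceLimits {
    max_connections: usize,
    open: HashMap<String, usize>,
}

impl NamespaceLimits {
    fn new(max_connections: usize) -> Self {
        Self { max_connections, open: HashMap::new() }
    }

    /// Admit a new connection for `namespace`, or reject it once the
    /// tenant has reached its quota.
    fn try_open(&mut self, namespace: &str) -> bool {
        let count = self.open.entry(namespace.to_string()).or_insert(0);
        if *count >= self.max_connections {
            false
        } else {
            *count += 1;
            true
        }
    }

    fn close(&mut self, namespace: &str) {
        if let Some(count) = self.open.get_mut(namespace) {
            *count = count.saturating_sub(1);
        }
    }
}

fn main() {
    let mut limits = NamespaceLimits::new(2);
    assert!(limits.try_open("tenant-a"));
    assert!(limits.try_open("tenant-a"));
    assert!(!limits.try_open("tenant-a")); // quota exhausted for tenant-a
    assert!(limits.try_open("tenant-b")); // other tenants unaffected
    limits.close("tenant-a");
    assert!(limits.try_open("tenant-a")); // slot freed
}
```

The point of enforcing this at face creation, as the milestone states, is that a rejected tenant never enters the message pipeline at all, so one tenant's connection storm cannot degrade routing for the others.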
No due date • 19/19 issues closed

## Goal

Provide a zenoh-ext library for client-side cursor management — the mechanism by which consumers track their position in an event key stream and resume from where they left off after a disconnection.

## Why Client-Side, Not Server-Side

Zenoh routers are designed to be stateless relays that run on constrained devices (256KB RAM in some deployments). Per-subscriber cursor state on the router contradicts Zenoh's core principle of supporting extremely constrained devices. Zenoh's pub/sub path is one-way by design — no subscriber-to-router feedback channel exists in the wire protocol.

Client-side cursors use HLC timestamps as the universal ordering. The client persists its last-processed HLC timestamp locally (a file, SQLite, or a Zenoh storage key via ack_put). On reconnect, it queries get(key_expr, "_time_range=T..") to resume. No router state, no protocol changes, no new wire messages.

## Zenoh Primitives Used

- HLC timestamps: Already provide a total order over all mutations system-wide.
- session.get() with a time_range parameter: Already supported for storage queries.
- Attachments: The HLC timestamp is already present on every Sample.
- ack_put (from milestone 1): For persisting the cursor bookmark durably.

## What Changes

1. **CursorBookmark struct**: Wraps (key_expr, last_processed_hlc) with serde for persistence.
2. **EventSubscriber**: Wraps AdvancedSubscriber with cursor tracking. On each received sample, it updates the in-memory cursor position. Periodically (or on an explicit flush) it persists the cursor via ack_put to a well-known key (e.g. @cursors/{consumer_name}/{key_expr_hash}).
3. **Resume-from-cursor**: On startup, loads the persisted cursor, issues get(key_expr, "_time_range=cursor..") for catch-up, then switches to a live subscription. Delivers catch-up events before live events to maintain ordering.
4. **Cursor status API**: cursor_position() -> HlcTimestamp, time_since_last_flush() -> Duration.

## Why This Ordering

Depends on milestone 1 (ack_put) for durable cursor persistence. The cursor bookmark is persisted via ack_put to ensure the storage confirms the write before the client considers the cursor advanced.

## Branch

zenoh/client-cursor
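The bookmark and the resume ordering above can be sketched with std only. HLC timestamps are simplified to u64, the string encoding stands in for serde, and the two event slices stand in for the time-range catch-up query and the live subscription; all names are illustrative:

```rust
/// Stand-in for the milestone's CursorBookmark; real code would use
/// serde and persist via ack_put rather than this string encoding.
#[derive(Debug, PartialEq)]
struct CursorBookmark {
    key_expr: String,
    last_processed_hlc: u64, // HLC timestamp, simplified to u64 here
}

impl CursorBookmark {
    fn encode(&self) -> String {
        format!("{}\n{}", self.key_expr, self.last_processed_hlc)
    }
    fn decode(s: &str) -> Option<Self> {
        let (key_expr, hlc) = s.split_once('\n')?;
        Some(Self {
            key_expr: key_expr.to_string(),
            last_processed_hlc: hlc.parse().ok()?,
        })
    }
}

/// Resume-from-cursor ordering: catch-up events (from the time-range
/// query) are delivered before live events, already-seen timestamps
/// are skipped, and the cursor advances past every delivered event.
fn resume(cursor: &mut CursorBookmark, catch_up: &[u64], live: &[u64]) -> Vec<u64> {
    let mut delivered = Vec::new();
    for &ts in catch_up.iter().chain(live) {
        if ts > cursor.last_processed_hlc {
            delivered.push(ts);
            cursor.last_processed_hlc = ts;
        }
    }
    delivered
}

fn main() {
    let mut cursor = CursorBookmark {
        key_expr: "devices/42/events/**".to_string(),
        last_processed_hlc: 100,
    };
    // 90 is at or before the bookmark and is skipped; the rest arrive in order.
    let delivered = resume(&mut cursor, &[90, 110, 120], &[130]);
    assert_eq!(delivered, vec![110, 120, 130]);
    assert_eq!(cursor.last_processed_hlc, 130);
    // Persist/restore round-trip.
    let restored = CursorBookmark::decode(&cursor.encode()).unwrap();
    assert_eq!(restored, cursor);
}
```

Delivering catch-up strictly before live is what preserves the HLC ordering guarantee the section relies on: a consumer never observes an event with a timestamp earlier than one it has already processed.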
No due date • 39/39 issues closed

## Goal

When using the unique-key event pattern at scale (thousands of events during an offline period), the existing replication system must handle potentially large divergence sets efficiently on reconnection. This milestone ensures that the alignment protocol performs well under high-volume catch-up scenarios and that the digest/bloom filter infrastructure is tuned for event-heavy workloads.

## Zenoh Primitives Used

The existing replication module (LogLatest, digest exchange, bloom filters, alignment protocol) works correctly for unique-key events — each event is a distinct key, History::Latest applies, one event per key. No architectural changes are needed. But operational tuning and possibly batching optimizations are needed for reconnection scenarios where 10,000+ event keys need to sync.

## Replication Architecture (confirmed by audit)

- LogLatest stores one Event per key (enforced by assert_only_one_event_per_key_expr).
- Fingerprint = xxhash3(key_expr + hlc_timestamp), XOR-chained up through SubInterval -> Interval -> Era.
- Alignment protocol: Digest diff -> Intervals -> SubIntervals -> EventsMetadata -> Retrieval.
- Hot/Warm/Cold eras provide hierarchical divergence detection.
- Bloom filters provide fast key membership checks.

All of this works with unique-key events. Each event key is a distinct entry in LogLatest. The concern is performance, not correctness.

## What Changes

1. **Benchmark**: Measure alignment performance with 1K, 10K, and 100K divergent event keys. Identify bottlenecks in the Retrieval phase (likely N sequential queries).
2. **Batch retrieval** (if needed): The current alignment requests events one at a time via AlignmentQuery::Events. For large catch-ups, batch retrieval would reduce round-trips.
3. **Bloom filter sizing**: The current bloom filter is ~5MB per storage (noted as a TODO in log.rs:346). With event keys, the false positive rate may need tuning for larger key sets.
4. **Hot era tuning**: The default hot era is 6 intervals (60 seconds). For reconnection after hours, most events are in the cold era — a single XOR fingerprint. If cold-era fingerprints differ, alignment must request ALL cold-era events. Consider configurable cold-era sub-grouping for large deployments.

## Why This Ordering

Depends on milestone 2 (event key GC) conceptually — the event key pattern must be established before tuning replication for it. Practically, it can be built in parallel.

## Branches

- **`zenoh/replication-volume`** — full working branch (includes benchmarks + features)
- **`zenoh/repl-vol`** — upstream-ready branch (features only, cherry-picked from main, no benchmarks)

Use `zenoh/repl-vol` for PRs to upstream zenoh. Keep `zenoh/replication-volume` for internal regression testing.
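The XOR-chained fingerprint scheme described under "Replication Architecture" can be demonstrated in a few lines. This sketch substitutes std's DefaultHasher for xxhash3 and collapses the SubInterval/Interval/Era hierarchy into a single `combine` step; it only illustrates the two properties the alignment protocol relies on:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Per-event fingerprint over (key_expr, hlc_timestamp). The real
/// implementation uses xxhash3; DefaultHasher stands in here.
fn event_fingerprint(key_expr: &str, hlc_timestamp: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key_expr.hash(&mut h);
    hlc_timestamp.hash(&mut h);
    h.finish()
}

/// XOR-chaining a level's fingerprints. XOR is commutative and
/// associative, so two replicas holding the same event set compute the
/// same digest regardless of arrival order.
fn combine(fingerprints: impl IntoIterator<Item = u64>) -> u64 {
    fingerprints.into_iter().fold(0, |acc, f| acc ^ f)
}

fn main() {
    let a = event_fingerprint("devices/42/events/1001", 1001);
    let b = event_fingerprint("devices/42/events/1002", 1002);
    let c = event_fingerprint("devices/7/events/1003", 1003);

    // Property 1: order independence — same set, same digest.
    assert_eq!(combine([a, b, c]), combine([c, a, b]));

    // Property 2: a missing event flips the digest (triggering
    // alignment), and XOR-ing the two digests isolates the divergence.
    let full = combine([a, b, c]);
    let missing_b = combine([a, c]);
    assert_ne!(full, missing_b);
    assert_eq!(full ^ missing_b, b);
}
```

Property 2 is also why the cold-era concern in point 4 exists: one XOR fingerprint over the whole cold era says *that* the eras diverge, but not *which* events diverge, so alignment must fall back to requesting everything in that era.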
No due date • 29/29 issues closed

## Goal

Support the unique-key modeling pattern where events are encoded as unique keys (e.g. devices/42/events/{hlc_timestamp}) rather than as multiple values on the same key. This pattern preserves complete event history using History::Latest — no History::All is needed, and existing replication works unmodified. But unique event keys accumulate and need lifecycle management.

## Zenoh Primitives Used

Zenoh already has GarbageCollectionConfig in the storage service with a configurable period and lifespan. Tombstone cleanup already runs on a timer. This milestone extends that mechanism to handle prefix-scoped expiration for event keys specifically, since event keys have different retention requirements than entity state keys.

## The Data Modeling Pattern This Enables

Instead of:

put("device/42/status", "offline") — overwrites previous, intermediate state lost

Applications do:

put("device/42/status", "offline") — current state (LWW)
put("device/42/events/{hlc_ts}", "status=offline") — event record (unique key)

Both are History::Latest. Both replicate via existing replication unmodified. Central gets current state AND complete event history. Unique event keys need GC after their retention window; entity state keys do not.

## What Changes

1. **Per-prefix GC configuration**: Allow the storage config to specify different retention rules for key expression subtrees (e.g. **/events/** keys expire after 48 hours, entity keys never expire).
2. **GC awareness of event key patterns**: The existing GarbageCollectionEvent timer checks key timestamps against the lifespan. Extend it to support per-prefix lifespan overrides.

## Why This Ordering

Depends on nothing. Could be built in parallel with milestone 1. Ordered second because milestone 1 is smaller and validates the workflow first.

## Branch

zenoh/event-key-gc
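The per-prefix retention rule above can be sketched as a lookup the GC timer consults per key. This is a hypothetical std-only model: plain string-prefix matching stands in for Zenoh key-expression matching, and `GcPolicy` is not the milestone's actual config type:

```rust
use std::time::Duration;

/// Hypothetical per-prefix retention table: the most specific matching
/// prefix wins; a lifespan of None means the key never expires.
struct GcPolicy {
    rules: Vec<(String, Option<Duration>)>,
    default_lifespan: Option<Duration>,
}

impl GcPolicy {
    fn lifespan_for(&self, key: &str) -> Option<Duration> {
        self.rules
            .iter()
            .filter(|(prefix, _)| key.starts_with(prefix))
            .max_by_key(|(prefix, _)| prefix.len()) // most specific prefix wins
            .map(|(_, lifespan)| *lifespan)
            .unwrap_or(self.default_lifespan)
    }

    /// The GC timer's per-key check: expire when the key's age exceeds
    /// the lifespan chosen for its prefix.
    fn is_expired(&self, key: &str, age: Duration) -> bool {
        match self.lifespan_for(key) {
            Some(lifespan) => age > lifespan,
            None => false, // entity state keys never expire
        }
    }
}

fn main() {
    // Event keys expire after 48 hours; everything else is kept forever.
    let policy = GcPolicy {
        rules: vec![(
            "device/42/events/".to_string(),
            Some(Duration::from_secs(48 * 3600)),
        )],
        default_lifespan: None,
    };
    let age = Duration::from_secs(72 * 3600); // 72 hours old
    assert!(policy.is_expired("device/42/events/1234", age)); // event key: collected
    assert!(!policy.is_expired("device/42/status", age)); // entity key: kept
}
```

The longest-prefix rule matters because subtrees nest: a broad `**` default must not override a specific `**/events/**` retention window.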
No due date • 7/7 issues closed

## Goal

Enable publishers to receive confirmation that a storage backend has durably written their data. This is the foundation for reliable store-and-forward patterns where a local outbox (e.g. SurrealDB) must not delete an entry until the messaging layer confirms persistence.

## Zenoh Primitives Used

Zenoh's query/reply path (session.get() -> queryable -> reply) already provides synchronous request-response semantics. The storage manager plugin already registers a queryable on every configured storage's key expression. StorageInsertionResult (Inserted/Replaced/Outdated/Deleted) is already returned by every Storage::put() call — but discarded. This milestone wires those existing primitives together.

## What Changes

Two small additions, both using existing Zenoh interfaces:

1. **zenoh-ext helper** (~50 lines): ack_put(session, key_expr, payload) -> ZResult<()> — wraps session.get() with the _ack_put=true parameter and the payload, and parses the reply.
2. **Storage service handler** (~40 lines): In StorageService::reply_query(), detect the _ack_put/_ack_delete query parameters, call Storage::put()/delete(), and serialize the StorageInsertionResult into the reply payload.

No protocol changes. No new wire messages. No router modifications.

## Why This Ordering

This milestone has zero dependencies and unblocks milestone 4 (Client Cursor Library), which uses ack_put for the cursor bookmark persistence pattern. It's also the smallest change — good for validating the fork workflow.

## Branch

zenoh/ack-put
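The client side of the helper reduces to parsing the storage's result out of the reply and deciding whether the outbox entry may be deleted. A sketch of that decision, assuming a simple string wire encoding (the encoding and the `write_confirmed` policy are assumptions of this sketch, not the milestone's actual format):

```rust
/// Mirrors the StorageInsertionResult variants named in the milestone.
#[derive(Debug, PartialEq)]
enum StorageInsertionResult {
    Inserted,
    Replaced,
    Outdated,
    Deleted,
}

/// Parse the ack reply payload (hypothetical string encoding).
fn parse_ack_reply(payload: &str) -> Option<StorageInsertionResult> {
    match payload {
        "inserted" => Some(StorageInsertionResult::Inserted),
        "replaced" => Some(StorageInsertionResult::Replaced),
        "outdated" => Some(StorageInsertionResult::Outdated),
        "deleted" => Some(StorageInsertionResult::Deleted),
        _ => None,
    }
}

/// The store-and-forward decision: the local outbox entry may only be
/// deleted once the storage reports the write was durably applied.
fn write_confirmed(result: &StorageInsertionResult) -> bool {
    matches!(
        result,
        StorageInsertionResult::Inserted | StorageInsertionResult::Replaced
    )
}

fn main() {
    let ack = parse_ack_reply("inserted").unwrap();
    assert!(write_confirmed(&ack)); // safe to delete the outbox entry
    // Outdated means a newer write already won (LWW); this sketch treats
    // it as not-applied and leaves the policy to the caller.
    assert!(!write_confirmed(&StorageInsertionResult::Outdated));
    assert_eq!(parse_ack_reply("garbage"), None); // malformed reply: no ack
}
```

Note the failure direction: a missing or unparseable reply never counts as confirmation, so the outbox retries rather than silently dropping data.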
No due date • 17/17 issues closed