feat(timestamp-stack): add routing timestamp instrumentation #30

Open
YuanYuYuan wants to merge 6 commits into ZettaScaleLabs:feat/routing-timestamps from YuanYuYuan:feat/routing-timestamps

@YuanYuYuan commented May 29, 2026

Summary

Adds opt-in timestamp instrumentation for measuring end-to-end message latency in Zenoh. Messages can carry a TsStack wire extension that accumulates Interception records at up to three points along a message's path: Send, Route, and Receive.

The feature is gated entirely behind #[cfg(feature = "unstable")]. Uninstrumented messages pay no overhead: push_ts_interception is a no-op when ext_ts_stack is None.

Key Changes

Wire protocol (zenoh-protocol, zenoh-codec)

  • New TsStackType extension (ID 0x7) added to Push, Request, and Response messages
  • WireTimestampStack carries conf_flags (which points are active) + an ordered Vec<Interception>
  • Codec enforces a max stack depth of 64 to prevent malformed-wire memory exhaustion
  • Round-trip codec tests cover empty stacks, known records, random instances, and the depth-limit boundary
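
To illustrate the depth cap described above, here is a minimal sketch of a length-prefixed stack codec that rejects oversized counts before allocating. The byte layout, the `Interception` fields, the `DecodeError` type, and the `encode`/`decode` function names are assumptions made for illustration; they are not taken from zenoh-codec.

```rust
// Hypothetical stand-ins for the wire types; the real field layout may differ.
pub const MAX_STACK_DEPTH: usize = 64;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Interception {
    pub point: u8,          // assumed numeric ID of the interception point
    pub timestamp: Vec<u8>, // opaque timestamp bytes
}

#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct WireTimestampStack {
    pub conf_flags: u8,
    pub interceptions: Vec<Interception>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DecodeError {
    Truncated,
    TooDeep(usize),
}

/// Assumed layout: [conf_flags][count] then, per record, [point][len][bytes].
/// Counts and lengths are single bytes for brevity, so this sketch only
/// handles fewer than 256 records and timestamps shorter than 256 bytes.
pub fn encode(stack: &WireTimestampStack) -> Vec<u8> {
    let mut out = vec![stack.conf_flags, stack.interceptions.len() as u8];
    for rec in &stack.interceptions {
        out.push(rec.point);
        out.push(rec.timestamp.len() as u8);
        out.extend_from_slice(&rec.timestamp);
    }
    out
}

pub fn decode(buf: &[u8]) -> Result<WireTimestampStack, DecodeError> {
    let (&conf_flags, rest) = buf.split_first().ok_or(DecodeError::Truncated)?;
    let (&count, mut rest) = rest.split_first().ok_or(DecodeError::Truncated)?;
    let count = count as usize;
    // Check the declared count before reserving memory for the records.
    if count > MAX_STACK_DEPTH {
        return Err(DecodeError::TooDeep(count));
    }
    let mut interceptions = Vec::with_capacity(count);
    for _ in 0..count {
        let (&point, tail) = rest.split_first().ok_or(DecodeError::Truncated)?;
        let (&len, tail) = tail.split_first().ok_or(DecodeError::Truncated)?;
        let len = len as usize;
        if tail.len() < len {
            return Err(DecodeError::Truncated);
        }
        let (ts, tail) = tail.split_at(len);
        interceptions.push(Interception { point, timestamp: ts.to_vec() });
        rest = tail;
    }
    Ok(WireTimestampStack { conf_flags, interceptions })
}
```

Checking the count before `Vec::with_capacity` is the part that matters: the decoder never sizes an allocation from an unvalidated wire value.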

Public API (zenoh crate, feature = "unstable")

  • TimestampInstrumentationBuilder / TimestampInstrumentation — configure which points to record
  • InterceptionPoint (Send, Route, Receive) — #[non_exhaustive] for forward compatibility
  • TimestampStack / TimestampStackRecord — read timestamps off a received Sample, ReplyError, or Query
  • TsStackContext / GetTimestampCallback — custom timestamp generation via OpenBuilder::with_timestamp_callback
  • .timestamp_instrumentation(Option<TimestampInstrumentation>) builder method on put / get / reply / publisher
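
As a rough model of how the configuration types above could relate to the wire-level conf_flags, the sketch below maps each interception point to one bit. The type names come from this PR, but the bit assignment, the `point`/`build`/`records` method names, and the internals are assumptions; the actual zenoh API may look different.

```rust
// Self-contained model, not the zenoh implementation.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InterceptionPoint {
    Send,
    Route,
    Receive,
}

impl InterceptionPoint {
    // Assumed bit assignment; the PR text does not specify the flag layout.
    const fn flag(self) -> u8 {
        match self {
            InterceptionPoint::Send => 0b001,
            InterceptionPoint::Route => 0b010,
            InterceptionPoint::Receive => 0b100,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct TimestampInstrumentation {
    conf_flags: u8,
}

impl TimestampInstrumentation {
    /// Direct constructor, handy where a builder chain is awkward.
    pub fn new(send: bool, route: bool, receive: bool) -> Self {
        let mut builder = TimestampInstrumentationBuilder::default();
        if send {
            builder = builder.point(InterceptionPoint::Send);
        }
        if route {
            builder = builder.point(InterceptionPoint::Route);
        }
        if receive {
            builder = builder.point(InterceptionPoint::Receive);
        }
        builder.build()
    }

    /// Whether a record should be pushed at `point`.
    pub fn records(&self, point: InterceptionPoint) -> bool {
        (self.conf_flags & point.flag()) != 0
    }
}

#[derive(Debug, Default)]
pub struct TimestampInstrumentationBuilder {
    conf_flags: u8,
}

impl TimestampInstrumentationBuilder {
    /// Hypothetical method name: enable recording at one point.
    pub fn point(mut self, p: InterceptionPoint) -> Self {
        self.conf_flags |= p.flag();
        self
    }

    pub fn build(self) -> TimestampInstrumentation {
        TimestampInstrumentation { conf_flags: self.conf_flags }
    }
}
```

Under these assumptions, `TimestampInstrumentationBuilder::default().point(InterceptionPoint::Send).build()` and `TimestampInstrumentation::new(true, false, false)` produce the same value.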

Instrumentation points

  • Send: session.rs (resolve_put, resolve_get) and builders/reply.rs
  • Route: routing/dispatcher/pubsub.rs (per-subscriber, inside fan-out loop) and queries.rs
  • Receive: WeakSession::send_push_consume and adminspace.rs

Robustness fixes

  • Codec decode cap at 64 interceptions (prevents OOM on crafted wire input)
  • Wire type TimestampStack renamed to WireTimestampStack to eliminate the name collision with the public API type
  • get_ts_stack_timestamp removes unwrap()/expect() — returns empty bytes on clock failure instead of panicking
  • push_ts_interception uses debug_assert + silent skip instead of expect for unknown point IDs
  • ext_ts_stack: None on timeout replies is now documented (was a TODO comment)
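
The two non-panicking patterns in this list can be sketched as follows. The function names echo the PR, but the signatures, the numeric point IDs, the `TsStack` alias, and the use of `SystemTime` as the fallible time source are assumptions for illustration only.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Assumed numeric IDs for the three known points.
pub const POINT_SEND: u8 = 0;
pub const POINT_ROUTE: u8 = 1;
pub const POINT_RECEIVE: u8 = 2;

pub type TsStack = Vec<(u8, Vec<u8>)>;

/// Push a record if the message is instrumented and the point is known.
pub fn push_ts_interception(ext: &mut Option<TsStack>, point_id: u8, timestamp: Vec<u8>) {
    // No stack on the message: nothing to do.
    let Some(stack) = ext.as_mut() else {
        return;
    };
    // Loud in debug builds, silently skipped in release builds.
    debug_assert!(point_id <= POINT_RECEIVE, "unknown interception point {}", point_id);
    if point_id > POINT_RECEIVE {
        return;
    }
    stack.push((point_id, timestamp));
}

/// Turn a fallible clock reading into bytes, returning an empty vector
/// instead of panicking when the reading fails.
pub fn timestamp_bytes(now: SystemTime) -> Vec<u8> {
    match now.duration_since(UNIX_EPOCH) {
        Ok(elapsed) => (elapsed.as_nanos() as u64).to_le_bytes().to_vec(),
        Err(_) => Vec::new(),
    }
}
```

The trade-off is that a release build drops an unknown point without any signal, so a consumer reading the stack has to tolerate missing records.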

Plumbing

  • WeakDynamicRuntime / WeakRuntime weak-reference plumbing for runtime access in query state
  • ReplyErr public API accessor
  • lazy_hlc in RuntimeState — lazily-initialized HLC fallback when no user HLC is configured, replacing the previous SystemTime::now() approach
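
A lazily-initialized fallback clock of the kind described for lazy_hlc might look like the sketch below, using `std::sync::OnceLock`. The `Clock` type is a plain monotonic counter standing in for a real HLC, and the `RuntimeState` fields and methods shown are hypothetical; only the lazy-initialization pattern is the point.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

/// Stand-in for an HLC: a monotonic counter, NOT a real hybrid logical clock.
pub struct Clock {
    last: AtomicU64,
}

impl Clock {
    pub fn new() -> Self {
        Clock { last: AtomicU64::new(0) }
    }

    /// Each call returns a value strictly greater than the previous one.
    pub fn now(&self) -> u64 {
        self.last.fetch_add(1, Ordering::Relaxed) + 1
    }
}

pub struct RuntimeState {
    user_clock: Option<Clock>,
    lazy_clock: OnceLock<Clock>,
}

impl RuntimeState {
    pub fn new(user_clock: Option<Clock>) -> Self {
        RuntimeState { user_clock, lazy_clock: OnceLock::new() }
    }

    /// Prefer the user-configured clock; otherwise create the fallback on
    /// first use and reuse the same instance afterwards.
    pub fn clock(&self) -> &Clock {
        match &self.user_clock {
            Some(clock) => clock,
            None => self.lazy_clock.get_or_init(Clock::new),
        }
    }

    pub fn fallback_initialized(&self) -> bool {
        self.lazy_clock.get().is_some()
    }
}
```

With this shape, sessions that never request an instrumented timestamp never construct the fallback, and those that do share a single clock instance rather than reading wall-clock time on each call.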

Tests

  • 20 integration tests in zenoh/tests/timestamp_stack.rs: no-instrumentation baseline, single-point (Send / Receive), all-points ordering, custom callback byte verification, callback context correctness, per-message stack independence, query/reply flows, is_custom flag, route-only instrumentation, ReplyError propagation, multiple-subscriber fan-out, and publisher API path
  • 4 codec tests: empty stack, known records, random round-trips, and the 64-depth limit

Breaking Changes

None. All new types and methods are behind feature = "unstable".

Known Limitations

  • Route timestamp is stamped once per routing hop; multi-hop topologies will produce multiple ROUTE records (by design)
  • QueryCleanup timeout responses carry ext_ts_stack: None — not instrumented (documented rationale in code)

🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: new feature

🆕 New Feature Requirements

Since this PR adds a new feature:

  • Feature scope documented - Clear description of what the feature does and why it's needed
  • Minimum necessary code - Implementation is as simple as possible, doesn't overcomplicate the system
  • New APIs well-designed - Public APIs are intuitive, consistent with existing APIs
  • Comprehensive tests - All functionality is tested (happy path + edge cases + error cases)
  • Examples provided - Usage examples in code comments or separate example files
  • Documentation added - New docs explaining the feature, its use cases, and API
  • Feature flag considered - Entirely behind feature = "unstable", zero overhead when inactive
  • Performance impact assessed - No overhead on uninstrumented messages (push_ts_interception is a no-op when ext_ts_stack is None)
  • Integration tested - 24/24 integration + codec tests pass (Rust 1.93.0)

Consider: Can this feature be split into smaller, incremental PRs?

Instructions:

  1. Check off items as you complete them (change - [ ] to - [x])
  2. The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

- Cap TsStack wire decode at 64 entries to prevent OOM from crafted packets
- Rename wire type TimestampStack -> WireTimestampStack to eliminate name
  collision with the public API type zenoh::timestamp_stack::TimestampStack,
  including all remaining import sites in the zenoh crate
- Remove unwrap/expect in get_ts_stack_timestamp; return empty vec on
  serialization failure
- Stamp ROUTE timestamp per-subscriber in fan-out so each gets its own
  dispatch time instead of sharing a pre-loop timestamp
- Replace avoidable .clone() with mem::take() at receive-side extract sites
- Replace expect() in push_ts_interception with debug_assert + graceful return
- Add #[non_exhaustive] to InterceptionPoint
- Add pub(crate) visibility comment on Runtime::state
- Document timeout reply ext_ts_stack: None intentionally unset
@YuanYuYuan YuanYuYuan force-pushed the feat/routing-timestamps branch from a764445 to fc1e09c Compare May 29, 2026 04:31
… reply_err, multi-subscriber, publisher API, and callback context
- Add `TimestampInstrumentation::new(send, route, receive)` direct constructor
  alongside the builder, for C/Python bindings that don't need the builder chain
- Add per-publisher default instrumentation: `PublisherBuilder::timestamp_instrumentation()`
  stores a default on the `Publisher` struct; per-put override takes precedence
- Add `TimestampStackRecord::as_timestamp()` to parse standard HLC bytes into
  `zenoh::time::Timestamp`, avoiding raw byte handling for non-custom records
- Rename `GetTimestampCallback` → `SessionTimestampCallback` to clarify that the
  callback is session-scoped (registered at open time, not per-message)

All 20 timestamp_stack integration tests pass (job 63 on beta-cuda).