Improve telemetry percentile accuracy with reservoir sampling #10

Description

@lambda-alpha-labs

The telemetry module estimates the P50/P95/P99 latency percentiles with an exponential moving average (EMA). The EMA is lightweight, but it is order-dependent: recent samples carry more weight than older ones, so the same data processed in a different order produces different percentile estimates.
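To make the order dependence concrete, here is a minimal sketch that replays the update from `update_percentiles` over a slice of samples. The function name `ema_estimate` and the 0.0 starting value are assumptions for illustration, not code from the repository.

```rust
/// Replays the EMA update from `update_percentiles` over `samples`, in order.
/// Hypothetical helper: the name and the 0.0 initial value are assumptions.
fn ema_estimate(samples: &[f64]) -> f64 {
    let mut p50_ms = 0.0;
    for &latency_ms in samples {
        // Same update as the current code: 90% old estimate, 10% new sample.
        p50_ms = p50_ms * 0.9 + latency_ms * 0.1;
    }
    p50_ms
}
```

With the samples 10, 20, 100 this gives roughly 12.61 in ascending order and roughly 10.9 in descending order, while the true median of the three values is 20.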

Proposed fix

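Implement a bounded sample buffer: maintain a sorted buffer of the last 100-200 latency measurements per node and read the percentiles directly from it. Strictly speaking, a "last N" buffer is a sliding window; classic reservoir sampling instead keeps a uniform random sample of the whole stream. Either way, a percentile read from a sorted buffer does not depend on the order of the samples inside that buffer, and the buffer stays memory-efficient.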
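One possible shape for such a buffer, as a sketch rather than a proposed final implementation. The type name `LatencyWindow`, the `f32` sample type, and the nearest-rank percentile definition are all assumptions; the issue does not specify them. This version stores samples in arrival order (so the oldest can be evicted) and sorts a copy on read, instead of keeping the buffer sorted at all times.

```rust
use std::collections::VecDeque;

/// Hypothetical per-node window over the most recent latency samples.
pub struct LatencyWindow {
    samples: VecDeque<f32>, // arrival order, oldest at the front
    capacity: usize,
}

impl LatencyWindow {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { samples: VecDeque::with_capacity(capacity), capacity }
    }

    /// Records one latency, evicting the oldest sample once the window is full.
    pub fn record(&mut self, latency_ms: f32) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(latency_ms);
    }

    /// Nearest-rank percentile (`p` in 0.0..=100.0) over the current window.
    /// Returns `None` if nothing has been recorded yet.
    pub fn percentile(&self, p: f64) -> Option<f32> {
        if self.samples.is_empty() {
            return None;
        }
        let mut sorted: Vec<f32> = self.samples.iter().copied().collect();
        sorted.sort_by(|a, b| a.total_cmp(b));
        // Nearest rank: the smallest value with at least p% of samples at or below it.
        let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
        let index = rank.clamp(1, sorted.len()) - 1;
        Some(sorted[index])
    }
}
```

Because the percentile is computed from a sorted copy, two windows holding the same set of samples report the same value regardless of arrival order. The guarantee only covers what is in the window: once more than `capacity` samples have arrived, which samples were evicted still depends on ordering.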

Current code

src/telemetry.rs in update_percentiles:

meta.p50_ms = meta.p50_ms * 0.9 + latency_ms * 0.1;

Acceptance

  • Percentile values are invariant to trace file ordering
  • Memory overhead stays under ~1KB per active node
  • No regression in existing telemetry tests
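The memory criterion interacts with the buffer size, so it may help to check the arithmetic up front. The helper below is a hypothetical sketch (the name `window_bytes` is an assumption) that counts only the raw sample storage, not container or allocator overhead.

```rust
use std::mem::size_of;

/// Bytes needed to store `capacity` samples of type `T`.
/// Hypothetical helper; excludes container and allocator overhead.
pub const fn window_bytes<T>(capacity: usize) -> usize {
    capacity * size_of::<T>()
}
```

Under this estimate, 100 `f64` samples take 800 bytes and 128 take exactly 1024 bytes, but 200 `f64` samples take 1600 bytes and exceed the ~1KB target. Staying at the upper end of the 100-200 range would need a narrower sample type such as `f32` (200 samples in 800 bytes).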

Metadata

    Labels

    enhancement (New feature or request)
