Cat 7.b: query latency distribution (YCSB percentiles) by jphein · Pull Request #33 · M0nkeyFl0wer/multipass-structural-memory-eval

jphein · 2026-05-25T01:01:39Z

Summary

Closes #25.

Adds Cat 7.b — query latency distribution using YCSB-standard percentile reporting (p50/p95/p99/p99.9/max).

New sme/categories/latency.py with LatencyCollector and LatencyReport dataclasses
LatencyCollector.timed_call(fn, *args) wraps any callable with wall-clock measurement
format_latency_report() renders human-readable output
Uses numpy percentiles (already a core dependency) — no new dependencies required

Design decisions:

numpy percentiles instead of HdrHistogram for the initial implementation — HdrHistogram is more appropriate for sub-ms benchmarks with coordinated omission concerns, but LLM-query latencies (hundreds of ms each) don't need that precision floor. Can upgrade later if warranted.
Raw latency array included in report for downstream reaggregation
[latency] optional extra added to pyproject.toml for future HdrHistogram upgrade path

Test plan

test_latency.py — collector recording, percentile computation, timed_call wrapper, empty report edge case, format output
Existing tests pass unchanged

🫏 Generated with Claude Code

Adds sme/categories/latency.py with LatencyCollector that wraps adapter calls and captures wall-clock timing, computing YCSB-standard percentiles (p50, p95, p99, p99.9, max) via numpy. The [latency] optional extra reserves an HdrHistogram upgrade path when higher-fidelity tail measurement is needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new Category 7.b latency distribution utility and accompanying tests, plus an optional dependency stanza for a latency-related extra.

Changes:

Introduces LatencyCollector/LatencyReport for recording and reporting query latency percentiles.
Adds pytest coverage for collector behavior, dict conversion, and report formatting.
Adds a latency optional dependency extra in pyproject.toml.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
tests/test_latency.py	Adds unit tests for latency collection, percentile reporting, dict serialization, and formatting output.
sme/categories/latency.py	Implements latency collection/reporting and a human-readable summary formatter.
pyproject.toml	Adds a `latency` optional-deps extra and includes it in the `all` extra.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+Uses numpy for percentile calculations; optionally upgrades to
+HdrHistogram when the [latency] extra is installed.


+import logging
+from dataclasses import dataclass, field
+
+import numpy as np
+
+log = logging.getLogger(__name__)
+
+


+import logging
+from dataclasses import dataclass, field
+
+import numpy as np
+
+log = logging.getLogger(__name__)
+
+


+def test_timed_call():
+    c = LatencyCollector()
+    def slow_fn(x):
+        time.sleep(0.01)
+        return x * 2
+    result, latency = c.timed_call(slow_fn, 5)
+    assert result == 10
+    assert latency >= 10.0


+latency = [
+    "hdrhistogram>=0.10",
+]


M0nkeyFl0wer · 2026-05-25T23:34:00Z

YCSB percentiles (p50/p95/p99/p99.9) are exactly the right shape for query-latency — aggregate means hide the tail, and the tail is where memory systems differentiate. The LatencyCollector API with keep_raw for post-hoc analysis is clean.

Three small things before merge:

HdrHistogram fallback. The docstring promises "optionally upgrade to HdrHistogram when the extra is installed" but the code never imports or uses it, and the optional hdrhistogram dep in pyproject.toml is unreferenced. Either wire it in (the value over np.percentile is preservation across merged runs and lower memory at large N) or drop the doc claim. Probably worth wiring — HdrHistogram's whole reason to exist is that np.percentile on a raw-keep array doesn't scale past ~1M samples.
Dead logging import — drop it.
time.sleep(0.01) test is timing-flaky — replace with a deterministic mock-clock or fixed timestamps.

One framing question for the spec angle: Cat 7.b (latency) and Cat 7 (cost-per-correct, #34) are both landing in the same category. Is the intended Cat 7 readout a single combined section or two side-by-side? The answer changes what shape the headline metric becomes — "cost-per-correct at p95 latency" vs. two independent columns.

…ic latency test Back LatencyCollector with HdrHistogram when the [latency] extra is installed (guarded optional import); fall back to np.percentile on the base install. HdrHistogram gives fixed per-sample memory at large N and lossless cross-run merging, fulfilling the module docstring's prior unmet promise. Record microseconds for sub-ms precision. Remove the unused logging import. Replace the flaky time.sleep(0.01) timed_call test with a monkeypatched perf_counter clock. Percentile assertions now tolerate HdrHistogram quantization on the HDR path while staying exact on the numpy fallback; add HDR-only merge/backend tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jphein · 2026-05-27T14:15:18Z

Pushed eb79499:

HdrHistogram wired. Added a latency extra (hdrhistogram>=0.10); the collector uses an HdrHistogram backend when the extra is installed and falls back to numpy percentiles otherwise (guarded import). That delivers the lossless-merge across runs / bounded memory at large N you flagged, while the base install stays dependency-light.
Removed the dead logging import.
Replaced the flaky time.sleep(0.01) test with deterministic fixed timestamps.

CI green on 3.10/3.11/3.12 (the numpy fallback path — the extra isn't installed in CI).

On the Cat 7 framing question (also on #34): going with two side-by-side readouts — latency (this PR) and cost (#34) stay independent so each lands on its own.

🫏

M0nkeyFl0wer · 2026-05-27T14:56:17Z

Verified — HdrHistogram backend behind the [latency] extra with numpy fallback is the right shape: base install stays dependency-light, large-N runs get lossless merging and bounded memory. The deterministic monkeypatch(perf_counter) test is much nicer than the sleep version it replaced. Ship.

One downstream methodology question, not a blocker: the HDR path and the numpy path return percentile values that differ by HDR quantization bounds. For published cross-system readings, is one canonical, or does the spec require [latency] installed when reporting numbers? Worth a one-line note in sme_spec_v8.md so two researchers reporting "p99 = 480ms" can't be talking about subtly different things, but doesn't need to gate this merge.

Copilot AI review requested due to automatic review settings May 25, 2026 01:01

Copilot AI reviewed May 25, 2026

View reviewed changes

M0nkeyFl0wer mentioned this pull request May 25, 2026

Cost-per-correct headline metric with versioned pricing (HELM methodology) #34

Open

2 tasks

This was referenced May 27, 2026

Random + oracle retrieval baselines (TREC bounds) #32

Open

test+refactor: adapter contract testkit, registry allowlist, test coverage (#8, #18, #19, #20) #30

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cat 7.b: query latency distribution (YCSB percentiles)#33

Cat 7.b: query latency distribution (YCSB percentiles)#33
jphein wants to merge 2 commits into
M0nkeyFl0wer:mainfrom
techempower-org:feat/cat7b-latency-distribution

jphein commented May 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

M0nkeyFl0wer commented May 25, 2026

Uh oh!

jphein commented May 27, 2026

Uh oh!

M0nkeyFl0wer commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		Uses numpy for percentile calculations; optionally upgrades to
		HdrHistogram when the [latency] extra is installed.

Conversation

jphein commented May 25, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

M0nkeyFl0wer commented May 25, 2026

Uh oh!

jphein commented May 27, 2026

Uh oh!

M0nkeyFl0wer commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants