feat(bench): load-100k.ts harness with p50/p90/p99 output (#363)
Adds a reproducible, dependency-free load harness so we can answer "what's p99 at 100k memories under concurrency 100?" with a number instead of a shrug.

The harness seeds N synthetic memories against a local agentmemory daemon (defaults to http://localhost:3111, optional autostart via AGENTMEMORY_BENCH_AUTOSTART=1), then drives a matrix of (N, concurrency, endpoint) cells with hand-rolled Promise.allSettled batches. Per-request latency is collected via performance.now() and summarized as nearest-rank p50 / p90 / p99 plus min / max / errors and wall-clock throughput. Results are written to benchmark/results/load-100k-<short-git-sha>.json with a schema_version field so future format changes don't silently break consumers.

Defaults match issue #346: N in {1000, 10000, 100000} × C in {1, 10, 100} × three endpoints (POST /agentmemory/remember, POST /agentmemory/smart-search, GET /agentmemory/memories?latest=true). Each cell issues BENCH_OPS=200 requests by default — enough samples for a stable p99 without dragging a 100k-seed run past tens of minutes.

Content is generated by a small noun/verb/concept vocabulary fed by a mulberry32(BENCH_SEED) PRNG, so re-running the harness against the same daemon build yields the same seed corpus. Reproducibility, not realism, is the point — latency variance comes from the daemon, not JSON payload jitter.
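As background for the reproducibility claim: mulberry32 is a tiny seedable 32-bit PRNG. A minimal TypeScript sketch of the standard construction (the harness's actual implementation may differ in details, and `pick` is an illustrative helper, not harness code):

```typescript
// mulberry32: tiny, fast, seedable 32-bit PRNG returning floats in [0, 1).
// Same seed => same sequence, which is what makes the seed corpus reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0; // force unsigned 32-bit state
  return (): number => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Deterministic pick from a small vocabulary, driven by the PRNG.
function pick<T>(rng: () => number, items: T[]): T {
  return items[Math.floor(rng() * items.length)];
}
```

Two generators built from the same seed advance in lockstep, which is exactly the property the seed corpus relies on.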
Files:

- benchmark/load-100k.ts: main harness
- benchmark/lib/percentiles.ts: zero-dep pXX helper, nearest-rank
- benchmark/README.md: how to run, what gets measured, where results land, and why p99 is the number you want for capacity planning
- benchmark/results/load-100k-96c0ed0.json: example result from a small-N (N=1000, C=10) verification run against a fresh daemon
- package.json: wires `npm run bench:load`
- CHANGELOG.md: Unreleased entry + a Performance section placeholder describing where per-release numbers should land going forward

Verified locally at BENCH_N=1000 BENCH_C=10 BENCH_OPS=200 — three cells, zero errors, JSON written. Full 100k matrix is intentionally deferred to CI/release time.

Closes #346.
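For reference, the nearest-rank rule behind the `pXX` helper fits in a few lines. This is a sketch of the approach described above, not the shipped `benchmark/lib/percentiles.ts` verbatim:

```typescript
// Nearest-rank percentile: for p in (0, 100], the value at 1-based rank
// ceil(p/100 * n) of the ascending-sorted sample. No interpolation, so the
// result is always an observed latency, never a synthetic in-between value.
function pXX(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
}
```

With 200 samples per cell, p99 is the value at rank `ceil(0.99 * 200) = 198`, so a handful of slow outliers is enough to move it.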
Actionable comments posted: 2
🧹 Nitpick comments (1)
benchmark/load-100k.ts (1)
140-151: ⚡ Quick win — `maybeStartDaemon` swallows spawn failures.

There's no `'error'` or `'exit'` listener on the child, so a failed `spawn` (e.g. EACCES on the CLI binary, or the daemon crashing on startup) shows up only as a `waitForLivez` timeout 30 seconds later. Attaching an error/early-exit listener lets the harness fail fast with a useful message instead of looking like a network problem.

♻️ Proposed handler — fail fast on spawn / early exit
```diff
   const child = spawn(process.execPath, [cliPath, "start"], {
     stdio: ["ignore", "pipe", "pipe"],
     detached: false,
   });
+  child.on("error", (err) => {
+    console.error(`[load-100k] daemon spawn error:`, err);
+  });
+  child.once("exit", (code, signal) => {
+    if (code !== 0 && code !== null) {
+      console.error(
+        `[load-100k] daemon exited early code=${code} signal=${signal ?? "-"}`,
+      );
+    }
+  });
   child.stdout?.on("data", () => {
     /* drain */
   });
```
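For context on the timeout being criticized, a `waitForLivez`-style readiness poll usually looks something like this sketch (the endpoint path, default values, and function name here are assumptions, not the harness's actual code). The spawn listeners proposed above catch a daemon that never starts; the poll catches one that starts but never becomes healthy:

```typescript
// Poll a health endpoint until it answers 200, or give up after deadlineMs.
async function waitForLive(
  baseUrl: string,
  deadlineMs = 30_000,
  intervalMs = 250,
): Promise<void> {
  const t0 = Date.now();
  while (Date.now() - t0 < deadlineMs) {
    try {
      const res = await fetch(`${baseUrl}/livez`, {
        signal: AbortSignal.timeout(2_000),
      });
      if (res.ok) return; // daemon is live
    } catch {
      // not up yet; fall through to the sleep below
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`daemon not live at ${baseUrl} after ${deadlineMs}ms`);
}
```

Without the spawn handlers, every startup failure funnels into this function's timeout branch, which is why the error looked like a network problem.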
```typescript
opsPerCell: parseInt(process.env["BENCH_OPS"] || "200", 10) || 200,
seed: parseInt(process.env["BENCH_SEED"] || "12648430", 10) || 12648430,
```
BENCH_SEED=0 (and BENCH_OPS=0) are silently replaced with defaults.
The parseInt(...) || fallback idiom on lines 99–100 conflates "unparseable" with "legitimate zero". BENCH_SEED=0 is a perfectly valid seed for mulberry32, but this code rewrites it to 12648430, which surprises anyone trying to reproduce a previously-published run that used seed 0. Same issue, less impactful, for BENCH_OPS=0.
🛡️ Proposed fix — distinguish parse failure from zero

```diff
-  opsPerCell: parseInt(process.env["BENCH_OPS"] || "200", 10) || 200,
-  seed: parseInt(process.env["BENCH_SEED"] || "12648430", 10) || 12648430,
+  opsPerCell: (() => {
+    const v = parseInt(process.env["BENCH_OPS"] ?? "", 10);
+    return Number.isFinite(v) && v > 0 ? v : 200;
+  })(),
+  seed: (() => {
+    const v = parseInt(process.env["BENCH_SEED"] ?? "", 10);
+    return Number.isFinite(v) ? v : 12648430;
+  })(),
```
```typescript
async function seedMemories(
  baseUrl: string,
  count: number,
  rng: () => number,
  seedConcurrency = 32,
): Promise<{ seeded: number; errors: number; wallMs: number }> {
  let issued = 0;
  let seeded = 0;
  let errors = 0;
  const t0 = performance.now();
  async function worker(): Promise<void> {
    while (true) {
      const i = issued++;
      if (i >= count) return;
      const body = JSON.stringify({
        content: buildContent(rng, i),
        type: "observation",
      });
      try {
        const res = await fetch(`${baseUrl}/agentmemory/remember`, {
          method: "POST",
          headers: { "content-type": "application/json" },
          body,
          signal: AbortSignal.timeout(30_000),
        });
        if (res.ok) {
          seeded++;
        } else {
          errors++;
        }
        // drain body to free the socket
        await res.text().catch(() => "");
      } catch {
        errors++;
      }
    }
  }
  await Promise.allSettled(
    Array.from({ length: seedConcurrency }, () => worker()),
  );
  return { seeded, errors, wallMs: performance.now() - t0 };
}
```
Shared mutable RNG across concurrent workers breaks the "deterministic content" guarantee.

`seedMemories` shares one `rng` closure across `seedConcurrency` workers. After the first synchronous burst (each worker grabs an `i` and consumes ~5 rng values before its first `await fetch`), every subsequent rng consumption is interleaved with HTTP completions, so the rng→i pairing depends on network timing. Two runs against the same daemon build with the same seed will produce different content for the same `i` once `i >= seedConcurrency`. The PR description explicitly promises reproducibility via `mulberry32(BENCH_SEED)`, so this is a real regression of that contract — and it also leaks into `measureRemember` below (same pattern with `probeRng`).

Derive the per-record rng from `i` so content is a pure function of (seed, i) regardless of completion order:
🛡️ Proposed fix — make content a pure function of (seed, i)

```diff
 async function seedMemories(
   baseUrl: string,
   count: number,
-  rng: () => number,
+  baseSeed: number,
   seedConcurrency = 32,
 ): Promise<{ seeded: number; errors: number; wallMs: number }> {
   let issued = 0;
   let seeded = 0;
   let errors = 0;
   const t0 = performance.now();
   async function worker(): Promise<void> {
     while (true) {
       const i = issued++;
       if (i >= count) return;
+      const wrng = mulberry32((baseSeed + i) >>> 0);
       const body = JSON.stringify({
-        content: buildContent(rng, i),
+        content: buildContent(wrng, i),
         type: "observation",
       });
```

…and update the call site (around line 406) to pass `cfg.seed + seededSoFar` instead of a constructed rng. Apply the same pattern in `measureRemember` (lines 294–307), where `probeRng` is shared across `driveLoad` workers.
Closes #346.
What this is
A reproducible load harness for the three hot REST endpoints, so we have a number to point at when somebody asks "what's p99 at 100k memories under concurrency 100?"
Hand-rolled, dependency-free Node 20 / TypeScript. Hits a local agentmemory daemon over real HTTP, collects per-request latency with `performance.now()`, summarizes as nearest-rank p50 / p90 / p99 plus min / max / errors / throughput, and writes a schema-versioned JSON report per run.

Matrix dimensions
- N in `{1000, 10000, 100000}` — set via `BENCH_N=` (comma list).
- C in `{1, 10, 100}` — set via `BENCH_C=`.
- 200 ops per cell by default — set via `BENCH_OPS=`. Enough samples for stable p99 without dragging a 100k-seed run past tens of minutes.
- `N` is processed ascending so each cell builds on the previous seed work — no re-seeding between cells.

Endpoints under test
- `POST /agentmemory/remember`
- `POST /agentmemory/smart-search`
- `GET /agentmemory/memories?latest=true`

Content generation
Synthetic memory bodies come from a small noun / verb / concept vocabulary fed by a `mulberry32(BENCH_SEED)` PRNG. Default seed `0xC0FFEE`. Same seed + same daemon build = byte-identical seed corpus, so re-running against the same git sha gives the same content mixture going in, and latency variance has to come from the daemon, not JSON payload jitter.

Output
Each run writes `benchmark/results/load-100k-<short-git-sha>.json` (`mkdir -p`'d if missing). Top-level `schema_version: 1` so future format changes don't silently break consumers. The harness also prints a compact table to stdout for quick eyeball comparison.

Verification run (small-N)
`BENCH_N=1000 BENCH_C=10 BENCH_OPS=200 npm run bench:load` against a fresh daemon — three cells, zero errors, JSON written. The full 100k matrix is intentionally deferred to CI / release time, per the issue's "single-process for now" scope. Cells cover all three endpoints:

- `POST /agentmemory/remember`
- `POST /agentmemory/smart-search`
- `GET /agentmemory/memories?latest=true`

p99 is the headline number for capacity planning; p50 + throughput are context. The raw JSON is committed at `benchmark/results/load-100k-96c0ed0.json` as the example result.

What's wired
- `benchmark/load-100k.ts` — main harness. File header documents env knobs.
- `benchmark/lib/percentiles.ts` — zero-dep nearest-rank `pXX(sorted, p)` helper.
- `benchmark/README.md` — how to run, what gets measured, where results land, and why p99 is the number you want.
- `package.json` — `npm run bench:load` (uses `tsx`, already in devDeps).
- `CHANGELOG.md` — `[Unreleased]` entry + a `### Performance` section placeholder describing where per-release p50/p90/p99 bullets should land going forward.

Out of scope
Test plan
- `npm run bench:load` with `BENCH_N=1000 BENCH_C=10 BENCH_OPS=200` against a freshly-started daemon — three cells, zero errors, JSON written.
- `npm test`, excluding the pre-existing `test/mcp-standalone.test.ts` failures (11 failures present on `main` HEAD, unrelated to this PR): 875 / 875 pass.
- Run the full `BENCH_N=1000,10000,100000 BENCH_C=1,10,100` matrix at release time and attach the resulting JSON to the release notes.
- Exercise the `AGENTMEMORY_BENCH_AUTOSTART=1` path on a fresh checkout (`npm run build && AGENTMEMORY_BENCH_AUTOSTART=1 BENCH_N=1000 npm run bench:load`).