Merged
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,14 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),

## [Unreleased]

### Added

- **`benchmark/load-100k.ts` load harness** ([#346](https://github.com/rohitg00/agentmemory/issues/346)). Hand-rolled, dependency-free harness that seeds N synthetic memories against a local daemon at `http://localhost:3111` and records p50 / p90 / p99 latency + throughput for `POST /agentmemory/remember`, `POST /agentmemory/smart-search`, and `GET /agentmemory/memories?latest=true` across the matrix N ∈ {1k, 10k, 100k} × concurrency C ∈ {1, 10, 100}. Content drawn from a seedable `mulberry32` PRNG so re-running against the same build produces the same seed corpus. Results land in `benchmark/results/load-100k-<short-git-sha>.json` (schema-versioned). Wired as `npm run bench:load`. See `benchmark/README.md` for the matrix and env knobs.

### Performance

- This is the placeholder for per-release p50 / p90 / p99 numbers from `benchmark/load-100k.ts`. Each release should land a `benchmark/results/load-100k-<sha>.json` and reference the headline p99 here. Format suggestion: one bullet per (N, C) cell that materially regressed or improved versus the previous release. p99 is the capacity-planning number; p50 + throughput are context. See [`benchmark/README.md`](benchmark/README.md) for how to reproduce.

## [0.9.12] — 2026-05-13

Four landed PRs since v0.9.11 — one type-correctness fix, one search-quality fix (BM25 unicode + vector-index live-write), one viewer hardening (CSP-clean fonts + load-error surface), and one integrations security hardening (bearer token over plaintext HTTP).
100 changes: 100 additions & 0 deletions benchmark/README.md
@@ -0,0 +1,100 @@
# benchmark/

Two kinds of numbers live in this directory:

1. **Quality / retrieval** — `longmemeval-bench.ts`, `quality-eval.ts`,
`real-embeddings-eval.ts`, `scale-eval.ts`. Recall, precision, token
savings. Documented in `LONGMEMEVAL.md`, `QUALITY.md`,
`REAL-EMBEDDINGS.md`, `SCALE.md`.

2. **Load shape** — `load-100k.ts`. p50 / p90 / p99 latency and
throughput against a running daemon. This is the file you want when
somebody asks "what's p99 at 100k memories under concurrency 100?".

## load-100k.ts

Hand-rolled, dependency-free load harness. Issues real HTTP against a
local agentmemory daemon at `http://localhost:3111`, records per-request
latency with `performance.now()`, and writes a JSON report per run.

### What it measures

For each cell in the matrix `(N, concurrency, endpoint)` it records:

- `p50_ms`, `p90_ms`, `p99_ms` — nearest-rank percentiles.
- `min_ms`, `max_ms`, `ops`, `errors`.
- `throughput_per_sec` — wall-clock ops / sec for that cell.
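
As a rough illustration of how the fields above could be derived from raw samples, here is a minimal sketch. `CellSummary`, `nearestRank`, and `summarizeCell` are hypothetical names, not taken from `load-100k.ts`, and the sketch assumes `ops` counts the latency samples that were recorded.

```typescript
// Hypothetical shape for one matrix cell; field names follow the list above.
interface CellSummary {
  p50_ms: number;
  p90_ms: number;
  p99_ms: number;
  min_ms: number;
  max_ms: number;
  ops: number;
  errors: number;
  throughput_per_sec: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function nearestRank(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  const idx = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[idx]!;
}

// Assumption: `ops` is the number of recorded latency samples, and
// throughput is ops divided by the wall-clock duration of the cell.
function summarizeCell(
  latenciesMs: number[],
  errors: number,
  wallClockMs: number,
): CellSummary {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const ops = sorted.length;
  return {
    p50_ms: nearestRank(sorted, 50),
    p90_ms: nearestRank(sorted, 90),
    p99_ms: nearestRank(sorted, 99),
    min_ms: ops > 0 ? sorted[0]! : NaN,
    max_ms: ops > 0 ? sorted[ops - 1]! : NaN,
    ops,
    errors,
    throughput_per_sec: wallClockMs > 0 ? ops / (wallClockMs / 1000) : NaN,
  };
}

// Ten made-up latencies collected over a 500 ms cell.
const summary = summarizeCell([12, 3, 7, 5, 40, 9, 6, 8, 4, 10], 0, 500);
console.log(summary.p50_ms, summary.p90_ms, summary.p99_ms); // 7 12 40
console.log(summary.throughput_per_sec); // 20
```

Throughput here is computed from wall-clock time rather than the sum of latencies, so it reflects concurrency: ten requests in flight at once raise ops / sec even when each request is no faster.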

Default matrix:

- `N` ∈ {1000, 10000, 100000}: number of memories seeded before the
  cell runs.
- `C` ∈ {1, 10, 100}: concurrent in-flight requests during the cell.
- Endpoints under test:
  - `POST /agentmemory/remember`
  - `POST /agentmemory/smart-search`
  - `GET /agentmemory/memories?latest=true`

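Each cell issues `BENCH_OPS=200` requests by default. With 200 samples
the nearest-rank p99 is the 198th-smallest latency, so only two slower
samples sit above it: enough for a first read on the tail without
dragging a 100k-seed run past tens of minutes. Raise `BENCH_OPS` when
you need a steadier p99.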
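
One way a single cell could keep `C` requests in flight while timing each one with `performance.now()` is a small worker pool. The sketch below is an assumption about the general technique, not the harness's actual code; `runCell` and its return shape are hypothetical.

```typescript
// Hypothetical helper: run `ops` calls of `task` with at most
// `concurrency` of them in flight at any moment.
async function runCell(
  task: () => Promise<void>,
  ops: number,
  concurrency: number,
): Promise<{ latenciesMs: number[]; errors: number; wallClockMs: number }> {
  const latenciesMs: number[] = [];
  let errors = 0;
  let issued = 0;

  // Each worker claims the next op until the shared counter is exhausted.
  // The check and the increment run without an intervening await, so two
  // workers never claim the same op.
  async function worker(): Promise<void> {
    while (issued < ops) {
      issued += 1;
      const start = performance.now();
      try {
        await task();
        latenciesMs.push(performance.now() - start);
      } catch {
        errors += 1; // failed requests are counted, not timed
      }
    }
  }

  const cellStart = performance.now();
  const workers = Array.from({ length: Math.min(concurrency, ops) }, () => worker());
  await Promise.all(workers);
  return { latenciesMs, errors, wallClockMs: performance.now() - cellStart };
}
```

In a real run `task` would wrap an HTTP call against one of the endpoints under test, for example a `fetch` of `GET /agentmemory/memories?latest=true` on the daemon's base URL.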

### Why p99 is the number that matters

p50 is the latency half of all requests meet or beat. p90 covers nine
requests in ten. **p99 is what the tail user sees: 1 request in 100 is
at least this slow.** Capacity planning lives here: if you want to size
a fleet, scale your daemon, or set an SLO, p99 is the number to plan
against. p50 will lie to you.
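
A small worked example makes the gap concrete. The numbers below are synthetic, not measured from the daemon, and `nearestRank` is a hypothetical stand-in for a nearest-rank percentile helper.

```typescript
// Nearest-rank percentile over an ascending-sorted array.
function nearestRank(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  const idx = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[idx]!;
}

// Synthetic cell of 200 requests: 195 take 5 ms, 5 take 500 ms.
const samples: number[] = [
  ...Array.from({ length: 195 }, () => 5),
  ...Array.from({ length: 5 }, () => 500),
].sort((a, b) => a - b);

const p50 = nearestRank(samples, 50); // 5
const p90 = nearestRank(samples, 90); // 5
const p99 = nearestRank(samples, 99); // 500

console.log({ p50, p90, p99 });
```

Both p50 and p90 report 5 ms, yet 2.5% of requests take a hundred times longer. Only p99 surfaces that.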

### Running it

```bash
# 1. Start the daemon however you normally do (npx, Docker, etc.)
npx @agentmemory/agentmemory

# 2. From the repo root, in another shell:
npm run bench:load
```

To override the matrix:

```bash
BENCH_N=1000 BENCH_C=1,10 BENCH_OPS=100 npm run bench:load
```

To have the harness spawn a daemon for the run (after `npm run build`):

```bash
AGENTMEMORY_BENCH_AUTOSTART=1 npm run bench:load
```

Other env knobs (see the file header for the canonical list):

- `AGENTMEMORY_URL` — base URL of the daemon (default
`http://localhost:3111`).
- `BENCH_SEED` — seed for the `mulberry32` content RNG. Same seed +
same daemon build = byte-identical seed corpus.
- `BENCH_OUT_DIR` — where the JSON report lands (default
`benchmark/results/`).
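
For orientation, the knobs above could be read along these lines. This is a sketch under assumptions: `BenchConfig`, `parseIntList`, and `readConfig` are hypothetical names, `BENCH_N` is assumed to accept a comma-separated list the same way `BENCH_C` does in the override example, and the seed is left undefined when unset because the default is not stated here.

```typescript
// Hypothetical config shape; the real harness may structure this differently.
interface BenchConfig {
  baseUrl: string;
  sizes: number[];
  concurrencies: number[];
  opsPerCell: number;
  seed: number | undefined;
  outDir: string;
}

// Parse "1,10,100" into [1, 10, 100]; fall back when the variable is unset.
function parseIntList(raw: string | undefined, fallback: number[]): number[] {
  if (raw === undefined || raw.trim() === "") return fallback;
  const values = raw.split(",").map((part) => Number.parseInt(part.trim(), 10));
  if (values.some((v) => !Number.isInteger(v) || v <= 0)) {
    throw new Error(`expected comma-separated positive integers, got "${raw}"`);
  }
  return values;
}

// `env` is passed in (rather than read from a global) to keep the sketch
// self-contained; in Node it would typically be `process.env`.
function readConfig(env: Record<string, string | undefined>): BenchConfig {
  const rawSeed = env["BENCH_SEED"];
  return {
    baseUrl: env["AGENTMEMORY_URL"] ?? "http://localhost:3111",
    sizes: parseIntList(env["BENCH_N"], [1000, 10000, 100000]),
    concurrencies: parseIntList(env["BENCH_C"], [1, 10, 100]),
    opsPerCell: parseIntList(env["BENCH_OPS"], [200])[0]!,
    seed: rawSeed === undefined ? undefined : Number.parseInt(rawSeed, 10),
    outDir: env["BENCH_OUT_DIR"] ?? "benchmark/results/",
  };
}

// Mirrors the override example: BENCH_N=1000 BENCH_C=1,10 BENCH_OPS=100
const config = readConfig({ BENCH_N: "1000", BENCH_C: "1,10", BENCH_OPS: "100" });
console.log(config.sizes, config.concurrencies, config.opsPerCell);
```

Rejecting malformed lists up front is a design choice in this sketch: a typo such as `BENCH_C=1,1O` fails before any seeding starts instead of silently dropping a cell.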

### Where results land

`benchmark/results/load-100k-<short-git-sha>.json`. The harness
`mkdir -p`s the directory. The file has a `schema_version: 1` field so
future format changes don't silently break consumers.
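
A consumer of these reports might build the path and guard on the schema version roughly as follows. Only the file-name pattern and the `schema_version` field come from the text above; `reportPath`, `assertSupportedReport`, and the placeholder sha `abc1234` are hypothetical.

```typescript
const SUPPORTED_SCHEMA_VERSION = 1;

// Build `<outDir>/load-100k-<short-git-sha>.json`, tolerating a missing
// trailing slash on the directory.
function reportPath(outDir: string, shortSha: string): string {
  const dir = outDir.endsWith("/") ? outDir : `${outDir}/`;
  return `${dir}load-100k-${shortSha}.json`;
}

// Refuse to interpret a report whose schema this consumer does not know.
function assertSupportedReport(report: unknown): void {
  const version =
    typeof report === "object" && report !== null
      ? (report as { schema_version?: unknown }).schema_version
      : undefined;
  if (version !== SUPPORTED_SCHEMA_VERSION) {
    throw new Error(`unsupported schema_version: ${String(version)}`);
  }
}

console.log(reportPath("benchmark/results", "abc1234"));
// benchmark/results/load-100k-abc1234.json
assertSupportedReport({ schema_version: 1 }); // passes silently
```

Failing loudly on an unknown version is what keeps a future format change from being misread as a latency regression.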

### Content generation is seedable

Synthetic memory content is built from a small noun / verb / concept
vocabulary fed by a `mulberry32(BENCH_SEED)` PRNG. Same seed + same
build = same corpus. The point isn't "realistic" content (no single
corpus is realistic for every workload); the point is
**reproducibility**. Re-running the harness against the same git sha
should send the same content mixture in, so latency variance comes from
the daemon and not from JSON payload jitter.
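
To show why a seeded PRNG gives a repeatable corpus, here is a minimal sketch using the widely circulated `mulberry32` routine. Whether the harness uses this exact variant is an assumption, and the vocabulary lists, `pick`, and `makeCorpus` are hypothetical placeholders.

```typescript
// mulberry32: a small 32-bit PRNG. Returns a function yielding floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Placeholder vocabulary; the harness's real word lists are not shown here.
const NOUNS = ["cache", "index", "session"] as const;
const VERBS = ["stores", "evicts", "rebuilds"] as const;
const CONCEPTS = ["latency", "recall", "durability"] as const;

// Pick one item using the next PRNG draw.
function pick<T>(rng: () => number, items: readonly T[]): T {
  return items[Math.floor(rng() * items.length)]!;
}

// Every draw comes from the one seeded stream, so the same seed and the
// same code always produce the same strings in the same order.
function makeCorpus(seed: number, count: number): string[] {
  const rng = mulberry32(seed);
  return Array.from(
    { length: count },
    () => `${pick(rng, NOUNS)} ${pick(rng, VERBS)} ${pick(rng, CONCEPTS)}`,
  );
}

const first = makeCorpus(42, 5);
const second = makeCorpus(42, 5);
console.log(first.join("|") === second.join("|")); // true
```

The sketch avoids `Math.random()` entirely; a single unseeded draw anywhere in content generation would break the same-seed, same-corpus property.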

### Publishing numbers per release

The release flow adds a `### Performance` section to that release's
entry in `CHANGELOG.md`, referencing the JSON in `benchmark/results/`
for that release's git sha. p99 is the headline number; the JSON is the
receipt.
22 changes: 22 additions & 0 deletions benchmark/lib/percentiles.ts
@@ -0,0 +1,22 @@
/**
 * Nearest-rank percentile over a pre-sorted ascending array of numbers.
 *
 * No dependencies, no allocation. The caller is responsible for sorting
 * the input ascending (`arr.sort((a, b) => a - b)`); sorting in here
 * would hide an O(n log n) cost in what looks like a cheap lookup.
 *
 * @param sorted Ascending-sorted samples. Empty array returns `NaN`.
 * @param p Percentile in [0, 100]. Values outside the range are clamped; `NaN` returns `NaN`.
 * @returns The sample at the nearest rank, or `NaN` for empty input or a `NaN` percentile.
 */
export function pXX(sorted: number[], p: number): number {
  const n = sorted.length;
  if (n === 0 || Number.isNaN(p)) return NaN; // a NaN p would otherwise index sorted[NaN]
  const clamped = Math.max(0, Math.min(100, p));
  if (clamped === 0) return sorted[0]!;
  if (clamped === 100) return sorted[n - 1]!;
  // Nearest-rank: rank = ceil(p/100 * n), index = rank - 1.
  const rank = Math.ceil((clamped / 100) * n);
  const idx = Math.min(n - 1, Math.max(0, rank - 1));
  return sorted[idx]!;
}