bench: LOCOMO-shaped memory-retrieval harness for memory_search #308
FMXExpress wants to merge 1 commit into
Self-contained adapter for measuring PasClaw's memory_search
retrieval quality on LOCOMO-shaped multi-session personas.
Designed to run WITHOUT a real LLM provider -- exercises the
retrieval layer only, which is the upper bound on agent
performance (no LLM can answer if the right facts don't surface).
Pipeline:
1. Per persona, set up a fresh $PASCLAW_HOME tempdir.
2. Write each session as one .md under workspace/memory/ with
date-stamped naming so PasClaw's daily-note injection
conventions apply.
3. memory_recall (a small Pascal CLI helper in this directory)
calls PasClaw.Memory.Index.Search per question; the harness
subprocesses it once per question.
4. Two-layer scoring per top-K result set:
- snippet-level: did any answer alias appear in the FTS5-
bounded snippet text? This is what the model sees in
one memory_search call.
- doc-level: did any alias appear in the full document body?
This is what an agent finds with a follow-up fs_read on
the cited path.
Both numbers tell different parts of the story. Snippet-
level reports the agent's first round of retrieval;
doc-level reports what an iterating agent can recover.
5. Tally R@1 / R@5 / R@10 per category and overall.
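A minimal sketch of steps 4 and 5, assuming each hit is a dict with `snippet` and `body` fields and that alias matching is a case-insensitive substring test. The field names, the matching rule, and the function names are illustrative assumptions, not the code in run.py.

```python
from collections import defaultdict

CUTOFFS = (1, 5, 10)


def contains_alias(text, aliases):
    """Case-insensitive substring match against any answer alias."""
    lowered = text.lower()
    return any(alias.lower() in lowered for alias in aliases)


def score_question(hits, aliases, cutoffs=CUTOFFS):
    """Return {k: (snippet_hit, doc_hit)} for one ranked hit list.

    'snippet' and 'body' are hypothetical field names for the bounded
    snippet text and the full document body of each hit.
    """
    scores = {}
    for k in cutoffs:
        top = hits[:k]
        snippet_hit = any(contains_alias(h["snippet"], aliases) for h in top)
        doc_hit = any(contains_alias(h["body"], aliases) for h in top)
        scores[k] = (snippet_hit, doc_hit)
    return scores


def tally(per_question):
    """Aggregate (category, scores) pairs into recall per category and overall.

    Returns {bucket: {k: (snippet_recall, doc_recall)}}.
    """
    # Each cell holds [question count, snippet hits, doc hits].
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0, 0]))
    for category, scores in per_question:
        for bucket in (category, "overall"):
            for k, (snippet_hit, doc_hit) in scores.items():
                cell = counts[bucket][k]
                cell[0] += 1
                cell[1] += int(snippet_hit)
                cell[2] += int(doc_hit)
    return {
        bucket: {k: (c[1] / c[0], c[2] / c[0]) for k, c in by_k.items()}
        for bucket, by_k in counts.items()
    }
```

Keeping the two booleans side by side per cutoff makes the gap between the snippet and doc columns fall straight out of the tally.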
Files:
* memory_recall.pas -- CLI shim over PasClaw.Memory.Index.
Reads --home, --query, --k. Falls back from
NewVectorMemoryIndex to NewMemoryIndex when the vector
runtime isn't on disk, so the harness works on any FPC
build without `pasclaw memory provision`.
* run.py -- main harness. Loader, scorer, summariser.
* fixture/alice_synthetic.json -- hand-rolled LOCOMO-shaped
persona with 1 persona, 3 sessions, 8 questions across
single-hop / multi-hop / temporal / adversarial categories.
Exists to smoke-test the harness end-to-end without
requiring the real LOCOMO dataset download.
* README.md -- design + run procedure + how to wire up real
LOCOMO (snap-stanford/locomo) via a small shape adapter
against the same loader contract.
* .gitignore -- compiled binary + per-run results dumps.
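The once-per-question subprocess call to the helper could look roughly like the sketch below. The `--home`, `--query`, and `--k` flags come from the description above; the assumption that the helper prints a single JSON object on stdout, the `recall` signature, and the timeout value are illustrative and not taken from run.py.

```python
import json
import subprocess


def recall(binary, home, query, k, timeout=30.0):
    """Run the recall helper once and return its parsed JSON output.

    `binary` is an argv prefix (a list of strings), which lets a test
    substitute a stub for the compiled helper. Raises on a non-zero exit
    status, a timeout, or output that is not valid JSON.
    """
    argv = list(binary) + ["--home", home, "--query", query, "--k", str(k)]
    proc = subprocess.run(
        argv,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # non-zero exit raises CalledProcessError
    )
    return json.loads(proc.stdout)
```

Letting errors raise here, rather than converting them to an empty result, leaves the caller free to decide how a helper failure should be reported.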
Smoke result on the bundled persona (memory_search hybrid
FTS5+vec via RRF, current PasClaw default):
category          n   snippet R@10   doc R@10
---------------  ---  ------------   --------
adversarial       1       0.000        1.000
multi-hop         2       1.000        1.000
single-hop        4       0.750        1.000
temporal          1       1.000        1.000
---------------  ---  ------------   --------
overall           8       0.750        1.000
Read: retrieval lands the right DOCUMENT almost always (doc R@10
~ 1.0 on the synthetic), but the FTS5-bounded snippet window
often clips the answer line (snippet R@10 ~ 0.75). Actionable:
either widen snippet windows in PasClaw.Memory.Index, or train
the agent to follow up with fs_read on retrieved citations.
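For readers unfamiliar with the "FTS5+vec via RRF" fusion named with the smoke result, here is a generic reciprocal rank fusion sketch. It illustrates the standard technique only; the constant `c=60` is the commonly cited default, and nothing here is taken from PasClaw.Memory.Index, whose actual constants and tie-breaking are not shown in this PR.

```python
def rrf_fuse(rankings, c=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (c + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because fusion works on document ranks, it decides which document surfaces, not which lines end up inside the snippet window, which is consistent with the doc-level numbers being higher than the snippet-level ones.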
These numbers are from a SYNTHETIC persona designed to be
recoverable; real LOCOMO will be lower because of:
- more distractor content per persona (35+ sessions vs 3)
- more cross-session synthesis required for multi-hop
- real adversarial cases (negation, supersession) that
PasClaw's no-consolidation memory model doesn't address
The harness is the regression target for future memory work
(Gauss decay, auto-consolidation, supersession) -- numbers can
be tracked over time as the layer evolves.
No Pascal source changes; the helper compiles against PasClaw's
existing memory units.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cfdc26fac9
    except Exception as e:
        print(f"[bench] FAIL on {q['id']}: {e}", file=sys.stderr)
        r = {"hits": []}
Fail the bench when recall crashes
Because this catches every exception from recall() and substitutes {"hits": []}, infrastructure failures such as a helper crash, timeout, invalid JSON, or sqlite unavailability are recorded as retrieval misses and the command still writes a summary/returns 0. In CI or regression runs that makes a broken harness look like poor recall, so the bench should propagate the error or exit nonzero after recording failures.
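One way to act on this suggestion is sketched below: failures are logged, kept separate from retrieval misses, and turned into a nonzero exit status. `run_questions`, `exit_code`, and the `recall_fn` callable are hypothetical names used for illustration, not functions in run.py.

```python
import sys


def run_questions(questions, recall_fn):
    """Call recall_fn once per question.

    Returns (results, failures). A failed call is logged and recorded in
    `failures`; it is never scored as an empty hit list.
    """
    results, failures = [], []
    for q in questions:
        try:
            r = recall_fn(q)
        except Exception as e:
            print(f"[bench] FAIL on {q['id']}: {e}", file=sys.stderr)
            failures.append((q["id"], repr(e)))
            continue
        results.append((q, r))
    return results, failures


def exit_code(failures):
    """Nonzero when any question hit an infrastructure failure."""
    return 1 if failures else 0
```

A caller would score only `results`, report `len(failures)` alongside the summary, and finish with `sys.exit(exit_code(failures))` so CI sees a broken harness as a failure rather than as low recall.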
    for k in (1, 5, 10):
        snippet_hit = False
        doc_hit = False
        for entry in hits[:k]:
Do not report R@10 from truncated hit lists
When the runner is invoked with --k below one of the fixed cutoffs, hits[:k] has already been truncated by recall(), but the loop still reports R@5 and R@10 as if those ranks had been retrieved. For example --k 1 makes every R@5/R@10 value actually score only the top 1 hit, producing misleading benchmark results; either require --k >= 10 or omit cutoffs larger than the retrieved K.
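Both remedies named in this comment can be sketched in a few lines. The helper names `usable_cutoffs` and `require_full_depth` are hypothetical; only the cutoffs (1, 5, 10) come from the code under review.

```python
CUTOFFS = (1, 5, 10)


def usable_cutoffs(retrieved_k, cutoffs=CUTOFFS):
    """Option A: keep only the cutoffs the retrieved depth can support."""
    return tuple(k for k in cutoffs if k <= retrieved_k)


def require_full_depth(retrieved_k, cutoffs=CUTOFFS):
    """Option B: reject a --k smaller than the largest reported cutoff."""
    if retrieved_k < max(cutoffs):
        raise ValueError(
            f"--k {retrieved_k} is below the largest cutoff {max(cutoffs)}"
        )
    return retrieved_k
```

Option A keeps shallow runs usable while dropping the columns that would mislead; option B keeps every run comparable at the cost of refusing shallow ones.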
Summary
Self-contained adapter that runs LOCOMO-shaped multi-session personas through PasClaw's `memory_search` and reports retrieval scores. It does not need a real LLM provider: it exercises the retrieval layer only, which is the upper bound on agent performance (no LLM can answer if the right facts don't surface).

This is the "I act as the provider/judge inside myself" setup you asked for: the harness writes conversations into PasClaw's memory store, queries `memory_search`, and dumps snippets; I read the dumps and judge whether the gold answer is present. No external API calls, no per-turn LLM costs.

Two-layer scoring
For each top-K result set:
- Snippet-level: did any answer alias appear in the FTS5-bounded snippet text? This is what the model sees in one `memory_search` call.
- Doc-level: did any alias appear in the full document body? This is what an agent finds with a follow-up `fs_read` on the cited path.

Both numbers tell different parts of the story. Snippet-level is the agent's first round; doc-level is what an iterating agent can recover.
Files
`memory_recall.pas` falls back from `NewVectorMemoryIndex` → `NewMemoryIndex` when the vector runtime isn't provisioned, so the harness works on any FPC build without first running `pasclaw memory provision`.

Smoke result on the bundled persona
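Read: PasClaw's `memory_search` lands the right document almost always (doc R@10 ~ 1.0 on this synthetic), but the FTS5 snippet window often clips the answer line (snippet R@10 ~ 0.75). Actionable: either widen snippet windows in `PasClaw.Memory.Index`, or train the agent to follow up with `fs_read` on retrieved citations more aggressively.

These numbers are from a synthetic persona designed to be recoverable. Real LOCOMO will be lower because of:
- more distractor content per persona (35+ sessions vs 3)
- more cross-session synthesis required for multi-hop
- real adversarial cases (negation, supersession) that PasClaw's no-consolidation memory model doesn't address
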
Running against real LOCOMO
Not bundled because LOCOMO has its own download terms.
Why no Pascal source changes
The helper compiles against PasClaw's existing `PasClaw.Memory.Index` / `PasClaw.Memory.Vector` units. The whole adapter is bench-shaped scaffolding around the existing memory API, with no internal PasClaw behavior change.

Test plan
The harness is also the regression target for future memory work (Gauss decay, auto-consolidation, supersession) — these numbers can be tracked over time as the layer evolves.
Generated by Claude Code