bench: LOCOMO-shaped memory-retrieval harness for memory_search by FMXExpress · Pull Request #308 · FMXExpress/PasClaw

FMXExpress · 2026-06-18T13:45:23Z

Summary

Self-contained adapter that runs LOCOMO-shaped multi-session personas through PasClaw's memory_search and reports retrieval scores. Does not need a real LLM provider — it exercises the retrieval layer only, which is the upper bound on agent performance (no LLM can answer if the right facts don't surface).

This is the "I act as the provider/judge inside myself" setup you asked for: the harness writes conversations into PasClaw's memory store, queries memory_search, dumps snippets — I read the dumps and judge whether the gold answer is present. No external API calls, no per-turn LLM costs.

Two-layer scoring

For each top-K result set:

- snippet-level: did the gold answer (or any alias) appear in the FTS5-bounded snippet text? This is what the model sees in one memory_search call.
- doc-level: did it appear anywhere in the full document body of the top-K hits' source files? This is what an agent reaches with a follow-up fs_read on the cited path.

Both numbers tell different parts of the story. Snippet-level is the agent's first round; doc-level is what an iterating agent can recover.
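A minimal sketch of the two layers, not the code in run.py. The hit shape (`path`, `snippet`), the case-insensitive substring match, and the helper names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def contains_alias(text: str, aliases: list[str]) -> bool:
    """True if any gold alias occurs in text (case-insensitive substring)."""
    haystack = text.casefold()
    return any(alias.casefold() in haystack for alias in aliases)


def score_hits(hits: list[dict], aliases: list[str], k: int) -> tuple[bool, bool]:
    """Return (snippet_hit, doc_hit) over the top-k hits.

    Each hit is assumed to look like {"path": "...", "snippet": "..."}.
    """
    snippet_hit = False
    doc_hit = False
    for hit in hits[:k]:
        # Layer 1: what the model sees directly in the search result.
        if contains_alias(hit.get("snippet", ""), aliases):
            snippet_hit = True
        # Layer 2: what a follow-up read of the cited file would expose.
        path = Path(hit.get("path", ""))
        if path.is_file() and contains_alias(path.read_text(encoding="utf-8"), aliases):
            doc_hit = True
    return snippet_hit, doc_hit


def demo() -> tuple[bool, bool]:
    """The answer sits in the document but outside the snippet window."""
    with tempfile.TemporaryDirectory() as tmp:
        doc = Path(tmp) / "2026-01-05.md"
        doc.write_text(
            "Alice works in Lisbon.\nHer beagle is named Biscuit.\n",
            encoding="utf-8",
        )
        hits = [{"path": str(doc), "snippet": "Alice works in Lisbon."}]
        return score_hits(hits, ["Biscuit"], k=10)
```

In this toy case the snippet layer misses and the doc layer hits, which is the "found doc but snippet too narrow" pattern the two numbers are meant to separate.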

Files

bench/locomo/
├── README.md            design + run procedure + real-LOCOMO adapter notes
├── fixture/
│   └── alice_synthetic.json    1 persona, 3 sessions, 8 questions
├── memory_recall.pas    Pascal CLI shim over PasClaw.Memory.Index.Search
├── run.py               main harness (loader / scorer / summariser)
└── .gitignore           compiled binary + per-run results

memory_recall.pas falls back from NewVectorMemoryIndex → NewMemoryIndex when the vector runtime isn't provisioned, so the harness works on any FPC build without first running pasclaw memory provision.

Smoke result on the bundled persona

category         n    snippet R@10    doc R@10
adversarial      1    0.000           1.000
multi-hop        2    1.000           1.000
single-hop       4    0.750           1.000
temporal         1    1.000           1.000
---------------  ---  ------------    --------
overall          8    0.750           1.000

Read: PasClaw's memory_search lands the right document for every question in this synthetic fixture (doc R@10 = 1.000), but the FTS5 snippet window clips the answer line for 2 of the 8 questions (snippet R@10 = 0.750): the adversarial question and one of the four single-hop questions. Actionable: either widen snippet windows in PasClaw.Memory.Index, or train the agent to follow up with fs_read on retrieved citations more aggressively.
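As a worked check, the overall row is the question-weighted mean of the category rows. The sketch below only re-derives the numbers from the table; the `rows` layout and function names are illustrative, not taken from run.py.

```python
# (n, snippet R@10, doc R@10) per category, copied from the smoke-result table.
rows = {
    "adversarial": (1, 0.000, 1.000),
    "multi-hop": (2, 1.000, 1.000),
    "single-hop": (4, 0.750, 1.000),
    "temporal": (1, 1.000, 1.000),
}


def overall(rows: dict, column: int) -> float:
    """Question-weighted mean; column 0 = snippet R@10, column 1 = doc R@10."""
    total = sum(n for n, *_ in rows.values())
    hits = sum(n * scores[column] for n, *scores in rows.values())
    return hits / total


def misses(rows: dict, column: int) -> int:
    """Number of questions whose gold answer was not found at this layer."""
    return round(sum(n * (1.0 - scores[column]) for n, *scores in rows.values()))
```

With these rows, `overall(rows, 0)` gives 0.75 and `misses(rows, 0)` gives 2, so the snippet layer misses two of eight questions while the doc layer misses none.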

These numbers are from a synthetic persona designed to be recoverable. Real LOCOMO will be lower because of:

- More distractor content per persona (35+ sessions vs 3)
- More cross-session synthesis required for multi-hop
- Real adversarial cases (negation, supersession) that PasClaw's no-consolidation memory model doesn't address

Running against real LOCOMO

git clone https://github.com/snap-stanford/locomo /tmp/locomo
# write a ~30-line shape adapter mapping LOCOMO's session_summary +
# dialogue field names to this harness's loader contract -- see
# load_persona() in run.py.
python3 bench/locomo/run.py --persona /tmp/locomo/data/locomo10.json

Not bundled because LOCOMO has its own download terms.

Why no Pascal source changes

The helper compiles against PasClaw's existing PasClaw.Memory.Index / PasClaw.Memory.Vector units. The whole adapter is bench-shaped scaffolding around the existing memory API — no internal PasClaw behavior change.

Test plan

- Harness runs end-to-end on the bundled fixture (results above).
- Two-layer scoring distinguishes "retrieval missed entirely" from "retrieval found doc but snippet too narrow."
- Real LOCOMO run with a shape adapter: left as a follow-up since the dataset has its own license terms.

The harness is also the regression target for future memory work (Gauss decay, auto-consolidation, supersession) — these numbers can be tracked over time as the layer evolves.

Generated by Claude Code

Self-contained adapter for measuring PasClaw's memory_search retrieval quality on LOCOMO-shaped multi-session personas. Designed to run WITHOUT a real LLM provider -- exercises the retrieval layer only, which is the upper bound on agent performance (no LLM can answer if the right facts don't surface).

Pipeline:

1. Per persona, set up a fresh $PASCLAW_HOME tempdir.
2. Write each session as one .md under workspace/memory/ with date-stamped naming so PasClaw's daily-note injection conventions apply.
3. memory_recall (a small Pascal CLI helper in this directory) calls PasClaw.Memory.Index.Search per question; the harness subprocesses it once per question.
4. Two-layer scoring per top-K result set:
   - snippet-level: did any answer alias appear in the FTS5-bounded snippet text? This is what the model sees in one memory_search call.
   - doc-level: did any alias appear in the full document body? This is what an agent finds with a follow-up fs_read on the cited path.
   Both numbers tell different parts of the story. Snippet-level reports the agent's first round of retrieval; doc-level reports what an iterating agent can recover.
5. Tally R@1 / R@5 / R@10 per category and overall.

Files:

* memory_recall.pas -- CLI shim over PasClaw.Memory.Index. Reads --home, --query, --k. Falls back from NewVectorMemoryIndex to NewMemoryIndex when the vector runtime isn't on disk, so the harness works on any FPC build without `pasclaw memory provision`.
* run.py -- main harness. Loader, scorer, summariser.
* fixture/alice_synthetic.json -- hand-rolled LOCOMO-shaped persona with 1 persona, 3 sessions, 8 questions across single-hop / multi-hop / temporal / adversarial categories. Exists to smoke-test the harness end-to-end without requiring the real LOCOMO dataset download.
* README.md -- design + run procedure + how to wire up real LOCOMO (snap-stanford/locomo) via a small shape adapter against the same loader contract.
* .gitignore -- compiled binary + per-run results dumps.

Smoke result on the bundled persona (memory_search hybrid FTS5+vec via RRF, current PasClaw default):

category         n    snippet R@10    doc R@10
---------------  ---  ------------    --------
adversarial      1    0.000           1.000
multi-hop        2    1.000           1.000
single-hop       4    0.750           1.000
temporal         1    1.000           1.000
---------------  ---  ------------    --------
overall          8    0.750           1.000

Read: retrieval lands the right DOCUMENT almost always (doc R@10 ~ 1.0 on the synthetic), but the FTS5-bounded snippet window often clips the answer line (snippet R@10 ~ 0.75). Actionable: either widen snippet windows in PasClaw.Memory.Index, or train the agent to follow up with fs_read on retrieved citations.

These numbers are from a SYNTHETIC persona designed to be recoverable; real LOCOMO will be lower because of:

- more distractor content per persona (35+ sessions vs 3)
- more cross-session synthesis required for multi-hop
- real adversarial cases (negation, supersession) that PasClaw's no-consolidation memory model doesn't address

The harness is the regression target for future memory work (Gauss decay, auto-consolidation, supersession) -- numbers can be tracked over time as the layer evolves.

No Pascal source changes; the helper compiles against PasClaw's existing memory units.
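Pipeline steps 1 to 3 in the commit message could look roughly like the sketch below. It is not the code in run.py. The session dict shape (`date`, `text`), the `<YYYY-MM-DD>.md` file name, the `./memory_recall` binary path, and the space-separated flag form are all assumptions. Only the workspace/memory/ location and the --home, --query, --k flag names come from the description.

```python
import tempfile
from pathlib import Path


def write_sessions(home: Path, sessions: list[dict]) -> list[Path]:
    """Write each session as one date-stamped .md under workspace/memory/."""
    memory_dir = home / "workspace" / "memory"
    memory_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for session in sessions:
        # Assumed naming scheme: <YYYY-MM-DD>.md; the real harness may differ.
        path = memory_dir / f"{session['date']}.md"
        path.write_text(session["text"], encoding="utf-8")
        written.append(path)
    return written


def recall_command(binary: str, home: Path, query: str, k: int) -> list[str]:
    """Build the argv for one memory_recall call (one call per question)."""
    # Passing each value as a separate argv element is an assumption.
    return [binary, "--home", str(home), "--query", query, "--k", str(k)]


def demo() -> list[str]:
    """Fresh temp home per persona, one session file, one recall argv."""
    with tempfile.TemporaryDirectory() as tmp:
        home = Path(tmp)
        write_sessions(home, [{"date": "2026-01-05", "text": "Alice moved to Lisbon."}])
        return recall_command("./memory_recall", home, "Where does Alice live?", 10)
```

The argv list would then be handed to something like `subprocess.run(..., capture_output=True, text=True)`, with the helper's output parsed by the harness. The output format is not shown in this PR, so it is left out here.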

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cfdc26fac9


chatgpt-codex-connector · 2026-06-18T13:48:55Z

+            except Exception as e:
+                print(f"[bench] FAIL on {q['id']}: {e}", file=sys.stderr)
+                r = {"hits": []}


Fail the bench when recall crashes

Because this catches every exception from recall() and substitutes {"hits": []}, infrastructure failures such as a helper crash, timeout, invalid JSON, or sqlite unavailability are recorded as retrieval misses and the command still writes a summary/returns 0. In CI or regression runs that makes a broken harness look like poor recall, so the bench should propagate the error or exit nonzero after recording failures.
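One possible shape for that fix, as a sketch only: keep logging each failure, but track failed question ids separately from genuine misses and turn them into a nonzero exit status. The `recall` callable, the `question` key, and the function names are assumptions. Only `q['id']` and the log line come from the quoted diff.

```python
import sys


def run_questions(questions: list[dict], recall) -> tuple[dict, list[str]]:
    """Run recall once per question; return (results, failed_ids).

    A crashed question is left out of results so it can't be scored as a miss.
    """
    results = {}
    failed_ids = []
    for q in questions:
        try:
            results[q["id"]] = recall(q["question"])
        except Exception as e:
            print(f"[bench] FAIL on {q['id']}: {e}", file=sys.stderr)
            failed_ids.append(q["id"])
    return results, failed_ids


def exit_status(failed_ids: list[str]) -> int:
    """Nonzero when any question hit an infrastructure failure."""
    return 1 if failed_ids else 0
```

A caller would finish with `sys.exit(exit_status(failed_ids))` after writing the summary, so CI sees a broken harness as a failure and not as poor recall.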


chatgpt-codex-connector · 2026-06-18T13:48:55Z

+    for k in (1, 5, 10):
+        snippet_hit = False
+        doc_hit = False
+        for entry in hits[:k]:


Do not report R@10 from truncated hit lists

When the runner is invoked with --k below one of the fixed cutoffs, hits[:k] has already been truncated by recall(), but the loop still reports R@5 and R@10 as if those ranks had been retrieved. For example --k 1 makes every R@5/R@10 value actually score only the top 1 hit, producing misleading benchmark results; either require --k >= 10 or omit cutoffs larger than the retrieved K.
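The second option (omit cutoffs larger than the retrieved K) can be sketched as a small filter applied before the scoring loop. The names `CUTOFFS` and `valid_cutoffs` are hypothetical. The cutoffs themselves are the (1, 5, 10) tuple from the quoted diff.

```python
CUTOFFS = (1, 5, 10)


def valid_cutoffs(requested_k: int, cutoffs: tuple = CUTOFFS) -> tuple:
    """Keep only cutoffs that the retrieved top-K list can actually support."""
    return tuple(k for k in cutoffs if k <= requested_k)
```

With `--k 1` only R@1 would be reported, and with `--k 7` R@1 and R@5. The first option, requiring `--k >= 10`, would instead be a single argument check at startup.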


chatgpt-codex-connector Bot reviewed Jun 18, 2026


FMXExpress mentioned this pull request Jun 18, 2026

memory/kb: widen FTS5 snippet window 24 → 60 + nudge agent to follow up with fs_read #309

Merged

5 tasks

FMXExpress wants to merge 1 commit into main from claude/locomo-memory-recall-harness
