bench: LOCOMO-shaped memory-retrieval harness for memory_search #308

Open
FMXExpress wants to merge 1 commit into main from claude/locomo-memory-recall-harness
Conversation

@FMXExpress

Owner

Summary

Self-contained adapter that runs LOCOMO-shaped multi-session personas through PasClaw's memory_search and reports retrieval scores. Does not need a real LLM provider — it exercises the retrieval layer only, which is the upper bound on agent performance (no LLM can answer if the right facts don't surface).

This is the "I act as the provider/judge inside myself" setup you asked for: the harness writes conversations into PasClaw's memory store, queries memory_search, dumps snippets — I read the dumps and judge whether the gold answer is present. No external API calls, no per-turn LLM costs.

Two-layer scoring

For each top-K result set:

  • snippet-level — did the gold answer (or any alias) appear in the FTS5-bounded snippet text? This is what the model sees in one memory_search call.
  • doc-level — did it appear anywhere in the full document body of the top-K hits' source files? This is what an agent reaches with a follow-up fs_read on the cited path.

The two numbers tell different parts of the story. Snippet-level measures the agent's first retrieval round; doc-level measures what an iterating agent can recover.
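A minimal sketch of the two checks, assuming case-insensitive substring matching and a hit shape with `snippet` and `path` keys. Both the matching rule and the key names are illustrative assumptions; run.py may differ.

```python
from pathlib import Path


def contains_alias(text: str, aliases: list[str]) -> bool:
    """Case-insensitive substring match against any gold alias (assumed rule)."""
    lowered = text.lower()
    return any(alias.lower() in lowered for alias in aliases)


def score_hits(hits: list[dict], aliases: list[str], k: int) -> tuple[bool, bool]:
    """Return (snippet_hit, doc_hit) over the top-k hits.

    Each hit is assumed to look like {"snippet": str, "path": str};
    these key names are hypothetical, not taken from run.py.
    """
    snippet_hit = False
    doc_hit = False
    for hit in hits[:k]:
        # Snippet-level: what the model sees in one memory_search call.
        if contains_alias(hit.get("snippet", ""), aliases):
            snippet_hit = True
        # Doc-level: what a follow-up fs_read on the cited path would expose.
        path = Path(hit.get("path", ""))
        if path.is_file() and contains_alias(path.read_text(encoding="utf-8"), aliases):
            doc_hit = True
    return snippet_hit, doc_hit
```

A question where this returns `(False, True)` is the "found the document, clipped the answer" case that the doc-level column is meant to expose.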

Files

bench/locomo/
├── README.md            design + run procedure + real-LOCOMO adapter notes
├── fixture/
│   └── alice_synthetic.json    1 persona, 3 sessions, 8 questions
├── memory_recall.pas    Pascal CLI shim over PasClaw.Memory.Index.Search
├── run.py               main harness (loader / scorer / summariser)
└── .gitignore           compiled binary + per-run results

memory_recall.pas falls back from NewVectorMemoryIndex to NewMemoryIndex when the vector runtime isn't provisioned, so the harness works on any FPC build without first running pasclaw memory provision.

Smoke result on the bundled persona

category         n    snippet R@10    doc R@10
---------------  ---  ------------    --------
adversarial      1    0.000           1.000
multi-hop        2    1.000           1.000
single-hop       4    0.750           1.000
temporal         1    1.000           1.000
---------------  ---  ------------    --------
overall          8    0.750           1.000

Read: PasClaw's memory_search lands the right document on every question in this synthetic fixture (doc R@10 = 1.000), but the FTS5 snippet window clips the answer line on 2 of the 8 questions (snippet R@10 = 0.750). Actionable: either widen snippet windows in PasClaw.Memory.Index, or train the agent to follow up with fs_read on retrieved citations more aggressively.
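As a sanity check on the table, the overall row can be reproduced from the per-category rows. The sketch below assumes the overall figure is a question-weighted (micro) average, which is what the reported 0.750 matches; an unweighted mean of the four category scores would give 0.6875 instead.

```python
# Per-category rows copied from the smoke table: (n, snippet R@10, doc R@10).
rows = {
    "adversarial": (1, 0.000, 1.000),
    "multi-hop": (2, 1.000, 1.000),
    "single-hop": (4, 0.750, 1.000),
    "temporal": (1, 1.000, 1.000),
}


def micro_average(rows: dict[str, tuple[int, float, float]], column: int) -> float:
    """Weight each category's recall by its question count."""
    total = sum(row[0] for row in rows.values())
    hits = sum(row[0] * row[column] for row in rows.values())
    return hits / total


snippet_overall = micro_average(rows, 1)
doc_overall = micro_average(rows, 2)
# Questions whose document was retrieved but whose snippet missed the answer.
clipped = round(sum(n * (doc - snip) for n, snip, doc in rows.values()))

print(f"{snippet_overall:.3f} {doc_overall:.3f} {clipped}")  # 0.750 1.000 2
```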

These numbers are from a synthetic persona designed to be recoverable. Real LOCOMO will be lower because of:

  • More distractor content per persona (35+ sessions vs 3)
  • More cross-session synthesis required for multi-hop
  • Real adversarial cases (negation, supersession) that PasClaw's no-consolidation memory model doesn't address

Running against real LOCOMO

git clone https://github.com/snap-stanford/locomo /tmp/locomo
# write a ~30-line shape adapter mapping LOCOMO's session_summary +
# dialogue field names to this harness's loader contract -- see
# load_persona() in run.py.
python3 bench/locomo/run.py --persona /tmp/locomo/data/locomo10.json

Not bundled because LOCOMO has its own download terms.

Why no Pascal source changes

The helper compiles against PasClaw's existing PasClaw.Memory.Index / PasClaw.Memory.Vector units. The whole adapter is bench-shaped scaffolding around the existing memory API — no internal PasClaw behavior change.

Test plan

  • Harness runs end-to-end on the bundled fixture (results above).
  • Two-layer scoring distinguishes "retrieval missed entirely" from "retrieval found doc but snippet too narrow."
  • Real LOCOMO run with a shape adapter — left as a follow-up since the dataset has its own license terms.

The harness is also the regression target for future memory work (Gauss decay, auto-consolidation, supersession) — these numbers can be tracked over time as the layer evolves.


Generated by Claude Code

Self-contained adapter for measuring PasClaw's memory_search
retrieval quality on LOCOMO-shaped multi-session personas.
Designed to run WITHOUT a real LLM provider -- exercises the
retrieval layer only, which is the upper bound on agent
performance (no LLM can answer if the right facts don't surface).

Pipeline:

  1. Per persona, set up a fresh $PASCLAW_HOME tempdir.
  2. Write each session as one .md under workspace/memory/ with
     date-stamped naming so PasClaw's daily-note injection
     conventions apply.
  3. memory_recall (a small Pascal CLI helper in this directory)
     calls PasClaw.Memory.Index.Search per question; the harness
     subprocesses it once per question.
  4. Two-layer scoring per top-K result set:

     - snippet-level: did any answer alias appear in the FTS5-
       bounded snippet text?  This is what the model sees in
       one memory_search call.
     - doc-level: did any alias appear in the full document body?
       This is what an agent finds with a follow-up fs_read on
       the cited path.

     Both numbers tell different parts of the story.  Snippet-
     level reports the agent's first round of retrieval;
     doc-level reports what an iterating agent can recover.

  5. Tally R@1 / R@5 / R@10 per category and overall.
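Steps 1 to 3 of the pipeline can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in run.py: the session shape (`date`, `text`), the `YYYY-MM-DD.md` file-name pattern, the space-separated flag syntax, and the sample sentence are all made up for the example. Only the `workspace/memory/` location and the `--home`, `--query`, `--k` flag names come from this description.

```python
import tempfile
from pathlib import Path


def write_sessions(home: Path, sessions: list[dict]) -> list[Path]:
    """Write one date-stamped .md per session under workspace/memory/.

    Each session is assumed to look like {"date": "YYYY-MM-DD", "text": str};
    the key names and file-name pattern are hypothetical.
    """
    memory_dir = home / "workspace" / "memory"
    memory_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for session in sessions:
        path = memory_dir / f"{session['date']}.md"
        path.write_text(session["text"], encoding="utf-8")
        written.append(path)
    return written


def recall_argv(binary: str, home: Path, query: str, k: int) -> list[str]:
    """Build the helper's command line from the three documented flags."""
    return [binary, "--home", str(home), "--query", query, "--k", str(k)]


with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)  # stands in for a fresh $PASCLAW_HOME
    paths = write_sessions(home, [{"date": "2024-01-05", "text": "Alice adopted a cat."}])
    # The harness would pass this list to subprocess.run once per question.
    argv = recall_argv("./memory_recall", home, "What pet does Alice have?", 10)
```

Note that this naming scheme would overwrite a file if two sessions shared a date; a real loader would need a disambiguating suffix.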

Files:

  * memory_recall.pas -- CLI shim over PasClaw.Memory.Index.
    Reads --home, --query, --k.  Falls back from
    NewVectorMemoryIndex to NewMemoryIndex when the vector
    runtime isn't on disk, so the harness works on any FPC
    build without `pasclaw memory provision`.

  * run.py -- main harness.  Loader, scorer, summariser.

  * fixture/alice_synthetic.json -- hand-rolled LOCOMO-shaped
    fixture with 1 persona, 3 sessions, 8 questions across
    single-hop / multi-hop / temporal / adversarial categories.
    Exists to smoke-test the harness end-to-end without
    requiring the real LOCOMO dataset download.

  * README.md -- design + run procedure + how to wire up real
    LOCOMO (snap-stanford/locomo) via a small shape adapter
    against the same loader contract.

  * .gitignore -- compiled binary + per-run results dumps.

Smoke result on the bundled persona (memory_search hybrid
FTS5+vec via RRF, current PasClaw default):

  category         n    snippet R@10    doc R@10
  ---------------  ---  ------------    --------
  adversarial      1    0.000           1.000
  multi-hop        2    1.000           1.000
  single-hop       4    0.750           1.000
  temporal         1    1.000           1.000
  ---------------  ---  ------------    --------
  overall          8    0.750           1.000

Read: retrieval lands the right DOCUMENT on every question in
this synthetic fixture (doc R@10 = 1.000), but the FTS5-bounded
snippet window clips the answer line on 2 of the 8 questions
(snippet R@10 = 0.750).  Actionable: either widen snippet windows
in PasClaw.Memory.Index, or train the agent to follow up with
fs_read on retrieved citations.

These numbers are from a SYNTHETIC persona designed to be
recoverable; real LOCOMO will be lower because of:

  - more distractor content per persona (35+ sessions vs 3)
  - more cross-session synthesis required for multi-hop
  - real adversarial cases (negation, supersession) that
    PasClaw's no-consolidation memory model doesn't address

The harness is the regression target for future memory work
(Gauss decay, auto-consolidation, supersession) -- numbers can
be tracked over time as the layer evolves.

No Pascal source changes; the helper compiles against PasClaw's
existing memory units.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cfdc26fac9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread bench/locomo/run.py
Comment on lines +223 to +225
    except Exception as e:
        print(f"[bench] FAIL on {q['id']}: {e}", file=sys.stderr)
        r = {"hits": []}


P2: Fail the bench when recall crashes

Because this catches every exception from recall() and substitutes {"hits": []}, infrastructure failures such as a helper crash, timeout, invalid JSON, or sqlite unavailability are recorded as retrieval misses and the command still writes a summary/returns 0. In CI or regression runs that makes a broken harness look like poor recall, so the bench should propagate the error or exit nonzero after recording failures.
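One possible shape for that fix, as a hedged sketch: record infrastructure failures separately from retrieval misses and exit nonzero if any occurred. The `run_questions` and `exit_code` helpers and the `recall(q)` call signature are hypothetical; the real loop in run.py may be structured differently.

```python
import sys


def run_questions(questions, recall):
    """Score every question, tracking infrastructure failures separately."""
    results, failures = [], []
    for q in questions:
        try:
            r = recall(q)
        except Exception as e:
            print(f"[bench] FAIL on {q['id']}: {e}", file=sys.stderr)
            failures.append(q["id"])
            continue  # do not record a crash as a retrieval miss
        results.append((q["id"], r))
    return results, failures


def exit_code(failures) -> int:
    """Nonzero when any question failed for non-retrieval reasons."""
    return 1 if failures else 0
```

The caller would still write the summary for the questions that ran, then end with `sys.exit(exit_code(failures))` so CI sees the broken run.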


Comment thread bench/locomo/run.py
Comment on lines +186 to +189
for k in (1, 5, 10):
    snippet_hit = False
    doc_hit = False
    for entry in hits[:k]:


P2: Do not report R@10 from truncated hit lists

When the runner is invoked with --k below one of the fixed cutoffs, hits[:k] has already been truncated by recall(), but the loop still reports R@5 and R@10 as if those ranks had been retrieved. For example --k 1 makes every R@5/R@10 value actually score only the top 1 hit, producing misleading benchmark results; either require --k >= 10 or omit cutoffs larger than the retrieved K.
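The second option (omit cutoffs larger than the retrieved K) could look like the sketch below. `valid_cutoffs` and `CUTOFFS` are hypothetical names introduced here for illustration; only the (1, 5, 10) cutoffs come from the reviewed code.

```python
CUTOFFS = (1, 5, 10)


def valid_cutoffs(retrieved_k: int, cutoffs: tuple[int, ...] = CUTOFFS) -> tuple[int, ...]:
    """Keep only the cutoffs that the retrieved hit list can actually cover."""
    return tuple(k for k in cutoffs if k <= retrieved_k)


# With --k 1 only R@1 is reported; with --k 10 all three cutoffs are.
print(valid_cutoffs(1))   # (1,)
print(valid_cutoffs(10))  # (1, 5, 10)
```

The scoring loop would then iterate over `valid_cutoffs(args.k)` instead of the fixed tuple, so no column claims a depth that was never retrieved.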
