Problem
The v2 LongMemEval runner (run_longmemeval_v2.py) stores individual turn-pairs as separate memories (~250 embeddings/question). This creates so much retrieval noise that relevant memories often fail to rank in the top-15 results.
Evidence
| Runner | Storage strategy | Accuracy (50q) | Refusal rate |
|---|---|---|---|
| v1 (run_longmemeval.py, March) | Full sessions (1 block/session, ~5 embeddings/question) | 22/50 (44%) | 24% |
| v2 (run_longmemeval_v2.py, April 8) | Turn-pairs (~250 blocks/question) | ~21/50 (42%) | 24% |
While final accuracy is similar, v2 has much higher retrieval noise. The main bottleneck appears to be the reader LLM (gemma3:1b) rather than ShrimPK retrieval: echo hit rate is ~80% on LME questions, but the LLM fails to extract the answer from the retrieved context.
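One way to check the retrieval-versus-reader split is to score each question twice: once for whether a gold memory was retrieved, and once for whether the final answer was correct. The sketch below is illustrative only. The `QuestionResult` fields and the `breakdown` helper are hypothetical names, not the runner's actual schema, and the sample records are made-up toy data rather than benchmark output.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    """One benchmark question (hypothetical schema, not the runner's)."""
    retrieved_ids: list    # memory ids returned by retrieval, best first
    gold_ids: set          # ids of memories that contain the answer
    answer_correct: bool   # whether the reader LLM's answer was judged correct


def breakdown(results, k=15):
    """Separate retrieval misses from reader misses at cutoff k."""
    hits = [r for r in results if set(r.retrieved_ids[:k]) & r.gold_ids]
    n = len(results)
    correct_given_hit = sum(r.answer_correct for r in hits)
    return {
        "hit_rate": len(hits) / n,
        "accuracy": sum(r.answer_correct for r in results) / n,
        # Low value here with a high hit_rate points at the reader, not retrieval.
        "reader_accuracy_given_hit": correct_given_hit / len(hits) if hits else 0.0,
    }


# Toy data: three of four questions retrieve a gold memory, one is answered correctly.
toy_results = [
    QuestionResult(["a", "b"], {"a"}, True),
    QuestionResult(["c", "d"], {"d"}, False),
    QuestionResult(["e", "f"], {"z"}, False),
    QuestionResult(["g", "h"], {"g"}, False),
]
stats = breakdown(toy_results, k=2)
```

If `reader_accuracy_given_hit` stays low when the same retrieved context is passed to a larger reader model, the reader is probably not the only problem.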
Root Cause
All ~250 embeddings per question compete in cosine similarity space. The relevant memory (containing the specific fact asked about) ranks below irrelevant turn-pairs that happen to be semantically closer to the query. With max_results=15, the relevant memory is often excluded.
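The crowding effect can be illustrated with a small, self-contained sketch. It assumes plain cosine-similarity ranking with a hard top-k cutoff. The vectors are hand-built 2-D toys, and `rank_of` and `survives_cutoff` are hypothetical helpers, not ShrimPK functions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_of(query, memories, relevant_idx):
    """1-based rank of memories[relevant_idx] when sorted by similarity to query."""
    target = cosine(query, memories[relevant_idx])
    better = sum(
        1
        for i, m in enumerate(memories)
        if i != relevant_idx and cosine(query, m) > target
    )
    return 1 + better


def survives_cutoff(query, memories, relevant_idx, max_results=15):
    """True if the relevant memory would still be returned under the cutoff."""
    return rank_of(query, memories, relevant_idx) <= max_results


query = [1.0, 0.0]
relevant = [0.6, 0.8]  # cosine 0.6 with the query
# Distractors that are all closer to the query (cosine >= ~0.707) than the relevant memory.
distractors = [[1.0, 0.05 * j] for j in range(1, 21)]

small = [relevant] + distractors[:4]   # few stored blocks
large = [relevant] + distractors       # many stored blocks

rank_small = rank_of(query, small, 0)  # 5: inside a top-15 cutoff
rank_large = rank_of(query, large, 0)  # 21: outside a top-15 cutoff
```

The relevant memory's similarity score is the same in both cases. Only the number of closer competitors changes, which is why more stored blocks per question can push it past the cutoff.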
Fix Directions
| Approach | Complexity | Notes |
|---|---|---|
| v3 hybrid chunking: store full session + extracted key facts | M | Best of both: dense context + precision retrieval |
| Reduce max_results: top-5 instead of top-15 | XS | Higher precision, lower recall |
| BM25 re-ranking: keyword filter on top of vector results | M | Addresses lexical gap |
| Better reader LLM: replace gemma3:1b with larger model | XS config | Addresses actual bottleneck |
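A minimal sketch of the BM25 re-ranking direction, assuming the vector search returns `(text, vector_score)` candidates. This is a plain Okapi BM25 written from scratch for illustration. `tokenize`, `bm25_scores`, and `rerank` are hypothetical helpers, not part of ShrimPK or the runners, and `k1`/`b` use common default values rather than tuned ones.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each doc against the query, with stats from `docs` only."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def rerank(query, candidates, top_n=5):
    """Re-order vector-search candidates by BM25, using vector score as tie-breaker."""
    lexical = bm25_scores(query, [text for text, _ in candidates])
    order = sorted(
        range(len(candidates)),
        key=lambda i: (lexical[i], candidates[i][1]),
        reverse=True,
    )
    return [candidates[i] for i in order[:top_n]]


# Toy candidates: the memory holding the fact has the lowest vector score.
candidates = [
    ("We talked about hiking trails and weather.", 0.82),
    ("My dentist appointment is on March 14.", 0.61),
    ("I enjoy cooking pasta on weekends.", 0.79),
]
reranked = rerank("When is my dentist appointment?", candidates, top_n=2)
```

In this toy case the keyword overlap lifts the dentist memory to the top even though its vector score is lowest. Computing BM25 statistics over only the retrieved candidates keeps the sketch simple. A corpus-wide index would give more stable IDF values.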
For KS74: use the v1 runner for the full 500q benchmark, build v3 hybrid chunking, and validate that the reader LLM (gemma3:1b) is the bottleneck by testing with gemma3:4b or qwen2.5:3b.
Discovered during KS73 overnight benchmark session (2026-04-08).