LongMemEval v2 runner: turn-pair chunking buries relevant memories under noise #12

@Liorrr

Problem

The v2 LongMemEval runner (run_longmemeval_v2.py) stores each turn-pair as a separate memory (~250 embeddings/question). This creates so much retrieval noise that relevant memories often fail to rank in the top-15 results.

Evidence

| Runner | Storage strategy | Accuracy (50q) | Refusal rate |
|---|---|---|---|
| v1 (run_longmemeval.py, March) | Full sessions (1 block/session, ~5 embeddings/question) | 22/50 (44%) | 24% |
| v2 (run_longmemeval_v2.py, April 8) | Turn-pairs (~250 blocks/question) | ~21/50 (42%) | 24% |

Final accuracy is similar, but v2 has much higher retrieval noise. The main bottleneck is the reader LLM (gemma3:1b), not ShrimPK retrieval: the echo hit rate is ~80% on LME questions, yet the LLM fails to extract the answer from the retrieved context.

Root Cause

Roughly 250 embeddings per question all compete in cosine similarity space. The relevant memory (the one containing the specific fact asked about) ranks below irrelevant turn-pairs that happen to be semantically closer to the query. With max_results=15, the relevant memory is often excluded.
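One way to confirm this per question is to compute where the relevant memory actually ranks. The sketch below is a minimal, standalone illustration and does not use the ShrimPK API; `cosine`, `rank_of`, and `in_top_k` are hypothetical helper names, and embeddings are assumed to be plain lists of floats.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_of(query, memories, target_index):
    """1-based rank of memories[target_index] by cosine similarity to query."""
    scores = [cosine(query, m) for m in memories]
    target = scores[target_index]
    # Count how many other memories score strictly higher than the target.
    better = sum(1 for i, s in enumerate(scores) if i != target_index and s > target)
    return 1 + better


def in_top_k(query, memories, target_index, k=15):
    """True if the target memory would survive a top-k cutoff (k plays the role of max_results)."""
    return rank_of(query, memories, target_index) <= k
```

Logging `rank_of` for the gold memory on each question would show how often it falls outside the top 15 in v2 compared with v1.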

Fix Directions

| Approach | Complexity | Notes |
|---|---|---|
| v3 hybrid chunking: store full session + extracted key facts | M | Best of both: dense context + precision retrieval |
| Reduce max_results: top-5 instead of top-15 | XS | Higher precision, lower recall |
| BM25 re-ranking: keyword filter on top of vector results | M | Addresses lexical gap |
| Better reader LLM: replace gemma3:1b with larger model | XS config | Addresses actual bottleneck |
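As a rough illustration of the BM25 re-ranking direction, the sketch below re-orders the vector search candidates by an Okapi BM25 score computed over just those candidates. It is a pure-Python approximation under stated assumptions, not the runner's code: `tokenize`, `bm25_scores`, and `rerank` are hypothetical names, the `k1`/`b` defaults are common textbook values, and candidates are assumed to be plain strings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each doc against the query, using the docs as the corpus."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    # Document frequency: number of docs containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative idf variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank(query, candidates, top_n=5):
    """Return the top_n candidates ordered by BM25 score, highest first."""
    scores = bm25_scores(query, candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]
```

One possible use is to fetch a wider vector result set and pass it through `rerank` before handing the top few blocks to the reader LLM. Whether this helps on LME questions would need to be measured.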

For KS74: Use v1 runner for 500q full benchmark. Build v3 hybrid chunking. Validate that reader LLM (gemma3:1b) is the bottleneck by testing with gemma3:4b or qwen2.5:3b.
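To check the reader-bottleneck hypothesis, failures can be split into retrieval misses and reader misses before and after swapping the model. The sketch below assumes each benchmark record carries two booleans, `retrieved_hit` (the gold memory was in the retrieved context) and `answer_correct`; both field names and the `failure_breakdown` helper are hypothetical and not taken from the runner.

```python
from collections import Counter


def failure_breakdown(records):
    """Count correct answers, reader misses, and retrieval misses.

    A reader miss is a wrong answer even though the gold memory was retrieved;
    a retrieval miss is a wrong answer where the gold memory was not retrieved.
    """
    counts = Counter({"correct": 0, "reader_miss": 0, "retrieval_miss": 0})
    for record in records:
        if record["answer_correct"]:
            counts["correct"] += 1
        elif record["retrieved_hit"]:
            counts["reader_miss"] += 1
        else:
            counts["retrieval_miss"] += 1
    return dict(counts)
```

If the reader is the bottleneck, a larger model should shrink `reader_miss` while `retrieval_miss` stays roughly constant.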

Discovered during KS73 overnight benchmark session (2026-04-08).
