LongMemEval v2 runner: turn-pair chunking buries relevant memories under noise #12

## Problem

The v2 LongMemEval runner (`run_longmemeval_v2.py`) stores each turn-pair as a separate memory (~250 embeddings/question). This creates so much retrieval noise that relevant memories often fail to rank in the top-15 results.

## Evidence

| Runner | Storage strategy | Accuracy (50q) | Refusal rate |
|--------|-----------------|----------------|--------------|
| v1 (`run_longmemeval.py`, March) | Full sessions (1 block/session, ~5 embeddings/question) | **22/50 (44%)** | 24% |
| v2 (`run_longmemeval_v2.py`, April 8) | Turn-pairs (1 block/turn-pair, ~250 embeddings/question) | ~21/50 (42%) | 24% |

Final accuracy is similar across the two runners, but v2 has much higher retrieval noise. The accuracy bottleneck appears to be the **reader LLM** (gemma3:1b) rather than ShrimPK retrieval: echo hit rate is ~80% on LME questions, yet the LLM fails to extract the answer from the retrieved context.

## Root Cause

Roughly 250 embeddings per question compete in the same cosine-similarity space. The relevant memory (the one containing the specific fact asked about) is outranked by irrelevant turn-pairs that happen to be semantically closer to the query. With `max_results=15`, the relevant memory is often cut off before it reaches the reader.
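
The ranking effect can be illustrated with a toy sketch. Everything here is hypothetical: the 2-D vectors, the `rank_of` and `is_retrieved` helpers, and the distractor generator are made up for illustration and are not ShrimPK's API or real LongMemEval embeddings. It only shows how a fixed `max_results` cutoff behaves as the number of competing embeddings grows.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_of(query, memories, target_id):
    """1-based rank of target_id when memories are sorted by similarity to query."""
    ranked = sorted(memories, key=lambda m: cosine(query, m[1]), reverse=True)
    return [mem_id for mem_id, _ in ranked].index(target_id) + 1


def is_retrieved(query, memories, target_id, max_results=15):
    """True if the target survives the top-k cutoff."""
    return rank_of(query, memories, target_id) <= max_results


def make_memories(n_distractors):
    """Synthetic store: one relevant memory plus n distractors (assumption:
    distractors drift away from the query direction in steps of 0.01)."""
    memories = [("relevant", (0.6, 0.8))]  # cosine 0.6 to the query below
    for i in range(n_distractors):
        memories.append((f"noise-{i}", (1.0, 0.01 * (i + 1))))
    return memories


query = (1.0, 0.0)
small_store = make_memories(4)    # about 5 embeddings, as in v1
large_store = make_memories(249)  # about 250 embeddings, as in v2
```

In this toy setup the relevant memory is retrieved from the small store but falls outside the top 15 in the large one, because every distractor whose cosine exceeds 0.6 is ranked ahead of it.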

## Fix Directions

| Approach | Complexity | Notes |
|----------|-----------|-------|
| **v3 hybrid chunking** — store full session + extracted key facts | M | Best of both: dense context + precision retrieval |
| **Reduce max_results** — top-5 instead of top-15 | XS | Higher precision, lower recall |
| **BM25 re-ranking** — keyword filter on top of vector results | M | Addresses lexical gap |
| **Better reader LLM** — replace gemma3:1b with larger model | XS config | Addresses actual bottleneck |
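
As a rough sketch of the BM25 re-ranking direction, the snippet below re-orders a list of already-retrieved candidate texts by an Okapi BM25 score against the query. The function names, the tokenizer, and the choice to compute document frequencies over the candidate set only are assumptions made for illustration; nothing here reflects how the runners or ShrimPK are actually wired.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rerank(query, candidates, k1=1.5, b=0.75):
    """Re-order candidates (already in vector-similarity order) by BM25 score.

    Assumption: IDF and average length are computed over the candidate set
    itself, not over the whole memory store.
    """
    docs = [tokenize(c) for c in candidates]
    n = len(docs)
    if n == 0:
        return []
    avgdl = (sum(len(d) for d in docs) / n) or 1.0
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)

    # sorted() is stable, so candidates with equal scores keep their vector order.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]


# Made-up candidates, listed in the order a vector search might return them.
candidates = [
    "We talked about the weather in Lisbon.",
    "I need to buy groceries tomorrow.",
    "My dog Biscuit had surgery on March 3rd.",
]
reranked = bm25_rerank("When did Biscuit have surgery?", candidates)
```

Here the third candidate moves to the front because it is the only one sharing query terms. Candidates with no lexical overlap score zero and keep their original relative order.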

**For KS74:** Use the v1 runner for the full 500q benchmark. Build v3 hybrid chunking. Confirm that the reader LLM (gemma3:1b) is the bottleneck by re-running with gemma3:4b or qwen2.5:3b.

Discovered during KS73 overnight benchmark session (2026-04-08).
