7 changes: 5 additions & 2 deletions .fallowrc.json
```diff
@@ -13,7 +13,8 @@
   ],
   "publicPackages": ["@atomicmemory/atomicmemory-sdk"],
   "ignorePatterns": [
-    "**/one-offs/**"
+    "**/one-offs/**",
+    "benchmarks/**"
   ],
   "rules": {
     "unused-class-members": "off",
@@ -30,6 +31,7 @@
       "tests/**",
       "scripts/**",
       "examples/**",
+      "benchmarks/**",
       "src/embedding/wasm-semantic-processor.ts"
     ]
   },
@@ -45,7 +47,8 @@
       "**/*.spec.tsx",
       "tests/**",
       "scripts/**",
-      "examples/**"
+      "examples/**",
+      "benchmarks/**"
     ]
   },
   "regression": {
```
3 changes: 3 additions & 0 deletions .gitignore
```diff
@@ -38,3 +38,6 @@ pnpm-debug.log*
 
 # Internal tech-debt notes — never commit.
 tech-debt.md
+
+# Superpowers skill plugin output — agent-generated specs/plans, internal-only.
+docs/superpowers/
```
95 changes: 95 additions & 0 deletions benchmarks/alignbench/PR-DESCRIPTION.md
@@ -0,0 +1,95 @@
# AlignBench v0 — controlled recall benchmark + falsified pronoun-rewrite fix

Adds `benchmarks/alignbench/` to the SDK: a 60-query / 55-fact controlled
benchmark for embedding-based recall, with a runner that ablates four
candidate fixes against the current Xenova/all-MiniLM-L6-v2 default.

## Why

Three observed failure modes share one signature:

1. **Partner demo** (atomicmem.filecoin.cloud): "what is my name?" returns no
recall; "what is the user's name?" returns the same fact at cosine 0.51.
2. **LongMemEval-S full n=500** (sprint 5): 31% of failures were "I don't have info"
refusals when the answer text was in the haystack.
3. **BEAM Knowledge-Update**: retrieval pulls the keyword-matching chunk
instead of the freshest one.

Each was filed as a benchmark-specific quirk. AlignBench tests whether
they're one phenomenon — and which fix actually closes the gap.

## Pre-registered hypothesis (and outcome)

Before running, I committed in writing:

> If query-side pronoun rewriting (my → the user's) doesn't lift r@5 by ≥0.25
> over baseline, the pronoun hypothesis is wrong and we look at extraction
> quality instead.

Result: query-rewrite r@5 lift = **0.000** (0.933 vs 0.933 baseline).
**Hypothesis falsified.** The diagnostic story I posted earlier — "fix it in
the SDK recall path with a pronoun rewrite" — does not survive contact with a
controlled benchmark.

This is exactly what pre-registration is for.
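
For concreteness, the rewrite that was tested looks roughly like this. A minimal sketch only: the exact rules live in `run.mjs` and may differ, and the function name here is illustrative.

```js
// Illustrative query-side pronoun rewrite (the variant falsified above).
// Maps first-person phrasing onto the third-person form the extractor stores.
function rewriteQuery(query) {
  return query
    .replace(/\bmy\b/gi, "the user's")
    .replace(/\bI am\b/gi, 'the user is')
    .replace(/\bI\b/g, 'the user')
    .replace(/\bme\b/gi, 'the user');
}

// rewriteQuery("what is my name?") -> "what is the user's name?"
```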

## What actually wins

| Variant | r@1 | r@5 | distractor_top1 | fp@control |
|---|---:|---:|---:|---:|
| baseline (current SDK) | 0.733 | 0.933 | 0.067 | 0.000 |
| **baseline, clean pool (no extraction meta-facts)** | **0.767** | **0.950** | 0.000 | 0.000 |
| query-rewrite | 0.733 | 0.933 | 0.083 (worse) | 0.000 |
| dual-storage | 0.783 | 0.933 | 0.067 | 0.000 |
| hybrid BM25 + semantic | 0.617 | 0.917 | 0.067 | **1.000** ← broken |
| combined (rewrite + BM25) | 0.650 | 0.933 | 0.083 | 1.000 |

The dominant fixable lift is **upstream of retrieval** — stopping the extractor
from emitting meta-facts like `The user asked for the user's name.` and
`As of <date>, X is a term mentioned in the conversation.`. Those poison the
embedding neighborhood for every adjacent query.
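
For concreteness, the extraction-time filter this points at (recommendation 1 below) could look like the sketch here. The patterns are illustrative, drawn from the observed distractors, not the list core would actually ship.

```js
// Illustrative extraction-time meta-fact filter (see recommendation 1 below).
// Patterns are examples only; a real list needs tuning against core's extractor output.
const META_FACT_PATTERNS = [
  /^The user (asked|requested|said|is asking)\b/i,
  /\bis a term mentioned in the conversation\b/i,
];

const isMetaFact = (fact) => META_FACT_PATTERNS.some((p) => p.test(fact));

// isMetaFact("The user asked for the user's name.") -> true  (dropped)
// isMetaFact("The user's name is Alex.")            -> false (kept)
```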

## What this PR contains

- `benchmarks/alignbench/items.json` — 55 facts, 60 scored queries, 10
controls, across 4 variation axes (pronoun, temporal, specificity,
negation) plus an extraction-style distractor pool observed in the partner
demo.
- `benchmarks/alignbench/run.mjs` — standalone Node runner using
`@huggingface/transformers` (same model as SDK). No Postgres, no network,
no SDK dependencies. Each variant produces a directly-comparable run JSON.
- `benchmarks/alignbench/runs/*.json` — all 5 variant runs committed for
diff-ability.
- `benchmarks/alignbench/RESULTS.md` — full per-axis breakdown, ablation
table, per-item failure analysis on the temporal axis, recommendations.
- `benchmarks/alignbench/README.md` — what it is, how to read it, what's out
of scope.

## What this PR does NOT contain (deliberately)

No SDK code change. Two reasons:

1. The pre-registered hypothesis was falsified, so the proposed fix (query
rewrite) doesn't earn a code change.
2. The actual leverage is in core's extraction prompt and the temporal-state
layer, neither of which is owned by this PR. Follow-up issues filed for
both.

## Recommendations (filed as follow-up issues)

| # | Where | What | Priority |
|---|---|---|---|
| 1 | core | Filter meta-facts at extraction time (drop `The user (asked\|is\|requested\|said).*` etc.) | high — biggest single lift |
| 2 | SDK | Expose `EXTRACTION_PROMPT` as a configurable surface (Ethan flagged Slack-side) | high — enables (1) for design partners |
| 3 | core/SDK | Wire core's temporal-state layer (`temporal-classifier`, `temporal-rerank`) into SDK retrieval path for time-anchored queries | medium — only fix that addresses the temporal-axis structural gap |
| 4 | SDK | Opt-in `RECALL_DUAL_STORAGE=true` for first-person-heavy workloads | low — +0.05 r@1 but 2× store size |
| 5 | — | Skip BM25 hybrid unless we ship a control-set-aware weight schedule | not recommended in this form |

## Honest limits

- n=60 is small. Treat ±0.05 differences in r@1 as noise.
- Distractor pool is hand-curated from observed SDK output. A pool sampled
from the live partner Postgres would be the gold version.
- Only the default embedding model was tested in depth. The mpnet ablation is
  one data point, not a sweep.
- AlignBench is a diagnostic instrument, not a leaderboard.
78 changes: 78 additions & 0 deletions benchmarks/alignbench/README.md
@@ -0,0 +1,78 @@
# AlignBench

A small, focused benchmark that exercises one failure mode in agentic-memory
recall: the alignment gap between **stored fact phrasing** and **query
phrasing**.

## Why

Several observed failures share the same signature:

1. SDK partner demo: "what is my name?" returns no recall, but
"what is the user's name?" returns the same fact at cosine 0.51.
2. LongMemEval-S full n=500: 31% of failures are "I don't have info" refusals
when the answer text is in the haystack.
3. BEAM Knowledge-Update regressions: model picks an older value because
retrieval brings in keyword-matching chunks rather than the freshest one.

These manifestations share one root: **embedding-and-threshold retrieval
silently returns empty when query phrasing diverges from stored phrasing**,
rather than degrading gracefully.

AlignBench isolates this in a controlled set (~100 items) so we can:
- Quantify the gap on the default SDK embedding stack
- Ablate three independent fixes (query rewrite / dual-storage / hybrid BM25)
- Pick the dominant point and regression-test against committed LoCoMo10 and
BEAM-1M numbers before shipping.

## Items

`items.json` — one array of test cases. Each case:

```json
{
"id": "pronoun-001",
"axis": "pronoun", // pronoun | temporal | specificity | negation | control
"fact": "The user's name is Alex.",
"query": "what is my name?",
"gold_in_topk": true, // expected presence in top-K
"gold_answer": "Alex" // for downstream LLM correctness
}
```

Facts are **shared across queries within an axis** — each query searches the
full fact pool, not just its own gold fact. That mimics real recall behavior.
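
A simplified sketch of what that full-pool ranking looks like with the same stack (`@huggingface/transformers` with Xenova/all-MiniLM-L6-v2). The actual `run.mjs` may be organized differently; the helper names here are illustrative.

```js
// Sketch of the ranking pass: embed every fact once, then rank the full fact
// pool for each query by cosine similarity. Helper names are illustrative.
import { readFile } from 'node:fs/promises';
import { pipeline } from '@huggingface/transformers';

const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embed(text) {
  // With normalize: true the vectors are unit-length, so dot product == cosine.
  const out = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(out.data);
}

const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

const items = JSON.parse(await readFile('items.json', 'utf8'));
const facts = [...new Set(items.map((item) => item.fact))];
const factVectors = await Promise.all(facts.map((fact) => embed(fact)));

for (const item of items) {
  const queryVector = await embed(item.query);
  const ranked = facts
    .map((fact, i) => ({ fact, score: dot(queryVector, factVectors[i]) }))
    .sort((a, b) => b.score - a.score);
  const rank = ranked.findIndex((r) => r.fact === item.fact) + 1; // 1-based; 0 if missing
  console.log(item.id, { rank, goldInTop5: rank > 0 && rank <= 5 });
}
```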

## Variation axes

| Axis | What it varies | Why it matters |
|---|---|---|
| pronoun | `my X` vs `the user's X` vs `X of <name>` | Tests bi-encoder pronoun alignment (dominant SDK failure) |
| temporal | `live in Y` vs `lived in Y` vs `as of 2026, live in Y` | Tests knowledge-update / temporal-anchor handling |
| specificity | `my dog Apollo` vs `my dog` vs `my pet` | Tests generic-vs-specific retrieval |
| negation | `I don't drink coffee` vs `I drink tea, not coffee` | Tests embedding sensitivity to polarity |
| control | unrelated facts/queries | False-positive floor (top-K shouldn't surface these) |

## Metrics

Per run (an aggregation sketch follows this list):
- **recall@1** — gold fact ranked first
- **recall@5** — gold fact in top-5
- **per-axis recall@5** — diagnostic
- **false-positive@5** — unrelated controls leaking into top-K
- **mean rank** of gold (lower is better)
- **median similarity** of gold vs distractors
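
A minimal aggregation sketch, assuming each scored item yields `{ axis, rank, isControl }` as in the ranking sketch in the Items section. Field names are illustrative, not the exact run-JSON schema; median gold-vs-distractor similarity is omitted for brevity.

```js
// Aggregate per-item ranks into the run-level metrics listed above.
// `results` is assumed to be an array of { axis, rank, isControl }; names are illustrative.
function summarize(results) {
  const scored = results.filter((r) => !r.isControl);
  const controls = results.filter((r) => r.isControl);
  const hit = (r, k) => r.rank > 0 && r.rank <= k;

  const perAxis = {};
  for (const r of scored) (perAxis[r.axis] ??= []).push(r);

  return {
    recallAt1: scored.filter((r) => hit(r, 1)).length / scored.length,
    recallAt5: scored.filter((r) => hit(r, 5)).length / scored.length,
    perAxisRecallAt5: Object.fromEntries(
      Object.entries(perAxis).map(([axis, rs]) => [
        axis,
        rs.filter((r) => hit(r, 5)).length / rs.length,
      ]),
    ),
    // A control surfacing in the top 5 counts as a false positive.
    falsePositiveAt5: controls.length
      ? controls.filter((r) => hit(r, 5)).length / controls.length
      : 0,
    meanRank: scored.reduce((sum, r) => sum + r.rank, 0) / scored.length,
  };
}
```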

## Runs

- `runs/baseline.json` — current SDK recall pipeline
- `runs/query-rewrite.json` — query-side pronoun rewrite
- `runs/dual-storage.json` — both phrasings stored
- `runs/hybrid-bm25.json` — BM25 + semantic union
- `runs/combined.json` — winning variants stacked

## Falsification

Pre-registered: if query-rewrite alone doesn't lift recall@5 by ≥0.25 over
baseline, the pronoun hypothesis is wrong and we look at extraction quality
next. Stated here so it's not adjusted after seeing data.