7 changes: 5 additions & 2 deletions .fallowrc.json
```diff
@@ -13,7 +13,8 @@
   ],
   "publicPackages": ["@atomicmemory/atomicmemory-sdk"],
   "ignorePatterns": [
-    "**/one-offs/**"
+    "**/one-offs/**",
+    "benchmarks/**"
   ],
   "rules": {
     "unused-class-members": "off",
@@ -30,6 +31,7 @@
       "tests/**",
       "scripts/**",
       "examples/**",
+      "benchmarks/**",
       "src/embedding/wasm-semantic-processor.ts"
     ]
   },
@@ -45,7 +47,8 @@
       "**/*.spec.tsx",
       "tests/**",
       "scripts/**",
-      "examples/**"
+      "examples/**",
+      "benchmarks/**"
     ]
   },
   "regression": {
```
3 changes: 3 additions & 0 deletions .gitignore
```diff
@@ -38,3 +38,6 @@ pnpm-debug.log*
 
 # Internal tech-debt notes — never commit.
 tech-debt.md
+
+# Superpowers skill plugin output — agent-generated specs/plans, internal-only.
+docs/superpowers/
```
95 changes: 95 additions & 0 deletions benchmarks/alignbench/PR-DESCRIPTION.md
@@ -0,0 +1,95 @@
# AlignBench v0 — controlled recall benchmark + falsified pronoun-rewrite fix

Adds `benchmarks/alignbench/` to the SDK: a 60-query / 55-fact controlled
benchmark for embedding-based recall, with a runner that ablates four
candidate fixes against the current Xenova/all-MiniLM-L6-v2 default.

## Why

Three observed failure modes share one signature:

1. **Partner demo** (atomicmem.filecoin.cloud): "what is my name?" returns no
recall; "what is the user's name?" returns the same fact at cosine 0.51.
2. **LongMemEval-S full n=500** (sprint 5): 31% of failures were "I don't have info"
refusals when the answer text was in the haystack.
3. **BEAM Knowledge-Update**: retrieval pulls the keyword-matching chunk
instead of the freshest one.

Each was filed as a benchmark-specific quirk. AlignBench tests whether
they're one phenomenon — and which fix actually closes the gap.

## Pre-registered hypothesis (and outcome)

Before running, I committed in writing:

> If query-side pronoun rewriting (my → the user's) doesn't lift r@5 by ≥0.25
> over baseline, the pronoun hypothesis is wrong and we look at extraction
> quality instead.

Result: query-rewrite r@5 lift = **0.000** (0.933 vs 0.933 baseline).
**Hypothesis falsified.** The diagnostic story I posted earlier — "fix it in
the SDK recall path with a pronoun rewrite" — does not survive contact with a
controlled benchmark.

This is exactly what pre-registration is for.
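
For concreteness, the rewrite that was tested looks roughly like this. A minimal sketch only: the exact rules live in `run.mjs` and may differ, and the function name here is illustrative.

```js
// Illustrative query-side pronoun rewrite (the variant falsified above).
// Maps first-person phrasing onto the third-person form the extractor stores.
function rewriteQuery(query) {
  return query
    .replace(/\bmy\b/gi, "the user's")
    .replace(/\bI am\b/gi, 'the user is')
    .replace(/\bI\b/g, 'the user')
    .replace(/\bme\b/gi, 'the user');
}

// rewriteQuery("what is my name?") -> "what is the user's name?"
```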

## What actually wins

| Variant | r@1 | r@5 | distractor_top1 | fp@control |
|---|---:|---:|---:|---:|
| baseline (current SDK) | 0.733 | 0.933 | 0.067 | 0.000 |
| **baseline, clean pool (no extraction meta-facts)** | **0.767** | **0.950** | 0.000 | 0.000 |
| query-rewrite | 0.733 | 0.933 | 0.083 (worse) | 0.000 |
| dual-storage | 0.783 | 0.933 | 0.067 | 0.000 |
| hybrid BM25 + semantic | 0.617 | 0.917 | 0.067 | **1.000** ← broken |
| combined (rewrite + BM25) | 0.650 | 0.933 | 0.083 | 1.000 |

The dominant fixable lift is **upstream of retrieval** — stopping the extractor
from emitting meta-facts like `The user asked for the user's name.` and
`As of <date>, X is a term mentioned in the conversation.`. Those poison the
embedding neighborhood for every adjacent query.
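
For concreteness, the extraction-time filter this points at (recommendation 1 below) could look like the sketch here. The patterns are illustrative, drawn from the observed distractors, not the list core would actually ship.

```js
// Illustrative extraction-time meta-fact filter (see recommendation 1 below).
// Patterns are examples only; a real list needs tuning against core's extractor output.
const META_FACT_PATTERNS = [
  /^The user (asked|requested|said|is asking)\b/i,
  /\bis a term mentioned in the conversation\b/i,
];

const isMetaFact = (fact) => META_FACT_PATTERNS.some((p) => p.test(fact));

// isMetaFact("The user asked for the user's name.") -> true  (dropped)
// isMetaFact("The user's name is Alex.")            -> false (kept)
```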

## What this PR contains

- `benchmarks/alignbench/items.json` — 55 facts, 60 scored queries, 10
controls, across 4 variation axes (pronoun, temporal, specificity,
negation) plus an extraction-style distractor pool observed in the partner
demo.
- `benchmarks/alignbench/run.mjs` — standalone Node runner using
`@huggingface/transformers` (same model as SDK). No Postgres, no network,
no SDK dependencies. Each variant produces a directly-comparable run JSON.
- `benchmarks/alignbench/runs/*.json` — all 5 variant runs committed for
diff-ability.
- `benchmarks/alignbench/RESULTS.md` — full per-axis breakdown, ablation
table, per-item failure analysis on the temporal axis, recommendations.
- `benchmarks/alignbench/README.md` — what it is, how to read it, what's out
of scope.

## What this PR does NOT contain (deliberately)

No SDK code change. Two reasons:

1. The pre-registered hypothesis was falsified, so the proposed fix (query
rewrite) doesn't earn a code change.
2. The actual leverage is in core's extraction prompt and the temporal-state
layer, neither of which is owned by this PR. Follow-up issues filed for
both.

## Recommendations (filed as follow-up issues)

| # | Where | What | Priority |
|---|---|---|---|
| 1 | core | Filter meta-facts at extraction time (drop `The user (asked\|is\|requested\|said).*` etc.) | high — biggest single lift |
| 2 | SDK | Expose `EXTRACTION_PROMPT` as a configurable surface (Ethan flagged Slack-side) | high — enables (1) for design partners |
| 3 | core/SDK | Wire core's temporal-state layer (`temporal-classifier`, `temporal-rerank`) into SDK retrieval path for time-anchored queries | medium — only fix that addresses the temporal-axis structural gap |
| 4 | SDK | Opt-in `RECALL_DUAL_STORAGE=true` for first-person-heavy workloads | low — +0.05 r@1 but 2× store size |
| 5 | — | Skip BM25 hybrid unless we ship a control-set-aware weight schedule | not recommended in this form |

## Honest limits

- n=60 is small. Treat ±0.05 differences in r@1 as noise.
- Distractor pool is hand-curated from observed SDK output. A pool sampled
from the live partner Postgres would be the gold version.
- Only the default embedding model was tested in depth. The mpnet ablation is
  one data point, not a sweep.
- AlignBench is a diagnostic instrument, not a leaderboard.
78 changes: 78 additions & 0 deletions benchmarks/alignbench/README.md
@@ -0,0 +1,78 @@
# AlignBench

A small, focused benchmark that exercises one failure mode in agentic-memory
recall: the alignment gap between **stored fact phrasing** and **query
phrasing**.

## Why

Several observed failures share the same signature:

1. SDK partner demo: "what is my name?" returns no recall, but
"what is the user's name?" returns the same fact at cosine 0.51.
2. LongMemEval-S full n=500: 31% of failures are "I don't have info" refusals
when the answer text is in the haystack.
3. BEAM Knowledge-Update regressions: model picks an older value because
retrieval brings in keyword-matching chunks rather than the freshest one.

These manifestations share one root: **embedding-and-threshold retrieval
silently returns empty when query phrasing diverges from stored phrasing**,
rather than degrading gracefully.

AlignBench isolates this in a controlled set (~100 items) so we can:
- Quantify the gap on the default SDK embedding stack
- Ablate three independent fixes (query rewrite / dual-storage / hybrid BM25)
- Pick the dominant point and regression-test against committed LoCoMo10 and
BEAM-1M numbers before shipping.

## Items

`items.json` — one array of test cases. Each case:

```json
{
"id": "pronoun-001",
"axis": "pronoun", // pronoun | temporal | specificity | negation | control
"fact": "The user's name is Alex.",
"query": "what is my name?",
"gold_in_topk": true, // expected presence in top-K
"gold_answer": "Alex" // for downstream LLM correctness
}
```

Facts are **shared across queries within an axis** — each query searches the
full fact pool, not just its own gold fact. That mimics real recall behavior.
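
A simplified sketch of what that full-pool ranking looks like with the same stack (`@huggingface/transformers` with Xenova/all-MiniLM-L6-v2). The actual `run.mjs` may be organized differently; the helper names here are illustrative.

```js
// Sketch of the ranking pass: embed every fact once, then rank the full fact
// pool for each query by cosine similarity. Helper names are illustrative.
import { readFile } from 'node:fs/promises';
import { pipeline } from '@huggingface/transformers';

const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embed(text) {
  // With normalize: true the vectors are unit-length, so dot product == cosine.
  const out = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(out.data);
}

const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

const items = JSON.parse(await readFile('items.json', 'utf8'));
const facts = [...new Set(items.map((item) => item.fact))];
const factVectors = await Promise.all(facts.map((fact) => embed(fact)));

for (const item of items) {
  const queryVector = await embed(item.query);
  const ranked = facts
    .map((fact, i) => ({ fact, score: dot(queryVector, factVectors[i]) }))
    .sort((a, b) => b.score - a.score);
  const rank = ranked.findIndex((r) => r.fact === item.fact) + 1; // 1-based; 0 if missing
  console.log(item.id, { rank, goldInTop5: rank > 0 && rank <= 5 });
}
```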

## Variation axes

| Axis | What it varies | Why it matters |
|---|---|---|
| pronoun | `my X` vs `the user's X` vs `X of <name>` | Tests bi-encoder pronoun alignment (dominant SDK failure) |
| temporal | `live in Y` vs `lived in Y` vs `as of 2026, live in Y` | Tests knowledge-update / temporal-anchor handling |
| specificity | `my dog Apollo` vs `my dog` vs `my pet` | Tests generic-vs-specific retrieval |
| negation | `I don't drink coffee` vs `I drink tea, not coffee` | Tests embedding sensitivity to polarity |
| control | unrelated facts/queries | False-positive floor (top-K shouldn't surface these) |

## Metrics

Per run (an aggregation sketch follows this list):
- **recall@1** — gold fact ranked first
- **recall@5** — gold fact in top-5
- **per-axis recall@5** — diagnostic
- **false-positive@5** — unrelated controls leaking into top-K
- **mean rank** of gold (lower is better)
- **median similarity** of gold vs distractors
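
A minimal aggregation sketch, assuming each scored item yields `{ axis, rank, isControl }` as in the ranking sketch in the Items section. Field names are illustrative, not the exact run-JSON schema; median gold-vs-distractor similarity is omitted for brevity.

```js
// Aggregate per-item ranks into the run-level metrics listed above.
// `results` is assumed to be an array of { axis, rank, isControl }; names are illustrative.
function summarize(results) {
  const scored = results.filter((r) => !r.isControl);
  const controls = results.filter((r) => r.isControl);
  const hit = (r, k) => r.rank > 0 && r.rank <= k;

  const perAxis = {};
  for (const r of scored) (perAxis[r.axis] ??= []).push(r);

  return {
    recallAt1: scored.filter((r) => hit(r, 1)).length / scored.length,
    recallAt5: scored.filter((r) => hit(r, 5)).length / scored.length,
    perAxisRecallAt5: Object.fromEntries(
      Object.entries(perAxis).map(([axis, rs]) => [
        axis,
        rs.filter((r) => hit(r, 5)).length / rs.length,
      ]),
    ),
    // A control surfacing in the top 5 counts as a false positive.
    falsePositiveAt5: controls.length
      ? controls.filter((r) => hit(r, 5)).length / controls.length
      : 0,
    meanRank: scored.reduce((sum, r) => sum + r.rank, 0) / scored.length,
  };
}
```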

## Runs

- `runs/baseline.json` — current SDK recall pipeline
- `runs/query-rewrite.json` — query-side pronoun rewrite
- `runs/dual-storage.json` — both phrasings stored
- `runs/hybrid-bm25.json` — BM25 + semantic union
- `runs/combined.json` — winning variants stacked

## Falsification

Pre-registered: if query-rewrite alone doesn't lift recall@5 by ≥0.25 over
baseline, the pronoun hypothesis is wrong and we look at extraction quality
next. Stated here so it's not adjusted after seeing data.