Per-layer conformance for agent memory: open spec, scenario suite, and public evidence.
iHow Memory is for testing whether agent memory survives handoff — across tools, sessions, and people. We focus on the retrieval layer because conformance needs deterministic, reproducible metrics.
Three things, one repo:
- Open spec (`spec/`) — Protocol terms, interfaces, event-context-writeback-audit schema
- Scenario suite (`scenarios/`) — Five reliability scenarios covering cross-tool handoff, feedback capture, constraint preservation, human team handoff, and model migration
- Conformance evidence (`conformance/`) — Public benchmark runs, including retrieval-stage LongMemEval_S 470/470
Most memory benchmarks today report end-to-end LLM-judged answer accuracy: retrieve evidence, generate an answer, judge it. That conflates two distinct failure modes:
- Memory retrieved the wrong evidence (retrieval failure)
- Memory retrieved the right evidence but the generator wrote a wrong answer (generation failure)
These have different fix paths. We report retrieval and generation separately.
Retrieval is the layer this repo's evidence speaks to today. Generation-stage scenarios are part of the v0.2 roadmap.
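The two failure modes above can be told apart mechanically once the retrieved evidence and the judged answer are recorded separately. The following is a minimal sketch, not this repo's implementation: `FailureLayer`, `attribute_failure`, the cutoff `k=10` (borrowed from the `@10` metrics), and the rule that all gold evidence must appear in the top k are illustrative assumptions.

```python
from enum import Enum
from typing import Sequence, Set


class FailureLayer(Enum):
    """Hypothetical labels for where a sample went wrong."""
    NONE = "none"
    RETRIEVAL = "retrieval"
    GENERATION = "generation"


def attribute_failure(
    retrieved_ids: Sequence[str],
    gold_ids: Set[str],
    answer_correct: bool,
    k: int = 10,
) -> FailureLayer:
    """Attribute one sample's outcome to a layer.

    Assumption: retrieval counts as successful only if every gold
    evidence id appears in the top-k retrieved ids.
    """
    evidence_found = set(gold_ids) <= set(retrieved_ids[:k])
    if not evidence_found:
        # Retrieval missed evidence; flagged regardless of the answer,
        # since a correct answer here was not grounded in memory.
        return FailureLayer.RETRIEVAL
    if not answer_correct:
        # Evidence was available, so the generator is at fault.
        return FailureLayer.GENERATION
    return FailureLayer.NONE
```

An end-to-end accuracy number collapses the last two branches into a single "wrong" bucket, which is why the per-layer split matters for choosing a fix.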
| Scenario | Metric | Result |
|---|---|---|
| LongMemEval_S retrieval (470 effective samples) | recall_all@10 | 1.0 |
| LongMemEval_S retrieval | recall_any@10 | 1.0 |
| LongMemEval_S retrieval | ndcg_any@10 | 0.946 |
| Reference implementation self-conformance | 5 spec scenarios v0.1 | 5/5 PASS |
Reproducible evidence manifest
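For readers unfamiliar with the metric names in the table, here is a minimal sketch of how per-sample values are commonly computed. It assumes binary relevance and the usual textbook definitions; the exact formulas used by the benchmark harness may differ, and the function names and sample ids are illustrative.

```python
import math
from typing import Sequence, Set


def recall_any_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 10) -> float:
    """1.0 if at least one gold evidence id is in the top k, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & gold_ids else 0.0


def recall_all_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 10) -> float:
    """1.0 only if every gold evidence id is in the top k, else 0.0."""
    return 1.0 if gold_ids <= set(ranked_ids[:k]) else 0.0


def ndcg_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: rewards placing gold evidence near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in gold_ids
    )
    # Ideal ranking puts all gold items first.
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Hypothetical sample: both gold sessions are retrieved, but not at ranks 1 and 2.
ranked = ["s3", "s1", "s7", "s2"]
gold = {"s1", "s2"}
scores = (
    recall_any_at_k(ranked, gold),
    recall_all_at_k(ranked, gold),
    ndcg_at_k(ranked, gold),
)
```

In this sample both recall values are 1.0 while nDCG is below 1.0, because the gold items are present but not ranked first. The same pattern explains how the table can show perfect recall alongside an nDCG under 1.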
Note on vendor comparisons. Other memory projects publish end-to-end LLM-judged answer accuracy on the same LongMemEval_S split — Mem0 token-efficient algorithm 93.4% (source), Mastra observational memory 94.87% (source), OMEGA 95.4% (source). Those measure a different layer than the retrieval recall reported here and are not directly comparable to our numbers. We argue memory systems should report per-layer; this evidence file is the retrieval-layer disclosure.
Spec is here, reference implementation is in ihow-memory-core
This repo holds: spec, scenarios, conformance evidence, whitepaper. Reference implementation (Apache-2.0): see ihow-memory-core.
If you maintain a memory system (Mem0, Letta, Zep, MemGPT, Cognee, Graphiti, or your own), we'd love for you to run a runner against our conformance suite. Scoring is not strictly pass-or-fail: PARTIAL and NOT_APPLICABLE are accepted results.
See conformance/runners/README.md.
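To illustrate what non-binary scoring could look like in a runner report, here is a small sketch. It is not the format defined in `conformance/runners/README.md`: the `Status` enum, the `FAIL` value, the scenario ids, and the choice to exclude NOT_APPLICABLE scenarios from the denominator are all assumptions made for the example.

```python
from collections import Counter
from enum import Enum
from typing import Dict


class Status(Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    NOT_APPLICABLE = "NOT_APPLICABLE"
    FAIL = "FAIL"  # assumed; the README only names the three above


def summarize(results: Dict[str, Status]) -> str:
    """Summarize per-scenario statuses without collapsing them to pass/fail."""
    counts = Counter(results.values())
    # Assumption: scenarios a system does not support are excluded
    # from the denominator instead of being counted as failures.
    applicable = len(results) - counts[Status.NOT_APPLICABLE]
    return (
        f"{counts[Status.PASS]}/{applicable} PASS, "
        f"{counts[Status.PARTIAL]} PARTIAL, "
        f"{counts[Status.NOT_APPLICABLE]} NOT_APPLICABLE"
    )


# Hypothetical scenario ids modeled on the five scenarios listed earlier.
example = {
    "cross-tool-handoff": Status.PASS,
    "feedback-capture": Status.PARTIAL,
    "constraint-preservation": Status.PASS,
    "human-team-handoff": Status.NOT_APPLICABLE,
    "model-migration": Status.PASS,
}
report = summarize(example)
```

Keeping the three counts separate lets a partially supported system show what it does cover without being reported as a flat failure.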
- v0.1 — Live. Five scenarios + protocol draft + retrieval-stage LongMemEval_S evidence.
- v0.2 — Scoped. Generation-stage scenarios, calibration memory, multi-runner conformance.
- External runners — Currently `PARTIAL` for OpenViking / GBrain / M-Flow. PRs welcome.
- 🌐 Site: ihowmemory.com
- 📕 Whitepaper: EN · 中文
- 💬 Discussions: GitHub Discussions
- 📧 Contact: repo issues / discussions
Built by a small team focused on multi-agent reliability. Open by default. Conformance over claims.