feat: add MemoryManager backend to LongMemEval runner by rohram04 · Pull Request #14 · mem0ai/memory-benchmarks

rohram04 · 2026-06-20T06:18:12Z

Summary

Adds a new memorymanager backend choice (--backend memorymanager) to the LongMemEval runner, enabling apples-to-apples comparison against Mem0's OSS and cloud backends with the same answerer prompt, judge, and metrics pipeline.
Introduces benchmarks/common/mm_bridge.py — a bridge module that wires up a fresh, isolated MemoryManager.Agent per question with an in-memory LongTermStore, then exposes three phases: ingest (turn-by-turn memory lifecycle), PREP (surface relevant blocks before the answerer), and PERSIST (store the exchange after).
Updates benchmarks/longmemeval/run.py to route MM questions through ingest_question_mm → process_question_answerer(agent=...), collapsing the cutoff loop to a single window since MM manages its own context budget.
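The per-question flow described above can be pictured roughly as follows. This is a minimal sketch and not the code in mm_bridge.py: `StubAgent`, its methods, and `run_question` are hypothetical stand-ins, and the word-overlap ranking is only a placeholder for whatever retrieval MemoryManager actually performs.

```python
from dataclasses import dataclass, field


@dataclass
class StubAgent:
    """Hypothetical stand-in for a per-question MemoryManager agent."""
    blocks: list[str] = field(default_factory=list)

    def ingest(self, user: str, assistant: str) -> None:
        # Ingest phase: store one haystack turn pair.
        self.blocks.append(f"user: {user}\nassistant: {assistant}")

    def prep(self, question: str, k: int = 3) -> list[str]:
        # PREP phase: surface blocks that share words with the question
        # (placeholder ranking, not MemoryManager's retrieval).
        words = set(question.lower().split())
        scored = [(len(words & set(b.lower().split())), b) for b in self.blocks]
        ranked = sorted(scored, key=lambda s: -s[0])
        return [b for score, b in ranked if score > 0][:k]

    def persist(self, question: str, answer: str) -> None:
        # PERSIST phase: store the question/answer exchange.
        self.blocks.append(f"user: {question}\nassistant: {answer}")


def run_question(pairs, question, answerer):
    agent = StubAgent()  # fresh agent per question, so no state is shared
    for user, assistant in pairs:
        agent.ingest(user, assistant)
    context = agent.prep(question)
    answer = answerer(question, context)  # the harness answerer, unchanged
    agent.persist(question, answer)
    return answer, context
```

The point of the shape is that the answerer stays outside the agent: only the memory system changes between backends.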

How it works

The MemoryManager system lives in a sibling repo (~/MemoryManager, or MEMORYMANAGER_PATH). Each benchmark question gets its own Agent instance with an isolated SQLite in-memory store — no state leaks between questions.

python run.py --backend memorymanager --mode answerer \
              --mm-max-tokens 8000 \
              --mm-model openai/gpt-4o \
              --mm-util-model openai/gpt-4o-mini
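As an illustration of why a per-question in-memory SQLite store cannot leak state, the sketch below uses Python's standard `sqlite3` module. How LongTermStore actually opens its database is not shown in this PR, so `new_question_store` and the `blocks` table are assumptions made for the example.

```python
import sqlite3


def new_question_store() -> sqlite3.Connection:
    # Each connect(":memory:") call creates a separate, private database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE blocks (id INTEGER PRIMARY KEY, text TEXT)")
    return conn


store_q1 = new_question_store()
store_q2 = new_question_store()
store_q1.execute("INSERT INTO blocks (text) VALUES (?)", ("fact from question 1",))

# The second store never sees rows written to the first.
count_q1 = store_q1.execute("SELECT COUNT(*) FROM blocks").fetchone()[0]
count_q2 = store_q2.execute("SELECT COUNT(*) FROM blocks").fetchone()[0]
```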

New CLI flags (ignored for Mem0 backends):

--mm-max-tokens — context-window token budget (default 8000)
--mm-model — main answerer model (default openai/gpt-4o)
--mm-util-model — compress/merge/novelty util model (default openai/gpt-4o-mini)
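For reference, the flags and defaults listed above could be declared with `argparse` along these lines. This is a sketch, not the parser in run.py: `build_parser` is a hypothetical name, and the other accepted `--backend` values are not enumerated here because the PR does not list them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="LongMemEval runner (sketch)")
    # Other backend names exist in the real runner but are not shown in this PR.
    parser.add_argument("--backend", required=True)
    parser.add_argument("--mode", default="answerer")
    parser.add_argument("--max-questions", type=int, default=None)
    # MemoryManager-only flags; the Mem0 backends ignore them.
    parser.add_argument("--mm-max-tokens", type=int, default=8000,
                        help="context-window token budget")
    parser.add_argument("--mm-model", default="openai/gpt-4o",
                        help="main answerer model")
    parser.add_argument("--mm-util-model", default="openai/gpt-4o-mini",
                        help="compress/merge/novelty util model")
    return parser


args = build_parser().parse_args(["--backend", "memorymanager"])
```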

Test plan

Smoke-test: run a single question with --max-questions 1 --backend memorymanager
Verify output JSON structure matches Mem0 output (same keys, same judge scores)
Run full LongMemEval-S subset and compare accuracy vs Mem0 OSS baseline

Made with Cursor

New --backend memorymanager path that benchmarks the custom MemoryManager system instead of mem0, holding chunks/answerer/judge identical (memory system is the only variable). Integrates directly at the runner's call sites: no Mem0Client replacement, no search() seam.

- benchmarks/common/mm_bridge.py: per-question Agent (isolated in-memory LongTermStore), OpenRouter gpt-4o/4o-mini backend + shared embedder singletons; mm_ingest / mm_surface_and_format (PREP) / mm_persist (PERSIST).
- run.py: --backend memorymanager + --mm-* flags; ingest_question_mm; process_question_answerer surfaces via the agent's PREP and persists via PERSIST, answer still generated by the SAME harness answerer; single managed-window cutoff; mem0 paths untouched.

Validated end-to-end through the bridge with real gpt-4o (construction, ingest, PREP, PERSIST). Embedder parity (text-embedding-3-small) is a follow-up needing a small MemoryManager-side change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-small

Embedder parity with mem0 OSS (text-embedding-3-small), routed through OpenRouter so the whole stack uses one OpenRouter key (no OpenAI key):

- route the bridge's embedder through MM's make_embedder (was hardcoding SentenceTransformer)
- default to "openrouter:openai/text-embedding-3-small" (1536-d, matches LongTermStore's default embedding_dim)
- add --mm-embedding-model (accepts openrouter:/openai: specs or a local sentence-transformers name)

Verified: OpenRouter's embeddings endpoint returns a 1536-d vector for openai/text-embedding-3-small.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
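The spec format accepted by --mm-embedding-model can be illustrated with a small parser. This is only a sketch of the prefix convention named in the commit message; `parse_embedder_spec` is a hypothetical helper, and MM's real make_embedder may resolve specs differently.

```python
def parse_embedder_spec(spec: str) -> tuple[str, str]:
    """Split an embedder spec into (provider, model_name)."""
    for provider in ("openrouter", "openai"):
        prefix = provider + ":"
        if spec.startswith(prefix):
            return provider, spec[len(prefix):]
    # No recognized prefix: treat it as a local sentence-transformers model name.
    return "sentence-transformers", spec


default_provider, default_model = parse_embedder_spec(
    "openrouter:openai/text-embedding-3-small"
)
```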

…ycles. Use PREP+PERSIST ingest and query in llm mode and dual receive() in algorithmic mode so LongMemEval exercises the same memory paths as Agent, with the harness answerer substituting for REPLY.

Co-authored-by: Cursor <cursoragent@cursor.com>

How-to for the MM-vs-mem0 LongMemEval head-to-head: the mm_bridge integration, all-OpenRouter routing (3 points + env), reusing MM's venv, mem0 OpenRouter config, run commands, and fairness caveats.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Thread each haystack pair's real-world session date through ingest. run.py builds a per-pair date list (parse_longmemeval_date -> UTC datetime) parallel to the pairs and passes it to mm_ingest, which stamps each pair's blocks via the scoped ContextManager.using_source_date(). _format_blocks now emits the block's source_date as an ISO created_at, so the answerer prompt date-orders and date-groups MM memories the same way it does mem0's per-memory timestamps, closing the temporal-reasoning gap. dates is optional (un-dated ingest still works).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
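The date threading in this commit can be sketched as below. It assumes session dates arrive as strings shaped like "2023/05/20 (Sat) 02:21"; that format, and the helper names `parse_session_date`, `dates_for_pairs`, and `created_at`, are assumptions for illustration and not the actual parse_longmemeval_date or _format_blocks code.

```python
import re
from datetime import datetime, timezone
from typing import Optional


def parse_session_date(raw: str) -> Optional[datetime]:
    """Parse 'YYYY/MM/DD (Day) HH:MM' into a UTC datetime; None if unparseable."""
    # Drop the parenthesized weekday so parsing does not depend on locale.
    cleaned = re.sub(r"\s*\([^)]*\)\s*", " ", raw).strip()
    try:
        parsed = datetime.strptime(cleaned, "%Y/%m/%d %H:%M")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)


def dates_for_pairs(session_dates, pairs_per_session):
    """Repeat each session's date once per pair so dates stay parallel to pairs."""
    dates = []
    for raw, n_pairs in zip(session_dates, pairs_per_session):
        dates.extend([parse_session_date(raw)] * n_pairs)
    return dates


def created_at(date: Optional[datetime]) -> Optional[str]:
    # Un-dated blocks simply carry no created_at value.
    return date.isoformat() if date is not None else None
```

Returning None for unparseable input mirrors the commit's note that dates are optional.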

…-turn. Thread clock_seconds_per_turn through mm_bridge into MemoryConfig so batch LongMemEval runs can differentiate recency during fast ingest.

Co-authored-by: Cursor <cursoragent@cursor.com>
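One way to picture a simulated per-turn clock: when a whole haystack is ingested in a burst, a clock that advances a fixed step per turn still gives older turns a larger age. The sketch below is an assumption-laden illustration; `SimulatedClock` and the half-life `recency_weight` formula are hypothetical and are not taken from MemoryManager's MemoryConfig.

```python
from dataclasses import dataclass


@dataclass
class SimulatedClock:
    seconds_per_turn: float
    now: float = 0.0

    def tick(self) -> float:
        # Advance simulated time by a fixed step per ingested turn.
        self.now += self.seconds_per_turn
        return self.now


def recency_weight(age_seconds: float, half_life_seconds: float) -> float:
    # Hypothetical exponential decay: the weight halves every half-life.
    return 0.5 ** (age_seconds / half_life_seconds)


clock = SimulatedClock(seconds_per_turn=3600.0)
stamps = [clock.tick() for _ in range(3)]
weights = [recency_weight(clock.now - t, half_life_seconds=3600.0) for t in stamps]
```

With a step of zero seconds per turn every block would have age 0 and identical weight, which is the situation a nonzero step avoids.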

Rohith Ramanathan and others added 6 commits June 20, 2026 00:41

Expose MemoryManager simulated decay clock via --mm-clock-seconds-per…

765222f


rohram04 wants to merge 6 commits into mem0ai:main from rohram04:feat/longmemeval-memorymanager
