feat: add MemoryManager backend to LongMemEval runner #14

Open
rohram04 wants to merge 6 commits into mem0ai:main from rohram04:feat/longmemeval-memorymanager

@rohram04

Summary

  • Adds a new memorymanager backend choice (--backend memorymanager) to the LongMemEval runner, enabling apples-to-apples comparison against Mem0's OSS and cloud backends with the same answerer prompt, judge, and metrics pipeline.
  • Introduces benchmarks/common/mm_bridge.py — a bridge module that wires up a fresh, isolated MemoryManager.Agent per question with an in-memory LongTermStore, then exposes three phases: ingest (turn-by-turn memory lifecycle), PREP (surface relevant blocks before the answerer), and PERSIST (store the exchange after).
  • Updates benchmarks/longmemeval/run.py to route MM questions through ingest_question_mm → process_question_answerer(agent=...), and collapses the cutoff loop to a single window, since MM manages its own context budget.

How it works

The MemoryManager system lives in a sibling repo (~/MemoryManager, or MEMORYMANAGER_PATH). Each benchmark question gets its own Agent instance with an isolated SQLite in-memory store — no state leaks between questions.
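To illustrate why a per-question in-memory store cannot leak state, here is a minimal sketch using only the standard-library sqlite3 module. It is not the LongTermStore code; the `blocks` table and the `new_question_store` helper are hypothetical names chosen for the example.

```python
import sqlite3


def new_question_store() -> sqlite3.Connection:
    """Open a fresh store for one benchmark question (hypothetical helper)."""
    # Every connection to ":memory:" gets its own private database,
    # so nothing written here is visible to any other connection.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blocks (id INTEGER PRIMARY KEY, text TEXT NOT NULL)"
    )
    return conn


store_q1 = new_question_store()
store_q2 = new_question_store()

# Write a memory for question 1 only.
store_q1.execute("INSERT INTO blocks (text) VALUES (?)", ("user likes hiking",))
store_q1.commit()

q1_count = store_q1.execute("SELECT COUNT(*) FROM blocks").fetchone()[0]
q2_count = store_q2.execute("SELECT COUNT(*) FROM blocks").fetchone()[0]
print(q1_count, q2_count)
```

Question 2's store stays empty even though both stores live in the same process, which is the isolation property the runner relies on.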

python run.py --backend memorymanager --mode answerer \
              --mm-max-tokens 8000 \
              --mm-model openai/gpt-4o \
              --mm-util-model openai/gpt-4o-mini

New CLI flags (ignored for Mem0 backends):

  • --mm-max-tokens — context-window token budget (default 8000)
  • --mm-model — main answerer model (default openai/gpt-4o)
  • --mm-util-model — compress/merge/novelty util model (default openai/gpt-4o-mini)
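For reviewers who want to see the flag surface in one place, the following is a rough argparse sketch of the options listed above. The flag names and defaults come from this description; the `oss` and `cloud` choice names and the `build_parser` helper are assumptions for illustration and may not match run.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the runner's CLI; not the actual run.py parser."""
    parser = argparse.ArgumentParser(prog="run.py")
    # "oss" and "cloud" are assumed names for the existing Mem0 backends.
    parser.add_argument(
        "--backend", default="oss", choices=["oss", "cloud", "memorymanager"]
    )
    parser.add_argument("--mode", default="answerer")
    # The --mm-* flags only matter when --backend memorymanager is selected.
    parser.add_argument("--mm-max-tokens", type=int, default=8000)
    parser.add_argument("--mm-model", default="openai/gpt-4o")
    parser.add_argument("--mm-util-model", default="openai/gpt-4o-mini")
    return parser


# Parse an explicit argument list so the sketch runs without a real CLI call.
args = build_parser().parse_args(
    ["--backend", "memorymanager", "--mm-max-tokens", "4000"]
)
print(args.backend, args.mm_max_tokens, args.mm_model, args.mm_util_model)
```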

Test plan

  • Smoke-test: run a single question with --max-questions 1 --backend memorymanager
  • Verify output JSON structure matches Mem0 output (same keys, same judge scores)
  • Run full LongMemEval-S subset and compare accuracy vs Mem0 OSS baseline
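The structure check in the second item could be done with a small helper like the one below. It is a sketch: the record fields shown (`question_id`, `answer`, `judge.score`) are placeholder names, not the runner's confirmed output schema, and `key_paths` is a hypothetical helper.

```python
def key_paths(obj, prefix=""):
    """Return the set of dotted key paths found in a nested JSON-like object."""
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(obj, list):
        # List items share one path, marked with "[]".
        for item in obj:
            paths |= key_paths(item, f"{prefix}[]")
    return paths


# Placeholder records; real ones would be loaded with json.load().
mem0_record = {"question_id": "q1", "answer": "Paris", "judge": {"score": 1}}
mm_record = {"question_id": "q1", "answer": "Lyon", "judge": {"score": 0}}

# Symmetric difference lists keys present in only one of the two outputs.
mismatched = key_paths(mem0_record) ^ key_paths(mm_record)
print(sorted(mismatched))
```

An empty result means both backends emit the same keys; values such as answers and scores are intentionally not compared.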

Made with Cursor

Rohith Ramanathan and others added 6 commits June 20, 2026 00:41
New --backend memorymanager path that benchmarks the custom MemoryManager
system instead of mem0, holding chunks/answerer/judge identical (memory
system is the only variable). Integrates directly at the runner's call
sites — no Mem0Client replacement, no search() seam.

- benchmarks/common/mm_bridge.py: per-question Agent (isolated in-memory
  LongTermStore), OpenRouter gpt-4o/4o-mini backend + shared embedder
  singletons; mm_ingest / mm_surface_and_format (PREP) / mm_persist (PERSIST).
- run.py: --backend memorymanager + --mm-* flags; ingest_question_mm;
  process_question_answerer surfaces via the agent's PREP and persists via
  PERSIST, answer still generated by the SAME harness answerer; single
  managed-window cutoff; mem0 paths untouched.

Validated end-to-end through the bridge with real gpt-4o (construction,
ingest, PREP, PERSIST). Embedder parity (text-embedding-3-small) is a
follow-up needing a small MemoryManager-side change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-small

Embedder parity with mem0 OSS (text-embedding-3-small), routed through OpenRouter
so the whole stack uses one OpenRouter key (no OpenAI key): route the bridge's
embedder through MM's make_embedder (was hardcoding SentenceTransformer), default
to "openrouter:openai/text-embedding-3-small" (1536-d, matches LongTermStore's
default embedding_dim), and add --mm-embedding-model (accepts openrouter:/openai:
specs or a local sentence-transformers name). Verified: OpenRouter's embeddings
endpoint returns a 1536-d vector for openai/text-embedding-3-small.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
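A rough sketch of how a spec string accepted by --mm-embedding-model might be split into a provider and a model name. This is not MM's make_embedder; `parse_embedder_spec`, `EmbedderSpec`, and the fallback provider label are hypothetical, and the local model name in the usage lines is only an example.

```python
from typing import NamedTuple


class EmbedderSpec(NamedTuple):
    provider: str
    model: str


# Prefixes named in the commit message above.
REMOTE_PROVIDERS = ("openrouter", "openai")


def parse_embedder_spec(spec: str) -> EmbedderSpec:
    """Split "provider:model"; treat anything else as a local model name."""
    provider, sep, model = spec.partition(":")
    if sep and provider in REMOTE_PROVIDERS and model:
        return EmbedderSpec(provider, model)
    # No recognized prefix: assume a local sentence-transformers model.
    return EmbedderSpec("sentence-transformers", spec)


remote = parse_embedder_spec("openrouter:openai/text-embedding-3-small")
local = parse_embedder_spec("all-MiniLM-L6-v2")
print(remote, local)
```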
…ycles.

Use PREP+PERSIST ingest and query in llm mode and dual receive() in algorithmic mode so LongMemEval exercises the same memory paths as Agent, with the harness answerer substituting for REPLY.

Co-authored-by: Cursor <cursoragent@cursor.com>
How-to for the MM-vs-mem0 LongMemEval head-to-head: the mm_bridge integration,
all-OpenRouter routing (3 points + env), reusing MM's venv, mem0 OpenRouter
config, run commands, and fairness caveats.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thread each haystack pair's real-world session date through ingest: run.py
builds a per-pair date list (parse_longmemeval_date -> UTC datetime) parallel to
the pairs and passes it to mm_ingest, which stamps each pair's blocks via the
scoped ContextManager.using_source_date(). _format_blocks now emits the block's
source_date as an ISO created_at, so the answerer prompt date-orders/-groups MM
memories the same way it does mem0's per-memory timestamps — closing the
temporal-reasoning gap. dates is optional (un-dated ingest still works).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
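The date threading described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the bridge code: the input format "2023/05/20 (Sat) 02:21" is assumed, and `sketch_parse_date`, `SketchStore`, `sketch_ingest`, and `sketch_format_blocks` are hypothetical stand-ins for parse_longmemeval_date, the ContextManager, mm_ingest, and _format_blocks.

```python
from contextlib import contextmanager
from datetime import datetime, timezone


def sketch_parse_date(raw: str) -> datetime:
    """Parse an assumed "YYYY/MM/DD (Day) HH:MM" string into a UTC datetime."""
    date_part, _weekday, time_part = raw.split()
    naive = datetime.strptime(f"{date_part} {time_part}", "%Y/%m/%d %H:%M")
    return naive.replace(tzinfo=timezone.utc)


class SketchStore:
    """Toy store that stamps blocks with the currently scoped source date."""

    def __init__(self):
        self.blocks = []
        self._source_date = None

    @contextmanager
    def using_source_date(self, when):
        # Scope the date to the with-block, then restore the previous value.
        previous, self._source_date = self._source_date, when
        try:
            yield
        finally:
            self._source_date = previous

    def add(self, text):
        self.blocks.append({"text": text, "source_date": self._source_date})


def sketch_ingest(store, pairs, dates=None):
    # dates is optional: un-dated ingest stamps every block with None.
    dates = dates if dates is not None else [None] * len(pairs)
    for (user, assistant), when in zip(pairs, dates):
        with store.using_source_date(when):
            store.add(f"user: {user}\nassistant: {assistant}")


def sketch_format_blocks(store):
    # Emit source_date as an ISO created_at so the answerer can date-order.
    return [
        {"memory": b["text"],
         "created_at": b["source_date"].isoformat() if b["source_date"] else None}
        for b in store.blocks
    ]


store = SketchStore()
pairs = [("I moved to Oslo.", "Noted."), ("I adopted a cat.", "Congrats!")]
dates = [sketch_parse_date("2023/05/20 (Sat) 02:21"),
         sketch_parse_date("2023/06/01 (Thu) 10:00")]
sketch_ingest(store, pairs, dates)
formatted = sketch_format_blocks(store)
print(formatted[0]["created_at"])
```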
…-turn.

Thread clock_seconds_per_turn through mm_bridge into MemoryConfig so batch LongMemEval runs can differentiate recency during fast ingest.

Co-authored-by: Cursor <cursoragent@cursor.com>
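One way to read clock_seconds_per_turn is as a simulated clock that advances a fixed amount per ingested turn, so blocks ingested within milliseconds of each other still get distinct recency. The sketch below assumes that interpretation; `TurnClock` is a hypothetical class, not part of MemoryConfig.

```python
from datetime import datetime, timedelta, timezone


class TurnClock:
    """Simulated clock that steps forward a fixed interval per turn."""

    def __init__(self, start: datetime, seconds_per_turn: float):
        self._now = start
        self._step = timedelta(seconds=seconds_per_turn)

    def tick(self) -> datetime:
        # Return the current simulated time, then advance for the next turn.
        current = self._now
        self._now += self._step
        return current


clock = TurnClock(datetime(2026, 1, 1, tzinfo=timezone.utc), seconds_per_turn=60.0)
stamps = [clock.tick() for _ in range(3)]
print(stamps[-1] - stamps[0])
```

With a real wall clock, a batch ingest would give every turn nearly the same timestamp; stepping the clock per turn keeps recency-based ranking meaningful.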