feat(bench): implement LongMemEval dataset loader and evaluator #2832

@bug-ops

Description

Implement the LongMemEval dataset loader, scenario iterator, and evaluator (exact match + F1).

Part of epic #2827. See spec: .local/specs/zeph-bench/spec.md section 6.

Scope

  • LongMemEvalLoader that reads the official dataset JSON format and produces Vec<Scenario>
  • Correct handling of multi-session scenarios: when session_id changes within a scenario, trigger a memory reset mid-scenario (new session boundary)
  • LongMemEvalEvaluator implementing exact-match and token-F1 metrics against ground-truth answers
  • Dataset download: fetch from official source URL, cache to ~/.local/share/zeph/bench/longmemeval/
  • Cache validation: verify the SHA-256 checksum of the downloaded archive before use
  • Unit tests with a small synthetic scenario fixture (3 turns, known ground truth)
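
The multi-session rule in the scope list could be sketched roughly as follows. This is only an illustration under assumptions: `Turn`, `Step`, and `plan_steps` are hypothetical names that do not come from the spec, and the real `Scenario` type in zeph-bench may be shaped differently. The sketch walks the turns in order and emits a reset step whenever `session_id` differs from the previous turn's.

```rust
// Hypothetical types for illustration only; the actual zeph-bench
// Scenario/turn types are not defined in this issue.
#[derive(Debug, Clone, PartialEq)]
struct Turn {
    session_id: String,
    text: String,
}

/// One unit of work for the scenario runner.
#[derive(Debug, PartialEq)]
enum Step<'a> {
    /// Session boundary: the runner should reset memory before continuing.
    ResetMemory,
    /// Replay a single turn.
    Play(&'a Turn),
}

/// Expand a scenario's turns into runner steps, inserting a `ResetMemory`
/// step each time `session_id` changes between consecutive turns.
fn plan_steps(turns: &[Turn]) -> Vec<Step<'_>> {
    let mut steps = Vec::with_capacity(turns.len());
    let mut previous: Option<&str> = None;
    for turn in turns {
        if let Some(prev_id) = previous {
            if prev_id != turn.session_id.as_str() {
                steps.push(Step::ResetMemory);
            }
        }
        steps.push(Step::Play(turn));
        previous = Some(turn.session_id.as_str());
    }
    steps
}
```

With a 3-turn fixture whose session IDs are `s1`, `s1`, `s2`, this yields four steps, with the reset placed immediately before the third turn. No reset is emitted before the first turn, on the assumption that each scenario already starts from clean memory.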

Acceptance Criteria

  • Loader correctly parses official LongMemEval JSON schema
  • Multi-session boundary triggers memory reset
  • F1 evaluator matches reference implementation on 3-example fixture
  • Download and cache work end-to-end (zeph bench download --dataset longmemeval)
  • Unit tests pass without network access (fixture-based)
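
For the exact-match and token-F1 criteria, a minimal std-only sketch might look like the one below. The normalization used here (lowercasing, stripping ASCII punctuation, dropping the articles "a", "an", "the") is an assumption borrowed from common QA evaluation practice; the spec or the reference implementation may normalize differently, and the function names are hypothetical.

```rust
use std::collections::HashMap;

/// Assumed normalization: lowercase, strip ASCII punctuation,
/// drop English articles, split on whitespace.
fn normalize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .chars()
        .filter(|c| !c.is_ascii_punctuation())
        .collect::<String>()
        .split_whitespace()
        .filter(|t| !matches!(*t, "a" | "an" | "the"))
        .map(str::to_owned)
        .collect()
}

/// Exact match after normalization.
fn exact_match(prediction: &str, gold: &str) -> bool {
    normalize(prediction) == normalize(gold)
}

/// Token-level F1 over the multiset overlap of normalized tokens.
fn token_f1(prediction: &str, gold: &str) -> f64 {
    let pred = normalize(prediction);
    let gold = normalize(gold);
    // If either side is empty, score 1.0 only when both are empty.
    if pred.is_empty() || gold.is_empty() {
        return if pred == gold { 1.0 } else { 0.0 };
    }
    // Count gold tokens, then consume them as prediction tokens match.
    let mut gold_counts: HashMap<&str, usize> = HashMap::new();
    for token in &gold {
        *gold_counts.entry(token.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0usize;
    for token in &pred {
        if let Some(count) = gold_counts.get_mut(token.as_str()) {
            if *count > 0 {
                *count -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / pred.len() as f64;
    let recall = overlap as f64 / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}
```

Under these assumptions, `exact_match("The Eiffel Tower.", "eiffel tower")` is true, and `token_f1("blue car", "red car")` is 0.5 (one shared token, precision and recall both 0.5). Fixture values should still be checked against the reference implementation rather than against this sketch.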

Metadata

Labels

  • P2: High value, medium complexity
  • enhancement: New feature or request
  • memory: zeph-memory crate (SQLite)
