Description
Implement the LongMemEval dataset loader, scenario iterator, and evaluator (exact match + F1).
Part of epic #2827. See spec: .local/specs/zeph-bench/spec.md section 6.
Scope
- LongMemEvalLoader that reads the official dataset JSON format and produces Vec<Scenario>
- Correct handling of multi-session scenarios: when session_id changes within a scenario, trigger a memory reset mid-scenario (new session boundary)
- LongMemEvalEvaluator implementing exact-match and token-F1 metrics against ground-truth answers
- Dataset download: fetch from official source URL, cache to ~/.local/share/zeph/bench/longmemeval/
- Cache validation: SHA256 checksum of downloaded archive
- Unit tests with a small synthetic scenario fixture (3 turns, known ground truth)
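
For discussion, here is a minimal sketch of the two metrics named in the scope, using only the standard library. The function names (normalize_tokens, exact_match, token_f1) and the normalization rule (lowercase, treat non-alphanumeric characters as separators, split on whitespace) are assumptions for illustration. The spec may define a different normalization, and the real LongMemEvalEvaluator interface is not shown here.

```rust
use std::collections::HashMap;

/// Hypothetical normalization: lowercase, turn every non-alphanumeric
/// character into a space, then split on whitespace.
fn normalize_tokens(text: &str) -> Vec<String> {
    text.to_lowercase()
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect::<String>()
        .split_whitespace()
        .map(str::to_owned)
        .collect()
}

/// Exact match after normalization.
fn exact_match(prediction: &str, truth: &str) -> bool {
    normalize_tokens(prediction) == normalize_tokens(truth)
}

/// Token-level F1: harmonic mean of precision and recall over the
/// multiset overlap between predicted and ground-truth tokens.
fn token_f1(prediction: &str, truth: &str) -> f64 {
    let pred = normalize_tokens(prediction);
    let gold = normalize_tokens(truth);

    // Assumed convention: two empty answers agree, one empty answer scores 0.
    if pred.is_empty() || gold.is_empty() {
        return if pred.is_empty() && gold.is_empty() { 1.0 } else { 0.0 };
    }

    // Count ground-truth tokens so repeated tokens are matched at most
    // as many times as they occur in the ground truth.
    let mut gold_counts: HashMap<&str, usize> = HashMap::new();
    for token in &gold {
        *gold_counts.entry(token.as_str()).or_insert(0) += 1;
    }

    let mut overlap = 0usize;
    for token in &pred {
        if let Some(remaining) = gold_counts.get_mut(token.as_str()) {
            if *remaining > 0 {
                *remaining -= 1;
                overlap += 1;
            }
        }
    }

    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / pred.len() as f64;
    let recall = overlap as f64 / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}
```

With this normalization, a prediction of "the cat sat" against a ground truth of "the cat" has precision 2/3 and recall 1, giving F1 = 0.8. A fixture of that shape could serve as one of the known-ground-truth unit tests.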
Acceptance Criteria
zeph bench download --dataset longmemeval)