feat(bench): implement LongMemEval dataset loader and evaluator #2832

## Description
Implement the LongMemEval dataset loader, scenario iterator, and evaluator (exact match + F1).

Part of epic #2827. See spec: `.local/specs/zeph-bench/spec.md` section 6.

## Scope
- `LongMemEvalLoader` that reads the official dataset JSON format and produces `Vec<Scenario>`
- Correct handling of multi-session scenarios: when `session_id` changes within a scenario, trigger a memory reset mid-scenario (new session boundary)
- `LongMemEvalEvaluator` implementing exact-match and token-F1 metrics against ground-truth answers
- Dataset download: fetch from official source URL, cache to `~/.local/share/zeph/bench/longmemeval/`
- Cache validation: verify the SHA-256 checksum of the downloaded archive before using the cache
- Unit tests with a small synthetic scenario fixture (3 turns, known ground truth)
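The multi-session rule in the scope list could be sketched roughly as follows. This is only an illustration under assumptions: the `Turn` struct and the `session_boundaries` helper are hypothetical names, not part of the spec, and the real dataset schema may carry more fields.

```rust
/// Hypothetical turn shape (an assumption); the official schema may differ.
#[derive(Debug, Clone)]
pub struct Turn {
    pub session_id: String,
    pub text: String,
}

/// Returns the indices of turns that open a new session, i.e. the positions
/// where a memory reset would fire before the turn is processed.
/// The first turn is never reported, since there is nothing to reset yet.
pub fn session_boundaries(turns: &[Turn]) -> Vec<usize> {
    turns
        .windows(2) // adjacent pairs (turn i, turn i + 1)
        .enumerate()
        .filter(|(_, pair)| pair[0].session_id != pair[1].session_id)
        .map(|(i, _)| i + 1) // the boundary sits at the second turn of the pair
        .collect()
}
```

A scenario iterator could consult this list (or do the same comparison inline) and call whatever reset hook the harness exposes each time it reaches a boundary index.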

## Acceptance Criteria
- [ ] Loader correctly parses official LongMemEval JSON schema
- [ ] Multi-session boundary triggers memory reset
- [ ] F1 evaluator matches reference implementation on 3-example fixture
- [ ] Download and cache work end-to-end (`zeph bench download --dataset longmemeval`)
- [ ] Unit tests pass without network access (fixture-based)
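For the exact-match and token-F1 criterion, a minimal std-only sketch is shown below. The normalization (lowercasing, dropping non-alphanumeric characters, whitespace tokenization) is an assumption; the reference implementation may normalize differently, and the function names here are hypothetical.

```rust
use std::collections::HashMap;

/// Assumed normalization: lowercase, strip non-alphanumeric characters,
/// split on whitespace. The reference implementation may differ.
fn normalize(s: &str) -> Vec<String> {
    s.to_lowercase()
        .split_whitespace()
        .map(|t| t.chars().filter(|c| c.is_alphanumeric()).collect::<String>())
        .filter(|t| !t.is_empty())
        .collect()
}

/// True when prediction and ground truth normalize to the same token sequence.
pub fn exact_match(pred: &str, truth: &str) -> bool {
    normalize(pred) == normalize(truth)
}

/// Token-level F1: harmonic mean of precision and recall over the multiset
/// overlap between predicted and ground-truth tokens.
pub fn token_f1(pred: &str, truth: &str) -> f64 {
    let p = normalize(pred);
    let t = normalize(truth);
    if p.is_empty() || t.is_empty() {
        // Two empty answers agree; one empty answer scores zero.
        return if p.is_empty() && t.is_empty() { 1.0 } else { 0.0 };
    }
    // Count ground-truth tokens, then consume them as predictions match.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for tok in &t {
        *counts.entry(tok.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0usize;
    for tok in &p {
        if let Some(c) = counts.get_mut(tok.as_str()) {
            if *c > 0 {
                *c -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / p.len() as f64;
    let recall = overlap as f64 / t.len() as f64;
    2.0 * precision * recall / (precision + recall)
}
```

As a worked case, a prediction of `the cat sat` against a ground truth of `the cat` shares two tokens, giving precision 2/3, recall 1, and F1 of 0.8. A fixture test can compare such values with a small tolerance rather than exact float equality.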
