Overview
Add zeph-bench: a crate, gated behind the bench feature, that runs Zeph against standard AI-agent benchmarks (LongMemEval, LOCOMO, FRAMES, tau-bench, GAIA) in a fully automated, reproducible manner.
Spec: .local/specs/zeph-bench/spec.md
Motivation
Zeph's persistent semantic memory and tool-use capabilities have no external, reproducible measurement. This epic adds the harness to demonstrate and regression-track those differentiators against standard leaderboards.
Architecture
- New crate zeph-bench at Layer 4 (same tier as zeph-channels, zeph-tui)
- BenchmarkChannel implements zeph-core::Channel: zero changes to agent core
- Dedicated Qdrant collection prefix and SQLite DB per run: never touches production state
- bench feature flag: excluded from full bundle
- CLI: zeph bench list | download | run | show
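
As a rough illustration of the architecture above, the sketch below shows how a benchmark channel could feed scripted prompts to an agent loop and capture its replies, alongside run-scoped storage names. It is a minimal sketch under stated assumptions: the `Channel` trait here is a stand-in (the real `zeph-core::Channel` signature is not given in this issue), and `drive`, `RunStorage`, `run_storage`, and the naming scheme are hypothetical.

```rust
use std::collections::VecDeque;

/// Stand-in for `zeph-core::Channel`. The real trait's methods are not
/// described in this issue, so this two-method shape is an assumption.
trait Channel {
    /// Next user message, or `None` when the conversation is over.
    fn recv(&mut self) -> Option<String>;
    /// Deliver an agent reply.
    fn send(&mut self, reply: &str);
}

/// Feeds scripted benchmark prompts to the agent and records its replies.
struct BenchmarkChannel {
    prompts: VecDeque<String>,
    replies: Vec<String>,
}

impl BenchmarkChannel {
    fn new(prompts: &[&str]) -> Self {
        Self {
            prompts: prompts.iter().map(|p| p.to_string()).collect(),
            replies: Vec::new(),
        }
    }
}

impl Channel for BenchmarkChannel {
    fn recv(&mut self) -> Option<String> {
        self.prompts.pop_front()
    }

    fn send(&mut self, reply: &str) {
        self.replies.push(reply.to_owned());
    }
}

/// Hypothetical per-run storage names, kept apart from production state.
struct RunStorage {
    qdrant_prefix: String,
    sqlite_path: String,
}

/// Derives both names from the run id (naming scheme is an assumption).
fn run_storage(run_id: &str) -> RunStorage {
    RunStorage {
        qdrant_prefix: format!("bench_{run_id}_"),
        sqlite_path: format!("bench-{run_id}.sqlite"),
    }
}

/// Toy agent loop that echoes each prompt. It only sees the trait, which is
/// why a new channel needs no changes on the agent side.
fn drive(channel: &mut dyn Channel) {
    while let Some(prompt) = channel.recv() {
        channel.send(&format!("echo: {prompt}"));
    }
}

fn main() {
    let mut channel = BenchmarkChannel::new(&["q1", "q2"]);
    drive(&mut channel);
    let storage = run_storage("run42");
    println!("{:?}", channel.replies);
    println!("{} {}", storage.qdrant_prefix, storage.sqlite_path);
}
```

Because the loop depends only on the trait, swapping the TUI or another channel for the benchmark one is a wiring change rather than an agent-core change.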
Child Issues
Acceptance Criteria
- zeph bench run --dataset longmemeval completes end-to-end and produces a valid results.json
- Two identical runs produce identical scores (determinism)
- Memory-enabled score >= memory-disabled score on LongMemEval
- No writes to production Qdrant/SQLite during bench run
- bench feature excluded from full build
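
The determinism and memory criteria above could be checked mechanically along these lines. This is a minimal sketch, not the harness itself: `RunScore` is a hypothetical summary type (the actual results.json schema lives in the spec), and the scores in `main` are placeholder values, not measurements.

```rust
/// Hypothetical summary of one bench run; the real results.json schema
/// is defined by the spec, not by this sketch.
#[derive(Debug, Clone, PartialEq)]
struct RunScore {
    dataset: String,
    memory_enabled: bool,
    score: f64,
}

/// Determinism: two runs of the same configuration must yield
/// bit-identical scores, so compare raw bits rather than using a tolerance.
fn is_deterministic(a: &RunScore, b: &RunScore) -> bool {
    a.dataset == b.dataset
        && a.memory_enabled == b.memory_enabled
        && a.score.to_bits() == b.score.to_bits()
}

/// Memory criterion: the memory-enabled score must not fall below the
/// memory-disabled score on the same dataset.
fn memory_not_worse(with_memory: &RunScore, without_memory: &RunScore) -> bool {
    with_memory.dataset == without_memory.dataset
        && with_memory.memory_enabled
        && !without_memory.memory_enabled
        && with_memory.score >= without_memory.score
}

fn main() {
    // Placeholder scores for illustration only.
    let first = RunScore { dataset: "longmemeval".into(), memory_enabled: true, score: 0.5 };
    let second = first.clone();
    let baseline = RunScore { dataset: "longmemeval".into(), memory_enabled: false, score: 0.25 };

    println!("deterministic: {}", is_deterministic(&first, &second));
    println!("memory not worse: {}", memory_not_worse(&first, &baseline));
}
```

Comparing bit patterns is deliberately strict: any tolerance would let small nondeterminism slip through, which is the opposite of what the criterion asks for.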