feat: zeph-bench benchmark harness #2827

@bug-ops

Description

Overview

Add zeph-bench: a crate gated behind the bench feature that runs Zeph against the standard AI-agent benchmarks LongMemEval, LOCOMO, FRAMES, tau-bench, and GAIA in a fully automated, reproducible manner.

Spec: .local/specs/zeph-bench/spec.md

Motivation

Zeph's persistent semantic memory and tool-use capabilities have no external, reproducible measurement. This epic adds the harness to demonstrate and regression-track those differentiators against standard leaderboards.

Architecture

  • New crate zeph-bench at Layer 4 (same tier as zeph-channels, zeph-tui)
  • BenchmarkChannel implements zeph-core::Channel — zero changes to agent core
  • Dedicated Qdrant collection prefix and SQLite DB per run — never touches production state
  • bench feature flag — excluded from full bundle
  • CLI: zeph bench list | download | run | show
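To make the "zero changes to agent core" point concrete, here is a minimal sketch of how BenchmarkChannel could drive the agent loop. The Channel trait below is a stand-in assumption (the real zeph-core::Channel trait and its method names may differ); the idea is only that the harness feeds scripted benchmark turns and records replies for scoring, through the same interface any other channel uses.

```rust
// Stand-in for zeph-core::Channel; the real trait's shape is an assumption here.
trait Channel {
    /// Deliver the next user turn to the agent loop, if any.
    fn next_message(&mut self) -> Option<String>;
    /// Receive the agent's reply so the harness can score it later.
    fn send_reply(&mut self, reply: String);
}

/// Feeds scripted benchmark turns and records replies for evaluation.
struct BenchmarkChannel {
    turns: std::vec::IntoIter<String>,
    replies: Vec<String>,
}

impl BenchmarkChannel {
    fn new(turns: Vec<String>) -> Self {
        Self { turns: turns.into_iter(), replies: Vec::new() }
    }
}

impl Channel for BenchmarkChannel {
    fn next_message(&mut self) -> Option<String> {
        self.turns.next()
    }
    fn send_reply(&mut self, reply: String) {
        self.replies.push(reply);
    }
}

fn main() {
    let mut ch = BenchmarkChannel::new(vec!["q1".into(), "q2".into()]);
    while let Some(msg) = ch.next_message() {
        // A real run would route `msg` through the agent; echoed here.
        ch.send_reply(format!("answer to {msg}"));
    }
    assert_eq!(ch.replies.len(), 2);
    println!("{}", ch.replies[0]);
}
```

Because the harness is just another Channel implementation, the agent core never learns it is being benchmarked, which is what keeps the measurement honest.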

Child Issues

  • Crate scaffold and BenchmarkChannel
  • CLI subcommand (zeph bench)
  • Memory isolation (Qdrant + SQLite reset per scenario)
  • Deterministic mode (temperature=0 override)
  • LongMemEval dataset loader and evaluator
  • JSON + Markdown result writer
  • Baseline comparison (--baseline flag)
  • Resume interrupted run (--resume flag)
  • LOCOMO dataset loader
  • FRAMES dataset loader
  • tau-bench dataset loader
  • GAIA dataset loader
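The memory-isolation child issue can be sketched as a naming scheme: each run gets its own Qdrant collection prefix and a throwaway SQLite file, so a bench run can never collide with production state. The zeph_bench_ prefix and function names below are illustrative assumptions, not the spec's.

```rust
use std::path::PathBuf;

/// Hypothetical per-run, per-scenario Qdrant collection name.
/// A distinct prefix guarantees bench data never lands in production collections.
fn bench_collection(run_id: &str, scenario: &str) -> String {
    format!("zeph_bench_{run_id}_{scenario}")
}

/// Hypothetical throwaway SQLite file for a single run, kept under the OS temp dir.
fn bench_sqlite_path(run_id: &str) -> PathBuf {
    std::env::temp_dir().join(format!("zeph_bench_{run_id}.sqlite"))
}

fn main() {
    let c = bench_collection("20240101", "longmemeval_s01");
    assert!(c.starts_with("zeph_bench_"));
    let p = bench_sqlite_path("20240101");
    assert!(p.to_string_lossy().contains("zeph_bench_"));
    println!("{c}");
}
```

Resetting between scenarios then reduces to dropping the scenario's collection and deleting the run's SQLite file, which is cheap and leaves no shared state behind.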

Acceptance Criteria

  • zeph bench run --dataset longmemeval completes end-to-end and produces valid results.json
  • Two identical runs produce identical scores (determinism)
  • Memory-enabled score >= memory-disabled score on LongMemEval
  • No writes to production Qdrant/SQLite during bench run
  • bench feature excluded from full build
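The determinism criterion is checkable mechanically: serialize each run's results and compare bytes. The results.json field names below are assumptions for illustration (the spec in .local/specs/zeph-bench/spec.md defines the real schema); serialization is hand-rolled here to keep the sketch dependency-free, where the real crate would likely use serde.

```rust
// Hypothetical results.json shape; field names are assumptions, not the spec's.
struct RunResult {
    dataset: String,
    score: f64,
    total: usize,
    correct: usize,
}

impl RunResult {
    /// Minimal hand-rolled JSON serialization for the sketch.
    fn to_json(&self) -> String {
        format!(
            "{{\"dataset\":\"{}\",\"score\":{:.4},\"correct\":{},\"total\":{}}}",
            self.dataset, self.score, self.correct, self.total
        )
    }
}

fn main() {
    // Stand-in for an actual bench run; a deterministic run ignores the seed-like arg.
    let run = |_attempt: u64| RunResult {
        dataset: "longmemeval".into(),
        correct: 37,
        total: 50,
        score: 37.0 / 50.0,
    };
    // Determinism criterion: two identical runs serialize identically.
    let (a, b) = (run(1).to_json(), run(2).to_json());
    assert_eq!(a, b);
    println!("{a}");
}
```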

Metadata

Labels

  • P2: High value, medium complexity
  • enhancement: New feature or request
  • epic: Milestone-level tracking issue
