An autonomous optimization lab for VoxTerm, an offline voice transcription tool with speaker diarization for macOS/Apple Silicon.

The lab systematically optimizes VoxTerm across four axes using an automated experiment loop:

| Axis | Metric | Goal |
|---|---|---|
| Transcription accuracy | WER (Word Error Rate) | < 10% |
| Diarization quality | DER (Diarization Error Rate) | < 15% |
| Latency | RTF (Real-Time Factor) | < 0.5 |
| Speaker recognition | Speaker ID accuracy | > 80% |
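
To make the table concrete, here is a minimal sketch of how two of the metrics are conventionally defined and how a score set could be checked against the goals. It is not the code in `eval/run_eval.py`: the function names and the score keys (`wer`, `der`, `rtf`, `speaker_id_accuracy`) are assumptions chosen for illustration, and the thresholds are copied from the table above.

```python
# Illustrative only: names and score keys are assumptions, not VoxTerm Lab's API.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# (threshold, lower_is_better) per axis, taken from the goals table.
GOALS = {
    "wer": (0.10, True),
    "der": (0.15, True),
    "rtf": (0.5, True),
    "speaker_id_accuracy": (0.80, False),
}


def meets_goals(scores: dict) -> dict:
    """Return, for each axis present in `scores`, whether its goal is met."""
    result = {}
    for metric, (threshold, lower_is_better) in GOALS.items():
        if metric in scores:
            value = scores[metric]
            result[metric] = value < threshold if lower_is_better else value > threshold
    return result
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 0.5.
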

```bash
git clone <this-repo> && cd voxterm-lab
bash setup.sh                  # Clone VoxTerm, install deps
make eval NAME=baseline        # Run baseline evaluation
make optimize NAME=cycle-1     # Start autonomous optimization (5 iterations)
make leaderboard               # Check current best scores
```

- Hypothesis-driven: `research/hypotheses.json` tracks 10+ optimization ideas ranked by expected impact
- Automated eval: `eval/run_eval.py` measures all four axes (plug point for external eval)
- Experiment tracking: Each change gets its own directory with scores, diffs, and analysis
- Leaderboard: `leaderboard.json` tracks best scores across all experiments
- Agent-friendly: Designed for Claude Code to run autonomously via `META-AGENT.md`

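
As a rough illustration of the leaderboard idea, the sketch below keeps the best score per axis, treating WER, DER, and RTF as lower-is-better and speaker ID accuracy as higher-is-better. The dictionary layout and the `update_leaderboard` function are hypothetical; the actual schema of `leaderboard.json` may differ.

```python
# Hypothetical leaderboard layout: {metric: {"score": float, "experiment": str}}.
# The metric keys are assumptions; only the direction of each goal comes from the README.
LOWER_IS_BETTER = {
    "wer": True,
    "der": True,
    "rtf": True,
    "speaker_id_accuracy": False,
}


def update_leaderboard(leaderboard: dict, experiment: str, scores: dict) -> dict:
    """Return a new leaderboard with any per-axis improvements recorded."""
    updated = dict(leaderboard)
    for metric, lower_is_better in LOWER_IS_BETTER.items():
        if metric not in scores:
            continue  # this experiment did not measure the axis
        value = scores[metric]
        best = updated.get(metric)
        if best is None:
            improved = True
        elif lower_is_better:
            improved = value < best["score"]
        else:
            improved = value > best["score"]
        if improved:
            updated[metric] = {"score": value, "experiment": experiment}
    return updated


if __name__ == "__main__":
    board = update_leaderboard({}, "baseline", {"wer": 0.14, "rtf": 0.6})
    board = update_leaderboard(board, "cycle-1", {"wer": 0.12, "rtf": 0.7})
    print(board)  # WER improves to cycle-1; RTF stays with baseline
```

Tracking the best score per axis, rather than one overall winner, keeps a latency regression visible even when accuracy improves.
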

```
VoxTerm (target)          VoxTerm Lab (this repo)
├── transcriber/          ├── eval/run_eval.py  (plug point)
├── diarization/          ├── scripts/optimize-loop.sh
├── audio/                ├── research/hypotheses.json
├── speakers/             ├── experiments/*/scores.json
└── config.py             └── leaderboard.json
```

This repo is designed for autonomous operation. See CLAUDE.md for agent instructions and META-AGENT.md for the optimization loop protocol.
| Target | Description |
|---|---|
| `make eval NAME=x` | Run evaluation, save scores |
| `make ab-eval NAME=x` | A/B comparison (baseline vs changes) |
| `make optimize NAME=x` | Autonomous optimization loop |
| `make leaderboard` | Show best scores |
| `make list-experiments` | List all experiments with scores |
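
For readers who want to inspect results without `make`, the following sketch shows one way to gather every `experiments/*/scores.json` file into a list. It is an assumption about how such a listing could work, not the implementation behind `make list-experiments`; the `list_experiments` helper is hypothetical, and the contents of each `scores.json` are passed through untouched.

```python
import json
from pathlib import Path


def list_experiments(root: Path) -> list:
    """Return (experiment_name, scores) pairs, sorted by experiment directory name.

    Assumes each experiment lives in its own directory under `root`
    and stores its results in a file named scores.json.
    """
    rows = []
    for scores_file in sorted(root.glob("*/scores.json")):
        with scores_file.open(encoding="utf-8") as handle:
            rows.append((scores_file.parent.name, json.load(handle)))
    return rows


if __name__ == "__main__":
    # Run from the repository root; prints nothing if no experiments exist yet.
    for name, scores in list_experiments(Path("experiments")):
        print(name, scores)
```
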