Add A/B harness plumbing for knowledge transfer experiment (#25) by ovidb · Pull Request #60 · Rememora/rememora

ovidb · 2026-04-12T00:21:04Z

Summary

Adds resetDbBetweenTasks condition flag to wipe DB between tasks (control condition)
Adds --repeats N CLI flag for statistical robustness (multiple runs per condition)
Adds LLM-as-judge scoring via Haiku with 3-dimension rubric (completion 0.4, knowledge utilization 0.3, quality 0.3)
Creates ab-control and ab-treatment condition configs using the winning config (full-hybrid + tmux) from Experiment: Instruction mode comparison (P0) #24
Adds quality columns to comparison report and HTML report
Fixes stale full-hybrid instruction test
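
To make the rubric weighting concrete, here is a minimal sketch of how the three judge dimensions could be combined into a single quality score. The weights (0.4, 0.3, 0.3) come from the summary above; the names weightedQuality and RubricScores, and the assumption that each dimension is scored in [0, 1], are hypothetical and not taken from the PR's code.

```typescript
// Rubric weights as stated in the PR summary.
const WEIGHTS = {
  completion: 0.4,
  knowledgeUtilization: 0.3,
  quality: 0.3,
} as const;

type Dimension = keyof typeof WEIGHTS;
type RubricScores = Record<Dimension, number>;

// Assumption: each dimension is scored in [0, 1]; out-of-range values are clamped.
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Weighted sum of the three dimensions; the weights sum to 1,
// so the result also lies in [0, 1].
function weightedQuality(scores: RubricScores): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    total += WEIGHTS[dim] * clamp01(scores[dim]);
  }
  return total;
}
```

With this weighting, a task that completes fully but uses no stored knowledge can score at most 0.7, which is why the knowledge utilization dimension matters for the A/B comparison.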

Test plan

All 66 existing + new tests pass (pnpm --prefix bench test:eval)
Dry run with ab-control.json — verify [reset] Wiping DB log messages
Dry run with ab-treatment.json — verify KB grows across tasks
--repeats 2 produces 2 JSONL files with distinct timestamps
With ANTHROPIC_API_KEY, JSONL shows non-zero task_quality and judge_reasoning
Comparison report shows Quality column
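
The JSONL check in the test plan could be automated along these lines. This is a sketch only: it assumes each non-empty JSONL line is one task result object with task_quality and judge_reasoning as top-level fields (the field names are from the test plan, their placement is an assumption), and unjudgedLines is a hypothetical helper, not part of the harness.

```typescript
// Assumed shape of one JSONL row; only the two judged fields matter here.
interface JudgedRow {
  task_quality?: number;
  judge_reasoning?: string;
}

// Returns the 1-based positions (among non-empty lines) of rows that lack
// a non-zero task_quality or a non-empty judge_reasoning.
function unjudgedLines(jsonl: string): number[] {
  const bad: number[] = [];
  const lines = jsonl.split("\n").filter((l) => l.trim() !== "");
  lines.forEach((line, i) => {
    const row = JSON.parse(line) as JudgedRow;
    const hasQuality = typeof row.task_quality === "number" && row.task_quality > 0;
    const hasReasoning =
      typeof row.judge_reasoning === "string" && row.judge_reasoning.length > 0;
    if (!hasQuality || !hasReasoning) bad.push(i + 1);
  });
  return bad;
}
```

An empty result means every row was scored by the judge; any returned positions point at rows to inspect by hand.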

Three features to support the Knowledge Transfer A/B experiment:

1. resetDbBetweenTasks condition flag: wipes the DB before each task for the control condition, isolating per-task knowledge
2. --repeats N CLI flag: runs sequences multiple times for statistical robustness; each repeat gets a fresh DB
3. LLM-as-judge scoring: calls Haiku with a 3-dimension rubric (completion, knowledge utilization, quality) to score task output

Also adds ab-control/ab-treatment condition configs, quality columns in comparison and HTML reports, and fixes a stale full-hybrid test.
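
The interaction between resetDbBetweenTasks and --repeats can be sketched as a small loop. Only the resetDbBetweenTasks flag name and the described behaviour (wipe before each task in the control condition, fresh DB per repeat) come from this PR; InMemoryDb, Condition, and runCondition are hypothetical stand-ins for the real harness, and the agent's work is reduced to a single save per task.

```typescript
interface Condition {
  name: string;
  resetDbBetweenTasks: boolean;
}

// Hypothetical stand-in for the knowledge DB: just a list of saved memories.
class InMemoryDb {
  private memories: string[] = [];
  save(memory: string): void {
    this.memories.push(memory);
  }
  wipe(): void {
    this.memories = [];
  }
  get size(): number {
    return this.memories.length;
  }
}

// Returns, for each repeat, the KB size observed at the start of each task.
function runCondition(condition: Condition, tasks: string[], repeats: number): number[][] {
  const sizesPerRepeat: number[][] = [];
  for (let r = 0; r < repeats; r++) {
    const db = new InMemoryDb(); // each repeat gets a fresh DB
    const sizes: number[] = [];
    for (const task of tasks) {
      if (condition.resetDbBetweenTasks) db.wipe(); // control: no carry-over
      sizes.push(db.size);
      db.save(`memory from ${task}`); // placeholder for the agent's work
    }
    sizesPerRepeat.push(sizes);
  }
  return sizesPerRepeat;
}
```

In this sketch the control condition always starts a task with an empty KB, while the treatment condition sees the KB grow by one entry per task within a repeat, which mirrors the "KB grows across tasks" check in the test plan.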

Control (DB wiped per task): 2 runs, 10/10 tasks, 17 autonomous saves
Treatment (DB persists): 2 runs, 10/10 tasks, 7 autonomous saves

Key finding: zero searches in both conditions. The agent saves memories but never searches, even with MANDATORY instructions. This is the critical gap to address before the full experiment.

ovidb added 2 commits April 12, 2026 01:20


ovidb wants to merge 2 commits into main from feat/25-ab-harness
