Add A/B harness plumbing for knowledge transfer experiment (#25) #60

Open
ovidb wants to merge 2 commits into main from feat/25-ab-harness

ovidb (Contributor) commented Apr 12, 2026

Summary

  • Adds resetDbBetweenTasks condition flag to wipe DB between tasks (control condition)
  • Adds --repeats N CLI flag for statistical robustness (multiple runs per condition)
  • Adds LLM-as-judge scoring via Haiku with 3-dimension rubric (completion 0.4, knowledge utilization 0.3, quality 0.3)
  • Creates ab-control and ab-treatment condition configs using the winning config from #24 (Experiment: Instruction mode comparison (P0)): full-hybrid + tmux
  • Adds quality columns to comparison report and HTML report
  • Fixes stale full-hybrid instruction test
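
To make the rubric weighting concrete, here is a minimal sketch of how the three dimensions could be combined into one quality number. The weights (0.4 / 0.3 / 0.3) come from the summary above; the names `RubricScores`, `WEIGHTS`, `clamp01`, and `weightedQuality` are hypothetical, and the assumption that each dimension is scored in [0, 1] is mine, not something this PR states.

```typescript
// Hypothetical shape: one score per rubric dimension,
// each ASSUMED to lie in [0, 1] (the PR does not state the scale).
interface RubricScores {
  completion: number;
  knowledgeUtilization: number;
  quality: number;
}

// Weights taken from the PR summary: 0.4 / 0.3 / 0.3.
const WEIGHTS: RubricScores = {
  completion: 0.4,
  knowledgeUtilization: 0.3,
  quality: 0.3,
};

// Clamp guards against a judge reply that strays outside the assumed range.
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Weighted sum of the three clamped dimension scores.
function weightedQuality(s: RubricScores): number {
  return (
    WEIGHTS.completion * clamp01(s.completion) +
    WEIGHTS.knowledgeUtilization * clamp01(s.knowledgeUtilization) +
    WEIGHTS.quality * clamp01(s.quality)
  );
}
```

Because the weights sum to 1, the combined score stays on the same scale as the individual dimensions, which keeps a single Quality column comparable across conditions.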

Test plan

  • All 66 existing + new tests pass (pnpm --prefix bench test:eval)
  • Dry run with ab-control.json — verify [reset] Wiping DB log messages
  • Dry run with ab-treatment.json — verify KB grows across tasks
  • --repeats 2 produces 2 JSONL files with distinct timestamps
  • With ANTHROPIC_API_KEY, JSONL shows non-zero task_quality and judge_reasoning
  • Comparison report shows Quality column
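
The JSONL check in the test plan could be automated along these lines. This is a sketch only: the field names `task_quality` and `judge_reasoning` come from the test plan, while the helper `findUnjudged`, the `JudgedRecord` type, and the one-JSON-object-per-line layout are assumptions about the harness output.

```typescript
// Only the two fields named in the test plan are typed;
// any other fields in a record are ignored.
interface JudgedRecord {
  task_quality?: number;
  judge_reasoning?: string;
}

// Returns the 1-based line numbers of records that look unjudged:
// task_quality missing or not above zero, or judge_reasoning empty.
function findUnjudged(jsonl: string): number[] {
  const bad: number[] = [];
  jsonl.split("\n").forEach((line, i) => {
    if (line.trim() === "") return; // tolerate blank lines and a trailing newline
    const rec = JSON.parse(line) as JudgedRecord;
    const scored =
      typeof rec.task_quality === "number" && rec.task_quality > 0;
    const explained =
      typeof rec.judge_reasoning === "string" &&
      rec.judge_reasoning.trim() !== "";
    if (!scored || !explained) bad.push(i + 1);
  });
  return bad;
}
```

An empty result means every record carries a judge score and a rationale; a non-empty result points at the lines to inspect.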

ovidb added 2 commits April 12, 2026 01:20
Three features to support the Knowledge Transfer A/B experiment:

1. resetDbBetweenTasks condition flag — wipes the DB before each task
   for the control condition, isolating per-task knowledge
2. --repeats N CLI flag — runs sequences multiple times for statistical
   robustness, each repeat gets a fresh DB
3. LLM-as-judge scoring — calls Haiku with a 3-dimension rubric
   (completion, knowledge utilization, quality) to score task output

Also adds ab-control/ab-treatment condition configs, quality columns
in comparison and HTML reports, and fixes a stale full-hybrid test.
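
As an illustration of how features 1 and 2 interact, the sketch below simulates the run loop with an in-memory array standing in for the DB. Only `resetDbBetweenTasks` is a name from this PR; `Condition`, `Db`, `simulate`, and the one-memory-per-task behavior are hypothetical and not taken from the harness code.

```typescript
// Hypothetical condition config; only resetDbBetweenTasks is named in the PR.
interface Condition {
  name: string;
  resetDbBetweenTasks: boolean;
}

// Stand-in for the real DB: a list of saved memories.
type Db = string[];

// Runs the task sequence `repeats` times and records how many memories
// were visible at the start of each task.
function simulate(cond: Condition, tasks: string[], repeats: number): number[][] {
  const visible: number[][] = [];
  for (let r = 0; r < repeats; r++) {
    let db: Db = []; // each repeat gets a fresh DB
    const perTask: number[] = [];
    for (const task of tasks) {
      if (cond.resetDbBetweenTasks) db = []; // control: wipe before each task
      perTask.push(db.length);
      db.push(`memory from ${task}`); // pretend the agent saves one memory
    }
    visible.push(perTask);
  }
  return visible;
}

const tasks = ["t1", "t2", "t3"];
const control = simulate({ name: "ab-control", resetDbBetweenTasks: true }, tasks, 2);
const treatment = simulate({ name: "ab-treatment", resetDbBetweenTasks: false }, tasks, 2);
```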

Control (DB wiped per task): 2 runs, 10/10 tasks, 17 autonomous saves
Treatment (DB persists): 2 runs, 10/10 tasks, 7 autonomous saves

Key finding: zero searches in both conditions. The agent saves memories
but never searches them, even with MANDATORY instructions. This is the
critical gap to address before the full experiment.