feat(agent-comparison): promote autoresearch live eval hardening to main by notque · Pull Request #208 · notque/claude-code-toolkit

notque · 2026-03-30T03:52:02Z

Summary

- promote the live registered-skill autoresearch hardening from fix/autoresearch-recommendations
- include the blind body-eval guardrails, holdout/report correctness fixes, and routing/docs alignment
- include the measured socratic-debugging instruction-body improvement and the short benchmark fixtures

Included Work

- isolate real registered-skill live evals in worktrees
- harden blind body scoring to require trigger evidence and reject fallback contamination
- support body-only optimization with documented short runnable benchmarks
- fix holdout score attribution, final best-by-test selection, and final report consistency
- route autoresearch requests through agent-comparison instead of the older skill-eval path
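
To make two of the items above concrete, here is a minimal sketch of what "require trigger evidence and reject fallback contamination" and "best-by-test selection" could look like. It is not the code from this PR: `BlindRun`, `Candidate`, `blind_body_score`, and `best_by_test` are hypothetical names, and the record fields are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BlindRun:
    """One blind body-eval run (hypothetical record shape, not the repo's)."""
    triggered: bool      # evidence that the skill under test actually fired
    used_fallback: bool  # output came from a fallback path, not the skill body
    raw_score: float


def blind_body_score(run: BlindRun) -> Optional[float]:
    """Return a score only when the run can be attributed to the skill body."""
    if not run.triggered or run.used_fallback:
        # No trigger evidence, or contaminated by a fallback: do not score.
        return None
    return run.raw_score


@dataclass(frozen=True)
class Candidate:
    """A candidate skill body with train and holdout scores kept separate."""
    name: str
    train_score: float
    test_score: float


def best_by_test(candidates: Sequence[Candidate]) -> Candidate:
    """Pick the final candidate by holdout (test) score; ties go to the earliest."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    return max(candidates, key=lambda c: c.test_score)
```

Under these assumptions, a run that never triggered the skill contributes no score at all instead of a zero, and a candidate that only looks good on the training split cannot win the final selection.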

Validation

- local validation passed on the promoted head before merge into fix/autoresearch-recommendations:
  - pytest -q scripts/tests/test_agent_comparison_optimize_loop.py scripts/tests/test_skill_eval_claude_code.py scripts/tests/test_passk_eval.py scripts/tests/test_eval_compare_optimization.py
- short live proof artifacts were generated during the feature PR for socratic-debugging body optimization

Notes

- this PR promotes the merged work from PR #207 (feat(agent-comparison): harden autoresearch live evals) into main
- unlike the stacked PR, this PR targets main, so the repository's Tests workflow should apply here

notque added 9 commits March 29, 2026 19:16

fix(autoresearch): isolate live registered-skill eval

127ef5a

Merge pull request #206 from notque/feat/live-registered-skill-eval-v2

8f6a968

Improve live registered-skill autoresearch eval and short proof flow

feat(agent-comparison): harden autoresearch live evals

2ec703c

fix(socratic-debugging): tighten first-turn question discipline

98ecf1a

Merge pull request #207 from notque/feature/skill-body-autoresearch-h…

23cad06

…ardening feat(agent-comparison): harden autoresearch live evals

Merge main into fix/autoresearch-recommendations

6ef426c

fix(ci): address lint findings in autoresearch tests

c10111d

fix(lint): auto-format 3 files to pass ruff format CI check

281b32a

merge: resolve conflicts with main after PR #209 merge

3da524a

notque merged commit 29ab879 into main Mar 30, 2026
4 checks passed

notque merged 9 commits into main from fix/autoresearch-recommendations
