feat(agent-comparison): promote autoresearch live eval hardening to main #208

Merged
notque merged 9 commits into main from fix/autoresearch-recommendations
Mar 30, 2026

@notque commented Mar 30, 2026

Summary

  • promote the live registered-skill autoresearch hardening from fix/autoresearch-recommendations
  • include the blind body-eval guardrails, holdout/report correctness fixes, and routing/docs alignment
  • include the measured socratic-debugging instruction-body improvement and the short benchmark fixtures

Included Work

  • isolate real registered-skill live evals in worktrees
  • harden blind body scoring to require trigger evidence and reject fallback contamination
  • support body-only optimization with documented short runnable benchmarks
  • fix holdout score attribution, final best-by-test selection, and final report consistency
  • route autoresearch requests through agent-comparison instead of the older skill-eval path
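
The scoring guardrail and best-by-test selection listed above can be illustrated with a minimal sketch. This is not the code from this PR: the `Candidate` fields, `blind_score`, and `select_best_by_test` are hypothetical names chosen for illustration, and the sketch assumes each candidate run records whether the skill actually triggered and whether a fallback path answered instead.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One evaluated instruction-body variant (hypothetical shape)."""
    name: str
    train_score: float
    test_score: float    # holdout score attributed to this candidate only
    triggered: bool      # evidence that the skill actually fired in the run
    used_fallback: bool  # a fallback path answered instead of the skill


def blind_score(c: Candidate) -> Optional[float]:
    """Return the holdout score, or None if the run must not be counted."""
    # Require trigger evidence and reject fallback-contaminated runs.
    if not c.triggered or c.used_fallback:
        return None
    return c.test_score


def select_best_by_test(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Pick the final winner by holdout (test) score, never by train score."""
    scored = [c for c in candidates if blind_score(c) is not None]
    if not scored:
        return None
    return max(scored, key=lambda c: c.test_score)


# Illustrative data only; these are not measured results.
runs = [
    Candidate("baseline", 0.70, 0.60, True, False),
    Candidate("variant-a", 0.95, 0.55, True, False),   # best on train only
    Candidate("variant-b", 0.80, 0.90, False, False),  # no trigger evidence
    Candidate("variant-c", 0.75, 0.65, True, False),
]
best = select_best_by_test(runs)
```

In this sketch `variant-b` is excluded despite its high holdout score because the run shows no trigger evidence, and `variant-a` loses despite the best train score, so the final report and the selected candidate both rest on the same holdout number.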

Validation

  • local validation passed on the promoted head before merge into fix/autoresearch-recommendations:
    • pytest -q scripts/tests/test_agent_comparison_optimize_loop.py scripts/tests/test_skill_eval_claude_code.py scripts/tests/test_passk_eval.py scripts/tests/test_eval_compare_optimization.py
  • short live proof artifacts were generated during the feature PR for socratic-debugging body optimization

Notes

@notque merged commit 29ab879 into main on Mar 30, 2026
4 checks passed