Add research-bot benchmark cases#77
Conversation
jarrettj
left a comment
There was a problem hiding this comment.
Code Review Summary — PR #77: Research-Bot Benchmark Cases
Verdict: Request Changes (2 critical gaps, multiple testing oversights)
🔴 Critical
Schema mismatch with evaluation framework
- The research-bot cases use schema:
{id, skill, category, input, expected:{must_include/must_avoid}, judge} - The EvalExample class (evolution/core/dataset_builder.py) uses schema:
{task_input, expected_behavior, difficulty, category, source} - Impact: No code currently exists to load or use research-bot cases. They cannot be consumed by the GEPA optimizer or evaluation framework without a conversion layer.
- Fix: Either (a) add a loader that converts research-bot schema to EvalExample, or (b) document that these cases require future framework extension
No tests for case loading, validation, or usage
- The 30 cases are added as raw data with no test coverage for:
- Loading cases.jsonl into a usable format
- Validating case structure and content
- Integrating cases into the evaluation pipeline
- The PR description mentions framework steps (P1–P4) but doesn't add tests to verify any of them work
- Impact: Case quality is unknown and cannot be verified programmatically
- Fix: Add test file
tests/core/test_research_bot_cases.pywith:- Test for loading all 30 cases from JSONL
- Validation that all required fields are present and valid
- Schema conformance (category in {planner, reflexion, synthesizer}, no duplicate IDs)
- Optional: convert research-bot schema to EvalExample and verify compatibility
⚠️ Warnings
No documentation on case usage
- README.md does not mention where benchmark cases live or how to use them
- PLAN.md mentions "benchmark gating" but not these specific case files
- Users/developers cannot discover or use these cases without tribal knowledge
- Suggestion: Add a
cases/README.mdor section to main README explaining:- Where research-bot cases are located
- Case schema (what each field means)
- How to load and evaluate against them
- How to add new cases
Missing cross-file validation
- The self-review claims to check "evolution sources and dependency availability" but doesn't provide evidence
- The PR should verify that:
- All cases reference valid "skill" values (currently all are "research-bot"; should be documented)
- All "category" values match the expected taxonomy
- No external dependencies are needed to load/validate cases
- Suggestion: Add these checks to the test suite
No tests for case semantic quality
- The cases themselves are manually authored; no tests verify:
- Input prompts are well-formed and solvable
- Judgment criteria are objective and verifiable
must_includeandmust_avoiddon't contradict each other- Examples given are realistic for a research-bot skill
- Suggestion: Consider adding a test for obvious contradictions (e.g., must_include "X" and must_avoid "X")
💡 Suggestions
Consider adding case difficulty and source metadata
- Current cases lack
difficultyandsourcefields that EvalExample uses - These would help with:
- Stratified evaluation (test on easy vs. hard cases)
- Traceability (hand-curated vs. synthetic)
- Not critical if cases are distinct from EvalExample, but worth planning
Plan for case evolution and updates
- The cases are static snapshots added once; no clear process for:
- Adding new cases as research-bot evolves
- Retiring cases that become obsolete
- Versioning the case set
- Suggestion: Document a case lifecycle in README (how to propose, review, merge new cases)
✅ Looks Good
- Data integrity: All 30 lines are valid JSON with no parse errors ✓
- Distribution: Perfect 10/10/10 split across planner/reflexion/synthesizer ✓
- No duplicates: All case IDs are unique ✓
- No diff issues: Git diff --check passes ✓
- Field completeness: All cases have all required fields ✓
- Clear semantics: Each case has a realistic prompt, clear expected behavior, and sensible judge criteria ✓
- Korean context: Excellent inclusion of market/cultural specifics in several cases ✓
Recommendation
Do not merge yet. The data itself is solid, but the cases are orphaned from the evaluation framework. Before merging:
- Add integration code: Implement
load_research_bot_cases()that returns EvalExample objects or documents why a new evaluation pipeline is needed - Add tests: Create
test_research_bot_cases.pywith validation and loading tests - Add documentation: Update README or add cases/README.md explaining schema and usage
Once these are in place, this contribution will be valuable for benchmarking research-bot improvements.
Reviewed by Code Review Skill
Test Coverage Gap AnalysisWhat's MissingThe PR adds 30 benchmark cases for research-bot but zero tests to verify they work: 1. No loader/integration tests
Required: Add with at least: def test_loads_all_30_cases():
# Load all cases from JSONL
# Assert count == 30
pass
def test_cases_have_required_fields():
# For each case: assert 'id', 'category', 'input', 'expected', 'judge' exist
pass
def test_category_distribution():
# Assert 10 planner, 10 reflexion, 10 synthesizer
pass2. No validation tests
3. No integration tests
Why It MattersWithout tests, we can't:
Quick FixAdd test file with basic validation (will take ~20 min). I can help review once you push it. |
Summary
Verification
Self-Review