Add research-bot benchmark cases #77

Open
upsearchmain wants to merge 1 commit into NousResearch:main from upsearchmain:codex/T-E5-1-03-research-bot-cases

@upsearchmain

Summary

  • Add 30 JSONL benchmark cases for research-bot self-evolution.
  • Balance planner/reflexion/synthesizer at 10 cases each.
  • Keep schema simple: id, skill, category, input, expected, judge.

Verification

  • test -f ~/hermes-evolution/cases/research-bot/cases.jsonl && echo present -> present
  • wc -l ~/hermes-evolution/cases/research-bot/cases.jsonl -> 30 /home/claude/hermes-evolution/cases/research-bot/cases.jsonl
  • /usr/bin/python3 -c "import json, pathlib; [json.loads(l) for l in pathlib.Path('/home/claude/hermes-evolution/cases/research-bot/cases.jsonl').read_text().splitlines()]" && echo OK -> OK
  • category distribution -> {'planner': 10, 'reflexion': 10, 'synthesizer': 10}
  • git diff --check origin/main...codex/T-E5-1-03-research-bot-cases -> no output
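The category-distribution check above is reported without the command that produced it. One way to reproduce it is sketched below, assuming each line of cases.jsonl is a JSON object with a `category` field; the function name `category_distribution` is hypothetical and not part of this PR.

```python
import json
from collections import Counter


def category_distribution(jsonl_text: str) -> dict:
    """Count cases per category in JSONL text (one JSON object per line)."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():  # tolerate blank lines
            counts[json.loads(line)["category"]] += 1
    return dict(counts)
```

Passing the text of cases/research-bot/cases.jsonl to this function should give the same dictionary as the one quoted above if the file is as described.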

Self-Review

  • P1: Verified task references and dependency availability before writing cases.
  • P2: Added only a simple static JSONL dataset, no harness or API calls.
  • P3: Changed only cases/research-bot/cases.jsonl in this repo.
  • P4: File existence, line count, JSON parse, category distribution, and diff-check all pass.

@jarrettj jarrettj left a comment

Code Review Summary — PR #77: Research-Bot Benchmark Cases

Verdict: Request Changes (2 critical gaps, multiple testing oversights)

🔴 Critical

Schema mismatch with evaluation framework

  • The research-bot cases use schema: {id, skill, category, input, expected:{must_include/must_avoid}, judge}
  • The EvalExample class (evolution/core/dataset_builder.py) uses schema: {task_input, expected_behavior, difficulty, category, source}
  • Impact: No code currently exists to load or use research-bot cases. They cannot be consumed by the GEPA optimizer or evaluation framework without a conversion layer.
  • Fix: Either (a) add a loader that converts research-bot schema to EvalExample, or (b) document that these cases require future framework extension
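Option (a) could look roughly like the sketch below. It is a sketch under assumptions, not the framework's actual API: the `EvalExample` dataclass here is a stand-in that only mirrors the field names listed above (the real class lives in evolution/core/dataset_builder.py and may differ), `to_eval_example` is a hypothetical helper, `must_include` and `must_avoid` are assumed to be lists of strings, and the `difficulty` and `source` values are placeholders because the case schema has neither field.

```python
import json
from dataclasses import dataclass


@dataclass
class EvalExample:
    """Stand-in with the field names quoted in the review; not the real class."""
    task_input: str
    expected_behavior: str
    difficulty: str
    category: str
    source: str


def to_eval_example(case: dict) -> EvalExample:
    """Map one research-bot case onto the EvalExample field names."""
    expected = case.get("expected", {})
    parts = []
    # Assumption: must_include / must_avoid are lists of strings.
    if expected.get("must_include"):
        parts.append("Must include: " + "; ".join(expected["must_include"]))
    if expected.get("must_avoid"):
        parts.append("Must avoid: " + "; ".join(expected["must_avoid"]))
    return EvalExample(
        task_input=case["input"],
        expected_behavior=" | ".join(parts),
        difficulty="unknown",   # placeholder: cases carry no difficulty
        category=case["category"],
        source="research-bot",  # placeholder: cases carry no source
    )


def load_research_bot_cases(jsonl_text: str) -> list[EvalExample]:
    """Convert JSONL text (one case per line) into EvalExample stand-ins."""
    return [
        to_eval_example(json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]
```

Flattening `must_include` and `must_avoid` into a single `expected_behavior` string loses structure, so the `judge` field and the original lists may need to be carried separately if the framework is extended instead.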

No tests for case loading, validation, or usage

  • The 30 cases are added as raw data with no test coverage for:
    • Loading cases.jsonl into a usable format
    • Validating case structure and content
    • Integrating cases into the evaluation pipeline
  • The PR description lists self-review checks (P1–P4) but adds no tests that verify any of them
  • Impact: Case quality is unknown and cannot be verified programmatically
  • Fix: Add test file tests/core/test_research_bot_cases.py with:
    • Test for loading all 30 cases from JSONL
    • Validation that all required fields are present and valid
    • Schema conformance (category in {planner, reflexion, synthesizer}, no duplicate IDs)
    • Optional: convert research-bot schema to EvalExample and verify compatibility

⚠️ Warnings

No documentation on case usage

  • README.md does not mention where benchmark cases live or how to use them
  • PLAN.md mentions "benchmark gating" but not these specific case files
  • Users/developers cannot discover or use these cases without tribal knowledge
  • Suggestion: Add a cases/README.md or section to main README explaining:
    • Where research-bot cases are located
    • Case schema (what each field means)
    • How to load and evaluate against them
    • How to add new cases

Missing cross-file validation

  • The self-review (P1) claims to have verified "task references and dependency availability" but provides no evidence
  • The PR should verify that:
    • All cases reference valid "skill" values (currently all are "research-bot"; should be documented)
    • All "category" values match the expected taxonomy
    • No external dependencies are needed to load/validate cases
  • Suggestion: Add these checks to the test suite

No tests for case semantic quality

  • The cases themselves are manually authored; no tests verify:
    • Input prompts are well-formed and solvable
    • Judgment criteria are objective and verifiable
    • must_include and must_avoid don't contradict each other
    • Examples given are realistic for a research-bot skill
  • Suggestion: Consider adding a test for obvious contradictions (e.g., must_include "X" and must_avoid "X")
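A contradiction check of this kind might be sketched as follows. It assumes `must_include` and `must_avoid` are lists of strings under `expected`, and it only catches exact matches after trimming and lowercasing, so paraphrased contradictions would still need human review. The name `find_contradictions` is hypothetical.

```python
def find_contradictions(case: dict) -> list[str]:
    """Return terms that appear in both must_include and must_avoid."""
    expected = case.get("expected", {})
    # Normalise so "X" and " x " count as the same term.
    include = {term.strip().lower() for term in expected.get("must_include", [])}
    avoid = {term.strip().lower() for term in expected.get("must_avoid", [])}
    return sorted(include & avoid)
```

A test could then assert that `find_contradictions(case)` is empty for every case in the file.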

💡 Suggestions

Consider adding case difficulty and source metadata

  • Current cases lack difficulty and source fields that EvalExample uses
  • These would help with:
    • Stratified evaluation (test on easy vs. hard cases)
    • Traceability (hand-curated vs. synthetic)
  • Not critical if cases are distinct from EvalExample, but worth planning

Plan for case evolution and updates

  • The cases are static snapshots added once; no clear process for:
    • Adding new cases as research-bot evolves
    • Retiring cases that become obsolete
    • Versioning the case set
  • Suggestion: Document a case lifecycle in README (how to propose, review, merge new cases)

✅ Looks Good

  • Data integrity: All 30 lines are valid JSON with no parse errors ✓
  • Distribution: Perfect 10/10/10 split across planner/reflexion/synthesizer ✓
  • No duplicates: All case IDs are unique ✓
  • No diff issues: git diff --check passes ✓
  • Field completeness: All cases have all required fields ✓
  • Clear semantics: Each case has a realistic prompt, clear expected behavior, and sensible judge criteria ✓
  • Korean context: Excellent inclusion of market/cultural specifics in several cases ✓

Recommendation

Do not merge yet. The data itself is solid, but the cases are orphaned from the evaluation framework. Before merging:

  1. Add integration code: Implement load_research_bot_cases() so that it returns EvalExample objects, or document why a new evaluation pipeline is needed
  2. Add tests: Create test_research_bot_cases.py with validation and loading tests
  3. Add documentation: Update README or add cases/README.md explaining schema and usage

Once these are in place, this contribution will be valuable for benchmarking research-bot improvements.


Reviewed by Code Review Skill

@jarrettj

Test Coverage Gap Analysis

What's Missing

The PR adds 30 benchmark cases for research-bot but no tests that verify they load and validate:

1. No loader/integration tests

  • How are these cases loaded for use?
  • Is there code to convert from research-bot schema to EvalExample?
  • What happens when someone tries to use these for evaluation?

Required: Add a test file with at least:

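import json
from collections import Counter
from pathlib import Path

# Assumes pytest runs from the repository root, where the PR adds this file.
CASES_PATH = Path("cases/research-bot/cases.jsonl")

def load_cases():
    # One JSON object per non-blank line
    lines = CASES_PATH.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]

def test_loads_all_30_cases():
    assert len(load_cases()) == 30

def test_cases_have_required_fields():
    required = {"id", "skill", "category", "input", "expected", "judge"}
    for case in load_cases():
        assert required <= case.keys(), f"{case.get('id')} lacks {required - case.keys()}"

def test_category_distribution():
    counts = Counter(case["category"] for case in load_cases())
    assert counts == {"planner": 10, "reflexion": 10, "synthesizer": 10}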

2. No validation tests

  • No test for duplicate IDs
  • No test for invalid categories
  • No test for schema conformance
  • No test for empty/null fields
  • No test for contradictory must_include/must_avoid
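Several of the checks listed above could share one helper that reports every problem at once. The sketch below is illustrative only: `validate_cases`, `REQUIRED_FIELDS` and `VALID_CATEGORIES` are hypothetical names, and the field list and category set are taken from the PR summary and the schema notes in this review.

```python
REQUIRED_FIELDS = ("id", "skill", "category", "input", "expected", "judge")
VALID_CATEGORIES = {"planner", "reflexion", "synthesizer"}


def validate_cases(cases: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the set is clean."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id") or f"case #{index}"
        # Missing, null, and empty values are all treated as errors.
        for field in REQUIRED_FIELDS:
            if case.get(field) in (None, "", [], {}):
                problems.append(f"{label}: missing or empty field '{field}'")
        category = case.get("category")
        if category and category not in VALID_CATEGORIES:
            problems.append(f"{label}: invalid category '{category}'")
        case_id = case.get("id")
        if case_id in seen_ids:
            problems.append(f"{label}: duplicate id")
        elif case_id:
            seen_ids.add(case_id)
    return problems
```

A single test asserting `validate_cases(cases) == []` would then fail with the full list of problems rather than stopping at the first one.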

3. No integration tests

  • Can the evaluation framework actually use these cases?
  • Can they be converted to EvalExample objects?
  • Do they produce meaningful scores?

Why It Matters

Without tests, we can't:

  • Verify the cases aren't broken (they pass manual checks today, but automated verification is good practice)
  • Prevent regressions if cases are modified
  • Discover schema mismatches early
  • Document the expected format for future contributors
  • Automate validation when new cases are added

Quick Fix

Add a test file with basic validation (about 20 minutes of work). I can help review once you push it.
