Add research-bot benchmark cases by upsearchmain · Pull Request #77 · NousResearch/hermes-agent-self-evolution

upsearchmain · 2026-05-14T16:46:46Z

Summary

Add 30 JSONL benchmark cases for research-bot self-evolution.
Balance planner/reflexion/synthesizer at 10 cases each.
Keep schema simple: id, skill, category, input, expected, judge.

Verification

test -f ~/hermes-evolution/cases/research-bot/cases.jsonl && echo present -> present
wc -l ~/hermes-evolution/cases/research-bot/cases.jsonl -> 30 /home/claude/hermes-evolution/cases/research-bot/cases.jsonl
/usr/bin/python3 -c "import json, pathlib; [json.loads(l) for l in pathlib.Path('/home/claude/hermes-evolution/cases/research-bot/cases.jsonl').read_text().splitlines()]" && echo OK -> OK
category distribution -> {'planner': 10, 'reflexion': 10, 'synthesizer': 10}
git diff --check origin/main...codex/T-E5-1-03-research-bot-cases -> no output
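
The category-distribution check above is reported without its command. A minimal sketch of how the line count, JSON parse, and distribution checks could be run in one pass, assuming each record has a top-level `category` key as the summary states. The `summarize_cases` helper name is hypothetical and not part of the PR:

```python
import json
from collections import Counter
from pathlib import Path


def summarize_cases(path):
    """Parse a JSONL file and return (case_count, category_counts).

    Hypothetical helper for illustration. Blank lines are skipped, and
    json.JSONDecodeError is raised on the first malformed line.
    """
    text = Path(path).read_text(encoding="utf-8")
    cases = [json.loads(line) for line in text.splitlines() if line.strip()]
    counts = Counter(case["category"] for case in cases)
    return len(cases), dict(counts)
```

Run from the repo root against cases/research-bot/cases.jsonl, this should reproduce the reported count and split if the file matches the description.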

Self-Review

P1: Verified task references and dependency availability before writing cases.
P2: Added only a simple static JSONL dataset, no harness or API calls.
P3: Changed only cases/research-bot/cases.jsonl in this repo.
P4: File existence, line count, JSON parse, category distribution, and diff-check all pass.

jarrettj

Code Review Summary — PR #77: Research-Bot Benchmark Cases

Verdict: Request Changes (2 critical gaps, multiple testing oversights)

🔴 Critical

Schema mismatch with evaluation framework

The research-bot cases use schema: {id, skill, category, input, expected:{must_include/must_avoid}, judge}
The EvalExample class (evolution/core/dataset_builder.py) uses schema: {task_input, expected_behavior, difficulty, category, source}
Impact: No code currently exists to load or use research-bot cases. They cannot be consumed by the GEPA optimizer or evaluation framework without a conversion layer.
Fix: Either (a) add a loader that converts research-bot schema to EvalExample, or (b) document that these cases require future framework extension
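
Option (a) could look roughly like the sketch below. The `EvalExample` dataclass here is a local stand-in that only mirrors the five field names quoted above; the real class in evolution/core/dataset_builder.py may differ. The field mapping (serializing `expected` to a JSON string, defaulting `difficulty` to "unknown", building `source` from the case id) is an assumption, not the project's convention:

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvalExample:
    # Stand-in mirroring the field names quoted in the review; not the real class.
    task_input: str
    expected_behavior: str
    difficulty: str
    category: str
    source: str


def load_research_bot_cases(path):
    """Convert research-bot JSONL records into EvalExample-shaped objects."""
    examples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        examples.append(
            EvalExample(
                task_input=case["input"],
                # Assumed mapping: keep must_include/must_avoid as a JSON string.
                expected_behavior=json.dumps(
                    case["expected"], ensure_ascii=False, sort_keys=True
                ),
                difficulty="unknown",  # research-bot cases carry no difficulty field
                category=case["category"],
                source=f"research-bot:{case['id']}",
            )
        )
    return examples
```

Serializing `expected` keeps the must_include/must_avoid structure recoverable downstream; a real loader might instead render it as prose if the framework expects free-text expected behavior.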

No tests for case loading, validation, or usage

The 30 cases are added as raw data with no test coverage for:
- Loading cases.jsonl into a usable format
- Validating case structure and content
- Integrating cases into the evaluation pipeline
The PR description mentions framework steps (P1–P4) but doesn't add tests to verify any of them work
Impact: Case quality is unknown and cannot be verified programmatically
Fix: Add test file tests/core/test_research_bot_cases.py with:
- Test for loading all 30 cases from JSONL
- Validation that all required fields are present and valid
- Schema conformance (category in {planner, reflexion, synthesizer}, no duplicate IDs)
- Optional: convert research-bot schema to EvalExample and verify compatibility
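
The field, category, and duplicate-ID checks listed above can be sketched as a single validator that a test could assert returns an empty list. The `validate_cases` name and the error-message format are hypothetical; the required fields and allowed categories are taken from the PR summary:

```python
ALLOWED_CATEGORIES = {"planner", "reflexion", "synthesizer"}
REQUIRED_FIELDS = {"id", "skill", "category", "input", "expected", "judge"}


def validate_cases(cases):
    """Return a list of human-readable problems; an empty list means the set passed."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            problems.append(f"case {index}: missing fields {sorted(missing)}")
        if case.get("category") not in ALLOWED_CATEGORIES:
            problems.append(f"case {index}: unknown category {case.get('category')!r}")
        case_id = case.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"case {index}: duplicate id {case_id!r}")
            seen_ids.add(case_id)
    return problems
```

Collecting every problem instead of failing on the first one gives a contributor the full list of fixes in a single test run.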

⚠️ Warnings

No documentation on case usage

README.md does not mention where benchmark cases live or how to use them
PLAN.md mentions "benchmark gating" but not these specific case files
Users/developers cannot discover or use these cases without tribal knowledge
Suggestion: Add a cases/README.md or section to main README explaining:
- Where research-bot cases are located
- Case schema (what each field means)
- How to load and evaluate against them
- How to add new cases

Missing cross-file validation

The self-review (P1) claims to have verified "task references and dependency availability" but doesn't provide evidence
The PR should verify that:
- All cases reference valid "skill" values (currently all are "research-bot"; should be documented)
- All "category" values match the expected taxonomy
- No external dependencies are needed to load/validate cases
Suggestion: Add these checks to the test suite

No tests for case semantic quality

The cases themselves are manually authored; no tests verify:
- Input prompts are well-formed and solvable
- Judgment criteria are objective and verifiable
- must_include and must_avoid don't contradict each other
- Examples given are realistic for a research-bot skill
Suggestion: Consider adding a test for obvious contradictions (e.g., must_include "X" and must_avoid "X")
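
A contradiction check of this kind might be sketched as follows. It assumes `must_include` and `must_avoid` are lists of strings under `expected`, which the PR does not confirm, and it only catches exact matches after lowercasing and trimming; the `find_contradictions` name is hypothetical:

```python
def find_contradictions(case):
    """Return terms that appear in both must_include and must_avoid.

    Assumes both are lists of strings under case["expected"]; comparison
    is case-insensitive and ignores surrounding whitespace.
    """
    expected = case.get("expected", {})
    include = {term.strip().lower() for term in expected.get("must_include", [])}
    avoid = {term.strip().lower() for term in expected.get("must_avoid", [])}
    return sorted(include & avoid)
```

A test could then assert that `find_contradictions(case)` is empty for every case. Semantic contradictions (different wording, same meaning) would still need manual review.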

💡 Suggestions

Consider adding case difficulty and source metadata

Current cases lack difficulty and source fields that EvalExample uses
These would help with:
- Stratified evaluation (test on easy vs. hard cases)
- Traceability (hand-curated vs. synthetic)
Not critical if cases are distinct from EvalExample, but worth planning

Plan for case evolution and updates

The cases are static snapshots added once; no clear process for:
- Adding new cases as research-bot evolves
- Retiring cases that become obsolete
- Versioning the case set
Suggestion: Document a case lifecycle in README (how to propose, review, merge new cases)

✅ Looks Good

Data integrity: All 30 lines are valid JSON with no parse errors ✓
Distribution: Perfect 10/10/10 split across planner/reflexion/synthesizer ✓
No duplicates: All case IDs are unique ✓
No diff issues: git diff --check passes ✓
Field completeness: All cases have all required fields ✓
Clear semantics: Each case has a realistic prompt, clear expected behavior, and sensible judge criteria ✓
Korean context: Excellent inclusion of market/cultural specifics in several cases ✓

Recommendation

Do not merge yet. The data itself is solid, but the cases are orphaned from the evaluation framework. Before merging:

Add integration code: Implement load_research_bot_cases() that returns EvalExample objects, or document why a new evaluation pipeline is needed
Add tests: Create test_research_bot_cases.py with validation and loading tests
Add documentation: Update README or add cases/README.md explaining schema and usage

Once these are in place, this contribution will be valuable for benchmarking research-bot improvements.

Reviewed by Code Review Skill

jarrettj · 2026-05-16T10:11:53Z

Test Coverage Gap Analysis

What's Missing

The PR adds 30 benchmark cases for research-bot but zero tests to verify they work:

1. No loader/integration tests

How are these cases loaded for use?
Is there code to convert from research-bot schema to EvalExample?
What happens when someone tries to use these for evaluation?

Required: Add a test file with at least:

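import json
from collections import Counter
from pathlib import Path

# Path is relative to the repo root (the file this PR changes)
CASES_PATH = Path("cases/research-bot/cases.jsonl")
CASES = [json.loads(line) for line in CASES_PATH.read_text(encoding="utf-8").splitlines() if line.strip()]

def test_loads_all_30_cases():
    # Every non-blank line is parsed at import time; the set must hold 30 cases
    assert len(CASES) == 30

def test_cases_have_required_fields():
    # For each case: 'id', 'category', 'input', 'expected', 'judge' must exist
    for case in CASES:
        assert {"id", "category", "input", "expected", "judge"} <= case.keys(), case.get("id")

def test_category_distribution():
    # 10 planner, 10 reflexion, 10 synthesizer
    counts = Counter(case["category"] for case in CASES)
    assert dict(counts) == {"planner": 10, "reflexion": 10, "synthesizer": 10}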

2. No validation tests

No test for duplicate IDs
No test for invalid categories
No test for schema conformance
No test for empty/null fields
No test for contradictory must_include/must_avoid
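
For the empty/null item in this list, one possible helper is sketched below. The `find_empty_fields` name is hypothetical, the default field list comes from the PR summary, and the definition of "empty" (blank string, empty list, or empty dict) is an assumption about what the reviewer intends:

```python
def find_empty_fields(case, fields=("id", "skill", "category", "input", "expected", "judge")):
    """Return names of fields that are missing, None, or empty.

    "Empty" here means a blank string or an empty list/dict; numeric zero
    and False are treated as real values.
    """
    empty = []
    for name in fields:
        value = case.get(name)
        if value is None:
            empty.append(name)
        elif isinstance(value, str) and not value.strip():
            empty.append(name)
        elif isinstance(value, (list, dict)) and not value:
            empty.append(name)
    return empty
```

A test would assert that the result is empty for each case in the set.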

3. No integration tests

Can the evaluation framework actually use these cases?
Can they be converted to EvalExample objects?
Do they produce meaningful scores?

Why It Matters

Without tests, we can't:

Verify the cases aren't broken (they are already added, but automated checks are still good practice)
Prevent regressions if cases are modified
Discover schema mismatches early
Document the expected format for future contributors
Automate validation when new cases are added

Quick Fix

Add a test file with basic validation (should take about 20 minutes). I can help review once you push it.

Commit cf75b71: Add research-bot benchmark cases

jarrettj suggested changes May 16, 2026


upsearchmain wants to merge 1 commit into NousResearch:main from upsearchmain:codex/T-E5-1-03-research-bot-cases
