Improve HealthBench parity with openai/simple-evals by bofenghuang · Pull Request #132 · MedARC-AI/medmarks

bofenghuang · 2026-05-14T09:43:18Z

Summary

Tightens HealthBench's parity with openai/simple-evals and adds the Professional variant.

Changes

Fix judge retry: the previous try/except AttributeError around _parse_json never fired (the parser returns {} on bad JSON), so malformed grading responses were silently scored criteria_met=False and the existing rerun_judge path was dead code. Retry is now driven by the real failure signals (judge error, empty raw, or missing/non-boolean criteria_met), bounded by a new max_judge_retries arg (default 3).
Add optional length-adjusted score metric (length_adjusted_score) matching simple-evals' formula. Computed in order raw → length adjust → clip to [0, 1]. Length-adjustment is opt-in but auto-applied when both knobs are None, using OpenAI's published per-variant defaults (regular 0.0299, hard 0.0392, consensus 0.0020, professional 0.0147; center=2000 for all).
Add use_length_adjusted_as_reward flag (default False) that swaps weights so length_adjusted_score becomes the headline reward; useful for RL training. Both numbers are always reported either way.
Add HealthBench Professional variant (difficulty="professional", 525 rows). Different upstream schema (conversation / rubric_items / id, no per-rubric tags) is projected onto the canonical shape at load time via _normalize_to_canonical_schema. Rubric-level axis/consensus slicing is unavailable for this variant; example-level tags (use_case, type, difficulty, specialty) are surfaced in info for analytics.
Merge per-variant dataset id and split into a single HEALTHBENCH_DATASET_MAPPING dict.
README: env-args table updated for new knobs; new "Known disparencies with openai/simple-evals" section documents three real numerical differences that remain (per-rollout vs aggregate clip, bounded judge-retry fallback, multi-judge averaging when K > 1).
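
The retry fix in the first item can be sketched as follows. This is a minimal illustration, not the code in this PR: parse_json, needs_retry, grade_with_retry, and the call_judge callable are hypothetical names, and two behaviors are assumptions (max_judge_retries counts retries after the first attempt, and an exhausted budget falls back to criteria_met=False).

```python
import json
from typing import Any, Callable, Optional


def parse_json(raw: str) -> dict[str, Any]:
    """Lenient parser: returns {} on bad JSON instead of raising."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


def needs_retry(raw: Optional[str], error: Optional[Exception]) -> bool:
    """Retry on a judge error, an empty response, or a missing/non-boolean criteria_met."""
    if error is not None or not raw:
        return True
    return not isinstance(parse_json(raw).get("criteria_met"), bool)


def grade_with_retry(
    call_judge: Callable[[], str], max_judge_retries: int = 3
) -> dict[str, Any]:
    # Assumption: one initial attempt plus up to max_judge_retries retries.
    for _ in range(max_judge_retries + 1):
        raw, error = None, None
        try:
            raw = call_judge()
        except Exception as exc:  # any failed judge call counts as a failure signal
            error = exc
        if not needs_retry(raw, error):
            return parse_json(raw)
    # Assumption: once retries are exhausted, the criterion is scored as not met.
    return {"criteria_met": False, "explanation": "judge retries exhausted"}
```

Because the parser never raises, wrapping it in try/except cannot detect a malformed response; checking the parsed value directly is what makes the retry path reachable.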

Test plan

uv run ruff check environments/healthbench/ && uv run ruff format --check environments/healthbench/ — clean.
Manual smoke: each difficulty (all, hard, consensus, professional) loads; normalization for professional produces the correct canonical fields (prompt, prompt_id, rubrics, example_tags) plus the expected info (theme=professional, axes=[None, ...]).
Validation: length_adjustment_* both-or-neither, non-negative, and use_length_adjusted_as_reward=True requiring length-adjustment all raise ValueError as expected.
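
The validation rules above can be sketched like this. It is an illustration under stated assumptions rather than the PR's implementation: resolve_length_adjustment and the penalty, center, and enabled parameters are hypothetical names (the PR only refers to the knobs as length_adjustment_*), the first default value of each pair is assumed to be a penalty coefficient, and the defaults table is keyed by the variant names used in the description.

```python
from typing import Optional

# Per-variant defaults quoted in the PR description; the tuple layout
# (penalty, center) and the key names are assumptions for this sketch.
DEFAULTS: dict[str, tuple[float, float]] = {
    "regular": (0.0299, 2000.0),
    "hard": (0.0392, 2000.0),
    "consensus": (0.0020, 2000.0),
    "professional": (0.0147, 2000.0),
}


def resolve_length_adjustment(
    variant: str,
    enabled: bool,
    penalty: Optional[float] = None,
    center: Optional[float] = None,
    use_length_adjusted_as_reward: bool = False,
) -> Optional[tuple[float, float]]:
    """Validate the knobs and return (penalty, center), or None when disabled."""
    # Both-or-neither: setting only one knob is an error.
    if (penalty is None) != (center is None):
        raise ValueError("penalty and center must be set together or both left as None")
    # Non-negative: neither knob may be below zero.
    if penalty is not None and center is not None and (penalty < 0 or center < 0):
        raise ValueError("penalty and center must be non-negative")
    # The reward swap only makes sense when length adjustment is on.
    if use_length_adjusted_as_reward and not enabled:
        raise ValueError("use_length_adjusted_as_reward=True requires length adjustment")
    if not enabled:
        return None
    if penalty is None or center is None:
        return DEFAULTS[variant]  # both knobs None: fall back to the variant defaults
    return (penalty, center)
```

For example, resolve_length_adjustment("hard", enabled=True) returns the hard-variant defaults, while passing only one of the two knobs raises ValueError.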


bofenghuang added 5 commits May 14, 2026 10:42

Fix HealthBench judge retry on malformed JSON or failed call

c4fc45e

Add HealthBench length-adjusted score metric with OpenAI defaults

2c796e6

Add HealthBench Professional variant

709d8be

Document remaining disparencies with openai/simple-evals in HealthBench README

7a856f1

Fix HealthBench length-adjustment doc URL

9e64a98

bofenghuang wants to merge 5 commits into MedARC-AI:main from bofenghuang:fix/healthbench-official-parity