
Improve HealthBench parity with openai/simple-evals #132

Open
bofenghuang wants to merge 5 commits into MedARC-AI:main from bofenghuang:fix/healthbench-official-parity

@bofenghuang
Contributor

Summary

Tightens HealthBench's parity with openai/simple-evals and adds the Professional variant.

Changes

  • Fix judge retry: the previous try/except AttributeError around _parse_json never fired (the parser returns {} on bad JSON), so malformed grading responses were silently scored criteria_met=False and the existing rerun_judge path was dead code. Retry is now driven by the real failure signals (judge error, empty raw, or missing/non-boolean criteria_met), bounded by a new max_judge_retries arg (default 3).
  • Add optional length-adjusted score metric (length_adjusted_score) matching simple-evals' formula. Computed in order raw → length adjust → clip to [0, 1]. Length-adjustment is opt-in but auto-applied when both knobs are None, using OpenAI's published per-variant defaults (regular 0.0299, hard 0.0392, consensus 0.0020, professional 0.0147; center=2000 for all).
  • Add use_length_adjusted_as_reward flag (default False) that swaps weights so length_adjusted_score becomes the headline reward; useful for RL training. Both numbers are always reported either way.
  • Add HealthBench Professional variant (difficulty="professional", 525 rows). Different upstream schema (conversation / rubric_items / id, no per-rubric tags) is projected onto the canonical shape at load time via _normalize_to_canonical_schema. Rubric-level axis/consensus slicing is unavailable for this variant; example-level tags (use_case, type, difficulty, specialty) are surfaced in info for analytics.
  • Merge per-variant dataset id and split into a single HEALTHBENCH_DATASET_MAPPING dict.
  • README: env-args table updated for new knobs; new "Known discrepancies with openai/simple-evals" section documents three real numerical differences that remain (per-rollout vs aggregate clip, bounded judge-retry fallback, multi-judge averaging when K > 1).
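
To make the judge-retry change concrete, here is a minimal sketch of the retry conditions described in the first bullet. It is not the code in this PR: `parse_json`, `needs_retry`, `grade_with_retry`, and the `call_judge` callable are hypothetical stand-ins, and two behaviors are assumptions on my part: that `max_judge_retries=3` means one initial attempt plus up to three retries, and that the fallback after exhaustion is `criteria_met=False`.

```python
import json
from typing import Callable, Optional


def parse_json(raw: Optional[str]) -> dict:
    """Hypothetical stand-in for _parse_json: returns {} on bad JSON, never raises."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


def needs_retry(raw: Optional[str], error: Optional[Exception], parsed: dict) -> bool:
    """Retry on the failure signals named above, not on an exception from the parser."""
    if error is not None:  # judge call itself failed
        return True
    if not raw or not raw.strip():  # empty raw response
        return True
    # missing or non-boolean criteria_met
    return not isinstance(parsed.get("criteria_met"), bool)


def grade_with_retry(call_judge: Callable[[], str], max_judge_retries: int = 3) -> dict:
    """Call the judge up to 1 + max_judge_retries times (assumed semantics)."""
    total_attempts = max_judge_retries + 1
    for attempt in range(total_attempts):
        raw, error = None, None
        try:
            raw = call_judge()
        except Exception as exc:  # any judge-side error counts as a failure signal
            error = exc
        parsed = parse_json(raw) if error is None else {}
        if not needs_retry(raw, error, parsed):
            return {
                "criteria_met": parsed["criteria_met"],
                "attempts": attempt + 1,
                "exhausted": False,
            }
    # Assumed fallback once the retry budget is spent.
    return {"criteria_met": False, "attempts": total_attempts, "exhausted": True}
```

The point of the sketch is that a parser returning `{}` can never trigger an `except AttributeError` branch, so the retry decision has to inspect the parsed result directly.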

Test plan

  • uv run ruff check environments/healthbench/ && uv run ruff format --check environments/healthbench/ — clean.
  • Manual smoke test: each difficulty (all, hard, consensus, professional) loads, and normalization for professional produces the correct canonical fields (prompt, prompt_id, rubrics, example_tags) plus the expected info (theme=professional, axes=[None, ...]).
  • Validation: violating the length_adjustment_* constraints (both-or-neither, non-negative) or setting use_length_adjusted_as_reward=True without length adjustment raises ValueError as expected.
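
For reviewers, the validation rules in the last item could look roughly like the sketch below. The function name `validate_length_adjustment_args`, the parameter names `length_adjustment_coef` and `length_adjustment_center`, and the `apply_length_adjustment` switch are hypothetical; only the `length_adjustment_*` prefix, `use_length_adjusted_as_reward`, and the three rules come from this PR.

```python
from typing import Optional


def validate_length_adjustment_args(
    length_adjustment_coef: Optional[float],    # hypothetical name
    length_adjustment_center: Optional[float],  # hypothetical name
    apply_length_adjustment: bool,              # hypothetical on/off switch
    use_length_adjusted_as_reward: bool = False,
) -> None:
    """Raise ValueError when the length-adjustment arguments are inconsistent."""
    # Rule 1: both knobs are set, or neither is.
    if (length_adjustment_coef is None) != (length_adjustment_center is None):
        raise ValueError("length_adjustment_* must be set together or both left as None")
    # Rule 2: when set, both knobs must be non-negative.
    if length_adjustment_coef is not None:
        if length_adjustment_coef < 0 or length_adjustment_center < 0:
            raise ValueError("length_adjustment_* must be non-negative")
    # Rule 3: the reward swap only makes sense when length adjustment is on.
    if use_length_adjusted_as_reward and not apply_length_adjustment:
        raise ValueError("use_length_adjusted_as_reward=True requires length adjustment")
```

Leaving both knobs as `None` passes validation in this sketch, which matches the description above where the per-variant defaults are used in that case.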
