feat(bench): implement LOCOMO, FRAMES, and GAIA dataset loaders #2842

Merged
bug-ops merged 1 commit into main from locomo-frames-gaia-dataset-loaders
Apr 8, 2026

@bug-ops bug-ops commented Apr 8, 2026

Summary

  • Add shared Scenario / Evaluator traits and metric functions (token_f1, exact_match, gaia_normalized_exact_match) in scenario.rs
  • LocomoLoader parses lmlab/locomo JSON array (one Scenario per QA pair); LocomoEvaluator uses token F1 with threshold 0.5
  • FramesLoader parses google/frames-benchmark JSONL, stores reasoning_types in metadata; FramesEvaluator uses exact match
  • GaiaLoader parses gaia-benchmark/GAIA JSONL with optional --level 1|2|3 filter; GaiaEvaluator uses GAIA-normalized exact match (strip articles, punctuation, collapse whitespace)
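The GAIA-normalized comparison described in the last bullet could look roughly like the sketch below. It is not the code from this PR. The helper name `gaia_normalize`, the lowercasing step, the restriction to ASCII punctuation, and the article list (`a`, `an`, `the`) are all assumptions made for illustration.

```rust
/// Hypothetical helper: normalizes an answer before comparison.
/// Assumptions: input is lowercased, only ASCII punctuation is stripped,
/// and the articles removed are "a", "an", and "the".
fn gaia_normalize(s: &str) -> String {
    // Lowercase and drop punctuation characters entirely.
    let cleaned: String = s
        .to_lowercase()
        .chars()
        .filter(|c| !c.is_ascii_punctuation())
        .collect();

    // Splitting on whitespace and re-joining with single spaces
    // collapses runs of whitespace and trims both ends.
    cleaned
        .split_whitespace()
        .filter(|w| !matches!(*w, "a" | "an" | "the"))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Returns true when both strings are equal after normalization.
fn gaia_normalized_exact_match(prediction: &str, reference: &str) -> bool {
    gaia_normalize(prediction) == gaia_normalize(reference)
}
```

Under these assumptions, `gaia_normalized_exact_match("the Eiffel Tower!", "Eiffel   tower")` returns `true`, because both sides normalize to `eiffel tower`.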

Closes #2836, #2837, #2839. Part of epic #2827.

Test plan

  • 52 unit tests across all new modules (synthetic fixtures for each loader + evaluator)
  • cargo nextest run -p zeph-bench --lib — 52 passed, 0 skipped
  • Full workspace: 7788 tests passed
  • cargo +nightly fmt --check — clean
  • cargo clippy -p zeph-bench --all-targets -- -D warnings — clean

@github-actions github-actions bot added the enhancement (New feature or request), rust (Rust code changes), dependencies (Dependency updates), and size/XL (Extra large PR, 500+ lines) labels and removed the enhancement (New feature or request) label Apr 8, 2026
@bug-ops bug-ops enabled auto-merge (squash) April 8, 2026 13:16
Closes #2836, #2837, #2839

Add shared Scenario/Evaluator traits and metric functions:
- token_f1: whitespace-token overlap F1 score
- exact_match: case-insensitive, punctuation-stripped equality
- gaia_normalized_exact_match: strips articles, punctuation, collapses whitespace
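As a rough illustration of the first two metrics listed above, here is a minimal sketch. It is not the implementation in `scenario.rs`. Counting repeated tokens as a multiset, comparing tokens case-sensitively in `token_f1`, scoring two empty strings as 1.0, and trimming surrounding whitespace in `exact_match` are assumptions.

```rust
use std::collections::HashMap;

/// Whitespace-token overlap F1 between a prediction and a reference.
/// Assumption: repeated tokens are matched as a multiset, so each
/// reference token can be matched at most once.
fn token_f1(prediction: &str, reference: &str) -> f64 {
    let pred: Vec<&str> = prediction.split_whitespace().collect();
    let gold: Vec<&str> = reference.split_whitespace().collect();

    // Assumption: two empty answers agree; one empty answer scores zero.
    if pred.is_empty() || gold.is_empty() {
        return if pred.is_empty() && gold.is_empty() { 1.0 } else { 0.0 };
    }

    // Count how many times each token is still available in the reference.
    let mut remaining: HashMap<&str, usize> = HashMap::new();
    for t in &gold {
        *remaining.entry(*t).or_insert(0) += 1;
    }

    let mut overlap = 0usize;
    for t in &pred {
        if let Some(n) = remaining.get_mut(*t) {
            if *n > 0 {
                *n -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }

    let precision = overlap as f64 / pred.len() as f64;
    let recall = overlap as f64 / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}

/// Case-insensitive equality after stripping ASCII punctuation.
/// Assumption: leading and trailing whitespace is also ignored.
fn exact_match(prediction: &str, reference: &str) -> bool {
    fn norm(s: &str) -> String {
        s.chars()
            .filter(|c| !c.is_ascii_punctuation())
            .flat_map(|c| c.to_lowercase())
            .collect::<String>()
            .trim()
            .to_string()
    }
    norm(prediction) == norm(reference)
}
```

For example, with prediction `the cat` and reference `the cat sat`, precision is 2/2 and recall is 2/3, which gives an F1 of 0.8.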

Loaders and evaluators:
- LocomoLoader: parses lmlab/locomo JSON array; one Scenario per QA pair;
  LocomoEvaluator uses token F1 with threshold 0.5
- FramesLoader: parses google/frames-benchmark JSONL; stores reasoning_types
  in metadata; FramesEvaluator uses exact match
- GaiaLoader: parses gaia-benchmark/GAIA JSONL with optional --level filter;
  GaiaEvaluator uses GAIA-normalized exact match
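To make the loader and evaluator split above more concrete, here is a hedged sketch of how a shared trait, a thresholded evaluator, and the optional level filter might fit together. Every name and field below (`Scenario`'s fields, `ThresholdEvaluator`, `filter_by_level`, `equality_score`) is hypothetical and not taken from the PR. Treating a score equal to the threshold as a pass is also an assumption.

```rust
/// Hypothetical scenario shape; the real struct may differ.
#[derive(Debug, Clone, PartialEq)]
struct Scenario {
    id: String,
    prompt: String,
    expected: String,
    /// Difficulty level, if the dataset provides one (for example GAIA 1, 2, or 3).
    level: Option<u8>,
}

/// Hypothetical evaluator trait: decides whether a response passes.
trait Evaluator {
    fn evaluate(&self, scenario: &Scenario, response: &str) -> bool;
}

/// Passes when a numeric metric reaches a threshold, in the spirit of
/// a token F1 evaluator with threshold 0.5.
struct ThresholdEvaluator {
    metric: fn(&str, &str) -> f64,
    threshold: f64,
}

impl Evaluator for ThresholdEvaluator {
    fn evaluate(&self, scenario: &Scenario, response: &str) -> bool {
        // Assumption: a score exactly at the threshold counts as a pass.
        (self.metric)(response, &scenario.expected) >= self.threshold
    }
}

/// Stand-in metric for illustration: 1.0 on strict equality, else 0.0.
fn equality_score(prediction: &str, reference: &str) -> f64 {
    if prediction == reference { 1.0 } else { 0.0 }
}

/// Keeps every scenario when no level is requested; otherwise keeps
/// only scenarios whose level matches the requested one.
fn filter_by_level(scenarios: Vec<Scenario>, level: Option<u8>) -> Vec<Scenario> {
    match level {
        None => scenarios,
        Some(wanted) => scenarios
            .into_iter()
            .filter(|s| s.level == Some(wanted))
            .collect(),
    }
}
```

Passing the metric as a plain function pointer keeps the sketch small; a real evaluator could just as well hard-code its metric or accept a closure.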

52 unit tests across all new modules; all 7788 workspace tests pass.
@bug-ops bug-ops force-pushed the locomo-frames-gaia-dataset-loaders branch from 39eb57c to 2fe5fce April 8, 2026 13:20
@github-actions github-actions bot added the enhancement (New feature or request) label and removed the dependencies (Dependency updates) label Apr 8, 2026
@bug-ops bug-ops merged commit d58a1b2 into main Apr 8, 2026
29 checks passed
@bug-ops bug-ops deleted the locomo-frames-gaia-dataset-loaders branch April 8, 2026 13:28

Labels

  • enhancement (New feature or request)
  • rust (Rust code changes)
  • size/XL (Extra large PR, 500+ lines)

Development

Successfully merging this pull request may close these issues.

  • feat(bench): implement GAIA dataset loader
  • feat(bench): implement FRAMES dataset loader
  • feat(bench): implement LOCOMO dataset loader