Use cases, pain points, and background
NeMo Gym benchmarks currently do not have a standardized way to indicate whether a benchmark has been validated for production evaluation usage. As benchmark coverage grows, users need a simple mechanism to distinguish production-ready benchmarks from experimental or in-progress integrations. Without this metadata, benchmark readiness is unclear across configs, documentation, and evaluation workflows.
Description:
Add a standardized validated tag/status for benchmarks.
The tag should indicate that a benchmark:
- has passed internal validation checks
- has acceptable runtime behavior
- has established expected evaluation parity
- is approved for production evaluation usage
The tag should be surfaced in benchmark metadata/configuration and exposed through benchmark discovery/documentation flows where applicable (Ref: #1434)
Design:
Expected work includes:
- defining benchmark validated metadata/schema
- adding support in benchmark configs/registry metadata
- surfacing validation status in docs/discovery tooling
- documenting usage expectations for the tag
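
As a rough illustration of the first two items, the status could be modeled as a small enum on the benchmark metadata, with a helper that discovery tooling uses to filter. This is only a sketch under assumptions: `ValidationStatus`, `BenchmarkMetadata`, the `validation_status` config key, and the helper names are hypothetical and do not reflect existing NeMo Gym APIs.

```python
from dataclasses import dataclass
from enum import Enum


class ValidationStatus(str, Enum):
    """Hypothetical status values; the real schema may define more states."""
    EXPERIMENTAL = "experimental"
    VALIDATED = "validated"


@dataclass(frozen=True)
class BenchmarkMetadata:
    """Hypothetical metadata record for a single benchmark."""
    name: str
    status: ValidationStatus = ValidationStatus.EXPERIMENTAL


def parse_metadata(raw: dict) -> BenchmarkMetadata:
    """Build metadata from a config mapping.

    A missing `validation_status` key (an assumed key name) is treated as
    experimental, so existing configs are never marked validated by accident.
    An unknown value raises ValueError.
    """
    status = ValidationStatus(raw.get("validation_status", "experimental"))
    return BenchmarkMetadata(name=raw["name"], status=status)


def validated_only(benchmarks):
    """Return only the benchmarks marked as validated."""
    return [b for b in benchmarks if b.status is ValidationStatus.VALIDATED]
```

Defaulting to experimental keeps the change backward compatible with configs that predate the tag, which fits the out-of-scope note about not retroactively validating all benchmarks.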
Potential areas of change:
Out of scope:
- implementing automated validation workflows
- retroactively validating all benchmarks
- changing benchmark execution or scoring logic
Acceptance Criteria:
- validated status is surfaced in benchmark discovery/docs where applicable