feat(evaluator): filesystem extension, Evidence handles, ATIF traces, and taskset by arpitsardhana · Pull Request #432 · NVIDIA-NeMo/nemo-platform

arpitsardhana · 2026-06-24T06:24:09Z

Summary

Fills the gaps between the agent-eval design and the evaluator SDK, so that metrics can score what the agent actually produced (filesystem, traces, logs) rather than a stamped reward. Built on top of the latest main.

Filesystem handle (values/evidence.py): read_bytes, list(pattern), content-hash diff → FilesystemDiff/FilesystemEntry, and run_verifier (runs a command list — no shell — in a throwaway mkdtemp overlay so stored evidence is never mutated; kills the process on timeout) → CommandResult.
ATIF traces + read handles (values/evidence.py): portable AtifTrace/AtifEvent/AtifTokenUsage, with normalizers for NAT trajectories and OTel/OpenInference spans (normalize_trace dispatch). Exposed via CandidateEvidence.trace() → TraceHandle (events/tool_calls/token_usage) and .logs() → LogHandle.
Taskset (agent_eval/tasks.py): AgentEvalTaskset — id/name wrapper around a unique-id task list.
Example reference metrics (examples/run_agent_eval/example_metrics.py, example-only): tests_pass, no_test_cheating, inefficient_retry_loop scoring off the handles.
Vendored SDK mirror synced via make vendor.
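For reviewers who want a concrete picture of the run_verifier behaviour described above (command list, no shell, throwaway mkdtemp overlay, kill on timeout), here is a minimal sketch. It is not the code in values/evidence.py: the CommandResult field names, the evidence_root and timeout_s parameters, and the copy-based overlay are all assumptions made for illustration.

```python
from __future__ import annotations

import shutil
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CommandResult:
    # Field names are assumptions for this sketch, not the SDK's schema.
    exit_code: int | None
    stdout: str
    stderr: str
    timed_out: bool


def run_verifier(
    evidence_root: Path, command: list[str], timeout_s: float = 60.0
) -> CommandResult:
    """Run `command` inside a disposable copy of `evidence_root`."""
    scratch = Path(tempfile.mkdtemp(prefix="verifier-overlay-"))
    try:
        # Copy the stored evidence so the verifier can write freely
        # without touching the original files.
        overlay = scratch / "fs"
        shutil.copytree(evidence_root, overlay)
        # A list argv with the default shell=False means no shell parsing.
        proc = subprocess.Popen(
            command,
            cwd=overlay,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
        try:
            out, err = proc.communicate(timeout=timeout_s)
            return CommandResult(proc.returncode, out, err, timed_out=False)
        except subprocess.TimeoutExpired:
            # Kill the process, then drain its pipes so nothing is leaked.
            proc.kill()
            out, err = proc.communicate()
            return CommandResult(None, out, err, timed_out=True)
    finally:
        # The overlay is always discarded, even if the command failed.
        shutil.rmtree(scratch, ignore_errors=True)
```

Copying the tree is the simplest way to get isolation in a sketch; the actual overlay strategy in the PR may differ.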

Test plan

ruff check + ruff format --check clean on changed files
ty check clean
pytest packages/nemo_evaluator_sdk/tests/agent_eval/ → 62 passed (filesystem ops + overlay isolation, NAT/OTel/OpenInference trace normalization, log handle, taskset dup-id, all three example metrics)
make vendor regenerated mirror, no drift
CI green

Notes

All SDK additions live in values/evidence.py (no new value module) per review preference.
The documented .list() evidence method shadows the builtin list inside its class, so the affected list[str] annotations use an alias instead of renaming the public API.
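The shadowing issue can be reproduced in a few lines. The sketch below is illustrative only: FilesystemHandle, StrList, and names() are hypothetical stand-ins, not the SDK's actual class or alias. Once a class body defines a method called list, any later list[str] annotation in that body resolves to the method rather than the builtin, which is why an alias captured at module scope is used.

```python
import fnmatch

# Alias captured at module scope, before any class shadows the name `list`.
StrList = list[str]


class FilesystemHandle:
    """Hypothetical handle used only to illustrate the shadowing issue."""

    def __init__(self, paths: StrList) -> None:
        self._paths = sorted(paths)

    def list(self, pattern: str = "*") -> StrList:
        # Public API name kept as documented, even though it shadows the
        # builtin for the rest of this class body.
        return [p for p in self._paths if fnmatch.fnmatchcase(p, pattern)]

    # Writing `-> list[str]` here would refer to the method defined above,
    # not the builtin, so the alias is used instead.
    def names(self) -> StrList:
        return self.list("*")
```

Without the alias, and without postponed annotation evaluation, the class body fails at import time because a function object is not subscriptable; with postponed evaluation the module imports, but a type checker still resolves the name to the method.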

Implement the design's metrics-over-evidence P0 surface on the agent-eval SDK:

- Filesystem handle: read_bytes, list(pattern), diff (content-hash), and run_verifier (runs a command in a throwaway overlay so stored evidence is never mutated) plus FilesystemDiff/FilesystemEntry/CommandResult value types.
- ATIF trace model + read handles: AtifTrace/AtifEvent with normalizers for NAT trajectories and OTel/OpenInference spans, exposed via CandidateEvidence.trace() (TraceHandle) and .logs() (LogHandle) for portable trace/log metrics.
- AgentEvalTaskset: id/name wrapper around a unique-id task list.
- Example reference metrics (tests_pass, no_test_cheating, inefficient_retry_loop) showing scoring from evidence handles instead of a stamped reward.

Vendored mirror synced via make vendor.

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
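As a rough illustration of what a content-hash diff over two filesystem snapshots could look like, here is a small sketch. The FilesystemDiff fields (added, removed, modified), the hash_tree and diff_trees helpers, and the choice of SHA-256 are assumptions for this example; the PR's actual value types may be shaped differently.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FilesystemDiff:
    # Field names are assumptions for this sketch.
    added: tuple[str, ...]
    removed: tuple[str, ...]
    modified: tuple[str, ...]


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file's POSIX-style relative path to a SHA-256 content hash."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_trees(before: Path, after: Path) -> FilesystemDiff:
    """Compare two directory trees by content hash, ignoring timestamps."""
    old, new = hash_tree(before), hash_tree(after)
    return FilesystemDiff(
        added=tuple(sorted(new.keys() - old.keys())),
        removed=tuple(sorted(old.keys() - new.keys())),
        modified=tuple(
            sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
        ),
    )
```

Hashing content rather than comparing modification times means a file that was rewritten with identical bytes does not show up as modified.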

github-actions · 2026-06-24T06:38:39Z

Suite	Lines Covered	Line Rate	Branch Rate
Unit Tests	21185/27777	76.3%	61.3%
Integration Tests	12221/26546	46.0%	19.4%

github-actions Bot added the feat label Jun 24, 2026

arpitsardhana changed the title ~~feat(evaluator): P0 evidence handles, ATIF traces, and taskset~~ feat(evaluator): filesystem extension, Evidence handles, ATIF traces, and taskset Jun 24, 2026

arpitsardhana wants to merge 1 commit into main from aalgo-258-p0-evidence-metrics/arpsingh
