
feat(evaluator): filesystem extension, Evidence handles, ATIF traces, and taskset #432

Draft
arpitsardhana wants to merge 1 commit into main from aalgo-258-p0-evidence-metrics/arpsingh

@arpitsardhana arpitsardhana commented Jun 24, 2026

Contributor

Summary

Implements the missing pieces of the agent-eval design in the evaluator SDK, so that metrics can score what the agent actually produced (filesystem, traces, logs) rather than a stamped reward. Built on top of the latest main.

  • Filesystem handle (values/evidence.py): read_bytes, list(pattern), content-hash diff → FilesystemDiff/FilesystemEntry, and run_verifier (runs a command list, without a shell, in a throwaway mkdtemp overlay so stored evidence is never mutated; kills the process on timeout) → CommandResult.
  • ATIF traces + read handles (values/evidence.py): portable AtifTrace/AtifEvent/AtifTokenUsage, with normalizers for NAT trajectories and OTel/OpenInference spans (normalize_trace dispatch). Exposed via CandidateEvidence.trace() → TraceHandle (events/tool_calls/token_usage) and .logs() → LogHandle.
  • Taskset (agent_eval/tasks.py): AgentEvalTaskset — id/name wrapper around a unique-id task list.
  • Example reference metrics (examples/run_agent_eval/example_metrics.py, example-only): tests_pass, no_test_cheating, inefficient_retry_loop scoring off the handles.
  • Vendored SDK mirror synced via make vendor.
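
The run_verifier behavior described above (argv list, no shell, throwaway mkdtemp overlay, kill on timeout) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in values/evidence.py: the CommandResult fields, the `timeout_s` parameter, the `_text` helper, and the copy-based overlay are all assumptions made for the example.

```python
import shutil
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class CommandResult:
    # Hypothetical fields; the real CommandResult may be shaped differently.
    returncode: Optional[int]
    stdout: str
    stderr: str
    timed_out: bool


def _text(data) -> str:
    # TimeoutExpired may carry bytes, str, or None for captured output.
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def run_verifier(evidence_root: Path, command: list[str], timeout_s: float = 60.0) -> CommandResult:
    """Run `command` in a disposable copy of `evidence_root` (assumed overlay strategy)."""
    overlay = Path(tempfile.mkdtemp(prefix="verifier-overlay-"))
    try:
        workdir = overlay / "fs"
        shutil.copytree(evidence_root, workdir)  # the stored evidence itself is never the cwd
        try:
            # Passing a list with the default shell=False means no shell parses the command.
            proc = subprocess.run(
                command, cwd=workdir, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired as exc:
            # subprocess.run kills the child and waits for it before raising.
            return CommandResult(None, _text(exc.stdout), _text(exc.stderr), True)
        return CommandResult(proc.returncode, proc.stdout, proc.stderr, False)
    finally:
        shutil.rmtree(overlay, ignore_errors=True)  # discard the overlay in every case


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    (root / "a.txt").write_text("original")
    result = run_verifier(root, [sys.executable, "-c", "open('a.txt', 'w').write('changed')"])
    print(result.returncode, (root / "a.txt").read_text())
```

A full copy is the simplest way to get isolation in a sketch; a real implementation might prefer a cheaper overlay mechanism, which the PR description does not specify.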

Test plan

  • ruff check + ruff format --check clean on changed files
  • ty check clean
  • pytest packages/nemo_evaluator_sdk/tests/agent_eval/ → 62 passed (filesystem ops + overlay isolation, NAT/OTel/OpenInference trace normalization, log handle, taskset dup-id, all three example metrics)
  • make vendor regenerated mirror, no drift
  • CI green

Notes

  • All SDK additions live in values/evidence.py (no new value module) per review preference.
  • The documented .list() evidence method shadows the builtin list inside its class, so list[str] annotations in that class use an alias instead of renaming the public API.
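
To illustrate the shadowing issue in the last note: once a class body defines a method named `list`, later uses of the bare name `list` at class-body level (such as annotations) refer to that method rather than the builtin. The sketch below shows one possible alias-based workaround; the class name FilesystemHandle, the alias name `_List`, and the `suffixes` method are hypothetical and not taken from the PR.

```python
import builtins
from pathlib import Path

# Alias captured at module scope, where `list` still means the builtin type.
_List = builtins.list  # hypothetical alias name


class FilesystemHandle:  # hypothetical stand-in for the evidence filesystem handle
    def __init__(self, root: Path) -> None:
        self._root = root

    def list(self, pattern: str = "*") -> _List[str]:
        """Return sorted, POSIX-style relative paths of files matching `pattern`."""
        return sorted(
            p.relative_to(self._root).as_posix()
            for p in self._root.glob(pattern)
            if p.is_file()
        )

    # From this point in the class body, the bare name `list` is the method above.
    # On interpreters that evaluate annotations eagerly, `-> list[str]` here would
    # try to subscript a function; type checkers also resolve it to the method.
    def suffixes(self) -> _List[str]:
        """Return the sorted, distinct file suffixes under the root."""
        # Inside a method body, names resolve through module scope, so calling
        # builtins such as sorted() is unaffected by the class-level shadowing.
        return sorted({Path(name).suffix for name in self.list("*")})
```

The public method keeps its documented name, and only the annotations change, which matches the trade-off the note describes.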

Implement the design's metrics-over-evidence P0 surface on the agent-eval SDK:

- Filesystem handle: read_bytes, list(pattern), diff (content-hash), and
  run_verifier (runs a command in a throwaway overlay so stored evidence is
  never mutated) plus FilesystemDiff/FilesystemEntry/CommandResult value types.
- ATIF trace model + read handles: AtifTrace/AtifEvent with normalizers for NAT
  trajectories and OTel/OpenInference spans, exposed via CandidateEvidence.trace()
  (TraceHandle) and .logs() (LogHandle) for portable trace/log metrics.
- AgentEvalTaskset: id/name wrapper around a unique-id task list.
- Example reference metrics (tests_pass, no_test_cheating, inefficient_retry_loop)
  showing scoring from evidence handles instead of a stamped reward.
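
As a rough illustration of the content-hash diff mentioned in the list above, the sketch below compares two snapshots that map relative paths to digests. SHA-256, the `snapshot` helper, and the added/removed/modified fields on FilesystemDiff are assumptions for the example; the actual value types in the PR may differ.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's POSIX-style relative path to the SHA-256 hex digest of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


@dataclass(frozen=True)
class FilesystemDiff:
    # Hypothetical shape for illustration only.
    added: tuple[str, ...]
    removed: tuple[str, ...]
    modified: tuple[str, ...]


def diff(before: dict[str, str], after: dict[str, str]) -> FilesystemDiff:
    """Classify paths by comparing content hashes rather than timestamps or sizes."""
    added = tuple(sorted(after.keys() - before.keys()))
    removed = tuple(sorted(before.keys() - after.keys()))
    modified = tuple(
        sorted(path for path in before.keys() & after.keys() if before[path] != after[path])
    )
    return FilesystemDiff(added=added, removed=removed, modified=modified)
```

Comparing digests means a file that was rewritten with identical bytes does not count as modified, which is the usual reason to prefer content hashes over modification times.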

Vendored mirror synced via make vendor.

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
@github-actions github-actions Bot added the feat label Jun 24, 2026
@arpitsardhana arpitsardhana changed the title from "feat(evaluator): P0 evidence handles, ATIF traces, and taskset" to "feat(evaluator): filesystem extension, Evidence handles, ATIF traces, and taskset" Jun 24, 2026
@github-actions

Contributor
| Suite | Lines Covered | Line Rate | Branch Rate |
|---|---|---|---|
| Unit Tests | 21185/27777 | 76.3% | 61.3% |
| Integration Tests | 12221/26546 | 46.0% | 19.4% |
