feat(evaluator): filesystem extension, Evidence handles, ATIF traces, and taskset #432
Draft
arpitsardhana wants to merge 1 commit into
Implement the design's metrics-over-evidence P0 surface on the agent-eval SDK:
- Filesystem handle: read_bytes, list(pattern), diff (content-hash), and run_verifier (runs a command in a throwaway overlay so stored evidence is never mutated), plus FilesystemDiff/FilesystemEntry/CommandResult value types.
- ATIF trace model + read handles: AtifTrace/AtifEvent with normalizers for NAT trajectories and OTel/OpenInference spans, exposed via CandidateEvidence.trace() (TraceHandle) and .logs() (LogHandle) for portable trace/log metrics.
- AgentEvalTaskset: id/name wrapper around a unique-id task list.
- Example reference metrics (tests_pass, no_test_cheating, inefficient_retry_loop) showing scoring from evidence handles instead of a stamped reward.

Vendored mirror synced via make vendor.

Signed-off-by: Arpit Singh (SW-CLOUD) <arpsingh@nvidia.com>
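For readers unfamiliar with content-hash diffing, here is a minimal sketch of the idea behind the diff operation. It is not the SDK's implementation: TreeDiff, hash_tree, and diff_trees are hypothetical names invented for illustration, and the real FilesystemDiff/FilesystemEntry types may carry different fields.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class TreeDiff:
    """Hypothetical stand-in for a FilesystemDiff-like value type."""

    added: list[str] = field(default_factory=list)
    removed: list[str] = field(default_factory=list)
    modified: list[str] = field(default_factory=list)


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file path (relative to root, POSIX style) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def diff_trees(before: Path, after: Path) -> TreeDiff:
    """Compare two directory trees by content hash rather than by mtime or size."""
    old, new = hash_tree(before), hash_tree(after)
    return TreeDiff(
        added=sorted(new.keys() - old.keys()),
        removed=sorted(old.keys() - new.keys()),
        # A path present on both sides counts as modified only if its hash changed.
        modified=sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    )
```

Comparing hashes means a file that was rewritten with identical bytes is not reported as modified, which is usually what a metric wants when it asks what the agent actually changed.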
Summary
Fills the gaps in the agent-eval design of the evaluator SDK, so metrics can score from what the agent actually produced (filesystem, traces, logs) rather than a stamped reward. Built on top of latest main.
- Filesystem handle (values/evidence.py): read_bytes, list(pattern), content-hash diff → FilesystemDiff/FilesystemEntry, and run_verifier → CommandResult. run_verifier runs a command list (no shell) in a throwaway mkdtemp overlay so stored evidence is never mutated, and kills the process on timeout.
- ATIF trace model + read handles (values/evidence.py): portable AtifTrace/AtifEvent/AtifTokenUsage, with normalizers for NAT trajectories and OTel/OpenInference spans (normalize_trace dispatch). Exposed via CandidateEvidence.trace() → TraceHandle (events/tool_calls/token_usage) and .logs() → LogHandle.
- Taskset (agent_eval/tasks.py): AgentEvalTaskset, an id/name wrapper around a unique-id task list.
- Example reference metrics (examples/run_agent_eval/example_metrics.py, example-only): tests_pass, no_test_cheating, inefficient_retry_loop, scoring off the handles.
- Vendored mirror synced via make vendor.
Test plan
- ruff check + ruff format --check clean on changed files
- ty check clean
- pytest packages/nemo_evaluator_sdk/tests/agent_eval/ → 62 passed (filesystem ops + overlay isolation, NAT/OTel/OpenInference trace normalization, log handle, taskset dup-id, all three example metrics)
- make vendor regenerated mirror, no drift
Notes
- New value types live in values/evidence.py (no new value module), per review preference.
- The .list() evidence method shadows the builtin inside its class; the list[str] annotations use an alias instead of renaming the public API.
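For reviewers who want the overlay behavior spelled out, the sketch below shows one way to run a verifier in a throwaway copy: copy the evidence tree into a mkdtemp directory, run the command as an argument list with no shell, and rely on subprocess.run to kill the child on timeout. The names run_in_overlay and VerifierResult are hypothetical and the field layout is an assumption; the PR's actual run_verifier and CommandResult may differ.

```python
import shutil
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass
class VerifierResult:
    """Hypothetical stand-in for a CommandResult-like value type."""

    returncode: Optional[int]  # None when the command timed out
    stdout: str
    stderr: str
    timed_out: bool


def _as_text(data: Union[str, bytes, None]) -> str:
    """TimeoutExpired may carry str, bytes, or None for captured output."""
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_in_overlay(
    evidence_dir: Path, command: list[str], timeout_s: float = 60.0
) -> VerifierResult:
    """Run command inside a disposable copy of evidence_dir."""
    overlay = Path(tempfile.mkdtemp(prefix="verifier-overlay-"))
    try:
        workdir = overlay / "fs"
        shutil.copytree(evidence_dir, workdir)  # the original tree is only read
        try:
            # A list argument with the default shell=False avoids shell interpolation.
            proc = subprocess.run(
                command, cwd=workdir, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired as exc:
            # subprocess.run has already killed and reaped the child at this point.
            return VerifierResult(None, _as_text(exc.stdout), _as_text(exc.stderr), True)
        return VerifierResult(proc.returncode, proc.stdout, proc.stderr, False)
    finally:
        shutil.rmtree(overlay, ignore_errors=True)  # discard any writes the command made
```

A full copy is the simplest way to get isolation. Anything the verifier writes or deletes disappears with the temporary directory, so repeated metric runs see the same stored evidence.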