Skip to content

update eval.json#2176

Merged
sosahi merged 2 commits into
mainfrom
sohail/update-eval
May 29, 2026
Merged

update eval.json#2176
sosahi merged 2 commits into
mainfrom
sohail/update-eval

Conversation

@sosahi
Copy link
Copy Markdown
Collaborator

@sosahi sosahi commented May 29, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@sosahi sosahi requested review from a team as code owners May 29, 2026 23:17
@sosahi sosahi requested a review from jperez999 May 29, 2026 23:18
@sosahi
Copy link
Copy Markdown
Collaborator Author

sosahi commented May 29, 2026

/nvskills-ci

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 29, 2026

Greptile Summary

This PR refreshes the nemo-retriever skill evaluation package: evals.json is replaced with four new scenarios (climate-change search, risk-management summarization, data-privacy clause extraction across mixed formats, and a negative chocolate-cake task), BENCHMARK.md is updated with live claude-code/codex benchmark results, skill-card.md is expanded to cover multimodal document types (images, Office, audio, video), and the Sigstore signature bundle is regenerated to match the new file tree.

  • evals.json: All four tasks are rewritten with broader document formats and richer citation expectations; the output.json persistence step that appeared in the previous version of evals 001–003 is not present in the new scenarios.
  • BENCHMARK.md / skill-card.md: Overall verdict upgraded to PASS; Effectiveness dimension scores (45%/47%) fall below the stated 50% pass threshold and show negative uplift for both agents — inconsistent with the PASS verdict.
  • skill.oms.sig: Signature bundle covers the updated file set including evals/evals.json and references/troubleshooting.md (replacing references/pitfalls.md).

Confidence Score: 4/5

Documentation and eval-definition update; no executable code changed. Safe to merge after resolving the benchmark verdict inconsistency.

The Effectiveness dimension scores (45% for claude-code, 47% for codex) are both below the document's stated 50% pass threshold and show negative uplift, yet the top-level verdict reads PASS. All other changes — new eval scenarios, multimodal format expansion, updated skill-card, refreshed signature — look correct and internally consistent.

skills/nemo-retriever/BENCHMARK.md — the Effectiveness row and the overall verdict need to be reconciled.

Important Files Changed

Filename Overview
skills/nemo-retriever/evals/evals.json Replaces all four eval scenarios with new questions covering climate change search, risk-management summarization, data-privacy clause extraction (now multi-format), and a negative chocolate-cake task; output-file verification steps removed (flagged separately in prior review).
skills/nemo-retriever/BENCHMARK.md Upgraded from FAIL to PASS with live agent results; Effectiveness dimension (45%/47%) falls below the document's own 50% pass threshold for both agents and shows negative uplift, creating an inconsistency with the overall PASS verdict.
skills/nemo-retriever/skill-card.md Expands supported formats to include images, Office docs, audio, and video; adds evaluation agents, signals, and results table; renames pitfalls.md reference to troubleshooting.md; updates skill version SHA.
skills/nemo-retriever/skill.oms.sig Refreshed DSSE/Sigstore bundle covering the updated skill file tree; covers new evals.json and troubleshooting.md resources; drops the old pitfalls.md and scripts hashes from the prior signature.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
skills/nemo-retriever/BENCHMARK.md:46-52
**Effectiveness below pass threshold contradicts PASS verdict**

The report's stated pass threshold is 50%, yet the Effectiveness dimension scores 45% for `claude-code` and 47% for `codex` — both below that threshold and both showing negative uplift (-3% and -2%), meaning the skill actively reduces performance relative to no-skill baseline. Every other dimension passes. If the 50% threshold applies uniformly, Effectiveness is a failing dimension, which conflicts with the top-level "Overall verdict: PASS." Either the threshold does not apply to Effectiveness (in which case that exception should be documented), or the overall verdict should be FAIL or CONDITIONAL PASS.

Reviews (2): Last reviewed commit: "Attach NVSkills validation signatures" | Re-trigger Greptile

Comment on lines 1 to +56
[
{
"id": "nemo-retriever-001",
"question": "Can you use the nemo-retriever tool to search my PDF folder for the term \"machine learning\" and give me the relevant passages?",
"expected_skill": "nemo-retriever",
"expected_script": "None",
"ground_truth": "The agent built a LanceDB index of the PDFs, performed a vector search for \"machine learning\", and returned the matching document excerpts in ./output.json.",
"expected_behavior": [
"The agent read the nemo-retriever SKILL.md before executing commands",
"The agent executed a Bash command to ingest the PDFs with `retriever ingest`",
"The agent executed a Bash command to query the index with `retriever query` for \"machine learning\"",
"The agent wrote ./output.json containing the query results",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-retriever-002",
"question": "I have a folder of research papers in PDF. How can I find which ones discuss reinforcement learning?",
"expected_skill": "nemo-retriever",
"expected_script": "None",
"ground_truth": "The agent queried the existing PDF index for reinforcement learning topics and supplied the relevant passages in ./output.json.",
"expected_behavior": [
"The agent read the nemo-retriever SKILL.md before executing commands",
"The agent executed a Bash command to query the index with `retriever query` for reinforcement learning",
"The agent wrote ./output.json with the relevant excerpts",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-retriever-003",
"question": "Our legal team needs to quickly locate clauses about data privacy in the contract PDFs we stored. Can you set up a search over those documents?",
"expected_skill": "nemo-retriever",
"expected_script": "None",
"ground_truth": "The agent created an index of the contract PDFs, searched for data‑privacy clauses, and returned the matching sections in ./output.json.",
"expected_behavior": [
"The agent read the nemo-retriever SKILL.md before executing commands",
"The agent executed a Bash command to ingest the contract PDFs with `retriever ingest`",
"The agent executed a Bash command to query the index with `retriever query` for data privacy clauses",
"The agent wrote ./output.json containing the found clauses",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-retriever-004",
"question": "What's the best way to bake a chocolate cake?",
"expected_skill": null,
"expected_script": "None",
"ground_truth": "The agent answered the cooking question directly without invoking the nemo-retriever tool.",
"expected_behavior": [
"The agent responded with a textual answer without invoking any Bash commands",
"The agent did not read the nemo-retriever SKILL.md or use the retriever CLI",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
}
]
{
"id": "nemo-retriever-001",
"question": "Use the nemo-retriever skill to find every mention of \"climate change\" in the PDF reports inside my folder \"research_reports\".",
"expected_skill": "nemo-retriever",
"expected_script": "None",
"ground_truth": "The agent indexed the folder and returned all passages containing \"climate change\" from the PDFs, each with the file name and page number as citations.",
"expected_behavior": [
"The agent read the nemo-retriever SKILL.md before executing commands",
"The agent executed a `retriever ingest` command to index the \"research_reports\" folder",
"The agent executed a `retriever query` command with the search term \"climate change\"",
"The agent returned the matching excerpts with file and page citations",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-retriever-002",
"question": "Can you search through all the documents I uploaded and give me a summary of the sections that discuss risk management?",
"expected_skill": "nemo-retriever",
"expected_script": "None",
"ground_truth": "The agent searched across the uploaded PDFs, DOCX, and text files, produced a concise summary of each risk‑management section, and included citations to the source documents.",
"expected_behavior": [
"The agent read the nemo-retriever SKILL.md before executing commands",
"The agent executed a `retriever ingest` command to index the uploaded document collection",
"The agent executed a `retriever query` command targeting \"risk management\"",
"The agent returned a summarized answer with citations to each source",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-retriever-003",
"question": "Our legal team needs to extract every clause about data privacy from the collection of contracts we have (PDFs, Word docs, and scanned images). Please provide the clauses with citations.",
"expected_skill": "nemo-retriever",
"expected_script": "None",
"ground_truth": "The agent indexed the mixed‑format contracts folder and extracted every verbatim data‑privacy clause, listing each clause together with the document name and page/slide number where it appears.",
"expected_behavior": [
"The agent read the nemo-retriever SKILL.md before executing commands",
"The agent executed a `retriever ingest` command to index PDFs, DOCX, and image files in the contracts folder",
"The agent executed a `retriever query` command to locate clauses containing \"data privacy\"",
"The agent returned each clause verbatim with document and location citations",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-retriever-004",
"question": "How do I bake a chocolate cake from scratch?",
"expected_skill": null,
"expected_script": "None",
"ground_truth": "The agent provided a step‑by‑step chocolate cake recipe without using the nemo-retriever skill or any tool calls.",
"expected_behavior": [
"The agent responded with a chocolate cake recipe without invoking any tools",
"The agent did not execute any Bash commands or read the nemo-retriever SKILL.md",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
}
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Output file verification removed from all retrieval evals

The previous versions of evals 001, 002, and 003 all included an explicit expected_behavior step asserting the agent wrote ./output.json. That step has been dropped from every retrieval eval in this update. If any part of the eval harness still checks for an output file on disk to verify actual retrieved content, all three evals will now silently pass even when the agent produces no usable result — the behavioral contract that retrieval happened and was persisted is no longer encoded.

Prompt To Fix With AI
This is a comment left during a code review.
Path: skills/nemo-retriever/evals/evals.json
Line: 1-56

Comment:
**Output file verification removed from all retrieval evals**

The previous versions of evals 001, 002, and 003 all included an explicit `expected_behavior` step asserting the agent wrote `./output.json`. That step has been dropped from every retrieval eval in this update. If any part of the eval harness still checks for an output file on disk to verify actual retrieved content, all three evals will now silently pass even when the agent produces no usable result — the behavioral contract that retrieval happened and was persisted is no longer encoded.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +22 to +38
"expected_behavior": [
"The agent read the nemo-retriever SKILL.md before executing commands",
"The agent executed a `retriever ingest` command to index the uploaded document collection",
"The agent executed a `retriever query` command targeting \"risk management\"",
"The agent returned a summarized answer with citations to each source",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-retriever-003",
"question": "Our legal team needs to extract every clause about data privacy from the collection of contracts we have (PDFs, Word docs, and scanned images). Please provide the clauses with citations.",
"expected_skill": "nemo-retriever",
"expected_script": "None",
"ground_truth": "The agent indexed the mixed‑format contracts folder and extracted every verbatim data‑privacy clause, listing each clause together with the document name and page/slide number where it appears.",
"expected_behavior": [
"The agent read the nemo-retriever SKILL.md before executing commands",
"The agent executed a `retriever ingest` command to index PDFs, DOCX, and image files in the contracts folder",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Summarization claim inconsistent with retriever query output contract

The ground_truth for eval 002 says the agent "produced a concise summary of each risk‑management section," and the final expected_behavior step says "returned a summarized answer with citations." However, retriever query emits a raw JSON array of vector-search hits (one object per chunk with text, source, page_number, etc.) — it does not produce a prose summary. An automated evaluator grading behavior against this step would need to distinguish between "agent returned raw hits" and "agent synthesized a summary," but the step as written conflates the two. If the summarization is expected to come from the LLM interpreting the hits, that should be a separate explicit step or the ground_truth should reflect raw excerpts, as evals 001 and 003 do.

Prompt To Fix With AI
This is a comment left during a code review.
Path: skills/nemo-retriever/evals/evals.json
Line: 22-38

Comment:
**Summarization claim inconsistent with `retriever query` output contract**

The `ground_truth` for eval 002 says the agent "produced a concise summary of each risk‑management section," and the final `expected_behavior` step says "returned a summarized answer with citations." However, `retriever query` emits a raw JSON array of vector-search hits (one object per chunk with `text`, `source`, `page_number`, etc.) — it does not produce a prose summary. An automated evaluator grading behavior against this step would need to distinguish between "agent returned raw hits" and "agent synthesized a summary," but the step as written conflates the two. If the summarization is expected to come from the LLM interpreting the hits, that should be a separate explicit step or the `ground_truth` should reflect raw excerpts, as evals 001 and 003 do.

How can I resolve this? If you propose a fix, please make it concise.

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@sosahi sosahi merged commit 2ebcf10 into main May 29, 2026
8 of 9 checks passed
@sosahi sosahi deleted the sohail/update-eval branch May 29, 2026 23:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants