update eval.json #2176
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,55 +1,56 @@ | ||
| [ | ||
| { | ||
| "id": "nemo-retriever-001", | ||
| "question": "Can you use the nemo-retriever tool to search my PDF folder for the term \"machine learning\" and give me the relevant passages?", | ||
| "expected_skill": "nemo-retriever", | ||
| "expected_script": "None", | ||
| "ground_truth": "The agent built a LanceDB index of the PDFs, performed a vector search for \"machine learning\", and returned the matching document excerpts in ./output.json.", | ||
| "expected_behavior": [ | ||
| "The agent read the nemo-retriever SKILL.md before executing commands", | ||
| "The agent executed a Bash command to ingest the PDFs with `retriever ingest`", | ||
| "The agent executed a Bash command to query the index with `retriever query` for \"machine learning\"", | ||
| "The agent wrote ./output.json containing the query results", | ||
| "The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "nemo-retriever-002", | ||
| "question": "I have a folder of research papers in PDF. How can I find which ones discuss reinforcement learning?", | ||
| "expected_skill": "nemo-retriever", | ||
| "expected_script": "None", | ||
| "ground_truth": "The agent queried the existing PDF index for reinforcement learning topics and supplied the relevant passages in ./output.json.", | ||
| "expected_behavior": [ | ||
| "The agent read the nemo-retriever SKILL.md before executing commands", | ||
| "The agent executed a Bash command to query the index with `retriever query` for reinforcement learning", | ||
| "The agent wrote ./output.json with the relevant excerpts", | ||
| "The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "nemo-retriever-003", | ||
| "question": "Our legal team needs to quickly locate clauses about data privacy in the contract PDFs we stored. Can you set up a search over those documents?", | ||
| "expected_skill": "nemo-retriever", | ||
| "expected_script": "None", | ||
| "ground_truth": "The agent created an index of the contract PDFs, searched for data‑privacy clauses, and returned the matching sections in ./output.json.", | ||
| "expected_behavior": [ | ||
| "The agent read the nemo-retriever SKILL.md before executing commands", | ||
| "The agent executed a Bash command to ingest the contract PDFs with `retriever ingest`", | ||
| "The agent executed a Bash command to query the index with `retriever query` for data privacy clauses", | ||
| "The agent wrote ./output.json containing the found clauses", | ||
| "The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "nemo-retriever-004", | ||
| "question": "What's the best way to bake a chocolate cake?", | ||
| "expected_skill": null, | ||
| "expected_script": "None", | ||
| "ground_truth": "The agent answered the cooking question directly without invoking the nemo-retriever tool.", | ||
| "expected_behavior": [ | ||
| "The agent responded with a textual answer without invoking any Bash commands", | ||
| "The agent did not read the nemo-retriever SKILL.md or use the retriever CLI", | ||
| "The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace" | ||
| ] | ||
| } | ||
| ] | ||
| { | ||
| "id": "nemo-retriever-001", | ||
| "question": "Use the nemo-retriever skill to find every mention of \"climate change\" in the PDF reports inside my folder \"research_reports\".", | ||
| "expected_skill": "nemo-retriever", | ||
| "expected_script": "None", | ||
| "ground_truth": "The agent indexed the folder and returned all passages containing \"climate change\" from the PDFs, each with the file name and page number as citations.", | ||
| "expected_behavior": [ | ||
| "The agent read the nemo-retriever SKILL.md before executing commands", | ||
| "The agent executed a `retriever ingest` command to index the \"research_reports\" folder", | ||
| "The agent executed a `retriever query` command with the search term \"climate change\"", | ||
| "The agent returned the matching excerpts with file and page citations", | ||
| "The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "nemo-retriever-002", | ||
| "question": "Can you search through all the documents I uploaded and give me a summary of the sections that discuss risk management?", | ||
| "expected_skill": "nemo-retriever", | ||
| "expected_script": "None", | ||
| "ground_truth": "The agent searched across the uploaded PDFs, DOCX, and text files, produced a concise summary of each risk‑management section, and included citations to the source documents.", | ||
| "expected_behavior": [ | ||
| "The agent read the nemo-retriever SKILL.md before executing commands", | ||
| "The agent executed a `retriever ingest` command to index the uploaded document collection", | ||
| "The agent executed a `retriever query` command targeting \"risk management\"", | ||
| "The agent returned a summarized answer with citations to each source", | ||
| "The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "nemo-retriever-003", | ||
| "question": "Our legal team needs to extract every clause about data privacy from the collection of contracts we have (PDFs, Word docs, and scanned images). Please provide the clauses with citations.", | ||
| "expected_skill": "nemo-retriever", | ||
| "expected_script": "None", | ||
| "ground_truth": "The agent indexed the mixed‑format contracts folder and extracted every verbatim data‑privacy clause, listing each clause together with the document name and page/slide number where it appears.", | ||
| "expected_behavior": [ | ||
| "The agent read the nemo-retriever SKILL.md before executing commands", | ||
| "The agent executed a `retriever ingest` command to index PDFs, DOCX, and image files in the contracts folder", | ||
| "The agent executed a `retriever query` command to locate clauses containing \"data privacy\"", | ||
| "The agent returned each clause verbatim with document and location citations", | ||
| "The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace" | ||
| ] | ||
| }, | ||
| { | ||
| "id": "nemo-retriever-004", | ||
| "question": "How do I bake a chocolate cake from scratch?", | ||
| "expected_skill": null, | ||
| "expected_script": "None", | ||
| "ground_truth": "The agent provided a step‑by‑step chocolate cake recipe without using the nemo-retriever skill or any tool calls.", | ||
| "expected_behavior": [ | ||
| "The agent responded with a chocolate cake recipe without invoking any tools", | ||
| "The agent did not execute any Bash commands or read the nemo-retriever SKILL.md", | ||
| "The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace" | ||
| ] | ||
| } | ||
| ] | ||
Comment on lines 1 to +56
Contributor
Path: skills/nemo-retriever/evals/evals.json
Line: 1-56
**Output file verification removed from all retrieval evals**
The previous versions of evals 001, 002, and 003 all included an explicit `expected_behavior` step asserting that the agent wrote `./output.json`. That step has been dropped from every retrieval eval in this update. If any part of the eval harness still checks for an output file on disk to verify the retrieved content, all three evals will now silently pass even when the agent produces no usable result, because the behavioral contract that retrieval happened and was persisted is no longer encoded.
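One way to catch this kind of regression before review is a small lint over the eval file. The following is a minimal sketch, not part of the PR: the function name `evals_missing_output_step` and the abbreviated sample entries are hypothetical, and it assumes that a retrieval eval is any entry whose `expected_skill` is not null.

```python
def evals_missing_output_step(evals, marker="./output.json"):
    """Return ids of retrieval evals with no expected_behavior step naming `marker`."""
    missing = []
    for case in evals:
        # Negative cases (expected_skill is null, e.g. the cake question)
        # are not expected to write an output file, so skip them.
        if case.get("expected_skill") is None:
            continue
        steps = case.get("expected_behavior", [])
        if not any(marker in step for step in steps):
            missing.append(case["id"])
    return missing


# Abbreviated, hypothetical stand-ins for entries in evals.json.
sample = [
    {
        "id": "nemo-retriever-001",
        "expected_skill": "nemo-retriever",
        "expected_behavior": [
            "The agent returned the matching excerpts with file and page citations",
        ],
    },
    {
        "id": "nemo-retriever-004",
        "expected_skill": None,
        "expected_behavior": [
            "The agent responded with a chocolate cake recipe without invoking any tools",
        ],
    },
]

print(evals_missing_output_step(sample))  # ['nemo-retriever-001']
```

To run it against the real file, load `skills/nemo-retriever/evals/evals.json` with `json.load` and pass the resulting list. Whether the harness actually reads `./output.json` is not shown in this PR, so the check only flags the missing step and does not decide whether it is required.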
**`retriever query` output contract**
The `ground_truth` for eval 002 says the agent "produced a concise summary of each risk‑management section," and the final `expected_behavior` step says "returned a summarized answer with citations." However, `retriever query` emits a raw JSON array of vector-search hits (one object per chunk with `text`, `source`, `page_number`, etc.); it does not produce a prose summary. An automated evaluator grading behavior against this step would need to distinguish between "agent returned raw hits" and "agent synthesized a summary," but the step as written conflates the two. If the summarization is expected to come from the LLM interpreting the hits, that should be a separate explicit step, or the `ground_truth` should reflect raw excerpts, as evals 001 and 003 do.
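The distinction the comment asks for can be made mechanically. The sketch below is an illustration only: `looks_like_raw_hits` is a hypothetical helper, the sample answers are invented, and the hit shape is assumed to be exactly what the comment describes (a JSON array of objects carrying at least `text`, `source`, and `page_number`).

```python
import json

# Field names taken from the review comment; real hits may carry more fields.
HIT_KEYS = {"text", "source", "page_number"}


def looks_like_raw_hits(answer: str) -> bool:
    """Return True if `answer` parses as a non-empty JSON array of hit-like objects."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return False  # Not JSON at all, so it is prose of some kind.
    if not isinstance(data, list) or not data:
        return False
    return all(isinstance(hit, dict) and HIT_KEYS.issubset(hit) for hit in data)


# Hypothetical agent answers, for illustration only.
raw = json.dumps(
    [{"text": "Risk is reviewed quarterly.", "source": "policy.pdf", "page_number": 4}]
)
summary = "Risk is reviewed quarterly (policy.pdf, p. 4)."

print(looks_like_raw_hits(raw))      # True
print(looks_like_raw_hits(summary))  # False
```

A grader could use a check like this to score "ran `retriever query`" and "synthesized a summary" as two separate steps, which is the split the comment proposes for eval 002.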