add codex skill-eval by edknv · Pull Request #2100 · NVIDIA/NeMo-Retriever

edknv · 2026-05-22T17:23:28Z

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

greptile-apps · 2026-05-22T17:29:48Z

Greptile Summary

This PR extends the skill-eval benchmark harness to support OpenAI Codex as a second agent alongside Claude, and adds a rescore sub-command plus a TraceSummarizer that narrates agent tool-use traces after each session.

Codex agent path: runner.py gains _build_codex_command, Codex-specific JSONL parsing, cumulative-delta token accounting, and session log discovery; the results key is widened from (condition, domain) to (agent, condition, domain) across cli.py, runner.py, and report.py.
rescore command: Adds per-file error handling in _load_trial and correct _needs_rescore logic that no longer treats a score of 0 as missing.
Two correctness fixes: retriever_graph_utils.py returns a column-labelled empty DataFrame to avoid a downstream KeyError: 'query', and judge.py strips <think>…</think> blocks before JSON parsing to support reasoning models.

Confidence Score: 5/5

The two correctness fixes are targeted one-liners; the Codex agent path is fully additive and isolated behind the --agent flag with no regressions to existing Claude paths.

Every new Codex branch is gated on agent == codex, leaving existing Claude behaviour unchanged. The empty-DataFrame fix and think-tag strip address confirmed crash paths. The rescore command file-level error handling and corrected _needs_rescore logic close the gaps from the previous review round.

runner.py — _codex_session_log_for_workdir full-scan and absence of budget enforcement for Codex trials.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/llm/clients/judge.py	Adds a think-tag strip before the existing JSON fence cleanup to handle reasoning model output — minimal, correct change.
nemo_retriever/src/nemo_retriever/retriever_graph_utils.py	Fixes empty-results crash by returning a column-labelled empty DataFrame instead of a column-less one; docstring updated to explain the bug being fixed.
nemo_retriever/src/nemo_retriever/skill_eval/configs/skill_eval.yaml	Adds agent, agent_models, and summarizer sections; gpt-5.5 default for Codex and back-compat agent_model comment are clearly documented.
nemo_retriever/src/nemo_retriever/skill_eval/trace_summarizer.py	New module backing the tool-use summarizer via claude --print; SPDX header present; from_kwargs still lacks type annotations (unresolved from previous review).
nemo_retriever/src/nemo_retriever/skill_eval/cli.py	Adds rescore command, Codex agent path, trace summarizer integration, and archive step; _needs_rescore and _load_trial now handle edge cases correctly.
nemo_retriever/src/nemo_retriever/skill_eval/runner.py	Large addition of Codex agent support: command builder, JSONL parsing, delta token accounting, and session log discovery; _codex_session_log_for_workdir does an O(N) full-scan of all Codex session files to locate a session by workdir.
nemo_retriever/src/nemo_retriever/skill_eval/report.py	Extends report grouping from (condition, domain) to (agent, condition, domain); adds _fmt_cost helper that correctly handles None for Codex where cost is unavailable.

Sequence Diagram

sequenceDiagram
    participant CLI as skill-eval CLI
    participant RC as run_condition()
    participant Agent as Agent CLI (claude/codex)
    participant Judge as LLMJudge
    participant Summarizer as TraceSummarizer
    participant FS as Artifact FS

    CLI->>RC: run_condition(agent, condition, domain, entries)
    RC->>Agent: setup turn (--session-id / exec)
    Agent-->>RC: TrialResult (session_id, tokens)
    RC->>Agent: query turn x N (--resume / exec resume)
    Agent-->>RC: TrialResult x N
    RC-->>CLI: (workdir, results)
    CLI->>Summarizer: summarize(trace)
    Summarizer->>Agent: claude --print (narrative)
    Agent-->>Summarizer: markdown text
    Summarizer-->>CLI: tool_use_summary
    CLI->>Judge: _apply_judge(entry, result) x N
    Judge-->>CLI: judge_score
    CLI->>FS: save_trial() + archive_session_log()
    CLI->>FS: write_summary (JSON + MD)

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/skill_eval/runner.py:1340-1354
**`_codex_session_log_for_workdir` does a full scan of all Codex session files**

This function reads `~/.codex/sessions/**/rollout-*.jsonl` and opens each one to parse the `session_meta` event looking for a matching `cwd`. If a developer has accumulated hundreds of Codex sessions, every setup-turn log lookup reads that many files from disk before finding (or failing to find) the right one. Consider caching the scan result within the run, or using a more targeted lookup (e.g., writing the workdir path into a sidecar file when the session starts) so rescanning isn't necessary on subsequent query turns.

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/skill_eval/runner.py:1048-1081
**`budget_usd` is silently dropped for Codex, leaving no per-trial cost guard**

`_build_codex_command` does not accept or forward `budget_usd`, so the configured `per_trial_budget_usd` has no effect when `agent == "codex"`. The wall-clock `timeout_s` still terminates the process, but a Codex trial that stays under the timeout can freely exhaust its context window and accumulate unbounded token costs. If Codex exposes a flag like `--max-tokens` or similar, wiring it here (or at least logging a warning that budget enforcement is unavailable) would prevent surprise cost overruns in automated benchmark runs.

_{Reviews (2): Last reviewed commit: "Unhandled exceptions from corrupt trial ..." | Re-trigger Greptile}

…e run

add codex skill-eval

9820baa

edknv requested review from a team as code owners May 22, 2026 17:23

edknv requested a review from jdye64 May 22, 2026 17:23

greptile-apps Bot reviewed May 22, 2026

View reviewed changes

Comment thread nemo_retriever/src/nemo_retriever/skill_eval/cli.py

Comment thread nemo_retriever/src/nemo_retriever/skill_eval/cli.py

Comment thread nemo_retriever/src/nemo_retriever/skill_eval/trace_summarizer.py

Comment thread nemo_retriever/src/nemo_retriever/skill_eval/cli.py

edknv added 3 commits May 22, 2026 10:37

lint

522acee

zero is a valid judge score

5b3dee0

Unhandled exceptions from corrupt trial files crash the entire rescor…

00343cb

…e run

jperez999 approved these changes May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add codex skill-eval#2100

add codex skill-eval#2100
edknv wants to merge 4 commits into
NVIDIA:mainfrom
edknv:edwardk/skill-codex

edknv commented May 22, 2026

Uh oh!

greptile-apps Bot commented May 22, 2026 •

edited

Loading

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

edknv commented May 22, 2026

Description

Checklist

Uh oh!

greptile-apps Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 22, 2026 •

edited

Loading