Skip to content

add codex skill-eval#2100

Open
edknv wants to merge 4 commits into
NVIDIA:mainfrom
edknv:edwardk/skill-codex
Open

add codex skill-eval#2100
edknv wants to merge 4 commits into
NVIDIA:mainfrom
edknv:edwardk/skill-codex

Conversation

@edknv
Copy link
Copy Markdown
Collaborator

@edknv edknv commented May 22, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@edknv edknv requested review from a team as code owners May 22, 2026 17:23
@edknv edknv requested a review from jdye64 May 22, 2026 17:23
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR extends the skill-eval benchmark harness to support OpenAI Codex as a second agent alongside Claude, and adds a rescore sub-command plus a TraceSummarizer that narrates agent tool-use traces after each session.

  • Codex agent path: runner.py gains _build_codex_command, Codex-specific JSONL parsing, cumulative-delta token accounting, and session log discovery; the results key is widened from (condition, domain) to (agent, condition, domain) across cli.py, runner.py, and report.py.
  • rescore command: Adds per-file error handling in _load_trial and correct _needs_rescore logic that no longer treats a score of 0 as missing.
  • Two correctness fixes: retriever_graph_utils.py returns a column-labelled empty DataFrame to avoid a downstream KeyError: 'query', and judge.py strips <think>…</think> blocks before JSON parsing to support reasoning models.

Confidence Score: 5/5

The two correctness fixes are targeted one-liners; the Codex agent path is fully additive and isolated behind the --agent flag with no regressions to existing Claude paths.

Every new Codex branch is gated on agent == codex, leaving existing Claude behaviour unchanged. The empty-DataFrame fix and think-tag strip address confirmed crash paths. The rescore command file-level error handling and corrected _needs_rescore logic close the gaps from the previous review round.

runner.py — _codex_session_log_for_workdir full-scan and absence of budget enforcement for Codex trials.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/llm/clients/judge.py Adds a think-tag strip before the existing JSON fence cleanup to handle reasoning model output — minimal, correct change.
nemo_retriever/src/nemo_retriever/retriever_graph_utils.py Fixes empty-results crash by returning a column-labelled empty DataFrame instead of a column-less one; docstring updated to explain the bug being fixed.
nemo_retriever/src/nemo_retriever/skill_eval/configs/skill_eval.yaml Adds agent, agent_models, and summarizer sections; gpt-5.5 default for Codex and back-compat agent_model comment are clearly documented.
nemo_retriever/src/nemo_retriever/skill_eval/trace_summarizer.py New module backing the tool-use summarizer via claude --print; SPDX header present; from_kwargs still lacks type annotations (unresolved from previous review).
nemo_retriever/src/nemo_retriever/skill_eval/cli.py Adds rescore command, Codex agent path, trace summarizer integration, and archive step; _needs_rescore and _load_trial now handle edge cases correctly.
nemo_retriever/src/nemo_retriever/skill_eval/runner.py Large addition of Codex agent support: command builder, JSONL parsing, delta token accounting, and session log discovery; _codex_session_log_for_workdir does an O(N) full-scan of all Codex session files to locate a session by workdir.
nemo_retriever/src/nemo_retriever/skill_eval/report.py Extends report grouping from (condition, domain) to (agent, condition, domain); adds _fmt_cost helper that correctly handles None for Codex where cost is unavailable.

Sequence Diagram

sequenceDiagram
    participant CLI as skill-eval CLI
    participant RC as run_condition()
    participant Agent as Agent CLI (claude/codex)
    participant Judge as LLMJudge
    participant Summarizer as TraceSummarizer
    participant FS as Artifact FS

    CLI->>RC: run_condition(agent, condition, domain, entries)
    RC->>Agent: setup turn (--session-id / exec)
    Agent-->>RC: TrialResult (session_id, tokens)
    RC->>Agent: query turn x N (--resume / exec resume)
    Agent-->>RC: TrialResult x N
    RC-->>CLI: (workdir, results)
    CLI->>Summarizer: summarize(trace)
    Summarizer->>Agent: claude --print (narrative)
    Agent-->>Summarizer: markdown text
    Summarizer-->>CLI: tool_use_summary
    CLI->>Judge: _apply_judge(entry, result) x N
    Judge-->>CLI: judge_score
    CLI->>FS: save_trial() + archive_session_log()
    CLI->>FS: write_summary (JSON + MD)
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/skill_eval/runner.py:1340-1354
**`_codex_session_log_for_workdir` does a full scan of all Codex session files**

This function reads `~/.codex/sessions/**/rollout-*.jsonl` and opens each one to parse the `session_meta` event looking for a matching `cwd`. If a developer has accumulated hundreds of Codex sessions, every setup-turn log lookup reads that many files from disk before finding (or failing to find) the right one. Consider caching the scan result within the run, or using a more targeted lookup (e.g., writing the workdir path into a sidecar file when the session starts) so rescanning isn't necessary on subsequent query turns.

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/skill_eval/runner.py:1048-1081
**`budget_usd` is silently dropped for Codex, leaving no per-trial cost guard**

`_build_codex_command` does not accept or forward `budget_usd`, so the configured `per_trial_budget_usd` has no effect when `agent == "codex"`. The wall-clock `timeout_s` still terminates the process, but a Codex trial that stays under the timeout can freely exhaust its context window and accumulate unbounded token costs. If Codex exposes a flag like `--max-tokens` or similar, wiring it here (or at least logging a warning that budget enforcement is unavailable) would prevent surprise cost overruns in automated benchmark runs.

Reviews (2): Last reviewed commit: "Unhandled exceptions from corrupt trial ..." | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/skill_eval/cli.py
Comment thread nemo_retriever/src/nemo_retriever/skill_eval/cli.py
Comment thread nemo_retriever/src/nemo_retriever/skill_eval/trace_summarizer.py
Comment thread nemo_retriever/src/nemo_retriever/skill_eval/cli.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants