add codex skill-eval#2100
Conversation
Greptile SummaryThis PR extends the skill-eval benchmark harness to support OpenAI Codex as a second agent alongside Claude, and adds a
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/llm/clients/judge.py | Adds a think-tag strip before the existing JSON fence cleanup to handle reasoning model output — minimal, correct change. |
| nemo_retriever/src/nemo_retriever/retriever_graph_utils.py | Fixes empty-results crash by returning a column-labelled empty DataFrame instead of a column-less one; docstring updated to explain the bug being fixed. |
| nemo_retriever/src/nemo_retriever/skill_eval/configs/skill_eval.yaml | Adds agent, agent_models, and summarizer sections; gpt-5.5 default for Codex and back-compat agent_model comment are clearly documented. |
| nemo_retriever/src/nemo_retriever/skill_eval/trace_summarizer.py | New module backing the tool-use summarizer via claude --print; SPDX header present; from_kwargs still lacks type annotations (unresolved from previous review). |
| nemo_retriever/src/nemo_retriever/skill_eval/cli.py | Adds rescore command, Codex agent path, trace summarizer integration, and archive step; _needs_rescore and _load_trial now handle edge cases correctly. |
| nemo_retriever/src/nemo_retriever/skill_eval/runner.py | Large addition of Codex agent support: command builder, JSONL parsing, delta token accounting, and session log discovery; _codex_session_log_for_workdir does an O(N) full-scan of all Codex session files to locate a session by workdir. |
| nemo_retriever/src/nemo_retriever/skill_eval/report.py | Extends report grouping from (condition, domain) to (agent, condition, domain); adds _fmt_cost helper that correctly handles None for Codex where cost is unavailable. |
Sequence Diagram
sequenceDiagram
participant CLI as skill-eval CLI
participant RC as run_condition()
participant Agent as Agent CLI (claude/codex)
participant Judge as LLMJudge
participant Summarizer as TraceSummarizer
participant FS as Artifact FS
CLI->>RC: run_condition(agent, condition, domain, entries)
RC->>Agent: setup turn (--session-id / exec)
Agent-->>RC: TrialResult (session_id, tokens)
RC->>Agent: query turn x N (--resume / exec resume)
Agent-->>RC: TrialResult x N
RC-->>CLI: (workdir, results)
CLI->>Summarizer: summarize(trace)
Summarizer->>Agent: claude --print (narrative)
Agent-->>Summarizer: markdown text
Summarizer-->>CLI: tool_use_summary
CLI->>Judge: _apply_judge(entry, result) x N
Judge-->>CLI: judge_score
CLI->>FS: save_trial() + archive_session_log()
CLI->>FS: write_summary (JSON + MD)
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
nemo_retriever/src/nemo_retriever/skill_eval/runner.py:1340-1354
**`_codex_session_log_for_workdir` does a full scan of all Codex session files**
This function reads `~/.codex/sessions/**/rollout-*.jsonl` and opens each one to parse the `session_meta` event looking for a matching `cwd`. If a developer has accumulated hundreds of Codex sessions, every setup-turn log lookup reads that many files from disk before finding (or failing to find) the right one. Consider caching the scan result within the run, or using a more targeted lookup (e.g., writing the workdir path into a sidecar file when the session starts) so rescanning isn't necessary on subsequent query turns.
### Issue 2 of 2
nemo_retriever/src/nemo_retriever/skill_eval/runner.py:1048-1081
**`budget_usd` is silently dropped for Codex, leaving no per-trial cost guard**
`_build_codex_command` does not accept or forward `budget_usd`, so the configured `per_trial_budget_usd` has no effect when `agent == "codex"`. The wall-clock `timeout_s` still terminates the process, but a Codex trial that stays under the timeout can freely exhaust its context window and accumulate unbounded token costs. If Codex exposes a flag like `--max-tokens` or similar, wiring it here (or at least logging a warning that budget enforcement is unavailable) would prevent surprise cost overruns in automated benchmark runs.
Reviews (2): Last reviewed commit: "Unhandled exceptions from corrupt trial ..." | Re-trigger Greptile
Description
Checklist