feat(review): codex review correctness (PR 2 of 3)#1370
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 1aaf8b5. Configure here.
There was a problem hiding this comment.
Pull request overview
This PR refines the agent-layer implementation of entire review, focusing on Codex correctness (skill invocation + live token reporting) and extracting shared skill discovery logic for Claude Code and Codex. It also introduces two additive configuration fields (Model, ReasoningEffort) on both persisted settings (ReviewConfig) and the runtime spawn config (RunConfig).
Changes:
- Adds a shared
agent/skilldiscoveryscanner (frontmatter parsing, plugin-cache version selection, invocation-form helpers) and refactors Claude Code discovery to use it; implements Codex discovery over~/.codex. - Updates Codex reviewer execution to support
-m <model>/-c model_reasoning_effort=<level>and adds rollout-file tailing to emit live token updates. - Updates Claude Code stream parser to emit input-only live token updates during streaming and final totals on
result.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| cmd/entire/cli/settings/settings.go | Adds persisted ReviewConfig.Model / ReasoningEffort and extends IsZero. |
| cmd/entire/cli/review/types/reviewer.go | Adds runtime RunConfig.Model / ReasoningEffort fields for per-spawn overrides. |
| cmd/entire/cli/agent/skilldiscovery/scan.go | New shared scanners/helpers for skill discovery (slash vs dollar invocation forms, plugin-cache scanning, frontmatter parsing). |
| cmd/entire/cli/agent/skilldiscovery/registry.go | Removes Codex curated built-ins; adjusts install-hint messaging constraints (no backticks). |
| cmd/entire/cli/agent/skilldiscovery/registry_test.go | Updates Codex curated-builtins expectations to empty. |
| cmd/entire/cli/agent/codex/reviewer.go | Adds model/reasoning flags, threads rollout token tailer, and tweaks token emission behavior. |
| cmd/entire/cli/agent/codex/reviewer_test.go | Updates tests for verbatim skill pass-through and new argv flags; adds token emission contract test. |
| cmd/entire/cli/agent/codex/review_tokens.go | New rollout JSONL tailer emitting cumulative Tokens from token_count events. |
| cmd/entire/cli/agent/codex/review_tokens_test.go | Unit tests for rollout token parsing and tailing behavior. |
| cmd/entire/cli/agent/codex/discovery.go | Implements Codex on-disk skill discovery using $name/$plugin:name invocation form. |
| cmd/entire/cli/agent/codex/discovery_test.go | Tests Codex discovery across user skills, plugins cache, and superpowers. |
| cmd/entire/cli/agent/codex/AGENT.md | Documents Codex review integration specifics (exec, skills, live tokens, no hooks). |
| cmd/entire/cli/agent/claudecode/testdata/stream_with_deltas.jsonl | Adds fixture capturing real Claude stream-json usage snapshot behavior. |
| cmd/entire/cli/agent/claudecode/reviewer.go | Emits input-only Tokens on assistant envelopes; final totals on result. |
| cmd/entire/cli/agent/claudecode/reviewer_test.go | Updates expectations and adds test locking input-only live-token semantics. |
| cmd/entire/cli/agent/claudecode/discovery.go | Refactors Claude discovery to delegate scanning/version-dedupe/frontmatter parsing to skilldiscovery. |
Comments suppressed due to low confidence (1)
cmd/entire/cli/agent/codex/reviewer_test.go:110
- This test uses the legacy slash-form "/review" skill, but codex skill discovery and prompt invocation in this PR are described as $name /
$plugin:name. Using a $ -prefixed skill name here would better reflect the real codex execution path and avoid locking the tests to an invocation form codex exec may not support.
t.Parallel()
cfg := reviewtypes.RunConfig{
Skills: []string{"/review"},
AlwaysPrompt: "Focus on auth regressions.",
ScopeBaseRef: "main",
CheckpointContext: "Commits in scope (newest first):\n abc123 summary\n",
}
- reviewer_test: TestParseCodexOutput_EmitsTokensAtEveryTurnCompleted feeds a thread.started envelope, which launches the rollout token tailer. Pin ENTIRE_TEST_CODEX_SESSION_DIR to an empty temp dir (drop t.Parallel) so the tailer can't pick up a real ~/.codex rollout-*-tid-1.jsonl and inject extra Tokens that break the exactly-2 assertion. - skilldiscovery: codex install-hint suppression fingerprint was slash form (/codex:adversarial-review) but DiscoverReviewSkills emits dollar form, so ActiveInstallHintsFor never intersected and the hint showed forever after install. Use $codex:adversarial-review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 210b33059b2b
Agent-layer correctness for `entire review`, independent of the command
surface (no cmd/setup/fix churn):
- Codex invokes its real review skill verbatim (no /review paraphrase);
codex's skill system loads the matching SKILL.md. Adds per-spawn
review.<agent>.model / reasoning_effort, threaded through RunConfig into
codex CLI flags -m / -c model_reasoning_effort=.
- Codex skill discovery in $name form over ~/.codex/{skills,plugins,superpowers};
lifts the generic SKILL.md scanners + version dedupe + frontmatter parse into
the shared skilldiscovery package (SlashForm/DollarForm); claude delegates to
it. Codex curated built-ins dropped (discovered on disk, not bundled).
- Honest live tokens: claude emits input-only Tokens during streaming (output
snapshot is misleading) with final totals on result; codex tails its rollout
transcript (located by thread_id) and emits cumulative Tokens per token_count,
since `codex exec --json` only carries usage at turn.completed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 4cbe537bd34d
- reviewer_test: TestParseCodexOutput_EmitsTokensAtEveryTurnCompleted feeds a thread.started envelope, which launches the rollout token tailer. Pin ENTIRE_TEST_CODEX_SESSION_DIR to an empty temp dir (drop t.Parallel) so the tailer can't pick up a real ~/.codex rollout-*-tid-1.jsonl and inject extra Tokens that break the exactly-2 assertion. - skilldiscovery: codex install-hint suppression fingerprint was slash form (/codex:adversarial-review) but DiscoverReviewSkills emits dollar form, so ActiveInstallHintsFor never intersected and the hint showed forever after install. Use $codex:adversarial-review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 210b33059b2b
- Event ordering (real bug): the rollout token tailer runs concurrently and sends Tokens on the same channel as the parser. It was only stopped in a defer that runs AFTER the inline Finished emissions, so a late Tokens could follow Finished — violating ReviewerTemplate's "Finished is the last event" contract. Introduce an idempotent stopTailer (sync.Once + WaitGroup.Wait) and call it before every terminal Finished; the defer remains as a backstop. Adds a regression test (Finished is last even with a live rollout tailer), verified under -race. - Test fidelity: TestCodexReviewer_PassesSkillThroughVerbatim configured the codex skill as "/review" (claude's slash form). Codex skills are dollar-prefixed; use "$code-reviewer" so the test models real codex usage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 480a94f05ac8
…1373) Claude review subprocesses run via `claude -p` don't reliably report a model through lifecycle hooks (the ModelUpdate path is flaky in headless mode), so per-session checkpoint metadata landed with "model": "" for Claude reviewer sessions, while codex/pi reviewers record theirs. Root cause: ClaudeCodeAgent did not implement agent.ModelExtractor, so condense's sessionStateBackfillModel fallback (AsModelExtractor -> ExtractModel) was a no-op for Claude; the model relied solely on the flaky hook path. Implement ExtractModel on ClaudeCodeAgent: parse the transcript and return the last non-empty model from an assistant line (mirrors pi/copilotcli/factoryai). Condense already calls this when state.ModelName == "", so review (and any hook-missed) Claude sessions now record a non-empty model deterministically. Verified against real recorded data: a real agent_review row in entire/checkpoints/v1 with model="" (skills=[/review]) — running the new ExtractModel on that session's stored transcript recovers "claude-opus-4-7", which is exactly the value the condense backfill writes. Surfaces in entire.io's per-reviewer attribution (entirehq/entire.io#2271). Note: the transcript carries the base model id ("claude-opus-4-7"), not the hook's context-window variant ("claude-opus-4-7[1m]"), so backfilled models omit the [1m] suffix — a correct, non-empty id, slightly less precise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 41d9aa9acef6

https://entire.io/gh/entireio/cli/trails/522
Stacked on #1241 (PR 1: role + setup foundation). Diffed against
review-review.Peeled out of the original PR 2 (#1352) so the agent-layer codex correctness can be reviewed independently of the command-layer UX cutover. No
cmd.go/setup.go/fix.gochurn — this PR touches onlyagent/*plus two additive fields (ReviewConfig.Model/ReasoningEffort,RunConfig.Model/ReasoningEffort).What this adds
/reviewparaphrase); codex's skill system loads the matching SKILL.md. Adds per-spawnreview.<agent>.model/reasoning_effort→ codex-m/-c model_reasoning_effort=.$nameform over~/.codex/{skills,plugins,superpowers}; lifts the generic SKILL.md scanners + version dedupe + frontmatter parse into the sharedskilldiscoverypackage (SlashForm/DollarForm); claude delegates to it. Codex curated built-ins dropped (discovered on disk, not bundled).Tokensduring streaming with final totals onresult; codex tails its rollout transcript (located bythread_id) and emits cumulativeTokenspertoken_count(sincecodex exec --jsononly carries usage atturn.completed).Verified: build + vet clean,
mise run lint0 issues, agent/review/settings unit tests green.🤖 Generated with Claude Code
Note
Medium Risk
Changes how review prompts and skills reach Codex and how token counts are derived from local rollout files; mistakes could affect review quality or TUI metrics without touching auth or data stores.
Overview
This PR tightens agent-layer behavior for
entire reviewon Claude Code and Codex: shared skill discovery, honest live token display, and Codex-specific review invocation.Codex stops paraphrasing
/reviewinto generic text and passes configured skills verbatim so Codex loads realSKILL.mdworkflows. It adds on-disk discovery under~/.codexin$name/$plugin:nameform, drops hardcoded/reviewcurated built-ins, and maps optionalmodel/reasoning_effortsettings tocodex exec-mand-c model_reasoning_effort=. Live usage during a run comes from tailing the rollout JSONL (bythread_id) fortoken_countevents, not onlyturn.completedon stdout.Claude Code discovery is refactored to the shared
skilldiscoverypackage (SlashForm, plugin cache scan, dedupe, frontmatter parsing). Its stream-json parser now emits input-onlyTokenson assistant envelopes (misleading tinyoutput_tokensat turn start) and full totals onresult.Shared: new
skilldiscovery/scan.go, picker install-hint text without backticks, andReviewConfig/RunConfigfields for per-spawn model and reasoning effort (wired for Codex; other agents ignore).Reviewed by Cursor Bugbot for commit 1aaf8b5. Configure here.
Note for reviewers (stacked-PR artifact):
RunConfig.Model/ReasoningEffort(andReviewConfig.Model/ReasoningEffort) are introduced here and mapped by the codex reviewer to-m/-c model_reasoning_effort=, but the call site that copies them from settings into the per-agentRunConfig(applyReviewConfiginreview/cmd.go) lives in PR 3 (#1352). They are therefore inert on this PR alone and become live once PR 3 lands — this is deliberate, keeping PR 2 free of command-layer churn.