feat(review): codex review correctness (PR 2 of 3) by peyton-alt · Pull Request #1370 · entireio/cli

peyton-alt · 2026-06-04T17:32:02Z

https://entire.io/gh/entireio/cli/trails/522

Stacked on #1241 (PR 1: role + setup foundation). Diffed against review-review.

Peeled out of the original PR 2 (#1352) so the agent-layer codex correctness can be reviewed independently of the command-layer UX cutover. No cmd.go/setup.go/fix.go churn — this PR touches only agent/* plus two additive fields (ReviewConfig.Model/ReasoningEffort, RunConfig.Model/ReasoningEffort).

What this adds

Codex invokes its real review skill verbatim (no /review paraphrase); codex's skill system loads the matching SKILL.md. Adds per-spawn review.<agent>.model / reasoning_effort → codex -m / -c model_reasoning_effort=.
Codex skill discovery in $name form over ~/.codex/{skills,plugins,superpowers}; lifts the generic SKILL.md scanners + version dedupe + frontmatter parse into the shared skilldiscovery package (SlashForm/DollarForm); claude delegates to it. Codex curated built-ins dropped (discovered on disk, not bundled).
Honest live tokens: claude emits input-only Tokens during streaming with final totals on result; codex tails its rollout transcript (located by thread_id) and emits cumulative Tokens per token_count (since codex exec --json only carries usage at turn.completed).

Verified: build + vet clean, mise run lint 0 issues, agent/review/settings unit tests green.

🤖 Generated with Claude Code

Note

Medium Risk
Changes how review prompts and skills reach Codex and how token counts are derived from local rollout files; mistakes could affect review quality or TUI metrics without touching auth or data stores.

Overview
This PR tightens agent-layer behavior for entire review on Claude Code and Codex: shared skill discovery, honest live token display, and Codex-specific review invocation.

Codex stops paraphrasing /review into generic text and passes configured skills verbatim so Codex loads real SKILL.md workflows. It adds on-disk discovery under ~/.codex in $name / $plugin:name form, drops hardcoded /review curated built-ins, and maps optional model / reasoning_effort settings to codex exec -m and -c model_reasoning_effort=. Live usage during a run comes from tailing the rollout JSONL (by thread_id) for token_count events, not only turn.completed on stdout.

Claude Code discovery is refactored to the shared skilldiscovery package (SlashForm, plugin cache scan, dedupe, frontmatter parsing). Its stream-json parser now emits input-only Tokens on assistant envelopes (misleading tiny output_tokens at turn start) and full totals on result.

Shared: new skilldiscovery/scan.go, picker install-hint text without backticks, and ReviewConfig / RunConfig fields for per-spawn model and reasoning effort (wired for Codex; other agents ignore).

^{Reviewed by Cursor Bugbot for commit 1aaf8b5. Configure here.}

Note for reviewers (stacked-PR artifact): RunConfig.Model / ReasoningEffort (and ReviewConfig.Model / ReasoningEffort) are introduced here and mapped by the codex reviewer to -m / -c model_reasoning_effort=, but the call site that copies them from settings into the per-agent RunConfig (applyReviewConfig in review/cmd.go) lives in PR 3 (#1352). They are therefore inert on this PR alone and become live once PR 3 lands — this is deliberate, keeping PR 2 free of command-layer churn.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

Comment @cursor review or bugbot run to trigger another review on this PR

^{Reviewed by Cursor Bugbot for commit 1aaf8b5. Configure here.}

Copilot

Pull request overview

This PR refines the agent-layer implementation of entire review, focusing on Codex correctness (skill invocation + live token reporting) and extracting shared skill discovery logic for Claude Code and Codex. It also introduces two additive configuration fields (Model, ReasoningEffort) on both persisted settings (ReviewConfig) and the runtime spawn config (RunConfig).

Changes:

Adds a shared agent/skilldiscovery scanner (frontmatter parsing, plugin-cache version selection, invocation-form helpers) and refactors Claude Code discovery to use it; implements Codex discovery over ~/.codex.
Updates Codex reviewer execution to support -m <model> / -c model_reasoning_effort=<level> and adds rollout-file tailing to emit live token updates.
Updates Claude Code stream parser to emit input-only live token updates during streaming and final totals on result.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
cmd/entire/cli/settings/settings.go	Adds persisted `ReviewConfig.Model` / `ReasoningEffort` and extends `IsZero`.
cmd/entire/cli/review/types/reviewer.go	Adds runtime `RunConfig.Model` / `ReasoningEffort` fields for per-spawn overrides.
cmd/entire/cli/agent/skilldiscovery/scan.go	New shared scanners/helpers for skill discovery (slash vs dollar invocation forms, plugin-cache scanning, frontmatter parsing).
cmd/entire/cli/agent/skilldiscovery/registry.go	Removes Codex curated built-ins; adjusts install-hint messaging constraints (no backticks).
cmd/entire/cli/agent/skilldiscovery/registry_test.go	Updates Codex curated-builtins expectations to empty.
cmd/entire/cli/agent/codex/reviewer.go	Adds model/reasoning flags, threads rollout token tailer, and tweaks token emission behavior.
cmd/entire/cli/agent/codex/reviewer_test.go	Updates tests for verbatim skill pass-through and new argv flags; adds token emission contract test.
cmd/entire/cli/agent/codex/review_tokens.go	New rollout JSONL tailer emitting cumulative `Tokens` from `token_count` events.
cmd/entire/cli/agent/codex/review_tokens_test.go	Unit tests for rollout token parsing and tailing behavior.
cmd/entire/cli/agent/codex/discovery.go	Implements Codex on-disk skill discovery using `$name`/`$plugin:name` invocation form.
cmd/entire/cli/agent/codex/discovery_test.go	Tests Codex discovery across user skills, plugins cache, and superpowers.
cmd/entire/cli/agent/codex/AGENT.md	Documents Codex review integration specifics (exec, skills, live tokens, no hooks).
cmd/entire/cli/agent/claudecode/testdata/stream_with_deltas.jsonl	Adds fixture capturing real Claude stream-json usage snapshot behavior.
cmd/entire/cli/agent/claudecode/reviewer.go	Emits input-only Tokens on assistant envelopes; final totals on result.
cmd/entire/cli/agent/claudecode/reviewer_test.go	Updates expectations and adds test locking input-only live-token semantics.
cmd/entire/cli/agent/claudecode/discovery.go	Refactors Claude discovery to delegate scanning/version-dedupe/frontmatter parsing to `skilldiscovery`.

Comments suppressed due to low confidence (1)

cmd/entire/cli/agent/codex/reviewer_test.go:110

This test uses the legacy slash-form "/review" skill, but codex skill discovery and prompt invocation in this PR are described as $name / $plugin:name. Using a $-prefixed skill name here would better reflect the real codex execution path and avoid locking the tests to an invocation form codex exec may not support.

	t.Parallel()
	cfg := reviewtypes.RunConfig{
		Skills:            []string{"/review"},
		AlwaysPrompt:      "Focus on auth regressions.",
		ScopeBaseRef:      "main",
		CheckpointContext: "Commits in scope (newest first):\n  abc123 summary\n",
	}

- reviewer_test: TestParseCodexOutput_EmitsTokensAtEveryTurnCompleted feeds a thread.started envelope, which launches the rollout token tailer. Pin ENTIRE_TEST_CODEX_SESSION_DIR to an empty temp dir (drop t.Parallel) so the tailer can't pick up a real ~/.codex rollout-*-tid-1.jsonl and inject extra Tokens that break the exactly-2 assertion. - skilldiscovery: codex install-hint suppression fingerprint was slash form (/codex:adversarial-review) but DiscoverReviewSkills emits dollar form, so ActiveInstallHintsFor never intersected and the hint showed forever after install. Use $codex:adversarial-review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 210b33059b2b

Agent-layer correctness for `entire review`, independent of the command surface (no cmd/setup/fix churn): - Codex invokes its real review skill verbatim (no /review paraphrase); codex's skill system loads the matching SKILL.md. Adds per-spawn review.<agent>.model / reasoning_effort, threaded through RunConfig into codex CLI flags -m / -c model_reasoning_effort=. - Codex skill discovery in $name form over ~/.codex/{skills,plugins,superpowers}; lifts the generic SKILL.md scanners + version dedupe + frontmatter parse into the shared skilldiscovery package (SlashForm/DollarForm); claude delegates to it. Codex curated built-ins dropped (discovered on disk, not bundled). - Honest live tokens: claude emits input-only Tokens during streaming (output snapshot is misleading) with final totals on result; codex tails its rollout transcript (located by thread_id) and emits cumulative Tokens per token_count, since `codex exec --json` only carries usage at turn.completed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 4cbe537bd34d

- reviewer_test: TestParseCodexOutput_EmitsTokensAtEveryTurnCompleted feeds a thread.started envelope, which launches the rollout token tailer. Pin ENTIRE_TEST_CODEX_SESSION_DIR to an empty temp dir (drop t.Parallel) so the tailer can't pick up a real ~/.codex rollout-*-tid-1.jsonl and inject extra Tokens that break the exactly-2 assertion. - skilldiscovery: codex install-hint suppression fingerprint was slash form (/codex:adversarial-review) but DiscoverReviewSkills emits dollar form, so ActiveInstallHintsFor never intersected and the hint showed forever after install. Use $codex:adversarial-review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 210b33059b2b

- Event ordering (real bug): the rollout token tailer runs concurrently and sends Tokens on the same channel as the parser. It was only stopped in a defer that runs AFTER the inline Finished emissions, so a late Tokens could follow Finished — violating ReviewerTemplate's "Finished is the last event" contract. Introduce an idempotent stopTailer (sync.Once + WaitGroup.Wait) and call it before every terminal Finished; the defer remains as a backstop. Adds a regression test (Finished is last even with a live rollout tailer), verified under -race. - Test fidelity: TestCodexReviewer_PassesSkillThroughVerbatim configured the codex skill as "/review" (claude's slash form). Codex skills are dollar-prefixed; use "$code-reviewer" so the test models real codex usage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 480a94f05ac8

…1373) Claude review subprocesses run via `claude -p` don't reliably report a model through lifecycle hooks (the ModelUpdate path is flaky in headless mode), so per-session checkpoint metadata landed with "model": "" for Claude reviewer sessions, while codex/pi reviewers record theirs. Root cause: ClaudeCodeAgent did not implement agent.ModelExtractor, so condense's sessionStateBackfillModel fallback (AsModelExtractor -> ExtractModel) was a no-op for Claude; the model relied solely on the flaky hook path. Implement ExtractModel on ClaudeCodeAgent: parse the transcript and return the last non-empty model from an assistant line (mirrors pi/copilotcli/factoryai). Condense already calls this when state.ModelName == "", so review (and any hook-missed) Claude sessions now record a non-empty model deterministically. Verified against real recorded data: a real agent_review row in entire/checkpoints/v1 with model="" (skills=[/review]) — running the new ExtractModel on that session's stored transcript recovers "claude-opus-4-7", which is exactly the value the condense backfill writes. Surfaces in entire.io's per-reviewer attribution (entirehq/entire.io#2271). Note: the transcript carries the base model id ("claude-opus-4-7"), not the hook's context-window variant ("claude-opus-4-7[1m]"), so backfilled models omit the [1m] suffix — a correct, non-empty id, slightly less precise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Entire-Checkpoint: 41d9aa9acef6

Copilot AI review requested due to automatic review settings June 4, 2026 17:32

peyton-alt requested a review from a team as a code owner June 4, 2026 17:32

Copilot started reviewing on behalf of peyton-alt June 4, 2026 17:32 View session

peyton-alt mentioned this pull request Jun 4, 2026

feat(review): role-driven UX cutover + TUI (PR 3 of 3) #1352

Open

cursor Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread cmd/entire/cli/agent/codex/reviewer_test.go

Comment thread cmd/entire/cli/agent/skilldiscovery/registry.go Outdated

Copilot AI reviewed Jun 4, 2026

View reviewed changes

Comment thread cmd/entire/cli/agent/skilldiscovery/registry.go

Comment thread cmd/entire/cli/agent/codex/reviewer.go

Comment thread cmd/entire/cli/agent/codex/reviewer_test.go

Comment thread cmd/entire/cli/review/types/reviewer.go

Comment thread cmd/entire/cli/agent/codex/reviewer.go Outdated

peyton-alt and others added 2 commits June 4, 2026 14:01

peyton-alt force-pushed the review-codex branch from 689bd75 to 242e9bb Compare June 4, 2026 18:04

peyton-alt and others added 2 commits June 4, 2026 14:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(review): codex review correctness (PR 2 of 3)#1370

feat(review): codex review correctness (PR 2 of 3)#1370
peyton-alt wants to merge 4 commits into
review-reviewfrom
review-codex

peyton-alt commented Jun 4, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Conversation

peyton-alt commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this adds

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

peyton-alt commented Jun 4, 2026 •

edited

Loading