Skip to content

feat(review): codex review correctness (PR 2 of 3)#1370

Open
peyton-alt wants to merge 4 commits into
review-reviewfrom
review-codex
Open

feat(review): codex review correctness (PR 2 of 3)#1370
peyton-alt wants to merge 4 commits into
review-reviewfrom
review-codex

Conversation

@peyton-alt
Copy link
Copy Markdown
Contributor

@peyton-alt peyton-alt commented Jun 4, 2026

https://entire.io/gh/entireio/cli/trails/522

Stacked on #1241 (PR 1: role + setup foundation). Diffed against review-review.

Peeled out of the original PR 2 (#1352) so the agent-layer codex correctness can be reviewed independently of the command-layer UX cutover. No cmd.go/setup.go/fix.go churn — this PR touches only agent/* plus two additive fields (ReviewConfig.Model/ReasoningEffort, RunConfig.Model/ReasoningEffort).

What this adds

  • Codex invokes its real review skill verbatim (no /review paraphrase); codex's skill system loads the matching SKILL.md. Adds per-spawn review.<agent>.model / reasoning_effort → codex -m / -c model_reasoning_effort=.
  • Codex skill discovery in $name form over ~/.codex/{skills,plugins,superpowers}; lifts the generic SKILL.md scanners + version dedupe + frontmatter parse into the shared skilldiscovery package (SlashForm/DollarForm); claude delegates to it. Codex curated built-ins dropped (discovered on disk, not bundled).
  • Honest live tokens: claude emits input-only Tokens during streaming with final totals on result; codex tails its rollout transcript (located by thread_id) and emits cumulative Tokens per token_count (since codex exec --json only carries usage at turn.completed).

Verified: build + vet clean, mise run lint 0 issues, agent/review/settings unit tests green.

🤖 Generated with Claude Code


Note

Medium Risk
Changes how review prompts and skills reach Codex and how token counts are derived from local rollout files; mistakes could affect review quality or TUI metrics without touching auth or data stores.

Overview
This PR tightens agent-layer behavior for entire review on Claude Code and Codex: shared skill discovery, honest live token display, and Codex-specific review invocation.

Codex stops paraphrasing /review into generic text and passes configured skills verbatim so Codex loads real SKILL.md workflows. It adds on-disk discovery under ~/.codex in $name / $plugin:name form, drops hardcoded /review curated built-ins, and maps optional model / reasoning_effort settings to codex exec -m and -c model_reasoning_effort=. Live usage during a run comes from tailing the rollout JSONL (by thread_id) for token_count events, not only turn.completed on stdout.

Claude Code discovery is refactored to the shared skilldiscovery package (SlashForm, plugin cache scan, dedupe, frontmatter parsing). Its stream-json parser now emits input-only Tokens on assistant envelopes (misleading tiny output_tokens at turn start) and full totals on result.

Shared: new skilldiscovery/scan.go, picker install-hint text without backticks, and ReviewConfig / RunConfig fields for per-spawn model and reasoning effort (wired for Codex; other agents ignore).

Reviewed by Cursor Bugbot for commit 1aaf8b5. Configure here.


Note for reviewers (stacked-PR artifact): RunConfig.Model / ReasoningEffort (and ReviewConfig.Model / ReasoningEffort) are introduced here and mapped by the codex reviewer to -m / -c model_reasoning_effort=, but the call site that copies them from settings into the per-agent RunConfig (applyReviewConfig in review/cmd.go) lives in PR 3 (#1352). They are therefore inert on this PR alone and become live once PR 3 lands — this is deliberate, keeping PR 2 free of command-layer churn.

Copilot AI review requested due to automatic review settings June 4, 2026 17:32
@peyton-alt peyton-alt requested a review from a team as a code owner June 4, 2026 17:32
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 1aaf8b5. Configure here.

Comment thread cmd/entire/cli/agent/codex/reviewer_test.go
Comment thread cmd/entire/cli/agent/skilldiscovery/registry.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines the agent-layer implementation of entire review, focusing on Codex correctness (skill invocation + live token reporting) and extracting shared skill discovery logic for Claude Code and Codex. It also introduces two additive configuration fields (Model, ReasoningEffort) on both persisted settings (ReviewConfig) and the runtime spawn config (RunConfig).

Changes:

  • Adds a shared agent/skilldiscovery scanner (frontmatter parsing, plugin-cache version selection, invocation-form helpers) and refactors Claude Code discovery to use it; implements Codex discovery over ~/.codex.
  • Updates Codex reviewer execution to support -m <model> / -c model_reasoning_effort=<level> and adds rollout-file tailing to emit live token updates.
  • Updates Claude Code stream parser to emit input-only live token updates during streaming and final totals on result.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
cmd/entire/cli/settings/settings.go Adds persisted ReviewConfig.Model / ReasoningEffort and extends IsZero.
cmd/entire/cli/review/types/reviewer.go Adds runtime RunConfig.Model / ReasoningEffort fields for per-spawn overrides.
cmd/entire/cli/agent/skilldiscovery/scan.go New shared scanners/helpers for skill discovery (slash vs dollar invocation forms, plugin-cache scanning, frontmatter parsing).
cmd/entire/cli/agent/skilldiscovery/registry.go Removes Codex curated built-ins; adjusts install-hint messaging constraints (no backticks).
cmd/entire/cli/agent/skilldiscovery/registry_test.go Updates Codex curated-builtins expectations to empty.
cmd/entire/cli/agent/codex/reviewer.go Adds model/reasoning flags, threads rollout token tailer, and tweaks token emission behavior.
cmd/entire/cli/agent/codex/reviewer_test.go Updates tests for verbatim skill pass-through and new argv flags; adds token emission contract test.
cmd/entire/cli/agent/codex/review_tokens.go New rollout JSONL tailer emitting cumulative Tokens from token_count events.
cmd/entire/cli/agent/codex/review_tokens_test.go Unit tests for rollout token parsing and tailing behavior.
cmd/entire/cli/agent/codex/discovery.go Implements Codex on-disk skill discovery using $name/$plugin:name invocation form.
cmd/entire/cli/agent/codex/discovery_test.go Tests Codex discovery across user skills, plugins cache, and superpowers.
cmd/entire/cli/agent/codex/AGENT.md Documents Codex review integration specifics (exec, skills, live tokens, no hooks).
cmd/entire/cli/agent/claudecode/testdata/stream_with_deltas.jsonl Adds fixture capturing real Claude stream-json usage snapshot behavior.
cmd/entire/cli/agent/claudecode/reviewer.go Emits input-only Tokens on assistant envelopes; final totals on result.
cmd/entire/cli/agent/claudecode/reviewer_test.go Updates expectations and adds test locking input-only live-token semantics.
cmd/entire/cli/agent/claudecode/discovery.go Refactors Claude discovery to delegate scanning/version-dedupe/frontmatter parsing to skilldiscovery.
Comments suppressed due to low confidence (1)

cmd/entire/cli/agent/codex/reviewer_test.go:110

  • This test uses the legacy slash-form "/review" skill, but codex skill discovery and prompt invocation in this PR are described as $name / $plugin:name. Using a $-prefixed skill name here would better reflect the real codex execution path and avoid locking the tests to an invocation form codex exec may not support.
	t.Parallel()
	cfg := reviewtypes.RunConfig{
		Skills:            []string{"/review"},
		AlwaysPrompt:      "Focus on auth regressions.",
		ScopeBaseRef:      "main",
		CheckpointContext: "Commits in scope (newest first):\n  abc123 summary\n",
	}

Comment thread cmd/entire/cli/agent/skilldiscovery/registry.go
Comment thread cmd/entire/cli/agent/codex/reviewer.go
Comment thread cmd/entire/cli/agent/codex/reviewer_test.go
Comment thread cmd/entire/cli/review/types/reviewer.go
Comment thread cmd/entire/cli/agent/codex/reviewer.go Outdated
peyton-alt added a commit that referenced this pull request Jun 4, 2026
- reviewer_test: TestParseCodexOutput_EmitsTokensAtEveryTurnCompleted feeds a
  thread.started envelope, which launches the rollout token tailer. Pin
  ENTIRE_TEST_CODEX_SESSION_DIR to an empty temp dir (drop t.Parallel) so the
  tailer can't pick up a real ~/.codex rollout-*-tid-1.jsonl and inject extra
  Tokens that break the exactly-2 assertion.
- skilldiscovery: codex install-hint suppression fingerprint was slash form
  (/codex:adversarial-review) but DiscoverReviewSkills emits dollar form, so
  ActiveInstallHintsFor never intersected and the hint showed forever after
  install. Use $codex:adversarial-review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 210b33059b2b
peyton-alt and others added 2 commits June 4, 2026 14:01
Agent-layer correctness for `entire review`, independent of the command
surface (no cmd/setup/fix churn):

- Codex invokes its real review skill verbatim (no /review paraphrase);
  codex's skill system loads the matching SKILL.md. Adds per-spawn
  review.<agent>.model / reasoning_effort, threaded through RunConfig into
  codex CLI flags -m / -c model_reasoning_effort=.
- Codex skill discovery in $name form over ~/.codex/{skills,plugins,superpowers};
  lifts the generic SKILL.md scanners + version dedupe + frontmatter parse into
  the shared skilldiscovery package (SlashForm/DollarForm); claude delegates to
  it. Codex curated built-ins dropped (discovered on disk, not bundled).
- Honest live tokens: claude emits input-only Tokens during streaming (output
  snapshot is misleading) with final totals on result; codex tails its rollout
  transcript (located by thread_id) and emits cumulative Tokens per token_count,
  since `codex exec --json` only carries usage at turn.completed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 4cbe537bd34d
- reviewer_test: TestParseCodexOutput_EmitsTokensAtEveryTurnCompleted feeds a
  thread.started envelope, which launches the rollout token tailer. Pin
  ENTIRE_TEST_CODEX_SESSION_DIR to an empty temp dir (drop t.Parallel) so the
  tailer can't pick up a real ~/.codex rollout-*-tid-1.jsonl and inject extra
  Tokens that break the exactly-2 assertion.
- skilldiscovery: codex install-hint suppression fingerprint was slash form
  (/codex:adversarial-review) but DiscoverReviewSkills emits dollar form, so
  ActiveInstallHintsFor never intersected and the hint showed forever after
  install. Use $codex:adversarial-review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 210b33059b2b
peyton-alt and others added 2 commits June 4, 2026 14:29
- Event ordering (real bug): the rollout token tailer runs concurrently and
  sends Tokens on the same channel as the parser. It was only stopped in a
  defer that runs AFTER the inline Finished emissions, so a late Tokens could
  follow Finished — violating ReviewerTemplate's "Finished is the last event"
  contract. Introduce an idempotent stopTailer (sync.Once + WaitGroup.Wait)
  and call it before every terminal Finished; the defer remains as a backstop.
  Adds a regression test (Finished is last even with a live rollout tailer),
  verified under -race.
- Test fidelity: TestCodexReviewer_PassesSkillThroughVerbatim configured the
  codex skill as "/review" (claude's slash form). Codex skills are
  dollar-prefixed; use "$code-reviewer" so the test models real codex usage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 480a94f05ac8
…1373)

Claude review subprocesses run via `claude -p` don't reliably report a model
through lifecycle hooks (the ModelUpdate path is flaky in headless mode), so
per-session checkpoint metadata landed with "model": "" for Claude reviewer
sessions, while codex/pi reviewers record theirs. Root cause: ClaudeCodeAgent
did not implement agent.ModelExtractor, so condense's sessionStateBackfillModel
fallback (AsModelExtractor -> ExtractModel) was a no-op for Claude; the model
relied solely on the flaky hook path.

Implement ExtractModel on ClaudeCodeAgent: parse the transcript and return the
last non-empty model from an assistant line (mirrors pi/copilotcli/factoryai).
Condense already calls this when state.ModelName == "", so review (and any
hook-missed) Claude sessions now record a non-empty model deterministically.

Verified against real recorded data: a real agent_review row in
entire/checkpoints/v1 with model="" (skills=[/review]) — running the new
ExtractModel on that session's stored transcript recovers "claude-opus-4-7",
which is exactly the value the condense backfill writes. Surfaces in
entire.io's per-reviewer attribution (entirehq/entire.io#2271).

Note: the transcript carries the base model id ("claude-opus-4-7"), not the
hook's context-window variant ("claude-opus-4-7[1m]"), so backfilled models
omit the [1m] suffix — a correct, non-empty id, slightly less precise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 41d9aa9acef6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants