feat(compare): add acpx compare to run one prompt across multiple agents #320
mvanhorn wants to merge 1 commit into
acpx compare <agent>... '<prompt>' runs the same prompt against multiple ACP-compatible agents and shows wall-clock time, token usage, stop reason, and a final message preview side by side. Use it to pick the right agent for a task.
Each agent runs in parallel via Promise.allSettled. Per-agent transcripts are persisted to ~/.acpx/compare/<run-id>/<agent>.ndjson so they survive the table render and can be inspected later. Token data comes from usage_update events already in the protocol stream; this PR only aggregates and presents that data and introduces no new state.
Flags:
- --cwd <dir>: target workspace
- --deny-all / --approve-all / --approve-reads: permission mode (default: deny-all)
- --timeout <seconds>: per-agent timeout (default 300)
- --json: emit CompareRow[] as JSON
- --diff: in approve-all mode, run each agent in an isolated worktree
- -f, --prompt-file <path>: read prompt from file
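The parallel run with a per-agent timeout described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `RunAgent`, `CompareRowSketch`, `runWithTimeout`, and `compareSketch` are hypothetical names, and the row fields are assumptions. Only `Promise.allSettled`, the 200-character preview, and the `ok` / `error` / `cancelled` statuses come from the PR text.

```typescript
// Hypothetical row shape; the PR's actual CompareRow fields may differ.
type CompareStatus = "ok" | "error" | "cancelled";

interface CompareRowSketch {
  agent: string;
  status: CompareStatus;
  wallClockMs: number;
  preview: string;
}

// Hypothetical agent runner: resolves with the final message, rejects on failure.
type RunAgent = (agent: string, prompt: string, signal: AbortSignal) => Promise<string>;

// Sentinel returned when the per-agent budget is exceeded.
const TIMED_OUT = Symbol("timed-out");

async function runWithTimeout(
  run: RunAgent,
  agent: string,
  prompt: string,
  timeoutMs: number,
): Promise<string | typeof TIMED_OUT> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the agent run to stop
      resolve(TIMED_OUT);
    }, timeoutMs);
  });
  try {
    return await Promise.race([run(agent, prompt, controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function compareSketch(
  agents: string[],
  prompt: string,
  run: RunAgent,
  timeoutMs: number,
): Promise<CompareRowSketch[]> {
  const elapsed = new Map<number, number>();
  const tasks = agents.map((agent, i) => {
    const start = Date.now();
    // Record wall-clock time whether the run succeeds, fails, or times out.
    return runWithTimeout(run, agent, prompt, timeoutMs).finally(() => {
      elapsed.set(i, Date.now() - start);
    });
  });
  // allSettled waits for every agent, so one failure does not hide the others.
  const settled = await Promise.allSettled(tasks);
  return settled.map((result, i): CompareRowSketch => {
    const agent = agents[i] ?? "unknown";
    const wallClockMs = elapsed.get(i) ?? 0;
    if (result.status === "rejected") {
      const reason: unknown = result.reason;
      const preview = reason instanceof Error ? reason.message : String(reason);
      return { agent, status: "error", wallClockMs, preview };
    }
    if (result.value === TIMED_OUT) {
      return { agent, status: "cancelled", wallClockMs, preview: "" };
    }
    return { agent, status: "ok", wallClockMs, preview: result.value.slice(0, 200) };
  });
}
```

Using a sentinel for the timeout keeps "cancelled" distinct from a genuine agent error without relying on error class checks; whether the PR distinguishes the two this way is not stated.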
Codex review: needs real behavior proof before merge.
Workflow note: Future ClawSweeper reviews update this same comment in place.
Summary
Reproducibility: yes, for the review findings: source inspection shows compare passes config defaults/no permission policy into
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.
Review details
Best possible solution: Land compare only after it reuses the shared exec/global policy plumbing, aligns JSON output with
Do we have a high-confidence way to reproduce the issue? Yes for the review findings: source inspection shows compare passes config defaults/no permission policy into
Is this the best way to solve the issue? No; a compare command may be a good fit, but this implementation forks core permission and output handling instead of reusing the established exec/global option path. The safer solution is to share that plumbing and treat any compare-specific deviations as explicit product decisions.
Overall correctness: patch is incorrect
Codex review notes: model gpt-5.5, reasoning high; reviewed against eb132177bd90.
Summary
- Adds acpx compare <agent>... '<prompt>'. Runs the same prompt against multiple ACP-compatible agents in parallel (via Promise.allSettled) and emits a per-agent table: wall-clock time, token usage (from usage_update events already in the protocol stream), stop reason, first 200 chars of final message, and transcript path.
- Token usage comes from the session/update.usage_update events the protocol already produces. This PR surfaces existing data; no new state is introduced.
- Per-agent transcripts are persisted to ~/.acpx/compare/<run-id>/<agent>.ndjson so they survive the table render and stay reviewable later.
- Flags: --cwd <dir>, --deny-all / --approve-all / --approve-reads (default deny-all), --timeout <seconds> (default 300, per-agent), --json for CompareRow[] output, --diff to run each agent in an isolated git worktree when --approve-all is set, -f, --prompt-file <path>.
Why this matters
acpx already supports calling any individual ACP-compatible agent: acpx codex 'fix the test', acpx claude 'fix the test', acpx pi 'fix the test'. What's missing is the natural next step: running the same prompt across multiple agents in one command and seeing the results side by side.
"Which agent should I use for this task?" is unsolved in ACP-land. The current workflow is to run the prompt under each agent and compare by hand. One acpx compare command closes that gap.
Each agent's full NDJSON transcript is persisted to ~/.acpx/compare/<run-id>/<agent>.ndjson, so the table render is a summary, not the only output.
Demo
Simulated demo:
The demo shows acpx compare codex claude pi 'fix the failing test' against three agents: codex finishes in 8.4s with a concise fix, claude takes 14.1s with deeper analysis, and pi times out at the 300s cap. The viewer sees all three outcomes in one table, which is exactly the picking-an-agent decision the feature exists to support.
Testing
- corepack pnpm typecheck
- corepack pnpm lint (oxlint + oxfmt + flow-schema-terms + persisted-key-casing, all clean)
- corepack pnpm test: 675 tests pass; new test/compare-command.test.ts uses stub agents to cover:
  - --json returns valid CompareRow[]
  - a failing agent reports status: error with error preview; other agents still ok
  - --timeout <s> cancels agents past the per-agent budget (status: cancelled)
  - transcripts are written to ~/.acpx/compare/<run-id>/<agent>.ndjson on disk
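For readers inspecting a persisted transcript by hand, the token aggregation the PR describes might look something like the sketch below. The event shape is an assumption made for illustration: the field names `type`, `inputTokens`, and `outputTokens`, the idea that each usage_update carries a delta to be summed, and the helper names `isUsageUpdate` and `sumUsage` are all hypothetical. The PR only states that token data is aggregated from usage_update events in the NDJSON stream.

```typescript
// Assumed event shape for illustration only; the real protocol payload may differ.
interface UsageUpdateSketch {
  type: "usage_update";
  inputTokens: number;
  outputTokens: number;
}

interface TokenTotals {
  inputTokens: number;
  outputTokens: number;
  events: number;
}

// Narrow an arbitrary parsed JSON value to the assumed usage_update shape.
function isUsageUpdate(value: unknown): value is UsageUpdateSketch {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    v["type"] === "usage_update" &&
    typeof v["inputTokens"] === "number" &&
    typeof v["outputTokens"] === "number"
  );
}

// Sum usage over an NDJSON transcript: one JSON object per line.
// Assumes each usage_update reports a delta; if the events were cumulative,
// the last event would be the total instead.
function sumUsage(ndjson: string): TokenTotals {
  const totals: TokenTotals = { inputTokens: 0, outputTokens: 0, events: 0 };
  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(trimmed);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (!isUsageUpdate(parsed)) continue;
    totals.inputTokens += parsed.inputTokens;
    totals.outputTokens += parsed.outputTokens;
    totals.events += 1;
  }
  return totals;
}
```

Skipping unparseable or unrelated lines keeps the summary usable even when a transcript was cut short by a timeout.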