feat(compare): add acpx compare to run one prompt across multiple agents by mvanhorn · Pull Request #320 · openclaw/acpx

mvanhorn · 2026-05-16T17:47:58Z

Summary

Adds acpx compare <agent>... '<prompt>'. Runs the same prompt against multiple ACP-compatible agents in parallel (via Promise.allSettled) and emits a per-agent table: wall-clock time, token usage (from usage_update events already in the protocol stream), stop reason, first 200 chars of final message, and transcript path.
Token data is aggregated from session/update.usage_update events the protocol already produces. This PR surfaces existing data; no new state is introduced.
Per-agent transcripts persist to ~/.acpx/compare/<run-id>/<agent>.ndjson so they survive the table render and stay reviewable later.
Flags: --cwd <dir>; --deny-all / --approve-all / --approve-reads (default deny-all); --timeout <seconds> (default 300, per agent); --json for CompareRow[] output; --diff to run each agent in an isolated git worktree when --approve-all is set; -f, --prompt-file <path> to read the prompt from a file.
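A minimal sketch of the fan-out described in the summary, under stated assumptions: the row shape (CompareRowSketch), the RunAgent callback, and the helper names are hypothetical and are not taken from the PR's src/cli/compare-command.ts. It only illustrates how a per-agent timeout and Promise.allSettled can combine so that one failing or slow agent does not affect the other rows.

```typescript
// Hypothetical row shape; the PR's real CompareRow fields are not shown in this thread.
type Status = "ok" | "error" | "cancelled";

interface CompareRowSketch {
  agent: string;
  status: Status;
  wallClockMs: number;
  preview: string; // first 200 chars of the final message, or of the error text
}

// Hypothetical stand-in for a one-shot agent run that resolves with the final message.
type RunAgent = (agent: string, prompt: string, signal: AbortSignal) => Promise<string>;

// Runs one agent under its own time budget and always resolves with a row.
async function runWithBudget(
  agent: string,
  prompt: string,
  run: RunAgent,
  timeoutMs: number,
): Promise<CompareRowSketch> {
  const controller = new AbortController();
  const started = Date.now();
  let timedOut = false;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      timedOut = true;
      controller.abort(); // ask the agent run to stop
      reject(new Error("timeout"));
    }, timeoutMs);
  });
  try {
    const message = await Promise.race([run(agent, prompt, controller.signal), timeout]);
    return { agent, status: "ok", wallClockMs: Date.now() - started, preview: message.slice(0, 200) };
  } catch (err) {
    const text = err instanceof Error ? err.message : String(err);
    return {
      agent,
      status: timedOut ? "cancelled" : "error",
      wallClockMs: Date.now() - started,
      preview: text.slice(0, 200),
    };
  } finally {
    clearTimeout(timer);
  }
}

// Fans the same prompt out to every agent and returns one row per agent, in input order.
async function compareSketch(
  agents: string[],
  prompt: string,
  run: RunAgent,
  timeoutMs: number,
): Promise<CompareRowSketch[]> {
  const settled = await Promise.allSettled(
    agents.map((agent) => runWithBudget(agent, prompt, run, timeoutMs)),
  );
  return settled.map((s, i): CompareRowSketch =>
    s.status === "fulfilled"
      ? s.value
      : { agent: agents[i], status: "error", wallClockMs: 0, preview: String(s.reason).slice(0, 200) },
  );
}
```

Because runWithBudget catches its own failures, the rejected branch in compareSketch is only a safety net; allSettled still guarantees that every agent yields exactly one entry.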

Why this matters

acpx already supports calling any individual ACP-compatible agent: acpx codex 'fix the test', acpx claude 'fix the test', acpx pi 'fix the test'. What's missing is the natural next step — running the same prompt across multiple agents in one command and seeing the results side-by-side.

"Which agent should I use for this task?" is unsolved in ACP-land. The current workflow is to run the prompt under each agent and compare the results by hand. One command closes that gap:

acpx compare codex claude pi 'fix the failing test in checkout.spec.ts'
acpx compare codex claude --json | jq '.[] | select(.status == "ok") | .agent'
acpx --approve-all compare codex claude 'refactor auth.ts' --diff   # isolated worktrees

Each agent's full NDJSON transcript is persisted to ~/.acpx/compare/<run-id>/<agent>.ndjson so the table render is a summary, not the only output.
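For reviewing a persisted transcript later, a small sketch may help. It assumes only what is stated above: the path layout ~/.acpx/compare/<run-id>/<agent>.ndjson and the NDJSON convention of one JSON value per line. The function names transcriptPath and parseNdjson are hypothetical, and the home directory is passed in as a parameter so the sketch stays free of platform-specific calls.

```typescript
// Builds the documented transcript location for one agent in one compare run.
// `home` is supplied by the caller; this helper is illustrative, not acpx's API.
function transcriptPath(home: string, runId: string, agent: string): string {
  return `${home}/.acpx/compare/${runId}/${agent}.ndjson`;
}

// Parses NDJSON text: one JSON value per line, blank lines skipped.
// Throws (via JSON.parse) if any non-blank line is not valid JSON.
function parseNdjson(text: string): unknown[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as unknown);
}
```

The parsed events are typed as unknown on purpose, since the thread does not show the event schema.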

Demo

Simulated demo:

The demo shows acpx compare codex claude pi 'fix the failing test' against three agents: codex finishes in 8.4s with a concise fix, claude takes 14.1s with deeper analysis, pi times out at the 300s cap. The viewer sees all three outcomes in one table — exactly the picking-an-agent decision the feature exists to support.

Testing

corepack pnpm typecheck
corepack pnpm lint (oxlint + oxfmt + flow-schema-terms + persisted-key-casing, all clean)
corepack pnpm test — 675 tests pass; new test/compare-command.test.ts uses stub agents to cover:
- multi-agent run produces one table row per agent
- --json returns valid CompareRow[]
- an erroring agent shows status: error with error preview; other agents still ok
- --timeout <s> cancels agents past the per-agent budget (status: cancelled)
- token totals populate from stubbed usage_update events
- transcripts persist to ~/.acpx/compare/<run-id>/<agent>.ndjson on disk
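The token-total case in the list above could be exercised with something like the following sketch. The event shape is an assumption: the thread names usage_update events but does not show their fields, so sessionUpdate, inputTokens, and outputTokens are hypothetical here, as is the choice to treat each event as an increment rather than a running total.

```typescript
interface TokenTotals {
  input: number;
  output: number;
  total: number;
}

// Hypothetical shape of a usage_update payload; field names are assumed, not confirmed.
interface UsageUpdateSketch {
  sessionUpdate: "usage_update";
  inputTokens?: number;
  outputTokens?: number;
}

// Narrows an arbitrary transcript event to the assumed usage_update shape.
function isUsageUpdate(u: unknown): u is UsageUpdateSketch {
  return (
    typeof u === "object" &&
    u !== null &&
    (u as { sessionUpdate?: unknown }).sessionUpdate === "usage_update"
  );
}

// Sums token counts across all usage_update events, ignoring every other event.
// Assumes per-event increments; if events carried cumulative totals, the last
// event would be used instead of a sum.
function sumUsage(updates: unknown[]): TokenTotals {
  let input = 0;
  let output = 0;
  for (const u of updates) {
    if (!isUsageUpdate(u)) continue;
    input += u.inputTokens ?? 0;
    output += u.outputTokens ?? 0;
  }
  return { input, output, total: input + output };
}
```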

acpx compare <agent>... '<prompt>' runs the same prompt against multiple ACP-compatible agents and shows wall-clock time, token usage, stop reason, and final message preview side-by-side. Use it to pick the right agent for a task.

Each agent runs in parallel via Promise.allSettled. Per-agent transcripts are persisted to ~/.acpx/compare/<run-id>/<agent>.ndjson so they survive the table render and can be inspected later. Token data comes from usage_update events already in the protocol stream; this PR aggregates and presents, no new state introduced.

Flags:
- --cwd <dir>: target workspace
- --deny-all / --approve-all / --approve-reads: permission mode (default: deny-all)
- --timeout <seconds>: per-agent timeout (default 300)
- --json: emit CompareRow[] as JSON
- --diff: in approve-all mode, run each agent in an isolated worktree
- -f, --prompt-file <path>: read prompt from file

clawsweeper · 2026-05-21T08:12:20Z

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds a top-level acpx compare command with parallel one-shot agent runs, transcript persistence, optional diff worktrees, docs, changelog, and stub-agent tests.

Reproducibility: yes, for the review findings. Source inspection shows compare passes config defaults/no permission policy into runOnce and only checks flags.json, while current exec/global docs establish the expected shared behavior. I did not run the PR because this review is read-only and the issue is clear from the diff.

PR rating
Overall: 🧂 unranked krab
Proof: 🦪 silver shellfish
Patch quality: 🦪 silver shellfish
Summary: The idea is useful, but the PR is not quality-ready until it preserves shared security/output contracts and includes real behavior proof.

Rank-up moves:

Add redacted terminal output or a short recording showing acpx compare against at least two real configured ACP agents, including transcript paths and any private data redacted.
Refactor compare to reuse the shared global flag, permission-policy, output-format, and exec option plumbing, then add regression coverage for --policy and --format json.
Rebase the branch onto current main and update the definitive CLI reference plus skills/acpx/SKILL.md for the new command.

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR provides a simulated GIF and stub-agent tests, but no terminal output, recording, or redacted logs from real configured ACP agents after the change. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

Compare currently ignores explicit global permission policy/auth/output choices, so users could get behavior that conflicts with existing CLI safety and automation contracts.
The PR body provides a simulated demo and stub-agent tests, but not after-fix proof from real configured ACP agents.
GitHub currently reports the branch as conflicting with main, so it needs a rebase before normal merge review.
This adds a new top-level command, output schema, and persisted transcript location, so maintainers should explicitly accept those product-surface conventions before landing.

Maintainer options:

Reuse shared exec policy plumbing (recommended)
Resolve compare execution through the same global flag, permission-policy, auth, terminal, retry, timeout, and session-option path as exec, and make --format json drive JSON output.
Accept a compare-specific contract explicitly
Maintainers could intentionally make compare use command-local flags, but then the PR needs explicit docs/tests showing which global flags are unsupported and why that is safe.
Pause the top-level command
If the new compare command/output schema is too much core surface right now, pause or close this PR until the product shape is agreed separately.

Next step before merge
Needs maintainer product/security review and contributor-supplied real behavior proof; ClawSweeper should not queue an automated repair while the proof gate and top-level command contract remain open.

Security
Needs attention: The diff introduces a concrete permission-boundary concern because compare does not pass explicit global permission policy into agent execution.

Review findings

[P1] Honor global permission policy in compare — src/cli/compare-command.ts:425-436
[P2] Honor the shared output format contract — src/cli/compare-command.ts:590-595
[P3] Keep the definitive CLI and skill docs in sync — README.md:39

Review details

Best possible solution:

Land compare only after it reuses the shared exec/global policy plumbing, aligns JSON output with --format, updates definitive agent-facing docs, rebases cleanly, and includes redacted real-agent proof.

Do we have a high-confidence way to reproduce the issue?

Yes for the review findings: source inspection shows compare passes config defaults/no permission policy into runOnce and only checks flags.json, while current exec/global docs establish the expected shared behavior. I did not run the PR because this review is read-only and the issue is clear from the diff.

Is this the best way to solve the issue?

No; a compare command may be a good fit, but this implementation forks core permission and output handling instead of reusing the established exec/global option path. The safer solution is to share that plumbing and treat any compare-specific deviations as explicit product decisions.

Label changes:

add P2: This is a normal-priority feature PR with concrete merge blockers but no urgent regression in current released behavior.
add merge-risk: 🚨 compatibility: The new command does not honor the existing global --format json/config output contract, which can confuse or break automation using shared CLI conventions.
add merge-risk: 🚨 security-boundary: The new command can ignore explicit permission-policy and non-interactive permission flags while running agents, weakening caller-selected tool safety controls.
add rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🦪 silver shellfish and patch quality is 🦪 silver shellfish. The idea is useful, but the PR is not quality-ready until it preserves shared security/output contracts and includes real behavior proof.
add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR provides a simulated GIF and stub-agent tests, but no terminal output, recording, or redacted logs from real configured ACP agents after the change. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Label justifications:

P2: This is a normal-priority feature PR with concrete merge blockers but no urgent regression in current released behavior.
merge-risk: 🚨 security-boundary: The new command can ignore explicit permission-policy and non-interactive permission flags while running agents, weakening caller-selected tool safety controls.
merge-risk: 🚨 compatibility: The new command does not honor the existing global --format json/config output contract, which can confuse or break automation using shared CLI conventions.
rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🦪 silver shellfish and patch quality is 🦪 silver shellfish. The idea is useful, but the PR is not quality-ready until it preserves shared security/output contracts and includes real behavior proof.
status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR provides a simulated GIF and stub-agent tests, but no terminal output, recording, or redacted logs from real configured ACP agents after the change. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Full review comments:

[P1] Honor global permission policy in compare — src/cli/compare-command.ts:425-436
compare runs each agent with config defaults for non-interactive/auth policy and never passes a loaded permissionPolicy, unlike exec which resolves those global flags before runOnce. With acpx --approve-all --policy '{"autoDeny":["execute"]}' compare ..., execute requests would not see the caller's deny rule, so the new command bypasses an existing safety contract. Reuse the shared global-flag and permission-policy path for compare.
Confidence: 0.9
[P2] Honor the shared output format contract — src/cli/compare-command.ts:590-595
--format and config format are the existing machine-output contract, but this branch only emits CompareRow[] when the command-local --json flag is set. acpx --format json compare ... or config format: "json" would still print a text table, which breaks scripts using the documented global output mode. Make --format json drive compare JSON output and keep --json only as an alias if desired.
Confidence: 0.87
[P3] Keep the definitive CLI and skill docs in sync — README.md:39
The README advertises a new top-level command, but the definitive CLI reference grammar and the bundled acpx skill still do not document compare. Agent users are a primary audience for this repo, so landing the command without those surfaces leaves the public command contract split and incomplete.
Confidence: 0.78
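For the P2 finding above, one possible shape of the fix is sketched here. It is not the repo's code: the type and function names are hypothetical, the set of format values is assumed to include at least "text" and "json", and the precedence order (command-local --json alias, then global --format, then config format, then a text default) is an assumption that maintainers would need to confirm.

```typescript
// Assumed subset of output formats; acpx may support others.
type OutputFormat = "text" | "json";

interface FormatInputs {
  localJson?: boolean; // command-local --json, kept only as an alias
  globalFormat?: OutputFormat; // global --format <value>
  configFormat?: OutputFormat; // format: "<value>" from config
}

// Resolves the effective output format so the global contract drives compare too.
function resolveCompareFormat(inputs: FormatInputs): OutputFormat {
  if (inputs.localJson) return "json";
  return inputs.globalFormat ?? inputs.configFormat ?? "text";
}
```

With this resolution, acpx --format json compare ... and config format: "json" would both select JSON output, while --json keeps working for callers who already use it.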

Overall correctness: patch is incorrect
Overall confidence: 0.88

Security concerns:

[medium] Explicit permission policy is ignored — src/cli/compare-command.ts:425
runAgentForCompare calls runOnce without a loaded permissionPolicy while still allowing permissive modes such as --approve-all, so caller-supplied deny/escalate rules can be skipped for the new command.
Confidence: 0.9
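To make the concern concrete, here is a minimal sketch of the behavior a caller would expect from acpx --approve-all --policy '{"autoDeny":["execute"]}' compare .... Everything in it is an assumption for illustration: the decide function, the PermissionPolicySketch type, and the "read" tool kind are hypothetical, and only the autoDeny field from the example above is modeled (escalate rules are left out).

```typescript
type PermissionMode = "deny-all" | "approve-reads" | "approve-all";

// Hypothetical policy shape covering only the autoDeny field shown in the review.
interface PermissionPolicySketch {
  autoDeny?: string[];
}

// Decides a single tool request. An explicit deny rule wins over any permissive mode.
function decide(
  mode: PermissionMode,
  policy: PermissionPolicySketch | undefined,
  toolKind: string,
): "allow" | "deny" {
  const denied = policy?.autoDeny?.some((kind) => kind === toolKind) ?? false;
  if (denied) return "deny";
  if (mode === "approve-all") return "allow";
  if (mode === "approve-reads" && toolKind === "read") return "allow";
  return "deny";
}
```

If the policy argument is never passed, as the finding describes for runAgentForCompare, the first check can never fire and approve-all allows the execute request.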

What I checked:

PR diff adds new command surface: The branch registers a new top-level compare command and implements its own compare-specific flag parsing/output in src/cli/compare-command.ts. (src/cli/compare-command.ts:529, 14a7ba86bdbe)
Current exec path threads global policy into runOnce: On current main, handleExec resolves global flags, loads permissionPolicy, and passes non-interactive permissions, auth policy, terminal, retries, timeout, and session options into runOnce. (src/cli/command-handlers.ts:404, eb132177bd90)
Compare omits global policy plumbing: The PR's runAgentForCompare passes config defaults for non-interactive/auth policy and does not pass any loaded permissionPolicy, terminal setting, retries, or session options to runOnce. (src/cli/compare-command.ts:425, 14a7ba86bdbe)
Global output contract exists on main: The CLI reference documents --format as the global output mode and --format json as the automation-oriented JSON output contract. (docs/CLI.md:112, eb132177bd90)
Compare ignores the shared output selector: The PR emits CompareRow[] only when the command-local --json flag is present, so --format json and config format: json would still render the text table. (src/cli/compare-command.ts:590, 14a7ba86bdbe)
PR proof and merge state: The PR body labels the demo as simulated, tests use stub agents, and gh pr view reports mergeable: CONFLICTING. (14a7ba86bdbe)

Likely related people:

Alex Knight: Recent current-main work touched CLI routing, session list/resume behavior, and prompt content handling, including custom-agent routing after global flags. (role: recent area contributor; confidence: high; commits: 0907268a37c7, adee5ad7d665, f09933a837de; files: src/cli-core.ts, src/cli/command-handlers.ts, src/cli/session/runtime.ts)
Peter Steinberger: Blame for the current top-level verbs, shared command registration, global flags, and exec plumbing points to the release/baseline commit that added these files in the checked-out history. (role: introduced current CLI command surface; confidence: medium; commits: 994fc6c9cf13; files: src/cli-core.ts, src/cli/command-registration.ts, src/cli/command-handlers.ts)
Bob: Recent Slophammer work refactored flag parsing and added extensive CLI flag tests around the same global option surface this PR needs to reuse. (role: recent flags and quality-gate contributor; confidence: medium; commits: c26e99c8bcd0; files: src/cli/flags.ts, test/cli-flags.test.ts, slophammer.yml)

Codex review notes: model gpt-5.5, reasoning high; reviewed against eb132177bd90.

clawsweeper · 2026-05-21T08:12:22Z

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?

The egg game starts only after the PR passes the real-behavior proof check.
Before that, no creature or rarity is rolled. The treat waits for real proof.
This is still just collectible flavor: proof affects review readiness, not creature quality.

mvanhorn wants to merge 1 commit into openclaw:main from mvanhorn:feat/acpx-compare
