
feat(compare): add acpx compare to run one prompt across multiple agents #320

Open
mvanhorn wants to merge 1 commit into openclaw:main from mvanhorn:feat/acpx-compare

Conversation

@mvanhorn
Contributor

Summary

  • Adds acpx compare <agent>... '<prompt>'. Runs the same prompt against multiple ACP-compatible agents in parallel (via Promise.allSettled) and emits a per-agent table: wall-clock time, token usage (from usage_update events already in the protocol stream), stop reason, first 200 chars of final message, and transcript path.
  • Token data is aggregated from session/update.usage_update events the protocol already produces. This PR surfaces existing data; no new state is introduced.
  • Per-agent transcripts persist to ~/.acpx/compare/<run-id>/<agent>.ndjson so they survive the table render and stay reviewable later.
  • Flags: --cwd <dir>; --deny-all / --approve-all / --approve-reads (default deny-all); --timeout <seconds> (default 300, per-agent); --json for CompareRow[] output; --diff to run each agent in an isolated git worktree when --approve-all is set; and -f, --prompt-file <path> to read the prompt from a file.
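
For reviewers who want to reason about the parallel run and the per-agent timeout described in the bullets above, here is a minimal sketch. It is not the code in this PR: `AgentRunner`, `Outcome`, `runWithTimeout`, and `compareAgents` are hypothetical names, and the real `runOnce` takes different arguments. It only illustrates how `Promise.allSettled` keeps one failing or slow agent from affecting the others.

```typescript
// Hypothetical runner signature; acpx's real runOnce has a different shape.
type AgentRunner = (agent: string, prompt: string, signal: AbortSignal) => Promise<string>;

type Outcome =
  | { agent: string; status: "ok"; finalMessage: string }
  | { agent: string; status: "error"; error: string }
  | { agent: string; status: "cancelled" };

// Sentinel used to tell a timeout apart from an ordinary agent failure.
const TIMED_OUT = { kind: "timeout" } as const;

async function runWithTimeout(
  agent: string,
  prompt: string,
  run: AgentRunner,
  timeoutMs: number,
): Promise<string> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Reject first so the race settles with the sentinel even if the
      // runner reacts to the abort signal by rejecting immediately.
      reject(TIMED_OUT);
      controller.abort();
    }, timeoutMs);
  });
  try {
    return await Promise.race([run(agent, prompt, controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function compareAgents(
  agents: string[],
  prompt: string,
  run: AgentRunner,
  timeoutMs: number,
): Promise<Outcome[]> {
  // allSettled never rejects, so every agent gets a row regardless of the others.
  const settled = await Promise.allSettled(
    agents.map((agent) => runWithTimeout(agent, prompt, run, timeoutMs)),
  );
  return settled.map((result, index): Outcome => {
    const agent = agents[index];
    if (result.status === "fulfilled") {
      return { agent, status: "ok", finalMessage: result.value };
    }
    if (result.reason === TIMED_OUT) {
      return { agent, status: "cancelled" };
    }
    return { agent, status: "error", error: String(result.reason) };
  });
}
```

Each agent gets its own timer and its own `AbortController`, which matches the "per-agent" wording of `--timeout`; a shared deadline would be a different design.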

Why this matters

acpx already supports calling any individual ACP-compatible agent: acpx codex 'fix the test', acpx claude 'fix the test', acpx pi 'fix the test'. What's missing is the natural next step — running the same prompt across multiple agents in one command and seeing the results side-by-side.

"Which agent should I use for this task?" is unsolved in ACP-land. The current workflow is to run the prompt under each agent and compare by hand. One command closes that gap:

acpx compare codex claude pi 'fix the failing test in checkout.spec.ts'
acpx compare codex claude --json | jq '.[] | select(.status == "ok") | .agent'
acpx --approve-all compare codex claude 'refactor auth.ts' --diff   # isolated worktrees

Each agent's full NDJSON transcript is persisted to ~/.acpx/compare/<run-id>/<agent>.ndjson so the table render is a summary, not the only output.
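
For anyone scripting against `--json` from Node rather than jq, the filter above could look roughly like this. The `CompareRow` shape here is an assumption inferred from the columns listed in the summary (only `agent` and `status` are confirmed by the jq example); the field names in the PR may differ. The sample rows reuse the timings from the simulated demo and are illustrative only.

```typescript
// Hypothetical row shape; only `agent` and `status` appear in the PR text.
type CompareStatus = "ok" | "error" | "cancelled";

interface CompareRow {
  agent: string;
  status: CompareStatus;
  wallClockMs: number; // assumed unit
  totalTokens: number | null; // null when no usage data was reported
  stopReason: string | null;
  preview: string; // first 200 chars of the final message
  transcriptPath: string;
}

// Equivalent of: jq '.[] | select(.status == "ok") | .agent'
function okAgents(rows: CompareRow[]): string[] {
  return rows.filter((row) => row.status === "ok").map((row) => row.agent);
}

// Fastest successful agent, or undefined when none succeeded.
function fastestOk(rows: CompareRow[]): CompareRow | undefined {
  return rows
    .filter((row) => row.status === "ok")
    .sort((a, b) => a.wallClockMs - b.wallClockMs)[0];
}

// Illustrative data only; not output from a real run.
const sampleRows: CompareRow[] = [
  { agent: "codex", status: "ok", wallClockMs: 8400, totalTokens: null,
    stopReason: "end_turn", preview: "...", transcriptPath: "~/.acpx/compare/<run-id>/codex.ndjson" },
  { agent: "claude", status: "ok", wallClockMs: 14100, totalTokens: null,
    stopReason: "end_turn", preview: "...", transcriptPath: "~/.acpx/compare/<run-id>/claude.ndjson" },
  { agent: "pi", status: "cancelled", wallClockMs: 300000, totalTokens: null,
    stopReason: null, preview: "", transcriptPath: "~/.acpx/compare/<run-id>/pi.ndjson" },
];
```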

Demo

Simulated demo:

[Image: acpx compare demo]

The demo shows acpx compare codex claude pi 'fix the failing test' against three agents: codex finishes in 8.4s with a concise fix, claude takes 14.1s with deeper analysis, pi times out at the 300s cap. The viewer sees all three outcomes in one table — exactly the picking-an-agent decision the feature exists to support.

Testing

  • corepack pnpm typecheck
  • corepack pnpm lint (oxlint + oxfmt + flow-schema-terms + persisted-key-casing, all clean)
  • corepack pnpm test — 675 tests pass; new test/compare-command.test.ts uses stub agents to cover:
    • multi-agent run produces one table row per agent
    • --json returns valid CompareRow[]
    • an erroring agent shows status: error with error preview; other agents still ok
    • --timeout <s> cancels agents past the per-agent budget (status: cancelled)
    • token totals populate from stubbed usage_update events
    • transcripts persist to ~/.acpx/compare/<run-id>/<agent>.ndjson on disk
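
The transcript layout covered by the last test case can be sketched without touching the filesystem. The helpers below are hypothetical (`transcriptPath`, `toNdjson`, `parseNdjson` are not names from the PR) and assume POSIX-style separators; they only show the `~/.acpx/compare/<run-id>/<agent>.ndjson` layout and the one-JSON-object-per-line format.

```typescript
// Builds <home>/.acpx/compare/<run-id>/<agent>.ndjson from explicit parts.
// Assumes POSIX separators; a real implementation would likely use node:path.
function transcriptPath(homeDir: string, runId: string, agent: string): string {
  const home = homeDir.replace(/\/+$/, ""); // drop trailing slashes
  return [home, ".acpx", "compare", runId, `${agent}.ndjson`].join("/");
}

// NDJSON: one JSON value per line, each line terminated by "\n".
function toNdjson(events: unknown[]): string {
  if (events.length === 0) return "";
  return events.map((event) => JSON.stringify(event)).join("\n") + "\n";
}

// Inverse of toNdjson; blank lines are skipped.
function parseNdjson(text: string): unknown[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}
```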

acpx compare <agent>... '<prompt>' runs the same prompt against multiple
ACP-compatible agents and shows wall-clock time, token usage, stop reason,
and final message preview side-by-side. Use it to pick the right agent
for a task.

Each agent runs in parallel via Promise.allSettled. Per-agent transcripts
are persisted to ~/.acpx/compare/<run-id>/<agent>.ndjson so they survive
the table render and can be inspected later.

Token data comes from usage_update events already in the protocol stream;
this PR aggregates and presents, no new state introduced.
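
A rough sketch of what "aggregates and presents" could mean for token data. The event shape is an assumption: this text only says that `usage_update` events exist in the stream, not which fields they carry, so `inputTokens` and `outputTokens` are hypothetical names. The sketch also assumes each event reports an increment; if the protocol reports running totals instead, the last event should be used rather than a sum.

```typescript
// Hypothetical event shape; field names are not confirmed by the PR.
interface UsageUpdate {
  sessionUpdate: "usage_update";
  inputTokens?: number;
  outputTokens?: number;
}

interface TokenTotals {
  input: number;
  output: number;
  total: number;
}

// Type guard that picks usage_update events out of a mixed event stream.
function isUsageUpdate(value: unknown): value is UsageUpdate {
  return (
    typeof value === "object" &&
    value !== null &&
    (value as { sessionUpdate?: unknown }).sessionUpdate === "usage_update"
  );
}

// Sums per-event counts (assumes increments, not running totals).
function aggregateUsage(updates: unknown[]): TokenTotals {
  let input = 0;
  let output = 0;
  for (const update of updates) {
    if (!isUsageUpdate(update)) continue;
    input += update.inputTokens ?? 0;
    output += update.outputTokens ?? 0;
  }
  return { input, output, total: input + output };
}
```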

Flags:
- --cwd <dir>: target workspace
- --deny-all / --approve-all / --approve-reads: permission mode (default: deny-all)
- --timeout <seconds>: per-agent timeout (default 300)
- --json: emit CompareRow[] as JSON
- --diff: in approve-all mode, run each agent in an isolated worktree
- -f, --prompt-file <path>: read prompt from file
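
The defaults in the flag list above can be summarized as a small normalization step. This is a sketch under assumptions, not the parser in the PR: `RawCompareFlags`, `CompareOptions`, and `normalizeFlags` are hypothetical, and rejecting conflicting permission flags is a design choice made here for illustration.

```typescript
type PermissionMode = "deny-all" | "approve-all" | "approve-reads";

// Hypothetical parsed-flag shape; the PR's own types may differ.
interface RawCompareFlags {
  cwd?: string;
  denyAll?: boolean;
  approveAll?: boolean;
  approveReads?: boolean;
  timeout?: number; // seconds
  json?: boolean;
  diff?: boolean;
  promptFile?: string;
}

interface CompareOptions {
  cwd: string;
  permissionMode: PermissionMode;
  timeoutMs: number;
  json: boolean;
  useWorktrees: boolean;
  promptFile?: string;
}

function normalizeFlags(raw: RawCompareFlags, defaultCwd: string): CompareOptions {
  const chosen: PermissionMode[] = [];
  if (raw.denyAll) chosen.push("deny-all");
  if (raw.approveAll) chosen.push("approve-all");
  if (raw.approveReads) chosen.push("approve-reads");
  if (chosen.length > 1) {
    // Assumption: conflicting modes are an error rather than last-one-wins.
    throw new Error(`conflicting permission flags: ${chosen.join(", ")}`);
  }
  const permissionMode = chosen[0] ?? "deny-all"; // documented default
  const timeoutSeconds = raw.timeout ?? 300; // documented default, per agent
  if (!Number.isFinite(timeoutSeconds) || timeoutSeconds <= 0) {
    throw new Error("--timeout must be a positive number of seconds");
  }
  return {
    cwd: raw.cwd ?? defaultCwd,
    permissionMode,
    timeoutMs: timeoutSeconds * 1000,
    json: raw.json ?? false,
    // --diff only takes effect in approve-all mode, per the flag description.
    useWorktrees: (raw.diff ?? false) && permissionMode === "approve-all",
    promptFile: raw.promptFile,
  };
}
```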
@clawsweeper clawsweeper Bot added labels on May 21, 2026:

  • rating: 🧂 unranked krab: Not merge-ready due to missing proof or serious correctness/safety concerns.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask.
  • P2: Normal priority bug or improvement with limited blast radius.
  • merge-risk: 🚨 compatibility 🚨: Merging this PR could break existing users, config, migrations, defaults, or upgrades.
  • merge-risk: 🚨 security-boundary 🚨: Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data.
@clawsweeper

clawsweeper Bot commented May 21, 2026

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds a top-level acpx compare command with parallel one-shot agent runs, transcript persistence, optional diff worktrees, docs, changelog, and stub-agent tests.

Reproducibility: yes, for the review findings. Source inspection shows compare passes config defaults and no permission policy into runOnce and only checks flags.json, while current exec/global docs establish the expected shared behavior. I did not run the PR because this review is read-only and the issue is clear from the diff.

PR rating
Overall: 🧂 unranked krab
Proof: 🦪 silver shellfish
Patch quality: 🦪 silver shellfish
Summary: The idea is useful, but the PR is not quality-ready until it preserves shared security/output contracts and includes real behavior proof.

Rank-up moves:

  • Add redacted terminal output or a short recording showing acpx compare against at least two real configured ACP agents, including transcript paths and any private data redacted.
  • Refactor compare to reuse the shared global flag, permission-policy, output-format, and exec option plumbing, then add regression coverage for --policy and --format json.
  • Rebase the branch onto current main and update the definitive CLI reference plus skills/acpx/SKILL.md for the new command.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR provides a simulated GIF and stub-agent tests, but no terminal output, recording, or redacted logs from real configured ACP agents after the change. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • Compare currently ignores explicit global permission policy/auth/output choices, so users could get behavior that conflicts with existing CLI safety and automation contracts.
  • The PR body provides a simulated demo and stub-agent tests, but not after-fix proof from real configured ACP agents.
  • GitHub currently reports the branch as conflicting with main, so it needs a rebase before normal merge review.
  • This adds a new top-level command, output schema, and persisted transcript location, so maintainers should explicitly accept those product-surface conventions before landing.

Maintainer options:

  1. Reuse shared exec policy plumbing (recommended)
    Resolve compare execution through the same global flag, permission-policy, auth, terminal, retry, timeout, and session-option path as exec, and make --format json drive JSON output.
  2. Accept a compare-specific contract explicitly
    Maintainers could intentionally make compare use command-local flags, but then the PR needs explicit docs/tests showing which global flags are unsupported and why that is safe.
  3. Pause the top-level command
    If the new compare command/output schema is too much core surface right now, pause or close this PR until the product shape is agreed separately.

Next step before merge
Needs maintainer product/security review and contributor-supplied real behavior proof; ClawSweeper should not queue an automated repair while the proof gate and top-level command contract remain open.

Security
Needs attention: The diff introduces a concrete permission-boundary concern because compare does not pass explicit global permission policy into agent execution.

Review findings

  • [P1] Honor global permission policy in compare — src/cli/compare-command.ts:425-436
  • [P2] Honor the shared output format contract — src/cli/compare-command.ts:590-595
  • [P3] Keep the definitive CLI and skill docs in sync — README.md:39
Review details

Best possible solution:

Land compare only after it reuses the shared exec/global policy plumbing, aligns JSON output with --format, updates definitive agent-facing docs, rebases cleanly, and includes redacted real-agent proof.

Do we have a high-confidence way to reproduce the issue?

Yes for the review findings: source inspection shows compare passes config defaults/no permission policy into runOnce and only checks flags.json, while current exec/global docs establish the expected shared behavior. I did not run the PR because this review is read-only and the issue is clear from the diff.

Is this the best way to solve the issue?

No; a compare command may be a good fit, but this implementation forks core permission and output handling instead of reusing the established exec/global option path. The safer solution is to share that plumbing and treat any compare-specific deviations as explicit product decisions.

Label changes:

  • add P2: This is a normal-priority feature PR with concrete merge blockers but no urgent regression in current released behavior.
  • add merge-risk: 🚨 compatibility: The new command does not honor the existing global --format json/config output contract, which can confuse or break automation using shared CLI conventions.
  • add merge-risk: 🚨 security-boundary: The new command can ignore explicit permission-policy and non-interactive permission flags while running agents, weakening caller-selected tool safety controls.
  • add rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🦪 silver shellfish, patch quality is 🦪 silver shellfish, and the idea is useful, but the PR is not quality-ready until it preserves shared security/output contracts and includes real behavior proof.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR provides a simulated GIF and stub-agent tests, but no terminal output, recording, or redacted logs from real configured ACP agents after the change. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Label justifications:

  • P2: This is a normal-priority feature PR with concrete merge blockers but no urgent regression in current released behavior.
  • merge-risk: 🚨 security-boundary: The new command can ignore explicit permission-policy and non-interactive permission flags while running agents, weakening caller-selected tool safety controls.
  • merge-risk: 🚨 compatibility: The new command does not honor the existing global --format json/config output contract, which can confuse or break automation using shared CLI conventions.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🦪 silver shellfish, patch quality is 🦪 silver shellfish, and the idea is useful, but the PR is not quality-ready until it preserves shared security/output contracts and includes real behavior proof.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR provides a simulated GIF and stub-agent tests, but no terminal output, recording, or redacted logs from real configured ACP agents after the change. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Full review comments:

  • [P1] Honor global permission policy in compare — src/cli/compare-command.ts:425-436
    compare runs each agent with config defaults for non-interactive/auth policy and never passes a loaded permissionPolicy, unlike exec which resolves those global flags before runOnce. With acpx --approve-all --policy '{"autoDeny":["execute"]}' compare ..., execute requests would not see the caller's deny rule, so the new command bypasses an existing safety contract. Reuse the shared global-flag and permission-policy path for compare.
    Confidence: 0.9
  • [P2] Honor the shared output format contract — src/cli/compare-command.ts:590-595
    --format and config format are the existing machine-output contract, but this branch only emits CompareRow[] when the command-local --json flag is set. acpx --format json compare ... or config format: "json" would still print a text table, which breaks scripts using the documented global output mode. Make --format json drive compare JSON output and keep --json only as an alias if desired.
    Confidence: 0.87
  • [P3] Keep the definitive CLI and skill docs in sync — README.md:39
    The README advertises a new top-level command, but the definitive CLI reference grammar and the bundled acpx skill still do not document compare. Agent users are a primary audience for this repo, so landing the command without those surfaces leaves the public command contract split and incomplete.
    Confidence: 0.78
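
To make the P1 finding concrete, here is a minimal sketch of the precedence the review expects: an explicit deny rule from `--policy` should win over a permissive mode such as `--approve-all`. Everything here is hypothetical (`PermissionPolicy`, `decide`, and the tool-kind strings are illustrative), and only the `autoDeny` field is taken from the example in the review; the real acpx policy object may carry more rules. Treating non-read requests as denied under approve-reads is also an assumption for a non-interactive run.

```typescript
type PermissionMode = "deny-all" | "approve-all" | "approve-reads";

// Only autoDeny appears in the review's example; other fields are omitted.
interface PermissionPolicy {
  autoDeny?: string[];
}

type Decision = "approve" | "deny";

function decide(
  mode: PermissionMode,
  policy: PermissionPolicy | undefined,
  toolKind: string,
): Decision {
  // Explicit deny rules take precedence over any permissive mode.
  if (policy?.autoDeny?.includes(toolKind)) return "deny";
  if (mode === "approve-all") return "approve";
  if (mode === "approve-reads") return toolKind === "read" ? "approve" : "deny";
  return "deny"; // deny-all
}

// With the policy threaded through, the caller's deny rule is honored.
const withPolicy = decide("approve-all", { autoDeny: ["execute"] }, "execute");

// Without it (the behavior the review describes), execute is approved.
const withoutPolicy = decide("approve-all", undefined, "execute");
```

The difference between `withPolicy` and `withoutPolicy` is the bypass the finding describes: the mode alone cannot express the caller's deny rule, so the policy has to reach the code that answers permission requests.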

Overall correctness: patch is incorrect
Overall confidence: 0.88

Security concerns:

  • [medium] Explicit permission policy is ignored — src/cli/compare-command.ts:425
    runAgentForCompare calls runOnce without a loaded permissionPolicy while still allowing permissive modes such as --approve-all, so caller-supplied deny/escalate rules can be skipped for the new command.
    Confidence: 0.9

What I checked:

  • PR diff adds new command surface: The branch registers a new top-level compare command and implements its own compare-specific flag parsing/output in src/cli/compare-command.ts. (src/cli/compare-command.ts:529, 14a7ba86bdbe)
  • Current exec path threads global policy into runOnce: On current main, handleExec resolves global flags, loads permissionPolicy, and passes non-interactive permissions, auth policy, terminal, retries, timeout, and session options into runOnce. (src/cli/command-handlers.ts:404, eb132177bd90)
  • Compare omits global policy plumbing: The PR's runAgentForCompare passes config defaults for non-interactive/auth policy and does not pass any loaded permissionPolicy, terminal setting, retries, or session options to runOnce. (src/cli/compare-command.ts:425, 14a7ba86bdbe)
  • Global output contract exists on main: The CLI reference documents --format as the global output mode and --format json as the automation-oriented JSON output contract. (docs/CLI.md:112, eb132177bd90)
  • Compare ignores the shared output selector: The PR emits CompareRow[] only when the command-local --json flag is present, so --format json and config format: json would still render the text table. (src/cli/compare-command.ts:590, 14a7ba86bdbe)
  • PR proof and merge state: The PR body labels the demo as simulated, tests use stub agents, and gh pr view reports mergeable: CONFLICTING. (14a7ba86bdbe)
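
One way to satisfy the output-format finding is a single resolver that compare consults instead of checking only its local flag. This is a sketch with hypothetical names (`FormatInputs`, `wantsJson`), not the shared plumbing on main. The precedence shown (local `--json` alias, then global `--format`, then config `format`) is an assumption that maintainers would need to confirm.

```typescript
// Hypothetical inputs gathered from the three places a format can come from.
interface FormatInputs {
  localJson?: boolean; // command-local --json alias
  globalFormat?: string; // value of the global --format flag, if given
  configFormat?: string; // `format` from the config file, if set
}

// Returns true when compare should emit CompareRow[] as JSON.
function wantsJson(inputs: FormatInputs): boolean {
  // The alias is an explicit request for JSON, so it wins (assumed precedence).
  if (inputs.localJson) return true;
  // An explicit global flag overrides the config file.
  if (inputs.globalFormat !== undefined) return inputs.globalFormat === "json";
  return inputs.configFormat === "json";
}
```

With this shape, `acpx --format json compare ...` and a config `format: "json"` both produce JSON, and `--json` remains available as an alias.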

Likely related people:

  • Alex Knight: Recent current-main work touched CLI routing, session list/resume behavior, and prompt content handling, including custom-agent routing after global flags. (role: recent area contributor; confidence: high; commits: 0907268a37c7, adee5ad7d665, f09933a837de; files: src/cli-core.ts, src/cli/command-handlers.ts, src/cli/session/runtime.ts)
  • Peter Steinberger: Blame for the current top-level verbs, shared command registration, global flags, and exec plumbing points to the release/baseline commit that added these files in the checked-out history. (role: introduced current CLI command surface; confidence: medium; commits: 994fc6c9cf13; files: src/cli-core.ts, src/cli/command-registration.ts, src/cli/command-handlers.ts)
  • Bob: Recent Slophammer work refactored flag parsing and added extensive CLI flag tests around the same global option surface this PR needs to reuse. (role: recent flags and quality-gate contributor; confidence: medium; commits: c26e99c8bcd0; files: src/cli/flags.ts, test/cli-flags.test.ts, slophammer.yml)

Codex review notes: model gpt-5.5, reasoning high; reviewed against eb132177bd90.

@clawsweeper

clawsweeper Bot commented May 21, 2026

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.



1 participant