bench/swe: empirical bench harness for tuning PasClaw agent settings #313
FMXExpress wants to merge 3 commits
Adds bench/swe/, a self-contained adapter for benchmarking PasClaw's
agent loop against SWE-shaped tasks. Same author/judge pattern as
bench/locomo/: the eval lives inside this repo, no external services
required for the harness to run end-to-end.
What the harness measures
=========================
The SUBJECT under test is PasClaw's agent loop (system prompt, tool
surface, plan-mode gates, profile defaults, condenser, etc.) -- NOT
the underlying model. The provider is held fixed across the sweep so
any pass-rate delta is attributable to PasClaw settings, not provider
variance.
Three drive modes (provider_stub.py):
  --mock <transcript.jsonl>       replay an offline transcript
  --proxy <upstream_base_url>     forward to a real upstream provider
  --blocking <queue_dir>          file FIFO for live human / subagent driving
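A minimal sketch of what the --mock mode amounts to. It assumes each non-empty line of the transcript is one JSON object holding a complete provider response; the class name `MockReplayer` and that one-response-per-line layout are illustrative assumptions, not the actual provider_stub.py implementation.

```python
import json
from pathlib import Path


class MockReplayer:
    """Replays canned provider responses from a JSONL transcript, one per request."""

    def __init__(self, transcript_path):
        text = Path(transcript_path).read_text(encoding="utf-8")
        # Assumption: one JSON object per non-empty line.
        self._responses = [
            json.loads(line) for line in text.splitlines() if line.strip()
        ]
        self._cursor = 0

    def next_response(self):
        """Return the next canned response; fail loudly when the transcript runs out."""
        if self._cursor >= len(self._responses):
            raise IndexError("mock transcript exhausted")
        response = self._responses[self._cursor]
        self._cursor += 1
        return response
```

Failing loudly on an exhausted transcript, rather than looping back to the start, keeps a run that takes more turns than the recording from silently passing.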
PasClaw is wired via a one-off config.json that points its OpenAI
provider at the localhost stub -- zero Pascal code changes, just the
existing OpenAI-compat path with a different api_base.
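The wiring could be sketched as below. Only `api_base` is named in this description; the surrounding key layout (`providers`, `openai`, `api_key`, `model`), the port, and the helper name `write_stub_config` are hypothetical placeholders, so check the real config schema before copying.

```python
import json
from pathlib import Path


def write_stub_config(path, stub_port=8765, model="stub-model"):
    """Write a throwaway config whose OpenAI provider targets the localhost stub."""
    config = {
        # Hypothetical layout: only `api_base` is confirmed by the PR text.
        "providers": {
            "openai": {
                "api_base": f"http://127.0.0.1:{stub_port}/v1",
                "api_key": "unused-by-stub",
                "model": model,
            }
        }
    }
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```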
15 fixtures
===========
01-04 simple bug fixes (snippet width, shell quoting, count files, yaml)
07 cross-file grep (capability test for fs_grep)
08 CLI Centipede game (long creative task)
09 bash notes CLI (multi-iteration multi-subcommand)
10 add Cloudflare AI Gateway provider to a real PasClaw checkout
11 skill discovery (capability test for skills_list / skills_view)
12 vault lookup (placeholder -- needs reachable vault endpoint)
13 web context fetch (capability test for web_fetch)
14 prior-session recall (capability test for memory_search)
15 auto skill creation via the distiller pipeline
Plus a 21-variant ablation matrix (ablation.json) and per-tool cost
breakdown (tool_cost.py) so anyone can probe how each individual
setting affects the first-turn prompt size without touching the model.
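One way to picture an ablation matrix is as a baseline with one toggle flipped per variant. The sketch below assumes that one-at-a-time shape; the schema of ablation.json is not shown here, and the function name is hypothetical.

```python
def one_at_a_time_variants(baseline):
    """Return (name, settings) pairs, each flipping exactly one boolean in baseline."""
    variants = []
    for key, value in baseline.items():
        if isinstance(value, bool):
            flipped = dict(baseline)
            flipped[key] = not value
            state = "on" if flipped[key] else "off"
            variants.append((f"{key}={state}", flipped))
    return variants
```

Because each variant differs from the baseline in a single setting, any change in first-turn prompt size can be attributed to that setting alone.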
Cross-model shootouts
=====================
Drove the same fixture set with Opus 4.8, Sonnet 4.6, and Haiku 4.5
via subagent providers. Cumulative finding (verifiable from
bench/swe/README.md):
model class   reliability   max-build penalty   recommendation
Haiku 4.5     poor          3x turns vs lean    lean-edit
Sonnet 4.6    rock-solid    none                lean-stock-shape
Opus 4.8      rock-solid    none                lean-stock-shape
Smaller models do WORSE with bigger profiles -- they reach for tools
they can't author correctly (fs_edit_hashline lured Haiku). Sonnet and
Opus pick the same tools in the same order regardless of profile.
Auto skill creation
===================
Fixture 15 captures real end-to-end distiller artifacts:
- draft staged at $PASCLAW_HOME/workspace/skills/.pending/<id>/
when auto_approve=false (default)
- direct install at $PASCLAW_HOME/workspace/skills/<name>/
when auto_approve=true
Distiller is NOT a max-build-only feature: it's inherited from
lean-stock-shaped settings and present in every lean-* profile, so
the cheap profiles get the auto-skill-creation pipeline without
paying for skills_manage.
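The two destinations described above can be expressed as a small path helper, for example when an oracle needs to know where to look for the artifact. The function name and parameters are hypothetical; only the two directory layouts come from the text.

```python
from pathlib import Path


def distilled_skill_dir(pasclaw_home, draft_id, skill_name, auto_approve=False):
    """Return where a distilled skill is expected to land under PASCLAW_HOME."""
    skills = Path(pasclaw_home) / "workspace" / "skills"
    if auto_approve:
        # Direct install: workspace/skills/<name>/
        return skills / skill_name
    # Default: staged draft under workspace/skills/.pending/<id>/
    return skills / ".pending" / draft_id
```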
OpenAI provider HTTP read timeout
=================================
Also bumps providers/openai PostJSON read timeout 120s -> 600s.
Discovered while running the long-creative fixture: slow-thinking
Claude subagents authoring multi-KB game files were taking 130+
seconds to publish their first reply to the localhost stub, and
PasClaw would time out the read and abort the run before any tool
fired. The 600s ceiling covers slow-think without affecting the
happy path (the read returns as soon as the body lands).
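The "ceiling, not a wait" behaviour is a general property of read timeouts. It is illustrated here in Python with `urllib` against a deliberately slow local server, not with PasClaw's Pascal PostJSON; the handler, its 0.3s delay, and the helper name are assumptions made for the demonstration.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    delay_seconds = 0.3  # stand-in for a slow-thinking driver

    def do_GET(self):
        time.sleep(self.delay_seconds)
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def timed_fetch(timeout_seconds):
    """Return (body, elapsed_seconds) for one request against a slow local server."""
    server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            body = resp.read()
        return body, time.monotonic() - start
    finally:
        server.shutdown()
        server.server_close()
```

Calling `timed_fetch(600)` returns shortly after the server's delay elapses, not after 600 seconds: the timeout only bounds how long the client is willing to wait.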
Follow-up PR
============
The findings drive a separate PR proposing that the TConfig.Create
defaults adopt lean-edit's settings -- the bench-grounded
recommendation for the cheapest profile that loses no pass-rate on
any model class tested. That PR also adds an opt-in hashline
question to `pasclaw onboard` based on the Haiku finding (smaller
models mis-handle fs_edit_hashline when it's advertised).
This PR is bench infrastructure only -- no profile or onboarding
changes here.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 77f20504d0
    workspace.mkdir(parents=True, exist_ok=True)
    stub_log = run_dir / "stub.log"

    stage_workspace(fixture, workspace)
Invoke fixture setup hooks in run.py
When using run.py/score.py on setup-backed fixtures (10 real repo, 11 skill, 13 web server, 14 memory), only pre-fix is staged here. Unlike start_cell.sh, no setup.sh hook is executed, so those workspaces are missing the repo snapshot, installed skill/data, spec URL server, or memory file before PasClaw and the oracle run, making full proxy sweeps fail or measure the wrong task setup.
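A hedged sketch of how run.py might invoke the hook after staging. The helper name `run_setup_hook`, running the script with the workspace as its working directory, and exporting `FIXTURE_DIR` to it are assumptions about what start_cell.sh does, not confirmed behaviour.

```python
import os
import subprocess
from pathlib import Path


def run_setup_hook(fixture_dir, workspace):
    """Run fixture_dir/setup.sh inside workspace if the fixture ships one.

    Returns True when a hook ran, False when the fixture has none.
    """
    hook = Path(fixture_dir) / "setup.sh"
    if not hook.is_file():
        return False
    # Assumption: the hook reads FIXTURE_DIR and writes into the current directory.
    env = dict(os.environ, FIXTURE_DIR=str(Path(fixture_dir).resolve()))
    subprocess.run(
        ["bash", str(hook.resolve())],
        cwd=str(workspace),
        env=env,
        check=True,  # a failing hook should abort the cell, not score it
    )
    return True
```

Called right after `stage_workspace(fixture, workspace)`, this would give run.py and score.py the same pre-run state that live cells get.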
    # The agent sees a clean tree (no .git, no build/ untracked).
    set -euo pipefail

    REPO=/home/user/PasClaw
Derive the fixture-10 repo path instead of hard-coding it
In any checkout that is not exactly /home/user/PasClaw (including this repo at /workspace/PasClaw), fixture 10's setup.sh fails at cd "$REPO" before it can stage the PasClaw snapshot. Since start_cell.sh runs this hook for the real-codebase fixture, live runs of fixture 10 are not portable; derive the repo root from FIXTURE_DIR/the harness path or pass it in from start_cell.sh.
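setup.sh itself is a shell script, but the same idea can be shown on the harness side, which could then pass the result to the hook. The sketch below walks upward from the fixture directory until it finds an ancestor containing `bench/swe`; the function name and the choice of marker are assumptions.

```python
from pathlib import Path


def find_repo_root(start, marker="bench/swe"):
    """Walk upward from start until a directory containing marker is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no ancestor of {start} contains {marker}")
```

For example, `find_repo_root("/workspace/PasClaw/bench/swe/fixture/10")` would resolve to `/workspace/PasClaw` in that checkout, with no path baked into the fixture.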
    ap.add_argument("--fixtures", nargs="*",
                    default=["fixture/*"])
Exclude fixtures without mocks from mock sweeps
With the documented score.py --mock invocation this default expands to all committed fixtures, but the added fixture tree only includes mock transcripts for 01-04; the remaining fixtures are returned from run_cell as passed: None and are still counted by aggregate, so the offline smoke sweep/frontier is dominated by ERR cells instead of a valid mock benchmark. Filter to mock-backed fixtures or narrow the default when --mock is selected.
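Both halves of this suggestion can be sketched briefly: select only fixtures that ship a transcript, and keep unscored cells out of the pass rate. The transcript filename `mock.jsonl` and both function names are hypothetical; `passed: None` is the value the review comment describes.

```python
from pathlib import Path

MOCK_NAME = "mock.jsonl"  # hypothetical transcript filename


def mock_backed_fixtures(fixture_root):
    """Return fixture directories that contain a mock transcript, sorted by name."""
    return sorted(
        p
        for p in Path(fixture_root).iterdir()
        if p.is_dir() and (p / MOCK_NAME).is_file()
    )


def pass_rate(results):
    """Pass rate over cells that produced a verdict; passed=None cells are skipped."""
    scored = [r for r in results if r.get("passed") is not None]
    if not scored:
        return None
    return sum(1 for r in scored if r["passed"]) / len(scored)
```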
The 600s timeout fix was originally bundled here as a "bench infrastructure prereq", but it is a Pascal code change that belongs in the lean-edit defaults PR (reviewer note: the bench PR should be bench code only). The lean-edit-as-stock-defaults branch (PR #314) now carries the timeout bump alongside its other Pascal-side changes. Against a checkout without that fix, slow subagent drivers may hit the original 120s read timeout; this is a known issue documented in bench/swe/README.md. The harness still produces useful data in mock and proxy modes, just not on some live-driven cells where subagents author large tool-call bodies.
Slow-thinking subagents driving the bench can take 2-3 minutes to emit a reasoning preamble + a multi-KB tool-call body before they publish the response to the localhost stub. 120s was tight and broke bench cells (driver took ~130s; PasClaw timed out the read before the response landed, even though the body was seconds away). 600s is a ceiling, not a wait -- the read returns as soon as the body arrives, so the happy path is unaffected. Carrying this on the bench branch too so a reviewer can run bench/swe end-to-end without needing the lean-edit-defaults PR merged first. Same change as PR #314's timeout bump -- if both PRs land in either order, the second is a no-op for this line.
Summary
Adds `bench/swe/`, a self-contained adapter for benchmarking PasClaw's agent loop against SWE-shaped tasks. Same author/judge pattern as `bench/locomo/`: the eval lives inside this repo, no external services required.

Pure bench infrastructure: no profile or onboarding changes. Those land in a separate follow-up PR.

What's in here
- `bench/swe/harness/`: `provider_stub.py` (localhost OpenAI-compat HTTP server with mock/proxy/blocking modes), `run.py`, `score.py`, `start_cell.sh` / `finalize_cell.sh` / `driver_helper.py` for live driving, plus probe scripts (`probe_first_turn.py`, `tool_cost.py`, `turn_growth.py`, `tool_utilization.py`).
- `bench/swe/fixture/01..15`: 15 fixtures covering simple bug fixes, multi-iteration tasks, real-codebase add-a-provider, and capability tests for `fs_grep`, `skills_list`/`view`, `web_fetch`, `memory_search`, and the distiller pipeline.
- `bench/swe/ablation.json` + `bench/swe/results/ablation.md`: 21-variant ablation matrix.
- `bench/swe/README.md`: full methodology + cross-model shootout results.

The one Pascal change in this PR
`src/pkg/providers/PasClaw.Providers.OpenAI.pas`: bump `PostJSON` read timeout 120s → 600s. Discovered while running the long-creative fixture: slow-thinking subagents authoring multi-KB tool calls were taking 130+ seconds, and PasClaw was timing out reads before the response landed. 600s is a ceiling; the read returns as soon as the body arrives, so the happy path is unaffected.

Cumulative findings
After running the same fixture set with Opus 4.8, Sonnet 4.6, and Haiku 4.5 as subagent providers, the recommended profile per model is:
- Haiku 4.5: `lean-edit-shape`
- Sonnet 4.6: `lean-stock-shape`
- Opus 4.8: `lean-stock-shape`

Smaller models do WORSE with bigger profiles: they reach for tools they can't author correctly (`fs_edit_hashline` lured Haiku). Bigger models pick the same tools regardless of profile.

Distiller pipeline (auto skill creation) verified end-to-end on fixture 15: it captures real staged/installed `SKILL.md` artifacts under `$PASCLAW_HOME/workspace/skills/.pending/` (default) or `workspace/skills/<name>/` (with `auto_approve=true`).

Follow-up PR
A separate PR proposes changing PasClaw's `TConfig.Create` defaults to match `lean-edit`'s shape (drop `web_fetch_enabled`, `vault_tools_enabled`, `memory_fetch_enabled` from out-of-box; turn on the 6 free behavioral toggles) and adds a hashline opt-in question to `pasclaw onboard` based on the Haiku finding.

Test plan
- `provider_stub.py` smoke (mock + proxy + blocking modes)
- `run.py` end-to-end on fixture 01 with mock transcript
- `score.py --mock` full sweep
- `probe_first_turn.py` against each built-in profile (`baseline`, `stock`, `low-token`, `security`, `max-build`, `all-on`)
- `python3 bench/swe/harness/score.py --mock` from a fresh checkout

Generated by Claude Code