bench/swe: empirical bench harness for tuning PasClaw agent settings by FMXExpress · Pull Request #313 · FMXExpress/PasClaw

FMXExpress · 2026-06-20T03:16:22Z

Summary

Adds bench/swe/, a self-contained adapter for benchmarking PasClaw's agent loop against SWE-shaped tasks. Same author/judge pattern as bench/locomo/: the eval lives inside this repo, no external services required.

Pure bench infrastructure — no profile or onboarding changes. Those land in a separate follow-up PR.

What's in here

bench/swe/harness/ — provider_stub.py (localhost OpenAI-compat HTTP server with mock/proxy/blocking modes), run.py, score.py, start_cell.sh/finalize_cell.sh/driver_helper.py for live driving, plus probe scripts (probe_first_turn.py, tool_cost.py, turn_growth.py, tool_utilization.py).
bench/swe/fixture/01..15 — 15 fixtures covering simple bug fixes, multi-iteration tasks, real-codebase add-a-provider, and capability tests for fs_grep, skills_list/view, web_fetch, memory_search, and the distiller pipeline.
bench/swe/ablation.json + bench/swe/results/ablation.md — 21-variant ablation matrix.
bench/swe/README.md — full methodology + cross-model shootout results.

The one Pascal change in this PR

src/pkg/providers/PasClaw.Providers.OpenAI.pas: bump PostJSON read timeout 120s → 600s. Discovered while running the long-creative fixture — slow-thinking subagents authoring multi-KB tool calls were taking 130+ seconds, and PasClaw was timing out reads before the response landed. 600s is a ceiling; the read returns as soon as the body arrives, so the happy path is unaffected.

Cumulative findings

After running the same fixture set with Opus 4.8, Sonnet 4.6, and Haiku 4.5 as subagent providers:

model	reliability	max-build vs lean penalty	recommendation
Haiku 4.5	bypassed schema 14/15	3x turns on max-build	`lean-edit`-shape
Sonnet 4.6	15/15 REAL	none	`lean-stock`-shape
Opus 4.8	45+/45+ REAL	none	`lean-stock`-shape

Smaller models do WORSE with bigger profiles — they reach for tools they can't author correctly (fs_edit_hashline lured Haiku). Bigger models pick the same tools regardless of profile.

Distiller pipeline (auto skill creation) verified end-to-end on fixture 15 — captures real staged/installed SKILL.md artifacts under $PASCLAW_HOME/workspace/skills/.pending/ (default) or workspace/skills/<name>/ (with auto_approve=true).

Follow-up PR

A separate PR proposes changing PasClaw's TConfig.Create defaults to match lean-edit's shape (drop web_fetch_enabled, vault_tools_enabled, memory_fetch_enabled from out-of-box; turn on the 6 free behavioral toggles) and adds a hashline opt-in question to pasclaw onboard based on the Haiku finding.

Test plan

provider_stub.py smoke (mock + proxy + blocking modes)
run.py end-to-end on fixture 01 with mock transcript
score.py --mock full sweep
probe_first_turn.py against each built-in profile (baseline, stock, low-token, security, max-build, all-on)
Live-driven cells on fixtures 01-15 across Opus / Sonnet / Haiku subagent drivers
Independent reviewer runs python3 bench/swe/harness/score.py --mock from a fresh checkout

Generated by Claude Code

Adds bench/swe/, a self-contained adapter for benchmarking PasClaw's agent loop against SWE-shaped tasks. Same author/judge pattern as bench/locomo/: the eval lives inside this repo, no external services required for the harness to run end-to-end. What the harness measures ========================= The SUBJECT under test is PasClaw's agent loop (system prompt, tool surface, plan-mode gates, profile defaults, condenser, etc.) -- NOT the underlying model. The provider is held fixed across the sweep so any pass-rate delta is attributable to PasClaw settings, not provider variance. Three drive modes (provider_stub.py): --mock <transcript.jsonl> replay an offline transcript --proxy <upstream_base_url> forward to a real upstream provider --blocking <queue_dir> file FIFO for live human / subagent driving PasClaw is wired via a one-off config.json that points its OpenAI provider at the localhost stub -- zero Pascal code changes, just the existing OpenAI-compat path with a different api_base. 15 fixtures =========== 01-04 simple bug fixes (snippet width, shell quoting, count files, yaml) 07 cross-file grep (capability test for fs_grep) 08 CLI Centipede game (long creative task) 09 bash notes CLI (multi-iteration multi-subcommand) 10 add Cloudflare AI Gateway provider to a real PasClaw checkout 11 skill discovery (capability test for skills_list / skills_view) 12 vault lookup (placeholder -- needs reachable vault endpoint) 13 web context fetch (capability test for web_fetch) 14 prior-session recall (capability test for memory_search) 15 auto skill creation via the distiller pipeline Plus a 21-variant ablation matrix (ablation.json) and per-tool cost breakdown (tool_cost.py) so anyone can probe how each individual setting affects the first-turn prompt size without touching the model. Cross-model shootouts ===================== Drove the same fixture set with Opus 4.8, Sonnet 4.6, and Haiku 4.5 via subagent providers. Cumulative finding (verifiable from bench/swe/README.md): model class reliability max-build penalty recommendation Haiku 4.5 poor 3x turns vs lean lean-edit Sonnet 4.6 rock-solid none lean-stock-shape Opus 4.8 rock-solid none lean-stock-shape Smaller models do WORSE with bigger profiles -- they reach for tools they can't author correctly (fs_edit_hashline lured Haiku). Sonnet and Opus pick the same tools in the same order regardless of profile. Auto skill creation =================== Fixture 15 captures real end-to-end distiller artifacts: - draft staged at $PASCLAW_HOME/workspace/skills/.pending/<id>/ when auto_approve=false (default) - direct install at $PASCLAW_HOME/workspace/skills/<name>/ when auto_approve=true Distiller is NOT a max-build-only feature: it's inherited from lean-stock-shaped settings and present in every lean-* profile, so the cheap profiles get the auto-skill-creation pipeline without paying for skills_manage. OpenAI provider HTTP read timeout ================================= Also bumps providers/openai PostJSON read timeout 120s -> 600s. Discovered while running the long-creative fixture: slow-thinking Claude subagents authoring multi-KB game files were taking 130+ seconds to publish their first reply to the localhost stub, and PasClaw would time out the read and abort the run before any tool fired. The 600s ceiling covers slow-think without affecting the happy path (the read returns as soon as the body lands). Follow-up PR ============ The findings drive a separate PR proposing TConfig.Create defaults adopt lean-edit's settings -- the bench-grounded recommendation for the cheapest profile that doesn't lose pass-rate on every model class tested. That PR also adds an opt-in hashline question to `pasclaw onboard` based on the Haiku finding (smaller models mis-handle fs_edit_hashline when it's advertised). This PR is bench infrastructure only -- no profile or onboarding changes here.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 77f20504d0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-20T03:19:35Z

+    workspace.mkdir(parents=True, exist_ok=True)
+    stub_log = run_dir / "stub.log"
+
+    stage_workspace(fixture, workspace)


Invoke fixture setup hooks in run.py

When using run.py/score.py on setup-backed fixtures (10 real repo, 11 skill, 13 web server, 14 memory), only pre-fix is staged here. Unlike start_cell.sh, no setup.sh hook is executed, so those workspaces are missing the repo snapshot, installed skill/data, spec URL server, or memory file before PasClaw and the oracle run, making full proxy sweeps fail or measure the wrong task setup.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-20T03:19:35Z

+# The agent sees a clean tree (no .git, no build/ untracked).
+set -euo pipefail
+
+REPO=/home/user/PasClaw


Derive the fixture-10 repo path instead of hard-coding it

In any checkout that is not exactly /home/user/PasClaw (including this repo at /workspace/PasClaw), fixture 10's setup.sh fails at cd "$REPO" before it can stage the PasClaw snapshot. Since start_cell.sh runs this hook for the real-codebase fixture, live runs of fixture 10 are not portable; derive the repo root from FIXTURE_DIR/the harness path or pass it in from start_cell.sh.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-20T03:19:35Z

+    ap.add_argument("--fixtures", nargs="*",
+                    default=["fixture/*"])


Exclude fixtures without mocks from mock sweeps

With the documented score.py --mock invocation this default expands to all committed fixtures, but the added fixture tree only includes mock transcripts for 01-04; the remaining fixtures are returned from run_cell as passed: None and are still counted by aggregate, so the offline smoke sweep/frontier is dominated by ERR cells instead of a valid mock benchmark. Filter to mock-backed fixtures or narrow the default when --mock is selected.

Useful? React with 👍 / 👎.

The 600s timeout fix was originally bundled here as a "bench infrastructure prereq" -- but it's a Pascal code change that belongs in the lean-edit defaults PR (reviewer note: the bench PR should be bench code only). The lean-edit-as-stock-defaults branch (PR #314) now carries the timeout bump alongside its other Pascal-side changes. Running the bench against a checkout without that fix means slow subagent drivers may hit the original 120s read timeout, but that's a known issue documented in bench/swe/README.md -- the bench harness can still produce useful data on mock and proxy modes, just not on some live-driven cells with subagents authoring large tool-call bodies.

Slow-thinking subagents driving the bench can take 2-3 minutes to emit a reasoning preamble + a multi-KB tool-call body before they publish the response to the localhost stub. 120s was tight and broke bench cells (driver took ~130s; PasClaw timed out the read before the response landed, even though the body was seconds away). 600s is a ceiling, not a wait -- the read returns as soon as the body arrives, so the happy path is unaffected. Carrying this on the bench branch too so a reviewer can run bench/swe end-to-end without needing the lean-edit-defaults PR merged first. Same change as PR #314's timeout bump -- if both PRs land in either order, the second is a no-op for this line.

chatgpt-codex-connector Bot reviewed Jun 20, 2026

View reviewed changes

FMXExpress mentioned this pull request Jun 20, 2026

config: adopt lean-edit shape for stock defaults + hashline opt-in onboarding #314

Merged

6 tasks

claude added 2 commits June 20, 2026 03:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bench/swe: empirical bench harness for tuning PasClaw agent settings#313

bench/swe: empirical bench harness for tuning PasClaw agent settings#313
FMXExpress wants to merge 3 commits into
mainfrom
claude/swe-bench-harness-v2

FMXExpress commented Jun 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 20, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 20, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		ap.add_argument("--fixtures", nargs="*",
		default=["fixture/*"])

Conversation

FMXExpress commented Jun 20, 2026

Summary

What's in here

The one Pascal change in this PR

Cumulative findings

Follow-up PR

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants