
bench/swe: empirical bench harness for tuning PasClaw agent settings#313

Open
FMXExpress wants to merge 3 commits into main from claude/swe-bench-harness-v2

@FMXExpress

Owner

Summary

Adds bench/swe/, a self-contained adapter for benchmarking PasClaw's agent loop against SWE-shaped tasks. Same author/judge pattern as bench/locomo/: the eval lives inside this repo, no external services required.

Pure bench infrastructure — no profile or onboarding changes. Those land in a separate follow-up PR.

What's in here

  • bench/swe/harness/provider_stub.py (localhost OpenAI-compat HTTP server with mock/proxy/blocking modes), run.py, score.py, start_cell.sh/finalize_cell.sh/driver_helper.py for live driving, plus probe scripts (probe_first_turn.py, tool_cost.py, turn_growth.py, tool_utilization.py).
  • bench/swe/fixture/01..15 — 15 fixtures covering simple bug fixes, multi-iteration tasks, real-codebase add-a-provider, and capability tests for fs_grep, skills_list/view, web_fetch, memory_search, and the distiller pipeline.
  • bench/swe/ablation.json + bench/swe/results/ablation.md — 21-variant ablation matrix.
  • bench/swe/README.md — full methodology + cross-model shootout results.

The one Pascal change in this PR

src/pkg/providers/PasClaw.Providers.OpenAI.pas: bump PostJSON read timeout 120s → 600s. Discovered while running the long-creative fixture — slow-thinking subagents authoring multi-KB tool calls were taking 130+ seconds, and PasClaw was timing out reads before the response landed. 600s is a ceiling; the read returns as soon as the body arrives, so the happy path is unaffected.
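The "ceiling, not a wait" behaviour of a read timeout can be checked against a throwaway local server. The sketch below is in Python rather than Pascal and does not use PasClaw's PostJSON; `SlowHandler`, `start_slow_server`, `timed_post`, and the 0.3 s delay are all hypothetical names and values chosen for illustration.

```python
import http.server
import threading
import time
import urllib.request


class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for a slow driver: waits `delay_s`, then sends a small JSON body."""

    delay_s = 0.3  # illustrative delay, not a measured value

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        time.sleep(self.delay_s)
        body = b'{"ok": true}'
        try:
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client already gave up on this request

    def log_message(self, *args):
        pass  # keep the demo quiet


def start_slow_server():
    """Serve SlowHandler on an ephemeral localhost port in a background thread."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_address[1]}/"


def timed_post(url, timeout_s):
    """Return (elapsed_seconds, body), with body None if the request timed out."""
    req = urllib.request.Request(
        url, data=b"{}", headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            body = resp.read()
    except OSError:  # covers socket timeouts and urllib.error.URLError
        body = None
    return time.monotonic() - start, body


if __name__ == "__main__":
    server, url = start_slow_server()
    print(timed_post(url, 0.05))  # timeout below the server delay: body is None
    print(timed_post(url, 5.0))   # generous ceiling: returns once the body arrives
    server.shutdown()
```

With a timeout shorter than the server delay the read is abandoned; with a much larger timeout the call still returns as soon as the body is available, which is the property the 600s setting relies on.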

Cumulative findings

After running the same fixture set with Opus 4.8, Sonnet 4.6, and Haiku 4.5 as subagent providers:

model        reliability             max-build vs lean penalty   recommendation
Haiku 4.5    bypassed schema 14/15   3x turns on max-build       lean-edit-shape
Sonnet 4.6   15/15 REAL              none                        lean-stock-shape
Opus 4.8     45+/45+ REAL            none                        lean-stock-shape

Smaller models do WORSE with bigger profiles — they reach for tools they can't author correctly (fs_edit_hashline lured Haiku). Bigger models pick the same tools regardless of profile.

Distiller pipeline (auto skill creation) verified end-to-end on fixture 15 — captures real staged/installed SKILL.md artifacts under $PASCLAW_HOME/workspace/skills/.pending/ (default) or workspace/skills/<name>/ (with auto_approve=true).

Follow-up PR

A separate PR proposes changing PasClaw's TConfig.Create defaults to match lean-edit's shape (drop web_fetch_enabled, vault_tools_enabled, and memory_fetch_enabled from the out-of-box defaults; turn on the 6 free behavioral toggles). It also adds a hashline opt-in question to pasclaw onboard, based on the Haiku finding.

Test plan

  • provider_stub.py smoke (mock + proxy + blocking modes)
  • run.py end-to-end on fixture 01 with mock transcript
  • score.py --mock full sweep
  • probe_first_turn.py against each built-in profile (baseline, stock, low-token, security, max-build, all-on)
  • Live-driven cells on fixtures 01-15 across Opus / Sonnet / Haiku subagent drivers
  • Independent reviewer runs python3 bench/swe/harness/score.py --mock from a fresh checkout

Generated by Claude Code

Adds bench/swe/, a self-contained adapter for benchmarking PasClaw's
agent loop against SWE-shaped tasks. Same author/judge pattern as
bench/locomo/: the eval lives inside this repo, no external services
required for the harness to run end-to-end.

What the harness measures
=========================

The SUBJECT under test is PasClaw's agent loop (system prompt, tool
surface, plan-mode gates, profile defaults, condenser, etc.) -- NOT
the underlying model. The provider is held fixed across the sweep so
any pass-rate delta is attributable to PasClaw settings, not provider
variance.

Three drive modes (provider_stub.py):

  --mock <transcript.jsonl>   replay an offline transcript
  --proxy <upstream_base_url> forward to a real upstream provider
  --blocking <queue_dir>      file FIFO for live human / subagent driving

PasClaw is wired via a one-off config.json that points its OpenAI
provider at the localhost stub -- zero Pascal code changes, just the
existing OpenAI-compat path with a different api_base.
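
As a rough illustration of the mock drive mode, here is a minimal replay stub. It is not the actual provider_stub.py: it assumes each line of the transcript is one complete JSON response body, replays them in order for any POST, and uses hypothetical names (`load_transcript`, `make_mock_handler`, `serve_mock`).

```python
import http.server
import json
import threading


def load_transcript(path):
    """Read a JSONL file into a list of reply objects, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def make_mock_handler(replies):
    """Build a handler class that replays `replies` in order, one per POST."""
    lock = threading.Lock()
    cursor = {"next": 0}

    class MockHandler(http.server.BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)  # request body is ignored in replay mode
            with lock:
                idx = cursor["next"]
                cursor["next"] += 1
            if idx < len(replies):
                status, payload = 200, replies[idx]
            else:
                status, payload = 500, {"error": "mock transcript exhausted"}
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the stub quiet

    return MockHandler


def serve_mock(replies, port=0):
    """Start the stub on localhost and return (server, base_url)."""
    server = http.server.ThreadingHTTPServer(
        ("127.0.0.1", port), make_mock_handler(replies)
    )
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_address[1]}/v1"
```

In this sketch the returned base URL is the value an OpenAI-compatible client would be given as its api_base; running out of transcript lines yields an HTTP 500 so a run fails loudly instead of hanging.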

15 fixtures
===========

  01-04  simple bug fixes (snippet width, shell quoting, count files, yaml)
  07     cross-file grep (capability test for fs_grep)
  08     CLI Centipede game (long creative task)
  09     bash notes CLI (multi-iteration multi-subcommand)
  10     add Cloudflare AI Gateway provider to a real PasClaw checkout
  11     skill discovery (capability test for skills_list / skills_view)
  12     vault lookup (placeholder -- needs reachable vault endpoint)
  13     web context fetch (capability test for web_fetch)
  14     prior-session recall (capability test for memory_search)
  15     auto skill creation via the distiller pipeline

Plus a 21-variant ablation matrix (ablation.json) and per-tool cost
breakdown (tool_cost.py) so anyone can probe how each individual
setting affects the first-turn prompt size without touching the model.

Cross-model shootouts
=====================

Drove the same fixture set with Opus 4.8, Sonnet 4.6, and Haiku 4.5
via subagent providers. Cumulative finding (verifiable from
bench/swe/README.md):

  model class    reliability   max-build penalty    recommendation
  Haiku 4.5      poor          3x turns vs lean     lean-edit
  Sonnet 4.6     rock-solid    none                 lean-stock-shape
  Opus 4.8       rock-solid    none                 lean-stock-shape

Smaller models do WORSE with bigger profiles -- they reach for tools
they can't author correctly (fs_edit_hashline lured Haiku). Sonnet and
Opus pick the same tools in the same order regardless of profile.

Auto skill creation
===================

Fixture 15 captures real end-to-end distiller artifacts:
  - draft staged at $PASCLAW_HOME/workspace/skills/.pending/<id>/
    when auto_approve=false (default)
  - direct install at $PASCLAW_HOME/workspace/skills/<name>/
    when auto_approve=true

Distiller is NOT a max-build-only feature: it's inherited from
lean-stock-shaped settings and present in every lean-* profile, so
the cheap profiles get the auto-skill-creation pipeline without
paying for skills_manage.

OpenAI provider HTTP read timeout
=================================

Also bumps providers/openai PostJSON read timeout 120s -> 600s.
Discovered while running the long-creative fixture: slow-thinking
Claude subagents authoring multi-KB game files were taking 130+
seconds to publish their first reply to the localhost stub, and
PasClaw would time out the read and abort the run before any tool
fired. The 600s ceiling covers slow-think without affecting the
happy path (the read returns as soon as the body lands).

Follow-up PR
============

The findings drive a separate PR proposing TConfig.Create defaults
adopt lean-edit's settings -- the bench-grounded recommendation for
the cheapest profile that doesn't lose pass-rate on every model
class tested. That PR also adds an opt-in hashline question to
`pasclaw onboard` based on the Haiku finding (smaller models
mis-handle fs_edit_hashline when it's advertised).

This PR is bench infrastructure only -- no profile or onboarding
changes here.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 77f20504d0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread bench/swe/harness/run.py
workspace.mkdir(parents=True, exist_ok=True)
stub_log = run_dir / "stub.log"

stage_workspace(fixture, workspace)

P1 Badge Invoke fixture setup hooks in run.py

When using run.py/score.py on setup-backed fixtures (10 real repo, 11 skill, 13 web server, 14 memory), only pre-fix is staged here. Unlike start_cell.sh, no setup.sh hook is executed, so those workspaces are missing the repo snapshot, installed skill/data, spec URL server, or memory file before PasClaw and the oracle run, making full proxy sweeps fail or measure the wrong task setup.
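
One possible shape for the missing step, sketched under assumptions: the hook is a `setup.sh` file directly inside the fixture directory, it is run with bash from the staged workspace, and it receives its paths through environment variables. `run_setup_hook` and the `WORKSPACE` variable are hypothetical; the real contract is whatever start_cell.sh already uses.

```python
import os
import subprocess
from pathlib import Path


def run_setup_hook(fixture: Path, workspace: Path, timeout_s: float = 300.0) -> bool:
    """Run <fixture>/setup.sh inside the staged workspace, if the hook exists.

    Returns True when a hook ran and False when the fixture has none.
    Raises subprocess.CalledProcessError if the hook exits non-zero.
    """
    hook = fixture / "setup.sh"
    if not hook.is_file():
        return False
    env = dict(os.environ)
    env["FIXTURE_DIR"] = str(fixture.resolve())   # assumed variable name
    env["WORKSPACE"] = str(workspace.resolve())   # hypothetical variable name
    subprocess.run(
        ["bash", str(hook.resolve())],
        cwd=workspace,
        env=env,
        check=True,          # a failing hook should fail the cell, not be ignored
        timeout=timeout_s,
    )
    return True
```

Called right after stage_workspace(fixture, workspace), a helper like this would give run.py the same setup behaviour the comment attributes to start_cell.sh.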

Useful? React with 👍 / 👎.

# The agent sees a clean tree (no .git, no build/ untracked).
set -euo pipefail

REPO=/home/user/PasClaw

P2 Badge Derive the fixture-10 repo path instead of hard-coding it

In any checkout that is not exactly /home/user/PasClaw (including this repo at /workspace/PasClaw), fixture 10's setup.sh fails at cd "$REPO" before it can stage the PasClaw snapshot. Since start_cell.sh runs this hook for the real-codebase fixture, live runs of fixture 10 are not portable; derive the repo root from FIXTURE_DIR/the harness path or pass it in from start_cell.sh.
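
A minimal sketch of deriving the root instead of hard-coding it, written in Python on the assumption that the harness computes the path and passes it to the hook. `find_repo_root` and the choice of `bench/swe` as the marker directory are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "bench/swe") -> Optional[Path]:
    """Walk upward from `start` and return the nearest directory containing `marker`.

    Returns None when no ancestor contains the marker directory.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    return None
```

For example, `find_repo_root(Path(__file__))` from inside the harness would resolve to the checkout root wherever the repo is cloned, and that value could then be exported to setup.sh instead of the fixed REPO path.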

Useful? React with 👍 / 👎.

Comment on lines +188 to +189
ap.add_argument("--fixtures", nargs="*",
                default=["fixture/*"])

P2 Badge Exclude fixtures without mocks from mock sweeps

With the documented score.py --mock invocation this default expands to all committed fixtures, but the added fixture tree only includes mock transcripts for 01-04; the remaining fixtures are returned from run_cell as passed: None and are still counted by aggregate, so the offline smoke sweep/frontier is dominated by ERR cells instead of a valid mock benchmark. Filter to mock-backed fixtures or narrow the default when --mock is selected.
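
Both halves of the suggestion can be sketched as follows. This is not score.py's actual code: the transcript filename `mock.jsonl`, the helper names, and the result shape (a dict with a `passed` key) are assumptions made for illustration.

```python
import glob
from pathlib import Path

MOCK_NAME = "mock.jsonl"  # hypothetical transcript filename


def select_fixtures(patterns, mock_only, root="."):
    """Expand fixture glob patterns, keeping only mock-backed ones if asked."""
    found = []
    for pattern in patterns:
        for match in sorted(glob.glob(str(Path(root) / pattern))):
            path = Path(match)
            if not path.is_dir():
                continue
            if mock_only and not (path / MOCK_NAME).is_file():
                continue  # no transcript, so a mock run could only produce ERR
            found.append(path)
    return found


def pass_rate(results):
    """Return (rate, skipped): rate over scored cells, plus the count of None cells."""
    scored = [r for r in results if r["passed"] is not None]
    skipped = len(results) - len(scored)
    rate = sum(1 for r in scored if r["passed"]) / len(scored) if scored else None
    return rate, skipped
```

Filtering at selection time keeps unscored cells out of the sweep, and reporting the skipped count separately keeps any remaining `passed: None` cells from diluting the pass rate.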

Useful? React with 👍 / 👎.

claude added 2 commits June 20, 2026 03:52
The 600s timeout fix was originally bundled here as a "bench
infrastructure prereq" -- but it's a Pascal code change that belongs
in the lean-edit defaults PR (reviewer note: the bench PR should be
bench code only).

The lean-edit-as-stock-defaults branch (PR #314) now carries the
timeout bump alongside its other Pascal-side changes. Running the
bench against a checkout without that fix means slow subagent
drivers may hit the original 120s read timeout, but that's a known
issue documented in bench/swe/README.md -- the bench harness can
still produce useful data on mock and proxy modes, just not on
some live-driven cells with subagents authoring large tool-call
bodies.

Slow-thinking subagents driving the bench can take 2-3 minutes to
emit a reasoning preamble + a multi-KB tool-call body before they
publish the response to the localhost stub. 120s was tight and broke
bench cells (driver took ~130s; PasClaw timed out the read before
the response landed, even though the body was seconds away). 600s
is a ceiling, not a wait -- the read returns as soon as the body
arrives, so the happy path is unaffected.

Carrying this on the bench branch too so a reviewer can run
bench/swe end-to-end without needing the lean-edit-defaults PR
merged first. Same change as PR #314's timeout bump -- if both PRs
land in either order, the second is a no-op for this line.