config: default ToolOutputCap to 24576 (C1) — 6× smaller request bodies#414
Conversation
Codifies the relay-worker methodology that found the
MALFORMED_FUNCTION_CALL death spiral, the workspace/CWD split, and the
goal-drift failure: the bench spawns a real gateway against a throwaway
PASCLAW_HOME, connects as the relay worker, plays a scripted "model", and
asserts on what the LOOP did -- the system prompts it built, what landed
on disk, how big the request bodies grew. Fully reproducible, zero API
tokens, every envelope inspectable.
Pure FPC, no external runtime: TProcess spawns the built binary, the
worker thread holds the SSE poll with the same TIdHTTP + custom-TStream
frame parser shape as PasClaw.Cmd.Relay, and PasClaw.JSON +
PostJSON/GetJSONURL cover the wire. The shipping binary is what gets
exercised.
Three scenarios (16 assertions):
* build-site -- ledger folds into iteration 2 (goal + written
file + checklist), iteration 1 pristine
(prefix-cache), deliverable exists on disk.
* malformed-recovery -- a Gemini-shaped malformed empty turn is retried
with the corrective nudge naming a REGISTERED
tool, and the turn still delivers.
* resume-after-cap -- the 25-iteration notice carries the resume
ledger (do-NOT-redo, file names, read counts);
a "continue" turn anchors its ledger goal to the
original task, not to the word "continue".
Metrics print either way -- provider calls to deliverable and max
request-body bytes across 25 iterations (24 KB baseline: the number the
future context-diet work gets measured against). Harness changes land
with before/after numbers, not vibes.
`make bench-agentloop` (compiles + runs; needs the built binary;
deliberately not part of `make test` since it binds a port).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
An uncapped tool result rides verbatim in history for the rest of the
turn: one fat read_file / grep_files inflated EVERY subsequent request --
the top context-growth driver behind the 125 KB request bodies seen on
real runs (slower turns, higher cost, and outsized histories are what
push Gemini-class models into the malformed-call behaviour in the first
place). ToolOutputCap existed but defaulted to 0.
* TConfig default 0 -> DefaultToolOutputCap (24576 ~ 6K tokens):
generous enough that normal reads pass verbatim, tight enough that a
runaway result can't dominate the turn. tool_output_get always
recovers the full text, and every surface already registers it when
the cap is on (the gates key off Cfg.ToolOutputCap > 0).
* ToJSON now emits tool_output_cap when it differs from the default --
including an explicit 0 opt-out, which must round-trip or the next
LoadConfig would silently re-enable the cap.
* stock profile mirrors the new default; baseline keeps its explicit 0
(the "everything off" A/B floor); low-token 8192 / max-build 16384
stay as their own explicit choices.
* Onboarding: Enter keeps the default, y customises, n disables.
Harness-verified (bench/agentloop, new fat-read scenario: a 100 KB read
at iteration 1, then further iterations):
before (main): fatread.max_request_body_bytes = 129256
after (this): fatread.max_request_body_bytes = 21502 (6x)
plus asserts the capped result carries the tool_output_get handle.
Also: fixes the stale test-session-stats assertion that has been failing
on main ("default is False" -- StatsCollectionEnabled defaulted to True
in #314's six-free-toggles change; the round-trip test now exercises the
explicit-off direction). Includes #413's bench commit: that PR merged
into its stacked base branch (claude/progress-ledger) instead of main --
GitHub only retargets when the base branch is deleted -- so the bench
never actually landed on main; carried here verbatim.
Known follow-up: the TUI's inner loop config doesn't wire ToolOutputCap
(pre-existing; its loop caps at 6 iterations and has its own condensers).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 84dc17b9f4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| RenderMarkdown := True; { on by default for terminal surfaces; cmd/serve flips off } | ||
| SubagentsEnabled := True; { on by default -- a built-in general-purpose subagent makes `spawn` available with no config. } | ||
| ToolOutputCap := 0; { off by default; operators opt in. See TConfig.ToolOutputCap. } | ||
| ToolOutputCap := DefaultToolOutputCap; { on by default -- an uncapped tool result rides in history for the whole turn and was the top context-growth driver (bench/agentloop: 100 KB read -> >100 KB request bodies). tool_output_get recovers the full text; operators opt out with tool_output_cap: 0. } |
There was a problem hiding this comment.
Thread the default cap into TUI loops
With this default flipped on, fresh TUI sessions still do not apply the cap: both TTUI.StartTurn and the line-based HandleUserInput build TToolLoopConfig from Default(TToolLoopConfig) and never assign ToolOutputCap, so RunToolLoop sees 0 and sends oversized tool output verbatim. I checked those TUI loop-construction paths, while Cmd.TUI now registers tool_output_get from the config, so the TUI advertises the recovery tool but the new default protection is not actually enforced on that chat surface.
Useful? React with 👍 / 👎.
| for i := 1 to GetEnvironmentVariableCount do | ||
| GW.Environment.Add(GetEnvironmentString(i)); | ||
| GW.Environment.Add('PASCLAW_HOME=' + GHomeDir); |
There was a problem hiding this comment.
Override PASCLAW_HOME instead of appending duplicates
When the caller already has PASCLAW_HOME exported, this copies that entry into the child environment and then appends a second PASCLAW_HOME; duplicate environment names are not a reliable override, and typical getenv resolution will see the earlier value. In that environment the spawned gateway can ignore the temp relay config and run the benchmark against the user's live PasClaw home/provider/workspace, causing false failures or unintended writes; remove/replace existing PASCLAW_HOME/PASCLAW_CONFIG entries before Execute.
Useful? React with 👍 / 👎.
Two #414 review findings: * The TUI's two loop-config builders never set ToolOutputCap, so with the default flipped on, TUI turns still sent oversized tool output verbatim -- while Cmd.TUI dutifully registered tool_output_get. Both builders now take the cap from RouteConfig, the session- lifetime config snapshot the auto-router already uses. * The bench copied the caller's environment and then APPENDED its PASCLAW_HOME -- duplicate names are not a reliable override, so a caller with PASCLAW_HOME exported could have the spawned gateway run against their LIVE home instead of the throwaway bench home. The copy now skips PASCLAW_HOME / PASCLAW_CONFIG / NO_COLOR before the bench's values are added. Verified: `PASCLAW_HOME=/tmp/decoy make bench-agentloop` passes and the decoy is never created. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
First of the five queued harness-effectiveness PRs (C1 → B1 → B2 → C2+C3 → D1), each verified with before/after numbers from
bench/agentloop.Why
An uncapped tool result rides verbatim in history for the rest of the turn: one fat
read_file/grep_filesinflates every subsequent request — the top context-growth driver behind the 125 KB request bodies seen on real runs. Oversized histories are also what push Gemini-class models into the malformed-call behaviour in the first place (#406 made recovery work; this attacks the trigger).ToolOutputCapexisted but defaulted to 0.What
0→DefaultToolOutputCap(24576 ≈ 6K tokens). Normal reads pass verbatim; a runaway result becomes head + tail + atool_output_gethandle (full text always recoverable). Every surface already registers the recovery tool when the cap is on.ToJSONemitstool_output_capwhen ≠ default — including an explicit0opt-out, which must round-trip or the next LoadConfig would silently re-enable the cap.stockmirrors the new default;baselinekeeps its explicit 0 (the "everything off" A/B floor);low-token8192 /max-build16384 stay as explicit choices.ycustomises the byte count,ndisables.Harness-verified (the point of this series)
New
fat-readbench scenario: a 100 KBread_fileat iteration 1, then further iterations — measuring the max request body the loop builds:fatread.max_request_body_bytes6× smaller, and the scenario asserts the capped result carries the
tool_output_getretrieval handle. All prior scenarios still green (20 assertions total).Also
test-session-statsassertion that has been failing on main ("default is False" —StatsCollectionEnabledflipped to True in config: adopt lean-edit shape for stock defaults + hashline opt-in onboarding #314; the round-trip test now exercises the explicit-off direction).ToolOutputCap.🤖 Generated with Claude Code
https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
Generated by Claude Code