Skip to content

config: default ToolOutputCap to 24576 (C1) — 6× smaller request bodies#414

Merged
FMXExpress merged 3 commits into
mainfrom
claude/tool-output-cap-default
Jul 2, 2026
Merged

config: default ToolOutputCap to 24576 (C1) — 6× smaller request bodies#414
FMXExpress merged 3 commits into
mainfrom
claude/tool-output-cap-default

Conversation

@FMXExpress

Copy link
Copy Markdown
Owner

First of the five queued harness-effectiveness PRs (C1 → B1 → B2 → C2+C3 → D1), each verified with before/after numbers from bench/agentloop.

⚠️ Also carries #413's bench commit. #413 merged into its stacked base branch (claude/progress-ledger) instead of main — GitHub only retargets a PR when its base branch is deleted — so the bench never actually landed on main. It's cherry-picked here verbatim as the first commit.

Why

An uncapped tool result rides verbatim in history for the rest of the turn: one fat read_file/grep_files inflates every subsequent request — the top context-growth driver behind the 125 KB request bodies seen on real runs. Oversized histories are also what push Gemini-class models into the malformed-call behaviour in the first place (#406 made recovery work; this attacks the trigger). ToolOutputCap existed but defaulted to 0.

What

  • Default 0DefaultToolOutputCap (24576 ≈ 6K tokens). Normal reads pass verbatim; a runaway result becomes head + tail + a tool_output_get handle (full text always recoverable). Every surface already registers the recovery tool when the cap is on.
  • ToJSON emits tool_output_cap when ≠ default — including an explicit 0 opt-out, which must round-trip or the next LoadConfig would silently re-enable the cap.
  • Profiles: stock mirrors the new default; baseline keeps its explicit 0 (the "everything off" A/B floor); low-token 8192 / max-build 16384 stay as explicit choices.
  • Onboarding: Enter keeps the default, y customises the byte count, n disables.

Harness-verified (the point of this series)

New fat-read bench scenario: a 100 KB read_file at iteration 1, then further iterations — measuring the max request body the loop builds:

fatread.max_request_body_bytes
before (main binary) 129,256
after (this PR) 21,502

6× smaller, and the scenario asserts the capped result carries the tool_output_get retrieval handle. All prior scenarios still green (20 assertions total).

Also

🤖 Generated with Claude Code

https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6


Generated by Claude Code

claude added 2 commits July 2, 2026 06:39
Codifies the relay-worker methodology that found the
MALFORMED_FUNCTION_CALL death spiral, the workspace/CWD split, and the
goal-drift failure: the bench spawns a real gateway against a throwaway
PASCLAW_HOME, connects as the relay worker, plays a scripted "model", and
asserts on what the LOOP did -- the system prompts it built, what landed
on disk, how big the request bodies grew. Fully reproducible, zero API
tokens, every envelope inspectable.

Pure FPC, no external runtime: TProcess spawns the built binary, the
worker thread holds the SSE poll with the same TIdHTTP + custom-TStream
frame parser shape as PasClaw.Cmd.Relay, and PasClaw.JSON +
PostJSON/GetJSONURL cover the wire. The shipping binary is what gets
exercised.

Three scenarios (16 assertions):
  * build-site         -- ledger folds into iteration 2 (goal + written
                          file + checklist), iteration 1 pristine
                          (prefix-cache), deliverable exists on disk.
  * malformed-recovery -- a Gemini-shaped malformed empty turn is retried
                          with the corrective nudge naming a REGISTERED
                          tool, and the turn still delivers.
  * resume-after-cap   -- the 25-iteration notice carries the resume
                          ledger (do-NOT-redo, file names, read counts);
                          a "continue" turn anchors its ledger goal to the
                          original task, not to the word "continue".

Metrics print either way -- provider calls to deliverable and max
request-body bytes across 25 iterations (24 KB baseline: the number the
future context-diet work gets measured against). Harness changes land
with before/after numbers, not vibes.

`make bench-agentloop` (compiles + runs; needs the built binary;
deliberately not part of `make test` since it binds a port).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
An uncapped tool result rides verbatim in history for the rest of the
turn: one fat read_file / grep_files inflated EVERY subsequent request --
the top context-growth driver behind the 125 KB request bodies seen on
real runs (slower turns, higher cost, and outsized histories are what
push Gemini-class models into the malformed-call behaviour in the first
place). ToolOutputCap existed but defaulted to 0.

  * TConfig default 0 -> DefaultToolOutputCap (24576 ~ 6K tokens):
    generous enough that normal reads pass verbatim, tight enough that a
    runaway result can't dominate the turn. tool_output_get always
    recovers the full text, and every surface already registers it when
    the cap is on (the gates key off Cfg.ToolOutputCap > 0).
  * ToJSON now emits tool_output_cap when it differs from the default --
    including an explicit 0 opt-out, which must round-trip or the next
    LoadConfig would silently re-enable the cap.
  * stock profile mirrors the new default; baseline keeps its explicit 0
    (the "everything off" A/B floor); low-token 8192 / max-build 16384
    stay as their own explicit choices.
  * Onboarding: Enter keeps the default, y customises, n disables.

Harness-verified (bench/agentloop, new fat-read scenario: a 100 KB read
at iteration 1, then further iterations):
    before (main):  fatread.max_request_body_bytes = 129256
    after  (this):  fatread.max_request_body_bytes =  21502   (6x)
plus asserts the capped result carries the tool_output_get handle.

Also: fixes the stale test-session-stats assertion that has been failing
on main ("default is False" -- StatsCollectionEnabled defaulted to True
in #314's six-free-toggles change; the round-trip test now exercises the
explicit-off direction). Includes #413's bench commit: that PR merged
into its stacked base branch (claude/progress-ledger) instead of main --
GitHub only retargets when the base branch is deleted -- so the bench
never actually landed on main; carried here verbatim.

Known follow-up: the TUI's inner loop config doesn't wire ToolOutputCap
(pre-existing; its loop caps at 6 iterations and has its own condensers).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84dc17b9f4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

RenderMarkdown := True; { on by default for terminal surfaces; cmd/serve flips off }
SubagentsEnabled := True; { on by default -- a built-in general-purpose subagent makes `spawn` available with no config. }
ToolOutputCap := 0; { off by default; operators opt in. See TConfig.ToolOutputCap. }
ToolOutputCap := DefaultToolOutputCap; { on by default -- an uncapped tool result rides in history for the whole turn and was the top context-growth driver (bench/agentloop: 100 KB read -> >100 KB request bodies). tool_output_get recovers the full text; operators opt out with tool_output_cap: 0. }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Thread the default cap into TUI loops

With this default flipped on, fresh TUI sessions still do not apply the cap: both TTUI.StartTurn and the line-based HandleUserInput build TToolLoopConfig from Default(TToolLoopConfig) and never assign ToolOutputCap, so RunToolLoop sees 0 and sends oversized tool output verbatim. I checked those TUI loop-construction paths, while Cmd.TUI now registers tool_output_get from the config, so the TUI advertises the recovery tool but the new default protection is not actually enforced on that chat surface.

Useful? React with 👍 / 👎.

Comment on lines +678 to +680
for i := 1 to GetEnvironmentVariableCount do
GW.Environment.Add(GetEnvironmentString(i));
GW.Environment.Add('PASCLAW_HOME=' + GHomeDir);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Override PASCLAW_HOME instead of appending duplicates

When the caller already has PASCLAW_HOME exported, this copies that entry into the child environment and then appends a second PASCLAW_HOME; duplicate environment names are not a reliable override, and typical getenv resolution will see the earlier value. In that environment the spawned gateway can ignore the temp relay config and run the benchmark against the user's live PasClaw home/provider/workspace, causing false failures or unintended writes; remove/replace existing PASCLAW_HOME/PASCLAW_CONFIG entries before Execute.

Useful? React with 👍 / 👎.

Two #414 review findings:

  * The TUI's two loop-config builders never set ToolOutputCap, so with
    the default flipped on, TUI turns still sent oversized tool output
    verbatim -- while Cmd.TUI dutifully registered tool_output_get.
    Both builders now take the cap from RouteConfig, the session-
    lifetime config snapshot the auto-router already uses.
  * The bench copied the caller's environment and then APPENDED its
    PASCLAW_HOME -- duplicate names are not a reliable override, so a
    caller with PASCLAW_HOME exported could have the spawned gateway run
    against their LIVE home instead of the throwaway bench home. The
    copy now skips PASCLAW_HOME / PASCLAW_CONFIG / NO_COLOR before the
    bench's values are added. Verified: `PASCLAW_HOME=/tmp/decoy make
    bench-agentloop` passes and the decoy is never created.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants