agent effectiveness: project facts, edit mini-diff, read ranges + dedup, find_files (+ real-task bench) by FMXExpress · Pull Request #415 · FMXExpress/PasClaw

FMXExpress · 2026-07-02T07:08:56Z

The remaining four items of the effectiveness queue (#414 was C1), one commit per concern, each harness-verified with before/after runs of the same bench binary against the main-built gateway (BENCH_BIN) vs this branch.

The headline: the original failing prompt, end to end

New real-task bench scenario runs the exact prompt that produced the original 100-tool-call failure ("build a better version of pasclaw.dev… use the workspace dir") under a deterministic policy model — the same code on both binaries, adapting to what the harness offers the way a real model reads its tool schemas (uses find_files / range reads / the mini-diff when present; falls back to the list_dir ladder and full-file verify re-reads from the real transcript when not):

metric	before (main)	after (this + #414)
provider calls to deliverable	13	11	−15%
total request bytes over the turn	1,651,216	296,859	5.6× less
max single request body	186,681	33,011	5.7× less

Identical deliverable both ways — fewer calls, a fifth of the context traffic. Guards assert the 68 KB reference dump was actually written and its body reached the model, so the scenario can never silently measure a lightweight task (my first draft did: the dump generator used $((i+1)), which the command-substitution deny-guard rightly refused).

The four changes

B1 — deterministic ## Project facts (PasClaw.Agent.ProjectFacts). Stack, build command, THE verify command (Test: make test <- run this to verify after code edits), up to 8 Makefile targets, npm scripts, and the git branch from .git/HEAD — pure file reads, no LLM, no exec, both compilers. Injected after the Workspace section from the directory relative tool paths resolve to; skipped entirely when the project has an AGENTS.md (operator-authored doc wins). Bench: facts + make test + branch reach iteration 1's system prompt — zero exploration calls.

B2 — edit_file mini-diff. Success result now appends the changed region of the new content with ±3 context lines and line numbers (bounded to 12), so the model confirms placement at a glance instead of the defensive full-file re-read. Bench asserts the snippet rides in the tool result.

C2+C3 — read_file line ranges + per-turn read dedup. start_line/end_line (clamped, (lines A-B of N) prefix; a range implies plain output since hashline headers hash the whole file), and RunToolLoop swaps a byte-identical repeat read of the same path for a one-line stub — state lives in the loop's own locals, so sessions/subagents can't cross-talk, and a changed file hashes differently and passes through. Before/after: history growth over two re-reads of an 8 KB file = 12,530 → 886 bytes (14×).

D1 — find_files. The Glob half of the Glob/Grep pair: recursive name-glob from the working dir, grep_files' walk discipline (skips dotdirs + VCS/build/deps dirs), sorted relative paths, max_results cap with an explicit truncation note, empty result points at grep_files for contents. Kills the list_dir ladder.

Tests

New project_facts_tests (+ make test); fs_tool_naming_tests grows range/mini-diff/find_files sections; progress_ledger_tests gains the dedup case. Bench: 40+ assertions across 6 scenarios, all green — including under a hostile PASCLAW_HOME=/tmp/decoy environment (per the #414 review fix). 13-suite regression sweep green.

🤖 Generated with Claude Code

https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6

Generated by Claude Code

A coding agent dropped into a repo burns its first 5-10 tool calls rediscovering facts the filesystem states outright (the observed real transcript opened with a wall of fs_list / probe calls). New PasClaw.Agent.ProjectFacts detects -- by PURE FILE READS, no LLM, no exec, both compilers -- the stack (Makefile / package.json / Cargo.toml / go.mod / .dpr / Python), the build command, THE verify command ("Test: make test <- run this to verify after code edits", giving Rule 3 a concrete target), up to 8 Makefile target names, npm scripts, and the git branch from .git/HEAD (working-tree state is the model's to query -- the block says to run `git status`). Injected right after the Workspace section, from the same directory relative tool paths resolve to (PromptWorkDir = CurrentWorkspace). Skipped entirely when the project carries an AGENTS.md -- the operator-authored doc wins; this is the zero-config floor. Empty / unrecognisable directories yield no section, so gateway homes with a bare workspace stay quiet. Tests: project_facts_tests (Makefile targets incl. VAR:=/pattern-rule/ .PHONY exclusions, test-target promotion, package.json scripts with Makefile keeping ownership of Build/Test, Cargo fallback, detached HEAD, empty dir). Bench: build-site now seeds a Makefile + .git/HEAD in the workspace and asserts '## Project', 'Test: make test' and the branch reach ITERATION 1's system prompt -- zero exploration calls spent. All prior scenarios green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6

edit_file's success result was a bare count ("edited X (replaced 1 occurrence(s))"), so a careful model follows up with a full-file read_file "to be safe" -- re-injecting the entire body into history for the rest of the turn. The result now appends the changed region of the NEW content with +-3 context lines and line numbers: edited src/x.pas (replaced 1 occurrence(s)) now reads (lines 2-8): 2: l2 ... 5: L5-CHANGED ... Bounded to 12 lines. The model confirms placement at a glance instead of paying a re-read; line numbers feed grep_files/read_file follow-ups. Tests: fs_tool_naming asserts the snippet header, the changed line with its number, leading/trailing context, and the +-3 bound. Bench: build-site's turn 2 is now an edit_file and asserts the following request's tool result carries "now reads (lines" + the changed content. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6

C2 -- read_file gains start_line/end_line (1-based inclusive, clamped; result prefixed "(lines A-B of N)"). grep_files already returns line numbers; the follow-up read can now be surgical instead of swallowing the whole file into history. A range implies plain output (hashline headers hash the WHOLE file, so a sliced body must not carry one). C3 -- RunToolLoop dedups repeat reads: when THIS loop already returned a byte-identical body for the same path, the repeat becomes a one-line stub ("unchanged since the earlier read this turn (hash #x) -- the previous read_file result above is still current"). State lives in the loop's own locals, so concurrent sessions / subagent child loops can't cross-talk, and a file that changed between reads hashes differently and passes through untouched. Runs before promptware + the byte cap (the stub skips both). Re-reading unchanged files "to be safe" was a constant in observed transcripts. Harness-verified (bench/agentloop, new repeat-read scenario -- an 8 KB file read three times in one turn; the fixture sits below the C1 cap on purpose, so the dedup is what keeps bodies flat): before (main): repeatread.growth_bytes_over_2_rereads = 12530 after (this): repeatread.growth_bytes_over_2_rereads = 886 (14x) Units: read_file slices + clamps (fs_tool_naming); exactly one full body + one stub in the loop's final history (progress_ledger_tests). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6

"Where is webui.html?" had no direct answer: grep_files searches CONTENTS (and requires a pattern), list_dir is one level -- so real transcripts show fs_list ladders descending the tree directory by directory. find_files is the Glob half of the Glob/Grep pair every mature harness ships (the model's priors already expect it): recursive name-glob from a root (default: the working directory), same walk discipline as grep_files (skip dotdirs + VCS/build/deps dirs), sorted relative paths, max_results cap (default 100, hard 500) with an explicit truncation note, and an empty result that points at grep_files for contents search. tcReadOnly -- fans out in parallel batches. Units: finds top-level + recursive matches as relative paths, skips node_modules, self-explaining empty result. Bench: build-site gains a find_files turn -- one call locates the deliverable by glob (no list_dir ladder). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6

Runs the exact prompt that produced the original 100-tool-call failure under a deterministic POLICY model: the same code adapts to what the harness offers (find_files / read ranges / the edit mini-diff when present; the list_dir ladder and full-file verify re-reads of the real transcript when not). The reference "site dump" is generated locally (yes|head -- a shell loop's $((...)) trips the command-substitution deny guard) so the run is byte-deterministic; guards assert the dump was written and its body reached the model, so the scenario can never silently measure the wrong task. Same bench, both binaries: before (main) after (C1+B1+B2+C2C3+D1) provider calls 13 11 total request bytes 1,651,216 296,859 (5.6x less) max request body 186,681 33,011 (5.7x less) Identical deliverable either way (index.html with the edited title) -- the after-binary just gets there in fewer calls and a fraction of the context traffic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 989eb8da5b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-07-02T07:13:55Z

+      if StartLn > Total then StartLn := Total;
+      if EndLn < StartLn then EndLn := StartLn;
+      Body := '';
+      for i := StartLn to EndLn do


Handle empty files before clamping range reads

When read_file is called with start_line or end_line on an empty file, Lines.Count is 0, so this clamp sets StartLn to 0; the loop immediately below then runs from 0 to 0 and indexes Lines[i - 1] (Lines[-1]). In that scenario the new range-read path raises an internal string-list bounds error instead of returning an empty slice, so reading an empty generated/config file with a line range fails.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-07-02T07:13:55Z

+  if FindProjectAgentsMd('') = '' then
+    Result := AppendSection(Result, BuildProjectFactsSection(PromptWorkDir));


Check AGENTS.md in the workspace tree

In sessions where the configured workspace differs from the process CWD, this checks the wrong directory tree: FindProjectAgentsMd('') searches the launch cwd, while BuildProjectFactsSection(PromptWorkDir) uses the workspace. That means an AGENTS.md in the actual workspace will not suppress the auto-detected facts as intended (and an unrelated AGENTS.md in the launch cwd can suppress useful facts for the workspace), which breaks the documented “operator-authored doc wins” behavior for configured workspaces.

Useful? React with 👍 / 👎.

…fixes) Two #415 review findings: * read_file with start_line/end_line on an EMPTY file: Lines.Count = 0 drove the clamp to StartLn = 0 and the slice loop into Lines[-1] (range-check error). An empty file now reports "(empty file: 0 lines)" instead of crashing. Unit test added. * The AGENTS.md suppression check walked the LAUNCH cwd (FindProjectAgentsMd('')) while the facts are read from the WORKSPACE -- so a workspace AGENTS.md failed to suppress the auto-facts, and an unrelated launch-dir AGENTS.md could suppress useful workspace facts. Both now walk PromptWorkDir. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6

Drives a typo'd tool name (read_files) and a wrong-directory read through the real dispatch path and asserts the errors carry the next move ("did you mean: read_file", "Did you mean: pages/sub-dir-page.html"). On the pre-B3 binary both assertions fail (bare "unknown tool: X" / "no such file: Y"), so the audit is now harness-verified, not just unit-tested. Fresh before/after against merged main (#415), same bench binary: first_request_bytes: 18726 -> 17780 (C4, every request) realtask total bytes: 296859 -> 286453 (trim x 11 requests) error-guidance: 2 FAIL -> PASS (B3) realtask deliverable: byte-identical index.html either way Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6

claude added 5 commits July 2, 2026 07:07

chatgpt-codex-connector Bot reviewed Jul 2, 2026

View reviewed changes

FMXExpress merged commit 5a24cc2 into main Jul 2, 2026

FMXExpress mentioned this pull request Jul 2, 2026

tools: schema trim + error-message audit (C4 + B3) #416

Merged

FMXExpress mentioned this pull request Jul 2, 2026

agent: remove the deterministic project-facts prompt section #417

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

agent effectiveness: project facts, edit mini-diff, read ranges + dedup, find_files (+ real-task bench)#415

agent effectiveness: project facts, edit mini-diff, read ranges + dedup, find_files (+ real-task bench)#415
FMXExpress merged 6 commits into
mainfrom
claude/project-facts

FMXExpress commented Jul 2, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jul 2, 2026

Uh oh!

chatgpt-codex-connector Bot Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if FindProjectAgentsMd('') = '' then
		Result := AppendSection(Result, BuildProjectFactsSection(PromptWorkDir));

Conversation

FMXExpress commented Jul 2, 2026

The headline: the original failing prompt, end to end

The four changes

Tests

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants