agent effectiveness: project facts, edit mini-diff, read ranges + dedup, find_files (+ real-task bench)#415
Conversation
A coding agent dropped into a repo burns its first 5-10 tool calls
rediscovering facts the filesystem states outright (the observed real
transcript opened with a wall of fs_list / probe calls). New
PasClaw.Agent.ProjectFacts detects -- by PURE FILE READS, no LLM, no
exec, both compilers -- the stack (Makefile / package.json / Cargo.toml
/ go.mod / .dpr / Python), the build command, THE verify command
("Test: make test <- run this to verify after code edits", giving Rule
3 a concrete target), up to 8 Makefile target names, npm scripts, and
the git branch from .git/HEAD (working-tree state is the model's to
query -- the block says to run `git status`).
Injected right after the Workspace section, from the same directory
relative tool paths resolve to (PromptWorkDir = CurrentWorkspace).
Skipped entirely when the project carries an AGENTS.md -- the
operator-authored doc wins; this is the zero-config floor. Empty /
unrecognisable directories yield no section, so gateway homes with a
bare workspace stay quiet.
Tests: project_facts_tests (Makefile targets incl. VAR:=/pattern-rule/
.PHONY exclusions, test-target promotion, package.json scripts with
Makefile keeping ownership of Build/Test, Cargo fallback, detached
HEAD, empty dir). Bench: build-site now seeds a Makefile + .git/HEAD in
the workspace and asserts '## Project', 'Test: make test' and the
branch reach ITERATION 1's system prompt -- zero exploration calls
spent. All prior scenarios green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
edit_file's success result was a bare count ("edited X (replaced 1
occurrence(s))"), so a careful model follows up with a full-file
read_file "to be safe" -- re-injecting the entire body into history for
the rest of the turn. The result now appends the changed region of the
NEW content with +-3 context lines and line numbers:
edited src/x.pas (replaced 1 occurrence(s))
now reads (lines 2-8):
2: l2
...
5: L5-CHANGED
...
Bounded to 12 lines. The model confirms placement at a glance instead of
paying a re-read; line numbers feed grep_files/read_file follow-ups.
Tests: fs_tool_naming asserts the snippet header, the changed line with
its number, leading/trailing context, and the +-3 bound. Bench:
build-site's turn 2 is now an edit_file and asserts the following
request's tool result carries "now reads (lines" + the changed content.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
C2 -- read_file gains start_line/end_line (1-based inclusive, clamped;
result prefixed "(lines A-B of N)"). grep_files already returns line
numbers; the follow-up read can now be surgical instead of swallowing
the whole file into history. A range implies plain output (hashline
headers hash the WHOLE file, so a sliced body must not carry one).
C3 -- RunToolLoop dedups repeat reads: when THIS loop already returned a
byte-identical body for the same path, the repeat becomes a one-line
stub ("unchanged since the earlier read this turn (hash #x) -- the
previous read_file result above is still current"). State lives in the
loop's own locals, so concurrent sessions / subagent child loops can't
cross-talk, and a file that changed between reads hashes differently and
passes through untouched. Runs before promptware + the byte cap (the
stub skips both). Re-reading unchanged files "to be safe" was a constant
in observed transcripts.
Harness-verified (bench/agentloop, new repeat-read scenario -- an 8 KB
file read three times in one turn; the fixture sits below the C1 cap on
purpose, so the dedup is what keeps bodies flat):
before (main): repeatread.growth_bytes_over_2_rereads = 12530
after (this): repeatread.growth_bytes_over_2_rereads = 886 (14x)
Units: read_file slices + clamps (fs_tool_naming); exactly one full body
+ one stub in the loop's final history (progress_ledger_tests).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
"Where is webui.html?" had no direct answer: grep_files searches CONTENTS (and requires a pattern), list_dir is one level -- so real transcripts show fs_list ladders descending the tree directory by directory. find_files is the Glob half of the Glob/Grep pair every mature harness ships (the model's priors already expect it): recursive name-glob from a root (default: the working directory), same walk discipline as grep_files (skip dotdirs + VCS/build/deps dirs), sorted relative paths, max_results cap (default 100, hard 500) with an explicit truncation note, and an empty result that points at grep_files for contents search. tcReadOnly -- fans out in parallel batches. Units: finds top-level + recursive matches as relative paths, skips node_modules, self-explaining empty result. Bench: build-site gains a find_files turn -- one call locates the deliverable by glob (no list_dir ladder). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
Runs the exact prompt that produced the original 100-tool-call failure
under a deterministic POLICY model: the same code adapts to what the
harness offers (find_files / read ranges / the edit mini-diff when
present; the list_dir ladder and full-file verify re-reads of the real
transcript when not). The reference "site dump" is generated locally
(yes|head -- a shell loop's $((...)) trips the command-substitution
deny guard) so the run is byte-deterministic; guards assert the dump was
written and its body reached the model, so the scenario can never
silently measure the wrong task.
Same bench, both binaries:
before (main) after (C1+B1+B2+C2C3+D1)
provider calls 13 11
total request bytes 1,651,216 296,859 (5.6x less)
max request body 186,681 33,011 (5.7x less)
Identical deliverable either way (index.html with the edited title) --
the after-binary just gets there in fewer calls and a fraction of the
context traffic.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 989eb8da5b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if StartLn > Total then StartLn := Total; | ||
| if EndLn < StartLn then EndLn := StartLn; | ||
| Body := ''; | ||
| for i := StartLn to EndLn do |
There was a problem hiding this comment.
Handle empty files before clamping range reads
When read_file is called with start_line or end_line on an empty file, Lines.Count is 0, so this clamp sets StartLn to 0; the loop immediately below then runs from 0 to 0 and indexes Lines[i - 1] (Lines[-1]). In that scenario the new range-read path raises an internal string-list bounds error instead of returning an empty slice, so reading an empty generated/config file with a line range fails.
Useful? React with 👍 / 👎.
| if FindProjectAgentsMd('') = '' then | ||
| Result := AppendSection(Result, BuildProjectFactsSection(PromptWorkDir)); |
There was a problem hiding this comment.
Check AGENTS.md in the workspace tree
In sessions where the configured workspace differs from the process CWD, this checks the wrong directory tree: FindProjectAgentsMd('') searches the launch cwd, while BuildProjectFactsSection(PromptWorkDir) uses the workspace. That means an AGENTS.md in the actual workspace will not suppress the auto-detected facts as intended (and an unrelated AGENTS.md in the launch cwd can suppress useful facts for the workspace), which breaks the documented “operator-authored doc wins” behavior for configured workspaces.
Useful? React with 👍 / 👎.
…fixes) Two #415 review findings: * read_file with start_line/end_line on an EMPTY file: Lines.Count = 0 drove the clamp to StartLn = 0 and the slice loop into Lines[-1] (range-check error). An empty file now reports "(empty file: 0 lines)" instead of crashing. Unit test added. * The AGENTS.md suppression check walked the LAUNCH cwd (FindProjectAgentsMd('')) while the facts are read from the WORKSPACE -- so a workspace AGENTS.md failed to suppress the auto-facts, and an unrelated launch-dir AGENTS.md could suppress useful workspace facts. Both now walk PromptWorkDir. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
Drives a typo'd tool name (read_files) and a wrong-directory read
through the real dispatch path and asserts the errors carry the next
move ("did you mean: read_file", "Did you mean: pages/sub-dir-page.html").
On the pre-B3 binary both assertions fail (bare "unknown tool: X" /
"no such file: Y"), so the audit is now harness-verified, not just
unit-tested.
Fresh before/after against merged main (#415), same bench binary:
first_request_bytes: 18726 -> 17780 (C4, every request)
realtask total bytes: 296859 -> 286453 (trim x 11 requests)
error-guidance: 2 FAIL -> PASS (B3)
realtask deliverable: byte-identical index.html either way
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
The remaining four items of the effectiveness queue (#414 was C1), one commit per concern, each harness-verified with before/after runs of the same bench binary against the main-built gateway (
BENCH_BIN) vs this branch.The headline: the original failing prompt, end to end
New
real-taskbench scenario runs the exact prompt that produced the original 100-tool-call failure ("build a better version of pasclaw.dev… use the workspace dir") under a deterministic policy model — the same code on both binaries, adapting to what the harness offers the way a real model reads its tool schemas (usesfind_files/ range reads / the mini-diff when present; falls back to thelist_dirladder and full-file verify re-reads from the real transcript when not):Identical deliverable both ways — fewer calls, a fifth of the context traffic. Guards assert the 68 KB reference dump was actually written and its body reached the model, so the scenario can never silently measure a lightweight task (my first draft did: the dump generator used
$((i+1)), which the command-substitution deny-guard rightly refused).The four changes
B1 — deterministic
## Projectfacts (PasClaw.Agent.ProjectFacts). Stack, build command, THE verify command (Test: make test <- run this to verify after code edits), up to 8 Makefile targets, npm scripts, and the git branch from.git/HEAD— pure file reads, no LLM, no exec, both compilers. Injected after the Workspace section from the directory relative tool paths resolve to; skipped entirely when the project has an AGENTS.md (operator-authored doc wins). Bench: facts +make test+ branch reach iteration 1's system prompt — zero exploration calls.B2 —
edit_filemini-diff. Success result now appends the changed region of the new content with ±3 context lines and line numbers (bounded to 12), so the model confirms placement at a glance instead of the defensive full-file re-read. Bench asserts the snippet rides in the tool result.C2+C3 —
read_fileline ranges + per-turn read dedup.start_line/end_line(clamped,(lines A-B of N)prefix; a range implies plain output since hashline headers hash the whole file), andRunToolLoopswaps a byte-identical repeat read of the same path for a one-line stub — state lives in the loop's own locals, so sessions/subagents can't cross-talk, and a changed file hashes differently and passes through. Before/after: history growth over two re-reads of an 8 KB file = 12,530 → 886 bytes (14×).D1 —
find_files. The Glob half of the Glob/Grep pair: recursive name-glob from the working dir, grep_files' walk discipline (skips dotdirs + VCS/build/deps dirs), sorted relative paths,max_resultscap with an explicit truncation note, empty result points atgrep_filesfor contents. Kills thelist_dirladder.Tests
New
project_facts_tests(+make test);fs_tool_naming_testsgrows range/mini-diff/find_files sections;progress_ledger_testsgains the dedup case. Bench: 40+ assertions across 6 scenarios, all green — including under a hostilePASCLAW_HOME=/tmp/decoyenvironment (per the #414 review fix). 13-suite regression sweep green.🤖 Generated with Claude Code
https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
Generated by Claude Code