Skip to content

agent effectiveness: project facts, edit mini-diff, read ranges + dedup, find_files (+ real-task bench)#415

Merged
FMXExpress merged 6 commits into
mainfrom
claude/project-facts
Jul 2, 2026
Merged

agent effectiveness: project facts, edit mini-diff, read ranges + dedup, find_files (+ real-task bench)#415
FMXExpress merged 6 commits into
mainfrom
claude/project-facts

Conversation

@FMXExpress

Copy link
Copy Markdown
Owner

The remaining four items of the effectiveness queue (#414 was C1), one commit per concern, each harness-verified with before/after runs of the same bench binary against the main-built gateway (BENCH_BIN) vs this branch.

The headline: the original failing prompt, end to end

New real-task bench scenario runs the exact prompt that produced the original 100-tool-call failure ("build a better version of pasclaw.dev… use the workspace dir") under a deterministic policy model — the same code on both binaries, adapting to what the harness offers the way a real model reads its tool schemas (uses find_files / range reads / the mini-diff when present; falls back to the list_dir ladder and full-file verify re-reads from the real transcript when not):

metric before (main) after (this + #414)
provider calls to deliverable 13 11 −15%
total request bytes over the turn 1,651,216 296,859 5.6× less
max single request body 186,681 33,011 5.7× less

Identical deliverable both ways — fewer calls, a fifth of the context traffic. Guards assert the 68 KB reference dump was actually written and its body reached the model, so the scenario can never silently measure a lightweight task (my first draft did: the dump generator used $((i+1)), which the command-substitution deny-guard rightly refused).

The four changes

B1 — deterministic ## Project facts (PasClaw.Agent.ProjectFacts). Stack, build command, THE verify command (Test: make test <- run this to verify after code edits), up to 8 Makefile targets, npm scripts, and the git branch from .git/HEADpure file reads, no LLM, no exec, both compilers. Injected after the Workspace section from the directory relative tool paths resolve to; skipped entirely when the project has an AGENTS.md (operator-authored doc wins). Bench: facts + make test + branch reach iteration 1's system prompt — zero exploration calls.

B2 — edit_file mini-diff. Success result now appends the changed region of the new content with ±3 context lines and line numbers (bounded to 12), so the model confirms placement at a glance instead of the defensive full-file re-read. Bench asserts the snippet rides in the tool result.

C2+C3 — read_file line ranges + per-turn read dedup. start_line/end_line (clamped, (lines A-B of N) prefix; a range implies plain output since hashline headers hash the whole file), and RunToolLoop swaps a byte-identical repeat read of the same path for a one-line stub — state lives in the loop's own locals, so sessions/subagents can't cross-talk, and a changed file hashes differently and passes through. Before/after: history growth over two re-reads of an 8 KB file = 12,530 → 886 bytes (14×).

D1 — find_files. The Glob half of the Glob/Grep pair: recursive name-glob from the working dir, grep_files' walk discipline (skips dotdirs + VCS/build/deps dirs), sorted relative paths, max_results cap with an explicit truncation note, empty result points at grep_files for contents. Kills the list_dir ladder.

Tests

New project_facts_tests (+ make test); fs_tool_naming_tests grows range/mini-diff/find_files sections; progress_ledger_tests gains the dedup case. Bench: 40+ assertions across 6 scenarios, all green — including under a hostile PASCLAW_HOME=/tmp/decoy environment (per the #414 review fix). 13-suite regression sweep green.

🤖 Generated with Claude Code

https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6


Generated by Claude Code

claude added 5 commits July 2, 2026 07:07
A coding agent dropped into a repo burns its first 5-10 tool calls
rediscovering facts the filesystem states outright (the observed real
transcript opened with a wall of fs_list / probe calls). New
PasClaw.Agent.ProjectFacts detects -- by PURE FILE READS, no LLM, no
exec, both compilers -- the stack (Makefile / package.json / Cargo.toml
/ go.mod / .dpr / Python), the build command, THE verify command
("Test: make test  <- run this to verify after code edits", giving Rule
3 a concrete target), up to 8 Makefile target names, npm scripts, and
the git branch from .git/HEAD (working-tree state is the model's to
query -- the block says to run `git status`).

Injected right after the Workspace section, from the same directory
relative tool paths resolve to (PromptWorkDir = CurrentWorkspace).
Skipped entirely when the project carries an AGENTS.md -- the
operator-authored doc wins; this is the zero-config floor. Empty /
unrecognisable directories yield no section, so gateway homes with a
bare workspace stay quiet.

Tests: project_facts_tests (Makefile targets incl. VAR:=/pattern-rule/
.PHONY exclusions, test-target promotion, package.json scripts with
Makefile keeping ownership of Build/Test, Cargo fallback, detached
HEAD, empty dir). Bench: build-site now seeds a Makefile + .git/HEAD in
the workspace and asserts '## Project', 'Test:  make test' and the
branch reach ITERATION 1's system prompt -- zero exploration calls
spent. All prior scenarios green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
edit_file's success result was a bare count ("edited X (replaced 1
occurrence(s))"), so a careful model follows up with a full-file
read_file "to be safe" -- re-injecting the entire body into history for
the rest of the turn. The result now appends the changed region of the
NEW content with +-3 context lines and line numbers:

    edited src/x.pas (replaced 1 occurrence(s))
    now reads (lines 2-8):
         2: l2
    ...
         5: L5-CHANGED
    ...

Bounded to 12 lines. The model confirms placement at a glance instead of
paying a re-read; line numbers feed grep_files/read_file follow-ups.

Tests: fs_tool_naming asserts the snippet header, the changed line with
its number, leading/trailing context, and the +-3 bound. Bench:
build-site's turn 2 is now an edit_file and asserts the following
request's tool result carries "now reads (lines" + the changed content.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
C2 -- read_file gains start_line/end_line (1-based inclusive, clamped;
result prefixed "(lines A-B of N)"). grep_files already returns line
numbers; the follow-up read can now be surgical instead of swallowing
the whole file into history. A range implies plain output (hashline
headers hash the WHOLE file, so a sliced body must not carry one).

C3 -- RunToolLoop dedups repeat reads: when THIS loop already returned a
byte-identical body for the same path, the repeat becomes a one-line
stub ("unchanged since the earlier read this turn (hash #x) -- the
previous read_file result above is still current"). State lives in the
loop's own locals, so concurrent sessions / subagent child loops can't
cross-talk, and a file that changed between reads hashes differently and
passes through untouched. Runs before promptware + the byte cap (the
stub skips both). Re-reading unchanged files "to be safe" was a constant
in observed transcripts.

Harness-verified (bench/agentloop, new repeat-read scenario -- an 8 KB
file read three times in one turn; the fixture sits below the C1 cap on
purpose, so the dedup is what keeps bodies flat):
    before (main): repeatread.growth_bytes_over_2_rereads = 12530
    after  (this): repeatread.growth_bytes_over_2_rereads =   886   (14x)

Units: read_file slices + clamps (fs_tool_naming); exactly one full body
+ one stub in the loop's final history (progress_ledger_tests).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
"Where is webui.html?" had no direct answer: grep_files searches
CONTENTS (and requires a pattern), list_dir is one level -- so real
transcripts show fs_list ladders descending the tree directory by
directory. find_files is the Glob half of the Glob/Grep pair every
mature harness ships (the model's priors already expect it): recursive
name-glob from a root (default: the working directory), same walk
discipline as grep_files (skip dotdirs + VCS/build/deps dirs),
sorted relative paths, max_results cap (default 100, hard 500) with an
explicit truncation note, and an empty result that points at grep_files
for contents search. tcReadOnly -- fans out in parallel batches.

Units: finds top-level + recursive matches as relative paths, skips
node_modules, self-explaining empty result. Bench: build-site gains a
find_files turn -- one call locates the deliverable by glob (no
list_dir ladder).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
Runs the exact prompt that produced the original 100-tool-call failure
under a deterministic POLICY model: the same code adapts to what the
harness offers (find_files / read ranges / the edit mini-diff when
present; the list_dir ladder and full-file verify re-reads of the real
transcript when not). The reference "site dump" is generated locally
(yes|head -- a shell loop's $((...)) trips the command-substitution
deny guard) so the run is byte-deterministic; guards assert the dump was
written and its body reached the model, so the scenario can never
silently measure the wrong task.

Same bench, both binaries:
                          before (main)   after (C1+B1+B2+C2C3+D1)
  provider calls               13                11
  total request bytes   1,651,216           296,859     (5.6x less)
  max request body        186,681            33,011     (5.7x less)

Identical deliverable either way (index.html with the edited title) --
the after-binary just gets there in fewer calls and a fraction of the
context traffic.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 989eb8da5b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +231 to +234
if StartLn > Total then StartLn := Total;
if EndLn < StartLn then EndLn := StartLn;
Body := '';
for i := StartLn to EndLn do

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle empty files before clamping range reads

When read_file is called with start_line or end_line on an empty file, Lines.Count is 0, so this clamp sets StartLn to 0; the loop immediately below then runs from 0 to 0 and indexes Lines[i - 1] (Lines[-1]). In that scenario the new range-read path raises an internal string-list bounds error instead of returning an empty slice, so reading an empty generated/config file with a line range fails.

Useful? React with 👍 / 👎.

Comment thread src/pkg/agent/PasClaw.Agent.Prompt.pas Outdated
Comment on lines +865 to +866
if FindProjectAgentsMd('') = '' then
Result := AppendSection(Result, BuildProjectFactsSection(PromptWorkDir));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check AGENTS.md in the workspace tree

In sessions where the configured workspace differs from the process CWD, this checks the wrong directory tree: FindProjectAgentsMd('') searches the launch cwd, while BuildProjectFactsSection(PromptWorkDir) uses the workspace. That means an AGENTS.md in the actual workspace will not suppress the auto-detected facts as intended (and an unrelated AGENTS.md in the launch cwd can suppress useful facts for the workspace), which breaks the documented “operator-authored doc wins” behavior for configured workspaces.

Useful? React with 👍 / 👎.

…fixes)

Two #415 review findings:

  * read_file with start_line/end_line on an EMPTY file: Lines.Count = 0
    drove the clamp to StartLn = 0 and the slice loop into Lines[-1]
    (range-check error). An empty file now reports "(empty file: 0
    lines)" instead of crashing. Unit test added.
  * The AGENTS.md suppression check walked the LAUNCH cwd
    (FindProjectAgentsMd('')) while the facts are read from the
    WORKSPACE -- so a workspace AGENTS.md failed to suppress the
    auto-facts, and an unrelated launch-dir AGENTS.md could suppress
    useful workspace facts. Both now walk PromptWorkDir.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
@FMXExpress FMXExpress merged commit 5a24cc2 into main Jul 2, 2026
FMXExpress pushed a commit that referenced this pull request Jul 2, 2026
Drives a typo'd tool name (read_files) and a wrong-directory read
through the real dispatch path and asserts the errors carry the next
move ("did you mean: read_file", "Did you mean: pages/sub-dir-page.html").
On the pre-B3 binary both assertions fail (bare "unknown tool: X" /
"no such file: Y"), so the audit is now harness-verified, not just
unit-tested.

Fresh before/after against merged main (#415), same bench binary:
    first_request_bytes:      18726 -> 17780   (C4, every request)
    realtask total bytes:    296859 -> 286453  (trim x 11 requests)
    error-guidance:           2 FAIL -> PASS   (B3)
    realtask deliverable:  byte-identical index.html either way

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LGQKz579j1ZnDRwmr6h1V6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants