
feat(tools): generic input-shape repair for tool calls (validate-then-repair)#2635

Merged
trungutt merged 4 commits into docker:main from trungutt:trungutt/tool-input-shape-repair
May 6, 2026
Conversation


@trungutt trungutt commented May 5, 2026

Today every tool in docker-agent has a single, strict, one-shot JSON parse for its arguments (tools.NewHandler, pkg/tools/tools.go:17). If a model sends arguments that don't match the Go struct exactly, the call fails and the model retries — usually with the same mistake, because the error message it sees is unreadable JSON noise.

This PR teaches NewHandler a small, named set of repairs for the four shape mistakes open-weights models repeat across every tool that has an array field. It does not change any tool's contract or schema. Valid inputs are not touched.

Why now

We already do this — but only for edit_file. PR #2452 and PR #2144 added validate-then-repair logic for the edits field and for malformed JSON syntax. Those repairs are bespoke. The patterns they catch are general — the same four shape errors show up across every []string field we expose:

  • read_multiple_files (paths)
  • create_directory / remove_directory (paths)
  • search_files_content (excludePatterns)
  • fetch (urls)
  • create_todos (descriptions)
  • task dependencies

Each is silently brittle today: any model that sends paths: "foo" instead of paths: ["foo"] gets a schema error and a wasted turn. The framing in this thread argues — credibly, in our experience — that this is harness design, not model capability. Anthropic and OpenAI eat the cost of strict contracts invisibly because they've memorised every JSON contract during pretraining; open models pay it loudly and get dismissed for it.

This PR generalises the pattern we already use for edit_file so every tool benefits.

How it works

The handler does the obvious thing first: try a strict json.Unmarshal into the typed struct. If that succeeds — the 95% case for valid input — the typed function is called immediately, with zero overhead and no behaviour change.

If the strict parse fails, the repair layer walks the destination struct's fields by reflection and looks for a small, fixed catalogue of mismatches between each typed field and the corresponding raw JSON value:

  1. Stringified array: paths: "[\"a\",\"b\"]" becomes paths: ["a","b"]. We try json.Unmarshal on the string and accept it only if it parses as an array.
  2. Bare scalar where an array is expected: paths: "foo" becomes paths: ["foo"]. Only fires for primitive element kinds; we don't guess struct construction.
  3. Single-key object placeholder: paths: {"path": "foo"} becomes paths: ["foo"]. Restricted to objects with exactly one entry whose value matches the slice's element kind.
  4. Null for a primitive scalar: n: null is dropped from the payload so the type's zero value wins. (Mostly a no-op in Go's stdlib, but it matters when a field has a custom UnmarshalJSON.)

If at least one repair fires we re-marshal the payload, retry the strict parse, and on success emit a tool_input_repaired log entry tagged with the tool name and the list of repairs applied. If the retry still fails we surface the original error so the model sees the schema's complaint rather than a synthesised one from the repair layer.
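The overall flow can be sketched as follows. This is a hedged approximation of the handler described above, not the merged implementation: the struct, helper names, and the single repair shown (bare-scalar wrap) are illustrative, and the real handler runs its full, ordered repair catalogue.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

type readMultipleFilesArgs struct {
	Paths []string `json:"paths"`
}

// decodeWithRepair: strict parse first; repairs only when it fails; the
// original error is surfaced if the retry still fails.
func decodeWithRepair(data []byte, dst any) error {
	strictErr := json.Unmarshal(data, dst)
	if strictErr == nil {
		return nil // the common case: valid input, zero extra work
	}
	var payload map[string]json.RawMessage
	if json.Unmarshal(data, &payload) != nil {
		return strictErr // not even a JSON object; nothing to repair
	}
	repaired := false
	for _, f := range reflect.VisibleFields(reflect.TypeOf(dst).Elem()) {
		name := f.Tag.Get("json")
		raw, ok := payload[name]
		if !ok || f.Type.Kind() != reflect.Slice {
			continue
		}
		// Sketch of repair 2: bare string where a []string is expected.
		var s string
		if json.Unmarshal(raw, &s) == nil && f.Type.Elem().Kind() == reflect.String {
			payload[name], _ = json.Marshal([]string{s})
			repaired = true
		}
	}
	if !repaired {
		return strictErr
	}
	redone, _ := json.Marshal(payload)
	if err := json.Unmarshal(redone, dst); err != nil {
		return strictErr // surface the schema's complaint, not a synthesised one
	}
	return nil
}

func main() {
	var args readMultipleFilesArgs
	err := decodeWithRepair([]byte(`{"paths":"foo"}`), &args)
	fmt.Println(err, args.Paths) // <nil> [foo]
}
```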

Why validate-then-repair (not preprocess-then-validate)

The naive approach would be to walk every input and normalise things that look broken before parsing. That's silently corrupting: imagine a write_file call where the model wants to write [1,2,3] to a file. Preprocessing strings-that-parse-as-JSON would turn content into a real array and we'd write garbage to disk.

Validate-then-repair avoids this because the schema is the prior. The repair only runs at field paths where the strict parse already failed. If content is typed as string, parsing succeeds with \"[1,2,3]\" as a string and we never touch it. If paths is typed as []string, the parse fails with a type mismatch at paths, and only there do we try unwrapping. The validator localises the bug for us; the repair layer only spends budget at confirmed mismatches.

Why ordering matters

The four repairs run in a fixed order. The load-bearing constraint is that the stringified-array unwrap must run before the bare-string wrap; otherwise a literal stringified array would be wrapped as a single element and we'd silently corrupt the input ("[\"a\",\"b\"]" would become ["[\"a\",\"b\"]"]). A dedicated test (TestRepair_OrderingPreventsDoubleWrap) pins this invariant.

What this is not

  • Not content fuzzy matching. We are not editing user-intended text. PR #2296 (feat: add fuzzy matching fallbacks for edit_file tool) attempted that for edit_file's oldText and was closed as too dangerous; that objection does not apply here because input-shape repair never modifies content fields.
  • Not a generic JSON repair layer. Four named, narrow rules, each with a specific failure mode it catches. New mistakes need new repair rules — intentionally.
  • Not a schema change. The contract advertised to the model is unchanged. We just become more forgiving when the model violates it in one of the four documented ways.
  • Not recursive. Only top-level fields are inspected. We don't walk into nested structs or slice elements yet — there is no evidence we need to.

trungutt added 2 commits May 5, 2026 10:05
Open-weights models (DeepSeek, Qwen, GLM) repeat the same small set of
shape mistakes when calling tools whose schema includes an array field:
sending a JSON-stringified array, a bare scalar, or a single-key object
placeholder where the schema expects an array. Today every tool surface
that doesn't have a custom UnmarshalJSON fails strictly on these and the
model wastes a turn retrying with the same mistake.

This generalises the validate-then-repair design we already use for
edit_file (PRs docker#2452 and docker#2144) to every tool that goes through
tools.NewHandler. Strict json.Unmarshal still runs first; only when it
fails does the repair layer walk the destination type's fields and
attempt four narrow fixes at the exact paths the schema disagreed at.
Successful repairs emit a tool_input_repaired log entry tagged with the
tool name and the repairs applied so per-(model, tool) repair rates can
be tracked.
@trungutt trungutt marked this pull request as ready for review May 5, 2026 08:16
@trungutt trungutt requested a review from a team as a code owner May 5, 2026 08:16
trungutt added 2 commits May 5, 2026 11:00
Replace direct NumField/Field iteration with reflect.VisibleFields, which
walks promoted fields from embedded structs the same way encoding/json
does when marshaling. The previous loop silently skipped them.

Affects ReferencesArgs/RenameArgs/CallHierarchyArgs/TypeHierarchyArgs in
pkg/tools/builtin/lsp.go which all embed PositionArgs. Promoted fields
are primitive scalars today so only repairDropNull would have applied,
but the loop should still visit them so future []string fields lifted
through embedding inherit the repair layer automatically.

Also resolves the modernize lint complaint on the same loop.
@trungutt trungutt merged commit 325d936 into docker:main May 6, 2026
5 checks passed