fix(fsm): make complete() work on non-Anthropic models#303
Open
LissaGreense wants to merge 18 commits into
The @ai-sdk/openai provider accepts a `baseURL` setting but Friday's `createOpenAIWithOptions` was not forwarding it. Without that, the openai provider was pinned to api.openai.com and could not target local runners (Ollama, LM Studio, llama.cpp) or alternative free hosted providers (OpenRouter) that expose an OpenAI-compatible API. Read from the `OPENAI_BASE_URL` env var by default, with an explicit `baseURL` option override. Three-line addition; unlocks every OpenAI-compatible endpoint without further code changes.
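The precedence described above (explicit `baseURL` option first, then the `OPENAI_BASE_URL` env var, otherwise the provider default) can be sketched as a small pure function. This is only an illustration: `resolveOpenAIBaseURL` and its option and env types are hypothetical names, not the code in `createOpenAIWithOptions`, and the two endpoint URLs are just examples of OpenAI-compatible servers.

```typescript
// Hypothetical helper; names and shapes are illustrative, not Friday's actual code.
type OpenAIOptionsSketch = { apiKey?: string; baseURL?: string };
type EnvSketch = Record<string, string | undefined>;

// Precedence: explicit option, then OPENAI_BASE_URL, then undefined.
// Returning undefined lets the provider fall back to its built-in endpoint.
function resolveOpenAIBaseURL(
  options: OpenAIOptionsSketch,
  env: EnvSketch,
): string | undefined {
  return options.baseURL || env.OPENAI_BASE_URL || undefined;
}

// Example: a local Ollama runner configured through the environment.
const ollamaEnv: EnvSketch = { OPENAI_BASE_URL: "http://localhost:11434/v1" };

const fromEnvOnly = resolveOpenAIBaseURL({}, ollamaEnv);
const overridden = resolveOpenAIBaseURL(
  { baseURL: "https://openrouter.ai/api/v1" },
  ollamaEnv,
);
const unset = resolveOpenAIBaseURL({}, {});
```

The resolved value would then be forwarded as the provider's `baseURL` setting; when it is `undefined` nothing is forwarded and the provider keeps its default.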
When an FSM action declares `outputTo:` without an explicit
`outputType:`, the engine injects a synthetic `complete()` tool the
model must call to emit structured output. The previous default
schema was
    z.record(z.string(), z.unknown()).refine(non-empty)
which compiles to JSON Schema `{type: "object", additionalProperties: true}`
— no required keys. The runtime `.refine(non-empty)` doesn't translate
to JSON Schema, so the schema sent to the provider is effectively "any
object." Claude infers a sensible shape from the tool's description
prose; smaller models (Groq llama-*, gpt-oss-*, etc.) read the empty
schema literally and return `complete({})`, failing FSM validation.
Tested across ~12 free model/provider combos: gpt-oss-120b/20b,
llama-3.3-70b, llama-4-scout, qwen3-*, deepseek-chat-v3, minimax-m2.5,
glm-4.5-air, plus their OpenRouter variants. Every one failed before
this fix.
Two-layer fix:
1) Tighten the default schema in `packages/fsm-engine/fsm-engine.ts`
to a concrete `{result: string, min 1}` shape using explicit
`aiJsonSchema(...)` so the schema sent over the wire actually
has `required: ["result"]` and `additionalProperties: false`.
`tool()` wraps the definition so the AI SDK runs its full
serialization path (matches the `failStep` tool pattern).
2) In `packages/fsm-engine/llm-provider-adapter.ts`, pass
`providerOptions: { groq: { strictJsonSchema: true } }` (or
`openai: { strictJsonSchema: true, structuredOutputs: true }`)
to `streamText()`. Without this the AI SDK does not set
`strict: true` on the function definition server-side, so the
provider treats the JSON Schema as advisory and accepts
schema-violating tool calls. Verified by instrumenting the
outgoing HTTP request: before this change the function def has
`strict: null`; after, generation is constrained to the schema.
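To make the difference between the two wire schemas concrete, here is a minimal sketch using plain JSON Schema objects. It does not use Zod or the AI SDK. `acceptsArgs` is a hypothetical toy check covering only required keys, unknown keys, and string `minLength`, and the `minLength` keyword is an assumption drawn from the "min 1" wording above.

```typescript
// Plain JSON Schema objects; no Zod or AI SDK involved in this sketch.
type WireSchema = {
  type?: string;
  properties?: Record<string, WireSchema>;
  required?: string[];
  additionalProperties?: boolean;
  minLength?: number;
};

// What the old z.record(...) default effectively sent over the wire.
const permissiveSchema: WireSchema = { type: "object", additionalProperties: true };

// The tightened default from step 1 (minLength mirrors "min 1"; assumption).
const strictSchema: WireSchema = {
  type: "object",
  properties: { result: { type: "string", minLength: 1 } },
  required: ["result"],
  additionalProperties: false,
};

// Toy validator (hypothetical): not a full JSON Schema implementation.
function acceptsArgs(schema: WireSchema, args: Record<string, unknown>): boolean {
  const props = schema.properties ?? {};
  for (const key of schema.required ?? []) {
    if (!(key in args)) return false; // missing required key
  }
  for (const [key, value] of Object.entries(args)) {
    const prop = props[key];
    if (prop === undefined) {
      if (schema.additionalProperties === false) return false; // unknown key
      continue;
    }
    if (prop.type === "string") {
      if (typeof value !== "string") return false;
      if (value.length < (prop.minLength ?? 0)) return false;
    }
  }
  return true;
}

const emptyVsPermissive = acceptsArgs(permissiveSchema, {});
const emptyVsStrict = acceptsArgs(strictSchema, {});
const filledVsStrict = acceptsArgs(strictSchema, { result: "NPC saved" });
const extraVsStrict = acceptsArgs(strictSchema, { result: "ok", extra: 1 });
```

Under the permissive schema `complete({})` is a valid call, which is what the smaller models produced. Under the strict schema the same call is invalid, which only helps when the provider actually enforces the schema (step 2).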
Model gating: strict mode triggers `response_format: json_schema`
on the wire, which is supported by Llama 4 family on Groq and
gpt-4o/4.1/5 family on OpenAI, but not by gpt-oss-* or llama-3.x.
Enabling unconditionally would break those models with HTTP 400, so
we feature-detect by model name regex for now. A proper capability
registry is the right long-term answer.
Verified end-to-end on the DnD demo workspace with
`meta-llama/llama-4-scout-17b-16e-instruct` via Groq free tier:
- `view-roster` (read-only, 1 tool): consistent success across
5+ consecutive runs, 1-3s each.
- `generate-npc` with payload `{prompt: "a roguish bard..."}`:
read memory, generated full 5e stat block, saved to memory,
returned summary. 3.1s end-to-end. The saved NPC was
retrievable on the next `view-roster` call.
Known limitation: gpt-oss-* and llama-3.x cannot be made to work
through this path because they do not support `response_format:
json_schema` server-side. Either `experimental_repairToolCall`
retries or a non-strict fallback would be needed for those models —
out of scope here.
Anthropic continues to work as before (enforces tool schemas
natively, ignores the strictJsonSchema flag).
…te() injection

- Remove dead `outputSchema` variable; inline `compiledSchema ?? aiJsonSchema(...)` as `inputSchema` so typed-output documents use their declared schema again (the previous diff silently dropped the compiledSchema branch).
- Re-sort `ai` imports for Biome's organize-imports rule.
- Type `strictModeProviderOptions` with a literal shape so it satisfies `SharedV3ProviderOptions` (Record<string, JSONObject>) instead of the unknown-valued record TS rejected.
…iders" This reverts commit f596a14.
… regex
The model-name regex was a whitelist of two families (Groq Llama 4 +
OpenAI gpt-4o), supposedly to protect unsupported models from HTTP 400
on `response_format: json_schema`. But those models were already broken
under this code path — they silently emit `complete({})` and the FSM
fails with "did not call complete." A 400 is the upgrade: it surfaces
the unsupported model clearly instead of leaving the user to debug an
empty FSM output.
Gate by provider only. Every Groq + OpenAI call gets strict mode;
Anthropic ignores the option. OpenRouter / LiteLLM passthroughs benefit
for any backend that supports the format. New models work by default.
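A minimal sketch of what provider-only gating could look like follows. `strictOptionsFor` is a hypothetical name, and matching with `startsWith` assumes provider ids may carry a surface suffix (for example `groq.chat`); the actual adapter code may differ. The two option shapes are the ones quoted earlier in this PR.

```typescript
// Literal shapes mirroring the two provider options quoted in this PR.
type StrictProviderOptionsSketch =
  | { groq: { strictJsonSchema: true } }
  | { openai: { strictJsonSchema: true; structuredOutputs: true } };

// Hypothetical provider-only gate: no model-name regex involved.
// startsWith assumes ids may be surface-qualified (assumption for groq/openai).
function strictOptionsFor(provider: string): StrictProviderOptionsSketch | undefined {
  if (provider.startsWith("groq")) {
    return { groq: { strictJsonSchema: true } };
  }
  if (provider.startsWith("openai")) {
    return { openai: { strictJsonSchema: true, structuredOutputs: true } };
  }
  // Anthropic and anything else: send no strict-mode options.
  return undefined;
}

const groqOpts = strictOptionsFor("groq");
const openaiOpts = strictOptionsFor("openai");
const anthropicOpts = strictOptionsFor("anthropic.messages");
```

Because the gate never inspects the model name, a newly released model on Groq or OpenAI gets strict mode without a code change, and an unsupported model fails with the provider's own error.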
Collaborator
@LissaGreense need to regression test full eval suite on anthropic (should be 100% pass as that's main case today) and the new providers.
1. `providerName` was derived from `defaultModel.provider`, so per-action
`provider:` overrides in workspace.yml never engaged strict mode —
the very case `provider: groq` in an FSM action exists for. Use
`params.provider ?? defaultModel.provider` so overrides flow through.
2. Typed `outputType:` schemas round-tripped through Zod, which serialized
`catchall: unknown` instead of `additionalProperties: false`. OpenAI /
Groq silently downgrade strict mode when additionalProperties isn't
forced false, letting `complete({})` slip through against a schema
that ostensibly required fields. Skip the Zod round-trip on the wire:
walk the workspace's JSON Schema and inject `additionalProperties:
false` at every object level, then hand it to `aiJsonSchema()`
verbatim.
Caught both running the first-principles eval suite against gpt-4o-mini
via OpenRouter: pre-fix the model emitted `complete({})` against
`ReviewResult{marker,count,firstId}` and the FSM stored an empty doc;
post-fix the wire payload carries `additionalProperties:false` at every
nested object and the model fills all required fields.
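The recursive walk in item 2 can be sketched roughly as below. This is a simplified stand-in, not the PR's `withStrictObjects`. The `StrictSchemaSketch` type and the `emailBatchLike` field names are hypothetical, composition keywords (`oneOf` / `anyOf` / `allOf`) are not traversed, and leaving an existing `additionalProperties` value untouched is an assumption. The sketch also treats a node with `properties` but no `type` as an object node, which a later commit in this PR calls out.

```typescript
// Minimal JSON Schema shape for this sketch; the real interface is richer.
type StrictSchemaSketch = {
  type?: string;
  properties?: Record<string, StrictSchemaSketch>;
  items?: StrictSchemaSketch;
  required?: string[];
  additionalProperties?: boolean | StrictSchemaSketch;
};

// Returns a copy with additionalProperties: false on every object node that
// does not already declare additionalProperties. The input is never mutated.
function withStrictObjectsSketch(schema: StrictSchemaSketch): StrictSchemaSketch {
  const out: StrictSchemaSketch = { ...schema };
  if (schema.properties !== undefined) {
    const props: Record<string, StrictSchemaSketch> = {};
    for (const [key, child] of Object.entries(schema.properties)) {
      props[key] = withStrictObjectsSketch(child); // recurse into each property
    }
    out.properties = props;
  }
  if (schema.items !== undefined) {
    out.items = withStrictObjectsSketch(schema.items); // recurse into array items
  }
  // A node counts as an object if it says so or if it declares properties.
  const isObjectNode = out.type === "object" || out.properties !== undefined;
  if (isObjectNode && out.additionalProperties === undefined) {
    out.additionalProperties = false;
  }
  return out;
}

// Hypothetical fixture: an array of nested objects, with a type-less item node.
const emailBatchLike: StrictSchemaSketch = {
  type: "object",
  properties: {
    emails: {
      type: "array",
      items: { properties: { subject: { type: "string" } }, required: ["subject"] },
    },
  },
  required: ["emails"],
};

const strictEmailBatch = withStrictObjectsSketch(emailBatchLike);
```

Building a fresh object at each level keeps the workspace's original schema intact, so the same document can still be used for runtime validation after the strict copy is sent over the wire.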
Makes tools/qa eval suites runnable against any registry provider, not
just the workspace.yml-hardcoded `anthropic` + `claude-sonnet-4-6`, and
unblocks them on macOS dev environments where pre-existing harness
regressions silently masked every daemon-based scenario.
- tools/qa/run-evals.sh — parallel runner over every promptfoo config
in tools/qa. `--profile {anthropic,groq,openai}` selects the
`FRIDAY_QA_PROVIDER` / `FRIDAY_QA_MODEL` pair; suites tagged `core`
(FSM/daemon), `prompt` (Anthropic-only inline-fetch), `no-llm`
(model-agnostic). Routes everything through `npx promptfoo eval`.
- Workspace fixtures use `__FRIDAY_QA_PROVIDER__` / `__FRIDAY_QA_MODEL__`
placeholders; `qaProviderReplacements()` in harness substitutes from
env at materialize time, default anthropic / claude-sonnet-4-6.
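The placeholder substitution can be pictured with a small sketch. `qaReplacementsSketch` and `materializeFixture` are hypothetical stand-ins for `qaProviderReplacements()` and the harness's materialize step, and the two-line fixture is invented for illustration. Only the placeholder names, env var names, and defaults come from the description above.

```typescript
type QaEnvSketch = Record<string, string | undefined>;

// Hypothetical stand-in for qaProviderReplacements(): maps each placeholder
// to its env value, falling back to the defaults named in the commit message.
function qaReplacementsSketch(env: QaEnvSketch): Record<string, string> {
  return {
    __FRIDAY_QA_PROVIDER__: env.FRIDAY_QA_PROVIDER ?? "anthropic",
    __FRIDAY_QA_MODEL__: env.FRIDAY_QA_MODEL ?? "claude-sonnet-4-6",
  };
}

// Replaces every occurrence of each placeholder in the fixture text.
function materializeFixture(yaml: string, env: QaEnvSketch): string {
  let out = yaml;
  for (const [placeholder, value] of Object.entries(qaReplacementsSketch(env))) {
    out = out.split(placeholder).join(value);
  }
  return out;
}

// Invented fixture snippet; real workspace.yml fixtures will look different.
const fixtureText = "provider: __FRIDAY_QA_PROVIDER__\nmodel: __FRIDAY_QA_MODEL__\n";

const defaultRun = materializeFixture(fixtureText, {});
const groqRun = materializeFixture(fixtureText, {
  FRIDAY_QA_PROVIDER: "groq",
  FRIDAY_QA_MODEL: "meta-llama/llama-4-scout-17b-16e-instruct",
});
```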
Harness fixes (all pre-existing regressions, surface only now that the
provider matrix is exercised):
- realpath FRIDAY_HOME on the way in. macOS Deno.makeTempDir returns
/var/folders/... but realpath gives /private/var/folders/...
`isUnderHome` (#265) does a literal prefix check and masked every
registered workspace as cross-home.
- Materialize fixture YAMLs inside FRIDAY_HOME via `qaWorkspaceTmpRoot()`
rather than the system TMPDIR — same isUnderHome reason.
- Bypass `deno task atlas` for the daemon spawn; the task hard-codes
`FRIDAY_HOME=$HOME/.atlas` as an inline shell assignment, silently
overriding the isolated path in env.
- Force-blank FRIDAY_TLS_CERT/KEY/CA in the spawn env. With #243 the
daemon binds HTTPS when these are set + readable, and the harness's
HTTP `/health` check times out at 90s with no clear cause.
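The first harness fix is easier to see with a toy version of a literal prefix check. `isUnderHomeSketch` is a hypothetical approximation of `isUnderHome`, the paths are invented, and the sketch assumes workspace paths arrive already resolved while FRIDAY_HOME did not. On macOS `/var` is a symlink to `/private/var`, so the two spellings name the same directory but do not share a string prefix.

```typescript
// Hypothetical literal-prefix check in the spirit of isUnderHome.
function isUnderHomeSketch(path: string, home: string): boolean {
  const prefix = home.endsWith("/") ? home : home + "/";
  return path === home || path.startsWith(prefix);
}

// Invented paths: the temp dir as returned, and the same dir after realpath.
const rawHome = "/var/folders/ab/T/friday-qa";
const realHome = "/private/var/folders/ab/T/friday-qa";

// A registered workspace whose path has already been resolved (assumption).
const workspacePath = "/private/var/folders/ab/T/friday-qa/workspaces/demo";

const underRawHome = isUnderHomeSketch(workspacePath, rawHome);
const underRealHome = isUnderHomeSketch(workspacePath, realHome);
```

With the unresolved home the workspace looks cross-home; resolving FRIDAY_HOME on the way in makes both sides use the same spelling.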
Collaborator
@LissaGreense this work is looking good. We're gonna target release of this for
Removed step-by-step narrator comment over the docTypeName lookup in FSMEngine.run() and trimmed the duplicate "Build lines, sort, then split" narration in run-evals.sh's results loop.
Commit 4808573 widened `providerName` from `this.defaultModel.provider` to `params.provider ?? this.defaultModel.provider` to fix strict-mode gating for per-action `provider:` overrides in workspace.yml. That fix worked for the groq/openai strict checks, but it also flowed into `isDefaultOptsProvider(providerName)`, and there `this.defaultModel.provider` returns the AI SDK's surface-qualified id like `"anthropic.messages"` (not the bare workspace key `"anthropic"`). So before 4808573 the Anthropic branch hit the fallback empty `{}`; after 4808573 it started returning `getDefaultProviderOpts("anthropic")` = `{ anthropic: { cacheControl: { ttl: "1h" } } }` as `defaultUserProviderOptions`, which gets attached to every user message: a wire-level cache_control change no one signed up for.

Revert `providerName` to its pre-4808573 value and introduce a local `effectiveProvider = params.provider ?? providerName` scoped to the strict-mode block, where the per-action override actually needs to participate. `isAnthropic`, `isDefaultOptsProvider`, `modelIdForLog`, and the error logger all keep their original behavior.

## Progress
- Task: revert cache_control side-effect from 4808573 surfaced in review
- Decisions: scoped the override to the strict-mode block only; kept the rest of the function reading the pre-existing `providerName` so no other wire-level behavior shifts.
- Key Learnings: AI SDK provider ids are surface-qualified (`anthropic.messages`, `anthropic.tools`), not the bare workspace key. `startsWith("anthropic")` masks the difference; equality and table lookups (like `isDefaultOptsProvider`) don't. When widening a variable that feeds multiple branches, check every consumer: `startsWith` passes can hide a real flip in a sibling exact-match check.
- Files: packages/fsm-engine/llm-provider-adapter.ts
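The key learning about surface-qualified ids can be reproduced in a few lines. The table and both helpers below are hypothetical simplifications of `isAnthropic` and `isDefaultOptsProvider`. They only show how a prefix check and an exact-match lookup disagree on the same id.

```typescript
// Hypothetical table keyed by bare workspace provider names.
const defaultOptsTableSketch: Record<string, object> = {
  anthropic: { anthropic: { cacheControl: { ttl: "1h" } } },
};

// Prefix check: tolerant of surface-qualified ids.
const isAnthropicSketch = (id: string): boolean => id.startsWith("anthropic");

// Exact-match table lookup: only the bare key hits.
const hasDefaultOptsSketch = (id: string): boolean =>
  Object.prototype.hasOwnProperty.call(defaultOptsTableSketch, id);

const sdkId = "anthropic.messages"; // surface-qualified id from the AI SDK
const workspaceKey = "anthropic"; // bare key, as a per-action override would supply

const prefixResults = [isAnthropicSketch(sdkId), isAnthropicSketch(workspaceKey)];
const lookupResults = [hasDefaultOptsSketch(sdkId), hasDefaultOptsSketch(workspaceKey)];
```

Both ids pass the prefix check, but only the bare key hits the table. Feeding the override into the shared variable therefore changed which branch the lookup took, even though every `startsWith` consumer behaved the same.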
JSON Schema allows omitting `type:` when `properties:` is set; design review flagged that the previous `out.type === "object"` gate skipped those nodes, leaving strict mode silently downgraded. Treat presence of a `properties` object as implying object-node. Also document the still-unhandled cases (`oneOf` / `anyOf` / `allOf` / `patternProperties` / schema-form `additionalProperties`) so the next author gets a hint instead of debugging strict-mode silently downgrading.

## Progress
- Task: design-review follow-up on fix/fsm-strict-mode-free-models
- Decisions: only widen the gate; defer composition-keyword recursion until a workspace actually needs it (YAGNI). Note left for the next author.
- Key Learnings: OpenAI/Groq strict mode silently downgrades to permissive when `additionalProperties` is absent on any object node. There's no error, just bad results. Any code that injects strict-mode schemas needs to also handle the type-less object form because JSON Schema doesn't require `type` when `properties` is set.
- Files: packages/fsm-engine/fsm-engine.ts
…ree-models

# Conflicts:
#   tools/qa/live-daemon/harness.ts
#   tools/qa/live-daemon/scenarios/first-principles.ts
#   tools/qa/live-daemon/scenarios/tool-suite-management.ts
- 9 cases: nested-objects-in-arrays, deep nesting, additionalProperties
preserved, untyped-object trigger (the new branch), input immutability,
non-record passthrough. Catches recursion-breaking refactors without
paying for a live OpenAI/Groq run.
- run-evals.sh: let FRIDAY_QA_MODEL override the profile default
(`${FRIDAY_QA_MODEL:-…}` fallback) so callers can run gpt-4o-mini via
OpenRouter as `openai/gpt-4o-mini` without forking the script.
Re-type withStrictObjects to take/return the existing JSONSchema interface instead of Record<string, unknown>, dropping recursion-site casts in the function body and the one production caller. Rewrite the unit test's nested-property assertions with toMatchObject so they no longer need brittle Record-of-Record casts that tripped noUncheckedIndexedAccess. Also bump biome.json $schema to 2.4.14 to match the CLI version CI runs.
…aemon

Isolates the FSM's auto-injected complete() tool from the daemon-based first-principles suite. Each case is one HTTP call per provider; full 5×3 matrix runs in ~17s with ~2K tokens vs the daemon suite's 50K+ tokens and several minutes.

Five cases cover the schema shapes used by workspace.yml fixtures (ReviewResult / ArrayReviewResult / EmailBatch / AgentHydrationResult) plus a single-field smoke. Each ships in two variants:
- case-N-*.yaml (post-PR-#303 wire shape): `additionalProperties: false` recursively, `strict: true` on OpenAI-style function defs, forced `tool_choice` so the model can't bail to natural text.
- case-N-pre-fix.yaml (pre-PR-#303 wire shape): permissive schema, no strict flag, no forced tool choice.

Validated on 2026-05-15: pre-fix 12/15 (80%), post-fix 15/15 (100%) across Anthropic claude-sonnet-4-6, Groq llama-4-scout, OpenRouter openai/gpt-oss-120b.

The fix's three measurable failure modes:
- Groq scout emits `count: "12"` (string) for number-typed fields without wire-level constrained decoding (cases 2, 5)
- Anthropic falls back to a natural-text response instead of calling the tool without forced tool_choice (case 3)

Run via README: promptfoo-native providers + `OPENROUTER_API_KEY` aliased from the OpenRouter key stored in `OPENAI_API_KEY`.
CI's trailing-whitespace check (Lint JS workflow) was rejecting case-1-pre-fix.yaml and case-3-pre-fix.yaml. Pure whitespace cleanup, no semantic change.
Re-measuring the validated delta surfaced three methodology gaps:
- Both pre-fix and post-fix YAMLs now use the same tool description string the production FSM ships (fsm-engine.ts:1461-1462). Earlier revisions gave the post-fix variant a tailored "Required fields: X, Y, Z" prose that leaked schema info into the prompt and inflated post-fix pass rates. Case-1 and case-3 pre-fix were also still on the stale tailored prose; both fixed.
- Sample size goes from n=1 to n=5 per cell via --repeat 5 in the run snippet. At n=1 a missed tool call is indistinguishable from a real regression for non-deterministic models.
- Case 5 no longer hand-feeds `sawBodySentinel=true` in the prompt; the model derives the boolean from stated facts, which is what actually exercises strict-mode type coercion.

Adds openrouter:nvidia/nemotron-3-super-120b-a12b:free as a fourth provider to widen the OpenAI-compatible-strict surface beyond gpt-oss-120b.

Re-measured matrix (n=5):
- Anthropic 18/25 → 25/25 (72% → 100%), driven by case-3 tool_choice forcing (1/5 → 5/5)
- OpenRouter gpt-oss-120b 25/25 → 25/25; previous 24/25 didn't replicate, was n=5 noise
- OpenRouter nemotron-super 25/25 → 25/25
- Groq 15/25 → 14/25: the eval doesn't engage Groq's real strict mode (promptfoo's tool-level `strict: true` is not the same wire shape as the FSM's providerOptions.groq.strictJsonSchema → Groq's response_format: json_schema). README documents the caveat; the daemon-based first-principles suite is the authoritative Groq test.
## Summary

The auto-injected `complete()` tool in FSM `outputTo:` actions only worked on Anthropic. Other providers either skipped the call or invoked `complete({})`, breaking every non-Claude FSM agent.

## Root cause
Two layers, both in the path that ships `complete()`'s schema to the provider:

1. The default schema was `z.record(z.unknown()).refine(non-empty, ...)`. That compiles to `{type: "object", additionalProperties: true}` with no `required` keys, because the runtime `.refine()` doesn't translate to JSON Schema. Anthropic infers good keys from the description prose; everything else reads "object with any keys" as "empty is fine."
2. The AI SDK did not set `strict: true` on the function definition. OpenAI-compatible providers (Groq, OpenAI, OpenRouter, LiteLLM) treat tool schemas as advisory without it.

## Changes
- `packages/fsm-engine/fsm-engine.ts`: replace the open `z.record` fallback with `z.object({ result: z.string().min(1) })`. Use `tool({...})` + explicit `aiJsonSchema(...)` so the AI SDK serializes the schema verbatim (no Zod→JSON quirks). Preserves the existing `compiledSchema` branch for actions with a declared `outputType:`.
- `packages/fsm-engine/llm-provider-adapter.ts`: pass `providerOptions: { groq: { strictJsonSchema: true } }` (or `openai: { strictJsonSchema: true, structuredOutputs: true }`) into `streamText()` for every non-Anthropic call. Strict mode constrains generation to the schema server-side; providers reject `{}` instead of accepting it.

## Why no model whitelist
An earlier revision gated strict mode by a model-name regex (`llama-4-scout`, `gpt-4o`, etc.) to avoid HTTP 400 from models that don't support `response_format: json_schema`. That was wrong:

- Those models were already broken under this code path: they silently emit `complete({})` and the FSM throws "did not call complete."
- An HTTP 400 surfaces the unsupported model clearly instead of leaving the user to debug an empty FSM output.

So strict mode is enabled for every Groq + OpenAI call. Models that can't honor it fail visibly, with an actionable error. Models that can (including everything modern and everything routed through OpenRouter/LiteLLM) get schema enforcement automatically.
## What still needs work (separate PRs)

- The code at `packages/core/src/agent-conversion/from-llm.ts:154` has the same `complete()` injection and the same brittleness. Not touched here.
- The `complete()` tool itself: when an FSM action declares no `outputType:`, we still inject a `{result: string}` schema and force the model through it. A better default would be to skip injection entirely and use the streamed text response. That is a behavior change for ~28% of existing untyped `outputTo:` actions and needs its own review.
- `OPENAI_BASE_URL` passthrough for OpenAI-compatible endpoints (Ollama, LM Studio): originally bundled into this PR, split out to keep scope tight.

Tracked in `docs/plans/free-models-bugs.md`.

## Test plan
- `@atlas/fsm-engine` and `@atlas/llm` test suites pass locally (with `friday.yml` moved aside)
- `groq:meta-llama/llama-4-scout-17b-16e-instruct`: tool-calling FSM actions succeed across multiple runs, including memory writes that round-trip
- `providerOptions.groq/openai` is ignored by the Anthropic provider
- Models that do not support `response_format: json_schema` now return a clear HTTP 400 instead of silently emitting empty output

## Follow-up: two more strict-mode gaps + eval matrix
Caught running the full `tools/qa` suite against three providers.

2a. Per-action override bypassed strict mode. `providerName` came from `defaultModel.provider`, so workspaces overriding provider per-action never engaged strict mode. Fix: scope the override to the strict-mode block only (keeps `isDefaultOptsProvider` semantics intact; broadening it would silently flip Anthropic `cacheControl` on every user message).

2b. Typed `outputType:` lost `additionalProperties: false` on the wire. The compiled-schema path round-tripped through Zod, which serialized `catchall: unknown` instead. OpenAI/Groq strict mode silently downgrades without `additionalProperties: false`. New `withStrictObjects()` walks the JSON Schema recursively, injects strictness at every object level, and hands it to `aiJsonSchema()` verbatim.

### Eval matrix (FSM-only sub-tests, 20 scenarios)
`tools/qa/run-evals.sh --profile {anthropic,groq,openai}`. FSM-only scope excludes 4 workspace-chat scenarios (called out as "not touched here") + 1 Python user-agent plumbing case, which show the same failures across all providers.

Wire schema verified strict on every call via instrumented log during development. Remaining failures are scenarios whose `outputType:` has 3+ required fields: a model-quality limit, not a strict-mode regression. Single-field schemas pass on all providers.

### Test coverage
Added: 9 unit tests for `withStrictObjects` covering nested-objects-in-arrays, deep nesting, additionalProperties preserved, untyped-object trigger, immutability, non-record passthrough.

Skipped (intentionally): wire-payload mocks (would couple to AI SDK version-pinned internals; live matrix is the integration test); targeted strict-mode-override regression (workspace.yml fixtures already parameterize per-action provider/model, every Groq/OpenRouter run exercises it); harness-fix unit tests (inverse-assertion: the fact that daemon suites run at all is the regression signal).
### Targeted `complete()` tool-call eval

Added `tools/qa/complete-eval/`: promptfoo-native, no daemon. Each case is one HTTP call per provider per schema shape. The full 5-case × 3-provider matrix runs in ~17s with ~2K tokens (vs. the daemon-based suite at 50K+ tokens and several minutes), so it can be re-run cheaply on every change.

Five cases cover the schema shapes the FSM ships: 1-field smoke, 3 flat primitives (`ReviewResult`), primitive + `string[]` (`ArrayReviewResult`), array of nested objects (`EmailBatch`, the regression surface for `additionalProperties` propagation), 4 fields incl. boolean (`AgentHydrationResult`). Each case ships in two variants: post-fix (current wire shape) and pre-fix (permissive schema, no `strict`, no forced `tool_choice`).

Pre-fix 12/15 (80%) → Post-fix 15/15 (100%). The fix prevents three measurable failure modes:

- Without `additionalProperties: false` + `strict: true`, scout emits `count: "12"` (string) instead of `12` (number) and the server-side validator 400s the call. With the fix, wire-level constrained decoding forces the right type.
- Without `tool_choice: { type: tool, name: complete }`, sonnet returns natural text. With the fix, the tool is forced.
- Empty `complete({})` calls (not surfaced in these particular prompts, but the foundational reason the FSM rejects an empty/incomplete call vs accepting it silently).

This eval is the integration test for the wire-shape contract the fix produces: fast enough to iterate on, narrow enough to interpret.