fix(fsm): make complete() work on non-Anthropic models by LissaGreense · Pull Request #303 · friday-platform/friday-studio

LissaGreense · 2026-05-13T14:16:17Z

Summary

The auto-injected complete() tool in FSM outputTo: actions only worked on Anthropic. Other providers either skipped the call or invoked complete({}), breaking every non-Claude FSM agent.

Root cause

Two layers, both in the path that ships complete()'s schema to the provider:

Schema too loose. The default fell back to z.record(z.unknown()).refine(non-empty, ...). That compiles to {type: "object", additionalProperties: true} with no required keys — the runtime .refine() doesn't translate to JSON Schema. Anthropic infers good keys from the description prose; everything else reads "object with any keys" as "empty is fine."
Strict mode never set. Even with a tighter schema, the AI SDK doesn't send strict: true on the function definition. OpenAI-compatible providers (Groq, OpenAI, OpenRouter, LiteLLM) treat tool schemas as advisory without it.
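To make the first layer concrete, here is a minimal sketch of why an empty call passes the old schema but not the tightened one. The `ObjectSchema` type and the `accepts` validator are hypothetical and cover only the keywords used here; they stand in for whatever validation a provider applies and are not the engine's code.

```typescript
// Hypothetical, simplified JSON Schema shape; just enough for this illustration.
type ObjectSchema = {
  type: "object";
  properties?: Record<string, { type: "string"; minLength?: number }>;
  required?: string[];
  additionalProperties?: boolean;
};

// What the old z.record(...) fallback compiled to, per the description above.
const looseSchema: ObjectSchema = { type: "object", additionalProperties: true };

// The tightened default shape: one required, non-empty string field.
const strictSchema: ObjectSchema = {
  type: "object",
  properties: { result: { type: "string", minLength: 1 } },
  required: ["result"],
  additionalProperties: false,
};

// Toy validator (assumption: not a full JSON Schema implementation).
function accepts(schema: ObjectSchema, value: Record<string, unknown>): boolean {
  const props = schema.properties ?? {};
  // Every required key must be present.
  for (const key of schema.required ?? []) {
    if (!(key in value)) return false;
  }
  for (const [key, v] of Object.entries(value)) {
    const prop = props[key];
    if (prop === undefined) {
      // Unknown key: only rejected when additionalProperties is false.
      if (schema.additionalProperties === false) return false;
      continue;
    }
    if (typeof v !== "string") return false;
    if (prop.minLength !== undefined && v.length < prop.minLength) return false;
  }
  return true;
}

const looseAcceptsEmpty = accepts(looseSchema, {});
const strictAcceptsEmpty = accepts(strictSchema, {});
const strictAcceptsResult = accepts(strictSchema, { result: "done" });
```

Under the loose schema `complete({})` is a valid call, so nothing on the wire tells a literal-minded model that it did anything wrong.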

Changes

packages/fsm-engine/fsm-engine.ts — replace the open z.record fallback with z.object({ result: z.string().min(1) }). Use tool({...}) + explicit aiJsonSchema(...) so the AI SDK serializes the schema verbatim (no Zod→JSON quirks). Preserves the existing compiledSchema branch for actions with a declared outputType:.

packages/fsm-engine/llm-provider-adapter.ts — pass providerOptions: { groq: { strictJsonSchema: true } } (or openai: { strictJsonSchema: true, structuredOutputs: true }) into streamText() for every non-Anthropic call. Strict mode constrains generation to the schema server-side; providers reject {} instead of accepting it.
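The provider-keyed options described above can be sketched as a small lookup. This is an illustration only: `strictModeOptionsFor` and `ProviderOptionsSketch` are hypothetical names, and the sketch assumes the provider arrives as a bare key such as `groq` or `openai`. Only the option names themselves come from this PR.

```typescript
// Provider options shape: provider key -> flat object of boolean flags.
type ProviderOptionsSketch = Record<string, Record<string, boolean>>;

// Hypothetical helper. `provider` is assumed to be a bare key such as "groq".
function strictModeOptionsFor(provider: string): ProviderOptionsSketch {
  switch (provider) {
    case "groq":
      return { groq: { strictJsonSchema: true } };
    case "openai":
      return { openai: { strictJsonSchema: true, structuredOutputs: true } };
    default:
      // Anthropic and anything else: no strict-mode flags are added.
      return {};
  }
}

const groqOptions = strictModeOptionsFor("groq");
const openaiOptions = strictModeOptionsFor("openai");
const anthropicOptions = strictModeOptionsFor("anthropic");
```

The result would be passed as `providerOptions` to `streamText()`. Gating on the provider alone, with no model-name check, matches the "no whitelist" decision in this PR.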

Why no model whitelist

An earlier revision gated strict mode by a model-name regex (llama-4-scout, gpt-4o, etc.) to avoid HTTP 400 from models that don't support response_format: json_schema. That was wrong:

Those same models are already broken under this code path — they silently call complete({}) and the FSM throws "did not call complete."
A 400 with a clear provider error message ("model X doesn't support json_schema") is a strictly better failure mode than silent empty output.
New models (Groq adds them weekly, OpenAI adds them periodically) silently regress with a whitelist; they work by default without one.

So strict mode is enabled for every Groq + OpenAI call. Models that can't honor it fail visibly, with an actionable error. Models that can — including everything modern + everything routed through OpenRouter/LiteLLM — get schema enforcement automatically.

What still needs work (separate PRs)

Workspace-chat path (packages/core/src/agent-conversion/from-llm.ts:154) has the same complete() injection and the same brittleness. Not touched here.
The synthetic complete() tool itself — when an FSM action declares no outputType:, we still inject a {result: string} schema and force the model through it. Better default would be to skip injection entirely and use the streamed text response. Behavior change for ~28% of existing untyped outputTo: actions, needs its own review.
OPENAI_BASE_URL passthrough for OpenAI-compatible endpoints (Ollama, LM Studio) — was originally bundled into this PR, split out to keep scope tight.

Tracked in docs/plans/free-models-bugs.md.

Test plan

@atlas/fsm-engine and @atlas/llm test suites pass locally (with friday.yml moved aside)
End-to-end on groq:meta-llama/llama-4-scout-17b-16e-instruct: tool-calling FSM actions succeed across multiple runs, including memory writes that round-trip
Anthropic path unchanged — providerOptions.groq/openai is ignored by the Anthropic provider
Models that don't support response_format: json_schema now return a clear HTTP 400 instead of silently emitting empty output

Follow-up: two more strict-mode gaps + eval matrix

Caught running the full tools/qa suite against three providers.

2a. Per-action override bypassed strict mode. providerName came from defaultModel.provider, so workspaces overriding provider per-action never engaged strict mode. Fix: scope the override to the strict-mode block only (keeps isDefaultOptsProvider semantics intact — broadening it would silently flip Anthropic cacheControl on every user message).

2b. Typed outputType: lost additionalProperties: false on the wire. The compiled-schema path round-tripped through Zod, which serialized catchall: unknown instead. OpenAI/Groq strict mode silently downgrades without additionalProperties: false. New withStrictObjects() walks the JSON Schema recursively, injects strictness at every object level, hands it to aiJsonSchema() verbatim.
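A rough sketch of the recursive walk in 2b follows. It is not the PR's `withStrictObjects`; the function name, the loose node type, and the field names in the example schema are assumptions. It also treats a node with `properties` but no `type` as an object node, and it leaves composition keywords such as `oneOf` / `anyOf` / `allOf` untouched.

```typescript
// Loose JSON Schema node type, assumed for this sketch.
type JsonSchemaNode = { [key: string]: unknown };

function isRecord(value: unknown): value is JsonSchemaNode {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a copy of `schema` with additionalProperties: false on every object
// node that does not already set additionalProperties. The input is not mutated.
function withStrictObjectsSketch(schema: unknown): unknown {
  if (!isRecord(schema)) return schema; // non-record passthrough
  const out: JsonSchemaNode = { ...schema };

  const properties = out["properties"];
  if (isRecord(properties)) {
    const strictProps: JsonSchemaNode = {};
    for (const [key, child] of Object.entries(properties)) {
      strictProps[key] = withStrictObjectsSketch(child);
    }
    out["properties"] = strictProps;
  }

  // Arrays: recurse into the item schema so nested objects are covered too.
  if (out["items"] !== undefined) {
    out["items"] = withStrictObjectsSketch(out["items"]);
  }

  // An object node is either typed "object" or carries a properties map.
  const isObjectNode = out["type"] === "object" || isRecord(properties);
  if (isObjectNode && out["additionalProperties"] === undefined) {
    out["additionalProperties"] = false;
  }
  return out;
}

// Example: an array of nested objects; the inner node omits `type` on purpose.
const batchSchema = {
  type: "object",
  properties: {
    emails: {
      type: "array",
      items: { properties: { subject: { type: "string" } }, required: ["subject"] },
    },
  },
  required: ["emails"],
};
const strictBatchSchema = withStrictObjectsSketch(batchSchema);
```

The strict copy carries `additionalProperties: false` on both the root and the array's item schema, while `batchSchema` itself is unchanged.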

Eval matrix (FSM-only sub-tests, 20 scenarios)

tools/qa/run-evals.sh --profile {anthropic,groq,openai}. FSM-only scope excludes 4 workspace-chat scenarios (called out as "not touched here") + 1 Python user-agent plumbing case — same failures across all providers.

| Provider / Model | Pass | Notes |
| --- | --- | --- |
| anthropic / claude-sonnet-4-6 | 20/20 (100%) | baseline |
| openai / gpt-4o-mini (via OpenRouter) | 15/20 (75%) | small-model multi-field limit |
| groq / llama-4-scout-17b-16e-instruct | 12/20 (60%) | scout struggles more with 3-field required outputs |

Wire schema verified strict on every call via instrumented log during development. Remaining failures are scenarios whose outputType: has 3+ required fields — model-quality limit, not strict-mode regression. Single-field schemas pass on all providers.

Test coverage

Added: 9 unit tests for withStrictObjects covering nested-objects-in-arrays, deep nesting, additionalProperties preserved, untyped-object trigger, immutability, non-record passthrough.

Skipped (intentionally): wire-payload mocks (would couple to AI SDK version-pinned internals; live matrix is the integration test); targeted strict-mode-override regression (workspace.yml fixtures already parameterize per-action provider/model, every Groq/OpenRouter run exercises it); harness-fix unit tests (inverse-assertion — the fact that daemon suites run at all is the regression signal).

Targeted `complete()` tool-call eval

Added tools/qa/complete-eval/ — promptfoo-native, no daemon. Each case is one HTTP call per provider per schema shape. The full 5-case × 3-provider matrix runs in ~17s with ~2K tokens (vs. the daemon-based suite at 50K+ tokens and several minutes), so it can be re-run cheaply on every change.

Five cases cover the schema shapes the FSM ships: 1-field smoke, 3 flat primitives (ReviewResult), primitive + string[] (ArrayReviewResult), array of nested objects (EmailBatch — the regression surface for additionalProperties propagation), 4 fields incl. boolean (AgentHydrationResult). Each case ships in two variants: post-fix (current wire shape) and pre-fix (permissive schema, no strict, no forced tool_choice).

| | Anthropic claude-sonnet-4-6 | Groq llama-4-scout | OpenRouter openai/gpt-oss-120b |
| --- | --- | --- | --- |
| Pre-fix | 4/5 | 3/5 | 5/5 |
| Post-fix | 5/5 | 5/5 | 5/5 |

Pre-fix 12/15 (80%) → Post-fix 15/15 (100%). The fix prevents three measurable failure modes:

Groq scout type coercion (cases 2, 5): without additionalProperties: false + strict: true, scout emits count: "12" (string) instead of 12 (number) and the server-side validator 400s the call. With the fix, wire-level constrained decoding forces the right type.
Anthropic skipping the tool call (case 3): without forced tool_choice: { type: tool, name: complete }, sonnet returns natural text. With the fix, the tool is forced.
Strict-mode-rejected complete({}) calls (not surfaced in these particular prompts, but the foundational reason the FSM rejects an empty/incomplete call vs accepting it silently).

This eval is the integration test for the wire-shape contract the fix produces — fast enough to iterate on, narrow enough to interpret.

The @ai-sdk/openai provider accepts a `baseURL` setting but Friday's `createOpenAIWithOptions` was not forwarding it. Without that, the openai provider was pinned to api.openai.com and could not target local runners (Ollama, LM Studio, llama.cpp) or alternative free hosted providers (OpenRouter) that expose an OpenAI-compatible API. Read from the `OPENAI_BASE_URL` env var by default, with an explicit `baseURL` option override. Three-line addition; unlocks every OpenAI-compatible endpoint without further code changes.
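The precedence described above (explicit option first, then the environment variable) can be sketched as follows. `resolveBaseURL` and `OpenAIOptionsSketch` are hypothetical names, not Friday's `createOpenAIWithOptions`, and the environment is passed in as a plain record so the sketch stays runtime-agnostic.

```typescript
// Hypothetical options shape; only baseURL matters for this sketch.
type OpenAIOptionsSketch = { apiKey?: string; baseURL?: string };

// Explicit option wins; otherwise fall back to OPENAI_BASE_URL from the environment.
// Returning undefined leaves the provider on its built-in default endpoint.
function resolveBaseURL(
  options: OpenAIOptionsSketch,
  env: Record<string, string | undefined>,
): string | undefined {
  return options.baseURL ?? env["OPENAI_BASE_URL"];
}

// Example values only: a local Ollama endpoint and an explicit override.
const fromEnv = resolveBaseURL({}, { OPENAI_BASE_URL: "http://localhost:11434/v1" });
const fromOption = resolveBaseURL(
  { baseURL: "https://openrouter.ai/api/v1" },
  { OPENAI_BASE_URL: "http://localhost:11434/v1" },
);
const unset = resolveBaseURL({}, {});
```

The resolved value would then be forwarded as the provider's `baseURL` setting.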

When an FSM action declares `outputTo:` without an explicit `outputType:`, the engine injects a synthetic `complete()` tool the model must call to emit structured output. The previous default schema was `z.record(z.string(), z.unknown()).refine(non-empty)`, which compiles to JSON Schema `{type: "object", additionalProperties: true}` with no required keys. The runtime `.refine(non-empty)` doesn't translate to JSON Schema, so the schema sent to the provider is effectively "any object." Claude infers a sensible shape from the tool's description prose; smaller models (Groq llama-*, gpt-oss-*, etc.) read the empty schema literally and return `complete({})`, failing FSM validation.

Tested across ~12 free model/provider combos: gpt-oss-120b/20b, llama-3.3-70b, llama-4-scout, qwen3-*, deepseek-chat-v3, minimax-m2.5, glm-4.5-air, plus their OpenRouter variants. Every one failed before this fix.

Two-layer fix:

1) Tighten the default schema in `packages/fsm-engine/fsm-engine.ts` to a concrete `{result: string, min 1}` shape using explicit `aiJsonSchema(...)` so the schema sent over the wire actually has `required: ["result"]` and `additionalProperties: false`. `tool()` wraps the definition so the AI SDK runs its full serialization path (matches the `failStep` tool pattern).

2) In `packages/fsm-engine/llm-provider-adapter.ts`, pass `providerOptions: { groq: { strictJsonSchema: true } }` (or `openai: { strictJsonSchema: true, structuredOutputs: true }`) to `streamText()`. Without this the AI SDK does not set `strict: true` on the function definition server-side, so the provider treats the JSON Schema as advisory and accepts schema-violating tool calls. Verified by instrumenting the outgoing HTTP request: before this change the function def has `strict: null`; after, generation is constrained to the schema.

Model gating: strict mode triggers `response_format: json_schema` on the wire, which is supported by Llama 4 family on Groq and gpt-4o/4.1/5 family on OpenAI, but not by gpt-oss-* or llama-3.x. Enabling unconditionally would break those models with HTTP 400, so we feature-detect by model name regex for now. A proper capability registry is the right long-term answer.

Verified end-to-end on the DnD demo workspace with `meta-llama/llama-4-scout-17b-16e-instruct` via Groq free tier:

- `view-roster` (read-only, 1 tool): consistent success across 5+ consecutive runs, 1-3s each.
- `generate-npc` with payload `{prompt: "a roguish bard..."}`: read memory, generated full 5e stat block, saved to memory, returned summary. 3.1s end-to-end. The saved NPC was retrievable on the next `view-roster` call.

Known limitation: gpt-oss-* and llama-3.x cannot be made to work through this path because they do not support `response_format: json_schema` server-side. Either `experimental_repairToolCall` retries or a non-strict fallback would be needed for those models; that is out of scope here.

Anthropic continues to work as before (enforces tool schemas natively, ignores the strictJsonSchema flag).

…te() injection

- Remove dead `outputSchema` variable; inline `compiledSchema ?? aiJsonSchema(...)` as `inputSchema` so typed-output documents use their declared schema again (the previous diff silently dropped the compiledSchema branch).
- Re-sort `ai` imports for Biome's organize-imports rule.
- Type `strictModeProviderOptions` with a literal shape so it satisfies `SharedV3ProviderOptions` (Record<string, JSONObject>) instead of the unknown-valued record TS rejected.

…iders" This reverts commit f596a14.

… regex

The model-name regex was a whitelist of two families (Groq Llama 4 + OpenAI gpt-4o), supposedly to protect unsupported models from HTTP 400 on `response_format: json_schema`. But those models were already broken under this code path: they silently emit `complete({})` and the FSM fails with "did not call complete." A 400 is the upgrade: it surfaces the unsupported model clearly instead of leaving the user to debug an empty FSM output.

Gate by provider only. Every Groq + OpenAI call gets strict mode; Anthropic ignores the option. OpenRouter / LiteLLM passthroughs benefit for any backend that supports the format. New models work by default.

basedfriday · 2026-05-13T17:48:30Z

@LissaGreense need to regression test full eval suite on anthropic (should be 100% pass as that's main case today) and the new providers.

1. `providerName` was derived from `defaultModel.provider`, so per-action `provider:` overrides in workspace.yml never engaged strict mode, which is the very case `provider: groq` in an FSM action exists for. Use `params.provider ?? defaultModel.provider` so overrides flow through.
2. Typed `outputType:` schemas round-tripped through Zod, which serialized `catchall: unknown` instead of `additionalProperties: false`. OpenAI / Groq silently downgrade strict mode when additionalProperties isn't forced false, letting `complete({})` slip through against a schema that ostensibly required fields. Skip the Zod round-trip on the wire: walk the workspace's JSON Schema and inject `additionalProperties: false` at every object level, then hand it to `aiJsonSchema()` verbatim.

Caught both running the first-principles eval suite against gpt-4o-mini via OpenRouter: pre-fix the model emitted `complete({})` against `ReviewResult{marker,count,firstId}` and the FSM stored an empty doc; post-fix the wire payload carries `additionalProperties:false` at every nested object and the model fills all required fields.

Makes tools/qa eval suites runnable against any registry provider, not just the workspace.yml-hardcoded `anthropic` + `claude-sonnet-4-6`, and unblocks them on macOS dev environments where pre-existing harness regressions silently masked every daemon-based scenario.

- tools/qa/run-evals.sh: parallel runner over every promptfoo config in tools/qa. `--profile {anthropic,groq,openai}` selects the `FRIDAY_QA_PROVIDER` / `FRIDAY_QA_MODEL` pair; suites tagged `core` (FSM/daemon), `prompt` (Anthropic-only inline-fetch), `no-llm` (model-agnostic). Routes everything through `npx promptfoo eval`.
- Workspace fixtures use `__FRIDAY_QA_PROVIDER__` / `__FRIDAY_QA_MODEL__` placeholders; `qaProviderReplacements()` in harness substitutes from env at materialize time, default anthropic / claude-sonnet-4-6.

Harness fixes (all pre-existing regressions, surface only now that the provider matrix is exercised):

- realpath FRIDAY_HOME on the way in. macOS Deno.makeTempDir returns /var/folders/... but realpath gives /private/var/folders/... `isUnderHome` (#265) does a literal prefix check and masked every registered workspace as cross-home.
- Materialize fixture YAMLs inside FRIDAY_HOME via `qaWorkspaceTmpRoot()` rather than the system TMPDIR, for the same isUnderHome reason.
- Bypass `deno task atlas` for the daemon spawn; the task hard-codes `FRIDAY_HOME=$HOME/.atlas` as an inline shell assignment, silently overriding the isolated path in env.
- Force-blank FRIDAY_TLS_CERT/KEY/CA in the spawn env. With #243 the daemon binds HTTPS when these are set + readable, and the harness's HTTP `/health` check times out at 90s with no clear cause.
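The placeholder substitution described above could look roughly like the sketch below. Both function names are hypothetical stand-ins (the real helper is `qaProviderReplacements()` in the harness, whose signature is not shown in this PR), and the environment is passed in as a plain record rather than read from the runtime.

```typescript
// Builds the placeholder -> value table from env, with the stated defaults.
function qaProviderReplacementsSketch(
  env: Record<string, string | undefined>,
): Record<string, string> {
  return {
    __FRIDAY_QA_PROVIDER__: env["FRIDAY_QA_PROVIDER"] ?? "anthropic",
    __FRIDAY_QA_MODEL__: env["FRIDAY_QA_MODEL"] ?? "claude-sonnet-4-6",
  };
}

// Replaces every occurrence of each placeholder in the fixture text.
function materializeFixtureSketch(yaml: string, replacements: Record<string, string>): string {
  let out = yaml;
  for (const [placeholder, value] of Object.entries(replacements)) {
    out = out.split(placeholder).join(value);
  }
  return out;
}

// Illustrative fixture fragment; the key names are assumptions.
const fixture = "provider: __FRIDAY_QA_PROVIDER__\nmodel: __FRIDAY_QA_MODEL__";
const defaultFixture = materializeFixtureSketch(fixture, qaProviderReplacementsSketch({}));
const groqFixture = materializeFixtureSketch(
  fixture,
  qaProviderReplacementsSketch({
    FRIDAY_QA_PROVIDER: "groq",
    FRIDAY_QA_MODEL: "meta-llama/llama-4-scout-17b-16e-instruct",
  }),
);
```

With an empty environment the fixture materializes to the Anthropic defaults, so existing runs keep their behavior.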

basedfriday · 2026-05-14T04:52:53Z

@LissaGreense this work is looking good. We're gonna target release of this for 0.1.9 as we're focusing next release on some critical fixes.

Removed step-by-step narrator comment over the docTypeName lookup in FSMEngine.run() and trimmed the duplicate "Build lines, sort, then split" narration in run-evals.sh's results loop.

Commit 4808573 widened `providerName` from `this.defaultModel.provider` to `params.provider ?? this.defaultModel.provider` to fix strict-mode gating for per-action `provider:` overrides in workspace.yml. That fix worked for the groq/openai strict checks, but it also flowed into `isDefaultOptsProvider(providerName)`, and there `this.defaultModel.provider` returns the AI SDK's surface-qualified id like `"anthropic.messages"` (not the bare workspace key `"anthropic"`). So before 4808573 the Anthropic branch hit the fallback empty `{}`; after 4808573 it started returning `getDefaultProviderOpts("anthropic")` = `{ anthropic: { cacheControl: { ttl: "1h" } } }` as `defaultUserProviderOptions`, which gets attached to every user message: a wire-level cache_control change no one signed up for.

Revert `providerName` to its pre-4808573 value and introduce a local `effectiveProvider = params.provider ?? providerName` scoped to the strict-mode block, where the per-action override actually needs to participate. `isAnthropic`, `isDefaultOptsProvider`, `modelIdForLog`, and the error logger all keep their original behavior.

## Progress

- Task: revert cache_control side-effect from 4808573 surfaced in review
- Decisions: scoped the override to the strict-mode block only; kept the rest of the function reading the pre-existing `providerName` so no other wire-level behavior shifts.
- Key Learnings: AI SDK provider ids are surface-qualified (`anthropic.messages`, `anthropic.tools`), not the bare workspace key. `startsWith("anthropic")` masks the difference; equality and table lookups (like `isDefaultOptsProvider`) don't. When widening a variable that feeds multiple branches, check every consumer: `startsWith` passes can hide a real flip in a sibling exact-match check.
- Files: packages/fsm-engine/llm-provider-adapter.ts
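The prefix-versus-exact-match trap described above can be reproduced in a few lines. This is an illustration under assumptions: the `Map` is a hypothetical stand-in for whatever table backs `isDefaultOptsProvider`, and `resolveProviders` is an invented helper that only mirrors the scoping of `effectiveProvider`.

```typescript
// Hypothetical stand-in for the default-options table, keyed by bare workspace key.
const defaultProviderOpts = new Map<string, object>([
  ["anthropic", { anthropic: { cacheControl: { ttl: "1h" } } }],
]);

const sdkProviderId = "anthropic.messages"; // surface-qualified id from the AI SDK
const workspaceKey = "anthropic"; // bare key as written in workspace.yml

// Prefix checks cannot tell the two forms apart.
const prefixMatchSdk = sdkProviderId.startsWith("anthropic");
const prefixMatchKey = workspaceKey.startsWith("anthropic");

// Exact-match lookups can, which is where the behavior flipped.
const tableHitSdk = defaultProviderOpts.has(sdkProviderId);
const tableHitKey = defaultProviderOpts.has(workspaceKey);

// Keep providerName as-is for existing consumers; apply the per-action
// override only to the value used for strict-mode gating.
function resolveProviders(defaultProvider: string, override?: string) {
  const providerName = defaultProvider;
  const effectiveProvider = override ?? providerName;
  return { providerName, effectiveProvider };
}

const resolved = resolveProviders("anthropic.messages", "groq");
```

Both prefix checks pass, but only the bare key hits the table, so swapping one form for the other silently changes which default options are attached.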

JSON Schema allows omitting `type:` when `properties:` is set; design review flagged that the previous `out.type === "object"` gate skipped those nodes, leaving strict mode silently downgraded. Treat presence of a `properties` object as implying object-node. Also document the still-unhandled cases (`oneOf` / `anyOf` / `allOf` / `patternProperties` / schema-form `additionalProperties`) so the next author gets a hint instead of debugging strict-mode silently downgrading.

## Progress

- Task: design-review follow-up on fix/fsm-strict-mode-free-models
- Decisions: only widen the gate; defer composition-keyword recursion until a workspace actually needs it (YAGNI). Note left for the next author.
- Key Learnings: OpenAI/Groq strict mode silently downgrades to permissive when `additionalProperties` is absent on any object node. There's no error, just bad results. Any code that injects strict-mode schemas needs to also handle the type-less object form because JSON Schema doesn't require `type` when `properties` is set.
- Files: packages/fsm-engine/fsm-engine.ts

…ree-models # Conflicts: # tools/qa/live-daemon/harness.ts # tools/qa/live-daemon/scenarios/first-principles.ts # tools/qa/live-daemon/scenarios/tool-suite-management.ts

- 9 cases: nested-objects-in-arrays, deep nesting, additionalProperties preserved, untyped-object trigger (the new branch), input immutability, non-record passthrough. Catches recursion-breaking refactors without paying for a live OpenAI/Groq run.
- run-evals.sh: let FRIDAY_QA_MODEL override the profile default (`${FRIDAY_QA_MODEL:-…}` fallback) so callers can run gpt-4o-mini via OpenRouter as `openai/gpt-4o-mini` without forking the script.

Re-type withStrictObjects to take/return the existing JSONSchema interface instead of Record<string, unknown>, dropping recursion-site casts in the function body and the one production caller. Rewrite the unit test's nested-property assertions with toMatchObject so they no longer need brittle Record-of-Record casts that tripped noUncheckedIndexedAccess. Also bump biome.json $schema to 2.4.14 to match the CLI version CI runs.

…aemon

Isolates the FSM's auto-injected complete() tool from the daemon-based first-principles suite. Each case is one HTTP call per provider; full 5×3 matrix runs in ~17s with ~2K tokens vs the daemon suite's 50K+ tokens and several minutes.

Five cases cover the schema shapes used by workspace.yml fixtures (ReviewResult / ArrayReviewResult / EmailBatch / AgentHydrationResult) plus a single-field smoke. Each ships in two variants:

- case-N-*.yaml (post-PR-#303 wire shape): `additionalProperties: false` recursively, `strict: true` on OpenAI-style function defs, forced `tool_choice` so the model can't bail to natural text.
- case-N-pre-fix.yaml (pre-PR-#303 wire shape): permissive schema, no strict flag, no forced tool choice.

Validated on 2026-05-15: pre-fix 12/15 (80%), post-fix 15/15 (100%) across Anthropic claude-sonnet-4-6, Groq llama-4-scout, OpenRouter openai/gpt-oss-120b. The fix's three measurable failure modes:

- Groq scout emits `count: "12"` (string) for number-typed fields without wire-level constrained decoding (cases 2, 5)
- Anthropic falls back to a natural-text response instead of calling the tool without forced tool_choice (case 3)

Run via README: promptfoo-native providers + `OPENROUTER_API_KEY` aliased from the OpenRouter key stored in `OPENAI_API_KEY`.

CI's trailing-whitespace check (Lint JS workflow) was rejecting case-1-pre-fix.yaml and case-3-pre-fix.yaml. Pure whitespace cleanup, no semantic change.

Re-measuring the validated delta surfaced three methodology gaps:

- Both pre-fix and post-fix YAMLs now use the same tool description string the production FSM ships (fsm-engine.ts:1461-1462). Earlier revisions gave the post-fix variant a tailored "Required fields: X, Y, Z" prose that leaked schema info into the prompt and inflated post-fix pass rates. Case-1 and case-3 pre-fix were also still on the stale tailored prose; both fixed.
- Sample size goes from n=1 to n=5 per cell via --repeat 5 in the run snippet. At n=1 a missed tool call is indistinguishable from a real regression for non-deterministic models.
- Case 5 no longer hand-feeds `sawBodySentinel=true` in the prompt; the model derives the boolean from stated facts, which is what actually exercises strict-mode type coercion.

Adds openrouter:nvidia/nemotron-3-super-120b-a12b:free as a fourth provider to widen the OpenAI-compatible-strict surface beyond gpt-oss-120b.

Re-measured matrix (n=5):

- Anthropic 18/25 → 25/25 (72% → 100%), driven by case-3 tool_choice forcing (1/5 → 5/5)
- OpenRouter gpt-oss-120b 25/25 → 25/25; previous 24/25 didn't replicate, was n=5 noise
- OpenRouter nemotron-super 25/25 → 25/25
- Groq 15/25 → 14/25: the eval doesn't engage Groq's real strict mode (promptfoo's tool-level `strict: true` is not the same wire shape as the FSM's providerOptions.groq.strictJsonSchema → Groq's response_format: json_schema). README documents the caveat; the daemon-based first-principles suite is the authoritative Groq test.

LissaGreense added 2 commits May 13, 2026 16:15

LissaGreense requested review from Vpr99 and ljagiello as code owners May 13, 2026 14:16

LissaGreense marked this pull request as draft May 13, 2026 16:04

Revert "feat(llm): respect OPENAI_BASE_URL for OpenAI-compatible prov…

b20984c

LissaGreense changed the title ~~Make FSM agents work on non-Anthropic models (Groq Llama 4, GPT-4o, etc.)~~ fix(fsm): make complete() work on non-Anthropic models May 13, 2026

LissaGreense added 2 commits May 14, 2026 01:09

LissaGreense added 10 commits May 14, 2026 17:58

chore(polish): drop narrator comments from fsm-engine and run-evals

5241b35

chore: biome formatter pass

6275e82

chore: gitignore tools/qa/results

b96d8e8

Merge remote-tracking branch 'origin/main' into fix/fsm-strict-mode-f…

3cb5204

chore(qa): strip trailing whitespace from pre-fix case yamls

aa2cf82

LissaGreense marked this pull request as ready for review May 15, 2026 18:20

LissaGreense wants to merge 18 commits into main from fix/fsm-strict-mode-free-models
