fix(fsm): make complete() work on non-Anthropic models#303

Open
LissaGreense wants to merge 18 commits into main from fix/fsm-strict-mode-free-models

Conversation

@LissaGreense
Contributor

@LissaGreense LissaGreense commented May 13, 2026

Summary

The auto-injected complete() tool in FSM outputTo: actions only worked on Anthropic. Other providers either skipped the call or invoked complete({}), breaking every non-Claude FSM agent.

Root cause

Two layers, both in the path that ships complete()'s schema to the provider:

  1. Schema too loose. The default fell back to z.record(z.unknown()).refine(non-empty, ...). That compiles to {type: "object", additionalProperties: true} with no required keys — the runtime .refine() doesn't translate to JSON Schema. Anthropic infers good keys from the description prose; everything else reads "object with any keys" as "empty is fine."
  2. Strict mode never set. Even with a tighter schema, the AI SDK doesn't send strict: true on the function definition. OpenAI-compatible providers (Groq, OpenAI, OpenRouter, LiteLLM) treat tool schemas as advisory without it.
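
To make the first layer concrete, here is a minimal sketch of the two wire schemas side by side. The `accepts` helper is a toy validator written for this illustration only (it is not the FSM's validator and covers just the keywords shown); the schema literals follow the shapes described above.

```typescript
// Minimal JSON Schema subset, defined for this illustration only.
type ObjectSchema = {
  type: "object";
  properties?: Record<string, { type: "string"; minLength?: number }>;
  required?: string[];
  additionalProperties?: boolean;
};

// What the old z.record(...).refine(non-empty) default amounted to on the wire:
// refine() is a runtime check with no JSON Schema form, so nothing is required.
const looseSchema: ObjectSchema = { type: "object", additionalProperties: true };

// The tightened default: exactly one required, non-empty string.
const tightSchema: ObjectSchema = {
  type: "object",
  properties: { result: { type: "string", minLength: 1 } },
  required: ["result"],
  additionalProperties: false,
};

// Toy validator (hypothetical helper) for the keywords used above.
function accepts(schema: ObjectSchema, value: Record<string, unknown>): boolean {
  for (const key of schema.required ?? []) {
    if (!(key in value)) return false;
  }
  for (const [key, v] of Object.entries(value)) {
    const prop = schema.properties?.[key];
    if (!prop) {
      // Unknown keys are rejected only when the schema closes the object.
      if (schema.additionalProperties === false) return false;
      continue;
    }
    if (typeof v !== "string") return false;
    if (v.length < (prop.minLength ?? 0)) return false;
  }
  return true;
}
```

Under the loose schema `complete({})` is a valid call, which is why a model that reads the schema literally has no reason to fill anything in.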

Changes

packages/fsm-engine/fsm-engine.ts — replace the open z.record fallback with z.object({ result: z.string().min(1) }). Use tool({...}) + explicit aiJsonSchema(...) so the AI SDK serializes the schema verbatim (no Zod→JSON quirks). Preserves the existing compiledSchema branch for actions with a declared outputType:.

packages/fsm-engine/llm-provider-adapter.ts — pass providerOptions: { groq: { strictJsonSchema: true } } (or openai: { strictJsonSchema: true, structuredOutputs: true }) into streamText() for every non-Anthropic call. Strict mode constrains generation to the schema server-side; providers reject {} instead of accepting it.
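
A rough sketch of the provider gating described above, for readers who want the shape without opening the diff. The function name and the prefix matching are assumptions made for illustration; only the option keys (`strictJsonSchema`, `structuredOutputs`) come from this PR.

```typescript
// Shape loosely modeled on a Record<string, JSONObject> provider-options bag.
type ProviderOptions = Record<string, Record<string, boolean>>;

// Hypothetical helper: returns the strict-mode options for a provider id.
// Prefix matching is an assumption, chosen so that surface-qualified ids
// (for example an id with a ".chat" suffix) still match.
function strictModeProviderOptions(provider: string): ProviderOptions {
  if (provider.startsWith("groq")) {
    return { groq: { strictJsonSchema: true } };
  }
  if (provider.startsWith("openai")) {
    return { openai: { strictJsonSchema: true, structuredOutputs: true } };
  }
  // Anthropic and anything else: no strict-mode options are added.
  return {};
}
```

The result would be merged into the `providerOptions` argument of `streamText()`; there is deliberately no model-name check here, for the reasons given in the next section.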

Why no model whitelist

An earlier revision gated strict mode by a model-name regex (llama-4-scout, gpt-4o, etc.) to avoid HTTP 400 from models that don't support response_format: json_schema. That was wrong:

  • Those same models are already broken under this code path — they silently call complete({}) and the FSM throws "did not call complete."
  • A 400 with a clear provider error message ("model X doesn't support json_schema") is a strictly better failure mode than silent empty output.
  • New models (Groq adds them weekly, OpenAI adds them periodically) silently regress with a whitelist; they work by default without one.

So strict mode is enabled for every Groq + OpenAI call. Models that can't honor it fail visibly, with an actionable error. Models that can — including everything modern + everything routed through OpenRouter/LiteLLM — get schema enforcement automatically.

What still needs work (separate PRs)

  • Workspace-chat path (packages/core/src/agent-conversion/from-llm.ts:154) has the same complete() injection and the same brittleness. Not touched here.
  • The synthetic complete() tool itself — when an FSM action declares no outputType:, we still inject a {result: string} schema and force the model through it. Better default would be to skip injection entirely and use the streamed text response. Behavior change for ~28% of existing untyped outputTo: actions, needs its own review.
  • OPENAI_BASE_URL passthrough for OpenAI-compatible endpoints (Ollama, LM Studio) — was originally bundled into this PR, split out to keep scope tight.

Tracked in docs/plans/free-models-bugs.md.

Test plan

  • @atlas/fsm-engine and @atlas/llm test suites pass locally (with friday.yml moved aside)
  • End-to-end on groq:meta-llama/llama-4-scout-17b-16e-instruct: tool-calling FSM actions succeed across multiple runs, including memory writes that round-trip
  • Anthropic path unchanged — providerOptions.groq/openai is ignored by the Anthropic provider
  • Models that don't support response_format: json_schema now return a clear HTTP 400 instead of silently emitting empty output

Follow-up: two more strict-mode gaps + eval matrix

Caught running the full tools/qa suite against three providers.

2a. Per-action override bypassed strict mode. providerName came from defaultModel.provider, so workspaces overriding provider per-action never engaged strict mode. Fix: scope the override to the strict-mode block only (keeps isDefaultOptsProvider semantics intact — broadening it would silently flip Anthropic cacheControl on every user message).

2b. Typed outputType: lost additionalProperties: false on the wire. The compiled-schema path round-tripped through Zod, which serialized catchall: unknown instead. OpenAI/Groq strict mode silently downgrades without additionalProperties: false. New withStrictObjects() walks the JSON Schema recursively, injects strictness at every object level, hands it to aiJsonSchema() verbatim.
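
For orientation, a minimal sketch of what a walker like `withStrictObjects()` could look like. This is not the code in the diff: the `JSONSchema` type below is a local stand-in, and leaving an explicit `additionalProperties` value untouched is an assumption based on the "additionalProperties preserved" test name. Composition keywords (`oneOf` / `anyOf` / `allOf`), `patternProperties`, and schema-form `additionalProperties` are not visited.

```typescript
// Local stand-in for the project's JSONSchema interface (assumption).
type JSONSchema = {
  type?: string;
  properties?: Record<string, JSONSchema>;
  items?: JSONSchema;
  required?: string[];
  additionalProperties?: boolean | JSONSchema;
  [keyword: string]: unknown;
};

// Returns a copy of the schema with additionalProperties: false on every
// object node that does not already set it. The input is never mutated.
function withStrictObjects(schema: JSONSchema): JSONSchema {
  const out: JSONSchema = { ...schema };

  if (schema.properties) {
    const props: Record<string, JSONSchema> = {};
    for (const [key, child] of Object.entries(schema.properties)) {
      props[key] = withStrictObjects(child);
    }
    out.properties = props;
  }

  if (schema.items) {
    out.items = withStrictObjects(schema.items);
  }

  // JSON Schema allows omitting `type` when `properties` is present,
  // so either signal marks an object node.
  const isObjectNode = schema.type === "object" || schema.properties !== undefined;
  if (isObjectNode && out.additionalProperties === undefined) {
    out.additionalProperties = false;
  }
  return out;
}
```

The output of such a walker is what gets handed to `aiJsonSchema()` verbatim, skipping the Zod round-trip that dropped the flag.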

Eval matrix (FSM-only sub-tests, 20 scenarios)

Run with tools/qa/run-evals.sh --profile {anthropic,groq,openai}. The FSM-only scope excludes 4 workspace-chat scenarios (the path called out above as "not touched here") and 1 Python user-agent plumbing case; those five fail identically on every provider.

| Provider / Model | Pass | Notes |
|---|---|---|
| anthropic / claude-sonnet-4-6 | 20/20 (100%) | baseline |
| openai / gpt-4o-mini (via OpenRouter) | 15/20 (75%) | small-model multi-field limit |
| groq / llama-4-scout-17b-16e-instruct | 12/20 (60%) | scout struggles more with 3-field required outputs |

Wire schema verified strict on every call via instrumented log during development. Remaining failures are scenarios whose outputType: has 3+ required fields — model-quality limit, not strict-mode regression. Single-field schemas pass on all providers.

Test coverage

Added: 9 unit tests for withStrictObjects covering nested-objects-in-arrays, deep nesting, additionalProperties preserved, untyped-object trigger, immutability, non-record passthrough.

Skipped (intentionally): wire-payload mocks (would couple to AI SDK version-pinned internals; live matrix is the integration test); targeted strict-mode-override regression (workspace.yml fixtures already parameterize per-action provider/model, every Groq/OpenRouter run exercises it); harness-fix unit tests (inverse-assertion — the fact that daemon suites run at all is the regression signal).

Targeted complete() tool-call eval

Added tools/qa/complete-eval/ — promptfoo-native, no daemon. Each case is one HTTP call per provider per schema shape. The full 5-case × 3-provider matrix runs in ~17s with ~2K tokens (vs. the daemon-based suite at 50K+ tokens and several minutes), so it can be re-run cheaply on every change.

Five cases cover the schema shapes the FSM ships: 1-field smoke, 3 flat primitives (ReviewResult), primitive + string[] (ArrayReviewResult), array of nested objects (EmailBatch — the regression surface for additionalProperties propagation), 4 fields incl. boolean (AgentHydrationResult). Each case ships in two variants: post-fix (current wire shape) and pre-fix (permissive schema, no strict, no forced tool_choice).

| | Anthropic claude-sonnet-4-6 | Groq llama-4-scout | OpenRouter openai/gpt-oss-120b |
|---|---|---|---|
| Pre-fix | 4/5 | 3/5 | 5/5 |
| Post-fix | 5/5 | 5/5 | 5/5 |

Pre-fix 12/15 (80%) → Post-fix 15/15 (100%). The fix prevents three measurable failure modes:

  • Groq scout type coercion (cases 2, 5): without additionalProperties: false + strict: true, scout emits count: "12" (string) instead of 12 (number) and the server-side validator 400s the call. With the fix, wire-level constrained decoding forces the right type.
  • Anthropic skipping the tool call (case 3): without forced tool_choice: { type: tool, name: complete }, sonnet returns natural text. With the fix, the tool is forced.
  • Strict-mode-rejected complete({}) calls (not surfaced in these particular prompts, but the foundational reason the FSM rejects an empty/incomplete call vs accepting it silently).

This eval is the integration test for the wire-shape contract the fix produces — fast enough to iterate on, narrow enough to interpret.

The @ai-sdk/openai provider accepts a `baseURL` setting but Friday's
`createOpenAIWithOptions` was not forwarding it. Without that, the
openai provider was pinned to api.openai.com and could not target
local runners (Ollama, LM Studio, llama.cpp) or alternative free
hosted providers (OpenRouter) that expose an OpenAI-compatible API.

Read from the `OPENAI_BASE_URL` env var by default, with an explicit
`baseURL` option override. Three-line addition; unlocks every
OpenAI-compatible endpoint without further code changes.
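
A minimal sketch of the precedence described above (explicit option first, then the env var). The helper name and the options type are hypothetical; the real `createOpenAIWithOptions` signature is not shown in this PR, and the URL in the usage note is only an example value.

```typescript
// Hypothetical options shape; only `baseURL` matters for this sketch.
type OpenAIOptions = { apiKey?: string; baseURL?: string };

// Resolves the base URL to forward to the OpenAI provider.
// `env` is passed in explicitly so the sketch has no runtime dependency.
function resolveBaseURL(
  options: OpenAIOptions,
  env: Record<string, string | undefined>,
): string | undefined {
  // An explicit option wins; otherwise fall back to OPENAI_BASE_URL.
  // undefined means "let the provider use its own default endpoint".
  return options.baseURL ?? env["OPENAI_BASE_URL"];
}
```

With `OPENAI_BASE_URL=http://localhost:11434/v1` set and no explicit option, the provider would target a local OpenAI-compatible server instead of api.openai.com.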
When an FSM action declares `outputTo:` without an explicit
`outputType:`, the engine injects a synthetic `complete()` tool the
model must call to emit structured output. The previous default
schema was

  z.record(z.string(), z.unknown()).refine(non-empty)

which compiles to JSON Schema `{type: "object", additionalProperties: true}`
— no required keys. The runtime `.refine(non-empty)` doesn't translate
to JSON Schema, so the schema sent to the provider is effectively "any
object." Claude infers a sensible shape from the tool's description
prose; smaller models (Groq llama-*, gpt-oss-*, etc.) read the empty
schema literally and return `complete({})`, failing FSM validation.

Tested across ~12 free model/provider combos: gpt-oss-120b/20b,
llama-3.3-70b, llama-4-scout, qwen3-*, deepseek-chat-v3, minimax-m2.5,
glm-4.5-air, plus their OpenRouter variants. Every one failed before
this fix.

Two-layer fix:

1) Tighten the default schema in `packages/fsm-engine/fsm-engine.ts`
   to a concrete `{result: string, min 1}` shape using explicit
   `aiJsonSchema(...)` so the schema sent over the wire actually
   has `required: ["result"]` and `additionalProperties: false`.
   `tool()` wraps the definition so the AI SDK runs its full
   serialization path (matches the `failStep` tool pattern).

2) In `packages/fsm-engine/llm-provider-adapter.ts`, pass
   `providerOptions: { groq: { strictJsonSchema: true } }` (or
   `openai: { strictJsonSchema: true, structuredOutputs: true }`)
   to `streamText()`. Without this the AI SDK does not set
   `strict: true` on the function definition server-side, so the
   provider treats the JSON Schema as advisory and accepts
   schema-violating tool calls. Verified by instrumenting the
   outgoing HTTP request: before this change the function def has
   `strict: null`; after, generation is constrained to the schema.

Model gating: strict mode triggers `response_format: json_schema`
on the wire, which is supported by Llama 4 family on Groq and
gpt-4o/4.1/5 family on OpenAI, but not by gpt-oss-* or llama-3.x.
Enabling unconditionally would break those models with HTTP 400, so
we feature-detect by model name regex for now. A proper capability
registry is the right long-term answer.

Verified end-to-end on the DnD demo workspace with
`meta-llama/llama-4-scout-17b-16e-instruct` via Groq free tier:

  - `view-roster` (read-only, 1 tool): consistent success across
    5+ consecutive runs, 1-3s each.
  - `generate-npc` with payload `{prompt: "a roguish bard..."}`:
    read memory, generated full 5e stat block, saved to memory,
    returned summary. 3.1s end-to-end. The saved NPC was
    retrievable on the next `view-roster` call.

Known limitation: gpt-oss-* and llama-3.x cannot be made to work
through this path because they do not support `response_format:
json_schema` server-side. Either `experimental_repairToolCall`
retries or a non-strict fallback would be needed for those models —
out of scope here.

Anthropic continues to work as before (enforces tool schemas
natively, ignores the strictJsonSchema flag).
…te() injection

- Remove dead `outputSchema` variable; inline `compiledSchema ?? aiJsonSchema(...)`
  as `inputSchema` so typed-output documents use their declared schema again
  (the previous diff silently dropped the compiledSchema branch).
- Re-sort `ai` imports for Biome's organize-imports rule.
- Type `strictModeProviderOptions` with a literal shape so it satisfies
  `SharedV3ProviderOptions` (Record<string, JSONObject>) instead of the
  unknown-valued record TS rejected.
@LissaGreense LissaGreense marked this pull request as draft May 13, 2026 16:04
@LissaGreense LissaGreense changed the title from "Make FSM agents work on non-Anthropic models (Groq Llama 4, GPT-4o, etc.)" to "fix(fsm): make complete() work on non-Anthropic models" May 13, 2026
… regex

The model-name regex was a whitelist of two families (Groq Llama 4 +
OpenAI gpt-4o), supposedly to protect unsupported models from HTTP 400
on `response_format: json_schema`. But those models were already broken
under this code path — they silently emit `complete({})` and the FSM
fails with "did not call complete." A 400 is the upgrade: it surfaces
the unsupported model clearly instead of leaving the user to debug an
empty FSM output.

Gate by provider only. Every Groq + OpenAI call gets strict mode;
Anthropic ignores the option. OpenRouter / LiteLLM passthroughs benefit
for any backend that supports the format. New models work by default.
@basedfriday
Collaborator

@LissaGreense need to regression test full eval suite on anthropic (should be 100% pass as that's main case today) and the new providers.

1. `providerName` was derived from `defaultModel.provider`, so per-action
   `provider:` overrides in workspace.yml never engaged strict mode —
   the very case `provider: groq` in an FSM action exists for. Use
   `params.provider ?? defaultModel.provider` so overrides flow through.

2. Typed `outputType:` schemas round-tripped through Zod, which serialized
   `catchall: unknown` instead of `additionalProperties: false`. OpenAI /
   Groq silently downgrade strict mode when additionalProperties isn't
   forced false, letting `complete({})` slip through against a schema
   that ostensibly required fields. Skip the Zod round-trip on the wire:
   walk the workspace's JSON Schema and inject `additionalProperties:
   false` at every object level, then hand it to `aiJsonSchema()`
   verbatim.

Caught both running the first-principles eval suite against gpt-4o-mini
via OpenRouter: pre-fix the model emitted `complete({})` against
`ReviewResult{marker,count,firstId}` and the FSM stored an empty doc;
post-fix the wire payload carries `additionalProperties:false` at every
nested object and the model fills all required fields.
Makes tools/qa eval suites runnable against any registry provider, not
just the workspace.yml-hardcoded `anthropic` + `claude-sonnet-4-6`, and
unblocks them on macOS dev environments where pre-existing harness
regressions silently masked every daemon-based scenario.

  - tools/qa/run-evals.sh — parallel runner over every promptfoo config
    in tools/qa. `--profile {anthropic,groq,openai}` selects the
    `FRIDAY_QA_PROVIDER` / `FRIDAY_QA_MODEL` pair; suites tagged `core`
    (FSM/daemon), `prompt` (Anthropic-only inline-fetch), `no-llm`
    (model-agnostic). Routes everything through `npx promptfoo eval`.
  - Workspace fixtures use `__FRIDAY_QA_PROVIDER__` / `__FRIDAY_QA_MODEL__`
    placeholders; `qaProviderReplacements()` in harness substitutes from
    env at materialize time, default anthropic / claude-sonnet-4-6.
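
As an illustration of the placeholder substitution, a small sketch follows. Both function names are hypothetical stand-ins (the harness's actual `qaProviderReplacements()` signature is not shown here); the placeholder tokens, env var names, and defaults are the ones listed above.

```typescript
// Hypothetical stand-in for the harness helper: builds the replacement
// table from the environment, defaulting to the Anthropic baseline.
function qaProviderReplacementsSketch(
  env: Record<string, string | undefined>,
): Record<string, string> {
  return {
    "__FRIDAY_QA_PROVIDER__": env["FRIDAY_QA_PROVIDER"] ?? "anthropic",
    "__FRIDAY_QA_MODEL__": env["FRIDAY_QA_MODEL"] ?? "claude-sonnet-4-6",
  };
}

// Applies every replacement to a fixture's YAML text (all occurrences).
function materializeFixture(yaml: string, replacements: Record<string, string>): string {
  let out = yaml;
  for (const [placeholder, value] of Object.entries(replacements)) {
    out = out.split(placeholder).join(value);
  }
  return out;
}
```

Keeping the defaults in the replacement table means a fixture run without any profile still behaves like the old hardcoded Anthropic setup.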

Harness fixes (all pre-existing regressions that surface only now that
the provider matrix is exercised):

  - realpath FRIDAY_HOME on the way in. macOS Deno.makeTempDir returns
    /var/folders/... but realpath gives /private/var/folders/...
    `isUnderHome` (#265) does a literal prefix check and masked every
    registered workspace as cross-home.
  - Materialize fixture YAMLs inside FRIDAY_HOME via `qaWorkspaceTmpRoot()`
    rather than the system TMPDIR — same isUnderHome reason.
  - Bypass `deno task atlas` for the daemon spawn; the task hard-codes
    `FRIDAY_HOME=$HOME/.atlas` as an inline shell assignment, silently
    overriding the isolated path in env.
  - Force-blank FRIDAY_TLS_CERT/KEY/CA in the spawn env. With #243 the
    daemon binds HTTPS when these are set + readable, and the harness's
    HTTP `/health` check times out at 90s with no clear cause.
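
The realpath issue in the first bullet is easy to reproduce in isolation. The sketch below is an illustration, not the project's `isUnderHome`: it assumes a plain string-prefix check, and `fakeRealpath` is a stand-in that models only the macOS `/var` to `/private/var` symlink rather than calling the filesystem.

```typescript
// Assumed shape of a literal prefix check (illustration only).
function isUnderHomeSketch(path: string, home: string): boolean {
  const prefix = home.endsWith("/") ? home : home + "/";
  return path === home || path.startsWith(prefix);
}

// Stand-in for realpath: resolves only the macOS /var -> /private/var symlink.
function fakeRealpath(p: string): string {
  return p.startsWith("/var/") ? "/private" + p : p;
}

// Temp dir as returned by the OS, and a workspace path as it looks after
// something downstream has resolved symlinks.
const rawHome = "/var/folders/ab/T/friday-home";
const workspacePath = "/private/var/folders/ab/T/friday-home/workspaces/ws1";
```

Normalizing FRIDAY_HOME once on the way in, as the fix does, makes both sides of the comparison use the same form.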
@basedfriday
Collaborator

@LissaGreense this work is looking good. We're gonna target release of this for 0.1.9 as we're focusing next release on some critical fixes.

Removed step-by-step narrator comment over the docTypeName lookup in
FSMEngine.run() and trimmed the duplicate "Build lines, sort, then split"
narration in run-evals.sh's results loop.
Commit 4808573 widened `providerName` from `this.defaultModel.provider`
to `params.provider ?? this.defaultModel.provider` to fix strict-mode
gating for per-action `provider:` overrides in workspace.yml. That fix
worked for the groq/openai strict checks, but it also flowed into
`isDefaultOptsProvider(providerName)` — and there `this.defaultModel.provider`
returns the AI SDK's surface-qualified id like `"anthropic.messages"`
(not the bare workspace key `"anthropic"`). So before 4808573 the
Anthropic branch hit the fallback empty `{}`; after 4808573 it started
returning `getDefaultProviderOpts("anthropic")` = `{ anthropic: {
cacheControl: { ttl: "1h" } } }` as `defaultUserProviderOptions`, which
gets attached to every user message — a wire-level cache_control change
no one signed up for.

Revert `providerName` to its pre-4808573 value and introduce a local
`effectiveProvider = params.provider ?? providerName` scoped to the
strict-mode block, where the per-action override actually needs to
participate. `isAnthropic`, `isDefaultOptsProvider`, `modelIdForLog`,
and the error logger all keep their original behavior.
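
A compact sketch of why the scoping matters. Everything here is a simplified stand-in for the adapter (the lookup table, `planProviders`, and the prefix-based strict check are assumptions); the provider ids and the `cacheControl` value are the ones quoted above.

```typescript
// Stand-in for the default-options table, keyed by bare workspace provider key.
const DEFAULT_PROVIDER_OPTS: Record<string, Record<string, unknown>> = {
  anthropic: { anthropic: { cacheControl: { ttl: "1h" } } },
};

// Exact-match lookup: "anthropic.messages" is NOT a key, "anthropic" is.
function isDefaultOptsProvider(name: string): boolean {
  return Object.prototype.hasOwnProperty.call(DEFAULT_PROVIDER_OPTS, name);
}

// Hypothetical summary of the two decisions that read the provider name.
function planProviders(defaultProvider: string, actionProvider?: string) {
  // Unchanged consumers keep reading the surface-qualified default id.
  const providerName = defaultProvider;
  // Only the strict-mode block sees the per-action override.
  const effectiveProvider = actionProvider ?? providerName;
  return {
    attachCacheControl: isDefaultOptsProvider(providerName),
    strictMode:
      effectiveProvider.startsWith("groq") || effectiveProvider.startsWith("openai"),
  };
}
```

Had the override flowed into `providerName` itself, a per-action `provider: anthropic` would have hit the exact-match table and started attaching `cacheControl`, which is the side effect being reverted.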

## Progress
- Task: revert cache_control side-effect from 4808573 surfaced in review
- Decisions: scoped the override to the strict-mode block only; kept the
  rest of the function reading the pre-existing `providerName` so no
  other wire-level behavior shifts.
- Key Learnings: AI SDK provider ids are surface-qualified
  (`anthropic.messages`, `anthropic.tools`), not the bare workspace key.
  `startsWith("anthropic")` masks the difference; equality and table
  lookups (like `isDefaultOptsProvider`) don't. When widening a variable
  that feeds multiple branches, check every consumer — `startsWith`
  passes can hide a real flip in a sibling exact-match check.
- Files: packages/fsm-engine/llm-provider-adapter.ts
JSON Schema allows omitting `type:` when `properties:` is set; design
review flagged that the previous `out.type === "object"` gate skipped
those nodes, leaving strict mode silently downgraded. Treat presence of
a `properties` object as implying object-node. Also document the still-
unhandled cases (`oneOf` / `anyOf` / `allOf` / `patternProperties` /
schema-form `additionalProperties`) so the next author gets a hint
instead of debugging strict-mode silently downgrading.

## Progress
- Task: design-review follow-up on fix/fsm-strict-mode-free-models
- Decisions: only widen the gate; defer composition-keyword recursion until a workspace actually needs it (YAGNI). Note left for the next author.
- Key Learnings: OpenAI/Groq strict mode silently downgrades to permissive when `additionalProperties` is absent on any object node — there's no error, just bad results. Any code that injects strict-mode schemas needs to also handle the type-less object form because JSON Schema doesn't require `type` when `properties` is set.
- Files: packages/fsm-engine/fsm-engine.ts
…ree-models

# Conflicts:
#	tools/qa/live-daemon/harness.ts
#	tools/qa/live-daemon/scenarios/first-principles.ts
#	tools/qa/live-daemon/scenarios/tool-suite-management.ts
- 9 cases: nested-objects-in-arrays, deep nesting, additionalProperties
  preserved, untyped-object trigger (the new branch), input immutability,
  non-record passthrough. Catches recursion-breaking refactors without
  paying for a live OpenAI/Groq run.

- run-evals.sh: let FRIDAY_QA_MODEL override the profile default
  (`${FRIDAY_QA_MODEL:-…}` fallback) so callers can run gpt-4o-mini via
  OpenRouter as `openai/gpt-4o-mini` without forking the script.
Re-type withStrictObjects to take/return the existing JSONSchema
interface instead of Record<string, unknown>, dropping recursion-site
casts in the function body and the one production caller. Rewrite the
unit test's nested-property assertions with toMatchObject so they no
longer need brittle Record-of-Record casts that tripped
noUncheckedIndexedAccess.

Also bump biome.json $schema to 2.4.14 to match the CLI version CI
runs.
…aemon

Isolates the FSM's auto-injected complete() tool from the daemon-based
first-principles suite. Each case is one HTTP call per provider; full
5×3 matrix runs in ~17s with ~2K tokens vs the daemon suite's 50K+
tokens and several minutes.

Five cases cover the schema shapes used by workspace.yml fixtures
(ReviewResult / ArrayReviewResult / EmailBatch / AgentHydrationResult)
plus a single-field smoke. Each ships in two variants:

- case-N-*.yaml — post-PR-#303 wire shape: `additionalProperties: false`
  recursively, `strict: true` on OpenAI-style function defs, forced
  `tool_choice` so the model can't bail to natural text.
- case-N-pre-fix.yaml — pre-PR-#303 wire shape: permissive schema, no
  strict flag, no forced tool choice.
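
To show the difference between the two variants in code rather than YAML, here is a sketch of the OpenAI-style request fragment each one corresponds to. The builder function and the description string are hypothetical, and the promptfoo YAML layout itself is not reproduced; the fields (`strict` on the function definition, forced `tool_choice`) are the ones named above.

```typescript
// OpenAI-style chat-completions tool fragment (subset used in this sketch).
type ToolRequest = {
  tools: Array<{
    type: "function";
    function: {
      name: string;
      description: string;
      parameters: Record<string, unknown>;
      strict?: boolean;
    };
  }>;
  tool_choice?: { type: "function"; function: { name: string } };
};

// Hypothetical builder for the two eval variants.
function buildCompleteRequest(
  parameters: Record<string, unknown>,
  variant: "pre-fix" | "post-fix",
): ToolRequest {
  const request: ToolRequest = {
    tools: [
      {
        type: "function",
        function: {
          name: "complete",
          description: "placeholder description (assumption)",
          parameters,
        },
      },
    ],
  };
  if (variant === "post-fix") {
    const fn = request.tools[0];
    // Strict flag on the function definition plus a forced tool choice,
    // so the model cannot answer in natural text instead.
    if (fn) fn.function.strict = true;
    request.tool_choice = { type: "function", function: { name: "complete" } };
  }
  return request;
}
```

The Anthropic variant forces the tool with `tool_choice: { type: tool, name: complete }` instead, as noted in the PR description.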

Validated on 2026-05-15: pre-fix 12/15 (80%), post-fix 15/15 (100%)
across Anthropic claude-sonnet-4-6, Groq llama-4-scout, OpenRouter
openai/gpt-oss-120b. The failure modes the fix measurably prevents in
this matrix:
- Groq scout emits `count: "12"` (string) for number-typed fields
  without wire-level constrained decoding (cases 2, 5)
- Anthropic falls back to a natural-text response instead of calling
  the tool without forced tool_choice (case 3)

Run via README — promptfoo-native providers + `OPENROUTER_API_KEY`
aliased from the OpenRouter key stored in `OPENAI_API_KEY`.
CI's trailing-whitespace check (Lint JS workflow) was rejecting case-1-pre-fix.yaml
and case-3-pre-fix.yaml. Pure whitespace cleanup, no semantic change.
@LissaGreense LissaGreense marked this pull request as ready for review May 15, 2026 18:20
Re-measuring the validated delta surfaced three methodology gaps:

- Both pre-fix and post-fix YAMLs now use the same tool description
  string the production FSM ships (fsm-engine.ts:1461-1462). Earlier
  revisions gave the post-fix variant a tailored "Required fields: X,
  Y, Z" prose that leaked schema info into the prompt and inflated
  post-fix pass rates. Case-1 and case-3 pre-fix were also still on
  the stale tailored prose; both fixed.
- Sample size goes from n=1 to n=5 per cell via --repeat 5 in the run
  snippet. At n=1 a missed tool call is indistinguishable from a real
  regression for non-deterministic models.
- Case 5 no longer hand-feeds `sawBodySentinel=true` in the prompt;
  the model derives the boolean from stated facts, which is what
  actually exercises strict-mode type coercion.

Adds openrouter:nvidia/nemotron-3-super-120b-a12b:free as a fourth
provider to widen the OpenAI-compatible-strict surface beyond
gpt-oss-120b.

Re-measured matrix (n=5):
- Anthropic 18/25 → 25/25 (72% → 100%), driven by case-3 tool_choice
  forcing (1/5 → 5/5)
- OpenRouter gpt-oss-120b 25/25 → 25/25; previous 24/25 didn't
  replicate, was n=5 noise
- OpenRouter nemotron-super 25/25 → 25/25
- Groq 15/25 → 14/25 — the eval doesn't engage Groq's real strict
  mode (promptfoo's tool-level `strict: true` is not the same wire
  shape as the FSM's providerOptions.groq.strictJsonSchema → Groq's
  response_format: json_schema). README documents the caveat; the
  daemon-based first-principles suite is the authoritative Groq test.