fix(opencode): consume Model.prefill + runtime-probe llama.cpp templates#27916

Open
feanor5555 wants to merge 2 commits into anomalyco:dev from feanor5555:pr3-consume-prefill-and-probe

Conversation


@feanor5555 feanor5555 commented May 16, 2026

Issue for this PR

Closes #27920

Stacked on #27915 for the Model.prefill capability. Sister-PR #27914 handles the orthogonal empty-trailing case via the empty-content filter.

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

Closes the remaining ~25% of trailing-assistant 400s on llama.cpp / vLLM / TGI that #27914 cannot reach. The MAX_STEPS prefill in session/prompt.ts is non-empty by design (it delivers a user-visible "wrap up" instruction), so it survives the empty-content filter and trips the same template-incompat 400.

Three coordinated pieces:

1. ProviderTransform.canAcceptTrailingAssistant(model): new helper with three-layer precedence (sketched further below):

  1. Explicit model.capabilities.prefill (from models.dev or user config) wins.
  2. Auto-inference: @ai-sdk/openai-compatible + reasoning:true → prefill false. Covers every known 2025-2026 thinking family even before models.dev ships explicit values.
  3. Default true (backwards compatible).

2. MAX_STEPS routing in session/prompt.ts now consults the helper: role:"assistant" for prefill-capable providers, role:"user" for the rest. Thinking stays enabled in the request body; only the role of the synthetic wrap-up message changes, so the model still thinks and writes its summary normally.

3. CapabilityProbe — runtime detection for self-hosted openai-compatible servers. llama.cpp's <root>/props endpoint exposes the active chat template; templates that branch on enable_thinking are exactly the ones that reject prefill at runtime. The probe runs once per base URL (cached), fail-silent (vLLM/TGI/mistral.rs have no /props and fall through to the auto-inference path), short-timeout (1.5s).
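
A minimal sketch of the probe logic in (3), assuming llama.cpp's /props response carries the active template under a chat_template key; the function name, cache, and response shape are illustrative, not the PR's verbatim code:

    const propsCache = new Map<string, boolean | undefined>()

    // Returns false when the active chat template branches on enable_thinking
    // (the templates that reject a trailing assistant turn), true when it does
    // not, and undefined when the probe cannot tell, so the caller falls back
    // to the auto-inference layer.
    export async function probePrefillSupport(baseURL: string): Promise<boolean | undefined> {
      if (!baseURL) return undefined
      const root = baseURL.replace(/\/v1\/?$/, "") // normalise ".../v1" to the server root
      if (propsCache.has(root)) return propsCache.get(root)
      let result: boolean | undefined
      try {
        const res = await fetch(`${root}/props`, { signal: AbortSignal.timeout(1500) })
        if (res.ok) {
          const props = (await res.json()) as { chat_template?: string }
          if (typeof props.chat_template === "string")
            result = !props.chat_template.includes("enable_thinking")
        }
        // non-200 (vLLM/TGI/mistral.rs expose no /props) leaves result undefined
      } catch {
        // fail-silent: timeouts and network errors fall through as well
      }
      propsCache.set(root, result)
      return result
    }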

Affected behaviour:

  • Anthropic, Bedrock, OpenAI, Google: unchanged (prefill stays available).
  • Thinking-on local models (Qwen3, DeepSeek-R1, GLM-thinking, Kimi-K2-Thinking, MiniMax-M2): MAX_STEPS arrives as a user message.

Common misunderstanding: prefill: false does not disable thinking — only the role of the synthetic MAX_STEPS message changes from assistant to user. The model thinks and writes its wrap-up normally.
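
The role swap and the precedence layers fit together roughly as follows; the Model shape and field names here are assumptions for illustration, only the precedence order and the role routing come from the description above:

    interface Model {
      npm?: string // provider SDK package, e.g. "@ai-sdk/openai-compatible"
      reasoning?: boolean
      capabilities?: { prefill?: boolean }
    }

    export function canAcceptTrailingAssistant(model: Model): boolean {
      // (a) explicit capability from models.dev or user config wins
      if (model.capabilities?.prefill !== undefined) return model.capabilities.prefill
      // (b) auto-inference: openai-compatible + reasoning implies prefill rejected
      if (model.npm === "@ai-sdk/openai-compatible" && model.reasoning) return false
      // (c) default: backwards compatible
      return true
    }

    // MAX_STEPS routing: only the role of the synthetic wrap-up message changes.
    function maxStepsMessage(model: Model, text: string) {
      const role = canAcceptTrailingAssistant(model) ? "assistant" : "user"
      return { role, content: text }
    }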

Users can override per-model via opencode.json:

{
  "provider": {
    "my-llamacpp": {
      "models": {
        "qwen3.5-coder": { "reasoning": true, "prefill": false }
      }
    }
  }
}

Related upstream: ggml-org/llama.cpp#20861, ggml-org/llama.cpp#21889, mastra-ai/mastra#15234.

How did you verify your code works?

  • bun test test/provider/transform.test.ts test/provider/capability-probe.test.ts: 243 pass, 0 fail.

  • bun run typecheck is clean.

  • Real-world benchmark against a Spring Boot project on llama.cpp + Qwen3.5-9B with --reasoning on, agent forced into MAX_STEPS via steps: 3, 3 runs per variant:

    Config                                          Prefill-400 / run
    Without this PR                                 2.0
    With this PR + reasoning: true in user config   0.0
    With this PR + auto-probe (no user config)      0.0

Tests:

  • transform.test.ts: 8-case canAcceptTrailingAssistant matrix (explicit-overrides-everything, auto-inference for openai-compatible + reasoning class, unchanged defaults for Anthropic/OpenAI/Google/Bedrock representatives).
  • capability-probe.test.ts: 11 cases (enable_thinking detection, /v1-suffix normalisation, 404 fallback, network-error fallback, empty baseURL, per-URL cache, supports_preserve_reasoning secondary signal).
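
One representative case from the matrix, written against the sketch above (bun:test; the import path and model literal are assumptions):

    import { expect, test } from "bun:test"
    import { canAcceptTrailingAssistant } from "../src/provider/transform"

    test("explicit prefill overrides auto-inference", () => {
      const model = {
        npm: "@ai-sdk/openai-compatible",
        reasoning: true,
        capabilities: { prefill: true }, // explicit value wins over layer (b)
      }
      expect(canAcceptTrailingAssistant(model)).toBe(true)
    })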

Screenshots / recordings

N/A — backend change.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

Anthropic-style providers accept (and rely on) an assistant message as
the last turn in a conversation ("response continuation" / "prefill"
for tool-use continuation). Most other thinking-on-by-default templates
reject it outright — llama.cpp returns HTTP 400 "Assistant response
prefill is incompatible with enable_thinking" on Qwen3-family templates,
and vLLM/TGI have equivalent behaviour for DeepSeek-R1, GLM-4.6 thinking,
Kimi-K2-Thinking, etc.

A first-class `prefill: boolean` on Model lets every host (opencode,
mastra, others) consult one canonical source of truth instead of
guessing from npm package + reasoning flag.

- packages/core/src/models.ts: add optional prefill field on Model
  with a per-family list of templates known to reject prefill
  (Qwen3 hybrid/3.5/3.6/Thinking-2507/VL, QwQ, DeepSeek-R1/R1-0528/V4,
  GLM-4.6/4.7-thinking, Kimi-K2-Thinking, MiniMax-M2).

- packages/opencode/src/config/provider.ts: mirror the field on the
  user-facing config schema with an annotation describing when to set
  it (and what the auto-default is for openai-compatible+reasoning).

Default (undefined) is treated as `true` to keep all existing models
unaffected. Consumer-side logic lives in a follow-up PR.
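
A sketch of what the optional field could look like (the zod-style schema and surrounding fields are assumptions; only the prefill semantics come from this commit):

    import { z } from "zod"

    export const Model = z.object({
      id: z.string(),
      reasoning: z.boolean().optional(),
      // undefined is treated as true so all existing models are unaffected;
      // set false for template families known to reject a trailing assistant
      // turn (Qwen3 thinking variants, DeepSeek-R1, GLM thinking,
      // Kimi-K2-Thinking, MiniMax-M2).
      prefill: z
        .boolean()
        .optional()
        .describe("Whether the model accepts a trailing assistant message (prefill)."),
    })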

Sister-PR to a sst/models.dev data PR that will populate prefill: false
on the affected per-model entries.

Closes the remaining ~25% of trailing-assistant 400s on llama.cpp /
vLLM / TGI that an empty-content filter alone cannot reach. The
MAX_STEPS prefill in session/prompt.ts is non-empty by design (it
delivers a user-visible "wrap up" instruction), so it survives the
empty filter and trips the same template-incompat 400.

Three coordinated changes:

1. ProviderTransform.canAcceptTrailingAssistant(model) — new helper.
   Three-layer precedence:
     (a) explicit model.capabilities.prefill wins (from models.dev or
         user config),
     (b) auto-inference: @ai-sdk/openai-compatible + reasoning:true
         → false (covers every known 2025-2026 thinking family even
         before models.dev ships explicit values),
     (c) default true (backwards compatible — Anthropic, Bedrock,
         OpenAI, Google etc. unchanged).

2. session/prompt.ts MAX_STEPS routing now consults the helper:
   role:"assistant" for prefill-capable providers, role:"user" for the
   rest. Thinking stays enabled in the request body — only the role of
   the synthetic wrap-up message changes from `assistant` to `user`,
   so the model still thinks and writes its summary normally.

3. CapabilityProbe — runtime detection for self-hosted openai-compatible
   servers. llama.cpp's `<root>/props` endpoint exposes the active
   chat template; templates that branch on `enable_thinking` are exactly
   the ones that reject prefill. The probe runs once per base URL
   (cached), fail-silent (vLLM/TGI/mistral.rs have no /props and fall
   through to the auto-inference path), short-timeout (1.5s).

User can always override per-model via opencode.json:

    {
      "provider": {
        "my-llamacpp": {
          "models": {
            "qwen3.5-coder": { "reasoning": true, "prefill": false }
          }
        }
      }
    }

Affected behaviour:
  - Anthropic, Bedrock, OpenAI, Google — unchanged (prefill stays
    available).
  - Thinking-on local models (Qwen3, DeepSeek-R1, GLM-thinking,
    Kimi-K2-Thinking, MiniMax-M2): MAX_STEPS arrives as a user message.
    Same instruction, same wrap-up behaviour, no template rejection.

Tests:
  - transform.test.ts: 8-case canAcceptTrailingAssistant matrix
    (explicit-overrides-everything, auto-inference for openai-compatible
    + reasoning class, unchanged defaults for Anthropic/OpenAI/Google/
    Bedrock representatives).
  - capability-probe.test.ts: 11 cases for the runtime probe
    (enable_thinking detection, /v1-suffix normalisation, 404 fallback,
    network-error fallback, empty baseURL, per-URL cache).

Real-world benchmark against an echomodus-sized Spring Boot project
on llama.cpp + Qwen3.5-9B with --reasoning on:
  - Without this PR: 2.0 prefill-400s per run (3/3 runs).
  - With this PR + reasoning:true in user config: 0 errors (3/3).
  - With this PR + auto-probe (no user config): 0 errors (3/3).

Common misunderstanding: prefill:false does NOT disable thinking.
Thinking stays on for the whole request — only the role of the synthetic
MAX_STEPS message changes from `assistant` to `user`. The model then
thinks (with thinking enabled) and writes its wrap-up normally.

Builds on the Model.prefill capability introduced in the previous
commit. Sister-PR-1 (filter empty assistant content for
@ai-sdk/openai-compatible) handles the orthogonal empty-trailing case;
this PR handles the non-empty trailing case.
@github-actions github-actions bot added the needs:compliance (auto-closes after 2 hours) and needs:title labels May 16, 2026
@github-actions
Contributor

Hey! Your PR title "provider: consume Model.prefill + runtime-probe llama.cpp templates" doesn't follow conventional commit format.

Please update it to start with one of:

  • feat: or feat(scope): new feature
  • fix: or fix(scope): bug fix
  • docs: or docs(scope): documentation changes
  • chore: or chore(scope): maintenance tasks
  • refactor: or refactor(scope): code refactoring
  • test: or test(scope): adding or updating tests

Where scope is the package name (e.g., app, desktop, opencode).

See CONTRIBUTING.md for details.

@github-actions
Contributor

The following comment was made by an LLM; it may be inaccurate:

Based on the search results, I found related PRs that are part of the same feature work:

Related PRs (Not Duplicates)

  1. PR #27915: feat(core): add Model.prefill capability for trailing-assistant support

  2. PR #27914: fix(opencode): filter empty assistant content for @ai-sdk/openai-compatible

Note: PR #27916 (the current PR) is explicitly stacked on PR #27915 and represents the next logical piece in the same feature chain. These are coordinated changes, not duplicates.

@feanor5555 feanor5555 changed the title from provider: consume Model.prefill + runtime-probe llama.cpp templates to fix(opencode): consume Model.prefill + runtime-probe llama.cpp templates May 16, 2026
@github-actions github-actions bot added the needs:issue label and removed the needs:compliance and needs:title labels May 16, 2026
@github-actions
Contributor

Thanks for updating your PR! It now meets our contributing guidelines. 👍

@github-actions
Contributor

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

  1. Open an issue describing the bug/feature (if one doesn't exist)
  2. Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

Development

Successfully merging this pull request may close these issues.

Trailing-assistant 400 on llama.cpp/vLLM with thinking-on templates (Qwen3, DeepSeek-R1, GLM-thinking, Kimi-K2-Thinking, MiniMax-M2)
