fix(opencode): consume Model.prefill + runtime-probe llama.cpp templates#27916

Open
feanor5555 wants to merge 2 commits into anomalyco:dev from feanor5555:pr3-consume-prefill-and-probe

Conversation


@feanor5555 feanor5555 commented May 16, 2026

Issue for this PR

Closes #27920

Stacked on #27915 for the Model.prefill capability. Sister-PR #27914 handles the orthogonal empty-trailing case via the empty-content filter.

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

Closes the remaining ~25% of trailing-assistant 400s on llama.cpp / vLLM / TGI that #27914 cannot reach. The MAX_STEPS prefill in session/prompt.ts is non-empty by design (it delivers a user-visible "wrap up" instruction), so it survives the empty-content filter and trips the same template-incompat 400.

Three coordinated pieces:

1. ProviderTransform.canAcceptTrailingAssistant(model): new helper with three-layer precedence (sketched further below):

  1. Explicit model.capabilities.prefill (from models.dev or user config) wins.
  2. Auto-inference: @ai-sdk/openai-compatible + reasoning:true → prefill false. Covers every known 2025-2026 thinking family even before models.dev ships explicit values.
  3. Default true (backwards compatible).

2. MAX_STEPS routing in session/prompt.ts now consults the helper: role:"assistant" for prefill-capable providers, role:"user" for the rest. Thinking stays enabled in the request body; only the role of the synthetic wrap-up message changes, so the model still thinks and writes its summary normally.

3. CapabilityProbe — runtime detection for self-hosted openai-compatible servers. llama.cpp's <root>/props endpoint exposes the active chat template; templates that branch on enable_thinking are exactly the ones that reject prefill at runtime. The probe runs once per base URL (cached), fail-silent (vLLM/TGI/mistral.rs have no /props and fall through to the auto-inference path), short-timeout (1.5s).
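
A minimal sketch of the probe logic in (3), assuming llama.cpp's /props response carries the active template under a chat_template key; the function name, cache, and response shape are illustrative, not the PR's verbatim code:

    const propsCache = new Map<string, boolean | undefined>()

    // Returns false when the active chat template branches on enable_thinking
    // (the templates that reject a trailing assistant turn), true when it does
    // not, and undefined when the probe cannot tell, so the caller falls back
    // to the auto-inference layer.
    export async function probePrefillSupport(baseURL: string): Promise<boolean | undefined> {
      if (!baseURL) return undefined
      const root = baseURL.replace(/\/v1\/?$/, "") // normalise ".../v1" to the server root
      if (propsCache.has(root)) return propsCache.get(root)
      let result: boolean | undefined
      try {
        const res = await fetch(`${root}/props`, { signal: AbortSignal.timeout(1500) })
        if (res.ok) {
          const props = (await res.json()) as { chat_template?: string }
          if (typeof props.chat_template === "string")
            result = !props.chat_template.includes("enable_thinking")
        }
        // non-200 (vLLM/TGI/mistral.rs expose no /props) leaves result undefined
      } catch {
        // fail-silent: timeouts and network errors fall through as well
      }
      propsCache.set(root, result)
      return result
    }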

Affected behaviour:

  • Anthropic, Bedrock, OpenAI, Google: unchanged (prefill stays available).
  • Thinking-on local models (Qwen3, DeepSeek-R1, GLM-thinking, Kimi-K2-Thinking, MiniMax-M2): MAX_STEPS arrives as a user message.

Common misunderstanding: prefill: false does not disable thinking — only the role of the synthetic MAX_STEPS message changes from assistant to user. The model thinks and writes its wrap-up normally.
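
The role swap and the precedence layers fit together roughly as follows; the Model shape and field names here are assumptions for illustration, only the precedence order and the role routing come from the description above:

    interface Model {
      npm?: string // provider SDK package, e.g. "@ai-sdk/openai-compatible"
      reasoning?: boolean
      capabilities?: { prefill?: boolean }
    }

    export function canAcceptTrailingAssistant(model: Model): boolean {
      // (a) explicit capability from models.dev or user config wins
      if (model.capabilities?.prefill !== undefined) return model.capabilities.prefill
      // (b) auto-inference: openai-compatible + reasoning implies prefill rejected
      if (model.npm === "@ai-sdk/openai-compatible" && model.reasoning) return false
      // (c) default: backwards compatible
      return true
    }

    // MAX_STEPS routing: only the role of the synthetic wrap-up message changes.
    function maxStepsMessage(model: Model, text: string) {
      const role = canAcceptTrailingAssistant(model) ? "assistant" : "user"
      return { role, content: text }
    }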

Users can override per-model via opencode.json:

{
  "provider": {
    "my-llamacpp": {
      "models": {
        "qwen3.5-coder": { "reasoning": true, "prefill": false }
      }
    }
  }
}

Related upstream: ggml-org/llama.cpp#20861, ggml-org/llama.cpp#21889, mastra-ai/mastra#15234.

How did you verify your code works?

  • bun test test/provider/transform.test.ts test/provider/capability-probe.test.ts: 243 pass, 0 fail.

  • bun run typecheck is clean.

  • Real-world benchmark against a Spring Boot project on llama.cpp + Qwen3.5-9B with --reasoning on, agent forced into MAX_STEPS via steps: 3, 3 runs per variant:

    Config                                          Prefill-400 / run
    Without this PR                                 2.0
    With this PR + reasoning: true in user config   0.0
    With this PR + auto-probe (no user config)      0.0

Tests:

  • transform.test.ts: 8-case canAcceptTrailingAssistant matrix (explicit-overrides-everything, auto-inference for openai-compatible + reasoning class, unchanged defaults for Anthropic/OpenAI/Google/Bedrock representatives).
  • capability-probe.test.ts: 11 cases (enable_thinking detection, /v1-suffix normalisation, 404 fallback, network-error fallback, empty baseURL, per-URL cache, supports_preserve_reasoning secondary signal).
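
One representative case from the matrix, written against the sketch above (bun:test; the import path and model literal are assumptions):

    import { expect, test } from "bun:test"
    import { canAcceptTrailingAssistant } from "../src/provider/transform"

    test("explicit prefill overrides auto-inference", () => {
      const model = {
        npm: "@ai-sdk/openai-compatible",
        reasoning: true,
        capabilities: { prefill: true }, // explicit value wins over layer (b)
      }
      expect(canAcceptTrailingAssistant(model)).toBe(true)
    })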

Screenshots / recordings

N/A — backend change.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

Anthropic-style providers accept (and rely on) an assistant message as
the last turn in a conversation ("response continuation" / "prefill"
for tool-use continuation). Most other thinking-on-by-default templates
reject it outright — llama.cpp returns HTTP 400 "Assistant response
prefill is incompatible with enable_thinking" on Qwen3-family templates,
and vLLM/TGI have equivalent behaviour for DeepSeek-R1, GLM-4.6 thinking,
Kimi-K2-Thinking, etc.

A first-class `prefill: boolean` on Model lets every host (opencode,
mastra, others) consult one canonical source of truth instead of
guessing from npm package + reasoning flag.

- packages/core/src/models.ts: add optional prefill field on Model
  with a per-family list of templates known to reject prefill
  (Qwen3 hybrid/3.5/3.6/Thinking-2507/VL, QwQ, DeepSeek-R1/R1-0528/V4,
  GLM-4.6/4.7-thinking, Kimi-K2-Thinking, MiniMax-M2).

- packages/opencode/src/config/provider.ts: mirror the field on the
  user-facing config schema with an annotation describing when to set
  it (and what the auto-default is for openai-compatible+reasoning).

Default (undefined) is treated as `true` to keep all existing models
unaffected. Consumer-side logic lives in a follow-up PR.
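
A sketch of what the optional field could look like (the zod-style schema and surrounding fields are assumptions; only the prefill semantics come from this commit):

    import { z } from "zod"

    export const Model = z.object({
      id: z.string(),
      reasoning: z.boolean().optional(),
      // undefined is treated as true so all existing models are unaffected;
      // set false for template families known to reject a trailing assistant
      // turn (Qwen3 thinking variants, DeepSeek-R1, GLM thinking,
      // Kimi-K2-Thinking, MiniMax-M2).
      prefill: z
        .boolean()
        .optional()
        .describe("Whether the model accepts a trailing assistant message (prefill)."),
    })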

Sister-PR to a sst/models.dev data PR that will populate prefill: false
on the affected per-model entries.

Closes the remaining ~25% of trailing-assistant 400s on llama.cpp /
vLLM / TGI that an empty-content filter alone cannot reach. The
MAX_STEPS prefill in session/prompt.ts is non-empty by design (it
delivers a user-visible "wrap up" instruction), so it survives the
empty filter and trips the same template-incompat 400.

Three coordinated changes:

1. ProviderTransform.canAcceptTrailingAssistant(model) — new helper.
   Three-layer precedence:
     (a) explicit model.capabilities.prefill wins (from models.dev or
         user config),
     (b) auto-inference: @ai-sdk/openai-compatible + reasoning:true
         → false (covers every known 2025-2026 thinking family even
         before models.dev ships explicit values),
     (c) default true (backwards compatible — Anthropic, Bedrock,
         OpenAI, Google etc. unchanged).

2. session/prompt.ts MAX_STEPS routing now consults the helper:
   role:"assistant" for prefill-capable providers, role:"user" for the
   rest. Thinking stays enabled in the request body — only the role of
   the synthetic wrap-up message changes from `assistant` to `user`,
   so the model still thinks and writes its summary normally.

3. CapabilityProbe — runtime detection for self-hosted openai-compatible
   servers. llama.cpp's `<root>/props` endpoint exposes the active
   chat template; templates that branch on `enable_thinking` are exactly
   the ones that reject prefill. The probe runs once per base URL
   (cached), fail-silent (vLLM/TGI/mistral.rs have no /props and fall
   through to the auto-inference path), short-timeout (1.5s).

User can always override per-model via opencode.json:

    {
      "provider": {
        "my-llamacpp": {
          "models": {
            "qwen3.5-coder": { "reasoning": true, "prefill": false }
          }
        }
      }
    }

Affected behaviour:
  - Anthropic, Bedrock, OpenAI, Google — unchanged (prefill stays
    available).
  - Thinking-on local models (Qwen3, DeepSeek-R1, GLM-thinking,
    Kimi-K2-Thinking, MiniMax-M2): MAX_STEPS arrives as a user message.
    Same instruction, same wrap-up behaviour, no template rejection.

Tests:
  - transform.test.ts: 8-case canAcceptTrailingAssistant matrix
    (explicit-overrides-everything, auto-inference for openai-compatible
    + reasoning class, unchanged defaults for Anthropic/OpenAI/Google/
    Bedrock representatives).
  - capability-probe.test.ts: 11 cases for the runtime probe
    (enable_thinking detection, /v1-suffix normalisation, 404 fallback,
    network-error fallback, empty baseURL, per-URL cache).

Real-world benchmark against an echomodus-sized Spring Boot project
on llama.cpp + Qwen3.5-9B with --reasoning on:
  - Without this PR: 2.0 prefill-400s per run (3/3 runs).
  - With this PR + reasoning:true in user config: 0 errors (3/3).
  - With this PR + auto-probe (no user config): 0 errors (3/3).

Common misunderstanding: prefill:false does NOT disable thinking.
Thinking stays on for the whole request — only the role of the synthetic
MAX_STEPS message changes from `assistant` to `user`. The model then
thinks (with thinking enabled) and writes its wrap-up normally.

Builds on the Model.prefill capability introduced in the previous
commit. Sister-PR-1 (filter empty assistant content for
@ai-sdk/openai-compatible) handles the orthogonal empty-trailing case;
this PR handles the non-empty trailing case.
@github-actions github-actions bot added the needs:compliance (auto-closes after 2 hours) and needs:title labels May 16, 2026
@github-actions
Contributor

Hey! Your PR title "provider: consume Model.prefill + runtime-probe llama.cpp templates" doesn't follow conventional commit format.

Please update it to start with one of:

  • feat: or feat(scope): new feature
  • fix: or fix(scope): bug fix
  • docs: or docs(scope): documentation changes
  • chore: or chore(scope): maintenance tasks
  • refactor: or refactor(scope): code refactoring
  • test: or test(scope): adding or updating tests

Where scope is the package name (e.g., app, desktop, opencode).

See CONTRIBUTING.md for details.

@github-actions
Contributor

The following comment was made by an LLM; it may be inaccurate:

Based on the search results, I found related PRs that are part of the same feature work:

Related PRs (Not Duplicates)

  1. PR #27915: feat(core): add Model.prefill capability for trailing-assistant support

  2. PR #27914: fix(opencode): filter empty assistant content for @ai-sdk/openai-compatible

Note: PR #27916 (the current PR) is explicitly stacked on PR #27915 and represents the next logical piece in the same feature chain. These are coordinated changes, not duplicates.

@feanor5555 feanor5555 changed the title from provider: consume Model.prefill + runtime-probe llama.cpp templates to fix(opencode): consume Model.prefill + runtime-probe llama.cpp templates May 16, 2026
@github-actions github-actions bot added the needs:issue label and removed the needs:compliance and needs:title labels May 16, 2026
@github-actions
Contributor

Thanks for updating your PR! It now meets our contributing guidelines. 👍

@github-actions
Contributor

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

  1. Open an issue describing the bug/feature (if one doesn't exist)
  2. Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

Development

Successfully merging this pull request may close these issues.

Trailing-assistant 400 on llama.cpp/vLLM with thinking-on templates (Qwen3, DeepSeek-R1, GLM-thinking, Kimi-K2-Thinking, MiniMax-M2)
