
feat(vllm): add grammar and structured output support #8806

Open

eureka928 wants to merge 7 commits into mudler:master from eureka928:feat/vllm-structured-output

Conversation

Contributor

@eureka928 eureka928 commented Mar 6, 2026

Description

This PR fixes #6857

Adds grammar and structured output support to the vLLM backend, enabling users to enforce structured outputs via JSON schema, JSON object, and BNF/GBNF grammar constraints.

Problem

The vLLM backend ignored all structured output parameters:

  • The Grammar field from the proto was never read
  • A phantom GuidedDecoding mapping referenced a non-existent proto field
  • response_format with json_schema or json_object had no effect on vLLM
  • A missing import time statement caused a runtime crash on video input

Solution

Proto (backend.proto):

  • Added JSONSchema (field 52) and ResponseFormat (field 53) to PredictOptions, allowing backends to receive the raw JSON schema and format type natively

Go endpoints (chat.go, completion.go):

  • Extract the raw JSON schema string from response_format: {type: "json_schema", ...} and store it on config.JSONSchema
  • Set config.ResponseFormat to the format type (json_object / json_schema)
  • GBNF grammar is still generated in parallel for llama.cpp compatibility
  • Added json_schema grammar support to the completion endpoint (was missing)
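To make the extraction step concrete, here is a minimal sketch of a request body using the OpenAI-style response_format convention, and the raw-schema extraction the endpoints perform. The model name, schema, and message content are made up for illustration; only the response_format shape comes from the PR.

```python
import json

# Hypothetical chat request body (OpenAI-compatible /v1/chat/completions).
payload = {
    "model": "vllm-model",  # placeholder model name
    "messages": [{"role": "user", "content": "List three fruits as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "fruits",
            "schema": {
                "type": "object",
                "properties": {
                    "fruits": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["fruits"],
            },
        },
    },
}

# chat.go/completion.go extract the inner schema and store it as a raw
# string on config.JSONSchema, while also generating a GBNF grammar
# for llama.cpp compatibility.
raw_schema = json.dumps(payload["response_format"]["json_schema"]["schema"])
```

A vLLM backend can then consume raw_schema directly, while llama.cpp continues to use the generated GBNF grammar.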

Go backend (options.go, model_config.go):

  • Pass JSONSchema and ResponseFormat through gRPCPredictOpts to backends

vLLM backend (backend.py):

  • Support both StructuredOutputsParams (vLLM latest) and GuidedDecodingParams (vLLM <=0.8.x) with graceful import fallback
  • Handle three structured output modes with priority: JSONSchema > json_object > Grammar
  • Fix missing import time and import json
  • Remove phantom GuidedDecoding mapping
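The version fallback and priority order described above can be sketched as follows. build_constraint is a hypothetical helper, not code from the PR; the class names, kwargs (json, json_object, grammar), and SamplingParams field names are taken from the PR description, and the sketch degrades gracefully when vLLM is not installed.

```python
# Prefer the newer API (vLLM latest), fall back to the older one (<=0.8.x).
try:
    from vllm.sampling_params import StructuredOutputsParams as ConstraintParams
    SAMPLING_FIELD = "structured_outputs"   # vLLM latest
except ImportError:
    try:
        from vllm.sampling_params import GuidedDecodingParams as ConstraintParams
        SAMPLING_FIELD = "guided_decoding"  # vLLM <=0.8.x
    except ImportError:
        ConstraintParams = None             # vLLM not installed
        SAMPLING_FIELD = None


def build_constraint(json_schema: str, response_format: str, grammar: str):
    """Apply the PR's priority order: JSONSchema > json_object > Grammar."""
    if ConstraintParams is None:
        return None
    if json_schema:
        # Pass the schema as a raw string; the 'json' kwarg matches both APIs.
        return {SAMPLING_FIELD: ConstraintParams(json=json_schema)}
    if response_format == "json_object":
        return {SAMPLING_FIELD: ConstraintParams(json_object=True)}
    if grammar:
        # GBNF passes through unchanged: xgrammar follows llama.cpp's GBNF spec.
        return {SAMPLING_FIELD: ConstraintParams(grammar=grammar)}
    return None
```

The returned dict would be splatted into SamplingParams(**constraint), which is why the field name differs by vLLM version.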

Docs (constrained_grammars.md):

  • Updated compatibility notice to include vLLM
  • Added vLLM section with examples for JSON schema, JSON object, and grammar

How It Works

User request (response_format or grammar parameter)
  → chat.go/completion.go: extracts raw JSON schema → config.JSONSchema
  → also generates GBNF grammar → config.Grammar (for llama.cpp compat)
  → options.go: passes both via gRPC PredictOptions
  → vLLM backend: uses StructuredOutputsParams/GuidedDecodingParams
  → llama.cpp backend: uses Grammar (GBNF) as before — no regression

Verification

  • xgrammar (used by vLLM) explicitly follows the GBNF spec from llama.cpp, so grammar passthrough works
  • GuidedDecodingParams.json and StructuredOutputsParams.json both accept JSON strings
  • grammar_is_likely_lark() in vLLM correctly identifies GBNF as non-Lark (via ::= detection)
  • No conflict with existing config.ResponseFormat usage (image endpoint b64_json goes through a different code path)
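The ::= detection mentioned above can be illustrated with a toy heuristic. This mirrors the idea behind vLLM's grammar_is_likely_lark() check, not its actual implementation:

```python
def looks_like_gbnf(grammar: str) -> bool:
    # GBNF (llama.cpp) defines rules with "::=", while Lark uses ":".
    # Simplified heuristic for illustration only.
    return "::=" in grammar


gbnf_grammar = 'root ::= "yes" | "no"'
lark_grammar = 'start: "yes" | "no"'
```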

Notes for Reviewers

  • The proto generated Go files are gitignored and built at compile time — only backend.proto is committed
  • Compatible with both vLLM <=0.8.x (GuidedDecodingParams / guided_decoding) and latest (StructuredOutputsParams / structured_outputs)
  • No regression for llama.cpp or other backends

Signed commits

  • Yes, I signed my commits.


netlify bot commented Mar 6, 2026

Deploy Preview for localai ready!

  • Latest commit: bb08454
  • Latest deploy log: https://app.netlify.com/projects/localai/deploys/69aa440d185937000840e1f9
  • Deploy Preview: https://deploy-preview-8806--localai.netlify.app

Commits

Add two new fields to PredictOptions in the proto:
- JSONSchema (field 52): raw JSON schema string for backends that
  support native structured output (e.g. vLLM guided decoding)
- ResponseFormat (field 53): response format type string

These fields allow backends like vLLM to receive structured output
constraints natively instead of only through GBNF grammar conversion.

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>

Add JSONSchema field to ModelConfig to carry the raw JSON schema string
alongside the GBNF Grammar. Pass both JSONSchema and ResponseFormat
through gRPCPredictOpts to backends via the new proto fields.

This allows backends like vLLM to receive the original JSON schema
for native structured output support.

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>

In chat and completion endpoints, when response_format is json_schema,
extract the raw JSON schema and store it on config.JSONSchema alongside
the GBNF grammar. Also set config.ResponseFormat to the format type.

This allows backends that support native structured output (like vLLM)
to use the JSON schema directly instead of the GBNF grammar.

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>

Update the vLLM backend to support structured output:
- Import GuidedDecodingParams from vllm.sampling_params
- Handle JSONSchema: parse and pass as GuidedDecodingParams(json_schema=...)
- Handle json_object response format: GuidedDecodingParams(json_object=True)
- Fall back to Grammar (GBNF) via GuidedDecodingParams(grammar=...)
- Remove phantom GuidedDecoding mapping (field doesn't exist in proto)
- Fix missing 'import time' and 'import json' for load_video and schema parsing

Priority: JSONSchema > json_object > Grammar (GBNF fallback)

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>

- Make GuidedDecodingParams import conditional (try/except) for
  backwards compatibility with older vLLM versions
- Remove GBNF grammar fallback — vLLM expects EBNF, not GBNF, so
  passing LocalAI's GBNF grammar would produce confusing errors
- Pass JSONSchema as string directly instead of parsing to dict
  (safer across vLLM versions)
- Add GBNF grammar generation for json_schema in completion endpoint
  so non-vLLM backends (llama.cpp) also get grammar enforcement

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>

- Handle both StructuredOutputsParams (vLLM latest) and
  GuidedDecodingParams (vLLM <=0.8.x) with graceful fallback
- Use the correct SamplingParams field name for each version
  (structured_outputs vs guided_decoding)
- Use 'json' parameter (not 'json_schema') matching both APIs
- Re-add grammar (GBNF/BNF) passthrough — both vLLM APIs accept
  a 'grammar' parameter handled by xgrammar which supports GBNF
- Priority: JSONSchema > json_object > Grammar

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>

Update the compatibility notice to include vLLM alongside llama.cpp.
Add a vLLM-specific section with examples for all three supported
methods: json_schema, json_object, and grammar (via xgrammar).

Ref: mudler#6857
Signed-off-by: eureka928 <meobius123@gmail.com>
@eureka928 eureka928 force-pushed the feat/vllm-structured-output branch from dabd63c to bb08454 on March 6, 2026 03:03
@eureka928
Contributor Author

GM @mudler, would you review my PR?
Thank you for your time.



Development

Successfully merging this pull request may close these issues.

Feature Request: Add grammar support (BNF/xgrammar) to vLLM backend
