Description
Current grammar enforcement runs through the HTTP layer via response_format: json_object. Moving structured output sampling to the C-side in llama.cpp would reduce per-token overhead and improve generation speed.
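For background, grammar-constrained decoding restricts sampling at each step to tokens that keep the output valid under a grammar. A minimal sketch of a GBNF grammar for a flat JSON object of string keys and values (the grammar and helper below are illustrative, not taken from the project; real grammars are usually generated from a JSON Schema):

```python
# Toy GBNF grammar constraining output to a flat JSON object with
# string keys and string values. Illustrative only.
JSON_OBJECT_GBNF = r'''
root   ::= "{" ws pair (ws "," ws pair)* ws "}"
pair   ::= string ws ":" ws string
string ::= "\"" [a-zA-Z0-9_ ]* "\""
ws     ::= [ \t\n]*
'''

def has_rule(gbnf: str, name: str) -> bool:
    """Check whether a rule with the given name is defined in the grammar text."""
    return any(line.strip().startswith(name) for line in gbnf.splitlines())
```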
Current Performance
~51 tok/s on RTX 5060 Ti 16GB with grammar constraints via HTTP response_format.
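To make the before/after comparison reproducible, throughput can be computed as generated tokens over wall-clock time; a small helper (hypothetical, not part of the codebase):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens per second; guards against non-positive elapsed time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_s

# Example: 512 tokens generated in 10.0 s -> 51.2 tok/s,
# close to the ~51 tok/s baseline reported above.
```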
Approach
- Investigate llama.cpp's native grammar/sampler chain API
- Determine if response_format: json_object already uses the C-side sampler or adds HTTP overhead
- If overhead exists, implement direct GBNF grammar passing to avoid the HTTP round-trip
- Benchmark before/after
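If overhead does exist, one direction is to pass a raw GBNF grammar in the completion request instead of a response_format hint. A hedged sketch against llama.cpp's server /completion endpoint, which accepts a grammar field in the request body (the URL, prompt, and toy grammar here are assumptions for this setup, not the project's actual configuration):

```python
import json
import urllib.request

# Toy grammar; a real one would be generated from the target schema.
GBNF = 'root ::= "{" [^}]* "}"'

def build_request(prompt: str, grammar: str,
                  url: str = "http://localhost:8080/completion") -> urllib.request.Request:
    """Build a llama-server /completion request carrying a raw GBNF grammar."""
    payload = {"prompt": prompt, "n_predict": 256, "grammar": grammar}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Usage (requires a running llama-server):
# req = build_request("Return a JSON object:", GBNF)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["content"])
```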
Context
V3.1 roadmap item. This requires understanding llama.cpp internals. The current approach works but may leave performance on the table.