Description
Current grammar enforcement runs through the HTTP layer via response_format: json_object. Moving structured output sampling to the C-side in llama.cpp would reduce per-token overhead and improve generation speed.
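For background, grammar-constrained decoding restricts sampling at each step to tokens that keep the output valid under a grammar. A minimal sketch of a GBNF grammar for a flat JSON object of string keys and values (the grammar and helper below are illustrative, not taken from the project; real grammars are usually generated from a JSON Schema):

```python
# Toy GBNF grammar constraining output to a flat JSON object with
# string keys and string values. Illustrative only.
JSON_OBJECT_GBNF = r'''
root   ::= "{" ws pair (ws "," ws pair)* ws "}"
pair   ::= string ws ":" ws string
string ::= "\"" [a-zA-Z0-9_ ]* "\""
ws     ::= [ \t\n]*
'''

def has_rule(gbnf: str, name: str) -> bool:
    """Check whether a rule with the given name is defined in the grammar text."""
    return any(line.strip().startswith(name) for line in gbnf.splitlines())
```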
Current Performance
~51 tok/s on RTX 5060 Ti 16GB with grammar constraints via HTTP response_format.
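To make the before/after comparison reproducible, throughput can be computed as generated tokens over wall-clock time; a small helper (hypothetical, not part of the codebase):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens per second; guards against non-positive elapsed time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_s

# Example: 512 tokens generated in 10.0 s -> 51.2 tok/s,
# close to the ~51 tok/s baseline reported above.
```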
Approach
- Investigate llama.cpp's native grammar/sampler chain API
- Determine if response_format: json_object already uses the C-side sampler or adds HTTP overhead
- If overhead exists, implement direct GBNF grammar passing to avoid the HTTP round-trip
- Benchmark before/after
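If overhead does exist, one direction is to pass a raw GBNF grammar in the completion request instead of a response_format hint. A hedged sketch against llama.cpp's server /completion endpoint, which accepts a grammar field in the request body (the URL, prompt, and toy grammar here are assumptions for this setup, not the project's actual configuration):

```python
import json
import urllib.request

# Toy grammar; a real one would be generated from the target schema.
GBNF = 'root ::= "{" [^}]* "}"'

def build_request(prompt: str, grammar: str,
                  url: str = "http://localhost:8080/completion") -> urllib.request.Request:
    """Build a llama-server /completion request carrying a raw GBNF grammar."""
    payload = {"prompt": prompt, "n_predict": 256, "grammar": grammar}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Usage (requires a running llama-server):
# req = build_request("Return a JSON object:", GBNF)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["content"])
```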
Context
V3.1 roadmap item. This requires understanding llama.cpp internals. The current approach works but may leave performance on the table.