
feat: Grammar speed optimization via C-side sampler chain #33

@itigges22

Description


Grammar enforcement currently runs through the HTTP layer via response_format: json_object. Moving structured-output sampling to the C side in llama.cpp would cut per-token overhead and improve generation speed.
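For reference, the current HTTP-layer path sends a request shaped roughly like the following (field names follow the OpenAI-compatible API that llama.cpp's server exposes; the model name is a placeholder):

```python
import json

# Sketch of the request body for an OpenAI-compatible
# /v1/chat/completions endpoint. With response_format set to
# json_object, the server enforces valid JSON at sampling time.
payload = {
    "model": "local-model",  # placeholder, not a real model name
    "messages": [
        {"role": "user", "content": "Return the result as JSON."}
    ],
    "response_format": {"type": "json_object"},
}

body = json.dumps(payload)
print(body)
```

The question this issue raises is whether that response_format field already maps onto the C-side sampler chain, or whether extra per-token work happens at the HTTP layer.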

Current Performance

~51 tok/s on an RTX 5060 Ti 16GB with grammar constraints applied via the HTTP response_format path.
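To keep before/after comparisons apples-to-apples, throughput should be computed the same way each run. A minimal helper (the function name is mine, not from the project):

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Generation throughput; exclude prompt-processing time
    so grammar overhead per generated token is isolated."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed_s

# Example: 512 tokens generated in 10.0 s -> 51.2 tok/s,
# in line with the ~51 tok/s baseline above.
print(round(tokens_per_second(512, 10.0), 1))
```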

Approach

  • Investigate llama.cpp's native grammar/sampler-chain API
  • Determine whether response_format: json_object already uses the C-side sampler or adds HTTP-layer overhead
  • If overhead exists, pass the GBNF grammar directly to avoid the HTTP round-trip
  • Benchmark before and after
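If direct grammar passing turns out to be worthwhile, llama.cpp consumes grammars in GBNF form (the C API exposes llama_sampler_init_grammar for attaching one to a sampler chain). A purely illustrative grammar constraining output to a single-key JSON object with an integer value might look like:

```
# Illustrative GBNF sketch, not taken from the repo:
# one string key mapped to an integer value.
root   ::= "{" ws string ws ":" ws number ws "}"
string ::= "\"" [a-zA-Z_]+ "\""
number ::= "-"? [0-9]+
ws     ::= [ \t\n]*
```

A real grammar for json_object-equivalent output would cover the full JSON value grammar, but even this sketch shows what would be passed to the C side instead of a response_format flag.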

Context

V3.1 roadmap item. This requires understanding llama.cpp internals. The current approach works but may leave performance on the table.

Metadata

Labels

enhancement (New feature or request)
