feat(libturboquant): port Phi-3 fused QKV/FFN + LongRoPE to split sources
Ports the Phi-3/Phi-3.5 architecture support from quant.h (PR #65)
to the split source files used by libturboquant and quant-server.
Changes:
- tq_model.c: fused attn_qkv detection, LongRoPE factor loading,
  fused gate||up FFN detection
- tq_transformer.c: fused QKV matmul + split, NeoX-style LongRoPE
  rotation, fused gate||up FFN path, expanded state allocation
- tq_generate.c: Phi-3 BOS token handling
- tq_tokenizer.c: <s> BOS lookup
- tq_server.c: Phi-3 chat template support
- tq_engine.h: new fields for fused weights and LongRoPE config
- cli.py: Phi-3.5 default model + alias updates
quant-server now detects Phi-3.5 correctly:
loaded 32 layers (32 self_attn) + LongRoPE
Known issue: quant-server segfaults in the forward pass during
inference. The likely cause is a buffer size mismatch in the server
path, either in the memcpy that splits the fused QKV output or in the
LongRoPE computation. Tracked in #67.
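One way to narrow down the suspected buffer size issue is to route the split through a bounds-checked helper. The sketch below is an assumption, not the existing code: `split_fused_qkv` and its capacity parameters are hypothetical names, and it assumes the fused output is laid out as Q, then K, then V, with K and V sized by the KV head count.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): copy the fused QKV output into
 * separate buffers. All lengths and capacities are in floats.
 * Returns 0 on success, -1 if any buffer is too small. */
int split_fused_qkv(const float *qkv, size_t qkv_len,
                    float *q, size_t q_cap,
                    float *k, size_t k_cap,
                    float *v, size_t v_cap,
                    size_t n_heads, size_t n_kv_heads, size_t head_dim)
{
    size_t q_dim  = n_heads * head_dim;
    size_t kv_dim = n_kv_heads * head_dim;

    /* Refuse to copy instead of overrunning an undersized buffer. */
    if (qkv_len < q_dim + 2 * kv_dim) return -1;
    if (q_cap < q_dim || k_cap < kv_dim || v_cap < kv_dim) return -1;

    memcpy(q, qkv,                  q_dim  * sizeof(float));
    memcpy(k, qkv + q_dim,          kv_dim * sizeof(float));
    memcpy(v, qkv + q_dim + kv_dim, kv_dim * sizeof(float));
    return 0;
}
```

A -1 return in the server path would indicate that the state allocation there does not match the fused layout, which would be consistent with the segfault described above.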
35/35 unit tests still pass.
Fixes #67 (partial: loader works, inference needs debugging)
Refs #69, #70
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>