
Add Qwen3.5 model support (0.8B, 4B, 9B, 27B)#684

Draft
mnoukhov wants to merge 3 commits into main from qwen35

@mnoukhov mnoukhov commented May 21, 2026

Summary

  • Add support for Qwen3.5 dense hybrid models (0.8B, 4B, 9B, 27B) to OLMo-core
  • Add TransformerConfig.qwen3_5_like() factory and size-specific builders with a 3:1 Gated DeltaNet + full-attention block pattern
  • Add partial RoPE support via partial_rotary_factor on RoPEConfig (25% of head dim, matching Qwen3.5)
  • Add HuggingFace weight conversion for qwen3_5_text hybrid models, including multimodal checkpoint key normalization
  • Fix HF conversion for recent Transformers checkpoints: GDN out_proj/norm key names, interleaved fused q_proj+gate layout, and tied word embeddings
  • Add comprehensive tests including HuggingFace logits comparison on GPU
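To illustrate the partial RoPE item above, here is a minimal sketch in plain Python rather than the actual OLMo-core implementation. It assumes the rotated slice is the leading 25% of each head vector and that pairs follow the "rotate half" convention (element i paired with element i + half); both are assumptions, and `apply_partial_rope` is a hypothetical helper name.

```python
import math


def apply_partial_rope(x, position, partial_rotary_factor=0.25, theta=10_000_000.0):
    """Rotate only the leading fraction of one head vector; pass the rest through.

    x is a list of floats of length head_dim (one head at one position).
    Hypothetical helper for illustration, not the OLMo-core API.
    """
    head_dim = len(x)
    rotary_dim = int(head_dim * partial_rotary_factor)
    rotary_dim -= rotary_dim % 2  # rotation acts on pairs, so keep it even
    half = rotary_dim // 2
    out = list(x)  # dimensions >= rotary_dim are copied unchanged
    for i in range(half):
        # Assumed "rotate half" pairing: element i rotates with element i + half.
        freq = theta ** (-2.0 * i / rotary_dim)
        angle = position * freq
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * cos - b * sin
        out[i + half] = b * cos + a * sin
    return out
```

With head_dim=256 and a factor of 0.25, only the first 64 dimensions are rotated; the remaining 192 carry no positional signal from RoPE.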

Architecture Details

Qwen3.5 dense models use a hybrid architecture with key differences from Qwen3:

  • Block pattern: 3 GDN (linear attention) layers + 1 full attention layer, repeating
  • GDN layers: 16 key heads × 128 head dim, grouped depthwise conv (kernel 4), allow_neg_eigval=False
  • Full-attention layers: explicit head_dim=256, GQA, per-head QK norm, elementwise output gating, partial RoPE (θ=10M)
  • Norm: Qwen-style RMSNorm (hidden_states * weight with HF zero-init → OLMo ones-init transform)
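
| Model | d_model | layers | attn heads | kv heads | head_dim | GDN v-heads | intermediate |
|-------|---------|--------|------------|----------|----------|-------------|--------------|
| 0.8B  | 1024    | 24     | 8          | 2        | 256      | 16          | 3584         |
| 4B    | 2560    | 32     | 16         | 4        | 256      | 32          | 9216         |
| 9B    | 4096    | 32     | 16         | 4        | 256      | 32          | 12288        |
| 27B   | 5120    | 64     | 24         | 4        | 256      | 48          | 17408        |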
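
As a rough sketch of the 3:1 block pattern, the function below expands a layer count into per-layer types, assuming the full-attention layer is the last of each group of four. The labels "gdn" and "attention" and the name `qwen3_5_block_pattern` are illustrative only and are not taken from the PR's code.

```python
def qwen3_5_block_pattern(n_layers, period=4):
    """Return a per-layer type list: 3 GDN layers then 1 full-attention layer.

    Hypothetical helper; layer-type labels are placeholders.
    """
    if n_layers % period != 0:
        raise ValueError(f"n_layers={n_layers} is not a multiple of {period}")
    return [
        "attention" if (i + 1) % period == 0 else "gdn"
        for i in range(n_layers)
    ]


# Layer counts from the table above.
LAYERS = {"0.8B": 24, "4B": 32, "9B": 32, "27B": 64}

for name, n_layers in LAYERS.items():
    pattern = qwen3_5_block_pattern(n_layers)
    print(name, pattern.count("gdn"), "GDN,", pattern.count("attention"), "attention")
```

All four layer counts in the table are multiples of four, so the pattern repeats without a partial group.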

HF conversion fixes

  • GDN keys: Map linear_attn.out_proj / linear_attn.norm (Transformers 5.9+) with legacy o_proj / o_norm fallbacks
  • Fused Q projection: HF interleaves per-head [q, gate] weights; OLMo stores [all q, all gate] — conversion now unshuffles correctly
  • Tied embeddings: Copy embed_tokens to lm_head.w_out when tie_word_embeddings=True
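
The fused Q projection fix can be sketched as a row reordering. The example below works on a plain list of weight rows and assumes the HF order is [head 0 q, head 0 gate, head 1 q, head 1 gate, ...], with head_dim rows per chunk, as the bullet above describes. `deinterleave_q_gate` is a hypothetical name, not the function in this PR.

```python
def deinterleave_q_gate(rows, n_heads, head_dim):
    """Reorder fused rows from per-head [q, gate] to [all q, all gate].

    rows is a list of weight rows (any objects) in the assumed HF layout.
    Hypothetical helper for illustration.
    """
    if len(rows) != 2 * n_heads * head_dim:
        raise ValueError("expected 2 * n_heads * head_dim rows")
    q_rows, gate_rows = [], []
    for h in range(n_heads):
        start = h * 2 * head_dim
        # First head_dim rows of each head's chunk are q, the next head_dim are gate.
        q_rows.extend(rows[start:start + head_dim])
        gate_rows.extend(rows[start + head_dim:start + 2 * head_dim])
    return q_rows + gate_rows


# Toy example: 2 heads, head_dim 2, rows labelled by origin.
hf_rows = ["q0a", "q0b", "g0a", "g0b", "q1a", "q1b", "g1a", "g1b"]
print(deinterleave_q_gate(hf_rows, n_heads=2, head_dim=2))
```

A naive split of the fused weight into two halves would instead mix q and gate rows from different heads, which is the kind of error this fix addresses.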

Test plan

  • pytest -v src/test/nn/transformer/model_test.py -k qwen3_5 — builder configs, param counts, GPU forward (9 tests)
  • pytest -v -m gpu src/test/nn/hf/qwen3_5_test.py — HF logits parity vs Qwen/Qwen3.5-0.8B (requires HF_TOKEN, GPU, fla)
  • test_qwen3_5_matches_huggingface — logits match within rtol=1e-3, atol=5e-3 (mean diff ~3e-4, max ~3e-3; relaxed tolerance accounts for HF torch fallback vs OLMo FLA kernels for GDN)
  • src/test/nn/rope_test.py — partial RoPE coverage
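
For readers unfamiliar with combined tolerances, the sketch below shows how rtol=1e-3 and atol=5e-3 interact, assuming the parity test uses the common elementwise rule |actual - expected| <= atol + rtol * |expected| (the rule used by torch.allclose). Whether the test applies exactly this rule is an assumption; `logits_close` is a hypothetical helper.

```python
def logits_close(actual, expected, rtol=1e-3, atol=5e-3):
    """Elementwise check: |a - e| <= atol + rtol * |e| for every pair."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


def max_abs_diff(actual, expected):
    """Largest absolute difference, useful when reporting a failure."""
    return max(abs(a - e) for a, e in zip(actual, expected))
```

Under this rule a reported max difference of about 3e-3 fits inside the absolute tolerance alone, so the relative term mainly adds headroom for large-magnitude logits.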

Note on GPU-only tests: Qwen3.5 is a GDN + attention hybrid, unlike Qwen3 (attention-only). The end-to-end forward and HF parity tests require a GPU because GDN layers depend on flash-linear-attention (fla), which has no CPU implementation. Config/param-count tests remain CPU-safe.
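
One way to gate such tests is a small availability probe, sketched below. It assumes flash-linear-attention is importable as the module `fla` and that PyTorch is the CUDA runtime; `gdn_runtime_available` is a hypothetical helper and not necessarily how the tests in this PR decide to skip.

```python
import importlib.util


def gdn_runtime_available():
    """Return True only if fla and a CUDA-enabled torch both appear usable."""
    # find_spec checks for the package without importing it.
    if importlib.util.find_spec("fla") is None:
        return False
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so CPU-only environments stay cheap

    return torch.cuda.is_available()
```

A test module could call this once at import time and skip the forward and parity tests when it returns False, leaving the config and param-count tests to run everywhere.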

Made with Cursor

mnoukhov and others added 3 commits May 21, 2026 22:17
Fix the interleaved q_proj/gate layout, update GDN key names for recent
Transformers checkpoints, handle tied embeddings, and run FLA tests on CUDA.

Co-authored-by: Cursor <cursoragent@cursor.com>