Add Qwen3.5 model support (0.8B, 4B, 9B, 27B) by mnoukhov · Pull Request #684 · allenai/OLMo-core

mnoukhov · 2026-05-21T22:48:03Z

Summary

Add support for Qwen3.5 dense hybrid models (0.8B, 4B, 9B, 27B) to OLMo-core
Add TransformerConfig.qwen3_5_like() factory and size-specific builders with a 3:1 Gated DeltaNet + full-attention block pattern
Add partial RoPE support via partial_rotary_factor on RoPEConfig (25% of head dim, matching Qwen3.5)
Add HuggingFace weight conversion for qwen3_5_text hybrid models, including multimodal checkpoint key normalization
Fix HF conversion for recent Transformers checkpoints: GDN out_proj/norm key names, interleaved fused q_proj+gate layout, and tied word embeddings
Add tests covering builder configs, parameter counts, partial RoPE, and a HuggingFace logits comparison on GPU
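To illustrate the partial RoPE item above, here is a minimal pure-Python sketch. It is not the OLMo-core implementation. It assumes the HF-style half-split pairing (dimension i rotates with dimension i + rotary_dim/2) and that the rotated slice is the leading 25% of each head. The function name partial_rope is hypothetical; partial_rotary_factor and θ=10M come from this description.

```python
import math


def partial_rope(x, pos, partial_rotary_factor=0.25, theta=10_000_000.0):
    """Rotate only the leading slice of one head vector; pass the rest through.

    Hypothetical helper for illustration. `x` is a list of floats of length
    head_dim, `pos` is the token position.
    """
    head_dim = len(x)
    rotary_dim = int(head_dim * partial_rotary_factor)  # 64 when head_dim=256
    assert rotary_dim % 2 == 0, "rotary slice must have an even width"
    half = rotary_dim // 2

    out = list(x)  # dims >= rotary_dim are copied unchanged
    for i in range(half):
        inv_freq = theta ** (-2.0 * i / rotary_dim)
        angle = pos * inv_freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        # Same as x * cos + rotate_half(x) * sin on the rotary slice.
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out
```

With head_dim=256 this rotates 64 dimensions and leaves the remaining 192 untouched, which is the property a partial RoPE test would most likely check.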

Architecture Details

Qwen3.5 dense models use a hybrid architecture with key differences from Qwen3:

Block pattern: 3 GDN (linear attention) layers + 1 full attention layer, repeating
GDN layers: 16 key heads × 128 head dim, grouped depthwise conv (kernel 4), allow_neg_eigval=False
Full-attention layers: explicit head_dim=256, GQA, per-head QK norm, elementwise output gating, partial RoPE (θ=10M)
Norm: Qwen-style RMSNorm (hidden_states * weight); conversion transforms HF's zero-initialized weights to OLMo's ones-initialized convention
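The 3:1 block pattern described above can be sketched as follows. This is an illustrative helper, not the builder added in this PR: the function name and the "gdn" / "full_attention" labels are hypothetical, and it assumes each group of four layers is ordered GDN, GDN, GDN, full attention.

```python
def qwen3_5_block_pattern(n_layers, period=4):
    """Return a per-layer type list: every `period`-th layer is full attention.

    Hypothetical helper; the labels are placeholders, not OLMo-core enum values.
    """
    if n_layers % period != 0:
        raise ValueError("n_layers must be a multiple of the pattern period")
    return [
        "full_attention" if (i + 1) % period == 0 else "gdn"
        for i in range(n_layers)
    ]
```

For the 24-layer 0.8B model this yields 18 GDN layers and 6 full-attention layers.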

Model	d_model	layers	attn heads	kv heads	head_dim	GDN v-heads	intermediate
0.8B	1024	24	8	2	256	16	3584
4B	2560	32	16	4	256	32	9216
9B	4096	32	16	4	256	32	12288
27B	5120	64	24	4	256	48	17408
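As a sanity check on the table, the sketch below derives a few quantities from its rows. The dictionary and function names are hypothetical. The derived values follow only from the table and the 3:1 pattern, and the query width excludes the fused gate half.

```python
# Rows copied from the table above:
# (d_model, layers, attn heads, kv heads, head_dim)
SIZES = {
    "0.8B": (1024, 24, 8, 2, 256),
    "4B": (2560, 32, 16, 4, 256),
    "9B": (4096, 32, 16, 4, 256),
    "27B": (5120, 64, 24, 4, 256),
}


def derived(name):
    """Compute attention widths and layer counts for one model size."""
    d_model, layers, heads, kv_heads, head_dim = SIZES[name]
    full_attn_layers = layers // 4  # one full-attention layer per four blocks
    return {
        "q_width": heads * head_dim,      # query projection width, gate excluded
        "kv_width": kv_heads * head_dim,  # key (or value) projection width
        "gqa_group": heads // kv_heads,   # query heads sharing one kv head
        "full_attn_layers": full_attn_layers,
        "gdn_layers": layers - full_attn_layers,
        "q_width_equals_d_model": heads * head_dim == d_model,
    }
```

Only the 9B row has heads × head_dim equal to d_model, which suggests why head_dim is set explicitly rather than derived as d_model / heads.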

HF conversion fixes

GDN keys: Map linear_attn.out_proj / linear_attn.norm (Transformers 5.9+) with legacy o_proj / o_norm fallbacks
Fused Q projection: HF interleaves per-head [q, gate] weights; OLMo stores [all q, all gate] — conversion now unshuffles correctly
Tied embeddings: Copy embed_tokens to lm_head.w_out when tie_word_embeddings=True
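The three fixes above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the conversion code in this PR: weights are modeled as lists of rows, all function names are hypothetical, and the full key strings (prefixes and ".weight" suffixes) are guesses built from the names mentioned above.

```python
def unshuffle_q_gate(rows, n_heads, head_dim):
    """Reorder per-head [q, gate] rows into [all q, all gate].

    `rows` has n_heads * 2 * head_dim entries, one per output row of the
    fused projection, laid out as head0 q, head0 gate, head1 q, ...
    """
    assert len(rows) == n_heads * 2 * head_dim
    q_rows, gate_rows = [], []
    for h in range(n_heads):
        base = h * 2 * head_dim
        q_rows.extend(rows[base : base + head_dim])
        gate_rows.extend(rows[base + head_dim : base + 2 * head_dim])
    return q_rows + gate_rows


def first_present(state, candidates):
    """Return the first key in `candidates` that exists in `state`."""
    for key in candidates:
        if key in state:
            return key
    raise KeyError(f"none of {candidates} found in checkpoint")


def apply_tied_embeddings(state, tie_word_embeddings,
                          embed_key="model.embed_tokens.weight",  # assumed key
                          head_key="lm_head.w_out.weight"):       # assumed key
    """Copy the embedding matrix to the output head when weights are tied."""
    if tie_word_embeddings and head_key not in state:
        state[head_key] = state[embed_key]
    return state
```

A lookup with the newer name first and the legacy name second, for example first_present(state, ["linear_attn.out_proj.weight", "linear_attn.o_proj.weight"]), would handle both checkpoint generations.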

Test plan

pytest -v src/test/nn/transformer/model_test.py -k qwen3_5 — builder configs, param counts, GPU forward (9 tests)
pytest -v -m gpu src/test/nn/hf/qwen3_5_test.py — HF logits parity vs Qwen/Qwen3.5-0.8B (requires HF_TOKEN, GPU, fla)
test_qwen3_5_matches_huggingface — logits match within rtol=1e-3, atol=5e-3 (mean diff ~3e-4, max ~3e-3; relaxed tolerance accounts for HF torch fallback vs OLMo FLA kernels for GDN)
src/test/nn/rope_test.py — partial RoPE coverage
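To show what the rtol/atol pair above allows, the sketch below applies the usual combined criterion |actual - expected| <= atol + rtol * |expected|, which is the rule torch.testing.assert_close uses. Whether the parity test calls that function is an assumption, and within_tolerance is a hypothetical name.

```python
def within_tolerance(actual, expected, rtol=1e-3, atol=5e-3):
    """Check element-wise closeness with a combined relative/absolute bound."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )
```

Under this rule a logit near 1.0 may differ by up to about 6e-3, so the reported maximum difference of roughly 3e-3 fits within the bound.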

Note on GPU-only tests: Qwen3.5 is a GDN + attention hybrid, unlike Qwen3 (attention-only). The end-to-end forward and HF parity tests require a GPU because GDN layers depend on flash-linear-attention (fla), which has no CPU implementation. Config/param-count tests remain CPU-safe.
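One way to express the GPU-only requirement in code is a runtime check like the sketch below. This is an alternative illustration, not how this PR gates its tests (the test plan uses the gpu pytest marker). The function name is hypothetical, and it assumes flash-linear-attention is importable as the top-level package fla.

```python
import importlib.util


def gdn_runtime_available():
    """Return True only if torch, fla, and a CUDA device are all present."""
    # find_spec on a top-level name checks for the package without importing it.
    if importlib.util.find_spec("torch") is None:
        return False
    if importlib.util.find_spec("fla") is None:
        return False
    import torch

    return torch.cuda.is_available()
```

A test module could pass the negated result to pytest.mark.skipif so that forward and parity tests are skipped on CPU-only machines while config and parameter-count tests still run.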

Made with Cursor


mnoukhov and others added 3 commits May 21, 2026 22:17

qwen35 initial try

4bfc551

Fix Qwen3.5 HF weight conversion and GPU parity tests.

52ae4e6

Correct interleaved q_proj/gate layout, updated GDN key names for recent Transformers checkpoints, tied-embedding handling, and run FLA tests on CUDA.

Co-authored-by: Cursor <cursoragent@cursor.com>

Fix Qwen3.5 CI failures for style, RoPE, and GPU-only forward test.

1907fab

Co-authored-by: Cursor <cursoragent@cursor.com>

mnoukhov wants to merge 3 commits into main from qwen35
