
wip #1

Open
yachty66 wants to merge 24 commits into gpu-mode:main from FelixMul:maxh

Conversation


@yachty66 yachty66 commented Apr 9, 2026

No description provided.

yachty66 and others added 24 commits April 9, 2026 14:08
Introspects the GatedDeltaNet class directly so we can read the exact
torch fallback HF is calling (the "fast path is not available" branch).
No model load required.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
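
The introspection approach above can be sketched with `inspect.getsource`. The real script would import GatedDeltaNet from transformers (the exact module path varies by version, so it is not hardcoded here); the runnable demo below applies the same technique to a stand-in class:

```python
import inspect

# Stand-in for HF's GatedDeltaNet. In the real script we import the class
# from transformers instead; no model load is needed to read its source.
class GatedDeltaNet:
    def forward(self, q, k, v):
        # the "fast path is not available" fallback branch we want to read
        return self.torch_recurrent_gated_delta_rule(q, k, v)

# Pull the exact source text of the fallback the forward pass calls.
src = inspect.getsource(GatedDeltaNet.forward)
assert "torch_recurrent_gated_delta_rule" in src
```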
HF's torch_recurrent_gated_delta_rule applies `query = query * (1/sqrt(k_head_dim))`
before the recurrence (use_qk_l2norm_in_kernel path). Without this, our
core_out norm was sqrt(128) ≈ 11.31x too large vs HF, matching the observed
ratio 111.545 / 9.862 = 11.311 exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
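
The arithmetic behind the diagnosis checks out with nothing but the stdlib: the observed norm ratio matches sqrt(k_head_dim) for the 128-dim heads named in the commit.

```python
import math

K_HEAD_DIM = 128  # gated-delta-net key head dim, per the commit message

# HF applies query = query * (1 / sqrt(k_head_dim)) before the recurrence.
scale = 1.0 / math.sqrt(K_HEAD_DIM)

# Without that scale, core_out norms differ by exactly sqrt(128):
observed_ratio = 111.545 / 9.862        # ours / HF, from the debug run
expected_ratio = math.sqrt(K_HEAD_DIM)  # ~ 11.31

assert abs(observed_ratio - expected_ratio) < 0.01
```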
Qwen chat mode ends turns with <|im_end|> (248046), not <|endoftext|>
(248044). Hardcoding only 248044 caused the model to roll past its
real stop and emit <|im_start|>assistant\n, which decoded to bare
"assistant\n" repeated until max_tokens. Switch generate() to a
frozenset of valid stop ids.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
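
A minimal sketch of the fix (the `generate`/`sample_next` names here are illustrative, not the repo's actual signatures): membership in a frozenset of stop ids replaces the single hardcoded comparison.

```python
# Treat BOTH Qwen stop tokens as end-of-turn, not just <|endoftext|>.
IM_END = 248046      # <|im_end|> -- chat-mode end of turn
ENDOFTEXT = 248044   # <|endoftext|>
STOP_IDS = frozenset({IM_END, ENDOFTEXT})

def generate(sample_next, max_tokens=16):
    out = []
    for _ in range(max_tokens):
        tok = sample_next()
        if tok in STOP_IDS:   # previously: tok == ENDOFTEXT only
            break
        out.append(tok)
    return out

# Fake sampler that emits two tokens, then <|im_end|>:
stream = iter([101, 102, IM_END, 103])
assert generate(lambda: next(stream)) == [101, 102]
```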
linear_attn.py: optional fla.ops.gated_delta_rule.chunk_gated_delta_rule
fast path for the prefill recurrence (T>1). Per-token decode (T==1) and
the no-FLA fallback still go through the original sequential loop. Same
math either way; the kernel is what HF calls when the fast path is
available, and we already verified our inputs to it match HF bit-exactly.
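
The dispatch described above can be sketched as follows. The kernel kwargs are FLA's actual API as best I can tell, but treat them as an assumption; `sequential_gated_delta_rule` is a stand-in for the repo's original loop.

```python
import numpy as np

try:
    # optional FLA kernel; absent installs fall back to the loop
    from fla.ops.gated_delta_rule import chunk_gated_delta_rule
    HAS_FLA = True
except ImportError:
    HAS_FLA = False

def sequential_gated_delta_rule(q, k, v, g, beta, state):
    """Stand-in for the original per-token recurrence loop."""
    return "sequential", state

def gated_delta_rule(q, k, v, g, beta, state):
    T = q.shape[1]  # assumes (batch, seq_len, heads, head_dim) layout
    if HAS_FLA and T > 1:
        # Prefill: chunked kernel, same math as the loop.
        return chunk_gated_delta_rule(q, k, v, g, beta,
                                      initial_state=state,
                                      output_final_state=True)
    # Decode step (T == 1), or FLA not installed: original sequential path.
    return sequential_gated_delta_rule(q, k, v, g, beta, state)
```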

loader.py: factor weight remapping into a helper so we can build N
replicas from a single CPU state-dict instead of paying the HF cold-load
cost N times. New load_replicas() takes a list of devices.

main.py: load one replica per GPU at startup, dispatch requests round-
robin across replicas with a per-GPU asyncio.Lock and a thread pool. Up
to NUM_GPUS requests now run truly in parallel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chunk_gated_delta_rule computes in fp32 internally and returns fp32
tensors. Without casting, the decode-step sequential loop (which runs
in bf16) crashed with "expected scalar type BFloat16 but found Float"
on the first einsum against the carried-over state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
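
A numpy analogue of the fix (torch spelling would be `core_out.to(query.dtype)`; fp16 stands in for bf16, which numpy lacks): the fp32 kernel output is cast back to the working dtype before the decode loop touches it.

```python
import numpy as np

working_dtype = np.float16  # stand-in for the model's bf16
core_out = np.ones((1, 4, 8), dtype=np.float32)  # chunked kernel returns fp32

# The added cast: without it, the bf16 decode loop's first einsum against
# the carried-over state mixes dtypes and torch raises.
core_out = core_out.astype(working_dtype)
assert core_out.dtype == working_dtype
```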
The per-GPU work in load_replicas (Qwen35MoE construction, load_state_dict,
.to(cuda:i)) is independent across replicas. H2D copies to different GPUs
use independent PCIe lanes and torch releases the GIL during the copy, so
parallelizing with a ThreadPoolExecutor gives near-linear speedup over
the sequential loop without any synchronization concerns — the shared
new_sd dict is read-only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
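
The parallel load can be sketched like this, with `build_replica` standing in for the real per-GPU work (construct, load_state_dict, `.to(cuda:i)`); the shared state-dict is only ever read, so no locking is needed.

```python
from concurrent.futures import ThreadPoolExecutor

new_sd = {"w": [1.0, 2.0]}  # stand-in for the remapped CPU state-dict (read-only)

def build_replica(device, sd):
    # Stand-in for: model = Qwen35MoE(); model.load_state_dict(sd); model.to(device)
    return (device, len(sd))

devices = [f"cuda:{i}" for i in range(4)]
# One worker per device; H2D copies to different GPUs release the GIL,
# so the loads proceed concurrently. ex.map preserves device order.
with ThreadPoolExecutor(max_workers=len(devices)) as ex:
    replicas = list(ex.map(lambda d: build_replica(d, new_sd), devices))
```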
