…d, sigmoid gate, per-head gate split
Introspects the GatedDeltaNet class directly so we can read the exact torch fallback HF is calling (the "fast path is not available" branch). No model load required. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
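The introspection needs nothing beyond the standard library; a minimal sketch (the exact HF import path for the fallback is not shown in this message, so the usage comment below is an assumption):

```python
import inspect

def dump_fallback_source(fn):
    """Print the exact source of the torch fallback HF dispatches to,
    without instantiating a model or loading any weights."""
    print(inspect.getsource(fn))

# usage (hypothetical import path, adjust to your transformers version):
#   from transformers.models... import torch_recurrent_gated_delta_rule
#   dump_fallback_source(torch_recurrent_gated_delta_rule)
```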
HF's torch_recurrent_gated_delta_rule applies `query = query * (1/sqrt(k_head_dim))` before the recurrence (use_qk_l2norm_in_kernel path). Without this, our core_out norm was sqrt(128) ≈ 11.31x too large vs HF, matching the observed ratio 111.545 / 9.862 = 11.311 exactly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
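The mismatch can be checked arithmetically; a quick sanity check of the ratio quoted above:

```python
import math

k_head_dim = 128
scale = 1.0 / math.sqrt(k_head_dim)  # applied to query before the recurrence
ratio = 111.545 / 9.862              # observed core_out norm, ours vs HF
print(scale)                         # ~0.0884
print(ratio, math.sqrt(k_head_dim))  # ~11.311 vs ~11.314
```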
Qwen chat mode ends turns with <|im_end|> (248046), not <|endoftext|> (248044). Hardcoding only 248044 caused the model to roll past its real stop and emit <|im_start|>assistant\n, which decoded to bare "assistant\n" repeated until max_tokens. Switch generate() to a frozenset of valid stop ids. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
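A minimal sketch of the frozenset-based stop check (token ids are the ones named above; `next_token` stands in for the real sampling step):

```python
# Both ids are valid end-of-turn markers for Qwen chat mode.
QWEN_STOP_IDS = frozenset({248044, 248046})  # <|endoftext|>, <|im_end|>

def generate(next_token, max_tokens):
    """Collect sampled tokens until any valid stop id appears."""
    out = []
    for _ in range(max_tokens):
        tok = next_token()
        if tok in QWEN_STOP_IDS:  # chat turns actually end with <|im_end|>
            break
        out.append(tok)
    return out
```

With only 248044 in the set, a stream ending in 248046 never breaks the loop and the model keeps sampling until max_tokens.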
linear_attn.py: optional fla.ops.gated_delta_rule.chunk_gated_delta_rule fast path for the prefill recurrence (T>1). Per-token decode (T==1) and the no-FLA fallback still go through the original sequential loop. Same math either way; the kernel is what HF calls when the fast path is available, and we already verified our inputs to it match HF bit-exactly.

loader.py: factor weight remapping into a helper so we can build N replicas from a single CPU state-dict instead of paying the HF cold-load cost N times. New load_replicas() takes a list of devices.

main.py: load one replica per GPU at startup, dispatch requests round-robin across replicas with a per-GPU asyncio.Lock and a thread pool. Up to NUM_GPUS requests now run truly in parallel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
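The main.py dispatch can be sketched with stdlib primitives only (class and method names here are hypothetical, not the actual ones in main.py):

```python
import asyncio
import itertools
from concurrent.futures import ThreadPoolExecutor

class ReplicaPool:
    """Round-robin requests across N replicas. One asyncio.Lock per GPU
    keeps each replica single-tenant, so up to N requests run in parallel."""

    def __init__(self, n_replicas):
        self._locks = [asyncio.Lock() for _ in range(n_replicas)]
        self._rr = itertools.cycle(range(n_replicas))
        self._pool = ThreadPoolExecutor(max_workers=n_replicas)

    async def submit(self, fn, *args):
        i = next(self._rr)              # pick the next replica round-robin
        async with self._locks[i]:      # serialize work on that GPU
            loop = asyncio.get_running_loop()
            # run the blocking model call off the event loop
            return await loop.run_in_executor(self._pool, fn, i, *args)
```

`fn` receives the replica index so it can route to the right model/device.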
chunk_gated_delta_rule computes in fp32 internally and returns fp32 tensors. Without casting, the decode-step sequential loop (which runs in bf16) crashed with "expected scalar type BFloat16 but found Float" on the first einsum against the carried-over state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
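NumPy has no bf16, so this sketch stands in float64 for the kernel's fp32 output and float32 for bf16, but the failure mode and fix are the same: cast the carried-over state back to the loop dtype before the first einsum mixes them.

```python
import numpy as np

def decode_step(state, k):
    # `state` comes back from the chunk kernel in a wider dtype than the
    # sequential decode loop runs in; cast before the einsum touches it.
    state = state.astype(k.dtype, copy=False)
    return np.einsum("ij,j->i", state, k)
```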
The per-GPU work in load_replicas (Qwen35MoE construction, load_state_dict, .to(cuda:i)) is independent across replicas. H2D copies to different GPUs use independent PCIe lanes and torch releases the GIL during the copy, so parallelizing with a ThreadPoolExecutor gives near-linear speedup over the sequential loop without any synchronization concerns — the shared new_sd dict is read-only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
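The pattern is a plain `ThreadPoolExecutor.map` over devices; `build_replica` below is a stand-in for the real construct / load_state_dict / `.to(device)` work, and the shared state-dict is only ever read:

```python
from concurrent.futures import ThreadPoolExecutor

def load_replicas(shared_sd, devices, build_replica):
    """Build one replica per device in parallel. Threads only read
    shared_sd, so no locking is needed; torch releases the GIL during
    the H2D copies, letting the per-device loads overlap."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        return list(pool.map(lambda d: build_replica(shared_sd, d), devices))
```

`pool.map` preserves input order, so replica i always corresponds to devices[i].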