
wip #1

Open
yachty66 wants to merge 24 commits into gpu-mode:main from FelixMul:maxh

Conversation


@yachty66 yachty66 commented Apr 9, 2026

No description provided.

yachty66 and others added 24 commits April 9, 2026 14:08
Introspects the GatedDeltaNet class directly so we can read the exact
torch fallback HF is calling (the "fast path is not available" branch).
No model load required.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
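
The introspection approach above can be sketched with `inspect.getsource`. The real script would import GatedDeltaNet from transformers (the exact module path varies by version, so it is not hardcoded here); the runnable demo below applies the same technique to a stand-in class:

```python
import inspect

# Stand-in for HF's GatedDeltaNet. In the real script we import the class
# from transformers instead; no model load is needed to read its source.
class GatedDeltaNet:
    def forward(self, q, k, v):
        # the "fast path is not available" fallback branch we want to read
        return self.torch_recurrent_gated_delta_rule(q, k, v)

# Pull the exact source text of the fallback the forward pass calls.
src = inspect.getsource(GatedDeltaNet.forward)
assert "torch_recurrent_gated_delta_rule" in src
```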
HF's torch_recurrent_gated_delta_rule applies `query = query * (1/sqrt(k_head_dim))`
before the recurrence (use_qk_l2norm_in_kernel path). Without this, our
core_out norm was sqrt(128) ≈ 11.31x too large vs HF, matching the observed
ratio 111.545 / 9.862 = 11.311 exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
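
The arithmetic behind the diagnosis checks out with nothing but the stdlib: the observed norm ratio matches sqrt(k_head_dim) for the 128-dim heads named in the commit.

```python
import math

K_HEAD_DIM = 128  # gated-delta-net key head dim, per the commit message

# HF applies query = query * (1 / sqrt(k_head_dim)) before the recurrence.
scale = 1.0 / math.sqrt(K_HEAD_DIM)

# Without that scale, core_out norms differ by exactly sqrt(128):
observed_ratio = 111.545 / 9.862        # ours / HF, from the debug run
expected_ratio = math.sqrt(K_HEAD_DIM)  # ~ 11.31

assert abs(observed_ratio - expected_ratio) < 0.01
```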
Qwen chat mode ends turns with <|im_end|> (248046), not <|endoftext|>
(248044). Hardcoding only 248044 caused the model to roll past its
real stop and emit <|im_start|>assistant\n, which decoded to bare
"assistant\n" repeated until max_tokens. Switch generate() to a
frozenset of valid stop ids.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
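
A minimal sketch of the fix (the `generate`/`sample_next` names here are illustrative, not the repo's actual signatures): membership in a frozenset of stop ids replaces the single hardcoded comparison.

```python
# Treat BOTH Qwen stop tokens as end-of-turn, not just <|endoftext|>.
IM_END = 248046      # <|im_end|> -- chat-mode end of turn
ENDOFTEXT = 248044   # <|endoftext|>
STOP_IDS = frozenset({IM_END, ENDOFTEXT})

def generate(sample_next, max_tokens=16):
    out = []
    for _ in range(max_tokens):
        tok = sample_next()
        if tok in STOP_IDS:   # previously: tok == ENDOFTEXT only
            break
        out.append(tok)
    return out

# Fake sampler that emits two tokens, then <|im_end|>:
stream = iter([101, 102, IM_END, 103])
assert generate(lambda: next(stream)) == [101, 102]
```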
linear_attn.py: optional fla.ops.gated_delta_rule.chunk_gated_delta_rule
fast path for the prefill recurrence (T>1). Per-token decode (T==1) and
the no-FLA fallback still go through the original sequential loop. Same
math either way; the kernel is what HF calls when the fast path is
available, and we already verified our inputs to it match HF bit-exactly.
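
The dispatch described above can be sketched as follows. The kernel kwargs are FLA's actual API as best I can tell, but treat them as an assumption; `sequential_gated_delta_rule` is a stand-in for the repo's original loop.

```python
import numpy as np

try:
    # optional FLA kernel; absent installs fall back to the loop
    from fla.ops.gated_delta_rule import chunk_gated_delta_rule
    HAS_FLA = True
except ImportError:
    HAS_FLA = False

def sequential_gated_delta_rule(q, k, v, g, beta, state):
    """Stand-in for the original per-token recurrence loop."""
    return "sequential", state

def gated_delta_rule(q, k, v, g, beta, state):
    T = q.shape[1]  # assumes (batch, seq_len, heads, head_dim) layout
    if HAS_FLA and T > 1:
        # Prefill: chunked kernel, same math as the loop.
        return chunk_gated_delta_rule(q, k, v, g, beta,
                                      initial_state=state,
                                      output_final_state=True)
    # Decode step (T == 1), or FLA not installed: original sequential path.
    return sequential_gated_delta_rule(q, k, v, g, beta, state)
```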

loader.py: factor weight remapping into a helper so we can build N
replicas from a single CPU state-dict instead of paying the HF cold-load
cost N times. New load_replicas() takes a list of devices.

main.py: load one replica per GPU at startup, dispatch requests round-
robin across replicas with a per-GPU asyncio.Lock and a thread pool. Up
to NUM_GPUS requests now run truly in parallel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chunk_gated_delta_rule computes in fp32 internally and returns fp32
tensors. Without casting, the decode-step sequential loop (which runs
in bf16) crashed with "expected scalar type BFloat16 but found Float"
on the first einsum against the carried-over state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
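
A numpy analogue of the fix (torch spelling would be `core_out.to(query.dtype)`; fp16 stands in for bf16, which numpy lacks): the fp32 kernel output is cast back to the working dtype before the decode loop touches it.

```python
import numpy as np

working_dtype = np.float16  # stand-in for the model's bf16
core_out = np.ones((1, 4, 8), dtype=np.float32)  # chunked kernel returns fp32

# The added cast: without it, the bf16 decode loop's first einsum against
# the carried-over state mixes dtypes and torch raises.
core_out = core_out.astype(working_dtype)
assert core_out.dtype == working_dtype
```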
The per-GPU work in load_replicas (Qwen35MoE construction, load_state_dict,
.to(cuda:i)) is independent across replicas. H2D copies to different GPUs
use independent PCIe lanes and torch releases the GIL during the copy, so
parallelizing with a ThreadPoolExecutor gives near-linear speedup over
the sequential loop without any synchronization concerns — the shared
new_sd dict is read-only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
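
The parallel load can be sketched like this, with `build_replica` standing in for the real per-GPU work (construct, load_state_dict, `.to(cuda:i)`); the shared state-dict is only ever read, so no locking is needed.

```python
from concurrent.futures import ThreadPoolExecutor

new_sd = {"w": [1.0, 2.0]}  # stand-in for the remapped CPU state-dict (read-only)

def build_replica(device, sd):
    # Stand-in for: model = Qwen35MoE(); model.load_state_dict(sd); model.to(device)
    return (device, len(sd))

devices = [f"cuda:{i}" for i in range(4)]
# One worker per device; H2D copies to different GPUs release the GIL,
# so the loads proceed concurrently. ex.map preserves device order.
with ThreadPoolExecutor(max_workers=len(devices)) as ex:
    replicas = list(ex.map(lambda d: build_replica(d, new_sd), devices))
```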
