
[cuda backend] int4/8 matvec: vectorized activation load #20144

Open
Gasoonjia wants to merge 2 commits into g4-opt-sliding-splitk from g4-opt-int4-vecload


@Gasoonjia Gasoonjia commented Jun 9, 2026

Contributor

The decode-only int4_plain_mm matvec was bound by activation load-instruction throughput, not by DRAM bandwidth (already at ~64% of peak) or latency. Each inner iteration issued ~15 loads per 16-byte weight chunk: 8 scalar int32 activation loads + the same per-block scale d reloaded 4x. The same holds for int8_plain_mm.
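To make the load count above concrete, here is a minimal host-side C++ sketch that emulates the baseline pattern for one activation block and counts the activation-side loads. It is not the kernel from this PR: the `Q8Block` field order, the `float` type of `d`, the `dp4a` emulation, and the helper names (`load_u32`, `load_scale`, `dot_block_scalar`, `activation_loads_per_block`) are all assumptions made for illustration, and only the 8 + 4 activation-side loads are modeled, not the remaining loads in the ~15 total.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical 36-byte activation block. Field names follow the PR text;
// the field order and the float type of `d` are assumptions.
struct Q8Block {
    int8_t qs_even[16];
    int8_t qs_odd[16];
    float  d;
};
static_assert(sizeof(Q8Block) == 36, "unpadded block is 36 bytes");

int g_loads = 0;  // number of emulated load instructions issued

// Host emulation of CUDA's __dp4a: 4-way int8 dot product added to acc.
int32_t dp4a(uint32_t a, uint32_t b, int32_t acc) {
    int8_t va[4], vb[4];
    std::memcpy(va, &a, 4);
    std::memcpy(vb, &b, 4);
    for (int k = 0; k < 4; ++k) acc += int32_t(va[k]) * int32_t(vb[k]);
    return acc;
}

uint32_t load_u32(const int8_t* p) {  // one scalar 4-byte load
    ++g_loads;
    uint32_t v;
    std::memcpy(&v, p, 4);
    return v;
}

float load_scale(const Q8Block& b) {  // one scalar load of d
    ++g_loads;
    return b.d;
}

// Baseline pattern for one block: 8 scalar activation loads, d reloaded 4 times.
// w_even / w_odd hold weights already unpacked to int8, four per 32-bit word.
float dot_block_scalar(const Q8Block& b, const uint32_t w_even[4], const uint32_t w_odd[4]) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        int32_t acc = 0;
        acc = dp4a(w_even[i], load_u32(b.qs_even + 4 * i), acc);
        acc = dp4a(w_odd[i],  load_u32(b.qs_odd  + 4 * i), acc);
        sum += load_scale(b) * float(acc);
    }
    return sum;
}

// Runs one block with simple data and returns the activation-side load count.
int activation_loads_per_block() {
    Q8Block b;
    for (int i = 0; i < 16; ++i) { b.qs_even[i] = 1; b.qs_odd[i] = 2; }
    b.d = 0.5f;
    const uint32_t w[4] = {0x01010101u, 0x01010101u, 0x01010101u, 0x01010101u};
    g_loads = 0;
    const float r = dot_block_scalar(b, w, w);
    assert(r == 24.0f);  // 4 iterations * 0.5 * (4*1 + 4*2)
    return g_loads;
}
```

Under these assumptions `activation_loads_per_block()` returns 12, i.e. 8 int32 loads plus 4 reloads of `d`.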

Align Q8Block to 16 bytes (sizeof 36->48) so each block's qs_even/qs_odd 16B halves are 16B-aligned, then load a whole activation block with two vectorized uint4 loads + one d load (~4x fewer activation loads). dp4a math and accumulation order are bit-identical; the int8 activation values and scale are unchanged.
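The layout change and the vectorized load can be sketched on the host as follows. This is an illustration under stated assumptions, not the code in this PR: the field order inside `Q8Block`, the `float` type of `d`, the `Vec4` stand-in for CUDA's built-in `uint4`, the emulated `dp4a`, and the helper names (`load_vec4`, `pack4`, `dot_block_vec`, `dot_block_ref`) are hypothetical. It shows that padding the block to 48 bytes puts both 16-byte halves on 16-byte boundaries, and that reading them with two 16-byte loads feeds the same values to the same dot-product sequence as element-wise access.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical padded layout; field order and the float type of `d` are assumptions.
struct alignas(16) Q8Block {
    int8_t qs_even[16];  // offset 0
    int8_t qs_odd[16];   // offset 16
    float  d;            // offset 32, followed by 12 bytes of padding
};
static_assert(sizeof(Q8Block) == 48, "36 bytes of payload padded to 48");
static_assert(offsetof(Q8Block, qs_even) % 16 == 0, "qs_even is 16B-aligned");
static_assert(offsetof(Q8Block, qs_odd) % 16 == 0, "qs_odd is 16B-aligned");

// Stand-in for CUDA's built-in uint4 (four 32-bit lanes, 16 bytes).
struct alignas(16) Vec4 { uint32_t x, y, z, w; };

// Host emulation of CUDA's __dp4a: 4-way int8 dot product added to acc.
int32_t dp4a(uint32_t a, uint32_t b, int32_t acc) {
    int8_t va[4], vb[4];
    std::memcpy(va, &a, 4);
    std::memcpy(vb, &b, 4);
    for (int k = 0; k < 4; ++k) acc += int32_t(va[k]) * int32_t(vb[k]);
    return acc;
}

// One 16-byte load; like the GPU instruction, it requires a 16B-aligned address.
Vec4 load_vec4(const int8_t* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    Vec4 v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}

// Packs four consecutive int8 values into one 32-bit word.
uint32_t pack4(const int8_t* p) {
    uint32_t v;
    std::memcpy(&v, p, 4);
    return v;
}

// Vectorized path: two 16B activation loads plus a single load of d.
float dot_block_vec(const Q8Block& b, const int8_t w_even[16], const int8_t w_odd[16]) {
    const Vec4 e = load_vec4(b.qs_even);
    const Vec4 o = load_vec4(b.qs_odd);
    const float d = b.d;
    const uint32_t ev[4] = {e.x, e.y, e.z, e.w};
    const uint32_t ov[4] = {o.x, o.y, o.z, o.w};
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        int32_t acc = 0;
        acc = dp4a(pack4(w_even + 4 * i), ev[i], acc);
        acc = dp4a(pack4(w_odd + 4 * i), ov[i], acc);
        sum += d * float(acc);
    }
    return sum;
}

// Element-wise reference with the same accumulation order.
float dot_block_ref(const Q8Block& b, const int8_t w_even[16], const int8_t w_odd[16]) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        int32_t acc = 0;
        for (int k = 0; k < 4; ++k) acc += int32_t(w_even[4 * i + k]) * int32_t(b.qs_even[4 * i + k]);
        for (int k = 0; k < 4; ++k) acc += int32_t(w_odd[4 * i + k]) * int32_t(b.qs_odd[4 * i + k]);
        sum += b.d * float(acc);
    }
    return sum;
}
```

With the unpadded 36-byte layout, consecutive blocks in an array start at multiples of 36, so most of them would fail the alignment check in `load_vec4`; the 48-byte stride keeps every block, and therefore both halves, on a 16-byte boundary.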

gemma4_31b decode (long-ctx harness, stacked on optimize_1):
decode 43.98 -> 46.79 tok/s (+6.4%), +12.7% compared with llama.cpp (41.5 tok/s)

profile result: int4 matvec avg 38.4 -> 34.75 us (-9.5%); quant kernel unchanged.
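The percentages quoted here can be re-derived with a few lines of C++. The sketch below uses the 46.79 tok/s decode figure recorded in the commit message; `pct_change` and `round1` are hypothetical helper names introduced only for this check.

```cpp
#include <cmath>

// Relative change from `before` to `after`, in percent.
double pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}

// Rounds to one decimal place, matching the precision used in the PR text.
double round1(double x) {
    return std::round(x * 10.0) / 10.0;
}
```

For example, `round1(pct_change(43.98, 46.79))` gives 6.4 (decode speedup), `round1(pct_change(41.5, 46.79))` gives 12.7 (versus the llama.cpp figure), and `round1(pct_change(38.4, 34.75))` gives -9.5 (int4 matvec kernel time).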

…ock)

The decode-only int4_plain_mm matvec was bound by activation load-instruction
throughput, not DRAM bandwidth (already ~64% peak) or latency. Each inner
iteration issued ~15 loads per 16-byte weight chunk: 8 scalar int32 activation
loads + the same per-block scale d reloaded 4x.

Align Q8Block to 16 bytes (sizeof 36->48) so each block's qs_even/qs_odd 16B
halves are 16B-aligned, then load a whole activation block with two vectorized
uint4 loads + one d load (~4x fewer activation loads). dp4a math and
accumulation order are bit-identical; the int8 activation values and scale are
unchanged.

gemma4_31b decode (long-ctx harness, stacked on optimize_1):
  decode  43.98 -> 46.79 tok/s (+6.4%)
  prefill 1193  -> 1186     (noise; int4_plain_mm is decode-only)
nsys: int4 matvec avg 38.4 -> 34.75 us (-9.5%); quant kernel unchanged.
Unit tests test_aoti_torch_cuda_int4_plain_mm: 6/6 pass (M=1/8, gs=16/32/128).

pytorch-bot Bot commented Jun 9, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20144

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 5 Pending, 2 Unrelated Failures

As of commit 457a316 with merge base a79f3e4:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 9, 2026
@Gasoonjia Gasoonjia changed the title from "[cuda backend] int4 W4A8 matvec: vectorized activation load" to "[cuda backend] int4/8 matvec: vectorized activation load" Jun 9, 2026
@Gasoonjia Gasoonjia marked this pull request as ready for review June 9, 2026 17:08
