
Unified stochastic kernel #259

Draft
habibutsu wants to merge 1 commit into trymirai:main from habibutsu:unified_kernel

Conversation

@habibutsu
Contributor

This is a draft of the kernel.
To check performance, run:

cargo test -p uzu --test kernel perf_batch -- --nocapture

On my laptop I get the following results:

test sampling::sampling_perf_test::perf_batch1_128k_vocab ... [uzu::backends::metal::backend::Metal] batch=1 vocab=128000
  sequential: mean=0.312ms  min=0.298ms
  unified:    mean=0.137ms  min=0.134ms
  speedup:      2.23x
ok
test sampling::sampling_perf_test::perf_batch64_128k_vocab ... [uzu::backends::metal::backend::Metal] batch=64 vocab=128000
  sequential: mean=8.402ms  min=8.271ms
  unified:    mean=1.921ms  min=1.736ms
  speedup:      4.77x
ok


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cb70631a4


}

// Precompute top_p threshold in unnorm space to avoid per-round division.
const float top_p_mass = top_p * sum_exp;

P1: Compute top-p mass from top-k-filtered logits

This computes top_p_mass from sum_exp before any top-k filtering, but the existing stochastic path applies TopK before TopP (crates/uzu/src/backends/common/kernel/sampling.rs calls self.topk.encode before self.topp.encode). When both parameters are enabled (including common defaults such as top_k=20 with top_p=0.95), unified sampling checks top-p against the full-vocab mass instead of the top-k subset. This makes top-p substantially weaker and changes sampling behavior relative to the current production path.
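The ordering the comment describes can be sketched on the CPU (an illustrative reference, not the Metal kernel): filter to the top-k survivors first, then compute the top-p mass over that subset only.

```rust
// CPU sketch (hypothetical helper, not the kernel code): compute the
// unnormalized top-p threshold mass only over the top-k survivors,
// mirroring the sequential path's TopK-before-TopP ordering.
fn top_p_mass_after_top_k(logits: &[f32], top_k: usize, top_p: f32) -> f32 {
    let mut sorted = logits.to_vec();
    // Descending sort; logits are assumed finite, so partial_cmp is safe.
    sorted.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let max = sorted[0];
    // Unnormalized softmax mass of the k survivors, shifted by max for stability.
    let sum_exp: f32 = sorted.iter().take(top_k).map(|&l| (l - max).exp()).sum();
    top_p * sum_exp
}
```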


Comment on lines +13 to +20
/// Unified single-pass path: all filtering and Gumbel-max in one kernel dispatch,
/// operating on logits loaded into private registers.
UnifiedStochastic {
temperature: Option<f32>,
top_k: Option<u32>,
top_p: Option<f32>,
min_p: Option<f32>,
},
Contributor

This should not be a separate sampling policy. Unified stochastic is an implementation detail, not a different sampling policy.
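A hypothetical shape of what this asks for (names are illustrative, not the crate's actual API): one public Stochastic policy, with the unified-vs-separate choice made internally by the backend.

```rust
// One public sampling policy; the kernel strategy is an internal detail.
#[derive(Debug, PartialEq)]
pub enum SamplingPolicy {
    Greedy,
    Stochastic {
        temperature: Option<f32>,
        top_k: Option<u32>,
        top_p: Option<f32>,
        min_p: Option<f32>,
    },
}

// Implementation choice lives behind the policy, not in it.
#[derive(Debug, PartialEq)]
enum KernelPath {
    Unified,
    Separate,
}

fn pick_path(policy: &SamplingPolicy, backend_supports_unified: bool) -> KernelPath {
    match (policy, backend_supports_unified) {
        (SamplingPolicy::Stochastic { .. }, true) => KernelPath::Unified,
        _ => KernelPath::Separate,
    }
}
```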

Comment on lines +179 to +188
if let Some(bitmask_buffer) = bitmask_buffer {
self.bitmask.encode(
None::<&B::Buffer>,
(bitmask_buffer, bitmask_offset),
logits_buffer.deref_mut(),
batch_size as u32,
vocab_size as u32,
command_buffer,
);
}
Contributor

This should also be fused

Comment on lines 41 to 48
bitmask: <B::Kernels as Kernels>::BitmaskKernel,
temperature: <B::Kernels as Kernels>::TemperatureKernel,
topk: <B::Kernels as Kernels>::TopKKernel,
topp: <B::Kernels as Kernels>::TopPKernel,
minp: <B::Kernels as Kernels>::MinPKernel,
gumbel: <B::Kernels as Kernels>::GumbelKernel,
unified: <B::Kernels as Kernels>::UnifiedStochasticKernel,
argmax_implementation: ArgmaxImplementation<B>,
Contributor

Are there any cases where separate kernels are faster? If not, unified stochastic should replace all of the old kernels rather than being a new option

Comment on lines +29 to +31
/// Sampling method (default: stochastic with model's generation config)
#[arg(long)]
sampler: Option<SamplerArg>,
Contributor

Implementation details (unified vs. separate sampling kernels) shouldn't be exposed in the CLI

// with logit-space pivots. Mirrors the Metal kernel logic.
#[kernel(UnifiedStochastic)]
#[variants(T, f32, f16, bf16)]
pub fn unified_stochastic<T: ArrayElement + Float>(
Contributor

The CPU backend is meant as a reference; it shouldn't be doing anything fancy like unified stochastic. I think there should be a single SamplingKernel trait that is implemented in the most straightforward, textbook way possible on CPU, and on Metal either with the unified kernel directly or with multiple private Metal kernels.
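A minimal sketch of that split (trait name and signature are hypothetical, not the crate's actual API): one trait, with a textbook CPU reference shown here as plain per-row argmax.

```rust
// Hypothetical trait: the only shared contract between backends.
pub trait SamplingKernel {
    /// Sample one token id per batch row from row-major `logits` (batch x vocab).
    fn sample(&self, logits: &[f32], batch: usize, vocab: usize) -> Vec<u32>;
}

// Textbook CPU reference implementation: plain argmax per row, nothing fancy.
pub struct CpuArgmax;

impl SamplingKernel for CpuArgmax {
    fn sample(&self, logits: &[f32], batch: usize, vocab: usize) -> Vec<u32> {
        (0..batch)
            .map(|b| {
                let row = &logits[b * vocab..(b + 1) * vocab];
                row.iter()
                    .enumerate()
                    .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
                    .map(|(i, _)| i as u32)
                    .unwrap()
            })
            .collect()
    }
}
```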

Comment on lines +63 to +69
// ── Unified stochastic sampling: temperature + top_k/p/min_p + sampling in one dispatch ──
//
// NOTE: No Gumbel noise, no argmax.
// Gumbel-max (add Gumbel noise to logits → argmax) is mathematically equivalent to
// inverse-transform sampling from the softmax distribution (draw u ~ U(0,1), find
// token at CDF position u). This kernel uses the latter: one uniform draw per round,
// located via a cooperative prefix-sum walk — no per-token noise, no full-vocab argmax.
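The equivalence the comment relies on can be illustrated with a small CPU sketch (not the kernel itself) of inverse-transform sampling from the softmax distribution:

```rust
// Illustrative CPU version: draw u ~ U(0,1), scale by the total
// unnormalized mass, and walk the prefix sum until it crosses the
// target. Equal in distribution to Gumbel-max over the same logits.
fn inverse_transform_sample(logits: &[f32], u: f32) -> usize {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // Unnormalized softmax weights, shifted by max for stability.
    let weights: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let total: f32 = weights.iter().sum();
    let target = u * total; // u in [0, 1)
    let mut acc = 0.0;
    for (i, &w) in weights.iter().enumerate() {
        acc += w;
        if acc > target {
            return i;
        }
    }
    weights.len() - 1 // guard against floating-point rounding at u near 1
}
```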
Contributor

We use Gumbel with a shared seed between speculator and LLM sampling for an increased acceptance rate

Contributor

@uuuvn left a comment


Accidentally selected the previous review as "approve" when it should have been "request changes"; I don't see a way to undo it

@uuuvn marked this pull request as draft on March 25, 2026 at 22:06
@@ -0,0 +1,151 @@
mod common;
Contributor

Why are you adding this file? crates/uzu/tests/integration/session/chat_session/context_mode_test.rs

@@ -0,0 +1,478 @@
mod common;
Contributor

Why are you adding this file? crates/uzu/tests/unit/encodable_block/sampling_test.rs

@@ -0,0 +1 @@
mod sampling_perf_test;
Contributor

Why are you adding a whole new directory for a single file?

min_p: f32,
#[specialize] has_bitmask: bool,
) {
let _ = min_p;
Contributor

?

} else {
top_p
};
let bitmask_stride = (vocab_size + 31) / 32;
Contributor

div_ceil
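That is, the manual rounding-up idiom can be replaced with `u32::div_ceil` (stable since Rust 1.73), which computes the same stride:

```rust
// Both expressions compute the number of 32-bit words needed to hold
// one bit per vocabulary entry.
fn bitmask_stride(vocab_size: u32) -> u32 {
    debug_assert_eq!(vocab_size.div_ceil(32), (vocab_size + 31) / 32);
    vocab_size.div_ceil(32)
}
```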

const MAX_TOP_K: u32 = 64;

pub struct SamplingKernel<B: Backend> {
bitmask: <B::Kernels as Kernels>::BitmaskKernel,
Contributor

Let's add bitmask to argmax instead of having a separate kernel only for argmax

Comment on lines +58 to +59
#[error("Stochastic: top_k={0} exceeds N_CANDIDATES={MAX_TOP_K}")]
TopKTooLarge(u32),
Contributor

We should probably still support top_k > 64 via some fallback.
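One possible shape for such a fallback (hypothetical, names illustrative): gate on the kernel's capacity and route oversized requests to the separate-kernel path instead of returning TopKTooLarge.

```rust
// Mirrors the MAX_TOP_K = 64 constant from the diff; larger values would
// fall back to the multi-kernel path rather than error out.
const MAX_TOP_K: u32 = 64;

fn fits_unified_path(top_k: Option<u32>) -> bool {
    // None means top-k filtering is disabled, which needs no candidate buffer.
    top_k.map_or(true, |k| k <= MAX_TOP_K)
}
```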

Comment on lines -178 to +160
processing_order,
..
Contributor

You're silently ignoring processing_order

@uuuvn
Contributor

uuuvn commented Apr 3, 2026

Please carefully read your diff before re-requesting review; there are a lot of things that are either obviously wrong or very dirty



2 participants