Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5cb70631a4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```cpp
}

// Precompute top_p threshold in unnorm space to avoid per-round division.
const float top_p_mass = top_p * sum_exp;
```
Compute top-p mass from top-k-filtered logits
This computes top_p_mass from sum_exp before any top-k filtering, but the existing stochastic path applies TopK before TopP (crates/uzu/src/backends/common/kernel/sampling.rs calls self.topk.encode before self.topp.encode). When both parameters are enabled (including common defaults like top_k=20 with top_p=0.95), unified sampling checks top-p against full-vocab mass instead of the top-k subset, which makes top-p substantially weaker and changes sampling behavior relative to the current production path.
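To illustrate the ordering issue, here is a hypothetical standalone sketch (illustrative numbers and function names, not the uzu code): a top-p threshold derived from the full-vocab mass can never be reached by the top-k subset, silently disabling top-p, whereas a threshold derived from the filtered mass cuts as intended.

```rust
// Hypothetical sketch — not the uzu kernel. Shows how computing the top-p
// threshold from the full-vocab sum_exp instead of the top-k-filtered sum
// weakens top-p when both filters are enabled.

/// Number of tokens kept by a top-p cutoff over the first `top_k` candidates,
/// given exp(logit) values sorted in descending order.
fn tokens_kept(sorted_exps: &[f64], top_k: usize, top_p_mass: f64) -> usize {
    let mut acc = 0.0;
    for (i, &e) in sorted_exps.iter().take(top_k).enumerate() {
        acc += e;
        if acc >= top_p_mass {
            return i + 1; // cumulative unnormalized mass reached the threshold
        }
    }
    top_k.min(sorted_exps.len()) // threshold never reached: top-p did nothing
}

fn main() {
    let exps = [8.0, 2.0, 1.5, 0.5]; // exp(logits), sorted descending
    let (top_k, top_p) = (2, 0.75);

    let full_mass: f64 = exps.iter().sum(); // 12.0 (all tokens)
    let topk_mass: f64 = exps.iter().take(top_k).sum(); // 10.0 (top-k survivors)

    // Full-vocab threshold 0.75 * 12.0 = 9.0 is only reached at the second
    // candidate, so top-p keeps everything top-k kept…
    println!("kept (full-vocab mass): {}", tokens_kept(&exps, top_k, top_p * full_mass));
    // …while the filtered threshold 0.75 * 10.0 = 7.5 already cuts after the
    // first token, matching TopK-then-TopP semantics.
    println!("kept (top-k mass):      {}", tokens_kept(&exps, top_k, top_p * topk_mass));
}
```

The exact numbers are contrived, but the same effect occurs at the defaults the review mentions (top_k=20, top_p=0.95): the subset mass is strictly smaller than the full-vocab mass, so a full-vocab threshold always keeps at least as many tokens.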
```rust
/// Unified single-pass path: all filtering and Gumbel-max in one kernel dispatch,
/// operating on logits loaded into private registers.
UnifiedStochastic {
    temperature: Option<f32>,
    top_k: Option<u32>,
    top_p: Option<f32>,
    min_p: Option<f32>,
},
```
This should not be a separate sampling policy. Unified stochastic is an implementation detail, not a different sampling policy.
```rust
if let Some(bitmask_buffer) = bitmask_buffer {
    self.bitmask.encode(
        None::<&B::Buffer>,
        (bitmask_buffer, bitmask_offset),
        logits_buffer.deref_mut(),
        batch_size as u32,
        vocab_size as u32,
        command_buffer,
    );
}
```
```rust
bitmask: <B::Kernels as Kernels>::BitmaskKernel,
temperature: <B::Kernels as Kernels>::TemperatureKernel,
topk: <B::Kernels as Kernels>::TopKKernel,
topp: <B::Kernels as Kernels>::TopPKernel,
minp: <B::Kernels as Kernels>::MinPKernel,
gumbel: <B::Kernels as Kernels>::GumbelKernel,
unified: <B::Kernels as Kernels>::UnifiedStochasticKernel,
argmax_implementation: ArgmaxImplementation<B>,
```
Are there any cases where separate kernels are faster? If not, unified stochastic should replace all of the old kernels rather than being a new option.
crates/cli/src/main.rs (Outdated)
```rust
/// Sampling method (default: stochastic with model's generation config)
#[arg(long)]
sampler: Option<SamplerArg>,
```
Implementation details (unified vs. separate sampling kernels) shouldn't be exposed in the CLI.
```rust
// with logit-space pivots. Mirrors the Metal kernel logic.
#[kernel(UnifiedStochastic)]
#[variants(T, f32, f16, bf16)]
pub fn unified_stochastic<T: ArrayElement + Float>(
```
The CPU backend is meant as a reference; it shouldn't be doing anything fancy like unified stochastic. I think there should be a single SamplingKernel trait that is implemented in the most straightforward, textbook way possible on CPU, and on Metal either with the unified kernel directly or with multiple private Metal kernels.
```rust
// ── Unified stochastic sampling: temperature + top_k/p/min_p + sampling in one dispatch ──
//
// NOTE: No Gumbel noise, no argmax.
// Gumbel-max (add Gumbel noise to logits → argmax) is mathematically equivalent to
// inverse-transform sampling from the softmax distribution (draw u ~ U(0,1), find
// token at CDF position u). This kernel uses the latter: one uniform draw per round,
// located via a cooperative prefix-sum walk — no per-token noise, no full-vocab argmax.
```
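The equivalence claimed in that comment can be sanity-checked numerically. A standalone sketch (illustrative only: a toy xorshift PRNG and hand-rolled softmax, not the uzu kernels) compares the empirical frequencies of both samplers:

```rust
// Toy demo — not the uzu kernels. Empirically checks that Gumbel-max
// (add Gumbel noise to logits, take argmax) samples from the same softmax
// distribution as inverse-transform sampling (one uniform, walk the CDF).

fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

fn gumbel_max_sample(logits: &[f64], uniform: &mut impl FnMut() -> f64) -> usize {
    // argmax over logit + Gumbel(0, 1) noise, one noise draw per token.
    let mut best = f64::NEG_INFINITY;
    let mut arg = 0;
    for (i, &l) in logits.iter().enumerate() {
        let g = -(-uniform().ln()).ln();
        if l + g > best {
            best = l + g;
            arg = i;
        }
    }
    arg
}

fn inverse_cdf_sample(probs: &[f64], u: f64) -> usize {
    // One uniform draw; return the token at CDF position u.
    let mut acc = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        acc += p;
        if u < acc {
            return i;
        }
    }
    probs.len() - 1
}

fn xorshift(state: &mut u64) -> f64 {
    // Tiny PRNG so the example needs no external crates.
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    (*state >> 11) as f64 / (1u64 << 53) as f64
}

fn main() {
    let logits = [2.0, 1.0, 0.0, -1.0];
    let probs = softmax(&logits);
    let mut s = 0x9E3779B97F4A7C15u64;
    let mut uni = || xorshift(&mut s);

    let n = 200_000;
    let (mut g_counts, mut c_counts) = (vec![0usize; 4], vec![0usize; 4]);
    for _ in 0..n {
        g_counts[gumbel_max_sample(&logits, &mut uni)] += 1;
        let u = uni();
        c_counts[inverse_cdf_sample(&probs, u)] += 1;
    }
    for i in 0..logits.len() {
        println!(
            "token {i}: softmax={:.4} gumbel={:.4} inverse-cdf={:.4}",
            probs[i],
            g_counts[i] as f64 / n as f64,
            c_counts[i] as f64 / n as f64,
        );
    }
}
```

Note that equivalence in distribution does not preserve the individual per-seed draws: with a fixed seed the two methods generally pick different tokens on any given step, which matters if the speculator and target model are expected to share noise.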
We use Gumbel with a shared seed between the speculator and LLM sampling for an increased acceptance rate.
uuuvn left a comment:
Accidentally selected the previous review as "approve" when it should've been "request changes"; I don't see a way to undo it.
Force-pushed 5cb7063 to 346ce14, then 346ce14 to 1a0a7b4.
```diff
@@ -0,0 +1,151 @@
mod common;
```
Why are you adding this file? `crates/uzu/tests/integration/session/chat_session/context_mode_test.rs`
```diff
@@ -0,0 +1,478 @@
mod common;
```
Why are you adding this file? `crates/uzu/tests/unit/encodable_block/sampling_test.rs`
```diff
@@ -0,0 +1 @@
mod sampling_perf_test;
```
Why are you adding a whole new directory for a single file?
```rust
    min_p: f32,
    #[specialize] has_bitmask: bool,
) {
    let _ = min_p;
```
```rust
} else {
    top_p
};
let bitmask_stride = (vocab_size + 31) / 32;
```
```rust
const MAX_TOP_K: u32 = 64;

pub struct SamplingKernel<B: Backend> {
    bitmask: <B::Kernels as Kernels>::BitmaskKernel,
```
Let's add bitmask to argmax instead of having a separate kernel only for argmax
```rust
#[error("Stochastic: top_k={0} exceeds N_CANDIDATES={MAX_TOP_K}")]
TopKTooLarge(u32),
```
We should probably still support top_k > 64 via some fallback.
```rust
processing_order,
..
```
You're silently ignoring processing order.
|
Please carefully read your diff before re-requesting review; there are a lot of things that are either obviously wrong or very dirty.
It is a draft of the kernel.
For checking performance, the following command can be used:
```shell
cargo test -p uzu --test kernel perf_batch -- --nocapture
```
On my laptop I have the following results: