Neural-speculator with PARD approach#241
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08756075d2
```rust
unsafe impl<B: Backend> Send for NeuralSpeculator<B> {}
unsafe impl<B: Backend> Sync for NeuralSpeculator<B> {}
```
Remove unsound Sync/Send impl from NeuralSpeculator
NeuralSpeculator keeps mutable state in RefCell (draft and speculative_cache), but this commit marks it Send and Sync with unsafe impl based on a SessionWrapper mutex assumption that only holds in the CLI path. The type is publicly exported, so safe Rust can now share it across threads (e.g., via Arc) and concurrently call prepare/speculate, which creates a data race on RefCell's non-atomic borrow state and is undefined behavior. Please make the interior mutability thread-safe (e.g., Mutex/RwLock) or remove these impls and enforce single-threaded access at the type boundary.
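One way to resolve this without giving up thread safety is to replace the `RefCell` interior mutability with a `Mutex`, which makes the type `Send + Sync` automatically and lets the `unsafe impl`s be deleted. A minimal sketch of that shape, where `DraftState` and `SpeculativeCache` are hypothetical placeholders for the real field types (only the field names `draft` and `speculative_cache` come from the review):

```rust
use std::sync::Mutex;

struct DraftState;       // placeholder for the draft-model state
struct SpeculativeCache; // placeholder for the speculative KV cache

pub struct NeuralSpeculator {
    // Mutex instead of RefCell: contended access blocks instead of
    // racing on RefCell's non-atomic borrow flags.
    draft: Mutex<DraftState>,
    speculative_cache: Mutex<SpeculativeCache>,
}

impl NeuralSpeculator {
    pub fn speculate(&self) {
        let _draft = self.draft.lock().unwrap();
        let _cache = self.speculative_cache.lock().unwrap();
        // ... run the draft model against the cache ...
    }
}

// Compile-time proof that the type is thread-safe without any unsafe impl.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    assert_send_sync::<NeuralSpeculator>();
    println!("ok");
}
```

The alternative the review mentions, removing the impls and keeping the type single-threaded, avoids the locking cost but pushes the constraint onto every caller.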
```rust
    buf.extend_from_slice(&(tok as u32).to_le_bytes());
}
// ...
xxh3_64(&buf) as usize % self.hashtable_size
```
Validate ngram hashtable_size before modulo operations
A malformed model.bin with hashtable_size = 0 is currently accepted by load, and any subsequent speculate on a non-empty prefix will panic on % self.hashtable_size here. Since speculator files come from disk/CLI input, this turns an invalid model into a process crash in run/serve instead of a clean InvalidData error. Reject zero-sized hash tables during load to avoid this runtime panic.
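A minimal sketch of the suggested load-time check, assuming `load` reads `hashtable_size` as a `usize` from the file header (the function name here is illustrative, not uzu's actual API):

```rust
use std::io;

// Reject a zero-sized hash table when loading model.bin, so speculate
// never reaches `% self.hashtable_size` with a zero divisor.
fn validate_hashtable_size(hashtable_size: usize) -> io::Result<usize> {
    if hashtable_size == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "ngram speculator: hashtable_size must be non-zero",
        ));
    }
    Ok(hashtable_size)
}

fn main() {
    assert!(validate_hashtable_size(0).is_err());
    assert_eq!(validate_hashtable_size(1 << 16).unwrap(), 1 << 16);
    println!("ok");
}
```

Returning `InvalidData` matches the clean-error behavior the review asks for in `run`/`serve`.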
Force-pushed 69841ab to a9d3506
I can reproduce the acceptance rates but not the speedup on my M1 Max, which is weird because the M4 has only about 37% more FLOPs/bandwidth, and that should be the main predictor of how much speedup a given acceptance rate yields. I'll ask people with other chips to test this too to figure out what's happening.

Re the ngram speculator: we have our own ngram speculator implementation internally which is a bit cleaner; I'll push it into open-source uzu today.

Can you split the CLI changes (thinking switch and acceptance-rate display) into a separate PR and leave this one with PARD?
31.411 t/s without speculation vs. 10.901 t/s with the n-gram speculator; it looks strange because that speculator does not add any overhead and can even correctly guess some tokens in 6% of cases.
How did you train the ngram model? 6% is a very low acceptance rate; you should be able to get much higher numbers with some hyperparameter tuning.
It's almost definitely the model from our CDN; I get roughly the same number there.
Oops, yes, the models from the CDN are not very good. @uuuvn I remember you had a good one for Qwen3, right? Can you upload it for testing?
Yes, I used your model. In my implementation, when the "ngram" speculator type is specified and no explicit path is given, uzu looks for a speculators directory inside the main model.
The ngram speculator's "inference" is indeed essentially free, but the larger batch size isn't in practice (even though theoretical calculations suggest we should be able to pack quite a few tokens before becoming compute-bound).
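The theoretical headroom mentioned here follows from a roofline argument: single-token decode streams every weight once, so batching speculative tokens reuses each weight read. A back-of-envelope sketch, where all hardware numbers are illustrative assumptions rather than measured figures for any Apple chip:

```rust
/// Batch size at which weight-streaming decode stops being memory-bound.
/// Decoding one token reads every weight once (~2 FLOPs per weight), so
/// per-token arithmetic intensity is 2 / bytes_per_weight FLOP per byte;
/// verifying b tokens reuses the same weight read b times.
fn critical_batch(peak_flops: f64, bandwidth: f64, bytes_per_weight: f64) -> f64 {
    let machine_balance = peak_flops / bandwidth; // FLOP/byte the chip sustains
    let per_token_intensity = 2.0 / bytes_per_weight;
    machine_balance / per_token_intensity
}

fn main() {
    // Assumed figures: ~4 TFLOP/s peak, ~120 GB/s bandwidth, fp16 weights.
    let b = critical_batch(4.0e12, 120.0e9, 2.0);
    println!("compute-bound above ~{b:.0} tokens per step");
}
```

Under these assumed numbers the crossover sits in the tens of tokens per step, which is why the speculative batch "should" be nearly free in theory even when it isn't in practice.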
Also, there is significant overhead from disabling the cursed async encoding loop, which causes degradation on higher-end devices and smaller models due to CPU-GPU sync overhead. We will push significant improvements to the encoding pipeline with the move to Metal4 soon, so speculative decoding will get more efficient for smaller models.
@habibutsu this model is about 2x better than the CDN one. I had to zip it because GitHub won't allow .bin attachments for whatever reason.
Yes, this speculator is much better. I've got the following results:
I am convinced that ngram models should outperform standalone LLMs at those scales, especially for larger draft trees. Our current ngram models are relatively primitive: we only use bigrams and don't have any backoff/smoothing. I think adding something like Kneser-Ney smoothing and using 3- or 4-grams should bring the acceptance rate to 20-30%.
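The backoff idea can be sketched in a few lines: draft from the longest n-gram context that has been seen, falling back to shorter contexts. This toy uses "stupid backoff" (hard fallback) rather than the Kneser-Ney smoothing proposed above, and the structures here are illustrative, not uzu's hashed model.bin format:

```rust
use std::collections::HashMap;

// Toy n-gram drafter: trigram table first, then bigrams, then the
// globally most frequent token.
struct NgramDraft {
    trigrams: HashMap<(u32, u32), u32>,
    bigrams: HashMap<u32, u32>,
    unigram: u32,
}

impl NgramDraft {
    fn draft(&self, prev2: u32, prev1: u32) -> u32 {
        self.trigrams
            .get(&(prev2, prev1))
            .or_else(|| self.bigrams.get(&prev1))
            .copied()
            .unwrap_or(self.unigram)
    }
}

fn main() {
    let model = NgramDraft {
        trigrams: HashMap::from([((1, 2), 3)]),
        bigrams: HashMap::from([(2, 9)]),
        unigram: 0,
    };
    assert_eq!(model.draft(1, 2), 3); // trigram hit
    assert_eq!(model.draft(7, 2), 9); // backoff to bigram
    assert_eq!(model.draft(7, 8), 0); // backoff to unigram
    println!("ok");
}
```

Proper Kneser-Ney would instead mix the levels with discounted continuation counts, which is what should push acceptance beyond what a hard fallback gives.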
@uuuvn I left only the PARD speculator here, plus some minor improvements to your ngram.
Results (running on an Apple M4, 24 GB):
```shell
cargo run --release -p cli -- run ./models/0.1.8/Qwen3-4B/ \
  --no-thinking --seed 16 \
  --message "Tell me briefly about Navier–Stokes equations"
# 17.712 s, 12.506 t/s

cargo run --release -p cli -- run ./models/0.1.8/Qwen3-4B/ \
  --no-thinking --seed 16 \
  --speculator-type ngram \
  --message "Tell me briefly about Navier–Stokes equations"
# 25.713 s, 13.530 t/s, acc 59/1060 (6%)

cargo run --release -p cli -- run ./models/0.1.8/Qwen3-4B/ \
  --no-thinking --seed 16 \
  --speculator-type pard \
  --speculator-path ../lalamo/models/PARD-Qwen3-0.6B/ \
  --speculator-tokens 4 \
  --message "Tell me briefly about Navier–Stokes equations"
# 14.708 s, 18.649 t/s, acc 120/420 (29%)
```