Neural-speculator with PARD approach #241

Draft

habibutsu wants to merge 1 commit into trymirai:main from habibutsu:speculators

Conversation

Contributor

habibutsu commented Mar 6, 2026

  • ngram-speculator
  • neural-speculator with PARD approach
  • acceptance rate metric
  • --no-thinking CLI flag

Results (run on an Apple M4, 24 GB):

cargo run --release -p cli -- run \
  ./models/0.1.8/Qwen3-4B/ \
  --no-thinking \
  --seed 16 \
  --message "Tell me briefly about Navier–Stokes equations"

17.712s, 12.506t/s

cargo run --release -p cli -- run \
  ./models/0.1.8/Qwen3-4B/ \
  --no-thinking \
  --seed 16 \
  --speculator-type ngram \
  --message "Tell me briefly about Navier–Stokes equations"

25.713s, 13.530t/s, acc 59/1060 (6%)

cargo run --release -p cli -- run \
  ./models/0.1.8/Qwen3-4B/ \
  --no-thinking \
  --seed 16 \
  --speculator-type pard \
  --speculator-path ../lalamo/models/PARD-Qwen3-0.6B/ \
  --speculator-tokens 4 \
  --message "Tell me briefly about Navier–Stokes equations"

14.708s, 18.649t/s, acc 120/420 (29%)


chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08756075d2


Comment on lines +47 to +48
unsafe impl<B: Backend> Send for NeuralSpeculator<B> {}
unsafe impl<B: Backend> Sync for NeuralSpeculator<B> {}

P1: Remove unsound Sync/Send impls from NeuralSpeculator

NeuralSpeculator keeps mutable state in RefCell (draft and speculative_cache), but this commit marks it Send and Sync with unsafe impl based on a SessionWrapper mutex assumption that only holds in the CLI path. The type is publicly exported, so safe Rust can now share it across threads (e.g., via Arc) and concurrently call prepare/speculate, which creates a data race on RefCell's non-atomic borrow state and is undefined behavior. Please make the interior mutability thread-safe (e.g., Mutex/RwLock) or remove these impls and enforce single-threaded access at the type boundary.
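One way to implement the suggested fix, sketched below with hypothetical stand-in types (the struct and field names are illustrative, not the repo's actual `NeuralSpeculator` API): wrapping the interior-mutable state in a `Mutex` lets the compiler derive `Send + Sync` automatically, so no `unsafe impl` is needed and concurrent access blocks instead of racing on `RefCell`'s borrow flags.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the draft model + speculative cache state;
// the real fields live inside NeuralSpeculator<B>.
struct DraftState {
    cache_len: usize,
}

// Because Mutex<T>: Send + Sync when T: Send, this type is Send + Sync
// by construction, with no `unsafe impl` anywhere.
struct NeuralSpeculatorSketch {
    state: Mutex<DraftState>,
}

impl NeuralSpeculatorSketch {
    fn new() -> Self {
        Self { state: Mutex::new(DraftState { cache_len: 0 }) }
    }

    // A `&self` method that mutates state safely even when the value is
    // shared across threads via Arc.
    fn speculate(&self, n_tokens: usize) -> usize {
        let mut state = self.state.lock().unwrap();
        state.cache_len += n_tokens;
        state.cache_len
    }
}

fn main() {
    let spec = Arc::new(NeuralSpeculatorSketch::new());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let s = Arc::clone(&spec);
            thread::spawn(move || s.speculate(4))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // All 4 threads appended 4 tokens each without data races.
    assert_eq!(spec.state.lock().unwrap().cache_len, 16);
}
```

If lock contention on the hot path is a concern, an alternative is to drop the impls entirely and require `&mut self` on `prepare`/`speculate`, pushing synchronization to the caller.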


buf.extend_from_slice(&(tok as u32).to_le_bytes());
}

xxh3_64(&buf) as usize % self.hashtable_size

P2: Validate ngram hashtable_size before modulo operations

A malformed model.bin with hashtable_size = 0 is currently accepted by load, and any subsequent speculate on a non-empty prefix will panic on % self.hashtable_size here. Since speculator files come from disk/CLI input, this turns an invalid model into a process crash in run/serve instead of a clean InvalidData error. Reject zero-sized hash tables during load to avoid this runtime panic.
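The check could look roughly like this (the function name and error wording are illustrative, not the actual load code, which reads `hashtable_size` from the model.bin header):

```rust
use std::io;

// Reject a zero-sized hash table at load time, so a malformed model.bin
// produces a clean InvalidData error instead of a `% 0` panic later in
// speculate().
fn validate_hashtable_size(hashtable_size: usize) -> io::Result<usize> {
    if hashtable_size == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "ngram speculator: hashtable_size must be non-zero",
        ));
    }
    Ok(hashtable_size)
}

fn main() {
    assert!(validate_hashtable_size(0).is_err());
    assert_eq!(validate_hashtable_size(1024).unwrap(), 1024);
}
```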


habibutsu force-pushed the speculators branch 3 times, most recently from 69841ab to a9d3506 (March 6, 2026, 21:17)
Contributor

uuuvn commented Mar 7, 2026

I can reproduce the acceptance rates but not the speedup on my M1 Max:

[uuuvn@macbook ~/src/uzu]% cargo run --release -p cli -- run \
  ./models/0.1.8/Qwen3-4B/ \
  --no-thinking \
  --seed 16 \
  --message "Tell me briefly about Navier–Stokes equations"
    Finished `release` profile [optimized] target(s) in 0.70s
     Running `target/release/cli run ./models/0.1.8/Qwen3-4B/ --no-thinking --seed 16 --message 'Tell me briefly about Navier–Stokes equations'`
Loaded: Qwen3-4B

The Navier–Stokes equations are a set of partial differential equations that describe the motion of viscous fluid substances. They are fundamental in fluid dynamics and are used to model a wide range of phenomena, from weather patterns to blood flow in the human body.

### Key Points:
- **Purpose**: They govern the behavior of fluids (liquids and gases) under various conditions, including viscosity, pressure, and external forces.
- **Equations**: The equations consist of the **continuity equation** (mass conservation) and the **momentum equation** (force balance).
- **Variables**: They involve velocity, pressure, density, and temperature fields.
- **Complexity**: The equations are nonlinear and can be extremely challenging to solve analytically, leading to the famous **Navier–Stokes existence and smoothness problem**, one of the Millennium Prize Problems.
- **Applications**: Used in engineering, meteorology, oceanography, and many other fields where fluid motion is critical.

In summary, the Navier–Stokes equations are the cornerstone of fluid dynamics, providing a mathematical framework to understand and predict fluid behavior.

9.367s, 31.411t/s
[uuuvn@macbook ~/src/uzu]% cargo run --release -p cli -- run \
  ./models/0.1.8/Qwen3-4B/ \
  --no-thinking \
  --seed 16 \
  --speculator-type ngram \
  --message "Tell me briefly about Navier–Stokes equations"
    Finished `release` profile [optimized] target(s) in 0.65s
     Running `target/release/cli run ./models/0.1.8/Qwen3-4B/ --no-thinking --seed 16 --speculator-type ngram --message 'Tell me briefly about Navier–Stokes equations'`
Loaded: Qwen3-4B

The **Navier–Stokes equations** are a set of partial differential equations that describe the motion of **viscous fluid substances** such as air or water. They are fundamental in fluid dynamics and are used to model a wide range of phenomena, from weather patterns to blood flow in the body.

### Key Points:
- **Purpose**: They describe the conservation of mass, momentum, and energy in a fluid.
- **Variables**: They involve velocity, pressure, density, and temperature of the fluid.
- **Form**: They are derived from Newton's second law of motion applied to fluid elements.
- **Challenges**: They are nonlinear and can exhibit complex behavior, including turbulence, making them difficult to solve analytically in general.
- **Applications**: Used in aerodynamics, meteorology, oceanography, and engineering.

### Mathematical Form (simplified):
$$
\rho \left( \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} \right) = -\nabla p + \mu \nabla^2 \mathbf{u} + \mathbf{f}
$$
Where:
- $\rho$ is density,
- $\mathbf{u}$ is velocity vector,
- $p$ is pressure,
- $\mu$ is dynamic viscosity,
- $\mathbf{f}$ is body force per unit volume.

The Navier–Stokes equations are one of the seven Millennium Prize Problems, with a prize of $1 million for a proof of their existence and smoothness in three dimensions.

31.933s, 10.901t/s, acc 59/1060 (6%)
[uuuvn@macbook ~/src/uzu]% cargo run --release -p cli -- run \
  ./models/0.1.8/Qwen3-4B/ \
  --no-thinking \
  --seed 16 \
  --speculator-type pard \
  --speculator-path ../lalamo/models/PARD-Qwen3-0.6B/ \
  --speculator-tokens 4 \
  --message "Tell me briefly about Navier–Stokes equations"
    Finished `release` profile [optimized] target(s) in 0.65s
     Running `target/release/cli run ./models/0.1.8/Qwen3-4B/ --no-thinking --seed 16 --speculator-type pard --speculator-path ../lalamo/models/PARD-Qwen3-0.6B/ --speculator-tokens 4 --message 'Tell me briefly about Navier–Stokes equations'`
Loaded: Qwen3-4B

The **Navier–Stokes equations** are a set of partial differential equations that describe the motion of fluid substances, such as liquids and gases. They are fundamental in fluid dynamics and are used to model a wide range of phenomena, from weather patterns to ocean currents, and from blood flow in the body to airflow over an airplane wing.

### Key Points:
- **Purpose**: They describe the conservation of mass, momentum, and energy in a fluid.
- **Variables**: They involve velocity, pressure, density, and temperature of the fluid.
- **Form**: The equations are nonlinear and can be very complex, making them difficult to solve analytically in most cases.
- **Applications**: Used in engineering, physics, meteorology, and many other fields.
- **Challenge**: Solving them exactly is generally impossible, so numerical methods (like CFD) are often used.

In summary, the Navier–Stokes equations are the cornerstone of fluid dynamics, governing how fluids move and interact with forces and boundaries.

17.412s, 14.169t/s, acc 119/425 (28%)
[uuuvn@macbook ~/src/uzu]%

Which is weird, because the M4 has only about 37% more FLOPS/bandwidth, and that should be the main predictor of how much speedup speculation gives. I'll ask people with other chips to test this too, to figure out what's happening.

re ngram speculator: we have our own ngram speculator implementation internally which is a bit cleaner; I'll push it into open-source uzu today.

re pard: the code currently contains ugly hacks, like interior mutability on trie nodes via `flat_idx: std::cell::Cell<usize>`, which is mutated on every trie linearization. We should probably figure out a better way to integrate it with LanguageModelGenerator, but otherwise this is a really good change.

Can you split the CLI changes (thinking switch and acceptance-rate display) into a separate PR, and leave this one with just PARD?

@habibutsu
Contributor Author

31.411 t/s without speculation vs. 10.901 t/s with the n-gram speculator — it looks strange because that speculator does not add any overhead and can even guess some tokens in 6% of cases.


norpadon commented Mar 7, 2026

> 31.411 t/s without speculation vs. 10.901 t/s with the n-gram speculator — it looks strange because that speculator does not add any overhead and can even guess some tokens in 6% of cases.

How did you train the ngram model? 6% is a very low acceptance rate; you should be able to get much higher numbers with some hyperparameter tuning.

Contributor

uuuvn commented Mar 7, 2026

> 31.411 t/s without speculation vs. 10.901 t/s with the n-gram speculator — it looks strange because that speculator does not add any overhead and can even guess some tokens in 6% of cases.

> How did you train the ngram model? 6% is a very low acceptance rate, you should be able to get much higher numbers with some hyperparameter tuning

It's almost definitely the model from our CDN; I get ~the same number there.


norpadon commented Mar 7, 2026

Oops, yes, the models from the CDN are not very good. @uuuvn I remember you had a good one for Qwen3, right? Can you upload it for testing?

@habibutsu
Contributor Author

> 31.411 t/s without speculation vs. 10.901 t/s with the n-gram speculator — it looks strange because that speculator does not add any overhead and can even guess some tokens in 6% of cases.

> How did you train the ngram model? 6% is a very low acceptance rate, you should be able to get much higher numbers with some hyperparameter tuning

> It's almost definitely the model from our cdn, I get ~the same number there

Yes, I used your model. In my implementation, when the "ngram" speculator type is specified and no explicit path is given, uzu looks for a speculators directory inside the main model.

Contributor

uuuvn commented Mar 7, 2026

> 31.411 t/s without speculation vs. 10.901 t/s with the n-gram speculator — it looks strange because that speculator does not add any overhead and can even guess some tokens in 6% of cases.

The ngram speculator's "inference" is indeed essentially free, but the larger batch size isn't in practice (even though theoretical calculations suggest we should be able to pack quite a few tokens before becoming compute-bound).
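That trade-off can be put in back-of-envelope terms. This is a sketch under stated simplifying assumptions, not a model of uzu's actual pipeline: per-token acceptance is treated as i.i.d. with probability `p`, and `step_cost` is a made-up parameter for the relative cost of one k-token verification step versus one plain decode step.

```rust
// Expected committed tokens per verification step over k draft tokens,
// assuming i.i.d. per-token acceptance probability p (< 1): the draft run
// commits between 1 and k+1 tokens, with expectation
// (1 - p^(k+1)) / (1 - p). Dividing by the step's relative cost gives an
// estimated end-to-end speedup.
fn expected_speedup(p: f64, k: u32, step_cost: f64) -> f64 {
    let expected_tokens = (1.0 - p.powi(k as i32 + 1)) / (1.0 - p);
    expected_tokens / step_cost
}

fn main() {
    // At 6% acceptance, even a modest 20% per-step overhead loses:
    assert!(expected_speedup(0.06, 4, 1.2) < 1.0);
    // At ~29% acceptance the same overhead is absorbed with room to spare:
    assert!(expected_speedup(0.29, 4, 1.2) > 1.0);
}
```

The reported acceptance rates (accepted/drafted) are not exactly a per-token probability, so this only gives a rough feel for where the break-even point sits on a given chip.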


norpadon commented Mar 7, 2026

Also, there is significant overhead from disabling the cursed async encoding loop, which results in degradation on higher-end devices and smaller models due to CPU-GPU sync overhead.

We will push significant improvements to the encoding pipeline with the move to Metal4 soon, so speculative decoding will get more efficient for smaller models

Contributor

uuuvn commented Mar 7, 2026

> Oops yes the models from the CDN are not very good. @uuuvn I remember you had a good one for Qwen3, right? Can you upload it for testing?

@habibutsu this model is about 2x better than the CDN one

2gram.bin.zip

I had to zip it because GitHub won't allow .bin for whatever reason

Contributor Author

habibutsu commented Mar 7, 2026

> Oops yes the models from the CDN are not very good. @uuuvn I remember you had a good one for Qwen3, right? Can you upload it for testing?

> @habibutsu this model is about 2x better than the cdn one
>
> 2gram.bin.zip
>
> I had to zip it because github won't allow .bin for whatever reason

Yes, this speculator is much better. I've got the following results:
time: 15.681s, speed: 13.996t/s, tokens: 206, speculation-rate: 28/176 (16%)


norpadon commented Mar 7, 2026

I am convinced that ngram models should outperform standalone LLMs at those scales, especially for larger draft trees.

Our current ngram models are relatively primitive: we only use bigrams and don't have any backoff/smoothing.

I think adding something like Kneser-Ney smoothing and using 3- or 4-grams should bring the acceptance rate to 20–30%.
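The interpolated Kneser-Ney idea can be sketched for bigrams as follows; this is a toy over raw token ids, unrelated to uzu's actual speculator file format. Seen bigram counts are discounted by a constant `d`, and the freed probability mass is redistributed via a continuation probability that counts in how many distinct contexts a token appears, rather than its raw frequency.

```rust
use std::collections::HashMap;

// Interpolated Kneser-Ney for bigrams (toy sketch):
//   P(w|v) = max(c(v,w) - d, 0)/c(v) + lambda(v) * Pcont(w)
// where lambda(v) = d * N1+(v,·)/c(v) and Pcont(w) = N1+(·,w) / #bigrams.
struct KneserNeyBigram {
    bigram: HashMap<(u32, u32), u32>, // c(v, w)
    context_total: HashMap<u32, u32>, // c(v)
    context_types: HashMap<u32, u32>, // N1+(v, ·): distinct followers of v
    continuation: HashMap<u32, u32>,  // N1+(·, w): distinct contexts of w
    distinct_bigrams: u32,
    discount: f64,
}

impl KneserNeyBigram {
    fn train(tokens: &[u32], discount: f64) -> Self {
        let mut bigram: HashMap<(u32, u32), u32> = HashMap::new();
        for w in tokens.windows(2) {
            *bigram.entry((w[0], w[1])).or_insert(0) += 1;
        }
        let mut context_total = HashMap::new();
        let mut context_types = HashMap::new();
        let mut continuation = HashMap::new();
        for (&(v, w), &c) in &bigram {
            *context_total.entry(v).or_insert(0) += c;
            *context_types.entry(v).or_insert(0) += 1;
            *continuation.entry(w).or_insert(0) += 1;
        }
        let distinct_bigrams = bigram.len() as u32;
        Self { bigram, context_total, context_types, continuation, distinct_bigrams, discount }
    }

    fn prob(&self, v: u32, w: u32) -> f64 {
        let c_vw = *self.bigram.get(&(v, w)).unwrap_or(&0) as f64;
        let c_v = *self.context_total.get(&v).unwrap_or(&0) as f64;
        let p_cont =
            *self.continuation.get(&w).unwrap_or(&0) as f64 / self.distinct_bigrams as f64;
        if c_v == 0.0 {
            return p_cont; // unseen context: back off to continuation probability
        }
        let types = *self.context_types.get(&v).unwrap_or(&0) as f64;
        let lambda = self.discount * types / c_v;
        (c_vw - self.discount).max(0.0) / c_v + lambda * p_cont
    }
}

fn main() {
    // Toy corpus of token ids: "1 2" occurs twice, "1 3" once.
    let lm = KneserNeyBigram::train(&[1u32, 2, 1, 2, 1, 3], 0.75);
    // The more frequent continuation outranks the rarer one...
    assert!(lm.prob(1, 2) > lm.prob(1, 3));
    // ...and unseen pairs still get non-zero mass from the backoff.
    assert!(lm.prob(3, 2) > 0.0);
}
```

Extending this to 3- or 4-grams means recursing the same discount-and-back-off step down to the continuation unigram, which composes naturally with a hashed storage scheme like the current one.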

habibutsu changed the title from "Speculators implementation" to "Neural-speculator with PARD approach" on Mar 8, 2026
@habibutsu
Contributor Author

@uuuvn I left only the PARD speculator here, plus some minor improvements to your ngram: instead of flat_idx, I added indices to FlatTrie.

uuuvn marked this pull request as draft (March 25, 2026, 22:06)