Neural-speculator with PARD approach#241
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08756075d2
```rust
unsafe impl<B: Backend> Send for NeuralSpeculator<B> {}
unsafe impl<B: Backend> Sync for NeuralSpeculator<B> {}
```
Remove unsound Sync/Send impl from NeuralSpeculator
NeuralSpeculator keeps mutable state in RefCell (draft and speculative_cache), but this commit marks it Send and Sync with unsafe impl based on a SessionWrapper mutex assumption that only holds in the CLI path. The type is publicly exported, so safe Rust can now share it across threads (e.g., via Arc) and concurrently call prepare/speculate, which creates a data race on RefCell's non-atomic borrow state and is undefined behavior. Please make the interior mutability thread-safe (e.g., Mutex/RwLock) or remove these impls and enforce single-threaded access at the type boundary.
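One way to resolve this without giving up thread safety is to replace the `RefCell` interior mutability with a `Mutex`, which makes the type `Send + Sync` automatically and lets the `unsafe impl`s be deleted. A minimal sketch of that shape, where `DraftState` and `SpeculativeCache` are hypothetical placeholders for the real field types (only the field names `draft` and `speculative_cache` come from the review):

```rust
use std::sync::Mutex;

struct DraftState;       // placeholder for the draft-model state
struct SpeculativeCache; // placeholder for the speculative KV cache

pub struct NeuralSpeculator {
    // Mutex instead of RefCell: contended access blocks instead of
    // racing on RefCell's non-atomic borrow flags.
    draft: Mutex<DraftState>,
    speculative_cache: Mutex<SpeculativeCache>,
}

impl NeuralSpeculator {
    pub fn speculate(&self) {
        let _draft = self.draft.lock().unwrap();
        let _cache = self.speculative_cache.lock().unwrap();
        // ... run the draft model against the cache ...
    }
}

// Compile-time proof that the type is thread-safe without any unsafe impl.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    assert_send_sync::<NeuralSpeculator>();
    println!("ok");
}
```

The alternative the review mentions, removing the impls and keeping the type single-threaded, avoids the locking cost but pushes the constraint onto every caller.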
```rust
    buf.extend_from_slice(&(tok as u32).to_le_bytes());
}
// ...
xxh3_64(&buf) as usize % self.hashtable_size
```
Validate ngram hashtable_size before modulo operations
A malformed model.bin with hashtable_size = 0 is currently accepted by load, and any subsequent speculate on a non-empty prefix will panic on % self.hashtable_size here. Since speculator files come from disk/CLI input, this turns an invalid model into a process crash in run/serve instead of a clean InvalidData error. Reject zero-sized hash tables during load to avoid this runtime panic.
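A minimal sketch of the suggested load-time check, assuming `load` reads `hashtable_size` as a `usize` from the file header (the function name here is illustrative, not uzu's actual API):

```rust
use std::io;

// Reject a zero-sized hash table when loading model.bin, so speculate
// never reaches `% self.hashtable_size` with a zero divisor.
fn validate_hashtable_size(hashtable_size: usize) -> io::Result<usize> {
    if hashtable_size == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "ngram speculator: hashtable_size must be non-zero",
        ));
    }
    Ok(hashtable_size)
}

fn main() {
    assert!(validate_hashtable_size(0).is_err());
    assert_eq!(validate_hashtable_size(1 << 16).unwrap(), 1 << 16);
    println!("ok");
}
```

Returning `InvalidData` matches the clean-error behavior the review asks for in `run`/`serve`.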
Force-pushed 69841ab to a9d3506
I can reproduce the acceptance rates but not the speedup on my M1 Max, which is weird because the M4 has only about 37% more FLOPs/bandwidth, and that should be the main predictor of how much speedup a given acceptance rate yields. I'll ask people with other chips to test this too to figure out what's happening.

Re the ngram speculator: we have our own ngram speculator implementation internally which is a bit cleaner; I'll push it into open-source uzu today.

Can you split the CLI changes (thinking switch and acceptance-rate display) into a separate PR and leave this one with PARD?
31.411 t/s without speculation vs. 10.901 t/s with the n-gram speculator; it looks strange because that speculator does not add any overhead and can even correctly guess some tokens in 6% of cases.
How did you train the ngram model? 6% is a very low acceptance rate; you should be able to get much higher numbers with some hyperparameter tuning.
It's almost definitely the model from our CDN; I get roughly the same number there.
Oops, yes, the models from the CDN are not very good. @uuuvn I remember you had a good one for Qwen3, right? Can you upload it for testing?
Yes, I used your model. In my implementation, when the "ngram" speculator type is specified and no explicit path is given, uzu looks for a speculators directory inside the main model.
The ngram speculator's "inference" is indeed essentially free, but the larger batch size isn't in practice (even though theoretical calculations suggest we should be able to pack quite a few tokens before becoming compute-bound).
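The theoretical headroom mentioned here follows from a roofline argument: single-token decode streams every weight once, so batching speculative tokens reuses each weight read. A back-of-envelope sketch, where all hardware numbers are illustrative assumptions rather than measured figures for any Apple chip:

```rust
/// Batch size at which weight-streaming decode stops being memory-bound.
/// Decoding one token reads every weight once (~2 FLOPs per weight), so
/// per-token arithmetic intensity is 2 / bytes_per_weight FLOP per byte;
/// verifying b tokens reuses the same weight read b times.
fn critical_batch(peak_flops: f64, bandwidth: f64, bytes_per_weight: f64) -> f64 {
    let machine_balance = peak_flops / bandwidth; // FLOP/byte the chip sustains
    let per_token_intensity = 2.0 / bytes_per_weight;
    machine_balance / per_token_intensity
}

fn main() {
    // Assumed figures: ~4 TFLOP/s peak, ~120 GB/s bandwidth, fp16 weights.
    let b = critical_batch(4.0e12, 120.0e9, 2.0);
    println!("compute-bound above ~{b:.0} tokens per step");
}
```

Under these assumed numbers the crossover sits in the tens of tokens per step, which is why the speculative batch "should" be nearly free in theory even when it isn't in practice.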
Also, there is significant overhead from disabling the cursed async encoding loop, which causes degradation on higher-end devices and smaller models due to CPU-GPU sync overhead. We will push significant improvements to the encoding pipeline with the move to Metal4 soon, so speculative decoding will get more efficient for smaller models.
@habibutsu this model is about 2x better than the CDN one. I had to zip it because GitHub won't allow .bin attachments for whatever reason.
Yes, this speculator is much better. I've got the following results:
I am convinced that ngram models should outperform standalone LLMs at those scales, especially for larger draft trees. Our current ngram models are relatively primitive: we only use bigrams and don't have any backoff/smoothing. I think adding something like Kneser-Ney smoothing and using 3- or 4-grams should bring the acceptance rate to 20-30%.
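The backoff idea can be sketched in a few lines: draft from the longest n-gram context that has been seen, falling back to shorter contexts. This toy uses "stupid backoff" (hard fallback) rather than the Kneser-Ney smoothing proposed above, and the structures here are illustrative, not uzu's hashed model.bin format:

```rust
use std::collections::HashMap;

// Toy n-gram drafter: trigram table first, then bigrams, then the
// globally most frequent token.
struct NgramDraft {
    trigrams: HashMap<(u32, u32), u32>,
    bigrams: HashMap<u32, u32>,
    unigram: u32,
}

impl NgramDraft {
    fn draft(&self, prev2: u32, prev1: u32) -> u32 {
        self.trigrams
            .get(&(prev2, prev1))
            .or_else(|| self.bigrams.get(&prev1))
            .copied()
            .unwrap_or(self.unigram)
    }
}

fn main() {
    let model = NgramDraft {
        trigrams: HashMap::from([((1, 2), 3)]),
        bigrams: HashMap::from([(2, 9)]),
        unigram: 0,
    };
    assert_eq!(model.draft(1, 2), 3); // trigram hit
    assert_eq!(model.draft(7, 2), 9); // backoff to bigram
    assert_eq!(model.draft(7, 8), 0); // backoff to unigram
    println!("ok");
}
```

Proper Kneser-Ney would instead mix the levels with discounted continuation counts, which is what should push acceptance beyond what a hard fallback gives.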
@uuuvn I left only the PARD speculator here, plus some minor improvements to your ngram.
Results (running on an Apple M4, 24 GB):
```shell
cargo run --release -p cli -- run ./models/0.1.8/Qwen3-4B/ \
  --no-thinking --seed 16 \
  --message "Tell me briefly about Navier–Stokes equations"
# 17.712 s, 12.506 t/s

cargo run --release -p cli -- run ./models/0.1.8/Qwen3-4B/ \
  --no-thinking --seed 16 \
  --speculator-type ngram \
  --message "Tell me briefly about Navier–Stokes equations"
# 25.713 s, 13.530 t/s, acc 59/1060 (6%)

cargo run --release -p cli -- run ./models/0.1.8/Qwen3-4B/ \
  --no-thinking --seed 16 \
  --speculator-type pard \
  --speculator-path ../lalamo/models/PARD-Qwen3-0.6B/ \
  --speculator-tokens 4 \
  --message "Tell me briefly about Navier–Stokes equations"
# 14.708 s, 18.649 t/s, acc 120/420 (29%)
```