fix(cuda): disable NCCL autodetection by default + sync llama.cpp MTP wrappers/examples#1020
MegalithOfficial wants to merge 10 commits into
Co-authored-by: Lothar Hoffmann <l.hoffmann@cherrymint.de>
I added MTP support to the Rust wrapper and then wired up runnable examples on top of it. The wrapper now exposes the upstream MTP context type, recurrent-state config, and pre-norm embedding staging APIs needed for Qwen3.5-style NextN/MTP models. I also fixed mixed token+embedding batch handling on the Rust side, because upstream MTP needs both token ids and embedding rows in the same batch. On top of that, I added two examples:
I also updated the MTP generation path to reuse live KV/recurrent state instead of clearing it and re-prefilling on every loop iteration. For Qwen3.5 MTP this required setting
Tested locally against
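To illustrate why reusing live state matters, here is a toy sketch comparing the two loop shapes. `FakeContext`, `next_token`, and both `generate_*` functions are hypothetical and are not part of the wrapper's API; the fake context only counts the tokens submitted to it, and MTP drafting, verification, and recurrent-state rollback are deliberately left out.

```rust
/// Hypothetical stand-in for a model context: it only tracks how many
/// positions are cached and how many tokens were pushed through the model.
#[derive(Default)]
struct FakeContext {
    n_past: usize,  // positions currently held in KV/recurrent state
    decoded: usize, // total tokens submitted for decoding
}

impl FakeContext {
    /// Drop all cached state (what the old loop did every iteration).
    fn clear(&mut self) {
        self.n_past = 0;
    }

    /// Decode tokens at the positions right after the cached ones.
    fn decode(&mut self, tokens: &[i32]) {
        self.n_past += tokens.len();
        self.decoded += tokens.len();
    }
}

/// Deterministic dummy sampler so both strategies produce the same output.
fn next_token(step: usize) -> i32 {
    100 + step as i32
}

/// Old shape: clear state and re-prefill the whole sequence every step.
fn generate_reprefill(prompt: &[i32], steps: usize) -> (Vec<i32>, usize) {
    let mut ctx = FakeContext::default();
    let mut seq = prompt.to_vec();
    for step in 0..steps {
        ctx.clear();
        ctx.decode(&seq);
        debug_assert_eq!(ctx.n_past, seq.len());
        seq.push(next_token(step));
    }
    (seq, ctx.decoded)
}

/// New shape: prefill once, then decode only the newly sampled token.
fn generate_reuse(prompt: &[i32], steps: usize) -> (Vec<i32>, usize) {
    let mut ctx = FakeContext::default();
    let mut seq = prompt.to_vec();
    let mut pending: Vec<i32> = prompt.to_vec();
    for step in 0..steps {
        ctx.decode(&pending);
        debug_assert_eq!(ctx.n_past, seq.len());
        let tok = next_token(step);
        seq.push(tok);
        pending = vec![tok];
    }
    (seq, ctx.decoded)
}
```

With a 4-token prompt and 3 steps, both shapes yield the same sequence, but the re-prefill loop submits 4 + 5 + 6 = 15 tokens while the reuse loop submits 4 + 1 + 1 = 6.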
From what I can tell from the GitHub Actions runs, the generated bindings differ across platforms: on Linux and macOS the ctx_type binding comes through with one signedness, while on Windows it comes through with the other. I updated the wrapper to handle both, so the MTP context type code now compiles consistently on all targets.
This is correct: bindgen generates the platform's underlying type for enums, so your fix is right.
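A minimal sketch of one way to stay agnostic to the enum's signedness, assuming the binding is a plain integer type alias as bindgen emits by default. The alias `sys_mtp_ctx_type`, the `MtpContextType` variants, and their values are hypothetical stand-ins, not the actual upstream or wrapper names; the `cfg` split only mimics what bindgen typically produces (`i32` on MSVC targets, `u32` elsewhere when every enumerator is non-negative).

```rust
use std::convert::TryFrom;

// Hypothetical stand-in for the bindgen-generated alias. The real alias name
// and its per-platform type come from the generated bindings, not from here.
#[cfg(target_env = "msvc")]
#[allow(non_camel_case_types)]
type sys_mtp_ctx_type = i32;
#[cfg(not(target_env = "msvc"))]
#[allow(non_camel_case_types)]
type sys_mtp_ctx_type = u32;

/// Hypothetical safe wrapper; variant names and values are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MtpContextType {
    Target,
    Draft,
}

impl MtpContextType {
    fn to_sys(self) -> sys_mtp_ctx_type {
        // Integer literals infer to whichever type the alias resolves to,
        // so this compiles whether bindgen chose i32 or u32.
        match self {
            Self::Target => 0,
            Self::Draft => 1,
        }
    }
}

impl TryFrom<sys_mtp_ctx_type> for MtpContextType {
    type Error = i64;

    fn try_from(raw: sys_mtp_ctx_type) -> Result<Self, Self::Error> {
        // i64 losslessly holds every i32 and u32 value, so widen first
        // instead of hard-coding one signedness with an `as` cast.
        match i64::from(raw) {
            0 => Ok(Self::Target),
            1 => Ok(Self::Draft),
            other => Err(other),
        }
    }
}
```

Because `i64` implements `From` for both `i32` and `u32`, the same conversion code compiles unchanged on every target.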
I believe the remaining Actions failure is just because I hadn't run the formatter.
What this fixes
Some Linux CUDA builds were failing at link time on systems that had NCCL installed.
The errors looked like this:
The root cause was that `llama.cpp` could detect NCCL automatically during the CMake build, but `llama-cpp-sys-2` was not linking `libnccl` on the Rust side. That left us with compiled CUDA objects that referenced NCCL symbols but no final NCCL linkage.
What changed
This update disables NCCL autodetection by default in CUDA builds by setting the relevant CMake flags in `llama-cpp-sys-2/build.rs`.
If someone does want NCCL enabled, they can still opt back in with:
Why this approach
NCCL is mainly useful for multi-GPU collective operations. For normal single-GPU setups, auto-enabling it is unnecessary and makes builds more fragile on machines where NCCL happens to be installed.
This keeps the default CUDA build path reliable while still leaving a deliberate opt-in path for NCCL.
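The default-off, opt-in behavior described above could be structured in `build.rs` roughly as follows. This is a sketch under stated assumptions, not the code from this PR: the environment variable name `LLAMA_CPP_SYS_NCCL` is hypothetical, and `CMAKE_DISABLE_FIND_PACKAGE_NCCL` is a standard CMake switch that only has an effect if llama.cpp locates NCCL through `find_package(NCCL)`, which is assumed here rather than confirmed. It also assumes that opting in should make the Rust side link `libnccl`, since the missing link is what caused the original failure.

```rust
/// Hypothetical opt-in variable; not the name used by this PR.
const NCCL_OPT_IN_ENV: &str = "LLAMA_CPP_SYS_NCCL";

/// What build.rs would hand to CMake and Cargo for NCCL.
#[derive(Debug, PartialEq, Eq)]
struct NcclPlan {
    cmake_defines: Vec<(&'static str, &'static str)>,
    link_libs: Vec<&'static str>,
}

/// Decide the NCCL setup from the opt-in value (`None` means the variable is unset).
fn nccl_plan(opt_in: Option<&str>) -> NcclPlan {
    let enabled = matches!(
        opt_in.map(str::trim),
        Some(v) if v == "1" || v.eq_ignore_ascii_case("on") || v.eq_ignore_ascii_case("true")
    );
    if enabled {
        // Opted in: leave CMake detection alone and also link libnccl from Rust,
        // so NCCL symbols referenced by the CUDA objects resolve at link time.
        NcclPlan { cmake_defines: Vec::new(), link_libs: vec!["nccl"] }
    } else {
        // Default: keep CMake from enabling NCCL just because it is installed.
        // ASSUMPTION: llama.cpp finds NCCL via find_package(NCCL).
        NcclPlan {
            cmake_defines: vec![("CMAKE_DISABLE_FIND_PACKAGE_NCCL", "ON")],
            link_libs: Vec::new(),
        }
    }
}

/// Read the opt-in and ask Cargo to rerun build.rs when it changes.
fn read_opt_in() -> Option<String> {
    println!("cargo:rerun-if-env-changed={}", NCCL_OPT_IN_ENV);
    std::env::var(NCCL_OPT_IN_ENV).ok()
}

/// Emit the Cargo side of the plan. With the `cmake` crate, the defines would be
/// applied via `config.define(k, v)` for each `(k, v)` in `plan.cmake_defines`.
fn apply_link_libs(plan: &NcclPlan) {
    for lib in &plan.link_libs {
        println!("cargo:rustc-link-lib={}", lib);
    }
}
```

Keeping the decision in a pure function makes it testable without running CMake, and `cargo:rerun-if-env-changed` ensures that toggling the variable triggers a rebuild of the build script.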
Notes
If a previous build already cached NCCL detection, a clean rebuild may be needed.