fix(cuda): disable NCCL autodetection by default + sync llama.cpp MTP wrappers/examples by MegalithOfficial · Pull Request #1020 · utilityai/llama-cpp-rs

MegalithOfficial · 2026-05-04T18:00:56Z

What this fixes

Some Linux CUDA builds were failing at link time on systems that had NCCL installed.

The errors looked like this:

rust-lld: error: undefined symbol: ncclCommInitAll
rust-lld: error: undefined symbol: ncclGetErrorString
rust-lld: error: undefined symbol: ncclGroupStart
rust-lld: error: undefined symbol: ncclAllReduce
rust-lld: error: undefined symbol: ncclGroupEnd
collect2: error: ld returned 1 exit status
error: could not compile due to 1 previous error

The root cause was that llama.cpp could detect NCCL automatically during the CMake build, but llama-cpp-sys-2 was not linking libnccl on the Rust side. That left us with compiled CUDA objects that referenced NCCL symbols but no final NCCL linkage.

What changed

This update disables NCCL autodetection by default in CUDA builds by setting the relevant CMake flags in llama-cpp-sys-2/build.rs.

If someone does want NCCL enabled, they can still opt back in with:

LLAMA_CUDA_DISABLE_NCCL=0
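
For anyone who wants a picture of the default-off, opt-in behaviour, here is a minimal sketch of how a build script could turn that environment variable into CMake cache entries. It is not the code in llama-cpp-sys-2/build.rs. The helper names are hypothetical, the flag names (GGML_CUDA_NCCL and CMAKE_DISABLE_FIND_PACKAGE_NCCL) are assumptions about what "the relevant CMake flags" could be, and treating only the value 0 as an opt-in is also an assumption.

```rust
/// Decide whether NCCL autodetection should stay disabled, given the raw
/// value of the opt-out variable (`None` when it is unset).
fn nccl_disabled(raw: Option<&str>) -> bool {
    match raw.map(str::trim) {
        // Explicit opt-in: the user asked for NCCL detection back.
        Some("0") => false,
        // Unset or any other value: keep the safe default.
        _ => true,
    }
}

/// CMake cache entries to pass for a CUDA build.
/// ASSUMPTION: these flag names are illustrative, not taken from build.rs.
fn nccl_cmake_defines(disabled: bool) -> Vec<(&'static str, &'static str)> {
    if disabled {
        vec![
            ("GGML_CUDA_NCCL", "OFF"),
            ("CMAKE_DISABLE_FIND_PACKAGE_NCCL", "ON"),
        ]
    } else {
        // Leave detection to the upstream CMake logic.
        Vec::new()
    }
}

fn main() {
    let raw = std::env::var("LLAMA_CUDA_DISABLE_NCCL").ok();
    let disabled = nccl_disabled(raw.as_deref());
    for (key, value) in nccl_cmake_defines(disabled) {
        println!("-D{key}={value}");
    }
}
```

Keeping the decision in a pure function makes the default easy to test without touching the process environment.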

Why this approach

NCCL is mainly useful for multi-GPU collective operations. For normal single-GPU setups, auto-enabling it is unnecessary and makes builds more fragile on machines where NCCL happens to be installed.

This keeps the default CUDA build path reliable while still leaving a deliberate opt-in path for NCCL.

Notes

If a previous build already cached NCCL detection, a clean rebuild may be needed.

Co-authored-by: Lothar Hoffmann <l.hoffmann@cherrymint.de>

MegalithOfficial · 2026-05-17T16:20:21Z

I added MTP support to the Rust wrapper and then wired up runnable examples on top of it. The wrapper now exposes the upstream MTP context type, recurrent-state config, and pre-norm embedding staging APIs needed for Qwen3.5-style NextN/MTP models. I also fixed mixed token+embedding batch handling on the Rust side, because upstream MTP needs both token ids and embedding rows in the same batch.
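
To make the mixed-batch requirement concrete, here is a toy sketch of a batch that carries one token id and one embedding row per position. The type and method names are hypothetical and the row-major layout is an assumption. It is not the wrapper's batch type, only an illustration of the invariant that both arrays must stay in step.

```rust
/// Hypothetical container: one token id plus one embedding row per position.
#[derive(Debug)]
struct MixedBatch {
    n_embd: usize,
    tokens: Vec<i32>,
    /// Row-major storage: `n_embd` floats for each entry in `tokens`.
    embd: Vec<f32>,
}

impl MixedBatch {
    fn new(n_embd: usize) -> Self {
        MixedBatch { n_embd, tokens: Vec::new(), embd: Vec::new() }
    }

    /// Append one position. Rejects rows whose width does not match `n_embd`,
    /// so the token and embedding arrays can never drift apart.
    fn push(&mut self, token: i32, row: &[f32]) -> Result<(), String> {
        if row.len() != self.n_embd {
            return Err(format!("expected {} floats, got {}", self.n_embd, row.len()));
        }
        self.tokens.push(token);
        self.embd.extend_from_slice(row);
        Ok(())
    }

    fn n_tokens(&self) -> usize {
        self.tokens.len()
    }

    /// Embedding row for position `i`, or `None` if out of range.
    fn row(&self, i: usize) -> Option<&[f32]> {
        self.embd.get(i * self.n_embd..(i + 1) * self.n_embd)
    }
}

fn main() {
    let mut batch = MixedBatch::new(2);
    batch.push(7, &[0.5, 1.0]).unwrap();
    println!("{} token(s), row 0 = {:?}", batch.n_tokens(), batch.row(0));
}
```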

On top of that, I added two examples:

mtp-example: generates with a bundled-MTP GGUF such as unsloth/Qwen3.5-4B-MTP-GGUF
mtp-compare: runs plain decoding and MTP decoding back to back and prints output plus timing/acceptance stats

I also updated the MTP generation path to reuse live KV/recurrent state instead of clearing and reprefilling every loop. For Qwen3.5 MTP this required setting n_rs_seq = draft_n so rollback works correctly on the recurrent architecture.
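
As a rough illustration of why rollback matters when state is reused, here is a sketch of one greedy draft-and-verify step. It uses plain token slices instead of the wrapper API, and every name in it is hypothetical. The point it shows: a step can reject up to draft_n drafted positions, and the cached state for those positions has to be discarded, which is presumably why the recurrent-state setting is tied to draft_n.

```rust
/// Result of one speculative step.
#[derive(Debug, PartialEq)]
struct StepOutcome {
    /// Tokens that can be appended to the output.
    emitted: Vec<u32>,
    /// Number of drafted positions whose cached state must be rolled back.
    rollback: usize,
}

/// Greedy verification of one draft.
///
/// `draft` holds the drafted tokens. `verified` holds the main model's greedy
/// choice at each drafted position plus one extra token after the last one,
/// so it must be exactly one element longer than `draft`.
fn speculative_step(draft: &[u32], verified: &[u32]) -> StepOutcome {
    assert_eq!(verified.len(), draft.len() + 1, "need one extra verified token");

    // Count the leading draft tokens that the main model agrees with.
    let accepted = draft
        .iter()
        .zip(verified)
        .take_while(|(d, v)| d == v)
        .count();

    // Keep the accepted prefix, then take the main model's own token at the
    // first disagreement (or the extra token if everything was accepted).
    let mut emitted = draft[..accepted].to_vec();
    emitted.push(verified[accepted]);

    StepOutcome { emitted, rollback: draft.len() - accepted }
}

fn main() {
    // Two of three drafted tokens match, so one drafted position is rolled back.
    let outcome = speculative_step(&[5, 9, 2], &[5, 9, 7, 4]);
    println!("{outcome:?}");
}
```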

Tested locally against Qwen3.5-4B-Q4_K_M.gguf. The path is working end to end; wall-clock speedup still depends on hardware, output length, and acceptance rate, so short CPU-only runs can still show MTP as slower even though the implementation is correct.
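
For reference when reading the comparison output, this is a small sketch of how acceptance rate and throughput could be derived from per-run counters. The struct and field names are hypothetical and are not taken from mtp-compare.

```rust
use std::time::Duration;

/// Hypothetical counters collected over one generation run.
struct RunStats {
    /// Tokens appended to the output.
    generated: usize,
    /// Tokens proposed by the draft path.
    drafted: usize,
    /// Drafted tokens that the main model accepted.
    accepted: usize,
    /// Wall-clock time for the run.
    elapsed: Duration,
}

impl RunStats {
    /// Fraction of drafted tokens that were accepted, or `None` when nothing
    /// was drafted (for example in a plain decoding run).
    fn acceptance_rate(&self) -> Option<f64> {
        if self.drafted == 0 {
            None
        } else {
            Some(self.accepted as f64 / self.drafted as f64)
        }
    }

    /// Output tokens per second, or `None` when no time was measured.
    fn tokens_per_second(&self) -> Option<f64> {
        if self.elapsed.is_zero() {
            None
        } else {
            Some(self.generated as f64 / self.elapsed.as_secs_f64())
        }
    }
}

fn main() {
    let stats = RunStats {
        generated: 100,
        drafted: 40,
        accepted: 30,
        elapsed: Duration::from_secs(4),
    };
    println!("acceptance: {:?}", stats.acceptance_rate());
    println!("tokens/s:   {:?}", stats.tokens_per_second());
}
```

A low acceptance rate means most draft work is thrown away, which is one way a correct MTP run can still end up slower than plain decoding.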

MegalithOfficial · 2026-05-25T21:58:21Z

From what I understand, the Actions runs suggest that the generated llama.cpp bindings differ across platforms: on Linux and macOS, the generated ctx_type binding comes through as one signedness, whereas on Windows it comes through as the other. I updated the wrapper to handle both, so the MTP context type code now compiles consistently on all targets.

MarcusDunn · 2026-05-25T22:09:12Z

> From what I understand, the Actions runs suggest that the generated llama.cpp bindings differ across platforms: on Linux and macOS, the generated ctx_type binding comes through as one signedness, whereas on Windows it comes through as the other. I updated the wrapper to handle both, so the MTP context type code now compiles consistently on all targets.

This is correct. bindgen generates the platform type for enums, so your fix is correct.
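
One way to write wrapper code that compiles against either signedness is to widen the raw value before matching on it. The sketch below assumes the generated constant is u32 on some targets and i32 on others (bindgen commonly emits unsigned on Linux/macOS and signed on Windows MSVC). The enum, its variants, and the numeric values are hypothetical and are not the names used in this PR.

```rust
/// Hypothetical Rust-side view of the context type.
#[derive(Debug, PartialEq, Clone, Copy)]
enum CtxType {
    Default,
    Mtp,
}

/// Convert a raw binding value of either signedness into `CtxType`.
///
/// Both `u32` and `i32` convert losslessly into `i64`, so the same function
/// body compiles regardless of which integer type the bindings use.
fn ctx_type_from_raw<T: Into<i64>>(raw: T) -> Option<CtxType> {
    // ASSUMPTION: the values 0 and 1 are placeholders for illustration.
    match raw.into() {
        0 => Some(CtxType::Default),
        1 => Some(CtxType::Mtp),
        _ => None,
    }
}

fn main() {
    // The same call works for an unsigned and a signed raw value.
    println!("{:?}", ctx_type_from_raw(1u32));
    println!("{:?}", ctx_type_from_raw(1i32));
}
```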

MegalithOfficial · 2026-05-26T07:35:29Z

I believe the Actions error is now because I didn't run the formatter.

MegalithOfficial and others added 8 commits May 4, 2026 20:57

fix(cuda): disable NCCL autodetection by default

bce6b64

Co-authored-by: Lothar Hoffmann <l.hoffmann@cherrymint.de>

chore(llama.cpp): sync vendored upstream to 39cf5d6

b847df7

fix(bindings): adapt wrappers to latest llama.cpp

0719405

feat(mtp): expose upstream context and pre-norm APIs

aacceb8

fix(mtp): support mixed token and pre-norm batches

e6c51ed

feat(example): add qwen mtp generation example

32aa7ca

perf(example): reuse kv state in mtp loop

19aae34

feat(example): compare plain and mtp decoding

809f647

MegalithOfficial changed the title ~~fix(cuda): disable NCCL autodetection by default~~ fix(cuda): disable NCCL autodetection by default + sync llama.cpp MTP wrappers/examples May 17, 2026

fix(bindings): handle platform-specific ctx_type signedness

faedb3d

style(fmt): normalize mtp examples and fit params

c9af356

MegalithOfficial wants to merge 10 commits into utilityai:main from MegalithOfficial:main
