fix(cuda): disable NCCL autodetection by default + sync llama.cpp MTP wrappers/examples #1020

Open
MegalithOfficial wants to merge 10 commits into utilityai:main from MegalithOfficial:main

@MegalithOfficial
Contributor

@MegalithOfficial MegalithOfficial commented May 4, 2026

What this fixes

Some Linux CUDA builds were failing at link time on systems that had NCCL installed.

The errors looked like this:

rust-lld: error: undefined symbol: ncclCommInitAll
rust-lld: error: undefined symbol: ncclGetErrorString
rust-lld: error: undefined symbol: ncclGroupStart
rust-lld: error: undefined symbol: ncclAllReduce
rust-lld: error: undefined symbol: ncclGroupEnd
collect2: error: ld returned 1 exit status
error: could not compile due to 1 previous error

The root cause was that llama.cpp could detect NCCL automatically during the CMake build, but llama-cpp-sys-2 was not linking libnccl on the Rust side. That left us with compiled CUDA objects that referenced NCCL symbols but no final NCCL linkage.

What changed

This update disables NCCL autodetection by default in CUDA builds by setting the relevant CMake flags in llama-cpp-sys-2/build.rs.

If someone does want NCCL enabled, they can still opt back in with:

LLAMA_CUDA_DISABLE_NCCL=0
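
To make the default-off behavior concrete, here is a minimal sketch of how a build script could turn the environment variable into CMake cache entries. The function name is hypothetical, and the two CMake flag names are assumptions for illustration only; the description above does not name the flags that build.rs actually sets.

```rust
/// Returns the CMake cache entries to pass for a CUDA build, given the raw
/// value of the `LLAMA_CUDA_DISABLE_NCCL` environment variable (`None` when
/// the variable is unset).
///
/// ASSUMPTION: the flag names below are illustrative guesses, not taken from
/// this PR. `CMAKE_DISABLE_FIND_PACKAGE_<PackageName>` is a standard CMake
/// mechanism; whether llama.cpp's NCCL detection honors it is not confirmed.
fn nccl_cmake_defines(env_value: Option<&str>) -> Vec<(&'static str, &'static str)> {
    // NCCL stays disabled unless the user explicitly sets the variable to "0".
    let disable = !matches!(env_value.map(str::trim), Some("0"));

    if disable {
        vec![
            ("GGML_CUDA_NCCL", "OFF"),
            ("CMAKE_DISABLE_FIND_PACKAGE_NCCL", "ON"),
        ]
    } else {
        // Opt-in path: pass nothing and let CMake autodetect NCCL.
        Vec::new()
    }
}
```

In a real build script, the input would come from `std::env::var("LLAMA_CUDA_DISABLE_NCCL").ok()`, and each returned pair would be forwarded to the CMake configuration. Emitting `cargo:rerun-if-env-changed=LLAMA_CUDA_DISABLE_NCCL` would let Cargo rerun the script when the variable changes.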

Why this approach

NCCL is mainly useful for multi-GPU collective operations. For normal single-GPU setups, auto-enabling it is unnecessary and makes builds more fragile on machines where NCCL happens to be installed.

This keeps the default CUDA build path reliable while still leaving a deliberate opt-in path for NCCL.

Notes

If a previous build already cached NCCL detection, a clean rebuild may be needed.

Co-authored-by: Lothar Hoffmann

@MegalithOfficial
Contributor Author

MegalithOfficial commented May 17, 2026

I added MTP support to the Rust wrapper and then wired up runnable examples on top of it. The wrapper now exposes the upstream MTP context type, recurrent-state config, and pre-norm embedding staging APIs needed for Qwen3.5-style NextN/MTP models. I also fixed mixed token+embedding batch handling on the Rust side, because upstream MTP needs both token ids and embedding rows in the same batch.

On top of that, I added two examples:

  • mtp-example: generates with a bundled-MTP GGUF such as unsloth/Qwen3.5-4B-MTP-GGUF
  • mtp-compare: runs plain decoding and MTP decoding back to back and prints output plus timing/acceptance stats

I also updated the MTP generation path to reuse the live KV/recurrent state instead of clearing it and re-prefilling on every loop iteration. For Qwen3.5 MTP this required setting n_rs_seq = draft_n so that rollback works correctly on the recurrent architecture.
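
As a rough illustration of why rollback depth is tied to the draft length, the sketch below models one draft-and-verify step in plain Rust. It is not the wrapper's API: the type and function names are hypothetical, and token ids are simplified to `u32`. The point is that in the worst case every drafted token is rejected, so the state may need to be rewound by up to the full draft length.

```rust
/// Outcome of checking one block of drafted tokens against the target model.
#[derive(Debug, PartialEq)]
struct Verdict {
    /// Number of leading draft tokens the target model agreed with.
    accepted: usize,
    /// Number of drafted positions whose state must be rolled back.
    rollback: usize,
}

/// Compares drafted tokens with the tokens the target model produced at the
/// same positions and accepts the longest matching prefix.
fn verify_draft(draft: &[u32], target: &[u32]) -> Verdict {
    let accepted = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();

    Verdict {
        accepted,
        // Everything after the first mismatch was computed speculatively
        // and has to be discarded.
        rollback: draft.len() - accepted,
    }
}
```

With a draft of length `draft_n`, `rollback` ranges from 0 to `draft_n`, which is one way to read the n_rs_seq = draft_n requirement for a recurrent model whose state cannot simply be truncated like a KV cache.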

Tested locally against Qwen3.5-4B-Q4_K_M.gguf. The path is working end to end; wall-clock speedup still depends on hardware, output length, and acceptance rate, so short CPU-only runs can still show MTP as slower even though the implementation is correct.
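
To show how acceptance rate feeds into throughput, here is a small sketch of the kind of bookkeeping a comparison run could do. The struct and its fields are hypothetical and are not taken from mtp-compare. It assumes the usual speculative-decoding accounting, where each verification pass of the main model yields the accepted draft tokens plus one token of its own.

```rust
/// Counters collected over one MTP generation run (hypothetical layout).
#[derive(Debug, Clone, Copy)]
struct MtpStats {
    /// Total tokens proposed by the draft (MTP) head.
    drafted: u64,
    /// Drafted tokens that the main model accepted.
    accepted: u64,
    /// Number of verification passes run by the main model.
    target_passes: u64,
}

impl MtpStats {
    /// Fraction of drafted tokens that were accepted (0.0 if nothing was drafted).
    fn acceptance_rate(&self) -> f64 {
        if self.drafted == 0 {
            0.0
        } else {
            self.accepted as f64 / self.drafted as f64
        }
    }

    /// Average tokens emitted per main-model pass: accepted draft tokens
    /// plus the one token each pass produces itself.
    fn tokens_per_pass(&self) -> f64 {
        if self.target_passes == 0 {
            0.0
        } else {
            (self.accepted + self.target_passes) as f64 / self.target_passes as f64
        }
    }
}
```

A value of `tokens_per_pass` close to 1.0 means the drafts are rarely accepted, in which case the extra drafting work can outweigh the savings, consistent with the slower short CPU-only runs mentioned above.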

@MegalithOfficial MegalithOfficial changed the title fix(cuda): disable NCCL autodetection by default fix(cuda): disable NCCL autodetection by default + sync llama.cpp MTP wrappers/examples May 17, 2026
@MegalithOfficial
Contributor Author

From what I understand, the Actions runs suggest that the generated llama.cpp bindings differ across platforms: on Linux and macOS, the generated ctx_type binding comes through as one signedness, whereas on Windows it comes through as the other. I updated the wrapper to handle both, so the MTP context type code now compiles consistently on all targets.
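
One way to write conversion code that compiles regardless of the generated integer type is to make it generic over the raw type, as in the sketch below. The enum, its variants, and the numeric values 0 and 1 are hypothetical placeholders, not the wrapper's actual definitions or the upstream constants.

```rust
/// Safe-side view of the context type (hypothetical variants and values).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ContextType {
    Standard,
    Mtp,
}

impl ContextType {
    /// Converts to whatever integer type the generated binding uses.
    /// Both `i32` and `u32` implement `From<u8>`, so this compiles either way.
    fn to_raw<T: From<u8>>(self) -> T {
        T::from(match self {
            ContextType::Standard => 0u8, // placeholder value
            ContextType::Mtp => 1u8,      // placeholder value
        })
    }

    /// Converts from the generated integer type by widening to `i64`,
    /// which is lossless for both `i32` and `u32`.
    fn from_raw<T: Into<i64>>(raw: T) -> Option<Self> {
        match raw.into() {
            0 => Some(ContextType::Standard),
            1 => Some(ContextType::Mtp),
            _ => None,
        }
    }
}
```

At a call site, `let raw: u32 = ContextType::Mtp.to_raw();` and `let raw: i32 = ContextType::Mtp.to_raw();` both type-check, so the same wrapper source builds whether the binding is signed or unsigned.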

@MarcusDunn
Contributor

> From what I understand, the Actions runs suggest that the generated llama.cpp bindings differ across platforms: on Linux and macOS, the generated ctx_type binding comes through as one signedness, whereas on Windows it comes through as the other. I updated the wrapper to handle both, so the MTP context type code now compiles consistently on all targets.

This is correct. bindgen generates the platform's underlying type for enums, so your fix is the right one.

@MegalithOfficial
Contributor Author

I believe the remaining Actions error is just because I didn't run the formatter.
