feat!: update default CUDA image to server-cuda13 for Qwen3.5 and Blackwell support #262
feat!: update default CUDA image to server-cuda13 for Qwen3.5 and Blackwell support

BREAKING CHANGE: The default GPU inference image is now server-cuda13 (CUDA 13) instead of server-cuda (CUDA 12). This requires NVIDIA driver 590+ (CUDA 13.1). Users on older drivers must specify `--image ghcr.io/ggml-org/llama.cpp:server-cuda` to use the CUDA 12 image.

This enables:
- Qwen3.5 model architecture support (previously an unknown architecture)
- Native SM 120 (Blackwell) GPU support for RTX 50-series
- The latest llama.cpp optimizations and model architecture support

Updated across the CLI, benchmark tool, sample manifests, and docs.

Fixes #261

Signed-off-by: Christopher Maher <chris@mahercode.io>
Summary

Updates the default GPU inference image from `server-cuda` (CUDA 12) to `server-cuda13` (CUDA 13) across the codebase. This is a one-line fix that enables support for Qwen3.5 models and native Blackwell GPU acceleration.

What this fixes:
- `unknown model architecture: 'qwen35'` on the old image

What changes:
- GPU deployments default to `server-cuda13` instead of `server-cuda`
- The CLI, benchmark tool, sample manifests, and docs all reference `server-cuda13`
- `--image` help text documents how to override with an older image

Driver requirement: NVIDIA driver 590+ (CUDA 13.1). Users on older drivers can still use `--image ghcr.io/ggml-org/llama.cpp:server-cuda`.

Validated on: Shadowstack (RTX 5060 Ti, driver 590.48.01) with Qwen3.5-9B at 61.5 tok/s generation.
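The driver cutoff above can be sketched as a small shell helper. This is a minimal sketch, not part of the PR: the `pick_image` function name and the 590 threshold encoding are illustrative, though `nvidia-smi --query-gpu=driver_version --format=csv,noheader` is the standard way to read the installed driver version.

```shell
# pick_image DRIVER_VERSION
# Echoes the llama.cpp server image tag appropriate for the given
# NVIDIA driver version: CUDA 13 needs driver 590+, else fall back
# to the CUDA 12 image.
pick_image() {
  major="${1%%.*}"   # "590.48.01" -> "590"
  if [ "${major:-0}" -ge 590 ]; then
    echo "ghcr.io/ggml-org/llama.cpp:server-cuda13"
  else
    echo "ghcr.io/ggml-org/llama.cpp:server-cuda"
  fi
}

# On a real host the version would come from:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
pick_image "590.48.01"
```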
Test plan
- `make test` passes
- `--image ghcr.io/ggml-org/llama.cpp:server-cuda` override works for older drivers

Fixes #261
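A manual smoke test of either image can be sketched as below. This is an assumption about how the GHCR server images are typically run (the image entrypoint is the llama.cpp server binary); the model filename and mount path are placeholders, not files from this PR.

```shell
# run_cmd MODEL_DIR IMAGE_TAG MODEL_FILE
# Builds (and echoes, rather than executes) the docker invocation for a
# llama.cpp GPU server smoke test, so the command can be inspected first.
run_cmd() {
  echo "docker run --rm --gpus all -p 8080:8080 -v $1:/models" \
       "ghcr.io/ggml-org/llama.cpp:$2 -m /models/$3 --host 0.0.0.0 --port 8080"
}

run_cmd /path/to/models server-cuda13 qwen3.5-9b.gguf
# After starting the container, poll http://localhost:8080/health
# until the server reports ready, then send a test completion request.
```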