
Conversation

pwilkin (Collaborator) commented Dec 5, 2025

Until now, the number of graph nodes has been essentially constant for GGML graphs. With some hybrid attention models, however, the O(n^3) cost of the recurrent updates in the gating functions (for which we use SOLVE_TRI) forces chunking, which makes the graph size depend on the number of chunks, and hence on the size of the ubatch. This patch passes the ubatch size / context size to the function so that models that need it can dynamically calculate the required maximum number of nodes.

Fixes #17578
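
A minimal sketch of the idea, with all names hypothetical (the actual llama.cpp function, chunk size, and per-op budgets differ): the maximum node count becomes a function of the ubatch size rather than a compile-time constant.

```cpp
// Illustrative only: function and constant names are hypothetical,
// not the actual llama.cpp / GGML API.
#include <cstdint>

static constexpr uint32_t CHUNK_SIZE = 64; // assumed chunk size for the recurrent update

// Fixed per-layer budget plus a per-chunk overhead for the chunked recurrent
// update; n_chunks grows with the ubatch size, so the node cap must grow too.
uint32_t graph_max_nodes(uint32_t n_layers, uint32_t n_ubatch) {
    const uint32_t n_chunks        = (n_ubatch + CHUNK_SIZE - 1) / CHUNK_SIZE;
    const uint32_t nodes_per_layer = 128; // constant part of the layer graph (assumed)
    const uint32_t nodes_per_chunk = 16;  // ops emitted once per chunk (assumed)
    return n_layers * (nodes_per_layer + n_chunks * nodes_per_chunk);
}
```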

ggerganov (Member) left a comment


We need to eventually improve the Qwen3 Next graph and reduce the number of nodes - there are a lot of unnecessary ops in it.

Also, a variable graph size is always going to have drawbacks. The graph should not change for different batch sizes. It's important to figure out how to do that if we want to have long-term support for linear attention in llama.cpp.

pwilkin (Collaborator, Author) commented Dec 6, 2025

> We need to eventually improve the Qwen3 Next graph and reduce the number of nodes - there are a lot of unnecessary ops in it.

Yeah, that's my plan after merging all the ops and fixing all the major bugs.

> Also, a variable graph size is always going to have drawbacks. The graph should not change for different batch sizes. It's important to figure out how to do that if we want to have long-term support for linear attention in llama.cpp.

The problem is that recurrent models require chunking by design. I'm not sure how to do chunking without exploding the graph unless we allow an operation like REPEAT_SUBGRAPH, where all operations are guaranteed to run on tensors of the same shape and characteristics on each repeat. At least that's the only idea I was able to come up with.
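
A minimal sketch of what such a primitive could mean; everything here is hypothetical (GGML has no REPEAT_SUBGRAPH op). The chunk body is recorded once, and only the runtime trip count depends on the ubatch size, so the stored graph stays constant.

```cpp
// Hypothetical illustration, not an existing GGML API: record the chunk body
// once as a fixed list of ops over same-shaped buffers, then replay it once
// per chunk at execution time. The recorded program size is constant; only
// the trip count varies with the ubatch size.
#include <cstdio>
#include <functional>
#include <vector>

using op_fn    = std::function<void(float *, int)>;
using subgraph = std::vector<op_fn>;

void repeat_subgraph(const subgraph & body, float * data, int chunk_size, int n_chunks) {
    for (int i = 0; i < n_chunks; ++i) {
        for (const auto & op : body) {
            op(data + i * chunk_size, chunk_size); // same shape on every repeat
        }
    }
}

int main() {
    // Stand-in for the per-chunk recurrent update (e.g. the SOLVE_TRI step).
    subgraph body = {
        [](float * x, int n) { for (int j = 0; j < n; ++j) x[j] += 1.0f; },
    };
    std::vector<float> data(8, 0.0f);
    repeat_subgraph(body, data.data(), /*chunk_size=*/2, /*n_chunks=*/4);
    printf("data[0] = %.1f\n", data[0]); // prints 1.0
    return 0;
}
```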



Development

Successfully merging this pull request may close this issue:

Qwen3-Next --ubatch-size issue (#17578)
