
Bug-Fix sets an upper VRAM limit for cached ggml_cuda graphs to prevent VRAM memory leaks#21673

Open
kmorennv wants to merge 25 commits into ggml-org:master from kmorennv:kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
Conversation

@kmorennv

@kmorennv kmorennv commented Apr 9, 2026

Overview

Bug fix for #19639. The fix in the CUDA backend prevents unbounded growth of cached CUDA graphs for Gemma3 models.

Additional information

The fix is local to the CUDA backend. It adds a new data type that manages the cached CUDA graphs. This PR fixes the bug, but some code refactoring is still necessary:

  • For instance, ggml_cuda_graph leaves cudaGraph_t open to misuse.
  • There is also a problem with the current design of the graph cache: many ggml_cuda_graph instances are uninitialized (their CUDA graph is nullptr) but still consume host memory.
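For illustration, here is a minimal sketch of the general technique — an LRU cache of instantiated graphs bounded by a total VRAM budget. This is not the PR's actual code; the type names, integer keys, and byte accounting are hypothetical stand-ins (real code would hold a cudaGraphExec_t and destroy it on eviction):

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// Hypothetical sketch: an LRU cache of CUDA graphs bounded by total VRAM use.
// Inserting a graph that would exceed the budget evicts least-recently-used
// entries (freeing their VRAM) until the new graph fits.
struct graph_entry {
    int    graph_id;   // stand-in for a cudaGraphExec_t handle
    size_t vram_bytes; // VRAM consumed by the instantiated graph
};

class cuda_graph_lru_cache {
public:
    explicit cuda_graph_lru_cache(size_t vram_limit) : limit(vram_limit) {}

    // Returns nullptr on miss; on hit, moves the entry to the front (MRU).
    const graph_entry * get(int key) {
        auto it = index.find(key);
        if (it == index.end()) return nullptr;
        order.splice(order.begin(), order, it->second);
        return &*it->second;
    }

    void put(int key, graph_entry e) {
        if (auto it = index.find(key); it != index.end()) {
            used -= it->second->vram_bytes;
            order.erase(it->second);
            index.erase(it);
        }
        // Evict LRU entries until the new graph fits under the VRAM limit.
        while (!order.empty() && used + e.vram_bytes > limit) {
            used -= order.back().vram_bytes;
            index.erase(order.back().graph_id);
            order.pop_back(); // real code would also cudaGraphExecDestroy here
        }
        used += e.vram_bytes;
        order.push_front(e);
        index[key] = order.begin();
    }

    size_t vram_used() const { return used; }
    size_t size() const { return order.size(); }

private:
    size_t limit;
    size_t used = 0;
    std::list<graph_entry> order; // front = most recently used
    std::unordered_map<int, std::list<graph_entry>::iterator> index;
};
```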

Requirements

kmorennv added 25 commits March 30, 2026 13:13
…ement and encapsulate functionality, add comments
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
@kmorennv kmorennv requested review from a team and ggerganov as code owners April 9, 2026 11:37
@kmorennv
Author

kmorennv commented Apr 9, 2026

  1. This was a general problem that has now been fixed (gemma3n was only affected in the multimodal case).
  2. The change in llama_context::set_causal_attn is not strictly necessary -> it can be isolated into a separate pull request.
  3. I want to further improve memory safety with additional refactoring, since we currently pollute host-side memory by not cleaning up empty graphs (unused instances of ggml_cuda_graph).

@kmorennv kmorennv changed the title fix cuda memory leak with mtmd gemma3 Bug-Fix sets an upper VRAM limit for cached ggml_cuda graphs to prevent VRAM memory leaks Apr 9, 2026
@am17an
Contributor

am17an commented Apr 9, 2026

I'm not sure if this is the correct way to do this, seems a bit over-engineered with the LRU cache. I would much prefer a simple ring buffer, for example in #21611. A memory limit is fine I guess, but it would be preferable to have a dynamic upper bound to do this correctly. I don't have a good solution yet, still thinking about it.
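For comparison, the ring-buffer approach suggested above boils down to a fixed number of cache slots where a new graph overwrites the oldest one. A hypothetical sketch, not the code from #21611 — the slot count and integer ids are illustrative, and real code would destroy the evicted cudaGraphExec_t:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical sketch of the ring-buffer alternative: keep at most N cached
// graphs; inserting a new one overwrites (and would free) the oldest slot.
template <size_t N>
class graph_ring {
public:
    // Returns the id of the evicted graph, if a slot was overwritten.
    std::optional<int> insert(int graph_id) {
        std::optional<int> evicted;
        if (slots[head].has_value()) {
            evicted = slots[head]; // real code: cudaGraphExecDestroy(old)
        }
        slots[head] = graph_id;
        head = (head + 1) % N;
        return evicted;
    }

    bool contains(int graph_id) const {
        for (const auto & s : slots) {
            if (s == graph_id) return true;
        }
        return false;
    }

private:
    std::array<std::optional<int>, N> slots{};
    size_t head = 0;
};
```

The trade-off versus the LRU variant is simplicity: eviction order ignores how recently a graph was reused, but there is no bookkeeping beyond the slot array.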

@JohannesGaessler
Contributor

I agree with @am17an that a simple solution would be much preferred.

@kmorennv
Author

kmorennv commented Apr 9, 2026

I'm not sure if this is the correct way to do this, seems a bit over-engineered with the LRU cache. I would much prefer a simple ring buffer, for example in #21611. A memory limit is fine I guess, but it would be preferable to have a dynamic upper bound to do this correctly. I don't have a good solution yet, still thinking about it.

  • OK, to fix only the upper-bound bug this might be too complex, but there are other issues.
  • The solution proposed in CUDA: use a ring-buffer for cuda graphs #21611 fixes only the unbounded cache; the caching of many empty ggml_cuda_graph instances is still not addressed (but, to be fair, that is another, related issue).

A memory limit is fine I guess, but it would be preferable to have a dynamic upper bound to do this correctly. I don't have a good solution yet, still thinking about it.

  • A dynamic upper bound is more complex than a static upper bound/heuristic. Dynamic means dependent, but on what? It could be made dependent on the micro-architecture, but that is a more complex check.
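One possible reading of "dynamic" — purely as an illustration, not something this PR implements — is a limit derived at runtime from free VRAM (which real CUDA code could query via cudaMemGetInfo), clamped to a static floor and ceiling. The fraction and bounds below are made-up numbers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical heuristic for a "dynamic" upper bound: cap the graph cache at
// a fraction of currently free VRAM, clamped to a static floor/ceiling.
// In real CUDA code, free_vram would come from cudaMemGetInfo().
size_t graph_cache_limit(size_t free_vram,
                         double fraction  = 0.05,
                         size_t min_bytes = size_t(16)  << 20,   // 16 MiB floor
                         size_t max_bytes = size_t(512) << 20) { // 512 MiB ceiling
    const size_t dynamic = static_cast<size_t>(free_vram * fraction);
    return std::clamp(dynamic, min_bytes, max_bytes);
}
```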

Long story short, I agree that to fix the bug only a few code changes are sufficient in the current code base. To address the other issues, would a separate PR be preferred?

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 9, 2026