
Bug-Fix sets an upper VRAM limit for cached ggml_cuda graphs to prevent VRAM memory leaks#21673

Open
kmorennv wants to merge 25 commits into ggml-org:master from kmorennv:kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
Conversation

@kmorennv

@kmorennv kmorennv commented Apr 9, 2026

Overview

Bug fix for #19639. The fix in the CUDA backend prevents unbounded growth of cached CUDA graphs for Gemma3 models.

Additional information

The fix is local to the CUDA backend. It adds a new data type that manages the cached CUDA graphs. This PR fixes the bug, but some code refactoring is still necessary:

  • For instance, ggml_cuda_graph leaves cudaGraph_t open to misuse.
  • There is also a problem with the current design of the graph cache: many ggml_cuda_graph instances are uninitialized (their CUDA graph is nullptr) but still consume host memory.
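For illustration, here is a minimal sketch of the general technique — an LRU cache of instantiated graphs bounded by a total VRAM budget. This is not the PR's actual code; the type names, integer keys, and byte accounting are hypothetical stand-ins (real code would hold a cudaGraphExec_t and destroy it on eviction):

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// Hypothetical sketch: an LRU cache of CUDA graphs bounded by total VRAM use.
// Inserting a graph that would exceed the budget evicts least-recently-used
// entries (freeing their VRAM) until the new graph fits.
struct graph_entry {
    int    graph_id;   // stand-in for a cudaGraphExec_t handle
    size_t vram_bytes; // VRAM consumed by the instantiated graph
};

class cuda_graph_lru_cache {
public:
    explicit cuda_graph_lru_cache(size_t vram_limit) : limit(vram_limit) {}

    // Returns nullptr on miss; on hit, moves the entry to the front (MRU).
    const graph_entry * get(int key) {
        auto it = index.find(key);
        if (it == index.end()) return nullptr;
        order.splice(order.begin(), order, it->second);
        return &*it->second;
    }

    void put(int key, graph_entry e) {
        if (auto it = index.find(key); it != index.end()) {
            used -= it->second->vram_bytes;
            order.erase(it->second);
            index.erase(it);
        }
        // Evict LRU entries until the new graph fits under the VRAM limit.
        while (!order.empty() && used + e.vram_bytes > limit) {
            used -= order.back().vram_bytes;
            index.erase(order.back().graph_id);
            order.pop_back(); // real code would also cudaGraphExecDestroy here
        }
        used += e.vram_bytes;
        order.push_front(e);
        index[key] = order.begin();
    }

    size_t vram_used() const { return used; }
    size_t size() const { return order.size(); }

private:
    size_t limit;
    size_t used = 0;
    std::list<graph_entry> order; // front = most recently used
    std::unordered_map<int, std::list<graph_entry>::iterator> index;
};
```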

Requirements

kmorennv added 25 commits March 30, 2026 13:13
…ement and encapsulate functionality, add comments
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
@kmorennv kmorennv requested review from a team and ggerganov as code owners April 9, 2026 11:37
@kmorennv
Author

kmorennv commented Apr 9, 2026

  1. This was a general problem that has now been fixed (gemma3n was only affected in the multimodal case).
  2. The change in llama_context::set_causal_attn is not strictly necessary -> it can be isolated into a separate pull request.
  3. I want to further improve memory safety with additional refactoring, since we currently pollute host-side memory by not cleaning up empty graphs (unused instances of ggml_cuda_graph).

@kmorennv kmorennv changed the title fix cuda memory leak with mtmd gemma3 Bug-Fix sets an upper VRAM limit for cached ggml_cuda graphs to prevent VRAM memory leaks Apr 9, 2026
@am17an
Contributor

am17an commented Apr 9, 2026

I'm not sure if this is the correct way to do this, seems a bit over-engineered with the LRU cache. I would much prefer a simple ring buffer, for example in #21611. A memory limit is fine I guess, but it would be preferable to have a dynamic upper bound to do this correctly. I don't have a good solution yet, still thinking about it.
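For comparison, the ring-buffer approach suggested above boils down to a fixed number of cache slots where a new graph overwrites the oldest one. A hypothetical sketch, not the code from #21611 — the slot count and integer ids are illustrative, and real code would destroy the evicted cudaGraphExec_t:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical sketch of the ring-buffer alternative: keep at most N cached
// graphs; inserting a new one overwrites (and would free) the oldest slot.
template <size_t N>
class graph_ring {
public:
    // Returns the id of the evicted graph, if a slot was overwritten.
    std::optional<int> insert(int graph_id) {
        std::optional<int> evicted;
        if (slots[head].has_value()) {
            evicted = slots[head]; // real code: cudaGraphExecDestroy(old)
        }
        slots[head] = graph_id;
        head = (head + 1) % N;
        return evicted;
    }

    bool contains(int graph_id) const {
        for (const auto & s : slots) {
            if (s == graph_id) return true;
        }
        return false;
    }

private:
    std::array<std::optional<int>, N> slots{};
    size_t head = 0;
};
```

The trade-off versus the LRU variant is simplicity: eviction order ignores how recently a graph was reused, but there is no bookkeeping beyond the slot array.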

@JohannesGaessler
Contributor

I agree with @am17an that a simple solution would be much preferred.

@kmorennv
Author

kmorennv commented Apr 9, 2026

I'm not sure if this is the correct way to do this, seems a bit over-engineered with the LRU cache. I would much prefer a simple ring buffer, for example in #21611. A memory limit is fine I guess, but it would be preferable to have a dynamic upper bound to do this correctly. I don't have a good solution yet, still thinking about it.

  • OK, to fix only the upper-bound bug this might be too complex, but there are other issues.
  • The solution proposed in CUDA: use a ring-buffer for cuda graphs #21611 fixes only the unbounded cache; the caching of many empty ggml_cuda_graph instances is still not addressed (but, to be fair, that is another, related issue).

A memory limit is fine I guess, but it would be preferable to have a dynamic upper bound to do this correctly. I don't have a good solution yet, still thinking about it.

  • A dynamic upper bound is more complex than a static upper bound/heuristic. Dynamic means dependent, but on what? It could be made dependent on the micro-architecture, but that is a more complex check.
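One possible reading of "dynamic" — purely as an illustration, not something this PR implements — is a limit derived at runtime from free VRAM (which real CUDA code could query via cudaMemGetInfo), clamped to a static floor and ceiling. The fraction and bounds below are made-up numbers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical heuristic for a "dynamic" upper bound: cap the graph cache at
// a fraction of currently free VRAM, clamped to a static floor/ceiling.
// In real CUDA code, free_vram would come from cudaMemGetInfo().
size_t graph_cache_limit(size_t free_vram,
                         double fraction  = 0.05,
                         size_t min_bytes = size_t(16)  << 20,   // 16 MiB floor
                         size_t max_bytes = size_t(512) << 20) { // 512 MiB ceiling
    const size_t dynamic = static_cast<size_t>(free_vram * fraction);
    return std::clamp(dynamic, min_bytes, max_bytes);
}
```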

Long story short, I agree that to fix the bug only a few code changes are sufficient in the current code base. To address the other issues, would a separate PR be preferred?

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 9, 2026