Bug-Fix: set an upper VRAM limit for cached ggml_cuda graphs to prevent VRAM memory leaks #21673
Conversation
…hs, add chrono to profile
…optimize memory usage
…ement and encapsulate functionality, add comments
…proviz/dl/llama.cpp into kmoren/Bug-Fix-19639-CUDA-memory-leak-with-MTMD-Gemma3
… graph evaluation
…valuate_and_capture
…a_graph_evaluate_and_capture
I'm not sure if this is the correct way to do this; it seems a bit over-engineered with the LRU cache. I would much prefer a simple ring buffer, for example in #21611. A memory limit is fine I guess, but it would be preferable to have a dynamic upper bound to do this correctly. I don't have a good solution yet, still thinking about it.
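For illustration, the ring-buffer alternative could look roughly like the sketch below. This is a hypothetical standalone sketch, not the actual code from #21611: slots hold an opaque graph key instead of real `cudaGraphExec_t` handles, and eviction of the oldest slot is where the real backend would destroy the graph and release its VRAM. Bounding the slot count bounds the number of live graphs, and therefore VRAM, without tracking byte sizes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size ring buffer of cached graph slots. With N
// slots, at most N graphs are ever alive: the newest capture
// overwrites the oldest slot, which caps VRAM usage implicitly.
template <size_t N>
class graph_ring {
public:
    // Store a new graph key; when full, the oldest slot is reused
    // (real code would destroy that slot's CUDA graph here).
    void push(uint64_t key) {
        slots_[head_] = key;
        head_ = (head_ + 1) % N;
        if (count_ < N) {
            count_++;
        }
    }

    // Linear scan over the live slots; with a small N this is cheap
    // and avoids any map bookkeeping.
    bool contains(uint64_t key) const {
        for (size_t i = 0; i < count_; i++) {
            if (slots_[i] == key) {
                return true;
            }
        }
        return false;
    }

    size_t size() const { return count_; }

private:
    std::array<uint64_t, N> slots_{};
    size_t head_  = 0; // next slot to overwrite
    size_t count_ = 0; // number of live slots (<= N)
};
```

The trade-off versus an LRU cache: eviction order is purely capture order, so a still-hot graph can be evicted, but the structure is trivial to reason about and has no per-lookup reordering cost.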
|
I agree with @am17an that a simple solution would be much preferred.
Long story short, I agree: to fix the bug, only a few code changes to the current code base are sufficient. Would a separate PR be preferred to address the other issues?
Overview
Bug fix for #19639. The fix in the CUDA backend prevents unbounded growth of cached CUDA graphs for Gemma3 models.
Additional information
The fix is local to the CUDA backend. It adds a new data type that manages the cached CUDA graphs. With this PR the bug is fixed, but a code refactoring is still necessary.
Requirements