ggml-cuda: load large weights via unified memory on integrated GPUs #17
Merged
On integrated GPUs the CPU and GPU share one pool of RAM, but the loader still allocated a separate device buffer and copied every weight tensor host-to-device. For a model whose weights approach total RAM this doubles residency and the load thrashes. On Strix Halo gfx1151 an ~86GB model failed to finish loading (>1500s).

Route large weight allocations on integrated devices to managed (unified) memory and fill it without a staged copy:

- ggml_cuda_device_malloc: on an integrated GPU, auto-select cudaMallocManaged + coarse-grain advise for allocations above DFLASH_HIP_UMA_MIN_FRAC of system RAM (default 0.45), where a separate device buffer plus the source would otherwise not fit. The cached ggml_cuda_info().devices[].integrated flag is hard-forced false, so this probes a fresh cudaGetDeviceProperties() (thread-safe init). Opt out with DFLASH_HIP_NO_AUTO_UMA=1; force on with GGML_CUDA_ENABLE_UNIFIED_MEMORY.
- set_tensor: for managed buffers, fill with a parallel host memcpy, skipping the pageable host-to-device staged copy and the per-tensor synchronize.
- add ggml_backend_cuda_buffer_is_managed so loaders can stream weights straight into a unified buffer.

Auto-UMA is Linux + integrated only; discrete GPUs, non-integrated devices, and models that fit are unchanged (legacy hipMalloc + H2D), and the unified path degrades to legacy where /proc/meminfo is unavailable.

Measured on Strix Halo gfx1151 (ROCm 7.2.2): an 86GB model goes from un-loadable (>1500s) to ~70-90s; a 16GB model is unchanged and its decode output is byte-identical to the legacy path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
davide221 added a commit to Luce-Org/lucebox-hub that referenced this pull request on Jun 18, 2026
Advance the server/deps/llama.cpp submodule to include Luce-Org/llama.cpp-dflash-ggml#17, which loads large model weights via managed (unified) memory on integrated GPUs. Models whose weights approach total RAM become loadable on Strix Halo gfx1151 (an ~86GB model went from un-loadable, >1500s, to ~70-90s); discrete GPUs and models that fit are unchanged.

Submodule: 574be613 -> 9cd9e1ed (PR #17 only).

Co-authored-by: mrciffa <davide@cifarelli.tech>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
On integrated GPUs (APUs where the CPU and GPU share one pool of RAM, e.g. AMD Strix Halo gfx1151) the loader allocated a separate device buffer and copied every weight tensor host-to-device. For a model whose weights approach total RAM this doubles residency, so the device buffer plus the mmap'd source no longer fit and the load thrashes. This routes large weight allocations on integrated devices to managed (unified) memory and fills it with a direct parallel host copy.
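The "direct parallel host copy" can be pictured with a small sketch. This is not the PR's code: `parallel_memcpy`, its 1 MiB per-worker minimum, and the slicing scheme are all hypothetical, chosen only to show how a managed (CPU-addressable) buffer could be filled by several host threads instead of a staged host-to-device copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper (not taken from the PR): copy `size` bytes from `src`
// to `dst` with up to `n_threads` workers, one contiguous slice per worker.
// `dst` is assumed to be CPU-addressable, as a managed/unified buffer is.
static void parallel_memcpy(void * dst, const void * src, size_t size, unsigned n_threads) {
    assert(size == 0 || (dst != nullptr && src != nullptr));

    // Assumption: below ~1 MiB per worker, spawning threads is not worth it.
    const size_t min_chunk  = size_t(1) << 20;
    const size_t max_useful = std::max<size_t>(1, size / min_chunk);
    n_threads = (unsigned) std::min<size_t>(std::max(1u, n_threads), max_useful);

    if (n_threads == 1) {
        if (size > 0) std::memcpy(dst, src, size);
        return;
    }

    // Round up so the slices cover the whole range; the last one may be shorter.
    const size_t chunk = (size + n_threads - 1) / n_threads;

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n_threads; ++i) {
        const size_t off = size_t(i) * chunk;
        if (off >= size) break;
        const size_t len = std::min(chunk, size - off);
        workers.emplace_back([=] {
            std::memcpy(static_cast<char *>(dst) + off,
                        static_cast<const char *>(src) + off, len);
        });
    }
    for (auto & w : workers) w.join();
}
```

Because the slices do not overlap, the workers need no synchronization beyond the final `join()`. How many threads the real `set_tensor` path uses, and how it splits the work, is not stated in this PR description.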
Changes (`ggml/src/ggml-cuda/ggml-cuda.cu`, ~127 lines)
- `ggml_cuda_device_malloc`: on an integrated GPU, auto-select `cudaMallocManaged` + coarse-grain advise for allocations above `DFLASH_HIP_UMA_MIN_FRAC` of system RAM (default `0.45`), i.e. where a separate device buffer plus the source would not fit. Probes a fresh `cudaGetDeviceProperties().integrated` via thread-safe init (the cached `ggml_cuda_info()` flag is hard-forced `false`). Opt out: `DFLASH_HIP_NO_AUTO_UMA=1`. Force on: `GGML_CUDA_ENABLE_UNIFIED_MEMORY`.
- `set_tensor`: for managed buffers, fill with a parallel host `memcpy`, skipping the pageable host-to-device staged copy and the per-tensor synchronize.
- `ggml_backend_cuda_buffer_is_managed`: small accessor so loaders can stream weights straight into a unified buffer.

Results (Strix Halo gfx1151, ROCm 7.2.2)
- 86GB model: un-loadable (>1500s) before, loads in ~70-90s after.
- 16GB model: unchanged; decode output is byte-identical to the legacy path.
Safety / scope
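Auto-UMA is Linux + integrated only. Discrete GPUs, non-integrated devices, and models that fit: legacy `hipMalloc` + H2D path untouched; the unified path degrades to legacy where `/proc/meminfo` is unavailable.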
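
The selection rule and its fallback can be sketched as a pure function. This is an illustration under stated assumptions, not the PR's implementation: `parse_mem_total_bytes`, `system_ram_bytes`, and `should_use_managed` are hypothetical names, the opt-out and force-on switches are passed in as booleans rather than read from the environment, and the precedence between them (force-on wins) is a guess that the description does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: read MemTotal (reported in kB) from meminfo-formatted
// text. Returns 0 when the field is missing or unparsable.
static size_t parse_mem_total_bytes(std::istream & in) {
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("MemTotal:", 0) == 0) {      // line starts with "MemTotal:"
            std::istringstream fields(line.substr(9));
            unsigned long long kb = 0;
            return (fields >> kb) ? (size_t) kb * 1024 : 0;
        }
    }
    return 0;
}

// Hypothetical helper: 0 means "unknown", e.g. /proc/meminfo is unavailable.
static size_t system_ram_bytes() {
    std::ifstream f("/proc/meminfo");
    return f ? parse_mem_total_bytes(f) : 0;
}

// Hypothetical decision function mirroring the rule described above.
// Assumption: force-on takes precedence over opt-out.
static bool should_use_managed(size_t alloc_bytes, bool integrated, size_t total_ram,
                               double min_frac, bool opt_out, bool force_on) {
    assert(min_frac > 0.0);
    if (force_on)               return true;
    if (opt_out || !integrated) return false;
    if (total_ram == 0)         return false;       // no RAM info: stay on the legacy path
    return (double) alloc_bytes > min_frac * (double) total_ram;
}
```

A caller would pass `system_ram_bytes()` as `total_ram` and `0.45` as `min_frac` to match the stated default. Keeping the decision free of I/O makes the fallback case (unknown RAM size) easy to exercise in isolation.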