ggml-cuda: load large weights via unified memory on integrated GPUs by davide221 · Pull Request #17 · Luce-Org/llama.cpp-dflash-ggml

davide221 · 2026-06-18T15:48:21Z

What

On integrated GPUs (APUs where the CPU and GPU share one pool of RAM, e.g. AMD Strix Halo gfx1151) the loader allocated a separate device buffer and copied every weight tensor host-to-device. For a model whose weights approach total RAM this doubles residency, so the device buffer plus the mmap'd source no longer fit and the load thrashes. This routes large weight allocations on integrated devices to managed (unified) memory and fills it with a direct parallel host copy.

Changes (`ggml/src/ggml-cuda/ggml-cuda.cu`, ~127 lines)

- `ggml_cuda_device_malloc`: on an integrated GPU, auto-select `cudaMallocManaged` + coarse-grain advise for allocations above `DFLASH_HIP_UMA_MIN_FRAC` of system RAM (default 0.45), i.e. where a separate device buffer plus the source would not fit. Probes a fresh `cudaGetDeviceProperties().integrated` via thread-safe init (the cached `ggml_cuda_info()` flag is hard-forced false). Opt out: `DFLASH_HIP_NO_AUTO_UMA=1`. Force on: `GGML_CUDA_ENABLE_UNIFIED_MEMORY`.
- `set_tensor`: for managed buffers, fill with a parallel host memcpy, skipping the pageable host-to-device staged copy and the per-tensor synchronize.
- `ggml_backend_cuda_buffer_is_managed`: small accessor so loaders can stream weights straight into a unified buffer.
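
The parallel host memcpy used by `set_tensor` for managed buffers could look roughly like the sketch below. This is not the code from the PR: `parallel_host_memcpy`, the thread count, and the 1 MiB per-thread cutoff are hypothetical, and ordinary host memory stands in for the managed buffer (which is addressable from the CPU, so a plain `memcpy` into it is valid).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper, not the PR's implementation: copy `size` bytes from
// `src` to `dst`, splitting the range into contiguous chunks that are copied
// by separate threads. `dst` stands in for a managed (unified) buffer.
static void parallel_host_memcpy(void * dst, const void * src, size_t size, unsigned n_threads) {
    assert((dst != nullptr && src != nullptr) || size == 0);
    if (size == 0) {
        return;
    }

    // Assumed cutoff: give each thread at least 1 MiB so that thread start-up
    // does not dominate small copies.
    const size_t min_chunk  = size_t(1) << 20;
    const size_t max_useful = std::max<size_t>(1, size / min_chunk);
    const size_t n = std::min<size_t>(std::max(1u, n_threads), max_useful);

    if (n == 1) {
        std::memcpy(dst, src, size);
        return;
    }

    const size_t chunk = (size + n - 1) / n; // ceil(size / n)
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (size_t i = 0; i < n; ++i) {
        const size_t off = i * chunk;
        if (off >= size) {
            break;
        }
        const size_t len = std::min(chunk, size - off);
        workers.emplace_back([=] {
            std::memcpy(static_cast<char *>(dst) + off,
                        static_cast<const char *>(src) + off, len);
        });
    }
    for (auto & w : workers) {
        w.join(); // all chunks are in place before the function returns
    }
}
```

A caller would pass the tensor's destination inside the managed buffer as `dst` and the mmap'd weight data as `src`. Because the threads are joined before returning, no separate per-tensor device synchronize is needed for the copy itself in this sketch.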

Results (Strix Halo gfx1151, ROCm 7.2.2)

| Model | Before | After |
| --- | --- | --- |
| ~86 GB (near total RAM) | un-loadable (>1500 s, thrash) | ~70-90 s |
| 16 GB (fits) | unchanged | unchanged; decode output byte-identical |

Safety / scope

- Linux + integrated only. Discrete GPUs, non-integrated devices, and models that fit keep the legacy hipMalloc + H2D path untouched; the unified path degrades to legacy where /proc/meminfo is unavailable.
- No decode regression: greedy output is byte-identical between the managed and legacy paths.
- Page-cache management for near-RAM models (dropping copied source pages) is intentionally left to the model loader, where the source is known to be a read-only file mapping, rather than the generic backend.
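
The scope rules above can be summarized as a small decision function. The following is a minimal sketch under stated assumptions, not the PR's code: `uma_inputs`, `parse_mem_total_bytes`, and `use_managed_alloc` are hypothetical names, the precedence of force-on over opt-out is assumed, and "above" the threshold is read as strictly greater. Only the environment variable names, the 0.45 default, and the `/proc/meminfo` fallback come from the description.

```cpp
#include <cstdint>
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical: extract MemTotal from a /proc/meminfo-style stream.
// Returns 0 when the field is missing, which the caller treats as
// "unavailable" and falls back to the legacy path.
static uint64_t parse_mem_total_bytes(std::istream & in) {
    std::string line;
    while (std::getline(in, line)) {
        unsigned long long kb = 0;
        if (std::sscanf(line.c_str(), "MemTotal: %llu kB", &kb) == 1) {
            return static_cast<uint64_t>(kb) * 1024;
        }
    }
    return 0;
}

// Hypothetical bundle of the inputs named in the PR description.
struct uma_inputs {
    bool     integrated;      // fresh device-properties probe says integrated
    bool     force_on;        // GGML_CUDA_ENABLE_UNIFIED_MEMORY is set
    bool     opt_out;         // DFLASH_HIP_NO_AUTO_UMA=1
    double   min_frac;        // DFLASH_HIP_UMA_MIN_FRAC (default 0.45)
    uint64_t mem_total_bytes; // 0 if /proc/meminfo could not be read
};

// Returns true when the allocation should use managed (unified) memory.
static bool use_managed_alloc(const uma_inputs & in, uint64_t alloc_bytes) {
    if (in.force_on) {
        return true;  // explicit force-on wins (assumed precedence)
    }
    if (in.opt_out || !in.integrated) {
        return false; // discrete GPU or opted out: legacy device alloc + H2D
    }
    if (in.mem_total_bytes == 0) {
        return false; // no RAM size available: degrade to legacy
    }
    // Only allocations above the configured fraction of system RAM qualify.
    return static_cast<double>(alloc_bytes) >
           in.min_frac * static_cast<double>(in.mem_total_bytes);
}
```

On Linux the stream would be an `std::ifstream` opened on `/proc/meminfo`; taking an `std::istream` keeps the parser testable with an `std::istringstream`. Small allocations on an integrated GPU fall through to the last comparison and stay on the legacy path, which matches the "models that fit are unchanged" guarantee.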

On integrated GPUs the CPU and GPU share one pool of RAM, but the loader still allocated a separate device buffer and copied every weight tensor host-to-device. For a model whose weights approach total RAM this doubles residency and the load thrashes. On Strix Halo gfx1151 an ~86GB model failed to finish loading (>1500s).

Route large weight allocations on integrated devices to managed (unified) memory and fill it without a staged copy:

- ggml_cuda_device_malloc: on an integrated GPU, auto-select cudaMallocManaged + coarse-grain advise for allocations above DFLASH_HIP_UMA_MIN_FRAC of system RAM (default 0.45), where a separate device buffer plus the source would otherwise not fit. The cached ggml_cuda_info().devices[].integrated flag is hard-forced false, so this probes a fresh cudaGetDeviceProperties() (thread-safe init). Opt out with DFLASH_HIP_NO_AUTO_UMA=1; force on with GGML_CUDA_ENABLE_UNIFIED_MEMORY.
- set_tensor: for managed buffers, fill with a parallel host memcpy, skipping the pageable host-to-device staged copy and the per-tensor synchronize.
- add ggml_backend_cuda_buffer_is_managed so loaders can stream weights straight into a unified buffer.

Auto-UMA is Linux + integrated only; discrete GPUs, non-integrated devices, and models that fit are unchanged (legacy hipMalloc + H2D), and the unified path degrades to legacy where /proc/meminfo is unavailable.

Measured on Strix Halo gfx1151 (ROCm 7.2.2): an 86GB model goes from un-loadable (>1500s) to ~70-90s; a 16GB model is unchanged and its decode output is byte-identical to the legacy path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Advance the server/deps/llama.cpp submodule to include Luce-Org/llama.cpp-dflash-ggml#17, which loads large model weights via managed (unified) memory on integrated GPUs. Models whose weights approach total RAM become loadable on Strix Halo gfx1151 (an ~86GB model went from un-loadable, >1500s, to ~70-90s); discrete GPUs and models that fit are unchanged.

Submodule: 574be613 -> 9cd9e1ed (PR #17 only).

Co-authored-by: mrciffa <davide@cifarelli.tech>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot added ggml CUDA labels Jun 18, 2026

davide221 merged commit 9cd9e1e into luce-dflash Jun 18, 2026
12 of 50 checks passed

davide221 mentioned this pull request Jun 18, 2026

deps: bump llama.cpp-dflash-ggml for unified-memory weight loading Luce-Org/lucebox-hub#421

Merged

davide221 merged 1 commit into luce-dflash from feat/uma-integrated-weight-load
