ggml-cuda: load large weights via unified memory on integrated GPUs #17

Merged
davide221 merged 1 commit into luce-dflash from feat/uma-integrated-weight-load on Jun 18, 2026

@davide221

What

On integrated GPUs (APUs where the CPU and GPU share one pool of RAM, e.g. AMD Strix Halo gfx1151), the loader previously allocated a separate device buffer and copied every weight tensor host-to-device. For a model whose weights approach total RAM this doubles residency: the device buffer plus the mmap'd source no longer fit, and the load thrashes. This PR routes large weight allocations on integrated devices to managed (unified) memory and fills that memory with a direct parallel host copy.

Changes (ggml/src/ggml-cuda/ggml-cuda.cu, ~127 lines)

  • ggml_cuda_device_malloc: on an integrated GPU, auto-select cudaMallocManaged + coarse-grain advise for allocations above DFLASH_HIP_UMA_MIN_FRAC of system RAM (default 0.45), i.e. where a separate device buffer plus the source would not fit. Because the cached ggml_cuda_info().devices[].integrated flag is hard-forced to false, the check calls cudaGetDeviceProperties() afresh and reads .integrated, behind a thread-safe one-time init. Opt out: DFLASH_HIP_NO_AUTO_UMA=1. Force on: GGML_CUDA_ENABLE_UNIFIED_MEMORY.
  • set_tensor: for managed buffers, fill with a parallel host memcpy, skipping the pageable host-to-device staged copy and the per-tensor synchronize.
  • ggml_backend_cuda_buffer_is_managed: small accessor so loaders can stream weights straight into a unified buffer.
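
The set_tensor change above relies on a managed buffer being directly addressable from the host, so it can be filled with ordinary memcpy calls spread across threads. A minimal sketch of that idea follows; it is not the code from this PR. The helper name parallel_host_copy, the one-chunk-per-thread split, and the thread count parameter are all assumptions made for illustration, and plain host memory stands in for the managed buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical helper (not from the PR): copy `size` bytes from `src` to `dst`
// using up to `n_threads` worker threads, one contiguous chunk per thread.
// For a managed (unified) buffer, `dst` is a host-addressable pointer, so a
// plain memcpy is valid and no staged host-to-device copy is involved.
static void parallel_host_copy(void * dst, const void * src, size_t size, unsigned n_threads) {
    n_threads = std::max(1u, n_threads);
    const size_t chunk = (size + n_threads - 1) / n_threads;  // ceiling division

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        const size_t begin = std::min(size, (size_t) t * chunk);
        const size_t end   = std::min(size, begin + chunk);
        if (begin == end) {
            break;  // nothing left for this or any later thread
        }
        workers.emplace_back([=] {
            std::memcpy((char *) dst + begin, (const char *) src + begin, end - begin);
        });
    }
    // The copy is complete once every worker has been joined.
    for (auto & w : workers) {
        w.join();
    }
}
```

A loader could call such a helper once per tensor after checking ggml_backend_cuda_buffer_is_managed on the destination buffer; how the real implementation chunks the copy or sizes its thread pool is not stated in this PR description.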

Results (Strix Halo gfx1151, ROCm 7.2.2)

| Model | Before | After |
|---|---|---|
| ~86 GB (near total RAM) | un-loadable (>1500 s, thrash) | ~70-90 s |
| 16 GB (fits) | unchanged | unchanged; decode output byte-identical |

Safety / scope

  • Linux + integrated only. Discrete GPUs, non-integrated devices, and models that fit keep the legacy hipMalloc + H2D path untouched; the unified path degrades to legacy where /proc/meminfo is unavailable.
  • No decode regression: greedy output is byte-identical between the managed and legacy paths.
  • Page-cache management for near-RAM models (dropping copied source pages) is intentionally left to the model loader, where the source is known to be a read-only file mapping, rather than the generic backend.
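
The selection rule described above (integrated device, allocation above a fraction of system RAM, fallback to legacy when /proc/meminfo cannot be read) can be sketched as a small policy function. This is an illustration under stated assumptions, not the PR's code: the names parse_mem_total_bytes and should_use_managed are hypothetical, the environment variables are passed in as already-parsed values, and letting the force-on switch take precedence over the opt-out is a guess. One way to read the 0.45 default: the legacy path needs roughly twice the allocation (source plus device buffer), and 2 × 0.45 = 0.9 of RAM leaves little headroom.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Parse the "MemTotal:" line of a /proc/meminfo-style stream.
// Returns total RAM in bytes, or 0 if the field is missing or unreadable.
static uint64_t parse_mem_total_bytes(std::istream & in) {
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("MemTotal:", 0) == 0) {       // line starts with the key
            std::istringstream fields(line.substr(9)); // skip "MemTotal:"
            uint64_t kib = 0;
            if (fields >> kib) {
                return kib * 1024;  // /proc/meminfo reports the value in kB
            }
            return 0;
        }
    }
    return 0;
}

// Hypothetical policy function (not from the PR). `min_frac` corresponds to
// DFLASH_HIP_UMA_MIN_FRAC, `opt_out` to DFLASH_HIP_NO_AUTO_UMA=1, and
// `force_on` to GGML_CUDA_ENABLE_UNIFIED_MEMORY being set.
static bool should_use_managed(bool integrated, uint64_t alloc_bytes, uint64_t total_ram_bytes,
                               double min_frac, bool opt_out, bool force_on) {
    if (force_on) {
        return true;   // assumption: the explicit force switch wins
    }
    if (opt_out || !integrated) {
        return false;  // discrete GPUs and opted-out runs keep the legacy path
    }
    if (total_ram_bytes == 0) {
        return false;  // no RAM figure available: degrade to the legacy path
    }
    return (double) alloc_bytes > min_frac * (double) total_ram_bytes;
}
```

On Linux the stream would typically be a std::ifstream opened on /proc/meminfo; if the file cannot be opened, the parser returns 0 and the policy stays on the legacy allocation path, which matches the fallback described in the list above.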

On integrated GPUs the CPU and GPU share one pool of RAM, but the loader still
allocated a separate device buffer and copied every weight tensor host-to-device.
For a model whose weights approach total RAM this doubles residency and the load
thrashes. On Strix Halo gfx1151 an ~86GB model failed to finish loading (>1500s).

Route large weight allocations on integrated devices to managed (unified) memory
and fill it without a staged copy:

- ggml_cuda_device_malloc: on an integrated GPU, auto-select cudaMallocManaged +
  coarse-grain advise for allocations above DFLASH_HIP_UMA_MIN_FRAC of system RAM
  (default 0.45), where a separate device buffer plus the source would otherwise
  not fit. The cached ggml_cuda_info().devices[].integrated flag is hard-forced
  false, so this probes a fresh cudaGetDeviceProperties() (thread-safe init). Opt
  out with DFLASH_HIP_NO_AUTO_UMA=1; force on with GGML_CUDA_ENABLE_UNIFIED_MEMORY.
- set_tensor: for managed buffers, fill with a parallel host memcpy, skipping the
  pageable host-to-device staged copy and the per-tensor synchronize.
- add ggml_backend_cuda_buffer_is_managed so loaders can stream weights straight
  into a unified buffer.

Auto-UMA is Linux + integrated only; discrete GPUs, non-integrated devices, and
models that fit are unchanged (legacy hipMalloc + H2D), and the unified path
degrades to legacy where /proc/meminfo is unavailable. Measured on Strix Halo
gfx1151 (ROCm 7.2.2): an ~86GB model goes from un-loadable (>1500s) to ~70-90s; a
16GB model is unchanged and its decode output is byte-identical to the legacy path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@davide221 davide221 merged commit 9cd9e1e into luce-dflash Jun 18, 2026
12 of 50 checks passed
davide221 added a commit to Luce-Org/lucebox-hub that referenced this pull request Jun 18, 2026

Advance the server/deps/llama.cpp submodule to include
Luce-Org/llama.cpp-dflash-ggml#17, which loads large model weights via
managed (unified) memory on integrated GPUs. Models whose weights approach
total RAM become loadable on Strix Halo gfx1151 (an ~86GB model went from
un-loadable, >1500s, to ~70-90s); discrete GPUs and models that fit are
unchanged.

Submodule: 574be613 -> 9cd9e1ed (PR #17 only).

Co-authored-by: mrciffa <davide@cifarelli.tech>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>