Large NRT HOST memory allocations fail under host memory fragmentation; driver fallback pool cannot satisfy >2 MiB requests #15

@yangyang05sb

Environment

  • aws-neuronx-dkms: 2.24.13.0
  • Also reproduced with: 2.19.64.0
  • Instance: inf1.2xlarge
  • OS: Ubuntu 24.04
  • Kernel: 6.17

Symptom

After heavy host I/O / page-cache pressure, such as running docker build --no-cache multiple times, starting multiple Neuron containers concurrently can fail during model loading.

Example NRT logs:

TDRV:dmem_alloc_internal Failed to alloc HOST memory: 3292176
TDRV:tensor_allocate Failed to allocate 3292176 bytes on HOST for tensor ...
TDRV:dml_dump Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_xxx.csv

Relevant kernel log:

page allocation failure: order:10, mode:0xcc0(GFP_KERNEL)
  __dma_direct_alloc_pages
  dma_alloc_attrs
  mc_alloc_align [neuron]
...
Node 0 Normal: ... 0*2048kB 0*4096kB
neuron:mc_alloc_internal: host mem occupied ...

In some cases, smaller allocations appear to be rescued by the internal pool:

page allocation failure: order:9, mode:0xcc0(GFP_KERNEL)
...
neuron:mc_alloc_internal: Completed host allocation of 2097152B from the internal pool

However, larger HOST tensor allocations, for example 2441472 bytes or 3292176 bytes, still fail.
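The order values in the kernel logs line up with the request sizes if the allocation is rounded up to a power-of-two number of pages, as the kernel's get_order() does. A minimal sketch of that arithmetic, assuming 4 KiB base pages (the helper name alloc_order is ours, not a driver or kernel symbol):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages on this host


def alloc_order(size_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Return the smallest order such that (page_size << order) >= size_bytes."""
    pages = -(-size_bytes // page_size)  # ceiling division
    # bit_length of (pages - 1) is the exponent of the next power of two.
    return max(pages - 1, 0).bit_length()


# Sizes taken from the logs in this issue.
for size in (2097152, 2441472, 3292176):
    order = alloc_order(size)
    block_kib = (PAGE_SIZE << order) // 1024
    print(f"{size} bytes -> order {order} ({block_kib} KiB contiguous block)")
```

Under that assumption, the 2097152-byte request needs an order-9 (2 MiB) block, while the 2441472-byte and 3292176-byte requests both need an order-10 (4 MiB) block.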

Observed behavior

  • The failure correlates with host memory fragmentation / lack of high-order free pages.
  • /proc/buddyinfo shows very low or zero order-9 / order-10 blocks before failure.
  • Running the following before starting the workload makes the failure disappear in our environment:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
echo 1 | sudo tee /proc/sys/vm/compact_memory
  • Increasing mempool_host_memory_size did not resolve this failure mode for allocations larger than 2 MiB.
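For anyone who wants to check the same condition before starting a workload, the sketch below parses /proc/buddyinfo and reports the free-block counts at order 9 and above per zone. The function name high_order_free and the SAMPLE line are illustrative only; the sample numbers are made up and are not captured from our host.

```python
# Illustrative /proc/buddyinfo line (made-up numbers): 11 columns, orders 0..10.
SAMPLE = (
    "Node 0, zone   Normal   4024   2810   1512    640    210"
    "     58     12      3      1      0      0\n"
)


def high_order_free(buddyinfo_text: str, min_order: int = 9) -> dict:
    """Map 'node<N>/<zone>' to the free-block counts for orders >= min_order."""
    result = {}
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        # Expected layout: Node <n>, zone <name> <order-0 count> <order-1 count> ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = fields[1].rstrip(",")
        counts = [int(c) for c in fields[4:]]
        result[f"node{node}/{fields[3]}"] = counts[min_order:]
    return result


if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on hosts without /proc/buddyinfo
    for zone, counts in high_order_free(text).items():
        print(zone, "free blocks at order >= 9:", counts)
```

A zone that reports all zeros here has no free contiguous 2 MiB or 4 MiB blocks, which is the state we see just before the failure.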

Driver-side analysis / hypothesis

From neuron_mempool.c, the HOST allocation path first calls:

mc->va = dma_alloc_coherent(mpset->pdev, size, &addr,
        GFP_KERNEL | GFP_DMA32);

If that returns NULL, the driver falls back to its reserved host mempool:

for (i = 0; i < MP_HOST_RESERVE_MEMORY_POOL_COUNT; i++) {
    u32 page_size = MP_HOST_PAGE_SIZE_MIN << i;
    if (page_size < size)
        continue;

    mp = &mpset->mp_hrm[i];
    mc->va = gen_pool_dma_alloc(mp->gen_pool, size, &mc->pa);
    ...
}

The reserved host mempool appears to be initialized with four size classes:

256 KiB / 512 KiB / 1 MiB / 2 MiB

Because the fallback loop skips any pool where page_size < size, requests larger than 2 MiB do not appear to be serviceable by the fallback pool. This matches our observed failures for HOST tensor allocations such as:

2441472 bytes
3292176 bytes

Our understanding is therefore:

  1. Kernel coherent DMA allocation fails due to lack of high-order contiguous pages.
  2. The driver fallback pool can rescue some allocations up to 2 MiB.
  3. Larger HOST allocations cannot be satisfied by the current fallback pool layout, so NRT fails with Failed to alloc HOST memory.
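To make the hypothesis concrete, here is a small userspace model of the size-class selection in the fallback loop quoted above. It is a sketch under our assumptions, not the driver's code: the constant values (256 KiB minimum, four pools) are inferred from the size classes we observed, and the model treats every pool as having free space, which the real gen_pool_dma_alloc call does not guarantee.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumed values, inferred from the 256 KiB / 512 KiB / 1 MiB / 2 MiB classes.
MP_HOST_PAGE_SIZE_MIN = 256 * KIB
MP_HOST_RESERVE_MEMORY_POOL_COUNT = 4


def fallback_size_class(size, pool_count=MP_HOST_RESERVE_MEMORY_POOL_COUNT):
    """Return the first size class that can hold `size`, or None if none can."""
    for i in range(pool_count):
        page_size = MP_HOST_PAGE_SIZE_MIN << i
        if page_size < size:
            continue  # same skip condition as the quoted driver loop
        return page_size  # model assumption: the pool has a free chunk
    return None


# Sizes taken from the logs in this issue.
for size in (2097152, 2441472, 3292176):
    print(size, "->", fallback_size_class(size))

# Hypothetical fifth pool (4 MiB), as asked about in the questions below.
print(3292176, "->", fallback_size_class(3292176, pool_count=5))
```

In this model only the 2097152-byte request finds a size class with four pools; the two larger requests return None unless a fifth, 4 MiB class is added.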

Questions

Could AWS confirm whether this understanding is correct?

Specifically:

  1. Is the internal HOST fallback mempool expected to support allocations larger than 2 MiB?
  2. If not, is there a recommended setting or workaround for NRT models that require HOST tensors larger than 2 MiB?
  3. Would it be feasible for the driver to include a larger fallback size class, for example 4 MiB, or otherwise split large HOST tensor allocations?

Reproduction

  1. Fresh inf1.2xlarge host.
  2. Run heavy host I/O, for example two consecutive docker build --no-cache runs. Heavy EC2 user-data scripts, such as ones that install a large number of packages, can also reproduce it.
  3. Start four containers concurrently, each pinned to a separate NeuronCore.
  4. Model loading fails with Failed to alloc HOST memory.

In our environment, the failure is highly reproducible after page-cache pressure and disappears after explicit cache drop + memory compaction.
