Environment
- aws-neuronx-dkms: 2.24.13.0
- Also reproduced with: 2.19.64.0
- Instance: inf1.2xlarge
- OS: Ubuntu 24.04
- Kernel: 6.17
Symptom
After heavy host I/O / page-cache pressure, such as running docker build --no-cache multiple times, starting multiple Neuron containers concurrently can fail during model loading.
Example NRT logs:
TDRV:dmem_alloc_internal Failed to alloc HOST memory: 3292176
TDRV:tensor_allocate Failed to allocate 3292176 bytes on HOST for tensor ...
TDRV:dml_dump Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_xxx.csv
Relevant kernel log:
page allocation failure: order:10, mode:0xcc0(GFP_KERNEL)
__dma_direct_alloc_pages
dma_alloc_attrs
mc_alloc_align [neuron]
...
Node 0 Normal: ... 0*2048kB 0*4096kB
neuron:mc_alloc_internal: host mem occupied ...
In some cases, smaller allocations appear to be rescued by the internal pool:
page allocation failure: order:9, mode:0xcc0(GFP_KERNEL)
...
neuron:mc_alloc_internal: Completed host allocation of 2097152B from the internal pool
However, larger HOST tensor allocations, for example 2441472 bytes or 3292176 bytes, still fail.
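As a rough cross-check of the orders seen in the kernel log, the sketch below computes the buddy-allocator order a contiguous request of a given size would need. It assumes a 4 KiB base page size (typical for x86_64, but not confirmed here for this host), and `alloc_order` is a hypothetical helper name, not a kernel or Neuron API.

```python
def alloc_order(size_bytes: int, page_size: int = 4096) -> int:
    """Smallest order such that (page_size << order) >= size_bytes.

    page_size=4096 is an assumption about the host, not a measured value.
    """
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    pages = -(-size_bytes // page_size)   # ceiling division
    return (pages - 1).bit_length()       # round page count up to a power of two


# Sizes taken from the logs quoted in this report.
for size in (2097152, 2441472, 3292176):
    order = alloc_order(size)
    block_kib = (4096 << order) // 1024
    print(f"{size} bytes -> order {order} ({block_kib} KiB block)")
```

Under that assumption, the 2097152-byte request maps to order 9 and both failing tensor sizes map to order 10, which is consistent with the `order:9` and `order:10` page allocation failures shown above.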
Observed behavior
- The failure correlates with host memory fragmentation / lack of high-order free pages.
- /proc/buddyinfo shows very low or zero order-9 / order-10 blocks before failure.
- Running the following before starting the workload makes the failure disappear in our environment:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
echo 1 | sudo tee /proc/sys/vm/compact_memory
- Increasing mempool_host_memory_size did not resolve this failure mode for allocations larger than 2 MiB.
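The /proc/buddyinfo observation can be checked before starting the workload with something like the sketch below. It assumes the usual buddyinfo layout (`Node N, zone NAME` followed by one free-block count per order, starting at order 0). The sample text is made up for illustration and is not output captured from the affected host; `high_order_free` is a hypothetical helper name.

```python
def high_order_free(buddyinfo_text: str, min_order: int = 9) -> dict:
    """Map (node, zone) -> number of free blocks at order >= min_order."""
    result = {}
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        # Expected shape: Node <n>, zone <name> <count order0> <count order1> ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = fields[1].rstrip(",")
        counts = [int(x) for x in fields[4:]]
        result[(node, fields[3])] = sum(counts[min_order:])
    return result


# Illustrative sample only (not real data from this report).
SAMPLE = """\
Node 0, zone      DMA      1      1      0      0      2      1      1      0      1      1      3
Node 0, zone    DMA32   3120   1544    610    201     48     12      3      1      0      0      0
Node 0, zone   Normal  20874   9321   2210    540     97     20      4      1      0      0      0
"""

for (node, zone), blocks in high_order_free(SAMPLE).items():
    print(f"node {node} zone {zone}: {blocks} free blocks at order >= 9")
```

On a real host the same function could be fed the contents of /proc/buddyinfo. A zero count for the DMA32 and Normal zones would match the fragmented state described above.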
Driver-side analysis / hypothesis
From neuron_mempool.c, the HOST allocation path first calls:
mc->va = dma_alloc_coherent(mpset->pdev, size, &addr,
                            GFP_KERNEL | GFP_DMA32);
If that returns NULL, the driver falls back to its reserved host mempool:
for (i = 0; i < MP_HOST_RESERVE_MEMORY_POOL_COUNT; i++) {
    u32 page_size = MP_HOST_PAGE_SIZE_MIN << i;
    if (page_size < size)
        continue;
    mp = &mpset->mp_hrm[i];
    mc->va = gen_pool_dma_alloc(mp->gen_pool, size, &mc->pa);
    ...
}
The reserved host mempool appears to be initialized with four size classes:
256 KiB / 512 KiB / 1 MiB / 2 MiB
Because the fallback loop skips any pool where page_size < size, requests larger than 2 MiB do not appear to be serviceable by the fallback pool. This matches our observed failures for HOST tensor allocations such as:
2441472 bytes
3292176 bytes
Our understanding is therefore:
- Kernel coherent DMA allocation fails due to lack of high-order contiguous pages.
- The driver fallback pool can rescue some allocations up to 2 MiB.
- Larger HOST allocations cannot be satisfied by the current fallback pool layout, so NRT fails with Failed to alloc HOST memory.
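To make the hypothesis concrete, here is a small model of the size-class selection in the fallback loop. It is only a sketch of our reading of the driver code: the constants are inferred from the four size classes listed above rather than copied from the driver headers, and it assumes each pool always has free space, so it models only the `page_size < size` skip and not pool exhaustion.

```python
KIB = 1024

# Assumed values, inferred from the 256 KiB / 512 KiB / 1 MiB / 2 MiB classes.
MP_HOST_PAGE_SIZE_MIN = 256 * KIB
MP_HOST_RESERVE_MEMORY_POOL_COUNT = 4


def pick_fallback_class(size: int):
    """Return the first size class that could hold `size`, or None."""
    for i in range(MP_HOST_RESERVE_MEMORY_POOL_COUNT):
        page_size = MP_HOST_PAGE_SIZE_MIN << i
        if page_size < size:
            continue              # same skip condition as the driver loop
        return page_size
    return None                   # no class is large enough


# Sizes taken from the logs quoted in this report.
for size in (2097152, 2441472, 3292176):
    print(size, "->", pick_fallback_class(size))
```

In this model the 2097152-byte request lands in the 2 MiB class, while 2441472 and 3292176 bytes fall through every class, which mirrors the split between rescued and failed allocations in the logs.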
Questions
Could AWS confirm whether this understanding is correct?
Specifically:
- Is the internal HOST fallback mempool expected to support allocations larger than 2 MiB?
- If not, is there a recommended setting or workaround for NRT models that require HOST tensors larger than 2 MiB?
- Would it be feasible for the driver to include a larger fallback size class, for example 4 MiB, or otherwise split large HOST tensor allocations?
Reproduction
- Fresh inf1.2xlarge host.
- Run heavy host I/O, for example two consecutive docker build --no-cache runs. A heavy EC2 user-data script that installs a large number of packages can also trigger it.
- Start four containers concurrently, each pinned to a separate NeuronCore.
- Model loading fails with Failed to alloc HOST memory.
In our environment, the failure is highly reproducible after page-cache pressure and disappears after explicit cache drop + memory compaction.