Large NRT HOST memory allocations fail under host memory fragmentation; driver fallback pool cannot satisfy >2 MiB requests

### Environment
- aws-neuronx-dkms: 2.24.13.0
- Also reproduced with: 2.19.64.0
- Instance: inf1.2xlarge
- OS: Ubuntu 24.04
- Kernel: 6.17

### Symptom
After heavy host I/O / page-cache pressure, such as running `docker build --no-cache` multiple times, starting multiple Neuron containers concurrently can fail during model loading.

Example NRT logs:

```text
TDRV:dmem_alloc_internal Failed to alloc HOST memory: 3292176
TDRV:tensor_allocate Failed to allocate 3292176 bytes on HOST for tensor ...
TDRV:dml_dump Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_xxx.csv
```

Relevant kernel log:

```text
page allocation failure: order:10, mode:0xcc0(GFP_KERNEL)
  __dma_direct_alloc_pages
  dma_alloc_attrs
  mc_alloc_align [neuron]
...
Node 0 Normal: ... 0*2048kB 0*4096kB
neuron:mc_alloc_internal: host mem occupied ...
```

In some cases, smaller allocations appear to be rescued by the internal pool:

```text
page allocation failure: order:9, mode:0xcc0(GFP_KERNEL)
...
neuron:mc_alloc_internal: Completed host allocation of 2097152B from the internal pool
```

However, larger HOST tensor allocations, for example 2441472 bytes or 3292176 bytes, still fail.
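
The `order:` values in the kernel log are consistent with these request sizes. The following is a minimal sketch of that arithmetic, assuming 4 KiB base pages and the usual rounding of a request up to a power-of-two number of pages. `page_order_for_bytes` is a hypothetical helper written for this issue, not part of any Neuron tooling.

```bash
# Hypothetical helper: smallest page order whose block covers a request.
# Assumes 4 KiB base pages unless a page size is passed as the second argument.
page_order_for_bytes() {
    bytes=$1
    page_size=${2:-4096}
    # Round the request up to whole pages.
    pages=$(( (bytes + page_size - 1) / page_size ))
    order=0
    # Grow the order until 2^order pages cover the request.
    while [ $(( 1 << order )) -lt "$pages" ]; do
        order=$(( order + 1 ))
    done
    echo "$order"
}

page_order_for_bytes 2097152   # prints 9  (exactly 2 MiB)
page_order_for_bytes 2441472   # prints 10 (rounds up to 4 MiB)
page_order_for_bytes 3292176   # prints 10 (rounds up to 4 MiB)
```

Under these assumptions, both failing tensor sizes need a contiguous order-10 (4 MiB) block, while the rescued 2097152-byte request needs only order 9.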

### Observed behavior
- The failure correlates with host memory fragmentation / lack of high-order free pages.
- `/proc/buddyinfo` shows few or no free order-9 / order-10 blocks before the failure.
- Running the following before starting the workload makes the failure disappear in our environment:

```bash
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
echo 1 | sudo tee /proc/sys/vm/compact_memory
```

- Increasing `mempool_host_memory_size` did not resolve this failure mode for allocations larger than 2 MiB.
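
To check the high-order columns before starting a workload, something like the sketch below may help. It assumes the common `/proc/buddyinfo` layout (`Node N, zone NAME` followed by one free-block count per order, starting at order 0) and 4 KiB pages, so that order 9 is 2 MiB and order 10 is 4 MiB. `high_order_free` is a hypothetical helper name.

```bash
# Hypothetical helper: print free block counts at order 9 and order 10 per zone.
# Reads buddyinfo-formatted text from the file given as $1
# (default: /proc/buddyinfo; "-" reads standard input).
# Assumed layout: fields 1-4 are "Node", "N,", "zone", NAME; field 5 is order 0.
# If the kernel exposes fewer orders, the missing columns print as 0.
high_order_free() {
    awk '{
        node = $2
        sub(/,$/, "", node)   # strip the trailing comma from the node id
        printf "node=%s zone=%s order9=%d order10=%d\n", node, $4, $14, $15
    }' "${1:-/proc/buddyinfo}"
}
```

Running `high_order_free` with no arguments before and after the `drop_caches` / `compact_memory` commands above shows whether compaction restored any order-9 or order-10 blocks.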

### Driver-side analysis / hypothesis
From `neuron_mempool.c`, the HOST allocation path first calls:

```c
mc->va = dma_alloc_coherent(mpset->pdev, size, &addr,
        GFP_KERNEL | GFP_DMA32);
```

If that returns `NULL`, the driver falls back to its reserved host mempool:

```c
for (i = 0; i < MP_HOST_RESERVE_MEMORY_POOL_COUNT; i++) {
    u32 page_size = MP_HOST_PAGE_SIZE_MIN << i;
    if (page_size < size)
        continue;

    mp = &mpset->mp_hrm[i];
    mc->va = gen_pool_dma_alloc(mp->gen_pool, size, &mc->pa);
    ...
}
```

The reserved host mempool appears to be initialized with four size classes:

```text
256 KiB / 512 KiB / 1 MiB / 2 MiB
```

Because the fallback loop skips any pool where `page_size < size`, requests larger than 2 MiB do not appear to be serviceable by the fallback pool. This matches our observed failures for HOST tensor allocations such as:

```text
2441472 bytes
3292176 bytes
```

Our understanding is therefore:

1. Kernel coherent DMA allocation fails due to lack of high-order contiguous pages.
2. The driver fallback pool can rescue some allocations up to 2 MiB.
3. Larger HOST allocations cannot be satisfied by the current fallback pool layout, so NRT fails with `Failed to alloc HOST memory`.

### Questions
Could AWS confirm whether this understanding is correct?

Specifically:

1. Is the internal HOST fallback mempool expected to support allocations larger than 2 MiB?
2. If not, is there a recommended setting or workaround for NRT models that require HOST tensors larger than 2 MiB?
3. Would it be feasible for the driver to include a larger fallback size class, for example 4 MiB, or otherwise split large HOST tensor allocations?

### Reproduction
1. Fresh inf1.2xlarge host.
2. Run heavy host I/O, for example two consecutive `docker build --no-cache` runs. A heavy EC2 user-data script, such as one that installs a large number of packages, can also reproduce it.
3. Start four containers concurrently, each pinned to a separate NeuronCore.
4. Model loading fails with `Failed to alloc HOST memory`.

In our environment, the failure is highly reproducible after page-cache pressure and disappears after explicit cache drop + memory compaction.
