Inf2 ERROR NRT:nrt_allocate_neuron_cores #1338

@pcayen

Describe the bug

We are running inf2 nodes in our cluster.
Occasionally, a service is scheduled on an inf2 node but fails with the errors shown below.
We are unsure what causes this bug.
Any help / guidance on this matter would be greatly appreciated.

Model Name

Custom, built in-house.
tinyvec_neuron_dynamic_batch_torch_2_9_1_16000.pt

Describe the workload type

Inference workload.

resources:
  limits:
    aws.amazon.com/neuroncore: '1'
    cpu: '1'
    memory: 1500Mi
  requests:
    aws.amazon.com/neuroncore: '1'
    cpu: 500m
    memory: 750Mi

Instance Type

Karpenter snippet:

requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
      - inf2.xlarge

Release version

libneuronxla 2.2.16408.0+50c26cbd
torch 2.9.1+cpu
torch-neuronx 2.9.0.2.13.26312+8e870898
torch-xla 2.9.0
transformers 4.57.6

Reproduction Steps

Uncertain; we have no reliable reproduction yet.

We use KEDA to scale our services based on work-queue load.
Once the queue is empty, we scale back down to zero.
We suspect that something goes wrong during teardown when the pod is removed from the inf2 node,
and that the NeuronCore isn't released properly.
When more work comes in, a new pod is scheduled and then fails repeatedly with the errors below.

Our backend team has tried to reproduce this by manually killing pods, with no luck.

Regression Issue

  • Select this option if this issue appears to be a regression.

Possible Solution

N/A

Logs/Context/Additional Information

2026-May-19 15:44:37.069500 1:18 ERROR NRT:nrt_infodump NRT version: 2.31.24.0 (0b044f4ce917b633a70eb3d0bc460f34ac3da620)
2026-May-19 15:44:37.071215 1:18 ERROR NRT:nrt_infodump Embedded FW version: unknown (unknown)
2026-May-19 15:44:37.072632 1:18 ERROR NRT:nrt_infodump CCOM not loaded
2026-May-19 15:44:37.073773 1:18 ERROR NRT:nrt_infodump NCFW version: 2.31.1.0 (cf13a49f86829014f0575b6d50c112ddf68b53c0)
2026-May-19 15:44:37.076914 1:18 ERROR NRT:nrt_infodump Cluster ID: N/A
2026-May-19 15:44:37.078007 1:18 ERROR NRT:nrt_infodump Kernel: Linux 6.12.83-113.160.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 8 17:56:26 UTC 2026
2026-May-19 15:44:37.081644 1:18 ERROR NRT:nrt_infodump Driver version: 2.27.4.0
2026-May-19 15:44:37.083090 1:18 ERROR NRT:nrt_infodump Failure: NRT_FAILURE in nrt_init()
2026-May-19 15:44:37.084445 1:18 ERROR NRT:nrt_infodump Visible cores: 1
2026-May-19 15:44:37.085612 1:18 ERROR NRT:nrt_infodump Environment:
2026-May-19 15:44:37.086743 1:18 ERROR NRT:nrt_infodump NEURON_LOGICAL_NC_CONFIG=1
2026-May-19 15:44:37.089503 1:18 ERROR NRT:nrt_infodump NEURON_LIBRARY_PATH=/home/botni/.local/lib/python3.11/site-packages/libneuronxla/libneuronpjrt.so
2026-May-19 15:44:37.091666 1:18 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=localhost:62182
2026-May-19 15:44:37.093119 1:18 ERROR NRT:nrt_infodump NEURON_RT_NUMERICAL_ERRORS_VERBOSITY=none
2026-May-19 15:44:37.472620 1:18 ERROR NRT:nrt_allocate_neuron_cores Logical Neuron Core(s) not available - Requested:lnc1-lnc1 Available:0 Logical Core size:1
2026-May-19 15:44:37.997567 1:18 ERROR NRT:nrt_allocate_neuron_cores Logical Neuron Core(s) not available - Requested:lnc1-lnc1 Available:0 Logical Core size:1
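
To help correlate these failures with pod scheduling events, the allocation error lines can be extracted from container logs. The sketch below is a small parser written against the exact message format shown above; the function name `parse_alloc_failure` is hypothetical, and the regular expression assumes the runtime keeps this wording across versions.

```python
import re
from typing import Optional

# Matches the nrt_allocate_neuron_cores failure message as it appears in the logs above.
ALLOC_RE = re.compile(
    r"NRT:nrt_allocate_neuron_cores\s+"
    r"Logical Neuron Core\(s\) not available - "
    r"Requested:(?P<requested>\S+)\s+"
    r"Available:(?P<available>\d+)\s+"
    r"Logical Core size:(?P<size>\d+)"
)


def parse_alloc_failure(line: str) -> Optional[dict]:
    """Return the requested range, available count, and core size, or None if no match."""
    match = ALLOC_RE.search(line)
    if match is None:
        return None
    return {
        "requested": match["requested"],
        "available": int(match["available"]),
        "core_size": int(match["size"]),
    }
```

Applied to the lines above, this reports a requested range of `lnc1-lnc1` with zero cores available, even though the infodump lists one visible core.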
