Describe the bug
We are running inf2 nodes in our cluster.
Occasionally, a service is scheduled onto an inf2 node but fails at startup with the errors included below.
We are unsure what causes this bug.
Any help / guidance on this matter would be greatly appreciated.
Model Name
Custom, built in-house.
tinyvec_neuron_dynamic_batch_torch_2_9_1_16000.pt
Describe the workload type
Inference workload.
resources:
  limits:
    aws.amazon.com/neuroncore: '1'
    cpu: '1'
    memory: 1500Mi
  requests:
    aws.amazon.com/neuroncore: '1'
    cpu: 500m
    memory: 750Mi
Instance Type
Karpenter snippet:
requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
      - inf2.xlarge
Release version
libneuronxla 2.2.16408.0+50c26cbd
torch 2.9.1+cpu
torch-neuronx 2.9.0.2.13.26312+8e870898
torch-xla 2.9.0
transformers 4.57.6
Reproduction Steps
Uncertain.
We use KEDA to scale our services based on work-queue load.
Once the queue is empty, we scale back down to zero.
Once the pod is removed from the inf2 node, we suspect that something goes wrong during teardown
and the core isn't released properly.
When more work comes in, a new model pod is scheduled and then fails repeatedly with the errors below.
Our backend team has tried manually killing pods to reproduce this, with no luck.
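
One thing that may be worth ruling out during teardown is whether the service process exits cleanly when the pod receives SIGTERM. The following is only a minimal sketch, not our actual service code. It assumes a Python worker process, and the names `shutdown_requested`, `install_sigterm_handler`, and `serve` are hypothetical. Whether a clean exit changes how the Neuron core is released is an untested assumption.

```python
import signal
import threading

# Hypothetical shutdown flag; the real service loop is not shown in this report.
shutdown_requested = threading.Event()


def _handle_sigterm(signum, frame):
    # Only record the request; the main loop decides when to stop.
    shutdown_requested.set()


def install_sigterm_handler():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, _handle_sigterm)


def serve(poll_once, idle_seconds=0.1):
    """Call poll_once() until SIGTERM is seen, then return so the
    interpreter can exit normally instead of being killed later."""
    while not shutdown_requested.is_set():
        poll_once()
        shutdown_requested.wait(idle_seconds)
```

If the service runs as PID 1 in its container, Linux does not apply the default SIGTERM action to it, so without an explicit handler the process keeps running until the kubelet sends SIGKILL at the end of the termination grace period.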
Regression Issue
Possible Solution
N/A
Logs/Context/Additional Information
2026-May-19 15:44:37.069500 1:18 ERROR NRT:nrt_infodump NRT version: 2.31.24.0 (0b044f4ce917b633a70eb3d0bc460f34ac3da620)
2026-May-19 15:44:37.071215 1:18 ERROR NRT:nrt_infodump Embedded FW version: unknown (unknown)
2026-May-19 15:44:37.072632 1:18 ERROR NRT:nrt_infodump CCOM not loaded
2026-May-19 15:44:37.073773 1:18 ERROR NRT:nrt_infodump NCFW version: 2.31.1.0 (cf13a49f86829014f0575b6d50c112ddf68b53c0)
2026-May-19 15:44:37.076914 1:18 ERROR NRT:nrt_infodump Cluster ID: N/A
2026-May-19 15:44:37.078007 1:18 ERROR NRT:nrt_infodump Kernel: Linux 6.12.83-113.160.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 8 17:56:26 UTC 2026
2026-May-19 15:44:37.081644 1:18 ERROR NRT:nrt_infodump Driver version: 2.27.4.0
2026-May-19 15:44:37.083090 1:18 ERROR NRT:nrt_infodump Failure: NRT_FAILURE in nrt_init()
2026-May-19 15:44:37.084445 1:18 ERROR NRT:nrt_infodump Visible cores: 1
2026-May-19 15:44:37.085612 1:18 ERROR NRT:nrt_infodump Environment:
2026-May-19 15:44:37.086743 1:18 ERROR NRT:nrt_infodump NEURON_LOGICAL_NC_CONFIG=1
2026-May-19 15:44:37.089503 1:18 ERROR NRT:nrt_infodump NEURON_LIBRARY_PATH=/home/botni/.local/lib/python3.11/site-packages/libneuronxla/libneuronpjrt.so
2026-May-19 15:44:37.091666 1:18 ERROR NRT:nrt_infodump NEURON_RT_ROOT_COMM_ID=localhost:62182
2026-May-19 15:44:37.093119 1:18 ERROR NRT:nrt_infodump NEURON_RT_NUMERICAL_ERRORS_VERBOSITY=none
2026-May-19 15:44:37.472620 1:18 ERROR NRT:nrt_allocate_neuron_cores Logical Neuron Core(s) not available - Requested:lnc1-lnc1 Available:0 Logical Core size:1
2026-May-19 15:44:37.997567 1:18 ERROR NRT:nrt_allocate_neuron_cores Logical Neuron Core(s) not available - Requested:lnc1-lnc1 Available:0 Logical Core size:1
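
To make this failure easier to spot in aggregated pod logs, the allocation error can be picked out with a small parser. This is a sketch only: the pattern is inferred from the two `nrt_allocate_neuron_cores` lines above, not from any documented NRT log format, and `parse_alloc_failure` is a hypothetical helper name.

```python
import re

# Pattern inferred from the log lines in this report (an assumption,
# not a documented format).
ALLOC_FAILURE = re.compile(
    r"NRT:nrt_allocate_neuron_cores\s+"
    r"Logical Neuron Core\(s\) not available - "
    r"Requested:(?P<requested>\S+)\s+"
    r"Available:(?P<available>\d+)\s+"
    r"Logical Core size:(?P<core_size>\d+)"
)


def parse_alloc_failure(line):
    """Return the requested range, available count, and core size from an
    allocation-failure line, or None if the line is something else."""
    match = ALLOC_FAILURE.search(line)
    if match is None:
        return None
    return {
        "requested": match["requested"],
        "available": int(match["available"]),
        "core_size": int(match["core_size"]),
    }
```

A pod whose logs yield `available` equal to 0 while its node advertises a free `aws.amazon.com/neuroncore` would match the symptom described in this report.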