Neuron plugin gives error during model loading that works fine with neuron upstream vllm fork #6

@ajayvohra2005

Description

Error:

(EngineCore_DP0 pid=272) 2025-12-22 19:08:35.000109:  272  [INFO]: Compilation Successfully Completed for model.MODULE_4aae3bed9043a81c0125+97c2cc02.hlo_module.pb
(EngineCore_DP0 pid=272) INFO:Neuron:Done compilation for the priority HLO in 106.28764724731445 seconds
(EngineCore_DP0 pid=272) INFO:Neuron:Updating the hlo module with optimized layout
(EngineCore_DP0 pid=280) 2025-12-22 19:08:36.000096:  280  [INFO]: Using a cached neff at /cache/neuronxcc-2.22.12471.0+b4a00d10/MODULE_4aae3bed9043a81c0125+97c2cc02/model.neff
(EngineCore_DP0 pid=280) INFO:Neuron:Done compilation for the priority HLO in 106.46791672706604 seconds
(EngineCore_DP0 pid=280) INFO:Neuron:Updating the hlo module with optimized layout
(EngineCore_DP0 pid=276) 2025-12-22 19:08:36.000776:  276  [INFO]: Using a cached neff at /cache/neuronxcc-2.22.12471.0+b4a00d10/MODULE_4aae3bed9043a81c0125+97c2cc02/model.neff
(EngineCore_DP0 pid=276) INFO:Neuron:Done compilation for the priority HLO in 106.47657084465027 seconds
(EngineCore_DP0 pid=276) INFO:Neuron:Updating the hlo module with optimized layout
(EngineCore_DP0 pid=272) INFO:Neuron:Updating the hlo module with optimized layout
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=280) Process EngineCore_DP0:
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     self._init_executor()
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 55, in _init_executor
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     self.collective_rpc("load_model")
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     return func(*args, **kwargs)
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuron_worker.py", line 86, in load_model
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     self.model_runner.load_model()
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_runner.py", line 221, in load_model
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     self.model = get_neuron_model(
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]                  ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 714, in get_neuron_model
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     model.load_weights(model_name_or_path=model_config.model,
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 394, in load_weights
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     self._compile_and_load_model(model_name_or_path, neuronx_model_cls,
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 240, in _compile_and_load_model
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     self.model.compile(compiled_path)
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed_inference/models/application_base.py", line 302, in compile
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     traced_model = self.get_builder(debug).trace(
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed/trace/model_builder.py", line 680, in trace
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     self._add_layout_optimization_to_remaining_hlo()
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed/trace/model_builder.py", line 1196, in _add_layout_optimization_to_remaining_hlo
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708]     os.remove(original_hlo_file_name)
(EngineCore_DP0 pid=280) ERROR 12-22 19:08:38 [core.py:708] FileNotFoundError: [Errno 2] No such file or directory: 'context_encoding_model_0.hlo'
(EngineCore_DP0 pid=280) Traceback (most recent call last):
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=280)     self.run()
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=280)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=280)     raise e
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=280)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=280)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=280)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=280)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=280)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=280)     self._init_executor()
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 55, in _init_executor
(EngineCore_DP0 pid=280)     self.collective_rpc("load_model")
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=280)     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=280)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=280)     return func(*args, **kwargs)
(EngineCore_DP0 pid=280)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuron_worker.py", line 86, in load_model
(EngineCore_DP0 pid=280)     self.model_runner.load_model()
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_runner.py", line 221, in load_model
(EngineCore_DP0 pid=280)     self.model = get_neuron_model(
(EngineCore_DP0 pid=280)                  ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 714, in get_neuron_model
(EngineCore_DP0 pid=280)     model.load_weights(model_name_or_path=model_config.model,
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 394, in load_weights
(EngineCore_DP0 pid=280)     self._compile_and_load_model(model_name_or_path, neuronx_model_cls,
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 240, in _compile_and_load_model
(EngineCore_DP0 pid=280)     self.model.compile(compiled_path)
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed_inference/models/application_base.py", line 302, in compile
(EngineCore_DP0 pid=280)     traced_model = self.get_builder(debug).trace(
(EngineCore_DP0 pid=280)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed/trace/model_builder.py", line 680, in trace
(EngineCore_DP0 pid=280)     self._add_layout_optimization_to_remaining_hlo()
(EngineCore_DP0 pid=280)   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed/trace/model_builder.py", line 1196, in _add_layout_optimization_to_remaining_hlo
(EngineCore_DP0 pid=280)     os.remove(original_hlo_file_name)
(EngineCore_DP0 pid=280) FileNotFoundError: [Errno 2] No such file or directory: 'context_encoding_model_0.hlo'
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=276) Process EngineCore_DP0:
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     self._init_executor()
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 55, in _init_executor
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     self.collective_rpc("load_model")
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     return func(*args, **kwargs)
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuron_worker.py", line 86, in load_model
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     self.model_runner.load_model()
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_runner.py", line 221, in load_model
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     self.model = get_neuron_model(
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]                  ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 714, in get_neuron_model
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     model.load_weights(model_name_or_path=model_config.model,
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 394, in load_weights
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     self._compile_and_load_model(model_name_or_path, neuronx_model_cls,
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 240, in _compile_and_load_model
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     self.model.compile(compiled_path)
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed_inference/models/application_base.py", line 302, in compile
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     traced_model = self.get_builder(debug).trace(
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed/trace/model_builder.py", line 680, in trace
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     self._add_layout_optimization_to_remaining_hlo()
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed/trace/model_builder.py", line 1196, in _add_layout_optimization_to_remaining_hlo
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708]     os.remove(original_hlo_file_name)
(EngineCore_DP0 pid=276) ERROR 12-22 19:08:39 [core.py:708] FileNotFoundError: [Errno 2] No such file or directory: 'context_encoding_model_0.hlo'
(EngineCore_DP0 pid=276) Traceback (most recent call last):
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=276)     self.run()
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=276)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=276)     raise e
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=276)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=276)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=276)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=276)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=276)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=276)     self._init_executor()
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 55, in _init_executor
(EngineCore_DP0 pid=276)     self.collective_rpc("load_model")
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=276)     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=276)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=276)     return func(*args, **kwargs)
(EngineCore_DP0 pid=276)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuron_worker.py", line 86, in load_model
(EngineCore_DP0 pid=276)     self.model_runner.load_model()
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_runner.py", line 221, in load_model
(EngineCore_DP0 pid=276)     self.model = get_neuron_model(
(EngineCore_DP0 pid=276)                  ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 714, in get_neuron_model
(EngineCore_DP0 pid=276)     model.load_weights(model_name_or_path=model_config.model,
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 394, in load_weights
(EngineCore_DP0 pid=276)     self._compile_and_load_model(model_name_or_path, neuronx_model_cls,
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/vllm_neuron/worker/neuronx_distributed_model_loader.py", line 240, in _compile_and_load_model
(EngineCore_DP0 pid=276)     self.model.compile(compiled_path)
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed_inference/models/application_base.py", line 302, in compile
(EngineCore_DP0 pid=276)     traced_model = self.get_builder(debug).trace(
(EngineCore_DP0 pid=276)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed/trace/model_builder.py", line 680, in trace
(EngineCore_DP0 pid=276)     self._add_layout_optimization_to_remaining_hlo()
(EngineCore_DP0 pid=276)   File "/opt/conda/lib/python3.12/site-packages/neuronx_distributed/trace/model_builder.py", line 1196, in _add_layout_optimization_to_remaining_hlo
(EngineCore_DP0 pid=276)     os.remove(original_hlo_file_name)

Dockerfile:

ARG BASE_IMAGE=public.ecr.aws/neuron/pytorch-inference-neuronx:2.9.0-neuronx-py312-sdk2.27.0-ubuntu24.04
FROM $BASE_IMAGE

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PJRT_DEVICE=NEURON \
    LD_LIBRARY_PATH="/opt/conda/lib:/opt/aws/neuron/lib:${LD_LIBRARY_PATH}" \
    PATH="/opt/program:/opt/conda/bin:/opt/aws/neuron/bin:${PATH}"

RUN apt-get update && apt-get install -y --no-install-recommends \
            ca-certificates \
            build-essential \
            git \
            libssl-dev \
            libcurl4-openssl-dev \
            libgoogle-perftools-dev \
            libnuma-dev \
            pkg-config \
            unzip \
            wget \
            nginx \
      && rm -rf /var/lib/apt/lists/*

# Upgrade core tools in the existing environment
RUN pip3 install --no-cache-dir --upgrade pip setuptools wheel virtualenv build

# vLLM Installation
RUN git clone --branch="0.2.1-lts" https://github.com/vllm-project/vllm-neuron.git /tmp/vllm-neuron && \
  cd /tmp/vllm-neuron && git checkout 94447bd0fa0012d3500362309582fb9b99ba051e && \
  pip3 install --no-cache-dir --upgrade-strategy only-if-needed  --extra-index-url=https://pip.repos.neuron.amazonaws.com . && \
  rm -rf /tmp/vllm-neuron

Machine:

inf2.48xlarge

OS

Ubuntu 24.04

Model:

deepseek-ai/DeepSeek-R1-Distill-Llama-8B

vLLM OpenAI server config:

model-impl: auto
enable-log-requests: false
tensor-parallel-size: 8
max-num-seqs: 8
dtype: auto
max-model-len: 8192
gpu-memory-utilization: 0.95
max-num-batched-tokens: 8192
block-size: 16

Number of server instances

24 NeuronCores / 8 (tensor-parallel-size) = 3

The cache directory is a local mount on the container and is shared among all three vLLM server instances.
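One possible reading of the log above, not confirmed: pid 272 compiles the priority HLO while pids 280 and 276 pick up the cached neff, and then 280 and 276 both fail in os.remove() on 'context_encoding_model_0.hlo'. That pattern would be consistent with the three instances racing to delete the same intermediate file. The sketch below only illustrates a removal that tolerates such a race. The helper name remove_if_present is hypothetical, and this is not the actual code in neuronx_distributed/trace/model_builder.py.

```python
import os
import tempfile


def remove_if_present(path: str) -> bool:
    """Delete `path`. Return False (instead of raising) if it is already gone.

    Hypothetical helper: it only shows how a delete can tolerate another
    process having removed the same file first.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Another process (or an earlier call) already removed the file.
        return False


# Simulate two instances trying to delete the same intermediate HLO file.
workdir = tempfile.mkdtemp()
hlo_path = os.path.join(workdir, "context_encoding_model_0.hlo")
open(hlo_path, "w").close()

first = remove_if_present(hlo_path)   # the file exists, so it is deleted
second = remove_if_present(hlo_path)  # the file is gone, no exception raised
print(first, second)
```

If the race is the real cause, giving each server instance its own compile working directory would be another way to avoid it. Whether the plugin exposes a setting for that is not something this report establishes.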
