Description
Hello,
I’m trying to run W4A4 inference in vLLM on a Qwen3 model quantized with MR-GPTQ from FP-Quant. However, loading the checkpoint I created in vLLM fails with the error shown below. The same error occurs when I download and use the official MR-GPTQ checkpoint for Llama-8B. Could you please help me with this?
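For reference, here is a minimal sketch of a script that should reproduce the load. It assumes the offline `LLM` entry point, since the exact launch command is not shown in this report. The argument values are taken from the engine config printed in the log, and `build_engine_kwargs` and `run_repro` are hypothetical helper names.

```python
"""Minimal reproduction sketch; helper names are hypothetical."""


def build_engine_kwargs(model_path: str, enforce_eager: bool = False) -> dict:
    """Collect the engine arguments that appear in the logged engine config."""
    return {
        "model": model_path,
        "quantization": "fp_quant",   # quantization=fp_quant in the log
        "dtype": "bfloat16",          # dtype=torch.bfloat16 in the log
        "max_model_len": 32768,       # "Using max model len 32768"
        "trust_remote_code": True,
        "enforce_eager": enforce_eager,
    }


def run_repro(model_path: str, enforce_eager: bool = False) -> str:
    """Load the checkpoint and generate a few tokens (assumed entry point)."""
    # Imported lazily so the helper above can be inspected without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs(model_path, enforce_eager))
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    return outputs[0].outputs[0].text
```

Calling `run_repro(...)` with the checkpoint directory should reach the same profile run that fails in the log. Passing `enforce_eager=True` skips torch.compile and CUDA graph capture, which may help tell whether the compiled path is involved, although the failing call in the traceback is inside the `fusedQuantizeNv` kernel itself.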
[Environment settings]
NVIDIA DGX-Spark
Operating System: Linux-6.11.0-1016-nvidia-aarch64-with-glibc2.39
Python Version: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
llm-compressor Version: None
compressed-tensors Version: 0.12.2
transformers Version: 4.57.1
torch Version: 2.9.0a0+50eac811a6.nv25.9
CUDA Devices: ['NVIDIA GB10']
AMD Devices: None
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
vllm version: 0.11.1rc7.dev147+gda14ae0fa.d20251114.cu130
[Error output]
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
INFO 11-20 05:02:49 [model.py:631] Resolved architecture: Qwen3ForCausalLM
INFO 11-20 05:02:49 [model.py:1737] Using max model len 32768
INFO 11-20 05:02:50 [scheduler.py:260] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 11-20 05:02:50 [system_utils.py:103] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
(EngineCore_DP0 pid=327518) INFO 11-20 05:02:54 [core.py:94] Initializing a V1 LLM engine (v0.11.1rc7.dev147+gda14ae0fa.d20251114) with config: model='/mnt/cephfs/name1/Qwen3-1.7B-W4A4-nvfp4-ID-realquant-s1.1', speculative_config=None, tokenizer='/mnt/cephfs/name1/Qwen3-1.7B-W4A4-nvfp4-ID-realquant-s1.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp_quant, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/cephfs/name1/Qwen3-1.7B-W4A4-nvfp4-ID-realquant-s1.1, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': 
{}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=327518) INFO 11-20 05:02:54 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=327518) INFO 11-20 05:02:54 [gpu_model_runner.py:3047] Starting to load model /mnt/cephfs/name1/Qwen3-1.7B-W4A4-nvfp4-ID-realquant-s1.1...
(EngineCore_DP0 pid=327518) INFO 11-20 05:02:55 [cuda.py:418] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=327518) INFO 11-20 05:02:55 [cuda.py:427] Using FLASH_ATTN backend.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:08<00:00, 8.13s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:08<00:00, 8.13s/it]
(EngineCore_DP0 pid=327518)
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:03 [default_loader.py:314] Loading weights took 8.23 seconds
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:03 [gpu_model_runner.py:3126] Model loading took 1.3486 GiB memory and 8.534439 seconds
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:07 [backends.py:631] Using cache directory: /root/.cache/vllm/torch_compile_cache/6b279ff56b/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:07 [backends.py:647] Dynamo bytecode transform time: 3.07 s
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:07 [backends.py:251] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:11 [backends.py:282] Compiling a graph for dynamic shape takes 3.90 s
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] EngineCore failed to start.
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] Traceback (most recent call last):
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 619, in __init__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] super().__init__(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 228, in _initialize_kv_caches
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/utils/contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_worker.py", line 318, in determine_available_memory
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] self.model_runner.profile_run()
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_model_runner.py", line 3923, in profile_run
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/utils/contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_model_runner.py", line 3643, in _dummy_run
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] outputs = self.model(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/models/qwen3.py", line 319, in forward
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] hidden_states = self.model(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/decorators.py", line 470, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] output = self.compiled_callable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/models/qwen2.py", line 361, in forward
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] def forward(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/caching.py", line 53, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] raise e
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "<eval_with_key>.58", line 682, in forward
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] submod_0 = self.submod_0(l_input_ids, s72, l_self_modules_embed_tokens_parameters_weight, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_forward_hadamard_matrix, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_act_global_scale, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_global_scale, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_qweight, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_scales, l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight, l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight, l_positions, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache); l_input_ids = l_self_modules_embed_tokens_parameters_weight = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_forward_hadamard_matrix = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_act_global_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_global_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_qweight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_scales_ = l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_ = None
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/piecewise_backend.py", line 93, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 62, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._compiled_fn(*args)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1124, in forward
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return compiled_fn(full_args)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] out = normalize_as_list(f(args))
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return compiled_fn(runtime_args)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 585, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.current_callable(inputs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2889, in run
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] out = model(new_inputs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/tmp/torchinductor_root/uq/cuqgwgvzmzkwo4or6j6eh62z5oiz2io24dqqhorj3r5rm6gapuw4.py", line 1057, in call
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] buf3 = torch.ops.vllm.fused_quantize_nv.default(buf2, arg4_1, arg5_1)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 840, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._op(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/layers/quantization/fp_quant.py", line 317, in fused_quantize_nv
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return fusedQuantizeNv(x_flat, hadamard_matrix, global_scale)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/_custom_ops.py", line 2840, in fusedQuantizeNv
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return torch.ops._qutlass_C.fusedQuantizeNv(a, b, xh_e2m1, xh_e4m3, global_scale)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1254, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._op(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] RuntimeError: Error Internal
(EngineCore_DP0 pid=327518) Process EngineCore_DP0:
(EngineCore_DP0 pid=327518) Traceback (most recent call last):
(EngineCore_DP0 pid=327518) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=327518) self.run()
(EngineCore_DP0 pid=327518) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=327518) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 859, in run_engine_core
(EngineCore_DP0 pid=327518) raise e
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=327518) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 619, in __init__
(EngineCore_DP0 pid=327518) super().__init__(
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=327518) num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 228, in _initialize_kv_caches
(EngineCore_DP0 pid=327518) available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=327518) return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=327518) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=327518) return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/utils/contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=327518) return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_worker.py", line 318, in determine_available_memory
(EngineCore_DP0 pid=327518) self.model_runner.profile_run()
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_model_runner.py", line 3923, in profile_run
(EngineCore_DP0 pid=327518) hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/utils/contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=327518) return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_model_runner.py", line 3643, in _dummy_run
(EngineCore_DP0 pid=327518) outputs = self.model(
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=327518) return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=327518) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=327518) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/models/qwen3.py", line 319, in forward
(EngineCore_DP0 pid=327518) hidden_states = self.model(
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/decorators.py", line 470, in __call__
(EngineCore_DP0 pid=327518) output = self.compiled_callable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
(EngineCore_DP0 pid=327518) return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/models/qwen2.py", line 361, in forward
(EngineCore_DP0 pid=327518) def forward(
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_DP0 pid=327518) return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/caching.py", line 53, in __call__
(EngineCore_DP0 pid=327518) return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=327518) return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=327518) raise e
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=327518) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=327518) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=327518) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "<eval_with_key>.58", line 682, in forward
(EngineCore_DP0 pid=327518) submod_0 = self.submod_0(l_input_ids, s72, l_self_modules_embed_tokens_parameters_weight, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_forward_hadamard_matrix, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_act_global_scale, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_global_scale, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_qweight, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_scales, l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight, l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight, l_positions, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache); l_input_ids = l_self_modules_embed_tokens_parameters_weight = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_forward_hadamard_matrix = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_act_global_scale = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_global_scale = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_qweight = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_scales = l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight = l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight = None
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=327518) return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/piecewise_backend.py", line 93, in __call__
(EngineCore_DP0 pid=327518) return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 62, in __call__
(EngineCore_DP0 pid=327518) return self._compiled_fn(*args)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_DP0 pid=327518) return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1124, in forward
(EngineCore_DP0 pid=327518) return compiled_fn(full_args)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
(EngineCore_DP0 pid=327518) all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=327518) out = normalize_as_list(f(args))
(EngineCore_DP0 pid=327518) ^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_DP0 pid=327518) return compiled_fn(runtime_args)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 585, in __call__
(EngineCore_DP0 pid=327518) return self.current_callable(inputs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2889, in run
(EngineCore_DP0 pid=327518) out = model(new_inputs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/tmp/torchinductor_root/uq/cuqgwgvzmzkwo4or6j6eh62z5oiz2io24dqqhorj3r5rm6gapuw4.py", line 1057, in call
(EngineCore_DP0 pid=327518) buf3 = torch.ops.vllm.fused_quantize_nv.default(buf2, arg4_1, arg5_1)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 840, in __call__
(EngineCore_DP0 pid=327518) return self._op(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/layers/quantization/fp_quant.py", line 317, in fused_quantize_nv
(EngineCore_DP0 pid=327518) return fusedQuantizeNv(x_flat, hadamard_matrix, global_scale)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/_custom_ops.py", line 2840, in fusedQuantizeNv
(EngineCore_DP0 pid=327518) return torch.ops._qutlass_C.fusedQuantizeNv(a, b, xh_e2m1, xh_e4m3, global_scale)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1254, in __call__
(EngineCore_DP0 pid=327518) return self._op(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) RuntimeError: Error Internal
[rank0]:[W1120 05:03:12.413768333 ProcessGroupNCCL.cpp:1534] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/mnt/cephfs/name1/evaluation.py", line 331, in <module>
evaluate_accelerate(args.model, eval_dataset, args, n_answers_per_question=args.N)
File "/mnt/cephfs/name1/evaluation.py", line 110, in evaluate_accelerate
llm = prepare_vllm_model(
^^^^^^^^^^^^^^^^^^^
File "/mnt/cephfs/name1/evaluation.py", line 22, in prepare_vllm_model
llm = LLM(
^^^^
File "/mnt/cephfs/name2/codes/vllm/vllm/entrypoints/llm.py", line 344, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/llm_engine.py", line 175, in from_engine_args
return cls(
^^^^
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/llm_engine.py", line 109, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core_client.py", line 93, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core_client.py", line 640, in __init__
super().__init__(
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core_client.py", line 469, in __init__
with launch_core_engines(vllm_config, executor_class, log_stats) as (
File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/utils.py", line 898, in launch_core_engines
wait_for_engine_startup(
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/utils.py", line 955, in wait_for_engine_startup
raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}