Realquant checkpoint vllm compatibility bug #15

@songsm921

Hello,
I'm trying to run W4A4 inference with vLLM after quantizing Qwen3 with MR-GPTQ from FP-Quant. However, when I load the checkpoint I created into vLLM, I get the error shown below. The same error occurs when I download and use the official MR-GPTQ checkpoint for Llama-8B. Could you please help me with this?
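
For reference, a minimal sketch of how such a load could be reproduced from Python. The argument values mirror the engine config line in the error output (quantization=fp_quant, dtype=torch.bfloat16, trust_remote_code=True, max_seq_len=32768); the helper names `build_engine_kwargs` and `load_and_generate` and the prompt are illustrative assumptions, not part of FP-Quant or vLLM, and the exact launch command used for this report is not shown here.

```python
from typing import Any


def build_engine_kwargs(model_path: str, eager: bool = False) -> dict[str, Any]:
    """Collect engine arguments matching the config line in the log (illustrative helper)."""
    return {
        "model": model_path,
        "quantization": "fp_quant",  # log: quantization=fp_quant
        "dtype": "bfloat16",         # log: dtype=torch.bfloat16
        "trust_remote_code": True,   # log: trust_remote_code=True
        "max_model_len": 32768,      # log: max_seq_len=32768
        "enforce_eager": eager,      # log: enforce_eager=False
    }


def load_and_generate(model_path: str, eager: bool = False) -> str:
    """Load the checkpoint with vLLM and generate a few tokens."""
    # Imported lazily so build_engine_kwargs works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs(model_path, eager=eager))
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
    return outputs[0].outputs[0].text
```

Calling `load_and_generate(path, eager=True)` disables torch.compile and CUDA graph capture, which may help check whether the failure still reaches `fusedQuantizeNv` without the compilation layers in the traceback.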

[Environment settings]
NVIDIA DGX-Spark
Operating System: Linux-6.11.0-1016-nvidia-aarch64-with-glibc2.39
Python Version: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
llm-compressor Version: None
compressed-tensors Version: 0.12.2
transformers Version: 4.57.1
torch Version: 2.9.0a0+50eac811a6.nv25.9
CUDA Devices: ['NVIDIA GB10']
AMD Devices: None
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
vllm version: 0.11.1rc7.dev147+gda14ae0fa.d20251114.cu130

[Error output]
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
INFO 11-20 05:02:49 [model.py:631] Resolved architecture: Qwen3ForCausalLM
INFO 11-20 05:02:49 [model.py:1737] Using max model len 32768
INFO 11-20 05:02:50 [scheduler.py:260] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 11-20 05:02:50 [system_utils.py:103] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
(EngineCore_DP0 pid=327518) INFO 11-20 05:02:54 [core.py:94] Initializing a V1 LLM engine (v0.11.1rc7.dev147+gda14ae0fa.d20251114) with config: model='/mnt/cephfs/name1/Qwen3-1.7B-W4A4-nvfp4-ID-realquant-s1.1', speculative_config=None, tokenizer='/mnt/cephfs/name1/Qwen3-1.7B-W4A4-nvfp4-ID-realquant-s1.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp_quant, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/cephfs/name1/Qwen3-1.7B-W4A4-nvfp4-ID-realquant-s1.1, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': 
{}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=327518) INFO 11-20 05:02:54 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=327518) INFO 11-20 05:02:54 [gpu_model_runner.py:3047] Starting to load model /mnt/cephfs/name1/Qwen3-1.7B-W4A4-nvfp4-ID-realquant-s1.1...
(EngineCore_DP0 pid=327518) INFO 11-20 05:02:55 [cuda.py:418] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=327518) INFO 11-20 05:02:55 [cuda.py:427] Using FLASH_ATTN backend.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:08<00:00, 8.13s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:08<00:00, 8.13s/it]
(EngineCore_DP0 pid=327518)
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:03 [default_loader.py:314] Loading weights took 8.23 seconds
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:03 [gpu_model_runner.py:3126] Model loading took 1.3486 GiB memory and 8.534439 seconds
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:07 [backends.py:631] Using cache directory: /root/.cache/vllm/torch_compile_cache/6b279ff56b/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:07 [backends.py:647] Dynamo bytecode transform time: 3.07 s
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:07 [backends.py:251] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=327518) INFO 11-20 05:03:11 [backends.py:282] Compiling a graph for dynamic shape takes 3.90 s
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] EngineCore failed to start.
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] Traceback (most recent call last):
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 619, in __init__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] super().__init__(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 228, in _initialize_kv_caches
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/utils/contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_worker.py", line 318, in determine_available_memory
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] self.model_runner.profile_run()
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_model_runner.py", line 3923, in profile_run
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/utils/contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_model_runner.py", line 3643, in _dummy_run
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] outputs = self.model(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/models/qwen3.py", line 319, in forward
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] hidden_states = self.model(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/decorators.py", line 470, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] output = self.compiled_callable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/models/qwen2.py", line 361, in forward
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] def forward(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/caching.py", line 53, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] raise e
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "<eval_with_key>.58", line 682, in forward
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_forward_hadamard_matrix_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_act_global_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_global_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_qweight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_scales_, l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_forward_hadamard_matrix_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_act_global_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_global_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_qweight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_scales_ = l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_ = None
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/piecewise_backend.py", line 93, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 62, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._compiled_fn(*args)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1124, in forward
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return compiled_fn(full_args)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] out = normalize_as_list(f(args))
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return compiled_fn(runtime_args)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 585, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self.current_callable(inputs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2889, in run
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] out = model(new_inputs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/tmp/torchinductor_root/uq/cuqgwgvzmzkwo4or6j6eh62z5oiz2io24dqqhorj3r5rm6gapuw4.py", line 1057, in call
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] buf3 = torch.ops.vllm.fused_quantize_nv.default(buf2, arg4_1, arg5_1)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 840, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._op(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/layers/quantization/fp_quant.py", line 317, in fused_quantize_nv
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return fusedQuantizeNv(x_flat, hadamard_matrix, global_scale)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/mnt/cephfs/name2/codes/vllm/vllm/_custom_ops.py", line 2840, in fusedQuantizeNv
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return torch.ops._qutlass_C.fusedQuantizeNv(a, b, xh_e2m1, xh_e4m3, global_scale)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1254, in __call__
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] return self._op(*args, **kwargs)
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) ERROR 11-20 05:03:11 [core.py:855] RuntimeError: Error Internal
(EngineCore_DP0 pid=327518) Process EngineCore_DP0:
(EngineCore_DP0 pid=327518) Traceback (most recent call last):
(EngineCore_DP0 pid=327518) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=327518) self.run()
(EngineCore_DP0 pid=327518) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=327518) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 859, in run_engine_core
(EngineCore_DP0 pid=327518) raise e
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=327518) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 619, in __init__
(EngineCore_DP0 pid=327518) super().__init__(
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=327518) num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core.py", line 228, in _initialize_kv_caches
(EngineCore_DP0 pid=327518) available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=327518) return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=327518) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=327518) return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/utils/contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=327518) return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_worker.py", line 318, in determine_available_memory
(EngineCore_DP0 pid=327518) self.model_runner.profile_run()
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_model_runner.py", line 3923, in profile_run
(EngineCore_DP0 pid=327518) hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/utils/contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=327518) return func(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/v1/worker/gpu_model_runner.py", line 3643, in _dummy_run
(EngineCore_DP0 pid=327518) outputs = self.model(
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=327518) return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=327518) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=327518) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/models/qwen3.py", line 319, in forward
(EngineCore_DP0 pid=327518) hidden_states = self.model(
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/decorators.py", line 470, in __call__
(EngineCore_DP0 pid=327518) output = self.compiled_callable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 804, in compile_wrapper
(EngineCore_DP0 pid=327518) return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/models/qwen2.py", line 361, in forward
(EngineCore_DP0 pid=327518) def forward(
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_DP0 pid=327518) return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/caching.py", line 53, in __call__
(EngineCore_DP0 pid=327518) return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=327518) return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=327518) raise e
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=327518) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=327518) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=327518) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "<eval_with_key>.58", line 682, in forward
(EngineCore_DP0 pid=327518) submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_forward_hadamard_matrix_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_act_global_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_global_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_qweight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_scales_, l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_forward_hadamard_matrix_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_act_global_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_global_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_qweight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_scales_ = l_self_modules_layers_modules_0_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_k_norm_parameters_weight_ = None
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/cuda_graph.py", line 126, in __call__
(EngineCore_DP0 pid=327518) return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/compilation/piecewise_backend.py", line 93, in __call__
(EngineCore_DP0 pid=327518) return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 62, in __call__
(EngineCore_DP0 pid=327518) return self._compiled_fn(*args)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
(EngineCore_DP0 pid=327518) return fn(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1124, in forward
(EngineCore_DP0 pid=327518) return compiled_fn(full_args)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 353, in runtime_wrapper
(EngineCore_DP0 pid=327518) all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 129, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=327518) out = normalize_as_list(f(args))
(EngineCore_DP0 pid=327518) ^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_DP0 pid=327518) return compiled_fn(runtime_args)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 585, in __call__
(EngineCore_DP0 pid=327518) return self.current_callable(inputs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2889, in run
(EngineCore_DP0 pid=327518) out = model(new_inputs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/tmp/torchinductor_root/uq/cuqgwgvzmzkwo4or6j6eh62z5oiz2io24dqqhorj3r5rm6gapuw4.py", line 1057, in call
(EngineCore_DP0 pid=327518) buf3 = torch.ops.vllm.fused_quantize_nv.default(buf2, arg4_1, arg5_1)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 840, in __call__
(EngineCore_DP0 pid=327518) return self._op(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/model_executor/layers/quantization/fp_quant.py", line 317, in fused_quantize_nv
(EngineCore_DP0 pid=327518) return fusedQuantizeNv(x_flat, hadamard_matrix, global_scale)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/mnt/cephfs/name2/codes/vllm/vllm/_custom_ops.py", line 2840, in fusedQuantizeNv
(EngineCore_DP0 pid=327518) return torch.ops._qutlass_C.fusedQuantizeNv(a, b, xh_e2m1, xh_e4m3, global_scale)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1254, in __call__
(EngineCore_DP0 pid=327518) return self._op(*args, **kwargs)
(EngineCore_DP0 pid=327518) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=327518) RuntimeError: Error Internal
[rank0]:[W1120 05:03:12.413768333 ProcessGroupNCCL.cpp:1534] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/mnt/cephfs/name1/evaluation.py", line 331, in <module>
evaluate_accelerate(args.model, eval_dataset, args, n_answers_per_question=args.N)
File "/mnt/cephfs/name1/evaluation.py", line 110, in evaluate_accelerate
llm = prepare_vllm_model(
^^^^^^^^^^^^^^^^^^^
File "/mnt/cephfs/name1/evaluation.py", line 22, in prepare_vllm_model
llm = LLM(
^^^^
File "/mnt/cephfs/name2/codes/vllm/vllm/entrypoints/llm.py", line 344, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/llm_engine.py", line 175, in from_engine_args
return cls(
^^^^
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/llm_engine.py", line 109, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core_client.py", line 93, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core_client.py", line 640, in __init__
super().__init__(
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/core_client.py", line 469, in __init__
with launch_core_engines(vllm_config, executor_class, log_stats) as (
File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/utils.py", line 898, in launch_core_engines
wait_for_engine_startup(
File "/mnt/cephfs/name2/codes/vllm/vllm/v1/engine/utils.py", line 955, in wait_for_engine_startup
raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
