
[Bug]: GLM-4.6-AWQ model outputs garbled text on vllm/vllm-openai:v0.10.2-x86_64 #30165

@zzzyoyo

Description


Your current environment

GPU: H800

Docker image: vllm/vllm-openai:v0.10.2-x86_64 (ships vllm=0.10.2, transformers=4.56.1, torch=2.8.0+cu128), plus autoawq=0.2.9 installed manually

🐛 Describe the bug

Hello vLLM developers,

I am using your vllm/vllm-openai:v0.10.2-x86_64 Docker image, deployed on a Linux server with 6 H800 GPUs. The model I am trying to serve is GLM-4.6-AWQ; its config.json contains the following:

"quantization_config": {
    "quant_method": "awq_marlin",
    "bits": 4,
    "group_size": 128,
    "version": "gemm",
    "zero_point": true,
    "modules_to_not_convert": ["embed_tokens", "shared_experts", "shared_head", "lm_head"]
}
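
For reference, a minimal check like this (a sketch, assuming the model directory is /data as in the serve command below) prints the quantization settings exactly as stored on disk:

import json

# Read the model's config.json and show the quantization block vLLM will see.
with open("/data/config.json") as f:
    cfg = json.load(f)

print(cfg["quantization_config"]["quant_method"])            # awq_marlin
print(cfg["quantization_config"]["modules_to_not_convert"])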


After entering the Docker container, I ran the following command:

vllm serve \
    /data \
    --served-model-name glm46 \
    --enable-auto-tool-choice \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

The server starts successfully:

[screenshot: server startup log]

(Apologies for the photo format, as our computers are offline.)

However, the output text is garbled:

[screenshot: garbled response text]
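
A request along the following lines (a sketch only; the prompt is illustrative, not the exact one in the screenshot) is what returns the garbled completions from the server started above:

import requests

# Query the OpenAI-compatible endpoint of the server above
# (served model name "glm46", listening on port 8000).
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "glm46",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])  # comes back garbled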

I also tried loading the model offline in Python:

from vllm import LLM
model = LLM('/data', tensor_parallel_size=4)

but the output is still garbled:

[screenshot: garbled output]
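
For completeness, a fuller sketch of that offline attempt (the prompt and sampling settings here are illustrative, not the exact ones used):

from vllm import LLM, SamplingParams

# Load the quantized checkpoint and generate a short completion.
model = LLM('/data', tensor_parallel_size=4)
outputs = model.generate(
    ["Hello, who are you?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)  # this text is garbled as well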

Next, I tried loading the model without vLLM, using Transformers and AutoAWQ:

from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    model_path,            # same /data directory as above
    fuse_layers=True,
    trust_remote_code=True,
    safetensors=True,
    device_map="auto",
)

but it fails with: glm4_moe awq quantization isn't supported yet.

I also tried using AutoModelForCausalLM.from_pretrained, which outputs:

[screenshot: model output]
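
Roughly how that Transformers-only attempt looked (a sketch; the prompt and generation arguments are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Plain Transformers load of the same /data checkpoint.
tokenizer = AutoTokenizer.from_pretrained("/data", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "/data",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))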

For reference, my environment versions are:

transformers=4.56.1

vllm=0.10.2

torch=2.8.0+cu128

These are all from the Docker image, with only autoawq=0.2.9 installed manually.

By the way, I have also tried passing --chat-template, adding an explicit --quantization argument, loading in bfloat16, etc., but nothing works. I have confirmed that the model files are not corrupted.

Could you please advise on how to correctly serve this AWQ model using vLLM or Transformers?

Thank you very much!

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
