
Conversation

lizexu123 (Collaborator) commented Dec 24, 2025

Motivation

Support loading w4afp8 with load_choices="default_v1", and fix the accuracy issue with tp>1 when load_choices="default".

Service launch script:

# online_inference.sh
for name in `env | grep -E 'PADDLE|ENDPOINT' | awk -F'=' '{print $1}'`; do
  unset ${name}
done

rm -rf log_eb
export FD_LOG_DIR=log_eb

model_path="ernie-4_5-21b-a3b-bf16-paddle"
# A torch model such as ERNIE-4.5-21B-A3B-PT can also be used

export devices=0,1
export CUDA_VISIBLE_DEVICES=${devices}


export FD_SAMPLING_CLASS=rejection
export INFERENCE_MSG_QUEUE_ID=8908

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --port 8912 \
    --quantization w4afp8 \
    --tensor-parallel-size 2
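
The script above exercises the default loader; to take the default_v1 path from the Motivation, set the loader option your FastDeploy build exposes for load_config.load_choices. Once the server is up, a quick smoke test can be run against it. This is a minimal sketch, not part of this PR: it assumes the standard OpenAI-compatible /v1/chat/completions route and the port 8912 from the script, and the model name is illustrative.

# smoke_test.py (illustrative sketch; stdlib only)
import json
import urllib.request

payload = {
    "model": "ernie-4_5-21b-a3b-bf16-paddle",  # placeholder; match your deployment
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}
req = urllib.request.Request(
    "http://127.0.0.1:8912/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Print the first choice's message content from the JSON response
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])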

💡 If this PR is a cherry-pick, the PR title must follow the required format: add the [Cherry-Pick] tag at the very beginning and append the original PR ID at the end, e.g. [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain why in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot bot commented Dec 24, 2025

Thanks for your contribution!

@lizexu123 lizexu123 changed the title from support to [Feature] support w4afp8 v1_loader and v0_loader(tp>1) Dec 25, 2025
@@ -1,22 +1,9 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
Collaborator

Why was this deleted?

Comment on lines 84 to 112
"--guided-decoding-backend",
"auto",
]

# Start subprocess in new process group
# 清除log目录
if os.path.exists("log"):
shutil.rmtree("log")
with open(log_path, "w") as logfile:
process = subprocess.Popen(
cmd,
stdout=logfile,
stderr=subprocess.STDOUT,
start_new_session=True, # Enables killing full group via os.killpg
start_new_session=True,
)

# Wait up to 300 seconds for API server to be ready
for _ in range(300):
if is_port_open("127.0.0.1", FD_API_PORT):
print(f"API server is up on port {FD_API_PORT}")
break
time.sleep(1)
else:
print("[TIMEOUT] API server failed to start in 5 minutes. Cleaning up...")
try:
os.killpg(process.pid, signal.SIGTERM)
except Exception as e:
print(f"Failed to kill process group: {e}")
raise RuntimeError(f"API server did not start on port {FD_API_PORT}")
Collaborator

Don't delete the explanatory comments here.

Comment on lines +93 to +96
[3072, 2560, 64, 0, 128],
[2560, 1536, 64, 0, 128],
[1536, 2560, 64, 0, 128],
[2560, 768, 64, 0, 128],
Collaborator

What is the reason for extending this list?

codecov-commenter commented Dec 26, 2025

Codecov Report

❌ Patch coverage is 5.98291% with 110 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8ee055a). Learn more about missing BASE report.

Files with missing lines                                   Patch %   Lines
...l_executor/layers/moe/fused_moe_cutlass_backend.py      4.54%     104 Missing and 1 partial ⚠️
fastdeploy/model_executor/models/ernie4_5_moe.py           0.00%     1 Missing and 2 partials ⚠️
...loy/model_executor/layers/quantization/__init__.py      0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5757   +/-   ##
==========================================
  Coverage           ?   66.68%           
==========================================
  Files              ?      346           
  Lines              ?    44322           
  Branches           ?     6813           
==========================================
  Hits               ?    29554           
  Misses             ?    12584           
  Partials           ?     2184           
Flag   Coverage Δ
GPU    66.68% <5.98%> (?)

Flags with carried forward coverage won't be shown.


yangjianfengo1 (Contributor) commented:

LGTM

quant_weight_list.append(quant_weight)
scale_list.append(weight_scale)

if hasattr(getattr(layer, weight_name), "tensor_track"):
Collaborator

Remove this if; free_tensor already contains this check.
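
For context, the reviewer's point is that the guard is redundant when the helper already checks the attribute itself. An illustrative sketch, not the actual FastDeploy implementation:

# Sketch only: if free_tensor guards tensor_track internally like this,
# callers do not need their own hasattr check before invoking it.
def free_tensor(tensor):
    if hasattr(tensor, "tensor_track"):
        tensor.tensor_track = None
    # ... release the underlying storage here ...
    del tensor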

if not up_gate_ready and not down_ready:
    return

if not self.quant_config.is_quantized:
Collaborator

Change this to checkpoint_bf16.

shape=self.ffn1_weight_shape,
dtype=self.weight_dtype,

if not self.quant_config.is_quantized and layer.fd_config.load_config.load_choices == "default_v1":
Collaborator

Change this to is_checkpoint_bf16.

bukejiyu previously approved these changes Dec 29, 2025
EmmonsCurse (Collaborator) commented:

@lizexu123 The current unit test coverage is already low, and the tests/e2e/test_ernie_4_5_w4afp8.py test takes nearly 6 minutes to run. Given that it noticeably lengthens the run_tests_with_coverage task while contributing little to coverage, I don't recommend including this test case in the run_tests_with_coverage task.

lizexu123 (Collaborator, Author) replied:

ernie4_5_moe.py was written directly from the internal (eb5) ernie4_5moe.py, so some parts are naturally not covered. This keeps the open-source code identical to the internal version, and the unit test is still necessary.

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds support for W4AFP8 quantization with the v1_loader ("default_v1") and fixes accuracy issues when using tensor parallelism (tp>1) with the default loader ("default").

Key Changes:

  • Enabled W4AFP8 quantization for v1_loader by removing it from the unsupported quantization list
  • Fixed hadamard_block_size calculation for tp>1 scenarios by dividing by tp_size (see the sketch after this list)
  • Added online quantization support for v1_loader in W4AFP8 MoE backend
  • Added new weight key mappings for W4AFP8 with dynamic quantization mode
  • Expanded W4AFP8 GEMM kernel test cases to cover more dimension combinations
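
A hedged sketch of the tp>1 fix in the second bullet (the function and parameter names are assumptions based on the summary above, not the code in this PR):

# Illustrative only: under tensor parallelism each rank holds just
# intermediate_size / tp_size of the MoE intermediate dimension, so the
# Hadamard block size must come from the per-rank shard, not the full size.
def hadamard_block_size_per_rank(intermediate_size: int, tp_size: int) -> int:
    assert intermediate_size % tp_size == 0, "dimension must split evenly across ranks"
    return intermediate_size // tp_size

# Example: a 3072-wide dimension with tp_size=2 yields blocks of 1536.
print(hadamard_block_size_per_rank(3072, 2))  # 1536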

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:
  • fastdeploy/model_executor/utils.py: Removed "w4afp8" from unsupported quantizations list for v1_loader on CUDA
  • fastdeploy/model_executor/layers/quantization/w4afp8.py: Added is_checkpoint_bf16 attribute to track checkpoint format
  • fastdeploy/model_executor/layers/quantization/__init__.py: Fixed hadamard_block_size calculation to account for tensor parallelism by dividing by tp_size
  • fastdeploy/model_executor/models/ernie4_5_moe.py: Added weight key mapping for W4AFP8 with dynamic quantization mode (without activation scales)
  • fastdeploy/model_executor/layers/moe/fused_moe_cutlass_backend.py: Implemented online quantization support for v1_loader including weight creation, Hadamard rotation, and quantization logic
  • custom_ops/utils/auto_gen_w4afp8_gemm_kernel.py: Fixed script path resolution and added new GEMM kernel configurations for additional dimension sizes
  • tests/ci_use/EB_Lite_with_w4afp8/test_ernie_4_5_w4afp8.py: Added comprehensive test suite for W4AFP8 with both default and default_v1 loaders

print(f"Failed to terminate API server [{config_id}]: {e}")
try:
os.killpg(process.pid, signal.SIGKILL)
except:
Copilot AI Dec 30, 2025

Except block directly handles BaseException.

Suggested change:
-    except:
+    except Exception:

Comment on lines +214 to +215
except:
pass
Copilot AI Dec 30, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:
-    except:
-        pass
+    except Exception as kill_error:
+        # Best-effort cleanup: log and ignore failure to force kill the process group.
+        print(f"Failed to force kill API server [{config_id}] (pid={process.pid}): {kill_error}")

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 44a13e4 into PaddlePaddle:develop Dec 30, 2025
23 of 26 checks passed
