
Conversation

lizexu123 (Collaborator) commented Dec 24, 2025

Motivation

Support loading w4afp8 with load_choices="default_v1", and fix the accuracy issue with tp>1 when load_choices="default".

Service launch script:

# online_inference.sh
for name in `env | grep -E 'PADDLE|ENDPOINT' | awk -F'=' '{print $1}'`; do
  unset ${name}
done

rm -rf log_eb
export FD_LOG_DIR=log_eb

model_path="ernie-4_5-21b-a3b-bf16-paddle"
# A torch model such as ERNIE-4.5-21B-A3B-PT can also be used

export devices=0,1
export CUDA_VISIBLE_DEVICES=${devices}


export FD_SAMPLING_CLASS=rejection
export INFERENCE_MSG_QUEUE_ID=8908

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --port 8912 \
    --quantization w4afp8 \
    --tensor-parallel-size 2
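
The script above exercises the default loader; to take the default_v1 path from the Motivation, set the loader option your FastDeploy build exposes for load_config.load_choices. Once the server is up, a quick smoke test can be run against it. This is a minimal sketch, not part of this PR: it assumes the standard OpenAI-compatible /v1/chat/completions route and the port 8912 from the script, and the model name is illustrative.

# smoke_test.py (illustrative sketch; stdlib only)
import json
import urllib.request

payload = {
    "model": "ernie-4_5-21b-a3b-bf16-paddle",  # placeholder; match your deployment
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}
req = urllib.request.Request(
    "http://127.0.0.1:8912/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Print the first choice's message content from the JSON response
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])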

💡 If this PR is a cherry-pick, the PR title must follow the required format: add the [Cherry-Pick] tag at the very beginning and append the original PR ID at the end, e.g. [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain why in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot bot commented Dec 24, 2025

Thanks for your contribution!

@lizexu123 lizexu123 changed the title from support to [Feature] support w4afp8 v1_loader and v0_loader(tp>1) Dec 25, 2025
@@ -1,22 +1,9 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
Collaborator

Why was this deleted?

Comment on lines 84 to 112
"--guided-decoding-backend",
"auto",
]

# Start subprocess in new process group
# 清除log目录
if os.path.exists("log"):
shutil.rmtree("log")
with open(log_path, "w") as logfile:
process = subprocess.Popen(
cmd,
stdout=logfile,
stderr=subprocess.STDOUT,
start_new_session=True, # Enables killing full group via os.killpg
start_new_session=True,
)

# Wait up to 300 seconds for API server to be ready
for _ in range(300):
if is_port_open("127.0.0.1", FD_API_PORT):
print(f"API server is up on port {FD_API_PORT}")
break
time.sleep(1)
else:
print("[TIMEOUT] API server failed to start in 5 minutes. Cleaning up...")
try:
os.killpg(process.pid, signal.SIGTERM)
except Exception as e:
print(f"Failed to kill process group: {e}")
raise RuntimeError(f"API server did not start on port {FD_API_PORT}")
Collaborator

Don't delete the explanatory comments here.

Comment on lines +93 to +96
[3072, 2560, 64, 0, 128],
[2560, 1536, 64, 0, 128],
[1536, 2560, 64, 0, 128],
[2560, 768, 64, 0, 128],
Collaborator

What is the reason for extending this list?

codecov-commenter commented Dec 26, 2025

Codecov Report

❌ Patch coverage is 5.98291% with 110 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8ee055a). Learn more about missing BASE report.

Files with missing lines                                   Patch %   Lines
...l_executor/layers/moe/fused_moe_cutlass_backend.py      4.54%     104 Missing and 1 partial ⚠️
fastdeploy/model_executor/models/ernie4_5_moe.py           0.00%     1 Missing and 2 partials ⚠️
...loy/model_executor/layers/quantization/__init__.py      0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5757   +/-   ##
==========================================
  Coverage           ?   66.68%           
==========================================
  Files              ?      346           
  Lines              ?    44322           
  Branches           ?     6813           
==========================================
  Hits               ?    29554           
  Misses             ?    12584           
  Partials           ?     2184           
Flag   Coverage Δ
GPU    66.68% <5.98%> (?)

Flags with carried forward coverage won't be shown.


yangjianfengo1 (Contributor) commented:

LGTM

quant_weight_list.append(quant_weight)
scale_list.append(weight_scale)

if hasattr(getattr(layer, weight_name), "tensor_track"):
Collaborator

Remove this if; free_tensor already contains this check.
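
For context, the reviewer's point is that the guard is redundant when the helper already checks the attribute itself. An illustrative sketch, not the actual FastDeploy implementation:

# Sketch only: if free_tensor guards tensor_track internally like this,
# callers do not need their own hasattr check before invoking it.
def free_tensor(tensor):
    if hasattr(tensor, "tensor_track"):
        tensor.tensor_track = None
    # ... release the underlying storage here ...
    del tensor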

if not up_gate_ready and not down_ready:
    return

if not self.quant_config.is_quantized:
Collaborator

Change this to checkpoint_bf16.

shape=self.ffn1_weight_shape,
dtype=self.weight_dtype,

if not self.quant_config.is_quantized and layer.fd_config.load_config.load_choices == "default_v1":
Collaborator

Change this to is_checkpoint_bf16.

bukejiyu previously approved these changes Dec 29, 2025
EmmonsCurse (Collaborator) commented:

@lizexu123 The current unit test coverage is already low, and the tests/e2e/test_ernie_4_5_w4afp8.py test takes nearly 6 minutes to run. Given that it noticeably lengthens the run_tests_with_coverage task while contributing little to coverage, I don't recommend including this test case in the run_tests_with_coverage task.

lizexu123 (Collaborator, Author) replied:

ernie4_5_moe.py was written directly from the internal (eb5) ernie4_5moe.py, so some parts are naturally not covered. This keeps the open-source code identical to the internal version, and the unit test is still necessary.

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds support for W4AFP8 quantization with the v1_loader ("default_v1") and fixes accuracy issues when using tensor parallelism (tp>1) with the default loader ("default").

Key Changes:

  • Enabled W4AFP8 quantization for v1_loader by removing it from the unsupported quantization list
  • Fixed hadamard_block_size calculation for tp>1 scenarios by dividing by tp_size (see the sketch after this list)
  • Added online quantization support for v1_loader in W4AFP8 MoE backend
  • Added new weight key mappings for W4AFP8 with dynamic quantization mode
  • Expanded W4AFP8 GEMM kernel test cases to cover more dimension combinations
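
A hedged sketch of the tp>1 fix in the second bullet (the function and parameter names are assumptions based on the summary above, not the code in this PR):

# Illustrative only: under tensor parallelism each rank holds just
# intermediate_size / tp_size of the MoE intermediate dimension, so the
# Hadamard block size must come from the per-rank shard, not the full size.
def hadamard_block_size_per_rank(intermediate_size: int, tp_size: int) -> int:
    assert intermediate_size % tp_size == 0, "dimension must split evenly across ranks"
    return intermediate_size // tp_size

# Example: a 3072-wide dimension with tp_size=2 yields blocks of 1536.
print(hadamard_block_size_per_rank(3072, 2))  # 1536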

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:
  • fastdeploy/model_executor/utils.py: Removed "w4afp8" from unsupported quantizations list for v1_loader on CUDA
  • fastdeploy/model_executor/layers/quantization/w4afp8.py: Added is_checkpoint_bf16 attribute to track checkpoint format
  • fastdeploy/model_executor/layers/quantization/__init__.py: Fixed hadamard_block_size calculation to account for tensor parallelism by dividing by tp_size
  • fastdeploy/model_executor/models/ernie4_5_moe.py: Added weight key mapping for W4AFP8 with dynamic quantization mode (without activation scales)
  • fastdeploy/model_executor/layers/moe/fused_moe_cutlass_backend.py: Implemented online quantization support for v1_loader including weight creation, Hadamard rotation, and quantization logic
  • custom_ops/utils/auto_gen_w4afp8_gemm_kernel.py: Fixed script path resolution and added new GEMM kernel configurations for additional dimension sizes
  • tests/ci_use/EB_Lite_with_w4afp8/test_ernie_4_5_w4afp8.py: Added comprehensive test suite for W4AFP8 with both default and default_v1 loaders

print(f"Failed to terminate API server [{config_id}]: {e}")
try:
os.killpg(process.pid, signal.SIGKILL)
except:
Copilot AI Dec 30, 2025

Except block directly handles BaseException.

Suggested change:
-    except:
+    except Exception:

Comment on lines +214 to +215
except:
pass
Copilot AI Dec 30, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:
-    except:
-        pass
+    except Exception as kill_error:
+        # Best-effort cleanup: log and ignore failure to force kill the process group.
+        print(f"Failed to force kill API server [{config_id}] (pid={process.pid}): {kill_error}")

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 44a13e4 into PaddlePaddle:develop Dec 30, 2025
23 of 26 checks passed
