[Feature] support w4afp8 v1_loader and v0_loader(tp>1) #5757
Conversation
Thanks for your contribution!
```diff
@@ -1,22 +1,9 @@
 # Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
```

Why was this section deleted?
```diff
     "--guided-decoding-backend",
     "auto",
 ]

 # Start subprocess in new process group
 # Clear the log directory
 if os.path.exists("log"):
     shutil.rmtree("log")
 with open(log_path, "w") as logfile:
     process = subprocess.Popen(
         cmd,
         stdout=logfile,
         stderr=subprocess.STDOUT,
-        start_new_session=True,  # Enables killing full group via os.killpg
+        start_new_session=True,
     )

 # Wait up to 300 seconds for API server to be ready
 for _ in range(300):
     if is_port_open("127.0.0.1", FD_API_PORT):
         print(f"API server is up on port {FD_API_PORT}")
         break
     time.sleep(1)
 else:
     print("[TIMEOUT] API server failed to start in 5 minutes. Cleaning up...")
     try:
         os.killpg(process.pid, signal.SIGTERM)
     except Exception as e:
         print(f"Failed to kill process group: {e}")
     raise RuntimeError(f"API server did not start on port {FD_API_PORT}")
```
Please don't delete these explanatory comments.
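For context, the test above relies on `start_new_session=True` so the whole server process group can be terminated on timeout. A minimal standalone sketch of that pattern, assuming a POSIX system (the short-lived Python child stands in for the API server; this is not the PR's actual test code):

```python
import os
import signal
import subprocess
import sys

# Start a child in its own session: its PID then doubles as a process-group ID,
# so os.killpg can terminate it together with any workers it spawned.
process = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],  # stand-in for the API server
    start_new_session=True,  # enables killing the full group via os.killpg
)

try:
    os.killpg(process.pid, signal.SIGTERM)  # signal the whole group, not just the child
except ProcessLookupError:
    pass  # group already gone

process.wait()
print("terminated by SIGTERM:", process.returncode == -signal.SIGTERM)
```

Without `start_new_session=True`, killing only `process.pid` could leave grandchild worker processes orphaned, which is why the cleanup path uses `os.killpg`.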
```python
[3072, 2560, 64, 0, 128],
[2560, 1536, 64, 0, 128],
[1536, 2560, 64, 0, 128],
[2560, 768, 64, 0, 128],
```
What is the reason for extending this list?
Codecov Report

❌ Patch coverage is

```
@@           Coverage Diff            @@
##           develop    #5757   +/-  ##
==========================================
  Coverage         ?   66.68%
==========================================
  Files            ?      346
  Lines            ?    44322
  Branches         ?     6813
==========================================
  Hits             ?    29554
  Misses           ?    12584
  Partials         ?     2184
```
LGTM
```python
quant_weight_list.append(quant_weight)
scale_list.append(weight_scale)

if hasattr(getattr(layer, weight_name), "tensor_track"):
```
Remove this `if`; `free_tensor` already contains this check.
```python
if not up_gate_ready and not down_ready:
    return

if not self.quant_config.is_quantized:
```
Change this to `checkpoint_bf16`.
```python
shape=self.ffn1_weight_shape,
dtype=self.weight_dtype,

if not self.quant_config.is_quantized and layer.fd_config.load_config.load_choices == "default_v1":
```
Change this to `is_checkpoint_bf16`.
@lizexu123 The current unit test coverage is already low to begin with, and …
ernie4_5_moe.py was written by directly referencing the internal (eb5) ernie4_5moe.py, so some parts are naturally not covered. This keeps the open-source version consistent with the internal one; unit tests are still necessary.
Pull request overview
This PR adds support for W4AFP8 quantization with the v1_loader ("default_v1") and fixes accuracy issues when using tensor parallelism (tp>1) with the default loader ("default").
Key Changes:
- Enabled W4AFP8 quantization for v1_loader by removing it from the unsupported quantization list
- Fixed hadamard_block_size calculation for tp>1 scenarios by dividing by tp_size
- Added online quantization support for v1_loader in W4AFP8 MoE backend
- Added new weight key mappings for W4AFP8 with dynamic quantization mode
- Expanded W4AFP8 GEMM kernel test cases to cover more dimension combinations
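The tp>1 fix listed above scales the Hadamard block size by the tensor-parallel degree: each rank holds only a 1/tp_size shard of the rotated dimension, so the rotation block must shrink accordingly. A hedged sketch of that arithmetic (the function and variable names here are illustrative, not the PR's actual code):

```python
def sharded_hadamard_block_size(hadamard_block_size: int, tp_size: int) -> int:
    """Per-rank Hadamard block size under tensor parallelism.

    Each rank owns 1/tp_size of the rotated weight dimension; applying the
    full-size rotation block to a shard would mix elements across rank
    boundaries and hurt accuracy, so the block size is divided by tp_size.
    """
    if hadamard_block_size % tp_size != 0:
        raise ValueError("hadamard_block_size must divide evenly across tp ranks")
    return hadamard_block_size // tp_size

# e.g. a 512-wide rotation block split across 4 tensor-parallel ranks
print(sharded_hadamard_block_size(512, 4))  # -> 128
```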
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `fastdeploy/model_executor/utils.py` | Removed `"w4afp8"` from the unsupported-quantization list for v1_loader on CUDA |
| `fastdeploy/model_executor/layers/quantization/w4afp8.py` | Added `is_checkpoint_bf16` attribute to track the checkpoint format |
| `fastdeploy/model_executor/layers/quantization/__init__.py` | Fixed the `hadamard_block_size` calculation to account for tensor parallelism by dividing by `tp_size` |
| `fastdeploy/model_executor/models/ernie4_5_moe.py` | Added a weight key mapping for W4AFP8 with dynamic quantization mode (without activation scales) |
| `fastdeploy/model_executor/layers/moe/fused_moe_cutlass_backend.py` | Implemented online quantization support for v1_loader, including weight creation, Hadamard rotation, and quantization logic |
| `custom_ops/utils/auto_gen_w4afp8_gemm_kernel.py` | Fixed script path resolution and added new GEMM kernel configurations for additional dimension sizes |
| `tests/ci_use/EB_Lite_with_w4afp8/test_ernie_4_5_w4afp8.py` | Added a comprehensive test suite for W4AFP8 with both `default` and `default_v1` loaders |
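The `fused_moe_cutlass_backend.py` change quantizes bf16 checkpoint weights online when loaded through v1_loader. As a rough illustration of the per-channel weight-quantization step only, here is a pure-Python sketch with hypothetical names (the real backend packs int4 weights and handles FP8 activation scales in CUDA kernels, which this does not attempt to reproduce):

```python
def quantize_w4_channel(row):
    """Symmetric int4 quantization of one output channel.

    Maps the channel's largest magnitude onto the int4 limit 7 and
    rounds every element to the nearest quantized step.
    """
    max_abs = max(abs(x) for x in row) or 1.0
    scale = max_abs / 7.0  # one scale per output channel
    q = [max(-7, min(7, round(x / scale))) for x in row]
    return q, scale

row = [0.9, -1.2, 0.05, 2.8, -0.7]
q, scale = quantize_w4_channel(row)
recon = [v * scale for v in q]  # dequantized approximation of the channel
print(q, round(scale, 3))
```

The reconstruction error per element is bounded by half a quantization step (`scale / 2`), which is why one scale is kept per channel rather than per tensor.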
```python
print(f"Failed to terminate API server [{config_id}]: {e}")
try:
    os.killpg(process.pid, signal.SIGKILL)
except:
```

Copilot (AI), Dec 30, 2025:

A bare `except` block directly handles `BaseException`. Suggested change:

```diff
-except:
+except Exception:
```
```python
except:
    pass
```

Copilot (AI), Dec 30, 2025:

The `except` clause does nothing but `pass` and there is no explanatory comment. Suggested change:

```diff
-except:
-    pass
+except Exception as kill_error:
+    # Best-effort cleanup: log and ignore failure to force kill the process group.
+    print(f"Failed to force kill API server [{config_id}] (pid={process.pid}): {kill_error}")
```


Motivation

Support loading W4AFP8 with `load_choices="default_v1"`, and fix the tp>1 accuracy issue when `load_choices="default"`.

Service launch script:
Modifications
Usage or Command
Accuracy Tests
Checklist
- [ ] Add at least a tag in the PR title, such as [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] If the PR targets a `release` branch, make sure it has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.