[Feature]【Hackathon 10th Spring No.45】SM-tier compile guards [cf] · Pull Request #7699 · PaddlePaddle/FastDeploy

ghost · 2026-05-02T17:14:36Z

Motivation

Modifications

Usage or Command

Accuracy Tests

Checklist

I have submitted the CLA (only first PR)
My PR title follows the convention
My changes pass all tests

CLAassistant · 2026-05-02T17:14:43Z

All committers have signed the CLA.

paddle-bot · 2026-05-02T17:15:55Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-02T18:24:19Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-03 21:54:45

This CI report is generated from the following code (refreshed every 30 minutes):

PR commit: 421e005
Merge base: d70f33d (branch: develop)
View full diff
CI details

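1 Task overview

No Required task is currently failing; all Required tasks have passed (or none are configured). One CI task is still running.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
2(0)	2	1	0	1	0	0

⚠️ Note: the following 7 workflows are in the action_required state (they run only after approval): Codestyle-Check, Approval, CI_HPU, Check PR Template, ILUVATAR-CI, PR Build and Test, CI_XPU. These workflows must be approved manually before they are triggered.

Note: action_required workflows are not counted in the table above.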

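2 Task status summary

2.1 Required tasks: 0/0 passed

No Required tasks are currently configured (Branch Protection Rules are not set, or permissions are insufficient), so no mandatory task blocks the merge.

2.2 Optional tasks: 1/2 passed

Optional tasks do not block the merge; failures are informational only.

Status	Task	Duration	Log	Rerun
⏳	`Trigger Jenkins for PR`	-	Job	-
✅	The other 1 optional task passed	-	-	-

3 Failure details (required only)

No required tasks failed.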


PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-03 22:09:08

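📋 Review summary

PR overview: Adds SM-architecture-level compile guards to the custom_ops GPU operators and removes deprecated operator declarations and source files.
Scope of changes: custom_ops/gpu_ops/cpp_extensions.cc, custom_ops/setup_ops.py
Impact tags: [OP] [Quantization] [Speculative Decoding]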

📝 PR convention check

Every section of the PR description is a TODO placeholder, and "My changes pass all tests" in the Checklist is unchecked.

Suggested title (can be copied as-is):

[Feature] Add SM-tier compile guards for custom ops

Suggested PR description (can be copied as-is):

## Motivation
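Add compile guards to the GPU operators in custom_ops that depend on a specific SM architecture, so that high-SM operators (cutlass_scaled_mm, the FP8 quantization family, MoE permute/depermute, and others) are compiled and registered only on GPUs that support the corresponding architecture. This avoids missing symbols or runtime errors when building for low-SM devices. The PR also removes deprecated operator declarations and source files, and unifies the parameter signatures of the speculative decoding and attention operators.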

## Modifications
- `custom_ops/gpu_ops/cpp_extensions.cc`:
  - Add an `#ifdef ENABLE_SM75_EXT_OPS` guard around the pybind registrations of `cutlass_scaled_mm`, `cutlass_scaled_mm_azp`, and the FP8 quantization operators (static/dynamic/per-token scaled fp8 quant)
  - Move `prefill_permute_to_masked_gemm`, `depermute_prefill_combine`, `radix_topk_ragged_transform`, and `per_token_group_fp8_quant` into the `#ifdef ENABLE_SM80_EXT_OPS` guard
  - Update forward-declaration signatures: `GetPositionIdsAndMaskEncoderBatch` (adds `mask_encoder_batch`), `UnifiedUpdateModelStatus` (adds `adaptive_step_input_len`/`mask_rollback`/`is_naive_mode`/`prefill_one_step_stop`), `DraftModelPreprocess` (parameters reorganized), `EagleGetSelfHiddenStates` (`seq_lens_encoder` → `step_idx`), `UpdateAttnMaskOffsets` (adds `attn_mask_offsets_decoder`/`mask_rollback`)
  - Remove deprecated functions and their pybind registrations: `FusedCastSigmoidBias`, `BuildSamplingParamLogProb`, `NaiveUpdateModelStatus`, `EagleGatherHiddenStates`; rename `SpeculateGetAcceptTokensAndLogits` to `SpeculateGetTargetLogits`
- `custom_ops/setup_ops.py`:
  - Remove deprecated source files: `gpu_ops/swap_cache_optimized.cu`, `gpu_ops/fused_cast_sigmoid_bias.cu`
  - Change the FP8 kernel auto-generation logic from per-target checks over the `sm_versions` list to mutually exclusive `if-elif-else` branches on a single `cc` value
  - Clean up the Iluvatar compile arguments (drop `-Wno-non-pod-varargs` on the cxx side; drop `iluvatar_ops/wi4a16_*.cu` and `gpu_ops/update_attn_mask_offsets.cu`)
  - Clean up the Metax compile arguments (drop `-Xcompiler -Wno-non-pod-varargs`)

## Usage or Command
N/A

## Accuracy Tests
N/A (build-infrastructure change; accuracy is not affected)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

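Issues

Severity	File	Summary
🔴 Compatibility	`custom_ops/gpu_ops/cpp_extensions.cc:1635`	The `ENABLE_SM75_EXT_OPS` macro does not appear to be defined in `setup_ops.py`; cutlass_scaled_mm and the FP8 quantization operators would silently disappear
🟡 Suggestion	`custom_ops/setup_ops.py:472`	FP8 kernel auto-generation now uses mutually exclusive if-elif-else branches (single cc value), which may break multi-architecture builds
🟡 Suggestion	`custom_ops/gpu_ops/cpp_extensions.cc`	Several function signatures changed (UnifiedUpdateModelStatus / DraftModelPreprocess / EagleGetSelfHiddenStates / UpdateAttnMaskOffsets, among others), but no matching update is visible on the Python call side (`fastdeploy/model_executor/layers/`)

Overall assessment

The tiered SM-guard approach is sound, but ENABLE_SM75_EXT_OPS is not defined in setup_ops.py, which creates a compatibility risk that cutlass_scaled_mm and the FP8 quantization operators are removed entirely. The refactored multi-architecture build logic also deserves additional verification. These issues should be fixed before merging.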
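
The "silently disappearing operator" risk raised in this review could be caught early with a small smoke check, sketched here. The operator names are the five listed in the review; the helper `missing_ops` and the `SimpleNamespace` stand-in for the compiled extension module are assumptions made for illustration and are not part of FastDeploy.

```python
from types import SimpleNamespace

# Operator names taken from the review; all five sit behind ENABLE_SM75_EXT_OPS.
EXPECTED_OPS = [
    "cutlass_scaled_mm",
    "cutlass_scaled_mm_azp",
    "static_scaled_fp8_quant",
    "dynamic_scaled_fp8_quant",
    "dynamic_per_token_scaled_fp8_quant",
]


def missing_ops(module, expected=EXPECTED_OPS):
    """Return the expected operator names that `module` does not expose.

    Hypothetical helper: `module` is any object carrying the registered ops
    as attributes, such as an imported extension module.
    """
    return [name for name in expected if not hasattr(module, name)]


# Stand-in for a build in which only one of the guarded ops was registered.
fake_ops = SimpleNamespace(cutlass_scaled_mm=lambda *args: None)
print(missing_ops(fake_ops))
```

Running such a check right after the build, against the real extension module, would turn a late `AttributeError` at a call site into an immediate and explicit list of missing registrations.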

PaddlePaddle-bot · 2026-05-03T14:15:38Z

-   * cutlass_scaled_mm
-   * cutlass_scaled_mm_azp
-   */
+#ifdef ENABLE_SM75_EXT_OPS


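🔴 Compatibility: the ENABLE_SM75_EXT_OPS macro is not defined anywhere in the setup_ops.py diff.

The pybind registrations of five operators (cutlass_scaled_mm, cutlass_scaled_mm_azp, static_scaled_fp8_quant, dynamic_scaled_fp8_quant, dynamic_per_token_scaled_fp8_quant) are placed inside this macro guard. If setup_ops.py does not add -DENABLE_SM75_EXT_OPS, these operators silently disappear at every SM tier, and callers hit AttributeError at runtime.

Please confirm whether setup_ops.py already contains (or this PR should add) something like: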

if cc >= 75: nvcc_compile_args += ["-DENABLE_SM75_EXT_OPS"]

Note: architecture.md currently documents only ENABLE_SM80_EXT_OPS (SM≥80); there is no corresponding SM75 macro.
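
A slightly fuller sketch of the suggestion above, assuming the build knows its target compute capabilities as a list of integers. Only the macro names `ENABLE_SM75_EXT_OPS` and `ENABLE_SM80_EXT_OPS` come from this PR and its review; the helper `sm_tier_defines` and the choice to key on the highest target are assumptions, and the real `setup_ops.py` may wire the flags differently.

```python
def sm_tier_defines(sm_versions):
    """Return -D flags for each SM tier reached by the highest target arch.

    Hypothetical helper: `sm_versions` is a list of integer compute
    capabilities such as [80, 90]. Assumption: a tier is enabled when the
    highest target reaches its threshold; a stricter build could require
    every target to reach it instead.
    """
    tiers = [(75, "ENABLE_SM75_EXT_OPS"), (80, "ENABLE_SM80_EXT_OPS")]
    highest = max(sm_versions, default=0)
    return [f"-D{macro}" for threshold, macro in tiers if highest >= threshold]


nvcc_compile_args = []
nvcc_compile_args += sm_tier_defines([80, 90])
print(nvcc_compile_args)
```

Keeping the thresholds in one table makes it harder for a guard added in `cpp_extensions.cc` to go without a matching define on the build side.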

PaddlePaddle-bot · 2026-05-03T14:15:38Z

@@ -472,59 +470,58 @@ def find_end_files(directory, end_str):
        # This script seems general enough for different SM versions, specific templates are chosen by CUTLASS.
        os.system("python utils/auto_gen_visitor_fp8_gemm_fused_kernels.py")



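🟡 Suggestion: the FP8 kernel auto-generation logic changed from per-target checks over the sm_versions list to mutually exclusive if-elif-else branches on a single cc value, which may introduce a multi-architecture build regression.

The original logic used has_sm90 = 90 in sm_versions, which handles multi-architecture cases such as FD_BUILDING_ARCS=[80,90] correctly: kernels for both SM versions are generated. The new logic checks only a single cc value. If cc means "the highest current CC" rather than being iterated over one target at a time, the FP8 kernels for the lower-CC target may be skipped in a multi-target build.

Consider verifying the FD_BUILDING_ARCS=[80,90] scenario, or state in a comment that cc is guaranteed to be iterated over one target at a time.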
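
The difference between the two selection styles can be illustrated with a toy model. This is a sketch under stated assumptions: the step labels `gen_sm80` and `gen_sm90` are placeholders rather than real generator scripts, and the branch thresholds are illustrative. Only the `90 in sm_versions` membership test and the single-`cc` if-elif-else shape are taken from the review.

```python
def select_by_membership(sm_versions):
    """Old style: each targeted SM version contributes its own step."""
    steps = []
    if 90 in sm_versions:
        steps.append("gen_sm90")  # placeholder label, not a real script
    if 80 in sm_versions:
        steps.append("gen_sm80")  # placeholder label, not a real script
    return steps


def select_by_single_cc(cc):
    """New style: mutually exclusive branches on one cc value."""
    if cc >= 90:
        return ["gen_sm90"]
    elif cc >= 80:
        return ["gen_sm80"]
    else:
        return []


targets = [80, 90]
print(select_by_membership(targets))      # ['gen_sm90', 'gen_sm80']
print(select_by_single_cc(max(targets)))  # ['gen_sm90']
# Iterating per target restores coverage of both architectures.
print([s for cc in targets for s in select_by_single_cc(cc)])  # ['gen_sm80', 'gen_sm90']
```

In this model the single-value form drops the SM80 step whenever `cc` is the maximum of the target list, which is the regression the reviewer asks to rule out.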

ghost temporarily deployed to Metax_ci May 2, 2026 17:14 — with GitHub Actions Inactive

ghost temporarily deployed to Metax_ci May 2, 2026 17:15 — with GitHub Actions Inactive

paddle-bot Bot added the contributor External developers label May 2, 2026

ghost temporarily deployed to Metax_ci May 2, 2026 17:15 — with GitHub Actions Inactive


[CI]【Hackathon 10th Spring No.45-part2】Add SM75/SM80 compile guards f…

421e005

…or cutlass and MoE tail ops

PaddlePaddle-bot suggested changes May 3, 2026

ghost wants to merge 1 commit into PaddlePaddle:develop from CloudForge-Solutions:task/h10-45-sm-tier-compile-guards-v3

2 participants
