
[Feature]【Hackathon 10th Spring No.45】SM-tier compile guards [cf]#7699

Open
ghost wants to merge 1 commit intoPaddlePaddle:developfrom
CloudForge-Solutions:task/h10-45-sm-tier-compile-guards-v3


@ghost ghost commented May 2, 2026

Motivation

Modifications

Usage or Command

Accuracy Tests

Checklist

  • I have submitted the CLA (only first PR)
  • My PR title follows the convention
  • My changes pass all tests

@ghost ghost temporarily deployed to Metax_ci May 2, 2026 17:14 — with GitHub Actions Inactive

CLAassistant commented May 2, 2026

CLA assistant check
All committers have signed the CLA.


paddle-bot Bot commented May 2, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label May 2, 2026
PaddlePaddle-bot

This comment was marked as outdated.


PaddlePaddle-bot commented May 2, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-03 21:54:45

This CI report is generated from the following code (refreshed every 30 minutes):


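1 Task overview

No Required task has failed; all Required tasks passed (or no Required tasks are configured). One CI task is still running.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 2 (0) | 2 | 1 | 0 | 1 | 0 | 0 |

⚠️ Note: the following 7 workflows are in the action_required state (they run only after approval): Codestyle-Check, Approval, CI_HPU, Check PR Template, ILUVATAR-CI, PR Build and Test, CI_XPU. These workflows must be triggered by manual approval.

Note: action_required workflows are not counted in the task statistics in the table above.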

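2 Task status summary

2.1 Required tasks: 0/0 passed

No Required tasks are currently configured (Branch Protection Rules are not set, or permissions are insufficient), so no mandatory task blocks the merge.

2.2 Optional tasks: 1/2 passed

Optional tasks do not block the merge; their failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
|  | Trigger Jenkins for PR | - | Job | - |
|  | 1 other optional task passed | - | - | - |

3 Failure details (required only)

No required tasks failed.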


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-03 22:09:08

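📋 Review summary

PR overview: adds SM-architecture-tier compile guards for the custom_ops GPU operators and cleans up deprecated operator declarations and source files.
Scope of changes: custom_ops/gpu_ops/cpp_extensions.cc, custom_ops/setup_ops.py
Impact tags: [OP] [Quantization] [Speculative Decoding]

📝 PR convention check

Every section of the PR description is a TODO placeholder, and "My changes pass all tests" in the Checklist is unchecked.

Suggested title (ready to copy):

  • [Feature] Add SM-tier compile guards for custom ops

Suggested PR description (ready to copy):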

## Motivation
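Add compile guards for the GPU operators in custom_ops that depend on a specific SM architecture, so that high-SM operators (cutlass_scaled_mm, the FP8 quantization family, MoE permute/depermute, etc.) are compiled and registered only on GPUs that support the corresponding architecture. This avoids missing symbols or runtime errors when building for low-SM devices. Also clean up deprecated operator declarations and source files, and unify the parameter signatures of the speculative-decoding and attention operators.

## Modifications
- `custom_ops/gpu_ops/cpp_extensions.cc`
  - Add an `#ifdef ENABLE_SM75_EXT_OPS` guard around the pybind registrations of `cutlass_scaled_mm`, `cutlass_scaled_mm_azp`, and the FP8 quantization operators (static/dynamic/per-token scaled fp8 quant)
  - Move `prefill_permute_to_masked_gemm`, `depermute_prefill_combine`, `radix_topk_ragged_transform`, and `per_token_group_fp8_quant` into the `#ifdef ENABLE_SM80_EXT_OPS` guard
  - Update forward-declaration signatures: `GetPositionIdsAndMaskEncoderBatch` (adds `mask_encoder_batch`), `UnifiedUpdateModelStatus` (adds `adaptive_step_input_len`/`mask_rollback`/`is_naive_mode`/`prefill_one_step_stop`), `DraftModelPreprocess` (parameters reorganized), `EagleGetSelfHiddenStates` (`seq_lens_encoder`, `step_idx`), `UpdateAttnMaskOffsets` (adds `attn_mask_offsets_decoder`/`mask_rollback`)
  - Remove deprecated functions and their pybind registrations: `FusedCastSigmoidBias`, `BuildSamplingParamLogProb`, `NaiveUpdateModelStatus`, `EagleGatherHiddenStates`; rename `SpeculateGetAcceptTokensAndLogits` to `SpeculateGetTargetLogits`
- `custom_ops/setup_ops.py`
  - Remove deprecated source files: `gpu_ops/swap_cache_optimized.cu`, `gpu_ops/fused_cast_sigmoid_bias.cu`
  - Change FP8 kernel auto-generation from parallel multi-target checks (the `sm_versions` list) to mutually exclusive `if-elif-else` branches (based on a single `cc` value)
  - Clean up Iluvatar compile arguments (remove `-Wno-non-pod-varargs` on the cxx side; remove `iluvatar_ops/wi4a16_*.cu` and `gpu_ops/update_attn_mask_offsets.cu`)
  - Clean up Metax compile arguments (remove `-Xcompiler -Wno-non-pod-varargs`)

## Usage or Command
N/A

## Accuracy Tests
N/A (build-infrastructure change; accuracy is not affected)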

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

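Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Compatibility | custom_ops/gpu_ops/cpp_extensions.cc:1635 | The ENABLE_SM75_EXT_OPS macro does not appear to be defined in setup_ops.py, so cutlass_scaled_mm and the FP8 quantization operators would silently disappear |
| 🟡 Suggestion | custom_ops/setup_ops.py:472 | FP8 kernel auto-generation now uses mutually exclusive if-elif-else branches (single cc value), which may break parallel multi-architecture builds |
| 🟡 Suggestion | custom_ops/gpu_ops/cpp_extensions.cc | Several function signatures change (UnifiedUpdateModelStatus / DraftModelPreprocess / EagleGetSelfHiddenStates / UpdateAttnMaskOffsets, etc.), but no matching update is visible on the Python call side (fastdeploy/model_executor/layers/) |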

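Overall assessment

The tiered SM-guard approach is sound, but ENABLE_SM75_EXT_OPS is not defined in setup_ops.py, which creates a compatibility risk: cutlass_scaled_mm and the FP8 quantization operators could be dropped from the build entirely. The refactored multi-architecture build logic also deserves additional verification. These issues need to be fixed before merging.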
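The "silently disappear" risk can be caught with a small smoke test run against the built extension. The following is a minimal sketch, not code from this PR: the helper name `find_missing_ops` is hypothetical, the op names are the five listed in the review, and a `SimpleNamespace` stands in for the real compiled module.

```python
from types import SimpleNamespace

# The five ops the review says are wrapped by the ENABLE_SM75_EXT_OPS guard.
SM75_GUARDED_OPS = (
    "cutlass_scaled_mm",
    "cutlass_scaled_mm_azp",
    "static_scaled_fp8_quant",
    "dynamic_scaled_fp8_quant",
    "dynamic_per_token_scaled_fp8_quant",
)


def find_missing_ops(ops_module, expected=SM75_GUARDED_OPS):
    """Return the expected op names that the module does not expose.

    Hypothetical helper: `ops_module` is whatever object the compiled
    extension is imported as.
    """
    return [name for name in expected if not hasattr(ops_module, name)]


# Stand-in for an extension built without the guard macro defined:
# only one of the five ops got registered.
fake_ops = SimpleNamespace(cutlass_scaled_mm=lambda *args: None)
missing = find_missing_ops(fake_ops)
print(missing)
```

Running such a check in CI on an SM75+ build would turn a late `AttributeError` at call time into an early, explicit failure.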

* cutlass_scaled_mm
* cutlass_scaled_mm_azp
*/
#ifdef ENABLE_SM75_EXT_OPS

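🔴 Compatibility: the ENABLE_SM75_EXT_OPS macro is not defined anywhere in the setup_ops.py diff.

The pybind registrations of 5 operators (cutlass_scaled_mm, cutlass_scaled_mm_azp, static_scaled_fp8_quant, dynamic_scaled_fp8_quant, dynamic_per_token_scaled_fp8_quant) are placed inside this macro guard. If setup_ops.py does not append -DENABLE_SM75_EXT_OPS, these operators will silently disappear on every SM tier, and callers will raise AttributeError at runtime.

Please confirm whether setup_ops.py already contains (or this PR should add) something like:

if cc >= 75:
    nvcc_compile_args += ["-DENABLE_SM75_EXT_OPS"]

Note: architecture.md currently documents only ENABLE_SM80_EXT_OPS (SM≥80); no corresponding SM75 macro is recorded.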
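One way to keep the guard macros and the build flags in sync is to derive every `-D` flag from the target compute capabilities in a single place. This is only a sketch under stated assumptions: `sm_tier_defines` is a hypothetical helper, not a function in setup_ops.py, and the "enable a tier when any target qualifies" policy is an assumption the PR would need to confirm.

```python
def sm_tier_defines(sm_versions):
    """Map target compute capabilities to -D flags (hypothetical helper).

    Assumption: a tier is enabled when at least one target architecture
    meets its minimum SM version.
    """
    flags = []
    if any(cc >= 75 for cc in sm_versions):
        flags.append("-DENABLE_SM75_EXT_OPS")
    if any(cc >= 80 for cc in sm_versions):
        flags.append("-DENABLE_SM80_EXT_OPS")
    return flags


print(sm_tier_defines([70]))
print(sm_tier_defines([75]))
print(sm_tier_defines([80, 90]))
```

The returned list could then be appended to the nvcc argument list in the same way as the two-line snippet above.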

Comment thread custom_ops/setup_ops.py
@@ -472,59 +470,58 @@ def find_end_files(directory, end_str):
# This script seems general enough for different SM versions, specific templates are chosen by CUTLASS.
os.system("python utils/auto_gen_visitor_fp8_gemm_fused_kernels.py")


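🟡 Suggestion: FP8 kernel auto-generation changes from parallel multi-target checks (the sm_versions list) to mutually exclusive if-elif-else branches (based on a single cc value), which may introduce a multi-architecture build regression.

The original logic uses has_sm90 = 90 in sm_versions, which correctly handles multi-architecture cases such as FD_BUILDING_ARCS=[80,90]: kernels for both SM versions are generated. The new logic checks only a single cc value. If cc means "the highest current CC" rather than each target visited in turn, the FP8 kernels for the lower-CC target may be skipped in a multi-target build.

Consider verifying the FD_BUILDING_ARCS=[80,90] case, or state in a comment that cc is guaranteed to iterate over each target.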
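The difference between the two selection styles can be illustrated with a toy example. This is a sketch, not the code in setup_ops.py: the function names and the generator labels `fp8_sm90` / `fp8_sm80` are hypothetical, and it assumes the single `cc` is taken as the highest requested architecture.

```python
def generators_from_list(sm_versions):
    """Membership checks: every requested architecture gets its kernels."""
    selected = []
    if 90 in sm_versions:
        selected.append("fp8_sm90")
    if any(80 <= cc < 90 for cc in sm_versions):
        selected.append("fp8_sm80")
    return selected


def generators_from_single_cc(cc):
    """Mutually exclusive branches: exactly one generator set per call."""
    if cc >= 90:
        return ["fp8_sm90"]
    elif cc >= 80:
        return ["fp8_sm80"]
    else:
        return []


targets = [80, 90]
list_based = generators_from_list(targets)
# Assumption for illustration: cc is the highest requested architecture.
single_cc = generators_from_single_cc(max(targets))
print(list_based, single_cc)
```

Under that assumption the single-value version never selects the SM80 generator for a `[80, 90]` build. If the surrounding code instead loops over each target and calls the branch once per `cc`, both styles select the same kernels.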


Labels

contributor External developers


2 participants