
[Feature]【Hackathon 10th Spring No.46】Python Windows runtime compatibility [cf]#7702

Open
ghost wants to merge 1 commit into PaddlePaddle:develop from CloudForge-Solutions:task/h10-46-python-compat-v3

@ghost ghost commented May 2, 2026

Motivation

Modifications

Usage or Command

Accuracy Tests

Checklist

  • I have submitted the CLA (only first PR)
  • My PR title follows the convention
  • My changes pass all tests

@ghost ghost temporarily deployed to Metax_ci May 2, 2026 17:14 — with GitHub Actions Inactive

CLAassistant commented May 2, 2026

CLA assistant check
All committers have signed the CLA.

@ghost ghost temporarily deployed to Metax_ci May 2, 2026 17:15 — with GitHub Actions Inactive

paddle-bot Bot commented May 2, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label May 2, 2026
PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot commented May 2, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-03 22:43:44

This CI report is generated from the following code (refreshed every 30 minutes):


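1 Task overview

All tasks that have run so far passed. However, 7 workflows are in the action_required state and will only run after manual approval (including the main CI pipelines), so CI has not yet run in full.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 2 (0) | 2 | 2 | 0 | 0 | 0 | 0 |

⚠️ Note: the following 7 workflows are in the action_required state (they run only after approval): Approval, Codestyle-Check, Check PR Template, CI_HPU, CI_XPU, PR Build and Test, ILUVATAR-CI. These workflows must be triggered by manual approval.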


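2 Task status summary

2.1 Required tasks: 0/0 passed

No configured Required tasks were detected (branch protection rules are not configured, or API permissions are insufficient). The main CI pipelines (PR Build and Test and others) are all in the action_required state and have not run yet.

2.2 Optional tasks: 2/2 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| | The remaining 2 optional tasks passed | - | - | - |

3 Failure details (required only)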

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-03 22:05:16

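📋 Review summary

PR overview: adds Windows runtime compatibility to the Python code by replacing `/dev/shm` paths and POSIX-only APIs such as `os.setsid` and `os.killpg`.
Scope of changes: fastdeploy/engine/, fastdeploy/cache_manager/, fastdeploy/inter_communicator/, fastdeploy/eplb/, fastdeploy/worker/
Affected-area tags: [Engine] [KVCache] [Feature]

📝 PR convention check

The title carries the unofficial suffix [cf], and the Motivation, Modifications, Usage, and Accuracy Tests sections are all empty placeholders with no actual content.

Suggested title (can be copied as-is):

  • [Feature] Python Windows runtime compatibility

Suggested PR description (can be copied as-is; it must reproduce the full structure of the checklist §D2 template):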

## Motivation
Fix runtime compatibility problems in the Python code on Windows. The code hardcodes the `/dev/shm` path in many places and relies on POSIX-only APIs such as `os.setsid` and `os.killpg`, so it cannot run on Windows.

## Modifications
- `cache_manager/cache_messager.py`: replace the `/dev/shm` path with `tempfile.gettempdir()`, selected via a `sys.platform` check
- `cache_manager/prefix_cache_manager.py`: change `preexec_fn=os.setsid` in `subprocess.Popen` to platform-conditional `**_popen_kwargs`
- `engine/common_engine.py`: replace the `/dev/shm` path, and use `proc.terminate()` instead of `os.killpg` on Windows
- `engine/engine.py`: same as above, and use `"spawn"` instead of `multiprocessing.get_context("fork")` on Windows
- `engine/expert_service.py`: make the process-termination logic Windows-compatible
- `eplb/async_expert_loader.py`: replace the `/dev/shm` path
- `inter_communicator/fmq.py`: make the default value of `Config.ipc_root` platform-aware
- `inter_communicator/zmq_client.py`, `zmq_server.py`: make socket paths platform-aware
- `worker/worker_process.py`: make the task queue path platform-aware

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

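Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | fastdeploy/engine/expert_service.py:113 | The `init_cache_info` method does not exist on FDConfig (only `init_pd_info` does), so startup raises AttributeError under dp/splitwise scheduling |
| 🔴 Bug | fastdeploy/engine/expert_service.py:126 | Same as above; the second call, on the splitwise path |
| 🔴 Bug | fastdeploy/engine/engine.py:872 | dp worker initialization changed from parallel to serial: the design of calling start() on all workers first and then waiting for them together is broken, so startup time grows linearly in multi-dp setups |
| 🟡 Suggestion | fastdeploy/engine/engine.py:680 | `enable_flashinfer_allreduce_fusion` was removed from `worker_store_true_flag` and from the worker command-line arguments, but the model_executor layer still reads this field, so the flag silently stops taking effect when a user enables it |

Overall assessment

The approach to making paths and process APIs Windows-compatible is sound, but the PR introduces two clear runtime bugs: the nonexistent `init_cache_info` method makes startup fail under dp/splitwise scheduling, and parallel dp worker initialization was unintentionally made serial. Both must be fixed before merging.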
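The platform-conditional pattern summarized in this review (a temp-dir fallback for `/dev/shm`, platform-specific `Popen` keyword arguments, and `terminate()` in place of `os.killpg` on Windows) could look roughly like the sketch below. It is not the PR's code: the helper names `ipc_root`, `popen_kwargs`, `mp_start_method`, and `stop_process` are hypothetical, and the extra `os.path.isdir` check is an assumption added here so the sketch also works on POSIX systems without `/dev/shm`.

```python
import os
import signal
import subprocess
import sys
import tempfile

IS_WINDOWS = sys.platform == "win32"


def ipc_root() -> str:
    """Directory for IPC files: /dev/shm where it exists, else the temp dir."""
    if not IS_WINDOWS and os.path.isdir("/dev/shm"):
        return "/dev/shm"
    return tempfile.gettempdir()


def popen_kwargs() -> dict:
    """Popen keyword arguments that detach the child into its own group."""
    if IS_WINDOWS:
        # This constant only exists on Windows, so it is only touched there.
        return {"creationflags": subprocess.CREATE_NEW_PROCESS_GROUP}
    # start_new_session=True makes the child call setsid(), which is what
    # preexec_fn=os.setsid was used for.
    return {"start_new_session": True}


def mp_start_method() -> str:
    """Windows has no fork(), so multiprocessing must use 'spawn' there."""
    return "spawn" if IS_WINDOWS else "fork"


def stop_process(proc: subprocess.Popen, timeout: float = 10.0) -> None:
    """Terminate a child started with popen_kwargs() and wait for it to exit."""
    if proc.poll() is not None:
        return  # already exited
    if IS_WINDOWS:
        proc.terminate()
    else:
        # Signal the whole process group created by start_new_session.
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    proc.wait(timeout=timeout)
```

A caller would then write `subprocess.Popen(cmd, **popen_kwargs())` and later `stop_process(proc)`, keeping the platform branch in one place rather than at every call site.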

self.engine.create_data_processor()
if self.cfg.scheduler_config.name == "dp":
    self.cfg.init_pd_info()
    self.cfg.init_cache_info()

🔴 Bug: the `init_cache_info` method does not exist on FDConfig (only `init_pd_info` does), so this call raises AttributeError at runtime under dp scheduling.

Suggested fix:

self.cfg.init_pd_info()


if self.cfg.scheduler_config.name == "splitwise":
    self.cfg.init_pd_info()
    self.cfg.init_cache_info()

🔴 Bug: same as above. The splitwise scheduling path also calls the nonexistent `init_cache_info` method, which raises AttributeError at runtime.

Suggested fix:

self.cfg.init_pd_info()


while self.launched_expert_service_signal.value[i] == 0:
    time.sleep(0.1)
    time.sleep(1)

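🔴 Bug: dp worker initialization was unintentionally changed from "start in parallel, then wait for all" to "start serially, waiting for each one".

Original design: call start() on all dp subprocesses first (so they can load weights concurrently), then wait for all of them to become ready in a separate loop. Total time is roughly that of the slowest process.

After the change: each process is waited on immediately after it is started, so in multi-dp setups total startup time grows linearly to the sum of all processes' initialization times.

Suggested fix, restoring the original two-loop structure: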

for i in range(1, self.cfg.parallel_config.data_parallel_size // self.cfg.nnode):
    # ... create the process and start() it
    self.dp_processed[-1].start()

for i in range(1, self.cfg.parallel_config.data_parallel_size // self.cfg.nnode):
    while self.launched_expert_service_signal.value[i] == 0:
        time.sleep(0.1)
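
The timing difference between the two structures can be seen in a toy model. The sketch below is only an illustration under loud assumptions: threads and `time.sleep` stand in for dp subprocesses and weight loading, a plain list stands in for `launched_expert_service_signal`, and all function names are hypothetical.

```python
import threading
import time


def fake_worker(ready: list, i: int, init_seconds: float) -> None:
    """Stand-in for a dp worker: 'loads weights', then raises its ready flag."""
    time.sleep(init_seconds)
    ready[i] = 1


def start_all_then_wait(n: int, init_seconds: float) -> float:
    """Two loops: start every worker first, then wait for all of them."""
    ready = [0] * n
    workers = [
        threading.Thread(target=fake_worker, args=(ready, i, init_seconds))
        for i in range(n)
    ]
    t0 = time.perf_counter()
    for w in workers:
        w.start()
    for i in range(n):
        while ready[i] == 0:
            time.sleep(0.01)
    elapsed = time.perf_counter() - t0
    for w in workers:
        w.join()
    return elapsed


def start_and_wait_each(n: int, init_seconds: float) -> float:
    """One loop: wait for each worker right after starting it."""
    ready = [0] * n
    t0 = time.perf_counter()
    for i in range(n):
        w = threading.Thread(target=fake_worker, args=(ready, i, init_seconds))
        w.start()
        while ready[i] == 0:
            time.sleep(0.01)
        w.join()
    return time.perf_counter() - t0


if __name__ == "__main__":
    print(f"start all, then wait: {start_all_then_wait(4, 0.2):.2f}s")
    print(f"start and wait each:  {start_and_wait_each(4, 0.2):.2f}s")
```

With four workers that each take 0.2 s to initialize, the two-loop version finishes in roughly the time of one worker, while the one-loop version takes roughly the sum of all four.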

@@ -700,7 +678,6 @@ def _start_worker_service(self):
"enable_entropy": self.cfg.model_config.enable_entropy,
"ep_prefill_use_worst_num_tokens": self.cfg.parallel_config.ep_prefill_use_worst_num_tokens,
"enable_overlap_schedule": self.cfg.scheduler_config.enable_overlap_schedule,

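🟡 Suggestion: `enable_flashinfer_allreduce_fusion` has been removed from `worker_store_true_flag` here and from the argparse arguments in `worker_process.py`, but `model_executor/layers/normalization.py` and `linear.py` still read this field from config to control fusion behavior.

As a result, the flag can no longer be passed from the engine side to the worker subprocess command line, and the model_executor layer will always run with the config default (False). If a user enables the flag through EngineArgs, it silently has no effect.

Recommendation: if the removal is intentional, also delete the related code in `config.py`, `args_utils.py`, `normalization.py`, and `linear.py`; if it is not intentional, restore these three deletions.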
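
The failure mode described above can be reproduced in miniature. The sketch below is a simplified model, not FastDeploy's actual code: it assumes the engine turns a dict of boolean flags into `--flag` command-line arguments and the worker parses them with `argparse` `store_true` options. The helpers `build_worker_args` and `parse_worker_args` are hypothetical, and only the engine-side removal is modeled.

```python
import argparse


def build_worker_args(store_true_flags: dict) -> list:
    """Engine side: emit '--name' for every flag that is switched on."""
    return [f"--{name}" for name, enabled in store_true_flags.items() if enabled]


def parse_worker_args(argv: list) -> argparse.Namespace:
    """Worker side: store_true options default to False when absent."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_overlap_schedule", action="store_true")
    parser.add_argument("--enable_flashinfer_allreduce_fusion", action="store_true")
    return parser.parse_args(argv)


# The user switched the fusion flag on.
user_flags = {
    "enable_overlap_schedule": False,
    "enable_flashinfer_allreduce_fusion": True,
}

# Flag present in the engine-side dict: the worker sees it.
forwarded = parse_worker_args(build_worker_args(user_flags))

# Flag dropped from the engine-side dict: nothing fails, the worker
# simply falls back to the default.
dropped = {k: v for k, v in user_flags.items()
           if k != "enable_flashinfer_allreduce_fusion"}
not_forwarded = parse_worker_args(build_worker_args(dropped))

print(forwarded.enable_flashinfer_allreduce_fusion)
print(not_forwarded.enable_flashinfer_allreduce_fusion)
```

Because a missing `store_true` argument is indistinguishable from one the user never set, no error is raised anywhere, which is why the review calls the failure silent.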

