[Feature]【Hackathon 10th Spring No.46】Python Windows runtime compatibility [cf]#7702
Conversation
Thanks for your contribution!
CI report generated from the code below (updated every 30 minutes):
1 Task overview: all executed tasks have passed, but 7 workflows are in …
2 Task status summary
2.1 Required tasks: 0/0 passed
2.2 Optional tasks: 2/2 passed
3 Failure details (required only): none
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review
2026-05-03 22:05:16
📋 Review Summary
PR overview: adds Windows runtime compatibility to the Python code, replacing the hard-coded /dev/shm path and POSIX-only APIs such as os.setsid/os.killpg
Scope of changes: fastdeploy/engine/, fastdeploy/cache_manager/, fastdeploy/inter_communicator/, fastdeploy/eplb/, fastdeploy/worker/
Impact tags: [Engine] [KVCache] [Feature]
📝 PR Convention Check
The title contains the unofficial suffix [cf], and the Motivation/Modifications/Usage/Accuracy Tests sections are all empty placeholders with no actual content.
Suggested title (can be copied directly):
[Feature] Python Windows runtime compatibility
Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Fix Python runtime compatibility on Windows. The code hard-codes the `/dev/shm` path and POSIX-only APIs such as `os.setsid` and `os.killpg`, which prevents it from running on Windows.
## Modifications
- `cache_manager/cache_messager.py`: replace the `/dev/shm` path with `tempfile.gettempdir()` selected via a `sys.platform` check
- `cache_manager/prefix_cache_manager.py`: change `preexec_fn=os.setsid` in `subprocess.Popen` to platform-conditional `**_popen_kwargs`
- `engine/common_engine.py`: replace the `/dev/shm` path, and use `proc.terminate()` instead of `os.killpg` on Windows
- `engine/engine.py`: same as above, and switch `multiprocessing.get_context("fork")` to `"spawn"` on Windows
- `engine/expert_service.py`: make the process-termination logic Windows-compatible
- `eplb/async_expert_loader.py`: replace the `/dev/shm` path
- `inter_communicator/fmq.py`: platform-adapt the default value of `Config.ipc_root`
- `inter_communicator/zmq_client.py`, `zmq_server.py`: platform-adapt the socket paths
- `worker/worker_process.py`: platform-adapt the task-queue path
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/engine/expert_service.py:113 | The `init_cache_info` method does not exist on `FDConfig` (only `init_pd_info` does); startup in dp/splitwise scheduling scenarios will raise `AttributeError` |
| 🔴 Bug | fastdeploy/engine/expert_service.py:126 | Same as above: the second call, on the splitwise path |
| 🔴 Bug | fastdeploy/engine/engine.py:872 | dp worker initialization changed from parallel to serial; the design of batch `start()` first, then wait for all, is broken, so startup time in multi-dp scenarios grows linearly |
| 🟡 Suggestion | fastdeploy/engine/engine.py:680 | `enable_flashinfer_allreduce_fusion` was removed from `worker_store_true_flag` and the worker command-line arguments, but the model_executor layer still reads this field, so enabling the flag silently has no effect |
Overall assessment
The approach of the Windows path and process-API compatibility changes is sound, but the PR introduces two clear runtime bugs: the nonexistent `init_cache_info` method will make startup fail in dp/splitwise scenarios, and dp worker initialization was accidentally changed from parallel to serial. Both must be fixed before merging.
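The platform-conditional substitutions described in the Modifications list (a tempdir fallback for `/dev/shm`, and `Popen` keyword selection in place of `preexec_fn=os.setsid`) can be sketched as below. This is a minimal illustration, not the PR's actual code; the helper names are hypothetical:

```python
import os
import subprocess
import sys
import tempfile

def shared_mem_dir():
    # /dev/shm is a Linux tmpfs mount; on Windows (and any platform
    # without it) fall back to the generic temporary directory.
    if sys.platform.startswith("linux") and os.path.isdir("/dev/shm"):
        return "/dev/shm"
    return tempfile.gettempdir()

def popen_group_kwargs():
    # os.setsid exists only on POSIX; Windows expresses "run in its own
    # process group" through a creation flag instead of a pre-exec hook.
    if os.name == "posix":
        return {"preexec_fn": os.setsid}
    return {"creationflags": subprocess.CREATE_NEW_PROCESS_GROUP}
```

A caller would then write `subprocess.Popen(cmd, **popen_group_kwargs())`, and on teardown use `os.killpg` on POSIX but `proc.terminate()` on Windows, mirroring the `common_engine.py` change described above.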
```python
self.engine.create_data_processor()
if self.cfg.scheduler_config.name == "dp":
    self.cfg.init_pd_info()
    self.cfg.init_cache_info()
```
🔴 Bug: the `init_cache_info` method does not exist on `FDConfig` (only `init_pd_info` does); this call will raise `AttributeError` at runtime in the dp scheduling scenario.
Suggested fix:
```python
self.cfg.init_pd_info()
```
```python
if self.cfg.scheduler_config.name == "splitwise":
    self.cfg.init_pd_info()
    self.cfg.init_cache_info()
```
🔴 Bug: same as above; the splitwise scheduling path also calls the nonexistent `init_cache_info` method and raises `AttributeError` at runtime.
Suggested fix:
```python
self.cfg.init_pd_info()
```
```python
while self.launched_expert_service_signal.value[i] == 0:
    time.sleep(0.1)
time.sleep(1)
```
🔴 Bug: dp worker initialization was accidentally changed from "start in parallel, then wait for all" to "start serially, waiting after each one".
Original design: `start()` all dp subprocesses first (allowing weights to load concurrently), then loop separately to wait for all of them to become ready; total time is roughly that of the slowest process.
After the change: each process is awaited immediately after it is started, so in multi-dp scenarios total startup time grows linearly to the sum of all processes' initialization times.
Suggested fix, restoring the original two-loop structure:

```python
for i in range(1, self.cfg.parallel_config.data_parallel_size // self.cfg.nnode):
    # ... create and start() the process
    self.dp_processed[-1].start()
for i in range(1, self.cfg.parallel_config.data_parallel_size // self.cfg.nnode):
    while self.launched_expert_service_signal.value[i] == 0:
        time.sleep(0.1)
```

```diff
@@ -700,7 +678,6 @@ def _start_worker_service(self):
```
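Why the two-loop structure matters can be demonstrated with a small self-contained sketch, in which threads stand in for the dp worker processes (all names here are illustrative): starting everything first lets the waits overlap, so total time tracks the slowest worker rather than the sum.

```python
import threading
import time

def start_all_then_wait(init_durations):
    """Phase 1: launch every worker; phase 2: wait for readiness."""
    ready = [threading.Event() for _ in init_durations]

    def worker(idx, duration):
        time.sleep(duration)   # stand-in for loading model weights
        ready[idx].set()       # analogous to launched_expert_service_signal

    workers = [
        threading.Thread(target=worker, args=(i, d))
        for i, d in enumerate(init_durations)
    ]
    for t in workers:          # first loop: start everything
        t.start()
    for ev in ready:           # second loop: only now block on readiness
        ev.wait()

start = time.monotonic()
start_all_then_wait([0.2, 0.2, 0.2])
elapsed = time.monotonic() - start
# With overlapped waits, elapsed is near 0.2 s; a serial
# start-then-wait-per-worker loop would take about 0.6 s.
```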
```python
"enable_entropy": self.cfg.model_config.enable_entropy,
"ep_prefill_use_worst_num_tokens": self.cfg.parallel_config.ep_prefill_use_worst_num_tokens,
"enable_overlap_schedule": self.cfg.scheduler_config.enable_overlap_schedule,
```
🟡 Suggestion: `enable_flashinfer_allreduce_fusion` has been removed from `worker_store_true_flag` here and from the argparse arguments in `worker_process.py`, but `model_executor/layers/normalization.py` and `linear.py` still read this field from the config to control fusion behavior.
As a result, the flag can no longer be passed from the engine side to the worker subprocess command line, and the model_executor layer will always run with the config default (False); if a user enables the flag via `EngineArgs`, it silently has no effect.
Suggestion: if the removal is intentional, also delete the related code in `config.py`, `args_utils.py`, `normalization.py`, and `linear.py`; if it is not intentional, restore the three deleted places.
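This silent-failure mechanism is generic to `store_true` flags forwarded across a process boundary, and can be reproduced with a small sketch (the flag plumbing here is simplified and hypothetical; only the flag name comes from the PR):

```python
import argparse

def build_worker_argv(store_true_flags):
    # Engine side: a flag reaches the worker argv only if it is still
    # listed here; removing it from this dict drops it silently.
    return [f"--{name}" for name, enabled in store_true_flags.items() if enabled]

def parse_worker_args(argv):
    # Worker side: argparse falls back to the default (False) for any
    # flag absent from argv -- no error, no warning.
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable_flashinfer_allreduce_fusion", action="store_true")
    return parser.parse_args(argv)

# The user enabled the flag, but it was removed from the forwarding dict,
# so the worker never sees it and runs with the default:
args = parse_worker_args(build_worker_argv({}))
```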
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist