From 7ba3abdfc35e54ec00eee133462604a6479389fe Mon Sep 17 00:00:00 2001 From: wongzhenhao Date: Mon, 2 Feb 2026 16:55:43 +0800 Subject: [PATCH 1/3] delete mathquetion_extarct --- .../guide/quickstart/mathquestion_extract.md | 187 ----------------- .../guide/quickstart/mathquestion_extract.md | 189 ------------------ 2 files changed, 376 deletions(-) delete mode 100644 docs/en/notes/guide/quickstart/mathquestion_extract.md delete mode 100644 docs/zh/notes/guide/quickstart/mathquestion_extract.md diff --git a/docs/en/notes/guide/quickstart/mathquestion_extract.md b/docs/en/notes/guide/quickstart/mathquestion_extract.md deleted file mode 100644 index bc80ca4b0..000000000 --- a/docs/en/notes/guide/quickstart/mathquestion_extract.md +++ /dev/null @@ -1,187 +0,0 @@ ---- -title: Case 6. Math Problem Extraction -createTime: 2025/07/16 20:10:28 -icon: teenyicons:receipt-outline -permalink: /en/guide/t8ykcw9l/ ---- - -# Quick Start: Math Problem Extraction - -This example demonstrates how to use the `MathBookQuestionExtract` operator in Dataflow to automatically extract math problems from a textbook PDF and generate output in JSON/Markdown format. - -## 1 Environment and Dependencies - -1. Install Dataflow and MinerU dependencies - ```shell - pip install "open-dataflow[mineru]" - ``` - Or install from source: - ```shell - pip install -e ".[mineru]" - ``` - -2. Download MinerU model weights - ```shell - mineru-models-download - ``` - -> This operator uses MinerU for PDF content segmentation and image extraction; please ensure that the installation and model weight download have succeeded. - -## 2 Configure LLM Serving - -This operator currently only supports API-based VLM Serving. Please configure the API URL and key before running. - -- Linux / macOS: - ```shell - export DF_API_KEY="sk-xxxxx" - ``` -- Windows PowerShell: - ```powershell - $env:DF_API_KEY = "sk-xxxxx" - ``` - -The API key will be read from the environment variable in the code, so there is no need to hard-code it in the script. - -## 3 Prepare the Test PDF - -The example repository includes a test PDF: -``` -./dataflow/example/KBCleaningPipeline/questionextract_test.pdf -``` -You can also replace it with any math textbook or exercise collection PDF. - -## 4 Initialize and Modify the Script - -First, create a new `run_dataflow` folder anywhere, enter that directory, and then execute Dataflow project initialization: - -```shell -mkdir run_dataflow -cd run_dataflow -dataflow init -``` - -After initialization is complete, the following file will appear in the project directory: - -```shell -run_dataflow/playground/mathbook_extract.py -``` - -The contents of that script are as follows: - -```python -from dataflow.operators.generate import MathBookQuestionExtract -from dataflow.serving.APIVLMServing_openai import APIVLMServing_openai - -class QuestionExtractPipeline: - def __init__(self, llm_serving: APIVLMServing_openai): - self.extractor = MathBookQuestionExtract(llm_serving) - self.test_pdf = "../example/KBCleaningPipeline/questionextract_test.pdf" - - def forward( - self, - pdf_path: str, - output_name: str, - output_dir: str, - api_url: str = "https://api.openai.com/v1/chat/completions", - key_name_of_api_key: str = "DF_API_KEY", - model_name: str = "o4-mini", - max_workers: int = 20 - ): - self.extractor.run( - pdf_file_path=pdf_path, - output_file_name=output_name, - output_folder=output_dir, - api_url=api_url, - key_name_of_api_key=key_name_of_api_key, - model_name=model_name, - max_workers=max_workers - ) - -if __name__ == "__main__": - # 1. Initialize LLM Serving - llm_serving = APIVLMServing_openai( - api_url="https://api.openai.com/v1/chat/completions", - model_name="o4-mini", # It is recommended to use a strong reasoning model - max_workers=20 # Number of concurrent requests - ) - - # 2. Construct and run the extraction pipeline - pipeline = QuestionExtractPipeline(llm_serving) - pipeline.forward( - pdf_path=pipeline.test_pdf, - output_name="test_question_extract", - output_dir="./output" - ) -``` - -### Key Parameter Explanation - -- `api_url`: OpenAI VLM endpoint URL -- `key_name_of_api_key`: Name of the environment variable -- `model_name`: Model name (e.g., `o4-mini`; strong reasoning models are recommended) -- `max_workers`: Number of concurrent requests - - -### Operator Logic - -The complete implementation of the operator is located at -`dataflow/operators/generate/KnowledgeCleaning/mathbook_question_extract.py` -Below, starting from the overall flow, we provide concise yet detailed explanations of each key stage to facilitate use and secondary development: - -1. PDF file splitting - - Use `pymupdf` (fitz) to open the target PDF, rendering each page into a high-quality JPEG image at the specified DPI. - - Save the images, named by page number, to the specified output directory, and log the conversion progress of each page to ensure traceability. - -2. Invoke MinerU for content recognition and image extraction - - Dynamically import the `mineru` module; if it is not installed, throw a friendly prompt guiding the user to run `pip install mineru[pipeline]` and download the models. - - Specify loading models from the local source via the environment variable `MINERU_MODEL_SOURCE=local`, supporting backend options `"vlm-sglang-engine"` or `"pipeline"`. - - Execute the command-line tool: -```shell - mineru -p -o -b --source local -``` - - After execution, the tool will generate `*_content_list.json` (a structured content inventory) and a folder of the original split images in the intermediate directory. - -3. Organize and rename image resources - - Read the `content_list.json` produced by MinerU, filtering out all items where `type=='image'`. - - Copy the corresponding images from MinerU’s temporary directory to the final result folder, renaming them sequentially as `0.jpg, 1.jpg...`. - - Also generate a new JSON inventory, recording each image’s page number in the source PDF and its new file path. - -4. Organize model invocation commands - - Retrieve the predefined text prompt (`mathbook_question_extract_prompt`) from `dataflow.prompts.kbcleaning.KnowledgeCleanerPrompt`, specifying the task requirements and format conventions. - - Package the rendered commands together with multiple input images (page snapshots, illustrations) to prepare for subsequent concurrent LLM service calls. - -5. Concurrently obtain model responses - - Use `APIVLMServing_openai` (or another `LLMServingABC` implementation) combined with `ThreadPoolExecutor` to concurrently submit the packaged list of images and labels to the model. - - Allow customization of the model name, API endpoint, concurrency level, and timeout to flexibly meet different performance and cost requirements. - -6. Parse and save the final output - - In the `analyze_and_save` method, use regular expressions to precisely capture the `index.jpg` tags in the model’s returned text. - - Copy the corresponding images referenced in the tags to the `images/` subfolder in the results directory. - - Output the results in two formats: - a. JSON file: sequentially store each question’s plain text (with tags removed) and the corresponding list of image paths - b. Markdown file: embed images in the original text using the `![](images/xx.jpg)` format for easy visualization - - All output files are saved in the user-specified result folder, facilitating subsequent verification and secondary use. - -## 5 Run the Script - -```shell -python generate_question_extract_api.py -``` - -After it finishes, the `./output` directory will contain: - -- `test_question_extract.json` - Each record includes: - - `text`: Extracted problem text - - `pics`: List of image paths involved in the problem -- `test_question_extract.md` - Displays the problems and their images in Markdown format - -## 6 Optional Extensions - -- Custom prompts: To adjust the system prompt, replace it inside the operator: - ```python - from dataflow.prompts.kbcleaning import KnowledgeCleanerPrompt - system_prompt = KnowledgeCleanerPrompt().mathbook_question_extract_prompt() - ``` -- Parameter customization: Supports switching the MinerU backend (`pipeline` | `vlm-sglang-engine`), adjusting DPI, concurrency, etc. See the `run` method signature in the operator. \ No newline at end of file diff --git a/docs/zh/notes/guide/quickstart/mathquestion_extract.md b/docs/zh/notes/guide/quickstart/mathquestion_extract.md deleted file mode 100644 index 5a81d8540..000000000 --- a/docs/zh/notes/guide/quickstart/mathquestion_extract.md +++ /dev/null @@ -1,189 +0,0 @@ ---- -title: 案例6. 数学问题提取 -createTime: 2025/07/16 20:10:28 -icon: teenyicons:receipt-outline -permalink: /zh/guide/zchbl7uk/ ---- - -# 快速开始:数学问题提取 - -本示例展示如何使用 Dataflow 中的 `MathBookQuestionExtract` 算子,自动从教材 PDF 中提取数学题目,并生成 JSON/Markdown 格式的输出。 - -## 1 环境及依赖 - -1. 安装 Dataflow 与 MinerU 依赖 - ```shell - pip install "open-dataflow[mineru]" - ``` - 或者从源码安装: - ```shell - pip install -e ".[mineru]" - ``` - -2. 下载 MinerU 模型权重 - ```shell - mineru-models-download - ``` - -> 本算子基于 MinerU 实现 PDF 内容切分与图像抽取,请确保安装并下载模型权重成功。 - -## 2 配置 LLM Serving - -当前算子仅支持基于 API 的 VLM Serving。请在运行前设置好 API 地址和 Key。 - -- Linux / macOS: - ```shell - export DF_API_KEY="sk-xxxxx" - ``` -- Windows PowerShell: - ```powershell - $env:DF_API_KEY = "sk-xxxxx" - ``` - -后续在代码中会通过环境变量读取该 API Key,无需在脚本中明文填写。 - -## 3 准备测试 PDF - -示例仓库自带一份测试 PDF: -``` -./dataflow/example/KBCleaningPipeline/questionextract_test.pdf -``` -你也可以替换为任意数学教材或习题集 PDF。 - -## 4 初始化并修改脚本 - -首先,在任意位置创建一个新的 `run_dataflow` 文件夹,并进入该目录,然后执行 Dataflow 项目初始化: - -```shell -mkdir run_dataflow -cd run_dataflow -dataflow init -``` - -初始化完成后,项目目录下会出现以下文件: - -```shell -run_dataflow/playground/mathbook_extract.py -``` - -该脚本的内容如下: - -```python -from dataflow.operators.generate import MathBookQuestionExtract -from dataflow.serving.APIVLMServing_openai import APIVLMServing_openai - -class QuestionExtractPipeline: - def __init__(self, llm_serving: APIVLMServing_openai): - self.extractor = MathBookQuestionExtract(llm_serving) - self.test_pdf = "../example/KBCleaningPipeline/questionextract_test.pdf" - - def forward( - self, - pdf_path: str, - output_name: str, - output_dir: str, - api_url: str = "https://api.openai.com/v1/chat/completions", - key_name_of_api_key: str = "DF_API_KEY", - model_name: str = "o4-mini", - max_workers: int = 20 - ): - self.extractor.run( - pdf_file_path=pdf_path, - output_file_name=output_name, - output_folder=output_dir, - api_url=api_url, - key_name_of_api_key=key_name_of_api_key, - model_name=model_name, - max_workers=max_workers - ) - -if __name__ == "__main__": - # 1. 初始化 LLM Serving - llm_serving = APIVLMServing_openai( - api_url="https://api.openai.com/v1/chat/completions", - model_name="o4-mini", # 推荐使用强推理模型 - max_workers=20 # 并发请求数 - ) - - # 2. 构造并运行提取管道 - pipeline = QuestionExtractPipeline(llm_serving) - pipeline.forward( - pdf_path=pipeline.test_pdf, - output_name="test_question_extract", - output_dir="./output" - ) -``` - -### 关键参数说明 - -- `api_url`:OpenAI VLM 接口地址 -- `key_name_of_api_key`:环境变量名称 -- `model_name`:模型名称(如 `o4-mini`,建议使用强推理模型) -- `max_workers`:并发请求数量 - - -### 算子逻辑 - -算子的完整实现位于 -`dataflow/operators/generate/KnowledgeCleaning/mathbook_question_extract.py` -下面从整体流程出发,对各关键环节做简要而不失细节的说明,便于使用和二次开发: - -1. PDF 文件切割 - - 利用 `pymupdf`(fitz)打开目标 PDF,将每一页按设定的 DPI 渲染成高质量的 JPEG 图片。 - - 图片按页编号保存到指定输出目录,并通过日志记录每一页的转换进度,确保可追溯性。 - -2. 调用 MinerU 进行内容识别与图片提取 - - 动态导入 `mineru` 模块,若未安装则抛出友好提示,指导用户完成 `pip install mineru[pipeline]` 和模型下载。 - - 通过环境变量 `MINERU_MODEL_SOURCE=local` 指定从本地加载模型,支持后端选项 `"vlm-sglang-engine"` 或 `"pipeline"`。 - - 执行命令行工具: -```shell - mineru -p -o -b --source local -``` - - 命令执行后会在中间目录下生成 `*_content_list.json`(结构化内容清单)和原始切割出的图片文件夹。 - -3. 整理与重命名图片资源 - - 读取 MinerU 产出的 `content_list.json`,筛选出所有 `type=='image'` 项。 - - 将对应的图片从 MinerU 的临时目录复制到最终结果文件夹,并按序重命名为 `0.jpg, 1.jpg...`。 - - 同时生成一份新的 JSON 清单,记录每张图片在源 PDF 中的页码及新文件路径。 - -4. 组织模型调用指令 - - 从 `dataflow.prompts.kbcleaning.KnowledgeCleanerPrompt` 中获取预定义的文本提示(`mathbook_question_extract_prompt`),明确任务要求和格式规范。 - - 将渲染好的指令与多张输入图片(页图、插图)打包,为后续并发调用 LLM 服务做准备。 - -5. 并发获取模型响应 - - 使用 `APIVLMServing_openai`(或其他 `LLMServingABC` 实现)并结合 `ThreadPoolExecutor`,将打包好的图片列表与标签并发提交给模型。 - - 可自定义模型名称、API 地址、并发数和超时时间,灵活满足不同性能与成本需求。 - -6. 解析并保存最终输出 - - 在 `analyze_and_save` 方法中,通过正则表达式精准抓取模型返回文本内的 `index.jpg` 标签。 - - 将标签中引用的对应图片复制到结果目录的 `images/` 子文件夹。 - - 以两种格式输出结果: - a. JSON 文件:按顺序保存各题的纯文本(已剔除标签)和对应图片路径列表 - b. Markdown 文件:原文中以 `![](images/xx.jpg)` 形式嵌入图片,易于可视化查看 - - 输出文件统一保存在用户指定的结果文件夹下,便于后续校验和二次使用。 - - - -## 5 运行脚本 - -```shell -python generate_question_extract_api.py -``` - -执行完成后,`./output` 目录下将产生: - -- `test_question_extract.json` - 每条记录包含: - - `text`:提取到的题目文本 - - `pics`:题目涉及的图片路径列表 -- `test_question_extract.md` - 以 Markdown 形式展示题目与配图 - -## 6 可选扩展 - -- 自定义提示词:若需调整系统提示,可在算子内部替换: - ```python - from dataflow.prompts.kbcleaning import KnowledgeCleanerPrompt - system_prompt = KnowledgeCleanerPrompt().mathbook_question_extract_prompt() - ``` -- 参数定制:支持切换 MinerU 后端(`pipeline` | `vlm-sglang-engine`)、调整 DPI、并发数等,详见算子 `run` 方法签名。 \ No newline at end of file From 9edd0255b08ed55ad24714440851231d9b6c3f27 Mon Sep 17 00:00:00 2001 From: wongzhenhao Date: Mon, 2 Feb 2026 16:57:08 +0800 Subject: [PATCH 2/3] change VQAextract pipeline to quickguide --- docs/.vuepress/notes/en/guide.ts | 3 +-- docs/.vuepress/notes/zh/guide.ts | 3 +-- .../PDFVQAExtractPipeline.md => quickstart/PDFVQAExtract.md} | 2 +- .../PDFVQAExtractPipeline.md => quickstart/PDFVQAExtract.md} | 2 +- 4 files changed, 4 insertions(+), 6 deletions(-) rename docs/en/notes/guide/{pipelines/PDFVQAExtractPipeline.md => quickstart/PDFVQAExtract.md} (99%) rename docs/zh/notes/guide/{pipelines/PDFVQAExtractPipeline.md => quickstart/PDFVQAExtract.md} (99%) diff --git a/docs/.vuepress/notes/en/guide.ts b/docs/.vuepress/notes/en/guide.ts index 6ea317d79..8dc26d906 100644 --- a/docs/.vuepress/notes/en/guide.ts +++ b/docs/.vuepress/notes/en/guide.ts @@ -42,7 +42,7 @@ export const Guide: ThemeNote = defineNoteConfig({ 'conversation_synthesis', 'reasoning_general', 'prompted_vqa', - 'mathquestion_extract', + 'PDFVQAExtract', 'knowledge_cleaning', 'speech_transcription', ], @@ -83,7 +83,6 @@ export const Guide: ThemeNote = defineNoteConfig({ "KnowledgeBaseCleaningPipeline", "FuncCallPipeline", "Pdf2ModelPipeline", - "PDFVQAExtractPipeline", ] }, { diff --git a/docs/.vuepress/notes/zh/guide.ts b/docs/.vuepress/notes/zh/guide.ts index f784356f8..942a615d0 100644 --- a/docs/.vuepress/notes/zh/guide.ts +++ b/docs/.vuepress/notes/zh/guide.ts @@ -50,7 +50,7 @@ export const Guide: ThemeNote = defineNoteConfig({ 'conversation_synthesis', "reasoning_general", "prompted_vqa", - "mathquestion_extract", + "PDFVQAExtract", 'knowledge_cleaning', 'speech_transcription', ], @@ -81,7 +81,6 @@ export const Guide: ThemeNote = defineNoteConfig({ "KnowledgeBaseCleaningPipeline", "FuncCallPipeline", "Pdf2ModelPipeline", - "PDFVQAExtractPipeline", ] }, { diff --git a/docs/en/notes/guide/pipelines/PDFVQAExtractPipeline.md b/docs/en/notes/guide/quickstart/PDFVQAExtract.md similarity index 99% rename from docs/en/notes/guide/pipelines/PDFVQAExtractPipeline.md rename to docs/en/notes/guide/quickstart/PDFVQAExtract.md index b6c4a1add..cedfea0f7 100644 --- a/docs/en/notes/guide/pipelines/PDFVQAExtractPipeline.md +++ b/docs/en/notes/guide/quickstart/PDFVQAExtract.md @@ -1,5 +1,5 @@ --- -title: PDF VQA Extraction Pipeline +title: Case 6. PDF VQA Extraction Pipeline createTime: 2025/11/17 14:01:55 permalink: /en/guide/vqa_extract_optimized/ icon: heroicons:document-text diff --git a/docs/zh/notes/guide/pipelines/PDFVQAExtractPipeline.md b/docs/zh/notes/guide/quickstart/PDFVQAExtract.md similarity index 99% rename from docs/zh/notes/guide/pipelines/PDFVQAExtractPipeline.md rename to docs/zh/notes/guide/quickstart/PDFVQAExtract.md index 774b4207b..d9d8ade80 100644 --- a/docs/zh/notes/guide/pipelines/PDFVQAExtractPipeline.md +++ b/docs/zh/notes/guide/quickstart/PDFVQAExtract.md @@ -1,5 +1,5 @@ --- -title: PDF中的VQA提取流水线 +title: 案例6. PDF中的VQA提取流水线 createTime: 2025/11/17 14:01:55 permalink: /zh/guide/vqa_extract_optimized/ icon: heroicons:document-text From 00f886df1b647d2c3418bb2c8b80741ba0ff784d Mon Sep 17 00:00:00 2001 From: wongzhenhao Date: Mon, 2 Feb 2026 16:57:25 +0800 Subject: [PATCH 3/3] renew case numbering --- docs/en/notes/guide/quickstart/speech_transcription.md | 2 +- docs/zh/notes/guide/quickstart/speech_transcription.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/en/notes/guide/quickstart/speech_transcription.md b/docs/en/notes/guide/quickstart/speech_transcription.md index 2a584e96a..8d009486f 100644 --- a/docs/en/notes/guide/quickstart/speech_transcription.md +++ b/docs/en/notes/guide/quickstart/speech_transcription.md @@ -1,5 +1,5 @@ --- -title: Case 9. Speech transcription +title: Case 8. Speech transcription createTime: 2025/08/22 16:38:49 permalink: /en/guide/5pdipkiv/ icon: fad:headphones diff --git a/docs/zh/notes/guide/quickstart/speech_transcription.md b/docs/zh/notes/guide/quickstart/speech_transcription.md index 84a093869..0fcc481b6 100644 --- a/docs/zh/notes/guide/quickstart/speech_transcription.md +++ b/docs/zh/notes/guide/quickstart/speech_transcription.md @@ -1,5 +1,5 @@ --- -title: 案例9. 语音转文字 +title: 案例8. 语音转文字 createTime: 2025/08/22 16:37:30 permalink: /zh/guide/du2akut8/ icon: fad:headphones