Commit 79671ce

Update pdf2vqa pipeline and chunked prompted generator doc (#152)
* Revise PDFVQAExtractPipeline documentation: updated to reflect changes in the VQA extraction process, including modifications to the code examples and descriptions of the new components and their functionalities.
* Revise PDFVQAExtractPipeline documentation: updated to reflect changes in the pipeline structure, including the introduction of ChunkedPromptedGenerator and modifications to the input data format.
* Add documentation for ChunkedPromptedGenerator operator
* [pdf2vqa] Add doc for pdf2vqa format operators.
* [pdf2vqa] update installation dependencies
1 parent 90f2cf2 commit 79671ce

10 files changed

Lines changed: 1071 additions & 82 deletions

Lines changed: 118 additions & 0 deletions
---
title: ChunkedPromptedGenerator
createTime: 2026/01/20 15:00:00
permalink: /en/api/operators/core_text/generate/chunkedpromptedgenerator/
---

## 📘 Overview

`ChunkedPromptedGenerator` is a prompt generation operator that supports **automatic chunking for long texts**. When the input content exceeds a preset token limit, the operator employs a recursive bisection method to split the text into smaller chunks. It then calls a Large Language Model (LLM) to generate a result for each chunk and joins the results using a specified separator.

It is particularly suitable for processing extra-long documents (such as books or long papers) and supports reading input content directly from file paths.

## `__init__` Function

```python
def __init__(self,
             llm_serving: LLMServingABC,
             system_prompt: str = "You are a helpful agent.",
             json_schema: dict = None,
             max_chunk_len: int = 128000,
             enc = tiktoken.get_encoding("cl100k_base"),
             seperator: str = "\n"
)
```
### Initialization Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| **llm_serving** | LLMServingABC | Required | The LLM service instance used for inference. |
| **system_prompt** | str | "You are a helpful agent." | System prompt defining the model's role and behavior. |
| **json_schema** | dict | None | (Optional) A JSON Schema to constrain the LLM's output format. |
| **max_chunk_len** | int | 128000 | The maximum number of tokens allowed per chunk. |
| **enc** | Encoder/Tokenizer | tiktoken.get_encoding("cl100k_base") | The encoder used for token counting. Supports any object with an `encode` method (e.g., tiktoken or AutoTokenizer). |
| **seperator** | str | "\n" | The string used to join the results from multiple chunks. |

### Chunking Logic

The operator uses a **recursive bisection method**:

1. Compute the total token count of the current text.
2. If the token count is at most `max_chunk_len`, the text is processed as a single chunk.
3. Otherwise, the text is split into two halves at the middle character position, and the process repeats recursively on each half.
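The recursion above can be sketched as follows. This is a minimal illustration rather than the operator's actual implementation, and it uses a toy whitespace tokenizer in place of a real `enc` such as tiktoken:

```python
def chunk_text(text, enc, max_chunk_len):
    """Recursively bisect text until every chunk fits the token budget."""
    # Base case: the whole text fits within max_chunk_len tokens.
    if len(enc.encode(text)) <= max_chunk_len:
        return [text]
    # Otherwise split at the middle *character* position and recurse on each half.
    mid = len(text) // 2
    return (chunk_text(text[:mid], enc, max_chunk_len)
            + chunk_text(text[mid:], enc, max_chunk_len))


class WhitespaceEnc:
    """Toy stand-in for a real tokenizer: one token per whitespace-separated word."""
    def encode(self, text):
        return text.split()


chunks = chunk_text("one two three four five six seven eight", WhitespaceEnc(), 3)
```

Note that because the split point is a character position rather than a token boundary, a chunk boundary can fall in the middle of a word; the real operator may handle such details differently.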
## `run` Function

```python
def run(self, storage: DataFlowStorage, input_path_key: str, output_path_key: str)
```

Executes the operator logic: reads file paths from the specified input column, loads each file's content, generates output per chunk, writes the joined results to a new text file, and records the output file path in the DataFrame.

#### Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| **storage** | DataFlowStorage | Required | DataFlow storage instance for reading and writing data. |
| **input_path_key** | str | Required | The input column name containing the **local paths** of the text files. |
| **output_path_key** | str | Required | The output column name where the resulting LLM output file paths will be stored. |
## 🧠 Example Usage

```python
from dataflow.core import LLMServing
from dataflow.utils.storage import DataFlowStorage

# Initialize the operator with a max chunk length of 2000 tokens
operator = ChunkedPromptedGenerator(
    llm_serving=my_llm_instance,
    max_chunk_len=2000,
    seperator="\n---\n"
)

# Run the operator
operator.run(
    storage=my_storage,
    input_path_key="file_path",
    output_path_key="result_path"
)
```
#### 🧾 Output Logic

The operator automatically generates a result file with the suffix `_llm_output.txt` in the same directory as the input file.

| Field | Type | Description |
| --- | --- | --- |
| file_path | str | Path to the original input file (e.g., `data/doc.txt`). |
| result_path | str | Path where the generated result file is saved (e.g., `data/doc_llm_output.txt`). |
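The naming convention can be reproduced with a small helper. This is purely illustrative (the helper name is hypothetical; the operator derives the path internally):

```python
from pathlib import Path

def llm_output_path(input_path: str) -> str:
    """Derive the result-file path: data/doc.txt -> data/doc_llm_output.txt."""
    p = Path(input_path)
    # Same directory, original stem plus the _llm_output.txt suffix.
    return str(p.with_name(p.stem + "_llm_output.txt"))
```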
**Example Input DataFrame Row:**

```json
{
  "file_path": "/home/user/data/long_article.txt"
}
```

**Chunking Workflow:**

1. Read the content of `long_article.txt`.
2. Assume the text is split into `Chunk A` and `Chunk B`.
3. Call the LLM to obtain `Result A` and `Result B`.
4. Write `Result A\nResult B` into `/home/user/data/long_article_llm_output.txt`.

**Example Output DataFrame Row:**

```json
{
  "file_path": "/home/user/data/long_article.txt",
  "result_path": "/home/user/data/long_article_llm_output.txt"
}
```
Lines changed: 106 additions & 0 deletions
---
title: LLMOutputParser
createTime: 2026/01/20 20:15:00
permalink: /en/api/operators/core_text/parse/llmoutputparser/
---

## 📘 Overview

`LLMOutputParser` is a structured data parsing operator designed specifically to parse response text generated by Large Language Models (LLMs) that contains specific XML tags.

The core functionalities of this operator include:

1. **Tag Parsing**: Identifying and extracting content within tags such as `<chapter>`, `<qa_pair>`, `<question>`, `<answer>`, `<solution>`, and `<label>`.
2. **ID Restoration**: Mapping numerical IDs returned by the LLM back to the original text content or image tags (based on the converted layout files generated by `MinerU2LLMInputOperator`).
3. **Resource Synchronization**: Automatically copying associated images from the intermediate directory to the final output directory and correcting the image reference paths.

## `__init__` Function

```python
def __init__(self,
             mode: Literal['question', 'answer'],
             output_dir: str,
             intermediate_dir: str = "intermediate"
)
```
### Initialization Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| **mode** | str | Required | Parsing mode. Options are `'question'` or `'answer'`, which affects the output filename and the image subdirectory name. |
| **output_dir** | str | Required | The final root directory for structured data and images. |
| **intermediate_dir** | str | "intermediate" | The intermediate directory where original image resources processed by MinerU are located. |

## XML Tag Protocol

The operator expects the LLM to return data according to the following structure:

* `<chapter>`: A chapter block containing a title and multiple QA pairs.
  * `<title>`: The **ID** corresponding to the chapter title.
  * `<qa_pair>`: A block representing a single question-answer pair.
    * `<question>` / `<solution>`: A list of **IDs** (e.g., `1, 2, 5`) corresponding to the source content.
    * `<answer>`: The answer extracted from the solution. **This is actual text content, not an ID.**
    * `<label>`: Question type or label information. **This is a real sequence number/label, not an ID.**
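A protocol of this shape can be consumed with simple pattern matching. The sketch below uses the tag names from the protocol above, but the sample response values are made up and the operator's real parser may be more robust:

```python
import re

def extract(tag, text):
    """Return the contents of every <tag>...</tag> block in text."""
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)

# Hypothetical LLM response following the tag protocol; IDs and values are invented.
response = """
<chapter>
  <title>4</title>
  <qa_pair>
    <question>1, 3</question>
    <solution>5</solution>
    <answer>The answer text itself.</answer>
    <label>multiple-choice</label>
  </qa_pair>
</chapter>
"""

chapter = extract("chapter", response)[0]
# <question> holds a comma-separated list of IDs, not text.
question_ids = [int(i) for i in extract("question", chapter)[0].split(",")]
```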
## `run` Function

```python
def run(self,
        storage: DataFlowStorage,
        input_response_path_key: str,
        input_converted_layout_path_key: str,
        input_name_key: str,
        output_qalist_path_key: str
)
```

Executes the parsing logic: reads the LLM response, restores content using the layout JSON file, and saves the result in JSONL format.

#### Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| **storage** | DataFlowStorage | Required | DataFlow storage instance. |
| **input_response_path_key** | str | Required | Column name for the path to the original LLM response file. |
| **input_converted_layout_path_key** | str | Required | Column name for the path to the converted layout file (`_converted.json`). |
| **input_name_key** | str | Required | Column name for the task name, which determines the naming of the output folder. |
| **output_qalist_path_key** | str | Required | Column name to store the path of the generated JSONL file. |
## 🧠 Example Logic

### 1. ID Restoration Process

Suppose the LLM returns `<question>1, 3</question>`. The operator looks up the entries with `id` 1 and `id` 3 in the layout JSON:

* If `id: 1` is the text "What is AI?" and `id: 3` is the image `path/to/img.png`, the restored content will be: `What is AI?\n![image](images/img.png)`.
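A minimal sketch of this restoration step is shown below. The layout entry field names (`img_path` in particular) are assumptions for illustration, not the operator's confirmed schema:

```python
# Converted layout entries as produced upstream; field names are assumptions.
layout = [
    {"id": 1, "type": "text", "text": "What is AI?"},
    {"id": 2, "type": "text", "text": "An unrelated paragraph."},
    {"id": 3, "type": "image", "img_path": "path/to/img.png"},
]

def restore(ids, layout):
    """Map LLM-returned IDs back to text content or markdown image tags."""
    by_id = {item["id"]: item for item in layout}
    parts = []
    for i in ids:
        item = by_id[i]
        if item["type"] == "image":
            # Reference the image by file name under the synchronized images directory.
            name = item["img_path"].rsplit("/", 1)[-1]
            parts.append(f"![image](images/{name})")
        else:
            parts.append(item["text"])
    return "\n".join(parts)

restored = restore([1, 3], layout)
```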
### 2. Output File Structure

After execution, the directory structure under `output_dir` (referenced as `cache_path` in some contexts) will be as follows:

```text
output_dir/
└── {name}/
    ├── extracted_questions.jsonl   # Structured data
    └── question_images/            # Automatically synchronized images
        ├── img1.png
        └── ...
```

### 3. JSONL Output Example

```json
{
  "question": "Please analyze the image below:\n![image](question_images/fig1.png)",
  "answer": "This is the parsed answer text.",
  "solution": "Detailed step-by-step solution...",
  "label": "1",
  "chapter_title": "Chapter 1: Fundamentals"
}
```
Lines changed: 88 additions & 0 deletions
---
title: MinerU2LLMInputOperator
createTime: 2026/01/20 20:10:00
permalink: /en/api/operators/core_text/convert/mineru2llminputoperator/
---

## 📘 Overview

`MinerU2LLMInputOperator` is a format conversion operator specifically designed for processing **MinerU** parsing results. It transforms the underlying `_content_list.json` files generated by MinerU into a flattened format that is more suitable for Large Language Model (LLM) understanding and processing.

### Key Features:

* **List Flattening**: Breaks down complex `list` type items into individual `text` entries.
* **Data Cleaning**: Removes metadata that is typically unnecessary for LLMs, such as `bbox` (bounding box coordinates) and `page_idx` (page numbers).
* **Re-indexing**: Generates continuous and unique `id` values for all converted content items.

## `__init__` Function

```python
def __init__(self)
```

This operator does not require any additional parameters during initialization.
## `run` Function

```python
def run(self, storage: DataFlowStorage, input_markdown_path_key: str, output_converted_layout_key: str)
```

Executes the conversion logic: locates the corresponding MinerU JSON file based on the Markdown file path, processes it, saves it as a new file, and records the new path.

#### Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| **storage** | DataFlowStorage | Required | DataFlow storage instance. |
| **input_markdown_path_key** | str | Required | Input column name containing the paths to MinerU `.md` files. The operator automatically searches for `_content_list.json` in the same directory. |
| **output_converted_layout_key** | str | Required | Output column name to store the path of the processed `_converted.json` file. |
## 🧠 Conversion Logic Details

1. **Path Matching**: The operator retrieves the file path from `input_markdown_path_key` and replaces the `.md` extension with `_content_list.json` to read the original layout data.
2. **Content Processing**:
   * If an entry's type is `list` and its sub-type is `text`, the operator iterates through `list_items` and promotes each sub-item to an independent `text` entry.
   * Entries that are already `text` or other types are preserved.
3. **Format Simplification**: The `bbox` and `page_idx` fields are removed from all entries to reduce token interference and noise.
4. **File Output**: The resulting file is saved with a `_converted.json` suffix in the same directory as the original file.
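The conversion steps above can be sketched in a few lines. This is an illustrative reimplementation under the stated rules, not the operator's source code:

```python
# Simplified MinerU content list (structure per the conversion rules above).
content_list = [
    {"type": "list", "sub_type": "text",
     "list_items": ["Item One Content", "Item Two Content"],
     "bbox": [10, 20, 100, 200], "page_idx": 0},
    {"type": "text", "text": "A plain paragraph.",
     "bbox": [10, 210, 100, 240], "page_idx": 0},
]

def convert(entries):
    """Flatten text lists, drop bbox/page_idx, and re-index."""
    out = []
    for entry in entries:
        if entry.get("type") == "list" and entry.get("sub_type") == "text":
            # Promote each list item to an independent text entry.
            for item in entry["list_items"]:
                out.append({"type": "text", "text": item})
        else:
            # Preserve other entries, minus the metadata fields.
            out.append({k: v for k, v in entry.items()
                        if k not in ("bbox", "page_idx")})
    # Assign continuous, unique IDs after flattening.
    for i, entry in enumerate(out):
        entry["id"] = i
    return out

converted = convert(content_list)
```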
## 🧠 Example Usage

#### 🧾 Format Conversion Comparison

**Input (Original MinerU `_content_list.json`):**

```json
[
  {
    "type": "list",
    "sub_type": "text",
    "list_items": ["Item One Content", "Item Two Content"],
    "bbox": [10, 20, 100, 200],
    "page_idx": 0
  }
]
```

**Output (Processed `_converted.json`):**

```json
[
  {
    "type": "text",
    "text": "Item One Content",
    "id": 0
  },
  {
    "type": "text",
    "text": "Item Two Content",
    "id": 1
  }
]
```
