Commit 79671ce

Update pdf2vqa pipeline and chunked prompted generator doc (#152)
* Revise PDFVQAExtractPipeline documentation: updated to reflect changes in the VQA extraction process, including modifications to the code examples and descriptions of the new components and their functionalities.
* Revise PDFVQAExtractPipeline documentation: updated to reflect changes in the pipeline structure, including the introduction of ChunkedPromptedGenerator and modifications to the input data format.
* Add documentation for ChunkedPromptedGenerator operator
* [pdf2vqa] Add doc for pdf2vqa format operators.
* [pdf2vqa] update installation dependencies
1 parent 90f2cf2 commit 79671ce

10 files changed

Lines changed: 1071 additions & 82 deletions

Lines changed: 118 additions & 0 deletions
---
title: ChunkedPromptedGenerator
createTime: 2026/01/20 15:00:00
permalink: /en/api/operators/core_text/generate/chunkedpromptedgenerator/
---

## 📘 Overview

`ChunkedPromptedGenerator` is a prompt generation operator that supports **automatic chunking for long texts**. When the input content exceeds a preset token limit, the operator employs a recursive bisection method to split the text into smaller chunks. It then calls a Large Language Model (LLM) to generate a result for each chunk and joins the results using a specified separator.

It is particularly suitable for processing extra-long documents (such as books or long papers) and supports reading input content directly from file paths.

## `__init__` Function

```python
def __init__(self,
             llm_serving: LLMServingABC,
             system_prompt: str = "You are a helpful agent.",
             json_schema: dict = None,
             max_chunk_len: int = 128000,
             enc = tiktoken.get_encoding("cl100k_base"),
             seperator: str = "\n"
)
```
### Initialization Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| **llm_serving** | LLMServingABC | Required | The LLM service instance used for inference. |
| **system_prompt** | str | "You are a helpful agent." | System prompt defining the model's role and behavior. |
| **json_schema** | dict | None | (Optional) A JSON Schema to constrain the LLM's output format. |
| **max_chunk_len** | int | 128000 | The maximum number of tokens allowed per chunk. |
| **enc** | Encoder/Tokenizer | tiktoken.get_encoding("cl100k_base") | The encoder used for token counting. Supports any object with an `encode` method (e.g., tiktoken or AutoTokenizer). |
| **seperator** | str | "\n" | The string used to join the results from multiple chunks. |

### Chunking Logic

The operator uses a **recursive bisection method**:

1. Compute the total token count of the current text.
2. If the token count is at most `max_chunk_len`, the text is processed as a single chunk.
3. Otherwise, the text is split into two halves at the middle character position, and the process repeats recursively on each half.
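The recursion above can be sketched as follows. This is a minimal illustration rather than the operator's actual implementation, and it uses a toy whitespace tokenizer in place of a real `enc` such as tiktoken:

```python
def chunk_text(text, enc, max_chunk_len):
    """Recursively bisect text until every chunk fits the token budget."""
    # Base case: the whole text fits within max_chunk_len tokens.
    if len(enc.encode(text)) <= max_chunk_len:
        return [text]
    # Otherwise split at the middle *character* position and recurse on each half.
    mid = len(text) // 2
    return (chunk_text(text[:mid], enc, max_chunk_len)
            + chunk_text(text[mid:], enc, max_chunk_len))


class WhitespaceEnc:
    """Toy stand-in for a real tokenizer: one token per whitespace-separated word."""
    def encode(self, text):
        return text.split()


chunks = chunk_text("one two three four five six seven eight", WhitespaceEnc(), 3)
```

Note that because the split point is a character position rather than a token boundary, a chunk boundary can fall in the middle of a word; the real operator may handle such details differently.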
## `run` Function

```python
def run(self, storage: DataFlowStorage, input_path_key: str, output_path_key: str)
```

Executes the operator logic: reads file paths from the specified input column, loads each file's content, generates output per chunk, writes the joined results to a new text file, and records the output file path in the DataFrame.

#### Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| **storage** | DataFlowStorage | Required | DataFlow storage instance for reading and writing data. |
| **input_path_key** | str | Required | The input column name containing the **local paths** of the text files. |
| **output_path_key** | str | Required | The output column name where the resulting LLM output file paths will be stored. |
## 🧠 Example Usage

```python
from dataflow.core import LLMServing
from dataflow.utils.storage import DataFlowStorage

# Initialize the operator with a max chunk length of 2000 tokens
operator = ChunkedPromptedGenerator(
    llm_serving=my_llm_instance,
    max_chunk_len=2000,
    seperator="\n---\n"
)

# Run the operator
operator.run(
    storage=my_storage,
    input_path_key="file_path",
    output_path_key="result_path"
)
```
#### 🧾 Output Logic

The operator automatically generates a result file with the suffix `_llm_output.txt` in the same directory as the input file.

| Field | Type | Description |
| --- | --- | --- |
| file_path | str | Path to the original input file (e.g., `data/doc.txt`). |
| result_path | str | Path where the generated result file is saved (e.g., `data/doc_llm_output.txt`). |
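The naming convention can be reproduced with a small helper. This is purely illustrative (the helper name is hypothetical; the operator derives the path internally):

```python
from pathlib import Path

def llm_output_path(input_path: str) -> str:
    """Derive the result-file path: data/doc.txt -> data/doc_llm_output.txt."""
    p = Path(input_path)
    # Same directory, original stem plus the _llm_output.txt suffix.
    return str(p.with_name(p.stem + "_llm_output.txt"))
```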
**Example Input DataFrame Row:**

```json
{
  "file_path": "/home/user/data/long_article.txt"
}
```

**Chunking Workflow:**

1. Read the content of `long_article.txt`.
2. Assume the text is split into `Chunk A` and `Chunk B`.
3. Call the LLM to obtain `Result A` and `Result B`.
4. Write `Result A\nResult B` into `/home/user/data/long_article_llm_output.txt`.

**Example Output DataFrame Row:**

```json
{
  "file_path": "/home/user/data/long_article.txt",
  "result_path": "/home/user/data/long_article_llm_output.txt"
}
```
Lines changed: 106 additions & 0 deletions
---
title: LLMOutputParser
createTime: 2026/01/20 20:15:00
permalink: /en/api/operators/core_text/parse/llmoutputparser/
---

## 📘 Overview

`LLMOutputParser` is a structured data parsing operator designed specifically to parse response text generated by Large Language Models (LLMs) that contains specific XML tags.

The core functionalities of this operator include:

1. **Tag Parsing**: Identifying and extracting content within tags such as `<chapter>`, `<qa_pair>`, `<question>`, `<answer>`, `<solution>`, and `<label>`.
2. **ID Restoration**: Mapping numerical IDs returned by the LLM back to the original text content or image tags (based on the converted layout files generated by `MinerU2LLMInputOperator`).
3. **Resource Synchronization**: Automatically copying associated images from the intermediate directory to the final output directory and correcting the image reference paths.

## `__init__` Function

```python
def __init__(self,
             mode: Literal['question', 'answer'],
             output_dir: str,
             intermediate_dir: str = "intermediate"
)
```
### Initialization Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| **mode** | str | Required | Parsing mode. Options are `'question'` or `'answer'`, which affects the output filename and the image subdirectory name. |
| **output_dir** | str | Required | The final root directory for structured data and images. |
| **intermediate_dir** | str | "intermediate" | The intermediate directory where original image resources processed by MinerU are located. |

## XML Tag Protocol

The operator expects the LLM to return data according to the following structure:

* `<chapter>`: A chapter block containing a title and multiple QA pairs.
  * `<title>`: The **ID** corresponding to the chapter title.
  * `<qa_pair>`: A block representing a single question-answer pair.
    * `<question>` / `<solution>`: A list of **IDs** (e.g., `1, 2, 5`) corresponding to the source content.
    * `<answer>`: The answer extracted from the solution. **This is actual text content, not an ID.**
    * `<label>`: Question type or label information. **This is a real sequence number/label, not an ID.**
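A protocol of this shape can be consumed with simple pattern matching. The sketch below uses the tag names from the protocol above, but the sample response values are made up and the operator's real parser may be more robust:

```python
import re

def extract(tag, text):
    """Return the contents of every <tag>...</tag> block in text."""
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)

# Hypothetical LLM response following the tag protocol; IDs and values are invented.
response = """
<chapter>
  <title>4</title>
  <qa_pair>
    <question>1, 3</question>
    <solution>5</solution>
    <answer>The answer text itself.</answer>
    <label>multiple-choice</label>
  </qa_pair>
</chapter>
"""

chapter = extract("chapter", response)[0]
# <question> holds a comma-separated list of IDs, not text.
question_ids = [int(i) for i in extract("question", chapter)[0].split(",")]
```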
## `run` Function

```python
def run(self,
        storage: DataFlowStorage,
        input_response_path_key: str,
        input_converted_layout_path_key: str,
        input_name_key: str,
        output_qalist_path_key: str
)
```

Executes the parsing logic: reads the LLM response, restores content using the layout JSON file, and saves the result in JSONL format.

#### Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| **storage** | DataFlowStorage | Required | DataFlow storage instance. |
| **input_response_path_key** | str | Required | Column name for the path to the original LLM response file. |
| **input_converted_layout_path_key** | str | Required | Column name for the path to the converted layout file (`_converted.json`). |
| **input_name_key** | str | Required | Column name for the task name, which determines the naming of the output folder. |
| **output_qalist_path_key** | str | Required | Column name to store the path of the generated JSONL file. |
## 🧠 Example Logic

### 1. ID Restoration Process

Suppose the LLM returns `<question>1, 3</question>`. The operator looks up the entries with `id` 1 and `id` 3 in the layout JSON:

* If `id: 1` is the text "What is AI?" and `id: 3` is the image `path/to/img.png`, the restored content will be: `What is AI?\n![image](images/img.png)`.
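A minimal sketch of this restoration step is shown below. The layout entry field names (`img_path` in particular) are assumptions for illustration, not the operator's confirmed schema:

```python
# Converted layout entries as produced upstream; field names are assumptions.
layout = [
    {"id": 1, "type": "text", "text": "What is AI?"},
    {"id": 2, "type": "text", "text": "An unrelated paragraph."},
    {"id": 3, "type": "image", "img_path": "path/to/img.png"},
]

def restore(ids, layout):
    """Map LLM-returned IDs back to text content or markdown image tags."""
    by_id = {item["id"]: item for item in layout}
    parts = []
    for i in ids:
        item = by_id[i]
        if item["type"] == "image":
            # Reference the image by file name under the synchronized images directory.
            name = item["img_path"].rsplit("/", 1)[-1]
            parts.append(f"![image](images/{name})")
        else:
            parts.append(item["text"])
    return "\n".join(parts)

restored = restore([1, 3], layout)
```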
### 2. Output File Structure

After execution, the directory structure under `output_dir` (referenced as `cache_path` in some contexts) will be as follows:

```text
output_dir/
└── {name}/
    ├── extracted_questions.jsonl   # Structured data
    └── question_images/            # Automatically synchronized images
        ├── img1.png
        └── ...
```

### 3. JSONL Output Example

```json
{
  "question": "Please analyze the image below:\n![image](question_images/fig1.png)",
  "answer": "This is the parsed answer text.",
  "solution": "Detailed step-by-step solution...",
  "label": "1",
  "chapter_title": "Chapter 1: Fundamentals"
}
```
Lines changed: 88 additions & 0 deletions
---
title: MinerU2LLMInputOperator
createTime: 2026/01/20 20:10:00
permalink: /en/api/operators/core_text/convert/mineru2llminputoperator/
---

## 📘 Overview

`MinerU2LLMInputOperator` is a format conversion operator specifically designed for processing **MinerU** parsing results. It transforms the underlying `_content_list.json` files generated by MinerU into a flattened format that is more suitable for Large Language Model (LLM) understanding and processing.

### Key Features:

* **List Flattening**: Breaks down complex `list` type items into individual `text` entries.
* **Data Cleaning**: Removes metadata that is typically unnecessary for LLMs, such as `bbox` (bounding box coordinates) and `page_idx` (page numbers).
* **Re-indexing**: Generates continuous and unique `id` values for all converted content items.

## `__init__` Function

```python
def __init__(self)
```

This operator does not require any additional parameters during initialization.
## `run` Function

```python
def run(self, storage: DataFlowStorage, input_markdown_path_key: str, output_converted_layout_key: str)
```

Executes the conversion logic: locates the corresponding MinerU JSON file based on the Markdown file path, processes it, saves it as a new file, and records the new path.

#### Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| **storage** | DataFlowStorage | Required | DataFlow storage instance. |
| **input_markdown_path_key** | str | Required | Input column name containing the paths to MinerU `.md` files. The operator automatically searches for `_content_list.json` in the same directory. |
| **output_converted_layout_key** | str | Required | Output column name to store the path of the processed `_converted.json` file. |
## 🧠 Conversion Logic Details

1. **Path Matching**: The operator retrieves the file path from `input_markdown_path_key` and replaces the `.md` extension with `_content_list.json` to read the original layout data.
2. **Content Processing**:
   * If an entry's type is `list` and its sub-type is `text`, the operator iterates through `list_items` and promotes each sub-item to an independent `text` entry.
   * Entries that are already `text` or other types are preserved.
3. **Format Simplification**: The `bbox` and `page_idx` fields are removed from all entries to reduce token interference and noise.
4. **File Output**: The resulting file is saved with a `_converted.json` suffix in the same directory as the original file.
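The conversion steps above can be sketched in a few lines. This is an illustrative reimplementation under the stated rules, not the operator's source code:

```python
# Simplified MinerU content list (structure per the conversion rules above).
content_list = [
    {"type": "list", "sub_type": "text",
     "list_items": ["Item One Content", "Item Two Content"],
     "bbox": [10, 20, 100, 200], "page_idx": 0},
    {"type": "text", "text": "A plain paragraph.",
     "bbox": [10, 210, 100, 240], "page_idx": 0},
]

def convert(entries):
    """Flatten text lists, drop bbox/page_idx, and re-index."""
    out = []
    for entry in entries:
        if entry.get("type") == "list" and entry.get("sub_type") == "text":
            # Promote each list item to an independent text entry.
            for item in entry["list_items"]:
                out.append({"type": "text", "text": item})
        else:
            # Preserve other entries, minus the metadata fields.
            out.append({k: v for k, v in entry.items()
                        if k not in ("bbox", "page_idx")})
    # Assign continuous, unique IDs after flattening.
    for i, entry in enumerate(out):
        entry["id"] = i
    return out

converted = convert(content_list)
```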
## 🧠 Example Usage

#### 🧾 Format Conversion Comparison

**Input (Original MinerU `_content_list.json`):**

```json
[
  {
    "type": "list",
    "sub_type": "text",
    "list_items": ["Item One Content", "Item Two Content"],
    "bbox": [10, 20, 100, 200],
    "page_idx": 0
  }
]
```

**Output (Processed `_converted.json`):**

```json
[
  {
    "type": "text",
    "text": "Item One Content",
    "id": 0
  },
  {
    "type": "text",
    "text": "Item Two Content",
    "id": 1
  }
]
```
