7 changes: 6 additions & 1 deletion docs/.vuepress/notes/en/guide.ts
@@ -116,7 +116,12 @@ export const Guide: ThemeNote = defineNoteConfig({
prefix: 'agent',
items: [
"agent_for_data",
"DataFlow-AgentPipelineOrchestration"
"DataFlow-AgentPipelineOrchestration",
"operator_assemble_line",
"operator_qa",
"operator_write",
"pipeline_prompt",
"pipeline_rec&refine"
]
},
],
7 changes: 6 additions & 1 deletion docs/.vuepress/notes/zh/guide.ts
@@ -115,7 +115,12 @@ export const Guide: ThemeNote = defineNoteConfig({
prefix: 'agent',
items: [
"agent_for_data",
"DataFlow-AgentPipelineOrchestration"
"DataFlow-AgentPipelineOrchestration",
"operator_assemble_line",
"operator_qa",
"operator_write",
"pipeline_prompt",
"pipeline_rec&refine"
]
},
// {
108 changes: 108 additions & 0 deletions docs/en/notes/guide/agent/operator_assemble_line.md
@@ -0,0 +1,108 @@
---
title: Visualized Operator Assemble Line
createTime: 2026/02/05 22:11:00
permalink: /en/guide/agent/operator_assemble_line/
---

## 1. Overview

**Visualized Operator Assemble Line** is a "low-code/no-code" development tool provided by the DataFlow-Agent platform. It allows users to bypass complex Python coding or AI planning processes by directly browsing available operators in the system via a Graphical User Interface (GUI), manually configuring parameters, and assembling them into ordered data processing pipelines.

The core value of this feature lies in:

* **What You See Is What You Get**: Real-time viewing of operator parameter definitions and pipeline structures.
* **Automatic Linking**: The system automatically attempts to match the output of a previous operator with the input of the next, simplifying data flow configuration.
* **Code Generation and Execution**: Assembled logic is automatically converted into standard Python code and executed in the background.

## 2. Features

This functional module primarily consists of frontend interaction logic (`op_assemble_line.py`) and a backend execution workflow (`wf_df_op_usage.py`).

### 2.1 Dynamic Operator Loading and Introspection

The system automatically scans the `OPERATOR_REGISTRY` upon startup, loading all registered operators and categorizing them based on their module paths.

* **Automatic Parameter Parsing**: Using Python's `inspect` module, the system automatically extracts the method signatures of each operator class's `__init__` and `run` methods to generate the corresponding configuration boxes in the UI.
* **Prompt Template Support**: For operators that support Prompts, the UI automatically reads `ALLOWED_PROMPTS` and provides a dropdown selection box.
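As a rough illustration of this introspection step, the sketch below extracts parameter names and defaults from `__init__` and `run` via `inspect.signature`. The `DemoOperator` class and `extract_params` helper are hypothetical stand-ins, not DataFlow code:

```python
import inspect

class DemoOperator:
    """Stand-in for an operator loaded from OPERATOR_REGISTRY."""
    def __init__(self, prompt_template: str = "default", max_times: int = 3):
        self.prompt_template = prompt_template
        self.max_times = max_times

    def run(self, input_key: str = "raw_content", output_key: str = "result"):
        pass

def extract_params(cls, method_name):
    """Return {parameter: default} for a method, skipping 'self'."""
    sig = inspect.signature(getattr(cls, method_name))
    return {
        name: (p.default if p.default is not inspect.Parameter.empty else None)
        for name, p in sig.parameters.items()
        if name != "self"
    }

init_params = extract_params(DemoOperator, "__init__")
run_params = extract_params(DemoOperator, "run")
```

The resulting dictionaries are what a UI would render as editable configuration boxes.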

### 2.2 Intelligent Parameter Linking

During the UI orchestration process, the system features "automatic wiring" capabilities. It analyzes the input-output relationships between adjacent operators, automatically matches keys with similar names, and displays the data flow through visualized connections.
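A minimal sketch of such name-based wiring is shown below; the similarity heuristic and threshold are illustrative, not the platform's actual matching logic:

```python
from difflib import SequenceMatcher

def auto_wire(prev_outputs, next_inputs, threshold=0.6):
    """For each input key of the next operator, pick the most similar
    output key of the previous operator (a purely name-based heuristic)."""
    wiring = {}
    for in_key in next_inputs:
        best_key, best_score = None, threshold
        for out_key in prev_outputs:
            score = SequenceMatcher(None, in_key, out_key).ratio()
            if score > best_score:
                best_key, best_score = out_key, score
        if best_key:
            wiring[in_key] = best_key
    return wiring

# Exact name match wins; unrelated keys stay unwired.
links = auto_wire(["generated_cot", "raw_content"], ["generated_cot"])
```

In the UI, pairs like those in `links` would be drawn as connections between adjacent operators.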

## 3. User Guide

This feature provides two modes of use: the **Graphical Interface (Gradio UI)** and **Command-line Scripts**.

### 3.1 UI Operation

Ideal for interactive exploration and rapid verification.

1. **Environment Configuration**: Enter API-related information and the input JSONL file path at the top of the page.
2. **Orchestrate Pipeline**:
1. **Select Operator**: Choose an operator category and a specific operator from the left dropdown menu.
2. **Configure Parameters**: Enter parameters into the JSON edit box.
3. **Add Operator**: Click the "Add Operator to Pipeline" button and drag items in the list below to adjust the execution order.
3. **Run and Results**: Click "Run Pipeline" to view the generated code and a preview of the processed results in the execution result section.

### 3.2 Script Invocation and Explicit Configuration

For automated tasks or batch processing, the `run_dfa_op_assemble.py` script can be used. This method bypasses the UI and defines the operator sequence directly through code.

> **Note: Explicit Configuration Requirement**: Unlike the "Automatic Linking" in the UI, the script mode requires you to **explicitly configure** all parameters. You must ensure that the `output_key` of the previous operator strictly matches the `input_key` of the next; the script will not automatically correct parameter names for you.
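Because the script will not correct mismatched keys, a small pre-flight check can catch broken chains before anything runs. The sketch below assumes the `PIPELINE_STEPS` shape used in this section and an initial `raw_content` field in the input file; `validate_key_chain` is a hypothetical helper, not part of DataFlow:

```python
def validate_key_chain(steps, initial_fields=("raw_content",)):
    """Check that each step's input_key was produced by an earlier
    step's output_key* parameter (or exists in the initial data)."""
    produced = set(initial_fields)
    errors = []
    for step in steps:
        params = step["params"]
        in_key = params.get("input_key")
        if in_key and in_key not in produced:
            errors.append(f"{step['op_name']}: '{in_key}' is never produced")
        for name, value in params.items():
            if name.startswith("output_key"):
                produced.add(value)
    return errors

steps = [
    {"op_name": "ReasoningAnswerGenerator",
     "params": {"input_key": "raw_content", "output_key": "generated_cot"}},
    {"op_name": "ReasoningPseudoAnswerGenerator",
     "params": {"input_key": "generated_cot", "output_key_answer": "pseudo_answers"}},
]
errors = validate_key_chain(steps)  # empty list: the chain is consistent
```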

#### 1. Modify Configuration

Open `run_dfa_op_assemble.py` and modify the configuration area at the top of the file.

**Key Configuration Item**: **`PIPELINE_STEPS`**—a list defining the pipeline execution steps. Each element contains an `op_name` and `params`.

```python
# [Pipeline Definition]
PIPELINE_STEPS = [
    {
        "op_name": "ReasoningAnswerGenerator",
        "params": {
            # __init__ parameters (Note: unified into 'params' in wf_df_op_usage)
            "prompt_template": "dataflow.prompts.reasoning.math.MathAnswerGeneratorPrompt",
            # run parameters
            "input_key": "raw_content",
            "output_key": "generated_cot"
        }
    },
    {
        "op_name": "ReasoningPseudoAnswerGenerator",
        "params": {
            "max_times": 3,
            "input_key": "generated_cot",
            "output_key_answer": "pseudo_answers",
            "output_key_answer_value": "pseudo_answer_value",
            "output_key_solutions": "pseudo_solutions",
            "output_key_correct_solution_example": "pseudo_correct_solution_example"
        }
    }
]

```

**Other Required Configurations**:

* `CACHE_DIR`: **Must use an absolute path** to avoid path errors when the generated Python script executes in a subprocess.
* `INPUT_FILE`: The absolute path to the initial data file.

#### 2. Run Script

```bash
python run_dfa_op_assemble.py

```

#### 3. Output Results

After execution, the console will print:

* **[Generation]**: The path of the generated Python script (e.g., `pipeline_script_pipeline_001.py`).
* **[Code Preview]**: A preview of the first 20 lines of the generated code.
* **[Execution]**:
* `Status: success` indicates successful execution.
* `STDOUT`: Prints the standard output logs from the pipeline runtime.

112 changes: 112 additions & 0 deletions docs/en/notes/guide/agent/operator_qa.md
@@ -0,0 +1,112 @@
---
title: Operator QA
createTime: 2026/02/05 22:11:00
permalink: /en/guide/agent/operator_qa/
---

## 1. Overview

**Operator QA** is a built-in vertical domain expert assistant within the DataFlow-Agent platform. Its core mission is to help users quickly navigate the extensive DataFlow operator library to find required tools, understand their usage, and inspect underlying source code.

Unlike generic chatbots, Operator QA integrates **RAG (Retrieval-Augmented Generation)** technology. It is equipped with a complete operator index (FAISS) and a metadata knowledge base of the DataFlow project. When a user asks a question, the Agent autonomously decides whether to retrieve information from the knowledge base, which operators to inspect, and provides accurate technical details—including code snippets and parameter descriptions—back to the user.

## 2. Core Features

This module is driven by a frontend UI (`operator_qa.py`), an entry script (`run_dfa_operator_qa.py`), and a backend agent (`operator_qa_agent.py`). It possesses the following core capabilities:

### 2.1 Intelligent Retrieval and Recommendation

The Agent does more than simple keyword matching; it identifies user needs based on semantic understanding.

* **Semantic Search**: If a user describes a need like "I want to filter out missing values," the Agent uses vector retrieval to find relevant operators such as `ContentNullFilter`.
* **On-Demand Invocation**: Running in the `BaseAgent` graph mode (`use_agent=True`), the Agent decides from the conversation context whether to call the `search_operators` tool or respond directly.

### 2.2 Multi-turn Conversation

Utilizing the `AdvancedMessageHistory` module, the system maintains a complete session context.

* **Contextual Memory**: A user can ask, "Which operators can load data?" followed by "How do I fill in **its** parameters?" The Agent can recognize that "its" refers to the operator recommended in the previous turn.
* **State Persistence**: In both script interaction and UI modes, by reusing the same `state` and `graph` instances, the `messages` list accumulates across multiple turns, ensuring the LLM maintains a full memory.
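This accumulation behaviour can be pictured with a minimal stand-in for the history module; the class below is an illustrative sketch, not the `AdvancedMessageHistory` API:

```python
class MessageHistory:
    """Toy stand-in: one list accumulates across turns, so each new
    LLM call sees the full conversation so far."""
    def __init__(self):
        self.messages = []

    def add_turn(self, user_text, assistant_text):
        self.messages.append({"role": "user", "content": user_text})
        self.messages.append({"role": "assistant", "content": assistant_text})

    def clear(self):
        self.messages.clear()

history = MessageHistory()
history.add_turn("Which operators can load data?", "Storage-based loaders can.")
history.add_turn("How do I fill in its parameters?", "Pass the file path when constructing it.")
```

Because the second turn is appended to the same list, the model can resolve "its" against the operator mentioned in the first turn.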

### 2.3 Visualization and Interaction

* **Gradio UI**: Provides code previews, operator highlighting, and quick-question buttons.
* **Interaction**: Supports multi-turn Q&A, clearing history, and viewing history.

## 3. Architectural Components

### 3.1 OperatorQAAgent

* Inherits from `BaseAgent` and is configured in ReAct/Graph mode.
* Possesses Post-Tools permissions to call RAG services for data retrieval.
* Responsible for parsing natural language, planning database queries, and generating final natural language responses.

### 3.2 OperatorRAGService

* A service layer decoupled from the Agent.
* Manages the FAISS vector index and `ops.json` metadata.
* Provides underlying capabilities such as `search` (vector search), `get_operator_info` (fetch details), and `get_operator_source` (fetch source code).
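To make the retrieval idea concrete, here is a toy search over operator descriptions. It substitutes a bag-of-words cosine similarity for the real embedding model and FAISS index, so every name and score below is illustrative:

```python
import math

def embed(text):
    """Toy embedding: word-count vector (a real service would call an
    embedding model and store vectors in a FAISS index)."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

OPERATOR_DOCS = {
    "ContentNullFilter": "filter rows with missing or null values",
    "ReasoningAnswerGenerator": "generate chain of thought answers",
}

def search(query, top_k=1):
    q = embed(query)
    ranked = sorted(OPERATOR_DOCS,
                    key=lambda name: cosine(q, embed(OPERATOR_DOCS[name])),
                    reverse=True)
    return ranked[:top_k]

hits = search("filter out missing values")
```

Even this crude similarity ranks `ContentNullFilter` first for a "missing values" query, which is the behaviour the vector index provides at scale.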

## 4. User Guide

### 4.1 UI Operation

1. **Configure Model**: In the "Configuration" panel on the right, verify the API URL and Key, and select a model (defaults to `gpt-4o`).
2. **Initiate Inquiry**:
1. **Dialogue Box**: Type your question.
2. **Quick Buttons**: Click "Quick Question" buttons, such as "Which operator filters missing values?" to start instantly.
3. **View Results**:
1. **Chat Area**: Displays the Agent's response and citations.
2. **Right Panel**:
* `Related Operators`: Lists operator names retrieved by the Agent.
* `Code Snippets`: Displays Python source code if specific implementations are involved.

### 4.2 Script Invocation and Explicit Configuration

Beyond the UI, the system provides the `run_dfa_operator_qa.py` script, which supports running the Q&A service through explicit code configuration—ideal for development and debugging.

**Configuration Method:** Directly modify the constant configuration area at the top of the script without passing command-line arguments:

```python
# ===== Example config (edit here) =====
INTERACTIVE = False # True for multi-turn mode, False for single query
QUERY = "Which operator should I use to filter missing values?" # Question for single query

LANGUAGE = "en"
SESSION_ID = "demo_operator_qa"
CACHE_DIR = "dataflow_cache"
TOP_K = 5 # Number of retrieval results

CHAT_API_URL = os.getenv("DF_API_URL", "http://123.129.219.111:3000/v1/")
API_KEY = os.getenv("DF_API_KEY", "")
MODEL = os.getenv("DF_MODEL", "gpt-4o")

OUTPUT_JSON = "" # e.g., "cache_local/operator_qa_result.json"; empty string means no file saving

```

**Execution Modes:**

1. **Single Query Mode** (`INTERACTIVE = False`): Executes a single `QUERY`; results can be printed or saved as a JSON file.
2. **Interactive Mode** (`INTERACTIVE = True`): Starts a terminal dialogue loop supporting `exit` to quit, `clear` to reset context, and `history` to view session history.
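The command handling in interactive mode can be sketched as a small dispatcher; the function name and return values here are illustrative, not the script's actual API:

```python
def handle_command(text, history):
    """Dispatch a line of terminal input, mirroring the documented
    commands: exit, clear, history. Anything else is a query."""
    cmd = text.strip().lower()
    if cmd == "exit":
        return "quit"
    if cmd == "clear":
        history.clear()
        return "cleared"
    if cmd == "history":
        return "\n".join(history) or "(empty)"
    history.append(text)
    return "query"

log = []
r1 = handle_command("Which operator loads data?", log)  # a normal query
r2 = handle_command("history", log)                     # dumps the log
r3 = handle_command("clear", log)                       # resets context
r4 = handle_command("exit", log)                        # ends the loop
```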

**Core Logic:** The script demonstrates how to explicitly construct `DFRequest` and `MainState`, and manually build the execution graph:

```python
# 1. Explicitly construct the request
req = DFRequest(
language=LANGUAGE,
chat_api_url=CHAT_API_URL,
api_key=API_KEY,
model=MODEL,
target="" # Populated before each query
)

# 2. Initialize state and graph
state = MainState(request=req, messages=[])
graph = create_operator_qa_graph().build()

# 3. Execute
result = await run_single_query(state, graph, QUERY)

```
120 changes: 120 additions & 0 deletions docs/en/notes/guide/agent/operator_write.md
@@ -0,0 +1,120 @@
---
title: Operator Write
createTime: 2026/02/05 22:11:00
permalink: /en/guide/agent/operator_write/
---

## 1. Overview

**Operator Write** is the core productivity module of the DataFlow-Agent. It is not merely a tool for generating Python code based on user requirements but rather builds a closed-loop system for **generation, execution, and debugging**.

This workflow enables:

1. **Semantic Matching**: Understanding user intent (e.g., "filter missing values") and finding the best-matching base class within the existing operator library.
2. **Code Generation**: Writing executable operator code based on the base class and user data samples.
3. **Automatic Injection**: Automatically injecting LLM service capabilities into the operator if needed.
4. **Subprocess Execution**: Instantiating and running the generated operator in a controlled environment.
5. **Self-Healing**: Launching a Debugger to analyze stack traces if execution fails, automatically modifying the code, and retrying until success or the maximum retry limit is reached.
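The generate-execute-debug loop can be sketched as follows. `run_with_debug` and the toy fixer are illustrative stand-ins for the real `code_debugger`/`rewriter` agents, which analyze the stack trace with an LLM rather than doing a string replacement:

```python
def run_with_debug(code_str, max_rounds=3, fixer=None):
    """Execute generated code; on failure, hand the error trace to a
    fixer and retry, up to max_rounds attempts."""
    for attempt in range(1, max_rounds + 1):
        namespace = {}
        try:
            exec(code_str, namespace)
            return {"success": True, "attempts": attempt, "ns": namespace}
        except Exception as exc:
            error_trace = repr(exc)
            if fixer is None:
                break
            code_str = fixer(code_str, error_trace)
    return {"success": False, "attempts": attempt}

def toy_fixer(code, trace):
    """Toy 'rewriter': patch a known typo instead of calling an LLM."""
    return code.replace("reslt", "result")

broken = "result = 21 * 2\nanswer = reslt"  # NameError on line 2
outcome = run_with_debug(broken, fixer=toy_fixer)
```

The first attempt raises `NameError`, the fixer rewrites the code, and the second attempt succeeds, which is the shape of the self-healing cycle described above.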

## 2. Core Features

### 2.1 Intelligent Code Generation

* **Sample-Based Programming**: The Agent reads actual data samples (calling the pre-tool `local_tool_for_sample`) and the data Schema to ensure the generated code correctly handles real field names and data types.
* **Operator Reuse**: The system first searches the existing operator library (calling the pre-tool `match_operator`) and generates code that inherits from existing base classes rather than starting from scratch, keeping the output standardized and maintainable.

### 2.2 Automatic Debugging Loop

This is a system equipped with self-reflection capabilities.

* **Execution Monitoring**: At the `llm_instantiate` node, the system attempts to execute the generated code (`exec(code_str)`) and captures standard output and standard errors.
* **Error Diagnosis**: If an exception occurs, the `code_debugger` Agent analyzes the error stack (`error_trace`) and the current code to generate repair suggestions (`debug_reason`).
* **Auto-Rewrite**: The `rewriter` Agent regenerates the code based on the repair suggestions, automatically updates the file, and enters the next round of testing.

### 2.3 LLM Service Injection

For complex operators requiring Large Model calls (e.g., "generate summary based on content"), the `llm_append_serving` node automatically injects standard LLM call interfaces (`self.llm_serving`) into the operator code, empowering it with AI capabilities.
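A simplified picture of the injection step: append a hook that stores an LLM backend on the operator instance. The snippet and helper below are illustrative stand-ins for the `llm_append_serving` node, not its actual implementation:

```python
LLM_SERVING_SNIPPET = '''
    def set_llm_serving(self, serving):
        """Injected hook: gives the operator access to an LLM backend."""
        self.llm_serving = serving
'''

def append_serving(operator_code):
    """Append the llm_serving hook to a generated operator class,
    skipping classes that already reference llm_serving."""
    if "llm_serving" in operator_code:
        return operator_code
    return operator_code.rstrip() + "\n" + LLM_SERVING_SNIPPET

generated = (
    "class SummaryOperator:\n"
    "    def run(self, text):\n"
    "        return text[:10]"
)
patched = append_serving(generated)
```

After patching, the class gains `set_llm_serving`, so the surrounding workflow can hand it a configured LLM client before `run` is called.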

## 3. Workflow Architecture

This feature is orchestrated by `wf_pipeline_write.py`, forming a directed graph containing conditional loops.

1. **Match Node**: Retrieves reference operators.
2. **Write Node**: Writes the initial code.
3. **Append Serving Node**: Injects LLM capabilities.
4. **Instantiate Node**: Attempts to run the code.
5. **Debugger Node** (Conditional Trigger): Analyzes errors.
6. **Rewriter Node**: Fixes the code.

## 4. User Guide

This feature provides two modes of usage: **Graphical Interface (Gradio UI)** and **Command Line Script**.

### 4.1 UI Operation

The frontend page code is located in `operator_write.py`, offering a visualized interactive experience.

#### 1. Configure Inputs

Configure the following in the left panel of the page:

* **Target Description**: Describe in detail the function and purpose of the operator you want to create.
* Example: "Create an operator that performs sentiment analysis on text."
* **Operator Category**: The category the operator belongs to, used for matching similar operators as references. Defaults to `"Default"`; options include `"filter"`, `"mapper"`, `"aggregator"`, etc.
* **Test Data File**: Specify the `.jsonl` file path used for testing the generated operator. Defaults to the project's built-in `tests/test.jsonl`.
* **Debug Settings**:
* `Enable Debug Mode`: If checked, the system automatically attempts to fix the code if an error occurs.
* `Max Debug Rounds`: Set the maximum number of automatic repair attempts (default is 3).
* **Output Path**: Specify the save path for the generated code (optional).

#### 2. View Results

After clicking the **"Generate Operator"** button, the right panel displays detailed results:

* **Generated Code**: Final usable Python code, supporting syntax highlighting.
* **Matched Operators**: Displays the list of reference operators found by the system in the library (e.g., `"LangkitSampleEvaluator"`, `"LexicalDiversitySampleEvaluator"`, `"PresidioSampleEvaluator"`, `"PerspectiveSampleEvaluator"`, etc.).
* **Execution Result**: Shows `success: true/false` and specific log information `stdout`/`stderr`.
* **Debug Info**: If debugging was triggered, this displays the runtime captured `stdout`/`stderr` and the selected input field key (`input_key`).
* **Agent Results**: Detailed execution results for each Agent node.
* **Execution Log**: Complete execution log information, facilitating the troubleshooting of the Agent's thought process.

### 4.2 Script Invocation and Explicit Configuration

For developers or automated tasks, `run_dfa_operator_write.py` can be executed directly.

#### 1. Modify Configuration

Open `run_dfa_operator_write.py` and modify the parameters in the configuration area at the top of the file:

```python
CHAT_API_URL = os.getenv("DF_API_URL", "http://123.129.219.111:3000/v1/")
MODEL = os.getenv("DF_MODEL", "gpt-4o")
LANGUAGE = "en"

TARGET = "Create an operator that filters out missing values and keeps rows with non-empty fields."
CATEGORY = "Default" # Fallback category (if classifier misses)
OUTPUT_PATH = "" # e.g., "cache_local/my_operator.py"; empty string means no file saving
JSON_FILE = "" # Empty string uses project built-in tests/test.jsonl

NEED_DEBUG = False
MAX_DEBUG_ROUNDS = 3

```

#### 2. Run Script

```bash
python run_dfa_operator_write.py

```

#### 3. Output Results

The script will print key information to the console:

* `Matched ops`: The matched reference operators.
* `Code preview`: A preview fragment of the generated code.
* `Execution Result`:
* `Success: True` indicates code generation and execution passed.
* `Success: False` will print `stderr preview` for troubleshooting.
* `Debug Runtime Preview`: Displays the automatically selected `input_key` and runtime logs.