242 changes: 242 additions & 0 deletions data-pipeline/README.md
@@ -0,0 +1,242 @@
# Log-to-Training Data Pipeline for Agentic AI (DMP 2026)

## Core Idea

In the OAN (OpenAgriNet) system, user interactions involve multi-step agent behavior (LLM + tools such as weather APIs, mandi prices, etc.). Raw logs from such systems are noisy, inconsistent and not directly suitable for training.

This project builds a **data quality layer** that converts these raw logs into reliable, training-ready datasets.

The pipeline:
- removes incorrect, weak and low-information responses
- detects inefficient multi-step behavior such as unnecessary tool/API calls
- ensures only reliable, high-quality interactions are used for training

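The goal is to improve the reliability and efficiency of the responses the OAN agentic AI system generates in real-world agricultural use cases, since a model fine-tuned on cleaner interactions has fewer failed or wasteful examples to imitate.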


- **Intelligent Quality Detection** - Filters out bad and weak responses
- **Explainable Filtering** - Shows WHY each entry was kept/removed
- **PII-Safe Processing** - Removes sensitive information
- **Training-Ready Exports** - JSONL format for LoRA fine-tuning
- **Inefficiency Detection** - Flags inefficient multi-step behavior and attaches optimization hints

## Where This Fits in OAN Architecture

In the OAN pipeline:

User → Voice/Text → Agentic LLM → Tools (APIs) → Response → Logs

This project operates on the **logs generated by this system**.

Logs → [This Pipeline] → Clean Training Data → Model Training

Better training data is intended to improve:
- response accuracy
- tool usage efficiency
- overall system reliability for farmers
- consistency of responses across different user queries

This pipeline acts as a data quality layer between raw system logs and model training.

## Example Pipeline Output

```
Total logs: 108

After filtering:
- Kept for training (high-quality): 100
- Removed:
- Weak: 1
- Bad: 7

Inefficient entries: 2

Removed entries summary:
- failure: error fetching (1)
- minimal response (1)
- empty response (1)
- too brief (1)
- failure: unable to (2)
- error only (2)

Exported to data/output.jsonl
```

## Key Innovations

### 1. **Multi-Tier Quality Assessment**
- **Bad entries**: Empty responses, minimal responses ("ok"), error responses
- **Weak entries**: Low-information responses (too brief or vague, e.g., "maybe", "might")
- **Good entries**: Helpful, detailed, professional responses
- **Explainable reasons**: Each filtered entry includes WHY it was removed

### 2. **Smart Error Detection**
- **Not naive**: Does not flag a response just because the word "error" appears somewhere in it
- **Specific patterns**: Only actual failures like "error fetching", "unable to"
- **Failure categories**: Differentiates between different types of errors

### 3. **Inefficiency Analysis**
- **Step-based detection**: An entry with more than 3 steps is flagged as inefficient
- **Optimization hints**: Provides suggestions for improvement
- **Training signals**: Includes inefficiency metadata in exports

### 4. **Training-Ready Format**
- **JSONL structure**: One JSON object per line for streaming
- **Chat template**: User/assistant role format
- **Rich metadata**: Quality scores, reasons, optimization hints

## Usage

```bash
python main.py
```

**Output:** `data/output.jsonl` (100 high-quality training examples for the 108-entry sample in `data/raw_logs.json`)

## Folder Structure and Files

### Root Directory
```
data-pipeline/
├── main.py # Main pipeline orchestrator
├── README.md # This documentation
├── requirements.txt # Python dependencies
├── data/ # Input and output data
└── src/ # Pipeline modules
```

### Data Directory (`data/`)
```
data/
├── raw_logs.json # Input: Raw production logs (108 entries)
└── output.jsonl # Output: Filtered training data (100 entries)
```

**`raw_logs.json`** - Contains raw agentic logs with:
- `input`: User question/request
- `output`: Bot response
- `agent_turns`: Array of interaction steps

**`output.jsonl`** - Contains filtered training data with:
- `messages`: Chat template format (user/assistant roles)
- `metadata`: Quality scores, reasons, optimization hints

### Source Modules (`src/`)

#### `src/ingest.py`
**Purpose**: Load raw log data
**Function**: `load_logs(path)` - Reads JSON file and returns log array
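
A minimal sketch of what `load_logs` could look like, assuming `raw_logs.json` holds a single top-level JSON array. The type check and the error message are illustrative additions, not taken from the actual module.

```python
import json
from pathlib import Path


def load_logs(path):
    """Read a JSON file and return its top-level array of log entries."""
    with Path(path).open(encoding="utf-8") as handle:
        logs = json.load(handle)
    # Assumption: the file is one JSON array; anything else is rejected early.
    if not isinstance(logs, list):
        raise ValueError(
            f"expected a JSON array of log entries, got {type(logs).__name__}"
        )
    return logs
```

Failing early on an unexpected top-level type keeps malformed input from surfacing later as a confusing error inside the scoring stage.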

#### `src/clean.py`
**Purpose**: Remove Personally Identifiable Information (PII)
**Function**: `remove_pii(text)` - Replaces sensitive data with placeholders
**Features**:
- Email detection → `<EMAIL>`
- Phone detection → `<PHONE>`
- Government ID detection (Aadhaar, PAN) → `<GOVT_ID>`
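
The placeholder substitution can be sketched with regular expressions. The patterns below are assumptions chosen for illustration (Indian 10-digit mobile numbers, 12-digit Aadhaar numbers, the 5-letter/4-digit/1-letter PAN layout); the real `src/clean.py` may use different or stricter rules.

```python
import re

# Illustrative patterns only; not taken from the actual module.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Aadhaar: 12 digits, optionally grouped 4-4-4 with spaces or hyphens.
AADHAAR_RE = re.compile(r"(?<!\d)\d{4}[\s-]?\d{4}[\s-]?\d{4}(?!\d)")
# PAN: five uppercase letters, four digits, one uppercase letter.
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")
# Phone: optional +91 prefix, then a 10-digit number starting with 6-9.
PHONE_RE = re.compile(r"(?<!\d)(?:\+91[\s-]?)?[6-9]\d{9}(?!\d)")


def remove_pii(text):
    """Replace emails, government IDs and phone numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    # IDs are replaced before phones so a 12-digit ID is never read as a phone.
    text = AADHAAR_RE.sub("<GOVT_ID>", text)
    text = PAN_RE.sub("<GOVT_ID>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text
```

Regex-based detection is a first line of defense only; names, addresses and free-form identifiers would need a different approach.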

#### `src/process.py`
**Purpose**: Process raw logs into structured format
**Functions**:
- `detect_inefficiency(entry)` - Identifies inefficient behavior (>3 steps)
- `process_logs(logs)` - Main processing function
**Features**:
- Calculates steps from `agent_turns`
- Applies inefficiency detection
- Cleans input/output text
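
A rough sketch of the processing step, assuming the step count is simply the length of `agent_turns`. The `clean` parameter is a hypothetical hook standing in for `remove_pii`, and the names of the output fields other than `steps` and `inefficient` are guesses.

```python
MAX_EFFICIENT_STEPS = 3  # more than 3 steps counts as inefficient


def detect_inefficiency(entry):
    """Return True when the entry took more steps than the threshold allows."""
    return entry["steps"] > MAX_EFFICIENT_STEPS


def process_logs(logs, clean=lambda text: text):
    """Turn raw log dicts into structured entries.

    `clean` is a hypothetical hook for PII removal; it defaults to a no-op
    so this sketch stays self-contained.
    """
    processed = []
    for log in logs:
        entry = {
            "input": clean(log.get("input") or ""),
            "output": clean(log.get("output") or ""),
            # Assumption: one element of agent_turns equals one step.
            "steps": len(log.get("agent_turns") or []),
        }
        entry["inefficient"] = detect_inefficiency(entry)
        processed.append(entry)
    return processed
```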

#### `src/score.py`
**Purpose**: Intelligent quality assessment and scoring
**Functions**:
- `is_bad(entry)` - Hard filters for obviously bad responses
- `is_weak(entry)` - Detects low-quality but not terrible responses
- `score_entry(entry)` - Main scoring function
**Quality Rules**:
- **Bad**: Empty, minimal ("ok"), error responses
- **Weak**: Too brief (<20 chars), uncertain language
- **Good**: Helpful, detailed responses

#### `src/select.py`
**Purpose**: Diversity filtering and deduplication
**Functions**:
- `select_diverse(entries)` - Removes duplicate inputs
- `balance_by_steps(entries)` - Balances by complexity (unused in current version)
**Features**:
- Input normalization (lowercase, remove punctuation)
- Duplicate detection and removal
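
Deduplication on normalized inputs might look like the sketch below. Keeping the first occurrence and collapsing whitespace are assumptions; `normalize_input` is a hypothetical helper name.

```python
import string

# Translation table that deletes ASCII punctuation. Punctuation from other
# scripts is left untouched in this sketch.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_input(text):
    """Lowercase, strip ASCII punctuation and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def select_diverse(entries):
    """Keep the first entry for each distinct normalized input."""
    seen = set()
    unique = []
    for entry in entries:
        key = normalize_input(entry["input"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(entry)
    return unique
```

This only catches near-identical phrasings; paraphrased questions would still count as distinct inputs.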

#### `src/improve.py`
**Purpose**: Add optimization hints for inefficient entries
**Functions**:
- `add_optimization_hint(entry)` - Suggests improvements
**Features**:
- "Reduce unnecessary steps or tool calls" for inefficient entries
- "Efficient" for good entries
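
As a sketch, the hint can be a single field set from the `inefficient` flag. The field name `optimization_suggestion` follows the metadata shown under Output Format; returning a copy instead of mutating the entry is an assumption.

```python
def add_optimization_hint(entry):
    """Return a copy of the entry with an optimization_suggestion field."""
    hinted = dict(entry)
    if hinted.get("inefficient"):
        hinted["optimization_suggestion"] = "Reduce unnecessary steps or tool calls"
    else:
        hinted["optimization_suggestion"] = "Efficient"
    return hinted
```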

#### `src/export.py`
**Purpose**: Export data in training-ready format
**Functions**:
- `format_chat_template(entry)` - Converts to chat template format
- `export_jsonl(data, path)` - Exports to JSONL file
**Features**:
- Chat template structure (user/assistant roles)
- Rich metadata inclusion (scores, reasons, hints)
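
A possible shape for the export step, assuming each scored entry carries the flat fields used below and that `export_jsonl` applies the chat template itself. Both points are assumptions about the actual module.

```python
import json

METADATA_FIELDS = (
    "score",
    "steps",
    "inefficient",
    "optimization_suggestion",
    "quality_type",
    "quality_reason",
)


def format_chat_template(entry):
    """Convert a flat entry into the messages/metadata structure."""
    return {
        "messages": [
            {"role": "user", "content": entry["input"]},
            {"role": "assistant", "content": entry["output"]},
        ],
        # Missing fields become None rather than raising in this sketch.
        "metadata": {name: entry.get(name) for name in METADATA_FIELDS},
    }


def export_jsonl(data, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for entry in data:
            record = format_chat_template(entry)
            # ensure_ascii=False keeps non-Latin text readable in the file.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```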

### Main Pipeline (`main.py`)

**Purpose**: Orchestrates the entire pipeline
**Process**:
1. Load raw logs (`load_logs`)
2. Process and clean (`process_logs`)
3. Add optimization hints (`add_optimization_hint`)
4. Score entries (`score_entry`)
5. Separate by quality type
6. Apply diversity filtering (`select_diverse`)
7. Export results (`export_jsonl`)
8. Generate detailed logging report
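
Steps 5 and 8 can be sketched as a partition by `quality_type` followed by a reason count. The helper names `split_by_quality` and `build_report`, the `"weak"` and `"bad"` type labels, and the report layout are illustrative assumptions rather than the actual `main.py` code.

```python
from collections import Counter


def split_by_quality(entries):
    """Group scored entries by their quality_type field."""
    buckets = {"good": [], "weak": [], "bad": []}
    for entry in entries:
        buckets.setdefault(entry["quality_type"], []).append(entry)
    return buckets


def build_report(entries):
    """Summarize how many entries were kept, removed, and why."""
    buckets = split_by_quality(entries)
    removed = buckets["weak"] + buckets["bad"]
    reasons = Counter(entry["quality_reason"] for entry in removed)
    inefficient = sum(1 for entry in entries if entry.get("inefficient"))
    lines = [
        f"Total logs: {len(entries)}",
        f"Kept for training (high-quality): {len(buckets['good'])}",
        f"Removed: weak={len(buckets['weak'])}, bad={len(buckets['bad'])}",
        f"Inefficient entries: {inefficient}",
        "Removed entries summary:",
    ]
    lines += [f"- {reason} ({count})" for reason, count in reasons.most_common()]
    return "\n".join(lines)
```

Counting reasons rather than only totals is what makes the filtering explainable: a sudden jump in one reason points at a specific upstream failure.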


## Quality Filtering Logic

The pipeline uses a **three-tier quality system**:

### Bad Entries (Always Removed)
- **Empty response**: `""` → Reason: "empty response"
- **Minimal response**: "ok", "yes", "done" → Reason: "minimal response"
- **Error only**: "err", "error", "failed" → Reason: "error only"
- **Failure patterns**: "error fetching", "unable to", "failed to" → Reason: "failure: [pattern]"

### Weak Entries (Filtered Out)
- **Too brief**: <20 characters → Reason: "too brief"
- **Uncertain language**: "maybe", "might", "possibly" → Reason: "uncertain language"

### Good Entries (Kept for Training)
- **Helpful responses**: Detailed, professional answers
- **Proper length**: 20 or more characters with substance
- **No uncertainty**: Confident, clear responses
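
The rules above can be sketched as a small classifier. The word lists and reason strings come from this section; having `is_bad` and `is_weak` return a reason string or `None`, and ignoring trailing punctuation, are assumptions. The numeric `score` field is left out because its scale is not defined here.

```python
import re

MINIMAL_RESPONSES = {"ok", "yes", "done"}
ERROR_ONLY = {"err", "error", "failed"}
FAILURE_PATTERNS = ("error fetching", "unable to", "failed to")
UNCERTAIN_WORDS = {"maybe", "might", "possibly"}
MIN_LENGTH = 20


def _output_text(entry):
    return (entry.get("output") or "").strip().lower()


def is_bad(entry):
    """Return the reason the entry is bad, or None if it is not."""
    text = _output_text(entry)
    if not text:
        return "empty response"
    bare = text.rstrip(".!")  # assumption: "OK." counts the same as "ok"
    if bare in MINIMAL_RESPONSES:
        return "minimal response"
    if bare in ERROR_ONLY:
        return "error only"
    # Only specific failure phrases count, not any mention of "error".
    for pattern in FAILURE_PATTERNS:
        if pattern in text:
            return f"failure: {pattern}"
    return None


def is_weak(entry):
    """Return the reason the entry is weak, or None if it is not."""
    text = _output_text(entry)
    if len(text) < MIN_LENGTH:
        return "too brief"
    if UNCERTAIN_WORDS & set(re.findall(r"[a-z]+", text)):
        return "uncertain language"
    return None


def score_entry(entry):
    """Return a copy of the entry with quality_type and quality_reason set."""
    scored = dict(entry)
    reason = is_bad(entry)
    quality = "bad" if reason else None
    if quality is None:
        reason = is_weak(entry)
        quality = "weak" if reason else "good"
    scored["quality_type"] = quality
    scored["quality_reason"] = reason or "good response"
    return scored
```

Checking the bad rules before the weak ones matters: a reply such as "ok" is both minimal and under 20 characters, and the more specific reason is the one that should be reported.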

## Output Format

Each training example in `output.jsonl`:
```json
{
  "messages": [
    {"role": "user", "content": "User question here"},
    {"role": "assistant", "content": "Helpful response here"}
  ],
  "metadata": {
    "score": 2,
    "steps": 3,
    "inefficient": false,
    "optimization_suggestion": "Efficient",
    "quality_type": "good",
    "quality_reason": "good response"
  }
}
```
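
Before handing the file to a fine-tuning job, a consumer may want to check each line against this structure. The sketch below is one way to do that; `validate_record` is a hypothetical helper and not part of the pipeline described here.

```python
import json

REQUIRED_METADATA = {
    "score",
    "steps",
    "inefficient",
    "optimization_suggestion",
    "quality_type",
    "quality_reason",
}


def validate_record(line):
    """Parse one JSONL line and check it matches the documented structure."""
    record = json.loads(line)
    roles = [message["role"] for message in record["messages"]]
    if roles != ["user", "assistant"]:
        raise ValueError(f"unexpected roles: {roles}")
    missing = REQUIRED_METADATA - set(record["metadata"])
    if missing:
        raise ValueError(f"missing metadata fields: {sorted(missing)}")
    return record
```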

## Summary

This project turns messy system logs into clean, reliable training data by filtering out weak or incorrect responses and keeping only interactions that actually help the model learn better.

