OpenAgriNet · sharanyaa23 · May 2, 2026
diff --git a/data-pipeline/README.md b/data-pipeline/README.md
@@ -0,0 +1,242 @@
+# Log-to-Training Data Pipeline for Agentic AI (DMP 2026)
+
+## Core Idea
+
+In the OAN (OpenAgriNet) system, user interactions involve multi-step agent behavior (LLM + tools such as weather APIs, mandi prices, etc.). Raw logs from such systems are noisy, inconsistent and not directly suitable for training.
+
+This project builds a **data quality layer** that converts these raw logs into reliable, training-ready datasets.
+
+The pipeline:
+- removes incorrect, weak and low-information responses
+- detects inefficient multi-step behavior such as unnecessary tool/API calls
+- ensures only reliable, high-quality interactions are used for training
+
+The goal is to improve the reliability and efficiency of responses generated by the OAN agentic AI system in real-world agricultural use cases.
+
+Key features:
+- **Intelligent Quality Detection** - Filters out bad and weak responses
+- **Explainable Filtering** - Records why each entry was kept or removed
+- **PII-Safe Processing** - Replaces emails, phone numbers, and government IDs with placeholders
+- **Training-Ready Exports** - JSONL format for LoRA fine-tuning
+- **Inefficiency Detection** - Flags inefficient multi-step behavior and attaches optimization hints
+
+## Where This Fits in OAN Architecture
+
+In the OAN pipeline:
+
+User → Voice/Text → Agentic LLM → Tools (APIs) → Response → Logs
+
+This project operates on the **logs generated by this system**.
+
+Logs → [This Pipeline] → Clean Training Data → Model Training
+
+Better training data is intended to improve:
+- response accuracy
+- tool usage efficiency
+- overall system reliability for farmers
+- consistency of responses across different user queries
+
+This pipeline acts as a data quality layer between raw system logs and model training.
+
+## Example Pipeline Output
+
+```
+Total logs: 108
+
+After filtering:
+- Kept for training (high-quality): 100
+- Removed:
+    - Weak: 1
+    - Bad: 7
+
+Inefficient entries: 2
+
+Removed entries summary:
+  - failure: error fetching (1)
+  - minimal response (1)
+  - empty response (1)
+  - too brief (1)
+  - failure: unable to (2)
+  - error only (2)
+
+Exported to data/output.jsonl
+```
+
+## Key Innovations
+
+### 1. **Multi-Tier Quality Assessment**
+- **Bad entries**: Empty responses, minimal responses ("ok"), error responses
+- **Weak entries**: Low-information responses (too brief or vague, e.g., "maybe", "might")
+- **Good entries**: Helpful, detailed, professional responses
+- **Explainable reasons**: Each filtered entry includes WHY it was removed
+
+### 2. **Smart Error Detection**
+- **Not naive**: Doesn't flag mentions of "error" in context
+- **Specific patterns**: Only actual failures like "error fetching", "unable to"
+- **Failure categories**: Differentiates between different types of errors
+
+### 3. **Inefficiency Analysis**
+- **Step-based detection**: >3 steps = inefficient
+- **Optimization hints**: Provides suggestions for improvement
+- **Training signals**: Includes inefficiency metadata in exports
+
+### 4. **Training-Ready Format**
+- **JSONL structure**: One JSON object per line for streaming
+- **Chat template**: User/assistant role format
+- **Rich metadata**: Quality scores, reasons, optimization hints
+
+## Usage
+
+```bash
+python main.py
+```
+
+**Output:** `data/output.jsonl` - 100 high-quality training examples for the 108 sample logs in `data/raw_logs.json`
+
+## Folder Structure and Files
+
+### Root Directory
+```
+data-pipeline/
+├── main.py              # Main pipeline orchestrator
+├── README.md            # This documentation
+├── requirements.txt     # Python dependencies
+├── data/                # Input and output data
+└── src/                 # Pipeline modules
+```
+
+### Data Directory (`data/`)
+```
+data/
+├── raw_logs.json        # Input: Raw production logs (108 entries)
+└── output.jsonl         # Output: Filtered training data (100 entries)
+```
+
+**`raw_logs.json`** - Contains raw agentic logs with:
+- `input`: User question/request
+- `output`: Bot response
+- `agent_turns`: Array of interaction steps
+
+**`output.jsonl`** - Contains filtered training data with:
+- `messages`: Chat template format (user/assistant roles)
+- `metadata`: Quality scores, reasons, optimization hints
+
+### Source Modules (`src/`)
+
+#### `src/ingest.py`
+**Purpose**: Load raw log data
+
+**Function**: `load_logs(path)` - Reads a JSON file and returns the list of log entries
+
+#### `src/clean.py`
+**Purpose**: Remove Personally Identifiable Information (PII)
+
+**Function**: `remove_pii(text)` - Replaces sensitive data with placeholders
+
+**Features**:
+- Email detection → `<EMAIL>`
+- Phone detection → `<PHONE>`
+- Government ID detection (Aadhaar, PAN) → `<GOVT_ID>`
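+
+A minimal sketch of placeholder-based PII removal with regular expressions. The patterns are illustrative assumptions (10-digit Indian mobile numbers, 12-digit Aadhaar numbers optionally grouped in fours, 10-character PAN) and may differ from the actual rules in `src/clean.py`:
+
+```python
+import re
+
+# Illustrative patterns (assumptions); the real rules in src/clean.py may differ.
+EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
+AADHAAR_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")
+PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")
+PHONE_RE = re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)")
+
+
+def remove_pii(text: str) -> str:
+    """Replace emails, government IDs, and phone numbers with placeholders."""
+    text = EMAIL_RE.sub("<EMAIL>", text)
+    # Design choice in this sketch: government IDs are replaced before phones.
+    text = AADHAAR_RE.sub("<GOVT_ID>", text)
+    text = PAN_RE.sub("<GOVT_ID>", text)
+    text = PHONE_RE.sub("<PHONE>", text)
+    return text
+```
+
+Under these patterns, `remove_pii("call 9876543210")` returns `"call <PHONE>"`, while ordinary numbers such as prices are left untouched.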
+
+#### `src/process.py`
+**Purpose**: Process raw logs into structured format
+
+**Functions**:
+- `detect_inefficiency(entry)` - Identifies inefficient behavior (>3 steps)
+- `process_logs(logs)` - Main processing function
+
+**Features**:
+- Calculates steps from `agent_turns`
+- Applies inefficiency detection
+- Cleans input/output text
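+
+The step rule can be sketched as follows. This assumes each element of `agent_turns` counts as one step; the constant name `MAX_EFFICIENT_STEPS` and the returned dict shape are hypothetical and may not match `src/process.py`:
+
+```python
+MAX_EFFICIENT_STEPS = 3  # README rule: more than 3 steps is inefficient
+
+
+def detect_inefficiency(entry: dict) -> dict:
+    """Return the step count and an inefficiency flag for one log entry."""
+    # Assumption: each element of `agent_turns` is one step; a missing
+    # or null field is treated as zero steps.
+    steps = len(entry.get("agent_turns") or [])
+    return {"steps": steps, "inefficient": steps > MAX_EFFICIENT_STEPS}
+```
+
+With this rule an entry with exactly 3 turns is still efficient, which matches the sample record in the output format (`"steps": 3`, `"inefficient": false`).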
+
+#### `src/score.py`
+**Purpose**: Intelligent quality assessment and scoring
+
+**Functions**:
+- `is_bad(entry)` - Hard filters for obviously bad responses
+- `is_weak(entry)` - Detects low-quality but not terrible responses
+- `score_entry(entry)` - Main scoring function
+
+**Quality Rules**:
+- **Bad**: Empty, minimal ("ok"), error responses
+- **Weak**: Too brief (<20 chars), uncertain language
+- **Good**: Helpful, detailed responses
+
+#### `src/select.py`
+**Purpose**: Diversity filtering and deduplication
+
+**Functions**:
+- `select_diverse(entries)` - Removes duplicate inputs
+- `balance_by_steps(entries)` - Balances by complexity (unused in current version)
+
+**Features**:
+- Input normalization (lowercase, remove punctuation)
+- Duplicate detection and removal
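+
+A minimal sketch of normalization-based deduplication. It assumes the first occurrence of each input is kept and that only ASCII punctuation is stripped; the helper `normalize` and the whitespace collapsing are additions for illustration, not confirmed behavior of `src/select.py`:
+
+```python
+import string
+
+_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
+
+
+def normalize(text: str) -> str:
+    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
+    return " ".join(text.lower().translate(_PUNCT_TABLE).split())
+
+
+def select_diverse(entries: list[dict]) -> list[dict]:
+    """Keep the first entry for each normalized input, preserving order."""
+    seen: set[str] = set()
+    unique = []
+    for entry in entries:
+        key = normalize(entry.get("input", ""))
+        if key not in seen:
+            seen.add(key)
+            unique.append(entry)
+    return unique
+```
+
+Note that `string.punctuation` covers ASCII only, so punctuation from Indic scripts (for example the danda) would need to be added explicitly.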
+
+#### `src/improve.py`
+**Purpose**: Add optimization hints for inefficient entries
+
+**Functions**:
+- `add_optimization_hint(entry)` - Suggests improvements
+
+**Features**:
+- "Reduce unnecessary steps or tool calls" for inefficient entries
+- "Efficient" for good entries
+
+#### `src/export.py`
+**Purpose**: Export data in training-ready format
+
+**Functions**:
+- `format_chat_template(entry)` - Converts to chat template format
+- `export_jsonl(data, path)` - Exports to JSONL file
+
+**Features**:
+- Chat template structure (user/assistant roles)
+- Rich metadata inclusion (scores, reasons, hints)
+
+### Main Pipeline (`main.py`)
+
+**Purpose**: Orchestrates the entire pipeline
+
+**Process**:
+1. Load raw logs (`load_logs`)
+2. Process and clean (`process_logs`)
+3. Add optimization hints (`add_optimization_hint`)
+4. Score entries (`score_entry`)
+5. Separate by quality type
+6. Apply diversity filtering (`select_diverse`)
+7. Export results (`export_jsonl`)
+8. Generate detailed logging report
+
+
+## Quality Filtering Logic
+
+The pipeline uses a **three-tier quality system**:
+
+### Bad Entries (Always Removed)
+- **Empty response**: `""` → Reason: "empty response"
+- **Minimal response**: "ok", "yes", "done" → Reason: "minimal response"
+- **Error only**: "err", "error", "failed" → Reason: "error only"
+- **Failure patterns**: "error fetching", "unable to", "failed to" → Reason: "failure: [pattern]"
+
+### Weak Entries (Filtered Out)
+- **Too brief**: <20 characters → Reason: "too brief"
+- **Uncertain language**: "maybe", "might", "possibly" → Reason: "uncertain language"
+
+### Good Entries (Kept for Training)
+- **Helpful responses**: Detailed, professional answers
+- **Proper length**: At least 20 characters with substance
+- **No uncertainty**: Confident, clear responses
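+
+The three tiers can be sketched as a single classifier. The keyword lists, the 20-character threshold, and the reason strings come from the rules above; the function name `classify`, the order of checks, and the matching details (case-insensitive, trailing `.`/`!` ignored for one-word replies) are assumptions and may differ from `src/score.py`:
+
+```python
+import re
+
+# Values taken from the README rules; matching details are assumptions.
+MINIMAL = {"ok", "yes", "done"}
+ERROR_ONLY = {"err", "error", "failed"}
+FAILURE_PATTERNS = ("error fetching", "unable to", "failed to")
+UNCERTAIN_RE = re.compile(r"\b(maybe|might|possibly)\b")
+MIN_LENGTH = 20
+
+
+def classify(output: str) -> tuple[str, str]:
+    """Return (quality_type, quality_reason) for one response."""
+    text = output.strip().lower()
+    bare = text.rstrip(".!")
+    if not text:
+        return "bad", "empty response"
+    if bare in MINIMAL:
+        return "bad", "minimal response"
+    if bare in ERROR_ONLY:
+        return "bad", "error only"
+    # Only specific failure phrases count, not any mention of "error".
+    for pattern in FAILURE_PATTERNS:
+        if pattern in text:
+            return "bad", f"failure: {pattern}"
+    if len(text) < MIN_LENGTH:
+        return "weak", "too brief"
+    if UNCERTAIN_RE.search(text):
+        return "weak", "uncertain language"
+    return "good", "good response"
+```
+
+Because only whole-response matches and specific failure phrases are treated as bad, a response such as "A sensor error rate below 2% is normal for this soil probe." is still classified as good in this sketch.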
+
+## Output Format
+
+Each training example in `output.jsonl`:
+```json
+{
+  "messages": [
+    {"role": "user", "content": "User question here"},
+    {"role": "assistant", "content": "Helpful response here"}
+  ],
+  "metadata": {
+    "score": 2,
+    "steps": 3,
+    "inefficient": false,
+    "optimization_suggestion": "Efficient",
+    "quality_type": "good",
+    "quality_reason": "good response"
+  }
+}
+```
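+
+Records of this shape could be produced with a sketch like the following. It assumes each scored entry is a flat dict holding `input`, `output`, and the metadata fields; the real `src/export.py` may be organized differently:
+
+```python
+import json
+
+
+def format_chat_template(entry: dict) -> dict:
+    """Convert a scored entry into a messages-plus-metadata record."""
+    return {
+        "messages": [
+            {"role": "user", "content": entry["input"]},
+            {"role": "assistant", "content": entry["output"]},
+        ],
+        # Assumption: every field except the two text fields is metadata.
+        "metadata": {k: v for k, v in entry.items() if k not in ("input", "output")},
+    }
+
+
+def export_jsonl(data: list[dict], path: str) -> None:
+    """Write one JSON object per line."""
+    with open(path, "w", encoding="utf-8") as f:
+        for entry in data:
+            # ensure_ascii=False keeps Indic-script text readable in the file.
+            f.write(json.dumps(format_chat_template(entry), ensure_ascii=False) + "\n")
+```
+
+Writing one object per line lets a training job stream the file without loading it all into memory.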
+
+## Summary
+
+This project turns messy system logs into clean, reliable training data by filtering out weak or incorrect responses and keeping only interactions that pass the quality checks described above.
+
+