Skip to content

End-to-End Protype for Logs to Training Pipeline#14

Open
a-vidushi wants to merge 1 commit into
OpenAgriNet:mainfrom
a-vidushi:main
Open

End-to-End Protype for Logs to Training Pipeline#14
a-vidushi wants to merge 1 commit into
OpenAgriNet:mainfrom
a-vidushi:main

Conversation

@a-vidushi
Copy link
Copy Markdown

Summary

Adds an end-to-end pipeline that ingests raw Langfuse traces, strips PII, and exports validated SFT and DPO JSONL datasets ready for TRL LoRA and DPO training runs. Covers schema modeling, Indian-ID redaction, language detection, trajectory validation, complexity tagging, semantic deduplication, synthetic data augmentation, persona scoring, and four-file JSONL export with audit logs.


What This PR Does

Schema and Types (types_def.py)

Defines Pydantic models for the full data lifecycle: Part, AgentTurn, LogSession, TrainingMetadata, and DPORecord. DPORecord exports chosen and rejected as message lists (List[Dict]), matching TRL DPOTrainer expectations.

PII Redaction (anonymizer.py)

Extends Microsoft Presidio with Indian-specific recognizers like Aadhaar (12-digit, grouped or plain), PAN card, Voter ID, and Indian phone numbers. For non-English sessions where Presidio's NLP pipeline does not apply, a regex-only fallback runs the same patterns directly. Both paths now correctly filter overlapping matches, keeping the longest span and replacing all detected entities with consistent indexed placeholders (<IN_AADHAAR_1>, <EMAIL_ADDRESS_1>, etc.) scoped per session. All redaction findings are written to audit_log.jsonl for human review.

Language Detection (lang_utils.py)

Uses langdetect to auto-tag sessions with a BCP-47 primary language code across nine supported languages. Result is cached onto LogSession.language and used downstream to route PII redaction and penalize language mismatches in persona scoring.

Trajectory Validation and Complexity Tagging (analyzer.py)

validate_trajectory rejects sessions with no agent turns, unnamed tool calls, tool names outside the allowlist, and tool-returns with no matching prior tool-call or with content=None. tag_complexity assigns one of three tiers, simple (0 tools), moderate (1–3 tools, no recovery), complex (4+ tools or any recovery step), and computes an ambiguity score from a multilingual token list covering English, Hindi, and Marathi.

Persona Scoring (behavior_scorer.py)

Additive-penalty scorer starting at 1.0. Deducts for over-refusal phrases, hallucinated scheme names, unhedged percentage claims, and language mismatch, where mismatch is only penalized when the user wrote in native script but the bot responded in ASCII, correctly ignoring transliterated sessions. Scores below 0.4 are excluded from DPO; scores below 0.8 are excluded from DPO chosen but retained for SFT.

Semantic Deduplication and Diversity Mapping (session_dedup.py)

Embeds redacted user questions with sentence-transformers/all-MiniLM-L6-v2 and runs greedy radius-search clustering in pure numpy (FAISS not implemented for the prototype, to be added for efficiency later). One representative seed is kept per cluster. Produces a diversity map reporting total seeds, per-language and per-complexity-tier breakdowns, zero-coverage cells, and underrepresented cells flagged for future data collection. Collapsed duplicates are saved to dedup_dropped.json for auditing.

Synthetic Data Augmentation (data_augmenter.py)

Two generators run on every seed session after deduplication. generate_hard_case prepends an ambiguity prefix and appends a vague suffix in the session's detected language (English and Hindi supported; other languages fall back to English). generate_failure_correction injects a synthetic tool error into the first tool-return and rewrites the bot response as a graceful recovery message. Both generators are deterministic from the session's question hash and set source="synthetic" on the output.

Export Pipeline (build_dataset.py)

Orchestrates the full pipeline: load → deduplicate → split → augment → export. Train/eval split is deterministic via MD5 hash of user_question modulo 100, ensuring near-identical prompts always land in the same split. create_sft_export writes one JSONL row per session in OpenAI chat format with full tool-call and tool-return turns. create_dpo_export writes prompt (full conversation prefix), chosen (final assistant message), and rejected (synthetic negative) as message lists, gated on persona score. Both exporters share a single append-mode audit file. Four output files are produced: sft_train.jsonl, sft_eval.jsonl, dpo_train.jsonl, dpo_eval.jsonl.


Final Touches for deployment

  • Add persona file when received. behavior_scorer.py encodes persona checks directly in Python. The scorer should instead read a persona.md at runtime so the rubric can be swapped per deployment without touching pipeline code. This also unblocks using the persona prompt as input to an LLM-as-judge scorer for richer quality signals.

  • Repair fixable traces through hosted open model when received. Sessions with valid questions but flawed trajectories or weak responses are currently dropped or left as-is. These should be repaired by running them through a hosted open model (Qwen2.5-32B or similar on the H100 cluster) to produce corrected responses, turning the noisiest part of the dataset into useful training signal rather than discarding it.

  • Add mock tool environment. Hard case generation currently inherits parent tool responses unchanged rather than running generated questions through a live mock environment. A mock tool executor that validates arguments against known schemas and returns realistic or controlled-failure responses is required before synthetic expansion can produce grounded, training-safe trajectories.

  • Strengthen DPO rejection system. Current rejected sides are string-prefix degradations, not structurally corrupted trajectories (missing tool calls, reordered steps, ignored results). Real preference pairs from thumbs-down signals, user edits, or genuine failure/success trajectory contrasts are not yet collected.

  • Multi-turn depth to be decided. multi_turn_depth is hardcoded to 1. True multi-turn session segmentation is to be implemented.

  • Add TRL dry-run validation. SFT and DPO outputs have not been validated against a live LoRA or DPO training run yet.

@a-vidushi a-vidushi changed the title End-to-End Protype for OAN End-to-End Protype for Logs to Training Pipeline May 16, 2026
@a-vidushi a-vidushi marked this pull request as ready for review May 16, 2026 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant