End-to-End Protype for Logs to Training Pipeline#14
Open
a-vidushi wants to merge 1 commit into
Open
Conversation
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an end-to-end pipeline that ingests raw Langfuse traces, strips PII, and exports validated SFT and DPO JSONL datasets ready for TRL LoRA and DPO training runs. Covers schema modeling, Indian-ID redaction, language detection, trajectory validation, complexity tagging, semantic deduplication, synthetic data augmentation, persona scoring, and four-file JSONL export with audit logs.
What This PR Does
Schema and Types (
types_def.py)Defines Pydantic models for the full data lifecycle:
Part,AgentTurn,LogSession,TrainingMetadata, andDPORecord.DPORecordexportschosenandrejectedas message lists (List[Dict]), matching TRLDPOTrainerexpectations.PII Redaction (
anonymizer.py)Extends Microsoft Presidio with Indian-specific recognizers like Aadhaar (12-digit, grouped or plain), PAN card, Voter ID, and Indian phone numbers. For non-English sessions where Presidio's NLP pipeline does not apply, a regex-only fallback runs the same patterns directly. Both paths now correctly filter overlapping matches, keeping the longest span and replacing all detected entities with consistent indexed placeholders (
<IN_AADHAAR_1>,<EMAIL_ADDRESS_1>, etc.) scoped per session. All redaction findings are written toaudit_log.jsonlfor human review.Language Detection (
lang_utils.py)Uses
langdetectto auto-tag sessions with a BCP-47 primary language code across nine supported languages. Result is cached ontoLogSession.languageand used downstream to route PII redaction and penalize language mismatches in persona scoring.Trajectory Validation and Complexity Tagging (
analyzer.py)validate_trajectoryrejects sessions with no agent turns, unnamed tool calls, tool names outside the allowlist, and tool-returns with no matching prior tool-call or withcontent=None.tag_complexityassigns one of three tiers,simple(0 tools),moderate(1–3 tools, no recovery),complex(4+ tools or any recovery step), and computes an ambiguity score from a multilingual token list covering English, Hindi, and Marathi.Persona Scoring (
behavior_scorer.py)Additive-penalty scorer starting at 1.0. Deducts for over-refusal phrases, hallucinated scheme names, unhedged percentage claims, and language mismatch, where mismatch is only penalized when the user wrote in native script but the bot responded in ASCII, correctly ignoring transliterated sessions. Scores below 0.4 are excluded from DPO; scores below 0.8 are excluded from DPO chosen but retained for SFT.
Semantic Deduplication and Diversity Mapping (
session_dedup.py)Embeds redacted user questions with
sentence-transformers/all-MiniLM-L6-v2and runs greedy radius-search clustering in pure numpy (FAISS not implemented for the prototype, to be added for efficiency later). One representative seed is kept per cluster. Produces a diversity map reporting total seeds, per-language and per-complexity-tier breakdowns, zero-coverage cells, and underrepresented cells flagged for future data collection. Collapsed duplicates are saved todedup_dropped.jsonfor auditing.Synthetic Data Augmentation (
data_augmenter.py)Two generators run on every seed session after deduplication.
generate_hard_caseprepends an ambiguity prefix and appends a vague suffix in the session's detected language (English and Hindi supported; other languages fall back to English).generate_failure_correctioninjects a synthetic tool error into the first tool-return and rewrites the bot response as a graceful recovery message. Both generators are deterministic from the session's question hash and setsource="synthetic"on the output.Export Pipeline (
build_dataset.py)Orchestrates the full pipeline: load → deduplicate → split → augment → export. Train/eval split is deterministic via MD5 hash of
user_questionmodulo 100, ensuring near-identical prompts always land in the same split.create_sft_exportwrites one JSONL row per session in OpenAI chat format with full tool-call and tool-return turns.create_dpo_exportwrites prompt (full conversation prefix), chosen (final assistant message), and rejected (synthetic negative) as message lists, gated on persona score. Both exporters share a single append-mode audit file. Four output files are produced:sft_train.jsonl,sft_eval.jsonl,dpo_train.jsonl,dpo_eval.jsonl.Final Touches for deployment
Add persona file when received.
behavior_scorer.pyencodes persona checks directly in Python. The scorer should instead read a persona.md at runtime so the rubric can be swapped per deployment without touching pipeline code. This also unblocks using the persona prompt as input to an LLM-as-judge scorer for richer quality signals.Repair fixable traces through hosted open model when received. Sessions with valid questions but flawed trajectories or weak responses are currently dropped or left as-is. These should be repaired by running them through a hosted open model (Qwen2.5-32B or similar on the H100 cluster) to produce corrected responses, turning the noisiest part of the dataset into useful training signal rather than discarding it.
Add mock tool environment. Hard case generation currently inherits parent tool responses unchanged rather than running generated questions through a live mock environment. A mock tool executor that validates arguments against known schemas and returns realistic or controlled-failure responses is required before synthetic expansion can produce grounded, training-safe trajectories.
Strengthen DPO rejection system. Current rejected sides are string-prefix degradations, not structurally corrupted trajectories (missing tool calls, reordered steps, ignored results). Real preference pairs from thumbs-down signals, user edits, or genuine failure/success trajectory contrasts are not yet collected.
Multi-turn depth to be decided.
multi_turn_depthis hardcoded to 1. True multi-turn session segmentation is to be implemented.Add TRL dry-run validation. SFT and DPO outputs have not been validated against a live LoRA or DPO training run yet.