Skip to content

feat: logs-to-training pipeline — PII redaction, SFT/DPO JSONL export, trajectory complexity tagging#6

Open
unmeshgb wants to merge 1 commit into
OpenAgriNet:mainfrom
unmeshgb:feat/logs-to-training-pipeline
Open

feat: logs-to-training pipeline — PII redaction, SFT/DPO JSONL export, trajectory complexity tagging#6
unmeshgb wants to merge 1 commit into
OpenAgriNet:mainfrom
unmeshgb:feat/logs-to-training-pipeline

Conversation

@unmeshgb
Copy link
Copy Markdown

@unmeshgb unmeshgb commented May 5, 2026

Resolves the DMP 2026 issue: Logs-to-training pipeline for agentic setups (#1)

Summary

End-to-end pipeline from raw Pydantic/Langfuse JSON logs to LoRA-ready SFT and
DPO JSONL datasets. Covers PII redaction, full agentic trajectory handling,
complexity tagging, split integrity, per-session audit logging, and automatic
student model filtering — all in a single repeatable CLI run.

Acceptance criteria coverage

Criterion How it's met
No artifact ships without PII pipeline + audit sample Every content field — user question, bot response, tool args, tool return content — passes redact_pii() before write; redactions logged to audit_log.jsonl with trace_id, original span, placeholder, and entity type
SFT JSONL validates against LoRA dry run Full trajectories in OpenAI/TRL tool-call format: user → tool-call → tool-return → assistant; compatible with TRL SFTTrainer and Gemma, Llama, Qwen chat templates without format shims
DPO JSONL has prompt / chosen / rejected Persona-violation style synthetic rejections tagged with rejection_type; "synthetic": true flag makes real pair replacement a one-line swap when feedback data arrives
Agent trajectories excluded on tool call mismatch Two-layer check: tool-call must have a tool name; tool-return must reference a tool_call_id already seen in that session — catches out-of-order and orphaned results
Field definitions, split strategy, complexity mapping SCHEMA.md documents all fields, hash-based split rationale, complexity tier → training schedule mapping, residual PII risk, and student model acceptance criteria
Student model criteria defined student_eligible auto-set to False when tool_count > 4 or total_tokens > 8192 — no manual post-processing needed

Implementation highlights

Schema — Langfuse-native
AgentTurn carries usage (input/output tokens), model_name, finish_reason,
run_id, and provider_name — matching real Langfuse trace fields directly so
no pre-processing transform is needed before ingest.

PII Redaction — multilingual aware
Presidio for English with a regex fallback for non-English (Hindi/regional) queries,
matching the confirmed reality of mixed Indian language logs. Custom Aadhaar recognizer
added (12-digit pattern, IN_AADHAAR). PERSON entity detection intentionally
disabled to prevent false positives on crop names and place names in agricultural
context. Placeholders are session-scoped and indexed — the same entity gets the same
token across SFT and DPO exports for that session. language field on LogSession
routes to the correct redaction path automatically with no manual flag per run.

Trajectory Validation
Two-layer check: (1) tool-call must have a tool name, (2) tool-return must reference
a tool_call_id already seen in that session — catches out-of-order and orphaned
results that would silently corrupt training data.

Complexity Tagging
Each exported row carries: tool_count, unique_tools, has_recovery,
total_tokens, complexity_tier (simple / moderate / complex), is_agentic,
and student_eligible. Token usage is read directly from AgentTurn.usage when
present. Recovery detection covers error, timeout, failed, and no results
in tool-return content. Trainers can filter by any tag for staged curriculum or
flat mixture runs without touching the pipeline.

DPO Rejection Typing
Each DPO pair carries a rejection_type field on its metadata. Current value is
persona_violation — making it straightforward to add more rejection categories
(tool inefficiency, wrong tool order, hallucinated args) as the pipeline matures.

Audit Log
audit_log.jsonl is cleared and rebuilt fresh on every pipeline run — no stale
entries. Each record carries trace_id (from session_id if present, otherwise
question hash), original PII span, placeholder, and entity type. Designed for
human sampling to catch false negatives before any artifact ships. Residual risk
for unstructured regional language free text is documented in SCHEMA.md.

Split Integrity
MD5 hash of user_question mod 100 for train/eval assignment — deterministic
across runs, prevents near-duplicate prompt leakage between SFT and DPO sets.

Files changed

File Purpose
schemas.py LogSession, AgentTurn (Langfuse-native fields), TrainingMetadata, DPORecord
pii_redactor.py Presidio + custom Aadhaar recognizer + multilingual regex fallback
segmenter.py validate_trajectory() with tool-call ID ordering; tag_complexity() with token counting and student_eligible
exporter.py CLI with --input, --output, --split-ratio; hash-based splits; SFT + DPO export; audit_log.jsonl
sample_logs.json 3 sessions covering all complexity tiers (simple/moderate/complex) and all PII types (phone, email, Aadhaar)
SCHEMA.md Field definitions, split strategy, complexity → schedule mapping, residual PII risk, student model criteria
requirements.txt presidio-analyzer, presidio-anonymizer, pydantic>=2, pytest
setup.sh One-command install + python -m spacy download en_core_web_lg
test_pipeline.py 5 pytest tests — see below

Tests (5/5 passing)

Test What it asserts
test_redact_pii Phone number and email removed; correct indexed placeholders present in output
test_tag_complexity Session with 3 tool calls + recovery flag returns complexity_tier = complex
test_validate_trajectory Tool-call with no tool_name returns False
test_sft_export_format Output JSONL contains messages key with at least one message
test_dpo_export_format Output JSONL contains all three required fields: prompt, chosen, rejected

How to run

bash setup.sh
python exporter.py --input sample_logs.json --output exported_dataset/ --split-ratio 0.8
pytest test_pipeline.py -v

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant