feat: logs-to-training pipeline — PII redaction, SFT/DPO JSONL export, trajectory complexity tagging#6
Open
unmeshgb wants to merge 1 commit into
Resolves the DMP 2026 issue: Logs-to-training pipeline for agentic setups (#1)
Summary
End-to-end pipeline from raw Pydantic/Langfuse JSON logs to LoRA-ready SFT and
DPO JSONL datasets. Covers PII redaction, full agentic trajectory handling,
complexity tagging, split integrity, per-session audit logging, and automatic
student model filtering — all in a single repeatable CLI run.
Acceptance criteria coverage
redact_pii() before write; redactions logged to audit_log.jsonl with trace_id, original span, placeholder, and entity type
user → tool-call → tool-return → assistant; compatible with TRL SFTTrainer and Gemma, Llama, Qwen chat templates without format shims
… rejection_type; "synthetic": true flag makes real pair replacement a one-line swap when feedback data arrives
… tool_call_id already seen in that session; catches out-of-order and orphaned results
SCHEMA.md documents all fields, hash-based split rationale, complexity tier → training schedule mapping, residual PII risk, and student model acceptance criteria
student_eligible auto-set to False when tool_count > 4 or total_tokens > 8192; no manual post-processing needed
Implementation highlights
Schema — Langfuse-native
AgentTurn carries usage (input/output tokens), model_name, finish_reason, run_id, and provider_name, matching real Langfuse trace fields directly so no pre-processing transform is needed before ingest.
PII Redaction — multilingual aware
Presidio for English with a regex fallback for non-English (Hindi/regional) queries,
matching the confirmed reality of mixed Indian language logs. Custom Aadhaar recognizer
added (12-digit pattern, IN_AADHAAR). PERSON entity detection is intentionally
disabled to prevent false positives on crop names and place names in the agricultural
context. Placeholders are session-scoped and indexed: the same entity gets the same
token across SFT and DPO exports for that session.
The language field on LogSession routes to the correct redaction path automatically with no manual flag per run.
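The session-scoped, indexed placeholders described above can be sketched with the standard library alone. This is a minimal illustration of the regex fallback path, not the code in pii_redactor.py: the SessionRedactor class, the exact Aadhaar regex (12 digits, optionally grouped 4-4-4), the `<IN_AADHAAR_n>` placeholder format, and the audit field names are all assumptions made for the example.

```python
import re

# Assumed pattern: 12 digits, optionally grouped 4-4-4 with spaces or hyphens.
AADHAAR_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")


class SessionRedactor:
    """Hypothetical helper: one instance per session, so the same entity
    always maps to the same indexed placeholder within that session."""

    def __init__(self):
        self._placeholders = {}  # (entity_type, normalised text) -> placeholder
        self._counts = {}        # entity_type -> placeholders issued so far
        self.audit = []          # one record per redacted span

    def _placeholder(self, entity_type, text):
        key = (entity_type, text)
        if key not in self._placeholders:
            self._counts[entity_type] = self._counts.get(entity_type, 0) + 1
            self._placeholders[key] = f"<{entity_type}_{self._counts[entity_type]}>"
        return self._placeholders[key]

    def redact(self, text, trace_id):
        def replace(match):
            # Strip separators so "1234 5678 9012" and "123456789012" share a token.
            normalised = re.sub(r"[ -]", "", match.group(0))
            placeholder = self._placeholder("IN_AADHAAR", normalised)
            self.audit.append({
                "trace_id": trace_id,
                "original": match.group(0),  # assumed field name
                "placeholder": placeholder,
                "entity_type": "IN_AADHAAR",
            })
            return placeholder

        return AADHAAR_RE.sub(replace, text)
```

Reusing one SessionRedactor instance for both the SFT and DPO export of a session is what would keep the tokens aligned across the two outputs under this design.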
Trajectory Validation
Two-layer check: (1) a tool-call must have a tool name, (2) a tool-return must reference
a tool_call_id already seen in that session. This catches out-of-order and orphaned
results that would silently corrupt training data.
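A minimal sketch of the two-layer check, assuming each turn is a plain dict with role, tool_name, and tool_call_id keys and that the role strings are "tool-call" and "tool-return". The real validate_trajectory() in segmenter.py operates on the PR's own models and may differ.

```python
def validate_trajectory(turns):
    """Return True when every tool-call is named and every tool-return
    references a tool_call_id issued earlier in the same session."""
    seen_call_ids = set()
    for turn in turns:
        role = turn.get("role")
        call_id = turn.get("tool_call_id")
        if role == "tool-call":
            # Layer 1: a tool-call without a tool name is unusable for training.
            if not turn.get("tool_name"):
                return False
            if call_id is not None:
                seen_call_ids.add(call_id)
        elif role == "tool-return":
            # Layer 2: the return must point at a call already seen, which
            # rejects both orphaned and out-of-order results.
            if call_id is None or call_id not in seen_call_ids:
                return False
    return True
```

Because the set is filled in turn order, a tool-return that appears before its matching tool-call fails the same membership test as one with no matching call at all.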
Complexity Tagging
Each exported row carries:
tool_count, unique_tools, has_recovery, total_tokens, complexity_tier (simple / moderate / complex), is_agentic,
and student_eligible. Token usage is read directly from AgentTurn.usage when
present. Recovery detection covers error, timeout, failed, and no results
in tool-return content. Trainers can filter by any tag for staged curriculum or
flat mixture runs without touching the pipeline.
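The tagging logic can be sketched as follows. The student_eligible thresholds (more than 4 tool calls or more than 8192 tokens) and the recovery markers come from this description. The tier cut-offs, the is_agentic rule, unique_tools as a count, and the input/output key names inside usage are assumptions for illustration only.

```python
RECOVERY_MARKERS = ("error", "timeout", "failed", "no results")


def tag_complexity(turns, max_tools=4, max_tokens=8192):
    """Compute per-trajectory tags from a list of turn dicts (assumed shape)."""
    tool_calls = [t for t in turns if t.get("role") == "tool-call"]
    tool_returns = [t for t in turns if t.get("role") == "tool-return"]

    tool_count = len(tool_calls)
    unique_tools = len({t["tool_name"] for t in tool_calls if t.get("tool_name")})

    # Case-insensitive substring match on tool-return content.
    has_recovery = any(
        marker in str(t.get("content", "")).lower()
        for t in tool_returns
        for marker in RECOVERY_MARKERS
    )

    # Sum token usage only where a turn actually reports it.
    total_tokens = 0
    for t in turns:
        usage = t.get("usage") or {}
        total_tokens += (usage.get("input") or 0) + (usage.get("output") or 0)

    # Tier cut-offs below are illustrative assumptions, not the PR's values.
    if tool_count == 0:
        tier = "simple"
    elif tool_count <= 2 and not has_recovery:
        tier = "moderate"
    else:
        tier = "complex"

    return {
        "tool_count": tool_count,
        "unique_tools": unique_tools,
        "has_recovery": has_recovery,
        "total_tokens": total_tokens,
        "complexity_tier": tier,
        "is_agentic": tool_count > 0,  # assumed definition
        "student_eligible": not (tool_count > max_tools or total_tokens > max_tokens),
    }
```

A trainer could then select a curriculum stage with a plain filter such as `[r for r in rows if r["complexity_tier"] == "simple" and r["student_eligible"]]`.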
DPO Rejection Typing
Each DPO pair carries a rejection_type field on its metadata. The current value is
persona_violation; the field makes it straightforward to add more rejection categories
(tool inefficiency, wrong tool order, hallucinated args) as the pipeline matures.
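For orientation, one DPO row might be assembled roughly like this. The prompt, chosen, and rejected keys and the rejection_type and synthetic fields are named in this description; the build_dpo_record and to_jsonl_line helpers, and placing synthetic inside metadata, are assumptions of the sketch.

```python
import json


def build_dpo_record(prompt, chosen, rejected,
                     rejection_type="persona_violation", synthetic=True):
    """Hypothetical builder for one DPO preference pair."""
    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "metadata": {
            "rejection_type": rejection_type,
            # Flip to False once real feedback pairs replace synthetic ones.
            "synthetic": synthetic,
        },
    }


def to_jsonl_line(record):
    # ensure_ascii=False keeps Hindi and other non-ASCII text readable on disk.
    return json.dumps(record, ensure_ascii=False)
```

Keeping rejection_type as a free-form string rather than a boolean is what lets new categories be added without a schema change.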
Audit Log
audit_log.jsonl is cleared and rebuilt fresh on every pipeline run, so there are no stale
entries. Each record carries trace_id (from session_id if present, otherwise the
question hash), original PII span, placeholder, and entity type. Designed for
human sampling to catch false negatives before any artifact ships. Residual risk
for unstructured regional language free text is documented in SCHEMA.md.
Split Integrity
MD5 hash of user_question mod 100 for train/eval assignment: deterministic
across runs, and prevents near-duplicate prompt leakage between SFT and DPO sets.
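The hash-based assignment can be sketched in a few lines. The assign_split name and the 90/10 default are assumptions; the ratio actually used is whatever --split-ratio resolves to in exporter.py.

```python
import hashlib


def assign_split(user_question, train_percent=90):
    """Map a question to 'train' or 'eval' from its MD5 bucket (0-99)."""
    digest = hashlib.md5(user_question.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "train" if bucket < train_percent else "eval"
```

Because the bucket depends only on the question text, an identical question lands in the same split whether it is exported as an SFT row or a DPO pair, and the assignment does not change between runs or with input order.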
Files changed
schemas.py: LogSession, AgentTurn (Langfuse-native fields), TrainingMetadata, DPORecord
pii_redactor.py
segmenter.py: validate_trajectory() with tool-call ID ordering; tag_complexity() with token counting and student_eligible
exporter.py: … --input, --output, --split-ratio; hash-based splits; SFT + DPO export; audit_log.jsonl
sample_logs.json
SCHEMA.md
requirements.txt: presidio-analyzer, presidio-anonymizer, pydantic>=2, pytest
setup.sh: python -m spacy download en_core_web_lg
test_pipeline.py
Tests (5/5 passing)
test_redact_pii
test_tag_complexity: … complexity_tier = complex
test_validate_trajectory: … tool_name returns False
test_sft_export_format: … messages key with at least one message
test_dpo_export_format: … prompt, chosen, rejected
How to run