feat: logs-to-training pipeline — PII redaction, SFT/DPO JSONL export, trajectory complexity tagging by unmeshgb · Pull Request #6 · OpenAgriNet/training_setup_logs

unmeshgb · 2026-05-05T16:13:13Z

Resolves the DMP 2026 issue: Logs-to-training pipeline for agentic setups (#1)

Summary

End-to-end pipeline from raw Pydantic/Langfuse JSON logs to LoRA-ready SFT and
DPO JSONL datasets. Covers PII redaction, full agentic trajectory handling,
complexity tagging, split integrity, per-session audit logging, and automatic
student model filtering — all in a single repeatable CLI run.

Acceptance criteria coverage

Criterion	How it's met
No artifact ships without PII pipeline + audit sample	Every content field — user question, bot response, tool args, tool return content — passes `redact_pii()` before write; redactions logged to `audit_log.jsonl` with `trace_id`, original span, placeholder, and entity type
SFT JSONL validates against LoRA dry run	Full trajectories in OpenAI/TRL tool-call format: `user → tool-call → tool-return → assistant`; compatible with TRL `SFTTrainer` and Gemma, Llama, Qwen chat templates without format shims
DPO JSONL has prompt / chosen / rejected	Persona-violation style synthetic rejections tagged with `rejection_type`; `"synthetic": true` flag makes real pair replacement a one-line swap when feedback data arrives
Agent trajectories excluded on tool call mismatch	Two-layer check: tool-call must have a tool name; tool-return must reference a `tool_call_id` already seen in that session — catches out-of-order and orphaned results
Field definitions, split strategy, complexity mapping	`SCHEMA.md` documents all fields, hash-based split rationale, complexity tier → training schedule mapping, residual PII risk, and student model acceptance criteria
Student model criteria defined	`student_eligible` auto-set to `False` when `tool_count > 4` or `total_tokens > 8192` — no manual post-processing needed

Implementation highlights

Schema — Langfuse-native
AgentTurn carries usage (input/output tokens), model_name, finish_reason,
run_id, and provider_name — matching real Langfuse trace fields directly so
no pre-processing transform is needed before ingest.

PII Redaction — multilingual aware
Presidio for English with a regex fallback for non-English (Hindi/regional) queries,
matching the confirmed reality of mixed Indian language logs. Custom Aadhaar recognizer
added (12-digit pattern, IN_AADHAAR). PERSON entity detection intentionally
disabled to prevent false positives on crop names and place names in agricultural
context. Placeholders are session-scoped and indexed — the same entity gets the same
token across SFT and DPO exports for that session. language field on LogSession
routes to the correct redaction path automatically with no manual flag per run.

Trajectory Validation
Two-layer check: (1) tool-call must have a tool name, (2) tool-return must reference
a tool_call_id already seen in that session — catches out-of-order and orphaned
results that would silently corrupt training data.

Complexity Tagging
Each exported row carries: tool_count, unique_tools, has_recovery,
total_tokens, complexity_tier (simple / moderate / complex), is_agentic,
and student_eligible. Token usage is read directly from AgentTurn.usage when
present. Recovery detection covers error, timeout, failed, and no results
in tool-return content. Trainers can filter by any tag for staged curriculum or
flat mixture runs without touching the pipeline.

DPO Rejection Typing
Each DPO pair carries a rejection_type field on its metadata. Current value is
persona_violation — making it straightforward to add more rejection categories
(tool inefficiency, wrong tool order, hallucinated args) as the pipeline matures.

Audit Log
audit_log.jsonl is cleared and rebuilt fresh on every pipeline run — no stale
entries. Each record carries trace_id (from session_id if present, otherwise
question hash), original PII span, placeholder, and entity type. Designed for
human sampling to catch false negatives before any artifact ships. Residual risk
for unstructured regional language free text is documented in SCHEMA.md.

Split Integrity
MD5 hash of user_question mod 100 for train/eval assignment — deterministic
across runs, prevents near-duplicate prompt leakage between SFT and DPO sets.

Files changed

File	Purpose
`schemas.py`	`LogSession`, `AgentTurn` (Langfuse-native fields), `TrainingMetadata`, `DPORecord`
`pii_redactor.py`	Presidio + custom Aadhaar recognizer + multilingual regex fallback
`segmenter.py`	`validate_trajectory()` with tool-call ID ordering; `tag_complexity()` with token counting and `student_eligible`
`exporter.py`	CLI with `--input`, `--output`, `--split-ratio`; hash-based splits; SFT + DPO export; `audit_log.jsonl`
`sample_logs.json`	3 sessions covering all complexity tiers (simple/moderate/complex) and all PII types (phone, email, Aadhaar)
`SCHEMA.md`	Field definitions, split strategy, complexity → schedule mapping, residual PII risk, student model criteria
`requirements.txt`	`presidio-analyzer`, `presidio-anonymizer`, `pydantic>=2`, `pytest`
`setup.sh`	One-command install + `python -m spacy download en_core_web_lg`
`test_pipeline.py`	5 pytest tests — see below

Tests (5/5 passing)

Test	What it asserts
`test_redact_pii`	Phone number and email removed; correct indexed placeholders present in output
`test_tag_complexity`	Session with 3 tool calls + recovery flag returns `complexity_tier = complex`
`test_validate_trajectory`	Tool-call with no `tool_name` returns `False`
`test_sft_export_format`	Output JSONL contains `messages` key with at least one message
`test_dpo_export_format`	Output JSONL contains all three required fields: `prompt`, `chosen`, `rejected`

How to run

bash setup.sh
python exporter.py --input sample_logs.json --output exported_dataset/ --split-ratio 0.8
pytest test_pipeline.py -v

…O export, trajectory tagging

feat: end-to-end logs-to-training pipeline with PII redaction, SFT/DP…

c17d9a3

…O export, trajectory tagging

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: logs-to-training pipeline — PII redaction, SFT/DPO JSONL export, trajectory complexity tagging#6

feat: logs-to-training pipeline — PII redaction, SFT/DPO JSONL export, trajectory complexity tagging#6
unmeshgb wants to merge 1 commit into
OpenAgriNet:mainfrom
unmeshgb:feat/logs-to-training-pipeline

unmeshgb commented May 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

unmeshgb commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Acceptance criteria coverage

Implementation highlights

Files changed

Tests (5/5 passing)

How to run

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

unmeshgb commented May 5, 2026 •

edited

Loading