Add prototype log-to-training pipeline#2

Open
vandit98 wants to merge 3 commits into OpenAgriNet:main from vandit98:prototype/log-to-training-pipeline

@vandit98

Summary

This PR adds an early Python-first prototype for the DMP 2026 log-to-training-data pipeline described in #1.

The implementation focuses on the first reviewable end-to-end slice:

  • JSON/JSONL ingestion and normalization into a canonical event schema
  • deterministic rule-based PII/secrets redaction with stable placeholders
  • session segmentation into Q&A units and agent trajectories
  • trajectory complexity tagging for training schedule buckets
  • basic validation for tool-call/tool-result consistency
  • LoRA-ready SFT JSONL export
  • human-review DPO candidate JSONL export
  • sample agent logs, schema docs, and implementation plan
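
As an illustration of the redaction step above, here is a minimal stdlib-only sketch of deterministic rule-based redaction with stable placeholders. It is not the code from this PR: the `Redactor` class, the two regex patterns, and the `<KIND_N>` placeholder format are all assumptions made for the example. The idea it shows is that the same sensitive value always maps to the same placeholder within a run, so references stay consistent across a session.

```python
import re

# Hypothetical patterns for illustration; the PR's actual rule set may differ.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}


class Redactor:
    """Replaces matched values with placeholders that are stable per value."""

    def __init__(self):
        self._seen = {}    # (kind, value) -> placeholder already assigned
        self._counts = {}  # kind -> number of distinct values seen so far

    def _placeholder(self, kind, value):
        key = (kind, value)
        if key not in self._seen:
            self._counts[kind] = self._counts.get(kind, 0) + 1
            self._seen[key] = f"<{kind}_{self._counts[kind]}>"
        return self._seen[key]

    def redact(self, text):
        # Patterns are applied in a fixed order, so output is deterministic.
        for kind, pattern in PATTERNS.items():
            text = pattern.sub(
                lambda m, k=kind: self._placeholder(k, m.group(0)), text
            )
        return text

    def counts(self):
        """Number of distinct values redacted, per kind."""
        return dict(self._counts)


redactor = Redactor()
redacted = redactor.redact(
    "Mail a@b.com or call +91 98765 43210, then a@b.com again"
)
```

In this sketch the repeated email address receives the same placeholder both times, and `redactor.counts()` gives a per-kind tally of the sort a PII count summary could be built from.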

Verification

  • python -m pytest
  • sample pipeline run produces 2 SFT rows, 1 DPO candidate row, PII counts, tag summary, and 0 validation issues
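
For readers unfamiliar with what a tool-call/tool-result consistency check might look like, the sketch below shows one possible shape. It assumes a hypothetical event format (dicts with a `type` of `tool_call` or `tool_result` and a shared `call_id`); the canonical schema in this PR may use different field names, and `validate_tool_pairs` is an illustrative name, not a function from the branch.

```python
def validate_tool_pairs(events):
    """Return a list of human-readable issues; an empty list means consistent.

    Assumes each event is a dict with hypothetical keys "type" and "call_id".
    """
    issues = []
    pending = {}  # call_id -> index of the tool_call still awaiting a result

    for i, event in enumerate(events):
        kind = event.get("type")
        call_id = event.get("call_id")
        if kind == "tool_call":
            if call_id in pending:
                issues.append(f"event {i}: duplicate tool_call id {call_id!r}")
            pending[call_id] = i
        elif kind == "tool_result":
            if call_id not in pending:
                issues.append(
                    f"event {i}: tool_result without matching tool_call {call_id!r}"
                )
            else:
                del pending[call_id]

    # Any call left in `pending` never received a result.
    for call_id, i in pending.items():
        issues.append(f"event {i}: tool_call {call_id!r} has no tool_result")
    return issues


good_session = [
    {"type": "tool_call", "call_id": "c1"},
    {"type": "tool_result", "call_id": "c1"},
]
bad_session = [
    {"type": "tool_result", "call_id": "c9"},
    {"type": "tool_call", "call_id": "c2"},
]
good_issues = validate_tool_pairs(good_session)
bad_issues = validate_tool_pairs(bad_session)
```

Under these assumptions, a run reporting 0 validation issues corresponds to every session yielding an empty list.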

Notes

This first pass is intentionally dependency-light and stdlib-only, so mentors can review the data model and pipeline shape before heavier pieces are added: trainer integrations, NER models, Hugging Face datasets, PEFT/TRL dry runs, or organization-specific PII dictionaries.
