feat: ingest → PII redaction → filter/tag → SFT export prototype #5

Open
Joshuathomas18 wants to merge 3 commits into
OpenAgriNet:main from
Joshuathomas18:main

@Joshuathomas18

What this PR does

Implements an ingest → PII redaction → filter/tag → SFT export pipeline on a sampled log subset.

Stages implemented

  • Ingest: Pydantic v2 models matching Langfuse/Pydantic schema, JSON/JSONL parser, never crashes on bad records
  • Redact: Presidio wrapper with custom Indian phone number recognizer, entity type normalization, session-scoped consistent placeholders, recursive dict leaf walker (preserves args structure), structured field-rule fallback for agristack tool returns, audit log with residual risk note
  • Filter/tag: 5 hard drop conditions, complexity tagging (step_count, unique_tools, has_recovery, redundant_calls, complexity_tier), language detection
  • Export: LoRA-ready SFT JSONL in TRL chat template format, cluster-level stratified 80/10/10 split
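The redaction bullet combines two ideas that are easier to see in code: a recursive walk that rewrites only string leaves (so the shape of tool-call args is preserved) and a session-scoped map that gives the same value the same placeholder every time. The following is a minimal sketch under stated assumptions, not the code in this PR: `PlaceholderMap`, `redact_text`, `walk_leaves`, and the regex are hypothetical names, and a plain regex stands in for the Presidio analyzer.

```python
import re
from typing import Any, Callable

# Stand-in detector (assumption): 10-digit Indian mobile number, optional +91 prefix.
PHONE_RE = re.compile(r"(?<!\d)(?:\+91[\s-]?)?[6-9]\d{9}(?!\d)")


class PlaceholderMap:
    """Hands out one stable placeholder per distinct value within a session."""

    def __init__(self) -> None:
        self._seen: dict[tuple[str, str], str] = {}
        self._counts: dict[str, int] = {}

    def get(self, entity_type: str, value: str) -> str:
        key = (entity_type, value)
        if key not in self._seen:
            self._counts[entity_type] = self._counts.get(entity_type, 0) + 1
            self._seen[key] = f"[{entity_type}_{self._counts[entity_type]}]"
        return self._seen[key]


def redact_text(text: str, session: PlaceholderMap) -> str:
    def repl(match: re.Match) -> str:
        # Key on the last 10 digits so "+91 98..." and "98..." share a placeholder.
        digits = re.sub(r"\D", "", match.group(0))[-10:]
        return session.get("PHONE_NUMBER", digits)

    return PHONE_RE.sub(repl, text)


def walk_leaves(node: Any, fn: Callable[[str], str]) -> Any:
    """Apply fn to every string leaf; dict keys, list order, and non-strings are kept."""
    if isinstance(node, dict):
        return {key: walk_leaves(value, fn) for key, value in node.items()}
    if isinstance(node, list):
        return [walk_leaves(value, fn) for value in node]
    if isinstance(node, str):
        return fn(node)
    return node
```

Creating one `PlaceholderMap` per session and passing `lambda s: redact_text(s, session)` to `walk_leaves` keeps placeholders consistent across every message and tool call in that session.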

Acceptance criteria coverage

| Criterion | File |
| --- | --- |
| No artifact without PII pipeline + audit sample | redact/engine.py, redact/audit.py |
| SFT JSONL in LoRA-ready format | export/sft.py |
| Agent trajectories flagged on tool call mismatch | tag/filters.py |
| Field definitions, split strategy, complexity mapping | docs/schema.md |
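For readers unfamiliar with the split strategy, here is one way a cluster-level stratified 80/10/10 split could work. This is a sketch under assumptions, not the logic documented in docs/schema.md: the record keys `cluster_id` and `complexity_tier`, the choice of complexity tier as the stratum, and the function name `cluster_split` are all hypothetical. Whole clusters are assigned to a single split so that near-duplicate sessions cannot leak between train and evaluation sets.

```python
import random
from collections import defaultdict


def cluster_split(records, seed=0, ratios=(0.8, 0.1)):
    """Assign whole clusters to train/val/test, stratified by complexity tier."""
    # Assumption: a cluster's stratum is the tier of its first record.
    cluster_tier = {}
    for record in records:
        cluster_tier.setdefault(record["cluster_id"], record["complexity_tier"])

    by_tier = defaultdict(list)
    for cluster_id, tier in sorted(cluster_tier.items()):
        by_tier[tier].append(cluster_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    assignment = {}
    for tier in sorted(by_tier):
        cluster_ids = by_tier[tier]
        rng.shuffle(cluster_ids)
        n_train = int(len(cluster_ids) * ratios[0])
        n_val = int(len(cluster_ids) * ratios[1])
        for i, cluster_id in enumerate(cluster_ids):
            if i < n_train:
                assignment[cluster_id] = "train"
            elif i < n_train + n_val:
                assignment[cluster_id] = "val"
            else:
                assignment[cluster_id] = "test"  # remainder goes to test

    splits = {"train": [], "val": [], "test": []}
    for record in records:
        splits[assignment[record["cluster_id"]]].append(record)
    return splits
```

With very small strata the integer rounding pushes the remainder into the test split, so the ratios are only approximate when there are few clusters per tier.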

Notable implementation decisions

  • Indian phone numbers (10-digit, +91 prefix) added as a custom Presidio recognizer, because the baseline English (en) recognizers classify them as UK_NHS
  • Entity types normalized before placeholder assignment so UK_NHS, IN_PAN etc. map to canonical PHONE_NUMBER / GOVT_ID
  • Agristack tool returns use structured field rules (Name:, Mobile:, Date of Birth:) as a fallback for PII that Presidio misses in Hindi/regional text
  • Residual risk documented in audit summary: unstructured regional-language free text is a known gap
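The entity-type normalization step can be illustrated with a small lookup applied before placeholders are assigned. This is a sketch, not the PR's code: the mapping contents, the findings shape (dicts with an `entity_type` key), and the name `IN_PHONE` for the custom recognizer's entity are assumptions. `UK_NHS`, `IN_PAN`, and `IN_AADHAAR` are entity names Presidio itself emits.

```python
# Assumed mapping from detector-specific types to canonical placeholder types.
CANONICAL_TYPES = {
    "UK_NHS": "PHONE_NUMBER",    # misclassified Indian mobile numbers
    "IN_PHONE": "PHONE_NUMBER",  # hypothetical entity name for the custom recognizer
    "IN_PAN": "GOVT_ID",
    "IN_AADHAAR": "GOVT_ID",
}


def normalize_entity_type(entity_type: str) -> str:
    """Return the canonical type, leaving unknown types unchanged."""
    return CANONICAL_TYPES.get(entity_type, entity_type)


def normalize_findings(findings: list[dict]) -> list[dict]:
    """Return copies of the findings with canonical entity types."""
    return [
        {**finding, "entity_type": normalize_entity_type(finding["entity_type"])}
        for finding in findings
    ]
```

Mapping every `UK_NHS` hit to `PHONE_NUMBER` assumes the logs contain no genuine NHS numbers, which seems reasonable for this data but is worth stating in the audit notes.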

Tests

22/22 passing. To run:

python -m spacy download en_core_web_sm

python -m pytest tests/ -v

@Joshuathomas18
Author

Hi @Gautam-Rajeev, I wanted to share an update on my prototype PR for #1.

Since the initial submission I've added the two remaining midpoint milestone pieces:

  • data/gold/gold_session.json: hand-curated representative agristack + weather forecast trajectory (Hindi query, clean 2-tool path, no failures)
  • data/gold/dpo_candidates.jsonl: 5 governed preference pairs covering redundant tool calls, tool failure without recovery, suboptimal paths, persona violations (English response to Hindi query), and wrong tool order

One implementation detail worth flagging: Presidio's default English recognizers classify Indian 10-digit mobile numbers as UK_NHS. I added a custom recognizer for the +91/10-digit pattern and a normalization layer so all variants map to [PHONE_NUMBER_N]. The field-rule fallback for agristack Name: / Mobile: fields handles cases Presidio misses in regional language content. Residual risk for unstructured Hindi free text is documented in the audit summary.
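To make the field-rule fallback concrete, here is a minimal sketch of how labeled fields in a tool return could be redacted without relying on language-specific NER. It assumes the agristack return is line-oriented text with `Label: value` pairs, which the PR does not confirm. The names `FIELD_RULES` and `redact_fields` and the entity types `PERSON` and `DATE_OF_BIRTH` are hypothetical, and the per-call counter stands in for the session-scoped placeholder map.

```python
import re

# Assumed field labels (from the PR description) mapped to assumed entity types.
FIELD_RULES = {
    "Name": "PERSON",
    "Mobile": "PHONE_NUMBER",
    "Date of Birth": "DATE_OF_BIRTH",
}

# Match "Label: value" on its own line and capture the value up to end of line.
FIELD_RE = re.compile(r"(?m)^([ \t]*)(Name|Mobile|Date of Birth):[ \t]*(.+?)[ \t]*$")


def redact_fields(text: str) -> str:
    """Replace the value of each known field, whatever script it is written in."""
    counts: dict[str, int] = {}

    def repl(match: re.Match) -> str:
        entity_type = FIELD_RULES[match.group(2)]
        counts[entity_type] = counts.get(entity_type, 0) + 1
        return f"{match.group(1)}{match.group(2)}: [{entity_type}_{counts[entity_type]}]"

    return FIELD_RE.sub(repl, text)
```

Because the rule keys on the field label rather than the value, a Devanagari name after `Name:` is redacted even when the English recognizers return nothing for it.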

The PR now covers all midpoint milestone goals from the ticket. Happy to iterate based on your feedback.
