
feat: add residual PII audit workflow #11

Open
Keerti707 wants to merge 1 commit into OpenAgriNet:main from Keerti707:feat/pii-audit-workflow


@Keerti707

------Summary-------------------

Adds a lightweight residual PII audit workflow for checking whether redacted JSONL training data still contains sensitive patterns before being used for downstream SFT or DPO training.

This contribution focuses on the privacy-auditing portion of the project requirements and provides a small, reusable utility for validating residual risk after redaction.
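As a rough illustration of the kind of check described above, the following is a minimal sketch of scanning JSONL records for residual patterns. It is not the code in this PR: the regular expressions, the `PATTERNS` table, and the `iter_strings` and `audit_jsonl_lines` helpers are all assumptions made for the example, and real patterns would need tuning against the actual data.

```python
import json
import re

# Hypothetical patterns for illustration only; the PR's actual expressions may differ.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"(?<!\d)(?:\+?\d{1,3}[\s-]?)?\d{10}(?!\d)"),
    "api_key": re.compile(r"\b(?:sk|pk|api)[-_][A-Za-z0-9_-]{16,}"),
    "token": re.compile(r"\b[A-Fa-f0-9]{32,}\b"),
}


def iter_strings(value):
    """Yield every string nested inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def audit_jsonl_lines(lines):
    """Return (line_number, pattern_name) pairs for each residual match."""
    findings = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than failing on them
        record = json.loads(line)
        for text in iter_strings(record):
            for name, pattern in PATTERNS.items():
                if pattern.search(text):
                    findings.append((number, name))
    return findings
```

Passing an open file handle works the same way as passing a list of strings, since the function only iterates over lines. Each pattern is reported at most once per string field, which keeps the output short but undercounts fields with several matches.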

------What’s included------------------

  • JSONL residual PII audit utility

  • Detection support for:

    • email addresses
    • phone numbers
    • API-key-like strings
    • token-like strings
  • Risk-level classification

  • Example redacted dataset

  • Unit tests with pytest

  • README usage documentation

  • pytest configuration through pyproject.toml
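The PR description does not say how risk levels are assigned, so the following is only a sketch of one possible scheme: each finding type gets a weight, and the summed score maps to a level. The `SEVERITY` weights, the thresholds, the level names, and the `classify_risk` function are hypothetical.

```python
from collections import Counter

# Hypothetical weights: credentials are treated as more severe than contact details.
SEVERITY = {"email": 1, "phone": 1, "api_key": 3, "token": 3}


def classify_risk(finding_types):
    """Map a list of finding type names to a coarse risk level."""
    counts = Counter(finding_types)
    # Unknown finding types default to a weight of 1.
    score = sum(SEVERITY.get(name, 1) * n for name, n in counts.items())
    if score == 0:
        return "none"
    if score <= 2:
        return "low"
    if score <= 5:
        return "medium"
    return "high"
```

For example, a single leftover email would be classified as `"low"` under these assumed weights, while one API-key-like string alone would already reach `"medium"`.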

------Why this contribution-----------------

The repository requirements explicitly mention:

  • residual PII auditing
  • documented audit workflows
  • measurable privacy validation

This PR contributes a focused validation utility around those requirements instead of duplicating broader pipeline/export implementations already being explored in parallel PRs.

-------Verification-------------------

pytest -q
python training_setup_logs/pii_audit.py

Both commands complete successfully when run locally.
