Add privacy-preserving PII redaction pipeline prototype #13

@NomzzNJS

Hi @Gautam-Rajeev, I went through the README.md and realized that we need to preserve privacy in the data before we pipeline it into model training.
So I added a minimal PII redaction pipeline (#12; a prototype as of now, and I would love the chance to make it production-grade) that handles common identifiers: emails, phone numbers, credit card numbers, UUIDs, and URLs with tokens. The pipeline normalizes heterogeneous log entries into a canonical event schema, audits them for PII, applies consistent placeholder redaction, and exports SFT- and DPO-ready datasets.
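
To illustrate what "consistent placeholder redaction" could look like, here is a minimal sketch. It is not the code from #12: the `PATTERNS` list, the `Redactor` class, and the `<KIND_n>` placeholder format are hypothetical names chosen for this example, and the regexes are deliberately simple and will miss many real-world formats.

```python
import re

# Hypothetical pattern table (an assumption for this sketch, not taken from #12).
# Order matters: more specific patterns run first so that, for example,
# the digit runs inside a UUID are not picked up as a phone number.
PATTERNS = [
    ("URL", re.compile(r"https?://\S*[?&](?:token|key|sig)=\S+")),
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("UUID", re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b")),
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),   # 13 to 16 digits
    ("PHONE", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


class Redactor:
    """Replaces PII with placeholders; the same raw value always maps
    to the same placeholder for the lifetime of the instance."""

    def __init__(self):
        self._seen = {}    # (kind, raw value) -> placeholder
        self._counts = {}  # kind -> number of distinct values seen

    def _placeholder(self, kind, value):
        key = (kind, value)
        if key not in self._seen:
            self._counts[kind] = self._counts.get(kind, 0) + 1
            self._seen[key] = f"<{kind}_{self._counts[kind]}>"
        return self._seen[key]

    def redact(self, text):
        # Apply each pattern in order, substituting a stable placeholder.
        for kind, pattern in PATTERNS:
            text = pattern.sub(
                lambda m, k=kind: self._placeholder(k, m.group(0)), text)
        return text

    def summary(self):
        """Distinct values redacted per kind, usable as a simple audit count."""
        return dict(self._counts)


redactor = Redactor()
print(redactor.redact(
    "Mail alice@example.com or call +91 98765 43210; alice@example.com again."))
# -> Mail <EMAIL_1> or call <PHONE_1>; <EMAIL_1> again.
```

Keeping the mapping on the instance means a repeated identifier gets the same placeholder across log entries, so references stay linked after redaction. A production version would likely need checksum validation for card numbers and a proper phone number parser.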

Also, I'm a dual-degree student at IIT Madras, and I would really like to contribute to this. Looking forward to it!
