Hi @Gautam-Rajeev, I went through the README.md and realized that preserving privacy in the data is necessary before we pipeline it into model training.
So I added a minimal PII redaction pipeline (#12; a prototype for now, and I would love the chance to make it production-grade) that handles common identifiers: emails, phone numbers, credit cards, UUIDs, and URLs with tokens. The pipeline normalizes heterogeneous log entries into a canonical event schema, audits them for PII, applies consistent placeholder redaction, and exports SFT- and DPO-ready datasets.
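To illustrate what "consistent placeholder redaction" means here, this is a rough sketch and not the code in #12. The pattern set, the `redact` function, and the `<LABEL_n>` placeholder format are all hypothetical, and the regexes are deliberately simple (they cover only emails, UUIDs, and phone numbers, and would need hardening for real logs).

```python
import re

# Hypothetical, simplified patterns for illustration only.
# UUID comes first so the looser phone pattern cannot match inside a UUID.
PATTERNS = {
    "UUID": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,13}\d"),
}


def redact(text, mapping):
    """Replace matches with placeholders, reusing `mapping` across calls.

    `mapping` maps (label, original_value) -> placeholder, so the same
    value always receives the same placeholder within one dataset.
    """
    for label, pattern in PATTERNS.items():
        def replace(match, label=label):
            key = (label, match.group(0))
            if key not in mapping:
                # Number placeholders per label: <EMAIL_1>, <EMAIL_2>, ...
                index = sum(1 for seen, _ in mapping if seen == label) + 1
                mapping[key] = f"<{label}_{index}>"
            return mapping[key]

        text = pattern.sub(replace, text)
    return text


mapping = {}
first = redact("Contact a@example.com or +91 98765 43210", mapping)
second = redact("a@example.com wrote to b@example.com", mapping)
print(first)
print(second)
```

Sharing one `mapping` across log entries is what keeps the redaction consistent: the same email maps to the same placeholder everywhere, so relationships between events survive in the exported data while the raw values do not.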
Also, I'm a dual-degree student at IIT Madras and would really like to contribute to this. Looking forward to it!