33 changes: 33 additions & 0 deletions README.md
@@ -90,3 +90,36 @@ The pipeline should treat **export formats** as first-class requirements so the
6. Define **student model** constraints (context length, tool set) and a **filter + eval** plan for teacher-to-student parity before production swap.


---

## PII residual-risk audit

This repository now includes a lightweight audit utility for checking whether redacted JSONL training data still contains common residual PII patterns.

### Currently detected patterns

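- email addresses
- unformatted 10-digit phone numbers (for example `5551234567`)
- API-key-like strings (`sk-` followed by 20 or more alphanumeric characters)
- token-like assignments (`token:` or `token=` followed by a value)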
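
As a rough illustration of the list above, the sketch below copies the regular expressions from `training_setup_logs/pii_audit.py` and reports which pattern names fire on a few sample strings. The helper name `matched_types` and the sample strings are assumptions made for this example only; they are not part of the utility.

```python
import re

# Patterns copied from training_setup_logs/pii_audit.py in this change.
PII_PATTERNS = {
    "email": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+",
    "phone": r"\b\d{10}\b",
    "api_key": r"sk-[A-Za-z0-9]{20,}",
    "token": r"token\s*[:=]\s*[A-Za-z0-9-_]+",
}


def matched_types(text):
    """Hypothetical helper: names of the patterns that match anywhere in text."""
    return [name for name, pattern in PII_PATTERNS.items() if re.search(pattern, text)]


# Illustrative inputs only; none of these come from real data.
samples = [
    "Oops leaked email john@example.com",
    "call 5551234567",
    "call 555-123-4567",  # separators break the 10-digit run
    "sk-" + "a" * 24,
    "token = abc123",
    "Call me maybe at [PHONE]",
]

for sample in samples:
    print(f"{sample!r}: {matched_types(sample)}")
```

Note that the phone pattern only matches ten consecutive digits, so a number written with separators, such as `555-123-4567`, would pass through this check undetected.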

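### Run the sample audit

Run this from the repository root, because the script reads `examples/redacted_sample.jsonl` through a relative path.
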
```bash
python training_setup_logs/pii_audit.py
```

### Run tests

```bash
pytest -q
```

### Audit output

The generated audit report includes:

- number of rows scanned
- number of rows suspected of containing residual PII
- per-line issue summaries (pattern type and match count)
- overall residual-risk classification (`low`, `medium`, or `high`)
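
To make the report fields concrete, here is a minimal sketch of the risk thresholds used in `training_setup_logs/pii_audit.py`, together with the report that reading the code suggests for `examples/redacted_sample.jsonl`. The function name `classify_risk` is an assumption for this example (the utility applies the thresholds inline), and the report is derived from the source rather than from a recorded run.

```python
import json


def classify_risk(suspected_rows):
    """Hypothetical helper: map a count of suspected rows to a risk label."""
    if suspected_rows == 0:
        return "low"
    if suspected_rows < 3:
        return "medium"
    return "high"


# Shape of the report for examples/redacted_sample.jsonl, where only
# row 4 still contains an email address.
sample_report = {
    "rows_scanned": 4,
    "suspected_pii": 1,
    "findings": [
        {"line": 4, "issues": [{"type": "email", "matches_found": 1}]},
    ],
}
sample_report["risk_level"] = classify_risk(sample_report["suspected_pii"])

print(json.dumps(sample_report, indent=2))
```

The classification depends on the absolute number of suspected rows, not on their share of the file, so a single leaked row is rated `medium` regardless of dataset size.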
4 changes: 4 additions & 0 deletions examples/redacted_sample.jsonl
@@ -0,0 +1,4 @@
{"text": "User email has been replaced with [EMAIL]"}
{"text": "Call me maybe at [PHONE]"}
{"text": "No sensitive information here"}
{"text": "Oops leaked email john@example.com"}
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -0,0 +1,3 @@
[tool.pytest.ini_options]
pythonpath = ["."]
testpaths = ["tests"]
Binary file not shown.
17 changes: 17 additions & 0 deletions tests/test_pii_audit.py
@@ -0,0 +1,17 @@
from training_setup_logs.pii_audit import detect_pii


def test_email_detection():
    text = "contact me at test@example.com"

    findings = detect_pii(text)

    assert len(findings) > 0


def test_clean_text():
    text = "all pii has been removed"

    findings = detect_pii(text)

    assert findings == []
Empty file added training_setup_logs/__init__.py
Binary file not shown.
Binary file not shown.
73 changes: 73 additions & 0 deletions training_setup_logs/pii_audit.py
@@ -0,0 +1,73 @@
import json
import re
from pathlib import Path


# Regexes for residual PII that redaction may have missed.
PII_PATTERNS = {
    "email": r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+",
    "phone": r"\b\d{10}\b",  # unformatted 10-digit numbers only
    "api_key": r"sk-[A-Za-z0-9]{20,}",
    "token": r"token\s*[:=]\s*[A-Za-z0-9-_]+",
}


def detect_pii(text):
    """Return one finding per PII type that matches anywhere in text."""
    findings = []

    for pii_type, pattern in PII_PATTERNS.items():
        matches = re.findall(pattern, text)

        if matches:
            findings.append(
                {
                    "type": pii_type,
                    "matches_found": len(matches),
                }
            )

    return findings


def audit_jsonl(file_path):
    """Scan a JSONL file and summarize the rows that still match a PII pattern."""
    report = {
        "rows_scanned": 0,
        "suspected_pii": 0,
        "findings": [],
    }

    path = Path(file_path)

    with path.open("r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            # Skip blank lines (such as a trailing newline) instead of
            # letting json.loads raise on them.
            if not line.strip():
                continue

            report["rows_scanned"] += 1

            data = json.loads(line)

            # Re-serialize so that keys and nested values are scanned too.
            text = json.dumps(data)

            findings = detect_pii(text)

            if findings:
                report["suspected_pii"] += 1

                report["findings"].append(
                    {
                        "line": line_number,
                        "issues": findings,
                    }
                )

    # Classify by the absolute number of rows with suspected PII.
    if report["suspected_pii"] == 0:
        report["risk_level"] = "low"
    elif report["suspected_pii"] < 3:
        report["risk_level"] = "medium"
    else:
        report["risk_level"] = "high"

    return report


if __name__ == "__main__":
    # The path is relative, so run this from the repository root.
    report = audit_jsonl("examples/redacted_sample.jsonl")

    print(json.dumps(report, indent=2))