feat(governance): pre-LLM privacy gate for the R2 mechanical regime #7

Open: sammy995 wants to merge 8 commits into SantanderAI:main from sammy995:feat/privacy-gate

@sammy995

Description

Adds a pre-LLM privacy gate to the R2 mechanical regime. It reversibly tokenizes direct identifiers (EMAIL, PHONE, SSN, PAN, IBAN, IP) so the model never sees raw personal data, and mechanically DEFERs (PRIV_0) a case when residual identifiers exceed a configurable budget or detection fails (fail-closed). This adds a data-minimization dimension (GDPR Art. 5(1)(c) / OWASP LLM06).
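
As a rough illustration of the reversible tokenization described above, here is a minimal sketch. It is not the code in this PR: the patterns, the `<LABEL_n>` token format, and the `tokenize` / `restore` helper names are assumptions made for the example, and only two of the six identifier types are covered.

```python
import re

# Hypothetical patterns for two of the identifier types named above.
# The PR's actual RegexRecognizer patterns may differ.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def tokenize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each match with a placeholder token and return the reverse map."""
    token_map: dict[str, str] = {}  # token -> raw value (kept in memory only)
    seen: dict[str, str] = {}  # raw value -> token, so repeats share one token

    for label, pattern in PATTERNS.items():

        def _sub(match: re.Match, label: str = label) -> str:
            raw = match.group(0)
            if raw not in seen:
                token = f"<{label}_{len(token_map) + 1}>"
                seen[raw] = token
                token_map[token] = raw
            return seen[raw]

        text = pattern.sub(_sub, text)
    return text, token_map


def restore(text: str, token_map: dict[str, str]) -> str:
    """Invert tokenize() by substituting each token with its raw value."""
    for token, raw in token_map.items():
        text = text.replace(token, raw)
    return text


raw_text = "Contact jane@example.com, SSN 123-45-6789, cc jane@example.com"
masked, mapping = tokenize(raw_text)
```

Only `masked` would be passed to the model. `mapping` stays with the caller so the original text can be recovered after the model responds.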

The new primitive mirrors the existing hard_gates / i6q shape (config dataclass + result dataclass + pure functions) and slots into the documented pipeline as a new stage:

hard_gates → privacy → E3_commit → CEFL → I6Q → ambiguity_gate → E3_reveal
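
For reviewers unfamiliar with that shape, the sketch below shows the general pattern (frozen config dataclass, frozen result dataclass, one pure function) together with the fail-closed and residual-budget behavior. `GateConfig`, `GateResult`, `run_gate`, and the `residual_budget` field are hypothetical stand-ins; the real `PrivacyConfig` / `PrivacyResult` fields and the `privacy_gate` signature are defined in the diff and may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GateConfig:  # hypothetical stand-in for PrivacyConfig
    enabled: bool = True
    residual_budget: int = 0  # assumed field name


@dataclass(frozen=True)
class GateResult:  # hypothetical stand-in for PrivacyResult
    passed: bool
    gate_id: Optional[str]
    entities_found: int
    residual_pii: int


def run_gate(
    text: str,
    detect: Callable[[str], tuple[int, int]],
    config: GateConfig,
) -> GateResult:
    """Pure gate function; detect() returns (entities_found, residual_pii)."""
    if not config.enabled:
        return GateResult(True, None, 0, 0)
    try:
        found, residual = detect(text)
    except Exception:
        # Fail closed: a detector error defers the case instead of passing it.
        return GateResult(False, "PRIV_0", 0, 0)
    if residual > config.residual_budget:
        return GateResult(False, "PRIV_0", found, residual)
    return GateResult(True, None, found, residual)
```

Because the function takes the detector as an argument and returns a value without side effects, the pass, over-budget, and detector-failure paths can each be tested offline with a stub.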

R2 decisions are driven by risk_score / regulatory_flags, not identities, so tokenization does not affect outcomes. The reversible token map stays in-memory and is never written to DecisionResult / to_dict() or logs — only integer counts (privacy_entities_found, privacy_residual_pii) go into metadata.
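
A small sketch of the serialization-safety property, under stated assumptions: `Decision` is a hypothetical stand-in for `DecisionResult`, and `record_privacy_counts` is an illustrative helper, not a function in this PR. Only the two metadata key names are taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:  # hypothetical stand-in for DecisionResult
    outcome: str
    metadata: dict[str, int] = field(default_factory=dict)

    def to_dict(self) -> dict[str, object]:
        return {"outcome": self.outcome, "metadata": dict(self.metadata)}


def record_privacy_counts(
    decision: Decision, token_map: dict[str, str], residual: int
) -> None:
    """Store integer counts only; the token map itself is never attached."""
    decision.metadata["privacy_entities_found"] = len(token_map)
    decision.metadata["privacy_residual_pii"] = residual


token_map = {"<EMAIL_1>": "jane@example.com"}  # lives only in this scope
decision = Decision(outcome="DEFER")
record_privacy_counts(decision, token_map, residual=0)
serialized = decision.to_dict()
```

A test in this style can assert that no raw value from the token map appears anywhere in the serialized output.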

This is PR 1 of 2. A follow-up will add a Privacy Leakage Rate metric (alongside CDL/DIU), synthetic narrative data that exercises the gate end-to-end, an optional NER recognizer behind a [privacy-ner] extra, and R1/R3 coverage via an LLMInterface wrapper.

What's included

  • governance/primitives/privacy_gate.py: PrivacyConfig, PrivacyResult, RegexRecognizer, pluggable PiiRecognizer, scan_and_tokenize, detokenize, privacy_gate.
  • R2 wiring (pre-LLM stage; off via PrivacyConfig(enabled=False)).
  • tests/test_privacy_gate.py — 19 offline tests (recognizer, reversible tokenization, residual fail-safe, fail-closed, R2 integration, serialization-safety).
  • examples/privacy_demo.py (offline) and a CHANGELOG entry.
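
One way a pluggable recognizer interface like the one listed above could look is sketched here with `typing.Protocol`. The `Span` tuple, the `Recognizer` protocol, its `find` method, and the `scan` helper are assumptions for illustration; the actual `PiiRecognizer` contract is whatever the diff defines.

```python
import re
from typing import NamedTuple, Protocol


class Span(NamedTuple):  # hypothetical match record
    label: str
    start: int
    end: int


class Recognizer(Protocol):  # hypothetical stand-in for PiiRecognizer
    def find(self, text: str) -> list[Span]: ...


class IPv4Recognizer:
    """Regex-based recognizer; satisfies Recognizer structurally."""

    # Loose dotted-quad pattern; it does not validate that octets are <= 255.
    _pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

    def find(self, text: str) -> list[Span]:
        return [Span("IP", m.start(), m.end()) for m in self._pattern.finditer(text)]


def scan(text: str, recognizers: list[Recognizer]) -> list[Span]:
    """Run every recognizer and return spans ordered by start offset."""
    spans = [span for rec in recognizers for span in rec.find(text)]
    return sorted(spans, key=lambda s: s.start)


sample = "from 10.0.0.1 to 192.168.1.20"
spans = scan(sample, [IPv4Recognizer()])
```

With structural typing, an optional NER-backed recognizer could be added later without the regex recognizer or the scanner importing it.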

Related issue

Closes #6

Type of change

  • New feature (non-breaking change that adds functionality)

Vendor-neutral core

  • This PR keeps the core vendor-neutral (no cloud SDK; stdlib only)

Checklist

  • I have signed the CLA (the CLA Assistant bot will prompt external contributors)
  • My commit messages follow Conventional Commits
  • ruff check . and black --check . pass
  • mypy src passes
  • pytest passes (tests run offline with the mock provider)
  • I have added/updated tests where relevant
  • I have updated documentation / CHANGELOG where relevant
  • No secrets, API keys, internal URLs, or proprietary content are included

Open items for maintainers

  • Gate id PRIV_0 vs the K0_x taxonomy — happy to rename.
  • PrivacyConfig.enabled defaults to True (adds two integer metadata keys; no behavior change on the current synthetic dataset, which has no free-text PII). Can default to opt-in if preferred.
  • New-file copyright header mirrors the repo's existing style.

@sammy995 sammy995 requested a review from a team as a code owner June 23, 2026 12:50
@github-actions

github-actions Bot commented Jun 23, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@sammy995

Author

I have read the CLA Document and I hereby sign the CLA

github-actions Bot added a commit that referenced this pull request Jun 23, 2026
@sammy995 sammy995 requested a review from a team as a code owner June 25, 2026 03:40
Development

Successfully merging this pull request may close these issues.

Add a pre-LLM privacy gate (PII data-minimization) to the R2 mechanical regime

1 participant