PII Detection & Redaction

Automatic detection and redaction of personally identifiable information (PII) before sending data to LLMs.

Overview

InferShield protects sensitive data by detecting and redacting PII before it reaches AI models. This is critical for:

HIPAA compliance - Healthcare organizations
GDPR compliance - European users
PCI-DSS compliance - Payment card data
SOC 2 compliance - Enterprise security audits

Supported PII Types

Critical (Auto-blocked in Free Tier)

SSN - US Social Security Numbers
Credit Cards - Validated with Luhn algorithm
Passport Numbers - Government-issued passports
Medical Records - MRN, patient IDs
Bank Accounts - Account numbers
API Keys - Generic + AWS-specific
AWS Keys - Access keys (AKIA...)

High Sensitivity

Email Addresses - Full email detection
Driver's Licenses - State-issued IDs

Medium Sensitivity

Phone Numbers - US format
IP Addresses - IPv4 (validated)
Dates of Birth - MM/DD/YYYY format

Redaction Strategies

1. MASK (Default)

Replace entire value with label: ``` Input: "My SSN is 123-45-6789" Output: "My SSN is [SSN_REDACTED]" ```

2. PARTIAL

Show last 4 digits: ``` Input: "Card: 4532-1488-0343-6467" Output: "Card: XXXX-XXXX-XXXX-6467" ```

3. HASH

One-way hash: ``` Input: "SSN: 123-45-6789" Output: "SSN: [a3f5c2d1]" ```

4. TOKEN (Enterprise)

Reversible tokenization: ``` Input: "SSN: 123-45-6789" Output: "SSN: [TOKEN_e3b0c442]" (can be detokenized with key) ```

5. REMOVE

Complete removal: ``` Input: "My SSN is 123-45-6789 and email is test@example.com" Output: "My SSN is and email is " ```

Usage

As Express Middleware

```javascript const { piiRedactionMiddleware, RedactionStrategy } = require('./services/pii-redactor');

app.use(piiRedactionMiddleware({ enabled: true, strategy: RedactionStrategy.PARTIAL, patterns: ['ssn', 'credit_card', 'email', 'phone'] })); ```

Direct API

```javascript const { detectPII, redactPII } = require('./services/pii-redactor');

// Detect PII const text = 'My SSN is 123-45-6789'; const detected = detectPII(text);

console.log(detected); // [{ // type: 'ssn', // name: 'SSN', // value: '123-45-6789', // position: 10, // length: 11, // severity: 'critical', // category: 'government_id' // }]

// Redact PII const result = redactPII(text, { strategy: 'partial' });

console.log(result); // { // redacted: 'My SSN is XXX-XX-6789', // original: 'My SSN is 123-45-6789', // detections: [...], // changed: true // } ```

Selective Pattern Detection

```javascript // Only check for critical PII const result = redactPII(text, { patterns: ['ssn', 'credit_card', 'passport'], strategy: 'mask' });

// Check everything except emails const allPatterns = Object.keys(PII_PATTERNS); const patternsExceptEmail = allPatterns.filter(p => p !== 'email');

const result = redactPII(text, { patterns: patternsExceptEmail }); ```

Configuration

Environment Variables

```bash

Enable/disable PII redaction

PII_REDACTION_ENABLED=true

Default redaction strategy

PII_REDACTION_STRATEGY=partial

Token encryption key (for TOKEN strategy)

PII_TOKEN_KEY=your-256-bit-encryption-key

Patterns to check (comma-separated)

PII_PATTERNS=ssn,credit_card,email,phone ```

Per-Request Configuration

Pass options in request headers:

```bash curl -X POST https://api.infershield.io/v1/chat/completions \ -H "X-PII-Strategy: partial" \ -H "X-PII-Patterns: ssn,credit_card" \ -d '{"prompt": "My SSN is 123-45-6789"}' ```

API Response Headers

Redacted requests include these headers:

``` X-PII-Redacted: true X-PII-Detections: 2 X-PII-Types: ssn,email ```

Audit Logging

All PII detections are logged:

```javascript { "timestamp": "2024-02-21T23:00:00Z", "action": "pii.detected", "userId": "user-123", "requestId": "req-abc", "detections": [ { "type": "ssn", "severity": "critical", "redacted": true, "strategy": "partial" } ], "totalDetected": 1 } ```

Performance

Detection speed: <5ms per 1KB of text
Memory: ~2MB per process
Latency: <1ms added to request
Throughput: 10,000 req/s (single process)

Optimizations:

Compiled regex patterns (cached)
Streaming detection for large payloads
Zero-copy redaction where possible

Examples

Protect Healthcare Data

```javascript app.use('/api/medical', piiRedactionMiddleware({ patterns: ['ssn', 'medical_record', 'date_of_birth', 'phone', 'email'], strategy: RedactionStrategy.MASK })); ```

Protect Financial Data

```javascript app.use('/api/payments', piiRedactionMiddleware({ patterns: ['credit_card', 'bank_account', 'ssn'], strategy: RedactionStrategy.HASH })); ```

Partial Redaction for User Context

```javascript // Keep last 4 digits for user reference app.use(piiRedactionMiddleware({ strategy: RedactionStrategy.PARTIAL, patterns: ['credit_card', 'ssn', 'phone'] })); ```

Testing

Run PII detection tests:

```bash npm test services/pii-redactor.test.js ```

Coverage:

14 PII pattern types
5 redaction strategies
Edge cases (empty, long text, special chars)
Validation (Luhn algorithm for credit cards)

Compliance

HIPAA (Healthcare)

Protected Health Information (PHI):

✅ Names (partial via email)
✅ Dates of birth
✅ Phone numbers
✅ Email addresses
✅ Medical record numbers
✅ SSN

GDPR (European)

Personal Data:

✅ Email addresses
✅ Phone numbers
✅ IP addresses
✅ Government IDs

PCI-DSS (Payment Cards)

Cardholder Data:

✅ Credit card numbers (Luhn validated)
✅ CVV detection (pattern: `\b\d{3,4}\b`)

SOC 2 (Enterprise)

Sensitive Data Protection:

✅ PII detection
✅ Audit logging
✅ Encryption (TOKEN strategy)
✅ Access controls

Limitations

Not Detected (Yet)

Names (too many false positives)
Addresses (complex patterns)
Non-US formats (international phone, etc.)
Biometric data
Photos/images (only text)

False Positives

Phone numbers: May match invoice numbers
Bank accounts: May match order IDs
API keys: May match hashes

Mitigation:

Use `validateMatches: true` option
Customize patterns for your domain
Whitelist known false positives

Roadmap

v0.5.1 (Next Release)

International phone formats
Address detection (US + EU)
Name detection (with ML)
Custom regex patterns (user-defined)

v0.6 (Future)

OCR for image-based PII
Audio transcription redaction
Real-time streaming redaction
Multi-language support

Best Practices

Use PARTIAL for user-facing - Users can verify last 4 digits
Use HASH for audit logs - One-way, searchable
Use TOKEN for reversibility - Enterprise customers only
Always log detections - Compliance requirement
Test with real data - Validate patterns work for your use case
Customize patterns - Add industry-specific patterns
Monitor false positives - Adjust patterns as needed

Support

Docs: https://docs.infershield.io/pii-redaction
Issues: https://github.com/InferShield/infershield/issues
Discussions: https://github.com/InferShield/infershield/discussions

License

MIT - See LICENSE file for details

FilesExpand file tree

PII_REDACTION.md

Latest commit

History