Skip to content

Latest commit

 

History

History
338 lines (259 loc) · 7.33 KB

File metadata and controls

338 lines (259 loc) · 7.33 KB

PII Detection & Redaction

Automatic detection and redaction of personally identifiable information (PII) before sending data to LLMs.

Overview

InferShield protects sensitive data by detecting and redacting PII before it reaches AI models. This is critical for:

  • HIPAA compliance - Healthcare organizations
  • GDPR compliance - European users
  • PCI-DSS compliance - Payment card data
  • SOC 2 compliance - Enterprise security audits

Supported PII Types

Critical (Auto-blocked in Free Tier)

  • SSN - US Social Security Numbers
  • Credit Cards - Validated with Luhn algorithm
  • Passport Numbers - Government-issued passports
  • Medical Records - MRN, patient IDs
  • Bank Accounts - Account numbers
  • API Keys - Generic + AWS-specific
  • AWS Keys - Access keys (AKIA...)

High Sensitivity

  • Email Addresses - Full email detection
  • Driver's Licenses - State-issued IDs

Medium Sensitivity

  • Phone Numbers - US format
  • IP Addresses - IPv4 (validated)
  • Dates of Birth - MM/DD/YYYY format

Redaction Strategies

1. MASK (Default)

Replace entire value with label: ``` Input: "My SSN is 123-45-6789" Output: "My SSN is [SSN_REDACTED]" ```

2. PARTIAL

Show last 4 digits: ``` Input: "Card: 4532-1488-0343-6467" Output: "Card: XXXX-XXXX-XXXX-6467" ```

3. HASH

One-way hash: ``` Input: "SSN: 123-45-6789" Output: "SSN: [a3f5c2d1]" ```

4. TOKEN (Enterprise)

Reversible tokenization: ``` Input: "SSN: 123-45-6789" Output: "SSN: [TOKEN_e3b0c442]" (can be detokenized with key) ```

5. REMOVE

Complete removal: ``` Input: "My SSN is 123-45-6789 and email is test@example.com" Output: "My SSN is and email is " ```

Usage

As Express Middleware

```javascript const { piiRedactionMiddleware, RedactionStrategy } = require('./services/pii-redactor');

app.use(piiRedactionMiddleware({ enabled: true, strategy: RedactionStrategy.PARTIAL, patterns: ['ssn', 'credit_card', 'email', 'phone'] })); ```

Direct API

```javascript const { detectPII, redactPII } = require('./services/pii-redactor');

// Detect PII const text = 'My SSN is 123-45-6789'; const detected = detectPII(text);

console.log(detected); // [{ // type: 'ssn', // name: 'SSN', // value: '123-45-6789', // position: 10, // length: 11, // severity: 'critical', // category: 'government_id' // }]

// Redact PII const result = redactPII(text, { strategy: 'partial' });

console.log(result); // { // redacted: 'My SSN is XXX-XX-6789', // original: 'My SSN is 123-45-6789', // detections: [...], // changed: true // } ```

Selective Pattern Detection

```javascript // Only check for critical PII const result = redactPII(text, { patterns: ['ssn', 'credit_card', 'passport'], strategy: 'mask' });

// Check everything except emails const allPatterns = Object.keys(PII_PATTERNS); const patternsExceptEmail = allPatterns.filter(p => p !== 'email');

const result = redactPII(text, { patterns: patternsExceptEmail }); ```

Configuration

Environment Variables

```bash

Enable/disable PII redaction

PII_REDACTION_ENABLED=true

Default redaction strategy

PII_REDACTION_STRATEGY=partial

Token encryption key (for TOKEN strategy)

PII_TOKEN_KEY=your-256-bit-encryption-key

Patterns to check (comma-separated)

PII_PATTERNS=ssn,credit_card,email,phone ```

Per-Request Configuration

Pass options in request headers:

```bash curl -X POST https://api.infershield.io/v1/chat/completions \ -H "X-PII-Strategy: partial" \ -H "X-PII-Patterns: ssn,credit_card" \ -d '{"prompt": "My SSN is 123-45-6789"}' ```

API Response Headers

Redacted requests include these headers:

``` X-PII-Redacted: true X-PII-Detections: 2 X-PII-Types: ssn,email ```

Audit Logging

All PII detections are logged:

```javascript { "timestamp": "2024-02-21T23:00:00Z", "action": "pii.detected", "userId": "user-123", "requestId": "req-abc", "detections": [ { "type": "ssn", "severity": "critical", "redacted": true, "strategy": "partial" } ], "totalDetected": 1 } ```

Performance

  • Detection speed: <5ms per 1KB of text
  • Memory: ~2MB per process
  • Latency: <1ms added to request
  • Throughput: 10,000 req/s (single process)

Optimizations:

  • Compiled regex patterns (cached)
  • Streaming detection for large payloads
  • Zero-copy redaction where possible

Examples

Protect Healthcare Data

```javascript app.use('/api/medical', piiRedactionMiddleware({ patterns: ['ssn', 'medical_record', 'date_of_birth', 'phone', 'email'], strategy: RedactionStrategy.MASK })); ```

Protect Financial Data

```javascript app.use('/api/payments', piiRedactionMiddleware({ patterns: ['credit_card', 'bank_account', 'ssn'], strategy: RedactionStrategy.HASH })); ```

Partial Redaction for User Context

```javascript // Keep last 4 digits for user reference app.use(piiRedactionMiddleware({ strategy: RedactionStrategy.PARTIAL, patterns: ['credit_card', 'ssn', 'phone'] })); ```

Testing

Run PII detection tests:

```bash npm test services/pii-redactor.test.js ```

Coverage:

  • 14 PII pattern types
  • 5 redaction strategies
  • Edge cases (empty, long text, special chars)
  • Validation (Luhn algorithm for credit cards)

Compliance

HIPAA (Healthcare)

Protected Health Information (PHI):

  • ✅ Names (partial via email)
  • ✅ Dates of birth
  • ✅ Phone numbers
  • ✅ Email addresses
  • ✅ Medical record numbers
  • ✅ SSN

GDPR (European)

Personal Data:

  • ✅ Email addresses
  • ✅ Phone numbers
  • ✅ IP addresses
  • ✅ Government IDs

PCI-DSS (Payment Cards)

Cardholder Data:

  • ✅ Credit card numbers (Luhn validated)
  • ✅ CVV detection (pattern: `\b\d{3,4}\b`)

SOC 2 (Enterprise)

Sensitive Data Protection:

  • ✅ PII detection
  • ✅ Audit logging
  • ✅ Encryption (TOKEN strategy)
  • ✅ Access controls

Limitations

Not Detected (Yet)

  • Names (too many false positives)
  • Addresses (complex patterns)
  • Non-US formats (international phone, etc.)
  • Biometric data
  • Photos/images (only text)

False Positives

  • Phone numbers: May match invoice numbers
  • Bank accounts: May match order IDs
  • API keys: May match hashes

Mitigation:

  • Use `validateMatches: true` option
  • Customize patterns for your domain
  • Whitelist known false positives

Roadmap

v0.5.1 (Next Release)

  • International phone formats
  • Address detection (US + EU)
  • Name detection (with ML)
  • Custom regex patterns (user-defined)

v0.6 (Future)

  • OCR for image-based PII
  • Audio transcription redaction
  • Real-time streaming redaction
  • Multi-language support

Best Practices

  1. Use PARTIAL for user-facing - Users can verify last 4 digits
  2. Use HASH for audit logs - One-way, searchable
  3. Use TOKEN for reversibility - Enterprise customers only
  4. Always log detections - Compliance requirement
  5. Test with real data - Validate patterns work for your use case
  6. Customize patterns - Add industry-specific patterns
  7. Monitor false positives - Adjust patterns as needed

Support

License

MIT - See LICENSE file for details