- 🔒 Overview
- 🎯 Key Features
- ⚙️ Configuration
- 🧠 Dual-Agent Validation (Hybrid Method)
- 🛡️ Advanced Features
- 🎨 Web UI Features
- 📊 Bulk Scanning
- 🔍 Search & Filter Examples
- 🔧 Technical Implementation
- 📈 Performance Considerations
- 🚨 Known Limitations
- 🔐 GDPR Compliance Use Cases
- 📚 Related Documentation
- 🎓 Best Practices
- 🔄 Version History
This project includes a comprehensive PII (Personally Identifiable Information) detection system designed for GDPR compliance. The system automatically scans uploaded documents for sensitive data and provides filtering capabilities to identify non-compliant documents.
The system supports 5 detection methods with different trade-offs:
- Regex - Fast pattern matching for common PII types
- Ollama - LLM-based detection with context understanding
- Hybrid - Combines regex + Ollama with dual-agent validation
- Compromise - NLP-based detection using compromise.js
- Advanced - All methods combined with deduplication
| PII Type | Regex | Ollama | Compromise | Description |
|---|---|---|---|---|
| Name | ❌ | ✅ | ✅ | Full names with context validation |
| Email | ✅ | ✅ | ❌ | Email addresses |
| Phone | ✅ | ✅ | ❌ | International phone numbers |
| Address | ❌ | ✅ | ✅ | Street addresses with landmarks |
| Credit Card | ✅ | ✅ | ❌ | Visa, MC, Amex, Discover (Luhn validated) |
| SSN | ✅ | ✅ | ❌ | Social Security Numbers |
| Passport | ❌ | ✅ | ❌ | Passport numbers |
| Driver's License | ❌ | ✅ | ❌ | License numbers |
| Bank Account | ❌ | ✅ | ❌ | Account numbers |
| IP Address | ✅ | ✅ | ❌ | IPv4 addresses |
| Date of Birth | ✅ | ✅ | ✅ | Birth dates |
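
To make the regex column concrete, here is a minimal sketch of pattern-based detection for two of the types above. The patterns and the `detectWithRegex` name are assumptions for illustration; the project's actual expressions are not shown in this document.

```javascript
// Hypothetical patterns for illustration; not the project's real regexes.
const PATTERNS = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  ip_address: /\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b/g,
};

// Scans text line by line and reports each match with its 1-based line number.
function detectWithRegex(text) {
  const findings = [];
  text.split('\n').forEach((lineText, index) => {
    for (const [type, pattern] of Object.entries(PATTERNS)) {
      // matchAll works on a copy of the regex, so the shared patterns stay stateless.
      for (const match of lineText.matchAll(pattern)) {
        findings.push({ type, value: match[0], line: index + 1 });
      }
    }
  });
  return findings;
}

const sample = 'Server 192.168.1.10 responded\nContact alice@example.com';
console.log(detectWithRegex(sample));
```

Pattern matching like this is fast but has no notion of context, which is why names and addresses are left to the Ollama and Compromise methods.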
# Enable/disable PII detection
PII_DETECTION_ENABLED=true
# Detection method: ollama, regex, hybrid, compromise, advanced
PII_DETECTION_METHOD=hybrid
# Model for PII detection (requires Ollama)
PII_DETECTION_MODEL=gemma3:4b

- Development/Testing: `PII_DETECTION_METHOD=regex` (fast)
- Production: `PII_DETECTION_METHOD=hybrid` (accurate + validated)
- Maximum Coverage: `PII_DETECTION_METHOD=advanced` (all methods)
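
The settings above could be read and validated at startup roughly as follows. This is a sketch, not the project's loader: the `loadPiiConfig` name and the fallback defaults (`hybrid`, `gemma3:4b`, taken from the sample values) are assumptions.

```javascript
const VALID_METHODS = ['ollama', 'regex', 'hybrid', 'compromise', 'advanced'];

// Builds a config object from an environment-like map (for example process.env).
// Defaults are assumptions borrowed from the sample configuration.
function loadPiiConfig(env) {
  const method = (env.PII_DETECTION_METHOD || 'hybrid').toLowerCase();
  if (!VALID_METHODS.includes(method)) {
    throw new Error(`Unknown PII_DETECTION_METHOD: ${method}`);
  }
  return {
    enabled: env.PII_DETECTION_ENABLED === 'true',
    method,
    model: env.PII_DETECTION_MODEL || 'gemma3:4b',
  };
}

console.log(loadPiiConfig({ PII_DETECTION_ENABLED: 'true', PII_DETECTION_METHOD: 'regex' }));
```

Failing fast on an unknown method avoids silently running without detection because of a typo in the environment.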
The hybrid method uses a two-stage validation process:
- LLM scans document for PII
- Returns findings with context and confidence
- Handles complex formats (Hebrew text, varied phone formats)
- Second LLM validates each finding
- Confirms PII is genuine (not part of business info)
- Reduces false positives from company names, product IDs, etc.
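
The two stages can be sketched as a small pipeline. Everything here is illustrative: `runHybridDetection`, the agent signatures, and the `minConfidence` threshold are assumptions, and the agents are synchronous stubs, whereas real agents would be asynchronous calls to Ollama.

```javascript
// Stage 1 proposes findings; stage 2 confirms or rejects each one.
// minConfidence is a hypothetical cut-off, not a documented setting.
function runHybridDetection(text, detectionAgent, validationAgent, minConfidence = 0.5) {
  const confirmed = [];
  for (const candidate of detectionAgent(text)) {
    const verdict = validationAgent(candidate);
    if (verdict.isValid && verdict.confidence >= minConfidence) {
      confirmed.push({ ...candidate, confidence: verdict.confidence });
    }
  }
  return confirmed;
}

// Stub agents for illustration only; real agents would prompt an LLM.
const stubDetector = () => [
  { type: 'phone', value: '000-0000000', context: 'Contact us at 000-0000000' },
  { type: 'phone', value: '000-0000000', context: 'My personal mobile is 000-0000000' },
];
const stubValidator = (finding) =>
  finding.context.includes('personal')
    ? { isValid: true, confidence: 0.9 }
    : { isValid: false, confidence: 0.3 };

console.log(runHybridDetection('...', stubDetector, stubValidator));
```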
// Example: Company phone number vs. personal phone
{
finding: "000-0000000",
context: "Contact us at 000-0000000",
isValid: false, // Validation agent rejects (business contact)
confidence: 0.3
}
{
finding: "000-0000000",
context: "My personal mobile is 000-0000000",
isValid: true, // Validation agent confirms (personal info)
confidence: 0.9
}

Prevents infinite loops with repeated findings:
- Tracks occurrence count per finding
- Stops stream if same finding appears >3 times
- Critical for handling LLM response anomalies
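
A minimal sketch of this guard, assuming findings are keyed by type and value (the key format and the `createDuplicateGuard` name are hypothetical):

```javascript
// Returns a function that counts occurrences per finding and signals
// when the stream should stop (same finding seen more than maxRepeats times).
function createDuplicateGuard(maxRepeats = 3) {
  const counts = new Map();
  return function shouldStop(finding) {
    const key = `${finding.type}:${finding.value}`;
    const count = (counts.get(key) || 0) + 1;
    counts.set(key, count);
    return count > maxRepeats;
  };
}

const shouldStop = createDuplicateGuard();
const repeated = { type: 'phone', value: '000-0000000' };
for (let i = 1; i <= 4; i++) {
  console.log(i, shouldStop(repeated));
}
```

The caller would abort the LLM stream as soon as the guard returns `true`.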
International phone validation with graceful handling:
- Uses the `phone` library for format validation
- Lowers confidence to 0.6 if validation fails (instead of rejecting)
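
The graceful fallback might look like the sketch below. The validator is passed in as a function so that no particular library API is assumed; `scorePhoneFinding` and the digit-count validator are hypothetical, and capping rather than overwriting the confidence is a design assumption.

```javascript
// Keeps a finding whose number fails validation, but caps its confidence.
function scorePhoneFinding(finding, isValidPhone, fallbackConfidence = 0.6) {
  if (isValidPhone(finding.value)) {
    return finding;
  }
  // Math.min ensures an already low confidence is never raised to the fallback.
  return { ...finding, confidence: Math.min(finding.confidence, fallbackConfidence) };
}

// Placeholder validator for illustration: 7 to 15 digits after stripping punctuation.
const looksLikePhone = (value) => /^\d{7,15}$/.test(value.replace(/[\s()+.-]/g, ''));

console.log(scorePhoneFinding({ type: 'phone', value: '12-34', confidence: 0.9 }, looksLikePhone));
```

Lowering confidence instead of rejecting keeps unusual but genuine formats visible for the validation agent or a human reviewer.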
Multi-step validation process:
- Pattern matching (Visa, Mastercard, Amex, Discover)
- Luhn algorithm checksum validation
- Removes spaces and hyphens before validation
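
The normalization and checksum steps can be sketched as follows. The `passesLuhn` name and the 13 to 19 digit length bound are assumptions; brand-specific pattern matching is omitted here.

```javascript
// Strips spaces and hyphens, then applies the Luhn checksum.
function passesLuhn(input) {
  const digits = input.replace(/[\s-]/g, '');
  if (!/^\d{13,19}$/.test(digits)) {
    return false;
  }
  let sum = 0;
  let doubleNext = false;
  // Walk from the rightmost digit, doubling every second digit.
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

console.log(passesLuhn('4111 1111 1111 1111')); // true (well-known Visa test number)
console.log(passesLuhn('4111-1111-1111-1112')); // false
```

The checksum filters out most random 16-digit strings, such as order or product IDs, that merely look like card numbers.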
| Risk Level | Criteria | Example PII Types |
|---|---|---|
| Low | 1-2 low-risk items | Email only, Single phone number |
| Medium | 3-4 items or 1 medium-risk | Multiple emails + phones |
| High | 5+ items or 1 high-risk | Names + addresses, Credit cards |
| Critical | Multiple high-risk items | SSN + credit card + address |
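
One way to read the table as code is sketched below. The assignment of types to risk tiers and the choice to count findings rather than distinct types are assumptions; the document does not define which types are low, medium, or high risk.

```javascript
// Hypothetical tier assignment; the real mapping is not specified here.
const HIGH_RISK = new Set(['credit_card', 'ssn', 'passport', 'bank_account', 'drivers_license']);
const MEDIUM_RISK = new Set(['name', 'address', 'date_of_birth']);

// Maps a list of findings ({ type, ... }) to a risk level following the table's criteria.
function classifyRisk(findings) {
  if (findings.length === 0) return 'none';
  const highCount = findings.filter((f) => HIGH_RISK.has(f.type)).length;
  const mediumCount = findings.filter((f) => MEDIUM_RISK.has(f.type)).length;
  if (highCount >= 2) return 'critical';
  if (highCount === 1 || findings.length >= 5) return 'high';
  if (mediumCount >= 1 || findings.length >= 3) return 'medium';
  return 'low';
}

console.log(classifyRisk([{ type: 'email' }]));
console.log(classifyRisk([{ type: 'ssn' }, { type: 'credit_card' }, { type: 'address' }]));
```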
Filter documents by detection status:
- Never Scanned - Documents not yet scanned for PII
- None - No PII detected (clean documents)
- Low - Minimal PII (1-2 low-risk items)
- Medium - Moderate PII (3-4 items)
- High - Significant PII (5+ items or credit cards)
- Critical - Severe PII exposure (SSN, multiple high-risk)
Multi-select filter by specific PII types:
- 💳 Credit Card
- 📞 Phone
- 📍 Address
- 👤 Name
- 🏦 Bank Account
- 🆔 SSN
- 🛂 Passport
- 🚗 Driver's License
- 📅 Date of Birth
- 🌐 IP Address
Real-time PII risk statistics:
PII Risk Levels:
├─ Never Scanned: 51 documents
├─ None: 1 document (GDPR compliant)
├─ Low: 2 documents
├─ Medium: 0 documents
├─ High: 5 documents ⚠️
└─ Critical: 1 document 🚨
Scan all documents at once via API:
# Trigger bulk PII scan
curl -X POST http://localhost:3001/api/documents/scan-all-pii
# Response
{
"scanned": 60,
"withPII": 9,
"noPII": 51,
"errors": 0
}

Note: Bulk scanning can take time depending on:
- Number of documents
- Document size
- Detection method (hybrid is slower but more accurate)
- Ollama server performance
// Web UI: Select any PII severity level except "None"
// API: Filter by pii_detected = true
{
must: [
{ key: 'pii_detected', match: { value: true } }
]
}

// Web UI: Select "Credit Card" from PII Types filter
// API: Filter by pii_types array
{
must: [
{ key: 'pii_types', match: { any: ['credit_card'] } }
]
}

// Web UI: Select "High" from PII Severity filter
// API: Filter by pii_risk_level
{
must: [
{ key: 'pii_risk_level', match: { value: 'high' } }
]
}

// Web UI: Select "Never Scanned" from PII Severity filter
// Backend: Uses application-layer filtering (must_not doesn't work for missing fields)
// Returns documents where pii_detected field is undefined

Each scanned document gets these payload fields:
{
pii_detected: true, // boolean
pii_types: ['email', 'phone', 'name'], // array of strings
pii_risk_level: 'high', // low/medium/high/critical
pii_scan_date: '2025-12-30T10:30:00Z', // ISO timestamp
pii_detection_method: 'hybrid', // detection method used
pii_details: [ // array of findings
{
type: 'email',
value: 'john@example.com',
context: '...contact me at john@example.com for...',
confidence: 0.95,
line: 42
}
]
}

For Ollama-based detection:
- Streaming prevents timeout on large documents
- Accumulates partial JSON objects
- Detects duplicate findings in real-time
- Gracefully handles malformed JSON
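
The accumulation step could be approximated with a brace-depth scanner like the one below. This is a sketch under the assumption that the model streams findings as JSON objects, possibly wrapped in an array or surrounded by stray text; `createFindingStreamParser` is a hypothetical name.

```javascript
// Buffers streamed chunks and emits each complete top-level {...} object.
// Malformed objects are skipped instead of aborting the stream.
function createFindingStreamParser(onFinding) {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    let depth = 0;
    let start = -1;
    let inString = false;
    let escaped = false;
    let consumed = 0;
    for (let i = 0; i < buffer.length; i++) {
      const ch = buffer[i];
      if (inString) {
        // Braces inside string values must not affect the depth counter.
        if (escaped) escaped = false;
        else if (ch === '\\') escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') {
        if (depth > 0) inString = true;
      } else if (ch === '{') {
        if (depth === 0) start = i;
        depth++;
      } else if (ch === '}' && depth > 0) {
        depth--;
        if (depth === 0) {
          try {
            onFinding(JSON.parse(buffer.slice(start, i + 1)));
          } catch {
            // Malformed object: drop it and keep reading.
          }
          consumed = i + 1;
        }
      }
    }
    // Keep only the unfinished tail for the next chunk.
    buffer = buffer.slice(consumed);
  };
}

const received = [];
const push = createFindingStreamParser((finding) => received.push(finding));
push('[{"type":"email","va');
push('lue":"a@b.co"},{"type": bad},');
push('{"type":"phone","value":"{555}"}]');
console.log(received);
```

A duplicate guard can be called inside `onFinding` to stop the stream when the model starts repeating itself.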
Special handling for missing PII fields:
- Qdrant's `must_not` doesn't detect missing fields
- Uses application-layer filtering
- Fetches all documents, filters where `pii_detected === undefined`
- Returns paginated results from filtered set
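
The filtering and pagination steps might be sketched like this, assuming each fetched document has the shape `{ id, payload }`. The `filterNeverScanned` name, the document shape, and the default page size are illustrative.

```javascript
// Filters already-fetched documents to those never scanned, then paginates in memory.
function filterNeverScanned(documents, page = 1, pageSize = 20) {
  const unscanned = documents.filter((doc) => doc.payload?.pii_detected === undefined);
  const startIndex = (page - 1) * pageSize;
  return {
    total: unscanned.length,
    page,
    pageSize,
    results: unscanned.slice(startIndex, startIndex + pageSize),
  };
}

const docs = [
  { id: 1, payload: { pii_detected: true } },
  { id: 2, payload: {} },
  { id: 3, payload: { pii_detected: false } },
  { id: 4, payload: { title: 'notes' } },
];
console.log(filterNeverScanned(docs, 1, 10));
```

Because every document is fetched before filtering, the cost grows with collection size, which is the trade-off noted under Known Limitations.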
| Method | Speed | Accuracy | Network Calls |
|---|---|---|---|
| Regex | ~50ms | 70% | 0 |
| Ollama | ~2-5s | 85% | 1 |
| Hybrid | ~4-10s | 95% | 2 (detection + validation) |
| Compromise | ~100ms | 75% | 0 |
| Advanced | ~5-12s | 98% | 2 |
- Use regex for initial screening
- Use hybrid for high-value documents
- Batch process during off-hours
- Cache results to avoid re-scanning
- Index the `pii_detected` and `pii_risk_level` fields
- False Positives: Company names that look like personal names
- Context Required: Email in signature vs. email in content
- Format Variations: Non-standard date formats may be missed
- Language Coverage: Best results with English text
- Never Scanned Filter: Slower than native Qdrant filters (application-layer)
// Scan document before adding to public database
const piiResult = await detectPII(content);
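// Block both high and critical risk levels; checking only 'high' would let critical documents through
if (piiResult.hasPII && ['high', 'critical'].includes(piiResult.riskLevel)) {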
return { error: 'Document contains sensitive PII, cannot be published' };
}

// Find all documents with PII for review
GET /api/search/semantic?query=&filters={"must":[{"key":"pii_detected","match":{"value":true}}]}

// Scan all documents, get list of non-compliant ones
POST /api/documents/scan-all-pii
// Filter by high-risk for manual review
GET /api/search/semantic?filters={"must":[{"key":"pii_risk_level","match":{"value":"high"}}]}

- File Upload Implementation - Upload process with PII scanning
- Web UI Architecture - Frontend PII filter components
- Quick Reference - API endpoints and commands
- Enable PII Detection in Production: Always scan user-uploaded content
- Use Hybrid Method: Best accuracy-to-speed ratio
- Review High-Risk Documents: Manual verification recommended
- Update Regularly: Re-scan documents when detection improves
- Monitor Statistics: Track PII exposure across your dataset
- Filter URLs: Add PII filters to shareable URLs for audits
- Educate Users: Show PII warnings during upload process
- ✅ Dual-agent validation (hybrid method)
- ✅ Duplicate detection for non-English content
- ✅ Phone number validation with graceful fallback
- ✅ Credit card Luhn validation
- ✅ PII severity filters (6 levels)
- ✅ PII type filters (11 types)
- ✅ Never scanned filter
- ✅ URL persistence for filters
- ✅ Background upload processing
- ✅ Bulk scanning API