This package provides tools for detecting sensitive data leakage in Enterprise Search RAG applications and for identifying attempts to extract sensitive data using similar techniques.
The package consists of two main tools:
- Sensitive Data Leakage Detection Tool: Tests Enterprise Search RAG applications for sensitive data leakage using domain-specific words that might elicit sensitive information.
- OpenTelemetry-based Detection Solution: Monitors requests to Enterprise Search RAG applications to detect potential sensitive data leakage attacks.
- Python 3.8 or higher
- pip package manager
```bash
git clone https://github.com/yourusername/sensitive-data-detection.git
cd sensitive-data-detection
pip install -e .
```

The package requires the following main dependencies:
- openai (for word generation)
- transformers (for local sensitivity analysis)
- torch (for running transformer models)
- opentelemetry (for detection solution)
- scikit-learn (for anomaly detection)
- requests (for API communication)
All dependencies will be automatically installed when installing the package.
This tool tests Enterprise Search RAG applications for sensitive data leakage by:
- Generating domain-specific words that might elicit sensitive information
- Querying the Enterprise Search application with these words
- Analyzing the responses with a local LLM to determine if they contain sensitive information
- Storing the words that successfully retrieved sensitive information
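The steps above can be sketched as a simple loop. This is an illustrative sketch only; the helper functions below are hypothetical stand-ins for the package's WordGenerator, EnterpriseSearchInterface, SensitivityAnalyzer, and WordBank components, not its actual API.

```python
def generate_words(domain):
    # Stand-in for LLM-based, domain-specific word generation
    return ["salary", "termination", "acquisition"]

def query_search(word):
    # Stand-in for an Enterprise Search query; returns response text
    return f"Results mentioning {word} ..."

def score_sensitivity(text):
    # Stand-in for local-model scoring on a 0-100 scale
    return 85 if "salary" in text else 20

def detect_leakage(domain, threshold=70):
    word_bank = []
    for word in generate_words(domain):
        response = query_search(word)
        score = score_sensitivity(response)
        if score >= threshold:
            # Only the word and its score are stored, never the response content
            word_bank.append({"word": word, "sensitivity_score": score})
    return word_bank
```

Note that only the triggering words and their scores are retained, which mirrors the security posture described later in this document.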
```bash
sensitive-data-detector --domain "Corporate information including HR, financial, and strategic data" --provider glean --api-key YOUR_API_KEY
```

```python
from sensitive_data_detection.detector import SensitiveDataDetector

# Initialize the detector
detector = SensitiveDataDetector({
    "word_generator": {
        "llm_provider": "openai",
        "api_key": "YOUR_OPENAI_API_KEY"
    },
    "enterprise_search": {
        "provider": "glean",
        "api_key": "YOUR_GLEAN_API_KEY",
        "api_url": "https://your-instance.glean.com/api/v1/search"
    },
    "sensitivity_analyzer": {
        "model_name": "distilbert-base-uncased",
        "threshold": 70
    }
})

# Run detection
results = detector.detect_leakage(
    domain="Corporate information including HR, financial, and strategic data",
    options={
        "max_words_per_run": 100,
        "batch_size": 10
    }
)

# Print results
print(f"Sensitive terms found: {len(results['sensitive_terms'])}")
for term in results['sensitive_terms'][:5]:
    print(f"- {term['word']} (Score: {term['sensitivity_score']})")
```

The tool supports the following configuration options:
- word_generator: Configuration for the word generation component
  - llm_provider: LLM provider to use (openai, huggingface)
  - api_key: API key for the LLM provider
  - cache_dir: Directory to cache generated words
- enterprise_search: Configuration for the Enterprise Search interface
  - provider: Enterprise Search provider (glean, brain, etc.)
  - api_key: API key for the provider
  - api_url: API URL for the provider
- sensitivity_analyzer: Configuration for the sensitivity analysis component
  - model_name: Model name for sensitivity analysis
  - threshold: Threshold for sensitivity score (0-100)
- word_bank: Configuration for the word bank component
  - storage_path: Path to the word bank database
This tool monitors requests to Enterprise Search RAG applications to detect potential sensitive data leakage attacks using an ensemble of detection methods:
- Pattern-based detection: Looks for suspicious patterns in queries
- Statistical detection: Identifies anomalies in query rates and patterns
- Behavioral detection: Analyzes user behavior for deviations from normal patterns
- Machine learning detection: Uses ML models to detect anomalous queries
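The ensemble combines the four methods via the detector_weights shown in the configuration below. As a hedged sketch (the actual EnsembleDetector API may differ), assuming each detector emits a score in [0, 1], the combination is a weighted sum compared against an alert threshold:

```python
# Weights matching the example configuration in this document
DETECTOR_WEIGHTS = {"pattern": 0.3, "statistical": 0.25, "behavioral": 0.25, "ml": 0.2}

def ensemble_score(scores, weights=DETECTOR_WEIGHTS):
    # Weighted sum of per-detector scores; missing detectors contribute 0
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)

def should_alert(scores, alert_threshold=0.5):
    # Raise an alert when the combined score crosses the threshold
    return ensemble_score(scores) >= alert_threshold
```

With these weights, two detectors firing strongly (e.g. pattern and statistical at 1.0) already yields 0.55 and crosses a 0.5 threshold, while any single detector alone does not.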
```bash
otel-detector --config config.json
```

```python
from sensitive_data_detection.otel_detector import OpenTelemetryDetector, QueryData

# Initialize the detector
detector = OpenTelemetryDetector({
    "opentelemetry": {
        "service_name": "sensitive-data-detection",
        "exporter_endpoint": "http://localhost:4317"
    },
    "ensemble_detector": {
        "pattern_detector": {
            "use_word_generator": True,
            "llm_provider": "openai",
            "api_key": "YOUR_OPENAI_API_KEY"
        },
        "detector_weights": {
            "pattern": 0.3,
            "statistical": 0.25,
            "behavioral": 0.25,
            "ml": 0.2
        }
    },
    "detection": {
        "domain_description": "Corporate information including HR, financial, and strategic data"
    }
})

# Start the detector
detector.start()

# Process an HTTP request
detector.process_http_request({
    "query": "confidential document",
    "user_id": "user123",
    "client_ip": "192.168.1.1",
    "user_agent": "Mozilla/5.0",
    "response_time": 100,
    "result_count": 5
})

# Get alerts
alerts = detector.get_alerts({"limit": 10})
for alert in alerts:
    print(f"Alert: {alert.severity} - {alert.detection_method}")
    print(f"User: {alert.user_id}")
    print(f"Queries: {', '.join(alert.queries)}")

# Stop the detector
detector.stop()
```

The detection solution can be integrated with popular web frameworks:
```python
from flask import Flask
from sensitive_data_detection.otel_detector import OpenTelemetryDetector, create_flask_middleware

app = Flask(__name__)
detector = OpenTelemetryDetector()
create_flask_middleware(app, detector)

@app.route('/search')
def search():
    # Your search endpoint
    pass

if __name__ == '__main__':
    detector.start()
    app.run()
```

```python
import uvicorn
from fastapi import FastAPI
from sensitive_data_detection.otel_detector import OpenTelemetryDetector, create_fastapi_middleware

app = FastAPI()
detector = OpenTelemetryDetector()
create_fastapi_middleware(app, detector)

@app.get('/search')
async def search():
    # Your search endpoint
    pass

if __name__ == '__main__':
    detector.start()
    uvicorn.run(app)
```

The detection solution supports the following configuration options:
- opentelemetry: Configuration for OpenTelemetry
  - service_name: Service name for telemetry
  - exporter_endpoint: Endpoint for the telemetry exporter
  - batch_timeout_secs: Batch timeout in seconds
- ensemble_detector: Configuration for the ensemble detector
  - pattern_detector: Configuration for the pattern detector
  - statistical_detector: Configuration for the statistical detector
  - behavioral_detector: Configuration for the behavioral detector
  - ml_detector: Configuration for the ML detector
  - detector_weights: Weights for each detector
  - alert_threshold: Threshold for generating alerts
- alert_manager: Configuration for the alert manager
  - console_alerts: Whether to print alerts to the console
  - file_alerts: Whether to write alerts to a file
  - webhook_url: URL for webhook notifications
- detection: Configuration for detection
  - processing_interval: Interval for processing queries
  - domain_description: Domain description for pattern detection
  - min_queries_for_detection: Minimum queries required for detection
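Putting these options together, a config.json passed to otel-detector might look like the following. The keys come from the list above; the values are illustrative defaults, not prescribed settings:

```json
{
  "opentelemetry": {
    "service_name": "sensitive-data-detection",
    "exporter_endpoint": "http://localhost:4317",
    "batch_timeout_secs": 5
  },
  "ensemble_detector": {
    "detector_weights": {"pattern": 0.3, "statistical": 0.25, "behavioral": 0.25, "ml": 0.2},
    "alert_threshold": 0.5
  },
  "alert_manager": {
    "console_alerts": true,
    "file_alerts": false,
    "webhook_url": "https://example.com/alerts"
  },
  "detection": {
    "processing_interval": 60,
    "domain_description": "Corporate information including HR, financial, and strategic data",
    "min_queries_for_detection": 5
  }
}
```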
The tool consists of the following components:
- WordGenerator: Generates domain-specific words that might elicit sensitive information
- EnterpriseSearchInterface: Interfaces with Enterprise Search RAG applications
- SensitivityAnalyzer: Analyzes responses to determine if they contain sensitive information
- WordBank: Stores and manages words that successfully retrieved sensitive information
- SensitiveDataDetector: Integrates all components and provides the main API
The solution consists of the following components:
- OpenTelemetryDetector: Main class that integrates all components
- PatternDetector: Detects suspicious patterns in queries
- StatisticalDetector: Identifies statistical anomalies in query patterns
- BehavioralDetector: Analyzes user behavior for deviations
- MLDetector: Uses machine learning for anomaly detection
- EnsembleDetector: Combines multiple detection methods
- AlertManager: Manages and stores alerts
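As an illustration of the statistical approach, a minimal rate-anomaly check might flag users whose query volume deviates far from the population mean. This is a sketch in the spirit of the StatisticalDetector, not the package's actual implementation:

```python
from statistics import mean, stdev

def find_rate_anomalies(query_counts, z_threshold=2.0):
    # query_counts maps user_id -> number of queries in the current window.
    # A user is flagged when their count sits more than z_threshold standard
    # deviations above the mean across all users.
    counts = list(query_counts.values())
    if len(counts) < 2:
        return []
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [user for user, n in query_counts.items()
            if (n - mu) / sigma > z_threshold]
```

A real detector would also account for time-of-day effects and per-user baselines, which is where the BehavioralDetector complements this purely statistical view.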
- The tools are designed to handle sensitive information securely
- No actual sensitive content is stored, only the words that retrieved it
- Local LLM models are used for sensitivity analysis to avoid sending potentially sensitive data to external services
- All data is stored locally and can be encrypted if needed
This project is licensed under the MIT License - see the LICENSE file for details.