This package provides tools for detecting sensitive data leakage in Enterprise Search RAG applications and for identifying attempts to extract sensitive data using similar techniques.
The package consists of two main tools:
- Sensitive Data Leakage Detection Tool: Tests Enterprise Search RAG applications for sensitive data leakage using domain-specific words that might elicit sensitive information.
- OpenTelemetry-based Detection Solution: Monitors requests to Enterprise Search RAG applications to detect potential sensitive data leakage attacks.
- Python 3.8 or higher
- pip package manager
```bash
git clone https://github.com/yourusername/sensitive-data-detection.git
cd sensitive-data-detection
pip install -e .
```

The package requires the following main dependencies:
- openai (for word generation)
- transformers (for local sensitivity analysis)
- torch (for running transformer models)
- opentelemetry (for detection solution)
- scikit-learn (for anomaly detection)
- requests (for API communication)
All dependencies will be automatically installed when installing the package.
This tool tests Enterprise Search RAG applications for sensitive data leakage by:
- Generating domain-specific words that might elicit sensitive information
- Querying the Enterprise Search application with these words
- Analyzing the responses with a local LLM to determine if they contain sensitive information
- Storing the words that successfully retrieved sensitive information
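The steps above can be sketched as a simple loop. This is an illustrative sketch only; the helper functions below are hypothetical stand-ins for the package's WordGenerator, EnterpriseSearchInterface, SensitivityAnalyzer, and WordBank components, not its actual API.

```python
def generate_words(domain):
    # Stand-in for LLM-based, domain-specific word generation
    return ["salary", "termination", "acquisition"]

def query_search(word):
    # Stand-in for an Enterprise Search query; returns response text
    return f"Results mentioning {word} ..."

def score_sensitivity(text):
    # Stand-in for local-model scoring on a 0-100 scale
    return 85 if "salary" in text else 20

def detect_leakage(domain, threshold=70):
    word_bank = []
    for word in generate_words(domain):
        response = query_search(word)
        score = score_sensitivity(response)
        if score >= threshold:
            # Only the word and its score are stored, never the response content
            word_bank.append({"word": word, "sensitivity_score": score})
    return word_bank
```

Note that only the triggering words and their scores are retained, which mirrors the security posture described later in this document.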
```bash
sensitive-data-detector --domain "Corporate information including HR, financial, and strategic data" --provider glean --api-key YOUR_API_KEY
```

```python
from sensitive_data_detection.detector import SensitiveDataDetector

# Initialize the detector
detector = SensitiveDataDetector({
    "word_generator": {
        "llm_provider": "openai",
        "api_key": "YOUR_OPENAI_API_KEY"
    },
    "enterprise_search": {
        "provider": "glean",
        "api_key": "YOUR_GLEAN_API_KEY",
        "api_url": "https://your-instance.glean.com/api/v1/search"
    },
    "sensitivity_analyzer": {
        "model_name": "distilbert-base-uncased",
        "threshold": 70
    }
})

# Run detection
results = detector.detect_leakage(
    domain="Corporate information including HR, financial, and strategic data",
    options={
        "max_words_per_run": 100,
        "batch_size": 10
    }
)

# Print results
print(f"Sensitive terms found: {len(results['sensitive_terms'])}")
for term in results['sensitive_terms'][:5]:
    print(f"- {term['word']} (Score: {term['sensitivity_score']})")
```

The tool supports the following configuration options:
- word_generator: Configuration for the word generation component
  - llm_provider: LLM provider to use (openai, huggingface)
  - api_key: API key for the LLM provider
  - cache_dir: Directory to cache generated words
- enterprise_search: Configuration for the Enterprise Search interface
  - provider: Enterprise Search provider (glean, brain, etc.)
  - api_key: API key for the provider
  - api_url: API URL for the provider
- sensitivity_analyzer: Configuration for the sensitivity analysis component
  - model_name: Model name for sensitivity analysis
  - threshold: Threshold for sensitivity score (0-100)
- word_bank: Configuration for the word bank component
  - storage_path: Path to the word bank database
This tool monitors requests to Enterprise Search RAG applications to detect potential sensitive data leakage attacks using an ensemble of detection methods:
- Pattern-based detection: Looks for suspicious patterns in queries
- Statistical detection: Identifies anomalies in query rates and patterns
- Behavioral detection: Analyzes user behavior for deviations from normal patterns
- Machine learning detection: Uses ML models to detect anomalous queries
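The ensemble combines the four methods via the detector_weights shown in the configuration below. As a hedged sketch (the actual EnsembleDetector API may differ), assuming each detector emits a score in [0, 1], the combination is a weighted sum compared against an alert threshold:

```python
# Weights matching the example configuration in this document
DETECTOR_WEIGHTS = {"pattern": 0.3, "statistical": 0.25, "behavioral": 0.25, "ml": 0.2}

def ensemble_score(scores, weights=DETECTOR_WEIGHTS):
    # Weighted sum of per-detector scores; missing detectors contribute 0
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)

def should_alert(scores, alert_threshold=0.5):
    # Raise an alert when the combined score crosses the threshold
    return ensemble_score(scores) >= alert_threshold
```

With these weights, two detectors firing strongly (e.g. pattern and statistical at 1.0) already yields 0.55 and crosses a 0.5 threshold, while any single detector alone does not.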
```bash
otel-detector --config config.json
```

```python
from sensitive_data_detection.otel_detector import OpenTelemetryDetector, QueryData

# Initialize the detector
detector = OpenTelemetryDetector({
    "opentelemetry": {
        "service_name": "sensitive-data-detection",
        "exporter_endpoint": "http://localhost:4317"
    },
    "ensemble_detector": {
        "pattern_detector": {
            "use_word_generator": True,
            "llm_provider": "openai",
            "api_key": "YOUR_OPENAI_API_KEY"
        },
        "detector_weights": {
            "pattern": 0.3,
            "statistical": 0.25,
            "behavioral": 0.25,
            "ml": 0.2
        }
    },
    "detection": {
        "domain_description": "Corporate information including HR, financial, and strategic data"
    }
})

# Start the detector
detector.start()

# Process an HTTP request
detector.process_http_request({
    "query": "confidential document",
    "user_id": "user123",
    "client_ip": "192.168.1.1",
    "user_agent": "Mozilla/5.0",
    "response_time": 100,
    "result_count": 5
})

# Get alerts
alerts = detector.get_alerts({"limit": 10})
for alert in alerts:
    print(f"Alert: {alert.severity} - {alert.detection_method}")
    print(f"User: {alert.user_id}")
    print(f"Queries: {', '.join(alert.queries)}")

# Stop the detector
detector.stop()
```

The detection solution can be integrated with popular web frameworks:
```python
from flask import Flask
from sensitive_data_detection.otel_detector import OpenTelemetryDetector, create_flask_middleware

app = Flask(__name__)
detector = OpenTelemetryDetector()
create_flask_middleware(app, detector)

@app.route('/search')
def search():
    # Your search endpoint
    pass

if __name__ == '__main__':
    detector.start()
    app.run()
```

```python
import uvicorn
from fastapi import FastAPI
from sensitive_data_detection.otel_detector import OpenTelemetryDetector, create_fastapi_middleware

app = FastAPI()
detector = OpenTelemetryDetector()
create_fastapi_middleware(app, detector)

@app.get('/search')
async def search():
    # Your search endpoint
    pass

if __name__ == '__main__':
    detector.start()
    uvicorn.run(app)
```

The detection solution supports the following configuration options:
- opentelemetry: Configuration for OpenTelemetry
  - service_name: Service name for telemetry
  - exporter_endpoint: Endpoint for the telemetry exporter
  - batch_timeout_secs: Batch timeout in seconds
- ensemble_detector: Configuration for the ensemble detector
  - pattern_detector: Configuration for the pattern detector
  - statistical_detector: Configuration for the statistical detector
  - behavioral_detector: Configuration for the behavioral detector
  - ml_detector: Configuration for the ML detector
  - detector_weights: Weights for each detector
  - alert_threshold: Threshold for generating alerts
- alert_manager: Configuration for the alert manager
  - console_alerts: Whether to print alerts to the console
  - file_alerts: Whether to write alerts to a file
  - webhook_url: URL for webhook notifications
- detection: Configuration for detection
  - processing_interval: Interval for processing queries
  - domain_description: Domain description for pattern detection
  - min_queries_for_detection: Minimum queries required for detection
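Putting these options together, a config.json passed to otel-detector might look like the following. The keys come from the list above; the values are illustrative defaults, not prescribed settings:

```json
{
  "opentelemetry": {
    "service_name": "sensitive-data-detection",
    "exporter_endpoint": "http://localhost:4317",
    "batch_timeout_secs": 5
  },
  "ensemble_detector": {
    "detector_weights": {"pattern": 0.3, "statistical": 0.25, "behavioral": 0.25, "ml": 0.2},
    "alert_threshold": 0.5
  },
  "alert_manager": {
    "console_alerts": true,
    "file_alerts": false,
    "webhook_url": "https://example.com/alerts"
  },
  "detection": {
    "processing_interval": 60,
    "domain_description": "Corporate information including HR, financial, and strategic data",
    "min_queries_for_detection": 5
  }
}
```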
The tool consists of the following components:
- WordGenerator: Generates domain-specific words that might elicit sensitive information
- EnterpriseSearchInterface: Interfaces with Enterprise Search RAG applications
- SensitivityAnalyzer: Analyzes responses to determine if they contain sensitive information
- WordBank: Stores and manages words that successfully retrieved sensitive information
- SensitiveDataDetector: Integrates all components and provides the main API
The solution consists of the following components:
- OpenTelemetryDetector: Main class that integrates all components
- PatternDetector: Detects suspicious patterns in queries
- StatisticalDetector: Identifies statistical anomalies in query patterns
- BehavioralDetector: Analyzes user behavior for deviations
- MLDetector: Uses machine learning for anomaly detection
- EnsembleDetector: Combines multiple detection methods
- AlertManager: Manages and stores alerts
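As an illustration of the statistical approach, a minimal rate-anomaly check might flag users whose query volume deviates far from the population mean. This is a sketch in the spirit of the StatisticalDetector, not the package's actual implementation:

```python
from statistics import mean, stdev

def find_rate_anomalies(query_counts, z_threshold=2.0):
    # query_counts maps user_id -> number of queries in the current window.
    # A user is flagged when their count sits more than z_threshold standard
    # deviations above the mean across all users.
    counts = list(query_counts.values())
    if len(counts) < 2:
        return []
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [user for user, n in query_counts.items()
            if (n - mu) / sigma > z_threshold]
```

A real detector would also account for time-of-day effects and per-user baselines, which is where the BehavioralDetector complements this purely statistical view.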
- The tools are designed to handle sensitive information securely
- No actual sensitive content is stored, only the words that retrieved it
- Local LLM models are used for sensitivity analysis to avoid sending potentially sensitive data to external services
- All data is stored locally and can be encrypted if needed
This project is licensed under the MIT License - see the LICENSE file for details.