Profpwner/SDLDT4Trae_IDX

Sensitive Data Leakage Detection Tool

This package provides tools for detecting sensitive data leakage in Enterprise Search RAG applications, and for detecting attempts to extract sensitive data from them with the same probing techniques.

Overview

The package consists of two main tools:

  1. Sensitive Data Leakage Detection Tool: Tests Enterprise Search RAG applications for sensitive data leakage by using domain-specific words that might elicit sensitive information.

  2. OpenTelemetry-based Detection Solution: Monitors requests to Enterprise Search RAG applications to detect potential sensitive data leakage attacks.

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Install from Source

git clone https://github.com/yourusername/sensitive-data-detection.git
cd sensitive-data-detection
pip install -e .

Dependencies

The package requires the following main dependencies:

  • openai (for word generation)
  • transformers (for local sensitivity analysis)
  • torch (for running transformer models)
  • opentelemetry (for detection solution)
  • scikit-learn (for anomaly detection)
  • requests (for API communication)

pip installs these dependencies automatically when you install the package.

Sensitive Data Leakage Detection Tool

This tool tests Enterprise Search RAG applications for sensitive data leakage by:

  1. Generating domain-specific words that might elicit sensitive information
  2. Querying the Enterprise Search application with these words
  3. Analyzing the responses with a local LLM to determine if they contain sensitive information
  4. Storing the words that successfully retrieved sensitive information
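
The four steps above can be sketched as a single loop. This is a minimal illustration, not the package's implementation: `run_detection`, `fake_search`, and `fake_score` are hypothetical names, the search and scoring functions are stand-ins for the real Enterprise Search call and the local model, and treating the threshold as inclusive is an assumption.

```python
from typing import Callable, Dict, List


def run_detection(
    words: List[str],
    search: Callable[[str], str],
    score: Callable[[str], float],
    threshold: float = 70.0,
) -> List[Dict[str, object]]:
    """Query each candidate word and keep those whose response scores at or above the threshold."""
    hits = []
    for word in words:                      # step 1 output: candidate words
        response = search(word)             # step 2: query the search application
        sensitivity = score(response)       # step 3: score the response (0-100)
        if sensitivity >= threshold:        # step 4: keep only words that leaked something
            hits.append({"word": word, "sensitivity_score": sensitivity})
    return hits


def fake_search(word: str) -> str:
    """Hypothetical stand-in for the Enterprise Search API."""
    corpus = {
        "salary": "Employee salary bands: confidential",
        "lunch": "Cafeteria menu for this week",
    }
    return corpus.get(word, "")


def fake_score(text: str) -> float:
    """Hypothetical stand-in for the local sensitivity model."""
    return 90.0 if "confidential" in text.lower() else 10.0


results = run_detection(["salary", "lunch", "roadmap"], fake_search, fake_score)
print(results)
```

Passing the search and scoring steps in as callables keeps the loop testable without network access or a model download.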

Usage

Command Line Interface

sensitive-data-detector --domain "Corporate information including HR, financial, and strategic data" --provider glean --api-key YOUR_API_KEY

Python API

from sensitive_data_detection.detector import SensitiveDataDetector

# Initialize the detector
detector = SensitiveDataDetector({
    "word_generator": {
        "llm_provider": "openai",
        "api_key": "YOUR_OPENAI_API_KEY"
    },
    "enterprise_search": {
        "provider": "glean",
        "api_key": "YOUR_GLEAN_API_KEY",
        "api_url": "https://your-instance.glean.com/api/v1/search"
    },
    "sensitivity_analyzer": {
        "model_name": "distilbert-base-uncased",
        "threshold": 70
    }
})

# Run detection
results = detector.detect_leakage(
    domain="Corporate information including HR, financial, and strategic data",
    options={
        "max_words_per_run": 100,
        "batch_size": 10
    }
)

# Print results
print(f"Sensitive terms found: {len(results['sensitive_terms'])}")
for term in results['sensitive_terms'][:5]:
    print(f"- {term['word']} (Score: {term['sensitivity_score']})")

Configuration Options

The tool supports the following configuration options:

  • word_generator: Configuration for the word generation component

    • llm_provider: LLM provider to use (openai, huggingface)
    • api_key: API key for the LLM provider
    • cache_dir: Directory to cache generated words
  • enterprise_search: Configuration for the Enterprise Search interface

    • provider: Enterprise Search provider (glean, brain, etc.)
    • api_key: API key for the provider
    • api_url: API URL for the provider
  • sensitivity_analyzer: Configuration for the sensitivity analysis component

    • model_name: Model name for sensitivity analysis
    • threshold: Threshold for sensitivity score (0-100)
  • word_bank: Configuration for the word bank component

    • storage_path: Path to the word bank database
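
As an illustration, a configuration that sets every option listed above might look like the following. The `cache_dir` and `storage_path` values are placeholders, and `validate_threshold` is a hypothetical helper that only checks the documented 0-100 range; the package may validate its configuration differently or not at all.

```python
config = {
    "word_generator": {
        "llm_provider": "openai",
        "api_key": "YOUR_OPENAI_API_KEY",
        "cache_dir": "./word_cache",          # placeholder path
    },
    "enterprise_search": {
        "provider": "glean",
        "api_key": "YOUR_GLEAN_API_KEY",
        "api_url": "https://your-instance.glean.com/api/v1/search",
    },
    "sensitivity_analyzer": {
        "model_name": "distilbert-base-uncased",
        "threshold": 70,
    },
    "word_bank": {
        "storage_path": "./word_bank.db",     # placeholder path
    },
}


def validate_threshold(cfg: dict) -> list:
    """Return a list of problems with the sensitivity threshold (empty if none)."""
    problems = []
    threshold = cfg.get("sensitivity_analyzer", {}).get("threshold")
    if threshold is None:
        problems.append("sensitivity_analyzer.threshold is missing")
    elif not 0 <= threshold <= 100:
        problems.append("sensitivity_analyzer.threshold must be between 0 and 100")
    return problems


print(validate_threshold(config))
```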

OpenTelemetry-based Detection Solution

This tool monitors requests to Enterprise Search RAG applications to detect potential sensitive data leakage attacks using an ensemble of detection methods:

  1. Pattern-based detection: Looks for suspicious patterns in queries
  2. Statistical detection: Identifies anomalies in query rates and patterns
  3. Behavioral detection: Analyzes user behavior for deviations from normal patterns
  4. Machine learning detection: Uses ML models to detect anomalous queries
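
To make the statistical method concrete, here is a minimal sketch of one way to flag an anomalous query rate: compare the current per-window query count for a user against that user's history with a z-score. The function name, the sample numbers, and the choice of z-score are assumptions for illustration; the package's StatisticalDetector may use a different statistic.

```python
from statistics import mean, pstdev
from typing import Sequence


def rate_zscore(history: Sequence[float], current: float) -> float:
    """How many standard deviations `current` lies from the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A flat history: any change at all is treated as maximally unusual.
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma


# Queries per minute for one user over six earlier windows (made-up numbers).
baseline = [4, 5, 6, 5, 4, 6]

for current_rate in (5, 40):
    z = rate_zscore(baseline, current_rate)
    flagged = abs(z) > 3.0  # 3 standard deviations is a common, arbitrary cut-off
    print(f"rate={current_rate} z={z:.2f} flagged={flagged}")
```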

Usage

Command Line Interface

otel-detector --config config.json
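
The README does not show the layout of config.json. Assuming it mirrors the dictionary accepted by the Python API (an assumption, not something the source confirms), a file could be generated like this:

```python
import json
from pathlib import Path

# Assumed schema: the same keys the OpenTelemetryDetector constructor takes.
config = {
    "opentelemetry": {
        "service_name": "sensitive-data-detection",
        "exporter_endpoint": "http://localhost:4317",
    },
    "ensemble_detector": {
        "pattern_detector": {
            "use_word_generator": True,
            "llm_provider": "openai",
            "api_key": "YOUR_OPENAI_API_KEY",
        },
        "detector_weights": {
            "pattern": 0.3,
            "statistical": 0.25,
            "behavioral": 0.25,
            "ml": 0.2,
        },
    },
    "detection": {
        "domain_description": "Corporate information including HR, financial, and strategic data",
    },
}

config_path = Path("config.json")
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

# Read the file back to confirm it round-trips as valid JSON.
loaded = json.loads(config_path.read_text(encoding="utf-8"))
print(f"Wrote {config_path} with sections: {', '.join(loaded)}")
```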

Python API

from sensitive_data_detection.otel_detector import OpenTelemetryDetector, QueryData

# Initialize the detector
detector = OpenTelemetryDetector({
    "opentelemetry": {
        "service_name": "sensitive-data-detection",
        "exporter_endpoint": "http://localhost:4317"
    },
    "ensemble_detector": {
        "pattern_detector": {
            "use_word_generator": True,
            "llm_provider": "openai",
            "api_key": "YOUR_OPENAI_API_KEY"
        },
        "detector_weights": {
            "pattern": 0.3,
            "statistical": 0.25,
            "behavioral": 0.25,
            "ml": 0.2
        }
    },
    "detection": {
        "domain_description": "Corporate information including HR, financial, and strategic data"
    }
})

# Start the detector
detector.start()

# Process an HTTP request
detector.process_http_request({
    "query": "confidential document",
    "user_id": "user123",
    "client_ip": "192.168.1.1",
    "user_agent": "Mozilla/5.0",
    "response_time": 100,
    "result_count": 5
})

# Get alerts
alerts = detector.get_alerts({"limit": 10})
for alert in alerts:
    print(f"Alert: {alert.severity} - {alert.detection_method}")
    print(f"User: {alert.user_id}")
    print(f"Queries: {', '.join(alert.queries)}")

# Stop the detector
detector.stop()

Integration with Web Frameworks

The detection solution can be integrated with popular web frameworks:

Flask

from flask import Flask
from sensitive_data_detection.otel_detector import OpenTelemetryDetector, create_flask_middleware

app = Flask(__name__)
detector = OpenTelemetryDetector()
create_flask_middleware(app, detector)

@app.route('/search')
def search():
    # Your search endpoint; a Flask view must return a response
    return {"results": []}

if __name__ == '__main__':
    detector.start()
    app.run()

FastAPI

from fastapi import FastAPI
from sensitive_data_detection.otel_detector import OpenTelemetryDetector, create_fastapi_middleware

app = FastAPI()
detector = OpenTelemetryDetector()
create_fastapi_middleware(app, detector)

@app.get('/search')
async def search():
    # Your search endpoint
    pass

if __name__ == '__main__':
    detector.start()
    import uvicorn
    uvicorn.run(app)

Configuration Options

The detection solution supports the following configuration options:

  • opentelemetry: Configuration for OpenTelemetry

    • service_name: Service name for telemetry
    • exporter_endpoint: Endpoint for telemetry exporter
    • batch_timeout_secs: Batch timeout in seconds
  • ensemble_detector: Configuration for the ensemble detector

    • pattern_detector: Configuration for the pattern detector
    • statistical_detector: Configuration for the statistical detector
    • behavioral_detector: Configuration for the behavioral detector
    • ml_detector: Configuration for the ML detector
    • detector_weights: Weights for each detector
    • alert_threshold: Threshold for generating alerts
  • alert_manager: Configuration for the alert manager

    • console_alerts: Whether to print alerts to console
    • file_alerts: Whether to write alerts to file
    • webhook_url: URL for webhook notifications
  • detection: Configuration for detection

    • processing_interval: Interval for processing queries
    • domain_description: Domain description for pattern detection
    • min_queries_for_detection: Minimum queries required for detection
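
One plausible reading of detector_weights and alert_threshold is a weighted average of per-detector scores compared against a cut-off. The sketch below assumes each detector emits a score between 0 and 1 and that the threshold is on the same scale; neither is confirmed by the source, and `ensemble_score` is a hypothetical name.

```python
from typing import Dict


def ensemble_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of detector scores; detectors without a score count as 0."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("detector weights must sum to a positive number")
    weighted = sum(weights[name] * scores.get(name, 0.0) for name in weights)
    return weighted / total_weight


weights = {"pattern": 0.3, "statistical": 0.25, "behavioral": 0.25, "ml": 0.2}

# Made-up scores for a single user's recent queries.
scores = {"pattern": 0.9, "statistical": 0.2, "behavioral": 0.4, "ml": 0.1}

alert_threshold = 0.5  # assumed scale and value
combined = ensemble_score(scores, weights)
should_alert = combined >= alert_threshold
print(f"combined={combined:.2f} alert={should_alert}")
```

Dividing by the total weight means the weights do not have to sum to exactly 1, which makes it harder to misconfigure the ensemble.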

Architecture

Sensitive Data Leakage Detection Tool

The tool consists of the following components:

  1. WordGenerator: Generates domain-specific words that might elicit sensitive information
  2. EnterpriseSearchInterface: Interfaces with Enterprise Search RAG applications
  3. SensitivityAnalyzer: Analyzes responses to determine if they contain sensitive information
  4. WordBank: Stores and manages words that successfully retrieved sensitive information
  5. SensitiveDataDetector: Integrates all components and provides the main API
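
As a rough picture of what the WordBank does, the sketch below stores only each word and its score in SQLite, never the retrieved content. The class name, table layout, and the use of SQLite are assumptions; the source says only that storage_path points to "the word bank database".

```python
import sqlite3
from typing import List, Tuple


class WordBankSketch:
    """Hypothetical word bank: persists words and scores, not response text."""

    def __init__(self, storage_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(storage_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS words ("
            "word TEXT PRIMARY KEY, sensitivity_score REAL NOT NULL)"
        )

    def add(self, word: str, score: float) -> None:
        # Re-adding a word overwrites its previous score.
        self.conn.execute(
            "INSERT OR REPLACE INTO words (word, sensitivity_score) VALUES (?, ?)",
            (word, score),
        )
        self.conn.commit()

    def top(self, limit: int = 5) -> List[Tuple[str, float]]:
        """Return the highest-scoring words first."""
        cursor = self.conn.execute(
            "SELECT word, sensitivity_score FROM words "
            "ORDER BY sensitivity_score DESC, word ASC LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()


bank = WordBankSketch()  # in-memory database for the example
bank.add("salary", 92.0)
bank.add("merger", 81.5)
bank.add("salary", 95.0)
print(bank.top())
```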

OpenTelemetry-based Detection Solution

The solution consists of the following components:

  1. OpenTelemetryDetector: Main class that integrates all components
  2. PatternDetector: Detects suspicious patterns in queries
  3. StatisticalDetector: Identifies statistical anomalies in query patterns
  4. BehavioralDetector: Analyzes user behavior for deviations
  5. MLDetector: Uses machine learning for anomaly detection
  6. EnsembleDetector: Combines multiple detection methods
  7. AlertManager: Manages and stores alerts
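
For a sense of what the simplest of these components might compute, here is a minimal pattern-style check: the fraction of a query's tokens that appear in a set of known sensitive terms. This is an illustrative stand-in, not the PatternDetector's actual logic; `pattern_score` and the term list are hypothetical.

```python
import re
from typing import Set


def pattern_score(query: str, sensitive_terms: Set[str]) -> float:
    """Fraction of distinct query tokens that match a known sensitive term."""
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    if not tokens:
        return 0.0
    return len(tokens & sensitive_terms) / len(tokens)


# Hypothetical term list, e.g. words previously saved in the word bank.
terms = {"confidential", "salary", "merger"}

for query in ("confidential document", "cafeteria menu", ""):
    print(repr(query), pattern_score(query, terms))
```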

Security Considerations

  • The tools are designed to handle sensitive information securely
  • The word bank stores only the words that retrieved sensitive content, not the content itself
  • Sensitivity analysis uses local LLMs, so potentially sensitive data is not sent to external services
  • All data is stored locally and can be encrypted if needed

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

Trying out "vibe" coding
