Spark Optimizer

An open-source tool that analyzes historical Spark job runs to recommend optimal resource configurations for future executions.

Features

  • Multiple Data Collection Methods

    • Parse Spark event logs
    • Query Spark History Server API
    • Integrate with cloud provider APIs (AWS EMR, Databricks, GCP Dataproc)
  • Intelligent Recommendations

    • Similarity-based matching with historical jobs
    • ML-powered predictions for runtime and resource needs
    • Rule-based optimization for common anti-patterns
  • Cost Optimization

    • Balance performance vs. cost trade-offs
    • Support for spot/preemptible instances
    • Multi-cloud cost comparison
  • Easy Integration

    • REST API for programmatic access
    • CLI for manual operations
    • Python SDK for custom workflows

Quick Start

Installation

# Clone the repository
git clone https://github.com/gridatek/spark-optimizer.git
cd spark-optimizer

# Install dependencies
pip install -e .

# Optional: Install with cloud provider support
pip install -e ".[aws]"          # For AWS EMR integration
pip install -e ".[gcp]"          # For GCP Dataproc integration

# Set up the database
python scripts/setup_db.py

Basic Usage

# Collect data from Spark event logs
spark-optimizer collect --event-log-dir /path/to/spark/logs

# Collect data from Spark History Server
spark-optimizer collect-from-history-server --history-server-url http://localhost:18080

# Collect data from AWS EMR
pip install -e ".[aws]"
spark-optimizer collect-from-emr --region us-west-2

# Collect data from Databricks (uses core dependencies)
spark-optimizer collect-from-databricks --workspace-url https://dbc-xxx.cloud.databricks.com

# Collect data from GCP Dataproc
pip install -e ".[gcp]"
spark-optimizer collect-from-dataproc --project my-project --region us-central1

# Get recommendations for a new job
spark-optimizer recommend --input-size 10GB --job-type etl

# Start the API server
spark-optimizer serve --port 8080
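
Once the server is running, recommendations can also be fetched programmatically. Below is a minimal Python sketch using the requests library; the /recommend path matches the endpoint list in the Architecture section, but the JSON field names simply mirror the CLI flags above and are assumptions, not a documented request schema.

# Hedged sketch: query the running API server from Python.
# The JSON field names mirror the CLI flags and are assumptions,
# not a documented request schema.
import requests

BASE_URL = "http://localhost:8080"

def get_recommendation(input_size: str, job_type: str) -> dict:
    """POST a job description to /recommend and return the response body."""
    resp = requests.post(
        f"{BASE_URL}/recommend",
        json={"input_size": input_size, "job_type": job_type},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(get_recommendation("10GB", "etl"))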

Architecture

Overview

The Spark Resource Optimizer is designed with a modular, layered architecture that separates concerns and allows for easy extension and maintenance.

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Client Applications                     │
│          (CLI, REST API Clients, Web Dashboard)             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      API Layer                              │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐         │
│  │  REST API   │  │     CLI     │  │   WebSocket  │         │
│  │   Routes    │  │  Commands   │  │  (Future)    │         │
│  └─────────────┘  └─────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                  Business Logic Layer                       │
│  ┌──────────────────────┐  ┌─────────────────────┐          │
│  │   Recommender        │  │   Analyzer          │          │
│  │  - Similarity        │  │  - Job Analysis     │          │
│  │  - ML-based          │  │  - Similarity       │          │
│  │  - Rule-based        │  │  - Features         │          │
│  └──────────────────────┘  └─────────────────────┘          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Data Access Layer                         │
│  ┌──────────────────────────────────────────────┐           │
│  │          Repository Pattern                  │           │
│  │  - SparkApplicationRepository                │           │
│  │  - JobRecommendationRepository               │           │
│  └──────────────────────────────────────────────┘           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Storage Layer                            │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐         │
│  │   SQLite    │  │  PostgreSQL │  │    MySQL     │         │
│  │  (Default)  │  │  (Optional) │  │  (Optional)  │         │
│  └─────────────┘  └─────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Data Collection Layer                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Event Log   │  │   History    │  │   Metrics    │       │
│  │  Collector   │  │    Server    │  │  Collector   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Data Sources                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Spark Event  │  │    Spark     │  │  Cloud APIs  │       │
│  │    Logs      │  │   History    │  │              │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘

Core Components

1. Data Collection Layer

Purpose: Gather Spark job metrics from various sources

Components:

  • BaseCollector: Abstract interface for all collectors
  • EventLogCollector: Parse Spark event log files
  • HistoryServerCollector: Query Spark History Server API
  • MetricsCollector: Integrate with monitoring systems

Key Features:

  • Pluggable collector architecture
  • Batch processing support
  • Error handling and retry logic
  • Data normalization
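
To make the pluggable design concrete, here is a minimal Python sketch of what the collector interface could look like. The class names come from the component list above; the method names, signatures, and field mapping are illustrative assumptions, not the project's actual code.

# Illustrative sketch of the pluggable collector interface.
# Class names match the components above; method names, signatures,
# and the normalized schema are assumptions for this example.
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Iterator


class BaseCollector(ABC):
    """Abstract interface shared by all collectors."""

    @abstractmethod
    def collect(self) -> Iterator[dict]:
        """Yield raw records from the underlying source."""

    def normalize(self, raw: dict) -> dict:
        """Map source-specific fields onto a common schema (illustrative)."""
        return {
            "app_id": raw.get("App ID") or raw.get("app_id"),
            "event": raw.get("Event"),
        }


class EventLogCollector(BaseCollector):
    """Reads newline-delimited JSON events from Spark event log files."""

    def __init__(self, log_dir: str):
        self.log_dir = Path(log_dir)

    def collect(self) -> Iterator[dict]:
        for path in self.log_dir.iterdir():
            with open(path) as fh:
                for line in fh:  # one JSON event per line
                    yield json.loads(line)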

2. Storage Layer

Purpose: Persist job data and recommendations

Components:

  • Database: Connection management and session handling
  • Models: SQLAlchemy ORM models
    • SparkApplication: Job metadata and metrics
    • SparkStage: Stage-level details
    • JobRecommendation: Historical recommendations
  • Repository: Data access abstraction

Key Features:

  • Database-agnostic design (SQLAlchemy)
  • Transaction management
  • Query optimization
  • Migration support (Alembic)
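
A minimal sketch of the ORM side, assuming SQLAlchemy's declarative style as noted above. The model names come from the component list; the table and column choices are illustrative assumptions, not the real schema.

# Sketch of the SQLAlchemy models; model names come from the list above,
# table and column choices are illustrative assumptions.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class SparkApplication(Base):
    __tablename__ = "spark_applications"

    id = Column(Integer, primary_key=True)
    app_id = Column(String, unique=True, nullable=False)
    job_type = Column(String)
    input_size_bytes = Column(Integer)
    duration_ms = Column(Integer)
    executor_count = Column(Integer)
    executor_memory_mb = Column(Integer)

    stages = relationship("SparkStage", back_populates="application")


class SparkStage(Base):
    __tablename__ = "spark_stages"

    id = Column(Integer, primary_key=True)
    application_id = Column(Integer, ForeignKey("spark_applications.id"))
    stage_id = Column(Integer)
    shuffle_read_bytes = Column(Integer)
    shuffle_write_bytes = Column(Integer)

    application = relationship("SparkApplication", back_populates="stages")


# SQLite is the default backend, matching the storage layer diagram.
engine = create_engine("sqlite:///spark_optimizer.db")
Base.metadata.create_all(engine)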

3. Analysis Layer

Purpose: Analyze job characteristics and extract insights

Components:

  • JobAnalyzer: Performance analysis and bottleneck detection
  • JobSimilarityCalculator: Calculate job similarity scores
  • FeatureExtractor: Extract ML features from job data

Key Features:

  • Resource efficiency metrics
  • Bottleneck identification (CPU, memory, I/O)
  • Issue detection (data skew, spills, failures)
  • Similarity-based job matching
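
For intuition, similarity-based matching can be as simple as extracting a numeric feature vector per job and ranking historical jobs by cosine similarity. The sketch below shows the idea; the specific features chosen are assumptions for illustration.

# Illustrative cosine-similarity matching over simple job features.
# The feature selection here is an assumption for the sketch.
import math

def features(job: dict) -> list[float]:
    """Turn a job record into a numeric feature vector (log-scaled sizes)."""
    return [
        math.log1p(job["input_size_bytes"]),
        math.log1p(job["shuffle_bytes"]),
        float(job["stage_count"]),
    ]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(new_job: dict, history: list[dict], k: int = 5) -> list[dict]:
    """Return the k historical jobs closest to the new job."""
    target = features(new_job)
    return sorted(history, key=lambda j: cosine(features(j), target), reverse=True)[:k]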

4. Recommendation Layer

Purpose: Generate optimal resource configurations

Components:

  • BaseRecommender: Abstract recommender interface
  • SimilarityRecommender: History-based recommendations
  • MLRecommender: ML model predictions
  • RuleBasedRecommender: Heuristic-based suggestions

Key Features:

  • Multiple recommendation strategies
  • Confidence scoring
  • Cost-performance trade-offs
  • Feedback loop integration
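
To give a flavor of the rule-based strategy, the sketch below encodes two common anti-pattern rules (disk spill suggests undersized executor memory; low CPU utilization suggests over-provisioned cores). The thresholds and metric names are illustrative assumptions, not the project's actual rules.

# Sketch of heuristic rules over an analyzed job profile.
# Thresholds and metric names are assumptions for illustration.
def rule_based_recommend(profile: dict) -> dict:
    """Adjust the job's last-known config based on simple anti-pattern rules."""
    config = dict(profile["last_config"])  # e.g. {"executor_memory_mb": 4096, "executor_cores": 4}

    # Rule 1: memory spill to disk suggests undersized executors.
    if profile.get("spilled_bytes", 0) > 0:
        config["executor_memory_mb"] = int(config["executor_memory_mb"] * 1.5)

    # Rule 2: persistently low CPU utilization suggests too many cores.
    if profile.get("cpu_utilization", 1.0) < 0.3:
        config["executor_cores"] = max(1, config["executor_cores"] - 1)

    return config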

5. API Layer

Purpose: Expose functionality to clients

Components:

  • REST API (Flask)
  • CLI interface (Click)
  • WebSocket support (future)

Endpoints:

  • /recommend: Get resource recommendations
  • /jobs: List and query historical jobs
  • /analyze: Analyze specific jobs
  • /feedback: Submit recommendation feedback
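
A minimal sketch of how the Flask side of this layer might wire an endpoint to the business logic. The route path comes from the endpoint list above; the handler body and response fields are placeholders for illustration, not the service's real schema.

# Hedged sketch of the Flask API layer; route path from the list above,
# handler body and response fields are placeholders.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/recommend", methods=["POST"])
def recommend():
    payload = request.get_json(force=True)  # e.g. {"input_size": "10GB", "job_type": "etl"}
    # A real handler would delegate to a recommender from the business-logic
    # layer (similarity, ML, or rule-based) backed by the repository.
    return jsonify({
        "job_type": payload.get("job_type"),
        "executor_instances": 8,   # placeholder values for illustration
        "executor_memory": "8g",
        "executor_cores": 4,
        "confidence": 0.82,
    })

if __name__ == "__main__":
    app.run(port=8080)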

Data Flow

Collection Flow

Event Logs → Collector → Parser → Normalizer → Repository → Database
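
Expressed as code, this flow is a short pipeline. The sketch below uses the illustrative collector interface from the Core Components section; the repository's save method is likewise an assumption.

# The collection flow as code. The collector and repository follow the
# illustrative interfaces sketched above; repository.save() is an assumption.
def run_collection(collector, repository):
    """Event Logs -> Collector -> Parser/Normalizer -> Repository -> Database."""
    for raw in collector.collect():        # read raw records from the source
        record = collector.normalize(raw)  # map onto the common schema
        repository.save(record)            # persist via the data access layer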

Recommendation Flow

User Request → API → Recommender → Analyzer → Repository → Database
                ↓
          Recommendation ← Model/Rules ← Historical Data

Analysis Flow

Job ID → Repository → Job Data → Analyzer → Insights
                                      ↓
                               Feature Extraction
                                      ↓
                               Similarity Matching

For more detailed architecture information, see docs/architecture.md.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this tool in your research or production systems, please cite:

@software{spark_resource_optimizer,
  title = {Spark Resource Optimizer},
  author = {Gridatek},
  year = {2024},
  url = {https://github.com/gridatek/spark-optimizer}
}
