The Spark Resource Optimizer is designed with a modular, layered architecture that separates concerns and allows for easy extension and maintenance.
```
┌─────────────────────────────────────────────────────────────┐
│                     Client Applications                     │
│            (CLI, REST API Clients, Web Dashboard)           │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                          API Layer                          │
│  ┌─────────────┐   ┌─────────────┐   ┌──────────────┐       │
│  │  REST API   │   │     CLI     │   │  WebSocket   │       │
│  │   Routes    │   │  Commands   │   │   (Future)   │       │
│  └─────────────┘   └─────────────┘   └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    Business Logic Layer                     │
│  ┌──────────────────────┐   ┌─────────────────────┐         │
│  │     Recommender      │   │      Analyzer       │         │
│  │ - Similarity         │   │ - Job Analysis      │         │
│  │ - ML-based           │   │ - Similarity        │         │
│  │ - Rule-based         │   │ - Features          │         │
│  └──────────────────────┘   └─────────────────────┘         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Data Access Layer                      │
│       ┌──────────────────────────────────────────────┐      │
│       │              Repository Pattern              │      │
│       │ - SparkApplicationRepository                 │      │
│       │ - JobRecommendationRepository                │      │
│       └──────────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Storage Layer                        │
│  ┌─────────────┐   ┌─────────────┐   ┌──────────────┐       │
│  │   SQLite    │   │ PostgreSQL  │   │    MySQL     │       │
│  │  (Default)  │   │ (Optional)  │   │  (Optional)  │       │
│  └─────────────┘   └─────────────┘   └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    Data Collection Layer                    │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │  Event Log   │   │   History    │   │  Cloud APIs  │     │
│  │  Collector   │   │    Server    │   │  Collector   │     │
│  └──────────────┘   └──────────────┘   └──────────────┘     │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Data Sources                         │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │ Spark Event  │   │    Spark     │   │  Cloud APIs  │     │
│  │     Logs     │   │   History    │   │              │     │
│  └──────────────┘   └──────────────┘   └──────────────┘     │
└─────────────────────────────────────────────────────────────┘
```

### Data Collection Layer

Purpose: Gather Spark job metrics from various sources

Components:
- `BaseCollector`: Abstract interface for all collectors
- `EventLogCollector`: Parse Spark event log files
- `HistoryServerCollector`: Query the Spark History Server API
- Cloud API Collectors: EMR, Databricks, Dataproc integrations
Key Features:
- Pluggable collector architecture
- Batch processing support
- Error handling and retry logic
- Data normalization
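The collector contract can be sketched in a few lines of Python. `BaseCollector`, `EventLogCollector`, and the `collect()`/`validate_config()` method names come from this document; the config shape and the event parsing below are illustrative assumptions, not the real implementation:

```python
import json
from abc import ABC, abstractmethod


class BaseCollector(ABC):
    """Abstract interface shared by all collectors."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def validate_config(self) -> None:
        """Raise ValueError if required settings are missing."""

    @abstractmethod
    def collect(self) -> list[dict]:
        """Return a list of normalized application records."""


class EventLogCollector(BaseCollector):
    """Parses newline-delimited JSON Spark event logs (lines passed via config here
    for brevity; the real collector reads event log files)."""

    def validate_config(self) -> None:
        if "log_lines" not in self.config:
            raise ValueError("log_lines is required")

    def collect(self) -> list[dict]:
        apps = []
        for line in self.config["log_lines"]:
            event = json.loads(line)
            # The application-start event carries the job's identity.
            if event.get("Event") == "SparkListenerApplicationStart":
                apps.append({"app_id": event["App ID"], "name": event["App Name"]})
        return apps
```

A new source type only needs a new subclass; callers never touch parsing details.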
### Storage Layer

Purpose: Persist job data and recommendations

Components:
- Database: Connection management and session handling
- Models: SQLAlchemy ORM models
  - `SparkApplication`: Job metadata and metrics
  - `SparkStage`: Stage-level details
  - `JobRecommendation`: Historical recommendations
- Repository: Data access abstraction
Key Features:
- Database-agnostic design (SQLAlchemy)
- Transaction management
- Query optimization
- Migration support (Alembic)
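The repository shape can be illustrated with a minimal in-memory stand-in. The real implementation is backed by SQLAlchemy sessions; the fields and method names below are an assumed subset for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SparkApplication:
    """Illustrative subset of the ORM model's fields."""
    app_id: str
    name: str
    executor_memory_mb: int


class SparkApplicationRepository:
    """Repository pattern: callers see CRUD methods, never storage details.
    In-memory dict here; the production version wraps a SQLAlchemy session."""

    def __init__(self):
        self._store: dict[str, SparkApplication] = {}

    def save(self, app: SparkApplication) -> None:
        self._store[app.app_id] = app

    def get(self, app_id: str) -> Optional[SparkApplication]:
        return self._store.get(app_id)

    def find_by_name(self, name: str) -> list[SparkApplication]:
        return [a for a in self._store.values() if a.name == name]
```

Because business logic depends only on this interface, tests can swap in exactly this in-memory version as a mock.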
### Analysis Layer

Purpose: Analyze job characteristics and extract insights

Components:
- `JobAnalyzer`: Performance analysis and bottleneck detection
- `JobSimilarityCalculator`: Calculate job similarity scores
- `FeatureExtractor`: Extract ML features from job data
Key Features:
- Resource efficiency metrics
- Bottleneck identification (CPU, memory, I/O)
- Issue detection (data skew, spills, failures)
- Similarity-based job matching
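Similarity matching typically reduces each job to a numeric feature vector and compares vectors. A cosine-similarity sketch follows; the actual metric used by `JobSimilarityCalculator` may differ:

```python
import math


class JobSimilarityCalculator:
    """Scores two jobs by cosine similarity of their feature vectors
    (e.g., input size, shuffle volume, stage counts)."""

    def similarity(self, a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0 or norm_b == 0:
            # A zero vector carries no signal; treat as dissimilar.
            return 0.0
        return dot / (norm_a * norm_b)
```

Cosine similarity is scale-invariant, so a job that processes 10× more data along the same profile still matches its smaller siblings.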
### Recommendation Layer

Purpose: Generate optimal resource configurations

Components:
- `BaseRecommender`: Abstract recommender interface
- `SimilarityRecommender`: History-based recommendations
- `MLRecommender`: ML model predictions
- `RuleBasedRecommender`: Heuristic-based suggestions
Key Features:
- Multiple recommendation strategies
- Confidence scoring
- Cost-performance trade-offs
- Feedback loop integration
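A rule-based strategy with confidence scoring can be sketched with one hypothetical heuristic. The metric names, the 1.5× factor, and the confidence thresholds below are illustrative assumptions, not the project's actual rules:

```python
class RuleBasedRecommender:
    """Heuristic sketch: disk spills suggest memory pressure, so scale
    executor memory up; confidence grows with observed history."""

    def recommend(self, metrics: dict) -> dict:
        memory_mb = metrics["executor_memory_mb"]
        if metrics.get("disk_spill_bytes", 0) > 0:
            # Hypothetical rule: grow memory 1.5x when past runs spilled.
            memory_mb = int(memory_mb * 1.5)
        # More historical runs -> higher confidence in the suggestion.
        confidence = 0.9 if metrics.get("run_count", 0) >= 5 else 0.5
        return {"executor_memory_mb": memory_mb, "confidence": confidence}
```

Each strategy returns the same `(configuration, confidence)` shape, which is what lets the API select a recommender at runtime.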
### API Layer

Purpose: Expose functionality to clients

Components:
- REST API (Flask)
- CLI interface (Click)
- WebSocket support (future)

Endpoints:
- `/recommend`: Get resource recommendations
- `/jobs`: List and query historical jobs
- `/analyze`: Analyze specific jobs
- `/feedback`: Submit recommendation feedback
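A minimal Flask sketch of the `/recommend` endpoint. The request/response fields are assumptions; a real handler would delegate to the recommendation layer instead of returning a fixed configuration:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/recommend", methods=["POST"])
def recommend():
    payload = request.get_json()
    if not payload or "app_id" not in payload:
        return jsonify({"error": "app_id is required"}), 400
    # Placeholder response; the real implementation calls a recommender.
    return jsonify({"app_id": payload["app_id"],
                    "executor_memory_mb": 4096,
                    "confidence": 0.5})
```

Keeping handlers this thin is what makes the API servers stateless and horizontally scalable.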
## Data Flow

Ingestion:

```
Event Logs → Collector → Parser → Normalizer → Repository → Database
```

Recommendation:

```
User Request → API → Recommender → Analyzer → Repository → Database
                          ↓
          Recommendation ← Model/Rules ← Historical Data
```

Analysis:

```
Job ID → Repository → Job Data → Analyzer → Insights
                                     ↓
                            Feature Extraction
                                     ↓
                            Similarity Matching
```
## Design Patterns

### Repository Pattern
- Abstracts data access logic
- Provides a clean interface for CRUD operations
- Enables easy testing with mocks
### Strategy Pattern
- Multiple recommender implementations
- Runtime selection of recommendation strategy
- Easy addition of new strategies
### Factory Pattern
- Collector creation based on source type
- Recommender instantiation based on method
- Configuration-driven component creation
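The factory can be a simple registry mapping source types to collector classes. The stub classes and type keys below are illustrative stand-ins for the real collectors:

```python
class EventLogCollector:
    """Stub standing in for the real event log collector."""
    def __init__(self, config: dict):
        self.config = config


class HistoryServerCollector:
    """Stub standing in for the real History Server collector."""
    def __init__(self, config: dict):
        self.config = config


# Registry: adding a collector means adding one entry here.
_COLLECTORS = {
    "event_log": EventLogCollector,
    "history_server": HistoryServerCollector,
}


def create_collector(source_type: str, config: dict):
    """Factory: resolve a configured source type to a collector instance."""
    try:
        return _COLLECTORS[source_type](config)
    except KeyError:
        raise ValueError(f"unknown source type: {source_type}") from None
```

Configuration then only needs to name a `source_type`; no caller hardcodes a concrete class.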
### Template Method Pattern
- `BaseCollector` defines the collection workflow
- Subclasses implement specific steps
- Consistent behavior across collectors
## Configuration

The system uses a hierarchical configuration approach:

- Default Values: Hardcoded defaults in `config.py`
- Configuration File: YAML file for persistent settings
- Environment Variables: Override for deployment-specific values
- Runtime Arguments: CLI/API parameters take precedence

Priority: Runtime > Environment > Config File > Defaults
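Since later layers override earlier ones per key, the resolution order can be sketched as a layered dictionary merge. The function name and layer shapes are illustrative, not the project's actual API:

```python
def resolve_config(defaults: dict, config_file: dict,
                   env: dict, runtime: dict) -> dict:
    """Merge configuration layers; later (higher-priority) layers win per key."""
    merged: dict = {}
    # Lowest priority first, so each update() overwrites the layer below it.
    for layer in (defaults, config_file, env, runtime):
        merged.update(layer)
    return merged
```

A key set only in defaults survives untouched, while a key set at runtime wins over all other layers.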
## Extension Points

### Adding a New Collector
- Extend `BaseCollector`
- Implement `collect()` and `validate_config()`
- Register in factory/configuration

### Adding a New Recommender
- Extend `BaseRecommender`
- Implement `recommend()` and `train()`
- Add to recommender registry

### Adding a New Data Source
- Define new collector class
- Add connection configuration
- Implement data normalization

### Adding New Features
- Update `FeatureExtractor`
- Retrain ML models
- Update similarity calculations
## Scalability Considerations

### Horizontal Scaling
- Stateless API servers
- Load balancer distribution
- Database connection pooling

### Data Volume
- Partitioned database tables
- Time-based data retention
- Background aggregation jobs

### Performance
- Caching frequently accessed data
- Async processing for long operations
- Batch operations for bulk imports
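Caching frequently accessed data can start as simply as memoizing hot lookups. A sketch with `functools.lru_cache`; the cached query and the call counter are stand-ins for a repository call:

```python
from functools import lru_cache

calls = {"count": 0}  # counts actual backend hits, to show the cache at work


@lru_cache(maxsize=1024)
def get_job_summary(app_id: str) -> tuple:
    """Stand-in for a repository/database query.
    Returns a tuple so the cached value stays immutable."""
    calls["count"] += 1
    return (app_id, "SUCCEEDED")
```

Repeat lookups for the same `app_id` never reach the backend; in a multi-process deployment this per-process cache would typically be replaced by a shared store such as Redis.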
## Security Considerations

### Access Control
- API key authentication (future)
- Role-based access control (future)
- Rate limiting per client

### Data Protection
- Sensitive data encryption
- Secure credential storage
- Audit logging

### Input Validation
- Request parameter validation
- SQL injection prevention (ORM)
- XSS protection in API responses
## Monitoring and Observability

### Logging
- Structured logging with loguru
- Log levels: DEBUG, INFO, WARNING, ERROR
- Correlation IDs for request tracing
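The project logs with loguru; the same correlation-ID idea can be sketched with the standard library's `logging` plus a `ContextVar`, shown here so the example stays dependency-free:

```python
import io
import logging
from contextvars import ContextVar

# Current request's correlation ID; API middleware would set this per request.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copies the correlation ID onto every record so the formatter can use it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger() -> tuple:
    """Returns a logger plus its in-memory stream (a real deployment
    would log to stderr or a file instead of a StringIO buffer)."""
    buf = io.StringIO()
    logger = logging.getLogger("optimizer-demo")
    logger.handlers.clear()
    handler = logging.StreamHandler(buf)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s"))
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger, buf
```

Every log line emitted while a request is in flight then carries that request's ID, which is what makes tracing across layers possible.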
### Metrics
- API request latency
- Recommendation accuracy
- Database query performance
- Collection throughput

### Health Checks
- API endpoint availability
- Database connectivity
- External service status
## Deployment Architecture

### Single Node

```
Single machine → SQLite → Local file system
```

### Distributed

```
Load Balancer → API Servers → PostgreSQL
                     ↓
          Message Queue (Celery)
                     ↓
           Background Workers
```
## Future Enhancements

- Web Dashboard: React-based UI for visualization
- Real-time Monitoring: WebSocket streaming of job metrics
- Auto-tuning: Automatic resource adjustment
- Multi-cloud Support: AWS EMR, Databricks, GCP Dataproc
- Cost Optimization: Spot instance recommendations
- Alerting: Proactive issue detection and notifications