A comprehensive machine learning project for predicting cluster performance metrics using multi-output regression models.
This project trains multiple regression models to predict cluster performance metrics from cluster configuration metadata. Multi-output regression lets each model predict all performance metrics simultaneously from the cluster setup parameters.
- Multi-output Regression: Predicts multiple performance metrics simultaneously
- Multiple Algorithms: Supports Random Forest, XGBoost, LightGBM, and CatBoost
- Automated Preprocessing: Handles categorical encoding, scaling, and missing values
- Model Comparison: Evaluates and compares different algorithms
- Feature Importance: Analyzes which configuration parameters matter most
- Cross-validation: Robust model evaluation with k-fold cross-validation
- Visualization: Comprehensive plots and analysis
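The multi-output idea can be sketched with scikit-learn's `MultiOutputRegressor` wrapping a `RandomForestRegressor` — an illustrative toy, not the project's actual pipeline in `src/multi_output_model.py`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Toy data: 4 configuration features -> 3 performance metrics
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = np.column_stack([
    X[:, 0] + 0.1 * rng.normal(size=200),       # e.g., a CPU metric
    2 * X[:, 1] + 0.1 * rng.normal(size=200),   # e.g., a memory metric
    X[:, 2] - X[:, 3],                          # e.g., a latency metric
])

# One independent forest is fit per target metric
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=0))
model.fit(X, Y)
preds = model.predict(X[:5])
print(preds.shape)  # (5, 3): one row per sample, one column per metric
```

The same wrapper works around XGBoost or LightGBM regressors, which is what makes swapping algorithms cheap in this setup.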
```
cluster-performance-ml/
├── src/                             # Source code
│   ├── data_preprocessor.py         # Data preprocessing pipeline
│   ├── multi_output_model.py        # Multi-output regression models
│   ├── train.py                     # Training script
│   ├── predict.py                   # Prediction script
│   └── __init__.py                  # Package initialization
├── configs/                         # Configuration files
│   └── config.yaml                  # Main configuration
├── data/                            # Data directory
│   ├── raw/                         # Raw data files
│   └── processed/                   # Processed data files
├── models/                          # Trained models
├── results/                         # Results and evaluation metrics
│   └── plots/                       # Generated plots
├── notebooks/                       # Jupyter notebooks
│   └── exploratory_analysis.ipynb   # EDA notebook
```
- Clone the repository:

```bash
git clone <repository-url>
cd cluster-performance-ml
```

- Install dependencies:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Place your cluster performance CSV file in `data/raw/cluster_data.csv`. The CSV should contain:
- Metadata columns (input features): `clusterType`, `controlPlaneArch`, `k8sVersion`, etc.
- Metrics columns (target outputs): `cpu-*`, `memory-*`, latency metrics, etc.
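The preprocessing the pipeline performs (categorical encoding, scaling, missing-value handling) can be sketched with a scikit-learn `ColumnTransformer`; the frame and column names below are illustrative, and the real logic lives in `src/data_preprocessor.py`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny frame mimicking the expected CSV layout
df = pd.DataFrame({
    "clusterType": ["self-managed", "managed", "self-managed"],
    "workerNodesCount": [3, 10, None],
    "cpu-usage": [0.4, 0.9, 0.5],   # target metric, excluded from feature preprocessing
})

categorical = ["clusterType"]
numeric = ["workerNodesCount"]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill missing values
        ("scale", StandardScaler()),                   # zero-mean, unit-variance
    ]), numeric),
])

X = preprocess.fit_transform(df[categorical + numeric])
print(X.shape)  # (3, 3): 2 one-hot columns + 1 imputed, scaled numeric column
```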
Run the training pipeline:

```bash
python src/train.py
```

This will:
- Load and preprocess the data
- Train multiple models (Random Forest, XGBoost, LightGBM, CatBoost)
- Evaluate model performance
- Save trained models and results
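The train-and-compare loop can be approximated as below; the candidate set and data are stand-ins (a `Ridge` model substitutes for the gradient-boosting libraries to keep the sketch dependency-free):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic multi-output data: 5 features -> 2 target metrics
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
Y = X @ rng.normal(size=(5, 2)) + 0.1 * rng.normal(size=(300, 2))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# Stand-ins for the project's RandomForest/XGBoost/LightGBM candidates
candidates = {
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Ridge": Ridge(),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, Y_tr)
    # Average R^2 uniformly across all target metrics
    scores[name] = r2_score(Y_te, model.predict(X_te), multioutput="uniform_average")

best = max(scores, key=scores.get)
print(f"best model: {best} (R2={scores[best]:.3f})")
```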
Use trained models to predict on new data:
```bash
python src/predict.py --input path/to/new_data.csv --output predictions.csv --model XGBoost
```

Run the Jupyter notebook for data exploration:

```bash
jupyter notebook notebooks/exploratory_analysis.ipynb
```

- EDA: https://colab.research.google.com/drive/1I_AqN-m2p0T2sP8gpHtornPpQWCl8Zlk#scrollTo=msdNxbzFslaG
- Training: https://colab.research.google.com/drive/1trek6cCQhJF-yZ-sSBLIGW86QzeY3ugR#scrollTo=bNTJCy_J9fd3
Modify configs/config.yaml to customize:
- Data paths: Input and output file locations
- Model parameters: Algorithm hyperparameters
- Feature patterns: Patterns to identify input/output columns
- Evaluation metrics: Metrics for model assessment
The system expects a CSV file with columns following these patterns:
- `clusterType`: Type of cluster (e.g., self-managed)
- `controlPlaneArch`: Architecture (e.g., amd64)
- `k8sVersion`: Kubernetes version
- `masterNodesCount`: Number of master nodes
- `workerNodesCount`: Number of worker nodes
- `jobConfig.*`: Job configuration parameters
- And more cluster configuration parameters...
- `cpu-*`: CPU usage metrics
- `memory-*`: Memory usage metrics
- `*-latency`: API call latency metrics
- `99th*`: 99th percentile metrics
- `cgroup*`: CGroup resource metrics
- And more performance metrics...
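Glob-style patterns like these can be matched with the standard-library `fnmatch` module; the sketch below is illustrative (the real patterns come from `configs/config.yaml`):

```python
from fnmatch import fnmatch

# Example column names as they might appear in the CSV header
columns = [
    "clusterType", "k8sVersion", "workerNodesCount", "jobConfig.parallelism",
    "cpu-total", "memory-rss", "apiserver-latency", "99th-latency", "cgroup-cpu",
]

metadata_patterns = ["clusterType", "k8sVersion", "*NodesCount", "jobConfig.*"]
metric_patterns = ["cpu-*", "memory-*", "*-latency", "99th*", "cgroup*"]

def match_any(name, patterns):
    """True if the column name matches any glob-style pattern."""
    return any(fnmatch(name, p) for p in patterns)

features = [c for c in columns if match_any(c, metadata_patterns)]
targets = [c for c in columns if match_any(c, metric_patterns)]
print("features:", features)
print("targets:", targets)
```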
The system evaluates models using:
- R² Score: Coefficient of determination
- RMSE: Root Mean Square Error
- MAE: Mean Absolute Error
- Explained Variance: Explained variance score
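All four metrics are available in scikit-learn; this toy computation assumes R² and explained variance are averaged uniformly across targets, which is one plausible convention for a multi-output model:

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Toy true/predicted values for two target metrics
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_pred = np.array([[1.1, 11.0], [1.9, 19.0], [3.2, 29.0], [3.8, 41.0]])

r2 = r2_score(y_true, y_pred, multioutput="uniform_average")
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt of per-target-averaged MSE
mae = mean_absolute_error(y_true, y_pred)
ev = explained_variance_score(y_true, y_pred)

print(f"R2={r2:.3f} RMSE={rmse:.3f} MAE={mae:.3f} ExplVar={ev:.3f}")
```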
After training, find results in:
- `results/model_summary.csv`: Model performance comparison
- `results/evaluation_results.yaml`: Detailed evaluation metrics
- `results/plots/`: Performance comparison plots
- `models/`: Trained model files
| Model | Overall R² | Overall RMSE | Overall MAE | Explained Variance |
|---|---|---|---|---|
| RandomForest | 0.938880 | 0.235042 | 0.113551 | 0.938970 |
| XGBoost | 0.934250 | 0.245289 | 0.122311 | 0.934259 |
| CatBoost | 0.921117 | 0.268978 | 0.150960 | 0.921137 |
| LightGBM | 0.898202 | 0.303874 | 0.157239 | 0.898221 |
Add new models in configs/config.yaml:
```yaml
models:
  - name: "CustomRF"
    type: "RandomForestRegressor"
    params:
      n_estimators: 200
      max_depth: 15
      min_samples_split: 5
```

Modify feature patterns in the config to include/exclude specific columns:

```yaml
features:
  metadata_patterns:
    - clusterType
    - customFeature
  metric_patterns:
    - custom-metric
```

- File not found: Ensure your CSV is at `data/raw/cluster_data.csv`
- Memory issues: Reduce dataset size or use sample data for testing
- Missing dependencies: Run `pip install -r requirements.txt`
Check `training.log` for detailed execution logs and error messages.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the Apache 2.0 License.
For questions or issues, please create an issue in the repository.