A comprehensive machine learning project for predicting cluster performance metrics using multi-output regression models.
This project trains multiple regression models to predict cluster performance metrics from cluster configuration metadata. Multi-output regression lets each model predict all performance metrics simultaneously from the cluster setup parameters.
- Multi-output Regression: Predicts multiple performance metrics simultaneously
- Multiple Algorithms: Supports Random Forest, XGBoost, LightGBM, and CatBoost
- Automated Preprocessing: Handles categorical encoding, scaling, and missing values
- Model Comparison: Evaluates and compares different algorithms
- Feature Importance: Analyzes which configuration parameters matter most
- Cross-validation: Robust model evaluation with k-fold cross-validation
- Visualization: Comprehensive plots and analysis
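The multi-output idea can be sketched with scikit-learn's `MultiOutputRegressor` wrapping a `RandomForestRegressor` — an illustrative toy, not the project's actual pipeline in `src/multi_output_model.py`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Toy data: 4 configuration features -> 3 performance metrics
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = np.column_stack([
    X[:, 0] + 0.1 * rng.normal(size=200),       # e.g., a CPU metric
    2 * X[:, 1] + 0.1 * rng.normal(size=200),   # e.g., a memory metric
    X[:, 2] - X[:, 3],                          # e.g., a latency metric
])

# One independent forest is fit per target metric
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=0))
model.fit(X, Y)
preds = model.predict(X[:5])
print(preds.shape)  # (5, 3): one row per sample, one column per metric
```

The same wrapper works around XGBoost or LightGBM regressors, which is what makes swapping algorithms cheap in this setup.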
```
cluster-performance-ml/
├── src/                             # Source code
│   ├── data_preprocessor.py         # Data preprocessing pipeline
│   ├── multi_output_model.py        # Multi-output regression models
│   ├── train.py                     # Training script
│   ├── predict.py                   # Prediction script
│   └── __init__.py                  # Package initialization
├── configs/                         # Configuration files
│   └── config.yaml                  # Main configuration
├── data/                            # Data directory
│   ├── raw/                         # Raw data files
│   └── processed/                   # Processed data files
├── models/                          # Trained models
├── results/                         # Results and evaluation metrics
│   └── plots/                       # Generated plots
├── notebooks/                       # Jupyter notebooks
│   └── exploratory_analysis.ipynb   # EDA notebook
```
- Clone the repository:

```bash
git clone <repository-url>
cd cluster-performance-ml
```

- Install dependencies:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Place your cluster performance CSV file in `data/raw/cluster_data.csv`. The CSV should contain:
- Metadata columns (input features): `clusterType`, `controlPlaneArch`, `k8sVersion`, etc.
- Metrics columns (target outputs): `cpu-*`, `memory-*`, latency metrics, etc.
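The preprocessing the pipeline performs (categorical encoding, scaling, missing-value handling) can be sketched with a scikit-learn `ColumnTransformer`; the frame and column names below are illustrative, and the real logic lives in `src/data_preprocessor.py`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny frame mimicking the expected CSV layout
df = pd.DataFrame({
    "clusterType": ["self-managed", "managed", "self-managed"],
    "workerNodesCount": [3, 10, None],
    "cpu-usage": [0.4, 0.9, 0.5],   # target metric, excluded from feature preprocessing
})

categorical = ["clusterType"]
numeric = ["workerNodesCount"]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill missing values
        ("scale", StandardScaler()),                   # zero-mean, unit-variance
    ]), numeric),
])

X = preprocess.fit_transform(df[categorical + numeric])
print(X.shape)  # (3, 3): 2 one-hot columns + 1 imputed, scaled numeric column
```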
Run the training pipeline:

```bash
python src/train.py
```

This will:
- Load and preprocess the data
- Train multiple models (Random Forest, XGBoost, LightGBM, CatBoost)
- Evaluate model performance
- Save trained models and results
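The train-and-compare loop can be approximated as below; the candidate set and data are stand-ins (a `Ridge` model substitutes for the gradient-boosting libraries to keep the sketch dependency-free):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic multi-output data: 5 features -> 2 target metrics
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
Y = X @ rng.normal(size=(5, 2)) + 0.1 * rng.normal(size=(300, 2))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# Stand-ins for the project's RandomForest/XGBoost/LightGBM candidates
candidates = {
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Ridge": Ridge(),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, Y_tr)
    # Average R^2 uniformly across all target metrics
    scores[name] = r2_score(Y_te, model.predict(X_te), multioutput="uniform_average")

best = max(scores, key=scores.get)
print(f"best model: {best} (R2={scores[best]:.3f})")
```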
Use trained models to predict on new data:
```bash
python src/predict.py --input path/to/new_data.csv --output predictions.csv --model XGBoost
```

Run the Jupyter notebook for data exploration:

```bash
jupyter notebook notebooks/exploratory_analysis.ipynb
```

- EDA: https://colab.research.google.com/drive/1I_AqN-m2p0T2sP8gpHtornPpQWCl8Zlk#scrollTo=msdNxbzFslaG
- Training: https://colab.research.google.com/drive/1trek6cCQhJF-yZ-sSBLIGW86QzeY3ugR#scrollTo=bNTJCy_J9fd3
Modify configs/config.yaml to customize:
- Data paths: Input and output file locations
- Model parameters: Algorithm hyperparameters
- Feature patterns: Patterns to identify input/output columns
- Evaluation metrics: Metrics for model assessment
The system expects a CSV file with columns following these patterns:
- `clusterType`: Type of cluster (e.g., self-managed)
- `controlPlaneArch`: Architecture (e.g., amd64)
- `k8sVersion`: Kubernetes version
- `masterNodesCount`: Number of master nodes
- `workerNodesCount`: Number of worker nodes
- `jobConfig.*`: Job configuration parameters
- And more cluster configuration parameters...
- `cpu-*`: CPU usage metrics
- `memory-*`: Memory usage metrics
- `*-latency`: API call latency metrics
- `99th*`: 99th percentile metrics
- `cgroup*`: CGroup resource metrics
- And more performance metrics...
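Glob-style patterns like these can be matched with the standard-library `fnmatch` module; the sketch below is illustrative (the real patterns come from `configs/config.yaml`):

```python
from fnmatch import fnmatch

# Example column names as they might appear in the CSV header
columns = [
    "clusterType", "k8sVersion", "workerNodesCount", "jobConfig.parallelism",
    "cpu-total", "memory-rss", "apiserver-latency", "99th-latency", "cgroup-cpu",
]

metadata_patterns = ["clusterType", "k8sVersion", "*NodesCount", "jobConfig.*"]
metric_patterns = ["cpu-*", "memory-*", "*-latency", "99th*", "cgroup*"]

def match_any(name, patterns):
    """True if the column name matches any glob-style pattern."""
    return any(fnmatch(name, p) for p in patterns)

features = [c for c in columns if match_any(c, metadata_patterns)]
targets = [c for c in columns if match_any(c, metric_patterns)]
print("features:", features)
print("targets:", targets)
```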
The system evaluates models using:
- R² Score: Coefficient of determination
- RMSE: Root Mean Square Error
- MAE: Mean Absolute Error
- Explained Variance: Explained variance score
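All four metrics are available in scikit-learn; this toy computation assumes R² and explained variance are averaged uniformly across targets, which is one plausible convention for a multi-output model:

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Toy true/predicted values for two target metrics
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_pred = np.array([[1.1, 11.0], [1.9, 19.0], [3.2, 29.0], [3.8, 41.0]])

r2 = r2_score(y_true, y_pred, multioutput="uniform_average")
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt of per-target-averaged MSE
mae = mean_absolute_error(y_true, y_pred)
ev = explained_variance_score(y_true, y_pred)

print(f"R2={r2:.3f} RMSE={rmse:.3f} MAE={mae:.3f} ExplVar={ev:.3f}")
```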
After training, find results in:
- `results/model_summary.csv`: Model performance comparison
- `results/evaluation_results.yaml`: Detailed evaluation metrics
- `results/plots/`: Performance comparison plots
- `models/`: Trained model files
| Model | Overall R² | Overall RMSE | Overall MAE | Explained Variance |
|---|---|---|---|---|
| RandomForest | 0.938880 | 0.235042 | 0.113551 | 0.938970 |
| XGBoost | 0.934250 | 0.245289 | 0.122311 | 0.934259 |
| CatBoost | 0.921117 | 0.268978 | 0.150960 | 0.921137 |
| LightGBM | 0.898202 | 0.303874 | 0.157239 | 0.898221 |
Add new models in configs/config.yaml:
```yaml
models:
  - name: "CustomRF"
    type: "RandomForestRegressor"
    params:
      n_estimators: 200
      max_depth: 15
      min_samples_split: 5
```

Modify feature patterns in the config to include/exclude specific columns:

```yaml
features:
  metadata_patterns:
    - clusterType
    - customFeature
  metric_patterns:
    - custom-metric
```

- File not found: Ensure your CSV is at `data/raw/cluster_data.csv`
- Memory issues: Reduce dataset size or use sample data for testing
- Missing dependencies: Run `pip install -r requirements.txt`
Check `training.log` for detailed execution logs and error messages.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the Apache 2.0 License.
For questions or issues, please create an issue in the repository.