AidenTran900/ml-library-cpp

ML Models

A C++ machine learning library, built from scratch, that implements classical ML algorithms, neural networks, and a transformer stack.

Features

Models

  • Linear Regression with gradient descent optimization
  • Logistic Regression for binary classification
  • K-Nearest Neighbors (KNN) with Euclidean and Manhattan distance metrics
  • Support Vector Machines (SVM) with multiple kernels (Linear, Polynomial, RBF, Sigmoid)
  • Decision Trees with Gini and Entropy impurity measures
  • Random Forests with bootstrap aggregation
  • K-Means Clustering for unsupervised learning
  • Neural Networks with backpropagation and configurable layers
  • Residual Networks (ResNet) with skip connections
  • Transformer with multi-head self-attention, KV cache, and autoregressive generation
  • Perceptron for binary classification
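
As a rough illustration of the KNN entry above, the following is a minimal sketch of nearest-neighbor classification with either distance metric. The names `Metric`, `point_distance`, and `knn_predict` are hypothetical and are not taken from the library's headers; labels are assumed to be binary (0/1).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical names for illustration; not the library's API.
enum class Metric { Euclidean, Manhattan };

// Distance between two points of equal dimension.
double point_distance(const std::vector<double>& a, const std::vector<double>& b, Metric m) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += (m == Metric::Euclidean) ? d * d : std::fabs(d);
    }
    return (m == Metric::Euclidean) ? std::sqrt(sum) : sum;
}

// Majority vote among the k nearest training points (binary labels 0/1).
int knn_predict(const std::vector<std::vector<double>>& X, const std::vector<int>& y,
                const std::vector<double>& query, std::size_t k, Metric m) {
    std::vector<std::pair<double, int>> scored;
    for (std::size_t i = 0; i < X.size(); ++i)
        scored.emplace_back(point_distance(X[i], query, m), y[i]);
    k = std::min(k, scored.size());
    // Only the k smallest distances need to be ordered.
    std::partial_sort(scored.begin(), scored.begin() + static_cast<std::ptrdiff_t>(k), scored.end());
    std::size_t ones = 0;
    for (std::size_t i = 0; i < k; ++i) ones += static_cast<std::size_t>(scored[i].second);
    return (2 * ones > k) ? 1 : 0;
}
```

An odd `k` avoids ties in the binary case; this sketch resolves a tie to class 0.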

Core Components

  • Matrix Operations: Addition, multiplication, transpose, inverse, Hadamard product, and determinant; templated for float and double (Matrix<float> / Matrix<double>)
  • Activation Functions: ReLU, Sigmoid, Tanh, Linear, Softplus, Softmax, Step, Sign
  • Loss Functions: MSE, MAE, RMSE, Binary Cross-Entropy, Categorical Cross-Entropy
  • Optimizers: SGD, Mini-Batch GD, Momentum, AdaGrad, RMSProp, Adam
  • Normalization: Layer Norm, RMS Norm
  • Regularization: L1 (Lasso) & L2 (Ridge)
  • Metrics:
    • Regression: R², Adjusted R², MSE, MAE, RMSE
    • Classification: Accuracy, Precision, Recall, F1 Score, Confusion Matrix, ROC Curve, AUC

NLP / Transformer Components

  • Tokenizer: Word, Character, BPE (Byte Pair Encoding), and Sentence tokenization
  • Embedding Layer: Trainable word embeddings
  • Multi-Head Attention: Scaled dot-product attention with KV cache for efficient inference
  • Positional Encoding: Sinusoidal and Rotary (RoPE)
  • Transformer Blocks: Pre-norm architecture with residual connections
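
The scaled dot-product attention mentioned above computes softmax(QKᵀ / √d) · V. The sketch below shows a single head on plain nested vectors; the `Mat` alias and the `attention` function are hypothetical and do not reflect the library's Matrix-based implementation. It omits masking and the KV cache; with a cache, K and V rows from earlier tokens would be stored so that only the newest query row has to be processed at each step.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major matrix as nested vectors; a stand-in for the library's Matrix type.
using Mat = std::vector<std::vector<double>>;

// Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
// Q is (n_q x d), K is (n_k x d), V is (n_k x d_v).
Mat attention(const Mat& Q, const Mat& K, const Mat& V) {
    const std::size_t d = K[0].size();
    const double scale = 1.0 / std::sqrt(static_cast<double>(d));
    Mat out(Q.size(), std::vector<double>(V[0].size(), 0.0));
    for (std::size_t i = 0; i < Q.size(); ++i) {
        // Scaled scores of query i against every key.
        std::vector<double> scores(K.size(), 0.0);
        for (std::size_t j = 0; j < K.size(); ++j) {
            for (std::size_t c = 0; c < d; ++c) scores[j] += Q[i][c] * K[j][c];
            scores[j] *= scale;
        }
        // Numerically stable softmax: subtract the row maximum first.
        const double max_s = *std::max_element(scores.begin(), scores.end());
        double denom = 0.0;
        for (double& s : scores) { s = std::exp(s - max_s); denom += s; }
        // Weighted sum of value rows.
        for (std::size_t j = 0; j < K.size(); ++j)
            for (std::size_t c = 0; c < V[j].size(); ++c)
                out[i][c] += (scores[j] / denom) * V[j][c];
    }
    return out;
}
```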

Precision Support

All core classes are templated on the scalar type (template<typename T = double>), so both float (f32) and double (f64) precision are available:

  • Matrix<float> / MatrixF32 for memory-efficient inference
  • Matrix<double> / MatrixF64 for training precision (default)
  • Classical ML models default to double; the transformer stack supports both
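
The pattern described above can be sketched as follows. Only the template signature and the alias names MatrixF32 / MatrixF64 come from this README; the constructor, accessors, and `operator*` shown here are assumptions made for the example, not the library's actual class definition.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative stand-in: the members below are assumed, not copied from the library.
template <typename T = double>
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols, T fill = T{})
        : rows_(rows), cols_(cols), data_(rows * cols, fill) {}

    T& operator()(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }
    const T& operator()(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Naive O(n^3) matrix product in the scalar type T.
    Matrix operator*(const Matrix& rhs) const {
        if (cols_ != rhs.rows_) throw std::invalid_argument("shape mismatch");
        Matrix out(rows_, rhs.cols_);
        for (std::size_t i = 0; i < rows_; ++i)
            for (std::size_t k = 0; k < cols_; ++k)
                for (std::size_t j = 0; j < rhs.cols_; ++j)
                    out(i, j) += (*this)(i, k) * rhs(k, j);
        return out;
    }

private:
    std::size_t rows_, cols_;
    std::vector<T> data_;
};

using MatrixF32 = Matrix<float>;   // half the memory per element
using MatrixF64 = Matrix<double>;  // same type as Matrix<>
```

Because the default template argument is double, code written against `Matrix<>` keeps f64 behavior, while inference paths can opt into `MatrixF32` explicitly.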

Language Bindings

  • Python bindings via pybind11 with NumPy array support (both float32 and float64)

Prerequisites

  • C++17 or higher
  • CMake 3.16+
  • A C++ compiler (GCC, Clang, or MSVC)

Building

Linux/macOS

# Clone the repository
git clone https://github.com/ProdigiousPersonn/ML-Models
cd ML-Models

# Create and enter build directory
mkdir build && cd build

# Configure with CMake
cmake ..

# Build the project
cmake --build .

# Run the executable
./Build

Windows

# Clone the repository
git clone https://github.com/ProdigiousPersonn/ML-Models
cd ML-Models

# Create and enter build directory
mkdir build
cd build

# Configure with CMake
cmake ..

# Build the project
cmake --build . --config Release

# Run the executable
.\Release\Build.exe

Project Structure

LinearModel/
├── source/
│   ├── main.cpp           # Entry point
│   ├── math/              # Matrix operations
│   ├── core/              # Loss, optimizer, regularizer, metrics, tokenizer, embedding
│   ├── models/            # ML model implementations
│   └── utils/             # CSV utilities
├── include/ml_lib/        # Public headers
├── examples/
│   ├── c++/               # C++ examples
│   │   ├── linear-regression/housing/
│   │   └── language-model/
│   ├── logistic-regression/ # Heart disease classification example
│   ├── python/            # Python examples
│   └── datasets/          # Example datasets
├── python/                # Python bindings (pybind11)
├── tests/                 # Unit tests (doctest)
├── external/              # Dependencies (fmt, spdlog, doctest)
├── csv-parser/            # CSV parsing library
├── pybind11/              # Python bindings library
└── CMakeLists.txt         # Build configuration

Examples

Housing Price Prediction (Linear Regression)

A complete example demonstrating linear regression on a real-world housing dataset (https://www.kaggle.com/datasets/yasserh/housing-prices-dataset):

  • Dataset: 545 housing samples with 12 features (area, bedrooms, bathrooms, etc.)
  • Features: Z-score normalization
  • Model: Linear regression with L2 regularization
  • Optimizer: Batch gradient descent
  • Metrics: MSE, RMSE, MAE, R²
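
The pipeline listed above (z-score normalization, then L2-regularized linear regression trained with batch gradient descent) can be sketched as below. This is a self-contained approximation on nested vectors, not the code in examples/; `Rows`, `zscore`, `LinearModel`, and `fit_ridge` are hypothetical names, and the loss is assumed to be MSE + lambda * ||w||² with an unpenalized bias.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Rows = std::vector<std::vector<double>>;

// Standardize each feature column in place: (x - mean) / std (population std).
void zscore(Rows& X) {
    if (X.empty()) return;
    const std::size_t n = X.size(), d = X[0].size();
    for (std::size_t j = 0; j < d; ++j) {
        double mean = 0.0, var = 0.0;
        for (const auto& row : X) mean += row[j];
        mean /= static_cast<double>(n);
        for (const auto& row : X) var += (row[j] - mean) * (row[j] - mean);
        const double sd = std::sqrt(var / static_cast<double>(n));
        // A constant column carries no information; map it to zero.
        for (auto& row : X) row[j] = (sd > 0.0) ? (row[j] - mean) / sd : 0.0;
    }
}

struct LinearModel {
    std::vector<double> w;
    double b = 0.0;
    double predict(const std::vector<double>& x) const {
        double out = b;
        for (std::size_t j = 0; j < w.size(); ++j) out += w[j] * x[j];
        return out;
    }
};

// Batch gradient descent on MSE + lambda * ||w||^2 (bias is not penalized).
LinearModel fit_ridge(const Rows& X, const std::vector<double>& y,
                      double lr, double lambda, int epochs) {
    const std::size_t n = X.size(), d = X[0].size();
    LinearModel m;
    m.w.assign(d, 0.0);
    for (int e = 0; e < epochs; ++e) {
        std::vector<double> grad_w(d, 0.0);
        double grad_b = 0.0;
        // Accumulate the gradient over the full dataset (one "batch").
        for (std::size_t i = 0; i < n; ++i) {
            const double err = m.predict(X[i]) - y[i];
            grad_b += err;
            for (std::size_t j = 0; j < d; ++j) grad_w[j] += err * X[i][j];
        }
        for (std::size_t j = 0; j < d; ++j)
            m.w[j] -= lr * (2.0 * grad_w[j] / static_cast<double>(n) + 2.0 * lambda * m.w[j]);
        m.b -= lr * 2.0 * grad_b / static_cast<double>(n);
    }
    return m;
}
```

Normalizing first puts all features on a comparable scale, which lets a single learning rate work across columns such as area and bedroom count.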

Llama 3.2-1B Instruct (Language Model)

A text generation example using Llama 3.2-1B Instruct loaded from a GGUF file:

  • Model: Llama 3.2-1B Instruct (GGUF format)
  • Supported quantizations: F32, F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1
  • Features: Tokenizer encoding/decoding, streaming output, temperature and top-p sampling
  • Inference: Autoregressive generation with KV cache
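
Temperature and top-p (nucleus) sampling, as named in the feature list above, can be sketched like this. The function names are hypothetical, and the example's own implementation may order or combine the steps differently: logits are divided by the temperature before softmax, then the smallest set of most likely tokens whose cumulative probability reaches `top_p` is kept and renormalized.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Softmax over logits / temperature. Assumes temperature > 0 and non-empty logits.
std::vector<double> softmax_with_temperature(const std::vector<double>& logits, double temperature) {
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;
    return probs;
}

// Keep the most likely tokens until their cumulative mass reaches top_p,
// zero out the rest, and renormalize.
std::vector<double> top_p_filter(const std::vector<double>& probs, double top_p) {
    std::vector<std::size_t> order(probs.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });
    std::vector<double> kept(probs.size(), 0.0);
    double cumulative = 0.0;
    for (std::size_t idx : order) {
        kept[idx] = probs[idx];
        cumulative += probs[idx];
        if (cumulative >= top_p) break;
    }
    for (double& p : kept) p /= cumulative;
    return kept;
}

// Draw one token id from the filtered distribution.
int sample_token(const std::vector<double>& logits, double temperature, double top_p,
                 std::mt19937& rng) {
    const std::vector<double> probs =
        top_p_filter(softmax_with_temperature(logits, temperature), top_p);
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```

Lower temperatures sharpen the distribution toward the top token, while a smaller `top_p` trims the low-probability tail before sampling.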

Downloading the model:

Model weights are not included in the repository. Download a GGUF file from Hugging Face:

# Install the Hugging Face CLI
pip install huggingface-hub

# Q8_0 quantized (~1.1 GB)
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF Llama-3.2-1B-Instruct-Q8_0.gguf --local-dir examples/datasets/language-model/

Heart Disease Prediction (Logistic Regression)

A binary classification example using logistic regression on the Framingham Heart Study dataset (https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression):

  • Dataset: Framingham Heart Study - 10 Year CHD Risk
  • Features: 15 clinical features (age, sex, cholesterol, blood pressure, BMI, etc.)
  • Preprocessing: Z-score normalization
  • Model: Logistic regression with L2 regularization
  • Loss: Binary Cross-Entropy (BCE)
  • Optimizer: Batch gradient descent
  • Metrics: Accuracy, Precision, Recall, F1 Score, Confusion Matrix, ROC Curve, AUC
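
The training setup listed above (logistic regression, BCE loss, L2 regularization, batch gradient descent) can be sketched as follows. The names are hypothetical and the code is not taken from the example itself; the penalty is assumed to be (lambda / 2) * ||w||², which gives the `lambda * w` term in the gradient, and the bias is left unpenalized.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

struct LogisticModel {
    std::vector<double> w;
    double b = 0.0;
    // Probability of the positive class for one sample.
    double predict_proba(const std::vector<double>& x) const {
        double z = b;
        for (std::size_t j = 0; j < w.size(); ++j) z += w[j] * x[j];
        return sigmoid(z);
    }
};

// Mean binary cross-entropy; probabilities are clamped to avoid log(0).
double bce_loss(const LogisticModel& m, const std::vector<std::vector<double>>& X,
                const std::vector<int>& y) {
    const double eps = 1e-12;
    double total = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        const double p = std::min(std::max(m.predict_proba(X[i]), eps), 1.0 - eps);
        total -= y[i] * std::log(p) + (1 - y[i]) * std::log(1.0 - p);
    }
    return total / static_cast<double>(X.size());
}

// Batch gradient descent on BCE + (lambda / 2) * ||w||^2.
void train(LogisticModel& m, const std::vector<std::vector<double>>& X,
           const std::vector<int>& y, double lr, double lambda, int epochs) {
    const std::size_t n = X.size(), d = X[0].size();
    m.w.assign(d, 0.0);
    m.b = 0.0;
    for (int e = 0; e < epochs; ++e) {
        std::vector<double> grad_w(d, 0.0);
        double grad_b = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            // For sigmoid + BCE the per-sample gradient reduces to (p - y) * x.
            const double err = m.predict_proba(X[i]) - y[i];
            grad_b += err;
            for (std::size_t j = 0; j < d; ++j) grad_w[j] += err * X[i][j];
        }
        for (std::size_t j = 0; j < d; ++j)
            m.w[j] -= lr * (grad_w[j] / static_cast<double>(n) + lambda * m.w[j]);
        m.b -= lr * grad_b / static_cast<double>(n);
    }
}
```

Thresholding `predict_proba` at 0.5 yields class labels; sweeping that threshold is what produces the ROC curve.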

Run the examples from the build directory:

./Build

Roadmap

Regression [X]

  • Linear Regression
  • Evaluation Metrics (Regression): MSE, MAE, RMSE, R-squared
  • Regularization: L1 (Lasso) & L2 (Ridge)

Classification [X]

  • Logistic Regression
  • Evaluation Metrics (Classification):
    • Accuracy, Precision, Recall, FPR, F1-Score
    • Confusion Matrix
    • ROC Curve and AUC
  • K-Nearest Neighbors (KNN)
  • Support Vector Machines (SVMs)

Tree-Based Models [X]

  • Decision Trees
  • Random Forests

Unsupervised Learning [X]

  • K-Means Clustering

Deep Learning [In Progress]

  • Neural Networks (Feedforward)
  • Backpropagation
  • Activation Functions: ReLU, Sigmoid, Tanh, Linear, Softplus, Softmax, Step, Sign
  • Optimizers:
    • Mini-Batch Gradient Descent
    • Adam Optimizer
    • RMSProp
    • AdaGrad
    • Momentum SGD
  • Model Serialization
  • Batch Normalization
  • Layer Normalization
  • RMS Normalization
  • Dropout Regularization

NLP / Transformers [In Progress]

  • Tokenizer: Word, Character, BPE, Sentence
  • Embedding Layer
  • Attention Mechanisms: Multi-head self-attention with KV cache
  • Positional Encoding: Sinusoidal, Rotary (RoPE)
  • Transformer Blocks: Pre-norm with residual connections
  • Transformer Model: Autoregressive generation with token sampling
  • Language Models (GGUF loading / Llama inference)

Precision [X]

  • f64 (double): Default precision for all operations
  • f32 (float): Template support across the full stack
  • f16 / Quantization: F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1 dequantization for GGUF loading
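
As an illustration of what dequantization involves, the sketch below handles Q8_0 only, assuming the usual ggml block layout: 32 signed 8-bit values sharing one fp16 scale, with each weight recovered as scale × value. The struct and function names are hypothetical and the loader in this repository may be organized differently.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

constexpr int kQ8BlockSize = 32;

// Assumed Q8_0 block layout: one fp16 scale followed by 32 int8 quants.
struct BlockQ8_0 {
    std::uint16_t d;               // scale, stored as IEEE 754 half-precision bits
    std::int8_t qs[kQ8BlockSize];  // quantized values
};

// Convert IEEE 754 half-precision bits to float.
float half_to_float(std::uint16_t h) {
    const std::uint32_t sign = (h >> 15) & 0x1u;
    const std::uint32_t exponent = (h >> 10) & 0x1Fu;
    const std::uint32_t mantissa = h & 0x3FFu;
    float value;
    if (exponent == 0) {
        // Zero or subnormal: mantissa * 2^-24.
        value = std::ldexp(static_cast<float>(mantissa), -24);
    } else if (exponent == 31) {
        value = mantissa ? std::numeric_limits<float>::quiet_NaN()
                         : std::numeric_limits<float>::infinity();
    } else {
        // Normal: (1024 + mantissa) * 2^(exponent - 15 - 10).
        value = std::ldexp(static_cast<float>(mantissa | 0x400u), static_cast<int>(exponent) - 25);
    }
    return sign ? -value : value;
}

// Expand one block into 32 floats: out[i] = scale * qs[i].
void dequantize_q8_0(const BlockQ8_0& block, float* out) {
    const float scale = half_to_float(block.d);
    for (int i = 0; i < kQ8BlockSize; ++i)
        out[i] = scale * static_cast<float>(block.qs[i]);
}
```

The 4-bit and 5-bit formats follow the same block idea but pack several quants per byte, and the `_1` variants are generally understood to add a per-block offset.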

DL Architectures [ ]

  • Convolutional Neural Networks (CNNs) for images
  • Recurrent Neural Networks (RNNs) for sequences

About

A C++/Python machine learning library built from scratch. Features classic ML algorithms and a GGUF-compatible inference loader for transformers.
