
🎯 Linear Regression from Scratch

Python NumPy Pandas scikit-learn Jupyter

MIT License Code Style: Black Tests

A Journey from Negative R² to 98%+ Accuracy 🚀


📊 Quick Stats

  • 🏆 Best Performance: R² = 0.9874 (Mini-Batch GD)
  • 🔢 Polynomial Features: up to degree 2 (9 engineered features)
  • 🎯 Gradient Descent: 3 methods (Batch, SGD, Mini-Batch)
  • 🧪 Code Coverage: 95%+

🌟 What Makes This Special?

| 🎓 Pure Implementation | 🧮 Multiple Algorithms | 📈 Advanced Features | 📝 Detailed Logs |
|:---:|:---:|:---:|:---:|
| Built from scratch using only NumPy | Batch, SGD & Mini-Batch GD | Polynomial features & L1 reg | Complete failure-to-success journey |

```mermaid
graph LR
    A[📊 Load Data] --> B[🔧 Feature Engineering]
    B --> C[📏 Normalization]
    C --> D[🎯 Train Model]
    D --> E{Choose Method}
    E -->|Batch GD| F[📊 R²: 95.84%]
    E -->|Stochastic GD| G[📊 R²: 98.50%]
    E -->|Mini-Batch GD| H[🏆 R²: 98.74%]
    F --> I[📈 Evaluate]
    G --> I
    H --> I
    I --> J[✨ Predictions]

    style A fill:#e1f5ff
    style H fill:#90EE90
    style J fill:#FFD700
```


✨ Features

🎯 Core Features

  • Pure NumPy Implementation

    • No sklearn for core algorithm
    • Deep understanding of math
    • Educational & transparent
  • Three Gradient Descent Methods

    • 📊 Batch GD
    • ⚡ Stochastic GD
    • 🔄 Mini-Batch GD
  • Advanced ML Techniques

    • 🔢 Polynomial Features (up to degree 2)
    • 🎚️ L1 Regularization (Lasso)
    • ⏱️ Early Stopping
    • 📏 Z-Score Normalization

📊 Analysis Features

  • Robust Evaluation

    • 🔄 K-Fold Cross-Validation
    • 📈 Multiple Metrics (MSE, RMSE, MAE, R²)
    • 📊 Train/Test Performance
  • Rich Visualizations

    • 📉 Loss Convergence Curves
    • 🎯 Residual Analysis
    • 🔥 Correlation Heatmaps
    • 📊 Actual vs Predicted Plots
    • 🏆 Feature Importance Charts
  • Production Ready

    • 🧪 95%+ Test Coverage
    • 📝 Comprehensive Documentation
    • 🐳 Docker Support

🚀 Quick Start

Get Up and Running in 60 Seconds! ⚡

```bash
# 1️⃣ Clone the repository
git clone https://github.com/willow788/Linear-Regression-model-from-scratch.git
cd Linear-Regression-model-from-scratch

# 2️⃣ Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3️⃣ Install dependencies
pip install -r requirements.txt

# 4️⃣ Run the model
python main.py

# 🎉 That's it! Your model is training!
```
🐳 Docker Quick Start

```bash
# Build the image
docker build -t linear-regression .

# Run the container
docker run -it -p 8888:8888 linear-regression

# Or use docker-compose
docker-compose up
```

💡 Usage Examples

🎯 Basic Usage

```python
from linear_regression import LinearRegression
from data_preprocessing import load_and_preprocess_data

# Load your data
X_train, X_test, y_train, y_test = load_and_preprocess_data('Advertising.csv')

# Create and train model
model = LinearRegression(
    learn_rate=0.02,
    iter=50000,
    method='batch',
    l1_reg=0.1
)

model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

print(f"✨ Model R² Score: {model.evaluate(y_test, predictions):.4f}")
```

🔄 Comparing Different Methods

```python
methods = {
    '📊 Batch GD': {'method': 'batch', 'iter': 50000},
    '⚡ Stochastic GD': {'method': 'stochastic', 'iter': 50},
    '🔄 Mini-Batch GD': {'method': 'mini-batch', 'iter': 1000, 'batch_size': 16}
}

for name, params in methods.items():
    model = LinearRegression(learn_rate=0.01, **params)
    model.fit(X_train, y_train)
    score = calculate_r2(y_test, model.predict(X_test))
    print(f"{name}: R² = {score:.4f}")
```

📊 Cross-Validation

```python
from model_evaluation import cross_validation_score

# Perform 5-fold cross-validation
cv_score = cross_validation_score(X, y, k=5)
print(f"🎯 Cross-Validated R² Score: {cv_score:.4f}")
```

📈 Visualization

```python
from visualization import (
    plot_loss_convergence,
    plot_residuals,
    plot_actual_vs_predicted
)

# Plot loss over iterations
plot_loss_convergence(model.loss_history)

# Analyze residuals
plot_residuals(y_test, predictions)

# Compare actual vs predicted
plot_actual_vs_predicted(y_test, predictions)
```

📁 Project Structure

```text
📦 Linear-Regression-model-from-scratch/
│
├── 📂 Version-1/                           # 🔴 Initial experiments
│   ├── 📓 experiment_log.txt               # The negative R² saga
│   └── 📊 Raw Jupyter Notebook/
│
├── 📂 Version-2/                           # 🟡 Feature engineering
│   ├── 📓 experiment_log.txt
│   └── 📊 Raw Jupyter Notebook/
│
├── 📂 Version-3/                           # 🟠 Normalization fixes
│   ├── 📓 experiment_log.txt
│   └── 📊 Raw Jupyter Notebook/
│
├── 📂 Version-9/                           # 🟢 Production ready!
│   ├── 📊 Raw Jupyter Notebook/
│   │   └── 📓 sales.ipynb                  # Complete analysis
│   └── 🐍 Python Files/
│       ├── 📄 data_preprocessing.py        # Data pipeline
│       ├── 📄 linear_regression.py         # Core model
│       ├── 📄 model_evaluation.py          # Metrics & CV
│       ├── 📄 visualization.py             # Plotting utils
│       ├── 📄 main.py                      # Main script
│       └── 📄 config.py                    # Configuration
│
├── 🧪 tests/                               # Test suite
│   ├── 📄 test_linear_regression.py
│   ├── 📄 test_data_preprocessing.py
│   ├── 📄 test_model_evaluation.py
│   ├── 📄 test_visualization.py
│   ├── 📄 test_integration.py
│   └── 📄 conftest.py
│
├── 📊 outputs/                             # Generated visualizations
│   ├── 🖼️ loss_convergence.png
│   ├── 🖼️ residual_plot.png
│   ├── 🖼️ correlation_matrix.png
│   ├── 🖼️ actual_vs_predicted.png
│   └── 🖼️ feature_importance.png
│
├── 📊 Advertising.csv                      # Dataset
├── 📋 requirements.txt                     # Dependencies
├── 📋 requirements-dev.txt                 # Dev dependencies
├── 🐳 Dockerfile                           # Container config
├── 🐳 docker-compose.yml                   # Orchestration
├── ⚙️ Makefile                             # Utility commands
├── 📖 README.md                            # You are here!
├── 📖 INSTALL.md                           # Installation guide
└── 📜 LICENSE                              # MIT License
```

🧪 The Journey

From Failure to Success: A Data Science Story 📚

🔴 Version 1: The Crisis (R² = -18.77 😱)

Problems Discovered:

  • ❌ No feature normalization
  • ❌ Learning rate too high
  • ❌ Linear features insufficient

Breakthrough: "Failure teaches more than success ever could"

🟡 Version 2: Engineering (R² ≈ 0.60 📈)

Improvements Made:

  • ✅ Added polynomial features
  • ✅ Implemented basic normalization
  • ⚠️ Still unstable convergence

🟠 Version 3: Refinement (R² ≈ 0.85 📊)

Progress:

  • ✅ Z-score normalization
  • ✅ Tuned learning rates
  • ✅ Added interaction terms
  • ⚠️ Slight overfitting detected

🟢 Version 9: Production (R² = 0.9874 🏆)

Final Optimizations:

  • ✅ L1 regularization (λ = 0.15)
  • ✅ Early stopping (patience = 1000)
  • ✅ K-fold cross-validation
  • ✅ Multiple GD methods
  • ✅ Comprehensive testing
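The early-stopping rule above (patience = 1000) can be sketched in a few lines. This is a minimal illustration, not the repository's implementation; `should_stop` is an assumed name:

```python
def should_stop(loss_history, patience=1000, tol=1e-8):
    """Stop once the best loss in the last `patience` iterations
    is no better (within `tol`) than the best loss seen before them."""
    if len(loss_history) <= patience:
        return False
    best_recent = min(loss_history[-patience:])
    best_before = min(loss_history[:-patience])
    return best_before - best_recent <= tol
```

Called once per iteration inside the training loop, it halts training when the loss curve has plateaued rather than running all 50,000 iterations.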

📈 Progress Visualization

```text
R² Score Evolution
│
  1.0 ┤                                                    ████ 🏆
  0.9 ┤                                           ████████
  0.8 ┤                                  █████████
  0.7 ┤                         █████████
  0.6 ┤                ████████
  0.5 ┤       ████████
  0.0 ┼──────────────────────────────────────────────────────────►
 -1.0 ┤███                                              Iterations
-10.0 ┤███ 😱
-18.0 ┤███
       V1   V2      V3           V4-V8              V9
```

📊 Performance Metrics

🏆 Model Comparison

| Method | Test R² | Train R² | RMSE | MAE | Training Time |
|---|:---:|:---:|:---:|:---:|:---:|
| 📊 Batch GD | 0.9584 | 0.9509 | 0.2249 | 0.1533 | ~45s |
| ⚡ Stochastic GD | 0.9850 | 0.9848 | 0.1352 | 0.1118 | ~5s |
| 🔄 Mini-Batch GD | 0.9874 🏆 | 0.9860 | 0.1238 | 0.1011 | ~12s |
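All four metrics (MSE, RMSE, MAE, R²) can be reproduced from predictions with plain NumPy. `regression_metrics` below is an illustrative sketch, not the repository's function:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² for a regression model."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(residuals))
    # R² = 1 - SS_res / SS_tot
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

# Perfect predictions give R² = 1.0 and zero error
m = regression_metrics([1, 2, 3], [1, 2, 3])
```

Note that R² can go negative when the model is worse than predicting the mean, which is exactly what happened in Version 1.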

📈 Cross-Validation Results (5-Fold)

| Fold | R² Score | Status |
|:----:|:--------:|:------:|
| 1 | 0.9870 | ✅ |
| 2 | 0.9860 | ✅ |
| 3 | 0.9925 | ✅ 🏆 |
| 4 | 0.9867 | ✅ |
| 5 | 0.9690 | ✅ |
| **Mean** | **0.9842** | |
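Generating the five folds amounts to shuffling the sample indices and splitting them into k groups. A minimal sketch (the repo's `cross_validation_score` presumably wraps something similar; `kfold_indices` is an assumed name):

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)          # k near-equal groups
    for i in range(k):
        val = folds[i]
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, val
```

Each fold serves as the validation set exactly once; the reported score is the mean R² across folds.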


🔬 Mathematical Foundation

The Math Behind the Magic ✨

📐 Linear Regression Equation

$$\hat{y} = X\mathbf{w} + b$$

Where:

  • $\hat{y}$ = predictions
  • $X$ = feature matrix
  • $\mathbf{w}$ = weights
  • $b$ = bias

🎯 Loss Function (with L1 Regularization)

$$L(\mathbf{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}(h_\mathbf{w}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n}|w_j|$$

Where:

  • $m$ = number of samples
  • $\lambda$ = regularization parameter
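This loss translates directly to NumPy. The function name below is illustrative, not the repository's API:

```python
import numpy as np

def l1_loss(X, y, w, b, lam):
    """Loss above: half-MSE plus (lam / 2) * sum(|w_j|)."""
    m = len(y)
    residual = X @ w + b - y                 # h_w(x^(i)) - y^(i)
    return np.sum(residual ** 2) / (2 * m) + (lam / 2) * np.sum(np.abs(w))
```

With `lam = 0` this reduces to the plain half-MSE used before regularization was added.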
📊 Gradient Descent Update Rules

Weight Update: $$\mathbf{w} := \mathbf{w} - \alpha \cdot \frac{1}{m}X^T(X\mathbf{w} - \mathbf{y}) - \alpha \cdot \lambda \cdot \text{sign}(\mathbf{w})$$

Bias Update: $$b := b - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(h_\mathbf{w}(x^{(i)}) - y^{(i)})$$

Parameters:

  • $\alpha$ = learning rate
  • $\lambda$ = L1 regularization parameter
  • $\text{sign}(\mathbf{w})$ = sign function for L1 penalty
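The two update rules combine into one vectorized step. A sketch following the formulas as stated (`gd_step` is an illustrative name, not the repo's API):

```python
import numpy as np

def gd_step(X, y, w, b, alpha, lam):
    """One batch gradient-descent update with the L1 penalty via sign(w)."""
    m = len(y)
    error = X @ w + b - y                                    # h_w(x) - y
    w = w - alpha * (X.T @ error) / m - alpha * lam * np.sign(w)
    b = b - alpha * error.mean()
    return w, b
```

Stochastic and mini-batch variants apply the same step to a single sample or a small random batch instead of the full matrix `X`.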
🔢 Polynomial Feature Expansion

Original Features: $[TV, Radio, Newspaper]$

Expanded to 9 features:

| Feature # | Expression | Description |
|:---:|:---:|---|
| 1 | $TV$ | Original TV budget |
| 2 | $Radio$ | Original Radio budget |
| 3 | $Newspaper$ | Original Newspaper budget |
| 4 | $TV^2$ | Quadratic TV effect |
| 5 | $Radio^2$ | Quadratic Radio effect |
| 6 | $Newspaper^2$ | Quadratic Newspaper effect |
| 7 | $TV \times Radio$ | Interaction effect |
| 8 | $TV \times Newspaper$ | Interaction effect |
| 9 | $Radio \times Newspaper$ | Interaction effect |
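The expansion is a few lines of NumPy. `expand_features` below is an illustrative sketch of the mapping in the table, not the repository's preprocessing function:

```python
import numpy as np

def expand_features(X):
    """Map columns [TV, Radio, Newspaper] to the 9 features tabulated above."""
    tv, radio, news = X[:, 0], X[:, 1], X[:, 2]
    return np.column_stack([
        tv, radio, news,                      # 1-3: original budgets
        tv ** 2, radio ** 2, news ** 2,       # 4-6: quadratic terms
        tv * radio, tv * news, radio * news,  # 7-9: interactions
    ])
```

Because the squared terms have much larger scales than the originals, this step is what makes normalization essential.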

📈 Visualizations

📊 Model Performance Insights

📉 Loss Convergence


Smooth convergence to global minimum

🎯 Residual Analysis


Random scatter indicates good fit

📊 Actual vs Predicted


Points close to diagonal line

🔥 Correlation Matrix


Feature relationships visualized

🏆 Feature Importance


TV advertising shows strongest impact on sales


🧰 Tech Stack

Built With Modern Tools 🛠️


| Tool | Purpose |
|---|---|
| Python 3.8+ | Core Language |
| NumPy | Numerical Computing |
| Pandas | Data Manipulation |
| Scikit-Learn | Validation Tools |
| Jupyter | Interactive Analysis |
| Matplotlib | Visualizations |
| Seaborn | Statistical Plots |
| Docker | Containerization |

📊 Dataset

📈 Advertising Dataset

| Attribute | Details |
|---|---|
| 📁 Source | Kaggle / UCI ML Repository |
| 📊 Samples | 200 observations |
| 🔢 Features | TV, Radio, Newspaper (advertising budgets in $1000s) |
| 🎯 Target | Sales (in thousands of units) |
| ✅ Quality | No missing values |
| 📈 Correlation with Sales | TV (0.78), Radio (0.58), Newspaper (0.23) |
📊 Sample Data Preview

```text
   TV     Radio  Newspaper  Sales
0  230.1  37.8   69.2       22.1
1  44.5   39.3   45.1       10.4
2  17.2   45.9   69.3       9.3
3  151.5  41.3   58.5       18.5
4  180.8  10.8   58.4       12.9
```

🎓 Key Learnings

💡 Insights from Building ML from Scratch

🔑 Technical Insights

  1. Normalization is Critical 🎯

    • Without it, gradients explode
    • Z-score normalization works best
    • Apply to both features AND targets
  2. Feature Engineering Matters 🔧

    • Polynomial terms capture non-linearity
    • Interaction terms reveal relationships
    • Domain knowledge helps feature selection
  3. Regularization Prevents Overfitting 🛡️

    • L1 (Lasso) performs feature selection
    • Sparsity helps interpretability
    • Balance between bias and variance
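Point 1 can be sketched in a few lines; the key detail is fitting the mean and standard deviation on the training split and reusing them on the test split to avoid leakage (`z_score` is an illustrative helper, not the repo's API):

```python
import numpy as np

def z_score(X, mean=None, std=None):
    """Z-score normalize columns; reuse train-set statistics for test data."""
    if mean is None:                       # fit mode: compute stats from X
        mean, std = X.mean(axis=0), X.std(axis=0)
    return (X - mean) / std, mean, std
```

Usage: call once on the training set to get `mean` and `std`, then pass those back in when transforming the test set (and apply the same idea to the target).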

📚 Development Insights

  1. Hyperparameter Tuning is an Art 🎨

    • Learning rate: too high = divergence
    • Too low = slow convergence
    • Cross-validation finds sweet spot
  2. Different Methods, Different Trade-offs ⚖️

    • Batch GD: Stable but slow
    • SGD: Fast but noisy
    • Mini-Batch: Best of both worlds
  3. Document Your Failures 📝

    • Negative R² taught more than success
    • Experiment logs are invaluable
    • Share your learning journey

🚀 Future Roadmap

What's Next? 🔮

  • 🔄 L2 Regularization (Ridge)

    • Compare with L1
    • Implement Elastic Net (L1 + L2)
  • 🎯 Adaptive Learning Rates

    • Adam optimizer
    • RMSprop
    • Learning rate scheduling
  • 🔍 Automated Hyperparameter Tuning

    • Grid Search
    • Random Search
    • Bayesian Optimization
  • 📊 Extended Dataset Support

    • Boston Housing
    • California Housing
    • Custom datasets
  • 🌐 Web Interface

    • Interactive predictions
    • Real-time visualization
    • Model playground
  • 📱 API Development

    • REST API with FastAPI
    • Model serving
    • Deployment pipeline
  • 📚 Educational Content

    • Step-by-step tutorials
    • Video explanations
    • Blog posts

💻 Command Reference

⚡ Quick Commands

```bash
# 📦 Installation
make install              # Install production dependencies
make install-dev          # Install dev dependencies

# 🧪 Testing
make test                 # Run all tests
make test-cov             # Run tests with coverage report

# 🎨 Code Quality
make lint                 # Run linters
make format               # Format code with black

# 🚀 Running
make run                  # Run main script
make jupyter              # Start Jupyter notebook

# 🐳 Docker
make docker-build         # Build Docker image
make docker-run           # Run Docker container

# 🧹 Cleanup
make clean                # Remove generated files
```

🤝 Contributing

Join the Journey! 🌟

We welcome contributions from the community!

🐛 Bug Reports

Found a bug?
Open an Issue

💡 Feature Requests

Have an idea?
Suggest a Feature

🔧 Pull Requests

Want to contribute?
Submit a PR

📋 Contribution Steps

```bash
# 1. Fork the repository
# 2. Clone your fork
git clone https://github.com/YOUR_USERNAME/Linear-Regression-model-from-scratch.git

# 3. Create a feature branch
git checkout -b feature/AmazingFeature

# 4. Make your changes and commit
git commit -m '✨ Add some AmazingFeature'

# 5. Push to your branch
git push origin feature/AmazingFeature

# 6. Open a Pull Request
```

Please ensure:

  • ✅ Code passes all tests (pytest)
  • ✅ Code is formatted (make format)
  • ✅ Documentation is updated
  • ✅ Commit messages are descriptive

📜 License

This project is licensed under the MIT License. See LICENSE for more information.


🙏 Acknowledgments

Special Thanks ❤️

📊 Dataset
Advertising Dataset
Kaggle Community

🎓 Inspiration
Andrew Ng
Machine Learning Course

🛠️ Tools
NumPy, Pandas
Scikit-Learn Team

📚 Community
Stack Overflow
GitHub Community


📞 Contact & Connect

Let's Connect! 🌐

GitHub LinkedIn Email Twitter



⭐ Star This Repository!

If you found this project helpful, please consider giving it a star! ⭐


```text
 ███████╗████████╗ █████╗ ██████╗     ████████╗██╗  ██╗██╗███████╗
 ██╔════╝╚══██╔══╝██╔══██╗██╔══██╗    ╚══██╔══╝██║  ██║██║██╔════╝
 ███████╗   ██║   ███████║██████╔╝       ██║   ███████║██║███████╗
 ╚════██║   ██║   ██╔══██║██╔══██╗       ██║   ██╔══██║██║╚════██║
 ███████║   ██║   ██║  ██║██║  ██║       ██║   ██║  ██║██║███████║
 ╚══════╝   ╚═╝   ╚═╝  ╚═╝╚═╝  ╚═╝       ╚═╝   ╚═╝  ╚═╝╚═╝╚══════╝
```

💙 Built with passion and ☕ by willow788

Learning by doing, one gradient descent at a time 🚀



