🏠 Multiple Linear Regression - House Price Prediction

A comprehensive machine learning project implementing Multiple Linear Regression to predict house prices using the California Housing Dataset

Overview • Dataset • Features • Results • Installation • Usage

📋 Overview

This project demonstrates a complete machine learning workflow for predictive modeling with Multiple Linear Regression. The model predicts median house prices for California block groups from eight housing and location features, covering exploratory analysis, preprocessing, training, evaluation, and assumption checks.

Built entirely by: Rupayan Dey

📊 Dataset

California Housing Dataset

The dataset contains 20,640 records from the 1990 California Census, one per block group, with the following features:

Feature	Description	Range
MedInc	Median income in block group (tens of thousands of USD)	0.50 to 15.00
HouseAge	Median house age in block group (years)	1 to 52
AveRooms	Average rooms per household	0.85 to 141.91
AveBedrms	Average bedrooms per household	0.33 to 34.07
Population	Block group population	3 to 35,682
AveOccup	Average household members	0.69 to 1,243.33
Latitude	Block group latitude (degrees)	32.54 to 41.95
Longitude	Block group longitude (degrees)	-124.27 to -114.13
Price (Target)	Median house value (units of $100,000)	0.15 to 5.00
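The ranges in the table can be checked with a few lines of pandas. The sketch below is illustrative only: `feature_ranges` is a hypothetical helper, and the tiny frame is made up so the example runs offline. For the real data, pass in the frame returned by scikit-learn's `fetch_california_housing(as_frame=True)`.

```python
import pandas as pd


def feature_ranges(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the minimum and maximum of every numeric column."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({"min": numeric.min(), "max": numeric.max()})


# Tiny made-up frame so the example runs offline; for the real data, pass
# sklearn.datasets.fetch_california_housing(as_frame=True).frame instead.
toy = pd.DataFrame({"MedInc": [0.5, 3.9, 15.0], "HouseAge": [1, 29, 52]})
ranges = feature_ranges(toy)
print(ranges)
```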

🎯 Project Features

✨ Key Components:

Data Loading & Preprocessing: Using scikit-learn's California Housing dataset
Exploratory Data Analysis (EDA): Comprehensive statistical analysis and visualization
Feature Engineering: Data standardization and normalization
Model Training: Multiple Linear Regression implementation
Model Evaluation: Detailed performance metrics and validation
Assumption Checking: Residual analysis and diagnostics
Model Persistence: Serialization using pickle for production deployment

📈 Analysis Performed:

Descriptive Statistics: Mean, median, standard deviation, quartiles
Correlation Analysis: Feature relationships and multicollinearity detection
Distribution Analysis: Univariate and multivariate distributions
Residual Analysis: Normality, homoscedasticity, and independence checks

📈 Results & Performance Metrics

Model Performance on Test Set:

├─ Mean Squared Error (MSE):      0.7549
├─ Root Mean Squared Error (RMSE): 0.8689
├─ Mean Absolute Error (MAE):    0.5348
└─ R² Score (Coefficient of Determination): 0.5757

Interpretation:

R² Score of 0.5757 indicates the model explains approximately 57.57% of the variance in house prices
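RMSE of 0.8689 corresponds to a typical prediction error of about $86,890 (the target is in units of $100,000); because errors are squared before averaging, large misses weigh more heavily than in MAE
MAE of 0.5348 indicates an average absolute error of about $53,480 per prediction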
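The dollar figures follow from the target's unit of $100,000, and RMSE is the square root of MSE. A short arithmetic check using the reported metrics (the constant name `UNIT_USD` is just for illustration):

```python
import math

UNIT_USD = 100_000  # the target is expressed in units of $100,000

mse, mae = 0.7549, 0.5348  # values reported above

rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.4f} -> about ${rmse * UNIT_USD:,.0f}")
print(f"MAE:  {mae:.4f} -> about ${mae * UNIT_USD:,.0f}")
```

Because the reported MSE is itself rounded, the recomputed RMSE can differ from 0.8689 in the last decimal place.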

📊 Visualizations

1. Pairplot - Feature Relationships

Visualization showing relationships between all features and target variable. Key observations:

Strong positive correlation between MedInc and Price
Latitude/Longitude show clear geographic price patterns
Some features show non-linear relationships

2. Correlation Heatmap

Correlation matrix revealing feature dependencies:

Strongest positive correlation with Price: Median Income (0.69)
Weak negative correlations with Price: Latitude (-0.14), Longitude (-0.05)
Latitude-Longitude correlation: -0.92 (expected geographic relationship)
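Strongly correlated feature pairs such as Latitude and Longitude can also be listed programmatically rather than read off the heatmap. A minimal sketch, assuming a pandas DataFrame of features; `strong_pairs` and the 0.8 threshold are illustrative choices, and the data is synthetic:

```python
import numpy as np
import pandas as pd


def strong_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> list:
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, round(float(r), 2)))
    return pairs


# Synthetic stand-in: "Longitude" is built to fall as "Latitude" rises.
rng = np.random.default_rng(0)
lat = rng.uniform(32.5, 42.0, 500)
toy = pd.DataFrame({
    "Latitude": lat,
    "Longitude": -lat + rng.normal(0, 0.5, 500),
    "HouseAge": rng.uniform(1, 52, 500),
})
pairs = strong_pairs(toy)
print(pairs)
```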

3. Actual vs Predicted Prices

Scatter plot comparing predicted vs actual prices:

Points cluster around the diagonal (good predictions)
Some deviation at higher prices indicates model limitations
Model performs better for mid-range prices

4. Residuals Distribution

Distribution of prediction errors:

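Roughly bell-shaped, so the normality assumption holds only approximately
Mean centered near zero
Right skew: if residuals are computed as actual minus predicted, the long right tail means the model occasionally underestimates prices by a wide margin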
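The direction of the skew depends on how residuals are defined. The sketch below assumes residuals are actual minus predicted (the notebook's convention is not stated in this README) and computes their mean and sample skewness on made-up values; `residual_summary` is a hypothetical helper.

```python
import numpy as np


def residual_summary(y_true, y_pred):
    """Return the mean and sample skewness of residuals (actual - predicted)."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    centered = residuals - residuals.mean()
    skewness = (centered ** 3).mean() / (centered ** 2).mean() ** 1.5
    return residuals.mean(), skewness


# Made-up values: one house is badly under-predicted, giving a long right tail.
y_true = [1.0, 1.2, 0.9, 1.1, 5.0]
y_pred = [1.1, 1.1, 1.0, 1.0, 2.0]
mean_res, skew = residual_summary(y_true, y_pred)
print(f"mean residual = {mean_res:.3f}, skewness = {skew:.3f}")
```

Under this convention, a positive skewness comes from a few large positive residuals, that is, cases where the actual price is well above the prediction.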

5. Residuals vs Predicted Values

Residuals analysis for homoscedasticity check:

Residuals show slight funnel pattern (heteroscedasticity)
Variance increases with predicted values
Suggests room for improvement, for example through polynomial features or a transformed target
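A funnel pattern can be given a rough number by comparing residual spread at low and high predicted values. This is an informal split comparison on synthetic data, not a formal statistical test and not code from the notebook; `spread_ratio` is a hypothetical helper.

```python
import numpy as np


def spread_ratio(y_pred, residuals):
    """Residual std for the upper half of predictions divided by the lower half."""
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    order = np.argsort(y_pred)
    half = len(order) // 2
    low, high = residuals[order[:half]], residuals[order[half:]]
    return high.std() / low.std()


# Synthetic residuals whose noise scale grows with the predicted value.
rng = np.random.default_rng(42)
pred = rng.uniform(0.5, 5.0, 2000)
res = rng.normal(0.0, 0.1 * pred)
ratio = spread_ratio(pred, res)
print(f"spread ratio (high/low) = {ratio:.2f}")
```

A ratio well above 1 is consistent with variance that increases with the predicted value; a ratio near 1 suggests roughly constant variance.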

6. Dataset Statistics

Distribution of key features:

Price: Right-skewed, concentrated in 0.5-3.5 range
House Age: Roughly uniform, with a spike at the maximum age of 52
Average Rooms: Right-skewed, concentrated in 4-8 range
Average Bedrooms: Clustered around 1-1.5 bedrooms

🛠️ Technical Stack

Libraries & Tools:

├─ NumPy 2.3.3          # Numerical computations
├─ Pandas 2.3.2         # Data manipulation and analysis
├─ Scikit-Learn 1.8.0   # Machine learning algorithms
├─ Matplotlib 3.10.7    # Plotting library
├─ Seaborn 0.13.2       # Statistical data visualization
└─ Jupyter              # Interactive notebook environment

Installation

Prerequisites:

Python 3.8 or higher
pip or conda package manager

Setup Instructions:

Clone the repository:

git clone <repository-url>
cd "2-mulipleLinearRegression(HousePricePrediction)"

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install numpy pandas scikit-learn matplotlib seaborn jupyter

Launch Jupyter Notebook:

jupyter notebook 22.1-MultipleLinearRegression.ipynb

🚀 Usage

Running the Notebook:

Open the Jupyter notebook: 22.1-MultipleLinearRegression.ipynb
Execute cells sequentially from top to bottom
Review visualizations and metrics at each step

Using the Saved Model:

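import pickle
import numpy as np

# Load the trained model (only unpickle files you trust)
with open('regressor.pkl', 'rb') as f:
    model = pickle.load(f)

# Feature order: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
# Note: the model was trained on standardized features, so apply the same scaling to new data first
new_data = np.array([[2.5, 25, 5.0, 1.0, 300, 2.5, 37.5, -122.3]])
prediction = model.predict(new_data)

# The target is in units of $100,000
print(f"Predicted Price: ${prediction[0] * 100000:,.2f}")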
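Since the workflow standardizes features before training, new inputs need the same scaling before prediction. This README does not say whether the fitted scaler is saved alongside regressor.pkl. One common way to sidestep the issue is to pickle a scikit-learn pipeline that contains both the scaler and the model. The sketch below uses synthetic data and is an assumption about how this could be done, not the notebook's actual code.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight housing features and the price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ np.arange(1, 9) + rng.normal(scale=0.1, size=200)

# The pipeline scales inputs and then fits the regression in one object.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X, y)

# Round-trip through pickle, as one would when saving to and loading from disk.
restored = pickle.loads(pickle.dumps(pipeline))

new_data = rng.normal(size=(1, 8))
print(restored.predict(new_data))
```

With this layout, callers pass raw feature values and the restored object applies the training-time scaling on its own.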

📚 Methodology

Step-by-Step Workflow:

1. Data Loading
   └─ Import California Housing Dataset
   
2. Exploratory Data Analysis
   ├─ Dataset Overview
   ├─ Statistical Summary
   ├─ Distribution Analysis
   └─ Correlation Analysis
   
3. Data Preprocessing
   ├─ Feature Selection
   ├─ Train-Test Split (67-33)
   └─ Feature Standardization
   
4. Model Development
   ├─ Linear Regression Initialization
   ├─ Model Training
   └─ Coefficient Extraction
   
5. Model Evaluation
   ├─ Prediction Generation
   ├─ Performance Metrics Calculation
   └─ Residual Analysis
   
6. Assumption Validation
   ├─ Linearity Check
   ├─ Normality of Residuals
   ├─ Homoscedasticity
   └─ Independence
   
7. Model Deployment
   └─ Model Serialization (pickle)
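Steps 3 to 5 of the workflow can be sketched in scikit-learn as follows. The data is synthetic so the example runs without downloading anything, and the `random_state` value is an arbitrary choice; the notebook's own settings may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (8 features, noisy linear target).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=1000)

# Step 3: 67-33 split, then fit the scaler on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Step 4: train the model and read off its coefficients.
model = LinearRegression().fit(X_train_s, y_train)
print("coefficients:", model.coef_)

# Step 5: predict and compute the metrics.
y_pred = model.predict(X_test_s)
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse, "RMSE:", np.sqrt(mse))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))
```

Fitting the scaler on the training split only keeps test-set statistics from leaking into training.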

📊 Model Equations

Multiple Linear Regression Formula:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Where:

$\hat{y}$ = Predicted house price
$\beta_0$ = Intercept
$\beta_1, \beta_2, \dots, \beta_n$ = Regression coefficients
$x_1, x_2, \dots, x_n$ = Independent variables (features)
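To make the formula concrete, the coefficients can be estimated by ordinary least squares with NumPy alone. This is a sketch on made-up, noise-free data with two features, so the recovered coefficients should match the ones used to generate it; it is not how the notebook trains its model.

```python
import numpy as np

# Made-up data generated from y = 1 + 2*x1 - 3*x2 with no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the first fitted coefficient is the intercept.
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

# Prediction for one new observation: beta_0 + beta_1*x_1 + beta_2*x_2.
x_new = np.array([0.5, -1.0])
y_hat = beta[0] + beta[1:] @ x_new
print("beta:", beta, "y_hat:", y_hat)
```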

Performance Metrics:

R² Score: $$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

RMSE: $$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

MAE: $$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
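The three formulas translate directly into plain Python. The helper name `regression_metrics` and the input values are made up for illustration; in practice scikit-learn's metric functions would be used.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE directly from their definitions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
    }


# Made-up values, chosen so the arithmetic is easy to follow by hand.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

Here the residual sum of squares is 0.5 and the total sum of squares is 5, so R² is 0.9, RMSE is the square root of 0.125, and MAE is 0.25.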

🔍 Key Insights

What We Learned:

Income is the strongest predictor of house prices (0.69 correlation)
Geographic location matters - Latitude/Longitude show clear price patterns
Model assumptions are mostly valid - Residuals follow approximate normal distribution
Linear model has limitations - Some non-linearity present in data
Heteroscedasticity detected - Variance increases with predicted values

Recommendations for Improvement:

🔧 Polynomial Features: Add polynomial terms to capture non-linearity
📈 Additional Features: Include economic indicators or market data
🤖 Advanced Models: Try Ridge, Lasso, or ensemble methods
📊 Feature Engineering: Create interaction terms between features
🎯 Data Quality: Investigate outliers and anomalies more thoroughly
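The first few recommendations can be combined in one scikit-learn pipeline: degree-2 polynomial features (which include interaction terms) followed by Ridge regression. The sketch below uses a synthetic target with a squared term and an interaction; `degree=2` and `alpha=1.0` are illustrative values, not tuned settings for the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic target with a squared term and an interaction that a plain
# linear model cannot represent.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=600)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

linear = make_pipeline(StandardScaler(), LinearRegression())
poly_ridge = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
)

linear_r2 = linear.fit(X_train, y_train).score(X_test, y_test)
poly_r2 = poly_ridge.fit(X_train, y_train).score(X_test, y_test)
print(f"linear R2 = {linear_r2:.3f}, degree-2 ridge R2 = {poly_r2:.3f}")
```

On real data the gain depends on how much non-linearity is actually present, so the comparison is worth repeating on a held-out set.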

📁 Project Structure

2-mulipleLinearRegression(HousePricePrediction)/
├── 22.1-MultipleLinearRegression.ipynb    # Main Jupyter notebook
├── save_visualizations.py                 # Script to generate visualizations
├── regressor.pkl                          # Trained model (serialized)
├── README.md                              # This file
└── assets/                                # Visualization outputs
    ├── 01_pairplot.png
    ├── 02_correlation_heatmap.png
    ├── 03_actual_vs_predicted.png
    ├── 04_residuals_histogram.png
    ├── 05_residuals_scatter.png
    └── 06_dataset_statistics.png

💡 Learning Outcomes

By exploring this project, you will understand:

✅ How to load and explore datasets using pandas
✅ How to perform exploratory data analysis with visualization
✅ How to preprocess and standardize features
✅ How to implement and train linear regression models
✅ How to evaluate model performance with multiple metrics
✅ How to validate regression assumptions
✅ How to serialize and save trained models
✅ How to make predictions on new data

🎓 Educational Value

This project is ideal for:

Students learning machine learning fundamentals
Data Scientists reviewing regression workflows
Practitioners implementing end-to-end ML pipelines
Researchers studying housing price prediction models

📝 Notes & Observations

Model Limitations:

Linear Assumption: Assumes linear relationships between features and price
Low R² Score: Only explains ~58% of price variance
Heteroscedasticity: Unequal error variance across predictions
Geographic Simplification: Continuous lat/long not ideal for distinct regions

Future Enhancements:

Implement non-linear models (Random Forest, Gradient Boosting)
Add temporal features if date information available
Perform cross-validation for robust metrics
Apply regularization (Ridge, Lasso) to stabilize coefficients under multicollinearity and limit overfitting
Create interaction features between top predictors
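Cross-validation, listed above, replaces the single train-test split with several, which gives a mean score and a spread. A minimal sketch with 5-fold cross-validation on synthetic data; the fold count and `random_state` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ np.linspace(0.5, 2.0, 8) + rng.normal(scale=0.5, size=500)

# Scaling lives inside the pipeline so each fold fits its own scaler.
model = make_pipeline(StandardScaler(), LinearRegression())
folds = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=folds, scoring="r2")
print(f"R2 per fold: {np.round(scores, 3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```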

🤝 Contributing

This project was created as a learning exercise. Improvements and suggestions are welcome!

📄 License

This project is open source and available under the MIT License.

👤 Author

Rupayan Dey

📧 Email: rupayandey134@gmail.com
🔗 GitHub: [Your GitHub Profile]
💼 LinkedIn: [Your LinkedIn Profile]

🙏 Acknowledgments

Scikit-Learn Team for the excellent California Housing dataset and ML library
Data Science Community for best practices and methodologies
Jupyter Project for interactive computing environment

⭐ If you found this project helpful, please consider giving it a star!

Made with ❤️ by Rupayan Dey
