
🏠 Multiple Linear Regression - House Price Prediction

Python Scikit-Learn Pandas License

A comprehensive machine learning project implementing Multiple Linear Regression to predict house prices using the California Housing Dataset

Overview • Dataset • Features • Results • Installation • Usage


πŸ“‹ Overview

This project demonstrates a complete machine learning workflow for predictive modeling using Multiple Linear Regression. The model predicts house prices in California based on various housing features using statistical techniques and data analysis.

Built entirely by: Rupayan Dey


πŸ“Š Dataset

California Housing Dataset

The dataset contains 20,640 housing records from the 1990 California Census, with the following features:

| Feature | Description | Range |
|---------|-------------|-------|
| MedInc | Median income (tens of thousands of dollars) | 0.50 – 15.00 |
| HouseAge | Median house age in years | 1 – 52 |
| AveRooms | Average rooms per household | 0.85 – 141.91 |
| AveBedrms | Average bedrooms per household | 0.33 – 34.07 |
| Population | Population per block group | 3 – 35,682 |
| AveOccup | Average occupancy | 0.69 – 1,243.33 |
| Latitude | Latitude coordinate | 32.54 – 41.95 |
| Longitude | Longitude coordinate | -124.27 – -114.13 |
| Price (Target) | Median house value (in $100,000s) | 0.15 – 5.00 |
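As a quick sanity check, the table above can be reproduced directly from scikit-learn's loader. This sketch is not part of the repo; it renames the `MedHouseVal` target to `Price` to match the notebook's terminology:

```python
# Sketch (assumed, not the repo's code): load the dataset and verify the
# record count and feature ranges shown in the table above.
import pandas as pd
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.frame.rename(columns={"MedHouseVal": "Price"})

print(df.shape)  # 20,640 rows: 8 features plus the Price target
print(df[["MedInc", "HouseAge", "Price"]].agg(["min", "max"]))
```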

🎯 Project Features

✨ Key Components:

  • Data Loading & Preprocessing: Using scikit-learn's California Housing dataset
  • Exploratory Data Analysis (EDA): Comprehensive statistical analysis and visualization
  • Feature Engineering: Data standardization and normalization
  • Model Training: Multiple Linear Regression implementation
  • Model Evaluation: Detailed performance metrics and validation
  • Assumption Checking: Residual analysis and diagnostics
  • Model Persistence: Serialization using pickle for production deployment

πŸ“ˆ Analysis Performed:

  1. Descriptive Statistics: Mean, median, standard deviation, quartiles
  2. Correlation Analysis: Feature relationships and multicollinearity detection
  3. Distribution Analysis: Univariate and multivariate distributions
  4. Residual Analysis: Normality, homoscedasticity, and independence checks

πŸ“ˆ Results & Performance Metrics

Model Performance on Test Set:

├─ Mean Squared Error (MSE):       0.7549
├─ Root Mean Squared Error (RMSE): 0.8689
├─ Mean Absolute Error (MAE):      0.5348
└─ R² Score (Coefficient of Determination): 0.5757

Interpretation:

  • An R² score of 0.5757 indicates the model explains approximately 57.57% of the variance in house prices
  • An RMSE of 0.8689 corresponds to a typical prediction error of ~$86,890 (prices are in units of $100,000)
  • An MAE of 0.5348 corresponds to an average absolute error of ~$53,480 per prediction
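These metrics can be computed with scikit-learn's `metrics` module. The sketch below uses small hypothetical `y_test`/`y_pred` arrays for illustration, not the notebook's actual values:

```python
# Sketch: computing the reported metrics with scikit-learn.
# y_test / y_pred here are hypothetical values in units of $100,000.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_test = np.array([2.5, 1.8, 3.2, 0.9])
y_pred = np.array([2.1, 2.0, 2.9, 1.3])

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                      # RMSE is the square root of MSE
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}  R²={r2:.4f}")
```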

πŸ“Š Visualizations

1. Pairplot - Feature Relationships

Pairplot

Visualization showing relationships between all features and the target variable. Key observations:

  • Strong positive correlation between MedInc and Price
  • Latitude/Longitude show clear geographic price patterns
  • Some features show non-linear relationships

2. Correlation Heatmap

Correlation Heatmap

Correlation matrix revealing feature dependencies:

  • Strongest positive correlation with Price: Median Income (0.69)
  • Notable negative correlations: Latitude (-0.14), Longitude (-0.05)
  • Latitude-Longitude correlation: -0.92 (expected geographic relationship)
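A heatmap like the one above can be reproduced with seaborn. This is a sketch rather than the repo's `save_visualizations.py`, and the output filename is illustrative:

```python
# Sketch (assumed): correlation heatmap for all features plus the target.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headlessly
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame.rename(
    columns={"MedHouseVal": "Price"})
corr = df.corr()

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Feature Correlation Matrix")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")  # illustrative output path
```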

3. Actual vs Predicted Prices

Actual vs Predicted

Scatter plot comparing predicted vs actual prices:

  • Points cluster around the diagonal (good predictions)
  • Some deviation at higher prices indicates model limitations
  • Model performs better for mid-range prices

4. Residuals Distribution

Residuals Histogram

Distribution of prediction errors:

  • Approximately normal distribution (assumption validated)
  • Mean centered near zero
  • A slight right skew reflects a minority of large errors

5. Residuals vs Predicted Values

Residuals Scatter

Residuals analysis for homoscedasticity check:

  • Residuals show slight funnel pattern (heteroscedasticity)
  • Variance increases with predicted values
  • Suggests potential for improvement through polynomial features
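Both residual plots can be generated from a single figure. The sketch below uses synthetic `y_test`/`y_pred` values for illustration, so the pattern it shows will not match the notebook's exactly:

```python
# Sketch: residual diagnostics (histogram for normality, scatter for
# homoscedasticity). Data here is synthetic, not the notebook's.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_test = rng.uniform(0.5, 5.0, 200)        # hypothetical actual prices
y_pred = y_test + rng.normal(0, 0.3, 200)  # hypothetical predictions
residuals = y_test - y_pred

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(residuals, bins=25)               # normality check
ax1.set(title="Residuals Distribution", xlabel="Residual")
ax2.scatter(y_pred, residuals, s=8)        # homoscedasticity check
ax2.axhline(0, color="red", linestyle="--")
ax2.set(title="Residuals vs Predicted", xlabel="Predicted", ylabel="Residual")
fig.savefig("residual_diagnostics.png")    # illustrative output path
```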

6. Dataset Statistics

Dataset Statistics

Distribution of key features:

  • Price: Right-skewed, concentrated in 0.5-3.5 range
  • House Age: Roughly uniform, with a spike at the maximum age of 52 (values are capped there)
  • Average Rooms: Right-skewed, concentrated in 4-8 range
  • Average Bedrooms: Clustered around 1-1.5 bedrooms

πŸ› οΈ Technical Stack

Libraries & Tools:

├─ NumPy 2.3.3          # Numerical computations
├─ Pandas 2.3.2         # Data manipulation and analysis
├─ Scikit-Learn 1.8.0   # Machine learning algorithms
├─ Matplotlib 3.10.7    # Plotting library
├─ Seaborn 0.13.2       # Statistical data visualization
└─ Jupyter              # Interactive notebook environment

## Installation

Prerequisites:

  • Python 3.8 or higher
  • pip or conda package manager

Setup Instructions:

  1. Clone the repository:

    git clone <repository-url>
    cd "2-mulipleLinearRegression(HousePricePrediction)"
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install numpy pandas scikit-learn matplotlib seaborn jupyter
  4. Launch Jupyter Notebook:

    jupyter notebook 22.1-MultipleLinearRegression.ipynb

πŸš€ Usage

Running the Notebook:

  1. Open the Jupyter notebook: 22.1-MultipleLinearRegression.ipynb
  2. Execute cells sequentially from top to bottom
  3. Review visualizations and metrics at each step

Using the Saved Model:

import pickle
import numpy as np

# Load the trained model
with open('regressor.pkl', 'rb') as f:
    model = pickle.load(f)

# Make predictions on new data.
# Note: the notebook trains on standardized features, so new inputs should be
# transformed with the same StandardScaler (fitted on the training data)
# before calling predict().
new_data = np.array([[2.5, 25, 5.0, 1.0, 300, 2.5, 37.5, -122.3]])
prediction = model.predict(new_data)
print(f"Predicted Price: ${prediction[0] * 100000:,.2f}")

πŸ“š Methodology

Step-by-Step Workflow:

1. Data Loading
   └─ Import California Housing Dataset
   
2. Exploratory Data Analysis
   ├─ Dataset Overview
   ├─ Statistical Summary
   ├─ Distribution Analysis
   └─ Correlation Analysis
   
3. Data Preprocessing
   ├─ Feature Selection
   ├─ Train-Test Split (67-33)
   └─ Feature Standardization
   
4. Model Development
   ├─ Linear Regression Initialization
   ├─ Model Training
   └─ Coefficient Extraction
   
5. Model Evaluation
   ├─ Prediction Generation
   ├─ Performance Metrics Calculation
   └─ Residual Analysis
   
6. Assumption Validation
   ├─ Linearity Check
   ├─ Normality of Residuals
   ├─ Homoscedasticity
   └─ Independence
   
7. Model Deployment
   └─ Model Serialization (pickle)
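The workflow above can be sketched end to end. The 67/33 split and StandardScaler follow the steps listed, but variable names and `random_state` are illustrative, and saving the scaler alongside the model is a suggested addition (the repo ships only `regressor.pkl`):

```python
# End-to-end sketch of steps 1-7 (illustrative, not the notebook's exact code).
import pickle
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Step 1: load data
X, y = fetch_california_housing(return_X_y=True)

# Step 3: 67/33 split, then standardize (fit the scaler on the train set only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Step 4: fit the multiple linear regression model
model = LinearRegression().fit(X_train_s, y_train)
print("Coefficients:", model.coef_)

# Step 5: evaluate on the held-out test set
r2 = r2_score(y_test, model.predict(X_test_s))
print(f"R² on test set: {r2:.4f}")

# Step 7: serialize the model (and, as an addition, the scaler it depends on)
with open("regressor.pkl", "wb") as f:
    pickle.dump(model, f)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
```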

πŸ“Š Model Equations

Multiple Linear Regression Formula:

$$\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$

Where:

  • $\hat{y}$ = Predicted house price
  • $\beta_0$ = Intercept
  • $\beta_1, \beta_2, ..., \beta_n$ = Regression coefficients
  • $x_1, x_2, ..., x_n$ = Independent variables (features)

Performance Metrics:

R² Score: $$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

RMSE: $$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

MAE: $$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
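The three formulas translate directly to NumPy. The sketch below implements them by hand and checks them against scikit-learn's implementations on toy values:

```python
# Sketch: the metric formulas above written directly in NumPy, verified
# against scikit-learn (toy y values, not the notebook's).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y = np.array([1.0, 2.0, 3.0, 4.0])      # toy actual values
yhat = np.array([1.1, 1.9, 3.3, 3.6])   # toy predictions

r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)
rmse = np.sqrt(np.mean((y - yhat) ** 2))
mae = np.mean(np.abs(y - yhat))

# The hand-rolled formulas agree with scikit-learn's implementations:
assert np.isclose(r2, r2_score(y, yhat))
assert np.isclose(mae, mean_absolute_error(y, yhat))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y, yhat)))
```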


πŸ” Key Insights

What We Learned:

  1. Income is the strongest predictor of house prices (0.69 correlation)
  2. Geographic location matters - Latitude/Longitude show clear price patterns
  3. Model assumptions are mostly valid - Residuals follow an approximately normal distribution
  4. Linear model has limitations - Some non-linearity present in data
  5. Heteroscedasticity detected - Variance increases with predicted values

Recommendations for Improvement:

  • 🔧 Polynomial Features: Add polynomial terms to capture non-linearity
  • 📈 Additional Features: Include economic indicators or market data
  • 🤖 Advanced Models: Try Ridge, Lasso, or ensemble methods
  • 📊 Feature Engineering: Create interaction terms between features
  • 🎯 Data Quality: Investigate outliers and anomalies more thoroughly
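The first and third suggestions can be combined in a single scikit-learn pipeline. The `degree` and `alpha` values here are illustrative placeholders, not tuned hyperparameters:

```python
# Sketch: polynomial features feeding a Ridge regression, as one pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

poly_model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),  # squares + interactions
    Ridge(alpha=1.0),                                  # L2 regularization
)
# Usage (with your own split): poly_model.fit(X_train, y_train)
#                              poly_model.score(X_test, y_test)
```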

πŸ“ Project Structure

2-mulipleLinearRegression(HousePricePrediction)/
├── 22.1-MultipleLinearRegression.ipynb    # Main Jupyter notebook
├── save_visualizations.py                 # Script to generate visualizations
├── regressor.pkl                          # Trained model (serialized)
├── README.md                              # This file
└── assets/                                # Visualization outputs
    ├── 01_pairplot.png
    ├── 02_correlation_heatmap.png
    ├── 03_actual_vs_predicted.png
    ├── 04_residuals_histogram.png
    ├── 05_residuals_scatter.png
    └── 06_dataset_statistics.png

πŸ’‘ Learning Outcomes

By exploring this project, you will understand:

✅ How to load and explore datasets using pandas
✅ How to perform exploratory data analysis with visualization
✅ How to preprocess and standardize features
✅ How to implement and train linear regression models
✅ How to evaluate model performance with multiple metrics
✅ How to validate regression assumptions
✅ How to serialize and save trained models
✅ How to make predictions on new data


πŸŽ“ Educational Value

This project is ideal for:

  • Students learning machine learning fundamentals
  • Data Scientists reviewing regression workflows
  • Practitioners implementing end-to-end ML pipelines
  • Researchers studying housing price prediction models

πŸ“ Notes & Observations

Model Limitations:

  • Linear Assumption: Assumes linear relationships between features and price
  • Low R² Score: Explains only ~58% of price variance
  • Heteroscedasticity: Unequal error variance across predictions
  • Geographic Simplification: Continuous lat/long not ideal for distinct regions

Future Enhancements:

  • Implement non-linear models (Random Forest, Gradient Boosting)
  • Add temporal features if date information is available
  • Perform cross-validation for robust metrics
  • Apply regularization (Ridge, Lasso) to prevent overfitting
  • Create interaction features between top predictors
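The cross-validation suggestion might look like the sketch below; the 5-fold setup and `random_state` are illustrative choices, not the project's:

```python
# Sketch: 5-fold cross-validated R² for the linear model (illustrative setup).
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)

# Bundle scaling with the model so each fold fits its own scaler (no leakage)
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
print(f"R² per fold: {np.round(scores, 3)}  mean: {scores.mean():.3f}")
```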

🀝 Contributing

This project was created as a learning exercise. Improvements and suggestions are welcome!


πŸ“„ License

This project is open source and available under the MIT License.


πŸ‘€ Author

Rupayan Dey

  • 📧 Email: rupayandey134@gmail.com
  • 🔗 GitHub: [Your GitHub Profile]
  • 💼 LinkedIn: [Your LinkedIn Profile]

πŸ™ Acknowledgments

  • Scikit-Learn Team for the excellent California Housing dataset and ML library
  • Data Science Community for best practices and methodologies
  • Jupyter Project for interactive computing environment

⭐ If you found this project helpful, please consider giving it a star!

Made with ❤️ by Rupayan Dey

⬆ Back to Top
