A comprehensive machine learning project implementing Multiple Linear Regression to predict house prices using the California Housing Dataset
Overview • Dataset • Features • Results • Installation • Usage
This project demonstrates a complete machine learning workflow for predictive modeling using Multiple Linear Regression. The model predicts house prices in California based on various housing features using statistical techniques and data analysis.
Built entirely by: Rupayan Dey
The dataset contains 20,640 housing records from the 1990 California Census with the following features:
| Feature | Description | Range |
|---|---|---|
| MedInc | Median income of the block group (tens of thousands of USD) | 0.50 to 15.00 |
| HouseAge | Median house age (years) | 1 to 52 |
| AveRooms | Average rooms per household | 0.85 to 141.91 |
| AveBedrms | Average bedrooms per household | 0.33 to 34.07 |
| Population | Population per block group | 3 to 35,682 |
| AveOccup | Average occupancy (household members) | 0.69 to 1243.33 |
| Latitude | Latitude coordinate (degrees) | 32.54 to 41.95 |
| Longitude | Longitude coordinate (degrees) | -124.27 to -114.13 |
| Price (Target) | Median house price (hundreds of thousands of USD) | 0.15 to 5.00 |
- Data Loading & Preprocessing: Using scikit-learn's California Housing dataset
- Exploratory Data Analysis (EDA): Comprehensive statistical analysis and visualization
- Feature Engineering: Data standardization and normalization
- Model Training: Multiple Linear Regression implementation
- Model Evaluation: Detailed performance metrics and validation
- Assumption Checking: Residual analysis and diagnostics
- Model Persistence: Serialization using pickle for production deployment
- Descriptive Statistics: Mean, median, standard deviation, quartiles
- Correlation Analysis: Feature relationships and multicollinearity detection
- Distribution Analysis: Univariate and multivariate distributions
- Residual Analysis: Normality, homoscedasticity, and independence checks
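The descriptive statistics listed above can be computed along these lines. This is a minimal sketch on a small made-up array, not the project's data; the `summarize` helper and the sample values are illustrative assumptions.

```python
import numpy as np

def summarize(values):
    """Return mean, median, sample standard deviation, and quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(values, [25, 50, 75])
    return {
        "mean": values.mean(),
        "median": q2,
        "std": values.std(ddof=1),  # sample standard deviation (n - 1 denominator)
        "q1": q1,
        "q3": q3,
    }

# Hypothetical house ages, not taken from the dataset
house_age = [5, 12, 20, 28, 35, 41, 52]
stats = summarize(house_age)
print(stats)
```

With a pandas DataFrame, `df.describe()` reports the same quantities for every numeric column at once.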
- Mean Squared Error (MSE): 0.7549
- Root Mean Squared Error (RMSE): 0.8689
- Mean Absolute Error (MAE): 0.5348
- R² Score (Coefficient of Determination): 0.5757
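- R² Score of 0.5757 indicates the model explains approximately 57.57% of the variance in house prices
- RMSE of 0.8689 corresponds to a typical prediction error of ~$86,890 (the target is in units of $100,000); because errors are squared before averaging, large misses weigh more heavily than in the MAE
- MAE of 0.5348 indicates an average absolute error of ~$53,480 per prediction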
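The four metrics can be sketched from their definitions as follows. The `regression_metrics` helper and the toy arrays are assumptions for illustration; only the MSE value 0.7549 comes from the results above, and it is used to check that the reported RMSE is its square root.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R^2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(residuals)),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Toy values, not taken from the notebook
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Consistency check on the reported numbers: RMSE is the square root of MSE,
# and the target is in units of $100,000.
reported_rmse = np.sqrt(0.7549)
print(f"RMSE in dollars: ${reported_rmse * 100_000:,.0f}")
```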
Visualization showing relationships between all features and target variable. Key observations:
- Strong positive correlation between MedInc and Price
- Latitude/Longitude show clear geographic price patterns
- Some features show non-linear relationships
Correlation matrix revealing feature dependencies:
- Strongest positive correlation with Price: Median Income (0.69)
- Weak negative correlations with Price: Latitude (-0.14), Longitude (-0.05)
- Latitude-Longitude correlation: -0.92 (expected geographic relationship)
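A correlation matrix of this kind can be sketched with NumPy. The data below is synthetic and only borrows the column names for readability, so the printed coefficients will not match the values reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins: price depends on income, house age is unrelated
med_inc = rng.normal(4.0, 2.0, n)
house_age = rng.uniform(1, 52, n)
price = 0.4 * med_inc + rng.normal(0.0, 1.0, n)

names = ["MedInc", "HouseAge", "Price"]
data = np.column_stack([med_inc, house_age, price])

# rowvar=False treats each column as one variable
corr = np.corrcoef(data, rowvar=False)

price_idx = names.index("Price")
for name, r in zip(names, corr[:, price_idx]):
    print(f"{name:>8}: {r:+.2f}")
```

With a pandas DataFrame, `df.corr()` returns the same Pearson correlation matrix with labeled rows and columns.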
Scatter plot comparing predicted vs actual prices:
- Points cluster around the diagonal (good predictions)
- Some deviation at higher prices indicates model limitations
- Model performs better for mid-range prices
Distribution of prediction errors:
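- Roughly bell-shaped distribution, so the normality assumption holds only approximately
- Mean centered near zero
- Right skew indicates occasional large positive residuals, which means underestimation if residuals are computed as actual minus predicted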
Residuals analysis for homoscedasticity check:
- Residuals show slight funnel pattern (heteroscedasticity)
- Variance increases with predicted values
- Suggests potential for improvement through polynomial features
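One rough way to quantify a funnel pattern is to compare the residual variance for high and low predictions. The sketch below uses synthetic residuals whose spread grows with the predicted value; `variance_ratio` is a hypothetical helper, and this is a quick diagnostic rather than a formal statistical test.

```python
import numpy as np

def variance_ratio(y_pred, residuals):
    """Residual variance for the upper half of predictions divided by the lower half."""
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    order = np.argsort(y_pred)
    half = len(order) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return high.var(ddof=1) / low.var(ddof=1)

rng = np.random.default_rng(42)
y_pred = rng.uniform(0.5, 5.0, 2000)

# Synthetic heteroscedastic residuals: noise scale proportional to the prediction
residuals = rng.normal(0.0, 0.2 * y_pred)

ratio = variance_ratio(y_pred, residuals)
print(f"variance ratio (high / low): {ratio:.2f}")
```

A ratio near 1 is what constant error variance would produce; a ratio well above 1 matches the funnel shape described above.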
Distribution of key features:
- Price: Right-skewed, concentrated in 0.5-3.5 range
- House Age: Uniform distribution with peak at maximum age
- Average Rooms: Right-skewed, concentrated in 4-8 range
- Average Bedrooms: Clustered around 1-1.5 bedrooms
- NumPy 2.3.3: Numerical computations
- Pandas 2.3.2: Data manipulation and analysis
- Scikit-Learn 1.8.0: Machine learning algorithms
- Matplotlib 3.10.7: Plotting library
- Seaborn 0.13.2: Statistical data visualization
- Jupyter: Interactive notebook environment

- Python 3.8 or higher (the pinned versions above need a newer interpreter; NumPy 2.3.x, for example, requires Python 3.11+)
- pip or conda package manager
1. Clone the repository:

       git clone <repository-url>
       cd "2-mulipleLinearRegression(HousePricePrediction)"

2. Create a virtual environment (recommended):

       python -m venv venv
       source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies:

       pip install numpy pandas scikit-learn matplotlib seaborn jupyter

4. Launch Jupyter Notebook:

       jupyter notebook 22.1-MultipleLinearRegression.ipynb
- Open the Jupyter notebook: 22.1-MultipleLinearRegression.ipynb
- Execute cells sequentially from top to bottom
- Review visualizations and metrics at each step
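    import pickle

    import numpy as np

    # Load the trained model; the context manager closes the file afterwards
    with open("regressor.pkl", "rb") as f:
        model = pickle.load(f)

    # Feature order: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
    # The workflow standardizes features before training, so raw values must first pass
    # through the same fitted scaler unless the pickled object already includes it.
    new_data = np.array([[2.5, 25, 5.0, 1.0, 300, 2.5, 37.5, -122.3]])
    prediction = model.predict(new_data)

    # The target is in units of $100,000
    print(f"Predicted Price: ${prediction[0] * 100000:.2f}")

1. Data Loading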
   - Import California Housing Dataset
2. Exploratory Data Analysis
   - Dataset Overview
   - Statistical Summary
   - Distribution Analysis
   - Correlation Analysis
3. Data Preprocessing
   - Feature Selection
   - Train-Test Split (67-33)
   - Feature Standardization
4. Model Development
   - Linear Regression Initialization
   - Model Training
   - Coefficient Extraction
5. Model Evaluation
   - Prediction Generation
   - Performance Metrics Calculation
   - Residual Analysis
6. Assumption Validation
   - Linearity Check
   - Normality of Residuals
   - Homoscedasticity
   - Independence
7. Model Deployment
   - Model Serialization (pickle)
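The workflow steps above can be sketched end to end with scikit-learn. This is a minimal illustration on synthetic data, not the notebook's code: the data, coefficients, and `random_state` are assumptions, and bundling the scaler with the regressor in one pipeline is a design choice made here so that a single pickle carries both steps. Whether the project's `regressor.pkl` includes the scaler is not stated.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 housing features (assumption, not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
true_coefs = np.array([0.8, 0.1, -0.3, 0.3, 0.0, -0.05, -0.9, -0.85])
y = 2.0 + X @ true_coefs + rng.normal(scale=0.5, size=500)

# 67-33 train-test split, as in the workflow
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# The scaler is fitted on the training data only, then reused at predict time
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}, R2: {r2:.4f}")

# Round-trip through pickle in memory; writing to a file works the same way
restored = pickle.loads(pickle.dumps(model))
same = np.allclose(restored.predict(X_test), y_pred)
print(f"restored model reproduces predictions: {same}")
```

Because the pipeline applies standardization inside `predict`, new samples can be passed in their raw units.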
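$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Where:
- $\hat{y}$ = Predicted house price
- $\beta_0$ = Intercept
- $\beta_1, \beta_2, ..., \beta_n$ = Regression coefficients
- $x_1, x_2, ..., x_n$ = Independent variables (features)

R² Score:

$$R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}$$

RMSE:

$$\text{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2}$$

MAE:

$$\text{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right|$$

Here $m$ is the number of samples, $y_i$ the actual price of sample $i$, $\hat{y}_i$ its predicted price, and $\bar{y}$ the mean of the actual prices.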
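To make the regression equation concrete, the coefficients can be estimated by ordinary least squares with NumPy alone. This is a sketch under stated assumptions: `fit_ols` and `predict` are hypothetical helpers, and the toy data is generated without noise from known coefficients so the fit can be checked by eye. scikit-learn's `LinearRegression` solves the same least-squares problem.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares estimate of [beta_0, beta_1, ..., beta_n]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # A leading column of ones lets beta_0 act as the intercept
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict(beta, X):
    """Evaluate y_hat = beta_0 + beta_1 * x_1 + ... + beta_n * x_n."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]

# Toy data generated exactly from y = 1 + 2*x1 - 3*x2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

beta = fit_ols(X, y)
y_hat = predict(beta, X)
print(np.round(beta, 6))
```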
- Income is the strongest predictor of house prices (0.69 correlation)
- Geographic location matters - Latitude/Longitude show clear price patterns
- Model assumptions are mostly valid - Residuals follow approximate normal distribution
- Linear model has limitations - Some non-linearity present in data
- Heteroscedasticity detected - Variance increases with predicted values
- Polynomial Features: Add polynomial terms to capture non-linearity
- Additional Features: Include economic indicators or market data
- Advanced Models: Try Ridge, Lasso, or ensemble methods
- Feature Engineering: Create interaction terms between features
- Data Quality: Investigate outliers and anomalies more thoroughly
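The first suggestion can be illustrated on synthetic data with a curved relationship: adding a squared term to an otherwise linear model raises the fit considerably. The data and the `r2_linear_fit` helper are assumptions for illustration; no result here comes from the housing data.

```python
import numpy as np

def r2_linear_fit(features, y):
    """Fit least squares with an intercept and return the in-sample R^2."""
    design = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

# Synthetic target with a strong quadratic component
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, 400)
y = 1.0 + 0.5 * x + 1.5 * x**2 + rng.normal(scale=0.3, size=400)

r2_plain = r2_linear_fit(x.reshape(-1, 1), y)          # features: x
r2_poly = r2_linear_fit(np.column_stack([x, x**2]), y)  # features: x, x^2
print(f"linear only: {r2_plain:.3f}, with squared term: {r2_poly:.3f}")
```

The model stays linear in its coefficients; only the feature set grows. In scikit-learn, `PolynomialFeatures` generates such terms, including interactions, for all columns.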
    2-mulipleLinearRegression(HousePricePrediction)/
    ├── 22.1-MultipleLinearRegression.ipynb   # Main Jupyter notebook
    ├── save_visualizations.py                # Script to generate visualizations
    ├── regressor.pkl                         # Trained model (serialized)
    ├── README.md                             # This file
    └── assets/                               # Visualization outputs
        ├── 01_pairplot.png
        ├── 02_correlation_heatmap.png
        ├── 03_actual_vs_predicted.png
        ├── 04_residuals_histogram.png
        ├── 05_residuals_scatter.png
        └── 06_dataset_statistics.png
By exploring this project, you will understand:
- ✅ How to load and explore datasets using pandas
- ✅ How to perform exploratory data analysis with visualization
- ✅ How to preprocess and standardize features
- ✅ How to implement and train linear regression models
- ✅ How to evaluate model performance with multiple metrics
- ✅ How to validate regression assumptions
- ✅ How to serialize and save trained models
- ✅ How to make predictions on new data
This project is ideal for:
- Students learning machine learning fundamentals
- Data Scientists reviewing regression workflows
- Practitioners implementing end-to-end ML pipelines
- Researchers studying housing price prediction models
- Linear Assumption: Assumes linear relationships between features and price
- Low R² Score: Only explains ~58% of price variance
- Heteroscedasticity: Unequal error variance across predictions
- Geographic Simplification: Continuous lat/long not ideal for distinct regions
- Implement non-linear models (Random Forest, Gradient Boosting)
- Add temporal features if date information available
- Perform cross-validation for robust metrics
- Apply regularization (Ridge, Lasso) to prevent overfitting
- Create interaction features between top predictors
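Cross-validation, one of the items above, can be sketched without any ML library: shuffle the rows, split them into k folds, and score each fold with a model fitted on the remaining ones. The `kfold_r2` helper and the synthetic data are assumptions for illustration.

```python
import numpy as np

def kfold_r2(X, y, k=5, seed=0):
    """Return the out-of-fold R^2 for each of k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit ordinary least squares on the training folds only
        design_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(design_train, y[train_idx], rcond=None)
        # Score on the held-out fold
        design_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        residuals = y[test_idx] - design_test @ beta
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - residuals @ residuals / ss_tot)
    return np.array(scores)

# Synthetic data with a known linear signal
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=300)

scores = kfold_r2(X, y, k=5)
print(f"R2 per fold: {np.round(scores, 3)}")
print(f"mean: {scores.mean():.3f}, std: {scores.std(ddof=1):.3f}")
```

The spread across folds shows how much a single train-test split could over- or understate performance. scikit-learn's `cross_val_score` provides the same procedure for any estimator.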
This project was created as a learning exercise. Improvements and suggestions are welcome!
This project is open source and available under the MIT License.
Rupayan Dey
- Email: rupayandey134@gmail.com
- GitHub: [Your GitHub Profile]
- LinkedIn: [Your LinkedIn Profile]
- Scikit-Learn Team for the excellent California Housing dataset and ML library
- Data Science Community for best practices and methodologies
- Jupyter Project for interactive computing environment
Made with ❤️ by Rupayan Dey





