End-to-end machine learning pipeline for predicting median house prices in California districts using Linear Regression, SQL integration, and an interactive Streamlit dashboard.
- Dataset: California Housing Dataset (20,640 districts with 8 features)
- Algorithm: Linear Regression with feature engineering
- Tech Stack: Python, scikit-learn, SQLite, Streamlit, pandas, seaborn
# Setup environment
python -m venv .venv
source .venv/bin/activate # Linux/WSL (.venv\Scripts\activate for Windows)
pip install -r requirements.txt
# Run complete pipeline (data processing → database → visualizations → model training)
python pipeline.py
# Launch interactive dashboard
streamlit run app.py
├── src/
│ ├── config.py # Configuration constants and paths
│ ├── dataset.py # HousingDataProcessor: data loading & cleaning
│ ├── features.py # Feature engineering functions
│ ├── plots.py # EDAAnalyser: 10 visualization types
│ ├── modeling/
│ │ ├── train.py # PricePredictionModel: model training
│ │ └── predict.py # PredictionInterface: inference
│ └── services/
│ └── database.py # DatabaseManager: SQLite operations
├── app.py # Streamlit dashboard (main interface)
├── pipeline.py # Full pipeline orchestrator
├── data/
│ ├── raw/ # Original dataset
│ ├── interim/ # Cleaned data
│ ├── processed/ # Feature-engineered data
│ └── housing.db # SQLite database
├── models/ # Trained model artifacts (.pkl files)
├── reports/figures/ # Generated visualizations (12 PNG files)
└── notebooks/ # Jupyter notebooks for EDA
Data loading, cleaning, and preprocessing
- Loads California Housing dataset from sklearn
- Handles missing values (median/mean/mode imputation)
- Removes outliers (IQR or z-score methods)
- Applies feature engineering
- Saves data at each pipeline stage
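The IQR outlier step above can be sketched as a small pandas helper. This is an illustrative sketch, not the pipeline's actual implementation; the function name and toy data are assumptions, and `k=1.5` mirrors the `OUTLIER_THRESHOLD` setting:

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose `column` value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Toy frame with one obvious outlier in median_income
toy = pd.DataFrame({"median_income": [2.0, 2.5, 3.0, 3.5, 4.0, 50.0]})
cleaned = remove_outliers_iqr(toy, "median_income")
print(len(cleaned))  # → 5 (the 50.0 row is removed)
```

A z-score variant would replace the quantile bounds with `abs((x - mean) / std) <= threshold`.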
SQLite database operations demonstrating SQL concepts
- Tables: housing (20,390 rows), district_summary (4 rows)
- WHERE filtering: income- and location-based queries
- GROUP BY aggregation: Statistics by income category
- INNER JOIN: Merges housing data with district summaries
- CRUD operations with Python API
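The CRUD operations could be exercised directly with Python's built-in `sqlite3` module. A minimal sketch against an in-memory database (the real pipeline targets `data/housing.db`; the two-column schema here is a simplification):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use "data/housing.db" for the real file
conn.execute("CREATE TABLE housing (median_income REAL, median_house_value REAL)")

# Create
conn.execute("INSERT INTO housing VALUES (?, ?)", (3.2, 210000.0))
# Read: parameterized income-range filter
rows = conn.execute(
    "SELECT * FROM housing WHERE median_income >= ? AND median_income <= ?",
    (2.0, 5.0),
).fetchall()
# Update
conn.execute("UPDATE housing SET median_house_value = ? WHERE median_income = ?",
             (215000.0, 3.2))
# Delete
conn.execute("DELETE FROM housing WHERE median_income = ?", (3.2,))
conn.commit()
conn.close()
print(rows)  # → [(3.2, 210000.0)]
```

Parameterized `?` placeholders keep the queries safe against injection and let SQLite reuse query plans.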
Generates 10 visualization types
- Histogram (price distribution)
- Boxplot (price by income category)
- Scatter (geographic coordinates)
- Correlation heatmap (14×14 matrix)
- Pairplot (key feature relationships)
- Bar chart (mean price by category)
- Violin plot (income distribution)
- Line chart (price trends by age)
- Density plot (multiple features)
- Geographic scatter (California map)
Linear regression training and evaluation
- Uses 11 features (8 original + 3 engineered ratios)
- StandardScaler normalization
- 80/20 train-test split
- Evaluation metrics: RMSE, R², MAE
- Saves model, scaler, and metrics as .pkl files
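The training recipe above (scale → fit → evaluate) can be condensed into a few lines of scikit-learn. This sketch uses a synthetic 11-feature matrix in place of the real data; the split ratio and seed match the pipeline's settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11-column feature matrix
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 11))
y = X @ rng.normal(size=11) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)       # 80/20 split

scaler = StandardScaler().fit(X_train)           # fit scaler on train only
model = LinearRegression().fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_test))
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print(f"RMSE={rmse:.3f}  R2={r2_score(y_test, y_pred):.3f}  "
      f"MAE={mean_absolute_error(y_test, y_pred):.3f}")
```

Fitting the scaler on the training split only avoids leaking test-set statistics into the model.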
Simplified prediction API
- Single and batch predictions
- Automatic feature engineering
- Input validation (geographic bounds, positive values)
- Returns predictions with detailed metadata
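The input-validation step could look like the helper below. The California bounding-box coordinates and the function name are illustrative assumptions, not the interface's actual constants:

```python
# Approximate California bounding box (assumed values for illustration)
CA_LAT = (32.5, 42.0)
CA_LON = (-124.5, -114.0)

def validate_input(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record is usable."""
    errors = []
    if not (CA_LAT[0] <= record.get("latitude", 0) <= CA_LAT[1]):
        errors.append("latitude outside California bounds")
    if not (CA_LON[0] <= record.get("longitude", 0) <= CA_LON[1]):
        errors.append("longitude outside California bounds")
    for field in ("total_rooms", "households", "population", "median_income"):
        if record.get(field, 0) <= 0:
            errors.append(f"{field} must be positive")
    return errors

ok = {"latitude": 37.5, "longitude": -122.0, "total_rooms": 2000,
      "households": 500, "population": 1200, "median_income": 4.1}
bad = dict(ok, latitude=50.0, households=0)
print(validate_input(ok))   # → []
print(validate_input(bad))  # → two errors
```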
┌─────────────────────────────────────────────────────────────────┐
│ STEP 1: Data Processing │
├─────────────────────────────────────────────────────────────────┤
│ sklearn.datasets → Load → Clean → Remove Outliers → │
│ Feature Engineering → data/processed/housing_processed.csv │
│ (20,640 rows → ~20,390 rows after outlier removal) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 2: Database Operations │
├─────────────────────────────────────────────────────────────────┤
│ Create SQLite DB → Create Tables → Insert Data → │
│ Populate Aggregated Summary → data/housing.db (1.7 MB) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 3: Exploratory Data Analysis │
├─────────────────────────────────────────────────────────────────┤
│ Generate 10 Visualizations → reports/figures/*.png │
│ Correlation Analysis → Summary Statistics │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 4: Model Training │
├─────────────────────────────────────────────────────────────────┤
│ Prepare Features (11 columns) → Split Data (80/20) → │
│ StandardScaler → LinearRegression.fit() → │
│ models/{model.pkl, scaler.pkl, metrics.json} │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 5: Model Evaluation & Visualization │
├─────────────────────────────────────────────────────────────────┤
│ Calculate Metrics (RMSE, R², MAE) → │
│ Plot Predictions vs Actual → Plot Residuals → │
│ reports/figures/model_*.png │
└─────────────────────────────────────────────────────────────────┘
Primary data table with 20,390 rows and 14 columns
Original Features (8):
- `longitude`, `latitude`: geographic coordinates
- `housing_median_age`: median age of houses in district
- `total_rooms`, `total_bedrooms`: total counts in district
- `population`, `households`: population statistics
- `median_income`: median household income (×$10,000)
- `median_house_value`: target variable (price in dollars)
Engineered Features (5):
- `rooms_per_household` = total_rooms / households
- `bedrooms_per_room` = total_bedrooms / total_rooms
- `population_per_household` = population / households
- `income_category`: categorical (low/medium/high/very_high)
- `age_category`: categorical (new/medium/old)
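The three ratios follow directly from the formulas above; the two categorical columns are bins over income and age. A minimal sketch (the bin edges here are illustrative assumptions, not the pipeline's exact cut-offs):

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the three ratio features plus the two categorical bins."""
    out = df.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    # Bin edges below are assumed for illustration
    out["income_category"] = pd.cut(
        out["median_income"], bins=[0, 2.5, 4.5, 6.0, float("inf")],
        labels=["low", "medium", "high", "very_high"])
    out["age_category"] = pd.cut(
        out["housing_median_age"], bins=[0, 15, 35, float("inf")],
        labels=["new", "medium", "old"])
    return out

sample = pd.DataFrame({
    "total_rooms": [2000], "total_bedrooms": [400], "households": [500],
    "population": [1200], "median_income": [3.5], "housing_median_age": [20],
})
print(add_engineered_features(sample).iloc[0]["rooms_per_household"])  # → 4.0
```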
Aggregated statistics by income category (4 rows)
Columns:
- `income_category`: category identifier (UNIQUE)
- `avg_house_value`: average price in category
- `avg_rooms`: average rooms per household
- `avg_age`: average housing age
- `district_count`: number of districts in category
WHERE Filtering:
SELECT * FROM housing
WHERE median_income >= ? AND median_income <= ?
GROUP BY Aggregation:
SELECT income_category,
AVG(median_house_value) as avg_price,
COUNT(*) as count
FROM housing
GROUP BY income_category
ORDER BY avg_price DESC
INNER JOIN:
SELECT h.*, ds.avg_house_value, ds.district_count
FROM housing h
INNER JOIN district_summary ds
ON h.income_category = ds.income_category
Interactive 4-page application at http://localhost:8501:
- Home - Project overview and dataset statistics
- Data Exploration - Interactive data table with filters, SQL query demonstrations
- Visualizations - Gallery of 10 EDA plots + 2 model performance plots
- Price Prediction - Interactive prediction interface with input sliders
Features:
- Real-time SQL query execution
- Dynamic data filtering
- Model performance metrics
- Single-house price predictions
- Engineered feature display
- RMSE: ~$68,000-72,000 (typical error in predictions)
- R²: ~0.58-0.62 (model explains 58-62% of variance)
- MAE: ~$48,000-53,000 (average absolute error)
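Each of these metrics can be reproduced from any pair of actual/predicted arrays with scikit-learn. The toy prices below are illustrative, not real model output:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy actual vs. predicted prices, each prediction off by $10,000
y_true = np.array([150000.0, 200000.0, 250000.0, 300000.0])
y_pred = np.array([160000.0, 190000.0, 260000.0, 310000.0])

rmse = mean_squared_error(y_true, y_pred) ** 0.5  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)         # average absolute error
r2 = r2_score(y_true, y_pred)                     # fraction of variance explained
print(rmse, mae, r2)  # → 10000.0 10000.0 0.968
```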
- Median income is the strongest predictor of house prices
- Geographic location (latitude/longitude) significantly impacts prices
- Engineered features (rooms per household, population density) improve model performance
- Linear regression provides an interpretable baseline but only moderate accuracy, due to:
- Non-linear relationships in housing data
- Geographic clustering effects
- Presence of outliers in price distribution
Key settings in src/config.py:
- `TEST_SIZE = 0.2` (80/20 train-test split)
- `RANDOM_STATE = 42` (reproducibility)
- `OUTLIER_THRESHOLD = 1.5` (IQR multiplier)
- `MISSING_VALUE_STRATEGY = 'median'` (imputation method)
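Based on the settings listed above, `src/config.py` could be sketched as follows; the path constants are assumptions inferred from the project layout:

```python
from pathlib import Path

# Directory layout anchored at the repository root (names assumed from the tree above)
PROJECT_ROOT = Path(".")
DATA_DIR = PROJECT_ROOT / "data"
MODELS_DIR = PROJECT_ROOT / "models"
FIGURES_DIR = PROJECT_ROOT / "reports" / "figures"

TEST_SIZE = 0.2                     # 80/20 train-test split
RANDOM_STATE = 42                   # reproducibility seed
OUTLIER_THRESHOLD = 1.5             # IQR multiplier for outlier removal
MISSING_VALUE_STRATEGY = "median"   # imputation: median/mean/mode
```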
# Explore data in Jupyter
jupyter notebook notebooks/
# Run individual components
python src/modeling/train.py # Train model only
python src/modeling/predict.py # Run predictions
# Verify database
python -c "from src.services.database import DatabaseManager; DatabaseManager().verify_connection()"
python pipeline.py      # Execute complete ETL + training pipeline
streamlit run app.py # Launch web interface