🌧️ Rainfall Prediction in Melbourne Using Machine Learning

End-to-end ML pipeline · Random Forest vs Logistic Regression · ~84% Accuracy · Production-ready design

Predicts daily rainfall occurrence in Melbourne from historical meteorological data, with explicit data leakage prevention, seasonal feature engineering, and a deployable Scikit-learn pipeline that reaches ~84% accuracy.


🎯 Why This Problem Is Hard

Rainfall prediction isn't a clean Kaggle exercise. Real meteorological data comes with missing values, class imbalance, geographic variability, and, critically, a risk of data leakage from how the target variable is defined. This project addresses each of these explicitly.


📊 Results

| Model | Accuracy | Recall (Rain Events) |
|---|---|---|
| Random Forest | ~84% | Good overall |
| Logistic Regression | ~84% | Better (superior minority-class recall) |

Key insight: In rainfall prediction, missing an actual rain event (false negative) costs more than a false alarm. Logistic Regression's higher recall for the minority class makes it the preferred model for operational use; accuracy alone doesn't tell the full story.
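To make the accuracy-versus-recall point concrete, here is a small illustration. The labels and the two prediction vectors are invented for the example and are not results from this project; they only show how two models can tie on accuracy while differing sharply in how many rain days they catch.

```python
from sklearn.metrics import accuracy_score, recall_score

# Ten hypothetical days: 1 = rain, 0 = no rain (three rain days in total).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical model A: no false alarms, but it misses two of the three rain days.
pred_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical model B: two false alarms, but it catches every rain day.
pred_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

accuracy_a = accuracy_score(y_true, pred_a)
accuracy_b = accuracy_score(y_true, pred_b)
recall_a = recall_score(y_true, pred_a)  # share of real rain days that were flagged
recall_b = recall_score(y_true, pred_b)

print(f"A: accuracy={accuracy_a:.2f}, recall={recall_a:.2f}")
print(f"B: accuracy={accuracy_b:.2f}, recall={recall_b:.2f}")
```

Both hypothetical models get 8 of 10 days right, yet only B flags all three rain days, which is the behaviour that matters when a missed rain event is the costlier error.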

Most influential features: Humidity-related variables and engineered seasonal features.


🔍 Key Engineering Decisions

1. Data Leakage Prevention

Redefined the prediction target so that same-day rainfall measurements are not used as input features. Leaking them is one of the most common production ML mistakes: it inflates test accuracy but fails in deployment.
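One way to set up a leakage-free target is sketched below. It is not necessarily how the notebook does it: the column names `Location`, `Date`, and `RainToday`, and the helper `add_next_day_target`, are assumptions made for illustration. The idea is that every row's label comes from the following calendar day, so nothing measured on the predicted day can appear among the features.

```python
import pandas as pd

def add_next_day_target(df: pd.DataFrame) -> pd.DataFrame:
    """Label each row with the rain flag of the next calendar day.

    Assumed (hypothetical) columns: Location, Date (datetime), RainToday.
    """
    df = df.sort_values(["Location", "Date"]).copy()
    # Look one record ahead within each station.
    next_rain = df.groupby("Location")["RainToday"].shift(-1)
    next_date = df.groupby("Location")["Date"].shift(-1)
    df["RainTomorrow"] = next_rain
    # Drop rows whose next record is missing or not exactly one day later,
    # otherwise the label would describe some other day.
    consecutive = (next_date - df["Date"]) == pd.Timedelta(days=1)
    return df[consecutive].reset_index(drop=True)

weather = pd.DataFrame({
    "Location": ["Melbourne"] * 4,
    "Date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-05"]),
    "RainToday": ["No", "Yes", "No", "Yes"],
})
labelled = add_next_day_target(weather)
print(labelled)
```

In this toy frame only the first two rows survive: 3 January is followed by a gap, and 5 January has no following record.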

2. Seasonal Feature Engineering

Extracted cyclical seasonal signals from raw date fields, capturing weather patterns (wet/dry seasons) that the raw numerical features don't expose directly.
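A minimal sketch of such features follows, assuming a `Date` column; the helper name `add_seasonal_features` and the output column names are hypothetical, and the notebook may encode seasons differently. It adds a Southern Hemisphere season label plus a sine/cosine encoding of the month, so December and January end up adjacent rather than eleven units apart.

```python
import numpy as np
import pandas as pd

# Meteorological seasons for the Southern Hemisphere (month number -> season).
SEASON_BY_MONTH = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}

def add_seasonal_features(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Return a copy of df with a Season label and a cyclical month encoding."""
    out = df.copy()
    month = pd.to_datetime(out[date_col]).dt.month
    out["Season"] = month.map(SEASON_BY_MONTH)
    # Place the 12 months evenly on a circle so the year wraps around.
    angle = 2 * np.pi * (month - 1) / 12
    out["month_sin"] = np.sin(angle)
    out["month_cos"] = np.cos(angle)
    return out

dates = pd.DataFrame({"Date": ["2017-01-15", "2017-07-15", "2017-12-31"]})
seasonal = add_seasonal_features(dates)
print(seasonal)
```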

3. Geographic Filtering

Restricted analysis to geographically close locations (Melbourne, Melbourne Airport, Watsonia) to reduce variability from unrelated climate zones in the national dataset.
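The filter itself can be as small as the sketch below. The station names are copied from this README and the `Location` column name is an assumption; the raw file may spell the stations differently (for example without spaces), so check the actual values before relying on it.

```python
import pandas as pd

# Station names as written in this README; treat the exact spelling as an assumption.
MELBOURNE_AREA = ("Melbourne", "Melbourne Airport", "Watsonia")

def filter_locations(df, locations=MELBOURNE_AREA, location_col="Location"):
    """Keep only rows recorded at the given stations."""
    mask = df[location_col].isin(locations)
    return df.loc[mask].reset_index(drop=True)

# Tiny made-up frame standing in for the national dataset.
national = pd.DataFrame({
    "Location": ["Melbourne", "Darwin", "Watsonia", "Perth", "Melbourne Airport"],
    "Humidity3pm": [55, 70, 60, 40, 52],
})
local = filter_locations(national)
print(local)
```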

4. Deployable Pipelines

Built Scikit-learn Pipelines that combine preprocessing and model in a single serializable object, rather than notebook-style step-by-step code.
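The sketch below shows the general shape of such a pipeline. The preprocessing choices (median imputation, scaling, one-hot encoding), the feature names, the helper `build_pipeline`, and the file name are all assumptions for illustration, not the notebook's exact configuration. The point is that one fitted object carries both preprocessing and model through serialization.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(numeric_cols, categorical_cols, estimator):
    """Bundle imputation, scaling, encoding, and the model into one object."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", estimator)])

# Made-up rows; the None in the numeric column becomes NaN and is imputed.
X = pd.DataFrame({
    "Humidity3pm": [80.0, 30.0, None, 25.0, 90.0, 40.0],
    "Season": ["Winter", "Summer", "Winter", "Summer", "Winter", "Summer"],
})
y = ["Yes", "No", "Yes", "No", "Yes", "No"]

pipeline = build_pipeline(["Humidity3pm"], ["Season"], LogisticRegression(max_iter=1000))
pipeline.fit(X, y)

# Round-trip through disk: the restored object needs no separate preprocessing code.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "rain_pipeline.joblib"
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

same_predictions = list(restored.predict(X)) == list(pipeline.predict(X))
print(same_predictions)
```

Because the transformers are fitted inside the pipeline, their statistics are learned only from the data passed to `fit`, which also keeps test-set information out of preprocessing.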


🗂 Dataset

| Property | Value |
|---|---|
| Source | Australian Bureau of Meteorology (BOM) + Kaggle Rattle Package |
| Coverage | Australia, 2008–2017 |
| Target Locations | Melbourne · Melbourne Airport · Watsonia |
| Task | Binary classification (Rain tomorrow: Yes / No) |

🧩 Modeling Approach

Random Forest Classifier

  • Robust to feature interactions and non-linear relationships
  • Hyperparameter tuning via GridSearchCV
  • Feature importance analysis identifies humidity and seasonal features as top predictors
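The tuning and importance steps above can be sketched as follows. The synthetic data from `make_classification`, the parameter grid, the `f1` scoring choice, and the 3-fold setup are all assumptions chosen to keep the example fast; the notebook's actual grid is not stated in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the weather features (about 25% positives).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=42
)

# Hypothetical search space, kept small on purpose.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 8],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="f1",  # assumption: a metric that reflects the minority class
)
search.fit(X, y)

# Impurity-based importances of the refitted best model, one value per feature.
importances = search.best_estimator_.feature_importances_
ranking = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
print(search.best_params_)
print(ranking[:3])
```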

Logistic Regression

  • Interpretable baseline with clear coefficient attribution
  • Better recall on minority class (actual rain days)
  • Preferred model for operational rainfall prediction

Both models are trained through the same preprocessing and modeling pipeline.


📏 Evaluation Suite

  • Accuracy, Precision, Recall, F1-score
  • Confusion Matrix with explicit false-negative analysis
  • Feature Importance (Random Forest)
  • Model comparison on the same train/test split for fair benchmarking
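A compact version of this evaluation suite might look like the sketch below. It runs on synthetic data, so the numbers it prints say nothing about this project's results; the helper `evaluate` and the model settings are assumptions. Both models are fitted and scored on one shared split, and false negatives are pulled out of the confusion matrix explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)
from sklearn.model_selection import train_test_split

def evaluate(model, X_test, y_test):
    """Score a fitted binary classifier, treating label 1 as the rain class."""
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary", zero_division=0
    )
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_negatives": int(fn),  # rain days the model missed
        "false_positives": int(fp),  # false alarms
    }

# Synthetic, imbalanced data standing in for the weather features.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=0
)
# One stratified split shared by every model keeps the comparison fair.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
results = {
    name: evaluate(model.fit(X_train, y_train), X_test, y_test)
    for name, model in models.items()
}
for name, metrics in results.items():
    print(name, metrics)
```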

⚙️ Tech Stack

| Tool | Purpose |
|---|---|
| Python | Core language |
| Pandas + NumPy | Data cleaning, feature engineering |
| Scikit-learn | Pipelines, GridSearchCV, models, evaluation |
| Matplotlib + Seaborn | EDA and results visualization |

🚀 Getting Started

# Clone the repository
git clone https://github.com/amarskdev/rainfall-prediction-melbourne.git
cd rainfall-prediction-melbourne

# Install dependencies
pip install -r requirements.txt

# Run the notebook
jupyter notebook rainfall_prediction_melbourne.ipynb

📁 Project Structure

rainfall-prediction-melbourne/
│
├── rainfall_prediction_melbourne.ipynb   # Full pipeline: EDA → features → models → evaluation
├── requirements.txt                      # Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
└── README.md

🔭 Roadmap

  • XGBoost / LightGBM comparison
  • SMOTE for class imbalance handling
  • Time-series cross-validation (prevent temporal leakage)
  • Probability calibration for confidence-aware predictions
  • FastAPI deployment for real-time inference
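For the time-series cross-validation item, scikit-learn's `TimeSeriesSplit` is one possible starting point. This is a sketch of a planned feature, not something the project currently does, and the 20-day index is a placeholder for rows sorted by date. Every fold trains only on days that come before the days it is tested on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder for 20 chronologically ordered daily records.
day_index = np.arange(20)

splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(day_index))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # Training days always precede test days, so no future data leaks backwards.
    print(f"fold {fold}: train up to day {train_idx.max()}, "
          f"test days {test_idx.min()} to {test_idx.max()}")
```

The splitter can be passed as the `cv` argument of `GridSearchCV` in place of a shuffled K-fold.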

Built with production ML principles: leakage prevention, deployable pipelines, and metric selection driven by the real-world cost of errors.


🤝 Connect With Me

👀 About the Author

Amar Kumar
Senior Backend Engineer · IBM Certified AI Engineer

LinkedIn GitHub Gmail LeetCode Instagram Credly

If you found this project useful, consider giving it a ⭐. It means a lot!
