End-to-end ML pipeline · Random Forest vs Logistic Regression · ~84% Accuracy · Production-ready design
Predicts daily rainfall occurrence in Melbourne from historical meteorological data, with rigorous data leakage prevention, seasonal feature engineering, and a deployable Scikit-learn pipeline that reaches ~84% accuracy.
Rainfall prediction isn't a clean Kaggle exercise. Real meteorological data comes with missing values, class imbalance, geographic variability, and, critically, data leakage risk from how the target variable is defined. This project addresses all of these explicitly.
| Model | Accuracy | Recall (Rain Events) |
|---|---|---|
| Random Forest | ~84% | Good overall |
| Logistic Regression | ~84% | Better: superior minority-class recall |
Key insight: In rainfall prediction, missing an actual rain event (false negative) costs more than a false alarm. Logistic Regression's higher recall for the minority class makes it the preferred model for operational use; accuracy alone doesn't tell the full story.
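A small worked example shows how two models can tie on accuracy while differing sharply on recall. The confusion-matrix counts below are hypothetical, chosen only so that both models land on 84% accuracy; they are not the project's measured results.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all days classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Share of actual rain days that the model caught."""
    return tp / (tp + fn)


# Hypothetical counts for 1,000 test days, 220 of which were rainy.
model_a = {"tp": 100, "fp": 40, "fn": 120, "tn": 740}  # few false alarms, many missed rain days
model_b = {"tp": 140, "fp": 80, "fn": 80, "tn": 700}   # more false alarms, fewer missed rain days

for name, counts in (("model_a", model_a), ("model_b", model_b)):
    acc = accuracy(**counts)
    rec = recall(counts["tp"], counts["fn"])
    print(f"{name}: accuracy={acc:.2f} recall={rec:.2f}")
```

Both hypothetical models score 0.84 on accuracy, yet the second one misses 40 fewer rain days, which is the difference that matters when false negatives are the costly error.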
Most influential features: Humidity-related variables and engineered seasonal features.
Redefined the prediction target so that same-day rainfall measurements are never used as input features. This kind of leakage is one of the most common production ML mistakes: it inflates test accuracy but fails in deployment.
Extracted cyclical seasonal signals from raw date fields, capturing weather patterns (wet/dry seasons) that the raw numerical features don't expose directly.
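One way to keep same-day measurements out of the inputs is to pair each day's label with the previous day's observations. The sketch below is an illustration under assumptions, not necessarily how the notebook does it: the column names (`Date`, `Location`, `Rainfall`, `Humidity3pm`), the 1 mm rain threshold, and the `_prev_day` suffix are all assumed for the example.

```python
import pandas as pd

RAIN_THRESHOLD_MM = 1.0  # assumed cut-off for counting a day as rainy

# Toy data standing in for the real observations (values are made up).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"]),
    "Location": ["Melbourne"] * 4,
    "Humidity3pm": [40.0, 85.0, 60.0, 90.0],
    "Rainfall": [0.0, 5.2, 0.0, 12.4],
})

df = df.sort_values(["Location", "Date"]).reset_index(drop=True)

# Label: did it rain on this day?
df["RainToday"] = (df["Rainfall"] > RAIN_THRESHOLD_MM).astype(int)

# Features: only measurements from the previous day at the same location.
feature_cols = ["Humidity3pm", "Rainfall"]
lagged = df.groupby("Location")[feature_cols].shift(1).add_suffix("_prev_day")

# The first day per location has no previous day, so it is dropped.
model_df = pd.concat([df[["Date", "Location", "RainToday"]], lagged], axis=1).dropna()
print(model_df)
```

After this step no column in `model_df` is measured on the same day as the label, so the model cannot learn from information that would be unavailable at prediction time.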
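A common way to encode a date cyclically is a sine/cosine pair over the day of year, optionally alongside a categorical season. The sketch below assumes a `Date` column; the helper name `add_seasonal_features` and the output column names are hypothetical, and the notebook may encode seasonality differently.

```python
import numpy as np
import pandas as pd

# Southern Hemisphere meteorological seasons by month number.
MONTH_TO_SEASON = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}


def add_seasonal_features(df, date_col="Date"):
    """Return a copy of df with cyclical day-of-year features and a season label."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    # Map the day of year onto a circle so 31 December sits next to 1 January.
    angle = 2 * np.pi * (dates.dt.dayofyear - 1) / 365.25
    out["doy_sin"] = np.sin(angle)
    out["doy_cos"] = np.cos(angle)
    out["Season"] = dates.dt.month.map(MONTH_TO_SEASON)
    return out


seasonal = add_seasonal_features(pd.DataFrame({"Date": ["2017-01-01", "2017-07-15"]}))
print(seasonal)
```

The sine/cosine pair avoids the artificial jump a raw month number has between December (12) and January (1).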
Restricted analysis to geographically close locations (Melbourne, Melbourne Airport, Watsonia) to reduce variability from unrelated climate zones in the national dataset.
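The location filter can be sketched as a simple pandas mask. The `Location` column name and the helper `filter_locations` are assumptions for illustration; because the raw file may spell a station name without a space (for example `MelbourneAirport`), the sketch compares names with spaces and case removed.

```python
import pandas as pd

TARGET_LOCATIONS = ["Melbourne", "Melbourne Airport", "Watsonia"]


def normalize(name):
    """Compare station names without regard to spaces or case."""
    return name.replace(" ", "").lower()


def filter_locations(df, locations=TARGET_LOCATIONS, location_col="Location"):
    """Keep only rows whose location is one of the target stations."""
    wanted = {normalize(loc) for loc in locations}
    mask = df[location_col].map(normalize).isin(wanted)
    return df.loc[mask].reset_index(drop=True)


# Toy frame with made-up humidity values.
weather = pd.DataFrame({
    "Location": ["Melbourne", "Sydney", "MelbourneAirport", "Darwin", "Watsonia"],
    "Humidity3pm": [55.0, 60.0, 48.0, 80.0, 52.0],
})
melbourne_area = filter_locations(weather)
print(melbourne_area)
```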
Built Scikit-learn Pipelines that combine preprocessing and model in a single serializable object, rather than notebook-style step-by-step code.
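As a minimal sketch of what such a single serializable object can look like, the example below wraps imputation, scaling, one-hot encoding, and a classifier in one `Pipeline`, then round-trips it through `pickle`. The feature names, the toy data, and the specific preprocessing steps are assumptions for illustration and may differ from the notebook.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["Humidity3pm", "Pressure3pm"]  # assumed feature names
categorical_cols = ["Season"]                  # assumed engineered feature

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardize.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical columns: one-hot encode, ignoring categories unseen in training.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy training data with a few missing numeric values (all values made up).
X = pd.DataFrame({
    "Humidity3pm": [35.0, 80.0, np.nan, 90.0, 40.0, 85.0, 30.0, 75.0],
    "Pressure3pm": [1020.0, 1005.0, 1018.0, np.nan, 1022.0, 1003.0, 1025.0, 1008.0],
    "Season": ["Summer", "Winter", "Summer", "Winter", "Autumn", "Spring", "Summer", "Winter"],
})
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

model.fit(X, y)

# The whole pipeline, preprocessing included, serializes as one object.
restored = pickle.loads(pickle.dumps(model))
predictions = restored.predict(X)
```

Because the fitted imputers, scaler, and encoder travel with the classifier, the restored object can score raw rows directly without re-running any notebook cells.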
| Property | Value |
|---|---|
| Source | Australian Bureau of Meteorology (BOM) + Kaggle Rattle Package |
| Coverage | Australia, 2008–2017 |
| Target Locations | Melbourne · Melbourne Airport · Watsonia |
| Task | Binary classification: Rain tomorrow (Yes / No) |
- Robust to feature interactions and non-linear relationships
- Hyperparameter tuning via GridSearchCV
- Feature importance analysis: identifies humidity and seasonal features as top predictors
- Interpretable baseline: clear coefficient attribution
- Better recall on minority class (actual rain days)
- Preferred model for operational rainfall prediction
Both models are trained through the same preprocessing and modeling pipeline.
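The sketch below shows one way to tune both models with `GridSearchCV` on an identical train/test split. Everything specific in it is an assumption for illustration: the synthetic data from `make_classification` stands in for the weather features, and the parameter grids, the grid for Logistic Regression, the 3-fold stratified CV, and the `recall` scoring choice are not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 22% positive class).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.78, 0.22], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def make_pipeline_for(classifier):
    """Same preprocessing for every model; only the final step changes."""
    return Pipeline([("scale", StandardScaler()), ("clf", classifier)])


# Illustrative grids only; the notebook's actual grids are not shown in this README.
candidates = {
    "random_forest": (
        make_pipeline_for(RandomForestClassifier(random_state=42)),
        {"clf__n_estimators": [50, 100], "clf__max_depth": [None, 5]},
    ),
    "logistic_regression": (
        make_pipeline_for(LogisticRegression(max_iter=1000)),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fitted = {}
for name, (pipeline, grid) in candidates.items():
    # Scoring on recall reflects the cost of missed rain days (an assumed choice).
    search = GridSearchCV(pipeline, grid, cv=cv, scoring="recall")
    search.fit(X_train, y_train)
    fitted[name] = search
    print(name, search.best_params_)
```

Tuning the whole pipeline, rather than the bare classifier, means the scaler is refit inside each CV fold, so no statistics from validation rows leak into training.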
- Accuracy, Precision, Recall, F1-score
- Confusion Matrix: explicit false negative analysis
- Feature Importance (Random Forest)
- Model comparison on the same train/test split for fair benchmarking
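The metrics listed above can be collected in one helper built on `sklearn.metrics`. This is a sketch: the function name `evaluate`, the assumption that rain is encoded as label `1`, and the toy labels are illustrative and are not the project's results.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)


def evaluate(y_true, y_pred):
    """Summarize a binary classifier, treating label 1 (rain) as the positive class."""
    # With labels=[0, 1], ravel() yields counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1, zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_negatives": int(fn),  # rain days the model missed
        "false_alarms": int(fp),     # dry days predicted as rain
    }


# Toy labels: 3 rain days out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
report = evaluate(y_true, y_pred)
print(report)
```

Calling the same helper on each model's predictions for the shared test split keeps the comparison like-for-like and puts the false negative count next to the headline accuracy.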
| Tool | Purpose |
|---|---|
| Python | Core language |
| Pandas + NumPy | Data cleaning, feature engineering |
| Scikit-learn | Pipelines, GridSearchCV, models, evaluation |
| Matplotlib + Seaborn | EDA and results visualization |
# Clone the repository
git clone https://github.com/amarskdev/rainfall-prediction-melbourne.git
cd rainfall-prediction-melbourne
# Install dependencies
pip install -r requirements.txt
# Run the notebook
jupyter notebook rainfall_prediction_melbourne.ipynb
rainfall-prediction-melbourne/
│
├── rainfall_prediction_melbourne.ipynb   # Full pipeline: EDA → features → models → evaluation
├── requirements.txt                      # Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
└── README.md
- XGBoost / LightGBM comparison
- SMOTE for class imbalance handling
- Time-series cross-validation (prevent temporal leakage)
- Probability calibration for confidence-aware predictions
- FastAPI deployment for real-time inference
Built with production ML principles: leakage prevention, deployable pipelines, and metric selection driven by the real-world cost of errors.