Water Quality Classification using Machine Learning

Project Overview

Water quality monitoring is essential for protecting public health and maintaining environmental sustainability. Government agencies classify water bodies into designated use categories such as Class A (drinking water source after disinfection), Class B (outdoor bathing), Class C (drinking water after conventional treatment), and Class E (irrigation and industrial use).

The objective of this project is to build a machine learning classification model that predicts whether a water sample belongs to High Quality Water (Class A) or Lower Quality Water (Classes B, C, and E) using physicochemical and biological parameters.

The goal is to create a stable, interpretable, and generalizable predictive system that can assist in efficient water quality monitoring and public health decision-making.

Dataset Information

The dataset was collected under the National Water Monitoring Programme in Maharashtra.

Original dataset size: 444 rows and 54 columns
Final modeling dataset: 342 samples
Final selected features: 17 important environmental parameters

The dataset includes variables such as Dissolved Oxygen, BOD, COD, Conductivity, Turbidity, Total Coliform, Fecal Coliform, Total Alkalinity, Sulphate, Nitrate Nitrogen, Flow, and others.

Severe class imbalance was observed. The class counts below sum to 342, matching the final modeling dataset:

Class A: 288 samples
Class B: 10 samples
Class C: 11 samples
Class E: 33 samples

Because Classes B and C had very few samples, the multiclass classification problem was reformulated into a binary classification task:

0 – High Quality Water (Class A)
1 – Lower Quality Water (Classes B, C, E)

This reformulation gives the minority class 54 samples (about 16% of the data) instead of as few as 10 per class, which makes training and evaluation more stable.
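The relabeling can be sketched in a few lines. This is a minimal illustration that rebuilds the label list from the class counts reported above; the helper name `to_binary` is hypothetical and not taken from the notebook.

```python
from collections import Counter

# Class counts as reported in this README.
CLASS_COUNTS = {"A": 288, "B": 10, "C": 11, "E": 33}


def to_binary(label: str) -> int:
    """Return 0 for Class A (high quality) and 1 for any other class."""
    return 0 if label == "A" else 1


# Expand the counts into one label per sample, then relabel.
labels = [cls for cls, n in CLASS_COUNTS.items() for _ in range(n)]
binary_counts = Counter(to_binary(lbl) for lbl in labels)

print(binary_counts[0], binary_counts[1])        # 288 54
print(round(binary_counts[1] / len(labels), 3))  # 0.158
```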

Project Workflow

1. Data Understanding and Exploratory Analysis

Checked duplicates and missing values
Removed columns with extremely high missing percentages
Converted numeric columns to proper numeric types and cleaned BDL (below detection limit) values
Performed univariate, bivariate, and multivariate analysis
Identified strong relationships such as the negative correlation between Dissolved Oxygen and BOD
Conducted hypothesis testing using a one-sample t-test to validate environmental assumptions
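The BDL cleaning step could look roughly like the sketch below. The README does not say how BDL entries were handled, so mapping them to missing values is an assumption here, as are the helper name `clean_numeric` and the sample column.

```python
import pandas as pd


def clean_numeric(series: pd.Series) -> pd.Series:
    """Convert a raw text column to floats, mapping BDL entries to NaN."""
    text = series.astype(str).str.strip()
    # Assumption: any entry containing "BDL" marks a below-detection-limit reading.
    is_bdl = text.str.contains("BDL", case=False)
    text = text.where(~is_bdl)
    # Any other non-numeric entry (e.g. "NA") also becomes NaN.
    return pd.to_numeric(text, errors="coerce")


raw = pd.Series(["6.8", " 7.1 ", "BDL", "NA", "5.4"], name="dissolved_oxygen")
cleaned = clean_numeric(raw)
print(cleaned.tolist())  # [6.8, 7.1, nan, nan, 5.4]
```

Values left as NaN this way can then be filled by the imputation step in the preprocessing stage.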

2. Data Preprocessing and Feature Engineering

Applied median and mode imputation for missing values
Used log transformation to reduce skewness in chemical parameters
Created new features such as Pollution Index and BOD/DO Ratio
Performed one-hot encoding for categorical variables
Applied an 80–20 stratified train-test split to preserve class distribution
Removed constant features and highly correlated features (correlation > 0.85)
Selected the top 17 most important features using Random Forest feature importance
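Several of the steps above can be sketched on synthetic data. This is an illustration under stated assumptions, not the notebook's code: the column names are made up, the BOD/DO ratio is computed on raw values with a small epsilon, and the correlation filter is fitted on the training split only. Imputation, one-hot encoding, and the Pollution Index are omitted because the README does not define them in detail.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 342

# Synthetic stand-in data; column names are illustrative, not the real schema.
X = pd.DataFrame({
    "bod": rng.lognormal(1.0, 0.6, n),
    "do": rng.normal(6.0, 1.0, n).clip(0.5),
    "conductivity": rng.lognormal(5.0, 0.8, n),
    "turbidity": rng.lognormal(1.5, 0.7, n),
})
X["cod"] = 2.5 * X["bod"] + rng.normal(0.0, 0.1, n)  # nearly collinear with bod
y = pd.Series((rng.random(n) < 0.16).astype(int), name="target")

# Ratio feature on raw values (assumption); epsilon guards against DO = 0.
X["bod_do_ratio"] = X["bod"] / (X["do"] + 1e-6)

# log1p reduces right skew and is defined at zero.
skewed = ["bod", "cod", "conductivity", "turbidity"]
X[skewed] = np.log1p(X[skewed])

# 80-20 stratified split keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Drop constant columns, then one column of each pair with |r| > 0.85.
constant = [c for c in X_train.columns if X_train[c].nunique() <= 1]
corr = X_train.drop(columns=constant).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
correlated = [c for c in upper.columns if (upper[c] > 0.85).any()]
to_drop = constant + correlated
X_train = X_train.drop(columns=to_drop)
X_test = X_test.drop(columns=to_drop)

# Rank the remaining features by Random Forest importance and keep the top k.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
importance = pd.Series(rf.feature_importances_, index=X_train.columns)
k = min(17, X_train.shape[1])
top_k = importance.nlargest(k).index.tolist()
print("dropped:", to_drop)
print("selected:", top_k)
```

Fitting the filter and the importance ranking on the training split keeps information from the test set out of feature selection.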

3. Model Training and Evaluation

Multiple models were tested:

Logistic Regression
Random Forest
XGBoost

The final selected model was XGBoost (default configuration), as it provided the best balance between overall performance and minority class detection.
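A comparison of this kind might be set up as in the sketch below. It runs on synthetic imbalanced data from `make_classification`, so the scores it prints say nothing about the real dataset. XGBoost is added with default settings only if the `xgboost` package is installed; the metric choices mirror the ones reported in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the 342-sample, 17-feature dataset.
X, y = make_classification(
    n_samples=342, n_features=17, n_informative=6,
    weights=[0.84, 0.16], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}
try:
    from xgboost import XGBClassifier  # optional dependency
    models["XGBoost"] = XGBClassifier()  # default configuration
except ImportError:
    pass  # compare only the scikit-learn models

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "macro_f1": f1_score(y_test, pred, average="macro"),
        "minority_recall": recall_score(y_test, pred, pos_label=1),
    }

for name, scores in results.items():
    print(f"{name}: macro F1={scores['macro_f1']:.2f}, "
          f"minority recall={scores['minority_recall']:.2f}")
```

Reporting minority recall alongside macro F1 matters here because a model that always predicts Class A would still reach about 84% accuracy.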

Test performance:

Accuracy: 0.97
Macro F1-score: 0.94
ROC-AUC: 0.95
Minority class recall: 0.82
Minority class precision: 1.00

The small gap between training and testing performance indicated limited overfitting and good generalization.
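As a sanity check, the reported test metrics are mutually consistent with one plausible confusion matrix. The sketch assumes the 80-20 stratified split of 342 samples yields 69 test samples, 11 of them from the minority class. The confusion matrix itself is a reconstruction for illustration and is not taken from the notebook.

```python
# Hypothetical confusion matrix consistent with the reported metrics.
tn, fp = 58, 0  # Class A samples: none flagged as lower quality
fn, tp = 2, 9   # lower-quality samples: 9 of 11 detected
total = tn + fp + fn + tp

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

accuracy = (tp + tn) / total
macro_f1 = (f1_0 + f1_1) / 2

print(round(accuracy, 2))     # 0.97
print(round(macro_f1, 2))     # 0.94
print(round(recall_1, 2))     # 0.82
print(round(precision_1, 2))  # 1.0
```

Under this assumption each missed minority sample moves recall by about 9 percentage points, so the minority-class figures rest on very few test samples.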

Model Explainability

Model interpretability was examined using:

XGBoost feature importance (gain method)
SHAP analysis

SHAP analysis showed that the model relies on meaningful environmental indicators such as Flow, Total Alkalinity, Turbidity, Sulphate, and Dissolved Oxygen.

This suggests the model captures realistic environmental relationships rather than spurious patterns.

Real-World Testing

The trained model was saved using joblib and tested on unseen samples resembling real-world measurements.

It correctly classified both high-quality and lower-quality water cases, which supports its practical usability.

The model does not replace laboratory testing but acts as a decision-support and screening tool to prioritize potentially risky water samples.
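The save, reload, and screening flow might look like the sketch below. A scikit-learn `LogisticRegression` trained on random data stands in for the project's XGBoost model, and the file name `model.pkl` is hypothetical. Ranking by predicted probability is one possible way to prioritize samples, not a method confirmed by the notebook.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in training data and model (the project itself uses XGBoost).
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0.8).astype(int)
model = LogisticRegression().fit(X, y)

# Persist the fitted model, then load it back as a deployment would.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# Score new samples and rank them from highest to lowest predicted risk.
new_samples = rng.normal(size=(5, 3))
risk = restored.predict_proba(new_samples)[:, 1]
priority = np.argsort(-risk)

for idx in priority:
    print(f"sample {idx}: P(lower quality) = {risk[idx]:.2f}")
```

Samples at the top of the ranking would be the first candidates for laboratory confirmation.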

Business Impact

This system can support environmental agencies by:

Identifying potentially lower-quality water samples early
Prioritizing inspection and monitoring resources
Improving response time in water quality management
Supporting data-driven environmental decision-making

Technologies Used

Python
Pandas and NumPy
Scikit-learn
XGBoost
SHAP
Matplotlib and Seaborn

Future Improvements

Increase dataset size for full multiclass modeling
Include seasonal and time-based patterns
Deploy the model through an API for real-time predictions
Expand to additional geographical regions

Author

Vimal
Machine Learning and Data Science Enthusiast
