A production-ready machine learning system to classify Email/SMS messages as Spam or Ham (Not Spam) using classical NLP and supervised learning techniques.
This project is designed with ML engineering best practices, focusing on clean architecture, reproducibility, and business-driven evaluation.
Spam messages cause financial loss, security risks, and poor user experience.
The goal of this project is to build a reliable and interpretable spam detection system that:
- Minimizes false positives (important emails marked as spam)
- Maximizes spam recall
- Is fast, explainable, and deployable
- Source: Kaggle – SMS Spam Collection Dataset
- Type: Text (Unstructured)
- Classes:
  - Ham → Legitimate messages
  - Spam → Unwanted or malicious messages
- Challenge: Class imbalance (more ham than spam)
The raw dataset is stored without modification for reproducibility.
```
email-spam-detection/
│
├── data/
│   └── raw_dataset.csv
│
├── notebooks/
│   └── eda_and_training.ipynb
│
├── src/
│   ├── preprocessing.py
│   ├── train.py
│   └── evaluate.py
│
├── models/
│   ├── spam_classifier.pkl
│   └── tfidf_vectorizer.pkl
│
├── requirements.txt
├── README.md
└── .gitignore
```

- Remove HTML tags and email headers
- Normalize whitespace
- Convert text to lowercase
- Replace URLs and email addresses with tokens
- Keep punctuation signals (`!`, `?`, `$`)
- Avoid over-cleaning to preserve spam indicators
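The cleaning steps above could be sketched as a single function. The regex patterns and the `url`/`emailaddr` placeholder tokens are illustrative assumptions, not the project's exact implementation in `preprocessing.py`:

```python
import re

def clean_text(text: str) -> str:
    """Normalize a raw message while keeping spam-indicative punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = text.lower()                                 # lowercase
    text = re.sub(r"\S+@\S+", " emailaddr ", text)      # replace email addresses
    text = re.sub(r"http\S+|www\.\S+", " url ", text)   # replace URLs
    text = re.sub(r"[^a-z0-9!?$\s]", " ", text)         # keep !, ?, $ signals
    return re.sub(r"\s+", " ", text).strip()            # normalize whitespace
```

Note the order: tokens are substituted before the character filter so they survive it, and `!`, `?`, `$` are deliberately whitelisted because they are strong spam indicators.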
- TF-IDF Vectorization
- Word n-grams (1–2)
- Sublinear TF scaling
- Rare and overly frequent term filtering
- Logistic Regression
- Class weighting to handle imbalance
- Chosen for:
- Interpretability
- Speed
- Industry adoption
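The model choice above can be sketched as follows, with a toy corpus standing in for the Kaggle dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus for illustration; the real pipeline trains on the full dataset.
texts = ["free prize call now", "win cash now", "lunch at noon?", "see you tomorrow"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# class_weight="balanced" reweights each class inversely to its frequency,
# countering the ham-heavy class imbalance noted above.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, labels)
```

Because logistic regression is linear, each TF-IDF term's learned coefficient directly indicates how strongly it pushes a message toward spam or ham, which is what makes the model interpretable.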
- Default threshold adjusted from 0.50 → 0.27
- Improves spam recall while maintaining high precision
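Applying the tuned threshold amounts to thresholding the predicted spam probability directly instead of calling `predict`; a minimal sketch:

```python
import numpy as np

def predict_with_threshold(model, X, threshold=0.27):
    """Flag a message as spam when P(spam) >= threshold, instead of 0.50."""
    proba = model.predict_proba(X)[:, 1]  # column 1 = spam-class probability
    return (proba >= threshold).astype(int)
```

Lowering the threshold trades a small amount of precision for recall: borderline messages with spam probability between 0.27 and 0.50 are now caught.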
| Metric | Ham | Spam |
|---|---|---|
| Precision | 1.00 | 0.94 |
| Recall | 0.99 | 0.97 |
| F1-Score | 0.99 | 0.95 |

Overall accuracy: 99%
```bash
pip install -r requirements.txt
python src/train.py
python src/evaluate.py
```

- Avoided deep learning due to small dataset size
- Focused on feature engineering over model complexity
- Explicit threshold tuning based on business tradeoffs
- Clean separation between experimentation and production code
Saved in `models/`:

- `spam_classifier.pkl` → Trained Logistic Regression model
- `tfidf_vectorizer.pkl` → TF-IDF feature transformer
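A minimal round-trip sketch of persisting and reloading these two artifacts with `joblib` (a toy fit stands in for the output of `train.py`, and the filenames here omit the `models/` prefix for runnability):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy fit so the round-trip is runnable; real artifacts come from train.py.
vec = TfidfVectorizer().fit(["free cash now", "lunch at noon"])
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(
    vec.transform(["free cash now", "lunch at noon"]), [1, 0]
)
joblib.dump(clf, "spam_classifier.pkl")
joblib.dump(vec, "tfidf_vectorizer.pkl")

# Later, e.g. at inference time: reload both artifacts and classify.
model = joblib.load("spam_classifier.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")
label = model.predict(vectorizer.transform(["free cash"]))[0]
```

Persisting the vectorizer alongside the model matters: inference must map text into exactly the same TF-IDF feature space the classifier was trained on.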
Ali Sulman
Machine Learning Engineer (Aspirant)
Focused on applied ML, NLP, and production-ready systems.
This project is open for educational and portfolio use.
- Looks professional on GitHub
- Easy to explain in interviews
- Shows engineering maturity
- Ready for deployment extension