This repository was archived by the owner on Jan 28, 2026. It is now read-only.

alisulmanpro/Email-Spam-Detection-Classifier


Email/SMS Spam Detection System (Machine Learning)

A production-ready machine learning system to classify Email/SMS messages as Spam or Ham (Not Spam) using classical NLP and supervised learning techniques.

This project is designed with ML engineering best practices, focusing on clean architecture, reproducibility, and business-driven evaluation.


Problem Statement

Spam messages cause financial loss, security risks, and poor user experience.
The goal of this project is to build a reliable and interpretable spam detection system that:

  • Minimizes false positives (important emails marked as spam)
  • Maximizes spam recall
  • Is fast, explainable, and deployable

Dataset

  • Source: Kaggle – SMS Spam Collection Dataset
  • Type: Text (Unstructured)
  • Classes:
    • Ham → Legitimate messages
    • Spam → Unwanted or malicious messages
  • Challenge: Class imbalance (more ham than spam)

The raw dataset is stored unmodified for reproducibility.


Project Structure

email-spam-detection/
│
├── data/
│   └── raw_dataset.csv
│
├── notebooks/
│   └── eda_and_training.ipynb
│
├── src/
│   ├── preprocessing.py
│   ├── train.py
│   └── evaluate.py
│
├── models/
│   ├── spam_classifier.pkl
│   └── tfidf_vectorizer.pkl
│
├── requirements.txt
├── README.md
└── .gitignore

Approach Overview

1. Text Preprocessing (Minimal, Signal-Preserving)

  • Remove HTML tags and email headers
  • Normalize whitespace
  • Convert text to lowercase
  • Replace URLs and email addresses with tokens
  • Keep punctuation signals (! ? $)
  • Avoid over-cleaning to preserve spam indicators
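The cleaning steps listed above could look roughly like the sketch below. It is not the project's actual `preprocessing.py`: the function name `clean_text`, the placeholder tokens `urltoken` and `emailtoken`, and the regular expressions are all assumptions made for illustration. Email header stripping is left out because the header format the project handles is not described here.

```python
import re

# Hypothetical placeholder tokens; the project's real token names are not documented.
URL_TOKEN = "urltoken"
EMAIL_TOKEN = "emailtoken"

# Opening/closing HTML tags such as <p> or </div>, but not a bare "<3".
_HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
# A URL, leaving trailing punctuation (e.g. "!!!") outside the match.
_URL = re.compile(r"(?:https?://|www\.)\S+?(?=[.,!?;:)]*(?:\s|$))")
# A simple email address pattern (not a full RFC 5322 matcher).
_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lightly normalize a message while keeping punctuation such as ! ? $."""
    text = _HTML_TAG.sub(" ", text)            # drop HTML tags
    text = text.lower()                        # case-fold
    text = _URL.sub(URL_TOKEN, text)           # URLs -> one token
    text = _EMAIL.sub(EMAIL_TOKEN, text)       # email addresses -> one token
    return _WHITESPACE.sub(" ", text).strip()  # collapse whitespace runs
```

Punctuation is deliberately left untouched, so a string like "prize!!!" keeps its exclamation marks for the feature extractor to use.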

2. Feature Engineering

  • TF-IDF Vectorization
  • Word n-grams (1–2)
  • Sublinear TF scaling
  • Rare and overly frequent term filtering
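As a rough illustration, these settings map onto scikit-learn's `TfidfVectorizer` as shown below. The `min_df` and `max_df` values and the custom `token_pattern` are assumptions; the README does not state the cutoffs used, nor how punctuation signals reach the vectorizer. `build_vectorizer` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def build_vectorizer(min_df=2, max_df=0.95):
    """TF-IDF over word unigrams and bigrams with sublinear TF scaling.

    The min_df / max_df defaults are illustrative assumptions, not the
    project's documented values.
    """
    return TfidfVectorizer(
        ngram_range=(1, 2),   # word n-grams (1-2)
        sublinear_tf=True,    # replace tf with 1 + log(tf)
        min_df=min_df,        # drop terms seen in fewer than min_df documents
        max_df=max_df,        # drop terms seen in more than this share of documents
        # Assumption: also emit ! ? $ as tokens so punctuation signals survive;
        # the default token_pattern would discard them.
        token_pattern=r"(?u)\b\w\w+\b|[!?$]",
    )
```

On a very small corpus, `min_df=2` can prune almost every term, so pass `min_df=1` when experimenting with a handful of messages.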

3. Model Selection

  • Logistic Regression
  • Class weighting to handle imbalance
  • Chosen for:
    • Interpretability
    • Speed
    • Industry adoption

4. Decision Threshold Optimization

  • Default threshold adjusted from 0.50 to 0.27
  • Improves spam recall while maintaining high precision
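Applying a custom threshold means comparing the predicted spam probability against it directly instead of calling `predict`, which uses 0.5 for a binary logistic regression. The sketch below assumes a message is flagged when its probability is at or above the threshold; whether the project uses `>=` or `>` is not stated. `label_from_proba` is a hypothetical helper name.

```python
SPAM_THRESHOLD = 0.27  # tuned value reported in this README


def label_from_proba(spam_proba, threshold=SPAM_THRESHOLD):
    """Return 1 (spam) where P(spam) >= threshold, else 0 (ham)."""
    return [int(p >= threshold) for p in spam_proba]


# A message scored at 0.30 is ham under the default 0.5 cutoff
# but spam under the tuned 0.27 cutoff.
scores = [0.10, 0.30, 0.60]
default_labels = label_from_proba(scores, threshold=0.5)
tuned_labels = label_from_proba(scores)
```

With a fitted scikit-learn classifier, the scores would typically come from `model.predict_proba(X)[:, 1]`, assuming spam is the second class.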

Results

Final Performance (Threshold = 0.27)

Metric       Ham     Spam
Precision    1.00    0.94
Recall       0.99    0.97
F1-Score     0.99    0.95

Accuracy: 99%
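As a quick consistency check, the F1 scores can be recomputed from the reported precision and recall using F1 = 2PR / (P + R). The inputs are the rounded figures from the table, so this only confirms agreement to two decimals; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


spam_f1 = f1_from(0.94, 0.97)
ham_f1 = f1_from(1.00, 0.99)
print(f"spam F1 = {spam_f1:.2f}, ham F1 = {ham_f1:.2f}")
# prints: spam F1 = 0.95, ham F1 = 0.99
```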

Evaluation Artifacts

Confusion Matrix & Classification Report


How to Run the Project

1. Install Dependencies

pip install -r requirements.txt

2. Train the Model

python src/train.py

3. Evaluate the Model

python src/evaluate.py

Key Engineering Decisions

  • Avoided deep learning due to small dataset size

  • Focused on feature engineering over model complexity

  • Explicit threshold tuning based on business tradeoffs

  • Clean separation between experimentation and production code
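One way to make threshold tuning explicit is to state the business constraint as a minimum acceptable spam precision, then pick the threshold with the highest recall that still satisfies it. This README does not say how 0.27 was actually chosen, so the sketch below is only one plausible procedure; `pick_threshold` and the `min_precision` default are assumptions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, spam_proba, min_precision=0.90):
    """Return the threshold with the best recall among those meeting min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, spam_proba)
    # The final (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    eligible = np.flatnonzero(precision >= min_precision)
    if eligible.size == 0:
        raise ValueError("no threshold reaches the requested precision")
    best = eligible[np.argmax(recall[eligible])]
    return float(thresholds[best])
```

The threshold should be chosen on a held-out validation split rather than the test set, so the reported metrics stay unbiased.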

Model Artifacts

Saved in models/:

  • spam_classifier.pkl → Trained Logistic Regression model

  • tfidf_vectorizer.pkl → TF-IDF feature transformer
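The two artifacts could be loaded and used for inference along these lines. This assumes the files were written with Python's `pickle` module (they may equally have been saved with `joblib`) and that spam is encoded as label 1; `load_artifacts` and `spam_probability` are hypothetical helper names. Unpickling also requires scikit-learn to be installed, ideally in the same version used for training.

```python
import pickle
from pathlib import Path


def load_artifacts(models_dir="models"):
    """Load the vectorizer and classifier. Only unpickle files you trust."""
    models_dir = Path(models_dir)
    with open(models_dir / "tfidf_vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)
    with open(models_dir / "spam_classifier.pkl", "rb") as f:
        classifier = pickle.load(f)
    return vectorizer, classifier


def spam_probability(texts, vectorizer, classifier):
    """Return P(spam) for each text, assuming the spam class is labelled 1."""
    features = vectorizer.transform(texts)
    spam_column = list(classifier.classes_).index(1)
    return classifier.predict_proba(features)[:, spam_column]
```

The returned probabilities can then be compared against the tuned threshold of 0.27 to produce the final spam or ham label.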

Demo

Vɪᴅᴇᴏ Dᴇᴍᴏ


Author

Ali Sulman
Machine Learning Engineer (Aspirant)
Focused on applied ML, NLP, and production-ready systems.

License

This project is open for educational and portfolio use.



About

End-to-end Email Spam Detection system using Machine Learning and NLP, featuring TF-IDF, Logistic Regression, threshold optimization, and a FastAPI-based real-time inference API.
