MuhammadUsman-Khan/Spam-Mail-Prediction

📧 Spam vs Ham Email Classification using Machine Learning

This project builds a machine learning model to classify emails as spam or ham (legitimate) using Natural Language Processing (NLP) and Logistic Regression.


🔍 Problem Statement

Email spam detection is an essential task for filtering unwanted emails and improving user experience. This project aims to classify email messages into two categories:

  • Ham (Not Spam) (label = 1)

  • Spam (label = 0)


📁 Dataset

Source: Kaggle SMS Spam Collection Dataset

File Used: mail_data.csv

Features:

  • Category: Spam or Ham

  • Message: Email text


🛠️ Technologies Used

  • Python (Google Colab)

  • Libraries: NumPy, Pandas, scikit-learn

  • Vectorization: TF-IDF (Term Frequency-Inverse Document Frequency)

  • Modeling: Logistic Regression


⚙️ Data Preprocessing

  • Null Value Handling: Replaced null entries with empty strings

  • Label Encoding: Converted spam → 0 and ham → 1

  • Feature Extraction: Used TfidfVectorizer to transform email text into numerical feature vectors

  • Data Splitting: Train-test split (80% training, 20% testing)
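
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy rows stand in for mail_data.csv, and the `Label` column name, `random_state`, and the `TfidfVectorizer` parameters are assumptions. The sketch splits before fitting the vectorizer so that the vocabulary is learned from training text only; the README does not state which order the notebook uses.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy rows standing in for mail_data.csv; the notebook would
# presumably load the real file with something like pd.read_csv("mail_data.csv").
raw = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "spam", "ham",
                 "ham", "spam", "ham", "spam", "ham"],
    "Message": [
        "Are we still meeting for lunch today?",
        "WINNER! Claim your free prize now",
        None,  # missing text, to exercise the null-handling step
        "Urgent: your account has won a cash reward",
        "Can you send me the report by Friday?",
        "Thanks for the update, talk soon",
        "Free entry in a weekly prize draw, text WIN",
        "I'll call you after the meeting",
        "Congratulations, you have been selected for a free gift",
        "See you at the station at six",
    ],
})

# Null value handling: replace missing entries with empty strings.
data = raw.fillna("")

# Label encoding: spam -> 0, ham -> 1 ("Label" is an assumed column name).
data["Label"] = data["Category"].map({"spam": 0, "ham": 1})

# Data splitting: 80% training, 20% testing (random_state is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    data["Message"], data["Label"], test_size=0.2, random_state=3
)

# Feature extraction: fit TF-IDF on the training text, then reuse the fitted
# vocabulary for the test text. The parameters here are assumptions.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

print(X_train_features.shape, X_test_features.shape)
```

Both feature matrices share the same number of columns because the test text is transformed with the vocabulary learned from the training text.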


🤖 Model Details

  • Algorithm: Logistic Regression

  • Input: TF-IDF features from email text

  • Evaluation Metric: Accuracy Score
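
A minimal sketch of the training and evaluation step, assuming default `LogisticRegression` settings and a small hypothetical corpus in place of the real dataset. The accuracy it prints on toy data says nothing about the figures reported for the actual project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; labels follow the README (spam = 0, ham = 1).
messages = [
    "Are we still meeting for lunch today?",
    "WINNER! Claim your free prize now",
    "Can you send me the report by Friday?",
    "Urgent: your account has won a cash reward",
    "Thanks for the update, talk soon",
    "I'll call you after the meeting",
    "Free entry in a weekly prize draw, text WIN",
    "See you at the station at six",
    "Congratulations, you have been selected for a free gift",
    "Exclusive offer: claim your free cash bonus today",
]
labels = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]

# 80/20 split; random_state is an assumption for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=3
)

# Input: TF-IDF features from the message text.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# Algorithm: Logistic Regression with scikit-learn defaults (an assumption).
model = LogisticRegression()
model.fit(X_train_features, y_train)

# Evaluation metric: accuracy on both the training and the test split.
train_accuracy = accuracy_score(y_train, model.predict(X_train_features))
test_accuracy = accuracy_score(y_test, model.predict(X_test_features))
print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```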


📊 Results

  • Training Accuracy: ~96.8%

  • Test Accuracy: ~95.0%


🧪 Usage

  • Upload mail_data.csv to your Colab environment

  • Run the notebook cells sequentially

  • Enter a custom email message in the predictive system to check if it is spam or ham
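
The predictive system in the last step might look something like the sketch below. `classify_message` is a hypothetical helper name, and the tiny training set only exists to make the example self-contained; in the notebook the vectorizer and model would already be fitted on mail_data.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data (spam = 0, ham = 1, as in the README).
train_messages = [
    "Are we still meeting for lunch today?",
    "WINNER! Claim your free prize now",
    "Can you send me the report by Friday?",
    "Urgent: your account has won a cash reward",
    "Thanks for the update, talk soon",
    "Free entry in a weekly prize draw, text WIN",
]
train_labels = [1, 0, 1, 0, 1, 0]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_messages), train_labels)


def classify_message(text: str) -> str:
    """Return "spam" or "ham" for one message (hypothetical helper)."""
    # The message must go through the same fitted vectorizer as the training data.
    features = vectorizer.transform([text])
    prediction = model.predict(features)[0]
    return "ham" if prediction == 1 else "spam"


print(classify_message("Claim your free cash prize today"))
```

The important detail is that a new message is passed through `vectorizer.transform`, not `fit_transform`, so it is encoded with the vocabulary the model was trained on.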


🔮 Future Improvements

  • Experiment with advanced models (SVM, Naive Bayes, or XGBoost)

  • Try deep learning models (LSTM or Transformers), which may improve accuracy

  • Add email metadata (sender, subject, etc.) as additional features
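
As a starting point for the first item, one possible way to compare classifiers is to swap them into the same TF-IDF pipeline. This is only a sketch on hypothetical toy data: the candidate names are illustrative, scikit-learn defaults are assumed throughout, and XGBoost is left out because it needs a separate package.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (spam = 0, ham = 1, as in the README).
messages = [
    "Are we still meeting for lunch today?",
    "WINNER! Claim your free prize now",
    "Can you send me the report by Friday?",
    "Urgent: your account has won a cash reward",
    "Thanks for the update, talk soon",
    "I'll call you after the meeting",
    "Free entry in a weekly prize draw, text WIN",
    "See you at the station at six",
    "Congratulations, you have been selected for a free gift",
    "Exclusive offer: claim your free cash bonus today",
]
labels = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=3
)

# Candidate classifiers, all with default settings (an assumption).
candidates = {
    "logistic_regression": LogisticRegression(),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

test_scores = {}
for name, classifier in candidates.items():
    # Each pipeline fits its own TF-IDF vocabulary on the training text.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(X_train, y_train)
    test_scores[name] = pipeline.score(X_test, y_test)  # accuracy

for name, score in test_scores.items():
    print(f"{name}: {score:.3f}")
```

On the real dataset, a single 80/20 split gives a fairly noisy comparison, so cross-validation would be a more reliable way to rank the candidates.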
