This project builds a machine learning model to classify emails as spam or ham (legitimate) using Natural Language Processing (NLP) and Logistic Regression.
Email spam detection is an essential task for filtering unwanted emails and improving user experience. This project aims to classify email messages into two categories:
- Ham (Not Spam) (label = 1)
- Spam (label = 0)
Source: Kaggle SMS Spam Collection Dataset
File Used: mail_data.csv
Features:
- Category: Spam or Ham
- Message: Email text
- Python (Google Colab)
- Libraries: NumPy, Pandas, scikit-learn
- Vectorization: TF-IDF (Term Frequency-Inverse Document Frequency)
- Modeling: Logistic Regression
Null Value Handling: Replaced null entries with empty strings
Label Encoding: Converted spam → 0 and ham → 1
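The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the four rows in `raw` are hypothetical stand-ins for `mail_data.csv`, and the variable names `raw` and `mail` are assumptions.

```python
import pandas as pd

# Hypothetical stand-in rows. In the notebook the data would come from the
# uploaded file instead, e.g. raw = pd.read_csv("mail_data.csv").
raw = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "spam"],
    "Message": ["See you at lunch", "WIN a free prize now", None, "Claim your cash reward"],
})

# Null value handling: replace null entries with empty strings.
mail = raw.fillna("")

# Label encoding: spam -> 0, ham -> 1.
mail["Category"] = mail["Category"].map({"spam": 0, "ham": 1})

print(mail)
```

Filling nulls before vectorization matters because `TfidfVectorizer` expects string input and would fail on missing values.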
Feature Extraction:
- Used TfidfVectorizer to transform email text into numerical feature vectors
- Data Splitting: Train-Test split (80% training, 20% testing)
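A rough sketch of the split and the TF-IDF transformation is shown here. The messages and labels are hypothetical stand-ins for the dataset, `random_state=3` is an assumed seed, and the vectorizer is left at its defaults because the original settings are not stated. Fitting the vectorizer on the training split only is also an assumption, chosen so that no vocabulary leaks in from the test messages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; the notebook would use the Message and
# Category columns of mail_data.csv instead.
messages = [
    "Win a free prize now",
    "Are we still meeting for lunch today",
    "Claim your cash reward immediately",
    "Please review the attached report",
    "Free entry in a weekly prize draw",
    "Can you call me after work",
    "Urgent offer, claim your free voucher",
    "Thanks for the update, see you tomorrow",
    "You have won a cash bonus",
    "Let us reschedule the meeting to Friday",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # spam = 0, ham = 1

# 80% training, 20% testing; the seed is an assumption for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=3
)

# Learn the vocabulary and IDF weights from the training text only,
# then apply the same mapping to the test text.
vectorizer = TfidfVectorizer()
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

print(X_train_features.shape, X_test_features.shape)
```

Both matrices share the same number of columns, one per term in the training vocabulary, which is what lets a model trained on one be evaluated on the other.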
- Algorithm: Logistic Regression
- Input: TF-IDF features from email text
- Evaluation Metric: Accuracy Score
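Training and evaluation might look like the sketch below. It assumes default `LogisticRegression` settings and uses a small hypothetical message list, so the accuracies it prints will not match the figures reported for the real dataset. The `stratify` argument is an addition here, used only to keep both classes in each split of such a tiny sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (spam = 0, ham = 1).
messages = [
    "Win a free prize now",
    "Are we still meeting for lunch today",
    "Claim your cash reward immediately",
    "Please review the attached report",
    "Free entry in a weekly prize draw",
    "Can you call me after work",
    "Urgent offer, claim your free voucher",
    "Thanks for the update, see you tomorrow",
    "You have won a cash bonus",
    "Let us reschedule the meeting to Friday",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=3, stratify=labels
)

# Turn text into TF-IDF features (fit on the training split only).
vectorizer = TfidfVectorizer()
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# Fit the classifier on the training features.
model = LogisticRegression()
model.fit(X_train_features, y_train)

# Accuracy = fraction of messages whose predicted label matches the true label.
train_accuracy = accuracy_score(y_train, model.predict(X_train_features))
test_accuracy = accuracy_score(y_test, model.predict(X_test_features))

print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```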
- Training Accuracy: ~96.8%
- Test Accuracy: ~95.0%
- Upload mail_data.csv to your Colab environment
- Run the notebook cells sequentially
- Enter a custom email message in the predictive system to check if it is spam or ham
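The predictive step could be wrapped in a small helper like the one below. The function name `classify_message` is hypothetical, and the model here is trained on a few stand-in messages purely so the example runs on its own; in the notebook, the already fitted vectorizer and model would be passed in.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_message(text, vectorizer, model):
    """Return "ham" or "spam" for one message (label 1 = ham, 0 = spam)."""
    # The message must go through the same fitted vectorizer used in training.
    features = vectorizer.transform([text])
    prediction = model.predict(features)[0]
    return "ham" if prediction == 1 else "spam"


# Hypothetical stand-in training data (spam = 0, ham = 1).
train_messages = [
    "Win a free prize now",
    "Are we still meeting for lunch today",
    "Claim your cash reward immediately",
    "Please review the attached report",
    "Free entry in a weekly prize draw",
    "Can you call me after work",
]
train_labels = [0, 1, 0, 1, 0, 1]

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_messages), train_labels)

result = classify_message("Claim your free prize today", vectorizer, model)
print(result)
```

Note that `transform` is used rather than `fit_transform` at prediction time, so the custom message is mapped onto the vocabulary learned during training.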
- Experiment with other models (SVM, Naive Bayes, or XGBoost)
- Try deep learning models (LSTM or Transformers), which may improve accuracy
- Add email metadata (sender, subject, etc.) as additional features
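As a starting point for the first item, the sketch below swaps the classifier while keeping the TF-IDF step unchanged. It is only an illustration under assumptions: the pipeline names are hypothetical, the data is a stand-in, and `LinearSVC` and `MultinomialNB` are used with their scikit-learn defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in data (spam = 0, ham = 1).
messages = [
    "Win a free prize now",
    "Are we still meeting for lunch today",
    "Claim your cash reward immediately",
    "Please review the attached report",
    "Free entry in a weekly prize draw",
    "Can you call me after work",
]
labels = [0, 1, 0, 1, 0, 1]

# Each pipeline chains the same TF-IDF step with a different classifier.
candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

predictions = {}
for name, pipeline in candidates.items():
    pipeline.fit(messages, labels)
    predictions[name] = int(pipeline.predict(["Claim your free prize"])[0])

print(predictions)
```

Comparing such candidates fairly would require the same train-test split and accuracy metric used for the Logistic Regression model.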