This project uses machine learning to classify incoming emails into two categories, Spam and Ham (not spam), based on the email content.
The project demonstrates the process of data preprocessing, feature extraction, model building, and evaluation using Python.
- Cleans and preprocesses raw email text
- Converts text data into numerical form using TF-IDF vectorization
- Implements machine learning models such as Logistic Regression for classification
- Evaluates model performance using metrics like accuracy, precision, recall, and F1 score
- Includes exploratory data analysis (EDA) for better understanding of the dataset
- Language: Python
- Libraries: pandas, numpy, scikit-learn, matplotlib, seaborn, nltk
- Environment: Jupyter Notebook
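The cleaning step listed above could look roughly like the sketch below. It is an illustration rather than the notebook's actual code: the function name `clean_text` and the small `STOPWORDS` set are hypothetical, and the notebook may rely on NLTK's full English stopword list instead.

```python
import re
import string

# Hypothetical stopword subset for illustration only; a real run would
# more likely use a full list such as the one shipped with NLTK.
STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "you", "your", "for"}

# Translation table that maps every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> str:
    """Lowercase, strip URLs, digits and punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\d+", " ", text)                     # remove digits
    text = text.translate(_PUNCT_TO_SPACE)               # remove punctuation
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_text("WIN a FREE prize!!! Visit http://example.com to claim your $1000."))
# prints: win free prize visit claim
```

Whether to remove digits and URLs is a design choice: in spam detection they can carry signal, so it may be worth comparing results with and without those two steps.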
The dataset used in this project is mail_data.csv, which contains email messages and their corresponding labels (spam or ham).
Each record includes:
- Email text – The actual content of the email
- Label – Indicates whether the email is spam or not
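A minimal loading sketch with pandas is shown below. The column names `Category` and `Message` are assumptions, as is the helper name `load_mail_data`; check the header of `mail_data.csv` and adjust. A tiny inline sample stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Inline stand-in for mail_data.csv so the sketch is self-contained.
# "Category" and "Message" are assumed column names, not confirmed ones.
sample_csv = io.StringIO(
    "Category,Message\n"
    "ham,See you at lunch tomorrow\n"
    "spam,Congratulations! You won a free ticket\n"
    "ham,\n"
)


def load_mail_data(source) -> pd.DataFrame:
    """Read the CSV, fill missing text with empty strings, add a 0/1 label."""
    df = pd.read_csv(source)
    df["Message"] = df["Message"].fillna("")
    # 1 = spam, 0 = ham; tolerate stray whitespace and mixed case in labels.
    df["is_spam"] = (df["Category"].str.strip().str.lower() == "spam").astype(int)
    return df


df = load_mail_data(sample_csv)  # for the real data: load_mail_data("mail_data.csv")
print(df["Category"].value_counts())
```

Checking `value_counts()` early is useful because spam datasets are often imbalanced, which affects how the evaluation metrics should be read.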
Email_Spam_Detection/
│
├── mail_data.csv # Dataset file
├── detection.ipynb # Main Jupyter notebook (data preprocessing, training, evaluation)
├── detection-checkpoint.ipynb # Jupyter autosave checkpoint of the notebook
└── README.md # Project documentation
git clone https://github.com/Pallabi26313/Email_Spam_Detection.git
cd Email_Spam_Detection
Make sure you have Python installed, then install the required libraries:
pip install pandas numpy scikit-learn matplotlib seaborn nltk
Then open the notebook:
jupyter notebook detection.ipynb
The notebook walks through the following steps:
- Load and explore the dataset
- Preprocess the text data
- Train and test the model
- Evaluate model performance
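The steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration under stated assumptions, not the notebook's exact code: the toy messages stand in for `mail_data.csv`, and the split ratio, `random_state`, and vectorizer settings are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy messages standing in for the real dataset (illustrative only).
spam = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free reward today",
    "urgent offer win cash now",
]
ham = [
    "are we meeting for lunch today",
    "please send the project report",
    "see you at the meeting tomorrow",
    "thanks for the report update",
]
texts = spam * 5 + ham * 5
labels = [1] * 20 + [0] * 20  # 1 = spam, 0 = ham

# Hold out 25% of the data, keeping the spam/ham ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# TF-IDF turns each message into a sparse numeric vector; the classifier
# is fitted on those vectors. The pipeline fits the vocabulary on the
# training split only.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary", pos_label=1, zero_division=0
)
# Scores on this repeated toy data are not meaningful; they only show the calls.
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline keeps test messages out of the TF-IDF vocabulary and IDF weights, which avoids a subtle form of data leakage.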
After training and testing, the model achieved:
- Accuracy: 96%
Trying other models such as SVM, Random Forest, or XGBoost may improve accuracy further.
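One way to compare alternative models is cross-validation over the same TF-IDF features, sketched below. The toy data, the 5-fold setting, and the hyperparameters are assumptions for illustration. XGBoost is left out because it is a separate package that is not in this project's library list.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy messages standing in for the real dataset (illustrative only).
spam = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free reward today",
    "urgent offer win cash now",
]
ham = [
    "are we meeting for lunch today",
    "please send the project report",
    "see you at the meeting tomorrow",
    "thanks for the report update",
]
texts = spam * 5 + ham * 5
labels = [1] * 20 + [0] * 20  # 1 = spam, 0 = ham

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, clf in candidates.items():
    # The vectorizer is refitted inside every fold, so held-out messages
    # never influence the vocabulary used for training.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    fold_scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean F1 = {scores[name]:.2f}")
```

F1 is used as the scoring metric here rather than accuracy because it balances precision and recall on the spam class; swap in `scoring="accuracy"` to compare against the figure reported above.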
- Add a web app interface using Streamlit or Flask
- Try deep learning models (LSTM or BERT) for better text understanding
- Improve text preprocessing using advanced NLP techniques
- Add visualization dashboards for real-time spam classification
This project demonstrates how machine learning can be applied to classify emails as spam or not spam. It covers the full workflow, from data preprocessing to model evaluation, making it a useful beginner project for NLP and text classification.
Pallabi Ghosh
- GitHub: Pallabi26313
- Email: pallabighosh7142@gmail.com