diff --git a/docs/natural-language-processing/email-spam-detection.md b/docs/natural-language-processing/email-spam-detection.md
index 15bf34b5..23928cb2 100644
--- a/docs/natural-language-processing/email-spam-detection.md
+++ b/docs/natural-language-processing/email-spam-detection.md
@@ -1,204 +1,188 @@
+# 🌟 Email Spam Detection
-# Email Spam Detection
+
+
+
-### AIM
-To develop a machine learning-based system that classifies email content as spam or ham (not spam).
+## 🎯 AIM
+To classify emails as spam or ham using machine learning models, ensuring better email filtering and security.

-### DATASET LINK
-[https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification](https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification)
+## 📊 DATASET LINK
+[Email Spam Detection Dataset](https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification)

+## 📚 KAGGLE NOTEBOOK
+[Notebook Link](https://www.kaggle.com/code/thatarguy/email-spam-classifier?kernelSessionId=224262023)

-### NOTEBOOK LINK
-[https://www.kaggle.com/code/inshak9/email-spam-detection](https://www.kaggle.com/code/inshak9/email-spam-detection)
+??? Abstract "Kaggle Notebook"
+

-### LIBRARIES NEEDED
+## ⚙️ TECH STACK

-??? quote "LIBRARIES USED"
+| **Category**             | **Technologies**                                 |
+|--------------------------|--------------------------------------------------|
+| **Languages**            | Python                                           |
+| **Libraries/Frameworks** | Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn |
+| **Databases**            | NOT USED                                         |
+| **Tools**                | Kaggle, Jupyter Notebook                         |
+| **Deployment**           | NOT USED                                         |

-    - pandas
-    - numpy
-    - scikit-learn
-    - matplotlib
-    - seaborn
-
----
+---

-### DESCRIPTION
+## 📝 DESCRIPTION

 !!! info "What is the requirement of the project?"
-    - A robust system to detect spam emails is essential to combat increasing spam content.
-    - It improves user experience by automatically filtering unwanted messages.
-
-??? info "Why is it necessary?"
-    - Spam emails consume resources, time, and may pose security risks like phishing.
-    - Helps organizations and individuals streamline their email communication.
+    - To efficiently classify emails as spam or ham.
+    - To improve email security by filtering out spam messages.

 ??? info "How is it beneficial and used?"
-    - Provides a quick and automated solution for spam classification.
-    - Used in email services, IT systems, and anti-spam software to filter messages.
+    - Helps in reducing unwanted spam emails in user inboxes.
+    - Enhances productivity by filtering out irrelevant emails.
+    - Can be integrated into email service providers for automatic filtering.

 ??? info "How did you start approaching this project? (Initial thoughts and planning)"
-    - Analyzed the dataset and prepared features.
-    - Implemented various machine learning models for comparison.
+    - Collected and preprocessed the dataset.
+    - Explored various machine learning models.
+    - Evaluated models based on performance metrics.
+    - Visualized results for better understanding.

 ??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)."
-    - Documentation from [scikit-learn](https://scikit-learn.org)
-    - Blog: Introduction to Spam Classification with ML
+    - Scikit-learn documentation.
+    - Various Kaggle notebooks related to spam detection.

 ---

-### EXPLANATION
+## 🔍 PROJECT EXPLANATION
+
+### 🧩 DATASET OVERVIEW & FEATURE DETAILS
+
+??? example "📂 spam.csv"

-#### DETAILS OF THE DIFFERENT FEATURES
-The dataset contains features like word frequency, capital letter counts, and others that help in distinguishing spam emails from ham.
+    - The dataset contains the following features:

-| Feature              | Description                                      |
-|----------------------|--------------------------------------------------|
-| `word_freq_x`        | Frequency of specific words in the email body    |
-| `capital_run_length` | Length of consecutive capital letters            |
-| `char_freq`          | Frequency of special characters like `;` and `$` |
-| `is_spam`            | Target variable (1 = Spam, 0 = Ham)              |
+    | Feature Name | Description     | Datatype |
+    |--------------|-----------------|:--------:|
+    | Category     | Spam or Ham     | object   |
+    | Text         | Email text      | object   |
+    | Length       | Length of email | int64    |
+
+??? example "🛠 Developed Features from spam.csv"
+
+    | Feature Name | Description       | Reason                  | Datatype |
+    |--------------|-------------------|-------------------------|:--------:|
+    | Length       | Email text length | Helps in spam detection | int64    |

 ---

-#### WHAT I HAVE DONE
+### 🛤 PROJECT WORKFLOW

-=== "Step 1"
+!!! success "Project workflow"
+
+    ``` mermaid
+    graph LR
+      A[Start] --> B[Load Dataset]
+      B --> C[Preprocess Data]
+      C --> D[Vectorize Text]
+      D --> E[Train Models]
+      E --> F[Evaluate Models]
+      F --> G[Visualize Results]
+    ```

-    Initial data exploration and understanding:
-    - Loaded the dataset using pandas.
-    - Explored dataset features and target variable distribution.
+=== "Step 1"
+    - Load the dataset and clean unnecessary columns.

 === "Step 2"
-    - Data cleaning and preprocessing:
-    - Checked for missing values.
-    - Standardized features using scaling techniques.
+    - Preprocess text and convert categorical labels.

 === "Step 3"
-    - Feature engineering and selection:
-    - Extracted relevant features for spam classification.
-    - Used correlation matrix to select significant features.
+    - Convert text into numerical features using CountVectorizer.

 === "Step 4"
-    - Model training and evaluation:
-    - Trained models: KNN, Naive Bayes, SVM, and Random Forest.
-    - Evaluated models using accuracy, precision, and recall.
+    - Train machine learning models.

 === "Step 5"
-    - Model optimization and fine-tuning:
-    - Tuned hyperparameters using GridSearchCV.
+    - Evaluate models using accuracy, precision, recall, and F1 score.

 === "Step 6"
-    - Validation and testing:
-    - Tested models on unseen data to check performance.
+    - Visualize performance using confusion matrices and heatmaps.

 ---

-#### PROJECT TRADE-OFFS AND SOLUTIONS
-
-=== "Trade Off 1"
-    - **Accuracy vs. Training Time**:
-    - Models like Random Forest took longer to train but achieved higher accuracy compared to Naive Bayes.
-
-=== "Trade Off 2"
-    - **Complexity vs. Interpretability**:
-    - Simpler models like Naive Bayes were more interpretable but slightly less accurate.
+### 🖥 CODE EXPLANATION
----

+=== "Section 1"
+    - Data loading and preprocessing.

-### SCREENSHOTS
-
+=== "Section 2"
+    - Text vectorization using CountVectorizer.

-!!! success "Project flowchart"
-
-    ``` mermaid
-    graph LR
-    A[Start] --> B[Load Dataset];
-    B --> C[Preprocessing];
-    C --> D[Train Models];
-    D --> E{Compare Performance};
-    E -->|Best Model| F[Deploy];
-    E -->|Retry| C;
-    ```
+=== "Section 3"
+    - Training models (MLP Classifier, MultinomialNB, BernoulliNB).

-??? tip "Confusion Matrix"
+=== "Section 4"
+    - Evaluating models using various metrics.

-    === "SVM"
-        ![Confusion Matrix - SVM](https://github.com/user-attachments/assets/5abda820-040a-4ea8-b389-cd114d329c62)
+=== "Section 5"
+    - Visualizing confusion matrices and metric comparisons (a combined code sketch of these sections follows below).
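+
+The sections above can be condensed into one short, illustrative sketch. This is not the notebook's exact code: the `Category`/`Text` column names follow the feature table above, while the `latin-1` encoding, the 80/20 split, and the model hyperparameters are assumptions made for the example.
+
+```python
+# Illustrative end-to-end sketch of the workflow described above (not the notebook's exact code).
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from sklearn.feature_extraction.text import CountVectorizer
+from sklearn.naive_bayes import MultinomialNB, BernoulliNB
+from sklearn.neural_network import MLPClassifier
+from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
+
+# Load the dataset; the column layout is assumed to match the feature table above.
+df = pd.read_csv("spam.csv", encoding="latin-1")[["Category", "Text"]].dropna()
+df["Label"] = (df["Category"] == "spam").astype(int)   # ham -> 0, spam -> 1
+df["Length"] = df["Text"].str.len()                    # engineered feature used for EDA, not fed to the models here
+
+X_train, X_test, y_train, y_test = train_test_split(
+    df["Text"], df["Label"], test_size=0.2, random_state=42, stratify=df["Label"]
+)
+
+# Bag-of-words features (Step 3 / Section 2).
+vectorizer = CountVectorizer()
+X_train_vec = vectorizer.fit_transform(X_train)
+X_test_vec = vectorizer.transform(X_test)
+
+# Train and evaluate the three classifiers compared in this project (Sections 3 to 5).
+models = {
+    "MLP Classifier": MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=42),
+    "Multinomial NB": MultinomialNB(),
+    "Bernoulli NB": BernoulliNB(),
+}
+for name, model in models.items():
+    model.fit(X_train_vec, y_train)
+    pred = model.predict(X_test_vec)
+    print(name, confusion_matrix(y_test, pred).tolist())
+    print(f"  acc={accuracy_score(y_test, pred):.3f}  prec={precision_score(y_test, pred):.3f}  "
+          f"rec={recall_score(y_test, pred):.3f}  f1={f1_score(y_test, pred):.3f}")
+```
+
+The two Naive Bayes models train almost instantly on a dataset of this size, while the MLP takes noticeably longer, which is the accuracy-versus-efficiency trade-off discussed in the next section.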
- === "Naive Bayes" - ![Confusion Matrix - Naive Bayes](https://github.com/user-attachments/assets/bdae9210-9b9b-45c7-9371-36c0a66a9184) +--- - === "Decision Tree" - ![Confusion Matrix - Decision Tree](https://github.com/user-attachments/assets/8e92fc53-4aff-4973-b0a1-b65a7fc4a79e) +### ⚖️ PROJECT TRADE-OFFS AND SOLUTIONS - === "AdaBoost" - ![Confusion Matrix - AdaBoost](https://github.com/user-attachments/assets/043692e3-f733-419c-9fb2-834f2e199506) +=== "Trade Off 1" + - Balancing accuracy and computational efficiency. + - Used Naive Bayes for speed and MLP for improved accuracy. - === "Random Forest" - ![Confusion Matrix - Random Forest](https://github.com/user-attachments/assets/5c689f57-9ec5-4e49-9ef5-3537825ac772) +=== "Trade Off 2" + - Handling false positives vs. false negatives. + - Tuned models to improve precision for spam detection. --- -### MODELS USED AND THEIR EVALUATION METRICS +## 🎮 SCREENSHOTS -| Model | Accuracy | Precision | Recall | -|----------------------|----------|-----------|--------| -| KNN | 90% | 89% | 88% | -| Naive Bayes | 92% | 91% | 90% | -| SVM | 94% | 93% | 91% | -| Random Forest | 95% | 94% | 93% | -| AdaBoost | 97% | 97% | 100% | +!!! tip "Visualizations and EDA of different features" ---- - -#### MODELS COMPARISON GRAPHS + === "Confusion Matrix comparision" + ![img](https://github.com/user-attachments/assets/94a3b2d8-c7e5-41a5-bba7-8ba4cb1435a7) -!!! tip "Models Comparison Graphs" - === "Accuracy Comparison" - ![Model accracy comparison](https://github.com/user-attachments/assets/1e17844d-e953-4eb0-a24d-b3dbc727db93) +??? example "Model performance graphs" ---- + === "Meteric comparison" + ![img](https://github.com/user-attachments/assets/c2be4340-89c9-4aee-9a27-8c40bf2c0066) -### CONCLUSION -#### WHAT YOU HAVE LEARNED - -!!! tip "Insights gained from the data" - - Feature importance significantly impacts spam detection. - - Simple models like Naive Bayes can achieve competitive performance. +--- -??? tip "Improvements in understanding machine learning concepts" - - Gained hands-on experience with classification models and model evaluation techniques. +## 📉 MODELS USED AND THEIR EVALUATION METRICS -??? tip "Challenges faced and how they were overcome" - - Balancing between accuracy and training time was challenging, solved using model tuning. +| Model | Accuracy | Precision | Recall | F1 Score | +|------------|----------|------------|--------|----------| +| MLP Classifier | 95% | 0.94 | 0.90 | 0.92 | +| Multinomial NB | 93% | 0.91 | 0.88 | 0.89 | +| Bernoulli NB | 92% | 0.89 | 0.85 | 0.87 | --- -#### USE CASES OF THIS MODEL - -=== "Application 1" +## ✅ CONCLUSION - **Email Service Providers** - - Automated filtering of spam emails for improved user experience. +### 🔑 KEY LEARNINGS -=== "Application 2" +!!! tip "Insights gained from the data" + - Text length plays a role in spam detection. + - Certain words appear more frequently in spam emails. - **Enterprise Email Security** - - Used in enterprise software to detect phishing and spam emails. +??? tip "Improvements in understanding machine learning concepts" + - Gained insights into text vectorization techniques. + - Understood trade-offs between different classification models. --- -### FEATURES PLANNED BUT NOT IMPLEMENTED +### 🌍 USE CASES -=== "Feature 1" +=== "Email Filtering Systems" + - Can be integrated into email services like Gmail and Outlook. - - Integration of deep learning models (LSTM) for improved accuracy. +=== "SMS Spam Detection" + - Used in mobile networks to block spam messages. 
diff --git a/docs/natural-language-processing/index.md b/docs/natural-language-processing/index.md
index d8f53239..4c5e4736 100644
--- a/docs/natural-language-processing/index.md
+++ b/docs/natural-language-processing/index.md
@@ -29,7 +29,7 @@
- Email Spam Detection
+ Email Spam Detection

Email Spam Detection

ML-Based Email Spam Classification