246 changes: 115 additions & 131 deletions docs/natural-language-processing/email-spam-detection.md

# Email Spam Detection
<div align="center">
<img src="https://github.com/user-attachments/assets/c90bf132-68a6-4155-b191-d2da7e35d0ca" />
</div>

## 🎯 AIM
To classify emails as spam or ham using machine learning models, ensuring better email filtering and security.

## 📊 DATASET LINK
[Email Spam Detection Dataset](https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification)

## 📚 KAGGLE NOTEBOOK
[Notebook Link](https://www.kaggle.com/code/thatarguy/email-spam-classifier?kernelSessionId=224262023)

??? abstract "Kaggle Notebook"

<iframe src="https://www.kaggle.com/embed/thatarguy/email-spam-classifier?kernelSessionId=224262023" height="800" style="margin: 0 auto; width: 100%; max-width: 950px;" frameborder="0" scrolling="auto" title="email-spam-classifier"></iframe>

## ⚙️ TECH STACK

??? quote "LIBRARIES USED"
| **Category** | **Technologies** |
|--------------------------|---------------------------------------------|
| **Languages** | Python |
| **Libraries/Frameworks** | Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn |
| **Databases** | NOT USED |
| **Tools** | Kaggle, Jupyter Notebook |
| **Deployment** | NOT USED |

---

## 📝 DESCRIPTION
!!! info "What is the requirement of the project?"
- A robust system to detect spam emails is essential to combat increasing spam content.
- It improves user experience by automatically filtering unwanted messages.

??? info "Why is it necessary?"
- To efficiently classify emails as spam or ham.
- To improve email security by filtering out spam messages.

??? info "How is it beneficial and used?"
- Helps in reducing unwanted spam emails in user inboxes.
- Enhances productivity by filtering out irrelevant emails.
- Can be integrated into email service providers for automatic filtering.

??? info "How did you start approaching this project? (Initial thoughts and planning)"
- Collected and preprocessed the dataset.
- Explored various machine learning models.
- Evaluated models based on performance metrics.
- Visualized results for better understanding.

??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)."
- Scikit-learn documentation.
- Various Kaggle notebooks related to spam detection.

---

## 🔍 PROJECT EXPLANATION

### 🧩 DATASET OVERVIEW & FEATURE DETAILS

??? example "📂 spam.csv"

- The dataset contains the following features:

| Feature Name | Description | Datatype |
|--------------|-------------|:------------:|
| Category | Spam or Ham | object |
| Text | Email text | object |
| Length | Length of email | int64 |

??? example "🛠 Developed Features from spam.csv"

| Feature Name | Description | Reason | Datatype |
|--------------|-------------|----------|:------------:|
| Length | Email text length | Helps in spam detection | int64 |
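Below is a minimal sketch of how the `Length` feature could be derived with pandas; the file path, encoding, and the `Category`/`Text` column names are assumptions taken from the tables above.

```python
import pandas as pd

# File path, encoding, and column names are assumptions based on the tables above
df = pd.read_csv("spam.csv", encoding="latin-1")

# Derived feature: character length of each email text
df["Length"] = df["Text"].str.len()

print(df[["Category", "Text", "Length"]].head())
```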

---

### 🛤 PROJECT WORKFLOW

=== "Step 1"
!!! success "Project workflow"

``` mermaid
graph LR
A[Start] --> B[Load Dataset]
B --> C[Preprocess Data]
C --> D[Vectorize Text]
D --> E[Train Models]
E --> F[Evaluate Models]
F --> G[Visualize Results]
```

=== "Step 1"
- Load the dataset and clean unnecessary columns.

=== "Step 2"

- Preprocess text and convert categorical labels.

=== "Step 3"

- Convert text into numerical features using CountVectorizer.

=== "Step 4"

- Train machine learning models.

=== "Step 5"

- Evaluate models using accuracy, precision, recall, and F1 score.

=== "Step 6"

- Visualize performance using confusion matrices and heatmaps.

---

### 🖥 CODE EXPLANATION

=== "Section 1"
- Data loading and preprocessing.
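    A minimal sketch of this section, assuming the `spam.csv` layout and label values described in the dataset overview above:

```python
import pandas as pd

# Column names and label values are assumptions from the dataset overview above
df = pd.read_csv("spam.csv", encoding="latin-1")
df = df[["Category", "Text"]].dropna()

# Convert categorical labels to integers: ham -> 0, spam -> 1
df["label"] = df["Category"].map({"ham": 0, "spam": 1})
```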

=== "Section 2"
- Text vectorization using CountVectorizer.
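    A sketch of the vectorization step with scikit-learn's `CountVectorizer`; the split ratio and random seed are illustrative, not the notebook's exact settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hold out a test set before fitting the vectorizer (80/20 split is illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    df["Text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Learn the vocabulary on the training data only, then transform both splits
vectorizer = CountVectorizer(stop_words="english", lowercase=True)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
```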

!!! success "Project flowchart"

``` mermaid
graph LR
A[Start] --> B[Load Dataset];
B --> C[Preprocessing];
C --> D[Train Models];
D --> E{Compare Performance};
E -->|Best Model| F[Deploy];
E -->|Retry| C;
```
=== "Section 3"
- Training models (MLP Classifier, MultinomialNB, BernoulliNB).
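    A hedged sketch of how the three models named above could be trained on the vectorized features; the hyperparameters are illustrative defaults, not the notebook's tuned values:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hyperparameters are illustrative, not the notebook's tuned values
models = {
    "MLP Classifier": MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42),
    "Multinomial NB": MultinomialNB(),
    "Bernoulli NB": BernoulliNB(),
}

# Fit each model on the vectorized training data
for name, model in models.items():
    model.fit(X_train_vec, y_train)
```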

??? tip "Confusion Matrix"
=== "Section 4"
- Evaluating models using various metrics.
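    A sketch of the evaluation step, computing the same metrics reported in the table further below:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Score each fitted model on the held-out test split
for name, model in models.items():
    y_pred = model.predict(X_test_vec)
    print(
        f"{name}: "
        f"accuracy={accuracy_score(y_test, y_pred):.3f} "
        f"precision={precision_score(y_test, y_pred):.3f} "
        f"recall={recall_score(y_test, y_pred):.3f} "
        f"f1={f1_score(y_test, y_pred):.3f}"
    )
```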

=== "SVM"
![Confusion Matrix - SVM](https://github.com/user-attachments/assets/5abda820-040a-4ea8-b389-cd114d329c62)
=== "Section 5"
- Visualizing confusion matrices and metric comparisons.
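    A sketch of the visualization step using seaborn heatmaps; figure size and colour map are illustrative choices:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# One confusion-matrix heatmap per model (styling is illustrative)
fig, axes = plt.subplots(1, len(models), figsize=(15, 4))
for ax, (name, model) in zip(axes, models.items()):
    cm = confusion_matrix(y_test, model.predict(X_test_vec))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
    ax.set_title(name)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
plt.tight_layout()
plt.show()
```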

=== "Naive Bayes"
![Confusion Matrix - Naive Bayes](https://github.com/user-attachments/assets/bdae9210-9b9b-45c7-9371-36c0a66a9184)
---

=== "Decision Tree"
![Confusion Matrix - Decision Tree](https://github.com/user-attachments/assets/8e92fc53-4aff-4973-b0a1-b65a7fc4a79e)
### ⚖️ PROJECT TRADE-OFFS AND SOLUTIONS

=== "AdaBoost"
![Confusion Matrix - AdaBoost](https://github.com/user-attachments/assets/043692e3-f733-419c-9fb2-834f2e199506)
=== "Trade Off 1"
- Balancing accuracy and computational efficiency.
- Used Naive Bayes for speed and MLP for improved accuracy.

=== "Random Forest"
![Confusion Matrix - Random Forest](https://github.com/user-attachments/assets/5c689f57-9ec5-4e49-9ef5-3537825ac772)
=== "Trade Off 2"
- Handling false positives vs. false negatives.
- Tuned models to improve precision for spam detection.

---

## 🎮 SCREENSHOTS

!!! tip "Visualizations and EDA of different features"


=== "Confusion Matrix comparision"
![img](https://github.com/user-attachments/assets/94a3b2d8-c7e5-41a5-bba7-8ba4cb1435a7)

!!! tip "Models Comparison Graphs"

=== "Accuracy Comparison"
![Model accracy comparison](https://github.com/user-attachments/assets/1e17844d-e953-4eb0-a24d-b3dbc727db93)
??? example "Model performance graphs"

=== "Meteric comparison"
![img](https://github.com/user-attachments/assets/c2be4340-89c9-4aee-9a27-8c40bf2c0066)

---

??? tip "Improvements in understanding machine learning concepts"
- Gained hands-on experience with classification models and model evaluation techniques.
## 📉 MODELS USED AND THEIR EVALUATION METRICS

??? tip "Challenges faced and how they were overcome"
- Balancing between accuracy and training time was challenging, solved using model tuning.
| Model | Accuracy | Precision | Recall | F1 Score |
|------------|----------|------------|--------|----------|
| MLP Classifier | 95% | 0.94 | 0.90 | 0.92 |
| Multinomial NB | 93% | 0.91 | 0.88 | 0.89 |
| Bernoulli NB | 92% | 0.89 | 0.85 | 0.87 |

---

## ✅ CONCLUSION

### 🔑 KEY LEARNINGS

=== "Application 2"
!!! tip "Insights gained from the data"
- Text length plays a role in spam detection.
- Certain words appear more frequently in spam emails.

??? tip "Improvements in understanding machine learning concepts"
- Gained insights into text vectorization techniques.
- Understood trade-offs between different classification models.

---

### 🌍 USE CASES

=== "Feature 1"
=== "Email Filtering Systems"
- Can be integrated into email services like Gmail and Outlook.

=== "SMS Spam Detection"
- Used in mobile networks to block spam messages.
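As a usage illustration for the use cases above, a fitted vectorizer and model from the code-explanation sketches could score a new message like this (the example message is made up):

```python
# Reuses `vectorizer` and `models` from the code-explanation sketches above
new_message = ["Congratulations! You have won a free prize. Click here to claim."]
features = vectorizer.transform(new_message)
prediction = models["Multinomial NB"].predict(features)[0]
print("spam" if prediction == 1 else "ham")
```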

2 changes: 1 addition & 1 deletion docs/natural-language-processing/index.md
<!-- Email Spam Detection -->
<figure style="padding: 1rem; background: rgba(39, 39, 43, 0.5); border-radius: 10px; border: 1px solid rgba(76, 76, 82, 0.4); box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); transition: transform 0.2s ease-in-out; text-align: center; max-width: 320px; margin: auto;">
<a href="email-spam-detection" style="color: white; text-decoration: none; display: block;">
<img src="https://img.freepik.com/free-photo/spam-mail-concept-with-envelopes_23-2149133736.jpg" alt="Email Spam Detection" style="width: 100%; height: 150px; object-fit: cover; border-radius: 8px; transition: transform 0.2s;" />
<img src="https://github.com/user-attachments/assets/c90bf132-68a6-4155-b191-d2da7e35d0ca" alt="Email Spam Detection" style="width: 100%; height: 150px; object-fit: cover; border-radius: 8px; transition: transform 0.2s;" />
<div style="padding: 0.8rem;">
<h3 style="margin: 0; font-size: 18px;">Email Spam Detection</h3>
<p style="font-size: 14px; opacity: 0.8;">ML-Based Email Spam Classification</p>