-### AIM
-To develop a machine learning-based system that classifies email content as spam or ham (not spam).
+## 🎯 AIM
+To classify emails as spam or ham using machine learning models, ensuring better email filtering and security.
-### DATASET LINK
-[https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification](https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification)
+## 📊 DATASET LINK
+[Email Spam Detection Dataset](https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification)
+## 📚 KAGGLE NOTEBOOK
+[Notebook Link](https://www.kaggle.com/code/thatarguy/email-spam-classifier?kernelSessionId=224262023)
-### NOTEBOOK LINK
-[https://www.kaggle.com/code/inshak9/email-spam-detection](https://www.kaggle.com/code/inshak9/email-spam-detection)
+??? Abstract "Kaggle Notebook"
+
-### LIBRARIES NEEDED
+## ⚙️ TECH STACK
-??? quote "LIBRARIES USED"
+| **Category** | **Technologies** |
+|--------------------------|---------------------------------------------|
+| **Languages** | Python |
+| **Libraries/Frameworks** | Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn |
+| **Databases** | NOT USED |
+| **Tools** | Kaggle, Jupyter Notebook |
+| **Deployment** | NOT USED |
- - pandas
- - numpy
- - scikit-learn
- - matplotlib
- - seaborn
-
----
+---
-### DESCRIPTION
+## 📝 DESCRIPTION
!!! info "What is the requirement of the project?"
- - A robust system to detect spam emails is essential to combat increasing spam content.
- - It improves user experience by automatically filtering unwanted messages.
-
-??? info "Why is it necessary?"
- - Spam emails consume resources, time, and may pose security risks like phishing.
- - Helps organizations and individuals streamline their email communication.
+ - To efficiently classify emails as spam or ham.
+ - To improve email security by filtering out spam messages.
??? info "How is it beneficial and used?"
- - Provides a quick and automated solution for spam classification.
- - Used in email services, IT systems, and anti-spam software to filter messages.
+ - Helps in reducing unwanted spam emails in user inboxes.
+ - Enhances productivity by filtering out irrelevant emails.
+ - Can be integrated into email service providers for automatic filtering.
??? info "How did you start approaching this project? (Initial thoughts and planning)"
- - Analyzed the dataset and prepared features.
- - Implemented various machine learning models for comparison.
+ - Collected and preprocessed the dataset.
+ - Explored various machine learning models.
+ - Evaluated models based on performance metrics.
+ - Visualized results for better understanding.
??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)."
- - Documentation from [scikit-learn](https://scikit-learn.org)
- - Blog: Introduction to Spam Classification with ML
+ - Scikit-learn documentation.
+ - Various Kaggle notebooks related to spam detection.
---
-### EXPLANATION
+## 🔍 PROJECT EXPLANATION
+
+### 🧩 DATASET OVERVIEW & FEATURE DETAILS
+
+??? example "📂 spam.csv"
-#### DETAILS OF THE DIFFERENT FEATURES
-The dataset contains features like word frequency, capital letter counts, and others that help in distinguishing spam emails from ham.
+ - The dataset contains the following features:
-| Feature | Description |
-|----------------------|-------------------------------------------------|
-| `word_freq_x` | Frequency of specific words in the email body |
-| `capital_run_length` | Length of consecutive capital letters |
-| `char_freq` | Frequency of special characters like `;` and `$` |
-| `is_spam` | Target variable (1 = Spam, 0 = Ham) |
+ | Feature Name | Description | Datatype |
+ |--------------|-------------|:------------:|
+ | Category | Spam or Ham | object |
+ | Text | Email text | object |
+ | Length | Length of the email text (derived feature) | int64 |
+
+??? example "🛠 Developed Features from spam.csv"
+
+ | Feature Name | Description | Reason | Datatype |
+ |--------------|-------------|----------|:------------:|
+ | Length | Email text length | Helps in spam detection | int64 |
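
The `Length` feature described above could be derived roughly as follows. This is a minimal sketch, assuming that `Length` counts characters in the `Text` column; the two rows are toy stand-ins, not data from `spam.csv`.

```python
import pandas as pd

# Toy rows standing in for spam.csv (hypothetical examples, not real data).
df = pd.DataFrame({
    "Category": ["ham", "spam"],
    "Text": ["See you at lunch", "WIN a FREE prize now!!!"],
})

# Assumed definition: Length is the number of characters in each email's text.
df["Length"] = df["Text"].str.len()

print(df[["Category", "Length"]])
```

If the notebook measures length in words instead, `df["Text"].str.split().str.len()` would be the equivalent one-liner.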
---
-#### WHAT I HAVE DONE
+### 🛤 PROJECT WORKFLOW
-=== "Step 1"
+!!! success "Project workflow"
+
+ ``` mermaid
+ graph LR
+ A[Start] --> B[Load Dataset]
+ B --> C[Preprocess Data]
+ C --> D[Vectorize Text]
+ D --> E[Train Models]
+ E --> F[Evaluate Models]
+ F --> G[Visualize Results]
+ ```
- Initial data exploration and understanding:
- - Loaded the dataset using pandas.
- - Explored dataset features and target variable distribution.
+=== "Step 1"
+ - Load the dataset and clean unnecessary columns.
=== "Step 2"
-
- Data cleaning and preprocessing:
- - Checked for missing values.
- - Standardized features using scaling techniques.
+ - Preprocess text and convert categorical labels.
=== "Step 3"
-
- Feature engineering and selection:
- - Extracted relevant features for spam classification.
- - Used correlation matrix to select significant features.
+ - Convert text into numerical features using CountVectorizer.
=== "Step 4"
-
- Model training and evaluation:
- - Trained models: KNN, Naive Bayes, SVM, and Random Forest.
- - Evaluated models using accuracy, precision, and recall.
+ - Train machine learning models.
=== "Step 5"
-
- Model optimization and fine-tuning:
- - Tuned hyperparameters using GridSearchCV.
+ - Evaluate models using accuracy, precision, recall, and F1 score.
=== "Step 6"
-
- Validation and testing:
- - Tested models on unseen data to check performance.
+ - Visualize performance using confusion matrices and heatmaps.
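
Steps 2 to 5 can be sketched end to end on a tiny corpus. This is an illustrative sketch, not the notebook's code: the example sentences, the `1 = spam, 0 = ham` encoding, and the choice of `MultinomialNB` as the single model shown are assumptions made for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; labels are encoded as 1 = spam, 0 = ham (assumed).
train_texts = [
    "win free prize now",
    "free cash claim now",
    "claim your free prize",
    "meeting at noon tomorrow",
    "lunch with the team",
    "see you at the meeting",
]
train_labels = [1, 1, 1, 0, 0, 0]

test_texts = ["free prize claim", "team meeting tomorrow"]
test_labels = [1, 0]

# Step 3: turn raw text into word-count vectors.
# The vocabulary is learned from the training split only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Step 4: train a classifier on the count vectors.
model = MultinomialNB()
model.fit(X_train, train_labels)

# Step 5: evaluate on held-out text.
predictions = model.predict(X_test)
print("predictions:", list(predictions))
print("accuracy:", accuracy_score(test_labels, predictions))
```

Fitting the vectorizer on the training split and only calling `transform` on the test split keeps test vocabulary from leaking into training.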
---
-#### PROJECT TRADE-OFFS AND SOLUTIONS
-
-=== "Trade Off 1"
- - **Accuracy vs. Training Time**:
- - Models like Random Forest took longer to train but achieved higher accuracy compared to Naive Bayes.
-
-=== "Trade Off 2"
- - **Complexity vs. Interpretability**:
- - Simpler models like Naive Bayes were more interpretable but slightly less accurate.
+### 🖥 CODE EXPLANATION
----
+=== "Section 1"
+ - Data loading and preprocessing.
-### SCREENSHOTS
-
+=== "Section 2"
+ - Text vectorization using CountVectorizer.
-!!! success "Project flowchart"
-
- ``` mermaid
- graph LR
- A[Start] --> B[Load Dataset];
- B --> C[Preprocessing];
- C --> D[Train Models];
- D --> E{Compare Performance};
- E -->|Best Model| F[Deploy];
- E -->|Retry| C;
- ```
+=== "Section 3"
+ - Training models (MLP Classifier, MultinomialNB, BernoulliNB).
-??? tip "Confusion Matrix"
+=== "Section 4"
+ - Evaluating models using various metrics.
- === "SVM"
- 
+=== "Section 5"
+ - Visualizing confusion matrices and metric comparisons.
- === "Naive Bayes"
- 
+---
- === "Decision Tree"
- 
+### ⚖️ PROJECT TRADE-OFFS AND SOLUTIONS
- === "AdaBoost"
- 
+=== "Trade Off 1"
+ - Balancing accuracy and computational efficiency.
+ - Used Naive Bayes for speed and MLP for improved accuracy.
- === "Random Forest"
- 
+=== "Trade Off 2"
+ - Handling false positives vs. false negatives.
+ - Tuned models to improve precision for spam detection.
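
The page does not say how the models were tuned. One common way to trade false positives against false negatives is to move the decision threshold applied to a model's spam probability. The sketch below illustrates that idea only; the labels, scores, and the helper `precision_recall` are hypothetical.

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are flagged as spam."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_spam = score >= threshold
        if predicted_spam and label == 1:
            tp += 1  # spam correctly flagged
        elif predicted_spam and label == 0:
            fp += 1  # ham wrongly flagged (false positive)
        elif not predicted_spam and label == 1:
            fn += 1  # spam that slipped through (false negative)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels (1 = spam) and predicted spam probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.55, 0.60, 0.30, 0.10, 0.40, 0.05]

for threshold in (0.5, 0.7):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.7 removes the one false positive but lets more spam through, which is the trade-off in miniature.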
---
-### MODELS USED AND THEIR EVALUATION METRICS
+## 🎮 SCREENSHOTS
-| Model | Accuracy | Precision | Recall |
-|----------------------|----------|-----------|--------|
-| KNN | 90% | 89% | 88% |
-| Naive Bayes | 92% | 91% | 90% |
-| SVM | 94% | 93% | 91% |
-| Random Forest | 95% | 94% | 93% |
-| AdaBoost | 97% | 97% | 100% |
+!!! tip "Visualizations and EDA of different features"
----
-
-#### MODELS COMPARISON GRAPHS
+ === "Confusion Matrix comparison"
+ 
-!!! tip "Models Comparison Graphs"
- === "Accuracy Comparison"
- 
+??? example "Model performance graphs"
----
+ === "Metric comparison"
+ 
-### CONCLUSION
-#### WHAT YOU HAVE LEARNED
-
-!!! tip "Insights gained from the data"
- - Feature importance significantly impacts spam detection.
- - Simple models like Naive Bayes can achieve competitive performance.
+---
-??? tip "Improvements in understanding machine learning concepts"
- - Gained hands-on experience with classification models and model evaluation techniques.
+## 📉 MODELS USED AND THEIR EVALUATION METRICS
-??? tip "Challenges faced and how they were overcome"
- - Balancing between accuracy and training time was challenging, solved using model tuning.
+| Model | Accuracy | Precision | Recall | F1 Score |
+|----------------|----------|-----------|--------|----------|
+| MLP Classifier | 0.95 | 0.94 | 0.90 | 0.92 |
+| Multinomial NB | 0.93 | 0.91 | 0.88 | 0.89 |
+| Bernoulli NB | 0.92 | 0.89 | 0.85 | 0.87 |
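
For reference, the four metrics in the table are all derived from confusion-matrix counts. The sketch below shows the standard formulas; the counts passed in are made up for illustration and are not the counts behind the table, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts with spam as the positive class.
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because ham usually outnumbers spam, accuracy can look high even when recall on spam is modest, which is why precision, recall, and F1 are reported alongside it.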
---
-#### USE CASES OF THIS MODEL
-
-=== "Application 1"
+## ✅ CONCLUSION
- **Email Service Providers**
- - Automated filtering of spam emails for improved user experience.
+### 🔑 KEY LEARNINGS
-=== "Application 2"
+!!! tip "Insights gained from the data"
+ - Text length plays a role in spam detection.
+ - Certain words appear more frequently in spam emails.
- **Enterprise Email Security**
- - Used in enterprise software to detect phishing and spam emails.
+??? tip "Improvements in understanding machine learning concepts"
+ - Gained insights into text vectorization techniques.
+ - Understood trade-offs between different classification models.
---
-### FEATURES PLANNED BUT NOT IMPLEMENTED
+### 🌍 USE CASES
-=== "Feature 1"
+=== "Email Filtering Systems"
+ - Can be integrated into email services like Gmail and Outlook.
- - Integration of deep learning models (LSTM) for improved accuracy.
+=== "SMS Spam Detection"
+ - Used in mobile networks to block spam messages.
diff --git a/docs/natural-language-processing/index.md b/docs/natural-language-processing/index.md
index d8f53239..4c5e4736 100644
--- a/docs/natural-language-processing/index.md
+++ b/docs/natural-language-processing/index.md
@@ -29,7 +29,7 @@
-
+