diff --git a/docs/projects/deep-learning/handwritten-digit-classifier-CNN-Model.md b/docs/projects/deep-learning/handwritten-digit-classifier-CNN-Model.md new file mode 100644 index 00000000..37edeb78 --- /dev/null +++ b/docs/projects/deep-learning/handwritten-digit-classifier-CNN-Model.md @@ -0,0 +1,272 @@ +# Handwritten Digit Classifier + +### AIM + +To develop a Convolutional Neural Network (CNN) model for classifying handwritten digits with detailed explanations of CNN architecture and implementation using MNIST dataset. + +### DATASET LINK + +[MNIST Dataset](https://www.kaggle.com/code/imdevskp/digits-mnist-classification-using-cnn) +- Training Set: 60,000 images +- Test Set: 10,000 images +- Image Size: 28x28 pixels (grayscale) + +### LIBRARIES NEEDED + +??? quote "LIBRARIES USED" + ```python + import numpy as np + import pandas as pd + import matplotlib.pyplot as plt + import seaborn as sns + + from sklearn.model_selection import train_test_split + from sklearn.metrics import confusion_matrix + + import tensorflow as tf + from tensorflow.keras.models import Sequential + from tensorflow.keras.layers import Dense, Conv2D, MaxPool2D, Flatten, Dropout + from tensorflow.keras.preprocessing.image import ImageDataGenerator + from tensorflow.keras.optimizers import Adam + from tensorflow.keras.callbacks import EarlyStopping + ``` + +--- + +### DESCRIPTION + +!!! info "What is the requirement of the project?" + - Create a CNN model to classify handwritten digits (0-9) from the MNIST dataset + - Achieve high accuracy while preventing overfitting + - Provide comprehensive visualization of model performance + - Create an educational resource for understanding CNN implementation + +??? info "Technical Implementation Details" + ```python + # Load and preprocess data + (X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data() + + # Reshape and normalize data + X_train = X_train.reshape(X_train.shape[0], 28, 28, 1) + X_test = X_test.reshape(X_test.shape[0], 28, 28, 1) + + X_train = X_train.astype('float32') + X_test = X_test.astype('float32') + X_train /= 255 + X_test /= 255 + + # One-hot encode labels + y_train = tf.keras.utils.to_categorical(y_train, 10) + y_test = tf.keras.utils.to_categorical(y_test, 10) + ``` + +### Model Architecture +```python +model = Sequential([ + # First Convolutional Block + Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)), + MaxPool2D(2, 2), + + # Second Convolutional Block + Conv2D(64, (3, 3), activation='relu'), + MaxPool2D(2, 2), + + # Third Convolutional Block + Conv2D(64, (3, 3), activation='relu'), + + # Flatten and Dense Layers + Flatten(), + Dense(64, activation='relu'), + Dropout(0.5), + Dense(10, activation='softmax') +]) + +# Compile model +model.compile(optimizer='adam', + loss='categorical_crossentropy', + metrics=['accuracy']) +``` + +### Training Parameters +```python +# Data Augmentation +datagen = ImageDataGenerator( + rotation_range=10, + zoom_range=0.1, + width_shift_range=0.1, + height_shift_range=0.1 +) + +# Early Stopping +early_stopping = EarlyStopping( + monitor='val_loss', + patience=3, + restore_best_weights=True +) + +# Training +history = model.fit( + datagen.flow(X_train, y_train, batch_size=32), + epochs=20, + validation_data=(X_test, y_test), + callbacks=[early_stopping] +) +``` + +--- + +#### IMPLEMENTATION STEPS + +=== "Step 1" + + Data Preparation and Analysis + ```python + # Visualize sample images + plt.figure(figsize=(10, 10)) + for i in range(25): + plt.subplot(5, 5, i+1) + 
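        # Optional: label each tile with its digit class so the sample grid is easier to read.
        # Assumes y_train has already been one-hot encoded with to_categorical (as in the
        # preprocessing snippet above), so np.argmax recovers the original label.
        plt.title(str(np.argmax(y_train[i])), fontsize=8)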
plt.imshow(X_train[i].reshape(28, 28), cmap='gray') + plt.axis('off') + plt.show() + + # Check data distribution + plt.figure(figsize=(10, 5)) + plt.bar(range(10), [len(y_train[y_train == i]) for i in range(10)]) + plt.title('Distribution of digits in training set') + plt.xlabel('Digit') + plt.ylabel('Count') + plt.show() + ``` + +=== "Step 2" + + Model Training and Monitoring + ```python + # Plot training history + plt.figure(figsize=(12, 4)) + + plt.subplot(1, 2, 1) + plt.plot(history.history['loss'], label='Training Loss') + plt.plot(history.history['val_loss'], label='Validation Loss') + plt.title('Model Loss') + plt.legend() + + plt.subplot(1, 2, 2) + plt.plot(history.history['accuracy'], label='Training Accuracy') + plt.plot(history.history['val_accuracy'], label='Validation Accuracy') + plt.title('Model Accuracy') + plt.legend() + + plt.show() + ``` + +=== "Step 3" + + Model Evaluation + ```python + # Make predictions + y_pred = model.predict(X_test) + y_pred_classes = np.argmax(y_pred, axis=1) + y_test_classes = np.argmax(y_test, axis=1) + + # Create confusion matrix + conf_mat = confusion_matrix(y_test_classes, y_pred_classes) + + # Plot confusion matrix + plt.figure(figsize=(10, 8)) + sns.heatmap(conf_mat, annot=True, fmt='d', cmap='Blues') + plt.title('Confusion Matrix') + plt.xlabel('Predicted') + plt.ylabel('True') + plt.show() + ``` + +--- + +#### MODEL PERFORMANCE + +=== "Metrics" + + - Training Accuracy: 99.42% + - Validation Accuracy: 99.15% + - Test Accuracy: 99.23% + +=== "Analysis" + + - Model shows excellent performance with minimal overfitting + - Data augmentation and dropout effectively prevent overfitting + - Confusion matrix shows most misclassifications between similar digits (4/9, 3/8) + +#### CHALLENGES AND SOLUTIONS + +=== "Challenge 1" + + **Overfitting Prevention** + - Solution: Implemented data augmentation and dropout layers + ```python + datagen = ImageDataGenerator( + rotation_range=10, + zoom_range=0.1, + width_shift_range=0.1, + height_shift_range=0.1 + ) + ``` + +=== "Challenge 2" + + **Model Optimization** + - Solution: Used early stopping to prevent unnecessary training + ```python + early_stopping = EarlyStopping( + monitor='val_loss', + patience=3, + restore_best_weights=True + ) + ``` + +--- + +### CONCLUSION + +#### KEY LEARNINGS + +!!! tip "Technical Achievements" + - Successfully implemented CNN with 99%+ accuracy + - Effective use of data augmentation and regularization + - Proper model monitoring and optimization + +??? 
tip "Future Improvements" + - Experiment with different architectures (ResNet, VGG) + - Implement real-time prediction capability + - Add support for custom handwritten input + +#### APPLICATIONS + +=== "Application 1" + + - Postal code recognition systems + ```python + # Example prediction code + def predict_digit(image): + image = image.reshape(1, 28, 28, 1) + image = image.astype('float32') / 255 + prediction = model.predict(image) + return np.argmax(prediction) + ``` + +=== "Application 2" + + - Educational tools for machine learning + ```python + # Example visualization code + def visualize_predictions(images, predictions, actual): + plt.figure(figsize=(15, 5)) + for i in range(10): + plt.subplot(2, 5, i+1) + plt.imshow(images[i].reshape(28, 28), cmap='gray') + plt.title(f'Pred: {predictions[i]}\nTrue: {actual[i]}') + plt.axis('off') + plt.show() + ``` + +--- \ No newline at end of file diff --git a/docs/projects/deep-learning/index.md b/docs/projects/deep-learning/index.md index 7d210a0f..9e328398 100644 --- a/docs/projects/deep-learning/index.md +++ b/docs/projects/deep-learning/index.md @@ -12,5 +12,14 @@ + + + +
+    Handwritten Digit Classifier CNN Model
+    Deep learning algorithm for handwritten digit classification
+    📅 2025-01-29 | ⏱️ 10 mins
diff --git a/docs/projects/natural-language-processing/email_spam_detection.md b/docs/projects/natural-language-processing/email_spam_detection.md index 15bf34b5..f42ea0c3 100644 --- a/docs/projects/natural-language-processing/email_spam_detection.md +++ b/docs/projects/natural-language-processing/email_spam_detection.md @@ -1,204 +1,149 @@ +# 📜 Email Spam Classification System -# Email Spam Detection +
+ +
-### AIM -To develop a machine learning-based system that classifies email content as spam or ham (not spam). +## 🎯 AIM +To develop a machine learning-based system that accurately classifies email content as spam or legitimate (ham) using various classification algorithms and natural language processing techniques. -### DATASET LINK -[https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification](https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification) +## 📊 DATASET LINK +[Email Spam Classification Dataset (Kaggle)](https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification) +## 📓 NOTEBOOK +[Email Spam Detection Notebook (Kaggle)](https://www.kaggle.com/code/inshak9/email-spam-detection) -### NOTEBOOK LINK -[https://www.kaggle.com/code/inshak9/email-spam-detection](https://www.kaggle.com/code/inshak9/email-spam-detection) +## ⚙️ TECH STACK - -### LIBRARIES NEEDED - -??? quote "LIBRARIES USED" - - - pandas - - numpy - - scikit-learn - - matplotlib - - seaborn +| **Category** | **Technologies** | +|-------------------------|------------------------------------------------------| +| **Languages** | Python | +| **Libraries** | pandas, numpy, scikit-learn, matplotlib, seaborn | +| **Development Tools** | Jupyter Notebook, VS Code | +| **Version Control** | Git | --- -### DESCRIPTION -!!! info "What is the requirement of the project?" - - A robust system to detect spam emails is essential to combat increasing spam content. - - It improves user experience by automatically filtering unwanted messages. +## 📝 DESCRIPTION -??? info "Why is it necessary?" - - Spam emails consume resources, time, and may pose security risks like phishing. - - Helps organizations and individuals streamline their email communication. +!!! info "What is the requirement of the project?" + - Develop an automated system to detect and filter spam emails + - Create a robust classification model with high accuracy + - Implement feature engineering for email content analysis + - Build a scalable solution for real-time email classification ??? info "How is it beneficial and used?" - - Provides a quick and automated solution for spam classification. - - Used in email services, IT systems, and anti-spam software to filter messages. - -??? info "How did you start approaching this project? (Initial thoughts and planning)" - - Analyzed the dataset and prepared features. - - Implemented various machine learning models for comparison. - -??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." - - Documentation from [scikit-learn](https://scikit-learn.org) - - Blog: Introduction to Spam Classification with ML - ---- - -### EXPLANATION - -#### DETAILS OF THE DIFFERENT FEATURES -The dataset contains features like word frequency, capital letter counts, and others that help in distinguishing spam emails from ham. - -| Feature | Description | -|----------------------|-------------------------------------------------| -| `word_freq_x` | Frequency of specific words in the email body | -| `capital_run_length` | Length of consecutive capital letters | -| `char_freq` | Frequency of special characters like `;` and `$` | -| `is_spam` | Target variable (1 = Spam, 0 = Ham) | - ---- - -#### WHAT I HAVE DONE - -=== "Step 1" - - Initial data exploration and understanding: - - Loaded the dataset using pandas. - - Explored dataset features and target variable distribution. - -=== "Step 2" - - Data cleaning and preprocessing: - - Checked for missing values. 
- - Standardized features using scaling techniques. - -=== "Step 3" - - Feature engineering and selection: - - Extracted relevant features for spam classification. - - Used correlation matrix to select significant features. - -=== "Step 4" + - Protects users from phishing attempts and malicious content + - Saves time and resources by automatically filtering unwanted emails + - Improves email system efficiency and user experience + - Reduces security risks associated with spam emails + - Can be integrated into existing email services and security systems + +??? info "How did you start approaching this project?" + - Analyzed the dataset structure and characteristics + - Conducted exploratory data analysis to understand feature distributions + - Researched various ML algorithms suitable for text classification + - Implemented data preprocessing and feature engineering pipeline + - Developed and compared multiple classification models + +??? info "Additional resources used" + - scikit-learn official documentation + - "Email Spam Filtering: An Implementation with Python and Scikit-learn" (Medium article) + - "Introduction to Machine Learning with Python" (Book, Chapters 3-5) + - Research paper: "A Comparative Study of Spam Detection using Machine Learning" - Model training and evaluation: - - Trained models: KNN, Naive Bayes, SVM, and Random Forest. - - Evaluated models using accuracy, precision, and recall. - -=== "Step 5" - - Model optimization and fine-tuning: - - Tuned hyperparameters using GridSearchCV. +--- -=== "Step 6" +## 🔍 EXPLANATION - Validation and testing: - - Tested models on unseen data to check performance. +### 🧩 DETAILS OF THE DIFFERENT FEATURES ---- +#### 📂 spam_classification.csv -#### PROJECT TRADE-OFFS AND SOLUTIONS +| Feature Name | Description | +|----------------------|-------------------------------------------------------| +| word_freq_x | Frequency of specific words in email content | +| char_freq_x | Frequency of specific characters | +| capital_run_length | Statistics about capital letters usage | +| is_spam | Target variable (1 = Spam, 0 = Ham) | -=== "Trade Off 1" - - **Accuracy vs. Training Time**: - - Models like Random Forest took longer to train but achieved higher accuracy compared to Naive Bayes. +#### 🛠 Developed Features -=== "Trade Off 2" - - **Complexity vs. Interpretability**: - - Simpler models like Naive Bayes were more interpretable but slightly less accurate. +| Feature Name | Description | Reason | +|----------------------|------------------------------------------------|---------------------------------------| +| text_length | Total length of email content | Spam often has distinct length patterns| +| special_char_ratio | Ratio of special characters to total chars | Indicator of suspicious formatting | +| capital_ratio | Proportion of capital letters | Spam often uses excessive capitals | ---- +--- -### SCREENSHOTS - +### 🛤 PROJECT WORKFLOW -!!! success "Project flowchart" +!!! success "Project workflow" ``` mermaid - graph LR - A[Start] --> B[Load Dataset]; - B --> C[Preprocessing]; - C --> D[Train Models]; - D --> E{Compare Performance}; - E -->|Best Model| F[Deploy]; - E -->|Retry| C; + graph TD + A[Data Collection] --> B[Data Preprocessing] + B --> C[Feature Engineering] + C --> D[Model Selection] + D --> E[Model Training] + E --> F[Model Evaluation] + F --> G{Performance Check} + G -->|Satisfactory| H[Model Deployment] + G -->|Need Improvement| D + H --> I[Real-time Classification] ``` -??? 
tip "Confusion Matrix" - - === "SVM" - ![Confusion Matrix - SVM](https://github.com/user-attachments/assets/5abda820-040a-4ea8-b389-cd114d329c62) - - === "Naive Bayes" - ![Confusion Matrix - Naive Bayes](https://github.com/user-attachments/assets/bdae9210-9b9b-45c7-9371-36c0a66a9184) - - === "Decision Tree" - ![Confusion Matrix - Decision Tree](https://github.com/user-attachments/assets/8e92fc53-4aff-4973-b0a1-b65a7fc4a79e) - - === "AdaBoost" - ![Confusion Matrix - AdaBoost](https://github.com/user-attachments/assets/043692e3-f733-419c-9fb2-834f2e199506) - - === "Random Forest" - ![Confusion Matrix - Random Forest](https://github.com/user-attachments/assets/5c689f57-9ec5-4e49-9ef5-3537825ac772) - ---- - -### MODELS USED AND THEIR EVALUATION METRICS - -| Model | Accuracy | Precision | Recall | -|----------------------|----------|-----------|--------| -| KNN | 90% | 89% | 88% | -| Naive Bayes | 92% | 91% | 90% | -| SVM | 94% | 93% | 91% | -| Random Forest | 95% | 94% | 93% | -| AdaBoost | 97% | 97% | 100% | - ---- - -#### MODELS COMPARISON GRAPHS - -!!! tip "Models Comparison Graphs" - - === "Accuracy Comparison" - ![Model accracy comparison](https://github.com/user-attachments/assets/1e17844d-e953-4eb0-a24d-b3dbc727db93) - ---- - -### CONCLUSION - -#### WHAT YOU HAVE LEARNED - -!!! tip "Insights gained from the data" - - Feature importance significantly impacts spam detection. - - Simple models like Naive Bayes can achieve competitive performance. +### 🖥 CODE EXPLANATION -??? tip "Improvements in understanding machine learning concepts" - - Gained hands-on experience with classification models and model evaluation techniques. +=== "Data Preprocessing" + - Implemented text cleaning and normalization + - Handled missing values and outliers + - Performed feature scaling and encoding -??? tip "Challenges faced and how they were overcome" - - Balancing between accuracy and training time was challenging, solved using model tuning. +=== "Model Development" + - Created model training pipeline + - Implemented cross-validation + - Applied hyperparameter tuning + - Developed ensemble methods ---- +### ⚖️ PROJECT TRADE-OFFS AND SOLUTIONS -#### USE CASES OF THIS MODEL +=== "Accuracy vs. Speed" + - Trade-off: Complex models achieved higher accuracy but slower processing + - Solution: Implemented model optimization and feature selection to balance performance -=== "Application 1" +=== "Precision vs. Recall" + - Trade-off: Stricter spam detection reduced false positives but increased false negatives + - Solution: Tuned model thresholds to achieve optimal F1-score - **Email Service Providers** - - Automated filtering of spam emails for improved user experience. +## 📉 MODELS USED AND THEIR EVALUATION METRICS -=== "Application 2" +| Model | Accuracy | Precision | Recall | F1-Score | +|----------------|----------|-----------|---------|----------| +| Naive Bayes | 92% | 91% | 90% | 90.5% | +| SVM | 94% | 93% | 91% | 92% | +| Random Forest | 95% | 94% | 93% | 93.5% | +| AdaBoost | 97% | 97% | 100% | 98.5% | - **Enterprise Email Security** - - Used in enterprise software to detect phishing and spam emails. +## ✅ CONCLUSION ---- +### 🔑 KEY LEARNINGS -### FEATURES PLANNED BUT NOT IMPLEMENTED +!!! 
tip "Technical Insights" + - Feature engineering significantly impacts classification accuracy + - Ensemble methods generally outperform single models + - Model tuning is crucial for optimal performance + - Real-world email patterns require regular model updates -=== "Feature 1" +### 🌍 USE CASES - - Integration of deep learning models (LSTM) for improved accuracy. +=== "Email Service Providers" + - Integration with email servers for automatic spam filtering + - Real-time classification of incoming emails + - Customizable spam detection thresholds +=== "Enterprise Security" + - Protection against phishing attempts + - Reduction of spam-related productivity loss + - Integration with existing security infrastructure diff --git a/docs/projects/natural-language-processing/index.md b/docs/projects/natural-language-processing/index.md index b64b4bf8..cd78479e 100644 --- a/docs/projects/natural-language-processing/index.md +++ b/docs/projects/natural-language-processing/index.md @@ -11,5 +11,14 @@

📅 2025-01-21 | ⏱️ 15 mins

+    Chatbot Illustration
+    Spam Email Classification
+    Classifying spam emails with NLP techniques and machine learning algorithms.
+    📅 2025-01-29 | ⏱️ 15 mins