From fe9dd837c17d7639d8462fc5d2d1eb7d07d13364 Mon Sep 17 00:00:00 2001 From: Piyush Chakarborthy <102757301+Chakravartinsamrat@users.noreply.github.com> Date: Sat, 25 Jan 2025 14:22:37 +0530 Subject: [PATCH 1/4] Create text_summarization.md --- .../text_summarization.md | 191 ++++++++++++++++++ 1 file changed, 191 insertions(+) create mode 100644 docs/projects/natural-language-processing/text_summarization.md diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md new file mode 100644 index 00000000..17e62192 --- /dev/null +++ b/docs/projects/natural-language-processing/text_summarization.md @@ -0,0 +1,191 @@ + +# Text Summarization + +### AIM +Develop a model to summarize long articles into short, concise summaries. + +### DATASET LINK +[CNN DailyMail News Dataset](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/) + +### NOTEBOOK LINK +??? Abstract "Kaggle Notebook" + + +### LIBRARIES NEEDED +??? quote "LIBRARIES USED" + + - pandas + - numpy + - scikit-learn + - matplotlib + - keras + - tensorflow + - spacy + - pytextrank + - TfidfVectorizer + - Transformer (Bart) +--- + +### DESCRIPTION + +??? info "What is the requirement of the project?" + - A robust system to summarize text efficiently is essential for handling large volumes of information. + - It helps users quickly grasp key insights without reading lengthy documents. + +??? info "Why is it necessary?" + - Large amounts of text can be overwhelming and time-consuming to process. + - Automated summarization improves productivity and aids decision-making in various fields like journalism, research, and customer support. + +??? info "How is it beneficial and used?" + - Provides a concise summary while preserving essential information. + - Used in news aggregation, academic research, and AI-powered assistants for quick content consumption. + +??? info "How did you start approaching this project? (Initial thoughts and planning)" + - Explored different text summarization techniques, including extractive and abstractive methods. + - Implemented models like TextRank, BART, and T5 to compare their effectiveness. + +??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." + - Documentation from Hugging Face Transformers + - Research Paper: "Text Summarization using Deep Learning" + - Blog: "Introduction to NLP-based Summarization Techniques" + +--- + +#### DETAILS OF THE DIFFERENT FEATURES +The dataset contains features like sentence importance, word frequency, and linguistic structures that help in generating meaningful summaries. + +| Feature | Description | +|----------------------|-------------------------------------------------| +| `sentence_rank` | Rank of a sentence based on importance using TextRank | +| `word_freq` | Frequency of key terms in the document | +| `tf-idf_score` | Term Frequency-Inverse Document Frequency for words | +| `summary_length` | Desired length of the summary | +| `generated_summary` | AI-generated condensed version of the original text | + +--- + +#### PROCEDURE + +=== "Step 1" + + Exploratory Data Analysis: + + - Loaded the CNN/DailyMail dataset using pandas. + - Explored dataset features like article and highlights, ensuring the correct format for summarization. + - Analyzed the distribution of articles and their corresponding summaries. 
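    A minimal sketch of this step (not the notebook's exact code; the file path assumes the Kaggle download layout, with `article` and `highlights` holding the full text and the reference summary):

    ``` python
    import pandas as pd

    # Path assumes the Kaggle "newspaper-text-summarization-cnn-dailymail" layout;
    # the test split is loaded here because it is the smallest file.
    df = pd.read_csv("cnn_dailymail/test.csv")

    print(df.shape)
    print(df.columns.tolist())  # expected: ['id', 'article', 'highlights']

    # Word-count distributions of articles vs. reference summaries.
    print(df["article"].str.split().str.len().describe())
    print(df["highlights"].str.split().str.len().describe())
    ```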
+ +=== "Step 2" + + Data cleaning and preprocessing: + + - Removed unnecessary columns (like id) and checked for missing values. + - Tokenized articles into sentences and words, removing stopwords and special characters. + - Preprocessed the text using basic NLP techniques such as lowercasing, lemmatization, and removing non-alphanumeric characters. + +=== "Step 3" + + Feature engineering and selection: + + - For TextRank-based summarization, calculated sentence similarity using TF-IDF (Term Frequency-Inverse Document Frequency) and Cosine Similarity. + - Selected top-ranked sentences based on their importance and relevance to the article. + - Applied transformers-based models like BART and T5 for abstractive summarization. + - Applied transformers-based models like BART and T5 for abstractive summarization. + +=== "Step 4" + + Model training and evaluation: + + - For the TextRank summarization approach, created a similarity matrix based on TF-IDF and Cosine Similarity. + - For transformer-based methods, used Hugging Face's BART and T5 models, summarizing articles with their pre-trained weights. + - Evaluated the summarization models based on BLEU, ROUGE, and Cosine Similarity metrics. + +=== "Step 5" + + Validation and testing: + + - Tested both extractive and abstractive summarization models on unseen data to ensure generalizability. + - Plotted confusion matrices to visualize True Positives, False Positives, and False Negatives, ensuring effective model performance. +--- + +#### PROJECT TRADE-OFFS AND SOLUTIONS + +=== "Trade-off 1" + + Training Dataset being over 1.2Gb, which is too large for local machines. + + - **Solution**: Instead of Training a model on train dataset, Used Test Dataset for training and validation. + +=== "Trade-off 2" + + Transformer models (BART/T5) required high computational resources and long inference times for summarizing large articles. + + - **Solution**: Model Pruning: Used smaller versions of transformer models (e.g., distilBART or distilT5) to reduce the computational load without compromising much on performance. + +=== "Trade-off 2" + + TextRank summary might miss nuances and context, leading to less accurate or overly simplistic outputs compared to transformer-based models. + + - **Solution**: Combined TextRank and Transformer-based summarization models in a hybrid approach to leverage the best of both worldsโ€”speed from TextRank and accuracy from transformers. + + +--- + +### SCREENSHOTS + +!!! success "Project flowchart" + + ``` mermaid + graph LR + A[Start] --> B[Load Dataset] + B --> C[Preprocessing] + C --> D[TextRank + TF-IDF / Transformer Models] + D --> E{Compare Performance} + E -->|Best Model| F[Deploy] + E -->|Retry| C; + ``` + +??? example "Confusion Matrix" + + === "TF-IDF Confusion Matrix" + ![tfidf](https://github.com/user-attachments/assets/28f257e1-2529-48f1-81e5-e058a50fb351) + + === "TextRank Confusion Matrix" + ![textrank](https://github.com/user-attachments/assets/cb748eff-e4f3-4096-ab2b-cf2e4b40186f) + + === "Transformers Confusion Matrix" + ![trans](https://github.com/user-attachments/assets/7e99887b-e225-4dd0-802d-f1c2b0e89bef) + + +### CONCLUSION + +#### KEY LEARNINGS + +!!! tip "Insights gained from the data" + - Data Complexity: News articles vary in length and structure, requiring different summarization techniques. + - Text Preprocessing: Cleaning text (e.g., stopword removal, tokenization) significantly improves summarization quality. 
+ - Feature Extraction: Techniques like TF-IDF, TextRank, and Transformer embeddings help in effective text representation for summarization models. + +??? tip "Improvements in understanding machine learning concepts" + - Model Selection: Comparing extractive (TextRank, TF-IDF) and abstractive (Transformers) models to determine the best summarization approach. + +??? tip "Challenges faced and how they were overcome" + - Long Text Processing: Splitting lengthy articles into manageable sections before summarization. + - Computational Efficiency: Used batch processing and model optimization to handle large datasets efficiently. + +--- + +#### USE CASES + +=== "Application 1" + + **News Aggregation & Personalized Summaries** + + - Automating news summarization helps users quickly grasp key events without reading lengthy articles. + - Used in news apps, digital assistants, and content curation platforms. + +=== "Application 2" + + **Legal & Academic Document Summarization** + + - Helps professionals extract critical insights from lengthy legal or research documents. + - Reduces the time needed for manual reading and analysis. From 5bb5d49a7aa05ff9f8003819f7d57b0c9f160782 Mon Sep 17 00:00:00 2001 From: Piyush Chakarborthy <102757301+Chakravartinsamrat@users.noreply.github.com> Date: Sat, 25 Jan 2025 15:24:23 +0530 Subject: [PATCH 2/4] Added Emojis and Missing Section from Template --- .../text_summarization.md | 97 ++++++++++++++----- 1 file changed, 73 insertions(+), 24 deletions(-) diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md index 17e62192..23a27d14 100644 --- a/docs/projects/natural-language-processing/text_summarization.md +++ b/docs/projects/natural-language-processing/text_summarization.md @@ -1,17 +1,17 @@ -# Text Summarization +# ๐Ÿ“œText Summarization -### AIM +### ๐ŸŽฏ AIM Develop a model to summarize long articles into short, concise summaries. -### DATASET LINK +### ๐Ÿ“Š DATASET LINK [CNN DailyMail News Dataset](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/) -### NOTEBOOK LINK +### ๐Ÿ““ NOTEBOOK LINK ??? Abstract "Kaggle Notebook" -### LIBRARIES NEEDED +### โš™๏ธ LIBRARIES NEEDED ??? quote "LIBRARIES USED" - pandas @@ -26,7 +26,7 @@ Develop a model to summarize long articles into short, concise summaries. - Transformer (Bart) --- -### DESCRIPTION +### ๐Ÿ“ DESCRIPTION ??? info "What is the requirement of the project?" - A robust system to summarize text efficiently is essential for handling large volumes of information. @@ -50,10 +50,22 @@ Develop a model to summarize long articles into short, concise summaries. - Blog: "Introduction to NLP-based Summarization Techniques" --- +## ๐Ÿ” EXPLANATION + +#### ๐Ÿงฉ DETAILS OF THE DIFFERENT FEATURES + +#### ๐Ÿ“‚ dataset.csv -#### DETAILS OF THE DIFFERENT FEATURES The dataset contains features like sentence importance, word frequency, and linguistic structures that help in generating meaningful summaries. 
+| Feature Name | Description | +|--------------|-------------| +| Id | A unique Id for each row | +| Article | Entire article written on CNN Daily mail | +| Highlights | Key Notes of the article | + +#### ๐Ÿ›  Developed Features + | Feature | Description | |----------------------|-------------------------------------------------| | `sentence_rank` | Rank of a sentence based on importance using TextRank | @@ -63,6 +75,18 @@ The dataset contains features like sentence importance, word frequency, and ling | `generated_summary` | AI-generated condensed version of the original text | --- +### ๐Ÿ›ค PROJECT WORKFLOW +!!! success "Project flowchart" + + ``` mermaid + graph LR + A[Start] --> B[Load Dataset] + B --> C[Preprocessing] + C --> D[TextRank + TF-IDF / Transformer Models] + D --> E{Compare Performance} + E -->|Best Model| F[Deploy] + E -->|Retry| C; + ``` #### PROCEDURE @@ -107,7 +131,44 @@ The dataset contains features like sentence importance, word frequency, and ling - Plotted confusion matrices to visualize True Positives, False Positives, and False Negatives, ensuring effective model performance. --- -#### PROJECT TRADE-OFFS AND SOLUTIONS +### ๐Ÿ–ฅ CODE EXPLANATION + + + +=== "TextRank algorithm" + + Steps: + + - Preprocessing: Tokenize the article into sentences and preprocess the text by removing stopwords and special characters. + - Similarity Matrix: Compute the similarity between sentences using Jaccard Similarity. + - Graph Construction: Build a graph where sentences are nodes and edges represent similarity scores. + - Ranking: Use the PageRank algorithm to rank sentences based on their importance. + - Summary Generation: Select the top-ranked sentences to form the summary. + - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + + +=== "Transformers" + + Steps: + + - Preprocessing: Tokenize the article and preprocess the text. + - Model Application: Use the pipeline function from the transformers library to generate a summary. + - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + + +=== "TTF-IDF Algorithm" + + Steps: + + - Preprocessing: Tokenize the article and preprocess the text. + - TF-IDF Calculation: Compute the Term Frequency-Inverse Document Frequency (TF-IDF) scores for words in the article. + - Sentence Scoring: Score sentences based on the TF-IDF values of the words they contain. + - Summary Generation: Select the top-scored sentences to form the summary. + - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + +--- + +#### โš–๏ธ PROJECT TRADE-OFFS AND SOLUTIONS === "Trade-off 1" @@ -130,19 +191,7 @@ The dataset contains features like sentence importance, word frequency, and ling --- -### SCREENSHOTS - -!!! success "Project flowchart" - - ``` mermaid - graph LR - A[Start] --> B[Load Dataset] - B --> C[Preprocessing] - C --> D[TextRank + TF-IDF / Transformer Models] - D --> E{Compare Performance} - E -->|Best Model| F[Deploy] - E -->|Retry| C; - ``` +### ๐Ÿ–ผ SCREENSHOTS ??? example "Confusion Matrix" @@ -156,9 +205,9 @@ The dataset contains features like sentence importance, word frequency, and ling ![trans](https://github.com/user-attachments/assets/7e99887b-e225-4dd0-802d-f1c2b0e89bef) -### CONCLUSION +### โœ…CONCLUSION -#### KEY LEARNINGS +#### ๐Ÿ”‘ KEY LEARNINGS !!! tip "Insights gained from the data" - Data Complexity: News articles vary in length and structure, requiring different summarization techniques. 
@@ -174,7 +223,7 @@ The dataset contains features like sentence importance, word frequency, and ling --- -#### USE CASES +#### ๐ŸŒ USE CASES === "Application 1" From 2c11f7c1d97239f974bfe855e16d1029c4426e2f Mon Sep 17 00:00:00 2001 From: Piyush Chakarborthy <102757301+Chakravartinsamrat@users.noreply.github.com> Date: Mon, 27 Jan 2025 14:53:15 +0530 Subject: [PATCH 3/4] Added Minor Changes --- .../text_summarization.md | 24 ++++++++----------- 1 file changed, 10 insertions(+), 14 deletions(-) diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md index 23a27d14..08521c45 100644 --- a/docs/projects/natural-language-processing/text_summarization.md +++ b/docs/projects/natural-language-processing/text_summarization.md @@ -140,31 +140,27 @@ The dataset contains features like sentence importance, word frequency, and ling Steps: - Preprocessing: Tokenize the article into sentences and preprocess the text by removing stopwords and special characters. - - Similarity Matrix: Compute the similarity between sentences using Jaccard Similarity. - - Graph Construction: Build a graph where sentences are nodes and edges represent similarity scores. - - Ranking: Use the PageRank algorithm to rank sentences based on their importance. - - Summary Generation: Select the top-ranked sentences to form the summary. - - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + - sentence_similarity(sent1, sent2) - Computes Jaccard similarity between two sentences. + - nx.pagerank(graph) - Computes sentence importance scores using the PageRank algorithm === "Transformers" Steps: - - Preprocessing: Tokenize the article and preprocess the text. - - Model Application: Use the pipeline function from the transformers library to generate a summary. - - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + - pipeline("summarization") - Initializes a pre-trained transformer model for summarization. + - generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False) - Generates a summary using a transformer model. === "TTF-IDF Algorithm" Steps: - - Preprocessing: Tokenize the article and preprocess the text. - - TF-IDF Calculation: Compute the Term Frequency-Inverse Document Frequency (TF-IDF) scores for words in the article. - - Sentence Scoring: Score sentences based on the TF-IDF values of the words they contain. - - Summary Generation: Select the top-scored sentences to form the summary. - - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + - TfidfVectorizer - Converts text into numerical feature vectors based on Term Frequency-Inverse Document Frequency (TF-IDF). + - vectorizer.fit_transform(processed_sentences): Transforms the processed text into a sparse matrix representation. + - Cosine Similarity (cosine_similarity) - Measures the similarity between text vectors based on their cosine angle. + - cosine_similarity(tfidf_matrix): Computes a similarity matrix between sentences. + - generate_summary() generates summary. --- @@ -182,7 +178,7 @@ The dataset contains features like sentence importance, word frequency, and ling - **Solution**: Model Pruning: Used smaller versions of transformer models (e.g., distilBART or distilT5) to reduce the computational load without compromising much on performance. 
-=== "Trade-off 2" +=== "Trade-off 3" TextRank summary might miss nuances and context, leading to less accurate or overly simplistic outputs compared to transformer-based models. From 6b034db3271fc28ab6df865bbb4707b894999d22 Mon Sep 17 00:00:00 2001 From: Piyush Chakarborthy <102757301+Chakravartinsamrat@users.noreply.github.com> Date: Mon, 27 Jan 2025 16:16:02 +0530 Subject: [PATCH 4/4] Tryed my best to keep it short and meaningful --- .../text_summarization.md | 59 ++++++++++++++----- 1 file changed, 45 insertions(+), 14 deletions(-) diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md index 08521c45..8fd40a40 100644 --- a/docs/projects/natural-language-processing/text_summarization.md +++ b/docs/projects/natural-language-processing/text_summarization.md @@ -137,30 +137,61 @@ The dataset contains features like sentence importance, word frequency, and ling === "TextRank algorithm" - Steps: - - - Preprocessing: Tokenize the article into sentences and preprocess the text by removing stopwords and special characters. - - sentence_similarity(sent1, sent2) - Computes Jaccard similarity between two sentences. - - nx.pagerank(graph) - Computes sentence importance scores using the PageRank algorithm + Important Function: + + graph = nx.from_numpy_array(similarity_matrix) + scores = nx.pagerank(graph) + + Example Input: + similarity_matrix = np.array([ + [0.0, 0.2, 0.1], # Sentence 1 + [0.2, 0.0, 0.3], # Sentence 2 + [0.1, 0.3, 0.0]]) # Sentence 3 + + graph = nx.from_numpy_array(similarity_matrix) + scores = nx.pagerank(graph) + + Output: + {0: 0.25, 1: 0.45, 2: 0.30} #That means sentence 2(0.45) has more importance than others + === "Transformers" - Steps: + Important Function: - - pipeline("summarization") - Initializes a pre-trained transformer model for summarization. - - generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False) - Generates a summary using a transformer model. + pipeline("summarization") - Initializes a pre-trained transformer model for summarization. + generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False) + This Generates a summary using a transformer model. + + Example Input: + article = "The Apollo program was a NASA initiative that landed humans on the Moon between 1969 and 1972, + with Apollo 11 being the first mission." + + Output: + The Apollo program was a NASA initiative that landed humans on the Moon between 1969 and 1972. + Apollo 11 was the first mission. + + === "TTF-IDF Algorithm" - Steps: + Important Function: - - TfidfVectorizer - Converts text into numerical feature vectors based on Term Frequency-Inverse Document Frequency (TF-IDF). - - vectorizer.fit_transform(processed_sentences): Transforms the processed text into a sparse matrix representation. - - Cosine Similarity (cosine_similarity) - Measures the similarity between text vectors based on their cosine angle. - - cosine_similarity(tfidf_matrix): Computes a similarity matrix between sentences. - - generate_summary() generates summary. 
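    Wired up end to end, these pieces might look like the sketch below; the `generate_summary` helper is an illustrative reconstruction, not the notebook's exact code:

    ``` python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def generate_summary(processed_sentences, top_n=2):
        # TF-IDF feature vectors, one row per sentence.
        tfidf_matrix = TfidfVectorizer().fit_transform(processed_sentences)
        # Pairwise cosine similarity between the sentence vectors.
        sim = cosine_similarity(tfidf_matrix)
        # Score each sentence by its total similarity to the others,
        # then keep the top_n sentences in their original order.
        scores = sim.sum(axis=1)
        ranked = sorted(range(len(processed_sentences)), key=lambda i: scores[i], reverse=True)
        return " ".join(processed_sentences[i] for i in sorted(ranked[:top_n]))

    processed_sentences = [
        "apollo program nasa initiative landed humans moon 1969 1972",
        "apollo 11 first mission land moon neil armstrong buzz aldrin walked surface",
        "apollo program significant achievement space exploration cold war space race",
    ]
    print(generate_summary(processed_sentences))
    ```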
+ vectorizer = TfidfVectorizer() + tfidf_matrix = vectorizer.fit_transform(processed_sentences) + + Example Input: + processed_sentences = [ + "apollo program nasa initiative landed humans moon 1969 1972", + "apollo 11 first mission land moon neil armstrong buzz aldrin walked surface", + "apollo program significant achievement space exploration cold war space race"] + + Output: + ['1969', '1972', 'achievement', 'aldrin', 'apollo', 'armstrong', 'buzz', 'cold', 'exploration', + 'first', 'humans', 'initiative', 'land', 'landed', 'moon', 'nasa', 'neil', 'program', 'race', + 'significant', 'space', 'surface', 'walked', 'war'] ---