From fe9dd837c17d7639d8462fc5d2d1eb7d07d13364 Mon Sep 17 00:00:00 2001
From: Piyush Chakarborthy
<102757301+Chakravartinsamrat@users.noreply.github.com>
Date: Sat, 25 Jan 2025 14:22:37 +0530
Subject: [PATCH 1/4] Create text_summarization.md
---
.../text_summarization.md | 191 ++++++++++++++++++
1 file changed, 191 insertions(+)
create mode 100644 docs/projects/natural-language-processing/text_summarization.md
diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md
new file mode 100644
index 00000000..17e62192
--- /dev/null
+++ b/docs/projects/natural-language-processing/text_summarization.md
@@ -0,0 +1,191 @@
+
+# Text Summarization
+
+### AIM
+Develop a model to summarize long articles into short, concise summaries.
+
+### DATASET LINK
+[CNN DailyMail News Dataset](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/)
+
+### NOTEBOOK LINK
+??? Abstract "Kaggle Notebook"
+
+
+### LIBRARIES NEEDED
+??? quote "LIBRARIES USED"
+
+ - pandas
+ - numpy
+ - scikit-learn
+ - matplotlib
+ - keras
+ - tensorflow
+ - spacy
+ - pytextrank
+    - TfidfVectorizer (scikit-learn)
+    - transformers (BART)
+---
+
+### DESCRIPTION
+
+??? info "What is the requirement of the project?"
+ - A robust system to summarize text efficiently is essential for handling large volumes of information.
+ - It helps users quickly grasp key insights without reading lengthy documents.
+
+??? info "Why is it necessary?"
+ - Large amounts of text can be overwhelming and time-consuming to process.
+ - Automated summarization improves productivity and aids decision-making in various fields like journalism, research, and customer support.
+
+??? info "How is it beneficial and used?"
+ - Provides a concise summary while preserving essential information.
+ - Used in news aggregation, academic research, and AI-powered assistants for quick content consumption.
+
+??? info "How did you start approaching this project? (Initial thoughts and planning)"
+ - Explored different text summarization techniques, including extractive and abstractive methods.
+ - Implemented models like TextRank, BART, and T5 to compare their effectiveness.
+
+??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)."
+ - Documentation from Hugging Face Transformers
+ - Research Paper: "Text Summarization using Deep Learning"
+ - Blog: "Introduction to NLP-based Summarization Techniques"
+
+---
+
+#### DETAILS OF THE DIFFERENT FEATURES
+The dataset contains features like sentence importance, word frequency, and linguistic structures that help in generating meaningful summaries.
+
+| Feature | Description |
+|----------------------|-------------------------------------------------|
+| `sentence_rank` | Rank of a sentence based on importance using TextRank |
+| `word_freq` | Frequency of key terms in the document |
+| `tf-idf_score` | Term Frequency-Inverse Document Frequency for words |
+| `summary_length` | Desired length of the summary |
+| `generated_summary` | AI-generated condensed version of the original text |
+
+---
+
+#### PROCEDURE
+
+=== "Step 1"
+
+ Exploratory Data Analysis:
+
+ - Loaded the CNN/DailyMail dataset using pandas.
+ - Explored dataset features like article and highlights, ensuring the correct format for summarization.
+ - Analyzed the distribution of articles and their corresponding summaries.
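+
+    A minimal sketch of this step (assuming the Kaggle CSV layout with `article` and `highlights` columns; the file path is illustrative):
+
+    ``` python
+    import pandas as pd
+
+    # Load one split of the CNN/DailyMail dataset
+    df = pd.read_csv("cnn_dailymail/test.csv")
+
+    # Inspect the two columns used for summarization
+    print(df[["article", "highlights"]].head())
+
+    # Compare article and summary lengths (in words)
+    df["article_len"] = df["article"].str.split().str.len()
+    df["highlights_len"] = df["highlights"].str.split().str.len()
+    print(df[["article_len", "highlights_len"]].describe())
+    ```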
+
+=== "Step 2"
+
+ Data cleaning and preprocessing:
+
+ - Removed unnecessary columns (like id) and checked for missing values.
+ - Tokenized articles into sentences and words, removing stopwords and special characters.
+ - Preprocessed the text using basic NLP techniques such as lowercasing, lemmatization, and removing non-alphanumeric characters.
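+
+    A minimal preprocessing sketch, assuming spaCy's small English pipeline (`en_core_web_sm`):
+
+    ``` python
+    import spacy
+
+    nlp = spacy.load("en_core_web_sm")
+
+    def preprocess(text):
+        # Lowercase, lemmatize, and drop stopwords and non-alphabetic tokens
+        doc = nlp(text.lower())
+        return " ".join(tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop)
+
+    print(preprocess("The Apollo program landed humans on the Moon."))
+    # apollo program land human moon
+    ```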
+
+=== "Step 3"
+
+ Feature engineering and selection:
+
+ - For TextRank-based summarization, calculated sentence similarity using TF-IDF (Term Frequency-Inverse Document Frequency) and Cosine Similarity.
+ - Selected top-ranked sentences based on their importance and relevance to the article.
+ - Applied transformers-based models like BART and T5 for abstractive summarization.
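+
+    One way to implement the extractive ranking described above (the helper name `rank_sentences` and the `top_k` parameter are illustrative):
+
+    ``` python
+    import numpy as np
+    from sklearn.feature_extraction.text import TfidfVectorizer
+    from sklearn.metrics.pairwise import cosine_similarity
+
+    def rank_sentences(sentences, top_k=3):
+        # Score each sentence by its total cosine similarity to all other sentences
+        tfidf = TfidfVectorizer().fit_transform(sentences)
+        sim = cosine_similarity(tfidf)
+        np.fill_diagonal(sim, 0.0)                      # ignore self-similarity
+        scores = sim.sum(axis=1)
+        top = sorted(np.argsort(scores)[::-1][:top_k])  # top sentences, original order
+        return [sentences[i] for i in top]
+    ```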
+
+=== "Step 4"
+
+ Model training and evaluation:
+
+ - For the TextRank summarization approach, created a similarity matrix based on TF-IDF and Cosine Similarity.
+ - For transformer-based methods, used Hugging Face's BART and T5 models, summarizing articles with their pre-trained weights.
+ - Evaluated the summarization models based on BLEU, ROUGE, and Cosine Similarity metrics.
+
+=== "Step 5"
+
+ Validation and testing:
+
+ - Tested both extractive and abstractive summarization models on unseen data to ensure generalizability.
+    - Plotted confusion matrices to visualize true positives, false positives, and false negatives, and to verify model performance.
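+
+    The notebook's exact evaluation code is not shown here; a sketch using Google's `rouge-score` package (an assumption) illustrates the ROUGE metric:
+
+    ``` python
+    from rouge_score import rouge_scorer  # pip install rouge-score
+
+    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
+
+    reference = "Apollo 11 landed the first humans on the Moon in 1969."
+    generated = "In 1969, Apollo 11 put the first humans on the Moon."
+
+    for name, score in scorer.score(reference, generated).items():
+        print(f"{name}: P={score.precision:.2f} R={score.recall:.2f} F1={score.fmeasure:.2f}")
+    ```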
+---
+
+#### PROJECT TRADE-OFFS AND SOLUTIONS
+
+=== "Trade-off 1"
+
+    The training dataset is over 1.2 GB, which is too large for local machines.
+
+    - **Solution**: Instead of training a model on the train split, used the smaller test split for training and validation.
+
+=== "Trade-off 2"
+
+ Transformer models (BART/T5) required high computational resources and long inference times for summarizing large articles.
+
+ - **Solution**: Model Pruning: Used smaller versions of transformer models (e.g., distilBART or distilT5) to reduce the computational load without compromising much on performance.
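+
+    A sketch of the distilled-model option, assuming the `sshleifer/distilbart-cnn-12-6` checkpoint from the Hugging Face Hub:
+
+    ``` python
+    from transformers import pipeline
+
+    # distilBART fine-tuned on CNN/DailyMail: smaller and faster than full-size BART
+    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
+
+    article = ("The Apollo program was a NASA initiative that landed humans on the Moon "
+               "between 1969 and 1972, with Apollo 11 being the first mission.")
+    print(summarizer(article, max_length=60, min_length=20, do_sample=False)[0]["summary_text"])
+    ```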
+
+=== "Trade-off 2"
+
+ TextRank summary might miss nuances and context, leading to less accurate or overly simplistic outputs compared to transformer-based models.
+
+    - **Solution**: Combined TextRank and Transformer-based summarization models in a hybrid approach to leverage the best of both worlds: speed from TextRank and accuracy from transformers.
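+
+    A sketch of the hybrid idea, reusing the illustrative `rank_sentences` helper from Step 3 (an assumption, not code from the notebook):
+
+    ``` python
+    from transformers import pipeline
+
+    summarizer = pipeline("summarization")
+
+    def hybrid_summary(article_sentences, top_k=10):
+        # 1) Cheap extractive pass keeps only the top-ranked sentences
+        reduced_text = " ".join(rank_sentences(article_sentences, top_k))
+        # 2) The expensive abstractive pass then runs on much shorter input
+        return summarizer(reduced_text, max_length=150, min_length=50,
+                          do_sample=False)[0]["summary_text"]
+    ```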
+
+
+---
+
+### SCREENSHOTS
+
+!!! success "Project flowchart"
+
+ ``` mermaid
+ graph LR
+ A[Start] --> B[Load Dataset]
+ B --> C[Preprocessing]
+ C --> D[TextRank + TF-IDF / Transformer Models]
+ D --> E{Compare Performance}
+ E -->|Best Model| F[Deploy]
+ E -->|Retry| C;
+ ```
+
+??? example "Confusion Matrix"
+
+ === "TF-IDF Confusion Matrix"
+ 
+
+ === "TextRank Confusion Matrix"
+ 
+
+ === "Transformers Confusion Matrix"
+ 
+
+
+### CONCLUSION
+
+#### KEY LEARNINGS
+
+!!! tip "Insights gained from the data"
+ - Data Complexity: News articles vary in length and structure, requiring different summarization techniques.
+ - Text Preprocessing: Cleaning text (e.g., stopword removal, tokenization) significantly improves summarization quality.
+ - Feature Extraction: Techniques like TF-IDF, TextRank, and Transformer embeddings help in effective text representation for summarization models.
+
+??? tip "Improvements in understanding machine learning concepts"
+ - Model Selection: Comparing extractive (TextRank, TF-IDF) and abstractive (Transformers) models to determine the best summarization approach.
+
+??? tip "Challenges faced and how they were overcome"
+ - Long Text Processing: Splitting lengthy articles into manageable sections before summarization.
+ - Computational Efficiency: Used batch processing and model optimization to handle large datasets efficiently.
+
+---
+
+#### USE CASES
+
+=== "Application 1"
+
+ **News Aggregation & Personalized Summaries**
+
+ - Automating news summarization helps users quickly grasp key events without reading lengthy articles.
+ - Used in news apps, digital assistants, and content curation platforms.
+
+=== "Application 2"
+
+ **Legal & Academic Document Summarization**
+
+ - Helps professionals extract critical insights from lengthy legal or research documents.
+ - Reduces the time needed for manual reading and analysis.
From 5bb5d49a7aa05ff9f8003819f7d57b0c9f160782 Mon Sep 17 00:00:00 2001
From: Piyush Chakarborthy
<102757301+Chakravartinsamrat@users.noreply.github.com>
Date: Sat, 25 Jan 2025 15:24:23 +0530
Subject: [PATCH 2/4] Added Emojis and Missing Section from Template
---
.../text_summarization.md | 97 ++++++++++++++-----
1 file changed, 73 insertions(+), 24 deletions(-)
diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md
index 17e62192..23a27d14 100644
--- a/docs/projects/natural-language-processing/text_summarization.md
+++ b/docs/projects/natural-language-processing/text_summarization.md
@@ -1,17 +1,17 @@
-# Text Summarization
+# 📜 Text Summarization
-### AIM
+### 🎯 AIM
Develop a model to summarize long articles into short, concise summaries.
-### DATASET LINK
+### 📊 DATASET LINK
[CNN DailyMail News Dataset](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/)
-### NOTEBOOK LINK
+### 📓 NOTEBOOK LINK
??? Abstract "Kaggle Notebook"
-### LIBRARIES NEEDED
+### ⚙️ LIBRARIES NEEDED
??? quote "LIBRARIES USED"
- pandas
@@ -26,7 +26,7 @@ Develop a model to summarize long articles into short, concise summaries.
- transformers (BART)
---
-### DESCRIPTION
+### 📝 DESCRIPTION
??? info "What is the requirement of the project?"
- A robust system to summarize text efficiently is essential for handling large volumes of information.
@@ -50,10 +50,22 @@ Develop a model to summarize long articles into short, concise summaries.
- Blog: "Introduction to NLP-based Summarization Techniques"
---
+## 🔍 EXPLANATION
+
+#### 🧩 DETAILS OF THE DIFFERENT FEATURES
+
+#### 📂 dataset.csv
-#### DETAILS OF THE DIFFERENT FEATURES
The dataset contains features like sentence importance, word frequency, and linguistic structures that help in generating meaningful summaries.
+| Feature Name | Description |
+|--------------|-------------|
+| `id` | A unique ID for each row |
+| `article` | The full news article from CNN/Daily Mail |
+| `highlights` | Key points of the article, used as the reference summary |
+
+#### 🛠 Developed Features
+
| Feature | Description |
|----------------------|-------------------------------------------------|
| `sentence_rank` | Rank of a sentence based on importance using TextRank |
@@ -63,6 +75,18 @@ The dataset contains features like sentence importance, word frequency, and ling
| `generated_summary` | AI-generated condensed version of the original text |
---
+### 🤖 PROJECT WORKFLOW
+!!! success "Project flowchart"
+
+ ``` mermaid
+ graph LR
+ A[Start] --> B[Load Dataset]
+ B --> C[Preprocessing]
+ C --> D[TextRank + TF-IDF / Transformer Models]
+ D --> E{Compare Performance}
+ E -->|Best Model| F[Deploy]
+ E -->|Retry| C;
+ ```
#### PROCEDURE
@@ -107,7 +131,44 @@ The dataset contains features like sentence importance, word frequency, and ling
- Plotted confusion matrices to visualize true positives, false positives, and false negatives, and to verify model performance.
---
-#### PROJECT TRADE-OFFS AND SOLUTIONS
+### 🖥 CODE EXPLANATION
+
+
+
+=== "TextRank algorithm"
+
+ Steps:
+
+ - Preprocessing: Tokenize the article into sentences and preprocess the text by removing stopwords and special characters.
+ - Similarity Matrix: Compute the similarity between sentences using Jaccard Similarity.
+ - Graph Construction: Build a graph where sentences are nodes and edges represent similarity scores.
+ - Ranking: Use the PageRank algorithm to rank sentences based on their importance.
+ - Summary Generation: Select the top-ranked sentences to form the summary.
+ - Evaluation: Compare the generated summary with the reference summary using a confusion matrix.
+
+
+=== "Transformers"
+
+ Steps:
+
+ - Preprocessing: Tokenize the article and preprocess the text.
+ - Model Application: Use the pipeline function from the transformers library to generate a summary.
+ - Evaluation: Compare the generated summary with the reference summary using a confusion matrix.
+
+
+=== "TTF-IDF Algorithm"
+
+ Steps:
+
+ - Preprocessing: Tokenize the article and preprocess the text.
+ - TF-IDF Calculation: Compute the Term Frequency-Inverse Document Frequency (TF-IDF) scores for words in the article.
+ - Sentence Scoring: Score sentences based on the TF-IDF values of the words they contain.
+ - Summary Generation: Select the top-scored sentences to form the summary.
+ - Evaluation: Compare the generated summary with the reference summary using a confusion matrix.
+
+---
+
+#### ⚖️ PROJECT TRADE-OFFS AND SOLUTIONS
=== "Trade-off 1"
@@ -130,19 +191,7 @@ The dataset contains features like sentence importance, word frequency, and ling
---
-### SCREENSHOTS
-
-!!! success "Project flowchart"
-
- ``` mermaid
- graph LR
- A[Start] --> B[Load Dataset]
- B --> C[Preprocessing]
- C --> D[TextRank + TF-IDF / Transformer Models]
- D --> E{Compare Performance}
- E -->|Best Model| F[Deploy]
- E -->|Retry| C;
- ```
+### 🖼 SCREENSHOTS
??? example "Confusion Matrix"
@@ -156,9 +205,9 @@ The dataset contains features like sentence importance, word frequency, and ling

-### CONCLUSION
+### ✅ CONCLUSION
-#### KEY LEARNINGS
+#### 🔑 KEY LEARNINGS
!!! tip "Insights gained from the data"
- Data Complexity: News articles vary in length and structure, requiring different summarization techniques.
@@ -174,7 +223,7 @@ The dataset contains features like sentence importance, word frequency, and ling
---
-#### USE CASES
+#### 🌍 USE CASES
=== "Application 1"
From 2c11f7c1d97239f974bfe855e16d1029c4426e2f Mon Sep 17 00:00:00 2001
From: Piyush Chakarborthy
<102757301+Chakravartinsamrat@users.noreply.github.com>
Date: Mon, 27 Jan 2025 14:53:15 +0530
Subject: [PATCH 3/4] Added Minor Changes
---
.../text_summarization.md | 24 ++++++++-----------
1 file changed, 10 insertions(+), 14 deletions(-)
diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md
index 23a27d14..08521c45 100644
--- a/docs/projects/natural-language-processing/text_summarization.md
+++ b/docs/projects/natural-language-processing/text_summarization.md
@@ -140,31 +140,27 @@ The dataset contains features like sentence importance, word frequency, and ling
Steps:
- Preprocessing: Tokenize the article into sentences and preprocess the text by removing stopwords and special characters.
- - Similarity Matrix: Compute the similarity between sentences using Jaccard Similarity.
- - Graph Construction: Build a graph where sentences are nodes and edges represent similarity scores.
- - Ranking: Use the PageRank algorithm to rank sentences based on their importance.
- - Summary Generation: Select the top-ranked sentences to form the summary.
- - Evaluation: Compare the generated summary with the reference summary using a confusion matrix.
+ - sentence_similarity(sent1, sent2) - Computes Jaccard similarity between two sentences.
+ - nx.pagerank(graph) - Computes sentence importance scores using the PageRank algorithm
=== "Transformers"
Steps:
- - Preprocessing: Tokenize the article and preprocess the text.
- - Model Application: Use the pipeline function from the transformers library to generate a summary.
- - Evaluation: Compare the generated summary with the reference summary using a confusion matrix.
+ - pipeline("summarization") - Initializes a pre-trained transformer model for summarization.
+ - generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False) - Generates a summary using a transformer model.
=== "TTF-IDF Algorithm"
Steps:
- - Preprocessing: Tokenize the article and preprocess the text.
- - TF-IDF Calculation: Compute the Term Frequency-Inverse Document Frequency (TF-IDF) scores for words in the article.
- - Sentence Scoring: Score sentences based on the TF-IDF values of the words they contain.
- - Summary Generation: Select the top-scored sentences to form the summary.
- - Evaluation: Compare the generated summary with the reference summary using a confusion matrix.
+ - TfidfVectorizer - Converts text into numerical feature vectors based on Term Frequency-Inverse Document Frequency (TF-IDF).
+ - vectorizer.fit_transform(processed_sentences): Transforms the processed text into a sparse matrix representation.
+ - Cosine Similarity (cosine_similarity) - Measures the similarity between text vectors based on their cosine angle.
+ - cosine_similarity(tfidf_matrix): Computes a similarity matrix between sentences.
+ - generate_summary() generates summary.
---
@@ -182,7 +178,7 @@ The dataset contains features like sentence importance, word frequency, and ling
- **Solution**: Model Pruning: Used smaller versions of transformer models (e.g., distilBART or distilT5) to reduce the computational load without compromising much on performance.
-=== "Trade-off 2"
+=== "Trade-off 3"
TextRank summary might miss nuances and context, leading to less accurate or overly simplistic outputs compared to transformer-based models.
From 6b034db3271fc28ab6df865bbb4707b894999d22 Mon Sep 17 00:00:00 2001
From: Piyush Chakarborthy
<102757301+Chakravartinsamrat@users.noreply.github.com>
Date: Mon, 27 Jan 2025 16:16:02 +0530
Subject: [PATCH 4/4] Tried my best to keep it short and meaningful
---
.../text_summarization.md | 59 ++++++++++++++-----
1 file changed, 45 insertions(+), 14 deletions(-)
diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md
index 08521c45..8fd40a40 100644
--- a/docs/projects/natural-language-processing/text_summarization.md
+++ b/docs/projects/natural-language-processing/text_summarization.md
@@ -137,30 +137,61 @@ The dataset contains features like sentence importance, word frequency, and ling
=== "TextRank algorithm"
- Steps:
-
- - Preprocessing: Tokenize the article into sentences and preprocess the text by removing stopwords and special characters.
- - sentence_similarity(sent1, sent2) - Computes Jaccard similarity between two sentences.
- - nx.pagerank(graph) - Computes sentence importance scores using the PageRank algorithm
+    Important functions:
+
+    ``` python
+    graph = nx.from_numpy_array(similarity_matrix)
+    scores = nx.pagerank(graph)
+    ```
+
+    Example Input:
+
+    ``` python
+    import networkx as nx
+    import numpy as np
+
+    similarity_matrix = np.array([
+        [0.0, 0.2, 0.1],   # Sentence 1
+        [0.2, 0.0, 0.3],   # Sentence 2
+        [0.1, 0.3, 0.0]])  # Sentence 3
+
+    graph = nx.from_numpy_array(similarity_matrix)  # sentences as nodes, similarities as edge weights
+    scores = nx.pagerank(graph)                     # importance score per sentence
+    ```
+
+    Output (approximate):
+
+    ``` python
+    {0: 0.25, 1: 0.45, 2: 0.30}  # sentence 2 (0.45) is ranked as the most important
+    ```
+
=== "Transformers"
- Steps:
+    Important functions:
- - pipeline("summarization") - Initializes a pre-trained transformer model for summarization.
- - generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False) - Generates a summary using a transformer model.
+ pipeline("summarization") - Initializes a pre-trained transformer model for summarization.
+ generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False)
+ This Generates a summary using a transformer model.
+
+ Example Input:
+ article = "The Apollo program was a NASA initiative that landed humans on the Moon between 1969 and 1972,
+ with Apollo 11 being the first mission."
+
+ Output:
+ The Apollo program was a NASA initiative that landed humans on the Moon between 1969 and 1972.
+ Apollo 11 was the first mission.
+
+
=== "TTF-IDF Algorithm"
- Steps:
+    Important functions:
- - TfidfVectorizer - Converts text into numerical feature vectors based on Term Frequency-Inverse Document Frequency (TF-IDF).
- - vectorizer.fit_transform(processed_sentences): Transforms the processed text into a sparse matrix representation.
- - Cosine Similarity (cosine_similarity) - Measures the similarity between text vectors based on their cosine angle.
- - cosine_similarity(tfidf_matrix): Computes a similarity matrix between sentences.
- - generate_summary() generates summary.
+    ``` python
+    vectorizer = TfidfVectorizer()
+    tfidf_matrix = vectorizer.fit_transform(processed_sentences)
+    ```
+
+    Example Input:
+
+    ``` python
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    processed_sentences = [
+        "apollo program nasa initiative landed humans moon 1969 1972",
+        "apollo 11 first mission land moon neil armstrong buzz aldrin walked surface",
+        "apollo program significant achievement space exploration cold war space race"]
+
+    vectorizer = TfidfVectorizer()
+    tfidf_matrix = vectorizer.fit_transform(processed_sentences)
+    print(vectorizer.get_feature_names_out())  # the learned vocabulary
+    ```
+
+    Output:
+
+    ```
+    ['11' '1969' '1972' 'achievement' 'aldrin' 'apollo' 'armstrong' 'buzz' 'cold'
+     'exploration' 'first' 'humans' 'initiative' 'land' 'landed' 'moon' 'nasa'
+     'neil' 'program' 'race' 'significant' 'space' 'surface' 'walked' 'war']
+    ```
---