From fe9dd837c17d7639d8462fc5d2d1eb7d07d13364 Mon Sep 17 00:00:00 2001 From: Piyush Chakarborthy <102757301+Chakravartinsamrat@users.noreply.github.com> Date: Sat, 25 Jan 2025 14:22:37 +0530 Subject: [PATCH 1/4] Create text_summarization.md --- .../text_summarization.md | 191 ++++++++++++++++++ 1 file changed, 191 insertions(+) create mode 100644 docs/projects/natural-language-processing/text_summarization.md diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md new file mode 100644 index 00000000..17e62192 --- /dev/null +++ b/docs/projects/natural-language-processing/text_summarization.md @@ -0,0 +1,191 @@ + +# Text Summarization + +### AIM +Develop a model to summarize long articles into short, concise summaries. + +### DATASET LINK +[CNN DailyMail News Dataset](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/) + +### NOTEBOOK LINK +??? Abstract "Kaggle Notebook" + + +### LIBRARIES NEEDED +??? quote "LIBRARIES USED" + + - pandas + - numpy + - scikit-learn + - matplotlib + - keras + - tensorflow + - spacy + - pytextrank + - TfidfVectorizer + - Transformer (Bart) +--- + +### DESCRIPTION + +??? info "What is the requirement of the project?" + - A robust system to summarize text efficiently is essential for handling large volumes of information. + - It helps users quickly grasp key insights without reading lengthy documents. + +??? info "Why is it necessary?" + - Large amounts of text can be overwhelming and time-consuming to process. + - Automated summarization improves productivity and aids decision-making in various fields like journalism, research, and customer support. + +??? info "How is it beneficial and used?" + - Provides a concise summary while preserving essential information. + - Used in news aggregation, academic research, and AI-powered assistants for quick content consumption. + +??? info "How did you start approaching this project? (Initial thoughts and planning)" + - Explored different text summarization techniques, including extractive and abstractive methods. + - Implemented models like TextRank, BART, and T5 to compare their effectiveness. + +??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)." + - Documentation from Hugging Face Transformers + - Research Paper: "Text Summarization using Deep Learning" + - Blog: "Introduction to NLP-based Summarization Techniques" + +--- + +#### DETAILS OF THE DIFFERENT FEATURES +The dataset contains features like sentence importance, word frequency, and linguistic structures that help in generating meaningful summaries. + +| Feature | Description | +|----------------------|-------------------------------------------------| +| `sentence_rank` | Rank of a sentence based on importance using TextRank | +| `word_freq` | Frequency of key terms in the document | +| `tf-idf_score` | Term Frequency-Inverse Document Frequency for words | +| `summary_length` | Desired length of the summary | +| `generated_summary` | AI-generated condensed version of the original text | + +--- + +#### PROCEDURE + +=== "Step 1" + + Exploratory Data Analysis: + + - Loaded the CNN/DailyMail dataset using pandas. + - Explored dataset features like article and highlights, ensuring the correct format for summarization. + - Analyzed the distribution of articles and their corresponding summaries. 
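    A minimal sketch of this step (not the notebook's exact code; the file path assumes the Kaggle download layout, with `article` and `highlights` holding the full text and the reference summary):

    ``` python
    import pandas as pd

    # Path assumes the Kaggle "newspaper-text-summarization-cnn-dailymail" layout;
    # the test split is loaded here because it is the smallest file.
    df = pd.read_csv("cnn_dailymail/test.csv")

    print(df.shape)
    print(df.columns.tolist())  # expected: ['id', 'article', 'highlights']

    # Word-count distributions of articles vs. reference summaries.
    print(df["article"].str.split().str.len().describe())
    print(df["highlights"].str.split().str.len().describe())
    ```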
+ +=== "Step 2" + + Data cleaning and preprocessing: + + - Removed unnecessary columns (like id) and checked for missing values. + - Tokenized articles into sentences and words, removing stopwords and special characters. + - Preprocessed the text using basic NLP techniques such as lowercasing, lemmatization, and removing non-alphanumeric characters. + +=== "Step 3" + + Feature engineering and selection: + + - For TextRank-based summarization, calculated sentence similarity using TF-IDF (Term Frequency-Inverse Document Frequency) and Cosine Similarity. + - Selected top-ranked sentences based on their importance and relevance to the article. + - Applied transformers-based models like BART and T5 for abstractive summarization. + - Applied transformers-based models like BART and T5 for abstractive summarization. + +=== "Step 4" + + Model training and evaluation: + + - For the TextRank summarization approach, created a similarity matrix based on TF-IDF and Cosine Similarity. + - For transformer-based methods, used Hugging Face's BART and T5 models, summarizing articles with their pre-trained weights. + - Evaluated the summarization models based on BLEU, ROUGE, and Cosine Similarity metrics. + +=== "Step 5" + + Validation and testing: + + - Tested both extractive and abstractive summarization models on unseen data to ensure generalizability. + - Plotted confusion matrices to visualize True Positives, False Positives, and False Negatives, ensuring effective model performance. +--- + +#### PROJECT TRADE-OFFS AND SOLUTIONS + +=== "Trade-off 1" + + Training Dataset being over 1.2Gb, which is too large for local machines. + + - **Solution**: Instead of Training a model on train dataset, Used Test Dataset for training and validation. + +=== "Trade-off 2" + + Transformer models (BART/T5) required high computational resources and long inference times for summarizing large articles. + + - **Solution**: Model Pruning: Used smaller versions of transformer models (e.g., distilBART or distilT5) to reduce the computational load without compromising much on performance. + +=== "Trade-off 2" + + TextRank summary might miss nuances and context, leading to less accurate or overly simplistic outputs compared to transformer-based models. + + - **Solution**: Combined TextRank and Transformer-based summarization models in a hybrid approach to leverage the best of both worldsโ€”speed from TextRank and accuracy from transformers. + + +--- + +### SCREENSHOTS + +!!! success "Project flowchart" + + ``` mermaid + graph LR + A[Start] --> B[Load Dataset] + B --> C[Preprocessing] + C --> D[TextRank + TF-IDF / Transformer Models] + D --> E{Compare Performance} + E -->|Best Model| F[Deploy] + E -->|Retry| C; + ``` + +??? example "Confusion Matrix" + + === "TF-IDF Confusion Matrix" + ![tfidf](https://github.com/user-attachments/assets/28f257e1-2529-48f1-81e5-e058a50fb351) + + === "TextRank Confusion Matrix" + ![textrank](https://github.com/user-attachments/assets/cb748eff-e4f3-4096-ab2b-cf2e4b40186f) + + === "Transformers Confusion Matrix" + ![trans](https://github.com/user-attachments/assets/7e99887b-e225-4dd0-802d-f1c2b0e89bef) + + +### CONCLUSION + +#### KEY LEARNINGS + +!!! tip "Insights gained from the data" + - Data Complexity: News articles vary in length and structure, requiring different summarization techniques. + - Text Preprocessing: Cleaning text (e.g., stopword removal, tokenization) significantly improves summarization quality. 
+ - Feature Extraction: Techniques like TF-IDF, TextRank, and Transformer embeddings help in effective text representation for summarization models. + +??? tip "Improvements in understanding machine learning concepts" + - Model Selection: Comparing extractive (TextRank, TF-IDF) and abstractive (Transformers) models to determine the best summarization approach. + +??? tip "Challenges faced and how they were overcome" + - Long Text Processing: Splitting lengthy articles into manageable sections before summarization. + - Computational Efficiency: Used batch processing and model optimization to handle large datasets efficiently. + +--- + +#### USE CASES + +=== "Application 1" + + **News Aggregation & Personalized Summaries** + + - Automating news summarization helps users quickly grasp key events without reading lengthy articles. + - Used in news apps, digital assistants, and content curation platforms. + +=== "Application 2" + + **Legal & Academic Document Summarization** + + - Helps professionals extract critical insights from lengthy legal or research documents. + - Reduces the time needed for manual reading and analysis. From 5bb5d49a7aa05ff9f8003819f7d57b0c9f160782 Mon Sep 17 00:00:00 2001 From: Piyush Chakarborthy <102757301+Chakravartinsamrat@users.noreply.github.com> Date: Sat, 25 Jan 2025 15:24:23 +0530 Subject: [PATCH 2/4] Added Emojis and Missing Section from Template --- .../text_summarization.md | 97 ++++++++++++++----- 1 file changed, 73 insertions(+), 24 deletions(-) diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md index 17e62192..23a27d14 100644 --- a/docs/projects/natural-language-processing/text_summarization.md +++ b/docs/projects/natural-language-processing/text_summarization.md @@ -1,17 +1,17 @@ -# Text Summarization +# ๐Ÿ“œText Summarization -### AIM +### ๐ŸŽฏ AIM Develop a model to summarize long articles into short, concise summaries. -### DATASET LINK +### ๐Ÿ“Š DATASET LINK [CNN DailyMail News Dataset](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/) -### NOTEBOOK LINK +### ๐Ÿ““ NOTEBOOK LINK ??? Abstract "Kaggle Notebook" -### LIBRARIES NEEDED +### โš™๏ธ LIBRARIES NEEDED ??? quote "LIBRARIES USED" - pandas @@ -26,7 +26,7 @@ Develop a model to summarize long articles into short, concise summaries. - Transformer (Bart) --- -### DESCRIPTION +### ๐Ÿ“ DESCRIPTION ??? info "What is the requirement of the project?" - A robust system to summarize text efficiently is essential for handling large volumes of information. @@ -50,10 +50,22 @@ Develop a model to summarize long articles into short, concise summaries. - Blog: "Introduction to NLP-based Summarization Techniques" --- +## ๐Ÿ” EXPLANATION + +#### ๐Ÿงฉ DETAILS OF THE DIFFERENT FEATURES + +#### ๐Ÿ“‚ dataset.csv -#### DETAILS OF THE DIFFERENT FEATURES The dataset contains features like sentence importance, word frequency, and linguistic structures that help in generating meaningful summaries. 
+| Feature Name | Description | +|--------------|-------------| +| Id | A unique Id for each row | +| Article | Entire article written on CNN Daily mail | +| Highlights | Key Notes of the article | + +#### ๐Ÿ›  Developed Features + | Feature | Description | |----------------------|-------------------------------------------------| | `sentence_rank` | Rank of a sentence based on importance using TextRank | @@ -63,6 +75,18 @@ The dataset contains features like sentence importance, word frequency, and ling | `generated_summary` | AI-generated condensed version of the original text | --- +### ๐Ÿ›ค PROJECT WORKFLOW +!!! success "Project flowchart" + + ``` mermaid + graph LR + A[Start] --> B[Load Dataset] + B --> C[Preprocessing] + C --> D[TextRank + TF-IDF / Transformer Models] + D --> E{Compare Performance} + E -->|Best Model| F[Deploy] + E -->|Retry| C; + ``` #### PROCEDURE @@ -107,7 +131,44 @@ The dataset contains features like sentence importance, word frequency, and ling - Plotted confusion matrices to visualize True Positives, False Positives, and False Negatives, ensuring effective model performance. --- -#### PROJECT TRADE-OFFS AND SOLUTIONS +### ๐Ÿ–ฅ CODE EXPLANATION + + + +=== "TextRank algorithm" + + Steps: + + - Preprocessing: Tokenize the article into sentences and preprocess the text by removing stopwords and special characters. + - Similarity Matrix: Compute the similarity between sentences using Jaccard Similarity. + - Graph Construction: Build a graph where sentences are nodes and edges represent similarity scores. + - Ranking: Use the PageRank algorithm to rank sentences based on their importance. + - Summary Generation: Select the top-ranked sentences to form the summary. + - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + + +=== "Transformers" + + Steps: + + - Preprocessing: Tokenize the article and preprocess the text. + - Model Application: Use the pipeline function from the transformers library to generate a summary. + - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + + +=== "TTF-IDF Algorithm" + + Steps: + + - Preprocessing: Tokenize the article and preprocess the text. + - TF-IDF Calculation: Compute the Term Frequency-Inverse Document Frequency (TF-IDF) scores for words in the article. + - Sentence Scoring: Score sentences based on the TF-IDF values of the words they contain. + - Summary Generation: Select the top-scored sentences to form the summary. + - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + +--- + +#### โš–๏ธ PROJECT TRADE-OFFS AND SOLUTIONS === "Trade-off 1" @@ -130,19 +191,7 @@ The dataset contains features like sentence importance, word frequency, and ling --- -### SCREENSHOTS - -!!! success "Project flowchart" - - ``` mermaid - graph LR - A[Start] --> B[Load Dataset] - B --> C[Preprocessing] - C --> D[TextRank + TF-IDF / Transformer Models] - D --> E{Compare Performance} - E -->|Best Model| F[Deploy] - E -->|Retry| C; - ``` +### ๐Ÿ–ผ SCREENSHOTS ??? example "Confusion Matrix" @@ -156,9 +205,9 @@ The dataset contains features like sentence importance, word frequency, and ling ![trans](https://github.com/user-attachments/assets/7e99887b-e225-4dd0-802d-f1c2b0e89bef) -### CONCLUSION +### โœ…CONCLUSION -#### KEY LEARNINGS +#### ๐Ÿ”‘ KEY LEARNINGS !!! tip "Insights gained from the data" - Data Complexity: News articles vary in length and structure, requiring different summarization techniques. 
@@ -174,7 +223,7 @@ The dataset contains features like sentence importance, word frequency, and ling --- -#### USE CASES +#### ๐ŸŒ USE CASES === "Application 1" From 2c11f7c1d97239f974bfe855e16d1029c4426e2f Mon Sep 17 00:00:00 2001 From: Piyush Chakarborthy <102757301+Chakravartinsamrat@users.noreply.github.com> Date: Mon, 27 Jan 2025 14:53:15 +0530 Subject: [PATCH 3/4] Added Minor Changes --- .../text_summarization.md | 24 ++++++++----------- 1 file changed, 10 insertions(+), 14 deletions(-) diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md index 23a27d14..08521c45 100644 --- a/docs/projects/natural-language-processing/text_summarization.md +++ b/docs/projects/natural-language-processing/text_summarization.md @@ -140,31 +140,27 @@ The dataset contains features like sentence importance, word frequency, and ling Steps: - Preprocessing: Tokenize the article into sentences and preprocess the text by removing stopwords and special characters. - - Similarity Matrix: Compute the similarity between sentences using Jaccard Similarity. - - Graph Construction: Build a graph where sentences are nodes and edges represent similarity scores. - - Ranking: Use the PageRank algorithm to rank sentences based on their importance. - - Summary Generation: Select the top-ranked sentences to form the summary. - - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + - sentence_similarity(sent1, sent2) - Computes Jaccard similarity between two sentences. + - nx.pagerank(graph) - Computes sentence importance scores using the PageRank algorithm === "Transformers" Steps: - - Preprocessing: Tokenize the article and preprocess the text. - - Model Application: Use the pipeline function from the transformers library to generate a summary. - - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + - pipeline("summarization") - Initializes a pre-trained transformer model for summarization. + - generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False) - Generates a summary using a transformer model. === "TTF-IDF Algorithm" Steps: - - Preprocessing: Tokenize the article and preprocess the text. - - TF-IDF Calculation: Compute the Term Frequency-Inverse Document Frequency (TF-IDF) scores for words in the article. - - Sentence Scoring: Score sentences based on the TF-IDF values of the words they contain. - - Summary Generation: Select the top-scored sentences to form the summary. - - Evaluation: Compare the generated summary with the reference summary using a confusion matrix. + - TfidfVectorizer - Converts text into numerical feature vectors based on Term Frequency-Inverse Document Frequency (TF-IDF). + - vectorizer.fit_transform(processed_sentences): Transforms the processed text into a sparse matrix representation. + - Cosine Similarity (cosine_similarity) - Measures the similarity between text vectors based on their cosine angle. + - cosine_similarity(tfidf_matrix): Computes a similarity matrix between sentences. + - generate_summary() generates summary. --- @@ -182,7 +178,7 @@ The dataset contains features like sentence importance, word frequency, and ling - **Solution**: Model Pruning: Used smaller versions of transformer models (e.g., distilBART or distilT5) to reduce the computational load without compromising much on performance. 
-=== "Trade-off 2" +=== "Trade-off 3" TextRank summary might miss nuances and context, leading to less accurate or overly simplistic outputs compared to transformer-based models. From 6b034db3271fc28ab6df865bbb4707b894999d22 Mon Sep 17 00:00:00 2001 From: Piyush Chakarborthy <102757301+Chakravartinsamrat@users.noreply.github.com> Date: Mon, 27 Jan 2025 16:16:02 +0530 Subject: [PATCH 4/4] Tryed my best to keep it short and meaningful --- .../text_summarization.md | 59 ++++++++++++++----- 1 file changed, 45 insertions(+), 14 deletions(-) diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md index 08521c45..8fd40a40 100644 --- a/docs/projects/natural-language-processing/text_summarization.md +++ b/docs/projects/natural-language-processing/text_summarization.md @@ -137,30 +137,61 @@ The dataset contains features like sentence importance, word frequency, and ling === "TextRank algorithm" - Steps: - - - Preprocessing: Tokenize the article into sentences and preprocess the text by removing stopwords and special characters. - - sentence_similarity(sent1, sent2) - Computes Jaccard similarity between two sentences. - - nx.pagerank(graph) - Computes sentence importance scores using the PageRank algorithm + Important Function: + + graph = nx.from_numpy_array(similarity_matrix) + scores = nx.pagerank(graph) + + Example Input: + similarity_matrix = np.array([ + [0.0, 0.2, 0.1], # Sentence 1 + [0.2, 0.0, 0.3], # Sentence 2 + [0.1, 0.3, 0.0]]) # Sentence 3 + + graph = nx.from_numpy_array(similarity_matrix) + scores = nx.pagerank(graph) + + Output: + {0: 0.25, 1: 0.45, 2: 0.30} #That means sentence 2(0.45) has more importance than others + === "Transformers" - Steps: + Important Function: - - pipeline("summarization") - Initializes a pre-trained transformer model for summarization. - - generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False) - Generates a summary using a transformer model. + pipeline("summarization") - Initializes a pre-trained transformer model for summarization. + generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False) + This Generates a summary using a transformer model. + + Example Input: + article = "The Apollo program was a NASA initiative that landed humans on the Moon between 1969 and 1972, + with Apollo 11 being the first mission." + + Output: + The Apollo program was a NASA initiative that landed humans on the Moon between 1969 and 1972. + Apollo 11 was the first mission. + + === "TTF-IDF Algorithm" - Steps: + Important Function: - - TfidfVectorizer - Converts text into numerical feature vectors based on Term Frequency-Inverse Document Frequency (TF-IDF). - - vectorizer.fit_transform(processed_sentences): Transforms the processed text into a sparse matrix representation. - - Cosine Similarity (cosine_similarity) - Measures the similarity between text vectors based on their cosine angle. - - cosine_similarity(tfidf_matrix): Computes a similarity matrix between sentences. - - generate_summary() generates summary. 
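    Wired up end to end, these pieces might look like the sketch below; the `generate_summary` helper is an illustrative reconstruction, not the notebook's exact code:

    ``` python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def generate_summary(processed_sentences, top_n=2):
        # TF-IDF feature vectors, one row per sentence.
        tfidf_matrix = TfidfVectorizer().fit_transform(processed_sentences)
        # Pairwise cosine similarity between the sentence vectors.
        sim = cosine_similarity(tfidf_matrix)
        # Score each sentence by its total similarity to the others,
        # then keep the top_n sentences in their original order.
        scores = sim.sum(axis=1)
        ranked = sorted(range(len(processed_sentences)), key=lambda i: scores[i], reverse=True)
        return " ".join(processed_sentences[i] for i in sorted(ranked[:top_n]))

    processed_sentences = [
        "apollo program nasa initiative landed humans moon 1969 1972",
        "apollo 11 first mission land moon neil armstrong buzz aldrin walked surface",
        "apollo program significant achievement space exploration cold war space race",
    ]
    print(generate_summary(processed_sentences))
    ```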
+ vectorizer = TfidfVectorizer() + tfidf_matrix = vectorizer.fit_transform(processed_sentences) + + Example Input: + processed_sentences = [ + "apollo program nasa initiative landed humans moon 1969 1972", + "apollo 11 first mission land moon neil armstrong buzz aldrin walked surface", + "apollo program significant achievement space exploration cold war space race"] + + Output: + ['1969', '1972', 'achievement', 'aldrin', 'apollo', 'armstrong', 'buzz', 'cold', 'exploration', + 'first', 'humans', 'initiative', 'land', 'landed', 'moon', 'nasa', 'neil', 'program', 'race', + 'significant', 'space', 'surface', 'walked', 'war'] ---