diff --git a/docs/projects/natural-language-processing/text_summarization.md b/docs/projects/natural-language-processing/text_summarization.md
new file mode 100644
index 00000000..8fd40a40
--- /dev/null
+++ b/docs/projects/natural-language-processing/text_summarization.md
@@ -0,0 +1,267 @@

# ๐Ÿ“œ Text Summarization

### ๐ŸŽฏ AIM
Develop a model to summarize long articles into short, concise summaries.

### ๐Ÿ“Š DATASET LINK
[CNN DailyMail News Dataset](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/)

### ๐Ÿ““ NOTEBOOK LINK
??? abstract "Kaggle Notebook"

### โš™๏ธ LIBRARIES NEEDED
??? quote "LIBRARIES USED"

    - pandas
    - numpy
    - scikit-learn (TfidfVectorizer)
    - matplotlib
    - keras
    - tensorflow
    - spacy
    - pytextrank
    - networkx
    - transformers (BART, T5)

---

### ๐Ÿ“ DESCRIPTION

??? info "What is the requirement of the project?"
    - A robust system to summarize text efficiently is essential for handling large volumes of information.
    - It helps users quickly grasp key insights without reading lengthy documents.

??? info "Why is it necessary?"
    - Large amounts of text can be overwhelming and time-consuming to process.
    - Automated summarization improves productivity and aids decision-making in fields like journalism, research, and customer support.

??? info "How is it beneficial and used?"
    - Provides a concise summary while preserving essential information.
    - Used in news aggregation, academic research, and AI-powered assistants for quick content consumption.

??? info "How did you start approaching this project? (Initial thoughts and planning)"
    - Explored different text summarization techniques, both extractive and abstractive.
    - Implemented models like TextRank, BART, and T5 to compare their effectiveness.

??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)."
    - Documentation from Hugging Face Transformers
    - Research paper: "Text Summarization using Deep Learning"
    - Blog: "Introduction to NLP-based Summarization Techniques"

---

## ๐Ÿ” EXPLANATION

#### ๐Ÿงฉ DETAILS OF THE DIFFERENT FEATURES

#### ๐Ÿ“‚ dataset.csv

The raw dataset provides article/highlight pairs; signals such as sentence importance, word frequency, and linguistic structure are derived from them during preprocessing to generate meaningful summaries.

| Feature Name | Description |
|--------------|-------------|
| Id | A unique ID for each row |
| Article | The full news article from CNN/DailyMail |
| Highlights | Human-written key points of the article |

#### ๐Ÿ›  Developed Features

| Feature | Description |
|----------------------|-------------------------------------------------|
| `sentence_rank` | Rank of a sentence based on importance, computed with TextRank |
| `word_freq` | Frequency of key terms in the document |
| `tf-idf_score` | Term Frequency-Inverse Document Frequency score for words |
| `summary_length` | Desired length of the summary |
| `generated_summary` | AI-generated condensed version of the original text |

---

### ๐Ÿ›ค PROJECT WORKFLOW

!!! success "Project flowchart"

    ``` mermaid
    graph LR
      A[Start] --> B[Load Dataset]
      B --> C[Preprocessing]
      C --> D[TextRank + TF-IDF / Transformer Models]
      D --> E{Compare Performance}
      E -->|Best Model| F[Deploy]
      E -->|Retry| C
    ```

#### PROCEDURE

=== "Step 1"

    Exploratory Data Analysis:

    - Loaded the CNN/DailyMail dataset using pandas (a minimal loading sketch follows below).
    - Explored dataset features like article and highlights, ensuring the correct format for summarization.
    - Analyzed the length distribution of articles and their corresponding summaries.
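    A minimal loading sketch, assuming a local CSV with the three documented columns (the file path is a placeholder, not the notebook's actual path):

    ``` python
    import pandas as pd

    # Load one split of the CNN/DailyMail dataset (adjust the path to your download)
    df = pd.read_csv("cnn_dailymail/test.csv")

    # Inspect the documented columns: id, article, highlights
    print(df[["article", "highlights"]].head())

    # Compare article vs. highlight lengths in words
    print(df["article"].str.split().str.len().describe())
    print(df["highlights"].str.split().str.len().describe())
    ```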
=== "Step 2"

    Data cleaning and preprocessing:

    - Removed unnecessary columns (like id) and checked for missing values.
    - Tokenized articles into sentences and words, removing stopwords and special characters.
    - Preprocessed the text with basic NLP techniques such as lowercasing, lemmatization, and removal of non-alphanumeric characters.

=== "Step 3"

    Feature engineering and selection:

    - For TextRank-based summarization, calculated sentence similarity using TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity.
    - Selected top-ranked sentences based on their importance and relevance to the article.
    - Applied transformer-based models like BART and T5 for abstractive summarization.

=== "Step 4"

    Model training and evaluation:

    - For the TextRank approach, built a sentence-similarity matrix from TF-IDF vectors and cosine similarity.
    - For the transformer-based approach, used Hugging Face's pre-trained BART and T5 models to summarize articles.
    - Evaluated the summarization models with BLEU, ROUGE, and cosine-similarity metrics (a small ROUGE sketch appears in the Transformers tab below).

=== "Step 5"

    Validation and testing:

    - Tested both extractive and abstractive summarization models on unseen data to ensure generalizability.
    - Plotted confusion matrices to visualize True Positives, False Positives, and False Negatives, confirming effective model performance.

---

### ๐Ÿ–ฅ CODE EXPLANATION

=== "TextRank algorithm"

    Important function: build a graph from the sentence-similarity matrix and score sentences with PageRank.

    ``` python
    import networkx as nx
    import numpy as np

    # Example input: pairwise similarity between three sentences
    similarity_matrix = np.array([
        [0.0, 0.2, 0.1],   # Sentence 1
        [0.2, 0.0, 0.3],   # Sentence 2
        [0.1, 0.3, 0.0],   # Sentence 3
    ])

    # Treat the matrix as a weighted graph and rank sentences by PageRank score
    graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(graph)
    print(scores)
    ```

    Output:

    ```
    {0: 0.25, 1: 0.45, 2: 0.30}  # Sentence 2 (index 1, score 0.45) is the most important
    ```

=== "Transformers"

    Important function: `pipeline("summarization")` initializes a pre-trained transformer model, which is then called to generate the summary.

    ``` python
    from transformers import pipeline

    # Initialize a pre-trained summarization model
    summarization_pipeline = pipeline("summarization")

    article = ("The Apollo program was a NASA initiative that landed humans on the Moon "
               "between 1969 and 1972, with Apollo 11 being the first mission.")

    # Generate a summary; do_sample=False keeps the output deterministic
    generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False)
    print(generated_summary[0]["summary_text"])
    ```

    Output:

    ```
    The Apollo program was a NASA initiative that landed humans on the Moon between 1969 and 1972.
    Apollo 11 was the first mission.
    ```
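    Step 4 mentions ROUGE evaluation; a minimal sketch using the `rouge-score` package (the package choice and the example strings are assumptions, since the notebook's evaluation code is not reproduced here):

    ``` python
    from rouge_score import rouge_scorer

    # Illustrative reference highlight and generated summary (assumed strings)
    reference = "The Apollo program landed humans on the Moon between 1969 and 1972."
    candidate = ("The Apollo program was a NASA initiative that landed humans on the Moon "
                 "between 1969 and 1972. Apollo 11 was the first mission.")

    # Score the generated summary against the reference highlight
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    print(scorer.score(reference, candidate))
    ```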
=== "TF-IDF Algorithm"

    Important function: fit a TF-IDF vectorizer on the preprocessed sentences.

    ``` python
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Example input: preprocessed (lowercased, stopword-free) sentences
    processed_sentences = [
        "apollo program nasa initiative landed humans moon 1969 1972",
        "apollo 11 first mission land moon neil armstrong buzz aldrin walked surface",
        "apollo program significant achievement space exploration cold war space race",
    ]

    # Learn the vocabulary and compute the sentence-by-term TF-IDF matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(processed_sentences)
    print(vectorizer.get_feature_names_out().tolist())
    ```

    Output (the learned vocabulary):

    ```
    ['1969', '1972', 'achievement', 'aldrin', 'apollo', 'armstrong', 'buzz', 'cold', 'exploration',
     'first', 'humans', 'initiative', 'land', 'landed', 'moon', 'nasa', 'neil', 'program', 'race',
     'significant', 'space', 'surface', 'walked', 'war']
    ```
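    The TextRank tab above starts from a `similarity_matrix`; a minimal sketch of deriving one from the `tfidf_matrix` computed above, assuming cosine similarity as described in Step 3 (the notebook's exact wiring is an assumption):

    ``` python
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Pairwise cosine similarity between the TF-IDF sentence vectors
    similarity_matrix = cosine_similarity(tfidf_matrix)
    np.fill_diagonal(similarity_matrix, 0.0)  # ignore self-similarity before ranking
    print(similarity_matrix.round(2))
    ```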
---

#### โš–๏ธ PROJECT TRADE-OFFS AND SOLUTIONS

=== "Trade-off 1"

    The training split is over 1.2 GB, which is too large to process comfortably on a local machine.

    - **Solution**: Instead of training on the full training split, used the smaller test split for training and validation.

=== "Trade-off 2"

    Transformer models (BART/T5) required high computational resources and long inference times when summarizing large articles.

    - **Solution**: Model pruning: used smaller distilled versions of the transformer models (e.g., distilBART or distilT5) to reduce the computational load without compromising much on performance.

=== "Trade-off 3"

    TextRank summaries can miss nuance and context, producing less accurate or overly simplistic outputs compared to transformer-based models.

    - **Solution**: Combined TextRank and transformer-based summarization in a hybrid approach, leveraging TextRank's speed and the transformers' accuracy.

---

### ๐Ÿ–ผ SCREENSHOTS

??? example "Confusion Matrix"

    === "TF-IDF Confusion Matrix"
        ![tfidf](https://github.com/user-attachments/assets/28f257e1-2529-48f1-81e5-e058a50fb351)

    === "TextRank Confusion Matrix"
        ![textrank](https://github.com/user-attachments/assets/cb748eff-e4f3-4096-ab2b-cf2e4b40186f)

    === "Transformers Confusion Matrix"
        ![trans](https://github.com/user-attachments/assets/7e99887b-e225-4dd0-802d-f1c2b0e89bef)

### โœ… CONCLUSION

#### ๐Ÿ”‘ KEY LEARNINGS

!!! tip "Insights gained from the data"
    - Data complexity: news articles vary in length and structure, requiring different summarization techniques.
    - Text preprocessing: cleaning text (e.g., stopword removal, tokenization) significantly improves summarization quality.
    - Feature extraction: techniques like TF-IDF, TextRank, and transformer embeddings provide effective text representations for summarization models.

??? tip "Improvements in understanding machine learning concepts"
    - Model selection: compared extractive (TextRank, TF-IDF) and abstractive (transformer) approaches to determine the best summarization strategy.

??? tip "Challenges faced and how they were overcome"
    - Long text processing: split lengthy articles into manageable sections before summarization.
    - Computational efficiency: used batch processing and model optimization to handle large datasets efficiently.

---

#### ๐ŸŒ USE CASES

=== "Application 1"

    **News Aggregation & Personalized Summaries**

    - Automated news summarization helps users quickly grasp key events without reading lengthy articles.
    - Used in news apps, digital assistants, and content curation platforms.

=== "Application 2"

    **Legal & Academic Document Summarization**

    - Helps professionals extract critical insights from lengthy legal or research documents.
    - Reduces the time needed for manual reading and analysis.