Devguru-codes/NLP-project---Skim_Lit-

Milestone Project 2: SkimLit

Overview

SkimLit is an NLP project designed to classify sentences in medical abstracts, making it easier to skim and understand research papers. This project replicates the approach from the PubMed 200k RCT paper, implementing and comparing various deep learning models for sentence classification.

Project Workflow

  1. Confirm GPU Access: Recommended for deep learning models.
  2. Download Dataset: Clone the PubMed 200k RCT dataset from GitHub.
  3. Preprocess Data: Convert abstracts into line-level, token-level, and character-level representations.
  4. Model Building: Implement and train multiple models:
    • Baseline (TF-IDF + Naive Bayes)
    • Conv1D with token embeddings
    • Pre-trained embeddings (USE, BERT, GloVe)
    • Hybrid and tribrid models (token, char, positional)
  5. Evaluation: Compare models using accuracy, F1 score, and confusion matrices.
  6. Visualization: Plot results and analyze misclassifications.
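The line-level preprocessing in step 3 can be sketched as a small parser. This is a minimal illustration, assuming the PubMed RCT file layout (an `###`-prefixed abstract ID, then one tab-separated `LABEL\tsentence` pair per line, with a blank line ending each abstract); the notebook's own helper may differ in details.

```python
def preprocess_text_with_line_numbers(raw_text):
    """Parse PubMed RCT-style text into per-sentence dicts carrying the
    label, the sentence text, and the positional features line_number
    and total_lines used by the tribrid models."""
    samples = []
    abstract_lines = []
    for line in raw_text.splitlines():
        if line.startswith("###"):        # new abstract ID -> reset the buffer
            abstract_lines = []
        elif line.strip() == "":          # blank line -> abstract is complete
            for i, (label, text) in enumerate(abstract_lines):
                samples.append({
                    "label": label,
                    "text": text,
                    "line_number": i,                    # position within abstract
                    "total_lines": len(abstract_lines),  # abstract length
                })
        else:                             # "LABEL\tsentence"
            label, _, text = line.partition("\t")
            abstract_lines.append((label, text))
    return samples
```

Each resulting dict can feed both the text inputs and the one-hot positional inputs of the later models.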

Requirements

  • Python 3.x
  • TensorFlow, TensorFlow Hub
  • scikit-learn
  • pandas, numpy
  • matplotlib, seaborn
  • GPU (recommended for training deep models)

Usage Instructions

  1. Clone the Dataset:
    git clone https://github.com/Franck-Dernoncourt/pubmed-rct
  2. Run the Notebook:
    • Open P-02_SKIM_LIT.ipynb in Jupyter or Colab.
    • Execute cells step by step.
    • Modify model sections for custom experiments.

Model Architectures

1. Baseline: TF-IDF + Naive Bayes

  • Uses TF-IDF vectorization and Multinomial Naive Bayes for sentence classification.
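A minimal scikit-learn sketch of this baseline, using toy sentences in place of the real training split (the sentences and labels below are illustrative, not from the dataset):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# TF-IDF features piped into a Multinomial Naive Bayes classifier
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

train_sentences = [
    "to assess the effect of the treatment",
    "patients were randomised",
    "the drug reduced symptoms",
]
train_labels = ["OBJECTIVE", "METHODS", "RESULTS"]

baseline.fit(train_sentences, train_labels)
```

The pipeline exposes the usual `predict`/`score` interface, which makes it easy to compare against the deep models on the same evaluation split.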

2. Conv1D with Token Embeddings

  • Embeds tokens and applies 1D convolutional layers.
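One way to sketch this architecture in Keras (the vocabulary cap, sequence length, and filter sizes below are illustrative assumptions, not the notebook's exact values):

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 5     # BACKGROUND, OBJECTIVE, METHODS, RESULTS, CONCLUSIONS
MAX_TOKENS = 10000  # vocabulary cap (illustrative)

# Toy sentences stand in for the real training split
train_sentences = [
    "to assess the effect of the treatment",
    "patients were randomised",
    "the drug reduced symptoms",
]

# Map raw sentences to fixed-length sequences of token ids
text_vectorizer = layers.TextVectorization(max_tokens=MAX_TOKENS,
                                           output_sequence_length=55)
text_vectorizer.adapt(train_sentences)

inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = text_vectorizer(inputs)
x = layers.Embedding(input_dim=MAX_TOKENS, output_dim=128)(x)  # learned token embeddings
x = layers.Conv1D(filters=64, kernel_size=5, padding="same", activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)                             # one vector per sentence
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
```

Taking the raw string as the model input keeps vectorization inside the graph, so the same preprocessing applies at training and inference time.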

3. Pre-trained Embeddings

  • Universal Sentence Encoder (USE)
  • BERT (PubMed)
  • GloVe
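For GloVe, the usual pattern is to build an embedding matrix aligned to the model's vocabulary. A minimal sketch with a tiny stand-in for the parsed GloVe file (real GloVe vectors are 50- to 300-dimensional; the 3-d vectors and the vocabulary below are hypothetical):

```python
import numpy as np

EMBED_DIM = 3  # kept small for the sketch

# Stand-in for parsed "glove.*.txt" lines: token -> pre-trained vector
glove = {
    "patients": np.array([0.1, 0.2, 0.3]),
    "randomised": np.array([0.4, 0.5, 0.6]),
}

# Vocabulary as produced by a fitted text vectorizer (padding token first, then OOV)
vocab = ["", "[UNK]", "patients", "randomised", "placebo"]

embedding_matrix = np.zeros((len(vocab), EMBED_DIM))
for i, word in enumerate(vocab):
    vector = glove.get(word)
    if vector is not None:   # words missing from GloVe stay zero-initialised
        embedding_matrix[i] = vector
```

The matrix can then seed a Keras `Embedding` layer via `weights=[embedding_matrix]`, typically with `trainable=False` to keep the pre-trained vectors frozen.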

4. Hybrid and Tribrid Models

  • Combine token, character, and positional embeddings.
  • Use concatenation and dense layers for final classification.
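The concatenation step can be sketched with the Keras functional API. All shapes, embedding sizes, and the pooling choice below are illustrative assumptions; the notebook's actual branches (e.g. recurrent layers over the embeddings) may be richer.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 5

# Hypothetical pre-vectorized inputs: token ids, char ids, one-hot positions
token_in = tf.keras.Input(shape=(55,), name="token_ids")
char_in = tf.keras.Input(shape=(290,), name="char_ids")
line_no_in = tf.keras.Input(shape=(15,), name="line_number_one_hot")
total_lines_in = tf.keras.Input(shape=(20,), name="total_lines_one_hot")

# Token and character branches: embed, then pool to one vector per sentence
token_x = layers.GlobalAveragePooling1D()(layers.Embedding(10000, 128)(token_in))
char_x = layers.GlobalAveragePooling1D()(layers.Embedding(70, 25)(char_in))

# Concatenate all four feature streams, then classify with dense layers
combined = layers.Concatenate()([token_x, char_x, line_no_in, total_lines_in])
combined = layers.Dense(128, activation="relu")(combined)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(combined)

tribrid = tf.keras.Model(
    [token_in, char_in, line_no_in, total_lines_in], outputs)
```

The positional one-hot inputs bypass the embedding branches entirely and join only at the concatenation, which is what lets the classifier exploit where a sentence sits in its abstract.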

Results

| Model | Accuracy | Precision | Recall | F1 Score | Notes |
| --- | --- | --- | --- | --- | --- |
| Baseline (TF-IDF + Naive Bayes) | 0.721832 | 0.718647 | 0.721832 | 0.698925 | Simple, fast baseline |
| Conv1D Token Embeddings | 0.820568 | 0.817851 | 0.820568 | 0.817342 | Deep learning with token-level features |
| Custom Char Embeddings | 0.740997 | 0.735827 | 0.740997 | 0.735911 | Character-level features |
| Token+Char Embeddings | 0.407255 | 0.288375 | 0.407255 | 0.322570 | Combines token and character features |
| Token+Char Hybrid | 0.757149 | 0.757167 | 0.757149 | 0.753551 | Hybrid model |
| Token+Char Hybrid (Custom Token Embeddings) | 0.808056 | 0.806597 | 0.808056 | 0.804239 | Hybrid with custom token embeddings |
| Token+Char+LineNo+TotalLines Hybrid | 0.846816 | 0.851562 | 0.846816 | 0.842716 | Tribrid model |
| Tribrid + Label Smoothing | 0.851979 | 0.854886 | 0.851979 | 0.848762 | Improves generalization |
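The label smoothing used by the best model replaces hard one-hot targets with slightly softened ones, which discourages over-confident predictions. A minimal NumPy sketch of the transformation (in Keras the same effect comes from `tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)`):

```python
import numpy as np

def smooth_labels(one_hot, smoothing=0.1):
    """Label smoothing: y_smooth = y * (1 - eps) + eps / K for K classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - smoothing) + smoothing / num_classes

# With K=5 and eps=0.1, the hard 1.0 becomes 0.92 and each 0.0 becomes 0.02
y = np.array([[0.0, 0.0, 1.0, 0.0, 0.0]])
y_smooth = smooth_labels(y)
```

The smoothed targets still sum to 1, so they remain a valid distribution for the cross-entropy loss.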

Note: For confusion matrices and additional plots, see the notebook's output cells for each model.

Example: Model Comparison Table (from notebook)

```python
import pandas as pd

# Each *_results variable is a dict of evaluation metrics
# (accuracy, precision, recall, f1) returned by the notebook's
# evaluation helper for the corresponding model.
model_results = pd.DataFrame({
    "model_0_baseline": base_line_model_results,
    "model_1_custom_token_embeddings": model_1_results,
    "model_2_custom_char_embeddings": model_2_results,
    "model_3_custom_token_char_embeddings": model_3_results,
    "model_4_token_char_hybrid": model_4_results,
    "model_4.5_token_char_hybrid_custom_token_embeddings": model_4_5_results,
    "model_5_token_char_line_no_total_lines_hybrid": model_5_results,
    "model_6_token_char_line_no_total_lines_hybrid_label_smoothing": model_6_results
})
```

Visualizations

  • Confusion Matrices: Plotted for each model to analyze misclassifications.
  • Bar Plots: Compare F1 scores and other metrics across models.
  • Heatmaps: Visualize confusion matrices for detailed error analysis.

Top Misclassifications

  • The notebook includes code to print and analyze the most common misclassified label pairs.

Author

Devguru Tiwari


For full details, code, and results, see P-02_SKIM_LIT.ipynb.
