SkimLit is an NLP project designed to classify sentences in medical abstracts, making it easier to skim and understand research papers. This project replicates the approach from the PubMed 200k RCT paper, implementing and comparing various deep learning models for sentence classification.
- Dataset: PubMed 200k RCT
- Reference Paper: *PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts* (Dernoncourt & Lee, 2017)
- Confirm GPU Access: Recommended for deep learning models.
- Download Dataset: Clone the PubMed 200k RCT dataset from GitHub.
- Preprocess Data: Convert abstracts into line-level, token-level, and character-level representations.
- Model Building: Implement and train multiple models:
- Baseline (TF-IDF + Naive Bayes)
- Conv1D with token embeddings
- Pre-trained embeddings (USE, BERT, GloVe)
- Hybrid and tribrid models (token, char, positional)
- Evaluation: Compare models using accuracy, F1 score, and confusion matrices.
- Visualization: Plot results and analyze misclassifications.
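The preprocessing step above can be sketched as follows. This is a minimal illustration assuming the PubMed RCT file layout (each abstract starts with a `###<id>` line, each sentence is a TAB-separated `TARGET\ttext` line, and a blank line ends the abstract); the function name and field names are illustrative, not necessarily the notebook's exact code:

```python
def preprocess_text_with_line_numbers(raw_text: str):
    """Parse PubMed RCT-style text into one dict per sentence,
    recording its label, position, and the abstract's length."""
    samples = []
    abstract_lines = []
    for line in raw_text.splitlines():
        if line.startswith("###"):       # new abstract ID -> reset buffer
            abstract_lines = []
        elif line.strip() == "":         # blank line -> emit buffered abstract
            for i, abstract_line in enumerate(abstract_lines):
                target, text = abstract_line.split("\t", 1)
                samples.append({
                    "target": target,                     # sentence label
                    "text": text,
                    "line_number": i,                     # positional feature
                    "total_lines": len(abstract_lines) - 1,
                })
        else:
            abstract_lines.append(line)
    return samples

sample = "###123\nOBJECTIVE\tTo test X.\nRESULTS\tX worked.\n\n"
print(preprocess_text_with_line_numbers(sample))
```

The `line_number` and `total_lines` fields are what the positional ("tribrid") models below consume.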
- Python 3.x
- TensorFlow, TensorFlow Hub
- scikit-learn
- pandas, numpy
- matplotlib, seaborn
- GPU (recommended for training deep models)
- Clone the Dataset:

  ```shell
  git clone https://github.com/Franck-Dernoncourt/pubmed-rct
  ```
- Run the Notebook: Open `P-02_SKIM_LIT.ipynb` in Jupyter or Colab.
- Execute cells step by step.
- Modify model sections for custom experiments.
- Uses TF-IDF vectorization and Multinomial Naive Bayes for sentence classification.
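A minimal sketch of this baseline as a scikit-learn pipeline, using toy sentences in place of the real training split:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the real train split.
train_sentences = [
    "to investigate the efficacy of the drug",
    "patients were randomly assigned to two groups",
    "the treatment group showed significant improvement",
    "these findings support the use of the drug",
]
train_labels = ["OBJECTIVE", "METHODS", "RESULTS", "CONCLUSIONS"]

# TF-IDF turns each sentence into a sparse term-weight vector;
# Multinomial Naive Bayes classifies those vectors.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
baseline.fit(train_sentences, train_labels)
print(baseline.predict(["the groups were randomly assigned"]))
```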
- Embeds tokens and applies 1D convolutional layers.
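A sketch of this architecture in Keras; the vocabulary size, sequence length, and filter counts here are illustrative placeholders, not the notebook's tuned values:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Token IDs in, class probabilities out (5 PubMed RCT labels).
model = tf.keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128),            # token embeddings
    layers.Conv1D(filters=64, kernel_size=5,
                  padding="same", activation="relu"),             # 1D convolution over tokens
    layers.GlobalMaxPooling1D(),                                  # collapse sequence dim
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

dummy_token_ids = np.random.randint(0, 10000, size=(2, 55))
print(model(dummy_token_ids).shape)  # (2, 5)
```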
- Universal Sentence Encoder (USE)
- BERT (PubMed)
- GloVe
- Combine token, character, and positional embeddings.
- Use concatenation and dense layers for final classification.
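The multi-input combination described above can be sketched with the Keras functional API. All shapes, vocabulary sizes, and layer widths below are assumed placeholders; the notebook's actual dimensions may differ:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Four inputs: token IDs, character IDs, and two one-hot positional features.
token_in = layers.Input(shape=(55,), dtype="int32", name="token_ids")
char_in = layers.Input(shape=(290,), dtype="int32", name="char_ids")
line_in = layers.Input(shape=(15,), name="line_number_one_hot")
total_in = layers.Input(shape=(20,), name="total_lines_one_hot")

tok = layers.GlobalAveragePooling1D()(layers.Embedding(10000, 128)(token_in))
chr_ = layers.Bidirectional(layers.LSTM(24))(layers.Embedding(70, 25)(char_in))

# Concatenate token + char branches, then add the positional features.
tok_chr = layers.Dense(128, activation="relu")(layers.Concatenate()([tok, chr_]))
combined = layers.Concatenate()([line_in, total_in, tok_chr])
outputs = layers.Dense(5, activation="softmax")(combined)

model = tf.keras.Model([token_in, char_in, line_in, total_in], outputs)
```

Dropping the two positional inputs gives the hybrid (token + char) variant; keeping all four gives the tribrid model.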
| Model | Accuracy | Precision | Recall | F1 Score | Notes |
|---|---|---|---|---|---|
| Baseline (TF-IDF + Naive Bayes) | 0.721832 | 0.718647 | 0.721832 | 0.698925 | Simple, fast baseline |
| Conv1D Token Embeddings | 0.820568 | 0.817851 | 0.820568 | 0.817342 | Deep learning with token-level features |
| Custom Char Embeddings | 0.740997 | 0.735827 | 0.740997 | 0.735911 | Character-level features |
| Token+Char Embeddings | 0.407255 | 0.288375 | 0.407255 | 0.322570 | Combines token and character features |
| Token+Char Hybrid | 0.757149 | 0.757167 | 0.757149 | 0.753551 | Hybrid model |
| Token+Char Hybrid (Custom Token Embeddings) | 0.808056 | 0.806597 | 0.808056 | 0.804239 | Hybrid with custom token embeddings |
| Token+Char+LineNo+TotalLines Hybrid | 0.846816 | 0.851562 | 0.846816 | 0.842716 | Tribrid model |
| Tribrid + Label Smoothing | 0.851979 | 0.854886 | 0.851979 | 0.848762 | Improves generalization |
Note: For confusion matrices and additional plots, see the notebook's output cells for each model.
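Label smoothing, used by the best-scoring model above, softens one-hot targets so the model is less overconfident. In Keras this is a loss argument (e.g. `tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2)`); the transform itself reduces to a one-liner, sketched here in NumPy:

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, smoothing: float = 0.2) -> np.ndarray:
    """Redistribute `smoothing` probability mass uniformly over all classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - smoothing) + smoothing / num_classes

y = np.array([[0., 0., 1., 0., 0.]])
print(smooth_labels(y, smoothing=0.2))
# -> [[0.04 0.04 0.84 0.04 0.04]]
```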
```python
import pandas as pd

# Collect each model's metrics dict into one DataFrame for comparison.
model_results = pd.DataFrame({
    "model_0_baseline": base_line_model_results,
    "model_1_custom_token_embeddings": model_1_results,
    "model_2_custom_char_embeddings": model_2_results,
    "model_3_custom_token_char_embeddings": model_3_results,
    "model_4_token_char_hybrid": model_4_results,
    "model_4.5_token_char_hybrid_custom_token_embeddings": model_4_5_results,
    "model_5_token_char_line_no_total_lines_hybrid": model_5_results,
    "model_6_token_char_line_no_total_lines_hybrid_label_smoothing": model_6_results,
})
```

- Confusion Matrices: Plotted for each model to analyze misclassifications.
- Bar Plots: Compare F1 scores and other metrics across models.
- Heatmaps: Visualize confusion matrices for detailed error analysis.
- The notebook includes code to print and analyze the most common misclassified label pairs.
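Ranking the most common misclassified label pairs can be sketched like this, using toy predictions in place of the real test split (the notebook's own code may differ):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]
y_true = ["METHODS", "METHODS", "RESULTS", "OBJECTIVE", "BACKGROUND", "RESULTS"]
y_pred = ["METHODS", "RESULTS", "RESULTS", "BACKGROUND", "BACKGROUND", "METHODS"]

cm = confusion_matrix(y_true, y_pred, labels=classes)

# Zero the diagonal so only errors remain, then rank (true, pred) pairs.
errors = cm.copy()
np.fill_diagonal(errors, 0)
pairs = sorted(
    ((errors[i, j], classes[i], classes[j])
     for i in range(len(classes))
     for j in range(len(classes))
     if errors[i, j]),
    reverse=True,
)
for count, true_label, pred_label in pairs:
    print(f"{true_label} -> {pred_label}: {count}")
```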
Devguru Tiwari
For full details, code, and results, see `P-02_SKIM_LIT.ipynb`.