SkimLit is an NLP project designed to classify sentences in medical abstracts, making it easier to skim and understand research papers. This project replicates the approach from the PubMed 200k RCT paper, implementing and comparing various deep learning models for sentence classification.
- Dataset: PubMed 200k RCT
- Reference Paper: *PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts* (Dernoncourt & Lee, 2017)
- Confirm GPU Access: Recommended for deep learning models.
- Download Dataset: Clone the PubMed 200k RCT dataset from GitHub.
- Preprocess Data: Convert abstracts into line-level, token-level, and character-level representations.
- Model Building: Implement and train multiple models:
- Baseline (TF-IDF + Naive Bayes)
- Conv1D with token embeddings
- Pre-trained embeddings (USE, BERT, GloVe)
- Hybrid and tribrid models (token, char, positional)
- Evaluation: Compare models using accuracy, F1 score, and confusion matrices.
- Visualization: Plot results and analyze misclassifications.
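The preprocessing step above can be sketched as follows. This is a minimal illustration assuming the PubMed RCT file layout (each abstract starts with a `###<id>` line, each sentence is a TAB-separated `TARGET\ttext` line, and a blank line ends the abstract); the function name and field names are illustrative, not necessarily the notebook's exact code:

```python
def preprocess_text_with_line_numbers(raw_text: str):
    """Parse PubMed RCT-style text into one dict per sentence,
    recording its label, position, and the abstract's length."""
    samples = []
    abstract_lines = []
    for line in raw_text.splitlines():
        if line.startswith("###"):       # new abstract ID -> reset buffer
            abstract_lines = []
        elif line.strip() == "":         # blank line -> emit buffered abstract
            for i, abstract_line in enumerate(abstract_lines):
                target, text = abstract_line.split("\t", 1)
                samples.append({
                    "target": target,                     # sentence label
                    "text": text,
                    "line_number": i,                     # positional feature
                    "total_lines": len(abstract_lines) - 1,
                })
        else:
            abstract_lines.append(line)
    return samples

sample = "###123\nOBJECTIVE\tTo test X.\nRESULTS\tX worked.\n\n"
print(preprocess_text_with_line_numbers(sample))
```

The `line_number` and `total_lines` fields are what the positional ("tribrid") models below consume.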
- Python 3.x
- TensorFlow, TensorFlow Hub
- scikit-learn
- pandas, numpy
- matplotlib, seaborn
- GPU (recommended for training deep models)
- Clone the Dataset:

  ```shell
  git clone https://github.com/Franck-Dernoncourt/pubmed-rct
  ```
- Run the Notebook: Open `P-02_SKIM_LIT.ipynb` in Jupyter or Colab.
- Execute cells step by step.
- Modify model sections for custom experiments.
- Uses TF-IDF vectorization and Multinomial Naive Bayes for sentence classification.
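A minimal sketch of this baseline as a scikit-learn pipeline, using toy sentences in place of the real training split:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the real train split.
train_sentences = [
    "to investigate the efficacy of the drug",
    "patients were randomly assigned to two groups",
    "the treatment group showed significant improvement",
    "these findings support the use of the drug",
]
train_labels = ["OBJECTIVE", "METHODS", "RESULTS", "CONCLUSIONS"]

# TF-IDF turns each sentence into a sparse term-weight vector;
# Multinomial Naive Bayes classifies those vectors.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
baseline.fit(train_sentences, train_labels)
print(baseline.predict(["the groups were randomly assigned"]))
```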
- Embeds tokens and applies 1D convolutional layers.
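A sketch of this architecture in Keras; the vocabulary size, sequence length, and filter counts here are illustrative placeholders, not the notebook's tuned values:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Token IDs in, class probabilities out (5 PubMed RCT labels).
model = tf.keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128),            # token embeddings
    layers.Conv1D(filters=64, kernel_size=5,
                  padding="same", activation="relu"),             # 1D convolution over tokens
    layers.GlobalMaxPooling1D(),                                  # collapse sequence dim
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

dummy_token_ids = np.random.randint(0, 10000, size=(2, 55))
print(model(dummy_token_ids).shape)  # (2, 5)
```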
- Universal Sentence Encoder (USE)
- BERT (PubMed)
- GloVe
- Combine token, character, and positional embeddings.
- Use concatenation and dense layers for final classification.
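The multi-input combination described above can be sketched with the Keras functional API. All shapes, vocabulary sizes, and layer widths below are assumed placeholders; the notebook's actual dimensions may differ:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Four inputs: token IDs, character IDs, and two one-hot positional features.
token_in = layers.Input(shape=(55,), dtype="int32", name="token_ids")
char_in = layers.Input(shape=(290,), dtype="int32", name="char_ids")
line_in = layers.Input(shape=(15,), name="line_number_one_hot")
total_in = layers.Input(shape=(20,), name="total_lines_one_hot")

tok = layers.GlobalAveragePooling1D()(layers.Embedding(10000, 128)(token_in))
chr_ = layers.Bidirectional(layers.LSTM(24))(layers.Embedding(70, 25)(char_in))

# Concatenate token + char branches, then add the positional features.
tok_chr = layers.Dense(128, activation="relu")(layers.Concatenate()([tok, chr_]))
combined = layers.Concatenate()([line_in, total_in, tok_chr])
outputs = layers.Dense(5, activation="softmax")(combined)

model = tf.keras.Model([token_in, char_in, line_in, total_in], outputs)
```

Dropping the two positional inputs gives the hybrid (token + char) variant; keeping all four gives the tribrid model.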
| Model | Accuracy | Precision | Recall | F1 Score | Notes |
|---|---|---|---|---|---|
| Baseline (TF-IDF + Naive Bayes) | 0.721832 | 0.718647 | 0.721832 | 0.698925 | Simple, fast baseline |
| Conv1D Token Embeddings | 0.820568 | 0.817851 | 0.820568 | 0.817342 | Deep learning with token-level features |
| Custom Char Embeddings | 0.740997 | 0.735827 | 0.740997 | 0.735911 | Character-level features |
| Token+Char Embeddings | 0.407255 | 0.288375 | 0.407255 | 0.322570 | Combines token and character features |
| Token+Char Hybrid | 0.757149 | 0.757167 | 0.757149 | 0.753551 | Hybrid model |
| Token+Char Hybrid (Custom Token Embeddings) | 0.808056 | 0.806597 | 0.808056 | 0.804239 | Hybrid with custom token embeddings |
| Token+Char+LineNo+TotalLines Hybrid | 0.846816 | 0.851562 | 0.846816 | 0.842716 | Tribrid model |
| Tribrid + Label Smoothing | 0.851979 | 0.854886 | 0.851979 | 0.848762 | Improves generalization |
Note: For confusion matrices and additional plots, see the notebook's output cells for each model.
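Label smoothing, used by the best-scoring model above, softens one-hot targets so the model is less overconfident. In Keras this is a loss argument (e.g. `tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2)`); the transform itself reduces to a one-liner, sketched here in NumPy:

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, smoothing: float = 0.2) -> np.ndarray:
    """Redistribute `smoothing` probability mass uniformly over all classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - smoothing) + smoothing / num_classes

y = np.array([[0., 0., 1., 0., 0.]])
print(smooth_labels(y, smoothing=0.2))
# -> [[0.04 0.04 0.84 0.04 0.04]]
```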
```python
import pandas as pd

# Collect each model's metrics dict into one DataFrame for comparison.
model_results = pd.DataFrame({
    "model_0_baseline": base_line_model_results,
    "model_1_custom_token_embeddings": model_1_results,
    "model_2_custom_char_embeddings": model_2_results,
    "model_3_custom_token_char_embeddings": model_3_results,
    "model_4_token_char_hybrid": model_4_results,
    "model_4.5_token_char_hybrid_custom_token_embeddings": model_4_5_results,
    "model_5_token_char_line_no_total_lines_hybrid": model_5_results,
    "model_6_token_char_line_no_total_lines_hybrid_label_smoothing": model_6_results,
})
```

- Confusion Matrices: Plotted for each model to analyze misclassifications.
- Bar Plots: Compare F1 scores and other metrics across models.
- Heatmaps: Visualize confusion matrices for detailed error analysis.
- The notebook includes code to print and analyze the most common misclassified label pairs.
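Ranking the most common misclassified label pairs can be sketched like this, using toy predictions in place of the real test split (the notebook's own code may differ):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]
y_true = ["METHODS", "METHODS", "RESULTS", "OBJECTIVE", "BACKGROUND", "RESULTS"]
y_pred = ["METHODS", "RESULTS", "RESULTS", "BACKGROUND", "BACKGROUND", "METHODS"]

cm = confusion_matrix(y_true, y_pred, labels=classes)

# Zero the diagonal so only errors remain, then rank (true, pred) pairs.
errors = cm.copy()
np.fill_diagonal(errors, 0)
pairs = sorted(
    ((errors[i, j], classes[i], classes[j])
     for i in range(len(classes))
     for j in range(len(classes))
     if errors[i, j]),
    reverse=True,
)
for count, true_label, pred_label in pairs:
    print(f"{true_label} -> {pred_label}: {count}")
```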
Devguru Tiwari
For full details, code, and results, see `P-02_SKIM_LIT.ipynb`.