Ruthvik-7/Glaucoma-Chatbot

Glaucoma Chatbot

A domain-specific retrieval-based chatbot for glaucoma-related clinical questions. It runs two embedding models, BioBERT and BioWordVec, side by side and returns the best-matching answer from each, along with a confidence score (the cosine similarity of the match). Built on 1,679 curated Q&A pairs.


How it works

The chatbot doesn't generate text. It finds the most semantically similar question in the dataset to whatever the user asks, then returns the corresponding answer. Two models do this independently so you can compare semantic match quality across both embedding spaces.

User question
     │
     ├──► BioBERT embedding ──► cosine similarity vs. precomputed set ──► top match + score
     │
     └──► BioWordVec embedding ──► cosine similarity vs. precomputed set ──► top match + score
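
The matching step in the diagram can be sketched as follows. This is a minimal illustration in plain Python so that it runs without dependencies; the project itself lists scikit-learn for cosine similarity, and the function names `cosine_similarity` and `top_match` as well as the toy vectors are assumptions made for the example, not code taken from chatbot.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_match(query_vec, stored_vecs):
    """Return (index, score) of the stored vector most similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in stored_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy 3-dimensional "embeddings"; real BioBERT vectors have 768 dimensions.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
index, score = top_match([0.9, 0.1, 0.0], stored)
print(index, round(score, 2))
```

The returned index would then be used to look up the answer stored alongside the matched question, and the score is what the chatbot reports as confidence.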

Embeddings for the full dataset are precomputed once and serialized to a .pkl file — this keeps inference fast on repeat runs.


Project structure

├── chatbot.py                    # main script, run this
├── Combined_Dataset.xlsx         # primary Q&A dataset (1,679 pairs)
├── Our_Dataset.xlsx              # alternate dataset (different results)
├── bio_embedding_intrinsic       # BioWordVec model file (binary format)
└── precomputed_embeddings.pkl    # auto-generated on first run

Setup

Requirements: Python 3.7+, PyTorch (CUDA-compatible version if using GPU)

pip install torch transformers gensim numpy pandas scikit-learn openpyxl

BioBERT downloads automatically via Hugging Face on first run:

dmis-lab/biobert-base-cased-v1.1

BioWordVec needs to be downloaded manually. Get bio_embedding_intrinsic from:

https://github.com/ncbi-nlp/BioWordVec

Place it in the project root directory. It must be in binary format.
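
For orientation, here is a hedged sketch of loading the file and turning a question into a single vector. `KeyedVectors.load_word2vec_format(..., binary=True)` is gensim's loader for binary word2vec-format files. The README does not say how chatbot.py combines word vectors into a question embedding, so the averaging below, the lowercase whitespace tokenization, and the function names are all assumptions.

```python
def load_biowordvec(path="bio_embedding_intrinsic"):
    """Load the binary word2vec-format file with gensim."""
    # Imported lazily so the helper below can be tried without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)


def average_vector(text, vectors):
    """Average the vectors of all known tokens; None if no token is known.

    `vectors` can be a gensim KeyedVectors object or any dict-like mapping
    from token to vector. Tokenization here is a naive lowercase split
    (an assumption), so punctuation attached to a word prevents a match.
    """
    tokens = text.lower().split()
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(float(v[i]) for v in found) / len(found) for i in range(dim)]


# Toy vocabulary standing in for the real model.
toy = {"glaucoma": [1.0, 0.0], "symptoms": [0.0, 1.0]}
print(average_vector("Glaucoma symptoms explained", toy))  # [0.5, 0.5]
```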

Dataset: Make sure Combined_Dataset.xlsx is in the root. The file needs two columns named exactly:

  • Question
  • Answer
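
A quick way to check the column requirement before running the chatbot is sketched below. It assumes the spreadsheet is read with pandas (`pd.read_excel`, which uses openpyxl for .xlsx files); the helper names `missing_columns` and `load_qa_pairs` are hypothetical and not part of chatbot.py.

```python
REQUIRED_COLUMNS = ("Question", "Answer")


def missing_columns(columns):
    """Return the required column names that are absent (case-sensitive)."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


def load_qa_pairs(path="Combined_Dataset.xlsx"):
    """Read the spreadsheet and fail early if a required column is missing."""
    import pandas as pd  # imported lazily; only needed when a file is read
    df = pd.read_excel(path)
    missing = missing_columns(df.columns)
    if missing:
        raise ValueError(f"{path} is missing column(s): {', '.join(missing)}")
    return df[list(REQUIRED_COLUMNS)]


# The check is case-sensitive, matching the "named exactly" requirement.
print(missing_columns(["question", "Answer"]))  # ['Question']
```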

Run

python chatbot.py

On first run, embeddings for the full dataset get computed and saved to precomputed_embeddings.pkl. Subsequent runs load from this file directly — much faster.
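
The compute-once, load-afterwards behaviour can be sketched with the standard library alone. This is an illustration under assumptions: the function name `load_or_compute` and the `compute_fn` callback are hypothetical, and the actual structure stored in precomputed_embeddings.pkl is not documented here.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute_fn):
    """Load cached embeddings if the file exists; otherwise compute and save."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    embeddings = compute_fn()  # the slow step, run only when no cache exists
    with cache_path.open("wb") as f:
        pickle.dump(embeddings, f)
    return embeddings


# Usage (not executed here): embed_dataset is a hypothetical function that
# embeds every question in the dataset.
# embeddings = load_or_compute("precomputed_embeddings.pkl", embed_dataset)
```

With this shape, deleting the .pkl file is enough to force a rebuild on the next run, for example after switching datasets.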


Usage

Your question: what are the symptoms of glaucoma?

--- BioBERT Response ---
Matched Question: What are the symptoms of glaucoma?
Answer: In early stages, glaucoma often has no symptoms. As the disease progresses, symptoms may include loss of peripheral vision...
Confidence: 0.97

--- BioWordVec Response ---
Matched Question: How does glaucoma affect vision?
Answer: Glaucoma damages the optic nerve, typically starting with peripheral vision loss...
Confidence: 0.91

Type exit to close.
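
An interactive loop producing output in this shape might look like the sketch below. It is an assumption about structure, not the code in chatbot.py: `chat_loop` and `answer_fn` are hypothetical names, and `answer_fn` is expected to return a dict mapping a model name to a (matched question, answer, score) tuple.

```python
def chat_loop(answer_fn, read=input, write=print):
    """Prompt for questions until the user types "exit"."""
    while True:
        question = read("Your question: ").strip()
        if question.lower() == "exit":
            break
        if not question:
            continue  # ignore empty input and prompt again
        for model_name, (matched, answer, score) in answer_fn(question).items():
            write(f"--- {model_name} Response ---")
            write(f"Matched Question: {matched}")
            write(f"Answer: {answer}")
            write(f"Confidence: {score:.2f}")
```

Passing `read` and `write` as parameters keeps the loop easy to exercise without a terminal.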


Datasets

Two datasets are included — results will differ between them.

File                     Description
Combined_Dataset.xlsx    Full merged dataset; recommended for best coverage
Our_Dataset.xlsx         Original curated set; more controlled, narrower scope

To switch datasets, update the file path in chatbot.py where the dataset is loaded.


Troubleshooting

BioWordVec model not found: Make sure bio_embedding_intrinsic is in the project directory and fully downloaded. Partial downloads will cause silent failures. Verify that the file size matches what's listed on the BioWordVec GitHub page.

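CUDA errors: Check that your PyTorch build matches your installed CUDA version. The command below prints the CUDA version PyTorch was built against (None for a CPU-only build):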

python -c "import torch; print(torch.version.cuda)"

Embeddings file missing or corrupted: Delete precomputed_embeddings.pkl and rerun the script. The file regenerates automatically.

Low confidence scores: This usually means the question is outside the dataset's domain coverage. Try rephrasing, or check whether a similar question exists in the dataset.
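
If low-scoring matches are a recurring problem, one option is to withhold the answer below a cutoff. The sketch below is an assumption, not existing behaviour of chatbot.py: the function name `answer_or_fallback` and the 0.75 threshold are hypothetical, and a sensible cutoff would need to be tuned separately for each model.

```python
def answer_or_fallback(matched_answer, score, threshold=0.75):
    """Return the matched answer only if its similarity score clears the cutoff."""
    if score < threshold:
        return "No close match found in the dataset; try rephrasing the question."
    return matched_answer


print(answer_or_fallback("stored answer", 0.91))  # stored answer
```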


Tech stack

Component          Tool
Biomedical NLP     BioBERT (dmis-lab/biobert-base-cased-v1.1)
Word embeddings    BioWordVec (NCBI)
Similarity         Cosine similarity via scikit-learn
Inference speed    Precomputed embedding serialization (pickle)
Data               pandas + openpyxl

Acknowledgments

  • BioBERT by DMIS Lab: BERT-based biomedical language model pre-trained on PubMed abstracts and PMC full-text articles
  • BioWordVec by NCBI NLP: biomedical word embeddings trained on PubMed + MeSH

Part of the HDA (Health Domain Applications) project — IIIT Dharwad, Dept. of Data Science & AI
