Glaucoma Chatbot

A domain-specific retrieval-based chatbot for glaucoma-related clinical questions. Runs two embedding models — BioBERT and BioWordVec — in parallel and returns the best answer from each, with a confidence score attached to both. Built on 1,679 curated Q&A pairs.

How it works

The chatbot doesn't generate text. It finds the most semantically similar question in the dataset to whatever the user asks, then returns the corresponding answer. Two models do this independently so you can compare semantic match quality across both embedding spaces.

User question
     │
     ├──► BioBERT embedding ──► cosine similarity vs. precomputed set ──► top match + score
     │
     └──► BioWordVec embedding ──► cosine similarity vs. precomputed set ──► top match + score
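
The lookup in the diagram can be sketched as below. This is a minimal illustration, not the code in chatbot.py: the function name `top_match`, the toy 3-dimensional vectors, and the use of plain NumPy in place of scikit-learn's `cosine_similarity` are all assumptions made for the example.

```python
import numpy as np

def top_match(query_vec, question_vecs):
    """Return (index, score) of the stored question most similar to the query."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(question_vecs, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms.
    # The small epsilon avoids division by zero for all-zero vectors.
    scores = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q) + 1e-12)
    best = int(np.argmax(scores))
    return best, float(scores[best])

# Toy stand-ins for the precomputed question embeddings (hypothetical values).
question_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
answers = ["answer 0", "answer 1", "answer 2"]

idx, score = top_match([0.9, 0.1, 0.0], question_vecs)
print(answers[idx], round(score, 2))
```

In this sketch the same function would be called once per model, each time with that model's query vector and that model's precomputed matrix.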

Embeddings for the full dataset are precomputed once and serialized to a .pkl file — this keeps inference fast on repeat runs.

Project structure

├── chatbot.py                    # main script, run this
├── Combined_Dataset.xlsx         # primary Q&A dataset (1,679 pairs)
├── Our_Dataset.xlsx              # alternate dataset (different results)
├── bio_embedding_intrinsic       # BioWordVec model file (binary format)
└── precomputed_embeddings.pkl    # auto-generated on first run

Setup

Requirements: Python 3.7+, PyTorch (CUDA-compatible version if using GPU)

pip install torch transformers gensim numpy pandas scikit-learn openpyxl

BioBERT downloads automatically via Hugging Face on first run:

dmis-lab/biobert-base-cased-v1.1
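
One common way to turn BioBERT token outputs into a single sentence vector is mean pooling over non-padding tokens. The sketch below assumes that approach; the README does not say which pooling chatbot.py uses, and the names `mean_pool` and `embed_with_biobert` are hypothetical.

```python
import numpy as np

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"

def mean_pool(token_vecs, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    token_vecs = np.asarray(token_vecs, dtype=float)         # (seq_len, hidden)
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # (seq_len, 1)
    return (token_vecs * mask).sum(axis=0) / max(mask.sum(), 1.0)

def embed_with_biobert(text):
    """Embed one sentence; the model is downloaded on the first call."""
    # Imported lazily so mean_pool works without torch/transformers installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return mean_pool(
        outputs.last_hidden_state[0].numpy(),
        inputs["attention_mask"][0].numpy(),
    )
```

A real script would load the tokenizer and model once and reuse them, rather than reloading on every call as this short sketch does.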

BioWordVec needs to be downloaded manually. Get bio_embedding_intrinsic from:

https://github.com/ncbi-nlp/BioWordVec

Place it in the project root directory. It must be in binary format.
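
A binary word2vec-format file can be loaded with gensim's `KeyedVectors.load_word2vec_format(..., binary=True)`. The sketch below pairs that with a simple averaging step to get a sentence vector. The averaging, the whitespace tokenization, and the function names are assumptions for illustration, not necessarily what chatbot.py does.

```python
import numpy as np

def load_biowordvec(path="bio_embedding_intrinsic"):
    """Load the binary word2vec-format model file with gensim."""
    from gensim.models import KeyedVectors  # lazy import: only needed for the real file
    return KeyedVectors.load_word2vec_format(path, binary=True)

def sentence_vector(text, vectors, dim):
    """Average the vectors of all in-vocabulary tokens; zeros if none are known."""
    # Naive whitespace tokenization: punctuation stays attached to words.
    tokens = text.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy vocabulary standing in for the real model (hypothetical values).
toy = {"glaucoma": np.array([1.0, 0.0]), "vision": np.array([0.0, 1.0])}
print(sentence_vector("Glaucoma and vision", toy, dim=2))
```

With the real model, `vectors` would be the object returned by `load_biowordvec()` and `dim` its `vector_size` attribute.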

Dataset: Make sure Combined_Dataset.xlsx is in the root. The file needs two columns named exactly:

Question
Answer
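
Because the column names must match exactly, a quick check before running can save a confusing error later. The helper names below are hypothetical, and the sketch assumes the data sits on the first sheet of the workbook.

```python
REQUIRED_COLUMNS = ("Question", "Answer")

def missing_columns(columns):
    """Return the required column names that are absent (exact, case-sensitive)."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]

def load_dataset(path="Combined_Dataset.xlsx"):
    """Read the spreadsheet and fail early if a required column is missing."""
    import pandas as pd  # lazy import; reading .xlsx also needs openpyxl
    df = pd.read_excel(path)
    missing = missing_columns(df.columns)
    if missing:
        raise ValueError(f"{path} is missing column(s): {', '.join(missing)}")
    return df[list(REQUIRED_COLUMNS)]

# A lowercase header does not count as a match.
print(missing_columns(["question", "Answer"]))
```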

Run

python chatbot.py

On first run, embeddings for the full dataset get computed and saved to precomputed_embeddings.pkl. Subsequent runs load from this file directly — much faster.
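
The load-or-compute behavior described above follows a common caching pattern, sketched here with the standard library only. The function `load_or_compute`, the `fake_compute` stand-in, and the dictionary layout of the cached data are assumptions; the real script may store its embeddings differently.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Load cached embeddings if present; otherwise compute and save them."""
    if os.path.exists(cache_path):
        try:
            with open(cache_path, "rb") as f:
                return pickle.load(f)
        except (pickle.UnpicklingError, EOFError):
            # Truncated or corrupted cache: fall through and rebuild it.
            pass
    data = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data

calls = []

def fake_compute():
    """Stand-in for the expensive embedding step; records each invocation."""
    calls.append(1)
    return {"biobert": [[0.1, 0.2]], "biowordvec": [[0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "precomputed_embeddings.pkl")
    first = load_or_compute(path, fake_compute)   # computes and writes the file
    second = load_or_compute(path, fake_compute)  # loads from the file

print(len(calls))
```

Note that a cache like this is tied to the dataset it was built from, so it would need to be deleted after switching datasets.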

Usage

Your question: what are the symptoms of glaucoma?

--- BioBERT Response ---
Matched Question: What are the symptoms of glaucoma?
Answer: In early stages, glaucoma often has no symptoms. As the disease progresses, symptoms may include loss of peripheral vision...
Confidence: 0.97

--- BioWordVec Response ---
Matched Question: How does glaucoma affect vision?
Answer: Glaucoma damages the optic nerve, typically starting with peripheral vision loss...
Confidence: 0.91

Type exit to close.

Datasets

Two datasets are included — results will differ between them.

| File | Description |
| --- | --- |
| `Combined_Dataset.xlsx` | Full merged dataset, recommended for best coverage |
| `Our_Dataset.xlsx` | Original curated set with a narrower, more controlled scope |

To switch datasets, update the file path in chatbot.py where the dataset is loaded.

Troubleshooting

BioWordVec model not found: Make sure bio_embedding_intrinsic is in the project directory and fully downloaded. Partial downloads will cause silent failures, so verify that the file size matches what's listed on the BioWordVec GitHub page.

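CUDA errors: Check that the CUDA version your PyTorch build targets is compatible with your installed NVIDIA driver, and that PyTorch can see the GPU:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

A CPU-only PyTorch build prints None for the CUDA version.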

Embeddings file missing or corrupted: Delete precomputed_embeddings.pkl and rerun the script; it regenerates automatically.

Low confidence scores: This usually means the question is outside the dataset's domain coverage. Try rephrasing, or check whether a similar question exists in the dataset.
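
If low-scoring matches are a recurring problem, one option is to return a fallback message instead of a weak match. The sketch below is a hypothetical addition, not part of chatbot.py, and the 0.6 threshold is an arbitrary placeholder that would need tuning against the dataset.

```python
def respond(matched_answer, score, threshold=0.6):
    """Return the matched answer, or a fallback message when the score is too low."""
    if score < threshold:
        return "No close match found. Try rephrasing the question."
    return matched_answer

print(respond("Glaucoma damages the optic nerve...", 0.91))
print(respond("Glaucoma damages the optic nerve...", 0.35))
```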

Tech stack

| Component | Tool |
| --- | --- |
| Biomedical NLP | BioBERT (`dmis-lab/biobert-base-cased-v1.1`) |
| Word embeddings | BioWordVec (NCBI) |
| Similarity | Cosine similarity via scikit-learn |
| Inference speed | Precomputed embedding serialization (pickle) |
| Data | pandas + openpyxl |

Acknowledgments

BioBERT by DMIS Lab: biomedical language model pre-trained on PubMed abstracts and PMC full-text articles
BioWordVec by NCBI NLP: biomedical word embeddings trained on PubMed + MeSH

Part of the HDA (Health Domain Applications) project — IIIT Dharwad, Dept. of Data Science & AI
