A domain-specific retrieval-based chatbot for glaucoma-related clinical questions. Runs two embedding models — BioBERT and BioWordVec — in parallel and returns the best answer from each, with a confidence score attached to both. Built on 1,679 curated Q&A pairs.
The chatbot doesn't generate text. It finds the most semantically similar question in the dataset to whatever the user asks, then returns the corresponding answer. Two models do this independently so you can compare semantic match quality across both embedding spaces.
User question
│
├──► BioBERT embedding ──► cosine similarity vs. precomputed set ──► top match + score
│
└──► BioWordVec embedding ──► cosine similarity vs. precomputed set ──► top match + score
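The matching step in the diagram can be sketched as follows. This is a minimal illustration, not the code in chatbot.py: the project uses scikit-learn for cosine similarity, while this version uses only the standard library so it runs without dependencies. The function names `cosine` and `top_match` and the toy vectors are assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_match(query_vec, dataset_vecs):
    """Return (index, score) of the stored vector most similar to the query."""
    scores = [cosine(query_vec, v) for v in dataset_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy 3-dimensional vectors standing in for real question embeddings.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
index, score = top_match([0.9, 0.1, 0.0], stored)
print(index, round(score, 2))
```

The returned index would be used to look up the answer in the dataset, and the score is what the chatbot reports as confidence. Each model runs this step against its own set of precomputed vectors.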
Embeddings for the full dataset are precomputed once and serialized to a .pkl file — this keeps inference fast on repeat runs.
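A load-or-compute cache along these lines could look like the sketch below. The helper name `load_or_compute` and the dictionary layout are assumptions; the actual structure stored in precomputed_embeddings.pkl is not documented here.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Load cached embeddings if the file exists, otherwise compute and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    embeddings = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings

# Demo: a counter shows that the expensive step runs only once.
calls = []

def fake_compute():
    calls.append(1)
    # Hypothetical layout: one list of vectors per model.
    return {"biobert": [[0.1, 0.2]], "biowordvec": [[0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "precomputed_embeddings.pkl")
    first = load_or_compute(path, fake_compute)   # computes and writes the file
    second = load_or_compute(path, fake_compute)  # reads the file back
```

A cache like this does not notice when the dataset changes. If chatbot.py does not key the cache by dataset, deleting the .pkl file after switching datasets forces a rebuild. Pickle files can execute code on load, so only load ones you generated yourself.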
├── chatbot.py # main script, run this
├── Combined_Dataset.xlsx # primary Q&A dataset (1,679 pairs)
├── Our_Dataset.xlsx # alternate dataset (different results)
├── bio_embedding_intrinsic # BioWordVec model file (binary format)
└── precomputed_embeddings.pkl # auto-generated on first run
Requirements: Python 3.7+, PyTorch (CUDA-compatible version if using GPU)
pip install torch transformers gensim numpy pandas scikit-learn openpyxl
BioBERT downloads automatically via Hugging Face on first run:
dmis-lab/biobert-base-cased-v1.1
BioWordVec needs to be downloaded manually. Get bio_embedding_intrinsic from:
Place it in the project root directory. It must be in binary format.
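BioWordVec provides one vector per word, so a question needs some aggregation step to become a single vector. The sketch below averages the vectors of known words, which is a common choice but an assumption here, since this README does not say how chatbot.py combines them. The `sentence_vector` helper and the toy dictionary are hypothetical. The gensim call in the comment is the standard way to load a binary word2vec-format file.

```python
def sentence_vector(text, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    tokens = text.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

# With gensim installed, the real vectors could be loaded like this:
#   from gensim.models import KeyedVectors
#   vectors = KeyedVectors.load_word2vec_format("bio_embedding_intrinsic", binary=True)
#   dim = vectors.vector_size
# A tiny dict stands in here so the sketch runs without the model file.
toy_vectors = {"glaucoma": [1.0, 0.0], "symptoms": [0.0, 1.0]}
vec = sentence_vector("Glaucoma symptoms today", toy_vectors, dim=2)
print(vec)
```

Words missing from the vocabulary ("today" in this example) are skipped, so they do not affect the average.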
Dataset: Make sure Combined_Dataset.xlsx is in the project root. The file needs two columns named exactly Question and Answer.
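Because the column names must match exactly, a quick check before running the chatbot can catch typos, wrong case, or trailing spaces. This is a small sketch: the `missing_columns` helper is hypothetical, and the pandas call in the comment is one way to read only the header row.

```python
REQUIRED_COLUMNS = ("Question", "Answer")

def missing_columns(columns):
    """Return the required column names that are absent (exact, case-sensitive match)."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]

# With pandas and openpyxl installed, the header could be read via:
#   import pandas as pd
#   columns = pd.read_excel("Combined_Dataset.xlsx", nrows=0).columns
print(missing_columns(["Question", "Answer"]))   # both present
print(missing_columns(["question", "Answer "]))  # wrong case, trailing space
```

An empty list means the file has both required columns.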
python chatbot.py
On first run, embeddings for the full dataset get computed and saved to precomputed_embeddings.pkl. Subsequent runs load from this file directly, which is much faster.
Your question: what are the symptoms of glaucoma?
--- BioBERT Response ---
Matched Question: What are the symptoms of glaucoma?
Answer: In early stages, glaucoma often has no symptoms. As the disease progresses, symptoms may include loss of peripheral vision...
Confidence: 0.97
--- BioWordVec Response ---
Matched Question: How does glaucoma affect vision?
Answer: Glaucoma damages the optic nerve, typically starting with peripheral vision loss...
Confidence: 0.91
Type exit to close.
Two datasets are included — results will differ between them.
| File | Description |
|---|---|
| Combined_Dataset.xlsx | Full merged dataset; recommended for best coverage |
| Our_Dataset.xlsx | Original curated set; more controlled, narrower scope |
To switch datasets, update the file path in chatbot.py where the dataset is loaded.
BioWordVec model not found
Make sure bio_embedding_intrinsic is in the project directory and fully downloaded. Partial downloads will cause silent failures. Verify the file size matches what's listed on the BioWordVec GitHub page.
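The size check can be scripted. In this sketch the `looks_complete` helper is hypothetical, and the expected byte count is something you supply from the published listing; no real size is assumed here.

```python
import os
import tempfile

def looks_complete(path, expected_bytes):
    """True if the file exists and its size equals the expected byte count."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes

# Demo on a throwaway file. For the real check, pass the path to
# bio_embedding_intrinsic and the size published by the model's maintainers.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as f:
        f.write(b"\x00" * 1024)
    complete = looks_complete(sample, 1024)   # sizes agree
    truncated = looks_complete(sample, 2048)  # file is smaller than expected
```

A matching size does not prove the contents are intact, but a mismatch is a reliable sign of a partial download.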
CUDA errors
Check that your PyTorch build matches your CUDA version. This prints the CUDA version PyTorch was built against (None for CPU-only builds):
python -c "import torch; print(torch.version.cuda)"
Embeddings file missing or corrupted
Delete precomputed_embeddings.pkl and rerun the script; it regenerates automatically.
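To detect a missing or truncated cache file before deciding to delete it, a loader can treat load failures as "no cache". The `load_cache` helper below is a hypothetical sketch, not part of chatbot.py. It covers the most common failure modes, but a file filled with arbitrary garbage can raise other exception types.

```python
import os
import pickle
import tempfile

def load_cache(path):
    """Return the cached object, or None if the file is missing or unreadable."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (FileNotFoundError, EOFError, pickle.UnpicklingError):
        return None

with tempfile.TemporaryDirectory() as tmp:
    bad = os.path.join(tmp, "precomputed_embeddings.pkl")
    with open(bad, "wb") as f:
        f.write(b"")  # an empty file, as left behind by an interrupted write
    result_empty = load_cache(bad)
    result_missing = load_cache(os.path.join(tmp, "nope.pkl"))
```

When the loader returns None, the caller can recompute the embeddings and overwrite the file.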
Low confidence scores
This usually means the question is outside the dataset's domain coverage. Try rephrasing or check if a similar question exists in the dataset.
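One way to act on low scores is to flag any match that falls below a cutoff. The 0.6 threshold, the `describe` helper, and the example scores below are illustrative assumptions. chatbot.py does not necessarily apply a threshold, and a sensible value depends on the model and dataset.

```python
LOW_CONFIDENCE = 0.6  # hypothetical cutoff; tune it against your own queries

def describe(results, threshold=LOW_CONFIDENCE):
    """Label each model's (matched_question, score) pair as 'ok' or 'low'."""
    report = {}
    for model, (_question, score) in results.items():
        report[model] = "ok" if score >= threshold else "low"
    return report

# Made-up scores for illustration only.
results = {
    "BioBERT": ("What are the symptoms of glaucoma?", 0.97),
    "BioWordVec": ("How does glaucoma affect vision?", 0.41),
}
report = describe(results)
print(report)
```

Cosine scores from the two embedding spaces are not directly comparable, so separate thresholds per model may work better than a shared one.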
| Component | Tool |
|---|---|
| Biomedical NLP | BioBERT (dmis-lab/biobert-base-cased-v1.1) |
| Word embeddings | BioWordVec (NCBI) |
| Similarity | Cosine similarity via Scikit-learn |
| Inference speed | Precomputed embedding serialization (pickle) |
| Data | Pandas + openpyxl |
- BioBERT by DMIS Lab: biomedical language model pre-trained on PubMed and PMC
- BioWordVec by NCBI NLP: biomedical word embeddings trained on PubMed + MeSH
Part of the HDA (Health Domain Applications) project — IIIT Dharwad, Dept. of Data Science & AI