LLM-powered pipeline for extracting & normalizing glycan structure terminology
Biomarker Glycan Structure Terms (bGST) is a controlled vocabulary of glycan structure terms extracted from literature and databases. It captures textual representations of glycans and glycan-related structural features, including full structures, motifs, epitopes, and substructures.
Because glycan structures are described inconsistently across sources, this project uses an LLM-assisted retrieval and entity resolution workflow to map terms to existing Glycan Structure Dictionary (GSD) entries or register new ones when needed. This helps unify heterogeneous glycan terminology into a normalized, de-duplicated reference knowledgebase.
Previous Work:
Vora J, Navelkar R, Vijay-Shanker K, Edwards N, Martinez K, Ding X, Wang T, Su P, Ross K, Lisacek F, Hayes C, Kahsay R, Ranzinger R, Tiemeyer M, Mazumder R. The Glycan Structure Dictionary-a dictionary describing commonly used glycan structure terms. Glycobiology. 2023 Jun 3;33(5):354-357. doi: 10.1093/glycob/cwad014. PMID: 36799723; PMCID: PMC10243773.
Follow these steps to get a local copy up and running.
Ollama is a local LLM inference runtime and model management layer that lets you pull and serve foundation models on-device. It abstracts backend details such as model packaging and request orchestration so developers can run local models with minimal configuration across different setups.
-
Install Ollama from https://ollama.com/download, or alternatively:
curl -fsSL https://ollama.com/install.sh | sh
- Ollama version >= v0.15.0 is recommended.
-
HPC Users Only:
-
On servers that use environment modules (Lmod), run the following to view pre-installed modules:
module avail
-
To display the default version of Ollama:
module -d avail ollama
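To compare an installed or module-provided version against the recommended minimum, a small helper along these lines may be useful. It is only a sketch: the function names are hypothetical, and it assumes the version string contains a dotted MAJOR.MINOR.PATCH number somewhere in the text.

```python
import re

# Recommended minimum from this README (v0.15.0).
MIN_VERSION = (0, 15, 0)


def parse_version(text: str) -> tuple:
    """Extract the first MAJOR.MINOR.PATCH triple found in `text`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text: str, minimum: tuple = MIN_VERSION) -> bool:
    """Return True if the version found in `text` is at least `minimum`."""
    return parse_version(text) >= minimum
```

The text passed in could be the output of `ollama --version`, captured for example with `subprocess.run(["ollama", "--version"], capture_output=True, text=True).stdout`.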
-
-
Clone this repo:
git clone https://github.com/glygener/glycan-structure-dictionary.git
cd glycan-structure-dictionary
-
Pull the required Ollama models:
A thinking model and an embedding model are required. If you choose to use other models, remember to update the model names in configs/models.yaml. This pipeline was developed against a locally hosted Ollama server, for which GPU acceleration is practically required. Alternatively, Ollama offers cloud models with limited free usage. For accessing cloud models and obtaining an Ollama API key, refer to the Ollama documentation.
Start your local Ollama service in a separate terminal window (close this window after verifying the downloads):
Non-HPC users:
ollama serve
HPC Users Only:
-
Load the ollama module with module load ollama every time you open a new terminal window:
module load ollama
ollama serve
Back in your main terminal window, download your reasoning model and your embedding model (more models):
ollama pull gpt-oss:20b
ollama pull mxbai-embed-large:335m
Verify the downloads:
ollama list
# NAME                      ID              SIZE      MODIFIED
# mxbai-embed-large:335m    468836162de7    669 MB    7 weeks ago
# gpt-oss:20b               17052f91a42e    13 GB     7 weeks ago
(You may now close the terminal window that runs the Ollama server)
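The verification step can also be scripted. The sketch below parses text in the tabular form shown above and reports which required models are missing. The function names are hypothetical, and it assumes the model name is always the first whitespace-separated column of `ollama list`.

```python
# Models named in this README; adjust if configs/models.yaml lists others.
REQUIRED_MODELS = {"gpt-oss:20b", "mxbai-embed-large:335m"}


def installed_models(listing: str) -> set:
    """Return model names from `ollama list` output (first column, header skipped)."""
    names = set()
    for line in listing.splitlines():
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue
        names.add(fields[0])
    return names


def missing_models(listing: str, required=REQUIRED_MODELS) -> set:
    """Return the required models that do not appear in the listing."""
    return set(required) - installed_models(listing)


sample = """\
NAME                      ID              SIZE      MODIFIED
mxbai-embed-large:335m    468836162de7    669 MB    7 weeks ago
gpt-oss:20b               17052f91a42e    13 GB     7 weeks ago
"""
print(missing_models(sample))  # set()
```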
-
-
Install Python dependencies:
(Optional) Create a virtual environment with Python 3.12:
python3.12 -m venv .venv
source .venv/bin/activate
Install packages:
python -m pip install -r requirements.txt
-
Start Ollama server:
For Non-HPC users:
Every Python script that uses an LLM requires a running Ollama server. You can use these scripts to start, stop, or check the server:
python scripts/ollama/start_server.py
python scripts/ollama/stop_server.py
python scripts/ollama/status_server.py
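If you want to confirm from your own code that a server is reachable before launching a long run, a check like the following may help. It is not the logic of `status_server.py` (which is not shown here), just a minimal sketch; it assumes the server listens on Ollama's default address `http://127.0.0.1:11434`, and the function name is hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on the server root answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed reply: treat as "not up".
        return False
```

Calling `ollama_is_up()` at the top of a script allows it to fail early with a clear message instead of partway through an extraction run.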
For HPC (Slurm) users only:
The Ollama server is managed by the shell script ./main_slurm.sh, which serves as a template with resource presets. To run a Python LLM script through Slurm, use main_slurm.sh, passing the target script path as an argument:
sbatch main_slurm.sh <SCRIPT.PY_PATH>
Example:
sbatch main_slurm.sh src/gsd/part1_textbook/01_ingest.py
On successful job submission, you will find the logs at logs/slurm-<job-id>_output.txt and logs/slurm-<job-id>_error.txt.
More on basic Slurm commands
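When submitting jobs from a wrapper script, the job ID can be read from the `sbatch` confirmation line ("Submitted batch job <id>") and turned into the log paths described above. This is a sketch with hypothetical helper names; the `logs/` directory and file name pattern are taken from this README.

```python
import re
from pathlib import Path


def job_id_from_sbatch(stdout: str) -> str:
    """Extract the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {stdout!r}")
    return match.group(1)


def log_paths(job_id: str, log_dir: str = "logs") -> tuple:
    """Return (output_log, error_log) paths for a given Slurm job ID."""
    base = Path(log_dir)
    return (
        base / f"slurm-{job_id}_output.txt",
        base / f"slurm-{job_id}_error.txt",
    )
```

The `stdout` argument would typically come from `subprocess.run(["sbatch", "main_slurm.sh", script_path], capture_output=True, text=True).stdout`.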
-
Create a ChromaDB vectorstore from Essentials of Glycobiology (EoG) documents
unzip data/inputs/eog/raw_chapters/unzip_me_before_running_01_ingest.py.zip -d data/inputs/eog/raw_chapters/
python src/gsd/part1_textbook/01_ingest.py
- Or for HPC users:
sbatch main_slurm.sh src/.../TargetScript.py
-
Extract terms from EoG documents (from vectorstore)
python src/gsd/part1_textbook/02_extract.py
Varki A, Cummings RD, Esko JD, et al., editors. Essentials of Glycobiology [Internet]. 4th edition. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press; 2022. Available from: https://www.ncbi.nlm.nih.gov/books/NBK579918/ doi: 10.1101/9781621824213
This part builds a master dictionary of glycan structure terms by:
- Ingesting heterogeneous source term sets (Essentials of Glycobiology, legacy GSD v0, curated publications, composition lists, curator-supplied sets, etc.).
- Normalizing and formatting raw term JSONL inputs into a canonical intermediate structure.
- Creating a semantic vector store (Chroma + OpenAI embeddings) for retrieval-augmented AI mapping.
- Running AI-assisted mapping agents to (a) map synonyms to existing concepts or (b) propose creation of new canonical terms.
- Reconciling AI action logs into term-to-UUID mappings.
- Post-processing: merging multiple sources into consolidated node (master_nodes.json) and edge (master_edges.json) registries with quality checks and backups.
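As an illustration of the reconciliation step in the list above, the sketch below turns a list of AI actions into a term-to-UUID mapping. The log record shape (`term`, `action`, `target_uuid`), the `EXAMPLE_NAMESPACE`, and the use of `uuid5` for new terms are all assumptions made for this example; the actual log format and ID scheme used by the pipeline are not documented in this README.

```python
import uuid

# Hypothetical namespace for illustration only; not the project's real one.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/gsd")


def reconcile(actions: list) -> dict:
    """Turn a list of AI actions into a term -> UUID mapping.

    Each action is assumed to look like one of:
      {"term": str, "action": "map", "target_uuid": str}
      {"term": str, "action": "create"}
    """
    mapping = {}
    for entry in actions:
        term = entry["term"]
        if entry["action"] == "map":
            # Synonym of an existing concept: reuse that concept's UUID.
            mapping[term] = entry["target_uuid"]
        elif entry["action"] == "create":
            # New canonical term: derive a deterministic ID from the label.
            mapping[term] = f"GSD:{uuid.uuid5(EXAMPLE_NAMESPACE, term.lower())}"
        else:
            raise ValueError(f"unknown action: {entry['action']!r}")
    return mapping
```

Deriving IDs deterministically means that re-running reconciliation on the same log yields the same mapping, which makes the result easier to diff and audit.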
-
Build embeddings
python src/gsd/part2_enrichment/1_ai-assisted_term_matching/01_create_vectordb.py
-
Run AI mapping for a source
python src/gsd/part2_enrichment/1_ai-assisted_term_matching/02_ai_mapping_gsdv0.py
-
Reconcile mapping decisions
python src/gsd/part2_enrichment/1_ai-assisted_term_matching/02_match_gsdv0_ai_mapping_with_uuid.py
(Repeat analogous steps for pubdictionaries)
python src/gsd/part2_enrichment/1_ai-assisted_term_matching/03_ai_mapping_pubdictionaries.py
python src/gsd/part2_enrichment/1_ai-assisted_term_matching/03_match_pubdict_ai_mapping_with_uuid.py
-
Merge into master dictionaries
python src/gsd/part2_enrichment/2_generate_mappings/postprocessing.py
Note
An OpenAI API key enables the application to access LLM services. Where to obtain an API key?
.
├── README.md
├── configs # YAML-based configuration for models, paths, and tooling
│   ├── base.yaml
│   ├── chroma.yaml # Persist directories + retriever params
│   ├── models.yaml # LLM labels + params
│   ├── ollama.yaml # Ollama configs
│   ├── paths.yaml
│   ├── schemas # JSON/schema definitions for bGST data model
│   └── prompts # Collection of system prompts in markdown format
├── data
│   ├── inputs # Raw/normalized source data for the pipelines
│   │   ├── _resource_template # Folder template for integrating new resources
│   │   │   ├── metadata
│   │   │   ├── normalized
│   │   │   └── raw
│   │   └── ... # Source data + merging audit records, grouped by folders
│   ├── outputs # Mapped terms (current/previous) + vectorstore snapshots (previous)
│   │   └── releases
│   └── workspace # Vectorstores of current release
│       └── chroma
├── docs # Supplementary documentation + notes
├── requirements.txt
├── scripts
│   └── ollama # Ollama server helpers (env var + pid management)
│       ├── start_server.py
│       ├── status_server.py
│       └── stop_server.py
├── src # Python library code for the GSD pipeline
│   └── gsd
│       ├── __init__.py
│       ├── adapters # Higher level adapter tools
│       ├── part1_textbook # EoG term extraction pipeline
│       ├── part2_enrichment # GSD resource enrichment pipeline
│       ├── cli.py
│       ├── config.py # Config loaders
│       ├── models.py
│       └── utils.py
└── tests # Unit tests

| Workflow | Description | Directory |
|---|---|---|
| GST Extraction | Extracts and classifies GSTs from a preprocessed text document and creates sentence-level citations as supporting evidence. Identifies GST entity pairs (i.e., has_abbr, has_formula). The example parses Essentials of Glycobiology 4e as a Chroma document. | src/gsd/part1_textbook/02_extract/ |
| RAG For Term Generation | Starts with deduplicated glycan structure terms. Retrieves the top-k document chunks from Essentials of Glycobiology 4e and synthesizes a term summary covering definition, cellular component, molecular function, and biological process. | src/gsd/part1_textbook/04_annotate |
| bGST Enrichment With New Datasets | Starts with a seed GST vectorstore (persist directory = src/data/workspace/chroma/gsd/). Parses query GST entities one at a time: searches each against existing term entries in the vectorstore and decides whether to (i) link the query to an existing entity or (ii) register a new entity. The vectorstore is updated dynamically during the iteration, and a list of AI term-linking audits is generated for human review before incorporation into the production GST datasets. | src/gsd/part2_enrichment/02_link/ |
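The link-or-register loop of the enrichment workflow can be sketched as follows. This is a deliberate simplification: a cosine-similarity threshold stands in for the LLM decision, a plain dict stands in for the Chroma vectorstore, and the function names and the `threshold` value are hypothetical. Vectors passed in would come from an embedding model; any short lists of floats work for experimentation.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_or_register(query: str, query_vec, store: dict, threshold: float = 0.9) -> dict:
    """Link `query` to its nearest stored term, or register it as a new term.

    `store` maps term label -> embedding vector and is updated in place
    whenever a new term is registered. Returns an audit record for review.
    """
    best_label, best_score = None, None
    for label, vec in store.items():
        score = cosine(query_vec, vec)
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    if best_score is not None and best_score >= threshold:
        return {"query": query, "action": "link", "target": best_label, "score": best_score}
    store[query] = query_vec  # the store grows as the iteration proceeds
    return {"query": query, "action": "register", "target": None, "score": best_score}
```

Because the store is updated inside the loop, a term registered early can become the link target for a later query, which is why the order of inputs and the audit trail both matter for human review.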
| Resource | URL | Entities | Notes |
|---|---|---|---|
| GlycoMotif | https://glycomotif.glyomics.org/ | 701 | Secondary: Glydin, UniCarbKB, GlyTouCan, CCRC, GlyGen |
| Glydin | https://glycoproteome.expasy.org/epitopes/ | | Secondary: SugarbindDB, GlycoEpitope, Cummings, BioOligo-DB |
| SugarbindDB | https://sugarbind.expasy.org/ | 204 | |
| GlycoEpitope | https://www.glycoepitope.jp/ | 173 | Also available at https://glycosmos.org/glycoepitope |
| Cummings | https://pubmed.ncbi.nlm.nih.gov/19756298/ | | |
| BioOligo-DB | https://glyco3d.cermav.cnrs.fr/search.php?type=bioligo | | |
| Monosac-DB | https://glycopedia.eu/resources/presentation/ | | |
| UniLectin3D | https://unilectin.unige.ch/unilectin3D/ | | |
| GlycoMaple | https://glycosmos.org/glycomaple/Human | | |
Describe the core data model(s) used by this project, including how glycan structure terms are represented, stored, and linked to external resources.
-
Primary storage: (e.g., JSONL, SQLite)
-
Key entities:
Each source terms file (*terms.jsonl) after formatting should produce lines like:
{
"lbl": "sialyl Lewis x",
"term_uuid": "GSD:32e928fb-1550-5e0a-945f-2218ac79b83c",
"gtc_id": [
"G00054MO"
],
"sources": [
{
"src_lbl": "sialyl Lewis x",
"src": "SRC:EOG_VARKI_4E",
"src_uuid": "SRC:66cc8ff8-5b05-4882-8c47-8ab4f036bed3"
},
{
"src_lbl": "sialyl Lewis x",
"src": "SRC:GSD_GLYGEN_V0",
"src_uuid": "SRC:0e4ec742-01a0-4d61-b1fb-655f380ac009"
},
{
"src_lbl": "sialyl Lewis x",
"src": "SRC:PUBDICTIONARIES-GLYCAN-IMAGE",
"src_uuid": "SRC:5c02589c-9c5e-489f-8863-e0bd2618d901"
}
],
"gsd_id": "GSD000151"
},
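A term record in this shape can be loaded and sanity-checked with a few lines of Python. The sketch below is illustrative: which keys are mandatory is an assumption (here `lbl`, `term_uuid`, and `sources`), and the helper names are hypothetical.

```python
import json

# Assumed mandatory keys; gtc_id and gsd_id are treated as optional here.
REQUIRED_KEYS = {"lbl", "term_uuid", "sources"}


def parse_term(line: str) -> dict:
    """Parse one JSONL line and check the fields this sketch relies on."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not record["term_uuid"].startswith("GSD:"):
        raise ValueError(f"unexpected term_uuid: {record['term_uuid']!r}")
    return record


def source_ids(record: dict) -> list:
    """Return the distinct source identifiers that contributed to a term."""
    return sorted({source["src"] for source in record["sources"]})


line = json.dumps({
    "lbl": "sialyl Lewis x",
    "term_uuid": "GSD:32e928fb-1550-5e0a-945f-2218ac79b83c",
    "gtc_id": ["G00054MO"],
    "sources": [
        {"src_lbl": "sialyl Lewis x", "src": "SRC:EOG_VARKI_4E",
         "src_uuid": "SRC:66cc8ff8-5b05-4882-8c47-8ab4f036bed3"},
        {"src_lbl": "sialyl Lewis x", "src": "SRC:GSD_GLYGEN_V0",
         "src_uuid": "SRC:0e4ec742-01a0-4d61-b1fb-655f380ac009"},
    ],
    "gsd_id": "GSD000151",
})
print(source_ids(parse_term(line)))  # ['SRC:EOG_VARKI_4E', 'SRC:GSD_GLYGEN_V0']
```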
Edges (*edges.jsonl) follow:
{
"subj": "GSD:a7868da4-a6c2-4825-97b9-c86700b1c213",
"pred": "is_a_related_synonym_of",
"obj": "GSD:8ce1f4e6-8cbe-5167-8ece-a1cfc850d3a5",
"comment": "GA1 is a related synonym of asialo-GM1"
},
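Edge records can be indexed to answer questions such as "which terms are related synonyms of this concept?". The following is a minimal sketch with hypothetical helper names; it assumes one JSON object per line with at least `subj`, `pred`, and `obj`.

```python
import json
from collections import defaultdict


def parse_edges(lines) -> list:
    """Parse JSONL edge lines, skipping blanks and checking required keys."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        edge = json.loads(line)
        missing = {"subj", "pred", "obj"} - edge.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        edges.append(edge)
    return edges


def subjects_by_object(edges, pred: str = "is_a_related_synonym_of") -> dict:
    """Map each object UUID to the subject UUIDs linked to it by `pred`."""
    index = defaultdict(list)
    for edge in edges:
        if edge["pred"] == pred:
            index[edge["obj"]].append(edge["subj"])
    return dict(index)
```

Grouping by `obj` keeps the direction of the predicate intact: the subject is the related synonym, and the object is the concept it points to.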
MIT License. Copyright (c) 2025 GlyGen
See LICENSE for more details.
Placeholder
- Placeholder for contributor/organization 1
- Placeholder for contributor/organization 2
- Placeholder for contributor/organization 3

