glygener/glycan-structure-dictionary

Biomarker Glycan Structure Terms (bGST) Workflow

LLM-powered pipeline for extracting & normalizing glycan structure terminology

About This Project

Biomarker Glycan Structure Terms (bGST) is a controlled vocabulary of glycan structure terms extracted from literature and databases. It captures textual representations of glycans and glycan-related structural features, including full structures, motifs, epitopes, and substructures.

Because glycan structures are described inconsistently across sources, this project uses an LLM-assisted retrieval and entity resolution workflow to map terms to existing Glycan Structure Dictionary (GSD) entries or register new ones when needed. This helps unify heterogeneous glycan terminology into a normalized, de-duplicated reference knowledgebase.

Previous Work:

Vora J, Navelkar R, Vijay-Shanker K, Edwards N, Martinez K, Ding X, Wang T, Su P, Ross K, Lisacek F, Hayes C, Kahsay R, Ranzinger R, Tiemeyer M, Mazumder R. The Glycan Structure Dictionary-a dictionary describing commonly used glycan structure terms. Glycobiology. 2023 Jun 3;33(5):354-357. doi: 10.1093/glycob/cwad014. PMID: 36799723; PMCID: PMC10243773.


Ollama
Local LLM inference
Run the pipeline entirely locally via Ollama, with configurable model selection and hardware setup.
LangGraph
Structured term normalization
Extract, normalize, and align glycan terminology through a state-driven workflow orchestrated by LangGraph.
Chroma
Vector search + embeddings
Build and query vector stores (Chroma) using embedded representations for similarity lookup.

Getting Started

Follow these steps to get a local copy up and running.

Prerequisites

Ollama is a local LLM inference runtime and model management layer that lets you pull and serve foundation models on-device. It abstracts backend details such as model packaging and request orchestration, so developers can run local models with minimal setup across different environments.

  • Install Ollama from https://ollama.com/download, or alternatively:

    curl -fsSL https://ollama.com/install.sh | sh
    • Ollama version >=v0.15.0 is recommended
  • HPC Users Only:

    • On servers that run on environment modules (Lmod), use the following to view pre-installed modules:

      module avail
    • To display the default version of Ollama:

      module -d avail ollama

Installation

  1. Clone this repo:

    git clone https://github.com/glygener/glycan-structure-dictionary.git
    cd glycan-structure-dictionary
  2. Pull the required Ollama models:

    A reasoning ("thinking") model and an embedding model are required. If you choose other models, remember to update the model names in configs/models.yaml. This pipeline was developed against a locally hosted Ollama server, for which GPU acceleration is strongly recommended. Alternatively, Ollama offers cloud models with limited free usage. To access cloud models and obtain an Ollama API key, refer to the Ollama documentation.

    Start your local Ollama service in a separate terminal window (close this window after verifying the downloads):

    Non-HPC users:

    ollama serve

    HPC Users Only:

    • Load the Ollama module each time you open a new terminal window:

      module load ollama
      ollama serve

    Back in your main terminal window, download your reasoning model and your embedding model (more models):

    ollama pull gpt-oss:20b
    ollama pull mxbai-embed-large:335m

    Verify the downloads:

    ollama list
    # NAME                         ID              SIZE      MODIFIED
    # mxbai-embed-large:335m       468836162de7    669 MB    7 weeks ago
    # gpt-oss:20b                  17052f91a42e    13 GB     7 weeks ago

    (You may now close the terminal window that runs the Ollama server)
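While the server is still running, the same verification can be scripted. The sketch below is not part of the repository: it assumes Ollama is listening on its default address (http://localhost:11434) and uses the server's GET /api/tags endpoint, which lists locally available models. The function names are illustrative.

```python
import json
import urllib.request

# Assumption: default Ollama address; change it if your server is configured differently.
OLLAMA_URL = "http://localhost:11434"
# Model names taken from the pull commands above.
REQUIRED_MODELS = ["gpt-oss:20b", "mxbai-embed-large:335m"]


def missing_models(installed, required):
    """Return the required model names that are absent from `installed`."""
    installed_set = set(installed)
    return [name for name in required if name not in installed_set]


def list_installed_models(base_url=OLLAMA_URL, timeout=5.0):
    """Ask a running Ollama server for its local models via GET /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        payload = json.load(resp)
    return [model["name"] for model in payload.get("models", [])]
```

Calling `missing_models(list_installed_models(), REQUIRED_MODELS)` against a running server returns an empty list when both models have been pulled.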

  3. Install Python dependencies:

    (Optional) create a virtual environment with Python 3.12:

    python3.12 -m venv .venv
    source .venv/bin/activate

    Install packages:

    python -m pip install -r requirements.txt
  4. Start Ollama server:

    For Non-HPC users:

    Every Python script that uses an LLM requires a running Ollama server. Use these scripts to start, stop, or check the server:

    python scripts/ollama/start_server.py
    python scripts/ollama/stop_server.py
    python scripts/ollama/status_server.py

    For HPC (Slurm) users only:

    The Ollama server is managed by the shell script ./main_slurm.sh, which serves as a template with resource presets. To run a Python LLM script through Slurm, pass the target script path to main_slurm.sh as an argument:

    sbatch main_slurm.sh <SCRIPT.PY_PATH>

    Example:

    sbatch main_slurm.sh src/gsd/part1_textbook/01_ingest.py

    On successful job submission, you will find the logs at logs/slurm-<job-id>_output.txt and logs/slurm-<job-id>_error.txt.

    More on basic Slurm commands
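For non-HPC setups, a quick reachability check before launching an LLM script might look like the sketch below. It is a stand-in, not the logic of scripts/ollama/status_server.py (which is described as managing environment variables and PIDs); it only assumes that a running Ollama server answers HTTP requests on its default address.

```python
import urllib.request


def is_server_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, connection refusals, and timeouts all derive from OSError.
        return False
```

A script could call `is_server_up()` first and exit with a clear message instead of failing midway through an ingest or extraction run.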

Usage

Workflow

Part 1: Term extraction from Essentials of Glycobiology (EoG) and relations mapping

  1. Create a Chroma database from EoG documents

    unzip data/inputs/eog/raw_chapters/unzip_me_before_running_01_ingest.py.zip -d data/inputs/eog/raw_chapters/
    python src/gsd/part1_textbook/01_ingest.py
    • Or for HPC users: sbatch main_slurm.sh src/.../TargetScript.py
  2. Extract terms from EoG documents (from vectorstore)

    python src/gsd/part1_textbook/02_extract.py

Varki A, Cummings RD, Esko JD, et al., editors. Essentials of Glycobiology [Internet]. 4th edition. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press; 2022. Available from: https://www.ncbi.nlm.nih.gov/books/NBK579918/ doi: 10.1101/9781621824213

Part 2: Incorporating heterogeneous data sources and building a deduplicated master list of terms

This part builds a master dictionary of glycan structure terms by:

  • Ingesting heterogeneous source term sets (Essentials of Glycobiology, legacy GSD v0, curated publications, composition lists, curator-supplied sets, etc.).
  • Normalizing and formatting raw term JSONL inputs into a canonical intermediate structure.
  • Creating a semantic vector store (Chroma + OpenAI embeddings) for retrieval-augmented AI mapping.
  • Running AI-assisted mapping agents to (a) map synonyms to existing concepts or (b) propose creation of new canonical terms.
  • Reconciling AI action logs into term-to-UUID mappings.
  • Post-processing: merging multiple sources into consolidated node (master_nodes.json) and edge (master_edges.json) registries with quality checks and backups.
  1. Build embeddings

    python src/gsd/part2_enrichment/1_ai-assisted_term_matching/01_create_vectordb.py
  2. Run AI mapping for a source

    python src/gsd/part2_enrichment/1_ai-assisted_term_matching/02_ai_mapping_gsdv0.py
  3. Reconcile mapping decisions

    python src/gsd/part2_enrichment/1_ai-assisted_term_matching/02_match_gsdv0_ai_mapping_with_uuid.py

    (Repeat analogous steps for pubdictionaries)

    python src/gsd/part2_enrichment/1_ai-assisted_term_matching/03_ai_mapping_pubdictionaries.py
    python src/gsd/part2_enrichment/1_ai-assisted_term_matching/03_match_pubdict_ai_mapping_with_uuid.py
  4. Merge into master dictionaries

    python src/gsd/part2_enrichment/2_generate_mappings/postprocessing.py
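To illustrate what the merge step aims for, here is a minimal sketch that consolidates term records sharing a term_uuid. It is not the code in postprocessing.py; the field names (lbl, term_uuid, gtc_id, sources, src_uuid) follow the record format shown under Data Model, and the merge rules (keep the first label, union the rest) are assumptions made for illustration.

```python
def merge_term_records(records):
    """Merge records that share a term_uuid, unioning gtc_id and sources."""
    merged = {}
    for record in records:
        key = record["term_uuid"]
        if key not in merged:
            # Assumption: the first label seen for a UUID is kept as the label.
            merged[key] = {
                "lbl": record["lbl"],
                "term_uuid": key,
                "gtc_id": [],
                "sources": [],
            }
        target = merged[key]
        for gtc in record.get("gtc_id", []):
            if gtc not in target["gtc_id"]:
                target["gtc_id"].append(gtc)
        seen = {src["src_uuid"] for src in target["sources"]}
        for src in record.get("sources", []):
            # De-duplicate source entries by their src_uuid.
            if src["src_uuid"] not in seen:
                target["sources"].append(src)
                seen.add(src["src_uuid"])
    return list(merged.values())
```

Keeping insertion order means the output stays stable across runs, which makes diffs against a previous release easier to review.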

Note

An OpenAI API key enables the application to access LLM services. Where to obtain an API key?

Project Structure

.
├── README.md
├── configs                       # YAML-based configuration for models, paths, and tooling
│   ├── base.yaml
│   ├── chroma.yaml               # Persist directories + retriever params
│   ├── models.yaml               # LLM labels + params
│   ├── ollama.yaml               # Ollama configs
│   ├── paths.yaml
│   ├── schemas                   # JSON/schema definitions for bGST data model
│   └── prompts                   # Collection of system prompts in markdown format
├── data
│   ├── inputs                    # Raw/normalized source data for the pipelines
│   │   ├── _resource_template    # Folder template for integrating new resources
│   │   │   ├── metadata
│   │   │   ├── normalized
│   │   │   └── raw
│   │   └── ...                   # Source data + merging audit records, grouped by folders
│   ├── outputs                   # Mapped terms (current/previous) + vectorstore snapshots (previous)
│   │   └── releases
│   └── workspace                 # Vectorstores of current release
│       └── chroma
├── docs                          # Supplementary documentation + notes
├── requirements.txt
├── scripts
│   └── ollama                    # Ollama server helpers (env var + pid management)
│       ├── start_server.py
│       ├── status_server.py
│       └── stop_server.py
├── src                           # Python library code for the GSD pipeline
│   └── gsd
│       ├── __init__.py
│       ├── adapters              # Higher level adapter tools
│       ├── part1_textbook        # EoG term extraction pipeline
│       ├── part2_enrichment      # GSD resource enrichment pipeline
│       ├── cli.py
│       ├── config.py             # Config loaders
│       ├── models.py
│       └── utils.py
└── tests                         # Unit tests

LLM Workflows

| Workflow | Description | Directory |
| --- | --- | --- |
| GST Extraction | Extracts and classifies glycan structure terms (GST) from a preprocessed text document and creates sentence-level citations as supporting evidence. Identifies GST entity pairs (i.e. has_abbr, has_formula). The example parses Essentials of Glycobiology 4e as a Chroma document. | src/gsd/part1_textbook/02_extract/ |
| RAG For Term Generation | Starts with deduplicated glycan structure terms. Retrieves the top-k document chunks from Essentials of Glycobiology 4e and synthesizes a term summary covering definition, cellular component, molecular function, and biological process. | src/gsd/part1_textbook/04_annotate |
| bGST Enrichment With New Datasets | Starts with a seed GST vectorstore (persist directory: src/data/workspace/chroma/gsd/). Parses query GST entities one at a time, searches each against the existing term entries in the vectorstore, and decides whether to (i) link the query to an existing entity or (ii) register a new entity. The vectorstore is updated dynamically during the iteration, and a list of AI term-linking audits is generated for human review before incorporation into the production GST datasets. | src/gsd/part2_enrichment/02_link/ |
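The link-or-register loop in the enrichment workflow can be sketched in a few lines. This is a simplified illustration, not the pipeline's implementation: a fixed cosine-similarity threshold stands in for the LLM agent's decision, a plain dict stands in for the Chroma vectorstore, and the `NEW:` identifiers, the threshold value, and the toy vectors are all made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_or_register(query_label, query_vec, store, threshold=0.9):
    """Link the query to its nearest stored term, or register it as a new term.

    `store` maps term id -> {"labels": [...], "vec": [...]} and is updated in
    place, so later queries can match terms registered earlier in the run.
    Returns an audit entry describing the decision.
    """
    best_id, best_score = None, -1.0
    for term_id, entry in store.items():
        score = cosine(query_vec, entry["vec"])
        if score > best_score:
            best_id, best_score = term_id, score
    if best_id is not None and best_score >= threshold:
        store[best_id]["labels"].append(query_label)
        action, target = "link", best_id
    else:
        target = f"NEW:{len(store) + 1}"  # hypothetical id scheme
        store[target] = {"labels": [query_label], "vec": list(query_vec)}
        action = "register"
    return {"query": query_label, "action": action, "target": target, "score": best_score}


# Toy usage: one seed term, then two queries processed in order.
seed_store = {"T1": {"labels": ["sialyl Lewis x"], "vec": [1.0, 0.0]}}
audits = [
    link_or_register("sLeX", [0.99, 0.05], seed_store),
    link_or_register("GM1", [0.0, 1.0], seed_store),
]
```

Collecting the returned entries in a list gives the kind of audit trail the workflow hands to curators for review.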

Data

Data Source

| Resource | URL | Entities | Notes |
| --- | --- | --- | --- |
| GlycoMotif | https://glycomotif.glyomics.org/ | 701 | Secondary: Glydin, UniCarbKB, GlyTouCan, CCRC, GlyGen |
| Glydin | https://glycoproteome.expasy.org/epitopes/ | | Secondary: SugarbindDB, GlycoEpitope, Cummings, BioOligo-DB |
| SugarbindDB | https://sugarbind.expasy.org/ | 204 | |
| GlycoEpitope | https://www.glycoepitope.jp/ | 173 | Also available at https://glycosmos.org/glycoepitope |
| Cummings | https://pubmed.ncbi.nlm.nih.gov/19756298/ | | |
| BioOligo-DB | https://glyco3d.cermav.cnrs.fr/search.php?type=bioligo | | |
| Monosac-DB | https://glycopedia.eu/resources/presentation/ | | |
| UniLectin3D | https://unilectin.unige.ch/unilectin3D/ | | |
| GlycoMaple | https://glycosmos.org/glycomaple/Human | | |

Data Model

Describe the core data model(s) used by this project, including how glycan structure terms are represented, stored, and linked to external resources.

  • Primary storage: (e.g., JSONL, SQLite)

  • Key entities:

Each source terms file (*terms.jsonl) after formatting should produce lines like:

{
  "lbl": "sialyl Lewis x",
  "term_uuid": "GSD:32e928fb-1550-5e0a-945f-2218ac79b83c",
  "gtc_id": [
    "G00054MO"
  ],
  "sources": [
    {
      "src_lbl": "sialyl Lewis x",
      "src": "SRC:EOG_VARKI_4E",
      "src_uuid": "SRC:66cc8ff8-5b05-4882-8c47-8ab4f036bed3"
    },
    {
      "src_lbl": "sialyl Lewis x",
      "src": "SRC:GSD_GLYGEN_V0",
      "src_uuid": "SRC:0e4ec742-01a0-4d61-b1fb-655f380ac009"
    },
    {
      "src_lbl": "sialyl Lewis x",
      "src": "SRC:PUBDICTIONARIES-GLYCAN-IMAGE",
      "src_uuid": "SRC:5c02589c-9c5e-489f-8863-e0bd2618d901"
    }
  ],
  "gsd_id": "GSD000151"
}
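A formatted terms file can be sanity-checked before it enters the mapping steps. The sketch below assumes one JSON object per line and treats lbl, term_uuid, and sources as required keys; which fields are actually mandatory is an assumption here, not something the repository's schemas are known to state.

```python
import json

# Assumption: these keys must be present on every term record.
REQUIRED_KEYS = {"lbl", "term_uuid", "sources"}


def parse_terms(lines):
    """Parse JSONL lines into records and collect (line_no, problem) pairs."""
    records, problems = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_no, "record is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((line_no, f"missing keys: {sorted(missing)}"))
            continue
        records.append(record)
    return records, problems
```

Passing an open file handle works as well as a list of strings, since the function only iterates over lines.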

Edges (*edges.jsonl) follow:

{
  "subj": "GSD:a7868da4-a6c2-4825-97b9-c86700b1c213",
  "pred": "is_a_related_synonym_of",
  "obj": "GSD:8ce1f4e6-8cbe-5167-8ece-a1cfc850d3a5",
  "comment": "GA1 is a related synonym of asialo-GM1"
}
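Because edges refer to terms by UUID, it can be useful to confirm that every edge endpoint exists among the nodes and to look up synonyms through the edge list. The helpers below are illustrative only; they use the subj, pred, and obj fields shown above, and the function names are not from the repository.

```python
def dangling_edges(edges, known_uuids):
    """Return edges whose subject or object is not among the known term UUIDs."""
    known = set(known_uuids)
    return [e for e in edges if e["subj"] not in known or e["obj"] not in known]


def synonyms_of(term_uuid, edges, pred="is_a_related_synonym_of"):
    """Return the subject UUIDs that point at term_uuid via the given predicate."""
    return [e["subj"] for e in edges if e["pred"] == pred and e["obj"] == term_uuid]
```

An empty result from `dangling_edges` indicates that the node and edge files are consistent with each other.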

License

MIT License. Copyright (c) 2025 GlyGen

See LICENSE for more details.

Acknowledgements

Placeholder

  • Placeholder for contributor/organization 1
  • Placeholder for contributor/organization 2
  • Placeholder for contributor/organization 3

About

This repository maintains the most up-to-date version of the Glycan Structure Dictionary.