SNOMED CT is a biomedical ontology with a hierarchical representation of large-scale concepts. Knowledge retrieval in SNOMED CT is critical for its application, but often proves challenging due to language ambiguity, synonymy, polysemy, and so on. This problem is exacerbated when queries are out-of-vocabulary (OOV), i.e., they have no equivalent matches in the ontology. In this work, we focus on the problem of hierarchical concept retrieval from SNOMED CT with OOV queries, and propose an approach based on language model-based ontology embeddings. For evaluation, we construct OOV queries annotated against SNOMED CT concepts, testing the retrieval of the most direct subsumers and their less relevant ancestors. We find that our method outperforms baselines including SBERT and two lexical matching methods. While evaluated against SNOMED CT, the approach is generalisable and can be extended to other ontologies. We release code, tools, and evaluation datasets at https://github.com/jonathondilworth.
- Create new datasets to test against using `make sample` (requires manual annotation).
- Re-use the provided `retrievers` and `gpu_retrievers` for knowledge retrieval. These support:
  - TF-IDF, BM25
  - SBERT
  - HiT and OnT (with hyperbolic distance and/or entity & concept subsumption)
- Review experimental results.
  - Published within the logs folder; they can be viewed within the included notebook.
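HiT- and OnT-style retrievers rank candidates by distance in hyperbolic space rather than cosine similarity. As a minimal sketch of the idea, the snippet below ranks concepts by Poincaré-ball distance to a query embedding; the 2-D vectors and concept names are illustrative, not the repository's actual embeddings or API:

```python
import math

def poincare_distance(u, v):
    """Distance between two points in the Poincaré ball model of
    hyperbolic space, as used by hierarchy-aware embeddings."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u_sq = sum(a * a for a in u)
    norm_v_sq = sum(b * b for b in v)
    x = 1 + 2 * diff_sq / ((1 - norm_u_sq) * (1 - norm_v_sq))
    return math.acosh(x)

# Illustrative 2-D embeddings: in hyperbolic hierarchy embeddings,
# more general concepts tend to sit nearer the origin.
concepts = {
    "disorder":       (0.1, 0.0),
    "heart disorder": (0.5, 0.1),
    "myocarditis":    (0.8, 0.2),
}
query = (0.75, 0.25)  # embedding of an OOV query phrase

# Rank candidate concepts by hyperbolic distance to the query.
ranked = sorted(concepts, key=lambda c: poincare_distance(query, concepts[c]))
print(ranked[0])  # prints "myocarditis"
```

The real retrievers additionally score entity and concept subsumption; this sketch only shows the distance-based ranking component.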
See the included Makefile and the note under reproducibility on deployment.
To initialise the repo on a remote machine, clone the repository with `git clone https://github.com/jonathondilworth/HR-OOV.git` and `cd HR-OOV`. Then run `make init` and `make env`.
To process SNOMED CT, run `make download-snomed` and `make process-snomed`.
To process MIRAGE, run `make download-mirage` and `make process-mirage`.
To create new datasets, set `SAMPLING_PROCEDURE=random` in `.env` and run `make sample`.
See details under reproducibility for re-creating experimental results.
(Instructions on local execution with docker coming soon...)
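The `SAMPLING_PROCEDURE` toggle can be pictured as follows; the fixed seed, helper name, and concept identifiers below are illustrative assumptions, not the repository's implementation:

```python
import os
import random

def make_rng(procedure: str) -> random.Random:
    """Illustrative: a deterministic procedure pins the seed so repeated
    runs draw identical samples; 'random' seeds from system entropy."""
    if procedure == "deterministic":
        return random.Random(42)  # fixed seed is an assumption
    return random.Random()

procedure = os.environ.get("SAMPLING_PROCEDURE", "deterministic")
rng = make_rng(procedure)

# Draw a reproducible evaluation sample from a pool of candidates.
candidates = [f"concept-{i}" for i in range(100)]
sample = rng.sample(candidates, k=5)
```

With `deterministic`, every run draws the same sample, which is what allows the published results to be reproduced exactly; `random` produces a fresh split for new annotation rounds.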
For end-to-end reproducibility, we strongly suggest running the included Makefile within a fresh VM instance; it is presently used for deploying to remote cloud VMs (support for Docker will be included shortly).
To reproduce experimental results, add your `NHS_API_KEY` to `.env` and set `SAMPLING_PROCEDURE=deterministic`, as shown in the example env file.
Then run `make`.
This procedure will:
- Initialise the project using `init.sh`.
- Configure the environment with `env.sh`.
- Download and process the September 2025 release of SNOMED CT.
  - This will fail over to a publicly available version if no `NHS_API_KEY` has been provided.
  - Failing to provide the `NHS_API_KEY` will result in small variations in the results.
- Download embedding models.
- Produce embeddings for experiments.
- Run single- and multiple-target experiments.
See the Makefile and scripts folder for the specific implementation.
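The single-target experiments test whether the most direct subsumer is retrieved, while the multiple-target experiments also credit less relevant ancestors. A minimal sketch of a hits@k-style check under that setup (concept IDs and data are fabricated for illustration):

```python
def hits_at_k(ranked, targets, k=10):
    """Fraction of queries whose ranked list contains at least one
    target concept within the top-k positions."""
    hits = sum(1 for r, t in zip(ranked, targets) if set(r[:k]) & set(t))
    return hits / len(ranked)

# Illustrative rankings for two OOV queries (concept IDs are fake).
rankings = [
    ["C1", "C2", "C3"],  # query 1
    ["C9", "C4", "C5"],  # query 2
]
direct = [["C1"], ["C7"]]                  # single target: most direct subsumer
ancestors = [["C1", "C0"], ["C5", "C6"]]   # multiple targets: any ancestor counts

print(hits_at_k(rankings, direct, k=3))     # prints 0.5
print(hits_at_k(rankings, ancestors, k=3))  # prints 1.0
```

The gap between the two scores illustrates why both settings are reported: a retriever can miss the exact subsumer while still landing in the right region of the hierarchy.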
- OS: Ubuntu 22.04
- Python: 3.12
- NVCC: 12.9
- NVIDIA GPU: H200
- vCPUs: 24
- Memory: 240 GB
The models used within this work include OnT-96, OnT-Mini-128 and HiT-mixed-SNOMED-25, all of which are available to download at OntoZoo.
As this work is ongoing, OntoZoo and Google Drive provide hosting prior to publishing finalised models to HuggingFace.
- OnT-96: https://ontozoo.io/models/OnT-96-ckpt.zip
- OnT-Mini-128: https://drive.google.com/file/d/1cQOqFVOHqBKkSirepzF7ga6mRYPP-LnT/view
- HiT-mixed-SNOMED-25: https://drive.google.com/file/d/1cQOqFVOHqBKkSirepzF7ga6mRYPP-LnT/view
To train HiT and OnT models using a local copy of SNOMED CT, ensure `snomedct-international.owl` is within the data directory, and run:

- `make hit-data` to prepare the training data for a Hierarchy Transformer.
- `make ont-data` to prepare the training data for an Ontology Transformer.
- `make train-hit` to train a HiT model, according to the HiT `config.yaml`.
- `make train-ont` to train an OnT model, according to the OnT `config.yaml`.
See the included documentation on training HiT models using custom ontologies, or review the Hierarchy Transformers repo to re-train models using existing datasets from HuggingFace.
See the OnT repo for training custom OnT models.
All source code is licensed under the MIT License (see LICENSE).
This repository does not redistribute the full SNOMED CT release or any substantial portion of it. Scripts in this repository may download SNOMED CT content from existing, publicly available datasets. The small evaluation subset packaged within this repository is derived from SNOMED CT International Edition (release 2025/10/01).
This material includes SNOMED Clinical Terms® (SNOMED CT®), which is used by permission of the International Health Terminology Standards Development Organisation (IHTSDO). All rights reserved. SNOMED CT® was originally created by The College of American Pathologists. "SNOMED" and "SNOMED CT" are registered trademarks of the IHTSDO.