HR-OOV

SNOMED CT is a biomedical ontology with a hierarchical representation of large-scale concepts. Knowledge retrieval in SNOMED CT is critical for its application, but often proves challenging due to language ambiguity, synonymy, polysemy and so on. This problem is exacerbated when the queries are out-of-vocabulary (OOV), i.e., have no equivalent match in the ontology. In this work, we focus on the problem of hierarchical concept retrieval from SNOMED CT with OOV queries, and propose an approach based on language model-based ontology embeddings. For evaluation, we construct OOV queries annotated against SNOMED CT concepts, testing the retrieval of the most direct subsumers and their less relevant ancestors. We find that our method outperforms the baselines, including SBERT and two lexical matching methods. While evaluated against SNOMED CT, the approach is generalisable and can be extended to other ontologies. We release code, tools, and evaluation datasets at https://github.com/jonathondilworth.
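To make the two evaluation targets concrete, here is a minimal sketch of the distinction between retrieving the most direct subsumer and retrieving any ancestor. The toy hierarchy, the labels, the pretend retriever output and the hit-at-k check are all illustrative assumptions; they are not taken from the repository or from SNOMED CT.

```python
# Toy is-a hierarchy (child -> parent). Labels are illustrative only,
# not real SNOMED CT content.
PARENT = {
    "viral pneumonia": "pneumonia",
    "pneumonia": "lung disease",
    "lung disease": "disease",
}


def ancestors(concept, parent=PARENT):
    """Return all subsumers of `concept`, nearest first."""
    chain = []
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain


def hit_at_k(ranked, targets, k):
    """True if any target concept appears in the top-k retrieved results."""
    return any(c in targets for c in ranked[:k])


# Hypothetical OOV query annotated with its most direct subsumer.
direct = "pneumonia"
targets_single = {direct}                       # direct subsumer only
targets_multi = {direct, *ancestors(direct)}    # direct subsumer plus ancestors

# Pretend retriever output, best match first (an assumption for illustration).
ranked = ["lung disease", "pneumonia", "disease"]

print(hit_at_k(ranked, targets_single, 1))
print(hit_at_k(ranked, targets_multi, 1))
```

In this toy case the top result is an ancestor rather than the direct subsumer, so it only counts as a hit when ancestors are accepted as targets.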

Features

Create new datasets to test against using make sample (requires manual annotation).
Re-use the provided retrievers and gpu_retrievers for knowledge retrieval. These support:
- TF-IDF, BM25
- SBERT
- HiT and OnT (with hyperbolic distance and/or entity & concept subsumption)
Review experimental results.
- Results are published in the logs folder and can be viewed in the included notebook.
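For readers unfamiliar with hyperbolic distance as a ranking signal, the following is a minimal sketch of the standard Poincaré ball distance and of ranking candidates by it. It assumes a unit ball with curvature -1; the released HiT and OnT models may use a different curvature and a different scoring function, and the function names here are hypothetical rather than the retrievers' API.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Assumes curvature -1 and that both points have norm strictly below 1.
    """
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1.0 - sq_norm_u) * (1.0 - sq_norm_v)
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


def rank_by_distance(query_vec, concept_vecs):
    """Return concept names ordered from nearest to farthest from the query."""
    return sorted(
        concept_vecs,
        key=lambda name: poincare_distance(query_vec, concept_vecs[name]),
    )


# Made-up 2-d embeddings, purely for illustration.
concepts = {"a": [0.1, 0.0], "b": [0.8, 0.0]}
print(rank_by_distance([0.2, 0.0], concepts))
```

Distances grow rapidly towards the boundary of the ball, which is what lets a hierarchy be laid out with general concepts near the origin and specific ones further out.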

Usage

See the included Makefile and the note on deployment under Reproducibility.

To initialise the repo on a remote machine, clone the repository with git clone https://github.com/jonathondilworth/HR-OOV.git and cd HR-OOV. Then run make init and make env.
To process SNOMED CT, run make download-snomed and make process-snomed.
To process MIRAGE, run make download-mirage and make process-mirage.
To create new datasets, set SAMPLING_PROCEDURE=random in .env and run make sample.
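As a rough illustration of what a SAMPLING_PROCEDURE switch can look like, the sketch below draws a sample either from a fixed seed (repeatable) or from a fresh random state. This is an assumption about the general pattern, not the repository's implementation: the function name, the default value and the seed are hypothetical.

```python
import os
import random


def sample_concepts(concept_ids, n, procedure=None, seed=42):
    """Draw `n` concept ids, repeatably or not depending on `procedure`.

    `procedure` falls back to the SAMPLING_PROCEDURE environment variable;
    the "deterministic" default and the seed value are assumptions.
    """
    procedure = procedure or os.environ.get("SAMPLING_PROCEDURE", "deterministic")
    if procedure == "deterministic":
        rng = random.Random(seed)   # fixed seed: same sample on every run
    elif procedure == "random":
        rng = random.Random()       # seeded from the OS: a new sample each run
    else:
        raise ValueError(f"unknown SAMPLING_PROCEDURE: {procedure!r}")
    # Sort first so the result does not depend on the input ordering.
    return rng.sample(sorted(concept_ids), n)


first = sample_concepts(range(100), 5, "deterministic")
second = sample_concepts(range(100), 5, "deterministic")
print(first == second)
```

Newly sampled datasets still require manual annotation before they can be used for evaluation.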

See details under Reproducibility for re-creating experimental results.

(Instructions on local execution with Docker coming soon...)

Reproducibility

For end-to-end reproducibility, we strongly suggest running the included Makefile within a fresh VM instance. The Makefile is presently used for deploying to remote cloud VMs (support for Docker will be included shortly).

To reproduce experimental results, add your NHS_API_KEY to .env and set SAMPLING_PROCEDURE=deterministic, as shown in the example env file.

Then run make.

This procedure will:

Initialise the project using init.sh.
Configure the environment with env.sh.
Download and process the September 2025 release of SNOMED CT.
- This will fall back to a publicly available version if no NHS_API_KEY has been provided.
- Failing to provide the NHS_API_KEY will result in small variations in the results.
Download embedding models.
Produce embeddings for experiments.
Run single and multiple target experiments.
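The NHS_API_KEY fallback described above can be pictured with a short sketch. It parses a .env-style string and decides which SNOMED CT source to use. The parser, the function names and the returned labels are hypothetical; the actual logic lives in the Makefile and scripts folder and may differ.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def choose_snomed_source(env):
    """Pick the download source; the labels returned here are illustrative."""
    if env.get("NHS_API_KEY"):
        return "nhs-api"
    # No key (or an empty one): fall back to the public copy, which the
    # README notes gives slightly different results.
    return "public-fallback"


example = "NHS_API_KEY=your-key-here\nSAMPLING_PROCEDURE=deterministic\n"
env = parse_env(example)
print(choose_snomed_source(env), env["SAMPLING_PROCEDURE"])
```

Checking the key up front makes it obvious which release a run was based on, which matters when comparing against the published results.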

See the Makefile and scripts folder for the specific implementation.

Environment

OS: Ubuntu 22.04
Python: 3.12
NVCC: 12.9

Hardware

NVIDIA GPU: H200
vCPUs: 24
Memory: 240 GB

Models

The models used within this work include OnT-96, OnT-Mini-128 and HiT-mixed-SNOMED-25, all of which are available to download at OntoZoo.

As this work is ongoing, the models are hosted on OntoZoo and Google Drive until finalised versions are published to HuggingFace.

Direct Model Links

OnT-96: https://ontozoo.io/models/OnT-96-ckpt.zip
OnT-Mini-128: https://drive.google.com/file/d/1cQOqFVOHqBKkSirepzF7ga6mRYPP-LnT/view
HiT-mixed-SNOMED-25: https://drive.google.com/file/d/1cQOqFVOHqBKkSirepzF7ga6mRYPP-LnT/view

Training Models

To train HiT and OnT models using a local copy of SNOMED CT, ensure snomedct-international.owl is within the data directory, and run:

make hit-data to prepare the training data for a Hierarchy Transformer.
make ont-data to prepare the training data for an Ontology Transformer.
make train-hit to train a HiT model, according to the HiT config.yaml.
make train-ont to train an OnT model, according to the OnT config.yaml.

Training Models with Custom OWL Ontologies

See the included documentation on training HiT models using custom ontologies, or review the Hierarchy Transformers repo to re-train models using existing datasets from HuggingFace.

See the OnT repo for training custom OnT models.

License

All source code is licensed under the MIT License (see LICENSE).

SNOMED CT

This repository does not redistribute the full SNOMED CT release or any substantial portion of it. Scripts in this repository may download SNOMED CT content from existing, publicly available datasets. The small evaluation subset packaged within this repository is derived from SNOMED CT International Edition (release 2025/10/01).

This material includes SNOMED Clinical Terms® (SNOMED CT®) which is used by permission of the International Health Terminology Standards Development Organisation (IHTSDO). All rights reserved. SNOMED CT® was originally created by The College of American Pathologists. "SNOMED" and "SNOMED CT" are registered trademarks of the IHTSDO.
