XDATA-263 Implement Ontology Inclusion Score #2090
Merged
eKathleenCarter merged 13 commits into main on Mar 11, 2026
* Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282) — Add drug/disease list publication to the data_publication pipeline, reading versioned parquet files from GCS and writing to everycure/drug-list and everycure/disease-list on HF Hub. The disease list drops six internal columns before publishing. HF publication is triggered automatically from the core_entities release CI on minor/major releases (patches are skipped).
* Add drug/disease datasets and clarify releases — Update the public data releases page to refer to 'datasets' and broaden the examples (knowledge graphs, drug lists, disease lists). Add drug-list and disease-list entries to the Available Datasets table with Hugging Face links and a Docs column, adjust the kg-nodes/kg-edges rows, and change 'Both datasets' to 'All datasets'.
* Refactor HF pipeline into separate publication pipelines — Requested by @JacquesVergine: instead of having a single data publication pipeline, there are now individual pipelines for the KG, drug list, and disease list.
* Add publish-to-HuggingFace step to the core_entities release pipeline
* HF disease list now merges full Mondo list with EC curated data (#2079) — The HuggingFace disease list publication previously included only curated diseases. It now left-joins the full Mondo disease list with the EC release list, so all Mondo diseases are present, with EC enrichment data (specialties, prevalence, categories, etc.) where available. For overlapping columns, EC values take precedence.
* Remove now-unnecessary drop_disease_hf_columns method
* Clarify exclusion of columns in HF dataset release — Added clarification that certain columns are excluded from the public HF release due to their experimental nature.
* Simplify data_publication pipeline tests — Remove tests that targeted the internal _drop_disease_hf_columns helper and explicit publish_drug_list_node/publish_disease_list_node behaviors. Keep a single higher-level test that asserts the pipeline contains the expected node names (publish_kg_edges_node, publish_kg_nodes_node). This reduces coupling to implementation details and focuses the test on pipeline composition.
* Add Git tag support to HFIterableDataset for version-coupled releases (#2080) — HFIterableDataset now accepts an optional `tag` parameter. After push and verification, it creates a Git tag on the HuggingFace Hub repo pinned to the exact commit SHA of the upload. This couples internal EC release versions (e.g. v1.1.0) with HF dataset versions, so users can load a specific release with `load_dataset("everycure/disease-list", revision="v1.1.0")`.
* Move the HF pipeline into the kedro cloud space (from base)
* Rephrase Hugging Face disease list docs
* Add approximate number of diseases in disease list release
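The Mondo/EC merge described in the commits above can be sketched in pandas. This is a minimal illustration, not the pipeline code: the column names (`id`, `label`, `prevalence`) and the tiny frames are hypothetical stand-ins for the real schemas.

```python
import pandas as pd

# Hypothetical miniature versions of the two lists.
mondo = pd.DataFrame({
    "id": ["MONDO:1", "MONDO:2", "MONDO:3"],
    "label": ["disease A", "disease B", "disease C"],
})
ec = pd.DataFrame({
    "id": ["MONDO:2"],
    "label": ["disease B (curated)"],  # overlapping column: EC value should win
    "prevalence": ["rare"],            # EC-only enrichment column
})

# Left-join the full Mondo list with the EC release list; suffix the Mondo
# copy of overlapping columns so the EC value can take precedence.
merged = mondo.merge(ec, on="id", how="left", suffixes=("_mondo", ""))
for col in ["label"]:  # overlapping columns
    # Fall back to the Mondo value only where EC has no row.
    merged[col] = merged[col].fillna(merged[f"{col}_mondo"])
    merged = merged.drop(columns=[f"{col}_mondo"])
```

The result keeps all Mondo rows, with EC enrichment (`prevalence`) present only where a curated row exists, matching the "EC values take precedence" behavior described above.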
Author
Leaving this as a draft until I can get my …
JacquesVergine (Collaborator) approved these changes on Mar 11, 2026:
Looks good to me!
…_metrics.py Co-authored-by: Jacques Vergine <jacques.vergine35@gmail.com>
Description of the changes
This PR implements the Ontology Inclusion Score metric (XDATA-263) as part of the integration pipeline.
Note: This branch was created off of ekcarter/xdata-285-improve-the-connected-components-output and should be merged after that PR is merged.
Background
Not all nodes in the knowledge graph are connected to ontological structures (TBox). Nodes that are connected to concept-level edges benefit from additional structural context, which improves performance in GNN-style approaches by grouping similar nodes closer together. This metric quantifies that coverage.
What was added
- `compute_ontology_inclusion_metric` in nodes.py: takes `unified_nodes` and `unified_edges`, identifies which nodes appear as subject or object in at least one concept-level (TBox) edge (a descendant of `related_to_at_concept_level` in the Biolink model), and outputs a per-node boolean flag `is_ontology_connected`.
- `combine_node_metrics` in nodes.py: joins all per-node metric datasets (`node_components`, `node_ontology`) into the final `node_metrics` table. `compute_connected_components` now writes to `integration.prm.node_components` (an intermediate local dataset), and `compute_ontology_inclusion_metric` writes to `integration.prm.node_ontology`. The new `combine_node_metrics` step joins these into `integration.prm.node_metrics`, which is also written to BQ for use in the dashboard and downstream resources.
Formula
Ontology Inclusion Score = (number of nodes with `is_ontology_connected` = True) / (total number of nodes)
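The metric above can be sketched in a few lines of pandas. This is an illustrative sketch, not the pipeline implementation: the toy frames stand in for `unified_nodes`/`unified_edges`, and the `CONCEPT_LEVEL` set is an assumed stand-in for the computed descendants of `related_to_at_concept_level` in the Biolink model.

```python
import pandas as pd

# Toy stand-ins for unified_nodes and unified_edges.
nodes = pd.DataFrame({"id": ["n1", "n2", "n3", "n4"]})
edges = pd.DataFrame({
    "subject": ["n1", "n2", "n3"],
    "predicate": [
        "biolink:subclass_of",                  # concept-level (TBox)
        "biolink:treats",                       # instance-level
        "biolink:related_to_at_concept_level",  # concept-level (TBox)
    ],
    "object": ["n2", "n3", "n1"],
})

# Assumption: a precomputed set of concept-level predicates (descendants
# of related_to_at_concept_level in the Biolink model).
CONCEPT_LEVEL = {"biolink:subclass_of", "biolink:related_to_at_concept_level"}

# A node is ontology-connected if it appears as subject or object of
# at least one concept-level edge.
tbox = edges[edges["predicate"].isin(CONCEPT_LEVEL)]
connected = set(tbox["subject"]) | set(tbox["object"])
nodes["is_ontology_connected"] = nodes["id"].isin(connected)

# Ontology Inclusion Score = connected nodes / total nodes.
score = nodes["is_ontology_connected"].mean()
```

Here `n1`, `n2`, and `n3` each touch a TBox edge while `n4` does not, so the score is 3/4.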
Fixes / Resolves the following issues:
Checklist:
- Labeled the PR (enhancement or bug)
- If pulling in latest main, uncomment the "Merge Notification" section below and describe the steps necessary for people
- `kedro run -e sample -p test_sample` (see the sample environment guide)