XDATA-263 Implement Ontology Inclusion Score #2090
Merged
eKathleenCarter merged 13 commits into main on Mar 11, 2026
* Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282) — Add drug/disease list publication to the data_publication pipeline, reading versioned parquet files from GCS and writing to everycure/drug-list and everycure/disease-list on HF Hub. The disease list drops six internal columns before publishing. HF publication is triggered automatically from the core_entities release CI on minor/major releases (patches are skipped).
* Add drug/disease datasets and clarify releases — Update the public data releases page to refer to 'datasets' and broaden the examples (knowledge graphs, drug lists, disease lists). Add drug-list and disease-list entries to the Available Datasets table with Hugging Face links and a Docs column, adjust the kg-nodes/kg-edges rows, and change 'Both datasets' to 'All datasets'.
* Refactor HF pipeline into separate publication pipelines — Requested by @JacquesVergine: instead of having a single data publication pipeline, there are now individual pipelines for the KG, drug list, and disease list.
* Add publish-to-HuggingFace step to the core_entities release pipeline
* HF disease list now merges full Mondo list with EC curated data (#2079) — The HuggingFace disease list publication previously included only curated diseases. It now left-joins the full Mondo disease list with the EC release list, so all Mondo diseases are present, with EC enrichment data (specialties, prevalence, categories, etc.) where available. For overlapping columns, EC values take precedence.
* Remove now-unnecessary drop_disease_hf_columns method
* Clarify exclusion of columns in HF dataset release — Added clarification that certain columns are excluded from the public HF release due to their experimental nature.
* Simplify data_publication pipeline tests — Remove tests that targeted the internal _drop_disease_hf_columns helper and explicit publish_drug_list_node/publish_disease_list_node behaviors. Keep a single higher-level test that asserts the pipeline contains the expected node names (publish_kg_edges_node, publish_kg_nodes_node). This reduces coupling to implementation details and focuses the test on pipeline composition.
* Add Git tag support to HFIterableDataset for version-coupled releases (#2080) — HFIterableDataset now accepts an optional `tag` parameter. After push and verification, it creates a Git tag on the HuggingFace Hub repo pinned to the exact commit SHA of the upload. This couples internal EC release versions (e.g. v1.1.0) with HF dataset versions, so users can load a specific release with `load_dataset("everycure/disease-list", revision="v1.1.0")`.
* Move the HF pipeline into the kedro cloud space (from base)
* Rephrase Hugging Face disease list docs
* Add approximate number of diseases in disease list release
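The Mondo/EC merge described in the commits above can be sketched in pandas. This is a minimal illustration, not the pipeline code: the column names (`id`, `label`, `prevalence`) and the tiny frames are hypothetical stand-ins for the real schemas.

```python
import pandas as pd

# Hypothetical miniature versions of the two lists.
mondo = pd.DataFrame({
    "id": ["MONDO:1", "MONDO:2", "MONDO:3"],
    "label": ["disease A", "disease B", "disease C"],
})
ec = pd.DataFrame({
    "id": ["MONDO:2"],
    "label": ["disease B (curated)"],  # overlapping column: EC value should win
    "prevalence": ["rare"],            # EC-only enrichment column
})

# Left-join the full Mondo list with the EC release list; suffix the Mondo
# copy of overlapping columns so the EC value can take precedence.
merged = mondo.merge(ec, on="id", how="left", suffixes=("_mondo", ""))
for col in ["label"]:  # overlapping columns
    # Fall back to the Mondo value only where EC has no row.
    merged[col] = merged[col].fillna(merged[f"{col}_mondo"])
    merged = merged.drop(columns=[f"{col}_mondo"])
```

The result keeps all Mondo rows, with EC enrichment (`prevalence`) present only where a curated row exists, matching the "EC values take precedence" behavior described above.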
Author
Leaving this as a draft until I can get my …
JacquesVergine (Collaborator) approved these changes on Mar 11, 2026:
Looks good to me!
…_metrics.py Co-authored-by: Jacques Vergine <jacques.vergine35@gmail.com>
Description of the changes
This PR implements the Ontology Inclusion Score metric (XDATA-263) as part of the integration pipeline.
Note: This branch was created off of ekcarter/xdata-285-improve-the-connected-components-output and should be merged after that PR is merged.
Background
Not all nodes in the knowledge graph are connected to ontological structures (TBox). Nodes that are connected to concept-level edges benefit from additional structural context, which improves performance in GNN-style approaches by grouping similar nodes closer together. This metric quantifies that coverage.
What was added
- `compute_ontology_inclusion_metric` in nodes.py: takes `unified_nodes` and `unified_edges`, identifies which nodes appear as subject or object in at least one concept-level (TBox) edge (a descendant of `related_to_at_concept_level` in the Biolink model), and outputs a per-node boolean flag `is_ontology_connected`.
- `combine_node_metrics` in nodes.py: joins all per-node metric datasets (`node_components`, `node_ontology`) into the final `node_metrics` table. `compute_connected_components` now writes to `integration.prm.node_components` (an intermediate local dataset), and `compute_ontology_inclusion_metric` writes to `integration.prm.node_ontology`. The new `combine_node_metrics` step joins these into `integration.prm.node_metrics`, which is also written to BQ for use in the dashboard and downstream resources.
Formula
Ontology Inclusion Score = (number of nodes with `is_ontology_connected` = True) / (total number of nodes)
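The metric above can be sketched in a few lines of pandas. This is an illustrative sketch, not the pipeline implementation: the toy frames stand in for `unified_nodes`/`unified_edges`, and the `CONCEPT_LEVEL` set is an assumed stand-in for the computed descendants of `related_to_at_concept_level` in the Biolink model.

```python
import pandas as pd

# Toy stand-ins for unified_nodes and unified_edges.
nodes = pd.DataFrame({"id": ["n1", "n2", "n3", "n4"]})
edges = pd.DataFrame({
    "subject": ["n1", "n2", "n3"],
    "predicate": [
        "biolink:subclass_of",                  # concept-level (TBox)
        "biolink:treats",                       # instance-level
        "biolink:related_to_at_concept_level",  # concept-level (TBox)
    ],
    "object": ["n2", "n3", "n1"],
})

# Assumption: a precomputed set of concept-level predicates (descendants
# of related_to_at_concept_level in the Biolink model).
CONCEPT_LEVEL = {"biolink:subclass_of", "biolink:related_to_at_concept_level"}

# A node is ontology-connected if it appears as subject or object of
# at least one concept-level edge.
tbox = edges[edges["predicate"].isin(CONCEPT_LEVEL)]
connected = set(tbox["subject"]) | set(tbox["object"])
nodes["is_ontology_connected"] = nodes["id"].isin(connected)

# Ontology Inclusion Score = connected nodes / total nodes.
score = nodes["is_ontology_connected"].mean()
```

Here `n1`, `n2`, and `n3` each touch a TBox edge while `n4` does not, so the score is 3/4.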
Fixes / Resolves the following issues:
Checklist:
- Labeled the PR (enhancement or bug)
- If pulling in latest main, uncomment the "Merge Notification" section below and describe the steps necessary for people
- `kedro run -e sample -p test_sample` (see the sample environment guide)