XDATA-263 Implement Ontology Inclusion Score #2090

Merged
eKathleenCarter merged 13 commits into main from ekcarter/xdata-263-implement-ontology-inclusion-score
Mar 11, 2026
Collaborator

@eKathleenCarter eKathleenCarter commented Feb 24, 2026

Description of the changes

This PR implements the Ontology Inclusion Score metric (XDATA-263) as part of the integration pipeline.

Note: This branch was created off of ekcarter/xdata-285-improve-the-connected-components-output and should be merged after that PR is merged.

Background

Not all nodes in the knowledge graph are connected to ontological structures (TBox). Nodes that are connected to concept-level edges benefit from additional structural context, which improves performance in GNN-style approaches by grouping similar nodes closer together. This metric quantifies that coverage.

What was added

  • compute_ontology_inclusion_metric in nodes.py: Takes unified_nodes and unified_edges, identifies which nodes appear as subject or object in at least one concept-level (TBox) edge (descendants of related_to_at_concept_level in the Biolink model), and outputs a per-node boolean flag is_ontology_connected.
  • combine_node_metrics in nodes.py: Joins all per-node metric datasets (node_components, node_ontology) into the final node_metrics table.
  • Pipeline restructure: compute_connected_components now writes to integration.prm.node_components (an intermediate local dataset). compute_ontology_inclusion_metric writes to integration.prm.node_ontology. A new combine_node_metrics step joins these into integration.prm.node_metrics, which is also written to BQ for use in the dashboard and downstream resources.
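The flagging and join steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in nodes.py: the function names flag_ontology_connected_nodes and join_node_metrics, the component_id column name, and the three-predicate subset are all hypothetical, and the real pipeline presumably operates on dataframes rather than dicts.

```python
from typing import Dict, Iterable, Set, Tuple

# Illustrative subset only (assumption): the real pipeline would derive the
# full set of descendants of related_to_at_concept_level from the Biolink model.
CONCEPT_LEVEL_PREDICATES: Set[str] = {
    "biolink:related_to_at_concept_level",
    "biolink:subclass_of",
    "biolink:superclass_of",
}

Edge = Tuple[str, str, str]  # (subject, predicate, object)


def flag_ontology_connected_nodes(
    node_ids: Iterable[str],
    edges: Iterable[Edge],
    concept_predicates: Set[str] = CONCEPT_LEVEL_PREDICATES,
) -> Dict[str, bool]:
    """Return a per-node flag: True if the node is the subject or object
    of at least one concept-level edge."""
    connected: Set[str] = set()
    for subject, predicate, obj in edges:
        if predicate in concept_predicates:
            connected.add(subject)
            connected.add(obj)
    return {node_id: node_id in connected for node_id in node_ids}


def join_node_metrics(
    node_components: Dict[str, int],
    node_ontology: Dict[str, bool],
) -> Dict[str, dict]:
    """Join the per-node metric datasets on node id (hypothetical column names)."""
    return {
        node_id: {
            "component_id": component_id,
            "is_ontology_connected": node_ontology.get(node_id, False),
        }
        for node_id, component_id in node_components.items()
    }


# Toy data, invented for this example.
nodes = ["MONDO:1", "MONDO:2", "CHEBI:1", "CHEBI:2"]
edges = [
    ("MONDO:1", "biolink:subclass_of", "MONDO:2"),  # concept-level edge
    ("CHEBI:1", "biolink:treats", "MONDO:1"),       # not concept-level
]
components = {"MONDO:1": 0, "MONDO:2": 0, "CHEBI:1": 0, "CHEBI:2": 1}

node_ontology = flag_ontology_connected_nodes(nodes, edges)
node_metrics = join_node_metrics(components, node_ontology)
```

In this toy graph only MONDO:1 and MONDO:2 are flagged; CHEBI:1 touches MONDO:1 through a non-concept-level edge and so stays unflagged.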

Formula

Ontology Inclusion Score = (Number of nodes with is_ontology_connected = True) / (Total number of nodes)
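As a worked example of the formula, here is a minimal sketch that computes the score from a mapping of node id to is_ontology_connected. The function name and the choice to return 0.0 for an empty graph are assumptions made for this illustration; the PR does not say how the empty case is handled.

```python
from typing import Dict


def ontology_inclusion_score(flags: Dict[str, bool]) -> float:
    """Fraction of nodes whose is_ontology_connected flag is True."""
    total = len(flags)
    if total == 0:
        # Assumption: an empty graph scores 0.0 rather than raising.
        return 0.0
    connected = sum(1 for is_connected in flags.values() if is_connected)
    return connected / total


# Toy flags, invented for this example: 2 of 4 nodes are ontology-connected.
flags = {"MONDO:1": True, "MONDO:2": True, "CHEBI:1": False, "CHEBI:2": False}
score = ontology_inclusion_score(flags)  # 2 / 4 = 0.5
```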

Fixes / Resolves the following issues:

Checklist:

  • Added label to PR (e.g. enhancement or bug)
  • Ensured the PR is named descriptively. FYI: This name is used as part of our changelog & release notes.
  • Looked at the diff on github to make sure no unwanted files have been committed.
  • Made corresponding changes to the documentation
  • Added tests that prove my fix is effective or that my feature works
  • Any dependent changes have been merged and published in downstream modules
  • If breaking changes occur or you need everyone to run a command locally after
    pulling in latest main, uncomment the below "Merge Notification" section and
    describe steps necessary for people
  • Ran on sample data using kedro run -e sample -p test_sample (see sample environment guide)

eKathleenCarter and others added 6 commits February 20, 2026 12:43

* Publish EC drug and disease lists to HuggingFace Hub (#XDATA-282)

Add drug/disease list publication to the data_publication pipeline,
reading versioned parquet files from GCS and writing to everycure/drug-list
and everycure/disease-list on HF Hub. The disease list drops six internal
columns before publishing. HF publication is triggered automatically from
the core_entities release CI on minor/major releases (patches are skipped).

* Add drug/disease datasets and clarify releases

Update public data releases page to refer to 'datasets' and broaden examples (knowledge graphs, drug lists, disease lists). Add drug-list and disease-list entries to the Available Datasets table with Hugging Face links and Docs column, adjust kg-nodes/kg-edges rows, and change 'Both datasets' to 'All datasets'.

* Refactor hf pipeline into separate publication pipelines

This was requested by @JacquesVergine: instead of having a single data publication pipeline, there are now individual pipelines for the KG, drug list, and disease list!

* Add publish to huggingface core entities release pipeline

* HF disease list now merges full Mondo list with EC curated data (#2079)

The HuggingFace disease list publication previously only included
curated diseases. It now left-joins the full Mondo disease list with
the EC release list, so all Mondo diseases are present with EC
enrichment data (specialties, prevalence, categories, etc.) where
available. For overlapping columns, EC values take precedence.

* Remove now unnecessary drop_disease_hf_columns method

* Clarify exclusion of columns in HF dataset release

Added clarification on the exclusion of certain columns from the public HF release due to their experimental nature.

* Simplify data_publication pipeline tests

Remove tests that targeted the internal _drop_disease_hf_columns helper and explicit publish_drug_list_node/publish_disease_list_node behaviors. Keep a single higher-level test that asserts the pipeline contains the expected node names (publish_kg_edges_node, publish_kg_nodes_node). This reduces coupling to implementation details and focuses the test on pipeline composition.

* Add Git tag support to HFIterableDataset for version-coupled releases (#2080)

* Add Git tag support to HFIterableDataset for version-coupled releases

HFIterableDataset now accepts an optional `tag` parameter. After
push and verification, it creates a Git tag on the HuggingFace Hub
repo pinned to the exact commit SHA of the upload. This couples
internal EC release versions (e.g. v1.1.0) with HF dataset versions,
so users can load a specific release with
`load_dataset("everycure/disease-list", revision="v1.1.0")`.

* Move the HF pipeline into the kedro cloud space (from base)

* Rephrase hugging face disease list docs

* Add approximate number of diseases in disease list release
@eKathleenCarter eKathleenCarter self-assigned this Feb 24, 2026
@eKathleenCarter eKathleenCarter requested a review from a team as a code owner February 24, 2026 15:12
@eKathleenCarter eKathleenCarter added the enhancement improving an existing system or feature to work better. label Feb 24, 2026
@eKathleenCarter eKathleenCarter added the Feature Used for PRs to label new features label Feb 24, 2026
@eKathleenCarter eKathleenCarter marked this pull request as draft February 24, 2026 15:12
@eKathleenCarter
Collaborator Author

Leaving this as a draft until I can get my uv run kedro experiment permissions issue resolved. I would like to test this but the work is complete besides that.

@eKathleenCarter eKathleenCarter marked this pull request as ready for review March 2, 2026 15:19
Collaborator

@JacquesVergine JacquesVergine left a comment

Looks good to me!

Comment thread pipelines/matrix/src/matrix/pipelines/integration/connectivity_metrics.py Outdated
…_metrics.py

Co-authored-by: Jacques Vergine <jacques.vergine35@gmail.com>
@eKathleenCarter eKathleenCarter merged commit 3734364 into main Mar 11, 2026
15 checks passed
@eKathleenCarter eKathleenCarter deleted the ekcarter/xdata-263-implement-ontology-inclusion-score branch March 11, 2026 17:42
Labels

enhancement improving an existing system or feature to work better. Feature Used for PRs to label new features

4 participants