This repository was archived by the owner on Mar 4, 2026. It is now read-only.
In the current aggregation function, we integrate the following node properties:
def union_and_deduplicate_nodes(retrieve_most_specific_category: bool, *nodes, cols: List[str]) -> ps.DataFrame:
    """Function to unify nodes datasets."""
    # fmt: off
    unioned_datasets = (
        _union_datasets(*nodes)
        # first we group the dataset by id to deduplicate
        .groupBy("id")
        .agg(
            F.first("name", ignorenulls=True).alias("name"),
            F.first("category", ignorenulls=True).alias("category"),
            F.first("description", ignorenulls=True).alias("description"),
            F.first("international_resource_identifier", ignorenulls=True).alias("international_resource_identifier"),
            F.flatten(F.collect_set("equivalent_identifiers")).alias("equivalent_identifiers"),
            F.flatten(F.collect_set("all_categories")).alias("all_categories"),
            F.flatten(F.collect_set("labels")).alias("labels"),
            F.flatten(F.collect_set("publications")).alias("publications"),
            F.flatten(F.collect_set("upstream_data_source")).alias("upstream_data_source"),
        )
    )
    # next we need to apply a number of transformations to the nodes to ensure grouping by id did not select wrong information
    # this is especially important if we integrate multiple KGs
    if retrieve_most_specific_category:
        unioned_datasets = unioned_datasets.transform(determine_most_specific_category)
    return unioned_datasets.select(*cols)
    # fmt: on
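To make the merge semantics above easier to discuss, here is a minimal pure-Python sketch of what the aggregation does per `id`: keep the first non-null value for each scalar property and combine the values of each array property. It is an illustration only, not the pipeline code. The helper name `merge_nodes` and the two column lists are assumptions made for this example. Two differences from the Spark version are worth noting: the sketch picks "first" in input order (Spark's `F.first` after a `groupBy` gives no ordering guarantee), and it deduplicates individual array elements, whereas `F.flatten(F.collect_set(...))` only deduplicates whole arrays before concatenating them.

```python
from typing import Any, Dict, Iterable, List

# Column lists copied from the aggregation above (grouping here is an assumption of this sketch).
SCALAR_COLS = ["name", "category", "description", "international_resource_identifier"]
ARRAY_COLS = ["equivalent_identifiers", "all_categories", "labels", "publications", "upstream_data_source"]


def merge_nodes(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: merge node rows that share an id."""
    merged: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        node = merged.setdefault(
            row["id"],
            {"id": row["id"], **{c: None for c in SCALAR_COLS}, **{c: [] for c in ARRAY_COLS}},
        )
        # Scalars: keep the first non-null value seen (input order).
        for col in SCALAR_COLS:
            if node[col] is None and row.get(col) is not None:
                node[col] = row[col]
        # Arrays: append values not seen yet (element-level deduplication).
        for col in ARRAY_COLS:
            for value in row.get(col) or []:
                if value not in node[col]:
                    node[col].append(value)
    return list(merged.values())


rows = [
    {"id": "A", "name": None, "category": "biolink:Drug", "labels": ["x"]},
    {"id": "A", "name": "aspirin", "labels": ["x", "y"]},
]
result = merge_nodes(rows)
```

With the two rows above, the merged node takes `name` from the second row and `category` from the first, which shows why a property that is null in one KG can still be filled from another.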
Used by EC pipeline and already required
id
category
Used by EC pipeline but not required
publications
description
name
Useful but not biolink:
international_resource_identifier
equivalent_identifiers
all_categories
labels (should be synonyms)
upstream_data_source
Action items
determine if the ones not in biolink should be added / have a corresponding correct attribute
decide which ones to make required, if any
decide how to check "optional columns of interest" in the validator. For example, robokop has a column name:string which, if name were an "optional desirable" attribute, should have been called name.
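One possible shape for the last action item is sketched below: a check that flags columns whose name, once a `:type` suffix is stripped, matches an optional column of interest that is otherwise missing. This is only a sketch under assumptions. The function name `find_near_miss_columns` is hypothetical, the list of optional columns is taken from the "Used by EC pipeline but not required" list above, and treating `name:string` as a `column:type` header is a guess based on the robokop example.

```python
from typing import Dict, Iterable, List

# Assumed from the "Used by EC pipeline but not required" list in this issue.
OPTIONAL_COLUMNS_OF_INTEREST: List[str] = ["publications", "description", "name"]


def find_near_miss_columns(
    columns: Iterable[str],
    optional: Iterable[str] = OPTIONAL_COLUMNS_OF_INTEREST,
) -> Dict[str, str]:
    """Hypothetical check: map each suspicious column to the optional column it likely stands for."""
    columns = list(columns)
    present = set(columns)
    wanted = set(optional)
    near_misses: Dict[str, str] = {}
    for column in columns:
        # Strip an assumed ":type" suffix, e.g. "name:string" -> "name".
        base = column.split(":", 1)[0]
        if base != column and base in wanted and base not in present:
            near_misses[column] = base
    return near_misses


warnings = find_near_miss_columns(["id", "category", "name:string"])
```

A validator could report such matches as warnings rather than errors, since the columns are optional; whether to auto-rename them is a separate decision.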