This repository was archived by the owner on Mar 4, 2026. It is now read-only.

Decide whether we should include certain node properties as "optional desirable" #33

@matentzn

In the current aggregation function, we integrate the following node properties:

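# Imports the snippet relies on (the module aliases are assumed from how they are used below)
from typing import List

import pyspark.sql as ps
from pyspark.sql import functions as F


def union_and_deduplicate_nodes(retrieve_most_specific_category: bool, *nodes, cols: List[str]) -> ps.DataFrame:
    """Union the given node datasets and deduplicate them by id."""
    # fmt: off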
    unioned_datasets = (
        _union_datasets(*nodes)
        # first we group the dataset by id to deduplicate
        .groupBy("id")
        .agg(
            F.first("name", ignorenulls=True).alias("name"),
            F.first("category", ignorenulls=True).alias("category"),
            F.first("description", ignorenulls=True).alias("description"),
            F.first("international_resource_identifier", ignorenulls=True).alias("international_resource_identifier"),
            F.flatten(F.collect_set("equivalent_identifiers")).alias("equivalent_identifiers"),
            F.flatten(F.collect_set("all_categories")).alias("all_categories"),
            F.flatten(F.collect_set("labels")).alias("labels"),
            F.flatten(F.collect_set("publications")).alias("publications"),
            F.flatten(F.collect_set("upstream_data_source")).alias("upstream_data_source"),
        )
    )
    # next we need to apply a number of transformations to the nodes to ensure grouping by id did not select wrong information
    # this is especially important if we integrate multiple KGs

    if retrieve_most_specific_category:
        unioned_datasets = unioned_datasets.transform(determine_most_specific_category)

    return unioned_datasets.select(*cols)
    # fmt: on
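
To make the aggregation semantics concrete, here is a minimal pure-Python sketch of what the two kinds of aggregates do for the rows of a single id. It does not use Spark: the helpers first_non_null and flatten_collect_set are hypothetical and only mimic F.first(..., ignorenulls=True) and F.flatten(F.collect_set(...)), and the example rows are made up. Two things are worth noting. First, F.first depends on row order, which Spark does not guarantee after a shuffle, so which name or description wins may vary. Second, collect_set deduplicates whole arrays, not their elements, so the flattened result can still contain repeated values (F.array_distinct would remove them, if that is wanted).

```python
from typing import Any, Iterable, List, Optional, Sequence


def first_non_null(values: Iterable[Optional[Any]]) -> Optional[Any]:
    """Mimic F.first(col, ignorenulls=True): return the first non-None value."""
    for value in values:
        if value is not None:
            return value
    return None


def flatten_collect_set(arrays: Iterable[Optional[Sequence[Any]]]) -> List[Any]:
    """Mimic F.flatten(F.collect_set(col)) for an array-typed column.

    Whole arrays are deduplicated, then concatenated. Spark does not
    guarantee any order here; this sketch keeps first-seen order.
    """
    distinct_arrays: List[List[Any]] = []
    for array in arrays:
        if array is None:  # collect_set skips nulls
            continue
        as_list = list(array)
        if as_list not in distinct_arrays:
            distinct_arrays.append(as_list)
    return [item for array in distinct_arrays for item in array]


# Made-up rows that all share one id, as seen inside a groupBy("id") group.
rows = [
    {"id": "EX:1", "name": None, "labels": ["aspirin", "ASA"]},
    {"id": "EX:1", "name": "aspirin", "labels": ["aspirin"]},
    {"id": "EX:1", "name": "acetylsalicylic acid", "labels": ["aspirin", "ASA"]},
]

name = first_non_null(row["name"] for row in rows)
labels = flatten_collect_set(row["labels"] for row in rows)
# Element-level deduplication, analogous to F.array_distinct.
distinct_labels = list(dict.fromkeys(labels))

print(name)             # aspirin
print(labels)           # ['aspirin', 'ASA', 'aspirin']
print(distinct_labels)  # ['aspirin', 'ASA']
```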

Used by EC pipeline and already required

  • id
  • category

Used by EC pipeline but not required

  • publications
  • description
  • name

Useful but not in biolink

  • international_resource_identifier
  • equivalent_identifiers
  • all_categories
  • labels (should be synonyms)
  • upstream_data_source

Action items

  • determine whether the properties not in biolink should be added, or whether they already have a corresponding correct biolink attribute
  • decide which ones to make required, if any
  • decide how to check "optional columns of interest" in the validator. For example, robokop has a column called name:string; if name were an "optional desirable" attribute, the validator should flag that this column should have been called name.
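
One possible shape for the validator check in the last action item, as a rough sketch only. It assumes that id and category stay required and that publications, description and name become "optional desirable"; none of that is decided yet. The function check_node_columns and the report keys are hypothetical names, and the only near-miss pattern handled is a type suffix after a colon, as in the robokop name:string case.

```python
from typing import Dict, List, Sequence

# Assumed classification, taken from the lists in this issue (not yet decided).
REQUIRED_COLUMNS = ["id", "category"]
OPTIONAL_DESIRABLE_COLUMNS = ["publications", "description", "name"]


def check_node_columns(
    columns: Sequence[str],
    required: Sequence[str] = REQUIRED_COLUMNS,
    optional_desirable: Sequence[str] = OPTIONAL_DESIRABLE_COLUMNS,
) -> Dict[str, object]:
    """Report missing required columns and missing or misnamed optional ones."""
    present = set(columns)
    # Map the part before a type suffix to the full name: "name:string" -> "name".
    base_names = {col.split(":", 1)[0]: col for col in columns if ":" in col}

    missing_optional: List[str] = []
    misnamed_optional: Dict[str, str] = {}
    for col in optional_desirable:
        if col in present:
            continue
        if col in base_names:
            # Present under a near-miss name; should be a warning, not an error.
            misnamed_optional[col] = base_names[col]
        else:
            missing_optional.append(col)

    return {
        "missing_required": [col for col in required if col not in present],
        "missing_optional": missing_optional,
        "misnamed_optional": misnamed_optional,
    }


# Made-up column list resembling the robokop example above.
report = check_node_columns(["id", "category", "name:string", "description"])
print(report["missing_required"])   # []
print(report["missing_optional"])   # ['publications']
print(report["misnamed_optional"])  # {'name': 'name:string'}
```

Only missing_required would need to fail validation; the other two entries could be surfaced as warnings so that an upstream KG is not rejected over an optional column.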
