This repository was archived by the owner on Mar 4, 2026. It is now read-only.
KGs should conform to the biolink "type system". This would allow us to catch systematic errors in the KG early on (either at ingestion or at integration).
A few suggestions:
1. Minimally, each node's `category` (and each edge's `predicate`) should be a valid biolink class (biolink predicate). Surprisingly, that's not the case: e.g. `biolink:Vitamin` exists in our KG but is NOT a valid biolink class, and the same goes for the predicate `biolink:contraindicated_for`. Also, abstract classes should probably not appear in the `category` of a node.
2. Edge types in biolink have a specific `domain` and `range`, i.e. subject/object types, which we should enforce in the graph. For example, `biolink:in_taxon` edges can only connect a `ThingWithTaxon` to an `OrganismTaxon`. This is frequently violated at the moment.
3. Nodes often have more than one `category` (see `all_categories`). Often these are superclass-subclass relations, e.g. a node has `all_categories = ["Protein", "Polypeptide"]`, which is valid. However, certain combinations of `category` on the same node point to errors (e.g. a node can't really be a `Disease` and a `Gene`; the gene might be mutated in the disease, but those are still different concepts that should not be mixed up).
Comments
Point 1 is easy, something like:

```python
import bmt
import polars as pl

B = bmt.Toolkit('https://raw.githubusercontent.com/biolink/biolink-model/refs/heads/master/biolink-model.yaml')
valid_classes = [B.get_element(el_name)['class_uri'] for el_name in B.get_all_classes()]

# df_nodes: the KG node table, with a list column `all_categories`
df_nodes.with_columns(
    # not sure how to check for set-equality in polars, this one works though:
    valid_biolink=pl.col("all_categories").list.set_intersection(valid_classes).list.len() == pl.col("all_categories").list.len(),
)
```
Point 2 is fairly easy too. One just needs to take inheritance into account, e.g. if an edge type has `domain == "ThingWithTaxon"`, any subclass of `ThingWithTaxon` is a valid subject.
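A minimal sketch of the inheritance-aware domain/range check, in plain Python. The `PARENTS` and `DOMAIN_RANGE` tables, the function names, and the `biolink:`-prefixed category strings are hypothetical and heavily simplified; in practice both tables would be derived from the biolink model rather than written by hand.

```python
# Toy, hand-written excerpt of a class hierarchy: class -> direct parents
# (is_a and mixins). Hypothetical and incomplete, for illustration only.
PARENTS = {
    "biolink:Gene": ["biolink:BiologicalEntity"],
    "biolink:Protein": ["biolink:BiologicalEntity"],
    "biolink:BiologicalEntity": ["biolink:ThingWithTaxon"],
    "biolink:ThingWithTaxon": [],
    "biolink:OrganismTaxon": [],
}

# Hypothetical lookup table: predicate -> (domain, range).
DOMAIN_RANGE = {
    "biolink:in_taxon": ("biolink:ThingWithTaxon", "biolink:OrganismTaxon"),
}


def ancestors(cls):
    """Return the class itself plus all of its transitive parents."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def edge_is_valid(subject_category, predicate, object_category):
    """Check an edge against the predicate's domain and range.

    A subject is accepted if it is the domain class or any subclass of it;
    the same rule applies to the object and the range.
    Predicates without an entry in DOMAIN_RANGE are treated as unconstrained.
    """
    if predicate not in DOMAIN_RANGE:
        return True
    domain, range_ = DOMAIN_RANGE[predicate]
    return (domain in ancestors(subject_category)
            and range_ in ancestors(object_category))
```

With these toy tables, `edge_is_valid("biolink:Gene", "biolink:in_taxon", "biolink:OrganismTaxon")` passes because `Gene` inherits from `ThingWithTaxon`, while the same edge with a `Gene` as object fails the range check.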
Point 3 is harder. Anything that adheres to the biolink class hierarchy is definitely valid, but the rest is tricky: if a node's `all_categories` violates the class hierarchy, it's not necessarily wrong, e.g.:

- Genes and proteins are often mixed in a single node (I guess the type should really be `GeneOrGeneProduct` then).
- A node might be both a `Protein` and a `Drug` (e.g. an antibody).
- Proteins are sometimes `SmallMolecule`s (if it's just a few amino acids).

We'd need to come up with a "blacklist" of category combinations that are "wrong" (e.g. `Disease`, `Gene`).
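A minimal sketch of such a blacklist check, in plain Python. The names `FORBIDDEN_COMBINATIONS` and `forbidden_combinations` are hypothetical, and the single entry is just the `Disease`/`Gene` example from above, written with an assumed `biolink:` prefix. A real list would have to be curated by hand.

```python
# Hypothetical blacklist: each entry is a set of categories that should
# never co-occur on a single node. Only the Disease/Gene example is listed.
FORBIDDEN_COMBINATIONS = [
    frozenset({"biolink:Disease", "biolink:Gene"}),
]


def forbidden_combinations(all_categories):
    """Return every blacklisted combination fully contained in the node's categories."""
    categories = set(all_categories)
    return [combo for combo in FORBIDDEN_COMBINATIONS if combo <= categories]
```

A node with `all_categories = ["biolink:Disease", "biolink:Gene"]` would be flagged, whereas the "tricky but plausible" combinations such as `Protein` plus `Drug` pass as long as nobody adds them to the list.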