Overview
Systematic semantic assessment of all 65 unique edge patterns in kg-microbe transformed data, analyzing subject category + predicate + object category combinations for Biolink Model compliance.
Analysis Date: 2025-11-20 (build from 2025-11-11, commit 77c42d8)
Data Files
kg_microbe_edge_patterns.tsv - Complete list of patterns with counts by source
edge_pattern_assessment.md - Detailed semantic assessment of each pattern
Pattern Format
source | subject_category | subject_prefix | predicate | object_category | object_prefix | count
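As a rough illustration of how rows in this format could be aggregated, the sketch below sums counts per subject category + predicate + object category across sources. It assumes the TSV is tab-delimited, has no header row, and uses the column order shown above. The sample rows, source names, and counts are made up for illustration and are not taken from kg_microbe_edge_patterns.tsv.

```python
import csv
import io
from collections import Counter

# Column order as given in "Pattern Format" (assumed to match the TSV layout).
COLUMNS = [
    "source", "subject_category", "subject_prefix", "predicate",
    "object_category", "object_prefix", "count",
]

def edge_totals_by_pattern(tsv_text):
    """Sum edge counts per (subject_category, predicate, object_category)."""
    totals = Counter()
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        key = (row["subject_category"], row["predicate"], row["object_category"])
        totals[key] += int(row["count"])
    return totals

# Illustrative rows only: sources, prefixes, and counts are hypothetical.
sample = "\n".join([
    "\t".join(["source_a", "organism", "NCBITaxon", "biolink:consumes", "chemical", "CHEBI", "10"]),
    "\t".join(["source_b", "organism", "NCBITaxon", "biolink:consumes", "chemical", "CHEBI", "5"]),
    "\t".join(["source_a", "organism", "NCBITaxon", "biolink:capable_of", "quality", "EC", "7"]),
])

totals = edge_totals_by_pattern(sample)
for pattern, count in totals.most_common():
    print(pattern, count)
```

Grouping on the category triple, rather than on the full row, collapses per-source and per-prefix rows into the pattern-level counts quoted in Key Findings.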
Key Findings
High-Priority Issues (3 patterns, ~239K edges)
organism --capable_of--> quality (186,197 edges) → biolink:capable_of used with wrong object type (PhenotypicQuality instead of Occurrent) #438
EC codes miscategorized; should be processes not qualities
organism --occurs_in--> medium (52,995 edges) → biolink:occurs_in used incorrectly for organism-medium relationships #440
Organisms don't occur in media; growth processes do
chemical --occurs_in--> assay (340 edges)
Wrong direction/predicate; should be assay uses/has_input chemical
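The three patterns above could be screened for with a simple lookup, sketched here under the assumption that categories appear as the short labels used in this issue (organism, quality, medium, chemical, assay). The function name and the table are hypothetical and are not part of kg-microbe.

```python
# High-priority patterns from this assessment, keyed by
# (subject_category, predicate without the "biolink:" prefix, object_category).
HIGH_PRIORITY = {
    ("organism", "capable_of", "quality"):
        "object should be a process (Occurrent), not a quality (#438)",
    ("organism", "occurs_in", "medium"):
        "organisms don't occur in media; growth processes do (#440)",
    ("chemical", "occurs_in", "assay"):
        "wrong direction/predicate; should be assay uses/has_input chemical",
}

def flag_pattern(subject_category, predicate, object_category):
    """Return the reason a pattern is flagged, or None if it is not in the table."""
    bare_predicate = predicate.split(":", 1)[-1]  # accept "biolink:x" or "x"
    return HIGH_PRIORITY.get((subject_category, bare_predicate, object_category))

print(flag_pattern("organism", "biolink:capable_of", "quality"))
print(flag_pattern("organism", "biolink:capable_of", "process"))
```

A pattern that returns None is simply absent from this table; that does not by itself mean the pattern is valid.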
Data Quality Issues (~85K edges)
Multiple patterns with (unknown) or (empty) categories:
Node Categorization Issues
Non-Standard Predicates (23,289 edges)
biolink:produces (12,523 edges) - should be has_output
biolink:associated_with_resistance_to (10,297 edges) - not in Biolink Model
biolink:is_assessed_by (112 edges) - not in Biolink Model
biolink:has_chemical_role (357 edges) - may be valid
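As a quick consistency check, the per-predicate counts listed above can be summed to confirm the section total. The dictionary name is only for illustration.

```python
# Edge counts per non-standard predicate, as listed in this issue.
NON_STANDARD = {
    "biolink:produces": 12_523,
    "biolink:associated_with_resistance_to": 10_297,
    "biolink:is_assessed_by": 112,
    "biolink:has_chemical_role": 357,
}

total = sum(NON_STANDARD.values())
print(f"{total:,}")  # 23,289
```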
Valid Patterns (~1.1M edges)
18 patterns are semantically valid including:
organism --consumes--> chemical (429K edges)
organism --has_phenotype--> quality (210K edges)
environment --location_of--> organism (168K edges)
organism --subclass_of--> organism (173K edges)
organism --capable_of--> process (6K edges)
Recommendations
Priority 1: Invalid Patterns
Fix EC node categories and capable_of usage (biolink:capable_of used with wrong object type (PhenotypicQuality instead of Occurrent) #438)
Fix organism-medium relationship (biolink:occurs_in used incorrectly for organism-medium relationships #440)
Reverse chemical-assay relationship
Priority 2: Data Quality
Assign missing categories (METPO phenotype nodes in madin_etal have missing categories #439)
Review misclassified nodes
Fix PATO-as-location patterns
Priority 3: Standardization
Map produces to has_output
Request associated_with_resistance_to in Biolink Model
Review non-standard predicates
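The produces to has_output mapping could be applied as a small remap step during transformation. This is only a sketch: it assumes edges are available as dictionaries with a "predicate" key (as in KGX-style edge records), and the function name and the placeholder identifiers in the example are hypothetical.

```python
# Predicate substitutions proposed under Priority 3.
PREDICATE_REMAP = {
    "biolink:produces": "biolink:has_output",
}

def remap_predicate(edge):
    """Return a copy of the edge with its predicate remapped if a mapping exists."""
    new_edge = dict(edge)  # leave the input record untouched
    new_edge["predicate"] = PREDICATE_REMAP.get(edge["predicate"], edge["predicate"])
    return new_edge

# Placeholder identifiers, not real records from the build.
example = {
    "subject": "NCBITaxon:0000000",
    "predicate": "biolink:produces",
    "object": "CHEBI:0000000",
}
remapped = remap_predicate(example)
print(remapped["predicate"])
```

Predicates without an entry in the table, such as associated_with_resistance_to, pass through unchanged, which keeps them visible for the review step.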
Methodology
Assessment based on:
Biolink Model predicate definitions and constraints
Semantic appropriateness of subject-predicate-object combinations
Standard usage patterns in biomedical knowledge graphs
Domain knowledge of microbiology and biological processes
Related Issues
Files
See the attached analysis files in the metpo repository for complete assessment details.