Summary
BacDive JSON data has inconsistent shapes where the same path can be either a single object or an array of objects. The current _extract_value_from_json_path() function in bacdive.py silently loses data when intermediate nodes are arrays, because it only handles dict traversal.
Estimated data loss: ~22% of METPO phenotype extractions (19,246 of 86,549 record-path combinations)
The Bug
In kg_microbe/transform_utils/bacdive/bacdive.py lines 314-320:
def _extract_value_from_json_path(self, record: dict, json_path: str):
    # ...
    for part in parts:
        if isinstance(current, dict):
            current = current.get(part)
            if current is None:
                return []
        else:
            return []  # ← BUG: If intermediate node is array, returns empty!
When traversing a path like "Morphology.cell morphology.cell shape":
- If cell morphology is a dict: traversal continues, value extracted ✓
- If cell morphology is an array: returns [], ALL values lost ✗
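The behaviour can be reproduced outside the transform with a small stand-in. This is only a sketch: `extract_dict_only` is a hypothetical name, the records are made-up minimal inputs, and the final `return` is an assumption, because the real function's leaf handling is elided in the snippet above.

```python
def extract_dict_only(record: dict, json_path: str) -> list:
    """Stand-in for the dict-only traversal (not the real method)."""
    current = record
    for part in json_path.split("."):
        if isinstance(current, dict):
            current = current.get(part)
            if current is None:
                return []
        else:
            # An array (or scalar) at an intermediate node ends the traversal.
            return []
    # Assumed leaf handling: stringify whatever the path resolved to.
    return [str(current).strip()]


path = "Morphology.cell morphology.cell shape"

# Same data, two shapes for the intermediate node.
as_object = {"Morphology": {"cell morphology": {"cell shape": "coccus-shaped"}}}
as_array = {"Morphology": {"cell morphology": [{"cell shape": "coccus-shaped"}]}}

object_result = extract_dict_only(as_object, path)
array_result = extract_dict_only(as_array, path)
```

With the object shape the value is returned; with the array shape the result is an empty list and no error is raised, which is why the loss goes unnoticed.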
Data Loss Analysis
Analysis of 99,392 BacDive strains shows these METPO phenotype extraction paths are affected:
| Phenotype | Intermediate Path | Total | Object | Array | % Lost |
|---|---|---|---|---|---|
| halophily | Physiology and metabolism.halophily | 9,687 | 3,209 | 6,478 | 66.9% |
| oxygen preference | Physiology and metabolism.oxygen tolerance | 23,255 | 18,655 | 4,600 | 19.8% |
| cell shape | Morphology.cell morphology | 16,032 | 13,251 | 2,781 | 17.3% |
| gram stain | Morphology.cell morphology | 16,032 | 13,251 | 2,781 | 17.3% |
| motility | Morphology.cell morphology | 16,032 | 13,251 | 2,781 | 17.3% |
| biosafety level | Safety information.risk assessment | 31,642 | 26,678 | 4,964 | 15.7% |
| sporulation | Physiology and metabolism.spore formation | 5,443 | 5,050 | 393 | 7.2% |
| trophic type | Physiology and metabolism.nutrition type | 490 | 460 | 30 | 6.1% |
| TOTAL | | 86,549 | 67,303 | 19,246 | 22.2% |
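The TOTAL row counts each intermediate path once, so Morphology.cell morphology contributes a single 16,032 even though three phenotypes depend on it. A quick arithmetic check, using only the figures from the table (the variable names are illustrative):

```python
# (object_count, array_count) per intermediate path, copied from the table.
shape_counts = {
    "Physiology and metabolism.halophily": (3209, 6478),
    "Physiology and metabolism.oxygen tolerance": (18655, 4600),
    "Morphology.cell morphology": (13251, 2781),
    "Safety information.risk assessment": (26678, 4964),
    "Physiology and metabolism.spore formation": (5050, 393),
    "Physiology and metabolism.nutrition type": (460, 30),
}

total_object = sum(obj for obj, _ in shape_counts.values())
total_array = sum(arr for _, arr in shape_counts.values())
total = total_object + total_array

# Share of record-path combinations whose intermediate node is an array.
percent_lost = round(100 * total_array / total, 1)
```

This gives 67,303 object-shaped and 19,246 array-shaped combinations out of 86,549, which matches the 22.2% in the TOTAL row.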
Verified Example
Document 98 - cell morphology is object (works):
{
  "Morphology": {
    "cell morphology": {
      "@ref": 119306,
      "gram stain": "negative",
      "cell shape": "coccus-shaped",
      "motility": "no"
    }
  }
}
Document 99 - cell morphology is array (data lost):
{
  "Morphology": {
    "cell morphology": [
      {"@ref": 22965, "gram stain": "negative", "cell shape": "coccus-shaped", "motility": "no"},
      {"@ref": 67771, "cell shape": "coccus-shaped"},
      {"@ref": 67771, "gram stain": "negative"},
      {"@ref": 120258, "gram stain": "negative", "cell shape": "coccus-shaped", "motility": "no"}
    ]
  }
}
For document 99, _extract_value_from_json_path("Morphology.cell morphology.cell shape") returns [] even though three of the four array entries carry a valid cell shape value.
Proposed Fix
Modify _extract_value_from_json_path() to handle arrays at intermediate nodes:
def _extract_value_from_json_path(self, record: dict, json_path: str):
    parts = json_path.split(".")
    current = record
    for i, part in enumerate(parts[:-1]):  # Traverse all but the last part
        if isinstance(current, dict):
            current = current.get(part)
            if current is None:
                return []
        elif isinstance(current, list):
            # Flatten: resolve the rest of the path against every item in the array.
            # Slicing by index (not parts.index(part)) stays correct if a key name repeats.
            remaining_path = ".".join(parts[i:])
            results = []
            for item in current:
                if isinstance(item, dict):
                    results.extend(
                        self._extract_value_from_json_path(item, remaining_path)
                    )
            return results
        else:
            return []
    # Handle the final value (existing logic)
    last_key = parts[-1]
    if isinstance(current, list):
        result = []
        for item in current:
            if isinstance(item, dict):
                value = item.get(last_key)
                if value:
                    result.append(str(value).strip())
            elif item:
                result.append(str(item).strip())
        return result
    elif isinstance(current, dict):
        value = current.get(last_key)
        if value:
            return [str(value).strip()]
        return []
    elif current is not None:
        return [str(current).strip()]
    else:
        return []
Or simpler - normalize arrays to be processed element-by-element:
def _extract_value_from_json_path(self, record: dict, json_path: str):
    parts = json_path.split(".")

    def traverse(current, remaining_parts):
        if not remaining_parts:
            if current is None:
                return []
            if isinstance(current, list):
                return [str(v).strip() for v in current if v]
            return [str(current).strip()] if current else []
        part = remaining_parts[0]
        rest = remaining_parts[1:]
        if isinstance(current, dict):
            return traverse(current.get(part), rest)
        elif isinstance(current, list):
            results = []
            for item in current:
                if isinstance(item, dict):
                    results.extend(traverse(item.get(part), rest))
            return results
        else:
            return []

    return traverse(record, parts)
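As a sanity check, the recursive variant can be run against documents 98 and 99 from the Verified Example. The sketch below copies the same logic into a module-level function (`extract_values` is a hypothetical name used only here) so it runs without the transform class.

```python
def extract_values(record: dict, json_path: str) -> list:
    """Standalone copy of the recursive traversal, for testing only."""

    def traverse(current, remaining_parts):
        if not remaining_parts:
            if current is None:
                return []
            if isinstance(current, list):
                return [str(v).strip() for v in current if v]
            return [str(current).strip()] if current else []
        part, rest = remaining_parts[0], remaining_parts[1:]
        if isinstance(current, dict):
            return traverse(current.get(part), rest)
        if isinstance(current, list):
            results = []
            for item in current:
                if isinstance(item, dict):
                    results.extend(traverse(item.get(part), rest))
            return results
        return []

    return traverse(record, json_path.split("."))


doc_98 = {"Morphology": {"cell morphology": {
    "@ref": 119306, "gram stain": "negative",
    "cell shape": "coccus-shaped", "motility": "no"}}}

doc_99 = {"Morphology": {"cell morphology": [
    {"@ref": 22965, "gram stain": "negative", "cell shape": "coccus-shaped", "motility": "no"},
    {"@ref": 67771, "cell shape": "coccus-shaped"},
    {"@ref": 67771, "gram stain": "negative"},
    {"@ref": 120258, "gram stain": "negative", "cell shape": "coccus-shaped", "motility": "no"},
]}}

shape_98 = extract_values(doc_98, "Morphology.cell morphology.cell shape")
shape_99 = extract_values(doc_99, "Morphology.cell morphology.cell shape")
motility_99 = extract_values(doc_99, "Morphology.cell morphology.motility")
```

Document 99 now yields one value per array entry that has the key, so repeated values are preserved. Whether callers should deduplicate them is a separate decision that this sketch does not make.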
BacDive Shape Patterns
Full intermediate node shape analysis
INTERMEDIATE NODES: object | array<object>
These nodes can be either a single object OR an array of objects

| Path | Object | Array | Total | %Array |
|---|---|---|---|---|
| Culture and growth conditions.culture medium | 19587 | 21299 | 40886 | 52.1% |
| Culture and growth conditions.culture pH | 1147 | 5649 | 6796 | 83.1% |
| Culture and growth conditions.culture temp | 32617 | 16889 | 49506 | 34.1% |
| External links.literature | 10636 | 9062 | 19698 | 46.0% |
| External links.phages | 134 | 78 | 212 | 36.8% |
| External links.straininfo link | 45914 | 171 | 46085 | 0.4% |
| General.NCBI tax id | 94863 | 3711 | 98574 | 3.8% |
| General.strain history | 32204 | 11317 | 43521 | 26.0% |
| Isolation, sampling and environmental information.isolation | 44689 | 14399 | 59088 | 24.4% |
| Isolation, sampling and environmental information.isolation source categories | 11768 | 31611 | 43379 | 72.9% |
| Morphology.cell morphology | 13251 | 2781 | 16032 | 17.3% |
| Morphology.colony morphology | 8122 | 2451 | 10573 | 23.2% |
| Morphology.multicellular morphology | 6528 | 1400 | 7928 | 17.7% |
| Morphology.multimedia | 3022 | 788 | 3810 | 20.7% |
| Morphology.pigmentation | 4090 | 444 | 4534 | 9.8% |
| Name and taxonomic classification.LPSN.synonyms | 24090 | 37801 | 61891 | 61.1% |
| Physiology and metabolism.antibiogram | 107 | 112 | 219 | 51.1% |
| Physiology and metabolism.antibiotic resistance | 4163 | 2397 | 6560 | 36.5% |
| Physiology and metabolism.compound production | 1615 | 893 | 2508 | 35.6% |
| Physiology and metabolism.enzymes | 659 | 28660 | 29319 | 97.8% |
| Physiology and metabolism.fatty acid profile | 5420 | 214 | 5634 | 3.8% |
| Physiology and metabolism.halophily | 3209 | 6478 | 9687 | 66.9% |
| Physiology and metabolism.metabolite production | 15933 | 7587 | 23520 | 32.3% |
| Physiology and metabolism.metabolite tests | 12093 | 8106 | 20199 | 40.1% |
| Physiology and metabolism.metabolite utilization | 298 | 29901 | 30199 | 99.0% |
| Physiology and metabolism.murein | 1307 | 12 | 1319 | 0.9% |
| Physiology and metabolism.nutrition type | 460 | 30 | 490 | 6.1% |
| Physiology and metabolism.observation | 7136 | 3911 | 11047 | 35.4% |
| Physiology and metabolism.oxygen tolerance | 18655 | 4600 | 23255 | 19.8% |
| Physiology and metabolism.spore formation | 5050 | 393 | 5443 | 7.2% |
| Safety information.risk assessment | 26678 | 4964 | 31642 | 15.7% |
| Sequence information.16S sequences | 21353 | 5691 | 27044 | 21.0% |
| Sequence information.GC content | 10426 | 5390 | 15816 | 34.1% |
| Sequence information.Genome sequences | 4465 | 13140 | 17605 | 74.6% |
| TOTAL | 491689 | 282330 | 774019 | 36.5% |
Notes
- The code already handles list vs dict at leaf nodes correctly (e.g., metabolite_utilization, enzymes) using isinstance checks
- The bug is specifically in intermediate node traversal in _extract_value_from_json_path()
- Other code paths that directly access fields (like culture medium processing at line 1241) already normalize with if not isinstance(media, list): media = [media]
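The inline normalization mentioned in the last note could be factored into a small helper so every direct field access treats both shapes the same way. This is a sketch only: `as_list` is a hypothetical name, not an existing function in bacdive.py, and the sample records are made up.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Return value unchanged if it is a list, wrap it otherwise; None becomes []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Both BacDive shapes can then be iterated the same way.
single_medium = {"name": "Medium A"}
several_media = [{"name": "Medium A"}, {"name": "Medium B"}]

names_single = [m["name"] for m in as_list(single_medium)]
names_several = [m["name"] for m in as_list(several_media)]
names_missing = [m["name"] for m in as_list(None)]
```

Unlike the inline one-liner, this variant maps a missing value to an empty list rather than [None], which avoids a separate None check in the calling loop.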
Analysis Method
Shape analysis was performed on a local MongoDB copy of 99,392 BacDive strains using aggregation queries to count type distribution at each path. The bacdive_meta.property_schemas collection contains inferred JSON schemas that document the anyOf patterns.
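The MongoDB aggregation queries themselves are not included here. As a rough pure-Python equivalent for anyone without the local database, the same kind of count can be taken over an iterable of already-loaded records. The function names and the three sample records below are illustrative assumptions, not the queries actually used.

```python
from collections import Counter


def shape_at_path(record: dict, json_path: str) -> str:
    """Classify the node at json_path as object, array, scalar, or missing."""
    current = record
    for part in json_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return "missing"
        current = current[part]
    if isinstance(current, dict):
        return "object"
    if isinstance(current, list):
        return "array"
    return "scalar"


def count_shapes(records, json_path: str) -> Counter:
    """Tally node shapes at json_path across all records."""
    return Counter(shape_at_path(record, json_path) for record in records)


# Made-up sample records covering the three cases of interest.
sample = [
    {"Morphology": {"cell morphology": {"cell shape": "coccus-shaped"}}},
    {"Morphology": {"cell morphology": [{"cell shape": "coccus-shaped"}]}},
    {"Morphology": {}},
]

counts = count_shapes(sample, "Morphology.cell morphology")
```

Running `count_shapes` once per intermediate path over the full strain set should give object and array counts comparable to the table above, assuming the records have the same structure as the MongoDB documents.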