Add NMDC transformation pattern tests (flattening + uuid5) #145
amc-corey-cox merged 2 commits into main
Pull request overview
Adds new pytest modules that demonstrate and verify NMDC-oriented transformation patterns in linkml-map, specifically biosample flattening via dot-notation expressions and deterministic NMDC-style ID generation via uuid5().
Changes:
- Add deterministic `uuid5()`-based ID generation tests (determinism, uniqueness, stdlib compatibility, null propagation).
- Add NMDC “flatten nested AttributeValue objects into scalar columns” pattern tests for Biosample and Study.
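
The properties listed for the uuid5 tests (determinism, uniqueness, stdlib compatibility, null propagation) can be illustrated with the standard library alone. This is a minimal sketch, not the linkml-map implementation: the namespace URL, the `nmdc` prefix, the `|` separator, and the `make_id` helper are all assumptions made for illustration.

```python
import uuid

# Assumption: a fixed namespace derived from an illustrative URL.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/nmdc")


def make_id(prefix, *parts):
    """Return a deterministic ID, or None if any source field is missing."""
    if any(part is None for part in parts):
        return None  # null propagation: a missing input yields no ID
    name = "|".join(str(part) for part in parts)
    return f"{prefix}:{uuid.uuid5(NAMESPACE, name)}"


# Hypothetical source field values.
a = make_id("nmdc", "bsm-001", "depth")
b = make_id("nmdc", "bsm-001", "depth")  # same inputs as a
c = make_id("nmdc", "bsm-002", "depth")  # different inputs
missing = make_id("nmdc", "bsm-001", None)
```

Because the ID depends only on the namespace and the source fields, re-running the same load produces the same IDs, which is what makes the ETL idempotent.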
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_transformer/test_nmdc_uuid5_ids.py | New tests covering deterministic NMDC-style ID generation using uuid5() in transformation expressions. |
| tests/test_transformer/test_nmdc_flattening_patterns.py | New tests documenting/validating common NMDC flattening patterns from nested structures into flat lakehouse-style columns. |

    import copy

    import pytest

pytest is imported but never used in this test module. This triggers an unused-import lint warning and can be removed (or use it for fixtures/parametrization if intended).

    Value types covered:
    - QuantityValue (has_numeric_value, has_unit, has_raw_value)
    - ControlledIdentifiedTermValue (term.id, term.name, has_raw_value)
    - ControlledTermValue (same structure, term optional)

The module docstring claims ControlledTermValue flattening is covered, but the source schema/spec/tests only define/use ControlledIdentifiedTermValue (and term is required). Either add a ControlledTermValue case or update the docstring so it matches what’s actually tested.

    def _make_transformer():
        tr = ObjectTransformer(unrestricted_eval=True)

unrestricted_eval=True isn’t needed for these expressions (they’re in the safe expression subset, and uuid5() is registered there). Keeping the default restricted mode here would better validate the intended user path and avoid tests passing due to unrestricted fallback behavior.
Suggested change:

    - tr = ObjectTransformer(unrestricted_eval=True)
    + tr = ObjectTransformer()

amc-corey-cox left a comment:
This looks good to me. I applied the fix on main so we should be good to merge.
Demonstrate how linkml-map can express the flattening operations
currently done by custom Python in flatten_nmdc_collections.py.
test_nmdc_flattening_patterns.py (6 tests):
- Biosample: QuantityValue (5-field), ControlledIdentifiedTermValue
(2-level term.id/term.name), GeolocationValue, TimestampValue
- Study: PersonValue (pi_name/email/orcid)
- Null propagation and partial-field cases
- Uses lakehouse naming conventions (depth_has_numeric_value, etc.)
test_nmdc_uuid5_ids.py (5 tests):
- Deterministic ID generation with uuid5()
- Verifies idempotency, uniqueness, stdlib compatibility
- Null propagation when source fields are missing
All tests use dot-notation expressions (e.g. depth.has_numeric_value)
which work on main. The {}-brace null-propagation syntax for dot
expressions requires PR #136.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
test_nmdc_unified_activities.py (6 tests):
- Multiple class_derivations for same target (#118): three NMDC
  workflow types (MetagenomeSequencing, ReadQcAnalysis,
  MetagenomeAssembly) map to a single FlatActivity table
- Discriminator column (activity_type) via slot value
- Slot renaming (started_at_time → started_at)
- Type-specific fields excluded from unified target
- Sparse/missing optional field handling
Copilot feedback on flattening/uuid5 tests:
- Remove unused pytest import from flattening tests
- Remove ControlledTermValue from docstring (not actually tested)
- Use default restricted eval in uuid5 tests (uuid5 is registered in
  the safe function set, unrestricted_eval not needed)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
85d36c1 to 5b411d5
Note: Corey's |
Summary
Adds 11 tests demonstrating how linkml-map can express real NMDC data transformation patterns. These serve as both test coverage and documentation examples for issue #137.
NMDC Biosample flattening (6 tests)
Shows how each NMDC AttributeValue type can be flattened to lakehouse-style scalar columns using dot-notation `expr`. The flat column naming conventions match those already in use in external-metadata-awareness/flatten_nmdc_collections.py.

| Source expression | Flat column |
|---|---|
| `depth.has_numeric_value` | `depth_has_numeric_value` |
| `env_broad_scale.term.id` | `env_broad_scale_term_id` |
| `lat_lon.latitude` | `lat_lon_latitude` |
| `collection_date.has_raw_value` | `collection_date_has_raw_value` |
| `principal_investigator.name` | `pi_name` |

Tests cover fully populated, sparse/null, and partial-field cases.
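
The flattening pattern itself can be sketched in plain Python. This is only an illustration of the idea under stated assumptions, not how linkml-map evaluates expressions: the `get_path` and `flatten` helpers, the `COLUMNS` mapping, and the sample record values are hypothetical.

```python
from functools import reduce


def get_path(obj, path):
    """Follow a dot-separated path through nested dicts.

    Returns None as soon as any step is missing or null.
    """
    def step(current, key):
        if isinstance(current, dict):
            return current.get(key)
        return None

    return reduce(step, path.split("."), obj)


# Hypothetical mapping: flat column name -> source path.
COLUMNS = {
    "depth_has_numeric_value": "depth.has_numeric_value",
    "env_broad_scale_term_id": "env_broad_scale.term.id",
    "lat_lon_latitude": "lat_lon.latitude",
    "collection_date_has_raw_value": "collection_date.has_raw_value",
}


def flatten(record):
    """Build one flat row from a nested record."""
    return {column: get_path(record, path) for column, path in COLUMNS.items()}


# Sample record with one null field (lat_lon) and one absent field (collection_date).
biosample = {
    "id": "example:biosample-1",
    "depth": {"has_numeric_value": 2.5, "has_unit": "m"},
    "env_broad_scale": {"term": {"id": "ENVO:00000446"}},
    "lat_lon": None,
}
flat = flatten(biosample)
```

In this sketch a null or absent intermediate object simply produces a null column, which mirrors the sparse/null and partial-field cases the tests describe.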
Deterministic uuid5 ID generation (5 tests)
Shows how `uuid5()` (#117) enables idempotent ETL by producing deterministic NMDC-style IDs from source fields. Tests verify determinism, uniqueness across inputs, compatibility with Python's `uuid.uuid5()`, and null propagation.
Note on expression syntax
These tests use bare dot-notation (`depth.has_numeric_value`), which works on `main`. The `{}`-brace null-propagation syntax for dot expressions (e.g., `{depth.has_numeric_value}`) requires the `ast.Attribute` support from PR #136. Both syntaxes are equivalent when the value is non-null; with braces, a null intermediate aborts the entire expression rather than just the attribute access.
Test plan
main
🤖 Generated with Claude Code