Skip to content

Add NMDC transformation pattern tests (flattening + uuid5)#145

Merged
amc-corey-cox merged 2 commits intomainfrom
nmdc-transformation-tests
Mar 9, 2026
Merged

Add NMDC transformation pattern tests (flattening + uuid5)#145
amc-corey-cox merged 2 commits intomainfrom
nmdc-transformation-tests

Conversation

@turbomam
Copy link
Copy Markdown
Member

@turbomam turbomam commented Mar 9, 2026

Summary

Adds 11 tests demonstrating how linkml-map can express real NMDC data transformation patterns. These serve as both test coverage and documentation examples for issue #137.

NMDC Biosample flattening (6 tests)

Shows how each NMDC AttributeValue type can be flattened to lakehouse-style scalar columns using dot-notation expr. The flat column naming conventions match those already in use in external-metadata-awareness/flatten_nmdc_collections.py.

Value type Nesting Expression pattern Example
QuantityValue 1-level depth.has_numeric_value depth_has_numeric_value column
ControlledIdentifiedTermValue 2-level env_broad_scale.term.id env_broad_scale_term_id column
GeolocationValue 1-level lat_lon.latitude lat_lon_latitude column
TimestampValue 1-level collection_date.has_raw_value collection_date_has_raw_value column
PersonValue (Study) 1-level principal_investigator.name pi_name column

Tests cover fully populated, sparse/null, and partial-field cases.

Deterministic uuid5 ID generation (5 tests)

Shows how uuid5() (#117) enables idempotent ETL by producing deterministic NMDC-style IDs from source fields:

id:
  expr: "'nmdc:bsm-' + uuid5('https://w3id.org/nmdc/', source_db + ':' + source_id)"

Tests verify determinism, uniqueness across inputs, compatibility with Python's uuid.uuid5(), and null propagation.

Note on expression syntax

These tests use bare dot-notation (depth.has_numeric_value) which works on main. The {}-brace null-propagation syntax for dot expressions (e.g., {depth.has_numeric_value}) requires the ast.Attribute support from PR #136. Both syntaxes are equivalent when the value is non-null; with braces, a null intermediate aborts the entire expression rather than just the attribute access.

Test plan

  • All 11 tests pass on main
  • No production code changes

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings March 9, 2026 18:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds new pytest modules that demonstrate and verify NMDC-oriented transformation patterns in linkml-map, specifically biosample flattening via dot-notation expressions and deterministic NMDC-style ID generation via uuid5().

Changes:

  • Add deterministic uuid5()-based ID generation tests (determinism, uniqueness, stdlib compatibility, null propagation).
  • Add NMDC “flatten nested AttributeValue objects into scalar columns” pattern tests for Biosample and Study.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/test_transformer/test_nmdc_uuid5_ids.py New tests covering deterministic NMDC-style ID generation using uuid5() in transformation expressions.
tests/test_transformer/test_nmdc_flattening_patterns.py New tests documenting/validating common NMDC flattening patterns from nested structures into flat lakehouse-style columns.


import copy

import pytest
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest is imported but never used in this test module. This triggers an unused-import lint warning and can be removed (or use it for fixtures/parametrization if intended).

Suggested change
import pytest

Copilot uses AI. Check for mistakes.
Value types covered:
- QuantityValue (has_numeric_value, has_unit, has_raw_value)
- ControlledIdentifiedTermValue (term.id, term.name, has_raw_value)
- ControlledTermValue (same structure, term optional)
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module docstring claims ControlledTermValue flattening is covered, but the source schema/spec/tests only define/use ControlledIdentifiedTermValue (and term is required). Either add a ControlledTermValue case or update the docstring so it matches what’s actually tested.

Suggested change
- ControlledTermValue (same structure, term optional)

Copilot uses AI. Check for mistakes.


def _make_transformer():
tr = ObjectTransformer(unrestricted_eval=True)
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrestricted_eval=True isn’t needed for these expressions (they’re in the safe expression subset, and uuid5() is registered there). Keeping the default restricted mode here would better validate the intended user path and avoid tests passing due to unrestricted fallback behavior.

Suggested change
tr = ObjectTransformer(unrestricted_eval=True)
tr = ObjectTransformer()

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@amc-corey-cox amc-corey-cox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I applied the fix on main so we should be good to merge.

turbomam and others added 2 commits March 9, 2026 16:25
Demonstrate how linkml-map can express the flattening operations
currently done by custom Python in flatten_nmdc_collections.py.

test_nmdc_flattening_patterns.py (6 tests):
- Biosample: QuantityValue (5-field), ControlledIdentifiedTermValue
  (2-level term.id/term.name), GeolocationValue, TimestampValue
- Study: PersonValue (pi_name/email/orcid)
- Null propagation and partial-field cases
- Uses lakehouse naming conventions (depth_has_numeric_value, etc.)

test_nmdc_uuid5_ids.py (5 tests):
- Deterministic ID generation with uuid5()
- Verifies idempotency, uniqueness, stdlib compatibility
- Null propagation when source fields are missing

All tests use dot-notation expressions (e.g. depth.has_numeric_value)
which work on main. The {}-brace null-propagation syntax for dot
expressions requires PR #136.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
test_nmdc_unified_activities.py (6 tests):
- Multiple class_derivations for same target (#118): three NMDC
  workflow types (MetagenomeSequencing, ReadQcAnalysis,
  MetagenomeAssembly) map to a single FlatActivity table
- Discriminator column (activity_type) via slot value
- Slot renaming (started_at_time → started_at)
- Type-specific fields excluded from unified target
- Sparse/missing optional field handling

Copilot feedback on flattening/uuid5 tests:
- Remove unused pytest import from flattening tests
- Remove ControlledTermValue from docstring (not actually tested)
- Use default restricted eval in uuid5 tests (uuid5 is registered
  in the safe function set, unrestricted_eval not needed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@turbomam turbomam force-pushed the nmdc-transformation-tests branch from 85d36c1 to 5b411d5 Compare March 9, 2026 20:25
@turbomam
Copy link
Copy Markdown
Member Author

turbomam commented Mar 9, 2026

Note: Corey's test_flatten_inlined.py (landed with #136) covers the basic QuantityValue dot-notation pattern. Our test_nmdc_flattening_patterns.py overlaps on that but also covers GeolocationValue, TimestampValue, PersonValue, and 2-level ControlledIdentifiedTermValue — NMDC-specific types not in Corey's test. The uuid5 and unified activity tests have no equivalent. Happy to deduplicate if there's too much overlap.

@amc-corey-cox amc-corey-cox merged commit eb19c49 into main Mar 9, 2026
7 checks passed
@amc-corey-cox amc-corey-cox deleted the nmdc-transformation-tests branch March 9, 2026 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants