Add cross-table lookup support for join-based transformations#136
amc-corey-cox merged 8 commits into main
Conversation
Implement spec-driven cross-table lookups using DuckDB, enabling
slot derivations to reference columns from secondary tables via
`{table.column}` syntax in expressions. This supports biomedical
data harmonization use cases (e.g., pulling demographics into
measurement observations from different source tables).
- Extend AliasedClass with source_key, lookup_key, join_on fields
- Create DuckDB-backed LookupIndex for fast keyed row lookups
- Fix _eval_set to accept {obj.attr} null-propagation syntax
- Wire cross-table resolution into Bindings via join_specs
- Add get_path() to DataLoader for file path resolution
- Create transform_spec() engine for spec-driven processing
- Add 18 tests (unit + integration) covering the full stack
Closes #134
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
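The null-propagation behavior of the `{table.column}` syntax described above can be sketched in plain Python. This is a hypothetical illustration of the semantics, not the actual linkml-map implementation; the function name and context shape are invented:

```python
# Hypothetical sketch of {obj.attr} null-propagation semantics:
# if any attribute along the dotted path is missing or None, the
# whole expression evaluates to None instead of raising.
import re

def eval_null_propagating(expr: str, context: dict) -> object:
    """Resolve a '{obj.attr}' expression against a context of objects/dicts."""
    match = re.fullmatch(r"\{([A-Za-z_][\w.]*)\}", expr)
    if not match:
        raise ValueError(f"not a braced path expression: {expr!r}")
    obj = context
    for part in match.group(1).split("."):
        if isinstance(obj, dict):
            obj = obj.get(part)
        else:
            obj = getattr(obj, part, None)
        if obj is None:
            return None  # null propagation: stop at the first missing link
    return obj
```

So `{demographics.age_at_exam}` yields the joined value when the lookup matched, and `None` (rather than an `AttributeError`) for orphan rows with no match.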
The `DynObj | None` union syntax requires Python 3.10+. CI tests against Python 3.9, so use `Optional[DynObj]` instead.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CI regenerates the Pydantic model from YAML and commits any diff. Align the hand-edited file with the gen-pydantic output to avoid spurious CI push failures (the only difference was comment line wrapping).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `strict` parameter for `zip()` was added in Python 3.10.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
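For reference, the 3.10+ `zip(..., strict=True)` behavior can be backported with a few lines of stdlib code. This is a hypothetical helper, not something the PR adds:

```python
# zip(..., strict=True) is Python 3.10+; a minimal 3.9-compatible
# equivalent that raises when the iterables have different lengths.
from itertools import zip_longest

_SENTINEL = object()

def zip_strict(*iterables):
    """Yield zipped tuples, raising ValueError if lengths differ."""
    for combo in zip_longest(*iterables, fillvalue=_SENTINEL):
        if _SENTINEL in combo:
            raise ValueError("zip_strict() arguments have unequal lengths")
        yield combo
```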
Pull request overview
Adds spec-driven cross-table lookup (join) capabilities to the transformation engine, enabling {joined_table.column} expressions and join resolution via DuckDB-backed indexing.
Changes:
- Introduces a DuckDB-backed `LookupIndex` and a new `transform_spec()` engine that registers/drops join tables and streams primary rows through `map_object`.
- Extends join specifications (`AliasedClass`) with `source_key`, `lookup_key`, and `join_on`, and wires join resolution into `Bindings`.
- Updates expression evaluation to allow `{obj.attr}` null-propagating syntax, with new unit + integration tests.
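The shape of the keyed-lookup component can be sketched with stdlib `sqlite3` as an analogy (the actual PR uses an in-memory DuckDB database and reads CSV/TSV files directly; the class and method names below are illustrative, not the real `LookupIndex` API):

```python
# Rough analogy of the DuckDB-backed LookupIndex: identifiers are
# validated before being spliced into SQL, values go through parameter
# binding, and a missing key returns None rather than raising.
import re
import sqlite3

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

class ToyLookupIndex:
    def __init__(self):
        self._con = sqlite3.connect(":memory:")
        self._con.row_factory = sqlite3.Row

    def register(self, table: str, rows: list) -> None:
        cols = list(rows[0])
        for name in [table, *cols]:
            if not _IDENT.match(name):
                raise ValueError(f"invalid identifier: {name!r}")
        col_sql = ", ".join(f"{c} TEXT" for c in cols)  # everything as text, like all_varchar=true
        self._con.execute(f"CREATE TABLE {table} ({col_sql})")
        placeholders = ", ".join("?" for _ in cols)
        self._con.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [tuple(str(r[c]) for c in cols) for r in rows],
        )

    def lookup(self, table: str, key_col: str, key: str):
        if not (_IDENT.match(table) and _IDENT.match(key_col)):
            raise ValueError("invalid identifier")
        cur = self._con.execute(
            f"SELECT * FROM {table} WHERE {key_col} = ? LIMIT 1", (key,)
        )
        row = cur.fetchone()
        return dict(row) if row else None  # None -> null propagation upstream
```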
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_utils/test_lookup_index.py | Adds unit coverage for LookupIndex (register/lookup/drop/validation/varchar coercion). |
| tests/test_utils/test_eval_utils.py | Adds tests for {obj.attr} curly-brace attribute access + null propagation. |
| tests/test_transformer/test_cross_table_lookup.py | End-to-end tests for join-based lookups via the new spec-driven engine. |
| src/linkml_map/utils/lookup_index.py | Implements the DuckDB in-memory lookup/index used for joins. |
| src/linkml_map/utils/eval_utils.py | Extends _eval_set to allow {obj.attr} in null-propagation braces. |
| src/linkml_map/transformer/object_transformer.py | Adds join specs to Bindings and resolves joined rows to DynObj. |
| src/linkml_map/transformer/engine.py | New transform_spec() iterator that registers join tables and streams transformations. |
| src/linkml_map/loaders/data_loaders.py | Adds get_path() helper for resolving table identifiers to file paths. |
| src/linkml_map/datamodel/transformer_model.yaml | Documents/defines new join key fields on AliasedClass. |
| src/linkml_map/datamodel/transformer_model.py | Generated model updates for the new AliasedClass join key fields. |
- Fix 'on shorthand' references to 'join_on' in comments and error messages
- Use parameter binding for file_path in DuckDB read_csv_auto call
- Add # noqa: S608 to validated dynamic SQL statements
- Raise clear ValueError when join is configured but lookup_index is None
- Validate both source_key and lookup_key in engine join registration
- Resolve path in data_loader.get_path() to match docstring guarantee
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
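The parameter-binding fix above follows the standard pattern for keeping user-supplied values out of SQL text. A minimal sketch with stdlib `sqlite3` (DuckDB's Python API accepts the same `?` placeholder style):

```python
# Why parameter binding: bound values are treated purely as data by the
# driver, so a hostile string can never change the statement's structure.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT)")

hostile = "x'); DROP TABLE files; --"

# String interpolation would splice this into the SQL; binding does not.
con.execute("INSERT INTO files (path) VALUES (?)", (hostile,))

row = con.execute("SELECT path FROM files").fetchone()
assert row[0] == hostile  # stored verbatim, no SQL executed
```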
Cover four gaps in test coverage for the LookupIndex and transform_spec engine introduced in PR #136:
- Duplicate keys: verify LIMIT 1 first-match semantics and document that the returned row is non-deterministic for non-unique keys
- Empty secondary tables: headers-only TSV files register and query cleanly, returning None on lookup
- LookupIndex lifecycle: close() clears tables, operations after close() raise, double-close is safe
- Engine no-joins regression: transform_spec works correctly when class_derivations have no joins block (common case)
- Mixed derivations: joins and non-joins class_derivations coexist

All 11 tests pass on the cross-table-lookup branch.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
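The duplicate-key semantics in the first bullet can be modeled with a toy dict-based index (hypothetical names; the real index does this in SQL with `LIMIT 1`): a lookup on a non-unique key returns one matching row, never a list and never an error, and which row comes back is unspecified.

```python
# Toy model of LIMIT 1 first-match lookup semantics.
from collections import defaultdict

def build_index(rows, key_col):
    idx = defaultdict(list)
    for row in rows:
        idx[row[key_col]].append(row)
    return idx

def lookup_first(idx, key):
    matches = idx.get(key)
    return matches[0] if matches else None  # LIMIT 1 analogue; None when absent

rows = [{"id": "p1", "v": "a"}, {"id": "p1", "v": "b"}, {"id": "p2", "v": "c"}]
idx = build_index(rows, "id")
hit = lookup_first(idx, "p1")
assert hit in rows and hit["id"] == "p1"     # exactly one row with the right key
assert lookup_first(idx, "missing") is None  # empty result -> None
```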
Hey Corey — nice work on the cross-table joins. The architecture is clean. I put together two PRs that target your branch:

**PR #142 — 11 passing edge-case tests.** Covers gaps I noticed in test coverage. All 11 pass on the cross-table-lookup branch.

**PR #144 + Issue #143 — resource cleanup gap (3 failing tests).** The fix is small.

**One other thing to consider.** The …

Not blocking — just flagging for your consideration.
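One common way to close a resource-cleanup gap like the one flagged above is to make the index a context manager, so `close()` runs even when transformation raises mid-stream. This is a hypothetical sketch of the pattern, not the fix in PR #144:

```python
# Hypothetical context-manager wrapper: close() always runs, double-close
# is safe, and exceptions from the body are never swallowed.
class ManagedIndex:
    def __init__(self):
        self.closed = False
        self.tables = {}

    def close(self):
        if not self.closed:  # double-close is a no-op
            self.tables.clear()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # propagate any exception from the with-body
```

Usage: `with ManagedIndex() as idx: ...` — after the block, `idx.closed` is `True` even if the body raised.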
Demonstrate how linkml-map can express the flattening operations
currently done by custom Python in flatten_nmdc_collections.py.
test_nmdc_flattening_patterns.py (6 tests):
- Biosample: QuantityValue (5-field), ControlledIdentifiedTermValue
(2-level term.id/term.name), GeolocationValue, TimestampValue
- Study: PersonValue (pi_name/email/orcid)
- Null propagation and partial-field cases
- Uses lakehouse naming conventions (depth_has_numeric_value, etc.)
test_nmdc_uuid5_ids.py (5 tests):
- Deterministic ID generation with uuid5()
- Verifies idempotency, uniqueness, stdlib compatibility
- Null propagation when source fields are missing
All tests use dot-notation expressions (e.g. depth.has_numeric_value)
which work on main. The {}-brace null-propagation syntax for dot
expressions requires PR #136.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
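The uuid5-based deterministic ID pattern the tests exercise can be sketched with the stdlib alone. The namespace UUID and field layout below are illustrative, not the ones used in the tests:

```python
# Deterministic row IDs with uuid5: the same namespace plus the same
# source fields always yield the same ID, so re-runs are idempotent.
import uuid

NS = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/nmdc-flattening")

def row_id(*fields: str) -> str:
    return str(uuid.uuid5(NS, "|".join(fields)))

a = row_id("nmdc:bsm-11-abc123", "depth")
b = row_id("nmdc:bsm-11-abc123", "depth")
c = row_id("nmdc:bsm-11-abc123", "temperature")
assert a == b  # idempotent across runs
assert a != c  # distinct inputs -> distinct IDs
```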
Demonstrates the joins: feature for a real NMDC use case: enriching Biosample rows with Study metadata (PI name/email, study name, ecosystem, funding) by joining on associated_studies → Study.id.

2 tests:
- Full join with 3 biosamples (2 matched, 1 orphan with null propagation when study ID has no match)
- Verify all biosample-native fields pass through unchanged when joins are active

This is the pattern needed for NMDC lakehouse denormalized tables, currently done with custom Python joins in external-metadata-awareness.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two integration tests for the joins: feature:
- Biosample rows enriched with Study metadata via join on associated_studies (matched joins, orphan with null propagation)
- Field preservation: all biosample-native fields pass through unchanged when joins are active
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Extends `AliasedClass` with `source_key`, `lookup_key`, and `join_on` fields to specify cross-table join keys
- Adds `LookupIndex` (src/linkml_map/utils/lookup_index.py) for fast keyed row lookups from CSV/TSV files
- Extends `_eval_set` in eval_utils.py to accept `{obj.attr}` syntax (e.g., `{demographics.age_at_exam}`) with null propagation
- Wires cross-table resolution into `Bindings` via `_resolve_join()`, returning `DynObj` wrappers for attribute access
- Adds `get_path()` to `DataLoader` for resolving table names to file paths
- Adds a `transform_spec()` engine (src/linkml_map/transformer/engine.py) that iterates class_derivation blocks, registers secondary tables, streams rows through `map_object`, and cleans up joins

Example YAML syntax
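A hypothetical spec fragment illustrating the join fields (the field names `source_key`, `lookup_key`, and `join_on` come from this PR; the class, table, and column names are invented, and the exact nesting may differ from the merged schema):

```yaml
class_derivations:
  MeasurementObservation:
    populated_from: measurements
    joins:
      demographics:                    # AliasedClass
        populated_from: demographics
        source_key: patient_id         # key column in the primary table
        lookup_key: patient_id         # key column in the joined table
        # join_on: the shorthand form ('on' would parse as boolean True in YAML 1.1)
    slot_derivations:
      age_at_exam:
        expr: "{demographics.age_at_exam}"  # null-propagating cross-table reference
```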
Design notes
- Uses `join_on` instead of `on` because YAML 1.1 parses bare `on` as boolean `True`
- All columns are read as `VARCHAR` (all_varchar=true) to avoid type coercion surprises
- The existing transformation path (`_transform_iterator`, `map-data`) is not touched — the new spec-driven path is added alongside it
- Table and column identifiers are validated against `[a-zA-Z_][a-zA-Z0-9_]*`

Test plan
- Unit tests for `LookupIndex` (register, lookup, missing row, drop, CSV, identifier validation, varchar coercion)
- Unit tests for `{obj.attr}` curly-brace attribute access with null propagation

Closes #134
🤖 Generated with Claude Code