Add cross-table lookup support for join-based transformations #136

Merged
amc-corey-cox merged 8 commits into main from cross-table-lookup
Mar 9, 2026

@amc-corey-cox
Contributor

Summary

  • Extends AliasedClass with source_key, lookup_key, and join_on fields to specify cross-table join keys
  • Creates DuckDB-backed LookupIndex (src/linkml_map/utils/lookup_index.py) for fast keyed row lookups from CSV/TSV files
  • Fixes _eval_set in eval_utils.py to accept {obj.attr} syntax (e.g., {demographics.age_at_exam}) with null propagation
  • Wires cross-table resolution into Bindings via _resolve_join(), returning DynObj wrappers for attribute access
  • Adds get_path() to DataLoader for resolving table names to file paths
  • Creates transform_spec() engine (src/linkml_map/transformer/engine.py) that iterates class_derivation blocks, registers secondary tables, streams rows through map_object, and cleans up joins

Example YAML syntax

class_derivations:
  MeasurementObservation:
    populated_from: lab_results
    joins:
      demographics:
        join_on: participant_id          # shorthand: same column name in both tables
      # or explicit:
      # demographics:
      #   source_key: participant_id
      #   lookup_key: subject_id
    slot_derivations:
      analyte_value:
        populated_from: result_value
      age_at_observation:
        expr: '{demographics.age_at_exam} * 365'
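
To make the intended semantics concrete, here is a plain-Python sketch of what the spec above asks for: each `lab_results` row is matched to a `demographics` row on `participant_id`, and the expression yields null when no match exists. It does not use linkml-map. The helper names (`build_index`, `age_at_observation`) and the sample rows are hypothetical, and the explicit `float()` conversion is an assumption, since this PR does not state how string values from an all-VARCHAR load are coerced before arithmetic.

```python
from typing import Optional

# Hypothetical sample rows; every value is a string, mirroring an all-VARCHAR load.
lab_results = [
    {"participant_id": "P1", "result_value": "5.4"},
    {"participant_id": "P2", "result_value": "6.1"},
]
demographics = [
    {"participant_id": "P1", "age_at_exam": "40"},
]


def build_index(rows, lookup_key):
    """Map each key value to the first row that carries it."""
    index = {}
    for row in rows:
        index.setdefault(row[lookup_key], row)
    return index


def age_at_observation(row, index, source_key) -> Optional[float]:
    """Mimic '{demographics.age_at_exam} * 365' with null propagation."""
    joined = index.get(row[source_key])
    if joined is None or joined.get("age_at_exam") in (None, ""):
        return None  # no matching row or empty value: the whole expression is null
    # Assumption: convert explicitly, because the joined value is a string.
    return float(joined["age_at_exam"]) * 365


index = build_index(demographics, lookup_key="participant_id")
results = [
    age_at_observation(row, index, source_key="participant_id")
    for row in lab_results
]
print(results)  # [14600.0, None]
```

With `join_on`, the same column name serves as both `source_key` and `lookup_key`; the explicit form only changes which column each side is read from.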

Design notes

  • Uses join_on instead of on because YAML 1.1 parses bare on as boolean True
  • DuckDB loads all columns as VARCHAR (all_varchar=true) to avoid type coercion surprises
  • Existing data-driven CLI path (_transform_iterator, map-data) is not touched — the new spec-driven path is added alongside it
  • SQL injection is prevented by validating all identifiers against [a-zA-Z_][a-zA-Z0-9_]*
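
The identifier check in the last note can be sketched as follows. This is an illustration of the technique, not the code in `lookup_index.py`: `validate_identifier` and `build_lookup_sql` are hypothetical names, and only the pattern itself comes from the note above. Identifiers are validated because they cannot be bound as parameters, while the key value stays a bound `?` placeholder.

```python
import re

# Pattern quoted in the design notes above.
_IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def validate_identifier(name: str) -> str:
    """Return name unchanged if it is a safe SQL identifier, else raise ValueError."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Invalid SQL identifier: {name!r}")
    return name


def build_lookup_sql(table: str, key_column: str) -> str:
    """Build a keyed lookup statement; the key value remains a bound parameter."""
    table = validate_identifier(table)
    key_column = validate_identifier(key_column)
    return f'SELECT * FROM "{table}" WHERE "{key_column}" = ? LIMIT 1'


print(build_lookup_sql("demographics", "participant_id"))
```

Using `fullmatch` rather than `match` matters here: `match` would accept a string such as `demographics; DROP TABLE x` because its prefix is a valid identifier.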

Test plan

  • 8 unit tests for LookupIndex (register, lookup, missing row, drop, CSV, identifier validation, varchar coercion)
  • 4 eval_utils tests for {obj.attr} curly-brace attribute access with null propagation
  • 6 integration tests for cross-table lookups (join_on shorthand, explicit keys, null propagation, arithmetic expressions, multiple joins, missing key error)
  • Full test suite passes (439 passed, 4 skipped, 0 failures)

Closes #134

🤖 Generated with Claude Code

amc-corey-cox and others added 4 commits March 4, 2026 07:57
Implement spec-driven cross-table lookups using DuckDB, enabling
slot derivations to reference columns from secondary tables via
`{table.column}` syntax in expressions. This supports biomedical
data harmonization use cases (e.g., pulling demographics into
measurement observations from different source tables).

- Extend AliasedClass with source_key, lookup_key, join_on fields
- Create DuckDB-backed LookupIndex for fast keyed row lookups
- Fix _eval_set to accept {obj.attr} null-propagation syntax
- Wire cross-table resolution into Bindings via join_specs
- Add get_path() to DataLoader for file path resolution
- Create transform_spec() engine for spec-driven processing
- Add 18 tests (unit + integration) covering the full stack

Closes #134

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The `DynObj | None` union syntax requires Python 3.10+. CI tests
against Python 3.9, so use `Optional[DynObj]` instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The CI regenerates the Pydantic model from YAML and commits any diff.
Align the hand-edited file with gen-pydantic output to avoid spurious
CI push failures (only difference was comment line wrapping).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The strict parameter for zip() was added in Python 3.10.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds spec-driven cross-table lookup (join) capabilities to the transformation engine, enabling {joined_table.column} expressions and join resolution via DuckDB-backed indexing.

Changes:

  • Introduces a DuckDB-backed LookupIndex and a new transform_spec() engine that registers/drops join tables and streams primary rows through map_object.
  • Extends join specifications (AliasedClass) with source_key, lookup_key, and join_on, and wires join resolution into Bindings.
  • Updates expression evaluation to allow {obj.attr} null-propagating syntax, with new unit + integration tests.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| tests/test_utils/test_lookup_index.py | Adds unit coverage for LookupIndex (register/lookup/drop/validation/varchar coercion). |
| tests/test_utils/test_eval_utils.py | Adds tests for {obj.attr} curly-brace attribute access + null propagation. |
| tests/test_transformer/test_cross_table_lookup.py | End-to-end tests for join-based lookups via the new spec-driven engine. |
| src/linkml_map/utils/lookup_index.py | Implements the DuckDB in-memory lookup/index used for joins. |
| src/linkml_map/utils/eval_utils.py | Extends _eval_set to allow {obj.attr} in null-propagation braces. |
| src/linkml_map/transformer/object_transformer.py | Adds join specs to Bindings and resolves joined rows to DynObj. |
| src/linkml_map/transformer/engine.py | New transform_spec() iterator that registers join tables and streams transformations. |
| src/linkml_map/loaders/data_loaders.py | Adds get_path() helper for resolving table identifiers to file paths. |
| src/linkml_map/datamodel/transformer_model.yaml | Documents/defines new join key fields on AliasedClass. |
| src/linkml_map/datamodel/transformer_model.py | Generated model updates for the new AliasedClass join key fields. |

amc-corey-cox and others added 2 commits March 6, 2026 15:36
- Fix 'on shorthand' references to 'join_on' in comments and error messages
- Use parameter binding for file_path in DuckDB read_csv_auto call
- Add # noqa: S608 to validated dynamic SQL statements
- Raise clear ValueError when join is configured but lookup_index is None
- Validate both source_key and lookup_key in engine join registration
- Resolve path in data_loader.get_path() to match docstring guarantee

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
turbomam added a commit that referenced this pull request Mar 9, 2026
Cover five gaps in test coverage for the LookupIndex and
transform_spec engine introduced in PR #136:

- Duplicate keys: verify LIMIT 1 first-match semantics and document
  that the returned row is non-deterministic for non-unique keys
- Empty secondary tables: headers-only TSV files register and query
  cleanly, returning None on lookup
- LookupIndex lifecycle: close() clears tables, operations after
  close() raise, double-close is safe
- Engine no-joins regression: transform_spec works correctly when
  class_derivations have no joins block (common case)
- Mixed derivations: joins and non-joins class_derivations coexist

All 11 tests pass on the cross-table-lookup branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@turbomam
Member

turbomam commented Mar 9, 2026

Hey Corey — nice work on the cross-table joins. The architecture is clean: LookupIndex / DynObj / Bindings separation is well-layered, and I appreciate that the existing data-driven CLI path is untouched.

I put together two PRs that target your branch:

PR #142 — 11 passing edge-case tests

Covers gaps I noticed in test coverage:

  • Duplicate key behavior — documents the LIMIT 1 first-match semantics (the test doesn't assert which duplicate wins, just that a row comes back)
  • Empty secondary tables — headers-only TSV files register and query cleanly
  • LookupIndex.close() lifecycle — close clears state, post-close ops raise, double-close is safe
  • Engine no-joins regression: transform_spec with no joins: block (the common case has no test coverage through the engine path)
  • Mixed derivations — joins and non-joins coexist in one spec

All 11 pass on cross-table-lookup. Should be a clean merge.

PR #144 + Issue #143 — resource cleanup gap (3 failing tests)

transform_spec() creates a LookupIndex on line 45 but never calls close(), so the DuckDB connection leaks. Also, LookupIndex doesn't support the with statement.

The fix is small — __enter__/__exit__ on LookupIndex + a top-level try/finally in transform_spec. Happy to implement if you'd like, or feel free to take it.
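
Roughly what I have in mind, as a sketch only. `LookupIndexSketch` and `transform_rows` are hypothetical stand-ins, and sqlite3 replaces DuckDB so the snippet runs with the standard library alone; the real class would keep its existing constructor and methods.

```python
import sqlite3


class LookupIndexSketch:
    """Hypothetical stand-in for LookupIndex; sqlite3 replaces DuckDB here."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")

    @property
    def closed(self):
        return self._conn is None

    def close(self):
        # Safe to call twice: the second call finds no connection and returns.
        if self._conn is not None:
            self._conn.close()
            self._conn = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised in the with body


def transform_rows(rows):
    """Generator shaped like transform_spec(): the finally clause closes the index."""
    index = LookupIndexSketch()
    try:
        for row in rows:
            yield {**row, "index_open": not index.closed}
    finally:
        index.close()


with LookupIndexSketch() as idx:
    assert not idx.closed
assert idx.closed
```

Because `transform_spec()` is a generator, the `finally` clause runs when iteration finishes, when the caller calls `.close()` on the generator, or when the generator is garbage-collected, so callers that stop early still release the connection eventually.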

One other thing to consider

The lookup_row docstring says "Return the first row" but doesn't mention what "first" means when there are duplicate keys. Might be worth either:

  • Documenting "arbitrary row for non-unique keys" in the docstring
  • Or adding a warn_on_duplicates option

Not blocking — just flagging for your consideration.
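
A possible shape for the second option, as a sketch under stated assumptions: the `warn_on_duplicates` parameter is only a suggestion, this `lookup_row` signature is hypothetical, and sqlite3 stands in for DuckDB so the example is self-contained.

```python
import sqlite3
import warnings


def lookup_row(conn, key, warn_on_duplicates=False):
    """Return one row of the demo table for key, or None if there is no match."""
    rows = conn.execute(
        "SELECT subject_id, age_at_exam FROM demographics WHERE subject_id = ?",
        (key,),
    ).fetchall()
    if not rows:
        return None
    if warn_on_duplicates and len(rows) > 1:
        warnings.warn(
            f"{len(rows)} rows share key {key!r}; returning an arbitrary one"
        )
    return rows[0]


# Hypothetical demo data with one duplicated key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demographics (subject_id TEXT, age_at_exam TEXT)")
conn.executemany(
    "INSERT INTO demographics VALUES (?, ?)",
    [("P1", "40"), ("P1", "41"), ("P2", "55")],
)
```

The trade-off is that detecting duplicates means fetching more than one row per lookup, so the check would probably need to stay opt-in or run once at registration time instead.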

turbomam added a commit that referenced this pull request Mar 9, 2026
Demonstrate how linkml-map can express the flattening operations
currently done by custom Python in flatten_nmdc_collections.py.

test_nmdc_flattening_patterns.py (6 tests):
- Biosample: QuantityValue (5-field), ControlledIdentifiedTermValue
  (2-level term.id/term.name), GeolocationValue, TimestampValue
- Study: PersonValue (pi_name/email/orcid)
- Null propagation and partial-field cases
- Uses lakehouse naming conventions (depth_has_numeric_value, etc.)

test_nmdc_uuid5_ids.py (5 tests):
- Deterministic ID generation with uuid5()
- Verifies idempotency, uniqueness, stdlib compatibility
- Null propagation when source fields are missing

All tests use dot-notation expressions (e.g. depth.has_numeric_value)
which work on main. The {}-brace null-propagation syntax for dot
expressions requires PR #136.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@amc-corey-cox amc-corey-cox merged commit 19cdeb4 into main Mar 9, 2026
7 checks passed
@amc-corey-cox amc-corey-cox deleted the cross-table-lookup branch March 9, 2026 19:58
turbomam added a commit that referenced this pull request Mar 9, 2026
Demonstrates the joins: feature for a real NMDC use case: enriching
Biosample rows with Study metadata (PI name/email, study name,
ecosystem, funding) by joining on associated_studies → Study.id.

2 tests:
- Full join with 3 biosamples (2 matched, 1 orphan with null
  propagation when study ID has no match)
- Verify all biosample-native fields pass through unchanged
  when joins are active

This is the pattern needed for NMDC lakehouse denormalized tables,
currently done with custom Python joins in external-metadata-awareness.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
turbomam added a commit that referenced this pull request Mar 9, 2026
Two integration tests for the joins: feature:
- Biosample rows enriched with Study metadata via join on associated_studies
  (matched joins, orphan with null propagation)
- Field preservation: all biosample-native fields pass through unchanged
  when joins are active

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Support cross-table slot lookup in class_derivation slot_derivations

3 participants