Design: Multi-source to single-target harmonization #90

@matthewhorridge

Design: Support multi-source to single-target harmonization rules.

Background

  • Some harmonizations require combining multiple source data elements into a single target element.
  • Example: one-hot encoded source columns -> single enum target (e.g., REDCap coding).
  • Other examples: split fields (first/last) to a canonical name, separate date/time to a combined timestamp, or multiple flags to a consolidated category.
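
The one-hot case above can be sketched as a small reducer. This is only an illustration under stated assumptions: `collapse_one_hot`, the column names, and the enum codes are hypothetical and not part of the existing codebase, and treating more than one active column as an error is one possible policy.

```python
from typing import Mapping, Optional


def collapse_one_hot(
    row: Mapping[str, object],
    column_to_code: Mapping[str, str],
) -> Optional[str]:
    """Reduce several one-hot source columns to a single enum code.

    Returns None when no column is set. Raises ValueError when more than
    one column is set, since the target holds exactly one value.
    """
    # A column counts as "set" when its value equals 1, "1", or True.
    active = [
        code
        for column, code in column_to_code.items()
        if row.get(column) in (1, "1", True)
    ]
    if len(active) > 1:
        raise ValueError(f"expected at most one active column, got {active}")
    return active[0] if active else None


# Hypothetical column names and codes, for illustration only.
COLOR_CODES = {"color___1": "red", "color___2": "green", "color___3": "blue"}
picked = collapse_one_hot(
    {"color___1": 0, "color___2": 1, "color___3": 0}, COLOR_CODES
)
```

The same shape (ordered inputs in, one value out) also covers the split-name and date/time examples, with a different reducing function.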

Goals

  • Define a rule model that can reference multiple source elements.
  • Specify serialization schema changes and backward compatibility expectations.
  • Clarify how rule lookup and application should work in the pipeline (e.g., list-of-sources mapping).
  • Define how replay logging should represent multi-source transformations.
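
One way the first two goals could fit together is sketched below. Every name here (`MultiSourceRule`, the `sources`, `source`, `target`, and `transform` keys) is an assumption made for illustration and does not describe the current schema. The idea is that a serialized rule with a single `source` key is read as a one-element `sources` list, so older rule files would keep loading.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class MultiSourceRule:
    """Hypothetical rule model: ordered sources, one target."""

    sources: Tuple[str, ...]  # order is significant for the transform
    target: str
    transform: str  # name of the transform to apply

    def to_dict(self) -> Dict[str, Any]:
        # Always write the new list form.
        return {
            "sources": list(self.sources),
            "target": self.target,
            "transform": self.transform,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MultiSourceRule":
        if "sources" in data:
            sources = tuple(data["sources"])
        elif "source" in data:
            # Assumed legacy form: a single source element.
            sources = (data["source"],)
        else:
            raise ValueError("rule needs a 'source' or 'sources' key")
        if not sources:
            raise ValueError("rule needs at least one source")
        return cls(sources, data["target"], data["transform"])


legacy = MultiSourceRule.from_dict(
    {"source": "dob", "target": "birth_date", "transform": "identity"}
)
combined = MultiSourceRule.from_dict(
    {"sources": ["visit_date", "visit_time"],
     "target": "visit_timestamp",
     "transform": "combine_datetime"}
)
```

Whether the writer should emit the legacy form for single-source rules is a separate compatibility decision that this sketch does not settle.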

Open questions

  • Should a new rule type be introduced (e.g., "multi_source"), or should the existing rule schema be extended?
  • How should source ordering and missing-value handling be expressed?
  • How should one-hot -> enum mapping be supported explicitly (e.g., reduce + mapping)?
  • How should validation work when sources are missing?
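
To make the missing-value question concrete, here is a minimal sketch of one option: an explicit per-rule policy. The `MissingPolicy` names and the `gather_sources` helper are hypothetical, and the three policies shown are examples rather than a proposed final set.

```python
from enum import Enum
from typing import List, Mapping, Optional, Sequence


class MissingPolicy(Enum):
    FAIL = "fail"            # raise if any source value is missing
    SKIP = "skip"            # produce no target value for this row
    PASS_NONE = "pass_none"  # hand None through to the transform


def gather_sources(
    row: Mapping[str, object],
    sources: Sequence[str],
    policy: MissingPolicy,
) -> Optional[List[object]]:
    """Collect source values in rule order, applying the missing policy.

    Returns None when the row should be skipped.
    """
    values = [row.get(name) for name in sources]
    missing = [name for name, value in zip(sources, values) if value is None]
    if not missing:
        return values
    if policy is MissingPolicy.FAIL:
        raise KeyError(f"missing source values: {missing}")
    if policy is MissingPolicy.SKIP:
        return None
    return values  # PASS_NONE: the transform decides what None means


row = {"first_name": "Ada"}
passed = gather_sources(row, ["first_name", "last_name"], MissingPolicy.PASS_NONE)
skipped = gather_sources(row, ["first_name", "last_name"], MissingPolicy.SKIP)
```

Values are returned in the order the rule lists its sources, which is one way to make source ordering explicit without extra schema fields.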

Deliverables (design only)

  • Proposed schema with examples
  • Changes needed in RuleRegistry lookup and harmonize pipeline
  • Replay log implications
  • Test strategy
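
As a starting point for the lookup and replay-log deliverables, the sketch below keys rules by an ordered tuple of sources plus the target and writes one JSON log line per application. `SketchRegistry` is a stand-in and makes no claim about how the existing RuleRegistry works; the log field names are likewise assumptions.

```python
import json
from typing import Any, Callable, Dict, List, Mapping, Sequence, Tuple

Transform = Callable[[Sequence[Any]], Any]


class SketchRegistry:
    """Hypothetical registry keyed by (ordered sources, target)."""

    def __init__(self) -> None:
        self._rules: Dict[Tuple[Tuple[str, ...], str], Transform] = {}

    def register(self, sources: Sequence[str], target: str, fn: Transform) -> None:
        self._rules[(tuple(sources), target)] = fn

    def lookup(self, sources: Sequence[str], target: str) -> Transform:
        # Raises KeyError when no rule matches this exact source order.
        return self._rules[(tuple(sources), target)]


def apply_and_log(
    registry: SketchRegistry,
    row: Mapping[str, Any],
    sources: Sequence[str],
    target: str,
    log: List[str],
) -> Any:
    """Apply a multi-source rule and append one replay entry to `log`."""
    transform = registry.lookup(sources, target)
    inputs = [row[name] for name in sources]
    output = transform(inputs)
    # Record every input so the step can be replayed without the row.
    log.append(json.dumps({
        "sources": list(sources),
        "inputs": inputs,
        "target": target,
        "output": output,
    }))
    return output


registry = SketchRegistry()
registry.register(["first_name", "last_name"], "full_name", " ".join)
replay_log: List[str] = []
full_name = apply_and_log(
    registry,
    {"first_name": "Ada", "last_name": "Lovelace"},
    ["first_name", "last_name"],
    "full_name",
    replay_log,
)
```

A test strategy could then assert on both the returned value and the parsed log entry, and check that a lookup with the sources in a different order fails rather than silently matching.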
