Add extract_regex primitive for regex capture extraction #100

@matthewhorridge

Summary

Add a new extract_regex primitive to extract values from text using regex capture groups.

Motivation

The existing substitute primitive can rewrite strings, but it does not directly model extraction semantics. Harmonization workflows often need to pull a specific token out of a larger field (e.g., an MRN, a code, or a numeric fragment) and route it to a target element.

Proposed primitive

  • operation: extract_regex
  • Input: scalar string (and iterable support via existing wrapper pattern)
  • Output: extracted string (or typed value in future extension)

V1 behavior

  • expression (required): regex pattern
  • group (optional, default 1): capture group index or group name
  • flags (optional): list of regex flags (IGNORECASE, MULTILINE, DOTALL)
  • strict (bool, default true):
    • true: raise on no match / invalid group
    • false: return default
  • default (optional): fallback value when strict is false
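
The V1 behavior listed above could be sketched roughly as follows. This is only an illustration built on Python's standard `re` module, not the framework's actual primitive: the function name and signature, the `_FLAG_NAMES` table, the choice of `re.search` over `re.match`, and the use of `ValueError` for strict failures are all assumptions.

```python
import re
from typing import Any, Iterable, Optional, Union

# Assumed mapping from the flag names in the proposal to Python's re flags.
_FLAG_NAMES = {
    "IGNORECASE": re.IGNORECASE,
    "MULTILINE": re.MULTILINE,
    "DOTALL": re.DOTALL,
}


def extract_regex(
    value: str,
    expression: str,
    group: Union[int, str] = 1,
    flags: Iterable[str] = (),
    strict: bool = True,
    default: Optional[Any] = None,
) -> Any:
    """Return the requested capture group from the first match in `value`."""
    combined = 0
    for name in flags:
        combined |= _FLAG_NAMES[name]  # unknown flag names raise KeyError

    # Assumption: search anywhere in the string, since the issue's MRN
    # example matches in the middle of the input.
    match = re.search(expression, value, combined)
    if match is None:
        if strict:
            raise ValueError(f"no match for {expression!r}")
        return default

    try:
        # Note: a group that exists but did not participate yields None;
        # the proposal does not say how that case should be treated.
        return match.group(group)
    except IndexError:
        # re raises IndexError for an unknown group index or name.
        if strict:
            raise ValueError(f"invalid group {group!r} for {expression!r}")
        return default
```

With this sketch, a call such as `extract_regex("no digits", r"(\d+)", strict=False, default="n/a")` falls back to the default, while the same call with `strict=True` raises.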

Serialization example

{
  "operation": "extract_regex",
  "expression": "MRN[:\\s]+([A-Z0-9-]+)",
  "group": 1,
  "strict": true
}
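
One way to round-trip the serialized form above is sketched below. The `ExtractRegexSpec` dataclass and its `to_dict` / `from_dict` methods are hypothetical names chosen for illustration; the framework's real serialization hooks may look different. Omitting `flags` and `default` from the output when they are unset is also an assumption, made so that the example payload survives a round trip unchanged.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class ExtractRegexSpec:
    """Hypothetical container for the extract_regex parameters."""
    expression: str
    group: Union[int, str] = 1
    flags: List[str] = field(default_factory=list)
    strict: bool = True
    default: Optional[Any] = None

    def to_dict(self) -> Dict[str, Any]:
        data: Dict[str, Any] = {
            "operation": "extract_regex",
            "expression": self.expression,
            "group": self.group,
            "strict": self.strict,
        }
        # Assumption: optional fields are written only when set.
        if self.flags:
            data["flags"] = list(self.flags)
        if self.default is not None:
            data["default"] = self.default
        return data

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ExtractRegexSpec":
        if data.get("operation") != "extract_regex":
            raise ValueError("not an extract_regex payload")
        return cls(
            expression=data["expression"],
            group=data.get("group", 1),
            flags=list(data.get("flags", [])),
            strict=data.get("strict", True),
            default=data.get("default"),
        )


# The payload from the issue; the raw string keeps the JSON escapes intact.
RAW = r'''
{
  "operation": "extract_regex",
  "expression": "MRN[:\\s]+([A-Z0-9-]+)",
  "group": 1,
  "strict": true
}
'''

spec = ExtractRegexSpec.from_dict(json.loads(RAW))
round_tripped = json.loads(json.dumps(spec.to_dict()))
```

After JSON decoding, `spec.expression` holds the regex with a single backslash before `s`, which is the form a regex engine expects.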

Example use cases

  • Extract identifier from free text:
    • "Patient MRN: A12-99" -> "A12-99"
  • Extract numeric suffix:
    • "visit_0042" with expression visit_(\d+) and group=1 -> "0042"
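
The two use cases can be checked directly against Python's standard `re` module, independent of any framework code. The snippet also shows why a pattern carries doubled backslashes once it is JSON-encoded. Variable names here are illustrative only.

```python
import json
import re

# Use case 1: identifier embedded in free text.
mrn_match = re.search(r"MRN[:\s]+([A-Z0-9-]+)", "Patient MRN: A12-99")
mrn_value = mrn_match.group(1)

# Use case 2: numeric suffix; the leading zeros are preserved because
# the result is a string, not a number.
visit_match = re.search(r"visit_(\d+)", "visit_0042")
visit_value = visit_match.group(1)

# JSON encoding escapes the backslash, so the serialized pattern contains
# two backslashes before "d"; decoding restores the single backslash.
encoded = json.dumps({"expression": r"visit_(\d+)"})
decoded = json.loads(encoded)["expression"]
```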

Suggested implementation areas

  • src/harmonization_framework/primitives/extract_regex.py
  • src/harmonization_framework/primitives/vocabulary.py
  • src/harmonization_framework/primitives/__init__.py
  • src/harmonization_framework/harmonization_rule.py (deserialization dispatch)
  • tests in tests/test_primitives_serialization.py and rule roundtrip tests

Test plan

  1. Basic extraction with numeric group.
  2. Named group extraction.
  3. No-match strict failure.
  4. No-match non-strict fallback.
  5. Invalid group handling.
  6. Iterable input support.
