Summary
Add a new extract_regex primitive to extract values from text using regex capture groups.
Motivation
The existing substitute primitive can rewrite strings but does not directly model extraction semantics. Harmonization workflows often need to pull a specific token out of a larger field (e.g., an MRN, a code, or a numeric fragment) and route it to a target element.
Proposed primitive
operation: extract_regex
- Input: scalar string (and iterable support via existing wrapper pattern)
- Output: extracted string (or typed value in future extension)
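The "existing wrapper pattern" is not shown in this issue, so the following is only a sketch of what scalar-plus-iterable support could look like. The decorator name support_iterable and the helper extract_first_group are hypothetical stand-ins, not names from the framework.

```python
import re
from functools import wraps
from typing import Callable, List, Union


def support_iterable(func: Callable[..., str]) -> Callable[..., Union[str, List[str]]]:
    """Hypothetical wrapper: apply a scalar primitive element-wise to lists/tuples."""

    @wraps(func)
    def wrapper(value, *args, **kwargs):
        # Strings are iterable too, so only lists and tuples are fanned out.
        if isinstance(value, (list, tuple)):
            return [func(item, *args, **kwargs) for item in value]
        return func(value, *args, **kwargs)

    return wrapper


@support_iterable
def extract_first_group(value: str, expression: str) -> str:
    """Toy scalar primitive: return capture group 1 of the first match."""
    match = re.search(expression, value)
    if match is None:
        raise ValueError(f"no match for {expression!r} in {value!r}")
    return match.group(1)
```

With this shape, the scalar logic stays in one place and a call such as extract_first_group(["visit_0042", "visit_7"], r"visit_(\d+)") returns one extracted string per input element.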
V1 behavior
- expression (required): regex pattern
- group (optional, default 1): capture group index or group name
- flags (optional): list of regex flags (IGNORECASE, MULTILINE, DOTALL)
- strict (bool, default true):
  - true: raise on no match / invalid group
  - false: return default
- default (optional): fallback value when strict is false
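A minimal sketch of the V1 behavior as a standalone function, to make the semantics concrete. Several points are assumptions not settled by this issue: it uses re.search (so the pattern may match anywhere in the input), it raises ValueError in strict mode, it treats a group that did not participate in the match like a missing match, and it rejects flag names outside the three listed above.

```python
import re
from typing import Optional, Sequence, Union

# Only the three flags named in the proposal are accepted.
_FLAGS = {
    "IGNORECASE": re.IGNORECASE,
    "MULTILINE": re.MULTILINE,
    "DOTALL": re.DOTALL,
}


def extract_regex(
    value: str,
    expression: str,
    group: Union[int, str] = 1,
    flags: Sequence[str] = (),
    strict: bool = True,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the requested capture group of the first match in value."""
    combined = 0
    for name in flags:
        if name not in _FLAGS:
            raise ValueError(f"unsupported regex flag: {name!r}")
        combined |= _FLAGS[name]

    match = re.search(expression, value, combined)
    extracted = None
    if match is not None:
        try:
            extracted = match.group(group)
        except IndexError:
            # re raises IndexError for an unknown group index or name.
            extracted = None

    if extracted is None:
        if strict:
            raise ValueError(
                f"extract_regex: no value for group {group!r} "
                f"with pattern {expression!r}"
            )
        return default
    return extracted
```

For example, extract_regex("no id here", r"MRN[:\s]+([A-Z0-9-]+)", strict=False, default="NA") falls back to "NA", while the same call with strict left at its default raises.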
Serialization example
{
"operation": "extract_regex",
"expression": "MRN[:\\s]+([A-Z0-9-]+)",
"group": 1,
"strict": true
}
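The payload above could round-trip through a small spec class along these lines. This is a sketch only: the class name ExtractRegexSpec and its to_dict/from_dict methods are hypothetical, and the framework's real primitive base class and dispatch in harmonization_rule.py may look different. Omitting empty flags and an unset default from the serialized form is also an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class ExtractRegexSpec:
    """Hypothetical container for the serialized fields of extract_regex."""

    expression: str
    group: Union[int, str] = 1
    flags: List[str] = field(default_factory=list)
    strict: bool = True
    default: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        payload: Dict[str, Any] = {
            "operation": "extract_regex",
            "expression": self.expression,
            "group": self.group,
            "strict": self.strict,
        }
        # Optional fields are written only when set (an assumption).
        if self.flags:
            payload["flags"] = list(self.flags)
        if self.default is not None:
            payload["default"] = self.default
        return payload

    @classmethod
    def from_dict(cls, payload: Dict[str, Any]) -> "ExtractRegexSpec":
        if payload.get("operation") != "extract_regex":
            raise ValueError("payload is not an extract_regex operation")
        if "expression" not in payload:
            raise ValueError("extract_regex requires 'expression'")
        return cls(
            expression=payload["expression"],
            group=payload.get("group", 1),
            flags=list(payload.get("flags", [])),
            strict=payload.get("strict", True),
            default=payload.get("default"),
        )


# The JSON text from the serialization example; a raw string keeps the
# JSON-level "\\s" escape intact until json.loads decodes it to "\s".
raw = r'{"operation": "extract_regex", "expression": "MRN[:\\s]+([A-Z0-9-]+)", "group": 1, "strict": true}'
spec = ExtractRegexSpec.from_dict(json.loads(raw))
```

Note that the doubled backslash exists only in the JSON text; after decoding, spec.expression holds the regex MRN[:\s]+([A-Z0-9-]+) with a single backslash.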
Example use cases
- Extract identifier from free text:
  "Patient MRN: A12-99" -> "A12-99"
- Extract numeric suffix:
  "visit_0042" with expression visit_(\d+) and group=1 -> "0042"
Suggested implementation areas
- src/harmonization_framework/primitives/extract_regex.py
- src/harmonization_framework/primitives/vocabulary.py
- src/harmonization_framework/primitives/__init__.py
- src/harmonization_framework/harmonization_rule.py (deserialization dispatch)
- tests in tests/test_primitives_serialization.py and rule roundtrip tests
Test plan
- Basic extraction with numeric group.
- Named group extraction.
- No-match strict failure.
- No-match non-strict fallback.
- Invalid group handling.
- Iterable input support.