A redesign of the replay log is needed so it is actually useful for audit, debugging, and re-running, and so it remains safe when multiple harmonization runs happen in parallel.
Current behavior and gaps:
The replay logger is a global logger named “ReplayLogger” that clears handlers every time it’s configured, so parallel runs can interfere. The event format is minimal—only { action, dataset }—with no run metadata, schema version, timestamps, or before/after values. Logging occurs once per rule per dataset (before transform), which is insufficient for step-by-step replay or debugging. The replay helper in utils/transformations.py replays rules without preserving ordering across datasets, and the log does not record input/output paths or other audit-relevant metadata.
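To make the interference concrete, the fragile pattern looks roughly like this (a sketch reconstructed from the description above, not the actual code; function names are illustrative):

```python
import logging

# Illustrative sketch of the current fragile pattern: a single
# process-wide logger whose handlers are wiped on every configuration.
def configure_replay_logger(log_path: str) -> logging.Logger:
    logger = logging.getLogger("ReplayLogger")   # shared global name
    logger.handlers.clear()                      # detaches handlers set up by other runs
    logger.addHandler(logging.FileHandler(log_path))
    logger.setLevel(logging.INFO)
    return logger

# Two "parallel" runs: the second clear() detaches run A's handler,
# so run A's later events land in run B's file (or nowhere).
run_a = configure_replay_logger("run_a.log")
run_b = configure_replay_logger("run_b.log")
```

Because `logging.getLogger` returns the same object for the same name, both runs share one logger, and whichever run configures it last wins.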
Why this matters:
A replay log should be self-contained, robust, and parallel-safe, and it should support auditing, debugging, and re-running. The current implementation does not.
Desired behavior:
- Replay log is self-contained and robust for audit/debug/re-run.
- Safe for multiple parallel runs (no shared global logger, no shared file).
- Includes input/output file paths and rules file path in metadata.
- Can optionally include before/after values (toggleable).
- Uses row identifiers (preferred) rather than row index.
- Schema version naming format: vN.N.N.
- Log is sufficient on its own for replay, though the rules file may exist.
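As one possible shape for such a log, the snippet below emits example JSONL events that satisfy the desired behavior (field names like `run_id`, `row_id`, and the paths are hypothetical placeholders, not a final schema):

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical JSONL events; field names are illustrative, not final.
run_id = str(uuid.uuid4())
now = datetime.now(timezone.utc).isoformat()

events = [
    {"event": "run_start", "schema_version": "v1.0.0", "run_id": run_id,
     "timestamp": now, "input_path": "data/in.csv",
     "output_path": "data/out.csv", "rules_path": "rules.yaml"},
    {"event": "operation", "run_id": run_id, "timestamp": now,
     "dataset": "patients",
     "rule": {"action": "rename", "from": "dob", "to": "birth_date"},
     "row_id": "patient-0042",            # row identifier, not row index
     "before": "dob", "after": "birth_date"},  # optional, toggleable
    {"event": "run_end", "run_id": run_id, "timestamp": now, "status": "ok"},
]

log_text = "\n".join(json.dumps(e) for e in events)
print(log_text)
```

One event per line keeps the log appendable and parseable even if a run is interrupted mid-write.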
Potential direction (not final):
- Introduce a per-run ReplayLogger class (no shared global handlers).
- One log file per run; fail if it already exists unless overwrite is allowed.
- JSONL schema with run_start, operation, and run_end events; operation events include dataset, rule serialization, row identifier, and optional before/after values.
- Update replay tool to use this schema and validate version.
Acceptance criteria:
- Parallel runs do not interfere.
- Log schema is versioned and documented.
- Tests cover overwrite behavior, optional before/after values, and replay parsing.