Redesign replay logging for audit/debug/replay and parallel runs #75

@matthewhorridge

Description

A redesign of the replay log is needed so it is actually useful for audit, debugging, and re-running, and so it remains safe when multiple harmonization runs happen in parallel.

Current behavior and gaps:

  • The replay logger is a global logger named “ReplayLogger” that clears its handlers every time it is configured, so parallel runs can interfere with each other.
  • The event format is minimal: only { action, dataset }, with no run metadata, schema version, timestamps, or before/after values.
  • Logging occurs once per rule per dataset (before the transform), which is insufficient for step-by-step replay or debugging.
  • The replay helper in utils/transformations.py replays rules without preserving ordering across datasets.
  • The log does not record input/output paths or other audit-relevant metadata.
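
The handler interference can be reproduced with the standard library alone. This is a minimal sketch, assuming the current setup resembles a function that looks up the logger by name and clears its handlers. The function name `configure_replay_logger` and the in-memory streams are hypothetical stand-ins, not the project's actual code.

```python
import io
import logging


def configure_replay_logger(stream):
    """Hypothetical stand-in for the current global configuration."""
    # getLogger returns the same object for every caller using this name.
    logger = logging.getLogger("ReplayLogger")
    # Clearing handlers also drops the handlers installed by other runs.
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(logging.StreamHandler(stream))
    return logger


run_a_out, run_b_out = io.StringIO(), io.StringIO()
logger_a = configure_replay_logger(run_a_out)
logger_b = configure_replay_logger(run_b_out)  # reconfigures run A's logger too

# Run A logs an event, but its handler is gone: the event lands in run B's stream.
logger_a.info("event from run A")
```

Because both runs share one logger object, the second configuration silently redirects the first run's events, which is the failure mode a per-run logger would avoid.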

Why this matters:

A replay log should be self-contained, robust, and parallel-safe, and it should support auditing, debugging, and re-running. The current implementation does not.

Desired behavior:

  • Replay log is self-contained and robust for audit/debug/re-run.
  • Safe for multiple parallel runs (no shared global logger, no shared file).
  • Includes input/output file paths and rules file path in metadata.
  • Can optionally include before/after values (toggleable).
  • Uses row identifiers (preferred) rather than row index.
  • Schema version naming format: vN.N.N.
  • Log is sufficient on its own for replay; the rules file may still exist, but replay must not depend on it.
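
The vN.N.N version format could be checked with a small parser. This is a sketch only: the names `parse_schema_version` and `is_supported` are hypothetical, and the rule "same major version means compatible" is an assumption that the issue does not specify.

```python
import re

# Matches the vN.N.N format named in the issue, e.g. "v1.0.0".
SCHEMA_VERSION_RE = re.compile(r"v([0-9]+)\.([0-9]+)\.([0-9]+)")


def parse_schema_version(text):
    """Return (major, minor, patch) or raise ValueError for a malformed version."""
    match = SCHEMA_VERSION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid schema version: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_supported(text, supported_major=1):
    """Assumed policy: a log is readable when its major version matches."""
    return parse_schema_version(text)[0] == supported_major
```

A replay tool could call `is_supported` on the version recorded at the start of the log and refuse to continue when it returns False.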

Potential direction (not final):

  • Introduce a per-run ReplayLogger class (no shared global handlers).
  • One log file per run; fail if it already exists unless overwrite is allowed.
  • JSONL schema with run_start, operation, and run_end events.
  • operation events include dataset, rule serialization, row identifier, and optional before/after values.
  • Update replay tool to use this schema and validate version.
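
The direction above could look roughly like the following. This is a sketch under stated assumptions, not a final design: the field names (`event`, `row_id`, `before`, `after`, `timestamp`), the constructor parameters, and the version value "v1.0.0" are all hypothetical, and rules are assumed to be JSON-serializable.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

SCHEMA_VERSION = "v1.0.0"  # assumed value; the issue only fixes the vN.N.N format


class ReplayLogger:
    """Per-run JSONL replay log that owns its file handle (no global state)."""

    def __init__(self, log_path, input_path, output_path, rules_path,
                 overwrite=False, include_values=False):
        self.include_values = include_values
        # Mode "x" raises FileExistsError when the log file already exists.
        mode = "w" if overwrite else "x"
        self._file = Path(log_path).open(mode, encoding="utf-8")
        self._write({
            "event": "run_start",
            "schema_version": SCHEMA_VERSION,
            "input_path": str(input_path),
            "output_path": str(output_path),
            "rules_path": str(rules_path),
        })

    def _write(self, event):
        # Stamp every event in UTC and write it as one JSON object per line.
        event["timestamp"] = datetime.now(timezone.utc).isoformat()
        self._file.write(json.dumps(event) + "\n")
        self._file.flush()

    def log_operation(self, dataset, rule, row_id, before=None, after=None):
        event = {"event": "operation", "dataset": dataset,
                 "rule": rule, "row_id": row_id}
        if self.include_values:  # before/after values are opt-in
            event["before"] = before
            event["after"] = after
        self._write(event)

    def close(self):
        self._write({"event": "run_end"})
        self._file.close()
```

Each run constructs its own instance with its own log path, so parallel runs never share a handler or a file. Events are written in call order, which gives the replay tool a single ordered stream across datasets.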

Acceptance criteria:

  • Parallel runs do not interfere.
  • Log schema is versioned and documented.
  • Tests cover overwrite behavior, optional before/after values, and replay parsing.
