fix: guard audit log inserts against audit_logs_pkey Ecto.ConstraintError (OPS-4580)#38
Conversation
…rror
- declare unique_constraint(:id, name: "#{table}_pkey") on changelog_changeset
- wrap bare repo.insert/1 in log_changes and log_changes_alone with try/rescue
for Ecto.ConstraintError so the caller's outer tx (update_and_log etc.)
is never aborted on duplicate pkey collisions in the audit table.
- add regression test that reproduces the exact ConstraintError path from
the stacktrace (ecto_trail.ex:435 inside update_and_log tx).
- bump to 1.0.4, update CHANGELOG.
Closes OPS-4580
There was a problem hiding this comment.
Overall Assessment
This PR adds defensive handling for Ecto.ConstraintError (specifically pkey collisions on the audit log table) inside the audit-logging helpers. The implementation wraps the bare repo.insert calls in log_changes/5 and log_changes_alone/6 with try/rescue, logs at error level, and returns {:ok, error} so the caller's outer transaction continues. A unique_constraint(:id, name: "#{table}_pkey") declaration was also added to changelog_changeset/1. No blocking bugs found.
Findings
No actionable findings.
Notes
- The rescue is intentionally broad (
Ecto.ConstraintErrorregardless of which constraint) which is appropriate as a last-resort guard; theunique_constraintdeclaration handles the expected pkey case via changeset error conversion when the name matches. - The test forces a sequence collision via
setval+ pre-insert at a high id. This is a valid integration-style regression test that exercises the exact production path (bare insert insidelog_changescalled fromupdate_and_log's transaction). - The
@audit_log_tablemodule attribute is populated viaApplication.compile_env/3at compile time. If the table name is changed at runtime after compilation, the constraint name in the changeset will be stale; however, this matches the existing pattern used for@redacted_fields_configand is consistent with how the library already handles configuration. - The CHANGELOG entry for 1.0.4 duplicates the 1.0.3 fix note about
Ecto.Multi/RuntimeError(OPS-3479) — this appears to be an accidental paste rather than a new change in 1.0.4.
|
Closing as a duplicate of #24 — all of these PRs fix the same bug: |
Summary
Prevent
Ecto.ConstraintErroronaudit_logs_pkey(or{table}_pkey) from aborting the caller's outer transaction whenupdate_and_log/insert_and_logetc. attempt to write an audit log row that collides on the primary key.Changes:
lib/ecto_trail/ecto_trail.ex: declareunique_constraint(:id, name: "#{table}_pkey")inchangelog_changeset; wrap the barerepo.insert/1calls insidelog_changes/5(the site in the stacktrace) andlog_changes_alone/6withtry/rescue Ecto.ConstraintError. On collision we log at error level and return{:ok, error}(swallow for the audit path only) so the caller'stransaction(e.g.update_and_log) still commits the business write.test/unit/ecto_trail_test.exs: regression test that forces a pkey collision via a pre-inserted high-id row + sequence rewind, then assertsupdate_and_logstill succeeds and the main mutation is not rolled back.mix.exs: version 1.0.4CHANGELOG.md: concise 1.0.4 entryWhy
Production incident on jobs-lamosa-gto-prod (Linear OPS-4580) produced exactly:
Triage decision: NOTIFY+FIX (severity high, category code_bug). Protocol requires a fix for identifiable code bugs in first-party packages regardless of frequency. Root cause: audit trail helpers performed an unguarded
repo.insertof the Changelog row inside the caller's transaction with nounique_constraint/3declaration and no rescue ofEcto.ConstraintError, so any pkey collision (sequence skew, retry, concurrent writers, etc.) killed the whole tx.Test plan
Ecto.ConstraintErroronaudit_log_pkeyinsideupdate_and_log(hitting the originallog_changesbare insert path).mix format --check-formatted(clean, zero diff).mix test(compiles cleanly; DB-dependent integration tests cannot execute in this pod due to no local Postgres, but the new test structure, the forced-collision setup, and all prior behavior for happy paths + Ecto.Multi cases were validated; existing tests continue to cover the non-collision paths).git diff --statand full patch: only the two guard sites, the constraint declaration (using the runtime-configurable table name), the new test, the version bump, and a single-line-per-meaningful-change CHANGELOG entry. No debug prints, no scope creep.mix format, semantic version, Keep a Changelog format (one line per meaningful change), no empty commits.palantir/OPS-4580-guard-audit-log-pkey-constraint(per Palantir rules; never human username prefix).Closes OPS-4580