Skip to content

direct: Fix WAL corruption after two consecutive failed deploys#5606

Merged
denik merged 2 commits into
mainfrom
wal-bug1
Jun 15, 2026
Merged

direct: Fix WAL corruption after two consecutive failed deploys#5606
denik merged 2 commits into
mainfrom
wal-bug1

Conversation

@denik

@denik denik commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Changes

The direct engine's local state is resources.json (committed) plus a
resources.json.wal write-ahead log whose serial must be exactly one ahead of
the committed serial. When a deploy opened the WAL but committed nothing — a
header-only WAL, e.g. a crash between UpgradeToWrite and Finalize — recovery
advanced the in-memory serial without persisting it. After a second such failure
the next WAL header was written two serials ahead of the committed state, and
every later bundle command then failed WAL recovery (WAL serial (N) is ahead of expected) until the WAL was deleted by hand.

Fix: only advance the serial when the WAL carried entries, i.e. when the merged
state is actually written back, so a no-op WAL is discarded without drift.

Why

Fixes #5557. Reported by a customer: any error that took more than one deploy
attempt to clear wedged the bundle on the second failure.

Tests

  • bundle/deploy/wal/two-crashed-deploys: two deploys killed mid-apply recover
    without wedging.
  • Unit TestHeaderOnlyWALRecoveryDoesNotAdvanceSerial. Both confirmed to fail
    when the fix is reverted.

This pull request and its description were written by Isaac.

Recovering a header-only WAL (a deploy that opened the WAL but committed
nothing, e.g. a crash between UpgradeToWrite and Finalize) advanced the
in-memory serial without persisting it. After a second such failure the next
WAL header was written two serials ahead of the committed state, and every
later command failed WAL recovery until the WAL was deleted by hand.

Only advance the serial when the WAL carried entries, i.e. when the merged
state is actually written back.

Co-authored-by: Isaac
Name it after the WAL condition it recovers, matching sibling tests
(empty-wal, stale-wal, future-serial-wal). Use a generic test-bundle name.

Co-authored-by: Isaac
@denik denik temporarily deployed to test-trigger-is June 15, 2026 13:48 — with GitHub Actions Inactive
@denik denik temporarily deployed to test-trigger-is June 15, 2026 13:48 — with GitHub Actions Inactive
@eng-dev-ecosystem-bot

Copy link
Copy Markdown
Collaborator

Integration test report

Commit: ec65a6b

Run: 27550846365

Env 🟨​KNOWN 🔄​flaky 💚​RECOVERED 🙈​SKIP ✅​pass 🙈​skip Time
💚​ aws linux 7 15 264 979 6:54
🟨​ aws windows 7 15 266 977 11:00
💚​ aws-ucws linux 7 15 360 893 14:56
💚​ aws-ucws windows 7 15 362 891 12:53
🔄​ azure linux 1 1 17 266 977 8:51
💚​ azure windows 1 17 269 975 9:14
🔄​ azure-ucws linux 2 1 17 363 889 13:55
💚​ azure-ucws windows 1 17 367 887 8:46
💚​ gcp linux 1 17 263 980 7:33
💚​ gcp windows 1 17 265 978 8:13
25 interesting tests: 15 SKIP, 7 KNOWN, 3 flaky
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
🟨​ TestAccept 💚​R 🟨​K 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/invariant/no_drift 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 💚​R 🟨​K 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 💚​R 🟨​K 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 🟨​K 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/replace_existing 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/grants/select 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/ssh/connection 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🔄​ TestFetchRepositoryInfoAPI_FromRepo ✅​p ✅​p ✅​p ✅​p 🔄​f ✅​p ✅​p ✅​p ✅​p ✅​p
🔄​ TestFetchRepositoryInfoAPI_FromRepo/root ✅​p ✅​p ✅​p ✅​p ✅​p ✅​p 🔄​f ✅​p ✅​p ✅​p
🔄​ TestFetchRepositoryInfoAPI_FromRepo/subdir ✅​p ✅​p ✅​p ✅​p ✅​p ✅​p 🔄​f ✅​p ✅​p ✅​p
Top 30 slowest tests (at least 2 minutes):
duration env testname
4:43 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:24 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:20 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:18 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:49 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:24 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:19 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:18 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:09 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:07 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:07 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:06 azure windows TestAccept
3:06 aws-ucws windows TestAccept
3:02 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:00 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:58 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:57 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:57 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:55 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:55 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:43 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:42 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:40 gcp windows TestAccept
2:29 aws-ucws linux TestAccept/bundle/resources/alerts/with_file/DATABRICKS_BUNDLE_ENGINE=terraform
2:24 azure-ucws windows TestAccept
2:15 aws-ucws linux TestAccept/bundle/deploy/files/no-snapshot-sync/DATABRICKS_BUNDLE_ENGINE=terraform
2:15 aws-ucws linux TestAccept/bundle/resources/volumes/recreate/DATABRICKS_BUNDLE_ENGINE=terraform
2:11 aws-ucws linux TestAccept/bundle/resources/model_serving_endpoints/basic/DATABRICKS_BUNDLE_ENGINE=terraform
2:07 aws-ucws linux TestAccept/bundle/resources/model_serving_endpoints/basic/DATABRICKS_BUNDLE_ENGINE=direct
2:03 aws-ucws windows TestAccept/bundle/resources/volumes/recreate/DATABRICKS_BUNDLE_ENGINE=terraform

Comment thread acceptance/bundle/deploy/wal/header-only-wal/output.txt
@denik denik added this pull request to the merge queue Jun 15, 2026
Merged via the queue into main with commit 8f270ba Jun 15, 2026
25 checks passed
@denik denik deleted the wal-bug1 branch June 15, 2026 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multiple references to the same resource do not resolve

3 participants