From 13b28701f681231e924047a048eed3a57004359f Mon Sep 17 00:00:00 2001
From: Denis Bilenko
Date: Sun, 14 Jun 2026 13:22:23 -0700
Subject: [PATCH 1/4] direct: Fix WAL corruption after two consecutive failed deploys

Two consecutive failed deploys left the local state WAL with a serial ahead
of the committed state, after which every bundle command failed WAL recovery
until the WAL was deleted by hand.

- Don't open the WAL for write when planning already failed, so a failed
  plan no longer leaves a header-only WAL behind.
- Don't advance the serial when recovering a header-only WAL, so a crash
  between UpgradeToWrite and Finalize can't wedge later deploys.

Co-authored-by: Isaac
---
 NEXT_CHANGELOG.md                             |  1 +
 .../wal/two-crashed-deploys/databricks.yml    | 14 +++++++
 .../wal/two-crashed-deploys/out.test.toml     |  3 ++
 .../deploy/wal/two-crashed-deploys/output.txt | 39 ++++++++++++++++++
 .../deploy/wal/two-crashed-deploys/script     | 22 ++++++++++
 .../deploy/wal/two-crashed-deploys/test.py    |  1 +
 .../wal/two-failed-deploys/databricks.yml     | 14 +++++++
 .../wal/two-failed-deploys/out.test.toml      |  3 ++
 .../deploy/wal/two-failed-deploys/output.txt  | 41 +++++++++++++++++++
 .../two-failed-deploys/resources.json.tmpl    |  1 +
 .../deploy/wal/two-failed-deploys/script      | 30 ++++++++++++++
 .../deploy/wal/two-failed-deploys/test.py     |  1 +
 bundle/direct/dstate/state.go                 | 17 +++++++-
 bundle/direct/dstate/state_test.go            | 35 ++++++++++++++++
 bundle/phases/deploy.go                       | 11 +++--
 15 files changed, 227 insertions(+), 6 deletions(-)
 create mode 100644 acceptance/bundle/deploy/wal/two-crashed-deploys/databricks.yml
 create mode 100644 acceptance/bundle/deploy/wal/two-crashed-deploys/out.test.toml
 create mode 100644 acceptance/bundle/deploy/wal/two-crashed-deploys/output.txt
 create mode 100644 acceptance/bundle/deploy/wal/two-crashed-deploys/script
 create mode 100644 acceptance/bundle/deploy/wal/two-crashed-deploys/test.py
 create mode 100644 acceptance/bundle/deploy/wal/two-failed-deploys/databricks.yml
 create mode 100644 acceptance/bundle/deploy/wal/two-failed-deploys/out.test.toml
 create mode 100644 acceptance/bundle/deploy/wal/two-failed-deploys/output.txt
 create mode 100644 acceptance/bundle/deploy/wal/two-failed-deploys/resources.json.tmpl
 create mode 100644 acceptance/bundle/deploy/wal/two-failed-deploys/script
 create mode 100644 acceptance/bundle/deploy/wal/two-failed-deploys/test.py

diff --git a/NEXT_CHANGELOG.md b/NEXT_CHANGELOG.md
index f023b573438..174eafcef74 100644
--- a/NEXT_CHANGELOG.md
+++ b/NEXT_CHANGELOG.md
@@ -14,6 +14,7 @@
 * Bundle variable references now accept Unicode letters in path segments (e.g. `${var.变量}`). ([#5532](https://github.com/databricks/cli/pull/5532))
 * Ignore remote changes for vector search direct_access_index_spec.schema_json to prevent drift when the backend normalizes the schema ([#5481](https://github.com/databricks/cli/pull/5481)).
 * Remove hidden, never-functional `--existing-dashboard-id`, `--existing-dashboard-path`, `--existing-alert-id`, and `--existing-genie-space-id` alias flags from `bundle generate`; use the documented `--existing-id` / `--existing-path` flags instead ([#5591](https://github.com/databricks/cli/pull/5591)).
+* direct: Fix two consecutive failed deploys leaving the local state WAL with a serial ahead of the committed state, which blocked all subsequent `bundle` commands until the WAL was deleted manually. A failed plan no longer opens the WAL for write, and recovering a header-only WAL no longer advances the serial ([#5557](https://github.com/databricks/cli/issues/5557)).

 ### Dependency updates

diff --git a/acceptance/bundle/deploy/wal/two-crashed-deploys/databricks.yml b/acceptance/bundle/deploy/wal/two-crashed-deploys/databricks.yml
new file mode 100644
index 00000000000..3d65ac2bfca
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-crashed-deploys/databricks.yml
@@ -0,0 +1,14 @@
+bundle:
+  name: wal-two-crashed-deploys
+
+resources:
+  jobs:
+    test_job:
+      name: "test-job"
+      tasks:
+        - task_key: "test-task"
+          spark_python_task:
+            python_file: ./test.py
+          new_cluster:
+            spark_version: 15.4.x-scala2.12
+            node_type_id: i3.xlarge
diff --git a/acceptance/bundle/deploy/wal/two-crashed-deploys/out.test.toml b/acceptance/bundle/deploy/wal/two-crashed-deploys/out.test.toml
new file mode 100644
index 00000000000..e90b6d5d1ba
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-crashed-deploys/out.test.toml
@@ -0,0 +1,3 @@
+Local = true
+Cloud = false
+EnvMatrix.DATABRICKS_BUNDLE_ENGINE = ["direct"]
diff --git a/acceptance/bundle/deploy/wal/two-crashed-deploys/output.txt b/acceptance/bundle/deploy/wal/two-crashed-deploys/output.txt
new file mode 100644
index 00000000000..1c70b4ae2ba
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-crashed-deploys/output.txt
@@ -0,0 +1,39 @@
+
+=== First deploy (killed before recording the job, leaves a header-only WAL)
+>>> errcode [CLI] bundle deploy
+Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-two-crashed-deploys/default/files...
+Deploying resources...
+[PROCESS_KILLED]
+
+Exit code: [KILLED]
+
+>>> cat .databricks/bundle/default/resources.json.wal
+{"state_version":2,"cli_version":"[DEV_VERSION]","lineage":"[UUID]","serial":1}
+
+=== Second deploy (killed again, leaves another header-only WAL)
+>>> errcode [CLI] bundle deploy --force-lock
+Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-two-crashed-deploys/default/files...
+Deploying resources...
+[PROCESS_KILLED]
+
+Exit code: [KILLED]
+
+>>> cat .databricks/bundle/default/resources.json.wal
+{"state_version":2,"cli_version":"[DEV_VERSION]","lineage":"[UUID]","serial":1}
+
+=== Third deploy (must recover and succeed, not blocked by the leftover WAL)
+>>> errcode [CLI] bundle deploy --force-lock
+Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-two-crashed-deploys/default/files...
+Deploying resources...
+Updating deployment state...
+Deployment complete!
+
+>>> errcode assert_not_exists.py .databricks/bundle/default/resources.json.wal
+
+>>> errcode cat .databricks/bundle/default/resources.json
+{
+  "serial": 1,
+  "state_keys": [
+    "resources.jobs.test_job"
+  ]
+}
diff --git a/acceptance/bundle/deploy/wal/two-crashed-deploys/script b/acceptance/bundle/deploy/wal/two-crashed-deploys/script
new file mode 100644
index 00000000000..89da7e1a277
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-crashed-deploys/script
@@ -0,0 +1,22 @@
+# Two consecutive deploys are killed mid-apply, after UpgradeToWrite has written
+# the WAL header but before Finalize runs (killed on the jobs/create call, before
+# the job's state is recorded). A kill cannot be prevented by bailing out early,
+# so each crash leaves a header-only WAL behind. Recovery must discard those
+# no-op WALs without advancing the serial; otherwise the second crash would write
+# its WAL header two serials ahead of the committed state and block every later
+# command. Regression test for the dstate recovery fix.
+kill_after.py "POST /api/2.2/jobs/create" 0 2
+
+title "First deploy (killed before recording the job, leaves a header-only WAL)"
+trace errcode $CLI bundle deploy
+trace cat .databricks/bundle/default/resources.json.wal
+
+title "Second deploy (killed again, leaves another header-only WAL)"
+trace errcode $CLI bundle deploy --force-lock
+trace cat .databricks/bundle/default/resources.json.wal
+
+title "Third deploy (must recover and succeed, not blocked by the leftover WAL)"
+trace errcode $CLI bundle deploy --force-lock
+
+trace errcode assert_not_exists.py .databricks/bundle/default/resources.json.wal
+trace errcode cat .databricks/bundle/default/resources.json | jq -S '{serial: .serial, state_keys: (.state | keys)}'
diff --git a/acceptance/bundle/deploy/wal/two-crashed-deploys/test.py b/acceptance/bundle/deploy/wal/two-crashed-deploys/test.py
new file mode 100644
index 00000000000..1ff8e07c707
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-crashed-deploys/test.py
@@ -0,0 +1 @@
+print("test")
diff --git a/acceptance/bundle/deploy/wal/two-failed-deploys/databricks.yml b/acceptance/bundle/deploy/wal/two-failed-deploys/databricks.yml
new file mode 100644
index 00000000000..942927ad95d
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-failed-deploys/databricks.yml
@@ -0,0 +1,14 @@
+bundle:
+  name: wal-two-failed-deploys
+
+resources:
+  jobs:
+    test_job:
+      name: "test-job"
+      tasks:
+        - task_key: "test-task"
+          spark_python_task:
+            python_file: ./test.py
+          new_cluster:
+            spark_version: 15.4.x-scala2.12
+            node_type_id: i3.xlarge
diff --git a/acceptance/bundle/deploy/wal/two-failed-deploys/out.test.toml b/acceptance/bundle/deploy/wal/two-failed-deploys/out.test.toml
new file mode 100644
index 00000000000..e90b6d5d1ba
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-failed-deploys/out.test.toml
@@ -0,0 +1,3 @@
+Local = true
+Cloud = false
+EnvMatrix.DATABRICKS_BUNDLE_ENGINE = ["direct"]
diff --git a/acceptance/bundle/deploy/wal/two-failed-deploys/output.txt b/acceptance/bundle/deploy/wal/two-failed-deploys/output.txt
new file mode 100644
index 00000000000..2db7d1594af
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-failed-deploys/output.txt
@@ -0,0 +1,41 @@
+
+=== Deploy 1 (planning fails)
+>>> errcode [CLI] bundle deploy --force-lock
+Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-two-failed-deploys/default/files...
+Error: cannot plan resources.jobs.test_job: reading id="[JOB_ID]": Fault injected by test. (403 INJECTED)
+
+Endpoint: GET [DATABRICKS_URL]/api/2.2/jobs/get?job_id=[JOB_ID]
+HTTP Status: 403 Forbidden
+API error_code: INJECTED
+API message: Fault injected by test.
+
+Error: planning failed
+
+
+Exit code: 1
+
+>>> assert_not_exists.py .databricks/bundle/default/resources.json.wal
+
+=== Deploy 2 (planning fails again)
+>>> errcode [CLI] bundle deploy --force-lock
+Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-two-failed-deploys/default/files...
+Error: cannot plan resources.jobs.test_job: reading id="[JOB_ID]": Fault injected by test. (403 INJECTED)
+
+Endpoint: GET [DATABRICKS_URL]/api/2.2/jobs/get?job_id=[JOB_ID]
+HTTP Status: 403 Forbidden
+API error_code: INJECTED
+API message: Fault injected by test.
+
+Error: planning failed
+
+
+Exit code: 1
+
+>>> assert_not_exists.py .databricks/bundle/default/resources.json.wal
+
+=== Deploy 3 (succeeds, not blocked by a leftover WAL)
+>>> errcode [CLI] bundle deploy --force-lock
+Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-two-failed-deploys/default/files...
+Deploying resources...
+Updating deployment state...
+Deployment complete!
diff --git a/acceptance/bundle/deploy/wal/two-failed-deploys/resources.json.tmpl b/acceptance/bundle/deploy/wal/two-failed-deploys/resources.json.tmpl
new file mode 100644
index 00000000000..33ca8292f4c
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-failed-deploys/resources.json.tmpl
@@ -0,0 +1 @@
+{"state_version":2,"cli_version":"0.0.0","lineage":"two-failed-deploys-lineage","serial":3,"state":{"resources.jobs.test_job":{"__id__":"$JOB","state":{"name":"test-job"}}}}
diff --git a/acceptance/bundle/deploy/wal/two-failed-deploys/script b/acceptance/bundle/deploy/wal/two-failed-deploys/script
new file mode 100644
index 00000000000..bd43cac94e5
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-failed-deploys/script
@@ -0,0 +1,30 @@
+# Two consecutive deploys whose plan stage fails must not block the next, healthy
+# deploy. Previously a failed plan still opened the WAL for write (UpgradeToWrite)
+# and returned without finalizing, leaving a header-only WAL. After two such
+# failures the WAL serial drifted two ahead of the committed serial and every
+# subsequent command failed WAL recovery until the WAL was deleted by hand.
+#
+# The failure is injected with a non-retried 403 on the resource-refresh GET that
+# planning issues for the already-deployed job. (A 5xx would be retried with
+# backoff and is too slow for a test.)
+
+# A real job on the server plus committed state pointing at it makes the plan
+# stage issue a refresh GET we can fault.
+export JOB=$($CLI jobs create --json '{"name":"test-job"}' | jq -r '.job_id')
+echo "$JOB:JOB_ID" >> ACC_REPLS
+mkdir -p .databricks/bundle/default
+envsubst < resources.json.tmpl > .databricks/bundle/default/resources.json
+
+# Fault the refresh GET for the first two deploys; the third proceeds normally.
+fault.py "GET /api/2.2/jobs/get" 403 0 2
+
+title "Deploy 1 (planning fails)"
+trace errcode $CLI bundle deploy --force-lock
+trace assert_not_exists.py .databricks/bundle/default/resources.json.wal
+
+title "Deploy 2 (planning fails again)"
+trace errcode $CLI bundle deploy --force-lock
+trace assert_not_exists.py .databricks/bundle/default/resources.json.wal
+
+title "Deploy 3 (succeeds, not blocked by a leftover WAL)"
+trace errcode $CLI bundle deploy --force-lock
diff --git a/acceptance/bundle/deploy/wal/two-failed-deploys/test.py b/acceptance/bundle/deploy/wal/two-failed-deploys/test.py
new file mode 100644
index 00000000000..1ff8e07c707
--- /dev/null
+++ b/acceptance/bundle/deploy/wal/two-failed-deploys/test.py
@@ -0,0 +1 @@
+print("test")
diff --git a/bundle/direct/dstate/state.go b/bundle/direct/dstate/state.go
index 54505677663..dff484ca30c 100644
--- a/bundle/direct/dstate/state.go
+++ b/bundle/direct/dstate/state.go
@@ -286,6 +286,7 @@ func (db *DeploymentState) mergeWalIntoState(ctx context.Context) (bool, error)
 	scanner.Buffer(make([]byte, 0, initialBufferSize), maxWalEntrySize)
 	lineNumber := 0
 	var corruptedLines [][]byte
+	var newSerial int
 
 	for scanner.Scan() {
 		lineNumber++
@@ -309,7 +310,7 @@ func (db *DeploymentState) mergeWalIntoState(ctx context.Context) (bool, error)
 			if header.Serial > expectedSerial {
 				return false, fmt.Errorf("WAL serial (%d) is ahead of expected (%d), state may be corrupted", header.Serial, expectedSerial)
 			}
-			db.Data.Serial = expectedSerial
+			newSerial = header.Serial
 		} else {
 			var entry WALEntry
 			if err := json.Unmarshal(line, &entry); err != nil {
@@ -344,7 +345,19 @@ func (db *DeploymentState) mergeWalIntoState(ctx context.Context) (bool, error)
 		}
 	}
 
-	return lineNumber > 1, nil
+	hasEntries := lineNumber > 1
+
+	// Only advance the serial when the WAL carried entries, because the caller
+	// (replayWAL) persists the new state file only in that case. A header-only
+	// WAL is a deploy that started but committed nothing; advancing the serial
+	// for it leaves the in-memory serial ahead of the persisted one, so the
+	// next deploy writes its WAL header at serial+2 and recovery rejects it as
+	// "ahead of expected". See acceptance/bundle/deploy/wal/two-crashed-deploys.
+	if hasEntries {
+		db.Data.Serial = newSerial
+	}
+
+	return hasEntries, nil
 }
 
 // Finalize replays the WAL (if open for write), captures the resulting state, and resets.
diff --git a/bundle/direct/dstate/state_test.go b/bundle/direct/dstate/state_test.go
index bbfd2559951..b2d13c0a6c7 100644
--- a/bundle/direct/dstate/state_test.go
+++ b/bundle/direct/dstate/state_test.go
@@ -1,6 +1,7 @@
 package dstate
 
 import (
+	"encoding/json"
 	"os"
 	"path/filepath"
 	"testing"
@@ -55,6 +56,40 @@ func TestPanicOnDoubleOpen(t *testing.T) {
 	mustFinalize(t, &db)
 }
 
+func TestHeaderOnlyWALRecoveryDoesNotAdvanceSerial(t *testing.T) {
+	path := filepath.Join(t.TempDir(), "state.json")
+	walPath := path + walSuffix
+
+	// Commit serial 1 with one resource.
+	var db DeploymentState
+	require.NoError(t, db.Open(t.Context(), path, WithRecovery(true), WithWrite(true)))
+	require.NoError(t, db.SaveState("jobs.my_job", "123", map[string]string{}, nil))
+	mustFinalize(t, &db)
+
+	var committed DeploymentState
+	require.NoError(t, committed.Open(t.Context(), path, WithRecovery(false), WithWrite(false)))
+	lineage := committed.Data.Lineage
+	require.Equal(t, 1, committed.Data.Serial)
+	mustFinalize(t, &committed)
+
+	// A deploy that opens the WAL for write but commits nothing (e.g. planning
+	// fails after UpgradeToWrite) leaves a header-only WAL behind, here at the
+	// expected serial 2. Recovering it must not advance the serial past the
+	// committed 1, otherwise a second such failed deploy would write its header
+	// at serial 3 and be rejected as ahead of the committed state.
+	header := Header{Lineage: lineage, Serial: 2, StateVersion: currentStateVersion}
+	headerLine, err := json.Marshal(header)
+	require.NoError(t, err)
+	require.NoError(t, os.WriteFile(walPath, append(headerLine, '\n'), 0o600))
+
+	var recovered DeploymentState
+	require.NoError(t, recovered.Open(t.Context(), path, WithRecovery(true), WithWrite(false)))
+	assert.Equal(t, 1, recovered.Data.Serial)
+	assert.Equal(t, "123", recovered.GetResourceID("jobs.my_job"))
+	assert.NoFileExists(t, walPath)
+	mustFinalize(t, &recovered)
+}
+
 func TestDeleteState(t *testing.T) {
 	path := filepath.Join(t.TempDir(), "state.json")
 
diff --git a/bundle/phases/deploy.go b/bundle/phases/deploy.go
index 8518230770a..53eb73386fb 100644
--- a/bundle/phases/deploy.go
+++ b/bundle/phases/deploy.go
@@ -170,6 +170,13 @@ func Deploy(ctx context.Context, b *bundle.Bundle, outputHandler sync.OutputHand
 		plan = RunPlan(ctx, b, engine)
 	}
 
+	// Stop before opening the WAL for write if planning failed. UpgradeToWrite
+	// writes a WAL header that only deployCore's Finalize commits or discards;
+	// returning past it without finalizing leaves a header-only WAL behind.
+	if logdiag.HasError(ctx) {
+		return
+	}
+
 	if engine.IsDirect() {
 		// Upgrade from read (opened by process.go) to write mode
 		if err := b.DeploymentBundle.StateDB.UpgradeToWrite(); err != nil {
@@ -187,10 +194,6 @@ func Deploy(ctx context.Context, b *bundle.Bundle, outputHandler sync.OutputHand
 		}
 	}
 
-	if logdiag.HasError(ctx) {
-		return
-	}
-
 	haveApproval, err := approvalForDeploy(ctx, b, plan)
 	if err != nil {
 		logdiag.LogError(ctx, err)

From f91f43eb6d16c604111c1aa93b58d6a5ff829143 Mon Sep 17 00:00:00 2001
From: Denis Bilenko
Date: Sun, 14 Jun 2026 13:36:36 -0700
Subject: [PATCH 2/4] direct: Keep post-InitForApply error check in deploy phase

InitForApply receives ctx and could log a diagnostic without returning an
error, so the call site cannot prove it never will. Re-check logdiag before
deploying. UpgradeToWrite takes no ctx and thus cannot log, so the earlier
check alone is enough to guard opening the WAL.

Co-authored-by: Isaac
---
 bundle/phases/deploy.go | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/bundle/phases/deploy.go b/bundle/phases/deploy.go
index 53eb73386fb..e3b0b777eb4 100644
--- a/bundle/phases/deploy.go
+++ b/bundle/phases/deploy.go
@@ -194,6 +194,13 @@ func Deploy(ctx context.Context, b *bundle.Bundle, outputHandler sync.OutputHand
 		}
 	}
 
+	// InitForApply receives ctx and could log a diagnostic without returning an
+	// error, so re-check before deploying. (UpgradeToWrite above takes no ctx and
+	// thus cannot log, so the earlier check is enough to guard the WAL open.)
+	if logdiag.HasError(ctx) {
+		return
+	}
+
 	haveApproval, err := approvalForDeploy(ctx, b, plan)
 	if err != nil {
 		logdiag.LogError(ctx, err)

From 97033bd473bcfbd05c5a3e08b7cbcb9057e57ef5 Mon Sep 17 00:00:00 2001
From: Denis Bilenko
Date: Sun, 14 Jun 2026 14:53:36 -0700
Subject: [PATCH 3/4] acceptance: Build WAL plan-failure repro from a real deploy

Drop the hand-written resources.json.tmpl so the test no longer depends on
the internal state-file format. Deploy the job normally, then inject a fault
on the plan-stage refresh GET so the next two deploys fail while planning and
the last one recovers.

Co-authored-by: Isaac
---
 .../deploy/wal/two-failed-deploys/output.txt | 27 +++++++-----
 .../two-failed-deploys/resources.json.tmpl   |  1 -
 .../deploy/wal/two-failed-deploys/script     | 41 +++++++++----------
 3 files changed, 37 insertions(+), 32 deletions(-)
 delete mode 100644 acceptance/bundle/deploy/wal/two-failed-deploys/resources.json.tmpl

diff --git a/acceptance/bundle/deploy/wal/two-failed-deploys/output.txt b/acceptance/bundle/deploy/wal/two-failed-deploys/output.txt
index 2db7d1594af..3fec3f807a7 100644
--- a/acceptance/bundle/deploy/wal/two-failed-deploys/output.txt
+++ b/acceptance/bundle/deploy/wal/two-failed-deploys/output.txt
@@ -1,10 +1,17 @@
 
-=== Deploy 1 (planning fails)
->>> errcode [CLI] bundle deploy --force-lock
+=== Deploy 1 (normal: creates the job and the committed state)
+>>> [CLI] bundle deploy
 Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-two-failed-deploys/default/files...
-Error: cannot plan resources.jobs.test_job: reading id="[JOB_ID]": Fault injected by test. (403 INJECTED)
+Deploying resources...
+Updating deployment state...
+Deployment complete!
+
+=== Deploy 2 (planning fails, must not leave a WAL)
+>>> errcode [CLI] bundle deploy
+Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-two-failed-deploys/default/files...
+Error: cannot plan resources.jobs.test_job: reading id="[NUMID]": Fault injected by test. (403 INJECTED)
 
-Endpoint: GET [DATABRICKS_URL]/api/2.2/jobs/get?job_id=[JOB_ID]
+Endpoint: GET [DATABRICKS_URL]/api/2.2/jobs/get?job_id=[NUMID]
 HTTP Status: 403 Forbidden
 API error_code: INJECTED
 API message: Fault injected by test.
@@ -16,12 +23,12 @@ Exit code: 1
 
 >>> assert_not_exists.py .databricks/bundle/default/resources.json.wal
 
-=== Deploy 2 (planning fails again)
->>> errcode [CLI] bundle deploy --force-lock
+=== Deploy 3 (planning fails again, must not leave a WAL)
+>>> errcode [CLI] bundle deploy
 Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-two-failed-deploys/default/files...
-Error: cannot plan resources.jobs.test_job: reading id="[JOB_ID]": Fault injected by test. (403 INJECTED)
+Error: cannot plan resources.jobs.test_job: reading id="[NUMID]": Fault injected by test. (403 INJECTED)
 
-Endpoint: GET [DATABRICKS_URL]/api/2.2/jobs/get?job_id=[JOB_ID]
+Endpoint: GET [DATABRICKS_URL]/api/2.2/jobs/get?job_id=[NUMID]
 HTTP Status: 403 Forbidden
 API error_code: INJECTED
 API message: Fault injected by test.
@@ -33,8 +40,8 @@ Exit code: 1
 
 >>> assert_not_exists.py .databricks/bundle/default/resources.json.wal
 
-=== Deploy 3 (succeeds, not blocked by a leftover WAL)
->>> errcode [CLI] bundle deploy --force-lock
+=== Deploy 4 (fault expired: recovers and succeeds)
+>>> [CLI] bundle deploy
 Uploading bundle files to /Workspace/Users/[USERNAME]/.bundle/wal-two-failed-deploys/default/files...
 Deploying resources...
 Updating deployment state...
diff --git a/acceptance/bundle/deploy/wal/two-failed-deploys/resources.json.tmpl b/acceptance/bundle/deploy/wal/two-failed-deploys/resources.json.tmpl
deleted file mode 100644
index 33ca8292f4c..00000000000
--- a/acceptance/bundle/deploy/wal/two-failed-deploys/resources.json.tmpl
+++ /dev/null
@@ -1 +0,0 @@
-{"state_version":2,"cli_version":"0.0.0","lineage":"two-failed-deploys-lineage","serial":3,"state":{"resources.jobs.test_job":{"__id__":"$JOB","state":{"name":"test-job"}}}}
diff --git a/acceptance/bundle/deploy/wal/two-failed-deploys/script b/acceptance/bundle/deploy/wal/two-failed-deploys/script
index bd43cac94e5..b4fd4878627 100644
--- a/acceptance/bundle/deploy/wal/two-failed-deploys/script
+++ b/acceptance/bundle/deploy/wal/two-failed-deploys/script
@@ -1,30 +1,29 @@
-# Two consecutive deploys whose plan stage fails must not block the next, healthy
-# deploy. Previously a failed plan still opened the WAL for write (UpgradeToWrite)
-# and returned without finalizing, leaving a header-only WAL. After two such
-# failures the WAL serial drifted two ahead of the committed serial and every
-# subsequent command failed WAL recovery until the WAL was deleted by hand.
+# A failed plan must not leave a write-ahead log behind, so repeated planning
+# failures never block a later, healthy deploy. Previously a failed plan still
+# opened the WAL for write (UpgradeToWrite) and returned without finalizing,
+# leaving a header-only WAL; after two failures the WAL serial drifted two ahead
+# of the committed serial and every later command failed WAL recovery until the
+# WAL was deleted by hand.
 #
-# The failure is injected with a non-retried 403 on the resource-refresh GET that
-# planning issues for the already-deployed job. (A 5xx would be retried with
-# backoff and is too slow for a test.)
+# A first deploy creates the job normally. An injected fault then makes the next
+# two deploys fail while planning (planning refreshes the existing job via
+# jobs/get). The final deploy, with the fault expired, must recover and succeed.
+# A non-retried 403 is used so the failure is immediate; a 5xx would be retried
+# with backoff.
 
-# A real job on the server plus committed state pointing at it makes the plan
-# stage issue a refresh GET we can fault.
-export JOB=$($CLI jobs create --json '{"name":"test-job"}' | jq -r '.job_id')
-echo "$JOB:JOB_ID" >> ACC_REPLS
-mkdir -p .databricks/bundle/default
-envsubst < resources.json.tmpl > .databricks/bundle/default/resources.json
+title "Deploy 1 (normal: creates the job and the committed state)"
+trace $CLI bundle deploy
 
-# Fault the refresh GET for the first two deploys; the third proceeds normally.
+# Fail the plan-stage refresh GET for the next two deploys only.
 fault.py "GET /api/2.2/jobs/get" 403 0 2
 
-title "Deploy 1 (planning fails)"
-trace errcode $CLI bundle deploy --force-lock
+title "Deploy 2 (planning fails, must not leave a WAL)"
+trace errcode $CLI bundle deploy
 trace assert_not_exists.py .databricks/bundle/default/resources.json.wal
 
-title "Deploy 2 (planning fails again)"
-trace errcode $CLI bundle deploy --force-lock
+title "Deploy 3 (planning fails again, must not leave a WAL)"
+trace errcode $CLI bundle deploy
 trace assert_not_exists.py .databricks/bundle/default/resources.json.wal
 
-title "Deploy 3 (succeeds, not blocked by a leftover WAL)"
-trace errcode $CLI bundle deploy --force-lock
+title "Deploy 4 (fault expired: recovers and succeeds)"
+trace $CLI bundle deploy

From 49bd4e2eac0b00dca54fa7c365ff13190cb187ea Mon Sep 17 00:00:00 2001
From: Denis Bilenko
Date: Sun, 14 Jun 2026 15:47:01 -0700
Subject: [PATCH 4/4] Shorten WAL changelog entry to match PR title

Co-authored-by: Isaac
---
 NEXT_CHANGELOG.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/NEXT_CHANGELOG.md b/NEXT_CHANGELOG.md
index 174eafcef74..0156a189e61 100644
--- a/NEXT_CHANGELOG.md
+++ b/NEXT_CHANGELOG.md
@@ -14,7 +14,7 @@
 * Bundle variable references now accept Unicode letters in path segments (e.g. `${var.变量}`). ([#5532](https://github.com/databricks/cli/pull/5532))
 * Ignore remote changes for vector search direct_access_index_spec.schema_json to prevent drift when the backend normalizes the schema ([#5481](https://github.com/databricks/cli/pull/5481)).
 * Remove hidden, never-functional `--existing-dashboard-id`, `--existing-dashboard-path`, `--existing-alert-id`, and `--existing-genie-space-id` alias flags from `bundle generate`; use the documented `--existing-id` / `--existing-path` flags instead ([#5591](https://github.com/databricks/cli/pull/5591)).
-* direct: Fix two consecutive failed deploys leaving the local state WAL with a serial ahead of the committed state, which blocked all subsequent `bundle` commands until the WAL was deleted manually. A failed plan no longer opens the WAL for write, and recovering a header-only WAL no longer advances the serial ([#5557](https://github.com/databricks/cli/issues/5557)).
+* engine/direct: Fix WAL corruption after two consecutive failed deploys ([#5557](https://github.com/databricks/cli/issues/5557)).
 
 ### Dependency updates
