Skip to content

[FLINK-38867] [Kubernetes Operator] Fix restartSavepointNonce to respect initialSavepointPath in BlueGreenDeployments#12

Open
james-kan-shopify wants to merge 1 commit into
mainfrom
flink-blue-green-restartsavepointnonce
Open

[FLINK-38867] [Kubernetes Operator] Fix restartSavepointNonce to respect initialSavepointPath in BlueGreenDeployments#12
james-kan-shopify wants to merge 1 commit into
mainfrom
flink-blue-green-restartsavepointnonce

Conversation

@james-kan-shopify
Copy link
Copy Markdown

@james-kan-shopify james-kan-shopify commented Dec 11, 2025

What is the purpose of the change

This pull request fixes a bug where FlinkBlueGreenDeployment ignores the user-specified initialSavepointPath when savepointRedeployNonce is incremented. The controller now properly honors savepointRedeployNonce changes and uses the specified savepoint location for redeployment, aligning Blue/Green deployment behavior with standard FlinkDeployment behavior.

Brief change log

  • Added BlueGreenDiffType.SAVEPOINT_REDEPLOY to handle nonce-triggered redeployments
  • Updated FlinkBlueGreenDeploymentSpecDiff to detect and prioritize savepointRedeployNonce changes
  • Added startSavepointRedeployTransition() to skip savepoint triggering and use initialSavepointPath from spec

Verifying this change

This change added tests and can be verified as follows:

  • Added verifySavepointRedeployNonceTriggersTransitionWithInitialSavepointPath integration test in FlinkBlueGreenDeploymentControllerTest that verifies the full savepoint redeploy flow:
    • Deploys to stable ACTIVE_BLUE state
    • Bumps savepointRedeployNonce with user-specified initialSavepointPath
    • Verifies controller skips savepoint triggering and transitions directly to TRANSITIONING_TO_GREEN
    • Verifies Green deployment uses the user-specified initialSavepointPath
    • Verifies successful completion to ACTIVE_GREEN
  • Updated existing tests and expanded existing tests to reflect new behavior

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? no (bug fix)
  • If yes, how is the feature documented? not applicable (fixes existing documented feature)

Copy link
Copy Markdown

@drossos drossos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a larger comment about how we could refactor this to potentially streamline the logic

Really like direction 👍

@james-kan-shopify james-kan-shopify force-pushed the flink-blue-green-restartsavepointnonce branch from 84834f0 to 4d3cc92 Compare January 13, 2026 06:26
@james-kan-shopify james-kan-shopify marked this pull request as ready for review January 14, 2026 02:56
@james-kan-shopify james-kan-shopify changed the title Flink blue green restartSavepointNonce behaviour alignment [FLINK-38867] [flink-kubernetes-operator] Fix restartSavepointNonce to respect initialSavepointPath in BlueGreenDeployments Jan 14, 2026
@james-kan-shopify james-kan-shopify changed the title [FLINK-38867] [flink-kubernetes-operator] Fix restartSavepointNonce to respect initialSavepointPath in BlueGreenDeployments [FLINK-38867] [Kubernetes Operator] Fix restartSavepointNonce to respect initialSavepointPath in BlueGreenDeployments Jan 14, 2026
Copy link
Copy Markdown

@drossos drossos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on gap with redeploying from stateless

FlinkBlueGreenDeploymentSpec spec2 = createBasicSpec();

// Change both top-level (configuration) and nested spec
// With new logic, only nested spec changes matter - setSavepointRedeployNonce triggers
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't specific to this code, but in our PR / JIRA in OSS we should highlight that for savepoint upgrade behaviour is not working as expected (loads from taken savepoint in transition and not specified one by initalSavepointPath)

@drossos drossos force-pushed the flink-blue-green-restartsavepointnonce branch from 0e4b463 to 5e48bdb Compare January 16, 2026 23:26
Comment on lines +352 to +355
// lastCheckpoint is null in two scenarios:
// 1. savepointRedeployNonce changed - user wants to redeploy from initialSavepointPath
// 2. upgradeMode is STATELESS - don't take savepoints (but use initialSavepointPath if
// set)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drossos we gotta revise this comment, likely drop the comment in the else case and move it up here to centralize it.

…estartSavepointNonce

Co-authored-by: Daniel Rossos <daniel.rossos@shopify.com>
@james-kan-shopify james-kan-shopify force-pushed the flink-blue-green-restartsavepointnonce branch from 5e48bdb to ed3508a Compare January 19, 2026 17:45
Copy link
Copy Markdown

@drossos drossos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (although co-authored a small part), good to deploy on our side

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants