Added BlueGreen ingress that switches between active Svc + resolve path conflict on Blue and Green deployment ingresses #11

Open
drossos wants to merge 2 commits into main from dr.bg-ingress

Conversation

@drossos drossos commented Nov 4, 2025

Closes https://github.com/Shopify/streaming-compute/issues/667

Overview

Adds an ingress at the FlinkBlueGreenDeployment layer that automatically switches to whichever FlinkDeployment service is currently active.

The switch happens in the finalization step of a transition: only after the Flink deployment has transitioned to its new active state is the ingress switched over to the new service. This also keeps the ingress always alive, because it only ever points at an active, running Flink pipeline.
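
To make the ordering concrete, here is a minimal, self-contained sketch of "switch only at finalization". It is not the operator's code: `StableIngress`, `restServiceName`, `finalizeTransition` and the `<parent>-<colour>-rest` naming scheme are all hypothetical stand-ins chosen for illustration.

```java
import java.util.Locale;

/** Illustrative model only; none of these types are the operator's real classes. */
final class IngressSwitchSketch {
    enum DeploymentType { BLUE, GREEN }

    /** Stand-in for the stable ingress: it only records which service it routes to. */
    static final class StableIngress {
        private String backendService;

        String backendService() {
            return backendService;
        }

        void pointAt(String serviceName) {
            this.backendService = serviceName;
        }
    }

    /** Assumed naming scheme for the per-colour REST service (hypothetical). */
    static String restServiceName(String parentName, DeploymentType type) {
        return parentName + "-" + type.name().toLowerCase(Locale.ROOT) + "-rest";
    }

    /**
     * Meant to be called only once a transition has been finalized, so the
     * ingress is never pointed at a deployment that is still starting up.
     */
    static void finalizeTransition(
            StableIngress ingress, String parentName, DeploymentType nowActive) {
        ingress.pointAt(restServiceName(parentName, nowActive));
    }

    public static void main(String[] args) {
        StableIngress ingress = new StableIngress();

        // First deployment: Blue becomes active, the ingress follows.
        finalizeTransition(ingress, "basic-bg-stateless-example", DeploymentType.BLUE);
        System.out.println(ingress.backendService());

        // While Green is coming up nothing is called, so traffic stays on Blue.
        // Only after Green is active is the ingress repointed.
        finalizeTransition(ingress, "basic-bg-stateless-example", DeploymentType.GREEN);
        System.out.println(ingress.backendService());
    }
}
```

The point of the sketch is that the ingress has exactly one write path, and that path is reached only after the new deployment is active.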

This FlinkBlueGreenDeployment ingress spec is separate from the ingress definition in the FlinkDeployment template spec; it sits at the top level of the spec, as shown in the manifest below. This means you can define both, neither, or just one of the FlinkDeployment and FlinkBlueGreenDeployment ingresses.

apiVersion: flink.apache.org/v1beta1
kind: FlinkBlueGreenDeployment
metadata:
  name: basic-bg-stateless-example
spec:
  configuration:
    kubernetes.operator.bluegreen.deployment-deletion.delay: "2s"
  ingress:
    labels:
      app: basic-bg-stateless-example-controller
    template: sandbox-sql-bluegreensample.yd-playground.shopifycloud.com
  template:
    spec:
      ingress:
        labels:
          app: basic-bg-stateless-example
        template: sandbox-sql-bluegreensample.yd-playground.shopifycloud.com
      image: docker.io/library/flink:1.20
      flinkVersion: v1_20
      flinkConfiguration:
        rest.port: "8081"
. . . etc . . .

Note

  • This does not affect any existing lifecycle / functionality of existing FlinkDeployment code and only uses structures already available to the FlinkBlueGreenDeployment
  • Because of this, we do not log a Kubernetes event if there is an issue reconciling the BlueGreenIngress as we do with the other ingress reconciliation functions
  • Introduces new fields to the FlinkBlueGreenDeployment CRD to allow for this Ingress field at the top layer

Testing

These changes are deployed in our sandbox environment and can be tested there: the yd-playground-sandbox-sql-bluegreensample namespace has a FlinkBlueGreenDeployment on which you can trigger a blue-green deployment to see the ingress switching behaviour.

This can also be tested in minikube using the following FlinkBlueGreenDeployment manifest:

apiVersion: flink.apache.org/v1beta1
kind: FlinkBlueGreenDeployment
metadata:
  name: basic-bg-stateless-example
spec:
  configuration:
    kubernetes.operator.bluegreen.deployment-deletion.delay: "2s"
  template:
    spec:
      image: flink:1.20
      flinkVersion: v1_20
      flinkConfiguration:
        rest.port: "8081"
        taskmanager.numberOfTaskSlots: "1"
      serviceAccount: flink
      jobManager:
        resource:
          memory: 1G
          cpu: 1
      taskManager:
        resource:
          memory: 2G
          cpu: 1
      job:
        jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
        parallelism: 1
        entryClass: org.apache.flink.streaming.examples.statemachine.StateMachineExample
        args:
          - "--error-rate"
          - "0.15"
          - "--sleep"
          - "30"
        upgradeMode: stateless
      mode: native

@drossos drossos marked this pull request as ready for review December 1, 2025 21:55

@james-kan-shopify james-kan-shopify left a comment

Should there be some unit tests for reconcileBlueGreenIngress?

Comment on lines +135 to +136
var flinkBlueGreenDeploymentSpec = context.getBgDeployment().getSpec();
var objectMeta = context.getBgDeployment().getMetadata();

do we know the type we should be getting here?

Author

We do. I was mostly following the pattern from the other IngressUtils reconcile methods, but I can add the explicit FlinkBlueGreenDeploymentSpec type.

Comment on lines +640 to +645
IngressUtils.reconcileBlueGreenIngress(
        blueGreenContext,
        flinkResourceContext.getOperatorConfig().isManageIngress(),
        activeDeployment,
        flinkResourceContext.getDeployConfig(activeDeployment.getSpec()),
        blueGreenContext.getJosdkContext());

If any part of this throws an exception, such as a NullPointerException, is that going to be a problem? Do we need to catch and handle it?

Author

I think the only retrying happens at the Kubernetes client level, i.e. any write / update it attempts will be retried. Any other operator exceptions propagate up and are thrown at the top level, in the FlinkBlueGreenDeploymentController.reconcile() method.

I will look into this a bit more to confirm, but it seems to be the pattern used by the operator.

Comment on lines +353 to +363
// Update Ingress template if exists to prevent path collision between Blue and Green
if (flinkDeployment.getSpec().getIngress() != null) {
    flinkDeployment
            .getSpec()
            .getIngress()
            .setTemplate(
                    blueGreenDeploymentType.toString().toLowerCase()
                            + "-"
                            + flinkDeployment.getSpec().getIngress().getTemplate());
}

Nit:

IngressSpec ingress = flinkDeployment.getSpec().getIngress();
if (ingress != null) {
    ingress.setTemplate(blueGreenDeploymentType.name().toLowerCase() + "-" + ingress.getTemplate());
}
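
For reference, the prefixing rule discussed here can be exercised in isolation. The sketch below is standalone: the enum is a local stand-in for the operator's `BlueGreenDeploymentType`, and `prefixed` is a hypothetical helper name. It uses `name()` as in the nit above, plus `Locale.ROOT`, which is a general habit for locale-independent lowercasing rather than something the PR requires.

```java
import java.util.Locale;

final class TemplatePrefixSketch {
    /** Local stand-in for the operator's enum; only the two colours matter here. */
    enum BlueGreenDeploymentType { BLUE, GREEN }

    /**
     * Prefixes the ingress template with the lowercase colour so the Blue and
     * Green child ingresses never resolve to the same host/path.
     */
    static String prefixed(BlueGreenDeploymentType type, String template) {
        return type.name().toLowerCase(Locale.ROOT) + "-" + template;
    }

    public static void main(String[] args) {
        String template = "sandbox-sql-bluegreensample.yd-playground.shopifycloud.com";
        // prints "blue-sandbox-sql-bluegreensample.yd-playground.shopifycloud.com"
        System.out.println(prefixed(BlueGreenDeploymentType.BLUE, template));
        // prints "green-sandbox-sql-bluegreensample.yd-playground.shopifycloud.com"
        System.out.println(prefixed(BlueGreenDeploymentType.GREEN, template));
    }
}
```

With this rule the un-prefixed template stays free for the top-level FlinkBlueGreenDeployment ingress, which is what resolves the path conflict.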

Comment on lines +151 to +159
Optional<? extends HasMetadata> ingress;
if (ingressInNetworkingV1(client.getClient())) {
    ingress =
            client.getSecondaryResource(
                    io.fabric8.kubernetes.api.model.networking.v1.Ingress.class);
} else {
    ingress = client.getSecondaryResource(Ingress.class);
}
ingress.ifPresent(i -> client.getClient().resource(i).delete());

This is for cleaning up and deleting the stable ingress. Can we ever get into a scenario where, on the next deployment, the ingress has been removed from the spec but operatorManagedIngress is now true? Would that leave the ingress orphaned?
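
One way to reason about this question is to enumerate the flag combinations. The sketch below encodes an assumed policy, not the behaviour of this PR: `decide`, `leavesOrphan` and the `Action` enum are hypothetical, and the rule "do nothing when the operator does not manage ingress" is an assumption made for the sake of the table.

```java
final class IngressCleanupSketch {
    enum Action { APPLY, DELETE, LEAVE_ALONE }

    /** Assumed policy (hypothetical): the operator only touches the ingress when it manages it. */
    static Action decide(boolean operatorManagesIngress, boolean specHasIngress, boolean ingressExists) {
        if (!operatorManagesIngress) {
            return Action.LEAVE_ALONE;
        }
        if (specHasIngress) {
            return Action.APPLY;
        }
        return ingressExists ? Action.DELETE : Action.LEAVE_ALONE;
    }

    /** "Orphaned" here means: it exists, the spec no longer asks for it, and nothing deletes it. */
    static boolean leavesOrphan(boolean operatorManagesIngress, boolean specHasIngress, boolean ingressExists) {
        return ingressExists
                && !specHasIngress
                && decide(operatorManagesIngress, specHasIngress, ingressExists) != Action.DELETE;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        // Print every combination so the orphaning cases are easy to spot.
        for (boolean manages : flags) {
            for (boolean specHas : flags) {
                for (boolean exists : flags) {
                    System.out.printf(
                            "manages=%b specHasIngress=%b exists=%b -> %s orphan=%b%n",
                            manages, specHas, exists,
                            decide(manages, specHas, exists),
                            leavesOrphan(manages, specHas, exists));
                }
            }
        }
    }
}
```

Under this assumed policy the only orphaning combination is an existing ingress, no ingress in the spec, and ingress management turned off. Whether the real reconcile path behaves the same way is exactly what the question above is asking.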

@JsonProperty("configuration")
private Map<String, String> configuration;

@Nullable private IngressSpec ingress;

Is the ingress field tracked by FlinkBlueGreenDeploymentSpecDiff? If this field is changed on its own, it will only be picked up when the next deploy occurs via another change that is tracked. Just checking that's the intended behaviour, or did we want to trigger a blue-green deploy immediately?

Author

That is a good point. Let me investigate this a bit more and test to see what would be the most sensible flow here.

Author

@drossos drossos Dec 23, 2025

I explored having a top-level "parent-diff" that would update the parent ingress without requiring a blue-green transition. In the end I didn't love the number of edge cases and the complexity it introduced, especially since the current precedent for blue-green configs that live outside the template is that they only take effect on a blue-green transition.

I am going to leave as is for now (requires blue-green transition to modify parent ingress) and see what maintainers think would be best

Author

Ignore the above. It was bugging me too much and seemed like bad design, so I added a preliminary ingress reconciliation to allow in-place blue-green ingress updates.
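
As a rough illustration of what an in-place check like this can look like, the sketch below compares the last-applied parent ingress spec with the desired one and reports whether the ingress should be re-applied without starting a transition. It is an assumption-laden model: `IngressSpec` here is a minimal local stand-in whose fields mirror the manifest, and `needsInPlaceUpdate` is a hypothetical name, not a method from the PR.

```java
import java.util.Map;
import java.util.Objects;

final class PrelimIngressReconcileSketch {
    /** Minimal stand-in for the ingress part of the spec; fields mirror the manifest. */
    static final class IngressSpec {
        final String template;
        final Map<String, String> labels;

        IngressSpec(String template, Map<String, String> labels) {
            this.template = template;
            this.labels = labels;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof IngressSpec)) {
                return false;
            }
            IngressSpec other = (IngressSpec) o;
            return Objects.equals(template, other.template)
                    && Objects.equals(labels, other.labels);
        }

        @Override
        public int hashCode() {
            return Objects.hash(template, labels);
        }
    }

    /**
     * True when the parent ingress differs from what was last applied, including
     * the cases where it was newly added or removed (either side null).
     */
    static boolean needsInPlaceUpdate(IngressSpec lastApplied, IngressSpec desired) {
        return !Objects.equals(lastApplied, desired);
    }

    public static void main(String[] args) {
        IngressSpec applied = new IngressSpec("old.example.com", Map.of("app", "demo"));
        IngressSpec desired = new IngressSpec("new.example.com", Map.of("app", "demo"));

        System.out.println(needsInPlaceUpdate(applied, desired)); // host changed
        System.out.println(needsInPlaceUpdate(desired, desired)); // nothing to do
        System.out.println(needsInPlaceUpdate(applied, null));    // ingress removed from spec
    }
}
```

Keeping this check separate from the template diff means an ingress-only edit does not have to trigger a full blue-green transition.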

Author

drossos commented Dec 12, 2025

> Should there be some unit tests for reconcileBlueGreenIngress?

Writing those up now. I forgot to add them when I flipped the PR to green.

return new FlinkBlueGreenDeploymentSpec(configuration, null, flinkDeploymentTemplateSpec);
}

// ==================== Ingress Rotation Tests ====================

I don't think we have any tests covering ingress deletion. Should we add some?

@james-kan-shopify

overall looks good, can approve, just wondering if you wanted to cover the tests around ingress deletion?

@james-kan-shopify james-kan-shopify left a comment

Reviewed a few weeks ago; looks good, and the PR has been opened in open source for review.
