[FLINK-38787][Kubernetes Operator] Introduce Blue Green Deployment Metrics by james-kan-shopify · Pull Request #13 · Shopify/flink-kubernetes-operator

james-kan-shopify · 2025-12-12T03:40:18Z

What is the purpose of the change

Jira Issue: https://issues.apache.org/jira/browse/FLINK-38787

This pull request adds lifecycle metrics support for FlinkBlueGreenDeployment resources, enabling operators to monitor blue-green deployment state transitions and timing. This provides observability into the deployment pipeline, helping identify bottlenecks and track deployment health. The implementation closely mirrors the existing FlinkDeployment metrics implementation to ensure consistency.

Note: Most lines of code introduced are for test files!

Brief change log

Real-time State Distribution Tracking
- Namespace-level gauges showing current count of deployments in each blue-green state and Flink job status.
Lifecycle Transition Timing
- Histogram metrics measuring duration of key transitions (initial deployment, blue-to-green, green-to-blue) and time spent in each state, available at system and namespace levels.
Historical Failure Tracking
- Accumulating counter that increments on each transition to FAILING state, enabling failure rate calculation and long-term reliability monitoring.
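
To make the three metric kinds above concrete, here is a minimal sketch of the state-distribution gauge and the accumulating failure counter in plain Java. The class `BlueGreenStateStats`, its methods, and the trimmed `State` enum are hypothetical names for illustration only. They are not the classes introduced by this PR, and the real implementation registers its metrics through the operator's metric groups rather than plain fields.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the operator's actual classes. */
final class BlueGreenStateStats {

    /** Subset of the states named in this PR; the real enum may contain more. */
    enum State { ACTIVE_BLUE, SAVEPOINTING_BLUE, TRANSITIONING_TO_GREEN, ACTIVE_GREEN, FAILING }

    // Last observed state per resource name; the per-state gauge values are derived from it.
    private final Map<String, State> currentState = new HashMap<>();

    // Accumulating counter: never decremented, even after a resource recovers.
    private long failureCount;

    /** Records an observed state and counts a failure only on a transition into FAILING. */
    void onStateObserved(String resourceName, State newState) {
        State previous = currentState.put(resourceName, newState);
        if (newState == State.FAILING && previous != State.FAILING) {
            failureCount++;
        }
    }

    /** Gauge value: how many resources are currently in the given state. */
    long countInState(State state) {
        return currentState.values().stream().filter(s -> s == state).count();
    }

    /** Counter value: total number of transitions into FAILING seen so far. */
    long failureCount() {
        return failureCount;
    }
}
```

The gauge goes back down when a resource leaves a state, while the counter only grows, which is what makes it usable for failure-rate calculations over time.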

Verifying this change

This change added tests and can be verified as follows:

Added BlueGreenLifecycleMetricsTest to verify histogram creation, namespace isolation, and metric registration
Added BlueGreenResourceLifecycleMetricTrackerTest to verify state transition timing, rollback scenarios, and intermediate state recording
Tests cover initial deployment, blue-to-green transitions, green-to-blue transitions, and failed transition rollbacks

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., are there any changes to the CustomResourceDescriptors: no
Core observer or reconciler logic that is regularly executed: no

Documentation

Does this pull request introduce a new feature? yes
If yes, how is the feature documented? docs

james-kan-shopify · 2026-01-13T04:26:50Z

+                                // FlinkResourceListener doesn't have a specific method for
+                                // BlueGreen deployments yet, so we skip listener notifications
+                                // for now. Metrics will still be tracked via MetricManager.
+                            });
+                    // No audit logging for BlueGreen deployments yet


This is future work and not required for the current metrics PR, which is already large enough. A separate Jira ticket will be opened for it.

Let's remove this part then and open a new Jira ticket. What is the effect of having no audit log in practice? (I don't think it's a huge deal for an initial metrics exporter.)

james-kan-shopify · 2026-01-13T06:24:09Z

+        // Create the resource in the mock server before reconciling
+        kubernetesClient.resource(blueGreenDeployment).createOrReplace();


Addressed CI failures.

drossos

LGTM, just a few questions, plus one more general question:

If we were to add other metrics in the future to mirror the regular FlinkDeployment metrics more closely, which ones would be the most important?

Code wise though 👍

drossos · 2026-01-13T21:15:53Z

+                buildStateTimeHistograms(namespace));
+    }
+
+    private Map<String, List<Histogram>> buildTransitionHistograms(String namespace) {


I know we talked about the histograms that were seen in sandbox. Did we ever conclude whether they were actually accurate?

I don't think the histograms are essential for our use case, but they would be good for OSS.

Histograms for FlinkDeployment aren't leveraged for our use case either, but I think we can open up the metrics to visualize them. They could help us identify when things are taking quite a while (average trend, etc.). But yes, they are not critical for our immediate use case.

The histograms we saw in sandbox were always 0 because Observe shows histogram data as a rate by default. Since the values are only reported on a transition and, for the most part, we didn't transition much, the rate was always 0 (there was no change). Looking at the raw values afterwards showed that they captured the correct number of seconds.

No, this is not because of Observe. This is how the Prometheus metric exporter exports histograms: as summaries.

Please use Histograms when they make sense.

We need to fix this at some point: https://github.com/Shopify/streaming-compute/issues/44
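
For readers following the "always 0" discussion above, here is a small illustration of the effect described: if a value is only updated on a transition and then scraped repeatedly, a rate-style view plots the change between scrapes, which is 0, even though the raw value is correct. This is a simplified sketch in plain Java. The class `RateVsRawValue` and the sample numbers are made up for illustration, and it does not model how any particular exporter or dashboard computes rates.

```java
/** Hypothetical illustration; not part of the operator or of any metrics backend. */
final class RateVsRawValue {

    /** Per-interval change between consecutive scrapes, which is roughly what a rate view plots. */
    static long[] deltas(long[] scrapes) {
        long[] out = new long[Math.max(0, scrapes.length - 1)];
        for (int i = 1; i < scrapes.length; i++) {
            out[i - 1] = scrapes[i] - scrapes[i - 1];
        }
        return out;
    }

    public static void main(String[] args) {
        // One transition recorded 95 seconds; the same value is then scraped four times.
        long[] scrapes = {95, 95, 95, 95};
        for (long delta : deltas(scrapes)) {
            System.out.println("delta between scrapes: " + delta);
        }
        System.out.println("raw value: " + scrapes[scrapes.length - 1]);
    }
}
```

Under these assumptions every delta is 0 while the raw value stays at 95, which matches the observation that the raw numbers were right even though the default view showed nothing.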

drossos · 2026-01-13T21:18:44Z

+                                // FlinkResourceListener doesn't have a specific method for
+                                // BlueGreen deployments yet, so we skip listener notifications
+                                // for now. Metrics will still be tracked via MetricManager.
+                            });
+                    // No audit logging for BlueGreen deployments yet


Let's remove this part then and open a new Jira ticket. What is the effect of having no audit log in practice? (I don't think it's a huge deal for an initial metrics exporter.)

drossos · 2026-01-16T17:13:51Z

+ - BlueToGreen : Time from ACTIVE_BLUE to ACTIVE_GREEN (upgrade via savepoint and traffic switch)
+ - GreenToBlue : Time from ACTIVE_GREEN to ACTIVE_BLUE (upgrade via savepoint and traffic switch)
+
+State time metrics track how long a resource spends in each intermediate state (SAVEPOINTING_BLUE, TRANSITIONING_TO_GREEN, etc.), which helps identify bottlenecks in the deployment pipeline.


Maybe clarify the meaning of this sentence a bit.

drossos · 2026-01-16T19:35:51Z

+            return;
+        }
+
+        long durationSeconds = Duration.between(fromTimes.f0, time).toSeconds();


Shouldn't we be measuring from f1, to get the time difference between exiting our last state and completing the transition? Correct me if I am wrong here.

Including the stable state isn't really necessary at all. Bumping to f1.
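
To spell out the difference discussed in this thread, here is a small sketch comparing the two start points. It assumes, based on the comments above, that `f0` is the time the source state was first entered and `f1` is the last time the resource was still observed in that state. The `StateTimes` holder and the `TransitionTiming` class are hypothetical stand-ins, not the tracker code from this PR.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of the two candidate start points; not the PR's actual tracker. */
final class TransitionTiming {

    /** Stand-in for the (f0, f1) pair: assumed first-seen and last-seen times of a state. */
    static final class StateTimes {
        final Instant f0;
        final Instant f1;

        StateTimes(Instant f0, Instant f1) {
            this.f0 = f0;
            this.f1 = f1;
        }
    }

    /** Measures from state entry, so the whole stay in the stable source state is included. */
    static long secondsFromEntry(StateTimes from, Instant transitionEnd) {
        return Duration.between(from.f0, transitionEnd).toSeconds();
    }

    /** Measures from the last observation of the source state, i.e. only the transition itself. */
    static long secondsFromLastUpdate(StateTimes from, Instant transitionEnd) {
        return Duration.between(from.f1, transitionEnd).toSeconds();
    }
}
```

For example, if a deployment sat in ACTIVE_BLUE for an hour before a five-minute switch to green, `secondsFromEntry` reports 3900 seconds and `secondsFromLastUpdate` reports 300, which is why measuring from `f1` better reflects the transition duration.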

drossos

Everything LGTM apart from a few comments / clarifications. Not included in this review: I'm currently trying out a failure metrics gauge to see whether we want to add it.

drossos

LGTM and new metrics all tested and working with dashboard 👍

james-kan-shopify changed the base branch from dr.bg-metrics to main January 6, 2026 01:22

james-kan-shopify changed the base branch from main to dr.bg-metrics January 6, 2026 01:24

james-kan-shopify changed the base branch from dr.bg-metrics to main January 6, 2026 08:16

james-kan-shopify changed the base branch from main to dr.bg-metrics January 6, 2026 08:25

james-kan-shopify changed the base branch from dr.bg-metrics to main January 7, 2026 00:57

james-kan-shopify changed the title ~~Expand Blue Green Metrics to Include Lifecycle Metrics~~ [FLINK-38787][Kubernetes Operator] Introduce Blue Green Deployment Metrics Jan 13, 2026

james-kan-shopify commented Jan 13, 2026

james-kan-shopify requested a review from drossos January 13, 2026 04:31

james-kan-shopify commented Jan 13, 2026

james-kan-shopify marked this pull request as ready for review January 13, 2026 20:11

drossos reviewed Jan 13, 2026

drossos reviewed Jan 16, 2026

james-kan-shopify force-pushed the jk.bg-metrics branch from f62be4c to 7f30663 Compare January 18, 2026 19:22

[FLINK-38787] Introduce FlinkBlueGreenDeployment Metrics

239e5cd

james-kan-shopify force-pushed the jk.bg-metrics branch from 7f30663 to 239e5cd Compare January 19, 2026 18:57

drossos approved these changes Jan 23, 2026

james-kan-shopify mentioned this pull request Jan 23, 2026

Flink bg merged #16

Draft

3 participants