[FLINK-38787][Kubernetes Operator] Introduce Blue Green Deployment Metrics #13

Open
james-kan-shopify wants to merge 1 commit into main from jk.bg-metrics

@james-kan-shopify james-kan-shopify commented Dec 12, 2025

What is the purpose of the change

Jira Issue: https://issues.apache.org/jira/browse/FLINK-38787

This pull request adds lifecycle metrics support for FlinkBlueGreenDeployment resources, enabling operators to monitor blue-green deployment state transitions and timing. This provides observability into the deployment pipeline, helping identify bottlenecks and track deployment health. The implementation closely mirrors the existing FlinkDeployment metrics implementation to ensure consistency.

Note: Most lines of code introduced are for test files!

Brief change log

  1. Real-time State Distribution Tracking

    • Namespace-level gauges showing current count of deployments in each blue-green state and Flink job status.
  2. Lifecycle Transition Timing

    • Histogram metrics measuring duration of key transitions (initial deployment, blue-to-green, green-to-blue) and time spent in each state, available at system and namespace levels.
  3. Historical Failure Tracking

    • Accumulating counter that increments on each transition to FAILING state, enabling failure rate calculation and long-term reliability monitoring.
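
The difference between the state-distribution gauges and the accumulating failure counter can be sketched in plain Java. This is only an illustration under assumptions, not the code in this PR: the class `BlueGreenMetricsSketch`, its methods, and the subset of states in the enum are hypothetical, and the real implementation would register its metrics with the operator's metric groups rather than keep them in maps.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Stdlib-only sketch; all names here are hypothetical and not taken from the PR. */
public class BlueGreenMetricsSketch {

    /** Assumed subset of blue-green states mentioned in this PR. */
    public enum State {
        ACTIVE_BLUE, SAVEPOINTING_BLUE, TRANSITIONING_TO_GREEN, ACTIVE_GREEN, FAILING
    }

    // Gauge-like view: how many deployments are in each state right now.
    private final Map<State, Integer> stateCounts = new EnumMap<>(State.class);
    // Counter: never decreases; incremented once per transition into FAILING.
    private final AtomicLong failureCount = new AtomicLong();
    // Last known state per deployment name.
    private final Map<String, State> current = new HashMap<>();

    public void onStateChange(String deployment, State next) {
        State prev = current.put(deployment, next);
        if (prev == next) {
            return; // Same state observed again: nothing to update.
        }
        if (prev != null) {
            stateCounts.merge(prev, -1, Integer::sum);
        }
        stateCounts.merge(next, 1, Integer::sum);
        if (next == State.FAILING) {
            failureCount.incrementAndGet();
        }
    }

    public int countIn(State state) {
        return stateCounts.getOrDefault(state, 0);
    }

    public long failures() {
        return failureCount.get();
    }
}
```

The gauge value goes down again when a deployment leaves a state, while the failure counter keeps its history, which is what makes a failure rate computable over time.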

Verifying this change

This change added tests and can be verified as follows:

  • Added BlueGreenLifecycleMetricsTest to verify histogram creation, namespace isolation, and metric registration

  • Added BlueGreenResourceLifecycleMetricTrackerTest to verify state transition timing, rollback scenarios, and intermediate state recording

  • Tests cover initial deployment, blue-to-green transitions, green-to-blue transitions, and failed transition rollbacks

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no

  • The public API, i.e., are there any changes to the CustomResourceDescriptors: no

  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? yes

  • If yes, how is the feature documented? docs

@james-kan-shopify james-kan-shopify changed the base branch from dr.bg-metrics to main January 6, 2026 01:22
@james-kan-shopify james-kan-shopify changed the base branch from main to dr.bg-metrics January 6, 2026 01:24
@james-kan-shopify james-kan-shopify changed the base branch from dr.bg-metrics to main January 6, 2026 08:16
@james-kan-shopify james-kan-shopify changed the base branch from main to dr.bg-metrics January 6, 2026 08:25
@james-kan-shopify james-kan-shopify changed the base branch from dr.bg-metrics to main January 7, 2026 00:57
@james-kan-shopify james-kan-shopify changed the title Expand Blue Green Metrics to Include Lifecycle Metrics [FLINK-38787][Kubernetes Operator] Introduce Blue Green Deployment Metrics Jan 13, 2026
Comment on lines +317 to +321
// FlinkResourceListener doesn't have a specific method for
// BlueGreen deployments yet, so we skip listener notifications
// for now. Metrics will still be tracked via MetricManager.
});
// No audit logging for BlueGreen deployments yet
Author

@james-kan-shopify james-kan-shopify Jan 13, 2026

This is future work. It is not required for the current metrics PR, which is already large enough. A separate JIRA ticket will be opened for it.

Let's remove this part then and open a new Jira ticket. What is the effect of having no audit log in practice? (I don't think it's a huge deal for an initial metrics exporter.)

Comment on lines +956 to +957
// Create the resource in the mock server before reconciling
kubernetesClient.resource(blueGreenDeployment).createOrReplace();
Author

Addressed CI failures.

@james-kan-shopify james-kan-shopify marked this pull request as ready for review January 13, 2026 20:11

@drossos drossos left a comment

LGTM just a few questions with a more general question of:

If we were to add other metrics in the future to mirror the regular FlinkDeployment metrics more closely, which would be the most important ones?

Code wise though 👍

buildStateTimeHistograms(namespace));
}

private Map<String, List<Histogram>> buildTransitionHistograms(String namespace) {

I know we talked about the histograms that were seen in sandbox. Did we ever conclude whether they were actually accurate?

I don't think the histograms are essential for our use case, but would be good for OSS

Author

Histograms for FlinkDeployment also aren't leveraged for our use case, but I think we can open up the metrics to visualize them. They can help us identify whether things are taking a long time (average trend), etc. But yes, not critical for our immediate use case.

The histograms we saw in sandbox were always 0 because, by default, Observe displays histogram data as a rate. Since the values were reported upon transition and, for the most part, we didn't transition much, they were always 0 (there was no change). Looking at the raw value afterwards showed that it was capturing the right number of seconds.


@ryanvanhuuksloot ryanvanhuuksloot Jan 13, 2026

No, this is not because of Observe. It is because of how the Prometheus metric exporter exports histograms as summaries.

Please use Histograms when they make sense.

We need to fix this at some point: https://github.com/Shopify/streaming-compute/issues/44
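
Whichever layer is responsible, the symptom discussed in this thread (a rate-style view reading 0 while the raw value is correct) can be reproduced with simple arithmetic. The sketch below assumes a reported value that only changes when a transition happens and a fixed scrape interval; `RateVersusRawSketch` and `ratePerSecond` are hypothetical names for illustration, not part of any exporter or dashboard API.

```java
/** Stdlib-only sketch; names and numbers are illustrative assumptions. */
public class RateVersusRawSketch {

    /** Per-second change between two consecutive scrapes of the same series. */
    public static double ratePerSecond(double earlier, double later, long secondsApart) {
        if (secondsApart <= 0) {
            throw new IllegalArgumentException("secondsApart must be positive");
        }
        return (later - earlier) / secondsApart;
    }

    /** The most recent scraped value, i.e. what a raw (non-rate) view would show. */
    public static double latestRaw(double[] scrapes) {
        return scrapes[scrapes.length - 1];
    }
}
```

For example, if a transition duration of 42 seconds is recorded once and then scraped every 30 seconds, consecutive scrapes are both 42: `ratePerSecond(42, 42, 30)` is 0, while `latestRaw` still returns 42.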


- BlueToGreen : Time from ACTIVE_BLUE to ACTIVE_GREEN (upgrade via savepoint and traffic switch)
- GreenToBlue : Time from ACTIVE_GREEN to ACTIVE_BLUE (upgrade via savepoint and traffic switch)

State time metrics track how long a resource spends in each intermediate state (SAVEPOINTING_BLUE, TRANSITIONING_TO_GREEN, etc.), which helps identify bottlenecks in the deployment pipeline.

Maybe clarify the meaning of this sentence a bit.

return;
}

long durationSeconds = Duration.between(fromTimes.f0, time).toSeconds();

Shouldn't we be measuring from f1, to get the time between exiting our last state and completing the transition? Correct me if I am wrong here.

Author

Including the stable state isn't really necessary. Bumping to f1.
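
The difference between the two choices can be seen in a small sketch. It assumes that `f0` holds the time the previous stable state was entered and `f1` the time that state was last observed before the transition began. That reading is inferred from this thread rather than confirmed by the code shown, and `TransitionTimingSketch` and `StateTimes` are hypothetical stand-ins for the tracker and its tuple.

```java
import java.time.Duration;
import java.time.Instant;

/** Stdlib-only sketch; class and field meanings are assumptions, not the PR's code. */
public class TransitionTimingSketch {

    /** Hypothetical stand-in for the (f0, f1) pair kept per state. */
    public static final class StateTimes {
        public final Instant f0; // assumed: when the state was entered
        public final Instant f1; // assumed: when the state was last observed

        public StateTimes(Instant f0, Instant f1) {
            this.f0 = f0;
            this.f1 = f1;
        }
    }

    /** Includes the whole time spent sitting in the stable state. */
    public static long secondsFromEntry(StateTimes from, Instant time) {
        return Duration.between(from.f0, time).toSeconds();
    }

    /** Measures only from leaving the stable state until the transition completes. */
    public static long secondsFromLastSeen(StateTimes from, Instant time) {
        return Duration.between(from.f1, time).toSeconds();
    }
}
```

Under these assumptions, a deployment that sat in ACTIVE_BLUE for an hour and then took 90 seconds to reach ACTIVE_GREEN would report 3690 seconds when measured from `f0` but 90 seconds when measured from `f1`.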


@drossos drossos left a comment

Everything LGTM sans a few comments / clarifications. Not included in the review, but I am currently trying out a failure metrics gauge to see if we want to add it.


@drossos drossos left a comment

LGTM and new metrics all tested and working with dashboard 👍


3 participants