[FLINK-38787][Kubernetes Operator] Introduce Blue Green Deployment Metrics#13
[FLINK-38787][Kubernetes Operator] Introduce Blue Green Deployment Metrics#13james-kan-shopify wants to merge 1 commit into
Conversation
| // FlinkResourceListener doesn't have a specific method for | ||
| // BlueGreen deployments yet, so we skip listener notifications | ||
| // for now. Metrics will still be tracked via MetricManager. | ||
| }); | ||
| // No audit logging for BlueGreen deployments yet |
There was a problem hiding this comment.
This would be for future work. Not required for the current metric PR which is already large enough. A separate JIRA ticket would be opened for this.
There was a problem hiding this comment.
Lets remove this part then and open up a new Jira ticket. What is the effect of no audit log in practise? (i don't think its huge deal for an inital metrics exporter)
| // Create the resource in the mock server before reconciling | ||
| kubernetesClient.resource(blueGreenDeployment).createOrReplace(); |
There was a problem hiding this comment.
Addressed CI failures.
drossos
left a comment
There was a problem hiding this comment.
LGTM just a few questions with a more general question of:
If we were to add on other metrics in the future to mirror closer the regular FlinkDeployment metrics, what would the most prevalent ones be?
Code wise though 👍
| buildStateTimeHistograms(namespace)); | ||
| } | ||
|
|
||
| private Map<String, List<Histogram>> buildTransitionHistograms(String namespace) { |
There was a problem hiding this comment.
I know we conversed about the histograms that were seen in sandbox, did we ever conclude if they were actually accurate
There was a problem hiding this comment.
I don't think the histograms are essential for our use case, but would be good for OSS
There was a problem hiding this comment.
Histograms for FlinkDeployment also aren't leveraged for our use case. But i think we can open metrics to visualize them. They can be of use to us to identify if things are taking quite a while (average trend) ...etc. But yes not critical for our immediate use case.
Histograms we saw in sandbox were always 0 because the default way Observe showed data for HIstogram was to demonstrate it as a rate. Since the values were reported upon transition and for the most part, we didn't transition much, they were always 0 (since there was no change). Looking at the raw value afterwards showed that it was capturing the right amount of seconds.
There was a problem hiding this comment.
No this is not because of Observe, this is how prometheus metric exporter exports histograms as summaries.
Please use Histograms when they make sense.
We need to fix this at some point: https://github.com/Shopify/streaming-compute/issues/44
| // FlinkResourceListener doesn't have a specific method for | ||
| // BlueGreen deployments yet, so we skip listener notifications | ||
| // for now. Metrics will still be tracked via MetricManager. | ||
| }); | ||
| // No audit logging for BlueGreen deployments yet |
There was a problem hiding this comment.
Lets remove this part then and open up a new Jira ticket. What is the effect of no audit log in practise? (i don't think its huge deal for an inital metrics exporter)
| - BlueToGreen : Time from ACTIVE_BLUE to ACTIVE_GREEN (upgrade via savepoint and traffic switch) | ||
| - GreenToBlue : Time from ACTIVE_GREEN to ACTIVE_BLUE (upgrade via savepoint and traffic switch) | ||
|
|
||
| State time metrics track how long a resource spends in each intermediate state (SAVEPOINTING_BLUE, TRANSITIONING_TO_GREEN, etc.), which helps identify bottlenecks in the deployment pipeline. |
There was a problem hiding this comment.
Maybe clarify this sentence a bit on meaning
| return; | ||
| } | ||
|
|
||
| long durationSeconds = Duration.between(fromTimes.f0, time).toSeconds(); |
There was a problem hiding this comment.
Shouldn't we be getting time between f.1 to get the time difference between exiting our last state and time to complete transition? Correct me if I am wrong here
There was a problem hiding this comment.
including stable state isn't really necessary at all. Bumping to f1.
drossos
left a comment
There was a problem hiding this comment.
Everything LGTM sans a few comments / clarifications. Not included in review but currently trying out failure metrics gauge to see if we wanted to add
f62be4c to
7f30663
Compare
7f30663 to
239e5cd
Compare
drossos
left a comment
There was a problem hiding this comment.
LGTM and new metrics all tested and working with dashboard 👍
What is the purpose of the change
Jira Issue: https://issues.apache.org/jira/browse/FLINK-38787
This pull request adds lifecycle metrics support for
FlinkBlueGreenDeploymentresources, enabling operators to monitor blue-green deployment state transitions and timing. This provides observability into the deployment pipeline, helping identify bottlenecks and track deployment health. The implementation heavily mirrors the existing FlinkDeployment metrics' implementation to ensure consistency.Note: Most lines of code introduced are for test files!
Brief change log
Real-time State Distribution Tracking
Lifecycle Transition Timing
Historical Failure Tracking
Verifying this change
This change added tests and can be verified as follows:
Added
BlueGreenLifecycleMetricsTestto verify histogram creation, namespace isolation, and metric registrationAdded
BlueGreenResourceLifecycleMetricTrackerTestto verify state transition timing, rollback scenarios, and intermediate state recordingTests cover initial deployment, blue-to-green transitions, green-to-blue transitions, and failed transition rollbacks
Does this pull request potentially affect one of the following parts:
Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changes to the
CustomResourceDescriptors: noCore observer or reconciler logic that is regularly executed: no
Documentation
Does this pull request introduce a new feature? yes
If yes, how is the feature documented? docs