[FLINK-39546][s3] Improve observability in flink-s3-fs-native by exposing operation-level S3 metrics #28427
Open
Samrat002 wants to merge 1 commit into
What is the purpose of the change
`flink-s3-fs-native` currently emits no metrics. When a job's checkpoints, savepoints, or sinks go through it, operators have no visibility into how Flink is actually talking to S3 (request volume, latency, throttling, or retries), which makes diagnosing slow or failing checkpoints largely guesswork.

This change makes the native S3 filesystem report operation-level S3 metrics into Flink's metric system. It does so by bridging the AWS SDK's built-in metrics SPI into Flink `Counter`/`Histogram` instruments, so every completed S3 API call is counted, timed, and classified.

More details: FLIP-576
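To make "counted, timed, and classified" concrete, here is a minimal, dependency-free sketch of that bookkeeping. It is not the PR's `AwsSdkMetricBridge`: the class name `S3CallMetricsSketch`, its methods, the `"other"` bucket, and the choice to treat HTTP 429 and 503 as `throttled` are all assumptions made for illustration, and plain JDK types stand in for Flink's `Counter`/`Histogram`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the bridge described above; names are illustrative only. */
final class S3CallMetricsSketch {
    // One counter per (op, status_class) pair, keyed as "op|status_class".
    private final Map<String, LongAdder> callCounts = new ConcurrentHashMap<>();
    // Bounded sliding window holding the most recent call durations.
    private final Deque<Long> recentDurationsMs = new ArrayDeque<>();
    private final int windowSize;

    S3CallMetricsSketch(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Maps an HTTP status to a label value. Assumption: 429 and 503 count as throttling. */
    static String statusClass(int httpStatus) {
        if (httpStatus == 429 || httpStatus == 503) {
            return "throttled";
        }
        if (httpStatus >= 200 && httpStatus < 300) {
            return "2xx";
        }
        if (httpStatus >= 400 && httpStatus < 500) {
            return "4xx";
        }
        if (httpStatus >= 500 && httpStatus < 600) {
            return "5xx";
        }
        return "other"; // assumption: catch-all bucket for anything else
    }

    /** Records one completed call: bump its counter and remember its duration. */
    void record(String op, int httpStatus, long durationMs) {
        String key = op + "|" + statusClass(httpStatus);
        callCounts.computeIfAbsent(key, k -> new LongAdder()).increment();
        synchronized (recentDurationsMs) {
            if (recentDurationsMs.size() == windowSize) {
                recentDurationsMs.removeFirst(); // evict the oldest sample
            }
            recentDurationsMs.addLast(durationMs);
        }
    }

    /** Number of recorded calls for the given operation and status class. */
    long count(String op, String statusClass) {
        LongAdder adder = callCounts.get(op + "|" + statusClass);
        return adder == null ? 0L : adder.sum();
    }

    /** Largest duration currently inside the window, or 0 if the window is empty. */
    long maxDurationMs() {
        synchronized (recentDurationsMs) {
            long max = 0L;
            for (long d : recentDurationsMs) {
                max = Math.max(max, d);
            }
            return max;
        }
    }
}
```

In this sketch, calling `record("GetObject", 200, 12)` increments the `GetObject`/`2xx` counter and adds 12 ms to the window. A real bridge would read the operation name, status, and duration from the SDK's metric records rather than take them as arguments.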
Brief change log
flink-core
- `MetricsAware` (`@PublicEvolving`, `org.apache.flink.core.plugin`): a `FileSystemFactory` implements it to be handed a `MetricGroup`.
- `FileSystem.attachMetrics(MetricGroup)` (`@Internal`): creates a `filesystem` child group and forwards it to every registered `MetricsAware` factory; resilient to a misbehaving factory and idempotent.
- `PluginFileSystemFactory` now implements `MetricsAware` and forwards `setMetricGroup` to the wrapped inner factory under the plugin classloader. Without this, plugin-loaded filesystems (the normal deployment mode) would silently never receive the group.

flink-runtime
- `ClusterEntrypoint` (JobManager) and `TaskManagerRunner` (TaskManager) call `FileSystem.attachMetrics(processMetricGroup)` during startup. The `ClusterEntrypoint` service-init order was adjusted so this runs before HA/blob services cache filesystem clients; otherwise those early clients would be created without a metric group.

flink-s3-fs-native
- `NativeS3FileSystemFactory` / `NativeS3AFileSystemFactory` implement `MetricsAware` and tag metrics with a `filesystem_type` label set to the scheme (`s3` vs `s3a`), so the two stay distinguishable.
- `AwsSdkMetricBridge` implements `software.amazon.awssdk.metrics.MetricPublisher` and translates each `MetricCollection` into Flink metrics: `api_call_count` (labels: `op`, `status_class`), `api_call_duration_ms` (histogram, label `op`), `throttle_count` (label `op`), `retry_count` (labels: `op`, `reason`).
- `S3MetricHistogram`: a bounded sliding-window histogram backing the duration metric.
- `S3ClientProvider` registers the publisher on the sync/async clients.
- Configuration options: `s3.metrics.enabled` (off by default), `s3.metrics.allowlist`, `s3.metrics.histogram.window-size`.

Verifying this change
This change added tests and can be verified as follows.
Automated tests
- `AwsSdkMetricBridgeTest`: translation of SDK records to Flink metrics; `status_class` classification (2xx/4xx/5xx/throttled); retry attribution; allowlist behavior (explicit list, `*` wildcard, empty → defaults).
- `S3MetricHistogramTest`: sliding-window statistics.
- `NativeS3FileSystemFactoryMetricsTest`: the `filesystem_type` label resolves to `s3`/`s3a` per factory.
- `FileSystemAttachMetricsTest` (flink-core): `attachMetrics` unwraps `PluginFileSystemFactory` to reach the real factory, skips non-`MetricsAware` factories, survives a throwing factory, and is idempotent.
- `NativeS3MetricsEmissionITCase`: MinIO via Testcontainers; real GET/HEAD/LIST round trips, asserting the counters/histograms are readable back through a real `MetricRegistry` (`MetricListener`). Auto-skips without Docker.

Manual end-to-end against real AWS S3
I ran a standalone cluster built from this branch with `s3.metrics.enabled: true` and the SLF4J reporter, and submitted a large-state streaming job checkpointing to `s3://<bucket>/checkpoints` (HashMap backend, filesystem checkpoint storage, 10 s interval). The native plugin loaded (`Plugin loader ... s3-fs-native`), built its client via the SDK default credential chain, and wrote real checkpoint objects to S3. The reporter then showed the metrics on both the TaskManager (data-plane writes) and the JobManager (checkpoint coordination / multipart):

Does this pull request potentially affect one of the following parts:
- `@Public(Evolving)`: no

Documentation
This change introduces the feature; follow-up documentation will come next.
Was generative AI tooling used to co-author this PR?