
@qiref qiref commented Dec 18, 2025

What is the purpose of the change

When a Flink job starts without any checkpoint data, the DescriptiveStatisticsHistogramStatistics may be initialized with an empty array. The maybeInitPercentile() method checked only for null, not for empty arrays, so Apache Commons Math's Percentile.evaluate() threw a NullArgumentException when the checkpoint statistics page was opened in the Web UI.

org.apache.commons.math3.exception.NullArgumentException: input array
	at org.apache.commons.math3.util.MathArrays.verifyValues(MathArrays.java:1753) ~[commons-math3-3.6.1.jar:3.6.1]
	at org.apache.commons.math3.stat.descriptive.AbstractUnivariateStatistic.test(AbstractUnivariateStatistic.java:158) ~[commons-math3-3.6.1.jar:3.6.1]
	at org.apache.commons.math3.stat.descriptive.rank.Percentile.evaluate(Percentile.java:272) ~[commons-math3-3.6.1.jar:3.6.1]
	at org.apache.commons.math3.stat.descriptive.rank.Percentile.evaluate(Percentile.java:241) ~[commons-math3-3.6.1.jar:3.6.1]
	at org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogramStatistics$CommonMetricsSnapshot.getPercentile(DescriptiveStatisticsHistogramStatistics.java:163) ~[classes/:?]
	at org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogramStatistics.getQuantile(DescriptiveStatisticsHistogramStatistics.java:57) ~[classes/:?]
	at org.apache.flink.runtime.checkpoint.StatsSummarySnapshot.getQuantile(StatsSummarySnapshot.java:108) ~[classes/:?]
	at org.apache.flink.runtime.rest.messages.checkpoints.StatsSummaryDto.valueOf(StatsSummaryDto.java:81) ~[classes/:?]
	at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.createCheckpointingStatistics(CheckpointingStatisticsHandler.java:128) ~[classes/:?]
	at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleCheckpointStatsRequest(CheckpointingStatisticsHandler.java:85) ~[classes/:?]
	at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.handleCheckpointStatsRequest(CheckpointingStatisticsHandler.java:59) ~[classes/:?]
	at org.apache.flink.runtime.rest.handler.job.checkpoints.AbstractCheckpointStatsHandler.lambda$handleRequest$1(AbstractCheckpointStatsHandler.java:89) ~[classes/:?]
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) [?:1.8.0_432]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) [?:1.8.0_432]
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) [?:1.8.0_432]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_432]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_432]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_432]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_432]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_432]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_432]
	at java.lang.Thread.run(Thread.java:750) [?:1.8.0_432]

Brief change log

This commit adds an empty-array check to the maybeInitPercentile() method. When the data array is null or empty, the method now falls back to a default array {0.0}, which avoids the exception and gives every percentile calculation a reasonable default value.
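
The guard described above can be sketched in isolation as follows. This is a minimal illustration and not the code from this PR: the class name EmptyDataGuardSketch, the helper dataOrDefault, and the nearest-rank percentile method are all hypothetical stand-ins, and the percentile here is deliberately simpler than the estimator in Commons Math's Percentile. Only the condition (data != null && data.length > 0) and the {0.0} fallback are taken from the description.

```java
import java.util.Arrays;

public class EmptyDataGuardSketch {

    /** Fallback samples used when there is no data; mirrors the {0.0} default from the change log. */
    private static final double[] DEFAULT_DATA = {0.0};

    /** Returns the given samples, or a copy of the default when they are null or empty. */
    static double[] dataOrDefault(double[] data) {
        return (data != null && data.length > 0) ? data : DEFAULT_DATA.clone();
    }

    /**
     * Hypothetical stand-in percentile using the nearest-rank method.
     * This is NOT the Commons Math estimator; p is expected in (0, 100].
     */
    static double percentile(double[] data, double p) {
        double[] sorted = dataOrDefault(data).clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        System.out.println(percentile(null, 50));                    // null input falls back to {0.0}
        System.out.println(percentile(new double[0], 99));           // empty input falls back to {0.0}
        System.out.println(percentile(new double[] {3, 1, 2}, 50));  // real data is used unchanged
    }
}
```

Routing both the null and the empty case through one helper keeps the fallback in a single place, so every quantile requested by the statistics page sees the same default.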

Verifying this change

This change is already covered by existing tests, such as:

  • DescriptiveStatisticsHistogramTest - Tests the histogram statistics calculation with various data scenarios
  • The existing tests continue to pass with this change, and the fix prevents the NullArgumentException in production scenarios

Manual Verification:

  • Started a Flink cluster with a streaming job that has no checkpointing configured
  • Opened the checkpoint statistics page in the Web UI (http://localhost:8081/#/job/<job-id>/checkpoints)
  • Verified that the page loads successfully without throwing NullArgumentException
  • After enabling checkpointing and completing some checkpoints, verified that real statistics are calculated correctly
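
The manual check above can also be scripted against the REST endpoint that backs the checkpoints page. The sketch below is an assumption-laden illustration rather than part of this PR: the class CheckpointStatsProbe and its methods are hypothetical, it needs Java 11 or later for java.net.http, and it assumes the JobManager REST address and a job id are passed on the command line. The path /jobs/<job-id>/checkpoints is the Flink REST endpoint for checkpoint statistics; a fixed build is expected to answer with HTTP 200, while the status an unfixed build returns is not stated in this thread.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckpointStatsProbe {

    /** Builds the REST URI for a job's checkpoint statistics, tolerating a trailing slash on the base URL. */
    static URI checkpointStatsUri(String baseUrl, String jobId) {
        String base = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        return URI.create(base + "/jobs/" + jobId + "/checkpoints");
    }

    /** Sends a GET request to the checkpoint statistics endpoint and returns the HTTP status code. */
    static int fetchStatus(String baseUrl, String jobId) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(checkpointStatsUri(baseUrl, jobId)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: CheckpointStatsProbe <base-url> <job-id>");
            return;
        }
        // Example arguments (assumed): http://localhost:8081 and the id of the running job.
        System.out.println("HTTP " + fetchStatus(args[0], args[1]));
    }
}
```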

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

Collaborator

flinkbot commented Dec 18, 2025

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

      percentilesImpl = new Percentile().withNaNStrategy(NaNStrategy.FIXED);
  }
- if (data != null) {
+ if (data != null && data.length > 0) {
Contributor

@davidradl davidradl Dec 18, 2025

Looks good. Could we have a unit test for this please? Maybe at the level that hit the NullArgumentException.

Author

Done! Added tests that specifically target the NullArgumentException scenario. The bug is now immortalized in our test suite. Tests can be verified with:

mvn test -pl flink-runtime -Dtest=DescriptiveStatisticsHistogramTest

@github-actions github-actions bot added the community-reviewed label (PR has been reviewed by the community.) Dec 19, 2025
@qiref qiref force-pushed the hotfix-NullArgumentException branch from 3119b54 to da263a7 Compare December 19, 2025 04:13