NullPointerException in FixedSizeExemplarReservoir due to unsafe lazy initialization #8523

Description

@seongjun-rpls

Describe the bug

We observed a NullPointerException in FixedSizeExemplarReservoir.offerDoubleMeasurement(...) while recording metrics through the OpenTelemetry Java SDK.

The exception message indicates that the storage array itself was non-null, but one of its elements was observed as null:

Cannot invoke "io.opentelemetry.sdk.metrics.internal.exemplar.ReservoirCell.recordDoubleMeasurement(double, io.opentelemetry.api.common.Attributes, io.opentelemetry.context.Context)" because "this.storage[bucket]" is null

FixedSizeExemplarReservoir lazily initializes ReservoirCell[] storage, but storage is neither volatile nor initialized under synchronization.

@Nullable private ReservoirCell[] storage;

@Override
public void offerDoubleMeasurement(double value, Attributes attributes, Context context) {
  if (storage == null) {
    storage = initStorage();
  }
  int bucket = reservoirCellSelector.reservoirCellIndexFor(storage, value, attributes, context);
  if (bucket != -1) {
    this.storage[bucket].recordDoubleMeasurement(value, attributes, context);
    this.hasMeasurements = true;
  }
}

initStorage() initializes every element:

private ReservoirCell[] initStorage() {
  ReservoirCell[] storage = new ReservoirCell[this.size];
  for (int i = 0; i < size; ++i) {
    storage[i] = new ReservoirCell(this.clock);
  }
  return storage;
}

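There does not appear to be any code path that sets storage[bucket] back to null after initialization. This looks like a data race during concurrent first-time metric recording: the write to storage is not safely published, so there is no happens-before edge between the initializing thread's element writes and another thread's reads. That thread may observe the non-null array reference while still seeing the default null in one or more elements.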
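
To make the suspected state concrete, here is a minimal single-threaded model of one ordering that the Java Memory Model permits when there is no synchronization: the array reference becomes visible before the element writes do. This is only an illustration, not a reproducer. PublicationOrderModel and Cell are hypothetical names, and Cell is a stand-in for ReservoirCell.

```java
// Single-threaded model of "reference visible before elements".
// All names here are hypothetical and exist only for illustration.
final class PublicationOrderModel {
  // Stand-in for ReservoirCell; the real class holds measurement state.
  static final class Cell {}

  /**
   * Returns a two-element array: what a reader would load from index 0
   * before the fill loop ran, and what it would load afterwards.
   */
  static Object[] simulate(int size) {
    Cell[] local = new Cell[size];     // every element starts as null
    Cell[] published = local;          // models the reference becoming visible early
    Object seenBeforeFill = published[0]; // what a racing reader could observe
    for (int i = 0; i < size; i++) {
      local[i] = new Cell();           // the element writes from initStorage()
    }
    Object seenAfterFill = published[0];
    return new Object[] {seenBeforeFill, seenAfterFill};
  }
}
```

In the real code the element writes come first in program order, but without volatile or a lock nothing forces another thread to observe them in that order.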

Steps to reproduce

I do not currently have a deterministic reproducer.

The issue occurred in a production Spring Boot workload under concurrent request handling while recording a repository invocation metric through Micrometer / OpenTelemetry metrics.

The observed stack path included:

OpenTelemetryTimer.recordNonNegative
MetricsRepositoryMethodInvocationListener.afterInvocation
FixedSizeExemplarReservoir.offerDoubleMeasurement

The failure appears to require concurrent metric recording during the first use of a lazily initialized exemplar reservoir. Because this appears to be an unsafe-publication race under the Java Memory Model, whether it manifests depends on thread timing, JIT compilation, and hardware memory ordering, so it may be difficult to reproduce deterministically with a normal unit test.
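
A best-effort stress harness might still be useful for comparing a candidate fix against the current behavior. The sketch below is an assumption about how such a harness could look, not an existing test in the SDK: FirstUseStress and countFailures are hypothetical names, and the reservoir is abstracted as a DoubleConsumer so the block stays self-contained. Each round creates a fresh instance and releases all threads at once so they contend on first use. A run with zero failures does not prove the race is absent; a purpose-built tool such as OpenJDK jcstress would likely be a better fit for a real regression test.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.DoubleConsumer;
import java.util.function.Supplier;

// Hypothetical harness: hammers the first use of a freshly created "reservoir".
final class FirstUseStress {
  /** Returns how many offer calls threw NullPointerException across all rounds. */
  static int countFailures(Supplier<DoubleConsumer> freshReservoir, int threads, int rounds) {
    AtomicInteger failures = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      for (int round = 0; round < rounds; round++) {
        DoubleConsumer offer = freshReservoir.get(); // new, uninitialized instance
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
          final double value = t;
          pool.execute(() -> {
            try {
              start.await();          // line all threads up on the same gate
              offer.accept(value);    // first-time recording happens here
            } catch (NullPointerException e) {
              failures.incrementAndGet();
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            } finally {
              done.countDown();
            }
          });
        }
        start.countDown();            // release every thread at once
        try {
          done.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    } finally {
      pool.shutdownNow();
    }
    return failures.get();
  }
}
```

To exercise the real class, the supplier would wrap a new reservoir and forward to offerDoubleMeasurement(...) with suitable Attributes and Context arguments.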

What did you expect to see?

Concurrent metric recordings should not observe a partially initialized ReservoirCell[] storage.

FixedSizeExemplarReservoir.offerDoubleMeasurement(...) should either initialize the reservoir safely or use an already fully initialized reservoir, and metric recording should not throw.

What did you see instead?

Metric recording threw the following NullPointerException, and the application request failed with a 500:

java.lang.NullPointerException: Cannot invoke "io.opentelemetry.sdk.metrics.internal.exemplar.ReservoirCell.recordDoubleMeasurement(double, io.opentelemetry.api.common.Attributes, io.opentelemetry.context.Context)" because "this.storage[bucket]" is null

What version and what artifacts are you using?

Artifacts:

  • OpenTelemetry Java agent via OpenTelemetry Operator auto-instrumentation
  • OpenTelemetry Java SDK metrics, shaded inside the Java agent
  • Micrometer / Spring Boot metrics bridge recording into OpenTelemetry metrics

Version:

  • OpenTelemetry Java instrumentation: 2.27.0
  • OpenTelemetry Java SDK used by instrumentation: 1.61.0
  • Java auto-instrumentation image: opentelemetry-operator/autoinstrumentation-java:2.27.0

I also checked opentelemetry-java v1.63.0 and current main, and FixedSizeExemplarReservoir appears to still have the same lazy initialization pattern.

How did you reference these artifacts?

The application uses Kubernetes auto-instrumentation:

instrumentation.opentelemetry.io/inject-java: addons-opentelemetry-operator/java-instrumentation

The injected init container image was:

opentelemetry-operator/autoinstrumentation-java:2.27.0

Environment

Compiler: not directly applicable; the application is instrumented at runtime.

Runtime:

  • Spring Boot application running on Kubernetes
  • OpenTelemetry Java auto-instrumentation agent 2.27.0
  • OpenTelemetry Java SDK 1.61.0 bundled with the agent

OS:

  • Amazon Linux container on Kubernetes with Amazon Corretto 25

Additional context

A possible fix would be to safely publish the lazily initialized array, for example with volatile and double-checked locking using a dedicated lock object:

@Nullable private volatile ReservoirCell[] storage;
private final Object storageLock = new Object();

private ReservoirCell[] getOrInitStorage() {
  ReservoirCell[] currentStorage = storage;
  if (currentStorage == null) {
    synchronized (storageLock) {
      currentStorage = storage;
      if (currentStorage == null) {
        currentStorage = initStorage();
        storage = currentStorage;
      }
    }
  }
  return currentStorage;
}

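Then offerDoubleMeasurement(...) / offerLongMeasurement(...) can use the returned local array for both bucket selection and recording, so each call works from a single read of the storage field instead of re-reading it for each step.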
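
For illustration, a self-contained sketch of how the offer path could look with that change. SketchReservoir and SketchCell are hypothetical stand-ins for FixedSizeExemplarReservoir and ReservoirCell; Attributes, Context, and the real reservoirCellSelector are omitted, and the hash-based bucket choice is only a placeholder for it.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for ReservoirCell; only counts recordings.
final class SketchCell {
  private final AtomicLong count = new AtomicLong();

  void recordDoubleMeasurement(double value) {
    count.incrementAndGet();
  }

  long count() {
    return count.get();
  }
}

// Hypothetical stand-in for FixedSizeExemplarReservoir.
final class SketchReservoir {
  private final int size;
  private final Object storageLock = new Object();
  // volatile: a reader that sees the reference also sees the element writes
  private volatile SketchCell[] storage;

  SketchReservoir(int size) {
    this.size = size;
  }

  private SketchCell[] initStorage() {
    SketchCell[] cells = new SketchCell[size];
    for (int i = 0; i < size; i++) {
      cells[i] = new SketchCell();
    }
    return cells;
  }

  private SketchCell[] getOrInitStorage() {
    SketchCell[] current = storage;      // single volatile read on the fast path
    if (current == null) {
      synchronized (storageLock) {
        current = storage;               // re-check under the lock
        if (current == null) {
          current = initStorage();
          storage = current;             // publish only the fully built array
        }
      }
    }
    return current;
  }

  void offerDoubleMeasurement(double value) {
    SketchCell[] cells = getOrInitStorage();
    // Placeholder for reservoirCellSelector.reservoirCellIndexFor(...)
    int bucket = Math.floorMod(Double.hashCode(value), cells.length);
    cells[bucket].recordDoubleMeasurement(value); // uses the local array only
  }

  long totalRecorded() {
    SketchCell[] cells = storage;
    if (cells == null) {
      return 0;
    }
    long total = 0;
    for (SketchCell cell : cells) {
      total += cell.count();
    }
    return total;
  }
}
```

The lock is taken only until the array is first published; after that, each call costs one volatile read.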

As a workaround, setting the exemplar filter to always_off should avoid this code path:

OTEL_METRICS_EXEMPLAR_FILTER=always_off

This keeps metrics and traces enabled, but disables metric exemplars.
