Skip to content

NPE in ManagedInformerEventSource.eventFilteringUpdateAndCacheResource when patch target was already deleted on the API server #3331

@Donnerbart

Description

@Donnerbart

Bug Report

What did you do?

ResourceOperations.removeFinalizer ultimately calls client.resource(r).edit(unaryOperator) via ManagedInformerEventSource.eventFilteringUpdateAndCacheResource. When that edit() returns null (fabric8's behavior on HTTP 404 when the target is gone), the result is fed unguarded into TemporaryResourceCache.putResource, which calls ResourceID.fromResource(null) and throws.

The code path that handles this null is missing; see ManagedInformerEventSource.java:101-103 on current main (5.3.4-SNAPSHOT). For comparison, removeFinalizer's own retry-precondition guard already null-checks the resource and returns gracefully (ResourceOperations.java:636-639); this one does not.

Concrete observed reproduction: a controller implementing Reconciler + Cleaner, using KubernetesMockServer (fabric8 7.6.1) in test teardown. The mock server honours finalizers, so the typical sequence we observed was:

  1. Test calls client.resources(...).inNamespace(ns).delete().
  2. Mock server sets deletionTimestamp and keeps the resource (finalizer present).
  3. Informer emits an update event; JOSDK runs handleCleanup; removeFinalizer patches the finalizer off successfully.
  4. Mock server actually removes the resource.
  5. A subsequent cleanup pass runs (the dispatcher had a superseding event queued — could be the delete event or another update; logs show deleteEvent=false, so in this case it was an update event). This pass's patch reaches the API server after the resource is gone; edit() returns null; NPE.

We have not characterized this on a real cluster — no production observations either way. The bug is reported because the code path is unguarded against a documented null return from fabric8, and we have a deterministic in-process reproducer (below).

What did you expect to see?

A clean return (no exception), matching the existing precondition guard in ResourceOperations.removeFinalizer's retry path:

if (r == null) {
  log.warn("Cannot remove finalizer since resource not exists.");
  return false;
}

What did you see instead? Under which circumstances?

NullPointerException thrown from TemporaryResourceCache.putResource, surfaced as a WARN in EventProcessor ("Uncaught error during event processing ... another reconciliation will be attempted because a superseding event has been received").

java.lang.NullPointerException: Cannot invoke "io.fabric8.kubernetes.api.model.HasMetadata.getMetadata()" because "resource" is null
  at io.javaoperatorsdk.operator.processing.event.ResourceID.fromResource(ResourceID.java:28)
  at io.javaoperatorsdk.operator.processing.event.source.informer.TemporaryResourceCache.putResource(TemporaryResourceCache.java:180)
  at io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource.eventFilteringUpdateAndCacheResource(ManagedInformerEventSource.java:103)
  at io.javaoperatorsdk.operator.api.reconciler.ResourceOperations.resourcePatch(ResourceOperations.java:567)
  at io.javaoperatorsdk.operator.api.reconciler.ResourceOperations.jsonPatchPrimary(ResourceOperations.java:409)
  at io.javaoperatorsdk.operator.api.reconciler.ResourceOperations.conflictRetryingPatchPrimary(ResourceOperations.java:668)
  at io.javaoperatorsdk.operator.api.reconciler.ResourceOperations.removeFinalizer(ResourceOperations.java:630)

Triggering circumstances:

  1. client.resources(...).inNamespace(ns).delete() is called against a resource whose reconciler manages a finalizer.
  2. Mock server (and real API server) honour the finalizer: deletionTimestamp is set, resource still present.
  3. Informer emits an update event with deletionTimestamp set.
  4. JOSDK runs handleCleanup and removeFinalizer patches the finalizer off successfully.
  5. Mock server removes the resource.
  6. A second cleanup pass is dispatched (superseding event already queued). This time edit() returns null and the NPE fires.

The relevant call site (ManagedInformerEventSource.eventFilteringUpdateAndCacheResource, current main / 5.3.4-SNAPSHOT):

updatedResource = updateMethod.apply(resourceToUpdate);   // can return null
log.debug("Resource update successful");
handleRecentResourceUpdate(id, updatedResource, resourceToUpdate);  // NPE: no null guard

handleRecentResourceUpdate forwards to temporaryResourceCache.putResource(null), which fails at ResourceID.fromResource(null).

Environment

Kubernetes cluster type:

vanilla; reproducer is also deterministic against fabric8 KubernetesMockServer.

$ Mention java-operator-sdk version from pom.xml file

Originally observed on 5.3.3; reproducer (below) also confirmed against current main / 5.3.4-SNAPSHOT.

$ java -version

openjdk version "21"

$ kubectl version

N/A (reproduced via KubernetesMockServer and via direct unit test; not cluster-dependent).

Possible Solution

Null-guard the call in ManagedInformerEventSource.eventFilteringUpdateAndCacheResource:

updatedResource = updateMethod.apply(resourceToUpdate);
if (updatedResource != null) {
  log.debug("Resource update successful");
  handleRecentResourceUpdate(id, updatedResource, resourceToUpdate);
} else {
  log.debug("Update returned null, skipping cache update");
}
return updatedResource;

The existing finally block already null-checks updatedResource for version tracking, so only this direct call path needs the guard.

Additional context

Reproducer test (no cluster required), demonstrating the call path from removeFinalizer all the way down to the NPE:

@Test
void removeFinalizerHandlesAlreadyDeletedTargetGracefully() {
  // When the underlying patch returns null because the target resource is already gone
  // on the API server (fabric8's edit() returns null on 404), removeFinalizer must
  // return cleanly rather than throw.
  var resource = TestUtils.testCustomResource1();
  resource.getMetadata().setResourceVersion("1");
  resource.addFinalizer(FINALIZER_NAME);
  when(context.getPrimaryResource()).thenReturn(resource);

  var client = MockKubernetesClient.client(TestCustomResource.class);
  NamespaceableResource<TestCustomResource> single = mock();
  when(single.edit(any(UnaryOperator.class))).thenReturn(null);
  when(client.resource(any(TestCustomResource.class))).thenReturn(single);

  var eventSource = new ControllerEventSource<>(testController(client));
  eventSource.start();
  try {
    when(context.getClient()).thenReturn(client);
    var eventSourceRetriever = mock(EventSourceRetriever.class);
    when(context.eventSourceRetriever()).thenReturn(eventSourceRetriever);
    when(eventSourceRetriever.getControllerEventSource()).thenReturn(eventSource);

    var result = resourceOperations.removeFinalizer(FINALIZER_NAME);

    assertThat(result).isNull();
  } finally {
    eventSource.stop();
  }
}

A second, narrower reproducer at the ManagedInformerEventSource layer:

@Test
void eventFilteringUpdateHandlesNullFromUpdateMethodGracefully() {
  withRealTemporaryResourceCache();
  var result = informerEventSource.eventFilteringUpdateAndCacheResource(testDeployment(), r -> null);
  assertThat(result).isNull();
}

Both fail today with the NPE shown above and pass with the proposed null-guard.

Related prior issue (different flow, same class of bug): #1987 — fixed in the old ReconciliationDispatcher flow; the 5.x rewrite moved the logic to ResourceOperations + ManagedInformerEventSource and regressed null-safety on the cache-update side.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions