Distinguish transient Get errors from 'no tunnel-targeted parent' in findTunnelTargetedParentRef #146

@jacaudi

Background

Issue #145 introduced deactivation pruning at the HTTPRoute/TLSRoute parent == nil branches (commits 1a2d934, 38e7f39). The prune deletes all previously-emitted CloudflareDNSRecord CRs labelled with the source's identity.

findTunnelTargetedParentRef (internal/controller/tunnel/attach.go:516-551) iterates parentRefs and skips any parent whose Get fails, whether the parent doesn't exist or the Get failed for some other reason. At each Get site a single `continue` handles both `apierrors.IsNotFound` and transient errors (apiserver glitch, cache resync hole, etc.). If every candidate parent fails, the function returns `(nil parent, nil err)`, and the reconcile hits the new deactivation-prune branch.

Before #145's fix this was harmless: orphans would just persist for a reconcile. After the fix, a transient Get error on the parent Gateway can spuriously delete every CR the source ever emitted, triggering a delete-and-reemit churn that briefly tears down the Cloudflare-side DNS + TXT records.

Why this isn't urgent

  • controller-runtime's cache normally returns `NotFound` (not network errors) for transient apiserver hiccups, so the spurious-prune path is largely theoretical.
  • `client.IgnoreNotFound` in the pruner makes the prune itself race-safe.
  • The CR's finalizer chain rebuilds state on the next reconcile.

The risk is observability churn (events, briefly missing Cloudflare-side records), not data loss. Documented as design open question §2 in `docs/plans/2026-05-28-source-controller-orphan-gc-design.md`.

Proposed fix

In `attach.go:529` and `:541` (the two `continue` sites in `findTunnelTargetedParentRef`), branch on the error type:

  • `apierrors.IsNotFound(err)` → `continue` (parent definitively doesn't exist).
  • Any other error → return the error from `findTunnelTargetedParentRef` so the reconcile can requeue and try again, instead of reporting "no tunnel-targeted parent" to the call site.

Then update the HTTPRoute and TLSRoute reconcile paths (`httproute_source_controller.go:120-122` and `tlsroute_source_controller.go:118-120`) to surface the non-NotFound error case as a requeue (`return reconcile.Result{}, err`) instead of falling into the deactivation prune.

When to revisit

Field evidence: an operator reports a CR briefly disappearing without a deliberate annotation change. Logs from that window would not contain the `orphan-prune failed during deactivation sweep` message (because the prune call succeeded) and would show re-emission a moment later.

Surfaced by

Independent comprehensive review of the #145 fix branch (`feature/source-controller-orphan-gc`).


Labels: bug (Something isn't working)
