Skip to content

fix: source-controllers GC derived DNSRecord CRs on deactivation (#145)#149

Merged
jacaudi merged 7 commits into
mainfrom
feature/source-controller-orphan-gc
May 28, 2026
Merged

fix: source-controllers GC derived DNSRecord CRs on deactivation (#145)#149
jacaudi merged 7 commits into
mainfrom
feature/source-controller-orphan-gc

Conversation

@jacaudi
Copy link
Copy Markdown
Owner

@jacaudi jacaudi commented May 28, 2026

Summary

Closes #145. Source-controllers (HTTPRoute, TLSRoute, Gateway, Service) emit derived `CloudflareDNSRecord` CRs based on source annotations. When annotations are removed from the source, the previously-emitted CRs were not deleted — they persisted as orphans and continued to hold the live DNS + TXT ownership entries in Cloudflare, blocking any new CR for the same hostname with `OwnershipCompanionFailed: foreign`.

Five new `pruneOrphanedDNSRecords(..., desired=nil)` calls at definitive-deactivation early-return branches:

Controller Branch File
HTTPRoute `parent == nil` `internal/controller/tunnel/httproute_source_controller.go`
TLSRoute `parent == nil` `internal/controller/tunnel/tlsroute_source_controller.go`
Gateway `!enabled` (primary repro: `cloudflare.io/tunnel` removed) `internal/controller/tunnel/gateway_source_controller.go`
Gateway `len(hostnames) == 0` `internal/controller/tunnel/gateway_source_controller.go`
Service `!enabled` `internal/controller/tunnel/service_source_controller.go`

Reuses the existing `pruneOrphanedDNSRecords` helper unchanged. The deleted CR's existing finalizer cleans up the Cloudflare-side DNS + TXT records.

Transient and ambiguous branches deliberately not touched (deferred-emission paths, `resolveGatewayService` errors) — pruning there would either thrash CRs on tunnel restarts or risk false-positive deletes.

Design and plan

Review

  • Per-task spec + code-quality reviews after each of the 4 implementation tasks.
  • Independent Go-engineering review of the full plan before implementation began (verdicts: will fix bug = YES, practices = SOUND, scope = RIGHT_SIZED).
  • Independent comprehensive review of the full diff (verdict: approve with notes).
  • Two notes addressed in the cleanup commit (`296173c`): aligned Gateway prune events to use `r.Recorder` consistently with the package's other `ReasonOrphanedDNSRecordPruned` emits; added `LabelSourceKind` assertion + recorder-rationale comment to the no-hostnames Gateway test.
  • Two notes deferred to follow-up issues: Distinguish transient Get errors from 'no tunnel-targeted parent' in findTunnelTargetedParentRef #146 (`findTunnelTargetedParentRef` transient-Get-error edge case), Cleanup: extract source-controller deactivation prune block + rename perr #147 (DRY review of the prune-event block + rename `perr`).

Test Plan

  • `make test` passes (unit + envtest + race detection)
  • Pre-existing lint failures unchanged (none in files this PR modifies)
  • New tests verify the bug fix end-to-end:
    • `TestHTTPRouteSource_ParentDeactivation_PrunesEmittedCRs`
    • `TestTLSRouteSource_ParentDeactivation_PrunesEmittedCRs`
    • `TestGatewaySource_OptOut_PrunesEmittedCRs` (primary repro)
    • `TestGatewaySource_HostnamesCleared_PrunesEmittedCRs`
    • `TestServiceSource_OptOut_PrunesEmittedCRs`
  • Optional: manual smoke test on a live cluster — apply a Gateway with `cloudflare.io/tunnel: "true"`, confirm a `CloudflareDNSRecord` appears, remove the annotation, confirm the CR is deleted within one reconcile cycle.

🤖 Generated with Claude Code

jacaudi and others added 7 commits May 28, 2026 01:30
… completion

Design update: per-controller call-sites table extended to enumerate
two branches missed in the initial brainstorming pass — Gateway
!enabled (the primary repro from issue #145) and Service !enabled
(parallel deactivation path that the original design wrongly claimed
the end-of-reconcile prune covered). Both follow the design's stated
"definitive deactivation" principle; the gap was incomplete
enumeration, not a different principle. Test-style language updated
from "envtest" to "controller-runtime fake client" to match the
package's actual unit-test pattern.

Implementation plan: 4 tasks (HTTPRoute, TLSRoute, Gateway with two
branches, Service) plus a final verification task. Each task is TDD:
write failing test, verify failure, implement, verify pass, commit.
Includes exact code blocks, exact commands, exact commit messages.

Refs issue #145.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unnel-targeted

When an HTTPRoute's tunnel-targeted parent loses its cloudflare.io/tunnel
annotation, findTunnelTargetedParent returns nil and the reconcile early-
returns. The end-of-reconcile pruneOrphanedDNSRecords call is never
reached, so previously-emitted CloudflareDNSRecord CRs persist as
orphans and continue to hold Cloudflare-side records. Add an explicit
prune at the parent==nil branch with desired=nil so all CRs labelled
with this source's identity are deleted.

Refs issue #145.
…e to design open questions

Surfaced by an independent Go-engineering review of the plan:
attach.go's findTunnelTargetedParentRef swallows Get errors via
continue, which could surface as parent==nil and spuriously fire
the new deactivation prune on a transient apiserver glitch. Document
the residual risk + mitigation; not blocking the current fix.

Refs issue #145.
…nnel-targeted

Mirrors the HTTPRoute fix: TLSRoute's parent==nil branch was missing
the deactivation prune. Add it so previously-emitted CRs are deleted
when the parent Gateway loses its cloudflare.io/tunnel annotation.

Refs issue #145.
…istener hostnames

Two deactivation branches in the Gateway reconcile early-returned
without pruning previously-emitted CRs: (1) cloudflare.io/tunnel
annotation removed (the primary repro from issue #145), and
(2) all listener hostnames cleared. Add prune calls at both branches.

Refs issue #145.
The Service reconcile's !enabled branch (cloudflare.io/tunnel removed)
early-returned without pruning previously-emitted CRs. The end-of-
reconcile pruneOrphanedDNSRecords call was unreachable on this path,
so removing the annotation left CRs as orphans. Add a prune at the
!enabled branch to match the parallel fixes in HTTPRoute, TLSRoute,
and Gateway.

Refs issue #145.
…ntion

Comprehensive review of issue #145's fix flagged that the new Gateway
deactivation prunes used r.recorder (lowercase, SafeRecorder wrapper)
while the existing Gateway happy-path prune at the end of the same
reconcile uses r.Recorder (capital) — as do the four other new
ReasonOrphanedDNSRecordPruned emits added in this branch
(HTTPRoute, TLSRoute, Service, and the Gateway happy-path). The
plan's "match adjacent convention" prescription was wrong: the
adjacent Warning event uses r.recorder, but the rule across the
package for ReasonOrphanedDNSRecordPruned is r.Recorder.

Use r.Recorder for both Gateway deactivation prune emits so every
ReasonOrphanedDNSRecordPruned in the package uses the same recorder
field, regardless of source kind or branch. dedupe.emit is nil-safe
on its first argument either way; this is a clarity fix, not a
runtime-behavior change.

Also tighten TestGatewaySource_HostnamesCleared_PrunesEmittedCRs:
add the LabelSourceKind assertion present in the four sibling tests
(consistency with TestGatewaySource_OptOut_PrunesEmittedCRs etc.),
and explain why the test wires Recorder (unlike the OptOut variant —
the no-hostnames branch emits a Warning before the prune).

Two follow-ups filed: #146 (transient Get errors at
findTunnelTargetedParentRef can spuriously fire the new
deactivation prune on HTTPRoute/TLSRoute), #147 (extract the prune
event block + rename perr to pruneErr).

Refs issue #145.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jacaudi jacaudi merged commit 5de7324 into main May 28, 2026
7 checks passed
@jacaudi jacaudi deleted the feature/source-controller-orphan-gc branch May 28, 2026 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Source controllers don't garbage-collect derived CloudflareDNSRecords when source annotations are removed

1 participant