relay: add DNS-01 cert acquisition via Cloudflare API #37

Merged: findias merged 1 commit into main from feat/relay-dns-01-cert-method on May 6, 2026

findias commented on May 6, 2026

Why

Multi-IP RU round-robin DNS made the existing webroot HTTP-01 challenge unreliable: LE picks one apex A-record IP, and only one RU has the challenge file. The second RU was worked around by rsyncing certs from the first, but that copy goes stale after 90 days unless we run a cron sync. DNS-01 via the Cloudflare API is the standard fix: it works regardless of where DNS resolves because the challenge is a TXT record.

After this PR, each RU auto-renews independently against its own LE account; no inter-RU coordination needed.

Changes

  • defaults/main.yml: relay_certbot_method (default webroot for backwards compat) + relay_certbot_dns_propagation_seconds.
  • defaults/secrets.yml.example: documents relay_cloudflare_api_token vault var with the CF token-creation recipe (Zone:DNS:Edit on apex).
  • tasks/install.yml: snap-installs certbot-dns-cloudflare plugin and connects it, gated on method=dns-cloudflare. trust-plugin-with-root set explicitly (plugin needs root to write into /etc/letsencrypt).
  • tasks/certbot.yml: validates token presence, deploys /etc/letsencrypt/cloudflare.ini (mode 600, no_log), branches the certbot certonly command on method. Existing webroot path retained unchanged for hosts that don't opt in.
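
The credential-file and method-branching steps in tasks/certbot.yml can be pictured roughly as the shell sketch below. It is not the role's actual task code: the function names, the webroot path /var/www/letsencrypt, and the 30-second propagation default are assumptions made for illustration. The dns_cloudflare_api_token key and the --dns-cloudflare* flags are the ones the certbot-dns-cloudflare plugin documents.

```shell
#!/bin/sh
# Sketch only: helper names, paths and defaults below are hypothetical.

# Write the Cloudflare credentials file with owner-only permissions.
# Usage: write_cloudflare_ini TOKEN DEST
write_cloudflare_ini() {
    token="$1"
    dest="$2"
    if [ -z "$token" ]; then
        echo "relay_cloudflare_api_token is empty; refusing to write $dest" >&2
        return 1
    fi
    # Create the file under a restrictive umask so it is never world-readable,
    # then force mode 600 in case the file already existed with wider bits.
    ( umask 077 && printf 'dns_cloudflare_api_token = %s\n' "$token" > "$dest" ) || return 1
    chmod 600 "$dest"
}

# Print the certbot command for the chosen method without running it.
# Usage: certonly_command METHOD DOMAIN [PROPAGATION_SECONDS]
certonly_command() {
    method="$1"
    domain="$2"
    wait_s="${3:-30}"   # assumed default, not taken from the role
    case "$method" in
        webroot)
            # /var/www/letsencrypt is an assumed webroot path.
            echo "certbot certonly --webroot -w /var/www/letsencrypt --cert-name $domain -d $domain"
            ;;
        dns-cloudflare)
            echo "certbot certonly --dns-cloudflare --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini --dns-cloudflare-propagation-seconds $wait_s --cert-name $domain -d $domain"
            ;;
        *)
            echo "unknown relay_certbot_method: $method" >&2
            return 1
            ;;
    esac
}
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; a real task would also need non-interactive flags and the idempotency check described in the next section.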

Idempotency bug fixed in same commit

The pre-existing 'cert already covers domain' check parsed certbot certificates ... | grep 'Domains:', but snap certbot 3.x renamed that field to Identifiers:. The grep returned nothing, so the 'cert doesn't exist' branch fired on every run and certbot tried to re-issue. LE's dedup window masked it for a few minutes, but a sustained re-run loop would burn rate-limit budget.

Fix:

grep -E '^[[:space:]]*(Domains|Identifiers):'

Handles both certbot 2.x (apt) and 3.x+ (snap).
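
As a rough illustration of how that pattern can back a 'cert already covers domain' check, the sketch below reads certbot certificates output on stdin and succeeds only if the given domain is listed on a Domains: or Identifiers: line. The function name cert_covers_domain is hypothetical, and the sketch assumes the names on that line are whitespace-separated; the role's real task may be shaped differently.

```shell
#!/bin/sh
# Sketch only: cert_covers_domain is a hypothetical helper, not the role's code.

# Succeed if DOMAIN appears on a "Domains:" (certbot 2.x) or
# "Identifiers:" (certbot 3.x) line of the text read from stdin.
# Usage: certbot certificates ... | cert_covers_domain DOMAIN
cert_covers_domain() {
    domain="$1"
    # 1. keep only the label lines, 2. strip the label,
    # 3. put one name per line, 4. require an exact whole-line match.
    grep -E '^[[:space:]]*(Domains|Identifiers):' \
        | sed -E 's/^[[:space:]]*(Domains|Identifiers):[[:space:]]*//' \
        | tr -s '[:space:]' '\n' \
        | grep -Fxq -- "$domain"
}
```

The exact whole-line match (grep -Fx) avoids treating www.example.com as covering example.com, which a plain substring grep would do.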

Test plan

  • Manual DNS-01 setup on vm_my_ru and vm_my_ru2 (out-of-band, before this PR)
  • Re-ran role with --tags relay_install,relay_certbot. First run reported changed=2 due to the broken grep (false positive).
  • After the grep fix, idempotent re-run reports changed=0 on both hosts.
  • certbot renew --cert-name zirgate.com --dry-run on both hosts: "Congratulations, all simulated renewals succeeded".
  • Reviewer note: opt-in via relay_certbot_method: dns-cloudflare in inventory or group_vars; default behaviour for any non-AlchemyLink consumer of this role is unchanged.

Operator note

To migrate an existing host from webroot to dns-cloudflare:

  1. Add relay_cloudflare_api_token to roles/relay/defaults/secrets.yml.
  2. Set relay_certbot_method: dns-cloudflare for the host.
  3. Run --tags relay_install,relay_certbot. Plugin installs; cloudflare.ini deploys; existing cert is re-used (idempotency check skips re-issuance).
  4. Manually run certbot certonly --dns-cloudflare --force-renewal --cert-name <domain> -d <domain> once to switch the renewal config from webroot to dns-cloudflare. Future auto-renewals pick up the new method.
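
One way to confirm that step 4 took effect is to read the authenticator recorded in the certificate's renewal config. The sketch below assumes certbot's usual layout, where /etc/letsencrypt/renewal/<cert-name>.conf contains an authenticator = ... line; the helper name renewal_authenticator is hypothetical and not part of this role.

```shell
#!/bin/sh
# Sketch only: renewal_authenticator is a hypothetical helper.

# Print the authenticator recorded in a certbot renewal config file.
# Usage: renewal_authenticator /etc/letsencrypt/renewal/<cert-name>.conf
renewal_authenticator() {
    conf="$1"
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 1
    fi
    # Take the value of the first "authenticator = value" line,
    # ignoring surrounding whitespace.
    sed -n -E 's/^[[:space:]]*authenticator[[:space:]]*=[[:space:]]*([^[:space:]]+).*$/\1/p' "$conf" \
        | head -n 1
}
```

After a successful migration the helper would be expected to print dns-cloudflare rather than webroot for the migrated certificate.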

Multi-IP RU round-robin DNS makes the existing webroot HTTP-01 challenge
unreliable: LE may pick whichever apex A-record IP it likes, and only
one of the round-robin RUs has the challenge file. The second RU's
cert ends up rsynced from the first, which goes stale after 90 days
unless we run cron-based sync — fragile.

DNS-01 via Cloudflare API works regardless of where DNS resolves
because the challenge is a TXT record, not an HTTP file. Each RU
auto-renews independently against its own LE account; no inter-RU
coordination needed.

Changes:

* defaults/main.yml: relay_certbot_method (default 'webroot' for
  backwards compat) and relay_certbot_dns_propagation_seconds.
* defaults/secrets.yml.example: documents relay_cloudflare_api_token
  vault var with the CF token-creation recipe (Zone:DNS:Edit on apex).
* tasks/install.yml: snap-install certbot-dns-cloudflare plugin and
  connect it via snap interface, gated on method=dns-cloudflare.
  trust-plugin-with-root must be set explicitly because the plugin
  needs root to write into /etc/letsencrypt.
* tasks/certbot.yml: validates token presence, deploys
  /etc/letsencrypt/cloudflare.ini (mode 600, no_log), branches the
  certbot certonly command on method. Existing webroot path retained
  unchanged for hosts that don't opt in.

Idempotency bug fixed in the same commit:

The pre-existing 'cert already covers domain' check parsed
`certbot certificates ... | grep 'Domains:'`, but snap certbot 3.x
renamed that line to 'Identifiers:'. The grep returned empty, the
'cert doesn't exist' branch fired on every run, and certbot tried to
re-issue. LE's small dedup window masked it for a few minutes, but a
sustained re-run loop would burn rate budget. Updated the grep to
accept both labels:

  grep -E '^[[:space:]]*(Domains|Identifiers):'

Both certbot 2.x (apt distro) and 3.x+ (snap) parse correctly now.

Tested:

* Manual DNS-01 setup on vm_my_ru and vm_my_ru2 from earlier session.
* Re-ran the role with --tags relay_install,relay_certbot — first run
  reported changed=2 due to the broken grep (false positive); after
  the fix, idempotent re-run reports changed=0 on both hosts.
* certbot renew --cert-name zirgate.com --dry-run on both hosts:
  "Congratulations, all simulated renewals succeeded".

Signed-off-by: findias <findias@gmail.com>
findias merged commit a46ccea into main on May 6, 2026
2 checks passed