Skip to content

feat(talos): detect and repair malformed CA in saved talosconfig#4645

Open
devantler wants to merge 5 commits intomainfrom
devantler/talos-ca-cert-update-error
Open

feat(talos): detect and repair malformed CA in saved talosconfig#4645
devantler wants to merge 5 commits intomainfrom
devantler/talos-ca-cert-update-error

Conversation

@devantler
Copy link
Copy Markdown
Contributor

Problem

ksail cluster update against a Talos cluster could fail deep inside talosclient.New with an opaque error:

✗ failed to apply updates: failed to sync cluster secrets:
  failed to create Talos client for secret sync:
  failed to create Talos client from saved config:
  failed to create client connection:
  failed to append CA certificate to RootCAs pool

Root cause: the CA certificate stored in ~/.talos/config (in the current context) was structurally malformed by exactly one byte — a BasicConstraints SEQUENCE-length field in the DER was 0x0e instead of 0x0f, causing crypto/tls RootCAs.AppendCertsFromPEM to silently return false. The pattern was reproducible and recoverable by bumping that single byte.

Changes

  • pkg/svc/repairer/ — new opt-in registry for local-state repairs. Future repairs (kubeconfig, stale state files, etc.) can plug in via init() without changing the command surface.
  • pkg/svc/repairer/talosconfig/ — first concrete repair. Detects the known DER signature
    30 0e 06 03 55 1d 13 01 01 ff 04 05 30 03 01 01 and bumps the length byte to 0x0f, after writing a timestamped backup. Idempotent; only touches contexts whose CA fails x509.ParseCertificate AND matches the recognised pattern.
  • ksail cluster repair — new cobra command iterating registered repairs and reporting per-repair status (OK / repaired / unrepairable / skipped). Honors --talosconfig for the talosconfig repair.
  • Early validation in pkg/svc/provisioner/cluster/talos/update.go createTalosClient: after clientconfig.Open, the current-context CA is base64-decoded → PEM-decoded → x509.ParseCertificate'd. On failure it returns a typed MalformedTalosConfigCAError (with path, context name, and the parse error) telling the user to run ksail cluster repair.
  • Docs — new troubleshooting subsection under ## Talos Issues and the regenerated cli-flags/cluster/cluster-repair.mdx. Chat docs (pkg/svc/chat/docs_generated.go) regenerated.

Tests

  • pkg/svc/repairer/repairer_test.go — registry behaviour, status table.
  • pkg/svc/repairer/talosconfig/talosconfig_test.go — table-driven black-box tests that generate a real Ed25519 self-signed CA, locate the 30 0f 06 03 … BasicConstraints SEQUENCE, bump 0x0f→0x0e to inject the corruption, and assert the repair fixes it.
  • pkg/svc/provisioner/cluster/talos/talosconfig_ca_test.go — black-box tests for the early validator with valid/invalid CA fixtures.
  • pkg/cli/cmd/cluster/repair_test.go — smoke tests for the new command (success, unrepairable, no-repairs-registered).

All go build ./..., go vet ./..., go test ./..., and the docs site npm run build pass locally.

Out of scope

  • Origin of the corruption (likely a one-off bundle write bug or post-write byte corruption) — tracked separately.
  • Auto-rotating the cluster CA (impossible without out-of-band Talos API access — exactly what we're locked out of).
  • Repairs other than the talosconfig CA byte-bump.

Notes

  • New cobra commands under cluster automatically inherit the ai.toolgen.consolidate=command annotation, so cluster repair shows up as part of the cluster_write consolidated MCP/chat tool with no extra wiring.

When the Talos client config (typically ~/.talos/config) contains a
structurally malformed CA certificate, 'ksail cluster update' previously
failed deep in talosclient.New with the opaque error:

  failed to append CA certificate to RootCAs pool

This change introduces:

- pkg/svc/repairer: a small registry for opt-in local-state repairs,
  designed to grow as we discover more recoverable issues.
- pkg/svc/repairer/talosconfig: the first concrete repair, fixing the
  one-byte BasicConstraints SEQUENCE-length corruption that has been
  observed in Ed25519 self-signed Talos CAs (DER pattern
  '30 0e 06 03 55 1d 13 01 01 ff 04 05 30 03 01 01' bumped to '30 0f …'),
  with a timestamped backup written before any modification.
- 'ksail cluster repair' command that iterates registered repairs and
  reports per-repair status (OK / repaired / unrepairable / skipped).
- Early validation in pkg/svc/provisioner/cluster/talos/update.go that
  surfaces a typed MalformedTalosConfigCAError pointing the user at
  'ksail cluster repair' instead of the cryptic AppendCertsFromPEM
  failure.
- Troubleshooting docs entry plus regenerated CLI-flag docs and chat
  docs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 7, 2026 22:14
@github-project-automation github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 7, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

MegaLinter analysis: Success

✅ Linters with no issues

actionlint, git_diff, hadolint, jscpd, jsonlint, lychee, markdown-table-formatter, markdownlint, prettier, prettier, stylelint, syft, trivy-sbom, trufflehog, v8r, v8r, yamllint

See detailed reports in MegaLinter artifacts

MegaLinter is graciously provided by OX Security
Show us your support by starring ⭐ the repository

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in “repair” subsystem to KSail and uses it to detect/repair a known single-byte Talos CA corruption in ~/.talos/config, plus earlier validation in the Talos update path so users get a targeted error that points to ksail cluster repair.

Changes:

  • Introduces a repairer registry and a concrete talosconfig-ca repair that patches the malformed BasicConstraints DER byte (with backups).
  • Adds ksail cluster repair (with --talosconfig) to run all registered repairs and report per-repair outcomes.
  • Adds early validation in Talos provisioner client creation; updates troubleshooting + generated CLI docs; adds unit tests.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
pkg/svc/repairer/doc.go Documents the new repairer subsystem.
pkg/svc/repairer/repairer.go Implements global registry + status/result types for repairs.
pkg/svc/repairer/repairer_test.go Tests registry behaviors and status string formatting.
pkg/svc/repairer/talosconfig/talosconfig.go Implements the talosconfig CA corruption detection/repair with backup + rewrite.
pkg/svc/repairer/talosconfig/talosconfig_test.go Tests the talosconfig repair logic (including corruption injection).
pkg/svc/provisioner/cluster/talos/update.go Adds early CA validation before constructing a Talos client from saved config.
pkg/svc/provisioner/cluster/talos/talosconfig_ca.go Adds typed error + CA validation helper for saved talosconfig current context.
pkg/svc/provisioner/cluster/talos/talosconfig_ca_test.go Tests CA validation behavior and typed error fields.
pkg/svc/provisioner/cluster/talos/export_test.go Exposes CA validation helper for black-box tests via export seam.
pkg/cli/cmd/cluster/repair.go Adds ksail cluster repair command and per-repair result rendering.
pkg/cli/cmd/cluster/repair_test.go Smoke tests for the new repair command behavior.
pkg/cli/cmd/cluster/cluster.go Wires the new repair subcommand under cluster.
docs/src/content/docs/troubleshooting.md Adds troubleshooting guidance for the malformed CA failure + repair command.
docs/src/content/docs/cli-flags/cluster/cluster-root.mdx Updates generated command list to include repair.
docs/src/content/docs/cli-flags/cluster/cluster-repair.mdx Adds generated CLI flags page for ksail cluster repair.

Comment thread pkg/cli/cmd/cluster/repair.go Outdated
Comment thread pkg/svc/repairer/repairer_test.go Outdated
Comment thread pkg/cli/cmd/cluster/repair_test.go
Comment thread pkg/svc/repairer/talosconfig/talosconfig_test.go Outdated
Comment thread pkg/svc/repairer/talosconfig/talosconfig_test.go Outdated
Comment thread pkg/svc/repairer/talosconfig/talosconfig.go Outdated
Comment thread pkg/svc/repairer/talosconfig/talosconfig.go Outdated
Comment thread pkg/svc/repairer/talosconfig/talosconfig.go Outdated
Comment thread pkg/svc/provisioner/cluster/talos/talosconfig_ca.go Outdated
Comment thread pkg/svc/repairer/talosconfig/talosconfig_test.go Outdated
@devantler devantler marked this pull request as ready for review May 7, 2026 22:19
@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.50.

Benchmark suite Current: 227bcc2 Previous: d05bb04 Ratio
BenchmarkCluster_MarshalJSON/FullProductionCluster (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) 196276 ns/op 20673 B/op 465 allocs/op 82654 ns/op 20661 B/op 465 allocs/op 2.37
BenchmarkCluster_MarshalJSON/FullProductionCluster (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) - ns/op 196276 ns/op 82654 ns/op 2.37
BenchmarkYAMLEncode/Minimal (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) 183821 ns/op 21320 B/op 388 allocs/op 63804 ns/op 21320 B/op 388 allocs/op 2.88
BenchmarkYAMLEncode/Minimal (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) - ns/op 183821 ns/op 63804 ns/op 2.88
BenchmarkYAMLEncode/FullProductionCluster (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) 222353 ns/op 32448 B/op 433 allocs/op 71887 ns/op 32448 B/op 433 allocs/op 3.09
BenchmarkYAMLEncode/FullProductionCluster (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) - ns/op 222353 ns/op 71887 ns/op 3.09
BenchmarkJSONEncode (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) 168919 ns/op 15833 B/op 384 allocs/op 61435 ns/op 15831 B/op 384 allocs/op 2.75
BenchmarkJSONEncode (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) - ns/op 168919 ns/op 61435 ns/op 2.75
BenchmarkPruneClusterDefaults/MostlyDefaults (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) 48631 ns/op 7904 B/op 226 allocs/op 28797 ns/op 7904 B/op 226 allocs/op 1.69
BenchmarkPruneClusterDefaults/MostlyDefaults (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) - ns/op 48631 ns/op 28797 ns/op 1.69

This comment was automatically generated by workflow using github-action-benchmark.

- Convert repairer registry to instance-based design with Default() singleton
  to eliminate global mutable state and allow parallel-safe tests.
- Canonicalize talosconfig path with fsutil.EvalCanonicalPath after
  ExpandHomePath to prevent symlink-escape and TOCTOU.
- Use fsutil.AtomicWriteFile for both backup and final writes so partial
  failures cannot leave a half-written talosconfig.
- Generate unique backup paths with nanosecond timestamp + random suffix
  to avoid clobbering when repair runs more than once per second.
- Thread a *repairer.Registry through cluster.NewRepairCmd so tests can
  inject isolated registries instead of mutating the package global.
- Drop misleading test sentinels and rewrite the validateCurrentContextCA
  doc comment to match its actual nil-on-benign-cases behaviour.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ksail-bot ksail-bot Bot enabled auto-merge May 8, 2026 03:09
- Refactor talosconfig repairer with extracted helpers, sentinel errors,
  named constants, and explicit error patterns (no nolint directives).
- Replace tryClientFromSavedConfig (client, bool, error) signature with
  (client, error) + sentinel errSavedTalosconfigUnavailable so callers
  use errors.Is to fall through to bundle (resolves revive error-last
  and nilerr without suppression).
- Remove unused Reset method/shim from repairer (resolves Dead Code
  Analysis failure).
- Replace init() side-effects with exported RegisterDefault wired once
  from cluster.go.
- Test conventions: t.Parallel() everywhere, package-level sentinels
  without nolint, noinlineerr / wsl_v5 / lll / golines / varnamelen all
  cleared on touched files.
- Allowlist gopkg.in/yaml.v3 in depguard (needed for yaml.Node round-
  tripping that preserves user formatting/comments).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 04:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.

Comment thread pkg/cli/cmd/cluster/cluster.go
Comment thread pkg/svc/provisioner/cluster/talos/update.go
Production code uses Default().Register / Registry.All() directly via
the *Registry value, never the package-level shims. Removing them
clears the deadcode analyzer.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address review feedback on PR #4645:

- Registry.Register now scans existing repairs by Name() and is a no-op
  when a repair with the same name is already registered, so callers may
  invoke RegisterDefault on every NewRootCmd construction without
  accumulating duplicates.
- createTalosClient now canonicalizes the talosconfig path via
  fsutil.EvalCanonicalPath after expanding ~, matching the path-safety
  convention used elsewhere in the provisioner. Falls back to the
  expanded path on canonicalization error so behavior degrades
  gracefully when the file does not yet exist.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 8, 2026 05:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 4 comments.

return r.summarizeNoRepair(path, unrepairable, alreadyValid)
}

return r.persistRepairedConfig(path, doc, data, repaired)
Comment on lines +140 to +145
// through symlinks. EvalCanonicalPath falls back to the parent
// directory when the file itself does not yet exist, so a missing
// talosconfig is still reported through the os.ReadFile branch
// below rather than failing canonicalization.
path, err := fsutil.EvalCanonicalPath(expanded)
if err != nil {
Comment thread pkg/svc/repairer/doc.go
Comment on lines +5 to +7
// Each repair satisfies the Repair interface and is registered in the
// process via Register. The `ksail cluster repair` command iterates
// every registered repair and runs it.
Comment on lines +403 to +408
// Canonicalize so symlinks resolve before opening the file (matches
// the path-safety convention used elsewhere in the provisioner).
canonicalPath, canonErr := fsutil.EvalCanonicalPath(talosconfigPath)
if canonErr == nil {
talosconfigPath = canonicalPath
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🫴 Ready

Development

Successfully merging this pull request may close these issues.

3 participants