feat(talos): detect and repair malformed CA in saved talosconfig#4645
feat(talos): detect and repair malformed CA in saved talosconfig#4645
Conversation
When the Talos client config (typically ~/.talos/config) contains a structurally malformed CA certificate, 'ksail cluster update' previously failed deep in talosclient.New with the opaque error: failed to append CA certificate to RootCAs pool This change introduces: - pkg/svc/repairer: a small registry for opt-in local-state repairs, designed to grow as we discover more recoverable issues. - pkg/svc/repairer/talosconfig: the first concrete repair, fixing the one-byte BasicConstraints SEQUENCE-length corruption that has been observed in Ed25519 self-signed Talos CAs (DER pattern '30 0e 06 03 55 1d 13 01 01 ff 04 05 30 03 01 01' bumped to '30 0f …'), with a timestamped backup written before any modification. - 'ksail cluster repair' command that iterates registered repairs and reports per-repair status (OK / repaired / unrepairable / skipped). - Early validation in pkg/svc/provisioner/cluster/talos/update.go that surfaces a typed MalformedTalosConfigCAError pointing the user at 'ksail cluster repair' instead of the cryptic AppendCertsFromPEM failure. - Troubleshooting docs entry plus regenerated CLI-flag docs and chat docs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
✅MegaLinter analysis: Success✅ Linters with no issuesactionlint, git_diff, hadolint, jscpd, jsonlint, lychee, markdown-table-formatter, markdownlint, prettier, prettier, stylelint, syft, trivy-sbom, trufflehog, v8r, v8r, yamllint See detailed reports in MegaLinter artifacts
|
There was a problem hiding this comment.
Pull request overview
Adds an opt-in “repair” subsystem to KSail and uses it to detect/repair a known single-byte Talos CA corruption in ~/.talos/config, plus earlier validation in the Talos update path so users get a targeted error that points to ksail cluster repair.
Changes:
- Introduces a
repairerregistry and a concretetalosconfig-carepair that patches the malformed BasicConstraints DER byte (with backups). - Adds
ksail cluster repair(with--talosconfig) to run all registered repairs and report per-repair outcomes. - Adds early validation in Talos provisioner client creation; updates troubleshooting + generated CLI docs; adds unit tests.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/svc/repairer/doc.go | Documents the new repairer subsystem. |
| pkg/svc/repairer/repairer.go | Implements global registry + status/result types for repairs. |
| pkg/svc/repairer/repairer_test.go | Tests registry behaviors and status string formatting. |
| pkg/svc/repairer/talosconfig/talosconfig.go | Implements the talosconfig CA corruption detection/repair with backup + rewrite. |
| pkg/svc/repairer/talosconfig/talosconfig_test.go | Tests the talosconfig repair logic (including corruption injection). |
| pkg/svc/provisioner/cluster/talos/update.go | Adds early CA validation before constructing a Talos client from saved config. |
| pkg/svc/provisioner/cluster/talos/talosconfig_ca.go | Adds typed error + CA validation helper for saved talosconfig current context. |
| pkg/svc/provisioner/cluster/talos/talosconfig_ca_test.go | Tests CA validation behavior and typed error fields. |
| pkg/svc/provisioner/cluster/talos/export_test.go | Exposes CA validation helper for black-box tests via export seam. |
| pkg/cli/cmd/cluster/repair.go | Adds ksail cluster repair command and per-repair result rendering. |
| pkg/cli/cmd/cluster/repair_test.go | Smoke tests for the new repair command behavior. |
| pkg/cli/cmd/cluster/cluster.go | Wires the new repair subcommand under cluster. |
| docs/src/content/docs/troubleshooting.md | Adds troubleshooting guidance for the malformed CA failure + repair command. |
| docs/src/content/docs/cli-flags/cluster/cluster-root.mdx | Updates generated command list to include repair. |
| docs/src/content/docs/cli-flags/cluster/cluster-repair.mdx | Adds generated CLI flags page for ksail cluster repair. |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.50.
| Benchmark suite | Current: 227bcc2 | Previous: d05bb04 | Ratio |
|---|---|---|---|
BenchmarkCluster_MarshalJSON/FullProductionCluster (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) |
196276 ns/op 20673 B/op 465 allocs/op |
82654 ns/op 20661 B/op 465 allocs/op |
2.37 |
BenchmarkCluster_MarshalJSON/FullProductionCluster (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) - ns/op |
196276 ns/op |
82654 ns/op |
2.37 |
BenchmarkYAMLEncode/Minimal (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) |
183821 ns/op 21320 B/op 388 allocs/op |
63804 ns/op 21320 B/op 388 allocs/op |
2.88 |
BenchmarkYAMLEncode/Minimal (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) - ns/op |
183821 ns/op |
63804 ns/op |
2.88 |
BenchmarkYAMLEncode/FullProductionCluster (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) |
222353 ns/op 32448 B/op 433 allocs/op |
71887 ns/op 32448 B/op 433 allocs/op |
3.09 |
BenchmarkYAMLEncode/FullProductionCluster (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) - ns/op |
222353 ns/op |
71887 ns/op |
3.09 |
BenchmarkJSONEncode (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) |
168919 ns/op 15833 B/op 384 allocs/op |
61435 ns/op 15831 B/op 384 allocs/op |
2.75 |
BenchmarkJSONEncode (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) - ns/op |
168919 ns/op |
61435 ns/op |
2.75 |
BenchmarkPruneClusterDefaults/MostlyDefaults (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) |
48631 ns/op 7904 B/op 226 allocs/op |
28797 ns/op 7904 B/op 226 allocs/op |
1.69 |
BenchmarkPruneClusterDefaults/MostlyDefaults (github.com/devantler-tech/ksail/v7/pkg/apis/cluster/v1alpha1) - ns/op |
48631 ns/op |
28797 ns/op |
1.69 |
This comment was automatically generated by workflow using github-action-benchmark.
- Convert repairer registry to instance-based design with Default() singleton to eliminate global mutable state and allow parallel-safe tests. - Canonicalize talosconfig path with fsutil.EvalCanonicalPath after ExpandHomePath to prevent symlink-escape and TOCTOU. - Use fsutil.AtomicWriteFile for both backup and final writes so partial failures cannot leave a half-written talosconfig. - Generate unique backup paths with nanosecond timestamp + random suffix to avoid clobbering when repair runs more than once per second. - Thread a *repairer.Registry through cluster.NewRepairCmd so tests can inject isolated registries instead of mutating the package global. - Drop misleading test sentinels and rewrite the validateCurrentContextCA doc comment to match its actual nil-on-benign-cases behaviour. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Refactor talosconfig repairer with extracted helpers, sentinel errors, named constants, and explicit error patterns (no nolint directives). - Replace tryClientFromSavedConfig (client, bool, error) signature with (client, error) + sentinel errSavedTalosconfigUnavailable so callers use errors.Is to fall through to bundle (resolves revive error-last and nilerr without suppression). - Remove unused Reset method/shim from repairer (resolves Dead Code Analysis failure). - Replace init() side-effects with exported RegisterDefault wired once from cluster.go. - Test conventions: t.Parallel() everywhere, package-level sentinels without nolint, noinlineerr / wsl_v5 / lll / golines / varnamelen all cleared on touched files. - Allowlist gopkg.in/yaml.v3 in depguard (needed for yaml.Node round- tripping that preserves user formatting/comments). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Production code uses Default().Register / Registry.All() directly via the *Registry value, never the package-level shims. Removing them clears the deadcode analyzer. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address review feedback on PR #4645: - Registry.Register now scans existing repairs by Name() and is a no-op when a repair with the same name is already registered, so callers may invoke RegisterDefault on every NewRootCmd construction without accumulating duplicates. - createTalosClient now canonicalizes the talosconfig path via fsutil.EvalCanonicalPath after expanding ~, matching the path-safety convention used elsewhere in the provisioner. Falls back to the expanded path on canonicalization error so behavior degrades gracefully when the file does not yet exist. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| return r.summarizeNoRepair(path, unrepairable, alreadyValid) | ||
| } | ||
|
|
||
| return r.persistRepairedConfig(path, doc, data, repaired) |
| // through symlinks. EvalCanonicalPath falls back to the parent | ||
| // directory when the file itself does not yet exist, so a missing | ||
| // talosconfig is still reported through the os.ReadFile branch | ||
| // below rather than failing canonicalization. | ||
| path, err := fsutil.EvalCanonicalPath(expanded) | ||
| if err != nil { |
| // Each repair satisfies the Repair interface and is registered in the | ||
| // process via Register. The `ksail cluster repair` command iterates | ||
| // every registered repair and runs it. |
| // Canonicalize so symlinks resolve before opening the file (matches | ||
| // the path-safety convention used elsewhere in the provisioner). | ||
| canonicalPath, canonErr := fsutil.EvalCanonicalPath(talosconfigPath) | ||
| if canonErr == nil { | ||
| talosconfigPath = canonicalPath | ||
| } |

Problem
ksail cluster updateagainst a Talos cluster could fail deep insidetalosclient.Newwith an opaque error:Root cause: the CA certificate stored in
~/.talos/config(in the current context) was structurally malformed by exactly one byte — a BasicConstraints SEQUENCE-length field in the DER was0x0einstead of0x0f, causingcrypto/tlsRootCAs.AppendCertsFromPEMto silently returnfalse. The pattern was reproducible and recoverable by bumping that single byte.Changes
pkg/svc/repairer/— new opt-in registry for local-state repairs. Future repairs (kubeconfig, stale state files, etc.) can plug in viainit()without changing the command surface.pkg/svc/repairer/talosconfig/— first concrete repair. Detects the known DER signature30 0e 06 03 55 1d 13 01 01 ff 04 05 30 03 01 01and bumps the length byte to0x0f, after writing a timestamped backup. Idempotent; only touches contexts whose CA failsx509.ParseCertificateAND matches the recognised pattern.ksail cluster repair— new cobra command iterating registered repairs and reporting per-repair status (OK / repaired / unrepairable / skipped). Honors--talosconfigfor the talosconfig repair.pkg/svc/provisioner/cluster/talos/update.gocreateTalosClient: afterclientconfig.Open, the current-context CA is base64-decoded → PEM-decoded →x509.ParseCertificate'd. On failure it returns a typedMalformedTalosConfigCAError(with path, context name, and the parse error) telling the user to runksail cluster repair.## Talos Issuesand the regeneratedcli-flags/cluster/cluster-repair.mdx. Chat docs (pkg/svc/chat/docs_generated.go) regenerated.Tests
pkg/svc/repairer/repairer_test.go— registry behaviour, status table.pkg/svc/repairer/talosconfig/talosconfig_test.go— table-driven black-box tests that generate a real Ed25519 self-signed CA, locate the30 0f 06 03 …BasicConstraints SEQUENCE, bump 0x0f→0x0e to inject the corruption, and assert the repair fixes it.pkg/svc/provisioner/cluster/talos/talosconfig_ca_test.go— black-box tests for the early validator with valid/invalid CA fixtures.pkg/cli/cmd/cluster/repair_test.go— smoke tests for the new command (success, unrepairable, no-repairs-registered).All
go build ./...,go vet ./...,go test ./..., and the docs sitenpm run buildpass locally.Out of scope
Notes
clusterautomatically inherit theai.toolgen.consolidate=commandannotation, socluster repairshows up as part of thecluster_writeconsolidated MCP/chat tool with no extra wiring.