7 changes: 7 additions & 0 deletions .golangci.yml
@@ -44,6 +44,12 @@ linters:
      - linters:
          - gochecknoglobals
        path: pkg/cli/ui/chat/
      # repairer package uses a process-wide registry singleton and an
      # immutable byte-pattern constant that cannot be expressed as a Go
      # const because of its []byte type.
      - linters:
          - gochecknoglobals
        path: pkg/svc/repairer/
      # Package names that conflict with stdlib or are too generic cannot be renamed without breaking changes
      - linters:
          - revive
@@ -124,6 +130,7 @@ linters:
- golang.org/x/sync
- golang.org/x/term
- golang.org/x/text
- gopkg.in/yaml.v3
- helm.sh/helm
- k8s.io
- sigs.k8s.io
29 changes: 29 additions & 0 deletions docs/src/content/docs/cli-flags/cluster/cluster-repair.mdx
@@ -0,0 +1,29 @@
---
title: "ksail cluster repair"
description: "Repair local KSail/Talos state files"
---

{/* This page is auto-generated by go generate ./docs/... — DO NOT EDIT */}

```text
Detect and repair known corruption patterns in local state files.

Currently supported repairs:
- talosconfig-ca: fixes a single-byte BasicConstraints corruption in
the Talos talosconfig CA that prevents 'cluster update' from
establishing a Talos client.

Each repair is idempotent and writes a timestamped backup of any file
it modifies.

Usage:
ksail cluster repair [flags]

Flags:
--talosconfig string path to talosconfig (default: ~/.talos/config)

Global Flags:
--benchmark Show per-activity benchmark output
--config string Path to config file (default: ksail.yaml found via directory traversal)

```
1 change: 1 addition & 0 deletions docs/src/content/docs/cli-flags/cluster/cluster-root.mdx
@@ -21,6 +21,7 @@ Available Commands:
info Display cluster information
init Initialize a new project
list List clusters
repair Repair local KSail/Talos state files
restore Restore cluster resources from backup
start Start a stopped cluster
stop Stop a running cluster
30 changes: 30 additions & 0 deletions docs/src/content/docs/troubleshooting.md
@@ -193,6 +193,36 @@ KSail automatically retries transient Talos node image pull failures (up to 3 at

If all retries fail, check your internet connection and `ghcr.io` availability with `curl -I https://ghcr.io/v2/`, then retry with `ksail cluster delete && ksail cluster create`.

### "failed to append CA certificate to RootCAs pool" on `cluster update`

`ksail cluster update` against a Talos cluster fails with:

```text
failed to apply updates: failed to sync cluster secrets:
failed to create Talos client for secret sync:
failed to create Talos client from saved config:
failed to create client connection:
failed to append CA certificate to RootCAs pool
```

This means the CA certificate stored under the current context in `~/.talos/config` is structurally malformed. KSail validates the saved CA before opening a Talos client and surfaces the path, context name, and underlying X.509 parse error.

**Recover automatically:**

```bash
ksail cluster repair
```

The `talosconfig-ca` repair detects a known single-byte BasicConstraints corruption pattern, fixes it in place, and writes a timestamped backup (`~/.talos/config.bak.<timestamp>`) before overwriting. The repair is idempotent and only modifies CA bytes whose corruption it recognises.

**Verify manually** (optional):

```bash
yq '.contexts.<context>.ca' ~/.talos/config | base64 -d | openssl x509 -noout -text
```

If neither the repair nor restoring a backup works, regenerate the talosconfig by re-running `ksail cluster create`. Note that this requires destroying and recreating the cluster.

## VCluster Issues

### Transient Startup Failures
4 changes: 4 additions & 0 deletions pkg/cli/cmd/cluster/cluster.go
@@ -61,6 +61,8 @@ import (
clusterprovisioner "github.com/devantler-tech/ksail/v7/pkg/svc/provisioner/cluster"
"github.com/devantler-tech/ksail/v7/pkg/svc/provisioner/cluster/clustererr"
"github.com/devantler-tech/ksail/v7/pkg/svc/provisioner/cluster/clusterupdate"
"github.com/devantler-tech/ksail/v7/pkg/svc/repairer"
talosconfigrepair "github.com/devantler-tech/ksail/v7/pkg/svc/repairer/talosconfig"
"github.com/devantler-tech/ksail/v7/pkg/svc/state"
"github.com/devantler-tech/ksail/v7/pkg/svc/versionresolver"
"github.com/devantler-tech/ksail/v7/pkg/timer"
@@ -1025,6 +1027,8 @@ func NewClusterCmd(runtimeContainer *di.Runtime) *cobra.Command {
cmd.AddCommand(NewBackupCmd(runtimeContainer))
cmd.AddCommand(NewRestoreCmd(runtimeContainer))
cmd.AddCommand(NewSwitchCmd(runtimeContainer))
talosconfigrepair.RegisterDefault(repairer.Default())
cmd.AddCommand(NewRepairCmd(runtimeContainer, repairer.Default()))

return cmd
}
139 changes: 139 additions & 0 deletions pkg/cli/cmd/cluster/repair.go
@@ -0,0 +1,139 @@
package cluster

import (
"context"
"errors"

"github.com/devantler-tech/ksail/v7/pkg/di"
"github.com/devantler-tech/ksail/v7/pkg/notify"
"github.com/devantler-tech/ksail/v7/pkg/svc/repairer"
talosconfigrepair "github.com/devantler-tech/ksail/v7/pkg/svc/repairer/talosconfig"
"github.com/spf13/cobra"
)

// NewRepairCmd creates the `ksail cluster repair` command, backed by
// the supplied [repairer.Registry]. Pass [repairer.Default] for normal
// operation; tests can pass an isolated registry from
// [repairer.NewRegistry] to avoid cross-package contention.
//
// The command runs every [repairer.Repair] registered with the
// supplied registry, printing one status line per repair. It is
// idempotent and safe to run repeatedly. The first registered repair
// fixes a known single-byte corruption in Talos talosconfig CA bytes
// that produces:
//
// failed to append CA certificate to RootCAs pool
//
// during `ksail cluster update`.
func NewRepairCmd(_ *di.Runtime, registry *repairer.Registry) *cobra.Command {
if registry == nil {
registry = repairer.Default()
}

var talosconfigPath string

cmd := &cobra.Command{
Use: "repair",
Short: "Repair local KSail/Talos state files",
Long: `Detect and repair known corruption patterns in local state files.

Currently supported repairs:
- talosconfig-ca: fixes a single-byte BasicConstraints corruption in
the Talos talosconfig CA that prevents 'cluster update' from
establishing a Talos client.

Each repair is idempotent and writes a timestamped backup of any file
it modifies.`,
SilenceUsage: true,
RunE: func(cmd *cobra.Command, _ []string) error {
return runRepair(cmd.Context(), cmd, registry, talosconfigPath)
},
}

cmd.Flags().StringVar(
&talosconfigPath,
"talosconfig",
"",
"path to talosconfig (default: ~/.talos/config)",
)

return cmd
}

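// runRepair runs every repair in the registry, prints one status line per
// repair, and returns errRepairsFailed if any repair reports an error or
// [repairer.StatusUnrepairable].
func runRepair(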
ctx context.Context,
cmd *cobra.Command,
registry *repairer.Registry,
talosconfigPath string,
) error {
out := cmd.OutOrStdout()

repairs := registry.All()
configurePerRepairOptions(repairs, talosconfigPath)

if len(repairs) == 0 {
notify.Activityf(out, "no repairs registered")

return nil
}

var hadError bool

for _, r := range repairs {
notify.Activityf(out, "running repair %q...", r.Name())

result := r.Run(ctx, out)
printRepairResult(cmd, result)

if result.Err != nil || result.Status == repairer.StatusUnrepairable {
hadError = true
}
}

if hadError {
return errRepairsFailed
}

return nil
}

// errRepairsFailed signals that at least one repair returned an error
// or [repairer.StatusUnrepairable]. Cobra picks this up via RunE and
// surfaces it as a non-zero exit.
var errRepairsFailed = errors.New("one or more repairs reported failures")

// configurePerRepairOptions threads CLI flags into individual repair
// implementations that need them. Today only the talosconfig repair
// reads --talosconfig.
func configurePerRepairOptions(repairs []repairer.Repair, talosconfigPath string) {
if talosconfigPath == "" {
return
}

for _, r := range repairs {
if tc, ok := r.(*talosconfigrepair.Repair); ok {
tc.Path = talosconfigPath
}
}
}

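// printRepairResult writes a status line for result, followed by an error
// line when result.Err is set.
func printRepairResult(cmd *cobra.Command, result repairer.Result) {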
out := cmd.OutOrStdout()

switch result.Status {
case repairer.StatusOK:
notify.Successf(out, "[%s] %s", result.Name, result.Detail)
case repairer.StatusRepaired:
notify.Successf(out, "[%s] %s (backup: %s)", result.Name, result.Detail, result.BackupPath)
case repairer.StatusUnrepairable:
notify.Warningf(out, "[%s] %s", result.Name, result.Detail)
case repairer.StatusSkipped:
notify.Activityf(out, "[%s] %s", result.Name, result.Detail)
default:
notify.Activityf(out, "[%s] %s (status=%s)", result.Name, result.Detail, result.Status)
}

if result.Err != nil {
notify.Errorf(out, "[%s] error: %v", result.Name, result.Err)
}
}
101 changes: 101 additions & 0 deletions pkg/cli/cmd/cluster/repair_test.go
@@ -0,0 +1,101 @@
package cluster_test

import (
"bytes"
"context"
"errors"
"io"
"strings"
"testing"

clustercmd "github.com/devantler-tech/ksail/v7/pkg/cli/cmd/cluster"
"github.com/devantler-tech/ksail/v7/pkg/svc/repairer"
)

// stubRepair is a repairer.Repair test double that returns a fixed result.
type stubRepair struct {
name string
result repairer.Result
}

func (s *stubRepair) Name() string { return s.name }

func (s *stubRepair) Run(_ context.Context, _ io.Writer) repairer.Result {
return s.result
}

func TestRepairCmd_RunsRegisteredRepairs(t *testing.T) {
t.Parallel()

reg := repairer.NewRegistry()
reg.Register(&stubRepair{
name: "fake-ok",
result: repairer.Result{Name: "fake-ok", Status: repairer.StatusOK, Detail: "all good"},
})

cmd := clustercmd.NewRepairCmd(nil, reg)
cmd.SetContext(context.Background())

var out bytes.Buffer
cmd.SetOut(&out)
cmd.SetErr(&out)
cmd.SetArgs([]string{})

err := cmd.Execute()
if err != nil {
t.Fatalf("execute: %v\nout: %s", err, out.String())
}

if !strings.Contains(out.String(), "fake-ok") {
t.Fatalf("expected fake-ok in output: %s", out.String())
}
}

// errStubFailure is a sentinel used by stub repairs in failure-path tests.
var errStubFailure = errors.New("stub repair failed")

func TestRepairCmd_FailsOnUnrepairable(t *testing.T) {
t.Parallel()

reg := repairer.NewRegistry()
reg.Register(&stubRepair{name: "broken", result: repairer.Result{
Name: "broken",
Status: repairer.StatusUnrepairable,
Detail: "cannot fix",
Err: errStubFailure,
}})

cmd := clustercmd.NewRepairCmd(nil, reg)
cmd.SetContext(context.Background())

var out bytes.Buffer
cmd.SetOut(&out)
cmd.SetErr(&out)
cmd.SetArgs([]string{})

err := cmd.Execute()
if err == nil {
t.Fatalf("expected non-nil error, got nil; out: %s", out.String())
}
}

func TestRepairCmd_NoRepairsRegistered(t *testing.T) {
t.Parallel()

reg := repairer.NewRegistry()

cmd := clustercmd.NewRepairCmd(nil, reg)
cmd.SetContext(context.Background())

var out bytes.Buffer

cmd.SetOut(&out)
cmd.SetErr(&out)
cmd.SetArgs([]string{})

err := cmd.Execute()
if err != nil {
t.Fatalf("expected nil err, got %v", err)
}

if !strings.Contains(out.String(), "no repairs registered") {
t.Fatalf("expected 'no repairs registered' in output: %s", out.String())
}
}
2 changes: 1 addition & 1 deletion pkg/svc/chat/docs_generated.go

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions pkg/svc/provisioner/cluster/talos/export_test.go
@@ -325,3 +325,8 @@ func MergeTalosconfigBytesForTest(talosconfigPath string, newData []byte) error
//
//nolint:gochecknoglobals // export_test.go pattern exposes internal helpers as globals.
var DetectHetznerServerTypesForTest = detectHetznerServerTypes

// ValidateCurrentContextCAForTest exposes validateCurrentContextCA for unit testing.
//
//nolint:gochecknoglobals // export_test.go pattern exposes internal helpers as globals.
var ValidateCurrentContextCAForTest = validateCurrentContextCA