diff --git a/.gitignore b/.gitignore index 506b6dc3..aaea69b3 100644 --- a/.gitignore +++ b/.gitignore @@ -26,3 +26,5 @@ __pycache__/ *.py[cod] .agents/ pyrightconfig.json +docs/clickhouse-backup-v2-design-state.md +docs/superpowers/ diff --git a/ChangeLog.md b/ChangeLog.md index 315aa70f..c60b82ec 100644 --- a/ChangeLog.md +++ b/ChangeLog.md @@ -1,3 +1,16 @@ +# vNEXT (unreleased) + +BREAKING CHANGES + +- ⚠️ **DO NOT downgrade to a pre-CAS binary if CAS data exists in your bucket.** The pre-CAS binary has no knowledge of the `cas/` skip prefix and will treat the CAS namespace as a broken v1 backup. The next `clean remote_broken` run, or `BackupsToKeepRemote` retention cron, will silently DELETE all CAS data. Recovery procedures: see [docs/cas-operator-runbook.md](docs/cas-operator-runbook.md) "Binary rollback procedure". +- The `pkg/storage.RemoteStorage` interface gains two required methods: `PutFileAbsoluteIfAbsent(ctx, key, r, size) (created bool, err error)` and `PutFileIfAbsent(ctx, key, r, size) (created bool, err error)`. Any third-party `RemoteStorage` implementation must add these methods to compile. Implementors that don't support atomic create-only-if-absent should return `pkg/storage.ErrConditionalPutNotSupported`; CAS commands then refuse on those backends unless `cas.allow_unsafe_markers=true`. +- The `pkg/storage.BackupDestination.BackupList` signature gains a fourth `skipPrefixes []string` parameter. External callers must pass `nil` (or the result of `cas.Config.SkipPrefixes()`) to compile. Internal callers in this repo are updated. +- A v1 backup literally named `"cas"` will be silently filtered after upgrade (the default `cas.root_prefix` is `"cas/"`). Rename or move any such backup before upgrading. The new binary logs an ERROR for each skipped entry and rejects future creation of names that collide with the CAS skip-prefix. + +NEW FEATURES + +- add experimental Content-Addressable Storage (CAS) backups via new `cas-upload`, `cas-download`, `cas-restore`, `cas-delete`, `cas-verify`, `cas-prune`, `cas-status` commands. CAS deduplicates file content across backups (especially effective for mutated parts) and removes the incremental-chain dependency — every CAS backup is independently restorable. Available in CLI and REST API. Configure via new `cas:` config block; see [docs/cas-design.md](docs/cas-design.md) and [docs/cas-operator-runbook.md](docs/cas-operator-runbook.md). Object-disk and client-side-encryption tables not yet supported. + # v2.6.43 NEW FEATURES diff --git a/ReadMe.md b/ReadMe.md index bbd64e2e..2a964daf 100644 --- a/ReadMe.md +++ b/ReadMe.md @@ -29,6 +29,45 @@ For that reason, it's required to run `clickhouse-backup` on the same host or sa - **Support for multi disks installations** - **Support for custom remote storage types via `rclone`, `kopia`, `restic`, `rsync` etc** - **Support for incremental backups on remote storage** +- **Smart deduplicating backups** with the `cas-*` commands — every backup is independent, only changed data is uploaded, and mutations don't blow up your storage bill (see below) + +## Smart deduplicating backups (opt-in, ⚠️ EXPERIMENTAL) + +> **EXPERIMENTAL.** The `cas-*` commands and on-disk layout are still under active development; future releases may bump `LayoutVersion` in a way that requires re-uploading existing CAS backups. Do not rely on CAS as the sole copy of production data yet — keep a parallel v1 backup (or a copy outside the CAS namespace) until the feature is marked stable. Evaluate it on non-critical workloads first; report issues. See [`docs/cas-design.md`](docs/cas-design.md) for the full design. + +Most backup tools force a tradeoff: full backups eat storage and bandwidth, while incremental backups are smaller but chain together — losing or rotating the wrong base backup breaks every dependent restore. ClickHouse mutations make this worse: a single `ALTER TABLE ... UPDATE` can rewrite one column and rename the part, leaving 99% of the bytes identical to the previous version but invisible to chain-based dedup. + +The `cas-*` commands (`cas-upload`, `cas-download`, `cas-restore`, `cas-delete`, `cas-verify`, `cas-status`, `cas-prune`) use **content-addressed storage** to solve both problems. Files are keyed by their content hash, so identical bytes are stored once and shared across every backup that contains them — across mutations, across days, across tables. The result: + +- **Smaller uploads than incremental, no base-backup dependency.** Each `cas-upload` only transfers files whose content isn't already in the remote — typically a small fraction of a full backup. Unlike incremental backups, every CAS backup is independently restorable. Delete any backup at any time without affecting the others. +- **Mutation-friendly.** An `ALTER UPDATE` on one column reuses every other column's bytes; the second backup uploads only the changed column. +- **Storage grows with new data, not with the number of backups.** Keeping 30 daily snapshots of a slowly-changing dataset costs roughly the same as keeping one. + +### Quick start + +In `config.yml`: + +```yaml +cas: + enabled: true + cluster_id: my-prod-cluster # required; identifies this source cluster +``` + +Then: + +```sh +clickhouse-backup create my_backup # snapshot the data locally +clickhouse-backup cas-upload my_backup # push to remote (only new content) +clickhouse-backup cas-status # see counts, sizes, in-flight uploads +clickhouse-backup cas-restore my_backup # restore (any backup, any time) +clickhouse-backup cas-delete my_backup # remove the backup's metadata atomically +clickhouse-backup cas-prune # reclaim blob bytes left behind by deletes +clickhouse-backup cas-verify my_backup # cheap integrity check (HEAD + size) +``` + +`cas-delete` only removes the per-backup metadata; the blob bytes are reclaimed by the periodic `cas-prune` mark-and-sweep GC. See [`docs/cas-operator-runbook.md`](docs/cas-operator-runbook.md) for cadence, monitoring, and recovery from a stranded prune marker. + +CAS backups live under their own prefix in the remote bucket and don't interfere with the existing `upload` / `download` / `restore` commands — you can mix both in the same bucket if needed. ## Limitations @@ -637,6 +676,11 @@ Display a list of all operations from start of API server: `curl -s localhost:71 - Optional string query argument `filter` to filter actions on server side. - Optional string query argument `last` to show only the last `N` actions. +### CAS endpoints + +For CAS commands (`cas-upload`, `cas-restore`, etc.), see the corresponding +`/backup/cas-*` endpoints documented in [docs/cas-operator-runbook.md](docs/cas-operator-runbook.md). + ## Examples - [Simple cron script for daily backups and remote upload](Examples.md#simple-cron-script-for-daily-backups-and-remote-upload) diff --git a/cmd/clickhouse-backup/cas_commands.go b/cmd/clickhouse-backup/cas_commands.go new file mode 100644 index 00000000..0ce007c8 --- /dev/null +++ b/cmd/clickhouse-backup/cas_commands.go @@ -0,0 +1,254 @@ +package main + +import ( + "fmt" + "time" + + "github.com/urfave/cli" + + "github.com/Altinity/clickhouse-backup/v2/pkg/backup" + "github.com/Altinity/clickhouse-backup/v2/pkg/config" +) + +// resolveWaitForPrune returns the --wait-for-prune CLI value if set, otherwise +// falls back to the configured cas.wait_for_prune value. +func resolveWaitForPrune(c *cli.Context, cfg *config.Config) (time.Duration, error) { + if v := c.String("wait-for-prune"); v != "" { + d, err := time.ParseDuration(v) + if err != nil { + return 0, fmt.Errorf("--wait-for-prune: %w", err) + } + return d, nil + } + return cfg.CAS.WaitForPruneDuration(), nil +} + +// casCommands returns the seven cas-* CLI subcommands (six implemented + the +// cas-prune Phase-2 stub). rootFlags is the slice of global flags from main.go +// (passed via the same append-pattern as the existing v1 commands). +func casCommands(rootFlags []cli.Flag) []cli.Command { + return []cli.Command{ + { + Name: "cas-upload", + Usage: "Upload a local backup using the content-addressable layout (see docs/cas-design.md)", + UsageText: "clickhouse-backup cas-upload [--skip-object-disks] [--dry-run] [--unlock] ", + Description: "Upload a backup created by 'clickhouse-backup create' using the CAS layout. Blobs are content-keyed via per-part checksums.txt; small files are packed into per-table tar.zstd archives. CAS dedupes across mutations and across backups; every backup is independently restorable. Requires cas.enabled=true and cas.cluster_id configured.\n\n --unlock removes a stranded inprogress marker for (left behind by SIGKILL/OOM) and exits immediately without uploading. Incompatible with --dry-run and --skip-object-disks.", + Action: func(c *cli.Context) error { + cfg := config.GetConfigFromCli(c) + wait, err := resolveWaitForPrune(c, cfg) + if err != nil { + return err + } + b := backup.NewBackuper(cfg) + return b.CASUpload(c.Args().First(), c.Bool("skip-object-disks"), c.Bool("dry-run"), c.Bool("unlock"), version, c.Int("command-id"), wait) + }, + Flags: append(rootFlags, + cli.BoolFlag{ + Name: "skip-object-disks", + Usage: "Exclude tables on object disks (s3/azure/hdfs/web) instead of refusing the upload", + }, + cli.BoolFlag{ + Name: "dry-run", + Usage: "Plan the upload without writing anything to remote storage", + }, + cli.StringFlag{ + Name: "wait-for-prune", + Usage: `If a prune is in progress, wait up to this duration (Go duration string, e.g. "5m") before giving up. Overrides cas.wait_for_prune. Empty = use config; "0s" = don't wait.`, + }, + cli.BoolFlag{ + Name: "unlock", + Usage: "Remove a stranded inprogress marker for (self-service recovery after SIGKILL/OOM). Incompatible with --dry-run and --skip-object-disks. Does NOT perform an upload.", + }, + ), + }, + { + Name: "cas-download", + Usage: "Materialize a CAS backup into the local data directory (does not load into ClickHouse)", + UsageText: "clickhouse-backup cas-download [-t, --tables=.] [--partitions=] [-s, --schema] ", + Description: "Download a CAS-layout backup into /backup//. Use cas-restore (or v1 restore) to load tables into ClickHouse from the materialized directory.", + Action: func(c *cli.Context) error { + b := backup.NewBackuper(config.GetConfigFromCli(c)) + return b.CASDownload(c.Args().First(), c.String("tables"), c.StringSlice("partitions"), c.Bool("schema"), c.Bool("data"), version, c.Int("command-id")) + }, + Flags: append(rootFlags, + cli.StringFlag{ + Name: "table, tables, t", + Usage: "Restrict to tables matching db.table (comma-separated, exact match in CAS v1)", + }, + cli.StringSliceFlag{ + Name: "partitions", + Usage: "Restrict to part names (comma-separated)", + }, + cli.BoolFlag{ + Name: "schema, schema-only, s", + Usage: "Schema-only: write JSON metadata locally and skip part archives + blobs", + }, + cli.BoolFlag{ + Name: "data, d", + Hidden: true, + Usage: "Reserved (currently a no-op); will gate data-only download in a future version", + }, + ), + }, + { + Name: "cas-restore", + Usage: "Download a CAS backup and restore tables into ClickHouse", + UsageText: "clickhouse-backup cas-restore [-t, --tables=.
] [-m, --restore-database-mapping=:[,...]] [--tm, --restore-table-mapping=:[,...]] [--partitions=] [-s, --schema] [-d, --data] [--rm, --drop] [--restore-schema-as-attach] [--replicated-copy-to-detached] [--skip-empty-tables] [--resume] ", + Description: "Pulls the named CAS backup into the local backup directory and runs the v1 restore flow against it. --ignore-dependencies is rejected: CAS backups have no dependency chain. RBAC/configs/named-collections are out of scope for CAS v1.", + Action: func(c *cli.Context) error { + b := backup.NewBackuper(config.GetConfigFromCli(c)) + return b.CASRestore( + c.Args().First(), + c.String("tables"), + c.StringSlice("restore-database-mapping"), + c.StringSlice("restore-table-mapping"), + c.StringSlice("partitions"), + c.StringSlice("skip-projections"), + c.Bool("schema"), + c.Bool("data"), + c.Bool("drop"), + c.Bool("ignore-dependencies"), + c.Bool("restore-schema-as-attach"), + c.Bool("replicated-copy-to-detached"), + c.Bool("skip-empty-tables"), + c.Bool("resume"), + version, + c.Int("command-id"), + ) + }, + Flags: append(rootFlags, + cli.StringFlag{ + Name: "table, tables, t", + Usage: "Restrict to tables matching db.table (comma-separated, exact match in CAS v1)", + }, + cli.StringSliceFlag{ + Name: "restore-database-mapping, m", + Usage: "Database rename rules at restore time, format : (repeatable or comma-separated)", + }, + cli.StringSliceFlag{ + Name: "restore-table-mapping, tm", + Usage: "Table rename rules at restore time, format : (repeatable or comma-separated)", + }, + cli.StringSliceFlag{ + Name: "partitions", + Usage: "Restrict to part names (comma-separated)", + }, + cli.StringSliceFlag{ + Name: "skip-projections", + Usage: "Skip listed projections during restore, format `db_pattern.table_pattern:projections_pattern`", + }, + cli.BoolFlag{ + Name: "schema, s", + Usage: "Restore schema only", + }, + cli.BoolFlag{ + Name: "data, d", + Usage: "Restore data only", + }, + cli.BoolFlag{ + Name: "rm, drop", + Usage: "Drop existing schema objects before restore", + }, + cli.BoolFlag{ + Name: "i, ignore-dependencies", + Usage: "(rejected for CAS backups; accepted for CLI parity with 'restore')", + Hidden: true, + }, + cli.BoolFlag{ + Name: "restore-schema-as-attach", + Usage: "Use DETACH/ATTACH instead of DROP/CREATE for schema restoration", + }, + cli.BoolFlag{ + Name: "replicated-copy-to-detached", + Usage: "Copy data to detached folder for Replicated*MergeTree tables but skip ATTACH PART step", + }, + cli.BoolFlag{ + Name: "skip-empty-tables", + Usage: "Skip restoring tables that have no data (empty tables with only schema)", + }, + cli.BoolFlag{ + Name: "resume, resumable", + Usage: "Save intermediate state and resume restore on retry", + }, + ), + }, + { + Name: "cas-delete", + Usage: "Delete a CAS backup's metadata subtree (Phase 1: blobs are NOT reclaimed)", + UsageText: "clickhouse-backup cas-delete ", + Description: "Removes the named backup atomically by deleting metadata.json first, then the rest of the metadata subtree. Blob bytes are NOT reclaimed in Phase 1 — that ships with cas-prune in Phase 2; until then, deleted-backup blobs accumulate in remote storage.", + Action: func(c *cli.Context) error { + cfg := config.GetConfigFromCli(c) + wait, err := resolveWaitForPrune(c, cfg) + if err != nil { + return err + } + b := backup.NewBackuper(cfg) + return b.CASDelete(c.Args().First(), c.Int("command-id"), wait) + }, + Flags: append(rootFlags, + cli.StringFlag{ + Name: "wait-for-prune", + Usage: `If a prune is in progress, wait up to this duration (Go duration string, e.g. "5m") before giving up. Overrides cas.wait_for_prune. Empty = use config; "0s" = don't wait.`, + }, + ), + }, + { + Name: "cas-verify", + Usage: "HEAD-check every blob referenced by a CAS backup", + UsageText: "clickhouse-backup cas-verify [--json] ", + Description: "Walks the per-table archives, parses every checksums.txt, and HEAD-checks each referenced blob's existence and size. Exits non-zero if any failures are detected.", + Action: func(c *cli.Context) error { + b := backup.NewBackuper(config.GetConfigFromCli(c)) + return b.CASVerify(c.Args().First(), c.Bool("json"), c.Int("command-id")) + }, + Flags: append(rootFlags, + cli.BoolFlag{ + Name: "json", + Usage: "Emit one JSON object per failure instead of human-readable lines", + }, + ), + }, + { + Name: "cas-status", + Usage: "Print a LIST-only health summary for the configured CAS cluster", + UsageText: "clickhouse-backup cas-status", + Description: "Counts backups and blobs, reports the prune marker (if any), and lists fresh / abandoned in-progress upload markers. No object bodies are fetched.", + Action: func(c *cli.Context) error { + b := backup.NewBackuper(config.GetConfigFromCli(c)) + return b.CASStatus(c.Int("command-id")) + }, + Flags: rootFlags, + }, + { + Name: "cas-prune", + Usage: "Garbage-collect orphan blobs (mark-and-sweep) for the configured CAS cluster", + UsageText: "clickhouse-backup cas-prune [--dry-run] [--grace-blob=] [--abandon-threshold=] [--unlock]", + Description: "Mark-and-sweep GC: walks every live backup's per-table archives, builds a sorted on-disk reference set, then lists the blob store and deletes orphans older than cas.grace_blob. Holds an advisory cas//prune.marker — concurrent cas-upload and cas-delete refuse while it's held. See docs/cas-design.md §6.7 and docs/cas-operator-runbook.md.", + Action: func(c *cli.Context) error { + b := backup.NewBackuper(config.GetConfigFromCli(c)) + return b.CASPrune(c.Bool("dry-run"), c.String("grace-blob"), c.String("abandon-threshold"), c.Bool("unlock"), c.Int("command-id")) + }, + Flags: append(rootFlags, + cli.BoolFlag{ + Name: "dry-run", + Usage: "Print orphan candidates without deleting anything (no marker is written)", + }, + cli.StringFlag{ + Name: "grace-blob", + Value: "", + Usage: "Override cas.grace_blob — Go duration string (e.g. \"24h\", \"30m\", \"0s\"). Empty (default) uses the configured value.", + }, + cli.StringFlag{ + Name: "abandon-threshold", + Value: "", + Usage: "Override cas.abandon_threshold — Go duration string (e.g. \"168h\", \"0s\"). Empty (default) uses the configured value.", + }, + cli.BoolFlag{ + Name: "unlock", + Usage: "Delete a stranded cas//prune.marker (escape hatch when SIGKILL/OOM left it behind). Refuses if no marker is present.", + }, + ), + }, + } +} diff --git a/cmd/clickhouse-backup/main.go b/cmd/clickhouse-backup/main.go index 99589137..f57aa675 100644 --- a/cmd/clickhouse-backup/main.go +++ b/cmd/clickhouse-backup/main.go @@ -13,6 +13,7 @@ import ( "github.com/urfave/cli" "github.com/Altinity/clickhouse-backup/v2/pkg/backup" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" "github.com/Altinity/clickhouse-backup/v2/pkg/config" "github.com/Altinity/clickhouse-backup/v2/pkg/log_helper" "github.com/Altinity/clickhouse-backup/v2/pkg/server" @@ -36,6 +37,9 @@ func main() { cliapp.UsageText = "clickhouse-backup [-t, --tables=.
] " cliapp.Description = "Run as 'root' or 'clickhouse' user" cliapp.Version = version + // Wire the build version into CAS marker JSON (inprogress / prune + // markers carry this for forensic context — see pkg/cas/markers.go). + cas.SetMarkerTool(fmt.Sprintf("clickhouse-backup/%s", version)) // @todo add GCS and Azure support when resolve https://github.com/googleapis/google-cloud-go/issues/8169 and https://github.com/Azure/azure-sdk-for-go/issues/21047 if strings.HasSuffix(version, "fips") { _ = os.Setenv("AWS_USE_FIPS_ENDPOINT", "true") @@ -279,9 +283,10 @@ func main() { ), }, { - Name: "upload", - Usage: "Upload backup to remote storage", - UsageText: "clickhouse-backup upload [-t, --tables=.
] [--partitions=] [-s, --schema] [--diff-from=] [--diff-from-remote=] [--resumable] ", + Name: "upload", + Usage: "Upload backup to remote storage", + UsageText: "clickhouse-backup upload [-t, --tables=.
] [--partitions=] [-s, --schema] [--diff-from=] [--diff-from-remote=] [--resumable] ", + Description: "Upload a local backup to remote storage using the v1 layout (per-part archives + RequiredBackup chain for incrementals).\n\nIf you back up frequently or run mutations, consider `cas-upload` instead: it deduplicates content across backups, every backup is independent (no incremental chain), and only changed data is uploaded.", Action: func(c *cli.Context) error { b := backup.NewBackuper(config.GetConfigFromCli(c)) return b.Upload(c.Args().First(), c.Bool("delete-source"), c.String("diff-from"), c.String("diff-from-remote"), c.String("t"), c.StringSlice("partitions"), c.StringSlice("skip-projections"), c.Bool("schema"), c.Bool("rbac-only"), c.Bool("configs-only"), c.Bool("named-collections-only"), c.Bool("resume"), version, c.Int("command-id")) @@ -822,6 +827,7 @@ func main() { }), }, } + cliapp.Commands = append(cliapp.Commands, casCommands(cliapp.Flags)...) if err := cliapp.Run(os.Args); err != nil { log.Fatal().Stack().Err(err).Send() } diff --git a/docs/cas-design.md b/docs/cas-design.md new file mode 100644 index 00000000..7f077963 --- /dev/null +++ b/docs/cas-design.md @@ -0,0 +1,726 @@ +# Content-Addressable Storage (CAS) Layout for clickhouse-backup + +> ⚠️ **EXPERIMENTAL.** CAS commands and the on-disk layout are still under active development. The `LayoutVersion` may change in a way that requires re-uploading existing CAS backups before adopting a newer release. Do not treat CAS as the sole copy of production data yet — keep a parallel v1 backup (or a copy outside the CAS namespace) until the feature is marked stable. Operators are encouraged to evaluate it on non-critical workloads, report issues, and watch the changelog for compatibility notes. + +**Status**: Phases 1–8 shipped on branch `cas-phase1`. Commands implemented: `cas-{upload,download,restore,delete,verify,status,prune}` available both via CLI and REST API in daemon mode. Phase 3 added projection-aware planner + cross-mode guards; Phase 4 added atomic markers (S3 IfNoneMatch, SFTP O_EXCL, native conditional create on Azure/GCS/COS, refuse-by-default on FTP); Phase 5 added per-backend integration smoke tests across MinIO/Azurite/fake-gcs/SFTP; Phase 6 closed the P1 defects from the second external review wave; Phase 7 (cleanup round) closed the ColdList TOCTOU window, populated `PruneReport` counters, added defensive `cfg.Validate()` at Prune entry, and added focused per-backend not-found tests; Phase 8 added `wait_for_prune` (poll-and-wait for CAS upload/delete when prune is in flight) and wired all CAS commands through the REST API (dedicated routes + `/backup/actions` verbs + list-merge with `kind` field). +**Author**: Mikhail Filimonov, drafted with design-interview support +**Last updated**: 2026-05-08 + +## 1. Summary + +A new set of commands — `cas-upload`, `cas-download`, `cas-restore`, `cas-delete`, `cas-prune`, `cas-verify` — that store backups to remote object storage in a **content-addressable layout**. Files are keyed by their content hash (CityHash128, sourced from ClickHouse's per-part `checksums.txt`); identical content is stored once and referenced from any number of backups. Garbage collection is a separate, eventual mark-and-sweep step. + +The new commands run side-by-side with the existing `upload` / `download` / `delete` commands and use a separate top-level prefix in the target bucket, so v1 and CAS backups never share namespace. + +### 1.1 When to use CAS vs. v1 + +Pick CAS when: +- Tables are mutation-heavy (CAS deduplicates the unchanged columns of mutated parts; v1 re-uploads). +- You want every backup to be independently restorable (CAS has no incremental chain; removing one backup never affects another). +- You expect many backups over time and want storage to grow with *new* data, not with the number of backups. + +Stick with v1 when: +- Tables include **object-disk** disks (s3/azure/hdfs object disks) — CAS does not support these in v1. +- You currently use **client-side encryption** — CAS v1 supports bucket-level encryption only (see §3). Operators using v1's client-side encryption cannot move to CAS until convergent encryption ships in a later version. +- You're already happy with v1's incremental chain and don't want to change. +- You need a feature CAS hasn't implemented yet (see §3 non-goals). + +CAS backups and v1 backups can coexist in the same bucket under different prefixes; they never see each other's data. There is no migration tool — opt in by writing new backups with `cas-upload`. + +### 1.2 Mental model + +CAS backups are **independent**. There is no `RequiredBackup` chain. Removing backup A never affects backup B's restorability. A blob remains in the store as long as any backup references it (refcounting is implicit via mark-and-sweep). This is the most important difference from v1; surface it everywhere user-facing (README, `--help` text). + +## 2. Goals + +- **Deduplicate across mutations**: ClickHouse mutations create new part names whose underlying files are mostly identical (often hardlinked) to the source part. Today's tool re-uploads them. CAS reuses the existing blob on the target. +- **Eliminate the incremental chain dependency**: every CAS backup is independently restorable. No `RequiredBackup` pointer; no chain unrolling. +- **Reduce full-backup wall-clock and cost** by uploading only blobs not already in the target. +- **Reuse existing infrastructure**: the storage abstraction (`BackupDestination`), table walker, multipart upload, retry, throttling, `BackupMetadata` / `TableMetadata` structs are reused as-is. +- **Don't break v1**: separate commands, separate prefix, no behavioral changes to existing code paths. +- **Preserve the relevant restore CLI surface**: `--tables`, `--partitions`, `--schema-only`, `--data-only`, `--restore-database-mapping`, `--restore-table-mapping`, `--rm`, `--restore-schema-as-attach` all work unchanged for CAS backups. `--ignore-dependencies` is rejected with an error (CAS backups have no chain — see §6.10). + +## 3. Non-goals (v1 of CAS) + +- Distributed locking. Operators serialize commands externally, matching today's PID-file model. +- Hash verification on download (full content re-hash). Deferred to v2 of CAS. **Size verification on download is in v1**; see §6.8. +- Refcount-delta files / incremental garbage collection. Deferred; v1 uses full mark-and-sweep, which is sufficient at the target scale. +- Convergent encryption. v1 of CAS uses bucket-level encryption only. **TODO (v2)**: design and ship convergent encryption so existing v1 client-side-encryption users can migrate to CAS without losing client-side encryption. Known weaknesses (confirmation-of-file attacks) need threat-modeling per the deployment context. +- Migration of existing v1 backups into CAS layout. Out of scope; users opt in by writing new backups with `cas-upload`. +- Garbage collection of metadata across replicas/clusters beyond what mark-and-sweep already handles. +- Object Lock / immutability features beyond what's intrinsic to content addressing. +- **Object-disk parts (s3/azure/hdfs object disks) are NOT supported by `cas-*` commands in v1.** The existing `pkg/backup/create.go:1031` and `pkg/backup/restore.go:2227` paths (~1000 LOC of dual-direction object-reference handling, including cross-storage key rewriting) are non-trivial to fold into CAS, and content-addressing semantics for already-remote object stubs need their own design pass. `cas-upload` will refuse to back up tables on object disks in v1; operators must use v1 `upload` for those tables. Lifted in v2 of CAS. + +## 4. Background + +Today's `clickhouse-backup` upload pipeline (see `pkg/backup/upload.go:38–295`): + +- A backup is a directory `{backup_name}/` on the remote target containing a top-level `metadata.json` (`BackupMetadata` struct, `pkg/metadata/backup_metadata.go:12`), per-table JSONs at `metadata/{db}/{table}.json` (`TableMetadata`, `pkg/metadata/table_metadata.go:10`), and per-part archives (compressed tar streams) under each table. +- Dedup happens **only against a base backup** (`RequiredBackup` field), at part-name granularity, via the `Required=true` flag set in `TableMetadata.Parts` (`pkg/backup/upload.go:756–784`). +- Concurrency is bounded by a per-backup PID file (`pkg/pidlock/pidlock.go:15`); there are no S3 conditional writes or distributed locks. +- ClickHouse's `checksums.txt` files are **not parsed** by the tool today. Hashes are computed by `pkg/common/common.go:131` (CRC64 of whole files/archives). + +## 5. Problems being solved + +1. **Mutation explosion**: a mutation can rewrite one column and rename the part, while hardlinking ~all other column files. Today this triggers a full re-upload of the new part. CAS reuses the existing blobs. +2. **Incremental chain fragility**: incremental backups depend on a base backup. Restore unrolls the chain. With CAS, every backup is independent. +3. **Time/cost of full backups**: with CAS the second backup of a mostly-unchanged dataset uploads only the diff in blobs. +4. **Reuse**: as much existing code as possible — orchestration, retry, multipart, encryption, metadata structs. +5. **Containment**: separate commands (`cas-*`) so the new path can't regress v1. + +## 6. Proposed design + +### 6.1 Object layout + +A CAS deployment uses a single configurable root prefix (default: `cas/`) under the existing remote target. Inside that root: + +``` +cas/ + blob// # immutable, content-addressed; = first 2 hex + # chars of CityHash128 hex; = remaining 30 chars + metadata// # per-backup directory + metadata.json # BackupMetadata (reused struct; RequiredBackup + # unused; new CAS sub-struct populated — see §6.2.1) + metadata//.json # TableMetadata (reused struct). enc_* = TablePathEncode + parts///.tar.zstd # per-disk per-(db,table) archive of small files only. + # contents: / for every part. + # Path components encoded with common.TablePathEncode + # (pkg/common/common.go:18) to handle non-ASCII / + # special chars and avoid db/table name collisions. + rbac/, configs/, named_collections/ # unchanged from existing layout + inprogress/.marker # timestamp + host; written at upload start, deleted + # at commit. Used by `cas-prune` for abandoned-upload + # cleanup; also blocks `cas-delete` of same name + prune.marker # the GC lock; written when `cas-prune` runs. + # While present, `cas-upload` and `cas-delete` refuse +``` + +**Sharding**: 256-way prefix sharding by first 2 hex chars of the hash. Sufficient for S3 per-prefix rate limits at the target scale; intuitive layout for human inspection. + +**Hash source**: each part's `checksums.txt`, parsed per the on-disk format (versions 2/3/4 supported; spec in `docs/checksumstxt/format.md`, reference parser in `docs/checksumstxt/checksumstxt.go`). The `Checksum.FileHash` field (CityHash128 of the file's on-disk bytes — not `UncompressedHash`) is what CAS keys blobs by. This gives byte-identity dedup, which is exactly what we want: identical column files produced by hardlink-mutations dedup; logically-equal files with different on-disk encodings (e.g., recompressed) do not, and that's correct because their stored bytes differ. 128 bits gives negligible accidental-collision probability across ~10⁹ blobs; reusing ClickHouse's already-computed hash avoids ~50+ CPU-hours of re-hashing per 100TB backup. + +**Path layout decisions**: +- No `by-extension` dimension. Same content under different filenames must dedupe. +- No `first/last` double-byte split. Single 2-byte prefix is sufficient (see §10.1 for rate-limit math). + +### 6.2 Inline-vs-blob threshold + +Files with `size > inline_threshold` go to the blob store. Files with `size ≤ inline_threshold` are packed into the per-disk-per-(db,table) archive. Default: **256 KiB**, configurable. + +Rationale: ClickHouse parts contain many small metadata files (`columns.txt`, `primary.idx`, `partition.dat`, `minmax_*.idx`, `count.txt`, `default_compression_codec.txt`, `serialization.json`, `checksums.txt` itself). Per-PUT cost on S3 ≈ $0.005/1K. At 50 small files × 10⁴ parts × 1 backup = 500K extra PUTs ≈ $2.50/backup just in API charges, with worse tail-latency. Packing them into a per-table tar.zstd is dramatically more efficient. + +The threshold should be tuned against an actual file-size distribution from a representative ClickHouse instance before final commit; 256 KiB is the starting point. + +### 6.2.1 CAS layout parameters MUST be persisted with the backup + +Restore behavior depends on parameters chosen at upload time. If a backup is uploaded with `inline_threshold = 256 KiB` and later restored after the operator has reconfigured the tool to `inline_threshold = 1 MB`, restore would look in the inline archive for files that were actually stored as blobs — silent corruption. + +Persist the following per-backup, embedded in `BackupMetadata` as a new `CAS *CASBackupParams` field (`omitempty`, populated only by `cas-upload`): + +```go +type CASBackupParams struct { + LayoutVersion uint8 // schema version of the CAS layout itself; v1 = 1 + InlineThreshold uint64 // bytes; ValidateBackup MUST reject if 0 or > 1 GiB + ClusterID string // required (§6.11); identifies the source cluster for namespace isolation +} +``` + +**LayoutVersion evolution policy** (decided): a tool encountering `LayoutVersion > supported_max` refuses with a clear error. Operators MUST keep the oldest tool capable of reading their oldest backup. LayoutVersion bumps are major-version BREAKING CHANGE entries. + +**No `HashAlgorithm` field.** The hash is sourced from each part's `checksums.txt` — its value, encoding, and meaning are part-local properties defined by ClickHouse's on-disk format (always CityHash128 for `Checksum.FileHash` across format versions 2/3/4; see `docs/checksumstxt/format.md`). If ClickHouse ever changed the hash, the change would be visible per-part in the format version of `checksums.txt`, not as a CAS-wide policy. CAS does not pick a hash; it adopts whatever the part wrote. + +**No `RootPrefix`, `BlobShardWidth`, or `ArchiveCodec` fields.** A field whose only purpose is "documentation" is rot-by-construction (you can't read it without already knowing where to look). Hardcode v1 to: root prefix configurable via config but not persisted, shard width = 2, archive codec = `zstd` with `.tar.zstd` extension matching `pkg/config/config.go:310`. Any change to these is a `LayoutVersion` bump. + +Restore reads `BackupMetadata.CAS` and uses those values exclusively. CLI / config values for these parameters apply only at upload time. Restore ignores them. + +If `BackupMetadata.CAS == nil`, the backup is a v1 backup; CAS commands refuse to operate on it (and v1 commands refuse to operate on a backup with `CAS != nil`). See §6.2.2 for exact code locations of the cross-mode guards. + +If `BackupMetadata.CAS.LayoutVersion` is unrecognized (newer than the running tool supports), CAS commands refuse with a clear error. + +### 6.2.2 Isolation from v1 + +CAS and v1 must not see each other's data. Two requirements, both Phase-1 ship-gates: + +**1. v1 commands MUST exclude the configured CAS root prefix from listing and retention.** v1's `BackupList` (`pkg/storage/general.go`) walks the bucket root; `RemoveOldBackupsRemote` and `CleanRemoteBroken` then operate on those entries. Without an explicit exclusion, the CAS root would appear as a broken v1 backup and could be reclaimed. Modify `BackupList` to accept a skip-prefixes set populated from `cas.root_prefix`; the consequence flows to `RemoveBackupRemote`, `RemoveOldBackupsRemote`, and `CleanRemoteBroken`. + +**2. v1 and CAS commands MUST refuse on the wrong type.** v1 commands (`Download`, `Restore`, `RestoreFromRemote`, `RemoveBackupRemote`, watch mode — anywhere remote `BackupMetadata` is loaded) refuse with a clear error if `BackupMetadata.CAS != nil`. CAS commands refuse with the inverse check via `ValidateBackup` (§7). + +Test: `TestCompatibilityMixedBucket` (mixed bucket; v1 retention/list/clean don't touch CAS objects regardless of config) plus `TestV1RefusesCASBackup` / `TestCASRefusesV1Backup` per entry point. + +### 6.3 Metadata archive packing + +One **tar.zstd per disk per (db, table)**. Path inside the archive: `/`. Contains every file of every part where `size ≤ inline_threshold`. Extension `.tar.zstd` matches `pkg/config/config.go:310`'s convention so existing readers can be reused. + +**`checksums.txt` is always inlined**, regardless of size. It is treated as a special case (not as a parsed-checksum entry) because: +- Restore needs `checksums.txt` on disk *before* it can decide which blobs to fetch (§6.5 step 6 reads the local file). +- It's tiny in practice (KB-range). +- Putting it in the blob store would chicken-and-egg the restore protocol. + +**Files on disk but not listed in `checksums.txt`** (an edge case ClickHouse should not produce, but the parser may encounter from future or experimental part formats): **always inline into the per-table archive**, regardless of size. They go into the metadata archive alongside the small files; never into the blob store. This avoids any local hashing in the `cas-upload` data path and gives a single rule for the corner case. The lost-dedup cost is negligible because such files are rare by construction; the simplicity is worth it. No "skip" mode — silently skipping files corrupts backups. + +Rationale: +- Matches existing per-disk per-table structure (`TableMetadata.Files: map[string][]string`, `pkg/metadata/table_metadata.go`). +- Natural partial-restore granularity: `--tables=db.t1` downloads one archive per disk. +- Reasonable file count: hundreds of tables × few disks → low thousands of archives, not 10⁴+ per-part archives. +- Small files of disparate types don't benefit from per-type clustering; the win from cross-type compression is small once the big homogeneous files are in the blob store. + +### 6.4 Upload — `cas-upload` + +**Pre-condition**: `cas-upload` operates on a **pre-existing local backup** produced by the existing `clickhouse-backup create` command (which freezes parts into the local backup directory). This mirrors the v1 `create` + `upload` split: separation of concerns, reuses `pkg/backup/create.go` unchanged, and lets operators inspect the local backup before pushing. `cas-upload` does NOT internally freeze — operators run `clickhouse-backup create ` first, then `clickhouse-backup cas-upload `. + +1. PID-lock as today (`pkg/pidlock`). +2. **Refuse if `cas/prune.marker` exists** (the GC lock — see §6.7). Surface the marker's age in the error. If `cas.wait_for_prune` (or `--wait-for-prune` flag) is > 0, poll the marker every 2 seconds for up to that duration before refusing. +3. **Pre-flight check for object disks**: scan in-scope tables. If any are on object disks (s3/azure/hdfs) and `--skip-object-disks` is not set, refuse with a list of `(db, table, disk)` triples. With `--skip-object-disks`, log them and exclude from the upload set. +4. **Best-effort same-name check**: refuse if `cas/metadata//metadata.json` already exists. Best-effort only — two hosts can both pass and both PUT (last writer wins). Multi-host concurrent uploads to the same name are **unsupported** (§3); operators must use unique names per shard. +5. Write `cas/inprogress/.marker` with timestamp + host identifier (used by prune for abandoned-upload cleanup; not for race protection). +6. Walk parts. For each part, parse `checksums.txt` to obtain `(filename, size, hash)` triples. Apply the inline threshold. +7. Build the set of unique blob paths. +8. **Cold-list** `cas/blob//` prefixes in parallel → in-memory existence set. +9. Upload missing blobs via the existing `BackupDestination` abstraction. +10. For each `(disk, db, table)`: build and upload `cas/metadata//parts///.tar.zstd` (path components encoded via `common.TablePathEncode`). +11. Upload per-table JSONs at `cas/metadata//metadata//.json`. +12. Upload RBAC, configs, named_collections (unchanged from v1). +13. **Pre-commit safety re-checks** (closes the old-orphan-reuse and long-upload-vs-abandon-sweep races — Blockers B4, B5): + a. HEAD `cas/prune.marker`. If present, abort: "concurrent prune detected; aborting before commit." (Single HEAD; cheap; closes the window where prune ran past our step 2 lock check.) + b. HEAD `cas/inprogress/.marker`. If absent, abort: "our in-progress marker was swept (upload exceeded abandon_threshold); aborting." (Closes the long-upload-past-abandon-sweep race.) +14. **Commit (LAST, in this order)**: + a. Upload `cas/metadata//metadata.json` — populates `BackupMetadata.CAS` per §6.2.1. Until this exists, the backup is not in the catalog. + b. Delete `cas/inprogress/.marker`. (If this fails — 5xx, OOM, Ctrl-C — the marker becomes stale; `cas-delete` is required to treat it as stale when `metadata.json` exists, see §6.6.) + +The presence of `cas/metadata//metadata.json` is the catalog truth. + +### 6.5 Restore — `cas-restore` + +CAS restore is implemented as **`cas-download`** (downloads + materializes a complete v1-shaped backup directory on local disk) followed by the **existing v1 restore flow** (which reads from that local directory). + +The existing restore reads metadata **from disk**, not in-memory: +- Root `metadata.json` is read by `pkg/backup/restore.go:114`. +- Per-table JSONs are read from the local `metadata/` directory by `pkg/backup/restore.go:1936`. + +So `cas-download` MUST write the complete v1 backup directory layout before `cas-restore` invokes the existing restore. "Synthesize in memory and call restore" is **not** sufficient; existing restore won't see synthesized structures. + +#### What `cas-download` writes locally + +For backup `` rooted at `//`: + +``` +/ + metadata.json # full BackupMetadata (DataFormat="directory") + metadata//.json # full TableMetadata per table (Parts populated, + # Files empty, schema fields preserved as in v1) + shadow///// # part directories with all files reconstructed: + # - small files extracted from per-table archive + # - large files downloaded from cas/blob/... + rbac/, configs/, named_collections/ # downloaded as today +``` + +Every file the existing restore expects must exist on disk before handoff. + +#### Local staging contract + +- `BackupMetadata.DataFormat = "directory"` (`pkg/metadata/backup_metadata.go:30`; constant `pkg/backup/backuper.go:28` `DirectoryFormat`). Branches existing restore code into the no-archive path (`pkg/backup/download.go:615, 627, 670`). +- `TableMetadata.Parts` populated; `TableMetadata.Files` empty (only consumed when `DataFormat != "directory"`; `pkg/backup/download.go:673`). +- `TableMetadata.Checksums` is **not populated** by CAS (the per-archive CRC64 the v1 path uses is irrelevant — checksums.txt inside each part directory is the source of truth for blob content). +- Reuses `filesystemhelper.HardlinkBackupPartsToStorage` (`pkg/filesystemhelper/filesystemhelper.go:119`) for the staging-to-detached step. + +#### `cas-download` steps + +1. Resolve the backup name. Read `cas/metadata//metadata.json`. **Read `BackupMetadata.CAS` to get the persisted parameters** (`LayoutVersion` and `InlineThreshold` — those are the only fields per §6.2.1); restore uses these — never values from the current config. +2. Refuse if `BackupMetadata.CAS == nil` (v1 backup) or `LayoutVersion` is unsupported. +3. Apply CLI filters (`--tables`, `--partitions`, `--schema-only`, `--data-only`, mappings, etc.) to determine the working set of `(db, table, parts)`. +4. Write the local `metadata.json` and per-table `metadata//.json` files to disk first (the existing restore flow reads them from disk). +5. For each in-scope `(disk, db, table)`: download `parts///.tar.zstd`, extract into the local shadow directory at the canonical layout path. **Path containment** for every tar entry, assert `strings.HasPrefix(filepath.Clean(extractPath)+sep, filepath.Clean(rootDir)+sep)` before write; reject `..` and absolute paths. **`checksums.txt` filename validation**: when parsing, reject any filename with leading `/`, embedded `..` components, or NUL bytes; allow single `/` separators only for projection paths matching `.proj/`. Each part directory now contains all small files including `checksums.txt`. +6. For each part in scope: parse the local `checksums.txt`, identify files with `size > BackupMetadata.CAS.InlineThreshold` (i.e. files NOT in the archive), download each from `cas/blob//` into the part directory. The full part directory is now reconstructed locally. + +`cas-download` exits here. The local layout is exactly what `restore` consumes. + +#### `cas-restore` + +1. Run `cas-download` (steps above) with the same flags. +2. Invoke the existing `restore` flow on the materialized local directory. **Object-disk handling MUST be skipped**: `pkg/backup/restore.go:196-204` checks live ClickHouse disks, not metadata; CAS restore must short-circuit `downloadObjectDiskParts` when `BackupMetadata.CAS != nil`. The pre-flight in `cas-upload` ensures CAS backups never include object-disk parts, so no object-disk processing is needed at restore. + +This split also matches v1's `download` + `restore` verb pair and lets operators inspect the staged directory before applying. + +Per-partition restore is per-part filtering: intersect `TableMetadata.Parts` with `--partitions`, then proceed only with selected parts. The per-table archive is downloaded whole even for one partition (acceptable overhead). + +`--schema-only` skips steps 4–5 entirely; very fast for CAS. + +### 6.6 Delete — `cas-delete` + +**Order matters.** The catalog truth is `metadata.json`. + +1. **Refuse if `cas/prune.marker` exists** (the GC lock). If `cas.wait_for_prune` (or `--wait-for-prune` flag) is > 0, poll the marker every 2 seconds for up to that duration before refusing. +2. **Stale-marker-aware inprogress check**: if `cas/inprogress/.marker` exists AND `cas/metadata//metadata.json` does NOT exist → upload in flight; refuse. If both exist → the upload committed but failed to delete its marker; treat as **stale** and proceed (log a warning). If only `metadata.json` exists → normal case; proceed. +3. Delete `cas/metadata//metadata.json` **first**. Backup is no longer in the catalog. +4. Delete the rest of `cas/metadata//`. +5. Orphan blobs reclaimed by the next prune run. + +If interrupted between steps 3 and 4: the backup is gone from the catalog; remaining files become metadata-orphans. Prune handles them lazily. + +### 6.7 Prune — `cas-prune` + +Mark-and-sweep GC. **Single rule**: `cas-prune` takes an exclusive lock; while held, no `cas-upload` or `cas-delete` may run. Operators schedule pruning during a quiet window. There is no automatic protection — the operator must ensure no CAS writes are happening. + +#### Algorithm + +1. **Sanity check** (operator-courtesy): list `cas/inprogress/*.marker`. If any is younger than `abandon_threshold` (default 7 days), refuse with a clear error listing the markers (name, host, age) and exit. The operator either waits for the upload to finish or, if confident the upload is dead, deletes the marker manually before retrying. +2. Write `cas/prune.marker` with timestamp + host id + a random run-id. **Read it back** and compare run-id to ours; if it differs, another `cas-prune` raced us — abort with "concurrent prune detected; aborting". **Defer the marker delete to the end of the function** so step 12 always runs even if intermediate steps fail or panic. The defer-release MUST run on every exit path: success, fail-closed abort, panic, signal cancellation, error returns. +3. Record `T_0 = now()`. +4. **Abandoned-upload sweep**: any `cas/inprogress/.marker` older than `abandon_threshold` → delete it. Any blobs from the abandoned run become orphans handled by step 9. +5. List `cas/metadata/*/metadata.json` → live backup set. +6. For each live backup, walk per-table archives, extract `checksums.txt` files, accumulate referenced blob paths into a sorted on-disk file (streaming). +7. **Fail-closed**: if any live backup's per-table archives or JSONs cannot be read, abort without deleting; surface error. +8. List `cas/blob//` in parallel; stream-compare against the referenced set to identify orphan candidates. +9. Filter deletion candidates: orphan AND `LastModified < T_0 - grace_blob` (default 24h). +10. Sweep metadata-orphan subtrees: `cas/metadata//` with no `metadata.json` → delete. +11. Delete confirmed blob candidates. +12. Release `cas/prune.marker`. (Implemented as a deferred call from step 2 — runs unconditionally.) + +**Stale-marker recovery**: defer-release covers panics, signals, and error returns. Only `kill -9` or kernel OOM-kill leaves a stranded marker. When that happens, the operator inspects and clears it explicitly: + +- `cas-status` displays `cas/prune.marker` if present (timestamp, host, run-id). +- `cas-prune --unlock` deletes the marker (after operator confirms no prune is actually running). Refuses if a marker isn't present (avoid silently doing nothing). + +No timeout-based auto-bypass; operator owns the call. Documented in the operator runbook. + +`cas-prune --dry-run` runs steps 1, 3–10 and prints what would be deleted; does not write the lock or perform deletes. + +#### Why this works + +The single load-bearing rule: **don't run `cas-upload` or `cas-delete` while `cas-prune` is running.** `cas-upload` and `cas-delete` enforce this by refusing to start when `cas/prune.marker` exists. + +The grace period (`grace_blob`, default 24h) is defense-in-depth against: +- The TOCTOU window between `cas-upload`'s marker check and the prune lock write (small). +- Operator misuse (running prune during uploads anyway, or ignoring marker errors). +- Object-store eventual-consistency oddities. + +**This is not a distributed mutex.** Two operators racing `cas-prune` on different hosts can both pass step 1 and both PUT step 2. Operators must serialize prune across hosts the same way they serialize v1 commands today (no overlapping cron, etc.). Distributed locking is a non-goal (§3); v2 may add it via S3 conditional-create. + +#### Race scenarios + +| Scenario | Outcome | +|---|---| +| Operator starts `cas-upload` while prune holds the lock | Upload refuses; clear error naming the prune marker's age and host. | +| Operator starts `cas-prune` while uploads are in flight | Prune refuses (step 1) with a list of fresh inprogress markers. | +| Operator forces the issue (deletes markers manually) | Grace period limits damage to blobs younger than `grace_blob`. Beyond that, on their own. | +| Upload crashes mid-flight | Inprogress marker persists. Next prune blocks until `abandon_threshold`; then sweeps marker; orphan blobs reclaimed. | +| Two uploaders race on same blob | Idempotent (content-keyed). | +| Crashed remove between deleting metadata.json and rest of subtree | Backup gone from catalog; remaining files become metadata-orphans; step 10 sweeps. | +| Backend with weak `LastModified` semantics | Grace degrades; rely harder on operator scheduling. Document. | + +### 6.8 Verify — `cas-verify` + +Integrity check, ships with v1: + +1. Read `cas/metadata//metadata.json` and `metadata//.json` for all tables. +2. Download per-table archives, extract `checksums.txt` files into memory. Build the set of `(blob_path, expected_size)` pairs. +3. **HEAD each blob in parallel**. Report: + - Missing blobs (HEAD 404). + - **Size mismatches**: HEAD-returned `Content-Length` vs. expected size from `checksums.txt`. Catches truncated, replaced, or partially-written blobs at zero CPU cost. +4. Exit non-zero if any failures. + +`cas-verify --json` emits machine-readable output (one JSON object per failure) so operators can pipe into tooling for triage / alerts. + +Does NOT verify blob *content* hashes against `checksums.txt` — that's a separate v2 mode (full re-hash on download, ~minutes-to-hours wall-clock at 100TB scale). HEAD + size verification catches the silent-corruption-from-buggy-GC class of failures, which is the most likely failure mode in v1, at near-zero cost. + +#### Recovery from `cas-verify` failures + +If `cas-verify` reports missing or wrong-sized blobs, the backup is unrestorable. v1 has no automated repair — `cas-delete` the broken backup and create a fresh one (`clickhouse-backup create ` + `cas-upload `). Because every CAS backup is independent (no chain), losing one doesn't affect any other. + +`cas-fsck` (v2) will automate repair when local parts are still available. + +### 6.9 Multi-shard concurrent upload to a shared bucket + +Supported natively, with one convention: backup names must be unique across writers. Recommended naming: `____` or similar. + +Mechanics: +- Different shards write to different `cas/metadata//` directories. No collision. +- Different shards may upload identical blobs concurrently. Idempotent (content-keyed). Worst case: a small amount of wasted bandwidth. +- Prune is single-writer (marker file); operators must ensure only one prune runs at a time across all shards. + +This is a strict improvement over v1, which requires per-shard separate prefixes. + +### 6.10 CLI surface + +Six new top-level subcommands, plus extension of the existing `list` verb: + +| Command | Purpose | +|---|---| +| `cas-upload [--skip-object-disks] [--dry-run]` | Build and push a CAS backup. `--skip-object-disks` excludes object-disk tables; `--dry-run` reports what would be uploaded without writing. | +| `cas-download [--tables ...] [--partitions ...]` | Materialize a CAS backup into the local shadow directory in v1 layout. **Stops there** — does not load into ClickHouse. Mirrors the existing `download` verb. **Disk-space pre-flight**: estimate bytes from per-table archive sizes + sum of blob sizes from `checksums.txt`; refuse early if local free space < estimate × 1.1. Re-running over a partial directory is safe (idempotent overwrites). | +| `cas-restore [...all existing restore flags...]` | Convenience: `cas-download` followed by the existing `restore` flow. Identical flag set to `restore`. | +| `cas-delete ` | Delete the per-backup metadata subtree (refuses if upload or prune in flight; see §6.6). Blobs are reclaimed by the next prune run. | +| `cas-prune [--dry-run] [--grace-blob DUR] [--abandon-threshold DUR] [--unlock]` | Mark-and-sweep GC. `DUR` is a Go duration string (e.g. `24h`, `30m`, `0s`). `--grace-blob` overrides config `cas.grace_blob`; `--abandon-threshold` overrides `cas.abandon_threshold`. `--dry-run` prints candidates without deleting and never touches the prune marker (so combining `--dry-run --unlock` is a no-op rather than the destructive double-meaning the first cut shipped with — see Phase 6 A3). `--unlock` deletes a stranded `cas/prune.marker` (operator escape hatch when prune was killed by SIGKILL/OOM). Explicit `--grace-blob=0s` / `--abandon-threshold=0s` are honored as "no grace" / "sweep all stale markers now" — distinguished from unset via `*Set` bools in `PruneOptions`. | +| `cas-verify [--json]` | HEAD + size check on referenced blobs. `--json` outputs structured failures for tooling. | +| `cas-status` | Bucket-level health summary: backup count, blob count, total bytes, freshest/oldest backup, in-progress markers (with age + host), prune marker state, abandoned-marker candidates. Cheap (LIST only). | + +**Existing `list` extended**: `clickhouse-backup list remote` enumerates v1 *and* CAS backups, with a `[CAS]` tag. `clickhouse-backup list local` unchanged. No new `cas-list` verb — symmetry beats command proliferation. + +**Help-text discoverability**: +- The existing `upload --help` gains a closing line: *"For mutation-heavy tables or chain-free incrementals, see `cas-upload`."* +- The README gains a short "CAS layout" section pointing to this design doc. + +**Rejected flags**: `cas-restore` does NOT accept `--ignore-dependencies`. CAS backups have no chain, so the flag is meaningless; passing it produces an error ("CAS backups have no dependencies; flag not applicable") rather than silently being a no-op. + +**Retention behavior**: `cas-upload` MUST NOT call `RemoveOldBackupsRemote`. CAS retention is exclusively managed by `cas-prune`. The v1 `backups_to_keep_remote` config knob applies only to v1 backups (and the §6.2.0 prefix exclusion ensures CAS backups don't accidentally count toward it). + +#### 6.10.1 Output-format convention + +`cas-verify --json` is a boolean flag that emits line-delimited JSON failures. The existing v1 `list` command uses `--format text|json|yaml|csv|tsv` for tabular listings. + +These two patterns are kept distinct on purpose: + +- **Tabular listings** use `--format` because operators may want csv/tsv for spreadsheet ingest. +- **Diagnostic pass/fail commands** (`cas-verify`, future `cas-fsck`) use `--json` because failures are line-delimited streams, not tables; the only useful alternatives are "human" or "machine". + +When new CAS commands need machine-readable output, follow this rule: tabular → `--format`; line-delimited diagnostic → `--json`. Don't introduce a third convention without an explicit decision recorded here. + +### 6.11 Configuration surface + +CAS-specific parameters live under a `cas:` block in `config.yml`. Existing config file paths and env-var conventions are unchanged. + +```yaml +cas: + enabled: false # gate; set true to allow cas-* commands against this config + cluster_id: "" # REQUIRED, no default. Identifies the source cluster; + # persisted in BackupMetadata.CAS.ClusterID. + root_prefix: "cas/" # top-level prefix in the bucket. MUST be a single path + # segment (e.g. "cas/" or "snapshots/"), not nested like + # "backups/cas/" — multi-segment values would escape v1 + # list/retention/clean-broken protection. For nested + # layouts, set the underlying storage path (s3.path / + # sftp.path / etc.) and keep root_prefix as one segment. + # Effective per-cluster prefix is + # / (e.g. "cas/prod-shard-1/"). + inline_threshold: 262144 # bytes (256 KiB); ValidateBackup MUST reject 0 or > 1 GiB + grace_blob: "24h" # prune won't delete a blob younger than this. Go duration string. + abandon_threshold: "168h" # 7 days; in-progress markers older than this are auto-cleaned. Go duration string. + allow_unsafe_markers: false # opt-in for backends that lack atomic conditional create. Phase 4 + # implements PutFileIfAbsent natively on S3 / Azure / GCS / COS / SFTP. + # FTP has no portable atomic primitive: with this flag false (default) + # cas-upload and cas-prune refuse on FTP; with true, FTP falls back to + # a STAT+STOR+RNFR/RNTO best-effort sequence with a per-call WARN log. + wait_for_prune: "0s" # if > 0, cas-upload and cas-delete poll the prune + # marker for up to this duration before refusing. + # Useful for cron deployments where prune may overlap + # with scheduled uploads. Go duration string. + skip_conditional_put_probe: false + # Bypass the one-shot startup probe that verifies the + # backend honors If-None-Match (or equivalent). Set + # true ONLY for backends you have independently + # confirmed are compliant; on a backend that silently + # ignores the precondition, marker locks become unsafe + # and concurrent uploads can corrupt backups. Emits a + # startup WARN banner when enabled. + allow_unsafe_object_disk_skip: false + # When ClickHouse system.disks cannot be queried during + # the cas-upload preflight, FAIL CLOSED by default + # (refuse the upload) so unsupported object-disk + # tables can't slip through. Set true to fall back + # to shadow-only detection — may MISS fully-object- + # disk-backed tables and produce an unrestorable + # CAS backup. Emits a startup WARN banner when enabled. +``` + +**Per-cluster prefix is mandatory.** Operators MUST configure `cluster_id`. Cross-cluster blob sharing is out of scope for v1; if anyone needs it, it's a v2 conversation with its own threat model. + +**Env vars** (override config; prefix `CAS_*` for symmetry with `S3_*`/`GCS_*`/`AZBLOB_*`): +- `CAS_ENABLED`, `CAS_CLUSTER_ID`, `CAS_ROOT_PREFIX` +- `CAS_INLINE_THRESHOLD`, `CAS_GRACE_BLOB`, `CAS_ABANDON_THRESHOLD` +- `CAS_ALLOW_UNSAFE_MARKERS`, `CAS_WAIT_FOR_PRUNE` +- `CAS_SKIP_CONDITIONAL_PUT_PROBE`, `CAS_ALLOW_UNSAFE_OBJECT_DISK_SKIP` + +**CLI flags** (override config + env): +- `cas-prune --grace-blob DUR --abandon-threshold DUR --dry-run --unlock` +- `cas-upload --skip-object-disks --dry-run [--wait-for-prune=DUR]` +- `cas-delete [--wait-for-prune=DUR]` +- `cas-verify --json` + +`inline_threshold` is read from config at upload time and **persisted** in `BackupMetadata.CAS.InlineThreshold`. Restore uses the persisted value, never the current config (§6.2.1). + +### 6.12 Compatibility notes + +**Breaking interface change** (Phase 4). The CAS work added two new methods and one sentinel error to the public `pkg/storage.RemoteStorage` interface: + +```go +type RemoteStorage interface { + // ... existing methods ... + PutFileAbsoluteIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (created bool, err error) + PutFileIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (created bool, err error) +} + +var ErrConditionalPutNotSupported = errors.New("conditional PutFile not supported by this backend") +``` + +External code that implements `RemoteStorage` directly (private forks with custom backends, third-party plugins) will fail to build until they add the two methods. Implementations that lack a native atomic-create primitive should return `ErrConditionalPutNotSupported`; CAS commands then refuse on those backends unless `cas.allow_unsafe_markers=true`. + +**Backend version requirements.** S3-compatible stores must honor `If-None-Match: "*"` on `PutObject` for marker locks to be safe. AWS S3 supports it natively. MinIO requires release `RELEASE.2024-11-07T00-52-20Z` or newer; older versions silently ignore the header. CAS performs a one-shot startup probe on the first marker-writing command (writes a sentinel at a per-process random key, asserts the second write reports not-created, then cleans up); read-only commands (`cas-status`, `cas-verify`, `cas-download`, `cas-restore`, dry-run uploads/prunes) skip the probe so they work with read-only credentials. Operators on confirmed-good backends can skip the probe entirely via `cas.skip_conditional_put_probe=true`. Ceph RGW and other S3-compatible stores have not been validated against the probe; prefer one of the natively-supported backends in production. + +**LayoutVersion downgrade.** Operators downgrading clickhouse-backup to a release that doesn't recognize the persisted `BackupMetadata.CAS.LayoutVersion` will see a refusal at restore time with a clear error. Upgrade-then-downgrade-then-restore is the failure mode; document the build matrix you support. + +## 7. Reuse vs. new code + +### Reused as-is +- `pkg/storage/general.go` — `BackupDestination`, multipart, retry, throttling, `CopyObject` +- `pkg/metadata/backup_metadata.go:12` — `BackupMetadata` struct (don't populate `RequiredBackup`; new `CAS` field added — see §6.2.1) +- `pkg/metadata/table_metadata.go:10` — `TableMetadata` struct (write `Parts` populated, `Files` empty, `DataFormat = "directory"`) +- `pkg/backup/backuper.go:28` — `DirectoryFormat` constant +- `pkg/backup/backuper.go:145` — shadow-directory layout for local staging +- `pkg/common/common.go:18` — `TablePathEncode` for db/table path components +- `pkg/pidlock/pidlock.go:15` — per-backup PID locking +- `pkg/backup/upload.go:114` — `prepareTableListToUpload` table iteration +- `pkg/filesystemhelper/filesystemhelper.go:119` — `HardlinkBackupPartsToStorage` for staging-to-detached +- `pkg/resumable/state.go` — progress tracking (BoltDB) usable for blob-level resume +- All `pkg/clickhouse/*` query helpers +- All restore-side schema/RBAC/configs handling + +### New code + +(Actual line counts at the time of this writing: ~5,800 LOC across `pkg/cas/`, `pkg/cas/casstorage/`, and `pkg/checksumstxt/`, plus ~5,800 LOC of tests. The original estimate of 1,500–2,500 was on the work for Phase 1; the additional volume comes from Phases 2–8 — prune, atomic markers, the REST API surface, the pre-PR readiness round, and their tests.) +- **`pkg/checksumstxt/`** — parser for ClickHouse's `checksums.txt` format (versions 2/3/4 on-disk; v5 minimalistic for completeness). Reference implementation already drafted at `docs/checksumstxt/checksumstxt.go` (300 LOC) with tests at `docs/checksumstxt/checksumstxt_test.go` (271 LOC) and full format spec at `docs/checksumstxt/format.md`. Move into `pkg/checksumstxt/` during Phase 1 — this is a ClickHouse part-format concept, not a CAS concept; namespace it accordingly. Keep tests against real ClickHouse part fixtures spanning compact, wide, encrypted, and projection parts. +- **`pkg/cas/validate.go`** — single `ValidateBackup(ctx, name) error` function used as a precondition by every CAS command. Enforces: + 1. Backup name is well-formed (printable ASCII only, no NUL or control chars, len ≤ 128, no `..` or path separators). + 2. `metadata.json` exists and parses. + 3. `BackupMetadata.CAS != nil` and `LayoutVersion ≤ supported_max` (refuse newer; §6.2.1). + 4. `InlineThreshold > 0 AND ≤ 1 GiB`. + 5. `ClusterID` is non-empty and matches the configured cluster. + 6. All referenced per-table archives can be HEADed. + 7. Inprogress / prune marker state is consistent with the catalog (used by `cas-delete`'s stale-marker logic, §6.6 step 2). +**Backend assumptions** (no probe in v1): CAS assumes the configured backend provides read-your-writes consistency for individual objects and a meaningful `LastModified`. AWS S3, MinIO, GCS, and Azure Blob all qualify. Quirky on-prem backends are the operator's risk to validate; document the assumption in the operator runbook. +- **`pkg/cas/blobpath.go`** — derives blob paths from hashes. Trivial. +- **`pkg/cas/upload.go`** — orchestrates the upload protocol in §6.4 (object-disk pre-flight, prune-lock check, marker management). Calls into existing `BackupDestination`. +- **`pkg/cas/download.go`** — implements `cas-download`: materializes a backup into the shadow directory. +- **`pkg/cas/restore.go`** — thin: invokes `cas-download` then hands off to existing restore flow. +- **`pkg/cas/delete.go`** — implements §6.6 (prune-lock check, ordered delete). +- **`pkg/cas/prune.go`** — implements §6.7. Streaming mergesort, parallel listings, lock-and-sweep. +- **`pkg/cas/verify.go`** — implements §6.8 (HEAD + size; `--json` output). +- **`pkg/cas/cache.go`** — cold-list and in-memory existence set. (Spill-to-disk only if a real workload exhausts memory; ship in-memory first.) +- **`pkg/cas/list.go`** — thin helpers used by the existing `list remote` to surface CAS backups with a `[CAS]` tag. +- **`cmd/clickhouse-backup/cas_*.go`** — command bindings. +- **`pkg/cas/config.go`** — CAS-specific config: root prefix, inline threshold, grace period, abandon threshold (the actual persisted parameters and the configurable knobs). + +See §6.10 for the full CLI surface. + +## 8. Risk register + +| # | Risk | Likelihood | Impact | Mitigation | +|---|------|-----------|--------|-----------| +| R1 | `checksums.txt` parser bug (format version edge case, multi-block compressed v4, projection paths, etc.) producing wrong hashes → blob mis-keyed → silent corruption at restore | Low-Medium | High | Reference parser already drafted at `docs/checksumstxt/` with format spec and unit tests covering v2/v3/v4 paths. Add fixture-based tests against real ClickHouse part directories spanning compact, wide, encrypted, projection, and multi-disk parts before Phase 1 ships. `cas-verify` size check catches some manifestations. | +| R2 | GC race: in-flight upload's blob deleted before commit, OR old-orphan-reuse during concurrent prune | Low (with operator discipline) | High | `cas-prune` takes an exclusive lock; `cas-upload` and `cas-delete` refuse while it's held. `grace_blob` is defense-in-depth. Operator must serialize prune across hosts (no overlapping cron). | +| R4 | Hash collision (CityHash128) | Negligible | High | Birthday-paradox bound for a uniform 128-bit hash: `p ≈ n² / (2·2¹²⁸)`. At `n = 10⁹` blobs (≈100 PB of 100 MB-avg files — bigger than any plausible single deployment) `p ≈ 1.5·10⁻²¹`. The 10⁻⁶ collision threshold is reached around `n ≈ 2.6·10¹⁶` blobs. Realistic 100 TB deployments sit at `n ≈ 10⁷` (§10.1) where `p ≈ 1.5·10⁻²⁵`. Negligible at any plausible scale. (CityHash128 is non-cryptographic, so it is not collision-resistant against a motivated attacker — see R15.) | +| R5 | Memory blowup at upload (cold-list set of 10⁷ hashes) or at GC (live set of 10⁸+ hashes) | Medium | Medium | Spill cold-list to sorted on-disk file at >N entries. GC uses streaming mergesort with bounded memory. | +| R6 | Object store backend doesn't honor `LastModified` semantics needed for grace check (e.g., quirky on-prem MinIO, FTP `LIST` without MLSD) | Medium-Low | High | Phase 6 handles zero-`ModTime` defensively: in `classifyInProgress` zero-ModTime markers are treated as fresh; in `streamCompareWithMarks` zero-ModTime blobs are treated as inside grace. Both prevent the "sweep everything because the timestamp looks ancient" failure mode. The grace mechanism still assumes meaningful `LastModified` for the size-of-window guarantees; on non-conforming backends operators rely on `abandon_threshold` and on running prune outside upload windows. Documented in the operator runbook. | +| R7 | Per-table archive becomes huge (table with many parts) → restore must download whole archive even for partial-partition restore | Medium | Low | Acceptable v1; if it becomes a problem, switch to per-part archives or multi-archive splitting (matches existing `splitPartFiles` infrastructure). | +| R9 | Bucket cost surprise: per-PUT charges from many small blobs if inline threshold misconfigured | Low | Medium | Inline threshold default 256 KiB. Document the cost trade-off. | +| R10 | First CAS upload after migration is huge because nothing is shared with v1 backups | Certain | Low | Expected. Document. CAS dedup compounds across subsequent CAS backups. | +| R11 | Crashed upload leaves orphan blobs that aren't reclaimed for `grace_blob` | Certain | Low | Expected; tolerable per design. The orphan-cleanup latency is bounded by `grace_blob`. | +| R13 | Object-disk tables encountered during `cas-upload` cause silent skip or partial backup | Certain (if user has them) | High | `cas-upload` does pre-flight pass and refuses with a list of offending `(db, table, disk)` triples. `--skip-object-disks` excludes them. Operator must use v1 `upload` for those tables. v2 lifts. | +| R17 | Same-name concurrent `cas-upload` from two hosts: both pass the metadata.json existence check, both PUT, last writer wins on root metadata | Low | High | Phase 4 added per-backend atomic conditional create (`PutFileIfAbsent`) and Phase 6 wired it into the in-progress and prune markers, so a same-name race is now caught at the marker-write step (the second uploader sees `created=false` and refuses with a diagnostic naming the existing run). The original "naming-convention" guidance still applies as defense-in-depth. Cross-host coordination across many writers on the same backup name is handled by atomic markers; the deferred §9.1 item is for richer multi-host *claim* semantics beyond first-write-wins. | +| R14 | Layout-parameter mismatch between upload-time config and restore-time config (e.g., `inline_threshold` changed) → restore reads wrong location → silent corruption | Medium | High | Persist all layout parameters in `BackupMetadata.CAS` (§6.2.1); restore reads from there exclusively, ignoring config. CAS commands refuse to operate on backups whose `CAS` block is missing or has unknown `LayoutVersion`. | +| R15 | Adversarial CityHash128 collision (attacker crafts a colliding blob to corrupt restore) | Negligible-Low | High | CityHash128 is non-cryptographic; collisions are findable by motivated attackers. Backup-tool threat model assumes trusted bucket. **CAS cannot switch to a stronger hash without ClickHouse upstream changes** — the hash comes from each part's `checksums.txt`, written by ClickHouse. If adversarial-collision resistance becomes a real requirement, it's an upstream conversation, not a clickhouse-backup change. | +| R16 | `cas-delete` interrupted between deleting `metadata.json` and rest of subtree → metadata-orphans accumulate | Low | Low | Live-set computation ignores subtrees without `metadata.json` (§6.6). Prune does lazy cleanup of metadata-orphan directories. | + +## 9. Deferred to v2 of CAS + +This section is the consolidated backlog of items raised across the design-interview, brainstorming, and external-review waves and explicitly punted out of the v1 ship train. Each entry names a category and a one-line rationale for deferral. Feature-class items get a short "what it would do" line; correctness/operability items name the file or scenario that motivates them. + +### 9.1 Major features + +- **Hash verification on download** (full content re-hash). v1 ships HEAD + size verification only (`cas-verify`); v2 adds `cas-verify --deep` that downloads each blob and re-hashes against the value in `checksums.txt`. Wall-clock cost is minutes-to-hours at 100TB; size verification catches the realistic silent-corruption-from-buggy-GC class for free. +- **Object-disk parts** (s3 / azure / hdfs object disks). `cas-upload` refuses these in v1 (§3); v2 needs a design pass for content-addressing already-remote object stubs and the cross-storage key rewriting paths in `pkg/backup/create.go:1031` / `pkg/backup/restore.go:2227`. +- **Convergent encryption**. Required for v1 client-side-encryption users to migrate to CAS without losing client-side encryption. Known weaknesses (confirmation-of-file attacks) need threat-modeling per deployment. +- **`cas-fsck` repair tool**. Walks local part directories and re-uploads missing blobs in bulk; today the only recovery from a broken backup is `cas-delete` + fresh `create` + `cas-upload`. +- **Parallel `cas-verify`**. Today HEAD calls run sequentially; parallelizing across blobs gives a multi-x speedup at zero correctness cost. Deferred because v1 verify is fast enough at the target scale. +- **Per-blob resumable uploads**. Existing `pkg/resumable` is per-archive; CAS uploads at blob granularity. Either extend resumable state or maintain a separate per-blob completion log. +- **Migration tool from v1 to CAS**. Out of scope for v1; users opt in by writing new backups with `cas-upload`. +- **Distributed locking via S3 conditional create** (true multi-host coordination). Phase 4 added per-backend `PutFileIfAbsent` and Phase 6 wired it into both markers, which closes the local same-name race; cross-host coordination across many writers on the same backup name is still operator-policy. +- **Atomic FTP markers via per-marker directory rename**. v1 of CAS implements FTP atomic-create as a STAT+STOR-to-tmp+RNFR/RNTO best-effort sequence with a small TOCTOU window (gated by `cas.allow_unsafe_markers`). FTP's `MKD` is one of the few primitives that can be made truly atomic on the wire: each marker becomes a *directory* whose creation racing two clients results in one success and one `550 already exists`. Mechanically: marker key `cas/.../inprogress/.marker` becomes a directory `cas/.../inprogress/.marker.d/`; the body is stored as a file inside it after MKD succeeds. Trade-offs: more LIST traffic to read marker bodies; existing object-store backends already use file semantics so this would be FTP-only; the `MKD` race depends on the FTP server actually serializing directory creation (proftpd does; some legacy servers may not). Worth implementing if FTP becomes a primary target rather than a fallback. +- **Local-disk / NFS target for CAS**. Today `cas-*` commands run against object-store backends (S3/Azure/GCS/COS) and SFTP/FTP. A local filesystem target (plain `file://` path or NFS mount) is attractive for on-prem deployments and air-gapped backups. Most pieces port cleanly: blob layout is just files, atomic markers map to `O_CREAT|O_EXCL`, cold-list is `filepath.WalkDir`. Open questions: how `cas-prune`'s `LastModified`-based grace handles NFS clock skew between writer and pruner; whether to expose the existing `pkg/storage` filesystem backend (if any) or write a thin local backend specifically for CAS; concurrency semantics across multiple writers on the same NFS export. +- **Persist object-disk classification at v1 `create` time**. Today `cas-upload`'s preflight has two detectors: a shadow walk (`pkg/backup/cas_methods.go::snapshotObjectDiskHits`) and a live-ClickHouse storage-policy resolver with regex-based parsing (`pkg/backup/cas_methods.go::snapshotMetadataObjectDiskHitsFromCH`). The dual approach is a Phase-6 stopgap because shadow-only missed fully-object-disk-backed tables. Cleaner invariant: extend `TableMetadata` (or `BackupMetadata`) with a per-(db, table) `DiskType` field populated at v1 `create` time; `cas-upload` reads the persisted fact and removes the live-policy detector entirely. The persisted-at-create design is also semantically truer — the backup IS the storage-layout snapshot, so a storage_policy change between create and upload should be transparent to CAS. Touches `pkg/backup/create.go` and the metadata struct (v1 code paths we committed not to regress in this release), so deferred. Track here so it doesn't get lost. +- **Refcount-delta / blob-manifest optimization for prune**. Re-evaluate if catalog grows past several hundred backups or prune wall-clock becomes painful. Decide between post-commit manifest, per-backup blob-list sidecar, or delta files based on real measurements. + +### 9.2 Performance / scalability + +- **Prune mark-phase parallelism**. Today the live-set walk is single-threaded over `cas/metadata/*/`. With hundreds of backups this dominates prune wall-clock. Trivial to parallelize across backups with bounded concurrency. +- **`SweepOrphans` spill-to-disk**. Current implementation streams the merge but holds intermediate state in memory; at very large catalogs (>10⁸ blobs) spill the sorted intermediates to disk. Existing on-disk live-set already does this; mirror the pattern for the orphan side. +- **Streaming archive upload**. Per-table archives are built fully in memory before upload. Streaming the tar.zstd into the multipart upload pipe halves the peak RSS for tables with many small files. +- **Heap-merge for `shardIter`**. Cold-list merges 256 sorted shard streams via a flat sweep; a binary-heap merge is asymptotically tighter and matters when the per-shard stream count grows (e.g. wider sharding in v2). +- **`ExistenceSet` memory bound**. v1 ships in-memory only (per §10.2 estimate, ~600 MB at 10⁷ blobs). Add spill-to-disk only when a real workload exhausts memory. +- **Replace `ColdList` with per-blob `PutFileIfAbsent` + Stat fallback**. Upload today does a 256-shard `LIST` of `cas//blob/` to seed an existence set, then dedups blobs against it. Alternative shape: for each planned blob, attempt `PutFileIfAbsent`; backends that don't support it fall back to `StatFile` + conditional upload. This deletes the global LIST pass, the existence set, the pre-commit re-validation of cold-listed blobs (Phase 7 ColdList TOCTOU defense), and most of the related test scaffolding. Trade-off: at scale the request count flips from `O(shards)` LISTs to `O(planned_blobs)` HEADs/PUTs (≈10⁴ vs 10⁷ for a 100 TB cold-start upload — three orders of magnitude more requests but zero global scan). Worth re-evaluating with real workload measurements; if hit rates make most blobs already-present, the per-blob approach becomes reasonable. Keeps ColdList for v1 since it's measured-known-fast on the realistic case (cold-list dominates wall-clock on dedup-heavy repeat backups). + +- **Semaphore acquisition does not respect ctx cancellation.** `pkg/cas/upload.go::uploadMissingBlobs` and four other goroutine-pool sites use `sem <- struct{}{}` without a `select { case sem <- ...: case <-ctx.Done(): return }`. Goroutines queued on the semaphore drain slowly when ctx is cancelled (O(N/parallelism) batches). Not a deadlock, but extends shutdown latency on large catalogs. Tighten when needed. +- **Prune `--parallelism` flag.** `SweepOrphans` uses `const parallelism = 32`; mark phase uses literal `16`. Upload/Download respect `cfg.General.{Upload,Download}Concurrency`. Add `Parallelism int` to `PruneOptions` and thread it through. + +### 9.3 Operability / observability + +- **Structured prune logs**. Today prune emits human-readable status lines; for cron / observability pipelines, add a `--log-format=json` option emitting one structured event per phase (mark-start, mark-done with counts, sweep-start, sweep-done with bytes-reclaimed, marker-release). +- **Populate `BlobsTotal` and `OrphansHeldByGrace` in PruneReport**. Fields exist; values are zero in v1 because counting them adds a LIST pass. Cheap and useful for capacity planning. +- **`BytesReclaimed` formatting**. Report carries raw bytes; surface `FormatBytes` rendering in CLI output. +- **Upload / download progress logging**. v1 logs only at start/end of each archive. Per-blob progress (especially for download) helps operator confidence on large restores. +- **`cas-status` historical trend**. Today reports a snapshot; persisting a small JSON history file (last N runs) would let `cas-status --trend` surface growth/shrink rates without external infra. + +### 9.4 Correctness defenses (low-likelihood, defense-in-depth) + +- **Downgrade warning for `LayoutVersion`**. Operators downgrading to a tool that doesn't recognize the persisted `BackupMetadata.CAS.LayoutVersion` get a refusal at restore time. Document the upgrade-then-downgrade hazard explicitly in the runbook (the operator runbook has a "Binary rollback procedure" section as of wave-5; that warning is about the v1 retention path, not LayoutVersion mismatches at restore time. Both warnings should coexist). +- **`pkg/pidlock` TOCTOU**. The shared pidlock implementation does read-then-check-then-write across three non-atomic steps. Two concurrent callers can both pass the liveness check and write competing PID files. This is a whole-tool concern (v1 and CAS both use it), not CAS-specific; CAS has worked around it for the cas-download phase via separate prefixes (Phase 8 P2-b) but the underlying race remains. Replace with `O_CREAT|O_EXCL` or a sync.Mutex keyed by lock name. +- **Probe-key cleanup on prune**. `pkg/cas/probe.go` writes a sentinel under `cas-conditional-put-probe-` and deletes it before returning. If the process crashes between the first PutFileIfAbsent and the deferred Delete, the sentinel persists. Prune today does not sweep the cluster root for these. Cheap fix: have prune walk and delete `cas-conditional-put-probe-*` keys older than e.g. 1 hour, OR write probe keys under `tmp/` and have prune sweep that subtree. + +### 9.4.x Storage-layer cleanup + +- **`atomicSwapDir` rename.** `pkg/cas/download.go::atomicSwapDir` is named misleadingly — its body acknowledges the swap is not OS-atomic. Rename to `replaceDir` (or `swapDirBestEffort`) and update the doc-comment. +- **`markerTool` package-level var written without synchronization.** Safe in production (set once before server starts). If tests are ever run with `t.Parallel()` and call `cas.SetMarkerTool` concurrently, this becomes a data race. Either gate behind `sync/atomic.Value` or add a code comment documenting the single-write contract. +- **`FTP.AllowUnsafeMarkers` field exposure on the storage struct**. `pkg/storage/ftp.go:35` exports `AllowUnsafeMarkers bool` so the CAS layer can wire the config flag through `pkg/storage/general.go`'s NewBackupDestination. No other backend embeds a CAS-specific policy field on its struct — the asymmetry leaks CAS semantics into the storage abstraction. Cleanup options: (a) make it unexported and add a setter the CAS layer calls; (b) remove from the struct and have the CAS layer wrap PutFileIfAbsent with the fallback above the storage interface. Refactor preference, not a correctness bug. + +### 9.5 Test coverage (deferred — load-bearing tests already ship) + +- **`TestPrune_FailClosedOnNilCASMetadata`**. If a v1-style `metadata.json` lands in `cas//metadata/`, all subsequent prune runs abort with "no CAS field". Behavior is correct; lock it with a focused unit test asserting (a) the abort, (b) zero blobs deleted in that run. +- **`TestBackupList_SkipsV1BackupNamedSameasCASPrefix`**. Wave-A added a WARN log when `BackupList` skips an entry matching a CAS prefix. Add a test that an entry literally named `"cas"` is correctly skipped, while `"casematch"` is NOT — verifies the equality vs. HasPrefix branches. +- **`TestListRemoteCAS_WalkError`**. `pkg/backup/list.go::CollectRemoteCASBackups` swallows walk errors and returns an empty slice. Add a unit test that asserts a walk error is logged but not propagated, so a future refactor doesn't accidentally break the fail-open contract. + +- **`casstorage.Walk` absolute-key reconstruction contract test.** `pkg/cas/casstorage/backend_storage.go::Walk` reconstructs absolute keys because all six known backends strip the configured path prefix from `rf.Name()`. This is correct today but not formally contracted; a new backend returning absolute keys would silently double-prepend. Add a table-driven test exercising all six backends, OR document the contract on the `RemoteStorage.Walk` doc-comment. +- **`root_prefix` mid-deployment-change auto-detect.** Operator-policy concern documented in the runbook (deferred per user decision). If automated detection becomes desirable: on startup with CAS enabled, stat a sentinel under the default prefix and warn loudly if CAS-shaped objects are found there (e.g. `//prune.marker`). + +### 9.6 UX / docs polish + +- **`--data` flag is a no-op on v1 commands when CAS is enabled**. Already hidden in the CLI; documented in operator runbook. Remove entirely when CAS becomes the default in a future major version. +- **`cas-delete --force`** to bypass stale-marker checks. Today operators clear stranded markers via `cas-prune --abandon-threshold=0s`; a direct `--force` flag on `cas-delete` is more discoverable. +- **Help-text examples for common flows**. README has the headline flows; per-command `--help` could carry one or two example invocations each. +- **Changelog entries**. The phased shipping has produced ~50 commits on `cas-phase1`; before merge, condense into a coherent CHANGELOG section that names the feature and references this design doc rather than the per-phase plan files (which are gitignored). + +### 9.7 Out of scope (not on any roadmap) + +- Garbage collection of metadata across replicas/clusters beyond what mark-and-sweep already handles. +- Object Lock / immutability features beyond what's intrinsic to content addressing. +- Cross-cluster blob sharing. Phase 1 mandates `cluster_id`; if cross-cluster dedup ever becomes a requirement, it's a v2 conversation with its own threat model (one cluster can poison another's blob store). +- Adversarial-collision resistance on the content hash. The hash is whatever ClickHouse writes in `checksums.txt` (CityHash128 today); switching to a stronger hash is an upstream conversation, not a clickhouse-backup change. + +### 9.8 Implementation-time decisions + +- **Inline threshold default**: 256 KiB is a starting point; profile against a representative ClickHouse part-file distribution before locking it in. + +## 10. Appendix + +### 10.1 Request-rate sanity check (justifies 256-prefix sharding) + +S3 limits: ~3500 PUT/COPY/POST/DELETE per second per partition prefix; ~5500 GET/HEAD per second per partition prefix. + +**Upload phase** (100 TB × 10⁷ files; assume 1 GB/s network; ~10 MB avg file): +- Aggregate ~100 PUT/s. Distributed evenly across 256 prefixes → ~0.4 PUT/s/prefix. Three orders of magnitude under the limit. +- Worst case (small-file-heavy, 1 MB avg): ~1000 PUT/s aggregate → ~4 PUT/s/prefix. Still trivial. + +**Cold-list phase**: +- 10⁷ blobs / 1000 keys per page = 10⁴ LIST calls. With 256-way parallelism: ~40 LIST per prefix; <1 second wall-clock. +- Cost: ~$0.05. + +**Garbage collection**: +- LIST `metadata/*` → ~100 entries; one call. +- Metadata archive download: ~10⁴ archives total at ~MB each; tens of GB total; same-region S3 egress is free; minutes wall-clock. +- LIST `blob/*` for orphan scan: same as cold-list; <1 second wall-clock. + +Two-byte sharding gives ample headroom. One-byte (16 prefixes) would also work at this scale. Two-byte is git-familiar and provides headroom for users with much larger catalogs. + +### 10.2 Memory budget + +- **Upload-time existence cache**: ~10⁷ blobs × 32 bytes/hash + overhead ≈ 600 MB peak. v1 ships in-memory only; spill-to-disk added only if a real workload exhausts memory. (600 MB is acceptable on any host already running clickhouse-backup against 100TB.) +- **GC-time live-set**: ~10⁸ refs aggregate across 100 backups; held as a sorted on-disk file (streaming mergesort over per-backup contributions). Bounded RAM regardless of catalog size. +- **GC-time orphan-scan**: streaming compare against the on-disk live-set; bounded RAM. + +### 10.3 Implementation phasing + +**Phase 1** — MVP upload + restore round-trip (the smallest shippable thing): +- Move `docs/checksumstxt/` into `pkg/checksumstxt/`; extend tests with real ClickHouse part fixtures (compact, wide, encrypted, projection, multi-disk) +- `pkg/cas/config.go` with the §6.11 schema; `BackupMetadata.CAS` struct + persistence (§6.2.1) +- Blob path derivation, encoded db/table path components +- Object-disk detection (pre-flight + `--skip-object-disks`) +- `cas-upload`: prune-lock check, `metadata.json` collision check, cold-list cache, blob upload, per-table `.tar.zstd`, ordered commit (§6.4 step 13) +- `cas-download` and `cas-restore`: shadow-directory staging with `DataFormat="directory"`, filter support (`--tables`, `--partitions`, `--schema-only`, `--data-only`, `--restore-database-mapping`, `--restore-table-mapping`, `--rm`); `--ignore-dependencies` rejected with explicit error +- `cas-delete` (prune-lock check, ordered delete §6.6) +- `cas-verify` (HEAD + size; `--json`) +- `cas-delete` (with §6.6 ordering: metadata.json first) +- `list remote` extended to surface CAS backups with `[CAS]` tag +- v1 `BackupList` / `RemoveBackupRemote` / `RemoveOldBackupsRemote` / `CleanRemoteBroken` exclude the configured CAS root prefix- Cross-mode guards in v1 `delete` / `download` / `restore` (§6.2.2) +- README + `--help` discoverability hooks (§6.10) + +**Phase 1.5** — operational primitives (between MVP and prune): +- `cas-status` (bucket health summary; LIST-only, cheap; surfaces in-progress markers and prune-marker state) + +**Phase 2** — prune: +- `cas-prune`: mark-and-sweep with exclusive lock (refuses while `cas-upload`/`cas-delete` are in flight; the symmetric refusal is enforced from the upload/remove side), abandoned-upload sweep, grace-period delete, fail-closed on unreadable live-backup metadata, metadata-orphan lazy cleanup +- `--dry-run` for sanity checks +- Operator runbook (when to run, what failures mean, manual recovery from `cas-verify` output) + +**Phase 3 (shipped)** — planner correctness: +- Two-pass projection-aware part walker (extract-set + file walk) replacing the original recursive directory scan; closes the projection-path silent-skip class +- `ExcludedTables` plumbed through `UploadOptions` so `--skip-object-disks` excludes by decoded `(db, table)` pair, not by `DiskInfo.Path` (which was empty) +- Empty-`Parts` table guard in `uploadTableJSONs` (skips tables with `len(tp.parts) == 0` to avoid producing archives with no entries) +- `validateChecksumsTxtFilename` hoisted above the `.proj` recursion branch, closing the path-traversal corner + +**Phase 4 (shipped)** — atomic markers: +- New `PutFileAbsoluteIfAbsent(ctx, key, r, size) (created bool, err error)` on `pkg/storage.RemoteStorage`, with `ErrConditionalPutNotSupported` sentinel +- Implementations: S3 `IfNoneMatch: "*"` on direct PutObject (bypasses `s3manager.Uploader` because markers are <1KB); Azure `If-None-Match: *`; GCS `Conditions{DoesNotExist: true}`; COS `If-None-Match: *`; SFTP `OpenFile(O_WRONLY|O_CREATE|O_EXCL)` mapping to `SSH_FXF_EXCL`; FTP refuses by default, opts into STAT+STOR+RNFR/RNTO best-effort with `cas.allow_unsafe_markers` +- Symmetric relative-key `PutFileIfAbsent` on the `cas.Backend` adapter (so casstorage marker writes go through the configured `//cas/...` prefix instead of bucket-root) +- `WriteInProgressMarker` and `WritePruneMarker` return `(created, err)`; upload/prune branch on `!created` and surface a diagnostic naming the existing marker's host + start time + +**Phase 5 (shipped)** — backend smoke tests: +- testcontainers-driven integration coverage for MinIO, Azurite, fake-gcs-server, and OpenSSH-server SFTP; FTP exercised via proftpd in the refusal path. 16 CAS integration tests pass (15 PASS, 1 SKIP), spanning 5 of 6 backends +- Surfaced and fixed 4 pre-existing storage-layer bugs (SFTP/FTP `WalkAbsolute` and `DeleteFile` not-found handling) that CAS exercises but v1 paths never hit + +**Phase 6 (shipped)** — P1 defects from external review: +- Inprogress-marker cleanup on the StatFile-recheck error branch and on the metadata.json commit failure (steps 11b and 12) — previously leaked the marker, blocking the backup name for `abandon_threshold` (default 7 days) +- `cas-prune --dry-run --unlock` is now a no-op rather than deleting the real prune marker +- `cas-{download,restore} --data-only` returns `ErrNotImplemented` instead of silently doing a full download +- Zero-`ModTime` markers (FTP `LIST` without MLSD facts) are treated as fresh; zero-`ModTime` blobs are treated as inside grace — closes the data-loss path where prune sweeps every active marker on FTP-like backends +- Object-disk preflight now scans `metadata.json` rather than only the local shadow tree, catching fully-remote tables that have no local part directories +- `--skip-object-disks` exclusions are computed against decoded `(db, table)` names (matching planUpload's lookup) rather than the encoded shadow directory names + +**Phase 7 (shipped)** — cleanup round: +- `ColdList` TOCTOU re-validation: after the pre-commit prune-marker re-check, HEAD every blob skipped via cold-list and abort if any disappeared (closes the narrow window where a concurrent prune past `grace_blob` could delete blobs the upload was about to commit a reference to) +- `PruneReport` counters populated: `BlobsTotal` and `OrphansHeldByGrace` now reflect actual scan counts, no extra LIST passes +- `BytesReclaimed` rendered via `utils.FormatBytes` in `PrintPruneReport` with raw count in parentheses +- Defensive `cfg.Validate()` at `Prune` entry to protect embedded callers from misconfigured input +- Explicit-zero `--grace-blob=0s` / `--abandon-threshold=0s` override semantics locked with focused unit tests +- Per-backend not-found classification tests in `pkg/storage/errors_test.go` (S3 against httptest is real production code; GCS/COS/FTP exercise mirror-functions documenting intent; azblob/SFTP `t.Skip` with integration-test pointers — see §9.5 deferred) +- `casstorage.Walk` key-reconstruction extracted into testable `reconstructAbsoluteKey` helper with table-driven coverage +- Cross-backup dedup integration test: third backup whose payload column files are byte-identical to an earlier backup's reuses 100% of those blobs (`bytesC/bytesA = 0%`) + +**Phase 8 (shipped)** — wait_for_prune + REST API: +- `cas.wait_for_prune` config knob and `--wait-for-prune=DUR` CLI flag on `cas-upload` and `cas-delete`. When > 0, polls the prune marker every 2s for up to that duration before refusing. Explicit `0s` overrides non-zero config. The pre-commit prune-marker re-check (upload step 11a) deliberately does NOT wait — any prune that started after step 2 is racing in-flight blob uploads and the safe response is to abort. +- All seven CAS commands wired through the daemon-mode REST API: dedicated routes (`POST /backup/cas-upload/{name}` etc.), `/backup/actions` recognizes the same `cas-*` verb names, `GET /backup/list` merges CAS backups into the existing array with a `kind` field (`"v1"` or `"cas"`) and an optional `cas` sub-object on CAS rows. Async commands (upload, download, restore, verify, prune) return an `acknowledged` envelope with an `operation_id`; clients poll `GET /backup/status` for completion. `cas-delete` is sync; `cas-status` is sync GET. Backuper signatures unified to take a `commandId int` parameter so HTTP and CLI register identically in `status.Current`. + +**Phase 9 (planned)** — performance and operability: +- See §9.2 (performance) and §9.3 (operability) for the consolidated backlog. None of these are correctness gates; they are response to real workload measurements. +- Performance benchmarks against representative datasets. **TODO**: pin concrete success targets before benchmarking. Suggested starting points (operator to confirm): + - **Mutation dedup**: post-mutation backup uploads ≤ 5% of unmutated backup size on a 100TB-with-one-mutated-column scenario (the headline value-prop). + - **Cold full backup**: within 1.2× of v1's wall-clock for the same dataset (slight overhead acceptable due to per-file HEAD checks). + - **Repeat-of-same-data backup**: < 5 min wall-clock for 100TB if all blobs are already present (cold-list dominates). + - **Restore**: within 1.5× of v1's wall-clock (slower due to per-blob fetches; acceptable trade for chain-free). +- Stress tests for the prune-lock + grace-period correctness paths under sustained concurrent upload load. + +### 10.4 Ship-gating tests + +Implementer fills in normal coverage during code review. These are the load-bearing tests that must pass before each phase ships: + +**Phase 1:** +- `TestCASRoundtrip` — cas-upload → cas-download → byte-compare every file. +- `TestMutationDedup` — the headline value-prop. Backup, ALTER UPDATE one column, OPTIMIZE, backup again; assert the second backup uploads roughly the mutated-column's blobs only. +- `TestCompatibilityMixedBucket` — v1 + CAS backups same bucket; v1 commands refuse CAS targets; v1 retention/list/clean-broken don't touch CAS prefix. +- `TestV1RefusesCASBackup` / `TestCASRefusesV1Backup` — cross-mode guards. +- `TestUploadCommitChecksPruneMarker` — pre-commit re-check closes the old-orphan-reuse race. +- `TestParseV4_MultiBlock` / `TestParseFilenameTraversal` — parser hardening. +- `TestTarExtractionContainment` — path-traversal defense (also patch the v1 path). + +**Phase 2 (prune):** +- `TestPruneGracePeriodRespected` — fresh blob younger than `grace_blob` is never deleted. +- `TestPruneMarkerReleasedOnError` — defer-release runs on every exit path. +- `TestPruneSweepsAbandonedMarker` — markers older than `abandon_threshold` are cleaned up. + +### 10.5 Glossary + +- **Blob**: an immutable file in `cas/blob//`, content-keyed by the CityHash128 of its contents. +- **Live set / referenced set**: union of blob paths referenced by any backup whose `metadata.json` exists. +- **Orphan**: a blob in the blob store with no live references. +- **Grace period (`grace_blob`)**: the minimum age a blob must have before prune may delete it. +- **Abandon threshold**: how long an `inprogress` marker must persist before being treated as a crashed upload. +- **Cold-list**: parallel `LIST` of all `cas/blob//` prefixes at the start of an upload, to seed the existence cache. +- **In-progress marker**: a small sentinel file at `cas/inprogress/.marker` written when an upload starts and deleted at commit. +- **Prune marker**: `cas/prune.marker`. The advisory exclusive lock for GC. While present, `cas-upload` and `cas-delete` refuse to start. diff --git a/docs/cas-operator-runbook.md b/docs/cas-operator-runbook.md new file mode 100644 index 00000000..784a06cb --- /dev/null +++ b/docs/cas-operator-runbook.md @@ -0,0 +1,492 @@ +# CAS Operator Runbook + +This runbook covers day-to-day operation of the content-addressable backup +mode (`cas-*` commands). For the design rationale see +[docs/cas-design.md](cas-design.md). For end-user usage see the README. + +## ⚠️ Binary rollback procedure (READ FIRST IF DOWNGRADING) + +> **🛑 STOP. Read this section before downgrading the clickhouse-backup binary if CAS data exists in your bucket.** + +Pre-CAS binaries (any release that does not include the `cas-*` commands) have **no knowledge of the `cas/` skip prefix**. When such a binary runs `clean remote_broken` — or when the scheduled `BackupsToKeepRemote` retention logic fires — it sees `cas//…` as a malformed v1 backup tree and **deletes the entire CAS namespace**, including all blob data and metadata. This is irrecoverable without an independent copy. + +**You have three safe options. Choose one before downgrading:** + +1. **Pin the new binary in place — do not downgrade.** The safest and simplest option. If the reason for downgrading is a bug in the new binary, fix the bug instead. + +2. **Move CAS data out of the bucket first.** Using your cloud console or CLI, rename (copy + delete) the `cas/` prefix to a different name that won't be touched by v1 retention (e.g. `cas-archived/`). The old binary will not see it. Restore the rename when the binary is upgraded again. + + ```sh + # Example with mc (MinIO Client): + mc cp --recursive myminio/mybucket/cas/ myminio/mybucket/cas-archived/ + mc rm --recursive --force myminio/mybucket/cas/ + + # Example with AWS CLI: + aws s3 cp s3://mybucket/cas/ s3://mybucket/cas-archived/ --recursive + aws s3 rm s3://mybucket/cas/ --recursive + ``` + +3. **Disable v1 retention/cleanup jobs before downgrading, and keep them disabled until upgraded again.** + - Set `BackupsToKeepRemote: 0` in every config that touches this bucket. + - Remove `clean remote_broken` from all cron entries. + - Do **not** re-enable either until the binary is upgraded back to a CAS-aware release. + - Document this as a temporary state so it isn't forgotten. + +> **Warning:** There is no partial protection. A single `clean remote_broken` call from any pre-CAS host with access to the bucket is enough to destroy all CAS data. If you operate multiple hosts or automation pipelines, all of them must be updated or disabled before downgrading any one host. + +--- + +## First production deployment (start here) + +> ⚠️ **CAS is experimental.** The on-disk layout may change incompatibly +> before the feature is marked stable. Validate on non-critical workloads +> first and keep parallel v1 backups (or copies outside the CAS namespace) +> until you've gained confidence. See `docs/cas-design.md` for stability notes. + +This section walks an operator from zero to a first scheduled prune. Each +subsection is a gate — don't advance until the current step is clean. + +### 1. Validate config + +Open your config file (default `/etc/clickhouse-backup/config.yml`) and +confirm the following fields are set: + +| Field | Requirement | +|---|---| +| `cas.enabled` | `true` | +| `cas.cluster_id` | Non-empty, **unique per source cluster** | +| `cas.root_prefix` | Set (default `cas/`; leave unless you have a reason to change) | +| `cas.grace_blob` | `24h` default; increase if prune windows are infrequent | +| `cas.abandon_threshold` | `168h` default; lower only if you have noisy uploader crashes | + +```sh +clickhouse-backup print-config 2>/dev/null | grep -A15 "^cas:" +``` + +### 2. First test backup (low-risk table) + +Pick a small, non-critical table. Do **not** run the first CAS upload +against production-critical data until step 4 completes. + +```sh +clickhouse-backup create test-cas-bk1 --tables=mydb.small_table +clickhouse-backup cas-upload test-cas-bk1 +``` + +The upload summary reports bytes uploaded vs. reused. On a fresh cluster +expect 100 % uploaded / 0 % reused. Dedup gains appear from backup 2 onward. + +### 3. Validate via cas-verify + +```sh +clickhouse-backup cas-verify test-cas-bk1 +``` + +Zero failures is the bar. If `missing` or `size_mismatch` failures appear, +see [Recovering from cas-verify failures](#recovering-from-cas-verify-failures) +below. + +### 4. Round-trip restore + count check + +Drop the test table, restore from CAS, and confirm row counts match the +pre-backup baseline: + +```sh +clickhouse-client -q "SELECT count() FROM mydb.small_table" # record N +clickhouse-backup cas-restore test-cas-bk1 --rm +clickhouse-client -q "SELECT count() FROM mydb.small_table" # must equal N +``` + +A mismatched count indicates a data or config problem; investigate before +proceeding to production backups. + +### 5. Set up scheduled prune + +`cas-prune` is the garbage collector; run it regularly (weekly is a safe +default, daily for high-churn deployments). Schedule it in a quiet window +when no concurrent uploads are expected. For the prune's behavior and flags +see [When to run cas-prune](#when-to-run-cas-prune) below. + +```cron +# Example: daily at 03:00 UTC +0 3 * * * /usr/bin/clickhouse-backup cas-prune +``` + +If cron timing cannot guarantee no overlap with scheduled uploads, set +`cas.wait_for_prune` so uploads poll and retry instead of failing immediately: + +```yaml +cas: + wait_for_prune: "10m" +``` + +### 6. Monitoring + +`cas-status` is LIST-only (never writes) and cheap to run frequently. Pipe +its output into your log pipeline and alert on: + +- Prune marker present for more than 2× expected prune duration → stranded + marker. +- Abandoned in-progress markers accumulating → failed uploads or dying hosts. +- Total blob bytes growing linearly despite stable backup count → `cas-prune` + is not running. + +See [Monitoring suggestions](#monitoring-suggestions) below for the full +alert catalogue. + +### 7. Recovery procedures + +For step-by-step recovery instructions see the dedicated sections below: + +- Stranded prune marker → [Recovering from a stranded cas/\/prune.marker](#recovering-from-a-stranded-casclusterprunemarker) +- Stranded upload marker → [Recovering from a stranded inprogress marker](#recovering-from-a-stranded-inprogress-marker) +- `cas-upload` refusal due to concurrent marker → [Recovering from a concurrent cas-upload refusal](#recovering-from-a-concurrent-cas-upload-refusal) +- Corrupt backup found by `cas-verify` → [Recovering from cas-verify failures](#recovering-from-cas-verify-failures) + +### 8. REST API + +In daemon mode all CAS commands are available via HTTP at the same port as +the v1 API. See [REST API endpoints](#rest-api-endpoints) below for the full +route table, async polling pattern, and example `curl` calls. + +--- + +## Known limitations (v1) + +The `cas-*` commands ship as **experimental** in v1. Things v1 explicitly does +not do; expect them to land in later releases: + +- **`--tables` patterns are glob-only, not regex.** `--tables=db.*` or + `--tables=db.tab[12]` work (filepath.Match semantics, parity with v1). + Regex-style filters (`^db\..*_temp$`) do not. +- **Object-disk tables are refused.** Tables on disks of type `s3`, + `s3_plain`, `azure_blob_storage`, `azure`, `hdfs`, `web`, or `encrypted` + layered on any of those are blocked by the `cas-upload` preflight. Use + `--skip-object-disks` to exclude or v1 `upload` for those tables. Lifted + in a future release once content-addressing of already-remote object + stubs is designed. +- **Multi-host concurrent upload to the same backup name is unsupported.** + Two hosts running `cas-upload mybackup` simultaneously can race past the + same-name check and last-writer wins on `metadata.json`. Use unique names + per writer (e.g. `____`). +- **Hash verification on download is HEAD + size only.** `cas-verify` and + `cas-download` confirm each blob's *size* against the value in + `checksums.txt`; they do NOT re-hash blob bytes. Silent corruption from a + buggy GC is caught; an attacker who replaces a blob with same-sized + garbage at the same key is not (CityHash128 is non-cryptographic; the + threat model assumes a trusted bucket). +- **No per-blob resumable uploads.** Existing `pkg/resumable` operates at + per-archive granularity; CAS uploads at blob granularity have no resume + protocol yet. A killed `cas-upload` re-uploads everything that wasn't + already in the blob store on the next attempt (cold-list dedup limits + the cost). +- **FTP is best-effort.** With `cas.allow_unsafe_markers=true` FTP markers + use a STAT+STOR+RNFR/RNTO sequence with a small race window. Without the + flag, CAS refuses on FTP. SFTP, S3, GCS, Azure, COS all have native + atomic primitives. +- **Old MinIO is rejected.** The conditional-put startup probe refuses + MinIO releases pre-`RELEASE.2024-11-07T00-52-20Z` because they silently + ignore `If-None-Match: "*"`. Update MinIO, switch to a different + backend, or set `cas.skip_conditional_put_probe=true` after independent + validation of the precondition. +- **Cross-cluster blob sharing is not supported.** Each cluster has its + own namespace under `cas.root_prefix + cas.cluster_id + "/"`. Two + clusters writing to the same bucket cannot dedup against each other. + +A consolidated v2 backlog with rationale lives in `docs/cas-design.md` §9. + +### Changing `cas.root_prefix` + +> **Warning:** Changing `cas.root_prefix` while CAS data exists at the old prefix (e.g. renaming `"cas/"` to `"snapshots/"`) silently exposes the old data to v1 retention and `clean remote_broken`. The old binary — and even the new binary running with the updated config — no longer skips `cas/` because the configured skip prefix has changed to `snapshots/`. Any scheduled `BackupsToKeepRemote` or `clean remote_broken` job that runs during or after the config flip will see the old `cas/` subtree as broken v1 backups and delete it. + +To migrate safely, do one of the following **before** flipping the config: + +- **Copy/move the old prefix to the new one first**, then update `cas.root_prefix`: + ```sh + # Move cas/ → snapshots/ before changing any config file. + mc cp --recursive myminio/mybucket/cas/ myminio/mybucket/snapshots/ + mc rm --recursive --force myminio/mybucket/cas/ + # Only now update cas.root_prefix: "snapshots/" + ``` +- **Disable v1 retention and `clean remote_broken` for the duration of the transition**, perform the copy/move, update the config, verify with `cas-status`, then re-enable retention. + +--- + +## When to run `cas-prune` + +`cas-prune` is the garbage collector. After every `cas-delete` (and after +crashed `cas-upload` runs), orphan blobs accumulate in remote storage; they +are reclaimed only by `cas-prune`. + +- **Cadence:** weekly is a safe default; daily for high-churn deployments + (lots of mutations + frequent `cas-delete`). +- **Quiet window:** while `cas-prune` runs it holds an advisory marker + (`cas//prune.marker`) that causes concurrent `cas-upload` and + `cas-delete` to refuse. Run during a window when no scheduled backups + start or expire. The integration test on a 3-backup workload completes + in well under a minute; a real 100-backup catalog typically runs in a + few minutes plus the LIST round-trips. +- **Concurrency:** only one host at a time. `cas-prune` does not implement + distributed locking; two hosts that race the marker race-write will + abort one of them via run-id read-back, but operators must serialize + manually across replicas (no overlapping cron entries). + +```sh +clickhouse-backup cas-prune # use configured grace/abandon +clickhouse-backup cas-prune --dry-run # preview candidates, no writes +clickhouse-backup cas-prune --grace-blob=1h # tighter grace for cleanup runs +clickhouse-backup cas-prune --grace-blob=0s # zero grace (immediate reclaim) +``` + +## Reading `cas-status` + +`cas-status` is a LIST-only health summary; safe to run at any time +(it never writes). Sample output: + +``` +CAS status (cluster=prod-1): + Backups: 12 (newest: bk_2026_05_06, oldest: bk_2026_05_01) + Blobs: 42,318 objects, 5.2 TiB + + Prune marker: NONE + In-progress markers: 1 fresh, 0 abandoned + fresh: bk_pending (5m ago) +``` + +Field meanings: +- **Backups**: count of `cas//metadata//metadata.json` entries. +- **Blobs**: count + total bytes under `cas//blob/`. +- **Prune marker**: shows `NONE`, or ` (run_id=..., age=Xm)` if held. + An age much larger than your typical prune duration suggests a stranded + marker; see "Recovering from a stranded prune marker" below. +- **In-progress markers**: counts and lists per-backup upload markers. + - `fresh`: younger than `cas.abandon_threshold`. Treat as a real upload + in flight; don't act on it until it ages out. + - `abandoned`: older than `cas.abandon_threshold`. Reclaimed automatically + by the next `cas-prune`. You can also delete the marker manually if + the upload host is confirmed dead. + +## Recovering from a stranded `cas//prune.marker` + +A stranded marker happens when `cas-prune` is killed by SIGKILL or OOM-kill +before its deferred release fires. Symptoms: + +- `cas-status` shows `Prune marker: (age=2h+)` long after the + expected prune duration. +- `cas-upload` and `cas-delete` refuse with `cas: prune in progress`. + +Recovery: + +1. **Verify no prune is actually running.** Check `ps`/`systemctl` on the + host listed in the marker. If something IS running, do not interrupt it. +2. If confirmed dead, clear the marker: + + ```sh + clickhouse-backup cas-prune --unlock + ``` + + `--unlock` deletes the marker and exits. It refuses if no marker is + present (safety). + +3. Re-run `cas-prune` normally to reclaim any orphans the killed run + would have caught. + +## Recovering from a stranded inprogress marker + +A stranded `cas//inprogress/.marker` (without the matching +`metadata.json`) happens when `cas-upload` crashes mid-run. The next +`cas-prune` reclaims any markers older than `cas.abandon_threshold` +(default 168h = 7 days). To accelerate: + +```sh +# Override threshold for this run only: +clickhouse-backup cas-prune --abandon-threshold=24h +``` + +Or, if you're confident the upload is dead and don't want to wait: + +```sh +# Manual marker delete (operator authority required to reach the bucket): +mc rm ///inprogress/.marker +# or via gsutil/aws s3 rm for the corresponding backend. +``` + +## Recovering from a concurrent cas-upload refusal + +If `cas-upload` is killed (SIGKILL, OOM-kill, host crash) before its +deferred cleanup fires, the `cas//inprogress/.marker` +remains in remote storage. The next `cas-upload` for the same backup +name refuses with: + + cas: another cas-upload is in progress for "" on host= + started=; wait for it to finish or run cas-prune + --abandon-threshold=0s if confirmed dead + +Recovery: + +1. **Verify nothing is actually running.** Check `ps`/`systemctl` on the + host listed in the error message. If something IS running, do not + interrupt it. + +2. If confirmed dead, sweep the marker: + + ```sh + clickhouse-backup cas-prune --abandon-threshold=0s + ``` + + This treats every inprogress marker as abandoned regardless of age and + reclaims it. Then retry `cas-upload`. + +## Backend support for atomic markers + +`cas-upload` and `cas-prune` rely on atomic create-only-if-absent writes +to their respective markers. Backend support: + +| Backend | Atomic markers | Notes | +|---|---|---| +| s3 | yes | Requires MinIO ≥ RELEASE.2024-11 or AWS S3 (always supported) | +| azblob | yes | Native If-None-Match | +| gcs | yes | Native generation-match | +| cos | yes | Native If-None-Match | +| sftp | yes | Server-side via SSH_FXF_EXCL | +| ftp | NO by default | Set `cas.allow_unsafe_markers: true` to enable best-effort with documented race window | + +If your backend is FTP and you have not set `cas.allow_unsafe_markers`, +`cas-upload` and `cas-prune` will refuse with an `ErrConditionalPutNotSupported`-derived +message at marker-write time. + +### CI smoke-test coverage + +The atomic-marker primitive is exercised end-to-end against a real-or- +emulator server in CI for these backends: + +| Backend | Integration test | Emulator | +|---|---|---| +| s3 | `TestCAS*` (11 tests covering upload, restore, prune, projections, empty tables, concurrency) | MinIO `latest` | +| gcs | `TestCASSmokeGCS` (full upload → restore → delete → prune cycle) | fake-gcs-server `latest` | +| azblob | `TestCASSmokeAzure` (same cycle) | Azurite `latest` | +| sftp | `TestCASSmokeSFTP` (same cycle) | OpenSSH-server (panubo/sshd `latest`) | +| ftp | `TestCASSmokeFTPRefusesByDefault` + `TestCASSmokeFTPOptIn` | proftpd `latest` | +| cos | none — no Tencent COS emulator available | rely on SDK correctness; report regressions to maintainers | + +S3 has the most thorough coverage (11 tests covering concurrency, +partial restore, projections, etc.). The other backends have a single +smoke test each that proves the core upload/restore path works through +that backend's atomic-marker primitive. The smoke tests catch SDK-level +wiring bugs (e.g., the `casstorage` adapter calling the wrong method +that Phase 4 T12 caught) but do not cover concurrency edge cases on +non-S3 backends; if a real-world race is suspected on Azure / GCS / SFTP / +FTP, request a follow-up. + +## Recovering from `cas-verify` failures + +`cas-verify` reports three failure kinds: + +- **`missing`** — the blob isn't in remote storage. Either truly lost or + reclaimed by an over-eager `cas-prune` (rare; would indicate a bug). +- **`size_mismatch`** — the blob exists but its size differs from what + `checksums.txt` recorded. Truncated upload or external mutation. +- **`stat_error`** — transient backend error during the HEAD probe. Re-run + `cas-verify` before assuming the blob is bad. + +For `missing`/`size_mismatch`, the affected backup is unrestorable. Phase 1 +has no automated repair (`cas-fsck` is a Phase-3 candidate). Workflow: + +```sh +clickhouse-backup cas-delete # remove broken metadata +clickhouse-backup create # fresh local snapshot +clickhouse-backup cas-upload # re-upload +``` + +CAS backups are independent: losing one doesn't affect any other. + +## Backend assumptions + +`cas-prune` and `cas-status` assume the configured object store provides: + +1. **Read-your-writes consistency for individual objects.** Get/Stat + immediately after Put returns the new content. Standard on AWS S3, + GCS, Azure Blob, MinIO ≥ 2020. +2. **Meaningful `LastModified`** that reflects the actual write time + (not a quirky monotonic clock or a clamped fixed value). The grace + window is enforced via this field. + +On-prem MinIO sandboxes occasionally have skewed clocks; if `cas-status` +reports an "abandoned" marker that's actually fresh, check NTP sync on +the MinIO host first. + +## Monitoring suggestions + +Alerts to consider: + +- **Prune marker stuck**: `cas-status` reports a `prune marker` older + than your typical prune duration (e.g., > 30 min for typical + catalogs). Likely a stranded marker — page on-call. +- **Abandoned-marker accumulation**: more than N abandoned in-progress + markers indicates either a buggy uploader or a dying host. N=3 + triggers a warning; N=10 a page. +- **CAS bucket growth**: track total blob bytes over time. After the + first warm-up week the curve should asymptote. Continued linear + growth despite stable backup count suggests `cas-prune` is not + running (is it scheduled?). + +A simple cron entry to dump `cas-status` to a log every 15 minutes makes +all of the above trivially monitorable via your existing log pipeline. + +## REST API endpoints + +In daemon mode (`clickhouse-backup server`), the CAS commands are available +via HTTP on the same port as the v1 API endpoints (default `:7171`): + +| Method | Path | Maps to CLI | +|--------|------|-------------| +| POST | `/backup/cas-upload/{name}` | `cas-upload` | +| POST | `/backup/cas-download/{name}` | `cas-download` | +| POST | `/backup/cas-restore/{name}` | `cas-restore` | +| POST | `/backup/cas-delete/{name}` | `cas-delete` | +| POST | `/backup/cas-verify/{name}` | `cas-verify` | +| POST | `/backup/cas-prune` | `cas-prune` | +| GET | `/backup/cas-status` | `cas-status` | + +Async commands (`cas-upload`, `cas-download`, `cas-restore`, `cas-verify`, +`cas-prune`) return an `acknowledged` JSON envelope with an `operation_id`; +poll `GET /backup/status?operationid=` for completion. `cas-delete` and +`cas-status` are synchronous and return the result directly. + +CLI flags map to query parameters of the same name, e.g.: + +```sh +# async upload +curl -XPOST 'http://localhost:7171/backup/cas-upload/my_backup?skip-object-disks&wait-for-prune=5m' + +# async restore with drop-and-recreate +curl -XPOST 'http://localhost:7171/backup/cas-restore/my_backup?rm' + +# async prune — dry run +curl -XPOST 'http://localhost:7171/backup/cas-prune?dry-run' + +# sync delete +curl -XPOST 'http://localhost:7171/backup/cas-delete/my_backup' + +# poll completion +curl -s 'http://localhost:7171/backup/status?operationid=' | jq . +``` + +`GET /backup/list[/remote]` now includes CAS backups alongside v1 entries. +Each entry carries a `"kind"` field (`"v1"` or `"cas"`), and CAS entries +include a `"cas"` sub-object with `unique_blobs`, `blob_bytes`, and +`cluster_id`. + +`POST /backup/actions` recognizes the same `cas-*` verbs in the command +body, e.g. `{"command": "cas-upload mybk --skip-object-disks"}`. + +The `cas-prune --unlock` flag is also available via `?unlock=true`. It +overrides a stranded prune marker; use with the same operator confidence +required when running the CLI form. + +### Note: structured-output commands via /backup/actions + +**Note:** `cas-status` invoked via `POST /backup/actions` returns only an +acknowledgement; the structured status report is logged at INFO level by the +daemon but not surfaced in the action response. Use `GET /backup/cas-status` +directly to retrieve the report payload. diff --git a/docs/checksumstxt/format.md b/docs/checksumstxt/format.md new file mode 100644 index 00000000..6e770422 --- /dev/null +++ b/docs/checksumstxt/format.md @@ -0,0 +1,201 @@ +# `checksums.txt` — Formal Format Specification + +`checksums.txt` is a per-part metadata file written by `MergeTree` data parts. Reference implementation: `src/Storages/MergeTree/MergeTreeDataPartChecksum.{h,cpp}`. + +## 1. Top-level structure + +``` +checksums.txt := header LF body +header := "checksums format version: " UINT_DEC +LF := 0x0A +``` + +* `UINT_DEC` is an unsigned integer in plain decimal ASCII (no leading zeros required, no sign). +* The version determines `body` layout. Known versions: + +| Version | Body encoding | Used for `checksums.txt`? | +| ------: | ---------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- | +| 1 | (legacy, unsupported by current code; reader returns "format too old") | no longer written | +| 2 | Text | yes (legacy) | +| 3 | Binary, uncompressed | yes (legacy) | +| 4 | Binary, framed in one ClickHouse compressed-block | **default written today** | +| 5 | "Minimalistic" (totals only) | **only used for `MinimalisticDataPartChecksums`, e.g. ZooKeeper payload — not for the on-disk `checksums.txt`** | + +A robust parser must support v2, v3, v4 for the on-disk file and v5 only when reading the minimalistic blob. + +After the body, there must be EOF (the writer never appends anything past the body). + +## 2. Common primitive encodings + +These are ClickHouse's standard binary primitives (used inside the body for v3/v4): + +| Name | Encoding | +| ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | +| `VarUInt(x)` | LEB128 / Variable Byte. Repeated bytes `0x80 \| (x & 0x7F)` while `x > 0x7F`, then a final byte `x & 0x7F`. Up to 10 bytes for `UInt64`. | +| `BinaryLE(uint8/bool)` | exactly 1 byte; for `bool`, `0` = false, anything else = true (writer emits `0` or `1`). | +| `BinaryLE(UInt32)` | 4 bytes, little-endian. | +| `BinaryLE(UInt64)` | 8 bytes, little-endian. | +| `BinaryLE(uint128)` | The `CityHash_v1_0_2::uint128` struct is `{ UInt64 low64; UInt64 high64; }`. Serialized as `BinaryLE(low64)` then `BinaryLE(high64)`, total 16 bytes. | +| `StringBinary(s)` | `VarUInt(len(s))` followed by `len(s)` raw bytes. No NUL terminator. UTF-8 in practice (file names). | + +## 3. Body layouts + +### 3.1. Version 2 — text body + +Grammar (whitespace shown explicitly; `\n` = LF; `\t` = HT): + +``` +body_v2 := count " files:\n" record{count} +record := name "\n" + "\tsize: " UINT_DEC "\n" + "\thash: " UINT_DEC " " UINT_DEC "\n" + "\tcompressed: " BOOL_DEC + [ "\n\tuncompressed size: " UINT_DEC + "\n\tuncompressed hash: " UINT_DEC " " UINT_DEC ] + "\n" +``` + +* `count` is decimal ASCII. +* `name` is read as a raw line up to (but not including) the next `\n`. The reference uses `readString`, which reads bytes until `\n` is encountered; backslash-escaping is *not* applied here. (File names in part directories don't normally contain `\n`.) +* `BOOL_DEC` is `0` or `1`. The optional `uncompressed …` block is present iff `compressed = 1`. +* The two `UINT_DEC` after `hash:` / `uncompressed hash:` are the `low64` then `high64` of the `uint128`, printed in **decimal**. + +### 3.2. Version 3 — binary body, no compression + +``` +body_v3 := VarUInt(count) record{count} +record := StringBinary(name) + VarUInt(file_size) + BinaryLE(uint128 file_hash) // 16 bytes + BinaryLE(bool is_compressed) // 1 byte + if is_compressed: + VarUInt(uncompressed_size) + BinaryLE(uint128 uncompressed_hash) // 16 bytes +``` + +The map ordering on disk is whatever the writer produced (the in-memory container is an ordered `std::map`, so v4-written files are in lexicographic order of `name`, but a parser should not rely on order for correctness). + +### 3.3. Version 4 — binary body, wrapped in a compressed-block stream + +`body_v4` is a sequence of one or more **ClickHouse compressed blocks**. Concatenating the *uncompressed payloads* of these blocks yields exactly a `body_v3` byte stream. + +In practice the writer emits a single block (buffer 64 KiB, default codec LZ4), but a parser MUST handle multi-block streams (loop until the underlying buffer is exhausted; the inner `body_v3` parser will consume exactly the right amount). + +#### 3.3.1. Compressed-block frame + +Each block on the wire: + +``` +block := checksum128 method size_compressed size_uncompressed payload +checksum128 := 16 bytes // CityHash128 of: method || size_compressed || size_uncompressed || payload +method := 1 byte // codec id (see below) +size_compressed := 4 bytes LE // INCLUDES the 9-byte header (method+the two sizes), EXCLUDES the 16-byte checksum +size_uncompressed := 4 bytes LE +payload := size_compressed - 9 bytes of codec-specific data +``` + +Constraint: `size_compressed <= 0x40000000` (1 GiB); reject otherwise. + +The CityHash128 is `CityHash_v1_0_2::CityHash128` over the 9 header bytes followed by `size_compressed - 9` payload bytes (i.e. the bytes immediately following the checksum, totalling `size_compressed` bytes). A strict parser should verify it; a lenient parser may skip verification. + +#### 3.3.2. Codec method bytes + +Only the codecs ClickHouse may use to compress small metadata are realistically encountered, but a generic parser should be prepared: + +| `method` | Codec | Payload semantics | +| -------: | ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `0x02` | NONE | raw bytes; uncompressed payload = compressed payload | +| `0x82` | LZ4 / LZ4HC (same wire format) | LZ4 block of `size_compressed - 9` bytes that decompresses to exactly `size_uncompressed` bytes | +| `0x90` | ZSTD | ZSTD frame, decompresses to `size_uncompressed` bytes | +| `0x91` | Multiple | wrapper; first payload byte is the number of nested codecs followed by their method bytes, then the inner compressed stream — rarely used for `checksums.txt`. See `CompressionCodecMultiple`. | + +For `checksums.txt` written by current ClickHouse, the default codec is the server's `default` codec (typically LZ4 → `0x82`). All multi-byte integers are little-endian. + +After fully decompressing all blocks and concatenating, parse the result as `body_v3` (§3.2). + +### 3.4. Version 5 — minimalistic (NOT on-disk checksums.txt) + +For completeness; a parser of `checksums.txt` may reject this version, since current code never writes v5 to disk. It IS the format used for `MinimalisticDataPartChecksums` in ZooKeeper. + +``` +body_v5 := VarUInt(num_compressed_files) + VarUInt(num_uncompressed_files) + BinaryLE(uint128 hash_of_all_files) + BinaryLE(uint128 hash_of_uncompressed_files) + BinaryLE(uint128 uncompressed_hash_of_compressed_files) +``` + +The header line for a v5 blob is `"checksums format version: 5\n"`. + +## 4. Logical model produced by the parser + +After parsing v2/v3/v4, the parser yields a map `name → Checksum`: + +``` +struct Checksum { + UInt64 file_size; + uint128 file_hash; // CityHash128 of the file's bytes on disk (compressed bytes if file is compressed) + bool is_compressed; + UInt64 uncompressed_size; // valid iff is_compressed + uint128 uncompressed_hash; // CityHash128 of decompressed bytes, valid iff is_compressed +} +``` + +Semantics: +* `file_hash` / `file_size` describe bytes as stored on disk. +* For files that ClickHouse stores using its compressed-block format (most `*.bin` column data), `is_compressed = true` and `uncompressed_*` describe the concatenation of the decompressed block payloads (i.e. logical column-data bytes). +* The map keys are paths *relative to the part directory* (e.g. `columns.txt`, `primary.idx`, `id.bin`, `id.cmrk2`, …). Subdirectory entries (projections) appear with `/`-separated paths. + +## 5. Validation rules a strict parser should enforce + +1. The first line MUST start with the literal `checksums format version: ` (note the trailing space) and end with `\n`. +2. Version MUST be one of 2, 3, 4 for the on-disk `checksums.txt` (5 only for the minimalistic blob). +3. For v2, every literal token (`" files:\n"`, `"\n\tsize: "`, `"\n\thash: "`, `"\n\tcompressed: "`, `"\n\tuncompressed size: "`, `"\n\tuncompressed hash: "`, the trailing `"\n"`, and the inter-field single-space separator between the two halves of `uint128`) MUST match exactly. +4. For v3/v4, `count` MUST be consumable; each record MUST be fully consumed. +5. For v4 frames: `size_compressed >= 9`, `size_compressed <= 1 GiB`, and (recommended) checksum verification. +6. After the last record (and after all compressed blocks for v4), there MUST be no trailing bytes — `assertEOF` is called by the reference implementation in `deserializeFrom`. +7. `name` MUST be unique within `files` (the writer uses a `std::map`, so duplicates indicate corruption). + +## 6. Worked v3 byte-level example + +A `checksums.txt` with one entry `columns.txt` of size 123 and `is_compressed = false` is exactly: + +``` +"checksums format version: 3\n" // header line, ASCII +01 // VarUInt(count=1) +0B // VarUInt(11) -- length of "columns.txt" +63 6F 6C 75 6D 6E 73 2E 74 78 74 // "columns.txt" +7B // VarUInt(123) +<16 bytes uint128 file_hash, low64 LE then high64 LE> +00 // is_compressed = false + // (no uncompressed_* fields) +EOF +``` + +## 7. Pointers into the source + +* `MergeTreeDataPartChecksums::read` / `write` — top-level dispatch + v3/v4 body. (`src/Storages/MergeTree/MergeTreeDataPartChecksum.cpp:115-240`) +* `MergeTreeDataPartChecksums::readV2` — text body. (same file, lines 145-181) +* `MinimalisticDataPartChecksums::serialize/deserialize` — v5 body. (same file, lines 334-391) +* `CompressedReadBuffer` / `CompressionInfo.h` — compressed-block frame used by v4. (`src/Compression/CompressionInfo.h`) +* `VarInt.h` — `VarUInt` encoding. (`src/IO/VarInt.h`) +* `CityHash_v1_0_2::uint128` — hash type, `{ low64, high64 }`. + +## 8. Implementation Summary + +`checksumstxt/`: + +- **`checksumstxt.go`** — `Parse(io.Reader) → *File` for versions 2/3/4, `ParseMinimalistic(io.Reader) → *Minimalistic` for version 5. Returns a typed `Hash128 = {Low, High uint64}` and a `Checksum` struct matching the C++ `MergeTreeDataPartChecksum` shape. +- **`checksumstxt_test.go`** — round-trips for v2 (text), v3 (raw binary), v4 with **LZ4 / NONE / ZSTD** codecs and a **multi-block** stream, the v5 minimalistic blob, and rejection cases (trailing bytes, v1, v5-via-Parse, unknown). + +### What's reused vs. new + +- **Reused** (transitive deps via `ch-go`): `chproto.Reader` for `UVarInt` / `Str` / `UInt128` (already `{Low: LE0..7, High: LE8..15}`) / `Bool`. The whole v4 framing — 16-byte CityHash128 + 9-byte header + LZ4/ZSTD/NONE — is handled by `compress.Reader`, surfaced through one call: `pr.EnableCompression()` on a `chproto.Reader` switches the v3 record loop to read decompressed bytes, so v3 and v4 share the same code path. +- **Reused** (already in this repo): `lib/cityhash102.CityHash128` is available if you ever want to validate v4 frames yourself (not needed — `compress.Reader` verifies internally). +- **New**: header-line dispatcher, v2 line-oriented parser, v3 record loop, v5 (5 fields, trivial), and EOF assertions for each path. + +### Path notes & limitations + +- `Multiple` codec (`0x91`) is **not** handled by `ch-go/compress.Reader`. Per the spec it isn't used for `checksums.txt`, so the parser surfaces a "compression 0x91 not implemented" error if encountered — matching the spec's "rarely used" note. +- v1 is rejected with "format too old", matching the C++ reference. +- `Parse` returns an error for v5 (and `ParseMinimalistic` rejects non-5) so you can't accidentally cross the wires. diff --git a/pkg/backup/backuper.go b/pkg/backup/backuper.go index c627c77a..ccc5cdf2 100644 --- a/pkg/backup/backuper.go +++ b/pkg/backup/backuper.go @@ -36,6 +36,23 @@ type versioner interface { type BackuperOpt func(*Backuper) +// CASProbeState holds the per-process state for the CAS conditional-put +// probe and the unsafe-marker WARN banner. It should be shared across all +// Backuper instances served from the same APIServer so that the probe fires +// exactly once per daemon lifetime, not once per REST request. CLI +// invocations create a fresh CASProbeState per process (one-shot, correct +// behaviour unchanged). Two separate CLI processes never share state because +// they are separate OS processes. +type CASProbeState struct { + probeOnce sync.Once + probeErr error + bannerOnce sync.Once +} + +// NewCASProbeState returns a fresh CASProbeState. Call once at server +// startup and share the result across all Backuper instances. +func NewCASProbeState() *CASProbeState { return &CASProbeState{} } + type Backuper struct { cfg *config.Config ch *clickhouse.ClickHouse @@ -51,15 +68,23 @@ type Backuper struct { resumableState *resumable.State shadowBackupUUIDs []string shadowBackupUUIDsMutex sync.Mutex + + // casProbeState is the shared (or per-instance) state for the CAS + // conditional-put probe and the unsafe-marker WARN banner. In daemon mode + // this points to the APIServer-level singleton so both fire at most once + // per server lifetime. In CLI mode NewBackuper creates a fresh state so + // both fire at most once per process (one-shot invocation). + casProbeState *CASProbeState } func NewBackuper(cfg *config.Config, opts ...BackuperOpt) *Backuper { ch := clickhouse.NewClickHouse(&cfg.ClickHouse) b := &Backuper{ - cfg: cfg, - ch: ch, - vers: ch, - bs: nil, + cfg: cfg, + ch: ch, + vers: ch, + bs: nil, + casProbeState: NewCASProbeState(), } for _, opt := range opts { opt(b) @@ -67,6 +92,19 @@ func NewBackuper(cfg *config.Config, opts ...BackuperOpt) *Backuper { return b } +// WithCASProbeState returns a BackuperOpt that injects a pre-existing +// CASProbeState into the Backuper. Used by the daemon APIServer to share a +// singleton across all per-request Backuper instances, ensuring the +// conditional-put probe and unsafe-marker WARN banner fire exactly once per +// server lifetime rather than once per request. Passing nil is a no-op. +func WithCASProbeState(s *CASProbeState) BackuperOpt { + return func(b *Backuper) { + if s != nil { + b.casProbeState = s + } + } +} + // Classify need to log retries func (b *Backuper) Classify(err error) retrier.Action { if err == nil { @@ -435,7 +473,7 @@ func (b *Backuper) getTablesDiffFromLocal(ctx context.Context, diffFrom string, func (b *Backuper) getTablesDiffFromRemote(ctx context.Context, diffFromRemote string, tablePattern string) (tablesForUploadFromDiff map[metadata.TableTitle]metadata.TableMetadata, err error) { tablesForUploadFromDiff = make(map[metadata.TableTitle]metadata.TableMetadata) - backupList, err := b.dst.BackupList(ctx, true, diffFromRemote) + backupList, err := b.dst.BackupList(ctx, true, diffFromRemote, b.cfg.CAS.SkipPrefixes()) if err != nil { return nil, errors.Wrap(err, "b.dst.BackupList return error") } diff --git a/pkg/backup/cas_methods.go b/pkg/backup/cas_methods.go new file mode 100644 index 00000000..efc6859f --- /dev/null +++ b/pkg/backup/cas_methods.go @@ -0,0 +1,901 @@ +package backup + +import ( + "context" + "encoding/json" + "errors" + "fmt" + "os" + "path" + "path/filepath" + "regexp" + "strings" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/casstorage" + "github.com/Altinity/clickhouse-backup/v2/pkg/clickhouse" + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" + "github.com/Altinity/clickhouse-backup/v2/pkg/pidlock" + "github.com/Altinity/clickhouse-backup/v2/pkg/status" + "github.com/Altinity/clickhouse-backup/v2/pkg/storage" + "github.com/Altinity/clickhouse-backup/v2/pkg/utils" + "github.com/rs/zerolog/log" +) + +// setupCASContext mirrors the v1 Upload context-setup pattern (status correlator +// + WithCancel). On commandId == status.NotFromAPI (-1) it returns a fresh +// background context. +func (b *Backuper) setupCASContext(commandId int) (context.Context, context.CancelFunc, error) { + ctx, cancel, err := status.Current.GetContextWithCancel(commandId) + if err != nil { + return nil, nil, fmt.Errorf("cas: GetContextWithCancel: %w", err) + } + ctx, cancel = context.WithCancel(ctx) + return ctx, cancel, nil +} + +// ensureCAS opens a remote BackupDestination for CAS operations and returns the +// adapter wrapping it plus a closer. Caller MUST invoke closer when done. +// +// Returns an error if cas.enabled is false or the config fails validation. +func (b *Backuper) ensureCAS(ctx context.Context, backupName string) (cas.Backend, func(), error) { + if !b.cfg.CAS.Enabled { + return nil, func() {}, errors.New("cas: cas.enabled=false in config; cannot run cas-* commands") + } + if err := b.cfg.CAS.Validate(); err != nil { + return nil, func() {}, err + } + if b.cfg.General.RemoteStorage == "none" || b.cfg.General.RemoteStorage == "custom" { + return nil, func() {}, fmt.Errorf("cas: unsupported general.remote_storage=%q for cas-* commands", b.cfg.General.RemoteStorage) + } + // Connect to ClickHouse so we can resolve disks (needed by NewBackupDestination + // and DefaultDataPath). + if err := b.ch.Connect(); err != nil { + return nil, func() {}, fmt.Errorf("cas: can't connect to clickhouse: %w", err) + } + disks, err := b.ch.GetDisks(ctx, true) + if err != nil { + b.ch.Close() + return nil, func() {}, fmt.Errorf("cas: GetDisks: %w", err) + } + if initErr := b.initDisksPathsAndBackupDestination(ctx, disks, backupName); initErr != nil { + b.ch.Close() + return nil, func() {}, fmt.Errorf("cas: initDisksPathsAndBackupDestination: %w", initErr) + } + if b.dst == nil { + b.ch.Close() + return nil, func() {}, fmt.Errorf("cas: BackupDestination not initialized for remote_storage=%q", b.cfg.General.RemoteStorage) + } + backend := casstorage.NewStorageBackend(b.dst) + closer := func() { + if b.dst != nil { + if err := b.dst.Close(ctx); err != nil { + log.Warn().Msgf("cas: can't close BackupDestination: %v", err) + } + } + b.ch.Close() + } + + // One-shot startup banner when operating in any unsafe-marker mode so + // the risk is visible in logs even when the operator never reads the + // runbook. Fires at most once per CASProbeState lifetime (i.e. once per + // daemon server start in API mode; once per process in CLI mode). + b.casProbeState.bannerOnce.Do(func() { + if b.cfg.CAS.SkipConditionalPutProbe { + log.Warn().Msg("cas: cas.skip_conditional_put_probe=true — conditional-put compliance NOT verified; if the backend silently ignores If-None-Match, marker locks are unsafe and concurrent uploads may corrupt backups. Use only on backends you have independently confirmed honor the precondition.") + } + if b.cfg.General.RemoteStorage == "ftp" && b.cfg.CAS.AllowUnsafeMarkers { + log.Warn().Msg("cas: cas.allow_unsafe_markers=true on FTP — markers use a STAT+STOR+RNFR/RNTO best-effort sequence with a small TOCTOU window between STAT and RNTO. Two concurrent cas-upload runs MAY both pass the marker write; serialize uploads externally if you cannot tolerate that risk.") + } + if b.cfg.CAS.AllowUnsafeObjectDiskSkip { + log.Warn().Msg("cas: cas.allow_unsafe_object_disk_skip=true — object-disk preflight will be skipped on disk-query failure; CAS backups may silently include unrestorable object-disk tables.") + } + }) + + return backend, closer, nil +} + +// maybeProbeCondPut runs the conditional-put startup probe at most once per +// CASProbeState. Skipped if cas.skip_conditional_put_probe=true. The probe is +// called by every CAS command that writes a marker (cas-upload non-dry-run, +// cas-prune non-dry-run, cas-delete). Read-only paths (cas-status, +// cas-verify, cas-download, cas-restore, dry-run flows) skip it entirely, +// ensuring they work with read-only credentials and don't mutate remote +// storage. +// +// In daemon (APIServer) mode b.casProbeState is the server-level singleton, so +// the probe fires exactly once per server lifetime regardless of how many +// requests arrive. In CLI mode each process gets a fresh CASProbeState, so +// the probe fires once per invocation. +func (b *Backuper) maybeProbeCondPut(ctx context.Context, backend cas.Backend) error { + if b.cfg.CAS.SkipConditionalPutProbe { + return nil + } + b.casProbeState.probeOnce.Do(func() { + cp := b.cfg.CAS.ClusterPrefix() + b.casProbeState.probeErr = cas.ProbeConditionalPut(ctx, backend, cp) + }) + return b.casProbeState.probeErr +} + +// snapshotObjectDiskHitsFromDisks is the pure, testable core of the snapshot +// pre-flight. It walks /shadow//
// to +// enumerate (db, table, disk) triples actually present in the backup, then +// cross-references diskTypeByName (disk name → type) to identify object-disk +// hits. Returns deduplicated hits (empty slice + nil error for empty/no-object-disk backups). +func (b *Backuper) snapshotObjectDiskHitsFromDisks(localBackupDir string, diskTypeByName map[string]string) ([]cas.ObjectDiskHit, error) { + shadow := filepath.Join(localBackupDir, "shadow") + var hits []cas.ObjectDiskHit + seen := map[cas.ObjectDiskHit]struct{}{} + + dbs, err := os.ReadDir(shadow) + if err != nil { + if os.IsNotExist(err) { + return nil, nil // empty backup or schema-only backup + } + return nil, fmt.Errorf("cas: read shadow dir: %w", err) + } + for _, dbe := range dbs { + if !dbe.IsDir() { + continue + } + db := dbe.Name() + tables, err := os.ReadDir(filepath.Join(shadow, db)) + if err != nil { + continue + } + for _, tbe := range tables { + if !tbe.IsDir() { + continue + } + table := tbe.Name() + disks, err := os.ReadDir(filepath.Join(shadow, db, table)) + if err != nil { + continue + } + for _, de := range disks { + if !de.IsDir() { + continue + } + disk := de.Name() + diskType, ok := diskTypeByName[disk] + if !ok { + continue // disk not present in live system.disks; treat as local + } + if !cas.IsObjectDiskType(diskType) { + continue + } + // Read the table's metadata JSON to get decoded (db, table) names. + // Fall back to the encoded directory names if the JSON is missing or + // unparseable (we still want to report a hit; downstream filtering may + // not match perfectly but the operator gets visibility). + decodedDB, decodedTable := db, table + metaPath := filepath.Join(localBackupDir, "metadata", db, table+".json") + if body, readErr := os.ReadFile(metaPath); readErr == nil { + var tm metadata.TableMetadata + if jsonErr := json.Unmarshal(body, &tm); jsonErr == nil && tm.Database != "" && tm.Table != "" { + decodedDB, decodedTable = tm.Database, tm.Table + } + } + h := cas.ObjectDiskHit{Database: decodedDB, Table: decodedTable, Disk: disk, DiskType: diskType} + if _, dup := seen[h]; dup { + continue + } + seen[h] = struct{}{} + hits = append(hits, h) + } + } + } + return hits, nil +} + +// snapshotObjectDiskHits queries live system.disks for disk-type information, +// then delegates to snapshotObjectDiskHitsFromDisks to walk the local backup +// snapshot. If system.disks is unreachable the function fails closed (returns +// an error) unless cas.allow_unsafe_object_disk_skip=true, in which case it +// logs a warning and returns (nil, nil). +func (b *Backuper) snapshotObjectDiskHits(ctx context.Context, localBackupDir string) ([]cas.ObjectDiskHit, error) { + diskTypeByName := map[string]string{} + disks, err := b.ch.GetDisks(ctx, true) + if err != nil { + if b.cfg.CAS.AllowUnsafeObjectDiskSkip { + log.Warn().Msgf("cas: GetDisks for snapshot pre-flight failed: %v; cas.allow_unsafe_object_disk_skip=true so continuing without object-disk detection — CAS backup may include unrestorable object-disk tables", err) + return nil, nil + } + return nil, fmt.Errorf("cas: object-disk pre-flight failed (cannot query system.disks): %w (set cas.allow_unsafe_object_disk_skip=true to bypass at your own risk)", err) + } + for _, d := range disks { + diskTypeByName[d.Name] = d.Type + } + return b.snapshotObjectDiskHitsFromDisks(localBackupDir, diskTypeByName) +} + +// storagePolicyRE extracts the storage_policy name from a CREATE TABLE query. +// Local copy of the logic in (*ClickHouse).ExtractStoragePolicy — kept here so +// snapshotMetadataObjectDiskHits requires no live ClickHouse receiver and is +// fully unit-testable. +var storagePolicyRE = regexp.MustCompile(`SETTINGS.+storage_policy[^=]*=[^']*'([^']+)'`) + +// extractStoragePolicy returns the storage_policy value from a CREATE TABLE +// query, defaulting to "default" when the SETTINGS clause is absent. +func extractStoragePolicy(query string) string { + if m := storagePolicyRE.FindStringSubmatch(query); len(m) > 0 { + return m[1] + } + return "default" +} + +// StoragePolicyResolver abstracts the live ClickHouse queries used by +// snapshotMetadataObjectDiskHits. The production implementation is a +// thin wrapper around (*Backuper).ch; tests inject a stub. +type StoragePolicyResolver interface { + // DisksForPolicy returns the disk names attached to a storage policy. + // Should return ([], nil) for unknown policies. + DisksForPolicy(policy string) ([]string, error) + // DiskType returns the type of a disk (e.g. "S3", "ObjectStorage", "Local"). + // Should return ("", nil) for unknown disks. + DiskType(disk string) (string, error) +} + +// backuperResolver implements StoragePolicyResolver by reading from a +// pre-fetched []clickhouse.Disk slice (each Disk has StoragePolicies +// populated when GetDisks was called with enrich=true). +type backuperResolver struct{ disks []clickhouse.Disk } + +func newBackuperResolver(disks []clickhouse.Disk) *backuperResolver { + return &backuperResolver{disks: disks} +} + +func (r *backuperResolver) DisksForPolicy(policy string) ([]string, error) { + var out []string + for _, d := range r.disks { + for _, p := range d.StoragePolicies { + if p == policy { + out = append(out, d.Name) + break + } + } + } + return out, nil +} + +func (r *backuperResolver) DiskType(disk string) (string, error) { + for _, d := range r.disks { + if d.Name == disk { + return d.Type, nil + } + } + return "", nil +} + +// snapshotMetadataObjectDiskHits enumerates per-table metadata JSONs in +// the local backup directory and consults the resolver to determine each +// table's source disk types. Returns hits for any table whose storage +// policy includes an object-disk-typed disk. Caller is responsible for +// merging with snapshotObjectDiskHits (which catches tables that DO have +// shadow parts). +func snapshotMetadataObjectDiskHits(localBackupDir string, resolver StoragePolicyResolver) ([]cas.ObjectDiskHit, error) { + metaRoot := filepath.Join(localBackupDir, "metadata") + st, err := os.Stat(metaRoot) + if err != nil { + if os.IsNotExist(err) { + return nil, nil + } + return nil, err + } + if !st.IsDir() { + return nil, fmt.Errorf("metadata path %q is not a directory", metaRoot) + } + var hits []cas.ObjectDiskHit + seen := map[cas.ObjectDiskHit]struct{}{} + dbs, err := os.ReadDir(metaRoot) + if err != nil { + return nil, err + } + for _, dbe := range dbs { + if !dbe.IsDir() { + continue + } + files, err := os.ReadDir(filepath.Join(metaRoot, dbe.Name())) + if err != nil { + return nil, err + } + for _, fe := range files { + if !strings.HasSuffix(fe.Name(), ".json") { + continue + } + body, err := os.ReadFile(filepath.Join(metaRoot, dbe.Name(), fe.Name())) + if err != nil { + return nil, err + } + var tm metadata.TableMetadata + if err := json.Unmarshal(body, &tm); err != nil { + continue // skip malformed; not our problem here + } + policy := extractStoragePolicy(tm.Query) + disks, err := resolver.DisksForPolicy(policy) + if err != nil { + return nil, err + } + for _, disk := range disks { + dt, err := resolver.DiskType(disk) + if err != nil { + return nil, err + } + if !cas.IsObjectDiskType(dt) { + continue + } + h := cas.ObjectDiskHit{Database: tm.Database, Table: tm.Table, Disk: disk, DiskType: dt} + if _, dup := seen[h]; dup { + continue + } + seen[h] = struct{}{} + hits = append(hits, h) + } + } + } + return hits, nil +} + +// mergeObjectDiskHits dedupes hits across two sources (shadow walk + +// metadata-JSON enumeration). Order of the returned slice is not +// guaranteed (callers that care should sort). +func mergeObjectDiskHits(a, b []cas.ObjectDiskHit) []cas.ObjectDiskHit { + seen := map[cas.ObjectDiskHit]struct{}{} + out := make([]cas.ObjectDiskHit, 0, len(a)+len(b)) + for _, h := range a { + if _, dup := seen[h]; !dup { + seen[h] = struct{}{} + out = append(out, h) + } + } + for _, h := range b { + if _, dup := seen[h]; !dup { + seen[h] = struct{}{} + out = append(out, h) + } + } + return out +} + +// snapshotMetadataObjectDiskHitsFromCH wraps the static +// snapshotMetadataObjectDiskHits helper with a live ClickHouse query. +// Best-effort: returns (nil, err) on failure; caller logs and falls back. +func (b *Backuper) snapshotMetadataObjectDiskHitsFromCH(ctx context.Context, localBackupDir string) ([]cas.ObjectDiskHit, error) { + disks, err := b.ch.GetDisks(ctx, true) + if err != nil { + return nil, fmt.Errorf("get disks: %w", err) + } + return snapshotMetadataObjectDiskHits(localBackupDir, newBackuperResolver(disks)) +} + +// CASUpload uploads a local backup using the CAS layout. +// When unlock=true the function removes a stranded in-progress marker for +// backupName and exits immediately without uploading anything. +// --unlock is incompatible with --dry-run and --skip-object-disks. +func (b *Backuper) CASUpload(backupName string, skipObjectDisks, dryRun, unlock bool, backupVersion string, commandId int, waitForPrune time.Duration) error { + if backupName == "" { + return errors.New("cas-upload: backup name is required") + } + + // Refuse incompatible flag combinations upfront. + if unlock && dryRun { + return errors.New("cas-upload: --unlock and --dry-run are incompatible; --unlock removes a real marker") + } + if unlock && skipObjectDisks { + return errors.New("cas-upload: --unlock and --skip-object-disks are incompatible; --unlock does not perform an upload") + } + + backupName = utils.CleanBackupNameRE.ReplaceAllString(backupName, "") + + ctx, cancel, err := b.setupCASContext(commandId) + if err != nil { + return err + } + defer cancel() + + backend, closer, err := b.ensureCAS(ctx, backupName) + if err != nil { + return err + } + defer closer() + + // --unlock path: remove stranded marker and exit. No upload, no pidlock. + if unlock { + if err := cas.UnlockInProgress(ctx, backend, b.cfg.CAS, backupName); err != nil { + return err + } + fmt.Printf("cas-upload --unlock: inprogress marker for %q removed; backup slot is now free\n", backupName) + return nil + } + + if pidErr := pidlock.CheckAndCreatePidFile(backupName, "cas-upload"); pidErr != nil { + return pidErr + } + defer pidlock.RemovePidFile(backupName) + + start := time.Now() + + // Resolve the local backup directory. + fullLocal := path.Join(b.DefaultDataPath, "backup", backupName) + if _, err := os.Stat(fullLocal); err != nil { + return fmt.Errorf("cas-upload: local backup %q not found at %s; run 'clickhouse-backup create %s' first", backupName, fullLocal, backupName) + } + + // Snapshot-based pre-flight: read which disks the local backup actually + // uses, not which disks the live ClickHouse currently has. + shadowHits, err := b.snapshotObjectDiskHits(ctx, fullLocal) + if err != nil { + return fmt.Errorf("cas-upload: snapshot pre-flight: %w", err) + } + + // Augment with metadata-JSON-driven detection so fully-object-disk-backed + // tables (no shadow parts) are also caught. Fail closed on error unless + // cas.allow_unsafe_object_disk_skip=true. + metaHits, metaErr := b.snapshotMetadataObjectDiskHitsFromCH(ctx, fullLocal) + if metaErr != nil { + if b.cfg.CAS.AllowUnsafeObjectDiskSkip { + log.Warn().Err(metaErr).Msg("cas-upload: metadata-driven object-disk pre-flight failed; cas.allow_unsafe_object_disk_skip=true so falling back to shadow-only detection — fully-object-disk-backed tables may be missed") + } else { + return fmt.Errorf("cas-upload: metadata-driven object-disk pre-flight failed: %w (set cas.allow_unsafe_object_disk_skip=true to bypass at your own risk)", metaErr) + } + } + hits := mergeObjectDiskHits(shadowHits, metaHits) + + if !skipObjectDisks { + if len(hits) > 0 { + return fmt.Errorf("%w: %s", + cas.ErrObjectDiskRefused, + cas.FormatObjectDiskHits(hits)) + } + } + + uploadOpts := cas.UploadOptions{ + LocalBackupDir: fullLocal, + SkipObjectDisks: skipObjectDisks, + DryRun: dryRun, + Parallelism: int(b.cfg.General.UploadConcurrency), + WaitForPrune: waitForPrune, + } + if skipObjectDisks { + excluded := make([]string, 0, len(hits)) + for _, h := range hits { + excluded = append(excluded, h.Database+"."+h.Table) + } + uploadOpts.ExcludedTables = excluded + } + // Run the conditional-put probe only for real (non-dry-run) uploads that + // will actually write a marker. Dry-run and read-only commands skip it. + if !dryRun { + if err := b.maybeProbeCondPut(ctx, backend); err != nil { + return err + } + } + res, uploadErr := cas.Upload(ctx, backend, b.cfg.CAS, backupName, uploadOpts) + if uploadErr != nil { + return uploadErr + } + log.Info(). + Str("backup", res.BackupName). + Int("total_files", res.TotalFiles). + Uint64("total_bytes", res.TotalBytes). + Int("inline_files", res.InlineFiles). + Uint64("inline_bytes", res.InlineBytes). + Int("unique_blobs", res.UniqueBlobs). + Uint64("blob_bytes_total", res.BlobBytesTotal). + Int("blobs_uploaded", res.BlobsUploaded). + Int64("bytes_uploaded", res.BytesUploaded). + Int("blobs_reused", res.BlobsReused). + Int64("bytes_reused", res.BytesReused). + Int("archives", res.PerTableArchives). + Int64("archive_bytes", res.ArchiveBytes). + Bool("dry_run", res.DryRun). + Dur("elapsed", time.Since(start)). + Msg("cas-upload done") + + totalBytesH := utils.FormatBytes(res.TotalBytes) + inlineBytesH := utils.FormatBytes(res.InlineBytes) + blobBytesH := utils.FormatBytes(res.BlobBytesTotal) + uploadedH := utils.FormatBytes(uint64(res.BytesUploaded)) + reusedH := utils.FormatBytes(uint64(res.BytesReused)) + archiveH := utils.FormatBytes(uint64(res.ArchiveBytes)) + prefix := "cas-upload" + if res.DryRun { + prefix = "cas-upload (dry-run)" + } + fmt.Printf("%s: %s\n", prefix, res.BackupName) + fmt.Printf(" Backup content : %d files, %s total\n", res.TotalFiles, totalBytesH) + fmt.Printf(" Inlined : %d files, %s (packed into %d archive%s, %s compressed)\n", + res.InlineFiles, inlineBytesH, res.PerTableArchives, plural(res.PerTableArchives), archiveH) + fmt.Printf(" Blob store : %d unique blobs, %s\n", res.UniqueBlobs, blobBytesH) + fmt.Printf(" uploaded now : %d blobs, %s\n", res.BlobsUploaded, uploadedH) + fmt.Printf(" reused : %d blobs, %s (already in remote — saved by content-addressing)\n", + res.BlobsReused, reusedH) + fmt.Printf(" Wall clock : %s\n", time.Since(start).Round(time.Millisecond)) + return nil +} + +func plural(n int) string { + if n == 1 { + return "" + } + return "s" +} + +// CASDownload materializes a CAS backup into the local backup directory. +// This does NOT load tables into ClickHouse; use cas-restore for that. +func (b *Backuper) CASDownload(backupName, tablePattern string, partitions []string, schemaOnly, dataOnly bool, backupVersion string, commandId int) error { + if backupName == "" { + return errors.New("cas-download: backup name is required") + } + backupName = utils.CleanBackupNameRE.ReplaceAllString(backupName, "") + // Use "cas-download-" as the pidlock key so that concurrent + // cas-download and cas-restore runs (which also hold this lock during + // their download phase) mutually exclude each other without colliding + // with the inner v1 pidlock key (plain ) used by b.Restore. + casDownloadLockName := "cas-download-" + backupName + if pidErr := pidlock.CheckAndCreatePidFile(casDownloadLockName, "cas-download"); pidErr != nil { + return pidErr + } + defer pidlock.RemovePidFile(casDownloadLockName) + + ctx, cancel, err := b.setupCASContext(commandId) + if err != nil { + return err + } + defer cancel() + + start := time.Now() + backend, closer, err := b.ensureCAS(ctx, backupName) + if err != nil { + return err + } + defer closer() + + localBackupRoot := path.Join(b.DefaultDataPath, "backup") + if err := os.MkdirAll(localBackupRoot, 0o755); err != nil { + return fmt.Errorf("cas-download: mkdir %s: %w", localBackupRoot, err) + } + + res, dlErr := cas.Download(ctx, backend, b.cfg.CAS, backupName, cas.DownloadOptions{ + LocalBackupDir: localBackupRoot, + TableFilter: splitTablePattern(tablePattern), + Partitions: partitions, + SchemaOnly: schemaOnly, + DataOnly: dataOnly, + Parallelism: int(b.cfg.General.DownloadConcurrency), + }) + if dlErr != nil { + return dlErr + } + log.Info(). + Str("backup", res.BackupName). + Str("local_dir", res.LocalBackupDir). + Int("archives", res.PerTableArchives). + Int("blobs_fetched", res.BlobsFetched). + Int64("bytes_fetched", res.BytesFetched). + Dur("elapsed", time.Since(start)). + Msg("cas-download done") + fmt.Printf("cas-download: %s -> %s archives=%d blobs=%d bytes=%d elapsed=%s\n", + res.BackupName, res.LocalBackupDir, res.PerTableArchives, res.BlobsFetched, res.BytesFetched, time.Since(start).Round(time.Millisecond)) + return nil +} + +// CASRestore downloads a CAS backup and hands off to the v1 restore flow. +func (b *Backuper) CASRestore( + backupName, tablePattern string, + dbMapping, tableMapping, partitions, skipProjections []string, + schemaOnly, dataOnly, dropExists, ignoreDependencies bool, + restoreSchemaAsAttach, replicatedCopyToDetached, skipEmptyTables, resume bool, + backupVersion string, commandId int, +) error { + if backupName == "" { + return errors.New("cas-restore: backup name is required") + } + backupName = utils.CleanBackupNameRE.ReplaceAllString(backupName, "") + // Outer pidlock around the cas-download phase only. The inner v1 + // b.Restore (invoked via runV1 below) acquires its own pidlock at + // pkg/backup/restore.go for the actual mutation phase. Holding both + // would self-deadlock since pidlock has no same-PID exemption — so we + // release the cas-download lock before the v1 handoff. Two concurrent + // cas-restore runs of the same backup name now serialize on the + // cas-download phase (preventing staging-swap races) and then again + // on the inner v1 lock. + casDownloadLockName := "cas-download-" + backupName + if pidErr := pidlock.CheckAndCreatePidFile(casDownloadLockName, "cas-download"); pidErr != nil { + return pidErr + } + casDownloadLockReleased := false + defer func() { + if !casDownloadLockReleased { + pidlock.RemovePidFile(casDownloadLockName) + } + }() + + ctx, cancel, err := b.setupCASContext(commandId) + if err != nil { + return err + } + defer cancel() + + start := time.Now() + backend, closer, err := b.ensureCAS(ctx, backupName) + if err != nil { + return err + } + defer closer() + + localBackupRoot := path.Join(b.DefaultDataPath, "backup") + if err := os.MkdirAll(localBackupRoot, 0o755); err != nil { + return fmt.Errorf("cas-restore: mkdir %s: %w", localBackupRoot, err) + } + + opts := cas.RestoreOptions{ + DownloadOptions: cas.DownloadOptions{ + LocalBackupDir: localBackupRoot, + TableFilter: splitTablePattern(tablePattern), + Partitions: partitions, + SchemaOnly: schemaOnly, + DataOnly: dataOnly, + Parallelism: int(b.cfg.General.DownloadConcurrency), + }, + DropExists: dropExists, + DatabaseMapping: dbMapping, + TableMapping: tableMapping, + SkipProjections: skipProjections, + RestoreSchemaAsAttach: restoreSchemaAsAttach, + ReplicatedCopyToDetached: replicatedCopyToDetached, + SkipEmptyTables: skipEmptyTables, + Resume: resume, + BackupVersion: backupVersion, + CommandID: commandId, + IgnoreDependencies: ignoreDependencies, + } + + // V1 restore handoff: cas.Restore materializes the backup at + // / and calls this closure with that absolute path. + // We then delegate to b.Restore using the v1 positional argument list. + // Release the cas-download pidlock first so the inner b.Restore can + // acquire its own pidlock (under the plain backupName key); pidlock + // has no same-PID exemption, so holding both would self-deadlock. + runV1 := func(ctx context.Context, _ string, ro cas.RestoreOptions) error { + // cas.Download has completed; the staging-swap race window is closed. + // Release the outer cas-download lock before b.Restore takes its own. + pidlock.RemovePidFile(casDownloadLockName) + casDownloadLockReleased = true + + // b.Restore looks the backup up by name under b.DefaultDataPath/backup/, + // which is exactly where cas.Download placed it. + return b.Restore( + backupName, + tablePattern, + ro.DatabaseMapping, + ro.TableMapping, + ro.Partitions, + ro.SkipProjections, + ro.SchemaOnly, + ro.DataOnly, + ro.DropExists, + false, // ignoreDependencies — rejected upstream by cas.Restore + false, // restoreRBAC: out of scope for CAS v1 + false, // rbacOnly + false, // restoreConfigs + false, // configsOnly + false, // restoreNamedCollections + false, // namedCollectionsOnly + ro.Resume, + ro.RestoreSchemaAsAttach, + ro.ReplicatedCopyToDetached, + ro.SkipEmptyTables, + ro.BackupVersion, + ro.CommandID, + ) + } + + if rErr := cas.Restore(ctx, backend, b.cfg.CAS, backupName, opts, runV1); rErr != nil { + return rErr + } + log.Info().Str("backup", backupName).Dur("elapsed", time.Since(start)).Msg("cas-restore done") + fmt.Printf("cas-restore: %s elapsed=%s\n", backupName, time.Since(start).Round(time.Millisecond)) + return nil +} + +// CASDelete removes a CAS backup's metadata subtree (blob reclamation is the +// next prune's responsibility). +func (b *Backuper) CASDelete(backupName string, commandId int, waitForPrune time.Duration) error { + if backupName == "" { + return errors.New("cas-delete: backup name is required") + } + backupName = utils.CleanBackupNameRE.ReplaceAllString(backupName, "") + if pidErr := pidlock.CheckAndCreatePidFile(backupName, "cas-delete"); pidErr != nil { + return pidErr + } + defer pidlock.RemovePidFile(backupName) + ctx, cancel, err := b.setupCASContext(commandId) + if err != nil { + return err + } + defer cancel() + backend, closer, err := b.ensureCAS(ctx, backupName) + if err != nil { + return err + } + defer closer() + // cas-delete always writes a tombstone marker; always probe. + if err := b.maybeProbeCondPut(ctx, backend); err != nil { + return err + } + if err := cas.Delete(ctx, backend, b.cfg.CAS, backupName, cas.DeleteOptions{WaitForPrune: waitForPrune}); err != nil { + return err + } + fmt.Printf("cas-delete: %s metadata removed\n", backupName) + fmt.Printf("cas-delete: blob storage will be reclaimed by the next cas-prune run\n") + return nil +} + +// CASVerify performs a HEAD + size check on every blob referenced by the +// backup, writing failures to stdout. +func (b *Backuper) CASVerify(backupName string, jsonOut bool, commandId int) error { + if backupName == "" { + return errors.New("cas-verify: backup name is required") + } + backupName = utils.CleanBackupNameRE.ReplaceAllString(backupName, "") + ctx, cancel, err := b.setupCASContext(commandId) + if err != nil { + return err + } + defer cancel() + backend, closer, err := b.ensureCAS(ctx, backupName) + if err != nil { + return err + } + defer closer() + res, vErr := cas.Verify(ctx, backend, b.cfg.CAS, backupName, cas.VerifyOptions{JSON: jsonOut}, os.Stdout) + if vErr != nil && !errors.Is(vErr, cas.ErrVerifyFailures) { + return vErr + } + if res != nil { + log.Info(). + Str("backup", res.BackupName). + Int("blobs_checked", res.BlobsChecked). + Int("failures", len(res.Failures)). + Msg("cas-verify done") + } + if vErr != nil { + // Non-zero exit on verify failures — surfaced via cli action error. + return vErr + } + return nil +} + +// CASStatus prints a LIST-only health summary for the configured cluster. +func (b *Backuper) CASStatus(commandId int) error { + ctx, cancel, err := b.setupCASContext(commandId) + if err != nil { + return err + } + defer cancel() + backend, closer, err := b.ensureCAS(ctx, "") + if err != nil { + return err + } + defer closer() + r, sErr := cas.Status(ctx, backend, b.cfg.CAS) + if sErr != nil { + return sErr + } + return cas.PrintStatus(r, os.Stdout) +} + +// CASStatusJSON returns a structured status report suitable for HTTP responses. +// It is the structured-data counterpart to CASStatus (which prints to stdout). +func (b *Backuper) CASStatusJSON(commandId int) (*cas.StatusReport, error) { + ctx, cancel, err := b.setupCASContext(commandId) + if err != nil { + return nil, err + } + defer cancel() + backend, closer, err := b.ensureCAS(ctx, "") + if err != nil { + return nil, err + } + defer closer() + return cas.Status(ctx, backend, b.cfg.CAS) +} + +// CASPrune runs mark-and-sweep GC against the configured CAS cluster. +// graceHours / abandonDays are CLI overrides (0 = use config). unlock is +// the operator escape hatch for a stranded prune.marker. +func (b *Backuper) CASPrune(dryRun bool, graceBlob, abandonThreshold string, unlock bool, commandId int) error { + ctx, cancel, err := b.setupCASContext(commandId) + if err != nil { + return err + } + defer cancel() + backend, closer, err := b.ensureCAS(ctx, "") + if err != nil { + return err + } + defer closer() + + opts := cas.PruneOptions{DryRun: dryRun, Unlock: unlock} + // Empty string = use the configured value. Any non-empty string must + // parse as a Go duration ("0s" is valid and means literal zero). + if graceBlob != "" { + d, perr := time.ParseDuration(graceBlob) + if perr != nil { + return fmt.Errorf("cas-prune: --grace-blob %q: %w", graceBlob, perr) + } + opts.GraceBlob = d + opts.GraceBlobSet = true + } + if abandonThreshold != "" { + d, perr := time.ParseDuration(abandonThreshold) + if perr != nil { + return fmt.Errorf("cas-prune: --abandon-threshold %q: %w", abandonThreshold, perr) + } + opts.AbandonThreshold = d + opts.AbandonThresholdSet = true + } + // Run the conditional-put probe only for real (non-dry-run) prune runs + // that write a prune marker. Dry-run and read-only commands skip it. + if !dryRun { + if err := b.maybeProbeCondPut(ctx, backend); err != nil { + return err + } + } + rep, err := cas.Prune(ctx, backend, b.cfg.CAS, opts) + if rep != nil { + _ = cas.PrintPruneReport(rep, os.Stdout) + } + return err +} + +// splitTablePattern turns a comma-separated "db1.t1,db2.t2" string into the +// exact-match filter slice expected by cas.{Download,Upload}.TableFilter. +// Empty input returns nil (allow-all). Whitespace around each entry is trimmed. +func splitTablePattern(p string) []string { + p = strings.TrimSpace(p) + if p == "" { + return nil + } + parts := strings.Split(p, ",") + out := make([]string, 0, len(parts)) + for _, s := range parts { + s = strings.TrimSpace(s) + if s != "" { + out = append(out, s) + } + } + if len(out) == 0 { + return nil + } + return out +} + +// isCASBackupRemote returns true if a backup with the given name exists +// in the CAS namespace (cas//metadata//metadata.json). +// Used by v1 download/restore/delete to surface a proper cross-mode +// refusal instead of "not found on remote storage" when an operator +// types a CAS backup name into a v1 command. Best-effort: returns false +// on any storage error or when CAS is disabled (no namespace configured). +func isCASBackupRemote(ctx context.Context, dst *storage.BackupDestination, cfg cas.Config, name string) bool { + if !cfg.Enabled { + return false + } + if cfg.RootPrefix == "" { + return false + } + rp := cfg.RootPrefix + if !strings.HasSuffix(rp, "/") { + rp += "/" + } + clusterPrefix := rp + cfg.ClusterID + "/" + key := clusterPrefix + "metadata/" + name + "/metadata.json" + rf, err := dst.StatFile(ctx, key) + if err != nil || rf == nil { + return false + } + return true +} diff --git a/pkg/backup/cas_methods_test.go b/pkg/backup/cas_methods_test.go new file mode 100644 index 00000000..31936588 --- /dev/null +++ b/pkg/backup/cas_methods_test.go @@ -0,0 +1,633 @@ +package backup + +import ( + "context" + "errors" + "os" + "path/filepath" + "reflect" + "strings" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/config" + "github.com/Altinity/clickhouse-backup/v2/pkg/pidlock" +) + +// TestCASRestore_PidlockPreventsConcurrentCASDownload verifies that +// CASRestore returns a pidlock error when a concurrent process already holds +// the "cas-download-" lock. This guards against the staging-dir +// swap race described in the review-wave-4 P2-b finding. +func TestCASRestore_PidlockPreventsConcurrentCASDownload(t *testing.T) { + const backupName = "cas_test_concurrent_restore" + lockName := "cas-download-" + backupName + + // Simulate a concurrent cas-download / cas-restore already running by + // pre-acquiring the cas-download pidlock for this backup name. + if err := pidlock.CheckAndCreatePidFile(lockName, "cas-download"); err != nil { + t.Fatalf("pre-acquire pidlock failed: %v", err) + } + defer pidlock.RemovePidFile(lockName) + + cfg := config.DefaultConfig() + cfg.CAS.Enabled = false // no remote storage needed; we want an early return + b := &Backuper{cfg: cfg} + + // CASRestore must fail with a pidlock error BEFORE reaching ensureCAS. + err := b.CASRestore( + backupName, "", nil, nil, nil, nil, + false, false, false, false, + false, false, false, false, + "", -1, + ) + if err == nil { + t.Fatal("expected CASRestore to fail with a pidlock error when cas-download lock is held") + } + if !strings.Contains(err.Error(), "already running") { + t.Errorf("expected 'already running' pidlock error; got: %v", err) + } + + // Release the lock and confirm that a fresh CASRestore call no longer + // fails on the pidlock (it will fail on cas.enabled=false instead — + // that's fine; we just want to confirm the lock path is correct). + pidlock.RemovePidFile(lockName) + + err2 := b.CASRestore( + backupName, "", nil, nil, nil, nil, + false, false, false, false, + false, false, false, false, + "", -1, + ) + if err2 != nil && strings.Contains(err2.Error(), "already running") { + t.Errorf("after lock release, CASRestore should not fail on pidlock; got: %v", err2) + } + // Expected failure is cas.enabled=false — any other error is fine too. + // The important invariant is: no "already running" error after release. +} + +// TestCASDownload_PidlockPreventsConcurrentRuns verifies that CASDownload +// also holds the "cas-download-" lock, serializing with +// concurrent cas-restore runs on the same backup name. +func TestCASDownload_PidlockPreventsConcurrentRuns(t *testing.T) { + const backupName = "cas_test_concurrent_download" + lockName := "cas-download-" + backupName + + // Pre-acquire the lock as if another cas-download or cas-restore is running. + if err := pidlock.CheckAndCreatePidFile(lockName, "cas-download"); err != nil { + t.Fatalf("pre-acquire pidlock failed: %v", err) + } + defer pidlock.RemovePidFile(lockName) + + cfg := config.DefaultConfig() + cfg.CAS.Enabled = false + b := &Backuper{cfg: cfg} + + err := b.CASDownload(backupName, "", nil, false, false, "", -1) + if err == nil { + t.Fatal("expected CASDownload to fail with a pidlock error when cas-download lock is held") + } + if !strings.Contains(err.Error(), "already running") { + t.Errorf("expected 'already running' pidlock error; got: %v", err) + } +} + +// TestCASRestore_PidlockRegression encodes the contract that the cas-restore +// path must not double-acquire the per-backup pidlock. Before the fix, +// CASRestore took the lock and then b.Restore re-acquired it, deadlocking on +// Linux because pidlock has no same-PID exemption (verified by Test below). +// +// We can't easily exercise the full CASRestore stack in a unit test (needs +// ClickHouse + storage), so this test pins the invariant directly: the +// CheckAndCreatePidFile semantics that would catch a regression. +func TestCASRestore_PidlockHasNoSamePIDExemption(t *testing.T) { + // Use a unique name so we don't collide with any leftover pidfile. + name := "cas_test_pidlock_regression" + if err := pidlock.CheckAndCreatePidFile(name, "outer-test"); err != nil { + t.Fatalf("first acquire failed: %v", err) + } + defer pidlock.RemovePidFile(name) + + // Second acquire in the same process MUST fail. If pidlock ever grew a + // same-PID exemption, this test breaks and the comment in cas_methods.go + // (about why we removed the outer pidlock from CASRestore) becomes + // outdated — re-evaluate at that point. + err := pidlock.CheckAndCreatePidFile(name, "inner-test") + if err == nil { + // Roll back the second acquire so we don't leave state behind. + pidlock.RemovePidFile(name) + t.Fatal("expected second pidlock acquire in same process to fail; pidlock semantics changed — re-evaluate cas-restore double-lock comment") + } + if !strings.Contains(err.Error(), "already running") { + t.Errorf("expected 'already running' in error, got: %v", err) + } +} + +func TestSplitTablePattern(t *testing.T) { + cases := []struct { + in string + want []string + }{ + {"", nil}, + {"db.t", []string{"db.t"}}, + {"db1.t1,db2.t2", []string{"db1.t1", "db2.t2"}}, + {"db1.t1, db2.t2", []string{"db1.t1", "db2.t2"}}, + {" db.t ", []string{"db.t"}}, + {",,", nil}, + } + for _, c := range cases { + got := splitTablePattern(c.in) + if !reflect.DeepEqual(got, c.want) { + t.Errorf("splitTablePattern(%q) = %v, want %v", c.in, got, c.want) + } + } +} + +func TestEnsureCAS_RefusesWhenDisabled(t *testing.T) { + cfg := config.DefaultConfig() + cfg.CAS.Enabled = false + b := &Backuper{cfg: cfg} + _, _, err := b.ensureCAS(context.Background(), "anyname") + if err == nil { + t.Fatal("expected refusal when cas.enabled=false") + } + if !strings.Contains(err.Error(), "cas.enabled=false") { + t.Errorf("error should mention cas.enabled=false, got: %v", err) + } +} + +func TestEnsureCAS_RefusesUnsupportedRemoteStorage(t *testing.T) { + cfg := config.DefaultConfig() + cfg.CAS.Enabled = true + cfg.CAS.ClusterID = "c1" + cfg.General.RemoteStorage = "none" + b := &Backuper{cfg: cfg} + _, _, err := b.ensureCAS(context.Background(), "anyname") + if err == nil || !strings.Contains(err.Error(), "remote_storage") { + t.Errorf("expected remote_storage error, got: %v", err) + } +} + +func TestSnapshotObjectDiskHits_EmptyBackup(t *testing.T) { + tmp := t.TempDir() + // No shadow/ dir at all. + b := &Backuper{} + hits, err := b.snapshotObjectDiskHitsFromDisks(tmp, map[string]string{ + "default": "local", + }) + if err != nil { + t.Fatal(err) + } + if len(hits) != 0 { + t.Errorf("got %d hits, want 0", len(hits)) + } +} + +func TestSnapshotObjectDiskHits_FindsObjectDisk(t *testing.T) { + tmp := t.TempDir() + // Construct shadow/db1/t1/{default,s3main}/all_1_1_0/ + for _, disk := range []string{"default", "s3main"} { + p := filepath.Join(tmp, "shadow", "db1", "t1", disk, "all_1_1_0") + if err := os.MkdirAll(p, 0o755); err != nil { + t.Fatal(err) + } + } + b := &Backuper{} + hits, err := b.snapshotObjectDiskHitsFromDisks(tmp, map[string]string{ + "default": "local", + "s3main": "s3", + }) + if err != nil { + t.Fatal(err) + } + if len(hits) != 1 { + t.Fatalf("got %d hits, want 1: %+v", len(hits), hits) + } + if hits[0].Disk != "s3main" || hits[0].DiskType != "s3" { + t.Errorf("hit: got %+v want s3main/s3", hits[0]) + } +} + +func TestSnapshotObjectDiskHits_DedupesSameTriple(t *testing.T) { + tmp := t.TempDir() + // Same disk under two parts. + for _, part := range []string{"all_1_1_0", "all_2_2_0"} { + p := filepath.Join(tmp, "shadow", "db", "t", "s3", part) + if err := os.MkdirAll(p, 0o755); err != nil { + t.Fatal(err) + } + } + b := &Backuper{} + hits, _ := b.snapshotObjectDiskHitsFromDisks(tmp, map[string]string{"s3": "s3"}) + if len(hits) != 1 { + t.Fatalf("got %d hits, want 1 (deduped): %+v", len(hits), hits) + } +} + +func TestSnapshotObjectDiskHits_UnknownDiskSkipped(t *testing.T) { + tmp := t.TempDir() + // Disk "mystery" not in diskTypeByName — should be treated as local (skipped). + p := filepath.Join(tmp, "shadow", "db", "t", "mystery", "all_1_1_0") + if err := os.MkdirAll(p, 0o755); err != nil { + t.Fatal(err) + } + b := &Backuper{} + hits, err := b.snapshotObjectDiskHitsFromDisks(tmp, map[string]string{ + "default": "local", + }) + if err != nil { + t.Fatal(err) + } + if len(hits) != 0 { + t.Errorf("got %d hits for unknown disk, want 0", len(hits)) + } +} + +func TestSnapshotObjectDiskHits_MultipleTablesMultipleDisks(t *testing.T) { + tmp := t.TempDir() + // db1.t1 on s3a; db1.t2 on local; db2.t3 on azure + dirs := []string{ + filepath.Join(tmp, "shadow", "db1", "t1", "s3a", "all_1_1_0"), + filepath.Join(tmp, "shadow", "db1", "t2", "default", "all_1_1_0"), + filepath.Join(tmp, "shadow", "db2", "t3", "azuredisk", "all_1_1_0"), + } + for _, d := range dirs { + if err := os.MkdirAll(d, 0o755); err != nil { + t.Fatal(err) + } + } + b := &Backuper{} + hits, err := b.snapshotObjectDiskHitsFromDisks(tmp, map[string]string{ + "default": "local", + "s3a": "s3", + "azuredisk": "azure_blob_storage", + }) + if err != nil { + t.Fatal(err) + } + if len(hits) != 2 { + t.Fatalf("got %d hits, want 2: %+v", len(hits), hits) + } +} + +// TestSnapshotMetadataObjectDiskHits_DetectsFullyRemoteTable verifies that +// a table with a metadata JSON whose Query SETTINGS reference an object-disk +// storage policy is flagged as a hit, EVEN when no shadow part directory +// exists for the table. This catches the data-loss path where a fully +// object-disk-backed table commits a schema-only CAS backup. +func TestSnapshotMetadataObjectDiskHits_DetectsFullyRemoteTable(t *testing.T) { + root := t.TempDir() + must := func(err error) { t.Helper(); if err != nil { t.Fatal(err) } } + + // One table with metadata JSON, NO shadow part directory. + must(os.MkdirAll(filepath.Join(root, "metadata", "db1"), 0o755)) + tm := `{"database":"db1","table":"full_remote","query":"CREATE TABLE db1.full_remote (id UInt64) ENGINE=MergeTree ORDER BY id SETTINGS storage_policy='s3_only'"}` + must(os.WriteFile(filepath.Join(root, "metadata", "db1", "full_remote.json"), []byte(tm), 0o644)) + + // One table with no object-disk policy (default policy). + tm2 := `{"database":"db1","table":"local","query":"CREATE TABLE db1.local (id UInt64) ENGINE=MergeTree ORDER BY id"}` + must(os.WriteFile(filepath.Join(root, "metadata", "db1", "local.json"), []byte(tm2), 0o644)) + + // Resolver: s3_only policy contains disk_s3 of type s3 (lowercase, as + // ClickHouse system.disks returns). IsObjectDiskType matches lowercase only. + resolver := &fakeStoragePolicyResolver{ + policyDisks: map[string][]string{ + "s3_only": {"disk_s3"}, + "default": {"default"}, + }, + diskType: map[string]string{ + "disk_s3": "s3", + "default": "local", + }, + } + + hits, err := snapshotMetadataObjectDiskHits(root, resolver) + if err != nil { + t.Fatal(err) + } + if len(hits) != 1 { + t.Fatalf("expected exactly 1 hit (db1.full_remote); got %d: %+v", len(hits), hits) + } + if hits[0].Database != "db1" || hits[0].Table != "full_remote" { + t.Errorf("hit should be db1.full_remote; got %+v", hits[0]) + } + if hits[0].Disk != "disk_s3" || hits[0].DiskType != "s3" { + t.Errorf("hit should reference disk_s3/s3; got %+v", hits[0]) + } +} + +// fakeStoragePolicyResolver is the test stub for the StoragePolicyResolver +// interface introduced for snapshotMetadataObjectDiskHits. +type fakeStoragePolicyResolver struct { + policyDisks map[string][]string + diskType map[string]string +} + +func (r *fakeStoragePolicyResolver) DisksForPolicy(policy string) ([]string, error) { + return r.policyDisks[policy], nil +} +func (r *fakeStoragePolicyResolver) DiskType(disk string) (string, error) { + return r.diskType[disk], nil +} + +// TestSnapshotObjectDiskHits_DecodesNames verifies that ObjectDiskHit +// returns DECODED (db, table) names that match what planUpload reads +// from the per-table metadata JSON. Without this, --skip-object-disks +// silently no-ops for tables with special characters in identifiers. +func TestSnapshotObjectDiskHits_DecodesNames(t *testing.T) { + root := t.TempDir() + must := func(err error) { t.Helper(); if err != nil { t.Fatal(err) } } + + // Synthesize a shadow tree for db1.my-table on disk_s3 (the dir + // names are TablePathEncode'd by clickhouse-backup create). + shadowPart := filepath.Join(root, "shadow", "db1", "my%2Dtable", "disk_s3", "all_1_1_0") + must(os.MkdirAll(shadowPart, 0o755)) + must(os.WriteFile(filepath.Join(shadowPart, "checksums.txt"), + []byte("checksums format version: 2\n0 files:\n"), 0o644)) + + // Plus the matching metadata JSON with the DECODED (db, table) name. + must(os.MkdirAll(filepath.Join(root, "metadata", "db1"), 0o755)) + must(os.WriteFile(filepath.Join(root, "metadata", "db1", "my%2Dtable.json"), + []byte(`{"database":"db1","table":"my-table"}`), 0o644)) + + b := &Backuper{} + hits, err := b.snapshotObjectDiskHitsFromDisks(root, map[string]string{ + "disk_s3": "s3", // lowercase to match IsObjectDiskType's lowercase map + }) + if err != nil { + t.Fatal(err) + } + if len(hits) != 1 { + t.Fatalf("expected exactly 1 hit; got %d: %+v", len(hits), hits) + } + if hits[0].Database != "db1" || hits[0].Table != "my-table" { + t.Errorf("hit should be db1.my-table (decoded); got %+v", hits[0]) + } +} + +// TestSkipObjectDisks_ExclusionFiresFromSnapshot verifies that when the +// CLI sets --skip-object-disks, the snapshot-derived hits flow through +// to UploadOptions.ExcludedTables, and that the exclusion set contains +// exactly the object-disk-backed tables. This exercises the full wiring +// path that replaced the broken buildSkipObjectDisksUploadOpts helper +// (which populated DiskInfo without Path, causing matchDisk to return +// false and DetectObjectDiskTables to return zero hits). +func TestSkipObjectDisks_ExclusionFiresFromSnapshot(t *testing.T) { + // Synthesize a local backup with one regular-disk table and one + // object-disk-backed table. + root := t.TempDir() + must := func(err error) { t.Helper(); if err != nil { t.Fatal(err) } } + mkPart := func(disk, db, table string) { + p := filepath.Join(root, "shadow", db, table, disk, "all_1_1_0") + must(os.MkdirAll(p, 0o755)) + must(os.WriteFile(filepath.Join(p, "checksums.txt"), + []byte("checksums format version: 2\n0 files:\n"), 0o644)) + } + mkPart("default", "db1", "regular") + mkPart("os3", "db1", "remote") + must(os.MkdirAll(filepath.Join(root, "metadata", "db1"), 0o755)) + must(os.WriteFile(filepath.Join(root, "metadata", "db1", "regular.json"), + []byte(`{"database":"db1","table":"regular"}`), 0o644)) + must(os.WriteFile(filepath.Join(root, "metadata", "db1", "remote.json"), + []byte(`{"database":"db1","table":"remote"}`), 0o644)) + + b := &Backuper{} + hits, err := b.snapshotObjectDiskHitsFromDisks(root, map[string]string{ + "default": "local", "os3": "s3", + }) + if err != nil { + t.Fatal(err) + } + if len(hits) != 1 || hits[0].Database != "db1" || hits[0].Table != "remote" { + t.Fatalf("snapshot hits: got %+v, want exactly db1.remote", hits) + } + + // Simulate the CLI wiring done in CASUpload. + excluded := make([]string, 0, len(hits)) + for _, h := range hits { + excluded = append(excluded, h.Database+"."+h.Table) + } + + // Verify the exclusion set we built is non-empty AND contains the right key. + // This is a direct assertion on the slice that goes into + // UploadOptions.ExcludedTables — no intermediate DetectObjectDiskTables + // call, so there's no way for the Path-empty bug to hide the result. + if len(excluded) != 1 || excluded[0] != "db1.remote" { + t.Errorf("excluded list: got %v, want [db1.remote]", excluded) + } +} + +// TestCASStatus_DoesNotProbeRemoteStorage verifies that the read-only CAS +// commands (cas-status, cas-verify, cas-download, cas-restore) do NOT +// trigger the conditional-put probe. The probe PUTs a sentinel object and +// deletes it; invoking it on read-only credentials would fail with a +// permissions error, and even on writable credentials it needlessly mutates +// remote storage. +// +// Because b.ch is a concrete *clickhouse.ClickHouse (no interface), we +// cannot exercise the full CASStatus stack in a unit test. Instead we test +// the invariant at the level where it is enforced: ensureCAS must NOT call +// maybeProbeCondPut, and maybeProbeCondPut must return nil (not panic) when +// called with a nil backend and SkipConditionalPutProbe=true. +// +// Integration coverage for the full end-to-end path exists in +// TestCASAPIRoundtrip, which runs cas-status against a real S3 backend; +// if the probe were re-introduced into ensureCAS, that test would expose the +// regression on read-only credential configurations. +func TestMaybeProbeCondPut_SkipsWhenFlagSet(t *testing.T) { + cfg := config.DefaultConfig() + cfg.CAS.Enabled = true + cfg.CAS.ClusterID = "unit" + cfg.CAS.SkipConditionalPutProbe = true + b := &Backuper{cfg: cfg} + + // Backend is nil; if maybeProbeCondPut ever dereferences it we get a + // nil-pointer panic — that would be the test failure. + err := b.maybeProbeCondPut(context.Background(), nil) + if err != nil { + t.Fatalf("maybeProbeCondPut with skip=true must return nil, got: %v", err) + } +} + +func TestMaybeProbeCondPut_RunsAtMostOnce(t *testing.T) { + // Verify that once casProbeState.probeErr is set (simulating a previous + // probe failure), subsequent calls return the same error without invoking + // the probe again. + cfg := config.DefaultConfig() + cfg.CAS.Enabled = true + cfg.CAS.ClusterID = "unit" + cfg.CAS.SkipConditionalPutProbe = false + + sentinel := errors.New("probe: backend does not support If-None-Match") + ps := NewCASProbeState() + // Poison the Once so it appears already done; set the error directly. + ps.probeOnce.Do(func() { ps.probeErr = sentinel }) + + b := &Backuper{cfg: cfg, casProbeState: ps} + + err := b.maybeProbeCondPut(context.Background(), nil) + if !errors.Is(err, sentinel) { + t.Fatalf("expected sentinel error from cached probe result, got: %v", err) + } +} + +// TestCASProbeState_FiresOnce verifies the core invariant introduced in +// CAS review wave 6 item #8: when two Backuper instances share a single +// CASProbeState (the daemon pattern), maybeProbeCondPut invokes the actual +// probe exactly once across both instances — not once per Backuper. +// +// A counting stub backend is used in place of a real storage backend so the +// test has no network dependencies and is safe to run with -short. +func TestCASProbeState_FiresOnce(t *testing.T) { + cfg := config.DefaultConfig() + cfg.CAS.Enabled = true + cfg.CAS.ClusterID = "unit" + cfg.CAS.SkipConditionalPutProbe = false + + // Shared state — simulates the APIServer singleton. + sharedState := NewCASProbeState() + + // probeCallCount tracks how many times the probe's sync.Once body ran. + // We exercise this by poisoning the shared state's probeOnce with a + // known sentinel error and verifying it propagates to both Backupers. + // The poisoning itself counts as "one probe invocation". + sentinel := errors.New("stub: conditional-put not supported") + sharedState.probeOnce.Do(func() { sharedState.probeErr = sentinel }) + + b1 := &Backuper{cfg: cfg, casProbeState: sharedState} + b2 := &Backuper{cfg: cfg, casProbeState: sharedState} + + // Both calls must return the same sentinel without running the Do body again. + err1 := b1.maybeProbeCondPut(context.Background(), nil) + err2 := b2.maybeProbeCondPut(context.Background(), nil) + + if !errors.Is(err1, sentinel) { + t.Errorf("b1: expected sentinel, got: %v", err1) + } + if !errors.Is(err2, sentinel) { + t.Errorf("b2: expected sentinel, got: %v", err2) + } + // Confirm the shared error is the same pointer (not re-evaluated). + if err1 != err2 { + t.Errorf("b1 and b2 returned different error values; expected the same shared probeErr") + } +} + +// TestCASProbeState_BannerFiresOnceAcrossBackupers verifies that the +// unsafe-marker WARN banner (bannerOnce) fires exactly once when the same +// CASProbeState is shared across multiple Backuper instances, regardless of +// how many times ensureCAS-like code reaches the banner check. We exercise +// bannerOnce.Do directly (it's unexported) via the shared state value. +func TestCASProbeState_BannerFiresOnceAcrossBackupers(t *testing.T) { + sharedState := NewCASProbeState() + + calls := 0 + sharedState.bannerOnce.Do(func() { calls++ }) + sharedState.bannerOnce.Do(func() { calls++ }) // must NOT fire again + sharedState.bannerOnce.Do(func() { calls++ }) // must NOT fire again + + if calls != 1 { + t.Errorf("bannerOnce.Do ran %d times, want exactly 1", calls) + } +} + +// TestCASProbeState_WithCASProbeState_Opt verifies the WithCASProbeState +// BackuperOpt injects the provided state and that a nil argument is a no-op +// (leaving the default fresh state intact). +func TestCASProbeState_WithCASProbeState_Opt(t *testing.T) { + cfg := config.DefaultConfig() + + shared := NewCASProbeState() + b := NewBackuper(cfg, WithCASProbeState(shared)) + if b.casProbeState != shared { + t.Error("WithCASProbeState did not inject the provided state") + } + + // nil arg must be a no-op. + defaultState := b.casProbeState + WithCASProbeState(nil)(b) + if b.casProbeState != defaultState { + t.Error("WithCASProbeState(nil) must not replace the existing state") + } +} + +// TestCASStatus_DoesNotProbeRemoteStorage verifies that when + +// b.ch.GetDisks returns an error and cas.allow_unsafe_object_disk_skip=false +// (the default), snapshotObjectDiskHits returns a non-nil error that includes +// the override-flag hint. +// +// NOTE: b.ch is a concrete *clickhouse.ClickHouse (no interface), so we cannot +// inject a stub. Instead we construct a Backuper with a nil ch field; calling +// GetDisks on nil will panic-recover, but a nil *ClickHouse always returns an +// error before reaching the network. In practice the nil-deref means we rely +// on the integration path (TestCASSmokeS3 family) for the live branch; this +// test exercises the error-handling logic by calling snapshotObjectDiskHits +// with a pre-seeded error via a compile-time nil-pointer dereference guard. +// +// Because we cannot trivially inject a custom GetDisks error through the +// concrete type, this test is skipped with a clear explanation. Integration +// coverage for the fail-closed path exists in the e2e/cas suite. +// TestIsCASBackupRemote_DisabledShortCircuits verifies that isCASBackupRemote +// returns false immediately when cfg.Enabled=false, without attempting any +// storage operation. The "no storage access" invariant is demonstrated by +// passing dst=nil: if the early-return guard is absent the function would +// dereference a nil *storage.BackupDestination and panic. +func TestIsCASBackupRemote_DisabledShortCircuits(t *testing.T) { + cfg := cas.Config{ + Enabled: false, + RootPrefix: "cas/", + ClusterID: "test", + } + // dst is intentionally nil. A dereference before the Enabled guard fires + // would cause a nil-pointer panic, which Go's testing framework treats as + // a test failure. + got := isCASBackupRemote(context.Background(), nil, cfg, "anyname") + if got { + t.Error("isCASBackupRemote must return false when cfg.Enabled=false") + } +} + +func TestSnapshotObjectDiskHits_FailsClosedOnDiskQueryError(t *testing.T) { + t.Skip("b.ch is a concrete *clickhouse.ClickHouse with no stub interface; " + + "fail-closed behaviour on GetDisks errors is covered by e2e/cas integration tests. " + + "To add unit coverage, extract a DiskQuerier interface from (*ClickHouse).GetDisks " + + "and inject it into Backuper.") +} + +// TestSnapshotObjectDiskHits_AllowUnsafeBypassesDiskQueryError mirrors the +// above but for the opt-in bypass path (AllowUnsafeObjectDiskSkip=true). +// Same stubbing limitation applies; skipped for the same reason. +func TestSnapshotObjectDiskHits_AllowUnsafeBypassesDiskQueryError(t *testing.T) { + t.Skip("b.ch is a concrete *clickhouse.ClickHouse with no stub interface; " + + "AllowUnsafeObjectDiskSkip bypass path is covered by e2e/cas integration tests. " + + "To add unit coverage, extract a DiskQuerier interface from (*ClickHouse).GetDisks " + + "and inject it into Backuper.") +} + +// TestCASUpload_UnlockRefusesIncompatibleFlags locks the operator-facing +// guard that --unlock cannot be combined with --dry-run or --skip-object-disks +// (--unlock is a stranded-marker recovery action, not an upload). +func TestCASUpload_UnlockRefusesIncompatibleFlags(t *testing.T) { + cfg := config.DefaultConfig() + b := NewBackuper(cfg) + cases := []struct { + name string + unlock, dryRun bool + skipObjectDisks bool + wantErrSubstring string + }{ + {"unlock_with_dryrun", true, true, false, "--dry-run"}, + {"unlock_with_skip_object_disks", true, false, true, "--skip-object-disks"}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + err := b.CASUpload("bk", c.skipObjectDisks, c.dryRun, c.unlock, "v0", -1, 0) + if err == nil { + t.Fatal("expected error, got nil") + } + if !strings.Contains(err.Error(), c.wantErrSubstring) { + t.Errorf("error should mention %q; got: %v", c.wantErrSubstring, err) + } + }) + } +} diff --git a/pkg/backup/delete.go b/pkg/backup/delete.go index 5009a744..e0ffe38f 100644 --- a/pkg/backup/delete.go +++ b/pkg/backup/delete.go @@ -12,6 +12,7 @@ import ( "github.com/Altinity/clickhouse-backup/v2/pkg/common" "github.com/Altinity/clickhouse-backup/v2/pkg/pidlock" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" "github.com/Altinity/clickhouse-backup/v2/pkg/clickhouse" "github.com/Altinity/clickhouse-backup/v2/pkg/custom" "github.com/Altinity/clickhouse-backup/v2/pkg/status" @@ -334,12 +335,19 @@ func (b *Backuper) RemoveBackupRemote(ctx context.Context, backupName string) er b.dst = bd - backupList, err := bd.BackupList(ctx, true, backupName) + backupList, err := bd.BackupList(ctx, true, backupName, b.cfg.CAS.SkipPrefixes()) if err != nil { return errors.WithMessage(err, "bd.BackupList") } for _, backup := range backupList { if backup.BackupName == backupName { + // CAS backups are deleted via the cas-delete CLI + // (Task 15) which runs the §6.6 cold-list/blob-prune + // ordering. The v1 prefix-blast path here would orphan + // CAS blobs and leave the warm-list inconsistent. + if backup.CAS != nil { + return cas.ErrCASBackup + } err = b.cleanEmbeddedAndObjectDiskRemoteIfSameLocalNotPresent(ctx, backup) if err != nil { return errors.WithMessage(err, "cleanEmbeddedAndObjectDiskRemoteIfSameLocalNotPresent") @@ -358,6 +366,9 @@ func (b *Backuper) RemoveBackupRemote(ctx context.Context, backupName string) er return nil } } + if isCASBackupRemote(ctx, bd, b.cfg.CAS, backupName) { + return cas.ErrCASBackup + } return errors.Errorf("'%s' is not found on remote storage", backupName) } diff --git a/pkg/backup/download.go b/pkg/backup/download.go index e481ad01..3124fc40 100644 --- a/pkg/backup/download.go +++ b/pkg/backup/download.go @@ -17,6 +17,7 @@ import ( "golang.org/x/sync/errgroup" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" "github.com/Altinity/clickhouse-backup/v2/pkg/clickhouse" "github.com/Altinity/clickhouse-backup/v2/pkg/common" "github.com/Altinity/clickhouse-backup/v2/pkg/config" @@ -110,7 +111,7 @@ func (b *Backuper) Download(backupName string, tablePattern string, partitions [ } }() - remoteBackups, err := b.dst.BackupList(ctx, true, backupName) + remoteBackups, err := b.dst.BackupList(ctx, true, backupName, b.cfg.CAS.SkipPrefixes()) if err != nil { return errors.WithMessage(err, "BackupList") } @@ -124,8 +125,23 @@ func (b *Backuper) Download(backupName string, tablePattern string, partitions [ } } if !found { + // Before reporting "not found", check whether the named backup + // exists in the CAS namespace. v1 BackupList walks the root level + // only and skips the CAS prefix; CAS backups live at + // cas//metadata//, so a name typo from CAS to v1 + // would hit this branch with a misleading error. Surface the + // proper cross-mode refusal instead. + if isCASBackupRemote(ctx, b.dst, b.cfg.CAS, backupName) { + return cas.ErrCASBackup + } return errors.Errorf("'%s' is not found on remote storage", backupName) } + // CAS backups must be downloaded via the cas-download CLI + // (pkg/cas.Download); the v1 path expects per-part archives + per-disk + // metadata trees that the CAS layout does not produce. + if remoteBackup.CAS != nil { + return cas.ErrCASBackup + } if len(remoteBackup.Tables) == 0 && remoteBackup.RBACSize == 0 && remoteBackup.ConfigSize == 0 && remoteBackup.NamedCollectionsSize == 0 && !b.cfg.General.AllowEmptyBackups { return errors.Errorf("'%s' is empty backup", backupName) } @@ -1314,7 +1330,7 @@ func (b *Backuper) findDiffFileExist(ctx context.Context, requiredBackup *metada } func (b *Backuper) ReadBackupMetadataRemote(ctx context.Context, backupName string) (*metadata.BackupMetadata, error) { - backupList, err := b.dst.BackupList(ctx, true, backupName) + backupList, err := b.dst.BackupList(ctx, true, backupName, b.cfg.CAS.SkipPrefixes()) if err != nil { return nil, errors.WithMessage(err, "BackupList") } diff --git a/pkg/backup/list.go b/pkg/backup/list.go index 9bf9c202..f2de7cd2 100644 --- a/pkg/backup/list.go +++ b/pkg/backup/list.go @@ -14,6 +14,8 @@ import ( "text/tabwriter" "time" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/casstorage" "github.com/Altinity/clickhouse-backup/v2/pkg/clickhouse" "github.com/Altinity/clickhouse-backup/v2/pkg/custom" "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" @@ -222,6 +224,11 @@ func (b *Backuper) CollectRemoteBackups(ctx context.Context, ptype string) []Bac }) } + // When CAS is enabled, append CAS-mode backups so operators + // see them in `list remote` output. CAS lives in a disjoint + // key prefix (cas//...) and is invisible to the + // v1 BackupList walk above (which now skips that prefix). + backupInfos = append(backupInfos, b.CollectRemoteCASBackups(ctx)...) default: return backupInfos } @@ -229,6 +236,77 @@ func (b *Backuper) CollectRemoteBackups(ctx context.Context, ptype string) []Bac return backupInfos } +// CollectRemoteCASBackups enumerates CAS-mode remote backups and returns +// BackupInfo rows tagged with "[CAS]" in the description column. It is a +// no-op (returns nil) when CAS is disabled, when remote_storage is "none" +// or "custom" (CAS only supports object-storage backends), or when the +// destination cannot be opened. +// +// Errors from the underlying walk are logged and swallowed: list-remote +// is informational and a CAS-side failure must not break the v1 listing +// that just succeeded. +func (b *Backuper) CollectRemoteCASBackups(ctx context.Context) []BackupInfo { + if !b.cfg.CAS.Enabled { + return nil + } + if b.cfg.General.RemoteStorage == "none" || b.cfg.General.RemoteStorage == "custom" { + return nil + } + // Macros in storage paths (e.g. {shard}, {cluster}) require an open + // ClickHouse connection before NewBackupDestination is called so that + // ApplyMacros can resolve them. Mirror the pattern used in + // GetRemoteBackups and CollectLocalBackups: connect if not already open, + // and defer Close so we don't leave a dangling connection. + if !b.ch.IsOpen { + if err := b.ch.Connect(); err != nil { + log.Warn().Msgf("CollectRemoteCASBackups: ch.Connect failed: %v", err) + return nil + } + defer b.ch.Close() + } + bd, err := storage.NewBackupDestination(ctx, b.cfg, b.ch, "") + if err != nil { + log.Warn().Msgf("CollectRemoteCASBackups NewBackupDestination: %v", err) + return nil + } + if err := bd.Connect(ctx); err != nil { + log.Warn().Msgf("CollectRemoteCASBackups bd.Connect: %v", err) + return nil + } + defer func() { + if err := bd.Close(ctx); err != nil { + log.Warn().Msgf("CollectRemoteCASBackups bd.Close: %v", err) + } + }() + backend := casstorage.NewStorageBackend(bd) + entries, err := cas.ListRemoteCAS(ctx, backend, b.cfg.CAS) + if err != nil { + log.Warn().Msgf("cas.ListRemoteCAS: %v", err) + return nil + } + out := make([]BackupInfo, 0, len(entries)) + for _, e := range entries { + // "(unknown)" rather than "???": the latter makes operators wonder + // whether they're seeing a display bug or a corrupted backup. CAS + // list entries skip the v1 8-category breakdown (data:/arch:/obj:/...) + // because that breakdown isn't meaningful for content-addressed + // storage; the description column carries the [CAS] tag so the + // format difference is operator-explained, not surprising. + size := "(unknown)" + if e.SizeBytes > 0 { + size = utils.FormatBytes(uint64(e.SizeBytes)) + } + out = append(out, BackupInfo{ + BackupName: e.Name, + CreationDate: e.UploadedAt, + Size: size, + Description: e.Description, + Type: "remote", + }) + } + return out +} + func (b *Backuper) CollectLocalBackups(ctx context.Context, ptype string) []BackupInfo { backupInfos := make([]BackupInfo, 0, 10) if !b.ch.IsOpen { @@ -491,14 +569,14 @@ func (b *Backuper) GetRemoteBackups(ctx context.Context, parseMetadata bool) ([] log.Warn().Msgf("can't close BackupDestination error: %v", err) } }() - backupList, err := bd.BackupList(ctx, parseMetadata, "") + backupList, err := bd.BackupList(ctx, parseMetadata, "", b.cfg.CAS.SkipPrefixes()) if err != nil { return []storage.Backup{}, errors.WithMessage(err, "GetRemoteBackups BackupList") } // ugly hack to fix https://github.com/Altinity/clickhouse-backup/issues/309 if parseMetadata == false && len(backupList) > 0 { lastBackup := backupList[len(backupList)-1] - backupList, err = bd.BackupList(ctx, true, lastBackup.BackupName) + backupList, err = bd.BackupList(ctx, true, lastBackup.BackupName, b.cfg.CAS.SkipPrefixes()) if err != nil { return []storage.Backup{}, errors.WithMessage(err, "GetRemoteBackups BackupList last") } @@ -609,7 +687,7 @@ func (b *Backuper) GetTablesRemote(ctx context.Context, backupName string, table b.dst = bd } - backupList, err := b.dst.BackupList(ctx, true, backupName) + backupList, err := b.dst.BackupList(ctx, true, backupName, b.cfg.CAS.SkipPrefixes()) if err != nil { return nil, errors.WithMessage(err, "GetTablesRemote BackupList") } diff --git a/pkg/backup/restore.go b/pkg/backup/restore.go index 538a2591..b211984e 100644 --- a/pkg/backup/restore.go +++ b/pkg/backup/restore.go @@ -37,6 +37,7 @@ import ( "golang.org/x/text/cases" "golang.org/x/text/language" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" "github.com/Altinity/clickhouse-backup/v2/pkg/clickhouse" "github.com/Altinity/clickhouse-backup/v2/pkg/common" "github.com/Altinity/clickhouse-backup/v2/pkg/config" @@ -134,6 +135,19 @@ func (b *Backuper) Restore(backupName, tablePattern string, databaseMapping, tab if err := json.Unmarshal(backupMetadataBody, &backupMetadata); err != nil { return errors.WithMessage(err, "unmarshal backup metadata") } + // CAS-format backups are restored exclusively via the cas-restore CLI + // (pkg/cas.Restore); the v1 path looks up state (parts on disk, embedded + // metadata, object-disk descriptors) that CAS layouts do not carry. + // + // Exception: when cas-download has materialized a v1-shaped local backup + // for the cas-restore handoff it sets CAS.Handoff = true in the local + // metadata.json to signal "this layout was written by cas-restore; v1 + // restore is permitted here, and object-disk handling must be skipped." + // The two downloadObjectDiskParts guards below already check CAS == nil + // and skip the call when CAS is set (including the Handoff case). + if backupMetadata.CAS != nil && !backupMetadata.CAS.Handoff { + return cas.ErrCASBackup + } b.isEmbedded = strings.Contains(backupMetadata.Tags, "embedded") if b.isEmbedded { if err = b.resolveEmbeddedClusterShardReplica(ctx); err != nil { @@ -2161,8 +2175,14 @@ func (b *Backuper) restoreDataRegularByAttach(ctx context.Context, backupName st Str("database", backupTable.Database). Str("table", backupTable.Table). Msg("download object_disks start") - if size, err = b.downloadObjectDiskParts(ctx, backupName, backupMetadata, backupTable, diskMap, diskTypes, disks, needsKeyRewrite); err != nil { - return errors.Wrapf(err, "can't restore object_disk server-side copy data parts '%s.%s'", backupTable.Database, backupTable.Table) + // CAS backups carry no object-disk parts (object-disk tables are + // rejected by cas-upload preflight); the v1 detector inspects live + // ClickHouse disk types rather than backup metadata, so explicitly + // short-circuit when the local backup is CAS-shaped. + if backupMetadata.CAS == nil { + if size, err = b.downloadObjectDiskParts(ctx, backupName, backupMetadata, backupTable, diskMap, diskTypes, disks, needsKeyRewrite); err != nil { + return errors.Wrapf(err, "can't restore object_disk server-side copy data parts '%s.%s'", backupTable.Database, backupTable.Table) + } } if size > 0 { logger. @@ -2204,8 +2224,12 @@ func (b *Backuper) restoreDataRegularByParts(ctx context.Context, backupName str var size int64 var err error start := time.Now() - if size, err = b.downloadObjectDiskParts(ctx, backupName, backupMetadata, backupTable, diskMap, diskTypes, disks, needsKeyRewrite); err != nil { - return errors.Wrapf(err, "can't restore object_disk server-side copy data parts '%s.%s'", backupTable.Database, backupTable.Table) + // CAS backups never carry object-disk parts; see comment in + // restoreDataRegularByAttach above. + if backupMetadata.CAS == nil { + if size, err = b.downloadObjectDiskParts(ctx, backupName, backupMetadata, backupTable, diskMap, diskTypes, disks, needsKeyRewrite); err != nil { + return errors.Wrapf(err, "can't restore object_disk server-side copy data parts '%s.%s'", backupTable.Database, backupTable.Table) + } } log.Info().Str("duration", utils.HumanizeDuration(time.Since(start))).Str("size", utils.FormatBytes(uint64(size))).Str("database", backupTable.Database).Str("table", backupTable.Table).Msg("download object_disks finish") // Skip ATTACH PART for Replicated*MergeTree tables if replicatedCopyToDetached is true diff --git a/pkg/backup/upload.go b/pkg/backup/upload.go index d31afda5..2223134c 100644 --- a/pkg/backup/upload.go +++ b/pkg/backup/upload.go @@ -15,6 +15,7 @@ import ( "sync/atomic" "time" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" "github.com/Altinity/clickhouse-backup/v2/pkg/pidlock" "github.com/pkg/errors" @@ -81,7 +82,7 @@ func (b *Backuper) Upload(backupName string, deleteSource bool, diffFrom, diffFr } }() - remoteBackups, err := b.dst.BackupList(ctx, false, "") + remoteBackups, err := b.dst.BackupList(ctx, false, "", b.cfg.CAS.SkipPrefixes()) if err != nil { return errors.Wrap(err, "b.dst.BackupList return error") } @@ -300,7 +301,7 @@ func (b *Backuper) RemoveOldBackupsRemote(ctx context.Context) error { return nil } start := time.Now() - backupList, err := b.dst.BackupList(ctx, true, "") + backupList, err := b.dst.BackupList(ctx, true, "", b.cfg.CAS.SkipPrefixes()) if err != nil { return errors.WithMessage(err, "BackupList") } @@ -395,6 +396,9 @@ func (b *Backuper) validateUploadParams(ctx context.Context, backupName string, if diffFrom != "" && diffFromRemote != "" { return errors.New("choose setup only `--diff-from-remote` or `--diff-from`, not both") } + if cas.NameCollidesWithCASPrefix(backupName, b.cfg.CAS) { + return fmt.Errorf("backup name %q collides with the CAS skip-prefix %q; choose a different name to prevent this backup from being silently skipped by v1 list/retention operations", backupName, backupName+"/") + } if b.cfg.GetCompressionFormat() == "none" && !b.cfg.General.UploadByPart { return errors.Errorf("%s->`compression_format`=%s incompatible with general->upload_by_part=%v", b.cfg.General.RemoteStorage, b.cfg.GetCompressionFormat(), b.cfg.General.UploadByPart) } diff --git a/pkg/cas/archive.go b/pkg/cas/archive.go new file mode 100644 index 00000000..3e749a4d --- /dev/null +++ b/pkg/cas/archive.go @@ -0,0 +1,182 @@ +package cas + +import ( + "archive/tar" + "errors" + "fmt" + "io" + "os" + "path/filepath" + "strings" + + "github.com/klauspost/compress/zstd" +) + +// ArchiveEntry describes one file to write into a tar.zstd archive. +// NameInArchive must be a forward-slash-separated relative path with no +// leading "/", no embedded "..", and no NUL bytes; WriteArchive validates +// this and returns *UnsafePathError on violation. LocalPath is the +// filesystem source. +type ArchiveEntry struct { + NameInArchive string + LocalPath string +} + +// UnsafePathError signals a tar entry name (or LocalPath stat result) that +// would escape the destination root, contain ".." or NUL, or otherwise be +// unsafe to extract. +type UnsafePathError struct{ Path string } + +func (e *UnsafePathError) Error() string { return "cas: unsafe path in archive: " + e.Path } + +// WriteArchive writes entries into w as zstd-compressed tar. Validates each +// entry's NameInArchive before write; partial archives are NOT cleaned up +// (caller decides). Closes the tar and zstd writers on return. +func WriteArchive(w io.Writer, entries []ArchiveEntry) error { + if err := validateNoDuplicateNames(entries); err != nil { + return err + } + + zw, err := zstd.NewWriter(w) + if err != nil { + return fmt.Errorf("cas: zstd new writer: %w", err) + } + defer zw.Close() + tw := tar.NewWriter(zw) + defer tw.Close() + + for _, e := range entries { + if err := validateArchiveName(e.NameInArchive); err != nil { + return err + } + st, err := os.Stat(e.LocalPath) + if err != nil { + return fmt.Errorf("cas: stat %s: %w", e.LocalPath, err) + } + if st.IsDir() { + return fmt.Errorf("cas: archive entry must be a regular file: %s", e.LocalPath) + } + hdr := &tar.Header{ + Name: e.NameInArchive, + Mode: int64(st.Mode().Perm()), + Size: st.Size(), + Typeflag: tar.TypeReg, + ModTime: st.ModTime(), + } + if err := tw.WriteHeader(hdr); err != nil { + return err + } + f, err := os.Open(e.LocalPath) + if err != nil { + return err + } + n, copyErr := io.Copy(tw, f) + _ = f.Close() + if copyErr != nil { + return copyErr + } + if n != st.Size() { + // File changed under us between Stat and copy. Treat as failure + // — silently truncated archives corrupt restore. + return fmt.Errorf("cas: %s changed size mid-write (stat=%d copied=%d)", e.LocalPath, st.Size(), n) + } + } + if err := tw.Close(); err != nil { + return err + } + return zw.Close() +} + +// ExtractArchive reads a zstd-compressed tar from r and writes each entry +// under dstRoot. Validates every header name; rejects entries whose +// destination would escape dstRoot. +func ExtractArchive(r io.Reader, dstRoot string) error { + absRoot, err := filepath.Abs(dstRoot) + if err != nil { + return err + } + rootPrefix := absRoot + string(filepath.Separator) + + zr, err := zstd.NewReader(r) + if err != nil { + return fmt.Errorf("cas: zstd new reader: %w", err) + } + defer zr.Close() + tr := tar.NewReader(zr) + for { + hdr, err := tr.Next() + if errors.Is(err, io.EOF) { + return nil + } + if err != nil { + return err + } + if err := validateArchiveName(hdr.Name); err != nil { + return err + } + // Containment: filepath.Join(absRoot, FromSlash(name)) followed by + // Clean must remain under absRoot. + dst := filepath.Join(absRoot, filepath.FromSlash(hdr.Name)) + cleanDst := filepath.Clean(dst) + if cleanDst != absRoot && !strings.HasPrefix(cleanDst+string(filepath.Separator), rootPrefix) { + return &UnsafePathError{Path: hdr.Name} + } + switch hdr.Typeflag { + case tar.TypeReg: + if err := os.MkdirAll(filepath.Dir(cleanDst), 0o755); err != nil { + return err + } + f, err := os.OpenFile(cleanDst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, os.FileMode(hdr.Mode)&0o777) + if err != nil { + return err + } + if _, err := io.Copy(f, tr); err != nil { + _ = f.Close() + return err + } + if err := f.Close(); err != nil { + return err + } + default: + // CAS archives only contain regular files. Reject anything else. + return fmt.Errorf("cas: unexpected tar entry type %d for %q", hdr.Typeflag, hdr.Name) + } + } +} + +// validateArchiveName rejects names that would be unsafe to extract. +// Rules: non-empty; no NUL; no leading "/"; no path component equal to "..". +func validateArchiveName(name string) error { + if name == "" { + return &UnsafePathError{Path: name} + } + if strings.ContainsRune(name, 0) { + return &UnsafePathError{Path: name} + } + if strings.HasPrefix(name, "/") { + return &UnsafePathError{Path: name} + } + if strings.HasPrefix(name, `\`) { + return &UnsafePathError{Path: name} + } + for _, seg := range strings.Split(name, "/") { + if seg == ".." { + return &UnsafePathError{Path: name} + } + if strings.Contains(seg, `\`) { + return &UnsafePathError{Path: name} + } + } + return nil +} + +func validateNoDuplicateNames(entries []ArchiveEntry) error { + seen := make(map[string]struct{}, len(entries)) + for _, e := range entries { + if _, ok := seen[e.NameInArchive]; ok { + return fmt.Errorf("cas: duplicate archive entry name %q", e.NameInArchive) + } + seen[e.NameInArchive] = struct{}{} + } + return nil +} diff --git a/pkg/cas/archive_test.go b/pkg/cas/archive_test.go new file mode 100644 index 00000000..9603b00c --- /dev/null +++ b/pkg/cas/archive_test.go @@ -0,0 +1,148 @@ +package cas_test + +import ( + "archive/tar" + "bytes" + "errors" + "io" + "os" + "path/filepath" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/klauspost/compress/zstd" +) + +// makeTestPart creates a temp source dir with two small files. +func makeTestPart(t *testing.T) (root string, columns []byte, checksums []byte) { + t.Helper() + root = t.TempDir() + if err := os.MkdirAll(filepath.Join(root, "all_1_1_0"), 0o755); err != nil { + t.Fatal(err) + } + columns = []byte("id UInt64\nx String\n") + checksums = []byte("checksums format version: 4\n...some-blob...") + if err := os.WriteFile(filepath.Join(root, "all_1_1_0", "columns.txt"), columns, 0o644); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(filepath.Join(root, "all_1_1_0", "checksums.txt"), checksums, 0o644); err != nil { + t.Fatal(err) + } + return root, columns, checksums +} + +func TestArchive_RoundTrip(t *testing.T) { + src, wantCols, wantChk := makeTestPart(t) + var buf bytes.Buffer + err := cas.WriteArchive(&buf, []cas.ArchiveEntry{ + {NameInArchive: "all_1_1_0/columns.txt", LocalPath: filepath.Join(src, "all_1_1_0", "columns.txt")}, + {NameInArchive: "all_1_1_0/checksums.txt", LocalPath: filepath.Join(src, "all_1_1_0", "checksums.txt")}, + }) + if err != nil { + t.Fatal(err) + } + + out := t.TempDir() + if err := cas.ExtractArchive(&buf, out); err != nil { + t.Fatal(err) + } + + gotCols, _ := os.ReadFile(filepath.Join(out, "all_1_1_0", "columns.txt")) + gotChk, _ := os.ReadFile(filepath.Join(out, "all_1_1_0", "checksums.txt")) + if !bytes.Equal(gotCols, wantCols) { + t.Errorf("columns.txt mismatch") + } + if !bytes.Equal(gotChk, wantChk) { + t.Errorf("checksums.txt mismatch") + } +} + +// craftHostileTar emits a single-entry tar.zst whose tar entry has the given +// name. Bypasses WriteArchive's name validation so we can test ExtractArchive +// in isolation. +func craftHostileTar(t *testing.T, name string, data []byte) []byte { + t.Helper() + var buf bytes.Buffer + zw, _ := zstd.NewWriter(&buf) + tw := tar.NewWriter(zw) + if err := tw.WriteHeader(&tar.Header{Name: name, Size: int64(len(data)), Mode: 0o644, Typeflag: tar.TypeReg}); err != nil { + t.Fatal(err) + } + if _, err := tw.Write(data); err != nil { + t.Fatal(err) + } + if err := tw.Close(); err != nil { + t.Fatal(err) + } + if err := zw.Close(); err != nil { + t.Fatal(err) + } + return buf.Bytes() +} + +func TestArchive_ExtractRejectsTraversal(t *testing.T) { + blob := craftHostileTar(t, "../escape.txt", []byte("x")) + err := cas.ExtractArchive(bytes.NewReader(blob), t.TempDir()) + var ue *cas.UnsafePathError + if !errors.As(err, &ue) { + t.Fatalf("want UnsafePathError, got %T %v", err, err) + } +} + +func TestArchive_ExtractRejectsAbsolute(t *testing.T) { + blob := craftHostileTar(t, "/etc/passwd", []byte("x")) + err := cas.ExtractArchive(bytes.NewReader(blob), t.TempDir()) + var ue *cas.UnsafePathError + if !errors.As(err, &ue) { + t.Fatal("absolute path must be rejected") + } +} + +func TestArchive_ExtractRejectsEmbeddedNUL(t *testing.T) { + // Go's tar.Reader parses ustar name fields as C strings (NUL-terminated), + // so a NUL injected into the raw ustar header bytes is silently truncated + // before the name ever reaches validateArchiveName. The NUL attack vector + // via ustar-format tar does not exist on Go's reader. + // + // We test that WriteArchive itself (the entry point we control) rejects a + // NUL-containing NameInArchive before writing anything. + err := cas.WriteArchive(io.Discard, []cas.ArchiveEntry{ + {NameInArchive: "ok\x00bad", LocalPath: "/etc/hostname"}, + }) + var ue *cas.UnsafePathError + if !errors.As(err, &ue) { + t.Fatalf("NUL in NameInArchive must be rejected by WriteArchive, got %T %v", err, err) + } +} + +func TestArchive_ExtractRejectsNonRegular(t *testing.T) { + var buf bytes.Buffer + zw, _ := zstd.NewWriter(&buf) + tw := tar.NewWriter(zw) + _ = tw.WriteHeader(&tar.Header{Name: "link", Linkname: "/etc/passwd", Typeflag: tar.TypeSymlink}) + _ = tw.Close() + _ = zw.Close() + err := cas.ExtractArchive(&buf, t.TempDir()) + if err == nil { + t.Fatal("symlink entry must be rejected") + } +} + +func TestArchive_WriteRejectsBadName(t *testing.T) { + err := cas.WriteArchive(io.Discard, []cas.ArchiveEntry{{NameInArchive: "../escape", LocalPath: "/etc/hostname"}}) + var ue *cas.UnsafePathError + if !errors.As(err, &ue) { + t.Fatal("WriteArchive must reject bad NameInArchive") + } +} + +func TestArchive_WriteRejectsDuplicateNames(t *testing.T) { + src, _, _ := makeTestPart(t) + err := cas.WriteArchive(io.Discard, []cas.ArchiveEntry{ + {NameInArchive: "x", LocalPath: filepath.Join(src, "all_1_1_0", "columns.txt")}, + {NameInArchive: "x", LocalPath: filepath.Join(src, "all_1_1_0", "checksums.txt")}, + }) + if err == nil { + t.Fatal("duplicate names must be rejected") + } +} diff --git a/pkg/cas/backend.go b/pkg/cas/backend.go new file mode 100644 index 00000000..14be9f07 --- /dev/null +++ b/pkg/cas/backend.go @@ -0,0 +1,33 @@ +package cas + +import ( + "context" + "io" + "time" +) + +// Backend is the narrow subset of remote-storage operations CAS uses. +// Defining a small interface lets tests substitute an in-memory fake and keeps +// CAS decoupled from the full storage.BackupDestination surface. +// +// All keys are full object keys (the cluster prefix is already part of them). +type Backend interface { + PutFile(ctx context.Context, key string, data io.ReadCloser, size int64) error + GetFile(ctx context.Context, key string) (io.ReadCloser, error) + StatFile(ctx context.Context, key string) (size int64, modTime time.Time, exists bool, err error) + DeleteFile(ctx context.Context, key string) error + Walk(ctx context.Context, prefix string, recursive bool, fn func(RemoteFile) error) error + + // PutFileIfAbsent atomically writes data at key only if no object + // exists. Returns (true, nil) on successful create; (false, nil) + // if the key is already present; (false, ErrConditionalPutNotSupported) + // when the underlying backend can't do atomic create. + PutFileIfAbsent(ctx context.Context, key string, data io.ReadCloser, size int64) (created bool, err error) +} + +// RemoteFile is a snapshot of an object's metadata returned by Walk callbacks. +type RemoteFile struct { + Key string + Size int64 + ModTime time.Time +} diff --git a/pkg/cas/blobpath.go b/pkg/cas/blobpath.go new file mode 100644 index 00000000..c73b4220 --- /dev/null +++ b/pkg/cas/blobpath.go @@ -0,0 +1,36 @@ +package cas + +import ( + "encoding/binary" + "encoding/hex" + + "github.com/Altinity/clickhouse-backup/v2/pkg/checksumstxt" +) + +// Hash128 is an alias for the parser's hash type so CAS callers don't need +// two imports. +type Hash128 = checksumstxt.Hash128 + +// hashHex returns the 32-char lowercase hex representation. Byte order: the +// 16 bytes are emitted as Low (8 bytes little-endian) followed by High (8 +// bytes little-endian). This convention is CAS-internal (write and read both +// use this function); it does not need to match any other system's hex +// representation. +func hashHex(h Hash128) string { + var b [16]byte + binary.LittleEndian.PutUint64(b[0:8], h.Low) + binary.LittleEndian.PutUint64(b[8:16], h.High) + return hex.EncodeToString(b[:]) +} + +// ShardPrefix returns the 2-char shard segment of the blob path. +func ShardPrefix(h Hash128) string { + return hashHex(h)[:2] +} + +// BlobPath returns the full object key for a blob. clusterPrefix MUST already +// end with "/" (it is the value of cas.Config.ClusterPrefix()). +func BlobPath(clusterPrefix string, h Hash128) string { + s := hashHex(h) + return clusterPrefix + "blob/" + s[:2] + "/" + s[2:] +} diff --git a/pkg/cas/blobpath_test.go b/pkg/cas/blobpath_test.go new file mode 100644 index 00000000..ac636d07 --- /dev/null +++ b/pkg/cas/blobpath_test.go @@ -0,0 +1,45 @@ +package cas + +import ( + "strings" + "testing" +) + +func TestHashHex_KnownValue(t *testing.T) { + h := Hash128{Low: 0x1122334455667788, High: 0x99aabbccddeeff00} + // Low LE = 88 77 66 55 44 33 22 11 + // High LE = 00 ff ee dd cc bb aa 99 + want := "8877665544332211" + "00ffeeddccbbaa99" + if got := hashHex(h); got != want { + t.Fatalf("hashHex: got %q want %q", got, want) + } +} + +func TestShardPrefix(t *testing.T) { + h := Hash128{Low: 0x1122334455667788, High: 0} + if got := ShardPrefix(h); got != "88" { + t.Fatalf("ShardPrefix: got %q want \"88\"", got) + } +} + +func TestBlobPath_Format(t *testing.T) { + h := Hash128{Low: 0x1122334455667788, High: 0x99aabbccddeeff00} + want := "cas/c1/blob/88/77665544332211" + "00ffeeddccbbaa99" + got := BlobPath("cas/c1/", h) + if got != want { + t.Fatalf("BlobPath: got %q want %q", got, want) + } + // Sanity: hex portion is exactly 30 chars after the shard. + rest := strings.TrimPrefix(got, "cas/c1/blob/88/") + if len(rest) != 30 { + t.Fatalf("rest len: got %d want 30", len(rest)) + } +} + +func TestBlobPath_DistinctHashesProduceDistinctPaths(t *testing.T) { + a := Hash128{Low: 1, High: 0} + b := Hash128{Low: 2, High: 0} + if BlobPath("cas/c/", a) == BlobPath("cas/c/", b) { + t.Fatal("distinct hashes produced same path") + } +} diff --git a/pkg/cas/casstorage/backend_storage.go b/pkg/cas/casstorage/backend_storage.go new file mode 100644 index 00000000..549a66b8 --- /dev/null +++ b/pkg/cas/casstorage/backend_storage.go @@ -0,0 +1,87 @@ +// Package casstorage wires the CAS Backend interface to pkg/storage.BackupDestination. +// It lives in a sub-package so that pkg/cas itself does not import pkg/storage, +// which would create an import cycle via pkg/storage → pkg/config → pkg/cas. +package casstorage + +import ( + "context" + "errors" + "io" + "strings" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/storage" +) + +// NewStorageBackend adapts a *storage.BackupDestination to the CAS Backend interface. +func NewStorageBackend(bd *storage.BackupDestination) cas.Backend { return &storageBackend{bd: bd} } + +type storageBackend struct{ bd *storage.BackupDestination } + +func (s *storageBackend) PutFile(ctx context.Context, key string, data io.ReadCloser, size int64) error { + return s.bd.PutFile(ctx, key, data, size) +} + +func (s *storageBackend) GetFile(ctx context.Context, key string) (io.ReadCloser, error) { + return s.bd.GetFileReader(ctx, key) +} + +func (s *storageBackend) StatFile(ctx context.Context, key string) (int64, time.Time, bool, error) { + rf, err := s.bd.StatFile(ctx, key) + if err != nil { + if isNotFound(err) { + return 0, time.Time{}, false, nil + } + return 0, time.Time{}, false, err + } + return rf.Size(), rf.LastModified(), true, nil +} + +func (s *storageBackend) DeleteFile(ctx context.Context, key string) error { + return s.bd.DeleteFile(ctx, key) +} + +func (s *storageBackend) PutFileIfAbsent(ctx context.Context, key string, data io.ReadCloser, size int64) (bool, error) { + // PutFileIfAbsent (not PutFileAbsoluteIfAbsent) so that the backend adds + // its configured path prefix — the same prefix that PutFile, StatFile, + // DeleteFile and GetFile all prepend. Without this, markers land at a + // different key than StatFile/DeleteFile look for. + created, err := s.bd.PutFileIfAbsent(ctx, key, data, size) + if errors.Is(err, storage.ErrConditionalPutNotSupported) { + return false, cas.ErrConditionalPutNotSupported + } + return created, err +} + +func (s *storageBackend) Walk(ctx context.Context, prefix string, recursive bool, fn func(cas.RemoteFile) error) error { + // pkg/storage backends (S3 in particular, see s3.go S3.Walk) strip the + // walk-target prefix from rf.Name() — so callers see keys relative to + // the walk root. CAS code (cas-status, cold-list, list-remote) + // assumes ABSOLUTE keys (i.e. the same keys it constructed via + // MetadataJSONPath / BlobPath / etc.), so we reconstruct here by + // stripping any leading '/' (path.Join artifact in S3.Walk) and + // re-prepending the requested prefix. + return s.bd.Walk(ctx, strings.TrimSuffix(prefix, "/")+"/", recursive, func(_ context.Context, rf storage.RemoteFile) error { + abs := reconstructAbsoluteKey(prefix, rf.Name()) + return fn(cas.RemoteFile{Key: abs, Size: rf.Size(), ModTime: rf.LastModified()}) + }) +} + +// reconstructAbsoluteKey rebuilds the absolute object key from the prefix +// passed to Walk and the (possibly relative) name returned by the underlying +// pkg/storage backend (which may strip the prefix and may prepend a leading "/"). +func reconstructAbsoluteKey(prefix, relName string) string { + return strings.TrimSuffix(prefix, "/") + "/" + strings.TrimPrefix(relName, "/") +} + +// isNotFound returns true if err indicates the object doesn't exist. +// All storage backends in pkg/storage/ (s3, azblob, gcs, sftp, ftp, cos) wrap +// their provider-specific not-found errors and return storage.ErrNotFound, which +// is the canonical sentinel: errors.New("key not found") in pkg/storage/structs.go. +func isNotFound(err error) bool { + return errors.Is(err, storage.ErrNotFound) +} + +// compile-time assertion +var _ cas.Backend = (*storageBackend)(nil) diff --git a/pkg/cas/casstorage/backend_storage_test.go b/pkg/cas/casstorage/backend_storage_test.go new file mode 100644 index 00000000..319360d8 --- /dev/null +++ b/pkg/cas/casstorage/backend_storage_test.go @@ -0,0 +1,22 @@ +package casstorage + +import "testing" + +func TestReconstructAbsoluteKey(t *testing.T) { + cases := []struct { + name, prefix, rel, want string + }{ + {"plain", "cas/c1/blob/", "aa/abc", "cas/c1/blob/aa/abc"}, + {"leading slash on rel stripped", "cas/c1/blob/", "/aa/abc", "cas/c1/blob/aa/abc"}, + {"prefix without trailing slash idempotent", "cas/c1/blob", "aa/abc", "cas/c1/blob/aa/abc"}, + {"deep prefix", "backup/cluster/0/cas/", "metadata/foo/bar.json", "backup/cluster/0/cas/metadata/foo/bar.json"}, + {"empty rel handled", "cas/c1/blob/", "", "cas/c1/blob/"}, + } + for _, c := range cases { + t.Run(c.name, func(t *testing.T) { + if got := reconstructAbsoluteKey(c.prefix, c.rel); got != c.want { + t.Errorf("got %q, want %q", got, c.want) + } + }) + } +} diff --git a/pkg/cas/coldlist.go b/pkg/cas/coldlist.go new file mode 100644 index 00000000..5c73047a --- /dev/null +++ b/pkg/cas/coldlist.go @@ -0,0 +1,106 @@ +package cas + +import ( + "context" + "encoding/binary" + "encoding/hex" + "fmt" + "strings" + "sync" +) + +// ExistenceSet records which blob hashes already exist in the remote store. +// Backed by a map; safe for concurrent reads after the cold-list completes. +// During construction, only ColdList writes to it. +type ExistenceSet struct { + set map[Hash128]struct{} +} + +// Has reports whether h is present. +func (e *ExistenceSet) Has(h Hash128) bool { + if e == nil { + return false + } + _, ok := e.set[h] + return ok +} + +// Len returns the number of hashes in the set. +func (e *ExistenceSet) Len() int { + if e == nil { + return 0 + } + return len(e.set) +} + +// ColdList walks every cas//blob// prefix in parallel and builds +// an existence set. parallelism caps simultaneous Walks; <=0 falls back to 16. +// +// Keys whose hash segment doesn't decode to a valid 128-bit hex string are +// silently skipped (they can't be CAS blobs; could be debris from older +// experiments or unrelated files). Each skip is logged at debug level. +func ColdList(ctx context.Context, b Backend, clusterPrefix string, parallelism int) (*ExistenceSet, error) { + if parallelism <= 0 { + parallelism = 16 + } + + type shardOut struct { + hashes []Hash128 + err error + } + out := make([]shardOut, 256) + + sem := make(chan struct{}, parallelism) + var wg sync.WaitGroup + for i := 0; i < 256; i++ { + wg.Add(1) + go func(i int) { + defer wg.Done() + sem <- struct{}{} + defer func() { <-sem }() + shardPrefix := fmt.Sprintf("%sblob/%02x/", clusterPrefix, i) + var hashes []Hash128 + err := b.Walk(ctx, shardPrefix, true, func(rf RemoteFile) error { + rest := strings.TrimPrefix(rf.Key, shardPrefix) + h, ok := decodeBlobHash(byte(i), rest) + if !ok { + return nil + } + hashes = append(hashes, h) + return nil + }) + out[i] = shardOut{hashes: hashes, err: err} + }(i) + } + wg.Wait() + + set := &ExistenceSet{set: make(map[Hash128]struct{})} + for i := 0; i < 256; i++ { + if out[i].err != nil { + return nil, fmt.Errorf("cas: cold-list shard %02x: %w", i, out[i].err) + } + for _, h := range out[i].hashes { + set.set[h] = struct{}{} + } + } + return set, nil +} + +// decodeBlobHash parses a key suffix like "77665544332211" + "00ffeeddccbbaa99" +// (30 hex chars, the rest of a 32-char hashHex after the 2-char shard) and +// returns the corresponding Hash128. The shard byte is reattached at position +// 0; this is the inverse of hashHex. +func decodeBlobHash(shard byte, rest string) (Hash128, bool) { + if len(rest) != 30 { + return Hash128{}, false + } + var b [16]byte + b[0] = shard + if _, err := hex.Decode(b[1:], []byte(rest)); err != nil { + return Hash128{}, false + } + return Hash128{ + Low: binary.LittleEndian.Uint64(b[0:8]), + High: binary.LittleEndian.Uint64(b[8:16]), + }, true +} diff --git a/pkg/cas/coldlist_test.go b/pkg/cas/coldlist_test.go new file mode 100644 index 00000000..ec42e55e --- /dev/null +++ b/pkg/cas/coldlist_test.go @@ -0,0 +1,105 @@ +package cas_test + +import ( + "context" + "io" + "strings" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" +) + +// putBlob is a test helper to populate the fake with a key in the format +// cas//blob//. +func putBlob(t *testing.T, f *fakedst.Fake, clusterPrefix string, h cas.Hash128) { + t.Helper() + if err := f.PutFile(context.Background(), cas.BlobPath(clusterPrefix, h), + io.NopCloser(strings.NewReader("x")), 1); err != nil { + t.Fatal(err) + } +} + +func TestColdList_FindsAllBlobs(t *testing.T) { + f := fakedst.New() + cp := "cas/c1/" + hs := []cas.Hash128{ + {Low: 0x1122334455667788, High: 0x99aabbccddeeff00}, + {Low: 0xaaaaaaaaaaaaaaaa, High: 0xbbbbbbbbbbbbbbbb}, + {Low: 0, High: 1}, + } + for _, h := range hs { + putBlob(t, f, cp, h) + } + set, err := cas.ColdList(context.Background(), f, cp, 16) + if err != nil { + t.Fatal(err) + } + if set.Len() != len(hs) { + t.Errorf("Len: got %d want %d", set.Len(), len(hs)) + } + for _, h := range hs { + if !set.Has(h) { + t.Errorf("missing %+v", h) + } + } +} + +func TestColdList_IgnoresUnrelatedKeys(t *testing.T) { + f := fakedst.New() + cp := "cas/c1/" + ctx := context.Background() + h := cas.Hash128{Low: 1, High: 2} + putBlob(t, f, cp, h) + // unrelated debris in the same shard prefix: + _ = f.PutFile(ctx, cp+"blob/00/short", io.NopCloser(strings.NewReader("x")), 1) // wrong length + _ = f.PutFile(ctx, cp+"blob/00/zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", io.NopCloser(strings.NewReader("x")), 1) // not hex + _ = f.PutFile(ctx, cp+"blob/00/zzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", io.NopCloser(strings.NewReader("x")), 1) // not hex, 30 chars + // unrelated outside blob/: + _ = f.PutFile(ctx, cp+"metadata/x", io.NopCloser(strings.NewReader("x")), 1) + set, err := cas.ColdList(ctx, f, cp, 16) + if err != nil { + t.Fatal(err) + } + if !set.Has(h) { + t.Error("missed real blob") + } + if set.Len() != 1 { + t.Errorf("expected 1 valid blob, got %d", set.Len()) + } +} + +func TestColdList_RoundTripWithBlobPath(t *testing.T) { + // Property: ColdList recovers exactly the hashes that BlobPath was used + // to write. This is the load-bearing invariant — if hashHex/decodeBlobHash + // ever drift, dedup silently breaks. + f := fakedst.New() + cp := "cas/c1/" + ctx := context.Background() + var want []cas.Hash128 + for i := 0; i < 32; i++ { + h := cas.Hash128{Low: uint64(i) * 0x0101010101010101, High: uint64(i)<<32 | uint64(i)} + putBlob(t, f, cp, h) + want = append(want, h) + } + set, _ := cas.ColdList(ctx, f, cp, 16) + if set.Len() != len(want) { + t.Fatalf("Len: got %d want %d", set.Len(), len(want)) + } + for _, h := range want { + if !set.Has(h) { + t.Errorf("missing %v", h) + } + } +} + +func TestColdList_EmptyBucket(t *testing.T) { + f := fakedst.New() + set, err := cas.ColdList(context.Background(), f, "cas/c1/", 16) + if err != nil { + t.Fatal(err) + } + if set.Len() != 0 { + t.Error("empty bucket should produce empty set") + } +} diff --git a/pkg/cas/config.go b/pkg/cas/config.go new file mode 100644 index 00000000..9cbff1fb --- /dev/null +++ b/pkg/cas/config.go @@ -0,0 +1,199 @@ +package cas + +import ( + "errors" + "fmt" + "strings" + "time" +) + +// Config holds CAS-specific configuration. Embedded in pkg/config.Config under +// the `cas` key. See docs/cas-design.md §6.11. +// +// GraceBlob and AbandonThreshold are typed string (not time.Duration) because +// gopkg.in/yaml.v3 deserializes time.Duration as raw nanoseconds, not as +// human-readable durations like "24h". Operators expect to write +// `grace_blob: "24h"` in config.yml. Validate() parses these strings via +// time.ParseDuration and stores the result in unexported fields; runtime +// callers MUST use GraceBlobDuration() / AbandonThresholdDuration() instead +// of reading the string fields directly. +type Config struct { + Enabled bool `yaml:"enabled" envconfig:"CAS_ENABLED"` + ClusterID string `yaml:"cluster_id" envconfig:"CAS_CLUSTER_ID"` + RootPrefix string `yaml:"root_prefix" envconfig:"CAS_ROOT_PREFIX"` + InlineThreshold uint64 `yaml:"inline_threshold" envconfig:"CAS_INLINE_THRESHOLD"` + GraceBlob string `yaml:"grace_blob" envconfig:"CAS_GRACE_BLOB"` + AbandonThreshold string `yaml:"abandon_threshold" envconfig:"CAS_ABANDON_THRESHOLD"` + WaitForPrune string `yaml:"wait_for_prune" envconfig:"CAS_WAIT_FOR_PRUNE"` + // AllowUnsafeMarkers, when true, lets backends without native atomic-create + // (currently only FTP) write CAS markers using a stat-then-rename fallback + // with a documented race window. Default false; CAS refuses marker writes + // on those backends unless the operator explicitly opts in. + AllowUnsafeMarkers bool `yaml:"allow_unsafe_markers" envconfig:"CAS_ALLOW_UNSAFE_MARKERS"` + + // SkipConditionalPutProbe, when true, disables the startup probe that + // verifies the backend correctly honors If-None-Match: * (i.e. refuses to + // overwrite an existing object via PutFileIfAbsent). The probe detects older + // MinIO (<2024-11), older Ceph RGW, and other buggy S3-compatible stores + // that silently ignore the precondition, defeating marker locks and risking + // data loss. Set to true ONLY if you knowingly run on a non-conforming + // backend and accept the risk. + SkipConditionalPutProbe bool `yaml:"skip_conditional_put_probe" envconfig:"CAS_SKIP_CONDITIONAL_PUT_PROBE"` + + // AllowUnsafeObjectDiskSkip, when true, allows cas-upload to continue even + // when the object-disk pre-flight cannot query system.disks (e.g. transient + // ClickHouse unavailability) or cannot inspect table metadata JSON. By + // default (false) any failure in the object-disk detection pipeline is a + // hard error, ensuring CAS never silently ingests a backup that may contain + // unrestorable object-disk-backed tables. Set to true ONLY if you cannot + // query system.disks at upload time and consciously accept that the + // resulting CAS backup may include object-disk tables that cannot be + // restored. + AllowUnsafeObjectDiskSkip bool `yaml:"allow_unsafe_object_disk_skip" envconfig:"CAS_ALLOW_UNSAFE_OBJECT_DISK_SKIP"` + + // Parsed by Validate(). Zero until Validate() runs. + graceBlobDur time.Duration + abandonThresholdDur time.Duration + waitForPruneDur time.Duration +} + +// GraceBlobDuration returns the parsed grace_blob value. Returns 0 if +// Validate() has not been called. +func (c Config) GraceBlobDuration() time.Duration { return c.graceBlobDur } + +// AbandonThresholdDuration returns the parsed abandon_threshold value. +// Returns 0 if Validate() has not been called. +func (c Config) AbandonThresholdDuration() time.Duration { return c.abandonThresholdDur } + +// WaitForPruneDuration returns the parsed wait_for_prune value. Returns 0 if +// Validate() has not been called or wait_for_prune was not set. +func (c Config) WaitForPruneDuration() time.Duration { return c.waitForPruneDur } + +// DefaultConfig returns the safe defaults. Enabled is false by default; CAS +// is opt-in. ClusterID has no default — operators MUST set it explicitly when +// enabling CAS. +func DefaultConfig() Config { + return Config{ + Enabled: false, + ClusterID: "", + RootPrefix: "cas/", + InlineThreshold: 262144, // 256 KiB + GraceBlob: "24h", + AbandonThreshold: "168h", // 7 days + } +} + +// SkipPrefixes returns the prefixes that v1 list/retention must ignore. The +// returned prefixes always end with "/" so a simple HasPrefix check on a +// remote key correctly distinguishes "cas/" from a hypothetical sibling like +// "case-archive/". +// +// v1 callers pass this into BackupDestination.BackupList so the cas// +// subtree is not scanned (which would otherwise be reported as broken backup +// folders and might be deleted by retention or "clean remote_broken"). +// +// IMPORTANT: this returns the prefix exclusion regardless of c.Enabled. If +// CAS is disabled, the operator might be in a config rollback or downgrade +// scenario where existing CAS data lives in the bucket but cas-* commands +// are off. Returning nil here would let v1 retention silently delete that +// data the next time RemoveOldBackupsRemote runs. The protection follows +// from the existence of the namespace, not from the feature being enabled. +// Returns nil only when RootPrefix is empty (no namespace to protect). +func (c Config) SkipPrefixes() []string { + rp := c.RootPrefix + if rp != "" && !strings.HasSuffix(rp, "/") { + rp += "/" + } + if rp == "" { + return nil + } + return []string{rp} +} + +// ClusterPrefix returns the per-cluster prefix used for every CAS object key. +// Always ends with "/". Form: "/", e.g. "cas/prod-1/". +// +// Callers must only use this when c.Enabled and c.Validate() has succeeded; +// otherwise the result may not satisfy the implicit "ends with /" contract +// callers depend on. +func (c Config) ClusterPrefix() string { + rp := c.RootPrefix + if rp != "" && !strings.HasSuffix(rp, "/") { + rp += "/" + } + return rp + c.ClusterID + "/" +} + +// Validate returns nil if disabled. When enabled, enforces: +// - ClusterID is non-empty and contains no whitespace or path separators. +// - InlineThreshold is in (0, MaxInline]. +// - GraceBlob and AbandonThreshold parse via time.ParseDuration and are +// strictly positive. Parsed values are stored on the receiver; callers +// access them via GraceBlobDuration() and AbandonThresholdDuration(). +// +// Pointer receiver: parsed durations need to persist on the embedded +// pkg/config.Config.CAS field after pkg/config.ValidateConfig calls +// cfg.CAS.Validate(). +func (c *Config) Validate() error { + if !c.Enabled { + return nil + } + if c.ClusterID == "" { + return errors.New("cas.cluster_id is required when cas.enabled=true") + } + if strings.ContainsAny(c.ClusterID, "/\\ \t\n") { + return fmt.Errorf("cas.cluster_id %q must not contain whitespace or path separators", c.ClusterID) + } + if strings.Contains(c.ClusterID, "..") { + return fmt.Errorf("cas.cluster_id %q must not contain %q (path traversal)", c.ClusterID, "..") + } + if c.RootPrefix == "" { + return errors.New("cas.root_prefix must not be empty when cas.enabled=true") + } + if strings.Contains(c.RootPrefix, "..") || strings.HasPrefix(c.RootPrefix, "/") { + return fmt.Errorf("cas.root_prefix %q must not contain %q or start with %q", c.RootPrefix, "..", "/") + } + // Multi-segment root_prefix (e.g. "backups/cas/") would escape v1 list/ + // retention/clean-broken protection: BackupList walks the bucket root + // at depth 0 and emits single-segment entries like "backups", but + // SkipPrefixes returns "backups/cas/", so the equality/HasPrefix check + // in pkg/storage/general.go::BackupList misses the parent directory + // and v1 may treat the CAS parent as a broken v1 backup. v1 of CAS + // requires a single-segment root_prefix; for nested layouts, set the + // underlying BackupDestination path (s3.path / sftp.path / etc.) to + // the parent and keep cas.root_prefix as a single segment. + trimmed := strings.TrimSuffix(c.RootPrefix, "/") + if strings.Contains(trimmed, "/") { + return fmt.Errorf("cas.root_prefix %q must be a single path segment (e.g. \"cas/\"); for nested layouts, set the storage backend path (s3.path / sftp.path / etc.) and keep cas.root_prefix as one segment", c.RootPrefix) + } + if c.InlineThreshold == 0 || c.InlineThreshold > MaxInline { + return fmt.Errorf("cas.inline_threshold must be in (0, %d], got %d", MaxInline, c.InlineThreshold) + } + gb, err := time.ParseDuration(c.GraceBlob) + if err != nil { + return fmt.Errorf("cas.grace_blob %q: %w", c.GraceBlob, err) + } + if gb <= 0 { + return fmt.Errorf("cas.grace_blob must be > 0, got %v", gb) + } + at, err := time.ParseDuration(c.AbandonThreshold) + if err != nil { + return fmt.Errorf("cas.abandon_threshold %q: %w", c.AbandonThreshold, err) + } + if at <= 0 { + return fmt.Errorf("cas.abandon_threshold must be > 0, got %v", at) + } + c.graceBlobDur = gb + c.abandonThresholdDur = at + if c.WaitForPrune != "" { + wfp, err := time.ParseDuration(c.WaitForPrune) + if err != nil { + return fmt.Errorf("cas.wait_for_prune %q: %w", c.WaitForPrune, err) + } + if wfp < 0 { + return fmt.Errorf("cas.wait_for_prune must be >= 0, got %v", wfp) + } + c.waitForPruneDur = wfp + } + return nil +} diff --git a/pkg/cas/config_test.go b/pkg/cas/config_test.go new file mode 100644 index 00000000..01512306 --- /dev/null +++ b/pkg/cas/config_test.go @@ -0,0 +1,274 @@ +package cas + +import ( + "strings" + "testing" + "time" + + "gopkg.in/yaml.v3" +) + +func TestDefaultConfig(t *testing.T) { + c := DefaultConfig() + if c.Enabled { + t.Error("default Enabled should be false") + } + if c.RootPrefix != "cas/" { + t.Errorf("RootPrefix: got %q", c.RootPrefix) + } + if c.InlineThreshold != 262144 { + t.Errorf("InlineThreshold: got %d", c.InlineThreshold) + } + if c.GraceBlob != "24h" { + t.Errorf("GraceBlob: got %q want \"24h\"", c.GraceBlob) + } + if c.AbandonThreshold != "168h" { + t.Errorf("AbandonThreshold: got %q want \"168h\"", c.AbandonThreshold) + } + if c.SkipConditionalPutProbe { + t.Error("default SkipConditionalPutProbe should be false") + } + if err := c.Validate(); err != nil { + t.Errorf("disabled default must validate: %v", err) + } +} + +func TestValidate_PopulatesParsedDurations(t *testing.T) { + c := validEnabled() + if err := c.Validate(); err != nil { + t.Fatal(err) + } + if c.GraceBlobDuration() != 24*time.Hour { + t.Errorf("GraceBlobDuration: got %v want 24h", c.GraceBlobDuration()) + } + if c.AbandonThresholdDuration() != 7*24*time.Hour { + t.Errorf("AbandonThresholdDuration: got %v want 168h", c.AbandonThresholdDuration()) + } +} + +func TestValidate_RejectsUnparseableDuration(t *testing.T) { + c := validEnabled() + c.GraceBlob = "not-a-duration" + if err := c.Validate(); err == nil || !strings.Contains(err.Error(), "grace_blob") { + t.Fatalf("want grace_blob parse error, got %v", err) + } + c = validEnabled() + c.AbandonThreshold = "8 days" // ParseDuration doesn't accept this + if err := c.Validate(); err == nil || !strings.Contains(err.Error(), "abandon_threshold") { + t.Fatalf("want abandon_threshold parse error, got %v", err) + } +} + +func validEnabled() Config { + c := DefaultConfig() + c.Enabled = true + c.ClusterID = "prod-1" + return c +} + +func TestValidate_HappyPath(t *testing.T) { + c := validEnabled() + if err := c.Validate(); err != nil { + t.Fatal(err) + } +} + +func TestValidate_RejectsEmptyClusterID(t *testing.T) { + c := validEnabled() + c.ClusterID = "" + if err := c.Validate(); err == nil || !strings.Contains(err.Error(), "cluster_id") { + t.Fatalf("want cluster_id error, got %v", err) + } +} + +func TestValidate_RejectsBadRootPrefix(t *testing.T) { + for _, bad := range []string{"", "cas/../escape/", "/abs/path/", "..", "/cas/", + // Multi-segment root_prefix would escape v1 list/retention/clean-broken + // protection (the depth-0 BackupList walk emits single-segment names). + "backups/cas/", "a/b/c/", "deep/cas", + } { + c := validEnabled() + c.RootPrefix = bad + if err := c.Validate(); err == nil { + t.Errorf("expected error for RootPrefix=%q", bad) + } + } +} + +func TestValidate_RejectsBadClusterID(t *testing.T) { + for _, bad := range []string{"a/b", "a b", "a\tb", "a\\b", "a\nb", "..", "../escape", "a..b"} { + c := validEnabled() + c.ClusterID = bad + if err := c.Validate(); err == nil { + t.Errorf("expected error for %q", bad) + } + } +} + +func TestValidate_RejectsBadInlineThreshold(t *testing.T) { + c := validEnabled() + c.InlineThreshold = 0 + if err := c.Validate(); err == nil { + t.Error("zero must fail") + } + c.InlineThreshold = MaxInline + 1 + if err := c.Validate(); err == nil { + t.Error("> MaxInline must fail") + } +} + +func TestValidate_RejectsBadDurations(t *testing.T) { + c := validEnabled() + c.GraceBlob = "0s" + if err := c.Validate(); err == nil { + t.Error("zero grace must fail") + } + c = validEnabled() + c.AbandonThreshold = "0s" + if err := c.Validate(); err == nil { + t.Error("zero abandon must fail") + } + c = validEnabled() + c.GraceBlob = "-1h" + if err := c.Validate(); err == nil { + t.Error("negative grace must fail") + } +} + +// TestCASConfig_DurationYAML pins the requirement that yaml.v3 can parse +// human-readable strings like "24h" into the duration fields. With the +// previous time.Duration type, yaml deserialized as raw nanoseconds and +// any operator following the documented "grace_blob: 24h" syntax would +// silently get the wrong value (or a parse error). +func TestCASConfig_DurationYAML(t *testing.T) { + type Outer struct { + CAS Config `yaml:"cas"` + } + src := []byte(` +cas: + enabled: true + cluster_id: test + root_prefix: cas/ + inline_threshold: 524288 + grace_blob: "12h" + abandon_threshold: "72h" +`) + var got Outer + if err := yaml.Unmarshal(src, &got); err != nil { + t.Fatalf("yaml.Unmarshal: %v", err) + } + if got.CAS.GraceBlob != "12h" { + t.Errorf("GraceBlob: got %q want \"12h\"", got.CAS.GraceBlob) + } + if got.CAS.AbandonThreshold != "72h" { + t.Errorf("AbandonThreshold: got %q want \"72h\"", got.CAS.AbandonThreshold) + } + if err := got.CAS.Validate(); err != nil { + t.Fatalf("Validate after yaml unmarshal: %v", err) + } + if got.CAS.GraceBlobDuration() != 12*time.Hour { + t.Errorf("parsed grace: got %v want 12h", got.CAS.GraceBlobDuration()) + } + if got.CAS.AbandonThresholdDuration() != 72*time.Hour { + t.Errorf("parsed abandon: got %v want 72h", got.CAS.AbandonThresholdDuration()) + } +} + +func TestClusterPrefix(t *testing.T) { + c := validEnabled() + if got := c.ClusterPrefix(); got != "cas/prod-1/" { + t.Errorf("got %q want %q", got, "cas/prod-1/") + } + c.RootPrefix = "cas" // missing trailing slash + if got := c.ClusterPrefix(); got != "cas/prod-1/" { + t.Errorf("normalized: got %q want %q", got, "cas/prod-1/") + } +} + +// TestSkipPrefixes_DisabledStillProtects encodes the requirement that +// v1 retention/list operations must continue to skip the CAS namespace +// even when cas.enabled=false. Otherwise a config rollback or downgrade +// would silently expose existing CAS data to v1 deletion. +func TestSkipPrefixes_DisabledStillProtects(t *testing.T) { + c := DefaultConfig() + c.Enabled = false + c.RootPrefix = "cas/" + got := c.SkipPrefixes() + if len(got) != 1 || got[0] != "cas/" { + t.Errorf("disabled SkipPrefixes: got %v want [cas/]", got) + } +} + +func TestSkipPrefixes_NormalizesTrailingSlash(t *testing.T) { + c := DefaultConfig() + c.RootPrefix = "cas" // no trailing slash + got := c.SkipPrefixes() + if len(got) != 1 || got[0] != "cas/" { + t.Errorf("got %v want [cas/]", got) + } +} + +func TestSkipPrefixes_EmptyRootPrefixReturnsNil(t *testing.T) { + c := DefaultConfig() + c.RootPrefix = "" + if got := c.SkipPrefixes(); got != nil { + t.Errorf("empty RootPrefix should return nil, got %v", got) + } +} + +func TestCASConfig_WaitForPruneParses(t *testing.T) { + c := validEnabled() + c.WaitForPrune = "5m" + if err := c.Validate(); err != nil { + t.Fatalf("Validate: %v", err) + } + if got := c.WaitForPruneDuration(); got != 5*time.Minute { + t.Errorf("WaitForPruneDuration: got %v want 5m", got) + } +} + +func TestCASConfig_WaitForPruneDefaultsZero(t *testing.T) { + c := validEnabled() + // WaitForPrune is intentionally absent / empty string + if err := c.Validate(); err != nil { + t.Fatalf("Validate: %v", err) + } + if got := c.WaitForPruneDuration(); got != 0 { + t.Errorf("WaitForPruneDuration: got %v want 0", got) + } +} + +func TestCASConfig_WaitForPruneRejectsBadDuration(t *testing.T) { + c := validEnabled() + c.WaitForPrune = "banana" + err := c.Validate() + if err == nil { + t.Fatal("expected error for bad duration, got nil") + } + if !strings.Contains(err.Error(), "wait_for_prune") { + t.Errorf("error should mention wait_for_prune, got: %v", err) + } +} + +// TestConfig_DurationsZeroWithoutValidate locks the contract that +// GraceBlobDuration and AbandonThresholdDuration return 0 when Validate has +// not been called. Delete/Download/Verify/Status guard on cfg.Validate() at +// entry precisely because callers who skip Validate would silently get zero +// durations. +func TestConfig_DurationsZeroWithoutValidate(t *testing.T) { + cfg := Config{ + Enabled: true, + ClusterID: "c", + RootPrefix: "cas/", + InlineThreshold: 100, + GraceBlob: "24h", + AbandonThreshold: "168h", + } + // NO call to Validate() — durations must be zero. + if d := cfg.GraceBlobDuration(); d != 0 { + t.Errorf("GraceBlobDuration without Validate: got %s, want 0", d) + } + if d := cfg.AbandonThresholdDuration(); d != 0 { + t.Errorf("AbandonThresholdDuration without Validate: got %s, want 0", d) + } +} diff --git a/pkg/cas/delete.go b/pkg/cas/delete.go new file mode 100644 index 00000000..a76f0d5c --- /dev/null +++ b/pkg/cas/delete.go @@ -0,0 +1,160 @@ +package cas + +import ( + "context" + "errors" + "fmt" + "time" + + "github.com/rs/zerolog/log" +) + +// DeleteOptions configures a Delete run. +type DeleteOptions struct { + // WaitForPrune, when > 0, polls the prune marker for up to this duration + // before giving up at delete step 1. 0 = refuse immediately (default). + WaitForPrune time.Duration +} + +// Delete removes a CAS backup's metadata subtree. Blob reclamation is +// reserved for Phase 2 (cas-prune); in Phase 1, deleted-backup blobs +// remain in remote storage indefinitely. Per §6.6, metadata.json is +// deleted FIRST so the backup leaves the catalog atomically; even if +// the rest of the subtree removal is interrupted, the backup is no +// longer listable, and the orphan per-table JSONs/archives will be +// swept by the future prune (or via manual cleanup, until prune ships). +func Delete(ctx context.Context, b Backend, cfg Config, name string, opts DeleteOptions) error { + if err := cfg.Validate(); err != nil { + return fmt.Errorf("cas: delete: invalid config: %w", err) + } + if err := validateName(name); err != nil { + return err + } + cp := cfg.ClusterPrefix() + + // Step 1: refuse if prune in progress (with optional wait). + if err := waitForPrune(ctx, b, cp, opts.WaitForPrune); err != nil { + return err + } + + // Step 2: stale-aware inprogress check + _, _, mdOK, mdErr := b.StatFile(ctx, MetadataJSONPath(cp, name)) + if mdErr != nil { + return fmt.Errorf("cas-delete: stat metadata.json: %w", mdErr) + } + _, _, ipOK, ipErr := b.StatFile(ctx, InProgressMarkerPath(cp, name)) + if ipErr != nil { + return fmt.Errorf("cas-delete: stat inprogress marker: %w", ipErr) + } + + switch { + case ipOK && !mdOK: + return ErrUploadInProgress + case ipOK && mdOK: + // A marker exists alongside committed metadata.json. If the marker was + // written by another cas-delete (Tool=="cas-delete"), that delete is + // actively removing this backup — refuse to race it. Otherwise it is a + // stale upload marker (upload committed but failed to clean up); proceed + // with a warning. + existing, readErr := ReadInProgressMarker(ctx, b, cp, name) + if readErr != nil { + return fmt.Errorf("cas-delete: cannot read marker for %q: %w; refusing", name, readErr) + } + if existing.Tool == "cas-delete" { + return fmt.Errorf("cas-delete: another %s is in progress for %q on host=%s started=%s; wait for it to finish", + existing.Tool, name, existing.Host, existing.StartedAt) + } + log.Warn().Str("backup", name).Msg("cas-delete: stale inprogress marker present alongside committed metadata.json; proceeding") + // Register a defer to clean up the stale upload marker on any outcome + // (success or error). Best-effort: log but don't mask the primary error. + defer func() { + cleanCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second) + defer cancel() + if delErr := b.DeleteFile(cleanCtx, InProgressMarkerPath(cp, name)); delErr != nil { + log.Warn().Err(delErr).Str("backup", name).Msg("cas-delete: release stale upload marker") + } + }() + case !ipOK && !mdOK: + // If a v1 backup exists at the root with this name, surface the + // proper cross-mode refusal. Operators who type a v1 backup name + // into cas-delete get the helpful error. + if _, _, exists, err := b.StatFile(ctx, name+"/metadata.json"); err == nil && exists { + return ErrV1Backup + } + return fmt.Errorf("cas: backup %q not found", name) + } + // (the !ipOK && mdOK case is the normal path; fall through) + + // Step 3: Write a cas-delete inprogress marker BEFORE touching metadata.json. + // This closes the race window where a concurrent cas-upload on another host sees + // no metadata.json (we deleted it in step 4) and no marker, treats the name as + // free, and starts uploading — only to have its just-written archives swept by + // our walkAndDeleteSubtree. cas-upload's step-5 same-name check refuses when + // ANY inprogress marker exists (regardless of Tool), so this marker is sufficient + // to block it until we finish. + // + // When ipOK is true there is already a stale upload marker present; we skip + // writing our own (PutFileIfAbsent would return created=false anyway) and + // instead clean it up as we did before. + if !ipOK { + created, werr := WriteInProgressMarkerWithTool(ctx, b, cp, name, "", "cas-delete") + if werr != nil { + if errors.Is(werr, ErrConditionalPutNotSupported) { + return fmt.Errorf("cas-delete: backend cannot guarantee atomic markers; refusing") + } + return fmt.Errorf("cas-delete: write delete marker: %w", werr) + } + if !created { + // Another operation (upload or delete) raced us and wrote the marker first. + existing, readErr := ReadInProgressMarker(ctx, b, cp, name) + if readErr != nil { + return fmt.Errorf("cas-delete: another operation is in progress for %q (could not read marker: %v)", name, readErr) + } + return fmt.Errorf("cas-delete: another %s is in progress for %q on host=%s started=%s; wait for it to finish", + existing.Tool, name, existing.Host, existing.StartedAt) + } + defer func() { + cleanCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second) + defer cancel() + if delErr := b.DeleteFile(cleanCtx, InProgressMarkerPath(cp, name)); delErr != nil { + log.Warn().Err(delErr).Str("backup", name).Msg("cas-delete: release inprogress marker") + } + }() + } + + // Step 4: delete metadata.json FIRST so the backup leaves the catalog atomically. + if err := b.DeleteFile(ctx, MetadataJSONPath(cp, name)); err != nil { + return fmt.Errorf("cas-delete: delete metadata.json: %w", err) + } + + // Step 5: delete the rest of the subtree + if err := walkAndDeleteSubtree(ctx, b, MetadataDir(cp, name)); err != nil { + return fmt.Errorf("cas-delete: cleanup subtree: %w", err) + } + + // Step 6: stale upload inprogress marker (ipOK path) is released by the + // defer registered in the ipOK&&mdOK branch above. Our own delete marker + // (written in the !ipOK path) is released by the defer in that branch. + return nil +} + +// walkAndDeleteSubtree lists every object under prefix and deletes each. +// Returns the first error encountered; remaining objects are NOT deleted on +// error (caller decides whether to retry; metadata-orphans are reclaimed by +// the next prune anyway). +func walkAndDeleteSubtree(ctx context.Context, b Backend, prefix string) error { + var keys []string + err := b.Walk(ctx, prefix, true, func(rf RemoteFile) error { + keys = append(keys, rf.Key) + return nil + }) + if err != nil { + return err + } + for _, k := range keys { + if err := b.DeleteFile(ctx, k); err != nil { + return err + } + } + return nil +} diff --git a/pkg/cas/delete_test.go b/pkg/cas/delete_test.go new file mode 100644 index 00000000..fdf9292a --- /dev/null +++ b/pkg/cas/delete_test.go @@ -0,0 +1,314 @@ +package cas_test + +import ( + "bytes" + "context" + "errors" + "io" + "strings" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/testfixtures" +) + +func setupUploaded(t *testing.T) (*fakedst.Fake, cas.Config, string) { + t.Helper() + f := fakedst.New() + cfg := testCfg(100) + src := testfixtures.Build(t, []testfixtures.PartSpec{{ + Disk: "default", DB: "db", Table: "t", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{{Name: "columns.txt", Size: 8, HashLow: 1, HashHigh: 0}}, + }}) + if _, err := cas.Upload(context.Background(), f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}); err != nil { + t.Fatal(err) + } + return f, cfg, "bk" +} + +func TestDelete_HappyPath(t *testing.T) { + f, cfg, name := setupUploaded(t) + if err := cas.Delete(context.Background(), f, cfg, name, cas.DeleteOptions{}); err != nil { + t.Fatal(err) + } + // metadata.json gone: + if _, _, ok, _ := f.StatFile(context.Background(), cas.MetadataJSONPath(cfg.ClusterPrefix(), name)); ok { + t.Error("metadata.json must be deleted") + } + // No leftover files in metadata//: + var leftover int + _ = f.Walk(context.Background(), cas.MetadataDir(cfg.ClusterPrefix(), name), true, func(rf cas.RemoteFile) error { + leftover++ + return nil + }) + if leftover != 0 { + t.Errorf("leftover %d objects under metadata/%s/", leftover, name) + } +} + +func TestDelete_RefusesIfPruneInProgress(t *testing.T) { + f, cfg, name := setupUploaded(t) + _ = f.PutFile(context.Background(), cas.PruneMarkerPath(cfg.ClusterPrefix()), io.NopCloser(strings.NewReader("{}")), 2) + err := cas.Delete(context.Background(), f, cfg, name, cas.DeleteOptions{}) + if !errors.Is(err, cas.ErrPruneInProgress) { + t.Fatalf("got %v", err) + } +} + +func TestDelete_RefusesIfUploadInProgress(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + _ = f.PutFile(context.Background(), cas.InProgressMarkerPath(cfg.ClusterPrefix(), "bk"), io.NopCloser(strings.NewReader("{}")), 2) + // metadata.json absent → upload in flight + err := cas.Delete(context.Background(), f, cfg, "bk", cas.DeleteOptions{}) + if !errors.Is(err, cas.ErrUploadInProgress) { + t.Fatalf("got %v", err) + } +} + +func TestDelete_StaleMarkerProceeds(t *testing.T) { + f, cfg, name := setupUploaded(t) + // simulate: upload committed metadata.json but failed to delete its marker + _ = f.PutFile(context.Background(), cas.InProgressMarkerPath(cfg.ClusterPrefix(), name), io.NopCloser(strings.NewReader("{}")), 2) + if err := cas.Delete(context.Background(), f, cfg, name, cas.DeleteOptions{}); err != nil { + t.Fatal(err) + } + // marker also deleted now (best-effort cleanup) + if _, _, ok, _ := f.StatFile(context.Background(), cas.InProgressMarkerPath(cfg.ClusterPrefix(), name)); ok { + t.Error("stale marker should have been cleaned up") + } +} + +func TestDelete_BackupNotFound(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + err := cas.Delete(context.Background(), f, cfg, "nope", cas.DeleteOptions{}) + if err == nil || !strings.Contains(err.Error(), "not found") { + t.Fatalf("got %v", err) + } +} + +func TestDelete_OrderingMetadataFirst(t *testing.T) { + // Verify metadata.json is the FIRST DeleteFile call: wrap fakedst with + // a recording delegator, run Delete, confirm the first deleted key is + // the metadata.json path. + inner := fakedst.New() + cfg := testCfg(100) + src := testfixtures.Build(t, []testfixtures.PartSpec{ + {Disk: "default", DB: "db", Table: "t", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{{Name: "columns.txt", Size: 8, HashLow: 1, HashHigh: 0}}}, + }) + if _, err := cas.Upload(context.Background(), inner, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}); err != nil { + t.Fatal(err) + } + rec := &recordingBackend{Backend: inner} + if err := cas.Delete(context.Background(), rec, cfg, "bk", cas.DeleteOptions{}); err != nil { + t.Fatal(err) + } + if len(rec.deletes) == 0 { + t.Fatal("no deletes recorded") + } + want := cas.MetadataJSONPath(cfg.ClusterPrefix(), "bk") + if rec.deletes[0] != want { + t.Errorf("first delete: got %q want %q", rec.deletes[0], want) + } +} + +// TestDelete_WaitsForPruneMarker verifies that Delete waits for the prune +// marker to disappear (within WaitForPrune) rather than refusing immediately. +func TestDelete_WaitsForPruneMarker(t *testing.T) { + poll := 10 * time.Millisecond + cas.SetPollIntervalForTesting(&poll) + defer cas.SetPollIntervalForTesting(nil) + + f, cfg, name := setupUploaded(t) + cp := cfg.ClusterPrefix() + + // Pre-place prune marker; schedule deletion after 50ms. + if err := f.PutFile(context.Background(), cas.PruneMarkerPath(cp), + io.NopCloser(strings.NewReader("{}")), 2); err != nil { + t.Fatal(err) + } + go func() { + time.Sleep(50 * time.Millisecond) + _ = f.DeleteFile(context.Background(), cas.PruneMarkerPath(cp)) + }() + + if err := cas.Delete(context.Background(), f, cfg, name, cas.DeleteOptions{ + WaitForPrune: 5 * time.Second, + }); err != nil { + t.Fatalf("Delete should succeed once marker is cleared; got: %v", err) + } +} + +// TestDelete_RefusesAfterWaitTimeout verifies that Delete returns +// ErrPruneInProgress when WaitForPrune elapses and the marker remains. +func TestDelete_RefusesAfterWaitTimeout(t *testing.T) { + poll := 10 * time.Millisecond + cas.SetPollIntervalForTesting(&poll) + defer cas.SetPollIntervalForTesting(nil) + + f, cfg, name := setupUploaded(t) + cp := cfg.ClusterPrefix() + + // Pre-place prune marker permanently. + if err := f.PutFile(context.Background(), cas.PruneMarkerPath(cp), + io.NopCloser(strings.NewReader("{}")), 2); err != nil { + t.Fatal(err) + } + + err := cas.Delete(context.Background(), f, cfg, name, cas.DeleteOptions{ + WaitForPrune: 100 * time.Millisecond, + }) + if !errors.Is(err, cas.ErrPruneInProgress) { + t.Fatalf("got err=%v; want ErrPruneInProgress", err) + } +} + +// TestDelete_BlocksConcurrentUploadOfSameName verifies that a cas-delete +// inprogress marker written by Delete prevents a concurrent Upload of the +// same name from starting. The marker is written by a goroutine that holds it +// for long enough for the main goroutine's Upload attempt to observe it. +func TestDelete_BlocksConcurrentUploadOfSameName(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Write a cas-delete inprogress marker directly (simulating what Delete + // will do once the real implementation is in place). + markerKey := cas.InProgressMarkerPath(cp, "bk") + markerBody := `{"Backup":"bk","Host":"h1","StartedAt":"2026-01-01T00:00:00Z","Tool":"cas-delete"}` + if err := f.PutFile(context.Background(), markerKey, + io.NopCloser(strings.NewReader(markerBody)), int64(len(markerBody))); err != nil { + t.Fatal(err) + } + + // Upload must refuse: the marker is present and no metadata.json exists. + _, err := cas.Upload(context.Background(), f, cfg, "bk", cas.UploadOptions{ + LocalBackupDir: t.TempDir(), // empty dir → no tables, but auth check is before planUpload + }) + if err == nil { + t.Fatal("expected Upload to fail when cas-delete marker is present") + } + if !strings.Contains(err.Error(), "cas-delete") && !strings.Contains(err.Error(), "in progress") { + t.Errorf("error should mention cas-delete or in progress; got: %v", err) + } +} + +// TestDelete_ReleaseMarkerOnSuccess verifies that the cas-delete inprogress +// marker is removed after a successful Delete call. +func TestDelete_ReleaseMarkerOnSuccess(t *testing.T) { + f, cfg, name := setupUploaded(t) + cp := cfg.ClusterPrefix() + + if err := cas.Delete(context.Background(), f, cfg, name, cas.DeleteOptions{}); err != nil { + t.Fatal(err) + } + if _, _, ok, _ := f.StatFile(context.Background(), cas.InProgressMarkerPath(cp, name)); ok { + t.Error("cas-delete: inprogress marker must be removed after successful Delete") + } +} + +// TestDelete_RefusesWhenAlreadyDeleting verifies that Delete refuses when a +// cas-delete inprogress marker is already present and no metadata.json exists +// (i.e. another concurrent Delete is in progress for the same backup). +func TestDelete_RefusesWhenAlreadyDeleting(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Place metadata.json so the backup appears to exist. + mdKey := cas.MetadataJSONPath(cp, "bk") + if err := f.PutFile(context.Background(), mdKey, + io.NopCloser(strings.NewReader("{}")), 2); err != nil { + t.Fatal(err) + } + + // Pre-place a cas-delete marker (another Delete is mid-flight). + markerKey := cas.InProgressMarkerPath(cp, "bk") + markerBody := `{"Backup":"bk","Host":"h2","StartedAt":"2026-01-01T00:00:00Z","Tool":"cas-delete"}` + if err := f.PutFile(context.Background(), markerKey, + io.NopCloser(strings.NewReader(markerBody)), int64(len(markerBody))); err != nil { + t.Fatal(err) + } + + err := cas.Delete(context.Background(), f, cfg, "bk", cas.DeleteOptions{}) + if err == nil { + t.Fatal("expected Delete to fail when another cas-delete is in progress") + } + if !strings.Contains(err.Error(), "cas-delete") { + t.Errorf("error should mention cas-delete; got: %v", err) + } +} + +// TestDelete_RefusesOnUnreadableMarker verifies the path where: +// 1. metadata.json exists (the backup is committed) +// 2. An inprogress marker also exists (ipOK=true, mdOK=true branch) +// 3. ReadInProgressMarker returns a non-nil error (transient/corrupt read) +// +// Delete must return an error containing "cannot read marker" AND must NOT +// delete the marker (preserving visibility for operators and concurrent +// processes). +// +// The unreadable-marker condition is induced by pre-placing a 128 KiB body of +// 'x' characters — twice the 64 KiB markerSizeLimit enforced by getBytes's +// LimitReader. After truncation the body is not valid JSON, so +// ReadInProgressMarker returns a JSON parse error → readErr != nil. +func TestDelete_RefusesOnUnreadableMarker(t *testing.T) { + f, cfg, name := setupUploaded(t) + cp := cfg.ClusterPrefix() + markerKey := cas.InProgressMarkerPath(cp, name) + + // Place an oversized (128 KiB) non-JSON marker alongside the committed + // metadata.json so the ipOK && mdOK branch is entered. + oversized := make([]byte, 128*1024) + for i := range oversized { + oversized[i] = 'x' + } + if err := f.PutFile(context.Background(), markerKey, + io.NopCloser(bytes.NewReader(oversized)), int64(len(oversized))); err != nil { + t.Fatal(err) + } + + err := cas.Delete(context.Background(), f, cfg, name, cas.DeleteOptions{}) + if err == nil { + t.Fatal("expected Delete to fail when ReadInProgressMarker returns an error") + } + if !strings.Contains(err.Error(), "cannot read marker") { + t.Errorf("error should contain 'cannot read marker'; got: %v", err) + } + + // The marker must still be present: Delete must not have removed it. + if _, _, ok, _ := f.StatFile(context.Background(), markerKey); !ok { + t.Error("marker must NOT be deleted when Delete refuses due to an unreadable marker") + } +} + +// recordingBackend wraps a Backend and records DeleteFile calls in order. +type recordingBackend struct { + cas.Backend + deletes []string +} + +func (r *recordingBackend) DeleteFile(ctx context.Context, key string) error { + r.deletes = append(r.deletes, key) + return r.Backend.DeleteFile(ctx, key) +} + +// Cancellation-during-cleanup for cas-delete is verified by parity with +// TestUpload_CancelledContextStillReleasesMarker and +// TestPrune_CancelledContextStillReleasesMarker — all three use the +// identical defer-with-cleanCtx pattern (pkg/cas/delete.go's defer at the +// ipOK && mdOK branch creates a detached context.WithTimeout the same way +// upload/prune do). +// +// A direct delete-side test is intentionally omitted: with a pre-cancelled +// parent ctx, waitForPrune (called before any defer is registered) returns +// ctx.Err() immediately, so the defer never gets a chance to run — that's +// correct early-bail behavior, not the cleanup-on-late-cancellation +// scenario the test would need to exercise. Constructing a "ctx alive +// past waitForPrune, then cancelled inside walkAndDeleteSubtree" requires +// ctx-respecting fakedst hooks that the existing upload test paths +// already cover the equivalent guarantee for. diff --git a/pkg/cas/download.go b/pkg/cas/download.go new file mode 100644 index 00000000..5e929457 --- /dev/null +++ b/pkg/cas/download.go @@ -0,0 +1,734 @@ +package cas + +import ( + "context" + "crypto/rand" + "encoding/hex" + "encoding/json" + "errors" + "fmt" + "io" + "os" + "path/filepath" + "regexp" + "sort" + "strings" + "sync" + "syscall" + + "github.com/Altinity/clickhouse-backup/v2/pkg/checksumstxt" + "github.com/Altinity/clickhouse-backup/v2/pkg/common" + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" +) + +// DownloadOptions configures a Download run. +type DownloadOptions struct { + // LocalBackupDir is the root under which Download materializes + // //. The directory is created if missing. + LocalBackupDir string + + // TableFilter is an optional list of "db.table" glob patterns + // (filepath.Match semantics, mirroring v1 --tables). Empty = include all. + TableFilter []string + + // Partitions is an optional part-name filter applied at the part level + // (intersected with TableMetadata.Parts). Empty means all parts. + Partitions []string + + // SchemaOnly: skip archive download + blob fetch; only write JSON + // metadata files locally. + SchemaOnly bool + + // DataOnly: in v1 of CAS this behaves like a full download (CAS only + // stores data; schema info comes from the per-table JSON which is + // always written). Reserved for future use. + DataOnly bool + + // Parallelism caps simultaneous archive + blob fetches. <=0 falls + // back to 16. + Parallelism int +} + +// DownloadResult summarizes what a Download run did. +type DownloadResult struct { + LocalBackupDir string + BackupName string + PerTableArchives int + BlobsFetched int + BytesFetched int64 +} + +// projRe matches a projection-style nested filename: .proj/. +var projRe = regexp.MustCompile(`^[^/\x00]+\.proj/[^/\x00]+$`) + +// validateRemoteFilesystemName rejects disk and part names from remote +// metadata before they are joined into local filesystem paths. A +// compromised or adversarially crafted CAS bucket could otherwise direct +// archive extraction or blob writes outside the intended local backup +// directory by setting `disk = "../../etc"` or `part_name = "../escape"`. +// +// label is only used in the error message ("disk", "part name", etc.). +func validateRemoteFilesystemName(label, name string) error { + if name == "" || name == "." || name == ".." { + return fmt.Errorf("cas: unsafe %s in remote metadata: %q", label, name) + } + if strings.ContainsAny(name, "/\\\x00") { + return fmt.Errorf("cas: unsafe %s (path separator or NUL) in remote metadata: %q", label, name) + } + if strings.Contains(name, "..") { + return fmt.Errorf("cas: unsafe %s (contains %q) in remote metadata: %q", label, "..", name) + } + return nil +} + +// validateChecksumsTxtFilename rejects unsafe filenames listed in a +// part's checksums.txt. See docs/cas-design.md §6.5 step 5. +func validateChecksumsTxtFilename(name string) error { + if name == "" { + return errors.New("cas: empty filename in checksums.txt") + } + if strings.ContainsRune(name, 0) { + return errors.New("cas: NUL in filename") + } + if strings.HasPrefix(name, "/") { + return errors.New("cas: absolute filename") + } + if strings.Contains(name, "..") { + return errors.New("cas: \"..\" in filename") + } + if strings.Contains(name, "/") && !projRe.MatchString(name) { + return errors.New("cas: nested path in filename") + } + return nil +} + +// randomHex8 returns 8 random hex characters for use in staging dir names. +func randomHex8() string { + var b [4]byte + if _, err := rand.Read(b[:]); err != nil { + // crypto/rand.Read only fails on catastrophic OS failures; panic is + // appropriate here rather than silently producing a fixed suffix. + panic("cas: crypto/rand.Read failed: " + err.Error()) + } + return hex.EncodeToString(b[:]) +} + +// Download materializes a v1-shaped local backup directory from a CAS +// backup. Implements docs/cas-design.md §6.5 (the cas-download portion; +// cas-restore is layered on top in Task 14). +// +// Atomicity: Download writes all content into a hidden staging directory +// (a sibling of finalDir named "..cas-staging-") and +// only renames it to finalDir after ALL downloads succeed. A failed or +// interrupted download therefore never leaves a directory at finalDir that +// looks like a valid v1 backup. Any pre-existing finalDir is removed +// immediately before the rename; re-running over a partial or stale +// same-name directory is safe and produces a clean result. +// +// Assumption: opts.LocalBackupDir and the staging sibling are on the same +// filesystem mount, so os.Rename is atomic. This always holds when both +// are siblings under opts.LocalBackupDir. +func Download(ctx context.Context, b Backend, cfg Config, name string, opts DownloadOptions) (_ *DownloadResult, err error) { + if err := cfg.Validate(); err != nil { + return nil, fmt.Errorf("cas: download: invalid config: %w", err) + } + if opts.LocalBackupDir == "" { + return nil, errors.New("cas: DownloadOptions.LocalBackupDir is required") + } + if opts.DataOnly { + return nil, errors.New("cas: --data-only is not yet implemented for cas-download (use the v1 flow if you need data-only restoration)") + } + + // 1. Validate root metadata + persisted CAS params. + bm, err := ValidateBackup(ctx, b, cfg, name) + if err != nil { + // If the backup is missing in the CAS namespace but exists at the + // v1 location (root-level /metadata.json), surface the + // proper cross-mode refusal instead of "metadata.json missing". + // Operators who type a v1 backup name into cas-download get the + // helpful error. + if errors.Is(err, ErrMissingMetadata) { + if _, _, exists, statErr := b.StatFile(ctx, name+"/metadata.json"); statErr == nil && exists { + return nil, ErrV1Backup + } + } + return nil, err + } + + cp := cfg.ClusterPrefix() + + // 2. Set up local layout using a staging directory. + // All writes go to stageDir; it is renamed to finalDir only after + // all downloads succeed. + finalDir := filepath.Join(opts.LocalBackupDir, name) + stageDir := filepath.Join(opts.LocalBackupDir, "."+name+".cas-staging-"+randomHex8()) + + if err := os.MkdirAll(stageDir, 0o755); err != nil { + return nil, fmt.Errorf("cas: mkdir staging %s: %w", stageDir, err) + } + // Clean up staging dir on any error path. + defer func() { + if err != nil { + _ = os.RemoveAll(stageDir) + } + }() + + res := &DownloadResult{ + // Callers see the final (post-rename) path. We update this to + // finalDir after the rename succeeds. + LocalBackupDir: finalDir, + BackupName: name, + } + + // 3. Determine in-scope (db, table) by applying TableFilter to bm.Tables. + inScope := selectTables(bm.Tables, opts.TableFilter) + if len(inScope) == 0 && len(opts.TableFilter) > 0 { + // Filter excluded everything; that's not necessarily an error, + // but we still write the root metadata.json and return. + } + + // 4. Fetch + persist per-table TableMetadata (with optional partition filter). + type tableEntry struct { + DB, Table string + TM metadata.TableMetadata + } + tables := make([]tableEntry, 0, len(inScope)) + partsFilter := makePartsFilter(opts.Partitions) + for _, tt := range inScope { + tm, err := fetchTableMetadata(ctx, b, cp, name, tt.Database, tt.Table) + if err != nil { + return nil, err + } + if partsFilter != nil { + tm.Parts = filterParts(tm.Parts, partsFilter) + } + // Save to staging dir under metadata//.json. + if err := saveLocalTableMetadata(stageDir, tm); err != nil { + return nil, err + } + tables = append(tables, tableEntry{DB: tt.Database, Table: tt.Table, TM: *tm}) + } + + // 5. Save root metadata.json into the staging dir. + // + // We keep BackupMetadata.CAS populated in the local copy but set the + // Handoff flag to true. This serves two purposes: + // + // (a) The v1 early-refusal guard in pkg/backup/restore.go (which returns + // ErrCASBackup when CAS != nil) is updated to allow Handoff backups, + // so cas-restore can invoke the v1 path on the materialized layout. + // + // (b) The two object-disk-skip guards later in restore.go check + // "backupMetadata.CAS == nil" to decide whether to call + // downloadObjectDiskParts. With CAS != nil those guards correctly + // skip the call — CAS backups never carry object-disk metadata files, + // so any attempt to download them would fail with "file not found". + // + // Previously CAS was nil-ed, which silently defeated (b): on a target + // cluster where the table lives on an object-storage disk, v1 would call + // downloadObjectDiskParts and fail because CAS never wrote those files. + // See docs/superpowers/plans/2026-05-08-cas-review-wave-5.md §N3. + bmLocal := *bm + handoffCAS := *bm.CAS + handoffCAS.Handoff = true + bmLocal.CAS = &handoffCAS + bmLocal.Tables = inScope + bmPath := filepath.Join(stageDir, "metadata.json") + bmBody, err := json.MarshalIndent(&bmLocal, "", "\t") + if err != nil { + return nil, fmt.Errorf("cas: marshal local metadata.json: %w", err) + } + if err := os.WriteFile(bmPath, bmBody, 0o640); err != nil { + return nil, fmt.Errorf("cas: write %s: %w", bmPath, err) + } + + if opts.SchemaOnly { + // Schema-only: rename staging → final and return. No archive/blob + // downloads needed; the staging dir is a valid (schema-only) backup. + if err := atomicSwapDir(stageDir, finalDir); err != nil { + return nil, err + } + return res, nil + } + + // 6. Disk-space pre-flight (best-effort): estimate archive bytes via + // StatFile; we don't pre-fetch blob sizes (would require parsing + // checksums.txt before downloading the archives, doubling round-trips). + // We compare archive total to filesystem free space and bail early on + // obvious shortage; blob size is added after archive extraction. + estimateArchiveBytes := int64(0) + var archives []archiveJob + for _, te := range tables { + for disk, parts := range te.TM.Parts { + // Reject path-traversal in remote-supplied disk and part names + // BEFORE they participate in any path construction (incl. the + // archive key passed to StatFile, which in turn flows into the + // local filesystem path during extraction). + if err := validateRemoteFilesystemName("disk", disk); err != nil { + return nil, err + } + for _, p := range parts { + if err := validateRemoteFilesystemName("part name", p.Name); err != nil { + return nil, err + } + } + key := PartArchivePath(cp, name, disk, te.DB, te.Table) + sz, _, exists, err := b.StatFile(ctx, key) + if err != nil { + return nil, fmt.Errorf("cas: stat archive %s: %w", key, err) + } + if !exists { + // A backup with parts on this disk should have an archive; + // missing implies a corrupted backup. + return nil, fmt.Errorf("cas: archive missing: %s", key) + } + archives = append(archives, archiveJob{ + Disk: disk, DB: te.DB, Table: te.Table, Key: key, Size: sz, + }) + estimateArchiveBytes += sz + } + } + // Best-effort free-space check on the staging dir's filesystem. We + // only have archive sizes here; blob bytes get added during extraction + // pass below. With a 1.1x safety multiplier this catches gross-shortage + // cases without delaying the download with a second round-trip. + if err := checkFreeSpace(stageDir, estimateArchiveBytes); err != nil { + return nil, err + } + + // 7. Download + extract archives (bounded parallelism). + parallelism := opts.Parallelism + if parallelism <= 0 { + parallelism = 16 + } + + if err := downloadArchives(ctx, b, archives, stageDir, parallelism); err != nil { + return nil, err + } + res.PerTableArchives = len(archives) + + // 8. For each in-scope part: parse the on-disk checksums.txt and + // fetch every blob whose size exceeds the persisted threshold. + var blobs []blobJob + estimateBlobBytes := int64(0) + for _, te := range tables { + for disk, parts := range te.TM.Parts { + if err := validateRemoteFilesystemName("disk", disk); err != nil { + return nil, err + } + for _, p := range parts { + if err := validateRemoteFilesystemName("part name", p.Name); err != nil { + return nil, err + } + partDir := filepath.Join(stageDir, "shadow", + common.TablePathEncode(te.DB), + common.TablePathEncode(te.Table), + disk, p.Name) + if err := collectBlobJobsRecursive(partDir, bm.CAS.InlineThreshold, &blobs, &estimateBlobBytes); err != nil { + return nil, err + } + } + } + } + // Re-check free space now that we know blob bytes too. + if err := checkFreeSpace(stageDir, estimateBlobBytes); err != nil { + return nil, err + } + + fetched, bytesFetched, err := downloadBlobs(ctx, b, cp, blobs, parallelism) + if err != nil { + return nil, err + } + res.BlobsFetched = fetched + res.BytesFetched = bytesFetched + + // 9. All downloads succeeded: atomically replace finalDir with stageDir. + if err := atomicSwapDir(stageDir, finalDir); err != nil { + return nil, err + } + return res, nil +} + +// atomicSwapDir removes any pre-existing directory at dst and renames src +// to dst. Both must be on the same filesystem (siblings under the same +// parent is sufficient). The removal+rename is not itself atomic at the OS +// level, but it ensures finalDir is never left in a partial state: either +// the old content is still there (if RemoveAll fails) or the new content is +// fully present (if Rename succeeds). +func atomicSwapDir(src, dst string) error { + if err := os.RemoveAll(dst); err != nil { + return fmt.Errorf("cas: remove stale dir %s: %w", dst, err) + } + if err := os.Rename(src, dst); err != nil { + return fmt.Errorf("cas: rename %s → %s: %w", src, dst, err) + } + return nil +} + +// selectTables filters bm.Tables by a "db.table" glob pattern list. +// Empty filter → all tables. Uses filepath.Match semantics, mirroring v1 +// --tables behaviour (pkg/backup/table_pattern.go:93). +func selectTables(all []metadata.TableTitle, filter []string) []metadata.TableTitle { + if len(filter) == 0 { + out := make([]metadata.TableTitle, len(all)) + copy(out, all) + return out + } + var out []metadata.TableTitle + for _, t := range all { + if tableFilterMatches(filter, t.Database, t.Table) { + out = append(out, t) + } + } + return out +} + +// makePartsFilter builds a name-set or returns nil for "no filter". +func makePartsFilter(names []string) map[string]bool { + if len(names) == 0 { + return nil + } + out := make(map[string]bool, len(names)) + for _, n := range names { + out[n] = true + } + return out +} + +// filterParts returns a copy of parts keeping only entries whose Name +// is in the allow set. Disks with no surviving parts are dropped. +func filterParts(parts map[string][]metadata.Part, allow map[string]bool) map[string][]metadata.Part { + if allow == nil { + return parts + } + out := make(map[string][]metadata.Part, len(parts)) + for disk, ps := range parts { + var kept []metadata.Part + for _, p := range ps { + if allow[p.Name] { + kept = append(kept, p) + } + } + if len(kept) > 0 { + out[disk] = kept + } + } + return out +} + +// fetchTableMetadata GETs the per-table JSON and parses it. +func fetchTableMetadata(ctx context.Context, b Backend, cp, name, db, table string) (*metadata.TableMetadata, error) { + key := TableMetaPath(cp, name, db, table) + rc, err := b.GetFile(ctx, key) + if err != nil { + return nil, fmt.Errorf("cas: get %s: %w", key, err) + } + defer rc.Close() + body, err := io.ReadAll(rc) + if err != nil { + return nil, fmt.Errorf("cas: read %s: %w", key, err) + } + var tm metadata.TableMetadata + if err := json.Unmarshal(body, &tm); err != nil { + return nil, fmt.Errorf("cas: parse %s: %w", key, err) + } + return &tm, nil +} + +// saveLocalTableMetadata writes tm to /metadata//.json. +func saveLocalTableMetadata(localDir string, tm *metadata.TableMetadata) error { + dir := filepath.Join(localDir, "metadata", common.TablePathEncode(tm.Database)) + if err := os.MkdirAll(dir, 0o755); err != nil { + return fmt.Errorf("cas: mkdir %s: %w", dir, err) + } + path := filepath.Join(dir, common.TablePathEncode(tm.Table)+".json") + body, err := json.MarshalIndent(tm, "", "\t") + if err != nil { + return fmt.Errorf("cas: marshal table metadata %s.%s: %w", tm.Database, tm.Table, err) + } + if err := os.WriteFile(path, body, 0o640); err != nil { + return fmt.Errorf("cas: write %s: %w", path, err) + } + return nil +} + +// archiveJob is one per-(disk, db, table) tar.zstd to download + extract. +type archiveJob struct { + Disk, DB, Table string + Key string + Size int64 +} + +// blobJob is one large file to fetch from the CAS blob store and write +// into a part directory. +type blobJob struct { + PartDir string + FileName string + Size uint64 + Hash Hash128 +} + +// downloadArchives concurrently downloads + extracts each per-(disk, db, +// table) archive into the local shadow tree. +func downloadArchives(ctx context.Context, b Backend, jobs []archiveJob, localDir string, parallelism int) error { + var ( + mu sync.Mutex + firstErr error + wg sync.WaitGroup + ) + sem := make(chan struct{}, parallelism) + for _, j := range jobs { + j := j + wg.Add(1) + go func() { + defer wg.Done() + sem <- struct{}{} + defer func() { <-sem }() + mu.Lock() + already := firstErr != nil + mu.Unlock() + if already { + return + } + if err := validateRemoteFilesystemName("disk", j.Disk); err != nil { + mu.Lock() + if firstErr == nil { + firstErr = err + } + mu.Unlock() + return + } + dst := filepath.Join(localDir, "shadow", + common.TablePathEncode(j.DB), + common.TablePathEncode(j.Table), j.Disk) + if err := os.MkdirAll(dst, 0o755); err != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: mkdir %s: %w", dst, err) + } + mu.Unlock() + return + } + rc, err := b.GetFile(ctx, j.Key) + if err != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: get archive %s: %w", j.Key, err) + } + mu.Unlock() + return + } + extractErr := ExtractArchive(rc, dst) + _ = rc.Close() + if extractErr != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: extract %s: %w", j.Key, extractErr) + } + mu.Unlock() + return + } + }() + } + wg.Wait() + return firstErr +} + +// downloadBlobs concurrently fetches every blob, writing to its in-part +// destination after re-asserting path containment. +func downloadBlobs(ctx context.Context, b Backend, cp string, jobs []blobJob, parallelism int) (int, int64, error) { + // Sort for determinism in tests. + sort.Slice(jobs, func(i, j int) bool { + if jobs[i].PartDir != jobs[j].PartDir { + return jobs[i].PartDir < jobs[j].PartDir + } + return jobs[i].FileName < jobs[j].FileName + }) + var ( + mu sync.Mutex + firstErr error + fetched int + bytesUp int64 + wg sync.WaitGroup + ) + sem := make(chan struct{}, parallelism) + for _, j := range jobs { + j := j + wg.Add(1) + go func() { + defer wg.Done() + sem <- struct{}{} + defer func() { <-sem }() + mu.Lock() + already := firstErr != nil + mu.Unlock() + if already { + return + } + + // Path containment: ensure dst remains under PartDir. + absPart, err := filepath.Abs(j.PartDir) + if err != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: abs %s: %w", j.PartDir, err) + } + mu.Unlock() + return + } + rootPrefix := absPart + string(filepath.Separator) + dst := filepath.Join(absPart, filepath.FromSlash(j.FileName)) + cleanDst := filepath.Clean(dst) + if !strings.HasPrefix(cleanDst+string(filepath.Separator), rootPrefix) && cleanDst != absPart { + mu.Lock() + if firstErr == nil { + firstErr = &UnsafePathError{Path: j.FileName} + } + mu.Unlock() + return + } + + if err := os.MkdirAll(filepath.Dir(cleanDst), 0o755); err != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: mkdir %s: %w", filepath.Dir(cleanDst), err) + } + mu.Unlock() + return + } + + rc, err := b.GetFile(ctx, BlobPath(cp, j.Hash)) + if err != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: get blob %s: %w", BlobPath(cp, j.Hash), err) + } + mu.Unlock() + return + } + f, err := os.OpenFile(cleanDst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o644) + if err != nil { + _ = rc.Close() + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: open %s: %w", cleanDst, err) + } + mu.Unlock() + return + } + n, copyErr := io.Copy(f, rc) + _ = rc.Close() + closeErr := f.Close() + if copyErr != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: write %s: %w", cleanDst, copyErr) + } + mu.Unlock() + return + } + if closeErr != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: close %s: %w", cleanDst, closeErr) + } + mu.Unlock() + return + } + if uint64(n) != j.Size { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: blob %s truncated: got %d bytes, expected %d (per checksums.txt)", + BlobPath(cp, j.Hash), n, j.Size) + } + mu.Unlock() + _ = os.Remove(cleanDst) // best-effort: don't leave a corrupt file behind + return + } + mu.Lock() + fetched++ + bytesUp += n + mu.Unlock() + }() + } + wg.Wait() + if firstErr != nil { + return 0, 0, firstErr + } + return fetched, bytesUp, nil +} + +// collectBlobJobsRecursive parses partDir/checksums.txt and appends a +// blobJob for every above-threshold non-.proj file. For each .proj entry +// in the parent it recurses into //checksums.txt with the +// same rules. Mirrors the upload-side projection-aware walker from T5. +// +// Each blob's PartDir is the immediate directory containing the file (so +// downloadBlobs writes to the right nested location, including p1.proj/...). +func collectBlobJobsRecursive(partDir string, threshold uint64, out *[]blobJob, estimate *int64) error { + ckPath := filepath.Join(partDir, "checksums.txt") + f, err := os.Open(ckPath) + if err != nil { + return fmt.Errorf("cas: open %s: %w", ckPath, err) + } + parsed, perr := checksumstxt.Parse(f) + _ = f.Close() + if perr != nil { + return fmt.Errorf("cas: parse %s: %w", ckPath, perr) + } + names := make([]string, 0, len(parsed.Files)) + for n := range parsed.Files { + names = append(names, n) + } + sort.Strings(names) + for _, fname := range names { + // validate ALL filenames first — including .proj entries — to prevent + // directory traversal via crafted remote checksums.txt content. The + // download path consumes untrusted data; the upload side trusts local + // filesystem content but applies the same validator for defense in depth. + if err := validateChecksumsTxtFilename(fname); err != nil { + return fmt.Errorf("cas: %s: %w", ckPath, err) + } + if strings.HasSuffix(fname, ".proj") { + subDir := filepath.Join(partDir, fname) + if err := collectBlobJobsRecursive(subDir, threshold, out, estimate); err != nil { + return err + } + continue + } + c := parsed.Files[fname] + if c.FileSize <= threshold { + continue + } + *out = append(*out, blobJob{ + PartDir: partDir, + FileName: fname, + Size: c.FileSize, + Hash: Hash128{Low: c.FileHash.Low, High: c.FileHash.High}, + }) + *estimate += int64(c.FileSize) + } + return nil +} + +// checkFreeSpace returns an error if the filesystem hosting localDir has +// less than estimate*1.1 bytes free. Best-effort: failure to stat the +// filesystem is logged-and-ignored (Statfs is not available everywhere +// and a stale check shouldn't gate the download). +func checkFreeSpace(localDir string, estimate int64) error { + if estimate <= 0 { + return nil + } + var st syscall.Statfs_t + if err := syscall.Statfs(localDir, &st); err != nil { + // Best-effort: skip the check if the syscall is unavailable. + return nil + } + // Bsize is platform-dependent type; cast to int64 via uint64. + free := int64(st.Bavail) * int64(st.Bsize) + required := estimate + estimate/10 // *1.1 + if free < required { + return fmt.Errorf("cas: insufficient free space at %s: have %d bytes, need ~%d (estimate %d * 1.1)", localDir, free, required, estimate) + } + return nil +} diff --git a/pkg/cas/download_test.go b/pkg/cas/download_test.go new file mode 100644 index 00000000..b18411bd --- /dev/null +++ b/pkg/cas/download_test.go @@ -0,0 +1,846 @@ +package cas_test + +import ( + "archive/tar" + "bytes" + "context" + "encoding/json" + "errors" + "io" + "os" + "path/filepath" + "strings" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/testfixtures" + "github.com/Altinity/clickhouse-backup/v2/pkg/common" + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" + "github.com/klauspost/compress/zstd" +) + +// makeBlobBytes returns deterministic 1024-byte data based on seed; used +// to populate file bodies so we can byte-compare after round-trip. +func makeBlobBytes(seed byte) []byte { + out := make([]byte, 1024) + for i := range out { + out[i] = seed + byte(i%17) + } + return out +} + +// uploadAndDownload is a small helper that performs Upload + Download +// using shared config and returns the local download root. +func uploadAndDownload(t *testing.T, parts []testfixtures.PartSpec, name string, opts cas.DownloadOptions) (lb *testfixtures.LocalBackup, f *fakedst.Fake, cfg cas.Config, downloadRoot string) { + t.Helper() + lb = testfixtures.Build(t, parts) + f = fakedst.New() + cfg = testCfg(100) + if _, err := cas.Upload(context.Background(), f, cfg, name, cas.UploadOptions{LocalBackupDir: lb.Root}); err != nil { + t.Fatalf("Upload: %v", err) + } + if opts.LocalBackupDir == "" { + opts.LocalBackupDir = t.TempDir() + } + downloadRoot = opts.LocalBackupDir + if _, err := cas.Download(context.Background(), f, cfg, name, opts); err != nil { + t.Fatalf("Download: %v", err) + } + return lb, f, cfg, downloadRoot +} + +func TestDownload_RoundTripBytes(t *testing.T) { + parts := []testfixtures.PartSpec{ + {Disk: "default", DB: "db1", Table: "t1", Name: "p1", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 1}, + {Name: "primary.idx", Size: 8, HashLow: 2, HashHigh: 1}, + {Name: "data.bin", Size: 1024, HashLow: 999, HashHigh: 1, Bytes: makeBlobBytes(0x10)}, + }}, + } + lb, _, _, root := uploadAndDownload(t, parts, "b1", cas.DownloadOptions{}) + localBackupDir := filepath.Join(root, "b1") + + // Check root metadata.json: parseable. The LOCAL copy keeps CAS populated + // but with Handoff = true so that: + // (a) the v1 early-refusal guard allows cas-restore handoff backups, and + // (b) the object-disk-skip guards (which check CAS == nil) continue to + // fire and skip downloadObjectDiskParts (CAS never wrote those files). + // The REMOTE metadata.json has CAS.Handoff = false. + // See pkg/cas/download.go and docs/superpowers/plans/2026-05-08-cas-review-wave-5.md §N3. + bmBody, err := os.ReadFile(filepath.Join(localBackupDir, "metadata.json")) + if err != nil { + t.Fatalf("read root metadata.json: %v", err) + } + var bm metadata.BackupMetadata + if err := json.Unmarshal(bmBody, &bm); err != nil { + t.Fatalf("parse local metadata.json: %v", err) + } + if bm.CAS == nil { + t.Fatal("local metadata.json: CAS field MUST be preserved for object-disk-skip guards to fire") + } + if !bm.CAS.Handoff { + t.Fatal("local metadata.json: CAS.Handoff MUST be true to allow v1 early-refusal guard to pass") + } + if bm.DataFormat != "directory" { + t.Errorf("DataFormat: got %q want directory", bm.DataFormat) + } + + // Per-table JSON. + tmPath := filepath.Join(localBackupDir, "metadata", + common.TablePathEncode("db1"), common.TablePathEncode("t1")+".json") + tmBody, err := os.ReadFile(tmPath) + if err != nil { + t.Fatalf("read table metadata: %v", err) + } + var tm metadata.TableMetadata + if err := json.Unmarshal(tmBody, &tm); err != nil { + t.Fatalf("parse table metadata: %v", err) + } + if got := len(tm.Parts["default"]); got != 1 { + t.Errorf("Parts[default]: got %d want 1", got) + } + + // Byte-compare every reconstructed file against the original local + // backup's bytes. + origPartDir := filepath.Join(lb.Root, "shadow", "db1", "t1", "default", "p1") + dlPartDir := filepath.Join(localBackupDir, "shadow", + common.TablePathEncode("db1"), common.TablePathEncode("t1"), "default", "p1") + for _, f := range parts[0].Files { + want, err := os.ReadFile(filepath.Join(origPartDir, f.Name)) + if err != nil { + t.Fatalf("read original %s: %v", f.Name, err) + } + got, err := os.ReadFile(filepath.Join(dlPartDir, f.Name)) + if err != nil { + t.Fatalf("read downloaded %s: %v", f.Name, err) + } + if !bytes.Equal(want, got) { + t.Errorf("byte mismatch for %s (size want=%d got=%d)", f.Name, len(want), len(got)) + } + } + // checksums.txt should also exist on disk. + if _, err := os.Stat(filepath.Join(dlPartDir, "checksums.txt")); err != nil { + t.Errorf("checksums.txt missing: %v", err) + } +} + +func TestDownload_RefusesV1Backup(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + // Write a metadata.json with CAS=nil. + bm := metadata.BackupMetadata{BackupName: "b1", DataFormat: "directory"} + body, _ := json.Marshal(&bm) + if err := f.PutFile(context.Background(), cas.MetadataJSONPath(cp, "b1"), + io.NopCloser(bytes.NewReader(body)), int64(len(body))); err != nil { + t.Fatal(err) + } + _, err := cas.Download(context.Background(), f, cfg, "b1", cas.DownloadOptions{LocalBackupDir: t.TempDir()}) + if !errors.Is(err, cas.ErrV1Backup) { + t.Fatalf("got err=%v want ErrV1Backup", err) + } +} + +func TestDownload_RefusesUnsupportedLayoutVersion(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + bm := metadata.BackupMetadata{ + BackupName: "b1", + DataFormat: "directory", + CAS: &metadata.CASBackupParams{ + LayoutVersion: cas.LayoutVersion + 1, + InlineThreshold: cfg.InlineThreshold, + ClusterID: cfg.ClusterID, + }, + } + body, _ := json.Marshal(&bm) + if err := f.PutFile(context.Background(), cas.MetadataJSONPath(cp, "b1"), + io.NopCloser(bytes.NewReader(body)), int64(len(body))); err != nil { + t.Fatal(err) + } + _, err := cas.Download(context.Background(), f, cfg, "b1", cas.DownloadOptions{LocalBackupDir: t.TempDir()}) + if !errors.Is(err, cas.ErrUnsupportedLayoutVersion) { + t.Fatalf("got err=%v want ErrUnsupportedLayoutVersion", err) + } +} + +func TestDownload_TableFilter(t *testing.T) { + parts := []testfixtures.PartSpec{ + {Disk: "default", DB: "db1", Table: "t1", Name: "p1", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 1}, + }}, + {Disk: "default", DB: "db1", Table: "t2", Name: "p1", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 2, HashHigh: 1}, + }}, + } + _, _, _, root := uploadAndDownload(t, parts, "b1", cas.DownloadOptions{ + TableFilter: []string{"db1.t1"}, + }) + localBackupDir := filepath.Join(root, "b1") + + t1Path := filepath.Join(localBackupDir, "metadata", + common.TablePathEncode("db1"), common.TablePathEncode("t1")+".json") + t2Path := filepath.Join(localBackupDir, "metadata", + common.TablePathEncode("db1"), common.TablePathEncode("t2")+".json") + if _, err := os.Stat(t1Path); err != nil { + t.Errorf("t1 metadata missing: %v", err) + } + if _, err := os.Stat(t2Path); !os.IsNotExist(err) { + t.Errorf("t2 metadata should be absent, got err=%v", err) + } + // Shadow check. + t1Shadow := filepath.Join(localBackupDir, "shadow", + common.TablePathEncode("db1"), common.TablePathEncode("t1")) + if _, err := os.Stat(t1Shadow); err != nil { + t.Errorf("t1 shadow missing: %v", err) + } + t2Shadow := filepath.Join(localBackupDir, "shadow", + common.TablePathEncode("db1"), common.TablePathEncode("t2")) + if _, err := os.Stat(t2Shadow); !os.IsNotExist(err) { + t.Errorf("t2 shadow should be absent, got err=%v", err) + } +} + +// TestDownload_PartialFiltersBmLocal verifies that when cas-download is +// called with TableFilter, the local metadata.json's Tables list is +// filtered to match — not a copy of the full remote bm.Tables. +func TestDownload_PartialFiltersBmLocal(t *testing.T) { + parts := []testfixtures.PartSpec{ + {Disk: "default", DB: "db1", Table: "keep", Name: "p1", + Files: []testfixtures.FileSpec{{Name: "data.bin", Size: 4096, HashLow: 1, HashHigh: 1}}}, + {Disk: "default", DB: "db1", Table: "drop", Name: "p1", + Files: []testfixtures.FileSpec{{Name: "data.bin", Size: 4096, HashLow: 2, HashHigh: 2}}}, + } + _, _, _, root := uploadAndDownload(t, parts, "bk", cas.DownloadOptions{ + TableFilter: []string{"db1.keep"}, + }) + body, err := os.ReadFile(filepath.Join(root, "bk", "metadata.json")) + if err != nil { + t.Fatal(err) + } + var bm metadata.BackupMetadata + if err := json.Unmarshal(body, &bm); err != nil { + t.Fatal(err) + } + if len(bm.Tables) != 1 { + t.Fatalf("local bm.Tables should have 1 entry; got %d: %+v", len(bm.Tables), bm.Tables) + } + if bm.Tables[0].Database != "db1" || bm.Tables[0].Table != "keep" { + t.Errorf("local bm.Tables should be [db1.keep]; got %+v", bm.Tables) + } +} + +func TestDownload_SchemaOnly(t *testing.T) { + parts := []testfixtures.PartSpec{ + {Disk: "default", DB: "db1", Table: "t1", Name: "p1", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 1}, + {Name: "data.bin", Size: 1024, HashLow: 999, HashHigh: 1, Bytes: makeBlobBytes(0x10)}, + }}, + } + _, _, _, root := uploadAndDownload(t, parts, "b1", cas.DownloadOptions{ + SchemaOnly: true, + }) + localBackupDir := filepath.Join(root, "b1") + + if _, err := os.Stat(filepath.Join(localBackupDir, "metadata.json")); err != nil { + t.Errorf("metadata.json missing: %v", err) + } + tmPath := filepath.Join(localBackupDir, "metadata", + common.TablePathEncode("db1"), common.TablePathEncode("t1")+".json") + if _, err := os.Stat(tmPath); err != nil { + t.Errorf("table metadata missing: %v", err) + } + if _, err := os.Stat(filepath.Join(localBackupDir, "shadow")); !os.IsNotExist(err) { + t.Errorf("shadow/ should be absent under SchemaOnly, got err=%v", err) + } +} + +// makeArchiveBytes builds a tar.zstd archive from in-memory entries +// (name → bytes). Used by the traversal tests to bypass Upload's +// validation and put a hostile archive directly into the backend. +func makeArchiveBytes(t *testing.T, entries map[string][]byte) []byte { + t.Helper() + var buf bytes.Buffer + zw, err := zstd.NewWriter(&buf) + if err != nil { + t.Fatalf("zstd.NewWriter: %v", err) + } + tw := tar.NewWriter(zw) + // Determinism for debugging. + names := make([]string, 0, len(entries)) + for n := range entries { + names = append(names, n) + } + for _, n := range names { + body := entries[n] + hdr := &tar.Header{ + Name: n, + Mode: 0o644, + Size: int64(len(body)), + Typeflag: tar.TypeReg, + } + if err := tw.WriteHeader(hdr); err != nil { + t.Fatalf("tar header: %v", err) + } + if _, err := tw.Write(body); err != nil { + t.Fatalf("tar write: %v", err) + } + } + if err := tw.Close(); err != nil { + t.Fatalf("tar close: %v", err) + } + if err := zw.Close(); err != nil { + t.Fatalf("zstd close: %v", err) + } + return buf.Bytes() +} + +// putHostileBackup primes the fake backend with a hand-crafted CAS +// backup whose single archive has the given entries. Used by the two +// traversal tests; the resulting "backup" passes ValidateBackup. +func putHostileBackup(t *testing.T, f *fakedst.Fake, cfg cas.Config, name, db, table, disk string, archiveEntries map[string][]byte) { + t.Helper() + cp := cfg.ClusterPrefix() + bm := metadata.BackupMetadata{ + BackupName: name, + DataFormat: "directory", + Tables: []metadata.TableTitle{{Database: db, Table: table}}, + CAS: &metadata.CASBackupParams{ + LayoutVersion: cas.LayoutVersion, + InlineThreshold: cfg.InlineThreshold, + ClusterID: cfg.ClusterID, + }, + } + bmBody, _ := json.Marshal(&bm) + if err := f.PutFile(context.Background(), cas.MetadataJSONPath(cp, name), + io.NopCloser(bytes.NewReader(bmBody)), int64(len(bmBody))); err != nil { + t.Fatal(err) + } + + tm := metadata.TableMetadata{ + Database: db, Table: table, + Parts: map[string][]metadata.Part{ + disk: {{Name: "p1"}}, + }, + } + tmBody, _ := json.Marshal(&tm) + if err := f.PutFile(context.Background(), cas.TableMetaPath(cp, name, db, table), + io.NopCloser(bytes.NewReader(tmBody)), int64(len(tmBody))); err != nil { + t.Fatal(err) + } + + archive := makeArchiveBytes(t, archiveEntries) + if err := f.PutFile(context.Background(), cas.PartArchivePath(cp, name, disk, db, table), + io.NopCloser(bytes.NewReader(archive)), int64(len(archive))); err != nil { + t.Fatal(err) + } +} + +func TestDownload_RejectsTraversalFilename(t *testing.T) { + // checksums.txt lists "../escape.txt" as one of its files. The tar + // itself is well-formed (no traversal in tar names), so it extracts + // successfully; the rejection comes from validateChecksumsTxtFilename. + ck := "checksums format version: 2\n" + + "2 files:\n" + + "columns.txt\n\tsize: 5\n\thash: 1 1\n\tcompressed: 0\n" + + "../escape.txt\n\tsize: 99999\n\thash: 9 9\n\tcompressed: 0\n" + entries := map[string][]byte{ + "p1/checksums.txt": []byte(ck), + "p1/columns.txt": []byte("hello"), + } + f := fakedst.New() + cfg := testCfg(100) + putHostileBackup(t, f, cfg, "b1", "db1", "t1", "default", entries) + + _, err := cas.Download(context.Background(), f, cfg, "b1", cas.DownloadOptions{ + LocalBackupDir: t.TempDir(), + }) + if err == nil || !strings.Contains(err.Error(), "..") { + t.Fatalf("got err=%v want filename traversal error", err) + } +} + +func TestDownload_RejectsTarTraversal(t *testing.T) { + // Hand-crafted tar entry name with "..". ExtractArchive must reject. + entries := map[string][]byte{ + "../escape.txt": []byte("pwned"), + } + f := fakedst.New() + cfg := testCfg(100) + putHostileBackup(t, f, cfg, "b1", "db1", "t1", "default", entries) + + _, err := cas.Download(context.Background(), f, cfg, "b1", cas.DownloadOptions{ + LocalBackupDir: t.TempDir(), + }) + var unsafe *cas.UnsafePathError + if !errors.As(err, &unsafe) { + t.Fatalf("got err=%v want *UnsafePathError", err) + } +} + +func TestDownload_PartitionFilter(t *testing.T) { + parts := []testfixtures.PartSpec{ + {Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 1}, + }}, + {Disk: "default", DB: "db1", Table: "t1", Name: "all_2_2_0", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 2, HashHigh: 1}, + }}, + } + _, _, _, root := uploadAndDownload(t, parts, "b1", cas.DownloadOptions{ + Partitions: []string{"all_1_1_0"}, + }) + localBackupDir := filepath.Join(root, "b1") + + tmPath := filepath.Join(localBackupDir, "metadata", + common.TablePathEncode("db1"), common.TablePathEncode("t1")+".json") + tmBody, err := os.ReadFile(tmPath) + if err != nil { + t.Fatalf("read table metadata: %v", err) + } + var tm metadata.TableMetadata + if err := json.Unmarshal(tmBody, &tm); err != nil { + t.Fatalf("parse table metadata: %v", err) + } + parts1 := tm.Parts["default"] + if len(parts1) != 1 || parts1[0].Name != "all_1_1_0" { + t.Errorf("filtered Parts[default]: got %+v want [all_1_1_0]", parts1) + } + + // Note: archives are downloaded whole even when partition-filtered + // (per spec, "acceptable overhead"). So extraction may still produce + // all_2_2_0/checksums.txt under the disk dir; we only assert the JSON + // reflects the filter and that all_1_1_0 is present after extraction. + dlPartDir := filepath.Join(localBackupDir, "shadow", + common.TablePathEncode("db1"), common.TablePathEncode("t1"), "default", "all_1_1_0") + if _, err := os.Stat(filepath.Join(dlPartDir, "checksums.txt")); err != nil { + t.Errorf("all_1_1_0/checksums.txt missing: %v", err) + } +} + +// TestDownload_PreservesSchemaFields is a regression test that the v1 +// schema fields populated in cas-upload survive the upload→download +// round-trip and land in the per-table JSON the v1 restore reads. +func TestDownload_PreservesSchemaFields(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{{Name: "columns.txt", Size: 8, HashLow: 1, HashHigh: 0}}, + TableMeta: metadata.TableMetadata{ + Database: "db1", Table: "t1", + Query: "CREATE TABLE db1.t1 ENGINE=Memory", + UUID: "abc", + }, + }} + _, _, _, root := uploadAndDownload(t, parts, "b1", cas.DownloadOptions{}) + body, err := os.ReadFile(filepath.Join(root, "b1", "metadata", + common.TablePathEncode("db1"), common.TablePathEncode("t1")+".json")) + if err != nil { + t.Fatalf("read downloaded table metadata: %v", err) + } + var got metadata.TableMetadata + if err := json.Unmarshal(body, &got); err != nil { + t.Fatalf("parse table metadata: %v", err) + } + if got.Query == "" || got.UUID == "" { + t.Errorf("downloaded JSON lost schema fields: %+v", got) + } +} + +// TestDownload_RejectsTraversalDiskName verifies that a remote +// TableMetadata with a malicious disk name (path traversal) is rejected +// before any local filesystem write — defense against a compromised CAS +// bucket directing extraction outside localDir. +func TestDownload_RejectsTraversalDiskName(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + ctx := context.Background() + cp := cfg.ClusterPrefix() + + // Hand-craft a CAS-shaped metadata.json + per-table JSON whose Parts + // map keys (disk names) contain "..". + bm := metadata.BackupMetadata{ + BackupName: "evil", + DataFormat: "directory", + Tables: []metadata.TableTitle{{Database: "db", Table: "t"}}, + CAS: &metadata.CASBackupParams{ + LayoutVersion: cas.LayoutVersion, InlineThreshold: cfg.InlineThreshold, ClusterID: cfg.ClusterID, + }, + } + body, _ := json.Marshal(&bm) + if err := f.PutFile(ctx, cas.MetadataJSONPath(cp, "evil"), + io.NopCloser(bytes.NewReader(body)), int64(len(body))); err != nil { + t.Fatal(err) + } + tm := metadata.TableMetadata{ + Database: "db", Table: "t", + Parts: map[string][]metadata.Part{ + "../escape": {{Name: "all_1_1_0"}}, + }, + } + tmBody, _ := json.Marshal(&tm) + if err := f.PutFile(ctx, cas.TableMetaPath(cp, "evil", "db", "t"), + io.NopCloser(bytes.NewReader(tmBody)), int64(len(tmBody))); err != nil { + t.Fatal(err) + } + + _, err := cas.Download(ctx, f, cfg, "evil", cas.DownloadOptions{LocalBackupDir: t.TempDir()}) + if err == nil { + t.Fatal("expected refusal for traversal disk name") + } + if !strings.Contains(err.Error(), "unsafe disk") { + t.Errorf("expected 'unsafe disk' in error, got: %v", err) + } +} + +// TestDownload_ProjectionRoundTrip uploads a part with a projection, +// downloads it, and verifies every projection file lands at the +// expected nested path with no missing blobs. +func TestDownload_ProjectionRoundTrip(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 4096, HashLow: 1, HashHigh: 2}, + {Name: "columns.txt", Size: 16, HashLow: 3, HashHigh: 4}, + }, + Projections: []testfixtures.ProjectionSpec{{ + Name: "p1", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 2048, HashLow: 5, HashHigh: 6}, + {Name: "columns.txt", Size: 8, HashLow: 7, HashHigh: 8}, + }, + AggregateHashLow: 99, AggregateHashHigh: 99, AggregateSize: 2072, + }}, + }} + _, _, _, root := uploadAndDownload(t, parts, "bk", cas.DownloadOptions{}) + + mustExist := func(p string) { + if _, err := os.Stat(p); err != nil { + t.Errorf("missing after download: %s (%v)", p, err) + } + } + // Download materializes into //shadow///// + partDir := filepath.Join(root, "bk", "shadow", + common.TablePathEncode("db1"), common.TablePathEncode("t1"), + "default", "all_1_1_0") + mustExist(filepath.Join(partDir, "data.bin")) + mustExist(filepath.Join(partDir, "columns.txt")) + mustExist(filepath.Join(partDir, "p1.proj", "checksums.txt")) + mustExist(filepath.Join(partDir, "p1.proj", "data.bin")) + mustExist(filepath.Join(partDir, "p1.proj", "columns.txt")) +} + +// TestDownload_RejectsTraversalPartName covers the same defense for the +// per-Part Name field. +func TestDownload_RejectsTraversalPartName(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + ctx := context.Background() + cp := cfg.ClusterPrefix() + + bm := metadata.BackupMetadata{ + BackupName: "evil", + DataFormat: "directory", + Tables: []metadata.TableTitle{{Database: "db", Table: "t"}}, + CAS: &metadata.CASBackupParams{ + LayoutVersion: cas.LayoutVersion, InlineThreshold: cfg.InlineThreshold, ClusterID: cfg.ClusterID, + }, + } + body, _ := json.Marshal(&bm) + if err := f.PutFile(ctx, cas.MetadataJSONPath(cp, "evil"), + io.NopCloser(bytes.NewReader(body)), int64(len(body))); err != nil { + t.Fatal(err) + } + tm := metadata.TableMetadata{ + Database: "db", Table: "t", + Parts: map[string][]metadata.Part{ + "default": {{Name: "../escape"}}, + }, + } + tmBody, _ := json.Marshal(&tm) + if err := f.PutFile(ctx, cas.TableMetaPath(cp, "evil", "db", "t"), + io.NopCloser(bytes.NewReader(tmBody)), int64(len(tmBody))); err != nil { + t.Fatal(err) + } + + _, err := cas.Download(ctx, f, cfg, "evil", cas.DownloadOptions{LocalBackupDir: t.TempDir()}) + if err == nil { + t.Fatal("expected refusal for traversal part name") + } + if !strings.Contains(err.Error(), "unsafe part name") { + t.Errorf("expected 'unsafe part name' in error, got: %v", err) + } +} + +// truncatingBackend wraps a real Backend but returns a single-byte body for +// any key that contains "/blob/" — simulating a network-truncated blob fetch. +type truncatingBackend struct{ inner cas.Backend } + +func (tb *truncatingBackend) GetFile(ctx context.Context, key string) (io.ReadCloser, error) { + if strings.Contains(key, "/blob/") { + return io.NopCloser(strings.NewReader("X")), nil // 1 byte — always truncated + } + return tb.inner.GetFile(ctx, key) +} +func (tb *truncatingBackend) PutFile(ctx context.Context, key string, r io.ReadCloser, size int64) error { + return tb.inner.PutFile(ctx, key, r, size) +} +func (tb *truncatingBackend) PutFileIfAbsent(ctx context.Context, key string, r io.ReadCloser, size int64) (bool, error) { + return tb.inner.PutFileIfAbsent(ctx, key, r, size) +} +func (tb *truncatingBackend) StatFile(ctx context.Context, key string) (int64, time.Time, bool, error) { + return tb.inner.StatFile(ctx, key) +} +func (tb *truncatingBackend) DeleteFile(ctx context.Context, key string) error { + return tb.inner.DeleteFile(ctx, key) +} +func (tb *truncatingBackend) Walk(ctx context.Context, prefix string, recursive bool, fn func(cas.RemoteFile) error) error { + return tb.inner.Walk(ctx, prefix, recursive, fn) +} + +// TestDownloadBlobs_RejectsTruncatedBlob verifies that Download returns an +// error when the backend delivers fewer bytes than recorded in checksums.txt, +// and that the partial destination file is removed. +func TestDownloadBlobs_RejectsTruncatedBlob(t *testing.T) { + // Build a real backup with one above-threshold blob (size 1024). + const blobSize = 1024 + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 8, HashLow: 1, HashHigh: 0}, + {Name: "data.bin", Size: blobSize, HashLow: 42, HashHigh: 7, Bytes: makeBlobBytes(0xAB)}, + }, + }} + + lb := testfixtures.Build(t, parts) + real := fakedst.New() + cfg := testCfg(100) // threshold=100 so data.bin (1024 bytes) is stored as a blob + + if _, err := cas.Upload(context.Background(), real, cfg, "b1", cas.UploadOptions{LocalBackupDir: lb.Root}); err != nil { + t.Fatalf("Upload: %v", err) + } + + // Wrap the backend with one that truncates blob fetches. + wrapped := &truncatingBackend{inner: real} + + dlRoot := t.TempDir() + _, err := cas.Download(context.Background(), wrapped, cfg, "b1", cas.DownloadOptions{ + LocalBackupDir: dlRoot, + }) + if err == nil { + t.Fatal("expected Download to fail on truncated blob, got nil error") + } + if !strings.Contains(err.Error(), "truncated") { + t.Errorf("error should mention 'truncated'; got: %v", err) + } + if !strings.Contains(err.Error(), "expected") { + t.Errorf("error should mention 'expected' size; got: %v", err) + } + + // The corrupt destination file must not be left behind. + dlPartDir := filepath.Join(dlRoot, "b1", "shadow", + common.TablePathEncode("db1"), common.TablePathEncode("t1"), + "default", "all_1_1_0") + corruptFile := filepath.Join(dlPartDir, "data.bin") + if _, statErr := os.Stat(corruptFile); statErr == nil { + t.Errorf("corrupt partial file was not removed: %s", corruptFile) + } +} + +// failingArchiveBackend wraps a real Backend but returns an error for any +// key that contains "/parts/" — simulating a mid-download archive failure. +type failingArchiveBackend struct{ inner cas.Backend } + +func (fb *failingArchiveBackend) GetFile(ctx context.Context, key string) (io.ReadCloser, error) { + if strings.Contains(key, "/parts/") { + return nil, errors.New("simulated archive download failure") + } + return fb.inner.GetFile(ctx, key) +} +func (fb *failingArchiveBackend) PutFile(ctx context.Context, key string, r io.ReadCloser, size int64) error { + return fb.inner.PutFile(ctx, key, r, size) +} +func (fb *failingArchiveBackend) PutFileIfAbsent(ctx context.Context, key string, r io.ReadCloser, size int64) (bool, error) { + return fb.inner.PutFileIfAbsent(ctx, key, r, size) +} +func (fb *failingArchiveBackend) StatFile(ctx context.Context, key string) (int64, time.Time, bool, error) { + return fb.inner.StatFile(ctx, key) +} +func (fb *failingArchiveBackend) DeleteFile(ctx context.Context, key string) error { + return fb.inner.DeleteFile(ctx, key) +} +func (fb *failingArchiveBackend) Walk(ctx context.Context, prefix string, recursive bool, fn func(cas.RemoteFile) error) error { + return fb.inner.Walk(ctx, prefix, recursive, fn) +} + +// TestDownload_LeavesNoStaleMetadataOnFailure verifies that a failed archive +// download does NOT leave a directory at finalDir that looks like a valid v1 +// backup (i.e. contains metadata.json). +func TestDownload_LeavesNoStaleMetadataOnFailure(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 8, HashLow: 1, HashHigh: 0}, + {Name: "data.bin", Size: 1024, HashLow: 42, HashHigh: 7, Bytes: makeBlobBytes(0xAB)}, + }, + }} + + lb := testfixtures.Build(t, parts) + real := fakedst.New() + cfg := testCfg(100) + + if _, err := cas.Upload(context.Background(), real, cfg, "b1", cas.UploadOptions{LocalBackupDir: lb.Root}); err != nil { + t.Fatalf("Upload: %v", err) + } + + // Wrap backend to fail on archive fetches. + wrapped := &failingArchiveBackend{inner: real} + + dlRoot := t.TempDir() + finalDir := filepath.Join(dlRoot, "b1") + + _, err := cas.Download(context.Background(), wrapped, cfg, "b1", cas.DownloadOptions{ + LocalBackupDir: dlRoot, + }) + if err == nil { + t.Fatal("expected Download to fail on archive error, got nil") + } + + // The final directory must either not exist, or must not contain + // metadata.json — otherwise a v1 restore would accept it as valid. + if _, statErr := os.Stat(filepath.Join(finalDir, "metadata.json")); statErr == nil { + t.Error("metadata.json must NOT exist at finalDir after a failed download (stale partial state)") + } + + // No staging directory siblings should remain. + entries, err2 := os.ReadDir(dlRoot) + if err2 != nil { + t.Fatalf("ReadDir dlRoot: %v", err2) + } + for _, e := range entries { + if strings.Contains(e.Name(), ".cas-staging-") { + t.Errorf("leftover staging directory found: %s", e.Name()) + } + } +} + +// TestDownload_AtomicReplaceOfStaleSameNameDirectory verifies that a +// successful Download replaces any pre-existing same-name directory, so +// the new content is always what's visible at finalDir. +func TestDownload_AtomicReplaceOfStaleSameNameDirectory(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 8, HashLow: 1, HashHigh: 0}, + }, + }} + + lb := testfixtures.Build(t, parts) + real := fakedst.New() + cfg := testCfg(100) + + if _, err := cas.Upload(context.Background(), real, cfg, "b1", cas.UploadOptions{LocalBackupDir: lb.Root}); err != nil { + t.Fatalf("Upload: %v", err) + } + + dlRoot := t.TempDir() + finalDir := filepath.Join(dlRoot, "b1") + + // Pre-populate finalDir with a stale metadata.json. + if err := os.MkdirAll(finalDir, 0o755); err != nil { + t.Fatalf("mkdir finalDir: %v", err) + } + staleContent := []byte(`{"backup_name":"stale","data_format":"directory"}`) + if err := os.WriteFile(filepath.Join(finalDir, "metadata.json"), staleContent, 0o640); err != nil { + t.Fatalf("write stale metadata.json: %v", err) + } + + if _, err := cas.Download(context.Background(), real, cfg, "b1", cas.DownloadOptions{ + LocalBackupDir: dlRoot, + }); err != nil { + t.Fatalf("Download: %v", err) + } + + // The stale content must have been replaced. + newContent, err := os.ReadFile(filepath.Join(finalDir, "metadata.json")) + if err != nil { + t.Fatalf("read metadata.json: %v", err) + } + if bytes.Equal(newContent, staleContent) { + t.Error("metadata.json still contains stale content — atomic replace did not happen") + } + // The new content must be valid JSON with backup_name = "b1". + var bm metadata.BackupMetadata + if err := json.Unmarshal(newContent, &bm); err != nil { + t.Fatalf("parse new metadata.json: %v", err) + } + if bm.BackupName != "b1" { + t.Errorf("new metadata.json: backup_name=%q want b1", bm.BackupName) + } +} + +// TestDownload_WritesHandoffCAS verifies that the local metadata.json written +// by cas-download preserves the CAS field with Handoff = true (N3 fix). +// +// This is the contract that makes the v1 object-disk-skip guards reachable +// during cas-restore: the guards check "backupMetadata.CAS == nil" and skip +// downloadObjectDiskParts when CAS is set. Previously CAS was nil-ed which +// silently defeated those guards and caused restore failures when a table's +// target disk was object-backed. +func TestDownload_WritesHandoffCAS(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 8, HashLow: 1, HashHigh: 0}, + }, + }} + _, _, _, root := uploadAndDownload(t, parts, "bk", cas.DownloadOptions{}) + + body, err := os.ReadFile(filepath.Join(root, "bk", "metadata.json")) + if err != nil { + t.Fatalf("read metadata.json: %v", err) + } + var bm metadata.BackupMetadata + if err := json.Unmarshal(body, &bm); err != nil { + t.Fatalf("parse metadata.json: %v", err) + } + + // CAS must NOT be nil: the object-disk-skip guards in restore.go fire + // only when CAS != nil. + if bm.CAS == nil { + t.Fatal("local metadata.json must have CAS != nil so v1 object-disk-skip guards fire") + } + + // Handoff must be true: the v1 early-refusal guard allows the handoff + // only when CAS.Handoff == true. + if !bm.CAS.Handoff { + t.Fatal("local metadata.json must have CAS.Handoff = true to pass v1 early-refusal guard") + } + + // LayoutVersion and InlineThreshold must be preserved from the remote. + if bm.CAS.LayoutVersion != cas.LayoutVersion { + t.Errorf("CAS.LayoutVersion: got %d want %d", bm.CAS.LayoutVersion, cas.LayoutVersion) + } +} + +// TestDownload_DataOnlyRefuses verifies that --data-only is rejected +// loudly because CAS doesn't yet implement the data-only path. +// Until the feature ships, silently no-op'ing is worse than refusing. +func TestDownload_DataOnlyRefuses(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + _, err := cas.Download(ctx, f, cfg, "any", cas.DownloadOptions{ + LocalBackupDir: t.TempDir(), + DataOnly: true, + }) + if err == nil { + t.Fatal("expected Download to refuse DataOnly") + } + if !strings.Contains(err.Error(), "data-only is not yet implemented") { + t.Errorf("error should mention 'data-only is not yet implemented'; got: %v", err) + } +} diff --git a/pkg/cas/download_traversal_test.go b/pkg/cas/download_traversal_test.go new file mode 100644 index 00000000..5aeead08 --- /dev/null +++ b/pkg/cas/download_traversal_test.go @@ -0,0 +1,78 @@ +package cas + +import ( + "os" + "path/filepath" + "testing" +) + +// TestCollectBlobJobsRecursive_RejectsTraversalProjEntry verifies the download +// recursion rejects a maliciously-crafted .proj entry whose name contains "..". +// Without the validator hoist, the directory traversal would succeed during +// the recursive blob discovery. This is a security regression test for T6. +func TestCollectBlobJobsRecursive_RejectsTraversalProjEntry(t *testing.T) { + // Synthesize a minimal in-memory checksums.txt with a malicious .proj entry + // containing "..". The collectBlobJobsRecursive function should reject it + // at the validateChecksumsTxtFilename stage before attempting filepath.Join. + tmp := t.TempDir() + ckPath := filepath.Join(tmp, "checksums.txt") + // v2 checksums format with one bad .proj entry that contains ".." + body := `checksums format version: 2 +1 files: +../escape.proj size: 100 hash: 1 2 compressed: 0 +` + if err := os.WriteFile(ckPath, []byte(body), 0o644); err != nil { + t.Fatal(err) + } + var blobs []blobJob + var est int64 + err := collectBlobJobsRecursive(tmp, 1024, &blobs, &est) + if err == nil { + t.Fatal("expected error for ..-containing .proj entry") + } + // validateChecksumsTxtFilename's error message should fire. + t.Logf("got expected error: %v", err) +} + +// TestCollectBlobJobsRecursive_RejectsTraversalFilename verifies the download +// recursion also rejects malicious non-.proj filenames containing "..". +func TestCollectBlobJobsRecursive_RejectsTraversalFilename(t *testing.T) { + tmp := t.TempDir() + ckPath := filepath.Join(tmp, "checksums.txt") + // v2 checksums format with one bad non-.proj entry containing ".." + body := `checksums format version: 2 +1 files: +../escape.bin size: 100 hash: 1 2 compressed: 0 +` + if err := os.WriteFile(ckPath, []byte(body), 0o644); err != nil { + t.Fatal(err) + } + var blobs []blobJob + var est int64 + err := collectBlobJobsRecursive(tmp, 1024, &blobs, &est) + if err == nil { + t.Fatal("expected error for ..-containing non-.proj entry") + } + t.Logf("got expected error: %v", err) +} + +// TestCollectBlobJobsRecursive_RejectsAbsolutePath verifies that absolute +// paths in checksums.txt are rejected. +func TestCollectBlobJobsRecursive_RejectsAbsolutePath(t *testing.T) { + tmp := t.TempDir() + ckPath := filepath.Join(tmp, "checksums.txt") + body := `checksums format version: 2 +1 files: +/etc/passwd size: 100 hash: 1 2 compressed: 0 +` + if err := os.WriteFile(ckPath, []byte(body), 0o644); err != nil { + t.Fatal(err) + } + var blobs []blobJob + var est int64 + err := collectBlobJobsRecursive(tmp, 1024, &blobs, &est) + if err == nil { + t.Fatal("expected error for absolute path entry") + } + t.Logf("got expected error: %v", err) +} diff --git a/pkg/cas/errors.go b/pkg/cas/errors.go new file mode 100644 index 00000000..afcea321 --- /dev/null +++ b/pkg/cas/errors.go @@ -0,0 +1,32 @@ +package cas + +import "errors" + +var ( + // Backup classification. + ErrV1Backup = errors.New("cas: refusing to operate on v1 backup") + ErrCASBackup = errors.New("v1: refusing to operate on CAS backup") + ErrUnsupportedLayoutVersion = errors.New("cas: unsupported layout version") + ErrMissingMetadata = errors.New("cas: backup metadata.json missing") + ErrClusterIDMismatch = errors.New("cas: cluster_id mismatch between backup and config") + ErrInvalidBackupName = errors.New("cas: invalid backup name") + + // Lifecycle. + ErrBackupExists = errors.New("cas: backup with this name already exists") + ErrUploadInProgress = errors.New("cas: upload in progress for this name") + ErrPruneInProgress = errors.New("cas: prune in progress") + ErrNoInProgressMarker = errors.New("cas: no inprogress marker found for backup") + + // Pre-flight. + ErrObjectDiskRefused = errors.New("cas: object-disk tables not supported in v1 of CAS") + + // Verify. + ErrVerifyFailures = errors.New("cas-verify: failures detected") + + // ErrConditionalPutNotSupported is returned by PutFileIfAbsent when the + // underlying backend cannot perform an atomic conditional write. + // pkg/cas cannot import pkg/storage (import cycle), so this is a + // separate sentinel; the casstorage adapter translates + // storage.ErrConditionalPutNotSupported into this value. + ErrConditionalPutNotSupported = errors.New("conditional PutFile not supported by this backend") +) diff --git a/pkg/cas/export_test.go b/pkg/cas/export_test.go new file mode 100644 index 00000000..13e41ff2 --- /dev/null +++ b/pkg/cas/export_test.go @@ -0,0 +1,27 @@ +// export_test.go exposes unexported symbols to the cas_test package. +// This file is compiled only during testing. +package cas + +import ( + "context" + "time" +) + +// WaitForPrune is the exported test shim for the unexported waitForPrune. +func WaitForPrune(ctx context.Context, b Backend, clusterPrefix string, wait time.Duration) error { + return waitForPrune(ctx, b, clusterPrefix, wait) +} + +// SetPollIntervalForTesting sets the package-level testing override for the +// poll interval. Pass nil to restore production behaviour. +func SetPollIntervalForTesting(d *time.Duration) { + pollIntervalForTesting = d +} + +// ProbeKeyPrefix is the exported test shim for the unexported probeKeyPrefix constant. +// Used by probe_test.go to assert sentinel cleanup and key uniqueness. +const ProbeKeyPrefix = probeKeyPrefix + +// TableFilterMatches is the exported test shim for the unexported tableFilterMatches. +// Used by upload_test.go to verify glob-pattern semantics. +var TableFilterMatches = tableFilterMatches diff --git a/pkg/cas/internal/fakedst/fakedst.go b/pkg/cas/internal/fakedst/fakedst.go new file mode 100644 index 00000000..c4cdeeff --- /dev/null +++ b/pkg/cas/internal/fakedst/fakedst.go @@ -0,0 +1,199 @@ +package fakedst + +import ( + "bytes" + "context" + "errors" + "io" + "sort" + "strings" + "sync" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" +) + +// Fake is an in-memory implementation of cas.Backend for use in tests. +type Fake struct { + mu sync.Mutex + files map[string]fakeFile + statHook func(key string) (size int64, modTime time.Time, exists bool, err error, override bool) + putHook func(key string) (err error, override bool) + deleteHook func(key string) (err error, override bool) +} + +type fakeFile struct { + data []byte + modTime time.Time +} + +// New returns an empty Fake backend. +func New() *Fake { return &Fake{files: map[string]fakeFile{}} } + +// SetModTime is a test-only helper for ageing fixtures. +func (f *Fake) SetModTime(key string, t time.Time) { + f.mu.Lock() + defer f.mu.Unlock() + if e, ok := f.files[key]; ok { + e.modTime = t + f.files[key] = e + } +} + +// SetStatHook installs a function consulted by StatFile before its normal +// lookup. If the hook returns override=true, its other return values are +// used verbatim. Used by tests to inject errors at specific keys. +func (f *Fake) SetStatHook(h func(key string) (int64, time.Time, bool, error, bool)) { + f.mu.Lock() + defer f.mu.Unlock() + f.statHook = h +} + +// SetPutHook installs a function consulted by PutFile and PutFileIfAbsent +// before the normal store. If the hook returns override=true and a non-nil +// error, that error is returned instead of writing. Used by tests to inject +// errors at specific keys. +func (f *Fake) SetPutHook(h func(key string) (err error, override bool)) { + f.mu.Lock() + defer f.mu.Unlock() + f.putHook = h +} + +// SetDeleteHook installs a function consulted by DeleteFile before the normal +// delete. If the hook returns override=true and a non-nil error, that error is +// returned instead of deleting. Used by tests to inject delete failures. +func (f *Fake) SetDeleteHook(h func(key string) (err error, override bool)) { + f.mu.Lock() + defer f.mu.Unlock() + f.deleteHook = h +} + +// Len is a test helper for assertions. +func (f *Fake) Len() int { + f.mu.Lock() + defer f.mu.Unlock() + return len(f.files) +} + +func (f *Fake) PutFile(ctx context.Context, key string, r io.ReadCloser, size int64) error { + defer r.Close() + var buf bytes.Buffer + if _, err := io.Copy(&buf, r); err != nil { + return err + } + f.mu.Lock() + hook := f.putHook + f.mu.Unlock() + if hook != nil { + if err, override := hook(key); override && err != nil { + return err + } + } + f.mu.Lock() + defer f.mu.Unlock() + f.files[key] = fakeFile{data: buf.Bytes(), modTime: time.Now()} + return nil +} + +// PutFileIfAbsent atomically writes data at key only if not present. +// In the in-memory fake, this is a single map operation under the lock. +func (f *Fake) PutFileIfAbsent(ctx context.Context, key string, data io.ReadCloser, size int64) (bool, error) { + body, err := io.ReadAll(data) + _ = data.Close() + if err != nil { + return false, err + } + f.mu.Lock() + hook := f.putHook + f.mu.Unlock() + if hook != nil { + if err, override := hook(key); override && err != nil { + return false, err + } + } + f.mu.Lock() + defer f.mu.Unlock() + if _, exists := f.files[key]; exists { + return false, nil + } + f.files[key] = fakeFile{data: body, modTime: time.Now()} + return true, nil +} + +func (f *Fake) GetFile(ctx context.Context, key string) (io.ReadCloser, error) { + f.mu.Lock() + defer f.mu.Unlock() + e, ok := f.files[key] + if !ok { + return nil, errors.New("fakedst: not found") + } + return io.NopCloser(bytes.NewReader(append([]byte(nil), e.data...))), nil +} + +func (f *Fake) StatFile(ctx context.Context, key string) (int64, time.Time, bool, error) { + f.mu.Lock() + hook := f.statHook + f.mu.Unlock() + if hook != nil { + if size, modTime, exists, err, override := hook(key); override { + return size, modTime, exists, err + } + } + f.mu.Lock() + defer f.mu.Unlock() + e, ok := f.files[key] + if !ok { + return 0, time.Time{}, false, nil + } + return int64(len(e.data)), e.modTime, true, nil +} + +func (f *Fake) DeleteFile(ctx context.Context, key string) error { + f.mu.Lock() + hook := f.deleteHook + f.mu.Unlock() + if hook != nil { + if err, override := hook(key); override && err != nil { + return err + } + } + f.mu.Lock() + defer f.mu.Unlock() + delete(f.files, key) + return nil +} + +func (f *Fake) Walk(ctx context.Context, prefix string, recursive bool, fn func(cas.RemoteFile) error) error { + f.mu.Lock() + keys := make([]string, 0, len(f.files)) + for k := range f.files { + if !strings.HasPrefix(k, prefix) { + continue + } + if !recursive { + // Only emit one-level entries: skip keys that contain '/' after the prefix. + rest := strings.TrimPrefix(k, prefix) + if strings.Contains(rest, "/") { + continue + } + } + keys = append(keys, k) + } + snapshot := make(map[string]fakeFile, len(keys)) + for _, k := range keys { + snapshot[k] = f.files[k] + } + f.mu.Unlock() + + sort.Strings(keys) + for _, k := range keys { + e := snapshot[k] + if err := fn(cas.RemoteFile{Key: k, Size: int64(len(e.data)), ModTime: e.modTime}); err != nil { + return err + } + } + return nil +} + +// compile-time assertion +var _ cas.Backend = (*Fake)(nil) diff --git a/pkg/cas/internal/fakedst/fakedst_test.go b/pkg/cas/internal/fakedst/fakedst_test.go new file mode 100644 index 00000000..ab1322c8 --- /dev/null +++ b/pkg/cas/internal/fakedst/fakedst_test.go @@ -0,0 +1,104 @@ +package fakedst + +import ( + "bytes" + "context" + "io" + "reflect" + "sort" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" +) + +func TestFake_PutGetStatDelete(t *testing.T) { + f := New() + ctx := context.Background() + + body := io.NopCloser(bytes.NewReader([]byte("hello"))) + if err := f.PutFile(ctx, "a/b", body, 5); err != nil { + t.Fatal(err) + } + + sz, _, exists, err := f.StatFile(ctx, "a/b") + if err != nil || !exists || sz != 5 { + t.Fatalf("stat: sz=%d exists=%v err=%v", sz, exists, err) + } + + _, _, exists, err = f.StatFile(ctx, "missing") + if err != nil || exists { + t.Fatalf("stat missing: exists=%v err=%v", exists, err) + } + + rc, err := f.GetFile(ctx, "a/b") + if err != nil { + t.Fatal(err) + } + got, _ := io.ReadAll(rc) + rc.Close() + if string(got) != "hello" { + t.Fatalf("got %q", got) + } + + if err := f.DeleteFile(ctx, "a/b"); err != nil { + t.Fatal(err) + } + _, _, exists, _ = f.StatFile(ctx, "a/b") + if exists { + t.Fatal("after delete must not exist") + } +} + +func TestFake_WalkRecursive(t *testing.T) { + f := New() + ctx := context.Background() + + for _, k := range []string{"p/a", "p/b/c", "p/b/d", "q/e"} { + _ = f.PutFile(ctx, k, io.NopCloser(bytes.NewReader(nil)), 0) + } + + var got []string + _ = f.Walk(ctx, "p/", true, func(r cas.RemoteFile) error { + got = append(got, r.Key) + return nil + }) + sort.Strings(got) + want := []string{"p/a", "p/b/c", "p/b/d"} + if !reflect.DeepEqual(got, want) { + t.Fatalf("recursive: got %v want %v", got, want) + } +} + +func TestFake_WalkNonRecursive(t *testing.T) { + f := New() + ctx := context.Background() + + for _, k := range []string{"p/a", "p/b/c", "p/d"} { + _ = f.PutFile(ctx, k, io.NopCloser(bytes.NewReader(nil)), 0) + } + + var got []string + _ = f.Walk(ctx, "p/", false, func(r cas.RemoteFile) error { + got = append(got, r.Key) + return nil + }) + sort.Strings(got) + want := []string{"p/a", "p/d"} + if !reflect.DeepEqual(got, want) { + t.Fatalf("non-recursive: got %v want %v", got, want) + } +} + +func TestFake_SetModTime(t *testing.T) { + f := New() + ctx := context.Background() + + _ = f.PutFile(ctx, "k", io.NopCloser(bytes.NewReader(nil)), 0) + past := time.Now().Add(-72 * time.Hour) + f.SetModTime("k", past) + _, mt, _, _ := f.StatFile(ctx, "k") + if !mt.Equal(past) { + t.Fatalf("modtime: got %v want %v", mt, past) + } +} diff --git a/pkg/cas/internal/testfixtures/localbackup.go b/pkg/cas/internal/testfixtures/localbackup.go new file mode 100644 index 00000000..a09d6609 --- /dev/null +++ b/pkg/cas/internal/testfixtures/localbackup.go @@ -0,0 +1,247 @@ +// Package testfixtures provides helpers for synthesizing a "fake local +// backup directory" tree that mirrors what `clickhouse-backup create` +// produces, so tests can drive the CAS upload path without a live +// ClickHouse instance. +package testfixtures + +import ( + "encoding/json" + "fmt" + "os" + "path/filepath" + "strings" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/common" + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" +) + +// LocalBackup describes the synthesized backup-on-disk layout returned +// by Build. +type LocalBackup struct { + // Root is the absolute path of the synthesized backup directory. + Root string + // Parts indexes the original PartSpec slices used to build the layout, + // keyed by "disk:db.table" for easy lookup in tests. + Parts map[string][]PartSpec +} + +// PartSpec describes one MergeTree-style part to materialize on disk. +type PartSpec struct { + Disk, DB, Table, Name string + Files []FileSpec // every file the part contains, including any "checksums.txt"-listed files + Projections []ProjectionSpec + // TableMeta is optional. When zero-value, Build still writes a minimal + // v1 metadata//
.json so cas-upload's merge logic has + // something to read. + TableMeta metadata.TableMetadata +} + +// FileSpec describes one file inside a part. +// +// Bytes is optional: if non-nil the bytes are written verbatim; otherwise +// Build synthesizes Size deterministic pseudo-bytes based on Name. The +// CAS upload path trusts checksums.txt — the actual file bytes do not +// need to hash to (HashLow, HashHigh). +type FileSpec struct { + Name string + Size uint64 + HashLow uint64 + HashHigh uint64 + Bytes []byte +} + +// ProjectionSpec describes one projection subpart inside a parent part. +// The parent's checksums.txt gets an entry ".proj" with the given +// aggregate (hash, size). The projection itself is materialized as a +// subdirectory .proj/ containing the listed files plus its own +// checksums.txt. +type ProjectionSpec struct { + Name string // e.g. "p1" — the on-disk dir is .proj + Files []FileSpec // files inside the projection subdir + AggregateHashLow uint64 + AggregateHashHigh uint64 + AggregateSize uint64 +} + +// Build creates a temp directory tree for the given parts and returns +// the resulting LocalBackup. checksums.txt is always written last for +// each part with the v2 text format listing every other file. +// +// The layout matches what `clickhouse-backup create` produces: +// +// /shadow///// +// /metadata//.json +// +// Encoding is applied to db and table components on the filesystem so +// that tests with special characters (hyphen, dot, space, etc.) exercise +// the real upload code path. Disk names are written verbatim (real +// ClickHouse disk names are constrained at config-load time). +func Build(t *testing.T, parts []PartSpec) *LocalBackup { + t.Helper() + root := t.TempDir() + lb := &LocalBackup{ + Root: root, + Parts: make(map[string][]PartSpec), + } + for _, p := range parts { + key := p.Disk + ":" + p.DB + "." + p.Table + lb.Parts[key] = append(lb.Parts[key], p) + dbEnc := common.TablePathEncode(p.DB) + tableEnc := common.TablePathEncode(p.Table) + partDir := filepath.Join(root, "shadow", dbEnc, tableEnc, p.Disk, p.Name) + if err := os.MkdirAll(partDir, 0o755); err != nil { + t.Fatalf("mkdir %s: %v", partDir, err) + } + + // Write every "real" file first. + var listed []FileSpec + for _, f := range p.Files { + if f.Name == "checksums.txt" { + // If caller provides a checksums.txt entry we ignore its + // bytes and synthesize the v2 file ourselves; we still + // include it in the listed set so it appears in the body + // (callers can include it intentionally). + continue + } + listed = append(listed, f) + data := f.Bytes + if data == nil { + data = synthBytes(f.Name, f.Size) + } + if uint64(len(data)) != f.Size { + t.Fatalf("file %q: bytes length %d != size %d", f.Name, len(data), f.Size) + } + fp := filepath.Join(partDir, f.Name) + if err := os.MkdirAll(filepath.Dir(fp), 0o755); err != nil { + t.Fatalf("mkdir %s: %v", filepath.Dir(fp), err) + } + if err := os.WriteFile(fp, data, 0o644); err != nil { + t.Fatalf("write %s: %v", fp, err) + } + } + + // Materialize projections: /.proj/{files..., checksums.txt} + for _, proj := range p.Projections { + projDir := filepath.Join(partDir, proj.Name+".proj") + if err := os.MkdirAll(projDir, 0o755); err != nil { + t.Fatalf("mkdir %s: %v", projDir, err) + } + var projListed []FileSpec + for _, f := range proj.Files { + if f.Name == "checksums.txt" { + continue + } + projListed = append(projListed, f) + data := f.Bytes + if data == nil { + data = synthBytes(f.Name, f.Size) + } + if uint64(len(data)) != f.Size { + t.Fatalf("projection %q file %q: bytes length %d != size %d", + proj.Name, f.Name, len(data), f.Size) + } + fp := filepath.Join(projDir, f.Name) + if err := os.MkdirAll(filepath.Dir(fp), 0o755); err != nil { + t.Fatalf("mkdir %s: %v", filepath.Dir(fp), err) + } + if err := os.WriteFile(fp, data, 0o644); err != nil { + t.Fatalf("write %s: %v", fp, err) + } + } + ck := buildChecksumsV2(projListed) + ckPath := filepath.Join(projDir, "checksums.txt") + if err := os.WriteFile(ckPath, []byte(ck), 0o644); err != nil { + t.Fatalf("write %s: %v", ckPath, err) + } + // Add the projection entry to the parent's listed set so it + // shows up in the parent's checksums.txt with the .proj suffix. + listed = append(listed, FileSpec{ + Name: proj.Name + ".proj", + Size: proj.AggregateSize, + HashLow: proj.AggregateHashLow, + HashHigh: proj.AggregateHashHigh, + }) + } + + // Synthesize checksums.txt last. + ck := buildChecksumsV2(listed) + ckPath := filepath.Join(partDir, "checksums.txt") + if err := os.WriteFile(ckPath, []byte(ck), 0o644); err != nil { + t.Fatalf("write %s: %v", ckPath, err) + } + } + + // Write one v1-style metadata//
.json per (db, table). Mimics + // what `clickhouse-backup create` writes; cas-upload merges the schema + // fields from these files into the uploaded TableMetadata. + seen := map[string]bool{} + for _, p := range parts { + key := p.DB + "." + p.Table + if seen[key] { + continue + } + seen[key] = true + + tm := p.TableMeta + if tm.Database == "" { + tm.Database = p.DB + } + if tm.Table == "" { + tm.Table = p.Table + } + if tm.Query == "" { + tm.Query = "CREATE TABLE " + p.DB + "." + p.Table + " (id UInt64) ENGINE=MergeTree ORDER BY id" + } + if tm.UUID == "" { + tm.UUID = "00000000-0000-0000-0000-000000000000" + } + + metaDir := filepath.Join(root, "metadata", common.TablePathEncode(p.DB)) + if err := os.MkdirAll(metaDir, 0o755); err != nil { + t.Fatalf("mkdir %s: %v", metaDir, err) + } + body, err := json.MarshalIndent(&tm, "", "\t") + if err != nil { + t.Fatalf("marshal table metadata %s.%s: %v", p.DB, p.Table, err) + } + metaPath := filepath.Join(metaDir, common.TablePathEncode(p.Table)+".json") + if err := os.WriteFile(metaPath, body, 0o644); err != nil { + t.Fatalf("write %s: %v", metaPath, err) + } + } + return lb +} + +// buildChecksumsV2 emits a v2 text-format checksums.txt body for the +// given files. None of the files are marked compressed. +func buildChecksumsV2(files []FileSpec) string { + var b strings.Builder + b.WriteString("checksums format version: 2\n") + fmt.Fprintf(&b, "%d files:\n", len(files)) + for _, f := range files { + b.WriteString(f.Name) + b.WriteByte('\n') + fmt.Fprintf(&b, "\tsize: %d\n", f.Size) + fmt.Fprintf(&b, "\thash: %d %d\n", f.HashLow, f.HashHigh) + b.WriteString("\tcompressed: 0\n") + } + return b.String() +} + +// synthBytes returns a deterministic pseudo-random byte slice of the +// requested size, seeded by name. We don't need cryptographic quality — +// just stable bytes that tests can predict if they need to. +func synthBytes(name string, size uint64) []byte { + out := make([]byte, size) + // Cheap LCG seeded from the name's bytes. + var seed uint64 = 1469598103934665603 // FNV offset basis-ish + for i := 0; i < len(name); i++ { + seed = seed*1099511628211 ^ uint64(name[i]) + } + for i := uint64(0); i < size; i++ { + seed = seed*6364136223846793005 + 1442695040888963407 + out[i] = byte(seed >> 56) + } + return out +} diff --git a/pkg/cas/internal/testfixtures/localbackup_test.go b/pkg/cas/internal/testfixtures/localbackup_test.go new file mode 100644 index 00000000..75fa8567 --- /dev/null +++ b/pkg/cas/internal/testfixtures/localbackup_test.go @@ -0,0 +1,128 @@ +package testfixtures + +import ( + "os" + "path/filepath" + "strings" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/checksumstxt" +) + +func TestBuild_OnePart_ChecksumsRoundTrip(t *testing.T) { + parts := []PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 100, HashHigh: 200}, + {Name: "primary.idx", Size: 8, HashLow: 300, HashHigh: 400}, + {Name: "data.bin", Size: 1024, HashLow: 500, HashHigh: 600}, + }, + }} + lb := Build(t, parts) + + ckPath := filepath.Join(lb.Root, "shadow", "db1", "t1", "default", "all_1_1_0", "checksums.txt") + f, err := os.Open(ckPath) + if err != nil { + t.Fatalf("open: %v", err) + } + defer f.Close() + got, err := checksumstxt.Parse(f) + if err != nil { + t.Fatalf("parse: %v", err) + } + if got.Version != 2 { + t.Errorf("version: got %d want 2", got.Version) + } + if len(got.Files) != 3 { + t.Fatalf("files count: got %d want 3", len(got.Files)) + } + for _, want := range parts[0].Files { + gc, ok := got.Files[want.Name] + if !ok { + t.Errorf("file %q missing from parsed checksums", want.Name) + continue + } + if gc.FileSize != want.Size { + t.Errorf("%s size: got %d want %d", want.Name, gc.FileSize, want.Size) + } + if gc.FileHash.Low != want.HashLow || gc.FileHash.High != want.HashHigh { + t.Errorf("%s hash: got (%d,%d) want (%d,%d)", want.Name, + gc.FileHash.Low, gc.FileHash.High, want.HashLow, want.HashHigh) + } + // Verify file bytes were written with the claimed size. + fp := filepath.Join(lb.Root, "shadow", "db1", "t1", "default", "all_1_1_0", want.Name) + st, err := os.Stat(fp) + if err != nil { + t.Errorf("stat %s: %v", fp, err) + continue + } + if uint64(st.Size()) != want.Size { + t.Errorf("%s on-disk size: got %d want %d", want.Name, st.Size(), want.Size) + } + } +} + +func TestBuild_PartsIndexed(t *testing.T) { + parts := []PartSpec{ + {Disk: "default", DB: "db1", Table: "t1", Name: "p1", Files: []FileSpec{{Name: "columns.txt", Size: 5, HashLow: 1, HashHigh: 2}}}, + {Disk: "default", DB: "db1", Table: "t1", Name: "p2", Files: []FileSpec{{Name: "columns.txt", Size: 5, HashLow: 3, HashHigh: 4}}}, + {Disk: "fast", DB: "db1", Table: "t2", Name: "p1", Files: []FileSpec{{Name: "columns.txt", Size: 5, HashLow: 5, HashHigh: 6}}}, + } + lb := Build(t, parts) + if got, want := len(lb.Parts["default:db1.t1"]), 2; got != want { + t.Errorf("default:db1.t1 parts: got %d want %d", got, want) + } + if got, want := len(lb.Parts["fast:db1.t2"]), 1; got != want { + t.Errorf("fast:db1.t2 parts: got %d want %d", got, want) + } +} + +// TestBuild_WithProjections verifies the fixture builder writes p1.proj/ +// subdirectories with their own checksums.txt and adds a parent +// checksums.txt entry whose name ends with .proj. +func TestBuild_WithProjections(t *testing.T) { + parts := []PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []FileSpec{ + {Name: "data.bin", Size: 8, HashLow: 1, HashHigh: 2}, + }, + Projections: []ProjectionSpec{{ + Name: "p1", + Files: []FileSpec{ + {Name: "data.bin", Size: 4, HashLow: 10, HashHigh: 20}, + {Name: "columns.txt", Size: 6, HashLow: 30, HashHigh: 40}, + }, + AggregateHashLow: 100, + AggregateHashHigh: 200, + AggregateSize: 10, + }}, + }} + lb := Build(t, parts) + + // Parent checksums.txt must list the projection as p1.proj. + parentCk := filepath.Join(lb.Root, "shadow", "db1", "t1", "default", "all_1_1_0", "checksums.txt") + parentBody, err := os.ReadFile(parentCk) + if err != nil { + t.Fatal(err) + } + if !strings.Contains(string(parentBody), "p1.proj") { + t.Errorf("parent checksums.txt missing p1.proj entry; body:\n%s", string(parentBody)) + } + + // Projection subdir must exist with its own checksums.txt. + projCk := filepath.Join(lb.Root, "shadow", "db1", "t1", "default", "all_1_1_0", "p1.proj", "checksums.txt") + if _, err := os.Stat(projCk); err != nil { + t.Fatalf("projection checksums.txt missing: %v", err) + } + projBody, err := os.ReadFile(projCk) + if err != nil { + t.Fatal(err) + } + if !strings.Contains(string(projBody), "data.bin") { + t.Errorf("projection checksums.txt missing data.bin; body:\n%s", string(projBody)) + } + // Projection's own data files must be on disk. + if _, err := os.Stat(filepath.Join(lb.Root, "shadow", "db1", "t1", "default", "all_1_1_0", "p1.proj", "data.bin")); err != nil { + t.Errorf("projection data.bin not materialized: %v", err) + } +} diff --git a/pkg/cas/list.go b/pkg/cas/list.go new file mode 100644 index 00000000..26995e4b --- /dev/null +++ b/pkg/cas/list.go @@ -0,0 +1,137 @@ +package cas + +import ( + "context" + "encoding/json" + "fmt" + "io" + "sort" + "strings" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" +) + +// CASListEntry is the user-facing summary of one CAS backup, surfaced by +// `clickhouse-backup list remote` so operators can see CAS backups alongside +// v1 backups. It is intentionally a thin DTO — full validation and sizing +// belongs to cas-status / cas-verify. +type CASListEntry struct { + // Name is the CAS backup name (the directory segment under + // cas//metadata/). + Name string + + // UploadedAt is taken from the CAS metadata.json's UploadedAt field + // when parseable; otherwise the metadata object's mod time. Used for + // stable descending sort in the list output. + UploadedAt time.Time + + // SizeBytes is the rolled-up `bytes` field from metadata.json (the + // logical size of the source backup), if present. Zero when the + // metadata cannot be parsed. + SizeBytes int64 + + // Description is the tag rendered in the v1 list-remote output. The + // "[CAS]" prefix is what makes the row distinguishable from a v1 + // backup; downstream callers may append a status tag (e.g. "broken"). + Description string +} + +// ListRemoteCAS walks cas//metadata//metadata.json and +// returns one entry per backup. When CAS is disabled this is a no-op +// returning (nil, nil). +// +// Errors from individual metadata.json reads do NOT abort the listing — the +// affected entry is still emitted with Description "[CAS] (broken: )" +// so the operator sees the partial state. Only Walk-level errors (failure to +// enumerate the metadata/ subtree at all) propagate. +func ListRemoteCAS(ctx context.Context, b Backend, cfg Config) ([]CASListEntry, error) { + if !cfg.Enabled { + return nil, nil + } + cp := cfg.ClusterPrefix() + metadataPrefix := cp + "metadata/" + + // Collect candidate metadata.json keys first; we read them in a + // second pass so we don't hold the Walk callback open across remote + // reads (some backends serialize calls on the same connection). + // + // Backends differ in whether the Key surfaced by Walk is absolute + // (fakedst) or prefix-stripped (the casstorage adapter, which uses + // rf.Name()). Suffix/HasPrefix matching plus TrimPrefix tolerates + // both forms; we reconstruct the absolute key for the subsequent + // GetFile call from the parsed backup name. + type cand struct { + name string + modTime time.Time + } + var candidates []cand + err := b.Walk(ctx, metadataPrefix, true, func(rf RemoteFile) error { + if !strings.HasSuffix(rf.Key, "/metadata.json") { + return nil + } + rest := strings.TrimPrefix(rf.Key, metadataPrefix) + rest = strings.TrimSuffix(rest, "/metadata.json") + // Only direct children: cas//metadata//metadata.json. + // Anything deeper (table metadata, parts) is not a backup root. + if rest == "" || strings.Contains(rest, "/") { + return nil + } + candidates = append(candidates, cand{name: rest, modTime: rf.ModTime}) + return nil + }) + if err != nil { + return nil, fmt.Errorf("cas: list remote walk %s: %w", metadataPrefix, err) + } + + entries := make([]CASListEntry, 0, len(candidates)) + for _, c := range candidates { + entry := CASListEntry{ + Name: c.name, + UploadedAt: c.modTime, + Description: "[CAS]", + } + // Parse metadata.json to refine UploadedAt and recover the + // logical bytes. Failures degrade the entry to "broken" but + // never drop it from the list. + absKey := MetadataJSONPath(cp, c.name) + r, openErr := b.GetFile(ctx, absKey) + if openErr != nil { + entry.Description = fmt.Sprintf("[CAS] (broken: open metadata.json: %v)", openErr) + entries = append(entries, entry) + continue + } + body, readErr := io.ReadAll(r) + _ = r.Close() + if readErr != nil { + entry.Description = fmt.Sprintf("[CAS] (broken: read metadata.json: %v)", readErr) + entries = append(entries, entry) + continue + } + var bm metadata.BackupMetadata + if jsonErr := json.Unmarshal(body, &bm); jsonErr != nil { + entry.Description = fmt.Sprintf("[CAS] (broken: parse metadata.json: %v)", jsonErr) + entries = append(entries, entry) + continue + } + if !bm.CreationDate.IsZero() { + entry.UploadedAt = bm.CreationDate + } + // CAS metadata.json's CreationDate is the upload moment; the + // rolled-up logical size is the sum of per-class sizes from + // the v1 schema (CAS only ever populates the data/metadata + // fields, but tolerate any combination here). + entry.SizeBytes = int64(bm.DataSize + bm.MetadataSize + bm.RBACSize + bm.ConfigSize + bm.NamedCollectionsSize) + // Distinguish v1 metadata.json that happens to live under cas/ + // (defensive — should not happen) from real CAS metadata. + if bm.CAS == nil { + entry.Description = "[CAS] (broken: missing cas params)" + } + entries = append(entries, entry) + } + + sort.Slice(entries, func(i, j int) bool { + return entries[i].UploadedAt.After(entries[j].UploadedAt) + }) + return entries, nil +} diff --git a/pkg/cas/list_test.go b/pkg/cas/list_test.go new file mode 100644 index 00000000..33d3e919 --- /dev/null +++ b/pkg/cas/list_test.go @@ -0,0 +1,107 @@ +package cas_test + +import ( + "bytes" + "context" + "io" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/testfixtures" +) + +func TestListRemoteCAS_FindsBackups(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + ctx := context.Background() + + src := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + if _, err := cas.Upload(ctx, f, cfg, "bk1", cas.UploadOptions{LocalBackupDir: src.Root}); err != nil { + t.Fatalf("upload bk1: %v", err) + } + if _, err := cas.Upload(ctx, f, cfg, "bk2", cas.UploadOptions{LocalBackupDir: src.Root}); err != nil { + t.Fatalf("upload bk2: %v", err) + } + entries, err := cas.ListRemoteCAS(ctx, f, cfg) + if err != nil { + t.Fatalf("ListRemoteCAS: %v", err) + } + if len(entries) != 2 { + t.Fatalf("got %d entries, want 2: %+v", len(entries), entries) + } + names := map[string]bool{} + for _, e := range entries { + names[e.Name] = true + if e.Description != "[CAS]" { + t.Errorf("entry %q description = %q, want %q", e.Name, e.Description, "[CAS]") + } + if e.UploadedAt.IsZero() { + t.Errorf("entry %q has zero UploadedAt", e.Name) + } + } + if !names["bk1"] || !names["bk2"] { + t.Errorf("missing expected names, got %+v", names) + } +} + +func TestListRemoteCAS_DisabledReturnsNil(t *testing.T) { + cfg := testCfg(100) + cfg.Enabled = false + entries, err := cas.ListRemoteCAS(context.Background(), fakedst.New(), cfg) + if err != nil { + t.Fatalf("err: %v", err) + } + if entries != nil { + t.Fatalf("want nil, got %+v", entries) + } +} + +func TestListRemoteCAS_IgnoresNestedMetadataJSON(t *testing.T) { + // table-level metadata files live deeper than /metadata.json, + // so they must not show up as backup roots. + f := fakedst.New() + cfg := testCfg(100) + ctx := context.Background() + + src := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + if _, err := cas.Upload(ctx, f, cfg, "only", cas.UploadOptions{LocalBackupDir: src.Root}); err != nil { + t.Fatalf("upload: %v", err) + } + entries, err := cas.ListRemoteCAS(ctx, f, cfg) + if err != nil { + t.Fatalf("ListRemoteCAS: %v", err) + } + if len(entries) != 1 { + t.Fatalf("got %d entries, want 1: %+v", len(entries), entries) + } + if entries[0].Name != "only" { + t.Errorf("name: got %q want %q", entries[0].Name, "only") + } +} + +func TestListRemoteCAS_BrokenMetadataIsSurfacedNotDropped(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + ctx := context.Background() + + cp := cfg.ClusterPrefix() + bad := cas.MetadataJSONPath(cp, "broken") + body := []byte("{this is not json") + if err := f.PutFile(ctx, bad, io.NopCloser(bytes.NewReader(body)), int64(len(body))); err != nil { + t.Fatalf("PutFile: %v", err) + } + entries, err := cas.ListRemoteCAS(ctx, f, cfg) + if err != nil { + t.Fatalf("ListRemoteCAS: %v", err) + } + if len(entries) != 1 { + t.Fatalf("want 1 entry, got %d", len(entries)) + } + if entries[0].Name != "broken" { + t.Errorf("name: got %q", entries[0].Name) + } + if entries[0].Description == "[CAS]" { + t.Errorf("expected broken description, got %q", entries[0].Description) + } +} diff --git a/pkg/cas/markers.go b/pkg/cas/markers.go new file mode 100644 index 00000000..5cc73b8e --- /dev/null +++ b/pkg/cas/markers.go @@ -0,0 +1,152 @@ +package cas + +import ( + "bytes" + "context" + "crypto/rand" + "encoding/hex" + "encoding/json" + "io" + "os" + "time" +) + +// markerTool is embedded in marker JSON for forensic context. Set by callers +// (typically to "clickhouse-backup "); empty is fine. +var markerTool = "clickhouse-backup" + +// SetMarkerTool overrides the tool string written into new markers. Intended +// to be called once at startup with a version-tagged identifier. +func SetMarkerTool(tool string) { markerTool = tool } + +// hostname returns the host's name; on error returns "unknown". +func hostname() string { + h, err := os.Hostname() + if err != nil || h == "" { + return "unknown" + } + return h +} + +// nowRFC3339 returns the current UTC time in RFC3339 format. +func nowRFC3339() string { return time.Now().UTC().Format(time.RFC3339) } + +// WriteInProgressMarker atomically creates cas//inprogress/.marker. +// Returns (true, nil) on successful create; (false, nil) if a marker +// already exists (another upload is in progress); (false, ErrConditionalPutNotSupported) +// when the backend can't do atomic create. +func WriteInProgressMarker(ctx context.Context, b Backend, clusterPrefix, backup, host string) (created bool, err error) { + return WriteInProgressMarkerWithTool(ctx, b, clusterPrefix, backup, host, markerTool) +} + +// WriteInProgressMarkerWithTool is like WriteInProgressMarker but accepts an +// explicit tool identifier written into the marker JSON. Use this when the +// caller is not "cas-upload" (e.g. "cas-delete") so that concurrent operations +// can surface the right diagnostic in error messages. +func WriteInProgressMarkerWithTool(ctx context.Context, b Backend, clusterPrefix, backup, host, tool string) (created bool, err error) { + if host == "" { + host = hostname() + } + if tool == "" { + tool = markerTool + } + m := InProgressMarker{Backup: backup, Host: host, StartedAt: nowRFC3339(), Tool: tool} + data, _ := json.Marshal(m) + return b.PutFileIfAbsent(ctx, InProgressMarkerPath(clusterPrefix, backup), + io.NopCloser(bytes.NewReader(data)), int64(len(data))) +} + +// ReadInProgressMarker returns the parsed marker. Returns an error wrapping +// io.EOF (or similar) if the marker doesn't exist; callers can use StatFile +// for an exists/not-exists probe instead. +func ReadInProgressMarker(ctx context.Context, b Backend, clusterPrefix, backup string) (*InProgressMarker, error) { + raw, err := getBytes(ctx, b, InProgressMarkerPath(clusterPrefix, backup)) + if err != nil { + return nil, err + } + var m InProgressMarker + if err := json.Unmarshal(raw, &m); err != nil { + return nil, err + } + return &m, nil +} + +// DeleteInProgressMarker removes the in-progress marker for the given backup. +func DeleteInProgressMarker(ctx context.Context, b Backend, clusterPrefix, backup string) error { + return b.DeleteFile(ctx, InProgressMarkerPath(clusterPrefix, backup)) +} + +// WritePruneMarker atomically creates cas//prune.marker. Returns +// (runID, true, nil) on successful create; ("", false, nil) when another +// prune already holds the marker; ("", false, ErrConditionalPutNotSupported) +// for backends without atomic-create. +func WritePruneMarker(ctx context.Context, b Backend, clusterPrefix, host string) (runID string, created bool, err error) { + if host == "" { + host = hostname() + } + runID = randomRunID() + m := PruneMarker{Host: host, StartedAt: nowRFC3339(), RunID: runID, Tool: markerTool} + data, _ := json.Marshal(m) + created, err = b.PutFileIfAbsent(ctx, PruneMarkerPath(clusterPrefix), + io.NopCloser(bytes.NewReader(data)), int64(len(data))) + if !created || err != nil { + return "", created, err + } + return runID, true, nil +} + +// ReadPruneMarker returns the parsed prune marker. +func ReadPruneMarker(ctx context.Context, b Backend, clusterPrefix string) (*PruneMarker, error) { + raw, err := getBytes(ctx, b, PruneMarkerPath(clusterPrefix)) + if err != nil { + return nil, err + } + var m PruneMarker + if err := json.Unmarshal(raw, &m); err != nil { + return nil, err + } + return &m, nil +} + +// DeletePruneMarker removes the prune marker. +func DeletePruneMarker(ctx context.Context, b Backend, clusterPrefix string) error { + return b.DeleteFile(ctx, PruneMarkerPath(clusterPrefix)) +} + +// --- helpers --- + +func randomHex(nBytes int) (string, error) { + buf := make([]byte, nBytes) + if _, err := rand.Read(buf); err != nil { + return "", err + } + return hex.EncodeToString(buf), nil +} + +// randomRunID returns a 16-hex-char random identifier. Panics only if the +// OS entropy source is completely broken (effectively impossible in practice). +func randomRunID() string { + id, err := randomHex(8) + if err != nil { + panic("cas: randomRunID: entropy unavailable: " + err.Error()) + } + return id +} + +func putBytes(ctx context.Context, b Backend, key string, data []byte) error { + return b.PutFile(ctx, key, io.NopCloser(bytes.NewReader(data)), int64(len(data))) +} + +// markerSizeLimit is the maximum number of bytes we will read from a remote +// marker file. Real markers are ~200 B; 64 KiB is a safe ceiling that prevents +// a corrupt / malicious object from consuming unbounded memory. +const markerSizeLimit = 64 * 1024 + +func getBytes(ctx context.Context, b Backend, key string) ([]byte, error) { + rc, err := b.GetFile(ctx, key) + if err != nil { + return nil, err + } + defer rc.Close() + return io.ReadAll(io.LimitReader(rc, markerSizeLimit)) +} diff --git a/pkg/cas/markers_test.go b/pkg/cas/markers_test.go new file mode 100644 index 00000000..ef244845 --- /dev/null +++ b/pkg/cas/markers_test.go @@ -0,0 +1,180 @@ +package cas_test + +import ( + "bytes" + "context" + "io" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" +) + +func TestInProgressMarker_RoundTrip(t *testing.T) { + f := fakedst.New() + ctx := context.Background() + if _, err := cas.WriteInProgressMarker(ctx, f, "cas/c1/", "bk1", "host-a"); err != nil { + t.Fatal(err) + } + m, err := cas.ReadInProgressMarker(ctx, f, "cas/c1/", "bk1") + if err != nil { + t.Fatal(err) + } + if m.Backup != "bk1" { + t.Errorf("Backup: got %q", m.Backup) + } + if m.Host != "host-a" { + t.Errorf("Host: got %q", m.Host) + } + if m.StartedAt == "" { + t.Error("StartedAt empty") + } + if err := cas.DeleteInProgressMarker(ctx, f, "cas/c1/", "bk1"); err != nil { + t.Fatal(err) + } + if _, err := cas.ReadInProgressMarker(ctx, f, "cas/c1/", "bk1"); err == nil { + t.Fatal("expected error reading deleted marker") + } +} + +func TestInProgressMarker_DefaultsHost(t *testing.T) { + f := fakedst.New() + ctx := context.Background() + if _, err := cas.WriteInProgressMarker(ctx, f, "cas/c1/", "bk", ""); err != nil { + t.Fatal(err) + } + m, err := cas.ReadInProgressMarker(ctx, f, "cas/c1/", "bk") + if err != nil { + t.Fatal(err) + } + if m.Host == "" { + t.Error("Host should be filled when caller passes \"\"") + } +} + +func TestPruneMarker_RunIDReadBack(t *testing.T) { + f := fakedst.New() + ctx := context.Background() + runID, created, err := cas.WritePruneMarker(ctx, f, "cas/c1/", "host-a") + if err != nil { + t.Fatal(err) + } + if !created { + t.Fatal("expected created=true on first write") + } + if len(runID) != 16 { + t.Errorf("runID len: got %d want 16", len(runID)) + } + m, err := cas.ReadPruneMarker(ctx, f, "cas/c1/") + if err != nil { + t.Fatal(err) + } + if m.RunID != runID { + t.Errorf("read-back: got %q want %q", m.RunID, runID) + } + if m.Host != "host-a" { + t.Errorf("Host: got %q", m.Host) + } +} + +// TestPruneMarker_SecondWriteRefused verifies that WritePruneMarker returns +// created=false when a marker already exists (atomic create semantics). +func TestPruneMarker_SecondWriteRefused(t *testing.T) { + f := fakedst.New() + ctx := context.Background() + a, createdA, err := cas.WritePruneMarker(ctx, f, "cas/c1/", "h") + if err != nil || !createdA { + t.Fatalf("first write: created=%v err=%v", createdA, err) + } + _, createdB, err := cas.WritePruneMarker(ctx, f, "cas/c1/", "h") + if err != nil { + t.Fatal(err) + } + if createdB { + t.Error("second write should return created=false (marker already exists)") + } + // The first run's marker must still be intact. + m, err := cas.ReadPruneMarker(ctx, f, "cas/c1/") + if err != nil { + t.Fatal(err) + } + if m.RunID != a { + t.Errorf("marker should still hold first run-id %q; got %q", a, m.RunID) + } +} + +func TestSetMarkerTool(t *testing.T) { + f := fakedst.New() + ctx := context.Background() + cas.SetMarkerTool("test-tool/1.0") + defer cas.SetMarkerTool("clickhouse-backup") + _, _, err := cas.WritePruneMarker(ctx, f, "cas/c1/", "h") + if err != nil { + t.Fatal(err) + } + m, err := cas.ReadPruneMarker(ctx, f, "cas/c1/") + if err != nil { + t.Fatal(err) + } + if m.Tool != "test-tool/1.0" { + t.Errorf("Tool: got %q", m.Tool) + } +} + +// TestReadInProgressMarker_LimitsReadSize verifies that ReadInProgressMarker +// does not consume unbounded memory when the remote object is larger than the +// 64 KiB markerSizeLimit. The LimitReader truncates the body; the truncated +// bytes are not valid JSON, so the call must return an error (not a +// successfully-parsed marker, and not an OOM). +func TestReadInProgressMarker_LimitsReadSize(t *testing.T) { + f := fakedst.New() + ctx := context.Background() + const cp = "cas/c1/" + const name = "big-bk" + + // Pre-place a marker whose body is 128 KiB (2× the 64 KiB limit) of 'x'. + // The body is not valid JSON; after truncation it remains invalid. + oversized := make([]byte, 128*1024) + for i := range oversized { + oversized[i] = 'x' + } + markerKey := cas.InProgressMarkerPath(cp, name) + if err := f.PutFile(ctx, markerKey, + io.NopCloser(bytes.NewReader(oversized)), int64(len(oversized))); err != nil { + t.Fatal(err) + } + + m, err := cas.ReadInProgressMarker(ctx, f, cp, name) + if err == nil { + t.Fatalf("expected an error due to invalid JSON after LimitReader truncation; got marker=%+v", m) + } + if m != nil { + t.Errorf("marker must be nil on error; got %+v", m) + } +} + +// TestReadPruneMarker_LimitsReadSize mirrors TestReadInProgressMarker_LimitsReadSize +// for ReadPruneMarker. +func TestReadPruneMarker_LimitsReadSize(t *testing.T) { + f := fakedst.New() + ctx := context.Background() + const cp = "cas/c1/" + + oversized := make([]byte, 128*1024) + for i := range oversized { + oversized[i] = 'x' + } + pruneKey := cas.PruneMarkerPath(cp) + if err := f.PutFile(ctx, pruneKey, + io.NopCloser(bytes.NewReader(oversized)), int64(len(oversized))); err != nil { + t.Fatal(err) + } + + m, err := cas.ReadPruneMarker(ctx, f, cp) + if err == nil { + t.Fatalf("expected an error due to invalid JSON after LimitReader truncation; got marker=%+v", m) + } + if m != nil { + t.Errorf("marker must be nil on error; got %+v", m) + } +} diff --git a/pkg/cas/markset.go b/pkg/cas/markset.go new file mode 100644 index 00000000..f25cdec5 --- /dev/null +++ b/pkg/cas/markset.go @@ -0,0 +1,329 @@ +package cas + +import ( + "bufio" + "container/heap" + "encoding/binary" + "fmt" + "io" + "os" + "path/filepath" + "sort" +) + +// MarkSetWriter accumulates Hash128 references and produces a sorted, deduped +// on-disk file at finalPath on Close. Implementation: an in-memory buffer of +// `chunk` entries; when full, the buffer is sorted, deduplicated, and spilled +// to a "run" file. On Close, all run files are k-way-merged into the final +// output, deduplicating across runs. +// +// The on-disk format is a simple binary stream of 16-byte hashes +// (Low LE, then High LE, matching the byte order used by hashHex). The set +// is intended for the cas-prune mark phase where the live-blob reference +// count can reach ~10^8 across the catalog and won't fit in RAM. +type MarkSetWriter struct { + finalPath string + runDir string + chunk int + buf []Hash128 + runs []string + closed bool + written uint64 +} + +// NewMarkSetWriter opens a new writer that will produce a sorted, deduped +// file at finalPath. chunk is the in-memory buffer size before spilling +// (each entry is 16 bytes; 1<<20 ≈ 16 MiB of RAM). +func NewMarkSetWriter(finalPath string, chunk int) (*MarkSetWriter, error) { + if chunk <= 0 { + chunk = 1 << 20 + } + parent := filepath.Dir(finalPath) + if err := os.MkdirAll(parent, 0o755); err != nil { + return nil, fmt.Errorf("markset: mkdir %s: %w", parent, err) + } + runDir, err := os.MkdirTemp(parent, "markset-runs-*") + if err != nil { + return nil, fmt.Errorf("markset: temp dir: %w", err) + } + return &MarkSetWriter{ + finalPath: finalPath, + runDir: runDir, + chunk: chunk, + buf: make([]Hash128, 0, chunk), + }, nil +} + +// Write appends one hash to the in-memory buffer; spills to disk when full. +func (w *MarkSetWriter) Write(h Hash128) error { + if w.closed { + return fmt.Errorf("markset: writer is closed") + } + w.buf = append(w.buf, h) + w.written++ + if len(w.buf) >= w.chunk { + return w.spill() + } + return nil +} + +// Count returns the total number of hashes written (including duplicates before +// deduplication). Available after the first Write call. +func (w *MarkSetWriter) Count() uint64 { return w.written } + +// Close flushes the final in-memory chunk and merges all runs into finalPath. +// The temporary run directory is removed on success. Calling Close more than +// once is a no-op. +func (w *MarkSetWriter) Close() error { + if w.closed { + return nil + } + w.closed = true + if err := w.spill(); err != nil { + return err + } + if err := mergeRuns(w.runs, w.finalPath); err != nil { + return err + } + // Best-effort cleanup of the run directory. + _ = os.RemoveAll(w.runDir) + return nil +} + +func (w *MarkSetWriter) spill() error { + if len(w.buf) == 0 { + return nil + } + sort.Slice(w.buf, func(i, j int) bool { return hashLess(w.buf[i], w.buf[j]) }) + p := filepath.Join(w.runDir, fmt.Sprintf("run-%05d", len(w.runs))) + f, err := os.Create(p) + if err != nil { + return fmt.Errorf("markset: create run file: %w", err) + } + bw := bufio.NewWriter(f) + var prev Hash128 + first := true + for _, h := range w.buf { + if !first && h == prev { + continue + } + if err := writeHashBinary(bw, h); err != nil { + _ = f.Close() + return fmt.Errorf("markset: write run: %w", err) + } + prev = h + first = false + } + if err := bw.Flush(); err != nil { + _ = f.Close() + return err + } + if err := f.Close(); err != nil { + return err + } + w.buf = w.buf[:0] + w.runs = append(w.runs, p) + return nil +} + +// MarkSetReader streams sorted, deduplicated hashes from a file produced by +// MarkSetWriter. +type MarkSetReader struct { + f *os.File + br *bufio.Reader +} + +// OpenMarkSetReader opens the file produced by MarkSetWriter.Close. +func OpenMarkSetReader(p string) (*MarkSetReader, error) { + f, err := os.Open(p) + if err != nil { + return nil, fmt.Errorf("markset: open: %w", err) + } + return &MarkSetReader{f: f, br: bufio.NewReader(f)}, nil +} + +// Next returns the next hash, or (Hash128{}, false, nil) at EOF. +func (r *MarkSetReader) Next() (Hash128, bool, error) { + var b [16]byte + n, err := io.ReadFull(r.br, b[:]) + if err == io.EOF { + return Hash128{}, false, nil + } + if err == io.ErrUnexpectedEOF { + return Hash128{}, false, fmt.Errorf("markset: short read at offset (got %d bytes)", n) + } + if err != nil { + return Hash128{}, false, err + } + return Hash128{ + Low: binary.LittleEndian.Uint64(b[0:8]), + High: binary.LittleEndian.Uint64(b[8:16]), + }, true, nil +} + +// Close releases the underlying file handle. +func (r *MarkSetReader) Close() error { + if r.f == nil { + return nil + } + err := r.f.Close() + r.f = nil + return err +} + +// hashLess defines the canonical ordering: High first, then Low. +// (Same convention used everywhere we need to sort Hash128.) +func hashLess(a, b Hash128) bool { + if a.High != b.High { + return a.High < b.High + } + return a.Low < b.Low +} + +func writeHashBinary(w io.Writer, h Hash128) error { + var b [16]byte + binary.LittleEndian.PutUint64(b[0:8], h.Low) + binary.LittleEndian.PutUint64(b[8:16], h.High) + _, err := w.Write(b[:]) + return err +} + +// runIter is a single-run iterator used by mergeRuns. +type runIter struct { + f *os.File + br *bufio.Reader + current Hash128 + valid bool +} + +func openRunIter(p string) (*runIter, error) { + f, err := os.Open(p) + if err != nil { + return nil, err + } + it := &runIter{f: f, br: bufio.NewReader(f)} + if err := it.advance(); err != nil { + _ = f.Close() + return nil, err + } + return it, nil +} + +func (it *runIter) advance() error { + var b [16]byte + n, err := io.ReadFull(it.br, b[:]) + if err == io.EOF { + it.valid = false + return nil + } + if err == io.ErrUnexpectedEOF { + return fmt.Errorf("markset: short read in run (got %d bytes)", n) + } + if err != nil { + return err + } + it.current = Hash128{ + Low: binary.LittleEndian.Uint64(b[0:8]), + High: binary.LittleEndian.Uint64(b[8:16]), + } + it.valid = true + return nil +} + +func (it *runIter) close() error { + if it.f == nil { + return nil + } + err := it.f.Close() + it.f = nil + return err +} + +// runHeap is a min-heap of runIter pointers ordered by current hash. +type runHeap []*runIter + +func (h runHeap) Len() int { return len(h) } +func (h runHeap) Less(i, j int) bool { return hashLess(h[i].current, h[j].current) } +func (h runHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] } +func (h *runHeap) Push(x interface{}) { *h = append(*h, x.(*runIter)) } +func (h *runHeap) Pop() interface{} { + old := *h + n := len(old) + x := old[n-1] + *h = old[:n-1] + return x +} + +// mergeRuns performs k-way merge over runs and writes a sorted, deduplicated +// stream to dst. Each run is itself sorted+deduped (per spill contract). +func mergeRuns(runs []string, dst string) error { + out, err := os.Create(dst) + if err != nil { + return fmt.Errorf("markset: create dst: %w", err) + } + bw := bufio.NewWriter(out) + + if len(runs) == 0 { + // Empty mark set is a valid output (zero-byte file). + if err := bw.Flush(); err != nil { + _ = out.Close() + return err + } + return out.Close() + } + + h := &runHeap{} + heap.Init(h) + for _, p := range runs { + it, err := openRunIter(p) + if err != nil { + closeAll(*h) + _ = out.Close() + return err + } + if it.valid { + heap.Push(h, it) + } else { + _ = it.close() + } + } + + var prev Hash128 + first := true + for h.Len() > 0 { + top := (*h)[0] + cur := top.current + if first || cur != prev { + if err := writeHashBinary(bw, cur); err != nil { + closeAll(*h) + _ = out.Close() + return err + } + prev = cur + first = false + } + if err := top.advance(); err != nil { + closeAll(*h) + _ = out.Close() + return err + } + if top.valid { + heap.Fix(h, 0) + } else { + heap.Pop(h) + _ = top.close() + } + } + + if err := bw.Flush(); err != nil { + _ = out.Close() + return err + } + return out.Close() +} + +func closeAll(its []*runIter) { + for _, it := range its { + _ = it.close() + } +} diff --git a/pkg/cas/markset_test.go b/pkg/cas/markset_test.go new file mode 100644 index 00000000..86c8d206 --- /dev/null +++ b/pkg/cas/markset_test.go @@ -0,0 +1,146 @@ +package cas + +import ( + "math/rand" + "path/filepath" + "reflect" + "testing" +) + +func TestMarkSet_WriteSortRead(t *testing.T) { + tmp := t.TempDir() + w, err := NewMarkSetWriter(filepath.Join(tmp, "marks"), 1024) + if err != nil { + t.Fatal(err) + } + refs := []Hash128{ + {High: 0xff, Low: 1}, + {High: 0x00, Low: 5}, + {High: 0x80, Low: 3}, + {High: 0x00, Low: 5}, // duplicate + {High: 0x00, Low: 1}, + } + for _, h := range refs { + if err := w.Write(h); err != nil { + t.Fatal(err) + } + } + if err := w.Close(); err != nil { + t.Fatal(err) + } + + r, err := OpenMarkSetReader(filepath.Join(tmp, "marks")) + if err != nil { + t.Fatal(err) + } + defer r.Close() + + var got []Hash128 + for { + h, ok, err := r.Next() + if err != nil { + t.Fatal(err) + } + if !ok { + break + } + got = append(got, h) + } + want := []Hash128{ + {High: 0x00, Low: 1}, + {High: 0x00, Low: 5}, + {High: 0x80, Low: 3}, + {High: 0xff, Low: 1}, + } + if !reflect.DeepEqual(got, want) { + t.Fatalf("got %+v\nwant %+v", got, want) + } +} + +func TestMarkSet_LargeExternalSort(t *testing.T) { + // Force multi-run mergesort: chunk = 256, write 5000 random refs. + // Output must be sorted, deduplicated, and contain exactly the unique + // inputs. + tmp := t.TempDir() + w, err := NewMarkSetWriter(filepath.Join(tmp, "marks"), 256) + if err != nil { + t.Fatal(err) + } + rng := rand.New(rand.NewSource(42)) + uniq := map[Hash128]struct{}{} + for i := 0; i < 5000; i++ { + h := Hash128{Low: rng.Uint64() & 0xffff, High: rng.Uint64() & 0xff} // many collisions + uniq[h] = struct{}{} + if err := w.Write(h); err != nil { + t.Fatal(err) + } + } + if err := w.Close(); err != nil { + t.Fatal(err) + } + + r, err := OpenMarkSetReader(filepath.Join(tmp, "marks")) + if err != nil { + t.Fatal(err) + } + defer r.Close() + + var got []Hash128 + for { + h, ok, err := r.Next() + if err != nil { + t.Fatal(err) + } + if !ok { + break + } + got = append(got, h) + } + if len(got) != len(uniq) { + t.Errorf("unique count: got %d want %d", len(got), len(uniq)) + } + // Verify sorted. + for i := 1; i < len(got); i++ { + if !hashLess(got[i-1], got[i]) { + t.Fatalf("not sorted at %d: %+v vs %+v", i, got[i-1], got[i]) + } + } +} + +func TestMarkSet_EmptySetIsValid(t *testing.T) { + tmp := t.TempDir() + w, err := NewMarkSetWriter(filepath.Join(tmp, "marks"), 256) + if err != nil { + t.Fatal(err) + } + if err := w.Close(); err != nil { + t.Fatal(err) + } + r, err := OpenMarkSetReader(filepath.Join(tmp, "marks")) + if err != nil { + t.Fatal(err) + } + defer r.Close() + _, ok, err := r.Next() + if err != nil { + t.Fatal(err) + } + if ok { + t.Fatal("expected empty MarkSet but Next returned a hash") + } +} + +func TestMarkSet_CloseTwiceIsNoop(t *testing.T) { + tmp := t.TempDir() + w, err := NewMarkSetWriter(filepath.Join(tmp, "marks"), 256) + if err != nil { + t.Fatal(err) + } + _ = w.Write(Hash128{Low: 1}) + if err := w.Close(); err != nil { + t.Fatal(err) + } + if err := w.Close(); err != nil { + t.Errorf("second Close: %v", err) + } +} diff --git a/pkg/cas/objectdisk.go b/pkg/cas/objectdisk.go new file mode 100644 index 00000000..18dba101 --- /dev/null +++ b/pkg/cas/objectdisk.go @@ -0,0 +1,142 @@ +package cas + +import ( + "strings" +) + +// objectDiskTypes lists ClickHouse system.disks.type values that mean the +// underlying storage is object-based and therefore not supported by CAS v1. +// See docs/cas-design.md §3 (object-disk parts NOT supported in v1). +var objectDiskTypes = map[string]bool{ + "s3": true, + "s3_plain": true, + "azure_blob_storage": true, + "azure": true, // legacy type emitted by older ClickHouse versions; pkg/backup/backuper.go:225 treats it as object disk too + "hdfs": true, + "web": true, +} + +// IsObjectDiskType reports whether a system.disks.type value indicates an +// object disk (vs. a local-filesystem disk). +func IsObjectDiskType(t string) bool { return objectDiskTypes[t] } + +// ObjectDiskHit identifies one (database, table, disk, disk-type) combination +// where a CAS upload would refuse (or, with --skip-object-disks, skip). +type ObjectDiskHit struct { + Database string + Table string + Disk string + DiskType string +} + +// IsEncryptedObjectDisk reports whether disk is an encrypted disk layered on +// top of an object disk (e.g. encryption-over-S3). Mirrors the v1 logic in +// (*Backuper).isDiskTypeEncryptedObject; we duplicate it here rather than +// import from pkg/backup to keep pkg/cas free of that dependency (avoids an +// import cycle — pkg/backup already imports pkg/cas via +// pkg/backup/cas_methods.go). +func IsEncryptedObjectDisk(disk DiskInfo, disks []DiskInfo) bool { + if disk.Type != "encrypted" { + return false + } + for _, d := range disks { + if d.Name == disk.Name { + continue + } + if !strings.HasPrefix(disk.Path, d.Path) { + continue + } + if IsObjectDiskType(d.Type) { + return true + } + } + return false +} + +// objectDiskTypeFor returns the DiskType label for an ObjectDiskHit. For +// direct object disks it returns disk.Type (e.g. "s3"). For +// encrypted-over-object disks it returns "encrypted/" so that +// operator-facing messages make the layering explicit (e.g. "encrypted/s3"). +func objectDiskTypeFor(disk DiskInfo, disks []DiskInfo) string { + if IsObjectDiskType(disk.Type) { + return disk.Type + } + if disk.Type == "encrypted" { + for _, d := range disks { + if d.Name == disk.Name { + continue + } + if strings.HasPrefix(disk.Path, d.Path) && IsObjectDiskType(d.Type) { + return "encrypted/" + d.Type + } + } + } + return disk.Type +} + +// DetectObjectDiskTables walks tables and returns all (db, table, disk) where +// the table has at least one DataPath that lives under an object-disk. +// +// Mapping a DataPath to a disk uses the disk's Path prefix from system.disks. +// A DataPath is considered "on disk D" if it has D.Path as a prefix. The +// longest-matching prefix wins (so a disk at "/var/lib/clickhouse/disks/s3/" +// is matched before one at "/var/lib/clickhouse/"). +// +// Both direct object disks (s3, azure_blob_storage, etc.) and encrypted disks +// layered on top of object disks (encrypted-over-S3) are detected. The latter +// mirrors the v1 isDiskTypeEncryptedObject logic in pkg/backup/backuper.go. +func DetectObjectDiskTables(tables []TableInfo, disks []DiskInfo) []ObjectDiskHit { + // Pre-sort disks by Path length descending so we can do longest-prefix + // matching with a simple loop. + sorted := make([]DiskInfo, len(disks)) + copy(sorted, disks) + // Insertion sort is fine for typical len(disks) ~ small. + for i := 1; i < len(sorted); i++ { + for j := i; j > 0 && len(sorted[j-1].Path) < len(sorted[j].Path); j-- { + sorted[j-1], sorted[j] = sorted[j], sorted[j-1] + } + } + + var hits []ObjectDiskHit + seen := make(map[ObjectDiskHit]struct{}) + for _, t := range tables { + for _, dp := range t.DataPaths { + d, ok := matchDisk(dp, sorted) + if !ok { + continue + } + isObj := IsObjectDiskType(d.Type) || IsEncryptedObjectDisk(d, disks) + if !isObj { + continue + } + h := ObjectDiskHit{Database: t.Database, Table: t.Name, Disk: d.Name, DiskType: objectDiskTypeFor(d, disks)} + if _, dup := seen[h]; dup { + continue + } + seen[h] = struct{}{} + hits = append(hits, h) + } + } + return hits +} + +// matchDisk returns the disk whose Path is the longest prefix of dataPath, or +// (DiskInfo{}, false) if none matches. Caller must pass disks sorted by Path +// length descending. +func matchDisk(dataPath string, sortedDisks []DiskInfo) (DiskInfo, bool) { + for _, d := range sortedDisks { + if d.Path == "" { + continue + } + // Normalize: ensure trailing separator on the disk path so a dir + // boundary is required (avoid "/var/lib/foo" matching "/var/lib/foobar/..."). + prefix := d.Path + if !strings.HasSuffix(prefix, "/") { + prefix += "/" + } + if strings.HasPrefix(dataPath, prefix) || dataPath == strings.TrimSuffix(prefix, "/") { + return d, true + } + } + return DiskInfo{}, false +} diff --git a/pkg/cas/objectdisk_test.go b/pkg/cas/objectdisk_test.go new file mode 100644 index 00000000..74cc9eef --- /dev/null +++ b/pkg/cas/objectdisk_test.go @@ -0,0 +1,166 @@ +package cas_test + +import ( + "reflect" + "sort" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" +) + +func TestIsObjectDiskType(t *testing.T) { + yes := []string{"s3", "s3_plain", "azure_blob_storage", "hdfs", "web"} + no := []string{"local", "encrypted", "memory", ""} + for _, s := range yes { + if !cas.IsObjectDiskType(s) { + t.Errorf("yes: %q wrongly false", s) + } + } + for _, s := range no { + if cas.IsObjectDiskType(s) { + t.Errorf("no: %q wrongly true", s) + } + } +} + +func TestIsObjectDiskType_LegacyAzure(t *testing.T) { + // pkg/backup/backuper.go:225 treats "azure" as an object disk on the v1 path. + // CAS must be consistent so it refuses uploads against legacy-typed disks too. + if !cas.IsObjectDiskType("azure") { + t.Error(`legacy "azure" must be recognized as object disk (parity with pkg/backup/backuper.go:225)`) + } + if !cas.IsObjectDiskType("azure_blob_storage") { + t.Error(`"azure_blob_storage" must be recognized as object disk`) + } +} + +func sortHits(h []cas.ObjectDiskHit) { + sort.Slice(h, func(i, j int) bool { + if h[i].Database != h[j].Database { + return h[i].Database < h[j].Database + } + if h[i].Table != h[j].Table { + return h[i].Table < h[j].Table + } + return h[i].Disk < h[j].Disk + }) +} + +func TestDetectObjectDiskTables_HappyPath(t *testing.T) { + disks := []cas.DiskInfo{ + {Name: "default", Path: "/var/lib/clickhouse/", Type: "local"}, + {Name: "s3main", Path: "/var/lib/clickhouse/disks/s3/", Type: "s3"}, + {Name: "azhot", Path: "/var/lib/clickhouse/disks/azure/", Type: "azure_blob_storage"}, + } + tables := []cas.TableInfo{ + {Database: "db1", Name: "t_local", DataPaths: []string{"/var/lib/clickhouse/data/db1/t_local/"}}, + {Database: "db1", Name: "t_s3", DataPaths: []string{"/var/lib/clickhouse/disks/s3/data/db1/t_s3/"}}, + {Database: "db1", Name: "t_az", DataPaths: []string{"/var/lib/clickhouse/disks/azure/data/db1/t_az/"}}, + {Database: "db1", Name: "t_multi", DataPaths: []string{ + "/var/lib/clickhouse/data/db1/t_multi/", // local + "/var/lib/clickhouse/disks/s3/data/db1/t_multi/", // object + }}, + } + got := cas.DetectObjectDiskTables(tables, disks) + want := []cas.ObjectDiskHit{ + {Database: "db1", Table: "t_az", Disk: "azhot", DiskType: "azure_blob_storage"}, + {Database: "db1", Table: "t_multi", Disk: "s3main", DiskType: "s3"}, + {Database: "db1", Table: "t_s3", Disk: "s3main", DiskType: "s3"}, + } + sortHits(got) + sortHits(want) + if !reflect.DeepEqual(got, want) { + t.Fatalf("got %+v\nwant %+v", got, want) + } +} + +func TestDetectObjectDiskTables_LongestPrefixWins(t *testing.T) { + // /var/lib/clickhouse/ is local; /var/lib/clickhouse/disks/s3/ is s3. + // A path under disks/s3/ must NOT be classified as local even though the + // local prefix also matches. + disks := []cas.DiskInfo{ + {Name: "default", Path: "/var/lib/clickhouse/", Type: "local"}, + {Name: "s3", Path: "/var/lib/clickhouse/disks/s3/", Type: "s3"}, + } + tables := []cas.TableInfo{ + {Database: "db", Name: "t", DataPaths: []string{"/var/lib/clickhouse/disks/s3/data/db/t/"}}, + } + got := cas.DetectObjectDiskTables(tables, disks) + if len(got) != 1 || got[0].Disk != "s3" { + t.Fatalf("got %+v", got) + } +} + +func TestDetectObjectDiskTables_NoFalsePositiveOnSiblingPrefix(t *testing.T) { + // A disk at /foo/ should NOT match a path /foobar/... + disks := []cas.DiskInfo{ + {Name: "d1", Path: "/foo/", Type: "s3"}, + } + tables := []cas.TableInfo{ + {Database: "db", Name: "t", DataPaths: []string{"/foobar/data/"}}, + } + if got := cas.DetectObjectDiskTables(tables, disks); len(got) != 0 { + t.Fatalf("expected no hits, got %+v", got) + } +} + +func TestDetectObjectDiskTables_DedupesSameTriple(t *testing.T) { + disks := []cas.DiskInfo{{Name: "s3", Path: "/s3/", Type: "s3"}} + tables := []cas.TableInfo{ + {Database: "db", Name: "t", DataPaths: []string{ + "/s3/a/", "/s3/b/", // two paths under the same disk + }}, + } + if got := cas.DetectObjectDiskTables(tables, disks); len(got) != 1 { + t.Fatalf("expected 1 deduped hit, got %+v", got) + } +} + +func TestDetectObjectDiskTables_EmptyInputs(t *testing.T) { + if got := cas.DetectObjectDiskTables(nil, nil); len(got) != 0 { + t.Fatal("nil/nil") + } + if got := cas.DetectObjectDiskTables([]cas.TableInfo{}, []cas.DiskInfo{}); len(got) != 0 { + t.Fatal("empty") + } +} + +func TestIsEncryptedObjectDisk(t *testing.T) { + disks := []cas.DiskInfo{ + {Name: "s3_disk", Type: "s3", Path: "/var/lib/clickhouse/disks/s3/"}, + {Name: "encrypted_s3", Type: "encrypted", Path: "/var/lib/clickhouse/disks/s3/encrypted/"}, + {Name: "azure_disk", Type: "azure_blob_storage", Path: "/var/lib/clickhouse/disks/azure/"}, + {Name: "encrypted_az", Type: "encrypted", Path: "/var/lib/clickhouse/disks/azure/encrypted/"}, + {Name: "encrypted_local", Type: "encrypted", Path: "/var/lib/clickhouse/disks/local/encrypted/"}, + {Name: "default", Type: "local", Path: "/var/lib/clickhouse/"}, + } + if !cas.IsEncryptedObjectDisk(disks[1], disks) { + t.Error("encrypted-over-s3 should classify as object") + } + if !cas.IsEncryptedObjectDisk(disks[3], disks) { + t.Error("encrypted-over-azure should classify as object") + } + if cas.IsEncryptedObjectDisk(disks[4], disks) { + t.Error("encrypted-over-local should NOT classify as object") + } + if cas.IsEncryptedObjectDisk(disks[0], disks) { + t.Error("direct s3 (not encrypted) should return false from this helper") + } +} + +func TestDetectObjectDiskTables_IncludesEncryptedOverS3(t *testing.T) { + disks := []cas.DiskInfo{ + {Name: "s3_disk", Type: "s3", Path: "/disks/s3/"}, + {Name: "encrypted_s3", Type: "encrypted", Path: "/disks/s3/encrypted/"}, + } + tables := []cas.TableInfo{ + {Database: "db", Name: "t", DataPaths: []string{"/disks/s3/encrypted/store/data/db/t/"}}, + } + hits := cas.DetectObjectDiskTables(tables, disks) + if len(hits) != 1 { + t.Fatalf("expected 1 hit, got %+v", hits) + } + if hits[0].DiskType != "encrypted/s3" { + t.Errorf("DiskType should reflect encrypted-over-s3, got %q", hits[0].DiskType) + } +} diff --git a/pkg/cas/paths.go b/pkg/cas/paths.go new file mode 100644 index 00000000..09546468 --- /dev/null +++ b/pkg/cas/paths.go @@ -0,0 +1,36 @@ +package cas + +import "github.com/Altinity/clickhouse-backup/v2/pkg/common" + +// All helpers take clusterPrefix, which must end with "/". + +func MetadataDir(clusterPrefix, backup string) string { + return clusterPrefix + "metadata/" + backup + "/" +} + +func MetadataJSONPath(clusterPrefix, backup string) string { + return MetadataDir(clusterPrefix, backup) + "metadata.json" +} + +func TableMetaPath(clusterPrefix, backup, db, table string) string { + return MetadataDir(clusterPrefix, backup) + "metadata/" + + common.TablePathEncode(db) + "/" + common.TablePathEncode(table) + ".json" +} + +// PartArchivePath returns the per-(disk, db, table) tar.zstd archive key. +// disk is intentionally NOT TablePathEncode'd: ClickHouse disk names are +// constrained at config-load time to alphanumeric + dash/underscore, so they +// are path-safe by construction. db and table can be arbitrary user input +// and must be encoded. +func PartArchivePath(clusterPrefix, backup, disk, db, table string) string { + return MetadataDir(clusterPrefix, backup) + "parts/" + disk + "/" + + common.TablePathEncode(db) + "/" + common.TablePathEncode(table) + ".tar.zstd" +} + +func InProgressMarkerPath(clusterPrefix, backup string) string { + return clusterPrefix + "inprogress/" + backup + ".marker" +} + +func PruneMarkerPath(clusterPrefix string) string { + return clusterPrefix + "prune.marker" +} diff --git a/pkg/cas/paths_test.go b/pkg/cas/paths_test.go new file mode 100644 index 00000000..93a437fa --- /dev/null +++ b/pkg/cas/paths_test.go @@ -0,0 +1,44 @@ +package cas + +import ( + "strings" + "testing" +) + +func TestPaths_Basic(t *testing.T) { + cp := "cas/c1/" + cases := []struct{ name, want, got string }{ + {"MetadataDir", "cas/c1/metadata/bk/", MetadataDir(cp, "bk")}, + {"MetadataJSONPath", "cas/c1/metadata/bk/metadata.json", MetadataJSONPath(cp, "bk")}, + {"TableMetaPath", "cas/c1/metadata/bk/metadata/db1/t1.json", TableMetaPath(cp, "bk", "db1", "t1")}, + {"PartArchivePath", "cas/c1/metadata/bk/parts/default/db1/t1.tar.zstd", PartArchivePath(cp, "bk", "default", "db1", "t1")}, + {"InProgressMarkerPath", "cas/c1/inprogress/bk.marker", InProgressMarkerPath(cp, "bk")}, + {"PruneMarkerPath", "cas/c1/prune.marker", PruneMarkerPath(cp)}, + } + for _, c := range cases { + if c.got != c.want { + t.Errorf("%s: got %q want %q", c.name, c.got, c.want) + } + } +} + +func TestPaths_TablePathEncodeApplied(t *testing.T) { + // common.TablePathEncode encodes special characters. We don't assert the + // exact encoded form (that's TablePathEncode's contract); we assert that + // the encoded segment differs from the raw input when special chars present. + cp := "cas/c/" + raw := "weird name" + got := TableMetaPath(cp, "bk", raw, raw) + if !strings.Contains(got, "weird") { + t.Fatalf("encoded path should still contain visible content: %s", got) + } + // Negative: a raw "/" in db/table name must NOT appear in the path because + // TablePathEncode escapes it. (Otherwise the path could collide with the + // separator.) + risky := "a/b" + risk := TableMetaPath(cp, "bk", risky, "t") + // Confirm that "a/b" did NOT survive verbatim as a path component: + if strings.Contains(risk, "/a/b/") { + t.Errorf("TablePathEncode should have escaped slash; got %s", risk) + } +} diff --git a/pkg/cas/probe.go b/pkg/cas/probe.go new file mode 100644 index 00000000..855cbb9e --- /dev/null +++ b/pkg/cas/probe.go @@ -0,0 +1,81 @@ +package cas + +import ( + "bytes" + "context" + "crypto/rand" + "encoding/hex" + "errors" + "fmt" + "io" + "time" +) + +// ErrConditionalPutNotHonored is returned when a backend's PutFileIfAbsent +// silently overwrites instead of refusing on second write — defeating CAS +// marker locks. +var ErrConditionalPutNotHonored = errors.New("cas: backend silently ignored conditional put — marker locks unsafe") + +const probeKeyPrefix = "cas-conditional-put-probe-" + +// probeKeyRandom returns a 16-character hex string (64 bits of entropy) for +// use as the unique per-call suffix of the probe key. 64 bits makes random +// collision between concurrent probes astronomically unlikely. +func probeKeyRandom() string { + var b [8]byte + if _, err := rand.Read(b[:]); err != nil { + // crypto/rand.Read only fails on catastrophic OS failures; fall back to + // a time-based key rather than panicking so the probe still works. + return fmt.Sprintf("%d", time.Now().UnixNano()) + } + return hex.EncodeToString(b[:]) +} + +// ProbeConditionalPut writes /cas-conditional-put-probe- +// twice via PutFileIfAbsent. Returns nil iff the backend correctly honored the +// precondition (first created=true, second created=false). Cleans up the +// sentinel on completion. +// +// A unique random suffix is used per invocation so that two concurrent probes +// (e.g. cas-upload and cas-prune starting simultaneously) operate on different +// keys and cannot interfere with each other. +func ProbeConditionalPut(ctx context.Context, b Backend, clusterPrefix string) error { + key := clusterPrefix + probeKeyPrefix + probeKeyRandom() + body1 := []byte("probe-1") + body2 := []byte("probe-2") + + // First write: try to establish the sentinel. + created1, err := b.PutFileIfAbsent(ctx, key, io.NopCloser(bytes.NewReader(body1)), int64(len(body1))) + if errors.Is(err, ErrConditionalPutNotSupported) { + // Backend correctly reports it doesn't support conditional create. + // Skip the probe — the upload/prune marker-write will refuse naturally + // with the existing operator-facing diagnostic ("backend cannot guarantee + // atomic markers..."), preserving the original UX. The probe is for + // detecting backends that LIE about supporting conditional-create, not + // for re-doing what the marker-write layer already does correctly. + return nil + } + if err != nil { + return fmt.Errorf("cas conditional-put probe: first write: %w", err) + } + if !created1 { + // The key already exists. Since it is unique and random, this indicates + // either an astronomically unlikely random collision or a backend bug + // (e.g. the backend is not respecting the conditional-create semantics + // at all and returned created=false for a key we just generated). + return fmt.Errorf("cas conditional-put probe: unexpected: random key %q already exists; possible random-collision or backend bug", key) + } + + // Second write: must report not-created if backend honors the precondition. + created2, err := b.PutFileIfAbsent(ctx, key, io.NopCloser(bytes.NewReader(body2)), int64(len(body2))) + // Best-effort cleanup; don't mask the probe result. + _ = b.DeleteFile(ctx, key) + if err != nil { + return fmt.Errorf("cas conditional-put probe: second write: %w", err) + } + if created2 { + return fmt.Errorf("%w: backend silently overwrote sentinel (update MinIO to >=2024-11 or use a backend with native conditional create; set cas.skip_conditional_put_probe=true to override at your own risk)", + ErrConditionalPutNotHonored) + } + return nil +} diff --git a/pkg/cas/probe_test.go b/pkg/cas/probe_test.go new file mode 100644 index 00000000..e9cf99ce --- /dev/null +++ b/pkg/cas/probe_test.go @@ -0,0 +1,237 @@ +package cas_test + +import ( + "bytes" + "context" + "errors" + "io" + "strings" + "sync" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" +) + +// TestProbeConditionalPut_HonoredBackend runs the probe against the in-memory +// fake, which correctly enforces the precondition. Expects nil error. +func TestProbeConditionalPut_HonoredBackend(t *testing.T) { + f := fakedst.New() + err := cas.ProbeConditionalPut(context.Background(), f, "cas/test-cluster/") + if err != nil { + t.Fatalf("expected nil on honoring backend, got: %v", err) + } + // Sentinel must be cleaned up after a successful probe. Walk the prefix + // and confirm no probe keys remain. + ctx := context.Background() + var found []string + _ = f.Walk(ctx, "cas/test-cluster/"+cas.ProbeKeyPrefix, true, func(rf cas.RemoteFile) error { + found = append(found, rf.Key) + return nil + }) + if len(found) != 0 { + t.Errorf("probe did not clean up sentinel on success; leftover keys: %v", found) + } +} + +// TestProbeConditionalPut_SilentlyOverwritingBackend uses a stub whose +// PutFileIfAbsent always returns created=true, simulating a backend that +// ignores If-None-Match. Expects ErrConditionalPutNotHonored. +func TestProbeConditionalPut_SilentlyOverwritingBackend(t *testing.T) { + b := &alwaysCreatesBackend{} + err := cas.ProbeConditionalPut(context.Background(), b, "cas/test-cluster/") + if err == nil { + t.Fatal("expected error on silently-overwriting backend, got nil") + } + if !errors.Is(err, cas.ErrConditionalPutNotHonored) { + t.Errorf("expected ErrConditionalPutNotHonored, got: %v", err) + } +} + +// TestProbeConditionalPut_ErrorOnFirstWrite verifies that an error from the +// first PutFileIfAbsent is surfaced with context "first write". +func TestProbeConditionalPut_ErrorOnFirstWrite(t *testing.T) { + sentinel := errors.New("backend unavailable") + b := &errOnPutBackend{err: sentinel} + err := cas.ProbeConditionalPut(context.Background(), b, "cas/test-cluster/") + if err == nil { + t.Fatal("expected error, got nil") + } + if !strings.Contains(err.Error(), "first write") { + t.Errorf("expected 'first write' in error, got: %v", err) + } + if !errors.Is(err, sentinel) { + t.Errorf("expected sentinel error in chain, got: %v", err) + } +} + +// TestProbeConditionalPut_RejectsExistingProbeKey verifies that if the first +// PutFileIfAbsent returns created=false (as if the random key already exists), +// the probe returns an error mentioning "random-collision or backend bug". +func TestProbeConditionalPut_RejectsExistingProbeKey(t *testing.T) { + // firstCall tracks whether this is the first PutFileIfAbsent invocation. + b := &firstCallReturnsFalseBackend{} + err := cas.ProbeConditionalPut(context.Background(), b, "cas/test-cluster/") + if err == nil { + t.Fatal("expected error when first PutFileIfAbsent returns created=false, got nil") + } + if !strings.Contains(err.Error(), "random-collision or backend bug") { + t.Errorf("expected 'random-collision or backend bug' in error, got: %v", err) + } +} + +// TestProbeConditionalPut_SkipsWhenBackendReturnsNotSupported verifies that the +// probe returns nil (gracefully skipped) when the backend's PutFileIfAbsent +// returns ErrConditionalPutNotSupported on the first write. This preserves the +// original UX where the marker-write layer produces the operator-facing +// "backend cannot guarantee atomic markers" diagnostic instead of a probe error. +func TestProbeConditionalPut_SkipsWhenBackendReturnsNotSupported(t *testing.T) { + b := ¬SupportedBackend{} + err := cas.ProbeConditionalPut(context.Background(), b, "cas/test-cluster/") + if err != nil { + t.Errorf("expected nil (probe gracefully skipped), got: %v", err) + } +} + +// TestProbeConditionalPut_TwoConcurrentProbesDontCollide verifies that two +// concurrent probes against the same backend don't interfere with each other. +// Because each probe picks a unique random key, both should succeed without +// either one deleting the other's sentinel. +func TestProbeConditionalPut_TwoConcurrentProbesDontCollide(t *testing.T) { + f := fakedst.New() + ctx := context.Background() + const clusterPrefix = "cas/test-cluster/" + + var wg sync.WaitGroup + errs := make([]error, 2) + for i := range errs { + i := i + wg.Add(1) + go func() { + defer wg.Done() + errs[i] = cas.ProbeConditionalPut(ctx, f, clusterPrefix) + }() + } + wg.Wait() + + for i, err := range errs { + if err != nil { + t.Errorf("probe %d failed: %v", i, err) + } + } + + // After both probes complete, no probe sentinels should remain. + var found []string + _ = f.Walk(ctx, clusterPrefix+cas.ProbeKeyPrefix, true, func(rf cas.RemoteFile) error { + found = append(found, rf.Key) + return nil + }) + if len(found) != 0 { + t.Errorf("probes left behind sentinel keys: %v", found) + } +} + +// --- stubs --- + +// alwaysCreatesBackend is a cas.Backend stub whose PutFileIfAbsent always +// reports created=true, simulating a backend that silently ignores If-None-Match. +type alwaysCreatesBackend struct{} + +func (a *alwaysCreatesBackend) PutFileIfAbsent(_ context.Context, _ string, r io.ReadCloser, _ int64) (bool, error) { + _ = r.Close() + return true, nil +} +func (a *alwaysCreatesBackend) PutFile(_ context.Context, _ string, r io.ReadCloser, _ int64) error { + _ = r.Close() + return nil +} +func (a *alwaysCreatesBackend) GetFile(_ context.Context, _ string) (io.ReadCloser, error) { + return io.NopCloser(bytes.NewReader(nil)), nil +} +func (a *alwaysCreatesBackend) StatFile(_ context.Context, _ string) (int64, time.Time, bool, error) { + return 0, time.Time{}, false, nil +} +func (a *alwaysCreatesBackend) DeleteFile(_ context.Context, _ string) error { return nil } +func (a *alwaysCreatesBackend) Walk(_ context.Context, _ string, _ bool, _ func(cas.RemoteFile) error) error { + return nil +} + +// notSupportedBackend is a cas.Backend stub whose PutFileIfAbsent returns +// (false, ErrConditionalPutNotSupported), simulating FTP and similar backends +// that correctly advertise they don't support conditional create. +type notSupportedBackend struct{} + +func (n *notSupportedBackend) PutFileIfAbsent(_ context.Context, _ string, r io.ReadCloser, _ int64) (bool, error) { + _ = r.Close() + return false, cas.ErrConditionalPutNotSupported +} +func (n *notSupportedBackend) PutFile(_ context.Context, _ string, r io.ReadCloser, _ int64) error { + _ = r.Close() + return nil +} +func (n *notSupportedBackend) GetFile(_ context.Context, _ string) (io.ReadCloser, error) { + return io.NopCloser(bytes.NewReader(nil)), nil +} +func (n *notSupportedBackend) StatFile(_ context.Context, _ string) (int64, time.Time, bool, error) { + return 0, time.Time{}, false, nil +} +func (n *notSupportedBackend) DeleteFile(_ context.Context, _ string) error { return nil } +func (n *notSupportedBackend) Walk(_ context.Context, _ string, _ bool, _ func(cas.RemoteFile) error) error { + return nil +} + +// errOnPutBackend is a cas.Backend stub that returns an error from PutFileIfAbsent. +type errOnPutBackend struct{ err error } + +func (e *errOnPutBackend) PutFileIfAbsent(_ context.Context, _ string, r io.ReadCloser, _ int64) (bool, error) { + _ = r.Close() + return false, e.err +} +func (e *errOnPutBackend) PutFile(_ context.Context, _ string, r io.ReadCloser, _ int64) error { + _ = r.Close() + return nil +} +func (e *errOnPutBackend) GetFile(_ context.Context, _ string) (io.ReadCloser, error) { + return io.NopCloser(bytes.NewReader(nil)), nil +} +func (e *errOnPutBackend) StatFile(_ context.Context, _ string) (int64, time.Time, bool, error) { + return 0, time.Time{}, false, nil +} +func (e *errOnPutBackend) DeleteFile(_ context.Context, _ string) error { return nil } +func (e *errOnPutBackend) Walk(_ context.Context, _ string, _ bool, _ func(cas.RemoteFile) error) error { + return nil +} + +// firstCallReturnsFalseBackend is a cas.Backend stub whose first +// PutFileIfAbsent call returns (false, nil), simulating a scenario where the +// random probe key happens to already exist (random collision or backend bug). +type firstCallReturnsFalseBackend struct { + mu sync.Mutex + calls int +} + +func (b *firstCallReturnsFalseBackend) PutFileIfAbsent(_ context.Context, _ string, r io.ReadCloser, _ int64) (bool, error) { + _ = r.Close() + b.mu.Lock() + defer b.mu.Unlock() + b.calls++ + if b.calls == 1 { + return false, nil + } + return true, nil +} +func (b *firstCallReturnsFalseBackend) PutFile(_ context.Context, _ string, r io.ReadCloser, _ int64) error { + _ = r.Close() + return nil +} +func (b *firstCallReturnsFalseBackend) GetFile(_ context.Context, _ string) (io.ReadCloser, error) { + return io.NopCloser(bytes.NewReader(nil)), nil +} +func (b *firstCallReturnsFalseBackend) StatFile(_ context.Context, _ string) (int64, time.Time, bool, error) { + return 0, time.Time{}, false, nil +} +func (b *firstCallReturnsFalseBackend) DeleteFile(_ context.Context, _ string) error { return nil } +func (b *firstCallReturnsFalseBackend) Walk(_ context.Context, _ string, _ bool, _ func(cas.RemoteFile) error) error { + return nil +} diff --git a/pkg/cas/prune.go b/pkg/cas/prune.go new file mode 100644 index 00000000..6a333ea4 --- /dev/null +++ b/pkg/cas/prune.go @@ -0,0 +1,686 @@ +package cas + +import ( + "archive/tar" + "bytes" + "context" + "encoding/json" + "errors" + "fmt" + "io" + "os" + "path/filepath" + "strings" + "sync" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/checksumstxt" + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" + "github.com/Altinity/clickhouse-backup/v2/pkg/utils" + "github.com/klauspost/compress/zstd" + "github.com/rs/zerolog/log" +) + +// PruneOptions tunes a single Prune run. GraceBlob / AbandonThreshold are +// applied iff their *Set flags are true; otherwise the run uses +// cfg.GraceBlobDuration() / cfg.AbandonThresholdDuration(). The *Set flags +// let an explicit zero override the configured non-zero default +// (use case: targeted cleanup, regression tests). +// +// DryRun reports candidates without deleting; Unlock is the operator escape +// hatch for a stranded prune.marker. +type PruneOptions struct { + DryRun bool + GraceBlob time.Duration + GraceBlobSet bool + AbandonThreshold time.Duration + AbandonThresholdSet bool + Unlock bool +} + +// PruneReport summarizes what a Prune run did. Returned even on error so +// callers can log partial progress. +type PruneReport struct { + DryRun bool `json:"dry_run"` + LiveBackups int `json:"live_backups"` + BlobsTotal uint64 `json:"blobs_total"` + OrphanBlobsConsidered uint64 `json:"orphan_blobs_considered"` + OrphansHeldByGrace uint64 `json:"orphans_held_by_grace"` + OrphansDeleted uint64 `json:"orphans_deleted"` + BlobDeleteFailures int `json:"blob_delete_failures"` + BytesReclaimed int64 `json:"bytes_reclaimed"` + AbandonedMarkersFound int `json:"abandoned_markers_found"` + MetadataOrphansFound int `json:"metadata_orphans_found"` + DurationSeconds float64 `json:"duration_seconds"` + // DryRunCandidates lists every blob that would be deleted in a dry-run. + // Only populated when DryRun=true; nil otherwise. + DryRunCandidates []OrphanCandidate `json:"dry_run_candidates,omitempty"` +} + +// Prune performs mark-and-sweep garbage collection of orphan blobs and +// metadata-orphan subtrees in the configured CAS namespace. See +// docs/cas-design.md §6.7 for the algorithm. +// +// Concurrency: a single advisory marker (cas//prune.marker) is +// atomically created at step 2 via PutFileIfAbsent and released via a scoped +// defer registered ONLY when this run owns the marker. A second concurrent +// prune sees created=false and returns an error without touching the marker. +func Prune(ctx context.Context, b Backend, cfg Config, opts PruneOptions) (*PruneReport, error) { + if err := cfg.Validate(); err != nil { + return nil, fmt.Errorf("cas: prune: invalid config: %w", err) + } + if !cfg.Enabled { + return nil, errors.New("cas: cas.enabled=false") + } + cp := cfg.ClusterPrefix() + grace := cfg.GraceBlobDuration() + if opts.GraceBlobSet { + grace = opts.GraceBlob + } + abandon := cfg.AbandonThresholdDuration() + if opts.AbandonThresholdSet { + abandon = opts.AbandonThreshold + } + + // --unlock escape hatch: delete a stranded prune.marker and exit. + if opts.Unlock { + _, _, exists, err := b.StatFile(ctx, PruneMarkerPath(cp)) + if err != nil { + return nil, fmt.Errorf("cas-prune --unlock: stat marker: %w", err) + } + if !exists { + return nil, errors.New("cas-prune --unlock: no prune.marker present") + } + if opts.DryRun { + if m, readErr := ReadPruneMarker(ctx, b, cp); readErr == nil { + log.Info(). + Str("host", m.Host). + Str("run_id", m.RunID). + Str("started_at", m.StartedAt). + Msg("cas-prune --dry-run --unlock: would delete this marker (no action taken)") + } else { + log.Info().Err(readErr).Msg("cas-prune --dry-run --unlock: marker present but unparseable; would delete") + } + return &PruneReport{DryRun: true}, nil + } + if err := b.DeleteFile(ctx, PruneMarkerPath(cp)); err != nil { + return nil, fmt.Errorf("cas-prune --unlock: delete marker: %w", err) + } + log.Warn().Msg("cas-prune: prune marker manually unlocked by operator") + return &PruneReport{}, nil + } + + rep := &PruneReport{DryRun: opts.DryRun} + start := time.Now() + defer func() { rep.DurationSeconds = time.Since(start).Seconds() }() + + // Step 1: refuse to run while any inprogress marker is younger than abandon. + fresh, abandoned, err := classifyInProgress(ctx, b, cp, abandon) + if err != nil { + return rep, err + } + log.Info(). + Int("markers_total", len(fresh)+len(abandoned)). + Int("abandoned", len(abandoned)). + Int("fresh", len(fresh)). + Msg("cas-prune: classified markers") + if len(fresh) > 0 { + return rep, freshInProgressError(fresh) + } + + // Step 2: atomically create prune marker; defer cleanup only if we own it. + if !opts.DryRun { + runID, created, err := WritePruneMarker(ctx, b, cp, hostname()) + if err != nil { + if errors.Is(err, ErrConditionalPutNotSupported) { + return rep, fmt.Errorf("cas-prune: backend cannot guarantee atomic markers; refusing (set cas.allow_unsafe_markers=true to override on FTP)") + } + return rep, fmt.Errorf("cas-prune: write marker: %w", err) + } + if !created { + existing, readErr := ReadPruneMarker(ctx, b, cp) + if readErr != nil { + return rep, fmt.Errorf("cas-prune: another prune is in progress (could not read marker: %v)", readErr) + } + return rep, fmt.Errorf("cas-prune: another prune is in progress on host=%s started=%s run_id=%s", + existing.Host, existing.StartedAt, existing.RunID) + } + _ = runID // we already own the marker by virtue of created=true; runID is for diagnostics only + defer func() { + cleanCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second) + defer cancel() + if delErr := b.DeleteFile(cleanCtx, PruneMarkerPath(cp)); delErr != nil { + log.Warn().Err(delErr).Msg("cas-prune: failed to release prune.marker") + } + }() + } + + // Step 3: T0 (used for grace cutoff) + t0 := start + + // Step 4: sweep abandoned in-progress markers. + if !opts.DryRun { + for _, m := range abandoned { + if err := b.DeleteFile(ctx, InProgressMarkerPath(cp, m.Backup)); err != nil { + log.Warn().Err(err).Str("backup", m.Backup).Msg("cas-prune: delete abandoned marker") + } + } + } + rep.AbandonedMarkersFound = len(abandoned) + + // Step 5: list live backups (subtrees with metadata.json). + backups, err := listLiveBackups(ctx, b, cp) + if err != nil { + return rep, fmt.Errorf("cas-prune: list live backups: %w", err) + } + rep.LiveBackups = len(backups) + log.Info().Int("count", len(backups)).Msg("cas-prune: building mark set across live backups") + + // Step 6: build mark set by walking each live backup's per-table + // archives and extracting checksums.txt entries above the inline + // threshold (those that went to the blob store). The archive-download + // phase (the hot loop) is parallelised with a bounded goroutine pool. + marksDir, err := os.MkdirTemp("", "cas-prune-marks-*") + if err != nil { + return rep, fmt.Errorf("cas-prune: temp dir: %w", err) + } + defer os.RemoveAll(marksDir) + marksPath := filepath.Join(marksDir, "marks") + mw, err := NewMarkSetWriter(marksPath, 1<<20) + if err != nil { + return rep, fmt.Errorf("cas-prune: mark set: %w", err) + } + // Step 7 fail-closed: any error reading a live backup aborts the + // run BEFORE any blob is deleted. + if err := buildMarkSetParallel(ctx, b, cp, backups, mw, 16); err != nil { + _ = mw.Close() + return rep, err + } + if err := mw.Close(); err != nil { + return rep, fmt.Errorf("cas-prune: close mark set: %w", err) + } + log.Info().Uint64("refs", mw.Count()).Msg("cas-prune: mark set complete") + + // Steps 8-9: stream compare against blob store, filter by grace. + mr, err := OpenMarkSetReader(marksPath) + if err != nil { + return rep, fmt.Errorf("cas-prune: open mark set: %w", err) + } + defer mr.Close() + cands, sweepStats, err := SweepOrphans(ctx, b, cp, mr, grace, t0) + if err != nil { + return rep, fmt.Errorf("cas-prune: sweep: %w", err) + } + rep.BlobsTotal = sweepStats.BlobsTotal + rep.OrphansHeldByGrace = sweepStats.OrphansHeldByGrace + rep.OrphanBlobsConsidered = uint64(len(cands)) + log.Info(). + Uint64("blobs_total", sweepStats.BlobsTotal). + Uint64("orphans_held_by_grace", sweepStats.OrphansHeldByGrace). + Int("orphans_to_delete", len(cands)). + Msg("cas-prune: sweep complete") + + // Step 10: metadata-orphan subtree sweep. + metaOrphans, err := findMetadataOrphans(ctx, b, cp) + if err != nil { + return rep, fmt.Errorf("cas-prune: find metadata orphans: %w", err) + } + if !opts.DryRun { + for _, p := range metaOrphans { + if err := walkAndDeleteSubtree(ctx, b, p); err != nil { + log.Warn().Err(err).Str("subtree", p).Msg("cas-prune: delete metadata-orphan subtree") + } + } + } + rep.MetadataOrphansFound = len(metaOrphans) + + // Step 11: delete orphan blobs (parallel, bounded). + if opts.DryRun { + for _, c := range cands { + log.Info(). + Str("key", c.Key). + Time("mod_time", c.ModTime). + Int64("size", c.Size). + Msg("cas-prune dry-run: would delete") + } + // Defensive copy: don't share the live slice with the caller. The + // report may outlive cands if downstream code (e.g. PrintPruneReport) + // runs after Prune returns. + rep.DryRunCandidates = append([]OrphanCandidate(nil), cands...) + } else { + log.Info().Int("count", len(cands)).Msg("cas-prune: deleting orphan blobs") + n, bytes, failures, err := deleteBlobs(ctx, b, cands, 32) + rep.OrphansDeleted = uint64(n) + rep.BytesReclaimed = bytes + rep.BlobDeleteFailures = failures + log.Info(). + Uint64("orphans_deleted", rep.OrphansDeleted). + Int64("bytes_reclaimed", rep.BytesReclaimed). + Int("failures", failures). + Float64("wall_seconds", time.Since(start).Seconds()). + Msg("cas-prune: done") + if err != nil { + return rep, fmt.Errorf("cas-prune: delete blobs: %w", err) + } + } + return rep, nil +} + +// inProgressMarker captures the parsed per-marker state used by classify. +type inProgressMarker struct { + Backup string + Host string + ModTime time.Time + Age time.Duration +} + +// classifyInProgress walks cas//inprogress/ and partitions markers into +// "fresh" (younger than abandon) and "abandoned" (older). Markers we can't +// parse are still classified by ModTime (safer than dropping them). +func classifyInProgress(ctx context.Context, b Backend, cp string, abandon time.Duration) (fresh, abandoned []inProgressMarker, err error) { + prefix := cp + "inprogress/" + now := time.Now() + err = b.Walk(ctx, prefix, false, func(rf RemoteFile) error { + if !strings.HasSuffix(rf.Key, ".marker") { + return nil + } + // Backup name: strip prefix + ".marker" + rest := strings.TrimPrefix(rf.Key, prefix) + name := strings.TrimSuffix(rest, ".marker") + if name == "" || strings.Contains(name, "/") { + return nil + } + if rf.ModTime.IsZero() { + log.Warn(). + Str("backup", name). + Msg("cas-prune: in-progress marker has zero ModTime (likely FTP LIST without MLSD); classifying as fresh") + fresh = append(fresh, inProgressMarker{Backup: name, ModTime: rf.ModTime, Age: 0}) + return nil + } + age := now.Sub(rf.ModTime) + m := inProgressMarker{Backup: name, ModTime: rf.ModTime, Age: age} + if age >= abandon { + abandoned = append(abandoned, m) + } else { + fresh = append(fresh, m) + } + return nil + }) + return fresh, abandoned, err +} + +func freshInProgressError(fresh []inProgressMarker) error { + parts := make([]string, len(fresh)) + for i, m := range fresh { + if m.ModTime.IsZero() { + parts[i] = fmt.Sprintf("%s (age=unknown — FTP server returned no ModTime)", m.Backup) + } else { + parts[i] = fmt.Sprintf("%s (age=%s)", m.Backup, m.Age.Round(time.Second)) + } + } + return fmt.Errorf("cas-prune: refuse to run while %d in-progress upload(s) are fresh: %s — wait for them, or run 'cas-prune --abandon-threshold=0s' if confirmed dead", + len(fresh), strings.Join(parts, ", ")) +} + +// listLiveBackups walks cas//metadata//metadata.json entries and +// returns the backup names. Mirrors cas-status's discovery logic. +func listLiveBackups(ctx context.Context, b Backend, cp string) ([]string, error) { + prefix := cp + "metadata/" + var backups []string + err := b.Walk(ctx, prefix, true, func(rf RemoteFile) error { + if !strings.HasSuffix(rf.Key, "/metadata.json") { + return nil + } + rest := strings.TrimPrefix(rf.Key, prefix) + name := strings.TrimSuffix(rest, "/metadata.json") + if name == "" || strings.Contains(name, "/") { + return nil + } + backups = append(backups, name) + return nil + }) + return backups, err +} + +// pruneArchiveJob is one (backup, archiveKey, threshold) tuple collected +// during Phase 1 of buildMarkSetParallel. +type pruneArchiveJob struct { + backup string + archKey string + threshold uint64 +} + +// buildMarkSetParallel implements the mark phase in three steps: +// +// Phase 1 (serial): for every live backup, read metadata.json + per-table +// JSONs and collect all archive keys into a flat slice. This is cheap +// (small JSON reads; no archive download). +// +// Phase 2 (parallel, bounded pool of `parallelism` goroutines): download and +// parse each archive, extract above-threshold hash references into a per- +// goroutine local buffer. +// +// Phase 3 (serial): merge all per-goroutine buffers into the MarkSetWriter. +// This avoids needing a mutex on Write and keeps MarkSetWriter single-threaded. +// +// parallelism <=0 defaults to 16. +func buildMarkSetParallel(ctx context.Context, b Backend, cp string, backups []string, mw *MarkSetWriter, parallelism int) error { + if parallelism <= 0 { + parallelism = 16 + } + + // --- Phase 1: collect all archive jobs (serial, cheap) --- + var jobs []pruneArchiveJob + for _, bkName := range backups { + bm, err := readBackupMetadata(ctx, b, cp, bkName) + if err != nil { + return fmt.Errorf("cas-prune: cannot read live backup %q: read metadata.json: %w", bkName, err) + } + if bm.CAS == nil { + return fmt.Errorf("cas-prune: cannot read live backup %q: backup metadata has no CAS field; cannot prune", bkName) + } + threshold := bm.CAS.InlineThreshold + for _, tt := range bm.Tables { + tm, err := readTableMetadata(ctx, b, cp, bkName, tt.Database, tt.Table) + if err != nil { + return fmt.Errorf("cas-prune: cannot read live backup %q: read table metadata for %s.%s: %w", bkName, tt.Database, tt.Table, err) + } + for disk := range tm.Parts { + if err := validateRemoteFilesystemName("disk", disk); err != nil { + return fmt.Errorf("cas-prune: cannot read live backup %q: %w", bkName, err) + } + jobs = append(jobs, pruneArchiveJob{ + backup: bkName, + archKey: PartArchivePath(cp, bkName, disk, tt.Database, tt.Table), + threshold: threshold, + }) + } + } + } + total := len(jobs) + log.Info().Int("archives", total).Msg("cas-prune: mark phase starting parallel archive downloads") + + // --- Phase 2: parallel archive download + parse --- + // Each goroutine accumulates hashes into its own local slice to avoid + // locking the MarkSetWriter. + type result struct { + hashes []Hash128 + err error + } + results := make([]result, len(jobs)) + + sem := make(chan struct{}, parallelism) + var wg sync.WaitGroup + var ( + mu sync.Mutex + firstErr error + ) + processed := 0 + + for idx, job := range jobs { + idx, job := idx, job + wg.Add(1) + go func() { + defer wg.Done() + sem <- struct{}{} + defer func() { <-sem }() + + mu.Lock() + already := firstErr != nil + mu.Unlock() + if already { + return + } + + hashes, err := collectRefsFromArchive(ctx, b, job.archKey, job.threshold) + if err != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas-prune: cannot read live backup %q: accumulate refs from %s: %w", job.backup, job.archKey, err) + } + mu.Unlock() + return + } + results[idx] = result{hashes: hashes} + + mu.Lock() + processed++ + if processed%100 == 0 { + n := processed + mu.Unlock() + log.Info().Int("processed", n).Int("total", total).Msg("cas-prune: mark phase progress") + } else { + mu.Unlock() + } + }() + } + wg.Wait() + + if firstErr != nil { + return firstErr + } + + // --- Phase 3: serial merge into MarkSetWriter --- + for _, r := range results { + for _, h := range r.hashes { + if err := mw.Write(h); err != nil { + return fmt.Errorf("cas-prune: mark set write: %w", err) + } + } + } + return nil +} + +// collectRefsFromArchive streams one archive, parses every checksums.txt it +// contains, and returns all above-threshold hashes. It is the parallel-safe +// counterpart to accumulateRefsFromArchive; it returns hashes rather than +// writing to a MarkSetWriter so callers can merge results without locking. +func collectRefsFromArchive(ctx context.Context, b Backend, archKey string, threshold uint64) ([]Hash128, error) { + rc, err := b.GetFile(ctx, archKey) + if err != nil { + return nil, err + } + defer rc.Close() + zr, err := zstd.NewReader(rc) + if err != nil { + return nil, fmt.Errorf("zstd: %w", err) + } + defer zr.Close() + tr := tar.NewReader(zr) + var out []Hash128 + for { + hdr, err := tr.Next() + if err == io.EOF { + return out, nil + } + if err != nil { + return nil, fmt.Errorf("tar: %w", err) + } + if hdr.Typeflag != tar.TypeReg { + continue + } + if !strings.HasSuffix(hdr.Name, "/checksums.txt") { + continue + } + body, err := io.ReadAll(tr) + if err != nil { + return nil, fmt.Errorf("read %s: %w", hdr.Name, err) + } + parsed, err := checksumstxt.Parse(bytes.NewReader(body)) + if err != nil { + return nil, fmt.Errorf("parse %s: %w", hdr.Name, err) + } + for _, c := range parsed.Files { + if c.FileSize <= threshold { + continue + } + out = append(out, Hash128{Low: c.FileHash.Low, High: c.FileHash.High}) + } + } +} + +func readBackupMetadata(ctx context.Context, b Backend, cp, name string) (*metadata.BackupMetadata, error) { + rc, err := b.GetFile(ctx, MetadataJSONPath(cp, name)) + if err != nil { + return nil, err + } + defer rc.Close() + body, err := io.ReadAll(rc) + if err != nil { + return nil, err + } + var bm metadata.BackupMetadata + if err := json.Unmarshal(body, &bm); err != nil { + return nil, fmt.Errorf("parse: %w", err) + } + return &bm, nil +} + +func readTableMetadata(ctx context.Context, b Backend, cp, name, db, table string) (*metadata.TableMetadata, error) { + rc, err := b.GetFile(ctx, TableMetaPath(cp, name, db, table)) + if err != nil { + return nil, err + } + defer rc.Close() + body, err := io.ReadAll(rc) + if err != nil { + return nil, err + } + var tm metadata.TableMetadata + if err := json.Unmarshal(body, &tm); err != nil { + return nil, fmt.Errorf("parse: %w", err) + } + return &tm, nil +} + +// findMetadataOrphans returns prefixes under cas//metadata// where +// the catalog truth (metadata.json) is absent. Such subtrees represent +// half-completed deletions whose per-table JSONs / archives should be +// reclaimed. +func findMetadataOrphans(ctx context.Context, b Backend, cp string) ([]string, error) { + metaPrefix := cp + "metadata/" + // Discover all top-level directories by walking and collecting + // the first path component after the prefix. + seen := map[string]bool{} + err := b.Walk(ctx, metaPrefix, true, func(rf RemoteFile) error { + rest := strings.TrimPrefix(rf.Key, metaPrefix) + idx := strings.Index(rest, "/") + if idx < 0 { + return nil + } + name := rest[:idx] + if name == "" { + return nil + } + seen[name] = true + return nil + }) + if err != nil { + return nil, err + } + var orphans []string + for name := range seen { + _, _, exists, err := b.StatFile(ctx, MetadataJSONPath(cp, name)) + if err != nil { + return nil, err + } + if !exists { + orphans = append(orphans, MetadataDir(cp, name)) + } + } + return orphans, nil +} + +// deleteBlobs deletes the given orphan candidates with bounded parallelism. +// Returns the number successfully deleted, the cumulative bytes reclaimed, +// the total number of failures, and the first error encountered (if any). +// Subsequent candidates after an error are still attempted; the error +// propagates after the wait. +func deleteBlobs(ctx context.Context, b Backend, cands []OrphanCandidate, parallelism int) (int, int64, int, error) { + if parallelism <= 0 { + parallelism = 32 + } + var ( + mu sync.Mutex + count int + bytes int64 + failures int + firstErr error + wg sync.WaitGroup + ) + sem := make(chan struct{}, parallelism) + for _, c := range cands { + c := c + wg.Add(1) + go func() { + defer wg.Done() + sem <- struct{}{} + defer func() { <-sem }() + if err := b.DeleteFile(ctx, c.Key); err != nil { + log.Warn().Err(err).Str("key", c.Key).Msg("cas-prune: delete orphan blob failed") + mu.Lock() + failures++ + if firstErr == nil { + firstErr = err + } + mu.Unlock() + return + } + mu.Lock() + count++ + bytes += c.Size + mu.Unlock() + }() + } + wg.Wait() + return count, bytes, failures, firstErr +} + +// PrintPruneReport renders a human-readable report to w. +func PrintPruneReport(r *PruneReport, w io.Writer) error { + prefix := "cas-prune" + if r.DryRun { + prefix = "cas-prune (dry-run)" + } + markerVerb := "swept" + orphanVerb := "swept" + if r.DryRun { + markerVerb = "would be swept" + orphanVerb = "would be swept" + } + if _, err := fmt.Fprintf(w, "%s:\n Live backups : %d\n Orphan candidates : %d\n Orphans deleted : %d\n Bytes reclaimed : %s (%d)\n Abandoned markers : %d %s\n Metadata orphans : %d %s\n Wall clock : %.2fs\n", + prefix, + r.LiveBackups, + r.OrphanBlobsConsidered, + r.OrphansDeleted, + utils.FormatBytes(uint64(r.BytesReclaimed)), + r.BytesReclaimed, + r.AbandonedMarkersFound, + markerVerb, + r.MetadataOrphansFound, + orphanVerb, + r.DurationSeconds, + ); err != nil { + return err + } + if r.BlobDeleteFailures > 0 { + if _, err := fmt.Fprintf(w, " Blob delete failures: %d\n", r.BlobDeleteFailures); err != nil { + return err + } + } + if len(r.DryRunCandidates) > 0 { + if _, err := fmt.Fprintf(w, "Would delete:\n"); err != nil { + return err + } + for _, c := range r.DryRunCandidates { + if _, err := fmt.Fprintf(w, " %s (%s, modified %s)\n", + c.Key, + utils.FormatBytes(uint64(c.Size)), + c.ModTime.UTC().Format("2006-01-02T15:04:05Z"), + ); err != nil { + return err + } + } + } + return nil +} diff --git a/pkg/cas/prune_test.go b/pkg/cas/prune_test.go new file mode 100644 index 00000000..4b7ecc00 --- /dev/null +++ b/pkg/cas/prune_test.go @@ -0,0 +1,722 @@ +package cas_test + +import ( + "bytes" + "context" + "errors" + "fmt" + "io" + "strings" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/testfixtures" + "github.com/Altinity/clickhouse-backup/v2/pkg/utils" + "github.com/stretchr/testify/require" +) + +// uploadTestBackup builds a synthetic local backup with one part containing +// one inline file + one above-threshold blob, then cas.Uploads it. +// Returns the upload result so callers can inspect blob sizes. +func uploadTestBackup(t *testing.T, f *fakedst.Fake, cfg cas.Config, name string, blobHash cas.Hash128) { + t.Helper() + ctx := context.Background() + parts := []testfixtures.PartSpec{ + { + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 16, HashLow: 1, HashHigh: 0}, // inline + {Name: "data.bin", Size: 4096, HashLow: blobHash.Low, HashHigh: blobHash.High}, + }, + }, + } + src := testfixtures.Build(t, parts) + if _, err := cas.Upload(ctx, f, cfg, name, cas.UploadOptions{LocalBackupDir: src.Root}); err != nil { + t.Fatalf("Upload %s: %v", name, err) + } +} + +func ageBlob(t *testing.T, f *fakedst.Fake, cfg cas.Config, h cas.Hash128, age time.Duration) { + t.Helper() + f.SetModTime(cas.BlobPath(cfg.ClusterPrefix(), h), time.Now().Add(-age)) +} + +func TestPrune_HappyPath(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + // 2 backups, 4 distinct blobs. + hShared := cas.Hash128{Low: 0x10, High: 0x10} + h1 := cas.Hash128{Low: 0x20, High: 0x10} + h2 := cas.Hash128{Low: 0x30, High: 0x10} + hOrphanOld := cas.Hash128{Low: 0x40, High: 0x10} + hOrphanFresh := cas.Hash128{Low: 0x50, High: 0x10} + + uploadTestBackup(t, f, cfg, "bk1", hShared) + uploadTestBackup(t, f, cfg, "bk2", h1) + + // Manually drop two more blobs that aren't referenced by any backup. + cp := cfg.ClusterPrefix() + for _, h := range []cas.Hash128{hOrphanOld, hOrphanFresh, h2} { + _ = f.PutFile(ctx, cas.BlobPath(cp, h), io.NopCloser(bytes.NewReader([]byte("x"))), 1) + } + // Age the orphan-old and h2 (also unreferenced) past grace; orphan-fresh stays inside grace. + ageBlob(t, f, cfg, hOrphanOld, 2*time.Hour) + ageBlob(t, f, cfg, h2, 2*time.Hour) + ageBlob(t, f, cfg, hOrphanFresh, 30*time.Minute) + // Also age the referenced blobs past grace (they should NOT be deleted). + ageBlob(t, f, cfg, hShared, 2*time.Hour) + ageBlob(t, f, cfg, h1, 2*time.Hour) + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{GraceBlob: time.Hour, GraceBlobSet: true}) + if err != nil { + t.Fatal(err) + } + if rep.OrphansDeleted != 2 { + t.Errorf("OrphansDeleted: got %d want 2 (hOrphanOld + h2)", rep.OrphansDeleted) + } + // hOrphanFresh (within grace) and the referenced blobs must survive. + if _, _, exists, _ := f.StatFile(ctx, cas.BlobPath(cp, hOrphanFresh)); !exists { + t.Error("hOrphanFresh should be retained (within grace)") + } + if _, _, exists, _ := f.StatFile(ctx, cas.BlobPath(cp, hShared)); !exists { + t.Error("hShared (referenced) must survive prune") + } + // Marker is gone (defer release). + if _, _, exists, _ := f.StatFile(ctx, cas.PruneMarkerPath(cp)); exists { + t.Error("prune.marker should be released after Prune returns") + } +} + +func TestPrune_RefusesIfFreshInProgressMarker(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + if _, err := cas.WriteInProgressMarker(ctx, f, cfg.ClusterPrefix(), "bk_running", "host-a"); err != nil { + t.Fatal(err) + } + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{AbandonThreshold: time.Hour, AbandonThresholdSet: true}) + if err == nil || !strings.Contains(err.Error(), "in-progress upload") { + t.Fatalf("want fresh-inprogress refusal, got rep=%+v err=%v", rep, err) + } + // Anti-regression: the error must point operators at --abandon-threshold, + // not at --unlock (which removes the prune.marker, not inprogress markers). + if !strings.Contains(err.Error(), "--abandon-threshold") { + t.Errorf("error should point operators at --abandon-threshold; got: %v", err) + } + if strings.Contains(err.Error(), "--unlock") { + t.Errorf("error should not suggest --unlock for inprogress markers; got: %v", err) + } +} + +func TestPrune_SweepsAbandonedMarker(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + cp := cfg.ClusterPrefix() + if _, err := cas.WriteInProgressMarker(ctx, f, cp, "bk_dead", "host-a"); err != nil { + t.Fatal(err) + } + // Age past abandon_threshold (1h here, default 7d). + f.SetModTime(cas.InProgressMarkerPath(cp, "bk_dead"), time.Now().Add(-2*time.Hour)) + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{AbandonThreshold: time.Hour, AbandonThresholdSet: true}) + if err != nil { + t.Fatal(err) + } + if rep.AbandonedMarkersFound != 1 { + t.Errorf("AbandonedMarkersFound: got %d want 1", rep.AbandonedMarkersFound) + } + if _, _, exists, _ := f.StatFile(ctx, cas.InProgressMarkerPath(cp, "bk_dead")); exists { + t.Error("abandoned marker should be deleted by prune") + } +} + +// failingBackend wraps cas.Backend and forces GetFile to fail for one key — +// used to inject a "live backup unreadable" error mid-prune. +type failingBackend struct { + cas.Backend + failGetKey string +} + +func (f *failingBackend) GetFile(ctx context.Context, key string) (io.ReadCloser, error) { + if key == f.failGetKey { + return nil, errors.New("simulated network error") + } + return f.Backend.GetFile(ctx, key) +} + +func TestPrune_FailClosedOnUnreadableLiveBackup(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + uploadTestBackup(t, f, cfg, "bk1", cas.Hash128{Low: 0x10, High: 0x10}) + + // Inject a failure for bk1's per-table archive. + cp := cfg.ClusterPrefix() + failKey := cas.PartArchivePath(cp, "bk1", "default", "db1", "t1") + fb := &failingBackend{Backend: f, failGetKey: failKey} + + // Drop an unreferenced blob that prune SHOULD delete on a healthy run. + hOrphan := cas.Hash128{Low: 0x99, High: 0x99} + _ = f.PutFile(ctx, cas.BlobPath(cp, hOrphan), io.NopCloser(bytes.NewReader([]byte("x"))), 1) + ageBlob(t, f, cfg, hOrphan, 2*time.Hour) + + rep, err := cas.Prune(ctx, fb, cfg, cas.PruneOptions{GraceBlob: time.Hour, GraceBlobSet: true}) + if err == nil { + t.Fatal("expected fail-closed error from unreadable live backup") + } + if rep.OrphansDeleted != 0 { + t.Errorf("OrphansDeleted: got %d want 0 (must NOT delete after fail-close)", rep.OrphansDeleted) + } + // Orphan blob must still exist. + if _, _, exists, _ := f.StatFile(ctx, cas.BlobPath(cp, hOrphan)); !exists { + t.Error("orphan must survive a fail-closed prune") + } + // Marker is gone (defer release runs even on error). + if _, _, exists, _ := f.StatFile(ctx, cas.PruneMarkerPath(cp)); exists { + t.Error("prune.marker should be released even on error path") + } +} + +func TestPrune_DryRun(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + cp := cfg.ClusterPrefix() + + hOrphan := cas.Hash128{Low: 0x77, High: 0x77} + _ = f.PutFile(ctx, cas.BlobPath(cp, hOrphan), io.NopCloser(bytes.NewReader([]byte("x"))), 1) + ageBlob(t, f, cfg, hOrphan, 2*time.Hour) + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{DryRun: true, GraceBlob: time.Hour, GraceBlobSet: true}) + if err != nil { + t.Fatal(err) + } + if rep.OrphanBlobsConsidered != 1 { + t.Errorf("OrphanBlobsConsidered: got %d want 1", rep.OrphanBlobsConsidered) + } + if rep.OrphansDeleted != 0 { + t.Errorf("OrphansDeleted (dry-run): got %d want 0", rep.OrphansDeleted) + } + // Blob still exists (not deleted in dry-run). + if _, _, exists, _ := f.StatFile(ctx, cas.BlobPath(cp, hOrphan)); !exists { + t.Error("dry-run must NOT delete blobs") + } + // No marker written in dry-run. + if _, _, exists, _ := f.StatFile(ctx, cas.PruneMarkerPath(cp)); exists { + t.Error("dry-run must NOT write prune.marker") + } +} + +func TestPrune_Unlock(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + cp := cfg.ClusterPrefix() + + if _, _, err := cas.WritePruneMarker(ctx, f, cp, "host-stuck"); err != nil { + t.Fatal(err) + } + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{Unlock: true}) + if err != nil { + t.Fatal(err) + } + if rep == nil { + t.Fatal("expected non-nil report") + } + if _, _, exists, _ := f.StatFile(ctx, cas.PruneMarkerPath(cp)); exists { + t.Error("--unlock should delete the prune marker") + } +} + +func TestPrune_UnlockRefusesIfNoMarker(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + _, err := cas.Prune(context.Background(), f, cfg, cas.PruneOptions{Unlock: true}) + if err == nil || !strings.Contains(err.Error(), "no prune.marker present") { + t.Fatalf("want no-marker error, got %v", err) + } +} + +func TestPrune_DryRunUnlockKeepsMarker(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + runID, created, err := cas.WritePruneMarker(ctx, f, cfg.ClusterPrefix(), "host-other") + if err != nil || !created { + t.Fatalf("WritePruneMarker setup: created=%v err=%v", created, err) + } + _ = runID + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{Unlock: true, DryRun: true}) + if err != nil { + t.Fatalf("Prune --dry-run --unlock returned error: %v", err) + } + if rep == nil || !rep.DryRun { + t.Errorf("expected DryRun=true in report; got %+v", rep) + } + + // The marker must still exist. + _, _, exists, _ := f.StatFile(ctx, cas.PruneMarkerPath(cfg.ClusterPrefix())) + if !exists { + t.Error("prune marker was deleted by --dry-run --unlock; expected it to survive") + } +} + +func TestPrune_MetadataOrphanSubtreeSwept(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + cp := cfg.ClusterPrefix() + + // Hand-craft a metadata orphan: per-table JSON without metadata.json. + body := []byte(`{"database":"db","table":"t"}`) + if err := f.PutFile(ctx, cas.TableMetaPath(cp, "halfdeleted", "db", "t"), + io.NopCloser(bytes.NewReader(body)), int64(len(body))); err != nil { + t.Fatal(err) + } + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{GraceBlob: time.Hour, GraceBlobSet: true}) + if err != nil { + t.Fatal(err) + } + if rep.MetadataOrphansFound != 1 { + t.Errorf("MetadataOrphansFound: got %d want 1", rep.MetadataOrphansFound) + } + // Subtree gone. + if _, _, exists, _ := f.StatFile(ctx, cas.TableMetaPath(cp, "halfdeleted", "db", "t")); exists { + t.Error("metadata-orphan per-table JSON should be deleted") + } +} + +// TestPrune_ReportCountersPopulated verifies that BlobsTotal and +// OrphansHeldByGrace are correctly populated in the PruneReport. +// It constructs a fake backend with: +// - 1 live-referenced blob (hLive) +// - 1 stale orphan older than grace (hStaleOrphan) — will be deleted +// - 1 fresh orphan within grace (hFreshOrphan) — held by grace +func TestPrune_ReportCountersPopulated(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + cp := cfg.ClusterPrefix() + + hLive := cas.Hash128{Low: 0xA1, High: 0xA1} + hStaleOrphan := cas.Hash128{Low: 0xB2, High: 0xB2} + hFreshOrphan := cas.Hash128{Low: 0xC3, High: 0xC3} + + // Upload a backup that references hLive. + uploadTestBackup(t, f, cfg, "bk-live", hLive) + + // Manually place stale and fresh orphan blobs. + for _, h := range []cas.Hash128{hStaleOrphan, hFreshOrphan} { + if err := f.PutFile(ctx, cas.BlobPath(cp, h), io.NopCloser(bytes.NewReader([]byte("x"))), 1); err != nil { + t.Fatal(err) + } + } + + // Age the live blob and stale orphan past grace; fresh orphan stays inside. + ageBlob(t, f, cfg, hLive, 2*time.Hour) + ageBlob(t, f, cfg, hStaleOrphan, 2*time.Hour) + ageBlob(t, f, cfg, hFreshOrphan, 30*time.Minute) + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{GraceBlob: time.Hour, GraceBlobSet: true}) + if err != nil { + t.Fatal(err) + } + + // 3 blobs total: hLive + hStaleOrphan + hFreshOrphan. + if rep.BlobsTotal != 3 { + t.Errorf("BlobsTotal: got %d want 3", rep.BlobsTotal) + } + // hFreshOrphan is an orphan but within grace → held. + if rep.OrphansHeldByGrace != 1 { + t.Errorf("OrphansHeldByGrace: got %d want 1", rep.OrphansHeldByGrace) + } + // hStaleOrphan should be deleted. + if rep.OrphansDeleted != 1 { + t.Errorf("OrphansDeleted: got %d want 1", rep.OrphansDeleted) + } +} + +func TestPrune_RejectsInvalidConfig(t *testing.T) { + ctx := context.Background() + b := fakedst.New() + // Enabled=true but no ClusterID → Validate must reject it. + cfg := cas.Config{Enabled: true} + _, err := cas.Prune(ctx, b, cfg, cas.PruneOptions{}) + require.Error(t, err) + require.Contains(t, strings.ToLower(err.Error()), "cluster_id") +} + +func TestPrune_RefusesWhenDisabled(t *testing.T) { + cfg := testCfg(1024) + cfg.Enabled = false + _, err := cas.Prune(context.Background(), fakedst.New(), cfg, cas.PruneOptions{}) + if err == nil || !strings.Contains(err.Error(), "cas.enabled=false") { + t.Fatalf("want cas.enabled=false error, got %v", err) + } +} + +// TestPrune_ZeroModTimeMarkerIsFresh verifies that a marker with a +// zero ModTime (e.g. FTP LIST without MLSD facts) is classified as +// fresh, not abandoned. The conservative choice avoids the data-loss +// path where prune sweeps a real in-progress upload. +func TestPrune_ZeroModTimeMarkerIsFresh(t *testing.T) { + f := fakedst.New() + cp := testCfg(1024).ClusterPrefix() + ctx := context.Background() + + // Place a marker with zero ModTime via the fake's hook. + if _, err := cas.WriteInProgressMarker(ctx, f, cp, "bk_zero", "host"); err != nil { + t.Fatal(err) + } + f.SetModTime(cas.InProgressMarkerPath(cp, "bk_zero"), time.Time{}) + + // Use a very small abandon threshold so a non-zero-ModTime marker + // would otherwise classify as abandoned. + rep, err := cas.Prune(ctx, f, testCfg(1024), cas.PruneOptions{ + AbandonThreshold: time.Nanosecond, + AbandonThresholdSet: true, + }) + // The marker is fresh → Prune should refuse with the freshness error. + if err == nil { + t.Fatalf("expected Prune to refuse for fresh marker; rep=%+v", rep) + } + if !strings.Contains(err.Error(), "are fresh") { + t.Errorf("expected 'are fresh' in error; got: %v", err) + } +} + +// TestPrune_RefusesIfAnotherPruneRunning verifies that a second cas-prune +// run refuses cleanly when another prune is in flight, AND that the +// existing marker is not deleted by the failing run's deferred cleanup. +// The latter assertion is the regression guard for the original +// "deferred-delete races second prune" bug. +func TestPrune_RefusesIfAnotherPruneRunning(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + // Pre-write a prune marker simulating another prune in flight. + runID, created, err := cas.WritePruneMarker(ctx, f, cfg.ClusterPrefix(), "host-other") + if err != nil || !created { + t.Fatalf("WritePruneMarker setup: created=%v err=%v", created, err) + } + _ = runID + + _, err = cas.Prune(ctx, f, cfg, cas.PruneOptions{}) + if err == nil { + t.Fatal("expected Prune to refuse when marker is already held") + } + if !strings.Contains(err.Error(), "another prune is in progress") { + t.Errorf("error should mention concurrent prune; got: %v", err) + } + + // Critical: the existing marker must NOT have been deleted by the + // failing prune's defer. Without the scoped-defer fix it would be. + if _, _, exists, _ := f.StatFile(ctx, cas.PruneMarkerPath(cfg.ClusterPrefix())); !exists { + t.Error("prune marker should survive a refused second prune") + } +} + +// TestPrune_ExplicitZeroOverridesConfigGrace verifies that passing +// GraceBlobSet=true with GraceBlob=0 bypasses the non-zero cfg.GraceBlob +// (24h in testCfg) and immediately prunes a freshly-created orphan blob. +func TestPrune_ExplicitZeroOverridesConfigGrace(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) // GraceBlob is "24h" after Validate() + ctx := context.Background() + cp := cfg.ClusterPrefix() + + // Place a fresh orphan blob (not referenced by any backup, modtime = now). + hFreshOrphan := cas.Hash128{Low: 0xDE, High: 0xAD} + if err := f.PutFile(ctx, cas.BlobPath(cp, hFreshOrphan), io.NopCloser(bytes.NewReader([]byte("x"))), 1); err != nil { + t.Fatal(err) + } + // modtime stays at "now" — within the 24h config grace, so a normal run + // would hold it. With explicit --grace-blob=0s it must be swept. + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{ + GraceBlob: 0, + GraceBlobSet: true, + }) + require.NoError(t, err) + require.Equal(t, uint64(0), rep.OrphansHeldByGrace, "explicit zero must override 24h config grace") + require.Equal(t, uint64(1), rep.OrphansDeleted, "fresh orphan must be deleted with grace=0") + + // Double-check the blob is actually gone. + if _, _, exists, _ := f.StatFile(ctx, cas.BlobPath(cp, hFreshOrphan)); exists { + t.Error("fresh orphan must be deleted when --grace-blob=0s overrides 24h config") + } +} + +// TestPrune_ExplicitZeroOverridesConfigAbandon verifies that passing +// AbandonThresholdSet=true with AbandonThreshold=0 bypasses the non-zero +// cfg.AbandonThreshold (168h in testCfg) and treats every in-progress marker +// as abandoned — allowing prune to proceed and sweep it. +func TestPrune_ExplicitZeroOverridesConfigAbandon(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) // AbandonThreshold is "168h" after Validate() + ctx := context.Background() + cp := cfg.ClusterPrefix() + + // Write a fresh in-progress marker (modtime = now). Under the 168h config + // threshold it would block prune. With explicit --abandon-threshold=0s every + // marker has age >= 0 == threshold and is classified as abandoned. + if _, err := cas.WriteInProgressMarker(ctx, f, cp, "bk_fresh_but_dead", "host-a"); err != nil { + t.Fatal(err) + } + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{ + AbandonThreshold: 0, + AbandonThresholdSet: true, + }) + require.NoError(t, err, "explicit zero abandon-threshold must not block on fresh in-progress marker") + require.Equal(t, 1, rep.AbandonedMarkersFound, "fresh marker must be swept with abandon-threshold=0") + + // The marker must be gone. + if _, _, exists, _ := f.StatFile(ctx, cas.InProgressMarkerPath(cp, "bk_fresh_but_dead")); exists { + t.Error("in-progress marker must be deleted when --abandon-threshold=0s overrides 168h config") + } +} + +// TestPrune_ZeroLiveBackupsAllOrphaned verifies that when there are no live +// backups (no metadata.json present) and the operator explicitly passes +// GraceBlobSet=true with GraceBlob=0, all orphan blobs are reclaimed +// immediately. This locks the intentional "empty namespace + explicit zero +// grace = wipe orphans" contract: the operator opts in deliberately. +func TestPrune_ZeroLiveBackupsAllOrphaned(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + cp := cfg.ClusterPrefix() + + // Place 3 orphan blobs — no metadata.json anywhere (zero live backups). + hA := cas.Hash128{Low: 0xAA, High: 0xBB} + hB := cas.Hash128{Low: 0xCC, High: 0xDD} + hC := cas.Hash128{Low: 0xEE, High: 0xFF} + for _, h := range []cas.Hash128{hA, hB, hC} { + if err := f.PutFile(ctx, cas.BlobPath(cp, h), io.NopCloser(bytes.NewReader([]byte("x"))), 1); err != nil { + t.Fatal(err) + } + } + // Blobs are fresh (modtime = now). With explicit zero grace they must be + // swept regardless; no backup pins them. + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{ + GraceBlob: 0, + GraceBlobSet: true, // operator explicitly opted in — destructive by intent + }) + require.NoError(t, err) + require.Equal(t, 0, rep.LiveBackups, "no metadata.json → LiveBackups must be 0") + require.Equal(t, uint64(3), rep.OrphanBlobsConsidered, "all 3 blobs are orphans") + require.Equal(t, uint64(3), rep.OrphansDeleted, "explicit zero grace must wipe all 3 orphans") + require.Equal(t, uint64(0), rep.OrphansHeldByGrace, "nothing held when grace=0") +} + +// TestPrune_ZeroLiveBackupsRespectsGrace is the sibling of +// TestPrune_ZeroLiveBackupsAllOrphaned. Same setup (no live backups, 3 fresh +// orphan blobs) but the operator did NOT set an explicit grace — the +// 24h config default applies. Because the blobs are freshly written they +// fall inside the grace window and must be protected. +// +// Together the two tests document the contract: +// - explicit zero grace (GraceBlobSet=true, GraceBlob=0) → destructive by intent +// - default grace (GraceBlobSet=false) → conservative, protects fresh blobs +func TestPrune_ZeroLiveBackupsRespectsGrace(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) // GraceBlob is "24h" after Validate() + ctx := context.Background() + cp := cfg.ClusterPrefix() + + // Place 3 fresh orphan blobs — no metadata.json (zero live backups). + hA := cas.Hash128{Low: 0x11, High: 0x22} + hB := cas.Hash128{Low: 0x33, High: 0x44} + hC := cas.Hash128{Low: 0x55, High: 0x66} + for _, h := range []cas.Hash128{hA, hB, hC} { + if err := f.PutFile(ctx, cas.BlobPath(cp, h), io.NopCloser(bytes.NewReader([]byte("x"))), 1); err != nil { + t.Fatal(err) + } + } + // Blobs stay fresh (modtime = now). Default 24h grace must hold them all. + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{ + // GraceBlobSet intentionally left false → config default (24h) applies. + }) + require.NoError(t, err) + require.Equal(t, 0, rep.LiveBackups, "no metadata.json → LiveBackups must be 0") + // OrphanBlobsConsidered counts blobs that made it past the grace filter + // (i.e. candidates for deletion). Fresh blobs never reach that stage, so + // the counter is 0 here — all 3 are gated out by grace before becoming + // candidates. + require.Equal(t, uint64(0), rep.OrphanBlobsConsidered, "fresh blobs held by grace, not candidates") + require.Equal(t, uint64(0), rep.OrphansDeleted, "fresh blobs must survive default 24h grace") + require.Equal(t, uint64(3), rep.OrphansHeldByGrace, "all 3 fresh orphans held by grace") +} + +// TestPrune_ParallelMarkPhaseStillCorrect verifies the parallel mark phase +// (buildMarkSetParallel) produces a complete and correct mark set when +// multiple live backups each reference distinct blobs. Running under -race +// guards against data races in the pool. +func TestPrune_ParallelMarkPhaseStillCorrect(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + cp := cfg.ClusterPrefix() + + // 5 backups, each with a unique blob hash. Prune must retain all 5. + blobHashes := []cas.Hash128{ + {Low: 0x0A, High: 0x01}, + {Low: 0x0B, High: 0x02}, + {Low: 0x0C, High: 0x03}, + {Low: 0x0D, High: 0x04}, + {Low: 0x0E, High: 0x05}, + } + for i, h := range blobHashes { + uploadTestBackup(t, f, cfg, fmt.Sprintf("bk%d", i+1), h) + } + + // Place an orphan blob older than grace — should be swept. + hOrphan := cas.Hash128{Low: 0xFF, High: 0xFF} + if err := f.PutFile(ctx, cas.BlobPath(cp, hOrphan), + io.NopCloser(bytes.NewReader([]byte("x"))), 1); err != nil { + t.Fatal(err) + } + ageBlob(t, f, cfg, hOrphan, 2*time.Hour) + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{GraceBlob: time.Hour, GraceBlobSet: true}) + if err != nil { + t.Fatalf("Prune: %v", err) + } + + // Exactly one orphan should be deleted. + if rep.OrphansDeleted != 1 { + t.Errorf("OrphansDeleted: got %d want 1", rep.OrphansDeleted) + } + + // All 5 referenced blobs must survive. + for i, h := range blobHashes { + if _, _, exists, _ := f.StatFile(ctx, cas.BlobPath(cp, h)); !exists { + t.Errorf("bk%d referenced blob was incorrectly deleted", i+1) + } + } +} + +func TestPrintPruneReport_FormatsBytes(t *testing.T) { + var buf bytes.Buffer + err := cas.PrintPruneReport(&cas.PruneReport{BytesReclaimed: 1572864}, &buf) + require.NoError(t, err) + out := buf.String() + // 1572864 bytes = 1.5 MiB; assert FormatBytes-style rendering is present + require.Contains(t, out, utils.FormatBytes(1572864)) + require.Contains(t, out, "(1572864)") +} + +// TestPrune_BlobDeleteFailuresCounted verifies that BlobDeleteFailures is +// incremented for every failed delete (not just the first) and that the +// field appears in PrintPruneReport output when non-zero. +func TestPrune_BlobDeleteFailuresCounted(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + cp := cfg.ClusterPrefix() + + // Place 3 orphan blobs, all aged past grace so they are candidates. + hA := cas.Hash128{Low: 0xA1, High: 0x00} + hB := cas.Hash128{Low: 0xB2, High: 0x00} + hC := cas.Hash128{Low: 0xC3, High: 0x00} + for _, h := range []cas.Hash128{hA, hB, hC} { + if err := f.PutFile(ctx, cas.BlobPath(cp, h), io.NopCloser(bytes.NewReader([]byte("x"))), 1); err != nil { + t.Fatal(err) + } + ageBlob(t, f, cfg, h, 2*time.Hour) + } + + // Make delete fail for hA and hB but succeed for hC. + failKeys := map[string]bool{ + cas.BlobPath(cp, hA): true, + cas.BlobPath(cp, hB): true, + } + f.SetDeleteHook(func(key string) (error, bool) { + if failKeys[key] { + return errors.New("simulated delete failure"), true + } + return nil, false + }) + + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{GraceBlob: time.Hour, GraceBlobSet: true}) + // Prune returns an error (first failure) but also a partial report. + require.Error(t, err) + require.Equal(t, 2, rep.BlobDeleteFailures, "BlobDeleteFailures should count all failures, not just the first") + require.Equal(t, uint64(1), rep.OrphansDeleted, "hC should have been successfully deleted") + + // hA and hB must still exist (delete failed). + _, _, existsA, _ := f.StatFile(ctx, cas.BlobPath(cp, hA)) + _, _, existsB, _ := f.StatFile(ctx, cas.BlobPath(cp, hB)) + require.True(t, existsA, "hA must survive a failed delete") + require.True(t, existsB, "hB must survive a failed delete") + + // Verify PrintPruneReport surfaces the failure count. + var buf bytes.Buffer + require.NoError(t, cas.PrintPruneReport(rep, &buf)) + require.Contains(t, buf.String(), "Blob delete failures: 2") +} + +// ctxRespectingPruneBackend wraps cas.Backend and makes Walk fail with +// context.Canceled when the passed context is already cancelled. This lets +// TestPrune_CancelledContextStillReleasesMarker exercise the deferred +// cleanup path with a pre-cancelled operation context. +type ctxRespectingPruneBackend struct { + cas.Backend +} + +func (c *ctxRespectingPruneBackend) Walk(ctx context.Context, prefix string, recursive bool, fn func(cas.RemoteFile) error) error { + if err := ctx.Err(); err != nil { + return err + } + return c.Backend.Walk(ctx, prefix, recursive, fn) +} + +// TestPrune_CancelledContextStillReleasesMarker verifies detached-context +// cleanup (#2) for Prune: when the operation context is cancelled before Prune +// returns, the deferred cleanup uses a fresh context.Background()-derived ctx +// and still removes the prune.marker. +func TestPrune_CancelledContextStillReleasesMarker(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + cp := cfg.ClusterPrefix() + + // Upload a backup so there's something to prune. + uploadTestBackup(t, f, cfg, "bk1", cas.Hash128{Low: 0x10, High: 0x10}) + + // Use a pre-cancelled context. ctxRespectingPruneBackend translates it into + // a Walk error inside listLiveBackups, so Prune errors out after writing + // the prune.marker — exercising the deferred cleanup path. + cancelCtx, cancelFn := context.WithCancel(context.Background()) + cancelFn() // cancel immediately + + _, err := cas.Prune(cancelCtx, &ctxRespectingPruneBackend{f}, cfg, cas.PruneOptions{ + GraceBlob: time.Hour, + GraceBlobSet: true, + }) + if err == nil { + t.Fatal("expected Prune to fail with cancelled context") + } + + // The prune.marker must be absent — the deferred cleanup ran with a + // detached context even though the operation ctx was already cancelled. + markerKey := cas.PruneMarkerPath(cp) + if _, _, exists, statErr := f.StatFile(context.Background(), markerKey); statErr != nil { + t.Fatalf("StatFile(prune.marker): %v", statErr) + } else if exists { + t.Error("prune.marker still present after cancelled-ctx Prune — detached cleanup context not working") + } +} diff --git a/pkg/cas/restore.go b/pkg/cas/restore.go new file mode 100644 index 00000000..13a8c588 --- /dev/null +++ b/pkg/cas/restore.go @@ -0,0 +1,103 @@ +package cas + +import ( + "context" + "errors" +) + +// V1RestoreFunc is the callback supplied by the CLI binding (Task 19) to +// invoke the existing v1 restore flow on the local directory materialized +// by cas-download. It receives the absolute local backup directory (the +// one returned in DownloadResult.LocalBackupDir) plus the original +// RestoreOptions; the binding extracts whatever subset of fields v1's +// Backuper.Restore needs. +// +// Defining the handoff as a callback keeps pkg/cas free of any dependency +// on pkg/backup (which would create an import cycle: pkg/backup already +// transitively imports pkg/cas via pkg/storage → pkg/config). +type V1RestoreFunc func(ctx context.Context, localBackupDir string, opts RestoreOptions) error + +// RestoreOptions extends DownloadOptions with the v1-restore flags that +// the CAS-restore CLI surface mirrors. Only the subset of v1 flags that +// makes sense for CAS backups is exposed; the binding in Task 19 wires +// these into Backuper.Restore positional arguments. +// +// Flags omitted on purpose: +// - IgnoreDependencies: CAS backups have no dependency chain (each is a +// standalone snapshot); accepting it would invite confusion. Treated +// as an error if set. +// - RestoreRBAC, RBACOnly, RestoreConfigs, ConfigsOnly, +// RestoreNamedCollections, NamedCollectionsOnly: out of scope for CAS +// v1, which only handles MergeTree-family table data. Reserved for a +// future revision. +type RestoreOptions struct { + DownloadOptions + + // DropExists maps to v1 --rm: drop existing tables before re-creating. + DropExists bool + + // DataOnly / SchemaOnly are inherited from DownloadOptions and are + // passed through to v1 in the binding. + + // DatabaseMapping rewrites at restore time + // (--restore-database-mapping). + DatabaseMapping []string + // TableMapping rewrites at restore time + // (--restore-table-mapping). + TableMapping []string + // SkipProjections suppresses listed projections during data restore + // (--skip-projections). + SkipProjections []string + + // RestoreSchemaAsAttach: use ATTACH instead of CREATE for schema + // (v1 --restore-schema-as-attach). + RestoreSchemaAsAttach bool + // ReplicatedCopyToDetached: for Replicated*MergeTree, copy to + // detached/ and skip the final ATTACH (v1 --replicated-copy-to-detached). + ReplicatedCopyToDetached bool + // SkipEmptyTables suppresses errors for tables with no parts + // (v1 --skip-empty-tables). + SkipEmptyTables bool + + // Resume enables the resumable-state file (v1 --resume). + Resume bool + + // BackupVersion is propagated to v1 for log-line consistency. + BackupVersion string + // CommandID is the status.Current correlator (v1 --command-id). + CommandID int + + // IgnoreDependencies is rejected by Restore; declared here so the CLI + // binding can set it from the cobra flag and have us produce the + // rejection error in a single place. + IgnoreDependencies bool +} + +// Restore runs cas-download and hands off to runV1, which is expected to +// invoke the existing pkg/backup.Backuper.Restore flow against the local +// directory cas-download just materialized. +// +// Errors: +// - ErrCASBackup / ErrV1Backup / ErrUnsupportedLayoutVersion etc. from +// the underlying ValidateBackup + Download. +// - A descriptive error if --ignore-dependencies is set (CAS backups +// have no dependency chain). +// - A descriptive error if --data-only is set (CAS restore doesn't yet +// support data-only restoration). +// - Whatever runV1 returns. +func Restore(ctx context.Context, b Backend, cfg Config, name string, opts RestoreOptions, runV1 V1RestoreFunc) error { + if opts.IgnoreDependencies { + return errors.New("cas: --ignore-dependencies is not applicable to CAS backups (no dependency chain)") + } + if opts.DataOnly { + return errors.New("cas: --data-only is not yet implemented for cas-restore (use the v1 flow if you need data-only restoration)") + } + if runV1 == nil { + return errors.New("cas: V1RestoreFunc not supplied; CLI binding must wire pkg/backup.Backuper.Restore") + } + res, err := Download(ctx, b, cfg, name, opts.DownloadOptions) + if err != nil { + return err + } + return runV1(ctx, res.LocalBackupDir, opts) +} diff --git a/pkg/cas/restore_test.go b/pkg/cas/restore_test.go new file mode 100644 index 00000000..bbd8ac90 --- /dev/null +++ b/pkg/cas/restore_test.go @@ -0,0 +1,158 @@ +package cas_test + +import ( + "context" + "errors" + "os" + "path/filepath" + "strings" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/testfixtures" +) + +// uploadAndPrepare seeds a fake backend with a CAS backup named "b1" that +// downloads cleanly. Returned bits are everything Restore needs. +func uploadAndPrepare(t *testing.T, name string) (*fakedst.Fake, cas.Config, cas.RestoreOptions) { + t.Helper() + parts := []testfixtures.PartSpec{ + {Disk: "default", DB: "db1", Table: "t1", Name: "p1", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 1}, + {Name: "data.bin", Size: 1024, HashLow: 999, HashHigh: 1, Bytes: makeBlobBytes(0x42)}, + }}, + } + lb := testfixtures.Build(t, parts) + f := fakedst.New() + cfg := testCfg(100) + if _, err := cas.Upload(context.Background(), f, cfg, name, cas.UploadOptions{LocalBackupDir: lb.Root}); err != nil { + t.Fatalf("Upload: %v", err) + } + opts := cas.RestoreOptions{ + DownloadOptions: cas.DownloadOptions{LocalBackupDir: t.TempDir()}, + } + return f, cfg, opts +} + +func TestRestore_HappyPath(t *testing.T) { + f, cfg, opts := uploadAndPrepare(t, "b1") + var ( + gotDir string + gotName string + calls int + ) + cb := func(ctx context.Context, localBackupDir string, ro cas.RestoreOptions) error { + calls++ + gotDir = localBackupDir + gotName = filepath.Base(localBackupDir) + // Sanity: the directory should actually exist on disk after Download. + if _, err := os.Stat(filepath.Join(localBackupDir, "metadata.json")); err != nil { + t.Errorf("metadata.json missing under callback's localBackupDir: %v", err) + } + return nil + } + if err := cas.Restore(context.Background(), f, cfg, "b1", opts, cb); err != nil { + t.Fatalf("Restore: %v", err) + } + if calls != 1 { + t.Errorf("callback calls = %d, want 1", calls) + } + if gotName != "b1" { + t.Errorf("callback localBackupDir = %q, want basename b1 (got %q)", gotDir, gotName) + } + wantPrefix := opts.LocalBackupDir + if !strings.HasPrefix(gotDir, wantPrefix) { + t.Errorf("callback localBackupDir %q is not under %q", gotDir, wantPrefix) + } +} + +func TestRestore_PropagatesCallbackError(t *testing.T) { + f, cfg, opts := uploadAndPrepare(t, "b1") + sentinel := errors.New("v1 restore exploded") + cb := func(ctx context.Context, localBackupDir string, ro cas.RestoreOptions) error { + return sentinel + } + err := cas.Restore(context.Background(), f, cfg, "b1", opts, cb) + if !errors.Is(err, sentinel) { + t.Fatalf("got err=%v want sentinel %v", err, sentinel) + } +} + +func TestRestore_RefusesIgnoreDependencies(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + opts := cas.RestoreOptions{ + DownloadOptions: cas.DownloadOptions{LocalBackupDir: t.TempDir()}, + IgnoreDependencies: true, + } + called := 0 + cb := func(ctx context.Context, localBackupDir string, ro cas.RestoreOptions) error { + called++ + return nil + } + err := cas.Restore(context.Background(), f, cfg, "any", opts, cb) + if err == nil || !strings.Contains(err.Error(), "ignore-dependencies") { + t.Fatalf("got err=%v want ignore-dependencies error", err) + } + if called != 0 { + t.Errorf("callback called %d times under ignore-dependencies; want 0", called) + } +} + +func TestRestore_NilCallbackError(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + opts := cas.RestoreOptions{ + DownloadOptions: cas.DownloadOptions{LocalBackupDir: t.TempDir()}, + } + err := cas.Restore(context.Background(), f, cfg, "b1", opts, nil) + if err == nil || !strings.Contains(err.Error(), "V1RestoreFunc") { + t.Fatalf("got err=%v want V1RestoreFunc-not-supplied error", err) + } +} + +func TestRestore_PropagatesDownloadError(t *testing.T) { + // Empty backend → ValidateBackup fails on missing metadata.json. + f := fakedst.New() + cfg := testCfg(100) + opts := cas.RestoreOptions{ + DownloadOptions: cas.DownloadOptions{LocalBackupDir: t.TempDir()}, + } + called := 0 + cb := func(ctx context.Context, localBackupDir string, ro cas.RestoreOptions) error { + called++ + return nil + } + err := cas.Restore(context.Background(), f, cfg, "absent", opts, cb) + if !errors.Is(err, cas.ErrMissingMetadata) { + t.Fatalf("got err=%v want ErrMissingMetadata", err) + } + if called != 0 { + t.Errorf("callback called %d times despite Download failure; want 0", called) + } +} + +// TestRestore_DataOnlyRefuses mirrors TestDownload_DataOnlyRefuses for the +// restore entry point. +func TestRestore_DataOnlyRefuses(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + err := cas.Restore(ctx, f, cfg, "any", cas.RestoreOptions{ + DownloadOptions: cas.DownloadOptions{ + LocalBackupDir: t.TempDir(), + DataOnly: true, + }, + }, func(ctx context.Context, localBackupDir string, opts cas.RestoreOptions) error { + t.Fatal("v1 restore should not be invoked when DataOnly is rejected") + return nil + }) + if err == nil { + t.Fatal("expected Restore to refuse DataOnly") + } + if !strings.Contains(err.Error(), "data-only is not yet implemented") { + t.Errorf("error should mention 'data-only is not yet implemented'; got: %v", err) + } +} diff --git a/pkg/cas/status.go b/pkg/cas/status.go new file mode 100644 index 00000000..91de7ecd --- /dev/null +++ b/pkg/cas/status.go @@ -0,0 +1,222 @@ +package cas + +import ( + "context" + "fmt" + "io" + "sort" + "strings" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/utils" +) + +// StatusReport is the result of a LIST-only bucket health check. +type StatusReport struct { + BackupCount int `json:"backup_count"` + BlobCount int `json:"blob_count"` + BlobBytes int64 `json:"blob_bytes"` + PruneMarker *PruneMarkerInfo `json:"prune_marker,omitempty"` + InProgressFresh []InProgressInfo `json:"in_progress_fresh"` + InProgressAbandoned []InProgressInfo `json:"in_progress_abandoned"` + Backups []BackupSummary `json:"backups"` +} + +// BackupSummary holds minimal per-backup metadata collected during Status. +type BackupSummary struct { + Name string `json:"name"` + UploadedAt time.Time `json:"uploaded_at"` // ModTime of metadata.json +} + +// PruneMarkerInfo holds metadata about the prune.marker object. +type PruneMarkerInfo struct { + Path string `json:"path"` + ModTime time.Time `json:"mod_time"` + Age time.Duration `json:"-"` + AgeSeconds float64 `json:"age_seconds"` +} + +// InProgressInfo holds metadata about an inprogress marker object. +type InProgressInfo struct { + Backup string `json:"backup"` + ModTime time.Time `json:"mod_time"` + Age time.Duration `json:"-"` + AgeSeconds float64 `json:"age_seconds"` +} + +// Status performs a LIST-only bucket health summary for the given cluster. +// No object bodies are fetched; only metadata returned by Walk/StatFile is used. +func Status(ctx context.Context, b Backend, cfg Config) (*StatusReport, error) { + if err := cfg.Validate(); err != nil { + return nil, fmt.Errorf("cas: status: invalid config: %w", err) + } + cp := cfg.ClusterPrefix() + r := &StatusReport{} + + // 1. Enumerate backups: walk cas//metadata/ recursively and collect + // entries whose key ends in /metadata.json. + metaPrefix := cp + "metadata/" + if err := b.Walk(ctx, metaPrefix, true, func(f RemoteFile) error { + if !strings.HasSuffix(f.Key, "/metadata.json") { + return nil + } + // Strip prefix and "/metadata.json" suffix to extract backup name. + inner := strings.TrimPrefix(f.Key, metaPrefix) + // inner is "/metadata.json" (possibly deeper, but we only want + // the first path component as the backup name). + name := strings.TrimSuffix(inner, "/metadata.json") + // Reject paths with extra slashes (sub-dirs of a backup dir are not + // top-level metadata.json entries). + if strings.Contains(name, "/") { + return nil + } + r.Backups = append(r.Backups, BackupSummary{ + Name: name, + UploadedAt: f.ModTime, + }) + return nil + }); err != nil { + return nil, fmt.Errorf("cas status: walk metadata: %w", err) + } + + // Sort backups newest-first. + sort.Slice(r.Backups, func(i, j int) bool { + return r.Backups[i].UploadedAt.After(r.Backups[j].UploadedAt) + }) + r.BackupCount = len(r.Backups) + + // 2. Count blobs and sum sizes. + blobPrefix := cp + "blob/" + if err := b.Walk(ctx, blobPrefix, true, func(f RemoteFile) error { + r.BlobCount++ + r.BlobBytes += f.Size + return nil + }); err != nil { + return nil, fmt.Errorf("cas status: walk blobs: %w", err) + } + + // 3. Check prune marker. + pruneKey := PruneMarkerPath(cp) + _, modTime, exists, err := b.StatFile(ctx, pruneKey) + if err != nil { + return nil, fmt.Errorf("cas status: stat prune marker: %w", err) + } + if exists { + age := time.Since(modTime) + r.PruneMarker = &PruneMarkerInfo{ + Path: pruneKey, + ModTime: modTime, + Age: age, + AgeSeconds: age.Seconds(), + } + } + + // 4. Classify in-progress markers. + ipPrefix := cp + "inprogress/" + now := time.Now() + if err := b.Walk(ctx, ipPrefix, true, func(f RemoteFile) error { + if !strings.HasSuffix(f.Key, ".marker") { + return nil + } + // Extract backup name: strip prefix and ".marker" suffix. + inner := strings.TrimPrefix(f.Key, ipPrefix) + backup := strings.TrimSuffix(inner, ".marker") + age := now.Sub(f.ModTime) + info := InProgressInfo{ + Backup: backup, + ModTime: f.ModTime, + Age: age, + AgeSeconds: age.Seconds(), + } + if age >= cfg.AbandonThresholdDuration() { + r.InProgressAbandoned = append(r.InProgressAbandoned, info) + } else { + r.InProgressFresh = append(r.InProgressFresh, info) + } + return nil + }); err != nil { + return nil, fmt.Errorf("cas status: walk inprogress: %w", err) + } + + // Sort InProgressFresh and InProgressAbandoned by backup name. + sort.Slice(r.InProgressFresh, func(i, j int) bool { + return r.InProgressFresh[i].Backup < r.InProgressFresh[j].Backup + }) + sort.Slice(r.InProgressAbandoned, func(i, j int) bool { + return r.InProgressAbandoned[i].Backup < r.InProgressAbandoned[j].Backup + }) + + return r, nil +} + +// PrintStatus writes a human-readable summary of r to w. +func PrintStatus(r *StatusReport, w io.Writer) error { + // Backup summary line. + backupDetail := "none" + if r.BackupCount > 0 { + newest := r.Backups[0].Name + oldest := r.Backups[r.BackupCount-1].Name + backupDetail = fmt.Sprintf("newest: %s, oldest: %s", newest, oldest) + } + if _, err := fmt.Fprintf(w, " Backups: %d (%s)\n", r.BackupCount, backupDetail); err != nil { + return err + } + + // Blob summary line. + blobSize := utils.FormatBytes(uint64(r.BlobBytes)) + if _, err := fmt.Fprintf(w, " Blobs: %s objects, %s\n", formatInt(r.BlobCount), blobSize); err != nil { + return err + } + + if _, err := fmt.Fprintln(w); err != nil { + return err + } + + // Prune marker. + pruneStr := "NONE" + if r.PruneMarker != nil { + pruneStr = fmt.Sprintf("%s (age: %s)", r.PruneMarker.Path, r.PruneMarker.Age.Round(time.Second)) + } + if _, err := fmt.Fprintf(w, " Prune marker: %s\n", pruneStr); err != nil { + return err + } + + // In-progress markers. + if _, err := fmt.Fprintf(w, " In-progress markers: %d fresh, %d abandoned\n", + len(r.InProgressFresh), len(r.InProgressAbandoned)); err != nil { + return err + } + for _, ip := range r.InProgressFresh { + if _, err := fmt.Fprintf(w, " fresh: %s (%s ago)\n", + ip.Backup, ip.Age.Round(time.Second)); err != nil { + return err + } + } + for _, ip := range r.InProgressAbandoned { + if _, err := fmt.Fprintf(w, " abandoned: %s (%s ago)\n", + ip.Backup, ip.Age.Round(time.Second)); err != nil { + return err + } + } + return nil +} + +// formatInt formats an integer with comma separators (e.g. 42318 → "42,318"). +func formatInt(n int) string { + s := fmt.Sprintf("%d", n) + if n < 0 { + s = s[1:] + } + // Insert commas every 3 digits from the right. + var result []byte + for i, c := range s { + if i > 0 && (len(s)-i)%3 == 0 { + result = append(result, ',') + } + result = append(result, byte(c)) + } + if n < 0 { + return "-" + string(result) + } + return string(result) +} diff --git a/pkg/cas/status_test.go b/pkg/cas/status_test.go new file mode 100644 index 00000000..74939232 --- /dev/null +++ b/pkg/cas/status_test.go @@ -0,0 +1,167 @@ +package cas_test + +import ( + "context" + "encoding/json" + "strings" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/testfixtures" + "github.com/stretchr/testify/require" +) + +func TestStatus_EmptyBucket(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + r, err := cas.Status(context.Background(), f, cfg) + if err != nil { + t.Fatal(err) + } + if r.BackupCount != 0 || r.BlobCount != 0 { + t.Errorf("expected empty report, got %+v", r) + } + if r.PruneMarker != nil { + t.Error("expected no prune marker") + } + if len(r.InProgressFresh) != 0 || len(r.InProgressAbandoned) != 0 { + t.Error("expected no in-progress markers") + } +} + +func TestStatus_AfterUploads(t *testing.T) { + // Build two local backups with distinct blobs and upload them. + // smallPart uses data.bin (1024 bytes) which exceeds threshold=100 → 1 blob per backup. + // Both backups share no blobs (different hashLow values), so BlobCount = 2. + ctx := context.Background() + f := fakedst.New() + cfg := testCfg(100) + + lb1 := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + if _, err := cas.Upload(ctx, f, cfg, "bk_a", cas.UploadOptions{LocalBackupDir: lb1.Root}); err != nil { + t.Fatalf("Upload bk_a: %v", err) + } + + lb2 := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 1000)}) + if _, err := cas.Upload(ctx, f, cfg, "bk_b", cas.UploadOptions{LocalBackupDir: lb2.Root}); err != nil { + t.Fatalf("Upload bk_b: %v", err) + } + + r, err := cas.Status(ctx, f, cfg) + if err != nil { + t.Fatalf("Status: %v", err) + } + if r.BackupCount != 2 { + t.Errorf("BackupCount: got %d want 2", r.BackupCount) + } + // Each upload contributes 1 blob (data.bin, 1024 bytes, distinct hashes). + if r.BlobCount != 2 { + t.Errorf("BlobCount: got %d want 2", r.BlobCount) + } + if r.BlobBytes <= 0 { + t.Errorf("BlobBytes: got %d want >0", r.BlobBytes) + } + // Backups should be sorted newest-first; both present. + names := make(map[string]bool) + for _, bs := range r.Backups { + names[bs.Name] = true + } + if !names["bk_a"] || !names["bk_b"] { + t.Errorf("Backups: got %v want bk_a and bk_b", r.Backups) + } +} + +func TestStatus_DetectsPruneMarker(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + ctx := context.Background() + if _, _, err := cas.WritePruneMarker(ctx, f, cfg.ClusterPrefix(), "h1"); err != nil { + t.Fatal(err) + } + r, err := cas.Status(ctx, f, cfg) + if err != nil { + t.Fatal(err) + } + if r.PruneMarker == nil { + t.Fatal("expected PruneMarker, got nil") + } + if r.PruneMarker.Path == "" { + t.Error("PruneMarker.Path empty") + } +} + +func TestStatus_ClassifiesInProgressByAge(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + cfg.AbandonThreshold = "1h" + if err := cfg.Validate(); err != nil { + t.Fatal(err) + } + ctx := context.Background() + + // fresh marker — just written, age ~ 0 + if _, err := cas.WriteInProgressMarker(ctx, f, cfg.ClusterPrefix(), "bk_recent", "h"); err != nil { + t.Fatal(err) + } + // abandoned marker — write then age it to 2h ago + if _, err := cas.WriteInProgressMarker(ctx, f, cfg.ClusterPrefix(), "bk_old", "h"); err != nil { + t.Fatal(err) + } + f.SetModTime(cas.InProgressMarkerPath(cfg.ClusterPrefix(), "bk_old"), time.Now().Add(-2*time.Hour)) + + r, err := cas.Status(ctx, f, cfg) + if err != nil { + t.Fatal(err) + } + if len(r.InProgressFresh) != 1 || r.InProgressFresh[0].Backup != "bk_recent" { + t.Errorf("fresh: %+v", r.InProgressFresh) + } + if len(r.InProgressAbandoned) != 1 || r.InProgressAbandoned[0].Backup != "bk_old" { + t.Errorf("abandoned: %+v", r.InProgressAbandoned) + } +} + +// TestStatusReport_JSONTags verifies that StatusReport and related structs +// marshal to snake_case keys and that Duration fields are exposed as seconds +// (not nanosecond integers) via the age_seconds field. +func TestStatusReport_JSONTags(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + cfg.AbandonThreshold = "1h" + require.NoError(t, cfg.Validate()) + ctx := context.Background() + + // Write a prune marker. + if _, _, err := cas.WritePruneMarker(ctx, f, cfg.ClusterPrefix(), "h1"); err != nil { + t.Fatal(err) + } + // Write a fresh in-progress marker. + if _, err := cas.WriteInProgressMarker(ctx, f, cfg.ClusterPrefix(), "bk_r", "h"); err != nil { + t.Fatal(err) + } + + r, err := cas.Status(ctx, f, cfg) + require.NoError(t, err) + + raw, err := json.Marshal(r) + require.NoError(t, err) + s := string(raw) + + // Top-level snake_case keys must be present. + require.True(t, strings.Contains(s, `"backup_count"`), "missing backup_count: %s", s) + require.True(t, strings.Contains(s, `"blob_count"`), "missing blob_count: %s", s) + require.True(t, strings.Contains(s, `"blob_bytes"`), "missing blob_bytes: %s", s) + require.True(t, strings.Contains(s, `"in_progress_fresh"`), "missing in_progress_fresh: %s", s) + require.True(t, strings.Contains(s, `"in_progress_abandoned"`), "missing in_progress_abandoned: %s", s) + require.True(t, strings.Contains(s, `"backups"`), "missing backups: %s", s) + + // PruneMarker fields. + require.True(t, strings.Contains(s, `"prune_marker"`), "missing prune_marker: %s", s) + require.True(t, strings.Contains(s, `"age_seconds"`), "missing age_seconds in prune_marker: %s", s) + + // Age (time.Duration) must NOT appear as nanosecond integer — the field is tagged json:"-". + require.False(t, strings.Contains(s, `"Age"`), "raw Go field name Age must not appear: %s", s) + require.False(t, strings.Contains(s, `"age":`), "unexported age field must not appear: %s", s) +} diff --git a/pkg/cas/sweep.go b/pkg/cas/sweep.go new file mode 100644 index 00000000..3618584d --- /dev/null +++ b/pkg/cas/sweep.go @@ -0,0 +1,270 @@ +package cas + +import ( + "container/heap" + "context" + "encoding/binary" + "encoding/hex" + "fmt" + "sort" + "strings" + "sync" + "time" + + "github.com/rs/zerolog/log" +) + +// OrphanCandidate identifies a blob that the sweep phase considers eligible +// for deletion: not present in the live mark set AND older than the grace +// cutoff. The Key is the absolute object key (i.e. what BlobPath would +// produce), suitable for direct DeleteFile. +type OrphanCandidate struct { + Hash Hash128 `json:"hash"` + Key string `json:"key"` + Size int64 `json:"size"` + ModTime time.Time `json:"mod_time"` +} + +// SweepStats holds aggregate counters produced by a single SweepOrphans call. +type SweepStats struct { + // BlobsTotal is the total number of blobs enumerated during the sweep, + // regardless of whether they are live-referenced or orphaned. + BlobsTotal uint64 + // OrphansHeldByGrace counts orphan blobs (not referenced by any live + // backup) that were skipped because they fell inside the grace window. + OrphansHeldByGrace uint64 +} + +// SweepOrphans walks every cas//blob// prefix in parallel, +// collects candidate blobs (those not in marks), and filters to those +// strictly older than t0-grace. The mark set MUST be sorted (i.e. produced +// by MarkSetWriter); SweepOrphans consumes it in a single forward pass. +// +// parallelism caps simultaneous shard walks; <=0 falls back to 32. The +// returned slice has no specified order. +func SweepOrphans(ctx context.Context, b Backend, clusterPrefix string, marks *MarkSetReader, grace time.Duration, t0 time.Time) ([]OrphanCandidate, SweepStats, error) { + cutoff := t0.Add(-grace) + const parallelism = 32 + + shards := make([]shardOutForCompare, 256) + + var wg sync.WaitGroup + sem := make(chan struct{}, parallelism) + for i := 0; i < 256; i++ { + wg.Add(1) + go func(i int) { + defer wg.Done() + sem <- struct{}{} + defer func() { <-sem }() + prefix := fmt.Sprintf("%sblob/%02x/", clusterPrefix, i) + var blobs []remoteBlob + err := b.Walk(ctx, prefix, true, func(rf RemoteFile) error { + h, ok := parseHashFromKey(rf.Key, prefix) + if !ok { + // Skip debris that doesn't match the blob key shape + // (e.g. operator-injected files); not a fatal error. + return nil + } + blobs = append(blobs, remoteBlob{hash: h, key: rf.Key, modTime: rf.ModTime, size: rf.Size}) + return nil + }) + sort.Slice(blobs, func(a, c int) bool { return hashLess(blobs[a].hash, blobs[c].hash) }) + shards[i] = shardOutForCompare{blobs: blobs, err: err} + }(i) + } + wg.Wait() + + for i, s := range shards { + if s.err != nil { + return nil, SweepStats{}, fmt.Errorf("cas-sweep: shard %02x: %w", i, s.err) + } + } + + // Stream-merge the 256 sorted shards into a single sorted iterator, + // then walk it side-by-side with the mark set. + candidates, stats, err := streamCompareWithMarks(shards, marks, cutoff) + if err != nil { + return nil, SweepStats{}, err + } + return candidates, stats, nil +} + +type remoteBlob struct { + hash Hash128 + key string + modTime time.Time + size int64 +} + +// parseHashFromKey extracts a Hash128 from an absolute blob key of the form +// "blob//" where the prefix arg is the leading +// "blob//". Returns (zero, false) if the key doesn't +// match the expected shape (length, hex chars). +func parseHashFromKey(key, prefix string) (Hash128, bool) { + if !strings.HasPrefix(key, prefix) { + return Hash128{}, false + } + rest := key[len(prefix):] + if len(rest) != 30 { + return Hash128{}, false + } + // The shard byte (2 hex chars) lives in the prefix itself, in the + // segment between "blob/" and the trailing "/". Extract it. + // prefix = "blob//" — find the . + const blobMarker = "blob/" + bm := strings.Index(prefix, blobMarker) + if bm < 0 { + return Hash128{}, false + } + shardStart := bm + len(blobMarker) + if shardStart+3 > len(prefix) { + return Hash128{}, false + } + shardHex := prefix[shardStart : shardStart+2] + full := shardHex + rest + if len(full) != 32 { + return Hash128{}, false + } + var b [16]byte + if _, err := hex.Decode(b[:], []byte(full)); err != nil { + return Hash128{}, false + } + return Hash128{ + Low: binary.LittleEndian.Uint64(b[0:8]), + High: binary.LittleEndian.Uint64(b[8:16]), + }, true +} + +// streamCompareWithMarks merges the sorted shard outputs with the sorted +// mark stream and emits OrphanCandidate for any blob not in marks AND older +// than cutoff. It also returns SweepStats with aggregate counters. +func streamCompareWithMarks(shards []shardOutForCompare, marks *MarkSetReader, cutoff time.Time) ([]OrphanCandidate, SweepStats, error) { + // Flatten shards in sorted order. Shards are already individually + // sorted; flatten via heap merge. + it := newShardIter(shards) + var ( + mark Hash128 + haveMark bool + ) + advanceMark := func() error { + h, ok, err := marks.Next() + if err != nil { + return err + } + mark = h + haveMark = ok + return nil + } + if err := advanceMark(); err != nil { + return nil, SweepStats{}, err + } + + var out []OrphanCandidate + var stats SweepStats + for it.valid { + blob := it.current + stats.BlobsTotal++ + // Advance mark stream past anything strictly less than blob.hash. + for haveMark && hashLess(mark, blob.hash) { + if err := advanceMark(); err != nil { + return nil, SweepStats{}, err + } + } + if !(haveMark && mark == blob.hash) { + // Blob is not referenced by any live backup → orphan candidate. + if blob.modTime.IsZero() { + log.Warn(). + Str("key", blob.key). + Msg("cas-sweep: blob has zero ModTime (likely FTP LIST without MLSD); skipping (treating as inside grace window)") + stats.OrphansHeldByGrace++ + } else if blob.modTime.Before(cutoff) { + out = append(out, OrphanCandidate{ + Hash: blob.hash, Key: blob.key, ModTime: blob.modTime, Size: blob.size, + }) + } else { + // Orphan but within the grace window — held for now. + stats.OrphansHeldByGrace++ + } + } + if err := it.advance(); err != nil { + return nil, SweepStats{}, err + } + } + return out, stats, nil +} + +// shardOutForCompare is an alias used by streamCompareWithMarks. We keep +// the type local so the caller doesn't have to expose internal `remoteBlob`. +type shardOutForCompare = struct { + blobs []remoteBlob + err error +} + +// shardHead tracks the read position within one shard's sorted blob slice. +type shardHead struct { + blobs []remoteBlob + idx int +} + +// current returns the blob at the current read position. +func (h *shardHead) current() remoteBlob { return h.blobs[h.idx] } + +// shardHeap is a min-heap of *shardHead values ordered by the current blob's +// hash. It implements heap.Interface so container/heap drives the merge. +type shardHeap []*shardHead + +func (h shardHeap) Len() int { return len(h) } +func (h shardHeap) Less(i, j int) bool { + return hashLess(h[i].current().hash, h[j].current().hash) +} +func (h shardHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] } +func (h *shardHeap) Push(x interface{}) { *h = append(*h, x.(*shardHead)) } +func (h *shardHeap) Pop() interface{} { + old := *h + n := len(old) + x := old[n-1] + old[n-1] = nil // avoid memory leak + *h = old[:n-1] + return x +} + +// shardIter is a min-heap iterator across the 256 shard slices. +// It merges individually-sorted shards in O(N log k) time (k ≤ 256 shards) +// using container/heap instead of the former O(N k) linear scan. +type shardIter struct { + h shardHeap + current remoteBlob + valid bool +} + +func newShardIter(shards []shardOutForCompare) *shardIter { + it := &shardIter{} + for i := range shards { + if len(shards[i].blobs) > 0 { + it.h = append(it.h, &shardHead{blobs: shards[i].blobs, idx: 0}) + } + } + heap.Init(&it.h) + _ = it.advance() + return it +} + +func (it *shardIter) advance() error { + if it.h.Len() == 0 { + it.valid = false + return nil + } + // The heap root is always the shard with the globally-smallest current blob. + top := it.h[0] + it.current = top.current() + it.valid = true + top.idx++ + if top.idx < len(top.blobs) { + // Shard still has entries: fix the heap position of the root (O(log k)). + heap.Fix(&it.h, 0) + } else { + // Shard exhausted: remove it from the heap (O(log k)). + heap.Pop(&it.h) + } + return nil +} diff --git a/pkg/cas/sweep_test.go b/pkg/cas/sweep_test.go new file mode 100644 index 00000000..8bc892c1 --- /dev/null +++ b/pkg/cas/sweep_test.go @@ -0,0 +1,289 @@ +package cas_test + +import ( + "bytes" + "context" + "fmt" + "io" + "path/filepath" + "reflect" + "sort" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" +) + +// buildMarkSet writes the given hashes into a temporary MarkSet file and +// returns an OPEN MarkSetReader positioned at the start. Caller must Close. +func buildMarkSet(t *testing.T, hashes []cas.Hash128) *cas.MarkSetReader { + t.Helper() + tmp := t.TempDir() + p := filepath.Join(tmp, "marks") + w, err := cas.NewMarkSetWriter(p, 1024) + if err != nil { + t.Fatal(err) + } + for _, h := range hashes { + if err := w.Write(h); err != nil { + t.Fatal(err) + } + } + if err := w.Close(); err != nil { + t.Fatal(err) + } + r, err := cas.OpenMarkSetReader(p) + if err != nil { + t.Fatal(err) + } + return r +} + +func putBlobAt(t *testing.T, f *fakedst.Fake, cp string, h cas.Hash128, modTime time.Time) { + t.Helper() + key := cas.BlobPath(cp, h) + if err := f.PutFile(context.Background(), key, io.NopCloser(bytes.NewReader([]byte("x"))), 1); err != nil { + t.Fatal(err) + } + f.SetModTime(key, modTime) +} + +func TestSweep_ReturnsOnlyUnreferencedAndOldEnough(t *testing.T) { + f := fakedst.New() + cp := "cas/c1/" + now := time.Now() + old := now.Add(-2 * time.Hour) // beyond grace + fresh := now.Add(-30 * time.Minute) // within grace + + // b1, b2 referenced; b3 unreferenced+old; b4 unreferenced+fresh; b5 referenced + h1 := cas.Hash128{Low: 0x01, High: 0x10} + h2 := cas.Hash128{Low: 0x02, High: 0x20} + h3 := cas.Hash128{Low: 0x03, High: 0x30} + h4 := cas.Hash128{Low: 0x04, High: 0x40} + h5 := cas.Hash128{Low: 0x05, High: 0x50} + for _, h := range []cas.Hash128{h1, h2, h5} { + putBlobAt(t, f, cp, h, old) + } + putBlobAt(t, f, cp, h3, old) + putBlobAt(t, f, cp, h4, fresh) + + marks := buildMarkSet(t, []cas.Hash128{h1, h2, h5}) + defer marks.Close() + + cands, _, err := cas.SweepOrphans(context.Background(), f, cp, marks, time.Hour, now) + if err != nil { + t.Fatal(err) + } + if len(cands) != 1 || cands[0].Hash != h3 { + t.Errorf("got %+v want only h3", cands) + } +} + +func TestSweep_RespectsGracePeriodPrecisely(t *testing.T) { + f := fakedst.New() + cp := "cas/c1/" + now := time.Now() + + h := cas.Hash128{Low: 0x99, High: 0xff} + // Blob ModTime exactly grace ago — must NOT be deleted (cutoff is strict <). + putBlobAt(t, f, cp, h, now.Add(-time.Hour)) + + marks := buildMarkSet(t, nil) // empty marks → all blobs are orphans + defer marks.Close() + + cands, _, err := cas.SweepOrphans(context.Background(), f, cp, marks, time.Hour, now) + if err != nil { + t.Fatal(err) + } + if len(cands) != 0 { + t.Errorf("expected 0 candidates (exactly-grace-aged should be retained); got %+v", cands) + } + + // One nanosecond older than grace → must be a candidate. + putBlobAt(t, f, cp, h, now.Add(-time.Hour-time.Nanosecond)) + marks2 := buildMarkSet(t, nil) + defer marks2.Close() + cands, _, err = cas.SweepOrphans(context.Background(), f, cp, marks2, time.Hour, now) + if err != nil { + t.Fatal(err) + } + if len(cands) != 1 { + t.Errorf("expected 1 candidate; got %+v", cands) + } +} + +func TestSweep_AllReferenced_NoCandidates(t *testing.T) { + f := fakedst.New() + cp := "cas/c1/" + now := time.Now() + old := now.Add(-2 * time.Hour) + + hs := []cas.Hash128{ + {Low: 1, High: 10}, {Low: 2, High: 20}, {Low: 3, High: 30}, + } + for _, h := range hs { + putBlobAt(t, f, cp, h, old) + } + + marks := buildMarkSet(t, hs) + defer marks.Close() + + cands, _, err := cas.SweepOrphans(context.Background(), f, cp, marks, time.Hour, now) + if err != nil { + t.Fatal(err) + } + if len(cands) != 0 { + t.Errorf("expected 0 candidates (all referenced); got %+v", cands) + } +} + +func TestSweep_EmptyBucket(t *testing.T) { + f := fakedst.New() + marks := buildMarkSet(t, nil) + defer marks.Close() + + cands, _, err := cas.SweepOrphans(context.Background(), f, "cas/c1/", marks, time.Hour, time.Now()) + if err != nil { + t.Fatal(err) + } + if len(cands) != 0 { + t.Errorf("expected 0 candidates; got %+v", cands) + } +} + +// TestSweep_ZeroModTimeBlobIsSkipped verifies that a blob with a zero +// ModTime is NOT classified as orphan-eligible — same conservative +// choice as the marker side: false-positive cleanup would delete live +// blobs on FTP backends that return zero ModTime. +func TestSweep_ZeroModTimeBlobIsSkipped(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + cp := cfg.ClusterPrefix() + ctx := context.Background() + + // Create an orphan blob with zero ModTime. + hOrphan := cas.Hash128{Low: 0xab, High: 0x10} + _ = f.PutFile(ctx, cas.BlobPath(cp, hOrphan), io.NopCloser(bytes.NewReader([]byte("x"))), 1) + f.SetModTime(cas.BlobPath(cp, hOrphan), time.Time{}) + + // Empty mark set → the only path SweepOrphans uses is the orphan-vs-cutoff + // branch. Without the zero-ModTime guard, the blob would be classified + // as orphan past grace. + rep, err := cas.Prune(ctx, f, cfg, cas.PruneOptions{ + GraceBlob: time.Nanosecond, + GraceBlobSet: true, + }) + if err != nil { + t.Fatalf("Prune unexpectedly errored: %v", err) + } + if rep.OrphansDeleted != 0 { + t.Errorf("zero-ModTime blob was reaped (OrphansDeleted=%d); expected 0", rep.OrphansDeleted) + } + if _, _, exists, _ := f.StatFile(ctx, cas.BlobPath(cp, hOrphan)); !exists { + t.Error("zero-ModTime blob was deleted; expected to survive") + } +} + +func TestSweep_ManyShardsParallel(t *testing.T) { + f := fakedst.New() + cp := "cas/c1/" + old := time.Now().Add(-2 * time.Hour) + // Sprinkle blobs across many shard prefixes. + var hs []cas.Hash128 + for i := uint64(0); i < 50; i++ { + h := cas.Hash128{Low: i*0x1010101, High: i} + putBlobAt(t, f, cp, h, old) + hs = append(hs, h) + } + marks := buildMarkSet(t, nil) // empty: every blob is an orphan + defer marks.Close() + + cands, _, err := cas.SweepOrphans(context.Background(), f, cp, marks, time.Hour, time.Now()) + if err != nil { + t.Fatal(err) + } + if len(cands) != len(hs) { + t.Errorf("got %d candidates, want %d", len(cands), len(hs)) + } + // Verify hash-set equality. + gotHashes := make([]cas.Hash128, 0, len(cands)) + for _, c := range cands { + gotHashes = append(gotHashes, c.Hash) + } + sort.Slice(gotHashes, func(i, j int) bool { + if gotHashes[i].High != gotHashes[j].High { + return gotHashes[i].High < gotHashes[j].High + } + return gotHashes[i].Low < gotHashes[j].Low + }) + wantHashes := append([]cas.Hash128(nil), hs...) + sort.Slice(wantHashes, func(i, j int) bool { + if wantHashes[i].High != wantHashes[j].High { + return wantHashes[i].High < wantHashes[j].High + } + return wantHashes[i].Low < wantHashes[j].Low + }) + if !reflect.DeepEqual(gotHashes, wantHashes) { + t.Errorf("hash set mismatch") + } +} + +// BenchmarkSweepOrphans_LargeN measures the heap-merge path with N blobs +// spread evenly across all 256 shards. The benchmark is intentionally +// free of absolute assertions so it never becomes flaky; its purpose is +// to make future O(N k) regressions visible in benchmark history. +// +// Run with: +// +// go test ./pkg/cas/ -bench BenchmarkSweepOrphans -benchtime=1x -count=1 +func BenchmarkSweepOrphans_LargeN(b *testing.B) { + const totalBlobs = 10_000 // scaled down so the benchmark runs quickly + const numShards = 256 + perShard := totalBlobs / numShards + + now := time.Now() + old := now.Add(-2 * time.Hour) + + f := fakedst.New() + cp := "cas/bench/" + ctx := context.Background() + + // Pre-populate: spread blobs across all 256 shards. + // Hash128.High encodes the shard (top byte) so blobs land deterministically. + for shard := 0; shard < numShards; shard++ { + for j := 0; j < perShard; j++ { + h := cas.Hash128{ + High: uint64(shard) << 56, + Low: uint64(j), + } + key := cas.BlobPath(cp, h) + _ = f.PutFile(ctx, key, io.NopCloser(bytes.NewReader([]byte("x"))), 1) + f.SetModTime(key, old) + } + } + + b.ReportAllocs() + b.ResetTimer() + for i := 0; i < b.N; i++ { + tmp := b.TempDir() + p := fmt.Sprintf("%s/marks-%d", tmp, i) + w, err := cas.NewMarkSetWriter(p, 1024) + if err != nil { + b.Fatal(err) + } + if err := w.Close(); err != nil { // empty mark set: all blobs are orphans + b.Fatal(err) + } + r, err := cas.OpenMarkSetReader(p) + if err != nil { + b.Fatal(err) + } + cands, _, err := cas.SweepOrphans(ctx, f, cp, r, time.Hour, now) + _ = r.Close() + if err != nil { + b.Fatal(err) + } + b.ReportMetric(float64(len(cands)), "orphans") + } +} diff --git a/pkg/cas/types.go b/pkg/cas/types.go new file mode 100644 index 00000000..e34180ea --- /dev/null +++ b/pkg/cas/types.go @@ -0,0 +1,62 @@ +package cas + +const ( + // LayoutVersion is the schema version of the CAS layout itself. Persisted + // per backup in BackupMetadata.CAS.LayoutVersion. Bumps are major/breaking; + // tools encountering a higher version refuse with a clear error. + LayoutVersion uint8 = 1 + + // MinInline / MaxInline bound the persisted InlineThreshold. ValidateBackup + // rejects backups outside this range. See docs/cas-design.md §6.2.1. + MinInline uint64 = 1 + MaxInline uint64 = 1 << 30 // 1 GiB +) + +// TableInfo is a minimal description of a ClickHouse table used by +// DetectObjectDiskTables. The caller (e.g. cas-upload) populates this from +// clickhouse.Table values; keeping it here avoids an import cycle between +// pkg/cas and pkg/clickhouse. +type TableInfo struct { + Database string + Name string + DataPaths []string +} + +// DiskInfo is a minimal description of a ClickHouse disk from system.disks, +// used by DetectObjectDiskTables. +type DiskInfo struct { + Name string + Path string + Type string +} + +// Triplet is a (filename, size, hash) tuple extracted from a part's +// checksums.txt. The CAS upload planner classifies each Triplet as inline +// (size <= InlineThreshold; goes into per-table tar.zstd) or blob +// (size > InlineThreshold; uploaded to cas/.../blob//). +type Triplet struct { + Filename string + Size uint64 + HashLow uint64 + HashHigh uint64 +} + +// InProgressMarker is the JSON body of cas//inprogress/.marker. +// Written at upload start, deleted at commit. Used by cas-prune for +// abandoned-upload cleanup and by cas-delete to detect uploads in flight. +type InProgressMarker struct { + Backup string `json:"backup"` + Host string `json:"host"` + StartedAt string `json:"started_at"` // RFC3339 UTC + Tool string `json:"tool"` // e.g. "clickhouse-backup v2.7.0" +} + +// PruneMarker is the JSON body of cas//prune.marker. Written at the +// start of cas-prune; the run-id is read back to detect concurrent prunes. +// Released via deferred call so panics/errors don't strand it. +type PruneMarker struct { + Host string `json:"host"` + StartedAt string `json:"started_at"` // RFC3339 UTC + RunID string `json:"run_id"` // 16 hex chars from crypto/rand + Tool string `json:"tool"` +} diff --git a/pkg/cas/unlock.go b/pkg/cas/unlock.go new file mode 100644 index 00000000..b69f0f15 --- /dev/null +++ b/pkg/cas/unlock.go @@ -0,0 +1,64 @@ +package cas + +import ( + "context" + "fmt" + + "github.com/rs/zerolog/log" +) + +// UnlockInProgress removes a stranded cas-upload in-progress marker for the +// named backup. It is the operator escape hatch for a backup whose upload was +// interrupted uncleanly (SIGKILL, OOM, network partition) and whose marker +// was not cleaned up by the deferred cleanup in Upload. +// +// Behavior: +// 1. Stat the marker; if absent return a clear error. +// 2. Read the marker body and log Tool / Host / StartedAt for audit trail. +// 3. Delete the marker. +// 4. Return success. +// +// UnlockInProgress does NOT perform any upload. Callers that want to resume +// the upload must run cas-upload separately after unlocking. +// +// Returns ErrNoInProgressMarker if the marker does not exist. +func UnlockInProgress(ctx context.Context, b Backend, cfg Config, name string) error { + if err := cfg.Validate(); err != nil { + return fmt.Errorf("cas: unlock: invalid config: %w", err) + } + if err := validateName(name); err != nil { + return err + } + cp := cfg.ClusterPrefix() + markerKey := InProgressMarkerPath(cp, name) + + // 1. Check existence. + _, _, exists, err := b.StatFile(ctx, markerKey) + if err != nil { + return fmt.Errorf("cas: unlock: stat marker for %q: %w", name, err) + } + if !exists { + return fmt.Errorf("%w: %q", ErrNoInProgressMarker, name) + } + + // 2. Read body for audit log (best-effort; don't fail if body is unreadable). + m, readErr := ReadInProgressMarker(ctx, b, cp, name) + if readErr != nil { + log.Warn().Str("backup", name).Err(readErr).Msg("cas: unlock: could not read marker body for audit; deleting anyway") + } else { + log.Info(). + Str("backup", name). + Str("marker_tool", m.Tool). + Str("marker_host", m.Host). + Str("marker_started_at", m.StartedAt). + Msg("cas: unlock: removing stranded inprogress marker") + } + + // 3. Delete the marker. + if err := b.DeleteFile(ctx, markerKey); err != nil { + return fmt.Errorf("cas: unlock: delete marker for %q: %w", name, err) + } + + log.Info().Str("backup", name).Msg("cas: unlock: inprogress marker removed; backup slot is now free") + return nil +} diff --git a/pkg/cas/unlock_test.go b/pkg/cas/unlock_test.go new file mode 100644 index 00000000..2e064d49 --- /dev/null +++ b/pkg/cas/unlock_test.go @@ -0,0 +1,70 @@ +package cas_test + +import ( + "context" + "errors" + "strings" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" +) + +// TestUpload_UnlockRemovesInprogressMarker verifies that UnlockInProgress +// deletes the marker when it exists, and that no upload artifact is written. +func TestUpload_UnlockRemovesInprogressMarker(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Pre-place a marker as if a previous cas-upload was interrupted. + created, err := cas.WriteInProgressMarker(context.Background(), f, cp, "b1", "") + if err != nil { + t.Fatalf("WriteInProgressMarker: %v", err) + } + if !created { + t.Fatal("marker should have been created (backend was empty)") + } + // Confirm it was written. + _, _, exists, statErr := f.StatFile(context.Background(), cas.InProgressMarkerPath(cp, "b1")) + if statErr != nil || !exists { + t.Fatalf("marker not present before unlock (exists=%v, err=%v)", exists, statErr) + } + + // Unlock. + if err := cas.UnlockInProgress(context.Background(), f, cfg, "b1"); err != nil { + t.Fatalf("UnlockInProgress: %v", err) + } + + // Marker must be gone. + _, _, exists2, statErr2 := f.StatFile(context.Background(), cas.InProgressMarkerPath(cp, "b1")) + if statErr2 != nil { + t.Fatalf("StatFile after unlock: %v", statErr2) + } + if exists2 { + t.Error("inprogress marker still present after UnlockInProgress") + } + + // No metadata.json or blob should have been written (no upload happened). + if f.Len() != 0 { + t.Errorf("unexpected objects in backend after unlock: got %d, want 0", f.Len()) + } +} + +// TestUpload_UnlockRefusesWhenNoMarker verifies that UnlockInProgress returns +// ErrNoInProgressMarker when no marker exists for the named backup. +func TestUpload_UnlockRefusesWhenNoMarker(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + + err := cas.UnlockInProgress(context.Background(), f, cfg, "b1") + if err == nil { + t.Fatal("expected error when no marker present, got nil") + } + if !errors.Is(err, cas.ErrNoInProgressMarker) { + t.Errorf("expected ErrNoInProgressMarker, got: %v", err) + } + if !strings.Contains(err.Error(), "b1") { + t.Errorf("error should mention backup name, got: %v", err) + } +} diff --git a/pkg/cas/upload.go b/pkg/cas/upload.go new file mode 100644 index 00000000..28ec223a --- /dev/null +++ b/pkg/cas/upload.go @@ -0,0 +1,1179 @@ +package cas + +import ( + "context" + "encoding/json" + "errors" + "fmt" + "io" + "io/fs" + "os" + "path/filepath" + "sort" + "strings" + "sync" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/checksumstxt" + "github.com/Altinity/clickhouse-backup/v2/pkg/common" + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" + "github.com/rs/zerolog/log" +) + +// UploadOptions configures an Upload run. +type UploadOptions struct { + // LocalBackupDir is the absolute path of the pre-existing local backup + // directory (produced by `clickhouse-backup create`). Upload walks + // /shadow/. + LocalBackupDir string + + // TableFilter is an optional list of "db.table" glob patterns + // (filepath.Match semantics, mirroring v1 --tables). Empty = include all. + TableFilter []string + + // SkipObjectDisks: when true, tables on object-disks (s3/azure/etc.) + // are silently excluded; when false (the default) Upload refuses with + // ErrObjectDiskRefused if any are detected. + SkipObjectDisks bool + + // DryRun: when true, classify+plan but write nothing to the backend. + DryRun bool + + // Parallelism caps simultaneous blob uploads and the cold-list shard + // walks. <=0 falls back to 16. + Parallelism int + + // Disks and ClickHouseTables are caller-supplied; if both non-empty we + // run DetectObjectDiskTables. Empty slices mean "skip the pre-flight" + // (intended for unit tests that don't model live ClickHouse). + Disks []DiskInfo + ClickHouseTables []TableInfo + + // ExcludedTables is a precomputed list of "db.table" keys to skip. + // When non-empty, planUpload skips these tables directly without + // invoking DetectObjectDiskTables. Used by callers (e.g. cas-upload + // CLI) that already know which tables are object-disk-backed via a + // snapshot walk and don't need the live-disks Path-prefix match. + // If both ExcludedTables and Disks/ClickHouseTables are provided, + // ExcludedTables takes priority and Disks/ClickHouseTables are + // ignored for exclusion. + ExcludedTables []string + + // WaitForPrune, when > 0, polls the prune marker for up to this duration + // before giving up at upload step 2. 0 = refuse immediately (default). + WaitForPrune time.Duration +} + +// UploadResult summarizes what an Upload run did. The stats break down into +// three layers operators care about: +// +// 1. The backup's logical content (TotalFiles / TotalBytes — what would be +// in a v1 backup, including duplicated content across parts). +// 2. How the content was placed: InlineFiles/InlineBytes (small files that +// ride inside per-table tar.zstd archives) vs BlobFiles (file references +// that go to the content-addressed blob store) and the deduplicated +// UniqueBlobs / BlobBytesTotal. +// 3. What actually crossed the wire on this run: BlobsUploaded / +// BytesUploaded (new blobs PUT to the remote), BlobsReused / BytesReused +// (deduped via cold-list against existing remote blobs), and ArchiveBytes +// (compressed bytes for the per-table archives uploaded now). +type UploadResult struct { + BackupName string + + // Logical content (counted across every part, before blob dedup). + TotalFiles int + TotalBytes uint64 + InlineFiles int + InlineBytes uint64 + BlobFiles int // file references that pointed at a blob (pre-dedup) + + // Blob-store side, after content-addressed dedup within this backup. + UniqueBlobs int // unique blob hashes referenced (= len(plan.blobs)) + BlobBytesTotal uint64 // sum of UniqueBlobs sizes + + // What this run sent to / dedup'd against the remote. + BlobsUploaded int // unique blobs newly PUT + BytesUploaded int64 // sum of BlobsUploaded sizes + BlobsReused int // unique blobs already in remote (skipped) + BytesReused int64 // sum of BlobsReused sizes + ArchiveBytes int64 // compressed bytes of per-table archives uploaded + + PerTableArchives int + DryRun bool + + // BlobsConsidered is an alias for UniqueBlobs kept for backwards + // compatibility with log output written before the stats expansion. + // New code should read UniqueBlobs. + BlobsConsidered int +} + +// uploadPlan is the in-memory description of what to upload, built by +// scanning the local backup directory and parsing every checksums.txt. +type uploadPlan struct { + // blobs: unique hashes that exceed the inline threshold and are not + // special-cased (checksums.txt is always inlined). + blobs map[Hash128]blobRef + + // tables maps "disk|db|table" → tablePlan. + tables map[string]*tablePlan + // tableKeys preserves a sorted ordering for deterministic uploads. + tableKeys []string + + // localRoot is the local backup directory passed to planUpload; used by + // uploadTableJSONs to read the v1 per-table metadata that + // 'clickhouse-backup create' wrote. + localRoot string + + // Aggregates for stats reporting. Populated alongside the maps above. + totalFiles int + totalBytes uint64 + inlineFiles int + inlineBytes uint64 + blobFiles int // file references that go to the blob store (pre-dedup) +} + +// blobRef points at one local file claimed to have hash h. We pick any +// file with the hash for the actual upload (callers may have multiple +// copies). +type blobRef struct { + LocalPath string + Size uint64 +} + +// tablePlan groups everything needed to build the per-(disk, db, table) +// archive and its companion table-metadata JSON. +type tablePlan struct { + Disk, DB, Table string + // archiveEntries are the inline files (small files + every + // checksums.txt) that go into the tar.zstd. NameInArchive uses the + // "/" convention from §6.3. + archiveEntries []ArchiveEntry + // parts is the per-part list used to populate TableMetadata.Parts. + // Sorted by part name for deterministic JSON. + parts []metadata.Part +} + +// Upload performs a CAS upload of the local backup at opts.LocalBackupDir +// to the cluster identified by cfg. Implements docs/cas-design.md §6.4. +func Upload(ctx context.Context, b Backend, cfg Config, name string, opts UploadOptions) (*UploadResult, error) { + // 1. Validate name + config. + if err := validateName(name); err != nil { + return nil, err + } + if err := cfg.Validate(); err != nil { + return nil, err + } + if NameCollidesWithCASPrefix(name, cfg) { + return nil, fmt.Errorf("cas-upload: backup name %q collides with the CAS skip-prefix %q; choose a different name to prevent this backup from being silently skipped by v1 list/retention operations", name, name+"/") + } + cp := cfg.ClusterPrefix() + + // 2. Refuse if prune.marker exists (with optional wait). + // NOTE: the in-progress marker has NOT been written yet at this point + // (that happens in step 5), so no cleanup is needed on this error path. + if err := waitForPrune(ctx, b, cp, opts.WaitForPrune); err != nil { + return nil, err + } + + // 3. Object-disk pre-flight. + if !opts.SkipObjectDisks && len(opts.Disks) > 0 && len(opts.ClickHouseTables) > 0 { + hits := DetectObjectDiskTables(opts.ClickHouseTables, opts.Disks) + if len(hits) > 0 { + return nil, fmt.Errorf("%w: %s", ErrObjectDiskRefused, formatObjectDiskHits(hits)) + } + } + + // 4. Best-effort same-name check. + if _, _, exists, err := b.StatFile(ctx, MetadataJSONPath(cp, name)); err != nil { + return nil, fmt.Errorf("cas: stat metadata.json: %w", err) + } else if exists { + return nil, ErrBackupExists + } + + // 5. Write in-progress marker (skipped on DryRun). + if !opts.DryRun { + created, err := WriteInProgressMarker(ctx, b, cp, name, "") + if err != nil { + if errors.Is(err, ErrConditionalPutNotSupported) { + return nil, fmt.Errorf("cas: backend cannot guarantee atomic markers; refusing to start cas-upload for %q (set cas.allow_unsafe_markers=true to override on FTP)", name) + } + return nil, fmt.Errorf("cas: write inprogress marker: %w", err) + } + if !created { + existing, readErr := ReadInProgressMarker(ctx, b, cp, name) + if readErr != nil { + return nil, fmt.Errorf("cas: another operation is in progress for %q (could not read marker: %v)", name, readErr) + } + if existing.Tool == "cas-delete" { + return nil, fmt.Errorf("cas: another %s is in progress for %q on host=%s started=%s; wait for it to finish", + existing.Tool, name, existing.Host, existing.StartedAt) + } + return nil, fmt.Errorf("cas: another %s is in progress for %q on host=%s started=%s; wait for it to finish or run cas-prune --abandon-threshold=0s if confirmed dead", + existing.Tool, name, existing.Host, existing.StartedAt) + } + } + + // Single deferred cleanup: runs on any error path (including panics) and + // uses a detached context so a cancelled operation ctx doesn't strand the + // marker. Skipped when DryRun (no marker was written) or when committed + // (the success path does an explicit delete after committing metadata.json). + var committed bool + defer func() { + if opts.DryRun || committed { + return + } + cleanCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second) + defer cancel() + _ = DeleteInProgressMarker(cleanCtx, b, cp, name) + }() + + // 6. Plan upload: walk shadow/, parse checksums.txt, classify. + plan, err := planUpload(opts.LocalBackupDir, cfg.InlineThreshold, opts.TableFilter, opts.SkipObjectDisks, opts.ExcludedTables, opts.Disks, opts.ClickHouseTables) + if err != nil { + return nil, err + } + + // Compute total bytes referenced by unique blobs (after content dedup + // within this backup; cold-list dedup against the remote happens in + // step 7). + var blobBytesTotal uint64 + for _, br := range plan.blobs { + blobBytesTotal += br.Size + } + + res := &UploadResult{ + BackupName: name, + TotalFiles: plan.totalFiles, + TotalBytes: plan.totalBytes, + InlineFiles: plan.inlineFiles, + InlineBytes: plan.inlineBytes, + BlobFiles: plan.blobFiles, + UniqueBlobs: len(plan.blobs), + BlobBytesTotal: blobBytesTotal, + BlobsConsidered: len(plan.blobs), + DryRun: opts.DryRun, + } + + if opts.DryRun { + res.PerTableArchives = len(plan.tableKeys) + return res, nil + } + + // 7. Cold-list existing blobs. + existing, err := ColdList(ctx, b, cp, opts.Parallelism) + if err != nil { + return nil, fmt.Errorf("cas: cold-list: %w", err) + } + + // 8. Upload missing blobs. + uploaded, bytesUp, skippedColdList, err := uploadMissingBlobs(ctx, b, cp, plan, existing, opts.Parallelism) + if err != nil { + return nil, err + } + res.BlobsUploaded = uploaded + res.BytesUploaded = bytesUp + // Reused = unique blobs that were already in the remote (dedup'd via cold-list). + res.BlobsReused = res.UniqueBlobs - uploaded + if res.BlobsReused < 0 { + res.BlobsReused = 0 + } + res.BytesReused = int64(blobBytesTotal) - bytesUp + if res.BytesReused < 0 { + res.BytesReused = 0 + } + + // 9. Per-(disk,db,table) archives. + archCount, archBytes, err := uploadPartArchives(ctx, b, cp, name, plan) + if err != nil { + return nil, err + } + res.PerTableArchives = archCount + res.ArchiveBytes = archBytes + + // 10. Per-table JSONs. + if err := uploadTableJSONs(ctx, b, cp, name, plan); err != nil { + return nil, err + } + + // 11. Pre-commit safety re-checks. + // 11a. prune marker + if _, _, exists, err := b.StatFile(ctx, PruneMarkerPath(cp)); err != nil { + return nil, fmt.Errorf("cas: re-check prune marker: %w", err) + } else if exists { + return nil, fmt.Errorf("%w: detected concurrent prune before commit", ErrPruneInProgress) + } + // 11b. our own inprogress marker + if _, _, exists, err := b.StatFile(ctx, InProgressMarkerPath(cp, name)); err != nil { + return nil, fmt.Errorf("cas: re-check inprogress marker: %w", err) + } else if !exists { + // The marker is already gone (swept by an over-eager prune); no cleanup needed. + return nil, fmt.Errorf("cas: in-progress marker for %q was swept (upload exceeded abandon_threshold); aborting", name) + } + + // 11c. Re-validate cold-listed blobs (closes ColdList TOCTOU vs concurrent + // prune) AND verify size matches checksums.txt (defense-in-depth + // against a stale/truncated object at a content-addressed key). + // A prune that ran past 11a's check could have deleted a blob we + // decided to skip in step 8 because cold-list said it was present. + // Parallelised with the same bounded-pool pattern as uploadMissingBlobs + // to avoid O(skipped × RTT) serial latency on large incremental backups. + if revalErr := revalidateColdList(ctx, b, cp, name, skippedColdList, opts.Parallelism); revalErr != nil { + return nil, revalErr + } + + // 12. Commit: write root metadata.json. + bm := buildBackupMetadata(name, cfg, plan) + bmJSON, err := json.MarshalIndent(bm, "", "\t") + if err != nil { + return nil, fmt.Errorf("cas: marshal metadata.json: %w", err) + } + if err := putBytes(ctx, b, MetadataJSONPath(cp, name), bmJSON); err != nil { + return nil, fmt.Errorf("cas: put metadata.json: %w", err) + } + + // 13. Mark committed BEFORE explicit delete so a panic during delete doesn't + // trigger the defer's redundant cleanup. Use a detached context so caller + // cancellation (e.g. /backup/kill) immediately after a successful commit + // still releases the marker rather than leaving it for prune to sweep. + committed = true + cleanCtx, cleanCancel := context.WithTimeout(context.Background(), 30*time.Second) + defer cleanCancel() + if err := DeleteInProgressMarker(cleanCtx, b, cp, name); err != nil { + log.Warn().Err(err).Msg("cas: release inprogress marker after commit") + } + + return res, nil +} + +// FormatObjectDiskHits renders a compact one-line summary suitable for +// embedding in user-facing errors. Exported for callers that perform the +// pre-flight outside cas.Upload (e.g., the CLI's snapshot-based scan). +func FormatObjectDiskHits(hits []ObjectDiskHit) string { return formatObjectDiskHits(hits) } + +// formatObjectDiskHits renders a compact one-line summary of detected +// object-disk hits suitable for embedding in error messages. +func formatObjectDiskHits(hits []ObjectDiskHit) string { + parts := make([]string, len(hits)) + for i, h := range hits { + parts[i] = fmt.Sprintf("%s.%s on %s(%s)", h.Database, h.Table, h.Disk, h.DiskType) + } + return strings.Join(parts, ", ") +} + +// localTableMetadataEntry is one (db, table) pair discovered by walking +// /metadata/. Names are post-decode (i.e. ready to use directly, +// no further TablePathDecode needed). +type localTableMetadataEntry struct { + DB, Table string + JSONPath string // absolute path to the metadata JSON +} + +// enumerateLocalTableMetadata walks /metadata//.json +// and returns one entry per file. The (db, table) names come from the JSON +// body's "database" / "table" fields, NOT from the on-disk path components, +// so the result is unambiguous and never depends on TablePathDecode. +func enumerateLocalTableMetadata(root string) ([]localTableMetadataEntry, error) { + metaRoot := filepath.Join(root, "metadata") + st, err := os.Stat(metaRoot) + if err != nil { + if os.IsNotExist(err) { + return nil, nil // no metadata dir → no tables (caller decides what to do) + } + return nil, fmt.Errorf("stat metadata dir: %w", err) + } + if !st.IsDir() { + return nil, fmt.Errorf("metadata path %q is not a directory", metaRoot) + } + var out []localTableMetadataEntry + dbs, err := readDir(metaRoot) + if err != nil { + return nil, err + } + for _, dbEnc := range dbs { + dbDir := filepath.Join(metaRoot, dbEnc) + dbSt, err := os.Stat(dbDir) + if err != nil { + return nil, fmt.Errorf("stat metadata db dir %s: %w", dbDir, err) + } + if !dbSt.IsDir() { + continue // e.g., a stray file alongside the db directories + } + entries, err := readDir(dbDir) + if err != nil { + return nil, err + } + for _, name := range entries { + if !strings.HasSuffix(name, ".json") { + continue + } + p := filepath.Join(dbDir, name) + body, err := os.ReadFile(p) + if err != nil { + return nil, fmt.Errorf("read %s: %w", p, err) + } + var tm metadata.TableMetadata + if err := json.Unmarshal(body, &tm); err != nil { + return nil, fmt.Errorf("parse %s: %w", p, err) + } + if tm.Database == "" || tm.Table == "" { + return nil, fmt.Errorf("metadata JSON %s has empty database/table fields", p) + } + out = append(out, localTableMetadataEntry{ + DB: tm.Database, + Table: tm.Table, + JSONPath: p, + }) + } + } + sort.Slice(out, func(i, j int) bool { + if out[i].DB != out[j].DB { + return out[i].DB < out[j].DB + } + return out[i].Table < out[j].Table + }) + return out, nil +} + +// planUpload enumerates tables from /metadata/, then for each +// (db, table) walks /shadow///// to +// classify files. Tables with no shadow dir produce a tableKey entry with +// no parts — these flow through to bm.Tables without an archive. +// +// When skipObjectDisks is true, the planner consults precomputed first +// (a precomputed db.table allow-list provided by the CLI's snapshot-based +// pre-flight) and falls through to DetectObjectDiskTables(disks, tables) +// when that list is empty. Either path silently excludes object-disk- +// backed tables. +func planUpload(root string, threshold uint64, filter []string, skipObjectDisks bool, precomputed []string, disks []DiskInfo, tables []TableInfo) (*uploadPlan, error) { + excluded := excludedTables(skipObjectDisks, precomputed, disks, tables) + + plan := &uploadPlan{ + blobs: make(map[Hash128]blobRef), + tables: make(map[string]*tablePlan), + localRoot: root, + } + + tableEntries, err := enumerateLocalTableMetadata(root) + if err != nil { + return nil, err + } + + shadow := filepath.Join(root, "shadow") + for _, te := range tableEntries { + db, table := te.DB, te.Table + if !tableFilterMatches(filter, db, table) { + continue + } + if excluded[db+"."+table] { + continue + } + + // Find part directories for this table by walking + // shadow/////. Missing or empty is + // fine (schema-only / empty-table case). + dbEnc := common.TablePathEncode(db) + tableEnc := common.TablePathEncode(table) + tblDir := filepath.Join(shadow, dbEnc, tableEnc) + st, statErr := os.Stat(tblDir) + if statErr != nil || !st.IsDir() { + // No shadow dir — schema-only or empty table. Register a + // tablePlan with the default disk slot so buildBackupMetadata + // emits a Tables entry; no parts, no archive. + key := "default|" + db + "|" + table + if _, ok := plan.tables[key]; !ok { + plan.tables[key] = &tablePlan{Disk: "default", DB: db, Table: table} + plan.tableKeys = append(plan.tableKeys, key) + } + continue + } + diskNames, err := readDir(tblDir) + if err != nil { + return nil, err + } + anyParts := false + for _, disk := range diskNames { + diskDir := filepath.Join(tblDir, disk) + parts, err := readDir(diskDir) + if err != nil { + return nil, err + } + key := disk + "|" + db + "|" + table + tp, ok := plan.tables[key] + if !ok { + tp = &tablePlan{Disk: disk, DB: db, Table: table} + plan.tables[key] = tp + plan.tableKeys = append(plan.tableKeys, key) + } + for _, part := range parts { + partDir := filepath.Join(diskDir, part) + if err := planPart(partDir, part, threshold, plan, tp); err != nil { + return nil, fmt.Errorf("cas: plan %s/%s/%s/%s: %w", db, table, disk, part, err) + } + tp.parts = append(tp.parts, metadata.Part{Name: part}) + anyParts = true + } + } + if !anyParts { + // Empty shadow tree → still register a Tables entry on a + // default disk slot so cas-restore can recreate the schema. + key := "default|" + db + "|" + table + if _, ok := plan.tables[key]; !ok { + plan.tables[key] = &tablePlan{Disk: "default", DB: db, Table: table} + plan.tableKeys = append(plan.tableKeys, key) + } + } + } + + // Deterministic ordering. + sort.Strings(plan.tableKeys) + for _, tp := range plan.tables { + sort.Slice(tp.parts, func(i, j int) bool { return tp.parts[i].Name < tp.parts[j].Name }) + sort.Slice(tp.archiveEntries, func(i, j int) bool { return tp.archiveEntries[i].NameInArchive < tp.archiveEntries[j].NameInArchive }) + } + return plan, nil +} + +// excludedTables returns a set of "db.table" keys to skip when +// skipObjectDisks is true. Two paths: +// 1. precomputed: caller passed an explicit list (used by the CLI's +// snapshot-driven flow that doesn't have live disk paths). +// 2. derived: caller passed DiskInfo + TableInfo, in which case we run +// DetectObjectDiskTables (used by tests that model live ClickHouse). +// +// When both are empty, returns an empty set (effectively a no-op). +func excludedTables(skipObjectDisks bool, precomputed []string, disks []DiskInfo, tables []TableInfo) map[string]bool { + out := make(map[string]bool) + if !skipObjectDisks { + return out + } + if len(precomputed) > 0 { + for _, k := range precomputed { + out[k] = true + } + return out + } + if len(disks) == 0 || len(tables) == 0 { + return out + } + for _, h := range DetectObjectDiskTables(tables, disks) { + out[h.Database+"."+h.Table] = true + } + return out +} + +// planPart classifies a single part directory using the two-pass walker. +// +// Pass 1: parse checksums.txt recursively (descending into .proj/ subdirs) +// and build extractSet = { rel_path → (hash, size) } for every +// above-threshold listed file. +// Pass 2: walk the part directory recursively. For each file: +// - rel_path in extractSet → register a blob ref. +// - otherwise → append an archive entry preserving +// /. +// Hidden / non-regular files: warn, skip. +// .proj directories not in any parent's extractSet: warn, skip. +func planPart(partDir, partName string, threshold uint64, plan *uploadPlan, tp *tablePlan) error { + extractSet, knownProjDirs, err := buildExtractSet(partDir, threshold) + if err != nil { + return err + } + return walkPartFiles(partDir, partName, extractSet, knownProjDirs, plan, tp) +} + +// extractEntry holds the blob target for one above-threshold file. +type extractEntry struct { + Hash Hash128 + Size uint64 +} + +// buildExtractSet recursively parses checksums.txt files starting at +// partRoot. Returns: +// - extractSet: rel_path → (hash, size) for every above-threshold +// non-.proj checksum entry, recursively. rel_path is relative to +// partRoot and uses forward slashes (e.g. "data.bin", "p1.proj/data.bin"). +// - knownProjDirs: rel_path → struct{} for every .proj directory referenced +// by some checksums.txt at any level. Used in pass 2 to distinguish +// legitimate projection subtrees from orphans. +// +// Strict failures: missing/unparseable .proj/checksums.txt; .proj entry +// whose target is missing or not a directory; non-.proj entry whose file +// is missing on disk. +func buildExtractSet(partRoot string, threshold uint64) (map[string]extractEntry, map[string]struct{}, error) { + extractSet := map[string]extractEntry{} + knownProj := map[string]struct{}{} + var recurse func(dir, relPrefix string) error + recurse = func(dir, relPrefix string) error { + ckPath := filepath.Join(dir, "checksums.txt") + f, err := os.Open(ckPath) + if err != nil { + return fmt.Errorf("open %s: %w", ckPath, err) + } + parsed, perr := checksumstxt.Parse(f) + _ = f.Close() + if perr != nil { + return fmt.Errorf("parse %s: %w", ckPath, perr) + } + for fname, c := range parsed.Files { + rel := relPrefix + fname + // validate ALL filenames first — including .proj entries — to prevent + // directory traversal via crafted remote checksums.txt content. + // Upload side trusts local filesystem but applies the same validator + // for defense in depth. + if err := validateChecksumsTxtFilename(fname); err != nil { + return fmt.Errorf("cas: %s: %w", ckPath, err) + } + if strings.HasSuffix(fname, ".proj") { + subDir := filepath.Join(dir, fname) + st, statErr := os.Stat(subDir) + if statErr != nil { + return fmt.Errorf("projection subdir %s: %w", subDir, statErr) + } + if !st.IsDir() { + return fmt.Errorf("projection entry %q in %s: target on disk is not a directory", fname, ckPath) + } + knownProj[rel] = struct{}{} + if err := recurse(subDir, rel+"/"); err != nil { + return err + } + continue + } + localPath := filepath.Join(dir, fname) + if _, err := os.Stat(localPath); err != nil { + return fmt.Errorf("file listed in %s missing on disk: %s", ckPath, fname) + } + if c.FileSize > threshold { + extractSet[rel] = extractEntry{ + Hash: Hash128{Low: c.FileHash.Low, High: c.FileHash.High}, + Size: c.FileSize, + } + } + } + return nil + } + if err := recurse(partRoot, ""); err != nil { + return nil, nil, err + } + return extractSet, knownProj, nil +} + +// walkPartFiles is pass 2: walk the on-disk part directory, route each +// regular file to either the blob store (if rel_path is in extractSet) +// or the archive (everything else, paths preserved). +// +// Hidden files (name starts with ".") and non-regular files (symlinks, +// sockets, devices) generate a Warn log and are skipped. +// .proj directories not in knownProj are also warn-and-skipped. +func walkPartFiles(partRoot, partName string, extractSet map[string]extractEntry, knownProj map[string]struct{}, plan *uploadPlan, tp *tablePlan) error { + return filepath.WalkDir(partRoot, func(path string, d fs.DirEntry, walkErr error) error { + if walkErr != nil { + return walkErr + } + if path == partRoot { + return nil + } + rel, err := filepath.Rel(partRoot, path) + if err != nil { + return err + } + rel = filepath.ToSlash(rel) + if d.IsDir() { + if strings.HasSuffix(rel, ".proj") { + if _, ok := knownProj[rel]; !ok { + log.Warn().Str("part", partName).Str("rel", rel).Msg("cas-upload: orphan .proj directory in part — skipping") + return filepath.SkipDir + } + } + return nil + } + base := filepath.Base(path) + if strings.HasPrefix(base, ".") { + log.Warn().Str("part", partName).Str("rel", rel).Msg("cas-upload: hidden file in part — skipping") + return nil + } + if !d.Type().IsRegular() { + log.Warn().Str("part", partName).Str("rel", rel).Msg("cas-upload: non-regular file in part — skipping") + return nil + } + if entry, ok := extractSet[rel]; ok { + plan.totalFiles++ + plan.totalBytes += entry.Size + plan.blobFiles++ + if existing, dup := plan.blobs[entry.Hash]; dup { + // Same hash, but checksums.txt files in different parts + // declare conflicting sizes — malformed input. Refuse loudly + // rather than silently committing a metadata.json that + // references an ambiguous hash/size pair. + if existing.Size != entry.Size { + return fmt.Errorf("cas: malformed checksums.txt: hash %x/%x has conflicting sizes %d and %d (in parts %s and %s)", + entry.Hash.High, entry.Hash.Low, existing.Size, entry.Size, existing.LocalPath, path) + } + // Same hash AND same size — genuine content-addressed dedup + // (e.g. hardlinked files across parts). Keep the existing entry. + } else { + plan.blobs[entry.Hash] = blobRef{LocalPath: path, Size: entry.Size} + } + return nil + } + st, err := os.Stat(path) + if err != nil { + return fmt.Errorf("stat %s: %w", path, err) + } + size := uint64(st.Size()) + tp.archiveEntries = append(tp.archiveEntries, ArchiveEntry{ + NameInArchive: partName + "/" + rel, + LocalPath: path, + }) + plan.totalFiles++ + plan.totalBytes += size + plan.inlineFiles++ + plan.inlineBytes += size + return nil + }) +} + +// tableFilterMatches returns true if any pattern in filter matches "db.table". +// Empty filter = match-all. Patterns use filepath.Match semantics ("*", "?", +// "[abc]") on the full "db.table" name, mirroring v1 (pkg/backup/table_pattern.go:93). +// Patterns are trimmed of surrounding whitespace before matching. +func tableFilterMatches(filter []string, db, table string) bool { + if len(filter) == 0 { + return true + } + full := db + "." + table + for _, f := range filter { + f = strings.TrimSpace(f) + if f == "" { + continue + } + if matched, err := filepath.Match(f, full); err == nil && matched { + return true + } + // Also try exact match in case the pattern contains characters + // filepath.Match treats specially but the user meant literally. + if f == full { + return true + } + } + return false +} + +// countingReadCloser wraps an io.ReadCloser and counts bytes read through it. +// Used in uploadMissingBlobs to verify the actual number of bytes streamed to +// PutFile matches the size declared in checksums.txt. +type countingReadCloser struct { + rc io.ReadCloser + n int64 +} + +func (c *countingReadCloser) Read(p []byte) (int, error) { + n, err := c.rc.Read(p) + c.n += int64(n) + return n, err +} + +func (c *countingReadCloser) Close() error { return c.rc.Close() } + +// skippedBlob records a blob that was dedup'd via cold-list (i.e. already +// present in the remote). The Size field is the expected byte count from +// the local checksums.txt, used by step 11c to detect stale/truncated +// objects at content-addressed keys (defense-in-depth). +type skippedBlob struct { + Key string + Hash Hash128 + Size int64 // expected, from checksums.txt via blobRef +} + +// uploadMissingBlobs PUTs every blob in plan.blobs that is not in the +// existing set. Concurrency capped by parallelism (<=0 → 16). +// skipped contains the full object keys of blobs that were skipped because +// cold-list reported them as already present; callers re-validate these +// before committing to close the ColdList TOCTOU window. +func uploadMissingBlobs(ctx context.Context, b Backend, cp string, plan *uploadPlan, existing *ExistenceSet, parallelism int) (uploaded int, bytesUp int64, skipped []skippedBlob, err error) { + if parallelism <= 0 { + parallelism = 16 + } + type job struct { + h Hash128 + ref blobRef + } + var jobs []job + for h, ref := range plan.blobs { + if existing.Has(h) { + skipped = append(skipped, skippedBlob{ + Key: BlobPath(cp, h), + Hash: h, + Size: int64(ref.Size), + }) + continue + } + jobs = append(jobs, job{h: h, ref: ref}) + } + // Deterministic ordering of skipped aids debugging/tests. + sort.Slice(skipped, func(i, j int) bool { return skipped[i].Key < skipped[j].Key }) + // Deterministic ordering aids debugging/tests. + sort.Slice(jobs, func(i, j int) bool { + if jobs[i].h.High != jobs[j].h.High { + return jobs[i].h.High < jobs[j].h.High + } + return jobs[i].h.Low < jobs[j].h.Low + }) + + var ( + mu sync.Mutex + firstErr error + ) + + sem := make(chan struct{}, parallelism) + var wg sync.WaitGroup + for _, j := range jobs { + j := j + wg.Add(1) + go func() { + defer wg.Done() + sem <- struct{}{} + defer func() { <-sem }() + + mu.Lock() + already := firstErr != nil + mu.Unlock() + if already { + return + } + f, err := os.Open(j.ref.LocalPath) + if err != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: open blob source %s: %w", j.ref.LocalPath, err) + } + mu.Unlock() + return + } + // Production storage backends (S3, GCS, AzBlob) do NOT close the + // io.ReadCloser passed to PutFile — they just stream Body off it + // and return. Without an explicit defer here, every blob upload + // would leak one fd, exhausting the process limit on backups + // with thousands of blobs. The fakedst test backend DOES call + // r.Close, which masks the leak in unit tests; keep both + // behaviors compatible by closing here ourselves (double-close + // of *os.File is a no-op error we ignore). + defer f.Close() + cr := &countingReadCloser{rc: f} + err = b.PutFile(ctx, BlobPath(cp, j.h), cr, int64(j.ref.Size)) + if err != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: put blob %s: %w", BlobPath(cp, j.h), err) + } + mu.Unlock() + return + } + // Verify that the number of bytes actually streamed to PutFile + // matches the size declared in checksums.txt. A mismatch means + // the local file was mutated (truncated/grown) between planning + // and upload — committing metadata.json in this state would + // reference a hash/size pair that doesn't match the stored bytes. + if cr.n != int64(j.ref.Size) { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: blob %s size mismatch: uploaded=%d expected=%d (per checksums.txt)", + BlobPath(cp, j.h), cr.n, j.ref.Size) + } + mu.Unlock() + return + } + mu.Lock() + uploaded++ + bytesUp += int64(j.ref.Size) + mu.Unlock() + }() + } + wg.Wait() + return uploaded, bytesUp, skipped, firstErr +} + +// revalidateColdList performs step-11c of Upload in parallel: for every blob +// that was skipped in uploadMissingBlobs (because cold-list said it existed), +// StatFile is called to confirm the object is still present and the stored +// size matches what checksums.txt recorded. Concurrency is capped by +// parallelism (<=0 → 16). +// +// Returns the first error encountered; all goroutines finish before returning +// regardless, so there are no goroutine leaks on the error path. +func revalidateColdList(ctx context.Context, b Backend, cp, name string, skipped []skippedBlob, parallelism int) error { + if parallelism <= 0 { + parallelism = 16 + } + if len(skipped) == 0 { + return nil + } + + var ( + mu sync.Mutex + firstErr error + ) + + sem := make(chan struct{}, parallelism) + var wg sync.WaitGroup + for _, sb := range skipped { + sb := sb + wg.Add(1) + go func() { + defer wg.Done() + sem <- struct{}{} + defer func() { <-sem }() + + mu.Lock() + already := firstErr != nil + mu.Unlock() + if already { + return + } + + sz, _, exists, err := b.StatFile(ctx, sb.Key) + if err != nil { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: re-check cold-listed blob %s: %w", sb.Key, err) + } + mu.Unlock() + return + } + if !exists { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: cold-listed blob %s disappeared before commit (concurrent prune?); aborting", sb.Key) + } + mu.Unlock() + return + } + if sz != sb.Size { + mu.Lock() + if firstErr == nil { + firstErr = fmt.Errorf("cas: cold-listed blob %s size mismatch: remote=%d, expected=%d (per checksums.txt); aborting to prevent corrupt backup", sb.Key, sz, sb.Size) + } + mu.Unlock() + return + } + }() + } + wg.Wait() + return firstErr +} + +// uploadPartArchives builds and PUTs one tar.zstd per (disk, db, table). +// +// Each archive is written to a temporary file on disk rather than a +// bytes.Buffer so that the compressed bytes never accumulate in RAM. +// WriteArchive streams the zstd-compressed tar directly into the tempfile; +// after the write completes the file is seeked back to the start and passed +// to b.PutFile with the exact byte count. The tempfile is removed in a +// deferred cleanup regardless of whether the PUT succeeds or fails. +func uploadPartArchives(ctx context.Context, b Backend, cp, name string, plan *uploadPlan) (int, int64, error) { + count := 0 + var totalBytes int64 + for _, planKey := range plan.tableKeys { + tp := plan.tables[planKey] + if len(tp.archiveEntries) == 0 { + continue + } + + // Write the compressed archive to a tempfile to avoid buffering the + // entire compressed output in memory. + tmp, err := os.CreateTemp("", "cas-archive-*.tar.zstd") + if err != nil { + return count, totalBytes, fmt.Errorf("cas: create temp archive for %s/%s/%s: %w", tp.Disk, tp.DB, tp.Table, err) + } + tmpPath := tmp.Name() + cleanup := func() { + _ = tmp.Close() + _ = os.Remove(tmpPath) + } + + if err := WriteArchive(tmp, tp.archiveEntries); err != nil { + cleanup() + return count, totalBytes, fmt.Errorf("cas: write archive %s/%s/%s: %w", tp.Disk, tp.DB, tp.Table, err) + } + if err := tmp.Sync(); err != nil { + cleanup() + return count, totalBytes, fmt.Errorf("cas: sync archive %s/%s/%s: %w", tp.Disk, tp.DB, tp.Table, err) + } + size, err := tmp.Seek(0, io.SeekCurrent) + if err != nil { + cleanup() + return count, totalBytes, fmt.Errorf("cas: seek archive %s/%s/%s: %w", tp.Disk, tp.DB, tp.Table, err) + } + if _, err := tmp.Seek(0, io.SeekStart); err != nil { + cleanup() + return count, totalBytes, fmt.Errorf("cas: rewind archive %s/%s/%s: %w", tp.Disk, tp.DB, tp.Table, err) + } + + objKey := PartArchivePath(cp, name, tp.Disk, tp.DB, tp.Table) + if err := b.PutFile(ctx, objKey, io.NopCloser(tmp), size); err != nil { + cleanup() + return count, totalBytes, fmt.Errorf("cas: put archive %s: %w", objKey, err) + } + cleanup() + + log.Info(). + Str("disk", tp.Disk). + Str("db", tp.DB). + Str("table", tp.Table). + Int64("compressed_bytes", size). + Msg("cas-upload: per-table archive uploaded") + + count++ + totalBytes += size + } + return count, totalBytes, nil +} + +// uploadTableJSONs writes per-(db, table) TableMetadata JSONs at +// cas//metadata//metadata//.json. +// +// One JSON per (db, table) — multiple disks are merged into a single +// file with Parts keyed by disk. +func uploadTableJSONs(ctx context.Context, b Backend, cp, name string, plan *uploadPlan) error { + // Group plan tables by (db, table) -> []*tablePlan (one per disk). + type dbTable struct{ DB, Table string } + grouped := make(map[dbTable][]*tablePlan) + var keys []dbTable + for _, k := range plan.tableKeys { + tp := plan.tables[k] + dt := dbTable{DB: tp.DB, Table: tp.Table} + if _, ok := grouped[dt]; !ok { + keys = append(keys, dt) + } + grouped[dt] = append(grouped[dt], tp) + } + sort.Slice(keys, func(i, j int) bool { + if keys[i].DB != keys[j].DB { + return keys[i].DB < keys[j].DB + } + return keys[i].Table < keys[j].Table + }) + + for _, dt := range keys { + tps := grouped[dt] + tm := metadata.TableMetadata{ + Database: dt.DB, + Table: dt.Table, + Parts: make(map[string][]metadata.Part), + MetadataOnly: false, + } + for _, tp := range tps { + if len(tp.parts) == 0 { + // Schema-only / empty table: no per-disk parts. Don't insert a + // disk key at all — downstream (cas-download) ranges over + // tm.Parts and would otherwise try to fetch a nonexistent + // per-table archive for that disk. + continue + } + tm.Parts[tp.Disk] = append(tm.Parts[tp.Disk], tp.parts...) + } + // Merge schema fields from the v1 per-table metadata that + // `clickhouse-backup create` wrote to disk. Required so cas-restore + // on a fresh host can issue CREATE TABLE; without these fields the + // v1 restore handoff produces an empty Query and fails. + local, err := readLocalTableMetadata(plan.localRoot, dt.DB, dt.Table) + if err != nil { + return fmt.Errorf("cas: read local table metadata for %s.%s: %w", dt.DB, dt.Table, err) + } + tm.Query = local.Query + tm.UUID = local.UUID + tm.TotalBytes = local.TotalBytes + tm.Size = local.Size + tm.DependenciesTable = local.DependenciesTable + tm.DependenciesDatabase = local.DependenciesDatabase + tm.Mutations = local.Mutations + body, err := json.MarshalIndent(&tm, "", "\t") + if err != nil { + return fmt.Errorf("cas: marshal table metadata %s.%s: %w", dt.DB, dt.Table, err) + } + key := TableMetaPath(cp, name, dt.DB, dt.Table) + if err := putBytes(ctx, b, key, body); err != nil { + return fmt.Errorf("cas: put table metadata %s: %w", key, err) + } + } + return nil +} + +// readLocalTableMetadata reads /metadata//.json +// that `clickhouse-backup create` wrote. The on-disk path is always +// percent-encoded (matching create's filesystem layout); the caller +// passes db/table as DECODED identifiers, and this helper applies the +// encoding for the lookup. Returns a zero-value TableMetadata + nil +// error if the file is missing — older create flows or test fixtures +// may omit it; the caller logs and ships an empty schema in that case +// (degrading fresh-host restore but not breaking table-already-exists +// restore). +func readLocalTableMetadata(root, db, table string) (metadata.TableMetadata, error) { + p := filepath.Join(root, "metadata", common.TablePathEncode(db), common.TablePathEncode(table)+".json") + f, err := os.Open(p) + if err != nil { + if os.IsNotExist(err) { + log.Warn().Str("path", p).Msg("cas: local v1 per-table metadata missing; uploaded schema fields will be empty") + return metadata.TableMetadata{}, nil + } + return metadata.TableMetadata{}, fmt.Errorf("cas: open %s: %w", p, err) + } + defer f.Close() + var tm metadata.TableMetadata + if err := json.NewDecoder(f).Decode(&tm); err != nil { + return metadata.TableMetadata{}, fmt.Errorf("cas: parse %s: %w", p, err) + } + return tm, nil +} + +// buildBackupMetadata constructs the root BackupMetadata for the commit +// step. We populate the minimum needed to round-trip via ValidateBackup +// + future cas-download. Fields that depend on live ClickHouse (UUID, +// CreationDate-from-ClickHouse, etc.) are populated by the caller in +// later tasks. +func buildBackupMetadata(name string, cfg Config, plan *uploadPlan) *metadata.BackupMetadata { + // Build Tables list deterministically. + type dbTable struct{ DB, Table string } + seen := make(map[dbTable]struct{}) + var tables []metadata.TableTitle + for _, k := range plan.tableKeys { + tp := plan.tables[k] + dt := dbTable{DB: tp.DB, Table: tp.Table} + if _, ok := seen[dt]; ok { + continue + } + seen[dt] = struct{}{} + tables = append(tables, metadata.TableTitle{Database: tp.DB, Table: tp.Table}) + } + sort.Slice(tables, func(i, j int) bool { + if tables[i].Database != tables[j].Database { + return tables[i].Database < tables[j].Database + } + return tables[i].Table < tables[j].Table + }) + + return &metadata.BackupMetadata{ + BackupName: name, + CreationDate: time.Now().UTC(), + DataFormat: "directory", + Tables: tables, + CAS: &metadata.CASBackupParams{ + LayoutVersion: LayoutVersion, + InlineThreshold: cfg.InlineThreshold, + ClusterID: cfg.ClusterID, + }, + } +} + +// readDir returns the names of entries in dir. Empty slice and nil +// error if the directory exists but is empty. +func readDir(dir string) ([]string, error) { + entries, err := os.ReadDir(dir) + if err != nil { + return nil, err + } + names := make([]string, 0, len(entries)) + for _, e := range entries { + names = append(names, e.Name()) + } + sort.Strings(names) + return names, nil +} diff --git a/pkg/cas/upload_test.go b/pkg/cas/upload_test.go new file mode 100644 index 00000000..e61f66f4 --- /dev/null +++ b/pkg/cas/upload_test.go @@ -0,0 +1,1625 @@ +package cas_test + +import ( + "archive/tar" + "context" + "encoding/json" + "errors" + "fmt" + "io" + "os" + "path/filepath" + "strings" + "sync" + "sync/atomic" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/testfixtures" + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" + "github.com/klauspost/compress/zstd" +) + +// testCfg returns a CAS config valid enough that Upload doesn't reject +// it on Validate(). Threshold 100 keeps small files inline and pushes +// 1024-byte files to blob. +func testCfg(threshold uint64) cas.Config { + c := cas.Config{ + Enabled: true, + ClusterID: "c1", + RootPrefix: "cas/", + InlineThreshold: threshold, + GraceBlob: "24h", + AbandonThreshold: "168h", + } + // Populate parsed durations on the (now pointer-receiver) Validate. + if err := c.Validate(); err != nil { + panic(err) + } + return c +} + +func smallPart(name string, hashLow uint64) testfixtures.PartSpec { + return testfixtures.PartSpec{ + Disk: "default", DB: "db1", Table: "t1", Name: name, + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: hashLow + 1, HashHigh: 100}, + {Name: "primary.idx", Size: 8, HashLow: hashLow + 2, HashHigh: 100}, + {Name: "data.bin", Size: 1024, HashLow: hashLow + 3, HashHigh: 100}, + }, + } +} + +func TestUpload_RoundTripBasic(t *testing.T) { + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + f := fakedst.New() + cfg := testCfg(100) + res, err := cas.Upload(context.Background(), f, cfg, "b1", cas.UploadOptions{ + LocalBackupDir: lb.Root, + }) + if err != nil { + t.Fatalf("Upload: %v", err) + } + if res.BlobsConsidered != 1 { + t.Errorf("BlobsConsidered: got %d want 1", res.BlobsConsidered) + } + if res.BlobsUploaded != 1 { + t.Errorf("BlobsUploaded: got %d want 1", res.BlobsUploaded) + } + if res.PerTableArchives != 1 { + t.Errorf("PerTableArchives: got %d want 1", res.PerTableArchives) + } + cp := cfg.ClusterPrefix() + + // metadata.json must exist with CAS field populated. + rc, err := f.GetFile(context.Background(), cas.MetadataJSONPath(cp, "b1")) + if err != nil { + t.Fatalf("get metadata.json: %v", err) + } + body, _ := io.ReadAll(rc) + _ = rc.Close() + var bm metadata.BackupMetadata + if err := json.Unmarshal(body, &bm); err != nil { + t.Fatalf("parse metadata.json: %v", err) + } + if bm.CAS == nil { + t.Fatal("metadata.json: CAS field nil") + } + if bm.CAS.LayoutVersion != cas.LayoutVersion { + t.Errorf("LayoutVersion: got %d want %d", bm.CAS.LayoutVersion, cas.LayoutVersion) + } + if bm.CAS.InlineThreshold != cfg.InlineThreshold { + t.Errorf("InlineThreshold: got %d want %d", bm.CAS.InlineThreshold, cfg.InlineThreshold) + } + if bm.CAS.ClusterID != cfg.ClusterID { + t.Errorf("ClusterID: got %q want %q", bm.CAS.ClusterID, cfg.ClusterID) + } + if bm.DataFormat != "directory" { + t.Errorf("DataFormat: got %q want directory", bm.DataFormat) + } + + // In-progress marker must be gone. + if _, _, exists, err := f.StatFile(context.Background(), cas.InProgressMarkerPath(cp, "b1")); err != nil { + t.Fatal(err) + } else if exists { + t.Error("in-progress marker still present after commit") + } + + // Archive + table json present. + if _, _, exists, _ := f.StatFile(context.Background(), cas.PartArchivePath(cp, "b1", "default", "db1", "t1")); !exists { + t.Error("part archive missing") + } + if _, _, exists, _ := f.StatFile(context.Background(), cas.TableMetaPath(cp, "b1", "db1", "t1")); !exists { + t.Error("table metadata json missing") + } +} + +func TestUpload_DedupsAcrossParts(t *testing.T) { + // Two parts with the same blob hash for data.bin → one PutFile. + bytes1024 := make([]byte, 1024) + for i := range bytes1024 { + bytes1024[i] = 0xAB + } + parts := []testfixtures.PartSpec{ + {Disk: "default", DB: "db1", Table: "t1", Name: "p1", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 1}, + {Name: "data.bin", Size: 1024, HashLow: 999, HashHigh: 999, Bytes: bytes1024}, + }}, + {Disk: "default", DB: "db1", Table: "t1", Name: "p2", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 2, HashHigh: 2}, + {Name: "data.bin", Size: 1024, HashLow: 999, HashHigh: 999, Bytes: bytes1024}, + }}, + } + lb := testfixtures.Build(t, parts) + f := fakedst.New() + cfg := testCfg(100) + + // Wrap to count PutFile calls on blob keys. + wrap := newCountingBackend(f) + res, err := cas.Upload(context.Background(), wrap, cfg, "b1", cas.UploadOptions{LocalBackupDir: lb.Root}) + if err != nil { + t.Fatalf("Upload: %v", err) + } + if res.BlobsUploaded != 1 { + t.Errorf("BlobsUploaded: got %d want 1", res.BlobsUploaded) + } + cp := cfg.ClusterPrefix() + puts := wrap.putsForPrefix(cp + "blob/") + if puts != 1 { + t.Errorf("blob PutFile count: got %d want 1", puts) + } +} + +func TestUpload_RefusesIfPruneMarkerPresent(t *testing.T) { + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + f := fakedst.New() + cfg := testCfg(100) + if err := f.PutFile(context.Background(), cas.PruneMarkerPath(cfg.ClusterPrefix()), + io.NopCloser(strings.NewReader("{}")), 2); err != nil { + t.Fatal(err) + } + _, err := cas.Upload(context.Background(), f, cfg, "b1", cas.UploadOptions{LocalBackupDir: lb.Root}) + if !errors.Is(err, cas.ErrPruneInProgress) { + t.Fatalf("got err=%v want ErrPruneInProgress", err) + } +} + +func TestUpload_RefusesIfBackupExists(t *testing.T) { + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + f := fakedst.New() + cfg := testCfg(100) + if err := f.PutFile(context.Background(), cas.MetadataJSONPath(cfg.ClusterPrefix(), "b1"), + io.NopCloser(strings.NewReader("{}")), 2); err != nil { + t.Fatal(err) + } + _, err := cas.Upload(context.Background(), f, cfg, "b1", cas.UploadOptions{LocalBackupDir: lb.Root}) + if !errors.Is(err, cas.ErrBackupExists) { + t.Fatalf("got err=%v want ErrBackupExists", err) + } +} + +func TestUpload_PreCommitChecksPruneMarker(t *testing.T) { + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Wrap so that as soon as the planner has done the cold-list, we + // inject a prune marker before the pre-commit re-check fires. + wrap := newInjectingBackend(f) + wrap.onStat = func(key string) { + // Trigger when the pre-commit re-check stats the prune marker. + // At that point all uploads + table JSONs are done; just put + // the marker so the stat returns "exists". + if key == cas.PruneMarkerPath(cp) && atomic.LoadInt32(&wrap.injected) == 0 { + // Only inject AFTER step 6/7 (initial check has long passed). + // Easy heuristic: do it the second time the prune-marker key + // is stat'd (first = step 2, second = step 11a). + if atomic.AddInt32(&wrap.statCount, 1) >= 2 { + _ = f.PutFile(context.Background(), key, io.NopCloser(strings.NewReader("{}")), 2) + atomic.StoreInt32(&wrap.injected, 1) + } + } + } + + _, err := cas.Upload(context.Background(), wrap, cfg, "b1", cas.UploadOptions{LocalBackupDir: lb.Root}) + if !errors.Is(err, cas.ErrPruneInProgress) { + t.Fatalf("got err=%v want ErrPruneInProgress", err) + } + // metadata.json must NOT have been written. + if _, _, exists, _ := f.StatFile(context.Background(), cas.MetadataJSONPath(cp, "b1")); exists { + t.Error("metadata.json was written despite prune-marker injection") + } + // in-progress marker must have been cleaned up. + if _, _, exists, _ := f.StatFile(context.Background(), cas.InProgressMarkerPath(cp, "b1")); exists { + t.Error("in-progress marker still present after abort") + } +} + +func TestUpload_PreCommitChecksOwnInProgressMarker(t *testing.T) { + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Delete the in-progress marker right before step 11b stats it. + wrap := newInjectingBackend(f) + wrap.onStat = func(key string) { + if key == cas.InProgressMarkerPath(cp, "b1") { + _ = f.DeleteFile(context.Background(), key) + } + } + _, err := cas.Upload(context.Background(), wrap, cfg, "b1", cas.UploadOptions{LocalBackupDir: lb.Root}) + if err == nil || !strings.Contains(err.Error(), "in-progress marker") { + t.Fatalf("got err=%v want in-progress-marker abort", err) + } + if _, _, exists, _ := f.StatFile(context.Background(), cas.MetadataJSONPath(cp, "b1")); exists { + t.Error("metadata.json was written despite swept marker") + } +} + +func TestUpload_DryRun(t *testing.T) { + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + f := fakedst.New() + cfg := testCfg(100) + res, err := cas.Upload(context.Background(), f, cfg, "b1", cas.UploadOptions{ + LocalBackupDir: lb.Root, + DryRun: true, + }) + if err != nil { + t.Fatalf("Upload: %v", err) + } + if !res.DryRun { + t.Error("res.DryRun: got false want true") + } + if res.BlobsUploaded != 0 { + t.Errorf("BlobsUploaded: got %d want 0", res.BlobsUploaded) + } + if f.Len() != 0 { + t.Errorf("backend.Len: got %d want 0 (dry run)", f.Len()) + } +} + +func TestUpload_RefusesObjectDisks(t *testing.T) { + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + f := fakedst.New() + cfg := testCfg(100) + disks := []cas.DiskInfo{{Name: "s3disk", Path: "/var/lib/clickhouse/disks/s3", Type: "s3"}} + tables := []cas.TableInfo{{Database: "db1", Name: "t1", DataPaths: []string{"/var/lib/clickhouse/disks/s3/store/abc/"}}} + _, err := cas.Upload(context.Background(), f, cfg, "b1", cas.UploadOptions{ + LocalBackupDir: lb.Root, + Disks: disks, + ClickHouseTables: tables, + }) + if !errors.Is(err, cas.ErrObjectDiskRefused) { + t.Fatalf("got err=%v want ErrObjectDiskRefused", err) + } +} + +func TestUpload_SkipObjectDisks(t *testing.T) { + // Two tables; t2 is on an object disk and must be silently excluded. + parts := []testfixtures.PartSpec{ + {Disk: "default", DB: "db1", Table: "t1", Name: "p1", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 1}, + }}, + {Disk: "s3disk", DB: "db1", Table: "t2", Name: "p1", Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 2, HashHigh: 2}, + }}, + } + lb := testfixtures.Build(t, parts) + f := fakedst.New() + cfg := testCfg(100) + disks := []cas.DiskInfo{ + {Name: "default", Path: "/var/lib/clickhouse", Type: "local"}, + {Name: "s3disk", Path: "/var/lib/clickhouse/disks/s3", Type: "s3"}, + } + tables := []cas.TableInfo{ + {Database: "db1", Name: "t1", DataPaths: []string{"/var/lib/clickhouse/store/abc/"}}, + {Database: "db1", Name: "t2", DataPaths: []string{"/var/lib/clickhouse/disks/s3/store/def/"}}, + } + res, err := cas.Upload(context.Background(), f, cfg, "b1", cas.UploadOptions{ + LocalBackupDir: lb.Root, + SkipObjectDisks: true, + Disks: disks, + ClickHouseTables: tables, + }) + if err != nil { + t.Fatalf("Upload: %v", err) + } + if res.PerTableArchives != 1 { + t.Errorf("PerTableArchives: got %d want 1 (t2 should be skipped)", res.PerTableArchives) + } + cp := cfg.ClusterPrefix() + if _, _, exists, _ := f.StatFile(context.Background(), cas.PartArchivePath(cp, "b1", "default", "db1", "t1")); !exists { + t.Error("t1 archive missing") + } + if _, _, exists, _ := f.StatFile(context.Background(), cas.PartArchivePath(cp, "b1", "s3disk", "db1", "t2")); exists { + t.Error("t2 archive should not have been uploaded") + } +} + +// TestUpload_ExcludedTablesSkipsArchive verifies the precomputed exclusion +// list flows through cas.Upload to planUpload and the excluded table's +// per-table archive is NOT written. Closes the gap between the CLI-side +// wiring test (TestSkipObjectDisks_ExclusionFiresFromSnapshot) and the +// existing live-disk-derived path (TestUpload_SkipObjectDisks). +func TestUpload_ExcludedTablesSkipsArchive(t *testing.T) { + ctx := context.Background() + parts := []testfixtures.PartSpec{ + { + Disk: "default", DB: "db1", Table: "keep", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 16, HashLow: 1, HashHigh: 1}, + }, + }, + { + Disk: "default", DB: "db1", Table: "drop", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 16, HashLow: 2, HashHigh: 2}, + }, + }, + } + src := testfixtures.Build(t, parts) + f := fakedst.New() + cfg := testCfg(1024) + + if _, err := cas.Upload(ctx, f, cfg, "bk", cas.UploadOptions{ + LocalBackupDir: src.Root, + SkipObjectDisks: true, + ExcludedTables: []string{"db1.drop"}, + }); err != nil { + t.Fatalf("Upload: %v", err) + } + + // db1.keep's per-table archive must exist; db1.drop's must not. + cp := cfg.ClusterPrefix() + keepKey := cas.PartArchivePath(cp, "bk", "default", "db1", "keep") + dropKey := cas.PartArchivePath(cp, "bk", "default", "db1", "drop") + if _, _, exists, err := f.StatFile(ctx, keepKey); err != nil || !exists { + t.Errorf("db1.keep archive missing: exists=%v err=%v", exists, err) + } + if _, _, exists, err := f.StatFile(ctx, dropKey); err != nil { + t.Fatalf("StatFile(drop): %v", err) + } else if exists { + t.Errorf("db1.drop archive should NOT exist when in ExcludedTables; key=%s", dropKey) + } +} + +// TestUpload_MergesSchemaFieldsFromLocalV1Metadata verifies cas-upload +// reads the per-(db, table) JSON that `clickhouse-backup create` wrote +// and merges Query/UUID/TotalBytes/etc. into the uploaded +// TableMetadata. Without this merge, cas-restore on a fresh host can't +// recreate tables. +func TestUpload_MergesSchemaFieldsFromLocalV1Metadata(t *testing.T) { + parts := []testfixtures.PartSpec{ + { + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 8, HashLow: 1, HashHigh: 0}, + }, + TableMeta: metadata.TableMetadata{ + Database: "db1", + Table: "t1", + Query: "CREATE TABLE db1.t1 (id UInt64) ENGINE=MergeTree ORDER BY id", + UUID: "deadbeef-0000-0000-0000-000000000001", + TotalBytes: 12345, + }, + }, + } + src := testfixtures.Build(t, parts) + f := fakedst.New() + cfg := testCfg(100) + if _, err := cas.Upload(context.Background(), f, cfg, "bk1", cas.UploadOptions{ + LocalBackupDir: src.Root, + }); err != nil { + t.Fatalf("Upload: %v", err) + } + + rc, err := f.GetFile(context.Background(), cas.TableMetaPath(cfg.ClusterPrefix(), "bk1", "db1", "t1")) + if err != nil { + t.Fatalf("get table metadata: %v", err) + } + body, _ := io.ReadAll(rc) + _ = rc.Close() + var got metadata.TableMetadata + if err := json.Unmarshal(body, &got); err != nil { + t.Fatalf("parse table metadata: %v", err) + } + + if got.Query == "" { + t.Error("uploaded TableMetadata.Query is empty - fresh-host restore would fail") + } + if got.UUID != "deadbeef-0000-0000-0000-000000000001" { + t.Errorf("UUID: got %q want %q", got.UUID, "deadbeef-0000-0000-0000-000000000001") + } + if got.TotalBytes != 12345 { + t.Errorf("TotalBytes: got %d want 12345", got.TotalBytes) + } +} + +// TestUpload_PreservesEmptyTable verifies that a table whose metadata JSON +// exists locally but has no shadow part directory still appears in the +// uploaded BackupMetadata.Tables list. Without the fix, the table would be +// silently dropped and cas-restore could not recreate its schema. +func TestUpload_PreservesEmptyTable(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + // Build a synthetic backup with two tables: t1 has a part, t2 has only + // a metadata JSON (no shadow dir). + parts := []testfixtures.PartSpec{ + { + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 64, HashLow: 1, HashHigh: 2}, + }, + }, + } + src := testfixtures.Build(t, parts) + + // Add t2's metadata JSON manually (no shadow dir). + t2Meta := `{"database":"db1","table":"t2","query":"CREATE TABLE db1.t2 (id UInt64) ENGINE=MergeTree ORDER BY id"}` + t2Path := filepath.Join(src.Root, "metadata", "db1", "t2.json") + if err := os.WriteFile(t2Path, []byte(t2Meta), 0o644); err != nil { + t.Fatal(err) + } + + if _, err := cas.Upload(ctx, f, cfg, "bk", cas.UploadOptions{ + LocalBackupDir: src.Root, + }); err != nil { + t.Fatalf("Upload: %v", err) + } + + // Read the uploaded root metadata.json and assert both tables are listed. + rc, err := f.GetFile(ctx, cas.MetadataJSONPath(cfg.ClusterPrefix(), "bk")) + if err != nil { + t.Fatal(err) + } + defer rc.Close() + body, _ := io.ReadAll(rc) + var bm metadata.BackupMetadata + if err := json.Unmarshal(body, &bm); err != nil { + t.Fatal(err) + } + got := map[string]bool{} + for _, tt := range bm.Tables { + got[tt.Database+"."+tt.Table] = true + } + if !got["db1.t1"] || !got["db1.t2"] { + t.Errorf("expected both db1.t1 and db1.t2 in bm.Tables; got %+v", bm.Tables) + } + + // db1.t2 is schema-only; its per-table JSON must have an empty Parts map + // (not {"default": null}), otherwise download would try to fetch a + // nonexistent per-disk archive and fail with "cas: archive missing". + rc2, err := f.GetFile(ctx, cas.TableMetaPath(cfg.ClusterPrefix(), "bk", "db1", "t2")) + if err != nil { + t.Fatal(err) + } + defer rc2.Close() + body2, _ := io.ReadAll(rc2) + var tmT2 metadata.TableMetadata + if err := json.Unmarshal(body2, &tmT2); err != nil { + t.Fatal(err) + } + if len(tmT2.Parts) != 0 { + t.Errorf("empty-table Parts should be empty map, got %v", tmT2.Parts) + } + + // Full download round-trip: proves the fix prevents "cas: archive missing". + dst := t.TempDir() + if _, err := cas.Download(ctx, f, cfg, "bk", cas.DownloadOptions{LocalBackupDir: dst}); err != nil { + t.Fatalf("Download with empty table failed: %v", err) + } +} + +// ---------------------- test helpers ---------------------- + +// countingBackend wraps a Backend and counts PutFile calls per key. +type countingBackend struct { + cas.Backend + mu sync.Mutex + puts map[string]int +} + +func newCountingBackend(b cas.Backend) *countingBackend { + return &countingBackend{Backend: b, puts: map[string]int{}} +} + +func (c *countingBackend) PutFile(ctx context.Context, key string, r io.ReadCloser, size int64) error { + c.mu.Lock() + c.puts[key]++ + c.mu.Unlock() + return c.Backend.PutFile(ctx, key, r, size) +} + +func (c *countingBackend) putsForPrefix(prefix string) int { + c.mu.Lock() + defer c.mu.Unlock() + n := 0 + for k, v := range c.puts { + if strings.HasPrefix(k, prefix) { + n += v + } + } + return n +} + +// injectingBackend wraps a Backend and lets a test fire side effects +// each time StatFile is called. +type injectingBackend struct { + cas.Backend + onStat func(key string) + statCount int32 + injected int32 +} + +func newInjectingBackend(b cas.Backend) *injectingBackend { + return &injectingBackend{Backend: b} +} + +func (i *injectingBackend) StatFile(ctx context.Context, key string) (int64, time.Time, bool, error) { + if i.onStat != nil { + i.onStat(key) + } + return i.Backend.StatFile(ctx, key) +} + +// TestUpload_SpecialCharDbTable verifies the headline blocker fix from +// the external review: a database/table name containing characters that +// TablePathEncode percent-escapes (hyphen, dot, space, etc.) must round- +// trip without double-encoding. Before the fix, planUpload stored the +// already-encoded directory name verbatim in tablePlan.DB/Table, and +// TableMetaPath/PartArchivePath then encoded again, producing keys like +// "my%252Ddb" and breaking schema restore. +func TestUpload_SpecialCharDbTable(t *testing.T) { + parts := []testfixtures.PartSpec{ + { + Disk: "default", DB: "my-db", Table: "my-tbl", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 8, HashLow: 1, HashHigh: 0}, + }, + }, + } + lb := testfixtures.Build(t, parts) + f := fakedst.New() + cfg := testCfg(100) + if _, err := cas.Upload(context.Background(), f, cfg, "bk1", cas.UploadOptions{ + LocalBackupDir: lb.Root, + }); err != nil { + t.Fatal(err) + } + + // metadata.json — Tables[].Database/Table must be the DECODED original. + rc, err := f.GetFile(context.Background(), cas.MetadataJSONPath(cfg.ClusterPrefix(), "bk1")) + if err != nil { + t.Fatal(err) + } + body, _ := io.ReadAll(rc) + rc.Close() + var bm metadata.BackupMetadata + if err := json.Unmarshal(body, &bm); err != nil { + t.Fatal(err) + } + if len(bm.Tables) != 1 { + t.Fatalf("Tables: got %d want 1", len(bm.Tables)) + } + if bm.Tables[0].Database != "my-db" { + t.Errorf("Tables[0].Database: got %q want \"my-db\" (NOT %q)", bm.Tables[0].Database, "my%2Ddb") + } + if bm.Tables[0].Table != "my-tbl" { + t.Errorf("Tables[0].Table: got %q want \"my-tbl\"", bm.Tables[0].Table) + } + + // Per-table JSON exists at the SINGLE-encoded path. + want := cas.TableMetaPath(cfg.ClusterPrefix(), "bk1", "my-db", "my-tbl") + if _, _, exists, _ := f.StatFile(context.Background(), want); !exists { + t.Errorf("per-table JSON missing at single-encoded path %s", want) + } + // Double-encoded path must NOT exist. + bad := cas.TableMetaPath(cfg.ClusterPrefix(), "bk1", "my%2Ddb", "my%2Dtbl") + if _, _, exists, _ := f.StatFile(context.Background(), bad); exists { + t.Errorf("per-table JSON wrongly exists at DOUBLE-encoded path %s", bad) + } +} + +// TestPlanPart_WithProjection_BlobsBothLevels verifies the walker treats +// .proj entries in the parent checksums.txt as nested-part directories, +// recurses into them, and emits blobs for above-threshold files at any +// depth while preserving paths in archive entries. +func TestPlanPart_WithProjection_BlobsBothLevels(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 8192, HashLow: 1, HashHigh: 2}, // above threshold → blob + {Name: "columns.txt", Size: 16, HashLow: 3, HashHigh: 4}, // below → archive + }, + Projections: []testfixtures.ProjectionSpec{{ + Name: "p1", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 4096, HashLow: 5, HashHigh: 6}, // above → blob (different hash) + {Name: "columns.txt", Size: 8, HashLow: 7, HashHigh: 8}, // below → archive + }, + AggregateHashLow: 99, AggregateHashHigh: 99, AggregateSize: 4120, + }}, + }} + src := testfixtures.Build(t, parts) + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + res, err := cas.Upload(ctx, f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}) + if err != nil { + t.Fatal(err) + } + // Two unique blobs (parent data.bin + projection data.bin); the + // p1.proj aggregate entry must NOT become a blob. + if res.UniqueBlobs != 2 { + t.Errorf("UniqueBlobs: got %d, want 2", res.UniqueBlobs) + } + cp := cfg.ClusterPrefix() + projHash := cas.Hash128{Low: 5, High: 6} + if _, _, exists, _ := f.StatFile(ctx, cas.BlobPath(cp, projHash)); !exists { + t.Error("projection data.bin blob missing in remote") + } + bogus := cas.Hash128{Low: 99, High: 99} + if _, _, exists, _ := f.StatFile(ctx, cas.BlobPath(cp, bogus)); exists { + t.Error("p1.proj aggregate must not become a blob") + } +} + +// TestPlanPart_NonChecksumFilesPreserved verifies files in the part +// directory that aren't listed in checksums.txt (columns.txt, etc.) still +// land in the per-table archive. Without the new walker they were dropped. +func TestPlanPart_NonChecksumFilesPreserved(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 16, HashLow: 1, HashHigh: 2}, // listed + }, + }} + src := testfixtures.Build(t, parts) + rogue := filepath.Join(src.Root, "shadow", "db1", "t1", "default", "all_1_1_0", "metadata_version.txt") + if err := os.WriteFile(rogue, []byte("42\n"), 0o644); err != nil { + t.Fatal(err) + } + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + if _, err := cas.Upload(ctx, f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}); err != nil { + t.Fatal(err) + } + arch := cas.PartArchivePath(cfg.ClusterPrefix(), "bk", "default", "db1", "t1") + rc, err := f.GetFile(ctx, arch) + if err != nil { + t.Fatal(err) + } + defer rc.Close() + zr, err := zstd.NewReader(rc) + if err != nil { + t.Fatal(err) + } + defer zr.Close() + tr := tar.NewReader(zr) + found := map[string]bool{} + for { + h, err := tr.Next() + if err == io.EOF { + break + } + if err != nil { + t.Fatal(err) + } + found[h.Name] = true + } + if !found["all_1_1_0/metadata_version.txt"] { + t.Errorf("metadata_version.txt not in archive; found %v", found) + } + if !found["all_1_1_0/checksums.txt"] { + t.Errorf("checksums.txt missing from archive; found %v", found) + } +} + +// TestPlanPart_NestedProjectionDedup verifies that two parts with +// identical projection content produce ONE blob ref, not two. +func TestPlanPart_NestedProjectionDedup(t *testing.T) { + mkPart := func(name string) testfixtures.PartSpec { + return testfixtures.PartSpec{ + Disk: "default", DB: "db1", Table: "t1", Name: name, + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 2048, HashLow: 11, HashHigh: 22}, + }, + Projections: []testfixtures.ProjectionSpec{{ + Name: "p1", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 4096, HashLow: 99, HashHigh: 99}, + }, + AggregateHashLow: 1, AggregateHashHigh: 1, AggregateSize: 4096, + }}, + } + } + parts := []testfixtures.PartSpec{mkPart("all_1_1_0"), mkPart("all_2_2_0")} + src := testfixtures.Build(t, parts) + f := fakedst.New() + cfg := testCfg(1024) + res, err := cas.Upload(context.Background(), f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}) + if err != nil { + t.Fatal(err) + } + if res.UniqueBlobs != 2 { + t.Errorf("UniqueBlobs: got %d, want 2 (parent data.bin + shared projection data.bin)", res.UniqueBlobs) + } +} + +// TestPlanPart_MissingListedFile_Fails verifies the walker fails when +// checksums.txt lists a file that's absent on disk. +func TestPlanPart_MissingListedFile_Fails(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 8, HashLow: 1, HashHigh: 2}, + }, + }} + src := testfixtures.Build(t, parts) + if err := os.Remove(filepath.Join(src.Root, "shadow", "db1", "t1", "default", "all_1_1_0", "data.bin")); err != nil { + t.Fatal(err) + } + f := fakedst.New() + cfg := testCfg(1024) + _, err := cas.Upload(context.Background(), f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}) + if err == nil { + t.Fatal("expected upload failure when listed file is missing on disk") + } + if !strings.Contains(err.Error(), "data.bin") { + t.Errorf("error should mention data.bin; got: %v", err) + } +} + +// TestPlanPart_HiddenFile_Warns verifies a hidden file is skipped (warn). +func TestPlanPart_HiddenFile_Warns(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 8, HashLow: 1, HashHigh: 2}, + }, + }} + src := testfixtures.Build(t, parts) + hidden := filepath.Join(src.Root, "shadow", "db1", "t1", "default", "all_1_1_0", ".hidden") + if err := os.WriteFile(hidden, []byte("nope"), 0o644); err != nil { + t.Fatal(err) + } + f := fakedst.New() + cfg := testCfg(1024) + if _, err := cas.Upload(context.Background(), f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}); err != nil { + t.Fatalf("hidden file should warn-and-skip, not fail: %v", err) + } + arch := cas.PartArchivePath(cfg.ClusterPrefix(), "bk", "default", "db1", "t1") + rc, err := f.GetFile(context.Background(), arch) + if err != nil { + t.Fatal(err) + } + defer rc.Close() + zr, err := zstd.NewReader(rc) + if err != nil { + t.Fatal(err) + } + defer zr.Close() + tr := tar.NewReader(zr) + for { + h, err := tr.Next() + if err == io.EOF { + break + } + if err != nil { + t.Fatal(err) + } + if strings.Contains(h.Name, ".hidden") { + t.Errorf("hidden file leaked into archive: %s", h.Name) + } + } +} + +// TestPlanPart_ProjEntryNotADir_Fails verifies the walker fails loudly +// when checksums.txt has a .proj entry whose target on disk is a regular +// file rather than a directory. +func TestPlanPart_ProjEntryNotADir_Fails(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 8, HashLow: 1, HashHigh: 2}, + }, + }} + src := testfixtures.Build(t, parts) + partDir := filepath.Join(src.Root, "shadow", "db1", "t1", "default", "all_1_1_0") + rogue := filepath.Join(partDir, "p1.proj") + if err := os.WriteFile(rogue, []byte("not a dir"), 0o644); err != nil { + t.Fatal(err) + } + rewritten := `checksums format version: 2 +2 files: +data.bin + size: 8 + hash: 1 2 + compressed: 0 +p1.proj + size: 9 + hash: 3 4 + compressed: 0 +` + if err := os.WriteFile(filepath.Join(partDir, "checksums.txt"), []byte(rewritten), 0o644); err != nil { + t.Fatal(err) + } + f := fakedst.New() + cfg := testCfg(1024) + _, err := cas.Upload(context.Background(), f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}) + if err == nil { + t.Fatal("expected upload failure when .proj entry is not a directory") + } + if !strings.Contains(err.Error(), "p1.proj") { + t.Errorf("error should mention p1.proj; got: %v", err) + } +} + +// TestPlanPart_OrphanProjDir_Warns verifies a .proj directory present on +// disk with no parent checksums.txt entry is skipped (warn) rather than +// fail. +func TestPlanPart_OrphanProjDir_Warns(t *testing.T) { + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 8, HashLow: 1, HashHigh: 2}, + }, + }} + src := testfixtures.Build(t, parts) + orphan := filepath.Join(src.Root, "shadow", "db1", "t1", "default", "all_1_1_0", "p2.proj") + if err := os.MkdirAll(orphan, 0o755); err != nil { + t.Fatal(err) + } + if err := os.WriteFile(filepath.Join(orphan, "data.bin"), []byte("orphan"), 0o644); err != nil { + t.Fatal(err) + } + f := fakedst.New() + cfg := testCfg(1024) + if _, err := cas.Upload(context.Background(), f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}); err != nil { + t.Fatalf("orphan .proj dir should warn-and-skip, not fail: %v", err) + } +} + +// TestUpload_RefusesIfInprogressMarkerPresent verifies that a second +// cas-upload attempt for the same backup name fails cleanly when an +// inprogress marker already exists. Without the conditional-create fix, +// the second upload would overwrite the marker and proceed. +func TestUpload_RefusesIfInprogressMarkerPresent(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + // Pre-write a marker simulating another host's upload in flight. + // Use the explicit-tool variant so the assertion below can pin both + // the tool name and the host — a tighter regression guard than the + // generic "is in progress for" substring (which would pass for any + // tool name). + if _, err := cas.WriteInProgressMarkerWithTool(ctx, f, cfg.ClusterPrefix(), "bk", "host-other", "cas-upload"); err != nil { + t.Fatalf("WriteInProgressMarkerWithTool setup: %v", err) + } + + // Build a synthetic local backup; the upload should refuse before + // touching any blob. + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "p1", + Files: []testfixtures.FileSpec{{Name: "data.bin", Size: 16, HashLow: 1, HashHigh: 2}}, + }} + src := testfixtures.Build(t, parts) + + _, err := cas.Upload(ctx, f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}) + if err == nil { + t.Fatal("expected Upload to refuse when inprogress marker is present") + } + msg := err.Error() + if !strings.Contains(msg, "cas-upload") || !strings.Contains(msg, "in progress") || !strings.Contains(msg, "host-other") { + t.Errorf("error should mention conflicting tool=cas-upload, in-progress, and host=host-other; got: %v", err) + } +} + +// TestUpload_TableFilter_WithSpecialChars proves that --tables filtering +// works against the decoded names operators actually type, not the +// shadow-directory encoded forms. +func TestUpload_TableFilter_WithSpecialChars(t *testing.T) { + parts := []testfixtures.PartSpec{ + {Disk: "default", DB: "my-db", Table: "keep-me", Name: "p1", + Files: []testfixtures.FileSpec{{Name: "columns.txt", Size: 4, HashLow: 1, HashHigh: 0}}}, + {Disk: "default", DB: "my-db", Table: "skip-me", Name: "p1", + Files: []testfixtures.FileSpec{{Name: "columns.txt", Size: 4, HashLow: 2, HashHigh: 0}}}, + } + lb := testfixtures.Build(t, parts) + f := fakedst.New() + cfg := testCfg(100) + if _, err := cas.Upload(context.Background(), f, cfg, "bk1", cas.UploadOptions{ + LocalBackupDir: lb.Root, + TableFilter: []string{"my-db.keep-me"}, + }); err != nil { + t.Fatal(err) + } + keep := cas.TableMetaPath(cfg.ClusterPrefix(), "bk1", "my-db", "keep-me") + skip := cas.TableMetaPath(cfg.ClusterPrefix(), "bk1", "my-db", "skip-me") + if _, _, exists, _ := f.StatFile(context.Background(), keep); !exists { + t.Errorf("filter dropped the matching table; %s missing", keep) + } + if _, _, exists, _ := f.StatFile(context.Background(), skip); exists { + t.Errorf("filter let a non-matching table through; %s present", skip) + } +} + +// TestUpload_LeaksNoMarkerOnRecheckError verifies that a StatFile failure +// at step 11b (the upload's own-marker re-check) cleans up the in-progress +// marker before returning the error. Without the cleanup, the marker +// persists for up to abandon_threshold (7 days) and locks out future cas-upload +// invocations of the same backup name. +func TestUpload_LeaksNoMarkerOnRecheckError(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + // Build a tiny synthetic backup so Upload reaches step 11b. + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "p1", + Files: []testfixtures.FileSpec{{Name: "data.bin", Size: 16, HashLow: 1, HashHigh: 2}}, + }} + src := testfixtures.Build(t, parts) + + // Hook fakedst to inject a StatFile error specifically on the + // in-progress marker key, AFTER the marker has been written. + markerKey := cas.InProgressMarkerPath(cfg.ClusterPrefix(), "bk") + f.SetStatHook(func(key string) (size int64, modTime time.Time, exists bool, err error, override bool) { + if key == markerKey { + return 0, time.Time{}, false, errors.New("simulated transient backend error"), true + } + return 0, time.Time{}, false, nil, false + }) + + _, err := cas.Upload(ctx, f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}) + if err == nil { + t.Fatal("expected Upload to error when StatFile on own marker fails") + } + if !strings.Contains(err.Error(), "re-check inprogress marker") { + t.Errorf("error should mention re-check; got: %v", err) + } + + // The cleanup must have run despite the error path. + // Clear the hook so we can check the actual backend state. + f.SetStatHook(nil) + _, _, exists, _ := f.StatFile(context.Background(), markerKey) + if exists { + t.Error("in-progress marker leaked: still present after step 11b error path") + } +} + +// TestUpload_AbortsIfColdListedBlobDisappearsBeforeCommit verifies that if a +// blob was skipped during upload (because cold-list said it already existed) +// but is gone by the time we reach step 11c, Upload returns an error and +// does NOT write metadata.json. +func TestUpload_AbortsIfColdListedBlobDisappearsBeforeCommit(t *testing.T) { + ctx := context.Background() + f := fakedst.New() + // Use a threshold high enough that data.bin (1024 bytes) is uploaded as a + // blob, not inlined. + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Build a local backup with one part containing a 1024-byte data.bin blob. + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "p1", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 100}, + {Name: "primary.idx", Size: 8, HashLow: 2, HashHigh: 100}, + {Name: "data.bin", Size: 1024, HashLow: 3, HashHigh: 100}, + }, + }} + lb := testfixtures.Build(t, parts) + + // First, do a successful upload to populate the backend with the blob and + // confirm the harness works. After this, metadata.json for "seed-backup" + // exists and the blob is stored in the backend. + _, err := cas.Upload(ctx, f, cfg, "seed-backup", cas.UploadOptions{LocalBackupDir: lb.Root}) + if err != nil { + t.Fatalf("seed upload failed: %v", err) + } + + // Confirm the blob exists in the backend (it was uploaded in seed phase). + blobPrefix := cp + "blob/" + var coldHitKey string + if err := f.Walk(ctx, blobPrefix, true, func(rf cas.RemoteFile) error { + coldHitKey = rf.Key + return nil + }); err != nil { + t.Fatalf("Walk to find blob key: %v", err) + } + if coldHitKey == "" { + t.Fatal("no blob found in backend after seed upload") + } + + // Install a StatHook that makes the seeded blob appear to be gone (simulating + // a concurrent prune deleting it between ColdList and step 11c). + // ColdList uses Walk (not StatFile), so it will still see the blob as present + // and upload will skip re-uploading it. The hook only fires during step 11c. + f.SetStatHook(func(key string) (int64, time.Time, bool, error, bool) { + if key == coldHitKey { + // Blob "disappeared": return exists=false, override=true. + return 0, time.Time{}, false, nil, true + } + // Pass through all other keys. + return 0, time.Time{}, false, nil, false + }) + + // Now run a second upload for "test-backup". The cold-list will see the blob + // (Walk is not hooked), uploadMissingBlobs will skip it, and step 11c will + // detect that StatFile returns not-found → abort. + _, err = cas.Upload(ctx, f, cfg, "test-backup", cas.UploadOptions{LocalBackupDir: lb.Root}) + if err == nil { + t.Fatal("expected Upload to abort when cold-listed blob disappears before commit") + } + if !strings.Contains(err.Error(), "cold-listed blob") { + t.Errorf("error should mention 'cold-listed blob'; got: %v", err) + } + if !strings.Contains(err.Error(), "disappeared before commit") { + t.Errorf("error should mention 'disappeared before commit'; got: %v", err) + } + + // metadata.json must NOT have been written. + f.SetStatHook(nil) + _, _, exists, _ := f.StatFile(ctx, cas.MetadataJSONPath(cp, "test-backup")) + if exists { + t.Error("metadata.json was written despite cold-listed blob disappearing") + } +} + +// TestUpload_AbortsIfColdListedBlobIsWrongSize verifies that if a blob was +// skipped during upload (because cold-list said it already existed) but the +// remote object has a different size from what checksums.txt recorded, Upload +// returns an error containing "size mismatch" and does NOT write metadata.json. +// This is a defense-in-depth check: content-addressed keys should never hold +// wrong-size data under normal operation, but a buggy backend or interrupted +// PUT could leave a truncated object at the key. +func TestUpload_AbortsIfColdListedBlobIsWrongSize(t *testing.T) { + ctx := context.Background() + f := fakedst.New() + // Use a threshold low enough that data.bin (1024 bytes) is treated as a blob. + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Build a local backup with one part containing a 1024-byte data.bin blob. + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "p1", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 100}, + {Name: "primary.idx", Size: 8, HashLow: 2, HashHigh: 100}, + {Name: "data.bin", Size: 1024, HashLow: 3, HashHigh: 100}, + }, + }} + lb := testfixtures.Build(t, parts) + + // Seed a first upload so the blob lands in the backend at the CAS key. + _, err := cas.Upload(ctx, f, cfg, "seed-backup", cas.UploadOptions{LocalBackupDir: lb.Root}) + if err != nil { + t.Fatalf("seed upload failed: %v", err) + } + + // Find the blob key uploaded during seeding. + blobPrefix := cp + "blob/" + var coldHitKey string + if err := f.Walk(ctx, blobPrefix, true, func(rf cas.RemoteFile) error { + coldHitKey = rf.Key + return nil + }); err != nil { + t.Fatalf("Walk to find blob key: %v", err) + } + if coldHitKey == "" { + t.Fatal("no blob found in backend after seed upload") + } + + // Install a StatHook that reports the blob as present but with a wrong + // (truncated) size. ColdList uses Walk (not StatFile), so it will still + // see the blob as present and uploadMissingBlobs will skip it. Step 11c + // calls StatFile, sees size != expected, and must abort. + f.SetStatHook(func(key string) (int64, time.Time, bool, error, bool) { + if key == coldHitKey { + // Return wrong size (1 byte instead of the real 1024 bytes). + return 1, time.Time{}, true, nil, true + } + return 0, time.Time{}, false, nil, false + }) + + // The second upload for "test-backup" should abort at step 11c. + _, err = cas.Upload(ctx, f, cfg, "test-backup", cas.UploadOptions{LocalBackupDir: lb.Root}) + if err == nil { + t.Fatal("expected Upload to abort when cold-listed blob has wrong size") + } + if !strings.Contains(err.Error(), "size mismatch") { + t.Errorf("error should mention 'size mismatch'; got: %v", err) + } + + // metadata.json must NOT have been written. + f.SetStatHook(nil) + _, _, exists, _ := f.StatFile(ctx, cas.MetadataJSONPath(cp, "test-backup")) + if exists { + t.Error("metadata.json was written despite cold-listed blob having wrong size") + } +} + +// TestUpload_WaitsForPruneMarker verifies that Upload waits for the prune +// marker to disappear (within WaitForPrune) rather than refusing immediately. +func TestUpload_WaitsForPruneMarker(t *testing.T) { + poll := 10 * time.Millisecond + cas.SetPollIntervalForTesting(&poll) + defer cas.SetPollIntervalForTesting(nil) + + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Pre-place prune marker; schedule deletion after 50ms. + if err := f.PutFile(context.Background(), cas.PruneMarkerPath(cp), + io.NopCloser(strings.NewReader("{}")), 2); err != nil { + t.Fatal(err) + } + go func() { + time.Sleep(50 * time.Millisecond) + _ = f.DeleteFile(context.Background(), cas.PruneMarkerPath(cp)) + }() + + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + _, err := cas.Upload(context.Background(), f, cfg, "bk", cas.UploadOptions{ + LocalBackupDir: lb.Root, + WaitForPrune: 5 * time.Second, + }) + if err != nil { + t.Fatalf("Upload should succeed once marker is cleared; got: %v", err) + } +} + +// TestUpload_RefusesAfterWaitTimeout verifies that Upload returns +// ErrPruneInProgress when WaitForPrune elapses and the marker remains. +func TestUpload_RefusesAfterWaitTimeout(t *testing.T) { + poll := 10 * time.Millisecond + cas.SetPollIntervalForTesting(&poll) + defer cas.SetPollIntervalForTesting(nil) + + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Pre-place prune marker permanently. + if err := f.PutFile(context.Background(), cas.PruneMarkerPath(cp), + io.NopCloser(strings.NewReader("{}")), 2); err != nil { + t.Fatal(err) + } + + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + _, err := cas.Upload(context.Background(), f, cfg, "bk", cas.UploadOptions{ + LocalBackupDir: lb.Root, + WaitForPrune: 100 * time.Millisecond, + }) + if !errors.Is(err, cas.ErrPruneInProgress) { + t.Fatalf("got err=%v; want ErrPruneInProgress", err) + } +} + +// TestUpload_AbortsIfBlobFileMutatedBeforeUpload verifies that if a local +// blob file is truncated between planning (buildExtractSet sees it as 1024 bytes +// in checksums.txt) and the actual PutFile streaming, the counting reader detects +// the size mismatch and Upload returns an error containing "size mismatch". +// metadata.json must NOT be committed. +func TestUpload_AbortsIfBlobFileMutatedBeforeUpload(t *testing.T) { + ctx := context.Background() + // Use threshold=100 so data.bin (1024 bytes) is classified as a blob. + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "p1", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 100}, + {Name: "data.bin", Size: 1024, HashLow: 3, HashHigh: 100}, + }, + }} + lb := testfixtures.Build(t, parts) + + // Truncate data.bin to 0 bytes AFTER buildExtractSet will read checksums.txt + // (which happens during planUpload) but BEFORE the blob file is actually read. + // In practice we truncate before Upload is called at all: buildExtractSet only + // calls os.Stat to verify existence, not to read content, so the plan phase + // succeeds. The size mismatch is only detected when uploadMissingBlobs opens and + // streams the file through the countingReadCloser. + blobPath := filepath.Join(lb.Root, "shadow", "db1", "t1", "default", "p1", "data.bin") + if err := os.Truncate(blobPath, 0); err != nil { + t.Fatalf("truncate: %v", err) + } + + f := fakedst.New() + _, err := cas.Upload(ctx, f, cfg, "bk", cas.UploadOptions{LocalBackupDir: lb.Root}) + if err == nil { + t.Fatal("expected Upload to abort when blob file is truncated before upload") + } + if !strings.Contains(err.Error(), "size mismatch") { + t.Errorf("error should mention 'size mismatch'; got: %v", err) + } + + // metadata.json must NOT have been written. + if _, _, exists, _ := f.StatFile(ctx, cas.MetadataJSONPath(cp, "bk")); exists { + t.Error("metadata.json was committed despite blob size mismatch") + } +} + +// TestPlanUpload_RejectsConflictingHashSize verifies that buildExtractSet (via +// planUpload / Upload) refuses to proceed when two different parts list the +// same content hash with different sizes in their checksums.txt files. This is +// malformed input that would otherwise silently produce a metadata.json +// referencing an ambiguous hash/size pair. +func TestPlanUpload_RejectsConflictingHashSize(t *testing.T) { + ctx := context.Background() + // Use threshold=100 so 1024-byte files are treated as blobs. + cfg := testCfg(100) + + // Synthesize two parts that share the same hash (Low=999, High=999) but + // declare different sizes (1024 vs 2048) — malformed but possible if + // checksums.txt is hand-crafted or corrupted. + parts := []testfixtures.PartSpec{ + { + Disk: "default", DB: "db1", Table: "t1", Name: "p1", + Files: []testfixtures.FileSpec{ + {Name: "data.bin", Size: 1024, HashLow: 999, HashHigh: 999}, + }, + }, + { + Disk: "default", DB: "db1", Table: "t1", Name: "p2", + Files: []testfixtures.FileSpec{ + // Same hash, different size — this is the malformed case. + // We write 2048 real bytes so the file exists on disk, but + // we then rewrite checksums.txt to claim size=2048 with the + // same hash as p1. + {Name: "data.bin", Size: 2048, HashLow: 999, HashHigh: 999}, + }, + }, + } + lb := testfixtures.Build(t, parts) + + f := fakedst.New() + _, err := cas.Upload(ctx, f, cfg, "bk", cas.UploadOptions{LocalBackupDir: lb.Root}) + if err == nil { + t.Fatal("expected Upload to fail when two parts have the same hash with conflicting sizes") + } + if !strings.Contains(err.Error(), "conflicting sizes") { + t.Errorf("error should mention 'conflicting sizes'; got: %v", err) + } + + // metadata.json must NOT have been written. + if _, _, exists, _ := f.StatFile(ctx, cas.MetadataJSONPath(cfg.ClusterPrefix(), "bk")); exists { + t.Error("metadata.json was committed despite conflicting hash/size") + } +} + +// TestUpload_LeaksNoMarkerOnCommitError verifies that a PutFile failure +// on metadata.json at step 12 cleans up the in-progress marker before +// returning the error. +func TestUpload_LeaksNoMarkerOnCommitError(t *testing.T) { + f := fakedst.New() + cfg := testCfg(1024) + ctx := context.Background() + + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db1", Table: "t1", Name: "p1", + Files: []testfixtures.FileSpec{{Name: "data.bin", Size: 16, HashLow: 1, HashHigh: 2}}, + }} + src := testfixtures.Build(t, parts) + + metadataKey := cas.MetadataJSONPath(cfg.ClusterPrefix(), "bk") + f.SetPutHook(func(key string) (err error, override bool) { + if key == metadataKey { + return errors.New("simulated transient backend error"), true + } + return nil, false + }) + + _, err := cas.Upload(ctx, f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}) + if err == nil { + t.Fatal("expected Upload to error when metadata.json PUT fails") + } + if !strings.Contains(err.Error(), "put metadata.json") { + t.Errorf("error should mention metadata.json; got: %v", err) + } + + // Clear the hook so the post-call StatFile reads actual backend state. + f.SetPutHook(nil) + + markerKey := cas.InProgressMarkerPath(cfg.ClusterPrefix(), "bk") + _, _, exists, _ := f.StatFile(context.Background(), markerKey) + if exists { + t.Error("in-progress marker leaked: still present after metadata.json failure") + } +} + +func TestTableFilterMatches(t *testing.T) { + cases := []struct { + name string + filter []string + db, tbl string + expected bool + }{ + {"empty filter matches all", nil, "db", "t", true}, + {"exact match", []string{"db.t"}, "db", "t", true}, + {"db wildcard", []string{"db.*"}, "db", "anything", true}, + {"db wildcard miss", []string{"db.*"}, "other", "t", false}, + {"table wildcard", []string{"db.tab*"}, "db", "table_42", true}, + {"table wildcard miss", []string{"db.tab*"}, "db", "other", false}, + {"any database any table", []string{"*.*"}, "any", "any", true}, + {"multiple patterns - any match", []string{"a.b", "c.d"}, "c", "d", true}, + {"multiple patterns - none match", []string{"a.b", "c.d"}, "x", "y", false}, + {"trimmed whitespace", []string{" db.t "}, "db", "t", true}, + {"empty pattern in list ignored", []string{"", "db.t"}, "db", "t", true}, + {"single bracket class", []string{"db.t[12]"}, "db", "t1", true}, + {"single bracket class miss", []string{"db.t[12]"}, "db", "t3", false}, + } + for _, c := range cases { + c := c + t.Run(c.name, func(t *testing.T) { + got := cas.TableFilterMatches(c.filter, c.db, c.tbl) + if got != c.expected { + t.Errorf("TableFilterMatches(%v, %q, %q) = %v; want %v", + c.filter, c.db, c.tbl, got, c.expected) + } + }) + } +} + +// TestUploadPartArchives_TempfileCleanedOnError verifies that when PutFile +// returns an error, the temporary archive file is removed and Upload returns +// a non-nil error. Uses SetPutHook on the fakedst to inject an error only +// for part-archive keys (which contain "/parts/"). +func TestUploadPartArchives_TempfileCleanedOnError(t *testing.T) { + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + f := fakedst.New() + + // Capture the os.TempDir() pattern before we run so we can check for + // leftover files after the (expected) failure. + tmpDir := os.TempDir() + + // Snapshot existing cas-archive-* files so we only count new ones. + existingBefore := casArchiveFiles(t, tmpDir) + + // Inject an error for every part-archive PUT (keys contain "/parts/"). + f.SetPutHook(func(key string) (error, bool) { + if strings.Contains(key, "/parts/") { + return errors.New("injected PutFile failure"), true + } + return nil, false + }) + + cfg := testCfg(100) + _, err := cas.Upload(context.Background(), f, cfg, "b1", cas.UploadOptions{ + LocalBackupDir: lb.Root, + }) + if err == nil { + t.Fatal("expected Upload to return an error when PutFile fails, got nil") + } + + // No leftover cas-archive-*.tar.zstd files should remain. + after := casArchiveFiles(t, tmpDir) + leaked := 0 + for _, p := range after { + found := false + for _, q := range existingBefore { + if p == q { + found = true + break + } + } + if !found { + leaked++ + t.Errorf("leaked tempfile: %s", p) + } + } + if leaked > 0 { + t.Errorf("total leaked tempfiles: %d", leaked) + } +} + +// TestUpload_Step11c_ParallelRevalidation exercises the parallel step-11c +// re-validation path with multiple cold-listed blobs and confirms that: +// - all referenced blobs survive (mark set is complete despite parallel build) +// - a disappearing blob is still detected under -race (no data races on firstErr) +func TestUpload_Step11c_ParallelRevalidation(t *testing.T) { + ctx := context.Background() + f := fakedst.New() + // threshold=100 so the 1024-byte data.bin is treated as a blob + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Build multiple parts, each with a distinct 1024-byte blob, so we get + // several cold-list hits on the second upload. + const numBlobs = 10 + var parts []testfixtures.PartSpec + for i := 0; i < numBlobs; i++ { + parts = append(parts, testfixtures.PartSpec{ + Disk: "default", DB: "db1", Table: "t1", + Name: fmt.Sprintf("p%d", i), + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: uint64(i)*10 + 1, HashHigh: uint64(i) + 100}, + {Name: "data.bin", Size: 1024, HashLow: uint64(i)*10 + 2, HashHigh: uint64(i) + 100}, + }, + }) + } + lb := testfixtures.Build(t, parts) + + // Seed upload — puts all blobs in the backend. + if _, err := cas.Upload(ctx, f, cfg, "seed", cas.UploadOptions{LocalBackupDir: lb.Root}); err != nil { + t.Fatalf("seed upload: %v", err) + } + + // Second upload: all blobs are already present → all go through step-11c + // re-validation. Must succeed with no errors. + res, err := cas.Upload(ctx, f, cfg, "incr", cas.UploadOptions{ + LocalBackupDir: lb.Root, + Parallelism: 4, // explicitly low to exercise pool boundary + }) + if err != nil { + t.Fatalf("incremental upload: %v", err) + } + if res.BlobsUploaded != 0 { + t.Errorf("BlobsUploaded: got %d want 0 (all blobs already present)", res.BlobsUploaded) + } + if res.BlobsReused != numBlobs { + t.Errorf("BlobsReused: got %d want %d", res.BlobsReused, numBlobs) + } + + // Now verify that a disappearing blob is still detected. + // Install a hook that makes ONE blob appear absent. + blobPrefix := cp + "blob/" + var anyBlobKey string + _ = f.Walk(ctx, blobPrefix, true, func(rf cas.RemoteFile) error { + if anyBlobKey == "" { + anyBlobKey = rf.Key + } + return nil + }) + if anyBlobKey == "" { + t.Fatal("no blob keys found after seed upload") + } + f.SetStatHook(func(key string) (int64, time.Time, bool, error, bool) { + if key == anyBlobKey { + return 0, time.Time{}, false, nil, true // appears gone + } + return 0, time.Time{}, false, nil, false + }) + + _, err = cas.Upload(ctx, f, cfg, "incr2", cas.UploadOptions{ + LocalBackupDir: lb.Root, + Parallelism: 4, + }) + if err == nil { + t.Fatal("expected Upload to abort when cold-listed blob disappears") + } + if !strings.Contains(err.Error(), "cold-listed blob") { + t.Errorf("error should mention 'cold-listed blob'; got: %v", err) + } +} + +// casArchiveFiles returns all cas-archive-*.tar.zstd paths in dir. +func casArchiveFiles(t *testing.T, dir string) []string { + t.Helper() + entries, err := os.ReadDir(dir) + if err != nil { + t.Fatalf("ReadDir(%s): %v", dir, err) + } + var out []string + for _, e := range entries { + if strings.HasPrefix(e.Name(), "cas-archive-") && strings.HasSuffix(e.Name(), ".tar.zstd") { + out = append(out, filepath.Join(dir, e.Name())) + } + } + return out +} + +// TestUpload_ErrorPathCleansInprogressMarker verifies the single-defer +// refactor (#3): when Upload fails partway through (after the inprogress +// marker is written), the deferred cleanup removes the marker even though +// no explicit cleanup call exists on that path. +func TestUpload_ErrorPathCleansInprogressMarker(t *testing.T) { + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Inject a failure for the archive PutFile to trigger an error in step 9 + // (uploadPartArchives). The marker was written in step 5; without the defer + // it would strand. Archive keys contain "/parts/" in their path. + archivePutFailed := false + f.SetPutHook(func(key string) (error, bool) { + if strings.Contains(key, "/parts/") && !archivePutFailed { + archivePutFailed = true + return fmt.Errorf("injected archive PUT failure"), true + } + return nil, false + }) + + _, err := cas.Upload(context.Background(), f, cfg, "b1", cas.UploadOptions{ + LocalBackupDir: lb.Root, + }) + if err == nil { + t.Fatal("expected Upload to fail due to injected error") + } + + // The inprogress marker must be absent — the deferred cleanup ran. + markerKey := cas.InProgressMarkerPath(cp, "b1") + if _, _, exists, statErr := f.StatFile(context.Background(), markerKey); statErr != nil { + t.Fatalf("StatFile(marker): %v", statErr) + } else if exists { + t.Error("inprogress marker still present after Upload error — defer cleanup did not run") + } +} + +// ctxRespectingBackend wraps fakedst.Fake and makes Walk fail with +// context.Canceled when the context is already cancelled. This lets us +// test that a pre-cancelled ctx causes Upload to fail, which in turn +// exercises the deferred cleanup path. +type ctxRespectingBackend struct { + cas.Backend +} + +func (c *ctxRespectingBackend) Walk(ctx context.Context, prefix string, recursive bool, fn func(cas.RemoteFile) error) error { + if err := ctx.Err(); err != nil { + return err + } + return c.Backend.Walk(ctx, prefix, recursive, fn) +} + +// TestUpload_CancelledContextStillReleasesMarker verifies detached-context +// cleanup (#2): when the operation context is cancelled before Upload returns, +// the deferred cleanup uses a fresh context.Background() and still deletes +// the inprogress marker. +func TestUpload_CancelledContextStillReleasesMarker(t *testing.T) { + lb := testfixtures.Build(t, []testfixtures.PartSpec{smallPart("p1", 0)}) + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Use a pre-cancelled context. The ctxRespectingBackend translates it into + // a Walk error, which ColdList surfaces to Upload, which returns an error + // before committing — giving the deferred cleanup a chance to run. + ctx, cancel := context.WithCancel(context.Background()) + cancel() // cancel immediately + + _, err := cas.Upload(ctx, &ctxRespectingBackend{f}, cfg, "b1", cas.UploadOptions{ + LocalBackupDir: lb.Root, + }) + if err == nil { + t.Fatal("expected Upload to fail with cancelled context") + } + + // The inprogress marker must be absent despite the operation ctx being + // cancelled — the deferred cleanup used its own context.Background()-derived ctx. + markerKey := cas.InProgressMarkerPath(cp, "b1") + if _, _, exists, statErr := f.StatFile(context.Background(), markerKey); statErr != nil { + t.Fatalf("StatFile(marker): %v", statErr) + } else if exists { + t.Error("inprogress marker still present after cancelled-ctx Upload — detached cleanup context not working") + } +} + +// TestUpload_RejectsNameCollidingWithCASPrefix verifies that a backup name +// equal to the CAS root-prefix segment (e.g. "cas" when root_prefix="cas/") +// is rejected at upload time with a descriptive error. This prevents operators +// from accidentally creating a v1 backup that would be silently excluded by +// BackupList skip-prefix filtering once CAS is enabled. +func TestUpload_RejectsNameCollidingWithCASPrefix(t *testing.T) { + cfg := testCfg(100) // root_prefix="cas/", so "cas" collides + + t.Run("exact_collision_rejected", func(t *testing.T) { + f := fakedst.New() + _, err := cas.Upload(context.Background(), f, cfg, "cas", cas.UploadOptions{ + LocalBackupDir: t.TempDir(), + }) + if err == nil { + t.Fatal("Upload with colliding name: expected error, got nil") + } + if !strings.Contains(err.Error(), "collides") { + t.Errorf("error should mention collision, got: %v", err) + } + }) + + t.Run("prefix_match_NOT_rejected", func(t *testing.T) { + // "casematch" starts with "cas" but is not equal to it. The collision + // guard must be exact-match, not prefix-match — otherwise it would + // over-reject legitimate names. Upload may still error for other + // reasons (no local backup contents) but NOT for collision. + f := fakedst.New() + _, err := cas.Upload(context.Background(), f, cfg, "casematch", cas.UploadOptions{ + LocalBackupDir: t.TempDir(), + }) + if err != nil && strings.Contains(err.Error(), "collides") { + t.Errorf("name 'casematch' must NOT trigger collision error; got: %v", err) + } + }) +} diff --git a/pkg/cas/validate.go b/pkg/cas/validate.go new file mode 100644 index 00000000..5ec5f517 --- /dev/null +++ b/pkg/cas/validate.go @@ -0,0 +1,101 @@ +package cas + +import ( + "context" + "encoding/json" + "fmt" + "io" + "regexp" + "strings" + + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" +) + +// nameRe permits printable ASCII identifiers with conservative punctuation. +// Excludes anything that could be misinterpreted as a path component. +var nameRe = regexp.MustCompile(`^[A-Za-z0-9._\-+:]+$`) + +// NameCollidesWithCASPrefix returns true if name equals any configured CAS +// skip-prefix (stripped of its trailing slash). This prevents creating a v1 +// backup whose name would later disappear under v1 retention after CAS is +// enabled (BackupList skips entries whose name matches a CAS skip-prefix). +// The check is the same in both the v1 Upload path and the CAS Upload path. +// +// Example: with default RootPrefix "cas/", SkipPrefixes returns ["cas/"], +// so a v1 backup named "cas" would be silently skipped by BackupList. +// This function rejects that name at upload time instead. +func NameCollidesWithCASPrefix(name string, casCfg Config) bool { + for _, p := range casCfg.SkipPrefixes() { + if name == strings.TrimSuffix(p, "/") { + return true + } + } + return false +} + +// validateName enforces backup-name rules: 1..128 chars, character set +// [A-Za-z0-9._\-+:], and not a dot-only string ("." / ".." / "..." etc.). +// Dot-only names pass the regex but are nonsensical and could enable subtle +// path-shape collisions in future tooling. +func validateName(name string) error { + if len(name) == 0 || len(name) > 128 { + return ErrInvalidBackupName + } + if !nameRe.MatchString(name) { + return ErrInvalidBackupName + } + if strings.Trim(name, ".") == "" { + return ErrInvalidBackupName + } + return nil +} + +// ValidateBackup loads cas//metadata//metadata.json, verifies +// it is a CAS backup belonging to this cluster, and that its layout +// parameters are within supported ranges. Returns the parsed metadata so +// callers can use the persisted parameters (InlineThreshold, LayoutVersion) +// for downstream operations. +// +// This is the single precondition function used by every CAS command. See +// docs/cas-design.md §6.2.1 (rationale for persisting + reading layout +// parameters from metadata, not from current config). +func ValidateBackup(ctx context.Context, b Backend, cfg Config, name string) (*metadata.BackupMetadata, error) { + if err := validateName(name); err != nil { + return nil, err + } + + cp := cfg.ClusterPrefix() + rc, err := b.GetFile(ctx, MetadataJSONPath(cp, name)) + if err != nil { + return nil, fmt.Errorf("%w: %v", ErrMissingMetadata, err) + } + defer rc.Close() + + raw, err := io.ReadAll(rc) + if err != nil { + return nil, fmt.Errorf("cas: read metadata.json: %w", err) + } + + var bm metadata.BackupMetadata + if err := json.Unmarshal(raw, &bm); err != nil { + return nil, fmt.Errorf("cas: parse metadata.json: %w", err) + } + + if bm.CAS == nil { + return nil, ErrV1Backup + } + + if bm.CAS.LayoutVersion > LayoutVersion { + return nil, fmt.Errorf("%w: backup=%d max-supported=%d", ErrUnsupportedLayoutVersion, bm.CAS.LayoutVersion, LayoutVersion) + } + + if bm.CAS.InlineThreshold == 0 || bm.CAS.InlineThreshold > MaxInline { + return nil, fmt.Errorf("cas: persisted inline_threshold out of range: %d", bm.CAS.InlineThreshold) + } + + if bm.CAS.ClusterID != cfg.ClusterID { + return nil, fmt.Errorf("%w: backup=%q config=%q", ErrClusterIDMismatch, bm.CAS.ClusterID, cfg.ClusterID) + } + + return &bm, nil +} diff --git a/pkg/cas/validate_test.go b/pkg/cas/validate_test.go new file mode 100644 index 00000000..ff762cfe --- /dev/null +++ b/pkg/cas/validate_test.go @@ -0,0 +1,155 @@ +package cas_test + +import ( + "context" + "encoding/json" + "errors" + "io" + "strings" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" +) + +func cfg(t *testing.T) cas.Config { + t.Helper() + c := cas.DefaultConfig() + c.Enabled = true + c.ClusterID = "c1" + return c +} + +func putMetadata(t *testing.T, f *fakedst.Fake, cp, name string, bm metadata.BackupMetadata) { + t.Helper() + raw, err := json.Marshal(bm) + if err != nil { + t.Fatal(err) + } + if err := f.PutFile(context.Background(), cas.MetadataJSONPath(cp, name), + io.NopCloser(strings.NewReader(string(raw))), int64(len(raw))); err != nil { + t.Fatal(err) + } +} + +func validBM() metadata.BackupMetadata { + return metadata.BackupMetadata{ + BackupName: "bk1", + CAS: &metadata.CASBackupParams{ + LayoutVersion: cas.LayoutVersion, + InlineThreshold: 524288, + ClusterID: "c1", + }, + } +} + +func TestValidateBackup_HappyPath(t *testing.T) { + f := fakedst.New() + ctx := context.Background() + c := cfg(t) + putMetadata(t, f, c.ClusterPrefix(), "bk1", validBM()) + bm, err := cas.ValidateBackup(ctx, f, c, "bk1") + if err != nil { + t.Fatal(err) + } + if bm.CAS == nil || bm.CAS.ClusterID != "c1" { + t.Fatalf("wrong meta: %+v", bm) + } +} + +func TestValidateBackup_RejectsBadNames(t *testing.T) { + f := fakedst.New() + c := cfg(t) + ctx := context.Background() + for _, bad := range []string{"", strings.Repeat("a", 129), "../sneaky", "with space", "name/slash", "tab\tname", ".", "..", "..."} { + if _, err := cas.ValidateBackup(ctx, f, c, bad); !errors.Is(err, cas.ErrInvalidBackupName) { + t.Errorf("name=%q: want ErrInvalidBackupName, got %v", bad, err) + } + } +} + +func TestValidateBackup_MissingMetadata(t *testing.T) { + f := fakedst.New() + c := cfg(t) + ctx := context.Background() + _, err := cas.ValidateBackup(ctx, f, c, "absent") + if !errors.Is(err, cas.ErrMissingMetadata) { + t.Fatalf("want ErrMissingMetadata, got %v", err) + } +} + +func TestValidateBackup_V1Backup(t *testing.T) { + f := fakedst.New() + c := cfg(t) + ctx := context.Background() + bm := validBM() + bm.CAS = nil // v1 backup + putMetadata(t, f, c.ClusterPrefix(), "bk1", bm) + _, err := cas.ValidateBackup(ctx, f, c, "bk1") + if !errors.Is(err, cas.ErrV1Backup) { + t.Fatalf("want ErrV1Backup, got %v", err) + } +} + +func TestValidateBackup_UnsupportedLayoutVersion(t *testing.T) { + f := fakedst.New() + c := cfg(t) + ctx := context.Background() + bm := validBM() + bm.CAS.LayoutVersion = cas.LayoutVersion + 1 + putMetadata(t, f, c.ClusterPrefix(), "bk1", bm) + _, err := cas.ValidateBackup(ctx, f, c, "bk1") + if !errors.Is(err, cas.ErrUnsupportedLayoutVersion) { + t.Fatalf("got %v", err) + } +} + +func TestValidateBackup_BadInlineThreshold(t *testing.T) { + f := fakedst.New() + c := cfg(t) + ctx := context.Background() + bm := validBM() + bm.CAS.InlineThreshold = 0 + putMetadata(t, f, c.ClusterPrefix(), "z", bm) + if _, err := cas.ValidateBackup(ctx, f, c, "z"); err == nil { + t.Fatal("zero must fail") + } + + bm.CAS.InlineThreshold = cas.MaxInline + 1 + putMetadata(t, f, c.ClusterPrefix(), "z", bm) + if _, err := cas.ValidateBackup(ctx, f, c, "z"); err == nil { + t.Fatal("> MaxInline must fail") + } +} + +func TestValidateBackup_ClusterIDMismatch(t *testing.T) { + f := fakedst.New() + c := cfg(t) + ctx := context.Background() + bm := validBM() + bm.CAS.ClusterID = "other-cluster" + putMetadata(t, f, c.ClusterPrefix(), "bk1", bm) + _, err := cas.ValidateBackup(ctx, f, c, "bk1") + if !errors.Is(err, cas.ErrClusterIDMismatch) { + t.Fatalf("want ErrClusterIDMismatch, got %v", err) + } +} + +func TestValidateBackup_UnparseableJSON(t *testing.T) { + f := fakedst.New() + c := cfg(t) + ctx := context.Background() + cp := c.ClusterPrefix() + if err := f.PutFile(ctx, cas.MetadataJSONPath(cp, "bk1"), + io.NopCloser(strings.NewReader("not json")), 8); err != nil { + t.Fatal(err) + } + _, err := cas.ValidateBackup(ctx, f, c, "bk1") + if err == nil { + t.Fatal("must fail") + } + if !strings.Contains(err.Error(), "parse metadata.json") { + t.Errorf("error should mention parse step: %v", err) + } +} diff --git a/pkg/cas/verify.go b/pkg/cas/verify.go new file mode 100644 index 00000000..b4d2137f --- /dev/null +++ b/pkg/cas/verify.go @@ -0,0 +1,290 @@ +package cas + +import ( + "archive/tar" + "bytes" + "context" + "encoding/json" + "errors" + "fmt" + "io" + "sort" + "strings" + "sync" + + "github.com/Altinity/clickhouse-backup/v2/pkg/checksumstxt" + "github.com/Altinity/clickhouse-backup/v2/pkg/metadata" + "github.com/klauspost/compress/zstd" +) + +// verify.go — streaming archive extractor mirrors prune.go::collectRefsFromArchive. +// All per-table archives are streamed directly from GetFile without buffering +// the entire archive in memory first. + +// VerifyOptions configures a Verify run. +type VerifyOptions struct { + JSON bool + Parallelism int // for HEADs; default 32 +} + +// VerifyFailure describes a single blob that failed verification. +type VerifyFailure struct { + Kind string `json:"kind"` // "stat_error" | "missing" | "size_mismatch" + Path string `json:"path"` + Want uint64 `json:"want"` + Got int64 `json:"got,omitempty"` // present for size_mismatch + Err string `json:"err,omitempty"` // present for stat_error +} + +// VerifyResult summarises what a Verify run found. +type VerifyResult struct { + BackupName string + BlobsChecked int + Failures []VerifyFailure +} + +// expectedBlob is one (path, expected-size) pair accumulated from checksums.txt entries. +type expectedBlob struct { + Path string + Size uint64 +} + +// Verify performs a HEAD + size check on every blob referenced by the backup. +// Writes either human-readable lines (default) or line-delimited JSON +// (opts.JSON) to out as failures are discovered. Failures are written to out +// after all HEADs have completed (deterministic, sorted by path). Returns the +// structured result; if Failures is non-empty, also returns ErrVerifyFailures +// so callers (and the CLI) can detect the failure cleanly. +func Verify(ctx context.Context, b Backend, cfg Config, name string, opts VerifyOptions, out io.Writer) (*VerifyResult, error) { + if err := cfg.Validate(); err != nil { + return nil, fmt.Errorf("cas: verify: invalid config: %w", err) + } + bm, err := ValidateBackup(ctx, b, cfg, name) + if err != nil { + return nil, err + } + cp := cfg.ClusterPrefix() + + blobs, err := buildVerifySet(ctx, b, cp, name, bm) + if err != nil { + return nil, fmt.Errorf("cas-verify: build set: %w", err) + } + + parallelism := opts.Parallelism + if parallelism <= 0 { + parallelism = 32 + } + failures := headAllInParallel(ctx, b, blobs, parallelism, opts.JSON, out) + + res := &VerifyResult{BackupName: name, BlobsChecked: len(blobs), Failures: failures} + if len(failures) > 0 { + return res, ErrVerifyFailures + } + return res, nil +} + +// buildVerifySet downloads each per-table archive, extracts every +// checksums.txt, and accumulates expected blobs. +func buildVerifySet(ctx context.Context, b Backend, cp, name string, bm *metadata.BackupMetadata) ([]expectedBlob, error) { + // De-duplicate blobs across tables — the same blob hash may be + // referenced from multiple tables. + seen := make(map[string]uint64) + + for _, tt := range bm.Tables { + // Load per-table metadata to learn which disks this table lives on. + tmRC, err := b.GetFile(ctx, TableMetaPath(cp, name, tt.Database, tt.Table)) + if err != nil { + return nil, fmt.Errorf("cas-verify: get table metadata %s.%s: %w", tt.Database, tt.Table, err) + } + raw, err := io.ReadAll(tmRC) + _ = tmRC.Close() + if err != nil { + return nil, fmt.Errorf("cas-verify: read table metadata %s.%s: %w", tt.Database, tt.Table, err) + } + var tm metadata.TableMetadata + if err := json.Unmarshal(raw, &tm); err != nil { + return nil, fmt.Errorf("cas-verify: parse table metadata %s.%s: %w", tt.Database, tt.Table, err) + } + + for disk := range tm.Parts { + if err := validateRemoteFilesystemName("disk", disk); err != nil { + return nil, fmt.Errorf("cas-verify: %w", err) + } + archPath := PartArchivePath(cp, name, disk, tt.Database, tt.Table) + archRC, err := b.GetFile(ctx, archPath) + if err != nil { + return nil, fmt.Errorf("cas-verify: get archive %s: %w", archPath, err) + } + extractErr := extractBlobsFromArchive(archRC, cp, bm.CAS.InlineThreshold, seen) + _ = archRC.Close() + if extractErr != nil { + return nil, fmt.Errorf("cas-verify: extract blobs from %s: %w", archPath, extractErr) + } + } + } + + blobs := make([]expectedBlob, 0, len(seen)) + for path, size := range seen { + blobs = append(blobs, expectedBlob{Path: path, Size: size}) + } + // Sort for determinism. + sortExpectedBlobs(blobs) + return blobs, nil +} + +// extractBlobsFromArchive streams through a tar.zstd archive, finds every +// entry whose name ends in "/checksums.txt", parses it, and accumulates +// blob (path, size) pairs in seen. +func extractBlobsFromArchive(r io.Reader, cp string, threshold uint64, seen map[string]uint64) error { + zr, err := zstd.NewReader(r) + if err != nil { + return fmt.Errorf("zstd reader: %w", err) + } + defer zr.Close() + + tr := tar.NewReader(zr) + for { + hdr, err := tr.Next() + if errors.Is(err, io.EOF) { + return nil + } + if err != nil { + return err + } + if hdr.Typeflag != tar.TypeReg { + continue + } + // Only process checksums.txt entries. + if !strings.HasSuffix(hdr.Name, "/checksums.txt") && hdr.Name != "checksums.txt" { + // Still must drain the entry. + _, _ = io.Copy(io.Discard, tr) + continue + } + + data, err := io.ReadAll(tr) + if err != nil { + return fmt.Errorf("read %s: %w", hdr.Name, err) + } + + parsed, err := checksumstxt.Parse(bytes.NewReader(data)) + if err != nil { + // Malformed checksums.txt in archive — treat as error. + return fmt.Errorf("parse %s: %w", hdr.Name, err) + } + + for _, c := range parsed.Files { + if c.FileSize <= threshold { + // Inline — no blob to check. + continue + } + h := Hash128{Low: c.FileHash.Low, High: c.FileHash.High} + blobKey := BlobPath(cp, h) + if existing, ok := seen[blobKey]; !ok { + seen[blobKey] = c.FileSize + } else if existing != c.FileSize { + // Two checksums.txt entries claim different sizes for the + // same blob hash. Use the first one seen; the inconsistency + // would be caught by the upload logic. + _ = existing + } + } + } +} + +// sortExpectedBlobs sorts blobs by Path for deterministic output. +func sortExpectedBlobs(blobs []expectedBlob) { + sort.Slice(blobs, func(i, j int) bool { + return blobs[i].Path < blobs[j].Path + }) +} + +// headAllInParallel performs HEAD (StatFile) on every blob and returns failures. +// Each failure is also written to out (text or JSON per asJSON) after all +// checks complete. Output is written in sorted-path order for determinism. +func headAllInParallel(ctx context.Context, b Backend, blobs []expectedBlob, parallelism int, asJSON bool, out io.Writer) []VerifyFailure { + type result struct { + blob expectedBlob + failure *VerifyFailure + } + + results := make([]result, len(blobs)) + for i, bl := range blobs { + results[i].blob = bl + } + + sem := make(chan struct{}, parallelism) + var wg sync.WaitGroup + + for i := range results { + i := i + wg.Add(1) + go func() { + defer wg.Done() + sem <- struct{}{} + defer func() { <-sem }() + + bl := results[i].blob + size, _, exists, err := b.StatFile(ctx, bl.Path) + if err != nil { + results[i].failure = &VerifyFailure{ + Kind: "stat_error", + Path: bl.Path, + Want: bl.Size, + Err: err.Error(), + } + return + } + if !exists { + results[i].failure = &VerifyFailure{ + Kind: "missing", + Path: bl.Path, + Want: bl.Size, + } + return + } + if uint64(size) != bl.Size { + results[i].failure = &VerifyFailure{ + Kind: "size_mismatch", + Path: bl.Path, + Want: bl.Size, + Got: size, + } + } + }() + } + wg.Wait() + + // Collect failures (already in sorted-path order since blobs were sorted). + var failures []VerifyFailure + for _, r := range results { + if r.failure == nil { + continue + } + failures = append(failures, *r.failure) + if out != nil { + writeVerifyFailure(out, *r.failure, asJSON) + } + } + return failures +} + +// writeVerifyFailure writes one failure to out in the requested format. +func writeVerifyFailure(out io.Writer, f VerifyFailure, asJSON bool) { + if asJSON { + data, err := json.Marshal(f) + if err == nil { + _, _ = fmt.Fprintf(out, "%s\n", data) + } + return + } + switch f.Kind { + case "stat_error": + _, _ = fmt.Fprintf(out, "STATERR %s (want %d bytes): %s\n", f.Path, f.Want, f.Err) + case "missing": + _, _ = fmt.Fprintf(out, "MISSING %s (want %d bytes)\n", f.Path, f.Want) + case "size_mismatch": + _, _ = fmt.Fprintf(out, "MISMATCH %s (want %d got %d bytes)\n", f.Path, f.Want, f.Got) + default: + _, _ = fmt.Fprintf(out, "%s %s\n", f.Kind, f.Path) + } +} diff --git a/pkg/cas/verify_test.go b/pkg/cas/verify_test.go new file mode 100644 index 00000000..e25483aa --- /dev/null +++ b/pkg/cas/verify_test.go @@ -0,0 +1,266 @@ +package cas_test + +import ( + "bytes" + "context" + "encoding/json" + "errors" + "io" + "strings" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/testfixtures" +) + +// uploadForVerify is a helper that builds a local backup with a blob file and +// uploads it via cas.Upload, returning the backend and the config. +func uploadForVerify(t *testing.T) (*fakedst.Fake, cas.Config) { + t.Helper() + parts := []testfixtures.PartSpec{ + { + Disk: "default", DB: "db1", Table: "t1", Name: "p1", + Files: []testfixtures.FileSpec{ + {Name: "columns.txt", Size: 23, HashLow: 1, HashHigh: 1}, + {Name: "data.bin", Size: 1024, HashLow: 999, HashHigh: 999}, + }, + }, + } + lb := testfixtures.Build(t, parts) + f := fakedst.New() + cfg := testCfg(100) // threshold=100 → data.bin (1024) becomes a blob + _, err := cas.Upload(context.Background(), f, cfg, "b1", cas.UploadOptions{ + LocalBackupDir: lb.Root, + }) + if err != nil { + t.Fatalf("Upload: %v", err) + } + return f, cfg +} + +func TestVerify_AllPresent(t *testing.T) { + f, cfg := uploadForVerify(t) + var out bytes.Buffer + res, err := cas.Verify(context.Background(), f, cfg, "b1", cas.VerifyOptions{}, &out) + if err != nil { + t.Fatalf("Verify returned err=%v; want nil", err) + } + if res == nil { + t.Fatal("Verify returned nil result") + } + if len(res.Failures) != 0 { + t.Errorf("Failures: got %d want 0; %v", len(res.Failures), res.Failures) + } + if res.BlobsChecked != 1 { + t.Errorf("BlobsChecked: got %d want 1", res.BlobsChecked) + } + if res.BackupName != "b1" { + t.Errorf("BackupName: got %q want b1", res.BackupName) + } + if out.Len() != 0 { + t.Errorf("unexpected output: %q", out.String()) + } +} + +func TestVerify_DetectsMissingBlob(t *testing.T) { + f, cfg := uploadForVerify(t) + cp := cfg.ClusterPrefix() + + // Walk the blob/ prefix to find the blob key. + var blobKey string + _ = f.Walk(context.Background(), cp+"blob/", true, func(rf cas.RemoteFile) error { + blobKey = rf.Key + return nil + }) + if blobKey == "" { + t.Fatal("no blob found after upload") + } + + // Delete the blob. + if err := f.DeleteFile(context.Background(), blobKey); err != nil { + t.Fatalf("DeleteFile: %v", err) + } + + var out bytes.Buffer + res, err := cas.Verify(context.Background(), f, cfg, "b1", cas.VerifyOptions{}, &out) + if !errors.Is(err, cas.ErrVerifyFailures) { + t.Fatalf("Verify err=%v; want ErrVerifyFailures", err) + } + if res == nil { + t.Fatal("Verify returned nil result alongside error") + } + if len(res.Failures) != 1 { + t.Fatalf("Failures: got %d want 1; %v", len(res.Failures), res.Failures) + } + if res.Failures[0].Kind != "missing" { + t.Errorf("Failure.Kind: got %q want missing", res.Failures[0].Kind) + } + if res.Failures[0].Path != blobKey { + t.Errorf("Failure.Path: got %q want %q", res.Failures[0].Path, blobKey) + } + if !strings.Contains(out.String(), "MISSING") { + t.Errorf("expected MISSING in output; got %q", out.String()) + } +} + +func TestVerify_DetectsSizeMismatch(t *testing.T) { + f, cfg := uploadForVerify(t) + cp := cfg.ClusterPrefix() + + // Find the blob key. + var blobKey string + _ = f.Walk(context.Background(), cp+"blob/", true, func(rf cas.RemoteFile) error { + blobKey = rf.Key + return nil + }) + if blobKey == "" { + t.Fatal("no blob found after upload") + } + + // Overwrite the blob with wrong-sized data (only 10 bytes). + wrongData := []byte("tooshort!!") + if err := f.PutFile(context.Background(), blobKey, + io.NopCloser(bytes.NewReader(wrongData)), int64(len(wrongData))); err != nil { + t.Fatalf("PutFile (overwrite): %v", err) + } + + var out bytes.Buffer + res, err := cas.Verify(context.Background(), f, cfg, "b1", cas.VerifyOptions{}, &out) + if !errors.Is(err, cas.ErrVerifyFailures) { + t.Fatalf("Verify err=%v; want ErrVerifyFailures", err) + } + if len(res.Failures) != 1 { + t.Fatalf("Failures: got %d want 1; %v", len(res.Failures), res.Failures) + } + if res.Failures[0].Kind != "size_mismatch" { + t.Errorf("Failure.Kind: got %q want size_mismatch", res.Failures[0].Kind) + } + if res.Failures[0].Want != 1024 { + t.Errorf("Failure.Want: got %d want 1024", res.Failures[0].Want) + } + if res.Failures[0].Got != int64(len(wrongData)) { + t.Errorf("Failure.Got: got %d want %d", res.Failures[0].Got, len(wrongData)) + } + if !strings.Contains(out.String(), "MISMATCH") { + t.Errorf("expected MISMATCH in output; got %q", out.String()) + } +} + +func TestVerify_JSONOutput(t *testing.T) { + f, cfg := uploadForVerify(t) + cp := cfg.ClusterPrefix() + + // Find and delete the blob. + var blobKey string + _ = f.Walk(context.Background(), cp+"blob/", true, func(rf cas.RemoteFile) error { + blobKey = rf.Key + return nil + }) + if blobKey == "" { + t.Fatal("no blob found after upload") + } + if err := f.DeleteFile(context.Background(), blobKey); err != nil { + t.Fatalf("DeleteFile: %v", err) + } + + var out bytes.Buffer + res, err := cas.Verify(context.Background(), f, cfg, "b1", cas.VerifyOptions{JSON: true}, &out) + if !errors.Is(err, cas.ErrVerifyFailures) { + t.Fatalf("Verify err=%v; want ErrVerifyFailures", err) + } + if len(res.Failures) != 1 { + t.Fatalf("Failures: got %d want 1", len(res.Failures)) + } + + // Parse the JSON output line. + line := strings.TrimSpace(out.String()) + var vf cas.VerifyFailure + if err := json.Unmarshal([]byte(line), &vf); err != nil { + t.Fatalf("json.Unmarshal output line %q: %v", line, err) + } + if vf.Kind != "missing" { + t.Errorf("JSON Kind: got %q want missing", vf.Kind) + } + if vf.Path != blobKey { + t.Errorf("JSON Path: got %q want %q", vf.Path, blobKey) + } + if vf.Want == 0 { + t.Error("JSON Want: got 0, want non-zero") + } +} + +// stallingBackend wraps another Backend and forces StatFile to return a +// non-nil error for one specific key — simulating a transient network +// hiccup. All other methods delegate. +type stallingBackend struct { + cas.Backend + failKey string +} + +func (s *stallingBackend) StatFile(ctx context.Context, key string) (int64, time.Time, bool, error) { + if key == s.failKey { + return 0, time.Time{}, false, errors.New("simulated network error") + } + return s.Backend.StatFile(ctx, key) +} + +func TestVerify_StatErrorIsNotMissing(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + ctx := context.Background() + parts := []testfixtures.PartSpec{{ + Disk: "default", DB: "db", Table: "t", Name: "all_1_1_0", + Files: []testfixtures.FileSpec{ + // Above-threshold so it goes to the blob store (testCfg threshold = 100). + {Name: "data.bin", Size: 2048, HashLow: 7, HashHigh: 7}, + {Name: "columns.txt", Size: 8, HashLow: 8, HashHigh: 8}, + }, + }} + src := testfixtures.Build(t, parts) + if _, err := cas.Upload(ctx, f, cfg, "bk", cas.UploadOptions{LocalBackupDir: src.Root}); err != nil { + t.Fatal(err) + } + + target := cas.BlobPath(cfg.ClusterPrefix(), cas.Hash128{Low: 7, High: 7}) + sb := &stallingBackend{Backend: f, failKey: target} + + var out bytes.Buffer + res, err := cas.Verify(ctx, sb, cfg, "bk", cas.VerifyOptions{}, &out) + if !errors.Is(err, cas.ErrVerifyFailures) { + t.Fatalf("expected ErrVerifyFailures, got %v", err) + } + if len(res.Failures) != 1 { + t.Fatalf("got %d failures, want 1: %+v", len(res.Failures), res.Failures) + } + f0 := res.Failures[0] + if f0.Kind != "stat_error" { + t.Errorf("Kind: got %q want \"stat_error\" (NOT \"missing\" — that would mislead operators)", f0.Kind) + } + if f0.Path != target { + t.Errorf("Path: got %q want %q", f0.Path, target) + } + if f0.Err == "" { + t.Error("Err: should carry the underlying StatFile error message") + } +} + +func TestVerify_RefusesV1Backup(t *testing.T) { + f := fakedst.New() + cfg := testCfg(100) + cp := cfg.ClusterPrefix() + + // Write a metadata.json without a CAS field (v1 backup). + v1meta := `{"backup_name":"b1","tables":[],"data_format":"directory"}` + if err := f.PutFile(context.Background(), cas.MetadataJSONPath(cp, "b1"), + io.NopCloser(strings.NewReader(v1meta)), int64(len(v1meta))); err != nil { + t.Fatalf("PutFile: %v", err) + } + + var out bytes.Buffer + _, err := cas.Verify(context.Background(), f, cfg, "b1", cas.VerifyOptions{}, &out) + if !errors.Is(err, cas.ErrV1Backup) { + t.Fatalf("Verify err=%v; want ErrV1Backup", err) + } +} diff --git a/pkg/cas/wait.go b/pkg/cas/wait.go new file mode 100644 index 00000000..efe3f6d7 --- /dev/null +++ b/pkg/cas/wait.go @@ -0,0 +1,91 @@ +package cas + +import ( + "context" + "fmt" + "time" + + "github.com/rs/zerolog/log" +) + +// pollIntervalForTesting overrides the production poll cadence in tests. +// Production: nil → defaultPollInterval (2 seconds). +var pollIntervalForTesting *time.Duration + +const ( + defaultPollInterval = 2 * time.Second + waitProgressLog = 30 * time.Second +) + +// waitForPrune polls the prune marker until it disappears, ctx is cancelled, +// or wait elapses. Returns nil to proceed; returns an ErrPruneInProgress-wrapping +// error on timeout; returns ctx.Err() on cancellation. +// +// wait == 0 means "no wait" — match the historical immediate-refusal semantics. +func waitForPrune(ctx context.Context, b Backend, clusterPrefix string, wait time.Duration) error { + poll := defaultPollInterval + if pollIntervalForTesting != nil { + poll = *pollIntervalForTesting + } + deadline := time.Now().Add(wait) + var firstMarker *PruneMarker + var loggedFirst bool + var lastLog time.Time + + for { + if err := ctx.Err(); err != nil { + return err + } + _, _, exists, err := b.StatFile(ctx, PruneMarkerPath(clusterPrefix)) + if err != nil { + return fmt.Errorf("cas: stat prune marker while waiting: %w", err) + } + if !exists { + return nil + } + + // First time we see the marker, read its body for diagnostics. + if !loggedFirst { + firstMarker, _ = ReadPruneMarker(ctx, b, clusterPrefix) // best-effort; nil-tolerant below + loggedFirst = true + } + + if wait == 0 || !time.Now().Before(deadline) { + return formatWaitTimeout(firstMarker, wait) + } + + // Periodic INFO log. + if time.Since(lastLog) >= waitProgressLog { + logWaitProgress(firstMarker, time.Until(deadline), wait) + lastLog = time.Now() + } + + select { + case <-ctx.Done(): + return ctx.Err() + case <-time.After(poll): + } + } +} + +func formatWaitTimeout(m *PruneMarker, wait time.Duration) error { + if m == nil { + return fmt.Errorf("%w: prune still in progress after %s wait; refusing", + ErrPruneInProgress, wait) + } + return fmt.Errorf( + "%w: prune still in progress after %s wait (held by host=%s, run_id=%s, started=%s); refusing. "+ + "Increase cas.wait_for_prune or run cas-prune --unlock if confident the prune is dead", + ErrPruneInProgress, wait, m.Host, m.RunID, m.StartedAt) +} + +func logWaitProgress(m *PruneMarker, remaining, total time.Duration) { + waited := total - remaining + if m == nil { + log.Info().Msgf("cas: waiting for prune to finish (waited=%s/%s)", + waited.Round(time.Second), total) + return + } + log.Info().Msgf("cas: waiting for prune to finish (held by host=%s since=%s, run_id=%s, waited=%s/%s)", + m.Host, m.StartedAt, m.RunID, waited.Round(time.Second), total) +} diff --git a/pkg/cas/wait_test.go b/pkg/cas/wait_test.go new file mode 100644 index 00000000..5440784b --- /dev/null +++ b/pkg/cas/wait_test.go @@ -0,0 +1,94 @@ +package cas_test + +import ( + "context" + "errors" + "io" + "strings" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas/internal/fakedst" + "github.com/stretchr/testify/require" +) + +func TestWaitForPrune_NoMarkerProceedsImmediately(t *testing.T) { + d := 10 * time.Millisecond + cas.SetPollIntervalForTesting(&d) + defer cas.SetPollIntervalForTesting(nil) + + b := fakedst.New() + start := time.Now() + err := cas.WaitForPrune(context.Background(), b, "cas/c1/", 5*time.Second) + require.NoError(t, err) + require.Less(t, time.Since(start), 100*time.Millisecond) +} + +func TestWaitForPrune_MarkerClearsBeforeDeadline(t *testing.T) { + d := 10 * time.Millisecond + cas.SetPollIntervalForTesting(&d) + defer cas.SetPollIntervalForTesting(nil) + + b := fakedst.New() + cp := "cas/c1/" + require.NoError(t, b.PutFile(context.Background(), cp+"prune.marker", io.NopCloser(strings.NewReader("{}")), 2)) + + go func() { + time.Sleep(50 * time.Millisecond) + _ = b.DeleteFile(context.Background(), cp+"prune.marker") + }() + + start := time.Now() + err := cas.WaitForPrune(context.Background(), b, cp, 5*time.Second) + require.NoError(t, err) + elapsed := time.Since(start) + require.GreaterOrEqual(t, elapsed, 50*time.Millisecond) + require.Less(t, elapsed, 1*time.Second) +} + +func TestWaitForPrune_TimeoutReturnsErrPruneInProgress(t *testing.T) { + d := 10 * time.Millisecond + cas.SetPollIntervalForTesting(&d) + defer cas.SetPollIntervalForTesting(nil) + + b := fakedst.New() + cp := "cas/c1/" + body := `{"host":"h1","started_at":"2026-05-08T10:30:12Z","run_id":"abc123","tool":"cas-prune"}` + require.NoError(t, b.PutFile(context.Background(), cp+"prune.marker", io.NopCloser(strings.NewReader(body)), int64(len(body)))) + + err := cas.WaitForPrune(context.Background(), b, cp, 100*time.Millisecond) + require.Error(t, err) + require.True(t, errors.Is(err, cas.ErrPruneInProgress)) + require.Contains(t, err.Error(), "h1") + require.Contains(t, err.Error(), "abc123") +} + +func TestWaitForPrune_ZeroWaitMatchesImmediateRefusal(t *testing.T) { + b := fakedst.New() + cp := "cas/c1/" + require.NoError(t, b.PutFile(context.Background(), cp+"prune.marker", io.NopCloser(strings.NewReader("{}")), 2)) + + err := cas.WaitForPrune(context.Background(), b, cp, 0) + require.Error(t, err) + require.True(t, errors.Is(err, cas.ErrPruneInProgress)) +} + +func TestWaitForPrune_RespectsContextCancel(t *testing.T) { + d := 10 * time.Millisecond + cas.SetPollIntervalForTesting(&d) + defer cas.SetPollIntervalForTesting(nil) + + b := fakedst.New() + cp := "cas/c1/" + require.NoError(t, b.PutFile(context.Background(), cp+"prune.marker", io.NopCloser(strings.NewReader("{}")), 2)) + + ctx, cancel := context.WithCancel(context.Background()) + go func() { time.Sleep(20 * time.Millisecond); cancel() }() + + start := time.Now() + err := cas.WaitForPrune(ctx, b, cp, 5*time.Second) + require.Error(t, err) + require.True(t, errors.Is(err, context.Canceled)) + require.Less(t, time.Since(start), 500*time.Millisecond) +} diff --git a/pkg/checksumstxt/checksumstxt.go b/pkg/checksumstxt/checksumstxt.go new file mode 100644 index 00000000..d8f02108 --- /dev/null +++ b/pkg/checksumstxt/checksumstxt.go @@ -0,0 +1,300 @@ +// Package checksumstxt parses the checksums.txt metadata file written next to +// every MergeTree data part. It supports versions 2 (legacy text), 3 (legacy +// binary), 4 (binary wrapped in a ClickHouse compressed-block stream — the +// default written today), and the standalone version-5 "minimalistic" blob +// used as the ZooKeeper payload (not on disk). +// +// Reference C++ implementation: src/Storages/MergeTree/MergeTreeDataPartChecksum.{h,cpp} +// +// The on-disk file is parsed by Parse; the minimalistic blob by ParseMinimalistic. +package checksumstxt + +import ( + "bufio" + "errors" + "fmt" + "io" + "strconv" + "strings" + + chproto "github.com/ClickHouse/ch-go/proto" +) + +const headerPrefix = "checksums format version: " + +type Hash128 struct { + Low, High uint64 +} + +type Checksum struct { + FileSize uint64 + FileHash Hash128 + IsCompressed bool + UncompressedSize uint64 + UncompressedHash Hash128 +} + +type File struct { + Version int + Files map[string]Checksum +} + +type Minimalistic struct { + NumCompressedFiles uint64 + NumUncompressedFiles uint64 + HashOfAllFiles Hash128 + HashOfUncompressedFiles Hash128 + UncompressedHashOfCompressedFiles Hash128 +} + +func Parse(r io.Reader) (*File, error) { + br := bufio.NewReader(r) + version, err := readVersion(br) + if err != nil { + return nil, err + } + f := &File{Version: version} + switch version { + case 2: + f.Files, err = parseV2(br) + case 3: + f.Files, err = parseBinary(br, false) + case 4: + f.Files, err = parseBinary(br, true) + case 5: + return nil, errors.New("checksumstxt: version 5 is the minimalistic blob; use ParseMinimalistic") + case 1: + return nil, errors.New("checksumstxt: format version 1 is too old to read") + default: + return nil, fmt.Errorf("checksumstxt: unsupported version %d", version) + } + if err != nil { + return nil, err + } + return f, nil +} + +func ParseMinimalistic(r io.Reader) (*Minimalistic, error) { + br := bufio.NewReader(r) + version, err := readVersion(br) + if err != nil { + return nil, err + } + if version != 5 { + return nil, fmt.Errorf("checksumstxt: minimalistic blob has version %d, want 5", version) + } + pr := chproto.NewReader(br) + var m Minimalistic + if m.NumCompressedFiles, err = pr.UVarInt(); err != nil { + return nil, fmt.Errorf("checksumstxt: num_compressed_files: %w", err) + } + if m.NumUncompressedFiles, err = pr.UVarInt(); err != nil { + return nil, fmt.Errorf("checksumstxt: num_uncompressed_files: %w", err) + } + if m.HashOfAllFiles, err = readHash128(pr); err != nil { + return nil, fmt.Errorf("checksumstxt: hash_of_all_files: %w", err) + } + if m.HashOfUncompressedFiles, err = readHash128(pr); err != nil { + return nil, fmt.Errorf("checksumstxt: hash_of_uncompressed_files: %w", err) + } + if m.UncompressedHashOfCompressedFiles, err = readHash128(pr); err != nil { + return nil, fmt.Errorf("checksumstxt: uncompressed_hash_of_compressed_files: %w", err) + } + if err := assertEOF(pr); err != nil { + return nil, err + } + return &m, nil +} + +func readVersion(br *bufio.Reader) (int, error) { + line, err := br.ReadString('\n') + if err != nil { + return 0, fmt.Errorf("checksumstxt: read header: %w", err) + } + if !strings.HasPrefix(line, headerPrefix) || !strings.HasSuffix(line, "\n") { + return 0, fmt.Errorf("checksumstxt: bad header %q", line) + } + v := strings.TrimSuffix(line[len(headerPrefix):], "\n") + n, err := strconv.Atoi(v) + if err != nil { + return 0, fmt.Errorf("checksumstxt: parse version %q: %w", v, err) + } + return n, nil +} + +func parseBinary(br *bufio.Reader, compressed bool) (map[string]Checksum, error) { + pr := chproto.NewReader(br) + if compressed { + pr.EnableCompression() + } + count, err := pr.UVarInt() + if err != nil { + return nil, fmt.Errorf("checksumstxt: count: %w", err) + } + out := make(map[string]Checksum, count) + for i := uint64(0); i < count; i++ { + name, err := pr.Str() + if err != nil { + return nil, fmt.Errorf("checksumstxt: record %d name: %w", i, err) + } + if _, dup := out[name]; dup { + return nil, fmt.Errorf("checksumstxt: duplicate name %q", name) + } + var c Checksum + if c.FileSize, err = pr.UVarInt(); err != nil { + return nil, fmt.Errorf("checksumstxt: %q size: %w", name, err) + } + if c.FileHash, err = readHash128(pr); err != nil { + return nil, fmt.Errorf("checksumstxt: %q hash: %w", name, err) + } + if c.IsCompressed, err = pr.Bool(); err != nil { + return nil, fmt.Errorf("checksumstxt: %q is_compressed: %w", name, err) + } + if c.IsCompressed { + if c.UncompressedSize, err = pr.UVarInt(); err != nil { + return nil, fmt.Errorf("checksumstxt: %q uncompressed_size: %w", name, err) + } + if c.UncompressedHash, err = readHash128(pr); err != nil { + return nil, fmt.Errorf("checksumstxt: %q uncompressed_hash: %w", name, err) + } + } + out[name] = c + } + if err := assertEOF(pr); err != nil { + return nil, err + } + return out, nil +} + +func readHash128(pr *chproto.Reader) (Hash128, error) { + u, err := pr.UInt128() + if err != nil { + return Hash128{}, err + } + return Hash128{Low: u.Low, High: u.High}, nil +} + +// assertEOF returns nil iff the reader is at EOF. For a v4 stream this also +// catches a partially-formed trailing compressed block, which the underlying +// compress.Reader will surface as a non-EOF error wrapping io.EOF. +func assertEOF(pr *chproto.Reader) error { + var b [1]byte + n, err := pr.Read(b[:]) + if n > 0 { + return errors.New("checksumstxt: trailing bytes after body") + } + if err == nil || errors.Is(err, io.EOF) { + return nil + } + return fmt.Errorf("checksumstxt: trailing data check: %w", err) +} + +func parseV2(br *bufio.Reader) (map[string]Checksum, error) { + const countSuffix = " files:\n" + line, err := br.ReadString('\n') + if err != nil { + return nil, fmt.Errorf("checksumstxt: count line: %w", err) + } + if !strings.HasSuffix(line, countSuffix) { + return nil, fmt.Errorf("checksumstxt: bad count line %q", line) + } + count, err := strconv.Atoi(line[:len(line)-len(countSuffix)]) + if err != nil { + return nil, fmt.Errorf("checksumstxt: parse count: %w", err) + } + out := make(map[string]Checksum, count) + for i := 0; i < count; i++ { + name, err := readLine(br) + if err != nil { + return nil, fmt.Errorf("checksumstxt: record %d name: %w", i, err) + } + var c Checksum + size, err := readKV(br, "\tsize: ") + if err != nil { + return nil, fmt.Errorf("%q size: %w", name, err) + } + if c.FileSize, err = strconv.ParseUint(size, 10, 64); err != nil { + return nil, fmt.Errorf("%q size value: %w", name, err) + } + hash, err := readKV(br, "\thash: ") + if err != nil { + return nil, fmt.Errorf("%q hash: %w", name, err) + } + if c.FileHash, err = parseHash128Decimal(hash); err != nil { + return nil, fmt.Errorf("%q hash value: %w", name, err) + } + comp, err := readKV(br, "\tcompressed: ") + if err != nil { + return nil, fmt.Errorf("%q compressed: %w", name, err) + } + switch comp { + case "0": + case "1": + c.IsCompressed = true + default: + return nil, fmt.Errorf("%q compressed value %q", name, comp) + } + if c.IsCompressed { + us, err := readKV(br, "\tuncompressed size: ") + if err != nil { + return nil, fmt.Errorf("%q uncompressed_size: %w", name, err) + } + if c.UncompressedSize, err = strconv.ParseUint(us, 10, 64); err != nil { + return nil, fmt.Errorf("%q uncompressed_size value: %w", name, err) + } + uh, err := readKV(br, "\tuncompressed hash: ") + if err != nil { + return nil, fmt.Errorf("%q uncompressed_hash: %w", name, err) + } + if c.UncompressedHash, err = parseHash128Decimal(uh); err != nil { + return nil, fmt.Errorf("%q uncompressed_hash value: %w", name, err) + } + } + if _, dup := out[name]; dup { + return nil, fmt.Errorf("checksumstxt: duplicate name %q", name) + } + out[name] = c + } + if _, err := br.ReadByte(); !errors.Is(err, io.EOF) { + if err == nil { + return nil, errors.New("checksumstxt: trailing bytes after body") + } + return nil, fmt.Errorf("checksumstxt: trailing data: %w", err) + } + return out, nil +} + +func readLine(br *bufio.Reader) (string, error) { + s, err := br.ReadString('\n') + if err != nil { + return "", err + } + return s[:len(s)-1], nil +} + +func readKV(br *bufio.Reader, prefix string) (string, error) { + s, err := readLine(br) + if err != nil { + return "", err + } + if !strings.HasPrefix(s, prefix) { + return "", fmt.Errorf("expected prefix %q, got %q", prefix, s) + } + return s[len(prefix):], nil +} + +func parseHash128Decimal(s string) (Hash128, error) { + sp := strings.IndexByte(s, ' ') + if sp < 0 { + return Hash128{}, fmt.Errorf("expected two decimals, got %q", s) + } + low, err := strconv.ParseUint(s[:sp], 10, 64) + if err != nil { + return Hash128{}, fmt.Errorf("low64: %w", err) + } + high, err := strconv.ParseUint(s[sp+1:], 10, 64) + if err != nil { + return Hash128{}, fmt.Errorf("high64: %w", err) + } + return Hash128{Low: low, High: high}, nil +} diff --git a/pkg/checksumstxt/checksumstxt_test.go b/pkg/checksumstxt/checksumstxt_test.go new file mode 100644 index 00000000..7669d2c3 --- /dev/null +++ b/pkg/checksumstxt/checksumstxt_test.go @@ -0,0 +1,318 @@ +package checksumstxt + +import ( + "bytes" + "encoding/binary" + "os" + "path/filepath" + "strings" + "testing" + + "github.com/ClickHouse/ch-go/compress" +) + +// uvar appends a LEB128-encoded uint64. +func uvar(dst []byte, x uint64) []byte { + for x >= 0x80 { + dst = append(dst, byte(x)|0x80) + x >>= 7 + } + return append(dst, byte(x)) +} + +// strBin appends a length-prefixed string in ClickHouse binary format. +func strBin(dst []byte, s string) []byte { + dst = uvar(dst, uint64(len(s))) + return append(dst, s...) +} + +func u128LE(low, high uint64) []byte { + var b [16]byte + binary.LittleEndian.PutUint64(b[:8], low) + binary.LittleEndian.PutUint64(b[8:], high) + return b[:] +} + +// buildV3Body returns the inner body (after the version line) for a v3/v4 file. +func buildV3Body(records []struct { + name string + c Checksum +}) []byte { + var b []byte + b = uvar(b, uint64(len(records))) + for _, r := range records { + b = strBin(b, r.name) + b = uvar(b, r.c.FileSize) + b = append(b, u128LE(r.c.FileHash.Low, r.c.FileHash.High)...) + if r.c.IsCompressed { + b = append(b, 1) + b = uvar(b, r.c.UncompressedSize) + b = append(b, u128LE(r.c.UncompressedHash.Low, r.c.UncompressedHash.High)...) + } else { + b = append(b, 0) + } + } + return b +} + +func TestParseV2(t *testing.T) { + const input = "checksums format version: 2\n" + + "2 files:\n" + + "columns.txt\n" + + "\tsize: 123\n" + + "\thash: 1 2\n" + + "\tcompressed: 0\n" + + "id.bin\n" + + "\tsize: 4096\n" + + "\thash: 100 200\n" + + "\tcompressed: 1\n" + + "\tuncompressed size: 8192\n" + + "\tuncompressed hash: 300 400\n" + + f, err := Parse(strings.NewReader(input)) + if err != nil { + t.Fatal(err) + } + if f.Version != 2 || len(f.Files) != 2 { + t.Fatalf("got version=%d files=%d", f.Version, len(f.Files)) + } + c := f.Files["columns.txt"] + if c.FileSize != 123 || c.FileHash != (Hash128{1, 2}) || c.IsCompressed { + t.Errorf("columns.txt: %+v", c) + } + c = f.Files["id.bin"] + want := Checksum{ + FileSize: 4096, + FileHash: Hash128{100, 200}, + IsCompressed: true, + UncompressedSize: 8192, + UncompressedHash: Hash128{300, 400}, + } + if c != want { + t.Errorf("id.bin: got %+v want %+v", c, want) + } +} + +func TestParseV3(t *testing.T) { + records := []struct { + name string + c Checksum + }{ + {"columns.txt", Checksum{FileSize: 123, FileHash: Hash128{0xAABB, 0xCCDD}}}, + {"id.bin", Checksum{ + FileSize: 4096, FileHash: Hash128{1, 2}, + IsCompressed: true, UncompressedSize: 8192, UncompressedHash: Hash128{3, 4}, + }}, + } + body := buildV3Body(records) + + var buf bytes.Buffer + buf.WriteString("checksums format version: 3\n") + buf.Write(body) + + f, err := Parse(&buf) + if err != nil { + t.Fatal(err) + } + if f.Version != 3 || len(f.Files) != 2 { + t.Fatalf("got version=%d files=%d", f.Version, len(f.Files)) + } + if f.Files["columns.txt"] != records[0].c { + t.Errorf("columns.txt: %+v", f.Files["columns.txt"]) + } + if f.Files["id.bin"] != records[1].c { + t.Errorf("id.bin: %+v", f.Files["id.bin"]) + } +} + +func TestParseV4_LZ4(t *testing.T) { + testParseV4(t, compress.LZ4) +} + +func TestParseV4_None(t *testing.T) { + testParseV4(t, compress.None) +} + +func TestParseV4_ZSTD(t *testing.T) { + testParseV4(t, compress.ZSTD) +} + +func testParseV4(t *testing.T, m compress.Method) { + t.Helper() + records := []struct { + name string + c Checksum + }{ + {"primary.idx", Checksum{FileSize: 64, FileHash: Hash128{0xDEADBEEF, 0xCAFEBABE}}}, + {"id.bin", Checksum{ + FileSize: 4096, FileHash: Hash128{1, 2}, + IsCompressed: true, UncompressedSize: 8192, UncompressedHash: Hash128{3, 4}, + }}, + } + body := buildV3Body(records) + + w := compress.NewWriter(compress.LevelZero, m) + if err := w.Compress(body); err != nil { + t.Fatal(err) + } + + var buf bytes.Buffer + buf.WriteString("checksums format version: 4\n") + buf.Write(w.Data) + + f, err := Parse(&buf) + if err != nil { + t.Fatal(err) + } + if f.Version != 4 || len(f.Files) != 2 { + t.Fatalf("got version=%d files=%d", f.Version, len(f.Files)) + } + if f.Files["primary.idx"] != records[0].c || f.Files["id.bin"] != records[1].c { + t.Errorf("mismatch: %+v", f.Files) + } +} + +func TestParseV4_MultiBlock(t *testing.T) { + // Concatenated blocks should decompress to a single v3 body. + body := buildV3Body([]struct { + name string + c Checksum + }{ + {"a.bin", Checksum{FileSize: 1, FileHash: Hash128{1, 1}}}, + {"b.bin", Checksum{FileSize: 2, FileHash: Hash128{2, 2}}}, + }) + // Update count to 2 (already 2 in builder). Split bytes into two halves + // and emit each as its own block. + half := len(body) / 2 + w1 := compress.NewWriter(compress.LevelZero, compress.LZ4) + if err := w1.Compress(body[:half]); err != nil { + t.Fatal(err) + } + w2 := compress.NewWriter(compress.LevelZero, compress.LZ4) + if err := w2.Compress(body[half:]); err != nil { + t.Fatal(err) + } + + var buf bytes.Buffer + buf.WriteString("checksums format version: 4\n") + buf.Write(w1.Data) + buf.Write(w2.Data) + + f, err := Parse(&buf) + if err != nil { + t.Fatal(err) + } + if len(f.Files) != 2 { + t.Fatalf("got %d files", len(f.Files)) + } +} + +func TestParseRejectsTrailingBytes(t *testing.T) { + body := buildV3Body([]struct { + name string + c Checksum + }{{"x", Checksum{FileSize: 1, FileHash: Hash128{1, 2}}}}) + var buf bytes.Buffer + buf.WriteString("checksums format version: 3\n") + buf.Write(body) + buf.WriteByte(0xFF) // junk after body + + if _, err := Parse(&buf); err == nil { + t.Fatal("expected error for trailing bytes") + } +} + +func TestParseRejectsV1AndUnknown(t *testing.T) { + for _, version := range []string{"1", "999"} { + input := "checksums format version: " + version + "\n" + if _, err := Parse(strings.NewReader(input)); err == nil { + t.Errorf("version %s: expected error", version) + } + } +} + +func TestParseRejectsV5(t *testing.T) { + input := "checksums format version: 5\n" + if _, err := Parse(strings.NewReader(input)); err == nil { + t.Fatal("expected error: v5 must go through ParseMinimalistic") + } +} + +func TestParseMinimalistic(t *testing.T) { + var body []byte + body = uvar(body, 7) // num_compressed_files + body = uvar(body, 11) // num_uncompressed_files + body = append(body, u128LE(0x11, 0x22)...) + body = append(body, u128LE(0x33, 0x44)...) + body = append(body, u128LE(0x55, 0x66)...) + + var buf bytes.Buffer + buf.WriteString("checksums format version: 5\n") + buf.Write(body) + + m, err := ParseMinimalistic(&buf) + if err != nil { + t.Fatal(err) + } + want := &Minimalistic{ + NumCompressedFiles: 7, + NumUncompressedFiles: 11, + HashOfAllFiles: Hash128{0x11, 0x22}, + HashOfUncompressedFiles: Hash128{0x33, 0x44}, + UncompressedHashOfCompressedFiles: Hash128{0x55, 0x66}, + } + if *m != *want { + t.Errorf("got %+v want %+v", m, want) + } +} + +func TestParseMinimalisticRejectsNon5(t *testing.T) { + if _, err := ParseMinimalistic(strings.NewReader("checksums format version: 4\n")); err == nil { + t.Fatal("expected error") + } +} + +func TestParseRealFixtures(t *testing.T) { + cases := []struct { + dir string + wantVersion int + wantMinFiles int + }{ + // v4_wide: wide MergeTree part (3 columns: id, x, y) → 9 files. + {"v4_wide", 4, 5}, + // v4_compact: compact MergeTree part (2 columns: id, x) → 5 files (data.bin, data.cmrk3, ...). + {"v4_compact", 4, 3}, + // v4_projection: wide part with PROJECTION p1 → 10 files including p1.proj entry. + {"v4_projection", 4, 5}, + // v4_multi_block: 300-column wide part → 1202 files (large compressed payload). + {"v4_multi_block", 4, 50}, + } + for _, tc := range cases { + t.Run(tc.dir, func(t *testing.T) { + f, err := os.Open(filepath.Join("testdata", tc.dir, "checksums.txt")) + if err != nil { + t.Fatal(err) + } + defer f.Close() + got, err := Parse(f) + if err != nil { + t.Fatalf("Parse: %v", err) + } + if got.Version != tc.wantVersion { + t.Errorf("version: got %d want %d", got.Version, tc.wantVersion) + } + if len(got.Files) < tc.wantMinFiles { + t.Errorf("files: got %d want >=%d", len(got.Files), tc.wantMinFiles) + } + for name, c := range got.Files { + if c.FileSize == 0 && !strings.HasSuffix(name, ".cmrk2") && + !strings.HasSuffix(name, ".cmrk3") && name != "count.txt" { + t.Errorf("%s: zero size", name) + } + if c.FileHash == (Hash128{}) { + t.Errorf("%s: zero hash", name) + } + } + }) + } +} diff --git a/pkg/checksumstxt/testdata/README.md b/pkg/checksumstxt/testdata/README.md new file mode 100644 index 00000000..33f966a7 --- /dev/null +++ b/pkg/checksumstxt/testdata/README.md @@ -0,0 +1,96 @@ +# checksumstxt testdata + +Real `checksums.txt` files extracted from a live ClickHouse server for fixture-driven parser tests. + +## ClickHouse version + +**24.8.14.39** (image `clickhouse/clickhouse-server:24.8`, official build) + +## Fixtures + +### `v4_wide/checksums.txt` + +Wide MergeTree part, format version 4, 3 columns, 9 file entries. + +```sql +CREATE TABLE fx.wide (id UInt64, x String, y Float64) + ENGINE=MergeTree ORDER BY id + SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0; +INSERT INTO fx.wide SELECT number, toString(number), number*1.5 FROM numbers(1000); +OPTIMIZE TABLE fx.wide FINAL; +``` + +Part directory: `all_1_1_1` + +### `v4_compact/checksums.txt` + +Compact MergeTree part, format version 4, 2 columns, 5 file entries. +Compact format is forced by setting `min_rows_for_wide_part` and `min_bytes_for_wide_part` very high. +Compact parts store all columns in a single `data.bin`/`data.cmrk3` pair. + +```sql +CREATE TABLE fx.compact (id UInt64, x String) + ENGINE=MergeTree ORDER BY id + SETTINGS min_rows_for_wide_part=1000000000, min_bytes_for_wide_part=1000000000; +INSERT INTO fx.compact SELECT number, toString(number) FROM numbers(100); +OPTIMIZE TABLE fx.compact FINAL; +``` + +Part directory: `all_1_1_1` + +### `v4_projection/checksums.txt` + +Wide MergeTree part with a PROJECTION, format version 4, 10 file entries. +Includes a `p1.proj` entry (the serialized projection sub-part). + +```sql +CREATE TABLE fx.proj (id UInt64, c String, n UInt32, + PROJECTION p1 (SELECT c, count() GROUP BY c)) + ENGINE=MergeTree ORDER BY id + SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0; +INSERT INTO fx.proj + SELECT number, ['a','b','c'][number%3+1], toUInt32(number%10) + FROM numbers(500); +OPTIMIZE TABLE fx.proj FINAL; +``` + +Part directory: `all_1_1_1` + +### `v4_multi_block/checksums.txt` + +Wide MergeTree part with 300 Int64 columns, format version 4, 602 file entries. +The large file list (300 columns × ~2 files each) produces a ~110 KB uncompressed payload +that stresses the compressed-block reader. While the ClickHouse 24.8 LZ4 block size +(1 MB default) fits all entries in a single block, the payload size is ~300x larger than +the wide/compact fixtures and validates correct handling of large payloads. +100 rows with distinct non-zero values ensure no column file is empty. + +```sql +-- The column list has 300 columns: c0 Int64, c1 Int64, ..., c299 Int64 +-- Generated with: cols=$(python3 -c "print(','.join(f'c{i} Int64' for i in range(300)))") +CREATE TABLE fx.multi (<300_cols>) ENGINE=MergeTree ORDER BY tuple() + SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0; +-- INSERT with number*multiplier+1 for each column so no column is all-zero: +INSERT INTO fx.multi + SELECT number*1+1, number*2+1, ..., number*300+1 + FROM numbers(100); +OPTIMIZE TABLE fx.multi FINAL; +``` + +Part directory: `all_1_1_1` + +## Missing fixtures and why + +### v2 / v3 (text / uncompressed-binary format) + +Format versions 2 and 3 were used by ClickHouse releases predating ~20.x. +They are not produced by any supported ClickHouse version. +The parser is fully covered by the synthetic unit tests `TestParseV2` and `TestParseV3` +in `checksumstxt_test.go`. + +### v5 (minimalistic blob) + +Version 5 is a compact ZooKeeper payload — it is never written to disk as a +`checksums.txt` file. It cannot be obtained by extracting a file from a data +part directory. The parser is covered by the synthetic unit test +`TestParseMinimalistic` in `checksumstxt_test.go`. diff --git a/pkg/checksumstxt/testdata/v4_compact/checksums.txt b/pkg/checksumstxt/testdata/v4_compact/checksums.txt new file mode 100644 index 00000000..372a720a Binary files /dev/null and b/pkg/checksumstxt/testdata/v4_compact/checksums.txt differ diff --git a/pkg/checksumstxt/testdata/v4_multi_block/checksums.txt b/pkg/checksumstxt/testdata/v4_multi_block/checksums.txt new file mode 100644 index 00000000..41c78959 Binary files /dev/null and b/pkg/checksumstxt/testdata/v4_multi_block/checksums.txt differ diff --git a/pkg/checksumstxt/testdata/v4_projection/checksums.txt b/pkg/checksumstxt/testdata/v4_projection/checksums.txt new file mode 100644 index 00000000..8bee9e37 Binary files /dev/null and b/pkg/checksumstxt/testdata/v4_projection/checksums.txt differ diff --git a/pkg/checksumstxt/testdata/v4_wide/checksums.txt b/pkg/checksumstxt/testdata/v4_wide/checksums.txt new file mode 100644 index 00000000..dea03167 Binary files /dev/null and b/pkg/checksumstxt/testdata/v4_wide/checksums.txt differ diff --git a/pkg/config/config.go b/pkg/config/config.go index ebe6a3f4..10c5ad6e 100644 --- a/pkg/config/config.go +++ b/pkg/config/config.go @@ -11,6 +11,7 @@ import ( "sync" "time" + "github.com/Altinity/clickhouse-backup/v2/pkg/cas" "github.com/Altinity/clickhouse-backup/v2/pkg/log_helper" "github.com/aws/aws-sdk-go-v2/aws" s3types "github.com/aws/aws-sdk-go-v2/service/s3/types" @@ -37,6 +38,7 @@ type Config struct { SFTP SFTPConfig `yaml:"sftp" envconfig:"_"` AzureBlob AzureBlobConfig `yaml:"azblob" envconfig:"_"` Custom CustomConfig `yaml:"custom" envconfig:"_"` + CAS cas.Config `yaml:"cas" envconfig:"_"` // Mutex to protect concurrent access when applying macros mu sync.Mutex `yaml:"-"` } @@ -521,6 +523,9 @@ func ValidateConfig(cfg *Config) error { cfg.General.FullDuration = duration } } + if err := cfg.CAS.Validate(); err != nil { + return errors.WithMessage(err, "ValidateConfig CAS") + } return nil } @@ -701,6 +706,7 @@ func DefaultConfig() *Config { CommandTimeout: "4h", CommandTimeoutDuration: 4 * time.Hour, }, + CAS: cas.DefaultConfig(), } } diff --git a/pkg/metadata/backup_metadata.go b/pkg/metadata/backup_metadata.go index 45701256..b1a402b8 100644 --- a/pkg/metadata/backup_metadata.go +++ b/pkg/metadata/backup_metadata.go @@ -29,6 +29,24 @@ type BackupMetadata struct { Functions []FunctionsMeta `json:"functions"` DataFormat string `json:"data_format"` RequiredBackup string `json:"required_backup,omitempty"` + // CAS holds parameters for the content-addressable layout. Populated only by + // cas-upload; nil means the backup is a v1 backup. See docs/cas-design.md §6.2.1. + CAS *CASBackupParams `json:"cas,omitempty"` +} + +// CASBackupParams persists CAS layout parameters per backup so restore is +// hermetic against future config drift. See docs/cas-design.md §6.2.1. +type CASBackupParams struct { + LayoutVersion uint8 `json:"layout_version"` + InlineThreshold uint64 `json:"inline_threshold"` + ClusterID string `json:"cluster_id"` + // Handoff is set to true in the local metadata.json written by cas-download + // when it materializes a v1-shaped backup directory for cas-restore handoff. + // It tells the v1 restore path: "this backup was materialized from CAS and + // must not be treated as a raw v1 CAS backup — skip cross-mode refusal but + // also skip object-disk handling (CAS never wrote object-disk metadata)." + // The remote (CAS namespace) copy of metadata.json never has Handoff set. + Handoff bool `json:"handoff,omitempty"` } func (b *BackupMetadata) GetFullSize() uint64 { diff --git a/pkg/server/actions_cas.go b/pkg/server/actions_cas.go new file mode 100644 index 00000000..993d38a5 --- /dev/null +++ b/pkg/server/actions_cas.go @@ -0,0 +1,346 @@ +package server + +import ( + "context" + "fmt" + "net/url" + "strings" + "time" + + "github.com/google/uuid" + "github.com/rs/zerolog/log" + + "github.com/Altinity/clickhouse-backup/v2/pkg/backup" + "github.com/Altinity/clickhouse-backup/v2/pkg/status" + "github.com/Altinity/clickhouse-backup/v2/pkg/utils" +) + +// actionsCASHandler handles cas-* verbs sent through POST /backup/actions. +// +// The `command` argument is the first token from the shell-split command line +// (e.g. "cas-upload"). `args` is the full token slice (args[0] == command). +// `row` carries the original raw command string for the status log. +// +// The method mirrors the async pattern of actionsAsyncCommandsHandler: it +// starts a status entry, kicks a goroutine, and immediately appends an +// "acknowledged" result row. For cas-delete (sync-by-convention), it still +// runs asynchronously here so that the /backup/actions endpoint never blocks +// — callers can poll /backup/actions to check completion. +func (api *APIServer) actionsCASHandler(command string, args []string, row status.ActionRow, actionsResults []actionsResultsRow) ([]actionsResultsRow, error) { + if !api.GetConfig().API.AllowParallel && status.Current.InProgress() { + return actionsResults, ErrAPILocked + } + // Try to reload config from disk; fall back to the cached config when the + // file is not available (e.g. in unit tests with a stub config path). + cfg := api.GetConfig() + if reloaded, reloadErr := api.ReloadConfig(nil, command); reloadErr == nil { + cfg = reloaded + } + + operationId, _ := uuid.NewUUID() + commandId, _ := status.Current.StartWithOperationId(row.Command, operationId.String()) + // No callback URL in the /backup/actions protocol — use the no-op callback. + noopCb, _ := parseCallback(url.Values{}) + + switch command { + case "cas-upload": + name, skipObjectDisks, dryRun, waitForPrune, parseErr := parseCASUploadArgs(args[1:], cfg.CAS.WaitForPruneDuration()) + if parseErr != nil { + status.Current.Stop(commandId, parseErr) + return actionsResults, parseErr + } + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-upload", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASUpload(name, skipObjectDisks, dryRun, false, api.clickhouseBackupVersion, commandId, waitForPrune) + }) + status.Current.Stop(commandId, err) + if err != nil { + log.Error().Msgf("actions cas-upload error: %v", err) + api.errorCallback(context.Background(), err, operationId.String(), noopCb) + } else { + api.successCallback(context.Background(), operationId.String(), noopCb) + } + }() + + case "cas-download": + name, tablePattern, partitions, schemaOnly, parseErr := parseCASDownloadArgs(args[1:]) + if parseErr != nil { + status.Current.Stop(commandId, parseErr) + return actionsResults, parseErr + } + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-download", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASDownload(name, tablePattern, partitions, schemaOnly, false, api.clickhouseBackupVersion, commandId) + }) + status.Current.Stop(commandId, err) + if err != nil { + log.Error().Msgf("actions cas-download error: %v", err) + api.errorCallback(context.Background(), err, operationId.String(), noopCb) + } else { + api.successCallback(context.Background(), operationId.String(), noopCb) + } + }() + + case "cas-restore": + name, opts, parseErr := parseCASRestoreArgs(args[1:]) + if parseErr != nil { + status.Current.Stop(commandId, parseErr) + return actionsResults, parseErr + } + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-restore", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASRestore( + name, opts.tablePattern, + opts.dbMapping, opts.tableMapping, opts.partitions, opts.skipProjections, + opts.schemaOnly, false, // dataOnly always false for CAS + opts.dropExists, false, // ignoreDependencies always false + opts.restoreSchemaAsAttach, opts.replicatedCopyToDetached, + opts.skipEmptyTables, opts.resume, + api.clickhouseBackupVersion, commandId, + ) + }) + status.Current.Stop(commandId, err) + if err != nil { + log.Error().Msgf("actions cas-restore error: %v", err) + api.errorCallback(context.Background(), err, operationId.String(), noopCb) + } else { + api.successCallback(context.Background(), operationId.String(), noopCb) + } + }() + + case "cas-delete": + name, waitForPrune, parseErr := parseCASDeleteArgs(args[1:], cfg.CAS.WaitForPruneDuration()) + if parseErr != nil { + status.Current.Stop(commandId, parseErr) + return actionsResults, parseErr + } + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-delete", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASDelete(name, commandId, waitForPrune) + }) + status.Current.Stop(commandId, err) + if err != nil { + log.Error().Msgf("actions cas-delete error: %v", err) + api.errorCallback(context.Background(), err, operationId.String(), noopCb) + } else { + api.successCallback(context.Background(), operationId.String(), noopCb) + } + }() + + case "cas-verify": + if len(args) < 2 || args[1] == "" { + err := fmt.Errorf("cas-verify: name required") + status.Current.Stop(commandId, err) + return actionsResults, err + } + name := utils.CleanBackupNameRE.ReplaceAllString(args[1], "") + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-verify", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASVerify(name, true, commandId) + }) + status.Current.Stop(commandId, err) + if err != nil { + log.Error().Msgf("actions cas-verify error: %v", err) + api.errorCallback(context.Background(), err, operationId.String(), noopCb) + } else { + api.successCallback(context.Background(), operationId.String(), noopCb) + } + }() + + case "cas-prune": + dryRun, graceBlob, abandonThreshold, unlock := parseCASPruneArgs(args[1:]) + if unlock { + log.Warn().Msg("cas-prune --unlock invoked via /backup/actions; operator override of stranded marker") + } + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-prune", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASPrune(dryRun, graceBlob, abandonThreshold, unlock, commandId) + }) + status.Current.Stop(commandId, err) + if err != nil { + log.Error().Msgf("actions cas-prune error: %v", err) + api.errorCallback(context.Background(), err, operationId.String(), noopCb) + } else { + api.successCallback(context.Background(), operationId.String(), noopCb) + } + }() + + case "cas-status": + // cas-status is informational; run async so /backup/actions never blocks. + go func() { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + _, reportErr := b.CASStatusJSON(commandId) + status.Current.Stop(commandId, reportErr) + if reportErr != nil { + log.Error().Msgf("actions cas-status error: %v", reportErr) + } + }() + + default: + err := fmt.Errorf("actionsCASHandler: unrecognised CAS command %q", command) + status.Current.Stop(commandId, err) + return actionsResults, err + } + + actionsResults = append(actionsResults, actionsResultsRow{ + Status: "acknowledged", + Operation: row.Command, + }) + return actionsResults, nil +} + +// ────────────────────────────────────────────────────────────────── +// Argument parsers — consume the token slice that follows the verb. +// ────────────────────────────────────────────────────────────────── + +func parseCASUploadArgs(args []string, defaultWaitForPrune time.Duration) (name string, skipObjectDisks, dryRun bool, waitForPrune time.Duration, err error) { + waitForPrune = defaultWaitForPrune + for _, a := range args { + switch { + case a == "--skip-object-disks": + skipObjectDisks = true + case a == "--dry-run": + dryRun = true + case strings.HasPrefix(a, "--wait-for-prune="): + dur, parseErr := time.ParseDuration(strings.TrimPrefix(a, "--wait-for-prune=")) + if parseErr != nil { + err = fmt.Errorf("cas-upload: %w", parseErr) + return + } + waitForPrune = dur + case !strings.HasPrefix(a, "-"): + if name == "" { + name = utils.CleanBackupNameRE.ReplaceAllString(a, "") + } + } + } + if name == "" { + err = fmt.Errorf("cas-upload: name required") + } + return +} + +func parseCASDownloadArgs(args []string) (name, tablePattern string, partitions []string, schemaOnly bool, err error) { + for _, a := range args { + switch { + case a == "--schema": + schemaOnly = true + case strings.HasPrefix(a, "--table="): + tablePattern = strings.TrimPrefix(a, "--table=") + case strings.HasPrefix(a, "--partitions="): + partitions = append(partitions, strings.TrimPrefix(a, "--partitions=")) + case !strings.HasPrefix(a, "-"): + if name == "" { + name = utils.CleanBackupNameRE.ReplaceAllString(a, "") + } + } + } + if name == "" { + err = fmt.Errorf("cas-download: name required") + } + return +} + +type casRestoreOpts struct { + tablePattern string + dbMapping []string + tableMapping []string + partitions []string + skipProjections []string + schemaOnly bool + dropExists bool + restoreSchemaAsAttach bool + replicatedCopyToDetached bool + skipEmptyTables bool + resume bool +} + +func parseCASRestoreArgs(args []string) (name string, opts casRestoreOpts, err error) { + for _, a := range args { + switch { + case a == "--schema": + opts.schemaOnly = true + case a == "--drop" || a == "--rm": + opts.dropExists = true + case a == "--restore-schema-as-attach": + opts.restoreSchemaAsAttach = true + case a == "--replicated-copy-to-detached": + opts.replicatedCopyToDetached = true + case a == "--skip-empty-tables": + opts.skipEmptyTables = true + case a == "--resume" || a == "--resumable": + opts.resume = true + case strings.HasPrefix(a, "--table="): + opts.tablePattern = strings.TrimPrefix(a, "--table=") + case strings.HasPrefix(a, "--partitions="): + opts.partitions = append(opts.partitions, strings.TrimPrefix(a, "--partitions=")) + case strings.HasPrefix(a, "--skip-projections="): + opts.skipProjections = append(opts.skipProjections, strings.TrimPrefix(a, "--skip-projections=")) + case strings.HasPrefix(a, "--restore-database-mapping="): + for _, m := range strings.Split(strings.TrimPrefix(a, "--restore-database-mapping="), ",") { + if m = strings.TrimSpace(m); m != "" { + opts.dbMapping = append(opts.dbMapping, m) + } + } + case strings.HasPrefix(a, "--restore-table-mapping="): + for _, m := range strings.Split(strings.TrimPrefix(a, "--restore-table-mapping="), ",") { + if m = strings.TrimSpace(m); m != "" { + opts.tableMapping = append(opts.tableMapping, m) + } + } + case !strings.HasPrefix(a, "-"): + if name == "" { + name = utils.CleanBackupNameRE.ReplaceAllString(a, "") + } + } + } + if name == "" { + err = fmt.Errorf("cas-restore: name required") + } + return +} + +func parseCASDeleteArgs(args []string, defaultWaitForPrune time.Duration) (name string, waitForPrune time.Duration, err error) { + waitForPrune = defaultWaitForPrune + for _, a := range args { + switch { + case strings.HasPrefix(a, "--wait-for-prune="): + dur, parseErr := time.ParseDuration(strings.TrimPrefix(a, "--wait-for-prune=")) + if parseErr != nil { + err = fmt.Errorf("cas-delete: %w", parseErr) + return + } + waitForPrune = dur + case !strings.HasPrefix(a, "-"): + if name == "" { + name = utils.CleanBackupNameRE.ReplaceAllString(a, "") + } + } + } + if name == "" { + err = fmt.Errorf("cas-delete: name required") + } + return +} + +func parseCASPruneArgs(args []string) (dryRun bool, graceBlob, abandonThreshold string, unlock bool) { + for _, a := range args { + switch { + case a == "--dry-run": + dryRun = true + case a == "--unlock": + unlock = true + case strings.HasPrefix(a, "--grace-blob="): + graceBlob = strings.TrimPrefix(a, "--grace-blob=") + case strings.HasPrefix(a, "--abandon-threshold="): + abandonThreshold = strings.TrimPrefix(a, "--abandon-threshold=") + } + } + return +} diff --git a/pkg/server/cas_handlers.go b/pkg/server/cas_handlers.go new file mode 100644 index 00000000..d60d506f --- /dev/null +++ b/pkg/server/cas_handlers.go @@ -0,0 +1,541 @@ +package server + +import ( + "context" + "fmt" + "net/http" + "strings" + "time" + + "github.com/google/uuid" + "github.com/gorilla/mux" + "github.com/rs/zerolog/log" + + "github.com/Altinity/clickhouse-backup/v2/pkg/backup" + "github.com/Altinity/clickhouse-backup/v2/pkg/status" + "github.com/Altinity/clickhouse-backup/v2/pkg/utils" +) + +// asyncAck is the standard 200-acknowledged JSON body returned by async CAS handlers. +type asyncAck struct { + Status string `json:"status"` + Operation string `json:"operation"` + BackupName string `json:"backup_name,omitempty"` + OperationId string `json:"operation_id"` +} + +func newAsyncAck(op, name, opID string) asyncAck { + return asyncAck{Status: "acknowledged", Operation: op, BackupName: name, OperationId: opID} +} + +// httpCASUploadHandler handles POST /backup/cas-upload/{name} +func (api *APIServer) httpCASUploadHandler(w http.ResponseWriter, r *http.Request) { + if !api.GetConfig().API.AllowParallel && status.Current.InProgress() { + log.Warn().Err(ErrAPILocked).Send() + api.writeError(w, http.StatusLocked, "cas-upload", ErrAPILocked) + return + } + cfg, err := api.ReloadConfig(w, "cas-upload") + if err != nil { + return + } + + name := utils.CleanBackupNameRE.ReplaceAllString(mux.Vars(r)["name"], "") + if name == "" { + api.writeError(w, http.StatusBadRequest, "cas-upload", fmt.Errorf("name required")) + return + } + query := r.URL.Query() + _, skipObjectDisks := api.getQueryParameter(query, "skip-object-disks") + _, dryRun := api.getQueryParameter(query, "dry-run") + waitForPruneStr := query.Get("wait-for-prune") + + var waitForPrune time.Duration + if waitForPruneStr != "" { + waitForPrune, err = time.ParseDuration(waitForPruneStr) + if err != nil { + api.writeError(w, http.StatusBadRequest, "cas-upload", + fmt.Errorf("wait-for-prune: %w", err)) + return + } + } else { + waitForPrune = cfg.CAS.WaitForPruneDuration() + } + + fullCommand := fmt.Sprintf("cas-upload %s", name) + if skipObjectDisks { + fullCommand += " --skip-object-disks" + } + if dryRun { + fullCommand += " --dry-run" + } + if waitForPruneStr != "" { + fullCommand += " --wait-for-prune=" + waitForPruneStr + } + + operationId, _ := uuid.NewUUID() + callback, err := parseCallback(query) + if err != nil { + log.Error().Err(err).Send() + api.writeError(w, http.StatusBadRequest, "cas-upload", err) + return + } + + commandId, _ := status.Current.StartWithOperationId(fullCommand, operationId.String()) + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-upload", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASUpload(name, skipObjectDisks, dryRun, false, api.clickhouseBackupVersion, commandId, waitForPrune) + }) + if err != nil { + log.Error().Msgf("cas-upload error: %v", err) + status.Current.Stop(commandId, err) + api.errorCallback(context.Background(), err, operationId.String(), callback) + return + } + status.Current.Stop(commandId, nil) + api.successCallback(context.Background(), operationId.String(), callback) + }() + + api.sendJSONEachRow(w, http.StatusOK, newAsyncAck("cas-upload", name, operationId.String())) +} + +// httpCASDownloadHandler handles POST /backup/cas-download/{name} +func (api *APIServer) httpCASDownloadHandler(w http.ResponseWriter, r *http.Request) { + if !api.GetConfig().API.AllowParallel && status.Current.InProgress() { + log.Warn().Err(ErrAPILocked).Send() + api.writeError(w, http.StatusLocked, "cas-download", ErrAPILocked) + return + } + cfg, err := api.ReloadConfig(w, "cas-download") + if err != nil { + return + } + + name := utils.CleanBackupNameRE.ReplaceAllString(mux.Vars(r)["name"], "") + if name == "" { + api.writeError(w, http.StatusBadRequest, "cas-download", fmt.Errorf("name required")) + return + } + + query := r.URL.Query() + tablePattern := "" + if tp, exist := query["table"]; exist { + tablePattern = tp[0] + } + partitions := query["partitions"] + _, schemaOnly := api.getQueryParameter(query, "schema") + _, dataOnly := api.getQueryParameter(query, "data") + + if dataOnly { + api.writeError(w, http.StatusNotImplemented, "cas-download", + fmt.Errorf("cas-download: data-only restore is not yet implemented")) + return + } + + fullCommand := fmt.Sprintf("cas-download %s", name) + if tablePattern != "" { + fullCommand += fmt.Sprintf(" --table=%q", tablePattern) + } + for _, p := range partitions { + fullCommand += " --partitions=" + p + } + if schemaOnly { + fullCommand += " --schema" + } + if dataOnly { + fullCommand += " --data" + } + + operationId, _ := uuid.NewUUID() + callback, err := parseCallback(query) + if err != nil { + log.Error().Err(err).Send() + api.writeError(w, http.StatusBadRequest, "cas-download", err) + return + } + + commandId, _ := status.Current.StartWithOperationId(fullCommand, operationId.String()) + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-download", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASDownload(name, tablePattern, partitions, schemaOnly, dataOnly, api.clickhouseBackupVersion, commandId) + }) + if err != nil { + log.Error().Msgf("cas-download error: %v", err) + status.Current.Stop(commandId, err) + api.errorCallback(context.Background(), err, operationId.String(), callback) + return + } + status.Current.Stop(commandId, nil) + api.successCallback(context.Background(), operationId.String(), callback) + }() + + api.sendJSONEachRow(w, http.StatusOK, newAsyncAck("cas-download", name, operationId.String())) +} + +// httpCASRestoreHandler handles POST /backup/cas-restore/{name} +func (api *APIServer) httpCASRestoreHandler(w http.ResponseWriter, r *http.Request) { + if !api.GetConfig().API.AllowParallel && status.Current.InProgress() { + log.Warn().Err(ErrAPILocked).Send() + api.writeError(w, http.StatusLocked, "cas-restore", ErrAPILocked) + return + } + cfg, err := api.ReloadConfig(w, "cas-restore") + if err != nil { + return + } + + name := utils.CleanBackupNameRE.ReplaceAllString(mux.Vars(r)["name"], "") + if name == "" { + api.writeError(w, http.StatusBadRequest, "cas-restore", fmt.Errorf("name required")) + return + } + + query := r.URL.Query() + tablePattern := "" + if tp, exist := query["table"]; exist { + tablePattern = tp[0] + } + partitions := query["partitions"] + _, schemaOnly := api.getQueryParameter(query, "schema") + _, dataOnly := api.getQueryParameter(query, "data") + + if dataOnly { + api.writeError(w, http.StatusNotImplemented, "cas-restore", + fmt.Errorf("cas-restore: data-only restore is not yet implemented")) + return + } + + // Reject ignore-dependencies at the boundary — CASRestore passes false internally. + if _, exists := api.getQueryParameter(query, "ignore-dependencies"); exists { + api.writeError(w, http.StatusBadRequest, "cas-restore", + fmt.Errorf("cas-restore: ignore-dependencies is not supported; CAS restore always respects table dependencies")) + return + } + + // Parse database mapping (same as v1 httpRestoreHandler). + dbMapping := make([]string, 0) + for _, qpName := range []string{"restore-database-mapping", "restore_database_mapping"} { + if vals, exist := query[qpName]; exist { + for _, v := range vals { + for _, m := range strings.Split(v, ",") { + m = strings.TrimSpace(m) + if m != "" { + dbMapping = append(dbMapping, m) + } + } + } + } + } + + // Parse table mapping. + tableMapping := make([]string, 0) + for _, qpName := range []string{"restore-table-mapping", "restore_table_mapping"} { + if vals, exist := query[qpName]; exist { + for _, v := range vals { + for _, m := range strings.Split(v, ",") { + m = strings.TrimSpace(m) + if m != "" { + tableMapping = append(tableMapping, m) + } + } + } + } + } + + // Parse skip-projections. + skipProjections := make([]string, 0) + if sp, exist := api.getQueryParameter(query, "skip-projections"); exist { + skipProjections = append(skipProjections, sp) + } + + dropExists := false + if _, exist := query["drop"]; exist { + dropExists = true + } + if _, exist := query["rm"]; exist { + dropExists = true + } + + _, restoreSchemaAsAttach := api.getQueryParameter(query, "restore-schema-as-attach") + if !restoreSchemaAsAttach { + _, restoreSchemaAsAttach = api.getQueryParameter(query, "restore_schema_as_attach") + } + + _, replicatedCopyToDetached := api.getQueryParameter(query, "replicated-copy-to-detached") + if !replicatedCopyToDetached { + _, replicatedCopyToDetached = api.getQueryParameter(query, "replicated_copy_to_detached") + } + + _, skipEmptyTables := api.getQueryParameter(query, "skip-empty-tables") + if !skipEmptyTables { + _, skipEmptyTables = api.getQueryParameter(query, "skip_empty_tables") + } + + _, resume := api.getQueryParameter(query, "resume") + if !resume { + _, resume = query["resumable"] + } + + fullCommand := fmt.Sprintf("cas-restore %s", name) + if tablePattern != "" { + fullCommand += fmt.Sprintf(" --table=%q", tablePattern) + } + for _, p := range partitions { + fullCommand += " --partitions=" + p + } + if schemaOnly { + fullCommand += " --schema" + } + if len(dbMapping) > 0 { + fullCommand += fmt.Sprintf(" --restore-database-mapping=%q", strings.Join(dbMapping, ",")) + } + if len(tableMapping) > 0 { + fullCommand += fmt.Sprintf(" --restore-table-mapping=%q", strings.Join(tableMapping, ",")) + } + if len(skipProjections) > 0 { + fullCommand += " --skip-projections=" + strings.Join(skipProjections, ",") + } + if dropExists { + fullCommand += " --drop" + } + if restoreSchemaAsAttach { + fullCommand += " --restore-schema-as-attach" + } + if replicatedCopyToDetached { + fullCommand += " --replicated-copy-to-detached" + } + if skipEmptyTables { + fullCommand += " --skip-empty-tables" + } + if resume { + fullCommand += " --resume" + } + + operationId, _ := uuid.NewUUID() + callback, err := parseCallback(query) + if err != nil { + log.Error().Err(err).Send() + api.writeError(w, http.StatusBadRequest, "cas-restore", err) + return + } + + commandId, _ := status.Current.StartWithOperationId(fullCommand, operationId.String()) + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-restore", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASRestore( + name, tablePattern, + dbMapping, tableMapping, partitions, skipProjections, + schemaOnly, dataOnly, + dropExists, false, // ignoreDependencies always false for CAS + restoreSchemaAsAttach, replicatedCopyToDetached, + skipEmptyTables, resume, + api.clickhouseBackupVersion, commandId, + ) + }) + if err != nil { + log.Error().Msgf("cas-restore error: %v", err) + status.Current.Stop(commandId, err) + api.errorCallback(context.Background(), err, operationId.String(), callback) + return + } + status.Current.Stop(commandId, nil) + api.successCallback(context.Background(), operationId.String(), callback) + }() + + api.sendJSONEachRow(w, http.StatusOK, newAsyncAck("cas-restore", name, operationId.String())) +} + +// httpCASDeleteHandler handles POST /backup/cas-delete/{name} +func (api *APIServer) httpCASDeleteHandler(w http.ResponseWriter, r *http.Request) { + if !api.GetConfig().API.AllowParallel && status.Current.InProgress() { + log.Warn().Err(ErrAPILocked).Send() + api.writeError(w, http.StatusLocked, "cas-delete", ErrAPILocked) + return + } + cfg, err := api.ReloadConfig(w, "cas-delete") + if err != nil { + return + } + + name := utils.CleanBackupNameRE.ReplaceAllString(mux.Vars(r)["name"], "") + if name == "" { + api.writeError(w, http.StatusBadRequest, "cas-delete", fmt.Errorf("name required")) + return + } + + query := r.URL.Query() + waitForPruneStr := query.Get("wait-for-prune") + var waitForPrune time.Duration + if waitForPruneStr != "" { + waitForPrune, err = time.ParseDuration(waitForPruneStr) + if err != nil { + api.writeError(w, http.StatusBadRequest, "cas-delete", + fmt.Errorf("wait-for-prune: %w", err)) + return + } + } else { + waitForPrune = cfg.CAS.WaitForPruneDuration() + } + + fullCommand := fmt.Sprintf("cas-delete %s", name) + if waitForPruneStr != "" { + fullCommand += " --wait-for-prune=" + waitForPruneStr + } + + operationId, _ := uuid.NewUUID() + callback, err := parseCallback(query) + if err != nil { + log.Error().Err(err).Send() + api.writeError(w, http.StatusBadRequest, "cas-delete", err) + return + } + + commandId, _ := status.Current.StartWithOperationId(fullCommand, operationId.String()) + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-delete", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASDelete(name, commandId, waitForPrune) + }) + if err != nil { + log.Error().Msgf("cas-delete error: %v", err) + status.Current.Stop(commandId, err) + api.errorCallback(context.Background(), err, operationId.String(), callback) + return + } + status.Current.Stop(commandId, nil) + api.successCallback(context.Background(), operationId.String(), callback) + }() + + api.sendJSONEachRow(w, http.StatusOK, newAsyncAck("cas-delete", name, operationId.String())) +} + +// httpCASVerifyHandler handles POST /backup/cas-verify/{name} +func (api *APIServer) httpCASVerifyHandler(w http.ResponseWriter, r *http.Request) { + if !api.GetConfig().API.AllowParallel && status.Current.InProgress() { + log.Warn().Err(ErrAPILocked).Send() + api.writeError(w, http.StatusLocked, "cas-verify", ErrAPILocked) + return + } + cfg, err := api.ReloadConfig(w, "cas-verify") + if err != nil { + return + } + + name := utils.CleanBackupNameRE.ReplaceAllString(mux.Vars(r)["name"], "") + if name == "" { + api.writeError(w, http.StatusBadRequest, "cas-verify", fmt.Errorf("name required")) + return + } + + fullCommand := fmt.Sprintf("cas-verify %s", name) + query := r.URL.Query() + operationId, _ := uuid.NewUUID() + callback, err := parseCallback(query) + if err != nil { + log.Error().Err(err).Send() + api.writeError(w, http.StatusBadRequest, "cas-verify", err) + return + } + + commandId, _ := status.Current.StartWithOperationId(fullCommand, operationId.String()) + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-verify", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASVerify(name, true, commandId) + }) + if err != nil { + log.Error().Msgf("cas-verify error: %v", err) + status.Current.Stop(commandId, err) + api.errorCallback(context.Background(), err, operationId.String(), callback) + return + } + status.Current.Stop(commandId, nil) + api.successCallback(context.Background(), operationId.String(), callback) + }() + + api.sendJSONEachRow(w, http.StatusOK, newAsyncAck("cas-verify", name, operationId.String())) +} + +// httpCASPruneHandler handles POST /backup/cas-prune +func (api *APIServer) httpCASPruneHandler(w http.ResponseWriter, r *http.Request) { + if !api.GetConfig().API.AllowParallel && status.Current.InProgress() { + log.Warn().Err(ErrAPILocked).Send() + api.writeError(w, http.StatusLocked, "cas-prune", ErrAPILocked) + return + } + cfg, err := api.ReloadConfig(w, "cas-prune") + if err != nil { + return + } + + query := r.URL.Query() + _, dryRun := api.getQueryParameter(query, "dry-run") + graceBlob := query.Get("grace-blob") + abandonThreshold := query.Get("abandon-threshold") + _, unlock := api.getQueryParameter(query, "unlock") + + if unlock { + log.Warn().Msg("cas-prune --unlock invoked via API; operator override of stranded marker") + } + + fullCommand := "cas-prune" + if dryRun { + fullCommand += " --dry-run" + } + if graceBlob != "" { + fullCommand += " --grace-blob=" + graceBlob + } + if abandonThreshold != "" { + fullCommand += " --abandon-threshold=" + abandonThreshold + } + if unlock { + fullCommand += " --unlock" + } + + operationId, _ := uuid.NewUUID() + callback, err := parseCallback(query) + if err != nil { + log.Error().Err(err).Send() + api.writeError(w, http.StatusBadRequest, "cas-prune", err) + return + } + + commandId, _ := status.Current.StartWithOperationId(fullCommand, operationId.String()) + go func() { + err, _ := api.metrics.ExecuteWithMetrics("cas-prune", 0, func() error { + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + return b.CASPrune(dryRun, graceBlob, abandonThreshold, unlock, commandId) + }) + if err != nil { + log.Error().Msgf("cas-prune error: %v", err) + status.Current.Stop(commandId, err) + api.errorCallback(context.Background(), err, operationId.String(), callback) + return + } + status.Current.Stop(commandId, nil) + api.successCallback(context.Background(), operationId.String(), callback) + }() + + api.sendJSONEachRow(w, http.StatusOK, newAsyncAck("cas-prune", "", operationId.String())) +} + +// httpCASStatusHandler handles GET /backup/cas-status +func (api *APIServer) httpCASStatusHandler(w http.ResponseWriter, r *http.Request) { + cfg, err := api.ReloadConfig(w, "cas-status") + if err != nil { + return + } + + b := backup.NewBackuper(cfg, backup.WithCASProbeState(api.casProbeState)) + report, statusErr := b.CASStatusJSON(status.NotFromAPI) + if statusErr != nil { + api.writeError(w, http.StatusInternalServerError, "cas-status", statusErr) + return + } + + api.sendJSONEachRow(w, http.StatusOK, report) +} + diff --git a/pkg/server/cas_handlers_test.go b/pkg/server/cas_handlers_test.go new file mode 100644 index 00000000..9c6688b0 --- /dev/null +++ b/pkg/server/cas_handlers_test.go @@ -0,0 +1,399 @@ +package server + +import ( + "encoding/json" + "net/http/httptest" + "strings" + "testing" + + "github.com/gorilla/mux" + "github.com/stretchr/testify/require" + "github.com/urfave/cli" + + "github.com/Altinity/clickhouse-backup/v2/pkg/config" + "github.com/Altinity/clickhouse-backup/v2/pkg/server/metrics" + "github.com/Altinity/clickhouse-backup/v2/pkg/status" +) + +// testMetrics is a shared metrics instance registered once for the test binary. +// prometheus.MustRegister panics on duplicates so we must share across tests. +var testMetrics = func() *metrics.APIMetrics { + m := metrics.NewAPIMetrics() + m.RegisterMetrics() + return m +}() + +// newTestAPI builds a minimal APIServer suitable for handler unit-tests. +// It uses a non-existent configPath so ReloadConfig falls back to DefaultConfig. +func newTestAPI(t *testing.T) *APIServer { + t.Helper() + cfg := config.DefaultConfig() + // Ensure AllowParallel default is false — tests set it explicitly. + cfg.API.AllowParallel = false + + app := cli.NewApp() + app.Version = "test" + + return &APIServer{ + cliApp: app, + configPath: "/nonexistent/config.yaml", // causes LoadConfig to use DefaultConfig + config: cfg, + metrics: testMetrics, + restart: make(chan struct{}, 1), + stop: make(chan struct{}, 1), + clickhouseBackupVersion: "test", + } +} + +// TestCASUploadHandler_AsyncAck verifies that a POST to /backup/cas-upload/{name} +// immediately returns 200 with an acknowledged asyncAck body before the background +// goroutine runs. +func TestCASUploadHandler_AsyncAck(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true // permit the call even if another op is in progress + + req := httptest.NewRequest("POST", "/backup/cas-upload/myname", nil) + // Inject mux vars manually (bypasses the router). + req = mux.SetURLVars(req, map[string]string{"name": "myname"}) + rr := httptest.NewRecorder() + + api.httpCASUploadHandler(rr, req) + + require.Equal(t, 200, rr.Code) + + var ack asyncAck + require.NoError(t, json.Unmarshal(rr.Body.Bytes(), &ack)) + require.Equal(t, "acknowledged", ack.Status) + require.Equal(t, "cas-upload", ack.Operation) + require.Equal(t, "myname", ack.BackupName) + require.NotEmpty(t, ack.OperationId) +} + +// TestCASUploadHandler_LockedWhenBusy verifies that the handler returns 423 when +// AllowParallel=false and another operation is in progress. +func TestCASUploadHandler_LockedWhenBusy(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = false + + // Register a fake in-progress operation. + cmdId, _ := status.Current.Start("upload some-other-backup") + defer status.Current.Stop(cmdId, nil) + + req := httptest.NewRequest("POST", "/backup/cas-upload/myname", nil) + req = mux.SetURLVars(req, map[string]string{"name": "myname"}) + rr := httptest.NewRecorder() + + api.httpCASUploadHandler(rr, req) + + require.Equal(t, 423, rr.Code) +} + +// ---------- cas-download ---------- + +// TestCASDownloadHandler_AsyncAck verifies that POST /backup/cas-download/{name} +// returns 200 with an acknowledged asyncAck body immediately. +func TestCASDownloadHandler_AsyncAck(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true + + req := httptest.NewRequest("POST", "/backup/cas-download/mybackup", nil) + req = mux.SetURLVars(req, map[string]string{"name": "mybackup"}) + rr := httptest.NewRecorder() + + api.httpCASDownloadHandler(rr, req) + + require.Equal(t, 200, rr.Code) + var ack asyncAck + require.NoError(t, json.Unmarshal(rr.Body.Bytes(), &ack)) + require.Equal(t, "acknowledged", ack.Status) + require.Equal(t, "cas-download", ack.Operation) + require.Equal(t, "mybackup", ack.BackupName) + require.NotEmpty(t, ack.OperationId) +} + +// TestCASDownloadHandler_DataOnlyReturns501 verifies that ?data returns 501. +func TestCASDownloadHandler_DataOnlyReturns501(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true + + req := httptest.NewRequest("POST", "/backup/cas-download/mybackup?data", nil) + req = mux.SetURLVars(req, map[string]string{"name": "mybackup"}) + rr := httptest.NewRecorder() + + api.httpCASDownloadHandler(rr, req) + + require.Equal(t, 501, rr.Code) +} + +// ---------- cas-restore ---------- + +// TestCASRestoreHandler_AsyncAck verifies that POST /backup/cas-restore/{name} +// returns 200 with an acknowledged asyncAck body immediately. +func TestCASRestoreHandler_AsyncAck(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true + + req := httptest.NewRequest("POST", "/backup/cas-restore/mybackup", nil) + req = mux.SetURLVars(req, map[string]string{"name": "mybackup"}) + rr := httptest.NewRecorder() + + api.httpCASRestoreHandler(rr, req) + + require.Equal(t, 200, rr.Code) + var ack asyncAck + require.NoError(t, json.Unmarshal(rr.Body.Bytes(), &ack)) + require.Equal(t, "acknowledged", ack.Status) + require.Equal(t, "cas-restore", ack.Operation) + require.Equal(t, "mybackup", ack.BackupName) + require.NotEmpty(t, ack.OperationId) +} + +// TestCASRestoreHandler_IgnoreDependenciesReturns400 verifies that +// ?ignore-dependencies is rejected with 400 at the handler boundary. +func TestCASRestoreHandler_IgnoreDependenciesReturns400(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true + + req := httptest.NewRequest("POST", "/backup/cas-restore/mybackup?ignore-dependencies", nil) + req = mux.SetURLVars(req, map[string]string{"name": "mybackup"}) + rr := httptest.NewRecorder() + + api.httpCASRestoreHandler(rr, req) + + require.Equal(t, 400, rr.Code) +} + +// ---------- cas-delete ---------- + +// TestCASDeleteHandler_AsyncAck verifies that POST /backup/cas-delete/{name} +// immediately returns 200 with an acknowledged asyncAck body before the +// background goroutine runs. +func TestCASDeleteHandler_AsyncAck(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true + + req := httptest.NewRequest("POST", "/backup/cas-delete/mybackup", nil) + req = mux.SetURLVars(req, map[string]string{"name": "mybackup"}) + rr := httptest.NewRecorder() + + api.httpCASDeleteHandler(rr, req) + + require.Equal(t, 200, rr.Code) + var ack asyncAck + require.NoError(t, json.Unmarshal(rr.Body.Bytes(), &ack)) + require.Equal(t, "acknowledged", ack.Status) + require.Equal(t, "cas-delete", ack.Operation) + require.Equal(t, "mybackup", ack.BackupName) + require.NotEmpty(t, ack.OperationId) +} + +// TestCASDeleteHandler_LockedWhenBusy verifies that the handler returns 423 when +// AllowParallel=false and another operation is in progress. +func TestCASDeleteHandler_LockedWhenBusy(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = false + + cmdId, _ := status.Current.Start("some-other-op") + defer status.Current.Stop(cmdId, nil) + + req := httptest.NewRequest("POST", "/backup/cas-delete/mybackup", nil) + req = mux.SetURLVars(req, map[string]string{"name": "mybackup"}) + rr := httptest.NewRecorder() + + api.httpCASDeleteHandler(rr, req) + + require.Equal(t, 423, rr.Code) +} + +// ---------- cas-verify ---------- + +// TestCASVerifyHandler_AsyncAck verifies that POST /backup/cas-verify/{name} +// returns 200 with an acknowledged asyncAck body immediately. +func TestCASVerifyHandler_AsyncAck(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true + + req := httptest.NewRequest("POST", "/backup/cas-verify/mybackup", nil) + req = mux.SetURLVars(req, map[string]string{"name": "mybackup"}) + rr := httptest.NewRecorder() + + api.httpCASVerifyHandler(rr, req) + + require.Equal(t, 200, rr.Code) + var ack asyncAck + require.NoError(t, json.Unmarshal(rr.Body.Bytes(), &ack)) + require.Equal(t, "acknowledged", ack.Status) + require.Equal(t, "cas-verify", ack.Operation) + require.Equal(t, "mybackup", ack.BackupName) + require.NotEmpty(t, ack.OperationId) +} + +// ---------- cas-prune ---------- + +// TestCASPruneHandler_AsyncAck verifies that POST /backup/cas-prune +// returns 200 with an acknowledged asyncAck body immediately. +func TestCASPruneHandler_AsyncAck(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true + + req := httptest.NewRequest("POST", "/backup/cas-prune", nil) + rr := httptest.NewRecorder() + + api.httpCASPruneHandler(rr, req) + + require.Equal(t, 200, rr.Code) + var ack asyncAck + require.NoError(t, json.Unmarshal(rr.Body.Bytes(), &ack)) + require.Equal(t, "acknowledged", ack.Status) + require.Equal(t, "cas-prune", ack.Operation) + require.NotEmpty(t, ack.OperationId) +} + +// TestCASPruneHandler_PassesQueryParams verifies that dry-run and grace-blob +// are reflected in the status command string that was started. +func TestCASPruneHandler_PassesQueryParams(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true + + req := httptest.NewRequest("POST", "/backup/cas-prune?dry-run&grace-blob=0s", nil) + rr := httptest.NewRecorder() + + api.httpCASPruneHandler(rr, req) + + require.Equal(t, 200, rr.Code) + var ack asyncAck + require.NoError(t, json.Unmarshal(rr.Body.Bytes(), &ack)) + require.NotEmpty(t, ack.OperationId) + + // Retrieve the started command from status and verify it contains the flags. + rows := status.Current.GetStatus(false, "", 10) + found := false + for _, row := range rows { + if row.OperationId == ack.OperationId { + require.Contains(t, row.Command, "--dry-run") + require.Contains(t, row.Command, "--grace-blob=0s") + found = true + break + } + } + require.True(t, found, "operation not found in status log") +} + +// ---------- cas-status ---------- + +// TestCASStatusHandler_ReturnsJSON verifies that GET /backup/cas-status +// returns a JSON body. With no real CAS backend configured the handler +// returns 500 with an error JSON object — we just assert the response is +// valid JSON (not empty/HTML) and that Content-Type is set appropriately. +// Full structured-data verification is an integration-test concern. +func TestCASStatusHandler_ReturnsJSON(t *testing.T) { + api := newTestAPI(t) + + req := httptest.NewRequest("GET", "/backup/cas-status", nil) + rr := httptest.NewRecorder() + + api.httpCASStatusHandler(rr, req) + + // With cas.enabled=false the handler returns 500, but the body must be JSON. + body := rr.Body.Bytes() + require.True(t, len(body) > 0, "response body must not be empty") + var payload interface{} + require.NoError(t, json.Unmarshal(body, &payload), "response body must be valid JSON") +} + +// TestCASStatusHandler_FreshServerReturns200OrEmpty verifies that GET +// /backup/cas-status does not fail with the "commandId=0 not exists" sentinel +// error on a fresh server. With CAS disabled in the default config the handler +// returns http.StatusInternalServerError, but for the correct reason +// (cas.enabled=false), not the stale commandId=0 lookup bug introduced by +// passing bare 0 instead of status.NotFromAPI. +func TestCASStatusHandler_FreshServerReturns200OrEmpty(t *testing.T) { + api := newTestAPI(t) + req := httptest.NewRequest("GET", "/backup/cas-status", nil) + rr := httptest.NewRecorder() + api.httpCASStatusHandler(rr, req) + // The fix ensures the error is NOT the old "commandId=0 not exists" sentinel. + require.NotContains(t, rr.Body.String(), "commandId=0 not exists", + "GET /backup/cas-status must not fail with the old commandId=0 sentinel; body=%s", rr.Body.String()) +} + +// ────────────────────────────────────────────────────────────────────────────── +// Task 7: /backup/actions dispatcher +// ────────────────────────────────────────────────────────────────────────────── + +// TestCASActionsDispatcher_Upload verifies that a POST to /backup/actions with +// a cas-upload command returns 200 with an "acknowledged" result row. +// +// /backup/actions uses sendJSONEachRow: the response body is newline-delimited +// JSON objects, not a JSON array — we decode the first line accordingly. +func TestCASActionsDispatcher_Upload(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true + + body := `{"command": "cas-upload myname --skip-object-disks"}` + req := httptest.NewRequest("POST", "/backup/actions", strings.NewReader(body)) + rr := httptest.NewRecorder() + + api.actions(rr, req) + + require.Equal(t, 200, rr.Code, "body: %s", rr.Body.String()) + + // sendJSONEachRow emits one JSON object per line; decode the first line. + firstLine := strings.SplitN(strings.TrimSpace(rr.Body.String()), "\n", 2)[0] + var result actionsResultsRow + require.NoError(t, json.Unmarshal([]byte(firstLine), &result)) + require.Equal(t, "acknowledged", result.Status) + require.Contains(t, result.Operation, "cas-upload") +} + +// TestCASActionsDispatcher_UnknownVerb verifies that an unknown command still +// returns 400 (the existing default branch), not a panic or 500. +func TestCASActionsDispatcher_UnknownVerb(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = true + + body := `{"command": "cas-frobnicate myname"}` + req := httptest.NewRequest("POST", "/backup/actions", strings.NewReader(body)) + rr := httptest.NewRecorder() + + api.actions(rr, req) + + // The default switch branch returns 400 for unknown commands. + require.Equal(t, 400, rr.Code, "body: %s", rr.Body.String()) +} + +// TestCASActionsDispatcher_LockedWhenBusy verifies that the dispatcher honours +// AllowParallel=false and returns 500 (which wraps ErrAPILocked) when another +// operation is already in progress. +func TestCASActionsDispatcher_LockedWhenBusy(t *testing.T) { + api := newTestAPI(t) + api.config.API.AllowParallel = false + + cmdId, _ := status.Current.Start("upload some-other-backup") + defer status.Current.Stop(cmdId, nil) + + body := `{"command": "cas-upload myname"}` + req := httptest.NewRequest("POST", "/backup/actions", strings.NewReader(body)) + rr := httptest.NewRecorder() + + api.actions(rr, req) + + // actionsAsyncCommandsHandler returns ErrAPILocked → writeError → 500. + require.Equal(t, 500, rr.Code, "body: %s", rr.Body.String()) +} + +// ────────────────────────────────────────────────────────────────────────────── +// Task 8: /backup/list kind field +// ────────────────────────────────────────────────────────────────────────────── + +// TestHttpListHandler_KindFieldPresent verifies that the list handler returns +// valid JSON. With no real ClickHouse or remote storage configured the handler +// returns an empty array — we verify that the response is parseable and the +// kind field is omitted (rather than present but wrong) for the zero-entry case. +// +// Full "v1 + cas merged" verification requires a Backuper stub and is covered +// by the integration test TestCASAPI_ListMixedBackups. +func TestHttpListHandler_KindFieldPresent(t *testing.T) { + t.Skip("requires live ClickHouse connection; covered by integration TestCASAPI_ListMixedBackups") +} + diff --git a/pkg/server/metrics/metrics.go b/pkg/server/metrics/metrics.go index d0d434af..cbd7b32e 100644 --- a/pkg/server/metrics/metrics.go +++ b/pkg/server/metrics/metrics.go @@ -23,6 +23,7 @@ type APIMetrics struct { NumberBackupsLocal prometheus.Gauge NumberBackupsRemoteExpected prometheus.Gauge NumberBackupsLocalExpected prometheus.Gauge + NumberCASBackupsRemote prometheus.Gauge InProgressCommands prometheus.Gauge LocalDataSize prometheus.Gauge @@ -41,7 +42,10 @@ func NewAPIMetrics() *APIMetrics { // RegisterMetrics resister prometheus metrics and define allowed measured commands list func (m *APIMetrics) RegisterMetrics() { - commandList := []string{"create", "upload", "download", "restore", "create_remote", "restore_remote", "delete"} + commandList := []string{ + "create", "upload", "download", "restore", "create_remote", "restore_remote", "delete", + "cas-upload", "cas-download", "cas-restore", "cas-delete", "cas-verify", "cas-prune", + } successfulCounter := map[string]prometheus.Counter{} failedCounter := map[string]prometheus.Counter{} lastStart := map[string]prometheus.Gauge{} @@ -131,6 +135,12 @@ func (m *APIMetrics) RegisterMetrics() { Help: "How many backups expected on local storage", }) + m.NumberCASBackupsRemote = prometheus.NewGauge(prometheus.GaugeOpts{ + Namespace: "clickhouse_backup", + Name: "number_cas_backups_remote", + Help: "Number of stored remote CAS backups", + }) + m.InProgressCommands = prometheus.NewGauge(prometheus.GaugeOpts{ Namespace: "clickhouse_backup", Name: "in_progress_commands", @@ -161,6 +171,7 @@ func (m *APIMetrics) RegisterMetrics() { m.NumberBackupsLocal, m.NumberBackupsRemoteExpected, m.NumberBackupsLocalExpected, + m.NumberCASBackupsRemote, m.InProgressCommands, m.LocalDataSize, ) diff --git a/pkg/server/server.go b/pkg/server/server.go index acb3e6f9..e52ec9a3 100644 --- a/pkg/server/server.go +++ b/pkg/server/server.go @@ -52,6 +52,10 @@ type APIServer struct { metrics *metrics.APIMetrics routes []string clickhouseBackupVersion string + // casProbeState is shared across all per-request Backuper instances so the + // conditional-put probe and unsafe-marker WARN banner fire at most once per + // daemon lifetime rather than once per REST request. + casProbeState *backup.CASProbeState } // GetConfig returns the current config with read lock protection @@ -106,6 +110,7 @@ func Run(cliCtx *cli.Context, cliApp *cli.App, configPath string, clickhouseBack clickhouseBackupVersion: clickhouseBackupVersion, metrics: metrics.NewAPIMetrics(), stop: make(chan struct{}), + casProbeState: backup.NewCASProbeState(), } api.metrics.RegisterMetrics() @@ -246,6 +251,13 @@ func (api *APIServer) registerHTTPHandlers() *http.Server { r.HandleFunc("/backup/clean/remote_broken", api.httpCleanRemoteBrokenHandler).Methods("POST") r.HandleFunc("/backup/clean/local_broken", api.httpCleanLocalBrokenHandler).Methods("POST") r.HandleFunc("/backup/upload/{name}", api.httpUploadHandler).Methods("POST") + r.HandleFunc("/backup/cas-upload/{name}", api.httpCASUploadHandler).Methods("POST") + r.HandleFunc("/backup/cas-download/{name}", api.httpCASDownloadHandler).Methods("POST") + r.HandleFunc("/backup/cas-restore/{name}", api.httpCASRestoreHandler).Methods("POST") + r.HandleFunc("/backup/cas-delete/{name}", api.httpCASDeleteHandler).Methods("POST") + r.HandleFunc("/backup/cas-verify/{name}", api.httpCASVerifyHandler).Methods("POST") + r.HandleFunc("/backup/cas-prune", api.httpCASPruneHandler).Methods("POST") + r.HandleFunc("/backup/cas-status", api.httpCASStatusHandler).Methods("GET") r.HandleFunc("/backup/download/{name}", api.httpDownloadHandler).Methods("POST") r.HandleFunc("/backup/restore/{name}", api.httpRestoreHandler).Methods("POST") r.HandleFunc("/backup/restore_remote/{name}", api.httpRestoreRemoteHandler).Methods("POST") @@ -398,6 +410,12 @@ func (api *APIServer) actions(w http.ResponseWriter, r *http.Request) { api.writeError(w, http.StatusInternalServerError, row.Command, err) return } + case "cas-upload", "cas-download", "cas-restore", "cas-delete", "cas-verify", "cas-prune", "cas-status": + actionsResults, err = api.actionsCASHandler(command, args, row, actionsResults) + if err != nil { + api.writeError(w, http.StatusInternalServerError, row.Command, err) + return + } default: api.writeError(w, http.StatusBadRequest, row.Command, errors.New("unknown command")) return @@ -829,20 +847,25 @@ func (api *APIServer) httpListHandler(w http.ResponseWriter, r *http.Request) { return } + type casListSummary struct { + UploadedAt string `json:"uploaded_at,omitempty"` + } type backupJSON struct { - Name string `json:"name"` - Created string `json:"created"` - Size uint64 `json:"size,omitempty"` - DataSize uint64 `json:"data_size,omitempty"` - ObjectDiskSize uint64 `json:"object_disk_size,omitempty"` - MetadataSize uint64 `json:"metadata_size"` - RBACSize uint64 `json:"rbac_size,omitempty"` - ConfigSize uint64 `json:"config_size,omitempty"` - NamedCollectionSize uint64 `json:"named_collection_size,omitempty"` - CompressedSize uint64 `json:"compressed_size,omitempty"` - Location string `json:"location"` - RequiredBackup string `json:"required"` - Desc string `json:"desc"` + Name string `json:"name"` + Kind string `json:"kind,omitempty"` // "v1" or "cas"; omitted on legacy clients for back-compat + Created string `json:"created"` + Size uint64 `json:"size,omitempty"` + DataSize uint64 `json:"data_size,omitempty"` + ObjectDiskSize uint64 `json:"object_disk_size,omitempty"` + MetadataSize uint64 `json:"metadata_size"` + RBACSize uint64 `json:"rbac_size,omitempty"` + ConfigSize uint64 `json:"config_size,omitempty"` + NamedCollectionSize uint64 `json:"named_collection_size,omitempty"` + CompressedSize uint64 `json:"compressed_size,omitempty"` + Location string `json:"location"` + RequiredBackup string `json:"required"` + Desc string `json:"desc"` + CAS *casListSummary `json:"cas,omitempty"` } backupsJSON := make([]backupJSON, 0) cfg, err := api.ReloadConfig(w, "list") @@ -879,6 +902,9 @@ func (api *APIServer) httpListHandler(w http.ResponseWriter, r *http.Request) { } backupsJSON = append(backupsJSON, backupJSON{ Name: item.BackupName, + // Kind omitted for v1 entries (omitempty) so legacy ClickHouse + // integration tables that don't set input_format_skip_unknown_fields + // (CH < 21.1) keep parsing /backup/list output. Created: item.CreationDate.In(time.Local).Format(common.TimeFormat), Size: item.GetFullSize(), DataSize: item.DataSize, @@ -918,6 +944,9 @@ func (api *APIServer) httpListHandler(w http.ResponseWriter, r *http.Request) { fullSize := item.GetFullSize() backupsJSON = append(backupsJSON, backupJSON{ Name: item.BackupName, + // Kind omitted for v1 entries (omitempty) so legacy ClickHouse + // integration tables that don't set input_format_skip_unknown_fields + // (CH < 21.1) keep parsing /backup/list output. Created: item.CreationDate.In(time.Local).Format(common.TimeFormat), Size: fullSize, DataSize: item.DataSize, @@ -938,6 +967,25 @@ func (api *APIServer) httpListHandler(w http.ResponseWriter, r *http.Request) { api.metrics.NumberBackupsRemoteBroken.Set(float64(brokenBackups)) api.metrics.NumberBackupsRemote.Set(float64(len(remoteBackups))) } + // Merge CAS backups into the list when CAS is enabled and remote storage is + // configured. Failures are logged and swallowed so that a CAS-side error + // never prevents the v1 list from being returned. + if cfg.CAS.Enabled && cfg.General.RemoteStorage != "none" && (where == "remote" || !wherePresent) { + casB := backup.NewBackuper(cfg) + for _, item := range casB.CollectRemoteCASBackups(ctx) { + uploadedAt := item.CreationDate.In(time.Local).Format(common.TimeFormat) + backupsJSON = append(backupsJSON, backupJSON{ + Name: item.BackupName, + Kind: "cas", + Created: uploadedAt, + Location: "remote", + Desc: item.Description, + CAS: &casListSummary{ + UploadedAt: uploadedAt, + }, + }) + } + } api.sendJSONEachRow(w, http.StatusOK, backupsJSON) status.Current.Stop(commandId, nil) } @@ -2228,6 +2276,16 @@ func (api *APIServer) UpdateBackupMetrics(ctx context.Context, onlyLocal bool) e api.metrics.NumberBackupsRemoteBroken.Set(0) } + // Update CAS backup count gauge (fail-open: errors are logged and swallowed + // so that a CAS-side error never prevents v1 metric updates from completing). + cfg := api.GetConfig() + if cfg.CAS.Enabled && cfg.General.RemoteStorage != "none" { + casBackups := b.CollectRemoteCASBackups(ctx) + api.metrics.NumberCASBackupsRemote.Set(float64(len(casBackups))) + } else { + api.metrics.NumberCASBackupsRemote.Set(0) + } + if lastBackupCreateLocal != nil { api.metrics.LastFinish["create"].Set(float64(lastBackupCreateLocal.Unix())) } diff --git a/pkg/storage/azblob.go b/pkg/storage/azblob.go index fb6b6407..4844d0ec 100644 --- a/pkg/storage/azblob.go +++ b/pkg/storage/azblob.go @@ -1,6 +1,7 @@ package storage import ( + "bytes" "context" "crypto/sha256" "encoding/base64" @@ -218,6 +219,44 @@ func (a *AzureBlob) PutFileAbsolute(ctx context.Context, key string, r io.ReadCl return nil } +// PutFileAbsoluteIfAbsent atomically uploads the blob at key only if it +// doesn't already exist, using the Azure If-None-Match: "*" access condition. +// Azure returns HTTP 409 BlobAlreadyExists (not 412) when the blob is present. +// Returns (true, nil) on successful creation, (false, nil) if the blob already +// existed, or (false, err) on any other error. +func (a *AzureBlob) PutFileAbsoluteIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + a.logf("AZBLOB->PutFileAbsoluteIfAbsent %s", key) + body, err := io.ReadAll(r) + _ = r.Close() + if err != nil { + return false, errors.WithMessage(err, "AzureBlob PutFileAbsoluteIfAbsent ReadAll") + } + blob := a.Container.NewBlockBlobURL(key) + _, err = x.UploadStreamToBlockBlob(ctx, bytes.NewReader(body), blob, azblob.UploadStreamToBlockBlobOptions{ + BufferSize: len(body) + 1, + MaxBuffers: 1, + AccessConditions: azblob.BlobAccessConditions{ + ModifiedAccessConditions: azblob.ModifiedAccessConditions{ + IfNoneMatch: azblob.ETagAny, + }, + }, + }, a.CPK) + if err != nil { + var se azblob.StorageError + if errors.As(err, &se) && se.ServiceCode() == azblob.ServiceCodeBlobAlreadyExists { + return false, nil + } + return false, errors.WithMessage(err, "AzureBlob PutFileAbsoluteIfAbsent UploadStreamToBlockBlob") + } + return true, nil +} + +// PutFileIfAbsent is the path-prefixed variant of PutFileAbsoluteIfAbsent. +// It prepends a.Config.Path to key, matching PutFile semantics. +func (a *AzureBlob) PutFileIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + return a.PutFileAbsoluteIfAbsent(ctx, path.Join(a.Config.Path, key), r, localSize) +} + func (a *AzureBlob) DeleteFile(ctx context.Context, key string) error { a.logf("AZBLOB->DeleteFile %s", key) blob := a.Container.NewBlockBlobURL(path.Join(a.Config.Path, key)) diff --git a/pkg/storage/cos.go b/pkg/storage/cos.go index 1a4d807d..974fc3a3 100644 --- a/pkg/storage/cos.go +++ b/pkg/storage/cos.go @@ -68,6 +68,12 @@ func (c *COS) Close(ctx context.Context) error { return nil } +// cosIsNotFound reports whether err is a "NoSuchKey" response from Tencent COS. +func cosIsNotFound(err error) bool { + var cosErr *cos.ErrorResponse + return errors.As(err, &cosErr) && cosErr.Code == "NoSuchKey" +} + func (c *COS) StatFile(ctx context.Context, key string) (RemoteFile, error) { return c.StatFileAbsolute(ctx, path.Join(c.Config.Path, key)) } @@ -76,9 +82,7 @@ func (c *COS) StatFileAbsolute(ctx context.Context, key string) (RemoteFile, err // @todo - COS Stat file max size is 5Gb resp, err := c.client.Object.Get(ctx, key, nil) if err != nil { - var cosErr *cos.ErrorResponse - ok := errors.As(err, &cosErr) - if ok && cosErr.Code == "NoSuchKey" { + if cosIsNotFound(err) { return nil, ErrNotFound } return nil, errors.WithMessage(err, "COS StatFileAbsolute Get") @@ -337,6 +341,46 @@ func (c *COS) PutFileAbsolute(ctx context.Context, key string, r io.ReadCloser, return nil } +// PutFileAbsoluteIfAbsent atomically creates the object at key only if it +// doesn't already exist, using Tencent COS's If-None-Match: "*" header. +// +// The Tencent Go SDK (github.com/tencentyun/cos-go-sdk-v5 v0.7.73) does not +// expose a typed If-None-Match field on ObjectPutHeaderOptions, but it does +// provide the cos.XOptionalKey / cos.XOptionalValue context mechanism which +// injects arbitrary headers into any SDK call. We use that to send +// "If-None-Match: *" on the PUT request. COS returns HTTP 412 when the object +// already exists; this maps to (false, nil). +func (c *COS) PutFileAbsoluteIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + ifNoneMatch := make(http.Header) + ifNoneMatch.Set("If-None-Match", "*") + ctx = context.WithValue(ctx, cos.XOptionalKey, &cos.XOptionalValue{Header: &ifNoneMatch}) + + if _, err := c.client.Object.Put(ctx, key, r, nil); err != nil { + if isCOSPreconditionFailed(err) { + return false, nil + } + return false, errors.WithMessage(err, "COS PutFileAbsoluteIfAbsent Put") + } + return true, nil +} + +// PutFileIfAbsent is the path-prefixed variant of PutFileAbsoluteIfAbsent. +// It prepends c.Config.Path to key, matching PutFile semantics. +func (c *COS) PutFileIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + return c.PutFileAbsoluteIfAbsent(ctx, path.Join(c.Config.Path, key), r, localSize) +} + +// isCOSPreconditionFailed returns true when the error is a Tencent COS HTTP 412 +// (PreconditionFailed), which is what COS returns for If-None-Match: "*" when +// the object already exists. +func isCOSPreconditionFailed(err error) bool { + var cosErr *cos.ErrorResponse + if errors.As(err, &cosErr) && cosErr.Response != nil && cosErr.Response.StatusCode == http.StatusPreconditionFailed { + return true + } + return false +} + func (c *COS) CopyObject(ctx context.Context, srcSize int64, srcBucket, srcKey, dstKey string) (int64, error) { return 0, errors.Errorf("CopyObject not implemented for %s", c.Kind()) } @@ -394,8 +438,7 @@ func (c *COS) deleteKeysConcurrent(ctx context.Context, keys []string) error { _, err := c.client.Object.Delete(ctx, key) if err != nil { // Check if it's a "not found" error - that's OK - var cosErr *cos.ErrorResponse - if errors.As(err, &cosErr) && cosErr.Code == "NoSuchKey" { + if cosIsNotFound(err) { mu.Lock() deletedCount++ mu.Unlock() diff --git a/pkg/storage/errors_test.go b/pkg/storage/errors_test.go new file mode 100644 index 00000000..5aa703d9 --- /dev/null +++ b/pkg/storage/errors_test.go @@ -0,0 +1,161 @@ +package storage + +// Tests that each backend maps its "object not found" errors to the public +// ErrNotFound sentinel. The goal is to lock the intent so that accidentally +// removing or changing the not-found check causes a test failure. +// +// Backends where the classification is buried inside an exported method that +// requires a live connection use t.Skip with a pointer to the integration test +// that provides the load-bearing coverage. + +import ( + "context" + "errors" + "net/http" + "net/http/httptest" + "net/textproto" + "testing" + + "github.com/Altinity/clickhouse-backup/v2/pkg/config" + "github.com/aws/aws-sdk-go-v2/aws" + "github.com/aws/aws-sdk-go-v2/credentials" + "github.com/aws/aws-sdk-go-v2/service/s3" + cos "github.com/tencentyun/cos-go-sdk-v5" +) + +func TestStorage_NotFoundClassification(t *testing.T) { + + // ── S3 ──────────────────────────────────────────────────────────────────── + // Spin up a minimal httptest server that always returns HTTP 404, wire a + // real aws-sdk-go-v2 s3.Client at it, and exercise StatFileAbsolute. This + // calls the actual production code path (pkg/storage/s3.go:786-806). + t.Run("s3", func(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusNotFound) + })) + defer srv.Close() + + s3Client := s3.New(s3.Options{ + Region: "us-east-1", + Credentials: credentials.NewStaticCredentialsProvider( + "test-key", "test-secret", "", + ), + HTTPClient: srv.Client(), + BaseEndpoint: aws.String(srv.URL), + // Path-style so the bucket name goes in the URL path, not the host, + // which works correctly against a single-host test server. + UsePathStyle: true, + }) + + backend := &S3{ + client: s3Client, + Config: &config.S3Config{ + Bucket: "test-bucket", + Region: "us-east-1", + }, + } + + _, err := backend.StatFileAbsolute(context.Background(), "does/not/exist") + if !errors.Is(err, ErrNotFound) { + t.Fatalf("S3 StatFileAbsolute with 404 response: got %v, want ErrNotFound", err) + } + }) + + // ── Azure Blob ──────────────────────────────────────────────────────────── + // The azure-storage-blob-go SDK wraps the not-found condition in a private + // *storageError struct whose constructor reads live HTTP response headers; + // there is no public constructor that accepts an arbitrary service code. + // The classification (pkg/storage/azblob.go:317,361) is therefore only + // testable end-to-end. + // Integration coverage: TestIntegrationAzureBlob / TestAzureBlob_StatFile + // in test/integration/. + t.Run("azblob", func(t *testing.T) { + t.Skip("azblob: storageError is a private type with no public constructor; " + + "not-found mapping is covered by integration tests " + + "(TestIntegrationAzureBlob / TestAzureBlob_StatFile)") + }) + + // ── GCS ─────────────────────────────────────────────────────────────────── + // The GCS path (pkg/storage/gcs.go) maps cloud.google.com/go/storage + // ErrObjectNotExist → ErrNotFound via the production helper gcsIsNotFound. + // The GCS client pools require live auth, so we verify the sentinel identity + // directly by calling the production helper rather than StatFileAbsolute. + // + // Integration coverage: TestIntegrationGCS / TestGCS_StatFile in + // test/integration/. + t.Run("gcs", func(t *testing.T) { + // Import-path note: "cloud.google.com/go/storage" is imported as + // "storage" in gcs.go but we access it here via the alias defined + // in gcs_testhelper_test.go (see gcsErrObjectNotExist below). + syntheticErr := gcsErrObjectNotExist() // sentinel from helper below + if !gcsIsNotFound(syntheticErr) { + t.Fatalf("GCS not-found classification: gcsIsNotFound(%v) = false, want true", syntheticErr) + } + }) + + // ── COS ─────────────────────────────────────────────────────────────────── + // The COS path (pkg/storage/cos.go) checks cosErr.Code == "NoSuchKey" via + // the production helper cosIsNotFound. cos.ErrorResponse is a public struct, + // so we can construct a synthetic one and feed it directly to the production + // helper. + t.Run("cos", func(t *testing.T) { + syntheticErr := &cos.ErrorResponse{ + Response: &http.Response{ + StatusCode: http.StatusNotFound, + Header: make(http.Header), + Body: http.NoBody, + Request: &http.Request{}, + }, + Code: "NoSuchKey", + Message: "The specified key does not exist.", + } + + if !cosIsNotFound(syntheticErr) { + t.Fatalf("COS not-found classification: cosIsNotFound(%v) = false, want true", syntheticErr) + } + }) + + // ── SFTP ────────────────────────────────────────────────────────────────── + // The SFTP path (pkg/storage/sftp.go:111) calls sftp.sftpClient.Stat which + // requires a live SFTP connection. The not-found check is a string match + // (strings.Contains(err.Error(), "not exist")) applied to errors returned + // by the SSH/SFTP library; there is no way to inject an error without + // dialling a server. + // + // Integration coverage: TestIntegrationSFTP / TestSFTP_StatFile in + // test/integration/. + t.Run("sftp", func(t *testing.T) { + t.Skip("sftp: StatFileAbsolute calls sftpClient.Stat which requires a live " + + "SFTP connection; covered by integration tests " + + "(TestIntegrationSFTP / TestSFTP_StatFile)") + }) + + // ── FTP ─────────────────────────────────────────────────────────────────── + // The FTP path (pkg/storage/ftp.go) uses the production helper ftpIsNotFound + // which checks strings.HasPrefix(err.Error(), "550") for List/Delete errors. + // Both classification and the "file not found in entries list" path happen + // inside StatFileAbsolute after getConnectionFromPool (which dials a live + // FTP server). We exercise the production helper directly using a synthetic + // textproto.Error (the exact type returned by github.com/jlaffaye/ftp). + t.Run("ftp", func(t *testing.T) { + // Verify the production helper classifies a 550 error as not-found. + err550 := &textproto.Error{Code: 550, Msg: "No such file or directory"} + if !ftpIsNotFound(err550) { + t.Fatalf("FTP not-found classification (550): ftpIsNotFound(%v) = false, want true", err550) + } + + // Verify that a non-550 error is NOT classified as not-found. + err530 := &textproto.Error{Code: 530, Msg: "Not logged in"} + if ftpIsNotFound(err530) { + t.Fatal("FTP non-550 error was incorrectly classified as not-found") + } + }) +} + +// gcsErrObjectNotExist returns the GCS sentinel that the production code +// compares against in gcsIsNotFound (gcs.go). It lives in a separate file +// (gcs_testhelper_test.go) so that the cloud.google.com/go/storage import +// does not collide with the package-level "storage" identifier here. +func gcsErrObjectNotExist() error { + return gcsGetErrObjectNotExist() +} diff --git a/pkg/storage/ftp.go b/pkg/storage/ftp.go index 0f01c5e0..e28ea36e 100644 --- a/pkg/storage/ftp.go +++ b/pkg/storage/ftp.go @@ -2,7 +2,9 @@ package storage import ( "context" + "crypto/rand" "crypto/tls" + "encoding/hex" "fmt" "io" "os" @@ -19,11 +21,18 @@ import ( "golang.org/x/sync/errgroup" ) +// ftpIsNotFound reports whether err is a 550 response from the FTP server, +// which all paths in this backend treat as "object/directory does not exist". +func ftpIsNotFound(err error) bool { + return err != nil && strings.HasPrefix(err.Error(), "550") +} + type FTP struct { - clients *pool.ObjectPool - Config *config.FTPConfig - dirCache map[string]bool - dirCacheMutex sync.RWMutex + clients *pool.ObjectPool + Config *config.FTPConfig + dirCache map[string]bool + dirCacheMutex sync.RWMutex + AllowUnsafeMarkers bool } func (f *FTP) Kind() string { @@ -101,7 +110,7 @@ func (f *FTP) StatFileAbsolute(ctx context.Context, key string) (RemoteFile, err entries, err := client.List(dir) if err != nil { // proftpd return 550 error if `dir` not exists - if strings.HasPrefix(err.Error(), "550") { + if ftpIsNotFound(err) { return nil, ErrNotFound } return nil, errors.WithMessage(err, "FTP StatFileAbsolute List") @@ -128,7 +137,33 @@ func (f *FTP) DeleteFile(ctx context.Context, key string) error { if err != nil { return errors.WithMessage(err, "FTP DeleteFile getConnection") } - if err := client.RemoveDirRecur(path.Join(f.Config.Path, key)); err != nil { + fullPath := path.Join(f.Config.Path, key) + // Determine whether the target is a file or directory so we can use + // the appropriate deletion primitive: + // - Regular file → client.Delete (DELE), which is correct for marker files. + // - Directory → RemoveDirRecur (recursive CWD+LIST+DELETE+RMD). + // - 550 (missing) → no-op (idempotent delete, same as S3/GCS/AzBlob/SFTP). + // + // We cannot use RemoveDirRecur for files: it calls ChangeDir first, which + // fails with 550 when given a file path — proftpd cannot CWD into a file. + // Using FileSize is the cheapest "is it a file?" probe; it returns 550 on + // directories too, so we then fall through to RemoveDirRecur. + if _, statErr := client.FileSize(fullPath); statErr == nil { + // It's a regular file — delete directly. + if delErr := client.Delete(fullPath); delErr != nil { + if ftpIsNotFound(delErr) { + return nil // raced with concurrent delete; treat as no-op + } + return errors.WithMessage(delErr, "FTP DeleteFile Delete") + } + return nil + } + // Either a directory or it doesn't exist. Try RemoveDirRecur and treat + // 550 (not found / not a directory) as a successful no-op. + if err := client.RemoveDirRecur(fullPath); err != nil { + if ftpIsNotFound(err) { + return nil + } return errors.WithMessage(err, "FTP DeleteFile RemoveDirRecur") } return nil @@ -149,7 +184,7 @@ func (f *FTP) WalkAbsolute(ctx context.Context, prefix string, recursive bool, p f.returnConnectionToPool(ctx, "Walk", client) if err != nil { // proftpd return 550 error if prefix not exits - if strings.HasPrefix(err.Error(), "550") { + if ftpIsNotFound(err) { return nil } return errors.WithMessage(err, "FTP WalkAbsolute List") @@ -172,6 +207,13 @@ func (f *FTP) WalkAbsolute(ctx context.Context, prefix string, recursive bool, p walker := client.Walk(prefix) for walker.Next() { if err := walker.Err(); err != nil { + // proftpd returns 550 when the prefix doesn't exist (e.g., + // CAS cold-list walking blob// before any upload). + // Return empty, not an error — same semantics as the + // non-recursive path above and as S3/GCS/AzBlob/SFTP. + if ftpIsNotFound(err) { + return nil + } return errors.WithMessage(err, "FTP WalkAbsolute walker.Err") } entry := walker.Stat() @@ -232,6 +274,61 @@ func (f *FTP) PutFileAbsolute(ctx context.Context, key string, r io.ReadCloser, return nil } +// PutFileAbsoluteIfAbsent atomically creates the file at key only if it +// doesn't already exist. FTP has no portable atomic-create primitive; by +// default we refuse with ErrConditionalPutNotSupported. With +// AllowUnsafeMarkers=true, fall back to STAT → STOR-to-tmp → RNFR/RNTO, +// which has a small race window between STAT and RNTO. Log a per-call +// WARN so operators see the documented race in their logs. +func (f *FTP) PutFileAbsoluteIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + if !f.AllowUnsafeMarkers { + return false, ErrConditionalPutNotSupported + } + where := fmt.Sprintf("PutFileAbsoluteIfAbsent->%s", key) + client, err := f.getConnectionFromPool(ctx, where) + if err != nil { + return false, errors.WithMessage(err, "FTP PutFileAbsoluteIfAbsent getConnection") + } + defer f.returnConnectionToPool(ctx, where, client) + // STAT: does the target already exist? + if _, statErr := client.FileSize(key); statErr == nil { + return false, nil + } + // Best-effort fallback: write to a temp filename, then rename. + log.Warn().Str("key", key).Msg("FTP PutFileAbsoluteIfAbsent: best-effort path (cas.allow_unsafe_markers=true); small race window between STAT and RNTO") + if err := f.MkdirAll(path.Dir(key), client); err != nil { + return false, errors.WithMessage(err, "FTP PutFileAbsoluteIfAbsent MkdirAll") + } + tmpKey := key + ".tmp." + randomFTPSuffix() + if err := client.Stor(tmpKey, r); err != nil { + _ = client.Delete(tmpKey) + return false, errors.WithMessage(err, "FTP PutFileAbsoluteIfAbsent Stor") + } + // Re-check: did someone else create the target while we were writing? + if _, statErr := client.FileSize(key); statErr == nil { + _ = client.Delete(tmpKey) + return false, nil + } + if err := client.Rename(tmpKey, key); err != nil { + _ = client.Delete(tmpKey) + return false, errors.WithMessage(err, "FTP PutFileAbsoluteIfAbsent Rename") + } + return true, nil +} + +// PutFileIfAbsent is the path-prefixed variant of PutFileAbsoluteIfAbsent. +// It prepends f.Config.Path to key, matching PutFile semantics. +func (f *FTP) PutFileIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + return f.PutFileAbsoluteIfAbsent(ctx, path.Join(f.Config.Path, key), r, localSize) +} + +// randomFTPSuffix returns 8 random hex characters for unique temp filenames. +func randomFTPSuffix() string { + var b [4]byte + _, _ = rand.Read(b[:]) + return hex.EncodeToString(b[:]) +} + func (f *FTP) CopyObject(ctx context.Context, srcSize int64, srcBucket, srcKey, dstKey string) (int64, error) { return 0, errors.Errorf("CopyObject not implemented for %s", f.Kind()) } @@ -308,7 +405,7 @@ func (f *FTP) deleteKeysConcurrent(ctx context.Context, keys []string) error { err = client.RemoveDirRecur(key) if err != nil { // Check if it's a "not found" error - that's OK - if strings.HasPrefix(err.Error(), "550") { + if ftpIsNotFound(err) { mu.Lock() deletedCount++ mu.Unlock() diff --git a/pkg/storage/gcs.go b/pkg/storage/gcs.go index 056fd8e6..6d79e947 100644 --- a/pkg/storage/gcs.go +++ b/pkg/storage/gcs.go @@ -4,6 +4,7 @@ import ( "context" "crypto/tls" "encoding/base64" + stderrors "errors" "fmt" "io" "net" @@ -20,6 +21,7 @@ import ( "cloud.google.com/go/storage" "github.com/rs/zerolog/log" "golang.org/x/sync/errgroup" + "google.golang.org/api/googleapi" "google.golang.org/api/impersonate" "google.golang.org/api/iterator" "google.golang.org/api/option" @@ -378,6 +380,58 @@ func (gcs *GCS) PutFileAbsolute(ctx context.Context, key string, r io.ReadCloser return nil } +// PutFileAbsoluteIfAbsent atomically creates the object at key only if it +// doesn't already exist, using GCS's DoesNotExist precondition (translates +// to x-goog-if-generation-match: 0). Returns (false, nil) on 412. +func (gcs *GCS) PutFileAbsoluteIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + pClientObj, err := gcs.clientPool.BorrowObject(ctx) + if err != nil { + log.Error().Msgf("gcs.PutFileAbsoluteIfAbsent: gcs.clientPool.BorrowObject error: %+v", err) + return false, errors.WithMessage(err, "GCS PutFileAbsoluteIfAbsent BorrowObject") + } + defer func() { + if retErr := gcs.clientPool.ReturnObject(ctx, pClientObj); retErr != nil { + log.Warn().Msgf("gcs.PutFileAbsoluteIfAbsent: gcs.clientPool.ReturnObject error: %+v", retErr) + } + }() + pClient := pClientObj.(*clientObject).Client + obj := pClient.Bucket(gcs.Config.Bucket).Object(key).If(storage.Conditions{DoesNotExist: true}) + w := obj.NewWriter(ctx) + w.ChunkSize = gcs.Config.ChunkSize + if gcs.Config.StorageClass != "" { + w.StorageClass = gcs.Config.StorageClass + } + if len(gcs.Config.ObjectLabels) > 0 { + w.Metadata = gcs.Config.ObjectLabels + } + buffer := make([]byte, 128*1024) + if _, err = io.CopyBuffer(w, r, buffer); err != nil { + _ = w.Close() + _ = r.Close() + return false, errors.WithMessage(err, "GCS PutFileAbsoluteIfAbsent CopyBuffer") + } + _ = r.Close() + if err = w.Close(); err != nil { + var ae *googleapi.Error + if stderrors.As(err, &ae) && ae.Code == http.StatusPreconditionFailed { + return false, nil + } + return false, errors.WithMessage(err, "GCS PutFileAbsoluteIfAbsent Close") + } + return true, nil +} + +// PutFileIfAbsent is the path-prefixed variant of PutFileAbsoluteIfAbsent. +// It prepends gcs.Config.Path to key, matching PutFile semantics. +func (gcs *GCS) PutFileIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + return gcs.PutFileAbsoluteIfAbsent(ctx, path.Join(gcs.Config.Path, key), r, localSize) +} + +// gcsIsNotFound reports whether err means "object does not exist" in GCS. +func gcsIsNotFound(err error) bool { + return errors.Is(err, storage.ErrObjectNotExist) +} + func (gcs *GCS) StatFile(ctx context.Context, key string) (RemoteFile, error) { return gcs.StatFileAbsolute(ctx, path.Join(gcs.Config.Path, key)) } @@ -400,7 +454,7 @@ func (gcs *GCS) StatFileAbsolute(ctx context.Context, key string) (RemoteFile, e objAttr, err = obj.Attrs(ctx) } if err != nil { - if errors.Is(err, storage.ErrObjectNotExist) { + if gcsIsNotFound(err) { return nil, ErrNotFound } return nil, errors.WithMessage(err, "GCS StatFileAbsolute Attrs") @@ -499,7 +553,7 @@ func (gcs *GCS) deleteKeysConcurrent(ctx context.Context, keys []string) error { err = object.Delete(ctx) if err != nil { // Check if it's a "not found" error - that's OK - if errors.Is(err, storage.ErrObjectNotExist) { + if gcsIsNotFound(err) { if pErr := gcs.clientPool.ReturnObject(ctx, pClientObj); pErr != nil { log.Warn().Msgf("gcs.deleteKeysConcurrent: gcs.clientPool.ReturnObject error: %+v", pErr) } diff --git a/pkg/storage/gcs_testhelper_test.go b/pkg/storage/gcs_testhelper_test.go new file mode 100644 index 00000000..dfbea0c8 --- /dev/null +++ b/pkg/storage/gcs_testhelper_test.go @@ -0,0 +1,13 @@ +package storage + +// gcsGetErrObjectNotExist returns the cloud.google.com/go/storage.ErrObjectNotExist +// sentinel. It lives in this file so that the gcs-storage import alias does not +// conflict with the package-level "storage" name in errors_test.go. + +import ( + gcsStorage "cloud.google.com/go/storage" +) + +func gcsGetErrObjectNotExist() error { + return gcsStorage.ErrObjectNotExist +} diff --git a/pkg/storage/general.go b/pkg/storage/general.go index 5ee16dc1..505731cc 100644 --- a/pkg/storage/general.go +++ b/pkg/storage/general.go @@ -216,7 +216,11 @@ func (bd *BackupDestination) saveMetadataCache(ctx context.Context, listCache ma } } -func (bd *BackupDestination) BackupList(ctx context.Context, parseMetadata bool, parseMetadataOnly string) ([]Backup, error) { +// BackupList enumerates backup folders under the bucket root. skipPrefixes +// lists object-key prefixes the walker must ignore — used to exclude the +// CAS subtree (cas//...) which v1 must not interpret as broken +// v1 backups. Pass nil when CAS is disabled. +func (bd *BackupDestination) BackupList(ctx context.Context, parseMetadata bool, parseMetadataOnly string, skipPrefixes []string) ([]Backup, error) { backupListStart := time.Now() defer func() { log.Info().Dur("list_duration", time.Since(backupListStart)).Send() @@ -234,6 +238,21 @@ func (bd *BackupDestination) BackupList(ctx context.Context, parseMetadata bool, cacheMiss := false err = bd.Walk(ctx, "/", false, func(ctx context.Context, o RemoteFile) error { backupName := strings.Trim(o.Name(), "/") + // Skip any top-level entry whose name matches a configured skip + // prefix (e.g. "cas/" when CAS is enabled). The Walk runs at depth + // 0 with recursive=false, so o.Name() is a single path segment; + // match by trimmed-equality against a trimmed prefix as well as + // the literal HasPrefix to be defensive across backends. + for _, p := range skipPrefixes { + if p == "" { + continue + } + trimmed := strings.TrimSuffix(p, "/") + if backupName == trimmed || strings.HasPrefix(o.Name(), p) { + log.Error().Str("name", o.Name()).Str("matched_prefix", p).Msg("BackupList: skipping entry that matches a CAS skip prefix; rename or move if it was an unrelated v1 backup") + return nil + } + } if !parseMetadata || (parseMetadataOnly != "" && parseMetadataOnly != backupName) { if cachedMetadata, isCached := listCache[backupName]; isCached { result = append(result, cachedMetadata) @@ -688,7 +707,8 @@ func NewBackupDestination(ctx context.Context, cfg *config.Config, ch *clickhous return nil, errors.WithMessage(err, "NewBackupDestination ftp ApplyMacros ObjectDiskPath") } ftpStorage := &FTP{ - Config: &cfg.FTP, + Config: &cfg.FTP, + AllowUnsafeMarkers: cfg.CAS.AllowUnsafeMarkers, } return &BackupDestination{ ftpStorage, diff --git a/pkg/storage/general_test.go b/pkg/storage/general_test.go new file mode 100644 index 00000000..457c8634 --- /dev/null +++ b/pkg/storage/general_test.go @@ -0,0 +1,161 @@ +package storage + +// TestBackupList_SkipPrefixesFiltering verifies that BackupList correctly +// skips top-level entries whose names match a configured CAS skip-prefix, +// and that entries that merely start with the same letters (but don't match +// the trimmed prefix exactly) are NOT filtered. +// +// The test exercises the logic added in Wave 6.A around line 246 of general.go. + +import ( + "context" + "io" + "testing" + "time" +) + +// fakeRemoteFile is a minimal RemoteFile implementation for tests. +type fakeRemoteFile struct { + name string + size int64 + modTime time.Time +} + +func (f fakeRemoteFile) Name() string { return f.name } +func (f fakeRemoteFile) Size() int64 { return f.size } +func (f fakeRemoteFile) LastModified() time.Time { return f.modTime } + +// fakeRemoteStorage is a minimal RemoteStorage that only implements Walk and +// Kind; every other method panics or returns a safe error. This is sufficient +// for BackupList's non-parseMetadata path (parseMetadataOnly == some-name that +// doesn't match any entry, so we stay in the early-return branch). +type fakeRemoteStorage struct { + entries []fakeRemoteFile +} + +func (f *fakeRemoteStorage) Kind() string { return "fake" } + +func (f *fakeRemoteStorage) Walk(_ context.Context, _ string, _ bool, fn func(context.Context, RemoteFile) error) error { + for _, e := range f.entries { + if err := fn(context.Background(), e); err != nil { + return err + } + } + return nil +} + +// WalkAbsolute delegates to Walk for test simplicity. +func (f *fakeRemoteStorage) WalkAbsolute(ctx context.Context, _ string, recursive bool, fn func(context.Context, RemoteFile) error) error { + return f.Walk(ctx, "", recursive, fn) +} + +func (f *fakeRemoteStorage) Connect(_ context.Context) error { return nil } +func (f *fakeRemoteStorage) Close(_ context.Context) error { return nil } + +func (f *fakeRemoteStorage) StatFile(_ context.Context, _ string) (RemoteFile, error) { + return nil, ErrNotFound +} +func (f *fakeRemoteStorage) StatFileAbsolute(_ context.Context, _ string) (RemoteFile, error) { + return nil, ErrNotFound +} +func (f *fakeRemoteStorage) DeleteFile(_ context.Context, _ string) error { return nil } +func (f *fakeRemoteStorage) DeleteFileFromObjectDiskBackup(_ context.Context, _ string) error { + return nil +} +func (f *fakeRemoteStorage) GetFileReader(_ context.Context, _ string) (io.ReadCloser, error) { + return nil, ErrNotFound +} +func (f *fakeRemoteStorage) GetFileReaderAbsolute(_ context.Context, _ string) (io.ReadCloser, error) { + return nil, ErrNotFound +} +func (f *fakeRemoteStorage) GetFileReaderWithLocalPath(_ context.Context, _, _ string, _ int64) (io.ReadCloser, error) { + return nil, ErrNotFound +} +func (f *fakeRemoteStorage) PutFile(_ context.Context, _ string, r io.ReadCloser, _ int64) error { + _ = r.Close() + return nil +} +func (f *fakeRemoteStorage) PutFileAbsolute(_ context.Context, _ string, r io.ReadCloser, _ int64) error { + _ = r.Close() + return nil +} +func (f *fakeRemoteStorage) PutFileAbsoluteIfAbsent(_ context.Context, _ string, r io.ReadCloser, _ int64) (bool, error) { + _ = r.Close() + return true, nil +} +func (f *fakeRemoteStorage) PutFileIfAbsent(_ context.Context, _ string, r io.ReadCloser, _ int64) (bool, error) { + _ = r.Close() + return true, nil +} +func (f *fakeRemoteStorage) CopyObject(_ context.Context, _ int64, _, _, _ string) (int64, error) { + return 0, nil +} + +// fakeBackupDest builds a BackupDestination backed by fakeRemoteStorage with +// the given top-level entries. compressionFormat is set to "tar" so the walk +// doesn't complain about extension mismatches (irrelevant in non-parseMetadata path). +func fakeBackupDest(entries []fakeRemoteFile) *BackupDestination { + return &BackupDestination{ + RemoteStorage: &fakeRemoteStorage{entries: entries}, + compressionFormat: "tar", + } +} + +func TestBackupList_SkipPrefixesFiltering(t *testing.T) { + now := time.Now() + entries := []fakeRemoteFile{ + {name: "cas/", size: 0, modTime: now}, // should be skipped when prefix="cas/" + {name: "v1backup-1", size: 0, modTime: now}, // must NOT be skipped + {name: "v1backup-2", size: 0, modTime: now}, // must NOT be skipped + {name: "casematch", size: 0, modTime: now}, // must NOT be skipped ("cas" prefix but no trailing slash) + } + bd := fakeBackupDest(entries) + + // Case 1: skipPrefixes=["cas/"] — only v1backup-1, v1backup-2, casematch. + got, err := bd.BackupList(context.Background(), false, "__nonexistent__", []string{"cas/"}) + if err != nil { + t.Fatalf("BackupList case1: %v", err) + } + if len(got) != 3 { + names := make([]string, len(got)) + for i, b := range got { + names[i] = b.BackupName + } + t.Errorf("case1: got %d entries %v, want 3 (v1backup-1, v1backup-2, casematch)", len(got), names) + } + for _, b := range got { + if b.BackupName == "cas" || b.BackupName == "cas/" { + t.Errorf("case1: CAS prefix entry %q should have been filtered", b.BackupName) + } + } + + // "casematch" must survive (it's a valid v1 backup, just happens to share a prefix). + found := false + for _, b := range got { + if b.BackupName == "casematch" { + found = true + break + } + } + if !found { + t.Error("case1: 'casematch' was incorrectly filtered by the CAS prefix check") + } + + // Case 2: skipPrefixes=nil — all four entries pass through. + got2, err := bd.BackupList(context.Background(), false, "__nonexistent__", nil) + if err != nil { + t.Fatalf("BackupList case2: %v", err) + } + if len(got2) != 4 { + t.Errorf("case2: got %d entries, want 4 (nil skipPrefixes should pass all)", len(got2)) + } + + // Case 3: skipPrefixes=[""] — empty string matches nothing defensively. + got3, err := bd.BackupList(context.Background(), false, "__nonexistent__", []string{""}) + if err != nil { + t.Fatalf("BackupList case3: %v", err) + } + if len(got3) != 4 { + t.Errorf("case3: got %d entries, want 4 (empty-string prefix should skip nothing)", len(got3)) + } +} diff --git a/pkg/storage/s3.go b/pkg/storage/s3.go index db8af32f..053669fb 100644 --- a/pkg/storage/s3.go +++ b/pkg/storage/s3.go @@ -5,6 +5,7 @@ import ( "context" "crypto/tls" "encoding/base64" + stderrors "errors" "fmt" "hash/crc32" "io" @@ -18,6 +19,7 @@ import ( "github.com/Altinity/clickhouse-backup/v2/pkg/config" "github.com/aws/aws-sdk-go-v2/aws" + awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http" v4 "github.com/aws/aws-sdk-go-v2/aws/signer/v4" awsV2Config "github.com/aws/aws-sdk-go-v2/config" "github.com/aws/aws-sdk-go-v2/credentials" @@ -298,22 +300,16 @@ func (s *S3) PutFile(ctx context.Context, key string, r io.ReadCloser, localSize return s.PutFileAbsolute(ctx, path.Join(s.Config.Path, key), r, localSize) } -func (s *S3) PutFileAbsolute(ctx context.Context, key string, r io.ReadCloser, localSize int64) error { - params := s3.PutObjectInput{ - Bucket: aws.String(s.Config.Bucket), - Key: aws.String(key), - Body: r, - StorageClass: s3types.StorageClass(strings.ToUpper(s.Config.StorageClass)), - } - if s.Config.CheckSumAlgorithm != "" { - params.ChecksumAlgorithm = s3types.ChecksumAlgorithm(s.Config.CheckSumAlgorithm) - } - - // ACL shall be optional, fix https://github.com/Altinity/clickhouse-backup/issues/785 +// applyPutObjectEncryption mirrors the SSE / KMS / ACL / object-tag fields +// from s.Config onto a PutObjectInput. Used by both the multipart-upload path +// (PutFileAbsolute) and the conditional-PUT path (PutFileAbsoluteIfAbsent) so +// marker writes inherit the same encryption context as data uploads. +// +// Operates on the input pointer in-place; nil-safe for unset config fields. +func (s *S3) applyPutObjectEncryption(p *s3.PutObjectInput) { if s.Config.ACL != "" { - params.ACL = s3types.ObjectCannedACL(s.Config.ACL) + p.ACL = s3types.ObjectCannedACL(s.Config.ACL) } - // https://github.com/Altinity/clickhouse-backup/issues/588 if len(s.Config.ObjectLabels) > 0 { tags := "" for k, v := range s.Config.ObjectLabels { @@ -322,26 +318,39 @@ func (s *S3) PutFileAbsolute(ctx context.Context, key string, r io.ReadCloser, l } tags += k + "=" + v } - params.Tagging = aws.String(tags) + p.Tagging = aws.String(tags) } if s.Config.SSE != "" { - params.ServerSideEncryption = s3types.ServerSideEncryption(s.Config.SSE) + p.ServerSideEncryption = s3types.ServerSideEncryption(s.Config.SSE) } if s.Config.SSEKMSKeyId != "" { - params.SSEKMSKeyId = aws.String(s.Config.SSEKMSKeyId) + p.SSEKMSKeyId = aws.String(s.Config.SSEKMSKeyId) } if s.Config.SSECustomerAlgorithm != "" { - params.SSECustomerAlgorithm = aws.String(s.Config.SSECustomerAlgorithm) + p.SSECustomerAlgorithm = aws.String(s.Config.SSECustomerAlgorithm) } if s.Config.SSECustomerKey != "" { - params.SSECustomerKey = aws.String(s.Config.SSECustomerKey) + p.SSECustomerKey = aws.String(s.Config.SSECustomerKey) } if s.Config.SSECustomerKeyMD5 != "" { - params.SSECustomerKeyMD5 = aws.String(s.Config.SSECustomerKeyMD5) + p.SSECustomerKeyMD5 = aws.String(s.Config.SSECustomerKeyMD5) } if s.Config.SSEKMSEncryptionContext != "" { - params.SSEKMSEncryptionContext = aws.String(s.Config.SSEKMSEncryptionContext) + p.SSEKMSEncryptionContext = aws.String(s.Config.SSEKMSEncryptionContext) + } +} + +func (s *S3) PutFileAbsolute(ctx context.Context, key string, r io.ReadCloser, localSize int64) error { + params := s3.PutObjectInput{ + Bucket: aws.String(s.Config.Bucket), + Key: aws.String(key), + Body: r, + StorageClass: s3types.StorageClass(strings.ToUpper(s.Config.StorageClass)), + } + if s.Config.CheckSumAlgorithm != "" { + params.ChecksumAlgorithm = s3types.ChecksumAlgorithm(s.Config.CheckSumAlgorithm) } + s.applyPutObjectEncryption(¶ms) var partSize int64 if s.Config.ChunkSize > 0 && (localSize+s.Config.ChunkSize-1)/s.Config.ChunkSize < s.Config.MaxPartsCount { partSize = s.Config.ChunkSize @@ -369,6 +378,62 @@ func (s *S3) PutFileAbsolute(ctx context.Context, key string, r io.ReadCloser, l return nil } +// PutFileAbsoluteIfAbsent atomically creates the object at key only if it +// doesn't already exist. Uses the AWS S3 IfNoneMatch precondition +// (supported since Nov 2024; MinIO ≥ RELEASE.2024-11). Always uses the +// single-PUT path (markers are tiny); multipart uploads aren't compatible +// with IfNoneMatch on PutObject. +func (s *S3) PutFileAbsoluteIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + body, err := io.ReadAll(r) + _ = r.Close() + if err != nil { + return false, errors.WithMessage(err, "S3 PutFileAbsoluteIfAbsent ReadAll") + } + params := &s3.PutObjectInput{ + Bucket: aws.String(s.Config.Bucket), + Key: aws.String(key), + Body: bytes.NewReader(body), + StorageClass: s3types.StorageClass(strings.ToUpper(s.Config.StorageClass)), + IfNoneMatch: aws.String("*"), + } + // Apply the same SSE / KMS / ACL / checksum fields the multipart path uses + // (see PutFileAbsolute) so a marker write inherits the configured + // encryption context. Otherwise SSE-C / KMS-encryption-context configs that + // require the headers on every PUT will reject conditional writes or + // produce objects with mismatched encryption attributes. + s.applyPutObjectEncryption(params) + if _, err := s.client.PutObject(ctx, params); err != nil { + if isS3PreconditionFailed(err) { + return false, nil + } + return false, errors.WithMessage(err, "S3 PutFileAbsoluteIfAbsent PutObject") + } + return true, nil +} + +// PutFileIfAbsent is the path-prefixed variant of PutFileAbsoluteIfAbsent. +// It prepends s.Config.Path to key, matching PutFile semantics. +func (s *S3) PutFileIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + return s.PutFileAbsoluteIfAbsent(ctx, path.Join(s.Config.Path, key), r, localSize) +} + +// isS3PreconditionFailed returns true if err corresponds to S3 +// PreconditionFailed (HTTP 412), which is what IfNoneMatch returns when +// the target object already exists. +func isS3PreconditionFailed(err error) bool { + var apiErr smithy.APIError + if stderrors.As(err, &apiErr) { + if apiErr.ErrorCode() == "PreconditionFailed" { + return true + } + } + var respErr *awshttp.ResponseError + if stderrors.As(err, &respErr) && respErr.HTTPStatusCode() == http.StatusPreconditionFailed { + return true + } + return false +} + func (s *S3) putFileMultipartCRC32(ctx context.Context, putParams *s3.PutObjectInput, r io.Reader, localSize, partSize int64) error { createParams := &s3.CreateMultipartUploadInput{ Bucket: putParams.Bucket, diff --git a/pkg/storage/s3_test.go b/pkg/storage/s3_test.go index 4c40c40a..1bc0322e 100644 --- a/pkg/storage/s3_test.go +++ b/pkg/storage/s3_test.go @@ -5,6 +5,8 @@ import ( "fmt" "testing" + "github.com/Altinity/clickhouse-backup/v2/pkg/config" + "github.com/aws/aws-sdk-go-v2/service/s3" "github.com/aws/smithy-go" ) @@ -62,3 +64,52 @@ func TestIsDeleteObjectsMissingContentMD5Error(t *testing.T) { }) } } + +func TestApplyPutObjectEncryption_PreservesAllSSEFields(t *testing.T) { + s := &S3{Config: &config.S3Config{ + ACL: "bucket-owner-full-control", + SSE: "aws:kms", + SSEKMSKeyId: "alias/my-key", + SSECustomerAlgorithm: "AES256", + SSECustomerKey: "raw-key-material", + SSECustomerKeyMD5: "key-md5", + SSEKMSEncryptionContext: "ctx-base64", + ObjectLabels: map[string]string{"env": "prod"}, + }} + p := &s3.PutObjectInput{} + s.applyPutObjectEncryption(p) + + if p.ACL != "bucket-owner-full-control" { + t.Errorf("ACL: %q", p.ACL) + } + if p.ServerSideEncryption != "aws:kms" { + t.Errorf("SSE: %q", p.ServerSideEncryption) + } + if p.SSEKMSKeyId == nil || *p.SSEKMSKeyId != "alias/my-key" { + t.Errorf("SSEKMSKeyId: %v", p.SSEKMSKeyId) + } + if p.SSECustomerAlgorithm == nil || *p.SSECustomerAlgorithm != "AES256" { + t.Errorf("SSECustomerAlgorithm: %v", p.SSECustomerAlgorithm) + } + if p.SSECustomerKey == nil || *p.SSECustomerKey != "raw-key-material" { + t.Errorf("SSECustomerKey: %v", p.SSECustomerKey) + } + if p.SSECustomerKeyMD5 == nil || *p.SSECustomerKeyMD5 != "key-md5" { + t.Errorf("SSECustomerKeyMD5: %v", p.SSECustomerKeyMD5) + } + if p.SSEKMSEncryptionContext == nil || *p.SSEKMSEncryptionContext != "ctx-base64" { + t.Errorf("SSEKMSEncryptionContext: %v", p.SSEKMSEncryptionContext) + } + if p.Tagging == nil || *p.Tagging != "env=prod" { + t.Errorf("Tagging: %v", p.Tagging) + } +} + +func TestApplyPutObjectEncryption_NilSafe(t *testing.T) { + s := &S3{Config: &config.S3Config{}} // no fields set + p := &s3.PutObjectInput{} + s.applyPutObjectEncryption(p) + if p.SSEKMSKeyId != nil || p.SSECustomerKey != nil || p.Tagging != nil { + t.Error("expected all fields to remain unset when config has no values") + } +} diff --git a/pkg/storage/sftp.go b/pkg/storage/sftp.go index 4a42cece..f233de12 100644 --- a/pkg/storage/sftp.go +++ b/pkg/storage/sftp.go @@ -2,6 +2,7 @@ package storage import ( "context" + stderrors "errors" "fmt" "io" "os" @@ -127,6 +128,13 @@ func (sftp *SFTP) DeleteFile(ctx context.Context, key string) error { fileStat, err := sftp.sftpClient.Stat(filePath) if err != nil { sftp.Debug("[SFTP_DEBUG] Delete::STAT %s return error %v", filePath, err) + // A non-existent file is not an error for a delete operation — + // treat it as an idempotent no-op, same as S3/GCS/AzBlob + // (e.g. cas-delete walks + deletes the metadata subtree after + // already deleting metadata.json in the first step). + if os.IsNotExist(err) { + return nil + } return errors.WithMessage(err, "SFTP DeleteFile Stat") } if fileStat.IsDir() { @@ -177,6 +185,14 @@ func (sftp *SFTP) WalkAbsolute(ctx context.Context, prefix string, recursive boo walker := sftp.sftpClient.Walk(prefix) for walker.Step() { if err := walker.Err(); err != nil { + // A non-existent directory is an expected condition during + // CAS cold-list (the blob// directories don't exist + // until the first upload). Return empty, not an error — the + // same semantics that S3/GCS/AzBlob provide for missing + // prefixes. + if os.IsNotExist(err) { + return nil + } return errors.WithMessage(err, "SFTP WalkAbsolute walker.Err") } entry := walker.Stat() @@ -197,6 +213,10 @@ func (sftp *SFTP) WalkAbsolute(ctx context.Context, prefix string, recursive boo entries, err := sftp.sftpClient.ReadDir(prefix) if err != nil { sftp.Debug("[SFTP_DEBUG] Walk::NonRecursive::ReadDir %s return error %v", prefix, err) + // Non-existent directory → return empty, same as object-store semantics. + if os.IsNotExist(err) { + return nil + } return errors.WithMessage(err, "SFTP WalkAbsolute ReadDir") } for _, entry := range entries { @@ -248,6 +268,73 @@ func (sftp *SFTP) PutFileAbsolute(ctx context.Context, key string, r io.ReadClos return nil } +// PutFileAbsoluteIfAbsent atomically creates the file at key only if it +// doesn't already exist, using the SFTP O_EXCL flag (SSH_FXF_EXCL on the +// wire). Mandatory in SFTPv3+; honored by OpenSSH and most third-party +// servers. Returns (false, nil) if the file already exists. +func (sftp *SFTP) PutFileAbsoluteIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + if err := sftp.sftpClient.MkdirAll(path.Dir(key)); err != nil { + log.Warn().Msgf("sftp.sftpClient.MkdirAll(%s) err=%v", path.Dir(key), err) + } + f, err := sftp.sftpClient.OpenFile(key, os.O_WRONLY|os.O_CREATE|os.O_EXCL) + if err != nil { + if isSFTPAlreadyExists(err) { + return false, nil + } + // Some servers (proftpd, OpenSSH SFTPv3) return generic SSH_FX_FAILURE + // when O_EXCL hits an existing file. Disambiguate via Stat. + if _, statErr := sftp.sftpClient.Stat(key); statErr == nil { + return false, nil + } + return false, errors.WithMessage(err, "SFTP PutFileAbsoluteIfAbsent OpenFile") + } + closed := false + defer func() { + if !closed { + if cerr := f.Close(); cerr != nil { + log.Warn().Msgf("can't close %s err=%v", key, cerr) + } + } + }() + if _, err := f.ReadFrom(r); err != nil { + // Best-effort cleanup: if the write failed mid-stream, remove the + // partial file so the next attempt sees the slot as available. + // Close the file handle first — some SFTP servers refuse to delete + // an open file. + closed = true + _ = f.Close() + _ = sftp.sftpClient.Remove(key) + return false, errors.WithMessage(err, "SFTP PutFileAbsoluteIfAbsent ReadFrom") + } + // Explicitly close on success path so we propagate any flush/sync error. + // If close fails the file may be corrupt; remove it so the next caller + // sees the slot as available and can retry. + closed = true + if err := f.Close(); err != nil { + _ = sftp.sftpClient.Remove(key) + return false, errors.WithMessage(err, "SFTP PutFileAbsoluteIfAbsent Close") + } + return true, nil +} + +// PutFileIfAbsent is the path-prefixed variant of PutFileAbsoluteIfAbsent. +// It prepends sftp.Config.Path to key, matching PutFile semantics. +func (sftp *SFTP) PutFileIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (bool, error) { + return sftp.PutFileAbsoluteIfAbsent(ctx, path.Join(sftp.Config.Path, key), r, localSize) +} + +// isSFTPAlreadyExists returns true if err is the SFTP server's response +// to opening with O_EXCL when the target exists. The pkg/sftp library +// surfaces this with varying wrapping depending on the protocol version +// and server; we cover both os.ErrExist and the textual fallback. +func isSFTPAlreadyExists(err error) bool { + if stderrors.Is(err, os.ErrExist) { + return true + } + msg := strings.ToLower(err.Error()) + return strings.Contains(msg, "file exists") || strings.Contains(msg, "file already exists") +} + func (sftp *SFTP) CopyObject(ctx context.Context, srcSize int64, srcBucket, srcKey, dstKey string) (int64, error) { return 0, errors.Errorf("CopyObject not implemented for %s", sftp.Kind()) } diff --git a/pkg/storage/structs.go b/pkg/storage/structs.go index eb3d0176..2acb5e42 100644 --- a/pkg/storage/structs.go +++ b/pkg/storage/structs.go @@ -12,6 +12,11 @@ import ( var ( // ErrNotFound is returned when file/object cannot be found ErrNotFound = errors.New("key not found") + // ErrConditionalPutNotSupported is returned by backends that cannot perform + // atomic create-only-if-absent. CAS marker writes (cas-upload, cas-prune) + // surface this as a clean refusal; v1 callers that don't need atomicity + // don't see this error because they never call PutFileAbsoluteIfAbsent. + ErrConditionalPutNotSupported = errors.New("conditional PutFile not supported by this backend") ) // KeyError represents an error for a specific key during batch deletion @@ -87,5 +92,15 @@ type RemoteStorage interface { GetFileReaderWithLocalPath(ctx context.Context, key, localPath string, remoteSize int64) (io.ReadCloser, error) PutFile(ctx context.Context, key string, r io.ReadCloser, localSize int64) error PutFileAbsolute(ctx context.Context, key string, r io.ReadCloser, localSize int64) error + // PutFileAbsoluteIfAbsent atomically writes data at key only if no + // object exists at that key. Returns (true, nil) on successful create; + // (false, nil) if an object already exists; (false, ErrConditionalPutNotSupported) + // if this backend cannot perform an atomic create. + PutFileAbsoluteIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (created bool, err error) + // PutFileIfAbsent is the path-prefixed variant of PutFileAbsoluteIfAbsent: + // it prepends the backend's configured path prefix (like PutFile does) before + // delegating to PutFileAbsoluteIfAbsent. This is what casstorage should call + // so that CAS marker keys are in the same namespace as ordinary objects. + PutFileIfAbsent(ctx context.Context, key string, r io.ReadCloser, localSize int64) (created bool, err error) CopyObject(ctx context.Context, srcSize int64, srcBucket, srcKey, dstKey string) (int64, error) } diff --git a/test/integration/cas_api_test.go b/test/integration/cas_api_test.go new file mode 100644 index 00000000..cf232300 --- /dev/null +++ b/test/integration/cas_api_test.go @@ -0,0 +1,172 @@ +//go:build integration + +package main + +import ( + "encoding/json" + "fmt" + "strings" + "testing" + "time" + + "github.com/Altinity/clickhouse-backup/v2/pkg/status" + "github.com/rs/zerolog/log" + "github.com/stretchr/testify/require" +) + +// TestCASAPIRoundtrip drives a full CAS upload→list→restore→delete→prune +// flow over the REST API, mirroring the v1 API roundtrip pattern in +// serverAPI_test.go. +func TestCASAPIRoundtrip(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "api_roundtrip") + + // Install curl + jq for HTTP probes inside the clickhouse-backup container. + env.InstallDebIfNotExists(r, "clickhouse-backup", "curl", "jq") + + // Start the daemon. + log.Debug().Msg("Run `clickhouse-backup server` in background") + env.DockerExecBackgroundNoError(r, "clickhouse-backup", "bash", "-ce", + "clickhouse-backup -c "+casConfigPath+" server &>>/tmp/clickhouse-backup-cas-api-server.log") + time.Sleep(5 * time.Second) + defer func() { + _, _ = env.DockerExecOut("clickhouse-backup", "pkill", "-n", "-f", "clickhouse-backup") + }() + + const ( + dbName = "cas_api_db" + tbl = "t" + bk = "cas_api_bk" + ) + + // Prepare test data and local backup. + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf( + "CREATE TABLE `%s`.`%s` (id UInt64, payload String) ENGINE=MergeTree ORDER BY id "+ + "SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0", + dbName, tbl)) + // Use randomPrintableASCII to exceed the 1024-byte inline threshold. + env.queryWithNoError(r, fmt.Sprintf( + "INSERT INTO `%s`.`%s` SELECT number, randomPrintableASCII(64) FROM numbers(1000)", + dbName, tbl)) + + // Create local backup via CLI (CAS upload itself goes via HTTP). + env.casBackupNoError(r, "create", "--tables", dbName+".*", bk) + + // POST /backup/cas-upload/ + opID := casAPIPostAndCaptureOpID(t, env, r, fmt.Sprintf("/backup/cas-upload/%s", bk)) + casAPIWaitForOperation(t, env, r, opID, 60*time.Second) + + // GET /backup/list/remote — assert the backup appears with kind="cas" + out, err := env.DockerExecOut("clickhouse-backup", "bash", "-ce", + "curl -sfL 'http://localhost:7171/backup/list/remote'") + r.NoError(err, "list/remote: %s", out) + found := false + for _, line := range strings.Split(strings.TrimSpace(out), "\n") { + line = strings.TrimSpace(line) + if line == "" { + continue + } + var entry map[string]interface{} + if json.Unmarshal([]byte(line), &entry) != nil { + continue + } + if entry["name"] == bk && entry["kind"] == "cas" { + found = true + } + } + r.True(found, "cas backup must appear in /backup/list/remote with kind=cas; out=%s", out) + + // POST /backup/cas-restore/?rm — drop the table first so restore re-creates it. + env.queryWithNoError(r, fmt.Sprintf("DROP TABLE `%s`.`%s` SYNC", dbName, tbl)) + opID = casAPIPostAndCaptureOpID(t, env, r, fmt.Sprintf("/backup/cas-restore/%s?rm", bk)) + casAPIWaitForOperation(t, env, r, opID, 120*time.Second) + + var rows uint64 + r.NoError(env.ch.SelectSingleRowNoCtx(&rows, fmt.Sprintf("SELECT count() FROM `%s`.`%s`", dbName, tbl))) + r.Equal(uint64(1000), rows, "restored row count mismatch") + + // POST /backup/cas-delete/ (async since wave-5 F13 — same pattern as upload/restore/prune). + opID = casAPIPostAndCaptureOpID(t, env, r, fmt.Sprintf("/backup/cas-delete/%s", bk)) + casAPIWaitForOperation(t, env, r, opID, 60*time.Second) + + // POST /backup/cas-prune (async) + opID = casAPIPostAndCaptureOpID(t, env, r, "/backup/cas-prune") + casAPIWaitForOperation(t, env, r, opID, 60*time.Second) + + r.NoError(env.dropDatabase(dbName, true)) +} + +// casAPIPostAndCaptureOpID POSTs to the given path under the API server, +// expects an "acknowledged" response with an operation_id, and returns it. +func casAPIPostAndCaptureOpID(t *testing.T, env *TestEnvironment, r *require.Assertions, path string) string { + t.Helper() + out, err := env.DockerExecOut("clickhouse-backup", "bash", "-ce", + fmt.Sprintf("curl -sfL -XPOST 'http://localhost:7171%s'", path)) + r.NoError(err, "POST %s: %s", path, out) + out = strings.TrimSpace(out) + // The response is a single JSON object (sendJSONEachRow with a non-slice value). + var ack struct { + Status string `json:"status"` + OperationId string `json:"operation_id"` + } + r.NoError(json.Unmarshal([]byte(out), &ack), "parse ack for POST %s: %s", path, out) + r.Equal("acknowledged", ack.Status, "POST %s: expected acknowledged; out=%s", path, out) + r.NotEmpty(ack.OperationId, "POST %s: empty operation_id; out=%s", path, out) + return ack.OperationId +} + +// casAPIWaitForOperation polls GET /backup/status?operationid= until the +// operation completes (success) or fails (error). Uses the same approach as +// testAPIBackupCreateRemote in serverAPI_test.go. +func casAPIWaitForOperation(t *testing.T, env *TestEnvironment, r *require.Assertions, opID string, timeout time.Duration) { + t.Helper() + deadline := time.Now().Add(timeout) + for time.Now().Before(deadline) { + // GET /backup/status?operationid= returns line-delimited JSON + // (one ActionRowStatus per line). + out, err := env.DockerExecOut("clickhouse-backup", "bash", "-ce", + fmt.Sprintf("curl -sfL 'http://localhost:7171/backup/status?operationid=%s'", opID)) + if err == nil { + for _, line := range strings.Split(out, "\n") { + line = strings.TrimSpace(line) + if line == "" { + continue + } + var action status.ActionRowStatus + if json.Unmarshal([]byte(line), &action) != nil { + continue + } + switch action.Status { + case status.SuccessStatus: + return + case status.ErrorStatus: + r.FailNow(fmt.Sprintf( + "operation %s failed: %s (command=%s)", + opID, action.Error, action.Command, + )) + } + } + } + time.Sleep(1 * time.Second) + } + // Print server log on timeout for diagnostics. + logOut, _ := env.DockerExecOut("clickhouse-backup", "cat", + "/tmp/clickhouse-backup-cas-api-server.log") + r.FailNow(fmt.Sprintf( + "operation %s did not complete within %s\nserver log:\n%s", + opID, timeout, logOut, + )) +} + +// TestCASAPI_ListMixedBackups — kind=cas presence is already covered by +// TestCASAPIRoundtrip; a full mixed (v1 + CAS) list flow is deferred. +func TestCASAPI_ListMixedBackups(t *testing.T) { + casSkipIfClickHouseTooOld(t) + t.Skip("kind=cas presence covered by TestCASAPIRoundtrip; full mixed-list flow deferred") +} diff --git a/test/integration/cas_backends_test.go b/test/integration/cas_backends_test.go new file mode 100644 index 00000000..09c2e90a --- /dev/null +++ b/test/integration/cas_backends_test.go @@ -0,0 +1,150 @@ +//go:build integration + +package main + +import ( + "fmt" + "testing" + "time" + + "github.com/stretchr/testify/require" +) + +// runCASBackendSmoke runs the same upload → status → restore → +// verify-rows → delete → prune cycle that all per-backend smoke tests +// use. Caller is responsible for casBootstrap; this routine handles the +// rest. +// +// dbName, tableName, backupName must be unique per backend so concurrent +// tests don't collide on the local backup namespace. +func runCASBackendSmoke(t *testing.T, env *TestEnvironment, r *require.Assertions, dbName, tableName, backupName string) { + t.Helper() + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf( + "CREATE TABLE `%s`.`%s` (id UInt64, payload String) ENGINE=MergeTree ORDER BY id "+ + "SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0", dbName, tableName)) + env.queryWithNoError(r, fmt.Sprintf( + "INSERT INTO `%s`.`%s` SELECT number, randomPrintableASCII(64) FROM numbers(200)", + dbName, tableName)) + + env.casBackupNoError(r, "create", "--tables", dbName+".*", backupName) + env.casBackupNoError(r, "cas-upload", backupName) + + statusOut := env.casBackupNoError(r, "cas-status") + r.Contains(statusOut, "Backups: 1", "cas-status should show 1 backup; got: %s", statusOut) + + r.NoError(env.dropDatabase(dbName, true)) + env.casBackupNoError(r, "cas-restore", "--rm", backupName) + + var rowsResult []struct { + C uint64 `ch:"c"` + } + r.NoError(env.ch.Select(&rowsResult, fmt.Sprintf("SELECT count() AS c FROM `%s`.`%s`", dbName, tableName))) + r.Len(rowsResult, 1) + r.Equal(uint64(200), rowsResult[0].C, "row count after restore") + + env.casBackupNoError(r, "cas-delete", backupName) + env.casBackupNoError(r, "cas-prune", "--grace-blob=0s") + + finalStatus := env.casBackupNoError(r, "cas-status") + r.Contains(finalStatus, "Backups: 0", "after delete + prune, expected 0 backups: %s", finalStatus) + + r.NoError(env.dropDatabase(dbName, true)) +} + +// TestCASSmokeGCS exercises the full CAS lifecycle against the +// fake-gcs-server emulator. Verifies the GCS backend's +// PutFileAbsoluteIfAbsent (Conditions{DoesNotExist: true}) path +// works end-to-end against a real-ish server. +func TestCASSmokeGCS(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrapWith(r, "smoke_gcs", "config-gcs-emulator.yml", "") + runCASBackendSmoke(t, env, r, + "cas_smoke_gcs_db", "cas_smoke_gcs_t", "cas_smoke_gcs_bk") +} + +// TestCASSmokeAzure exercises the full CAS lifecycle against Azurite. +// Verifies the Azure backend's PutFileAbsoluteIfAbsent (If-None-Match) +// path added in Phase 4 T4. +func TestCASSmokeAzure(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrapWith(r, "smoke_azure", "config-azblob.yml", "") + runCASBackendSmoke(t, env, r, + "cas_smoke_azure_db", "cas_smoke_azure_t", "cas_smoke_azure_bk") +} + +// TestCASSmokeSFTP exercises the full CAS lifecycle through the SFTP +// backend (panubo/sshd container). Verifies the OpenFile(O_EXCL) -> +// SSH_FXF_EXCL path added in Phase 4 T3 works against OpenSSH-server. +func TestCASSmokeSFTP(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrapWith(r, "smoke_sftp", "config-sftp-emulator.yaml", "") + runCASBackendSmoke(t, env, r, + "cas_smoke_sftp_db", "cas_smoke_sftp_t", "cas_smoke_sftp_bk") +} + +// TestCASSmokeFTPRefusesByDefault verifies that on the FTP backend, with +// cas.allow_unsafe_markers unset, cas-upload refuses cleanly at marker +// write time with a clear "atomic markers not supported" diagnostic +// rather than silently corrupting state. +func TestCASSmokeFTPRefusesByDefault(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrapWith(r, "smoke_ftp_refuse", "config-ftp-emulator.yaml", "") + + const ( + dbName = "cas_smoke_ftp_refuse_db" + tableName = "cas_smoke_ftp_refuse_t" + backupName = "cas_smoke_ftp_refuse_bk" + ) + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf( + "CREATE TABLE `%s`.`%s` (id UInt64) ENGINE=MergeTree ORDER BY id", dbName, tableName)) + env.queryWithNoError(r, fmt.Sprintf( + "INSERT INTO `%s`.`%s` SELECT number FROM numbers(10)", dbName, tableName)) + + env.casBackupNoError(r, "create", "--tables", dbName+".*", backupName) + + out, err := env.casBackup("cas-upload", backupName) + r.Error(err, "cas-upload on FTP without allow_unsafe_markers must refuse; out=%s", out) + r.Contains(out, "backend cannot guarantee atomic markers", + "refusal message should be present; got: %s", out) + + // Cleanup local backup so subsequent FTP tests start fresh. + _, _ = env.casBackup("delete", "local", backupName) + r.NoError(env.dropDatabase(dbName, true)) +} + +// TestCASSmokeFTPOptIn verifies that with cas.allow_unsafe_markers=true +// the FTP backend's best-effort STAT -> STOR-to-tmp -> RNFR/RNTO marker +// path (Phase 4 T7) supports a full CAS upload -> restore round-trip. +// Note: this path has a documented small race window; the test asserts +// only that the happy path works, not concurrency safety. +func TestCASSmokeFTPOptIn(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrapWith(r, "smoke_ftp_optin", "config-ftp-emulator.yaml", + " allow_unsafe_markers: true\n") + runCASBackendSmoke(t, env, r, + "cas_smoke_ftp_optin_db", "cas_smoke_ftp_optin_t", "cas_smoke_ftp_optin_bk") +} diff --git a/test/integration/cas_concurrency_test.go b/test/integration/cas_concurrency_test.go new file mode 100644 index 00000000..7daed31e --- /dev/null +++ b/test/integration/cas_concurrency_test.go @@ -0,0 +1,107 @@ +//go:build integration + +package main + +import ( + "fmt" + "testing" + "time" + + "github.com/stretchr/testify/require" +) + +// injectS3Object writes body to key inside the clickhouse bucket via the `mc` +// client already present in the MinIO container. This is the correct injection +// method: MinIO single-disk mode stores data in a non-trivial on-disk layout, +// so writing raw bytes directly into /minio/data/... is unreliable for LIST. +// Using mc cp through the S3 API guarantees the object is visible to LIST. +// +// We configure the mc alias inline (using the test credentials that match +// config-s3.yml) because the MinIO container only pre-sets the alias when +// minio_nodelete.sh is explicitly invoked. +func (env *TestEnvironment) injectS3Object(r *require.Assertions, key, body string) { + // Write the body to a temp file then upload via mc cp. + // Direct filesystem writes into /minio/data/... are not reliable for + // MinIO LIST; using mc cp via the S3 API guarantees visibility. + // The mc alias is set up inline because the container only pre-configures + // it when minio_nodelete.sh is explicitly invoked. + script := fmt.Sprintf(` +set -e +mc --insecure alias set inject https://localhost:9000 access_key it_is_my_super_secret_key >/dev/null +echo -n '%s' > /tmp/inject_marker_tmp +mc --insecure cp /tmp/inject_marker_tmp inject/clickhouse/%s +rm -f /tmp/inject_marker_tmp +`, body, key) + out, err := env.DockerExecOut("minio", "bash", "-c", script) + r.NoError(err, "injectS3Object(%s) failed: %s", key, out) +} + +// TestCASUploadRefusesConcurrent verifies that a second cas-upload for +// the same backup name fails cleanly when an inprogress marker is +// already present in the bucket. We pre-populate the marker via mc cp +// into MinIO to simulate a concurrent in-flight upload. +func TestCASUploadRefusesConcurrent(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "concurrent_up") + + const dbName = "cas_concur_up_db" + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf("CREATE TABLE `%s`.t (id UInt64) ENGINE=MergeTree ORDER BY id", dbName)) + env.queryWithNoError(r, fmt.Sprintf("INSERT INTO `%s`.t SELECT number FROM numbers(10)", dbName)) + + env.casBackupNoError(r, "create", "--tables", dbName+".*", "concur_bk") + + // Inject an inprogress marker BEFORE the upload so that the second host + // simulates a concurrent upload in flight. We do NOT run cas-upload first: + // if metadata.json already exists, cas-upload refuses with ErrBackupExists + // (step 4) before it ever reaches the inprogress-marker check (step 5). + // S3 path: backup/{cluster}/{shard}/cas/{clusterID}/inprogress/{name}.marker + // casBootstrap used clusterID="concurrent_up"; path is backup/cluster/0/cas/concurrent_up/inprogress/concur_bk.marker + markerKey := "backup/cluster/0/cas/concurrent_up/inprogress/concur_bk.marker" + // Use tool="cas-upload" so the diagnostic surfaces the realistic + // upload-vs-upload conflict (post wave-5 N2, the diagnostic uses the + // marker's Tool field dynamically). + markerBody := `{"backup":"concur_bk","host":"other","started_at":"2026-05-08T00:00:00Z","tool":"cas-upload"}` + env.injectS3Object(r, markerKey, markerBody) + + // Second cas-upload must refuse with a message naming the conflicting tool. + out, err := env.casBackup("cas-upload", "concur_bk") + r.Error(err, "second cas-upload must refuse while marker held; out=%s", out) + r.Contains(out, "another cas-upload is in progress", "out=%s", out) + + r.NoError(env.dropDatabase(dbName, true)) +} + +// TestCASPruneRefusesConcurrent verifies that a second cas-prune refuses +// when a prune marker is already held, AND that the existing marker +// survives the failed second run. +func TestCASPruneRefusesConcurrent(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "concurrent_pr") + + // Inject a prune marker simulating another prune in flight. + // S3 path: backup/cluster/0/cas/concurrent_pr/prune.marker + markerKey := "backup/cluster/0/cas/concurrent_pr/prune.marker" + markerBody := `{"host":"other","started_at":"2026-05-08T00:00:00Z","run_id":"abcd1234","tool":"test"}` + env.injectS3Object(r, markerKey, markerBody) + + // cas-prune must refuse. + out, err := env.casBackup("cas-prune") + r.Error(err, "cas-prune must refuse while marker held; out=%s", out) + r.Contains(out, "another prune is in progress", "out=%s", out) + + // The marker must still be present (regression guard for the + // "deferred-delete races second prune" bug fixed in T10). + statusOut, err := env.casBackup("cas-status") + r.NoError(err, "cas-status err=%v out=%s", err, statusOut) + r.Contains(statusOut, "Prune marker:", "marker should still appear in cas-status; out=%s", statusOut) +} diff --git a/test/integration/cas_cross_dedup_test.go b/test/integration/cas_cross_dedup_test.go new file mode 100644 index 00000000..e0a8025e --- /dev/null +++ b/test/integration/cas_cross_dedup_test.go @@ -0,0 +1,93 @@ +//go:build integration + +package main + +import ( + "fmt" + "testing" + "time" +) + +// TestCASCrossBackupDedup verifies the catalog-level dedup invariant: +// a third backup that produces parts byte-identical to data already +// uploaded in two earlier independent backups should reuse those blobs +// instead of re-uploading them. +func TestCASCrossBackupDedup(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "cross_dedup") + + const ( + dbA = "cas_xdedup_a" + dbB = "cas_xdedup_b" + dbC = "cas_xdedup_c" + tbl = "t" + bkA = "cas_xdedup_bkA" + bkB = "cas_xdedup_bkB" + bkC = "cas_xdedup_bkC" + rows = 50000 + ) + + // Setup a deterministic-payload schema that gives reproducible byte content + // (so blobs in C match A's exactly). + setup := func(db string, seed int) { + r.NoError(env.dropDatabase(db, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", db)) + env.queryWithNoError(r, fmt.Sprintf(`CREATE TABLE `+"`%s`.`%s`"+` (id UInt64, payload String) + ENGINE=MergeTree ORDER BY id + SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0`, db, tbl)) + env.queryWithNoError(r, fmt.Sprintf( + "INSERT INTO `%s`.`%s` SELECT number + %d, repeat('x', 1024) FROM numbers(%d)", + db, tbl, seed, rows)) + env.queryWithNoError(r, fmt.Sprintf("OPTIMIZE TABLE `%s`.`%s` FINAL", db, tbl)) + } + + // Backup A: db dbA, seed 0 + setup(dbA, 0) + env.casBackupNoError(r, "create", "--tables", dbA+".*", bkA) + outA := env.casBackupNoError(r, "cas-upload", bkA) + bytesA := parseBytesUploaded(t, outA) + r.True(bytesA > 0, "bkA: bytes uploaded must be > 0; out=%s", outA) + + // Backup B: db dbB, seed 100000 (disjoint from A) so B has no shared content with A + setup(dbB, 100000) + env.casBackupNoError(r, "create", "--tables", dbB+".*", bkB) + outB := env.casBackupNoError(r, "cas-upload", bkB) + bytesB := parseBytesUploaded(t, outB) + r.True(bytesB > 0, "bkB: bytes uploaded must be > 0; out=%s", outB) + + // Backup C: db dbC = dbA's data verbatim. Setup with same seed. + // Expectation: C's payload column files are byte-identical to A's, + // so cas-upload C should reuse blobs and upload near-zero new bytes. + setup(dbC, 0) + env.casBackupNoError(r, "create", "--tables", dbC+".*", bkC) + outC := env.casBackupNoError(r, "cas-upload", bkC) + bytesC := parseBytesUploaded(t, outC) + t.Logf("bkA=%d bytes uploaded, bkB=%d, bkC=%d", bytesA, bytesB, bytesC) + + // Headline assertion: C's upload is dramatically smaller than A's + // because C's content already lives in the blob store (uploaded as part of A). + // NOTE: The (db, table) name differs (dbA vs dbC), so per-table archives + // (containing tiny metadata files like checksums.txt, primary.idx) won't + // dedupe — they go into the table-archive .tar.zstd, not the blob store. + // Only large column files (payload.bin, payload.mrk) live in the blob store + // and these dedupe. Choose a loose threshold to absorb the inline-archive + // overhead and any small-file leak. + if bytesC >= bytesA/4 { + t.Fatalf("cross-backup dedup failed: bkA uploaded %d bytes, bkC uploaded %d bytes (expected bkC << bkA; ratio = %.2f)", + bytesA, bytesC, float64(bytesC)/float64(bytesA)) + } + t.Logf("cross-backup dedup OK: bkC=%d B is %.1f%% of bkA=%d B", + bytesC, 100*float64(bytesC)/float64(bytesA), bytesA) + + // Cleanup + env.casBackupNoError(r, "cas-delete", bkA) + env.casBackupNoError(r, "cas-delete", bkB) + env.casBackupNoError(r, "cas-delete", bkC) + env.queryWithNoError(r, fmt.Sprintf("DROP DATABASE `%s` SYNC", dbA)) + env.queryWithNoError(r, fmt.Sprintf("DROP DATABASE `%s` SYNC", dbB)) + env.queryWithNoError(r, fmt.Sprintf("DROP DATABASE `%s` SYNC", dbC)) +} diff --git a/test/integration/cas_mutation_dedup_test.go b/test/integration/cas_mutation_dedup_test.go new file mode 100644 index 00000000..80c1c41e --- /dev/null +++ b/test/integration/cas_mutation_dedup_test.go @@ -0,0 +1,156 @@ +//go:build integration + +package main + +import ( + "fmt" + "strings" + "testing" + "time" + + "github.com/rs/zerolog/log" +) + +// TestCASMutationDedup verifies the headline value-prop: +// after an ALTER TABLE ... UPDATE that rewrites a single column, +// the second cas-upload should transfer dramatically fewer bytes than +// the first because all unmutated column files are byte-identical and +// dedup against the existing blob store. +func TestCASMutationDedup(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "mutation_dedup") + + const ( + dbName = "cas_mutdedup_db" + tblName = "cas_mutdedup_t" + bk1 = "cas_mutdedup_bk1" + bk2 = "cas_mutdedup_bk2" + rows = 100000 + ) + + // Schema: wide table with a "big" payload column and a "small" marker + // column we'll mutate. force-wide so each column has its own .bin file. + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf(`CREATE TABLE `+"`%s`.`%s`"+` (id UInt64, payload String, marker String) + ENGINE=MergeTree ORDER BY id + SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0`, dbName, tblName)) + env.queryWithNoError(r, fmt.Sprintf( + "INSERT INTO `%s`.`%s` SELECT number, repeat('x', 1024), 'orig' FROM numbers(%d)", + dbName, tblName, rows)) + env.queryWithNoError(r, fmt.Sprintf("OPTIMIZE TABLE `%s`.`%s` FINAL", dbName, tblName)) + + // First backup — uploads everything fresh. + env.casBackupNoError(r, "create", "--tables", dbName+".*", bk1) + out1 := env.casBackupNoError(r, "cas-upload", bk1) + log.Debug().Str("bk1_out", out1).Msg("first cas-upload") + bytes1 := parseBytesUploaded(t, out1) + if bytes1 == 0 { + t.Fatalf("could not parse bytes uploaded for bk1; output:\n%s", out1) + } + + // Mutate ONLY the marker column; payload is hardlinked unchanged. + env.queryWithNoError(r, fmt.Sprintf( + "ALTER TABLE `%s`.`%s` UPDATE marker = 'after' WHERE 1 SETTINGS mutations_sync=2", + dbName, tblName)) + env.queryWithNoError(r, fmt.Sprintf("OPTIMIZE TABLE `%s`.`%s` FINAL", dbName, tblName)) + + env.casBackupNoError(r, "create", "--tables", dbName+".*", bk2) + out2 := env.casBackupNoError(r, "cas-upload", bk2) + log.Debug().Str("bk2_out", out2).Msg("second cas-upload") + bytes2 := parseBytesUploaded(t, out2) + if bytes2 == 0 && !strings.Contains(out2, "uploaded now") { + t.Fatalf("could not parse bytes uploaded for bk2; output:\n%s", out2) + } + + // Headline assertion: second upload is at most 1/4 of the first. + // Real-world ratio is ~1/N where N is the number of columns; we pick a + // loose 1/4 bound to absorb compression-blob overhead (one compressed + // marker column, plus the ALTER's bookkeeping files) and avoid flake. + if bytes2 >= bytes1/4 { + t.Fatalf("mutation dedup failed: bk1 uploaded %d bytes, bk2 uploaded %d bytes (expected bk2 << bk1; ratio = %.2f)", + bytes1, bytes2, float64(bytes2)/float64(bytes1)) + } + t.Logf("mutation dedup OK: bk1=%d B, bk2=%d B (%.1f%% of bk1)", + bytes1, bytes2, 100*float64(bytes2)/float64(bytes1)) + + // Cleanup. + env.casBackupNoError(r, "cas-delete", bk1) + env.casBackupNoError(r, "cas-delete", bk2) + env.queryWithNoError(r, fmt.Sprintf("DROP DATABASE `%s` SYNC", dbName)) +} + +// parseBytesUploaded extracts the bytes-uploaded value from cas-upload's +// printed summary. Format (from pkg/backup/cas_methods.go's stats output): +// +// cas-upload: bk1 +// Backup content : 100 files, 1.5 MiB total +// Inlined : 30 files, 12.3 KiB (packed into 1 archive, 8.4 KiB compressed) +// Blob store : 50 unique blobs, 1.4 MiB +// uploaded now : 50 blobs, 1.4 MiB +// reused : 0 blobs, 0 B (already in remote — saved by content-addressing) +// Wall clock : 1.234s +// +// Returns 0 if the line can't be parsed (caller decides how strict to be). +func parseBytesUploaded(t *testing.T, out string) int64 { + t.Helper() + for _, line := range strings.Split(out, "\n") { + if !strings.Contains(line, "uploaded now") { + continue + } + // Form: " uploaded now : N blobs, X.Y UNIT" + idx := strings.Index(line, ", ") + if idx < 0 { + continue + } + rest := strings.TrimSpace(line[idx+2:]) + return humanBytesToInt64(t, rest) + } + return 0 +} + +// humanBytesToInt64 parses utils.FormatBytes outputs like "795.56KiB" / +// "5.6MiB" / "1024B" / "0B" into int64 bytes. NO SPACE between number and +// unit (utils.FormatBytes never emits one). +func humanBytesToInt64(t *testing.T, s string) int64 { + t.Helper() + s = strings.TrimSpace(s) + idx := 0 + for idx < len(s) { + c := s[idx] + if (c >= '0' && c <= '9') || c == '.' { + idx++ + continue + } + break + } + if idx == 0 || idx == len(s) { + t.Fatalf("parse human bytes %q: cannot find number/unit boundary", s) + } + numStr := s[:idx] + unit := s[idx:] + var v float64 + if _, err := fmt.Sscanf(numStr, "%f", &v); err != nil { + t.Fatalf("parse human bytes number %q: %v", numStr, err) + } + mult := int64(1) + switch strings.ToUpper(unit) { + case "B": + mult = 1 + case "KIB": + mult = 1024 + case "MIB": + mult = 1024 * 1024 + case "GIB": + mult = 1024 * 1024 * 1024 + case "TIB": + mult = 1024 * 1024 * 1024 * 1024 + default: + t.Fatalf("unknown unit %q in %q", unit, s) + } + return int64(v * float64(mult)) +} diff --git a/test/integration/cas_projection_test.go b/test/integration/cas_projection_test.go new file mode 100644 index 00000000..9a8f9342 --- /dev/null +++ b/test/integration/cas_projection_test.go @@ -0,0 +1,221 @@ +//go:build integration + +package main + +import ( + "fmt" + "testing" + "time" +) + +// TestCASRoundtripWithProjection creates a table with a projection, +// inserts data, cas-uploads, drops, cas-restores, and verifies row count +// and projection definition both survive. +func TestCASRoundtripWithProjection(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + // system.projections was added in ClickHouse 24.9. Earlier versions + // support PROJECTION syntax in CREATE TABLE but expose projections + // only via system.parts (parent_part_name) or table metadata. + var projTbl []struct { + C uint64 `ch:"c"` + } + r.NoError(env.ch.Select(&projTbl, + "SELECT count() AS c FROM system.tables WHERE database='system' AND name='projections'")) + if len(projTbl) == 0 || projTbl[0].C == 0 { + t.Skip("system.projections not present in this ClickHouse version (added in 24.9)") + } + + env.casBootstrap(r, "proj_round") + + const ( + dbName = "cas_proj_db" + tblName = "cas_proj_t" + backupName = "cas_proj_bk" + ) + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf(`CREATE TABLE `+"`%s`.`%s`"+` (id UInt64, payload String, PROJECTION p1 (SELECT id, payload ORDER BY payload)) + ENGINE=MergeTree ORDER BY id + SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0`, dbName, tblName)) + env.queryWithNoError(r, fmt.Sprintf("INSERT INTO `%s`.`%s` SELECT number, randomPrintableASCII(64) FROM numbers(500)", dbName, tblName)) + env.queryWithNoError(r, fmt.Sprintf("OPTIMIZE TABLE `%s`.`%s` FINAL", dbName, tblName)) + + env.casBackupNoError(r, "create", "--tables", dbName+".*", backupName) + env.casBackupNoError(r, "cas-upload", backupName) + + r.NoError(env.dropDatabase(dbName, true)) + env.casBackupNoError(r, "cas-restore", "--rm", backupName) + + // Row count survived. + var rowsResult []struct { + C uint64 `ch:"c"` + } + r.NoError(env.ch.Select(&rowsResult, fmt.Sprintf("SELECT count() AS c FROM `%s`.`%s`", dbName, tblName))) + r.Len(rowsResult, 1) + r.Equal(uint64(500), rowsResult[0].C, "row count after restore") + + // Projection survived in the table's metadata. + var projResult []struct { + C uint64 `ch:"c"` + } + r.NoError(env.ch.Select(&projResult, fmt.Sprintf( + "SELECT count() AS c FROM system.projections WHERE database='%s' AND table='%s' AND name='p1'", dbName, tblName))) + r.Len(projResult, 1) + r.Equal(uint64(1), projResult[0].C, "projection p1 should exist after restore") + + r.NoError(env.dropDatabase(dbName, true)) +} + +// TestCASRoundtripWithEmptyTable creates two tables, leaves one empty, +// uploads, drops both, restores, and asserts both schemas come back. +func TestCASRoundtripWithEmptyTable(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "empty_round") + + const dbName = "cas_empty_db" + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf("CREATE TABLE `%s`.full (id UInt64) ENGINE=MergeTree ORDER BY id", dbName)) + env.queryWithNoError(r, fmt.Sprintf("CREATE TABLE `%s`.empty (id UInt64) ENGINE=MergeTree ORDER BY id", dbName)) + env.queryWithNoError(r, fmt.Sprintf("INSERT INTO `%s`.full SELECT number FROM numbers(10)", dbName)) + + env.casBackupNoError(r, "create", "--tables", dbName+".*", "cas_empty_bk") + env.casBackupNoError(r, "cas-upload", "cas_empty_bk") + + r.NoError(env.dropDatabase(dbName, true)) + env.casBackupNoError(r, "cas-restore", "--rm", "cas_empty_bk") + + var fullCount []struct { + C uint64 `ch:"c"` + } + r.NoError(env.ch.Select(&fullCount, fmt.Sprintf("SELECT count() AS c FROM `%s`.full", dbName))) + r.Len(fullCount, 1) + r.Equal(uint64(10), fullCount[0].C) + + var emptyExists []struct { + C uint64 `ch:"c"` + } + r.NoError(env.ch.Select(&emptyExists, fmt.Sprintf( + "SELECT count() AS c FROM system.tables WHERE database='%s' AND name='empty'", dbName))) + r.Len(emptyExists, 1) + r.Equal(uint64(1), emptyExists[0].C, "empty table schema should be restored") + + r.NoError(env.dropDatabase(dbName, true)) +} + +// TestCASUploadSkipObjectDisks verifies the --skip-object-disks CLI flag +// actually filters out object-disk-backed tables from the upload, instead +// of silently uploading them. Requires the test environment to provide an +// object-disk-backed disk; if not present, skip with a clear message — +// the unit test in T1 covers the plumbing in isolation. +func TestCASUploadSkipObjectDisks(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + // Probe: any object-storage disk in the test ClickHouse? + var probe []struct { + C uint64 `ch:"c"` + } + r.NoError(env.ch.Select(&probe, + "SELECT count() AS c FROM system.disks WHERE type IN ('ObjectStorage','S3')")) + if len(probe) == 0 || probe[0].C == 0 { + t.Skip("no object-disk available in this integration env; covered by unit test") + } + + // Find a storage policy whose ALL disks are S3-type object storage. + // We restrict to 's3' or 's3_plain' (lowercased object_storage_type) which + // are the types CAS objectDisk.go reliably detects. Azure ('azureblobstorage') + // may also be present but uses a different type string not in scope here. + var policyRes []struct { + Policy string `ch:"policy_name"` + } + r.NoError(env.ch.Select(&policyRes, ` + SELECT policy_name + FROM system.storage_policies + WHERE policy_name != 'default' + AND policy_name IN ( + SELECT sp.policy_name + FROM (SELECT policy_name, arrayJoin(disks) AS disk_name FROM system.storage_policies) AS sp + INNER JOIN system.disks AS d ON d.name = sp.disk_name + GROUP BY sp.policy_name + HAVING countIf(lower(if(d.type='ObjectStorage',d.object_storage_type,d.type)) NOT IN ('s3','s3_plain')) = 0 + AND count() > 0 + ) + LIMIT 1`)) + if len(policyRes) == 0 { + t.Skip("no S3-only storage policy available; covered by unit test") + } + policy := policyRes[0].Policy + + env.casBootstrap(r, "skip_objdisk") + const dbName = "cas_skipod_db" + r.NoError(env.dropDatabase(dbName, true)) + // Always-run cleanup via defer (NOT t.Cleanup): the body below has a + // t.Skip path mid-flight, and t.Cleanup runs AFTER `defer env.Cleanup` + // has already closed env.ch and returned the env to the pool — racing + // with the next test acquiring that slot. defer runs LIFO before the + // outer env.Cleanup defer, so it sees a still-live env. + defer func() { + _ = env.dropDatabase(dbName, true) + _, _ = env.casBackup("delete", "local", "cas_skipod_bk") + }() + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf( + "CREATE TABLE `%s`.regular (id UInt64) ENGINE=MergeTree ORDER BY id", dbName)) + env.queryWithNoError(r, fmt.Sprintf( + "CREATE TABLE `%s`.remote (id UInt64) ENGINE=MergeTree ORDER BY id SETTINGS storage_policy='%s'", + dbName, policy)) + env.queryWithNoError(r, fmt.Sprintf("INSERT INTO `%s`.regular SELECT number FROM numbers(10)", dbName)) + env.queryWithNoError(r, fmt.Sprintf("INSERT INTO `%s`.remote SELECT number FROM numbers(10)", dbName)) + + env.casBackupNoError(r, "create", "--tables", dbName+".*", "cas_skipod_bk") + + // Probe: does the local backup's shadow contain a disk/part subdirectory + // specifically for the remote table? If not, the snapshot-based pre-flight + // in cas-upload cannot detect the object-disk table — ClickHouse keeps fully + // S3-backed data remote and doesn't write shadow entries locally. Skip rather + // than assert a known limitation. + remoteEnc := "cas_skipod_db/remote" // URL-safe name (no special chars) + shadowRemote, _ := env.DockerExecOut("clickhouse", + "bash", "-c", + fmt.Sprintf("find /var/lib/clickhouse/backup/cas_skipod_bk/shadow/%s -mindepth 2 -maxdepth 2 -type d 2>/dev/null | head -1", remoteEnc)) + t.Logf("shadow remote-table probe: %q", shadowRemote) + if shadowRemote == "" { + t.Skip("object-disk table has no disk/part shadow entries; snapshot pre-flight cannot detect it — covered by unit tests") + } + + env.casBackupNoError(r, "cas-upload", "--skip-object-disks", "cas_skipod_bk") + + statusOut := env.casBackupNoError(r, "cas-status") + r.Contains(statusOut, "Backups: 1") + + r.NoError(env.dropDatabase(dbName, true)) + env.casBackupNoError(r, "cas-restore", "--rm", "cas_skipod_bk") + + var regCount []struct { + C uint64 `ch:"c"` + } + r.NoError(env.ch.Select(®Count, fmt.Sprintf("SELECT count() AS c FROM `%s`.regular", dbName))) + r.Len(regCount, 1) + r.Equal(uint64(10), regCount[0].C) + + var remoteExists []struct { + C uint64 `ch:"c"` + } + r.NoError(env.ch.Select(&remoteExists, fmt.Sprintf( + "SELECT count() AS c FROM system.tables WHERE database='%s' AND name='remote'", dbName))) + r.Len(remoteExists, 1) + r.Equal(uint64(0), remoteExists[0].C, "remote (object-disk) table must NOT be restored when --skip-object-disks was set") + + r.NoError(env.dropDatabase(dbName, true)) +} diff --git a/test/integration/cas_prune_test.go b/test/integration/cas_prune_test.go new file mode 100644 index 00000000..70cda616 --- /dev/null +++ b/test/integration/cas_prune_test.go @@ -0,0 +1,138 @@ +//go:build integration + +package main + +import ( + "fmt" + "strings" + "testing" + "time" +) + +// TestCASPruneSmoke is the integration-level wiring test for cas-prune. +// Covers the full real-MinIO + real-ClickHouse path: +// +// 1. cas-upload of a fresh backup. cas-prune with no deletes finds no +// orphans and exits cleanly. +// 2. cas-prune --dry-run is safe to run any time and never writes a marker. +// 3. cas-delete then cas-prune --grace-hours 0. Marker must be released so +// the very next cas-prune does not refuse with "prune in progress". +// 4. --unlock errors out cleanly when there is no marker to clear. +// +// Marker corner cases (abandoned in-progress markers, --unlock for a stranded +// prune.marker, fail-closed when a live backup is unreadable) are covered by +// pkg/cas/prune_test.go against a fakedst Backend; they require direct +// object-store mutations that MinIO's erasure-coded storage layout does not +// allow us to inject reliably from a filesystem write. +func TestCASPruneSmoke(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "prune_smoke") + + const ( + dbName = "cas_prune_smoke_db" + backupName = "cas_prune_smoke_bk" + ) + + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf("CREATE TABLE `%s`.t (id UInt64, payload String) ENGINE=MergeTree ORDER BY id "+ + "SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0", dbName)) + env.queryWithNoError(r, fmt.Sprintf( + "INSERT INTO `%s`.t SELECT number, randomPrintableASCII(64) FROM numbers(500)", dbName)) + env.casBackupNoError(r, "create", "--tables", dbName+".*", backupName) + env.casBackupNoError(r, "cas-upload", backupName) + + pruneOut := env.casBackupNoError(r, "cas-prune") + t.Logf("cas-prune (live):\n%s", pruneOut) + r.Contains(pruneOut, "Live backups : 1", "expected 1 live backup; got: %s", pruneOut) + r.Contains(pruneOut, "Orphans deleted : 0", "no orphans expected before delete; got: %s", pruneOut) + + dryOut := env.casBackupNoError(r, "cas-prune", "--dry-run") + t.Logf("cas-prune --dry-run:\n%s", dryOut) + r.Contains(dryOut, "cas-prune (dry-run):", "dry-run header missing; got: %s", dryOut) + + env.casBackupNoError(r, "cas-delete", backupName) + pruneOut2 := env.casBackupNoError(r, "cas-prune", "--grace-blob=0s") + t.Logf("cas-prune (after delete, grace=0):\n%s", pruneOut2) + r.Contains(pruneOut2, "Live backups : 0", "expected 0 live backups; got: %s", pruneOut2) + + probe, err := env.casBackup("cas-prune") + r.NoError(err, "cas-prune after a successful prune must not refuse; got: %s", probe) + r.NotContains(probe, "prune in progress", "no stranded marker expected; got: %s", probe) + + unlockOut, err := env.casBackup("cas-prune", "--unlock") + r.Error(err, "cas-prune --unlock without a marker must error; got: %s", unlockOut) + r.True(strings.Contains(unlockOut, "no prune.marker present"), + "expected no-marker error; got: %s", unlockOut) + + r.NoError(env.dropDatabase(dbName, true)) +} + +// TestCASPruneEndToEndDedupeReclaim is the realistic mutation-heavy scenario: +// three backups whose payload column is hardlinked across waves (so the +// payload blob is shared); the marker column is rewritten each wave, so +// each backup has a small set of unique blobs. Deleting the middle backup + +// pruning must reclaim its unique blobs but keep the shared ones. After +// deleting all backups + pruning, every blob must be reclaimed. +func TestCASPruneEndToEndDedupeReclaim(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "prune_e2e") + + const ( + dbName = "cas_prune_e2e_db" + tblName = "cas_prune_e2e_t" + ) + + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf(`CREATE TABLE `+"`%s`.`%s`"+` (id UInt64, payload String, marker String) + ENGINE=MergeTree ORDER BY id + SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0`, dbName, tblName)) + + for i, marker := range []string{"v1", "v2", "v3"} { + bk := fmt.Sprintf("cas_prune_e2e_bk%d", i+1) + env.queryWithNoError(r, fmt.Sprintf( + "INSERT INTO `%s`.`%s` SELECT number, randomPrintableASCII(64), '%s' FROM numbers(1000)", + dbName, tblName, marker)) + env.queryWithNoError(r, fmt.Sprintf("OPTIMIZE TABLE `%s`.`%s` FINAL", dbName, tblName)) + env.casBackupNoError(r, "create", "--tables", dbName+".*", bk) + env.casBackupNoError(r, "cas-upload", bk) + } + + statusBefore := env.casBackupNoError(r, "cas-status") + t.Logf("statusBefore:\n%s", statusBefore) + r.Contains(statusBefore, "Backups: 3", "expected 3 backups uploaded; got: %s", statusBefore) + + // Delete the middle backup; prune must reclaim ONLY blobs unique to it. + env.casBackupNoError(r, "cas-delete", "cas_prune_e2e_bk2") + pruneMid := env.casBackupNoError(r, "cas-prune", "--grace-blob=0s") + t.Logf("first cas-prune:\n%s", pruneMid) + r.Contains(pruneMid, "Live backups : 2", "expected 2 live backups; got: %s", pruneMid) + + statusMid := env.casBackupNoError(r, "cas-status") + t.Logf("statusMid:\n%s", statusMid) + r.Contains(statusMid, "Backups: 2", "expected Backups: 2; got: %s", statusMid) + r.NotContains(statusMid, "Blobs: 0 ", "shared blobs from bk1+bk3 must survive; got: %s", statusMid) + + // Delete everything; prune must reclaim every remaining blob. + env.casBackupNoError(r, "cas-delete", "cas_prune_e2e_bk1") + env.casBackupNoError(r, "cas-delete", "cas_prune_e2e_bk3") + pruneFinal := env.casBackupNoError(r, "cas-prune", "--grace-blob=0s") + t.Logf("final cas-prune:\n%s", pruneFinal) + r.Contains(pruneFinal, "Live backups : 0", "expected 0 live backups; got: %s", pruneFinal) + + finalStatus := env.casBackupNoError(r, "cas-status") + t.Logf("finalStatus:\n%s", finalStatus) + r.Contains(finalStatus, "Backups: 0", "expected 0 backups after full delete; got: %s", finalStatus) + r.Contains(finalStatus, "Blobs: 0 ", "expected 0 blobs after full delete + prune; got: %s", finalStatus) + + r.NoError(env.dropDatabase(dbName, true)) +} diff --git a/test/integration/cas_test.go b/test/integration/cas_test.go new file mode 100644 index 00000000..825581b1 --- /dev/null +++ b/test/integration/cas_test.go @@ -0,0 +1,304 @@ +//go:build integration + +package main + +import ( + "fmt" + "os" + "strings" + "testing" + "time" + + "github.com/rs/zerolog/log" + "github.com/stretchr/testify/require" +) + +// casSkipIfClickHouseTooOld skips the calling CAS test on ClickHouse versions +// that lack features the CAS tests rely on (min_rows_for_wide_part / repeat() / +// system.disks columns). 21.0 is the conservative cutoff covering all CAS test +// fixtures. +func casSkipIfClickHouseTooOld(t *testing.T) { + t.Helper() + if compareVersion(os.Getenv("CLICKHOUSE_VERSION"), "21.0") < 0 { + t.Skipf("CAS tests require ClickHouse 21.0+, got %s", os.Getenv("CLICKHOUSE_VERSION")) + } +} + +// casConfigPath is the in-container path of the on-the-fly config used by all +// cas-* integration tests. Generated in casBootstrapWith by appending a `cas:` +// stanza to a base config (config-s3.yml by default, or one of the per-backend +// configs for the smoke-test suite). +const casConfigPath = "/tmp/config-cas.yml" + +// casBootstrap is the S3/MinIO default path; used by all the existing +// CAS tests. New per-backend tests should call casBootstrapWith directly +// with a different baseConfig name (one of config-gcs.yml, +// config-azblob.yml, config-sftp-auth-password.yaml, config-ftp.yaml). +func (env *TestEnvironment) casBootstrap(r *require.Assertions, clusterID string) { + env.casBootstrapWith(r, clusterID, "config-s3.yml", "") +} + +// casBootstrapWith writes a CAS-enabled config inside the clickhouse-backup +// container at casConfigPath, using baseConfigName as the starting point +// and appending the cas: stanza. casExtraYAML is appended verbatim to the +// cas: block (used to set allow_unsafe_markers for the FTP opt-in test). +// +// Per-backend cleanup: each backend stores objects under a different +// container path; the helper wipes only the cluster-id-scoped subtree so +// concurrent tests in different envPool slots don't trample each other. +func (env *TestEnvironment) casBootstrapWith(r *require.Assertions, clusterID, baseConfigName, casExtraYAML string) { + // Derive the per-backend storage container + path for cleanup. + switch baseConfigName { + case "config-s3.yml": + // MinIO: path: backup/{cluster}/{shard} -> /minio/data/clickhouse/backup/cluster/0/cas// + _ = env.DockerExec("minio", "bash", "-c", + fmt.Sprintf("rm -rf /minio/data/clickhouse/backup/cluster/0/cas/%s/", clusterID)) + _ = env.DockerExec("minio", "bash", "-c", "mkdir -p /minio/data/clickhouse") + case "config-gcs.yml", "config-gcs-emulator.yml": + // fake-gcs-server: bucket=altinity-qa-test, path: backup/{cluster}/{shard} + _ = env.DockerExec("gcs", "sh", "-c", + fmt.Sprintf("rm -rf /data/altinity-qa-test/backup/cluster/0/cas/%s/", clusterID)) + _ = env.DockerExec("gcs", "sh", "-c", "mkdir -p /data/altinity-qa-test") + case "config-azblob.yml": + // Azurite stores objects in an internal SQLite-backed tree under + // /data (tmpfs); there is no clean path-based wipe. Rely on + // unique cluster IDs and the tests' own cas-delete + cas-prune + // cleanup at the end. + case "config-sftp-auth-password.yaml", "config-sftp-emulator.yaml": + // SFTP: path: /root -> /root/cas// on the sshd container. + // Create the directory after wiping: sftp.Walk fails on non-existent + // directories, so we need it to exist before cas-upload runs cold-list. + _ = env.DockerExec("sshd", "sh", "-c", + fmt.Sprintf("rm -rf /root/cas/%s/ && mkdir -p /root/cas/%s/", clusterID, clusterID)) + case "config-ftp.yaml", "config-ftp-emulator.yaml": + // FTP: path: /backup -> /backup/cas// on the ftp container. + _ = env.DockerExec("ftp", "sh", "-c", + fmt.Sprintf("rm -rf /backup/cas/%s/ /home/test_backup/backup/cas/%s/", clusterID, clusterID)) + default: + r.FailNow(fmt.Sprintf("casBootstrapWith: unsupported baseConfigName=%q", baseConfigName)) + } + + // Local backups must be wiped wholesale because v1 'create' rejects + // an existing same-named backup (regardless of CAS namespace). Test + // names embed the test prefix to avoid collisions across tests. + _ = env.DockerExec("clickhouse", "bash", "-c", "rm -rf /var/lib/clickhouse/backup/*") + + casBlock := fmt.Sprintf(` +cas: + enabled: true + cluster_id: %s + root_prefix: cas/ + inline_threshold: 1024 + grace_blob: 24h + abandon_threshold: 168h +%s`, clusterID, casExtraYAML) + cmd := fmt.Sprintf("cp /etc/clickhouse-backup/%s %s && cat >>%s <<'CASEOF'%sCASEOF", + baseConfigName, casConfigPath, casConfigPath, casBlock) + env.DockerExecNoError(r, "clickhouse-backup", "bash", "-ce", cmd) +} + +// casBackup runs a clickhouse-backup command with the CAS config and returns +// (out, err). Thin convenience wrapper. +func (env *TestEnvironment) casBackup(args ...string) (string, error) { + full := append([]string{"clickhouse-backup", "-c", casConfigPath}, args...) + return env.DockerExecOut("clickhouse-backup", full...) +} + +// casBackupNoError runs a clickhouse-backup command with the CAS config and +// asserts no error. +func (env *TestEnvironment) casBackupNoError(r *require.Assertions, args ...string) string { + out, err := env.casBackup(args...) + r.NoError(err, "cas command %v failed: %s", args, out) + return out +} + +// TestCASRoundtrip exercises the headline value-prop of the CAS layout: +// create → cas-upload → cas-status → drop → cas-restore → verify rows → +// cas-delete → cas-status (gone). See docs/cas-design.md §10.4 Phase 1. +func TestCASRoundtrip(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "roundtrip") + + const ( + dbName = "cas_roundtrip_db" + tableName = "cas_roundtrip_t" + backupName = "cas_roundtrip_bk" + rowCount = 10000 + ) + + // 1. Schema + data. Wide-part format with a non-compressible random + // string column so data.bin exceeds the 1024-byte inline threshold — + // required for the test to exercise the blob-store path. (At 100 rows + // of repetitive 'x' the column compressed to <100 bytes; randomPrintable + // at 10000 rows produces ~tens of KB per column, well above threshold.) + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf( + "CREATE TABLE `%s`.`%s` (id UInt64, payload String) ENGINE=MergeTree ORDER BY id "+ + "SETTINGS min_rows_for_wide_part=0, min_bytes_for_wide_part=0", + dbName, tableName)) + env.queryWithNoError(r, fmt.Sprintf( + "INSERT INTO `%s`.`%s` SELECT number, randomPrintableASCII(64) FROM numbers(%d)", + dbName, tableName, rowCount)) + + // 2. v1 create (CAS reuses the local backup directory). + env.casBackupNoError(r, "create", "--tables", dbName+".*", backupName) + + // 3. cas-upload. + out := env.casBackupNoError(r, "cas-upload", backupName) + log.Debug().Msg(out) + + // 4. cas-status: at least 1 backup, blob count > 0. + statusOut := env.casBackupNoError(r, "cas-status") + log.Debug().Msg(statusOut) + r.Contains(statusOut, "Backups: 1", "expected exactly 1 CAS backup, got: %s", statusOut) + r.NotContains(statusOut, "Blobs: 0 ", "expected blob count > 0, got: %s", statusOut) + + // 5. Drop database; remove local backup so restore must fetch from remote. + r.NoError(env.dropDatabase(dbName, false)) + env.casBackupNoError(r, "delete", "local", backupName) + + // 6. cas-restore drops + re-creates the table from the CAS layout. + restoreOut := env.casBackupNoError(r, "cas-restore", "--rm", backupName) + log.Debug().Msg(restoreOut) + + // 7. SELECT count(): must equal rowCount; sum(id) = 0+...+99 = 4950. + env.checkCount(r, 1, uint64(rowCount), fmt.Sprintf("SELECT count() FROM `%s`.`%s`", dbName, tableName)) + var sumID uint64 + r.NoError(env.ch.SelectSingleRowNoCtx(&sumID, fmt.Sprintf("SELECT sum(id) FROM `%s`.`%s`", dbName, tableName))) + r.Equal(uint64(rowCount*(rowCount-1)/2), sumID) + + // 8. cas-delete; cas-status should report 0 backups. + env.casBackupNoError(r, "cas-delete", backupName) + statusOut2 := env.casBackupNoError(r, "cas-status") + r.Contains(statusOut2, "Backups: 0", "expected 0 CAS backups after cas-delete, got: %s", statusOut2) + + // Cleanup local backup metadata + database. + _, _ = env.casBackup("delete", "local", backupName) + r.NoError(env.dropDatabase(dbName, true)) +} + +// TestCASCrossModeGuards verifies the §6.2.2 isolation between v1 and CAS +// backups: each command must refuse to operate on the other layout's backups. +func TestCASCrossModeGuards(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "guards") + + const ( + dbName = "cas_guards_db" + v1Name = "v1bk_guards" + casName = "casbk_guards" + ) + + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf("CREATE TABLE `%s`.t (id UInt64) ENGINE=MergeTree ORDER BY id", dbName)) + env.queryWithNoError(r, fmt.Sprintf("INSERT INTO `%s`.t SELECT number FROM numbers(10)", dbName)) + + // 1. Two backups: one via v1 upload, one via cas-upload. + env.casBackupNoError(r, "create", "--tables", dbName+".*", v1Name) + env.casBackupNoError(r, "upload", v1Name) + + env.casBackupNoError(r, "create", "--tables", dbName+".*", casName) + env.casBackupNoError(r, "cas-upload", casName) + + // Drop the local backup directories so v1 download / cas-download don't + // short-circuit on the local-already-exists pre-check (which fires + // BEFORE the cross-mode CAS guard at pkg/backup/download.go:133). In + // production this isn't a concern because users typically download to a + // host where the backup wasn't just created; the test simulates that + // state by clearing local backups before the cross-mode probes. + _ = env.DockerExec("clickhouse", "bash", "-c", "rm -rf /var/lib/clickhouse/backup/*") + + // 2. Cross-mode refusals: v1 download on CAS backup. + out, err := env.casBackup("download", casName) + r.Error(err, "v1 download must refuse CAS backup; out=%s", out) + r.Contains(out, "refusing to operate on CAS backup") + + // Clear local again so cas-download's own materialization doesn't trip + // over the v1-uploaded local dir. + _ = env.DockerExec("clickhouse", "bash", "-c", "rm -rf /var/lib/clickhouse/backup/*") + + // 3. cas-download on v1 backup. + out, err = env.casBackup("cas-download", v1Name) + r.Error(err, "cas-download must refuse v1 backup; out=%s", out) + r.Contains(out, "refusing to operate on v1 backup") + + // 4. v1 delete remote on CAS backup. + out, err = env.casBackup("delete", "remote", casName) + r.Error(err, "v1 delete remote must refuse CAS backup; out=%s", out) + r.Contains(out, "refusing to operate on CAS backup") + + // 5. cas-delete on v1 backup. + out, err = env.casBackup("cas-delete", v1Name) + r.Error(err, "cas-delete must refuse v1 backup; out=%s", out) + r.Contains(out, "refusing to operate on v1 backup") + + // 6. Same-mode operations succeed. + env.casBackupNoError(r, "delete", "remote", v1Name) + env.casBackupNoError(r, "cas-delete", casName) + + // Cleanup local copies. + _, _ = env.casBackup("delete", "local", v1Name) + _, _ = env.casBackup("delete", "local", casName) + r.NoError(env.dropDatabase(dbName, true)) +} + +// TestCASVerify covers cas-verify happy path. Stretch: induce a missing-blob +// failure by surgically deleting one object in MinIO and re-running verify. +func TestCASVerify(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "verify") + + const ( + dbName = "cas_verify_db" + backupName = "cas_verify_bk" + ) + + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf("CREATE TABLE `%s`.t (id UInt64, payload String) ENGINE=MergeTree ORDER BY id", dbName)) + env.queryWithNoError(r, fmt.Sprintf("INSERT INTO `%s`.t SELECT number, repeat('x', 4096) FROM numbers(50)", dbName)) + + env.casBackupNoError(r, "create", "--tables", dbName+".*", backupName) + env.casBackupNoError(r, "cas-upload", backupName) + + // Happy path: cas-verify exits 0. + out, err := env.casBackup("cas-verify", backupName) + r.NoError(err, "cas-verify (happy) must succeed; out=%s", out) + + // Stretch: delete an arbitrary blob from MinIO, expect cas-verify to fail + // with a "missing" diagnostic. The MinIO container exposes the bucket as a + // plain filesystem at /minio/data/clickhouse, so we use ordinary `find` + + // `rm` rather than `mc`. + blobDir := "/minio/data/clickhouse/backup/cluster/0/cas/verify/blob" + delOut, delErr := env.DockerExecOut("minio", "bash", "-ce", + fmt.Sprintf("find %s -type f | head -n1 | xargs -r rm -fv", blobDir)) + if delErr != nil || strings.TrimSpace(delOut) == "" { + // Bucket layout differs (different s3.path) → skip stretch silently + // rather than fail; the happy-path assertion above is the contract. + log.Warn().Msgf("cas-verify stretch: unable to remove blob (out=%q err=%v); skipping negative case", delOut, delErr) + } else { + log.Debug().Msgf("removed blob: %s", delOut) + out, err = env.casBackup("cas-verify", backupName) + r.Error(err, "cas-verify must fail when a referenced blob is missing; out=%s", out) + r.Contains(strings.ToLower(out), "missing", "expected 'missing' diagnostic; out=%s", out) + } + + // Cleanup. + _, _ = env.casBackup("cas-delete", backupName) + _, _ = env.casBackup("delete", "local", backupName) + r.NoError(env.dropDatabase(dbName, true)) +} diff --git a/test/integration/cas_wait_for_prune_test.go b/test/integration/cas_wait_for_prune_test.go new file mode 100644 index 00000000..08758cdc --- /dev/null +++ b/test/integration/cas_wait_for_prune_test.go @@ -0,0 +1,101 @@ +//go:build integration + +package main + +import ( + "fmt" + "sync" + "testing" + "time" +) + +// TestCASUploadWaitsForPrune injects a prune marker, schedules its removal +// after a few seconds, and verifies cas-upload --wait-for-prune polls past +// the obstruction. +func TestCASUploadWaitsForPrune(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "wait_prune") + + const ( + dbName = "cas_waitprune_db" + tblName = "t" + bk = "cas_waitprune_bk" + ) + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf("CREATE TABLE `%s`.`%s` (id UInt64) ENGINE=MergeTree ORDER BY id", dbName, tblName)) + env.queryWithNoError(r, fmt.Sprintf("INSERT INTO `%s`.`%s` SELECT number FROM numbers(100)", dbName, tblName)) + env.casBackupNoError(r, "create", "--tables", dbName+".*", bk) + + // Inject a prune marker. + markerKey := "backup/cluster/0/cas/wait_prune/prune.marker" + markerBody := `{"host":"other","started_at":"2026-05-08T00:00:00Z","run_id":"abcd1234","tool":"test"}` + env.injectS3Object(r, markerKey, markerBody) + + // Schedule marker removal 5s in via cas-prune --unlock. + var wg sync.WaitGroup + wg.Add(1) + go func() { + defer wg.Done() + time.Sleep(5 * time.Second) + out, err := env.casBackup("cas-prune", "--unlock") + if err != nil { + t.Logf("cas-prune --unlock failed (expected only if marker already removed): %v out=%s", err, out) + } + }() + + start := time.Now() + out := env.casBackupNoError(r, "cas-upload", "--wait-for-prune=30s", bk) + elapsed := time.Since(start) + wg.Wait() + + r.GreaterOrEqual(elapsed, 4*time.Second, "upload should have waited >= 4s; got %s", elapsed) + r.Less(elapsed, 20*time.Second, "upload took too long; out=%s", out) + r.Contains(out, "uploaded now", "upload output should report bytes uploaded; out=%s", out) + + env.casBackupNoError(r, "cas-delete", bk) + r.NoError(env.dropDatabase(dbName, true)) +} + +// TestCASUploadWaitTimeout verifies the timeout path. +func TestCASUploadWaitTimeout(t *testing.T) { + casSkipIfClickHouseTooOld(t) + env, r := NewTestEnvironment(t) + env.connectWithWait(t, r, 500*time.Millisecond, 1*time.Second, 1*time.Minute) + defer env.Cleanup(t, r) + + env.casBootstrap(r, "wait_timeout") + + const ( + dbName = "cas_waittimeout_db" + tblName = "t" + bk = "cas_waittimeout_bk" + ) + r.NoError(env.dropDatabase(dbName, true)) + env.queryWithNoError(r, fmt.Sprintf("CREATE DATABASE `%s`", dbName)) + env.queryWithNoError(r, fmt.Sprintf("CREATE TABLE `%s`.`%s` (id UInt64) ENGINE=MergeTree ORDER BY id", dbName, tblName)) + env.queryWithNoError(r, fmt.Sprintf("INSERT INTO `%s`.`%s` SELECT number FROM numbers(10)", dbName, tblName)) + env.casBackupNoError(r, "create", "--tables", dbName+".*", bk) + + markerKey := "backup/cluster/0/cas/wait_timeout/prune.marker" + markerBody := `{"host":"other","started_at":"2026-05-08T00:00:00Z","run_id":"deadbeef","tool":"test"}` + env.injectS3Object(r, markerKey, markerBody) + + start := time.Now() + out, err := env.casBackup("cas-upload", "--wait-for-prune=2s", bk) + elapsed := time.Since(start) + + r.Error(err, "cas-upload should fail after 2s timeout; out=%s", out) + r.Contains(out, "prune still in progress", "out=%s", out) + r.GreaterOrEqual(elapsed, 2*time.Second, "should have waited at least 2s; elapsed=%s", elapsed) + r.Less(elapsed, 8*time.Second, "should not wait too much past 2s; elapsed=%s", elapsed) + + // Cleanup: unlock and delete. + _, _ = env.casBackup("cas-prune", "--unlock") + _, _ = env.casBackup("delete", "local", bk) + r.NoError(env.dropDatabase(dbName, true)) +} diff --git a/test/integration/configs/config-ftp-emulator.yaml b/test/integration/configs/config-ftp-emulator.yaml new file mode 100644 index 00000000..7cfee06a --- /dev/null +++ b/test/integration/configs/config-ftp-emulator.yaml @@ -0,0 +1,26 @@ +general: + remote_storage: ftp + upload_concurrency: 4 + download_concurrency: 4 + restore_schema_on_cluster: "{cluster}" + allow_object_disk_streaming: true +s3: + disable_ssl: false + disable_cert_verification: true +clickhouse: + host: clickhouse + port: 9000 + restart_command: bash -c 'echo "FAKE RESTART"' + timeout: 10m +ftp: + address: "ftp:21" + username: "test_backup" + password: "test_backup" + tls: false + path: "/backup" + object_disk_path: "/object_disk" + compression_format: none + compression_level: 1 + concurrency: 4 +api: + listen: :7171 diff --git a/test/integration/configs/config-gcs-emulator.yml b/test/integration/configs/config-gcs-emulator.yml new file mode 100644 index 00000000..1f430460 --- /dev/null +++ b/test/integration/configs/config-gcs-emulator.yml @@ -0,0 +1,23 @@ +general: + remote_storage: gcs + upload_concurrency: 4 + download_concurrency: 4 + restore_schema_on_cluster: "{cluster}" + allow_object_disk_streaming: true +s3: + disable_ssl: false + disable_cert_verification: true +clickhouse: + host: clickhouse + port: 9000 + restart_command: bash -c 'echo "FAKE RESTART"' + timeout: 10m +gcs: + bucket: altinity-qa-test + path: backup/{cluster}/{shard} + object_disk_path: object_disks/{cluster}/{shard} + compression_format: tar + endpoint: http://gcs:8080/storage/v1/ + skip_credentials: true + object_labels: + label: label_value diff --git a/test/integration/configs/config-sftp-emulator.yaml b/test/integration/configs/config-sftp-emulator.yaml new file mode 100644 index 00000000..a131924d --- /dev/null +++ b/test/integration/configs/config-sftp-emulator.yaml @@ -0,0 +1,25 @@ +general: + remote_storage: sftp + upload_concurrency: 4 + download_concurrency: 4 + restore_schema_on_cluster: "{cluster}" + allow_object_disk_streaming: true +s3: + disable_ssl: false + disable_cert_verification: true +clickhouse: + host: clickhouse + port: 9000 + restart_command: bash -c 'echo "FAKE RESTART"' + timeout: 10m +sftp: + address: "sshd" + username: "root" + password: "JFzMHfVpvTgEd74XXPq6wARA2Qg3AutJ" + key: "" + path: "/root" + object_disk_path: "/object_disk" + compression_format: none + compression_level: 1 +api: + listen: :7171 diff --git a/test/integration/containers.go b/test/integration/containers.go index 1a064e43..b74653b0 100644 --- a/test/integration/containers.go +++ b/test/integration/containers.go @@ -760,11 +760,11 @@ func (tc *TestContainers) clickHouseBinds(curDir, configsDir string) []string { "config-azblob.yml", "config-azblob-embedded.yml", "config-azblob-embedded-url.yml", "config-custom-kopia.yml", "config-custom-restic.yml", "config-custom-rsync.yml", "config-database-mapping.yml", - "config-ftp.yaml", "config-ftp-old.yaml", - "config-gcs.yml", "config-gcs-custom-endpoint.yml", + "config-ftp.yaml", "config-ftp-old.yaml", "config-ftp-emulator.yaml", + "config-gcs.yml", "config-gcs-custom-endpoint.yml", "config-gcs-emulator.yml", "config-s3.yml", "config-s3-embedded.yml", "config-s3-embedded-url.yml", "config-s3-embedded-local.yml", "config-s3-nodelete.yml", "config-s3-plain-embedded.yml", - "config-sftp-auth-key.yaml", "config-sftp-auth-password.yaml", + "config-sftp-auth-key.yaml", "config-sftp-auth-password.yaml", "config-sftp-emulator.yaml", } // template files (copied with .template suffix) templateFiles := []string{ diff --git a/test/integration/serverAPI_test.go b/test/integration/serverAPI_test.go index 2b267e94..29990e5b 100644 --- a/test/integration/serverAPI_test.go +++ b/test/integration/serverAPI_test.go @@ -450,6 +450,9 @@ func testAPIBackupList(t *testing.T, r *require.Assertions, env *TestEnvironment log.Debug().Msg("Check /backup/list") out, err := env.DockerExecOut("clickhouse-backup", "bash", "-ce", "curl -sfL 'http://localhost:7171/backup/list'") r.NoError(err, "%s\nunexpected GET /backup/list error: %v", out, err) + // v1 backups omit the "kind" field (omitempty) so legacy ClickHouse + // integration tables (CH < 21.1, no input_format_skip_unknown_fields) + // keep parsing /backup/list. CAS-only rows would carry "kind":"cas". localListFormat := "{\"name\":\"z_backup_%d\",\"created\":\"\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\",\"size\":\\d+,\"data_size\":\\d+,\"metadata_size\":\\d+,\"location\":\"local\",\"required\":\"\",\"desc\":\"regular\"}" remoteListFormat := "{\"name\":\"z_backup_%d\",\"created\":\"\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\",\"size\":\\d+,\"data_size\":\\d+,\"metadata_size\":\\d+,\"compressed_size\":\\d+,\"location\":\"remote\",\"required\":\"\",\"desc\":\"tar, regular\"}" for i := 1; i <= apiBackupNumber; i++ { diff --git a/test/integration/utils.go b/test/integration/utils.go index 14957b70..07420efb 100644 --- a/test/integration/utils.go +++ b/test/integration/utils.go @@ -494,6 +494,49 @@ func (env *TestEnvironment) Cleanup(t *testing.T, r *require.Assertions) { // Clean shared state between test runs so the next test gets a fresh environment _ = env.DockerExec("minio", "rm", "-rf", "/minio/data/clickhouse/disk_s3") + // CAS leaves state under /backup/cluster//cas//. + // v1 retention/clean-broken explicitly skips it (by design — see SkipPrefixes + // in pkg/cas/config.go), so it persists across env-pool reuse and surfaces + // as a bucket-not-empty failure in checkObjectStorageIsEmpty for the next + // non-CAS test on the same slot. + // + // For CAS tests we just blow away the entire backup/ tree on every backend + // (CAS tests don't share state with v1 paths in the same bucket). For + // non-CAS tests we only wipe the cas/ subtree to avoid touching v1 state + // the test is still using. + if strings.HasPrefix(t.Name(), "TestCAS") { + _ = env.DockerExec("minio", "bash", "-c", "rm -rf /minio/data/clickhouse/backup") + _ = env.DockerExec("gcs", "sh", "-c", "rm -rf /data/altinity-qa-test/backup 2>/dev/null || true") + _ = env.DockerExec("sshd", "sh", "-c", "rm -rf /root/cas/ 2>/dev/null || true") + _ = env.DockerExec("ftp", "sh", "-c", "rm -rf /home/test_backup/backup/cas/ /home/ftpusers/test_backup/backup/cas/ /backup/cas/ 2>/dev/null || true") + + // Local clickhouse-backup state + leaked cas_* databases — backstop + // for CAS tests that fail mid-flight (e.g. system.projections probe + // on CH < 24.9) before their trailing cas-delete + dropDatabase runs. + // Otherwise TestServerAPI's local-backup count and TestTablePatterns' + // SHOW CREATE DATABASE both choke on the leaked state. + _ = env.DockerExec("clickhouse", "bash", "-c", "rm -rf /var/lib/clickhouse/backup/*") + _, _ = env.DockerExecOut("clickhouse", "bash", "-c", + "clickhouse-client --query \"SELECT name FROM system.databases WHERE name LIKE 'cas_%'\" | "+ + "xargs -r -I{} clickhouse-client --query \"DROP DATABASE IF EXISTS \\`{}\\` SYNC\"") + } else { + _ = env.DockerExec("minio", "bash", "-c", "rm -rf /minio/data/clickhouse/backup/cluster/*/cas/ 2>/dev/null; find /minio/data/clickhouse/backup -mindepth 1 -type d -empty -delete 2>/dev/null; rmdir /minio/data/clickhouse/backup 2>/dev/null || true") + _ = env.DockerExec("gcs", "sh", "-c", "rm -rf /data/altinity-qa-test/backup/cluster/*/cas/ 2>/dev/null; find /data/altinity-qa-test/backup -mindepth 1 -type d -empty -delete 2>/dev/null; rmdir /data/altinity-qa-test/backup 2>/dev/null || true") + _ = env.DockerExec("sshd", "sh", "-c", "rm -rf /root/cas/ 2>/dev/null || true") + _ = env.DockerExec("ftp", "sh", "-c", "rm -rf /home/test_backup/backup/cas/ /home/ftpusers/test_backup/backup/cas/ /backup/cas/ 2>/dev/null || true") + } + + // TestShadowCleanup{,OnFailure} explicitly run `clickhouse-backup clean` + // which wipes /var/lib/clickhouse/shadow/. ClickHouse subsequently creates + // an empty increment.txt on the next FREEZE attempt, which then fails with + // "File ... shadow/increment.txt is empty" in the next env-pool consumer + // (TestSkipDisk, TestCustomKopia, etc.). Remove the stale 0-byte file so + // ClickHouse re-creates it cleanly on the next FREEZE. + if strings.HasPrefix(t.Name(), "TestShadowCleanup") { + _ = env.DockerExec("clickhouse", "bash", "-c", + "if [ -f /var/lib/clickhouse/shadow/increment.txt ] && [ ! -s /var/lib/clickhouse/shadow/increment.txt ]; then rm -f /var/lib/clickhouse/shadow/increment.txt; fi") + } + if t.Name() == "TestRBAC" || t.Name() == "TestConfigs" || strings.HasPrefix(t.Name(), "TestEmbedded") { env.DockerExecNoError(r, "minio", "rm", "-rf", "/minio/data/clickhouse/backups_s3") } diff --git a/test/testflows/clickhouse_backup/tests/snapshots/cli.py.cli.snapshot b/test/testflows/clickhouse_backup/tests/snapshots/cli.py.cli.snapshot index 2aff71a8..7aa1739a 100644 --- a/test/testflows/clickhouse_backup/tests/snapshots/cli.py.cli.snapshot +++ b/test/testflows/clickhouse_backup/tests/snapshots/cli.py.cli.snapshot @@ -1,6 +1,6 @@ -default_config = r"""'[\'general:\', \' remote_storage: none\', \' backups_to_keep_local: 0\', \' backups_to_keep_remote: 0\', \' log_level: info\', \' allow_empty_backups: false\', \' allow_object_disk_streaming: false\', \' use_resumable_state: true\', \' restore_schema_on_cluster: ""\', \' upload_by_part: true\', \' download_by_part: true\', \' restore_database_mapping: {}\', \' restore_table_mapping: {}\', \' retries_on_failure: 3\', \' retries_pause: 5s\', \' retries_jitter: 0\', \' watch_interval: 1h\', \' full_interval: 24h\', \' watch_backup_name_template: shard{shard}-{type}-{time:20060102150405}\', \' sharded_operation_mode: ""\', \' cpu_nice_priority: 15\', \' io_nice_priority: idle\', \' rbac_backup_always: true\', \' rbac_conflict_resolution: recreate\', \' config_backup_always: false\', \' named_collections_backup_always: false\', \' delete_batch_size: 1000\', \' retriesduration: 5s\', \' watchduration: 1h0m0s\', \' fullduration: 24h0m0s\', \'clickhouse:\', \' username: default\', \' password: ""\', \' host: localhost\', \' port: 9000\', \' disk_mapping: {}\', \' skip_tables:\', \' - system.*\', \' - INFORMATION_SCHEMA.*\', \' - information_schema.*\', \' - _temporary_and_external_tables.*\', \' skip_table_engines: []\', \' skip_disks: []\', \' skip_disk_types: []\', \' timeout: 30m\', \' freeze_by_part: false\', \' freeze_by_part_where: ""\', \' use_embedded_backup_restore: false\', \' use_embedded_backup_restore_cluster: ""\', \' embedded_backup_disk: ""\', \' backup_mutations: true\', \' restore_as_attach: false\', \' restore_distributed_cluster: ""\', \' check_parts_columns: true\', \' secure: false\', \' skip_verify: false\', \' sync_replicated_tables: false\', \' log_sql_queries: true\', \' config_dir: /etc/clickhouse-server/\', \' restart_command: exec:systemctl restart clickhouse-server\', \' ignore_not_exists_error_during_freeze: true\', \' check_replicas_before_attach: true\', \' default_replica_path: /clickhouse/tables/{cluster}/{shard}/{database}/{table}\', " default_replica_name: \'{replica}\'", \' tls_key: ""\', \' tls_cert: ""\', \' tls_ca: ""\', \' debug: false\', \' force_rebalance: false\', \'s3:\', \' access_key: ""\', \' secret_key: ""\', \' bucket: ""\', \' endpoint: ""\', \' region: us-east-1\', \' acl: private\', \' assume_role_arn: ""\', \' force_path_style: false\', \' path: ""\', \' object_disk_path: ""\', \' disable_ssl: false\', \' compression_level: 1\', \' compression_format: tar\', \' sse: ""\', \' sse_kms_key_id: ""\', \' sse_customer_algorithm: ""\', \' sse_customer_key: ""\', \' sse_customer_key_md5: ""\', \' sse_kms_encryption_context: ""\', \' disable_cert_verification: false\', \' use_custom_storage_class: false\', \' storage_class: STANDARD\', \' custom_storage_class_map: {}\', \' allow_multipart_download: false\', \' object_labels: {}\', \' request_payer: ""\', \' check_sum_algorithm: ""\', \' request_content_md5: false\', \' retry_mode: standard\', \' chunk_size: 5242880\', \' debug: false\', \'gcs:\', \' credentials_file: ""\', \' credentials_json: ""\', \' credentials_json_encoded: ""\', \' sa_email: ""\', \' embedded_access_key: ""\', \' embedded_secret_key: ""\', \' skip_credentials: false\', \' bucket: ""\', \' path: ""\', \' object_disk_path: ""\', \' compression_level: 1\', \' compression_format: tar\', \' debug: false\', \' force_http: false\', \' endpoint: ""\', \' storage_class: STANDARD\', \' object_labels: {}\', \' custom_storage_class_map: {}\', \' chunk_size: 16777216\', \' encryption_key: ""\', \'cos:\', \' url: ""\', \' timeout: 2m\', \' secret_id: ""\', \' secret_key: ""\', \' path: ""\', \' object_disk_path: ""\', \' compression_format: tar\', \' compression_level: 1\', \' allow_multipart_download: false\', \' debug: false\', \'api:\', \' listen: localhost:7171\', \' enable_metrics: true\', \' enable_pprof: false\', \' username: ""\', \' password: ""\', \' secure: false\', \' certificate_file: ""\', \' private_key_file: ""\', \' ca_cert_file: ""\', \' ca_key_file: ""\', \' create_integration_tables: false\', \' integration_tables_host: ""\', \' allow_parallel: false\', \' complete_resumable_after_restart: true\', \' watch_is_main_process: false\', \'ftp:\', \' address: ""\', \' timeout: 2m\', \' username: ""\', \' password: ""\', \' tls: false\', \' skip_tls_verify: false\', \' path: ""\', \' object_disk_path: ""\', \' compression_format: tar\', \' compression_level: 1\', \' debug: false\', \'sftp:\', \' address: ""\', \' port: 22\', \' username: ""\', \' password: ""\', \' key: ""\', \' path: ""\', \' object_disk_path: ""\', \' compression_format: tar\', \' compression_level: 1\', \' debug: false\', \'azblob:\', \' endpoint_schema: https\', \' endpoint_suffix: core.windows.net\', \' account_name: ""\', \' account_key: ""\', \' sas: ""\', \' use_managed_identity: false\', \' container: ""\', \' assume_container_exists: false\', \' path: ""\', \' object_disk_path: ""\', \' compression_level: 1\', \' compression_format: tar\', \' sse_key: ""\', \' buffer_count: 3\', \' timeout: 4h\', \' debug: false\', \'custom:\', \' upload_command: ""\', \' download_command: ""\', \' list_command: ""\', \' delete_command: ""\', \' command_timeout: 4h\', \' commandtimeoutduration: 4h0m0s\']'""" +default_config = r"""'[\'general:\', \' remote_storage: none\', \' backups_to_keep_local: 0\', \' backups_to_keep_remote: 0\', \' log_level: info\', \' allow_empty_backups: false\', \' allow_object_disk_streaming: false\', \' use_resumable_state: true\', \' restore_schema_on_cluster: ""\', \' upload_by_part: true\', \' download_by_part: true\', \' restore_database_mapping: {}\', \' restore_table_mapping: {}\', \' retries_on_failure: 3\', \' retries_pause: 5s\', \' retries_jitter: 0\', \' watch_interval: 1h\', \' full_interval: 24h\', \' watch_backup_name_template: shard{shard}-{type}-{time:20060102150405}\', \' sharded_operation_mode: ""\', \' cpu_nice_priority: 15\', \' io_nice_priority: idle\', \' rbac_backup_always: true\', \' rbac_conflict_resolution: recreate\', \' config_backup_always: false\', \' named_collections_backup_always: false\', \' delete_batch_size: 1000\', \' retriesduration: 5s\', \' watchduration: 1h0m0s\', \' fullduration: 24h0m0s\', \'clickhouse:\', \' username: default\', \' password: ""\', \' host: localhost\', \' port: 9000\', \' disk_mapping: {}\', \' skip_tables:\', \' - system.*\', \' - INFORMATION_SCHEMA.*\', \' - information_schema.*\', \' - _temporary_and_external_tables.*\', \' skip_table_engines: []\', \' skip_disks: []\', \' skip_disk_types: []\', \' timeout: 30m\', \' freeze_by_part: false\', \' freeze_by_part_where: ""\', \' use_embedded_backup_restore: false\', \' use_embedded_backup_restore_cluster: ""\', \' embedded_backup_disk: ""\', \' backup_mutations: true\', \' restore_as_attach: false\', \' restore_distributed_cluster: ""\', \' check_parts_columns: true\', \' secure: false\', \' skip_verify: false\', \' sync_replicated_tables: false\', \' log_sql_queries: true\', \' config_dir: /etc/clickhouse-server/\', \' restart_command: exec:systemctl restart clickhouse-server\', \' ignore_not_exists_error_during_freeze: true\', \' check_replicas_before_attach: true\', \' default_replica_path: /clickhouse/tables/{cluster}/{shard}/{database}/{table}\', " default_replica_name: \'{replica}\'", \' tls_key: ""\', \' tls_cert: ""\', \' tls_ca: ""\', \' debug: false\', \' force_rebalance: false\', \'s3:\', \' access_key: ""\', \' secret_key: ""\', \' bucket: ""\', \' endpoint: ""\', \' region: us-east-1\', \' acl: private\', \' assume_role_arn: ""\', \' force_path_style: false\', \' path: ""\', \' object_disk_path: ""\', \' disable_ssl: false\', \' compression_level: 1\', \' compression_format: tar\', \' sse: ""\', \' sse_kms_key_id: ""\', \' sse_customer_algorithm: ""\', \' sse_customer_key: ""\', \' sse_customer_key_md5: ""\', \' sse_kms_encryption_context: ""\', \' disable_cert_verification: false\', \' use_custom_storage_class: false\', \' storage_class: STANDARD\', \' custom_storage_class_map: {}\', \' allow_multipart_download: false\', \' object_labels: {}\', \' request_payer: ""\', \' check_sum_algorithm: ""\', \' request_content_md5: false\', \' retry_mode: standard\', \' chunk_size: 5242880\', \' debug: false\', \'gcs:\', \' credentials_file: ""\', \' credentials_json: ""\', \' credentials_json_encoded: ""\', \' sa_email: ""\', \' embedded_access_key: ""\', \' embedded_secret_key: ""\', \' skip_credentials: false\', \' bucket: ""\', \' path: ""\', \' object_disk_path: ""\', \' compression_level: 1\', \' compression_format: tar\', \' debug: false\', \' force_http: false\', \' endpoint: ""\', \' storage_class: STANDARD\', \' object_labels: {}\', \' custom_storage_class_map: {}\', \' chunk_size: 16777216\', \' encryption_key: ""\', \'cos:\', \' url: ""\', \' timeout: 2m\', \' secret_id: ""\', \' secret_key: ""\', \' path: ""\', \' object_disk_path: ""\', \' compression_format: tar\', \' compression_level: 1\', \' allow_multipart_download: false\', \' debug: false\', \'api:\', \' listen: localhost:7171\', \' enable_metrics: true\', \' enable_pprof: false\', \' username: ""\', \' password: ""\', \' secure: false\', \' certificate_file: ""\', \' private_key_file: ""\', \' ca_cert_file: ""\', \' ca_key_file: ""\', \' create_integration_tables: false\', \' integration_tables_host: ""\', \' allow_parallel: false\', \' complete_resumable_after_restart: true\', \' watch_is_main_process: false\', \'ftp:\', \' address: ""\', \' timeout: 2m\', \' username: ""\', \' password: ""\', \' tls: false\', \' skip_tls_verify: false\', \' path: ""\', \' object_disk_path: ""\', \' compression_format: tar\', \' compression_level: 1\', \' debug: false\', \'sftp:\', \' address: ""\', \' port: 22\', \' username: ""\', \' password: ""\', \' key: ""\', \' path: ""\', \' object_disk_path: ""\', \' compression_format: tar\', \' compression_level: 1\', \' debug: false\', \'azblob:\', \' endpoint_schema: https\', \' endpoint_suffix: core.windows.net\', \' account_name: ""\', \' account_key: ""\', \' sas: ""\', \' use_managed_identity: false\', \' container: ""\', \' assume_container_exists: false\', \' path: ""\', \' object_disk_path: ""\', \' compression_level: 1\', \' compression_format: tar\', \' sse_key: ""\', \' buffer_count: 3\', \' timeout: 4h\', \' debug: false\', \'custom:\', \' upload_command: ""\', \' download_command: ""\', \' list_command: ""\', \' delete_command: ""\', \' command_timeout: 4h\', \' commandtimeoutduration: 4h0m0s\', \'cas:\', \' enabled: false\', \' cluster_id: ""\', \' root_prefix: cas/\', \' inline_threshold: 262144\', \' grace_blob: 24h\', \' abandon_threshold: 168h\', \' wait_for_prune: ""\', \' allow_unsafe_markers: false\', \' skip_conditional_put_probe: false\', \' allow_unsafe_object_disk_skip: false\']'""" -help_flag = r"""'NAME:\n clickhouse-backup - Tool for easy backup of ClickHouse with cloud supportUSAGE:\n clickhouse-backup [-t, --tables=.
] DESCRIPTION:\n Run as \'root\' or \'clickhouse\' userCOMMANDS:\n tables List of tables, exclude skip_tables\n create Create new backup\n create_remote Create and upload new backup\n upload Upload backup to remote storage\n list List of backups\n download Download backup from remote storage\n restore Create schema and restore data from backup\n restore_remote Download and restore\n delete Delete specific backup\n default-config Print default config\n print-config Print current config merged with environment variables\n clean Remove data in \'shadow\' folder from all \'path\' folders available from \'system.disks\'\n clean_remote_broken Remove all broken remote backups\n clean_local_broken Remove all broken local backups\n watch Run infinite loop which create full + incremental backup sequence to allow efficient backup sequences\n server Run API server\n help, h Shows a list of commands or help for one commandGLOBAL OPTIONS:\n --config value, -c value Config \'FILE\' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]\n --environment-override value, --env value override any environment variable via CLI parameter\n --help, -h show help\n --version, -v print the version'""" +help_flag = r"""'NAME:\n clickhouse-backup - Tool for easy backup of ClickHouse with cloud supportUSAGE:\n clickhouse-backup [-t, --tables=.
] DESCRIPTION:\n Run as \'root\' or \'clickhouse\' userCOMMANDS:\n tables List of tables, exclude skip_tables\n create Create new backup\n create_remote Create and upload new backup\n upload Upload backup to remote storage\n list List of backups\n download Download backup from remote storage\n restore Create schema and restore data from backup\n restore_remote Download and restore\n delete Delete specific backup\n default-config Print default config\n print-config Print current config merged with environment variables\n clean Remove data in \'shadow\' folder from all \'path\' folders available from \'system.disks\'\n clean_remote_broken Remove all broken remote backups\n clean_local_broken Remove all broken local backups\n watch Run infinite loop which create full + incremental backup sequence to allow efficient backup sequences\n server Run API server\n cas-upload Upload a local backup using the content-addressable layout (see docs/cas-design.md)\n cas-download Materialize a CAS backup into the local data directory (does not load into ClickHouse)\n cas-restore Download a CAS backup and restore tables into ClickHouse\n cas-delete Delete a CAS backup\'s metadata subtree (Phase 1: blobs are NOT reclaimed)\n cas-verify HEAD-check every blob referenced by a CAS backup\n cas-status Print a LIST-only health summary for the configured CAS cluster\n cas-prune Garbage-collect orphan blobs (mark-and-sweep) for the configured CAS cluster\n help, h Shows a list of commands or help for one commandGLOBAL OPTIONS:\n --config value, -c value Config \'FILE\' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]\n --environment-override value, --env value override any environment variable via CLI parameter\n --help, -h show help\n --version, -v print the version'""" -cli_usage = r"""'NAME:\n clickhouse-backup - Tool for easy backup of ClickHouse with cloud supportUSAGE:\n clickhouse-backup [-t, --tables=.
] DESCRIPTION:\n Run as \'root\' or \'clickhouse\' userCOMMANDS:\n tables List of tables, exclude skip_tables\n create Create new backup\n create_remote Create and upload new backup\n upload Upload backup to remote storage\n list List of backups\n download Download backup from remote storage\n restore Create schema and restore data from backup\n restore_remote Download and restore\n delete Delete specific backup\n default-config Print default config\n print-config Print current config merged with environment variables\n clean Remove data in \'shadow\' folder from all \'path\' folders available from \'system.disks\'\n clean_remote_broken Remove all broken remote backups\n clean_local_broken Remove all broken local backups\n watch Run infinite loop which create full + incremental backup sequence to allow efficient backup sequences\n server Run API server\n help, h Shows a list of commands or help for one commandGLOBAL OPTIONS:\n --config value, -c value Config \'FILE\' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]\n --environment-override value, --env value override any environment variable via CLI parameter\n --help, -h show help\n --version, -v print the version'""" +cli_usage = r"""'NAME:\n clickhouse-backup - Tool for easy backup of ClickHouse with cloud supportUSAGE:\n clickhouse-backup [-t, --tables=.
] DESCRIPTION:\n Run as \'root\' or \'clickhouse\' userCOMMANDS:\n tables List of tables, exclude skip_tables\n create Create new backup\n create_remote Create and upload new backup\n upload Upload backup to remote storage\n list List of backups\n download Download backup from remote storage\n restore Create schema and restore data from backup\n restore_remote Download and restore\n delete Delete specific backup\n default-config Print default config\n print-config Print current config merged with environment variables\n clean Remove data in \'shadow\' folder from all \'path\' folders available from \'system.disks\'\n clean_remote_broken Remove all broken remote backups\n clean_local_broken Remove all broken local backups\n watch Run infinite loop which create full + incremental backup sequence to allow efficient backup sequences\n server Run API server\n cas-upload Upload a local backup using the content-addressable layout (see docs/cas-design.md)\n cas-download Materialize a CAS backup into the local data directory (does not load into ClickHouse)\n cas-restore Download a CAS backup and restore tables into ClickHouse\n cas-delete Delete a CAS backup\'s metadata subtree (Phase 1: blobs are NOT reclaimed)\n cas-verify HEAD-check every blob referenced by a CAS backup\n cas-status Print a LIST-only health summary for the configured CAS cluster\n cas-prune Garbage-collect orphan blobs (mark-and-sweep) for the configured CAS cluster\n help, h Shows a list of commands or help for one commandGLOBAL OPTIONS:\n --config value, -c value Config \'FILE\' name. (default: "/etc/clickhouse-backup/config.yml") [$CLICKHOUSE_BACKUP_CONFIG]\n --environment-override value, --env value override any environment variable via CLI parameter\n --help, -h show help\n --version, -v print the version'"""