From f4267e9f480702d69b28f381203a6f87d76d692e Mon Sep 17 00:00:00 2001
From: Chris Munns
Date: Tue, 31 Mar 2026 18:42:43 -0400
Subject: [PATCH 1/3] Fix --split-tables-larger-than guidance for resume

The blanket "never use with --resume" was wrong. The truncation risk
only exists when COPY is still in progress. If COPY completed and the
failure was in indexes/CDC, resume with the same value is safe and
required by catalog validation.
---
 pgcopydb-helpers/AGENTS.md           |  8 ++++----
 pgcopydb-helpers/README.md           | 11 +++++++++--
 pgcopydb-helpers/resume-migration.sh | 19 +++++++++++++++----
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/pgcopydb-helpers/AGENTS.md b/pgcopydb-helpers/AGENTS.md
index 036d747..3e850ba 100644
--- a/pgcopydb-helpers/AGENTS.md
+++ b/pgcopydb-helpers/AGENTS.md
@@ -201,9 +201,9 @@ Resumes a previously interrupted `pgcopydb clone --follow` migration. Backs up t
 ~/resume-migration.sh ~/migration_YYYYMMDD-HHMMSS  # specify explicitly
 ```
 
-**Important:** This script intentionally does NOT use `--split-tables-larger-than` with `--resume`. pgcopydb truncates the entire table before checking split parts on resume, which causes data loss.
+**Important:** If the original migration used `--split-tables-larger-than`, the resume script passes the same value. This is safe when COPY has already completed (the COPY supervisor doesn't run, so no truncation occurs). If COPY was still in progress when the failure happened, use `--restart` instead — pgcopydb truncates split tables before re-queuing parts on resume, which loses already-copied partitions. Run `~/check-migration-status.sh` to determine whether COPY completed before deciding.
 
-**When to use:** After pgcopydb crashes, the instance reboots, or the migration is interrupted. Do NOT use after a successful migration — use `run-migration.sh` to start fresh.
+**When to use:** After pgcopydb crashes, the instance reboots, or the migration is interrupted during indexes, post-data restore, or CDC. If COPY failed mid-flight, use `~/target-clean.sh` + `~/drop-replication-slots.sh` + `~/start-migration-screen.sh` to start fresh instead.
 
 **Requires:** `PGCOPYDB_SOURCE_PGURI`, `PGCOPYDB_TARGET_PGURI`, existing migration directory
 
@@ -396,11 +396,11 @@ All scripts use variables at the top that can be adjusted per migration. See [Cl
 | `TABLE_JOBS` | 16 | run-migration.sh, resume-migration.sh |
 | `INDEX_JOBS` | 12 | run-migration.sh, resume-migration.sh |
 | `FILTER_FILE` | ~/filters.ini | run-migration.sh, resume-migration.sh |
-| `--split-tables-larger-than` | 50GB | run-migration.sh only (not resume) |
+| `--split-tables-larger-than` | 50GB | run-migration.sh, resume-migration.sh |
 
 ## Critical Warnings
 
-- **Never use `--split-tables-larger-than` with `--resume`** — pgcopydb truncates the entire table before checking parts, causing data loss.
+- **If COPY failed mid-flight, use `--restart` instead of `--resume`** — pgcopydb truncates split tables before re-queuing parts on resume, causing data loss for partially-copied tables. If COPY completed and the failure was in a later phase (indexes, CDC), `--resume` with the same `--split-tables-larger-than` value is safe. Run `~/check-migration-status.sh` to check.
 - **Never use `pgcopydb --restart`** without backing up first — it wipes the CDC directory AND SQLite catalogs.
 - **Always clean up replication slots** after a migration — unconsumed slots cause WAL accumulation on the source.
 - **Verify extension filtering after STEP 1** — check `SELECT COUNT(*) FROM s_depend;` in `filter.db`. If it's 0, extension-owned objects in `public` won't be filtered.
diff --git a/pgcopydb-helpers/README.md b/pgcopydb-helpers/README.md
index 3b85e55..cb2293c 100644
--- a/pgcopydb-helpers/README.md
+++ b/pgcopydb-helpers/README.md
@@ -215,7 +215,14 @@ If pgcopydb crashes, the instance reboots, or the migration is interrupted:
 ~/resume-migration.sh ~/migration_YYYYMMDD-HHMMSS  # or specify explicitly
 ```
 
-This backs up the SQLite catalog before resuming. It uses `--not-consistent` to allow resuming from a mid-transaction state, and intentionally omits `--split-tables-larger-than` because pgcopydb truncates the entire table before checking split parts on resume, which causes data loss.
+This backs up the SQLite catalog before resuming and uses `--not-consistent` to allow resuming from a mid-transaction state.
+
+**Choosing between `--resume` and `--restart`:**
+
+- **COPY already completed** (failure was during indexes, post-data restore, or CDC): Use `--resume`. If the original migration used `--split-tables-larger-than`, pass the same value — the COPY phase is skipped entirely so there is no truncation risk.
+- **COPY was still in progress** when the failure occurred: Use `--restart` (full restart) instead. pgcopydb truncates split tables before re-queuing parts on resume, which loses data from already-copied partitions.
+
+To check whether COPY completed, run `~/check-migration-status.sh` and look at the copy task progress. If all COPY tasks show as completed with no outstanding jobs, it is safe to `--resume`.
 
 To start completely over, wipe the target and clean up replication:
 
@@ -392,7 +399,7 @@ sqlite3 ~/migration_*/schema/filter.db "SELECT COUNT(*) FROM s_depend;"
 
 ## Critical Warnings
 
-- **Never use `--split-tables-larger-than` with `--resume`** — pgcopydb truncates the entire table before checking parts, causing data loss.
+- **If COPY failed mid-flight, use `--restart` instead of `--resume`** — pgcopydb truncates split tables before re-queuing parts on resume, causing data loss for partially-copied tables. If COPY completed and the failure was in a later phase (indexes, CDC), `--resume` with the same `--split-tables-larger-than` value is safe.
 - **Never use `pgcopydb --restart`** without backing up first — it wipes the CDC directory AND SQLite catalogs.
 - **Always clean up replication slots** when done — unconsumed slots cause unbounded WAL growth on the source.
 - **Verify extension filtering after STEP 1** — if `s_depend` count is 0, extension-owned objects won't be excluded.
diff --git a/pgcopydb-helpers/resume-migration.sh b/pgcopydb-helpers/resume-migration.sh
index 2675efa..0a2ad5f 100755
--- a/pgcopydb-helpers/resume-migration.sh
+++ b/pgcopydb-helpers/resume-migration.sh
@@ -5,8 +5,15 @@
 #
 # Resumes a previously interrupted pgcopydb clone --follow migration.
 # If no directory is given, uses the most recent ~/migration_* directory.
-# Backs up the SQLite catalog before resuming. Does NOT use
-# --split-tables-larger-than (unsafe with --resume).
+# Backs up the SQLite catalog before resuming.
+#
+# IMPORTANT: --split-tables-larger-than and --resume
+# If the original migration used --split-tables-larger-than, you MUST pass
+# the same value here -- pgcopydb validates catalog consistency and will
+# refuse to resume without it. This is SAFE if the COPY phase already
+# completed (indexes, CDC, etc.). If COPY was still in progress when the
+# failure occurred, use --restart instead -- pgcopydb truncates split tables
+# before re-queuing parts on resume, which loses already-copied partitions.
 #
 set -eo pipefail
 
@@ -57,8 +64,10 @@ cp "$MIGRATION_DIR/schema/source.db" "$MIGRATION_DIR/schema/source.db.bak.$(date
     echo "Migration dir: $MIGRATION_DIR"
     echo "=========================================="
 
-    # NOTE: Do NOT use --split-tables-larger-than with --resume.
-    # pgcopydb truncates the entire table before checking parts, causing data loss.
+    # If the original migration used --split-tables-larger-than, pass the
+    # same value here. This is safe when COPY is already complete (the COPY
+    # supervisor won't run, so no truncation occurs). If COPY failed
+    # mid-flight, use --restart instead of --resume.
     /usr/lib/postgresql/17/bin/pgcopydb clone \
         --follow \
         --plugin wal2json \
@@ -73,6 +82,8 @@ cp "$MIGRATION_DIR/schema/source.db" "$MIGRATION_DIR/schema/source.db.bak.$(date
         --skip-db-properties \
         --table-jobs "$TABLE_JOBS" \
         --index-jobs "$INDEX_JOBS" \
+        --split-tables-larger-than 50GB \
+        --split-max-parts "$TABLE_JOBS" \
         --dir "$MIGRATION_DIR"
 
     EXIT_CODE=$?

From 7ffcb9e26e43358c24dbb0928b4a81138f39c58c Mon Sep 17 00:00:00 2001
From: Chris Munns
Date: Fri, 10 Apr 2026 12:42:31 -0400
Subject: [PATCH 2/3] Simplify resume docs: drop resume vs restart guidance

The resume script just does its thing with --resume and
--split-tables-larger-than for catalog consistency. Users don't need to
choose between resume and restart.
---
 pgcopydb-helpers/AGENTS.md           |  6 +++---
 pgcopydb-helpers/README.md           | 11 ++---------
 pgcopydb-helpers/resume-migration.sh | 14 +++-----------
 3 files changed, 8 insertions(+), 23 deletions(-)

diff --git a/pgcopydb-helpers/AGENTS.md b/pgcopydb-helpers/AGENTS.md
index 3e850ba..f0029d9 100644
--- a/pgcopydb-helpers/AGENTS.md
+++ b/pgcopydb-helpers/AGENTS.md
@@ -201,9 +201,9 @@ Resumes a previously interrupted `pgcopydb clone --follow` migration. Backs up t
 ~/resume-migration.sh ~/migration_YYYYMMDD-HHMMSS  # specify explicitly
 ```
 
-**Important:** If the original migration used `--split-tables-larger-than`, the resume script passes the same value. This is safe when COPY has already completed (the COPY supervisor doesn't run, so no truncation occurs). If COPY was still in progress when the failure happened, use `--restart` instead — pgcopydb truncates split tables before re-queuing parts on resume, which loses already-copied partitions. Run `~/check-migration-status.sh` to determine whether COPY completed before deciding.
+**Important:** The script passes `--split-tables-larger-than` to match `run-migration.sh`. pgcopydb requires catalog consistency — if the original run used split tables, the resume must pass the same value.
 
-**When to use:** After pgcopydb crashes, the instance reboots, or the migration is interrupted during indexes, post-data restore, or CDC. If COPY failed mid-flight, use `~/target-clean.sh` + `~/drop-replication-slots.sh` + `~/start-migration-screen.sh` to start fresh instead.
+**When to use:** After pgcopydb crashes, the instance reboots, or the migration is interrupted. Do NOT use after a successful migration — use `run-migration.sh` to start fresh.
 
 **Requires:** `PGCOPYDB_SOURCE_PGURI`, `PGCOPYDB_TARGET_PGURI`, existing migration directory
 
@@ -400,7 +400,7 @@ All scripts use variables at the top that can be adjusted per migration. See [Cl
 
 ## Critical Warnings
 
-- **If COPY failed mid-flight, use `--restart` instead of `--resume`** — pgcopydb truncates split tables before re-queuing parts on resume, causing data loss for partially-copied tables. If COPY completed and the failure was in a later phase (indexes, CDC), `--resume` with the same `--split-tables-larger-than` value is safe. Run `~/check-migration-status.sh` to check.
+- **Resume and restart are different** — `--resume` skips completed work; `--restart` wipes progress and starts over. The resume script uses `--resume`.
 - **Never use `pgcopydb --restart`** without backing up first — it wipes the CDC directory AND SQLite catalogs.
 - **Always clean up replication slots** after a migration — unconsumed slots cause WAL accumulation on the source.
 - **Verify extension filtering after STEP 1** — check `SELECT COUNT(*) FROM s_depend;` in `filter.db`. If it's 0, extension-owned objects in `public` won't be filtered.
diff --git a/pgcopydb-helpers/README.md b/pgcopydb-helpers/README.md
index cb2293c..4270b0c 100644
--- a/pgcopydb-helpers/README.md
+++ b/pgcopydb-helpers/README.md
@@ -215,14 +215,7 @@ If pgcopydb crashes, the instance reboots, or the migration is interrupted:
 ~/resume-migration.sh ~/migration_YYYYMMDD-HHMMSS  # or specify explicitly
 ```
 
-This backs up the SQLite catalog before resuming and uses `--not-consistent` to allow resuming from a mid-transaction state.
-
-**Choosing between `--resume` and `--restart`:**
-
-- **COPY already completed** (failure was during indexes, post-data restore, or CDC): Use `--resume`. If the original migration used `--split-tables-larger-than`, pass the same value — the COPY phase is skipped entirely so there is no truncation risk.
-- **COPY was still in progress** when the failure occurred: Use `--restart` (full restart) instead. pgcopydb truncates split tables before re-queuing parts on resume, which loses data from already-copied partitions.
-
-To check whether COPY completed, run `~/check-migration-status.sh` and look at the copy task progress. If all COPY tasks show as completed with no outstanding jobs, it is safe to `--resume`.
+This backs up the SQLite catalog before resuming and uses `--not-consistent` to allow resuming from a mid-transaction state. The script passes `--split-tables-larger-than` to match `run-migration.sh` — pgcopydb requires catalog consistency, so the resume must use the same split value as the original run.
 
 To start completely over, wipe the target and clean up replication:
 
@@ -399,7 +392,7 @@ sqlite3 ~/migration_*/schema/filter.db "SELECT COUNT(*) FROM s_depend;"
 
 ## Critical Warnings
 
-- **If COPY failed mid-flight, use `--restart` instead of `--resume`** — pgcopydb truncates split tables before re-queuing parts on resume, causing data loss for partially-copied tables. If COPY completed and the failure was in a later phase (indexes, CDC), `--resume` with the same `--split-tables-larger-than` value is safe.
+- **Resume and restart are different** — `--resume` skips completed work; `--restart` wipes progress and starts over. The resume script uses `--resume`.
 - **Never use `pgcopydb --restart`** without backing up first — it wipes the CDC directory AND SQLite catalogs.
 - **Always clean up replication slots** when done — unconsumed slots cause unbounded WAL growth on the source.
 - **Verify extension filtering after STEP 1** — if `s_depend` count is 0, extension-owned objects won't be excluded.
diff --git a/pgcopydb-helpers/resume-migration.sh b/pgcopydb-helpers/resume-migration.sh
index 0a2ad5f..8bd2c90 100755
--- a/pgcopydb-helpers/resume-migration.sh
+++ b/pgcopydb-helpers/resume-migration.sh
@@ -7,13 +7,9 @@
 # If no directory is given, uses the most recent ~/migration_* directory.
 # Backs up the SQLite catalog before resuming.
 #
-# IMPORTANT: --split-tables-larger-than and --resume
-# If the original migration used --split-tables-larger-than, you MUST pass
-# the same value here -- pgcopydb validates catalog consistency and will
-# refuse to resume without it. This is SAFE if the COPY phase already
-# completed (indexes, CDC, etc.). If COPY was still in progress when the
-# failure occurred, use --restart instead -- pgcopydb truncates split tables
-# before re-queuing parts on resume, which loses already-copied partitions.
+# Uses --split-tables-larger-than to match run-migration.sh. pgcopydb
+# requires catalog consistency — if the original run used split tables,
+# the resume must pass the same value.
 #
 set -eo pipefail
 
@@ -64,10 +60,6 @@ cp "$MIGRATION_DIR/schema/source.db" "$MIGRATION_DIR/schema/source.db.bak.$(date
     echo "Migration dir: $MIGRATION_DIR"
     echo "=========================================="
 
-    # If the original migration used --split-tables-larger-than, pass the
-    # same value here. This is safe when COPY is already complete (the COPY
-    # supervisor won't run, so no truncation occurs). If COPY failed
-    # mid-flight, use --restart instead of --resume.
     /usr/lib/postgresql/17/bin/pgcopydb clone \
         --follow \
         --plugin wal2json \

From 5bd33dc3f983b9e6158a93e4b581b6915d554a12 Mon Sep 17 00:00:00 2001
From: Chris Munns
Date: Fri, 10 Apr 2026 12:56:03 -0400
Subject: [PATCH 3/3] Remove --restart guidance, recommend clean start instead

pgcopydb --restart doesn't clean the target or correct previous
failures. Point users to target-clean + drop-slots + start-migration
instead.
---
 pgcopydb-helpers/AGENTS.md | 5 ++---
 pgcopydb-helpers/README.md | 3 +--
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/pgcopydb-helpers/AGENTS.md b/pgcopydb-helpers/AGENTS.md
index f0029d9..7237744 100644
--- a/pgcopydb-helpers/AGENTS.md
+++ b/pgcopydb-helpers/AGENTS.md
@@ -203,7 +203,7 @@ Resumes a previously interrupted `pgcopydb clone --follow` migration. Backs up t
 
 **Important:** The script passes `--split-tables-larger-than` to match `run-migration.sh`. pgcopydb requires catalog consistency — if the original run used split tables, the resume must pass the same value.
 
-**When to use:** After pgcopydb crashes, the instance reboots, or the migration is interrupted. Do NOT use after a successful migration — use `run-migration.sh` to start fresh.
+**When to use:** After pgcopydb crashes, the instance reboots, or the migration is interrupted. To start completely over instead, run `~/target-clean.sh` + `~/drop-replication-slots.sh` first, then `~/start-migration-screen.sh`.
 
 **Requires:** `PGCOPYDB_SOURCE_PGURI`, `PGCOPYDB_TARGET_PGURI`, existing migration directory
 
@@ -400,8 +400,7 @@ All scripts use variables at the top that can be adjusted per migration. See [Cl
 
 ## Critical Warnings
 
-- **Resume and restart are different** — `--resume` skips completed work; `--restart` wipes progress and starts over. The resume script uses `--resume`.
-- **Never use `pgcopydb --restart`** without backing up first — it wipes the CDC directory AND SQLite catalogs.
+- **Do not use `pgcopydb --restart`** — it wipes the CDC directory and SQLite catalogs without cleaning the target database or correcting previous failures. To start over, use `~/target-clean.sh` + `~/drop-replication-slots.sh` + `~/start-migration-screen.sh` instead.
 - **Always clean up replication slots** after a migration — unconsumed slots cause WAL accumulation on the source.
 - **Verify extension filtering after STEP 1** — check `SELECT COUNT(*) FROM s_depend;` in `filter.db`. If it's 0, extension-owned objects in `public` won't be filtered.
 - **pg_restore error tolerance** — pgcopydb allows up to 10 restore errors by default. If your migration has more, you may need a custom build with a higher `MAX_TOLERATED_RESTORE_ERRORS`.
diff --git a/pgcopydb-helpers/README.md b/pgcopydb-helpers/README.md
index 4270b0c..1cadd89 100644
--- a/pgcopydb-helpers/README.md
+++ b/pgcopydb-helpers/README.md
@@ -392,7 +392,6 @@ sqlite3 ~/migration_*/schema/filter.db "SELECT COUNT(*) FROM s_depend;"
 
 ## Critical Warnings
 
-- **Resume and restart are different** — `--resume` skips completed work; `--restart` wipes progress and starts over. The resume script uses `--resume`.
-- **Never use `pgcopydb --restart`** without backing up first — it wipes the CDC directory AND SQLite catalogs.
+- **Do not use `pgcopydb --restart`** — it wipes the CDC directory and SQLite catalogs without cleaning the target database or correcting previous failures. To start over, use `~/target-clean.sh` + `~/drop-replication-slots.sh` + `~/start-migration-screen.sh` instead.
 - **Always clean up replication slots** when done — unconsumed slots cause unbounded WAL growth on the source.
 - **Verify extension filtering after STEP 1** — if `s_depend` count is 0, extension-owned objects won't be excluded.
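
Reviewer note (not part of the patches above): the series hard-codes `50GB` in `resume-migration.sh` so that it matches `run-migration.sh`. Since the point of the change is that a resume must pass the same split value as the original run, one possible follow-up is to record the value in the migration directory and read it back on resume. The sketch below is only an illustration under assumptions: the function names `record_split_threshold` and `read_split_threshold` and the marker file name `split-threshold` are hypothetical and do not exist in pgcopydb-helpers or in pgcopydb.

```bash
# Hypothetical helpers (not part of pgcopydb-helpers): remember the split
# threshold used by the first run so a later resume passes the same value.
# The marker file name "split-threshold" is an assumption for this sketch.

# record_split_threshold DIR VALUE
# Called once by the initial run, after the migration directory exists.
record_split_threshold() {
    printf '%s\n' "$2" > "$1/split-threshold"
}

# read_split_threshold DIR
# Prints the recorded value. Returns non-zero when nothing was recorded,
# which in this sketch means the original run did not split tables.
read_split_threshold() {
    [ -f "$1/split-threshold" ] || return 1
    cat "$1/split-threshold"
}
```

With helpers like these, the resume script could pass `--split-tables-larger-than "$(read_split_threshold "$MIGRATION_DIR")"` instead of a literal, so the two scripts cannot drift apart if the default is changed later.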