Fix restore order: wait for replicated DB sync before restoring tables#302
aalexfvk merged 22 commits into yandex:main from
Conversation
Try to fix readonly replicated tables in-place via ZooKeeper metadata cleanup and SYSTEM RESTORE REPLICA before falling back to table recreation
Reviewer's Guide

Refactors the restore workflow to first synchronize replicated databases via ZooKeeper log_ptr tracking, attempts in-place repair of readonly replicated tables using ZooKeeper metadata cleanup plus SYSTEM RESTORE REPLICA before falling back to table recreation, and centralizes table-drop logic while skipping table operations in replicated databases unless explicitly requested.

Sequence diagram for updated restore workflow with replicated DB sync

sequenceDiagram
actor Operator
participant CLI as ch_backup_restore
participant Restore as RestoreWorkflow
participant DBSync as wait_sync_replicated_databases
participant SyncOne as _sync_database
participant ZKData as DatabaseZookeeperData
participant ZK as ZookeeperCTL
participant CH as ClickhouseCTL
participant TblMgr as TableBackupManager
Operator->>CLI: invoke restore
CLI->>Restore: _restore(context, sources,...)
Restore->>DBSync: wait_sync_replicated_databases(context, restored_databases, keep_going=True)
loop for each replicated Database
DBSync->>SyncOne: _sync_database(context, db, deadline)
SyncOne->>ZKData: DatabaseZookeeperData.resolve(context, db)
ZKData->>CH: get_macros()
CH-->>ZKData: macros
ZKData->>ZK: zk_client.get(zk_root_path + zk_path + /max_log_ptr)
ZK-->>ZKData: max_log_ptr
ZKData->>ZK: zk_client.get(zk_root_path + zk_path + /replicas/shard|replica/log_ptr)
ZK-->>ZKData: log_ptr
alt log_ptr >= max_log_ptr
SyncOne-->>DBSync: sync complete for db
else log_ptr < max_log_ptr
loop until deadline
SyncOne->>ZKData: get_log_ptr()
ZKData->>ZK: zk_client.get(...log_ptr)
ZK-->>ZKData: log_ptr
alt log_ptr >= max_log_ptr
SyncOne-->>DBSync: sync complete
SyncOne-->>SyncOne: break out of loop
else log_ptr stalled
SyncOne->>CH: get_ddl_queue_unfinished_status(db.name)
CH-->>SyncOne: pending DDL entries
alt entries have errors
SyncOne-->>DBSync: raise RuntimeError
else no errors
SyncOne->>SyncOne: sleep(poll_interval)
end
end
end
end
end
DBSync-->>Restore: sync_results per db (DONE/FAILED)
Restore->>TblMgr: restore(..., restore_tables_in_replicated_database)
TblMgr-->>Restore: tables restored
Restore->>DBSync: wait_sync_replicated_databases(context, failed_databases, keep_going)
DBSync-->>Restore: updated sync_results
Restore-->>CLI: restore finished / errors
CLI-->>Operator: output status
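The polling loop in the diagram above can be sketched as follows. This is a hypothetical simplification, not the PR's actual code: `get_log_ptr`, `ddl_queue_has_errors`, and the toy replica stand in for the ZooKeeper reads and the `get_ddl_queue_unfinished_status` check.

```python
import time

def wait_for_log_ptr(get_log_ptr, max_log_ptr, deadline, poll_interval=0.01,
                     ddl_queue_has_errors=lambda: False):
    """Poll until the replica's log_ptr catches up with max_log_ptr.

    Returns True on success, False if the deadline expires; raises if the
    replicated DDL queue reports failed entries (no point in waiting).
    """
    while time.monotonic() < deadline:
        if get_log_ptr() >= max_log_ptr:
            return True
        if ddl_queue_has_errors():
            raise RuntimeError("replicated DDL queue has failed entries")
        time.sleep(poll_interval)
    return False

# Toy replica whose log_ptr advances by one on every read.
state = {"log_ptr": 0}
def fake_log_ptr():
    state["log_ptr"] += 1
    return state["log_ptr"]

synced = wait_for_log_ptr(fake_log_ptr, max_log_ptr=5,
                          deadline=time.monotonic() + 1.0)
print(synced)  # True: the fake replica catches up well within the deadline
```

Checking the DDL queue only when progress stalls (as in the diagram) would add a "last seen log_ptr" comparison inside the loop; it is omitted here for brevity.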
Sequence diagram for readonly replicated table in-place repair during restore

sequenceDiagram
participant TblMgr as TableBackupManager
participant Prep as _preprocess_tables_to_restore
participant FixRO as _get_remaining_readonly_tables_after_fix_attempt
participant DropTbl as _drop_existing_table
participant MC as MetadataCleaner
participant CH as ClickhouseCTL
TblMgr->>Prep: _preprocess_tables_to_restore(context, tables, metadata_cleaner,...)
Prep->>FixRO: _get_remaining_readonly_tables_after_fix_attempt(context, tables, metadata_cleaner)
FixRO->>CH: get_replicas(readonly=True)
CH-->>FixRO: readonly replicas list
alt metadata_cleaner is None or no readonly tables
FixRO-->>Prep: (existing_readonly_tables, empty_set)
else some replicated readonly tables
FixRO->>MC: clean_tables_metadata(replicated_readonly_tables)
loop for each replicated_readonly_table
FixRO->>CH: restore_replica(table)
alt restore_replica fails
CH-->>FixRO: error
FixRO->>FixRO: log warning, continue
else restore_replica succeeds
CH-->>FixRO: ok
end
end
FixRO->>CH: get_replicas(readonly=True)
CH-->>FixRO: still_readonly set
FixRO->>FixRO: compute already_cleaned
FixRO-->>Prep: (remaining_readonly_tables, already_cleaned)
end
loop for each table to restore
Prep->>Prep: find existing_table
alt db is replicated and restore_tables_in_replicated_database is False
Prep->>Prep: log skip restore/drop
else existing_table does not exist
Prep->>Prep: append table to result
else existing_table exists
Prep->>DropTbl: _drop_existing_table(context, table, existing_table)
alt drop succeeds
DropTbl-->>Prep: None
Prep->>Prep: drop_performed=True, append table
else drop raises
DropTbl--xPrep: Exception
alt keep_going is False
Prep-->>TblMgr: raise
else keep_going is True
Prep->>Prep: log exception, continue
end
end
end
alt table is replicated and metadata_cleaner and not drop_performed and (db, name) not in readonly_cleaned_metadata
Prep->>Prep: tables_to_clean_metadata.append(table)
end
end
alt metadata_cleaner and tables_to_clean_metadata not empty
Prep->>MC: clean_tables_metadata(tables_to_clean_metadata)
end
Prep-->>TblMgr: list of tables to actually restore
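The readonly-repair step above reduces to: clean ZooKeeper metadata for the readonly replicated tables, attempt `SYSTEM RESTORE REPLICA` per table, then re-query which tables are still readonly. A minimal sketch, with `clean_metadata`, `restore_replica`, and `get_readonly` as stand-ins for the `MetadataCleaner` and `ClickhouseCTL` calls:

```python
def fix_readonly_tables(readonly, clean_metadata, restore_replica, get_readonly):
    """Return (still_readonly, already_cleaned) as sets of (db, table)."""
    if not readonly:
        return set(), set()
    clean_metadata(readonly)          # drop stale replica metadata in ZooKeeper
    for table in readonly:
        try:
            restore_replica(table)    # SYSTEM RESTORE REPLICA <db>.<table>
        except Exception:
            pass                      # log and fall back to recreation later
    still_readonly = get_readonly()   # re-query system.replicas
    already_cleaned = set(readonly) - still_readonly
    return still_readonly, already_cleaned

# Toy run: two readonly tables, one of which the repair fixes.
still, cleaned = fix_readonly_tables(
    readonly={("db1", "t1"), ("db1", "t2")},
    clean_metadata=lambda tables: None,
    restore_replica=lambda t: None,
    get_readonly=lambda: {("db1", "t2")},
)
print(sorted(still), sorted(cleaned))  # [('db1', 't2')] [('db1', 't1')]
```

Returning `already_cleaned` alongside the remaining set is what lets the caller skip redundant metadata cleanup later in `_preprocess_tables_to_restore`.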
Class diagram for new replicated database sync and readonly repair helpers

classDiagram
class BackupContext {
+config: Dict
+config_root: Dict
+ch_ctl_conf: Dict
+ch_ctl: ClickhouseCTL
+zk_ctl: ZookeeperCTL
}
class ClickhouseCTL {
+get_macros() Dict
+get_replicas(readonly: bool) List~Dict~
+restore_replica(table: Table) None
+get_ddl_queue_unfinished_status(db_name: str) List~Dict~
+drop_dictionary_if_exists(table: Table) None
+drop_table_if_exists(table: Table) None
+rename_table(src: Table, new_name: str) None
}
class ZookeeperCTL {
+zk_client
+zk_root_path: str
}
class Database {
+name: str
+engine_full: str
+is_replicated_db_engine() bool
}
class Table {
+database: str
+name: str
+uuid: str
+create_statement: str
+is_dictionary() bool
+is_replicated() bool
+make_dummy(database: str, name: str) Table
}
class MetadataCleaner {
+clean_tables_metadata(tables: List~Table~) None
}
class DatabaseZookeeperData {
+zk_path: str
+shard: str
+replica: str
+max_log_ptr: int
+get_log_ptr() int
+get_max_log_ptr() int
+parse_replicated_db_zk_info(db: Database, macros: Dict~str,str~) tuple
+resolve(context: BackupContext, db: Database) DatabaseZookeeperData
-_read_int_node(path: str) int
-_full_path(suffix: str) str
}
DatabaseZookeeperData --> ZookeeperCTL : uses
class SyncStatus {
<<enumeration>>
DONE
FAILED
}
class DatabaseSyncModule {
+wait_sync_replicated_databases(context: BackupContext, databases: Iterable~Database~, keep_going: bool) Dict~str,SyncStatus~
-_sync_database(context: BackupContext, db: Database, deadline: float) None
-_check_ddl_queue_for_errors(context: BackupContext, db_name: str) bool
}
DatabaseSyncModule --> BackupContext
DatabaseSyncModule --> DatabaseZookeeperData
DatabaseSyncModule --> ClickhouseCTL
class TableBackupManager {
+restore(context: BackupContext, ...) None
-_preprocess_tables_to_restore(context: BackupContext, tables: List~Table~, metadata_cleaner: MetadataCleaner, restore_tables_in_replicated_database: bool, keep_going: bool) List~Table~
-_get_remaining_readonly_tables_after_fix_attempt(context: BackupContext, tables: List~Table~, metadata_cleaner: MetadataCleaner) Tuple~Set~Tuple~str,str~~,Set~Tuple~str,str~~~
-_drop_existing_table(context: BackupContext, table: Table, existing_table: Table) None
}
TableBackupManager --> BackupContext
TableBackupManager --> ClickhouseCTL
TableBackupManager --> MetadataCleaner
TableBackupManager --> Table
BackupContext --> ClickhouseCTL
BackupContext --> ZookeeperCTL
BackupContext --> Database
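The `DatabaseZookeeperData` helper from the class diagram essentially reads two integer nodes under the database's ZooKeeper path. A hypothetical sketch with an in-memory fake client (`FakeZk` mimics kazoo's `get`, which returns a `(data, stat)` pair); the real node layout and macro expansion may differ:

```python
class FakeZk:
    """Dict-backed stand-in for the kazoo client held by ZookeeperCTL."""
    def __init__(self, nodes):
        self.nodes = nodes
    def get(self, path):
        return self.nodes[path].encode(), None  # kazoo returns (data, stat)

class DatabaseZookeeperData:
    def __init__(self, zk, zk_path, shard, replica):
        self.zk, self.zk_path = zk, zk_path
        self.shard, self.replica = shard, replica
    def _read_int(self, suffix):
        data, _ = self.zk.get(self.zk_path + suffix)
        return int(data)
    def get_max_log_ptr(self):
        return self._read_int("/max_log_ptr")
    def get_log_ptr(self):
        # Replicated database engines key replicas as "<shard>|<replica>".
        return self._read_int(f"/replicas/{self.shard}|{self.replica}/log_ptr")

zk = FakeZk({
    "/clickhouse/db1/max_log_ptr": "7",
    "/clickhouse/db1/replicas/shard1|replica1/log_ptr": "5",
})
db = DatabaseZookeeperData(zk, "/clickhouse/db1", "shard1", "replica1")
print(db.get_log_ptr() < db.get_max_log_ptr())  # True: this replica lags
```

In the real code, `resolve()` would build `zk_path`, `shard`, and `replica` from the database's `engine_full` arguments expanded with `get_macros()`.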
Hey - I've found 1 issue, and left some high level feedback:
- Several new log statements use `{}`-style placeholders with `logging.*` (e.g. in `_try_fix_readonly_tables`), but the standard logging API expects `%`-style formatting or f-strings, so these messages will not be formatted as intended; consider switching to f-strings or `%s` placeholders.
- The `_try_fix_readonly_tables` helper assumes it is given replicated tables (per the docstring) but does not enforce or validate this; consider either asserting `table.is_replicated()` inside or relaxing the wording in the docstring to match its actual behavior.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Several new log statements use `{}`-style placeholders with `logging.*` (e.g. in `_try_fix_readonly_tables`), but the standard logging API expects `%`-style formatting or f-strings, so these messages will not be formatted as intended; consider switching to f-strings or `%s` placeholders.
- The `_try_fix_readonly_tables` helper assumes it is given replicated tables (per the docstring) but does not enforce or validate this; consider either asserting `table.is_replicated()` inside or relaxing the wording in the docstring to match its actual behavior.
## Individual Comments
### Comment 1
<location path="ch_backup/logic/table.py" line_range="691-700" />
<code_context>
+ not restore_tables_in_replicated_database
+ and is_in_replicated_db
+ ):
+ logging.info(
+ f"Skipping table {table_name_for_logs} because it is in replicated database and --restore-tables-in-replicated-database flag is not set",
+ )
</code_context>
<issue_to_address>
**issue:** Use standard logging formatting instead of `{}` placeholders to ensure arguments are interpolated.
The stdlib logger won’t interpolate `{}` placeholders; they’ll be logged literally and the extra args will be ignored. Please switch these calls to either f-strings or `%`-style placeholders, for example:
```python
logging.info(
"Attempting to fix %d readonly table(s) via metadata cleanup + SYSTEM RESTORE REPLICA",
len(readonly_tables),
)
```
Apply the same change to the other `logging.debug` / `logging.warning` / `logging.info` calls using `{}` placeholders.
</issue_to_address>
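The interpolation behavior the review describes is easy to verify: the stdlib logger applies `%`-style interpolation (`msg % args`), so a `{}`-message must be formatted before being passed in. A small self-contained demo:

```python
import io
import logging

# Capture log output in a buffer instead of stderr.
buf = io.StringIO()
logger = logging.getLogger("fmt_demo")
logger.handlers = [logging.StreamHandler(buf)]
logger.setLevel(logging.INFO)
logger.propagate = False

# Correct: %-style placeholders are interpolated lazily by the logger.
logger.info("fixed %d readonly table(s)", 3)
# Also correct: format the message yourself (f-string or str.format).
logger.info("fixed {} readonly table(s)".format(3))

print(buf.getvalue().splitlines())
# ['fixed 3 readonly table(s)', 'fixed 3 readonly table(s)']
```

A call like `logger.info("fixed {} table(s)", 3)` by contrast leaves the braces uninterpolated, and the mismatched positional argument triggers a formatting error inside the logging machinery.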
```python
    )
    continue

if need_drop and existing_table:
```
need_drop looks superfluous
```python
    for replica in context.ch_ctl.get_replicas(readonly=True)
}

if metadata_cleaner and existing_readonly_tables:
    replicated_readonly_tables = []

    for table in tables:
        table_key = (table.database, table.name)
        if table.is_replicated() and table_key in existing_readonly_tables:
            replicated_readonly_tables.append(table)

    if replicated_readonly_tables:
        existing_readonly_tables -= self._try_fix_readonly_tables(
            context,
            replicated_readonly_tables,
            metadata_cleaner,
        )
```
Let's extract this as a separate function as well.
Hey - I've found 1 issue, and left some high level feedback:
- The new logging calls in `_get_remaining_readonly_tables_after_fix_attempt` use `{}`-style placeholders (`logging.info(..., len(...))` etc.), but the standard Python logger expects `%s`-style placeholders, so these messages will log the braces literally rather than the interpolated values; consider switching to `%s` placeholders or f-strings for those new messages.
- Readonly replicated tables that are successfully repaired are passed to `metadata_cleaner.clean_tables_metadata` both inside `_get_remaining_readonly_tables_after_fix_attempt` and again later via `tables_to_clean_metadata`; if this double-cleaning is not required for correctness, you might want to avoid re-adding already-cleaned tables to reduce redundant work.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new logging calls in `_get_remaining_readonly_tables_after_fix_attempt` use `{}`-style placeholders (`logging.info(..., len(...))` etc.), but the standard Python logger expects `%s`-style placeholders, so these messages will log the braces literally rather than interpolated values; consider switching to `%s` placeholders or f-strings for those new messages.
- Readonly replicated tables that are successfully repaired are passed to `metadata_cleaner.clean_tables_metadata` both inside `_get_remaining_readonly_tables_after_fix_attempt` and again later via `tables_to_clean_metadata`; if this double-cleaning is not required for correctness, you might want to avoid re-adding already-cleaned tables to reduce redundant work.
## Individual Comments
### Comment 1
<location path="ch_backup/logic/table.py" line_range="669-678" />
<code_context>
+ # For tables in replicated databases, skip unless explicitly requested.
+ if not restore_tables_in_replicated_database and is_in_replicated_db:
+ logging.info(
+ f"Skipping table {table_name_for_logs} because it is in replicated database and --restore-tables-in-replicated-database flag is not set",
+ )
</code_context>
<issue_to_address>
**issue (bug_risk):** Logging messages use `{}` placeholders but `logging` default formatting is `%`-style, so these arguments likely won’t be interpolated.
In this helper you’re passing extra positional arguments to `logging.info`/`logging.warning` while using `{}` in the message, so those arguments won’t be interpolated and the braces will appear literally. Please either use f-strings and drop the extra args, or switch the placeholders to `%s` (e.g. `"Attempting to fix %s readonly table(s)...", len(replicated_readonly_tables)`). Apply the same correction to all new logging calls in this block that use `{}` with positional args.
</issue_to_address>
…cated DB

Track already-cleaned tables to avoid redundant re-cleaning; defer the drop for replicated DB tables only when the flag is not set, instead of skipping the restore entirely.
Reproducing the test case for deleting a table in read-only mode is difficult, because the script's query that fetches read-only tables has to hit exactly the moment when a table has already been added by the database but has not yet been activated by the restart thread.
…estore

Track log_ptr via ZooKeeper to detect a stalled DDL worker; retry sync after table restore.
Remove system sync branch, stall threshold and intermediate sync statuses
@sourcery-ai review
Hey - I've found 2 issues, and left some high level feedback:
- In `ch_backup.py` the `logging.info` call for retrying sync uses `{}` placeholders (`"Retrying sync for {} database(s) that failed: {}"`) instead of `%s`, so the arguments will not be interpolated correctly; switch to `%s` or `str.format` to log the actual counts and names.
- In `GET_DDL_QUEUE_STATUS_SQL` the filter uses `WHERE cluster = '{db_name}'`, but the argument passed is a database name; if the intention is to query DDL entries for that database, this should likely filter by the `database` column (or by the actual cluster name derived elsewhere) rather than `cluster`.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `ch_backup.py` the `logging.info` call for retrying sync uses `{}` placeholders (`"Retrying sync for {} database(s) that failed: {}"`) instead of `%s`, so the arguments will not be interpolated correctly; switch to `%s` or `str.format` to log the actual counts and names.
- In `GET_DDL_QUEUE_STATUS_SQL` the filter uses `WHERE cluster = '{db_name}'`, but the argument passed is a database name; if the intention is to query DDL entries for that database, this should likely filter by the `database` column (or by the actual cluster name derived elsewhere) rather than `cluster`.
## Individual Comments
### Comment 1
<location path="ch_backup/ch_backup.py" line_range="631-632" />
<code_context>
+ if sync_results.get(db.name) == SyncStatus.FAILED
+ ]
+ if failed_databases:
+ logging.info(
+ "Retrying sync for {} database(s) that failed: {}",
+ len(failed_databases),
+ ", ".join(db.name for db in failed_databases),
</code_context>
<issue_to_address>
**issue (bug_risk):** The logging call uses '{}' formatting but passes arguments separately, so placeholders won't be interpolated.
Here `logging.info` is given a `{}`-style format string plus separate arguments, but the stdlib logger only applies `%`-style interpolation. This will log the literal `{}` instead of the values. Please either format the string yourself (e.g. with an f-string) or switch to `%s` placeholders and let the logger format the message.
</issue_to_address>
### Comment 2
<location path="ch_backup/clickhouse/control.py" line_range="279" />
<code_context>
+ exception_code,
+ exception_text
+ FROM system.distributed_ddl_queue
+ WHERE cluster = '{db_name}'
+ AND status != 'Finished'
+ ORDER BY entry
</code_context>
<issue_to_address>
**issue (bug_risk):** Filtering distributed_ddl_queue by `cluster = '{db_name}'` looks conceptually wrong and may never match entries.
In `GET_DDL_QUEUE_STATUS_SQL`, `cluster = '{db_name}'` is compared against `db.name`, but `cluster` in `system.distributed_ddl_queue` normally stores the *cluster name*, not the database name. This will likely return no rows and hide real DDL errors. Revisit the predicate: either filter by the correct column for the replicated database, or drop the `cluster` filter and apply any DB/schema filtering in Python instead.
</issue_to_address>
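One way to follow Comment 2 above would be to drop the `cluster` predicate from the SQL and filter the returned entries by database in Python. This is a hypothetical sketch, not the PR's code; `unfinished_ddl_for_db` and the substring match on the `query` column are illustrative assumptions:

```python
# Fetch unfinished entries without the (likely wrong) cluster filter;
# filtering by the target database happens client-side below.
GET_DDL_QUEUE_STATUS_SQL = """
    SELECT entry, status, exception_code, exception_text, query
    FROM system.distributed_ddl_queue
    WHERE status != 'Finished'
    ORDER BY entry
"""

def unfinished_ddl_for_db(rows, db_name):
    """Keep only queue entries whose DDL text targets the given database."""
    needle = f"{db_name}."
    return [row for row in rows if needle in row.get("query", "")]

# Illustrative rows shaped like system.distributed_ddl_queue output.
rows = [
    {"entry": "query-1", "status": "Active", "query": "ALTER TABLE db1.t1 DROP PARTITION '2024'"},
    {"entry": "query-2", "status": "Active", "query": "ALTER TABLE db2.t9 DROP PARTITION '2024'"},
]
print([r["entry"] for r in unfinished_ddl_for_db(rows, "db1")])  # ['query-1']
```

Matching on the DDL text is a heuristic; if the server version exposes a reliable per-entry database field, filtering on that column directly in SQL would be cleaner.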
Can we reproduce a test case by deleting the table metadata in ZooKeeper after the first restore and then running a second restore?
```gherkin
"""
3
"""
When we execute query on clickhouse02
```
Let's check that table is not readonly anymore

Summary by Sourcery
Improve handling of replicated databases and readonly replicated tables during restore by syncing via ZooKeeper log pointers and attempting in-place fixes before table recreation.