Fix Iceberg REST catalog commit safety under post-commit network failure#1982
Fix Iceberg REST catalog commit safety under post-commit network failure#1982il9ue wants to merge 1 commit into
Conversation
On INSERT via a REST catalog, a commit whose response is lost was retried, got a 409, collapsed to a bool false, and triggered cleanup that deleted the files the now-live snapshot referenced. Replace the bool with a typed CommitOutcome and, on failure, re-read the catalog to classify whether our snapshot actually landed before deciding to clean up.
SequencesWhat happens on an INSERT when the commit response is lost:
How the pieces fitThe whole design hangs on one contract: the The catalog layer decides the outcome. Three implementations satisfy the contract:
The caller layer acts on the outcome. Four call sites all call the same
TestsBoth tests use a fault proxy (
|
Found while working on #1609 (TRUNCATE storage-leak cleanup).
RestCatalog::updateMetadatareturned a boolean result, which folded two outcomes that need different handling into one value:Both came back as
false, so the caller ran cleanup in either case and deleted the manifest/data files the live snapshot still references. That's the same cleanup-on-failure pattern I'd otherwise be carrying into TRUNCATE.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix possible data loss in Iceberg writes via REST catalog when a commit response is lost
Documentation entry for user-facing changes
When the network drops the response to a catalog commit that actually succeeded, ClickHouse retries the commit, sees the retry fail, and runs cleanup that deletes object-storage files the catalog at that point still references, leading to possible table corruption. The fix runs cleanup only when the commit was cleanly rejected; ambiguous outcomes leave the files in place.
CI/CD Options
Exclude tests:
Regression jobs to run:
Tests
Two integration tests under
tests/integration/test_storage_iceberg_no_spark/, both asserting one invariant: no catalog-referenced snapshot may point at a manifest list missing from object storage.test_insert_commit_response_loss_is_handled: the fix, end to end. A fault proxy lets the commit POST reach the catalog (so it lands) then drops the response; ClickHouse retries and gets a 409. With the fix the INSERT completes, no snapshot loses its manifest list, and the row is readable. Without the fix this same test reproduces the corruption.test_insert_without_fault_is_clean_baseline: control. With no fault armed the same INSERT must succeed cleanly, proving the corruption above comes from the injected response loss, not from the build or the harness.Files:
test_catalog_commit_safety.py: the two tests above.catalog_fault_proxy.py: a transparent reverse proxy in front of the REST catalog. Forwards everything by default; when armed, it forwards a commit POST upstream (so the catalog commits) then RSTs the client without returning the response, reproducing the "commit succeeded, response lost" window.docker_compose_rest_proxy.yml: runs the proxy as arest-proxyservice on the cluster network and publishes its control port.conftest.py: merges that compose file into the cluster so the proxy starts with the suite. ClickHouse talks to the catalog through the proxy; pyiceberg talks to the real catalog directly, so verification sees true committed state.