Fix Iceberg REST catalog commit safety under post-commit network failure #1982

Open: il9ue wants to merge 1 commit into Altinity:antalya-26.3 from il9ue:fix/iceberg-commit-safety-26.3

il9ue commented Jun 29, 2026

Found while working on #1609 (TRUNCATE storage-leak cleanup).

RestCatalog::updateMetadata returned a boolean result, which folded two outcomes that need different handling into one value:

  • the commit may have succeeded with the response lost, or
  • the catalog may have rejected it cleanly.

Both came back as false, so the caller ran cleanup in either case and deleted the manifest/data files the live snapshot still references. That's the same cleanup-on-failure pattern I'd otherwise be carrying into TRUNCATE.

  • INSERT corruption reproduces on 26.3.10. Mutations aren't reachable via SQL on these builds, and TRUNCATE on 26.3 stays safe only because of commit e229240.

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix possible data loss in Iceberg writes via REST catalog when a commit response is lost

Documentation entry for user-facing changes

When the network drops the response to a catalog commit that actually succeeded, ClickHouse retries the commit, sees the retry fail, and runs cleanup that deletes object-storage files the catalog still references, which can corrupt the table. The fix runs cleanup only when the commit was cleanly rejected; ambiguous outcomes leave the files in place.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

Tests

Two integration tests under tests/integration/test_storage_iceberg_no_spark/, both asserting one invariant: no catalog-referenced snapshot may point at a manifest list missing from object storage.

  • test_insert_commit_response_loss_is_handled : the fix, end to end. A fault proxy lets the commit POST reach the catalog (so it lands) then drops the response; ClickHouse retries and gets a 409. With the fix the INSERT completes, no snapshot loses its manifest list, and the row is readable. Without the fix this same test reproduces the corruption.
  • test_insert_without_fault_is_clean_baseline : control. With no fault armed the same INSERT must succeed cleanly, proving the corruption above comes from the injected response loss, not from the build or the harness.

Files:

  • test_catalog_commit_safety.py : the two tests above.
  • catalog_fault_proxy.py : a transparent reverse proxy in front of the REST catalog. Forwards everything by default; when armed, it forwards a commit POST upstream (so the catalog commits) then RSTs the client without returning the response, reproducing the "commit succeeded, response lost" window.
  • docker_compose_rest_proxy.yml : runs the proxy as a rest-proxy service on the cluster network and publishes its control port.
  • conftest.py : merges that compose file into the cluster so the proxy starts with the suite. ClickHouse talks to the catalog through the proxy; pyiceberg talks to the real catalog directly, so verification sees true committed state.
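To make the "commit succeeded, response lost" window concrete, here is a minimal in-memory model of it. This is not the code in catalog_fault_proxy.py: `FakeCatalog`, `FaultProxy`, and `commit_with_retry` are hypothetical names, and the real proxy works at the HTTP/TCP level rather than with Python method calls.

```python
class FakeCatalog:
    """In-memory stand-in for a REST catalog holding one branch pointer."""

    def __init__(self, current_snapshot_id):
        self.current_snapshot_id = current_snapshot_id

    def commit(self, expected_parent_id, new_snapshot_id):
        # Optimistic-concurrency check: apply the commit only if the branch
        # still points at the parent snapshot the writer expected.
        if self.current_snapshot_id != expected_parent_id:
            return 409
        self.current_snapshot_id = new_snapshot_id
        return 200


class FaultProxy:
    """Forwards commits; when armed, forwards once and then drops the reply."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.armed = False

    def commit(self, expected_parent_id, new_snapshot_id):
        # The upstream call happens first, so the commit lands either way.
        status = self.upstream.commit(expected_parent_id, new_snapshot_id)
        if self.armed:
            self.armed = False  # one-shot fault
            raise ConnectionResetError("response dropped after upstream commit")
        return status


def commit_with_retry(proxy, expected_parent_id, new_snapshot_id, attempts=2):
    """Retry on connection resets; return the first status actually received."""
    for _ in range(attempts):
        try:
            return proxy.commit(expected_parent_id, new_snapshot_id)
        except ConnectionResetError:
            continue
    return None  # every attempt lost its response


catalog = FakeCatalog(current_snapshot_id=1)
proxy = FaultProxy(catalog)
proxy.armed = True
status = commit_with_retry(proxy, expected_parent_id=1, new_snapshot_id=2)
print(status, catalog.current_snapshot_id)  # prints: 409 2
```

The client sees only a 409, yet the catalog already points at snapshot 2. That is the state in which cleanup must not run.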

On INSERT via a REST catalog, a commit whose response is lost was retried,
got a 409, collapsed to a bool false, and triggered cleanup that deleted the
files the now-live snapshot referenced. Replace the bool with a typed
CommitOutcome and, on failure, re-read the catalog to classify whether our
snapshot actually landed before deciding to clean up.

il9ue commented Jun 29, 2026

Sequences

What happens on an INSERT when the commit response is lost:

  1. The INSERT sink (IcebergWrites) calls catalog->updateMetadata(...) to commit the new snapshot.
  2. RestCatalog sends the commit POST. The catalog applies it server-side: main now points at our new snapshot, but the response never makes it back because the connection drops.
  3. The HTTP layer retries the same POST. This time the catalog rejects it with a 409, because the assert-ref-snapshot-id requirement no longer holds: main already points at the snapshot the first attempt created. It looks like a failure, but the commit had in fact landed.
  4. Old behavior: updateMetadata caught the 409 and returned false. The caller read false as "rejected" and deleted the files it had written, including the ones the now-live snapshot points at. That is the corruption.
  5. New behavior: instead of returning false, RestCatalog calls classifyCommitOutcomeAfterFailure. It re-reads the table from the catalog (a fresh GET, no cache) and looks for our snapshot-id.
  6. It returns one of three answers (CommitOutcome):
    • Committed : our id is the current snapshot, or sits in the snapshot history.
    • RejectedCleanly : the read succeeded and our id is nowhere to be found.
    • Unknown : the re-read itself failed, so we can't tell.
  7. The caller acts on the answer: keep the files on Committed, delete them only on RejectedCleanly, and preserve them on Unknown. The reproduced INSERT returns Committed, so cleanup never runs and the data stays intact.
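The classification in steps 5 and 6 can be sketched in Python as below. This models the decision logic only and is not the C++ in RestCatalog: `fetch_table_metadata` is a hypothetical callable standing in for the fresh GET, and it is assumed to return a dict shaped like Iceberg table metadata (`current-snapshot-id`, plus a `snapshots` list whose entries carry `snapshot-id`).

```python
from enum import Enum


class CommitOutcome(Enum):
    COMMITTED = "Committed"
    REJECTED_CLEANLY = "RejectedCleanly"
    UNKNOWN = "Unknown"


def classify_commit_outcome_after_failure(fetch_table_metadata, our_snapshot_id):
    """Re-read the table and decide whether our snapshot actually landed."""
    try:
        metadata = fetch_table_metadata()  # hypothetical: fresh read, no cache
    except Exception:
        # The re-read itself failed, so nothing can be concluded.
        return CommitOutcome.UNKNOWN

    # Case 1: our snapshot is the current one.
    if metadata.get("current-snapshot-id") == our_snapshot_id:
        return CommitOutcome.COMMITTED

    # Case 2: our snapshot landed but a later commit has since moved the branch.
    history_ids = {s.get("snapshot-id") for s in metadata.get("snapshots", [])}
    if our_snapshot_id in history_ids:
        return CommitOutcome.COMMITTED

    # The read succeeded and our id is absent: the commit was rejected.
    return CommitOutcome.REJECTED_CLEANLY
```

Checking the history as well as the current pointer matters because a concurrent writer may have committed on top of our snapshot between the lost response and the re-read.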

How the pieces fit

The whole design hangs on one contract: the CommitOutcome enum in ICatalog.h. Changing updateMetadata's return type from bool to CommitOutcome there forces the compiler to walk every implementation and every caller into the new shape. There are two layers:

The catalog layer decides the outcome. Three implementations satisfy the contract:

  • RestCatalog is the only one that re-reads. On failure it issues a fresh GET (getRawTableMetadataObject) and classifies the result (classifyCommitOutcomeAfterFailure).
  • GlueCatalog has no lost-response semantics (its update is synchronous and non-transactional), so a success is always Committed.
  • ICatalog is the abstract base; it just throws.

The caller layer acts on the outcome. Four call sites all call the same updateMetadata and receive the same enum, then branch to fit their context:

  • INSERT (IcebergWrites) is the reproduced path and the most careful: it also guards its outer catch with a published flag, so a failure after the commit lands can't delete live files.
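A rough Python model of how a caller might branch on the outcome and guard its outer catch is shown below. All names here (`commit_written_files`, `update_metadata`, `finalize`, `delete_file`) are hypothetical, and two choices are assumptions of this sketch rather than facts from the PR: Unknown is treated as possibly published, and an exception raised before any outcome is returned is treated as not published.

```python
from enum import Enum


class CommitOutcome(Enum):
    COMMITTED = "Committed"
    REJECTED_CLEANLY = "RejectedCleanly"
    UNKNOWN = "Unknown"


def commit_written_files(written_files, update_metadata, finalize, delete_file):
    """Commit already-written files and clean up only when that is safe."""
    published = False
    try:
        outcome = update_metadata()
        if outcome is CommitOutcome.REJECTED_CLEANLY:
            # The catalog never referenced these files, so deleting is safe.
            for path in written_files:
                delete_file(path)
            return outcome
        # Committed: the files are live. Unknown: they may be live.
        # Either way they must survive any later failure (sketch assumption).
        published = True
        finalize()  # post-commit work that may still raise
        return outcome
    except Exception:
        if not published:
            for path in written_files:
                delete_file(path)
        raise
```

The flag is set before any post-commit work runs, so an exception from `finalize` reaches the outer handler with `published` already true and the files are left alone.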

Tests

Both tests use a fault proxy (catalog_fault_proxy.py) that forwards a commit POST upstream so it lands, then drops the response, reproducing the exact "commit succeeded, response lost" window. pyiceberg verifies against the real catalog, so it checks the true committed state rather than what ClickHouse believes. The invariant in both: no catalog-referenced snapshot may point at a manifest list missing from object storage.

  • test_insert_commit_response_loss_is_handled : The fix, end to end. The commit lands server-side, its response is dropped, ClickHouse retries and gets a 409. With the fix: the INSERT completes, no snapshot loses its manifest list, and the row is readable (count() == 2). Without the fix this same test reproduces the corruption.
  • test_insert_without_fault_is_clean_baseline : Control. With no fault armed the same INSERT must succeed and leave no corruption, proving the corruption above is caused by the injected response loss, not by the build or the harness.
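The invariant both tests assert can be written as a small check. This is a sketch over plain dicts, not the code in test_catalog_commit_safety.py: it assumes each snapshot exposes the Iceberg metadata fields `snapshot-id` and `manifest-list`, and that `object_keys` is a listing of the paths present in object storage. The function name is hypothetical.

```python
def snapshots_missing_manifest_list(snapshots, object_keys):
    """Return ids of catalog snapshots whose manifest list is not in storage."""
    present = set(object_keys)
    return [
        snap["snapshot-id"]
        for snap in snapshots
        if snap["manifest-list"] not in present
    ]


snapshots = [
    {"snapshot-id": 1, "manifest-list": "s3://bucket/t/metadata/snap-1.avro"},
    {"snapshot-id": 2, "manifest-list": "s3://bucket/t/metadata/snap-2.avro"},
]
healthy = ["s3://bucket/t/metadata/snap-1.avro", "s3://bucket/t/metadata/snap-2.avro"]
corrupted = ["s3://bucket/t/metadata/snap-1.avro"]  # snap-2 was deleted by cleanup

print(snapshots_missing_manifest_list(snapshots, healthy))    # prints: []
print(snapshots_missing_manifest_list(snapshots, corrupted))  # prints: [2]
```

An empty result means the invariant holds; any id in the list is a snapshot the catalog still references whose manifest list is gone.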
