Skip to content

Fix Iceberg REST catalog commit safety, corruption under post-commit network failure#1676

Closed
il9ue wants to merge 1 commit into
antalya-26.1from
fix/iceberg-truncate-cleanup-on-commit-failure
Closed

Fix Iceberg REST catalog commit safety, corruption under post-commit network failure#1676
il9ue wants to merge 1 commit into
antalya-26.1from
fix/iceberg-truncate-cleanup-on-commit-failure

Conversation

@il9ue

@il9ue il9ue commented Apr 22, 2026

Copy link
Copy Markdown

Found while working on #1609 (TRUNCATE storage-leak cleanup)
RestCatalog::updateMetadata returns a boolean result, which folds two outcomes that need different handling into one value:

  • The commit may have succeeded with the response lost or,
  • The catalog may have rejected it cleanly.

Both come back as false return, so the caller runs cleanup in either case and deletes the manifest/data files in which live snapshot still references. That's the same cleanup-on-failure pattern I'd otherwise be carrying into TRUNCATE.

  • INSERT corruption reproduces on v26.4.1 and 26.3.10. Mutations aren't reachable via SQL on these builds, and TRUNCATE on 26.3 stays safe only because of the commit e229240.

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix possible data loss in Iceberg writes via REST catalog when a commit response is lost

Documentation entry for user-facing changes

When the network drops the response to a catalog commit that actually succeeded, ClickHouse retries the commit, sees the retry fail, and runs cleanup that deletes object-storage files the catalog at that time points at, leads to possible table corruption. The fix runs cleanup only when the commit was cleanly rejected, while ambiguous outcomes leave the files in place.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

Original ticket #1609 (storage leak cleanup on TRUNCATE commit failure)
investigation revealed a latent table-corruption bug in the REST catalog
client that the #1609 fix would propagate into TRUNCATE if written as
originally scoped. Repurposing this branch to fix the root cause, with
#1609 becoming the final layer on top.

See design-notes.md for full investigation and layered fix proposal.
Awaiting lead approval on scope expansion before implementation.
@github-actions

Copy link
Copy Markdown

Workflow [PR], commit [3bb0e62]

@il9ue il9ue changed the title [WIP] Fix Iceberg REST catalog commit safety — corruption under post-commit network failure Fix Iceberg REST catalog commit safety, corruption under post-commit network failure Jun 27, 2026
@il9ue

il9ue commented Jun 28, 2026

Copy link
Copy Markdown
Author

Will close this issue after pushing PR to antalya 26.3, to be superseded, when the patchset is ready.

@il9ue

il9ue commented Jun 29, 2026

Copy link
Copy Markdown
Author

Handled in #1982

@il9ue il9ue closed this Jun 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants