@platinumhamburg
Contributor

Core changes:

  • Add ProducerSnapshotManager for lifecycle management with atomic registration
  • Add ProducerSnapshotStore for ZK + remote storage operations
  • Add tryRegisterProducerSnapshot in ZooKeeperClient for atomic check-and-create
  • Add Admin API: registerProducerOffsets, getProducerOffsets, deleteProducerOffsets
  • Add configurable TTL and cleanup interval for producer snapshots

Design highlights:

  • Atomic registration via ZK's NodeExistsException handling
  • Eventually consistent: ZK as commit point, orphan files cleaned periodically
  • UUID-based file naming prevents concurrent upload conflicts
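
The three design points above can be illustrated with a small in-memory sketch. It is not the code from this PR: the class and method names are hypothetical, a `ConcurrentHashMap` stands in for both ZooKeeper and remote storage, and `putIfAbsent` plays the role that catching `NodeExistsException` on a ZK create would play in a real implementation.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch only: an in-memory stand-in for the ZK + remote storage pair.
 * All names here are hypothetical and are not taken from the PR.
 */
class SnapshotRegistrySketch {
    // Stand-in for ZK: producerId -> path of the committed snapshot file.
    private final Map<String, String> committed = new ConcurrentHashMap<>();
    // Stand-in for remote storage: file path -> serialized offsets.
    private final Map<String, String> remoteFiles = new ConcurrentHashMap<>();

    /** Returns true if this call created the snapshot, false if one already existed. */
    boolean tryRegister(String producerId, String offsetsJson) {
        // 1. Upload under a UUID-based name so concurrent callers never
        //    overwrite each other's files.
        String path = producerId + "/" + UUID.randomUUID() + ".json";
        remoteFiles.put(path, offsetsJson);

        // 2. Commit point: atomic create-if-absent. Against real ZooKeeper this
        //    would be a create call where NodeExistsException means another
        //    caller already registered (assumption about the real flow).
        String existing = committed.putIfAbsent(producerId, path);

        // If we lost the race, our uploaded file is now an orphan and is left
        // for the periodic cleanup below.
        return existing == null;
    }

    /** Deletes files that no committed entry references; returns how many were removed. */
    int cleanOrphans() {
        int removed = 0;
        for (String path : remoteFiles.keySet()) {
            if (!committed.containsValue(path)) {
                remoteFiles.remove(path);
                removed++;
            }
        }
        return removed;
    }

    /** Returns the committed offsets for a producer, or null if none is registered. */
    String committedOffsets(String producerId) {
        String path = committed.get(producerId);
        return path == null ? null : remoteFiles.get(path);
    }
}
```

Uploading first and committing second means a crash between the two steps can only leave an unreferenced file behind, never a committed entry that points at missing data. That is why orphan cleanup can run lazily.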

Tests:

  • ProducerSnapshotManagerTest: lifecycle, expiration, concurrent atomicity
  • ProducerSnapshotJsonSerdeTest: JSON format compatibility

Purpose

Linked issue: close #2433

This commit introduces a producer offset snapshot mechanism to support
undo recovery when Flink jobs fail before completing their first checkpoint.

Key changes:
- Add ProducerSnapshotManager for snapshot lifecycle management including
  registration, retrieval, deletion, and periodic cleanup of expired snapshots
- Add ProducerSnapshotStore for low-level ZooKeeper and remote storage operations
- Add ProducerSnapshot and ProducerSnapshotJsonSerde for snapshot data model
- Extend Admin API with registerProducerOffsets, getProducerOffsets, and
  deleteProducerOffsets operations
- Add RPC protocol definitions for producer offset snapshot management
- Add configuration options for snapshot TTL and cleanup interval
- Add unit tests for ProducerSnapshotManager and JSON serialization
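
As a rough illustration of the TTL and periodic cleanup behavior listed above, the following sketch tracks an expiry deadline per producer and removes entries whose deadline has passed. The class name, method names, and the choice to store an absolute deadline are assumptions made for this example. The PR's actual configuration keys and data model are not shown here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch only: hypothetical TTL bookkeeping, not the PR's ProducerSnapshotManager. */
class SnapshotExpirySketch {
    private final Duration ttl;
    // producerId -> instant at which the snapshot expires.
    private final Map<String, Instant> expiresAt = new ConcurrentHashMap<>();

    SnapshotExpirySketch(Duration ttl) {
        this.ttl = ttl;
    }

    /** Registers a snapshot if none exists; an existing deadline is left unchanged. */
    void register(String producerId, Instant now) {
        expiresAt.putIfAbsent(producerId, now.plus(ttl));
    }

    /** A snapshot is live only while the current time is strictly before its deadline. */
    boolean isLive(String producerId, Instant now) {
        Instant deadline = expiresAt.get(producerId);
        return deadline != null && now.isBefore(deadline);
    }

    /** One cleanup pass, intended to be called on the configured interval. */
    int cleanupExpired(Instant now) {
        int removed = 0;
        for (Map.Entry<String, Instant> e : expiresAt.entrySet()) {
            // remove(key, value) only deletes if the deadline is still the one
            // we inspected, so a concurrent change to the entry is not lost.
            if (!now.isBefore(e.getValue()) && expiresAt.remove(e.getKey(), e.getValue())) {
                removed++;
            }
        }
        return removed;
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic under test. A real manager would more likely read a clock and run `cleanupExpired` from a scheduled executor.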

Development

Successfully merging this pull request may close these issues.

[Server] Add Producer Offset Snapshot Registry as Infrastructure for Exactly-Once Semantics
