@tenfyzhong
Collaborator

What problem does this PR solve?

Issue Number: close #3943

What is changed and how it works?

This PR introduces CMEK (Customer-Managed Encryption Keys) encryption at rest for TiCDC's EventStore and SchemaStore components. The primary goal is to enhance data security by encrypting persisted data on disk using encryption keys managed by customers through their preferred KMS (Key Management Service) providers.

Key changes include:

  • Added encryption support for EventStore: Data written to Pebble is now encrypted before storage and decrypted on read. The encryption uses a layered key model in which data keys encrypt the actual data and master keys (managed by KMS) encrypt the data keys.
  • Added encryption support for SchemaStore: DDL events and schema information stored in the schema store are now encrypted using the same encryption framework.
  • Implemented encryption framework: Created a comprehensive encryption package with support for multiple algorithms (AES-256-CTR, AES-256-GCM), KMS integration, and key management.
  • Graceful degradation: The system can gracefully degrade to unencrypted mode when encryption is disabled or when encryption operations fail (configurable).
  • Backward compatibility: The implementation maintains backward compatibility with existing unencrypted data through automatic detection of encryption headers.

The encryption works by:

  1. Checking if encryption is enabled for a keyspace via encryption metadata stored in TiKV
  2. Using a current data key to encrypt data, with the data key itself encrypted by a master key
  3. Storing encrypted data with a header containing version and data key ID information
  4. Decrypting data on read by extracting the data key ID from the header and retrieving the corresponding data key
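
The write and read paths in steps 2 to 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the header layout (one version byte followed by a 3-byte data key ID), the `keyring` type, and the `encrypt`/`decrypt` helpers are hypothetical, and unwrapping data keys with the KMS-managed master key is omitted. A single version byte is also not a robust way to tell encrypted from legacy data; the real detection logic may differ.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// Assumed layout, for illustration only:
// [version:1][dataKeyID:3][nonce:12][ciphertext+tag]
const (
	formatVersion = 0x01
	keyIDLen      = 3
	headerLen     = 1 + keyIDLen
)

type dataKeyID [keyIDLen]byte

// keyring maps data key IDs to plaintext 32-byte data keys (hypothetical type).
type keyring map[dataKeyID][]byte

func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key) // a 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encrypt prepends the header and a random nonce to the sealed payload.
func encrypt(keys keyring, id dataKeyID, plaintext []byte) ([]byte, error) {
	key, ok := keys[id]
	if !ok {
		return nil, errors.New("unknown data key ID")
	}
	aead, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	header := append([]byte{formatVersion}, id[:]...)
	out := append(append([]byte{}, header...), nonce...)
	// The header is passed as additional data, so tampering with it fails authentication.
	return aead.Seal(out, nonce, plaintext, header), nil
}

// decrypt reads the data key ID from the header and opens the payload with that key.
func decrypt(keys keyring, data []byte) ([]byte, error) {
	if len(data) < headerLen || data[0] != formatVersion {
		return nil, errors.New("missing or unsupported encryption header")
	}
	var id dataKeyID
	copy(id[:], data[1:headerLen])
	key, ok := keys[id]
	if !ok {
		return nil, errors.New("unknown data key ID")
	}
	aead, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	body := data[headerLen:]
	if len(body) < aead.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := body[:aead.NonceSize()], body[aead.NonceSize():]
	return aead.Open(nil, nonce, ciphertext, data[:headerLen])
}

func main() {
	id := dataKeyID{'0', '0', '1'}
	keys := keyring{id: make([]byte, 32)} // all-zero key, demo only
	ct, err := encrypt(keys, id, []byte("ddl event"))
	if err != nil {
		panic(err)
	}
	pt, err := decrypt(keys, ct)
	fmt.Println(string(pt), err)
}
```

Binding the header to the ciphertext as additional authenticated data is a design choice of this sketch; it only applies to AEAD modes such as GCM, not to AES-256-CTR.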

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?

  • Performance: There will be some performance overhead due to encryption/decryption operations, but this is necessary for security. The impact should be minimal with efficient cipher implementations.
  • Compatibility: The changes maintain backward compatibility. Existing unencrypted data will continue to work, and new data can be written encrypted when encryption is enabled. The system automatically detects whether data is encrypted based on headers.

Do you need to update user documentation, design documentation or monitoring documentation?

  • User documentation: Yes, documentation should be updated to explain how to configure CMEK encryption for TiCDC.
  • Design documentation: The encryption architecture and key management should be documented.
  • Monitoring documentation: New metrics for encryption operations (success/failure rates, latency) should be documented.

Release note

Added support for CMEK (Customer-Managed Encryption Keys) encryption at rest for TiCDC EventStore and SchemaStore. This feature allows users to encrypt persisted data on disk using keys managed through AWS KMS, Azure Key Vault, or GCP KMS. The encryption can be enabled via configuration and provides graceful degradation on encryption errors.

- Add encryption manager for transparent data encryption/decryption
- Implement AES-256-CTR and AES-256-GCM cipher algorithms
- Add encryption format with header for version and key ID
- Integrate encryption with event store and schema store
- Add mock KMS and TiKV clients for testing
- Support graceful degradation on encryption errors
- Add configuration options for encryption settings

Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
@ti-chi-bot added the release-note (denotes a PR that will be considered when it comes time to generate release notes) and do-not-merge/needs-triage-completed labels on Jan 8, 2026
@ti-chi-bot

ti-chi-bot bot commented Jan 8, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign flowbehappy for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Jan 8, 2026
@tenfyzhong tenfyzhong marked this pull request as draft January 8, 2026 06:11
@ti-chi-bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Jan 8, 2026
@gemini-code-assist

Summary of Changes

Hello @tenfyzhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant security enhancement by integrating Customer-Managed Encryption Keys (CMEK) encryption at rest for TiCDC's EventStore and SchemaStore components. The primary objective is to safeguard persisted data on disk by encrypting it with keys managed by customers through their chosen Key Management Service (KMS) providers. This foundational work establishes a robust and flexible encryption architecture, ensuring data confidentiality while maintaining seamless operation and compatibility with existing data structures.

Highlights

  • EventStore Encryption: Data written to PebbleDB in the EventStore is now encrypted before storage and decrypted during reads, utilizing a layered key model where data keys encrypt the actual data, and master keys (managed by KMS) encrypt the data keys.
  • SchemaStore Encryption: DDL events and schema information stored in the SchemaStore are now encrypted using the same comprehensive encryption framework.
  • Comprehensive Encryption Framework: A new encryption package has been implemented, supporting multiple algorithms (AES-256-CTR, AES-256-GCM), KMS integration, and robust key management.
  • Graceful Degradation: The system is designed to gracefully degrade to an unencrypted mode when encryption is disabled or when encryption operations encounter failures, which is configurable.
  • Backward Compatibility: The implementation ensures backward compatibility, allowing existing unencrypted data to function correctly while new data can be written in an encrypted state, with automatic detection of encryption headers.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive framework for Customer-Managed Encryption Keys (CMEK) to encrypt data at rest for TiCDC's EventStore and SchemaStore. The changes are extensive, including a new encryption package with support for multiple ciphers, KMS integration, and a layered key model. The implementation also considers backward compatibility and graceful degradation. My review focuses on improving code clarity, robustness, and performance by addressing issues such as non-idiomatic Go patterns for optional dependencies, inconsistent error handling, potential race conditions in caching, and inefficient data access patterns. I've also pointed out some minor issues in comments and mock data to enhance maintainability.

Comment on lines 1309 to 1329
if encryption.IsEncrypted(value) {
	// Try to get encryptionManager from appcontext
	var encMgr encryption.EncryptionManager
	defer func() {
		if r := recover(); r != nil {
			// EncryptionManager not registered, skip decryption
			encMgr = nil
		}
	}()
	encMgr = appcontext.GetService[encryption.EncryptionManager]("EncryptionManager")
	if encMgr != nil {
		// TODO: Get keyspaceID from dispatcher/subscription metadata
		// For now, use default keyspaceID (0)
		keyspaceID := uint32(0)
		decryptedValue, err := encMgr.DecryptData(context.Background(), keyspaceID, value)
		if err != nil {
			log.Panic("failed to decrypt value", zap.Error(err))
		}
		value = decryptedValue
	}
}

high

As the comment here suggests, fetching the encryptionManager from the global appcontext on every Next() call is inefficient and not idiomatic. It also uses a defer/recover pattern which is best avoided for control flow.

The encryptionManager should be passed to eventStoreIter upon its creation and stored as a field.

  1. Add encryptionManager encryption.EncryptionManager to the eventStoreIter struct.
  2. In GetIterator, initialize this new field from e.encryptionManager.
  3. Then, simplify this Next method to use iter.encryptionManager directly.

This would improve performance and make the code cleaner and more maintainable.

if iter.encryptionManager != nil && encryption.IsEncrypted(value) {
	// TODO: Get keyspaceID from dispatcher/subscription metadata
	// For now, use default keyspaceID (0)
	keyspaceID := uint32(0)
	decryptedValue, err := iter.encryptionManager.DecryptData(context.Background(), keyspaceID, value)
	if err != nil {
		log.Panic("failed to decrypt value", zap.Error(err))
	}
	value = decryptedValue
}
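
A self-contained sketch of the three steps in this comment, with the manager injected when the iterator is built. The `EncryptionManager` interface here is a reduced stand-in for the PR's `encryption.EncryptionManager`, and `xorManager`, `drain`, and the slice-backed iterator are hypothetical helpers for the demo; the `IsEncrypted` check is left out to keep it short.

```go
package main

import (
	"context"
	"fmt"
)

// EncryptionManager is a reduced stand-in for encryption.EncryptionManager.
type EncryptionManager interface {
	DecryptData(ctx context.Context, keyspaceID uint32, data []byte) ([]byte, error)
}

// xorManager is a toy manager for the demo; it is not real encryption.
type xorManager struct{ key byte }

func (m xorManager) DecryptData(_ context.Context, _ uint32, data []byte) ([]byte, error) {
	out := make([]byte, len(data))
	for i, b := range data {
		out[i] = b ^ m.key
	}
	return out, nil
}

// eventStoreIter holds the manager as a field, set once at construction,
// so Next never touches a global service registry.
type eventStoreIter struct {
	encryptionManager EncryptionManager // nil when encryption is not configured
	keyspaceID        uint32
	values            [][]byte
	pos               int
}

// Next returns the next value, decrypting it when a manager is present.
func (it *eventStoreIter) Next(ctx context.Context) ([]byte, bool, error) {
	if it.pos >= len(it.values) {
		return nil, false, nil
	}
	value := it.values[it.pos]
	it.pos++
	if it.encryptionManager != nil {
		decrypted, err := it.encryptionManager.DecryptData(ctx, it.keyspaceID, value)
		if err != nil {
			return nil, false, err
		}
		value = decrypted
	}
	return value, true, nil
}

// drain reads the iterator to the end and returns the values as strings.
func drain(it *eventStoreIter) ([]string, error) {
	var out []string
	for {
		v, ok, err := it.Next(context.Background())
		if err != nil {
			return out, err
		}
		if !ok {
			return out, nil
		}
		out = append(out, string(v))
	}
}

func main() {
	m := xorManager{key: 0x2a}
	// XOR is symmetric, so DecryptData doubles as the demo's "encrypt".
	stored, _ := m.DecryptData(context.Background(), 0, []byte("row"))
	it := &eventStoreIter{encryptionManager: m, values: [][]byte{stored}}
	fmt.Println(drain(it))
}
```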

Comment on lines +198 to +200
if err != nil {
	log.Fatal("decrypt db info failed", zap.Error(err))
}

high

Using log.Fatal can lead to abrupt program termination without proper cleanup or stack unwinding. It's generally better to return an error from this function and let the caller decide on the appropriate action. If the error is considered truly unrecoverable at this level, panic(err) would be a better choice as it allows for a top-level recovery mechanism to perform a more graceful shutdown. This applies to other log.Fatal calls in this file as well.

Suggested change
- if err != nil {
- 	log.Fatal("decrypt db info failed", zap.Error(err))
- }
+ if err != nil {
+ 	return nil, errors.Trace(err)
+ }
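
To show what returning the error buys the caller, here is a small standard-library sketch. `loadDBInfo`, `decryptFn`, and `errDecrypt` are hypothetical names, and `fmt.Errorf` with `%w` stands in for `errors.Trace` from the suggestion above.

```go
package main

import (
	"errors"
	"fmt"
)

var errDecrypt = errors.New("decrypt failed")

// decryptFn is a hypothetical stand-in for the encryption manager call.
type decryptFn func([]byte) ([]byte, error)

// loadDBInfo returns the failure instead of terminating the process, so
// deferred cleanup in callers still runs and the caller picks the policy.
func loadDBInfo(raw []byte, decrypt decryptFn) ([]byte, error) {
	plain, err := decrypt(raw)
	if err != nil {
		return nil, fmt.Errorf("decrypt db info: %w", err)
	}
	return plain, nil
}

func main() {
	failing := func([]byte) ([]byte, error) { return nil, errDecrypt }
	if _, err := loadDBInfo([]byte("x"), failing); err != nil {
		// The caller can still match the cause and decide to retry, degrade, or stop.
		fmt.Println("caller handles:", err, errors.Is(err, errDecrypt))
	}
}
```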

Comment on lines +55 to +56
// DecryptMasterKey decrypts the master key using mock KMS
// In a real implementation, this would call the actual KMS service

high

The dataKeyID is defined to be 3 bytes long throughout the encryption framework (e.g., in format.go and types.go). However, these mock data key IDs are 8 bytes long. This will cause EncodeEncryptedData to fail. The mock data should be updated to use 3-byte strings for data key IDs to be consistent with the implementation.

	mockDataKeyID1 := "001"
	mockDataKeyID2 := "002"

if cached, ok := m.metaCache[keyspaceID]; ok {
	// Check if cache is still valid
	if time.Since(cached.timestamp) < m.ttl {
		meta := cached.meta

high

The cached meta object is returned directly. Since KeyspaceEncryptionMeta contains pointers and maps, a caller could inadvertently modify the cached data, leading to race conditions or inconsistent state. To prevent this, a deep copy of the meta object should be returned from the cache.
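
One possible shape for the deep copy, as a sketch only. `keyspaceEncryptionMeta` below is a reduced, hypothetical stand-in for the PR's `KeyspaceEncryptionMeta` (its real fields are not shown in this thread), and `metaCache`, `clone`, and `defaultTTL` are illustrative names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const defaultTTL = time.Minute

// keyspaceEncryptionMeta is a hypothetical, reduced stand-in for the PR's type.
type keyspaceEncryptionMeta struct {
	Enabled      bool
	CurrentKeyID string
	DataKeys     map[string][]byte
}

// clone copies the map and every byte slice so callers never share memory with the cache.
func (m *keyspaceEncryptionMeta) clone() *keyspaceEncryptionMeta {
	if m == nil {
		return nil
	}
	out := &keyspaceEncryptionMeta{
		Enabled:      m.Enabled,
		CurrentKeyID: m.CurrentKeyID,
		DataKeys:     make(map[string][]byte, len(m.DataKeys)),
	}
	for id, key := range m.DataKeys {
		out.DataKeys[id] = append([]byte(nil), key...)
	}
	return out
}

type cachedMeta struct {
	meta      *keyspaceEncryptionMeta
	timestamp time.Time
}

type metaCache struct {
	mu      sync.RWMutex
	ttl     time.Duration
	entries map[uint32]cachedMeta
}

func newMetaCache(ttl time.Duration) *metaCache {
	return &metaCache{ttl: ttl, entries: make(map[uint32]cachedMeta)}
}

// get returns a private copy of a fresh entry, or false when missing or expired.
func (c *metaCache) get(keyspaceID uint32) (*keyspaceEncryptionMeta, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	cached, ok := c.entries[keyspaceID]
	if !ok || time.Since(cached.timestamp) >= c.ttl {
		return nil, false
	}
	return cached.meta.clone(), true
}

// put stores a copy, so later changes by the caller do not leak into the cache.
func (c *metaCache) put(keyspaceID uint32, meta *keyspaceEncryptionMeta) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[keyspaceID] = cachedMeta{meta: meta.clone(), timestamp: time.Now()}
}

func main() {
	c := newMetaCache(defaultTTL)
	c.put(1, &keyspaceEncryptionMeta{Enabled: true, DataKeys: map[string][]byte{"001": {1, 2, 3}}})
	got, _ := c.get(1)
	got.DataKeys["001"][0] = 9 // mutate the copy
	again, _ := c.get(1)
	fmt.Println(again.DataKeys["001"][0]) // the cached entry is unchanged
}
```

Copying on both `put` and `get` costs an allocation per access; if the metadata were treated as immutable after construction, sharing the pointer would be safe and cheaper.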

Comment on lines 264 to 270
defer func() {
	if r := recover(); r != nil {
		// EncryptionManager not registered, use nil
		encMgr = nil
	}
}()
encMgr = appcontext.GetService[encryption.EncryptionManager]("EncryptionManager")

medium

Using defer and recover for control flow to handle an optional dependency is not idiomatic Go. It can obscure the program's control flow and is generally reserved for handling unexpected panics. A better approach would be to have a TryGetService function in appcontext that returns a boolean indicating whether the service was found, for example: encMgr, ok := appcontext.TryGetService[...](...). This would make the code clearer and more robust.
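
A minimal sketch of what such a `TryGetService` could look like. The registry below (`services`, `SetService`) is a hypothetical stand-in, not the actual `appcontext` package, and `EncryptionManager`/`noopManager` are placeholder types for the demo.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu       sync.RWMutex
	services = map[string]any{}
)

// SetService registers a service under a name (hypothetical registry).
func SetService[T any](name string, svc T) {
	mu.Lock()
	defer mu.Unlock()
	services[name] = svc
}

// TryGetService reports whether a service of type T is registered under name.
// It returns the zero value and false instead of panicking when it is not.
func TryGetService[T any](name string) (T, bool) {
	mu.RLock()
	defer mu.RUnlock()
	var zero T
	raw, ok := services[name]
	if !ok {
		return zero, false
	}
	svc, ok := raw.(T)
	if !ok {
		return zero, false
	}
	return svc, true
}

// EncryptionManager and noopManager are placeholder types for the demo.
type EncryptionManager interface{ Name() string }

type noopManager struct{}

func (noopManager) Name() string { return "noop" }

func main() {
	if _, ok := TryGetService[EncryptionManager]("EncryptionManager"); !ok {
		fmt.Println("encryption manager not registered, continuing without it")
	}
	SetService[EncryptionManager]("EncryptionManager", noopManager{})
	if m, ok := TryGetService[EncryptionManager]("EncryptionManager"); ok {
		fmt.Println("found:", m.Name())
	}
}
```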

// TODO: Get keyspaceID from dispatcher/subscription metadata
// For now, use default keyspaceID (0) for classic mode
keyspaceID := uint32(0)
encryptedValue, err := e.encryptionManager.EncryptData(context.Background(), keyspaceID, value)

medium

Using context.Background() here means this encryption operation will not be cancelled if the parent context (from the write task pool) is cancelled. It's better to pass the context from writeTaskPool.run down to writeEvents and use it here. This ensures that long-running operations can be properly cancelled.
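
The effect of threading the context through can be sketched as below. `encryptData`, `writeEvents`, and `runWriteTask` are simplified, hypothetical stand-ins for the PR's functions; the only point is that a cancelled parent context stops the encryption call.

```go
package main

import (
	"context"
	"fmt"
)

// encryptData stands in for EncryptionManager.EncryptData. It checks the
// context first, as a call that may reach out to a KMS or TiKV typically would.
func encryptData(ctx context.Context, value []byte) ([]byte, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return append([]byte("enc:"), value...), nil
}

// writeEvents receives the caller's context instead of creating its own.
// It returns how many values were processed before an error, if any.
func writeEvents(ctx context.Context, values [][]byte) (int, error) {
	for i, v := range values {
		if _, err := encryptData(ctx, v); err != nil {
			return i, err
		}
	}
	return len(values), nil
}

// runWriteTask mimics a task-pool run loop that owns the context.
func runWriteTask(cancelFirst bool, values [][]byte) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel() // simulate shutdown before the batch is written
	}
	return writeEvents(ctx, values)
}

func main() {
	batch := [][]byte{[]byte("a"), []byte("b")}
	fmt.Println(runWriteTask(false, batch))
	fmt.Println(runWriteTask(true, batch))
}
```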

	return &AES256GCMCipher{}
}

// IVSize returns the IV size for AES-256-GCM (12 bytes recommended, but we use 16 for compatibility)

medium

The comment states that 16 bytes are used for compatibility, but the function returns 12. This is confusing. The standard and recommended nonce size for GCM is 12 bytes, which the code correctly returns. The comment should be updated to reflect this and remove the mention of 16 bytes to avoid confusion.

Suggested change
- // IVSize returns the IV size for AES-256-GCM (12 bytes recommended, but we use 16 for compatibility)
+ // IVSize returns the IV size for AES-256-GCM (12 bytes).
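
The 12-byte figure can be checked against Go's standard library, which is presumably what the cipher wraps (an assumption; the PR's cipher code is not shown here). `gcmNonceSize` is a hypothetical helper.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// gcmNonceSize reports the nonce size of the standard GCM mode for the given key.
func gcmNonceSize(key []byte) (int, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return 0, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return 0, err
	}
	return aead.NonceSize(), nil
}

func main() {
	n, err := gcmNonceSize(make([]byte, 32)) // 32-byte key selects AES-256
	// aes.BlockSize (16) is the IV length CTR mode expects, which may be
	// where the "16" in the old comment came from.
	fmt.Println(n, err, aes.BlockSize) // 12 <nil> 16
}
```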

// DataKeyID represents a 3-byte data key identifier
type DataKeyID [3]byte

// ToString converts DataKeyID to hex string

medium

The comment says ToString converts the DataKeyID to a hex string, but the implementation performs a direct byte-to-string conversion. The comment should be corrected to reflect the actual behavior.

Suggested change
- // ToString converts DataKeyID to hex string
+ // ToString converts DataKeyID to a string.
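
The difference between the two behaviours, and the 3-byte length constraint raised earlier for the mock key IDs, can be seen in a short sketch. `Hex` and `ParseDataKeyID` are hypothetical helpers added for comparison; only `DataKeyID` and `ToString` come from the PR, and `ToString` is written here the way the review describes it.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// DataKeyID represents a 3-byte data key identifier.
type DataKeyID [3]byte

// ToString converts DataKeyID to a string by direct byte-to-string conversion.
func (id DataKeyID) ToString() string { return string(id[:]) }

// Hex is a hypothetical helper showing what a hex string would look like instead.
func (id DataKeyID) Hex() string { return hex.EncodeToString(id[:]) }

// ParseDataKeyID is a hypothetical helper that rejects IDs that are not exactly 3 bytes.
func ParseDataKeyID(s string) (DataKeyID, error) {
	var id DataKeyID
	if len(s) != len(id) {
		return id, errors.New("data key ID must be exactly 3 bytes")
	}
	copy(id[:], s)
	return id, nil
}

func main() {
	id, _ := ParseDataKeyID("001")
	fmt.Println(id.ToString()) // 001
	fmt.Println(id.Hex())      // 303031
	_, err := ParseDataKeyID("00000001")
	fmt.Println(err) // data key ID must be exactly 3 bytes
}
```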

- Add unit tests for cipher functionality including unsupported algorithm detection and AES256CTR encryption/decryption
- Add unit tests for data format encoding/decoding with encrypted and unencrypted data
- Add unit tests for mock TiKV client behavior including keyspace encryption meta retrieval

Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
- Replace panic-prone appcontext.GetService calls with TryGetService for optional encryption manager
- Pass encryption manager to eventStoreIter to avoid repeated appcontext lookups
- Update encryption format to support version-based detection from TiKV metadata
- Add GetEncryptionVersion method to encryption meta manager interface
- Improve backward compatibility for legacy unencrypted data formats
- Add comprehensive tests for encryption format handling

Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
- Add keyspaceID field to eventWithCallback struct to carry keyspace information
- Store keyspaceID in dispatcher statistics during registration
- Pass keyspaceID to encryption manager for both encryption and decryption operations
- Update encryption manager to return errors instead of graceful degradation when configured
- Add test to verify keyspaceID is correctly used in encryption/decryption flow

Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
- Add test for encryption degrade on error with allow flag enabled
- Add test for encryption degrade on error with allow flag disabled
- Add test for encryption disabled scenario
- Include mock meta manager for testing encryption manager behavior

Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
- Remove redundant comments and build tags from test files
- Use sync.Once for safe stop channel closure in encryption manager
- Clear meta cache entries before refresh to ensure fresh data
- Standardize formatting and remove outdated compatibility comments

Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
Labels

  • do-not-merge/needs-triage-completed
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Implement CMEK Encryption Support for TiCDC Next-Gen Architecture
