Enabling CSE encryption for COPY command #377
Open
ameent wants to merge 4 commits into databricks:master
Conversation
ameent added 4 commits on November 28, 2017 at 10:10
Changed usage of the CREDENTIALS clause for Redshift to specific
keywords. In other words, the commands produced will no longer be
of the form:
CREDENTIALS('access_key=X&secret_key=Y')
and will instead be of the form:
access_key = 'X' secret_key = 'Y'
This is needed because loading client-side encrypted payloads into
Redshift requires passing master_symmetric_key as an argument on the
COPY command. However, that key is also an option within the
CREDENTIALS clause, so if a query to Redshift includes both a
CREDENTIALS clause and master_symmetric_key, Redshift reports this
error:
com.amazon.support.exceptions.ErrorException: Amazon Invalid operation: conflicting or redundant options;
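For illustration, here is a hedged sketch of the two forms side by side. The helper function and all names in it are hypothetical, not the PR's actual code; the explicit keywords follow the Redshift COPY authorization documentation linked below.

```scala
// Hypothetical sketch of rendering the authorization portion of a
// COPY/UNLOAD statement with explicit keywords rather than a single
// opaque CREDENTIALS string.
def credentialsClause(accessKey: String,
                      secretKey: String,
                      sessionToken: Option[String] = None): String = {
  val base = s"ACCESS_KEY_ID '$accessKey' SECRET_ACCESS_KEY '$secretKey'"
  sessionToken.fold(base)(token => s"$base SESSION_TOKEN '$token'")
}

// Old style (master_symmetric_key inside the string collides with a
// MASTER_SYMMETRIC_KEY argument elsewhere on the command):
//   CREDENTIALS 'aws_access_key_id=X;aws_secret_access_key=Y'
// New style (MASTER_SYMMETRIC_KEY '...' can now be appended
// independently, e.g. via extracopyoptions):
//   ACCESS_KEY_ID 'X' SECRET_ACCESS_KEY 'Y'
```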
When data is encrypted in S3 and a COPY command is invoked, it's expected that the manifest is not encrypted and is in plain-text form. Encryption on EMR through EMRFS is controlled by a Hadoop option (fs.enable.cse). Once set, all data that goes through the file system will be encrypted. This commit adds an exception around generation of manifest files so that even if the encryption option is set, the manifest file created on S3 is not encrypted. This enables Redshift to read the manifest and ingest the data even for cases where data is encrypted on the client side with a symmetric encryption key.
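A minimal sketch of that exception, assuming a helper that writes the manifest through a FileSystem whose configuration has the encryption flag cleared. The helper and its names are hypothetical; the property name is taken from the commit message above.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: write the COPY manifest with client-side
// encryption turned off so Redshift can read it as plain text.
def writeManifestUnencrypted(baseConf: Configuration,
                             manifestPath: String,
                             manifestJson: String): Unit = {
  val conf = new Configuration(baseConf)   // copy; leave the caller's conf intact
  conf.setBoolean("fs.enable.cse", false)  // property name per the PR description
  val path = new Path(manifestPath)
  // newInstance bypasses the FileSystem cache, so the modified conf is honored
  val fs = FileSystem.newInstance(path.toUri, conf)
  try {
    val out = fs.create(path)
    try out.write(manifestJson.getBytes("UTF-8"))
    finally out.close()
  } finally fs.close()
}
```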
Switching the UNLOAD statement to no longer use the WITH CREDENTIALS method and instead rely on explicitly passing the role, access key, secret key, session token, etc. Generally speaking this is a more flexible way of passing credentials, though for UNLOAD it doesn't make much difference; the change is pursued for consistency with the COPY command, where it is necessary to enable copying client-side encrypted data into Redshift. http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-authorization.html
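As an illustration only, an UNLOAD of roughly this shape could be built with explicit authorization parameters; every table, bucket, and credential value below is a placeholder, with keywords per the linked docs.

```scala
// Illustrative only: UNLOAD with explicit authorization parameters
// instead of WITH CREDENTIALS. All values are placeholders.
val accessKey    = "AKIA..."   // placeholder
val secretKey    = "SECRET..." // placeholder
val sessionToken = "TOKEN..."  // placeholder, for temporary credentials

val unloadSql =
  s"""UNLOAD ('SELECT * FROM my_table')
     |TO 's3://my-bucket/unload/part_'
     |ACCESS_KEY_ID '$accessKey'
     |SECRET_ACCESS_KEY '$secretKey'
     |SESSION_TOKEN '$sessionToken'
     |MANIFEST""".stripMargin
```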
With these changes one can write encrypted data to S3 using client-side encryption with a custom symmetric master key, and then use spark-redshift to ingest the data. I have sensitive data being ingested into Redshift and can't use S3 SSE for my data.
The main change was to switch away from the WITH CREDENTIALS syntax and explicitly pass iam_role, access_key, etc. parameters, so that the end user can use the "extracopyoptions" setting to supply their symmetric master key.
Usage:
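(The original usage snippet is not preserved above; what follows is a hedged reconstruction based on the description. "tempdir" and "extracopyoptions" are documented spark-redshift options, "forward_spark_s3_credentials" is assumed available, and every URL, table, bucket, and key value is a placeholder. `spark` and `df` are assumed in scope.)

```scala
// Hedged sketch of end-to-end usage: stage client-side encrypted data in S3,
// then let spark-redshift issue a COPY that carries the symmetric master key.
val symmetricKey = sys.env("MASTER_SYMMETRIC_KEY") // base64-encoded key; fails fast if unset

// Per the PR, EMRFS client-side encryption is enabled for the session, so the
// temp data spark-redshift stages in S3 is encrypted (property name as given
// in the commit message above).
spark.sparkContext.hadoopConfiguration.set("fs.enable.cse", "true")

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://example.host:5439/dev?user=u&password=p")
  .option("dbtable", "my_table")
  .option("tempdir", "s3://my-bucket/tmp/")
  .option("forward_spark_s3_credentials", "true")
  // The symmetric master key rides along on the generated COPY command,
  // which is possible now that credentials are passed as explicit keywords.
  .option("extracopyoptions", s"MASTER_SYMMETRIC_KEY '$symmetricKey' ENCRYPTED")
  .mode("append")
  .save()
```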