A Nextflow pipeline designed to efficiently and reliably transfer massive datasets (e.g., 35TB+) between AWS S3 buckets across different AWS accounts.
By leveraging Nextflow, this pipeline parallelizes AWS CLI `s3 sync` operations, automatically handles retries for transient network/API failures, and allows you to cleanly resume interrupted transfers without starting from scratch.
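Conceptually, each source/destination pair becomes one sync task along the lines of the sketch below. This is an illustration, not the pipeline's actual process block: the bucket names are placeholders, and the explicit loop stands in for the retry handling Nextflow performs for you.

```bash
# Conceptual equivalent of a single transfer task (placeholder bucket names).
# The real pipeline schedules many of these in parallel and relies on
# Nextflow to retry failed tasks; the loop below only illustrates retries.
SRC="s3://source-account-bucket-1/"
DST="s3://destination-account-bucket-1/"
for attempt in 1 2 3; do
  aws s3 sync "$SRC" "$DST" --acl bucket-owner-full-control && break
  echo "sync attempt $attempt failed; retrying" >&2
  sleep $((attempt * 30))
done
```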
Before running the pipeline, make sure you have:

- Nextflow (version 22.0 or later).
- AWS CLI: Ensure the AWS Command Line Interface is installed (`aws --version`).
- AWS Credentials: You must be authenticated locally (e.g., via `aws configure` or SSO) with an IAM entity that has appropriate permissions.
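A quick sanity check before launching a large transfer (the bucket name below is a placeholder):

```bash
# Confirm the required tools are installed and on PATH
nextflow -version
aws --version

# Confirm which IAM identity your current credentials resolve to
aws sts get-caller-identity

# Spot-check read access to a source bucket (placeholder name)
aws s3 ls s3://source-account-bucket-1/
```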
To successfully copy data from Account A (Source) to Account B (Destination), your active AWS credentials must have:
- Read access (`s3:GetObject`, `s3:ListBucket`) to the source buckets.
- Write access (`s3:PutObject`) to the destination buckets.
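For reference, a minimal identity policy covering one transfer pair might look like the sketch below. The bucket names and policy file name are placeholders, and `s3:PutObjectAcl` is included to support the ACL flag discussed next.

```bash
# Write a minimal policy document; attach it to the IAM identity that will
# run the transfer. All names below are placeholders, not pipeline defaults.
cat > transfer-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::source-account-bucket-1",
        "arn:aws:s3:::source-account-bucket-1/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:PutObjectAcl"],
      "Resource": ["arn:aws:s3:::destination-account-bucket-1/*"]
    }
  ]
}
EOF
```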
Important Note on Object Ownership: This pipeline automatically applies the `--acl bucket-owner-full-control` flag. Without this flag, files transferred to Account B would still be "owned" by Account A, making them unreadable to Account B. Ensure the destination bucket has ACLs enabled or bucket policies allowing `s3:PutObjectAcl`.
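If you are unsure how the destination bucket is configured, you can inspect its Object Ownership setting (bucket name is a placeholder; the call returns an error if ownership controls were never set on the bucket):

```bash
# "BucketOwnerEnforced" means ACLs are disabled and ownership is automatic;
# "BucketOwnerPreferred" / "ObjectWriter" mean the --acl flag matters.
aws s3api get-bucket-ownership-controls --bucket destination-account-bucket-1
```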
Define your transfers in a file named `buckets.csv` in the root directory. Format the file as `source,destination` with no spaces and no header row.
```
s3://source-account-bucket-1/,s3://destination-account-bucket-1/
s3://source-bucket/batch-1/,s3://dest-bucket/batch-1/
s3://source-bucket/batch-2/,s3://dest-bucket/batch-2/
```

Submit the pipeline as an AWS Batch job:

```bash
aws batch submit-job \
    --job-name nf-transfer \
    --job-queue priority-maf-pipelines \
    --job-definition nextflow-production \
    --container-overrides command="FischbachLab/nf-transfer, \
"--buckets_list", "s3://genomics-workflow-core/Results/transfer/buckets.csv" "
```