aws-samples/sample-distributed-cross-cloud-data-migration-s3-rclone

Distributed cross-cloud data migration at scale into Amazon S3 using rclone

This repository contains a single AWS CloudFormation template that deploys a distributed, scalable architecture for migrating data from any S3-compatible storage provider to Amazon S3 using rclone. It is designed to handle petabyte-scale migrations from providers such as IBM Cloud Object Storage, Google Cloud Storage, and Azure Blob Storage.

Single-instance migration approaches often suffer from lack of visibility, frequent failures, source-side throttling, and performance saturation. This solution addresses those challenges with a fan-out architecture that distributes work across multiple parallel workers, provides granular progress tracking, and automatically retries failed transfers.

This sample was published alongside an AWS Storage Blog post.

Disclaimer: This is sample code intended for educational and demonstration purposes. Review and test thoroughly before using in a production environment.

Architecture diagram showing distributed cross-cloud data migration solution. (1) Discovery layer shows Amazon ECS with AWS Fargate containers listing objects from source storage. (2) Queueing layer shows Amazon SQS receiving batched file messages. (3) Execution layer shows Amazon EC2 Auto Scaling Group (0–5 r5n.xlarge instances, 6 rclone processes each) with distributed rclone workers transferring data from S3-compatible source storage to Amazon S3. Amazon CloudWatch Logs provide observability across all layers.

Figure 1: Distributed cross-cloud migration architecture showing the three layers — Discovery (ECS Fargate lister), Queueing (Amazon SQS), and Execution (EC2 Auto Scaling workers with rclone).

Architecture overview

The CloudFormation template deploys the following components:

| Component | Resources | Purpose |
|---|---|---|
| Networking | VPC, 3 public subnets (3 AZs), Internet Gateway, route tables, security group | Outbound internet access for cross-cloud transfers |
| Message queue | SQS main queue + dead-letter queue (both SSE-encrypted) | Fan-out work distribution between lister and workers |
| Credentials | AWS Secrets Manager (3 secrets) | Secure storage for source endpoint credentials |
| Logging | Amazon CloudWatch Log Groups (encrypted with AWS KMS, 7-day retention) | Centralized logging for lister and worker components |
| IAM | 3 least-privilege roles (ECS execution, ECS task, EC2 worker) | Scoped permissions for each component |
| Lister | ECS Fargate task (python:3.13-slim) | Enumerates source objects, batches 20 keys per SQS message |
| Workers | Amazon EC2 Auto Scaling group (r5n.xlarge, 0–5 instances, 6 rclone processes each) | Copies objects from source to Amazon S3 using rclone |

How it works

  1. You run an aws ecs run-task command (provided as a stack output) specifying source and destination buckets.
  2. The Lister Fargate task connects to the source endpoint using credentials from AWS Secrets Manager, lists all objects, and sends batches of 20 keys as messages to the SQS queue.
  3. EC2 workers poll the queue and run rclone copyto for each object key. On success, the message is deleted. On failure, the worker resets the message's visibility timeout to zero so it becomes visible for an immediate retry.
  4. After 2 failed attempts, messages move to the dead-letter queue for investigation.
  5. The Amazon EC2 Auto Scaling group adjusts worker count based on queue depth using a target tracking scaling policy.
  6. Workers protect themselves from scale-in termination while actively processing a message.
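The batching step above (20 keys per SQS message) can be sketched in pure Python. Note that `batch_keys`, `to_message`, and the message field names are illustrative assumptions, not the actual lister code:

```python
import json

BATCH_SIZE = 20  # keys per SQS message, as configured by the template

def batch_keys(keys, batch_size=BATCH_SIZE):
    """Group object keys into fixed-size batches, one batch per SQS message."""
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]

def to_message(batch, source_bucket, dest_bucket):
    """Serialize one batch as an SQS message body (hypothetical schema)."""
    return json.dumps({
        "source_bucket": source_bucket,
        "dest_bucket": dest_bucket,
        "keys": batch,
    })

keys = [f"data/part-{i:05d}.parquet" for i in range(45)]
messages = [to_message(b, "my-src", "my-dst") for b in batch_keys(keys)]
# 45 keys -> 3 messages (20 + 20 + 5 keys)
```

Batching amortizes the per-message SQS request cost and lets each worker poll once and process many objects.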

Performance

This architecture was tested with a 2.7 PB dataset migrated from IBM Cloud Object Storage, achieving 20–80 Gbps aggregate throughput. The migration completed in approximately 2 weeks at roughly $2,000 in compute costs. Results may vary based on file sizes, network conditions, and source provider throttling limits.
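As a rough sanity check on those figures (assuming decimal units and a sustained aggregate rate; real transfers fluctuate with throttling and file-size mix):

```python
PB = 1e15  # bytes, decimal petabyte
dataset_bits = 2.7 * PB * 8

def days_at(gbps):
    """Transfer time in days at a sustained aggregate throughput."""
    seconds = dataset_bits / (gbps * 1e9)
    return seconds / 86400

low_rate_days = days_at(20)   # 12.5 days
high_rate_days = days_at(80)  # ~3.1 days
```

A sustained rate near the low end of the 20–80 Gbps range is consistent with the observed roughly two-week duration.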

Prerequisites

  • An AWS account with permissions to create CloudFormation stacks, IAM roles, VPCs, ECS clusters, EC2 instances, SQS queues, and Secrets Manager secrets
  • AWS CLI v2 installed and configured
  • HMAC credentials (access key and secret key) for your S3-compatible source storage provider
  • A destination Amazon S3 bucket configured with required security controls (must already exist — see SECURITY.md for bucket security prerequisites including Block Public Access, encryption at rest, and TLS enforcement)

Deployment

1. Deploy the CloudFormation stack

```bash
aws cloudformation deploy \
  --template-file cross-cloud-s3-migration.yaml \
  --stack-name cross-cloud-migration \
  --capabilities CAPABILITY_IAM \
  --region <your-region>
```

2. Update source credentials in AWS Secrets Manager

Replace the placeholder values with your actual source storage credentials:

```bash
aws secretsmanager put-secret-value \
  --secret-id /migration/source_endpoint \
  --secret-string "https://s3.us-south.cloud-object-storage.appdomain.cloud" \
  --region <your-region>

aws secretsmanager put-secret-value \
  --secret-id /migration/source_access_key \
  --secret-string "<your-access-key>" \
  --region <your-region>

aws secretsmanager put-secret-value \
  --secret-id /migration/source_secret_key \
  --secret-string "<your-secret-key>" \
  --region <your-region>
```

Usage

Retrieve the run-task command

The stack outputs a ready-to-use CLI command for starting migration jobs:

```bash
aws cloudformation describe-stacks \
  --stack-name cross-cloud-migration \
  --query 'Stacks[0].Outputs[?OutputKey==`RunTaskCommand`].OutputValue' \
  --output text \
  --region <your-region>
```

Start a migration job

Replace YOUR-SOURCE-BUCKET and YOUR-DEST-BUCKET in the output command with your actual bucket names and run it. You can optionally set a PREFIX value to migrate only objects matching a specific key prefix.

```bash
aws ecs run-task \
  --cluster <cluster-from-output> \
  --task-definition <task-def-from-output> \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[<subnet-from-output>],securityGroups=[<sg-from-output>],assignPublicIp=ENABLED}" \
  --overrides '{"containerOverrides": [{"name": "lister", "environment": [{"name": "SOURCE_BUCKET", "value": "my-source-bucket"}, {"name": "DEST_BUCKET", "value": "my-dest-bucket"}, {"name": "QUEUE_URL", "value": "<queue-url-from-output>"}, {"name": "PREFIX", "value": ""}]}]}' \
  --region <your-region>
```

Monitoring

| What to monitor | Where to find it |
|---|---|
| Lister progress | CloudWatch Logs → /migration/lister |
| Worker activity | CloudWatch Logs → /migration/workers |
| Queue depth | SQS console → ApproximateNumberOfMessagesVisible |
| Failed messages | SQS console → dead-letter queue |
| Worker scaling | EC2 Auto Scaling console → activity history |

The lister logs each S3 listing page, every SQS message sent, and a final summary with total keys and messages. Workers log each object copy operation with progress indicators.
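Scaling on queue depth is commonly implemented as a backlog-per-instance target. The sketch below illustrates that idea; the target value and the exact metric the template's target tracking policy uses are assumptions, not values taken from the template:

```python
import math

TARGET_BACKLOG_PER_INSTANCE = 600  # illustrative target, not the template's actual value
MAX_INSTANCES = 5                  # the Auto Scaling group's maximum

def desired_capacity(visible_messages: int) -> int:
    """Worker count so each instance handles ~TARGET_BACKLOG_PER_INSTANCE messages."""
    if visible_messages == 0:
        return 0  # scale to zero when the queue drains
    needed = math.ceil(visible_messages / TARGET_BACKLOG_PER_INSTANCE)
    return min(needed, MAX_INSTANCES)
```

With this shape of policy, a deep queue scales the group out to its maximum, and an empty queue scales it back to zero so idle instances stop accruing cost.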

Cost considerations

The primary cost drivers for this solution are:

  • EC2 instances: r5n.xlarge instances ($0.298/hr in us-east-1). The Amazon EC2 Auto Scaling group scales from 0 to 5 instances based on queue depth, so costs scale with workload.
  • Data transfer in: AWS does not charge for inbound data transfer from the internet. Data flowing from the source provider to EC2 workers via the Internet Gateway incurs no AWS charges.
  • Data transfer to S3: Transfers from EC2 to S3 within the same Region are free.
  • ECS Fargate: The lister task runs briefly (minutes) and costs are minimal.
  • SQS, Secrets Manager, CloudWatch: Costs are negligible for typical migration workloads.

This solution deliberately avoids NAT Gateways. NAT Gateways charge $0.045/GB for data processing, which would add significant cost for large-scale migrations. Instead, workers use public subnets with direct Internet Gateway access (no per-GB charge).
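The two cost figures above can be put side by side with some back-of-envelope arithmetic (prices are the ones quoted in this README; verify current rates for your Region):

```python
HOURLY_RATE = 0.298      # r5n.xlarge on-demand, $/hr, us-east-1 (from this README)
MAX_INSTANCES = 5

# Fleet compute cost per day at full scale-out
daily_fleet_cost = HOURLY_RATE * MAX_INSTANCES * 24   # $35.76/day

# Hypothetical NAT Gateway data-processing cost for the 2.7 PB test dataset
NAT_GB_RATE = 0.045      # $/GB processed
DATASET_GB = 2.7e6       # 2.7 PB in decimal GB
nat_cost_avoided = NAT_GB_RATE * DATASET_GB           # ~$121,500
```

Even weeks of fully scaled-out compute are small next to the six-figure NAT data-processing charge the public-subnet design avoids.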

Cleaning up

Delete the CloudFormation stack to remove all resources:

```bash
aws cloudformation delete-stack \
  --stack-name cross-cloud-migration \
  --region <your-region>
```

This removes all resources created by the template. Your source and destination buckets are not affected.

Security considerations

  • IAM least privilege: Each component (ECS task, EC2 worker) has a dedicated IAM role scoped to the minimum required permissions.
  • Encryption at rest: SQS queues and CloudWatch Log Groups encrypted with customer-managed AWS KMS keys (automatic annual key material rotation enabled). Source credentials encrypted in AWS Secrets Manager.
  • Encryption in transit: rclone uses HTTPS by default for all S3-compatible endpoints.
  • No inbound traffic: The security group allows all outbound traffic but no inbound traffic.
  • Secrets management: Source credentials stored in AWS Secrets Manager, encrypted at rest by default.
  • IMDSv2: Enforced on EC2 instances via Launch Template MetadataOptions (HttpTokens: required).
  • Input validation: Worker validates SQS message structure and uses regex allowlists for bucket names and object keys.
  • Network isolation: The VPC is purpose-built for migration workloads with no shared resources.
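The input-validation bullet can be illustrated with a minimal sketch. The regexes and message schema here are assumptions for illustration (the bucket pattern follows S3's published naming rules; the key charset is deliberately conservative) and may differ from the worker's actual allowlists:

```python
import re

# Illustrative allowlists; the worker's actual patterns may differ.
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")  # S3 bucket naming rules
KEY_RE = re.compile(r"^[A-Za-z0-9!_.*'()/-]+$")                # conservative key charset

def validate_message(msg: dict) -> bool:
    """Reject malformed or suspicious SQS payloads before they reach rclone."""
    try:
        buckets_ok = all(BUCKET_RE.fullmatch(msg[k])
                         for k in ("source_bucket", "dest_bucket"))
        keys_ok = (isinstance(msg["keys"], list) and bool(msg["keys"])
                   and all(KEY_RE.fullmatch(k) for k in msg["keys"]))
    except (KeyError, TypeError):
        return False
    return bool(buckets_ok and keys_ok)
```

Validating against an allowlist (rather than escaping) means a key containing shell metacharacters is dropped outright instead of being passed to a subprocess.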

For detailed security documentation, see:

  • SECURITY.md — Shared responsibility model, data classification, risk assessment, AWS KMS key management, and access logging guidance
  • DESIGN.md — Architecture design decisions and security trade-offs

Contributing

See CONTRIBUTING for information on how to contribute to this project.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
