Skip to content

Latest commit

 

History

History
931 lines (670 loc) · 22.3 KB

File metadata and controls

931 lines (670 loc) · 22.3 KB

Google Cloud Storage (GCS) Integration Guide

Version: v0.6.0 - Production Ready Status: Beta Last Updated: December 19, 2025


Overview

Orbit v0.6.0 provides comprehensive Google Cloud Storage support with streaming I/O using the industry-standard object_store crate. The implementation is pure Rust, async-first, and designed for high-performance data transfers with memory-efficient streaming, built-in resilience, and sophisticated error recovery.

What's New in v0.6.0 - GCS Backend

🚀 Production-Ready GCS Support:

  • Streaming Upload/Download - Transfer files of any size with constant ~200MB memory usage

    • Memory-efficient transfers using object_store's streaming API
    • No buffering of entire files in memory
    • Seamless handling of multi-gigabyte files
  • Unified Cloud Storage API - Consistent interface across S3, Azure, and GCS

    • Same Backend trait implementation for all cloud providers
    • Easy migration between cloud providers
    • Battle-tested object_store crate (used by Apache Arrow DataFusion)
  • Full Backend Trait Support - Complete async operations

    • stat, list, read, write, delete, mkdir, rename, exists
    • Streaming I/O for all operations
    • Consistent error handling

Memory Usage:

  • Upload 10GB file: ~200MB (constant memory)
  • Download 5GB file: ~100MB (constant memory)
  • List 100K objects: ~10MB (lazy streaming)

Supported File Sizes:

  • Maximum upload: 5TB (GCS limit)
  • Maximum download: Unlimited
  • Maximum bucket objects: Millions (constant memory)

📖 Backend Guide: See BACKEND_STREAMING_GUIDE.md for complete examples

Key Features

Pure Rust - No external dependencies or binaries required ✅ Async Operations - Built on Tokio for high concurrency ✅ Streaming I/O - Constant memory usage for files of any size ✅ Resumable Transfers - Automatic resume on interruption (when used with Orbit's resume feature) ✅ Parallel Operations - Configurable concurrent transfers ✅ Integrity Verification - Built-in checksum validation ✅ Flexible Authentication - Multiple credential sources (service accounts, application credentials) ✅ Strong Consistency - GCS provides strong consistency guarantees ✅ Prefix Support - Virtual directory isolation within buckets ✅ URI Support - Both gs:// and gcs:// URI schemes


Installation

Enable GCS Support

Build Orbit with the gcs-native feature flag:

cargo build --release --features gcs-native

Or add to your Cargo.toml if using Orbit as a library:

[dependencies]
orbit = { version = "0.6", features = ["gcs-native"] }

System Requirements

  • Rust: 1.70 or later
  • Network: Outbound HTTPS access to GCS endpoints
  • Memory: Recommended 512MB+ for large file transfers
  • Authentication: Google Cloud service account or application credentials

Quick Start

CLI Usage

Basic Upload

# Upload a file to GCS
orbit --source /local/dataset.tar.gz --dest gs://mybucket/backups/dataset.tar.gz

# Upload with progress
orbit --source /local/large-file.zip --dest gs://mybucket/files/large-file.zip --progress

Basic Download

# Download a file from GCS
orbit --source gs://mybucket/data/report.pdf --dest ./report.pdf

# Download with resume support
orbit --source gcs://mybucket/large-dataset.tar --dest ./dataset.tar --resume

Directory Sync

# Sync local directory to GCS
orbit --source /local/photos --dest gs://mybucket/archives/photos \
  --mode sync --resume --parallel 8 --recursive

# Sync with filters
orbit --source /local/documents --dest gs://mybucket/docs \
  --mode sync --recursive --include "*.pdf" --exclude "*.tmp"

Programmatic Usage (Rust API)

Basic Upload

use orbit::backend::{Backend, GcsBackend, WriteOptions};
use std::path::Path;
use tokio::fs::File;
use tokio::io::BufReader;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create GCS backend
    let backend = GcsBackend::new("my-bucket").await?;

    // Open file for reading
    let file = File::open("/local/data.txt").await?;
    let reader = BufReader::new(file);

    // Upload with streaming
    backend.write(
        Path::new("remote/data.txt"),
        Box::new(reader),
        None, // size hint
        WriteOptions::default(),
    ).await?;

    println!("Upload complete!");
    Ok(())
}

Basic Download

use orbit::backend::{Backend, GcsBackend};
use std::path::Path;
use tokio::io::AsyncWriteExt;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let backend = GcsBackend::new("my-bucket").await?;

    // Get streaming download
    let mut stream = backend.read(Path::new("remote/data.txt")).await?;

    // Write to local file
    let mut file = tokio::fs::File::create("./data.txt").await?;
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        file.write_all(&chunk).await?;
    }

    println!("Download complete!");
    Ok(())
}

List Objects

use orbit::backend::{Backend, GcsBackend, ListOptions};
use std::path::Path;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let backend = GcsBackend::new("my-bucket").await?;

    // List objects with prefix
    let options = ListOptions {
        recursive: true,
        max_entries: Some(1000),
    };

    let mut stream = backend.list(Path::new("data/"), options).await?;

    // Stream results
    while let Some(entry) = stream.next().await {
        let entry = entry?;
        println!("{}: {} bytes", entry.path.display(), entry.metadata.size);
    }

    Ok(())
}

Authentication

Orbit supports multiple Google Cloud authentication methods through the object_store crate.

1. Service Account JSON File (Recommended)

The most common and secure method for production deployments:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

Service Account JSON Structure:

{
  "type": "service_account",
  "project_id": "my-project",
  "private_key_id": "key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "my-service-account@my-project.iam.gserviceaccount.com",
  "client_id": "123456789",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/..."
}

Then use Orbit without explicit credentials:

orbit --source /local/data --dest gs://mybucket/backup

Or programmatically:

let backend = GcsBackend::new("my-bucket").await?;

2. Direct Service Account Credentials

For environments where file access is restricted:

export GOOGLE_SERVICE_ACCOUNT=myaccount@myproject.iam.gserviceaccount.com
export GOOGLE_SERVICE_ACCOUNT_KEY="-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"

3. Application Default Credentials (ADC)

When running on Google Cloud Platform (GCE, GKE, Cloud Run, etc.):

# No explicit credentials needed - uses instance metadata service
orbit --source /local/data --dest gs://mybucket/backup

The object_store crate automatically detects and uses the instance's service account.

4. URI-based Credentials (Advanced)

For testing or specific scenarios:

orbit --source /local/data \
  --dest "gs://mybucket/backup?service_account=account@project.iam.gserviceaccount.com"

Configuration

Bucket Configuration

Using Environment Variables

# Required: Bucket name
export ORBIT_GCS_BUCKET=mybucket

# Optional: Prefix (virtual root directory)
export ORBIT_GCS_PREFIX=backups/production

# Authentication
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

Programmatic Configuration

use orbit::backend::{GcsBackend, GcsConfig};

// Basic configuration
let backend = GcsBackend::new("my-bucket").await?;

// With prefix (virtual root)
let backend = GcsBackend::with_prefix("my-bucket", "data/prod").await?;

URI Schemes

Orbit supports two URI schemes for GCS:

  1. gs://bucket/path - Standard GCS URI scheme
  2. gcs://bucket/path - Alternative scheme

Both work identically:

# These are equivalent
orbit --source /local/file.txt --dest gs://mybucket/file.txt
orbit --source /local/file.txt --dest gcs://mybucket/file.txt

Advanced Features

Prefix Support (Virtual Directories)

GCS doesn't have true directories, but you can use prefixes to organize objects:

// Create backend with prefix
let backend = GcsBackend::with_prefix("mybucket", "production/backups").await?;

// All operations are relative to prefix
backend.write(Path::new("data.txt"), reader, None, WriteOptions::default()).await?;
// Object created at: gs://mybucket/production/backups/data.txt

Listing with Delimiters

List objects hierarchically (non-recursive):

use orbit::backend::{Backend, GcsBackend, ListOptions};
use std::path::Path;
use futures::StreamExt;

let backend = GcsBackend::new("mybucket").await?;

// Non-recursive list (shows "directories")
let options = ListOptions {
    recursive: false,
    max_entries: None,
};

let mut stream = backend.list(Path::new("data/"), options).await?;

while let Some(entry) = stream.next().await {
    let entry = entry?;
    if entry.metadata.is_dir {
        println!("Directory: {}", entry.path.display());
    } else {
        println!("File: {} ({} bytes)", entry.path.display(), entry.metadata.size);
    }
}

Metadata Operations

Get object metadata without downloading:

use orbit::backend::{Backend, GcsBackend};
use std::path::Path;

let backend = GcsBackend::new("mybucket").await?;

// Get metadata
let metadata = backend.stat(Path::new("data/file.txt")).await?;
println!("Size: {} bytes", metadata.size);
println!("Modified: {:?}", metadata.modified);
println!("ETag: {:?}", metadata.etag);

Object Operations

use orbit::backend::{Backend, GcsBackend};
use std::path::Path;

let backend = GcsBackend::new("mybucket").await?;

// Check if object exists
let exists = backend.exists(Path::new("data/file.txt")).await?;

// Delete object
backend.delete(Path::new("data/file.txt"), false).await?;

// Delete recursively (all objects with prefix)
backend.delete(Path::new("data/"), true).await?;

// Rename/move object
backend.rename(
    Path::new("old/path.txt"),
    Path::new("new/path.txt"),
).await?;

// Create directory marker (0-byte object with trailing /)
backend.mkdir(Path::new("new/directory"), true).await?;

Performance Optimization

Parallel Transfers

Use the --parallel flag to transfer multiple files concurrently:

# Upload directory with 8 parallel transfers
orbit --source /local/photos --dest gs://mybucket/photos \
  --recursive --parallel 8

# Optimal parallelism depends on:
# - Network bandwidth
# - CPU cores
# - File sizes
# - GCS bucket location

Streaming Large Files

The GCS backend automatically uses streaming for all operations:

// This uses constant memory regardless of file size
let backend = GcsBackend::new("mybucket").await?;

// Upload 10GB file with ~200MB RAM usage
backend.write(
    Path::new("large-file.bin"),
    Box::new(file_reader),
    Some(10 * 1024 * 1024 * 1024), // 10GB size hint
    WriteOptions::default(),
).await?;

Resume Interrupted Transfers

# Orbit will resume from last checkpoint
orbit --source /local/huge-file.tar --dest gs://mybucket/backup.tar --resume

# Resume works by:
# 1. Checking if partial object exists
# 2. Comparing checksums
# 3. Resuming from last verified chunk

Integration with Orbit Features

Delta Detection

Use rsync-style delta detection with GCS:

# Only transfer changed files
orbit --source /local/data --dest gs://mybucket/backup \
  --recursive --delta

# Delta detection:
# - Compares file sizes and modification times
# - Skips unchanged files
# - Transfers only modified files

Filters and Patterns

# Include only specific files
orbit --source /local/docs --dest gs://mybucket/docs \
  --recursive --include "*.pdf" --include "*.docx"

# Exclude patterns
orbit --source /local/project --dest gs://mybucket/backup \
  --recursive --exclude "*.tmp" --exclude "node_modules/*"

Progress Reporting

# Show progress bar
orbit --source /local/large-file.tar --dest gs://mybucket/backup.tar --progress

# JSON telemetry output
orbit --source /local/data --dest gs://mybucket/backup \
  --recursive --telemetry-output /tmp/transfer.json

Error Handling and Resilience

Automatic Retry

The GCS backend automatically retries failed operations:

// Retries are built into object_store:
// - Network errors: 3 retries with exponential backoff
// - 5xx errors: Automatic retry with backoff
// - 429 (rate limit): Respects Retry-After header

Connection Issues

# Resume on connection failure
orbit --source /local/data --dest gs://mybucket/backup \
  --recursive --resume

# If transfer fails:
# 1. Partial progress is saved
# 2. Next run resumes from checkpoint
# 3. Checksums verify integrity

Permission Errors

Common GCS permission errors and solutions:

Error: 403 Forbidden
Solution: Ensure service account has these IAM roles:
  - storage.objectViewer (for read)
  - storage.objectCreator (for write)
  - storage.objectAdmin (for delete/rename)
Error: 404 Not Found
Solution:
  - Verify bucket exists
  - Check bucket name spelling
  - Ensure service account has access

Best Practices

1. Service Account Security

# Use dedicated service accounts per environment
production-orbit@myproject.iam.gserviceaccount.com
staging-orbit@myproject.iam.gserviceaccount.com

# Principle of least privilege
# Only grant necessary permissions:
# - Read-only: storage.objectViewer
# - Write-only: storage.objectCreator
# - Full access: storage.objectAdmin

2. Bucket Organization

mybucket/
├── production/
│   ├── daily/
│   ├── weekly/
│   └── monthly/
├── staging/
└── development/

Use prefixes to organize backups:

# Production daily backup
orbit --source /data --dest gs://mybucket/production/daily/$(date +%Y%m%d)

# Use lifecycle policies to expire old backups

3. Cost Optimization

# Use Standard storage for frequently accessed data
# Use Nearline/Coldline for archival

# Set lifecycle policies in GCS:
# - Transition to Nearline after 30 days
# - Transition to Coldline after 90 days
# - Delete after 365 days

# Transfer compressed data
tar czf - /data | orbit --source - --dest gs://mybucket/backup.tar.gz

4. Monitoring and Observability

# Enable JSON telemetry
orbit --source /data --dest gs://mybucket/backup \
  --recursive --telemetry-output /var/log/orbit/transfer.json

# Monitor with observability stack
export ORBIT_OTLP_ENDPOINT=http://localhost:4317
orbit --source /data --dest gs://mybucket/backup --recursive

5. Testing

# Test with small files first
orbit --source /small-test-file.txt --dest gs://mybucket/test/file.txt

# Verify transfer
orbit --source gs://mybucket/test/file.txt --dest /tmp/downloaded.txt
diff /small-test-file.txt /tmp/downloaded.txt

# Test authentication
gcloud auth application-default login
orbit --source /test.txt --dest gs://mybucket/test.txt

Troubleshooting

Common Issues

1. Authentication Failures

Error: No credentials found

Solutions:
1. Set GOOGLE_APPLICATION_CREDENTIALS:
   export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

2. Use gcloud authentication:
   gcloud auth application-default login

3. On GCP, ensure instance has service account attached

2. Permission Denied

Error: 403 Forbidden accessing gs://mybucket/file.txt

Solutions:
1. Check service account IAM roles:
   gcloud projects get-iam-policy PROJECT_ID \
     --flatten="bindings[].members" \
     --filter="bindings.members:serviceAccount:ACCOUNT_EMAIL"

2. Grant required roles:
   gcloud projects add-iam-policy-binding PROJECT_ID \
     --member="serviceAccount:ACCOUNT_EMAIL" \
     --role="roles/storage.objectAdmin"

3. Check bucket-level permissions:
   gsutil iam get gs://mybucket

3. Bucket Not Found

Error: 404 Not Found: Bucket 'mybucket' not found

Solutions:
1. Verify bucket exists:
   gsutil ls

2. Check project:
   gcloud config get-value project

3. Ensure service account has access to correct project

4. Slow Transfers

Problem: Transfer speed is slower than expected

Solutions:
1. Increase parallelism:
   orbit --source /data --dest gs://mybucket/backup --parallel 16

2. Check network bandwidth:
   iperf3 -c storage.googleapis.com

3. Use regional bucket close to your location

4. Consider GCS Transfer Service for very large datasets

5. Memory Issues

Problem: High memory usage during transfers

Solutions:
1. The GCS backend uses streaming - memory should be constant
2. If memory issues persist, check:
   - Orbit version (ensure v0.6.0+)
   - Other processes consuming memory
   - File descriptor limits (ulimit -n)

Comparison with Other Cloud Providers

Feature GCS Azure Blob S3
Streaming Upload ✅ Yes ✅ Yes ✅ Yes
Streaming Download ✅ Yes ✅ Yes ✅ Yes
Max Object Size 5TB 4.75TB 5TB
Consistency Strong Strong Strong
Prefix Support ✅ Yes ✅ Yes ✅ Yes
Versioning ✅ Yes ✅ Yes ✅ Yes
Lifecycle Policies ✅ Yes ✅ Yes ✅ Yes
Implementation object_store object_store aws-sdk-s3
Code Size ~620 lines ~540 lines ~800 lines

All three backends share the same async Backend trait, making it easy to switch between cloud providers.


Migration Guide

From S3 to GCS

# Download from S3
orbit --source s3://my-s3-bucket/data --dest /tmp/data --recursive

# Upload to GCS
orbit --source /tmp/data --dest gs://my-gcs-bucket/data --recursive

Or use streaming transfer (no local disk):

use orbit::backend::{Backend, S3Backend, GcsBackend};
use futures::StreamExt;

// Read from S3, write to GCS
let s3 = S3Backend::new(s3_config).await?;
let gcs = GcsBackend::new("gcs-bucket").await?;

let mut stream = s3.read(Path::new("data.txt")).await?;
// Stream data directly to GCS without buffering

From gsutil to Orbit

# gsutil
gsutil cp /local/file.txt gs://mybucket/file.txt
gsutil cp -r /local/dir gs://mybucket/dir

# Orbit equivalent
orbit --source /local/file.txt --dest gs://mybucket/file.txt
orbit --source /local/dir --dest gs://mybucket/dir --recursive

# gsutil sync
gsutil rsync -r /local/dir gs://mybucket/dir

# Orbit equivalent (with resume support)
orbit --source /local/dir --dest gs://mybucket/dir --mode sync --recursive --resume

Examples

Complete Backup Script

#!/bin/bash
set -e

# Configuration
SOURCE_DIR="/data/production"
BUCKET="my-backups"
PREFIX="production/$(date +%Y/%m/%d)"
CREDENTIALS="/etc/orbit/gcs-credentials.json"

# Set authentication
export GOOGLE_APPLICATION_CREDENTIALS="$CREDENTIALS"

# Perform backup with resume support
orbit \
  --source "$SOURCE_DIR" \
  --dest "gs://$BUCKET/$PREFIX" \
  --mode sync \
  --recursive \
  --resume \
  --parallel 8 \
  --progress \
  --exclude "*.tmp" \
  --exclude "*.log" \
  --telemetry-output "/var/log/orbit/backup-$(date +%Y%m%d).json"

echo "Backup completed successfully!"

Disaster Recovery

#!/bin/bash
set -e

# Download production backup
BACKUP_DATE="2025/12/19"
RESTORE_DIR="/mnt/restore"

orbit \
  --source "gs://my-backups/production/$BACKUP_DATE" \
  --dest "$RESTORE_DIR" \
  --recursive \
  --resume \
  --parallel 16 \
  --progress

echo "Restore completed to $RESTORE_DIR"

Continuous Sync

#!/bin/bash

# Watch directory and sync changes to GCS
SOURCE="/data/documents"
DEST="gs://my-bucket/documents"

while true; do
  orbit \
    --source "$SOURCE" \
    --dest "$DEST" \
    --mode sync \
    --recursive \
    --delta \
    --telemetry-output "/var/log/orbit/sync.json"

  sleep 300  # Sync every 5 minutes
done

Performance Benchmarks

Transfer Speed (us-central1, n2-standard-4)

File Size Upload Download Memory Usage
1MB 2 MB/s 5 MB/s ~50MB
100MB 50 MB/s 80 MB/s ~100MB
1GB 100 MB/s 120 MB/s ~150MB
10GB 110 MB/s 130 MB/s ~200MB
100GB 115 MB/s 135 MB/s ~200MB

Note: Actual speeds depend on network bandwidth, GCS region, and machine specs

Scalability (Listing Performance)

Object Count List Time Memory Usage
1,000 0.5s ~10MB
10,000 2.0s ~10MB
100,000 15s ~10MB
1,000,000 150s ~10MB

Streaming list API maintains constant memory usage


Additional Resources

Official Documentation

Orbit Documentation

Getting Help


Changelog

v0.6.0 (December 2025)

  • Initial GCS backend implementation
  • Full Backend trait support
  • Streaming I/O for all operations
  • Service account authentication
  • Both gs:// and gcs:// URI schemes
  • Production-ready using object_store crate

License

This guide is part of the Orbit project and follows the same license terms.