# Quick Start

This guide will help you get the ETL pipeline running in 5 minutes.

## Prerequisites
- Docker and Docker Compose installed
- Java 11 or later (for local development)
- SBT 1.9.7 (for building the application)
## Clone the Repository

```bash
git clone https://github.com/akreit/async2databricks.git
cd async2databricks
```

## Start the Infrastructure

Start PostgreSQL and LocalStack (S3 emulator):
```bash
docker compose up -d
```

Wait about 15 seconds for the services to be healthy:

```bash
docker compose ps
```

You should see both `etl-postgres` and `etl-localstack` as healthy.
## Verify the Setup

Run the integration test script:

```bash
./docker/integration-test.sh
```

This will verify:
- Docker services are running
- Database has 10 sample records
- S3 bucket exists
## Build the Application

Compile the project:

```bash
sbt compile
```

Or build a fat JAR:

```bash
sbt assembly
```

The JAR will be at `target/scala-2.13/async2databricks-assembly-0.1.0.jar`.
## Run the Pipeline

**Option A: Using SBT**

```bash
sbt run
```

**Option B: Using the JAR**

```bash
java -jar target/scala-2.13/async2databricks-assembly-0.1.0.jar
```

## Verify the Output

Check that data was written to S3:

```bash
docker exec etl-localstack awslocal s3 ls s3://etl-output-bucket/data/parquet/ --recursive
```

You should see a `.parquet` file with a timestamp in its name.
Download and inspect the file (optional):

```bash
docker exec etl-localstack awslocal s3 cp s3://etl-output-bucket/data/parquet/<filename>.parquet /tmp/output.parquet
docker cp etl-localstack:/tmp/output.parquet ./output.parquet
```

## Customizing for Your Data

Edit `src/main/scala/com/async2databricks/model/SampleData.scala` to match your database schema.
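As a sketch, a customized model might look like the following (the field names and types here are illustrative only, not the project's actual schema):

```scala
// Hypothetical replacement for SampleData; adjust the fields to mirror
// your table's columns and their SQL types.
final case class SampleData(
  id: Long,
  name: String,
  createdAt: java.time.Instant
)
```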
Modify the query in `src/main/resources/application.conf`:

```hocon
etl {
  batch-size = 1000
  query = "SELECT * FROM your_table WHERE ..."
}
```

Update database credentials in `src/main/resources/application.conf`:
```hocon
database {
  url = "jdbc:postgresql://your-host:5432/your-database"
  user = "your-username"
  password = "your-password"
}
```

See the main README.md for detailed AWS deployment instructions.
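For reference, configuration sections like the above are typically loaded with PureConfig along these lines (a sketch only; the project's actual case class names, and the exact PureConfig derivation API for your version, may differ):

```scala
import pureconfig._
import pureconfig.generic.auto._  // automatic derivation (Scala 2 syntax)

// Hypothetical config classes mirroring the keys in application.conf.
final case class DatabaseConfig(url: String, user: String, password: String)
final case class EtlConfig(batchSize: Int, query: String)

// Load the "database" and "etl" sections, failing fast on invalid config.
val dbConfig  = ConfigSource.default.at("database").loadOrThrow[DatabaseConfig]
val etlConfig = ConfigSource.default.at("etl").loadOrThrow[EtlConfig]
```

PureConfig maps kebab-case keys such as `batch-size` to camelCase fields such as `batchSize` by default, which is why the names above line up.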
## Troubleshooting

Check the service logs:

```bash
docker compose logs postgres
docker compose logs localstack
```

Make sure services are healthy:

```bash
docker compose ps
```

Both should show `(healthy)` status.
If the pipeline runs out of memory, increase the heap size:

```bash
java -Xmx4g -jar target/scala-2.13/async2databricks-assembly-0.1.0.jar
```

To start fresh:

```bash
docker compose down -v
docker compose up -d
```

This removes volumes and recreates everything.
## Running Tests

```bash
sbt test
```

## Cleanup

Stop and remove containers:

```bash
docker compose down
```

Remove volumes too:
```bash
docker compose down -v
```

## Architecture

```text
┌─────────────┐          ┌──────────────┐          ┌──────────────┐
│ PostgreSQL  │ ──────>  │ ETL Pipeline │ ──────>  │      S3      │
│  (Source)   │  Doobie  │   (Scala)    │ Parquet  │ (Destination)│
└─────────────┘          └──────┬───────┘          └──────────────┘
                                │
                           FS2 Stream
                          Cats Effect
```
The pipeline:
- Connects to PostgreSQL using Doobie
- Streams data efficiently using FS2
- Batches records for optimal performance
- Writes to S3 in Parquet format using Parquet4s
- Configuration managed by PureConfig
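Putting those pieces together, the streaming shape looks roughly like this. This is a minimal sketch, not the project's actual code: the connection details, table, row type, and `writeBatch` helper are placeholders, the real pipeline writes Parquet to S3 via Parquet4s instead of printing, and API details (e.g. `Transactor.fromDriverManager`) vary with the Doobie version:

```scala
import cats.effect.{IO, IOApp}
import doobie._
import doobie.implicits._

object PipelineSketch extends IOApp.Simple {

  // Transactor: Doobie's handle on the JDBC connection (placeholder creds).
  val xa = Transactor.fromDriverManager[IO](
    "org.postgresql.Driver",
    "jdbc:postgresql://localhost:5432/etl_db",
    "etl_user",
    "etl_password"
  )

  // Hypothetical stand-in for the Parquet4s sink that uploads a batch to S3.
  def writeBatch(batch: List[(Long, String)]): IO[Unit] =
    IO.println(s"writing batch of ${batch.size} records")

  val run: IO[Unit] =
    sql"SELECT id, payload FROM sample_data"
      .query[(Long, String)]
      .stream            // fs2.Stream of rows, fetched lazily from PostgreSQL
      .transact(xa)
      .chunkN(1000)      // batch-size, as configured in application.conf
      .evalMap(chunk => writeBatch(chunk.toList))
      .compile
      .drain
}
```

The key design point is that `.stream` never materializes the full result set: rows flow through in constant memory, and `chunkN` regroups them into fixed-size batches for the sink.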
## Configuration

All configuration is in `src/main/resources/application.conf`. You can override values using:

**System Properties:**

```bash
sbt run -Ddatabase.url=jdbc:postgresql://newhost:5432/db
```

**Environment Variables:**
```bash
export DATABASE_URL=jdbc:postgresql://newhost:5432/db
sbt run
```

## Next Steps

- Read the full README.md for deployment guides
- Explore the code in `src/main/scala/com/async2databricks/`
- Customize for your use case
- Deploy to AWS