Quick Start Guide

This guide will help you get the ETL pipeline running in 5 minutes.

Prerequisites

  • Docker and Docker Compose installed
  • Java 11 or later (for local development)
  • SBT 1.9.7 (for building the application)
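
You can confirm the tools are installed and on your PATH before continuing:

docker --version
docker compose version
java -version
sbt --version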

Quick Start

1. Clone the Repository

git clone https://github.com/akreit/async2databricks.git
cd async2databricks

2. Start Infrastructure

Start PostgreSQL and LocalStack (S3 emulator):

docker compose up -d

Wait about 15 seconds for the services to become healthy, then check their status:

docker compose ps

You should see both etl-postgres and etl-localstack as healthy.

3. Verify Setup

Run the integration test script:

./docker/integration-test.sh

This will verify:

  • Docker services are running
  • Database has 10 sample records
  • S3 bucket exists
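
If you prefer to spot-check these by hand, the commands below approximate what the script does. The postgres user/database and the sample_data table names are assumptions for illustration; adjust them to whatever the repo's Docker init scripts actually create.

docker compose ps
docker exec etl-postgres psql -U postgres -d postgres -c "SELECT count(*) FROM sample_data;"
docker exec etl-localstack awslocal s3 ls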

4. Build the Application

sbt compile

Or build a fat JAR:

sbt assembly

The JAR will be at target/scala-2.13/async2databricks-assembly-0.1.0.jar.

5. Run the ETL Pipeline

Option A: Using SBT

sbt run

Option B: Using the JAR

java -jar target/scala-2.13/async2databricks-assembly-0.1.0.jar
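
Configuration overrides (see the Configuration section below) can be passed as JVM system properties here too:

java -Ddatabase.url=jdbc:postgresql://newhost:5432/db -jar target/scala-2.13/async2databricks-assembly-0.1.0.jar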

6. Verify Results

Check that data was written to S3:

docker exec etl-localstack awslocal s3 ls s3://etl-output-bucket/data/parquet/ --recursive

You should see a .parquet file with a timestamp in its name.

Download and inspect the file (optional):

docker exec etl-localstack awslocal s3 cp s3://etl-output-bucket/data/parquet/<filename>.parquet /tmp/output.parquet
docker cp etl-localstack:/tmp/output.parquet ./output.parquet
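
To inspect the rows programmatically, here is a minimal Scala sketch using Parquet4s, which is already on the project's classpath. The SampleData shape below is a placeholder; substitute the real case class from com.async2databricks.model.

import com.github.mjakubowski84.parquet4s.{ParquetReader, Path}

// Placeholder record shape -- replace with the project's SampleData.
case class SampleData(id: Long, name: String)

object InspectParquet extends App {
  // read() returns a closeable iterable over the decoded rows
  val rows = ParquetReader.as[SampleData].read(Path("output.parquet"))
  try rows.foreach(println)
  finally rows.close()
}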

What's Next?

Customize the Data Model

Edit src/main/scala/com/async2databricks/model/SampleData.scala to match your database schema.
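
For example, a reshaped SampleData might look like the sketch below; the fields are purely illustrative and should mirror your table's columns.

package com.async2databricks.model

import java.time.Instant

// Illustrative fields only -- match names and types to your schema.
case class SampleData(
  id: Long,
  customerEmail: String,
  amountCents: Long,
  createdAt: Instant
)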

Update the Query

Modify the query in src/main/resources/application.conf:

etl {
  batch-size = 1000
  query = "SELECT * FROM your_table WHERE ..."
}
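
For instance, a hypothetical incremental extraction of the last day's orders (the orders table and its columns are invented for illustration):

etl {
  batch-size = 5000
  query = "SELECT id, customer_email, amount_cents, created_at FROM orders WHERE created_at > now() - interval '1 day'"
}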

Connect to Your Database

Update database credentials in src/main/resources/application.conf:

database {
  url = "jdbc:postgresql://your-host:5432/your-database"
  user = "your-username"
  password = "your-password"
}

Deploy to AWS

See the main README.md for detailed AWS deployment instructions.

Troubleshooting

Services Not Starting

Check the container logs:

docker compose logs postgres
docker compose logs localstack

Connection Refused

Make sure services are healthy:

docker compose ps

Both should show a (healthy) status.

Out of Memory

Increase the JVM heap size:

java -Xmx4g -jar target/scala-2.13/async2databricks-assembly-0.1.0.jar

If memory pressure persists, lowering etl.batch-size in application.conf also reduces the per-batch footprint.

Clean Start

To start fresh:

docker compose down -v
docker compose up -d

This removes the volumes (including the database data) and recreates everything from scratch.

Running Tests

sbt test

Cleaning Up

Stop and remove containers:

docker compose down

Remove volumes too:

docker compose down -v

Architecture Overview

┌─────────────┐         ┌──────────────┐         ┌─────────────┐
│  PostgreSQL │ ──────> │ ETL Pipeline │ ──────> │     S3      │
│  (Source)   │  Doobie │   (Scala)    │ Parquet │(Destination)│
└─────────────┘         └──────────────┘         └─────────────┘
      │                        │
      │                    FS2 Stream
      │                    Cats Effect
      └────────────────────────┘

The pipeline:

  1. Connects to PostgreSQL using Doobie
  2. Streams data efficiently using FS2
  3. Batches records for optimal performance
  4. Writes to S3 in Parquet format using Parquet4s
  5. Loads its configuration via PureConfig
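
As a rough sketch of that shape (connection details, the record type, and the write step are stand-ins; the real implementation lives under src/main/scala/com/async2databricks/, and newer doobie versions add a logHandler argument to fromDriverManager):

import cats.effect.{IO, IOApp}
import doobie._
import doobie.implicits._

object PipelineSketch extends IOApp.Simple {

  case class SampleData(id: Long, name: String) // placeholder shape

  // Stand-in transactor; the real one is built from application.conf.
  val xa = Transactor.fromDriverManager[IO](
    "org.postgresql.Driver",
    "jdbc:postgresql://localhost:5432/postgres",
    "postgres",
    "postgres"
  )

  // Stand-in for the Parquet4s write to S3.
  def writeBatch(batch: List[SampleData]): IO[Unit] =
    IO.println(s"writing ${batch.size} records")

  val run: IO[Unit] =
    sql"SELECT id, name FROM sample_data"
      .query[SampleData]
      .stream                             // fs2.Stream[ConnectionIO, SampleData]
      .transact(xa)                       // Stream[IO, SampleData]
      .chunkN(1000)                       // batch records (cf. etl.batch-size)
      .evalMap(c => writeBatch(c.toList)) // write each batch
      .compile
      .drain
}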

Configuration

All configuration is in src/main/resources/application.conf. You can override values using:

System Properties:

sbt run -Ddatabase.url=jdbc:postgresql://newhost:5432/db

Environment Variables:

export DATABASE_URL=jdbc:postgresql://newhost:5432/db
sbt run
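
Note that the environment variable only takes effect if application.conf declares an optional HOCON substitution for that key, along these lines:

database {
  url = "jdbc:postgresql://localhost:5432/postgres"
  url = ${?DATABASE_URL}   # overrides the default when DATABASE_URL is set
}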

Next Steps

  • Read the full README.md for deployment guides
  • Explore the code in src/main/scala/com/async2databricks/
  • Customize for your use case
  • Deploy to AWS