A demonstration of local data transformation and reverse ETL pipelines using DuckDB and dbt with the dbt-duckdb adapter. This project showcases how to build a complete data pipeline that processes Dutch railway services and geographic data into a structured data mart.
📖 Based on: Fully Local Data Transformation with dbt and DuckDB - DuckDB Blog Post
- Fully Local Processing: No cloud dependencies - all data transformation happens locally
- Spatial Data Integration: Combines railway data with geographic information using DuckDB's spatial extensions
- Modern Data Stack: Uses dbt for transformation
- Multiple Export Options: Export to files (CSV, JSON, Parquet) or directly to PostgreSQL
- Dimensional Modeling: Implements a star schema with fact and dimension tables
- Incremental Processing: Supports both full refresh and incremental data loads
- Data Visualization: Metabase for ad-hoc analysis
The project creates a dimensional model with:
- dim_nl_provinces - Dutch provinces dimension table
- dim_nl_municipalities - Dutch municipalities dimension table (linked to provinces)
- dim_nl_train_stations - Train stations dimension table (linked to municipalities)
- fact_services - Train services fact table (linked to stations)
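As a sketch of how the star schema chains together, a query like the following walks from the fact table up through all three dimensions. The surrogate-key and name columns are assumptions for illustration (only station_sk appears in the project's configs); adjust them to the actual model schemas.

```sql
-- Count services per province by walking the star schema.
-- Column names other than station_sk are illustrative assumptions.
SELECT
    p.province_name,
    COUNT(*) AS service_count
FROM fact_services f
JOIN dim_nl_train_stations s ON f.station_sk = s.station_sk
JOIN dim_nl_municipalities m ON s.municipality_sk = m.municipality_sk
JOIN dim_nl_provinces p      ON m.province_sk = p.province_sk
GROUP BY p.province_name
ORDER BY service_count DESC;
```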
- DuckDB - In-process analytical database
- dbt - Data transformation framework
- dbt-duckdb - DuckDB adapter for dbt
- Metabase - Business intelligence tool
- Spatial Extensions - For geographic data processing
- PostgreSQL - Optional target for reverse ETL
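To give a flavor of the spatial processing, DuckDB's spatial extension supports point-in-polygon joins such as linking a station's coordinates to the municipality boundary that contains it. This is a minimal sketch; the table and column names are assumptions, not the project's actual schemas.

```sql
-- Illustrative only: table/column names are assumed.
INSTALL spatial;
LOAD spatial;

-- Assign each station to the municipality whose polygon contains it.
SELECT s.station_name, m.municipality_name
FROM stations s
JOIN municipalities m
  ON ST_Contains(m.geometry, ST_Point(s.longitude, s.latitude));
```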
Make sure you have:
- uv: Python package and project manager (instructions)
- Docker or another container runtime (e.g., Colima on macOS)
- DBeaver or pgAdmin (optional): a database tool for managing your database
make install

In a terminal, we will spin up Postgres and Metabase:
# Open a terminal on the root folder of the project
make docker-up

Now, in the same terminal:
# Export environment variables in the shell so dbt can use them
set -o allexport && source .env && set +o allexport
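The .env file holds the connection settings dbt and the containers read. The variable names below are illustrative assumptions; see example.env in the repository for the actual keys the Makefile and dbt profile expect.

```shell
# Illustrative .env - variable names are assumptions, not the
# project's actual keys; copy example.env for the real ones.
POSTGRES_USER=postgres
POSTGRES_PASSWORD=changeme
POSTGRES_DB=railway
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
```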
### Change directory
cd railway_analytics
# Run all models
dbt run
# Run with tests
dbt run && dbt test
# Run specific model groups
dbt run --select "+transformation"
dbt run --select "+reverse_etl"
dbt run --select "+exports"
.
├── docker
├── example.env
├── Makefile
├── pyproject.toml
├── railway_analytics
│ ├── analyses
│ ├── data
│ │ ├── exports
│ │ └── railway_analytics.duckdb
│ ├── dbt_project.yml
│ ├── macros
│ ├── models
│ │ ├── exports
│ │ ├── reverse_etl
│ │ ├── sources.yml
│ │ └── transformation
│ ├── package-lock.yml
│ ├── packages.yml
│ ├── profiles.yml
│ ├── README.md
│ ├── seeds
│ ├── snapshots
│ └── tests
├── README.md
└── uv.lock
The project uses multiple data sources:
- External DuckDB Database: Railway services data hosted on Cloudflare
- Remote GeoJSON: Dutch provinces data from cartomap
- Local GeoJSON: Dutch municipalities data included in the project
- table: Full refresh replacement
- incremental: Append or delete+insert strategies
- external: Export to external files (CSV, JSON, Parquet)
- view: Creates database views
-- Full refresh table
{{ config(materialized='table') }}
-- Incremental model
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='service_date, station_sk'
) }}
-- External file export
{{ config(
    materialized='external',
    location='data/exports/output',
    options={
        "partition_by": "year, month",
        "overwrite": True
    }
) }}

To enable reverse ETL to PostgreSQL:
- Add PostgreSQL extension to your profile:
extensions:
  - spatial
  - httpfs
  - postgres

- Configure database attachment:
attach:
  - path: "postgresql://user:password@localhost:5432/dbname"
    type: postgres
    alias: internal_db

- Run reverse ETL models:
dbt run --select "+reverse_etl"

The project includes comprehensive data quality tests:
# Run all tests
dbt test
# Run tests for specific models
dbt test --select dim_nl_provinces
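Generic tests like these are typically declared in a schema file alongside the models. The file path and column name below are assumptions for illustration, not the project's actual test definitions.

```yaml
# models/transformation/schema.yml (illustrative; column name assumed)
version: 2

models:
  - name: dim_nl_provinces
    columns:
      - name: province_sk
        tests:
          - not_null
          - unique
```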
Performance characteristics on a MacBook Pro (12GB RAM):
- Total execution time: 40-45 seconds
- Data processed: 400MB+ from external sources
- Models: 10 transformation models
- Tests: 20 data quality tests
- Processing: ~30 seconds for main transformations
- PostgreSQL export: ~4 seconds
This project demonstrates:
- Local data lake/warehouse creation
- Spatial data processing and analysis
- Modern data transformation patterns
- Multi-format data export capabilities
- Reverse ETL to operational databases
- Data quality testing and validation