leif-erickson/etl-with-python

Overview

Explanation of Key Concepts

Hashing

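Business keys are hashed to generate surrogate keys (e.g., customer_hash). Because a hash depends only on the business key, the surrogate key is independent of any source system. A link's key is built the same way, by hashing the combined business keys of the hubs it connects.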
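A minimal sketch of this idea follows. The function name `hash_key`, the `||` delimiter, the trim-and-uppercase normalization, and the choice of SHA-256 are all assumptions made for illustration; the repository's `etl.py` may use a different algorithm or normalization.

```python
import hashlib


def hash_key(*business_keys, delimiter="||"):
    """Return a deterministic hex digest for one or more business keys.

    Hypothetical helper: normalization (strip + upper) and SHA-256 are
    assumptions, not necessarily what etl.py does.
    """
    normalized = [str(key).strip().upper() for key in business_keys]
    return hashlib.sha256(delimiter.join(normalized).encode("utf-8")).hexdigest()


# Hub key: one business key in, one surrogate key out.
customer_hash = hash_key("C001")

# Link key: a composite of the business keys of the connected hubs.
customer_order_hash = hash_key("C001", "O-1001")
```

The delimiter keeps composites unambiguous: without it, the key pairs ("AB", "C") and ("A", "BC") would concatenate to the same string and collide.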

Data Vault 2.0

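Hubs store raw business keys and nothing else; they carry no descriptive attributes. Satellites hold the attributes that can change, and each row is timestamped so changes can be audited over time. Links model relationships between hubs, including many-to-many relationships, without duplicating hub or satellite data. The design scales well because new satellites can be added as attributes evolve, without altering existing tables.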
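As a rough illustration of how these three table types fit together, the sketch below creates one possible schema in an in-memory SQLite database. All table and column names (`hub_customer`, `hub_order`, `sat_customer`, `link_customer_order`, `load_ts`, `record_source`) are hypothetical and may differ from the schema in `etl.py`.

```python
import sqlite3

# Hypothetical schema: names and columns are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS hub_customer (
    customer_hash TEXT PRIMARY KEY,   -- surrogate key (hash of business key)
    customer_id   TEXT NOT NULL,      -- raw business key, no other attributes
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS hub_order (
    order_hash    TEXT PRIMARY KEY,
    order_id      TEXT NOT NULL,
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS sat_customer (
    customer_hash TEXT NOT NULL REFERENCES hub_customer (customer_hash),
    load_ts       TEXT NOT NULL,      -- one row per load, so history is kept
    name          TEXT,
    email         TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hash, load_ts)
);
CREATE TABLE IF NOT EXISTS link_customer_order (
    link_hash     TEXT PRIMARY KEY,   -- hash of both business keys combined
    customer_hash TEXT NOT NULL REFERENCES hub_customer (customer_hash),
    order_hash    TEXT NOT NULL REFERENCES hub_order (order_hash),
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

Adding a new group of customer attributes later would mean creating another satellite that references `hub_customer`, leaving the existing tables unchanged.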

ETL Flow

Extract the data with pandas, transform it by hashing and deduplicating, and load it with SQL inserts. Hubs and links use INSERT OR IGNORE so that a key that already exists is not duplicated. Satellites always use a plain INSERT so that every load adds a row and history is preserved.
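The load behaviour described above can be sketched with the standard-library `sqlite3` module. To keep the example dependency-free, plain dictionaries stand in for the rows of a pandas DataFrame (for instance, the output of `df.to_dict("records")`). The function `load_customers`, the simplified two-table schema, and the sample data are assumptions, not the repository's actual code.

```python
import hashlib
import sqlite3


def hash_key(*parts):
    """Hypothetical helper: SHA-256 over normalized, delimited business keys."""
    joined = "||".join(str(part).strip().upper() for part in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def load_customers(conn, rows, load_ts):
    """Load one batch of customer rows into a hub and its satellite."""
    # Transform: drop exact duplicate rows within the batch.
    unique_rows = list({tuple(sorted(row.items())): row for row in rows}.values())
    for row in unique_rows:
        key = hash_key(row["customer_id"])
        # Hub: OR IGNORE skips the insert when the key already exists.
        conn.execute(
            "INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?)",
            (key, row["customer_id"], load_ts),
        )
        # Satellite: always insert, so each load adds a history row.
        conn.execute(
            "INSERT INTO sat_customer VALUES (?, ?, ?)",
            (key, load_ts, row["email"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE hub_customer (customer_hash TEXT PRIMARY KEY, customer_id TEXT, load_ts TEXT);
    CREATE TABLE sat_customer (customer_hash TEXT, load_ts TEXT, email TEXT);
    """
)

# Made-up sample data: the first batch contains an exact duplicate row.
batch_1 = [
    {"customer_id": "C001", "email": "a@example.com"},
    {"customer_id": "C001", "email": "a@example.com"},
]
batch_2 = [{"customer_id": "C001", "email": "new@example.com"}]

load_customers(conn, batch_1, "2024-01-01T00:00:00")
load_customers(conn, batch_2, "2024-01-02T00:00:00")

hub_count = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
sat_count = conn.execute("SELECT COUNT(*) FROM sat_customer").fetchone()[0]
```

After both loads the hub still holds a single row for C001, while the satellite holds one row per load, recording the change of email address.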

This is a minimal implementation. For production, use a full database server such as PostgreSQL, add error handling, and implement incremental loads (e.g., via staging tables). Add more hubs, links, and satellites as new entities are needed.

TL;DR

pip install -r requirements.txt
python etl.py
