120 lines (96 loc) · 3.81 KB

06 - SQL Integration (SQL & ETL Pipelines for Reporting and Dashboards)

Home: README

Who this is for

Learners moving from file automation to data pipelines.
Teams with SQL databases as a reporting backbone.

What you will build

A Python ETL pipeline that ingests validated records into a SQL database.
Staging and reporting tables with idempotent loads.
A daily summary query output for dashboard use.

Prerequisites

Capstone A outputs from 05_AUTOMATION_FILES_EXCEL.md.
Basic SQL query familiarity.
DB credentials with insert/select access in target schema.

Step-by-step lab pack

Lab 1 - SQL fundamentals in strict order

SELECT and WHERE.
GROUP BY aggregates.
JOIN across lookup tables.
INSERT and UPDATE basics.
transactions with commit/rollback.

Lab 2 - Table design

Create tables:

staging_alerts
alerts_reporting

Minimum metadata fields:

source_file
ingested_at_utc
idempotency_key

Lab 3 - Python connection strategy

Preferred start: sqlite3 (built-in, no driver needed). Optional scaling path: SQLAlchemy (supports SQLite, PostgreSQL, and more).

Connection requirements:

no hardcoded secrets,
explicit timeout,
retry for transient failures,
structured error logging.

Lab 4 - Load strategy

Load raw validated rows into staging_alerts.
Promote clean rows into alerts_reporting.
Use idempotency_key to prevent duplicates.

Lab 5 - Daily summary output

Generate query output by:

date,
severity,
customer/site.

Export summary to output/daily_summary.csv for dashboard consumption.

Lab 6 - Custom reporting backend integration

For an existing reporting database:

identify existing table contracts,
map your ETL output to existing schema,
avoid direct writes to unmanaged tables until schema ownership is clear.

Lab 7 - Optional PostgreSQL extension

If you scale beyond SQLite:

ingest to staging only first,
normalize data types to PostgreSQL-compatible forms,
preserve source system metadata.

Expected output

Repeatable ETL job with clean staging-to-reporting flow.
No duplicate records on reruns.
Usable CSV summary artifacts for dashboards.

Break/fix drills

Force duplicate ingest and prove idempotency key blocks duplicates.
Simulate DB timeout and confirm retries/logging.
Insert malformed rows in staging and verify promotion filter blocks them.

Troubleshooting

connection failures:
- confirm driver installation,
- validate host, db name, user,
- test least-privilege account manually.
duplicate rows:
- inspect key generation and unique constraint.
poor query performance:
- add indexes on date, severity, idempotency key.

Mastery check

You are ready for API integration when you can:

explain your table contract,
rerun ETL safely,
recover from transient DB failures,
produce daily summaries without manual edits.

Learning-style options (Play/Build/Dissect/Teach-back)

Play: test different idempotency key designs.
Build: implement full staging -> reporting pipeline.
Dissect: explain query plans and table role boundaries.
Teach-back: present ETL data flow and failure strategy.

Primary Sources

Optional Resources

Sample database schemas

Full schema pack for staging/reporting/cache/marts:
- 13_SAMPLE_DATABASE_SCHEMAS.md

Next

Next: projects/level-4/README.md →