- Learners moving from file automation to data pipelines.
- Teams with SQL databases as a reporting backbone.
- A Python ETL pipeline that ingests validated records into a SQL database.
- Staging and reporting tables with idempotent loads.
- A daily summary query output for dashboard use.
- Capstone A outputs from 05_AUTOMATION_FILES_EXCEL.md.
- Basic SQL query familiarity.
- DB credentials with insert/select access in target schema.
- `SELECT` and `WHERE`.
- `GROUP BY` aggregates.
- `JOIN` across lookup tables.
- `INSERT` and `UPDATE` basics.
- transactions with commit/rollback.
Create tables:
- `staging_alerts`
- `alerts_reporting`
Minimum metadata fields:
- `source_file`
- `ingested_at_utc`
- `idempotency_key`
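A minimal sketch of the two tables in `sqlite3`. Only the table names and the three metadata fields come from this guide; the business columns (`alert_date`, `severity`, `customer`, `message`) and the `create_tables` helper are assumptions for illustration, so adjust them to your Capstone A output.

```python
import sqlite3

# Assumed layout: staging keeps business columns nullable so malformed rows
# can land there; reporting enforces NOT NULL and a unique idempotency key.
SCHEMA = """
CREATE TABLE IF NOT EXISTS staging_alerts (
    id              INTEGER PRIMARY KEY,
    alert_date      TEXT,
    severity        TEXT,
    customer        TEXT,
    message         TEXT,
    source_file     TEXT NOT NULL,
    ingested_at_utc TEXT NOT NULL,
    idempotency_key TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS alerts_reporting (
    id              INTEGER PRIMARY KEY,
    alert_date      TEXT NOT NULL,
    severity        TEXT NOT NULL,
    customer        TEXT NOT NULL,
    message         TEXT,
    source_file     TEXT NOT NULL,
    ingested_at_utc TEXT NOT NULL,
    idempotency_key TEXT NOT NULL UNIQUE
);
"""


def create_tables(conn: sqlite3.Connection) -> None:
    """Create both tables; safe to call on every run (IF NOT EXISTS)."""
    conn.executescript(SCHEMA)
    conn.commit()
```

Calling `create_tables(sqlite3.connect("alerts.db"))` at the start of each run keeps the job rerunnable without a separate migration step.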
Preferred start: sqlite3 (built-in, no driver needed).
Optional scaling path: SQLAlchemy (supports SQLite, PostgreSQL, and more).
Connection requirements:
- no hardcoded secrets,
- explicit timeout,
- retry for transient failures,
- structured error logging.
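The four requirements above can be sketched with the standard library alone. The environment variable name `ETL_DB_PATH`, the default file name, the logger name, and the retry parameters are all assumptions; with a server database the same pattern would read host and credentials from the environment instead of a file path.

```python
import logging
import os
import sqlite3
import time

logger = logging.getLogger("etl.db")  # hypothetical logger name


def connect_with_retry(db_path=None, timeout_s=5.0, attempts=3, backoff_s=0.5):
    """Open a SQLite connection with an explicit timeout and bounded retries."""
    # Configuration comes from the environment, not from literals in the code.
    path = db_path or os.environ.get("ETL_DB_PATH", "alerts.db")
    for attempt in range(1, attempts + 1):
        try:
            # timeout= is how long SQLite waits on a locked database.
            conn = sqlite3.connect(path, timeout=timeout_s)
            try:
                conn.execute("SELECT 1")  # cheap check that the connection works
            except sqlite3.OperationalError:
                conn.close()
                raise
            return conn
        except sqlite3.OperationalError as exc:
            # Structured fields travel in `extra` so a JSON formatter can emit them.
            logger.error(
                "db_connect_failed",
                extra={"attempt": attempt, "max_attempts": attempts, "error": str(exc)},
            )
            if attempt == attempts:
                raise
            time.sleep(backoff_s * 2 ** (attempt - 1))  # exponential backoff
```

Only `sqlite3.OperationalError` is retried here, on the assumption that it covers the transient cases (locked or temporarily unavailable database); other exceptions propagate immediately.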
- Load raw validated rows into `staging_alerts`.
- Promote clean rows into `alerts_reporting`.
- Use `idempotency_key` to prevent duplicates.
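One way to sketch the load and promote steps with `sqlite3`. The business columns, the choice of SHA-256 over selected fields as the key, and the helper names `make_key` and `load_and_promote` are assumptions for illustration; the key should be built from whatever fields define "the same alert" in your data.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

COLUMNS = "alert_date, severity, customer, message, source_file, ingested_at_utc, idempotency_key"

# Assumed schema, repeated here so the sketch runs on its own.
DDL = """
CREATE TABLE IF NOT EXISTS staging_alerts (
    alert_date TEXT, severity TEXT, customer TEXT, message TEXT,
    source_file TEXT NOT NULL, ingested_at_utc TEXT NOT NULL,
    idempotency_key TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS alerts_reporting (
    alert_date TEXT NOT NULL, severity TEXT NOT NULL, customer TEXT NOT NULL, message TEXT,
    source_file TEXT NOT NULL, ingested_at_utc TEXT NOT NULL,
    idempotency_key TEXT NOT NULL UNIQUE
);
"""


def make_key(row):
    """Deterministic key: the same input row always hashes to the same value."""
    parts = [str(row.get(name) or "") for name in ("alert_date", "severity", "customer", "message")]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def load_and_promote(conn, rows, source_file):
    """Load rows into staging, promote complete rows, return the reporting row count."""
    now = datetime.now(timezone.utc).isoformat()
    params = [
        (r.get("alert_date"), r.get("severity"), r.get("customer"), r.get("message"),
         source_file, now, make_key(r))
        for r in rows
    ]
    with conn:  # one transaction: commit on success, rollback on any exception
        # OR IGNORE skips rows whose idempotency_key already exists.
        conn.executemany(
            f"INSERT OR IGNORE INTO staging_alerts ({COLUMNS}) VALUES (?, ?, ?, ?, ?, ?, ?)",
            params,
        )
        # Promotion filter: only rows with all required fields reach reporting.
        conn.execute(f"""
            INSERT OR IGNORE INTO alerts_reporting ({COLUMNS})
            SELECT {COLUMNS} FROM staging_alerts
            WHERE alert_date IS NOT NULL AND severity IS NOT NULL AND customer IS NOT NULL
        """)
    return conn.execute("SELECT COUNT(*) FROM alerts_reporting").fetchone()[0]
```

Running `load_and_promote` twice with the same rows should leave both tables unchanged on the second call, because the `UNIQUE` constraint plus `INSERT OR IGNORE` drops repeated keys.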
Generate query output by:
- date,
- severity,
- customer/site.
Export summary to output/daily_summary.csv for dashboard consumption.
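A possible shape for the summary query and export, assuming `alerts_reporting` has `alert_date`, `severity`, and `customer` columns (these column names and the `export_daily_summary` helper are illustrative, not defined by this guide).

```python
import csv
import sqlite3
from pathlib import Path

# Assumed column names; one output row per (date, severity, customer) group.
SUMMARY_SQL = """
SELECT alert_date, severity, customer, COUNT(*) AS alert_count
FROM alerts_reporting
GROUP BY alert_date, severity, customer
ORDER BY alert_date, severity, customer
"""


def export_daily_summary(conn, out_path="output/daily_summary.csv"):
    """Write the grouped summary to CSV and return the number of data rows."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create output/ if missing
    cur = conn.execute(SUMMARY_SQL)
    header = [col[0] for col in cur.description]  # column names from the query
    rows = cur.fetchall()
    # newline="" lets the csv module control line endings on every platform.
    with out.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

The explicit `ORDER BY` keeps the file stable between runs, which makes diffs and dashboard refreshes easier to reason about.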
For an existing reporting database:
- identify existing table contracts,
- map your ETL output to existing schema,
- avoid direct writes to unmanaged tables until schema ownership is clear.
If you scale beyond SQLite:
- ingest to staging only first,
- normalize data types to PostgreSQL-compatible forms,
- preserve source system metadata.
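SQLite stores dates and timestamps as text, while PostgreSQL has native `DATE` and `TIMESTAMPTZ` types, so normalizing usually means parsing strings into typed values before the insert. A rough sketch, assuming rows are dicts with ISO 8601 strings in `alert_date` and `ingested_at_utc` (the `normalize_row` helper and the treatment of naive timestamps as UTC are assumptions):

```python
from datetime import date, datetime, timezone


def normalize_row(row: dict) -> dict:
    """Return a copy of the row with text dates converted to typed values."""
    # fromisoformat() on older Python versions rejects a trailing "Z".
    ts_text = str(row["ingested_at_utc"]).replace("Z", "+00:00")
    ts = datetime.fromisoformat(ts_text)
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return {
        **row,  # keeps source_file and any other source metadata untouched
        "alert_date": date.fromisoformat(row["alert_date"]),
        "ingested_at_utc": ts.astimezone(timezone.utc),
    }
```

Most PostgreSQL drivers accept `date` and timezone-aware `datetime` objects directly as query parameters, so no string formatting is needed after this step.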
- Repeatable ETL job with clean staging-to-reporting flow.
- No duplicate records on reruns.
- Usable CSV summary artifacts for dashboards.
- Force duplicate ingest and prove idempotency key blocks duplicates.
- Simulate DB timeout and confirm retries/logging.
- Insert malformed rows in staging and verify promotion filter blocks them.
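The timeout drill can be run without a real outage by wrapping the database call in a retry helper and feeding it a test double that fails a fixed number of times. Both `run_with_retry` and `FlakyOperation` are hypothetical names for this sketch, and `"database is locked"` stands in for whatever transient error your driver raises.

```python
import logging
import sqlite3
import time

logger = logging.getLogger("etl.retry")  # hypothetical logger name


def run_with_retry(operation, attempts=3, backoff_s=0.1):
    """Call operation(); retry on OperationalError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            logger.warning(
                "transient_db_error attempt=%d/%d error=%s", attempt, attempts, exc
            )
            if attempt == attempts:
                raise  # retries exhausted: surface the failure
            time.sleep(backoff_s)


class FlakyOperation:
    """Test double: raises a lock/timeout error `failures` times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise sqlite3.OperationalError("database is locked")
        return "ok"
```

With `FlakyOperation(2)` and three attempts the call should succeed on the third try; with `FlakyOperation(5)` it should raise after three logged warnings, which is the evidence the drill asks for.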
- connection failures:
  - confirm driver installation,
  - validate host, db name, user,
  - test least-privilege account manually.
- duplicate rows:
  - inspect key generation and unique constraint.
- poor query performance:
  - add indexes on date, severity, idempotency key.
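For the performance case, a small sketch of adding the indexes and checking that SQLite actually uses them. The column name `alert_date`, the index names, and the helpers `add_indexes` and `query_plan` are assumptions; `EXPLAIN QUERY PLAN` is SQLite-specific, and other databases have their own `EXPLAIN` variants.

```python
import sqlite3

# Assumed column and index names; IF NOT EXISTS keeps this safe to rerun.
INDEX_SQL = """
CREATE INDEX IF NOT EXISTS idx_reporting_date ON alerts_reporting (alert_date);
CREATE INDEX IF NOT EXISTS idx_reporting_severity ON alerts_reporting (severity);
CREATE UNIQUE INDEX IF NOT EXISTS idx_reporting_idem ON alerts_reporting (idempotency_key);
"""


def add_indexes(conn: sqlite3.Connection) -> None:
    """Create the three indexes on an existing alerts_reporting table."""
    conn.executescript(INDEX_SQL)


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]
```

Running `query_plan(conn, "SELECT * FROM alerts_reporting WHERE alert_date = ?", ("2024-01-05",))` before and after `add_indexes` shows whether the plan changes from a full table scan to an index search.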
You are ready for API integration when you can:
- explain your table contract,
- rerun ETL safely,
- recover from transient DB failures,
- produce daily summaries without manual edits.
- Play: test different idempotency key designs.
- Build: implement full staging -> reporting pipeline.
- Dissect: explain query plans and table role boundaries.
- Teach-back: present ETL data flow and failure strategy.
- Full schema pack for staging/reporting/cache/marts: