Cheng-Vang/integrated-workforce-intel

Tulsa For You and Me Project

📖 Table of Contents

📜 Disclaimer

This project is a lightweight, local proof-of-concept that demonstrates a functional data warehouse schema for my Tulsa For You and Me project. To speed up development and avoid cloud infrastructure overhead, the pipeline uses Python pandas for fast, in-memory transformations and loads directly into a local PostgreSQL warehouse. Data validation combines pre-load data profiling with Great Expectations and custom, in-house referential integrity assertions at runtime.
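As a rough illustration of what a runtime referential integrity assertion can look like in pandas, here is a minimal sketch. It is not the code in `code/libs/validations`; the function name `assert_referential_integrity`, the table names, the column names, and the sample rows are all hypothetical and exist only for this example.

```python
import pandas as pd


def assert_referential_integrity(
    child: pd.DataFrame, parent: pd.DataFrame, fk: str, pk: str
) -> None:
    """Raise ValueError if any non-null `fk` value in `child` is missing from `parent[pk]`."""
    child_keys = child[fk].dropna()
    # Keep only the foreign key values that have no matching primary key
    orphans = child_keys[~child_keys.isin(parent[pk])]
    if not orphans.empty:
        sample = sorted(orphans.unique())[:5]
        raise ValueError(f"{len(orphans)} orphaned value(s) in '{fk}', e.g. {sample}")


# Hypothetical dimension and fact tables with made-up rows
dim_occupation = pd.DataFrame({"occupation_code": ["A-001", "A-002"]})
fact_job_zone = pd.DataFrame(
    {"occupation_code": ["A-001", "A-002"], "job_zone": [5, 4]}
)

# Passes silently because every fact row points at an existing dimension row
assert_referential_integrity(
    fact_job_zone, dim_occupation, fk="occupation_code", pk="occupation_code"
)
```

Raising an exception, rather than logging a warning, stops the run before inconsistent rows reach the warehouse.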

🗂️ Business Context

You’ve just joined the tech team at Tulsa For You and Me. The team is working to standardize job and wage data to support multiple workforce programs across Tulsa. Your first assignment is to prototype a simple data warehouse that can power future dashboards and analysis on Tulsa’s workforce and labor market trends.

❗ Business Problem

The Tulsa For You and Me initiative is tasked with driving economic mobility and expanding workforce programs across the city of Tulsa. However, the organization faces a critical operational bottleneck. Its foundational labor market, occupation, and demographic data are heavily fragmented, unstandardized, and siloed across various public and municipal source files.

Because no data infrastructure currently exists, the technical team cannot support regional dashboards, track wage trends, or identify skill gaps. To solve this, a Data Warehouse Engineer must design and execute a complete, end-to-end data ecosystem from scratch, moving from identifying sources to final analytics.

🗺️ Project Overview

My solution is an end-to-end analytics pipeline, built from scratch, that centralizes highly fragmented public datasets. It standardizes disparate source formats into a clean, read-optimized warehouse ready for downstream enterprise reporting. The data moves through a structured, six-stage pipeline:
  1. Extractions
  2. Extraction Validations
  3. Transformations
  4. Transformation Validations
  5. Warehouse Loading
  6. Loading Validations
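
The six stages above can be sketched as a simple sequential runner in which each validation stage raises on failure and halts the run. This is only an assumed shape, not the actual `main.py`; the `run_pipeline` helper, the stage functions, and the sample record are hypothetical, and the load step is a stand-in for a real PostgreSQL write.

```python
from typing import Callable, Optional

Stage = Callable[[dict], dict]


def run_pipeline(stages: list[tuple[str, Stage]], state: Optional[dict] = None) -> dict:
    """Run stages in order; any exception raised by a stage stops the pipeline."""
    state = {} if state is None else state
    for name, stage in stages:
        state = stage(state)
        state.setdefault("completed", []).append(name)
    return state


def extract(state: dict) -> dict:
    # Made-up record standing in for a downloaded source file
    return {**state, "raw": [{"tract": "001", "population": "1200"}]}


def validate_extraction(state: dict) -> dict:
    if not state["raw"]:
        raise ValueError("extraction returned no rows")
    return state


def transform(state: dict) -> dict:
    clean = [{"tract": r["tract"], "population": int(r["population"])} for r in state["raw"]]
    return {**state, "clean": clean}


def validate_transformation(state: dict) -> dict:
    if any(r["population"] < 0 for r in state["clean"]):
        raise ValueError("negative population after transformation")
    return state


def load(state: dict) -> dict:
    # Stand-in for the warehouse write: only records how many rows would be loaded
    return {**state, "loaded_rows": len(state["clean"])}


def validate_load(state: dict) -> dict:
    if state["loaded_rows"] != len(state["clean"]):
        raise ValueError("row count mismatch after loading")
    return state


result = run_pipeline([
    ("extractions", extract),
    ("extraction_validations", validate_extraction),
    ("transformations", transform),
    ("transformation_validations", validate_transformation),
    ("warehouse_loading", load),
    ("loading_validations", validate_load),
])
print(result["completed"])
```

Pairing every stage with its own validation means a bad source file is caught before it is transformed, and a bad transformation is caught before it is loaded.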

🏆 Business Outcome

By centralizing data from scratch and resolving the fragmentation between O*NET, GeoCorr, and Census datasets, this proof-of-concept shows how Tulsa For You and Me could successfully transition from an operational standstill to a data-driven workforce organization.

🏗️ Repository Structure
.
├── artifacts <----------- Any outputs produced during runtime get saved here
│   ├── census.json
│   ├── clean_census.json
│   ├── clean_geocorr.csv
│   ├── clean_job_zones.xlsx
│   ├── clean_occupation_data.xlsx
│   ├── geocorr.csv
│   ├── job_zones.xlsx
│   └── occupation_data.xlsx
│
├── assets <----------- Stores external, non-code dependencies
│   └── readme
│       └── images
│           └── rainbow_bar.png
│
├── code <----------- Contains all of the pipeline's source code
│   ├── libs
│   │   ├── extractions <----------- Source code related to extractions
│   │   │   ├── census.py
│   │   │   ├── geocorr.py
│   │   │   └── onet.py
│   │   │
│   │   ├── loaders <----------- Source code related to loading
│   │   │   └── postgres.py
│   │   │
│   │   ├── transformations <----------- Source code related to transformations
│   │   │   ├── census.py
│   │   │   ├── geocorr.py
│   │   │   └── onet.py
│   │   │
│   │   ├── utilities <----------- Source code for helpers and common miscellaneous functions
│   │   │   ├── configs.py
│   │   │   ├── env.py
│   │   │   ├── extractions.py
│   │   │   ├── transformations.py
│   │   │   ├── file_system.py
│   │   │   ├── __init__.py
│   │   │   └── postgres_helper.py
│   │   │
│   │   └── validations <----------- Source code related to data validations
│   │       ├── census.py
│   │       ├── database.py
│   │       ├── geocorr.py
│   │       └── onet.py
│   │
│   ├── main.py <----------- *** The pipeline's entry point ***
│   │
│   └── setup <----------- Source code related to Great Expectations/our data validations
│       ├── expectations.py
│       └── gx_setup.py
│
├── configs <----------- Configurations directory
│   ├── general.toml
│   ├── gx
│   ├── gx.toml
│   └── .env <----------- A `.env` FILE MUST BE CREATED LOCALLY HERE TO STORE CREDENTIALS AS INSTRUCTED IN `./docs/3 - Extractions.md` FOR PROPER RUNTIME
│
├── docs <----------- Documentation covering the pipeline/program/source code
│   ├── 1 - Architecture.md
│   ├── 2 - Sources.md
│   ├── 3 - Extractions.md
│   ├── 4 - Extraction Validations.md
│   ├── 5 - Transformations.md
│   ├── 6 - Transformation Validations.md
│   ├── 7 - Warehouse Schema.md
│   ├── 8 - Warehouse Loading.md
│   ├── 9 - Loading Validations.md
│   └── warehouse_star_schema_erd.html
│
├── management <----------- Contains project management and lifecycle materials
│   └── DWE Candidate Technical Activity.pdf
│
├── sql_queries <----------- 3 SQL queries and their results
│   ├── query_1.sql
│   ├── query_1_results.png
│   ├── query_2.sql
│   ├── query_2_results.png
│   ├── query_3.sql
│   └── query_3_results.png
│
├── tests <----------- Test suites were skipped for this proof-of-concept; the directory is kept to maintain my standard project layout.
│   └── .gitkeep
│
├── pyproject.toml
├── README.md
└── uv.lock
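
The `artifacts` listing pairs each raw file with a `clean_`-prefixed counterpart (for example `census.json` and `clean_census.json`). A minimal sketch of deriving the cleaned path from the raw one is shown below; the helper name `clean_artifact_path` and the `ARTIFACTS_DIR` constant are assumptions for illustration and may not match what `code/libs/utilities/file_system.py` actually does.

```python
from pathlib import Path

# Assumed location of runtime outputs, relative to the repository root
ARTIFACTS_DIR = Path("artifacts")


def clean_artifact_path(raw_path: Path) -> Path:
    """Return the sibling path whose file name carries the `clean_` prefix."""
    return raw_path.with_name(f"clean_{raw_path.name}")


raw = ARTIFACTS_DIR / "census.json"
print(clean_artifact_path(raw).as_posix())
```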

🎥 Video Presentation

Click here to watch my presentation

About

A scalable, configuration-driven data warehouse that unifies public labor market and demographic data streams for localized Tulsa workforce analytics.
