- Objective
- Visualizations
- Setup
- Dataset Setup
- Project Structure
- Pipeline
- Validation & Schema Enforcement
- Skills & Key Learnings
- Future Improvements
- API Documentation
- License
## Objective

Build a reproducible, production-style data cleaning, validation, and ingestion pipeline for the Olist Brazilian e-commerce dataset. Raw CSVs are transformed into clean, schema-enforced Parquet files optimized for BigQuery and downstream analytics.
**Key enhancement** (merged from the `feature/improve-schema-enforcement` branch):
Strengthened deterministic type casting and explicit schema enforcement during Parquet serialization. This gives full control over column types and eliminates common BigQuery import errors caused by weak/ambiguous typing (e.g., object → STRING coercion failures).
The resulting Parquet files now power reliable analysis and an interactive Looker Studio dashboard showing key revenue and order metrics across Brazilian states and product categories.
This project demonstrates modern data engineering and analytics practices:
- Modular `src/` package layout
- Explicit schema contracts and validation
- Automated CI testing + post-ingestion verification
- Reproducible environments via `uv`
## Visualizations

Early Tableau choropleth map created to validate the cleaned dataset:
- Shows revenue concentration across states (darker = higher revenue)
- Highlights strong market dominance by São Paulo.
- Built from aggregated `price` and `customer_state` fields
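The underlying aggregation is simple; a hypothetical pandas sketch (column names follow the Olist schema, but this is not the project's code):

```python
# Illustrative revenue-by-state aggregation behind the choropleth:
# join order items to customers, then sum price per state.
import pandas as pd

items = pd.DataFrame({"order_id": ["o1", "o1", "o2"], "price": [10.0, 5.0, 20.0]})
orders = pd.DataFrame({"order_id": ["o1", "o2"], "customer_id": ["c1", "c2"]})
customers = pd.DataFrame({"customer_id": ["c1", "c2"], "customer_state": ["SP", "RJ"]})

revenue_by_state = (
    items.merge(orders, on="order_id")
         .merge(customers, on="customer_id")
         .groupby("customer_state", as_index=False)["price"].sum()
         .sort_values("price", ascending=False)
)
```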
Polished interactive dashboard summarizing key findings from the 2017 Olist dataset:
- KPIs: Total Revenue, Unique Customers, Total Orders
- Monthly order trend with seasonality insights
- State-level revenue distribution and top performers
- Key insights and actionable business recommendations
## Setup

```bash
git clone https://github.com/space-lumps/ecommerce-data-cleaning.git
cd ecommerce-data-cleaning
uv venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
uv pip install -e .
```
```bash
# Use sample data (no Kaggle needed)
cp data/samples/*.csv data/raw/
uv run python run_pipeline.py
```

## Dataset Setup

The pipeline can run using either the included sample dataset or the full Kaggle dataset.
### Option 1: Sample Dataset

Copy the included samples and run the pipeline:

```bash
cp data/samples/*.csv data/raw/
uv run python run_pipeline.py
```

### Option 2: Full Kaggle Dataset

Download the [Brazilian E-Commerce Public Dataset by Olist](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce) and place all `.csv` files in `data/raw/`.
Using the Kaggle CLI (fastest):

```bash
pip install kaggle
kaggle datasets download -d olistbr/brazilian-ecommerce -p data/raw --unzip
```

Ensure `data/raw/` contains only one version of the data at a time.
## Project Structure

```
ecommerce-data-cleaning/
├── src/ecom_pipeline/        # All reusable code (installable package)
├── data/samples/             # Lightweight test data (included)
├── docs/                     # schema_contract.md, data_dictionary.md, etc.
├── reports/                  # Generated audits and profiles
├── tests/                    # E2E and IO smoke tests
├── .github/workflows/ci.yml
├── pyproject.toml + uv.lock  # Reproducible environment
└── run_pipeline.py
```
The project follows a proper `src/` layout: all reusable code lives inside the `ecom_pipeline` package, and tests live in a top-level `tests/` directory.
## Pipeline

Run the full pipeline:

```bash
uv run python run_pipeline.py
```

- **Sanity Check Raw** – Confirms raw files exist and are readable.
- **Profile Raw** – Profiles source datasets before transformation.
- **Standardize Columns** – Applies consistent column naming.
- **Enforce Schema** – Applies explicit type casting using `SCHEMA_CONTRACT` as the single source of truth.
- **Audit Dtypes** – Flags suspicious type patterns and generates detailed reports.
- **Validate Schema Contract** – Comprehensive validation (required columns, dtypes, nullability, PK uniqueness, FK integrity, domain constraints).
- **Generate Data Dictionary** – Generates `docs/data_dictionary.md` from `reports/clean_dtypes_full.csv`.
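Two of the validation steps can be sketched in a few lines (function names here are hypothetical, not the project's actual API): primary-key uniqueness and foreign-key orphan detection.

```python
# Illustrative sketch of PK uniqueness and FK orphan checks.
import pandas as pd

def check_pk_unique(df: pd.DataFrame, pk: str) -> bool:
    """True when the primary-key column has no nulls and no duplicates."""
    return bool(df[pk].notna().all() and df[pk].is_unique)

def find_fk_orphans(child: pd.DataFrame, fk: str, parent: pd.DataFrame, pk: str) -> pd.Series:
    """Foreign-key values in `child` with no matching row in `parent`."""
    return child.loc[~child[fk].isin(parent[pk]), fk]

customers = pd.DataFrame({"customer_id": ["c1", "c2"]})
orders = pd.DataFrame({"order_id": ["o1", "o2"], "customer_id": ["c1", "c9"]})

pk_ok = check_pk_unique(customers, "customer_id")
orphans = find_fk_orphans(orders, "customer_id", customers, "customer_id")
# orphans contains "c9": an order pointing at a customer row that does not exist
```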
## Validation & Schema Enforcement

This pipeline features strong type safety and relational integrity for BigQuery compatibility:

- All type casting is driven by `SCHEMA_CONTRACT` as the single source of truth
- Strict nullable dtypes (`string`, `Int64`, `Float64`, `datetime64[ns]`)
- Brazilian CEP zip codes preserved with leading zeros
- Full English state names added for better visualization support
- Cross-table foreign key integrity checks with orphan detection
- Automatic enrichment of the `product_category_name_translation` table with missing categories from `olist_products_dataset` (e.g. `pc_gamer`, `portateis_cozinha_e_preparadores_de_alimentos`)
These changes eliminate common import failures and produce cleaner, more reliable outputs for analysis and visualization.
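For example, preserving leading zeros in CEP prefixes comes down to casting to a string dtype and zero-padding before any numeric inference can occur (column name assumed from the Olist schema; this is a sketch, not the project's code):

```python
# Hedged sketch: keep Brazilian CEP zip-code prefixes as zero-padded strings
# rather than letting them load as integers, which silently drops leading zeros.
import pandas as pd

df = pd.DataFrame({"customer_zip_code_prefix": [1046, 88330]})
df["customer_zip_code_prefix"] = (
    df["customer_zip_code_prefix"].astype("string").str.zfill(5)
)
# 1046 becomes "01046"; 88330 is already five digits and is unchanged
```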
Key outputs:

- `data/clean/*.parquet` – production-ready files with correct nullable types
- `docs/data_dictionary.md` – living, accurate documentation of the final schema
- `reports/clean_contract_audit.csv` – comprehensive contract validation (required columns, dtypes, constraints, FK integrity)
- `reports/clean_dtypes_full.csv` – detailed audit with null counts/percentages
- `reports/clean_dtypes_flags.csv` – flagged suspicious columns for manual review
A declarative schema contract defines the expected structure of every cleaned dataset and serves as the single source of truth for type enforcement.
It specifies:
- Required columns and primary keys
- Logical data types (`string`, `numeric`, `datetime`)
- Nullable rules and domain constraints (including `numeric_type: Int64` or `Float64`)
- Foreign key relationships (fully enforced with cross-table referential integrity checks)
The contract is enforced in `enforce_schema.py` and rigorously validated by `validate_schema_contract.py`.
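A minimal sketch of the idea (the real `SCHEMA_CONTRACT` and `enforce_schema.py` are richer; the entries and column names below are illustrative):

```python
# Illustrative contract-driven casting: the contract maps each column to a
# pandas nullable dtype, and enforcement casts deterministically from it.
import pandas as pd

SCHEMA_CONTRACT = {
    "orders": {
        "order_id": {"dtype": "string", "nullable": False},
        "item_count": {"dtype": "Int64", "nullable": True},
        "price": {"dtype": "Float64", "nullable": True},
    }
}

def enforce_schema(df: pd.DataFrame, table: str) -> pd.DataFrame:
    spec = SCHEMA_CONTRACT[table]
    out = df[list(spec)].copy()                      # keep only contracted columns, in order
    for col, rules in spec.items():
        out[col] = out[col].astype(rules["dtype"])   # deterministic cast
    return out

raw = pd.DataFrame({
    "order_id": ["a", "b"],
    "item_count": [1.0, None],   # floats with NaN become nullable Int64
    "price": [9.9, None],
    "debug_col": [0, 0],         # not in the contract: dropped
})
clean = enforce_schema(raw, "orders")
```

Keeping the contract declarative means the same structure can drive casting, validation, and documentation generation without duplicating type information.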
## Skills & Key Learnings

- Modular Python package with a proper `src/` layout for maintainability
- Strict schema enforcement + deterministic type casting
- Cross-table foreign key validation with automatic data enrichment
- Defensive validation + CI testing
- Reproducible environments with `uv`
- Production-ready data pipeline powering a Looker Studio dashboard
**Key Learnings**
- Explicit type casting early prevents BigQuery import failures and silent data issues
- Handling real-world data quirks (incomplete lookup tables) is critical for clean relational pipelines
- Clear schema contracts + validation save significant debugging time
## Future Improvements

- Expand domain-specific validation rules (e.g. price ≥ 0, valid `order_status` values)
- Increase test coverage for edge cases
## API Documentation

The project is structured as an installable Python package (`ecom_pipeline`).
Full API reference (auto-generated from docstrings):
→ View API Documentation
## License

MIT License
Copyright (c) 2026 Corin Stedman
See the LICENSE file for details.

