E-commerce Data Cleaning (Olist)

Objective

Build a reproducible, production-style data cleaning, validation, and ingestion pipeline for the Olist Brazilian e-commerce dataset. Raw CSVs are transformed into clean, schema-enforced Parquet files optimized for BigQuery and downstream analytics.

Key enhancement (merged from feature/improve-schema-enforcement branch):
Strengthened deterministic type casting and explicit schema enforcement during Parquet serialization. This gives full control over column types and eliminates common BigQuery import errors caused by weak/ambiguous typing (e.g., object → STRING coercion failures).

The resulting Parquet files now power reliable analysis and an interactive Looker Studio dashboard showing key revenue and order metrics across Brazilian states and product categories.

This project demonstrates modern data engineering and analytics practices:

Modular src/ package layout
Explicit schema contracts and validation
Automated CI testing + post-ingestion verification
Reproducible environments via uv

Visualizations

Exploratory Analysis: Revenue by Brazilian State

Early Tableau choropleth map created to validate the cleaned dataset:

Shows revenue concentration across states (darker = higher revenue)
Highlights strong market dominance by São Paulo.
Built from aggregated price and customer_state

2017 Revenue & Sales Dashboard

Polished interactive dashboard summarizing key findings from the 2017 Olist dataset:

KPIs: Total Revenue, Unique Customers, Total Orders
Monthly order trend with seasonality insights
State-level revenue distribution and top performers
Key insights and actionable business recommendations

Setup

git clone https://github.com/space-lumps/ecommerce-data-cleaning.git
cd ecommerce-data-cleaning

uv venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate

uv pip install -e .

# Use sample data (no Kaggle needed)
cp data/samples/*.csv data/raw/
uv run python run_pipeline.py

Dataset Setup

The pipeline can run using either the included sample dataset or the full Kaggle dataset.

Option A — Sample Data (recommended for quick start, no Kaggle needed)

Copy the included samples and run the pipeline:

cp data/samples/*.csv data/raw/
uv run python run_pipeline.py

Option B - Download full dataset from Kaggle

Download from Brazilian E-Commerce Public Dataset by Olist and place all .csv files in data/raw/.

Using Kaggle CLI (fastest):

pip install kaggle
kaggle datasets download -d olistbr/brazilian-ecommerce -p data/raw --unzip

Ensure data/raw/ contains only one version of the data at a time.

Project Structure

ecommerce-data-cleaning/
├── src/ecom_pipeline/          # All reusable code (installable package)
├── data/samples/               # Lightweight test data (included)
├── docs/                       # schema_contract.md, data_dictionary.md, etc.
├── reports/                    # Generated audits and profiles
├── tests/                      # E2E and IO smoke tests
├── .github/workflows/ci.yml
├── pyproject.toml + uv.lock    # Reproducible environment
└── run_pipeline.py

The project follows a proper src/ layout.
All reusable code lives inside the ecom_pipeline package. Tests live in a top-level tests/ directory.

Pipeline

Execution

uv run python run_pipeline.py

Pipeline Stages

Sanity Check Raw Confirms raw files exist and are readable.
Profile Raw Profiles source datasets before transformation.
Standardize Columns Applies consistent column naming.
Enforce Schema Applies explicit type casting using SCHEMA_CONTRACT as the single source of truth.
Audit Dtypes Flags suspicious type patterns and generates detailed reports.
Validate Schema Contract Comprehensive validation (required columns, dtypes, nullability, PK uniqueness, FK integrity, domain constraints)
Generate Data Dictionary Generates docs/data_dictionary.md from reports/clean_dtypes_full.csv.

Validation & Schema Enforcement

This pipeline features strong type safety and relational integrity for BigQuery compatibility:

Key Improvements

All type casting is driven by SCHEMA_CONTRACT as the single source of truth
Strict nullable dtypes (string, Int64, Float64, datetime64[ns])
Brazilian CEP zip codes preserved with leading zeros
Full English state names added for better visualization support
Cross-table foreign key integrity checks with orphan detection
Automatic enrichment of product_category_name_translation table with missing categories from olist_products_dataset (e.g. pc_gamer, portateis_cozinha_e_preparadores_de_alimentos)

These changes eliminate common import failures and produce cleaner, more reliable outputs for analysis and visualization.

Outputs

data/clean/*.parquet – production-ready files with correct nullable types
docs/data_dictionary.md – living, accurate documentation of the final schema
reports/clean_contract_audit.csv – comprehensive contract validation (required columns, dtypes, constraints, FK integrity)
reports/clean_dtypes_full.csv – detailed audit with null counts/percentages
reports/clean_dtypes_flags.csv – flagged suspicious columns for manual review

Schema Contract

A declarative schema contract defines the expected structure of every cleaned dataset and serves as the single source of truth for type enforcement.

It specifies:

Required columns and primary keys
Logical data types (string, numeric, datetime)
Nullable rules and domain constraints (including numeric_type: Int64 or Float64)
Foreign key relationships (fully enforced with cross-table referential integrity checks)

The contract is enforced in enforce_schema.py and rigorously validated by validate_schema_contract.py.

Skills & Key Learnings

Modular Python package with proper src/ layout for maintainability
Strict schema enforcement + deterministic type casting
Cross-table foreign key validation with automatic data enrichment
Defensive validation + CI testing
Reproducible environments with uv
Production-ready data pipeline powering a Looker Studio dashboard

Key Learnings

Explicit type casting early prevents BigQuery import failures and silent data issues
Handling real-world data quirks (incomplete lookup tables) is critical for clean relational pipelines
Clear schema contracts + validation save significant debugging time

Future Improvements

Expand domain-specific validation rules (e.g. price ≥ 0, valid order_status values)
Increase test coverage for edge cases

API Documentation

The project is structured as an installable Python package (ecom_pipeline).

Full API reference (auto-generated from docstrings):
→ View API Documentation

License

MIT License

See the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

E-commerce Data Cleaning (Olist)

Table of Contents

Objective

Visualizations

Exploratory Analysis: Revenue by Brazilian State

2017 Revenue & Sales Dashboard

Setup

Dataset Setup

Option A — Sample Data (recommended for quick start, no Kaggle needed)

Option B - Download full dataset from Kaggle

Project Structure

Pipeline

Execution

Pipeline Stages

Validation & Schema Enforcement

Key Improvements

Outputs

Schema Contract

Skills & Key Learnings

Future Improvements

API Documentation

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 98 Commits
.github/workflows		.github/workflows
assets/images		assets/images
data/samples		data/samples
docs		docs
reports		reports
src/ecom_pipeline		src/ecom_pipeline
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.ruff.toml		.ruff.toml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements-lock.txt		requirements-lock.txt
requirements.txt		requirements.txt
run_pipeline.py		run_pipeline.py
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

E-commerce Data Cleaning (Olist)

Table of Contents

Objective

Visualizations

Exploratory Analysis: Revenue by Brazilian State

2017 Revenue & Sales Dashboard

Setup

Dataset Setup

Option A — Sample Data (recommended for quick start, no Kaggle needed)

Option B - Download full dataset from Kaggle

Project Structure

Pipeline

Execution

Pipeline Stages

Validation & Schema Enforcement

Key Improvements

Outputs

Schema Contract

Skills & Key Learnings

Future Improvements

API Documentation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages