The executive management team at the DVD store chain Sakila Entertainment seeks to gain deeper insights into the company’s rental business. Their goals are to:
- Analyze the performance of top products
- Identify opportunities for operational improvement
- Make data-driven decisions to optimize business outcomes
This project delivers a comprehensive end-to-end data engineering solution. It transforms the legacy, normalized Sakila rental database (OLTP) into a high-performance, denormalized Star Schema Data Warehouse (OLAP), enabling advanced analytics and robust business intelligence.
Key features of the solution include:
- Containerized pipeline: All components are packaged with Docker for portability and consistency.
- Automated orchestration: Data flow is managed and automated via `docker-compose.yaml`, ensuring seamless deployment and operation.
- Incremental loading: A watermarking strategy is implemented to support efficient, incremental data updates.
- Executive dashboards: The pipeline delivers actionable insights directly to decision-makers through tailored Jupyter Notebook and Metabase dashboards.
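The orchestration described above might look roughly like the following `docker-compose.yaml` sketch. The service names and ports are taken from the container list later in this README; the image tags, build context, and volume name are assumptions for illustration only.

```yaml
# Illustrative sketch only — not the project's actual compose file.
services:
  sakila-mysql:
    image: mysql:8.0
    ports: ["3306:3306"]
    volumes:
      - mysql_data:/var/lib/mysql   # volume persistence across restarts
  sakila-notebook:
    build: .                        # ETL & analysis environment (assumed build context)
    ports: ["8888:8888"]
    depends_on: [sakila-mysql]
  sakila-metabase:
    image: metabase/metabase
    ports: ["3000:3000"]
    depends_on: [sakila-mysql]
volumes:
  mysql_data:
```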
This architecture empowers Sakila Entertainment to unlock the full potential of their data, driving strategic improvements and operational excellence.
- Source (OLTP): `sakila` (transactional MySQL database).
- ETL: Custom Python pipeline with Incremental Loading (Watermark strategy).
- Target: `sakila_star` (dimensional Data Warehouse).
- Consumption:
  - Jupyter Notebook: for deep-dive statistical analysis (`JupySQL`, `Seaborn`).
  - Metabase: for self-service executive dashboards.
| Component | Technology | Key Skills Demonstrated |
|---|---|---|
| Orchestration | Docker Compose | Multi-container networking (DB, App, BI), Volume persistence |
| Database | MySQL 8.0 | Schema Design (3NF → Star Schema) |
| ETL Engine | Python (Pandas) | Incremental Upserts, Watermarking, Data Cleaning |
| Analysis | SQL / JupySQL | CTEs, Window Functions (RANK, LAG), Aggregations |
| Visualization | Metabase / Seaborn | Dashboard design, Heatmaps, Time-series forecasting |
- Docker & Docker Compose installed on your machine.
```bash
# Clone the repository
git clone https://github.com/yourusername/sakila-data-engineering.git

# Start the environment (approx. 60 seconds to initialize)
docker-compose up -d
```

Run `docker ps` in the terminal to ensure the following containers are active:
- `sakila-mysql` (Port 3306) - The Database Host
- `sakila-notebook` (Port 8888) - The ETL & Analysis Environment
- `sakila-metabase` (Port 3000) - The BI Dashboard
They should also have a status of "Up" and "Healthy". If not, try again - it may take a few minutes for the containers to start up fully.
The pipeline is pre-configured to run automatically upon container startup.
By following these steps, you'll have a fully functional data engineering pipeline ready to use.
- http://localhost:8888: The ETL & Analysis Environment (Jupyter Notebook)
  - This is where you can run the pipeline and analyze the data.
- localhost:3306: The Database Host (MySQL)
  - You can use this to view and manage the database via MySQL Workbench or the CLI: `docker exec -it sakila-mysql mysql -u app -p` (Password: `app_password`)
- http://localhost:3000: The BI Dashboard (Metabase)
To use the pipeline, access the Jupyter Notebook at http://localhost:8888.
- Run `notebooks/01_incremental_ETL_pipeline.ipynb` to initialize the `etl_state` table and populate `sakila_star`.
- Run `notebooks/02_data_analysis.ipynb` to analyze the data and generate insights.
To see the dashboard, access Metabase at http://localhost:3000, create a username/password, and connect it to the MySQL database. After running the notebooks, you can view the executive overview dashboard there.
File: `sql/10_sakila_star-schema.sql`
Before any data movement, I architected a Star Schema optimized for analytical queries (OLAP). This involved writing the SQL DDL to define the warehouse structure:
- Fact Table (`fact_rental`):
  - The central table containing over 16,000 transaction records.
  - It connects to dimensions via Foreign Keys (`film_id`, `customer_id`, etc.) and includes performance indexes (`idx_rental_date`).
- Dimension Tables: Denormalized tables to reduce join complexity. Examples:
  - `dim_customer`: merges customer profile + address + city + country.
  - `dim_film`: consolidates film details + language.
- Infrastructure Tables:
  - `etl_state`: a custom table designed to store Watermarks (`last_success_ts`) for each pipeline, enabling the incremental ETL logic.
- Data Integrity:
  - Implemented `ON DUPLICATE KEY UPDATE` logic in the schema to ensure idempotency.
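The idempotency property of the upsert logic can be demonstrated in a self-contained sketch. The warehouse itself uses MySQL's `ON DUPLICATE KEY UPDATE`; the example below uses SQLite's equivalent `ON CONFLICT ... DO UPDATE` so it runs without a MySQL server, and the table and column values are illustrative only.

```python
import sqlite3

# Standalone demo of idempotent upserts. The real schema uses MySQL's
# ON DUPLICATE KEY UPDATE; SQLite's ON CONFLICT ... DO UPDATE is the
# equivalent used here so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_film (
        film_id INTEGER PRIMARY KEY,
        title   TEXT,
        rating  TEXT
    )
""")

upsert = """
    INSERT INTO dim_film (film_id, title, rating)
    VALUES (?, ?, ?)
    ON CONFLICT(film_id) DO UPDATE SET
        title  = excluded.title,
        rating = excluded.rating
"""

# Loading the same key twice leaves exactly one row (idempotent);
# changed attributes are updated in place rather than duplicated.
conn.execute(upsert, (1, "ACADEMY DINOSAUR", "PG"))
conn.execute(upsert, (1, "ACADEMY DINOSAUR", "PG-13"))  # same key, new rating

rows = conn.execute("SELECT film_id, title, rating FROM dim_film").fetchall()
print(rows)  # [(1, 'ACADEMY DINOSAUR', 'PG-13')]
```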
File: `notebooks/01_incremental_ETL_pipeline.ipynb`
The ETL process moves data from Source to Target using Python and Pandas.
- Extraction: Queries the source `sakila` DB using `WHERE last_update > watermark`.
- Transformation: Cleans timestamps and handles NaN values for SQL compatibility.
- Loading: Executes Upserts (Update-Insert) into the `sakila_star` warehouse.
- Watermarking: After a successful load, the `etl_state` table is updated with the latest timestamp, ensuring the next run only processes new data.
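The watermark pattern can be sketched end to end in a few lines. This is a minimal illustration, not the notebook's actual code: SQLite stands in for the source and target MySQL databases, the function name `run_incremental` is invented, and the sample rows are made up.

```python
import sqlite3
import pandas as pd

# Minimal sketch of the watermark strategy: extract only rows whose
# last_update is newer than the stored watermark, then advance it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rental (rental_id INTEGER, last_update TEXT)")
conn.executemany("INSERT INTO rental VALUES (?, ?)", [
    (1, "2024-01-01 10:00:00"),
    (2, "2024-01-02 10:00:00"),
    (3, "2024-01-03 10:00:00"),
])
conn.execute("CREATE TABLE etl_state (pipeline TEXT PRIMARY KEY, last_success_ts TEXT)")
conn.execute("INSERT INTO etl_state VALUES ('fact_rental', '2024-01-01 12:00:00')")

def run_incremental(conn):
    # 1. Read the current watermark for this pipeline.
    wm = conn.execute(
        "SELECT last_success_ts FROM etl_state WHERE pipeline='fact_rental'"
    ).fetchone()[0]
    # 2. Extract only rows newer than the watermark.
    df = pd.read_sql("SELECT * FROM rental WHERE last_update > ?", conn, params=(wm,))
    # 3. After a successful load, advance the watermark.
    if not df.empty:
        conn.execute(
            "UPDATE etl_state SET last_success_ts=? WHERE pipeline='fact_rental'",
            (df["last_update"].max(),),
        )
    return df

first = run_incremental(conn)   # picks up rentals 2 and 3
second = run_incremental(conn)  # nothing new, so the extract is empty
print(len(first), len(second))  # 2 0
```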
File: `notebooks/02_data_analysis.ipynb`
I utilized the Data Warehouse to answer critical business questions using advanced SQL.
- Financial Volatility (Window Functions): I calculated Month-over-Month Revenue Growth using the `LAG()` window function to detect financial trends. Insight: identifies immediate periods of growth vs. decline.
- Top Products (`RANK()`): I identified the highest-grossing movie for each MPAA rating category (G, PG, R, etc.) using `RANK() OVER (PARTITION BY rating)`.
- VIP Customer Analysis (Pivoting): I analyzed the Top 20 highest-spending customers, pivoting their spending data to generate a Heatmap of category preferences (e.g., specific customers preferring Animation vs. Sports).
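For readers without the warehouse running, the same window-function logic can be sketched in pandas, where `shift()` plays the role of `LAG()` and `groupby(...).rank()` plays the role of `RANK() OVER (PARTITION BY ...)`. The toy figures below are invented, not query results.

```python
import pandas as pd

# Invented monthly revenue figures standing in for warehouse output.
rev = pd.DataFrame({
    "month":   ["2005-05", "2005-06", "2005-07", "2005-08"],
    "revenue": [4824.0, 9631.0, 28368.0, 24072.0],
})

# LAG() equivalent: compare each month's revenue with the previous month's.
rev["prev"] = rev["revenue"].shift(1)
rev["mom_growth_pct"] = (rev["revenue"] - rev["prev"]) / rev["prev"] * 100

# RANK() OVER (PARTITION BY rating) equivalent: rank films within each rating.
films = pd.DataFrame({
    "rating": ["G", "G", "PG", "PG"],
    "title":  ["A", "B", "C", "D"],
    "gross":  [300.0, 250.0, 400.0, 150.0],
})
films["rk"] = films.groupby("rating")["gross"].rank(method="min", ascending=False)
top_per_rating = films[films["rk"] == 1]
print(top_per_rating[["rating", "title"]].values.tolist())  # [['G', 'A'], ['PG', 'C']]
```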
Access: http://localhost:3000
I deployed a persistent "Sakila Executive Overview" dashboard in Metabase, organized into three strategic categories to display the notebook's queries via different graph types.
MoM Revenue Volatility: A Bar Chart displaying the month-over-month growth or decline in total revenue.
Catalog Distribution: Bar chart visualizing the count of films per Genre.
Top Revenue Films: A Table with conditional formatting (deep to light green) identifying the #1 highest-grossing movie per Rating.
Spending Tiers: A Row-based Bar Chart segmenting customers into Low/Mid/High value tiers, featuring a target line for desired high-value transaction volume.
Return Policy Compliance: Pie chart showing the ratio of Late vs. On-Time returns.
VIP Heatmap: A Pivot Table/Heatmap visualizing our Top 10 customers and their total spending amount per film category.



