This project implements a robust End-to-End ETL (Extract, Transform, Load) Framework for a banking institution. It processes customer data, standardizes it into a normalized relational schema, and performs advanced risk analysis to identify potential customer churn.
Unlike simple query collections, this project simulates a production-grade environment featuring transaction handling, error logging, audit trails, and role-based security.
The dataset used is the Bank Churners dataset from Kaggle.
- Source: Kaggle - Credit Card Customers
- Note: To maintain a lightweight repository, the raw `.csv` and processed `.json` files are excluded via `.gitignore`.
Setup Instructions:
- Download the dataset from the link above.
- Place the `.csv` file in the `04_Source_Scripts/` folder.
- Run the Python script to convert the CSV to JSON format.
The repository is organized into the logical flow of the engineering pipeline:
| Folder | Description |
|---|---|
| `01_Schema_Setup` | Database creation, schema definitions (`ETLLog`, `Staging`, `ETL`, `Person`, `Bank`, `Operations`), table creation with constraints, and security roles. |
| `02_ETL_Framework` | Stored procedures for data loading, the master orchestration script, and the custom ETL logging architecture. |
| `03_Analysis_Reporting` | Views for demographics, financial behavior analysis, high-risk churn segmentation, and reporting queries. |
| `04_Source_Scripts` | Python script (`converter.py`) for pre-processing raw CSV data into JSON format for SQL ingestion. |
- Custom Logging Schema: The `ETLLog` schema tracks every execution (`RunHeader`) and every step (`RunStep`), capturing row counts (read/inserted/rejected) and execution times.
- Error Handling: Full use of `TRY...CATCH` blocks with transaction management (`BEGIN TRAN`, `COMMIT`, `ROLLBACK`) to ensure data atomicity.
- Data Quality Logic: "Bad data" isn't simply discarded; it is captured in a `RowReject` table with specific reasons (e.g., "Invalid Gender Code", "Scientific Notation Error") for auditing.
- Merge Statements: Efficiently handle `UPSERT` operations (update existing records, insert new ones).
- Business Logic: Stored procedures detect high-risk churners and log operational alerts.
- Window Functions: Utilized for pagination, quartiles (`NTILE`), and ranking customers by value.
- Role-Based Access Control (RBAC): Implementation of specific roles (e.g., `JuniorAnalystRole`) with granular permissions (`SELECT` only, `DENY DELETE`).
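The error-handling pattern described above generally takes the following shape. This is a minimal sketch, not the project's actual procedure; the `ETLLog.RunStep` column names here are illustrative:

```sql
BEGIN TRY
    BEGIN TRAN;

    -- ... load staged rows into target tables ...

    COMMIT TRAN;
END TRY
BEGIN CATCH
    -- Undo any partial load so the batch stays atomic
    IF @@TRANCOUNT > 0
        ROLLBACK TRAN;

    -- Record the failure before re-raising (column names are illustrative)
    INSERT INTO ETLLog.RunStep (StepName, ErrorMessage)
    VALUES (N'LoadCustomers', ERROR_MESSAGE());

    THROW;  -- surface the original error to the caller
END CATCH;
```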
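A `MERGE`-based upsert follows this general form (the table and column names below are placeholders, not necessarily the project's schema):

```sql
MERGE Bank.Customer AS tgt
USING Staging.Customer AS src
    ON tgt.CustomerID = src.CustomerID
WHEN MATCHED THEN
    -- Update existing records in place
    UPDATE SET tgt.CreditLimit   = src.CreditLimit,
               tgt.AttritionFlag = src.AttritionFlag
WHEN NOT MATCHED BY TARGET THEN
    -- Insert new records
    INSERT (CustomerID, CreditLimit, AttritionFlag)
    VALUES (src.CustomerID, src.CreditLimit, src.AttritionFlag);
```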
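Quartile segmentation with `NTILE` looks roughly like this (column names are illustrative):

```sql
SELECT CustomerID,
       TotalTransAmt,
       NTILE(4) OVER (ORDER BY TotalTransAmt DESC) AS ValueQuartile
FROM Bank.Customer;
-- ValueQuartile 1 = highest-value customers, 4 = lowest
```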
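The RBAC setup amounts to grants and denies against a role, along these lines (role and schema names are taken from this README; the exact permission set in `01_Schema_Setup` may differ):

```sql
CREATE ROLE JuniorAnalystRole;
GRANT SELECT ON SCHEMA::Bank TO JuniorAnalystRole;  -- read-only access
DENY  DELETE ON SCHEMA::Bank TO JuniorAnalystRole;  -- explicitly block deletes
```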
- Microsoft SQL Server (Developer or Enterprise Edition)
- Python 3.x
- Prepare Data: Run the Python script in `04_Source_Scripts/` to generate the JSON file. It defaults to looking for `data.csv`, or you can specify filenames: `python converter.py --input "BankChurners.csv" --output "BankChurners_json_20251218.json"`

  ⚠️ Critical Step: Before running the SQL, open `02_ETL_Framework/usp_RunBankChurnersETL.sql` and update the `@FilePath` variable to point to the location of the JSON file on your local machine.
- Build Database: Execute the scripts in `01_Schema_Setup` to create the database, schemas, and tables.
- Run ETL: Execute the scripts in `02_ETL_Framework` to create the stored procedures. Finally, run the master procedure: `EXEC ETL.usp_RunBankChurnersETL;`
- Automation (Optional): To simulate a production environment, you can schedule the master procedure using SQL Server Agent (requires Developer or Enterprise Edition).
  Steps to schedule:
  - Open SSMS and expand SQL Server Agent.
  - Right-click Jobs > New Job.
  - General: Name the job `Daily_BankChurn_ETL`.
  - Steps: Add a new step of type Transact-SQL script (T-SQL).
    - Command: `EXEC ETL.usp_RunBankChurnersETL;`
    - Database: `BankChurnersDec`
  - Schedules: Create a new schedule (e.g., daily at 3:00 AM).
- Analyze: Run the scripts in `03_Analysis_Reporting` to generate insights and view the dashboard metrics.
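For reference, a `@FilePath` variable like the one mentioned in the Prepare Data step is typically consumed with `OPENROWSET(BULK ...)` plus `OPENJSON`. The sketch below shows the general mechanism only; the actual procedure and its column mappings may differ:

```sql
DECLARE @FilePath NVARCHAR(260) = N'C:\data\BankChurners.json';  -- update per machine
DECLARE @Json NVARCHAR(MAX);

-- OPENROWSET(BULK ...) requires a literal path, hence the dynamic SQL
DECLARE @Sql NVARCHAR(MAX) =
    N'SELECT @j = BulkColumn FROM OPENROWSET(BULK ''' + @FilePath + N''', SINGLE_CLOB) AS f;';
EXEC sp_executesql @Sql, N'@j NVARCHAR(MAX) OUTPUT', @j = @Json OUTPUT;

-- Shred the JSON array into rows (column mapping is illustrative)
SELECT *
FROM OPENJSON(@Json)
WITH (CLIENTNUM BIGINT '$.CLIENTNUM');
```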
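The SQL Server Agent steps above can also be scripted with the `msdb` job procedures instead of the SSMS UI; a sketch, with names and schedule adjustable to taste:

```sql
USE msdb;
EXEC sp_add_job @job_name = N'Daily_BankChurn_ETL';
EXEC sp_add_jobstep
     @job_name      = N'Daily_BankChurn_ETL',
     @step_name     = N'Run ETL',
     @subsystem     = N'TSQL',
     @database_name = N'BankChurnersDec',
     @command       = N'EXEC ETL.usp_RunBankChurnersETL;';
EXEC sp_add_schedule
     @schedule_name     = N'Daily_3AM',
     @freq_type         = 4,       -- daily
     @freq_interval     = 1,
     @active_start_time = 30000;   -- 03:00:00
EXEC sp_attach_schedule @job_name = N'Daily_BankChurn_ETL', @schedule_name = N'Daily_3AM';
EXEC sp_add_jobserver   @job_name = N'Daily_BankChurn_ETL'; -- register on the local server
```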
Author: [ifelv]