The goal of this project is to build a robust machine learning pipeline to predict the probability of credit card default. Using the Taiwan Credit Card dataset, I developed a model that helps financial institutions identify high-risk applicants, shifting the focus from simple accuracy to high-sensitivity risk detection.
By moving from Logistic Regression to a tuned XGBoost model, I raised recall on defaulters (Class 1) from 0% to 86% and reached a ROC-AUC of 0.8666.
- Language: Python 3.x
- Libraries: Pandas, NumPy, Scikit-Learn, XGBoost
- Visualization: Matplotlib, Seaborn
- Environment: Google Colab / Jupyter Notebook
- Handling Missing Values: Imputed `MonthlyIncome` with the median and `NumberOfDependents` with the mode.
- Outlier Correction: Addressed "masking codes" (values 96-98) in payment history columns to ensure data integrity.
- Feature Scaling: Applied `StandardScaler` to standardize features (zero mean, unit variance) so the linear model performs well.
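The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's exact code: the payment-history column names (`Late30_59`, `Late60_89`, `Late90`) are hypothetical, and replacing the 96-98 codes with the median of the valid values is an assumption about how the masking codes were handled.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical names for the payment-history columns; adjust to the real dataset.
LATE_COLS = ["Late30_59", "Late60_89", "Late90"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values and neutralize the 96-98 masking codes."""
    out = df.copy()
    # Median for income, mode for dependents.
    out["MonthlyIncome"] = out["MonthlyIncome"].fillna(out["MonthlyIncome"].median())
    out["NumberOfDependents"] = out["NumberOfDependents"].fillna(
        out["NumberOfDependents"].mode().iloc[0]
    )
    for col in LATE_COLS:
        # Assumption: treat codes 96-98 as missing, then fill with the
        # median of the remaining valid values.
        valid = out[col].where(~out[col].between(96, 98))
        out[col] = valid.fillna(valid.median())
    return out


# Tiny made-up frame, only to exercise the function.
raw = pd.DataFrame(
    {
        "MonthlyIncome": [3000.0, np.nan, 5000.0, 4000.0],
        "NumberOfDependents": [0.0, 2.0, np.nan, 0.0],
        "Late30_59": [0, 98, 1, 0],
        "Late60_89": [0, 98, 0, 0],
        "Late90": [0, 98, 2, 0],
    }
)
clean = preprocess(raw)

# Standardize every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(clean)
```

In a real pipeline the scaler (and the imputation statistics) would normally be fit on the training split only and then applied to the test split, to avoid leaking test information.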
I engineered new features to capture deeper behavioral insights:
- `TotalLate`: Aggregated count of all late payments.
- `SeverityScore`: A weighted penalty system where 90-day delays are weighted 3x more than 30-day delays.
- `HighUtilizationFlag`: A binary indicator for customers exceeding their credit limit.
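A minimal sketch of how these three features could be derived with Pandas. The input column names (`Late30_59`, `Late60_89`, `Late90`, `RevolvingUtilization`) are hypothetical, the weight of 2 for 60-day delays is an assumption (the description only fixes the 3:1 ratio between 90-day and 30-day delays), and "exceeding the credit limit" is read here as a utilization ratio above 1.0.

```python
import pandas as pd


def add_behavior_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add TotalLate, SeverityScore and HighUtilizationFlag to a copy of df."""
    out = df.copy()
    # Total number of late payments across all delay buckets.
    out["TotalLate"] = out["Late30_59"] + out["Late60_89"] + out["Late90"]
    # Weighted penalty: 90-day delays count 3x a 30-day delay.
    # The weight of 2 for 60-day delays is an assumed middle value.
    out["SeverityScore"] = (
        1 * out["Late30_59"] + 2 * out["Late60_89"] + 3 * out["Late90"]
    )
    # 1 when the balance exceeds the credit limit (utilization ratio > 1.0).
    out["HighUtilizationFlag"] = (out["RevolvingUtilization"] > 1.0).astype(int)
    return out


# Made-up rows, only to show the arithmetic.
sample = pd.DataFrame(
    {
        "Late30_59": [1, 0, 2],
        "Late60_89": [0, 0, 1],
        "Late90": [1, 0, 0],
        "RevolvingUtilization": [0.4, 1.2, 1.0],
    }
)
features = add_behavior_features(sample)
```

The first and third sample rows end up with the same `SeverityScore` even though their `TotalLate` differs, which is the kind of distinction the weighted score is meant to capture.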
I evaluated three different algorithms to determine the best business solution:
| Metric | Logistic Regression | Random Forest | Tuned XGBoost |
|---|---|---|---|
| ROC-AUC | 0.8255 | 0.8383 | 0.8666 |
| Recall (Class 1) | 0.00 | 0.17 | 0.86 |
| Business Value | Failed to find risk | Missed 83% of risk | Caught 86% of risk |
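The table shows a model with a reasonable ROC-AUC and a recall of 0.00, which can look contradictory. The sketch below uses made-up scores (not the project's predictions) to illustrate one way this happens: ROC-AUC measures how well the scores rank defaulters above non-defaulters, while recall depends on the decision threshold. Whether this is what happened in the project's Logistic Regression run is an assumption.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# Made-up labels (1 = default) and predicted default probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.10, 0.20, 0.30, 0.35, 0.30, 0.45])


def recall_at(threshold: float) -> float:
    """Recall for class 1 when scores >= threshold are flagged as default."""
    y_pred = (y_score >= threshold).astype(int)
    return recall_score(y_true, y_pred)


# Threshold-independent ranking quality.
auc = roc_auc_score(y_true, y_score)

# With the usual 0.5 cut-off no score is high enough, so no defaulter is flagged.
recall_at_default = recall_at(0.5)

# Lowering the cut-off flags both defaulters, at the cost of some false alarms.
recall_at_lower = recall_at(0.25)

print(f"ROC-AUC={auc:.3f}  recall@0.5={recall_at_default:.2f}  "
      f"recall@0.25={recall_at_lower:.2f}")
```

This is why both metrics are reported side by side: ROC-AUC alone would not reveal that a model misses every defaulter at its operating threshold.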
The XGBoost model identified that the most significant predictors of default are:
- TotalLate: Previous payment behavior is the #1 predictor.
- SeverityScore: The intensity of delays significantly impacts the risk profile.
- Revolving Utilization: Customers using a high percentage of their credit limit are at higher risk.
In credit risk modeling, Recall matters more than Accuracy. A bank would rather "double-check" a safe customer than miss a customer who will default on a $50,000 loan. The tuned XGBoost model was optimized for that trade-off and catches 86% of defaulters in the evaluation above.
- Name: Tahir Ahmad
- Domain: Data Science
- Organization: CodeAlpha
- Task: 1 (Credit Scoring)