💳 Credit Scoring Model for Default Prediction

CodeAlpha Data Science Internship - Task 1

📌 Project Overview

The goal of this project is to build a robust machine learning pipeline to predict the probability of credit card default. Using the Taiwan Credit Card dataset, I developed a model that helps financial institutions identify high-risk applicants, shifting the focus from simple accuracy to high-sensitivity risk detection.

🚀 The Result

By moving from Logistic Regression to a Tuned XGBoost model, I improved the detection of defaulters (Recall on Class 1) from 0% to 86% and raised the ROC-AUC score from 0.8255 to 0.8666.

🛠️ Tech Stack & Tools

Language: Python 3.x
Libraries: Pandas, NumPy, Scikit-Learn, XGBoost
Visualization: Matplotlib, Seaborn
Environment: Google Colab / Jupyter Notebook

📂 Project Workflow

1. Data Cleaning & Preprocessing

Handling Missing Values: Imputed MonthlyIncome with the median and NumberOfDependents with the mode.
Outlier Correction: Addressed "masking codes" (values 96-98) in payment history columns to ensure data integrity.
Feature Scaling: Applied StandardScaler to standardize features to zero mean and unit variance, which the linear model (Logistic Regression) benefits from.
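
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the payment-history column names (`Late30`, `Late60`, `Late90`) are hypothetical, and replacing masking codes with the median of the valid values is an assumption, since the exact correction is not stated here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical names for the payment-history columns; adjust to the real dataset.
LATE_COLS = ["Late30", "Late60", "Late90"]
MASK_CODES = [96, 97, 98]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values and neutralize masking codes."""
    out = df.copy()
    # Median for the income column, mode for the discrete dependents count.
    out["MonthlyIncome"] = out["MonthlyIncome"].fillna(out["MonthlyIncome"].median())
    out["NumberOfDependents"] = out["NumberOfDependents"].fillna(
        out["NumberOfDependents"].mode().iloc[0]
    )
    # Assumption: masking codes are replaced by the median of the valid values.
    for col in LATE_COLS:
        out[col] = out[col].astype(float)
        is_code = out[col].isin(MASK_CODES)
        out.loc[is_code, col] = out.loc[~is_code, col].median()
    return out


def scale(train: pd.DataFrame, test: pd.DataFrame):
    """Fit the scaler on training rows only, then apply it to both splits."""
    scaler = StandardScaler().fit(train)
    return scaler.transform(train), scaler.transform(test)
```

Fitting the scaler on the training split only keeps test-set statistics from leaking into the model.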

2. Feature Engineering

I engineered new features to capture deeper behavioral insights:

TotalLate: Aggregated count of all late payments.
SeverityScore: A weighted penalty system where 90-day delays are weighted 3x more than 30-day delays.
HighUtilizationFlag: A binary indicator for customers exceeding their credit limit.
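
As a rough sketch, the three features could be derived like this. The input column names (`Late30`, `Late60`, `Late90`, `RevolvingUtilization`) are hypothetical, and the weight of 2 for 60-day delays is an assumption; only the 3x weighting of 90-day over 30-day delays comes from the description above.

```python
import pandas as pd

# Hypothetical column names; the 60-day weight of 2 is an assumption.
WEIGHTS = {"Late30": 1, "Late60": 2, "Late90": 3}


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add TotalLate, SeverityScore and HighUtilizationFlag columns."""
    out = df.copy()
    late_cols = list(WEIGHTS)
    # Unweighted count of all late payments.
    out["TotalLate"] = out[late_cols].sum(axis=1)
    # Weighted penalty: longer delays contribute more.
    out["SeverityScore"] = sum(out[col] * w for col, w in WEIGHTS.items())
    # Utilization above 1.0 means the balance exceeds the credit limit.
    out["HighUtilizationFlag"] = (out["RevolvingUtilization"] > 1.0).astype(int)
    return out
```

For a customer with two 30-day, one 60-day and one 90-day delay, this gives TotalLate = 4 and SeverityScore = 2 + 2 + 3 = 7.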

3. Model Training & Comparison

I evaluated three different algorithms to determine the best business solution:

| Metric | Logistic Regression | Random Forest | Tuned XGBoost |
| --- | --- | --- | --- |
| ROC-AUC | 0.8255 | 0.8383 | 0.8666 |
| Recall (Class 1) | 0.00 | 0.17 | 0.86 |
| Business Value | Failed to find risk | Missed 83% of risk | Caught 86% of risk |
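
A comparison like the one in the table can be produced with a small evaluation helper. The sketch below uses synthetic, imbalanced stand-in data and only the two scikit-learn models, so its numbers will not match the table; `evaluate` is a hypothetical helper name. An `XGBClassifier` exposes the same `fit` / `predict` / `predict_proba` interface and could be added to the `models` dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test) -> dict:
    """Return ROC-AUC (from probabilities) and recall on the positive class."""
    proba = model.predict_proba(X_test)[:, 1]
    preds = model.predict(X_test)
    return {
        "roc_auc": roc_auc_score(y_test, proba),
        "recall": recall_score(y_test, preds, pos_label=1),
    }


# Synthetic stand-in data with roughly 7% positives; not the project's dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.93], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
results = {
    name: evaluate(model.fit(X_train, y_train), X_test, y_test)
    for name, model in models.items()
}
for name, scores in results.items():
    print(f"{name}: ROC-AUC={scores['roc_auc']:.4f}, recall={scores['recall']:.2f}")
```

ROC-AUC is computed from predicted probabilities while recall depends on the hard predictions at the default 0.5 cutoff, which is why a model can rank customers reasonably well and still show very low recall on an imbalanced dataset.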

🔍 Key Findings (Feature Importance)

The XGBoost model identified that the most significant predictors of default are:

TotalLate: Previous payment behavior is the #1 predictor.
SeverityScore: The intensity of delays significantly impacts the risk profile.
Revolving Utilization: Customers using a high percentage of their credit limit are at higher risk.
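
One way to obtain such a ranking is to read the `feature_importances_` attribute of a fitted tree ensemble. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in on synthetic data, so the resulting order is illustrative only; `rank_features` is a hypothetical helper, and the feature names are reused from this README purely as labels. A fitted `XGBClassifier` exposes the same attribute.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def rank_features(model, feature_names) -> pd.Series:
    """Sort a fitted tree ensemble's importances from highest to lowest."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False)


# Synthetic data; the names reuse the README's features for illustration only.
names = ["TotalLate", "SeverityScore", "RevolvingUtilization", "MonthlyIncome"]
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
ranking = rank_features(model, names)
print(ranking)
```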

🏆 Conclusion

In credit risk modeling, Recall on the default class often matters more than overall Accuracy, because a missed defaulter usually costs far more than an extra review. A bank would rather "double-check" a safe customer than miss a customer who will default on a $50,000 loan. This project optimized for Recall and reached 86% on defaulters, giving a high-sensitivity screening tool for credit assessment.
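
The trade-off between catching defaulters and double-checking safe customers can be made concrete with a toy example. The ten scores below are invented for illustration and `summarize` is a hypothetical helper; the sketch only shows how lowering the decision threshold raises recall at the cost of more false alarms.

```python
import numpy as np


def summarize(y_true: np.ndarray, proba: np.ndarray, threshold: float) -> dict:
    """Recall on defaulters and number of safe customers flagged at a threshold."""
    flagged = proba >= threshold
    recall = float(flagged[y_true == 1].mean())
    false_positives = int(flagged[y_true == 0].sum())
    return {"recall": recall, "false_positives": false_positives}


# Invented example: 7 safe customers (0) and 3 defaulters (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.45, 0.30, 0.40, 0.60])

strict = summarize(y_true, proba, threshold=0.5)
lenient = summarize(y_true, proba, threshold=0.3)
print("threshold 0.5:", strict)
print("threshold 0.3:", lenient)
```

At the 0.5 threshold only one of the three defaulters is flagged and no safe customer is reviewed; at 0.3 all three defaulters are flagged, and two safe customers get a second look.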

👤 Intern Details

Name: Tahir Ahmad
Domain: Data Science
Organization: CodeAlpha
Task: 1 (Credit Scoring)
