Tahirahmad1002/Codealpha_tasks

💳 Credit Scoring Model for Default Prediction

CodeAlpha Data Science Internship - Task 1


📌 Project Overview

The goal of this project is to build a robust machine learning pipeline to predict the probability of credit card default. Using the Taiwan Credit Card dataset, I developed a model that helps financial institutions identify high-risk applicants, shifting the focus from simple accuracy to high-sensitivity risk detection.

🚀 The Result

By moving from Logistic Regression to a tuned XGBoost model, I raised Recall on the default class from 0% to 86% and reached an ROC-AUC of 0.8666.


🛠️ Tech Stack & Tools

  • Language: Python 3.x
  • Libraries: Pandas, NumPy, Scikit-Learn, XGBoost
  • Visualization: Matplotlib, Seaborn
  • Environment: Google Colab / Jupyter Notebook

📂 Project Workflow

1. Data Cleaning & Preprocessing

  • Handling Missing Values: Imputed MonthlyIncome with the median and NumberOfDependents with the mode.
  • Outlier Correction: Addressed "masking codes" (values 96-98) in payment history columns to ensure data integrity.
  • Feature Scaling: Applied StandardScaler to standardize features (zero mean, unit variance), which matters most for the linear model.
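
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the payment-history column names (`Late30`, `Late60`, `Late90`) are hypothetical, and replacing masking codes with the median of the valid values is an assumption, since the README only says they were "addressed".

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical names for the payment-history count columns.
LATE_COLS = ["Late30", "Late60", "Late90"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values and neutralize masking codes (96-98)."""
    out = df.copy()
    # Median imputation for income, mode imputation for dependents.
    out["MonthlyIncome"] = out["MonthlyIncome"].fillna(out["MonthlyIncome"].median())
    out["NumberOfDependents"] = out["NumberOfDependents"].fillna(
        out["NumberOfDependents"].mode().iloc[0]
    )
    # Assumption: masking codes are replaced by the median of the valid values.
    for col in LATE_COLS:
        out[col] = out[col].astype(float)
        masked = out[col].between(96, 98)
        out.loc[masked, col] = out.loc[~masked, col].median()
    return out


# Tiny synthetic frame, for illustration only.
toy = pd.DataFrame({
    "MonthlyIncome": [3000.0, np.nan, 5000.0, 7000.0],
    "NumberOfDependents": [0.0, 2.0, np.nan, 0.0],
    "Late30": [0, 98, 1, 0],
    "Late60": [0, 98, 0, 0],
    "Late90": [0, 98, 0, 2],
})
cleaned = clean(toy)

# Standardize every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(cleaned)
```

In a train/test setting the scaler would normally be fit on the training split only and then applied to the test split, so that test statistics do not leak into training.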

2. Feature Engineering

I engineered new features to capture deeper behavioral insights:

  • TotalLate: Aggregated count of all late payments.
  • SeverityScore: A weighted penalty in which each 90-day delay carries three times the weight of a 30-day delay.
  • HighUtilizationFlag: A binary indicator for customers exceeding their credit limit.
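
One way these three features could be computed is sketched below. The input column names (`Late30`, `Late60`, `Late90`, `Utilization`) are hypothetical, and the weight of 2 for 60-day delays is an assumption; the README only fixes the 3:1 ratio between 90-day and 30-day delays.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add TotalLate, SeverityScore and HighUtilizationFlag columns."""
    out = df.copy()
    # Total number of late payments across all delay buckets.
    out["TotalLate"] = out[["Late30", "Late60", "Late90"]].sum(axis=1)
    # 30-day = 1 and 90-day = 3 follow the README; 60-day = 2 is an assumption.
    out["SeverityScore"] = 1 * out["Late30"] + 2 * out["Late60"] + 3 * out["Late90"]
    # Utilization above 1.0 means the balance exceeds the credit limit.
    out["HighUtilizationFlag"] = (out["Utilization"] > 1.0).astype(int)
    return out


# Synthetic rows, for illustration only.
toy = pd.DataFrame({
    "Late30": [0, 2, 1],
    "Late60": [0, 1, 0],
    "Late90": [0, 0, 2],
    "Utilization": [0.2, 1.3, 0.9],
})
features = add_features(toy)
print(features[["TotalLate", "SeverityScore", "HighUtilizationFlag"]])
```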

3. Model Training & Comparison

I evaluated three different algorithms to determine the best business solution:

| Metric | Logistic Regression | Random Forest | Tuned XGBoost |
|---|---|---|---|
| ROC-AUC | 0.8255 | 0.8383 | 0.8666 |
| Recall (Class 1) | 0.00 | 0.17 | 0.86 |
| Business Value | Failed to find risk | Missed 83% of risk | Caught 86% of risk |
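
A comparison like the one above can be produced with a small evaluation helper. The sketch below uses synthetic, imbalanced data from scikit-learn rather than the project's dataset, and the hyperparameters are placeholders, so its scores will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 7% positives (defaulters).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.93, 0.07], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def evaluate(model):
    """Fit a classifier and report ROC-AUC and Recall on the positive class."""
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # probability of class 1
    pred = model.predict(X_te)               # labels at the default 0.5 threshold
    return {
        "roc_auc": roc_auc_score(y_te, proba),
        "recall": recall_score(y_te, pred),
    }


models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
results = {name: evaluate(model) for name, model in models.items()}

for name, scores in results.items():
    print(f"{name}: ROC-AUC={scores['roc_auc']:.4f}, Recall={scores['recall']:.2f}")
```

An XGBoost classifier exposes the same `fit` / `predict_proba` interface, so it could be passed to `evaluate` in the same way. How the original model was tuned (for example, whether class weighting was used) is not stated in this README.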

🔍 Key Findings (Feature Importance)

The XGBoost feature importances rank the following as the strongest predictors of default:

  1. TotalLate: Previous payment behavior is the #1 predictor.
  2. SeverityScore: The intensity of delays significantly impacts the risk profile.
  3. Revolving Utilization: Customers using a high percentage of their credit limit are at higher risk.
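
A ranking of this kind is typically read from a fitted tree model's `feature_importances_` attribute. The sketch below uses a scikit-learn Random Forest as a stand-in for XGBoost, on synthetic data in which the label is, by construction, driven only by a `TotalLate` column; the two noise columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic features: one informative column and two pure-noise columns.
X = pd.DataFrame({
    "TotalLate": rng.poisson(1.0, n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
# By construction, the label depends only on TotalLate.
y = (X["TotalLate"] >= 2).astype(int)

# max_features=None lets every split consider all columns,
# which keeps this toy ranking stable.
model = RandomForestClassifier(n_estimators=50, max_features=None, random_state=0)
model.fit(X, y)

# Pair each importance with its column name and sort in descending order.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```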

🏆 Conclusion

In credit risk modeling, Recall on the default class matters more than overall Accuracy, because a missed defaulter usually costs far more than a false alarm. A bank would rather "double-check" a safe customer than miss a customer who will default on a $50,000 loan. This project optimized for Recall (86%) to reflect that priority in credit assessment.
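
The gap between Accuracy and Recall on imbalanced data can be shown with a little arithmetic. The confusion-matrix counts below are hypothetical and chosen only to illustrate the point; they are not results from this project.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of actual defaulters that the model flags."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical portfolio: 1,000 customers, 70 of whom default.
# Model A labels everyone as safe; Model B flags aggressively.
always_safe = {"tp": 0, "fp": 0, "tn": 930, "fn": 70}
high_recall = {"tp": 60, "fp": 200, "tn": 730, "fn": 10}

for name, cm in [("always safe", always_safe), ("high recall", high_recall)]:
    acc = accuracy(**cm)
    rec = recall(cm["tp"], cm["fn"])
    print(f"{name}: accuracy={acc:.2f}, recall={rec:.2f}")
# always safe: accuracy=0.93, recall=0.00
# high recall: accuracy=0.79, recall=0.86
```

The first model looks better on Accuracy yet catches no defaulters, which is why Accuracy alone is a misleading target here.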


👤 Intern Details

  • Name: Tahir Ahmad
  • Domain: Data Science
  • Organization: CodeAlpha
  • Task: 1 (Credit Scoring)
