A machine learning project that implements the three core components of the Basel III credit risk framework using historical Lending Club loan data (2007–2014).
Predicts the likelihood that a borrower will default on a loan.
- Algorithm: Logistic Regression with custom p-value computation
- Feature Selection: LassoCV (5-fold cross-validation)
- Evaluation: ROC-AUC, Confusion Matrix, Classification threshold = 0.75
- Output:
pd_model.sav(generated on run, not tracked in repo)
Estimates the percentage of the loan amount lost if a default occurs. Uses a two-stage approach:
- Stage 1 — Classification: Logistic Regression to predict whether any loss occurs (threshold = 0.45)
- Stage 2 — Regression: Linear Regression to estimate the magnitude of loss
- Evaluation: ROC-AUC (Stage 1), MSE + R² (Stage 2)
- Output:
lgd_model_stage_1.sav,lgd_model_stage_2.sav(generated on run, not tracked in repo)
Estimates the Credit Conversion Factor (CCF) — the proportion of the loan outstanding at the time of default.
- Algorithm: Linear Regression with custom p-value computation
- Target:
CCF = (funded_amnt − total_rec_prncp) / funded_amnt - Evaluation: R², MSE, Pearson Correlation
- Output:
ead_model.sav(generated on run, not tracked in repo)
Source: Lending Club Loan Data (2007–2014) — available on Kaggle.
| Split | Samples |
|---|---|
| Train | 373,028 |
| Test | 93,257 |
Class Distribution (PD target): ~89% good loans, ~11% bad loans
Required raw file: loan_data_2007_2014.csv (place in project root before running preprocessing)
Credit risk/
├── 01_Data_Preprocessing.ipynb # Data cleaning, encoding, feature engineering
├── 02_Probability_of_Default_Model.ipynb # PD model training and evaluation
├── 03_LGD_and_EAD_Models.ipynb # LGD (2-stage) and EAD model training
│
├── loan_data_2007_2014.csv # Raw dataset (not included — download separately)
├── loan_data_inputs_train.csv # Preprocessed training features (generated)
├── loan_data_inputs_test.csv # Preprocessed test features (generated)
├── loan_data_targets_train.csv # Training labels
├── loan_data_targets_test.csv # Test labels
│
└── requirements.txt
pip install -r requirements.txtPython version: 3.9+
Run the notebooks in order:
01_Data_Preprocessing.ipynb— Cleans and preprocessesloan_data_2007_2014.csv, outputs train/test CSV files.02_Probability_of_Default_Model.ipynb— Trains and saves the PD model.03_LGD_and_EAD_Models.ipynb— Trains and saves the LGD and EAD models.
Both logistic and linear regression models are extended to compute p-values for every feature coefficient, enabling statistical significance testing beyond standard sklearn outputs.
- Logistic Regression: Uses the Fisher Information Matrix to derive coefficient variances, then computes two-tailed z-test p-values.
- Linear Regression: Computes t-statistics and p-values using the t-distribution.
LassoCV automatically selects significant features for the PD model by shrinking irrelevant coefficients to zero, reducing noise and improving generalization.
| Metric | Value |
|---|---|
| ROC-AUC | 0.690 |
| Accuracy (threshold = 0.75) | 87.3% |
Confusion Matrix (test set, 93,257 samples):
| Predicted Good | Predicted Default | |
|---|---|---|
| Actual Good | 1,066 | 9,128 |
| Actual Default | 2,680 | 80,383 |
| Metric | Value |
|---|---|
| ROC-AUC | 0.620 |
| Accuracy (threshold = 0.45) | 57.7% |
Confusion Matrix:
| Predicted 0 | Predicted 1 | |
|---|---|---|
| Actual 0 | 523 | 3,250 |
| Actual 1 | 410 | 4,465 |
| Metric | Value |
|---|---|
| Pearson Correlation (actual vs predicted) | 0.325 |
| Mean predicted LGD | 8.9% |
| Std | 5.0% |
| Range | 0% – 23.5% |
| Metric | Value |
|---|---|
| R² | 0.278 |
| MSE | 0.0290 |
| Pearson Correlation | 0.527 |
| Mean predicted CCF | 73.6% |
After training all three models, the expected loss for any loan can be estimated as:
Expected Loss (EL) = PD × LGD × EAD
This is the standard Basel III formula used by financial institutions to quantify credit risk.