
A port of defaulter classification networks from Keras to PyTorch.


hml230/loan-default-risks-analysis

Risky Borrowers Identification with PyTorch

Python PyTorch Jupyter NumPy Pandas PySpark Scikit-learn License: MIT


Background and Introduction | Sourcing data | Data Transformation and EDA | Visualisation | Model Evaluation | Conclusions

Background and Introduction

This project is a PyTorch port of an earlier collaborative project. The task was to select a loan/credit default dataset and develop classification models to predict, or investigate the factors that influence, whether a borrower would default, recorded as 0 (non-default) or 1 (default) in the default_ind variable.

The networks were originally written in TensorFlow, but I found that PyTorch offered a better interface and development experience, so I ported them.

In this exercise, I investigated a variation of the LoanStatNew dataset using PyTorch. I applied dimensionality reduction, feature selection and class weighting, then compared the performance of different model architectures.

See the notebooks here: Analysis Notebook and Training Notebook

Sourcing data

The dataset is a variation of Kaggle's Lending Club dataset, with additional features derived from the original columns. It offers a wide view of different aspects of the borrowers' financial status and a high degree of flexibility for model selection.

The data dictionary with the schema definition can be found in the .md file.

Data Transformation

The proposed processing pipeline includes the following steps:

  • Dropping columns that were deemed uninformative or had a high proportion of missing values

  • Dropping rows with missing values in columns that had low missingness, including the target default_ind (the dataset was large enough to afford this)

  • Converting the time-based columns to Timestamp type, stored as the number of days since 1970-01-01

  • Preprocessing the remaining categorical variables with MLlib's StringIndexer() estimator and the numerical variables by casting

  • Filling any remaining missing or 'Unknown' values with the median once the train set had been obtained
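Two of the steps above, the date conversion and the median fill, can be sketched in plain Python. This is only an illustration and not the project's MLlib pipeline: the ISO date format, the toy income column and the helper names are all assumptions, and the sketch reads "after acquiring the train set" as meaning the median is learned from the train split and then reused elsewhere.

```python
from datetime import date
from statistics import median

EPOCH = date(1970, 1, 1)


def days_since_epoch(iso_day: str) -> int:
    """Convert an ISO date string (YYYY-MM-DD) to days since 1970-01-01."""
    return (date.fromisoformat(iso_day) - EPOCH).days


def fit_median(train_values):
    """Median of the observed (non-missing) training values."""
    observed = [v for v in train_values if v is not None]
    return median(observed)


def fill_missing(values, fill_value):
    """Replace None with a fill value that was learned beforehand."""
    return [fill_value if v is None else v for v in values]


# Hypothetical toy column; None stands in for a missing or 'Unknown' entry.
train_income = [40_000, None, 55_000, 70_000]
test_income = [None, 62_000]

income_median = fit_median(train_income)  # learned from the train split only
train_filled = fill_missing(train_income, income_median)
test_filled = fill_missing(test_income, income_median)

issue_day = days_since_epoch("2011-12-01")  # hypothetical issue date
```

Reusing the train-split median for every other split keeps information from the validation and test rows out of the preprocessing step.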

Overall, the dataset's quality is relatively high: no inconsistent string values requiring regex clean-up were found. The only other prominent issue is class imbalance, which was mitigated using class weights.

| default_ind | count |
|---|---|
| 0 | 647004 |
| 1 | 37149 |

One design change from the original work was to store the processed data in .parquet format instead of .csv. The original code was written in a single Colab notebook, so replication was not an issue there: we used the .toPandas() method and converted the dataset to tensors via NumPy.

Visualisation

Correlation Heatmap

Distribution of Relevant Features

After visualising the underlying patterns of the dataset, we used compute_class_weight() from scikit-learn to calculate the class weights applied during model training.
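As a rough illustration of what those weights look like, the sketch below reproduces the formula scikit-learn documents for class_weight="balanced", namely n_samples / (n_classes * count_per_class), using the class counts from the table above. Whether the project used the "balanced" mode, and whether the weights were computed on these exact counts rather than on the train split only, are assumptions here; the helper name is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weights following the 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Counts taken from the default_ind table; they may differ from the train split.
weights = balanced_class_weights({0: 647004, 1: 37149})

# Relative cost of misclassifying a defaulter versus a non-defaulter.
imbalance_ratio = weights[1] / weights[0]
```

With these counts the minority (default) class receives a weight of roughly 9.2 against roughly 0.53 for the majority class, so each defaulter counts about 17 times as much in the loss.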

Building the models

Using nn.Module, I created three classes: a Multilayer Perceptron (MLP), a bagging model with MLP base models, and a convolutional network (CNN) with 1-D convolution layers. The architectures are relatively simple, as I initially built them following Keras' documentation.
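For readers unfamiliar with the pattern, here is a minimal sketch of how an MLP and a bagging wrapper around MLP base models might be expressed with nn.Module. It is not the repository's code: the layer sizes, the number of estimators, the class names and the choice to average predicted probabilities are all assumptions made for illustration.

```python
import torch
from torch import nn


class MLP(nn.Module):
    """Small feed-forward classifier that outputs one logit per row."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).reshape(-1)


class BaggedMLP(nn.Module):
    """Averages the predicted probabilities of several independent MLPs."""

    def __init__(self, n_features: int, n_estimators: int = 5):
        super().__init__()
        self.members = nn.ModuleList(
            [MLP(n_features) for _ in range(n_estimators)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = torch.stack([torch.sigmoid(m(x)) for m in self.members])
        return probs.mean(dim=0)


def bootstrap_indices(n_rows: int) -> torch.Tensor:
    """Row indices sampled with replacement, used to train one base model."""
    return torch.randint(0, n_rows, (n_rows,))


# Hypothetical shapes: 8 rows with 20 features each.
x = torch.randn(8, 20)
ensemble = BaggedMLP(n_features=20, n_estimators=3)
probs = ensemble(x)           # one averaged probability per row
idx = bootstrap_indices(8)    # resampled rows for a single base model
```

In a bagging setup each base model would be trained on its own bootstrap sample, and only the averaged output is used at prediction time.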

The first two models consist of a few Dense/Linear layers, while the CNN follows the usual design. All models are validated using a DataLoader instance created from the validation set.

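from torch.utils.data import DataLoader

# val_tensor_ds and device are assumed to be defined beforehand.
# Fixed order for validation; pin host memory only when a GPU is in use.
val_loader = DataLoader(val_tensor_ds,
                        batch_size=32,
                        shuffle=False,
                        pin_memory=(device.type == 'cuda'))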
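A validation pass over such a loader could look like the sketch below, which counts true positives, false positives and false negatives to derive precision and recall. The evaluate helper, the 0.5 threshold, the synthetic data and the hand-set linear model are all hypothetical stand-ins rather than the notebook's code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def evaluate(model, loader, device, threshold=0.5):
    """Return (precision, recall) for the positive class over a loader."""
    model.eval()
    tp = fp = fn = 0
    for features, targets in loader:
        features = features.to(device)
        targets = targets.to(device).long()
        logits = model(features).reshape(-1)
        preds = (torch.sigmoid(logits) >= threshold).long()
        tp += int(((preds == 1) & (targets == 1)).sum())
        fp += int(((preds == 1) & (targets == 0)).sum())
        fn += int(((preds == 0) & (targets == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

# Synthetic stand-in for the validation set: the label depends on feature 0.
X = torch.randn(64, 10)
y = (X[:, 0] > 0).float()
demo_loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=False,
                         pin_memory=(device.type == "cuda"))

# Hand-set linear model whose logit equals feature 0, so it separates the toy data.
model = nn.Linear(10, 1).to(device)
with torch.no_grad():
    model.weight.zero_()
    model.weight[0, 0] = 1.0
    model.bias.zero_()

precision, recall = evaluate(model, demo_loader, device)
```

Accumulating raw counts across batches, instead of averaging per-batch metrics, avoids distortions when some batches contain few or no defaulters.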

Evaluation

  • The MLP and bagging model achieved high performance, with precision and recall between roughly 94% and 99%.

    • Including class weights in the loss function may have reduced the bias towards the majority class in these models and improved the results.
  • The CNN model performed noticeably worse than the other two, which is expected since CNNs are not typically fitted to tabular data.

  • The area under the curve (AUC) is above 0.90 for all networks and close to 1 for the MLP and bagging model, meaning they almost always score a defaulter above a non-defaulter.

    • Such high AUC values could also indicate overfitting; since regularisation/normalisation was applied, data leakage is a more likely explanation.

Below is a summary of the TensorFlow models' performance on the test set.

| Model | Accuracy | AUC | Precision | Recall | Loss | False Negative Count |
|---|---|---|---|---|---|---|
| Multilayer Perceptron | 0.9981 | 0.9992 | 0.9763 | 0.9884 | 0.0313 | 107.0 |
| Bagging Approach MLP | 0.9958 | 0.9984 | 0.9378 | 0.9868 | 0.0506 | 122.0 |
| 1D Convolutional Network | 0.9651 | 0.9651 | 0.7551 | 0.5254 | 0.1647 | 4394.0 |

For PyTorch:


====================== Multilayer Perceptron ======================
Test Accuracy: 0.9964450142782213
Precision: 0.9685643296964459
Recall: 0.9656435110393107
ROC AUC: 0.9906845145536228

====================== Bagging with MLP base ======================
Test Accuracy: 0.9971210443499039
Precision: 0.9612761045230349
Recall: 0.9865374259558427
ROC AUC: 0.9987235500711122

====================== CNN with 1-D Convoluitonal layers ======================
Test Accuracy: 0.9458884550381724
Precision: 0.0
Recall: 0.0
ROC AUC: 0.9009592831143862

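The convolutional network appears to have predicted the non-default class for almost every test sample and produced no true positives, which gives zero precision and recall. Its accuracy of about 0.946 is close to the share of non-defaulters in the data (647,004 of 684,153, roughly 0.946), which is what predicting non-default for every sample would achieve. The AUC of about 0.90 nevertheless shows that its scores still rank borrowers informatively.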
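The combination of zero precision and recall with a high AUC is possible because AUC depends only on how the scores rank samples, while precision and recall depend on a decision threshold. The toy sketch below uses made-up scores (not the CNN's outputs) and hypothetical helper names to show the effect: every score sits below 0.5, so nothing is predicted positive, yet positives are mostly ranked above negatives.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(labels, scores, threshold=0.5):
    """Precision and recall for the positive class at a given threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: 8 non-defaulters, 2 defaulters, all scores below 0.5.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.02, 0.05, 0.03, 0.10, 0.04, 0.08, 0.01, 0.06, 0.30, 0.09]

auc = roc_auc(labels, scores)                            # ranking quality
at_default = precision_recall(labels, scores)            # threshold 0.5
at_lowered = precision_recall(labels, scores, 0.09)      # lower threshold
```

In this toy case lowering the threshold recovers both defaulters at the cost of one false positive, which suggests that threshold tuning is worth checking before discarding a model with a reasonable AUC.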

Conclusion

We previously prototyped tree-based models, which were more robust and performed better than the neural networks. On the other hand, PyTorch's interface offers a pleasant development experience and scales more easily thanks to its CUDA integration.
