Background and Introduction • Sourcing data • Data Transformation and EDA • Visualisation • Model • Evaluation • Conclusions
This project is a PyTorch port of a collaborative project I worked on. The task was to select a loan/credit default dataset and develop classification models to predict, or investigate the factors that influence, whether a borrower would default, recorded as 0 (non-default) or 1 (default) in the `default_ind` variable.
The networks were previously written in TensorFlow, but PyTorch offered a better interface and development experience, so I ported them.
In this exercise, I investigated a variation of the LoanStatNew dataset using PyTorch. I performed dimensionality reduction, feature selection, and class weighting to compare the performance of different model architectures.
See the notebooks here: Analysis Notebook and Training Notebook
The dataset is a variation of Kaggle's Lending Club dataset, with additional features derived from the original columns. It offers a broad view of the borrowers' financial status and considerable flexibility for model selection.
The data dictionary with the schema definition can be found in the .md file.
The proposed processing pipeline includes the following steps (a minimal sketch follows the list):

- Dropping a range of columns deemed uninformative or carrying a high proportion of missing values
- Dropping rows (since the data size was adequate) for columns with low missingness, including the target, `default_ind`
- Transforming the time-based columns into Timestamp type, stored as the number of days since 1970-01-01
- Preprocessing the remaining categorical and numerical variables using MLlib's `StringIndexer()` estimator and casting, respectively
- Filling any remaining missing or 'Unknown' values with the median, computed after splitting off the train set
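A minimal sketch of these steps, assuming a Spark DataFrame `df` and illustrative column names from the data dictionary (`issue_d`, `grade`, `annual_inc`):

```python
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer

# Drop rows with missing values in low-missingness columns, including the target.
df = df.dropna(subset=["default_ind"])

# Convert a date string column into days since 1970-01-01.
df = df.withColumn(
    "issue_d_days",
    F.datediff(F.to_date("issue_d", "MMM-yyyy"), F.lit("1970-01-01")),
)

# Index a categorical column; handleInvalid="keep" tolerates unseen labels later.
indexer = StringIndexer(inputCol="grade", outputCol="grade_idx", handleInvalid="keep")
df = indexer.fit(df).transform(df)

# Fill remaining numeric gaps with the median; in the real pipeline this
# median comes from the train split only, to avoid leakage.
median_inc = df.approxQuantile("annual_inc", [0.5], 0.01)[0]
df = df.fillna({"annual_inc": median_inc})
```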
Overall, the dataset's quality is relatively high: no inconsistent string values requiring regex clean-up were found. The most prominent remaining issue is class imbalance, which was alleviated using class weights.
| default_ind | count |
|---|---|
| 0 | 647004 |
| 1 | 37149 |
One design change from the original work was storing the processed data in .parquet format instead of .csv. Originally, all code lived in a single Colab notebook, so replication was not an issue: we used the `.toPandas()` method to convert the dataset into tensors via NumPy, as sketched below.
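A sketch of that hand-off, assuming a processed Spark DataFrame `train_df`, an active `spark` session, and a hypothetical local path:

```python
import torch

# Persist the processed split as parquet.
train_df.write.mode("overwrite").parquet("data/train.parquet")

# In the training notebook: read back, convert to pandas, then to tensors.
train_pd = spark.read.parquet("data/train.parquet").toPandas()
X = torch.tensor(train_pd.drop(columns="default_ind").values, dtype=torch.float32)
y = torch.tensor(train_pd["default_ind"].values, dtype=torch.float32)
```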
After visualising the underlying patterns of the dataset, we employed `compute_class_weight()` from scikit-learn to calculate the class weights for model training.
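A sketch of that step, assuming binary labels in a NumPy array `y_train`; with `BCEWithLogitsLoss`, the two class weights can be folded into a single `pos_weight` ratio:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

# "balanced" weights are inversely proportional to class frequencies.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_train)

# Up-weight the minority (default) class in the loss.
pos_weight = torch.tensor(weights[1] / weights[0], dtype=torch.float32)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```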
Using `nn.Module`, I created three classes: a multilayer perceptron (MLP), a bagging model with MLP base learners, and a convolutional network (CNN) with 1-D convolution layers. The architectures are relatively simple, as I initially built them from Keras's documentation.
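As an illustration, a minimal `nn.Module` sketch of the MLP, with hypothetical layer sizes rather than the exact architecture used:

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),  # single logit; pairs with BCEWithLogitsLoss
        )

    def forward(self, x):
        return self.net(x)
```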
The first two models consist of Dense/Linear layers, while the CNN follows the usual convolution-and-pooling design. All models are validated using a DataLoader instance created from the validation set:
```python
val_loader = DataLoader(val_tensor_ds,
                        batch_size=32,
                        shuffle=False,
                        pin_memory=(device.type == 'cuda'))
```
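A sketch of the validation pass over `val_loader`, assuming a trained `model` that outputs one logit per sample:

```python
import torch

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for xb, yb in val_loader:
        xb, yb = xb.to(device), yb.to(device)
        # Threshold the sigmoid of the logit at 0.5 for the hard label.
        preds = (torch.sigmoid(model(xb)).squeeze(1) > 0.5).float()
        correct += (preds == yb).sum().item()
        total += yb.size(0)
print(f"Validation accuracy: {correct / total:.4f}")
```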
- The MLP and bagging models achieved high performance, with precision and recall in the high 90s (per cent).
- The inclusion of class weights in the loss function may have reduced the selection bias within these models and improved the results.
- The CNN model performed noticeably worse than the other two, which is expected since CNNs are not typically suited to tabular data.
- The area under the curve (AUC) is close to 1 across all networks, meaning they are very likely to rank a defaulting borrower above a non-defaulting one.
- Such a high AUC also suggests potential overfitting, but since regularisation/normalisation was applied, it may instead indicate data leakage.
Below is a summary of the TensorFlow models' performance on the test set.
| Model | Accuracy | AUC | Precision | Recall | Loss | False Negatives |
|---|---|---|---|---|---|---|
| Multilayer Perceptron | 0.9981 | 0.9992 | 0.9763 | 0.9884 | 0.0313 | 107 |
| Bagging Approach MLP | 0.9958 | 0.9984 | 0.9378 | 0.9868 | 0.0506 | 122 |
| 1D Convolutional Network | 0.9651 | 0.9651 | 0.7551 | 0.5254 | 0.1647 | 4394 |
For PyTorch, the corresponding test-set results are:

```
====================== Multilayer Perceptron ======================
Test Accuracy: 0.9964450142782213
Precision: 0.9685643296964459
Recall: 0.9656435110393107
ROC AUC: 0.9906845145536228
====================== Bagging with MLP base ======================
Test Accuracy: 0.9971210443499039
Precision: 0.9612761045230349
Recall: 0.9865374259558427
ROC AUC: 0.9987235500711122
====================== CNN with 1-D Convolutional layers ======================
Test Accuracy: 0.9458884550381724
Precision: 0.0
Recall: 0.0
ROC AUC: 0.9009592831143862
```
The convolutional network never predicted the positive (default) class at test time, which explains the precision and recall of 0 despite the seemingly informative AUC and accuracy: with such a strong class imbalance, predicting only the majority class still yields high accuracy, as illustrated below.
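A toy illustration, with made-up labels, of how an all-negative classifier scores high accuracy but zero precision and recall under class imbalance:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# ~5% positives, mirroring a strong class imbalance.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100)  # the model predicts the majority class for everything

print(accuracy_score(y_true, y_pred))                    # 0.95
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred))                      # 0.0
```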
We had previously prototyped tree-based models, which were more robust and performed better than the neural networks. On the other hand, PyTorch's interface offers a pleasant development experience and scales easily thanks to its CUDA integration.


