This project involves the development of a predictive credit scoring model to help a financial institution classify unsecured loan applications. By leveraging data analytics, the model aims to minimize credit risk, optimize loan approval rates, and provide a clear understanding of the key factors that drive loan defaults.
A bank needs to refine its loan approval process to balance market competitiveness with risk management. The core challenge is to accurately distinguish between applicants who are likely to repay a loan ("good" customers) and those who are likely to default ("bad" customers).
The key business objectives are:
- Develop a model that accepts the maximum number of good applicants while correctly identifying at least 85% of bad applicants.
- Develop a model that accepts at least 70% of good applicants while rejecting the maximum possible number of bad applicants.
- Identify the most influential variables that determine a customer's repayment behavior to inform future lending strategies.
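The first two objectives amount to choosing a score cut-off on the ROC curve. The following is a minimal sketch of that idea, not the project's actual procedure. It assumes a fitted classifier has already produced `p_bad`, a predicted probability of default, and that an applicant is rejected when `p_bad` meets or exceeds the cut-off. The function name `pick_thresholds` and its parameter names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve


def pick_thresholds(y_true, p_bad, min_bad_recall=0.85, min_good_accept=0.70):
    """Return cut-offs for the two business operating points.

    Assumption: an applicant is rejected when p_bad >= threshold.
    """
    # With BAD=1 as the positive class:
    #   tpr = share of bad applicants rejected (recall on BAD)
    #   fpr = share of good applicants rejected, so 1 - fpr = share accepted
    fpr, tpr, thr = roc_curve(y_true, p_bad, drop_intermediate=False)
    good_accept = 1.0 - fpr

    def point(i):
        return {
            "threshold": float(thr[i]),
            "bad_rejected": float(tpr[i]),
            "good_accepted": float(good_accept[i]),
        }

    # Objective 1: catch at least 85% of bad applicants, accept the most good.
    ok1 = tpr >= min_bad_recall
    i1 = np.flatnonzero(ok1)[np.argmax(good_accept[ok1])]

    # Objective 2: accept at least 70% of good applicants, reject the most bad.
    ok2 = good_accept >= min_good_accept
    i2 = np.flatnonzero(ok2)[np.argmax(tpr[ok2])]

    return {"objective_1": point(i1), "objective_2": point(i2)}
```

In practice the cut-offs would be chosen on validation data and then checked on a held-out test set, since a threshold tuned on the same data it is evaluated on tends to look better than it is.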
The analysis is based on a historical dataset of past bank customers. The dataset (data.csv) contains customer-level information on financial health and loan characteristics. The target variable is BAD, where:
- 1: The applicant defaulted on the loan.
- 0: The applicant successfully paid off the loan.
Key predictor variables include:
- LOAN: The total amount of the loan requested.
- MORTDUE: The amount due on the applicant's existing mortgage.
- VALUE: The current value of the applicant's property.
- DEBTINC: The applicant's debt-to-income ratio.
- YOJ: Years at the applicant's current job.
- DEROG: Number of major derogatory reports.
- DELINQ: Number of delinquent credit lines.
- CLAGE: Age of the oldest credit line in months.
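As a rough illustration of a first data-quality check on a dataset with this layout, the sketch below reports each column's type, share of missing values, and number of distinct values, along with the overall default rate. The helper name `summarize_quality` is hypothetical, and the only thing assumed about the file is that it contains the `BAD` column described above.

```python
import pandas as pd


def summarize_quality(df, target="BAD"):
    """Return a per-column quality table and the overall default rate."""
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_share": df.isna().mean(),  # fraction of rows missing
            "n_unique": df.nunique(),           # distinct non-missing values
        }
    )
    # BAD is coded 0/1, so its mean is the share of defaulted loans.
    default_rate = float(df[target].mean())
    return summary, default_rate
```

A typical call would be `summary, rate = summarize_quality(pd.read_csv("data.csv"))`, after which `summary.sort_values("missing_share", ascending=False)` lists the columns most in need of an imputation strategy.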
The project follows a structured data analytics workflow to deliver a robust and interpretable solution.
- Exploratory Data Analysis (EDA):
  - Conducted a thorough investigation of all variables to understand their distributions, identify outliers, and assess data quality.
  - Used visualizations to analyze relationships among variables and their association with the target variable (BAD), and to form a preliminary ranking of feature importance.
- Data Pre-processing:
  - Developed strategies for handling missing values and incomplete records.
  - Transformed categorical variables and created new features where appropriate to prepare the dataset for modeling.
- Modeling:
  - Built and evaluated at least two distinct classification algorithms to predict loan default.
  - Selected performance measures aligned with the specific business objectives (e.g., recall, precision, ROC-AUC).
  - Incorporated feature selection methods to create parsimonious and effective models.
- Performance Evaluation and Optimization:
  - Tuned model hyperparameters to optimize performance against the predefined business goals.
  - Assessed the generalization performance of the final recommended models to ensure they are robust and reliable for future predictions.
- Business Recommendations:
  - Translated the model's findings into actionable business insights.
  - Provided clear recommendations for the bank's lending strategy, highlighting key assumptions and potential limitations of the analytical solution.
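The pre-processing and modeling steps above can be sketched as a single scikit-learn pipeline. This is an illustrative outline under stated assumptions, not the project's actual configuration. The imputation strategies (median and most frequent), one-hot encoding, logistic regression as the example classifier, and the helper names `build_pipeline` and `cv_auc` are all choices made for this example. Columns are split into numeric and categorical by dtype rather than by name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline():
    """Impute, encode, and scale inside the pipeline to avoid leakage."""
    numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    categorical = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore"),
    )
    pre = ColumnTransformer(
        [
            ("num", numeric, make_column_selector(dtype_include=np.number)),
            ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
        ]
    )
    # class_weight="balanced" is an assumption made for this sketch.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    return Pipeline([("pre", pre), ("clf", clf)])


def cv_auc(df, target="BAD", n_splits=5, seed=0):
    """Stratified k-fold ROC-AUC for the pipeline on a labeled DataFrame."""
    X = df.drop(columns=[target])
    y = df[target]
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(build_pipeline(), X, y, cv=cv, scoring="roc_auc")
```

Keeping imputation and scaling inside the pipeline means each cross-validation fold fits them on its own training portion only, which gives a more honest estimate of generalization performance.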
This analysis was conducted using Python and/or SAS. The primary Python libraries used are:
- Pandas for data manipulation and analysis.
- NumPy for numerical operations.
- Matplotlib & Seaborn for data visualization.
- Scikit-learn for building and evaluating machine learning models.
Classification algorithms considered:
- Histogram Gradient Boosting Classifier
- Random Forest Classifier
- Support Vector Classifier
- Logistic Regression
- k-Nearest Neighbors Classifier
- Neural Network (Multi-Layer Perceptron)
- Decision Tree Classifier