This project contains a comprehensive analysis of student academic performance data, exploring factors that influence student success and conducting predictive modeling. The analysis covers exploratory data analysis, logistic regression modeling, and stress testing of predictive models.
The analysis is organized in the following sequence:
- Initial data exploration and visualization
- Statistical summary of student data
- Distribution analysis of key variables (GPA, study habits, etc.)
- Correlation analysis between variables
- Missing value analysis and data quality assessment
- In-depth analysis focusing on four key variables
- Feature importance analysis
- Implementation of logistic regression models
- Interpretation of model coefficients and statistical significance
- In the GLM, the response variable is the severity level
- Sensitivity analysis under four scenarios
- Re-evaluate the risk segmentation and compare it with the original model
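The modeling and stress-testing steps listed above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebooks' actual code: the column names (`study_hours`, `absences`, `passed`), the binary outcome, and the single "two extra absences" scenario are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the student data; all column names are hypothetical.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "study_hours": rng.normal(10, 3, n),
    "absences": rng.poisson(3, n),
})

# Assumed data-generating process: more study and fewer absences
# raise the probability of a positive outcome.
log_odds = -1.0 + 0.25 * df["study_hours"] - 0.4 * df["absences"]
prob = 1 / (1 + np.exp(-log_odds.to_numpy()))
df["passed"] = rng.binomial(1, prob)

# Fit a logistic regression with the statsmodels formula API.
model = smf.logit("passed ~ study_hours + absences", data=df).fit(disp=0)
print(model.params)   # coefficients on the log-odds scale
print(model.pvalues)  # statistical significance of each coefficient

# One illustrative stress scenario: every student has two more absences.
baseline = model.predict(df).mean()
stressed = df.assign(absences=df["absences"] + 2)
stressed_rate = model.predict(stressed).mean()
print(f"mean predicted probability: baseline={baseline:.3f}, stressed={stressed_rate:.3f}")
```

Comparing the mean predicted probability before and after the shift is one simple way to see how sensitive the fitted model is to a single input; the same pattern can be repeated for each scenario of interest.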
The dataset contains student information including:
- See "Glossary.pdf" for detailed variable descriptions
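As a rough illustration of the first checks one might run once the dataset is loaded, the sketch below builds a small synthetic frame and computes a statistical summary, the share of missing values per column, and a correlation matrix. The column names (`gpa`, `study_hours`, `absences`) are hypothetical placeholders; the real variables are those described in the glossary.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the student dataset; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "gpa": rng.normal(3.0, 0.5, n).clip(0, 4),
    "study_hours": rng.normal(10, 3, n).clip(0, None),
    "absences": rng.poisson(3, n).astype(float),
})

# Blank out ten values to mimic missing data.
missing_rows = rng.choice(n, size=10, replace=False)
df.loc[missing_rows, "study_hours"] = np.nan

summary = df.describe()           # count, mean, std, quartiles per column
missing_share = df.isna().mean()  # fraction of missing values per column
corr = df.corr()                  # pairwise Pearson correlation

print(summary)
print(missing_share)
print(corr)
```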
The project requires the following Python packages:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
Install dependencies using:
pip install pandas numpy seaborn matplotlib statsmodels scikit-learn
- Clone or download the project files
- Install dependencies as listed above
- Run the notebooks in sequence:
  - Start with EDA.ipynb for data exploration
  - Proceed to severity_on_4.ipynb for further analysis of the key factors
  - Continue with LR.ipynb for logistic regression modeling
  - Finish with stress_test.ipynb for stress testing
Each notebook is self-contained and can be run independently, but following the recommended sequence will provide the most comprehensive understanding of the analysis.
The analysis provides insights into:
- Key factors influencing student academic performance
- Predictive models for student success classification
- Model performance metrics and validation results
- Recommendations based on statistical findings
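For the model performance metrics mentioned above, the scikit-learn functions imported by the project can be applied along these lines. The labels and predicted probabilities here are made-up toy values, and the 0.5 classification threshold is an assumption for the example.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
)

# Toy ground-truth labels and predicted probabilities (illustrative only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])

# Turn probabilities into class labels with an assumed 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),  # uses probabilities, not labels
}
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(metrics)
print(cm)
```

Note that ROC AUC is computed from the predicted probabilities, while the other metrics depend on the chosen threshold.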
This project is for academic/analytical purposes. Please ensure proper data privacy and ethical considerations when working with student data.