This project performs customer segmentation on a credit card dataset using K-Means clustering in PySpark. By grouping users based on their financial behavior, the goal is to provide insights that can support targeted marketing, personalized services, and strategic decision-making in financial institutions.
- Preprocess real-world customer transaction data
- Apply clustering techniques to segment users into distinct groups
- Use dimensionality reduction and scaling to improve model quality
- Visualize and evaluate the effectiveness of clustering results
- PySpark (MLlib, DataFrame APIs)
- K-Means Clustering
- PCA (Principal Component Analysis)
- RobustScaler
- Seaborn, matplotlib, pandas
- Jupyter Notebook
- Loaded credit card data from `CC_GENERAL.csv`
- Dropped non-numerical IDs
- Handled missing values by imputing median values
- Selected top 4 features based on variance
- Scaled features using `RobustScaler` to minimize outlier influence
- Applied PCA to reduce the feature space to 2 dimensions for better visualization and clustering performance
- Used Elbow Method to determine optimal number of clusters (k=3)
- Trained K-Means model using `pca_features`
- Assigned cluster labels to each customer
- Visualized clusters in 2D using Seaborn
- Evaluated model using Silhouette Score: 0.89, indicating strong cluster separation and cohesion
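
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the ID column name `CUST_ID`, the `_imputed` suffix, the intermediate column names `features`, `scaled_features` and `cluster`, the candidate range for `k`, and the seed are all assumptions, and the function names `top_variance_columns` and `segment_customers` are hypothetical. It assumes Spark 3.0 or later, which is needed for `RobustScaler`.

```python
from typing import Dict, List


def top_variance_columns(variances: Dict[str, float], n: int = 4) -> List[str]:
    """Return the n column names with the largest variance, highest first."""
    return sorted(variances, key=variances.get, reverse=True)[:n]


def segment_customers(csv_path: str = "CC_GENERAL.csv", k: int = 3):
    """Return (clustered DataFrame, elbow costs per k, silhouette score)."""
    # Spark is imported here so the helper above works without it installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator
    from pyspark.ml.feature import PCA, Imputer, RobustScaler, VectorAssembler
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("customer-segmentation").getOrCreate()
    raw = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Assumption: "CUST_ID" is the only non-numeric column in the file.
    raw = raw.drop("CUST_ID")
    cols = raw.columns
    raw = raw.select([F.col(c).cast("double").alias(c) for c in cols])

    # Median imputation into new "<name>_imputed" columns.
    imputed = [f"{c}_imputed" for c in cols]
    imputer = Imputer(strategy="median", inputCols=cols, outputCols=imputed)
    df = imputer.fit(raw).transform(raw)

    # Keep the four columns with the largest variance.
    variances = df.select([F.variance(c).alias(c) for c in imputed]).first().asDict()
    top4 = top_variance_columns(variances, n=4)

    # Assemble, robust-scale, then project onto two principal components.
    prep = Pipeline(stages=[
        VectorAssembler(inputCols=top4, outputCol="features"),
        RobustScaler(inputCol="features", outputCol="scaled_features"),
        PCA(k=2, inputCol="scaled_features", outputCol="pca_features"),
    ])
    features = prep.fit(df).transform(df).cache()

    # Elbow method: record the training cost for a range of candidate k values.
    costs = {}
    for candidate in range(2, 9):
        model = KMeans(k=candidate, seed=42, featuresCol="pca_features").fit(features)
        costs[candidate] = model.summary.trainingCost

    # Final model and per-customer cluster labels.
    kmeans = KMeans(k=k, seed=42, featuresCol="pca_features", predictionCol="cluster")
    clustered = kmeans.fit(features).transform(features)

    # ClusteringEvaluator computes the silhouette score by default.
    evaluator = ClusteringEvaluator(featuresCol="pca_features", predictionCol="cluster")
    return clustered, costs, evaluator.evaluate(clustered)
```

Calling `clustered, costs, score = segment_customers()` and plotting `costs` against `k` gives the elbow curve; the exact score obtained will depend on the seed and Spark version.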
- Optimal number of clusters: 3
- Key behavioral groupings were identified, supporting segmentation-based decision-making
- Visual plots clearly demonstrated cluster separation
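
A 2D cluster plot of this kind could be produced along these lines. This is a sketch only: the column names `pca_features` and `cluster`, the output file name, and the helper names `to_plot_records` and `plot_clusters` are assumptions, not taken from the notebook.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def to_plot_records(rows: Iterable[Tuple[Sequence[float], int]]) -> List[Dict]:
    """Flatten (2D vector, cluster label) pairs into plain dicts for plotting."""
    return [
        {"pc1": float(vec[0]), "pc2": float(vec[1]), "cluster": int(label)}
        for vec, label in rows
    ]


def plot_clusters(records: List[Dict], out_path: str = "clusters.png") -> None:
    """Scatter the two principal components, colored by cluster label."""
    # Plotting libraries are imported here so the helper above has no dependencies.
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    pdf = pd.DataFrame(records)
    ax = sns.scatterplot(data=pdf, x="pc1", y="pc2", hue="cluster", palette="deep", s=15)
    ax.set_title("Customer clusters in PCA space")
    ax.figure.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(ax.figure)
```

With a clustered Spark DataFrame named `clustered` (hypothetical), the records can be built with `to_plot_records(clustered.select("pca_features", "cluster").collect())`, since each collected row unpacks into a vector and a label.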
- `Customer_Clustering.ipynb`: Full code, step-by-step process
- Dataset: `CC_GENERAL.csv`
- Ensure you have PySpark and required Python libraries installed
- Update the path to your dataset if needed
- Open the notebook in Jupyter
- Run each cell sequentially
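
Before opening the notebook, a quick check like the one below can confirm that the libraries are importable and the dataset is where you expect it. The package list and the default path are assumptions based on the tools named in this README; adjust them to your setup. The function names are hypothetical.

```python
import importlib.util
from pathlib import Path
from typing import Iterable, List

# Assumed package list, based on the tools this README mentions.
REQUIRED = ("pyspark", "pandas", "matplotlib", "seaborn")


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_setup(dataset_path: str = "CC_GENERAL.csv") -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = [f"missing package: {name}" for name in missing_packages()]
    if not Path(dataset_path).is_file():
        problems.append(f"dataset not found: {dataset_path}")
    return problems


if __name__ == "__main__":
    issues = check_setup()
    for issue in issues:
        print(issue)
    if not issues:
        print("Environment looks ready.")
```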
- Financial institutions segmenting customers for credit offers
- Retail businesses personalizing campaigns based on spending patterns
- Analysts exploring unsupervised learning on behavioral data