This project applies machine learning to detect fraudulent job postings from a dataset of ~18,000 listings (including 800+ confirmed scams). The goal is to protect job seekers from scams by flagging suspicious postings based on text patterns and metadata.
- Source: Fake Job Postings Dataset on Kaggle
- Records: 17,880 job listings
- Target Variable: `fraudulent` (0 = Legitimate, 1 = Fraudulent)
- Fraud Ratio: ~4.5% of records (highly imbalanced dataset)
- Data Cleaning & Imputation
  - Resolved missing values in `title`, `description`, `company_profile`, and salary fields.
  - Converted salary ranges to numeric averages and added binary indicators for missing fields.
  - Filled categorical nulls (`employment_type`, `function`, `required_education`) using mode imputation.
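The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column name `salary_range`, the `low-high` string format, the helper names `salary_midpoint` and `clean_postings`, and the tiny sample frame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def salary_midpoint(value):
    """Return the mean of a 'low-high' salary string, or NaN if it cannot be parsed."""
    if not isinstance(value, str):
        return np.nan
    parts = value.split("-")
    if len(parts) != 2:
        return np.nan
    try:
        low, high = float(parts[0]), float(parts[1])
    except ValueError:
        return np.nan
    return (low + high) / 2


def clean_postings(df):
    """Apply the cleaning steps to a copy of df (column names are assumed)."""
    df = df.copy()
    # Record missingness first, so the signal survives the fills that follow.
    for col in ["salary_range", "company_profile", "description"]:
        df[f"{col}_missing"] = df[col].isna().astype(int)
    # Salary range -> numeric average; unparseable or missing values stay NaN.
    df["salary_avg"] = df["salary_range"].apply(salary_midpoint)
    # Free-text columns: replace nulls with an empty string.
    for col in ["title", "description", "company_profile"]:
        df[col] = df[col].fillna("")
    # Categorical columns: mode imputation (assumes at least one non-null value).
    for col in ["employment_type", "function", "required_education"]:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df


# Hypothetical three-row sample, only to exercise the function.
sample = pd.DataFrame({
    "title": ["Data Analyst", None, "Clerk"],
    "description": ["Analyze data", "Earn money fast", None],
    "company_profile": ["Acme Corp", None, None],
    "salary_range": ["20000-28000", None, "unknown"],
    "employment_type": ["Full-time", None, "Full-time"],
    "function": ["Engineering", "Engineering", None],
    "required_education": [None, "Bachelor's Degree", "Bachelor's Degree"],
})
cleaned = clean_postings(sample)
print(cleaned[["salary_avg", "salary_range_missing", "employment_type"]])
```

The sketch leaves `salary_avg` as NaN where no range could be parsed; how the notebook fills those values is not stated, so a fill strategy would need to be chosen before modeling.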
- Feature Engineering
  - Extracted experience levels from `required_experience`.
  - Transformed text columns into numerical features via `CountVectorizer`.
  - Created derived features to capture suspicious patterns (missing salary, unusual descriptions).
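As a rough sketch of the text vectorization and derived-feature steps, the snippet here builds a sparse feature matrix from a few made-up descriptions. The `binary=True` and `ngram_range=(1, 3)` settings, the 8-word "short description" threshold, and the sample data are assumptions for illustration; the notebook's actual settings are not documented here.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical descriptions and missing-salary flags.
descriptions = [
    "urgent requirement work from home no experience needed",
    "senior data engineer to build and maintain etl pipelines",
    "work from home and earn money fast",
]
salary_missing = np.array([1, 0, 1])

# Presence/absence of 1- to 3-word phrases; binary features suit Bernoulli Naive Bayes.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))
text_features = vectorizer.fit_transform(descriptions)

# Derived feature: flag unusually short descriptions (threshold is an assumption).
word_counts = np.array([len(d.split()) for d in descriptions])
short_description = (word_counts < 8).astype(int)

# Stack the text features and the derived flags into one sparse matrix.
derived = csr_matrix(np.column_stack([salary_missing, short_description]))
X = hstack([text_features, derived]).tocsr()

print(X.shape)
print("work from home" in vectorizer.vocabulary_)
```

Keeping n-grams up to length three lets multi-word phrases such as "work from home" become features of their own rather than being split into unrelated single words.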
- Modeling
  - Handled class imbalance using SMOTE.
  - Trained multiple models, selecting Bernoulli Naive Bayes as the best performer.
  - Split the dataset into training and testing sets (80/20).
- Evaluation
  - Measured performance with precision, recall, F1-score, and accuracy.
  - Generated a confusion matrix and classification report.
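To make the evaluation metrics concrete, here is a small worked example on ten hand-written labels (illustrative values, not results from this project). It uses scikit-learn's metric functions and treats class 1 (fraudulent) as the positive class.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Illustrative labels: 6 legitimate (0) and 4 fraudulent (1) postings.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Metrics for the fraudulent class only.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(classification_report(y_true, y_pred, target_names=["Legitimate", "Fraudulent"]))
```

Here one legitimate posting is wrongly flagged and one fraudulent posting is missed, so precision and recall for the fraudulent class are both 3/4 while accuracy is 8/10. On a dataset where roughly 4.5% of rows are fraudulent, predicting "Legitimate" for every row would already score about 95.5% accuracy, which is why the fraudulent-class precision, recall, and F1 are the more informative numbers.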
- Accuracy: 96.4%
- F1-score (Fraudulent class): 0.95
- Precision: 0.94 | Recall: 0.96
- Outperformed the baseline logistic regression model by 20% in fraud-detection precision.
Business Impact:
This pipeline successfully identified 800+ fake postings, reducing risk for job seekers and helping platforms maintain trust by catching scams early.
- Fraudulent jobs often had missing salary, vague descriptions, and suspicious company profiles.
- Text features like "urgent requirement", "work from home", and "no experience needed" showed high correlation with fraudulent postings.
- Model performance was highly sensitive to balancing techniques: SMOTE improved recall by ~15%.
- `fraud_detection.ipynb`: Full notebook with code & outputs
- `fake_job_postings.csv`: Dataset (Kaggle-sourced, ~18k rows)
- `README.md`: Project documentation
- Python: pandas, NumPy, scikit-learn, imbalanced-learn
- ML Models: Naive Bayes, Logistic Regression (baseline)
- NLP: CountVectorizer, text preprocessing
- Visualization: Matplotlib, Seaborn
- Deploy model as a Streamlit app where users can paste job descriptions and check fraud risk.
- Integrate advanced NLP (TF-IDF, Word2Vec, BERT) for semantic context.
- Build a real-time fraud monitoring dashboard for job boards.