This repository implements GAMBA (Generalized Additive Model and Bayesian Analysis), a framework for pay equity analysis using GAMs with frequentist (REML) and Bayesian (MCMC) estimation. It demonstrates gender pay gap analysis, predictive modeling, and visualization of interaction effects across job level and family status.
It is built on the cookiecutter data science template, providing a clean, reproducible structure and CI integration with pytest and flake8.
This repository accompanies the paper:
Kuna, M., Kowalczyk, M.: Bayesian and Frequentist Approaches to Pay Equity Analysis Using Generalized Additive Models. PP-RAI 2026.
The repository follows a clear modular design. Each step – from data ingestion to feature engineering, model training, and visualization – has its own module under src/.
- `dataset.py` – loading raw CSV data
- `features.py` – preprocessing and feature creation
- `modeling/train.py` – train all GAMBA model variants (GCV Default, Grid Search GCV, Grid Search AIC, Bayesian PyMC)
- `plots.py` – visualizations (gender pay gap, counterfactual analysis, motherhood penalty)
- `config.py` – central place for constants, paths, feature lists, MCMC parameters, and GAM specifications

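
To make the role of the configuration module concrete, here is a minimal sketch of what such a file might contain. The paths, column names, and minimum wage come from this README; the seed and the MCMC values are hypothetical placeholders, not the repository's actual settings.

```python
from pathlib import Path

# Paths taken from the directory layout described in this README.
PROJECT_ROOT = Path(".")
RAW_DATA_PATH = PROJECT_ROOT / "data" / "raw" / "salary_data_2009.csv"
FIGURES_DIR = PROJECT_ROOT / "reports" / "figures"

# Column names from the data dictionary in this README.
TARGET = "income"
FEATURES = [
    "gender", "age", "education_level", "job_level", "experience_years",
    "distance_from_home", "absence", "child",
]

# National minimum wage in 2009 (PLN), as stated in this README.
MIN_INCOME_PLN = 1276

# Hypothetical values for illustration only; the real settings live in src/config.py.
RANDOM_SEED = 42
MCMC_PARAMS = {"draws": 1000, "tune": 1000, "chains": 4, "random_seed": RANDOM_SEED}
```

Keeping these values in one module means the training and plotting steps can import the same seed and feature list instead of redefining them.
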
All datasets in this repository are synthetic and contain no real individual records. They are based on realistic distributions characteristic of Polish salary data circa 2009 (national minimum wage: 1,276 PLN).
The raw dataset is provided in data/raw/salary_data_2009.csv.
| Column | Type | Values / Range | Description |
|---|---|---|---|
| `gender` | categorical | M / F | Employee gender |
| `age` | continuous | integer, years | Employee age |
| `education_level` | ordinal | 1–4 | Level of education (1 = lowest, 4 = highest) |
| `job_level` | ordinal | 1 = Junior, 2 = Mid, 3 = Senior, 4 = Manager | Hierarchical job level |
| `experience_years` | continuous | years | Total work experience |
| `distance_from_home` | binary | 0 = ≤15 km, 1 = >15 km | Whether employee lives more than 15 km from workplace |
| `absence` | discrete | count | Number of absence days |
| `child` | discrete | count (0–3+) | Number of children |
| `income` | continuous | PLN | Gross monthly salary |

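
The data dictionary above can be turned into a quick sanity check. The sketch below uses only the standard library and assumes the CSV stores the ordinal and binary columns as plain numbers; the function name `validate_rows` and the two sample rows are made up for illustration and are not part of the repository.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "gender", "age", "education_level", "job_level", "experience_years",
    "distance_from_home", "absence", "child", "income",
]


def validate_rows(csv_file):
    """Return a list of human-readable problems found in an open CSV file."""
    reader = csv.DictReader(csv_file)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["gender"] not in {"M", "F"}:
            problems.append(f"line {line_no}: gender must be M or F")
        if float(row["education_level"]) not in {1, 2, 3, 4}:
            problems.append(f"line {line_no}: education_level must be 1-4")
        if float(row["job_level"]) not in {1, 2, 3, 4}:
            problems.append(f"line {line_no}: job_level must be 1-4")
        if float(row["distance_from_home"]) not in {0, 1}:
            problems.append(f"line {line_no}: distance_from_home must be 0 or 1")
        if float(row["income"]) < 0:
            problems.append(f"line {line_no}: income must be non-negative")
    return problems


# Made-up sample: the second data row has an invalid gender and education level.
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "F,34,3,2,10,0,4,1,3200\n"
    "X,40,5,2,15,1,0,0,4100\n"
)
for problem in validate_rows(sample):
    print(problem)
```

To check the real file, pass `open("data/raw/salary_data_2009.csv", newline="")` instead of the in-memory sample.
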
The dataset was generated using CTGAN (Xu et al., 2019) trained on an HR training dataset with realistic salary dependencies. CTGAN is a conditional generative adversarial network designed for tabular data that preserves realistic distributional properties — including salary ranges, gender ratios, job level distributions, and family status patterns — characteristic of the Polish labor market circa 2009.
Post-processing rules enforce:
- Non-negativity of all numeric variables
- Minimum income of 1,276 PLN (national minimum wage, 2009)
- Binary encoding of `distance_from_home` (0 = ≤15 km, 1 = >15 km)

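
The three rules above could be applied to each generated record roughly as follows. This is a sketch, not the repository's implementation: it assumes the generator emits a continuous distance in a hypothetical field named `distance_km`, and the function name `postprocess` is likewise illustrative.

```python
MIN_INCOME_PLN = 1276        # national minimum wage, 2009 (from this README)
DISTANCE_THRESHOLD_KM = 15   # threshold from the data dictionary
NUMERIC_COLUMNS = ["age", "experience_years", "absence", "child", "income"]


def postprocess(row):
    """Apply the post-processing rules to one record (a dict) and return a copy.

    Assumes a hypothetical raw field 'distance_km' with a continuous distance.
    """
    clean = dict(row)

    # Rule 1: no numeric variable may be negative.
    for col in NUMERIC_COLUMNS:
        clean[col] = max(0, clean[col])

    # Rule 2: income is floored at the 2009 minimum wage.
    clean["income"] = max(MIN_INCOME_PLN, clean["income"])

    # Rule 3: distance becomes a binary flag (1 if more than 15 km, else 0).
    clean["distance_from_home"] = int(clean.pop("distance_km") > DISTANCE_THRESHOLD_KM)
    return clean


raw = {"age": 30, "experience_years": -1.5, "absence": 2, "child": 0,
       "income": 900.0, "distance_km": 22.4}
print(postprocess(raw))
```

In this made-up record, the negative experience is clipped to 0, the income is raised to 1,276 PLN, and the 22.4 km commute is encoded as 1.
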
Since the data were generated by a generative model rather than a structural process, exact ground truth parameters are not available. Validation is assessed through interval calibration and consistency with empirical patterns in labor economics literature.
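
One common way to check interval calibration is empirical coverage: the share of held-out observations that fall inside their predicted interval, which should be close to the nominal level (for example, about 0.95 for 95% intervals). The sketch below illustrates that idea with made-up numbers; the function name is hypothetical, and the README does not specify which calibration metric the paper uses.

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of observed values that fall inside their [lower, upper] interval."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("all inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)


# Made-up incomes and interval bounds (PLN), for illustration only.
incomes = [3000, 4200, 2500, 5100]
lower = [2600, 3900, 2700, 4500]
upper = [3400, 4800, 3300, 5600]
print(empirical_coverage(incomes, lower, upper))  # 0.75
```

The same function works for frequentist prediction intervals and Bayesian credible intervals, since it only needs the lower and upper bounds.
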

    # Load dataset (raw)
    python3 -m src.dataset
    # Preprocess and create features
    python3 -m src.features
    # Train all GAMBA models (GCV Default, Grid Search GCV, Grid Search AIC, Bayesian PyMC)
    python3 -m src.modeling.train

Notebooks in `notebooks/` allow interactive exploration and plotting.

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

All random seeds are fixed in the code to ensure exact replication of reported results.

    ├── README.md
    ├── data
    │   ├── external
    │   ├── interim
    │   ├── processed
    │   └── raw
    │       └── salary_data_2009.csv
    │
    ├── docs
    ├── models
    ├── notebooks
    ├── reports
    │   └── figures
    │
    ├── requirements.txt
    ├── setup.cfg
    └── src
        ├── __init__.py
        ├── config.py
        ├── dataset.py
        ├── features.py
        ├── modeling
        │   ├── __init__.py
        │   └── train.py
        └── plots.py

- Counterfactual Gender Pay Gap Analysis: predict salaries if all employees were male/female.
- Flexible GAM specification: monotonicity constraints on age, experience, job level, absence, etc.
- Bayesian modeling: probabilistic inference and credible intervals via MCMC (PyMC, NUTS sampler).
- Frequentist modeling: REML-based smoothing parameter selection via pyGAM.
- Interaction effects: gender × job level and gender × child visualizations.
- Reproducible pipeline: fully modular `dataset → features → modeling → plots`.

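
The counterfactual analysis listed above can be sketched in a few lines: predict every employee's salary once with gender set to male and once with gender set to female, then compare the averages. The `toy_predict` function below is a hypothetical stand-in for a fitted GAM, with invented coefficients; only the column names come from this README.

```python
def counterfactual_gap(rows, predict):
    """Mean predicted income if everyone were male minus the mean if everyone were female."""
    as_male = [predict({**row, "gender": "M"}) for row in rows]
    as_female = [predict({**row, "gender": "F"}) for row in rows]
    return sum(as_male) / len(rows) - sum(as_female) / len(rows)


def toy_predict(row):
    # Hypothetical stand-in for a fitted model; the numbers are invented.
    base = 2000 + 500 * row["job_level"]
    return base + (200 if row["gender"] == "M" else 0)


employees = [
    {"gender": "F", "job_level": 1},
    {"gender": "M", "job_level": 3},
]
print(counterfactual_gap(employees, toy_predict))  # 200.0
```

Because every other feature is held fixed, the difference isolates what the model attributes to gender rather than to differences in job level or experience between groups.
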
- Keep `config.py` updated with paths, features, and GAM settings.
- Use `reports/figures` for all plots to maintain reproducibility.
- Tests live in `tests/` (pytest compatible) to ensure preprocessing and modeling correctness.