GAMBA – Generalized Additive Model and Bayesian Analysis

This repository implements GAMBA (Generalized Additive Model and Bayesian Analysis), a framework for pay equity analysis using GAMs with frequentist (GCV- and AIC-based smoothing selection) and Bayesian (MCMC) estimation. It demonstrates gender pay gap analysis, predictive modeling, and visualization of interaction effects across job level and family status.

It is built on the Cookiecutter Data Science template, which provides a clean, reproducible project structure, and it uses pytest and flake8 for CI checks.

This repository accompanies the paper:

Kuna, M., Kowalczyk, M.: Bayesian and Frequentist Approaches to Pay Equity Analysis Using Generalized Additive Models. PP-RAI 2026.

What is here?

Code and Files Structure

The repository follows a clear modular design. Each step – from data ingestion to feature engineering, model training, and visualization – has its own module under src/.

dataset.py – loads the raw CSV data
features.py – preprocesses the data and creates features
modeling/train.py – trains all GAMBA model variants (GCV Default, Grid Search GCV, Grid Search AIC, Bayesian PyMC)
plots.py – produces visualizations (gender pay gap, counterfactual analysis, motherhood penalty)
config.py – holds constants, paths, feature lists, MCMC parameters, and GAM specifications in one place

Data

All datasets in this repository are synthetic and contain no real individual records. They are based on realistic distributions characteristic of Polish salary data circa 2009 (national minimum wage: 1,276 PLN).

The raw dataset is provided in data/raw/salary_data_2009.csv.

Column Descriptions

| Column | Type | Values / Range | Description |
|---|---|---|---|
| `gender` | categorical | M / F | Employee gender |
| `age` | continuous | integer, years | Employee age |
| `education_level` | ordinal | 1–4 | Level of education (1 = lowest, 4 = highest) |
| `job_level` | ordinal | 1 = Junior, 2 = Mid, 3 = Senior, 4 = Manager | Hierarchical job level |
| `experience_years` | continuous | years | Total work experience |
| `distance_from_home` | binary | 0 = ≤15 km, 1 = >15 km | Whether the employee lives more than 15 km from the workplace |
| `absence` | discrete | count | Number of absence days |
| `child` | discrete | count (0–3+) | Number of children |
| `income` | continuous | PLN | Gross monthly salary |
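The column table can be turned into a lightweight schema check before any modeling. The following is a minimal sketch using only the standard library; `validate_rows` and `EXPECTED_COLUMNS` are hypothetical names that are not part of this repository, and the check assumes ordinal columns are stored as numeric codes.

```python
import csv
import io

# Column names come from the table above; the checks themselves are illustrative.
EXPECTED_COLUMNS = [
    "gender", "age", "education_level", "job_level", "experience_years",
    "distance_from_home", "absence", "child", "income",
]


def validate_rows(handle):
    """Read CSV rows from an open text handle and check the documented ranges."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["gender"] not in {"M", "F"}:
            raise ValueError(f"line {line_no}: unexpected gender {row['gender']!r}")
        # Assumption: ordinal levels are stored as numeric codes 1-4.
        for col in ("education_level", "job_level"):
            if float(row[col]) not in {1, 2, 3, 4}:
                raise ValueError(f"line {line_no}: {col} outside 1-4")
        if float(row["distance_from_home"]) not in {0, 1}:
            raise ValueError(f"line {line_no}: distance_from_home must be 0 or 1")
        rows.append(row)
    return rows


# Two made-up records, used only to exercise the check.
sample = io.StringIO(
    "gender,age,education_level,job_level,experience_years,"
    "distance_from_home,absence,child,income\n"
    "F,34,3,2,10,0,4,1,3200\n"
    "M,41,4,4,18,1,2,2,7400\n"
)
rows = validate_rows(sample)
print(len(rows), "rows passed validation")
```

To check the real file, pass an open handle to data/raw/salary_data_2009.csv instead of the in-memory sample.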

Data Generation

The dataset was generated using CTGAN (Xu et al., 2019) trained on an HR training dataset with realistic salary dependencies. CTGAN is a conditional generative adversarial network designed for tabular data. It preserves realistic distributional properties characteristic of the Polish labor market circa 2009, including salary ranges, gender ratios, job level distributions, and family status patterns.

Post-processing rules enforce:

Non-negativity of all numeric variables
Minimum income of 1,276 PLN (national minimum wage, 2009)
Binary encoding of distance_from_home (0 = ≤15 km, 1 = >15 km)
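The three rules above could be applied to each generated record roughly as follows. This is a sketch under stated assumptions, not the repository's implementation: it assumes negative values are clipped to zero (the rules do not say whether offending records are clipped or rejected) and that a raw distance in kilometres is available before encoding. `postprocess`, `NUMERIC_COLUMNS`, and `distance_km` are hypothetical names.

```python
MIN_WAGE_PLN = 1276          # national minimum wage in 2009, as stated above
DISTANCE_THRESHOLD_KM = 15   # threshold used for the binary encoding

# Assumption: these are the numeric columns subject to the non-negativity rule.
NUMERIC_COLUMNS = ["age", "experience_years", "absence", "child", "income"]


def postprocess(record, distance_km):
    """Return a cleaned copy of one generated record (a dict)."""
    cleaned = dict(record)

    # Rule 1: non-negativity (assumed here to mean clipping at zero).
    for col in NUMERIC_COLUMNS:
        cleaned[col] = max(0, cleaned[col])

    # Rule 2: income floor at the 2009 minimum wage.
    cleaned["income"] = max(MIN_WAGE_PLN, cleaned["income"])

    # Rule 3: 0 = at most 15 km, 1 = more than 15 km.
    cleaned["distance_from_home"] = 1 if distance_km > DISTANCE_THRESHOLD_KM else 0
    return cleaned


raw = {"age": 30, "experience_years": -2, "absence": 3, "child": 0, "income": 900}
fixed = postprocess(raw, distance_km=22.5)
print(fixed)
```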

Because the data were generated by a generative model rather than a structural process, exact ground-truth parameters are not available. The models are therefore validated through interval calibration and consistency with empirical patterns reported in the labor economics literature.
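One common way to check interval calibration is empirical coverage: the share of held-out observations that fall inside their predicted intervals, compared with the nominal level (for example 0.95). The sketch below illustrates that idea with made-up numbers; it is an assumption about what calibration means here, and `empirical_coverage` is a hypothetical helper rather than a function from this repository.

```python
def empirical_coverage(y_true, lower, upper):
    """Fraction of observations lying inside their [lower, upper] interval."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)


# Made-up incomes (PLN) and interval bounds, for illustration only.
observed = [3000, 4500, 5200, 2500]
lower = [2500, 4600, 4800, 2000]
upper = [3500, 5400, 5600, 3000]

coverage = empirical_coverage(observed, lower, upper)
print(coverage)  # 0.75: the second observation falls below its interval
```

Coverage well below the nominal level suggests intervals that are too narrow; coverage well above it suggests intervals that are too wide.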

How to run the code

# Load dataset (raw)
python3 -m src.dataset

# Preprocess and create features
python3 -m src.features

# Train all GAMBA models (GCV Default, Grid Search GCV, Grid Search AIC, Bayesian PyMC)
python3 -m src.modeling.train

Notebooks in notebooks/ allow interactive exploration and plotting.

Environment Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

All random seeds are fixed in the code to ensure exact replication of reported results.
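Fixing seeds typically means seeding every random number generator the pipeline touches. The helper below is a sketch of that pattern; `SEED` and `set_seeds` are hypothetical, and the actual values and locations used in this repository may differ.

```python
import random

SEED = 42  # hypothetical value; the repository's actual seed may differ


def set_seeds(seed=SEED):
    """Seed the standard-library RNG and, if installed, NumPy's global RNG."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy not available; only the stdlib RNG was seeded
    np.random.seed(seed)


set_seeds()
first = random.random()
set_seeds()
second = random.random()
print(first == second)  # True: reseeding reproduces the same draw
```

For MCMC, PyMC's `pm.sample` also accepts a `random_seed` argument, so the sampler can be seeded explicitly rather than relying on global state.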

Project Organization

├── README.md
├── data
│   ├── external
│   ├── interim
│   ├── processed
│   └── raw
│       └── salary_data_2009.csv
│
├── docs
├── models
├── notebooks
├── reports
│   └── figures
│
├── requirements.txt
├── setup.cfg
├── src
│   ├── __init__.py
│   ├── config.py
│   ├── dataset.py
│   ├── features.py
│   ├── modeling
│   │   ├── __init__.py
│   │   └── train.py
│   └── plots.py
│
└── tests

GAMBA Highlights

Counterfactual gender pay gap analysis: predict salaries under the scenario that every employee is male, then that every employee is female, and compare.
Flexible GAM specification: monotonicity constraints on selected terms such as age, experience, job level, and absence.
Bayesian modeling: probabilistic inference and credible intervals via MCMC (PyMC, NUTS sampler).
Frequentist modeling: smoothing parameter selection by GCV and AIC (default fit and grid search) via pyGAM.
Interaction effects: gender × job level and gender × child visualizations.
Reproducible pipeline: fully modular dataset → features → modeling → plots.
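The counterfactual analysis in the first highlight can be sketched as follows: hold every other feature fixed, set gender to M for all employees, then to F, and compare the mean predictions. The predictor below is a made-up additive stand-in, not a fitted GAMBA model; `toy_predict`, `counterfactual_gap`, and all coefficients are hypothetical.

```python
from statistics import mean


def toy_predict(row):
    """Hypothetical stand-in for a fitted model's prediction (PLN)."""
    # Illustrative gender effect that grows with the number of children.
    penalty = (200 + 50 * row["child"]) if row["gender"] == "F" else 0
    return 2000 + 300 * row["job_level"] - penalty


def counterfactual_gap(rows, predict):
    """Mean predicted pay if everyone were male minus if everyone were female."""
    as_male = [predict({**r, "gender": "M"}) for r in rows]
    as_female = [predict({**r, "gender": "F"}) for r in rows]
    gap = mean(as_male) - mean(as_female)
    gap_pct = 100 * gap / mean(as_male)
    return gap, gap_pct


# Made-up employees, for illustration only.
employees = [
    {"gender": "F", "job_level": 1, "child": 0},
    {"gender": "M", "job_level": 2, "child": 1},
    {"gender": "F", "job_level": 3, "child": 2},
]

gap, gap_pct = counterfactual_gap(employees, toy_predict)
print(f"gap: {gap} PLN ({gap_pct:.1f}% of mean male-scenario pay)")
```

With a real model, `predict` would wrap the fitted GAM's prediction call. Because only the gender column changes between the two scenarios, the difference isolates the modeled gender effect, including its interactions with job level and number of children.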

Additional Tips

Keep config.py up to date with paths, feature lists, and GAM settings.
Save all plots to reports/figures so that generated outputs stay in one place.
Tests live in tests/ and run with pytest; they check preprocessing and modeling correctness.
