fmm-fairness-eval

SaMD-specific fairness evaluation CLI for foundation-model medical AI. Emits AI Act Art. 10 / Art. 9 evidence artifacts.

Pricing

CLI (Free). OSS Python CLI, MIT-licensed, fairness metrics + audit chain.
Hosted CI (€99/mo). Hosted runner on customer prediction CSVs, monthly evidence pack.
Consulting (€60-100/hour). Fairness-evidence drafting, AI Act Art. 9 / Art. 10 dossier review, Calendly booking.

Paid tiers are in early access, available on request by email. The CLI itself is free under MIT, and pip install --pre fmm-fairness-eval works today. See pricing.md for tier details and how to request early access.

fmm-fairness-eval (fmm-fairness on the command line) is a small, focused CLI that takes a predictions CSV from any SaMD or SaMD-adjacent medical-AI model and produces a regulator-friendly fairness evidence pack: a Markdown report, a machine-readable JSON pack, and a SHA-256 audit chain. It is built around the failure mode regulators actually care about: inter-hospital and inter-site bias. The output drops straight into an EU AI Act Art. 10 / Art. 9 dossier.

Why this exists

Modern medical-AI systems are increasingly built on foundation-model embeddings (CONCH for histopathology, DINOv2 for general radiology, RadFM-style models for multi-modal radiology) plus a small downstream classifier. The dominant failure mode is no longer "the model is biased against women" or "the model misses dark skin tones" in isolation. It is inter-hospital generalization collapse: the model that scores F1=0.89 on one cohort drops to F1=0.70 on the cohort across the river, and the gap is largest in subgroups the training data under-represented.

The author's TFG (Universitat Politècnica de València, 2024) measured exactly this on dermatology AI using CONCH embeddings and multiple-instance learning over the AI4SkIN cohort: weighted F1 = 0.89, but an inter-hospital fairness gap of 0.19 between sites. That is not a corner case. It is the modal failure mode for any SaMD that crosses a hospital network boundary.

Existing fairness libraries (Fairlearn, AIF360, Holistic AI, Microsoft Responsible AI Toolbox) are general-purpose ML fairness tools. None of them ships a SaMD-specific evaluation pipeline that:

Treats site / hospital as a first-class protected attribute distinct from individual demographics.
Emits AI Act Art. 10 / Art. 9 cross-cited evidence by default.
Defines a composite SaMD fairness score whose weighting reflects how regulators actually prioritize bias categories.
Ships a SHA-256 audit chain so the evidence pack is tamper-evident the moment it leaves your pipeline.

This tool fills that gap. Nothing more, nothing less.

Citation for the underlying TFG work: César Pereiro, Foundation-model-based fairness evaluation in dermatology classification using the AI4SkIN dataset, Universitat Politècnica de València, 2024. https://riunet.upv.es/handle/10251/226903

Install

The current feature set (BCa bootstrap, permutation tests, intersectional analysis, multi-class, AI Act full dossier) is in the 0.2.0a pre-release line. Install it with --pre:

pip install --pre fmm-fairness-eval

A bare pip install fmm-fairness-eval (without --pre) currently resolves to the older 0.1.0 stable, which lacks most of the capability documented below. Use --pre until 0.2.0 is promoted to stable.

Or from source:

git clone https://github.com/Ces107/fmm-fairness-eval-cli
cd fmm-fairness-eval-cli
pip install -e .

Requires Python 3.10+, numpy ≥ 1.24, pandas ≥ 2.0, scikit-learn ≥ 1.3. No GPU dependency required.

What it does

1. Run an evaluation

fmm-fairness evaluate predictions.csv \
    --protected-attrs site,sex,age_bucket \
    --site-attribute site \
    --output fairness-report/

predictions.csv must contain these columns, where the score-column shape depends on the number of classes K your model produces.

Binary (K = 2):

| column | type | meaning |
| --- | --- | --- |
| `y_true` | int ∈ {0, 1} | Ground-truth label |
| `y_pred` | int ∈ {0, 1} | Thresholded prediction |
| `y_score` | float ∈ [0, 1] | Raw model probability `P(class = 1)` |
| (declared) | str | One column per `--protected-attrs` value |

Multi-class (K ≥ 2):

| column | type | meaning |
| --- | --- | --- |
| `y_true` | int ∈ {0..K-1} | Ground-truth label |
| `y_pred` | int ∈ {0..K-1} | Argmax prediction |
| `y_score_0` | float ∈ [0, 1] | `P(class = 0)` |
| `y_score_1` | float ∈ [0, 1] | `P(class = 1)` |
| ... | ... | ... |
| `y_score_{K-1}` | float ∈ [0, 1] | `P(class = K-1)` (rows sum to ~1.0) |
| (declared) | str | One column per `--protected-attrs` value |

The CLI auto-detects K from the score columns. Pass --num-classes K to override or to assert the expected value; a mismatch with the detected shape triggers a warning.
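
The column conventions above can be checked before handing a file to the CLI. The following is a minimal sketch, not the tool's own detection logic: the helper name detect_num_classes and the sample rows are hypothetical, and the real CLI may apply different rules.

```python
import csv
import io
import re


def detect_num_classes(columns):
    """Infer K from score columns (hypothetical helper, not the CLI's code)."""
    if "y_score" in columns:
        return 2  # binary layout: a single P(class = 1) column
    indices = sorted(
        int(m.group(1))
        for c in columns
        if (m := re.fullmatch(r"y_score_(\d+)", c))
    )
    if not indices or indices != list(range(len(indices))):
        raise ValueError("expected y_score or contiguous y_score_0..y_score_{K-1}")
    return len(indices)


# Made-up multi-class sample with two declared protected attributes.
sample = io.StringIO(
    "y_true,y_pred,y_score_0,y_score_1,y_score_2,site,sex\n"
    "0,0,0.7,0.2,0.1,hospital_a,F\n"
    "2,1,0.1,0.5,0.4,hospital_b,M\n"
)
reader = csv.DictReader(sample)
k = detect_num_classes(reader.fieldnames)
for row in reader:
    total = sum(float(row[f"y_score_{i}"]) for i in range(k))
    assert abs(total - 1.0) < 1e-6, "multi-class scores should sum to ~1.0"
print(k)
```

Running a check like this first makes a --num-classes mismatch warning less likely to come as a surprise.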

For multi-class inputs the evidence pack carries the F1-family fairness metrics (weighted_f1_gap, macro_f1_gap, per_class_f1_gap, multi_class_auc_gap); see docs/multi-class-metrics.md for definitions. For binary inputs the v0.1 metrics (equal_opportunity_gap, demographic_parity_gap, calibration_gap) are retained.

2. Read the output

The CLI produces three files in fairness-report/:

fairness-report.md: human-readable, regulator-friendly summary.
fairness-evidence.json: machine-readable evidence pack (stable schema, sorted keys for deterministic SHA).
audit.sha256: SHA-256 of the above two files. Pin in your QMS / change-control record.
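
Why sorted keys matter for a deterministic SHA can be shown in a few lines. This is an illustrative sketch only: the exact layout of audit.sha256 and the serialisation settings the tool uses are not documented here, so the function canonical_sha256 and the example values are assumptions.

```python
import hashlib
import json


def canonical_sha256(payload):
    """SHA-256 over a key-sorted JSON serialisation (illustrative only)."""
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Same content, different key insertion order (values are made up).
pack_a = {"equal_opportunity_gap": 0.12, "demographic_parity_gap": 0.05}
pack_b = {"demographic_parity_gap": 0.05, "equal_opportunity_gap": 0.12}

# Sorting keys removes insertion order from the hashed bytes.
assert canonical_sha256(pack_a) == canonical_sha256(pack_b)

# Changing any value changes the digest, so a pinned digest exposes later edits.
tampered = {**pack_a, "equal_opportunity_gap": 0.11}
assert canonical_sha256(pack_a) != canonical_sha256(tampered)

print(canonical_sha256(pack_a))
```

The digest only provides tamper evidence if it is stored somewhere the pack's editor cannot also change, which is why pinning it in the QMS record matters.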

3. Cross-cite to the AI Act

fmm-fairness evaluate predictions.csv \
    --protected-attrs site,sex,age_bucket \
    --manifest-mode ai-act \
    --output fairness-report/

In ai-act mode the JSON pack gains a regulatory_mapping block that cross-cites each metric to the EU AI Act article it evidences:

Art. 9 (Risk management system) ↔ samd_fairness_score, inter_site_auc_variance.
Art. 10 (Data and data governance) ↔ equal_opportunity_gap, demographic_parity_gap, calibration_gap. Evidences Art. 10(2)(f-g) examination of biases and shortcomings.
Art. 15 (Accuracy, robustness) ↔ inter_site_auc_variance. Evidences generalization claims.
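
For downstream tooling it can be handy to look up which articles a given metric evidences. The sketch below transcribes the list above into a Python dict; the dict layout and the helper articles_for are assumptions for illustration, and the actual schema of the regulatory_mapping block may differ.

```python
# Article -> metrics, transcribed from the list above. The layout is
# hypothetical; consult the real JSON pack for the actual schema.
ARTICLE_TO_METRICS = {
    "Art. 9": ["samd_fairness_score", "inter_site_auc_variance"],
    "Art. 10": ["equal_opportunity_gap", "demographic_parity_gap", "calibration_gap"],
    "Art. 15": ["inter_site_auc_variance"],
}


def articles_for(metric):
    """Return the articles a metric is cross-cited to, in listing order."""
    return [art for art, metrics in ARTICLE_TO_METRICS.items() if metric in metrics]


print(articles_for("inter_site_auc_variance"))
print(articles_for("calibration_gap"))
```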

Metrics computed

| Metric | Formula (short) | When it matters |
| --- | --- | --- |
| `equal_opportunity_gap` | max-min TPR across groups (Hardt et al. 2016) | Under-diagnosis disparity (Seyyed-Kalantari et al. 2021) |
| `demographic_parity_gap` | max-min P(ŷ=1) across groups | Selection-rate disparity |
| `calibration_gap` | max-min ECE across groups | Score-trust differs by subgroup |
| `inter_site_auc_variance` | Var(AUC) across sites | Inter-hospital generalization risk (the SaMD failure mode) |
| `samd_fairness_score` | composite ∈ [0,1] (see `docs/samd-fairness-score.md`) | Single-number summary for QMS dashboards |
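
The short formulas in the table can be worked through on a toy example. This is a standard-library sketch, not the package's implementation: the helper names and the data are made up, the AUC uses a plain pairwise comparison, and Var(AUC) is taken as the population variance, which is an assumption since the table does not say which estimator the tool uses.

```python
from collections import defaultdict
from statistics import pvariance


def group_rate(rows, event, condition=lambda r: True):
    """Per-site estimate of P(event | condition) by counting rows."""
    hit, total = defaultdict(int), defaultdict(int)
    for r in rows:
        if condition(r):
            total[r["site"]] += 1
            hit[r["site"]] += int(event(r))
    return {site: hit[site] / total[site] for site in total}


def auc(rows):
    """P(random positive outscores random negative); ties count as 0.5."""
    pos = [r["y_score"] for r in rows if r["y_true"] == 1]
    neg = [r["y_score"] for r in rows if r["y_true"] == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gap(values):
    values = list(values)
    return max(values) - min(values)


# Toy data: (site, y_true, y_pred, y_score)
raw = [
    ("a", 1, 1, 0.9), ("a", 1, 1, 0.8), ("a", 0, 0, 0.2), ("a", 0, 1, 0.6),
    ("b", 1, 1, 0.7), ("b", 1, 0, 0.4), ("b", 0, 0, 0.3), ("b", 0, 0, 0.45),
]
rows = [dict(zip(("site", "y_true", "y_pred", "y_score"), t)) for t in raw]


def predicted_positive(r):
    return r["y_pred"] == 1


# TPR = P(y_pred = 1 | y_true = 1); selection rate = P(y_pred = 1).
tpr = group_rate(rows, predicted_positive, condition=lambda r: r["y_true"] == 1)
selection = group_rate(rows, predicted_positive)
site_auc = {s: auc([r for r in rows if r["site"] == s]) for s in ("a", "b")}

equal_opportunity_gap = gap(tpr.values())
demographic_parity_gap = gap(selection.values())
inter_site_auc_variance = pvariance(list(site_auc.values()))
print(equal_opportunity_gap, demographic_parity_gap, inter_site_auc_variance)
```

With only two sites the variance is a blunt summary; the per-site values in site_auc are usually more informative to read directly.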

All gap metrics ship with percentile bootstrap 95% CIs computed over a stratified resample.
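
A percentile bootstrap over a stratified resample might look roughly like the sketch below. The strata here are (group, y_true) pairs, which is an assumption: the text does not say how the tool stratifies. The function names, the toy counts, and the number of resamples are likewise illustrative.

```python
import random
from collections import defaultdict


def eo_gap(rows):
    """max - min TPR across groups; rows are (group, y_true, y_pred)."""
    hits, positives = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in rows:
        if y_true == 1:
            positives[group] += 1
            hits[group] += y_pred
    rates = [hits[g] / positives[g] for g in positives]
    return max(rates) - min(rates)


def stratified_percentile_ci(rows, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI; resamples with replacement inside each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[(row[0], row[1])].append(row)  # assumed stratum: (group, y_true)
    draws = []
    for _ in range(n_boot):
        sample = []
        for members in strata.values():
            sample.extend(rng.choices(members, k=len(members)))
        draws.append(stat(sample))
    draws.sort()
    lo = draws[int((alpha / 2) * n_boot)]
    hi = draws[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy cohort: site a detects 36 of 40 positives, site b detects 28 of 40.
rows = (
    [("a", 1, 1)] * 36 + [("a", 1, 0)] * 4 + [("a", 0, 0)] * 40
    + [("b", 1, 1)] * 28 + [("b", 1, 0)] * 12 + [("b", 0, 0)] * 40
)

point = eo_gap(rows)
ci_low, ci_high = stratified_percentile_ci(rows, eo_gap)
print(point, ci_low, ci_high)
```

Stratifying on the label as well as the group keeps every resample's TPR denominator nonzero, which is one practical reason to resample this way.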

The composite samd_fairness_score is defined explicitly with documented weights and a sensitivity analysis in docs/samd-fairness-score.md. It is not a black box and is not an FDA-blessed metric. It is a transparent aggregate the operator can defend, override, or replace.

Scientific context

CONCH (Lu et al. 2024) is the visual-language pathology foundation model used in the underlying TFG work. Lu, M. Y. et al. "A visual-language foundation model for computational pathology." Nature Medicine 30, 863–874 (2024). doi:10.1038/s41591-024-02856-4
AI4SkIN is the multi-hospital dermatopathology dataset (Spain, multi-site) on which the TFG measured the 0.19 inter-hospital gap.
Under-diagnosis bias on chest X-rays (Seyyed-Kalantari et al. 2021, Nat. Med. 27, 2176-2182) is the canonical demonstration that single-site fairness audits miss the dominant failure mode.
Pain disparity reduction (Pierson et al. 2021, Nat. Med. 27, 136-140) demonstrates the inverse: algorithmic predictions can outperform human-graded severity in capturing real disparities, which motivates better measurement, not less.
Ethical implementation (Char, Shah, Magnus 2018, NEJM 378, 981-983) sets the still-canonical framing for healthcare-ML ethics.

A one-paragraph literature pointer for the formal definitions: the equal-opportunity criterion is Hardt, Price, Srebro (NeurIPS 2016); the demographic-parity definition follows Dwork et al. (ITCS 2012); calibration-by-group follows Pleiss et al. (NeurIPS 2017). The composite weighting is justified from FDA GMLP guidance (2021, updated 2024 IMDRF GMLP) + EU AI Act Art. 10 prioritisation of multi-site data governance.

What it does NOT do

Not a model-training framework. Bring your own predictions.
Not a foundation-model serving stack. Embeddings are outside scope.
Not auto-detection of protected attributes. You must declare them. Silent attribute inference is itself a bias risk.
Not a certification. A fairness evaluation is evidence; certification is a regulatory process this tool helps you prepare for.
Not an explainability tool. It surfaces where bias lives, not why.

Honest scientific caveats (read before quoting numbers)

Threshold sensitivity. equal_opportunity_gap and demographic_parity_gap both depend on the operating threshold used to produce y_pred. Re-run the evaluation at any threshold you would actually deploy at.
Small-sample bootstrap. Percentile bootstrap is approximate for small groups; for n < 50 prefer the BCa interval or treat CIs as exploratory. Groups with n < min_group_n (default 20) are excluded with a warning rather than silently producing a near-zero gap.
Prevalence confound. Inter-site bias is frequently confounded with prevalence shift. A site that sees twice the disease prevalence will show a different selection rate even under a perfectly fair model, and its TPR often shifts as well because higher prevalence usually comes with a different case mix. The tool reports both per-site AUC (less prevalence-sensitive) and per-site rates; interpret jointly.
Composite score is opinionated. The default weights (w_site=0.4, w_eo=0.3, w_dp=0.15, w_cal=0.15) reflect this author's read of regulatory priority. Override with weights= in the Python API or treat the components separately. The single number is for dashboards; the components are for decisions.
No causal inference. A measured gap does not identify the mechanism. Combine with subgroup analysis, training-set provenance audit, and (where possible) prospective evaluation.
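
The threshold-sensitivity caveat is easy to see on toy data. The sketch below recomputes the equal-opportunity gap at several operating thresholds; the function eo_gap_at, the scores, and the use of >= for thresholding are assumptions for illustration, not the tool's behaviour.

```python
def eo_gap_at(rows, threshold):
    """Equal-opportunity gap after thresholding y_score at `threshold`."""
    tpr = {}
    for group in {r[0] for r in rows}:
        pos_scores = [s for g, y, s in rows if g == group and y == 1]
        tpr[group] = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    return max(tpr.values()) - min(tpr.values())


# (site, y_true, y_score): made-up data where site "b" scores positives lower.
rows = [
    ("a", 1, 0.9), ("a", 1, 0.8), ("a", 1, 0.7), ("a", 1, 0.6),
    ("b", 1, 0.8), ("b", 1, 0.6), ("b", 1, 0.5), ("b", 1, 0.4),
    ("a", 0, 0.3), ("b", 0, 0.2),
]

# The same scores give a gap of 0.0, 0.25, and 0.5 at these three thresholds.
for t in (0.3, 0.5, 0.7):
    print(t, eo_gap_at(rows, t))
```

A gap reported at one threshold therefore says little about another, which is why the evaluation should be re-run at the threshold actually deployed.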

Pricing

CLI: MIT, free, forever.
Hosted "fairness CI" (Phase 2, not yet shipped): planned at €99/month for teams that want every commit to a model repo to fire an evaluation against a frozen multi-site cohort and post the evidence pack as a CI artifact. Mailing list opens at validation green-light.
Consulting: the author is available for SaMD fairness review / AI Act Art. 10 evidence-pack design at €60-100/hour. Contact via the linked GitHub profile; introductions through the academic-DM channel are welcome.

Citing

If you use this tool in published research:

@software{pereiro2026fmmfairness,
  author       = {Pereiro, C{\'e}sar},
  title        = {{fmm-fairness-eval}: SaMD-specific fairness evaluation for foundation-model medical AI},
  year         = {2026},
  url          = {https://github.com/Ces107/fmm-fairness-eval-cli},
  note         = {Zenodo DOI to be minted on the first stable (0.2.0) release}
}

@thesis{pereiro2024dermfairness,
  author = {Pereiro, C{\'e}sar},
  title  = {Foundation-model-based fairness evaluation in dermatology classification using the AI4SkIN dataset},
  school = {Universitat Polit{\`e}cnica de Val{\`e}ncia},
  year   = {2024},
  url    = {https://riunet.upv.es/handle/10251/226903}
}

Roadmap

v0.1 (current stable, 0.1.0): CLI, 4 fairness gap metrics, composite score, AI Act manifest mode, SHA-256 audit chain.
v0.2 (pre-release line, 0.2.0a): BCa bootstrap, sub-group intersectionality (site × sex), CSV-of-CSVs batch mode.
v0.3: HTML report option, hosted fairness-CI (Phase 2, gated on validation pass).
v0.4: subgroup-aware threshold optimisation (opt-in, with the appropriate caveats).

License

MIT. See LICENSE.
