aedt-mspe-experiments

This repository audits a patent-faithful Medical Student Performance Evaluation (MSPE) scoring pipeline (US 12,265,502 B1, §530) and finds that the claimed semantic-preservation property is not satisfied: materially equivalent leave-of-absence (LOA) documentation can receive different scores depending on sentence-level clinical-register wording. Follow-up sensitivity testing narrows the mechanism from "rehabilitative content" to a lexical threshold-gating effect. When calibrated against the JAMA-published ERAS population structure (Nguyen et al., 2024; doi:10.1001/jama.2024.4797), the mechanism is real but bounded: it predicts limited amplification and does not explain the full pre-existing race × LOA placement disparity. The resulting claim is falsifiable in post-AEDT cohorts.

Companion to the published audit at chadmarkey/aedt-fairness-audit (OSF osf.io/uk6xn), which covers the §PS Component personal-statement pipeline.

No MSPE-shaped content is committed to this repo. All paragraph wordings live in a gitignored wordings_local.py (template at wordings_local.example.py). The committed scripts are scoring infrastructure that operates over whatever wordings the local config provides. Anyone reproducing the analysis with a different wording manipulation populates their own wordings_local.py from the example template.
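
A populated local module can be sanity-checked before running the scripts. The sketch below is a hypothetical helper, not part of this repo: `missing_wordings` and `REQUIRED_NAMES` are illustrative names, and the list covers only the constants that the reproduction steps name explicitly (the template may define more).

```python
from types import SimpleNamespace

# Constant names listed explicitly in the reproduction steps.
# Assumption: the example template may require additional names.
REQUIRED_NAMES = (
    "LOA_NONE", "LOA_ORIGINAL", "LOA_CORRECTED",
    "IDENTIFYING_INFO", "ACADEMIC_HISTORY_STATIC",
    "NOTEWORTHY_VARIANTS", "ON_DOCTORING_VARIANTS",
)

def missing_wordings(module, required=REQUIRED_NAMES):
    """Return the required names that `module` does not define."""
    return [name for name in required if not hasattr(module, name)]

# Stand-in for `import wordings_local`, with only two names defined.
demo = SimpleNamespace(LOA_NONE="", LOA_ORIGINAL="placeholder wording")
print(missing_wordings(demo))
```

The check tests only for presence, so an intentionally empty string (for example, a no-LOA arm) is not reported as missing.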

Read this first

walkthrough.md — seven-figure walkthrough (one composite + six detail) explaining what the experiments show, with approachable language and explicit caveats.

nguyen_calibration/findings_significance.md — what the work establishes, what it does not, and the falsifiable prediction it pre-registers.

Headline finding

The patent's Claim 1 specifies that the §532 BiasMitigator should "preserve semantic structure." Across N=50 paired synthetic MSPE-shaped documents (~1,300-word median, both indicator sets, both raw and mitigated modes), the patent-faithful pipeline scores documents differently when only the wording of a leave-of-absence paragraph changes. The mechanism the audit identifies is a sentence-level clinical-register lexical sensitivity in the §530 threshold gate.

The F1 perturbation suite establishes this empirically. A standalone sentence containing clinical-register vocabulary in subject/predicate position (e.g., "Clinical responsibilities were paused during this period") crosses the 0.35 cosine-similarity gate to the exemplar centroids and contributes Δ ≈ +1.85. That is the full-magnitude effect, on a wording with no medical context, no remediation language, and no rehabilitative framing. The same words embedded in a subordinate clause produce zero. A standalone "recovery period" sentence without clinical vocabulary produces zero. Stripping "clinical" and "recovery" from a control wording produces zero. The triggering condition is the standalone-sentence clinical-register lexical configuration, not the paragraph's semantic content.

Vague-administrative LOA wording (the original arm, a single fixed LOA_ORIGINAL string) contributes zero to the document score to within about 8e-06 (max |Δ_original| = 7.6e-06; the residual is SBERT float32 quantization jitter). The per-document Δ_original is identical across the 50 documents by construction, because the same fixed string is evaluated against the same exemplar centroids, so the original-arm contribution is effectively an N=1 measurement. The 50/50 framing applies meaningfully only to the paired Δ_corrected − Δ_original contrast, where the paragraph-variant draw differs between documents and the contrast is above zero on 50/50 documents, with min Δ_corrected ≈ 1.31.
Rehabilitative-content LOA wording (the corrected arm) contributes strictly positive score on 50/50 docs in all four cells: ACGME mitigated Δ_corrected = 1.747, 95% CI [1.690, 1.803]; patent-illustrative mitigated Δ_corrected = 2.734, CI [2.639, 2.828]. F1 establishes that the corrected arm's positive contribution operates through the same standalone-sentence clinical-register mechanism, not through "rehabilitative semantics."

Claim 1's "preserve semantic structure" requirement is empirically not met under this interpretation either: the mitigator does not penalize the insertion of a standalone clinical-register sentence into an otherwise vague-administrative LOA paragraph, even when that sentence semantically describes cessation of clinical activity rather than rehabilitation. Same person, same fact, lexical-register-different wording → different outcome.

What the population-scale numbers actually mean

Plugged into an MSPE-only top-K=20% screening simulation under uniform-group assumptions, the wording effect produces a disparate-impact (DI) ratio below the EEOC four-fifths (4/5) threshold across realistic disclosure rates (0–25%). This is a single-feature failure-mode result: it answers what the mechanism alone predicts at population scale, holding the rest of the application fixed.

Under a realistic multi-feature ranker (see f5_multifeature/), with MSPE weighted at 0.4 alongside Step 2 CK, LORs, signaling, and program filters, DI at 10% disclosure rises to 0.812 — above the 4/5 threshold. The single-feature DI failure result therefore characterizes the mechanism's strength under MSPE-only ranking, not its deployed population impact. The JAMA-calibrated bound in nguyen_calibration/ is the population-scale claim; the single-feature DI is the mechanism diagnostic.
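
The four-fifths comparison used above reduces to a ratio of selection rates. A minimal sketch, with hypothetical counts and helper names (`disparate_impact`, `passes_four_fifths`) that do not come from this repo:

```python
def disparate_impact(selected_a, total_a, selected_b, total_b):
    """Ratio of group A's selection rate to group B's selection rate."""
    rate_a = selected_a / total_a
    rate_b = selected_b / total_b
    return rate_a / rate_b

def passes_four_fifths(di, threshold=0.8):
    """True when the DI ratio is at or above the 4/5 threshold."""
    return di >= threshold

# Hypothetical counts: 150 of 1,000 selected in group A, 200 of 1,000 in group B.
di_example = disparate_impact(150, 1000, 200, 1000)
print(round(di_example, 3), passes_four_fifths(di_example))

# The multi-feature DI reported above at 10% disclosure.
print(passes_four_fifths(0.812))
```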

The JAMA-calibrated simulation replaces the uniform-group assumption with race-stratified LOA prevalences and Step 2 CK distributions for the 2014–2016 ERAS cohort (n=37,485 applicants), as reported in JAMA April 15, 2024 (doi:10.1001/jama.2024.4797). Under the strongest documentation-heterogeneity assumption swept (N=200 bootstrap replicates × 20,000 applicants), the mechanism produces ~20% widening of the Black-LOA / White-LOA ratio over null, which is about 15% of the JAMA-reported pre-AI excess gradient over parity, not the full magnitude.

The bounded reading:

The mechanism (sentence-level clinical-register lexical sensitivity) is empirically established by the F1 decoy experiment (V9 alone, no rehab content, full-magnitude trigger).
Its population-level contribution, in a realistic multi-feature ranker, is small relative to the JAMA-reported pre-AI gradient.
The pre-registered prediction: post-AI cohort replication of the JAMA pre-AI cohort will show at most ~20% widening of the Black-LOA / White-LOA ratio (about 15% of the JAMA-reported pre-AI excess gradient over parity) attributable to this mechanism. Observed widening above that bound implies AEDT-mediated mechanisms beyond wording asymmetry; observed widening at or below it is consistent with mechanism-only transmission; null widening rules out AEDT mediation at this layer.
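
The three-way reading of the prediction can be written as a small decision rule. This is only an illustrative sketch: `classify_widening` is a hypothetical name, the `null_tol` tolerance for calling a result "null" is an assumption not specified here, and a real analysis would compare confidence intervals rather than point estimates.

```python
def classify_widening(observed, bound=0.20, null_tol=0.0):
    """Map an observed fractional widening of the Black-LOA / White-LOA
    ratio (0.20 means 20%) onto the three readings of the prediction.

    bound:    the ~20% upper bound stated in the prediction.
    null_tol: hypothetical tolerance at or below which widening counts as null.
    """
    if observed <= null_tol:
        return "null: rules out AEDT mediation at this layer"
    if observed <= bound:
        return "consistent with mechanism-only transmission"
    return "exceeds bound: implies mechanisms beyond wording asymmetry"

for value in (0.0, 0.10, 0.35):
    print(value, "->", classify_widening(value))
```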

What's in here

`run_synthetic_corpus.py`

Generates N=50 paired synthetic MSPE-shaped documents by sampling from paragraph inventories defined in wordings_local.py. Each document is rendered in three LOA arms (none / original / corrected) and scored at the document level under both ACGME core competencies and patent-illustrative attribute indicators, raw and mitigated. Reports paired Δ_original and Δ_corrected with 95% bootstrap CIs. Output: synthetic_corpus/{summary.md, all_scores_long.csv}.
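
A bootstrap interval on paired deltas can be sketched as follows. This assumes a percentile bootstrap over per-document deltas; the script's actual resampling scheme is not described here, and the function name and input values are hypothetical.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of `values`.

    Returns (mean, lower, upper). Resamples documents with replacement.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = round((alpha / 2) * (n_boot - 1))
    hi_idx = round((1 - alpha / 2) * (n_boot - 1))
    return statistics.fmean(values), means[lo_idx], means[hi_idx]

# Hypothetical per-document deltas (arm score minus the same document's
# no-LOA score); these are NOT values from the corpus.
deltas = [1.6, 1.8, 1.7, 1.9, 1.5, 1.75, 1.65, 1.85]
mean, lower, upper = bootstrap_mean_ci(deltas)
print(f"mean={mean:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Because the deltas are paired within document, resampling whole documents keeps each document's arms together.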

Note on the PSExtractor contamination fix: the public aedt-fairness-audit toolkit before 2026-05-22 had a PSExtractor.__init__ that merged custom exemplars into the patent's default four PS questions rather than replacing them. This script's make_extractor_with_only helper performed a post-construction override to work around that. The upstream toolkit was fixed on 2026-05-22 (commit 0951db9); the helper is now redundant on toolkit versions at or above that commit. Verify with extractor._question_keys after construction.

`run_empirical_invite_rate.py`

Bridges the document-level wording effect to a population-scale DI calculation. Plugs the corpus-measured score distributions into a top-K=20% screening simulation (using the public toolkit's top_k_selection for proper random tie-breaking). Two groups, n=3,000 per group, 500 bootstrap replicates per disclosure rate.

Output: empirical_invite_rate/{summary.md, empirical_di_sweep.csv}.
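
The shape of such a simulation can be sketched in plain Python. Everything here is an assumption for illustration: `top_k_select` is a hypothetical stand-in for the toolkit's `top_k_selection` (whose signature is not reproduced here), base scores are drawn from a standard normal rather than from the corpus, and the asymmetry is modeled by giving only group B's disclosing applicants the positive wording delta.

```python
import random

def top_k_select(scores, k_frac=0.20, rng=None):
    """Indices of the top k_frac of scores, ties broken uniformly at random."""
    if rng is None:
        rng = random.Random(0)
    n_select = round(k_frac * len(scores))
    # A random secondary key orders equal scores randomly instead of by index.
    keyed = sorted(((s, rng.random(), i) for i, s in enumerate(scores)),
                   reverse=True)
    return {i for _, _, i in keyed[:n_select]}

def simulate_di(n_per_group=3000, disclosure_rate=0.10, delta=1.75, seed=0):
    """DI = selection rate of group A / selection rate of group B."""
    rng = random.Random(seed)
    scores, groups = [], []
    for group in ("A", "B"):
        for _ in range(n_per_group):
            score = rng.gauss(0.0, 1.0)          # hypothetical base score
            discloses = rng.random() < disclosure_rate
            if discloses and group == "B":       # assumed wording asymmetry
                score += delta
            scores.append(score)
            groups.append(group)
    chosen = top_k_select(scores, 0.20, rng)
    rate = {g: sum(1 for i in chosen if groups[i] == g) / n_per_group
            for g in ("A", "B")}
    return rate["A"] / rate["B"]

print(round(simulate_di(), 3))
```

Sorting on a random secondary key avoids the systematic bias that index-order tie-breaking would introduce when many applicants share a score.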

`nguyen_calibration/`

Race-stratified ERAS population calibration anchored on the JAMA-published 2014–2016 cohort (Nguyen et al., JAMA April 15, 2024; doi:10.1001/jama.2024.4797). Inputs: race × LOA prevalences and Step 2 CK distributions from the published Table 1; ACGME-mitigated MSPE _total anchorings from this repo's synthetic_corpus/; realistic ERAS-style composite weights (w_MSPE = 0.4, w_Step2CK = 0.4, w_other = 0.2). Sweeps the free parameter P(rehab-LOA | race) across four heterogeneity scenarios (null → strong). Outputs: per-scenario Black-LOA / White-LOA crude-OR ratios with bootstrap CIs.

run_nguyen_calibration.py: the simulation script.
results.json: machine-readable per-scenario results.
summary.md: full inputs, scenarios, and headline table.
findings_significance.md: what the combined work establishes and the pre-registered falsifiable prediction.
figure.png / figure.pdf: visualization of the per-scenario bounds against the JAMA-reported pre-AI baseline.
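
The crude-OR ratio and its widening over null can be illustrated with a short calculation. This sketch assumes each crude odds ratio compares a race × LOA cell against a common reference cell; all counts and function names are hypothetical and are not taken from the JAMA table or from results.json.

```python
def crude_odds_ratio(cell_events, cell_total, ref_events, ref_total):
    """Crude (unadjusted) odds ratio of a cell against a reference cell."""
    odds_cell = cell_events / (cell_total - cell_events)
    odds_ref = ref_events / (ref_total - ref_events)
    return odds_cell / odds_ref

def widening(ratio_scenario, ratio_null):
    """Fractional widening of a scenario's OR ratio over the null scenario."""
    return ratio_scenario / ratio_null - 1.0

# Hypothetical counts of the adverse outcome per cell.
or_black_loa = crude_odds_ratio(40, 100, 100, 1000)   # odds 2/3 vs 1/9
or_white_loa = crude_odds_ratio(25, 100, 100, 1000)   # odds 1/3 vs 1/9
ratio = or_black_loa / or_white_loa

print(round(or_black_loa, 2), round(or_white_loa, 2), round(ratio, 2))
print(round(widening(2.4, 2.0), 2))  # a hypothetical scenario vs. null
```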

`f1_perturbations/`, `f2_threshold/`, `f4_embedding/`, `f5_multifeature/`, `f6_length/`, `f8_clusterboot/`

Sensitivity tests for the mechanism finding and the population-level DI claim. Each subdirectory has its own summary.md.

F1 (wording perturbation) surfaces the actual triggering condition: a sentence-level clinical-register lexical configuration in subject/predicate position crosses the §530 gate; the same tokens in a subordinate clause do not.
F2 (threshold sweep) shows that the mechanism direction (positive Δ_corrected) is preserved across thresholds 0.20–0.55, but the magnitude is sensitive to the implementer's choice of threshold: Δ_corrected collapses to zero at thresholds ≥ 0.40 and peaks at thresholds ≤ 0.20.
F4 confirms direction-robustness within the SBERT embedder family (cross-family swap untested).
F5 quantifies the population-level dilution under multi-feature ranking: under realistic ERAS-style weights, the MSPE-only DI failure dilutes substantially.
F6 confirms per-sentence locality (the LOA paragraph itself is byte-identical across length conditions, by construction).
F8 (cluster bootstrap) widens DI uncertainty materially (up to 3.74× wider at the worst rate) and flips the "CI entirely below 0.80" status at low disclosure rates. The narrower DI wording that F8 supports is recorded in f8_clusterboot/summary.md.
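
The threshold sensitivity that F2 reports follows directly from how a hard gate behaves. The toy sweep below uses hypothetical per-sentence similarities and a deliberately simplified contribution (the sum of similarities that pass the gate) in place of the full aggregation, so the numbers are illustrative only.

```python
def gated_contribution(similarities, threshold):
    """Sum of sentence similarities that meet or exceed the gate."""
    return sum(s for s in similarities if s >= threshold)

# Hypothetical max-cosine similarities for the sentences of one LOA paragraph.
paragraph_sims = [0.38, 0.31, 0.24]

sweep = {t: gated_contribution(paragraph_sims, t)
         for t in (0.20, 0.30, 0.35, 0.40, 0.55)}
for threshold, contribution in sweep.items():
    print(f"threshold={threshold:.2f}  contribution={contribution:.2f}")
```

Under these assumed similarities the contribution is largest at the lowest threshold, shrinks as sentences drop below the gate, and is exactly zero once the threshold exceeds the highest similarity.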

`make_figures.py`

Generates the seven PNG figures referenced in walkthrough.md (one composite + six detail) from the corpus and DI-sweep CSVs. Forces the matplotlib Agg backend at import so bare python3 make_figures.py works on macOS without setting MPLBACKEND=Agg.

`pipeline.mmd` / `pipeline_full.mmd`

Mermaid diagrams of the patent-faithful MSPE pipeline (focused) and the entire patent architecture (§502–§532, non-ML factors, GUI, ranking).

Pipeline (inherited from the published toolkit)

text
  → BiasMitigator (§532 input-side, Claim 1)        [from chadmarkey/aedt-fairness-audit]
  → SBERT MiniLM-L6-v2
  → cosine similarity to attribute-indicator exemplar centroids
  → threshold gate at 0.35
  → softmax β=8 soft-assignment across indicators
  → power-2 aggregation per §530
  → per-indicator score + _total
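
The scoring stages after embedding can be sketched on toy vectors. This is not the toolkit's implementation: the README does not give the exact §530 formula, so the order of operations and the reading of "power-2 aggregation" (squaring each soft-assigned similarity and summing over sentences) are assumptions, and the three-dimensional vectors stand in for SBERT embeddings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def score_document(sentence_vecs, centroids, threshold=0.35, beta=8.0):
    """Gate -> softmax soft-assignment -> squared aggregation (assumed form)."""
    totals = {name: 0.0 for name in centroids}
    for vec in sentence_vecs:
        sims = {name: cosine(vec, c) for name, c in centroids.items()}
        best = max(sims.values())
        if best < threshold:
            continue  # sentence fails the gate and contributes nothing
        # Softmax with inverse temperature beta (shifted by `best` for stability).
        weights = {n: math.exp(beta * (s - best)) for n, s in sims.items()}
        norm = sum(weights.values())
        for name, sim in sims.items():
            totals[name] += (weights[name] / norm * sim) ** 2
    totals["_total"] = sum(totals.values())
    return totals

# Toy indicator centroids and sentence vectors (hypothetical).
centroids = {"indicator_a": [1.0, 0.0, 0.0], "indicator_b": [0.0, 1.0, 0.0]}
on_topic = [1.0, 0.0, 0.0]    # similarity 1.0 to indicator_a, passes the gate
off_topic = [0.0, 0.0, 1.0]   # similarity 0.0 to both, fails the gate

print(score_document([on_topic], centroids))
print(score_document([on_topic, off_topic], centroids))
```

The second call returns the same scores as the first, which is the per-sentence locality that a hard gate implies: a sentence below the threshold leaves the document score unchanged.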

This repo does not modify the public aedt-fairness-audit library; it imports from it via sys.path. Scripts assume the public audit repo is cloned at ~/aedt-fairness-audit/ (or installed as a package).
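
One way the sys.path arrangement could look, as a minimal sketch: `add_toolkit_to_path` is a hypothetical helper, and the toolkit's importable module names are not shown here, so no import statement is included.

```python
import sys
from pathlib import Path

# Default clone location assumed by the scripts.
TOOLKIT_DIR = Path.home() / "aedt-fairness-audit"

def add_toolkit_to_path(toolkit_dir=TOOLKIT_DIR):
    """Put the toolkit checkout at the front of sys.path, once."""
    path = str(toolkit_dir)
    if path not in sys.path:
        sys.path.insert(0, path)
    return path

print(add_toolkit_to_path())
```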

Reproducing from scratch

# 1. Clone the public audit toolkit and check out the pinned commit
git clone https://github.com/chadmarkey/aedt-fairness-audit.git ~/aedt-fairness-audit
(cd ~/aedt-fairness-audit && git checkout 0951db99e3385fd235bd12fe5ba411251f4eba40)

# 2. Install Python dependencies
pip install -r requirements.txt
pip install sentence-transformers spacy
python -m spacy download en_core_web_sm

# 3. Verify the toolkit pin (fails fast if the toolkit is at a different SHA)
python3 _toolkit_pin.py

# 4. Create your local wordings module (template provided)
cp wordings_local.example.py wordings_local.py
# Edit wordings_local.py and populate:
#   LOA_NONE / LOA_ORIGINAL / LOA_CORRECTED                  (the wordings to test)
#   IDENTIFYING_INFO / ACADEMIC_HISTORY_STATIC / ...         (static section content)
#   NOTEWORTHY_VARIANTS / ON_DOCTORING_VARIANTS / ...        (variant inventories)

# 5. Run the corpus generator + scorer (~2-3 min)
python3 run_synthetic_corpus.py

# 6. Run the empirically-anchored DI sweep (~30 sec)
python3 run_empirical_invite_rate.py

# 7. Run the JAMA-calibrated bound (~1 min)
python3 nguyen_calibration/run_nguyen_calibration.py

# 8. Generate figures (~15-20 sec; Fig 6 dominates with its top-K × disclosure sweep)
python3 make_figures.py

# 9. Run the test suite (51 tests, ~2 sec, fully offline)
pytest tests/ -v

requirements.txt lists hard runtime deps. requirements-lock.txt pins the upstream toolkit SHA (0951db99...) plus floor versions of the PyPI deps used to generate the canonical reference outputs. _toolkit_pin.py asserts the toolkit SHA at runtime; bypass with AEDT_MSPE_SKIP_TOOLKIT_PIN_CHECK=1. No API keys; entirely local.
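
A runtime pin check of this kind can be sketched as below. This is an illustration of the idea, not the contents of `_toolkit_pin.py`: the function names are hypothetical, and only the pinned SHA and the bypass variable are taken from this README.

```python
import os
import subprocess

PINNED_SHA = "0951db99e3385fd235bd12fe5ba411251f4eba40"
SKIP_VAR = "AEDT_MSPE_SKIP_TOOLKIT_PIN_CHECK"

def current_sha(repo_dir):
    """Return the checked-out commit SHA of the git repo at repo_dir."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def check_pin(repo_dir, pinned=PINNED_SHA, get_sha=current_sha):
    """Raise if the toolkit is not at the pinned commit, unless bypassed."""
    if os.environ.get(SKIP_VAR) == "1":
        return
    actual = get_sha(repo_dir)
    if actual != pinned:
        raise RuntimeError(
            f"toolkit at {actual}, expected {pinned}; "
            f"set {SKIP_VAR}=1 to bypass"
        )
```

Passing the SHA lookup as a parameter keeps the comparison testable without a git checkout.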

Reproducibility criterion

The audit's reproducibility criterion is byte identity of the data artifacts, not the figure PNGs. Within a single matplotlib + Python environment with the same wordings_local.py, all of the following regenerate to byte-identical output: synthetic_corpus/all_scores_long.csv, synthetic_corpus/summary.md, empirical_invite_rate/empirical_di_sweep.csv, empirical_invite_rate/summary.md, nguyen_calibration/results.json, and the per-test CSVs under f1_perturbations/, f2_threshold/, f4_embedding/, f5_multifeature/, f6_length/, f8_clusterboot/. The figure PNGs strip their matplotlib Software metadata at write time so they also regenerate byte-identically within a given environment, but font availability and matplotlib patch versions can still introduce sub-pixel rendering drift across environments. The pixel content is the visualization; the CSVs are the artifact.
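
Byte identity between two runs can be checked by comparing file hashes. A minimal sketch, assuming the two runs' outputs sit in separate directories; the helper names and the temporary demo files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def differing_artifacts(run_a, run_b, relative_paths):
    """Relative paths whose bytes differ between two output directories."""
    return [rel for rel in relative_paths
            if sha256_of(Path(run_a) / rel) != sha256_of(Path(run_b) / rel)]

# Demo on throwaway files: one identical artifact, one that differs.
with tempfile.TemporaryDirectory() as tmp:
    run_a, run_b = Path(tmp) / "run_a", Path(tmp) / "run_b"
    for run_dir in (run_a, run_b):
        run_dir.mkdir()
        (run_dir / "same.csv").write_text("x,y\n1,2\n")
    (run_a / "diff.csv").write_text("x\n1\n")
    (run_b / "diff.csv").write_text("x\n2\n")
    changed = differing_artifacts(run_a, run_b, ["same.csv", "diff.csv"])

print(changed)
```

An empty list means every listed artifact regenerated byte-identically.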

What's not in this repo (intentionally)

wordings_local.py — local-only; contains all paragraph wordings being tested. Never committed.
documents_local/ — local-only working directory for any source documents.
Scripts that operate on real-MSPE inputs (triangulation runner, corrected-vs-original comparison, paragraph-level audits) — kept local only.
Output directories derived from real-MSPE scoring — kept local only.

Relationship to the published audit

The published audit at chadmarkey/aedt-fairness-audit is the authoritative timestamped record on OSF (osf.io/uk6xn, pre-reg osf.io/vwyjm). It covers the §PS Component personal-statement pipeline and the wording-asymmetry finding on a synthetic personal-statement corpus.

This repo:

Uses the published toolkit as a library; does not modify it.
Extends the audit to the patent's MSPE pipeline (different attribute-indicator set, different document type).
Bounds the simulation's population-level prediction against the JAMA-published 2014–2016 ERAS cohort.

If any of these analyses are written up for peer-reviewed publication, that write-up will reference this repo as the methods record; nothing here is rewritten retroactively to fit a publication.

Citation

If you cite this work, the recommended form is:

Markey, C. (2026). aedt-mspe-experiments: independent population-scale audit of the MSPE pipeline disclosed in US 12,265,502 B1, calibrated against Nguyen et al. (JAMA 2024). GitHub: chadmarkey/aedt-mspe-experiments.

License

See LICENSE. Code: MIT. Analysis text and figures: CC-BY-4.0 unless noted otherwise.
