chadmarkey/aedt-mspe-experiments

aedt-mspe-experiments

This repository audits a patent-faithful MSPE scoring pipeline (US 12,265,502 B1, §530) and finds that the claimed semantic-preservation property is not satisfied: materially equivalent leave-of-absence documentation can receive different scores depending on sentence-level clinical-register wording. Follow-up sensitivity testing narrows the mechanism from "rehabilitative content" to a lexical threshold-gating effect. When calibrated against the JAMA-published ERAS population structure (Nguyen et al., 2024; doi:10.1001/jama.2024.4797), the mechanism is real but bounded, predicting limited amplification rather than explaining the full pre-existing race × LOA placement disparity. The resulting claim is falsifiable in post-AEDT cohorts.

Companion to the published audit at chadmarkey/aedt-fairness-audit (OSF osf.io/uk6xn), which covers the §PS Component personal-statement pipeline.

No MSPE-shaped content is committed to this repo. All paragraph wordings live in a gitignored wordings_local.py (template at wordings_local.example.py). The committed scripts are scoring infrastructure that operates over whatever wordings the local config provides. Anyone reproducing the analysis with a different wording manipulation populates their own wordings_local.py from the example template.
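As a rough illustration of that split, the sketch below shows how a local wordings module and a document renderer could fit together. The constant names are the ones listed in the reproduction steps; their types (plain strings and lists of strings), the placeholder values, and the `render_arms` helper are assumptions for illustration, not the repo's actual template or generator.

```python
import random

# Placeholder values only: real wordings stay in the gitignored wordings_local.py.
LOA_NONE = ""
LOA_ORIGINAL = "<original-arm LOA paragraph>"
LOA_CORRECTED = "<corrected-arm LOA paragraph>"
IDENTIFYING_INFO = "<static identifying-information section>"
NOTEWORTHY_VARIANTS = ["<noteworthy variant A>", "<noteworthy variant B>"]


def render_arms(rng):
    """Render one synthetic document in three LOA arms (hypothetical helper)."""
    # Draw the variant paragraph once so the arms differ only in the LOA paragraph.
    noteworthy = rng.choice(NOTEWORTHY_VARIANTS)
    arms = {}
    for arm, loa in (
        ("none", LOA_NONE),
        ("original", LOA_ORIGINAL),
        ("corrected", LOA_CORRECTED),
    ):
        parts = [IDENTIFYING_INFO, noteworthy, loa]
        arms[arm] = "\n\n".join(p for p in parts if p)  # skip the empty LOA slot
    return arms


docs = render_arms(random.Random(0))
```

Seeding the generator keeps the variant draw reproducible, which matters if the regenerated outputs are expected to match byte for byte.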

Read this first

walkthrough.md — seven-figure walkthrough (one composite + six detail) explaining what the experiments show, with approachable language and explicit caveats.

nguyen_calibration/findings_significance.md — what the work establishes, what it does not, and the falsifiable prediction it pre-registers.

Headline finding

The patent's Claim 1 specifies that the §532 BiasMitigator should "preserve semantic structure." Across N=50 paired synthetic MSPE-shaped documents (~1,300-word median, both indicator sets, both raw and mitigated modes), the patent-faithful pipeline scores documents differently when only the wording of a leave-of-absence paragraph changes. The mechanism the audit identifies is a sentence-level clinical-register lexical sensitivity in the §530 threshold gate.

The F1 perturbation suite establishes this empirically. A standalone sentence containing clinical-register vocabulary in subject/predicate position (e.g., "Clinical responsibilities were paused during this period") crosses the 0.35 cosine-similarity gate to the exemplar centroids and contributes Δ ≈ +1.85. That is the full-magnitude effect, on a wording with no medical context, no remediation language, and no rehabilitative framing. The same words embedded in a subordinate clause produce zero. A standalone "recovery period" sentence without clinical vocabulary produces zero. Stripping "clinical" and "recovery" from a control wording produces zero. The triggering condition is the standalone-sentence clinical-register lexical configuration, not the paragraph's semantic content.

  • Vague-administrative LOA wording (the original arm, a single fixed LOA_ORIGINAL string) contributes zero to the document score to within about 8e-06 (max |Δ_original| = 7.6e-06; the residual is SBERT float32 quantization jitter). The per-document Δ_original is identical across the 50 documents by construction, because the same fixed string is evaluated against the same exemplar centroids, so this is effectively N=1 for the original-arm contribution. The 50/50 framing applies meaningfully only to the paired Δ_corrected − Δ_original contrast, where the paragraph-variant draw differs between documents and the result is 50/50 above zero with min Δ_corrected ≈ 1.31.
  • Rehabilitative-content LOA wording (the corrected arm) contributes strictly positive score on 50/50 docs in all four cells: ACGME mitigated Δ_corrected = 1.747, 95% CI [1.690, 1.803]; patent-illustrative mitigated Δ_corrected = 2.734, CI [2.639, 2.828]. F1 establishes that the corrected arm's positive contribution operates through the same standalone-sentence clinical-register mechanism — not through "rehabilitative semantics."

Claim 1's "preserve semantic structure" requirement is empirically not met under this interpretation either: the mitigator does not penalize the insertion of a standalone clinical-register sentence into an otherwise vague-administrative LOA paragraph, even when that sentence semantically describes cessation of clinical activity rather than rehabilitation. Same person, same fact, lexical-register-different wording → different outcome.

What the population-scale numbers actually mean

Plugged into an MSPE-only top-K=20% screening simulation under uniform-group assumptions, the wording effect produces DI below the EEOC 4/5 threshold across realistic disclosure rates (0–25%). This is a single-feature failure-mode result: it answers what the mechanism alone predicts at population scale, holding the rest of the application fixed.

Under a realistic multi-feature ranker (see f5_multifeature/), with MSPE weighted at 0.4 alongside Step 2 CK, LORs, signaling, and program filters, DI at 10% disclosure rises to 0.812 — above the 4/5 threshold. The single-feature DI failure result therefore characterizes the mechanism's strength under MSPE-only ranking, not its deployed population impact. The JAMA-calibrated bound in nguyen_calibration/ is the population-scale claim; the single-feature DI is the mechanism diagnostic.

The JAMA-calibrated simulation replaces the uniform-group assumption with race-stratified LOA prevalences and Step 2 CK distributions for the 2014–2016 ERAS cohort (n=37,485 applicants), as reported in JAMA April 15, 2024 (doi:10.1001/jama.2024.4797). Under the strongest documentation-heterogeneity assumption swept (N=200 bootstrap replicates × 20,000 applicants), the mechanism produces ~20% widening of the Black-LOA / White-LOA ratio over null. That is about 15% of the JAMA-reported pre-AI excess gradient over parity, not the full magnitude.

The bounded reading:

  • The mechanism (sentence-level clinical-register lexical sensitivity) is empirically established by the F1 decoy experiment (V9 alone, no rehab content, full-magnitude trigger).
  • Its population-level contribution, in a realistic multi-feature ranker, is small relative to the JAMA-reported pre-AI gradient.
  • The pre-registered prediction: post-AI cohort replication of the JAMA pre-AI cohort will show at most ~20% widening of the Black-LOA / White-LOA ratio (about 15% of the JAMA-reported pre-AI excess gradient over parity) attributable to this mechanism. Observed widening above that bound implies AEDT-mediated mechanisms beyond wording asymmetry; observed widening at or below it is consistent with mechanism-only transmission; null widening rules out AEDT mediation at this layer.

What's in here

run_synthetic_corpus.py

Generates N=50 paired synthetic MSPE-shaped documents by sampling from paragraph inventories defined in wordings_local.py. Each document is rendered in three LOA arms (none / original / corrected) and scored at the document level under both ACGME core competencies and patent-illustrative attribute indicators, raw and mitigated. Reports paired Δ_original and Δ_corrected with 95% bootstrap CIs. Output: synthetic_corpus/{summary.md, all_scores_long.csv}.
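The paired-Δ reporting can be pictured with a small percentile bootstrap. This is a minimal sketch, assuming each Δ is an arm's document score minus the same document's none-arm score and that the CI is a percentile interval over resampled documents; the script's actual resampling scheme is not documented here, and `paired_bootstrap_ci` and the toy values are illustrative only.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired per-document deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample documents with replacement and record the mean of each replicate.
    means = sorted(fmean(rng.choices(deltas, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return fmean(deltas), lo, hi


# Toy deltas for illustration; these are not values from the repo's outputs.
toy_deltas = [1.6, 1.7, 1.8, 1.75, 1.65, 1.9, 1.7, 1.8]
mean, lo, hi = paired_bootstrap_ci(toy_deltas)
```

Resampling whole documents rather than individual scores keeps each pair intact, which is what makes the interval a statement about the paired contrast.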

Note on the PSExtractor contamination fix: the public aedt-fairness-audit toolkit before 2026-05-22 had a PSExtractor.__init__ that merged custom exemplars into the patent's default four PS questions rather than replacing them. This script's make_extractor_with_only helper performed a post-construction override to work around that. The upstream toolkit was fixed on 2026-05-22 (commit 0951db9); the helper is now redundant on toolkit versions at or above that commit. Verify with extractor._question_keys after construction.

run_empirical_invite_rate.py

Bridges the document-level wording effect to a population-scale DI calculation. Plugs the corpus-measured score distributions into a top-K=20% screening simulation (using the public toolkit's top_k_selection for proper random tie-breaking). Two groups, n=3,000 per group, 500 bootstrap replicates per disclosure rate.

Output: empirical_invite_rate/{summary.md, empirical_di_sweep.csv}.
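The core of a top-K disparate-impact calculation can be sketched in a few lines. The example below is a stand-in under stated assumptions: it pools two groups, selects the top 20% with random tie-breaking, and reports the ratio of selection rates. It does not use the toolkit's `top_k_selection`, and `disparate_impact` is a hypothetical helper.

```python
import random


def disparate_impact(focal_scores, reference_scores, top_frac=0.20, seed=0):
    """Selection-rate ratio (focal / reference) under pooled top-K screening."""
    rng = random.Random(seed)
    pool = [(s, "focal") for s in focal_scores]
    pool += [(s, "ref") for s in reference_scores]
    # Shuffle first; the stable sort below then breaks score ties at random.
    rng.shuffle(pool)
    pool.sort(key=lambda item: item[0], reverse=True)
    k = round(top_frac * len(pool))
    selected = [group for _, group in pool[:k]]
    focal_rate = selected.count("focal") / len(focal_scores)
    ref_rate = selected.count("ref") / len(reference_scores)
    return focal_rate / ref_rate if ref_rate else float("inf")


# Toy scores for illustration only.
reference = list(range(2, 21, 2))     # 2, 4, ..., 20
focal_matched = list(range(1, 20, 2)) # 1, 3, ..., 19
focal_shifted = [s - 4 for s in focal_matched]

di_matched = disparate_impact(focal_matched, reference)
di_shifted = disparate_impact(focal_shifted, reference)
```

With interleaved scores the ratio is 1.0; shifting one group's scores down pushes it below the 0.80 four-fifths threshold, which is the comparison the sweep reports at each disclosure rate.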

nguyen_calibration/

Race-stratified ERAS population calibration anchored on the JAMA-published 2014–2016 cohort (Nguyen et al., JAMA April 15, 2024; doi:10.1001/jama.2024.4797). Inputs: race × LOA prevalences and Step 2 CK distributions from the published Table 1; ACGME-mitigated MSPE _total anchorings from this repo's synthetic_corpus/; realistic ERAS-style composite weights (w_MSPE = 0.4, w_Step2CK = 0.4, w_other = 0.2). Sweeps the free parameter P(rehab-LOA | race) across four heterogeneity scenarios (null → strong). Outputs: per-scenario Black-LOA / White-LOA crude-OR ratios with bootstrap CIs.

  • run_nguyen_calibration.py: the simulation script.
  • results.json: machine-readable per-scenario results.
  • summary.md: full inputs, scenarios, and headline table.
  • findings_significance.md: what the combined work establishes and the pre-registered falsifiable prediction.
  • figure.png / figure.pdf: visualization of the per-scenario bounds against the JAMA-reported pre-AI baseline.
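The composite weighting and the crude-OR ratio named above reduce to simple arithmetic. The sketch below assumes the three features are already standardized and that each crude odds ratio comes from a 2×2 table of placed / not-placed counts; the function names and toy counts are illustrative and are not taken from run_nguyen_calibration.py or the published Table 1.

```python
def composite_score(z_mspe, z_step2ck, z_other,
                    w_mspe=0.4, w_step2ck=0.4, w_other=0.2):
    """Weighted sum of standardized features, using the stated weights."""
    return w_mspe * z_mspe + w_step2ck * z_step2ck + w_other * z_other


def crude_odds_ratio(exposed_events, exposed_nonevents,
                     unexposed_events, unexposed_nonevents):
    """Unadjusted odds ratio from a 2x2 table of counts."""
    exposed_odds = exposed_events / exposed_nonevents
    unexposed_odds = unexposed_events / unexposed_nonevents
    return exposed_odds / unexposed_odds


# Toy counts (placed, not placed) for LOA vs. no-LOA applicants in two groups.
or_group_a = crude_odds_ratio(10, 90, 20, 80)
or_group_b = crude_odds_ratio(15, 85, 20, 80)
or_ratio = or_group_a / or_group_b  # the per-scenario quantity being compared
```

Under these assumptions a ratio of 1.0 means the LOA penalty is the same in both groups, so widening is measured as movement away from the null scenario's ratio.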

f1_perturbations/, f2_threshold/, f4_embedding/, f5_multifeature/, f6_length/, f8_clusterboot/

Sensitivity tests for the mechanism finding and the population-level DI claim. Each subdirectory has its own summary.md.

  • F1 (wording perturbation) surfaces the actual triggering condition: a sentence-level clinical-register lexical configuration in subject/predicate position crosses the §530 gate; the same tokens in a subordinate clause do not.
  • F2 (threshold sweep) shows the mechanism direction (positive Δ_corrected) is preserved across thresholds 0.20–0.55, but the magnitude is sensitive to the implementer's choice of threshold: Δ_corrected collapses to zero at thresholds ≥ 0.40 and peaks at thresholds ≤ 0.20.
  • F4 confirms direction-robustness within the SBERT embedder family (cross-family swap untested).
  • F5 quantifies the population-level dilution under multi-feature ranking: under realistic ERAS-style weights, the MSPE-only DI failure dilutes substantially.
  • F6 confirms per-sentence locality (the LOA paragraph itself is byte-identical across length conditions, by construction).
  • F8 (cluster bootstrap) widens DI uncertainty materially (up to 3.74× wider at the worst rate) and flips the "CI entirely below 0.80" status at low disclosure rates. The narrower DI wording F8 supports is recorded in f8_clusterboot/summary.md.

make_figures.py

Generates the seven PNG figures referenced in walkthrough.md (one composite + six detail) from the corpus and DI-sweep CSVs. Forces the matplotlib Agg backend at import so bare python3 make_figures.py works on macOS without setting MPLBACKEND=Agg.

pipeline.mmd / pipeline_full.mmd

Mermaid diagrams of the patent-faithful MSPE pipeline (focused) and the entire patent architecture (§502–§532, non-ML factors, GUI, ranking).

Pipeline (inherited from the published toolkit)

text
  → BiasMitigator (§532 input-side, Claim 1)        [from chadmarkey/aedt-fairness-audit]
  → SBERT MiniLM-L6-v2
  → cosine similarity to attribute-indicator exemplar centroids
  → threshold gate at 0.35
  → softmax β=8 soft-assignment across indicators
  → power-2 aggregation per §530
  → per-indicator score + _total

This repo does not modify the public aedt-fairness-audit library; it imports from it via sys.path. Scripts assume the public audit repo is cloned at ~/aedt-fairness-audit/ (or installed as a package).
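To make the gate's role concrete, here is a toy sketch of the scoring stages listed above. It is not the toolkit's code: the 3-dimensional vectors stand in for SBERT embeddings, the gate is assumed to test each sentence's best similarity, and "power-2 aggregation" is assumed to mean summing softmax-weighted squared similarities. Only the 0.35 threshold and β=8 come from the pipeline description.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def score_document(sentence_vecs, centroids, threshold=0.35, beta=8.0):
    """Gate, soft-assign, and aggregate per-sentence similarities (assumed form)."""
    scores = {name: 0.0 for name in centroids}
    for vec in sentence_vecs:
        sims = {name: cosine(vec, c) for name, c in centroids.items()}
        best = max(sims.values())
        if best < threshold:
            continue  # below the gate: the sentence contributes exactly zero
        # Softmax over indicators, shifted by the max for numerical stability.
        exps = {name: math.exp(beta * (s - best)) for name, s in sims.items()}
        z = sum(exps.values())
        for name, s in sims.items():
            scores[name] += (exps[name] / z) * s ** 2  # assumed power-2 term
    scores["_total"] = sum(scores.values())
    return scores


# Hypothetical centroids and sentence vectors, chosen to sit on either side of the gate.
centroids = {"indicator_a": [1.0, 0.0, 0.0], "indicator_b": [0.0, 1.0, 0.0]}
gated = score_document([[0.2, 0.2, 1.0]], centroids)   # best similarity ≈ 0.19
passed = score_document([[0.9, 0.1, 0.1]], centroids)  # best similarity ≈ 0.99
```

The hard gate is what produces an all-or-nothing response: a sentence just under the threshold adds nothing, while a sentence just over it adds its full weighted contribution.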

Reproducing from scratch

# 1. Clone the public audit toolkit and check out the pinned commit
git clone https://github.com/chadmarkey/aedt-fairness-audit.git ~/aedt-fairness-audit
(cd ~/aedt-fairness-audit && git checkout 0951db99e3385fd235bd12fe5ba411251f4eba40)

# 2. Install Python dependencies
pip install -r requirements.txt
pip install sentence-transformers spacy
python -m spacy download en_core_web_sm

# 3. Verify the toolkit pin (fails fast if the toolkit is at a different SHA)
python3 _toolkit_pin.py

# 4. Create your local wordings module (template provided)
cp wordings_local.example.py wordings_local.py
# Edit wordings_local.py and populate:
#   LOA_NONE / LOA_ORIGINAL / LOA_CORRECTED                  (the wordings to test)
#   IDENTIFYING_INFO / ACADEMIC_HISTORY_STATIC / ...         (static section content)
#   NOTEWORTHY_VARIANTS / ON_DOCTORING_VARIANTS / ...        (variant inventories)

# 5. Run the corpus generator + scorer (~2-3 min)
python3 run_synthetic_corpus.py

# 6. Run the empirically-anchored DI sweep (~30 sec)
python3 run_empirical_invite_rate.py

# 7. Run the JAMA-calibrated bound (~1 min)
python3 nguyen_calibration/run_nguyen_calibration.py

# 8. Generate figures (~15-20 sec; Fig 6 dominates with its top-K × disclosure sweep)
python3 make_figures.py

# 9. Run the test suite (51 tests, ~2 sec, fully offline)
pytest tests/ -v

requirements.txt lists hard runtime deps. requirements-lock.txt pins the upstream toolkit SHA (0951db99...) plus floor versions of the PyPI deps used to generate the canonical reference outputs. _toolkit_pin.py asserts the toolkit SHA at runtime; bypass with AEDT_MSPE_SKIP_TOOLKIT_PIN_CHECK=1. No API keys; entirely local.
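A runtime pin check of this kind can be sketched as follows. This is an illustration, not the contents of _toolkit_pin.py: the SHA and the environment variable name come from the text above, while `read_head_sha`, `check_pin`, and the error message are hypothetical.

```python
import os
import subprocess

EXPECTED_SHA = "0951db99e3385fd235bd12fe5ba411251f4eba40"
SKIP_ENV = "AEDT_MSPE_SKIP_TOOLKIT_PIN_CHECK"


def read_head_sha(repo_dir):
    """Return the checked-out commit SHA of a git working tree."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def check_pin(actual_sha, expected_sha=EXPECTED_SHA, env=None):
    """Raise if the toolkit is not at the pinned SHA, unless the bypass is set."""
    env = os.environ if env is None else env
    if env.get(SKIP_ENV) == "1":
        return True  # explicit bypass requested by the caller
    if actual_sha != expected_sha:
        raise RuntimeError(
            f"toolkit at {actual_sha}, expected {expected_sha}; "
            f"set {SKIP_ENV}=1 to bypass"
        )
    return True
```

A script could call `check_pin(read_head_sha(os.path.expanduser("~/aedt-fairness-audit")))` before importing the toolkit, so a mismatched checkout fails before any scores are produced.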

Reproducibility criterion

The audit's reproducibility criterion is byte identity of the data artifacts, not the figure PNGs. Within a single matplotlib + Python environment with the same wordings_local.py, all of the following regenerate to byte-identical output: synthetic_corpus/all_scores_long.csv, synthetic_corpus/summary.md, empirical_invite_rate/empirical_di_sweep.csv, empirical_invite_rate/summary.md, nguyen_calibration/results.json, and the per-test CSVs under f1_perturbations/, f2_threshold/, f4_embedding/, f5_multifeature/, f6_length/, f8_clusterboot/. The figure PNGs strip their matplotlib Software metadata at write time so they also regenerate byte-identically within a given environment, but font availability and matplotlib patch versions can still introduce sub-pixel rendering drift across environments; pixel content is the visualization, the CSVs are the artifact.
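One way to check the byte-identity criterion is to hash each regenerated artifact against a reference copy. The sketch below assumes the reference outputs have been saved to a separate directory; `sha256_of` and `diff_artifacts` are hypothetical helpers, not part of the repo.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def diff_artifacts(reference_dir, regenerated_dir, relative_paths):
    """Return the relative paths whose bytes differ between two output trees."""
    mismatched = []
    for rel in relative_paths:
        ref = Path(reference_dir) / rel
        new = Path(regenerated_dir) / rel
        if not new.exists() or sha256_of(ref) != sha256_of(new):
            mismatched.append(rel)
    return mismatched


# Data artifacts named in the criterion; extend with the per-test CSVs as needed.
ARTIFACTS = [
    "synthetic_corpus/all_scores_long.csv",
    "synthetic_corpus/summary.md",
    "empirical_invite_rate/empirical_di_sweep.csv",
    "empirical_invite_rate/summary.md",
    "nguyen_calibration/results.json",
]
```

An empty list from `diff_artifacts(reference_dir, ".", ARTIFACTS)` would indicate that the listed artifacts regenerated byte-identically.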

What's not in this repo (intentionally)

  • wordings_local.py — local-only; contains all paragraph wordings being tested. Never committed.
  • documents_local/ — local-only working directory for any source documents.
  • Scripts that operate on real-MSPE inputs (triangulation runner, corrected-vs-original comparison, paragraph-level audits) — kept local only.
  • Output directories derived from real-MSPE scoring — kept local only.

Relationship to the published audit

The published audit at chadmarkey/aedt-fairness-audit is the authoritative timestamped record on OSF (osf.io/uk6xn, pre-reg osf.io/vwyjm). It covers the §PS Component personal-statement pipeline and the wording-asymmetry finding on a synthetic personal-statement corpus.

This repo:

  1. Uses the published toolkit as a library; does not modify it.
  2. Extends the audit to the patent's MSPE pipeline (different attribute-indicator set, different document type).
  3. Bounds the simulation's population-level prediction against the JAMA-published 2014–2016 ERAS cohort.

If any of these analyses are written up for peer-reviewed publication, that write-up will reference this repo as the methods record; nothing here is rewritten retroactively to fit a publication.

Citation

If you cite this work, the recommended form is:

Markey, C. (2026). aedt-mspe-experiments: independent population-scale audit of the MSPE pipeline disclosed in US 12,265,502 B1, calibrated against Nguyen et al. (JAMA 2024). GitHub: chadmarkey/aedt-mspe-experiments.

License

See LICENSE. Code: MIT. Analysis text and figures: CC-BY-4.0 unless noted otherwise.

About

Private working extension of chadmarkey/aedt-fairness-audit: synthetic-corpus generator + empirically-anchored disparate-impact sweep for the patent's MSPE pipeline.
