Source code for the paper "Matching clinicians with clinical trials using AI", Nature Health, 2026.
If you find our work helpful, please cite it as:
Gao, J., Xiao, C., Glass, L.M. et al. Matching clinicians with clinical trials using AI. Nat. Health 1, 290–299 (2026). https://doi.org/10.1038/s44360-026-00073-6
- Install Python, PyTorch and RecBole. We use Python 3.7.6 and PyTorch 1.12.1.
- We parse trial criteria with the Clinical-Trial-Parser, available at https://github.com/facebookresearch/Clinical-Trial-Parser/tree/main.
- If you plan to use GPU computation, install CUDA.
- The composite similarity metric needs to be added manually to
RecBole/evaluator/metrics.py. The metric calculation function is in utils.py/com_sim.
All data should be downloaded into the data folder.
Public external data
- The CMS Open Payments data: https://www.cms.gov/priorities/key-initiatives/open-payments/data/dataset-downloads. We use the OP_DTL_RSRCH_PGYRXXXX_P01202023.csv files from 2017-2021.
- US state-level zip code mapping file uszips.csv from https://github.com/akinniyi/US-Zip-Codes-With-City-State/tree/master
- Trial XML files from https://clinicaltrials.gov/
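As a minimal sketch of checking that the five yearly research payment files are in place, the snippet below builds the expected paths. It assumes that XXXX in the file name stands for the four-digit program year and that the files sit directly in the data folder; the helper names are hypothetical and not part of this repository.

```python
from pathlib import Path

# Assumption: "XXXX" in the file name is the four-digit program year.
FILE_TEMPLATE = "OP_DTL_RSRCH_PGYR{year}_P01202023.csv"


def open_payments_paths(data_dir="data", years=range(2017, 2022)):
    """Return the expected research-payment CSV path for each program year."""
    return [Path(data_dir) / FILE_TEMPLATE.format(year=year) for year in years]


def missing_files(paths):
    """Return the expected files that are not on disk yet."""
    return [p for p in paths if not p.exists()]


if __name__ == "__main__":
    for path in missing_files(open_payments_paths()):
        print(f"missing: {path}")
```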
We have provided processed data in the data folder. The .pkl files can be read using pickle.load. Some key files are:
- npi2trial.pkl: The linked relationship between NPI and NCTID.
- npi_info_dict.pkl: The clinician information extracted from CMS data, including location information and other public information.
- payment_dict.pkl: The processed CMS dataset. It records the payments from each trial (identified by NCTID) to each clinician or teaching hospital (identified by NPI).
- ie_extracted_clinical_trials.tsv: The trial criteria processed with the Clinical-Trial-Parser.
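The pickled files can be opened with the standard-library pickle.load. The sketch below assumes the files sit in a folder named data and only reports each object's type and size, since the internal layout of the objects is not described here. The helper name load_pickle is hypothetical.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    data_dir = Path("data")  # assumed location of the processed files
    for name in ("npi2trial.pkl", "npi_info_dict.pkl", "payment_dict.pkl"):
        obj = load_pickle(data_dir / name)
        # Report only the container type and size; the exact structure
        # of each object is not assumed.
        size = len(obj) if hasattr(obj, "__len__") else "n/a"
        print(name, type(obj).__name__, size)
```

Unpickling can execute arbitrary code, so only load .pkl files from a source you trust.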
- 01_A_process_payment_data.ipynb: Extract the clinical trial and clinician relationship from the Open Payments data.
- 01_B_process_trial_info.ipynb: Parse clinical trial information from trial XML documents.
- 01_C_process_trial_criteria_embd.ipynb: Generate the trial criteria embeddings using ClinicalBERT.
- 01_D_process_trial_summary_embd.ipynb: Generate the trial summary embeddings using ClinicalBERT.
- 01_E_process_claims_data.ipynb: Process the ICD codes in the claims data.
- 01_F_process_clinician_info.ipynb: Extract the clinician information from the CMS data.
- 01_G_process_geo_data.ipynb: Extract demographic information (e.g., racial and ethnic distributions) from the regional data.
- 02_A_gen_trial_npi_relation.ipynb: Link and filter trial and clinician information.
- 02_B_get_trial_phase.ipynb: Get trial phase and condition information.
- 02_C_get_stat.ipynb: Get basic statistics of the dataset we built.
- 03_A_gen_atom_file.ipynb: Build the atomic dataset under the regular setting for recommendation model training, following the format required by the RecBole package.
- 03_A_gen_zeroshot_atom_file.ipynb: Build the atomic dataset under the temporal setting for recommendation model training, following the format required by the RecBole package.
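RecBole reads datasets from tab-separated "atomic files" whose header names each column as field:type (for example user_id:token). The following is a minimal sketch of writing an interaction file from clinician-trial pairs. The dataset name doctr_demo, the choice of fields, and the example identifiers are made up for illustration and are not taken from the notebooks.

```python
import csv
from pathlib import Path


def write_inter_file(pairs, dataset_dir, dataset_name):
    """Write (user, item) pairs as a RecBole .inter atomic file.

    RecBole looks for <dataset_dir>/<dataset_name>/<dataset_name>.inter.
    """
    out_dir = Path(dataset_dir) / dataset_name
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{dataset_name}.inter"
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        # Header: each column is declared as "<field name>:<field type>".
        writer.writerow(["user_id:token", "item_id:token"])
        writer.writerows(pairs)
    return out_path


if __name__ == "__main__":
    # Hypothetical NPI / NCTID pairs, for illustration only.
    example_pairs = [("1234567890", "NCT00000001"), ("1234567890", "NCT00000002")]
    print(write_inter_file(example_pairs, "dataset", "doctr_demo"))
```

Additional features would go into separate .user and .item files using the same header convention.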
- 04_train_doctr.ipynb: Build and evaluate the proposed DocTr model.
- 05_A_get_competing_trial.ipynb: Extract the competing trials from the trial relationships.
- 05_B_fairness_analysis.ipynb: Run the genetic algorithm to improve the fairness of the recommendation results, and report the results. The genetic algorithm is in genetic.py.