Abstract: Machine learning (ML) holds great promise to support, improve, and automate clinical decision-making in hospitals. Model training on abundantly available routine data, however, is hindered by data protection regulations. Generative models can comply with privacy laws by learning to synthesize hospital data from a target population while ensuring data privacy. Clinical time series acquired during intensive care are difficult to model using established techniques, especially due to uneven sampling intervals. Here we introduce GHOSTS (Generator of Hospital Time Series), a novel generator of synthetic patient trajectories that is capable of generating realistic heterogeneous hospital data, including time series with uneven sampling intervals and static patient attributes. To achieve this, GHOSTS introduces novel regularizers and a postprocessing module leveraging low-dimensional summary statistics. We further present a suite of novel benchmarks for synthetic hospital time series, GHOSTS-Bench. We train GHOSTS on a large cohort of patient data from the MIMIC-IV and EICU critical care datasets. Along with measuring the quality of the generated data in terms of how faithfully the distributions of the real data and their spatio-temporal dynamics are preserved, we also measure how well ML models trained on the generated data can solve a clinical prediction task on the real data. We observe that GHOSTS outperforms a state-of-the-art approach, DoppelGANger, with respect to these criteria. We intend to make the GHOSTS model, a corpus of synthetic data, and Python code implementing GHOSTS and GHOSTS-Bench publicly available. We expect these resources to be instrumental in the future development of powerful predictive models for intensive and perioperative care.
Companion repositories
This is the source code for reproducing experiments in the paper "GHOSTS: Generation of synthetic hospital time series for clinical machine learning research".
- Dataset generation: Code for generating the datasets can be found in the `data_preparation` folder.
  a. MIMIC-IV dataset: Run the SQL queries found in `data_preparation/queries.sql` and run `data_preparation/preprocess.ipynb`, substituting your own paths.
  b. EICU dataset: Run `data_preparation/extract_r_ricu_eicu.ipynb` with an R Jupyter kernel, or copy the code into an R script. Then run `data_preparation/eicu_preprocessing.ipynb`, substituting your own paths.
  c. HALO: The HALO model requires a special encoding. To generate HALO datasets, run `data_preparation/convert_to_halo.ipynb` to encode the data for HALO. That notebook also contains the code for decoding back from the HALO format.
- Benchmarking: For benchmarking, please install the companion repository, GHOSTS-Bench. Configs and instructions can be found in the `benchmark_configs` folder.
- Figures and tables: Jupyter notebooks for the figures and tables can be found in the `figures` folder.
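
Before working through the steps above, it can help to confirm that a local checkout contains the files and folders they refer to. The following is a minimal sketch and not part of the repository: the `EXPECTED` mapping and the `missing_paths` helper are hypothetical names introduced here, and the script assumes it is run from the repository root. It only checks that paths exist; it does not run the SQL queries or notebooks.

```python
from pathlib import Path

# Files and folders named in the steps above, relative to the repository root.
# The grouping into steps is an illustrative assumption, not a repository API.
EXPECTED = {
    "MIMIC-IV dataset": [
        "data_preparation/queries.sql",
        "data_preparation/preprocess.ipynb",
    ],
    "EICU dataset": [
        "data_preparation/extract_r_ricu_eicu.ipynb",
        "data_preparation/eicu_preprocessing.ipynb",
    ],
    "HALO": ["data_preparation/convert_to_halo.ipynb"],
    "Benchmarking": ["benchmark_configs"],
    "Figures and tables": ["figures"],
}


def missing_paths(repo_root="."):
    """Return {step: [missing paths]} for every step with absent files or folders."""
    root = Path(repo_root)
    missing = {}
    for step, paths in EXPECTED.items():
        absent = [p for p in paths if not (root / p).exists()]
        if absent:
            missing[step] = absent
    return missing


if __name__ == "__main__":
    problems = missing_paths(".")
    if not problems:
        print("All expected files and folders were found.")
    for step, paths in problems.items():
        print(f"{step}: missing {', '.join(paths)}")
```

An empty result means every path mentioned above is present; the paths inside the notebooks still have to be substituted by hand as described.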