GHOSTS: Generation of synthetic hospital time series for clinical machine learning research

Abstract: Machine learning (ML) holds great promise to support, improve, and automate clinical decision-making in hospitals. Model training on abundantly available routine data, however, is hindered by data protection regulations. Generative models can comply with privacy laws by learning to synthesize hospital data from a target population while ensuring data privacy. Clinical time series acquired during intensive care are difficult to model using established techniques, especially due to uneven sampling intervals. Here we introduce GHOSTS (Generator of Hospital Time Series), a novel generator of synthetic patient trajectories that is capable of generating realistic heterogeneous hospital data, including time series with uneven sampling intervals and static patient attributes. To achieve this, GHOSTS introduces novel regularizers and a postprocessing module leveraging low-dimensional summary statistics. We further present a suite of novel benchmarks for synthetic hospital time series, GHOSTS-Bench. We train GHOSTS on a large cohort of patient data from the MIMIC-IV and eICU critical care datasets. Along with measuring the quality of the generated data in terms of how faithfully the distributions of the real data and their spatio-temporal dynamics are preserved, we also measure how well ML models trained on the generated data can solve a clinical prediction task on the real data. We observe that GHOSTS outperforms a state-of-the-art approach, DoppelGANger, with respect to these criteria. We intend to make the GHOSTS model, a corpus of synthetic data, and Python code implementing GHOSTS and GHOSTS-Bench publicly available. These resources will become instrumental in the future development of powerful predictive models for intensive and perioperative care.
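
The last evaluation criterion in the abstract (train a model on generated data, then test it on real data) can be illustrated with a small, self-contained sketch. This is not the GHOSTS-Bench implementation: the toy cohorts, the nearest-centroid classifier, and every function name below are hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
import random
from statistics import mean


def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean feature vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y):
    """Fraction of samples in (X, y) that the classifier labels correctly."""
    hits = sum(predict(centroids, x) == lab for x, lab in zip(X, y))
    return hits / len(y)


def toy_cohort(n, shift, rng):
    """Hypothetical stand-in data: two Gaussian features per sample,
    with the mean of class 1 shifted by `shift` relative to class 0."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(label * shift, 1.0), rng.gauss(label * shift, 1.0)])
        y.append(label)
    return X, y


rng = random.Random(0)
real_train_X, real_train_y = toy_cohort(500, 3.0, rng)
real_test_X, real_test_y = toy_cohort(500, 3.0, rng)
# Assumption: an imperfect "synthetic" cohort, mimicked by a smaller class shift.
synth_X, synth_y = toy_cohort(500, 2.5, rng)

# Baseline: train on real data, test on held-out real data.
trtr = accuracy(fit_centroids(real_train_X, real_train_y), real_test_X, real_test_y)
# Utility of the synthetic data: train on synthetic, test on the same real data.
tstr = accuracy(fit_centroids(synth_X, synth_y), real_test_X, real_test_y)

print(f"train-real/test-real accuracy:  {trtr:.3f}")
print(f"train-synth/test-real accuracy: {tstr:.3f}")
```

Comparing the two scores shows how much predictive utility is lost by replacing real training data with generated data. A small gap suggests that the generator preserved the signal relevant to the task.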

Companion repositories

This repository contains the source code for reproducing the experiments in the paper "GHOSTS: Generation of synthetic hospital time series for clinical machine learning research".

Dataset generation: Code for generating the datasets can be found in the data_preparation folder.
   a. MIMIC-IV dataset: Run the SQL queries in data_preparation/queries.sql, then run data_preparation/preprocess.ipynb, substituting your own paths.
   b. eICU dataset: Run data_preparation/extract_r_ricu_eicu.ipynb with an R Jupyter kernel, or copy the code into an R script. Then run data_preparation/eicu_preprocessing.ipynb, substituting your own paths.
   c. HALO: The HALO model requires a special encoding. To generate HALO datasets, run data_preparation/convert_to_halo.ipynb to encode the data. That notebook also contains the code for decoding back from the HALO format.
Benchmarking: To run the benchmarks, install the companion repository, GHOSTS-Bench. Configs and instructions can be found in the benchmark_configs folder.
Figures and tables: Jupyter notebooks for the figures and tables can be found in the figures folder.

Folders and files:
benchmark_configs
data_preparation
figures
.gitignore
README.md
generate_metadata_doppelGANger.py
optimize_doppelGANger.py
optimize_torch_gan.py
sample_doppelGANger.py
train_doppelGANger.py
