GitHub - bschilder/synthlab: Generate synthetic EHR, genomics, imaging, and other data types from Python.

Python tools for working with synthetic healthcare datasets.

SynthLab provides Python interfaces for working with major synthetic healthcare datasets, including:

Synthea: A synthetic patient population simulator that generates realistic synthetic patient records
UK Biobank Synthetic Dataset: A large-scale synthetic dataset designed for system testing with UK Biobank-compatible data

Features

Synthea Support

Synthea Runner: Easy-to-use Python interface for running Synthea simulations
OMOP Conversion: Convert Synthea CSV output to OMOP CDM format
AWS Dataset Download: Download pre-generated Synthea OMOP datasets from AWS
Configuration Management: Flexible configuration system with validation

Synthea Coherent Data Set (Multimodal) - NEW!

EHR + Imaging + Genomics: The only publicly available synthetic dataset combining all three modalities
FHIR Records: Complete patient records with demographics, conditions, medications, encounters
MRI DICOM: Synthetic brain imaging linked to patients
Familial Genomes: VCF files with genetic variants for patients and family members
Clinical Notes: SOAP-style clinical documentation
Physiological Data: Time-series vital signs data

UK Biobank Synthetic Dataset Support

Dataset Download: Download tabular, medical, genetic, and bulk data files
Automatic Caching: Files are automatically cached in ~/.cache/synthlab/ukbiobank_synthetic/
MD5 Verification: Automatic checksum verification for downloaded files
Data Loading: Load data into Polars DataFrames for efficient analysis
Category Management: Organized downloads by category (tabular, medical, genetic, bulk)
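
The caching and MD5 verification features listed above can be illustrated with the standard library alone. This is a minimal sketch, not SynthLab's implementation: the helper names `md5_of_file` and `is_cached_and_valid` are hypothetical, and the expected checksum is assumed to be supplied by the caller.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_cached_and_valid(path, expected_md5):
    """True if the file exists and its MD5 matches the expected checksum."""
    path = Path(path)
    return path.is_file() and md5_of_file(path) == expected_md5.lower()
```

A download helper built this way could skip the network request whenever `is_cached_and_valid` returns True and re-download otherwise.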

Genomics Data Support (NEW)

HAPNEST Integration: Download and load HAPNEST synthetic genomics data (1M+ individuals, 6.8M variants)
Synthetic Genotype Generation: Generate simple synthetic genotypes for testing
PLINK Format Support: Work with standard genomics file formats (.pgen, .pvar, .psam)
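
For orientation, a PLINK 2 `.psam` file is a delimited text table of sample metadata. The sketch below reads one with the standard library. It assumes a tab-delimited file with a single header line beginning with `#` (for example `#IID`), which is the common case but not the only layout PLINK 2 accepts. The function name `read_psam` is hypothetical and is not part of SynthLab.

```python
import csv


def read_psam(handle):
    """Parse a tab-delimited .psam stream into a list of row dictionaries.

    Assumes exactly one header line whose first column starts with '#'.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # '#IID' -> 'IID'
    # Skip blank lines; every other row is paired with the header columns
    return [dict(zip(header, row)) for row in reader if row]
```

Usage: `with open("example.psam", newline="") as fh: samples = read_psam(fh)`, where `example.psam` is a placeholder path.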

Medical Imaging Catalog (NEW)

Dataset Discovery: Catalog of 15+ publicly available medical imaging datasets
Multi-modal Coverage: CT, MRI, X-ray, and histopathology datasets
Access Information: Clear documentation of open vs. registration-required datasets
Download Utilities: Helpers for downloading select open-access datasets

Installation

pip install synthlab

For AWS dataset download functionality:

pip install synthlab[aws]

For imaging notebooks (DICOM):

pip install synthlab[imaging]

For genomics helpers:

pip install synthlab[genomics]

Install everything:

pip install synthlab[all]

Optional GPU performance (FlashAttention / FlashAttention 2):

# Requires a supported NVIDIA GPU + CUDA toolchain
# FlashAttention 2 is provided by the same flash-attn package
# See https://github.com/Dao-AILab/flash-attention for compatibility details
pip install flash-attn --no-build-isolation

Requirements

Python 3.8+
Java 11 or newer (required by Synthea)
Polars (for efficient data loading)
Requests (for dataset downloads)
Matplotlib (for plotting)
Optional (imaging notebooks): pydicom
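
Because Synthea needs Java 11 or newer, it can be useful to check the local Java version before running a simulation. The following is a sketch using only the standard library. The function names are hypothetical and this is not how SynthLab itself performs the check.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: "1.8.0_381" means Java 8
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_new_enough(java_executable="java", minimum=11):
    """True if the executable is on PATH and reports at least `minimum`."""
    if shutil.which(java_executable) is None:
        return False
    result = subprocess.run(
        [java_executable, "-version"], capture_output=True, text=True
    )
    # `java -version` writes to stderr; check both streams to be safe
    major = parse_java_major(result.stderr + result.stdout)
    return major is not None and major >= minimum
```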

Quick Start

Synthea: Generate Synthetic Patient Data

from synthlab import SyntheaRunner, SyntheaConfig

# Create a runner (downloads Synthea JAR automatically)
runner = SyntheaRunner()

# Configure a simulation
config = SyntheaConfig(
    population_size=100,
    state="Massachusetts",
    seed=12345,
    output_dir="output/synthea_data"
)

# Run the simulation
result = runner.run(config)

if result['returncode'] == 0:
    print(f"Generated data in: {result['output_dir']}")

Synthea: Convert to OMOP CDM

from synthlab import convert_synthea_to_omop

# Convert Synthea CSV to OMOP CDM format
output_files = convert_synthea_to_omop(
    synthea_csv_dir="output/synthea_data",
    output_dir="output/omop",
    cdm_version="5.4",
    output_format="parquet"
)

print(f"Generated {len(output_files)} OMOP tables")

Synthea: Download Pre-generated Datasets

from synthlab import list_synthea_datasets, download_dataset

# List available datasets
datasets = list_synthea_datasets()

# Download a dataset
download_dataset("synthea1k", output_dir="data/synthea1k")

UK Biobank Synthetic Dataset: Download and Load Data

from synthlab import download_category, load_tabular_data, get_cache_dir

# Download tabular data (saved to ~/.cache/synthlab/ukbiobank_synthetic/tabular/)
download_category("tabular", verify_md5=True)

# Load the data into Polars DataFrames
tabular_data = load_tabular_data(sample_rows=1000)  # Load first 1000 rows for demo

# Access individual files
death_data = tabular_data["dates_death"]
integer_data = tabular_data["integer_no_arrays"]

print(f"Loaded {len(tabular_data)} tabular files")
print(f"Cache directory: {get_cache_dir()}")

Genomics: Download HAPNEST Synthetic Data

from synthlab import download_hapnest_small, load_hapnest_variants, load_hapnest_samples

# Download small HAPNEST test dataset (600 individuals)
data_dir = download_hapnest_small()

# Load variant and sample information
variants = load_hapnest_variants(data_dir)
samples = load_hapnest_samples(data_dir)

print(f"Variants: {len(variants)}")
print(f"Samples: {len(samples)}")

Genomics: Generate Synthetic Genotypes

from synthlab import generate_synthetic_genotypes

# Generate random synthetic genotype data for testing
data_dir = generate_synthetic_genotypes(
    n_samples=1000,
    n_variants=10000,
    seed=42
)
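
To give a sense of what "simple synthetic genotypes" can mean, the sketch below draws genotype dosages (0, 1, or 2 copies of the alternate allele) as two independent allele draws per sample, which corresponds to Hardy-Weinberg proportions. It is an illustration under stated assumptions and not the algorithm used by `generate_synthetic_genotypes`. The function name and the allele-frequency range of 0.05 to 0.5 are hypothetical choices.

```python
import random


def simulate_genotypes(n_samples, n_variants, seed=None):
    """Return a variants x samples matrix of dosages in {0, 1, 2}."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    genotypes = []
    for _ in range(n_variants):
        # Assumed range for the alternate allele frequency of this variant
        freq = rng.uniform(0.05, 0.5)
        # Each sample carries two alleles, each drawn independently
        row = [
            (rng.random() < freq) + (rng.random() < freq)
            for _ in range(n_samples)
        ]
        genotypes.append(row)
    return genotypes
```

Passing the same `seed` yields the same matrix, which is convenient for tests.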

Coherent Data Set: Multimodal EHR + Imaging + Genomics

from synthlab import (
    download_coherent_dataset,
    load_fhir_patients,
    print_coherent_info,
)

# See what's available
print_coherent_info()

# Download specific components (FHIR records + genomics)
download_coherent_dataset(components=['fhir', 'genomics'])

# Or download everything (several GB)
# download_coherent_dataset()

# Load FHIR patient records
patients = load_fhir_patients(max_patients=10)
print(f"Loaded {len(patients)} patient bundles")
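
A FHIR Bundle is a JSON object whose `entry` list wraps individual resources, each tagged with a `resourceType`. The sketch below pulls basic demographics out of such a bundle. It assumes each bundle is available as a plain Python dictionary. Whether `load_fhir_patients` returns dictionaries of this shape is an assumption, and `extract_patients` is a hypothetical helper rather than part of SynthLab.

```python
def extract_patients(bundle):
    """Collect id, gender, and birth date for each Patient in a FHIR Bundle."""
    patients = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Patient":
            continue  # skip conditions, encounters, medications, etc.
        patients.append(
            {
                "id": resource.get("id"),
                "gender": resource.get("gender"),
                "birthDate": resource.get("birthDate"),
            }
        )
    return patients
```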

Medical Imaging: Explore Available Datasets

from synthlab import print_dataset_catalog, list_imaging_datasets, get_dataset_info

# Print catalog of all datasets
print_dataset_catalog()

# Filter by modality
histology = list_imaging_datasets(modality="Histopathology")

# Get info about specific dataset
mhist_info = get_dataset_info("mhist")
print(f"MHIST: {mhist_info.n_images} images, {mhist_info.size_gb} GB")

UK Biobank Synthetic Dataset: Download Specific Files

from synthlab import download_file

# Download a single file
download_file("dates_death.tsv", category="tabular")

# Files are automatically saved to ~/.cache/synthlab/ukbiobank_synthetic/tabular/

Documentation

SyntheaRunner

The main class for running Synthea simulations.

runner = SyntheaRunner(
    jar_path=None,          # Path to existing JAR (auto-downloads if None)
    jar_url=SYNTHEA_JAR_URL,  # URL to download JAR from
    cache_dir=None,         # Cache directory (defaults to OS cache)
    java_executable="java"  # Java executable path
)

SyntheaConfig

Configuration class for Synthea simulations.

config = SyntheaConfig(
    population_size=100,     # Number of patients
    seed=12345,              # Random seed
    state="Massachusetts",   # US state
    city="Boston",           # Optional city
    min_age=0,              # Minimum age
    max_age=100,            # Maximum age
    gender="M",             # "M", "F", or None
    output_dir="output"     # Output directory
)
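
To show how configuration fields like these could map onto the Synthea command line, here is a sketch that validates the values and builds an argument list. It uses Synthea's documented flags (`-p` population, `-s` seed, `-a` age range, `-g` gender, and `--exporter.baseDirectory` for the output location). The function `build_synthea_command` is hypothetical, and how SynthLab actually constructs its command is not shown in this README.

```python
def build_synthea_command(
    jar_path,
    population_size=100,
    seed=None,
    state=None,
    city=None,
    min_age=None,
    max_age=None,
    gender=None,
    output_dir=None,
    java_executable="java",
):
    """Validate settings and return a Synthea command as a list of strings."""
    if population_size < 1:
        raise ValueError("population_size must be at least 1")
    if gender not in (None, "M", "F"):
        raise ValueError('gender must be "M", "F", or None')
    if city is not None and state is None:
        raise ValueError("city requires state")

    cmd = [java_executable, "-jar", str(jar_path), "-p", str(population_size)]
    if seed is not None:
        cmd += ["-s", str(seed)]
    if min_age is not None and max_age is not None:
        if min_age > max_age:
            raise ValueError("min_age must not exceed max_age")
        cmd += ["-a", f"{min_age}-{max_age}"]
    if gender is not None:
        cmd += ["-g", gender]
    if output_dir is not None:
        cmd.append(f"--exporter.baseDirectory={output_dir}")
    # State and optional city are positional arguments
    if state is not None:
        cmd.append(state)
    if city is not None:
        cmd.append(city)
    return cmd
```

The resulting list can be passed to `subprocess.run` without a shell, which avoids quoting problems with state or city names that contain spaces.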

UK Biobank Synthetic Dataset Functions

from synthlab import (
    list_available_files,
    download_file,
    download_category,
    load_tabular_data,
    load_medical_records,
    load_genetic_dictionary,
    get_cache_dir,
)

# List available files
files = list_available_files(category="tabular")

# Download entire category
download_category("tabular")  # Downloads to ~/.cache/synthlab/ukbiobank_synthetic/tabular/

# Download single file
download_file("dates_death.tsv", category="tabular")

# Load data (uses cache directory by default)
data = load_tabular_data(sample_rows=1000)
medical = load_medical_records(sample_rows=10000)
genetic_dict = load_genetic_dictionary()

# Get cache directory
cache_dir = get_cache_dir()  # Returns ~/.cache/synthlab/ukbiobank_synthetic/

Convenience Methods

# Synthea quick test run
runner.run_quick(population_size=10, state="Massachusetts")

# Synthea custom location
runner.run_custom_location(state="California", city="San Francisco", population_size=100)

# Synthea age-specific population
runner.run_age_specific(min_age=25, max_age=65, population_size=100)

Dataset Information

Synthea

Synthea generates synthetic patient records with:

Demographics
Medical history
Medications
Lab results
Procedures
Encounters

Reference: Synthea GitHub

UK Biobank Synthetic Dataset

The UK Biobank Synthetic Dataset contains:

Tabular Records (23 TSV files): Main phenotype data (~600K participants × ~27K columns)
- Survey responses, measurements, clinical data
- Files: dates_death.tsv, integer_no_arrays.tsv, real_fields1.tsv, etc.
Medical Records (6 text files): GP clinical records (~400M rows)
- Diagnosis codes (Read 2/3), visit data, clinical events
Genetic Records: SNP genotype data (~600K participants × 840K SNPs)
- Dictionary file + 26 chromosome files (compressed)
Bulk Files (37 zip archives): ~6M files for system testing

Important: This is synthetic data and may not be internally consistent (e.g., events after death, prostate cancer in females).
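
One way to screen for the kind of inconsistency mentioned above (events recorded after death) is a simple date comparison. The sketch below works on plain Python values. The input shapes (a mapping of participant ID to death date, and a list of ID and event-date pairs) are assumptions for illustration and do not reflect the actual column layout of the UK Biobank synthetic files.

```python
from datetime import date


def events_after_death(death_dates, events):
    """Return the (participant_id, event_date) pairs dated after death."""
    flagged = []
    for participant_id, event_date in events:
        died = death_dates.get(participant_id)
        # Participants without a recorded death date are never flagged
        if died is not None and event_date > died:
            flagged.append((participant_id, event_date))
    return flagged
```

Flagged records can then be dropped or kept, depending on whether the goal is realistic analysis or system testing.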

Reference: UK Biobank Synthetic Dataset

Examples

See the examples/ and notebooks/ directories for detailed examples:

examples/basic_usage.py - Basic Synthea usage
examples/ukbiobank_synthetic_example.py - UK Biobank Synthetic Dataset examples
notebooks/Synthea.ipynb - Comprehensive Synthea tutorial
notebooks/UKBiobank_Synthetic.ipynb - UK Biobank Synthetic Dataset tutorial
notebooks/Coherent_Dataset.ipynb - Multimodal EHR + Imaging + Genomics tutorial

License

MIT License

References

Synthea

Synthea Coherent Data Set (Multimodal)

AWS Open Data Registry
Paper: "The Coherent Data Set" (MDPI Electronics, 2022)
S3 Bucket: s3://synthea-open-data/coherent/

UK Biobank

UK Biobank Synthetic Dataset
UK Biobank Showcase (for field definitions)
Read Code Mappings

Genomics

HAPNEST Paper - Synthetic genotype/phenotype generation
HAPNEST BioStudies - Full dataset download
PLINK 2 - Genomics analysis toolkit

Medical Imaging

TCIA - The Cancer Imaging Archive
Stanford AIMI Datasets
MHIST Histopathology Dataset
Grand Challenge - Medical imaging challenges and datasets
