Synthetic Healthcare Data Generator

Welcome to the Synthetic Healthcare Data Generator, a project designed to leverage Large Language Models (LLMs) for generating synthetic healthcare datasets. The goal is to evaluate the ability of various open-source LLMs to produce realistic, diverse, and useful synthetic data, all while adhering to privacy and ethical standards.

Features

Healthcare Data Generation: Generate synthetic datasets based on real-world schemas.
Multi-Model Integration: Compare results from multiple LLMs, such as GPT-4, LLaMA, etc., accessed via API.
Validation & Analysis:
- Schema validation to ensure data consistency.
- Statistical analysis to compare synthetic data against real datasets.
- Utility analysis to test synthetic data in downstream machine learning tasks.

Project Workflow

Dataset Preparation:
- Preprocessed healthcare data schema with 15 attributes (e.g., Name, Age, Gender, Medical Condition).
- Original dataset: https://www.kaggle.com/datasets/prasad22/healthcare-dataset
Prompt Engineering:
- Designed prompts at three levels (Basic, Improved, Advanced) to enhance data generation.
- Provided developer instructions at three levels (Basic, Improved, Advanced) which the model should follow, regardless of messages sent by the user.
Data Generation:
- Generated partial synthetic datasets using APIs for various open-source LLMs.
Validation:
- Ensured data consistency, schema adherence, and statistical similarity.
Benchmarking:
- Compared model performance across metrics like realism, diversity, and efficiency.

Models Used

The project integrates the following LLMs through their respective APIs:

Model	Description	Access
GPT-4	OpenAI's advanced language model known for high accuracy and contextual depth.	OpenAI API
LLaMA	Meta’s language model optimized for open-ended generation tasks.	LLaMA API

Each model is benchmarked for:

Realism: Statistical similarity to the real dataset.
Cost: API usage cost per 1,000 rows.
Speed: Latency for generating synthetic data.

Sample Schema

The synthetic data follows the schema below:

Column	Data Type	Description
Name	`string`	Patient's full name
Age	`int`	Patient's age in years
Gender	`string`	Gender of the patient
Blood Type	`string`	Blood type (e.g., A+, O-)
Medical Condition	`string`	Primary medical diagnosis
Date of Admission	`string`	Date the patient was admitted
Doctor	`string`	Attending doctor's name
Hospital	`string`	Name of the hospital
Insurance Provider	`string`	Insurance company
Billing Amount	`float`	Amount billed in USD
Room Number	`int`	Hospital room number
Admission Type	`string`	Type of admission (e.g., emergency, routine)
Discharge Date	`string`	Date the patient was discharged
Medication	`string`	Medications prescribed
Test Results	`string`	Key diagnostic test results

Model Results Comparison

Cost & Efficiency

Time and cost benchmarks for generating 1,000 rows of data.

Model	Time	Cost (USD)	Input Cost (USD)	Output Cost (USD)
GPT-4o 2024-08-06	60m	$1.75	$0.13	$1.61
LLaMA 3.1-70b	3m 57.8s

More detailed statiscal comparisons (such KL divergence, predictive utility, visual distribution comparisons, etc.) can be seen with {model}_data_validation.ipynb scripts.

Getting Started

Clone the repository:

git clone https://github.com/your-username/synthetic-healthcare-data-generator.git

Install dependencies
```
pip install -r requirements.txt
```
Set up your API keys
Run the scripts in the following order
1. dataset_exploration
2. data_preprocessing
3. {model}_llm_integration
4. {model}_data_parsing
5. {model}_data_validation

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
data_parsing		data_parsing
data_validation		data_validation
datasets		datasets
full_data		full_data
llm_api		llm_api
.gitattributes		.gitattributes
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Synthetic Healthcare Data Generator

Features

Project Workflow

Models Used

Sample Schema

Model Results Comparison

Cost & Efficiency

Getting Started

About

Uh oh!

Releases

Packages

Uh oh!

Languages

RohanMannem/SyntheticHealthDataGenerator

Folders and files

Latest commit

History

Repository files navigation

Synthetic Healthcare Data Generator

Features

Project Workflow

Models Used

Sample Schema

Model Results Comparison

Cost & Efficiency

Getting Started

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages