🧬 Clinical Trials Data Analysis — End-to-End Portfolio Project

Dataset: chemNLP/clinical-trials-v2 — ~450,000 registered trials from ClinicalTrials.gov

⚠️ Disclaimer: This project was created solely to demonstrate data science and analytical skills. It does not represent clinical, medical, or scientific advice, and should not be interpreted as such. All findings are exploratory in nature and are based on a sample of publicly available data from ClinicalTrials.gov. No conclusions drawn here should be used to inform any clinical, regulatory, or healthcare decision.


📌 Project Overview

This project demonstrates a complete, production-style data science pipeline applied to real-world clinical trial data — from raw XML ingestion through to predictive machine learning and SQL analytics. The dataset is drawn from ClinicalTrials.gov, the world's largest registry of clinical research studies, which stores each trial as a richly structured XML document following the FDA Amendments Act (FDAAA 801) schema.

The central analytical question is:

Can we predict whether a clinical trial will be completed — using only information available at the time of registration?

This question has real-world value: funders, ethics boards, and protocol reviewers make resource allocation decisions before a trial launches. A model that estimates completion probability from registration metadata could meaningfully support those decisions.

The project is structured across four modular notebooks, each building on the output of the previous:

| Notebook | Focus | Key Output |
|---|---|---|
| 01_data_acquisition_and_eda.ipynb | Data ingestion, XML parsing, EDA | clinical_trials_processed.csv |
| 02_nlp_text_analytics.ipynb | NLP from TF-IDF to transformers | Classifier, embeddings, NER |
| 03_machine_learning_models.ipynb | XGBoost + SHAP explainability | Tuned model, feature importance |
| 04_sql_analytics.ipynb | SQL analytics on relational schema | Sponsor scorecard, window functions |

🔬 Notebook 01 — Data Acquisition & EDA

Purpose

This notebook handles the full data pipeline from raw source to a clean, analysis-ready dataset. Rather than downloading the full ~4 GB dataset, it uses HuggingFace streaming mode to fetch and process 5,000 records on-the-fly — making the project fully reproducible on any laptop in under a minute.

Each record in the dataset is an XML document following the FDAAA schema. Parsing raw XML directly mirrors real-world regulatory data pipelines and extracts the full richness of the nested structure: trial identifiers, phase, status, enrollment, sponsor information, geographic scope, dates, and free-text summaries.

Why This Approach?

Clinical data is rarely delivered in clean tabular form. In practice, regulatory systems (EHR, CTMS, EDC platforms) store data as nested XML or JSON. This notebook demonstrates the full ETL workflow: streaming ingestion → schema-driven parsing → type coercion → feature engineering → validation → export.

Key Steps

Streaming ingestion — datasets.load_dataset() with streaming=True avoids a 4 GB download and keeps memory constant regardless of total dataset size.

XML parsing — A custom parser using xml.etree.ElementTree extracts 30 structured fields from each trial record. Safe accessor functions wrap every ET.find() call in a try/except, ensuring a single malformed record doesn't abort the pipeline.
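The parser itself isn't reproduced in this README; a minimal sketch of the safe-accessor pattern (the notebook wraps calls in try/except — an explicit None check is used here, and the XML fragment and field paths are invented stand-ins for the FDAAA schema, not the real field names):

```python
import xml.etree.ElementTree as ET

# Illustrative fragment — a simplified stand-in for a real trial record
SAMPLE = """
<clinical_study>
  <id_info><nct_id>NCT00000001</nct_id></id_info>
  <overall_status>Recruiting</overall_status>
  <enrollment>120</enrollment>
</clinical_study>
"""

def safe_find_text(root, path, default=None):
    """Return stripped text at `path`, or `default` if the node is missing or empty."""
    node = root.find(path)
    if node is None or node.text is None:
        return default
    return node.text.strip()

root = ET.fromstring(SAMPLE)
record = {
    "nct_id": safe_find_text(root, "id_info/nct_id"),
    "overall_status": safe_find_text(root, "overall_status"),
    "phase": safe_find_text(root, "phase", default="Unknown"),  # tag absent here → default
    "enrollment": safe_find_text(root, "enrollment"),
}
print(record)
```

A malformed or incomplete record simply produces defaults instead of raising, so one bad trial cannot abort the 5,000-record stream.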

Cleaning and feature engineering:

  • Date columns coerced to datetime using pd.to_datetime(errors='coerce') — NaT for unparseable values
  • Enrollment coerced to numeric; non-numeric strings (empty strings) become NaN
  • Phase normalised: "N/A" and empty strings → "Unknown"; "Early Phase 1" → "Phase 0"
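A toy sketch of these three coercions (the values below are invented; the notebook applies the same calls to the full frame):

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": ["2020-01-15", "2021-06-15", "not a date"],
    "enrollment": ["120", "", "45"],
    "phase": ["N/A", "Early Phase 1", "Phase 2"],
})

# Dates: unparseable strings become NaT rather than raising
df["start_date"] = pd.to_datetime(df["start_date"], errors="coerce")

# Enrollment: non-numeric strings (including empty) become NaN
df["enrollment_numeric"] = pd.to_numeric(df["enrollment"], errors="coerce")

# Phase normalisation
df["phase"] = df["phase"].replace(
    {"N/A": "Unknown", "": "Unknown", "Early Phase 1": "Phase 0"}
)
print(df[["start_date", "enrollment_numeric", "phase"]])
```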

Output: clinical_trials_processed.csv — 5,000 records × 31 columns, 6.59 MB

EDA Results

Trial Status Distribution

The registry is live and active, which means the largest single category is Recruiting (2,386 trials, 47.7%). Completed trials account for 26.0% (1,301), with the remainder spanning various in-progress and terminal states.

| Status | Count | % |
|---|---|---|
| Recruiting | 2,386 | 47.72% |
| Completed | 1,301 | 26.02% |
| Not yet recruiting | 735 | 14.70% |
| Active, not recruiting | 295 | 5.90% |
| Enrolling by invitation | 135 | 2.70% |
| Withdrawn | 52 | 1.04% |
| Terminated | 42 | 0.84% |

Phase Distribution

The large "Unknown" slice (3,599 / 72%) reflects observational studies and registry studies, which do not follow the phase framework — that structure is specific to interventional drug trials. Among phased trials, Phase 1 and Phase 2 are most common, reflecting the high attrition rate in early drug development.

| Phase | Count |
|---|---|
| Unknown (observational/N/A) | 3,599 |
| Phase 2 | 384 |
| Phase 1 | 382 |
| Phase 3 | 216 |
| Phase 4 | 186 |
| Phase 1–Phase 2 | 123 |

Enrollment Distribution

Enrollment is extremely right-skewed. The median is 78 participants, but the mean is 3,479 — driven by a small number of population-level screening studies with up to 5,341,584 participants. This skew motivated the log-transform feature log_enrollment used in the ML model (Notebook 03).

| Statistic | Value |
|---|---|
| Median enrollment | 78 |
| Mean enrollment | 3,479 |
| Maximum enrollment | 5,341,584 |
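The transform is a one-liner; a sketch assuming log1p (the notebook may use a plain log — the choice only matters for zero enrollments):

```python
import numpy as np
import pandas as pd

# Toy enrollment values spanning the observed range (median 78, max ~5.3M)
enrollment = pd.Series([10.0, 78.0, 500.0, 5_341_584.0])

# log1p compresses the heavy right tail and is safe for zero enrollment
log_enrollment = np.log1p(enrollment)
print(log_enrollment.round(2).tolist())
```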

Trials Started Per Year

Trial registrations grew sharply from 2005 onwards, reflecting the FDA Amendments Act (FDAAA 801), which made ClinicalTrials.gov registration a legal requirement for most interventional trials from 2007 onwards.

Top 15 Countries by Trial Count

The United States dominates trial volume, reflecting both the size of the US research infrastructure and ClinicalTrials.gov's origins as a US registry. Other high-volume countries include China, Egypt, and the United Kingdom.

Phase × Status Heatmap

A cross-tabulation of phase vs. status reveals that Phase 2 trials have proportionally more "Terminated" entries — consistent with the well-documented "valley of death" in drug development where promising Phase 1 results fail to replicate in larger Phase 2 studies.


📝 Notebook 02 — NLP & Text Analytics

Purpose

This notebook treats the free-text trial summary (brief_summary) as a data source in its own right. A trial's description contains information not captured by structured fields — the medical context, patient population, and treatment rationale — all of which carry predictive signal about trial completion.

The notebook spans the full spectrum from traditional to modern NLP methods, demonstrating each approach and when it is most appropriate.

Why This Approach?

Text analytics on clinical summaries is directly relevant to healthcare AI. Regulators, researchers, and funders routinely read large numbers of trial descriptions. NLP can automate categorisation, flag unusual language, or identify trials similar to a query of interest — tasks that are time-consuming to do manually.

Methods Covered

| Section | Method | Purpose |
|---|---|---|
| Text preprocessing | Lowercasing, lemmatisation, stopword removal | Normalise vocabulary for TF-IDF |
| Word cloud + frequency | WordCloud, Counter | Exploratory — what language dominates? |
| TF-IDF + Logistic Regression | TfidfVectorizer, LogisticRegression | Interpretable baseline classifier |
| Sentence embeddings | all-MiniLM-L6-v2 (SentenceTransformers) | Semantic similarity + clustering |
| Zero-shot classification | facebook/bart-large-mnli | Topic tagging without labelled data |
| Named entity recognition | dbmdz/bert-large-cased-finetuned-conll03 | Extract organisations, locations, terms |
| Sentiment analysis | distilbert-base-uncased-finetuned-sst-2 | Tone analysis + model applicability limits |

Results

TF-IDF + Logistic Regression Baseline

Trained on 80% of the 2,292 records that had non-trivial summaries, with stratified splits to preserve the 26.7% completion rate. TF-IDF with bigrams (ngram_range=(1,2)) and balanced class weights.
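A condensed sketch of such a pipeline — the vectorizer settings (bigrams, balanced class weights) mirror the description above, but the summaries and labels here are invented toy data, not the real 2,292 records:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for trial summaries and completion labels (1 = Completed)
summaries = [
    "randomized phase 2 trial of drug X in advanced cancer",
    "observational registry of patients with chronic heart failure",
    "phase 3 study evaluating vaccine efficacy in healthy adults",
    "pilot feasibility study of a mobile health intervention",
] * 5
labels = [1, 0, 1, 0] * 5

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(summaries, labels)

# Probability of completion for a new, unseen summary
probs = clf.predict_proba(["phase 3 randomized trial of drug X"])[:, 1]
print(round(float(probs[0]), 3))
```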

| Metric | Score |
|---|---|
| Accuracy | 0.6536 |
| ROC-AUC | 0.6906 |
| Precision (Completed) | 0.39 |
| Recall (Completed) | 0.52 |
| F1 (Completed) | 0.45 |

The ROC-AUC of 0.69 confirms that trial summaries carry genuine predictive signal beyond random chance — the language used to describe a trial at registration correlates with its eventual outcome.

Sentence Embeddings & Clustering

Using all-MiniLM-L6-v2 (80 MB, CPU-compatible), 500 trial summaries were embedded into 384-dimensional vectors. K-Means clustering (k=5) on these vectors groups semantically similar trials without any explicit labels. PCA was used to reduce to 2 dimensions for visualisation.
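A sketch of the clustering step — random vectors stand in for the MiniLM embeddings so the snippet runs without downloading the model (the real code would call SentenceTransformer("all-MiniLM-L6-v2").encode(summaries) to produce the 500 × 384 matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Stand-in for the 500 × 384 embedding matrix
embeddings = rng.normal(size=(500, 384))

# Group semantically similar trials without labels
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(embeddings)

# Project to 2-D for visualisation
coords = PCA(n_components=2, random_state=42).fit_transform(embeddings)

print(np.bincount(cluster_ids))  # cluster sizes
print(coords.shape)              # (500, 2)
```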

Cluster sizes:

| Cluster | Trials |
|---|---|
| 0 | 87 |
| 1 | 124 |
| 2 | 78 |
| 3 | 126 |
| 4 | 85 |

Unlike TF-IDF, sentence embeddings understand that "heart disease" and "cardiac condition" refer to the same concept — enabling true semantic similarity rather than surface-level word matching.

Zero-Shot Classification

Using facebook/bart-large-mnli, 10 trial summaries were classified into 6 medical categories (cancer treatment, cardiovascular disease, diabetes management, mental health, infectious disease, pain management) without any task-specific training. This demonstrates the practical value of zero-shot NLP for rapid categorisation of new data.

Sentiment Analysis

Applied distilbert-base-uncased-finetuned-sst-2-english to 50 trial summaries. This section is deliberately included to illustrate a model's limits: clinical language is neutral by design, yet disease vocabulary ("mortality", "adverse", "failed") scores as "negative" to a model trained on movie reviews. Of 50 summaries, 29 scored NEGATIVE — a clear sign of domain mismatch rather than actual negativity in the text.

NLP Method Comparison

| Approach | Training Time | Inference | Est. Accuracy | GPU Needed | Interpretability |
|---|---|---|---|---|---|
| TF-IDF + Logistic Regression | Seconds | Very fast | ~70–80% | No | High |
| Sentence Transformers (MiniLM) | Pre-trained | Fast (CPU) | ~80–88% | No | Medium |
| BERT / DistilBERT (fine-tuned) | Minutes–Hours | Moderate | ~85–93% | Recommended | Low (SHAP needed) |
| Domain BERT (BioBERT, ClinicalBERT) | Hours+ | Moderate | ~90–95% | Yes | Low (SHAP needed) |

🤖 Notebook 03 — Machine Learning with XGBoost

Purpose

This notebook builds a structured feature model to predict whether a clinical trial will be completed, using only information available at the time of registration — making the prediction practically useful for prospective decision-making.

Why XGBoost?

XGBoost (Extreme Gradient Boosting) builds sequential decision trees, each correcting the residuals of the previous one. For tabular data with mixed feature types and moderate dataset size (~5,000 rows), it typically outperforms both linear models (which miss non-linear feature interactions) and neural networks (which overfit at this scale). It also natively handles missing values and is robust to feature scaling.

Feature Engineering

Eight features were engineered from the processed dataset, each with a specific analytical justification:

| Feature | Justification |
|---|---|
| enrollment_numeric | Raw participant count — proxy for trial scope |
| log_enrollment | Log-transform stabilises the extreme right skew (median 78, mean 3,479) |
| has_keywords | Keyword entry at registration suggests more systematic trial management |
| is_industry_sponsor | Industry sponsors face financial pressure to reach endpoints — structural incentive for completion |
| has_collaborators | Multi-institution trials share resources and oversight — historically better completion rates |
| gender_encoded | Eligibility scope: all-gender vs. single-gender trials differ structurally |
| phase_encoded | Trial complexity and regulatory stage |
| study_type_encoded | Interventional vs. observational — fundamentally different completion dynamics |

Dataset shape after dropping rows with missing features: 4,977 rows × 8 features. Class balance: 26.1% completed (1,301) vs. 73.9% not completed (3,676).

Results

Baseline XGBoost (default parameters)

| Metric | Score |
|---|---|
| Accuracy | 0.7380 |
| Precision | 0.4964 |
| Recall | 0.2654 |
| F1 | 0.3459 |
| ROC-AUC | 0.6614 |
| CV ROC-AUC (5-fold) | 0.6627 ± 0.0108 |

Hyperparameter Tuning (RandomizedSearchCV)

20 random configurations × 3-fold cross-validation. RandomizedSearchCV is preferred over GridSearchCV for continuous parameters like learning_rate, where a grid would miss the optimal value between grid points.
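A sketch of the search setup — sklearn's GradientBoostingClassifier and synthetic data stand in for XGBoost and the real 4,977 × 8 feature matrix (so the snippet runs without xgboost installed), but the RandomizedSearchCV mechanics are identical:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data (~74% / 26%, mirroring the real class balance)
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.74], random_state=42)

param_dist = {
    "learning_rate": uniform(0.01, 0.3),  # continuous — a grid would miss values
    "max_depth": randint(2, 6),
    "subsample": uniform(0.6, 0.4),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5, cv=3, scoring="roc_auc", random_state=42,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Sampling learning_rate from a continuous distribution is exactly why RandomizedSearchCV beats a grid here: the optimum can fall anywhere in the interval, not just on grid points.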

Best CV ROC-AUC: 0.6912

Best parameters found:

  • colsample_bytree: 0.926
  • gamma: 0.283
  • learning_rate: 0.192
  • max_depth: 3
  • min_child_weight: 3

Tuned Model Performance

| Metric | Baseline | Tuned | Change (p.p.) |
|---|---|---|---|
| Accuracy | 0.7380 | 0.7460 | +0.80 |
| ROC-AUC | 0.6614 | 0.6822 | +2.08 |
| F1 (Completed) | 0.3459 | 0.2537 | −9.22 |

The ROC-AUC improvement of just over two percentage points (0.6614 → 0.6822) demonstrates genuine gain from tuning. The F1 trade-off reflects the decision threshold: tuning improved the model's discrimination ability (AUC) while shifting the default decision boundary in a way that reduced recall on the minority class.

Confusion Matrix (Tuned Model)

|  | Predicted: Not Completed | Predicted: Completed |
|---|---|---|
| Actual: Not Completed | 700 (TN) | 36 (FP) |
| Actual: Completed | 217 (FN) | 43 (TP) |

Feature Importance (XGBoost Gain)

| Feature | Importance |
|---|---|
| phase_encoded | 0.235 |
| has_collaborators | 0.199 |
| log_enrollment | 0.158 |
| gender_encoded | 0.120 |
| has_keywords | 0.106 |
| enrollment_numeric | 0.104 |
| study_type_encoded | 0.078 |
| is_industry_sponsor | 0.000 |

Phase is the strongest predictor — later-phase trials represent more mature, better-funded programmes with a higher institutional commitment to completion. The zero importance of is_industry_sponsor is notable: sponsor class alone does not predict completion when other structural features are controlled for.

SHAP Explainability

SHAP (SHapley Additive Explanations) was used to interpret the tuned model. Unlike built-in feature importance (which is aggregate and unsigned), SHAP values show each feature's directional contribution to each individual prediction.

Key SHAP findings:

  • High log_enrollment values push predictions towards completion
  • High phase_encoded (later phases) increases completion probability
  • has_collaborators = 1 reliably increases predicted completion probability

Multi-Model Comparison

| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.5753 | 0.3380 | 0.6538 | 0.4456 | 0.6177 |
| Random Forest | 0.7209 | 0.4464 | 0.2885 | 0.3505 | 0.6404 |
| XGBoost Baseline | 0.7380 | 0.4964 | 0.2654 | 0.3459 | 0.6614 |
| XGBoost Tuned | 0.7460 | 0.5443 | 0.1654 | 0.2537 | 0.6822 |

XGBoost consistently achieves the highest ROC-AUC, validating the model choice. Logistic Regression achieves a higher recall but lower precision — appropriate for use cases where missing completions is more costly than false alarms.


🗄️ Notebook 04 — SQL Analytics

Purpose

This notebook loads the processed dataset into a SQLite relational database and demonstrates a progressive range of SQL analytical patterns — from basic aggregations through to window functions, multi-level CTEs, and data quality auditing.

Why SQLite in a Notebook?

SQLite runs in-process with no server configuration, making the notebook fully self-contained and reproducible. Every SQL pattern demonstrated here is directly portable to production databases: PostgreSQL, BigQuery, Snowflake, Redshift. The schema design decisions (normalised tables, foreign keys) mirror how production clinical databases are structured.

Schema Design

Two tables linked by nct_id (the ClinicalTrials.gov trial identifier):

```sql
-- Trials table: core trial metadata
CREATE TABLE trials (
    trial_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    nct_id          TEXT UNIQUE NOT NULL,
    brief_title     TEXT,
    overall_status  TEXT,
    phase           TEXT,
    study_type      TEXT,
    enrollment      REAL,
    has_results     TEXT,
    start_date      TEXT,
    completion_date TEXT
);

-- Sponsors table: sponsor information linked by foreign key
CREATE TABLE sponsors (
    sponsor_id         INTEGER PRIMARY KEY AUTOINCREMENT,
    nct_id             TEXT,
    lead_sponsor       TEXT,
    lead_sponsor_class TEXT,
    FOREIGN KEY (nct_id) REFERENCES trials(nct_id)
);
```

Both tables loaded with 5,000 rows each.
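One way to populate such a table from the processed CSV (a toy frame stands in for the real data, and to_sql infers the schema for brevity — the notebook's own loading code may instead insert into the tables created by the CREATE TABLE statements above):

```python
import sqlite3

import pandas as pd

# Toy slice of clinical_trials_processed.csv (column names follow the schema)
df = pd.DataFrame({
    "nct_id": ["NCT00000001", "NCT00000002"],
    "overall_status": ["Completed", "Recruiting"],
    "phase": ["Phase 2", "Unknown"],
    "enrollment": [120.0, 45.0],
})

conn = sqlite3.connect(":memory:")
df.to_sql("trials", conn, index=False, if_exists="replace")

count = conn.execute("SELECT COUNT(*) FROM trials").fetchone()[0]
print(count)  # 2
conn.close()
```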

SQL Concepts Demonstrated

Basic Aggregations

Status distribution with inline percentage calculation using a window function in the same pass as GROUP BY — avoids a subquery:

```sql
SELECT
    overall_status,
    COUNT(*) AS trial_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM trials
GROUP BY overall_status
ORDER BY trial_count DESC;
```

Completion Rate by Phase (CASE WHEN conditional aggregation):

| Phase | Total Trials | Completed | Completion % |
|---|---|---|---|
| Phase 1 | 382 | 126 | 32.98% |
| Phase 4 | 186 | 51 | 27.42% |
| Early Phase 1 | 64 | 11 | 17.19% |
| Phase 2–Phase 3 | 46 | 6 | 13.04% |
| Phase 1–Phase 2 | 123 | 15 | 12.20% |
| Phase 3 | 216 | 23 | 10.65% |
| Phase 2 | 384 | 35 | 9.11% |

Phase 1 has the highest completion rate (33%) — counterintuitive at first glance, but Phase 1 trials are typically smaller, shorter, and primarily focused on safety rather than efficacy, so they are easier to finish. Phase 2 has the lowest (9%), consistent with its role as the attrition bottleneck in drug development.
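The CASE WHEN pattern behind that table, runnable end-to-end on a toy table (phase labels and counts here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (phase TEXT, overall_status TEXT)")
conn.executemany(
    "INSERT INTO trials VALUES (?, ?)",
    [("Phase 1", "Completed"), ("Phase 1", "Recruiting"),
     ("Phase 2", "Terminated"), ("Phase 2", "Recruiting")],
)

# Conditional aggregation: completion rate per phase in a single pass
rows = conn.execute("""
    SELECT
        phase,
        COUNT(*) AS total_trials,
        SUM(CASE WHEN overall_status = 'Completed' THEN 1 ELSE 0 END) AS completed,
        ROUND(100.0 * SUM(CASE WHEN overall_status = 'Completed' THEN 1 ELSE 0 END)
              / COUNT(*), 2) AS completion_pct
    FROM trials
    GROUP BY phase
    ORDER BY completion_pct DESC
""").fetchall()
print(rows)  # [('Phase 1', 2, 1, 50.0), ('Phase 2', 2, 0, 0.0)]
conn.close()
```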

Window Functions

RANK() OVER (PARTITION BY lead_sponsor_class ORDER BY COUNT(*) DESC) ranks each sponsor within its class — equivalent to a grouped rank in pandas:

| Sponsor | Class | Trials | Rank in Class | Overall Rank |
|---|---|---|---|---|
| Riphah International University | Other | 70 | 1 | 1 |
| Assiut University | Other | 53 | 2 | 2 |
| Cairo University | Other | 49 | 3 | 3 |
| Mayo Clinic | ... | ... | ... | ... |
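A self-contained toy version of the grouped-rank query (sponsor names and counts invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sponsors (lead_sponsor TEXT, lead_sponsor_class TEXT)")
conn.executemany(
    "INSERT INTO sponsors VALUES (?, ?)",
    [("Univ A", "Other"), ("Univ A", "Other"), ("Univ B", "Other"),
     ("Pharma X", "Industry"), ("Pharma X", "Industry"), ("Pharma X", "Industry")],
)

# RANK() restarts within each sponsor class — a grouped rank
rows = conn.execute("""
    SELECT
        lead_sponsor,
        lead_sponsor_class,
        COUNT(*) AS trials,
        RANK() OVER (PARTITION BY lead_sponsor_class
                     ORDER BY COUNT(*) DESC) AS rank_in_class
    FROM sponsors
    GROUP BY lead_sponsor, lead_sponsor_class
""").fetchall()
for row in rows:
    print(row)
conn.close()
```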

Cumulative Trial Registrations using ROWS UNBOUNDED PRECEDING in a CTE — a running total of registrations per year showing the sharp growth post-2007 (FDAAA 801 mandate).
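A runnable toy version of the running-total pattern (years and counts invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (start_year INTEGER)")
conn.executemany("INSERT INTO trials VALUES (?)",
                 [(2019,), (2019,), (2020,), (2021,), (2021,), (2021,)])

# CTE aggregates per year; the window frame accumulates across years
rows = conn.execute("""
    WITH yearly AS (
        SELECT start_year, COUNT(*) AS registrations
        FROM trials
        GROUP BY start_year
    )
    SELECT
        start_year,
        registrations,
        SUM(registrations) OVER (
            ORDER BY start_year
            ROWS UNBOUNDED PRECEDING
        ) AS cumulative_registrations
    FROM yearly
    ORDER BY start_year
""").fetchall()
print(rows)  # [(2019, 2, 2), (2020, 1, 3), (2021, 3, 6)]
conn.close()
```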

Multi-Level CTE — Sponsor Performance Scorecard

A two-layer CTE computes completion rate and results reporting rate per sponsor, filtered to sponsors with ≥ 10 trials, then ranked by trial volume:

```sql
WITH sponsor_metrics AS (
    SELECT ..., COUNT(...) AS total_trials, SUM(CASE WHEN ...) AS completed
    FROM sponsors s JOIN trials t ON s.nct_id = t.nct_id
    GROUP BY s.lead_sponsor HAVING total_trials >= 10
),
ranked AS (
    SELECT *, RANK() OVER (ORDER BY total_trials DESC) AS trial_rank
    FROM sponsor_metrics
)
SELECT * FROM ranked ORDER BY trial_rank LIMIT 20;
```

Industry vs. Non-Industry Comparison

| Sponsor Type | Total Trials | Completed | Completion % | Avg Enrollment |
|---|---|---|---|---|
| Non-Industry | 5,000 | 1,301 | 26.02% | 3,442 |

(Note: All 5,000 records in this sample are non-industry sponsored — confirming the academic/institutional character of this slice of the registry.)

Data Quality Audit

Single-pass completeness check using conditional aggregation:

| Column | Missing Values | Completeness |
|---|---|---|
| nct_id | 0 | 100.00% |
| brief_title | 0 | 100.00% |
| overall_status | 0 | 100.00% |
| phase | 3,599 | 28.02% |
| enrollment | 23 | 99.54% |

Phase incompleteness (72%) is expected — observational and registry studies do not have a formal phase. This is a data characteristic, not a data quality issue.


🛠️ Tech Stack

| Category | Tools |
|---|---|
| Data ingestion | HuggingFace datasets (streaming), xml.etree.ElementTree |
| Data manipulation | pandas, numpy |
| Visualisation | plotly.express, matplotlib, seaborn, WordCloud |
| NLP (traditional) | nltk (tokenisation, lemmatisation, stopwords), scikit-learn (TF-IDF) |
| NLP (modern) | sentence-transformers (MiniLM), transformers (BART, BERT, DistilBERT) |
| Machine learning | scikit-learn, xgboost |
| Explainability | shap (TreeExplainer, beeswarm, waterfall) |
| Database | sqlite3 (standard library) |
| Clustering / Dimensionality reduction | KMeans, PCA (scikit-learn) |

⚙️ Setup & Reproducibility

Requirements

```bash
pip install -r requirements.txt
```

Key dependencies:

```text
pandas>=2.0
numpy>=1.24
matplotlib>=3.7
seaborn>=0.12
plotly>=5.14
datasets>=2.14          # HuggingFace streaming
scikit-learn>=1.3
xgboost>=1.7
shap>=0.42
nltk>=3.8
transformers>=4.35
sentence-transformers>=2.2
wordcloud>=1.9
tqdm>=4.65
```

Run Order

The notebooks must be run in sequence — each one depends on outputs from the previous:

```text
01_data_acquisition_and_eda.ipynb   →  generates clinical_trials_processed.csv
02_nlp_text_analytics.ipynb         →  reads clinical_trials_processed.csv
03_machine_learning_models.ipynb    →  reads clinical_trials_processed.csv
04_sql_analytics.ipynb              →  reads clinical_trials_processed.csv → creates clinical_trials.db
```

Run Time (standard laptop, CPU only)

| Notebook | Approx. Time |
|---|---|
| 01 — Data Acquisition & EDA | 30–60 seconds |
| 02 — NLP & Text Analytics | 3–8 minutes (transformer downloads dominate first run) |
| 03 — Machine Learning | 3–5 minutes (hyperparameter search) |
| 04 — SQL Analytics | < 1 minute |

💡 Key Insights

On clinical trial completion:

  • Only 26% of trials in this sample are completed — most are still active or recruiting
  • Phase is the strongest structural predictor of completion probability
  • Collaborator presence (multi-site trials) significantly increases the likelihood of completion
  • Enrollment size (log-transformed) carries meaningful signal — very small and very large trials have different completion dynamics

On NLP for clinical text:

  • TF-IDF remains a powerful, interpretable baseline with ROC-AUC of 0.69 using trial summaries alone
  • Sentence embeddings enable semantic clustering without labels — a practical tool for literature organisation
  • Domain-specific models (BioBERT, ClinicalBERT, PubMedBERT) are preferred for production clinical NLP
  • General sentiment models are not appropriate for clinical language — they misread neutral clinical descriptions as negative

On model selection:

  • XGBoost consistently outperforms Logistic Regression and Random Forest on ROC-AUC for this task
  • SHAP explainability is essential when presenting model outputs to clinical or regulatory stakeholders
  • The moderate AUC (~0.68) reflects a genuine signal ceiling: completion is fundamentally hard to predict from registration metadata alone — external factors (funding continuity, site performance, regulatory changes) dominate

🧭 Background & Motivation

This project emerged from my transition from clinical research into data science. Having worked within the clinical research ecosystem — with study protocols, data management, and regulatory submissions — I have firsthand knowledge of the challenges that motivate this analysis. Trial failure is costly: a single Phase 3 failure can cost hundreds of millions of dollars and delay treatments for patients who need them.

Data science applied to trial registries can help the field ask better questions earlier: which trial designs are most likely to succeed? Which sponsors consistently complete and report? Where do the structural bottlenecks lie?

This project is my attempt to bridge those two worlds — applying rigorous data science to a domain I understand deeply.


✨ Created by Nadia Rozman | March 2026

📂 Project Structure

```text
Clinical_Trials_Analysis/
│
├── notebooks/
│    ├── 01_data_acquisition_and_eda.ipynb      # Data pipeline + exploratory analysis
│    ├── 02_nlp_text_analytics.ipynb            # NLP + transformer models
│    ├── 03_machine_learning_models.ipynb       # XGBoost prediction + SHAP
│    ├── 04_sql_analytics.ipynb                 # SQL analytics (SQLite)
│    └── 04_sql_analytics.sql                   # Standalone SQL query library
│
├── requirements.txt                       # Python dependencies
└── README.md
```

🔗 Connect with me

⭐ If you found this project helpful, please consider giving it a star!
