🧬 Clinical Trials Data Analysis — End-to-End Portfolio Project

Dataset: chemNLP/clinical-trials-v2 — ~450,000 registered trials from ClinicalTrials.gov

⚠️ Disclaimer: This project was created solely to demonstrate data science and analytical skills. It does not represent clinical, medical, or scientific advice, and should not be interpreted as such. All findings are exploratory in nature and are based on a sample of publicly available data from ClinicalTrials.gov. No conclusions drawn here should be used to inform any clinical, regulatory, or healthcare decision.


📌 Project Overview

This project demonstrates a complete, production-style data science pipeline applied to real-world clinical trial data — from raw XML ingestion through to predictive machine learning and SQL analytics. The dataset is drawn from ClinicalTrials.gov, the world's largest registry of clinical research studies, which stores each trial as a richly structured XML document following the FDA Amendments Act (FDAAA 801) schema.

The central analytical question is:

Can we predict whether a clinical trial will be completed — using only information available at the time of registration?

This question has real-world value: funders, ethics boards, and protocol reviewers make resource allocation decisions before a trial launches. A model that estimates completion probability from registration metadata could meaningfully support those decisions.

The project is structured across four modular notebooks, each building on the output of the previous:

| Notebook | Focus | Key Output |
|---|---|---|
| 01_data_acquisition_and_eda.ipynb | Data ingestion, XML parsing, EDA | clinical_trials_processed.csv |
| 02_nlp_text_analytics.ipynb | NLP from TF-IDF to transformers | Classifier, embeddings, NER |
| 03_machine_learning_models.ipynb | XGBoost + SHAP explainability | Tuned model, feature importance |
| 04_sql_analytics.ipynb | SQL analytics on relational schema | Sponsor scorecard, window functions |

🔬 Notebook 01 — Data Acquisition & EDA

Purpose

This notebook handles the full data pipeline from raw source to a clean, analysis-ready dataset. Rather than downloading the full ~4 GB dataset, it uses HuggingFace streaming mode to fetch and process 5,000 records on-the-fly — making the project fully reproducible on any laptop in under a minute.

Each record in the dataset is an XML document following the FDAAA schema. Parsing raw XML directly mirrors real-world regulatory data pipelines and extracts the full richness of the nested structure: trial identifiers, phase, status, enrollment, sponsor information, geographic scope, dates, and free-text summaries.

Why This Approach?

Clinical data is rarely delivered in clean tabular form. In practice, regulatory systems (EHR, CTMS, EDC platforms) store data as nested XML or JSON. This notebook demonstrates the full ETL workflow: streaming ingestion → schema-driven parsing → type coercion → feature engineering → validation → export.

Key Steps

Streaming ingestion — datasets.load_dataset() with streaming=True avoids a 4 GB download and keeps memory constant regardless of total dataset size.

XML parsing — A custom parser using xml.etree.ElementTree extracts 30 structured fields from each trial record. Safe accessor functions wrap every ET.find() call in a try/except, ensuring a single malformed record doesn't abort the pipeline.
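The parser itself isn't reproduced in this README; a minimal sketch of the safe-accessor pattern (the notebook wraps calls in try/except — an explicit None check is used here, and the XML fragment and field paths are invented stand-ins for the FDAAA schema, not the real field names):

```python
import xml.etree.ElementTree as ET

# Illustrative fragment — a simplified stand-in for a real trial record
SAMPLE = """
<clinical_study>
  <id_info><nct_id>NCT00000001</nct_id></id_info>
  <overall_status>Recruiting</overall_status>
  <enrollment>120</enrollment>
</clinical_study>
"""

def safe_find_text(root, path, default=None):
    """Return stripped text at `path`, or `default` if the node is missing or empty."""
    node = root.find(path)
    if node is None or node.text is None:
        return default
    return node.text.strip()

root = ET.fromstring(SAMPLE)
record = {
    "nct_id": safe_find_text(root, "id_info/nct_id"),
    "overall_status": safe_find_text(root, "overall_status"),
    "phase": safe_find_text(root, "phase", default="Unknown"),  # tag absent here → default
    "enrollment": safe_find_text(root, "enrollment"),
}
print(record)
```

A malformed or incomplete record simply produces defaults instead of raising, so one bad trial cannot abort the 5,000-record stream.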

Cleaning and feature engineering:

  • Date columns coerced to datetime using pd.to_datetime(errors='coerce') — NaT for unparseable values
  • Enrollment coerced to numeric; non-numeric strings (empty strings) become NaN
  • Phase normalised: "N/A" and empty strings → "Unknown"; "Early Phase 1" → "Phase 0"
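A toy sketch of these three coercions (the values below are invented; the notebook applies the same calls to the full frame):

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": ["2020-01-15", "2021-06-15", "not a date"],
    "enrollment": ["120", "", "45"],
    "phase": ["N/A", "Early Phase 1", "Phase 2"],
})

# Dates: unparseable strings become NaT rather than raising
df["start_date"] = pd.to_datetime(df["start_date"], errors="coerce")

# Enrollment: non-numeric strings (including empty) become NaN
df["enrollment_numeric"] = pd.to_numeric(df["enrollment"], errors="coerce")

# Phase normalisation
df["phase"] = df["phase"].replace(
    {"N/A": "Unknown", "": "Unknown", "Early Phase 1": "Phase 0"}
)
print(df[["start_date", "enrollment_numeric", "phase"]])
```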

Output: clinical_trials_processed.csv — 5,000 records × 31 columns, 6.59 MB

EDA Results

Trial Status Distribution

The registry is live and active, which means the largest single category is Recruiting (2,386 trials, 47.7%). Completed trials account for 26.0% (1,301), with the remainder spanning various in-progress and terminal states.

| Status | Count | % |
|---|---|---|
| Recruiting | 2,386 | 47.72% |
| Completed | 1,301 | 26.02% |
| Not yet recruiting | 735 | 14.70% |
| Active, not recruiting | 295 | 5.90% |
| Enrolling by invitation | 135 | 2.70% |
| Withdrawn | 52 | 1.04% |
| Terminated | 42 | 0.84% |

Phase Distribution

The large "Unknown" slice (3,599 / 72%) reflects observational studies and registry studies, which do not follow the phase framework — that structure is specific to interventional drug trials. Among phased trials, Phase 1 and Phase 2 are most common, reflecting the high attrition rate in early drug development.

| Phase | Count |
|---|---|
| Unknown (observational/N/A) | 3,599 |
| Phase 2 | 384 |
| Phase 1 | 382 |
| Phase 3 | 216 |
| Phase 4 | 186 |
| Phase 1–Phase 2 | 123 |

Enrollment Distribution

Enrollment is extremely right-skewed. The median is 78 participants, but the mean is 3,479 — driven by a small number of population-level screening studies with up to 5,341,584 participants. This skew motivated the log-transform feature log_enrollment used in the ML model (Notebook 03).

| Statistic | Value |
|---|---|
| Median enrollment | 78 |
| Mean enrollment | 3,479 |
| Maximum enrollment | 5,341,584 |
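The transform is a one-liner; a sketch assuming log1p (the notebook may use a plain log — the choice only matters for zero enrollments):

```python
import numpy as np
import pandas as pd

# Toy enrollment values spanning the observed range (median 78, max ~5.3M)
enrollment = pd.Series([10.0, 78.0, 500.0, 5_341_584.0])

# log1p compresses the heavy right tail and is safe for zero enrollment
log_enrollment = np.log1p(enrollment)
print(log_enrollment.round(2).tolist())
```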

Trials Started Per Year

Trial registrations grew sharply from 2005 onwards, reflecting the FDA Amendments Act (FDAAA 801), which made ClinicalTrials.gov registration a legal requirement for most interventional trials from 2007 onwards.

Top 15 Countries by Trial Count

The United States dominates trial volume, reflecting both the size of the US research infrastructure and ClinicalTrials.gov's origins as a US registry. Other high-volume countries include China, Egypt, and the United Kingdom.

Phase × Status Heatmap

A cross-tabulation of phase vs. status reveals that Phase 2 trials have proportionally more "Terminated" entries — consistent with the well-documented "valley of death" in drug development where promising Phase 1 results fail to replicate in larger Phase 2 studies.


📝 Notebook 02 — NLP & Text Analytics

Purpose

This notebook treats the free-text trial summary (brief_summary) as a data source in its own right. A trial's description contains information not captured by structured fields — the medical context, patient population, and treatment rationale — all of which carry predictive signal about trial completion.

The notebook spans the full spectrum from traditional to modern NLP methods, demonstrating each approach and when it is most appropriate.

Why This Approach?

Text analytics on clinical summaries is directly relevant to healthcare AI. Regulators, researchers, and funders routinely read large numbers of trial descriptions. NLP can automate categorisation, flag unusual language, or identify trials similar to a query of interest — tasks that are time-consuming to do manually.

Methods Covered

| Section | Method | Purpose |
|---|---|---|
| Text preprocessing | Lowercasing, lemmatisation, stopword removal | Normalise vocabulary for TF-IDF |
| Word cloud + frequency | WordCloud, Counter | Exploratory — what language dominates? |
| TF-IDF + Logistic Regression | TfidfVectorizer, LogisticRegression | Interpretable baseline classifier |
| Sentence embeddings | all-MiniLM-L6-v2 (SentenceTransformers) | Semantic similarity + clustering |
| Zero-shot classification | facebook/bart-large-mnli | Topic tagging without labelled data |
| Named entity recognition | dbmdz/bert-large-cased-finetuned-conll03 | Extract organisations, locations, terms |
| Sentiment analysis | distilbert-base-uncased-finetuned-sst-2 | Tone analysis + model applicability limits |

Results

TF-IDF + Logistic Regression Baseline

Trained on 80% of the 2,292 records that had non-trivial summaries, with stratified splits to preserve the 26.7% completion rate. TF-IDF with bigrams (ngram_range=(1,2)) and balanced class weights.
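A condensed sketch of such a pipeline — the vectorizer settings (bigrams, balanced class weights) mirror the description above, but the summaries and labels here are invented toy data, not the real 2,292 records:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for trial summaries and completion labels (1 = Completed)
summaries = [
    "randomized phase 2 trial of drug X in advanced cancer",
    "observational registry of patients with chronic heart failure",
    "phase 3 study evaluating vaccine efficacy in healthy adults",
    "pilot feasibility study of a mobile health intervention",
] * 5
labels = [1, 0, 1, 0] * 5

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(summaries, labels)

# Probability of completion for a new, unseen summary
probs = clf.predict_proba(["phase 3 randomized trial of drug X"])[:, 1]
print(round(float(probs[0]), 3))
```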

| Metric | Score |
|---|---|
| Accuracy | 0.6536 |
| ROC-AUC | 0.6906 |
| Precision (Completed) | 0.39 |
| Recall (Completed) | 0.52 |
| F1 (Completed) | 0.45 |

The ROC-AUC of 0.69 confirms that trial summaries carry genuine predictive signal beyond random chance — the language used to describe a trial at registration correlates with its eventual outcome.

Sentence Embeddings & Clustering

Using all-MiniLM-L6-v2 (80 MB, CPU-compatible), 500 trial summaries were embedded into 384-dimensional vectors. K-Means clustering (k=5) on these vectors groups semantically similar trials without any explicit labels. PCA was used to reduce to 2 dimensions for visualisation.
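A sketch of the clustering step — random vectors stand in for the MiniLM embeddings so the snippet runs without downloading the model (the real code would call SentenceTransformer("all-MiniLM-L6-v2").encode(summaries) to produce the 500 × 384 matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Stand-in for the 500 × 384 embedding matrix
embeddings = rng.normal(size=(500, 384))

# Group semantically similar trials without labels
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(embeddings)

# Project to 2-D for visualisation
coords = PCA(n_components=2, random_state=42).fit_transform(embeddings)

print(np.bincount(cluster_ids))  # cluster sizes
print(coords.shape)              # (500, 2)
```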

Cluster sizes:

| Cluster | Trials |
|---|---|
| 0 | 87 |
| 1 | 124 |
| 2 | 78 |
| 3 | 126 |
| 4 | 85 |

Unlike TF-IDF, sentence embeddings understand that "heart disease" and "cardiac condition" refer to the same concept — enabling true semantic similarity rather than surface-level word matching.

Zero-Shot Classification

Using facebook/bart-large-mnli, 10 trial summaries were classified into 6 medical categories (cancer treatment, cardiovascular disease, diabetes management, mental health, infectious disease, pain management) without any task-specific training. This demonstrates the practical value of zero-shot NLP for rapid categorisation of new data.

Sentiment Analysis

Applied distilbert-base-uncased-finetuned-sst-2-english to 50 trial summaries. This section is deliberately included to illustrate a model's limits: clinical language is neutral by design, yet disease vocabulary ("mortality", "adverse", "failed") scores as "negative" to a model trained on movie reviews. Of 50 summaries, 29 scored NEGATIVE — a clear sign of domain mismatch rather than actual negativity in the text.

NLP Method Comparison

| Approach | Training Time | Inference | Est. Accuracy | GPU Needed | Interpretability |
|---|---|---|---|---|---|
| TF-IDF + Logistic Regression | Seconds | Very fast | ~70–80% | No | High |
| Sentence Transformers (MiniLM) | Pre-trained | Fast (CPU) | ~80–88% | No | Medium |
| BERT / DistilBERT (fine-tuned) | Minutes–Hours | Moderate | ~85–93% | Recommended | Low (SHAP needed) |
| Domain BERT (BioBERT, ClinicalBERT) | Hours+ | Moderate | ~90–95% | Yes | Low (SHAP needed) |

🤖 Notebook 03 — Machine Learning with XGBoost

Purpose

This notebook builds a structured feature model to predict whether a clinical trial will be completed, using only information available at the time of registration — making the prediction practically useful for prospective decision-making.

Why XGBoost?

XGBoost (Extreme Gradient Boosting) builds sequential decision trees, each correcting the residuals of the previous one. For tabular data with mixed feature types and moderate dataset size (~5,000 rows), it typically outperforms both linear models (which miss non-linear feature interactions) and neural networks (which overfit at this scale). It also natively handles missing values and is robust to feature scaling.

Feature Engineering

Eight features were engineered from the processed dataset, each with a specific analytical justification:

| Feature | Justification |
|---|---|
| enrollment_numeric | Raw participant count — proxy for trial scope |
| log_enrollment | Log-transform stabilises the extreme right skew (median 78, mean 3,479) |
| has_keywords | Keyword entry at registration suggests more systematic trial management |
| is_industry_sponsor | Industry sponsors face financial pressure to reach endpoints — structural incentive for completion |
| has_collaborators | Multi-institution trials share resources and oversight — historically better completion rates |
| gender_encoded | Eligibility scope: all-gender vs. single-gender trials differ structurally |
| phase_encoded | Trial complexity and regulatory stage |
| study_type_encoded | Interventional vs. observational — fundamentally different completion dynamics |

Dataset shape after dropping rows with missing features: 4,977 rows × 8 features. Class balance: 26.1% completed (1,301) vs. 73.9% not completed (3,676).

Results

Baseline XGBoost (default parameters)

| Metric | Score |
|---|---|
| Accuracy | 0.7380 |
| Precision | 0.4964 |
| Recall | 0.2654 |
| F1 | 0.3459 |
| ROC-AUC | 0.6614 |
| CV ROC-AUC (5-fold) | 0.6627 ± 0.0108 |

Hyperparameter Tuning (RandomizedSearchCV)

20 random configurations × 3-fold cross-validation. RandomizedSearchCV is preferred over GridSearchCV for continuous parameters like learning_rate, where a grid would miss the optimal value between grid points.
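A sketch of the search setup — sklearn's GradientBoostingClassifier and synthetic data stand in for XGBoost and the real 4,977 × 8 feature matrix (so the snippet runs without xgboost installed), but the RandomizedSearchCV mechanics are identical:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data (~74% / 26%, mirroring the real class balance)
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.74], random_state=42)

param_dist = {
    "learning_rate": uniform(0.01, 0.3),  # continuous — a grid would miss values
    "max_depth": randint(2, 6),
    "subsample": uniform(0.6, 0.4),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5, cv=3, scoring="roc_auc", random_state=42,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Sampling learning_rate from a continuous distribution is exactly why RandomizedSearchCV beats a grid here: the optimum can fall anywhere in the interval, not just on grid points.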

Best CV ROC-AUC: 0.6912

Best parameters found:

  • colsample_bytree: 0.926
  • gamma: 0.283
  • learning_rate: 0.192
  • max_depth: 3
  • min_child_weight: 3

Tuned Model Performance

| Metric | Baseline | Tuned | Change (p.p.) |
|---|---|---|---|
| Accuracy | 0.7380 | 0.7460 | +0.80 |
| ROC-AUC | 0.6614 | 0.6822 | +2.08 |
| F1 (Completed) | 0.3459 | 0.2537 | −9.22 |

The ROC-AUC improvement of just over two percentage points (0.6614 → 0.6822) demonstrates genuine gain from tuning. The F1 trade-off reflects the decision threshold: tuning improved the model's discrimination ability (AUC) while shifting the default decision boundary in a way that reduced recall on the minority class.

Confusion Matrix (Tuned Model)

|  | Predicted: Not Completed | Predicted: Completed |
|---|---|---|
| Actual: Not Completed | 700 (TN) | 36 (FP) |
| Actual: Completed | 217 (FN) | 43 (TP) |

Feature Importance (XGBoost Gain)

| Feature | Importance |
|---|---|
| phase_encoded | 0.235 |
| has_collaborators | 0.199 |
| log_enrollment | 0.158 |
| gender_encoded | 0.120 |
| has_keywords | 0.106 |
| enrollment_numeric | 0.104 |
| study_type_encoded | 0.078 |
| is_industry_sponsor | 0.000 |

Phase is the strongest predictor — later-phase trials represent more mature, better-funded programmes with a higher institutional commitment to completion. The zero importance of is_industry_sponsor is notable: sponsor class alone does not predict completion when other structural features are controlled for.

SHAP Explainability

SHAP (SHapley Additive Explanations) was used to interpret the tuned model. Unlike built-in feature importance (which is aggregate and unsigned), SHAP values show each feature's directional contribution to each individual prediction.

Key SHAP findings:

  • High log_enrollment values push predictions towards completion
  • High phase_encoded (later phases) increases completion probability
  • has_collaborators = 1 reliably increases predicted completion probability

Multi-Model Comparison

| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.5753 | 0.3380 | 0.6538 | 0.4456 | 0.6177 |
| Random Forest | 0.7209 | 0.4464 | 0.2885 | 0.3505 | 0.6404 |
| XGBoost Baseline | 0.7380 | 0.4964 | 0.2654 | 0.3459 | 0.6614 |
| XGBoost Tuned | 0.7460 | 0.5443 | 0.1654 | 0.2537 | 0.6822 |

XGBoost consistently achieves the highest ROC-AUC, validating the model choice. Logistic Regression achieves a higher recall but lower precision — appropriate for use cases where missing completions is more costly than false alarms.


🗄️ Notebook 04 — SQL Analytics

Purpose

This notebook loads the processed dataset into a SQLite relational database and demonstrates a progressive range of SQL analytical patterns — from basic aggregations through to window functions, multi-level CTEs, and data quality auditing.

Why SQLite in a Notebook?

SQLite runs in-process with no server configuration, making the notebook fully self-contained and reproducible. Every SQL pattern demonstrated here is directly portable to production databases: PostgreSQL, BigQuery, Snowflake, Redshift. The schema design decisions (normalised tables, foreign keys) mirror how production clinical databases are structured.

Schema Design

Two tables linked by nct_id (the ClinicalTrials.gov trial identifier):

```sql
-- Trials table: core trial metadata
CREATE TABLE trials (
    trial_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    nct_id          TEXT UNIQUE NOT NULL,
    brief_title     TEXT,
    overall_status  TEXT,
    phase           TEXT,
    study_type      TEXT,
    enrollment      REAL,
    has_results     TEXT,
    start_date      TEXT,
    completion_date TEXT
);

-- Sponsors table: sponsor information linked by foreign key
CREATE TABLE sponsors (
    sponsor_id         INTEGER PRIMARY KEY AUTOINCREMENT,
    nct_id             TEXT,
    lead_sponsor       TEXT,
    lead_sponsor_class TEXT,
    FOREIGN KEY (nct_id) REFERENCES trials(nct_id)
);
```

Both tables loaded with 5,000 rows each.
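One way to populate such a table from the processed CSV (a toy frame stands in for the real data, and to_sql infers the schema for brevity — the notebook's own loading code may instead insert into the tables created by the CREATE TABLE statements above):

```python
import sqlite3

import pandas as pd

# Toy slice of clinical_trials_processed.csv (column names follow the schema)
df = pd.DataFrame({
    "nct_id": ["NCT00000001", "NCT00000002"],
    "overall_status": ["Completed", "Recruiting"],
    "phase": ["Phase 2", "Unknown"],
    "enrollment": [120.0, 45.0],
})

conn = sqlite3.connect(":memory:")
df.to_sql("trials", conn, index=False, if_exists="replace")

count = conn.execute("SELECT COUNT(*) FROM trials").fetchone()[0]
print(count)  # 2
conn.close()
```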

SQL Concepts Demonstrated

Basic Aggregations

Status distribution with inline percentage calculation using a window function in the same pass as GROUP BY — avoids a subquery:

```sql
SELECT
    overall_status,
    COUNT(*) AS trial_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM trials
GROUP BY overall_status
ORDER BY trial_count DESC;
```

Completion Rate by Phase (CASE WHEN conditional aggregation):

| Phase | Total Trials | Completed | Completion % |
|---|---|---|---|
| Phase 1 | 382 | 126 | 32.98% |
| Phase 4 | 186 | 51 | 27.42% |
| Early Phase 1 | 64 | 11 | 17.19% |
| Phase 2–Phase 3 | 46 | 6 | 13.04% |
| Phase 1–Phase 2 | 123 | 15 | 12.20% |
| Phase 3 | 216 | 23 | 10.65% |
| Phase 2 | 384 | 35 | 9.11% |

Phase 1 has the highest completion rate (33%) — counterintuitive at first glance, but Phase 1 trials are typically smaller, shorter, and primarily focused on safety rather than efficacy, so they are easier to finish. Phase 2 has the lowest (9%), consistent with its role as the attrition bottleneck in drug development.
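The CASE WHEN pattern behind that table, runnable end-to-end on a toy table (phase labels and counts here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (phase TEXT, overall_status TEXT)")
conn.executemany(
    "INSERT INTO trials VALUES (?, ?)",
    [("Phase 1", "Completed"), ("Phase 1", "Recruiting"),
     ("Phase 2", "Terminated"), ("Phase 2", "Recruiting")],
)

# Conditional aggregation: completion rate per phase in a single pass
rows = conn.execute("""
    SELECT
        phase,
        COUNT(*) AS total_trials,
        SUM(CASE WHEN overall_status = 'Completed' THEN 1 ELSE 0 END) AS completed,
        ROUND(100.0 * SUM(CASE WHEN overall_status = 'Completed' THEN 1 ELSE 0 END)
              / COUNT(*), 2) AS completion_pct
    FROM trials
    GROUP BY phase
    ORDER BY completion_pct DESC
""").fetchall()
print(rows)  # [('Phase 1', 2, 1, 50.0), ('Phase 2', 2, 0, 0.0)]
conn.close()
```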

Window Functions

RANK() OVER (PARTITION BY lead_sponsor_class ORDER BY COUNT(*) DESC) ranks each sponsor within its class — equivalent to a grouped rank in pandas:

| Sponsor | Class | Trials | Rank in Class | Overall Rank |
|---|---|---|---|---|
| Riphah International University | Other | 70 | 1 | 1 |
| Assiut University | Other | 53 | 2 | 2 |
| Cairo University | Other | 49 | 3 | 3 |
| Mayo Clinic | ... | ... | ... | ... |
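A self-contained toy version of the grouped-rank query (sponsor names and counts invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sponsors (lead_sponsor TEXT, lead_sponsor_class TEXT)")
conn.executemany(
    "INSERT INTO sponsors VALUES (?, ?)",
    [("Univ A", "Other"), ("Univ A", "Other"), ("Univ B", "Other"),
     ("Pharma X", "Industry"), ("Pharma X", "Industry"), ("Pharma X", "Industry")],
)

# RANK() restarts within each sponsor class — a grouped rank
rows = conn.execute("""
    SELECT
        lead_sponsor,
        lead_sponsor_class,
        COUNT(*) AS trials,
        RANK() OVER (PARTITION BY lead_sponsor_class
                     ORDER BY COUNT(*) DESC) AS rank_in_class
    FROM sponsors
    GROUP BY lead_sponsor, lead_sponsor_class
""").fetchall()
for row in rows:
    print(row)
conn.close()
```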

Cumulative Trial Registrations using ROWS UNBOUNDED PRECEDING in a CTE — a running total of registrations per year showing the sharp growth post-2007 (FDAAA 801 mandate).
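A runnable toy version of the running-total pattern (years and counts invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (start_year INTEGER)")
conn.executemany("INSERT INTO trials VALUES (?)",
                 [(2019,), (2019,), (2020,), (2021,), (2021,), (2021,)])

# CTE aggregates per year; the window frame accumulates across years
rows = conn.execute("""
    WITH yearly AS (
        SELECT start_year, COUNT(*) AS registrations
        FROM trials
        GROUP BY start_year
    )
    SELECT
        start_year,
        registrations,
        SUM(registrations) OVER (
            ORDER BY start_year
            ROWS UNBOUNDED PRECEDING
        ) AS cumulative_registrations
    FROM yearly
    ORDER BY start_year
""").fetchall()
print(rows)  # [(2019, 2, 2), (2020, 1, 3), (2021, 3, 6)]
conn.close()
```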

Multi-Level CTE — Sponsor Performance Scorecard

A two-layer CTE computes completion rate and results reporting rate per sponsor, filtered to sponsors with ≥ 10 trials, then ranked by trial volume:

```sql
WITH sponsor_metrics AS (
    SELECT ..., COUNT(...) AS total_trials, SUM(CASE WHEN ...) AS completed
    FROM sponsors s JOIN trials t ON s.nct_id = t.nct_id
    GROUP BY s.lead_sponsor HAVING total_trials >= 10
),
ranked AS (
    SELECT *, RANK() OVER (ORDER BY total_trials DESC) AS trial_rank
    FROM sponsor_metrics
)
SELECT * FROM ranked ORDER BY trial_rank LIMIT 20;
```

Industry vs. Non-Industry Comparison

| Sponsor Type | Total Trials | Completed | Completion % | Avg Enrollment |
|---|---|---|---|---|
| Non-Industry | 5,000 | 1,301 | 26.02% | 3,442 |

(Note: All 5,000 records in this sample are non-industry sponsored — confirming the academic/institutional character of this slice of the registry.)

Data Quality Audit

Single-pass completeness check using conditional aggregation:

| Column | Missing Values | Completeness |
|---|---|---|
| nct_id | 0 | 100.00% |
| brief_title | 0 | 100.00% |
| overall_status | 0 | 100.00% |
| phase | 3,599 | 28.02% |
| enrollment | 23 | 99.54% |

Phase incompleteness (72%) is expected — observational and registry studies do not have a formal phase. This is a data characteristic, not a data quality issue.


🛠️ Tech Stack

| Category | Tools |
|---|---|
| Data ingestion | HuggingFace datasets (streaming), xml.etree.ElementTree |
| Data manipulation | pandas, numpy |
| Visualisation | plotly.express, matplotlib, seaborn, WordCloud |
| NLP (traditional) | nltk (tokenisation, lemmatisation, stopwords), scikit-learn (TF-IDF) |
| NLP (modern) | sentence-transformers (MiniLM), transformers (BART, BERT, DistilBERT) |
| Machine learning | scikit-learn, xgboost |
| Explainability | shap (TreeExplainer, beeswarm, waterfall) |
| Database | sqlite3 (standard library) |
| Clustering / Dimensionality reduction | KMeans, PCA (scikit-learn) |

⚙️ Setup & Reproducibility

Requirements

```bash
pip install -r requirements.txt
```

Key dependencies:

```text
pandas>=2.0
numpy>=1.24
matplotlib>=3.7
seaborn>=0.12
plotly>=5.14
datasets>=2.14          # HuggingFace streaming
scikit-learn>=1.3
xgboost>=1.7
shap>=0.42
nltk>=3.8
transformers>=4.35
sentence-transformers>=2.2
wordcloud>=1.9
tqdm>=4.65
```

Run Order

The notebooks must be run in sequence — each one depends on outputs from the previous:

```text
01_data_acquisition_and_eda.ipynb   →  generates clinical_trials_processed.csv
02_nlp_text_analytics.ipynb         →  reads clinical_trials_processed.csv
03_machine_learning_models.ipynb    →  reads clinical_trials_processed.csv
04_sql_analytics.ipynb              →  reads clinical_trials_processed.csv → creates clinical_trials.db
```

Run Time (standard laptop, CPU only)

| Notebook | Approx. Time |
|---|---|
| 01 — Data Acquisition & EDA | 30–60 seconds |
| 02 — NLP & Text Analytics | 3–8 minutes (transformer downloads dominate first run) |
| 03 — Machine Learning | 3–5 minutes (hyperparameter search) |
| 04 — SQL Analytics | < 1 minute |

💡 Key Insights

On clinical trial completion:

  • Only 26% of trials in this sample are completed — most are still active or recruiting
  • Phase is the strongest structural predictor of completion probability
  • Collaborator presence (multi-site trials) significantly increases the likelihood of completion
  • Enrollment size (log-transformed) carries meaningful signal — very small and very large trials have different completion dynamics

On NLP for clinical text:

  • TF-IDF remains a powerful, interpretable baseline with ROC-AUC of 0.69 using trial summaries alone
  • Sentence embeddings enable semantic clustering without labels — a practical tool for literature organisation
  • Domain-specific models (BioBERT, ClinicalBERT, PubMedBERT) are preferred for production clinical NLP
  • General sentiment models are not appropriate for clinical language — they misread neutral clinical descriptions as negative

On model selection:

  • XGBoost consistently outperforms Logistic Regression and Random Forest on ROC-AUC for this task
  • SHAP explainability is essential when presenting model outputs to clinical or regulatory stakeholders
  • The moderate AUC (~0.68) reflects a genuine signal ceiling: completion is fundamentally hard to predict from registration metadata alone — external factors (funding continuity, site performance, regulatory changes) dominate

🧭 Background & Motivation

This project emerged from my transition from clinical research into data science. Having worked within the clinical research ecosystem — with study protocols, data management, and regulatory submissions — I have firsthand knowledge of the challenges that motivate this analysis. Trial failure is costly: a single Phase 3 failure can cost hundreds of millions of dollars and delay treatments for patients who need them.

Data science applied to trial registries can help the field ask better questions earlier: which trial designs are most likely to succeed? Which sponsors consistently complete and report? Where do the structural bottlenecks lie?

This project is my attempt to bridge those two worlds — applying rigorous data science to a domain I understand deeply.


✨ Created by Nadia Rozman | March 2026

📂 Project Structure

```text
Clinical_Trials_Analysis/
│
├── notebooks/
│    ├── 01_data_acquisition_and_eda.ipynb      # Data pipeline + exploratory analysis
│    ├── 02_nlp_text_analytics.ipynb            # NLP + transformer models
│    ├── 03_machine_learning_models.ipynb       # XGBoost prediction + SHAP
│    ├── 04_sql_analytics.ipynb                 # SQL analytics (SQLite)
│    └── 04_sql_analytics.sql                   # Standalone SQL query library
│
├── requirements.txt                       # Python dependencies
└── README.md
```

🔗 Connect with me

⭐ If you found this project helpful, please consider giving it a star!
