🧪 A/B Test Statistical Significance Engine

A full-stack A/B testing platform built with Python and Streamlit. It automates frequentist hypothesis testing, Bayesian inference, sequential boundary analysis, and multi-metric batch testing, then compiles the results into a downloadable PDF report. It is designed to mirror the statistical workflow used by data-driven product and growth teams.

💡 Key Findings (Sample Run)

Analysis	Result
Frequentist (continuous)	p < 0.0001, Cohen's d = 0.529 (medium), +10.8% lift → SHIP IT
Multi-Metric	3 of 5 metrics significant — revenue (+12.7%), pages_viewed (+27.4%), add_to_cart (+60.0%)
Bayesian	P(B > A) = 96.7%, expected loss = 0.02% → SHIP IT
Sequential (Monte Carlo)	Naive peeking = 15% false positive rate vs OBF = 5.0%: roughly 3× inflation shown in simulation

✨ Features

Page	What it does
⚡ Power Analysis	Sample size calculator with live power curves, MDE analysis, and What-If checker
🔬 Run Test	T-test + Mann-Whitney U + chi-squared with data profiling, outlier detection, normality testing
📊 Multi-Metric	Batch-test all CSV metric columns — color-coded summary table, lift chart, p-value heatmap
🎲 Bayesian	P(B beats A), expected loss, credible intervals, prior sensitivity analysis
📈 Sequential	O'Brien-Fleming bounds, peeking risk detector, Monte Carlo false positive simulation
📋 Report	One-click PDF export of all results with business-readable ship/don't-ship recommendations

🏗️ Pipeline

Raw data (paste / CSV upload)
        │
        ▼
data_profiler.py      ← Shapiro-Wilk normality, IQR outlier detection, skewness
        │
        ▼
stats_engine.py       ← Welch's T-Test, Mann-Whitney U, Chi-Squared
        │
        ▼
power_analysis.py     ← Required N, achieved power, MDE curves (statsmodels)
        │
        ▼
bayesian_engine.py    ← Beta-Binomial posterior, Normal posterior, expected loss
        │
        ▼
sequential_testing.py ← O'Brien-Fleming bounds, Pocock bounds, Monte Carlo sim
        │
        ▼
report_builder.py     ← Ship / Don't Ship / Caution / Inconclusive decision logic
        │
        ▼
pdf_exporter.py       ← ReportLab multi-section PDF report
        │
        ▼
pages/ (Streamlit)    ← 6-page interactive UI
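
The data_profiler.py stage above is described as using IQR outlier detection. A minimal sketch of that idea follows, assuming the conventional Tukey multiplier of 1.5; the function name and the multiplier are illustrative assumptions, not taken from the project code.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside the IQR (Tukey) fences.

    k=1.5 is the conventional multiplier; the project's actual setting is
    not stated in this README, so treat it as an assumption.
    """
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

# Toy data: ten ordinary values and one extreme value
sample = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 100]
mask = iqr_outlier_mask(sample)
outliers = [v for v, flagged in zip(sample, mask) if flagged]
```

Flagged points are only candidates for review; whether to drop, cap, or keep them is a judgment call that depends on the metric.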

📁 Project Structure

ab_test_analyzer/
├── app.py                        # Home page and navigation
├── stats_engine.py               # T-test, Mann-Whitney U, Chi-Squared
├── bayesian_engine.py            # Bayesian Beta-Binomial + Normal model
├── power_analysis.py             # Sample size & power calculations
├── sequential_testing.py         # O'Brien-Fleming, Pocock, peeking detector
├── data_profiler.py              # Outlier detection, normality testing
├── report_builder.py             # Ship/don't-ship recommendation logic
├── pdf_exporter.py               # ReportLab PDF generation
├── sample_data.py                # Built-in sample datasets
├── utils.py                      # Shared UI helpers
├── pages/
│   ├── 1_Power_Analysis.py
│   ├── 2_Run_Test.py
│   ├── 3_Multi_Metric.py
│   ├── 4_Bayesian.py
│   ├── 5_Sequential.py
│   └── 6_Report.py
└── sample_data_files/
    └── multi_metric_sample.csv

🗂️ Statistical Methods

Method	Purpose
Welch's T-Test	Compares means of two independent groups; robust to unequal variances
Mann-Whitney U	Non-parametric alternative when normality fails or sample is small
Chi-Squared Test	Compares conversion rates between two groups
Cohen's d	Effect size for continuous metrics (T-Test)
Cramér's V	Effect size for conversion rate tests (Chi-Squared)
Rank Biserial r	Effect size for Mann-Whitney U
Beta-Binomial Model	Bayesian posterior for conversion rate experiments
Normal Posterior	Bayesian model for continuous metric experiments
O'Brien-Fleming	Sequential boundary controlling false positives across interim looks
Pocock Boundary	Constant sequential boundary for early-stopping experiments
Shapiro-Wilk	Normality test — auto-recommends T-Test vs Mann-Whitney
IQR Fences	Outlier detection flagging data quality issues before analysis
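
The frequentist methods in the table can be sketched with SciPy roughly as follows. This is an illustration under stated assumptions, not the contents of stats_engine.py: the function names, the pooled-SD form of Cohen's d, and the synthetic data are all hypothetical.

```python
import numpy as np
from scipy import stats

def compare_continuous(control, treatment):
    """Welch's t-test and Mann-Whitney U with their effect sizes."""
    a = np.asarray(control, dtype=float)
    b = np.asarray(treatment, dtype=float)

    # Welch's t-test: equal_var=False drops the equal-variance assumption
    _, t_p = stats.ttest_ind(b, a, equal_var=False)

    # Cohen's d using the pooled standard deviation
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    cohens_d = (b.mean() - a.mean()) / np.sqrt(pooled_var)

    # Mann-Whitney U; in SciPy 1.7+ the statistic is U for the first sample
    u_stat, u_p = stats.mannwhitneyu(b, a, alternative="two-sided")
    rank_biserial = 2.0 * u_stat / (len(a) * len(b)) - 1.0

    return {"t_p": t_p, "cohens_d": cohens_d, "u_p": u_p, "rank_biserial": rank_biserial}

def compare_conversions(conv_a, n_a, conv_b, n_b):
    """Chi-squared test on a 2x2 converted / not-converted table, plus Cramér's V."""
    table = np.array([[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]])
    chi2, p, _, _ = stats.chi2_contingency(table, correction=False)
    cramers_v = np.sqrt(chi2 / table.sum())  # min(rows-1, cols-1) is 1 for a 2x2 table
    return {"chi2": chi2, "p": p, "cramers_v": cramers_v}

# Synthetic example data (not the project's sample dataset)
rng = np.random.default_rng(0)
control = rng.normal(100, 20, 500)
treatment = rng.normal(110, 20, 500)
continuous = compare_continuous(control, treatment)
conversions = compare_conversions(100, 1000, 130, 1000)
```

Reporting an effect size next to each p-value matters because, with large samples, a very small p-value can accompany a practically negligible difference.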

🚀 How to Run

Prerequisites: Python 3.10+, pip

# 1. Clone the repo
git clone https://github.com/DeekshithaKalluri/ab-test-analyzer.git
cd ab-test-analyzer

# 2. Set up environment
python -m venv venv
source venv/bin/activate        # Mac/Linux
# venv\Scripts\activate         # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Launch the app
streamlit run app.py

Open http://localhost:8501 in your browser. The app loads with built-in sample data on every page — no uploads required to explore all features.

⚙️ Recommended Workflow

Step	Page	What to do
1	⚡ Power Analysis	Set your baseline rate and MDE — find required sample size before launching
2	🔬 Run Test	Paste values or upload CSV — run all three statistical tests
3	📊 Multi-Metric	Upload experiment CSV — batch-test all metrics at once
4	🎲 Bayesian	Run Bayesian analysis — get P(B > A) and expected loss
5	📈 Sequential	Check peeking risk — validate your result isn't a false positive
6	📋 Report	Generate and download the full PDF report
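
Step 1 of the workflow can be illustrated with a short sample-size calculation. The project lists statsmodels for power analysis; the sketch below instead uses the standard normal-approximation formula with Cohen's h and SciPy only, so it is a stand-in for the idea rather than the code in power_analysis.py. The function name and the example rates are assumptions.

```python
import math
from scipy.stats import norm

def required_n_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test.

    baseline_rate: control conversion rate, e.g. 0.10
    mde_abs: minimum detectable effect as an absolute difference, e.g. 0.02
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    # Cohen's h: effect size on the arcsine-transformed scale
    h = 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / h) ** 2)

# Halving the detectable effect roughly quadruples the required sample
n_large_effect = required_n_per_group(0.10, 0.02)
n_small_effect = required_n_per_group(0.10, 0.01)
```

Because the required sample size scales with the inverse square of the effect size, choosing the MDE before launch is the main lever on experiment duration.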

📊 Sample PDF Report

The Report page compiles all analysis into a structured PDF including:

Frequentist test results with confidence intervals and effect sizes
Multi-metric summary table (color-coded by decision)
Bayesian posterior probabilities and credible intervals
Sequential testing simulation results and the naive-vs-OBF false positive comparison

🛠️ Tech Stack

Layer	Tool
Language	Python 3.10+
UI framework	Streamlit
Statistical tests	SciPy
Power analysis	statsmodels
Data processing	pandas, NumPy
Visualization	Matplotlib, Seaborn
PDF generation	ReportLab
Version control	Git / GitHub

🧠 Challenges and What I Learned

Peeking problem in sequential testing: Implemented O'Brien-Fleming spending bounds and a Monte Carlo simulation to show that checking results at 5 interim looks without correction inflates the false positive rate from the nominal 5% to about 15%. Across 500 simulated A/A tests, OBF held the rate at approximately 5.0%; with only 500 runs, that estimate carries a standard error of roughly 1 percentage point.
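
The peeking effect can be reproduced with a small simulation. The sketch below is an assumption-laden illustration, not the code in sequential_testing.py: it simulates standardized z-statistics directly (known variance, 5 equally spaced looks) and uses the classic O'Brien-Fleming boundary shape with the tabulated final critical value of about 2.040 for 5 looks at two-sided α = 0.05. The names and the number of simulations are hypothetical.

```python
import numpy as np

N_LOOKS = 5
# Tabulated O'Brien-Fleming final critical value for 5 looks, two-sided alpha = 0.05.
# This constant is only valid for that specific design.
OBF_FINAL_Z_5_LOOKS = 2.040

def simulate_false_positive_rates(n_sims=20_000, seed=0):
    """Simulate A/A tests (no true effect) and return (naive_fpr, obf_fpr)."""
    rng = np.random.default_rng(seed)
    looks = np.arange(1, N_LOOKS + 1)

    # Each increment is the standardized difference contributed by one batch of data.
    increments = rng.standard_normal((n_sims, N_LOOKS))
    z = np.cumsum(increments, axis=1) / np.sqrt(looks)  # z-statistic at each look

    naive_bounds = np.full(N_LOOKS, 1.96)                      # same cutoff at every peek
    obf_bounds = OBF_FINAL_Z_5_LOOKS * np.sqrt(N_LOOKS / looks)  # strict early, lenient late

    naive_fpr = np.any(np.abs(z) > naive_bounds, axis=1).mean()
    obf_fpr = np.any(np.abs(z) > obf_bounds, axis=1).mean()
    return float(naive_fpr), float(obf_fpr)

naive_fpr, obf_fpr = simulate_false_positive_rates()
```

The O'Brien-Fleming boundary starts above 4.5 at the first look and relaxes to about 2.04 at the last, which is why early stopping is rare unless the effect is very large.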

Bayesian vs Frequentist framing: A p-value is the probability of seeing data at least this extreme if there were no true difference, not the probability that B is better than A. The latter is what Bayesian inference provides: the Beta-Binomial model outputs P(B > A) directly, along with an expected loss metric that quantifies the cost of a wrong ship decision.
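
A minimal sketch of the Beta-Binomial calculation, assuming a uniform Beta(1, 1) prior and Monte Carlo draws from the two posteriors. The function name, the prior, the draw count, and the example counts are assumptions for illustration, not values from bayesian_engine.py.

```python
import numpy as np

def beta_binomial_compare(conv_a, n_a, conv_b, n_b,
                          prior_alpha=1.0, prior_beta=1.0,
                          draws=200_000, seed=0):
    """Return (P(B > A), expected loss of shipping B) from posterior samples."""
    rng = np.random.default_rng(seed)
    # Conjugate update: Beta(prior_alpha + successes, prior_beta + failures)
    post_a = rng.beta(prior_alpha + conv_a, prior_beta + n_a - conv_a, size=draws)
    post_b = rng.beta(prior_alpha + conv_b, prior_beta + n_b - conv_b, size=draws)

    prob_b_beats_a = float(np.mean(post_b > post_a))
    # Expected loss: average conversion rate given up in the cases where A is actually better
    expected_loss = float(np.mean(np.maximum(post_a - post_b, 0.0)))
    return prob_b_beats_a, expected_loss

# Hypothetical counts: 100/1000 conversions for A, 130/1000 for B
prob_b_beats_a, expected_loss = beta_binomial_compare(100, 1000, 130, 1000)
```

Expected loss is useful as a stopping rule: a team can ship once the loss falls below a threshold it is willing to tolerate, even if P(B > A) is not extreme.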

Prior sensitivity analysis: Added a sweep over prior strengths (α = β from 0.1 to 10) to check whether the Bayesian conclusion holds across prior choices. A result that changes dramatically with prior strength indicates that the data are too sparse to outweigh the prior, not a genuine effect.
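
One way to sketch such a sweep is to recompute P(B > A) under several symmetric Beta(s, s) priors and look at the spread. The grid of strengths, the function names, and the example counts below are assumptions chosen for illustration.

```python
import numpy as np

PRIOR_STRENGTHS = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]  # alpha = beta = s

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, s, draws=200_000, seed=0):
    """P(B > A) under a symmetric Beta(s, s) prior, estimated by sampling."""
    rng = np.random.default_rng(seed)
    post_a = rng.beta(s + conv_a, s + n_a - conv_a, size=draws)
    post_b = rng.beta(s + conv_b, s + n_b - conv_b, size=draws)
    return float(np.mean(post_b > post_a))

def sensitivity_spread(conv_a, n_a, conv_b, n_b):
    """Return max minus min of P(B > A) across the prior grid."""
    probs = [prob_b_beats_a(conv_a, n_a, conv_b, n_b, s) for s in PRIOR_STRENGTHS]
    return max(probs) - min(probs)

# Plenty of data: the prior barely matters
spread_large = sensitivity_spread(100, 1000, 130, 1000)
# Very little data: the prior visibly moves the answer
spread_small = sensitivity_spread(3, 10, 6, 10)
```

A small spread suggests the data dominate the prior; a large spread is a signal to collect more data before acting.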

Normality-aware test selection: Implemented Shapiro-Wilk on both groups before running tests. The profiler automatically recommends Mann-Whitney U when either group fails normality with n < 30, so the T-Test is not silently applied to small non-normal samples.
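
The selection rule described above might look like the sketch below. It reads the rule as "a group both fails Shapiro-Wilk at the 0.05 level and has fewer than 30 observations"; that reading, the 0.05 cutoff, the function name, and the return labels are assumptions rather than details confirmed by data_profiler.py.

```python
import numpy as np
from scipy import stats

def recommend_test(group_a, group_b, alpha=0.05, small_n=30):
    """Return 'mann-whitney' or 'welch-t' based on Shapiro-Wilk and sample size."""
    for group in (group_a, group_b):
        x = np.asarray(group, dtype=float)
        _, p_normal = stats.shapiro(x)
        # A small sample that fails normality cannot lean on the central limit theorem
        if p_normal < alpha and len(x) < small_n:
            return "mann-whitney"
    return "welch-t"

# Deterministic toy samples: evenly spaced normal quantiles and a heavily skewed series
normal_like = stats.norm.ppf((np.arange(1, 21) - 0.5) / 20)
skewed_small = np.exp(np.linspace(0, 8, 20))
skewed_large = np.exp(np.linspace(0, 8, 100))

choice_skewed_small = recommend_test(skewed_small, normal_like)
choice_normal = recommend_test(normal_like, 2 * normal_like + 1)
choice_skewed_large = recommend_test(skewed_large, skewed_large)
```

Under this rule a large skewed sample still gets the T-Test, on the grounds that the sampling distribution of the mean is approximately normal at larger n.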

Multi-page session state: Streamlit reruns a page's script from the top on every interaction and page switch, so ordinary variables do not survive navigation. Each analysis page explicitly writes its results to st.session_state so the Report page can compile a full cross-page PDF without requiring the user to re-run anything.
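
The hand-off between pages can be sketched with two small helpers. They accept any dict-like store so the example runs without Streamlit; in the app, st.session_state would be passed in. The key name, helper names, and page labels are hypothetical.

```python
RESULTS_KEY = "analysis_results"  # hypothetical key; the project's actual key is not documented here

def save_result(state, page_name, result):
    """Store one page's result under a shared key in a dict-like state object."""
    if RESULTS_KEY not in state:
        state[RESULTS_KEY] = {}
    state[RESULTS_KEY][page_name] = result

def collect_for_report(state, expected_pages):
    """Return (available results, list of pages that have not been run yet)."""
    stored = state[RESULTS_KEY] if RESULTS_KEY in state else {}
    available = {page: stored[page] for page in expected_pages if page in stored}
    missing = [page for page in expected_pages if page not in stored]
    return available, missing

# A plain dict stands in for st.session_state in this example
state = {}
save_result(state, "bayesian", {"prob_b_beats_a": 0.967})
available, missing = collect_for_report(state, ["frequentist", "bayesian"])
```

Returning the list of missing pages lets the Report page tell the user which analyses still need to be run instead of failing on an absent key.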

📄 License

MIT — see LICENSE

👤 Author

Deekshitha Kalluri — GitHub
