DeekshithaKalluri/ab-test-analyzer

🧪 A/B Test Statistical Significance Engine

A full-stack A/B testing platform built with Python and Streamlit. It automates frequentist hypothesis testing, Bayesian inference, sequential boundary analysis, and multi-metric batch testing, then compiles the results into a downloadable PDF report. It aims to replicate the statistical rigor used by data-driven product and growth teams.


💡 Key Findings (Sample Run)

| Analysis | Result |
|---|---|
| Frequentist (continuous) | p < 0.0001, Cohen's d = 0.529 (Medium), +10.8% lift → SHIP IT |
| Multi-Metric | 3 of 5 metrics significant: revenue (+12.7%), pages_viewed (+27.4%), add_to_cart (+60.0%) |
| Bayesian | P(B > A) = 96.7%, expected loss = 0.02% → SHIP IT |
| Sequential (Monte Carlo) | Naive peeking = 15% false positive rate vs OBF = 5.0%, a 3× inflation observed in simulation |

✨ Features

| Page | What it does |
|---|---|
| ⚡ Power Analysis | Sample size calculator with live power curves, MDE analysis, and What-If checker |
| 🔬 Run Test | T-test + Mann-Whitney U + chi-squared with data profiling, outlier detection, normality testing |
| 📊 Multi-Metric | Batch-test all CSV metric columns: color-coded summary table, lift chart, p-value heatmap |
| 🎲 Bayesian | P(B beats A), expected loss, credible intervals, prior sensitivity analysis |
| 📈 Sequential | O'Brien-Fleming bounds, peeking risk detector, Monte Carlo false positive simulation |
| 📋 Report | One-click PDF export of all results with business-readable ship/don't-ship recommendations |

🏗️ Pipeline

Raw data (paste / CSV upload)
        │
        ▼
data_profiler.py      ← Shapiro-Wilk normality, IQR outlier detection, skewness
        │
        ▼
stats_engine.py       ← Welch's T-Test, Mann-Whitney U, Chi-Squared
        │
        ▼
power_analysis.py     ← Required N, achieved power, MDE curves (statsmodels)
        │
        ▼
bayesian_engine.py    ← Beta-Binomial posterior, Normal posterior, expected loss
        │
        ▼
sequential_testing.py ← O'Brien-Fleming bounds, Pocock bounds, Monte Carlo sim
        │
        ▼
report_builder.py     ← Ship / Don't Ship / Caution / Inconclusive decision logic
        │
        ▼
pdf_exporter.py       ← ReportLab multi-section PDF report
        │
        ▼
pages/ (Streamlit)    ← 6-page interactive UI

📁 Project Structure

ab_test_analyzer/
├── app.py                        # Home page and navigation
├── stats_engine.py               # T-test, Mann-Whitney U, Chi-Squared
├── bayesian_engine.py            # Bayesian Beta-Binomial + Normal model
├── power_analysis.py             # Sample size & power calculations
├── sequential_testing.py         # O'Brien-Fleming, Pocock, peeking detector
├── data_profiler.py              # Outlier detection, normality testing
├── report_builder.py             # Ship/don't-ship recommendation logic
├── pdf_exporter.py               # ReportLab PDF generation
├── sample_data.py                # Built-in sample datasets
├── utils.py                      # Shared UI helpers
├── pages/
│   ├── 1_Power_Analysis.py
│   ├── 2_Run_Test.py
│   ├── 3_Multi_Metric.py
│   ├── 4_Bayesian.py
│   ├── 5_Sequential.py
│   └── 6_Report.py
└── sample_data_files/
    └── multi_metric_sample.csv

🗂️ Statistical Methods

| Method | Purpose |
|---|---|
| Welch's T-Test | Compares means of two independent groups; robust to unequal variances |
| Mann-Whitney U | Non-parametric alternative when normality fails or sample is small |
| Chi-Squared Test | Compares conversion rates between two groups |
| Cohen's d | Effect size for continuous metrics (T-Test) |
| Cramér's V | Effect size for conversion rate tests (Chi-Squared) |
| Rank Biserial r | Effect size for Mann-Whitney U |
| Beta-Binomial Model | Bayesian posterior for conversion rate experiments |
| Normal Posterior | Bayesian model for continuous metric experiments |
| O'Brien-Fleming | Sequential boundary controlling false positives across interim looks |
| Pocock Boundary | Constant sequential boundary for early-stopping experiments |
| Shapiro-Wilk | Normality test; auto-recommends T-Test vs Mann-Whitney |
| IQR Fences | Outlier detection flagging data quality issues before analysis |
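
As a rough illustration of the first and fourth rows of the table, the sketch below runs Welch's t-test with SciPy and computes a pooled-standard-deviation Cohen's d. The function name `welch_with_cohens_d` and the synthetic data are assumptions made for this example; the repository's `stats_engine.py` may be organized differently.

```python
import numpy as np
from scipy import stats

def welch_with_cohens_d(a, b):
    """Welch's t-test plus a pooled-SD Cohen's d for two independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # equal_var=False selects Welch's test (no equal-variance assumption)
    t_stat, p_value = stats.ttest_ind(b, a, equal_var=False)
    # Pooled standard deviation from the two sample variances (ddof=1)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    d = (b.mean() - a.mean()) / np.sqrt(pooled_var)
    return {"t": float(t_stat), "p": float(p_value), "cohens_d": float(d)}

# Synthetic data (hypothetical): true effect of half a standard deviation
rng = np.random.default_rng(0)
control = rng.normal(100.0, 20.0, size=400)
variant = rng.normal(110.0, 20.0, size=400)
result = welch_with_cohens_d(control, variant)
print(result)
```

Cohen's d is computed here with the pooled standard deviation, which is one common convention; other denominators (for example the control group's standard deviation) give slightly different values.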

🚀 How to Run

Prerequisites: Python 3.10+, pip

# 1. Clone the repo
git clone https://github.com/DeekshithaKalluri/ab-test-analyzer.git
cd ab-test-analyzer

# 2. Set up environment
python -m venv venv
source venv/bin/activate        # Mac/Linux
# venv\Scripts\activate         # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Launch the app
streamlit run app.py

Open http://localhost:8501 in your browser. The app loads with built-in sample data on every page — no uploads required to explore all features.


⚙️ Recommended Workflow

| Step | Page | What to do |
|---|---|---|
| 1 | ⚡ Power Analysis | Set your baseline rate and MDE to find the required sample size before launching |
| 2 | 🔬 Run Test | Paste values or upload a CSV, then run all three statistical tests |
| 3 | 📊 Multi-Metric | Upload the experiment CSV to batch-test all metrics at once |
| 4 | 🎲 Bayesian | Run the Bayesian analysis to get P(B > A) and expected loss |
| 5 | 📈 Sequential | Check peeking risk and whether interim looks could have produced a false positive |
| 6 | 📋 Report | Generate and download the full PDF report |
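
To make step 1 concrete, here is a minimal sketch of a per-group sample size calculation for a two-sided, two-proportion z-test. It uses the textbook normal-approximation formula with `scipy.stats.norm` rather than the statsmodels routines the app relies on, so the numbers may differ slightly from the Power Analysis page. The function name `required_n_per_group` and its parameters are assumptions for illustration.

```python
import math
from scipy.stats import norm

def required_n_per_group(baseline, mde_abs, alpha=0.05, power=0.80):
    """Per-group N to detect an absolute lift `mde_abs` over `baseline`."""
    p1 = baseline
    p2 = baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# Hypothetical inputs: 10% baseline conversion, 2 percentage point MDE
n = required_n_per_group(baseline=0.10, mde_abs=0.02)
n_smaller_mde = required_n_per_group(baseline=0.10, mde_abs=0.01)
print(n, n_smaller_mde)
```

Halving the MDE roughly quadruples the required sample size, which is why the MDE is worth fixing before launch.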

📊 Sample PDF Report

The Report page compiles all analysis into a structured PDF including:

  • Frequentist test results with confidence intervals and effect sizes
  • Multi-metric summary table (color-coded by decision)
  • Bayesian posterior probabilities and credible intervals
  • Sequential testing simulation results and OBF-corrected false positive rates

🛠️ Tech Stack

| Layer | Tool |
|---|---|
| Language | Python 3.10+ |
| UI framework | Streamlit |
| Statistical tests | SciPy |
| Power analysis | statsmodels |
| Data processing | pandas, NumPy |
| Visualization | Matplotlib, Seaborn |
| PDF generation | ReportLab |
| Version control | Git / GitHub |

🧠 Challenges and What I Learned

Peeking problem in sequential testing: Checking results naively at 5 interim looks inflated the false positive rate from the nominal 5% to about 15%. With O'Brien-Fleming bounds applied, a Monte Carlo simulation over 500 A/A tests held the rate at 5.0%. At 500 simulations the Monte Carlo standard error is roughly 1 percentage point, so this is consistent with the nominal 5% rather than an exact match.
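
The peeking effect can be reproduced with a short simulation. The sketch below is an assumption-laden illustration, not the repository's `sequential_testing.py`: it runs A/A tests with known unit variance, 5 equally spaced looks, and compares a fixed 1.96 threshold against the classic O'Brien-Fleming boundary `c * sqrt(K / k)`. The constant `c = 2.040` is the commonly tabulated value for K = 5 looks at two-sided α = 0.05; a spending-function implementation would produce slightly different bounds.

```python
import numpy as np

def simulate_peeking(n_sims=2000, n_looks=5, n_per_look=100,
                     obf_constant=2.040, seed=0):
    """Return (naive_fpr, obf_fpr) for simulated A/A tests with interim looks."""
    rng = np.random.default_rng(seed)
    # A/A test: both arms come from the same N(0, 1) distribution
    a = rng.normal(size=(n_sims, n_looks, n_per_look))
    b = rng.normal(size=(n_sims, n_looks, n_per_look))
    looks = np.arange(1, n_looks + 1)
    n_cum = n_per_look * looks  # per-arm sample size at each look
    # Cumulative difference in means at each look, then a z-statistic
    diff = (b.sum(axis=2) - a.sum(axis=2)).cumsum(axis=1) / n_cum
    z = diff / np.sqrt(2.0 / n_cum)
    naive_bound = 1.96  # same threshold reused at every look
    # obf_constant is only valid for 5 looks at two-sided alpha = 0.05
    obf_bound = obf_constant * np.sqrt(n_looks / looks)
    naive_fpr = np.mean((np.abs(z) > naive_bound).any(axis=1))
    obf_fpr = np.mean((np.abs(z) > obf_bound).any(axis=1))
    return float(naive_fpr), float(obf_fpr)

naive_fpr, obf_fpr = simulate_peeking()
print(f"naive: {naive_fpr:.3f}, OBF: {obf_fpr:.3f}")
```

The O'Brien-Fleming boundary is very strict at early looks and relaxes toward roughly 2.04 at the final look, which is what keeps the overall error rate near the nominal level.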

Bayesian vs Frequentist framing: A p-value does not give the probability that B is better than A; that is what Bayesian inference provides. The Beta-Binomial model outputs P(B > A) directly, along with an expected loss metric that quantifies the cost of a wrong ship decision.
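
A minimal sketch of how P(B > A) and expected loss can be estimated from Beta posteriors by sampling. The function name, the uniform Beta(1, 1) default prior, and the conversion counts are assumptions for illustration; expected loss is taken here as the average conversion-rate shortfall from shipping B, which is one common definition and may not match `bayesian_engine.py` exactly.

```python
import numpy as np

def beta_binomial_summary(conv_a, n_a, conv_b, n_b,
                          prior_alpha=1.0, prior_beta=1.0,
                          n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(B > A) and the expected loss of shipping B."""
    rng = np.random.default_rng(seed)
    # Conjugate update: Beta(prior_alpha + successes, prior_beta + failures)
    post_a = rng.beta(prior_alpha + conv_a, prior_beta + n_a - conv_a, size=n_draws)
    post_b = rng.beta(prior_alpha + conv_b, prior_beta + n_b - conv_b, size=n_draws)
    prob_b_better = float(np.mean(post_b > post_a))
    # Shortfall is zero whenever B is at least as good as A in a draw
    expected_loss_b = float(np.mean(np.maximum(post_a - post_b, 0.0)))
    return prob_b_better, expected_loss_b

# Hypothetical counts: 100/1000 conversions for A, 125/1000 for B
prob, loss = beta_binomial_summary(conv_a=100, n_a=1000, conv_b=125, n_b=1000)
print(f"P(B > A) = {prob:.3f}, expected loss = {loss:.5f}")
```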

Prior sensitivity analysis: Added a sweep over prior strengths (α = β from 0.1 to 10) to check whether the Bayesian conclusion holds across prior choices. A conclusion that changes dramatically with prior strength indicates that the data are too sparse to dominate the prior, not that the effect is genuine.
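
The sweep can be sketched as follows, assuming symmetric Beta(s, s) priors and the same sampling approach for P(B > A). The grid of strengths, the function names, and both sets of counts are hypothetical; they are chosen only to contrast a well-powered experiment with a very small one.

```python
import numpy as np

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior_strength,
                   n_draws=100_000, seed=0):
    """P(B > A) under a symmetric Beta(prior_strength, prior_strength) prior."""
    rng = np.random.default_rng(seed)
    a = rng.beta(prior_strength + conv_a, prior_strength + n_a - conv_a, size=n_draws)
    b = rng.beta(prior_strength + conv_b, prior_strength + n_b - conv_b, size=n_draws)
    return float(np.mean(b > a))

def prior_sweep(conv_a, n_a, conv_b, n_b,
                strengths=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0)):
    """Map each prior strength to the resulting P(B > A)."""
    return {s: prob_b_beats_a(conv_a, n_a, conv_b, n_b, s) for s in strengths}

def spread(sweep):
    """Range of P(B > A) across the swept priors."""
    return max(sweep.values()) - min(sweep.values())

large_sample = prior_sweep(100, 1000, 125, 1000)  # 1000 users per arm
small_sample = prior_sweep(2, 20, 5, 20)          # 20 users per arm
print(f"spread (n=1000): {spread(large_sample):.3f}")
print(f"spread (n=20):   {spread(small_sample):.3f}")
```

A wide spread in the small-sample case is the signal described above: the prior, not the data, is driving the conclusion.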

Normality-aware test selection: The profiler runs Shapiro-Wilk on both groups before any test. When either group fails normality with n < 30, it automatically recommends Mann-Whitney U, which avoids silently applying a T-Test to small non-normal samples.
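
One way this selection rule could look in code is sketched below. The function name `recommend_test`, the 0.05 Shapiro-Wilk threshold, and the reading of "n < 30" as applying to the smaller group are assumptions; the actual rule in `data_profiler.py` may differ in detail.

```python
import numpy as np
from scipy import stats

def recommend_test(a, b, alpha=0.05, small_n=30):
    """Pick Mann-Whitney U for small non-normal samples, else Welch's t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Shapiro-Wilk: a small p-value is evidence against normality
    non_normal = (stats.shapiro(a).pvalue < alpha) or (stats.shapiro(b).pvalue < alpha)
    small = min(len(a), len(b)) < small_n
    if non_normal and small:
        name = "mann_whitney_u"
        res = stats.mannwhitneyu(a, b, alternative="two-sided")
    else:
        name = "welch_t"
        res = stats.ttest_ind(a, b, equal_var=False)
    return name, float(res.pvalue)

# Small samples with one extreme value each (hypothetical data)
skewed_a = np.array([1.0, 1.2, 1.1, 0.9, 1.3, 1.0, 1.1, 0.8, 1.2, 25.0, 1.0, 0.95])
skewed_b = np.array([1.4, 1.6, 1.5, 1.3, 1.7, 1.4, 1.5, 1.2, 1.6, 30.0, 1.4, 1.35])
small_choice = recommend_test(skewed_a, skewed_b)

# Larger samples fall through to Welch's t-test regardless of the normality check
rng = np.random.default_rng(0)
large_choice = recommend_test(rng.normal(size=200), rng.normal(size=200))
print(small_choice[0], large_choice[0])
```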

Multi-page session state: Streamlit reruns a page's script from the top on every interaction and page switch, so ordinary variables do not survive navigation. Each analysis page explicitly writes its results to st.session_state so the Report page can compile a full cross-page PDF without requiring the user to re-run anything.
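
The write-then-collect pattern can be sketched without running Streamlit by passing the state object in explicitly. In the app the `state` argument would be `st.session_state`, which supports the same dictionary-style access used here; a plain dict stands in for it below. The helper names and the keys (`"results"`, `"frequentist"`, `"bayesian"`, `"sequential"`) are hypothetical.

```python
def save_result(state, page_key, result):
    """Store one analysis page's output under a shared 'results' entry."""
    if "results" not in state:
        state["results"] = {}
    state["results"][page_key] = result

def collect_for_report(state, required=("frequentist", "bayesian", "sequential")):
    """Return saved results plus the list of analyses that have not been run."""
    results = state["results"] if "results" in state else {}
    missing = [key for key in required if key not in results]
    return results, missing

# A plain dict stands in for st.session_state so the sketch runs anywhere
state = {}
save_result(state, "frequentist", {"p_value": 0.01})
save_result(state, "bayesian", {"prob_b_better": 0.967})
results, missing = collect_for_report(state)
print(missing)
```

Reporting which analyses are missing lets the Report page prompt the user to visit a page instead of failing on an absent key.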


📄 License

MIT — see LICENSE


👤 Author

Deekshitha Kalluri — GitHub
