A full-stack A/B testing platform built with Python and Streamlit. Automates frequentist hypothesis testing, Bayesian inference, sequential boundary analysis, and multi-metric batch testing — then compiles everything into a downloadable PDF report. Built to replicate the statistical rigor used at data-driven product and growth teams.
| Analysis | Result |
|---|---|
| Frequentist (continuous) | p < 0.0001, Cohen's d=0.529 (Medium), +10.8% lift → SHIP IT |
| Multi-Metric | 3 of 5 metrics significant — revenue (+12.7%), pages_viewed (+27.4%), add_to_cart (+60.0%) |
| Bayesian | P(B > A) = 96.7%, expected loss = 0.02% → SHIP IT |
| Sequential (Monte Carlo) | Naive peeking = 15% false positive rate vs OBF = 5.0% (3× inflation in simulation) |
| Page | What it does |
|---|---|
| ⚡ Power Analysis | Sample size calculator with live power curves, MDE analysis, and What-If checker |
| 🔬 Run Test | T-test + Mann-Whitney U + chi-squared with data profiling, outlier detection, normality testing |
| 📊 Multi-Metric | Batch-test all CSV metric columns — color-coded summary table, lift chart, p-value heatmap |
| 🎲 Bayesian | P(B beats A), expected loss, credible intervals, prior sensitivity analysis |
| 📈 Sequential | O'Brien-Fleming bounds, peeking risk detector, Monte Carlo false positive simulation |
| 📋 Report | One-click PDF export of all results with business-readable ship/don't-ship recommendations |
Raw data (paste / CSV upload)
│
▼
data_profiler.py ← Shapiro-Wilk normality, IQR outlier detection, skewness
│
▼
stats_engine.py ← Welch's T-Test, Mann-Whitney U, Chi-Squared
│
▼
power_analysis.py ← Required N, achieved power, MDE curves (statsmodels)
│
▼
bayesian_engine.py ← Beta-Binomial posterior, Normal posterior, expected loss
│
▼
sequential_testing.py ← O'Brien-Fleming bounds, Pocock bounds, Monte Carlo sim
│
▼
report_builder.py ← Ship / Don't Ship / Caution / Inconclusive decision logic
│
▼
pdf_exporter.py ← ReportLab multi-section PDF report
│
▼
pages/ (Streamlit) ← 6-page interactive UI
ab_test_analyzer/
├── app.py # Home page and navigation
├── stats_engine.py # T-test, Mann-Whitney U, Chi-Squared
├── bayesian_engine.py # Bayesian Beta-Binomial + Normal model
├── power_analysis.py # Sample size & power calculations
├── sequential_testing.py # O'Brien-Fleming, Pocock, peeking detector
├── data_profiler.py # Outlier detection, normality testing
├── report_builder.py # Ship/don't-ship recommendation logic
├── pdf_exporter.py # ReportLab PDF generation
├── sample_data.py # Built-in sample datasets
├── utils.py # Shared UI helpers
├── pages/
│   ├── 1_Power_Analysis.py
│   ├── 2_Run_Test.py
│   ├── 3_Multi_Metric.py
│   ├── 4_Bayesian.py
│   ├── 5_Sequential.py
│   └── 6_Report.py
└── sample_data_files/
    └── multi_metric_sample.csv
| Method | Purpose |
|---|---|
| Welch's T-Test | Compares means of two independent groups; robust to unequal variances |
| Mann-Whitney U | Non-parametric alternative when normality fails or sample is small |
| Chi-Squared Test | Compares conversion rates between two groups |
| Cohen's d | Effect size for continuous metrics (T-Test) |
| Cramér's V | Effect size for conversion rate tests (Chi-Squared) |
| Rank Biserial r | Effect size for Mann-Whitney U |
| Beta-Binomial Model | Bayesian posterior for conversion rate experiments |
| Normal Posterior | Bayesian model for continuous metric experiments |
| O'Brien-Fleming | Sequential boundary controlling false positives across interim looks |
| Pocock Boundary | Constant sequential boundary for early-stopping experiments |
| Shapiro-Wilk | Normality test — auto-recommends T-Test vs Mann-Whitney |
| IQR Fences | Outlier detection flagging data quality issues before analysis |
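To make the first rows of the table concrete, here is a minimal standard-library sketch of Welch's t statistic (with Welch-Satterthwaite degrees of freedom) and Cohen's d. The function names and the sample values are hypothetical and are not taken from `stats_engine.py`; the p-value step is omitted because it needs a t-distribution CDF, such as the one SciPy provides.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Per-group variance of the sample mean (sample variance divided by n).
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(b) - mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled_var)


# Illustrative values only (not from the project's sample data).
control = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0]
variant = [12.0, 14.0, 11.0, 13.0, 15.0, 12.0]

t_stat, df = welch_t(control, variant)
d = cohens_d(control, variant)
print(f"t = {t_stat:.3f}, df = {df:.1f}, Cohen's d = {d:.3f}")
```

Unlike Student's t-test, the Welch form does not pool the variances in the test statistic, which is why it stays valid when the two groups have unequal spread.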
Prerequisites: Python 3.10+, pip
# 1. Clone the repo
git clone https://github.com/DeekshithaKalluri/ab-test-analyzer.git
cd ab-test-analyzer
# 2. Set up environment
python -m venv venv
source venv/bin/activate # Mac/Linux
# venv\Scripts\activate # Windows
# 3. Install dependencies
pip install -r requirements.txt
# 4. Launch the app
streamlit run app.py
Open http://localhost:8501 in your browser. The app loads with built-in sample data on every page, so no uploads are required to explore all features.
| Step | Page | What to do |
|---|---|---|
| 1 | ⚡ Power Analysis | Set your baseline rate and MDE — find required sample size before launching |
| 2 | 🔬 Run Test | Paste values or upload CSV — run all three statistical tests |
| 3 | 📊 Multi-Metric | Upload experiment CSV — batch-test all metrics at once |
| 4 | 🎲 Bayesian | Run Bayesian analysis — get P(B > A) and expected loss |
| 5 | 📈 Sequential | Check peeking risk — validate your result isn't a false positive |
| 6 | 📋 Report | Generate and download the full PDF report |
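Step 1 in the table above (baseline rate and MDE in, required sample size out) can be sketched with the usual normal-approximation formula for comparing two proportions. This is an illustration only: the project lists statsmodels for power analysis, and its numbers may differ slightly from this closed form. The function name and its parameters are assumptions.

```python
import math
from statistics import NormalDist


def required_n_per_group(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test.

    baseline: control conversion rate (e.g. 0.10)
    mde_abs:  minimum detectable effect as an absolute difference (e.g. 0.02)
    """
    p1, p2 = baseline, baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


n = required_n_per_group(baseline=0.10, mde_abs=0.02)
print(f"Users needed per group: {n}")
```

Halving the MDE roughly quadruples the required sample size, because the difference enters the denominator squared.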
The Report page compiles all analysis into a structured PDF including:
- Frequentist test results with confidence intervals and effect sizes
- Multi-metric summary table (color-coded by decision)
- Bayesian posterior probabilities and credible intervals
- Sequential testing simulation results, including the false positive rate with and without OBF correction
| Layer | Tool |
|---|---|
| Language | Python 3.10+ |
| UI framework | Streamlit |
| Statistical tests | SciPy |
| Power analysis | statsmodels |
| Data processing | pandas, NumPy |
| Visualization | Matplotlib, Seaborn |
| PDF generation | ReportLab |
| Version control | Git / GitHub |
Peeking problem in sequential testing: Checking results at 5 interim looks against a fixed 5% threshold inflated the false positive rate from 5% to about 15%. With O'Brien-Fleming spending bounds applied at the same looks, a Monte Carlo simulation over 500 A/A tests held the rate at 5.0%. At 500 simulations that estimate carries roughly ±2 percentage points of sampling noise.
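The peeking experiment can be sketched in a few lines of standard-library Python. This is not the code in `sequential_testing.py`: it assumes unit-variance outcomes, equally spaced looks, and the classical O'Brien-Fleming boundary `c * sqrt(K / k)` with the tabulated constant c ≈ 2.040 for K = 5 looks at two-sided α = 0.05 (a value taken from standard group-sequential tables, not from the project). It also uses more simulations than the 500 mentioned above, to reduce noise.

```python
import math
import random

Z_NAIVE = 1.96   # fixed two-sided 5% threshold reused at every look
OBF_C = 2.040    # assumed tabulated OBF constant for K=5, alpha=0.05


def simulate_false_positive_rates(n_sims=2000, looks=5, n_per_look=200, seed=0):
    """Run A/A tests and return (naive_rate, obf_rate) of false rejections."""
    rng = random.Random(seed)
    # OBF critical values: very strict early, close to 1.96 at the final look.
    obf_bounds = [OBF_C * math.sqrt(looks / k) for k in range(1, looks + 1)]
    naive_hits = obf_hits = 0
    for _ in range(n_sims):
        cum_diff = 0.0
        naive_rejected = obf_rejected = False
        for k in range(1, looks + 1):
            # Under A/A with unit-variance outcomes, the difference of the two
            # group sums over n_per_look new users is Normal(0, 2 * n_per_look).
            cum_diff += rng.gauss(0.0, math.sqrt(2 * n_per_look))
            z = cum_diff / math.sqrt(2 * n_per_look * k)
            naive_rejected = naive_rejected or abs(z) > Z_NAIVE
            obf_rejected = obf_rejected or abs(z) > obf_bounds[k - 1]
        naive_hits += naive_rejected
        obf_hits += obf_rejected
    return naive_hits / n_sims, obf_hits / n_sims


naive_rate, obf_rate = simulate_false_positive_rates()
print(f"naive peeking: {naive_rate:.1%}, O'Brien-Fleming: {obf_rate:.1%}")
```

Because there is no true effect in an A/A test, every rejection is a false positive, so the two rates can be compared directly against the nominal 5%.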
Bayesian vs Frequentist framing: A p-value is not the probability that B is better than A; it is the probability of seeing data at least this extreme if there were no difference. The Beta-Binomial model outputs P(B > A) directly, along with an expected loss metric that quantifies the cost of a wrong ship decision.
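A minimal sketch of the Beta-Binomial calculation, assuming a Beta prior on each conversion rate and Monte Carlo sampling from the two posteriors. The function name, the conversion counts, and the choice to define expected loss as the average of max(rate_A − rate_B, 0) are illustrative assumptions; `bayesian_engine.py` may define these differently.

```python
import random


def beta_binomial_summary(conv_a, n_a, conv_b, n_b,
                          prior_alpha=1.0, prior_beta=1.0,
                          draws=20000, seed=0):
    """Return (P(B > A), expected loss of shipping B) from posterior samples."""
    rng = random.Random(seed)
    wins = 0
    loss_if_ship_b = 0.0
    for _ in range(draws):
        # Conjugate update: Beta(prior + successes, prior + failures).
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        wins += rate_b > rate_a
        # Conversion rate given up in the draws where A is actually better.
        loss_if_ship_b += max(rate_a - rate_b, 0.0)
    return wins / draws, loss_if_ship_b / draws


# Made-up counts for illustration: 100/1000 conversions for A, 130/1000 for B.
p_b_better, expected_loss = beta_binomial_summary(100, 1000, 130, 1000)
print(f"P(B > A) = {p_b_better:.1%}, expected loss = {expected_loss:.4%}")
```

Expected loss stays small when B is either clearly better or only marginally worse, which is what makes it useful as a ship threshold alongside P(B > A).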
Prior sensitivity analysis: Added a sweep over prior strengths (α = β from 0.1 to 10) to check that the Bayesian conclusion is stable across prior choices. A result that changes dramatically with prior strength indicates insufficient data, not a genuine effect.
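The sweep can be illustrated as follows, using symmetric Beta(s, s) priors with s from 0.1 to 10 as described above. The helper names, the grid of strengths, and the conversion counts are hypothetical; the point of the sketch is only that a large sample barely moves while a small one swings noticeably.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, strength, draws=10000, seed=0):
    """Monte Carlo P(B > A) under a symmetric Beta(strength, strength) prior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(strength + conv_a, strength + n_a - conv_a)
        rate_b = rng.betavariate(strength + conv_b, strength + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


def prior_sweep(conv_a, n_a, conv_b, n_b,
                strengths=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0)):
    """Map each prior strength to the resulting P(B > A)."""
    return {s: prob_b_beats_a(conv_a, n_a, conv_b, n_b, s) for s in strengths}


def spread(sweep):
    """Range of P(B > A) across the sweep; small means prior-insensitive."""
    return max(sweep.values()) - min(sweep.values())


large = prior_sweep(100, 1000, 130, 1000)  # made-up large experiment
small = prior_sweep(2, 20, 5, 20)          # made-up small experiment
print(f"spread with n=1000: {spread(large):.3f}, with n=20: {spread(small):.3f}")
```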
Normality-aware test selection: Shapiro-Wilk runs on both groups before any test. The profiler recommends Mann-Whitney U when either group fails normality with n < 30, so a T-Test is not silently applied to small non-normal samples.
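One way to express that rule with SciPy is sketched below. It reads "fails normality with n < 30" as: the smaller group has fewer than 30 observations and at least one group has a Shapiro-Wilk p-value below 0.05. That reading, the function name, and the returned labels are assumptions and may not match `data_profiler.py`.

```python
from scipy import stats


def choose_and_run_test(a, b, alpha=0.05, small_n=30):
    """Pick Mann-Whitney U for small non-normal samples, else Welch's t-test.

    Returns (test_name, p_value).
    """
    small_sample = min(len(a), len(b)) < small_n
    non_normal = any(stats.shapiro(group).pvalue < alpha for group in (a, b))
    if small_sample and non_normal:
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        return "mann_whitney_u", result.pvalue
    # equal_var=False selects Welch's variant of the t-test.
    result = stats.ttest_ind(a, b, equal_var=False)
    return "welch_t", result.pvalue


# Illustrative small samples, each with one extreme outlier.
skewed_a = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 25.0]
skewed_b = [1.3, 1.2, 1.4, 1.1, 1.5, 1.3, 1.2, 1.4, 1.6, 30.0]

name, p_value = choose_and_run_test(skewed_a, skewed_b)
print(name, round(p_value, 4))
```

With 30 or more observations per group the sketch always falls through to Welch's t-test, relying on the central limit theorem for the sampling distribution of the mean.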
Multi-page session state: Streamlit reruns a page's script from the top on every interaction and page switch, so ordinary variables do not survive navigation. Each analysis page therefore writes its results to st.session_state explicitly, and the Report page compiles a full cross-page PDF without requiring the user to re-run anything.
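The pattern can be sketched with two small helpers that take any dict-like store. In the app that store would be `st.session_state`, which supports item assignment and `in` checks; a plain dict stands in here so the sketch runs without Streamlit. The key names and helper names are hypothetical.

```python
from collections.abc import MutableMapping

# Hypothetical keys, one per analysis page that feeds the report.
REPORT_KEYS = ("frequentist", "multi_metric", "bayesian", "sequential")


def save_result(state: MutableMapping, key: str, result: dict) -> None:
    """Called at the end of an analysis page to persist its output."""
    state[key] = result


def collect_for_report(state: MutableMapping) -> tuple[dict, list]:
    """Return the results that exist plus the keys still missing."""
    available = {k: state[k] for k in REPORT_KEYS if k in state}
    missing = [k for k in REPORT_KEYS if k not in state]
    return available, missing


# A plain dict stands in for st.session_state.
state = {}
save_result(state, "bayesian", {"p_b_better": 0.967})
available, missing = collect_for_report(state)
print("ready:", list(available), "| not yet run:", missing)
```

Returning the missing keys lets a report page tell the user which analyses have not been run yet instead of failing on a lookup.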
MIT — see LICENSE
Deekshitha Kalluri — GitHub