|
1 | | -# BuildABiocWorkshop |
| 1 | +# Introduction and Motivation |
2 | 2 |
|
3 | | -This package is a template for building a Bioconductor workshop. The package |
4 | | -includes Github actions to: |
| 3 | +Artificial intelligence and machine learning (AI/ML) have become essential tools in biomedical research, enabling large-scale analyses across diverse domains such as genomics, structural biology, and electronic health records-based research. Increasingly, researchers rely on model-generated predictions, rather than directly measured variables, as inputs for downstream statistical analyses. For example, predicted gene expression values or polygenic risk scores are often used in place of experimental assays, allowing researchers to expand cohort sizes and explore hypotheses when traditional data collection is infeasible, costly, or time-consuming. |
5 | 4 |
|
6 | | -1. Set up bioconductor/bioconductor_docker:devel on Github resources |
7 | | -2. Install package dependencies for your package (based on the `DESCRIPTION` file) |
8 | | -3. Run `rcmdcheck::rcmdcheck` |
9 | | -4. Build a pkgdown website and push it to github pages |
10 | | -5. Build a docker image with the installed package and dependencies and deploy to [the Github Container Repository](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry#pulling-container-images) at the name `ghcr.io/gihub_user/repo_name`, all lowercase. |
| 5 | +While this practice of "using predictions as data" holds promise for accelerating scientific discovery, it presents significant challenges for statistical inference. When predicted values are used in place of true variables, the resulting estimates of association can be biased and misleading if uncertainty in the prediction step is not properly accounted for. |
11 | 6 |
|
12 | | -## Responsibilities |
| 7 | +In this workshop, we explore the consequences of inference on predicted data across several biomedical applications. Drawing from classical approaches to measurement error and recent developments in bias correction, we will present a suite of prediction-based inference methods that adjust for prediction-related uncertainty and improve inference validity and efficiency. We will also introduce {ipd}, a user-friendly Bioconductor R package that implements several of these correction methods through a unified interface. The package supports modular integration into existing workflows and includes tidy methods for model inspection and diagnostics. |
13 | 8 |
|
14 | | -Package authors are primarily responsible for: |
15 | 9 |
|
16 | | -1. Creating a landing site of their choosing for their workshops (a website). This website should be listed in the `DESCRIPTION` file as the `URL`. |
17 | | -2. Creating a docker image that will contain workshop materials and the installed packages necessary to run those materials. The name of the resulting docker image, including "tag" if desired, should be listed in a non-standard tag, `DockerImage:` in the `DESCRIPTION` file. |
18 | 10 |
|
19 | | -Both of those tasks can be accomplished using the Github actions included in this template package. The vignette accompanying this package describes how to accomplish both of these tasks. |
20 | 11 |
|
21 | | -## Details |
22 | 12 |
|
23 | | -For detailed instructions, see the `How to build a workshop` article/vignette. |
| 13 | +################################################################################ |
24 | 14 |
|
25 | | -## Results of successful deployment |
26 | 15 |
|
27 | | -- A working docker image that contains the installed package and dependencies. |
28 | | -- An up-to-date `pkgdown` website at https://YOURUSERNAME.github.io/YOURREPOSITORYNAME/ |
29 | | -- Docker image will be tagged with `latest`, `sha-XXXXXX` where `XXXXXX` is the hash of the current `master` commit, and `master`. |
30 | 16 |
|
31 | | -## To use the resulting image: |
32 | 17 |
|
33 | | -```sh |
34 | | -docker run -e PASSWORD=<choose_a_password_for_rstudio> -p 8787:8787 YOURDOCKERIMAGENAME |
35 | | -``` |
36 | | -Once running, navigate to http://localhost:8787/ and then login with `rstudio`:`yourchosenpassword`. |
| 18 | +In many modern data science applications, it is common to encounter settings |
| 19 | +where measuring a particular outcome, $Y$, is expensive or time-consuming, |
| 20 | +whereas predictions, $\hat{Y} = f(X)$, from a machine learning model are |
| 21 | +readily available on large datasets. The `ipd` package provides a suite of |
| 22 | +methods to perform valid statistical inference when some outcomes are |
| 23 | +observed (labeled) and others are only predicted. |
| 24 | + |
| 25 | +In this workshop, you will learn: |
| 26 | + |
| 27 | +* The theoretical foundation behind prediction-powered inference (PPI) and its extensions. |
| 28 | +* How to use **ipd** functions to simulate data, fit models, and extract inference results. |
| 29 | +* Practical exercises to compare naive estimators with IPD methods. |
| 30 | + |
| 31 | +By the end, you should be able to design analyses that leverage large unlabeled datasets while maintaining correct uncertainty quantification. |
| 32 | + |
| 33 | +## The Augmented Data Scheme |
| 34 | + |
| 35 | +Consider three sets of observations: |
| 36 | + |
| 37 | +* **Training set**: $\{(X_i, Y_i)\}_{i=1}^{n_\text{train}}$, used to fit a predictive model $f(\cdot)$. |
| 38 | +* **Labeled set**: $\{(X_i, Y_i)\}_{i=1}^{n_\ell}$, smaller sample with true outcomes. |
| 39 | +* **Unlabeled set**: $\{X_i\}_{i=n_\text{train}+n_\ell+1}^{n_\text{train}+n_\ell+n_u}$, only features available. |
37 | 40 |
|
38 | | -To try with **this** repository docker image: |
| 41 | +After fitting $f$ on the training set, we apply it to the labeled and unlabeled sets to obtain predictions $f_i = f(X_i)$. We then construct an **augmented dataset**: |
39 | 42 |
|
40 | | -```sh |
41 | | -docker run -e PASSWORD=abc -p 8787:8787 ghcr.io/bioconductor/buildabiocworkshop |
| 43 | +```plaintext |
| 44 | ++---------------+ +---------------+ +----------------+ |
| 45 | +| Training (T) | ---> | Labeled (L) | ---> | Unlabeled (U) | |
| 46 | +| (X, Y) | | (X, Y, f) | | (X, f) | |
| 47 | ++---------------+ +---------------+ +----------------+ |
42 | 48 | ``` |
43 | 49 |
|
44 | | -*NOTE*: Running docker that uses the password in plain text like above exposes the password to others |
45 | | -in a multi-user system (like a shared workstation or compute node). In practice, consider using an environment |
46 | | -variable instead of plain text to pass along passwords and other secrets in docker command lines. |
| 50 | +We treat $f_i$ in the unlabeled set as surrogate outcomes and combine them with observed $Y_i$ in the labeled set to estimate regression parameters $\beta$. |
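
The construction of the augmented dataset can be sketched in base R. This is a minimal illustration under stated assumptions, not the workshop's own code: the sample sizes, the simulated data-generating model, the use of `lm()` as the prediction model $f$, and the column name `set_label` are all hypothetical choices made for the example.

```r
set.seed(1)

# Hypothetical sample sizes for the training, labeled, and unlabeled sets
n_train <- 500
n_lab   <- 100
n_unlab <- 1000
n       <- n_train + n_lab + n_unlab

# Simulate one feature and an outcome (illustrative data-generating model)
x <- rnorm(n)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(n)
set_label <- rep(c("training", "labeled", "unlabeled"),
                 times = c(n_train, n_lab, n_unlab))
dat <- data.frame(x = x, y = y, set_label = set_label)

# Fit the prediction model f on the training set only
fit_f <- lm(y ~ x, data = dat[dat$set_label == "training", ])

# Apply f to the labeled and unlabeled sets to obtain f_i = f(X_i)
aug <- dat[dat$set_label != "training", ]
aug$f <- predict(fit_f, newdata = aug)

# True outcomes are not available in the unlabeled set
aug$y[aug$set_label == "unlabeled"] <- NA

head(aug)
table(aug$set_label)
```

The resulting `aug` data frame stacks the labeled rows (with `x`, `y`, and `f`) on top of the unlabeled rows (with `x` and `f` only), mirroring the diagram above.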
| 51 | + |
| 52 | +## Key Formulas |
| 53 | + |
| 54 | +### Naive Estimator |
| 55 | + |
| 56 | +Using only the unlabeled predictions, the naive OLS estimator solves |
| 57 | + |
| 58 | +$$ |
| 59 | +\hat\beta_{\text{naive}} = \arg\min_\beta \sum_{i\in U} \bigl(f_i - X_i^T\beta\bigr)^2. |
| 60 | +$$ |
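
To see why the naive estimator can mislead, the sketch below compares it with an OLS fit that uses only the true outcomes in the labeled set. Everything here is an assumption made for illustration: the simulated data, the sample sizes, and the cubic polynomial used as the prediction model $f$. It uses base R only and does not call any `ipd` function.

```r
set.seed(42)

# Hypothetical sample sizes
n_train <- 500
n_lab   <- 100
n_unlab <- 1000

# Illustrative data-generating model: y is linear in x with noisy errors
simulate <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 2))
}
train <- simulate(n_train)
lab   <- simulate(n_lab)
unlab <- simulate(n_unlab)

# Prediction model f, fit on the training set (assumed here to be a cubic fit)
fit_f   <- lm(y ~ poly(x, 3), data = train)
lab$f   <- predict(fit_f, newdata = lab)
unlab$f <- predict(fit_f, newdata = unlab)

# Naive estimator: regress the predictions f on x in the unlabeled set
naive <- lm(f ~ x, data = unlab)

# Labeled-only comparison: regress the true outcome y on x in the labeled set
classical <- lm(y ~ x, data = lab)

# Slope estimate and standard error from each fit
res <- rbind(
  naive     = coef(summary(naive))["x", c("Estimate", "Std. Error")],
  classical = coef(summary(classical))["x", c("Estimate", "Std. Error")]
)
res
```

Because the predictions are much smoother than the true outcomes, the naive fit reports a far smaller standard error than the labeled-only fit. That apparent precision ignores the uncertainty in $f$ itself, which is the gap the IPD corrections are designed to address.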
| 61 | + |
| 62 | + |
47 | 63 |
|
| 64 | +# This Workshop |
48 | 65 |
|
49 | | -## Whatcha get |
| 66 | +Welcome! This workshop provides a brief introduction to performing valid statistical inference when your outcome has been partially imputed by a machine learning model. The central package is [**ipd**](https://bioconductor.org/packages/ipd/), which implements several recent methods for conducting inference with predicted data (IPD). |
| 67 | + |
| 68 | +> **Prerequisites** |
| 69 | +> 1. R (≥ 4.1) and Bioconductor installed |
| 70 | +> 2. The `ipd` package: |
| 71 | +> ```r |
| 72 | +> if (!requireNamespace("BiocManager", quietly = TRUE)) |
| 73 | +> install.packages("BiocManager") |
| 74 | +> BiocManager::install("ipd") |
| 75 | +> ``` |
| 76 | +> 3. Supporting packages: |
| 77 | +> ```r |
| 78 | +> install.packages(c("tidyverse", "patchwork", "NHANES", "rashomonquartet", "ranger", "mgcv", "pROC")) |
| 79 | +> BiocManager::install(c("BiocStyle", "Biobase", "BiocGenerics", "ALL")) |
| 80 | +> ``` |
| 81 | +
|
| 82 | +--- |
| 83 | +
|
| 84 | +## Workshop Structure |
| 85 | +
|
| 86 | +There are four hands‐on tutorials (R Markdown vignettes). Each vignette loads or simulates data, trains a prediction model, applies `ipd::ipd()`, and includes exercises: |
| 87 | +
|
| 88 | +1. [Chapter 1: Simulated Data](vignettes/01-simulated-data.html) |
| 89 | + - Fully synthetic linear‐regression example |
| 90 | + - Compare naïve, classical, and PPI/PPI++/PostPI/PSPA methods |
| 91 | + - Residual diagnostics and bootstrap coverage |
| 92 | +
|
| 93 | +2. [Chapter 2: Rashomon Quartet](vignettes/02-rashomon-quartet.html) |
| 94 | + - Illustrate how four datasets with identical summary statistics can behave very differently |
| 95 | + - Show why naïve regression on predictions can fail under nonlinearity or outliers |
| 96 | + - Compare IPD corrections (PPI, PPI++, PSPA) across R1–R4 scenarios |
| 97 | +
|
| 98 | +3. [Chapter 3: NHANES Body Fat vs BMI](vignettes/03-nhanes-bodyfat.html) |
| 99 | + - Real‐world data from NHANES (DXA percent body fat vs BMI) |
| 100 | + - Fit a linear (or nonlinear) prediction model on labeled participants |
| 101 | + - Use IPD to estimate the effect of Age on true percent body fat (correcting for bias) |
| 102 | +
|
| 103 | +4. [Chapter 4: Genetic Data (Bioconductor `ALL`)](vignettes/04-genetic-data.html) |
| 104 | + - A binary IPD example using leukemia microarray data (`ALL` package) |
| 105 | + - Fit a logistic model (CD19‐based) to predict BCR/ABL labels |
| 106 | + - Apply IPD to estimate the log‐odds effect of CD38 expression on true BCR/ABL status |
| 107 | +
|
| 108 | +--- |
| 109 | +
|
| 110 | +## How to Build & View Locally |
| 111 | +
|
| 112 | +1. Clone or download this repository: |
| 113 | +
|
| 114 | +```bash |
| 115 | +git clone https://github.com/salernos/ipdworkshop.git |
| 116 | +cd ipdworkshop |
| 117 | +``` |
50 | 118 |
|
51 | | -- https://bioconductor.github.io/BuildABiocWorkshop |
52 | | -- A Docker image that you can run locally, in the cloud, or (usually) even as a singularity container on HPC systems. |
|