
Commit 0f0f2e5

Updating workshop to include all the dependencies we will need and to ensure it is able to build without error. Pushing changes now, but will need to continue updating content.
1 parent 593ca01 commit 0f0f2e5

206 files changed

Lines changed: 29237 additions & 40 deletions


DESCRIPTION

Lines changed: 18 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -19,12 +19,12 @@ Authors@R:
1919
Description: |
2020
A workshop demonstrating how to perform downstream statistical inference
2121
when the outcome has been partially imputed via a machine learning model.
22-
The central R/Bioconductor package is ipd (Inference with Predicted Data).
22+
The central R package is the ipd package (Inference with Predicted Data).
2323
License: MIT + file LICENSE
2424
Encoding: UTF-8
2525
LazyData: true
2626
Roxygen: list(markdown = TRUE)
27-
RoxygenNote: 7.1.0
27+
RoxygenNote: 7.3.2
2828
Depends:
2929
R (>= 4.0),
3030
ipd,
@@ -40,7 +40,22 @@ Imports:
4040
mgcv,
4141
pROC,
4242
ALL,
43-
BiocGenerics
43+
golubEsets,
44+
AnnotationDbi,
45+
hgu95av2.db,
46+
hu6800.db,
47+
BiocGenerics,
48+
DALEX,
49+
broom,
50+
GGally,
51+
MLInterfaces,
52+
here,
53+
janitor,
54+
keggorthology,
55+
neuralnet,
56+
partykit,
57+
randomForest,
58+
scales
4459
Suggests:
4560
DT
4661
URL: https://salernos.github.io/ipdworkshop/

README.md

Lines changed: 99 additions & 33 deletions
Original file line number | Diff line number | Diff line change
@@ -1,52 +1,118 @@
1-
# BuildABiocWorkshop
1+
# Introduction and Motivation
22

3-
This package is a template for building a Bioconductor workshop. The package
4-
includes Github actions to:
3+
Artificial intelligence and machine learning (AI/ML) have become essential tools in biomedical research, enabling large-scale analyses across diverse domains such as genomics, structural biology, and electronic health records-based research. Increasingly, researchers rely on model-generated predictions, rather than directly measured variables, as inputs for downstream statistical analyses. For example, predicted gene expression values or polygenic risk scores are often used in place of experimental assays, allowing researchers to expand cohort sizes and explore hypotheses when traditional data collection is infeasible, costly, or time-consuming.
54

6-
1. Set up bioconductor/bioconductor_docker:devel on Github resources
7-
2. Install package dependencies for your package (based on the `DESCRIPTION` file)
8-
3. Run `rcmdcheck::rcmdcheck`
9-
4. Build a pkgdown website and push it to github pages
10-
5. Build a docker image with the installed package and dependencies and deploy to [the Github Container Repository](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry#pulling-container-images) at the name `ghcr.io/gihub_user/repo_name`, all lowercase.
5+
While this practice of "using predictions as data" holds promise for accelerating scientific discovery, it presents significant challenges for statistical inference. When predicted values are used in place of true variables, the resulting estimates of association can be biased and misleading if uncertainty in the prediction step is not properly accounted for.
116

12-
## Responsibilities
7+
In this workshop, we explore the consequences of inference on predicted data across several biomedical applications. Drawing from classical approaches to measurement error and recent developments in bias correction, we will present a suite of prediction-based inference methods that adjust for prediction-related uncertainty and improve inference validity and efficiency. We will also introduce {ipd}, a user-friendly Bioconductor R package that implements several of these correction methods through a unified interface. The package supports modular integration into existing workflows and includes tidy methods for model inspection and diagnostics.
138

14-
Package authors are primarily responsible for:
159

16-
1. Creating a landing site of their choosing for their workshops (a website). This website should be listed in the `DESCRIPTION` file as the `URL`.
17-
2. Creating a docker image that will contain workshop materials and the installed packages necessary to run those materials. The name of the resulting docker image, including "tag" if desired, should be listed in a non-standard tag, `DockerImage:` in the `DESCRIPTION` file.
1810

19-
Both of those tasks can be accomplished using the Github actions included in this template package. The vignette accompanying this package describes how to accomplish both of these tasks.
2011

21-
## Details
2212

23-
For detailed instructions, see the `How to build a workshop` article/vignette.
13+
################################################################################
2414

25-
## Results of successful deployment
2615

27-
- A working docker image that contains the installed package and dependencies.
28-
- An up-to-date `pkgdown` website at https://YOURUSERNAME.github.io/YOURREPOSITORYNAME/
29-
- Docker image will be tagged with `latest`, `sha-XXXXXX` where `XXXXXX` is the hash of the current `master` commit, and `master`.
3016

31-
## To use the resulting image:
3217

33-
```sh
34-
docker run -e PASSWORD=<choose_a_password_for_rstudio> -p 8787:8787 YOURDOCKERIMAGENAME
35-
```
36-
Once running, navigate to http://localhost:8787/ and then login with `rstudio`:`yourchosenpassword`.
18+
In many modern data science applications, it is common to encounter settings
19+
where measuring a particular outcome, $Y$, is expensive or time-consuming,
20+
whereas predictions, $\hat{Y} = f(X)$, from a machine learning model are
21+
readily available on large datasets. The `ipd` package provides a suite of
22+
methods to perform valid statistical inference when some outcomes are
23+
observed (labeled) and others are only predicted.
24+
25+
In this workshop, you will learn:
26+
27+
* The theoretical foundation behind prediction-powered inference (PPI) and its extensions.
28+
* How to use **ipd** functions to simulate data, fit models, and extract inference results.
29+
* Practical exercises to compare naive estimators with IPD methods.
30+
31+
By the end, you should be able to design analyses that leverage large unlabeled datasets while maintaining correct uncertainty quantification.
32+
33+
## The Augmented Data Scheme
34+
35+
Consider three sets of observations:
36+
37+
* **Training set**: $\{(X_i, Y_i)\}_{i=1}^{n_\text{train}}$, used to fit a predictive model $f(\cdot)$.
38+
* **Labeled set**: $\{(X_i, Y_i)\}_{i=n_\text{train}+1}^{n_\text{train}+n_\ell}$, a smaller sample with true outcomes.
39+
* **Unlabeled set**: $\{X_i\}_{i=n_\text{train}+n_\ell+1}^{n_\text{train}+n_\ell+n_u}$, where only features are available.
3740

38-
To try with **this** repository docker image:
41+
After fitting $f$ on the training set, we apply it to the labeled and unlabeled sets to obtain predictions $f_i = f(X_i)$. We then construct an **augmented dataset**:
3942

40-
```sh
41-
docker run -e PASSWORD=abc -p 8787:8787 ghcr.io/bioconductor/buildabiocworkshop
43+
```plaintext
44+
+---------------+ +---------------+ +----------------+
45+
| Training (T) | ---> | Labeled (L) | ---> | Unlabeled (U) |
46+
| (X, Y) | | (X, Y, f) | | (X, f) |
47+
+---------------+ +---------------+ +----------------+
4248
```
4349

44-
*NOTE*: Running docker that uses the password in plain text like above exposes the password to others
45-
in a multi-user system (like a shared workstation or compute node). In practice, consider using an environment
46-
variable instead of plain text to pass along passwords and other secrets in docker command lines.
50+
We treat $f_i$ in the unlabeled set as surrogate outcomes and combine them with observed $Y_i$ in the labeled set to estimate regression parameters $\beta$.
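The construction above can be sketched in base R. This is a minimal illustration, not the workshop's own code: the sample sizes, the simulated linear model, the use of `lm()` as the prediction model $f$, and the column name `set_label` are all assumptions made for the example.

```r
set.seed(1)

# Hypothetical sizes for the three sets (assumed for illustration)
n_train <- 500
n_lab   <- 100
n_unlab <- 1000
n       <- n_train + n_lab + n_unlab

# Simulate one feature and a noisy linear outcome (assumed data-generating model)
dat <- data.frame(X = rnorm(n))
dat$Y <- 1 + 2 * dat$X + rnorm(n)
dat$set_label <- rep(c("training", "labeled", "unlabeled"),
                     times = c(n_train, n_lab, n_unlab))

# Fit the prediction model f on the training set only
fit_f <- lm(Y ~ X, data = subset(dat, set_label == "training"))

# Apply f to the labeled and unlabeled sets to obtain f_i = f(X_i)
aug <- subset(dat, set_label != "training")
aug$f <- predict(fit_f, newdata = aug)

# In practice the true outcome is not observed in the unlabeled set
aug$Y[aug$set_label == "unlabeled"] <- NA

# Augmented dataset: (X, Y, f) for labeled rows, (X, f) for unlabeled rows
table(aug$set_label)
```

Keeping both sets in one data frame with a label column makes it easy to pass labeled and unlabeled rows to a downstream correction method together.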
51+
52+
## Key Formulas
53+
54+
### Naive Estimator
55+
56+
Using only the unlabeled predictions, the naive OLS estimator solves
57+
58+
$$
59+
\hat\beta_{\text{naive}} = \arg\min_\beta \sum_{i\in U} \bigl(f_i - X_i^T\beta\bigr)^2.
60+
$$
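As a rough sketch of the naive estimator, the following base R code regresses the predictions $f_i$ on $X_i$ over the unlabeled set and checks the result against the normal equations. The simulated data, the extra feature `Z` used only by the prediction model, and the convention that $X_i$ includes an intercept column are assumptions made for this example.

```r
set.seed(2)

# Hypothetical simulation: Y depends on X and on a second feature Z
sim <- function(n) {
  X <- rnorm(n)
  Z <- rnorm(n)
  data.frame(X = X, Z = Z, Y = 1 + 2 * X + Z + rnorm(n))
}
train <- sim(500)   # training set, used only to fit f
unlab <- sim(1000)  # unlabeled set U (Y is simulated but never used below)

# Prediction model f, fit on the training set
fit_f <- lm(Y ~ X + Z, data = train)
unlab$f <- predict(fit_f, newdata = unlab)

# Naive estimator: least squares of f_i on X_i over the unlabeled set
naive_fit  <- lm(f ~ X, data = unlab)
beta_naive <- coef(naive_fit)

# The same estimate from the normal equations (X'X)^{-1} X'f
Xmat        <- cbind(1, unlab$X)
beta_closed <- drop(solve(crossprod(Xmat), crossprod(Xmat, unlab$f)))

print(beta_naive)
print(beta_closed)
```

The standard errors reported by `summary(naive_fit)` describe the variability of $f$ around its regression on $X$, not the variability of the true outcome $Y$, which is why the naive fit should not be used for inference without correction.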
61+
62+
4763

64+
# This Workshop
4865

49-
## Whatcha get
66+
Welcome! This workshop provides a brief introduction to performing valid statistical inference when your outcome has been partially imputed by a machine learning model. The central package is [**ipd**](https://bioconductor.org/packages/ipd/), which implements several recent methods for conducting inference with predicted data (IPD).
67+
68+
> **Prerequisites**
69+
> 1. R (≥ 4.1) and Bioconductor installed
70+
> 2. The `ipd` package:
71+
> ```r
72+
> if (!requireNamespace("BiocManager", quietly = TRUE))
73+
> install.packages("BiocManager")
74+
> BiocManager::install("ipd")
75+
> ```
76+
> 3. Supporting packages:
77+
> ```r
78+
> install.packages(c("tidyverse", "patchwork", "NHANES", "rashomonquartet", "ranger", "mgcv", "pROC", "ALL"))
79+
> BiocManager::install(c("BiocStyle", "Biobase", "BiocGenerics"))
80+
> ```
81+
82+
---
83+
84+
## Workshop Structure
85+
86+
There are four hands-on tutorials (R Markdown vignettes). Each vignette loads or simulates data, trains a prediction model, applies `ipd::ipd()`, and includes exercises:
87+
88+
1. [Chapter 1: Simulated Data](vignettes/01-simulated-data.html)
89+
- Fully synthetic linear-regression example
90+
- Compare naïve, classical, and PPI/PPI++/PostPI/PSPA methods
91+
- Residual diagnostics and bootstrap coverage
92+
93+
2. [Chapter 2: Rashomon Quartet](vignettes/02-rashomon-quartet.html)
94+
- Illustrate how four datasets with identical summary statistics can behave very differently
95+
- Show why naïve regression on predictions can fail under nonlinearity or outliers
96+
- Compare IPD corrections (PPI, PPI++, PSPA) across R1–R4 scenarios
97+
98+
3. [Chapter 3: NHANES Body Fat vs BMI](vignettes/03-nhanes-bodyfat.html)
99+
- Real-world data from NHANES (DXA percent body fat vs BMI)
100+
- Fit a linear (or nonlinear) prediction model on labeled participants
101+
- Use IPD to estimate the effect of Age on true percent body fat (correcting for bias)
102+
103+
4. [Chapter 4: Genetic Data (Bioconductor `ALL`)](vignettes/04-genetic-data.html)
104+
- A binary IPD example using leukemia microarray data (`ALL` package)
105+
- Fit a logistic model (CD19-based) to predict BCR/ABL labels
106+
- Apply IPD to estimate the log-odds effect of CD38 expression on true BCR/ABL status
107+
108+
---
109+
110+
## How to Build & View Locally
111+
112+
1. Clone or download this repository:
113+
114+
```bash
115+
git clone https://github.com/salernos/ipdworkshop.git
116+
cd ipdworkshop
117+
```
50118
51-
- https://bioconductor.github.io/BuildABiocWorkshop
52-
- A Docker image that you can run locally, in the cloud, or (usually) even as a singularity container on HPC systems.

_pkgdown.yml

Lines changed: 7 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -1,18 +1,21 @@
11
url: https://salernos.github.io/ipdworkshop
22

33
template:
4-
params:
4+
bootstrap: 5
5+
lightswitch: true
6+
bslib:
57
bootswatch: flatly
6-
#ganalytics: UA-99999999-9
8+
pkgdown-nav-height: 100px
9+
primary: "#1B365D"
710

811
home:
912
title: "Inference with Predicted Data Using the ipd Package"
1013
type: inverse
1114

12-
1315
navbar:
1416
right:
1517
- icon: fa-github
1618
href: https://github.com/salernos/ipdworkshop
1719

18-
20+
project:
21+
render: ['*.qmd']

docs/404.html

Lines changed: 82 additions & 0 deletions
Some generated files are not rendered by default.

docs/LICENSE-text.html

Lines changed: 63 additions & 0 deletions
Some generated files are not rendered by default.
