These pipelines were developed through a collaboration between the Bioinformatics and Proteomics Cores at VCU. Their primary goal is to streamline the analysis of data generated by the Proteomics Core. In addition, the pipelines are designed to be modular and fully reproducible, enabling broader adoption and adaptation by other research groups. The pipeline has been developed to handle the following study designs and protein data (see below).
| Study Design | LFQ | PTM Analysis |
|---|---|---|
| 1 × 1 | NS | NS |
| 3 × 3 | 90% | NS |
| 3 × 3 × 3 | NS | NS |
| 3 × 3 × 3 × 3 | NS | NS |
NS means "Not started" Percentages describe about how much of the pipeline is complete
ANALYST READ THIS
1. The pipeline temporarily runs using the instructions under "Current Execution Workflow".
Currently this is the most reliable way of making sure the correct environment is loaded, the analysis
data is output, and the rmarkdown is rendered. This last step has been problematic due to a bug with
knittr and knitting must be done manually with Rstudio.
2. Currently there are some difficulties using the renv setup. The current `.renv` file was generated
using Cardinal (VCU server) and somehow, configurations on that server affected this file.
- Input Datasets and Proteome Sources
- Imputation
- Differential Abundance Analysis
- Developmental Notes
The overview briefly describe the analyses that are under development and, in this section, we will go more in-depth into the study designs and proteome sources that can be run with this code base.
The proteomics core generates multiple types of datasets, each with distinct study designs that determine the analytical pipeline we can apply. The diagram below gives an overview with follow-up verbal descriptions:
The available study designs include:
- 1 × 1: two treatment groups, one sample per group
- 3 × 3: two treatment groups, three samples per group
- 3 × 3 × 3: three treatment groups, three samples per group
- 3 × 3 × 3 × 3: four treatment groups, three samples per group
Given the limitations imposed by sample size, differential abundance analysis is performed only for designs with at least a 3 × 3 structure or larger. For the 1 × 1 design, we restrict the analysis to exploratory and pilot-level visualizations.
The pipeline(s) developed here are focused on data generated by Label-Free Quantification (LFQ) proteomics and PTM (post-translational modification) proteomics. Both data types are generated using tandem mass spectrometry (MS/MS). The diagram below gives an overview with follow-up verbal descriptions:
Raw proteomic data are generated by the VCU Proteomics Core using tandem mass spectrometry (MS/MS). The data are processed through a standardized bioinformatics pipeline for peptide and protein identification and quantification.
Peptide identification is performed using the Sequest HT search algorithm implemented in Proteome Discoverer 3.0. MS/MS spectra are searched against the SwissProt protein database, supplemented with a contaminant database to identify and remove exogenous protein sequences.
Protein-level quantification is derived from constituent peptide measurements. Peptide intensities are aggregated to generate:
-
Total protein abundance
-
Number of unique peptides
-
Peptide sequence coverage
The PTM analysis workflow follows a similar MS/MS acquisition and database search strategy. However, due to inherent limitations in PTM peptide detection and protein inference:
-
Quantification is performed at the PTM peptide level
-
Not all PTM peptides can be confidently and uniquely assigned to a single parent protein
-
In some cases, PTM peptides may map to multiple protein isoforms or homologous proteins
Therefore, PTM analyses are conducted primarily at the site- or peptide-level, rather than at the aggregated protein level.
Missing intensity values are common in data-dependent acquisition (DDA) proteomics experiments and are predominantly attributed to a missing-not-at-random (MNAR) mechanism — low-abundance peptides fall below the instrument's detection threshold rather than being truly absent. Imputation is performed before differential abundance analysis to preserve statistical power.
The imputation method is selected via the --imputation flag:
| Method | Description |
|---|---|
DEP-MinProb (default) |
Draws imputed values from a down-shifted Gaussian distribution centered at the low end of the observed intensity distribution. The --imputation-q parameter (default 0.01) controls the quantile cutoff for the left-censored distribution. |
DEP-QRILC |
Quantile Regression Imputation of Left-Censored data — fits a QRILC model per sample. Also accepts --imputation-q. |
DEP-knn |
k-Nearest Neighbour imputation across proteins. |
DEP-bpca |
Bayesian PCA-based imputation. |
DEP-man |
Manual/global minimum imputation via the DEP package. |
none |
Skips imputation entirely. |
3by3 |
In-house method developed for VCU Proteomics Core experiments (see below). |
The 3by3 method was developed specifically for the study designs most common at the VCU Proteomics Core, where each group has exactly 3 biological replicates. Standard per-column distribution estimates become unreliable at this sample size, so 3by3 uses a group-aware, globally-calibrated strategy:
- ≥ 2 observed values in a group → missing values are imputed as the median of the observed values within that group. This preserves the within-group signal rather than pulling values artificially low.
- < 2 observed values in a group → missing values are drawn from a global left-tail Gaussian (µ = 1st-percentile of all observed intensities pooled; σ = MAD of the pooled distribution). This handles cases where a protein is nearly absent in one condition.
- All groups have ≤ 1 valid observation → the protein is flagged as not-imputable and removed from the analysis.
This hybrid approach minimizes artificial noise while retaining the MNAR-aware behavior needed for low-abundance proteins. The 3by3 function is defined in workflow/custom_imputation.R and is automatically sourced by the pipeline.
Note: Additional custom imputation methods can be added to
workflow/custom_imputation.Rby defining a function namedimpute_<method_name>()and then passing--imputation <method_name>at runtime.
Differential abundance analysis is performed using the limma R package (v3.64+). Protein intensity values are log₂-transformed and a linear model is fit per protein using the design matrix derived from the sample group assignments. Empirical Bayes moderation is applied to stabilize variance estimates across proteins, making limma well-suited for smaller sample sizes typical of proteomics experiments. Pairwise comparisons are defined in the samplesheet and evaluated using moderated t-tests. Proteins are considered significantly differentially abundant (DAPs) when they exhibit an absolute log₂ fold-change ≥ 0.58 (1.5×) and an FDR-adjusted P-value < 0.05 (Benjamini–Hochberg correction). Results feed directly into volcano plots, MA plots, heatmaps, and GSEA.
- The input samplesheet is parsed to generate comparisons definitions.
- Runs differential analysis over all comparisons specified using limma R package 3.64.3.
- Runs Gene Set Enrichment Analysis (Gene Ontology).
- Generates exploratory and differential analysis plots.
- Automatically builds an HTML report based on R markdown, with plots and tables.
Note:
- This pipeline is intended for a "first pass" analysis. For custom or complex analyses, please contact the Bioinformatics core and submit a Jira ticket
- Differential abundance results are considered significant if the Benjamini-Hochberg adjusted p-value (padj) is less than or equal to 0.05 and the absolute log2 fold change is greater than 0.58 (corresponding to an absolute fold change of 1.5).
renv is used to create an isolated environment. If this is your first time working with renv then please start an R session and run the following:
install.packages("renv")Then go ahead and close this repository and cd inside
git clone https://github.com/VCU-Bioinformatics-Core/proteomics_core_pipelines
cd proteomics_core_pipelinesFrom there, start another R session and run the following:
library(renv)
renv::restore()Two input files are required in specific formats: the Intensity Matrix and the Samplesheet.
Required format: Tab-Separated Values (.tsv)
This file contains the protein ids and raw merged protein levels. The columns must be organized as follows:
- PG.ProteinGroups: The first column must contain protein identifiers in the form of UniProt/SwissProt IDs.
- PG.Genes: The second column must contain protein identifiers in the form of gene symbols.
- Subsequent Columns: All subsequent columns should contain the raw count data for each sample. The
headervalues for these sample columns must match the sample identifiers (SampleID) used in the samplesheet. The pipeline expects integer counts, as is typical for RNA-seq data. While the script has the functionality to automatically parse and handle count data from various quantification tools, the users are still responsible for removing any additional columns besides the protein IDs and sample counts. It currently works best with the merged counts output from pipelines likenf-core's rnaseq Nextflow pipeline.
| PG.ProteinGroups | PG.Genes | sample1_r1 | sample1_r2 | sample1_r3 | sample2_r1 | sample2_r2 | sample2_r3 | sample3_r1 | sample3_r2 | sample3_r3 | sample4_r1 | sample4_r2 | sample4_r3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A0A087WV53 | SPEGNB | 1243.58 | 2310.42 | 3428.67 | 4592.11 | 5635.74 | 6772.45 | 2384.91 | 3499.26 | 4621.08 | 5790.34 | 3435.89 | 4522.73 |
| A0A0B4J2D5 | P0DPI2 | 1189.24 | 2285.33 | 3375.19 | 4478.62 | 5592.47 | 6721.53 | 2290.16 | 3420.58 | 4518.47 | 5659.82 | 3392.41 | 4461.05 |
| A0A0B4J2F0 | PIGBOS1 | 1322.88 | 2399.75 | 3541.62 | 4684.91 | 5743.56 | 6833.10 | 2415.77 | 3556.28 | 4652.94 | 5734.62 | 3522.84 | 4613.25 |
| A0A0C4DH30 | IGHV3-16 | 1278.52 | 2366.42 | 3478.13 | 4555.08 | 5692.75 | 6811.44 | 2362.27 | 3518.32 | 4588.69 | 5688.11 | 3451.95 | 4582.63 |
| A0A0U1RRE5 | NBDY | 1210.67 | 2333.91 | 3411.27 | 4518.45 | 5623.83 | 6745.92 | 2321.34 | 3465.28 | 4549.71 | 5679.53 | 3408.72 | 4528.84 |
Required format: Comma-Separated Values (.csv)
This file contains metadata for each sample, including group identifiers and binary indicators for specific comparisons. The columns must be organized as follows:
- SampleID: Sample identifiers that exactly match the sample column headers in the count matrix.
- GroupID: Group identifiers for each sample (e.g.,
experiment1,control1). These group identifiers are crucial for defining the experimental design in limma. - [comparison]: Subsequent columns define pairwise comparisons. The column name should follow the format
experiment_vs_control. For each comparison column, use1to indicate samples belonging to the experimental group,0for the control group, and leave the cellblankfor samples to be excluded from that specific comparison. This design matrix setup allows the user to specify which samples are used for each comparison, providing flexibility in complex experimental designs.
| SampleID | GroupID | experiment1_vs_control1 | experiment2_vs_control2 |
|---|---|---|---|
| sample1_r1 | experiment1 | 1 | |
| sample1_r2 | experiment1 | 1 | |
| sample1_r3 | experiment1 | 1 | |
| sample2_r1 | control1 | 0 | |
| sample2_r2 | control1 | 0 | |
| sample2_r3 | control1 | 0 | |
| sample3_r1 | experiment2 | 1 | |
| sample3_r2 | experiment2 | 1 | |
| sample3_r3 | experiment2 | 1 | |
| sample4_r1 | control2 | 0 | |
| sample4_r2 | control2 | 0 | |
| sample4_r3 | control2 | 0 |
The pipeline is executed using an R script. Example commands for running it on VCU HPRC (high performance research computing servers) are as follows:
module load R/4.4.1
Rscript de.regular.R \
--counts human_counts.tsv \
--samplesheet samplesheet.csv \
--outdir human_results \
--runid human_experiment \
--annotation humanmodule load R/4.4.1
Rscript de.regular.R \
--counts mouse_counts.tsv \
--samplesheet samplesheet.csv \
--outdir mouse_results \
--runid mouse_experiment \
--annotation mouse-c, --counts: Path to the merged counts file (Mandatory)-s, --samplesheet: Path to the sample sheet file (Mandatory)-o, --outdir: Output directory (default:./output)-r, --runid: Unique identifier for the analysis run (Mandatory)-a, --annotation: Genome to use for annotation:mouseorhuman(default:mouse)-i, --imputation: Imputation method (default:DEP-MinProb). DEP methods:DEP-MinProb,DEP-knn,DEP-bpca,DEP-QRILC,DEP-man. Custom methods defined inworkflow/custom_imputation.R(e.g.3by3). Usenoneto skip.-q, --imputation-q: Quantile cutoff for the left-censored distribution used byDEP-MinProbandDEP-QRILC(default:0.01)--seed: Random seed for reproducibility of stochastic imputation methods (default:42)--heatmap-norm: Heatmap row normalization —zscore(z-score normalize rows) ornone(raw intensity) (default:zscore)--heatmap-top-n: Number of top proteins by CV to show in the global heatmap (default:1000)--gsea-ont: Gene Ontology category for GSEA:BP,MF,CC, orALL(default:BP)--skip-gsea: Skip GSEA analysis for faster runs (flag, no value needed)--skip-anova: Skip one-way ANOVA for faster runs (flag, no value needed)--group-color1: Hex color for the up-regulated group in plots (default:#D55E00)--group-color2: Hex color for the down-regulated group in plots (default:#0072B2)
At present, the pipeline is executed manually through an interactive RStudio session. This workflow exists to ensure a controlled runtime environment (via renv) and will likely be automated in the future.
The current steps are:
-
Launch RStudio on
cardinal.som.edu. -
Open the proteomics_core_pipelines GitHub repository as the active RStudio project. This step is required to activate the project-specific renv environment and ensure all package versions are correctly resolved.
-
From within RStudio, open a terminal and navigate to the desired analysis directory.
-
Execute the pipeline using a wrapper script (example below).
-
Open the generated R Markdown file located in
<outdir>/*.Rmdfor inspection, debugging, or manual rendering.
The pipeline is currently launched via a shell wrapper that calls the R script with all required command-line arguments as show:
project_dir="<path-to-project-dir>"
Rscript <wrapper-script-path> \
--counts <counts-path> \
--samplesheet <samplesheet-path> \
--outdir <output-dir> \
--runid <arbitrary-id> \
--annotation <human or mouse>
This approach allows the analysis to be parameterized from the command line while still relying on RStudio to manage the execution environment. The wrapper script is responsible only for argument passing; all analysis logic remains within the R pipeline itself.
The pipeline generates an output directory (specified by --outdir) with the following structure:
[outDir]/
├── proteomics_analysis.html ← final HTML report (name controlled by --report-prefix)
├── data/
│ ├── de_data/
│ │ └── [comparison]_limma.csv
│ ├── gsea_data/
│ │ └── [comparison]_go_analysis_.csv
│ └── anova/
│ └── global_anova.csv
└── figures/
├── imputation/
│ ├── global_imputation_histogram.png
│ ├── global_imputation_distribution.png
│ ├── global_imputation_missing_per_sample.png
│ └── global_imputation_total_counts.png
├── pca/
│ └── global_pca_plot.png
├── heatmap/
│ ├── global_heatmap.png
│ └── [comparison]_heatmap.png
├── volcano/
│ └── [comparison]_volcano.png
├── ma/
│ └── [comparison]_ma.png
└── gsea/
└── [comparison]_gsea.png
Key output files:
[comparison]_limma.csv— full limma results (logFC, pvalue, padj, imputation_category per protein)[comparison]_go_analysis_.csv— GSEA Gene Ontology results (NES, pvalue, p.adjust, leading edge)global_anova.csv— one-way ANOVA results across all groups (generated when >2 groups)global_pca_plot.png— PCA of all samples colored by groupglobal_heatmap.png— top proteins by coefficient of variation across all samplesglobal_imputation_*.png— four diagnostic plots for the imputation step
- This pipeline currently supports only pairwise comparisons. Support for more complex designs with multiple comparisons with covariates and contrast matrices will be added in future versions. This is a limitation for experiments with more than two conditions.
The CSV Differential Abundance output from limma (available in the results directory provided alongside this report: de_data/[comparison_name]_limma.csv) can be uploaded directly into QIAGEN Ingenuity Pathway Analysis (IPA) for self-exploration of pathways predicted to be enriched by this experimental condition. Massey’s BISR provides access to VCU’s license of IPA. If you do not already have an account associated with this license, you may reach out to morecockcm@vcu.edu with your name, VCU Health or VCU email, and a request for IPA access. To perform a core abundance analysis, log in with your credentials at: https://analysis.ingenuity.com/pa
Prior to differential abundance analysis, missing intensity values were imputed using the DEP R package [1] or an in-house method developed by the VCU Proteomics Core. Missing values in proteomics data-dependent acquisition (DDA) experiments are commonly attributed to a missing-not-at-random (MNAR) mechanism, where low-abundance peptides fall below the instrument detection threshold. To address this, remaining missing values were imputed from a down-shifted Gaussian distribution centered at the low end of the observed intensity distribution.
Differential abundance analysis was carried out using limma v1.44.0 [2]. Proteins were considered significantly differentially abundant (DAP) when they exhibited an absolute log₂ fold-change ≥ 0.58 (1.5-fold) and an adjusted P-value (FDR) < 0.05 (Benjamini–Hochberg correction). Volcano plots and heatmaps were generated using the raw or log-transformed intensity values and visualized using R packages (ggplot2 and ggrepel), where the top up-regulated and down-regulated proteins (based on adjusted P-value) were highlighted. Heatmaps were produced using ComplexHeatmap after z-score transformation of the filtered expression matrix. Gene Set Enrichment Analysis (GSEA) [3] for Gene Ontology terms (GO) was performed using the clusterProfiler package [4] across all proteins, regardless of significance. All computational analyses were performed on VCU’s High Performance Research Computing cluster.
-
Tyanova S, Temu T, Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc. 2016;11(12):2301–2319. doi:10.1038/nprot.2016.136
-
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. doi:10.1093/nar/gkv007
-
Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545–15550. doi:10.1073/pnas.0506580102
-
Wu T, Hu E, Xu S, et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation. 2021;2(3):100141. doi:10.1016/j.xinn.2021.100141
Please include the following statements in your acknowledgements manuscript section:
-
“Services in support of the research project were provided by the VCU Massey Comprehensive Cancer Center Bioinformatics Shared Resource. Massey is supported, in part, with funding from NIH-NCI Cancer Center Support Grant P30 CA016059.”
-
“High Performance Computing resources provided by the High Performance Research Computing (HPRC) core facility at Virginia Commonwealth University (https://hprc.vcu.edu) were used for conducting the research reported in this work.”
- Implement functionality to perform analyses using multi-factor designs and covariates.
- Incorporate a KEGG pathway analysis module.
- Add support for gene annotation using Entrez IDS.
For questions or issues, please contact the BISR group at [mccbioinfo@vcu.edu] or open a GitHub issue in the repository.
Note: figure was developed in biorender at: link
The pipeline is written within the following scripts:
- helpers.R
- report_generator.regular.R
- de.regular.R
- report_generator.ptm.R
- de.ptm.R
To generate results, the two data files (intensity matrix and samplesheet) are passed to de.{ptype}.R.
de.regular.R serves as the orchestration layer for the differential abundance pipeline. It contains all logistics required to run the analysis end to end, including input handling, command-line argument parsing, data preparation, and invocation of the helper functions that perform statistical testing, visualization, result serialization, and report generation.
The script implements the full limma-based differential abundance workflow, starting from raw quantified protein intensities and proceeding through model fitting, downstream visualizations, PCA, RDS serialization, and automated HTML report creation. It is designed to be executed as a command-line script and to coordinate the interaction between analysis modules rather than reimplementing their internal logic.
All computational heavy lifting (limma modeling, plotting, enrichment analysis, and reporting) is delegated to sourced helper functions. As such, the primary responsibilities of this script are control flow, data plumbing, and enforcing the execution contract between inputs, analysis components, and outputs.
Overall execution flow
-
At a high level, the script executes in the following phases:
-
Library loading and environment setup
-
Command-line argument parsing (with optional debug overrides)
-
Annotation database initialization
-
Input data ingestion and cleaning
-
Comparison definition from the samplesheet
-
Limma design matrix construction
-
Per-comparison differential analysis loop
-
Global PCA computation and visualization
-
Serialization of all results into a single RDS
-
Automated report generation from the RDS
-
-
Each phase is intentionally linear to make failures easier to diagnose.
Command-line interface and runtime configuration
-
The script is designed to be executed via Rscript and uses optparse to define required and optional arguments:
-
Counts matrix (merged protein intensities)
-
Samplesheet defining experimental groups and comparisons
-
Output directory
-
Run identifier
-
Genome annotation (human or mouse)
-
A debug block is included to bypass CLI arguments during development. This block overrides all runtime inputs and should be disabled or removed in production to avoid accidental misuse.
-
Annotation and identifier mapping
-
Protein identifiers are assumed to be UniProt Swiss-Prot IDs, potentially concatenated with semicolons.
-
Key steps:
-
UniProt IDs are split, de-duplicated, and queried against Ensembl using biomaRt
-
A one-to-one UniProt → Ensembl mapping is enforced via distinct()
-
Ensembl gene IDs are merged back into the main protein-level table
-
Row names are set to protein accessions for downstream compatibility
-
-
Any change in identifier conventions upstream (e.g., gene symbols instead of UniProt) will require updates here and in all downstream plotting and reporting functions.
Data cleaning and preprocessing assumptions
-
Several important assumptions are embedded in the preprocessing logic:
-
Proteins with missing PG.Genes are dropped
-
Duplicate protein accessions are removed naïvely using distinct() (this is explicitly marked as a temporary and potentially lossy step)
-
Non-intensity metadata columns are manually excluded
-
Intensities are coerced to integers and log2-transformed with a pseudocount
-
-
These steps should be reviewed carefully if the upstream quantification format changes or if missingness patterns differ substantially.
Comparison definition logic
-
Comparisons are inferred from the samplesheet using the following convention:
-
Column 1–2: sample metadata (SampleID, GroupID)
-
Column 3 onward: binary comparison vectors
-
1 → experimental group
-
0 → control group
-
-
-
For each comparison column:
-
Experimental and control group labels are inferred dynamically
-
Only valid comparisons (both groups present) are retained
-
-
This makes the pipeline flexible but tightly couples it to the samplesheet schema.
Limma model construction
The limma setup is intentionally minimal:
-
A no-intercept design (~0 + condition) is used
-
One contrast is evaluated per comparison
-
Expression data are assumed to be already normalized and log-scaled
All limma-specific logic is delegated to helper functions (run_analysis(), perform_limma_analysis()), allowing this script to focus on orchestration rather than statistics.
Analysis loop and result collection
For each comparison:
-
run_analysis() is called with:
-
Comparison metadata
-
Shared limma parameters
-
Intensity matrix
-
Output directory structure
-
-
Results are appended only if non-null
-
Each comparison produces:
-
Limma tables
-
Volcano plots
-
Heatmaps
-
GSEA outputs (if applicable)
-
Failures in a single comparison do not halt the entire run unless explicitly thrown.
Principal Component Analysis (PCA)
PCA is computed once globally, across all samples:
-
Missing values are imputed using per-protein means (explicitly flagged in-code as a temporary and suboptimal solution)
-
Additional filtering removes columns with remaining non-finite values
-
PCA is limited to three components
-
Only PC1 vs PC2 is plotted and saved
The resulting pca_plot object is included in the final RDS and reused by the report generator to avoid recomputation.
Result serialization contract
All downstream reporting depends on the structure of the saved RDS:
list(
results,
comparisons,
out_dirs,
pca_plot
)This positional contract is relied upon verbatim by generate_report(). Any modification to the order or contents of this list must be reflected in the report generator, otherwise the report will break or silently misinterpret data.
Report generation
The final step calls generate_report() using the saved RDS as its sole input. At this point:
-
No analysis is recomputed
-
All figures are loaded from disk
-
All tables are reconstructed from serialized results
This separation ensures reproducibility and allows report generation to be rerun independently of the analysis.
Known technical debt and update hotspots
This script intentionally documents several weak points inline (marked with warnings), including:
-
Duplicate protein handling
-
PCA imputation strategy
-
Tight coupling to input file schemas
-
Hard-coded thresholds propagated across multiple components
When updating the pipeline, changes should be evaluated holistically across:
-
This orchestration script
-
Helper functions
-
Report generation logic
Treat this file as the control plane of the pipeline: it defines what runs, in what order, and with what assumptions.
helper.R::perform_limma_analysis() function encodes the differential abundance code with limma.
Purpose
- Fit a linear model for a specified experimental vs control contrast
- Apply empirical Bayes moderation
- Return a complete results table suitable for downstream filtering, plotting, and reporting
Inputs and assumptions
- limma_params is a structured list that must contain:
- E: an expression matrix (already normalized / voom-transformed)
- design: a design matrix with column names matching experimental conditions
- exp and ctrl must exactly match column names in limma_params$design
- Only a single contrast is constructed and evaluated per function call
Execution flow
- A contrast string (exp - ctrl) is constructed dynamically and passed to makeContrasts().
- A linear model is fit using lmFit() on the provided expression matrix.
- The contrast is applied using contrasts.fit().
- Empirical Bayes moderation is performed with eBayes().
- Results are extracted with topTable():
- All features are returned (number = Inf)
- Sorted by nominal p-value (sort.by = "P")
report_generator.{ptype}.R::generate_report() function encapsulates the full logic required to assemble and render the automated HTML report from a serialized analysis object. The function does not perform analysis; it strictly consumes precomputed results and formats them into a reproducible R Markdown document.
At a high level, the function performs four distinct tasks:
-
Input validation and setup
-
Deserialization of analysis outputs
-
Programmatic construction of an R Markdown document
-
Rendering (optional) and file management
Below is a breakdown of how these pieces fit together and where updates should be made.
- Inputs and assumptions
-
The function assumes that analysis_results_path points to an .rds file with a fixed positional structure:
-
[[1]] → results: a list of per-comparison analysis outputs (limma, GSEA, etc.)
-
[[2]] → comparisons: metadata defining comparison names and group labels
-
[[3]] → out_dirs: named list of output directories (volcano, heatmap, gsea, pca)
-
[[4]] → pca_plot: a precomputed PCA ggplot object
-
-
Any change to how the upstream pipeline saves this RDS (order, naming, or contents) must be mirrored here, otherwise the report will silently misbehave or fail.
- Output behavior
-
The function creates a single .Rmd file ({report_prefix}.Rmd) in output_dir.
-
The HTML filename is fixed to {report_prefix}.html (timestamping is currently disabled but preserved in comments).
-
Rendering via rmarkdown::render() is intentionally commented out. This allows:
-
Manual inspection/debugging of the generated Rmd
-
Deferred rendering on systems without Pandoc or a full R Markdown stack
-
- R Markdown generation strategy
-
The report is assembled as a set of character strings using glue() and appended incrementally:
-
Front matter + setup chunk. Defines document metadata, HTML options, and reloads the RDS inside the report environment to ensure self-containment.
-
Global sections. Overview, pipeline description, PCA visualization, and a cross-comparison summary table computed dynamically from results.
-
Per-comparison sections (looped) For each comparison:
-
Experimental design metadata
-
Volcano plot (loaded from file)
-
Heatmap (loaded from file)
-
Differential abundance tables (computed in-report)
-
GSEA plots and tables (if available)
-
All plots except PCA are included via knitr::include_graphics(). This deliberately decouples visualization generation from reporting and avoids recomputation.
-
-
- Thresholds and hard-coded logic
-
Several analytical thresholds are hard-coded in the reporting layer and must remain synchronized with the analysis pipeline:
-
Adjusted p-value cutoff: adj.P.Val < 0.05
-
Fold-change cutoff: |log2FC| ≥ 0.58 (1.5×)
-
GSEA presence is inferred by checking results[[i]]$gsea
-
This environment was created within
