Skip to content

VCU-Bioinformatics-Core/proteomics_core_pipelines

Repository files navigation

Proteomics Core Analyses

DAPRmd Logo

Overview

These pipelines were developed through a collaboration between the Bioinformatics and Proteomics Cores at VCU. Their primary goal is to streamline the analysis of data generated by the Proteomics Core. In addition, the pipelines are designed to be modular and fully reproducible, enabling broader adoption and adaptation by other research groups. The pipeline has been developed to handle the following study designs and protein data (see below).

Study Design LFQ PTM Analysis
1 × 1 NS NS
3 × 3 90% NS
3 × 3 × 3 NS NS
3 × 3 × 3 × 3 NS NS

NS means "Not started" Percentages describe about how much of the pipeline is complete

ANALYST READ THIS

1. The pipeline temporarily runs using the instructions under "Current Execution Workflow".
   Currently this is the most reliable way of making sure the correct environment is loaded, the analysis
   data is output, and the rmarkdown is rendered. This last step has been problematic due to a bug with
   knittr and knitting must be done manually with Rstudio.

2. Currently there are some difficulties using the renv setup. The current `.renv` file was generated
   using Cardinal (VCU server) and somehow, configurations on that server affected this file.

Table of Contents

Input Datasets and Proteome Sources

The overview briefly describe the analyses that are under development and, in this section, we will go more in-depth into the study designs and proteome sources that can be run with this code base.

Study Designs

The proteomics core generates multiple types of datasets, each with distinct study designs that determine the analytical pipeline we can apply. The diagram below gives an overview with follow-up verbal descriptions:

image

The available study designs include:

  • 1 × 1: two treatment groups, one sample per group
  • 3 × 3: two treatment groups, three samples per group
  • 3 × 3 × 3: three treatment groups, three samples per group
  • 3 × 3 × 3 × 3: four treatment groups, three samples per group

Given the limitations imposed by sample size, differential abundance analysis is performed only for designs with at least a 3 × 3 structure or larger. For the 1 × 1 design, we restrict the analysis to exploratory and pilot-level visualizations.

Protein Assays

The pipeline(s) developed here are focused on data generated by Label-Free Quantification (LFQ) proteomics and PTM (post-translational modification) proteomics. Both data types are generated using tandem mass spectrometry (MS/MS). The diagram below gives an overview with follow-up verbal descriptions:

image

Label-Free Quantification (LFQ)

Raw proteomic data are generated by the VCU Proteomics Core using tandem mass spectrometry (MS/MS). The data are processed through a standardized bioinformatics pipeline for peptide and protein identification and quantification.

Peptide identification is performed using the Sequest HT search algorithm implemented in Proteome Discoverer 3.0. MS/MS spectra are searched against the SwissProt protein database, supplemented with a contaminant database to identify and remove exogenous protein sequences.

Protein-level quantification is derived from constituent peptide measurements. Peptide intensities are aggregated to generate:

  • Total protein abundance

  • Number of unique peptides

  • Peptide sequence coverage

PTM Analysis

The PTM analysis workflow follows a similar MS/MS acquisition and database search strategy. However, due to inherent limitations in PTM peptide detection and protein inference:

  • Quantification is performed at the PTM peptide level

  • Not all PTM peptides can be confidently and uniquely assigned to a single parent protein

  • In some cases, PTM peptides may map to multiple protein isoforms or homologous proteins

Therefore, PTM analyses are conducted primarily at the site- or peptide-level, rather than at the aggregated protein level.

Imputation

Missing intensity values are common in data-dependent acquisition (DDA) proteomics experiments and are predominantly attributed to a missing-not-at-random (MNAR) mechanism — low-abundance peptides fall below the instrument's detection threshold rather than being truly absent. Imputation is performed before differential abundance analysis to preserve statistical power.

Available Methods

The imputation method is selected via the --imputation flag:

Method Description
DEP-MinProb (default) Draws imputed values from a down-shifted Gaussian distribution centered at the low end of the observed intensity distribution. The --imputation-q parameter (default 0.01) controls the quantile cutoff for the left-censored distribution.
DEP-QRILC Quantile Regression Imputation of Left-Censored data — fits a QRILC model per sample. Also accepts --imputation-q.
DEP-knn k-Nearest Neighbour imputation across proteins.
DEP-bpca Bayesian PCA-based imputation.
DEP-man Manual/global minimum imputation via the DEP package.
none Skips imputation entirely.
3by3 In-house method developed for VCU Proteomics Core experiments (see below).

The 3×3 Method (In-House)

The 3by3 method was developed specifically for the study designs most common at the VCU Proteomics Core, where each group has exactly 3 biological replicates. Standard per-column distribution estimates become unreliable at this sample size, so 3by3 uses a group-aware, globally-calibrated strategy:

  • ≥ 2 observed values in a group → missing values are imputed as the median of the observed values within that group. This preserves the within-group signal rather than pulling values artificially low.
  • < 2 observed values in a group → missing values are drawn from a global left-tail Gaussian (µ = 1st-percentile of all observed intensities pooled; σ = MAD of the pooled distribution). This handles cases where a protein is nearly absent in one condition.
  • All groups have ≤ 1 valid observation → the protein is flagged as not-imputable and removed from the analysis.

This hybrid approach minimizes artificial noise while retaining the MNAR-aware behavior needed for low-abundance proteins. The 3by3 function is defined in workflow/custom_imputation.R and is automatically sourced by the pipeline.

Note: Additional custom imputation methods can be added to workflow/custom_imputation.R by defining a function named impute_<method_name>() and then passing --imputation <method_name> at runtime.


Differential Abundance Analysis

Differential abundance analysis is performed using the limma R package (v3.64+). Protein intensity values are log₂-transformed and a linear model is fit per protein using the design matrix derived from the sample group assignments. Empirical Bayes moderation is applied to stabilize variance estimates across proteins, making limma well-suited for smaller sample sizes typical of proteomics experiments. Pairwise comparisons are defined in the samplesheet and evaluated using moderated t-tests. Proteins are considered significantly differentially abundant (DAPs) when they exhibit an absolute log₂ fold-change ≥ 0.58 (1.5×) and an FDR-adjusted P-value < 0.05 (Benjamini–Hochberg correction). Results feed directly into volcano plots, MA plots, heatmaps, and GSEA.

Pipeline

Proteomics Pipeline
  • The input samplesheet is parsed to generate comparisons definitions.
  • Runs differential analysis over all comparisons specified using limma R package 3.64.3.
  • Runs Gene Set Enrichment Analysis (Gene Ontology).
  • Generates exploratory and differential analysis plots.
  • Automatically builds an HTML report based on R markdown, with plots and tables.

Note:

  • This pipeline is intended for a "first pass" analysis. For custom or complex analyses, please contact the Bioinformatics core and submit a Jira ticket
  • Differential abundance results are considered significant if the Benjamini-Hochberg adjusted p-value (padj) is less than or equal to 0.05 and the absolute log2 fold change is greater than 0.58 (corresponding to an absolute fold change of 1.5).

Preparing your R Environment

renv is used to create an isolated environment. If this is your first time working with renv then please start an R session and run the following:

install.packages("renv")

Then go ahead and close this repository and cd inside

git clone https://github.com/VCU-Bioinformatics-Core/proteomics_core_pipelines
cd proteomics_core_pipelines

From there, start another R session and run the following:

library(renv)
renv::restore()

Usage

Preparing Your Data

Two input files are required in specific formats: the Intensity Matrix and the Samplesheet.

1. Intensity Matrix

Required format: Tab-Separated Values (.tsv) This file contains the protein ids and raw merged protein levels. The columns must be organized as follows:

  • PG.ProteinGroups: The first column must contain protein identifiers in the form of UniProt/SwissProt IDs.
  • PG.Genes: The second column must contain protein identifiers in the form of gene symbols.
  • Subsequent Columns: All subsequent columns should contain the raw count data for each sample. The header values for these sample columns must match the sample identifiers (SampleID) used in the samplesheet. The pipeline expects integer counts, as is typical for RNA-seq data. While the script has the functionality to automatically parse and handle count data from various quantification tools, the users are still responsible for removing any additional columns besides the protein IDs and sample counts. It currently works best with the merged counts output from pipelines like nf-core's rnaseq Nextflow pipeline.
PG.ProteinGroups PG.Genes sample1_r1 sample1_r2 sample1_r3 sample2_r1 sample2_r2 sample2_r3 sample3_r1 sample3_r2 sample3_r3 sample4_r1 sample4_r2 sample4_r3
A0A087WV53 SPEGNB 1243.58 2310.42 3428.67 4592.11 5635.74 6772.45 2384.91 3499.26 4621.08 5790.34 3435.89 4522.73
A0A0B4J2D5 P0DPI2 1189.24 2285.33 3375.19 4478.62 5592.47 6721.53 2290.16 3420.58 4518.47 5659.82 3392.41 4461.05
A0A0B4J2F0 PIGBOS1 1322.88 2399.75 3541.62 4684.91 5743.56 6833.10 2415.77 3556.28 4652.94 5734.62 3522.84 4613.25
A0A0C4DH30 IGHV3-16 1278.52 2366.42 3478.13 4555.08 5692.75 6811.44 2362.27 3518.32 4588.69 5688.11 3451.95 4582.63
A0A0U1RRE5 NBDY 1210.67 2333.91 3411.27 4518.45 5623.83 6745.92 2321.34 3465.28 4549.71 5679.53 3408.72 4528.84
2. Samplesheet:

Required format: Comma-Separated Values (.csv) This file contains metadata for each sample, including group identifiers and binary indicators for specific comparisons. The columns must be organized as follows:

  • SampleID: Sample identifiers that exactly match the sample column headers in the count matrix.
  • GroupID: Group identifiers for each sample (e.g., experiment1, control1). These group identifiers are crucial for defining the experimental design in limma.
  • [comparison]: Subsequent columns define pairwise comparisons. The column name should follow the format experiment_vs_control. For each comparison column, use 1 to indicate samples belonging to the experimental group, 0 for the control group, and leave the cell blank for samples to be excluded from that specific comparison. This design matrix setup allows the user to specify which samples are used for each comparison, providing flexibility in complex experimental designs.
SampleID GroupID experiment1_vs_control1 experiment2_vs_control2
sample1_r1 experiment1 1
sample1_r2 experiment1 1
sample1_r3 experiment1 1
sample2_r1 control1 0
sample2_r2 control1 0
sample2_r3 control1 0
sample3_r1 experiment2 1
sample3_r2 experiment2 1
sample3_r3 experiment2 1
sample4_r1 control2 0
sample4_r2 control2 0
sample4_r3 control2 0

Running the Script

The pipeline is executed using an R script. Example commands for running it on VCU HPRC (high performance research computing servers) are as follows:

Human Analysis
module load R/4.4.1

Rscript de.regular.R \
--counts human_counts.tsv \
--samplesheet samplesheet.csv \
--outdir human_results \
--runid human_experiment \
--annotation human
Mouse Analysis
module load R/4.4.1

Rscript de.regular.R \
--counts mouse_counts.tsv \
--samplesheet samplesheet.csv \
--outdir mouse_results \
--runid mouse_experiment \
--annotation mouse
Arguments
  • -c, --counts: Path to the merged counts file (Mandatory)
  • -s, --samplesheet: Path to the sample sheet file (Mandatory)
  • -o, --outdir: Output directory (default: ./output)
  • -r, --runid: Unique identifier for the analysis run (Mandatory)
  • -a, --annotation: Genome to use for annotation: mouse or human (default: mouse)
  • -i, --imputation: Imputation method (default: DEP-MinProb). DEP methods: DEP-MinProb, DEP-knn, DEP-bpca, DEP-QRILC, DEP-man. Custom methods defined in workflow/custom_imputation.R (e.g. 3by3). Use none to skip.
  • -q, --imputation-q: Quantile cutoff for the left-censored distribution used by DEP-MinProb and DEP-QRILC (default: 0.01)
  • --seed: Random seed for reproducibility of stochastic imputation methods (default: 42)
  • --heatmap-norm: Heatmap row normalization — zscore (z-score normalize rows) or none (raw intensity) (default: zscore)
  • --heatmap-top-n: Number of top proteins by CV to show in the global heatmap (default: 1000)
  • --gsea-ont: Gene Ontology category for GSEA: BP, MF, CC, or ALL (default: BP)
  • --skip-gsea: Skip GSEA analysis for faster runs (flag, no value needed)
  • --skip-anova: Skip one-way ANOVA for faster runs (flag, no value needed)
  • --group-color1: Hex color for the up-regulated group in plots (default: #D55E00)
  • --group-color2: Hex color for the down-regulated group in plots (default: #0072B2)

Current execution workflow (temporary)

At present, the pipeline is executed manually through an interactive RStudio session. This workflow exists to ensure a controlled runtime environment (via renv) and will likely be automated in the future.

The current steps are:

  1. Launch RStudio on cardinal.som.edu.

  2. Open the proteomics_core_pipelines GitHub repository as the active RStudio project. This step is required to activate the project-specific renv environment and ensure all package versions are correctly resolved.

  3. From within RStudio, open a terminal and navigate to the desired analysis directory.

  4. Execute the pipeline using a wrapper script (example below).

  5. Open the generated R Markdown file located in <outdir>/*.Rmd for inspection, debugging, or manual rendering.

Wrapper script invocation

The pipeline is currently launched via a shell wrapper that calls the R script with all required command-line arguments as show:

project_dir="<path-to-project-dir>"
Rscript <wrapper-script-path> \
  --counts <counts-path> \
  --samplesheet <samplesheet-path> \
  --outdir <output-dir> \
  --runid <arbitrary-id> \
  --annotation <human or mouse>

This approach allows the analysis to be parameterized from the command line while still relying on RStudio to manage the execution environment. The wrapper script is responsible only for argument passing; all analysis logic remains within the R pipeline itself.

Pipeline Output

The pipeline generates an output directory (specified by --outdir) with the following structure:

[outDir]/
├── proteomics_analysis.html        ← final HTML report (name controlled by --report-prefix)
├── data/
│   ├── de_data/
│   │   └── [comparison]_limma.csv
│   ├── gsea_data/
│   │   └── [comparison]_go_analysis_.csv
│   └── anova/
│       └── global_anova.csv
└── figures/
    ├── imputation/
    │   ├── global_imputation_histogram.png
    │   ├── global_imputation_distribution.png
    │   ├── global_imputation_missing_per_sample.png
    │   └── global_imputation_total_counts.png
    ├── pca/
    │   └── global_pca_plot.png
    ├── heatmap/
    │   ├── global_heatmap.png
    │   └── [comparison]_heatmap.png
    ├── volcano/
    │   └── [comparison]_volcano.png
    ├── ma/
    │   └── [comparison]_ma.png
    └── gsea/
        └── [comparison]_gsea.png

Key output files:

  • [comparison]_limma.csv — full limma results (logFC, pvalue, padj, imputation_category per protein)
  • [comparison]_go_analysis_.csv — GSEA Gene Ontology results (NES, pvalue, p.adjust, leading edge)
  • global_anova.csv — one-way ANOVA results across all groups (generated when >2 groups)
  • global_pca_plot.png — PCA of all samples colored by group
  • global_heatmap.png — top proteins by coefficient of variation across all samples
  • global_imputation_*.png — four diagnostic plots for the imputation step

Limitations

  • This pipeline currently supports only pairwise comparisons. Support for more complex designs with multiple comparisons with covariates and contrast matrices will be added in future versions. This is a limitation for experiments with more than two conditions.

Utilizing IPA

The CSV Differential Abundance output from limma (available in the results directory provided alongside this report: de_data/[comparison_name]_limma.csv) can be uploaded directly into QIAGEN Ingenuity Pathway Analysis (IPA) for self-exploration of pathways predicted to be enriched by this experimental condition. Massey’s BISR provides access to VCU’s license of IPA. If you do not already have an account associated with this license, you may reach out to morecockcm@vcu.edu with your name, VCU Health or VCU email, and a request for IPA access. To perform a core abundance analysis, log in with your credentials at: https://analysis.ingenuity.com/pa

Manuscript-Ready Text

Methods

Prior to differential abundance analysis, missing intensity values were imputed using the DEP R package [1] or an in-house method developed by the VCU Proteomics Core. Missing values in proteomics data-dependent acquisition (DDA) experiments are commonly attributed to a missing-not-at-random (MNAR) mechanism, where low-abundance peptides fall below the instrument detection threshold. To address this, remaining missing values were imputed from a down-shifted Gaussian distribution centered at the low end of the observed intensity distribution.

Differential abundance analysis was carried out using limma v1.44.0 [2]. Proteins were considered significantly differentially abundant (DAP) when they exhibited an absolute log₂ fold-change ≥ 0.58 (1.5-fold) and an adjusted P-value (FDR) < 0.05 (Benjamini–Hochberg correction). Volcano plots and heatmaps were generated using the raw or log-transformed intensity values and visualized using R packages (ggplot2 and ggrepel), where the top up-regulated and down-regulated proteins (based on adjusted P-value) were highlighted. Heatmaps were produced using ComplexHeatmap after z-score transformation of the filtered expression matrix. Gene Set Enrichment Analysis (GSEA) [3] for Gene Ontology terms (GO) was performed using the clusterProfiler package [4] across all proteins, regardless of significance. All computational analyses were performed on VCU’s High Performance Research Computing cluster.

References

  1. Tyanova S, Temu T, Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc. 2016;11(12):2301–2319. doi:10.1038/nprot.2016.136

  2. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. doi:10.1093/nar/gkv007

  3. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545–15550. doi:10.1073/pnas.0506580102

  4. Wu T, Hu E, Xu S, et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation. 2021;2(3):100141. doi:10.1016/j.xinn.2021.100141

Required Acknowledgements

Please include the following statements in your acknowledgements manuscript section:

  • “Services in support of the research project were provided by the VCU Massey Comprehensive Cancer Center Bioinformatics Shared Resource. Massey is supported, in part, with funding from NIH-NCI Cancer Center Support Grant P30 CA016059.”

  • “High Performance Computing resources provided by the High Performance Research Computing (HPRC) core facility at Virginia Commonwealth University (https://hprc.vcu.edu) were used for conducting the research reported in this work.”

Future Improvements

  • Implement functionality to perform analyses using multi-factor designs and covariates.
  • Incorporate a KEGG pathway analysis module.
  • Add support for gene annotation using Entrez IDS.

Contact

For questions or issues, please contact the BISR group at [mccbioinfo@vcu.edu] or open a GitHub issue in the repository.

License

GPL-3.0 license

Developmental Notes

Code Files and Pipeline Execution Overview

LFQ Software Design

Note: figure was developed in biorender at: link

The pipeline is written within the following scripts:

  • helpers.R
  • report_generator.regular.R
  • de.regular.R
  • report_generator.ptm.R
  • de.ptm.R

To generate results, the two data files (intensity matrix and samplesheet) are passed to de.{ptype}.R.

Update the pipeline code

de.regular.R serves as the orchestration layer for the differential abundance pipeline. It contains all logistics required to run the analysis end to end, including input handling, command-line argument parsing, data preparation, and invocation of the helper functions that perform statistical testing, visualization, result serialization, and report generation.

The script implements the full limma-based differential abundance workflow, starting from raw quantified protein intensities and proceeding through model fitting, downstream visualizations, PCA, RDS serialization, and automated HTML report creation. It is designed to be executed as a command-line script and to coordinate the interaction between analysis modules rather than reimplementing their internal logic.

All computational heavy lifting (limma modeling, plotting, enrichment analysis, and reporting) is delegated to sourced helper functions. As such, the primary responsibilities of this script are control flow, data plumbing, and enforcing the execution contract between inputs, analysis components, and outputs.

Overall execution flow

  • At a high level, the script executes in the following phases:

    1. Library loading and environment setup

    2. Command-line argument parsing (with optional debug overrides)

    3. Annotation database initialization

    4. Input data ingestion and cleaning

    5. Comparison definition from the samplesheet

    6. Limma design matrix construction

    7. Per-comparison differential analysis loop

    8. Global PCA computation and visualization

    9. Serialization of all results into a single RDS

    10. Automated report generation from the RDS

  • Each phase is intentionally linear to make failures easier to diagnose.

Command-line interface and runtime configuration

  • The script is designed to be executed via Rscript and uses optparse to define required and optional arguments:

    • Counts matrix (merged protein intensities)

    • Samplesheet defining experimental groups and comparisons

    • Output directory

    • Run identifier

    • Genome annotation (human or mouse)

    • A debug block is included to bypass CLI arguments during development. This block overrides all runtime inputs and should be disabled or removed in production to avoid accidental misuse.

Annotation and identifier mapping

  • Protein identifiers are assumed to be UniProt Swiss-Prot IDs, potentially concatenated with semicolons.

  • Key steps:

    • UniProt IDs are split, de-duplicated, and queried against Ensembl using biomaRt

    • A one-to-one UniProt → Ensembl mapping is enforced via distinct()

    • Ensembl gene IDs are merged back into the main protein-level table

    • Row names are set to protein accessions for downstream compatibility

  • Any change in identifier conventions upstream (e.g., gene symbols instead of UniProt) will require updates here and in all downstream plotting and reporting functions.

Data cleaning and preprocessing assumptions

  • Several important assumptions are embedded in the preprocessing logic:

    • Proteins with missing PG.Genes are dropped

    • Duplicate protein accessions are removed naïvely using distinct() (this is explicitly marked as a temporary and potentially lossy step)

    • Non-intensity metadata columns are manually excluded

    • Intensities are coerced to integers and log2-transformed with a pseudocount

  • These steps should be reviewed carefully if the upstream quantification format changes or if missingness patterns differ substantially.

Comparison definition logic

  • Comparisons are inferred from the samplesheet using the following convention:

    • Column 1–2: sample metadata (SampleID, GroupID)

    • Column 3 onward: binary comparison vectors

      • 1 → experimental group

      • 0 → control group

  • For each comparison column:

    • Experimental and control group labels are inferred dynamically

    • Only valid comparisons (both groups present) are retained

  • This makes the pipeline flexible but tightly couples it to the samplesheet schema.

Limma model construction

The limma setup is intentionally minimal:

  • A no-intercept design (~0 + condition) is used

  • One contrast is evaluated per comparison

  • Expression data are assumed to be already normalized and log-scaled

All limma-specific logic is delegated to helper functions (run_analysis(), perform_limma_analysis()), allowing this script to focus on orchestration rather than statistics.

Analysis loop and result collection

For each comparison:

  • run_analysis() is called with:

    • Comparison metadata

    • Shared limma parameters

    • Intensity matrix

    • Output directory structure

  • Results are appended only if non-null

  • Each comparison produces:

    • Limma tables

    • Volcano plots

    • Heatmaps

    • GSEA outputs (if applicable)

Failures in a single comparison do not halt the entire run unless explicitly thrown.

Principal Component Analysis (PCA)

PCA is computed once globally, across all samples:

  • Missing values are imputed using per-protein means (explicitly flagged in-code as a temporary and suboptimal solution)

  • Additional filtering removes columns with remaining non-finite values

  • PCA is limited to three components

  • Only PC1 vs PC2 is plotted and saved

The resulting pca_plot object is included in the final RDS and reused by the report generator to avoid recomputation.

Result serialization contract

All downstream reporting depends on the structure of the saved RDS:

list(
  results,
  comparisons,
  out_dirs,
  pca_plot
)

This positional contract is relied upon verbatim by generate_report(). Any modification to the order or contents of this list must be reflected in the report generator, otherwise the report will break or silently misinterpret data.

Report generation

The final step calls generate_report() using the saved RDS as its sole input. At this point:

  • No analysis is recomputed

  • All figures are loaded from disk

  • All tables are reconstructed from serialized results

This separation ensures reproducibility and allows report generation to be rerun independently of the analysis.

Known technical debt and update hotspots

This script intentionally documents several weak points inline (marked with warnings), including:

  • Duplicate protein handling

  • PCA imputation strategy

  • Tight coupling to input file schemas

  • Hard-coded thresholds propagated across multiple components

When updating the pipeline, changes should be evaluated holistically across:

  • This orchestration script

  • Helper functions

  • Report generation logic

Treat this file as the control plane of the pipeline: it defines what runs, in what order, and with what assumptions.

Updating the limma code

helper.R::perform_limma_analysis() function encodes the differential abundance code with limma.

Purpose

  • Fit a linear model for a specified experimental vs control contrast
  • Apply empirical Bayes moderation
  • Return a complete results table suitable for downstream filtering, plotting, and reporting

Inputs and assumptions

  • limma_params is a structured list that must contain:
    • E: an expression matrix (already normalized / voom-transformed)
    • design: a design matrix with column names matching experimental conditions
    • exp and ctrl must exactly match column names in limma_params$design
    • Only a single contrast is constructed and evaluated per function call

Execution flow

  • A contrast string (exp - ctrl) is constructed dynamically and passed to makeContrasts().
  • A linear model is fit using lmFit() on the provided expression matrix.
  • The contrast is applied using contrasts.fit().
  • Empirical Bayes moderation is performed with eBayes().
  • Results are extracted with topTable():
    • All features are returned (number = Inf)
    • Sorted by nominal p-value (sort.by = "P")

Updating the reporter

report_generator.{ptype}.R::generate_report() function encapsulates the full logic required to assemble and render the automated HTML report from a serialized analysis object. The function does not perform analysis; it strictly consumes precomputed results and formats them into a reproducible R Markdown document.

At a high level, the function performs four distinct tasks:

  1. Input validation and setup

  2. Deserialization of analysis outputs

  3. Programmatic construction of an R Markdown document

  4. Rendering (optional) and file management

Below is a breakdown of how these pieces fit together and where updates should be made.

  1. Inputs and assumptions
  • The function assumes that analysis_results_path points to an .rds file with a fixed positional structure:

    • [[1]] → results: a list of per-comparison analysis outputs (limma, GSEA, etc.)

    • [[2]] → comparisons: metadata defining comparison names and group labels

    • [[3]] → out_dirs: named list of output directories (volcano, heatmap, gsea, pca)

    • [[4]] → pca_plot: a precomputed PCA ggplot object

  • Any change to how the upstream pipeline saves this RDS (order, naming, or contents) must be mirrored here, otherwise the report will silently misbehave or fail.

  1. Output behavior
  • The function creates a single .Rmd file ({report_prefix}.Rmd) in output_dir.

  • The HTML filename is fixed to {report_prefix}.html (timestamping is currently disabled but preserved in comments).

  • Rendering via rmarkdown::render() is intentionally commented out. This allows:

    • Manual inspection/debugging of the generated Rmd

    • Deferred rendering on systems without Pandoc or a full R Markdown stack

  1. R Markdown generation strategy
  • The report is assembled as a set of character strings using glue() and appended incrementally:

    • Front matter + setup chunk. Defines document metadata, HTML options, and reloads the RDS inside the report environment to ensure self-containment.

    • Global sections. Overview, pipeline description, PCA visualization, and a cross-comparison summary table computed dynamically from results.

    • Per-comparison sections (looped) For each comparison:

      • Experimental design metadata

      • Volcano plot (loaded from file)

      • Heatmap (loaded from file)

      • Differential abundance tables (computed in-report)

      • GSEA plots and tables (if available)

      • All plots except PCA are included via knitr::include_graphics(). This deliberately decouples visualization generation from reporting and avoids recomputation.

  1. Thresholds and hard-coded logic
  • Several analytical thresholds are hard-coded in the reporting layer and must remain synchronized with the analysis pipeline:

    • Adjusted p-value cutoff: adj.P.Val < 0.05

    • Fold-change cutoff: |log2FC| ≥ 0.58 (1.5×)

    • GSEA presence is inferred by checking results[[i]]$gsea

Issues with the renv environment

This environment was created within

About

Pipelines that we are building in collaboration with the proteomics core

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors