Proteomics Core Analyses

Overview

These pipelines were developed through a collaboration between the Bioinformatics and Proteomics Cores at VCU. Their primary goal is to streamline the analysis of data generated by the Proteomics Core. In addition, the pipelines are designed to be modular and fully reproducible, enabling broader adoption and adaptation by other research groups. The pipeline has been developed to handle the following study designs and protein data (see below).

Study Design	LFQ	PTM Analysis
1 × 1	NS	NS
3 × 3	90%	NS
3 × 3 × 3	NS	NS
3 × 3 × 3 × 3	NS	NS

NS means "Not started" Percentages describe about how much of the pipeline is complete

ANALYST READ THIS

1. The pipeline temporarily runs using the instructions under "Current Execution Workflow".
   Currently this is the most reliable way of making sure the correct environment is loaded, the analysis
   data is output, and the rmarkdown is rendered. This last step has been problematic due to a bug with
   knittr and knitting must be done manually with Rstudio.

2. Currently there are some difficulties using the renv setup. The current `.renv` file was generated
   using Cardinal (VCU server) and somehow, configurations on that server affected this file.

Input Datasets and Proteome Sources

The overview briefly describe the analyses that are under development and, in this section, we will go more in-depth into the study designs and proteome sources that can be run with this code base.

Study Designs

The proteomics core generates multiple types of datasets, each with distinct study designs that determine the analytical pipeline we can apply. The diagram below gives an overview with follow-up verbal descriptions:

The available study designs include:

1 × 1: two treatment groups, one sample per group
3 × 3: two treatment groups, three samples per group
3 × 3 × 3: three treatment groups, three samples per group
3 × 3 × 3 × 3: four treatment groups, three samples per group

Given the limitations imposed by sample size, differential abundance analysis is performed only for designs with at least a 3 × 3 structure or larger. For the 1 × 1 design, we restrict the analysis to exploratory and pilot-level visualizations.

Protein Assays

The pipeline(s) developed here are focused on data generated by Label-Free Quantification (LFQ) proteomics and PTM (post-translational modification) proteomics. Both data types are generated using tandem mass spectrometry (MS/MS). The diagram below gives an overview with follow-up verbal descriptions:

Label-Free Quantification (LFQ)

Raw proteomic data are generated by the VCU Proteomics Core using tandem mass spectrometry (MS/MS). The data are processed through a standardized bioinformatics pipeline for peptide and protein identification and quantification.

Peptide identification is performed using the Sequest HT search algorithm implemented in Proteome Discoverer 3.0. MS/MS spectra are searched against the SwissProt protein database, supplemented with a contaminant database to identify and remove exogenous protein sequences.

Protein-level quantification is derived from constituent peptide measurements. Peptide intensities are aggregated to generate:

Total protein abundance
Number of unique peptides
Peptide sequence coverage

PTM Analysis

The PTM analysis workflow follows a similar MS/MS acquisition and database search strategy. However, due to inherent limitations in PTM peptide detection and protein inference:

Quantification is performed at the PTM peptide level
Not all PTM peptides can be confidently and uniquely assigned to a single parent protein
In some cases, PTM peptides may map to multiple protein isoforms or homologous proteins

Therefore, PTM analyses are conducted primarily at the site- or peptide-level, rather than at the aggregated protein level.

Imputation

Missing intensity values are common in data-dependent acquisition (DDA) proteomics experiments and are predominantly attributed to a missing-not-at-random (MNAR) mechanism — low-abundance peptides fall below the instrument's detection threshold rather than being truly absent. Imputation is performed before differential abundance analysis to preserve statistical power.

Available Methods

The imputation method is selected via the --imputation flag:

Method	Description
`DEP-MinProb` (default)	Draws imputed values from a down-shifted Gaussian distribution centered at the low end of the observed intensity distribution. The `--imputation-q` parameter (default 0.01) controls the quantile cutoff for the left-censored distribution.
`DEP-QRILC`	Quantile Regression Imputation of Left-Censored data — fits a QRILC model per sample. Also accepts `--imputation-q`.
`DEP-knn`	k-Nearest Neighbour imputation across proteins.
`DEP-bpca`	Bayesian PCA-based imputation.
`DEP-man`	Manual/global minimum imputation via the DEP package.
`none`	Skips imputation entirely.
`3by3`	In-house method developed for VCU Proteomics Core experiments (see below).

The 3×3 Method (In-House)

The 3by3 method was developed specifically for the study designs most common at the VCU Proteomics Core, where each group has exactly 3 biological replicates. Standard per-column distribution estimates become unreliable at this sample size, so 3by3 uses a group-aware, globally-calibrated strategy:

≥ 2 observed values in a group → missing values are imputed as the median of the observed values within that group. This preserves the within-group signal rather than pulling values artificially low.
< 2 observed values in a group → missing values are drawn from a global left-tail Gaussian (µ = 1st-percentile of all observed intensities pooled; σ = MAD of the pooled distribution). This handles cases where a protein is nearly absent in one condition.
All groups have ≤ 1 valid observation → the protein is flagged as not-imputable and removed from the analysis.

This hybrid approach minimizes artificial noise while retaining the MNAR-aware behavior needed for low-abundance proteins. The 3by3 function is defined in workflow/custom_imputation.R and is automatically sourced by the pipeline.

Note: Additional custom imputation methods can be added to workflow/custom_imputation.R by defining a function named impute_<method_name>() and then passing --imputation <method_name> at runtime.

Differential Abundance Analysis

Differential abundance analysis is performed using the limma R package (v3.64+). Protein intensity values are log₂-transformed and a linear model is fit per protein using the design matrix derived from the sample group assignments. Empirical Bayes moderation is applied to stabilize variance estimates across proteins, making limma well-suited for smaller sample sizes typical of proteomics experiments. Pairwise comparisons are defined in the samplesheet and evaluated using moderated t-tests. Proteins are considered significantly differentially abundant (DAPs) when they exhibit an absolute log₂ fold-change ≥ 0.58 (1.5×) and an FDR-adjusted P-value < 0.05 (Benjamini–Hochberg correction). Results feed directly into volcano plots, MA plots, heatmaps, and GSEA.

Pipeline

The input samplesheet is parsed to generate comparisons definitions.
Runs differential analysis over all comparisons specified using limma R package 3.64.3.
Runs Gene Set Enrichment Analysis (Gene Ontology).
Generates exploratory and differential analysis plots.
Automatically builds an HTML report based on R markdown, with plots and tables.

Note:

This pipeline is intended for a "first pass" analysis. For custom or complex analyses, please contact the Bioinformatics core and submit a Jira ticket
Differential abundance results are considered significant if the Benjamini-Hochberg adjusted p-value (padj) is less than or equal to 0.05 and the absolute log2 fold change is greater than 0.58 (corresponding to an absolute fold change of 1.5).

Preparing your R Environment

renv is used to create an isolated environment. If this is your first time working with renv then please start an R session and run the following:

install.packages("renv")

Then go ahead and close this repository and cd inside

git clone https://github.com/VCU-Bioinformatics-Core/proteomics_core_pipelines
cd proteomics_core_pipelines

From there, start another R session and run the following:

library(renv)
renv::restore()

Usage

Preparing Your Data

Two input files are required in specific formats: the Intensity Matrix and the Samplesheet.

1. Intensity Matrix

Required format: Tab-Separated Values (.tsv) This file contains the protein ids and raw merged protein levels. The columns must be organized as follows:

PG.ProteinGroups: The first column must contain protein identifiers in the form of UniProt/SwissProt IDs.
PG.Genes: The second column must contain protein identifiers in the form of gene symbols.
Subsequent Columns: All subsequent columns should contain the raw count data for each sample. The header values for these sample columns must match the sample identifiers (SampleID) used in the samplesheet. The pipeline expects integer counts, as is typical for RNA-seq data. While the script has the functionality to automatically parse and handle count data from various quantification tools, the users are still responsible for removing any additional columns besides the protein IDs and sample counts. It currently works best with the merged counts output from pipelines like nf-core's rnaseq Nextflow pipeline.

PG.ProteinGroups	PG.Genes	sample1_r1	sample1_r2	sample1_r3	sample2_r1	sample2_r2	sample2_r3	sample3_r1	sample3_r2	sample3_r3	sample4_r1	sample4_r2	sample4_r3
A0A087WV53	SPEGNB	1243.58	2310.42	3428.67	4592.11	5635.74	6772.45	2384.91	3499.26	4621.08	5790.34	3435.89	4522.73
A0A0B4J2D5	P0DPI2	1189.24	2285.33	3375.19	4478.62	5592.47	6721.53	2290.16	3420.58	4518.47	5659.82	3392.41	4461.05
A0A0B4J2F0	PIGBOS1	1322.88	2399.75	3541.62	4684.91	5743.56	6833.10	2415.77	3556.28	4652.94	5734.62	3522.84	4613.25
A0A0C4DH30	IGHV3-16	1278.52	2366.42	3478.13	4555.08	5692.75	6811.44	2362.27	3518.32	4588.69	5688.11	3451.95	4582.63
A0A0U1RRE5	NBDY	1210.67	2333.91	3411.27	4518.45	5623.83	6745.92	2321.34	3465.28	4549.71	5679.53	3408.72	4528.84

2. Samplesheet:

Required format: Comma-Separated Values (.csv) This file contains metadata for each sample, including group identifiers and binary indicators for specific comparisons. The columns must be organized as follows:

SampleID: Sample identifiers that exactly match the sample column headers in the count matrix.
GroupID: Group identifiers for each sample (e.g., experiment1, control1). These group identifiers are crucial for defining the experimental design in limma.
[comparison]: Subsequent columns define pairwise comparisons. The column name should follow the format experiment_vs_control. For each comparison column, use 1 to indicate samples belonging to the experimental group, 0 for the control group, and leave the cell blank for samples to be excluded from that specific comparison. This design matrix setup allows the user to specify which samples are used for each comparison, providing flexibility in complex experimental designs.

SampleID	GroupID	experiment1_vs_control1	experiment2_vs_control2
sample1_r1	experiment1	1
sample1_r2	experiment1	1
sample1_r3	experiment1	1
sample2_r1	control1	0
sample2_r2	control1	0
sample2_r3	control1	0
sample3_r1	experiment2		1
sample3_r2	experiment2		1
sample3_r3	experiment2		1
sample4_r1	control2		0
sample4_r2	control2		0
sample4_r3	control2		0

Running the Script

The pipeline is executed using an R script. Example commands for running it on VCU HPRC (high performance research computing servers) are as follows:

Human Analysis

module load R/4.4.1

Rscript de.regular.R \
--counts human_counts.tsv \
--samplesheet samplesheet.csv \
--outdir human_results \
--runid human_experiment \
--annotation human

Mouse Analysis

module load R/4.4.1

Rscript de.regular.R \
--counts mouse_counts.tsv \
--samplesheet samplesheet.csv \
--outdir mouse_results \
--runid mouse_experiment \
--annotation mouse

Arguments

-c, --counts: Path to the merged counts file (Mandatory)
-s, --samplesheet: Path to the sample sheet file (Mandatory)
-o, --outdir: Output directory (default: ./output)
-r, --runid: Unique identifier for the analysis run (Mandatory)
-a, --annotation: Genome to use for annotation: mouse or human (default: mouse)
-i, --imputation: Imputation method (default: DEP-MinProb). DEP methods: DEP-MinProb, DEP-knn, DEP-bpca, DEP-QRILC, DEP-man. Custom methods defined in workflow/custom_imputation.R (e.g. 3by3). Use none to skip.
-q, --imputation-q: Quantile cutoff for the left-censored distribution used by DEP-MinProb and DEP-QRILC (default: 0.01)
--seed: Random seed for reproducibility of stochastic imputation methods (default: 42)
--heatmap-norm: Heatmap row normalization — zscore (z-score normalize rows) or none (raw intensity) (default: zscore)
--heatmap-top-n: Number of top proteins by CV to show in the global heatmap (default: 1000)
--gsea-ont: Gene Ontology category for GSEA: BP, MF, CC, or ALL (default: BP)
--skip-gsea: Skip GSEA analysis for faster runs (flag, no value needed)
--skip-anova: Skip one-way ANOVA for faster runs (flag, no value needed)
--group-color1: Hex color for the up-regulated group in plots (default: #D55E00)
--group-color2: Hex color for the down-regulated group in plots (default: #0072B2)

Current execution workflow (temporary)

At present, the pipeline is executed manually through an interactive RStudio session. This workflow exists to ensure a controlled runtime environment (via renv) and will likely be automated in the future.

The current steps are:

Launch RStudio on cardinal.som.edu.
Open the proteomics_core_pipelines GitHub repository as the active RStudio project. This step is required to activate the project-specific renv environment and ensure all package versions are correctly resolved.
From within RStudio, open a terminal and navigate to the desired analysis directory.
Execute the pipeline using a wrapper script (example below).
Open the generated R Markdown file located in <outdir>/*.Rmd for inspection, debugging, or manual rendering.

Wrapper script invocation

The pipeline is currently launched via a shell wrapper that calls the R script with all required command-line arguments as show:

project_dir="<path-to-project-dir>"
Rscript <wrapper-script-path> \
  --counts <counts-path> \
  --samplesheet <samplesheet-path> \
  --outdir <output-dir> \
  --runid <arbitrary-id> \
  --annotation <human or mouse>

This approach allows the analysis to be parameterized from the command line while still relying on RStudio to manage the execution environment. The wrapper script is responsible only for argument passing; all analysis logic remains within the R pipeline itself.

Pipeline Output

The pipeline generates an output directory (specified by --outdir) with the following structure:

[outDir]/
├── proteomics_analysis.html        ← final HTML report (name controlled by --report-prefix)
├── data/
│   ├── de_data/
│   │   └── [comparison]_limma.csv
│   ├── gsea_data/
│   │   └── [comparison]_go_analysis_.csv
│   └── anova/
│       └── global_anova.csv
└── figures/
    ├── imputation/
    │   ├── global_imputation_histogram.png
    │   ├── global_imputation_distribution.png
    │   ├── global_imputation_missing_per_sample.png
    │   └── global_imputation_total_counts.png
    ├── pca/
    │   └── global_pca_plot.png
    ├── heatmap/
    │   ├── global_heatmap.png
    │   └── [comparison]_heatmap.png
    ├── volcano/
    │   └── [comparison]_volcano.png
    ├── ma/
    │   └── [comparison]_ma.png
    └── gsea/
        └── [comparison]_gsea.png

Key output files:

[comparison]_limma.csv — full limma results (logFC, pvalue, padj, imputation_category per protein)
[comparison]_go_analysis_.csv — GSEA Gene Ontology results (NES, pvalue, p.adjust, leading edge)
global_anova.csv — one-way ANOVA results across all groups (generated when >2 groups)
global_pca_plot.png — PCA of all samples colored by group
global_heatmap.png — top proteins by coefficient of variation across all samples
global_imputation_*.png — four diagnostic plots for the imputation step

Limitations

This pipeline currently supports only pairwise comparisons. Support for more complex designs with multiple comparisons with covariates and contrast matrices will be added in future versions. This is a limitation for experiments with more than two conditions.

Utilizing IPA

The CSV Differential Abundance output from limma (available in the results directory provided alongside this report: de_data/[comparison_name]_limma.csv) can be uploaded directly into QIAGEN Ingenuity Pathway Analysis (IPA) for self-exploration of pathways predicted to be enriched by this experimental condition. Massey’s BISR provides access to VCU’s license of IPA. If you do not already have an account associated with this license, you may reach out to morecockcm@vcu.edu with your name, VCU Health or VCU email, and a request for IPA access. To perform a core abundance analysis, log in with your credentials at: https://analysis.ingenuity.com/pa

Manuscript-Ready Text

Methods

Prior to differential abundance analysis, missing intensity values were imputed using the DEP R package [1] or an in-house method developed by the VCU Proteomics Core. Missing values in proteomics data-dependent acquisition (DDA) experiments are commonly attributed to a missing-not-at-random (MNAR) mechanism, where low-abundance peptides fall below the instrument detection threshold. To address this, remaining missing values were imputed from a down-shifted Gaussian distribution centered at the low end of the observed intensity distribution.

Differential abundance analysis was carried out using limma v1.44.0 [2]. Proteins were considered significantly differentially abundant (DAP) when they exhibited an absolute log₂ fold-change ≥ 0.58 (1.5-fold) and an adjusted P-value (FDR) < 0.05 (Benjamini–Hochberg correction). Volcano plots and heatmaps were generated using the raw or log-transformed intensity values and visualized using R packages (ggplot2 and ggrepel), where the top up-regulated and down-regulated proteins (based on adjusted P-value) were highlighted. Heatmaps were produced using ComplexHeatmap after z-score transformation of the filtered expression matrix. Gene Set Enrichment Analysis (GSEA) [3] for Gene Ontology terms (GO) was performed using the clusterProfiler package [4] across all proteins, regardless of significance. All computational analyses were performed on VCU’s High Performance Research Computing cluster.

References

Tyanova S, Temu T, Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc. 2016;11(12):2301–2319. doi:10.1038/nprot.2016.136
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. doi:10.1093/nar/gkv007
Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545–15550. doi:10.1073/pnas.0506580102
Wu T, Hu E, Xu S, et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation. 2021;2(3):100141. doi:10.1016/j.xinn.2021.100141

Required Acknowledgements

Please include the following statements in your acknowledgements manuscript section:

“Services in support of the research project were provided by the VCU Massey Comprehensive Cancer Center Bioinformatics Shared Resource. Massey is supported, in part, with funding from NIH-NCI Cancer Center Support Grant P30 CA016059.”
“High Performance Computing resources provided by the High Performance Research Computing (HPRC) core facility at Virginia Commonwealth University (https://hprc.vcu.edu) were used for conducting the research reported in this work.”

Future Improvements

Implement functionality to perform analyses using multi-factor designs and covariates.
Incorporate a KEGG pathway analysis module.
Add support for gene annotation using Entrez IDS.

Contact

For questions or issues, please contact the BISR group at [mccbioinfo@vcu.edu] or open a GitHub issue in the repository.

License

GPL-3.0 license

Developmental Notes

Code Files and Pipeline Execution Overview

Note: figure was developed in biorender at: link

The pipeline is written within the following scripts:

helpers.R
report_generator.regular.R
de.regular.R
report_generator.ptm.R
de.ptm.R

To generate results, the two data files (intensity matrix and samplesheet) are passed to de.{ptype}.R.

Update the pipeline code

de.regular.R serves as the orchestration layer for the differential abundance pipeline. It contains all logistics required to run the analysis end to end, including input handling, command-line argument parsing, data preparation, and invocation of the helper functions that perform statistical testing, visualization, result serialization, and report generation.

The script implements the full limma-based differential abundance workflow, starting from raw quantified protein intensities and proceeding through model fitting, downstream visualizations, PCA, RDS serialization, and automated HTML report creation. It is designed to be executed as a command-line script and to coordinate the interaction between analysis modules rather than reimplementing their internal logic.

All computational heavy lifting (limma modeling, plotting, enrichment analysis, and reporting) is delegated to sourced helper functions. As such, the primary responsibilities of this script are control flow, data plumbing, and enforcing the execution contract between inputs, analysis components, and outputs.

Overall execution flow

At a high level, the script executes in the following phases:
1. Library loading and environment setup
2. Command-line argument parsing (with optional debug overrides)
3. Annotation database initialization
4. Input data ingestion and cleaning
5. Comparison definition from the samplesheet
6. Limma design matrix construction
7. Per-comparison differential analysis loop
8. Global PCA computation and visualization
9. Serialization of all results into a single RDS
10. Automated report generation from the RDS
Each phase is intentionally linear to make failures easier to diagnose.

Command-line interface and runtime configuration

The script is designed to be executed via Rscript and uses optparse to define required and optional arguments:
- Counts matrix (merged protein intensities)
- Samplesheet defining experimental groups and comparisons
- Output directory
- Run identifier
- Genome annotation (human or mouse)
- A debug block is included to bypass CLI arguments during development. This block overrides all runtime inputs and should be disabled or removed in production to avoid accidental misuse.

Annotation and identifier mapping

Protein identifiers are assumed to be UniProt Swiss-Prot IDs, potentially concatenated with semicolons.
Key steps:
- UniProt IDs are split, de-duplicated, and queried against Ensembl using biomaRt
- A one-to-one UniProt → Ensembl mapping is enforced via distinct()
- Ensembl gene IDs are merged back into the main protein-level table
- Row names are set to protein accessions for downstream compatibility
Any change in identifier conventions upstream (e.g., gene symbols instead of UniProt) will require updates here and in all downstream plotting and reporting functions.

Data cleaning and preprocessing assumptions

Several important assumptions are embedded in the preprocessing logic:
- Proteins with missing PG.Genes are dropped
- Duplicate protein accessions are removed naïvely using distinct() (this is explicitly marked as a temporary and potentially lossy step)
- Non-intensity metadata columns are manually excluded
- Intensities are coerced to integers and log2-transformed with a pseudocount
These steps should be reviewed carefully if the upstream quantification format changes or if missingness patterns differ substantially.

Comparison definition logic

Comparisons are inferred from the samplesheet using the following convention:
- Column 1–2: sample metadata (SampleID, GroupID)
- Column 3 onward: binary comparison vectors
  - 1 → experimental group
  - 0 → control group
For each comparison column:
- Experimental and control group labels are inferred dynamically
- Only valid comparisons (both groups present) are retained
This makes the pipeline flexible but tightly couples it to the samplesheet schema.

Limma model construction

The limma setup is intentionally minimal:

A no-intercept design (~0 + condition) is used
One contrast is evaluated per comparison
Expression data are assumed to be already normalized and log-scaled

All limma-specific logic is delegated to helper functions (run_analysis(), perform_limma_analysis()), allowing this script to focus on orchestration rather than statistics.

Analysis loop and result collection

For each comparison:

run_analysis() is called with:
- Comparison metadata
- Shared limma parameters
- Intensity matrix
- Output directory structure
Results are appended only if non-null
Each comparison produces:
- Limma tables
- Volcano plots
- Heatmaps
- GSEA outputs (if applicable)

Failures in a single comparison do not halt the entire run unless explicitly thrown.

Principal Component Analysis (PCA)

PCA is computed once globally, across all samples:

Missing values are imputed using per-protein means (explicitly flagged in-code as a temporary and suboptimal solution)
Additional filtering removes columns with remaining non-finite values
PCA is limited to three components
Only PC1 vs PC2 is plotted and saved

The resulting pca_plot object is included in the final RDS and reused by the report generator to avoid recomputation.

Result serialization contract

All downstream reporting depends on the structure of the saved RDS:

list(
  results,
  comparisons,
  out_dirs,
  pca_plot
)

This positional contract is relied upon verbatim by generate_report(). Any modification to the order or contents of this list must be reflected in the report generator, otherwise the report will break or silently misinterpret data.

Report generation

The final step calls generate_report() using the saved RDS as its sole input. At this point:

No analysis is recomputed
All figures are loaded from disk
All tables are reconstructed from serialized results

This separation ensures reproducibility and allows report generation to be rerun independently of the analysis.

Known technical debt and update hotspots

This script intentionally documents several weak points inline (marked with warnings), including:

Duplicate protein handling
PCA imputation strategy
Tight coupling to input file schemas
Hard-coded thresholds propagated across multiple components

When updating the pipeline, changes should be evaluated holistically across:

This orchestration script
Helper functions
Report generation logic

Treat this file as the control plane of the pipeline: it defines what runs, in what order, and with what assumptions.

Updating the limma code

helper.R::perform_limma_analysis() function encodes the differential abundance code with limma.

Purpose

Fit a linear model for a specified experimental vs control contrast
Apply empirical Bayes moderation
Return a complete results table suitable for downstream filtering, plotting, and reporting

Inputs and assumptions

limma_params is a structured list that must contain:
- E: an expression matrix (already normalized / voom-transformed)
- design: a design matrix with column names matching experimental conditions
- exp and ctrl must exactly match column names in limma_params$design
- Only a single contrast is constructed and evaluated per function call

Execution flow

A contrast string (exp - ctrl) is constructed dynamically and passed to makeContrasts().
A linear model is fit using lmFit() on the provided expression matrix.
The contrast is applied using contrasts.fit().
Empirical Bayes moderation is performed with eBayes().
Results are extracted with topTable():
- All features are returned (number = Inf)
- Sorted by nominal p-value (sort.by = "P")

Updating the reporter

report_generator.{ptype}.R::generate_report() function encapsulates the full logic required to assemble and render the automated HTML report from a serialized analysis object. The function does not perform analysis; it strictly consumes precomputed results and formats them into a reproducible R Markdown document.

At a high level, the function performs four distinct tasks:

Input validation and setup
Deserialization of analysis outputs
Programmatic construction of an R Markdown document
Rendering (optional) and file management

Below is a breakdown of how these pieces fit together and where updates should be made.

Inputs and assumptions

The function assumes that analysis_results_path points to an .rds file with a fixed positional structure:
- [[1]] → results: a list of per-comparison analysis outputs (limma, GSEA, etc.)
- [[2]] → comparisons: metadata defining comparison names and group labels
- [[3]] → out_dirs: named list of output directories (volcano, heatmap, gsea, pca)
- [[4]] → pca_plot: a precomputed PCA ggplot object
Any change to how the upstream pipeline saves this RDS (order, naming, or contents) must be mirrored here, otherwise the report will silently misbehave or fail.

Output behavior

The function creates a single .Rmd file ({report_prefix}.Rmd) in output_dir.
The HTML filename is fixed to {report_prefix}.html (timestamping is currently disabled but preserved in comments).
Rendering via rmarkdown::render() is intentionally commented out. This allows:
- Manual inspection/debugging of the generated Rmd
- Deferred rendering on systems without Pandoc or a full R Markdown stack

R Markdown generation strategy

The report is assembled as a set of character strings using glue() and appended incrementally:
- Front matter + setup chunk. Defines document metadata, HTML options, and reloads the RDS inside the report environment to ensure self-containment.
- Global sections. Overview, pipeline description, PCA visualization, and a cross-comparison summary table computed dynamically from results.
- Per-comparison sections (looped) For each comparison:
  - Experimental design metadata
  - Volcano plot (loaded from file)
  - Heatmap (loaded from file)
  - Differential abundance tables (computed in-report)
  - GSEA plots and tables (if available)
  - All plots except PCA are included via knitr::include_graphics(). This deliberately decouples visualization generation from reporting and avoids recomputation.

Thresholds and hard-coded logic

Several analytical thresholds are hard-coded in the reporting layer and must remain synchronized with the analysis pipeline:
- Adjusted p-value cutoff: adj.P.Val < 0.05
- Fold-change cutoff: |log2FC| ≥ 0.58 (1.5×)
- GSEA presence is inferred by checking results[[i]]$gsea

Issues with the renv environment

This environment was created within

Name		Name	Last commit message	Last commit date
Latest commit History 144 Commits
.vscode		.vscode
R		R
assets		assets
envs		envs
inst		inst
man		man
test		test
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.md		README.md
proteomics_core_pipelines.Rproj		proteomics_core_pipelines.Rproj
proteomics_core_pipelines.code-workspace		proteomics_core_pipelines.code-workspace
renv.lock		renv.lock

Folders and files

Latest commit

History

Repository files navigation

Proteomics Core Analyses

Overview

Table of Contents

Input Datasets and Proteome Sources

Study Designs

Protein Assays

Label-Free Quantification (LFQ)

PTM Analysis

Imputation

Available Methods

The 3×3 Method (In-House)

Differential Abundance Analysis

Pipeline

Preparing your R Environment

Usage

Preparing Your Data

1. Intensity Matrix

2. Samplesheet:

Running the Script

Human Analysis

Mouse Analysis

Arguments

Current execution workflow (temporary)

Wrapper script invocation

Pipeline Output

Limitations

Utilizing IPA

Manuscript-Ready Text

Methods

References

Required Acknowledgements

Future Improvements

Contact

License

Developmental Notes

Code Files and Pipeline Execution Overview

Update the pipeline code

Updating the limma code

Updating the reporter

Issues with the renv environment

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages