CompOmics/MoDPAv1.0

MoDPA: Modification-Dependent Protein Associations

Post-translational modifications (PTMs) are key regulators of protein function and cellular processes; however, the overall principles of PTM co-regulation and crosstalk remain to be fully understood.

To overcome the extreme sparsity and heterogeneity of PTM calls across experiments, MoDPA utilizes a variational autoencoder (VAE) to embed per-site detection profiles into a low-dimensional latent space that preserves covariation while denoising missing data. A PTM association network is constructed by correlating latent representations across experiments.


Reproducing the results

To reproduce our publication's results, you will need:

  1. A peptidoform identifications (IDs) file
  2. A peptidoform counts file
  3. A fasta file to map peptidoforms to proteins

To ensure reproducibility, we recommend running the code in a new Python environment created from the .yml file in the repository: conda env create -f env.yml
Estimated run time is 1-2 hours for the small pulsed SILAC dataset (about 3 hours on low-powered laptops).

Preprocessing & PTMs quantification

The scripts inside the 1-quant-pipeline-MoDPA_v2 folder parse the peptidoform IDs and counts files to generate absolute counts for each peptidoform (A_Pipeline_Sept2025.py). These absolute counts are then used to compute relative PTM counts (B_relative_PTMs.py).

Example

python ./A_Pipeline_Sept2025.py ./v0113-2025/20250217_Peptidoforms_IDs_v0113.csv.gz ./v0113-2025/20250217_Peptidoforms_counts_v0113.csv.gz Human_2023_01_isoforms.fasta.gz

python ./B_relative_PTMs.py . 2025-12-19

Note: This repository contains only the pulsed SILAC dataset. The v0113-2025 dataset is available on Zenodo (https://zenodo.org/records/18310674).
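
The README does not spell out how relative PTM counts are defined. As a rough illustration only, the sketch below assumes that a relative count is the count of a modified site divided by the total count of all peptidoforms covering that site. The function name and the input tuple layout are hypothetical and are not taken from B_relative_PTMs.py.

```python
from collections import defaultdict


def relative_ptm_counts(site_counts):
    """Return modified / total count per (protein, position, modification).

    site_counts: iterable of (protein, position, modification, count),
    where modification is None (or "") for the unmodified form.
    This input layout is an assumption made for illustration.
    """
    rows = list(site_counts)

    # Total count observed at each site, modified or not.
    totals = defaultdict(float)
    for protein, position, _mod, count in rows:
        totals[(protein, position)] += count

    # Fraction of the site total carried by each modified form.
    relative = {}
    for protein, position, mod, count in rows:
        total = totals[(protein, position)]
        if mod and total > 0:
            relative[(protein, position, mod)] = count / total
    return relative
```

For example, a site observed 3 times with an oxidation and once unmodified would get a relative count of 0.75 under this assumed definition.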

Prepare VAE training data

Filter the results with 1-quant-pipeline-MoDPA_v2/C_prefilter-relative-PTMs.py and extract PTM-by-experiment matrices (one per PTM of interest) using 1-quant-pipeline-MoDPA_v2/D_Get-MoDPA-matrices.py.
Provide the list of PTMs of interest as a .csv file with the following columns:

AA unimod_id ptm_name
C 4 Carbamidomethyl
M 35 Oxidation
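
Before running the matrix extraction, it can help to check that the PTM list has the expected columns. The sketch below assumes a comma-delimited file with a header row; the function name read_ptm_list is hypothetical and not part of the repository.

```python
import csv
import io


def read_ptm_list(handle):
    """Parse a PTM list with columns AA, unimod_id, ptm_name.

    Assumes comma-delimited input with a header row.
    Returns a list of (amino_acid, unimod_id, ptm_name) tuples.
    """
    reader = csv.DictReader(handle)
    required = {"AA", "unimod_id", "ptm_name"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [
        (row["AA"], int(row["unimod_id"]), row["ptm_name"])
        for row in reader
    ]


# In-memory example mirroring the two rows shown above.
example = io.StringIO(
    "AA,unimod_id,ptm_name\nC,4,Carbamidomethyl\nM,35,Oxidation\n"
)
ptms = read_ptm_list(example)
```

With a real file, pass an open handle instead, for instance open(path, newline="").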

Combine the PTMs of interest into one dataset with 1-quant-pipeline-MoDPA_v2/E_combine-MoDPA-matrices.py.

Train VAE

We recommend using GPUs for this task.

Multiple models can be trained at the same time with 2-VAE-code/New-VAE-gridsearch.py. The hyperparameters of the models can be specified either directly in the script or in a separate .txt file. Hard-coded values are saved to a .txt file in case you need to re-run.
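
A grid search trains one model per combination of hyperparameter values. The sketch below shows how such a grid can be expanded into individual configurations. The hyperparameter names (latent_dim, learning_rate, beta) are placeholders chosen for illustration and may not match those used in New-VAE-gridsearch.py.

```python
from itertools import product


def expand_grid(grid):
    """Expand {name: [values]} into a list of {name: value} configurations."""
    names = sorted(grid)  # fixed order so the output is reproducible
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


# Hypothetical grid: 2 x 2 x 1 = 4 configurations.
grid = {
    "latent_dim": [8, 16],
    "learning_rate": [1e-3, 1e-4],
    "beta": [1.0],
}
configs = expand_grid(grid)
```

Each dictionary in configs describes one model to train, so the number of models grows multiplicatively with the number of values per hyperparameter.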

After training, run 2-VAE-code/New-VAE-validation.py to get an overview of the trained models and select the best one. The latent space of the selected model serves as input to step 3, the calculation of PTM correlations.

Calculate correlations between PTMs

Use 3-calculate-correlations/get-signdistance-multiproc.py to calculate signed distance correlations between PTMs in the latent space. To make it feasible to process large datasets (10,000+ PTMs), the data is analyzed in chunks. Use 3-calculate-correlations/combine-signdistance-results.py to combine the partial results into one file and perform multiple testing correction.
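
The two ideas in this step, chunked processing of PTM pairs and multiple testing correction on the combined results, can be sketched as follows. This is not the repository's implementation: the README does not name the correction method, so Benjamini-Hochberg is used here as an assumption, and both function names are hypothetical.

```python
from itertools import combinations, islice


def pair_chunks(n_items, chunk_size):
    """Yield lists of (i, j) index pairs, at most chunk_size pairs each."""
    pairs = combinations(range(n_items), 2)
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each chunk can be handed to a separate worker process. The correction has to be applied once on the p-values of all chunks together, because the adjustment depends on the total number of tests.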

Pulsed SILAC data analysis

Pulsed SILAC is a variation of SILAC where labeled amino acids are added to the growth medium for a short period to assess de novo protein production. Over time, light-labeled proteins decrease due to turnover, while heavy-labeled proteins increase.
This experimental setup thus provides a simple model in which unlabelled Arginines (R0) and Lysines (K0) will follow similar trends, while their heavy-labeled counterparts (R10, K8) will exhibit the opposite trend. Other modifications, like Methionine oxidations, are not expected to correlate with light or heavy labels.
Validation used rule-based edge labeling:

  1. Positive associations between two heavy or two light residues are valid
  2. Negative associations between one heavy and one light residue are valid
  3. All other associations are invalid.
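
The three rules above can be sketched as a small function. The label strings follow the residue names used in this section (R0, K0, R10, K8); the function name and its signature are hypothetical, and treating a correlation of exactly zero as invalid is an assumption.

```python
HEAVY = {"R10", "K8"}
LIGHT = {"R0", "K0"}


def is_valid_edge(label_a, label_b, correlation):
    """Apply the rule-based edge labeling to one edge.

    Returns True if the edge is valid under rules 1 and 2,
    and False otherwise (rule 3).
    """
    groups = []
    for label in (label_a, label_b):
        if label in HEAVY:
            groups.append("heavy")
        elif label in LIGHT:
            groups.append("light")
        else:
            # Any other modification (e.g. oxidation) makes the edge invalid.
            return False

    same_group = groups[0] == groups[1]
    if correlation > 0:
        return same_group      # rule 1: heavy-heavy or light-light
    if correlation < 0:
        return not same_group  # rule 2: one heavy, one light
    return False
```

The fraction of valid edges can then be compared between the MoDPA network and a randomized network.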

The MoDPA network is compared to a randomized network to verify whether our method is capturing biological associations or random artefacts.
Two randomization algorithms are available: 4-pulse-silac-validation/randomize-network.py, which shuffles edges without changing the degree of the nodes, and 4-pulse-silac-validation/Full-random-network.py, which applies no such restriction.
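
One standard way to shuffle edges while keeping every node's degree fixed is the double edge swap: two edges (a, b) and (c, d) are replaced by (a, d) and (c, b). The sketch below illustrates that technique on an undirected edge list. It is not necessarily how randomize-network.py works, it ignores edge weights and signs, and all names are hypothetical.

```python
import random
from collections import Counter


def node_degrees(edges):
    """Count how many edges touch each node."""
    return Counter(node for edge in edges for node in edge)


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Rewire an undirected edge list with double edge swaps."""
    rng = random.Random(seed)
    edges = [tuple(edge) for edge in edges]
    if len(edges) < 2:
        return edges
    present = {frozenset(edge) for edge in edges}
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # avoid looping forever on small graphs
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if new_1 in present or new_2 in present:
            continue  # swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new_1, new_2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges
```

Because each swap removes one edge end and adds one edge end at each of the four nodes involved, node_degrees returns the same counts before and after shuffling.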

Clustering and enrichment analysis

Downstream analysis of the network can be performed in different ways.

In our publication, we used the clusterMaker Cytoscape plugin to generate 23 clusters of strongly interconnected PTMs (Leiden algorithm; objective_function = modularity; resolution_parameter = 0.5; beta = 0.01; n_iterations = 2). The clustered PTMs can be found in 5-Enrichment-analysis/compassionate_buck-v2/Leiden-0.5/clustered-nodes.csv. Finally, we performed a Reactome pathway enrichment analysis on the modified proteins within each cluster (5-Enrichment-analysis/reactome-pathways-enrichment.py).
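
The README does not state which statistical test reactome-pathways-enrichment.py uses. As a generic illustration of over-representation analysis, the sketch below computes a one-sided hypergeometric p-value for the overlap between one cluster and one pathway. The function name and arguments are hypothetical, and the choice of test is an assumption.

```python
from math import comb


def hypergeom_enrichment_p(cluster_size, pathway_size, overlap, background_size):
    """P(X >= overlap) when drawing cluster_size proteins from the background.

    background_size: number of proteins in the background set
    pathway_size:    number of background proteins annotated to the pathway
    cluster_size:    number of proteins in the cluster
    overlap:         number of cluster proteins annotated to the pathway
    """
    upper = min(cluster_size, pathway_size)
    total = comb(background_size, cluster_size)
    tail = sum(
        comb(pathway_size, k)
        * comb(background_size - pathway_size, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

P-values obtained this way for many cluster and pathway pairs would still need multiple testing correction before interpretation.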

