This pipeline tries to find putative protein functions for unknown phage proteins using a combination of sequence- and structure-based inference methods.
MMseqs2 clustering and InterProScan are used for sequence-based inference. ColabFold and Foldseek are used for structure-based inference.
The pipeline is build using Nextflow and Python. All scripts can found in bin/.
The pipeline can be run using the two Docker images provided in dockerfiles.
- Input proteins are filtered based on their annotation. Proteins annotated with hypothetical, putative or postulated are labelled as unknown. The rest as known.
- Proteins are clusters twice using
MMseqs2. First to reduce redundance, and second to group related proteins. The cluster representatives are used for further analysis.- Unknown proteins that are clustered together with known proteins are assigned the function of the known protein.
- Unknown proteins get assigned sequence domains using
InterProScan. - All sequence representatives are input to
ColabFoldfor structure prediction. - Protein structures from Unknown proteins are aligned against a structure database using
Foldseek. - Optionally use structures of known proteins to align against a reference database of known protein structures to verify accuracy of predicted structures.
- Generate a JSON-report of findings for each input phage.
- Re-annotate GFF files with putative functions and domains.
The standard command to run the pipeline is as follows:
nextflow run main.nf \
--foldseekdb=databaseDirectory/db/db_name \
--colabdb=databaseDirectory/colabfolddb \
--email=valid-E-mail-adressTo run without the colabfold structure database and with the Foldseek database downloaded like instructed below:
nextflow run main.nf \
--foldseekdb=databases/af_prot/db \
--wsl \
--email=valid-E-mail-adressFor further information run:
nextflow run main.nf --helpResults are stored in the results directory. Files for each major step are stored in sepereate sub-folders.
Docker images can be installed with:
docker build -t phagomics dockerfiles/docker_phage_imageand
docker build -t colabfold dockerfiles/docker_colabfoldThe Foldseek database (Alphafold/Proteome) can be downloaded via the docker image using the following commands:
mkdir -p databases/af_protdocker run --rm \
-v ./databases/af_prot:/data \
phagomics \
foldseek databases Alphafold/Proteome /data/db /data/tmpFor information on how to install the colabfold search database (~2TB!) used for structure prediction, visit: https://colabfold.mmseqs.com/
bin/analysisScripts/analysis.py, bin/analysisScripts/visualizationScripts.py and bin/analysisScripts/getPdbs.py are the main scipts and can be used with default pipeline execution (fingers crossed).
analysis.py calculates statistics about the results. Also must be run before getPdbs.py.
getPdbs.py gets the .pdb files (protein structure files) from the colabfold subfolder, for structures that had a putative function assigned through Foldseek. These can then be clustered using Foldseek to identify similar predicted structures.
For more details and an example workflow see bin/analysisScripts/analysis_workflow.md
visualizationScripts.py creates three figures:
- Barplot showing the ratio of known to unknown proteins per phage.
- Barplot showing the cluster sizes and makeup.
- Scatterplot showing cluster heterogeneity.