This repository holds scripts that were used in our analysis of protein-coding genes in RNA viruses, using a method called "evolutionary fingerprinting". An evolutionary fingerprint is the bivariate posterior distribution of codon site-specific non-synonymous (dN) and synonymous (dS) substitution rates that is inferred by fitting a codon substitution model to the phylogeny and multiple sequence alignment for a gene. A manuscript is being prepared for submission to a preprint server and eventually for peer review.
Essential data have been deposited in Zenodo under the Creative Commons 4.0 Attribution International license:
-
To get accession numbers, search your virus from the
taxonomybrowser in NCBI. -
Click your virus species and filter genomes by length (based on the length of your genome). This should give you accession numbers of entries properly classified as your virus species that are complete or close to complete genomes.
-
In the right panel, click
Send to,Complete Record. Destination should beFileand then chose the formatAccession Listand clickCreate File.
Example of a list with accessions numbers:
KU728743.1
KU728742.1
MN125030.1
MN125029.1
Option 1: Genomes with genes translated independently:
Use get_all_accns.py to download the information associated with your genomes.
Inputs:
- List with accession numbers
Outputs: The outputs of this script depends on the type of genome:
_md.csv: metadata_aa.fasta: amino acid sequences (DEPRECATED)_CDS.fasta: coding sequences (CDSs)
Example:
$ python3 get_all_accns.py data/measles_accns.seq user@gmail.com --outfile data/measlesNext, we need to sort the CDSs for the different gene products (proteins).
Locate a well-annotated reference genome and obtain the Genbank file for that record.
Run the script sortCDS.py to partition the CDS file produced by get_all_accns.py into multiple FASTA files.
Inputs:
- Genbank file of reference genome
- File containing all CDS records
Example:
$ python3 ../../surfaces/scripts/sortCDS.py rsv_sequence.gb rsv_sequence.seq_CDSs.fastaOutput:
- Automatically opens multiple FASTA files to write different CDSs
Option 2: Genomes encoding a single polyprotein:
To download sequences from NCBI for genomes with a single open reading frame encoding all proteins, use the get_all_accns.py script with the --poly flag:
python3 ../../surfaces/scripts/get_all_accns.py --prefix zika zika.seq --poly hcastelans@gmail.comNext you need to run the extract_mat_peptides.py script to extract the CDS.
In this case, it is necessary to use a reference genome in Genbank format to extract the CDS.
python3 ../../surfaces/scripts/extract_mat_peptides.py --label zika 'sequence_zika.gb' '/home/hugocastelan/zika.seq_CDSs_polyprot.fasta' Use the codon_align.py script to align your protein encoding sequence in a codon-aware manner:
python3 ../../surfaces/scripts/codon_align.py zika_nonstructural_protein_NS4B_step1.fasta -o zika_nonstructural_protein_NS4B_step2.fasta- For multiple coding sequences:
for i in *.fasta; do python3 ../../surfaces/scripts/codon_align.py "$i" -o "${i%_step1.fasta}_step2.fasta"; done-
Use
Aliviewto visualize your alignments. If required, remove problematic sequences or nucleotides that introduce frameshifts. After removing the sequences, re-align the data and save the alignments with the extension_step3.fastain the step 3 folder. -
Use
FastTreeto build the phylogenetic tree on the manually curated alignments:
for f in *.fasta; do fasttree -nt -quote "$f" > ${f%.fasta}.nwk ; done - Use
prunetree.pyto record the effect on tree sizes when progressively down-sampling the sequence alignment by removing the shortest tips of the tree.
for f in *_step3.nwk; do python3 scripts/prunetree.py "$f" > "${f%_step3.nwk}_step4.csv"; done- Use
step4_filter.Rto create a tree-length-decay plot. Take note of the number of tips to prune per alignment. - It is necessary to include the "path in quotes"
Rscript step4_filter.R "/home/hugocastelan/Documents/projects/surfaces_data/dengue/step4/*.csv" "/home/hugocastelan/Documents/projects/surfaces_data/dengue/step4/figure.png" > step4_filter.csvNOTE: If length of entire tree is below some threshold (0.5) then abandon alignment (stop here)
- Use
prunetree.pyone more time, but now provide a target number of tips (--targetoption under--mode ntips) for each alignment based on your previous output.
python3 prunetree.py measles_C_protein_step3.nwk --seq measles_C_protein_step2.fasta --mode ntips --target 97 --csvfile measles_C_protein_step5.labels.csv --outfile measles_C_protein_step5.fastaFor batch processing, use:
python3 "step4_batch.py" "step4_filter.csv" "*_step3.nwk" NOTES:
step4_filter.csvis a CSV fromstep4_filter.Routput- Script assumes Newick files are in the same directory as corresponding
\*\_step3.fastafiles
-
Case 1: prune down to tree length of
1.0if doing so does not removes too many tips (change mode to--mode treelen). -
Case 2: prune to
100tips when you cannot reach a tree length of1.0without removing too many sequences.
for f in data/HCV/HCV1a_*_step3.nwk; do python scripts_pipeline/prunetree.py -t 100 \
--mode ntips --seq "${f%.nwk}.fasta" -o "${f%_step3.nwk}_step5.fasta" --csvfile \
"${f%_step3.nwk}_step5.labels.csv" $f; done- Use
fubar.pyto run FUBAR on your pruned alignments and store selection pressures (dN and dS per codon) in acsvfile. For example:
python3 fubar.py zika_protein_2K_step5.fasta zika_protein_2K_step5.fubar.csvFor batch processing, use:
for f in *step5.fasta; do python3 scripts/fubar.py "$f" "${f%.fasta}"; done- Create and visualize selection fingerprints with
fingerprint_dnds_plot.R.