- pharokka
- https://phrogs.lmge.uca.fr/
https://github.com/gbouras13/pharokka#database-installation
- tail spike database
- sanitize genomes module
- fold and search fold database (of phage spike proteins?)
- reverse containment (see journal, 2022-03-30, reverse_gather.py)
- circular binary segmentation (see methylation work and py code)
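As a reminder of what the circular binary segmentation step does, here is a minimal sketch of its core scan only: find the arc of the signal whose mean differs most from the rest. It is not the code from the methylation work; the function name is made up, the statistic is left unscaled by the noise standard deviation, and the permutation test and recursive splitting of the full method are left out.

```python
from itertools import accumulate
from math import sqrt


def best_circular_segment(values):
    """Return (start, end, stat) for the half-open arc [start, end) whose mean
    differs most from the mean of the remaining values.

    The remaining values are treated as one group, as if the two ends of the
    signal were joined into a circle. Returns (None, None, 0.0) when no arc
    differs from the rest (for example, a constant signal).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    prefix = [0.0] + list(accumulate(values))  # prefix[k] == sum(values[:k])
    total = prefix[-1]
    best = (None, None, 0.0)
    for start in range(n):
        for end in range(start + 1, n + 1):
            k = end - start  # number of points inside the arc
            if k == n:
                continue  # the arc must leave something outside
            inside = prefix[end] - prefix[start]
            outside = total - inside
            diff = inside / k - outside / (n - k)
            # two-sample statistic, unscaled by the noise standard deviation
            stat = abs(diff) / sqrt(1 / k + 1 / (n - k))
            if stat > best[2]:
                best = (start, end, stat)
    return best
```

The scan is quadratic in the signal length, so it only suits short signals or pre-binned data; for a benchmark an established implementation is probably the better baseline.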
maybe use shapemers? (cluster phage feet or whatever to discover new ones)
https://pubmed.ncbi.nlm.nih.gov/33381814/ https://www.biorxiv.org/content/10.1101/2022.10.11.511548v1 https://www.nature.com/articles/s41594-022-00849-w
could implement this in faltwerk
https://github.com/TurtleTools/afdb-shapemer-darkness/blob/main/scripts/make_shapemers.py
or ask questions like: does increased resistance mean more or fewer phages? are they lost if resistance increases, or do they bring the resistance?
https://www.nature.com/articles/s41564-022-01263-0
not one word of phages!
expose mmseqs params in annotation module (search careful, search feet)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC135240/
phage proteomic tree -- classify the (pro)phage!
https://ggdc.dsmz.de/victor.php https://academic.oup.com/bioinformatics/article/33/21/3396/3933260
extend data sources
phage feet (see journal)
https://mjoh223.github.io/jbd-lab.github.io/static/pdf/publications/soto-perez_2019.pdf https://github.com/jbisanz/HuVirDB/blob/master/readme.md
- don't hard-code mmseqs params, or at least set lower values
- another segmentation: circular binary ... -- benchmark!
- annotate pVOGs further https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8442406/
http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/gut_phage_database/README.txt
- PCs_GPD.txt: GPD proteome clustered. Each line represents a protein cluster.
- GPD_proteome_orthology_assignment.txt.gz: Compressed file containing the functional annotation of GPD proteome.
- concat the queries to mmseqs so we don't load the db a million times
- search crispr takes too much memory with metagraph -- any solution? (split index?)
- predict small ORFs
- add other annotation resources:
- vis raw signal after protein search
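For the item on concatenating queries, a minimal sketch of one way to do it: prefix every FASTA header with its sample name, run a single search on the merged file, then split the tabular hits back out by prefix. The function names and the `__` separator are assumptions for illustration (sample names must not contain the separator), and the sketch assumes the query ID is in the first tab-separated column of the hit table.

```python
from collections import defaultdict

SEP = "__"  # assumed separator; must not occur in sample names


def merge_queries(fastas):
    """Merge FASTA texts into one, prefixing each header with its sample name.

    fastas: dict mapping sample name -> FASTA text.
    """
    out = []
    for sample, text in fastas.items():
        for line in text.splitlines():
            if line.startswith(">"):
                out.append(f">{sample}{SEP}{line[1:]}")
            elif line.strip():
                out.append(line)
    return "\n".join(out) + "\n"


def split_hits(table_text):
    """Group tab-separated hit lines by the sample prefix of the query ID."""
    hits = defaultdict(list)
    for line in table_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        sample, query = fields[0].split(SEP, 1)
        hits[sample].append([query] + fields[1:])
    return dict(hits)
```

The merged text would be written to one file and used as the query of a single search, so the target database is loaded once per batch rather than once per genome.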
touch annotation.gff
find GPD_annotations -name '*.gff' -exec awk '/^##FASTA/ {exit} {print}' {} \; >> annotation.gff
But code does not match GPD proteome code ...
We can even include non-intact prophages because at some point they got in, never mind their fate in the genome, though of course the chance of an intact receptor is smaller.
Search strategy:
- Manually collect receptors, sequences and 3D
- search for more (snowball)
- Where are they? (end of the protein? search beginning and annotate as putative or fold and see if they have some characteristics)
3D, search upper part homology, fold lower part and search, get upper part homology, ...
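The snowball step of the search strategy can be sketched as a loop that re-searches with every newly found hit until nothing new turns up. This is only an outline: `search` is a hypothetical callable (a sequence search, a structure search, or a mix) that returns the IDs of hits for one query, and `max_rounds` is an assumed cap to keep the search from drifting too far from the manually collected seeds.

```python
def snowball(seeds, search, max_rounds=5):
    """Iteratively expand a seed set of receptor IDs.

    seeds: iterable of starting IDs (the manually collected receptors).
    search: hypothetical callable, query ID -> iterable of hit IDs.
    max_rounds: assumed cap on the number of expansion rounds.
    """
    known = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        found = set()
        for query in frontier:
            found.update(search(query))
        frontier = found - known  # only hits not seen before are searched again
        if not frontier:
            break
        known |= frontier
    return known
```

Keeping a record of which round each hit entered in would make it easier to annotate later hits as putative.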