Releases: gbouras13/phold
v1.2.5
v1.2.4
v1.2.3
v1.2.2
v1.2.1
v1.2.0
- Improved ProstT5 3Di prediction throughput for `phold run`, `phold predict` and `phold proteins-predict` due to smarter batching implementations
- Addition of the `phold autotune` subcommand to detect an appropriate `--batch_size` for your hardware
- You can also use `--autotune` with `phold run`, `phold predict` and `phold proteins-predict` to automatically detect and use the optimal `--batch_size` (only recommended for large datasets with thousands of proteins)
- Manuscript now published:
Bouras G., Grigson S.R., Mirdita M., Heinzinger M., Papudeshi B.,
Mallawaarachchi V., Green R., Kim S.R., Mihalia V., Psaltis A.J.,
Wormald P-J., Vreugde S., Steinegger M., Edwards R.A.
Protein Structure Informed Bacteriophage Genome Annotation with Phold
Nucleic Acids Research, Volume 54, Issue 1, 13 January 2026
https://doi.org/10.1093/nar/gkaf1448
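The idea behind batch-size autotuning can be illustrated with a minimal Python sketch. This is not Phold's implementation: `pick_batch_size`, the `throughput` callback, and the candidate sizes are all hypothetical, and the sketch assumes that a batch which does not fit on the device raises `MemoryError` (a real benchmark would likely need to catch the framework's own out-of-memory error instead).

```python
from typing import Callable, Iterable


def pick_batch_size(
    throughput: Callable[[int], float],
    candidates: Iterable[int] = (1, 2, 4, 8, 16, 32, 64),
) -> int:
    """Return the candidate batch size with the highest measured throughput.

    `throughput(batch_size)` is a hypothetical callback that returns proteins
    processed per second and raises MemoryError when the batch does not fit.
    Candidates are expected in increasing order.
    """
    best_size, best_rate = 0, 0.0
    for size in candidates:
        try:
            rate = throughput(size)
        except MemoryError:
            break  # assume larger batches will not fit either
        if rate > best_rate:
            best_size, best_rate = size, rate
    if best_size == 0:
        raise RuntimeError("no candidate batch size could be run")
    return best_size


def fake_throughput(batch_size: int) -> float:
    """Toy stand-in for a benchmark: peaks at 16, out of memory above 32."""
    if batch_size > 32:
        raise MemoryError
    return {1: 10.0, 2: 18.0, 4: 30.0, 8: 41.0, 16: 47.0, 32: 44.0}[batch_size]
```

With the toy benchmark, `pick_batch_size(fake_throughput)` selects 16: the largest batch that fits is not necessarily the fastest, which is why measuring is worthwhile on large datasets.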
v1.1.0
- Integration with suvtk to make it easier to submit Pharokka and Phold annotated genomes to GenBank - thanks to @LanderDC for suvtk and integration. See https://github.com/gbouras13/phold?tab=readme-ov-file#genbank-submission for more details
- Adds `--restart` parameter to complete large `phold compare` jobs #79
v1.0.0
Major Phold release to go with the preprint. For more details, see the preprint and updated documentation.
You will need to re-install the updated Phold search database with `phold install` to be compatible with v1.0.0.
Major Changes
- Phold search database has been modified, filtered and curated to contain 1,363,704 protein structures with functional labels (see https://zenodo.org/records/16741548). In particular, since the previous release of Phold, the enVhogs were re-clustered and re-labelled by the authors of that work. This release contains the updated enVhog structures.
- We additionally make available a larger database containing 3,166,602 structures (i.e. the Phold search database plus an extra 1.8M efam and enVhog proteins without PHROG assignment or functional label) to download using `phold install --extended_db`. Using this database provides marginally fewer functional annotations and takes longer than using the default Phold search database, so it is not recommended for functional annotation. It does find more hits overall (i.e. including hits to proteins of unknown function), so it may be of interest for viral identification tasks.
- PHROG functional labels have been updated for 2,798 PHROGs using manual curation informed by structural similarity searches. See the preprint for more details. The updated annotations are available in the Phold database under `phold_annots.tsv`.
- Phold search database is no longer pre-clustered, as the pre-clustered database was shown not to differ significantly from the unclustered one in sensitivity or runtime for the updated database.
- Phold supports Foldseek-GPU acceleration for NVIDIA GPUs using `--foldseek_gpu`. Note that it is still ideal to run Phold with multiple CPU threads (e.g. `-t 8` or however many threads you have available), as GPU acceleration only accelerates and improves the prefilter of Foldseek.
- Phold supports custom user-specified Foldseek databases with `--custom_db`.
- Phold adds high, medium and low confidence annotation heuristics to guide the user (especially users from wet-lab backgrounds or without much understanding of protein structural alignment metrics) as to which annotations they should trust with a very high degree of confidence, and which they should prioritise for manual curation. See the documentation for more.
- Phold will now mask all residues below 25 ProstT5 confidence by default (can be varied with `--mask_threshold`), as this was shown to increase annotation performance compared to no masking.
- If you only want to annotate hypothetical proteins from Pharokka to save runtime and resource usage, you can use `--hyps`.
- You can run Phold with fine-tuned ProstT5 models using `--finetune` (phage fine-tuned ProstT5 encoder and phage fine-tuned CNN) or `--vanilla` (phage fine-tuned ProstT5 encoder and vanilla PDB-based CNN). Annotation performance with these does not differ dramatically from the default ProstT5 (see the preprint), but they may be of interest to some users of Phold.
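To make the confidence masking concrete, here is a minimal Python sketch of the idea, not Phold's actual code. The function name `mask_low_confidence`, the per-residue confidence list, and the use of `X` as the mask character are assumptions made for illustration; only the default threshold of 25 and the "below the threshold" rule come from the notes above.

```python
from typing import Sequence


def mask_low_confidence(
    three_di: str,
    confidence: Sequence[float],
    threshold: float = 25.0,
    mask_char: str = "X",  # assumption: the real mask symbol may differ
) -> str:
    """Replace every 3Di token whose confidence is below `threshold`."""
    if len(three_di) != len(confidence):
        raise ValueError("need exactly one confidence value per residue")
    return "".join(
        mask_char if score < threshold else token
        for token, score in zip(three_di, confidence)
    )
```

For example, `mask_low_confidence("DVLAP", [90, 24.9, 25, 10, 60])` masks the second and fourth residues only; a residue scoring exactly 25 is kept because the rule is strictly "below".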
v0.2.0
You will need to re-install the updated Phold database for v0.2.0 using `phold install`
You will also need to upgrade Foldseek to v9.427df8a
v0.2.0 is a very large update adding:
- Improved sensitivity and faster runtime for the `foldseek` search. This is achieved by clustering the Phold database at `--min-seq-id 0.3 -c 0.8` and creating a cluster db before running with `foldseek`, which significantly improves runtime
  - Overall, just over 1.1M structures are clustered into around 372k clusters
- `--cluster-search 1` parameter is added to `foldseek search` to search against the cluster representatives first and then within each cluster, which increases sensitivity and reduces resource usage compared to `phold v0.1.4`
- Changed default `--max_seqs` from 1000 to 10000 to improve sensitivity at little resource usage cost
- Phold database is expanded adding:
  - Extremely conservative high confidence efam proteins with hits to PHROGs.
  - 95% dereplicated diversity-generating retroelements (DGRs) from Roux et al.
  - 7153 netflax toxin-antitoxin system proteins from Ernits et al.
- Adds `--ultra_sensitive` flag which turns off Foldseek prefiltering for maximum sensitivity. Recommended for small datasets/single phages only.
  - This passes the `--exhaustive-search` parameter to `foldseek search`
- Adds the ability to save ProstT5 embeddings with `--save_per_residue_embeddings` and `--save_per_protein_embeddings`
- Adds `.cif` support (e.g. from Alphafold3 server) for structures, not just `.pdb` file format, and changes the CLI to reflect this
- Removes some experimental parameters from v0.1.4 (`--split` etc)
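As a rough illustration of how the search options above might translate into a `foldseek search` call, the following Python sketch assembles an argument list. It is not taken from Phold's source: the function name and its parameters are hypothetical, the mapping of Phold's `--max_seqs` onto Foldseek's `--max-seqs` is an assumption, and so is the choice to keep `--cluster-search 1` when the exhaustive search is enabled.

```python
from typing import List


def foldseek_search_args(
    query_db: str,
    target_db: str,
    result_db: str,
    tmp_dir: str,
    max_seqs: int = 10000,
    cluster_search: bool = True,
    ultra_sensitive: bool = False,
) -> List[str]:
    """Build an argument list for `foldseek search` (illustrative only)."""
    args = [
        "foldseek", "search",
        query_db, target_db, result_db, tmp_dir,
        "--max-seqs", str(max_seqs),
    ]
    if cluster_search:
        # search cluster representatives first, then within each cluster
        args += ["--cluster-search", "1"]
    if ultra_sensitive:
        # skip the prefilter for maximum sensitivity
        args += ["--exhaustive-search", "1"]
    return args
```

The returned list could be handed to `subprocess.run(args, check=True)` on a machine with Foldseek installed.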
Breaking CLI parameter changes
- `--pdb` has changed to `--structures`
- `--pdb_dir` has changed to `--structure_dir`
- `--filter_pdbs` has changed to `--filter_structures`
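If you have scripts that still use the old parameter names, a small helper along these lines could rewrite them. This is a hypothetical convenience sketch, not part of Phold; only the three renames themselves come from the list above.

```python
from typing import Iterable, List

# Old v0.1.x flag -> new v0.2.0 flag, as listed in the release notes.
RENAMED_FLAGS = {
    "--pdb": "--structures",
    "--pdb_dir": "--structure_dir",
    "--filter_pdbs": "--filter_structures",
}


def migrate_args(argv: Iterable[str]) -> List[str]:
    """Return a copy of `argv` with renamed flags updated.

    Handles both `--flag value` and `--flag=value` spellings; every other
    argument is passed through unchanged.
    """
    migrated = []
    for arg in argv:
        flag, sep, value = arg.partition("=")
        migrated.append(RENAMED_FLAGS.get(flag, flag) + sep + value)
    return migrated
```

For instance, `migrate_args(["--pdb_dir", "my_structures", "-t", "8"])` returns `["--structure_dir", "my_structures", "-t", "8"]`.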
v0.1.4
- Fixes #31 issue with older Pharokka genbank input (prior to v1.5.0) that lacked 'transl_table' field, thanks @btemperton
- All Pharokka genbank input from before v1.5.0 is treated as transl_table 11 (those versions predate the addition of pyrodigal-gv)
- Fixes genbank parsing bug that would occur if the ID/locus tag of the CDS features in the input genbank were longer than 54 characters
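The `transl_table` fallback described above can be sketched in Python as follows. This is an illustration rather than Phold's code: `get_transl_table` is a hypothetical name, and the sketch assumes qualifiers are stored as a mapping from qualifier name to a list of strings, which is how Biopython exposes them.

```python
from typing import Dict, List


def get_transl_table(qualifiers: Dict[str, List[str]], default: int = 11) -> int:
    """Return the CDS translation table, falling back to `default`.

    Older Pharokka GenBank files (before v1.5.0) lack the 'transl_table'
    qualifier, so table 11 is assumed for them.
    """
    values = qualifiers.get("transl_table")
    if not values:
        return default
    return int(values[0])
```

A CDS without the qualifier yields 11, while `{"transl_table": ["4"]}` yields 4.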