P-GRe is a software tool for predicting pseudogene structures at genome scale.

sebc31/P-GRe

P-GRe v.0.2


Contacts and authors

Sébastien Cabanac, PhD student, sebastien.cabanac@univ-tlse3.fr

Christophe Dunand, professor at Université Paul Sabatier Toulouse 3, christophe.dunand@univ-tlse3.fr

Catherine Mathé, associate professor at Université Paul Sabatier Toulouse 3, catherine.mathe-dehais@univ-tlse3.fr

About P-GRe

Pseudogenes are genomic sequences that are homologous to functional genes but harbor deleterious mutations, such as loss of the start codon, loss of coding sequence, gain of a stop codon or frameshifts. The goal of PseudoGene REtriever (P-GRe) is to locate pseudogenes in a genome and to infer their structures as pseudo-CDSs. P-GRe aims to be more user-friendly than other software with the same goal, with few dependencies and fully automated operation, while producing high-quality results with greater sensitivity. P-GRe relies on miniprot to align user-provided protein sequences to the genome, then filters the overlapping alignments. P-GRe also assigns each prediction to one of the three main categories of pseudogenes (unitary, duplicated or processed).

Input requirements

P-GRe requires three input files: the genome in FASTA format, the structural annotation of the genes present on this genome in GFF3 format, and the set of protein sequences encoded by these genes in FASTA format. Optionally, a second set of protein sequences from other organisms, also in FASTA format, can be provided to increase sensitivity and to annotate unitary pseudogenes.

The GFF3 file must contain features whose type field (3rd column) is "CDS". More importantly, on CDS lines, the attributes field (9th column) must contain the predefined tag "Parent", and the value associated with this tag must be identical to the identifier of the corresponding sequence in the protein sequences FASTA file. This is necessary because P-GRe uses this tag to match the structural annotation of each transcript to the corresponding protein sequence. Comparing the structure of each transcript with the structure inferred for the pseudogenes then allows P-GRe to identify intron loss events, which are needed to categorize pseudogenes.
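The Parent/identifier requirement above can be checked before launching a run. The following is a minimal sketch, not part of P-GRe: the function names (`fasta_ids`, `cds_parents`, `check_inputs`) are hypothetical, and it assumes that the FASTA identifier is the first whitespace-delimited word of each header line.

```python
def fasta_ids(fasta_text):
    """Return the set of sequence identifiers (first word after '>')."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                ids.add(header[0])
    return ids


def cds_parents(gff_text):
    """Return the set of Parent values found on CDS lines of a GFF3 text."""
    parents = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only well-formed CDS lines are of interest here
        for attr in cols[8].split(";"):
            key, sep, value = attr.strip().partition("=")
            if key == "Parent" and sep:
                # GFF3 allows several comma-separated parents
                parents.update(v for v in value.split(",") if v)
    return parents


def check_inputs(gff_text, fasta_text):
    """Return (parents missing from the FASTA, proteins without any CDS)."""
    parents = cds_parents(gff_text)
    proteins = fasta_ids(fasta_text)
    return sorted(parents - proteins), sorted(proteins - parents)
```

To use it, read the contents of genome_A_annotation.gff and proteins_A.fasta and pass them to `check_inputs`. If the first returned list is non-empty, some CDS lines point to a Parent that has no matching protein sequence.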

Installation requirements

P-GRe is designed to run on a Unix system. Before running P-GRe, make sure the following software is installed: bedtools (arq5x/bedtools2), miniprot (lh3/miniprot) and diamond (bbuchfink/diamond). P-GRe also requires a Python 3 interpreter.
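A quick way to confirm that the dependencies are reachable is to look them up on the PATH. This is a minimal sketch that is not shipped with P-GRe. It assumes the executables are installed under their usual names (`bedtools`, `miniprot`, `diamond`), and the helper name `missing_tools` is hypothetical.

```python
import shutil

# Assumed executable names for the three dependencies listed above.
REQUIRED_TOOLS = ("bedtools", "miniprot", "diamond")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` returns an empty list when all three programs are found. Any name it returns should be installed, or added to the PATH, before running P-GRe.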

Running P-GRe

To predict the structure of all pseudogenes in an organism's genome (genome_A.fasta) from that organism's protein sequences (proteins_A.fasta), a typical command line is:

PGRe.sh -f genome_A.fasta -g genome_A_annotation.gff -p proteins_A.fasta

To add a set of protein sequences from multiple organisms (proteins_B.fasta) to increase sensitivity and predict unitary pseudogenes, the typical command line would look like this:

PGRe.sh -f genome_A.fasta -g genome_A_annotation.gff -p proteins_A.fasta -u proteins_B.fasta
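When P-GRe is launched from a Python workflow, the two command lines above can be assembled programmatically. This is a minimal sketch using only the `-f`, `-g`, `-p` and `-u` options shown in this document. The function names are hypothetical, and the script name is a parameter because it may differ in your installation.

```python
import subprocess


def build_pgre_command(genome, annotation, proteins,
                       extra_proteins=None, script="PGRe.sh"):
    """Build the argument list for a P-GRe run."""
    cmd = [script, "-f", genome, "-g", annotation, "-p", proteins]
    if extra_proteins is not None:
        # Optional second protein set, used to predict unitary pseudogenes
        cmd += ["-u", extra_proteins]
    return cmd


def run_pgre(*args, **kwargs):
    """Run P-GRe and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_pgre_command(*args, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.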

Other options control multithreading and the output directory. The main output files are the structural annotation of the pseudogenes (PGRe.gff) and the sequences of the peptides encoded by the predicted pseudogenes (pseudogene_protein.fasta).
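As a first look at the results, the feature types in the output annotation can be tallied. This is a minimal sketch that assumes only that PGRe.gff is a tab-separated GFF file with the feature type in the 3rd column. The function name `count_feature_types` is hypothetical, and the feature types P-GRe actually writes are not documented here.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count features per type (3rd column) in a GFF text."""
    counts = Counter()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts
```

Reading the contents of PGRe.gff and passing them to `count_feature_types` gives a per-type summary that can be compared between runs.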

Citing P-GRe

To be added.

Future implementation

Planned features include, among others:

  1. Greater control over alignment and filter parameters for users.

  2. Checking the input GFF file to verify its formatting before processing.

  3. The ability to add functional annotation of genes to statistically identify fading functions.

Licence

All source code contained in the scripts folder and in the P-GRe_pipeline.sh file is released under the Artistic License 2.0.
