Description
Is your feature request related to a problem? Please describe.
A user, Aditi Nagaraj, found that Prodigal (run within treesapp assign) had fragmented a single RpoB protein sequence into five consecutive ORFs.
Example outputs can be generated from the following command with rpob_test.txt and the RpoB reference package from RefPkgs:
treesapp assign \
-i rpob_test.txt -o RpoB_fragment_test/ \
--refpkg_dir RefPkgs/Translation/RpoB/seed_refpkg/final_outputs/
Describe the solution you'd like
A single ORF should be reported in cases where the whole protein sequence has been fragmented into pieces.
The 'stitching' can happen after the profile HMM alignment results have been parsed. A new function needs to be written that compares the alignment positions of ORFs on a single contig or scaffold (i.e. parent sequence). If it finds multiple ORFs from the same parent sequence whose profile HMM positions do not overlap and are located on the same strand, then the ORFs must be stitched.
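The comparison described above could look roughly like the following sketch. It is not TreeSAPP code: the `OrfHit` record, its field names, and `find_stitch_candidates` are hypothetical, and it assumes each ORF has exactly one profile HMM alignment with inclusive start and end positions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OrfHit:
    """Hypothetical record for one ORF's profile HMM alignment."""
    orf_id: str
    parent: str     # contig or scaffold the ORF was predicted on
    strand: str     # "+" or "-"
    hmm_start: int  # first aligned profile HMM position (inclusive)
    hmm_end: int    # last aligned profile HMM position (inclusive)


def find_stitch_candidates(hits):
    """Return groups of ORFs that share a parent sequence and strand and
    whose profile HMM ranges do not overlap, ordered by HMM position."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[(hit.parent, hit.strand)].append(hit)

    candidates = []
    for group in grouped.values():
        if len(group) < 2:
            continue  # nothing to stitch
        group.sort(key=lambda h: h.hmm_start)
        # Once sorted by start, checking consecutive pairs is enough to
        # show that no two ranges in the group overlap.
        disjoint = all(prev.hmm_end < nxt.hmm_start
                       for prev, nxt in zip(group, group[1:]))
        if disjoint:
            candidates.append(group)
    return candidates


hits = [
    OrfHit("contig1_1", "contig1", "+", 1, 210),
    OrfHit("contig1_2", "contig1", "+", 215, 480),
    OrfHit("contig1_3", "contig1", "+", 490, 900),
    OrfHit("contig2_1", "contig2", "+", 1, 300),
    OrfHit("contig2_2", "contig2", "-", 310, 600),  # opposite strand
    OrfHit("contig3_1", "contig3", "+", 1, 300),
    OrfHit("contig3_2", "contig3", "+", 250, 600),  # overlaps contig3_1
]
stitchable = find_stitch_candidates(hits)
print([[h.orf_id for h in group] for group in stitchable])
# prints [['contig1_1', 'contig1_2', 'contig1_3']]
```

This sketch rejects a whole group if any two alignments overlap. A real implementation might instead tolerate a few overlapping positions, or keep the largest non-overlapping subset.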
Stitching involves going back to the (untranslated) input sequences, finding the start and stop positions, deducing the frame in which the ORFs were translated, and conceptually translating a single sequence using the same translation table that Prodigal used.
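A minimal sketch of the stitching step follows, under several assumptions: ORF coordinates are 1-based, inclusive, and given on the forward strand of the parent sequence; Prodigal was run with its default translation table 11, which assigns the same amino acids as the standard code; and the reading frame of the 5'-most ORF is used for the whole stitched sequence. The function names are hypothetical. Frameshifts between fragments are not handled, and internal stop codons are left in the output as `*`.

```python
from itertools import product

# Codon table in NCBI order (bases T, C, A, G). Table 11 assigns the same
# amino acids as the standard code; only the permitted start codons differ.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def stitch_and_translate(parent_seq, orf_coords, strand):
    """Translate the region spanning all ORF fragments as one sequence.

    orf_coords: (start, end) pairs, 1-based inclusive, forward-strand
    coordinates (an assumption of this sketch).
    """
    start = min(s for s, _ in orf_coords)
    end = max(e for _, e in orf_coords)
    nucl = parent_seq[start - 1:end].upper()
    if strand == "-":
        # The 5'-most ORF on the minus strand ends at `end`, so reverse
        # complementing the span puts its first codon at index 0.
        nucl = reverse_complement(nucl)
    nucl = nucl[:len(nucl) - len(nucl) % 3]  # drop a trailing partial codon
    codons = (nucl[i:i + 3] for i in range(0, len(nucl), 3))
    # Codons with ambiguous bases are translated to X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Toy parent sequence with two adjacent fragments at 3..11 and 12..20.
parent = "CCATGAAACCCGGGTTTTGACC"
protein = stitch_and_translate(parent, [(3, 11), (12, 20)], "+")
print(protein)  # prints MKPGF*
```

The same function covers the minus strand: passing the reverse complement of `parent` with coordinates `[(12, 20), (3, 11)]` and strand `"-"` yields the same protein.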