Hi @PatrickRWright @JensGeorg
CopraRNA checks for duplicated CDS within the all.fas file:
`CopraRNA/coprarna_aux/homology_intaRNA.pl`, line 341 in fdbca79:

    system "grep '>' all.fas | uniq -d > duplicated_CDS.txt";

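As far as I can tell, this check just collects FASTA header lines that occur more than once. A rough Python sketch of the same idea follows, assuming headers are the lines starting with `>` and that a "duplicate" means an identical full header line; the function name `duplicated_headers` is only illustrative and not part of CopraRNA. One difference: `uniq -d` only reports repeats on adjacent lines, while this sketch counts repeats anywhere in the file.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_headers(lines: Iterable[str]) -> List[str]:
    """Return FASTA header lines that occur more than once, in first-seen order."""
    # Keep only header lines, stripped of trailing newlines.
    headers = [line.rstrip("\n") for line in lines if line.startswith(">")]
    counts = Counter(headers)
    # Counter preserves insertion order, so the result follows file order.
    return [header for header, n in counts.items() if n > 1]


# Hypothetical usage on the file from the issue:
# with open("all.fas") as fh:
#     print("\n".join(duplicated_headers(fh)))
```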
Why? What is the problem with them? Even the recent E. coli genome (NC_000913) has duplicated CDS entries. For example, two CDS features share the locus tag b0149:
bash-4.2$ bzgrep -A10 -B10 b0149 ~/data/ncbi-refseq-gbk/NC_000913.gb.bz2
LQMLGALEGERLSAQGQKMAALGNDPRLAAMLVSAKNDDEAATAAKIAAILEEPPRMG
NSDLGVAFSRNQPAWQQRSQQLLKRLNVRGGEADSSLIAPLLAGAFADRIARRRGQDG
RYQLANGMGAMLDANDALSRHEWLIAPLLLQGSASPDARILLALLVDIDELVQRCPQL
VQQSDTVEWDDAQGTLKAWRRLQIGQLTVKVQPLAKPSEDELHQAMLNGIRDKGLSVL
NWTAEAEQLRLRLLCAAKWLPEYDWPAVDDESLLAALETWLLPHMTGVHSLRGLKSLD
IYQALRGLLDWGMQQRLDSELPAHYTVPTGSRIAIRYHEDNPPALAVRMQEMFGEATN
PTIAQGRVPLVLELLSPAQRPLQITRDLSDFWKGAYREVQKEMKGRYPKHVWPDDPAN
TAPTRRTKKYS"
gene 164730..167264
/gene="mrcB"
/locus_tag="b0149"
/gene_synonym="ECK0148; pbpF; ponB"
/db_xref="ASAP:ABE-0000516"
/db_xref="ECOCYC:EG10605"
/db_xref="GeneID:944843"
CDS 164730..167264
/gene="mrcB"
/locus_tag="b0149"
/gene_synonym="ECK0148; pbpF; ponB"
/EC_number="2.4.1.129"
/codon_start=1
/transl_table=11
/product="peptidoglycan glycosyltransferase/peptidoglycan
DD-transpeptidase MrcB"
/protein_id="NP_414691.1"
/db_xref="UniProtKB/Swiss-Prot:P02919"
/db_xref="ASAP:ABE-0000516"
/db_xref="ECOCYC:EG10605"
--
MLSARPLGVQPRGGVISPQPAFMQLVRQELQAKLGDKVKDLSGVKIFTTFDSVAQDAA
EKAAVEGIPALKKQRKLSDLETAIVVVDRFSGEVRAMVGGSEPQFAGYNRAMQARRSI
GSLAKPATYLTALSQPKIYRLNTWIADAPIALRQPNGQVWSPQNDDRRYSESGRVMLV
DALTRSMNVPTVNLGMALGLPAVTETWIKLGVPKDQLHPVPAMLLGALNLTPIEVAQA
FQTIASGGNRAPLSALRSVIAEDGKVLYQSFPQAERAVPAQAAYLTLWTMQQVVQRGT
GRQLGAKYPNLHLAGKTGTTNNNVDTWFAGIDGSTVTITWVGRDNNQPTKLYGASGAM
SIYQRYLANQTPTPLNLVPPEDIADMGVDYDGNFVCSGGMRILPVWTSDPQSLCQQSE
MQQQPSGNPFDQSSQPQQQPQQQPAQQEQKDSDGVAGWIKDMFGSN"
CDS 164865..167264
/gene="mrcB"
/locus_tag="b0149"
/gene_synonym="ECK0148; pbpF; ponB"
/codon_start=1
/transl_table=11
/product="PBP1Bgamma"
/protein_id="YP_010051172.1"
/db_xref="ASAP:ABE-0000516"
/db_xref="ECOCYC:EG10605"
/db_xref="GeneID:944843"
/translation="MPRKGKGKGKGRKPRGKRGWLWLLLKLAIVFAVLIAIYGVYLDQ
KIRSRIDGKVWQLPAAVYGRMVNLEPDMTISKNEMVKLLEATQYRQVSKMTRPGEFTV
Can this be ignored? If not, can we just prune all.fas to the first occurrence of each CDS?
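To make the pruning suggestion concrete, here is a minimal sketch of what I have in mind. It assumes all.fas is a plain FASTA file in which duplicated CDS show up as identical header lines; `prune_to_first_occurrence` and the output name `all.pruned.fas` are hypothetical, not existing CopraRNA code.

```python
from typing import Iterable, Iterator


def prune_to_first_occurrence(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines, keeping only the first record for each header."""
    seen = set()
    keep = True  # lines before the first header (if any) pass through unchanged
    for line in lines:
        if line.startswith(">"):
            header = line.rstrip("\n")
            # Keep this record only if its header has not been seen before.
            keep = header not in seen
            seen.add(header)
        if keep:
            yield line


# Hypothetical usage:
# with open("all.fas") as src, open("all.pruned.fas", "w") as dst:
#     dst.writelines(prune_to_first_occurrence(src))
```

For b0149 this would keep the first CDS listed (164730..167264, NP_414691.1) and drop the shorter PBP1Bgamma isoform (164865..167264), assuming all.fas follows the GenBank feature order. Whether that is the right one to keep is part of my question.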
thanks,
Martin