What is the problem with duplicated CDS? #35

@martin-raden

Hi @PatrickRWright @JensGeorg

CopraRNA checks for duplicated CDS within the all.fas file:

system "grep '>' all.fas | uniq -d > duplicated_CDS.txt";

Why? What is the problem with them? Even the recent E. coli genome contains duplicated CDS entries, for example two CDS features that share the locus tag b0149:

bash-4.2$ bzgrep -A10 -B10 b0149 ~/data/ncbi-refseq-gbk/NC_000913.gb.bz2
                     LQMLGALEGERLSAQGQKMAALGNDPRLAAMLVSAKNDDEAATAAKIAAILEEPPRMG
                     NSDLGVAFSRNQPAWQQRSQQLLKRLNVRGGEADSSLIAPLLAGAFADRIARRRGQDG
                     RYQLANGMGAMLDANDALSRHEWLIAPLLLQGSASPDARILLALLVDIDELVQRCPQL
                     VQQSDTVEWDDAQGTLKAWRRLQIGQLTVKVQPLAKPSEDELHQAMLNGIRDKGLSVL
                     NWTAEAEQLRLRLLCAAKWLPEYDWPAVDDESLLAALETWLLPHMTGVHSLRGLKSLD
                     IYQALRGLLDWGMQQRLDSELPAHYTVPTGSRIAIRYHEDNPPALAVRMQEMFGEATN
                     PTIAQGRVPLVLELLSPAQRPLQITRDLSDFWKGAYREVQKEMKGRYPKHVWPDDPAN
                     TAPTRRTKKYS"
     gene            164730..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
                     /db_xref="GeneID:944843"
     CDS             164730..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /EC_number="2.4.1.129"
                     /codon_start=1
                     /transl_table=11
                     /product="peptidoglycan glycosyltransferase/peptidoglycan
                     DD-transpeptidase MrcB"
                     /protein_id="NP_414691.1"
                     /db_xref="UniProtKB/Swiss-Prot:P02919"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
--
                     MLSARPLGVQPRGGVISPQPAFMQLVRQELQAKLGDKVKDLSGVKIFTTFDSVAQDAA
                     EKAAVEGIPALKKQRKLSDLETAIVVVDRFSGEVRAMVGGSEPQFAGYNRAMQARRSI
                     GSLAKPATYLTALSQPKIYRLNTWIADAPIALRQPNGQVWSPQNDDRRYSESGRVMLV
                     DALTRSMNVPTVNLGMALGLPAVTETWIKLGVPKDQLHPVPAMLLGALNLTPIEVAQA
                     FQTIASGGNRAPLSALRSVIAEDGKVLYQSFPQAERAVPAQAAYLTLWTMQQVVQRGT
                     GRQLGAKYPNLHLAGKTGTTNNNVDTWFAGIDGSTVTITWVGRDNNQPTKLYGASGAM
                     SIYQRYLANQTPTPLNLVPPEDIADMGVDYDGNFVCSGGMRILPVWTSDPQSLCQQSE
                     MQQQPSGNPFDQSSQPQQQPQQQPAQQEQKDSDGVAGWIKDMFGSN"
     CDS             164865..167264
                     /gene="mrcB"
                     /locus_tag="b0149"
                     /gene_synonym="ECK0148; pbpF; ponB"
                     /codon_start=1
                     /transl_table=11
                     /product="PBP1Bgamma"
                     /protein_id="YP_010051172.1"
                     /db_xref="ASAP:ABE-0000516"
                     /db_xref="ECOCYC:EG10605"
                     /db_xref="GeneID:944843"
                     /translation="MPRKGKGKGKGRKPRGKRGWLWLLLKLAIVFAVLIAIYGVYLDQ
                     KIRSRIDGKVWQLPAAVYGRMVNLEPDMTISKNEMVKLLEATQYRQVSKMTRPGEFTV

Can this be ignored? If not, can we just prune all.fas to the first occurrence of each CDS?
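To make the pruning idea concrete, here is a minimal sketch in Python. It is not CopraRNA code. It assumes that all.fas is a plain FASTA file and that two records count as "the same CDS" when their full header lines are identical; the function names and the output file name in the usage comment are hypothetical. Note that `uniq -d` only reports duplicates on adjacent lines, so the sketch also includes a check that finds repeated headers anywhere in the file.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header, sequence_lines) for each FASTA record."""
    header = None
    seq: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, seq


def duplicated_headers(lines: Iterable[str]) -> List[str]:
    """Headers that occur more than once, adjacent or not."""
    counts = Counter(header for header, _ in read_fasta(lines))
    return [header for header, n in counts.items() if n > 1]


def keep_first_occurrence(lines: Iterable[str]) -> Iterator[str]:
    """Yield the lines of each record whose header has not been seen before."""
    seen = set()
    for header, seq in read_fasta(lines):
        if header in seen:
            continue  # assumption: a repeated header means a repeated CDS
        seen.add(header)
        yield header
        yield from seq


# Hypothetical usage (file names are assumptions):
# with open("all.fas") as src, open("all.dedup.fas", "w") as dst:
#     for line in keep_first_occurrence(src):
#         dst.write(line + "\n")
```

One caveat with this approach: in the b0149 example above, the two CDS features have different start coordinates and different protein IDs, so keeping only the first occurrence would silently drop the shorter PBP1Bgamma variant.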

thanks,
Martin
