Dear authors,
Thank you for providing this repository with data from your work, and for making the analyses as reproducible as possible. It's really helpful. I am using the genomic data of the four placozoans for some comparative genomics, and I was extracting the full-length mRNAs of several genes from T. adhaerens using AGAT. However, when I extract the corresponding CDS (still using AGAT), the result does not match the one provided in the file Tadh_long.cds.fasta. It seems that the start and stop positions in the GTF are out of phase, i.e., the extracted sequence is shifted 64 nucleotides downstream. For clarity, see below:
- the command I used and the sequence I obtained:
$ agat_sp_extract_sequences.pl -g gene_of_interest_below.gtf -f Tadh_gDNA.fasta -t cds
>Tadh_TriadT26009 gene=Tadh_TriadG26009 seq_id=scaffold_5 type=cds
TTGTCTCAGATTGTCTATGGCGTATCATTTTAAGCTGTTTCATGACCCTATTTATGTTGATGAGCTGCATTGGTAATGGCGCCGTCTTACTCGTTTTACGCTACCATCATGATGATATCAAGTCGGCATCTAACTATTTTATCACTAATTTAGCCTTAACTGATTTTTTACTGGGCGTACTATGCATGCCCTGTATTTTGATTTCCTGCTTAAATGGGCAATGGGTTTTTGGTCAGACCTTATGCAGTTTAACAGGGTTTGCTAACTCATTTTTTTGTATTAATTCCATGATTACTTTAGCCGCTGTTAGTGTGGAAAAATACTGTGCTATTGCTTCACCATTGACATATCATCATTATATGAGCAAAAGTAAAGTCACATGTGTAATTTCAATTATATGGATCCATTCAGCTATTAATGCTAGTCTACCCTTTTTGGGCTGGGGAGAATATGTCTACCTTCCTTTCGAAACAATTTGCACAGTTGCTTGGTGGAGCTTTCCAAATTATGTTGGTTTTATAGTTGGTATTAATTTTGGACTACCTACCGTGATCATGAGTTGTACTTATTTCCTCATACTAAAAATTGCTCGTAAACATTCAAGGCGGATAGGTGTATCTACGTCAACTGTAGCAATTTCAACTTATCTAAGCCCAACTGGTACATATAATAACCTTAGTCCAGTTTTTATAGTCTGCTGGCTACCGCATCTTATTAGTATGATATATTTAACCATTTATGAAATAAGCCCGTTACCCTGTAGTTTTCATCAAATTACAACATGGCTAGCAATGGCTAACTCGGCTTTTAACCCAATCATATATGGAGCTATGGATACATCTATAAGAAAAGGTCTTAAAACCTTACTCGGATCCTGGGTAAAATATTGTAAATTATACTAAATTCGAATGAAATTGGTGCAGTTTTGTTGTTTATATTTATCGTATTTTTATTTCTTGCATA
- the same sequence as provided in Tadh_long.cds.fasta:
>Tadh_TriadT26009
ATGGCTGATACCTACATTAACAATTTCACGAATAAATCACTAGAGCTATGCAATGGGAGCCTAGTTGTCTCAGATTGTCTATGGCGTATCATTTTAAGCTGTTTCATGACCCTATTTATGTTGATGAGCTGCATTGGTAATGGCGCCGTCTTACTCGTTTTACGCTACCATCATGATGATATCAAGTCGGCATCTAACTATTTTATCACTAATTTAGCCTTAACTGATTTTTTACTGGGCGTACTATGCATGCCCTGTATTTTGATTTCCTGCTTAAATGGGCAATGGGTTTTTGGTCAGACCTTATGCAGTTTAACAGGGTTTGCTAACTCATTTTTTTGTATTAATTCCATGATTACTTTAGCCGCTGTTAGTGTGGAAAAATACTGTGCTATTGCTTCACCATTGACATATCATCATTATATGAGCAAAAGTAAAGTCACATGTGTAATTTCAATTATATGGATCCATTCAGCTATTAATGCTAGTCTACCCTTTTTGGGCTGGGGAGAATATGTCTACCTTCCTTTCGAAACAATTTGCACAGTTGCTTGGTGGAGCTTTCCAAATTATGTTGGTTTTATAGTTGGTATTAATTTTGGACTACCTACCGTGATCATGAGTTGTACTTATTTCCTCATACTAAAAATTGCTCGTAAACATTCAAGGCGGATAGGTGTATCTACCCGAAGAATACATTATAAAACACATATTAAAGCAACATTGATGTTATTAATTGTCATCGGTAGTTTTATAGTCTGCTGGCTACCGCATCTTATTAGTATGATATATTTAACCATTTATGAAATAAGCCCGTTACCCTGTAGTTTTCATCAAATTACAACATGGCTAGCAATGGCTAACTCGGCTTTTAACCCAATCATATATGGAGCTATGGATACATCTATAAGAAAAGGTCTTAAAACCTTACTCGGATCCTGGGTAAAATATTGTAAATTATAC
- the nucleotide alignment between the two:
>Tadh_TriadT26009
ATGGCTGATACCTACATTAACAATTTCACGAATAAATCACTAGAGCTATGCAATGGGAGCCTAGTTGTCTCAGATTGTCTATGGCGTATCATTTTAAGCTGTTTCATGACCCTATTTATGTTGATGAGCTGCATTGGTAATGGCGCCGTCTTACTCGTTTTACGCTACCATCATGATGATATCAAGTCGGCATCTAACTATTTTATCACTAATTTAGCCTTAACTGATTTTTTACTGGGCGTACTATGCATGCCCTGTATTTTGATTTCCTGCTTAAATGGGCAATGGGTTTTTGGTCAGACCTTATGCAGTTTAACAGGGTTTGCTAACTCATTTTTTTGTATTAATTCCATGATTACTTTAGCCGCTGTTAGTGTGGAAAAATACTGTGCTATTGCTTCACCATTGACATATCATCATTATATGAGCAAAAGTAAAGTCACATGTGTAATTTCAATTATATGGATCCATTCAGCTATTAATGCTAGTCTACCCTTTTTGGGCTGGGGAGAATATGTCTACCTTCCTTTCGAAACAATTTGCACAGTTGCTTGGTGGAGCTTTCCAAATTATGTTGGTTTTATAGTTGGTATTAATTTTGGACTACCTACCGTGATCATGAGTTGTACTTATTTCCTCATACTAAAAATTGCTCGTAAACATTCAAGGCGGATAGGTGTATCTACCCGAAGAATA-CATTATAAAACACATATTAAAGCAACATTGATGTTATTAATTGTCAT----CGGTAGTTTTATAGTCTGCTGGCTACCGCATCTTATTAGTATGATATATTTAACCATTTATGAAATAAGCCCGTTACCCTGTAGTTTTCATCAAATTACAACATGGCTAGCAATGGCTAACTCGGCTTTTAACCCAATCATATATGGAGCTATGGATACATCTATAAGAAAAGGTCTTAAAACCTTACTCGGATCCTGGGTAAAATATTGTAAATTATAC----------------------------------------------------------------
>Tadh_TriadT26009 gene=Tadh_TriadG26009 seq_id=scaffold_5 type=cds
----------------------------------------------------------------TTGTCTCAGATTGTCTATGGCGTATCATTTTAAGCTGTTTCATGACCCTATTTATGTTGATGAGCTGCATTGGTAATGGCGCCGTCTTACTCGTTTTACGCTACCATCATGATGATATCAAGTCGGCATCTAACTATTTTATCACTAATTTAGCCTTAACTGATTTTTTACTGGGCGTACTATGCATGCCCTGTATTTTGATTTCCTGCTTAAATGGGCAATGGGTTTTTGGTCAGACCTTATGCAGTTTAACAGGGTTTGCTAACTCATTTTTTTGTATTAATTCCATGATTACTTTAGCCGCTGTTAGTGTGGAAAAATACTGTGCTATTGCTTCACCATTGACATATCATCATTATATGAGCAAAAGTAAAGTCACATGTGTAATTTCAATTATATGGATCCATTCAGCTATTAATGCTAGTCTACCCTTTTTGGGCTGGGGAGAATATGTCTACCTTCCTTTCGAAACAATTTGCACAGTTGCTTGGTGGAGCTTTCCAAATTATGTTGGTTTTATAGTTGGTATTAATTTTGGACTACCTACCGTGATCATGAGTTGTACTTATTTCCTCATACTAAAAATTGCTCGTAAACATTCAAGGCGGATAGGTGTATCTACGTCAACTGTAGCAATTTCA---ACTTATCTAAGCCCAACTGGTACATATAATAACCTTAGTCCAGT--TTTTATAGTCTGCTGGCTACCGCATCTTATTAGTATGATATATTTAACCATTTATGAAATAAGCCCGTTACCCTGTAGTTTTCATCAAATTACAACATGGCTAGCAATGGCTAACTCGGCTTTTAACCCAATCATATATGGAGCTATGGATACATCTATAAGAAAAGGTCTTAAAACCTTACTCGGATCCTGGGTAAAATATTGTAAATTATACTAAATTCGAATGAAATTGGTGCAGTTTTGTTGTTTATATTTATCGTATTTTTATTTCTTGCATA
- and the corresponding gene model as per Tadh_long.annot.gtf:
scaffold_5 JGI transcript 3427178 3428332 . + . transcript_id "Tadh_TriadT26009"; gene_id "Tadh_TriadG26009";
scaffold_5 JGI exon 3427178 3427863 . + . transcript_id "Tadh_TriadT26009"; gene_id "Tadh_TriadG26009";
scaffold_5 JGI exon 3428053 3428332 . + . transcript_id "Tadh_TriadT26009"; gene_id "Tadh_TriadG26009";
scaffold_5 JGI CDS 3427178 3427863 . + 0 transcript_id "Tadh_TriadT26009"; gene_id "Tadh_TriadG26009";
scaffold_5 JGI CDS 3428053 3428329 . + 1 transcript_id "Tadh_TriadT26009"; gene_id "Tadh_TriadG26009";
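In case it helps to reproduce the check, here is a minimal Python sketch of how the offset and the annotated CDS length can be verified. The function names `find_offset` and `cds_length` are hypothetical helpers written for this post, not part of AGAT or of the repository, and the sketch assumes the GTF is tab-separated with 1-based, inclusive coordinates. The sequences are truncated to their first bases for brevity.

```python
def find_offset(reference: str, query: str, probe_len: int = 20) -> int:
    """Return the 0-based position in `reference` where the first `probe_len`
    bases of `query` occur, or -1 if they are not found."""
    return reference.upper().find(query[:probe_len].upper())


def cds_length(gtf_lines: list[str]) -> int:
    """Sum the lengths of all CDS features (assumes tab-separated GTF lines
    with 1-based, inclusive start and end coordinates)."""
    total = 0
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5 and fields[2] == "CDS":
            total += int(fields[4]) - int(fields[3]) + 1
    return total


# Start of the record from Tadh_long.cds.fasta (truncated).
published = (
    "ATGGCTGATACCTACATTAACAATTTCACGAATAAATCACTAGAGCTATGCAATGGGAGCCTAG"
    "TTGTCTCAGATTGTCTATGGCGTATCATTTTAAGC"
)
# Start of the sequence extracted with AGAT (truncated).
extracted = "TTGTCTCAGATTGTCTATGGCGTATCATTTTAAGCTGTTTC"

# The two CDS lines of the gene model from Tadh_long.annot.gtf.
attrs = 'transcript_id "Tadh_TriadT26009"; gene_id "Tadh_TriadG26009";'
gtf = [
    f"scaffold_5\tJGI\tCDS\t3427178\t3427863\t.\t+\t0\t{attrs}",
    f"scaffold_5\tJGI\tCDS\t3428053\t3428329\t.\t+\t1\t{attrs}",
]

print(find_offset(published, extracted))  # 64
print(cds_length(gtf))                    # 963
```

The first value is the number of bases in the published CDS that precede the start of the AGAT-extracted sequence. The second is the total CDS length implied by the GTF coordinates, which can be compared with the length of the published record.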
I have also noticed this issue with other genes from T. adhaerens, but apparently not with T. adhaerens H2. I haven't checked the other two species.
Am I doing something wrong or is there something I am not seeing? Thank you for helping!
Filippo