-
Notifications
You must be signed in to change notification settings - Fork 23
Description
Hi Rob. I'm using the --writeMappings option in Salmon 0.7.2, but reporting this issue here since I think the code base is (will be) the same. I have a few genes that we've noticed (based on biological intuition) are called expression when they shouldn't be. The mapping data has really helped to diagnose these transcripts! Here's one example:
This transcript has some repetitive sequence near the 3' end, which is masked by RepeatMasker:
>ENSMUST00000034602.7
GAATACCCATGAGTCTGTTGGCGGAGGATAGTGCTAACCCTGGCTGATCATCCTGTGGCTTGCCTCTATTTGTTGCATGATTCCCTCTGGAAGATGGAACACAGCGGGATTCTGGCTAGTCTGATACTGATTGCTGTTCTCCCCCAAGGGAGCCCCTTCAAGATACAAGTGACCGAATATGAGGACAAAGTATTTGTGACCTGCAATACCAGCGTCATGCATCTAGATGGAACGGTGGAAGGATGGTTTGCAAAGAATAAAACACTCAACTTGGGCAAAGGCGTTCTGGACCCACGAGGGATATATCTGTGTAATGGGACAGAGCAGCTGGCAAAGGTGGTGTCTTCTGTGCAAGTCCATTACCGAATGTGCCAGAACTGTGTGGAGCTAGACTCGGGCACCATGGCTGGTGTCATCTTCATTGACCTCATCGCAACTCTGCTCCTGGCTTTGGGCGTCTACTGCTTTGCAGGACATGAGACCGGAAGGCCTTCTGGGGCTGCTGAGGTTCAAGCACTGCTGAAGAATGAGCAGCTGTATCAGCCTCTTCGAGATCGTGAAGATACCCAGTACAGCCGTCTTGGAGGGAACTGGCCCCGGAACAAGAAATCTTAAGATGGAGGTTTCTCAAAGTGCCATAGCAACGGCGCCTTCCCCTGTGATCAACCAATAAAGACGTTTCCTTCCGTCGGCAGGCGCCTGTGTTTTCCAAAGCATGGATTGAGAGTTGTCTTAGCTTGGCTGAGTTCTGCCCACCCGTGGCCCCTCACTTCTGGCTTCCTCTCCTCAGACCCAGACAGGCAGTACGAGGGACAGCTTCCCATCCTGGCCCCTTGGCATTTCTAACCCTGTGGGTTTCCTGTGGGCAGGGTCAGCCCAGCCACTCTGCTTCATTGCCTGTTGGTTCCAGTCCCTATCATAACAACTTCCGGGATTTTGCTGCCTTACCACAGCCCAAAGCAGGCTGTGGTGTTTAGAAGCTGAGCCACAAAGAAGTTTCCATGACATCATGAATGGGGGTGGCAGAGAAGAATATTGGGGCTCAGAGGgtgtgtgtgtgtgtgtgtgtgtgtgtgtgagtgagagagagagagagagagagagagagagagagagagagagacagacagagagacagagagacagaaagacagaaagacagaaagacagagacagagctgtcAAGaattttcttctcacttcgtgggttctggggattgtactaaggtcatgaggcttgtgtggcaagcacccttacccactgagccatcttgcctgcccTCATCCTCaaattaattaaaaataaagaacatgatgaattcc
The IGV screenshot has two tracks: the top track is using the transcript sequence as is, and the bottom is converting all of the RepeatMasker soft-masked bases to hard-masking ("N"). You can see that the mapping is greatly improved. I'm not sure other transcripts will benefit so much (or might even be harmed) by masking repetitive elements, but have you considered any way to deal with this issue? The reads mapping to this transcript are:
1 163 ENSMUST00000034602.7 146 1 80M = 319 253 AAGGGAGCCCCTTCAAGATACAAGTGACCGAATATGAGGACAAAGTATTTGTGACCTGCAATACCAGCGTCATGCATCTA * NH:i:1
2 83 ENSMUST00000034602.7 319 1 80M = 146 -253 ACAGAGCAGCTGGCAAAGGTGGTGTCTTCTGTGCAAGTCCATTACCGAATGTGCCAGAACTGTGTGGAGCTAGACTCCGG * NH:i:1
3 121 ENSMUST00000034602.7 976 1 125M = 976 0 CCAGTTGGTTGTCTAGCACCAAATAGCTCAGAAAATACACATAAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACA * NH:i:1
4 377 ENSMUST00000034602.7 988 1 125M = 988 0 CCCTGTATTATAATGACACTGGGATGTGAAGCCAGCTCTTCTCACTCCAGCCTGCTCTTTGTCTGCTTCGGCACACCTCTTATCACCCCTCAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG * NH:i:15
5 121 ENSMUST00000034602.7 1007 1 80M = 1007 0 GAACCGGGATTGGTCTATTTGTCCTTTTAGAGGCCAGACAACTGCATATGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAG * NH:i:1
6 163 ENSMUST00000034602.7 1010 1 80M = 1018 88 ATACACATAAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGA * NH:i:1
7 121 ENSMUST00000034602.7 1017 1 80M = 1017 0 CAAATACTTTTAAAGCAGTCCAAGTTTATTTTCATTCGTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAG * NH:i:1
8 83 ENSMUST00000034602.7 1018 1 80M = 1010 -88 AAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGC * NH:i:1
9 121 ENSMUST00000034602.7 1021 1 80M = 1021 0 TAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACA * NH:i:1
10 163 ENSMUST00000034602.7 1022 1 125M = 1056 159 AATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACATAACAGCAATTAATGAAA * NH:i:1
11 105 ENSMUST00000034602.7 1024 1 80M = 1024 0 CCCGGCATAGTCTGATTTATTTATTTTTCATTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGA * NH:i:1
12 153 ENSMUST00000034602.7 1030 1 80M = 1030 0 GTTTCTTAATCTCTCTTTCTCTTTCTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGTCTTTTCTATACCAGTTTATA * NH:i:1
13 361 ENSMUST00000034602.7 1033 1 80M = 1033 0 GTCAGCGCTAAGAAGTCCTGTTAGACCGTGTGCGTGTGCGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG * NH:i:2
14 393 ENSMUST00000034602.7 1035 1 80M = 1035 0 TAATAATTCTCATAATTTTTTTCTCCTTACCATGTGCATGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG * NH:i:2
15 377 ENSMUST00000034602.7 1035 1 80M = 1035 0 GCAGAAAGTTAAATAAAAGGTAAGTACGTGTGTGTGGCCAAATCAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG * NH:i:15
16 361 ENSMUST00000034602.7 1036 1 80M = 1036 0 AGCGCTAAGAAGTCCTGTTAGACCGTGTGCGTGTGCGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA * NH:i:2
17 361 ENSMUST00000034602.7 1044 1 80M = 1044 0 CGGAAATGTGAAAGTTCTGAGACAGACATATAGACAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA * NH:i:7
18 153 ENSMUST00000034602.7 1050 1 80M = 1050 0 GTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACAT * NH:i:1
19 99 ENSMUST00000034602.7 1054 1 125M = 1065 136 GTGTGAGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGACTAAGTTAAATTTAAGAGTTCCATCTGACTTGAATGAGCAGCACAAACTTAACACTAACCACAAAATACCTGGGGCTTTCT * NH:i:1
20 83 ENSMUST00000034602.7 1056 1 125M = 1022 -159 AAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACATAACAGCAATTAATGAAAAAAGAGGACATGAATTTAAAAGAGCAAGGAGGGG * NH:i:1
21 105 ENSMUST00000034602.7 1056 1 80M = 1056 0 GTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGCGCGTGAGAGAGAGCGGGCGAGCACACTCTGTGCCTCTCAGGAAG * NH:i:1
22 121 ENSMUST00000034602.7 1056 1 80M = 1056 0 GTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGCAAGAGCGAGCGAGAGCGAAAGCATGTTCAGGT * NH:i:1
23 153 ENSMUST00000034602.7 1058 1 80M = 1058 0 GAGAAAATGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGCGAGCGCAGGCTGGTGCCTTTTTCAGTAACTTCCC * NH:i:1
24 147 ENSMUST00000034602.7 1065 1 125M = 1054 -136 TGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGACTAAGTTAAATTTAAGAGTTCCATCTGACTTGAATGAGCAGCACAAACTTAACACTAACCACAAAATACCTGGGGCTTTCTCAGTCAGAAAC * NH:i:1
25 361 ENSMUST00000034602.7 1086 1 80M = 1086 0 GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAGTGTGCAAATGATGCAAAGCCTACTGCCTAAGTTGTATTT * NH:i:22
26 409 ENSMUST00000034602.7 1094 1 80M = 1094 0 GAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAGAAACGTGACTTCCTCTTTTTCTATTTGTAGAACATTTATTTTCTTCC * NH:i:27
27 83 ENSMUST00000034602.7 1134 1 125M = 1137 128 GACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAG * NH:i:1
28 163 ENSMUST00000034602.7 1137 1 125M = 1134 -128 AGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAA * NH:i:1
29 99 ENSMUST00000034602.7 1137 1 80M = 1139 82 AGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGAC * NH:i:1
30 147 ENSMUST00000034602.7 1139 1 80M = 1137 -82 AAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAG * NH:i:1
31 105 ENSMUST00000034602.7 1141 1 125M = 1141 0 AGACAGAAAGACAGAAAGACAGAAAGACAGAAACACATAGAGGTGTACTGTACAACTGGTAAGAAAAGGCATATTTTTTAAATATAGAAGTGCCTACATATAGAGTGTACTTGGGAAAGAAAAGA * NH:i:1
32 169 ENSMUST00000034602.7 1225 1 80M = 1225 0 TAACATCATCAGGCTTGGTGGCAAGCACCCTTACCCACTGAGCCATCTAATAGCTCTTCTCTTGAATTTTTTTAAACACC * NH:i:1
