truvari/Notes.txt at develop · sonny-mo/truvari · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130

Need to change the readme so that it only has the pure basics
Then we point to wiki pages for each of the submodules

So I need to make detailed documentation for each of the sub-pages


Original Notes:
    """ Want to pull out this structure from all of them. Need to figure out how to get the ID in there
    And I'm trying to figure out if bcftools annotate with the bed is good enough or if I need
    to look into Survivor_ant so I can do more 'reciprocal overlap' type things to match up calls...
    Or do I make my own annotation engine in Truvari? Dunno. Hard to say..
    I know for sure that if I do that, though, I'm going to do a 'annot_source_creator' so I can
    index/save the Interval Tree data sources and then reuse them often... then its just looking up for comp entry...

    Actually, I really like that idea all around...
    Can I handle all SV Types?
    Can I also annotate the SNPs/INDELs? How would I separate them?
    Would have a 'known matching' mode and a 'genome region' mode. anything genome region could just go
    into SNPs/ INDELs... Then I could do a name lookup....
    The key differentiation would be speed maybe once you create the IntervalTree index of knowns.
    But it could also be ease of use and collection of well curated annotations for the index
    So its essental SV annovar...
    But then I can see what could be done to replicate annovar. And additionally VEP
    And additionally I wouldn't mind REPEATMASKING the resolved insertion sequences...

    This is getting out of scope for core Truvari... but Truvari could become a next level
    Truannovari

    Imagine a tool where you just give it a text file of annotations to add.. anda

    PLUS -- how could would it be to have the index create a lookup for each of the sources
    So instead of annotating the VCF with EVERY single piece of information, you can make a single
    truvari-anno key that, with the truvari api, you could then retreive all the information...
    If I structured it in a way that was easily consumed by an ML, that would be awesome...
    1-hot encoding? Then its pretty easily consumed??
    I'm just going to focus on this to start...
    CHROM START END ID MSTART MLEN MEND MEINFO AF EAS_AF EUR_AF AFR_AF AMR_AF SAS_AF ALT"""

Okay, bcftools isn't working....

So I'm going to make a tool.

1) vcf-to-db:
	I want to transform the vcf to a pickle so I can use it more quickly.
2) Matching Rules:
	I need to make rules per-database that say how to match...
	For example, All the INS types of 1kg should match to an INS call.
	Sequence comparison should be turned off.
	I'll reciprocal overlap should be turned on for DELs
	INS don't need that, though.
	So I'll need to figure out a light-rule-script language that we can writea

3) Reusing Truvari:
	I need to modularize the Truvari functionality so that I can easily lookup things

4) Annotation Sources:
	Need to encode each of the sources' details that would be placed on the output VCF

5) SNPs/Indels - this is way too extra, but I want to be able to auto annotate these things also.
	e.g. Annovar?

6) Matching operations... Do I want to report the TruScore type metrics for everything??
	Generic Annotation Source
	BND Rules - annotate just the start/end positions (particullarly for DELs), we don't care whats deleted,
	but what is mediating it?

7) RepeatMasker -
	Would like to lookup inserted sequences to figure out of they contain a functional element or known repeat or
	something

8) Locus Simplifier -- doesn't really fit here.

9) Micro-homology -
	report how much microhomology is around the breakpoints because that's a thing that exists


Actual todolist -

#1 - Make Truvari Modular.
	Take out the GenomicTree
	The matching operators
	Generalize how we edit the header
	Make sure pysam writers are fine...

#2 - Make create-anno
	Given a vcf or a bed file, create the corresponding IntervalTrees
	I could require that a lot of the match-rules work is done here.
	So then its just a simple 'TAG', "COMPARESVTYPE" etc that we need to lookup?

#3 - Match rules:
	See (6) above, figure out how to code that beside the created anno
	This will be tricky. Will need to do a lot of regular expression stuff?
		Or will need to create an ontology for things like START/END/SVTYPE etc

#4 - RepeatMasker.. It's going to be very irritating getting that thing wrapped up,
	And I need to figure out if I can use it commercially. Lots of DFAM requirements


https://github.com/bcbio/bcbio-nextgen/issues/963


create-anno

	Given an input .bed or .vcf file (maybe GFF eventually, too),
	And a --header (for what the fields are going to be added in the vcf)
	And a set of --rules (need to make it so you can put those together)
	Build an annotation object
	If we want special rules for matching, we'll recommend you manually edit the pickled object created.

anno
	Given a file with a list of paths to annotations.pkl
	And a vcf
	Annotate every vcf entry

And if I want to create custom annotations (for things like 'repeatmasker', then I can just make a new class
for it and then make sure it has the \.annotate method.

I'm also going to need filtering rules. (e.g. VARTYPE== and VARSIZE(min, max) and possibly)
and I want bcftools +fill-tags functionality, also.


1) Collect the annotation sources I want to use.
2) For each annotation source - figure out a method of how I want calls to match up to them
	Run this with some tests/figures so you can see how what 'feels' best.
	This will also allow you to build the `truvari anno` for each file.
3) Pool the annotion sources, anno procedures into a single protocol. Run on everything.