You can use Variant Transforms to load multiple files at once
by providing a pattern like --input_pattern gs://mybucket/*.vcf. This
will create a merged BigQuery schema that contains all variants from all
matching files. Note that the VCF fields must be compatible across files (e.g.
INFO and FORMAT fields with the same key have compatible types and numbers
across files).
However, by default, calls between files will not be merged. For instance, consider the following VCF files:
gs://mybucket/a.vcf:
## <headers omitted>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2
1 100 rs1 A T 29 PASS DP=14 GT:GQ 0|1:48 0/0:30
gs://mybucket/b.vcf:
## <headers omitted>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S3
1 100 rs1 A T 7 q10 DP=11;AA=G GT:GQ 1/1:2
If you load both files together (i.e. by specifying
--input_pattern gs://mybucket/*.vcf), you would end up with 2 rows in
BigQuery. One with calls S1 and S2, and another with just call S3 even
though all calls have the same variant.
Variant merging can be used to merge calls across files. There are different
merging strategies and you can specify any of the following
strategies by adding --variant_merge_strategy <merge_strategy> to the
arguments.
This strategy merges calls that have the same reference_name,
start_position, end_position, reference_bases, and all alternate_bases
across files. By default, the variant fields are merged as follows:
Callsare merged together in no particular order.- The highest
qualityvalue is chosen among all variants being merged. - All
filtervalues are merged and deduplicated. - All
names(IDcolumn in the VCF spec) are merged and deduplicated. - For each
INFOfield, one is chosen as representative for the entire merged variant record in no particular order. Note that for repeated fields, a single set is chosen as representative for that particular field.
Depending on your use case, you may customize the merge strategy as follows:
- Adding
--copy_quality_to_callsand/or--copy_filter_to_callsto the arguments will copy thequalityand/orfiltervalues in each file to the set of calls specified in that file. This option is useful when merging single-call VCF files asqualityandfiltervalues usually correspond to call-level quality/filter. Note that the variant-levelqualityandfiltervalues will still be merged according to the above logic. - Adding
--info_keys_to_move_to_calls_regex <regex>will moveINFOfields that match the provided regular expression to the associated calls. This is useful to ensure allINFOfields are kept when merging variants. For instance, specifying.*will move allINFOkeys to calls, or specifying^(AA|AF)$will only moveAAandAFfields to calls. See here for more details on the regular expression specifications. Note that if a key is the same in bothINFOandFORMAT, then such a key cannot be moved to calls and an error will be raised.
The examples below help with illustrating the details of the above options.
You may also read the code and documentation in
move_to_calls_strategy.py.
Particularly the get_merged_variants method implements the merging logic.
The following examples illustrate the merging outcome using the example files
gs://mybucket/a.vcf and gs://mybucket/b.vcf that are defined above.
- Default (i.e. just adding
--variant_merge_strategy MOVE_TO_CALLS):
reference_name: "1"
start_position: 100
end_position: 101
reference_bases: "A"
alternate_bases: {alt: "T"}
names: ["rs1"]
quality: 29
filter: ["PASS", "q10"]
call: {
[
name: "S3",
genotype: [1, 1]
phaseset: null
GQ: 2
],
[
name: "S1",
genotype: [0, 1]
phaseset: "*"
GQ: 48
],
[
name: "S2",
genotype: [0, 0]
phaseset: null
GQ: 30
]
}
DP: 14 # This can also be 11.
AA: "G"
- With
--copy_quality_to_callsand--copy_filter_to_calls:
reference_name: "1"
start_position: 100
end_position: 101
reference_bases: "A"
alternate_bases: {alt: "T"}
names: ["rs1"]
quality: 29
filter: ["PASS", "q10"]
call: {
[
name: "S3",
genotype: [1, 1]
phaseset: null
GQ: 2
quality: 7
filter: ["q10"]
],
[
name: "S1",
genotype: [0, 1]
phaseset: "*"
GQ: 48
quality: 29
filter: ["PASS"]
],
[
name: "S2",
genotype: [0, 0]
phaseset: null
GQ: 30
quality: 29
filter: ["PASS"]
]
}
DP: 14 # This can also be 11.
AA: "G"
- With
--info_keys_to_move_to_calls_regex ^DP$,--copy_quality_to_calls,--copy_filter_to_calls:
reference_name: "1"
start_position: 100
end_position: 101
reference_bases: "A"
alternate_bases: {alt: "T"}
names: ["rs1"]
quality: 29
filter: ["PASS", "q10"]
call: {
[
name: "S3",
genotype: [1, 1]
phaseset: null
GQ: 2
quality: 7
filter: ["q10"]
DP: 11
],
[
name: "S1",
genotype: [0, 1]
phaseset: "*"
GQ: 48
quality: 29
filter: ["PASS"]
DP: 14
],
[
name: "S2",
genotype: [0, 0]
phaseset: null
GQ: 30
quality: 29
filter: ["PASS"]
DP: 14
]
}
AA: "G"
This strategy is not yet production ready and can't be used by users. This section is included here to provide more context about our ongoing work (Issue #80 and Issue #88).
The MOVE_TO_CALLS strategy is effective for merging calls that have exactly
the same reference and alternate bases. However, this strategy is not effective
for merging
gVCF files
since non-variants are represented as intervals in the genome.
Consider the following VCF file:
gs://mybucket/c.g.vcf:
## <headers omitted>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S4
1 50 . T <NON_REF> . . END=150 GT:GQ 0/0:40
If you specify --input_pattern gs://mybucket/*.vcf with MOVE_TO_CALLS merge
strategy, the non-variant region from call S4 will not be merged with calls
S1, S2, and S3. However, we know that call S4 has genotype 0/0 in
chr1:100 since it matches the reference from position 50-150 in chr1.
The MERGE_WITH_NON_VARIANTS strategy will infer 0/0 for call S4 at
chr1:100 so that the resulting BigQuery table has a single row with all calls
S1-4.
Our initial work is focused on correctly calling SNPs (Issue #80). We will then attempt to merge insertions/deletions (Issue #88).