fix: sentieon gvcftyper accepts intervals #11582
Conversation
Please join the nf-core organization on GitHub to enable the CI tests to run on your PR. You can request to join the organization via #github-invitations in the nf-core Slack. You can join the nf-core Slack via https://nf-co.re/join.
Can you also add some line breaks in the script where sentieon is called while you are updating the module?
Per @SPPearce review on nf-core#11582 — match the multi-line format used in sentieon/haplotyper.
I’ve requested access to the nf-core GitHub organisation via #github-invitations so CI can run on the PR. I’ve also pushed a small follow-up commit formatting the sentieon driver call over multiple lines while keeping the same logic. Once CI runs with the Sentieon credentials, I expect the old intervals snapshots to fail because they encoded the previous unconstrained behaviour. I’m happy to update the snapshots from the CI output if the new expected md5s are visible.
Description
`SENTIEON_GVCFTYPER` declares `path(intervals)` as an input and stages the file into the work dir, but the rendered `sentieon driver` command never references it, so GVCFtyper has been running unconstrained even when callers pass per-interval BEDs.
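As a rough illustration of the missing wiring, the shell sketch below adds `--interval` to the driver call only when an intervals file is supplied. It prints the command instead of running it. The helper names `interval_arg` and `render_command` and the file names `genome.fasta`, `sample.g.vcf.gz` and `out.vcf.gz` are placeholders for this example, not taken from the module.

```shell
# Hypothetical helper: print the interval option only when a BED file is given.
interval_arg() {
    if [ -n "$1" ]; then
        printf '%s' "--interval $1"
    fi
}

# Hypothetical helper: render (but do not execute) the driver command.
# File names are placeholders chosen for this sketch.
render_command() {
    opt="$(interval_arg "$1")"
    # ${opt:+ $opt} expands to " $opt" only when opt is non-empty,
    # so the no-intervals command carries no stray whitespace.
    printf '%s\n' "sentieon driver -r genome.fasta${opt:+ $opt} --algo GVCFtyper -v sample.g.vcf.gz out.vcf.gz"
}

render_command "genome.bed"   # command with the interval restriction
render_command ""             # command without it
```

The point of the sketch is only that a staged input has no effect unless it reaches the command line; the actual module builds the argument inside its Nextflow script block.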
The fix mirrors the pattern already used in `sentieon/haplotyper` (`modules/nf-core/sentieon/haplotyper/main.nf:39`).
Reason for PR
In a multi-sample joint-genotyping pipeline (e.g. nf-core/sarek's sentieon path), GVCFtyper is invoked per shard with a per-interval BED so each shard genotypes only its assigned region; the per-interval VCFs are then concatenated by GATK4 MergeVcfs (which assumes non-overlapping inputs). Without `--interval`, every shard processes the full gVCF, so neighbouring shards emit overlapping records at interval boundaries, and the concatenation then produces duplicate/overlapping variant calls. Single-sample runs are largely unaffected, which is why this slipped through.
Evidence the input was being ignored
Looking at `tests/main.nf.test.snap` before this PR:

| Test | md5 | |
|---|---|---|
| sentieon gvcftyper vcf | d13216836f1452e200b215b796606671 | |
| sentieon gvcftyper vcf.gz | d13216836f1452e200b215b796606671 | |
| sentieon gvcftyper intervals | d13216836f1452e200b215b796606671 | ← |
| sentieon gvcftyper dbsnp intervals | 21606383c760bf676d4c1f747b97d118 | |
| sentieon gvcftyper dbsnp | 21606383c760bf676d4c1f747b97d118 | ← |

The intervals tests produced the exact same md5 as the matching no-intervals runs: proof that the BED was being silently ignored.
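The comparison can be sketched in shell using the md5 values quoted above. `check_pair` is a hypothetical helper written for this example; it only compares two checksum strings and does not read the snapshot file.

```shell
# md5s copied from the pre-PR snapshot values quoted above
md5_vcf="d13216836f1452e200b215b796606671"
md5_intervals="d13216836f1452e200b215b796606671"
md5_dbsnp="21606383c760bf676d4c1f747b97d118"
md5_dbsnp_intervals="21606383c760bf676d4c1f747b97d118"

# Hypothetical helper: compare a no-intervals md5 with its intervals counterpart.
# $1 = label, $2 = md5 without intervals, $3 = md5 with intervals
# Returns non-zero when the two runs are indistinguishable.
check_pair() {
    if [ "$2" = "$3" ]; then
        echo "$1: identical md5, intervals had no effect"
        return 1
    fi
    echo "$1: md5 differs, intervals changed the output"
    return 0
}

# '|| true' keeps the script going so both pairs are reported
check_pair "plain" "$md5_vcf" "$md5_intervals" || true
check_pair "dbsnp" "$md5_dbsnp" "$md5_dbsnp_intervals" || true
```

Both pairs report identical checksums for the pre-PR snapshots, which is the same observation the table makes.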
Test added
A regression test (`sentieon gvcftyper intervals constrain output - regression`) runs GVCFtyper with the existing `genome.bed` and asserts that the produced md5 differs from the buggy md5 that was previously locked in. This will fail loudly if `--interval` is ever dropped from the command again.
Snapshot regeneration ⚠️
The two existing snapshots that encoded the bug (`sentieon gvcftyper intervals` and `sentieon gvcftyper dbsnp intervals`) will fail on the first CI run because their md5s were computed under the broken behaviour. They need to be regenerated with `--update-snapshot` in an environment with full Sentieon credentials. I couldn't do this locally; I'm happy to push a follow-up commit once a maintainer with the auth secrets can confirm the expected new md5s, or apply them in this branch if a reviewer pastes them.
PR checklist
- existing module
- `genome.bed` from `nf-core/test-datasets`
- `topic: versions`: unchanged from existing module
- `label`: `label 'process_high'` already present
- `nf-core modules test sentieon/gvcftyper --profile docker`
- `nf-core modules test sentieon/gvcftyper --profile singularity`
- `nf-core modules test sentieon/gvcftyper --profile conda`