Skip to content

fix: sentieon gvcftyper accepts intervals#11582

Open
nicorap wants to merge 4 commits into
nf-core:masterfrom
nicorap:master
Open

fix: sentieon gvcftyper accepts intervals#11582
nicorap wants to merge 4 commits into
nf-core:masterfrom
nicorap:master

Conversation

@nicorap
Copy link
Copy Markdown

@nicorap nicorap commented May 11, 2026

Description

SENTIEON_GVCFTYPER declares path(intervals) as an input and stages the file into the work dir, but the rendered sentieon driver command never references it — so GVCFtyper has been
running unconstrained even when callers pass per-interval BEDs.

+ def interval_command = intervals ? "--interval ${intervals}" : ""
  ...
- sentieon driver -r ${fasta} --algo GVCFtyper ${gvcfs_input} ${dbsnp_cmd} ${prefix}.vcf.gz
+ sentieon driver -r ${fasta} ${interval_command} --algo GVCFtyper ${gvcfs_input} ${dbsnp_cmd} ${prefix}.vcf.gz

The fix mirrors the pattern already used in sentieon/haplotyper (modules/nf-core/sentieon/haplotyper/main.nf:39).

Reason for PR

In a multi-sample joint-genotyping pipeline (e.g. nf-core/sarek's sentieon path), GVCFtyper is invoked per shard with a per-interval BED so each shard genotypes only its assigned region; the per-interval VCFs are then concatenated by GATK4 MergeVcfs (which assumes non-overlapping inputs). Without --interval, every shard processes the full gVCF, so neighbouring shards emit overlapping records at interval boundaries — the concat then produces duplicate/overlapping variant calls. Single-sample runs are largely unaffected, which is why this slipped through.

Evidence the input was being ignored

Looking at tests/main.nf.test.snap before this PR:

Test variantsMD5
sentieon gvcftyper vcf d13216836f1452e200b215b796606671
sentieon gvcftyper vcf.gz d13216836f1452e200b215b796606671
sentieon gvcftyper intervals d13216836f1452e200b215b796606671
sentieon gvcftyper dbsnp intervals 21606383c760bf676d4c1f747b97d118
sentieon gvcftyper dbsnp 21606383c760bf676d4c1f747b97d118

The intervals tests produced the exact same md5 as the matching no-intervals runs — proof that the BED was being silently ignored.

Test added

A regression test (sentieon gvcftyper intervals constrain output - regression) runs GVCFtyper with the existing genome.bed and asserts:

assert path(process.out.vcf_gz[0][1]).vcf.variantsMD5 != "d13216836f1452e200b215b796606671"

i.e. the produced md5 must differ from the buggy md5 that was previously locked in. This will fail loudly if --interval is ever dropped from the command again.

Snapshot regeneration ⚠️

The two existing snapshots that encoded the bug (sentieon gvcftyper intervals and sentieon gvcftyper dbsnp intervals) will fail on the first CI run because their md5s were computed under the broken behaviour. They need to be regenerated with --update-snapshot in an environment with full Sentieon credentials. I couldn't do this locally — happy to take a follow-up
commit once a maintainer with the auth secrets can confirm the expected new md5s, or apply them in this branch if a reviewer pastes them.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docsN/A,
    existing module
  • If necessary, include test data in your PR. — uses existing genome.bed from nf-core/test-datasets
  • Remove all TODO statements.
  • Broadcast software version numbers to topic: versionsunchanged from existing module
  • Follow the naming conventions. — unchanged
  • Follow the parameters requirements. — unchanged
  • Follow the input/output options guidelines. — unchanged
  • Add a resource labellabel 'process_high' already present
  • Use BioConda and BioContainers if possible to fulfil software requirements. — already configured
  • For modules:
    • nf-core modules test sentieon/gvcftyper --profile docker
    • nf-core modules test sentieon/gvcftyper --profile singularity
    • nf-core modules test sentieon/gvcftyper --profile conda

Could not run the three test-profile boxes locally: the Sentieon container needs SENTIEON_AUTH_MECH + SENTIEON_AUTH_DATA (plus SENTIEON_LICSRVR_IP) which I don't have outside CI.
Relying on the CI matrix to validate. The intervals snapshots will need regeneration as noted above.

@famosab
Copy link
Copy Markdown
Contributor

famosab commented May 11, 2026

Please join the nf-core organization on GitHub to enable the CI-tests to run on your PR. You can request to join the organization via #github-invitations in the nf-core slack. You can join the nf-core slack via https://nf-co.re/join.

@SPPearce
Copy link
Copy Markdown
Contributor

Can you also add some line breaks in the script where sentieon is called, while you are updating the module.

  Per @SPPearce review on nf-core#11582 — match the multi-line format used
  in sentieon/haplotyper.
@nicorap
Copy link
Copy Markdown
Author

nicorap commented May 15, 2026

Thanks @famosab @SPPearce !

I’ve requested access to the nf-core GitHub organisation via #github-invitations so CI can run on the PR.

I’ve also pushed a small follow-up commit formatting the sentieon driver call over multiple lines while keeping the same logic.

Once CI runs with the Sentieon credentials, I expect the old intervals snapshots to fail because they encoded the previous unconstrained behaviour. I’m happy to update the snapshots from the CI output if the new expected md5s are visible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants