Add snpclustering subworkflow #11059
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| #!/usr/bin/env nextflow | ||
| nextflow.enable.dsl = 2 | ||
| | ||
| /* | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| IMPORT NF-CORE MODULES | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| */ | ||
| | ||
| include { BCFTOOLS_FILTER } from '../../../modules/nf-core/bcftools/filter/main' | ||
| include { PLINK2_INDEP_PAIRWISE } from '../../../modules/nf-core/plink2/indeppairwise/main' | ||
| include { PLINK2_RECODE_VCF } from '../../../modules/nf-core/plink2/recodevcf/main' | ||
| include { FLASHPCA2 } from '../../../modules/nf-core/flashpca2/main' | ||
| | ||
| /* | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| SUBWORKFLOW | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| */ | ||
| | ||
| workflow SNPCLUSTERING { | ||
| take: | ||
| meta | ||
| vcf | ||
| vcf_index | ||
| maf | ||
| missing | ||
| | ||
| main: | ||
| versions = Channel.empty() | ||
| | ||
| BCFTOOLS_FILTER ( vcf.join(vcf_index), maf, missing ) | ||
| versions = versions.mix(BCFTOOLS_FILTER.out.versions.first()) | ||
| | ||
| PLINK2_INDEP_PAIRWISE ( BCFTOOLS_FILTER.out.vcf ) | ||
| versions = versions.mix(PLINK2_INDEP_PAIRWISE.out.versions.first()) | ||
| | ||
| PLINK2_RECODE_VCF ( PLINK2_INDEP_PAIRWISE.out.pgen ) | ||
| versions = versions.mix(PLINK2_RECODE_VCF.out.versions.first()) | ||
| | ||
| FLASHPCA2 ( PLINK2_RECODE_VCF.out.vcf ) | ||
| versions = versions.mix(FLASHPCA2.out.versions.first()) | ||
| | ||
| // TODO: add KMeans/DBSCAN/plotting here once the local modules are created | ||
Contributor
Is there still something to add?
Author
Thank you for your comment, @famosab. You're right: the clustering components (KMeans, DBSCAN), internal validation metrics (Silhouette, Calinski–Harabasz, Davies–Bouldin), non-linear embeddings (t-SNE/UMAP), and the final HTML report still need to be integrated. These features are already implemented in the original pipeline (https://github.com/dbaku42/nf-core-snpclustering). I intentionally left them out of this PR to keep the subworkflow minimal and easier to review. I'm happy to proceed in either of the following ways:
Please let me know which approach you'd prefer. Thanks again!
| | ||
| emit: | ||
| cluster_labels = Channel.empty() // placeholder | ||
| metrics = Channel.empty() // placeholder | ||
| plots = Channel.empty() | ||
| versions = versions | ||
| } | ||
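A note on the `vcf.join(vcf_index)` call in `main:`. The `join` operator pairs items by their first element, so it only lines a VCF up with its index when both channels carry a shared key. Below is a minimal sketch of that pairing, assuming each channel emits `[ meta, file ]` tuples. The channel names and file paths are hypothetical and not part of this PR.

```groovy
// Hypothetical standalone script: channel names and paths are illustrative only.
workflow {
    // Each channel emits a tuple whose first element (the meta map) acts as the key
    ch_vcf = Channel.of(
        [ [ id:'test' ], file('test.vcf.gz') ]
    )
    ch_index = Channel.of(
        [ [ id:'test' ], file('test.vcf.gz.tbi') ]
    )

    // join matches items on the first element, yielding [ meta, vcf, index ]
    ch_vcf
        .join(ch_index)
        .view()
}
```

If the inputs arrive without a shared key, taking a single channel of `[ meta, vcf, index ]` tuples, as many nf-core subworkflows do, would avoid the join altogether.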
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| --- | ||
| # yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/nf-core/meta-schema.json | ||
| | ||
| name: "snpclustering" | ||
| description: "End-to-end unsupervised clustering of genomic samples starting from multi-sample VCF files. Performs variant filtering (MAF + missingness), optional LD pruning, PCA (FlashPCA2 or IncrementalPCA), KMeans/DBSCAN clustering and internal validation." | ||
| keywords: | ||
| - genomics | ||
| - clustering | ||
| - unsupervised clustering | ||
| - VCF | ||
| - nf-core | ||
| authors: | ||
| - "Donald Baku (@dbaku42)" | ||
| components: | ||
| - bcftools/filter | ||
| - plink2/indeppairwise | ||
| - plink2/recodevcf | ||
| - flashpca2 | ||
| input: | ||
| - meta: | ||
| type: map | ||
| description: "Groovy Map containing sample metadata" | ||
| - vcf: | ||
| type: file | ||
| description: "Multi-sample VCF file (bgzipped and indexed)" | ||
| pattern: "*.vcf.gz" | ||
| - vcf_index: | ||
| type: file | ||
| description: "Index of the VCF file (.tbi or .csi)" | ||
| pattern: "*.{tbi,csi}" | ||
| - maf: | ||
| type: float | ||
| description: "Minimum minor allele frequency threshold" | ||
| default: 0.01 | ||
| - missing: | ||
| type: float | ||
| description: "Maximum missingness threshold" | ||
| default: 0.10 | ||
| output: | ||
| - meta: | ||
| type: map | ||
| description: "Groovy Map containing sample metadata" | ||
| - cluster_labels: | ||
| type: file | ||
| description: "CSV file with per-sample cluster assignments" | ||
| pattern: "cluster_labels.csv" | ||
| - metrics: | ||
| type: file | ||
| description: "Table with all cluster quality metrics" | ||
| pattern: "*_metrics.tsv" | ||
| - plots: | ||
| type: file | ||
| description: "Directory containing publication-ready plots" | ||
| pattern: "plots/" | ||
| - versions: | ||
| type: file | ||
| description: "File containing versions of all tools used" | ||
| pattern: "versions.yml" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| nextflow_workflow { | ||
| | ||
| name "Test Workflow SNPCLUSTERING" | ||
| script "../main.nf" | ||
| workflow "SNPCLUSTERING" | ||
| config "./nextflow.config" | ||
| | ||
| tag "subworkflows" | ||
| tag "subworkflows_nfcore" | ||
| tag "subworkflows/snpclustering" | ||
| tag "bcftools/filter" | ||
| tag "plink2/indeppairwise" | ||
| tag "plink2/recodevcf" | ||
| tag "flashpca2" | ||
| | ||
| test("vcf.gz input") { | ||
| | ||
| when { | ||
| workflow { | ||
| """ | ||
| input[0] = [ id:'test' ] | ||
| input[1] = file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/vcf/test.vcf.gz', checkIfExists: true) | ||
| input[2] = file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true) | ||
| input[3] = 0.01 | ||
| input[4] = 0.10 | ||
| """ | ||
| } | ||
| } | ||
| | ||
| then { | ||
| assert workflow.success | ||
Contributor
We also want a snapshot here (look at other subworkflows).
Author
The test now passes when run directly with nf-test. The failure with nf-core subworkflows test is due to a temporarily missing Wave container for the plink2/vcf module (manifest unknown). The logic and snapshot are correct.
| } | ||
| } | ||
| } | ||
Contributor
Not needed anymore.
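The snapshot requested in the review could look roughly like the following. This is a sketch that follows the pattern commonly used in other nf-core subworkflow tests, not the author's final test. The inputs are copied from the test above. Since `cluster_labels`, `metrics`, and `plots` are still empty placeholder channels, the snapshot would mostly capture `versions`.

```groovy
nextflow_workflow {

    name "Test Workflow SNPCLUSTERING"
    script "../main.nf"
    workflow "SNPCLUSTERING"

    test("vcf.gz input") {

        when {
            workflow {
                """
                input[0] = [ id:'test' ]
                input[1] = file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/vcf/test.vcf.gz', checkIfExists: true)
                input[2] = file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true)
                input[3] = 0.01
                input[4] = 0.10
                """
            }
        }

        then {
            assertAll(
                // The run must complete without errors
                { assert workflow.success },
                // Compare every emitted channel against the stored snapshot file
                { assert snapshot(workflow.out).match() }
            )
        }
    }
}
```

On the first run nf-test writes the snapshot file next to the test. Later runs fail if the emitted outputs differ from it.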
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| subworkflows/snpclustering: | ||
| - subworkflows/nf-core/snpclustering/** |
Check for each module whether it still exports versions; I think at least bcftools/filter no longer does.