30 commits
f949409
Add scdblfinder module skeleton generated by nf-core tools
KurayiChawatama Mar 12, 2026
0c1c2ee
Fix scdblfinder: remove mockDoubletSCE and use real SCE object directly
KurayiChawatama Mar 12, 2026
8ba85f7
Integrate scdblfinder into pipeline configuration and tests
KurayiChawatama Mar 12, 2026
5a109a5
Fix scdblfinder module implementation and tests
KurayiChawatama Mar 12, 2026
0fb9d77
Update documentation to include scDblFinder
KurayiChawatama Mar 12, 2026
9af9d36
added more documentation for scdblfinder
KurayiChawatama Mar 12, 2026
bd40dce
added scdblfinder citation to citations md
KurayiChawatama Mar 12, 2026
807f00f
removed template comment from meta yml
KurayiChawatama Mar 12, 2026
275fac8
moved scdblfinder module to doublet detection directory
KurayiChawatama Mar 12, 2026
993a8f2
updated docs output to include scdblfinder
KurayiChawatama Mar 12, 2026
a56e656
[automated] Fix code linting
nf-core-bot Mar 12, 2026
ec56f35
Update modules/local/doublet_detection/scdblfinder/templates/scdblfin…
KurayiChawatama Mar 12, 2026
632676f
Merge remote-tracking branch 'origin/dev' into module/scdblfinder
nictru Mar 12, 2026
7b03b01
Merge branch 'nf-core:dev' into module/scdblfinder
KurayiChawatama Mar 13, 2026
0b0b19d
added https version of the singularity container link
KurayiChawatama Mar 13, 2026
6d741bc
refactor(scDblFinder): optimize multiplet rate calculation using find…
KurayiChawatama Mar 13, 2026
947ffa2
added explanation for column name change
KurayiChawatama Mar 13, 2026
8c2e382
write updated SingleCellExperiment directly as h5ad without explicit …
KurayiChawatama Mar 13, 2026
212b3e2
enhance h5ad writing with validation for cell barcodes and primary assay
KurayiChawatama Mar 13, 2026
9774828
add scdblfinder to input methods in doublet detection subworkflow test
KurayiChawatama Mar 13, 2026
e32e393
streamline renaming of scDblFinder columns with less clumsy code
KurayiChawatama Mar 13, 2026
addb641
removed explicit call of artificial doublet number in scdblfinder func…
KurayiChawatama Mar 13, 2026
1ad461f
updated test snapshot to match previous commit
KurayiChawatama Mar 13, 2026
63c0307
Enhance scDblFinder functionality and documentation
KurayiChawatama Mar 13, 2026
ee63725
[automated] Fix code linting
nf-core-bot Mar 13, 2026
6480259
change other doublet detection methods to use mix
KurayiChawatama Mar 13, 2026
4b5b2b3
Remove redundant restoration of original cell barcodes in scDblFinder…
KurayiChawatama Mar 13, 2026
ee1d883
Remove unnecessary comment about RNG seed parameter in scDblFinder sc…
KurayiChawatama Mar 13, 2026
a2fe8ab
Refactor doublet rate handling in scDblFinder to streamline logic and…
KurayiChawatama Mar 13, 2026
6194ca4
Fix regex pattern for doublet detection tool options in nextflow_sche…
KurayiChawatama Mar 13, 2026
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -10,9 +10,13 @@ Initial release of nf-core/scdownstream, created with the [nf-core](https://nf-c
### `Added`

- Added `singleR` module for automated cell type annotation.
- Added `scDblFinder` module for doublet detection.
- Added optional `doublet_rate` column in input samplesheet to provide per-sample expected doublet rate for `scDblFinder`.

### `Fixed`

- Updated `scDblFinder` to use internal `dbr` estimation when `doublet_rate` is not provided, and to use provided `doublet_rate` when available.

### `Dependencies`

### `Deprecated`
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -47,6 +47,10 @@

> Cannoodt R, Zappia L, Morgan M, Deconinck L (2025). anndataR: AnnData interoperability in R. R package version 0.99.0

- [scDblFinder](https://pubmed.ncbi.nlm.nih.gov/35118618/)

> Germain P, Lun A, Garcia Meixide C, Macnair W, Robinson M. Doublet identification in single-cell sequencing data using scDblFinder. F1000Res. 2022;11:979. doi: 10.12688/f1000research.73600.2.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
3 changes: 3 additions & 0 deletions README.md
@@ -49,6 +49,7 @@ Steps marked with the boat icon are not yet implemented. For the other steps, th
- [scrublet](https://scanpy.readthedocs.io/en/stable/api/generated/scanpy.pp.scrublet.html)
- [DoubletDetection](https://doubletdetection.readthedocs.io/en/v2.5.2/doubletdetection.doubletdetection.html)
- [SCDS](https://bioconductor.org/packages/devel/bioc/vignettes/scds/inst/doc/scds.html)
- [scDblFinder](https://bioconductor.org/packages/release/bioc/html/scDblFinder.html)
2. Sample aggregation
1. Merge into a single h5ad file
2. Present QC for merged counts ([`MultiQC`](http://multiqc.info/))
@@ -87,6 +88,8 @@ sample4,/absolute/path/to/sample3.csv
Each entry represents a h5ad, h5, RDS or CSV file. RDS files may contain any object that can be converted to a SingleCellExperiment using the [Seurat `as.SingleCellExperiment`](https://satijalab.org/seurat/reference/as.singlecellexperiment) function.
CSV files should contain a matrix with genes as columns and cells as rows. The first column should contain cell names/barcodes.

For `scDblFinder`, you can optionally add a `doublet_rate` column (values between `0` and `1`) to the samplesheet. If omitted, `scDblFinder` estimates the doublet rate internally.

-->

Now, you can run the pipeline using:
7 changes: 7 additions & 0 deletions assets/schema_input.json
@@ -122,6 +122,13 @@
                "errorMessage": "Number of cells expected from the experimental design, used as input to cellbender.",
                "meta": ["expected_cells"]
            },
            "doublet_rate": {
                "type": "number",
                "minimum": 0,
                "maximum": 1,
                "errorMessage": "doublet_rate must be a number between 0 and 1.",
                "meta": ["doublet_rate"]
            },
            "ambient_correction": {
                "type": "boolean",
                "default": true,
10 changes: 10 additions & 0 deletions conf/modules.config
@@ -213,6 +213,16 @@ process {
        ]
    }

    withName: SCDBLFINDER {
        ext.prefix = { meta.id + '_scdblfinder' }
        publishDir = [
            path: { "${params.outdir}/quality_control/doublet_detection/scdblfinder" },
            mode: params.publish_dir_mode,
            enabled: params.save_intermediates,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
        ]
    }

    withName: DOUBLET_REMOVAL {
        publishDir = [
            path: { "${params.outdir}/quality_control/doublet_detection" },
2 changes: 1 addition & 1 deletion conf/test.config
@@ -25,7 +25,7 @@ params {
     // Input data
     input = params.pipelines_testdata_base_path + 'samplesheet.csv'
     integration_methods = 'scvi,harmony,bbknn,combat'
-    doublet_detection = 'solo,scrublet,scds'
+    doublet_detection = 'solo,scrublet,scds,scdblfinder'
     celltypist_model = 'Adult_Human_Skin'
     celldex_reference = 'https://raw.githubusercontent.com/nf-core/test-datasets/scdownstream/singleR/references.csv'
     integration_hvgs = 500
2 changes: 1 addition & 1 deletion conf/test_full.config
@@ -25,7 +25,7 @@ params {
     // Input data for full size test
     input = params.pipelines_testdata_base_path + 'samplesheet.csv'
     integration_methods = 'scvi,harmony,bbknn,combat'
-    doublet_detection = 'solo,scrublet,doubletdetection,scds'
+    doublet_detection = 'solo,scrublet,doubletdetection,scds,scdblfinder'
     celltypist_model = 'Adult_Human_Skin'
     celldex_reference = 'hpca__2024-02-26,monaco_immune__2024-02-26' // Feature: Support offline.
     celldex_reference_label = 'label.main,label.fine'
3 changes: 2 additions & 1 deletion docs/output.md
@@ -25,6 +25,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [scrublet](https://scanpy.readthedocs.io/en/stable/api/generated/scanpy.pp.scrublet.html)
- [DoubletDetection](https://doubletdetection.readthedocs.io/en/v2.5.2/doubletdetection.doubletdetection.html)
- [SCDS](https://bioconductor.org/packages/devel/bioc/vignettes/scds/inst/doc/scds.html)
- [scDblFinder](https://bioconductor.org/packages/release/bioc/html/scDblFinder.html)
2. Sample aggregation
1. Merge into a single h5ad file
2. Present QC for merged counts ([`MultiQC`](http://multiqc.info/))
@@ -57,7 +58,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- `custom_thresholds/`: Results of applying user-defined QC thresholds.
- `doublet_detection/`: Directories related to doublet detection.
- `input_rds/`: RDS version of the h5ad file that is used as input to the doublet detection tools.
-    - `(doubletdetection|scds|scrublet|solo)/`: Results of doublet detection. Each directory contains a filtered `h5ad`/`rds` and a `csv`/`pkl` file with the doublet annotations.
+    - `(doubletdetection|scdblfinder|scds|scrublet|solo)/`: Results of doublet detection. Each directory contains a filtered `h5ad`/`rds` and a `csv`/`pkl` file with the doublet annotations.
- `${sample_id}.h5ad`: The h5ad without doublets.
- `qc_preprocessed/`: QC plots for the preprocessed data.

9 changes: 5 additions & 4 deletions docs/usage.md
@@ -38,10 +38,10 @@ sample3,/absolute/path/to/sample3.csv
There are a couple of optional columns that can be used for more advanced features:

```csv title="samplesheet.csv"
-sample,filtered,unfiltered,batch_col,label_col,condition_col,unknown_label,min_genes,min_cells,min_counts_cell,min_counts_gene,expected_cells,ambient_correction,ambient_corrected_integration
-sample1,/absolute/path/to/sample1_filtered.h5ad,/absolute/path/to/sample1.h5ad,batch,cell_type,condition,unknown,1,2,3,4,5000,true,false
-sample2,relative/path/to/sample2_filtered.rds,relative/path/to/sample2.rds,batch_id,annotation,condition,unannotated,5,6,7,8,3000,false,
-sample3,/absolute/path/to/sample3_filtered.csv,/absolute/path/to/sample3.csv,,,,,9,10,11,12,,true,true
+sample,filtered,unfiltered,batch_col,label_col,condition_col,unknown_label,min_genes,min_cells,min_counts_cell,min_counts_gene,expected_cells,doublet_rate,ambient_correction,ambient_corrected_integration
+sample1,/absolute/path/to/sample1_filtered.h5ad,/absolute/path/to/sample1.h5ad,batch,cell_type,condition,unknown,1,2,3,4,5000,0.08,true,false
+sample2,relative/path/to/sample2_filtered.rds,relative/path/to/sample2.rds,batch_id,annotation,condition,unannotated,5,6,7,8,3000,,false,
+sample3,/absolute/path/to/sample3_filtered.csv,/absolute/path/to/sample3.csv,,,,,9,10,11,12,,,true,true
```

For CSV input files, specifying the `batch_col`, `label_col`, `condition_col`, and `unknown_label` columns will not have any effect, as no additional metadata is available in the CSV file.
@@ -63,6 +63,7 @@ For CSV input files, specifying the `batch_col`, `label_col`, `condition_col`, a
| `min_counts_cell` | Minimum number of counts required for a cell to be considered. Defaults to `1`. |
| `min_counts_gene` | Minimum number of counts required for a gene to be considered. Defaults to `1`. |
| `expected_cells` | Number of expected cells, used as input to CellBender for empty droplet detection. |
| `doublet_rate` | Optional expected doublet rate (0-1) for `scDblFinder`. If not provided, `scDblFinder` estimates it internally. |
| `max_mito_percentage` | Maximum percentage of mitochondrial reads for a cell to be considered. Defaults to `100`. |
| `ambient_correction` | Whether to perform ambient RNA correction for this sample. Set to `true` to use the globally configured method, `false` to skip ambient correction for this sample. Defaults to `true`. |
| `ambient_corrected_integration` | Whether to use ambient-corrected counts for integration for this sample. Set to `true` to use corrected counts in downstream integration, `false` to store them only as additional layers. Can override the global `--ambient_corrected_integration` parameter. Defaults to global setting. |
11 changes: 11 additions & 0 deletions modules/local/doublet_detection/scdblfinder/environment.yml
@@ -0,0 +1,11 @@
name: scdblfinder
channels:
  - conda-forge
  - bioconda
dependencies:
  - bioconda::bioconductor-scdblfinder=1.24.0
  - bioconda::bioconductor-singlecellexperiment=1.32.0
  - bioconda::bioconductor-biocparallel=1.44.0
  - bioconda::bioconductor-anndatar=1.0.2
  - bioconda::bioconductor-rhdf5=2.54.1
  - conda-forge::r-tidyverse=2.0.0
32 changes: 32 additions & 0 deletions modules/local/doublet_detection/scdblfinder/main.nf
@@ -0,0 +1,32 @@
process SCDBLFINDER {
    tag "$meta.id"
    label 'process_medium'

    conda "${moduleDir}/environment.yml"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/99/993a012a69d920412b090701eb733ccf35c8655c3d012756ca6b0af1cfcd4780/data' :
        'community.wave.seqera.io/library/bioconductor-anndatar_bioconductor-biocparallel_bioconductor-rhdf5_bioconductor-scdblfinder_pruned:0f9db6b0855861de' }"

    input:
    tuple val(meta), path(h5ad), val(dbr)

    output:
    tuple val(meta), path("${prefix}.h5ad"), emit: h5ad
    tuple val(meta), path("${prefix}.csv"), emit: predictions
    path "versions.yml", emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    prefix = task.ext.prefix ?: "${meta.id}"
    template('scdblfinder.R')

    stub:
    prefix = task.ext.prefix ?: "${meta.id}"
    """
    touch ${prefix}.h5ad
    touch ${prefix}.csv
    touch versions.yml
    """
}
72 changes: 72 additions & 0 deletions modules/local/doublet_detection/scdblfinder/meta.yml
@@ -0,0 +1,72 @@
name: "scdblfinder"
description: Detect doublets in single-cell RNA-seq data using scDblFinder
keywords:
  - doublet-detection
  - single-cell
  - scrnaseq
  - quality-control
tools:
  - "scdblfinder":
      description: "scDblFinder: Computational identification of doublets in single-cell transcriptomics data"
      homepage: "https://bioconductor.org/packages/scDblFinder"
      documentation: "https://bioconductor.org/packages/release/bioc/vignettes/scDblFinder/inst/doc/scDblFinder.html"
      tool_dev_url: "https://github.com/plger/scDblFinder"
      doi: "10.12688/f1000research.73600.2"
      licence: ["GPL-3.0"]
      identifier: biotools:scdblfinder

input:
  - - meta:
        type: map
        description: |
          Groovy Map containing sample information
          e.g. `[ id:'sample1' ]`
    - h5ad:
        type: file
        description: AnnData object in h5ad format
        pattern: "*.{h5ad}"
        ontologies:
          - edam: "http://edamontology.org/format_3590" # HDF5 format
    - dbr:
        type: number
        description: |
          Optional expected doublet rate (0-1). If null, scDblFinder estimates
          the doublet rate internally.

output:
  h5ad:
    - - meta:
          type: map
          description: |
            Groovy Map containing sample information
            e.g. `[ id:'sample1' ]`
      - "*.h5ad":
          type: file
          description: AnnData object with doublet annotations
          pattern: "*.h5ad"
          ontologies:
            - edam: "http://edamontology.org/format_3590" # HDF5 format
  predictions:
    - - meta:
          type: map
          description: |
            Groovy Map containing sample information
            e.g. `[ id:'sample1' ]`
      - "*.csv":
          type: file
          description: CSV file containing doublet predictions (boolean)
          pattern: "*.csv"
          ontologies:
            - edam: "http://edamontology.org/format_3752" # CSV
  versions:
    - versions.yml:
        type: file
        description: File containing software versions
        pattern: "versions.yml"
        ontologies:
          - edam: http://edamontology.org/format_3750 # YAML

authors:
  - "@KurayiChawatama"
maintainers:
  - "@KurayiChawatama"
102 changes: 102 additions & 0 deletions modules/local/doublet_detection/scdblfinder/templates/scdblfinder.R
@@ -0,0 +1,102 @@
#!/usr/bin/env Rscript

library(scDblFinder)
library(tidyverse)
library(SingleCellExperiment)
library(BiocParallel)
library(anndataR)

adata <- read_h5ad("${h5ad}")
sce <- adata\$as_SingleCellExperiment()

num_threads <- max(1L, as.integer("${task.cpus}"))
bp <- MulticoreParam(workers = num_threads, RNGseed = 123)

# Save original cell names before overwriting sce
original_cell_names <- colnames(sce)

# Parse per-sample doublet rate from Nextflow input. If unavailable, let
# scDblFinder estimate dbr internally (recommended default for 10X data).
dbr_raw <- trimws("${dbr}")
dbr <- suppressWarnings(as.numeric(dbr_raw))

set.seed(123)
if (is.na(dbr)) {
    message("No valid doublet_rate provided; using scDblFinder internal dbr estimation")
    dbr <- NULL
} else {
    message(paste0("Using provided doublet_rate (dbr): ", dbr))
}

# Run scDblFinder on the counts matrix (first assay).
# scDblFinder creates artificial doublets internally and returns a new SCE.
sce <- scDblFinder(
    assays(sce)[[1]],
    BPPARAM = bp,
    dbr = dbr
)
Comment on lines +33 to +37
Collaborator

I found that `scDblFinder` has an optional `samples` argument, which is highly relevant here. You should pass the `batch_col` to it, which you can get via the `ch_batch_col` channel in the doublet detection subworkflow.
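
As a rough illustration of this suggestion, the call could look something like the sketch below. It is not the PR's implementation: `resolve_samples` and `run_scdblfinder` are hypothetical helpers, and it assumes the batch column name reaches the script as a plain string (empty or missing when no batch column is configured). The counts matrix is still passed in the same way as in the current template.

```r
# Hypothetical helper (not part of this PR): return per-cell sample labels
# from a colData-like table, or NULL when no batch column is configured.
resolve_samples <- function(col_data, batch_col = NULL) {
    if (is.null(batch_col) || is.na(batch_col) || !nzchar(batch_col)) {
        return(NULL)
    }
    if (!batch_col %in% colnames(col_data)) {
        stop("Batch column '", batch_col, "' not found in colData")
    }
    as.character(col_data[[batch_col]])
}

# Sketch of the scDblFinder call with `samples` set. With samples = NULL the
# call is equivalent to the one currently in the template.
run_scdblfinder <- function(sce, batch_col = NULL, dbr = NULL,
                            bp = BiocParallel::SerialParam()) {
    samples <- resolve_samples(SummarizedExperiment::colData(sce), batch_col)
    scDblFinder::scDblFinder(
        SummarizedExperiment::assays(sce)[[1]],
        samples = samples,
        dbr = dbr,
        BPPARAM = bp
    )
}
```

Keeping the lookup in a small helper means a missing or misspelled batch column fails with a clear message before scDblFinder runs.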


# Generate a summary table
message("scDblFinder results summary:")
print(table(sce\$scDblFinder.class))

# Rename scDblFinder.* columns for consistency with other doublet methods.
# Replace prefix first, then replace any remaining dots with underscores.
idx <- grep("^scDblFinder\\\\.", colnames(colData(sce)))
colnames(colData(sce))[idx] <- gsub(
    "\\\\.",
    "_",
    sub("^scDblFinder\\\\.", "scdblfinder_", colnames(colData(sce))[idx])
)

# The doublet calls must stay keyed by the original cell barcodes. If they are not
# present here, something went wrong during conversion or scDblFinder processing and
# we should fail instead of inventing replacement identifiers.
if (is.null(colnames(sce)) || length(colnames(sce)) != ncol(sce)) {
    stop("scDblFinder output is missing valid cell barcodes; cannot write aligned h5ad and prediction outputs.")
}

# Write the updated SingleCellExperiment directly as h5ad, explicitly mapping the
# primary assay to AnnData X so downstream readers see a valid matrix field.
primary_assay <- assayNames(sce)[1]
if (is.null(primary_assay) || is.na(primary_assay) || primary_assay == "") {
    stop("scDblFinder output is missing a primary assay; cannot write h5ad output.")
}
Comment on lines +59 to +64
Collaborator

I think this is also unnecessary, because the `sce` is created from the AnnData object in a way that is already prepared for reversing the process.
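
If the check is dropped as suggested, the output step might reduce to something like the following sketch. `build_predictions` and `write_doublet_outputs` are hypothetical helpers, not code from this PR, and the `x_mapping` argument is simply carried over from the current template on the assumption that `write_h5ad` accepts it there.

```r
# Hypothetical helper: build the one-column predictions table keyed by barcode.
# `classes` holds the scDblFinder class labels ("singlet"/"doublet").
build_predictions <- function(classes, barcodes, prefix) {
    stopifnot(length(classes) == length(barcodes))
    predictions <- data.frame(classes == "doublet", row.names = barcodes)
    colnames(predictions) <- prefix
    predictions
}

# Sketch of the simplified output step, without the assay validation.
write_doublet_outputs <- function(sce, prefix) {
    anndataR::write_h5ad(
        sce,
        paste0(prefix, ".h5ad"),
        x_mapping = SummarizedExperiment::assayNames(sce)[1]
    )
    predictions <- build_predictions(
        SummarizedExperiment::colData(sce)[["scdblfinder_class"]],
        colnames(sce),
        prefix
    )
    write.csv(predictions, paste0(prefix, ".csv"))
    invisible(predictions)
}
```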

write_h5ad(sce, "${prefix}.h5ad", x_mapping = primary_assay)

# Extract predictions for doublet removal step
# Create a binary doublet call based on class

# Create predictions vector
doublet_calls <- colData(sce)\$scdblfinder_class == "doublet"

# Create data frame without row.names first, then add them
predictions <- data.frame(doublet = doublet_calls)
row.names(predictions) <- colnames(sce)

colnames(predictions) <- "${prefix}"

# Save predictions to CSV
write.csv(predictions, "${prefix}.csv")

################################################
################################################
## VERSIONS FILE ##
################################################
################################################

r.version <- strsplit(version[['version.string']], ' ')[[1]][3]
scDblFinder.version <- as.character(packageVersion('scDblFinder'))

writeLines(
    c(
        '"${task.process}":',
        paste(' R:', r.version),
        paste(' scDblFinder:', scDblFinder.version)
    ),
    'versions.yml')

################################################
################################################
################################################
################################################