nf-core · KurayiChawatama · Mar 12, 2026
@@ -10,9 +10,13 @@ Initial release of nf-core/scdownstream, created with the [nf-core](https://nf-c
 ### `Added`
 
 - Added `singleR` module for automated cell type annotation.
+- Added `scDblFinder` module for doublet detection.
+- Added an optional `doublet_rate` column to the input samplesheet to provide a per-sample expected doublet rate for `scDblFinder`.
 
 ### `Fixed`
 
+- Updated `scDblFinder` to use internal `dbr` estimation when `doublet_rate` is not provided, and to use provided `doublet_rate` when available.
+
 ### `Dependencies`
 
 ### `Deprecated`
@@ -47,6 +47,10 @@
 
   > Cannoodt R, Zappia L, Morgan M, Deconinck L (2025). anndataR: AnnData interoperability in R. R package version 0.99.0
 
+- [scDblFinder](https://pubmed.ncbi.nlm.nih.gov/35118618/)
+
+  > Germain P, Lun A, Garcia Meixide C, Macnair W, Robinson M. Doublet identification in single-cell sequencing data using scDblFinder. F1000Res. 2021;10:979. doi: 10.12688/f1000research.73600.2.
+
 ## Software packaging/containerisation tools
 
 - [Anaconda](https://anaconda.com)

@@ -49,6 +49,7 @@ Steps marked with the boat icon are not yet implemented. For the other steps, th
       - [scrublet](https://scanpy.readthedocs.io/en/stable/api/generated/scanpy.pp.scrublet.html)
       - [DoubletDetection](https://doubletdetection.readthedocs.io/en/v2.5.2/doubletdetection.doubletdetection.html)
       - [SCDS](https://bioconductor.org/packages/devel/bioc/vignettes/scds/inst/doc/scds.html)
+      - [scDblFinder](https://bioconductor.org/packages/release/bioc/html/scDblFinder.html)
 2. Sample aggregation
    1. Merge into a single h5ad file
    2. Present QC for merged counts ([`MultiQC`](http://multiqc.info/))
@@ -87,6 +88,8 @@ sample4,/absolute/path/to/sample3.csv
 Each entry represents a h5ad, h5, RDS or CSV file. RDS files may contain any object that can be converted to a SingleCellExperiment using the [Seurat `as.SingleCellExperiment`](https://satijalab.org/seurat/reference/as.singlecellexperiment) function.
 CSV files should contain a matrix with genes as columns and cells as rows. The first column should contain cell names/barcodes.
 
+For `scDblFinder`, you can optionally add a `doublet_rate` column (values between `0` and `1`) to the samplesheet. If omitted, `scDblFinder` estimates the doublet rate internally.
+
 -->
 
 Now, you can run the pipeline using:

@@ -122,6 +122,13 @@
                 "errorMessage": "Number of cells expected from the experimental design, used as input to cellbender.",
                 "meta": ["expected_cells"]
             },
+            "doublet_rate": {
+                "type": "number",
+                "minimum": 0,
+                "maximum": 1,
+                "errorMessage": "doublet_rate must be a number between 0 and 1.",
+                "meta": ["doublet_rate"]
+            },
             "ambient_correction": {
                 "type": "boolean",
                 "default": true,

@@ -213,6 +213,16 @@ process {
         ]
     }
 
+    withName: SCDBLFINDER {
+        ext.prefix = { meta.id + '_scdblfinder' }
+        publishDir = [
+            path: { "${params.outdir}/quality_control/doublet_detection/scdblfinder" },
+            mode: params.publish_dir_mode,
+            enabled: params.save_intermediates,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+        ]
+    }
+
     withName: DOUBLET_REMOVAL {
         publishDir = [
             path: { "${params.outdir}/quality_control/doublet_detection" },

@@ -25,7 +25,7 @@ params {
     // Input data
     input               = params.pipelines_testdata_base_path + 'samplesheet.csv'
     integration_methods = 'scvi,harmony,bbknn,combat'
-    doublet_detection   = 'solo,scrublet,scds'
+    doublet_detection   = 'solo,scrublet,scds,scdblfinder'
     celltypist_model    = 'Adult_Human_Skin'
     celldex_reference   = 'https://raw.githubusercontent.com/nf-core/test-datasets/scdownstream/singleR/references.csv'
     integration_hvgs    = 500

@@ -25,7 +25,7 @@ params {
     // Input data for full size test
     input               = params.pipelines_testdata_base_path + 'samplesheet.csv'
     integration_methods = 'scvi,harmony,bbknn,combat'
-    doublet_detection   = 'solo,scrublet,doubletdetection,scds'
+    doublet_detection   = 'solo,scrublet,doubletdetection,scds,scdblfinder'
     celltypist_model    = 'Adult_Human_Skin'
     celldex_reference   = 'hpca__2024-02-26,monaco_immune__2024-02-26' // Feature: Support offline.
     celldex_reference_label = 'label.main,label.fine'

@@ -25,6 +25,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
       - [scrublet](https://scanpy.readthedocs.io/en/stable/api/generated/scanpy.pp.scrublet.html)
       - [DoubletDetection](https://doubletdetection.readthedocs.io/en/v2.5.2/doubletdetection.doubletdetection.html)
       - [SCDS](https://bioconductor.org/packages/devel/bioc/vignettes/scds/inst/doc/scds.html)
+      - [scDblFinder](https://bioconductor.org/packages/release/bioc/html/scDblFinder.html)
 2. Sample aggregation
    1. Merge into a single h5ad file
    2. Present QC for merged counts ([`MultiQC`](http://multiqc.info/))
@@ -57,7 +58,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
   - `custom_thresholds/`: Results of applying user-defined QC thresholds.
   - `doublet_detection/`: Directories related to doublet detection.
     - `input_rds/`: RDS version of the h5ad file that is used as input to the doublet detection tools.
-    - `(doubletdetection|scds|scrublet|solo)/`: Results of doublet detection. Each directory contains a filtered `h5ad`/`rds` and a `csv`/`pkl` file with the doublet annotations.
+    - `(doubletdetection|scdblfinder|scds|scrublet|solo)/`: Results of doublet detection. Each directory contains a filtered `h5ad`/`rds` and a `csv`/`pkl` file with the doublet annotations.
     - `${sample_id}.h5ad`: The h5ad without doublets.
   - `qc_preprocessed/`: QC plots for the preprocessed data.
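
The `csv` annotation file can be inspected without R. The following is a minimal Python sketch, assuming the layout written by the `scDblFinder` template in this change (cell barcodes as row names, one `TRUE`/`FALSE` column named after the output prefix). The barcodes and the `sample1_scdblfinder` column name are made-up example values, not pipeline output.

```python
import csv
import io

# Example content in the layout produced by R's write.csv(): an unnamed first
# column holding the cell barcodes, then one TRUE/FALSE column.
# All values below are illustrative only.
PREDICTIONS_CSV = '''"","sample1_scdblfinder"
"AAACCTGAGAAACCAT-1",FALSE
"AAACCTGAGAAACCGC-1",TRUE
"AAACCTGAGAAACCTA-1",FALSE
'''

reader = csv.reader(io.StringIO(PREDICTIONS_CSV))
header = next(reader)  # ['', 'sample1_scdblfinder']

# Map each barcode to a Python bool (R writes logicals as TRUE/FALSE)
calls = {barcode: flag == "TRUE" for barcode, flag in reader}

doublets = [barcode for barcode, is_doublet in calls.items() if is_doublet]
print(f"{header[1]}: {len(doublets)} of {len(calls)} cells flagged as doublets")
```

For a real file, replace the `io.StringIO(...)` wrapper with an opened file handle (`open(path, newline="")`).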
 

@@ -38,10 +38,10 @@ sample3,/absolute/path/to/sample3.csv
 There are a couple of optional columns that can be used for more advanced features:
 
 ```csv title="samplesheet.csv"
-sample,filtered,unfiltered,batch_col,label_col,condition_col,unknown_label,min_genes,min_cells,min_counts_cell,min_counts_gene,expected_cells,ambient_correction,ambient_corrected_integration
-sample1,/absolute/path/to/sample1_filtered.h5ad,/absolute/path/to/sample1.h5ad,batch,cell_type,condition,unknown,1,2,3,4,5000,true,false
-sample2,relative/path/to/sample2_filtered.rds,relative/path/to/sample2.rds,batch_id,annotation,condition,unannotated,5,6,7,8,3000,false,
-sample3,/absolute/path/to/sample3_filtered.csv,/absolute/path/to/sample3.csv,,,,,9,10,11,12,,true,true
+sample,filtered,unfiltered,batch_col,label_col,condition_col,unknown_label,min_genes,min_cells,min_counts_cell,min_counts_gene,expected_cells,doublet_rate,ambient_correction,ambient_corrected_integration
+sample1,/absolute/path/to/sample1_filtered.h5ad,/absolute/path/to/sample1.h5ad,batch,cell_type,condition,unknown,1,2,3,4,5000,0.08,true,false
+sample2,relative/path/to/sample2_filtered.rds,relative/path/to/sample2.rds,batch_id,annotation,condition,unannotated,5,6,7,8,3000,,false,
+sample3,/absolute/path/to/sample3_filtered.csv,/absolute/path/to/sample3.csv,,,,,9,10,11,12,,,true,true
 ```
 
 For CSV input files, specifying the `batch_col`, `label_col`, `condition_col`, and `unknown_label` columns will not have any effect, as no additional metadata is available in the CSV file.
@@ -63,6 +63,7 @@ For CSV input files, specifying the `batch_col`, `label_col`, `condition_col`, a
 | `min_counts_cell`               | Minimum number of counts required for a cell to be considered. Defaults to `1`.                                                                                                                                                                                                                                                                                                                                     |
 | `min_counts_gene`               | Minimum number of counts required for a gene to be considered. Defaults to `1`.                                                                                                                                                                                                                                                                                                                                     |
 | `expected_cells`                | Number of expected cells, used as input to CellBender for empty droplet detection.                                                                                                                                                                                                                                                                                                                                  |
+| `doublet_rate`                  | Optional expected doublet rate (0-1) for `scDblFinder`. If not provided, `scDblFinder` estimates it internally.                                                                                                                                                                                                                                                                                                     |
 | `max_mito_percentage`           | Maximum percentage of mitochondrial reads for a cell to be considered. Defaults to `100`.                                                                                                                                                                                                                                                                                                                           |
 | `ambient_correction`            | Whether to perform ambient RNA correction for this sample. Set to `true` to use the globally configured method, `false` to skip ambient correction for this sample. Defaults to `true`.                                                                                                                                                                                                                             |
 | `ambient_corrected_integration` | Whether to use ambient-corrected counts for integration for this sample. Set to `true` to use corrected counts in downstream integration, `false` to store them only as additional layers. Can override the global `--ambient_corrected_integration` parameter. Defaults to global setting.                                                                                                                         |
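
The `doublet_rate` rule from the table (a number between 0 and 1, or an empty cell) can be checked before launching the pipeline. The sketch below is an illustration only: `parse_doublet_rate` is a hypothetical helper, the samplesheet is reduced to three columns for brevity, and the pipeline itself performs this validation through its samplesheet schema.

```python
import csv
import io
from typing import Optional


def parse_doublet_rate(raw: str) -> Optional[float]:
    """Return the rate as a float, or None when the cell is empty.

    Hypothetical helper; mirrors the documented rule (0 <= rate <= 1).
    """
    text = raw.strip()
    if text == "":
        return None  # scDblFinder will estimate the rate internally
    value = float(text)  # raises ValueError for non-numeric input
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"doublet_rate must be a number between 0 and 1, got {value}")
    return value


# Reduced example samplesheet (only the columns needed here)
SAMPLESHEET = """sample,unfiltered,doublet_rate
sample1,/absolute/path/to/sample1.h5ad,0.08
sample2,relative/path/to/sample2.rds,
"""

rates = {
    row["sample"]: parse_doublet_rate(row["doublet_rate"])
    for row in csv.DictReader(io.StringIO(SAMPLESHEET))
}
print(rates)
```

Here `sample1` gets an explicit rate of `0.08`, while `sample2` leaves the cell empty and so falls back to the internal estimate.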

@@ -0,0 +1,11 @@
+name: scdblfinder
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - bioconda::bioconductor-scdblfinder=1.24.0
+  - bioconda::bioconductor-singlecellexperiment=1.32.0
+  - bioconda::bioconductor-biocparallel=1.44.0
+  - bioconda::bioconductor-anndatar=1.0.2
+  - bioconda::bioconductor-rhdf5=2.54.1
+  - conda-forge::r-tidyverse=2.0.0
@@ -0,0 +1,32 @@
+process SCDBLFINDER {
+    tag "$meta.id"
+    label 'process_medium'
+
+    conda "${moduleDir}/environment.yml"
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/99/993a012a69d920412b090701eb733ccf35c8655c3d012756ca6b0af1cfcd4780/data' :
+        'community.wave.seqera.io/library/bioconductor-anndatar_bioconductor-biocparallel_bioconductor-rhdf5_bioconductor-scdblfinder_pruned:0f9db6b0855861de' }"
+
+    input:
+    tuple val(meta), path(h5ad), val(dbr)
+
+    output:
+    tuple val(meta), path("${prefix}.h5ad"), emit: h5ad
+    tuple val(meta), path("${prefix}.csv"), emit: predictions
+    path "versions.yml", emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    prefix = task.ext.prefix ?: "${meta.id}"
+    template('scdblfinder.R')
+
+    stub:
+    prefix = task.ext.prefix ?: "${meta.id}"
+    """
+    touch ${prefix}.h5ad
+    touch ${prefix}.csv
+    touch versions.yml
+    """
+}
@@ -0,0 +1,72 @@
+name: "scdblfinder"
+description: Detect doublets in single-cell RNA-seq data using scDblFinder
+keywords:
+  - doublet-detection
+  - single-cell
+  - scrnaseq
+  - quality-control
+tools:
+  - "scdblfinder":
+      description: "scDblFinder: Computational identification of doublets in single-cell transcriptomics data"
+      homepage: "https://bioconductor.org/packages/scDblFinder"
+      documentation: "https://bioconductor.org/packages/release/bioc/vignettes/scDblFinder/inst/doc/scDblFinder.html"
+      tool_dev_url: "https://github.com/plger/scDblFinder"
+      doi: "10.12688/f1000research.73600.2"
+      licence: ["GPL-3.0"]
+      identifier: biotools:scdblfinder
+
+input:
+  - - meta:
+        type: map
+        description: |
+          Groovy Map containing sample information
+          e.g. `[ id:'sample1' ]`
+    - h5ad:
+        type: file
+        description: AnnData object in h5ad format
+        pattern: "*.h5ad"
+        ontologies:
+          - edam: "http://edamontology.org/format_3590" # HDF5 format
+    - dbr:
+        type: number
+        description: |
+          Optional expected doublet rate (0-1). If null, scDblFinder estimates
+          the doublet rate internally.
+
+output:
+  h5ad:
+    - - meta:
+          type: map
+          description: |
+            Groovy Map containing sample information
+            e.g. `[ id:'sample1' ]`
+      - "*.h5ad":
+          type: file
+          description: AnnData object with doublet annotations
+          pattern: "*.h5ad"
+          ontologies:
+            - edam: "http://edamontology.org/format_3590" # HDF5 format
+  predictions:
+    - - meta:
+          type: map
+          description: |
+            Groovy Map containing sample information
+            e.g. `[ id:'sample1' ]`
+      - "*.csv":
+          type: file
+          description: CSV file containing doublet predictions (boolean)
+          pattern: "*.csv"
+          ontologies:
+            - edam: "http://edamontology.org/format_3752" # CSV
+  versions:
+    - versions.yml:
+        type: file
+        description: File containing software versions
+        pattern: "versions.yml"
+        ontologies:
+          - edam: http://edamontology.org/format_3750 # YAML
+
+authors:
+  - "@KurayiChawatama"
+maintainers:
+  - "@KurayiChawatama"
@@ -0,0 +1,102 @@
+#!/usr/bin/env Rscript
+
+library(scDblFinder)
+library(tidyverse)
+library(SingleCellExperiment)
+library(BiocParallel)
+library(anndataR)
+
+adata <- read_h5ad("${h5ad}")
+sce <- adata\$as_SingleCellExperiment()
+
+num_threads <- max(1L, as.integer("${task.cpus}"))
+bp <- MulticoreParam(workers = num_threads, RNGseed = 123)
+
+# Save the original cell barcodes before sce is overwritten by the scDblFinder result
+original_cell_names <- colnames(sce)
+
+# Parse per-sample doublet rate from Nextflow input. If unavailable, let
+# scDblFinder estimate dbr internally (recommended default for 10X data).
+dbr_raw <- trimws("${dbr}")
+dbr <- suppressWarnings(as.numeric(dbr_raw))
+
+# Fall back to the internal scDblFinder estimate when no usable rate was supplied
+if (is.na(dbr)) {
+  message("No valid doublet_rate provided; using scDblFinder internal dbr estimation")
+  dbr <- NULL
+} else {
+  message("Using provided doublet_rate (dbr): ", dbr)
+}
+
+# Run scDblFinder on the counts matrix (first assay).
+# scDblFinder creates artificial doublets internally and returns a new SCE.
+set.seed(123)
+sce <- scDblFinder(
+  assays(sce)[[1]],
+  BPPARAM = bp,
+  dbr = dbr
+)
+
+# Generate a summary table
+message("scDblFinder results summary:")
+print(table(sce\$scDblFinder.class))
+
+# Rename scDblFinder.* columns for consistency with other doublet methods.
+# Replace prefix first, then replace any remaining dots with underscores.
+idx <- grep("^scDblFinder\\\\.", colnames(colData(sce)))
+colnames(colData(sce))[idx] <- gsub(
+  "\\\\.",
+  "_",
+  sub("^scDblFinder\\\\.", "scdblfinder_", colnames(colData(sce))[idx])
+)
+
+# The doublet calls must stay keyed by the original cell barcodes. If they are
+# missing or no longer match the input, something went wrong during conversion or
+# scDblFinder processing and we should fail instead of inventing replacement identifiers.
+if (is.null(colnames(sce)) || !identical(colnames(sce), original_cell_names)) {
+  stop("scDblFinder output is missing valid cell barcodes; cannot write aligned h5ad and prediction outputs.")
+}
+
+# Write the updated SingleCellExperiment directly as h5ad, explicitly mapping the
+# primary assay to AnnData X so downstream readers see a valid matrix field.
+primary_assay <- assayNames(sce)[1]
+if (is.null(primary_assay) || is.na(primary_assay) || primary_assay == "") {
+  stop("scDblFinder output is missing a primary assay; cannot write h5ad output.")
+}
+write_h5ad(sce, "${prefix}.h5ad", x_mapping = primary_assay)
+
+# Extract predictions for the doublet removal step:
+# a logical vector (TRUE = doublet) keyed by cell barcode
+doublet_calls <- colData(sce)\$scdblfinder_class == "doublet"
+
+predictions <- data.frame(doublet = doublet_calls)
+row.names(predictions) <- colnames(sce)
+
+# Name the prediction column after the output prefix
+colnames(predictions) <- "${prefix}"
+
+# Save predictions to CSV
+write.csv(predictions, "${prefix}.csv")
+
+################################################
+################################################
+## VERSIONS FILE                              ##
+################################################
+################################################
+
+r.version <- strsplit(version[['version.string']], ' ')[[1]][3]
+scDblFinder.version <- as.character(packageVersion('scDblFinder'))
+
+writeLines(
+    c(
+        '"${task.process}":',
+        paste('    R:', r.version),
+        paste('    scDblFinder:', scDblFinder.version)
+    ),
+'versions.yml')
+
+################################################
+################################################
+################################################
+################################################
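
For readers more familiar with Python, the two string-handling steps of the template above (parsing `dbr` and renaming the `scDblFinder.*` columns) can be approximated as follows. This is a sketch, not part of the module: `parse_dbr` and `rename_column` are hypothetical names, and Python's `float()` does not accept exactly the same strings as R's `as.numeric()`, so edge cases may differ.

```python
import math
import re
from typing import Optional


def parse_dbr(raw: str) -> Optional[float]:
    """Approximate as.numeric(trimws(x)) plus the is.na() fallback.

    Anything that is not a number (for example "null" or an empty string)
    yields None, meaning "let scDblFinder estimate the rate".
    """
    try:
        value = float(raw.strip())
    except ValueError:
        return None
    return None if math.isnan(value) else value


def rename_column(name: str) -> str:
    """Replace the scDblFinder. prefix, then turn remaining dots into underscores."""
    if not name.startswith("scDblFinder."):
        return name  # columns from other sources are left untouched
    prefixed = re.sub(r"^scDblFinder\.", "scdblfinder_", name)
    return re.sub(r"\.", "_", prefixed)


print(parse_dbr("0.08"), parse_dbr("null"), parse_dbr(""))
print(rename_column("scDblFinder.class"), rename_column("sample"))
```

Returning `None` plays the role of `dbr <- NULL` in the R code: the caller passes no rate and the tool falls back to its own estimate.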