Pipeline for gene-fusion searching in Rna Single-cell sequences
PEARS is a Nextflow DSL2 pipeline that detects gene fusions at single-cell resolution from 10X scRNA-seq data. It combines three complementary fusion-calling approaches — FUSCIA, Flexiplex, and Arriba — and assigns cell barcodes to each detected fusion event, producing per-cell fusion calls.
- Reference preparation — Downloads genome FASTA and GTF annotation (or uses pre-built references).
- Fusion target generation — Builds search targets from a known fusions list using the reference annotation.
- Alignment — Aligns reads with STARsolo (chimeric-aware) and produces a BAM and single-cell count matrix.
- Fusion detection - Calls fusions using FUSCIA, Flexiplex, and Arriba in parallel.
- Formatting — Consolidates results into three CSV files of per-cell fusion calls (
fuscia_fusion_calls.csv,flexiplex_fusion_calls.csv,arriba_fusion_calls.csv).
Running locally:
nextflow run DavidsonLab/pears \
--fastq_r1 "/path/to/Reads_R1.fastq.gz" \
--fastq_r2 "/path/to/Reads_R2.fastq.gz" \
--known_fusions_list "known_fusions.csv" \
--protocol "10x-3prime-v3" \
--genome_version "GRCh38+GENCODE44" \
-profile "local" \
-resumeRunning on SLURM cluster:
nextflow run DavidsonLab/pears \
--fastq_r1 "/path/to/Reads_R1.fastq.gz" \
--fastq_r2 "/path/to/Reads_R2.fastq.gz" \
--known_fusions_list "known_fusions.csv" \
--protocol "10x-3prime-v3" \
--genome_version "GRCh38+GENCODE44" \
-profile "slurm" \
-resumeRunning on SLURM will be much faster due to parallelization of the fusion detection step. The -resume flag allows you to continue from the last successful step if the pipeline is interrupted.
Tip
After running the pipeline for the first time, you can extract the references and index from the references/ and STAR_index/ folders in the output directory. Note that the STAR index is built with a specific --sjdbOverhang parameter according to the read length of the input FASTQ. If you want to reuse the same index, make sure that the read lengths of the new FASTQ files are the same as the original ones. If they differ, you should rebuild the STAR index with the appropriate --sjdbOverhang (read length - 1) for optimal performance.
See Pre-built or reusing reference overrides for the relevant arguments.
| Argument | Default | Description |
|---|---|---|
--fastq_r1 |
— | Glob pattern or path to Read 1 FASTQ files (gzipped). |
--fastq_r2 |
— | Glob pattern or path to Read 2 FASTQ files (gzipped). |
--known_fusions_list |
— | CSV file of known/candidate fusions to search for (see Known fusions list format). |
--protocol |
— | 10x Chromium chemistry preset (see Protocol presets). Sets the barcode whitelist and UMI length automatically. |
--genome_version |
GRCh38+GENCODE44 |
Genome build to download. Available versions: GRCh38+GENCODE40 through GRCh38+GENCODE49. |
--out_dir |
pears_output |
Directory for all pipeline outputs. |
-profile |
— | Execution environment: local or slurm. |
--protocol sets the barcode whitelist and UMI length for the given 10x chemistry. These values can be individually overridden with --barcode_include_list and --umi_len (see Read structure overrides).
| Preset | Chemistry | UMI length | Barcode whitelist |
|---|---|---|---|
10x-3prime-v2 |
3' Gene Expression v2 | 10 bp | 737K-august-2016 |
10x-3prime-v3 |
3' Gene Expression v3/v3.1 | 12 bp | 3M-february-2018 |
10x-3prime-v4 |
3' Gene Expression v4 | 12 bp | 3M-3pgex-may-2023 |
10x-5prime-v2 |
5' Gene Expression v1/v2 | 10 bp | 737K-august-2016 |
10x-5prime-v3 |
5' Gene Expression v3 | 12 bp | 3M-5pgex-jan-2023 |
--barcode_include_list and --umi_len override the corresponding values set by --protocol. If --protocol is omitted entirely, both must be provided.
| Argument | Default | Overrides |
|---|---|---|
--barcode_include_list |
set by --protocol |
Barcode whitelist. Path to a custom whitelist file (can be gzipped). |
--umi_len |
set by --protocol |
UMI length in bases. |
By default, the pipeline downloads the genome specified by --genome_version and builds the STAR index automatically. To skip this, provide all three arguments below — --genome_version is then ignored.
| Argument | Default | Overrides |
|---|---|---|
--ref_fasta |
downloaded | Genome FASTA. Must be provided together with --ref_gtf and --star_genome_index. |
--ref_gtf |
downloaded | GTF annotation. Must be provided together with --ref_fasta and --star_genome_index. |
--star_genome_index |
built from download | STAR genome index directory. Must be provided together with --ref_fasta and --ref_gtf. |
| Argument | Default | Description |
|---|---|---|
--flexiplex_searchlen |
20 |
Length of fusion junction sequence to search for (2x actual overlap). |
--flexiplex_demultiplex_options |
auto-generated | Flexiplex demultiplexing options string. When not set, auto-generated as -b "?{barcode_len}" -u "?{umi_len}" -e 1 -f 0 where barcode length is read from the whitelist file and UMI length comes from --protocol or --umi_len. Setting this explicitly overrides the auto-generated value. |
--fuscia_mapqual |
30 |
Minimum mapping quality for FUSCIA read extraction. |
--fuscia_up |
1000 |
Upstream search distance (bp) when no gene annotation is available. |
--fuscia_down |
1000 |
Downstream search distance (bp) when no gene annotation is available. |
The --known_fusions_list input is a CSV with the following required columns:
| Column | Description |
|---|---|
fusion genes |
Fusion gene pair separated by -- (e.g. BCAS4--BCAS3). |
chrom1 |
Chromosome of gene 1 (e.g. chr20). |
base1 |
Breakpoint position of gene 1. |
strand1 |
Strand of gene 1 (+ or -). |
chrom2 |
Chromosome of gene 2 (e.g. chr17). |
base2 |
Breakpoint position of gene 2. |
strand2 |
Strand of gene 2 (+ or -). |
Additional columns (e.g. classification) are ignored. This format is compatible with JAFFA output. Fusions involving mitochondrial genes (MT-) are automatically filtered out.
Example:
fusion genes,chrom1,base1,strand1,chrom2,base2,strand2,classification
BCAS4--BCAS3,chr20,50795173,+,chr17,61368327,+,HighConfidence
RPS6KB1--VMP1,chr17,59914703,+,chr17,59839768,+,HighConfidence
SLC25A24--NBPF6,chr1,108161182,-,chr1,108470597,+,HighConfidence
Results are written to --out_dir (default pears_output/):
| File/Directory | Description |
|---|---|
fuscia_fusion_calls.csv |
Per-cell fusion calls from FUSCIA. |
flexiplex_fusion_calls.csv |
Per-cell fusion calls from Flexiplex. |
arriba_fusion_calls.csv |
Per-cell fusion calls from Arriba. |
STARsolo/ |
BAM alignment, index, and single-cell count matrix. |
fuscia_out/ |
Per-fusion FUSCIA discordant read files. |
flexiplex_out/ |
Per-fusion Flexiplex barcode files. |
arriba_out/ |
Arriba fusion table and per-fusion barcode files. |
fusion_targets.csv |
Generated fusion target coordinates and sequences. |
nextflow_report.html |
Nextflow execution report. |
nextflow_trace.txt |
Nextflow process trace log. |
The three fusion calls CSVs (fuscia_fusion_calls.csv, flexiplex_fusion_calls.csv, arriba_fusion_calls.csv) share the same format:
| Column | Description |
|---|---|
cell_barcode |
10x cell barcode (CB tag) identifying the single cell. The trailing -1 suffix is stripped. |
molecular_barcode |
Unique Molecular Identifier (UMI / UB tag) distinguishing distinct RNA molecules from PCR duplicates within the same cell. |
fusion |
Detected gene fusion, formatted as GENE1--GENE2 (names taken from the --known_fusions_list input). |
Example:
cell_barcode,molecular_barcode,fusion
CCACAAAAGGTTCTTG,CAGGGATCAGTA,JPH1--NCOA2
CAGGGCTCACTTGGGC,TGATAGGAATCG,JPH1--NCOA2
GTGTGGCGTGGCCCAT,GGTAATCAGCAA,KIAA1429--RP11-586K2.1
Each row represents one unique observation of a fusion transcript in a specific cell. Rows are deduplicated — each (cell_barcode, molecular_barcode, fusion) triple appears at most once. Multiple rows with different cell barcodes for the same fusion indicate independent cells harbouring that fusion. Multiple rows with the same cell barcode but different UMIs for the same fusion indicate multiple distinct fusion transcript molecules captured in that cell, providing stronger evidence. Fusions detected by more than one tool in the same cell are higher confidence and can be identified by cross-referencing the three CSVs.
Adapted from FUSCIA (Steven Foltz, 2019) and Flexiplex (Davidson et al., 2022).