The References Extension simplifies genome reference and annotation file management in Nextflow pipelines. It resolves each reference from either a user-provided file or a standardized reference collection such as iGenomes, so a pipeline can accept both without extra branching logic.
Managing reference files is a common challenge in bioinformatics pipelines. Users might provide:
- Custom reference files via parameters
- References from standardized collections (iGenomes)
- Mixed approaches depending on the analysis
The References Extension solves this by providing a unified interface that automatically resolves the appropriate reference source based on user input and pipeline configuration.
The extension provides two essential functions:
- File Resolution: getReferencesFile() resolves file paths from parameters or reference metadata
- Value Resolution: getReferencesValue() retrieves metadata values with parameter override support
A related helper, getGenomeAttribute(), is used to retrieve genome attributes in the nf-core TEMPLATE.
It retrieves a specific attribute (such as fasta, gtf, or index paths) for the selected genome from the params.genomes map. This is useful for pipelines that support multiple genomes and need to access reference files or metadata for the currently selected genome.
Object getGenomeAttribute(String attribute)
or (static utility):
Object ReferencesUtils.getGenomeAttribute(Map params, String attribute)

| Parameter | Type | Required | Description |
|---|---|---|---|
| attribute | String | Yes | The attribute name to retrieve (e.g. 'fasta', 'gtf', 'star') |
| params | Map | Yes | (static) The Nextflow params map containing genome and genomes |

Returns the value of the requested attribute for the selected genome, or null if not found.
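
The lookup described above can be pictured as a null-safe walk through the params map. The following is a minimal sketch in plain Groovy, not the plugin's source: the helper name lookupGenomeAttribute is hypothetical, and the sketch assumes params.genomes is a map keyed by genome name.

```groovy
// Hypothetical stand-in for the lookup described above (not the plugin's code).
def lookupGenomeAttribute(Map params, String attribute) {
    // Safe navigation yields null when the genomes map, the selected genome,
    // or the requested attribute is missing.
    return params?.genomes?.get(params?.genome)?.get(attribute)
}

def exampleParams = [
    genome : 'GRCh38',
    genomes: [
        GRCh38: [fasta: 's3://bucket/genome.fa'],
    ],
]

// Attribute present for the selected genome
assert lookupGenomeAttribute(exampleParams, 'fasta') == 's3://bucket/genome.fa'
// Attribute absent for the selected genome
assert lookupGenomeAttribute(exampleParams, 'star') == null
// Selected genome absent from the genomes map
assert lookupGenomeAttribute([genome: 'mm10', genomes: exampleParams.genomes], 'fasta') == null
```

Returning null rather than throwing lets the caller decide whether a missing attribute is an error or simply an optional reference.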
// Example params structure
def params = [
genome: 'GRCh38',
genomes: [
GRCh38: [
fasta: 's3://bucket/genome.fa',
gtf: 's3://bucket/genes.gtf',
],
GRCh37: [
fasta: 's3://bucket/genome37.fa',
star: 's3://bucket/star_index/'
]
]
]
// Retrieve the FASTA file for the selected genome
def fasta = ReferencesUtils.getGenomeAttribute(params, 'fasta')
// Returns: 's3://bucket/genome.fa'
// Retrieve the GTF file
def gtf = ReferencesUtils.getGenomeAttribute(params, 'gtf')
// Returns: 's3://bucket/genes.gtf'
// If the attribute or genome is missing, returns null
def missing = ReferencesUtils.getGenomeAttribute(params, 'star')
// Returns: null

include { getGenomeAttribute } from 'plugin/nf-core-utils'
workflow {
// Example: get the FASTA file for the selected genome
genome_fasta = getGenomeAttribute('fasta')
log.info "Selected genome FASTA: ${genome_fasta}"
}

Let's start with a simple example that demonstrates the core concept:
#!/usr/bin/env nextflow
// Import reference utilities
include { getReferencesFile } from 'plugin/nf-core-utils'
include { getReferencesValue } from 'plugin/nf-core-utils'
// Pipeline parameters
params.fasta = null // User can override with custom file
params.genome = 'GRCh38' // Default to standard reference
params.igenomes_base = 's3://ngi-igenomes/igenomes'
workflow {
log.info "Setting up genome references for ${params.genome}"
// Create references channel (example structure)
references_ch = Channel.of([
genome: params.genome,
fasta: "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
])
// Get reference file - user parameter takes precedence
genome_fasta = getReferencesFile(
references_ch, // Reference metadata channel
params.fasta, // User-provided parameter (null = use metadata)
'fasta', // Attribute to look for in metadata
params.igenomes_base // Base path for reference resolution
)
genome_fasta.view { "Using genome: ${it}" }
}

N E X T F L O W ~ version 25.04.0
Launching `basic_references.nf` [peaceful-darwin] - revision: abc1234
INFO: Setting up genome references for GRCh38
Using genome: s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa

The extension follows a clear precedence hierarchy:
- User Parameters (highest priority) - Direct file paths provided by users
- Reference Metadata (fallback) - Files from reference collections
- Default Values (lowest priority) - Pipeline defaults
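
The three levels above can be sketched as a small plain-Groovy function. This is an illustration only: resolveWithPrecedence and its signature are hypothetical and are not part of the plugin API, and the sketch assumes that a null user parameter means "not provided".

```groovy
// Hypothetical helper illustrating the precedence order (not the plugin's code).
def resolveWithPrecedence(Object userParam, Map metadata, String attribute, Object defaultValue = null) {
    if (userParam != null) {
        return userParam              // 1. user parameter has the highest priority
    }
    if (metadata?.get(attribute) != null) {
        return metadata[attribute]    // 2. fall back to the reference metadata
    }
    return defaultValue               // 3. finally, use the pipeline default
}

def metadata = [genome: 'GRCh38', fasta: 's3://igenomes/GRCh38/genome.fa']

// User override wins over metadata
assert resolveWithPrecedence('/custom/path/genome.fa', metadata, 'fasta') == '/custom/path/genome.fa'
// No user value: metadata is used
assert resolveWithPrecedence(null, metadata, 'fasta') == 's3://igenomes/GRCh38/genome.fa'
// Neither user value nor metadata entry: the default is used
assert resolveWithPrecedence(null, metadata, 'gtf', 'default.gtf') == 'default.gtf'
```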
#!/usr/bin/env nextflow
include { getReferencesFile } from 'plugin/nf-core-utils'
params.fasta = "/custom/path/genome.fa" // User override
params.genome = 'GRCh38'
workflow {
references_ch = Channel.of([
genome: 'GRCh38',
fasta: 's3://igenomes/GRCh38/genome.fa' // This will be ignored
])
// User parameter takes precedence
genome_fasta = getReferencesFile(references_ch, params.fasta, 'fasta', null)
genome_fasta.view { "Selected: ${it}" } // Shows custom path
}

This function resolves file paths based on user parameters and reference metadata.

| Parameter | Type | Required | Description |
|---|---|---|---|
| references | Channel | Yes | Channel containing reference metadata |
| param | String/null | Yes | User-provided file path (null = use metadata) |
| attribute | String | Yes | Metadata attribute name to extract |
| basepath | String | No | Base path for relative path resolution |

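
To make the role of basepath concrete, here is a minimal plain-Groovy sketch of one way relative metadata paths could be joined to a base path. How the plugin actually performs this step is not documented here, so the helper resolveAgainstBase and its rules (absolute paths and URIs are left untouched) are assumptions made for illustration.

```groovy
// Hypothetical helper; the plugin's real path-joining rules may differ.
def resolveAgainstBase(String path, String basepath) {
    // Treat local absolute paths and URIs with a scheme (s3://, https://, ...) as already resolved.
    boolean isAbsolute = path.startsWith('/') || path ==~ /^[A-Za-z][A-Za-z0-9+.-]*:\/\/.*/
    if (isAbsolute || !basepath) {
        return path
    }
    // Join with exactly one slash between the base and the relative path.
    return basepath.replaceAll('/+$', '') + '/' + path
}

// Relative metadata path is prefixed with the base path
assert resolveAgainstBase('Homo_sapiens/NCBI/GRCh38/README.txt', 's3://ngi-igenomes/igenomes/') ==
    's3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/README.txt'
// Absolute user path is returned unchanged
assert resolveAgainstBase('/custom/annotations.gtf', 's3://ngi-igenomes/igenomes') == '/custom/annotations.gtf'
// Without a base path, the value is returned as given
assert resolveAgainstBase('genome.fa', null) == 'genome.fa'
```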
#!/usr/bin/env nextflow
include { getReferencesFile } from 'plugin/nf-core-utils'
params.fasta = null
params.gtf = "/custom/annotations.gtf"
params.igenomes_base = 's3://ngi-igenomes/igenomes'
workflow {
// Create comprehensive reference metadata
references = Channel.of([
genome: 'GRCh38',
fasta: 'Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa',
gtf: 'Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf',
readme: 'Homo_sapiens/NCBI/GRCh38/README.txt'
])
// Resolve multiple reference files
genome_fasta = getReferencesFile(references, params.fasta, 'fasta', params.igenomes_base)
genome_gtf = getReferencesFile(references, params.gtf, 'gtf', params.igenomes_base)
// Combine for downstream processing
references_ready = genome_fasta.combine(genome_gtf)
references_ready.view { fasta, gtf ->
"""
Reference files ready:
FASTA: ${fasta}
GTF: ${gtf}
"""
}
}

This function extracts metadata values with user parameter override support.
Object getReferencesValue(
Channel references, // Reference metadata channel
Object param, // User parameter value or null
String attribute // Metadata attribute name
)

| Parameter | Type | Required | Description |
|---|---|---|---|
| references | Channel | Yes | Channel containing reference metadata |
| param | Any/null | Yes | User-provided value (null = use metadata) |
| attribute | String | Yes | Metadata attribute name to extract |

#!/usr/bin/env nextflow
include { getReferencesValue } from 'plugin/nf-core-utils'
params.species = null // Use metadata default
params.build = "custom_v2" // Override metadata
workflow {
references = Channel.of([
genome: 'GRCh38',
species: 'Homo sapiens',
build: 'GRCh38.p13',
assembly_date: '2020-12-01'
])
// Extract various metadata values
species_name = getReferencesValue(references, params.species, 'species')
genome_build = getReferencesValue(references, params.build, 'build')
assembly_date = getReferencesValue(references, null, 'assembly_date')
// Combine all metadata
metadata = species_name.combine(genome_build).combine(assembly_date)
// species comes from metadata (params.species = null), build comes from the
// params.build override, and date comes from metadata
metadata.view { species, build, date ->
"""
Genome Metadata:
Species: ${species}
Build: ${build}
Date: ${date}
"""
}
}

Here's a comprehensive example showing how to integrate reference resolution into a real bioinformatics pipeline:
#!/usr/bin/env nextflow
include { getReferencesFile } from 'plugin/nf-core-utils'
include { getReferencesValue } from 'plugin/nf-core-utils'
// Pipeline parameters with sensible defaults
params.input = 'samples.csv'
params.fasta = null // User can override
params.gtf = null // User can override
params.genome = 'GRCh38' // Default reference
params.igenomes_base = 's3://ngi-igenomes/igenomes'
workflow {
log.info "Starting analysis with genome: ${params.genome}"
// Load reference metadata (in practice, this might come from a YAML file)
references = Channel.of([
genome: params.genome,
species: 'Homo sapiens',
fasta: "Homo_sapiens/NCBI/${params.genome}/Sequence/WholeGenomeFasta/genome.fa",
gtf: "Homo_sapiens/NCBI/${params.genome}/Annotation/Genes/genes.gtf",
star_index: "Homo_sapiens/NCBI/${params.genome}/Sequence/STARIndex/",
build: "${params.genome}.p13"
])
// Resolve reference files intelligently
genome_fasta = getReferencesFile(references, params.fasta, 'fasta', params.igenomes_base)
genome_gtf = getReferencesFile(references, params.gtf, 'gtf', params.igenomes_base)
star_index = getReferencesFile(references, null, 'star_index', params.igenomes_base)
// Extract metadata values
species_name = getReferencesValue(references, null, 'species')
genome_build = getReferencesValue(references, null, 'build')
// Use in downstream processes
ALIGNMENT(Channel.fromPath(params.input), genome_fasta, star_index)
ANNOTATION(ALIGNMENT.out.bam, genome_gtf)
// Create analysis report with metadata
species_name.combine(genome_build).view { species, build ->
log.info "Analysis completed for ${species} (${build})"
}
}
process ALIGNMENT {
input:
path samples
path fasta
path index
output:
path "*.bam", emit: bam
script:
"""
echo "Aligning samples using:"
echo "Reference: ${fasta}"
echo "Index: ${index}"
# STAR alignment commands would go here
touch aligned.bam
"""
}
process ANNOTATION {
input:
path bam
path gtf
output:
path "*.counts", emit: counts
script:
"""
echo "Counting features using: ${gtf}"
# featureCounts commands would go here
touch counts.txt
"""
}

For pipelines using standardized reference collections, create a systematic approach:
#!/usr/bin/env nextflow
include { getReferencesFile } from 'plugin/nf-core-utils'
include { getReferencesValue } from 'plugin/nf-core-utils'
params.genome = 'GRCh38'
params.igenomes_base = 's3://ngi-igenomes/igenomes'
// User overrides (any can be null to use defaults)
params.fasta = null
params.gtf = null
params.bed12 = null
workflow {
// Define comprehensive iGenomes structure
igenomes_references = Channel.of([
genome: params.genome,
species: 'Homo sapiens',
provider: 'NCBI',
build: 'GRCh38.p13',
fasta: "Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa",
fasta_fai: "Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa.fai",
gtf: "Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf",
bed12: "Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed",
star_index: "Homo_sapiens/NCBI/GRCh38/Sequence/STARIndex/",
bowtie2_index: "Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/"
])
// Resolve all required references
genome_fasta = getReferencesFile(igenomes_references, params.fasta, 'fasta', params.igenomes_base)
genome_fasta_fai = getReferencesFile(igenomes_references, null, 'fasta_fai', params.igenomes_base)
genome_gtf = getReferencesFile(igenomes_references, params.gtf, 'gtf', params.igenomes_base)
genome_bed12 = getReferencesFile(igenomes_references, params.bed12, 'bed12', params.igenomes_base)
// Create reference bundle for processes
reference_bundle = genome_fasta
.combine(genome_fasta_fai)
.combine(genome_gtf)
.combine(genome_bed12)
reference_bundle.view { fasta, fai, gtf, bed ->
"""
Reference Bundle Ready:
- FASTA: ${fasta}
- Index: ${fai}
- GTF: ${gtf}
- BED12: ${bed}
"""
}
}

For pipelines supporting multiple genomes:
#!/usr/bin/env nextflow
include { getReferencesFile } from 'plugin/nf-core-utils'
include { getReferencesValue } from 'plugin/nf-core-utils'
// Support for multiple genomes
params.genomes = ['GRCh38', 'mm10']
params.igenomes_base = 's3://ngi-igenomes/igenomes'
workflow {
// Create references for multiple genomes
genome_configs = Channel.fromList([
[genome: 'GRCh38', species: 'Homo sapiens', provider: 'NCBI'],
[genome: 'mm10', species: 'Mus musculus', provider: 'UCSC']
])
// Add file paths to each genome config
references = genome_configs.map { config ->
config + [
fasta: "${config.species.replace(' ', '_')}/${config.provider}/${config.genome}/Sequence/WholeGenomeFasta/genome.fa",
gtf: "${config.species.replace(' ', '_')}/${config.provider}/${config.genome}/Annotation/Genes/genes.gtf"
]
}
// Resolve references for each genome using proper channel operations
resolved_references = references
.map { ref ->
tuple(ref,
getReferencesFile(Channel.of(ref), null, 'fasta', params.igenomes_base),
getReferencesFile(Channel.of(ref), null, 'gtf', params.igenomes_base)
)
}
.flatMap { ref, fasta_ch, gtf_ch ->
fasta_ch
.combine(gtf_ch)
.map { fasta, gtf ->
[
genome: ref.genome,
species: ref.species,
fasta: fasta,
gtf: gtf
]
}
}
resolved_references.view { ref ->
"Ready: ${ref.genome} (${ref.species}) - FASTA: ${ref.fasta}, GTF: ${ref.gtf}"
}
}

Add validation to ensure reference files exist and are compatible:
#!/usr/bin/env nextflow
include { getReferencesFile } from 'plugin/nf-core-utils'
params.fasta = null
params.igenomes_base = 's3://ngi-igenomes/igenomes'
workflow {
references = Channel.of([
genome: 'GRCh38',
fasta: 'Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa'
])
genome_fasta = getReferencesFile(references, params.fasta, 'fasta', params.igenomes_base)
// Validate reference file
VALIDATE_REFERENCE(genome_fasta)
}
process VALIDATE_REFERENCE {
input:
path fasta
output:
path fasta, emit: validated_fasta
stdout emit: validation_report
script:
"""
# Check if file exists and is not empty
if [[ ! -s "${fasta}" ]]; then
echo "ERROR: Reference file ${fasta} is empty or doesn't exist"
exit 1
fi
# Check FASTA format
if ! grep -q "^>" "${fasta}"; then
echo "ERROR: ${fasta} doesn't appear to be a valid FASTA file"
exit 1
fi
# Count sequences
num_sequences=\$(grep -c "^>" "${fasta}")
echo "Validated FASTA with \${num_sequences} sequences"
"""
}

> [!TIP]
> Parameter naming: use consistent parameter names across your pipeline:
>
> - params.fasta for genome sequences
> - params.gtf for gene annotations
> - params.bed12 for BED format annotations
> - params.{tool}_index for tool-specific indices

#!/usr/bin/env nextflow
include { getReferencesFile } from 'plugin/nf-core-utils'
params.fasta = null
params.genome = 'GRCh38'
workflow {
// Validate required parameters
if (!params.genome && !params.fasta) {
error "Either --genome or --fasta must be provided!"
}
references = Channel.of([
genome: params.genome,
fasta: params.genome ? "genomes/${params.genome}/genome.fa" : null
])
// Handle missing reference gracefully
try {
genome_fasta = getReferencesFile(references, params.fasta, 'fasta', null)
genome_fasta.view { "Using reference: ${it}" }
} catch (Exception e) {
log.error "Failed to resolve reference: ${e.message}"
log.error "Please check --genome parameter or provide --fasta directly"
System.exit(1)
}
}

Always document your reference requirements:
#!/usr/bin/env nextflow
/*
* REFERENCE FILES
*
* This pipeline supports flexible reference file specification:
*
* Option 1: Use standard genome (automatic file resolution)
* --genome GRCh38
*
* Option 2: Provide custom files (override defaults)
* --fasta /path/to/genome.fa
* --gtf /path/to/annotations.gtf
*
* Option 3: Mix standard and custom
* --genome GRCh38 --gtf /custom/annotations.gtf
*
* Supported genomes: GRCh38, GRCh37, mm10, mm9
*/
include { getReferencesFile } from 'plugin/nf-core-utils'
include { getReferencesValue } from 'plugin/nf-core-utils'
// Reference parameters with documentation
params.genome = null // Standard genome name (e.g., 'GRCh38')
params.fasta = null // Custom genome FASTA file
params.gtf = null // Custom gene annotation file
params.igenomes_base = 's3://ngi-igenomes/igenomes'
workflow {
// Implementation here...
}

The References Extension provides a flexible system for managing genome references in Nextflow pipelines:
- Unified Interface: Single functions handle both custom files and reference collections
- Smart Resolution: Automatic parameter precedence with user override support
- iGenomes Integration: Seamless integration with standardized reference collections
- Pipeline Flexibility: Users can mix custom and standard references as needed
- Explore NfCore Utilities for comprehensive pipeline utilities
- Learn about NextflowPipelineExtension for core pipeline functions
- Check out the utility documentation in utilities/ for specialized functions