Support multiple outputs #63
One current limitation of modules is that they typically output only one file type, and sample grouping information is lost. This is fine when, for example, an alignment module has five inputs and creates five outputs. But if an analysis module creates three different output files for each input (and different downstream modules could make use of different file types or combinations), it becomes hard to tell which outputs belong together.
As an extra layer, output files could be assigned a group based on the original starting file. Modules could then filter the input that they use by file type, and output as many files as they need.
Currently:
start_000 file1.fq.gz
start_000 file2.fq.gz
align_838 file1.bam
align_838 file2.bam
analyse_239 file1_stats.csv
analyse_239 file1_filtered.bam
analyse_239 file2_stats.csv
analyse_239 file2_filtered.bam
Suggested:
start_000 file1 file1.fq.gz
start_000 file2 file2.fq.gz
align_838 file1 file1.bam
align_838 file2 file2.bam
analyse_239 file1 file1_stats.csv
analyse_239 file1 file1_filtered.bam
analyse_239 file2 file2_stats.csv
analyse_239 file2 file2_filtered.bam
This grouping means that modules downstream of analyse_239 can use both the alignments and the stats file in combination safely, knowing that they came from the same sample.
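As a rough illustration of how a downstream module might use the grouping, here is a minimal Python sketch. It assumes the suggested three-column layout (job ID, group, filename) separated by whitespace; the `group_outputs` helper and the `RUN_LINES` sample are hypothetical and not part of Cluster Flow.

```python
from collections import defaultdict

# Hypothetical run-file contents in the suggested layout:
# job ID, sample group, filename (whitespace-separated, an assumption).
RUN_LINES = """\
start_000 file1 file1.fq.gz
start_000 file2 file2.fq.gz
align_838 file1 file1.bam
align_838 file2 file2.bam
analyse_239 file1 file1_stats.csv
analyse_239 file1 file1_filtered.bam
analyse_239 file2 file2_stats.csv
analyse_239 file2 file2_filtered.bam
"""


def group_outputs(text, job_id):
    """Return {group: [filenames]} for the files produced by one job."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        job, group, filename = line.split(None, 2)
        if job == job_id:
            groups[group].append(filename)
    return dict(groups)


grouped = group_outputs(RUN_LINES, "analyse_239")
for sample, files in grouped.items():
    print(sample, files)
```

Each sample's stats file and filtered alignment end up under the same key, so a module can pair them without guessing from filenames.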
We could also add a 'type' field to describe the kind of output being generated. Modules could then list the input types needed and output types generated at the --request stage, enabling a pipeline to be checked for compatibility, e.g.:
start_000 file1 fastq file1.fq.gz
start_000 file2 fastq file2.fq.gz
align_838 file1 bam file1.bam
align_838 file2 bam file2.bam
analyse_239 file1 stats file1_stats.csv
analyse_239 file1 counts file1_counts.csv
analyse_239 file1 bam file1_filtered.bam
analyse_239 file2 stats file2_stats.csv
analyse_239 file2 counts file2_counts.csv
analyse_239 file2 bam file2_filtered.bam
Using a named tag would enable differentiation between different types of files. For instance, if a module generates a csv with a specific format, that could be included in the name.
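To make the idea concrete, here is a hedged Python sketch of type-based filtering and a pipeline compatibility check. The four-column layout follows the example above; the `MODULES` declarations, the `report` module, and the `select` and `check_pipeline` helpers are all hypothetical stand-ins for whatever a module would actually report at the --request stage.

```python
from collections import defaultdict
from typing import NamedTuple


class Output(NamedTuple):
    job: str
    group: str
    kind: str  # the proposed 'type' field
    filename: str


# Hypothetical run-file contents: job ID, group, type, filename.
RUN_LINES = """\
start_000 file1 fastq file1.fq.gz
align_838 file1 bam file1.bam
analyse_239 file1 stats file1_stats.csv
analyse_239 file1 counts file1_counts.csv
analyse_239 file1 bam file1_filtered.bam
analyse_239 file2 stats file2_stats.csv
analyse_239 file2 bam file2_filtered.bam
"""


def parse(text):
    """Turn each non-blank line into an Output record."""
    return [Output(*line.split(None, 3))
            for line in text.splitlines() if line.strip()]


def select(outputs, job, wanted):
    """Per group, map each wanted type to its files from one job.

    Groups that lack any of the wanted types are left out.
    """
    by_group = defaultdict(lambda: defaultdict(list))
    for out in outputs:
        if out.job == job and out.kind in wanted:
            by_group[out.group][out.kind].append(out.filename)
    return {g: dict(kinds) for g, kinds in by_group.items()
            if set(kinds) == set(wanted)}


# Hypothetical per-module declarations of input and output types.
MODULES = {
    "align": {"needs": {"fastq"}, "makes": {"bam"}},
    "analyse": {"needs": {"bam"}, "makes": {"stats", "counts", "bam"}},
    "report": {"needs": {"stats", "counts"}, "makes": {"html"}},
}


def check_pipeline(steps, start_types):
    """Return [(module, missing types)] for steps whose inputs are unmet."""
    available = set(start_types)
    problems = []
    for step in steps:
        missing = MODULES[step]["needs"] - available
        if missing:
            problems.append((step, sorted(missing)))
        available |= MODULES[step]["makes"]
    return problems


outputs = parse(RUN_LINES)
print(select(outputs, "analyse_239", {"bam", "stats"}))
print(check_pipeline(["align", "analyse", "report"], {"fastq"}))
print(check_pipeline(["analyse", "report"], {"fastq"}))
```

The check here only tracks which types exist anywhere upstream. A stricter version could require that a type comes from the immediately preceding module, which is a design choice this sketch leaves open.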
This is a fairly major change in behaviour and not a top priority. Just a thought at this point, but it could be useful as Cluster Flow gains more modules and is able to handle more complex pipelines.