If you want to use the downsampling step in Hummingbird but your input files are in a format other than BAM or FASTQ (fastq/fastq.gz), please follow the examples below:
-
Input file(s) are in CRAM format
The user can convert CRAM to BAM using samtools (http://www.htslib.org/doc/samtools-view.html):
samtools view -b -T ref.fa -o aln.bam aln.cram
The reverse direction (BAM to CRAM) uses -C instead of -b:
samtools view -C -T ref.fa aln.bam > aln.cram
Please note that the original reference fasta file is required for this conversion. Generating the index file after conversion may be necessary for subsequent analysis using software tools or pipelines. For more information on using CRAM files with samtools, please see http://www.htslib.org/workflow/.
-
Input file(s) are in SAM format
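Samtools can do this conversion:
samtools view -b file.sam | samtools sort -o file_sorted.bam -
This is the syntax for samtools 1.3 and later. Older 0.1.x releases used the prefix form instead:
samtools view -bS file.sam | samtools sort - file_sorted
Generating the index file after conversion may be necessary for subsequent analysis using software tools or pipelines.
-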
Input file(s) are in FASTQ but need uBAM
In some cases, the bioinformatics pipeline being evaluated accepts unaligned BAM (uBAM) files, so the FASTQ files need to be converted to uBAM. The FastqToSam tool within the Picard suite of tools (https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam) can be used:
java -jar picard.jar FastqToSam F1=file_1.fastq O=fastq_to_bam.bam SM=for_tool_testing
For paired-end data, the second FASTQ file is passed with F2 (for example F2=file_2.fastq).
-
Aligned BAM to unmapped BAM
The RevertSam tool from the Picard tools suite can be used:
java -jar picard.jar RevertSam I=input.bam O=reverted.bam
where input.bam is the aligned BAM and reverted.bam is the output unmapped BAM.
-
BED to BAM
Bedtools can be used to convert a file from BED format to BAM:
bedtools bedtobam -i input.bed -g genome_file > input_converted.bam
where genome_file is not a fasta file but a two-column, tab-separated file listing each chromosome and its size in base pairs.
The genome_file can be derived from a fasta index file in the .fai format (indexing can be done using samtools faidx reference.fasta) by extracting the first two columns, for example with cut -f1,2 reference.fasta.fai > genome_file, and adding the "chr" prefix to the chromosome names where that is needed to match the names in the BED file. For more details on genome_file and the pre-defined genome files available with the bedtools distribution, please see https://bedtools.readthedocs.io/en/latest/content/general-usage.html#genome-file-format.
In case the input file is in BED12 format and spliced BAM entries are to be generated, use:
bedtools bedtobam -i input_bed12format.bed -g genome_file -bed12 > input_converted_spliced.bam
If required, a BED12 file (which has blocked features) can be converted to BED6 format (each block listed on a separate line) using:
bedtools bed12tobed6 -i input_bed12.bed
For details on options, please refer to the bedtools documentation (https://bedtools.readthedocs.io/en/latest/content/overview.html).
-
GFF to BAM
Bedtools can help convert a feature file such as one in GFF format to BAM:
bedtools bedtobam -i input.gff -g genome_file > input_converted.bam
where genome_file is not a fasta file but a two-column file listing each chromosome and its size in base pairs.
Please refer to BED to BAM for more details on the genome_file format.
-
VCF to BAM
Bedtools can help convert a feature file such as one in VCF format to BAM:
bedtools bedtobam -i input.vcf -g genome_file > input_converted.bam
where genome_file is not a fasta file but a two-column file listing each chromosome and its size in base pairs.
-
BAM to BED/BEDPE formats
Please note that currently Hummingbird does not natively support BEDPE format. However, users can skip the downsampling step (please check Downsample option in Hummingbird for more details) and continue using other features of Hummingbird.
a) An input BAM file can be converted to a BED file (BED6 format by default) using bedtools:
bedtools bamtobed -i input.bam > output.bed
For further details, please see https://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html.
b) An input BAM file can be converted to a BEDPE file using bedtools:
bedtools bamtobed -i input.bam -bedpe > output.bedpe
c) Non-lossy conversion of BAM to BED can be performed using the bam2bed functionality in the BEDOPS suite of tools:
bam2bed --keep-header < input.bam > output.bed
The --keep-header option is needed for the header information to be included in the output file.
In some cases, sorting and indexing of the input.bam file may be required for subsequent analyses.
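Before passing a converted BED file on, it can be worth checking that it really has the BED6 layout (chromosome, start, end, name, score, strand). The following is a minimal sketch of such a check; check_bed6 is a hypothetical helper name, not part of bedtools, BEDOPS, or Hummingbird, and it assumes a tab-separated file without header lines.

```shell
# check_bed6 FILE: hypothetical helper for illustration only.
# Prints "ok" and returns 0 when every line has exactly 6 tab-separated
# fields, numeric start/end columns, and start not greater than end.
# Otherwise it reports the first offending line and returns 1.
check_bed6() {
    awk -F'\t' '
        NF != 6 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
            printf "bad line %d: %s\n", NR, $0
            bad = 1
            exit
        }
        END {
            if (bad) exit 1
            print "ok"
        }
    ' "$1"
}
```

A typical use would be to run check_bed6 output.bed right after the bamtobed conversion and stop if it returns a non-zero status.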
-
VCF to BED/BEDPE formats
A number of structural variant tools help with conversion from VCF to BED or BEDPE formats, such as:
a) lumpy-sv (https://github.com/arq5x/lumpy-sv): Please refer to the script vcfToBedpe.py.
b) SURVIVOR (https://github.com/fritzsedlazeck/SURVIVOR): Please refer to the bedpetovcf functionality.
c) svtools (https://github.com/hall-lab/svtools): Please refer to the bedpetovcf and vcftobedpe subcommands. The benchmarking details on these subcommands can be found in Table 3 of their publication (https://academic.oup.com/bioinformatics/article/35/22/4782/5520944), which gives an idea of the computational resources required and execution times.
-
BEDPE to BED12 format
The subcommand bedpetobed12 within svtools (https://github.com/hall-lab/svtools) can convert a BEDPE file to BED12 format.
-
BAM to FASTQ
a) Using the bamtofastq functionality in bedtools, for paired-end data:
bedtools bamtofastq -i aln.qsort.bam -fq aln.R1.fq -fq2 aln.R2.fq
The input BAM file has to be sorted by query name, which can be done using:
samtools sort -n -o aln.qsort.bam aln.bam
(samtools 0.1.x releases used the prefix form samtools sort -n aln.bam aln.qsort instead.)
In case of single-end reads, conversion is done using:
bedtools bamtofastq -i aln.bam -fq aln.fq
For further information on the options that can be used on the bamtofastq command line, please see https://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html
b) Using samtools, for paired-end data:
samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n in_sorted.bam
The input BAM file has to be sorted by query name before conversion, as explained in (a) above. For more details on the options, please refer to http://www.htslib.org/doc/samtools-fasta.html.
Alternative tools for conversion of BAM to FASTQ can be found here: https://sites.google.com/site/wiki4metagenomics/tools/samtools/converting-bam-to-fastq.
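After a paired-end conversion, a quick sanity check is to confirm that both FASTQ files hold the same number of reads. The sketch below is illustrative only: count_fastq_reads and check_pairs are hypothetical helper names (not part of bedtools, samtools, or Hummingbird), and the code assumes uncompressed FASTQ with exactly four lines per record.

```shell
# count_fastq_reads FILE: number of records, assuming 4 lines per record.
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# check_pairs R1 R2: succeeds only if both files hold the same number of reads.
check_pairs() {
    local n1 n2
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "paired: $n1 reads in each file"
    else
        echo "mismatch: $n1 vs $n2 reads" >&2
        return 1
    fi
}
```

For the file names used above, this would be called as check_pairs aln.R1.fq aln.R2.fq. A mismatch usually suggests that the BAM file was not sorted by query name or that some reads lacked a mate.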
For other file format conversions not listed here:
a) The Galaxy suite of tools can be used. Please refer to "Convert Formats" in https://vclv99-241.hpc.ncsu.edu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fvipints%2Ffml_gff3togtf%2Ffml_bed2gff%2F2.1.0&version=2.1.0&__identifer=ibvartqtce.
b) Jvarkit (Java utilities for bioinformatics) can be used. Please see http://lindenb.github.io/jvarkit/.
Any of the above functionality from the various format conversion tools can be incorporated into the Hummingbird code by building a Docker image of the tool and adding the required command lines for automated execution.
Please note that sorting and generating the index file after the conversion may be necessary for subsequent analysis using software tools or bioinformatics pipelines.
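The sorting and indexing step mentioned above could be scripted along the following lines. This is a minimal sketch under stated assumptions: plan_sort_index is a hypothetical helper name, the commands assume samtools 1.3 or later (where sort takes -o), and the function only prints the commands so that they can be reviewed before being run or copied into a pipeline.

```shell
# plan_sort_index INPUT.bam [OUTPUT.bam]: hypothetical dry-run helper.
# Prints the samtools commands for coordinate sorting and indexing
# without executing them. File names containing spaces are not handled.
plan_sort_index() {
    local input output
    input=$1
    # Default output name: replace the .bam suffix with .sorted.bam
    output=${2:-${input%.bam}.sorted.bam}
    printf 'samtools sort -o %s %s\n' "$output" "$input"
    printf 'samtools index %s\n' "$output"
}
```

Calling plan_sort_index input_converted.bam prints the two commands; piping that output to sh would execute them on a machine where samtools is installed.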