Skip to content

Latest commit

 

History

History
252 lines (134 loc) · 7.17 KB

File metadata and controls

252 lines (134 loc) · 7.17 KB

Background




Source: Schatz et al. Genome Biology 2012 13:243

De novo genome assembly is a method for constructing genomes from a large number of (short- or long-) DNA fragments, with no a priori knowledge of the correct sequence or order of those fragments. In de novo genome assembly there is no reference template to help guide the reassembly of the reads into the original genome. Instead, the genome assembler reassembles the pieces of sequenced DNA. The order of the reads can be computationally inferred by detecting overlapping regions within the reads. The more similarity between the end of one read and the beginning of another, the more likely they are to have originated from overlapping sections of the genome.

We used PacBio long reads to generate the raw de-novo assembly.

There are two kinds of PacBio long reads:

HiFi reads. see https://www.pacb.com/smrt-science/smrt-sequencing/hifi-reads-for-highly-accurate-long-read-sequencing/

CLR (continuous long reads) See https://www.pacb.com/tag/continuous-long-read-clr/

There are many assembly tools for these types of long reads.

We used hifiasm to generate raw assemblies with HiFi reads. https://github.com/chhylp123/hifiasm

We used Redbean (aka wtdbg2) to generate raw assemblies with CLR reads. https://github.com/ruanjue/wtdbg2 https://github.com/ruanjue/wtdbg2/blob/master/README-ori.md

Dependencies

1. Generate raw assembly using hifiasm and HiFi reads

We used hifiasm to generate raw assemblies with HiFi reads. https://github.com/chhylp123/hifiasm

This tool does not require read depth or estimated genome size to be specified at the command line

1.1 The hifiasm script

The script hifiasm_slurm.sh runs hifiasm on a linux cluster.

Please edit the script with your information.

The script consists of two steps:

  • Step 1 runs hifiasm and generates the raw assembly. The output is NOT in fasta format.

  • Step 2 runs gfatools gfa2fa to reformat the output and write it in fasta format.

hifiasm -o ${PREFIX}_hifiasm -t $SLURM_NPROCS $HiFi

gfatools gfa2fa ${PREFIX}_hifiasm.p_ctg.gfa > ${PREFIX}_hifiasm.p_ctg.fasta

1.2 Run the script

 sbatch hifiasm_slurm.sh

The tool is very fast. It should not take more than 6 hours to run.

1.3 hifiasm output files

Hifiasm generates different types of assemblies based on the input data.

It also writes error corrected reads to the prefix.ec.bin binary file and writes overlaps to prefix.ovlp.source.bin and prefix.ovlp.reverse.bin.

For more details, please see the complete: https://hifiasm.readthedocs.io/en/latest/interpreting-output.html#interpreting-output

2. Generate raw assembly using redbean and CLR reads

This tool requires some preprocessing steps.

  • Estimate genome size

  • Filter CLR reads by length min=5000 max=library-length

  • Estimate read coverage

2.1 Estimate genome size

See Estimate Genome Size

Example:

Results

GenomeScope version 2.0
input file = user_uploads/UFE4kh5TB9DozL0NkKCq
output directory = user_data/UFE4kh5TB9DozL0NkKCq
p = 2
k = 21

property                      min               max               
Homozygous (aa)               95.9674%          100%              
Heterozygous (ab)             0%                4.03258%          
Genome Haploid Length         1,709,133,088 bp  3,291,686,136 bp  
Genome Repeat Length          723,339,182 bp    1,393,107,168 bp  
Genome Unique Length          985,793,907 bp    1,898,578,968 bp  
Model Fit                     62.3549%          97.7983%          
Read Error Rate               0.303731%         0.303731%

The estimated genome size is 3.3Gbp

2.2 Calculate stats of CLR reads

There are many tools that can do this job. We used seqkit stats

module load seqkit

export SEQKIT_THREADS=2

seqkit stats -j 24 -T *subreads.fastq > read_fate.tsv

cat read_fate.tv

file 	num_seqs 	sum_len 	min_len 	avg_len 	max_len
EB20_m64108_210117_053528.subreads.fastq 	27,570,320 	232,559,004,086 	50 	8,435 	521,818
EB20_m64108_210118_121306.subreads.fastq 	28,535,669 	240,060,443,237 	50 	8,413 	436,859

2.3 Filter CLR reads by read length

Denovo assembly tools select the longest reads as *anchors".

We need to make sure that the longest reads are of high quality.

One way to achieve that goal is to filter out reads that are longer than the length of the library used for sequencing. In our case that would be 60kb.

Assembly tools like redbean filter reads shorter than 5kb anyway; but we may need to switch to a different tool that does not filter reads by length at all.

Therefore, it is better to do it now.

This is the cmd to filter reads shorter than 5kb and longer than 60kb

$ module load seqkit

$ seqkit seq -w 0 -m 5000 -M 60000 EB20_m64108_210117_053528.subreads.fastq  > EB20_m64108_210117_053528.subreads_min5kb_max50kb.fastq &

$ seqkit seq -w 0 -m 5000 -M 60000 EB20_m64108_210118_121306.subreads.fastq  > EB20_m64108_210118_121306.subreads_min5kb_max60kb.fastq &

2.4 Check read lengths again

$ module load seqkit

$ seqkit stats -j 24 *.fastq -T > CLR_stats.tsv

$ cat CLR_stats.tsv

file    format    type    num_seqs    sum_len    min_len    avg_len    max_len
EB20_m64108_210117_053528.subreads_min5kb_max60kb.fastq    FASTQ    DNA    12505818    1.85924E+11    5000    14867    60000
EB20_m64108_210118_121306.subreads_min5kb_max60kb.fastq    FASTQ    DNA    13251586    1.95534E+11    5000    14755.5    60000

2.5 Calculate read depth

The estimated genome size is 3.3Gbp

For 90x coverage we would need = 90 * (3 300 000 000) For 80x coverage we would need = 80 * (3 300 000 000) And so on

Do we have enough read coverage for each one of these levels? The answer is yes we do

2.6 Edit the script

The script run_redbean_slurm.sh runs the assembly tool on a cluster.

Please edit the script with your information.

The script consists of two steps:

  • Step 1 runs wtdbg2 and generates the raw assembly. The output is NOT in fasta format.

  • Step 2 runs wtpoa-cns to reformat the output and write it in fasta format.

$ wtdbg2 -x sq -g $SIZE -X $DEPTH \
  -t $threads \
  -i $CLR1  -i $CLR2 \
  -fo ${PREFIX}_Redbean_${SIZE}_${DEPTH}

$ wtpoa-cns -t $threads \
   -i ${PREFIX}_Redbean_${SIZE}_${DEPTH}.ctg.lay.gz \
   -fo ${PREFIX}_Redbean_${SIZE}_${DEPTH}.ctg.fa

2.7 Run the script

 sbatch run_redbean_slurm.sh
 

This tool takes a very long time to execute. It could take over a week or more to complete.

2.8 Output files

This link explains the output files of wtdbg2 : https://github.com/ruanjue/wtdbg2/blob/master/README-ori.md

The raw assembly should be in the file ending in ctg.fa

QC the raw assembly

We provide a script BUSCO_metrics_slurm.sh that runs assemblathon.pl to calculate assembly contiguity.

It also runs BUSCO (database: insecta_odb10) to calculate genome completeness.