Update README.md

cheny19 · web-flow · commit a640384feb66 · 2020-06-08T21:31:38.000-07:00
diff --git a/README.md b/README.md
@@ -1,9 +1,9 @@
 [![Release](https://img.shields.io/github/v/release/bcgsc/nanosim?include_prereleases)](https://github.com/bcgsc/NanoSim/releases)
-[![Downloads](https://img.shields.io/github/downloads/bcgsc/Nanosim/total?logo=github)](https://github.com/bcgsc/NanoSim/archive/v2.5.0.zip)
+[![Downloads](https://img.shields.io/github/downloads/bcgsc/Nanosim/total?logo=github)](https://github.com/bcgsc/NanoSim/archive/v2.6.0.zip)
 [![Conda](https://img.shields.io/conda/dn/bioconda/nanosim?label=Conda)](https://anaconda.org/bioconda/nanosim)
 [![Stars](https://img.shields.io/github/stars/bcgsc/NanoSim.svg)](https://github.com/bcgsc/NanoSim/stargazers)  
 
-![NanoSim](https://github.com/bcgsc/NanoSim/blob/master/NanoSim%20logo.png)
+![NanoSim](https://github.com/bcgsc/NanoSim/blob/master/NanoSim_logo.png)
 
 NanoSim is a fast and scalable read simulator that captures the technology-specific features of ONT data, and allows for adjustments upon improvement of nanopore sequencing technology.  
 
@@ -246,7 +246,7 @@ usage: simulator.py genome [-h] -rg REF_G [-c MODEL_PREFIX] [-o OUTPUT]
                            [-med MEDIAN_LEN] [-sd SD_LEN] [--seed SEED]
                            [-k KMERBIAS] [-b {albacore,guppy,guppy-flipflop}]
                            [-s STRANDNESS] [-dna_type {linear,circular}]
-                           [--perfect] [-t NUM_THREADS]
+                           [--perfect] [--fastq] [-t NUM_THREADS]
 
 optional arguments:
   -h, --help            show this help message and exit
@@ -285,6 +285,7 @@ optional arguments:
                         Specify the dna type: circular OR linear (Default =
                         linear)
   --perfect             Ignore error profiles and simulate perfect reads
+  --fastq               Output fastq files instead of fasta files
   -t NUM_THREADS, --num_threads NUM_THREADS
                         Number of threads for simulation (Default = 1)
 
@@ -298,10 +299,10 @@ __transcriptome mode usage:__
 usage: simulator.py transcriptome [-h] -rt REF_T [-rg REF_G] -e EXP
                                   [-c MODEL_PREFIX] [-o OUTPUT] [-n NUMBER]
                                   [-max MAX_LEN] [-min MIN_LEN] [--seed SEED]
-                                  [-k KMERBIAS] [-b {albacore, guppy}]
+                                  [-k KMERBIAS] [-b {albacore,guppy}]
                                   [-r {dRNA,cDNA_1D,cDNA_1D2}] [-s STRANDNESS]
-                                  [--no_model_ir] [--perfect] [-t NUM_THREADS]
-                                  [--uracil]
+                                  [--no_model_ir] [--perfect] [--polya POLYA]
+                                  [--fastq] [-t NUM_THREADS] [--uracil]
 
 optional arguments:
   -h, --help            show this help message and exit
@@ -340,14 +341,18 @@ optional arguments:
                         0 and 1
   --no_model_ir         Simulate intron retention events
   --perfect             Ignore profiles and simulate perfect reads
+  --polya POLYA         Simulate polyA tails for given list of transcripts
+  --fastq               Output fastq files instead of fasta files
   -t NUM_THREADS, --num_threads NUM_THREADS
                         Number of threads for simulation (Default = 1)
   --uracil              Converts the thymine (T) bases to uracil (U) in the
                         output fasta format
 ```
 
 
-\* Notice: the use of `max_len` and `min_len` in genome mode will affect the read length distributions. If the range between `max_len` and `min_len` is too small, the program will run slowlier accordingly.
+\* Notice: the use of `max_len` and `min_len` in genome mode will affect the read length distributions. If the range between `max_len` and `min_len` is too small, the program will run more slowly.  
+
+\* Notice: the transcript names in the expression tsv file and in the polyadenylated transcript list must match those in the reference transcripts; otherwise the tool won't recognize them and won't know where to find them when extracting reads for simulation.
 
 __Example runs:__  
 1 If you want to simulate _E. coli_ genome, then circular command must be chosen because it's a circular genome  
@@ -371,6 +376,9 @@ __Example runs:__
 7 If you want to simulate five thousands cDNA/directRNA reads from mouse reference transcriptome without modeling intron retention  
 `./simulator.py transcriptome -rt Mus_musculus.GRCm38.cdna.all.fa -c mouse_cdna -e abundance.tsv -n 5000 --no_model_ir`
 
+8 If you want to simulate two thousand cDNA/directRNA reads from the human reference transcriptome with polyA tails, mimicking homopolymer bias (starting from homopolymer length >= 6), and output the reads in fastq format  
+`./simulator.py transcriptome -rt Homo_sapiens.GRCh38.cdna.all.fa -c Homo_sapiens_model -e abundance.tsv -n 2000 -rg Homo_sapiens.GRCh38.dna.primary.assembly.fa --polya transcripts_with_polya_tails --fastq -k 6 --basecaller guppy -r dRNA`
+
 ## Explanation of output files
 ### 1. Characterization stage
 #### 1.1 Characterization stage (genome)
@@ -425,6 +433,8 @@ __Example runs:__
   
   The information in the header can help users to locate the read easily.  
   
+__Specific to transcriptome simulation__: for reads that include retained introns, the header includes this information starting from `Retained_intron`, with each genomic interval separated by `;`.
+  
 2. `simulated_error_profile`
   Contains all the information of errors introduced into each reads, including error type, position, original bases and current bases.