This Snakemake pipeline is designed for variant calling. It expects the following inputs:
- BAM files
- output directory
- reference genome
- SNP database
- BED file
The pipeline writes two directories:

- `logs` - log files for each job; check here first if you run into errors
- `working` - intermediate files for each job
The workflow is organized into the following rule groups (rule group -- GATK tools used):

- **genome_prepare**
- **BQSR** -- BaseRecalibrator, ApplyBQSR
- **HaplotypeCaller** -- HaplotypeCaller
- **JointCallSNPs** -- JointCallSNP, GenotypeGVCFs, GatherVcfs
- **VQSR** -- VariantRecalibrator, ApplyVQSR
- Install conda
- Clone the workflow into your working directory

  ```bash
  git clone <repo> <dir>
  cd <dir>
  ```
- Create a new environment

  ```bash
  conda env create -n <project> --file environment.yaml
  ```
- Activate the environment

  ```bash
  conda activate <project>
  ```
- Enable the Bioconda channel

  ```bash
  conda config --add channels bioconda
  conda config --add channels conda-forge
  ```

  **Note:** the default conda channel causes SSL errors on the Max Planck intranet.
- Install Snakemake

  ```bash
  conda install snakemake
  ```
- Edit the configuration file: change the paths of `fastq_dir`, `output_dir`, and `reference_genome` in `config.yaml`
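For reference, `config.yaml` might look like the sketch below. Only `fastq_dir`, `output_dir`, and `reference_genome` are named in this README; all paths are placeholders, and the workflow's actual key set may include further entries (e.g. for the SNP database and BED file).

```yaml
# Example config.yaml -- every path below is a placeholder
fastq_dir: /path/to/fastq             # directory containing the input reads
output_dir: /path/to/output           # where results will be written
reference_genome: /path/to/genome.fa  # reference FASTA
```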
- Create an index for your SNP database

  ```bash
  gatk IndexFeatureFile -I dbSNP.vcf.gz
  ```
- The first time you execute this pipeline, run it locally; once the first run has finished, you can switch to running it on the cluster. To preview what will be run, do a dry run first:

  ```bash
  snakemake --configfile "config.yaml" --use-conda --cores N --dryrun
  ```
- Execute the workflow (add the `--notemp` flag if you want to keep the intermediate files instead of deleting them automatically)

  ```bash
  snakemake --configfile "config.yaml" --use-conda --cores N
  ```
If you need to submit jobs to an SGE cluster, install the `snakemake-executor-plugin-cluster-generic` plugin with pip:
```bash
pip install snakemake-executor-plugin-cluster-generic
```
then submit the workflow with:
```bash
snakemake --use-conda --jobs {cores} --executor cluster-generic --cluster-generic-submit-cmd "qsub -cwd -V -l h_vmem={resources.mem_mb}M -pe parallel {threads} -o logs/ -e logs/"
```
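As an alternative to retyping these flags, they can be stored in a Snakemake (v8+) workflow profile. This is a sketch under the assumption that the profile keys mirror the long CLI options (the documented Snakemake convention); the profile path and job limit are arbitrary choices.

```yaml
# profiles/sge/config.yaml -- hypothetical profile mirroring the command above
executor: cluster-generic
cluster-generic-submit-cmd: "qsub -cwd -V -l h_vmem={resources.mem_mb}M -pe parallel {threads} -o logs/ -e logs/"
jobs: 100        # maximum number of concurrently submitted cluster jobs
use-conda: true
```

With this file in place, the workflow can be launched as `snakemake --configfile config.yaml --workflow-profile profiles/sge`.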