For a detailed description of the analysis, see the Materials and Methods section. Raw data and processed output files are available at the NCBI Gene Expression Omnibus under accession number GSE302092.
The analysis steps, in order, are:
- retrieve raw sequencing reads (as FASTQ files)
- quality control of raw sequencing reads
- extract spacer sequences by trimming the adapters of the sequencing constructs; extract UMIs when applicable
- when applicable, deduplicate spacer-UMI pairs
- align spacer sequences to a reference containing the Neisseria meningitidis genome and the MDA phage genome, and filter to keep only unique alignments using `bowtie2` flags
- calculate the ranges of genomic positions after padding the aligned spacer positions by 25 bp upstream and 50 bp downstream (BED files), and extract the strand-specific sequences on the aligned strands (FASTA files) as genomic contexts
- reverse complement the genomic context sequences to obtain the non-target strand sequences, consistent with PAM identification conventions
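The padding and reverse-complement steps can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function names are hypothetical, coordinates are assumed to be BED-style (0-based, half-open), "upstream" and "downstream" are assumed to be relative to the aligned strand, and padded ranges are simply clipped at the sequence ends rather than wrapped around a circular genome.

```python
from typing import Tuple

# Complement table for DNA bases, including N and lowercase (soft-masked) bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def pad_interval(start: int, end: int, strand: str, chrom_len: int,
                 upstream: int = 25, downstream: int = 50) -> Tuple[int, int]:
    """Pad a BED-style interval relative to the aligned strand (hypothetical helper).

    On the "+" strand, upstream lies at lower coordinates; on the "-" strand,
    upstream lies at higher coordinates. The result is clipped to [0, chrom_len].
    """
    if strand == "+":
        new_start, new_end = start - upstream, end + downstream
    elif strand == "-":
        new_start, new_end = start - downstream, end + upstream
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    return max(0, new_start), min(chrom_len, new_end)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_specific_context(genome: str, start: int, end: int, strand: str) -> str:
    """Extract genome[start:end] as read on the given strand."""
    seq = genome[start:end]
    return reverse_complement(seq) if strand == "-" else seq
```

For a 30 bp spacer aligned at positions 100 to 130, `pad_interval(100, 130, "+", 1000)` gives `(75, 180)` and `pad_interval(100, 130, "-", 1000)` gives `(50, 155)`. Applying `reverse_complement` to the strand-specific context then yields the opposite strand.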
The numbers in script names indicate the step for which each script is written, with scripts named phageAD-00-* holding the overall pipeline or environment variables.
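As a small illustration of this naming convention, the sketch below pulls the step number out of a script name. It assumes a two-digit number directly after the `phageAD-` prefix, as in `phageAD-00-*`; the helper name and the example file names are hypothetical.

```python
import re
from typing import Optional

# Assumed pattern: "phageAD-" followed by a two-digit step number and a hyphen.
STEP_PATTERN = re.compile(r"^phageAD-(\d{2})-")


def script_step(script_name: str) -> Optional[int]:
    """Return the step number encoded in a script name, or None if absent."""
    match = STEP_PATTERN.match(script_name)
    return int(match.group(1)) if match else None
```

With hypothetical names, `script_step("phageAD-04-align.sh")` returns `4`, and a file that does not follow the convention returns `None`.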
- `env/` contains YML files of the `conda` environments used in the analysis
  - `adapt_py-3.7.yml`: Python 3.7, used for all Python scripts and other programs called from `bash` unless otherwise specified
  - `adapt_r-3.6.2.yml`: R 3.6.2 and corresponding packages, used for all R scripts
  - `adapt_umitools-3.7.yml`: environment where `umitools` is installed on top of the existing `adapt_py-3.7.yml`
- Others:
  - `Cutadapt` 2.6
  - `ParDRe` 2.2.5
  - `bowtie2` 2.4.1
  - `samtools` 1.9 (using `htslib` 1.9)
  - `bedtools` 2.19.1
  - `FastQC` 0.11.8
  - `MultiQC` 0.9
- Genome: Neisseria meningitidis strain 8013 (RefSeq accession number: NC_017501.1, RefSeq assembly accession: GCF_000026965.1, v1); MDA phage genomic sequence, supplied by collaborators. A combined genome sequence file is provided (`202211-genomeNme_phageMDA_ref.fa`).
The sub-directories under the overall project data directories are organized by the analysis steps:
data directory
├ 00-seq/ # raw FASTQ files or symbolic links
├ 01-qc/ # data quality control
├ 02-spacer/ # extracted spacer sequences
├ (02-umi/) # extracted UMI sequences, UMI samples only
├ (03-dedupped/) # deduplication, spacer and UMI sequences, runtime logs, UMI samples only
├ 04-align/ # alignment to Nme genome and BAM file processing
├ 05-pad/ # extracted strand-specific genome context positions and sequences
└ 06-rc/ # non-target strand sequences for motif analysis
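
The layout above could be created with a short Python sketch like the one below. The helper name and the `with_umi` switch are hypothetical and not part of the pipeline scripts; the directory names are taken from the tree, with the parenthesized directories created only for UMI samples.

```python
from pathlib import Path
from typing import List, Union

# Sub-directories used by every sample.
CORE_DIRS = ["00-seq", "01-qc", "02-spacer", "04-align", "05-pad", "06-rc"]
# Sub-directories used only by UMI samples.
UMI_DIRS = ["02-umi", "03-dedupped"]


def make_data_dirs(root: Union[str, Path], with_umi: bool = False) -> List[Path]:
    """Create the per-step sub-directories under root and return their paths."""
    names = CORE_DIRS + (UMI_DIRS if with_umi else [])
    created = []
    for name in sorted(names):  # sorting keeps the step order 00 to 06
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_data_dirs("data", with_umi=True)` creates all eight sub-directories under a `data` directory, where `data` is a placeholder for the actual project data directory.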