diff --git a/CHANGELOG.md b/CHANGELOG.md index 0455b54..318eafd 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,7 @@ ## ASPEN 1.0.6 - fix: dockername typo ([#57](https://github.com/CCBR/ASPEN/issues/57), @kopardev) +- docs: update documentation, change theme ([#77](https://github.com/CCBR/ASPEN/issues/77), [#78](https://github.com/CCBR/ASPEN/issues/78), @kopardev) ## ASPEN 1.0.5 diff --git a/docs/css/custom.css b/docs/css/custom.css index 22c7639..5fcc4ec 100644 --- a/docs/css/custom.css +++ b/docs/css/custom.css @@ -18,3 +18,30 @@ pre { background: #e0e0e0; } +/* Light blue background for notes */ +div.admonition.note { + background-color: #e3f2fd; + border-left: 5px solid #2196F3; + padding: 10px; + margin: 10px 0; + border-radius: 5px; +} + +/* Light grey background for info */ +div.admonition.info { + background-color: #f5f5f5; + border-left: 5px solid #9E9E9E; + padding: 10px; + margin: 10px 0; + border-radius: 5px; +} + +/* Orange warning box */ +div.admonition.warning { + background-color: #fff3e0; + border-left: 5px solid #FF9800; + padding: 10px; + margin: 10px 0; + border-radius: 5px; +} + diff --git a/docs/deployment.md b/docs/deployment.md index b5d11f2..760baab 100644 --- a/docs/deployment.md +++ b/docs/deployment.md @@ -30,13 +30,14 @@ ASPEN requires a sample manifest file (`samples.tsv`) to identify and organize y - `path_to_R1_fastq`: Absolute path to the Read 1 FASTQ file. - `path_to_R2_fastq`: Absolute path to the Read 2 FASTQ file (required for paired-end data). -> **Note**: Symlinks for R1 and R2 files will be created in the results directory, named as .R1.fastq.gz and .R2.fastq.gz, respectively. Therefore, original filenames do not need to be altered. +!!! note + Symlinks for R1 and R2 files will be created in the results directory, named as .R1.fastq.gz and .R2.fastq.gz, respectively. Therefore, original filenames do not need to be altered. 
-> **Note**: The `replicateName` is used as a prefix for individual peak calls, while the `sampleName` serves as a prefix for consensus peak calls. +!!! note + The `replicateName` is used as a prefix for individual peak calls, while the `sampleName` serves as a prefix for consensus peak calls. -> **Note**: For differential ATAC analysis, prepare a contrasts.tsv file with two columns (Group1 and Group2, without headers) and place it in the output directory after initialization. - -[Back to Table of Contents](#table-of-contents) +!!! note + For differential ATAC analysis, create a `contrasts.tsv` file with two columns (Group1 and Group2 ... aka Sample1 and Sample2, without headers) and place it in the output directory after initialization. Ensure each group/sample in the contrast has at least two replicates, as DESeq2 requires this for accurate contrast calculations. ## Running the ASPEN Pipeline ASPEN operates through a series of modes to facilitate various stages of the analysis. @@ -50,7 +51,8 @@ aspen -m=init -w= This command generates a config.yaml and a placeholder `samples.tsv` in the specified directory. Edit these files to reflect your experimental setup, replacing the placeholder `samples.tsv` with your prepared manifest. If performing differential analysis, include the `contrasts.tsv` file at this stage. -> **Note**: To explore all possible options of the `aspen` command you can either run it without any arguments or run `aspen --help` +!!! note + To explore all possible options of the `aspen` command you can either run it without any arguments or run `aspen --help` Here is what help looks like: ```bash diff --git a/docs/outputs.md b/docs/outputs.md index dad2a09..22c5063 100644 --- a/docs/outputs.md +++ b/docs/outputs.md @@ -67,12 +67,23 @@ Content details: | Folder | Description | |-------------|------------| -| dedupBam | Deduplicated filtered BAM files; can be used for visualization. 
| -| peaks | Genrich/MACS2 peak calls (raw, consensus, fixed-width); also contains ROI files with Diff-ATAC results if `contrasts.tsv` is provided; motif enrichments using HOMER and AME; bigwigs for visualization. | -| QC | Flagstats; dupmetrics; read counts; motif enrichments; FLD stats; Fqscreen; FRiP; ChIPSeeker results; TSS enrichments; Preseq; MultiQC. | -| qsortedBam | Query name sorted BAM files; used for Genrich peak calling (includes multimappers). | -| tagAlign | `tagAlign.gz` files; deduplicated; used for MACS2 peak calling. | -| tmp | Can be deleted; blacklist index; intermediate FASTQs; Genrich output reads. | +| `dedupBam` | Deduplicated filtered BAM files; can be used for visualization. | +| `peaks` | Genrich/MACS2 peak calls (raw, consensus, fixed-width); also contains ROI files with Diff-ATAC results if `contrasts.tsv` is provided; motif enrichments using HOMER and AME; bigwigs for visualization. | +| `QC` | Flagstats; dupmetrics; read counts; motif enrichments; FLD stats; Fqscreen; FRiP; ChIPSeeker results; TSS enrichments; Preseq; MultiQC. | +| `qsortedBam` | Query name sorted BAM files; used for Genrich peak calling (includes multimappers). | +| `tagAlign` | `tagAlign.gz` files; deduplicated; used for MACS2 peak calling. | +| `tmp` | Can be deleted; blacklist index; intermediate FASTQs; Genrich output reads. | + +The `QC` folder contains the `multiqc_report.html` file which provides a comprehensive summary of the quality control metrics across all samples, including read quality, duplication rates, and other relevant statistics. This report aggregates results from various QC tools such as FastQC, FastqScreen, FLD, TSS enrichment, Peak Annotations, and others, presenting them in an easy-to-read format with interactive plots and tables. It helps in quickly identifying any issues with the sequencing data and ensures that the data quality is sufficient for downstream analysis. + +!!! 
note + BAM files from `dedupBam` can be used for downstream footprinting analysis using [CCBR_TOBIAS](https://github.com/CCBR/CCBR_Tobias) pipeline + +!!! note + [bamCompare](https://deeptools.readthedocs.io/en/develop/content/tools/bamCompare.html) from deeptools can be run to compare BAMs from `dedupBam` for comprehensive BAM comparisons. + +!!! note + BAM files from `dedupBam` can also be converted to BED format and processed with [chromVAR](https://github.com/GreenleafLab/chromVAR) to identify variability in motif accessibility across samples and assess differentially active transcription factors from the JASPAR database. Most of the above folders are self-explanatory. The `peaks` folder has this hierarchy: @@ -81,22 +92,131 @@ WORKDIR ├── results ├── peaks ├── genrich + │   ├── .genrich.narrowPeak + │   ├── .genrich.narrowPeak.annotated + │   ├── .genrich.narrowPeak.genelist + │   ├── .genrich.narrowPeak.annotation_summary + │   ├── .genrich.narrowPeak.annotation_distribution + │   ├── .genrich.pooled.narrowPeak + │   ├── .genrich.consensus.bed + │   ├── ROI.counts.tsv │   ├── bigwig │   ├── .genrich.narrowPeak_motif_enrichment │   │   └── knownResults │   ├── DiffATAC - │   ├── fixed_width + │   ├── .genrich.consensus.bed_motif_enrichment │   └── tn5nicks └── macs2 + │   ├── .macs2.narrowPeak + │   ├── .macs2.narrowPeak.annotated + │   ├── .macs2.narrowPeak.genelist + │   ├── .macs2.narrowPeak.annotation_summary + │   ├── .macs2.narrowPeak.annotation_distribution + │   ├── .macs2.pooled.narrowPeak + │   ├── .macs2.consensus.bed + │   ├── ROI.counts.tsv ├── bigwig ├── .macs2.narrowPeak_motif_enrichment │   └── knownResults ├── DiffATAC + │   ├── .macs2.consensus.bed_motif_enrichment ├── fixed_width └── tn5nicks ``` -`tn5nicks` folders host the per-replicate BAM files containing the Tn5 nicking sites in Genrich or MACS2 "peakcalling" reads, respectively. For easy visualization, they are converted to bigWig format and saved in respective `bigwig` folders. 
`DiffATAC` contains the DESeq2 differential accessiblity results, both per-contrast and aggregated accross all contrasts in `contrasts.tsv`. These results are solely based on tn5 nick counts. +Some of the important folders and files are highlighted below: + +### Folders + +- `bigwig`: + +For easy visualization, the Tn5 nick BAM files are converted to bigWig format and saved in the respective `bigwig` folders. The bigWig files can be directly loaded into the [UCSC Browser](https://genome.ucsc.edu/) or [IGV](https://igv.org/doc/desktop/). + +- `tn5nicks`: + +This folder hosts the per-replicate BAM files containing the Tn5 nicking sites in Genrich or MACS2 "peakcalling" reads, respectively. + +- `DiffATAC`: + +Contains the DESeq2 differential accessibility results, both per-contrast and aggregated across all contrasts in `contrasts.tsv`. These results are solely based on Tn5 nick counts. + +- `fixed_width`: + +This folder contains fixed-width consensus peaks across replicates and samples, represented in the "Regions-Of-Interest" files. The `ROI.bed` file lists genomic regions where chromatin accessibility is analyzed using DESeq2, with results stored in the `DiffATAC` folder. + +- `.macs2.narrowPeak_motif_enrichment`;`.genrich.narrowPeak_motif_enrichment`;`.macs2.consensus.bed_motif_enrichment`;`.genrich.consensus.bed_motif_enrichment`: + +These folders contain the motif enrichments calculated using HOMER and AME for the per-replicate peaks and the per-sample consensus peaks called with both MACS2 and Genrich. Specifically, two types of motif enrichments are performed: + + - Enrichment of known [HOCOMOCO](https://hocomoco11.autosome.org/) (version 11) motifs for HUMAN or MOUSE or BOTH using [HOMER](http://homer.ucsd.edu/homer/ngs/peaks.html). See file `knownResults.html`. + + - _de novo_ motif enrichment using [AME](https://meme-suite.org/meme/doc/ame.html) from the MEME suite. See file `ame_results.txt`. Custom parallelization is used to optimize AME-based enrichment analysis. 
+ +### Files + +- `*.narrowPeak`: + +Called peaks from Genrich or MACS2. + +- Annotated peak files: + +Peaks are annotated with ChIPSeeker and results are saved in the following files: + + - `.annotated` + + Tab-delimited text file with the following columns: + +| Column Number | Field Name | Description | +|--------------|------------|-------------| +| 1 | #peakID | Peak identifier | +| 2 | chrom | Peak chromosome | +| 3 | chromStart | Peak start coordinate | +| 4 | chromEnd | Peak end coordinate | +| 5 | width | Peak width | +| 6 | annotation | Peak annotation (Promoter; 3' or 5' UTR; Distal; Downstream; Exon; Intron) | +| 7 | geneChr | Gene chromosome | +| 8 | geneStart | Gene start coordinate | +| 9 | geneEnd | Gene end coordinate | +| 10 | geneLength | Gene length (including introns) | +| 11 | geneStrand | Gene strand | +| 12 | geneId | Gene identifier | +| 13 | transcriptId | Transcript identifier | +| 14 | distanceToTSS | Distance of peak from the Transcription Start Site | +| 15 | ENSEMBL | Gene Ensembl ID | +| 16 | SYMBOL | Gene symbol | +| 17 | GENENAME | Gene description | +| 18 | score | Score from `.narrowPeak` file | +| 19 | signalValue | Signal from `.narrowPeak` file | +| 20 | pValue | p-value from `.narrowPeak` file | +| 21 | qValue | q-value from `.narrowPeak` file | +| 22 | peak | Distance of peak summit from peak start coordinate | + + - `.genelist` + +This is a tab-delimited file with names (Ensembl ID, gene symbol) of genes which have ATAC-seq peaks in their promoter regions. This file can be used downstream for gene enrichment analysis such as over-representation analysis (ORA). + + - `.annotation_summary`; `.annotation_distribution` + +Tab-delimited files that provide statistics on peak annotations, quantifying the number of peaks found in Promoters, Exonic regions, Distal Intergenic regions, etc. The `.annotation_distribution` file is used to plot these annotation distributions in the MultiQC report. 
+ +- `ROI.counts.tsv` + +This file contains the read counts for each Region-Of-Interest (ROI) across all replicates of all samples. It is a tab-delimited file with the following columns: + +| Column Number | Field Name | Description | +|---------------|------------|-------------| +| 1 | Geneid | Region-Of-Interest identifier | +| 2 | Chr | Chromosome of the ROI | +| 3 | Start | Start coordinate of the ROI | +| 4 | End | End coordinate of the ROI | +| 5 | Strand | "." | +| 6 | Length | Length of the ROI | +| 7 | sample1_replicate1 | Tn5 nicking site counts in this ROI for replicate1 of sample1 | +| 8 | sample1_replicate2 | Tn5 nicking site counts in this ROI for replicate2 of sample1 | +| ... | ... | ... | +| n | sampleN_replicateM | Tn5 nicking site counts in this ROI for replicateM of sampleN| +Each row represents a specific ROI, and the columns contain the read counts for each sample, allowing for differential accessibility analysis. -> DISCLAIMER: This folder hierarchy is specific to v1.0.6 and is subject to change with version. \ No newline at end of file +!!! warning + DISCLAIMER: This folder hierarchy is specific to v1.0.6 and is subject to change with version. \ No newline at end of file
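The `ROI.counts.tsv` layout documented in this diff (six fixed columns followed by one count column per replicate) can be read with a few lines of standard-library Python. This is a minimal sketch, not part of ASPEN: the function names `read_roi_counts` and `totals_per_replicate` are hypothetical, the replicate column names in the demo are made up, and it assumes counts are integers and that any leading `#` comment lines can be skipped.

```python
import csv
import io

# Fixed leading columns, as listed in the ROI.counts.tsv table in docs/outputs.md.
FIXED_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def read_roi_counts(handle):
    """Parse an ROI.counts.tsv-style file (hypothetical helper).

    Returns (replicate_names, counts), where counts maps each ROI
    identifier to its list of integer counts, one per replicate column.
    """
    # Assumption: comment lines, if any, start with '#'; blank lines are ignored.
    lines = (line for line in handle if line.strip() and not line.startswith("#"))
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    if header[: len(FIXED_COLUMNS)] != FIXED_COLUMNS:
        raise ValueError(f"unexpected header: {header[:len(FIXED_COLUMNS)]}")
    replicates = header[len(FIXED_COLUMNS):]
    counts = {}
    for row in reader:
        # Assumption: counts are stored as plain integers.
        counts[row[0]] = [int(value) for value in row[len(FIXED_COLUMNS):]]
    return replicates, counts


def totals_per_replicate(replicates, counts):
    """Sum the counts over all ROIs for each replicate column."""
    totals = [0] * len(replicates)
    for values in counts.values():
        for index, value in enumerate(values):
            totals[index] += value
    return dict(zip(replicates, totals))


# Demo on made-up data; replace io.StringIO(...) with open("ROI.counts.tsv").
demo = (
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\ts1_r1\ts1_r2\n"
    "roi1\tchr1\t100\t600\t.\t500\t10\t12\n"
    "roi2\tchr2\t50\t550\t.\t500\t3\t4\n"
)
names, table = read_roi_counts(io.StringIO(demo))
print(totals_per_replicate(names, table))
```

Per-replicate totals like these are a quick sanity check before differential analysis: a replicate whose total is far below the others is worth inspecting in the MultiQC report first.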