Skip to content

Latest commit

 

History

History
78 lines (60 loc) · 3.59 KB

File metadata and controls

78 lines (60 loc) · 3.59 KB

Postprocessing of ipod data

For the purposes of this distribution of tools, postprocessing refers to calling peaks and EPODS in the protein occupancy data after a final set of IPOD enrichment scores has been calculated.

Running the peak and epod calling driver script

From within our provided singularity container (see here for instructions for using our container), running the following will perform peak and epod calling:

python /src_for_distrib/drivers/do_peak_and_epod_calls.py /ipod_data/main.conf

Peak calling

Peak calling is performed by using either robust z-scores or log10p-values. We first calculate a rolling mean score at each position. The width of the window is defined by windowsize_bp (link) in the peaks section of the top-level configuration file.

We then go through the rolling mean scores to define any position whose score is above a given threshold as a peak. We use several score thresholds, defined by the rz_thresholds (link) and log10p_thresholds (link) options in the peaks section of the top-level configuration file.

Resulting narrowpeaks files will be written for each score type (robust z-score and log10p value), and for each threshold.

EPOD calling

Options for EPOD calling can be set here in the main configuration file. Of note are loose_epod_length and strict_epod_length. These options set the minimum length for which the pipeline will call a region a "loose" epod or a "strict" epod, respectively.

We find that having both the loose epod calls and strict epod calls is very helpful when interpreting the calls. For instance, if a strict epod is absent in one condition but present in another, check the loose epod calls for that region to see whether a loose epod is present in both. If a loose epod is present in both conditions, the absence of a strict epod in one condition should receive less weight in your final interpretation of results.

NOTE: in prior versions of this pipeline, EPODs that wrapped around the ends of contigs representing circular chromosomes or plasmids could be called. As of version 2.5.4 of this pipeline, we do not allow wrapping around the ends of contigs.

Irreproducible discovery rate

For peak calling, we calculate the irreproducible discovery rate (IDR, doi:10.1214/11-AOAS466; click here for link). IDR calculation works on one pair of replicates at a time. Therefore, we have chosen to perform pairwise IDR calculations for each pair of replicates, resulting in comp-num comparisons, where n is the number of replicates performed.

After calculating each pairwise IDR, we determine, for each position in the genome, the fraction of comparisons that passed the threshold IDR <= x, where x is the value set by the threshold option in the idr section of the main configuration file. Any region containing contiguous positions for which at least half of the pairwise IDR calculations pass this threshold are reported in a final narrowpeaks output file.

Assessing EPOD reproducibility

For EPOD calling across replicates, we take the union of all detected EPODs in all replicates. For each merged epod in this set union, we calculate the mean representation across replicates.