In spike-in normalization, exogenous DNA (e.g. from E. coli or Drosophila) is added to the stop buffer of the CUT&RUN or CUT&Tag reaction. Spike-in reads are mapped separately to the spike-in genome (for example, the E. coli genome), in addition to the reads mapped to the target (primary) genome (e.g. human or mouse). The assumption is that the ratio of reads mapped to the target genome to reads mapped to the spike-in genome is the same across all samples that were processed in the same batch with the same number of cells.
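Using a constant C, we define the scaling factor S as:

$$S = \frac{C}{\text{number of reads mapped to the spike-in genome}}$$

Then, we apply the scaling factor to normalize the target reads:

$$\text{Normalized coverage} = S \times \text{target genome coverage}$$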
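A minimal Python sketch of these two steps, assuming the commonly used form in which the scaling factor is C divided by the number of reads mapped to the spike-in genome. The function names, the default `constant` of 10,000 (an arbitrary value chosen only to keep the numbers readable), and the sample counts are all hypothetical.

```python
def spike_in_scale_factor(spike_in_reads: int, constant: float = 10_000) -> float:
    """Return S = C / (reads mapped to the spike-in genome)."""
    if spike_in_reads <= 0:
        raise ValueError("spike-in read count must be positive")
    return constant / spike_in_reads


def normalize_coverage(coverage: list[float], scale: float) -> list[float]:
    """Multiply every target-genome coverage value by the scaling factor."""
    return [value * scale for value in coverage]


# Hypothetical spike-in read counts and per-bin coverage for two samples
# from the same batch.
spike_in_counts = {"sample_A": 2_000, "sample_B": 5_000}
raw_coverage = {
    "sample_A": [10.0, 40.0, 0.0],
    "sample_B": [25.0, 100.0, 0.0],
}

normalized = {
    name: normalize_coverage(raw_coverage[name], spike_in_scale_factor(count))
    for name, count in spike_in_counts.items()
}
```

In this toy input, sample_B has 2.5 times more spike-in reads and 2.5 times more raw coverage than sample_A, so both samples end up with identical normalized tracks.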
Advantages:
- Accounts for technical variability between samples
- Corrects for differences in cell number and antibody efficiency
- Particularly useful when comparing samples with different signal strengths
Disadvantages:
- Requires careful spike-in addition at the experimental stage
- Assumes uniform spike-in distribution
- May not work well if spike-in percentage varies significantly
Library size normalization scales each sample by its total number of mapped reads, assuming that the total amount of sequencing obtained is proportional to the amount of material in each sample.
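Scaling factor:

$$S = \frac{C}{\text{total mapped reads}}$$

Normalized coverage:

$$\text{Normalized coverage} = S \times \text{raw coverage}$$

where C is typically set to the minimum library size across all samples, or to a standard value such as 1 million or 10 million.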
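As an illustration, the sketch below computes per-sample factors under the assumption that the scaling factor is C divided by the total number of mapped reads, with C defaulting to the smallest library. The function name and the library sizes are hypothetical.

```python
from typing import Optional


def library_size_factors(
    mapped_reads: dict[str, int], constant: Optional[float] = None
) -> dict[str, float]:
    """Return S = C / (total mapped reads) for each sample.

    If no constant is given, C defaults to the smallest library size,
    so the shallowest sample keeps a factor of 1.0.
    """
    if not mapped_reads or min(mapped_reads.values()) <= 0:
        raise ValueError("every sample needs a positive mapped-read count")
    c = min(mapped_reads.values()) if constant is None else constant
    return {name: c / total for name, total in mapped_reads.items()}


# Hypothetical library sizes (total mapped reads per sample).
library_sizes = {"sample_A": 4_000_000, "sample_B": 8_000_000}

factors = library_size_factors(library_sizes)
# factors == {"sample_A": 1.0, "sample_B": 0.5}
```

Passing `constant=1_000_000` instead would put every sample on a per-million scale.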
Advantages:
- Simple and straightforward
- No additional experimental steps required
- Works when spike-ins are not available
Disadvantages:
- Assumes similar signal-to-noise ratios across samples
- Can be biased if global changes in binding occur
- Not ideal when samples have very different enrichment levels
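Counts Per Million (CPM), also called Reads Per Million (RPM), rescales coverage values to a standard library size of 1 million reads:

$$\text{CPM} = \frac{\text{read count} \times 10^{6}}{\text{total mapped reads}}$$

For bigWig files or coverage tracks, the same factor is applied to the coverage in every bin:

$$\text{CPM coverage} = \frac{\text{raw coverage} \times 10^{6}}{\text{total mapped reads}}$$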
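A short sketch of CPM scaling applied to a list of per-bin counts, using the standard definition (count times one million, divided by the total number of mapped reads). The function name and the numbers are hypothetical, and reading or writing bigWig files is left out.

```python
def cpm(counts: list[float], total_mapped_reads: int) -> list[float]:
    """Scale per-bin counts to counts per million mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total mapped reads must be positive")
    return [count * 1_000_000 / total_mapped_reads for count in counts]


# Hypothetical per-bin read counts from a library of 20 million mapped reads.
bin_counts = [5, 50, 0]
cpm_track = cpm(bin_counts, total_mapped_reads=20_000_000)
# cpm_track == [0.25, 2.5, 0.0]
```

In practice this scaling is usually delegated to a coverage tool; deepTools `bamCoverage`, for instance, offers it through `--normalizeUsing CPM`.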
Advantages:
- Makes samples directly comparable
- Intuitive interpretation (counts per million reads)
- Standard method in many analysis pipelines
Disadvantages:
- Same limitations as library size normalization
- Doesn't account for fragment length differences
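Reads Per Genomic Content (RPGC), also known as 1x depth normalization, normalizes by both library size and effective genome size, so that the average coverage over the mappable genome becomes 1x:

$$\text{Normalized coverage} = \text{raw coverage} \times \frac{\text{effective genome size}}{\text{total mapped reads} \times \text{fragment length}}$$

where:
- total mapped reads is the number of reads mapped in the sample
- fragment length is the average fragment length of the library

The effective genome size is the portion of the genome that is mappable with the given read length.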
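The sketch below follows the 1x scaling as described in the deepTools documentation: the current average depth is (mapped reads × fragment length) / effective genome size, and coverage is divided by that depth. The function name and all input values are hypothetical; in particular, the effective genome size used here is a round placeholder, not a value for any real assembly.

```python
def rpgc_scale_factor(
    mapped_reads: int, fragment_length: float, effective_genome_size: int
) -> float:
    """Return the factor that brings average coverage to 1x."""
    if min(mapped_reads, fragment_length, effective_genome_size) <= 0:
        raise ValueError("all inputs must be positive")
    # Average depth over the mappable genome before normalization.
    current_depth = mapped_reads * fragment_length / effective_genome_size
    return 1.0 / current_depth


# Hypothetical values: 5 million mapped reads, 200 bp fragments,
# and a placeholder effective genome size of 2 Gb.
factor = rpgc_scale_factor(
    mapped_reads=5_000_000,
    fragment_length=200,
    effective_genome_size=2_000_000_000,
)
# Current depth is 0.5x, so coverage is doubled: factor == 2.0

raw_track = [0.0, 1.0, 4.0]
rpgc_track = [value * factor for value in raw_track]
```

The effective genome size depends on the assembly and read length, so it should be taken from a published table or computed for the genome in use rather than guessed.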
Advantages:
- Accounts for effective genome size
- Useful for comparing across different genomes
- Implemented in deepTools
Disadvantages:
- Requires knowing the effective genome size
- Still assumes similar global enrichment
Background signal is estimated from an IgG control experiment, scaled, and subtracted from the target signal:

$$\text{Corrected signal} = \text{target signal} - s \times \text{IgG signal}$$

The scaling factor s can be determined by:
- Library size ratio between target and IgG
- Spike-in normalization factors
- Or set to 1 if both are already normalized
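A minimal sketch of scaled IgG subtraction on two coverage tracks binned identically, assuming negative results are clipped to 0 as is typically done. The function name, the tracks, and the scaling factor of 0.5 are hypothetical.

```python
def subtract_background(
    target: list[float], igg: list[float], scale: float = 1.0
) -> list[float]:
    """Subtract scaled IgG coverage from target coverage, bin by bin.

    Bins where the scaled IgG signal exceeds the target are set to 0.
    """
    if len(target) != len(igg):
        raise ValueError("target and IgG tracks must have the same number of bins")
    return [max(t - scale * g, 0.0) for t, g in zip(target, igg)]


# Hypothetical per-bin coverage; the IgG library is assumed to be twice as
# deep as the target library, hence a scaling factor of 0.5.
target_track = [10.0, 3.0, 0.5]
igg_track = [2.0, 4.0, 1.0]

corrected = subtract_background(target_track, igg_track, scale=0.5)
# corrected == [9.0, 1.0, 0.0]
```

Leaving `scale` at 1.0 corresponds to the case where both tracks were already normalized beforehand.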
Advantages:
- Removes non-specific binding signal
- Improves signal-to-noise ratio
- Identifies true binding sites
Disadvantages:
- Requires additional IgG control experiment
- May overcorrect in low-signal regions
- Negative values require handling (typically set to 0)
Quantile normalization makes the distribution of signal values identical across all samples by ranking the values within each sample and replacing each value with the mean across samples at the same rank.
Steps:
- Rank all values in each sample
- For each rank, compute the mean value across all samples
- Replace each value with the mean of its rank position
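The three steps can be sketched in plain Python as follows. This is an illustrative version, not a reference implementation: it assumes every sample has the same number of values (for example, signal over a shared set of bins or peaks), and it breaks ties by position rather than averaging tied ranks. The function name and input values are hypothetical.

```python
def quantile_normalize(samples: list[list[float]]) -> list[list[float]]:
    """Quantile-normalize samples that all contain the same number of values."""
    if not samples:
        return []
    n = len(samples[0])
    if any(len(sample) != n for sample in samples):
        raise ValueError("all samples must have the same number of values")

    # Steps 1 and 2: sort each sample, then average the values at each rank.
    sorted_samples = [sorted(sample) for sample in samples]
    rank_means = [
        sum(sorted_sample[rank] for sorted_sample in sorted_samples) / len(samples)
        for rank in range(n)
    ]

    # Step 3: replace each value with the mean for its rank.
    result = []
    for sample in samples:
        order = sorted(range(n), key=lambda i: sample[i])  # indices by ascending value
        normalized = [0.0] * n
        for rank, index in enumerate(order):
            normalized[index] = rank_means[rank]
        result.append(normalized)
    return result


# Hypothetical signal values for two samples over the same three regions.
signal = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized_signal = quantile_normalize(signal)
# normalized_signal == [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both samples contain exactly the same set of values (1.5, 3.5, 5.5); only their assignment to regions differs.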
Advantages:
- Removes systematic biases
- Makes distributions comparable
- Useful for batch effect removal
Disadvantages:
- Assumes samples should have similar distributions
- Can obscure true biological differences
- Not recommended if samples are expected to be very different
For standard experiments:
- Use spike-in normalization when available (gold standard)
- Fall back to CPM/RPM if spike-ins were not used
- Always include IgG controls for background subtraction
For differential binding analysis:
- Consider using tools with built-in normalization (DiffBind, csaw)
- These often use TMM or DESeq2-style normalization
For visualization:
- CPM/RPM is often sufficient for browser tracks
- Spike-in normalization provides the most accurate comparison
- Consider log-transformation for better visualization of fold changes
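For the log-transformation mentioned for visualization, one common choice is a log2 ratio with a pseudocount so that empty bins do not cause a division by zero or a logarithm of zero. The sketch below assumes both tracks are already normalized and binned identically; the function name, the pseudocount of 1.0, and the values are hypothetical.

```python
import math


def log2_fold_change(
    treatment: list[float], control: list[float], pseudocount: float = 1.0
) -> list[float]:
    """Return per-bin log2((treatment + pseudocount) / (control + pseudocount))."""
    if len(treatment) != len(control):
        raise ValueError("tracks must have the same number of bins")
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    return [
        math.log2((t + pseudocount) / (c + pseudocount))
        for t, c in zip(treatment, control)
    ]


# Hypothetical normalized coverage for two conditions.
treated = [3.0, 0.0, 7.0]
untreated = [1.0, 0.0, 1.0]

log2_track = log2_fold_change(treated, untreated)
# log2_track == [1.0, 0.0, 2.0]
```

The pseudocount shrinks fold changes in low-coverage bins, so its size relative to typical coverage values affects how the track looks.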
Common pitfalls:
- Not normalizing: Can lead to false positives due to library size differences
- Over-normalization: Applying multiple normalizations can introduce artifacts
- Mixing methods: Be consistent across all samples in an experiment
- Ignoring spike-in quality: Low spike-in reads (<1%) may not be reliable for normalization
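A simple check for the spike-in quality point could look like the sketch below. It assumes the 1% threshold refers to spike-in reads as a fraction of all mapped reads (target plus spike-in), which the text does not state explicitly. The function names and read counts are hypothetical.

```python
def spike_in_fraction(spike_in_reads: int, target_reads: int) -> float:
    """Return spike-in reads as a fraction of all mapped reads."""
    total = spike_in_reads + target_reads
    if spike_in_reads < 0 or target_reads < 0 or total == 0:
        raise ValueError("read counts must be non-negative and not both zero")
    return spike_in_reads / total


def spike_in_is_reliable(
    spike_in_reads: int, target_reads: int, min_fraction: float = 0.01
) -> bool:
    """Flag samples whose spike-in fraction falls below the threshold."""
    return spike_in_fraction(spike_in_reads, target_reads) >= min_fraction


# Hypothetical mapped-read counts as (spike-in reads, target reads).
samples = {
    "sample_A": (50_000, 9_950_000),   # 0.5% spike-in
    "sample_B": (300_000, 9_700_000),  # 3% spike-in
}

reliability = {
    name: spike_in_is_reliable(spike, target)
    for name, (spike, target) in samples.items()
}
# reliability == {"sample_A": False, "sample_B": True}
```

Samples flagged as unreliable here are candidates for falling back to library size or CPM normalization, applied consistently across the whole experiment.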