Evaluation of TopDownCrawl (TDC)

Here we describe a detailed workflow for reproducing the tests used to evaluate the TopDownCrawl (TDC) method available at topdowncrawl.usc.edu. The method is used for the alignment of sequences with quantitative binding metrics from DNA binding experiments such as SELEX-seq.

Many popular tools for processing SELEX-seq data are designed to create a position weight matrix (PWM) to represent the binding preferences of a given DNA-binding protein, which assumes all positions contribute independently to binding. Although k-mer level enrichment is a highly reproducible measurement of binding affinity, it is difficult to know which region of the binding site is covered by each k-mer. For this purpose, we developed TopDownCrawl for aligning k-mers using their associated binding metrics. We evaluate the quality of alignments based on the performance of a multiple linear regression (MLR) model which is inherently sensitive to the alignment of the input sequences. To compare with PWM-based methods, we align the same set of k-mers to the generated PWMs and then perform MLR as was done for the TDC-based alignment.

Datasets Used for Evaluation

We evaluated TDC using 10 previously published SELEX-seq datasets released with the following publications. All reads were trimmed to only include the variable region.

Abe, N., Dror, I., Yang, L., Slattery, M., Zhou, T., Bussemaker, H. J., Rohs, R., & Mann, R. S. (2015). Deconvolving the recognition of DNA shape from sequence. Cell, 161(2), 307-318.

Dantas Machado, A. C., Cooper, B. H., Lei, X., Di Felice, R., Chen, L., & Rohs, R. (2020). Landscape of DNA binding signatures of myocyte enhancer factor-2B reveals a unique interplay of base and shape readout. Nucleic acids research, 48(15), 8529-8544.

Zhang, L., Martini, G. D., Rube, H. T., Kribelbauer, J. F., Rastogi, C., FitzPatrick, V. D., Houtman, J. C., Bussemaker, H. J., & Pufall, M. A. (2018). SelexGLM differentiates androgen and glucocorticoid receptor DNA-binding preference over an extended binding site. Genome research, 28(1), 111-121.

SRA Details

SRA	Renamed File
SRR5340724	AR_R0.fastq.gz
SRR5340730	AR_R3.fastq.gz
SRR5340729	AR_R4.fastq.gz
SRR5340724	GR_R0.fastq.gz
SRR5340721	GR_R3.fastq.gz
SRR5340720	GR_R4.fastq.gz
SRR7450249	MEF2B_R0.fastq.gz
SRR7450250	MEF2B_R1.fastq.gz
SRR7450251	MEF2B_R2.fastq.gz
SRR1765757	AbdA-Exd_R0.fastq.gz
SRR1765754	AbdA-Exd_R3.fastq.gz
SRR1765757	Dfd-Exd_R0.fastq.gz
SRR1765752	Dfd-Exd_R3.fastq.gz
SRR1765757	Lab-Exd_R0.fastq.gz
SRR1765751	Lab-Exd_R3.fastq.gz
SRR1765759	PbFl-Exd_R0.fastq.gz
SRR1765746	PbFl-Exd_R3.fastq.gz
SRR1765756	Scr-Exd_R0.fastq.gz
SRR1765733	Scr-Exd_R3.fastq.gz
SRR1765756	UbxIa-Exd_R0.fastq.gz
SRR1765750	UbxIa-Exd_R3.fastq.gz
SRR1765757	UbxIVa-Exd_R0.fastq.gz
SRR1765753	UbxIVa-Exd_R3.fastq.gz

Calculation of Relative Enrichment

The relative enrichments for 10-mers were calculated using a modified R script provided for the SELEX package available on bioconductor. Additionally, the script depends on the R package, 'stringr'.

https://bioconductor.org/packages/release/bioc/html/SELEX.html

./calculateEnrichment.R <R0 input> <R# input> <round #> <k length>
./calculateEnrichment.R AbdA-Exd_R0.fastq.gz AbdA-Exd_R3.fastq.gz 3 10
# Output: AbdA-Exd_R3_k10.tsv

Alignment of 10-mers using TDC

The table of 10-mers with their assciated relative enrichments can be directly aligned using the TDC algorithm provided through our webserver at topdowncrawl.usc.edu or using the our python package available through pip.

pip install TopDownCrawl
TopDownCrawl AbdA-Exd_R3_k10.tsv
# Output: AbdA-Exd_R3_k10_aligned.tsv

Alignment of 10-mers using PWM-based Methods

BEESEM

BEESEM is available as a python package through a github page maintained by the original creator. In the original publication, the authors use the sequenceing results from the previous round of selection as a prior, but this data was not published for the Hox-Exd datasets. Instead, we used the initial library as the prior. Although this may affect the scaling of the output motif, I do not expect it to have an impact on the predicted ranking of differing binding sites, and should therefore not impact algnment. For the AR and GR datasets, R3 sequencing was used as a prior for R4. For the MEF2B dataset, R1 was used as the prior for R2.

https://github.com/sx-ruan/BEESEM

Ruan S, Swamidass SJ, Stormo GD. BEESEM: estimation of binding energy models using HT-SELEX data. Bioinformatics (2017)

./countUnique.py <input fastq.gz>
./countUnique.py AbdA-Exd_R0.fastq.gz
./countUnique.py AbdA-Exd_R3.fastq.gz

beesem.py -s <most enriched 10-mer as seed> <output name> <processed prior> <processed R# input>
beesem.py -s ATGATTTATG beesem_k10 AbdA-Exd_R0.tsv AbdA-Exd_R3.tsv
# Output: ./beesem_k10_rep=1_phs=10/results/pfm_r0_p9_w10.txt

./alignWithBEESEM.py <input sequences> <beesem generated PFM>
./alignWithBEESEM.py AbdA-Exd_R3_k10.tsv ./beesem_k10_rep=1_phs=10/results/pfm_r0_p9_w10.txt
# Output: ./AbdA-Exd_R3_k10_BEESEM.tsv

SelexGLM

SelexGLM is available as an R package through a github page maintained by the original lab. The included runSELEXGLM.R script is based on the script published for the original AR / GR paper by Zhang et al. For consistency accross datasets, symmetry was not enforced.

https://github.com/BussemakerLab/SelexGLM/

SELEX Run Details (Used for SelexGLM configuration)

# Hox-Exd
varLen = 16
leftFixed = "CCGACGATCTGG"
rightFixed = "CCAGCTGTCGTAT"

# AR / GR
varLen = 23
leftFixed = "GTTCAGAGTTCTACAGTCCGACGATC"
rightFixed = "TGGAATTCTCGGGTGCCAAGG"

# MEF2B
varLen = 16
leftFixed = "GAGTTCTACAGTCCGACGATCCGC"
rightFixed = "CCTGGAATTCTCGGGTGCCA"

./runSelexGLM.R <R0 fastq.gz> <R# fastq.gz> <R#> <most enriched 10-mer as seed> <variable region length> <left adapter> <right adapter>
./runSelexGLM.R AbdA-Exd_R0.fastq.gz AbdA-Exd_R3.fastq.gz 3 ATGATTTATG 16 CCGACGATCTGG CCAGCTGTCGTAT
# Output: SelexGLM_k10_R3.pwm

./alignWithSelexGLM.py <input sequences> <SelexGLM generated PWM>
./alignWithSelexGLM.py AbdA-Exd_R3_k10.tsv SelexGLM_k10_R3.pwm
# Output: ./AbdA-Exd_R3_k10_SelexGLM.tsv

MEME

MEME is a very popular and powerful tool for the alignment of DNA sequences, but inherently weights all sequences equally. This restricts the ability to use quantitative binding measurements in the alignment process. For this reason, we must restrict the table of 10-mers to only 10-mers which might create a useful binding motif. In this case, we arbitratily chose 10-mers with a log enrichment 2 standard deviations above the mean. This set of 10-mers is padded with k-1 ambiguous N bases and used to generate a single motif, which is then used for alignment of all 10-mers. Padding helps generate larger and more representative motifs. MEME can be accessed through a publicly available webserver or downloaded as part of the meme-suite package available on conda.

https://meme-suite.org/meme/tools/meme https://anaconda.org/bioconda/meme

./Zfilt.py AbdA-Exd_R3_k10.tsv
./tsvToFa.py AbdA-Exd_R3_k10_Z2.tsv
meme -revcomp -csites <# greater than the input length> -searchsize <# chars greater than input length> -dna <filtered input fa>
meme -revcomp -csites 10000 -searchsize 1000000 -dna AbdA-Exd_R3_k10_Z2.fa 
meme-get-motif -all meme_out/meme.txt > meme_out/motif.txt
# Output: ./meme_out/motif.txt

./alignWithMEME.py <input sequences> <MEME generated PFM>
./alignWithMEME.py AbdA-Exd_R3_k10.tsv meme_out/motif.txt
# Output: ./AbdA-Exd_R3_k10_MEME.tsv

Multiple Linear Regression (MLR)

Multiple linear regression is performed with elastic net regularization using the sci-kit learn package in python. Models are trained and evaluated using 5-fold cross validation and the median R² is reported. Models are trained to predict the log enrichment of the aligned input using 1-hot encoded sequences, in which case gaps are encoded as an empty vector. Additionally, minor groove width (MGW) and electrostatic potential (EP) features are added to the input using feature tables derived from the DNAshapeR package available on bioconductor. A random seed of 0 is used arbitratily for consistence accross datasets. Since low-affinity binding sites are less likely to have a meaningful alignment, we chose to evaluate only sequences with a log enrichment at least 2 standard deviations above the mean.

./Zfilt.py <input alignment>
./Zfilt.py AbdA-Exd_R3_k10_aligned.tsv
./Zfilt.py AbdA-Exd_R3_k10_BEESEM.tsv
./Zfilt.py AbdA-Exd_R3_k10_SelexGLM.tsv
./Zfilt.py AbdA-Exd_R3_k10_MEME.tsv

./MLR.py <filtered alignment>
./MLR.py AbdA-Exd_R3_k10_aligned_Z2.tsv
./MLR.py AbdA-Exd_R3_k10_BEESEM_Z2.tsv
./MLR.py AbdA-Exd_R3_k10_SelexGLM_Z2.tsv
./MLR.py AbdA-Exd_R3_k10_MEME_Z2.tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Evaluation of TopDownCrawl (TDC)

Datasets Used for Evaluation

Calculation of Relative Enrichment

Alignment of 10-mers using TDC

Alignment of 10-mers using PWM-based Methods

BEESEM

SelexGLM

MEME

Multiple Linear Regression (MLR)

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
featureTables		featureTables
MLR.py		MLR.py
README.md		README.md
Zfilt.py		Zfilt.py
alignWithBEESEM.py		alignWithBEESEM.py
alignWithMEME.py		alignWithMEME.py
alignWithSelexGLM.py		alignWithSelexGLM.py
calculateEnrichment.R		calculateEnrichment.R
countUnique.py		countUnique.py
runSelexGLM.R		runSelexGLM.R
tsvToFa.py		tsvToFa.py

bhcooper/TDC_evaluation

Folders and files

Latest commit

History

Repository files navigation

Evaluation of TopDownCrawl (TDC)

Datasets Used for Evaluation

Calculation of Relative Enrichment

Alignment of 10-mers using TDC

Alignment of 10-mers using PWM-based Methods

BEESEM

SelexGLM

MEME

Multiple Linear Regression (MLR)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages