Exploring the Combinatorics of Motif Alignments Foraccurately Computing E-values from P-values
In biological and biomedical research motif finding tools are important in locating regulatory elements in DNA sequences. There are many such motif finding tools available, which often yield position weight matrices and significance indicators. These indicators, p-values and E-values, describe the likelihood that a motif alignment is generated by the background process, and the expected number of occurrences of the motif in the data set, respectively. The various tools often estimate these indicators differently, making them not directly comparable. One approach for comparing motifs from different tools, is computing the E-value as the product of the p-value and the number of possible alignments in the data set. In this paper we explore the combinatorics of the motif alignment models OOPS, ZOOPS, and ANR, and propose a generic algorithm for computing the number of possible combinations accurately. We also show that using the wrong alignment model can give E-values that significantly diverge from their true values.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1072171Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 885
 M. Tompa, N. Li, T. L. Bailey, G. M. Church, B. De Moor, E. Eskin, A. V. Favorov, M. C. Frith, Y. Fu, W. J. Kent, V. J. Makeev, A. A. Mironov, W. S. Noble, G. Pavesi, G. Pesole, M. Regnier, N. Simonis, S. Sinha, G. Thijs, J. van Helden, M. Vandenbogaert, Z. Weng, C. Workman, C. Ye, and Z. Zhu, "Assessing computational tools for the discovery of transcription factor binding sites." Nat Biotechnol, vol. 23, no. 1, pp. 137-144, Jan 2005.
 T. A. Down and T. J. P. Hubbard, "Nestedmica: sensitive inference of over-represented motifs in nucleic acid sequence." Nucleic Acids Res, vol. 33, no. 5, pp. 1445-1453, 2005.
 T. Bailey and C. Elkan, "Unsupervised learning of multiple motifs in biopolymers using expectation maximization," Machine Learning, vol. 21, no. 1-2, pp. 51-80, Oct-Nov 1995.
 G. Pavesi, G. Mauri, and G. Pesole, "An algorithm for finding signals of unknown length in dna sequences." Bioinformatics, vol. 17 Suppl 1, pp. S207-14, 2001.
 G. Z. Hertz and G. D. Stormo, "Identifying dna and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics, vol. 15, no. 7-8, pp. 563-577, Jul-Aug 1999.
 C. Pizzia and E. Ukkonen, "Fast profile matching algorithms - a survey," Theoretical Computer Science, vol. 395, no. 2-3, pp. 137-157, MAY 1 2008.
 N. Nagarajan, P. Ng, and U. Keich, "Refining motif finders with e-value calculations," Regulatory Genomics, p. 73, 2006.
 G. Andrews, The theory of partitions. Cambridge University Press, 1998.
 G. Andrews and K. Eriksson, Integer partitions. Cambridge University Press, 2004.
 A. Zoghbi and I. Stojmenovic, "Fast algorithms for generating integer partitions," International Journal of Computer Mathematics, vol. 70, no. 2, pp. 319-332, 1998.