**Commenced**in January 2007

**Frequency:**Monthly

**Edition:**International

**Paper Count:**30373

##### Computing Entropy for Ortholog Detection

**Authors:**
Hsing-Kuo Pao,
John Case

**Abstract:**

Biological sequences from different species are called or-thologs if they evolved from a sequence of a common ancestor species and they have the same biological function. Approximations of Kolmogorov complexity or entropy of biological sequences are already well known to be useful in extracting similarity information between such sequences -in the interest, for example, of ortholog detection. As is well known, the exact Kolmogorov complexity is not algorithmically computable. In prac-tice one can approximate it by computable compression methods. How-ever, such compression methods do not provide a good approximation to Kolmogorov complexity for short sequences. Herein is suggested a new ap-proach to overcome the problem that compression approximations may notwork well on short sequences. This approach is inspired by new, conditional computations of Kolmogorov entropy. A main contribution of the empir-ical work described shows the new set of entropy-based machine learning attributes provides good separation between positive (ortholog) and nega-tive (non-ortholog) data - better than with good, previously known alter-natives (which do not employ some means to handle short sequences well).Also empirically compared are the new entropy based attribute set and a number of other, more standard similarity attributes sets commonly used in genomic analysis. The various similarity attributes are evaluated by cross validation, through boosted decision tree induction C5.0, and by Receiver Operating Characteristic (ROC) analysis. The results point to the conclu-sion: the new, entropy based attribute set by itself is not the one giving the best prediction; however, it is the best attribute set for use in improving the other, standard attribute sets when conjoined with them.

**Keywords:**
Entropy,
Compression,
Decision Tree,
ROC,
ortholog

**Digital Object Identifier (DOI):**
doi.org/10.5281/zenodo.1076994

**References:**

[1] M. Li and P.Vitanyi , An Introduction to Kolmogorov Complexity and Its Applications (2nd Ed.), Springer, New York, 1997.

[2] J. K. Lanctot, M. Li, and E.-H. Yang, ''Estimating DNA sequence entropy,''in Symposium on Discrete Algorithms, 2000, pp. 409-418.

[3] S. Karlin and S. F. Altschul, ''Methods for assessing the statistical signif-icance of molecular sequence features by using general scoring schemes,''Proc. Natl. Acad. Sci. USA, vol. 87, pp. 2264-2268, 1990.

[4] M. Ouyang, J. Case, and J. Burnside, ''Divide and conquer machine learn-ing for a genomics analogy problem (progress report),'' in Discovery Sci-ence, 2001, pp. 290-303.

[5] J. R. Quinlan, C4.5: Programs for machine learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.

[6] Y. Freund and R. E. Schapire, ''A decision-theoretic generalization of on-line learning and an application to boosting,'' Journal of Computer and System Science, vol. 55, pp. 119-139, 1997.

[7] F. Mosteller and J. Tukey, Data Analysis and Regression: A Second Coursein Statistics, Addison-Wesley, Reading, Mass., 1977.

[8] M. Stone, ''Cross-validatory choice and assessment of statistical predic-tions,'' Journal of the Royal Statistical Society, vol. 36, pp. 111-147, 1974.

[9] J. A. Hanley and B. J. McNeil, ''The meaning and use of the area under a receiver operating characteristic (ROC) curve,'' Radiology, vol. 143, pp.29-36, 1982.

[10] O. Gotoh, ''An improved algorithm for matching biological sequences,'' J.Mol. Biol., vol. 162, pp. 705-708, 1982.

[11] S. Needleman and C. Wunsch, ''A general method applicable to the search for similarities in the amino acid sequence of two proteins,'' Journal of Molecular Biology, vol. 48, pp. 443-453, 1970.

[12] J. D. Thompson, D. G. Higgins, and T. J. Gibson, ''Clustal w: improv-ing the sensitivity of progressive multiple sequence alignment through sequence weighting position-specific gap penalties and weight matrix choice,'' Nucleic Acids Res., vol. 22, no. 22, pp. 4673-4680, 1994.

[13] T. F. Smith and M. S. Waterman, ''Comparison of biosequences,'' Adv. Appl. Math., vol. 2, pp. 482-489, 1981.

[14] X. Huang and W. Miller, ''A time efficient, linear space local similarity algorithm,'' Adv. Appl. Math., vol. 12, pp. 337-357, 1991.

[15] B. Morgenstern, A. Dress, and T. Werner, ''Multiple DNA and protein se-quence alignment based on segment-to-segment comparison,'' Proc. Natl. Acad. Sci. USA, vol. 93, no. 22, pp. 12098-12103, 1996.

[16] B. Morgenstern, K. Frech, A. Dress, and T. Werner, ''Dialign: finding local similarities by multiple sequence alignment,'' Bioinformatics, vol. 14, no.3, pp. 290-294, 1998.

[17] A. Lempel and J. Ziv, ''A universal algorithm for sequential data compres-sion,'' IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337-343, 1977.

[18] X. Chen, S. Kwong, and M. Li, ''A compression algorithm for DNA se-quences and its applications in genome comparison,'' in RECOMB, 2000,p. 107.

[19] C. E. Shannon, ''A mathematical theory of communication,'' Bell Syst Tech. J., vol. 27, pp. 379-423, 1948.

[20] S. Grumbach and F. Tahi, ''Compression of DNA sequences,'' in Proceed-ings of the IEEE Symposium on Data Compression, 1993, pp. 340-350.

[21] D. M. Loewenstern, H. Hirsh, P. Yianilos, and M. Noordewier, ''DNA sequence classification using compression-based induction.,'' Tech. Rep.95-04, DIMACS, 1995.

[22] D. M. Loewenstern and P. N. Yianilos, ''Significantly lower entropy estimates for natural DNA sequences,'' in IEEE Data Compression Conf.,DCC97, 1997, pp. 151-160.

[23] M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, ''An information-based sequence distance and its application to whole mitochondrial genome phylogeny,'' Bioinformatics, vol. 17, no. 2, pp. 149-154,2001.

[24] D. Benedetto, E. Caglioti, and V. Loreto, ''Language trees and zipping,''Physical Review Letters, vol. 88, no. 4, pp. 048702, 2002.

[25] Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul Vitanyi, ''The similarity metric,'' in Proc. 14th ACMSIAM Symp. Discrete Algorithms (SODA),2003.

[26] D. R. Powell, D. L. Dowe, L. Allison, and T. I. Dix, ''Discovering simple DNA sequences by compression,'' in PSB. 1998, pp. 597-608, World Scientific.

[27] M. Ouyang, J. Case, V. Tirunagaru, and J. Burnside, ''565 triples of chicken, human, and mouse candidate orthologs,'' Journal of Molecular Evolution, vol. 57, pp. 271-281, 2003.

[28] M.D. Adams, A.R. Kerlavage, R.D. Fleischmann, R.A. Fuldner, C.J. Bult,N.H. Lee, E.F. Kirkness, K.G. Weinstock, J.D. Gocayne, O. White, and et al, ''Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence,'' Nature, vol. 377,pp. 3-174, 1995.

[29] V. Tirunagaru, L. Sofer, and J. Burnside, ''An expressed sequence tag database of activated chicken T cells: Sequence analysis of 5000 cDNA clones,'' Genomics, vol. 66, pp. 144-151, 2000.