Hsing-Kuo Pao and John Case
Computing Entropy for Ortholog Detection
5 - 8
2007
1
1
International Journal of Bioengineering and Life Sciences
https://publications.waset.org/pdf/11278
https://publications.waset.org/vol/1
World Academy of Science, Engineering and Technology
Biological sequences from different species are called orthologs if they evolved from a sequence of a common ancestor species and have the same biological function. Approximations of the Kolmogorov complexity, or entropy, of biological sequences are well known to be useful for extracting similarity information between such sequences, for example in ortholog detection. The exact Kolmogorov complexity is not algorithmically computable, so in practice it is approximated by computable compression methods. However, such compression methods do not approximate Kolmogorov complexity well for short sequences. This paper suggests a new approach to overcome the problem that compression approximations may not work well on short sequences. The approach is inspired by new, conditional computations of Kolmogorov entropy. A main contribution of the empirical work is to show that the new set of entropy-based machine learning attributes separates positive (ortholog) from negative (non-ortholog) data better than good, previously known alternatives, which do not employ any means of handling short sequences well. The new entropy-based attribute set is also compared empirically with a number of other, more standard similarity attribute sets commonly used in genomic analysis. The various similarity attributes are evaluated by cross-validation, through boosted decision tree induction (C5.0), and by Receiver Operating Characteristic (ROC) analysis. The results indicate that the new entropy-based attribute set is not, by itself, the one giving the best prediction; however, it is the best attribute set for improving the other, standard attribute sets when conjoined with them.
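The general idea of approximating Kolmogorov complexity by compression can be illustrated with a minimal sketch. This is not the attribute set proposed in the paper. It is the generic normalized compression distance, computed here with Python's standard `zlib` module as an assumed stand-in compressor. The function names `compressed_size` and `ncd`, and the randomly generated test sequences, are hypothetical and chosen only for illustration. The sketch also shows why short sequences are a problem: the compressor's fixed overhead dominates when the input is only a few characters long.

```python
import random
import zlib


def compressed_size(seq: str) -> int:
    """Byte length of the zlib-compressed sequence.

    This is a computable upper-bound proxy for Kolmogorov complexity,
    not the (uncomputable) complexity itself.
    """
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Values near 0 suggest the sequences share most of their information;
    values near 1 suggest they share little.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical test data: two independent random DNA-like sequences.
rng = random.Random(0)
seq_a = "".join(rng.choice("ACGT") for _ in range(2000))
seq_b = "".join(rng.choice("ACGT") for _ in range(2000))

print("NCD(a, a):", ncd(seq_a, seq_a))  # identical sequences
print("NCD(a, b):", ncd(seq_a, seq_b))  # unrelated sequences

# Short-sequence problem: the compressed form is longer than the input,
# so the size estimate reflects format overhead rather than content.
short = "ACGT"
print("raw length:", len(short), "compressed length:", compressed_size(short))
```

In this sketch an identical pair scores a much lower distance than an unrelated pair, while the four-character input compresses to more bytes than it started with, which is the kind of distortion on short sequences that the abstract describes.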
Open Science Index 1, 2007