MIM: A Species Independent Approach for Classifying Coding and Non-Coding DNA Sequences in Bacterial and Archaeal Genomes

Authors: Achraf El Allali, John R. Rose


A number of competing methodologies have been developed to identify genes and to classify DNA sequences as coding or non-coding. This classification is fundamental to gene finding and gene annotation tools and is one of the most challenging tasks in bioinformatics and computational biology. An information-theoretic measure based on mutual information has shown good accuracy in separating coding from non-coding DNA sequences. In this paper we describe a species-independent, iterative approach that distinguishes coding from non-coding sequences using the mutual information measure (MIM). A set of sixty prokaryotes is used to extract universal training data. To facilitate comparison with the published results of other researchers, a test set of 51 bacterial and archaeal genomes was used to evaluate MIM. The evaluation shows that MIM produces superior results while remaining species independent.
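The abstract does not give the formula behind MIM, so the following is only a minimal sketch of the general idea, assuming the standard definition of mutual information between the bases found k positions apart in a sequence. The function names `mutual_information` and `mi_profile` are hypothetical, and the code does not reproduce the authors' classifier, its iterative procedure, or its training data.

```python
from collections import Counter
from math import log2


def mutual_information(seq: str, k: int) -> float:
    """Mutual information (in bits) between bases k positions apart.

    Textbook definition: I(k) = sum over (a, b) of
    p_ab * log2(p_ab / (p_a * p_b)), with all probabilities
    estimated from the pairs (seq[i], seq[i + k]).
    """
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)    # marginal of the first base
    right = Counter(b for _, b in pairs)   # marginal of the second base
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) rewritten with counts to limit rounding error
        mi += p_ab * log2(p_ab * n * n / (left[a] * right[b]))
    return mi


def mi_profile(seq: str, max_k: int) -> list[float]:
    """Mutual information for every distance k = 1 .. max_k."""
    return [mutual_information(seq, k) for k in range(1, max_k + 1)]
```

A profile such as `mi_profile(sequence, 12)` could then serve as a feature vector for a downstream classifier; which distances to use and how to set the decision threshold are left open here, since the abstract does not specify them.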

Keywords: Coding Non-coding Classification, Entropy, Gene Recognition, Mutual Information.


