Application of KL Divergence for Estimation of Each Metabolic Pathway Genes
Developing methods to estimate gene function is an important task in bioinformatics. One approach to annotation is to identify the metabolic pathway in which a gene is involved. Because gene expression data reflect various intracellular phenomena, they are considered to be related to gene function. However, it has been difficult to estimate gene function from such data with high accuracy. The low accuracy is thought to stem from the difficulty of measuring gene expression precisely: even when measured under the same condition, expression values usually vary. In this study, we propose a feature extraction method that focuses on the variability of gene expression in order to estimate the metabolic pathway of each gene accurately. First, we estimated the distribution of each gene's expression from replicate data. Next, we calculated the similarity between all gene pairs using the Kullback-Leibler (KL) divergence, which measures how much one probability distribution differs from another. Finally, we used the resulting similarity vectors as feature vectors and trained a multiclass SVM to identify the metabolic pathway of each gene. To evaluate the method, we applied it to budding yeast and trained the multiclass SVM to distinguish seven metabolic pathways. The accuracy obtained with our method was higher than that obtained from the raw gene expression data. Thus, the proposed method based on KL divergence is useful for identifying the metabolic pathways of genes.
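The feature extraction steps described above can be illustrated with a minimal sketch. The abstract does not state which distribution family was fitted to the replicates, so this example assumes a univariate Gaussian per gene, for which the KL divergence has a closed form. The function names, the variance floor `min_var`, and the toy expression values are all hypothetical and serve only as illustration.

```python
import math
from statistics import mean, variance


def fit_gaussian(replicates, min_var=1e-6):
    """Return (mean, variance) for one gene's replicate measurements.

    Assumption: each gene's expression is modeled as a univariate Gaussian.
    The sample variance is floored at min_var to avoid division by zero.
    """
    mu = mean(replicates)
    var = max(variance(replicates), min_var)
    return mu, var


def kl_gaussian(p, q):
    """KL(p || q) for two univariate Gaussians given as (mean, variance)."""
    mu_p, var_p = p
    mu_q, var_q = q
    return 0.5 * (
        math.log(var_q / var_p)
        + (var_p + (mu_p - mu_q) ** 2) / var_q
        - 1.0
    )


def kl_feature_matrix(expression):
    """Row i holds KL(gene_i || gene_j) for every gene j.

    Each row can serve as the feature vector of gene i.
    """
    params = [fit_gaussian(reps) for reps in expression]
    return [[kl_gaussian(p, q) for q in params] for p in params]


# Toy replicate data (hypothetical values): 3 genes x 4 replicates each.
expression = [
    [5.1, 4.9, 5.0, 5.2],
    [5.0, 5.1, 4.8, 5.1],
    [9.0, 8.5, 9.4, 9.1],
]
features = kl_feature_matrix(expression)
```

Each row of `features` could then be passed, together with the gene's pathway label, to a multiclass SVM. Note that KL divergence is asymmetric, so `features[i][j]` and `features[j][i]` generally differ; whether the original method symmetrizes the values is not stated in the abstract.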
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1099834