A Novel Approach for Protein Classification Using Fourier Transform
Authors: A. F. Ali, D. M. Shawky
Abstract:
Discovering new biological knowledge from high-throughput biological data is a major challenge in bioinformatics today. To address this challenge, we developed a new approach for protein classification. Proteins that are evolutionarily, and thereby functionally, related are said to belong to the same classification. Identifying a protein's classification is of fundamental importance for documenting the diversity of the known protein universe. It also provides a means to determine the functional roles of newly discovered protein sequences. Our goal is to predict the functional classification of novel protein sequences based on a set of features extracted from each protein sequence. The proposed technique uses datasets extracted from the Structural Classification of Proteins (SCOP) database. A set of spectral-domain features based on the Fast Fourier Transform (FFT) is used. The proposed classifier uses a multilayer backpropagation (MLBP) neural network for protein classification. The maximum classification accuracy is about 91% when the classifier is applied to the full four levels of the SCOP database. However, it reaches a maximum of 96% when classification is limited to the family level. The classification results reveal that the spectral domain contains information that can be used for classification with high accuracy. In addition, the results emphasize that sequence similarity measures are of great importance, especially at the family level.
Keywords: Bioinformatics, Artificial Neural Networks, Protein Sequence Analysis, Feature Extraction.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1326728
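Illustrative sketch: The abstract mentions spectral-domain features based on the FFT but does not say how residues are mapped to numbers or which spectral quantities are kept. The following Python sketch shows one way such features could be computed, and it rests on several assumptions that are not taken from the paper: residues are encoded with the Kyte-Doolittle hydropathy scale, the signal is truncated or zero-padded to a hypothetical fixed length n_points, and the feature vector is the power spectrum of the non-redundant frequency bins. The function names are illustrative only. A vector like this could then serve as input to a neural network classifier such as the MLBP network the abstract describes.

```python
import cmath

# ASSUMPTION: Kyte-Doolittle hydropathy values are used to turn residues into
# numbers. The abstract does not specify the encoding used by the authors.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def spectral_features(sequence, n_points=64):
    """Return a fixed-length power-spectrum vector for a protein sequence.

    n_points is a hypothetical parameter (must be a power of two); sequences
    are truncated or zero-padded to this length before the transform.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")

    # Encode residues; unknown symbols are mapped to 0.0.
    signal = [HYDROPATHY.get(residue, 0.0) for residue in sequence.upper()]

    # Subtract the mean so the spectrum reflects variation, not overall level.
    mean = sum(signal) / len(signal)
    signal = [value - mean for value in signal]

    # Truncate or zero-pad to exactly n_points samples.
    signal = (signal + [0.0] * n_points)[:n_points]

    spectrum = fft(signal)

    # For real input, bins 1 .. n_points/2 carry all non-redundant information.
    return [abs(c) ** 2 / n_points for c in spectrum[1:n_points // 2 + 1]]


features = spectral_features("MKTAYIAKQRQISFVKSHFSRQ")
print(len(features))
```

With the default n_points of 64, every sequence yields 32 features regardless of its length, which is what allows a fixed-size network input layer.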