Dimensionality Reduction of PSSM Matrix and its Influence on Secondary Structure and Relative Solvent Accessibility Predictions

Rafał Adamczak

Authors: Rafał Adamczak

Abstract:

State-of-the-art methods for secondary structure prediction (Porter, Psi-PRED, SAM-T99sec, Sable) and solvent accessibility prediction (Sable, ACCpro) use evolutionary profiles represented by the position-specific scoring matrix (PSSM). It has been demonstrated that evolutionary profiles are the most important features in the feature space for these predictions. Unfortunately, using the PSSM matrix leads to high-dimensional feature spaces that may create problems with parameter optimization and generalization. Several recently published studies suggest that applying feature extraction to the PSSM matrix may improve secondary structure prediction. However, none of the top-performing methods considered here uses dimensionality reduction to improve generalization. In the present study, we used simple and fast feature selection methods (t-statistics, information gain) that allow us to decrease the dimensionality of the PSSM matrix by 75% and improve generalization in secondary structure prediction compared to the Sable server.
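
As an illustration of the kind of filter-style feature selection mentioned above, the following is a minimal sketch that ranks features by a two-class t-statistic and keeps the top 25% (that is, a 75% reduction). It is not the procedure used in the paper, whose details are not given in this abstract. The 15-residue window, the synthetic data, the one-vs-rest labels, the use of Welch's form of the t-statistic, and the helper names `welch_t_scores` and `select_top_fraction` are all assumptions made for the example.

```python
import numpy as np

def welch_t_scores(X, y):
    """Absolute Welch t-statistic of each feature column between y == 1 and y == 0."""
    a, b = X[y == 1], X[y == 0]
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    std_err = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(mean_diff) / np.maximum(std_err, 1e-12)  # guard against zero variance

def select_top_fraction(scores, keep=0.25):
    """Sorted indices of the highest-scoring features, keeping a `keep` fraction."""
    n_keep = max(1, int(round(keep * scores.size)))
    return np.sort(np.argsort(scores)[::-1][:n_keep])

# Synthetic stand-in for windowed PSSM features (assumption):
# 20 PSSM columns x 15-residue window = 300 features per residue.
rng = np.random.default_rng(0)
n_samples, n_features = 400, 300
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)       # hypothetical labels, e.g. helix vs. rest

# Make a few features class-dependent so the ranking has something to find.
informative = np.arange(0, n_features, 30)
X[:, informative] += 2.0 * y[:, None]

scores = welch_t_scores(X, y)
selected = select_top_fraction(scores, keep=0.25)
X_reduced = X[:, selected]                   # 300 -> 75 features
```

An information-gain ranking could be swapped in by replacing `welch_t_scores` with any function that returns one score per feature; the selection step stays the same.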

Keywords: Secondary structure prediction, feature selection, position specific scoring matrix.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1080508
