SAF: A Substitution and Alignment Free Similarity Measure for Protein Sequences

Abdellali Kelil; Shengrui Wang; Ryszard Brzezinski

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33122

SAF: A Substitution and Alignment Free Similarity Measure for Protein Sequences

Authors: Abdellali Kelil, Shengrui Wang, Ryszard Brzezinski

Abstract:

The literature reports a large number of approaches for measuring the similarity between protein sequences. Most of these approaches estimate this similarity using alignment-based techniques that do not necessarily yield biologically plausible results, for two reasons. First, for the case of non-alignable (i.e., not yet definitively aligned and biologically approved) sequences such as multi-domain, circular permutation and tandem repeat protein sequences, alignment-based approaches do not succeed in producing biologically plausible results. This is due to the nature of the alignment, which is based on the matching of subsequences in equivalent positions, while non-alignable proteins often have similar and conserved domains in non-equivalent positions. Second, the alignment-based approaches lead to similarity measures that depend heavily on the parameters set by the user for the alignment (e.g., gap penalties and substitution matrices). For easily alignable protein sequences, it's possible to supply a suitable combination of input parameters that allows such an approach to yield biologically plausible results. However, for difficult-to-align protein sequences, supplying different combinations of input parameters yields different results. Such variable results create ambiguities and complicate the similarity measurement task. To overcome these drawbacks, this paper describes a novel and effective approach for measuring the similarity between protein sequences, called SAF for Substitution and Alignment Free. Without resorting either to the alignment of protein sequences or to substitution relations between amino acids, SAF is able to efficiently detect the significant subsequences that best represent the intrinsic properties of protein sequences, those underlying the chronological dependencies of structural features and biochemical activities of protein sequences. Moreover, by using a new efficient subsequence matching scheme, SAF more efficiently handles protein sequences that contain similar structural features with significant meaning in chronologically non-equivalent positions. To show the effectiveness of SAF, extensive experiments were performed on protein datasets from different databases, and the results were compared with those obtained by several mainstream algorithms.

Keywords: Protein, Similarity, Substitution, Alignment.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1057677

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1415

References:

[1] S. Vinga, and J. Almeida, "Alignment-free sequence comparison - a review," BIOINFORMATICS, 4, vol. 19, 2003, pp. 513-523.
[2] M.W. Berry, and R.D. Fierro, "Low-rank orthogonal decompositions for information retrieval applications," Numerical Linear Algebra with Applications, 3, vol. 4, 1996, pp. 301-327.
[3] S. Karlin, and G. Ghandour, "Comparative statistics for DNA and protein sequences: Single sequence analysis," Proc. Natl. Acad. Sci. USA, 17, vol. 82, 1985, pp. 5800-5804.
[4] M. Ganapathiraju, J. Klein-Seetharaman, N. Balakrishnan, and R. Reddy, "Characterization of protein secondary structure," Signal Processing Magazine, IEEE, 3, vol. 21, May 2004, pp. 78-87.
[5] A. Amir, M. Lewenstein, and E. Porat, "Faster algorithms for string matching with k mismatches," J. Algorithms, 2, vol. 50, 2004, pp. 257-275.
[6] M. Brand, "Fast low-rank modifications of the thin singular value decomposition," Linear Algebra and Its Applications, 1, vol. 415, 2006, pp. 20-30.
[7] http://www.ncbi.nlm.nih.gov/BLAST.
[8] http://www.ebi.ac.uk/fasta33.
[9] http://www.ebi.ac.uk/clustalw.
[10] E. L. L. Sonnhammer, and V. Hollich, "Scoredist: A simple and robust protein sequence distance estimator," BMC Bioinformatics 2005, vol. 6 pp. 108.
[11] A. Kelil, S. Wang, and R. Brzezinski, "A new alignment-independent algorithm for clustering protein sequences," in Proc. 7th IEEE International Conference on BioInformatics and BioEngineering. Cambridge, Harvard University, Massachusetts, USA, 2007, pp. 27-34.
[12] A. Kelil, S. Wang, and R. Brzezinski, "CLUSS2: An alignment-independent algorithm for clustering protein families with multiple biological functions," International Journal of Computational Biology and Drug Design, 2008. (In press).
[13] K.P. Wu, H.N. Lin, T.Y. Sung, and W.L. Hsu, "A New Similarity Measure among Protein Sequences," in Proc. Computational Systems Bioinformatics, Stanford, CA, USA, 2003, pp. 347-352.
[14] A. Bogan-Marta, N. Laskaris, M. A. Gavrielides, I. Pitas, K. Lyroudia, "A novel efficient protein similarity measure based on n-gram modeling," in Proc. 2nd International Conference on Computational Intelligence in Medicine and Healthcare. Costa da Caparica, Lisbon, Portugal, 2005.
[15] R.L. Tatusov, N.D. Fedorova, J.D. Jackson, A.R. Jacobs, B. Kiryutin, E.V. Koonin, D.M. Krylov, R. Mazumder, S.L. Mekhedov, A.N. Nikolskaya, B.S. Rao, S. Smirnov, A.V. Sverdlov, S. Vasudevan, Y.I. Wolf, J.J. Yin, and D.A. Natale, "The COG database: An updated version includes eukaryotes," BMC Bioinformatics, 4, vol. 41, 2003.
[16] K. ONeill, W. Klimke, and T. Tatusova, "Protein clusters: A collection of proteins grouped by sequence similarity and function," NCBI, October 04, 2007.
[17] N. C├┤té, A. Fleury, E. Dumont-Blanchette, T. Fukamizo, M. Mitsutomi, and R. Brzezinski, "Two exo-β-D-glucosaminidases/exochitosanases from actinomycetes define a new subfamily within family 2 of glycoside hydrolases," Biochem. J., vol. 394, 2006, pp. 675-686.
[18] T. Fukamizo, A. Fleury, N. C├┤té, M. Mitsutomi, and R. Brzezinski, "Exo-β-D-glucosaminidase from Amycolatopsis orientalis: Catalytic residues, sugar recognition specificity, kinetics, and synergism," Glycobiology, vol. 16, 2006, pp. 1064-1072.
[19] www.cazy.org.
[20] D. Higgins, "Multiple Alignment," in The Phylogenetic Handbook, M. Salemi, and A. M. Vandamme, Ed. Cambridge University Press, 2004, pp. 45-71.
[21] A. Kelil, S. Wang, R. Brzezinski, and F. Alain, "CLUSS: Clustering of protein sequences based on a new similarity measure," BMC Bioinformatics, 2007, vol. 8, pp. 286.
[22] W. C. Lo, and P. C. Lyu, "CPSARST: An efficient circular permutation search tool applied to the detection of novel protein structural relationships," Genome Biology, vol. 9, 2008.
[23] S. Uliel, A. Fliess, and R. Unger, "Naturally occurring circular permutations in proteins," Protein Eng., vol. 14, 2001, pp. 533-542.
[24] http://www.ncbi.nlm.nih.gov.
[25] http://prospectus.usherbrooke.ca/CLUSS.