Clustering Protein Sequences with Tailored General Regression Model Technique

G. Lavanya Devi; Allam Appa Rao; A. Damodaram; GR Sridhar; G. Jaya Suma

Abstract:

Cluster analysis divides data into groups that are meaningful, useful, or both. Analysis of biological data is creating a new generation of epidemiologic, prognostic, diagnostic and treatment modalities. Clustering of protein sequences is a current research topic in computer science. Linear relations are valuable for rule discovery in a given data set, for example "if value X goes up by 1, value Y will go down by 3". Classical linear regression models the linear relation between two sequences perfectly. However, if we need to cluster a large repository of protein sequences into groups whose sequences have a strong linear relationship with each other, it is prohibitively expensive to compare the sequences pair by pair. In this paper, we propose a new technique, the General Regression Model Technique Clustering Algorithm (GRMTCA), to handle the problem of clustering linear sequences efficiently. GRMT gives a measure, GR*, that indicates the degree of linearity of multiple sequences without having to compare each pair of them.
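As a rough illustration of the pairwise baseline that the abstract describes as too expensive, the sketch below fits a classical least-squares line between two numeric sequences and reports R² as the degree of linearity. It does not implement GRMTCA or the GR* measure, which the abstract does not define. The function names `fit_line` and `pairwise_comparisons` are hypothetical, and the sequences are assumed to be already encoded as equal-length lists of numbers (the abstract does not say how protein sequences are encoded).

```python
def fit_line(x, y):
    """Ordinary least squares fit y ~ slope * x + intercept.

    Returns (slope, intercept, r2), where r2 is the coefficient of
    determination, used here as the degree of linearity of the pair.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined linear fit")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


def pairwise_comparisons(n_sequences):
    """Number of regressions needed to compare every pair of sequences."""
    return n_sequences * (n_sequences - 1) // 2


# The rule from the abstract: when X goes up by 1, Y goes down by 3.
# The intercept of 10 is an arbitrary choice for this example.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0 - 3.0 * v for v in x]
slope, intercept, r2 = fit_line(x, y)
print(slope, intercept, r2)

# Comparing every pair in a repository grows quadratically.
print(pairwise_comparisons(1000))
```

For the example pair the fitted slope is -3, matching the rule, and R² is 1 because the relation is exactly linear. A repository of 1000 sequences would already need 499500 such fits, which is the cost that a single multi-sequence measure such as GR* is meant to avoid.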

Keywords: Clustering, General Regression Model, Protein Sequences, Similarity Measure.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1075565
