Constraint Based Frequent Pattern Mining Technique for Solving GCS Problem

First G.M. Karthik; Second Ramachandra.V.Pujeri; Dr.

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33156

Constraint Based Frequent Pattern Mining Technique for Solving GCS Problem

Authors: First G.M. Karthik, Second Ramachandra.V.Pujeri, Dr.

Abstract:

Generalized Center String (GCS) problem are generalized from Common Approximate Substring problem and Common substring problems. GCS are known to be NP-hard allowing the problems lies in the explosion of potential candidates. Finding longest center string without concerning the sequence that may not contain any motifs is not known in advance in any particular biological gene process. GCS solved by frequent pattern-mining techniques and known to be fixed parameter tractable based on the fixed input sequence length and symbol set size. Efficient method known as Bpriori algorithms can solve GCS with reasonable time/space complexities. Bpriori 2 and Bpriori 3-2 algorithm are been proposed of any length and any positions of all their instances in input sequences. In this paper, we reduced the time/space complexity of Bpriori algorithm by Constrained Based Frequent Pattern mining (CBFP) technique which integrates the idea of Constraint Based Mining and FP-tree mining. CBFP mining technique solves the GCS problem works for all center string of any length, but also for the positions of all their mutated copies of input sequence. CBFP mining technique construct TRIE like with FP tree to represent the mutated copies of center string of any length, along with constraints to restraint growth of the consensus tree. The complexity analysis for Constrained Based FP mining technique and Bpriori algorithm is done based on the worst case and average case approach. Algorithm's correctness compared with the Bpriori algorithm using artificial data is shown.

Keywords: Constraint Based Mining, FP tree, Data mining, GCS problem, CBFP mining technique.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1076646

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1710

References:

[1] C.N. Meneses, Z. Lu, A.S. Oliveria, and P.M. Pardolos, "Optimal Solutions for the Closest String Problem via Integer Programming," INFORMS J. Computing, vol. 16, no. 4, pp. 419-429, 2004.
[2] E. Eskin and P. Pevzner, "Finding Composite Regulatory Patterns in DNA Sequences," Bioinformatics, pp. 354-363, 2002.
[3] F.Y.L. Chin and H.C.M. Leung, "Voting Algorithms for Discovering Long Motifs," Proc. Third Asia-Pacific Bioinformatics Conf. (APBC -05), pp. 261-271, 2005.
[4] G. Grahne, L. Lakshmanan, and X. Wang, "Efficient mining of constrained correlated sets," In Proc. 2000 Int. Conf. Data Engineering (ICDE-00), San Diego, CA, pp. 512-521, 2000.
[5] G. Pavesi, G. Mauri, and G. Pesole, "An Algorithm for Finding Signals of Unknown Length in DNA Sequences," Bioinformatics, vol. 17, pp. 207-214, 2001.
[6] G. Pavesi, G. Mauri, and G. Pesole, "In Silico Representation and Discovery of Transcription Factor Binding Sites," Brief in Bioinformatics, vol. 5, no. 3, pp. 217-36, 2004.
[7] G.D. Stormo, "DNA Binding Sites: Representation and Discovery," Bionformatics, vol. 16, no. 1, pp. 16-23, 2004.
[8] H. Mannila, H. Toivonen, "Levelwise search and borders of theories in knowledge discovery," Data Mining and Knowledge Discovery 1, pp 241-258, 1997.
[9] H.C.M. Leung and F.Y.L. Chin, "Algorithms for Challenging Motif Problems," J. Bioinformatics and Computational Biology, vol. 4, no. 1, pp. 43-58, 2006.
[10] J. Buhler and M. Tompa, "Finding Motifs Using Random Projections," Proc. Fifth Ann. Int-l Conf. Computational Molecular Biology (RECOMB), pp. 69-76, 2001.
[11] J. Davila, S. Balla, and S. Rajasekaran, "Space and Time Efficient Algorithms for Planted Motif Search," Proc. Int-l Workshop Bioinformatics Research and Applications (IWBRA -06), May 2006.
[12] J. Gramm, R. Niedermeier, and P. Rossmanith, "Exact Solutions for CLOSEST STRING and Related Problems," Proc. 12th Int-l Symp. Algorithms and Computation, pp. 441-453, 2001.
[13] J. Guan, D.Y. Liu, and D.A. Bell, "Discovering Motifs in DNA Sequences," Fundamenta Informaticae, vol. 59, pp. 119-132, 2004.
[14] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," In Proc. 2000 ACMSIGMOD, Int. Conf. Management of Data (SIGMOD-00), Dallas, TX, pp. 1-12, 2000.
[15] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD, pp. 1-12, 2000.
[16] J. Han, L.V.S. Lakshmanan, T.N.Raymond, "Constraint-Based Multidimensional Data Mining," IEEE Computer, 32(8), 46- 50,1999.
[17] J. Pei, J. Han, and L.V.S. Lakshmanan, "Mining frequent itemsets with convertible constraints," In Proc. 2001 Int. Conf. Data Engineering (ICDE-01), Heidelberg, Germany, pp. 433-332, 2001.
[18] J. Pei, J. Han, and R. Mao, "CLOSET: An efficient algorithm for mining frequent closed itemsets," In Proc. 2000 ACM-SIGMOD Int. Workshop Data Mining and Knowledge Discovery (DMKD-00), Dallas, TX, pp. 11-20, 2000.
[19] K. Lanctot, M. Li, B. Ma, and L. Zhang, "Distinguishing String Selection Problems," Information and Computation, vol. 185, no. 1, pp. 44-51, 2003.
[20] L. De Raedt, S. Kramer, "The levelwise version space algorithm and its application to molecular fragment finding," In. IJCA101: 7th Int. Joint Conf. Artificial Intelligence, 2001.
[21] L. Marsan and M.F. Sagot, "Algorithms for Extracting Structured Motifs Using a Suffix Tree with Application to Promoter and Regulatory Site Consensus Identification," J. Computational Biology, vol. 7, pp. 345-360, 2000.
[22] M. Ester and X. Zhang, "A Top-Down Method for Mining Most Specific Frequent Patterns in Biological Sequence Data," Proc. SIAM Int-l Conf. on Data Mining (SDM -04), Apr. 2004.
[23] M. Frances and A. Litman, "On Covering Problems of Codes," Theoretical Computer System, vol. 30, pp. 113-119, 1997.
[24] M. Li, B. Ma, and L. Wang, "Finding Similar Regions in Many Strings," Proc. 31st Ann. ACM Symp. Theory of Computing (STOC -99), pp. 473-482, 1999.
[25] M. Li, B. Ma, and L. Wang, "On the Closest String and Substring Problems," J. ACM, vol. 49, no. 2, pp. 151-171, 2002.
[26] M. Tompa et al., "Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites," Nature Biotechnology, vol. 23, no. 1, pp. 137-144, 2005.
[27] M. Tompa, "An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem," Proc. Seventh Int-l Conf. Intelligent Systems for Molecular Biology, pp. 262-271, 1999.
[28] M.F. Sagot, "Spelling Approximate Repeated or Common Motifs Using a Suffix Tree," Proc. Third Latin Am. Symp. Theoretical Informatics, pp. 111-127, 1998.
[29] M.J. Zaki, and C.J. Hsiao, "CHARM: An efficient algorithm for closed itemset mining," In Proc. 2002 SIAMInt. Conf. Data Mining, Arlington, VA, pp. 457-473, 2002.
[30] M.N. Garofalakis, R. Rastogi, and K. Shim, "SPIRIT: Sequential pattern mining with regular expression constraints," In Proc. 25th Int. Conf. Very Large Data Bases (VLDB '99), San Francisco, Morgan Kaufmann, pp 223-234, 1999.
[31] M.S. Waterman, R. Arratia, and D.J. Galas, "Pattern Recognition in Several Sequences: Consensus and Alignment," Bull. Math. Biology, vol. 46, pp. 515-527, 1984
[32] P. Pevzner and S. Sze, "Combinatorial Approaches to Finding Subtle Signals in DNA Sequences," Proc. Int-l Conf. Intelligent Systems for Molecular Biology, pp. 269-278, 2000.
[33] P.A. Evans and A.D. Smith, "Complexity of Approximating Closest Substring Problems," Fundamentals of Computation Theory, pp. 210-221, 2003.
[34] P.A. Evans and A.D. Smith, "Toward Optimal Motif Enumeration," Proc. Algorithms and Data Structures, Eighth Int-l Workshop (WADS -03), pp. 47-58, 2003.
[35] P.A. Evans and H.T. Wareham, "Practical Algorithms for Universal DNA Primer Design: An Exercise in Algorithm Engineering," Currents in Computational Molecular Biology 2001, N. El-Mabrouk, T. Lengauer, and D. Sankoff, eds., pp. 25-26, Les Publications CRM, 2001.
[36] P.A. Evans, A.D. Smith, and H.T. Wareham, "On the Complexity of Finding Common Approximate Substrings," Theoretical Computer Science, vol. 306, no. 1-3, pp. 407-430, 2003.
[37] R. Agarwal, C. Aggarwal, and V.V.V. Prasad, "A tree projection algorithm for generation of frequentitemsets," Journal of Parallel and Distributed Computing, 61:350-371, 2001.
[38] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int-l Conf. Very Large Data Bases (VLDB -94), pp. 487-499, 1994.
[39] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and I. Verkamo, "Fast Discovery of the Association Rules," Advances in Knowledge Discovery and Data Mining, pp. 307-328, AAAI Press, 1996.
[40] R. Ng et al., "Exploratory Mining and Pruning Optimizations of Constrained Associations Rules," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1998, pp. 13- 24.
[41] R. Srikant, and R. Agrawal, "Mining sequential patterns: Generalizations and performance improvements," In Proc. 5th Int. Conf. Extending Database Technology (EDBT-96), Avignon, France, pp. 3-17, 1996.
[42] R. Srikant, Q. Vu, and R. Agrawal, "Mining association rules with item constraints," In Proc. 1997 Int. Conf. Knowledge Discovery and Data Mining (KDD-97), Newport Beach, CA, pp. 67-73, 1997.
[43] Ruqian Lu, Caiyan Jia, Shaofang Zhang, Lusheng Chen, Hongyu Zhang, "An Exact Data Mining Method for Finding Center Strings and All Their Instances," IEEE Trans. Knowledge and Data Engineering," 19(4), 509-522, 2007.
[44] S. Rajasekaran, S. Balla, and C.-H. Huang, "Exact Algorithms for the Planted Motif Problem," J. Computational Biology, vol. 12, no. 8, pp. 1117-1128, 2005.
[45] S. Sinha and M. Tompa, "Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresentation," Nucleic Acids Research, vol. 30, no. 24, pp. 5549-5560, 2002.
[46] S. Sinha and M. Tompa, "YMF: A Program or Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresentation," Nucleic Acids Research, vol. 31, pp. 3586- 3588, 2003.
[47] S.T. Jensen, X.S. Liu, Q. Zhou, and J.S. Liu, "Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective," Statistical Science, vol. 19, pp. 188-204, 2004.
[48] Sau Dan Lee, Luc De Raedt, "Constraint Based Mining of First Order Sequences in SeqLog," Database Support for Data Mining Applications 154-173, 2004.