{"title":"Constraint Based Frequent Pattern Mining Technique for Solving GCS Problem","authors":"First G.M. Karthik, Second Ramachandra.V.Pujeri, Dr.","volume":18,"journal":"International Journal of Computer and Information Engineering","pagesStart":1989,"pagesEnd":1997,"ISSN":"1307-6892","URL":"https:\/\/publications.waset.org\/pdf\/11090","abstract":"Generalized Center String (GCS) problem are\r\ngeneralized from Common Approximate Substring problem\r\nand Common substring problems. GCS are known to be\r\nNP-hard allowing the problems lies in the explosion of\r\npotential candidates. Finding longest center string without\r\nconcerning the sequence that may not contain any motifs is\r\nnot known in advance in any particular biological gene\r\nprocess. GCS solved by frequent pattern-mining techniques\r\nand known to be fixed parameter tractable based on the\r\nfixed input sequence length and symbol set size. Efficient\r\nmethod known as Bpriori algorithms can solve GCS with\r\nreasonable time\/space complexities. Bpriori 2 and Bpriori\r\n3-2 algorithm are been proposed of any length and any\r\npositions of all their instances in input sequences. In this\r\npaper, we reduced the time\/space complexity of Bpriori\r\nalgorithm by Constrained Based Frequent Pattern mining\r\n(CBFP) technique which integrates the idea of Constraint\r\nBased Mining and FP-tree mining. CBFP mining technique\r\nsolves the GCS problem works for all center string of any\r\nlength, but also for the positions of all their mutated copies\r\nof input sequence. CBFP mining technique construct TRIE\r\nlike with FP tree to represent the mutated copies of center\r\nstring of any length, along with constraints to restraint\r\ngrowth of the consensus tree. The complexity analysis for\r\nConstrained Based FP mining technique and Bpriori\r\nalgorithm is done based on the worst case and average case\r\napproach. Algorithm's correctness compared with the\r\nBpriori algorithm using artificial data is shown.","references":"[1] C.N. Meneses, Z. Lu, A.S. Oliveria, and P.M. Pardolos, \"Optimal\r\nSolutions for the Closest String Problem via Integer Programming,\"\r\nINFORMS J. Computing, vol. 16, no. 4, pp. 419-429, 2004.\r\n[2] E. Eskin and P. Pevzner, \"Finding Composite Regulatory Patterns\r\nin DNA Sequences,\" Bioinformatics, pp. 354-363, 2002.\r\n[3] F.Y.L. Chin and H.C.M. Leung, \"Voting Algorithms for\r\nDiscovering Long Motifs,\" Proc. Third Asia-Pacific Bioinformatics\r\nConf. (APBC -05), pp. 261-271, 2005.\r\n[4] G. Grahne, L. Lakshmanan, and X. Wang, \"Efficient mining of\r\nconstrained correlated sets,\" In Proc. 2000 Int. Conf. Data\r\nEngineering (ICDE-00), San Diego, CA, pp. 512-521, 2000.\r\n[5] G. Pavesi, G. Mauri, and G. Pesole, \"An Algorithm for Finding\r\nSignals of Unknown Length in DNA Sequences,\" Bioinformatics,\r\nvol. 17, pp. 207-214, 2001.\r\n[6] G. Pavesi, G. Mauri, and G. Pesole, \"In Silico Representation and\r\nDiscovery of Transcription Factor Binding Sites,\" Brief in\r\nBioinformatics, vol. 5, no. 3, pp. 217-36, 2004.\r\n[7] G.D. Stormo, \"DNA Binding Sites: Representation and Discovery,\"\r\nBionformatics, vol. 16, no. 1, pp. 16-23, 2004.\r\n[8] H. Mannila, H. Toivonen, \"Levelwise search and borders of theories\r\nin knowledge discovery,\" Data Mining and Knowledge Discovery\r\n1, pp 241-258, 1997.\r\n[9] H.C.M. Leung and F.Y.L. Chin, \"Algorithms for Challenging Motif\r\nProblems,\" J. Bioinformatics and Computational Biology, vol. 4,\r\nno. 1, pp. 43-58, 2006.\r\n[10] J. Buhler and M. Tompa, \"Finding Motifs Using Random\r\nProjections,\" Proc. Fifth Ann. Int-l Conf. Computational Molecular\r\nBiology (RECOMB), pp. 69-76, 2001.\r\n[11] J. Davila, S. Balla, and S. Rajasekaran, \"Space and Time Efficient\r\nAlgorithms for Planted Motif Search,\" Proc. Int-l Workshop\r\nBioinformatics Research and Applications (IWBRA -06), May\r\n2006.\r\n[12] J. Gramm, R. Niedermeier, and P. Rossmanith, \"Exact Solutions for\r\nCLOSEST STRING and Related Problems,\" Proc. 12th Int-l Symp.\r\nAlgorithms and Computation, pp. 441-453, 2001.\r\n[13] J. Guan, D.Y. Liu, and D.A. Bell, \"Discovering Motifs in DNA\r\nSequences,\" Fundamenta Informaticae, vol. 59, pp. 119-132, 2004.\r\n[14] J. Han, J. Pei, and Y. Yin, \"Mining frequent patterns without\r\ncandidate generation,\" In Proc. 2000 ACMSIGMOD, Int. Conf.\r\nManagement of Data (SIGMOD-00), Dallas, TX, pp. 1-12, 2000.\r\n[15] J. Han, J. Pei, and Y. Yin, \"Mining Frequent Patterns without\r\nCandidate Generation,\" Proc. ACM SIGMOD, pp. 1-12, 2000.\r\n[16] J. Han, L.V.S. Lakshmanan, T.N.Raymond, \"Constraint-Based\r\nMultidimensional Data Mining,\" IEEE Computer, 32(8), 46-\r\n50,1999.\r\n[17] J. Pei, J. Han, and L.V.S. Lakshmanan, \"Mining frequent itemsets\r\nwith convertible constraints,\" In Proc. 2001 Int. Conf. Data\r\nEngineering (ICDE-01), Heidelberg, Germany, pp. 433-332, 2001.\r\n[18] J. Pei, J. Han, and R. Mao, \"CLOSET: An efficient algorithm for\r\nmining frequent closed itemsets,\" In Proc. 2000 ACM-SIGMOD\r\nInt. Workshop Data Mining and Knowledge Discovery\r\n(DMKD-00), Dallas, TX, pp. 11-20, 2000.\r\n[19] K. Lanctot, M. Li, B. Ma, and L. Zhang, \"Distinguishing String\r\nSelection Problems,\" Information and Computation, vol. 185, no. 1,\r\npp. 44-51, 2003.\r\n[20] L. De Raedt, S. Kramer, \"The levelwise version space algorithm\r\nand its application to molecular fragment finding,\" In. IJCA101: 7th\r\nInt. Joint Conf. Artificial Intelligence, 2001.\r\n[21] L. Marsan and M.F. Sagot, \"Algorithms for Extracting Structured\r\nMotifs Using a Suffix Tree with Application to Promoter and\r\nRegulatory Site Consensus Identification,\" J. Computational\r\nBiology, vol. 7, pp. 345-360, 2000.\r\n[22] M. Ester and X. Zhang, \"A Top-Down Method for Mining Most\r\nSpecific Frequent Patterns in Biological Sequence Data,\" Proc.\r\nSIAM Int-l Conf. on Data Mining (SDM -04), Apr. 2004.\r\n[23] M. Frances and A. Litman, \"On Covering Problems of Codes,\"\r\nTheoretical Computer System, vol. 30, pp. 113-119, 1997.\r\n[24] M. Li, B. Ma, and L. Wang, \"Finding Similar Regions in Many\r\nStrings,\" Proc. 31st Ann. ACM Symp. Theory of Computing\r\n(STOC -99), pp. 473-482, 1999.\r\n[25] M. Li, B. Ma, and L. Wang, \"On the Closest String and Substring\r\nProblems,\" J. ACM, vol. 49, no. 2, pp. 151-171, 2002.\r\n[26] M. Tompa et al., \"Assessing Computational Tools for the Discovery\r\nof Transcription Factor Binding Sites,\" Nature Biotechnology, vol.\r\n23, no. 1, pp. 137-144, 2005.\r\n[27] M. Tompa, \"An Exact Method for Finding Short Motifs in\r\nSequences, with Application to the Ribosome Binding Site\r\nProblem,\" Proc. Seventh Int-l Conf. Intelligent Systems for\r\nMolecular Biology, pp. 262-271, 1999.\r\n[28] M.F. Sagot, \"Spelling Approximate Repeated or Common Motifs\r\nUsing a Suffix Tree,\" Proc. Third Latin Am. Symp. Theoretical\r\nInformatics, pp. 111-127, 1998.\r\n[29] M.J. Zaki, and C.J. Hsiao, \"CHARM: An efficient algorithm for\r\nclosed itemset mining,\" In Proc. 2002 SIAMInt. Conf. Data Mining,\r\nArlington, VA, pp. 457-473, 2002.\r\n[30] M.N. Garofalakis, R. Rastogi, and K. Shim, \"SPIRIT: Sequential\r\npattern mining with regular expression constraints,\" In Proc. 25th Int.\r\nConf. Very Large Data Bases (VLDB '99), San Francisco, Morgan\r\nKaufmann, pp 223-234, 1999.\r\n[31] M.S. Waterman, R. Arratia, and D.J. Galas, \"Pattern Recognition in\r\nSeveral Sequences: Consensus and Alignment,\" Bull. Math. Biology,\r\nvol. 46, pp. 515-527, 1984\r\n[32] P. Pevzner and S. Sze, \"Combinatorial Approaches to Finding Subtle\r\nSignals in DNA Sequences,\" Proc. Int-l Conf. Intelligent Systems for\r\nMolecular Biology, pp. 269-278, 2000.\r\n[33] P.A. Evans and A.D. Smith, \"Complexity of Approximating Closest\r\nSubstring Problems,\" Fundamentals of Computation Theory, pp.\r\n210-221, 2003.\r\n[34] P.A. Evans and A.D. Smith, \"Toward Optimal Motif Enumeration,\"\r\nProc. Algorithms and Data Structures, Eighth Int-l Workshop\r\n(WADS -03), pp. 47-58, 2003.\r\n[35] P.A. Evans and H.T. Wareham, \"Practical Algorithms for Universal\r\nDNA Primer Design: An Exercise in Algorithm Engineering,\"\r\nCurrents in Computational Molecular Biology 2001, N. El-Mabrouk,\r\nT. Lengauer, and D. Sankoff, eds., pp. 25-26, Les Publications CRM,\r\n2001.\r\n[36] P.A. Evans, A.D. Smith, and H.T. Wareham, \"On the Complexity of\r\nFinding Common Approximate Substrings,\" Theoretical Computer\r\nScience, vol. 306, no. 1-3, pp. 407-430, 2003.\r\n[37] R. Agarwal, C. Aggarwal, and V.V.V. Prasad, \"A tree projection\r\nalgorithm for generation of frequentitemsets,\" Journal of Parallel and\r\nDistributed Computing, 61:350-371, 2001.\r\n[38] R. Agrawal and R. Srikant, \"Fast Algorithms for Mining Association\r\nRules,\" Proc. 20th Int-l Conf. Very Large Data Bases (VLDB -94),\r\npp. 487-499, 1994.\r\n[39] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and I. Verkamo,\r\n\"Fast Discovery of the Association Rules,\" Advances in Knowledge\r\nDiscovery and Data Mining, pp. 307-328, AAAI Press, 1996.\r\n[40] R. Ng et al., \"Exploratory Mining and Pruning Optimizations of\r\nConstrained Associations Rules,\" Proc. 1998 ACM SIGMOD Int'l\r\nConf. Management of Data, ACM Press, New York, 1998, pp. 13-\r\n24.\r\n[41] R. Srikant, and R. Agrawal, \"Mining sequential patterns:\r\nGeneralizations and performance improvements,\" In Proc. 5th Int.\r\nConf. Extending Database Technology (EDBT-96), Avignon,\r\nFrance, pp. 3-17, 1996.\r\n[42] R. Srikant, Q. Vu, and R. Agrawal, \"Mining association rules with\r\nitem constraints,\" In Proc. 1997 Int. Conf. Knowledge Discovery and\r\nData Mining (KDD-97), Newport Beach, CA, pp. 67-73, 1997.\r\n[43] Ruqian Lu, Caiyan Jia, Shaofang Zhang, Lusheng Chen, Hongyu\r\nZhang, \"An Exact Data Mining Method for Finding Center Strings\r\nand All Their Instances,\" IEEE Trans. Knowledge and Data\r\nEngineering,\" 19(4), 509-522, 2007.\r\n[44] S. Rajasekaran, S. Balla, and C.-H. Huang, \"Exact Algorithms for\r\nthe Planted Motif Problem,\" J. Computational Biology, vol. 12, no.\r\n8, pp. 1117-1128, 2005.\r\n[45] S. Sinha and M. Tompa, \"Discovery of Novel Transcription Factor\r\nBinding Sites by Statistical Overrepresentation,\" Nucleic Acids\r\nResearch, vol. 30, no. 24, pp. 5549-5560, 2002.\r\n[46] S. Sinha and M. Tompa, \"YMF: A Program or Discovery of Novel\r\nTranscription Factor Binding Sites by Statistical\r\nOverrepresentation,\" Nucleic Acids Research, vol. 31, pp. 3586-\r\n3588, 2003.\r\n[47] S.T. Jensen, X.S. Liu, Q. Zhou, and J.S. Liu, \"Computational\r\nDiscovery of Gene Regulatory Binding Motifs: A Bayesian\r\nPerspective,\" Statistical Science, vol. 19, pp. 188-204, 2004.\r\n[48] Sau Dan Lee, Luc De Raedt, \"Constraint Based Mining of First\r\nOrder Sequences in SeqLog,\" Database Support for Data Mining\r\nApplications 154-173, 2004.","publisher":"World Academy of Science, Engineering and Technology","index":"Open Science Index 18, 2008"}