{"title":"FCNN-MR: A Parallel Instance Selection Method Based on Fast Condensed Nearest Neighbor Rule","authors":"Lu Si, Jie Yu, Shasha Li, Jun Ma, Lei Luo, Qingbo Wu, Yongqi Ma, Zhengji Liu","volume":127,"journal":"International Journal of Information and Communication Engineering","pagesStart":855,"pagesEnd":862,"ISSN":"1307-6892","URL":"https:\/\/publications.waset.org\/pdf\/10007534","abstract":"Instance selection (IS) technique is used to reduce
\r\nthe data size to improve the performance of data mining methods.
\r\nRecently, to process very large data set, several proposed methods
\r\ndivide the training set into some disjoint subsets and apply IS
\r\nalgorithms independently to each subset. In this paper, we analyze
\r\nthe limitation of these methods and give our viewpoint about how to
\r\ndivide and conquer in IS procedure. Then, based on fast condensed
\r\nnearest neighbor (FCNN) rule, we propose a large data sets instance
\r\nselection method with MapReduce framework. Besides ensuring the
\r\nprediction accuracy and reduction rate, it has two desirable properties:
\r\nFirst, it reduces the work load in the aggregation node; Second
\r\nand most important, it produces the same result with the sequential
\r\nversion, which other parallel methods cannot achieve. We evaluate the
\r\nperformance of FCNN-MR on one small data set and two large data
sets. The experimental results show that it is effective and practical.","references":"[1] S. García, J. Luengo, and F. Herrera, Data Preprocessing in Data Mining. Springer, 2015.
[2] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011.
[3] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[4] D. R. Wilson and T. R. Martinez, “Reduction techniques for instance-based learning algorithms,” Machine Learning, vol. 38, no. 3, pp. 257–286, 2000.
[5] M. Kudo and J. Sklansky, “Comparison of algorithms that select features for pattern classifiers,” Pattern Recognition, vol. 33, no. 1, pp. 25–41, 2000.
[6] H. Liu and H. Motoda, Feature Extraction, Construction and Selection: A Data Mining Perspective. Springer, 1999.
[7] I. Triguero, J. Derrac, S. García, and F. Herrera, “A taxonomy and experimental study on prototype generation for nearest neighbor classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 1, pp. 86–100, 2012.
[8] J. Hamidzadeh, R. Monsefi, and H. S. Yazdi, “IRAHC: Instance reduction algorithm using hyperrectangle,” Pattern Recognition, vol. 48, no. 5, pp. 1878–1889, 2015.
[9] A. de Haro-García and N. García-Pedrajas, “A divide-and-conquer recursive approach for scaling up instance selection algorithms,” Data Mining and Knowledge Discovery, vol. 18, no. 3, pp. 392–418, 2009.
[10] I. Triguero, D. Peralta, J. Bacardit, S. García, and F. Herrera, “MRPR: A MapReduce solution for prototype reduction in big data classification,” Neurocomputing, vol. 150, pp. 331–345, 2015.
[11] J. Zhai, X. Wang, and X. Pang, “Voting-based instance selection from large data sets with MapReduce and random weight networks,” Information Sciences, vol. 367, pp. 1066–1077, 2016.
[12] H. Liu and H. Motoda, “On issues of instance selection,” Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 115–130, 2002.
[13] J. R. Cano, F. Herrera, and M. Lozano, “Stratification for scaling up evolutionary prototype selection,” Pattern Recognition Letters, vol. 26, no. 7, pp. 953–963, 2005.
[14] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proceedings of the Conference on Symposium on Operating Systems Design & Implementation, 2004, pp. 107–113.
[15] F. Angiulli, “Fast condensed nearest neighbor rule,” in Proc. International Conference on Machine Learning, 2005, pp. 25–32.
[16] F. Angiulli, “Fast nearest neighbor condensation for large data sets classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 11, pp. 1450–1464, 2007.
[17] P. E. Hart, “The condensed nearest neighbor rule,” IEEE Transactions on Information Theory, 1968.
[18] C. H. Chou, B. H. Kuo, and F. Chang, “The generalized condensed nearest neighbor rule as a data reduction method,” vol. 2, pp. 556–559, 2006.
[19] G. W. Gates, “The reduced nearest neighbor rule,” IEEE Transactions on Information Theory, vol. 18, no. 3, pp. 431–433, 1972.
[20] D. L. Wilson, “Asymptotic properties of nearest neighbor rules using edited data,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 2, no. 3, pp. 408–421, 1972.
[21] W. C. Lin, C. F. Tsai, S. W. Ke, C. W. Hung, and W. Eberle, “Learning to detect representative data for large scale instance selection,” Journal of Systems and Software, vol. 106, pp. 1–8, 2015.
[22] A. Onan, “A fuzzy-rough nearest neighbor classifier combined with consistency-based subset evaluation and instance selection for automated diagnosis of breast cancer,” Expert Systems with Applications, vol. 42, no. 20, pp. 6844–6852, 2015.
[23] J. A. Olvera-López, J. A. Carrasco-Ochoa, and J. F. Martínez-Trinidad, “A new fast prototype selection method based on clustering,” Pattern Analysis and Applications, vol. 13, no. 2, pp. 131–141, 2010.
[24] J. R. Cano, F. Herrera, and M. Lozano, “Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study,” IEEE Transactions on Evolutionary Computation, vol. 7, no. 6, pp. 561–575, 2004.
[25] D. Borthakur, “The Hadoop distributed file system: Architecture and design,” Hadoop Project Website, pp. 1–10, 2007.
[26] D. B. Skalak, “Prototype and feature selection by sampling and random mutation hill climbing algorithms,” in Machine Learning Proceedings 1994, pp. 293–301, 1994.
[27] V. S. Devi and M. N. Murty, “An incremental prototype set building technique,” Pattern Recognition, vol. 35, no. 2, pp. 505–513, 2002.","publisher":"World Academy of Science, Engineering and Technology","index":"Open Science Index 127, 2017"}
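Note: as a rough illustration of the sequential fast condensed nearest neighbor (FCNN) rule that the abstract builds on and that FCNN-MR parallelizes with MapReduce, the sketch below is a minimal NumPy version, not the authors' code; the function name fcnn, the Euclidean metric, and the brute-force distance computation are assumptions made here for clarity only.

import numpy as np

def fcnn(X, y):
    # Minimal sketch of the sequential FCNN rule (Angiulli, 2005), assuming
    # Euclidean distance. X: (n, d) float array, y: (n,) integer labels.
    # Returns indices of the selected (condensed) subset S.
    seeds = []
    # Seed S with, for each class, the point closest to that class centroid.
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centroid = X[idx].mean(axis=0)
        seeds.append(int(idx[np.argmin(np.linalg.norm(X[idx] - centroid, axis=1))]))

    S = []            # selected subset, as indices into X
    delta_S = seeds   # points to be added in the current iteration
    while delta_S:
        S.extend(delta_S)
        S_arr = np.asarray(S)
        # Assign every training point to its nearest selected prototype
        # (its Voronoi owner); brute-force distances for illustration only.
        dists = np.linalg.norm(X[:, None, :] - X[S_arr][None, :, :], axis=2)
        owner = S_arr[np.argmin(dists, axis=1)]
        delta_S = []
        # For each prototype p, add one representative of the points in its
        # Voronoi cell that the current subset still misclassifies.
        for p in S_arr:
            wrong = np.where((owner == p) & (y != y[p]))[0]
            if wrong.size:
                rep = wrong[np.argmin(np.linalg.norm(X[wrong] - X[p], axis=1))]
                delta_S.append(int(rep))
    return S

For example, S = fcnn(X_train, y_train) returns indices of a training-set-consistent subset: at termination no Voronoi cell contains a misclassified point, so a 1-NN classifier built on X_train[S] labels every training point correctly, which is the property the parallel FCNN-MR version is designed to reproduce exactly.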