{"title":"FCNN-MR: A Parallel Instance Selection Method Based on Fast Condensed Nearest Neighbor Rule","authors":"Lu Si, Jie Yu, Shasha Li, Jun Ma, Lei Luo, Qingbo Wu, Yongqi Ma, Zhengji Liu","volume":127,"journal":"International Journal of Information and Communication Engineering","pagesStart":855,"pagesEnd":862,"ISSN":"1307-6892","URL":"https:\/\/publications.waset.org\/pdf\/10007534","abstract":"Instance selection (IS) technique is used to reduce
\r\nthe data size to improve the performance of data mining methods.
\r\nRecently, to process very large data set, several proposed methods
\r\ndivide the training set into some disjoint subsets and apply IS
\r\nalgorithms independently to each subset. In this paper, we analyze
\r\nthe limitation of these methods and give our viewpoint about how to
\r\ndivide and conquer in IS procedure. Then, based on fast condensed
\r\nnearest neighbor (FCNN) rule, we propose a large data sets instance
\r\nselection method with MapReduce framework. Besides ensuring the
\r\nprediction accuracy and reduction rate, it has two desirable properties:
\r\nFirst, it reduces the work load in the aggregation node; Second
\r\nand most important, it produces the same result with the sequential
\r\nversion, which other parallel methods cannot achieve. We evaluate the
\r\nperformance of FCNN-MR on one small data set and two large data
sets. The experimental results show that it is effective and practical.","references":"[1] S. García, J. Luengo, and F. Herrera, Data Preprocessing in Data Mining. Springer, 2015.
[2] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011.
[3] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[4] D. R. Wilson and T. R. Martinez, “Reduction techniques for instance-based learning algorithms,” Machine Learning, vol. 38, no. 3, pp. 257–286, 2000.
[5] M. Kudo and J. Sklansky, “Comparison of algorithms that select features for pattern classifiers,” Pattern Recognition, vol. 33, no. 1, pp. 25–41, 2000.
[6] H. Liu and H. Motoda, Feature Extraction, Construction and Selection: A Data Mining Perspective. Springer, 1999.
[7] I. Triguero, J. Derrac, S. García, and F. Herrera, “A taxonomy and experimental study on prototype generation for nearest neighbor classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 1, pp. 86–100, 2012.
[8] J. Hamidzadeh, R. Monsefi, and H. S. Yazdi, “IRAHC: Instance reduction algorithm using hyperrectangle,” Pattern Recognition, vol. 48, no. 5, pp. 1878–1889, 2015.
[9] A. de Haro-García and N. García-Pedrajas, “A divide-and-conquer recursive approach for scaling up instance selection algorithms,” Data Mining and Knowledge Discovery, vol. 18, no. 3, pp. 392–418, 2009.
[10] I. Triguero, D. Peralta, J. Bacardit, S. García, and F. Herrera, “MRPR: A MapReduce solution for prototype reduction in big data classification,” Neurocomputing, vol. 150, pp. 331–345, 2015.
[11] J. Zhai, X. Wang, and X. Pang, “Voting-based instance selection from large data sets with MapReduce and random weight networks,” Information Sciences, vol. 367, pp. 1066–1077, 2016.
[12] H. Liu and H. Motoda, “On issues of instance selection,” Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 115–130, 2002.
[13] J. R. Cano, F. Herrera, and M. Lozano, “Stratification for scaling up evolutionary prototype selection,” Pattern Recognition Letters, vol. 26, no. 7, pp. 953–963, 2005.
[14] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proceedings of the Conference on Symposium on Operating Systems Design & Implementation, 2004, pp. 107–113.
[15] F. Angiulli, “Fast condensed nearest neighbor rule,” in Proc. International Conference on Machine Learning, 2005, pp. 25–32.
[16] F. Angiulli, “Fast nearest neighbor condensation for large data sets classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 11, pp. 1450–1464, 2007.
[17] P. E. Hart, “The condensed nearest neighbor rule,” IEEE Transactions on Information Theory, 1968.
[18] C. H. Chou, B. H. Kuo, and F. Chang, “The generalized condensed nearest neighbor rule as a data reduction method,” vol. 2, pp. 556–559, 2006.
[19] G. W. Gates, “The reduced nearest neighbor rule,” IEEE Transactions on Information Theory, vol. 18, no. 3, pp. 431–433, 1972.
[20] D. L. Wilson, “Asymptotic properties of nearest neighbor rules using edited data,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 2, no. 3, pp. 408–421, 1972.
[21] W. C. Lin, C. F. Tsai, S. W. Ke, C. W. Hung, and W. Eberle, “Learning to detect representative data for large scale instance selection,” Journal of Systems and Software, vol. 106, pp. 1–8, 2015.
[22] A. Onan, “A fuzzy-rough nearest neighbor classifier combined with consistency-based subset evaluation and instance selection for automated diagnosis of breast cancer,” Expert Systems with Applications, vol. 42, no. 20, pp. 6844–6852, 2015.
[23] J. A. Olvera-López, J. A. Carrasco-Ochoa, and J. F. Martínez-Trinidad, “A new fast prototype selection method based on clustering,” Pattern Analysis and Applications, vol. 13, no. 2, pp. 131–141, 2010.
[24] J. R. Cano, F. Herrera, and M. Lozano, “Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study,” IEEE Transactions on Evolutionary Computation, vol. 7, no. 6, pp. 561–575, 2004.
[25] D. Borthakur, “The Hadoop distributed file system: Architecture and design,” Hadoop Project Website, pp. 1–10, 2007.
[26] D. B. Skalak, “Prototype and feature selection by sampling and random mutation hill climbing algorithms,” in Machine Learning Proceedings 1994, pp. 293–301, 1994.
[27] V. S. Devi and M. N. Murty, “An incremental prototype set building technique,” Pattern Recognition, vol. 35, no. 2, pp. 505–513, 2002.","publisher":"World Academy of Science, Engineering and Technology","index":"Open Science Index 127, 2017"}
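Note: as a rough illustration of the sequential fast condensed nearest neighbor (FCNN) rule that the abstract builds on and that FCNN-MR parallelizes with MapReduce, the sketch below is a minimal NumPy version, not the authors' code; the function name fcnn, the Euclidean metric, and the brute-force distance computation are assumptions made here for clarity only.

import numpy as np

def fcnn(X, y):
    # Minimal sketch of the sequential FCNN rule (Angiulli, 2005), assuming
    # Euclidean distance. X: (n, d) float array, y: (n,) integer labels.
    # Returns indices of the selected (condensed) subset S.
    seeds = []
    # Seed S with, for each class, the point closest to that class centroid.
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centroid = X[idx].mean(axis=0)
        seeds.append(int(idx[np.argmin(np.linalg.norm(X[idx] - centroid, axis=1))]))

    S = []            # selected subset, as indices into X
    delta_S = seeds   # points to be added in the current iteration
    while delta_S:
        S.extend(delta_S)
        S_arr = np.asarray(S)
        # Assign every training point to its nearest selected prototype
        # (its Voronoi owner); brute-force distances for illustration only.
        dists = np.linalg.norm(X[:, None, :] - X[S_arr][None, :, :], axis=2)
        owner = S_arr[np.argmin(dists, axis=1)]
        delta_S = []
        # For each prototype p, add one representative of the points in its
        # Voronoi cell that the current subset still misclassifies.
        for p in S_arr:
            wrong = np.where((owner == p) & (y != y[p]))[0]
            if wrong.size:
                rep = wrong[np.argmin(np.linalg.norm(X[wrong] - X[p], axis=1))]
                delta_S.append(int(rep))
    return S

For example, S = fcnn(X_train, y_train) returns indices of a training-set-consistent subset: at termination no Voronoi cell contains a misclassified point, so a 1-NN classifier built on X_train[S] labels every training point correctly, which is the property the parallel FCNN-MR version is designed to reproduce exactly.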