Image Spam Detection Using Color Features and K-Nearest Neighbor Classification
Authors: T. Kumaresan, S. Sanjushree, C. Palanisamy
Abstract:
Image spam is a kind of email spam in which the spam text is embedded in an image. It is a new technique that spammers use to send their messages to large numbers of internet users. Spam email has become a serious problem for internet users, costing them time and money. The main objective of this paper is to detect image spam using the histogram properties of an image. Although many techniques exist to detect and block spam automatically, spammers keep employing new tricks to bypass them, and as a result those techniques fail to detect the spam mails. In this paper, we propose a new method for detecting image spam. Image features are extracted using an RGB histogram, an HSV histogram, and a combination of the two. Classification is then performed on the optimized image feature set using the k-Nearest Neighbor (k-NN) algorithm. Experimental results show that the proposed method improves detection accuracy, and that the combination of RGB and HSV histograms with the k-NN algorithm gives the best accuracy in spam detection.
Keywords: File Type, HSV Histogram, k-NN, RGB Histogram, Spam Detection.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1337863
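The pipeline described in the abstract (color histogram features followed by k-NN classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the bin count, the value of k, the Euclidean distance, and the synthetic `toy_image` helper are all assumptions made for illustration, and the feature-set optimization step mentioned in the abstract is not reproduced. Only the Python standard library is used, so real images would first have to be decoded into lists of (r, g, b) tuples by an image library of your choice.

```python
import colorsys
import random
from collections import Counter
from math import dist


def channel_histogram(values, bins, upper):
    """Normalized histogram of values lying in [0, upper]."""
    counts = [0] * bins
    for v in values:
        # Clamp so that v == upper falls into the last bin.
        counts[min(int(v / upper * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]


def rgb_hsv_features(pixels, bins=8):
    """Concatenate per-channel RGB and HSV histograms (6 * bins values)."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    features = []
    for ch in range(3):
        features += channel_histogram([p[ch] for p in pixels], bins, 255)
    for ch in range(3):
        features += channel_histogram([p[ch] for p in hsv], bins, 1.0)
    return features


def knn_predict(train_features, train_labels, query, k=3):
    """Majority vote among the k training vectors closest to query."""
    order = sorted(range(len(train_features)),
                   key=lambda i: dist(train_features[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def toy_image(label, rng, n_pixels=200):
    """Hypothetical stand-in for a decoded image: a list of (r, g, b) tuples."""
    if label == "spam":
        # Text-on-plain-background look: mostly white, some black.
        return [(255, 255, 255) if rng.random() < 0.8 else (0, 0, 0)
                for _ in range(n_pixels)]
    # Photo-like look: colors spread over the whole range.
    return [tuple(rng.randrange(256) for _ in range(3)) for _ in range(n_pixels)]


# Assumed toy training set: five synthetic images per class.
rng = random.Random(0)
train_labels = ["spam"] * 5 + ["ham"] * 5
train_features = [rgb_hsv_features(toy_image(lbl, rng)) for lbl in train_labels]

# Classify one fresh synthetic image of each kind.
for true_label in ("spam", "ham"):
    query = rgb_hsv_features(toy_image(true_label, rng))
    print(true_label, "->", knn_predict(train_features, train_labels, query))
```

Dropping either loop in `rgb_hsv_features` yields the RGB-only or HSV-only feature set, which is one way the three variants named in the abstract could be compared on the same data.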