Identification of Spam Keywords Using Hierarchical Category in C2C E-commerce
Consumer-to-Consumer (C2C) E-commerce has grown rapidly in recent years. Because identical or nearly identical products compete with one another through keyword search in C2C E-commerce, some sellers describe their products with spam keywords that are popular but unrelated to the products. Although such products are more likely to be retrieved and selected by consumers than those without spam keywords, the spam keywords mislead consumers and waste their time. This problem has been reported in many commercial services such as eBay and Taobao, but there has been little research on solving it. As a solution, this paper proposes a method for classifying whether the keywords of a product are spam. The proposed method assumes that a keyword for a given product is more reliable if it is commonly observed in the specifications of products that are the same as, or of the same kind as, the given product. This is because the hierarchical category of a product is, in general, determined precisely by its seller, and so is the specification of the product. Since higher layers of the hierarchical category represent more general kinds of products, the degree of reliability is determined differently at each layer. Hence, the degrees of reliability from the different layers of the hierarchical category become features of a keyword, and they are used together with features derived only from specifications to classify the keywords. Support Vector Machines are adopted as the base classifier over these features, since they are powerful and widely used in many classification tasks. In the experiments, the proposed method is evaluated on a gold-standard dataset from Yi-han-wang, a Chinese C2C E-commerce service, and is compared with a baseline method that does not consider the hierarchical category.
The experimental results show that the proposed method outperforms the baseline in F1-measure, which indicates that spam keywords can be identified effectively by using a hierarchical category in C2C E-commerce.
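The layer-wise reliability idea can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the `Product` schema, the function names, and the definition of reliability as the fraction of other products in the same category (at a given layer) whose specification contains the keyword. The resulting per-layer values are the kind of feature vector that could then be passed to an SVM; the classifier itself is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    # Hypothetical schema: category path ordered from the most general
    # layer to the most specific one, e.g. ("Electronics", "Phones").
    category_path: tuple
    # Terms that appear in the seller-provided specification.
    spec_terms: frozenset


def layer_reliability(keyword, product, catalog, layer):
    """Assumed reliability measure (not the paper's exact formula):
    the fraction of other products sharing the first `layer` category
    levels with `product` whose specification contains `keyword`."""
    prefix = product.category_path[:layer]
    peers = [p for p in catalog
             if p is not product and p.category_path[:layer] == prefix]
    if not peers:
        return 0.0  # no comparable products at this layer
    hits = sum(1 for p in peers if keyword in p.spec_terms)
    return hits / len(peers)


def keyword_features(keyword, product, catalog):
    """One reliability value per category layer, general to specific."""
    depth = len(product.category_path)
    return [layer_reliability(keyword, product, catalog, layer)
            for layer in range(1, depth + 1)]


# Toy catalog, invented purely for illustration.
catalog = [
    Product(("Electronics", "Phones"), frozenset({"android", "touchscreen"})),
    Product(("Electronics", "Phones"), frozenset({"android", "camera"})),
    Product(("Electronics", "Laptops"), frozenset({"keyboard", "ssd"})),
    Product(("Electronics", "Laptops"), frozenset({"keyboard", "hdmi"})),
]
target = catalog[0]

android_features = keyword_features("android", target, catalog)
keyboard_features = keyword_features("keyboard", target, catalog)
```

In this toy catalog, "keyboard" attached to the first phone scores 0.0 at the most specific layer because no other phone specification mentions it, while "android" scores 1.0 there. Under the assumptions above, that contrast is the signal a classifier would use to separate plausible keywords from spam.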
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1337471