Identification of Most Frequently Occurring Lexis in Body-enhancement Medicinal Unsolicited Bulk e-mails

Jatinderkumar R. Saini; Apurva A. Desai

Abstract:

E-mail has become an important means of electronic communication, but its usefulness is marred by Unsolicited Bulk e-mail (UBE) messages. UBE comes in many types, such as pornographic, virus-infected and 'cry-for-help' messages, as well as fake and fraudulent offers for jobs, winnings and medicines. UBE poses technical and socio-economic challenges to the use of e-mail. To meet these challenges and combat this menace, we need to understand UBE. Towards this end, the current paper presents a content-based textual analysis of more than 2700 body-enhancement medicinal UBE messages. Technically, this is an application of Text Parsing and Tokenization to unstructured textual documents, which we approach using the Bag Of Words (BOW) and Vector Space Document Model techniques. We attempt to identify the most frequently occurring lexis in UBE documents that advertise various products for body enhancement, and we present an analysis of the top 100 such lexis. We exhibit the relationship between the occurrence of a word from the identified lexis-set in a given UBE and the probability that this UBE advertises a fake medicinal product. To the best of our knowledge, and based on our survey of the related literature, this is the first formal attempt to identify the most frequently occurring lexis in such UBE through textual analysis. Finally, this is a sincere attempt to raise alertness against, and to mitigate the threat of, such luring but fake UBE.
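
The pipeline outlined in the abstract (tokenization, bag-of-words counting, a vector space representation, and a word-to-class probability) can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny corpus, the regular-expression tokenizer, and the function names `tokenize`, `top_lexis`, `to_vector` and `p_medicinal_given_word` are all hypothetical, and the probability shown is a simple document-frequency ratio chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical miniature corpus; the paper's 2700+ messages are not reproduced here.
MEDICINAL_UBE = [
    "Buy cheap pills online, no prescription needed",
    "Cheap pills shipped worldwide, order online today",
]
OTHER_MAIL = [
    "Meeting moved to Monday, agenda attached",
    "Order confirmation for your online book purchase",
]

# Assumption: a token is a run of ASCII letters after lowercasing.
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def top_lexis(documents, k):
    """Return the k most frequent tokens across all documents (bag of words)."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common(k)


def to_vector(text, vocabulary):
    """Represent one document as term counts over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def p_medicinal_given_word(word, medicinal, other):
    """Fraction of the documents containing `word` that are medicinal UBE."""
    in_medicinal = sum(word in set(tokenize(d)) for d in medicinal)
    in_other = sum(word in set(tokenize(d)) for d in other)
    total = in_medicinal + in_other
    return in_medicinal / total if total else 0.0


if __name__ == "__main__":
    vocabulary = [word for word, _ in top_lexis(MEDICINAL_UBE, 3)]
    print(vocabulary)
    print(to_vector(MEDICINAL_UBE[0], vocabulary))
    print(p_medicinal_given_word("online", MEDICINAL_UBE, OTHER_MAIL))
```

In this sketch a word such as "pills", which appears only in the medicinal messages, receives a ratio of 1.0, while a word shared with ordinary mail receives a lower value. A real study would need a far larger corpus and, most likely, stop-word handling, neither of which is shown here.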

Keywords: Body Enhancement, Lexis, Medicinal, Unsolicited Bulk e-mail (UBE), Vector Space Document Model, Viagra

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1071888
