Identification of Most Frequently Occurring Lexis in Winnings-announcing Unsolicited Bulk e-mails

Jatinderkumar R. Saini; Apurva A. Desai

Abstract:

E-mail has become an important means of electronic communication, but its usefulness is marred by Unsolicited Bulk e-mail (UBE) messages. UBE takes many forms, including pornographic, virus-infected and 'cry-for-help' messages as well as fake and fraudulent offers for jobs, winnings and medicines. UBE poses technical and socio-economic challenges to the use of e-mail. To meet these challenges and combat this menace, we need to understand UBE. Towards this end, the current paper presents a content-based textual analysis of nearly 3000 winnings-announcing UBE messages. Technically, this is an application of Text Parsing and Tokenization to unstructured textual documents, and we approach it using Bag of Words (BOW) and Vector Space Document Model techniques. We attempt to identify the most frequently occurring lexis in the winnings-announcing UBE documents and present an analysis of the top 100 such lexis. We exhibit the relationship between the occurrence of a word from the identified lexis set in a given UBE and the probability that the UBE is one announcing fake winnings. To the best of our knowledge and survey of related literature, this is the first formal attempt to identify the most frequently occurring lexis in winnings-announcing UBE by textual analysis. Finally, this is a sincere attempt to raise alertness against, and mitigate the threat of, such luring but fake UBE.

Keywords: Lexis, Unsolicited Bulk e-mail (UBE), Vector Space Document Model, Winnings, Lottery
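
The tokenization and Bag of Words counting named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`tokenize`, `top_lexis`, `document_frequency`), the letters-only tokenization rule and the three-message toy corpus are all assumptions made for illustration. The last function reports only the share of messages that contain a word; estimating the probability that a message announces fake winnings given that word would also require a contrasting corpus of other e-mail, which is not modelled here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_lexis(documents, n=100):
    """Return the n most frequent tokens over all documents (bag of words)."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common(n)


def document_frequency(documents, word):
    """Return the fraction of documents that contain the word at least once."""
    if not documents:
        return 0.0
    hits = sum(1 for doc in documents if word in set(tokenize(doc)))
    return hits / len(documents)


# Toy corpus invented for illustration only; it is not the paper's data.
sample = [
    "You have won the lottery prize",
    "Claim your lottery winnings now",
    "Your email won a cash prize",
]

print(top_lexis(sample, n=3))
print(document_frequency(sample, "lottery"))
```

A real analysis would likely also need stop-word handling and a decision on whether to count raw occurrences or the number of messages containing each word; the sketch leaves both choices open.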

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1057173
