Identification of Non-Lexicon Non-Slang Unigrams in Body-enhancement Medicinal UBE

Authors: Jatinderkumar R. Saini, Apurva A. Desai


Email has become a fast and cheap means of online communication. The main threat to email is Unsolicited Bulk Email (UBE), commonly called spam. The current work aims to identify unigrams in more than 2700 UBE that advertise body-enhancement drugs. A unigram is selected only if it is neither present in the dictionary nor a slang term. The motives of the paper are manifold. It is an attempt to analyze spamming behaviour and the use of the word-mutation technique. Alongside this, we have attempted to better understand spam, slang, and their interplay. The problem is addressed using tokenization and a unigram bag-of-words (BOW) model. We found that non-lexicon words constitute nearly 66% of the total lexis of the corpus, whereas non-slang words constitute nearly 2.4% of the non-lexicon words. Further, non-lexicon non-slang unigrams composed of 2 lexicon words form more than 71% of the total number of such unigrams. To the best of our knowledge, this is the first attempt to analyze the usage of non-lexicon non-slang unigrams in any kind of UBE.
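As an illustration only, the pipeline outlined in the abstract (tokenization, a unigram BOW model, filtering against a dictionary and a slang list, and checking whether a remaining unigram is two lexicon words joined together) might be sketched as follows. This is a minimal sketch and not the authors' implementation: the word lists, the example messages, the tokenization rule, and all function names are hypothetical assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical toy word lists; the dictionary and slang sources used in the
# paper are not stated in this abstract.
LEXICON = {"buy", "cheap", "pills", "now", "order", "best", "price",
           "online", "at", "med", "shop"}
SLANG = {"meds"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_bow(messages):
    """Build a unigram bag-of-words (token -> count) over all messages."""
    bow = Counter()
    for message in messages:
        bow.update(tokenize(message))
    return bow


def non_lexicon_non_slang(bow, lexicon, slang):
    """Keep only unigrams found in neither the lexicon nor the slang list."""
    return {tok: n for tok, n in bow.items()
            if tok not in lexicon and tok not in slang}


def split_into_two_lexicon_words(token, lexicon):
    """Return (left, right) if the token is two lexicon words joined, else None."""
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in lexicon and right in lexicon:
            return left, right
    return None


# Hypothetical example messages standing in for a UBE corpus.
messages = [
    "Buy cheappills now",
    "Order meds online at bestprice",
    "vi4gra medshop",
]

bow = unigram_bow(messages)
candidates = non_lexicon_non_slang(bow, LEXICON, SLANG)
two_word = {tok: split_into_two_lexicon_words(tok, LEXICON) for tok in candidates}

for tok, parts in sorted(two_word.items()):
    print(tok, parts)
```

In this toy corpus, "meds" is dropped because it appears in the slang list, while a mutated token such as "vi4gra" is kept as a non-lexicon non-slang unigram that does not split into two lexicon words.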

Keywords: Body Enhancement, Lexicon, Medicinal, Slang, Unigram, Unsolicited Bulk Email (UBE)



