Adaptive Naïve Bayesian Anti-Spam Engine
Authors: Wojciech P. Gajewski
Abstract:
Spam has seriously troubled the Internet community over the last few years and has now reached an alarming scale. Observations made at CERN (the European Organization for Nuclear Research, located in Geneva, Switzerland) show that spam can constitute up to 75% of daily SMTP traffic. A naïve Bayesian classifier (NBC) based on a bag-of-words representation of an email is widely used to stop this unwanted flood, as it combines good performance with simple training and classification processes. However, because the patterns of spam change constantly, the classifier must be able to adapt online. This work proposes combining such a classifier with a second NBC based on pairs of adjacent words. Only the second classifier is retrained with examples of spam reported by users. Tests are performed on sizeable sets of mails, both from public spam archives and from CERN mailboxes. They suggest that this architecture can increase spam recall without reducing the classifier's precision, which is what happens when only the NBC based on single words is retrained.
Keywords: Text classification, naïve Bayesian classification, spam, email.
Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1061840
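The architecture outlined in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not state how the two classifiers' outputs are combined, so the OR rule, the 0.9 threshold, the whitespace tokenizer, the Laplace smoothing, and all names (NaiveBayes, CombinedFilter, report_spam) are assumptions made for illustration only. What it does take from the abstract is the split of roles: one NBC over single words, one over pairs of adjacent words, and only the latter retrained on user-reported spam.

```python
import math
from collections import Counter


def unigrams(text):
    # Bag-of-words features (assumed tokenizer: lowercase, split on whitespace).
    return text.lower().split()


def bigrams(text):
    # Features built from pairs of adjacent words.
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


class NaiveBayes:
    """Tiny two-class naive Bayes over a pluggable feature extractor."""

    def __init__(self, extract):
        self.extract = extract
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(self.extract(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_docs = sum(self.docs.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Laplace-smoothed class prior and feature likelihoods (assumption).
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            denom = sum(self.counts[label].values()) + len(vocab) + 1
            for feat in self.extract(text):
                score += math.log((self.counts[label][feat] + 1) / denom)
            log_score[label] = score
        # Normalize the two log scores into a posterior for "spam".
        top = max(log_score.values())
        weight = {k: math.exp(v - top) for k, v in log_score.items()}
        return weight["spam"] / (weight["spam"] + weight["ham"])


class CombinedFilter:
    """Word-based NBC plus word-pair NBC; only the latter adapts online."""

    def __init__(self, threshold=0.9):  # threshold value is hypothetical
        self.word_nbc = NaiveBayes(unigrams)
        self.pair_nbc = NaiveBayes(bigrams)
        self.threshold = threshold

    def train_initial(self, text, label):
        # Offline training: both classifiers see the labelled corpus.
        self.word_nbc.train(text, label)
        self.pair_nbc.train(text, label)

    def report_spam(self, text):
        # Online adaptation: user reports update the word-pair NBC only.
        self.pair_nbc.train(text, "spam")

    def is_spam(self, text):
        # Assumed combination rule: flag the mail if either classifier is confident.
        return (self.word_nbc.spam_probability(text) >= self.threshold
                or self.pair_nbc.spam_probability(text) >= self.threshold)
```

In this sketch, calling report_spam on a missed message raises the word-pair classifier's spam probability for that message while leaving the single-word classifier's statistics untouched, which is the property the abstract relies on to protect precision.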