Layout Based Spam Filtering

Authors: Claudiu N. Musat

Abstract:

As the volume of information available to applications keeps growing, in fields ranging from medical diagnosis to web search engines, accurate support for similarity comparison becomes an important task. The same holds for spam filtering, where the similarity between known and incoming messages is the basis of the spam/not-spam decision. We present a novel approach to filtering based solely on layout, whose goal is not only to identify spam correctly but also to warn about major emerging threats. We propose a mathematical formulation of the email message layout and, based on it, develop an algorithm that separates different types of emails and finds the new, numerically relevant spam types.

Keywords: Clustering, layout, k-means, spam.
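
The abstract and keywords name layout features and k-means clustering but do not give the formulation itself. The following is only a minimal sketch of how such a pipeline could look, not the paper's method. The line classes, the 40-character threshold, the function names, and the farthest-point initialisation are all assumptions made for illustration.

```python
# Illustrative sketch only: the paper's actual layout formulation is not
# stated in the abstract. All names and thresholds here are hypothetical.
from collections import Counter

# Hypothetical coarse layout classes for a single line of an email body.
CLASSES = ("blank", "url", "short", "long")


def line_class(line, short_limit=40):
    """Map one line to a layout class, ignoring what the words say."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if "http://" in stripped or "https://" in stripped:
        return "url"
    return "short" if len(stripped) <= short_limit else "long"


def layout_vector(body):
    """Fraction of lines that fall in each layout class."""
    lines = body.splitlines() or [""]
    counts = Counter(line_class(line) for line in lines)
    return [counts[c] / len(lines) for c in CLASSES]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    # Deterministic farthest-point initialisation (an assumed choice).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


def cluster_sizes(labels):
    """Count messages per cluster; large new clusters may merit review."""
    return Counter(labels)
```

Under these assumptions, each message is reduced to a four-component vector and clustered without looking at its wording. A cluster that suddenly gathers many messages would be a candidate for a "numerically relevant" new spam type in the abstract's sense.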

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1085253
