Detecting Email Forgery using Random Forests and Naïve Bayes Classifiers

Authors: Emad E. Abdallah, A.F. Otoom, Arwa Saqer, Ola Abu-Aisheh, Diana Omari, Ghadeer Salem


Because email communication has no consistent authentication procedure to guarantee authenticity, we present an investigative approach for detecting forged emails based on Random Forests and Naïve Bayes classifiers. Instead of examining the email headers, we use the body content to extract a distinctive writing style for each possible suspect. Our approach consists of four main steps: (1) the cybercrime investigator extracts a set of effective features, including structural, lexical, linguistic, and syntactic evidence, from previous emails of all possible suspects; (2) the extracted feature vectors are normalized to increase the accuracy rate; (3) the normalized features are used to train the learning engine; and (4) upon receiving the anonymous email (M), we apply the same feature extraction process to produce a feature vector. Finally, using the machine learning classifiers, the email is assigned to the suspect whose writing style most closely matches M. Experimental results on real data sets show the improved performance of the proposed method and its ability to identify authors with a very limited number of features.
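The four steps can be illustrated with a minimal sketch. It is not the authors' implementation: the six features, the `FUNCTION_WORDS` list, the variance floor, and every function name are hypothetical choices made for illustration, and a hand-written Gaussian Naïve Bayes stands in for both classifiers (the Random Forest is omitted to keep the example dependency-free).

```python
import math
import re
from collections import defaultdict

# Hypothetical, deliberately tiny function-word list (an assumption, not from the paper).
FUNCTION_WORDS = ("the", "of", "and", "to", "in")

def extract_features(body):
    """Step 1: map an email body to a small stylometric feature vector."""
    words = re.findall(r"[A-Za-z']+", body)
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    lower = [w.lower() for w in words]
    n_words = max(len(words), 1)
    n_chars = max(len(body), 1)
    return [
        sum(len(w) for w in words) / n_words,               # lexical: mean word length
        len(set(lower)) / n_words,                          # lexical: vocabulary richness
        len(words) / max(len(sentences), 1),                # syntactic: words per sentence
        sum(c in ",;:" for c in body) / n_chars,            # syntactic: punctuation rate
        sum(w in FUNCTION_WORDS for w in lower) / n_words,  # linguistic: function-word rate
        body.count("\n") / n_chars,                         # structural: line-break rate
    ]

def fit_minmax(vectors):
    """Step 2a: learn per-feature minimum and maximum from the training vectors."""
    cols = list(zip(*vectors))
    return [min(c) for c in cols], [max(c) for c in cols]

def apply_minmax(vector, lo, hi):
    """Step 2b: rescale features; constant features map to 0.0."""
    return [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(vector, lo, hi)]

def train_gaussian_nb(vectors, labels, var_floor=1e-3):
    """Step 3: estimate a log prior plus per-feature mean and variance per suspect."""
    by_author = defaultdict(list)
    for v, y in zip(vectors, labels):
        by_author[y].append(v)
    model = {}
    for author, rows in by_author.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # var_floor (an assumed constant) avoids division by zero on tiny samples.
        variances = [sum((x - m) ** 2 for x in c) / len(c) + var_floor
                     for c, m in zip(cols, means)]
        model[author] = (math.log(len(rows) / len(vectors)), means, variances)
    return model

def predict(model, vector):
    """Step 4: return the suspect with the highest log posterior."""
    def log_posterior(author):
        prior, means, variances = model[author]
        return prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(vector, means, variances)
        )
    return max(model, key=log_posterior)

def attribute(emails_by_suspect, anonymous_body):
    """Run all four steps: extract, normalize, train, then classify email M."""
    raw, labels = [], []
    for suspect, bodies in emails_by_suspect.items():
        for body in bodies:
            raw.append(extract_features(body))
            labels.append(suspect)
    lo, hi = fit_minmax(raw)
    model = train_gaussian_nb([apply_minmax(v, lo, hi) for v in raw], labels)
    return predict(model, apply_minmax(extract_features(anonymous_body), lo, hi))
```

Calling `attribute({"suspect_a": [...], "suspect_b": [...]}, body_of_M)` returns the name of the closest-matching suspect. The normalization bounds are learned from the suspects' emails only and then reused for M, so the anonymous email is scaled the same way as the training data.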

Keywords: Cybercrimes, digital investigation, email forensics, anonymous emails, writing style, authorship analysis



