Breast Cancer Survivability Prediction via Classifier Ensemble

Mohamed Al-Badrashiny; Abdelghani Bellaachia

Abstract:

This paper presents a classifier ensemble approach for predicting the survivability of breast cancer patients using the latest version of the Surveillance, Epidemiology, and End Results (SEER) Program database of the National Cancer Institute. The system consists of two main components: feature selection and a classifier ensemble. The feature selection component divides the features in the SEER database into four groups and then searches those groups for the features that maximize the weighted average F-score of a given classification algorithm. The ensemble component uses three different classifiers, each of which models a different set of SEER features chosen by the feature selection module. On top of them, another classifier makes the final decision based on the output decisions and confidence scores of the underlying classifiers. Different classification algorithms were examined; the best setup uses the decision tree, Bayesian network, and Naïve Bayes algorithms for the underlying classifiers and Naïve Bayes for the ensemble step. The system outperforms all published systems to date when evaluated on exactly the same SEER data (1973-2002): it achieves a weighted average F-score of 87.39%, compared to 85.82% and 81.34% for the other published systems. When the data is extended to cover the whole database (1973-2014), the overall weighted average F-score rises to 92.4% on the held-out, unseen test set.
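The abstract reports every result as a weighted average F-score but does not spell out the formula. The sketch below assumes the conventional definition: compute the F1 score of each class separately, then average those scores weighted by the number of true instances of each class. The function name `weighted_f_score`, the class labels, and the toy predictions are illustrative assumptions and are not taken from the paper or from SEER.

```python
from collections import Counter


def weighted_f_score(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Assumed definition (not stated in the abstract): each class's F1 is
    weighted by its share of the true labels.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter(y_true)      # number of true instances per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = 0.0
    for label, n in support.items():
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / n
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += n * f1            # weight the class F1 by its support
    return total / len(y_true)


# Toy labels invented for illustration only (not SEER data).
y_true = ["survived"] * 6 + ["not_survived"] * 4
y_pred = (["survived"] * 5 + ["not_survived"] * 1      # predictions for the 6 survivors
          + ["not_survived"] * 3 + ["survived"] * 1)   # predictions for the other 4
score = weighted_f_score(y_true, y_pred)
print(f"weighted average F-score: {score:.4f}")
```

In this toy case the "survived" class has an F1 of 5/6 with support 6 and the "not_survived" class has an F1 of 3/4 with support 4, so the weighted average is (6 × 5/6 + 4 × 3/4) / 10 = 0.8. Weighting by support keeps the larger class from being under-represented in the summary number, which matters when the two survivability classes are imbalanced.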

Keywords: Classifier ensemble, breast cancer survivability, data mining, SEER.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1123969
