Latent Topic Based Medical Data Classification

Authors: Jian-hua Yeh, Shi-yi Kuo


This paper discusses a classification process for medical data. We use the data set from ACM KDDCup 2008 to demonstrate a classification process based on latent topic discovery. In this data set, the target class and the outliers differ sharply in size: the target class accounts for only 0.6% of the data, while the outliers make up the remaining 99.4%. We use this data set as an example to show how we dealt with such an extremely imbalanced data set using latent topic discovery and noise reduction techniques. Our experiment faces two major challenges: (1) the outliers are extremely widely distributed, and (2) positive samples are far fewer than negative ones. We propose a process flow suited to these issues and obtain a best AUC of 0.98.
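
To illustrate why AUC rather than plain accuracy is the natural metric at a 0.6% positive rate, the following is a minimal sketch and not the authors' implementation. The data are synthetic, and the helper names `auc` and `accuracy` are hypothetical. AUC is computed through the standard rank-sum (Mann-Whitney U) identity, with tied scores given their average rank.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity (ties get average rank)."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores occupying 1-based ranks i+1 .. j.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum += avg_rank * positives_in_run
        i = j
    u = rank_sum - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)


# Synthetic data with the same class ratio as in the abstract:
# 6 positives (0.6%) and 994 negatives (99.4%).
labels = [1] * 6 + [0] * 994

# A trivial model that calls every sample negative.
trivial_preds = [0] * len(labels)
trivial_scores = [0.0] * len(labels)

print("accuracy of trivial model:", accuracy(trivial_preds, labels))
print("AUC of trivial model:", auc(trivial_scores, labels))
```

On this synthetic set the trivial model reaches 99.4% accuracy while its AUC stays at 0.5, the chance level, so accuracy alone would say little about a classifier on data this imbalanced.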

Keywords: classification, latent topics, outlier adjustment, feature scaling



