Text Mining Technique for Data Mining Application

M. Govindarajan

Abstract:

Applying knowledge discovery techniques to unstructured text is termed knowledge discovery in text (KDT), text data mining, or text mining. The decision tree approach is most useful in classification problems. With this technique, a tree is constructed to model the classification process. There are two basic steps: building the tree and applying the tree to the database. This paper describes a proposed C5.0 classifier that applies rulesets, cross-validation, and boosting to the original C5.0 in order to reduce the error ratio. The feasibility and benefits of the proposed approach are demonstrated on a medical data set, hypothyroid. It is shown that the performance of a classifier on the training cases from which it was constructed gives a poor estimate of its accuracy on new cases. Accuracy can instead be estimated by sampling or by using a separate test file; either way, the classifier is evaluated on cases that were not used to build it. Such an estimate is reliable only when the numbers of cases used to build and to evaluate the classifier are both large. If the cases in hypothyroid.data and hypothyroid.test were shuffled and divided into a new 2772-case training set and a 1000-case test set, C5.0 might construct a different classifier with a lower or higher error rate on the test cases. An important feature of See5 is its ability to generate classifiers called rulesets. The ruleset has an error rate of 0.5% on the test cases. The standard errors of the means provide an estimate of the variability of results. One way to get a more reliable estimate of predictive accuracy is f-fold cross-validation. The error rate of a classifier produced from all the cases is estimated as the ratio of the total number of errors on the hold-out cases to the total number of cases. The Boost option with x trials instructs See5 to construct up to x classifiers in this manner. Trials over numerous datasets, large and small, show that on average 10-classifier boosting reduces the error rate for test cases by about 25%.
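
The f-fold cross-validation estimate described above (total errors on the hold-out cases divided by the total number of cases) and the standard error of the mean over repeated runs can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not call C5.0 or See5: the learner build_stump is a hypothetical one-split stand-in, the function names cv_error and repeated_cv are assumptions, and the toy data is invented for illustration.

```python
import random
import statistics

def build_stump(train):
    """Stand-in learner (NOT C5.0): a one-split tree on one numeric feature."""
    best = None
    for t in sorted({x for x, _ in train}):
        for low, high in ((0, 1), (1, 0)):
            # Training errors if we predict `low` for x <= t and `high` otherwise.
            errors = sum(1 for x, y in train if (low if x <= t else high) != y)
            if best is None or errors < best[0]:
                best = (errors, t, low, high)
    _, t, low, high = best
    return lambda x: low if x <= t else high

def cv_error(cases, f, build, seed=0):
    """f-fold cross-validation: total hold-out errors / total cases."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::f] for i in range(f)]  # f blocks of near-equal size
    total_errors = 0
    for i, hold_out in enumerate(folds):
        # Build on every block except the hold-out block, then test on it.
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        predict = build(train)
        total_errors += sum(1 for x, y in hold_out if predict(x) != y)
    return total_errors / len(cases)

def repeated_cv(cases, f, build, repeats):
    """Mean error over several cross-validations and its standard error.

    Requires repeats >= 2 so that a sample standard deviation exists.
    """
    rates = [cv_error(cases, f, build, seed=s) for s in range(repeats)]
    mean = statistics.mean(rates)
    std_err = statistics.stdev(rates) / len(rates) ** 0.5
    return mean, std_err

# Toy data (invented for illustration): the label is 1 when the feature is >= 10.
cases = [(x, int(x >= 10)) for x in range(20)]
print(cv_error(cases, f=5, build=build_stump))
print(repeated_cv(cases, f=5, build=build_stump, repeats=10))
```

Each case is held out exactly once, so every prediction that is counted comes from a classifier that never saw that case during building. Repeating the procedure with different shuffles gives the spread that the standard error summarizes.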

Keywords: C5.0, Error Ratio, text mining, training data, test data.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1082531
