Learning to Order Terms: Supervised Interestingness Measures in Terminology Extraction

Authors: Jérôme Azé, Mathieu Roche, Yves Kodratoff, Michèle Sebag


Term Extraction, a key data preparation step in Text Mining, extracts the terms, i.e. relevant collocations of words, attached to specific concepts (e.g. genetic-algorithms and decision-trees are terms associated with the concept "Machine Learning"). In this paper, the task of extracting interesting collocations is achieved through a supervised learning algorithm, exploiting a few collocations manually labelled as interesting/not interesting. From these examples, the ROGER algorithm learns a numerical function, inducing a ranking on the collocations. This ranking is optimized using genetic algorithms, maximizing the Area Under the ROC Curve, which summarizes the trade-off between the true positive and false positive rates. This approach uses a particular representation for the word collocations, namely the vector of values of the standard statistical interestingness measures attached to each collocation. As this representation is general (over corpora and natural languages), generality tests were performed by applying the ranking function learned from an English corpus in Biology to a French corpus of Curriculum Vitae, and vice versa, showing good robustness of the approach compared to the state-of-the-art Support Vector Machine (SVM).

Keywords: Text-mining, Terminology Extraction, Evolutionary algorithm, ROC Curve.
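
The idea summarized in the abstract (rank collocations with a learned numerical function, chosen to maximize the Area Under the ROC Curve) can be illustrated with a minimal sketch. It is not the ROGER implementation. It assumes a linear scoring function over the vector of interestingness measures, and it swaps the genetic algorithm for a much simpler (1+1) evolution strategy. The function names, the toy vectors, and the parameters `generations`, `sigma`, and `seed` are all hypothetical.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve, computed as the Wilcoxon-Mann-Whitney
    statistic: the fraction of (positive, negative) pairs in which the
    positive example gets the higher score. Ties count as half a pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def score(weights, vector):
    """Linear scoring function (an assumption made for this sketch)."""
    return sum(w * x for w, x in zip(weights, vector))


def evolve_weights(vectors, labels, generations=200, sigma=0.3, seed=0):
    """(1+1) evolution strategy: mutate the current weights with Gaussian
    noise and keep the child whenever its AUC is not worse."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    best = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    best_auc = auc([score(best, v) for v in vectors], labels)
    for _ in range(generations):
        child = [w + rng.gauss(0.0, sigma) for w in best]
        child_auc = auc([score(child, v) for v in vectors], labels)
        if child_auc >= best_auc:
            best, best_auc = child, child_auc
    return best, best_auc


# Toy data: each row is a made-up vector of two interestingness measures
# for one collocation; label 1 means "interesting", 0 means "not interesting".
vectors = [[0.9, 0.2], [0.8, 0.4], [0.7, 0.1], [0.3, 0.5], [0.2, 0.3], [0.1, 0.6]]
labels = [1, 1, 1, 0, 0, 0]

weights, train_auc = evolve_weights(vectors, labels)
print("weights:", weights)
print("training AUC:", train_auc)
```

Because the fitness depends only on the ordering of the scores, any monotone rescaling of the learned function yields the same AUC. That is why the criterion suits a ranking task, and it is consistent with reusing a learned function across corpora whose measure values may be scaled differently.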


