Performance Analysis of MT Evaluation Measures and Test Suites

Authors: Yao Jian-Min, Lv Qiang, Zhang Jing


Many measures have been proposed for machine translation evaluation (MTE), but little research has examined the performance of the MTE methods themselves. This paper is an effort toward such an analysis. A general framework is proposed for describing an MTE measure and its test suite, covering whether the automatic measure is consistent with human evaluation, whether results from different measures or test suites agree with one another, whether the content of the test suite is suitable for performance evaluation, the degree of difficulty of the test suite and its influence on MTE results, and the relationship between the significance of MTE results and the size of the test suite. To clarify the framework, several experimental results are analyzed, relating human evaluation, BLEU evaluation, and typological MTE. A visualization method is introduced to present the results more clearly. The study aims to aid test suite construction and method selection in MTE practice.

Keywords: Machine translation, natural language processing, visualization.
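The central consistency question in the framework, whether an automatic measure ranks MT systems the way human judges do, is commonly quantified with a rank correlation. The following Python sketch illustrates the idea with hypothetical per-system scores (not data from the paper): score a set of systems with an automatic measure such as BLEU and with human adequacy judgments, then compute the Spearman correlation between the two rankings.

# A minimal sketch, assuming hypothetical scores, of the measure-vs-human
# consistency check: compare the system ranking induced by an automatic
# measure (e.g., BLEU) with the ranking induced by human judgments.

def ranks(scores):
    # Rank the scores from 1 (lowest) to n (highest); ties broken by position.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    # Spearman's rho via the classic formula 1 - 6*sum(d^2) / (n*(n^2 - 1)),
    # which assumes no tied scores.
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for five MT systems on the same test suite.
bleu_scores = [0.21, 0.35, 0.28, 0.40, 0.18]   # automatic measure
human_scores = [2.9, 3.1, 3.8, 4.2, 2.5]       # mean human adequacy, 1-5 scale

print("Spearman rho (BLEU vs. human): %.2f" % spearman(bleu_scores, human_scores))
# Prints 0.90 for these values; a rho near 1 means the automatic measure
# reproduces the human ranking of the systems.

The same comparison can be repeated across different measures or different test suites; disagreement between the resulting correlations is exactly the kind of inconsistency the proposed framework is meant to expose.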


