Data Gathering and Analysis for Arabic Historical Documents

Ali Dulla

Abstract:

This paper introduces a new dataset, and the methodology used to generate it, based on a wide range of historical Arabic documents containing clean data with simple, homogeneous page layouts. The experiments are carried out on printed and handwritten documents obtained from major libraries, including the Qatar Digital Library, the British Library, and the Library of Congress. We have gathered and annotated 150 archival document images from different locations and time periods; the documents date from the 17th to the 19th century. The dataset comprises differing page layouts and degradations that challenge text line segmentation methods. Ground truth is produced using the Aletheia tool by PRImA and stored in an XML representation, in the PAGE (Page Analysis and Ground truth Elements) format. The dataset will be made readily available to researchers worldwide for research into the challenges posed by historical Arabic documents, such as geometric correction.
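As an illustration of how ground truth stored in PAGE XML might be consumed for text line segmentation experiments, the following is a minimal sketch in Python. It assumes the 2013-or-later PAGE convention in which each TextLine element carries a Coords child with a points attribute; the helper names (local_name, read_text_lines) and the sample document are hypothetical and are not taken from the dataset or from the paper's own tooling.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def read_text_lines(page_xml):
    """Return a list of (line_id, polygon) pairs from a PAGE XML string.

    Assumption: each TextLine has a Coords child whose points attribute
    is formatted as "x1,y1 x2,y2 ...".
    """
    root = ET.fromstring(page_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        for child in elem:
            if local_name(child.tag) == "Coords" and child.get("points"):
                polygon = [
                    tuple(int(v) for v in pair.split(","))
                    for pair in child.get("points").split()
                ]
                lines.append((elem.get("id"), polygon))
                break
    return lines


# Hypothetical, hand-written sample; not an actual file from the dataset.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="sample.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,400 100,400"/>
      <TextLine id="l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
      </TextLine>
      <TextLine id="l2">
        <Coords points="100,250 900,250 900,350 100,350"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    for line_id, polygon in read_text_lines(SAMPLE):
        print(line_id, polygon)
```

Matching on the local element name rather than a fixed namespace is a design choice for this sketch, since PAGE files produced by different Aletheia releases may declare different schema versions.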

Keywords: Dataset production, ground truth production, historical documents, arbitrary warping, geometric correction.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.2643822
