Search results for: Persian/Arabic document
401 Persian/Arabic Document Segmentation Based On Pyramidal Image Structure
Authors: Seyyed Yasser Hashemi, Khalil Monfaredi
Abstract:
Automatic transformation of paper documents into electronic documents requires document segmentation at the first stage. However, some parameters restrictions such as variations in character font sizes, different text line spacing, and also not uniform document layout structures altogether have made it difficult to design a general-purpose document layout analysis algorithm for many years. Thus in most previously reported methods it is inevitable to include these parameters. This problem becomes excessively acute and severe, especially in Persian/Arabic documents. Since the Persian/Arabic scripts differ considerably from the English scripts, most of the proposed methods for the English scripts do not render good results for the Persian scripts. In this paper, we present a novel parameter-free method for segmenting the Persian/Arabic document images which also works well for English scripts. This method segments the document image into maximal homogeneous regions and identifies them as texts and non-texts based on a pyramidal image structure. In other words the proposed method is capable of document segmentation without considering the character font sizes, text line spacing, and document layout structures. This algorithm is examined for 150 Arabic/Persian and English documents and document segmentation process are done successfully for 96 percent of documents.
Keywords: Persian/Arabic document, document segmentation, Pyramidal Image Structure, skew detection and correction.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1765400 Ultra High Speed Approach for Document Skew Detection and Correction Based On Centre of Gravity
Authors: Seyyed Yasser Hashemi
Abstract:
Skew detection and correction (SDC) has a direct effect in efficiency and exactitude of documents’ segmentation and analysis and thus is considered as a very important step in documents’ analysis field. Skew is a major problem in documents’ analysis for every language. For Arabic/Persian document scripts this problem is more severe because of special features of these languages. In this paper an efficient and fast algorithm for Document Skew Detection (DSD) based on the concept of segmentation and Center of Gravity (COG) is proposed. This algorithm is examined for 150 Arabic/Persian and English documents and SDC process are done successfully for 93 percent of documents with error rate of less than 1°. This algorithm shows better results for English documents compared to Arabic/Persian documents. The proposed method is also represents favorable results for handwritten, printed and also complicated documents such as newspapers and journals even with very low quality and resolution.
Keywords: Arabic/Persian document, Baseline, Centre of gravity, Document segmentation, Skew detection and correction.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1911399 Simultaneous Segmentation and Recognition of Arabic Characters in an Unconstrained On-Line Cursive Handwritten Document
Authors: Randa I. Elanwar, Mohsen A. Rashwan, Samia A. Mashali
Abstract:
The last two decades witnessed some advances in the development of an Arabic character recognition (CR) system. Arabic CR faces technical problems not encountered in any other language that make Arabic CR systems achieve relatively low accuracy and retards establishing them as market products. We propose the basic stages towards a system that attacks the problem of recognizing online Arabic cursive handwriting. Rule-based methods are used to perform simultaneous segmentation and recognition of word portions in an unconstrained cursively handwritten document using dynamic programming. The output of these stages is in the form of a ranked list of the possible decisions. A new technique for text line separation is also used.
Keywords: Arabic handwriting, character recognition, cursive handwriting, on-line recognition.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1899398 The Contribution of Translation to Arabic and Islamic Civilization during the Golden Age: 661-1258
Authors: Smail Hadj Mahammed
Abstract:
Translation is not merely a process of conveying the meaning from one particular language into another to overcome language barriers and ensure a good understanding; it is also a work of civilization and progress. Without the translation of Greek, Indian and Persian works, Arabic and Islamic Civilization would not have taken off, and without the translations of Arabic works into Latin, and then into European languages, the scientific and technological revolution of the modern world would not have taken place. In this context, the present paper seeks to investigate how the translation movement contributed to the Arabic and Islamic Civilizations during the Golden Age. The paper consists of three major parts: the first part provides a brief historical overview of the translation movement during the golden age, which witnessed two important eras: the Umayyad and Abbasid eras. The second part shows the main reasons why translation was a prominent cultural activity during the Golden Age and why it gained great interest from the Arabs. The last part highlights the constructive contribution of translation to the Arabic and Islamic Civilization during the period (661–1258). The results demonstrate that Arabic translation movement during the Golden Age had significantly assisted in enriching the Arabic and Islamic civilizations considering the major and important scientific works of old Greek, Indian and Persian civilizations which had been absorbed.
Keywords: Arabic and Islamic civilization, contribution, golden age, translation.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 230397 Automatic Building an Extensive Arabic FA Terms Dictionary
Authors: El-Sayed Atlam, Masao Fuketa, Kazuhiro Morita, Jun-ichi Aoe
Abstract:
Field Association (FA) terms are a limited set of discriminating terms that give us the knowledge to identify document fields which are effective in document classification, similar file retrieval and passage retrieval. But the problem lies in the lack of an effective method to extract automatically relevant Arabic FA Terms to build a comprehensive dictionary. Moreover, all previous studies are based on FA terms in English and Japanese, and the extension of FA terms to other language such Arabic could be definitely strengthen further researches. This paper presents a new method to extract, Arabic FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules and corpora comparison. Experimental evaluation is carried out for 14 different fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhyah news selected average of 2,825 FA Terms (single and compound) per field. From the experimental results, recall and precision are 84% and 79% respectively. Therefore, this method selects higher number of relevant Arabic FA Terms at high precision and recall.
Keywords: Arabic Field Association Terms, information extraction, document classification, information retrieval.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1733396 Comparing Arabic and Latin Handwritten Digits Recognition Problems
Authors: Sherif Abdelazeem
Abstract:
A comparison between the performance of Latin and Arabic handwritten digits recognition problems is presented. The performance of ten different classifiers is tested on two similar Arabic and Latin handwritten digits databases. The analysis shows that Arabic handwritten digits recognition problem is easier than that of Latin digits. This is because the interclass difference in case of Latin digits is smaller than in Arabic digits and variances in writing Latin digits are larger. Consequently, weaker yet fast classifiers are expected to play more prominent role in Arabic handwritten digits recognition.Keywords: Handwritten recognition, Arabic recognition, Digits recognition, Document recognition
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1985395 Support Vector Machine for Persian Font Recognition
Abstract:
In this paper we examine the use of global texture analysis based approaches for the purpose of Persian font recognition in machine-printed document images. Most existing methods for font recognition make use of local typographical features and connected component analysis. However derivation of such features is not an easy task. Gabor filters are appropriate tools for texture analysis and are motivated by human visual system. Here we consider document images as textures and use Gabor filter responses for identifying the fonts. The method is content independent and involves no local feature analysis. Two different classifiers Weighted Euclidean Distance and SVM are used for the purpose of classification. Experiments on seven different type faces and four font styles show average accuracy of 85% with WED and 82% with SVM classifier over typefacesKeywords: Persian font recognition, support vector machine, gabor filter.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1708394 A Novel Arabic Text Steganography Method Using Letter Points and Extensions
Authors: Adnan Abdul-Aziz Gutub, Manal Mohammad Fattani
Abstract:
This paper presents a new steganography approach suitable for Arabic texts. It can be classified under steganography feature coding methods. The approach hides secret information bits within the letters benefiting from their inherited points. To note the specific letters holding secret bits, the scheme considers the two features, the existence of the points in the letters and the redundant Arabic extension character. We use the pointed letters with extension to hold the secret bit 'one' and the un-pointed letters with extension to hold 'zero'. This steganography technique is found attractive to other languages having similar texts to Arabic such as Persian and Urdu.Keywords: Arabic text, Cryptography, Feature coding, Information security, Text steganography, Text watermarking.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3503393 A New Pattern for Handwritten Persian/Arabic Digit Recognition
Authors: A. Harifi, A. Aghagolzadeh
Abstract:
The main problem for recognition of handwritten Persian digits using Neural Network is to extract an appropriate feature vector from image matrix. In this research an asymmetrical segmentation pattern is proposed to obtain the feature vector. This pattern can be adjusted as an optimum model thanks to its one degree of freedom as a control point. Since any chosen algorithm depends on digit identity, a Neural Network is used to prevail over this dependence. Inputs of this Network are the moment of inertia and the center of gravity which do not depend on digit identity. Recognizing the digit is carried out using another Neural Network. Simulation results indicate the high recognition rate of 97.6% for new introduced pattern in comparison to the previous models for recognition of digits.
Keywords: Pattern recognition, Persian digits, NeuralNetwork.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1676392 Applying Clustering of Hierarchical K-means-like Algorithm on Arabic Language
Authors: Sameh H. Ghwanmeh
Abstract:
In this study a clustering technique has been implemented which is K-Means like with hierarchical initial set (HKM). The goal of this study is to prove that clustering document sets do enhancement precision on information retrieval systems, since it was proved by Bellot & El-Beze on French language. A comparison is made between the traditional information retrieval system and the clustered one. Also the effect of increasing number of clusters on precision is studied. The indexing technique is Term Frequency * Inverse Document Frequency (TF * IDF). It has been found that the effect of Hierarchical K-Means Like clustering (HKM) with 3 clusters over 242 Arabic abstract documents from the Saudi Arabian National Computer Conference has significant results compared with traditional information retrieval system without clustering. Additionally it has been found that it is not necessary to increase the number of clusters to improve precision more.
Keywords: Hierarchical K-mean like clustering (HKM), Kmeans, cluster centroids, initial partition, and document distances
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2570391 Abai Kunanbayev's Role in Enrichment of the Kazakh Language
Authors: Y.M. Paltore, B.N. Zhubatova, A.A. Mustafayeva
Abstract:
Abai Kunanbayev is famous for being enlightener, composer, interpreter, social agent, philosopher, reformer, who wanted to enrich Kazakh literature by emergence with Russian and European culture, and also as a founder of Kazakh written literary language. Abai Kunanbayev was born in 1845 in East Kazakhstan area and passed away in 1904 in his hometown. His oeuvre absorbed and reflected all changes in the life of Kazakh society of the second half of XIX century. Because ХІХ century, especially its second half, was an important transition period for Kazakhstan, which radically changed traditional way of Kazakh society and predetermined further development in consequence of activation of Russian colonial policy and approval of commodity-money relations in Steppe Land.Abai Kunanbayev, besides Arabic and Persian common words and loanwords from Quran in his words of edification, had used a lot of words of Arabic, Persian, Latin, Russian, Nogai, Shaghatai, Polish, Greek, Turkish, which are used in the Kazakh language.Keywords: Abai Kunanbayev, the Kazakh, Russian languages, literature
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2864390 Persian Printed Numeral Characters Recognition Using Geometrical Central Moments and Fuzzy Min-Max Neural Network
Authors: Hamid Reza Boveiri
Abstract:
In this paper, a new proposed system for Persian printed numeral characters recognition with emphasis on representation and recognition stages is introduced. For the first time, in Persian optical character recognition, geometrical central moments as character image descriptor and fuzzy min-max neural network for Persian numeral character recognition has been used. Set of different experiments on binary images of regular, translated, rotated and scaled Persian numeral characters has been done and variety of results has been presented. The best result was 99.16% correct recognition demonstrating geometrical central moments and fuzzy min-max neural network are adequate for Persian printed numeral character recognition.Keywords: Fuzzy min-max neural network, geometrical centralmoments, optical character recognition, Persian digits recognition, Persian printed numeral characters recognition.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1724389 A Novel Approach to Persian Online Hand Writing Recognition
Authors: Ramin Halavati, Mansour Jamzad, Mahdieh Soleymani
Abstract:
Persian (Farsi) script is totally cursive and each character is written in several different forms depending on its former and later characters in the word. These complexities make automatic handwriting recognition of Persian a very hard problem and there are few contributions trying to work it out. This paper presents a novel practical approach to online recognition of Persian handwriting which is based on representation of inputs and patterns with very simple visual features and comparison of these simple terms. This recognition approach is tested over a set of Persian words and the results have been quite acceptable when the possible words where unknown and they were almost all correct in cases that the words where chosen from a prespecified list.
Keywords: Image Processing, Pattern Recognition.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1329388 Morphological Analysis of English L1-Persian L2 Adult Learners’ Interlanguage: From the Perspective of SLA Variation
Authors: Maassoumeh Bemani Naeini
Abstract:
Studies on interlanguage have long been engaged in describing the phenomenon of variation in SLA. Pursuing the same goal and particularly addressing the role of linguistic features, this study describes the use of Persian morphology in the interlanguage of two adult English-speaking learners of Persian L2. Taking the general approach of a combination of contrastive analysis, error analysis and interlanguage analysis, this study focuses on the identification and prediction of some possible instances of transfer from English L1 to Persian L2 across six elicitation tasks aiming to investigate whether any of contextual features may variably influence the learners’ order of morpheme accuracy in the areas of copula, possessives, articles, demonstratives, plural form, personal pronouns, and genitive cases. Results describe the existence of task variation in the interlanguage system of Persian L2 learners.Keywords: English L1, Interlanguage Analysis, Persian L2, SLA variation.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1312387 Arabic Light Stemmer for Better Search Accuracy
Authors: Sahar Khedr, Dina Sayed, Ayman Hanafy
Abstract:
Arabic is one of the most ancient and critical languages in the world. It has over than 250 million Arabic native speakers and more than twenty countries having Arabic as one of its official languages. In the past decade, we have witnessed a rapid evolution in smart devices, social network and technology sector which led to the need to provide tools and libraries that properly tackle the Arabic language in different domains. Stemming is one of the most crucial linguistic fundamentals. It is used in many applications especially in information extraction and text mining fields. The motivation behind this work is to enhance the Arabic light stemmer to serve the data mining industry and leverage it in an open source community. The presented implementation works on enhancing the Arabic light stemmer by utilizing and enhancing an algorithm that provides an extension for a new set of rules and patterns accompanied by adjusted procedure. This study has proven a significant enhancement for better search accuracy with an average 10% improvement in comparison with previous works.Keywords: Arabic data mining, Arabic Information extraction, Arabic Light stemmer, Arabic stemmer.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1496386 The Using of Rasch-Model in Validating the Arabic Version of Multiple Intelligence Development Assessment Scale (MIDAS)
Authors: Saher Ali Al-Sabbah, See Ching Mey, Ong Saw Lan
Abstract:
This article addresses the procedures to validate the Arabic version of Multiple Intelligence Development Assessment Scale (MIDAS). The content validity was examined based on the experts- judgments on the MIDAS-s items in the Arabic version. The content of eleven items in the Arabic version of MIDAS was modified to match the Arabic context. Then a translation from original English version of MIDAS into Arabic language was performed. The reliability of the Arabic MIDAS was calculated based on test and retest method and found to be 0.85 for the overall MIDAS and for the different subscales ranging between 0.78 - 0.87. The examination of construct validity for the overall Arabic MIDAS and its subscales was established by using Winsteps program version 6 based on Rasch model in order to fit the items into the Arabic context. The findings indicated that, the eight subscales in Arabic version of MIDAS scale have a unidimensionality, and the total number of kept items in the overall scale is 108 items.
Keywords: Rasch-Model, validation, multiple intelligence, and MIDAS scale.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1891385 Arabic Word Semantic Similarity
Authors: Faaza A, Almarsoomi, James D, O'Shea, Zuhair A, Bandar, Keeley A, Crockett
Abstract:
This paper is concerned with the production of an Arabic word semantic similarity benchmark dataset. It is the first of its kind for Arabic which was particularly developed to assess the accuracy of word semantic similarity measurements. Semantic similarity is an essential component to numerous applications in fields such as natural language processing, artificial intelligence, linguistics, and psychology. Most of the reported work has been done for English. To the best of our knowledge, there is no word similarity measure developed specifically for Arabic. In this paper, an Arabic benchmark dataset of 70 word pairs is presented. New methods and best possible available techniques have been used in this study to produce the Arabic dataset. This includes selecting and creating materials, collecting human ratings from a representative sample of participants, and calculating the overall ratings. This dataset will make a substantial contribution to future work in the field of Arabic WSS and hopefully it will be considered as a reference basis from which to evaluate and compare different methodologies in the field.
Keywords: Arabic categories, benchmark dataset, semantic similarity, word pair, stimulus Arabic words
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3106384 Data Gathering and Analysis for Arabic Historical Documents
Authors: Ali Dulla
Abstract:
This paper introduces a new dataset (and the methodology used to generate it) based on a wide range of historical Arabic documents containing clean data simple and homogeneous-page layouts. The experiments are implemented on printed and handwritten documents obtained respectively from some important libraries such as Qatar Digital Library, the British Library and the Library of Congress. We have gathered and commented on 150 archival document images from different locations and time periods. It is based on different documents from the 17th-19th century. The dataset comprises differing page layouts and degradations that challenge text line segmentation methods. Ground truth is produced using the Aletheia tool by PRImA and stored in an XML representation, in the PAGE (Page Analysis and Ground truth Elements) format. The dataset presented will be easily available to researchers world-wide for research into the obstacles facing various historical Arabic documents such as geometric correction of historical Arabic documents.Keywords: Dataset production, ground truth production, historical documents, arbitrary warping, geometric correction.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 865383 A Persian OCR System using Morphological Operators
Authors: M. Salmani Jelodar, M.J. Fadaeieslam, N. Mozayani, M. Fazeli
Abstract:
Optical Character Recognition (OCR) is a very old and of great interest in pattern recognition field. In this paper we introduce a very powerful approach to recognize Persian text. We have used morphological operators, especially Hit/Miss operator to descript each sub-word and by using a template matching approach we have tried to classify generated description. We used just one font in two different sizes to verify our approach. We achieved a very good rate, up to 99.9%.
Keywords: A Persian Optical Character Recognition.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2308382 Query Reformulation Guided by External Resource for Information Retrieval
Authors: Mohammed El Amine Abderrahim
Abstract:
Reformulating the user query is a technique that aims to improve the performance of an Information Retrieval System (IRS) in terms of precision and recall. This paper tries to evaluate the technique of query reformulation guided by an external resource for Arabic texts. To do this, various precision and recall measures were conducted and two corpora with different external resources like Arabic WordNet (AWN) and the Arabic Dictionary (thesaurus) of Meaning (ADM) were used. Examination of the obtained results will allow us to measure the real contribution of this reformulation technique in improving the IRS performance.
Keywords: Arabic NLP, Arabic Information Retrieval, Arabic WordNet, Query Expansion.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1401381 Persian Printed Numerals Classification Using Extended Moment Invariants
Authors: Hamid Reza Boveiri
Abstract:
Classification of Persian printed numeral characters has been considered and a proposed system has been introduced. In representation stage, for the first time in Persian optical character recognition, extended moment invariants has been utilized as characters image descriptor. In classification stage, four different classifiers namely minimum mean distance, nearest neighbor rule, multi layer perceptron, and fuzzy min-max neural network has been used, which first and second are traditional nonparametric statistical classifier. Third is a well-known neural network and forth is a kind of fuzzy neural network that is based on utilizing hyperbox fuzzy sets. Set of different experiments has been done and variety of results has been presented. The results showed that extended moment invariants are qualified as features to classify Persian printed numeral characters.Keywords: Extended moment invariants, optical characterrecognition, Persian numerals classification.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1918380 Entropy Based Data Hiding for Document Images
Authors: Swetha Kurup, Sridhar G., Sridhar V.
Abstract:
In this paper we present a novel technique for data hiding in binary document images. We use the concept of entropy in order to identify document specific least distortive areas throughout the binary document image. The document image is treated as any other image and the proposed method utilizes the standard document characteristics for the embedding process. Proposed method minimizes perceptual distortion due to embedding and allows watermark extraction without the requirement of any side information at the decoder end.Keywords: Entropy, Steganography, Watermarking.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1529379 A New Recognition Scheme for Machine- Printed Arabic Texts based on Neural Networks
Authors: Z. Shaaban
Abstract:
This paper presents a new approach to tackle the problem of recognizing machine-printed Arabic texts. Because of the difficulty of recognizing cursive Arabic words, the text has to be normalized and segmented to be ready for the recognition stage. The new scheme for recognizing Arabic characters depends on multiple parallel neural networks classifier. The classifier has two phases. The first phase categories the input character into one of eight groups. The second phase classifies the character into one of the Arabic character classes in the group. The system achieved high recognition rate.
Keywords: Neural Networks, character recognition, feature extraction, multiple networks, Arabic text.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1476378 Recognition-based Segmentation in Persian Character Recognition
Authors: Mohsen Zand, Ahmadreza Naghsh Nilchi, S. Amirhassan Monadjemi
Abstract:
Optical character recognition of cursive scripts presents a number of challenging problems in both segmentation and recognition processes in different languages, including Persian. In order to overcome these problems, we use a newly developed Persian word segmentation method and a recognition-based segmentation technique to overcome its segmentation problems. This method is robust as well as flexible. It also increases the system-s tolerances to font variations. The implementation results of this method on a comprehensive database show a high degree of accuracy which meets the requirements for commercial use. Extended with a suitable pre and post-processing, the method offers a simple and fast framework to develop a full OCR system.Keywords: OCR, Persian, Recognition, Segmentation.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1839377 Review of the Characteristics of Mahan Garden:One Type of Persian Gardens
Authors: Ladan Tajaddini
Abstract:
Iranians- imagination of heaven, which is the reward of a person-s good deeds during their life, has shown itself in pleasant and green gardens where earthly gardens were made as representations of paradise. Iranians are also quite interested in making their earthly gardens and plantations around their buildings. With Iran-s hot and dry climate with a lack of sufficient water for plantation coverage, it becomes noticeable how important it is to Iranians- art in making gardens. This study, with regard to examples, documents and library studies, investigates the characteristics of Persian gardens. The result shows that elements such as soil, water, plants and layout have been used in forming a unique style of Persian gardens. Bagh-e Shah Zadeh Mahan (Mahan prince garden) is a typical example and has been carefully studied. In this paper I try to investigate and evaluate the characteristics of a Persian garden by means of a descriptive approach.Keywords: environmental planning, Persian garden, landscape, shah zadeh garden, soil and water, gardening.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2938376 A New Approach for Flexible Document Categorization
Authors: Jebari Chaker, Ounelli Habib
Abstract:
In this paper we propose a new approach for flexible document categorization according to the document type or genre instead of topic. Our approach implements two homogenous classifiers: contextual classifier and logical classifier. The contextual classifier is based on the document URL, whereas, the logical classifier use the logical structure of the document to perform the categorization. The final categorization is obtained by combining contextual and logical categorizations. In our approach, each document is assigned to all predefined categories with different membership degrees. Our experiments demonstrate that our approach is best than other genre categorization approaches.
Keywords: Categorization, combination, flexible, logicalstructure, genre, category, URL.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1482375 An Enhanced Support Vector Machine-Based Approach for Sentiment Classification of Arabic Tweets of Different Dialects
Authors: Gehad S. Kaseb, Mona F. Ahmed
Abstract:
Arabic Sentiment Analysis (SA) is one of the most common research fields with many open areas. This paper proposes different pre-processing steps and a modified methodology to improve the accuracy using normal Support Vector Machine (SVM) classification. The paper works on two datasets, Arabic Sentiment Tweets Dataset (ASTD) and Extended Arabic Tweets Sentiment Dataset (Extended-ATSD), which are publicly available for academic use. The results show that the classification accuracy approaches 86%.
Keywords: Arabic, hybrid classification, sentiment analysis, tweets.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 474374 Optimal Document Archiving and Fast Information Retrieval
Authors: Hazem M. El-Bakry, Ahmed A. Mohammed
Abstract:
In this paper, an intelligent algorithm for optimal document archiving is presented. It is kown that electronic archives are very important for information system management. Minimizing the size of the stored data in electronic archive is a main issue to reduce the physical storage area. Here, the effect of different types of Arabic fonts on electronic archives size is discussed. Simulation results show that PDF is the best file format for storage of the Arabic documents in electronic archive. Furthermore, fast information detection in a given PDF file is introduced. Such approach uses fast neural networks (FNNs) implemented in the frequency domain. The operation of these networks relies on performing cross correlation in the frequency domain rather than spatial one. It is proved mathematically and practically that the number of computation steps required for the presented FNNs is less than that needed by conventional neural networks (CNNs). Simulation results using MATLAB confirm the theoretical computations.Keywords: Information Storage and Retrieval, Electronic Archiving, Fast Information Detection, Cross Correlation, Frequency Domain.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1585373 Echo State Networks for Arabic Phoneme Recognition
Authors: Nadia Hmad, Tony Allen
Abstract:
This paper presents an ESN-based Arabic phoneme recognition system trained with supervised, forced and combined supervised/forced supervised learning algorithms. Mel-Frequency Cepstrum Coefficients (MFCCs) and Linear Predictive Code (LPC) techniques are used and compared as the input feature extraction technique. The system is evaluated using 6 speakers from the King Abdulaziz Arabic Phonetics Database (KAPD) for Saudi Arabia dialectic and 34 speakers from the Center for Spoken Language Understanding (CSLU2002) database of speakers with different dialectics from 12 Arabic countries. Results for the KAPD and CSLU2002 Arabic databases show phoneme recognition performances of 72.31% and 38.20% respectively.
Keywords: Arabic phonemes recognition, echo state networks (ESNs), neural networks (NNs), supervised learning.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2409372 A Study on Explicitation Strategies Employed in Persian Subtitling of English Crime Movies
Authors: Hossein Heidari Tabrizi, Azizeh Chalak, Hossein Enayat
Abstract:
The present study seeks to investigate the application of expansion strategy in Persian subtitles of English crime movies. More precisely, this study aims at classifying the different types of expansion used in subtitles as well as investigating the appropriateness or inappropriateness of the application of each type. To achieve this end, three movies; namely, The Net (1995), Contact (1997) and Mission Impossible 2 (2000), available with Persian subtitles, were selected for the study. To collect the data, the above mentioned movies were watched and those parts of the Persian subtitles in which expansion had been used were identified and extracted along with their English dialogs. Then, the extracted Persian subtitles were classified based on the reason that led to expansion in each case. Next, the appropriateness or inappropriateness of using expansion in the extracted Persian subtitles was descriptively investigated. Finally, an equivalent not containing any expansion was proposed for those cases in which the meaning could be fully transferred without this strategy. The findings of the study indicated that the reasons range from explicitation (explicitation of visual, co-textual and contextual information), mistranslation and paraphrasing to the preferences of subtitlers. Furthermore, it was found that the employment of expansion strategy was inappropriate in all cases except for those caused by explicitation of contextual information since correct and shorter equivalents which were equally capable of conveying the intended meaning could be posited for the original dialogs.Keywords: Audiovisual translation, English crime movies, expansion strategies, Persian subtitles.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2248