Search results for: corpus linguistics
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 105

Search results for: corpus linguistics

75 A Study on Bilingual Semantic Processing: Category Effects and Age Effects

Authors: Lai Yi-Hsiu

Abstract:

The present study addressed the nature of bilingual semantic processing in Mandarin Chinese and Southern Min and examined category effects and age effects. Nineteen bilingual adults of Mandarin Chinese and Southern Min, nine monolingual seniors of Mandarin Chinese, and ten monolingual seniors of Southern Min in Taiwan individually completed two semantic tasks: Picture naming and category fluency tasks. The instruments for the naming task were sixty black-and-white pictures, including thirty-five object pictures and twenty-five action pictures. The category fluency task also consisted of two semantic categories – objects (or nouns) and actions (or verbs). The reaction time for each picture/question was additionally calculated and analyzed. Oral productions in Mandarin Chinese and in Southern Min were compared and discussed to examine the category effects and age effects. The results of the category fluency task indicated that the content of information of these seniors was comparatively deteriorated, and thus they produced a smaller number of semantic-lexical items. Significant group differences were also found in the reaction time results. Category effects were significant for both adults and seniors in the semantic fluency task. The findings of the present study will help characterize the nature of the bilingual semantic processing of adults and seniors, and contribute to the fields of contrastive and corpus linguistics.

Keywords: Bilingual semantic processing, aging, Mandarin Chinese, Southern Min.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1254
74 Distributional Semantics Approach to Thai Word Sense Disambiguation

Authors: Sunee Pongpinigpinyo, Wanchai Rivepiboon

Abstract:

Word sense disambiguation is one of the most important open problems in natural language processing applications such as information retrieval and machine translation. Many approach strategies can be employed to resolve word ambiguity with a reasonable degree of accuracy. These strategies are: knowledgebased, corpus-based, and hybrid-based. This paper pays attention to the corpus-based strategy that employs an unsupervised learning method for disambiguation. We report our investigation of Latent Semantic Indexing (LSI), an information retrieval technique and unsupervised learning, to the task of Thai noun and verbal word sense disambiguation. The Latent Semantic Indexing has been shown to be efficient and effective for Information Retrieval. For the purposes of this research, we report experiments on two Thai polysemous words, namely  /hua4/ and /kep1/ that are used as a representative of Thai nouns and verbs respectively. The results of these experiments demonstrate the effectiveness and indicate the potential of applying vector-based distributional information measures to semantic disambiguation.

Keywords: Distributional semantics, Latent Semantic Indexing, natural language processing, Polysemous words, unsupervisedlearning, Word Sense Disambiguation.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1757
73 Analyzing Microblogs: Exploring the Psychology of Political Leanings

Authors: Meaghan Bowman

Abstract:

Microblogging has become increasingly popular for commenting on current events, spreading gossip, and encouraging individualism--which favors its low-context communication channel. These social media (SM) platforms allow users to express opinions while interacting with a wide range of populations. Hashtags allow immediate identification of like-minded individuals worldwide on a vast array of topics. The output of the analytic tool, Linguistic Inquiry and Word Count (LIWC)--a program that associates psychological meaning with the frequency of use of specific words--may suggest the nature of individuals’ internal states and general sentiments. When applied to groupings of SM posts unified by a hashtag, such information can be helpful to community leaders during periods in which the forming of public opinion happens in parallel with the unfolding of political, economic, or social events. This is especially true when outcomes stand to impact the well-being of the group. Here, we applied the online tools, Google Translate and the University of Texas’s LIWC, to a 90-posting sample from a corpus of Colombian Spanish microblogs. On translated disjoint sets, identified by hashtag as being authored by advocates of voting “No,” advocates voting “Yes,” and entities refraining from hashtag use, we observed the value of LIWC’s Tone feature as distinguishing among the categories and the word “peace,” as carrying particular significance, due to its frequency of use in the data.

Keywords: Colombia peace referendum, FARC, hashtags, linguistics, microblogging, social media.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 782
72 Learning to Order Terms: Supervised Interestingness Measures in Terminology Extraction

Authors: Jérôme Azé, Mathieu Roche, Yves Kodratoff, Michèle Sebag

Abstract:

Term Extraction, a key data preparation step in Text Mining, extracts the terms, i.e. relevant collocation of words, attached to specific concepts (e.g. genetic-algorithms and decisiontrees are terms associated to the concept “Machine Learning" ). In this paper, the task of extracting interesting collocations is achieved through a supervised learning algorithm, exploiting a few collocations manually labelled as interesting/not interesting. From these examples, the ROGER algorithm learns a numerical function, inducing some ranking on the collocations. This ranking is optimized using genetic algorithms, maximizing the trade-off between the false positive and true positive rates (Area Under the ROC curve). This approach uses a particular representation for the word collocations, namely the vector of values corresponding to the standard statistical interestingness measures attached to this collocation. As this representation is general (over corpora and natural languages), generality tests were performed by experimenting the ranking function learned from an English corpus in Biology, onto a French corpus of Curriculum Vitae, and vice versa, showing a good robustness of the approaches compared to the state-of-the-art Support Vector Machine (SVM).

Keywords: Text-mining, Terminology Extraction, Evolutionary algorithm, ROC Curve.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1614
71 A Corpus-Based Analysis on Code-Mixing Features in Mandarin-English Bilingual Children in Singapore

Authors: Xunan Huang, Caicai Zhang

Abstract:

This paper investigated the code-mixing features in Mandarin-English bilingual children in Singapore. First, it examined whether the code-mixing rate was different in Mandarin Chinese and English contexts. Second, it explored the syntactic categories of code-mixing in Singapore bilingual children. Moreover, this study investigated whether morphological information was preserved when inserting syntactic components into the matrix language. Data are derived from the Singapore Bilingual Corpus, in which the recordings and transcriptions of sixty English-Mandarin 5-to-6-year-old children were preserved for analysis. Results indicated that the rate of code-mixing was asymmetrical in the two language contexts, with the rate being significantly higher in the Mandarin context than that in the English context. The asymmetry is related to language dominance in that children are more likely to code-mix when using their nondominant language. Concerning the syntactic categories of code-mixing words in the Singaporean bilingual children, we found that noun-mixing, verb-mixing, and adjective-mixing are the three most frequently used categories in code-mixing in the Mandarin context. This pattern mirrors the syntactic categories of code-mixing in the Cantonese context in Cantonese-English bilingual children, and the general trend observed in lexical borrowing. Third, our results also indicated that English vocabularies that carry morphological information are embedded in bare forms in the Mandarin context. These findings shed light upon how bilingual children take advantage of the two languages in mixed utterances in a bilingual environment.

Keywords: Code-mixing, Mandarin Chinese, English, bilingual children.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1050
70 Re-telling Goa's History: The Margin Narrative

Authors: Anna Beatriz Paula

Abstract:

This paper presents the first reflexions about Margaret Mascarenhas-s novel, “Skin", based on post-colonial critic perception of History and its agents. By doing so, this study will put light on a literary corpus of Indian Literatures: the Goan Literature whose cultural basis creates an unique historiographic metafiction conducted by different characters that one by one plays the narrator role.

Keywords: Goa, History, Literature, Metafiction.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2087
69 A Corpus-Based Approach to Understanding Market Access in Fisheries and Aquaculture: A Systematic Literature Review

Authors: Cheryl Marie Cordeiro

Abstract:

Although fisheries and aquaculture studies might seem marginal to international business (IB) studies in general, fisheries and aquaculture IB (FAIB) management is currently facing increasing pressure to meet global demand and consumption for fish in the next coming decades. In part address to this challenge, the purpose of this systematic review of literature (SLR) study is to investigate the use of the term ‘market access’ in its context of use in the generic literature and business sector discourse, in comparison to the more specific literature and discourse in fisheries, aquaculture and seafood. This SLR aims to uncover the knowledge/interest gaps between the academic subject discourses and business sector practices. Corpus driven in methodology and using a triangulation method of three different text analysis software including AntConc, VOSviewer and Web of Science (WoS) analytics, the SLR results indicate a gap in conceptual knowledge and business practices in how ‘market access’ is conceived and used in the context of the pharmaceutical healthcare industry and FAIB research and practice. While it is acknowledged that the product orientation of different business sectors might differ, this SLR study works with the assumption that both business sectors are global in orientation. These business sectors are complex in their operations from product to market. This SLR suggests a conceptual model in understanding the challenges, the potential barriers as well as avenues for solutions to developing market access for FAIB.

Keywords: Market access, fisheries and aquaculture, international business, systematic literature review.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 877
68 Named Entity Recognition using Support Vector Machine: A Language Independent Approach

Authors: Asif Ekbal, Sivaji Bandyopadhyay

Abstract:

Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is now-a-days considered to be fundamental for many Natural Language Processing (NLP) tasks such as information retrieval, machine translation, information extraction, question answering systems and others. This paper reports about the development of a NER system for Bengali and Hindi using Support Vector Machine (SVM). Though this state of the art machine learning technique has been widely applied to NER in several well-studied languages, the use of this technique to Indian languages (ILs) is very new. The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the four different named (NE) classes, such as Person name, Location name, Organization name and Miscellaneous name. We have used the annotated corpora of 122,467 tokens of Bengali and 502,974 tokens of Hindi tagged with the twelve different NE classes 1, defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL) 2. In addition, we have manually annotated 150K wordforms of the Bengali news corpus, developed from the web-archive of a leading Bengali newspaper. We have also developed an unsupervised algorithm in order to generate the lexical context patterns from a part of the unlabeled Bengali news corpus. Lexical patterns have been used as the features of SVM in order to improve the system performance. The NER system has been tested with the gold standard test sets of 35K, and 60K tokens for Bengali, and Hindi, respectively. Evaluation results have demonstrated the recall, precision, and f-score values of 88.61%, 80.12%, and 84.15%, respectively, for Bengali and 80.23%, 74.34%, and 77.17%, respectively, for Hindi. Results show the improvement in the f-score by 5.13% with the use of context patterns. Statistical analysis, ANOVA is also performed to compare the performance of the proposed NER system with that of the existing HMM based system for both the languages.

Keywords: Named Entity (NE), Named Entity Recognition (NER), Support Vector Machine (SVM), Bengali, Hindi.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3319
67 Analysis of Steles with Libyan Inscriptions of Grande Kabylia, Algeria

Authors: Samia Ait Ali Yahia

Abstract:

Several steles with Libyan inscriptions were discovered in Grande Kabylia (Algeria), but very few researchers were interested in these inscriptions. Our work is to list, if possible all these steles in order to do a descriptive study of the corpus. The steles analysis will be focused on the iconographic and epigraphic level and on the different forms of Libyan characters in order to highlight the alphabet used by the Grande Kabylia.

Keywords: Epigraphy, stele, Libyan inscription, Grande Kabylia.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1026
66 Maya Semantic Technique: A Mathematical Technique Used to Determine Partial Semantics for Declarative Sentences

Authors: Marcia T. Mitchell

Abstract:

This research uses computational linguistics, an area of study that employs a computer to process natural language, and aims at discerning the patterns that exist in declarative sentences used in technical texts. The approach is mathematical, and the focus is on instructional texts found on web pages. The technique developed by the author and named the MAYA Semantic Technique is used here and organized into four stages. In the first stage, the parts of speech in each sentence are identified. In the second stage, the subject of the sentence is determined. In the third stage, MAYA performs a frequency analysis on the remaining words to determine the verb and its object. In the fourth stage, MAYA does statistical analysis to determine the content of the web page. The advantage of the MAYA Semantic Technique lies in its use of mathematical principles to represent grammatical operations which assist processing and accuracy if performed on unambiguous text. The MAYA Semantic Technique is part of a proposed architecture for an entire web-based intelligent tutoring system. On a sample set of sentences, partial semantics derived using the MAYA Semantic Technique were approximately 80% accurate. The system currently processes technical text in one domain, namely Cµ programming. In this domain all the keywords and programming concepts are known and understood.

Keywords: Natural language understanding, computational linguistics, knowledge representation, linguistic theories.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1621
65 Effective Features for Disambiguation of Turkish Verbs

Authors: Zeynep Orhan, Zeynep Altan

Abstract:

This paper summarizes the results of some experiments for finding the effective features for disambiguation of Turkish verbs. Word sense disambiguation is a current area of investigation in which verbs have the dominant role. Generally verbs have more senses than the other types of words in the average and detecting these features for verbs may lead to some improvements for other word types. In this paper we have considered only the syntactical features that can be obtained from the corpus and tested by using some famous machine learning algorithms.

Keywords: Word sense disambiguation, feature selection.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1714
64 The Role of Paraphrase in Interpreting Students’ Writing

Authors: Maya Lisa Aryanti, S. S. M. Hum

Abstract:

To improve students’ skill, writing is the most challenging skill to be developed. The reason is that besides helping the students to develop their skill, this activity also helps them to express themselves. This paper depicts how paraphrasing is very helpful to interpret students’ writing. Syntactic units, used tenses and meanings will indeed change once the writings were paraphrased. The objectives of this research are to reveal the inappropriate structure of syntactic units, to show what types of sentences the students often make, and to show how paraphrasing can help to infer the message. The methodology of this research is descriptive qualitative research. In addition, theories of linguistics are also included. This includes theory of Syntax to describe syntactic units and tenses and theory of Semantics to describe theories of meaning and how paraphrasing works. The theories of general linguistics, grammar and writing are also provided to support the theories of Syntax and Semantics. The results of this research are concerned with how the message is received in the end. The message written in the students’ essay is not clear because of the improper structure of syntactic units and use of incorrect of tenses. The students tend to use simple sentences, compound sentences and complex sentences with a few mistakes in their writing. In addition, they tend to create unnecessary phrases. The last point is that this research shows how paraphrase works to attain complete meaning of a sentence.

Keywords: Paraphrase, meanings, syntactic units and tenses.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 975
63 Achieving Maximum Performance through the Practice of Entrepreneurial Ethics: Evidence from SMEs in Nigeria

Authors: S. B. Tende, H. L. Abubakar

Abstract:

It is acknowledged that small and medium enterprises (SMEs) may encounter different ethical issues and pressures that could affect the way in which they strategize or make decisions concerning the outcome of their business. Therefore, this research aimed at assessing entrepreneurial ethics in the business of SMEs in Nigeria. Secondary data were adopted as source of corpus for the analysis. The findings conclude that a sound entrepreneurial ethics system has a significant effect on the level of performance of SMEs in Nigeria. The Nigerian Government needs to provide both guiding and physical structures; as well as learning systems that could inculcate these entrepreneurial ethics.

Keywords: Entrepreneurial ethics, culture, performance, SME.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 892
62 Software Architectural Design Ontology

Authors: Muhammad Irfan Marwat, Sadaqat Jan, Syed Zafar Ali Shah

Abstract:

Software Architecture plays a key role in software development but absence of formal description of Software Architecture causes different impede in software development. To cope with these difficulties, ontology has been used as artifact. This paper proposes ontology for Software Architectural design based on IEEE model for architecture description and Kruchten 4+1 model for viewpoints classification. For categorization of style and views, ISO/IEC 42010 has been used. Corpus method has been used to evaluate ontology. The main aim of the proposed ontology is to classify and locate Software Architectural design information.

Keywords: Software Architecture Ontology, Semantic based Software Architecture, Software Architecture, Ontology, Software Engineering.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 4116
61 A Multigranular Linguistic Additive Ratio Assessment Model in Group Decision Making

Authors: Wiem Daoud Ben Amor, Luis Martínez López, Jr., Hela Moalla Frikha

Abstract:

Most of the multi-criteria group decision making (MCGDM) problems dealing with qualitative criteria require consideration of the large background of expert information. It is common that experts have different degrees of knowledge for giving their alternative assessments according to criteria. So, it seems logical that they use different evaluation scales to express their judgment, i.e., multi granular linguistic scales. In this context, we propose the extension of the classical additive ratio assessment (ARAS) method to the case of a hierarchical linguistics term for managing multi granular linguistic scales in uncertain context where uncertainty is modeled by means in linguistic information. The proposed approach is called the extended hierarchical linguistics-ARAS method (ELH-ARAS). Within the ELH-ARAS approach, the decision maker (DMs) can diagnose the results (the ranking of the alternatives) in a decomposed style i.e., not only at one level of the hierarchy but also at the intermediate ones. Also, the developed approach allows a feedback transformation i.e., the collective final results of all experts are able to be transformed at any level of the extended linguistic hierarchy that each expert has previously used. Therefore, the ELH-ARAS technique makes it easier for decision-makers to understand the results. Finally, an MCGDM case study is given to illustrate the proposed approach.

Keywords: Additive ratio assessment, extended hierarchical linguistic, multi-criteria group decision making problems, multi granular linguistic contexts.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 305
60 Part of Speech Tagging Using Statistical Approach for Nepali Text

Authors: Archit Yajnik

Abstract:

Part of Speech Tagging has always been a challenging task in the era of Natural Language Processing. This article presents POS tagging for Nepali text using Hidden Markov Model and Viterbi algorithm. From the Nepali text, annotated corpus training and testing data set are randomly separated. Both methods are employed on the data sets. Viterbi algorithm is found to be computationally faster and accurate as compared to HMM. The accuracy of 95.43% is achieved using Viterbi algorithm. Error analysis where the mismatches took place is elaborately discussed.

Keywords: Hidden Markov model, Viterbi algorithm, POS tagging, natural language processing.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1652
59 Contextual Distribution for Textual Alignment

Authors: Yuri Bizzoni, Marianne Reboul

Abstract:

Our program compares French and Italian translations of Homer’s Odyssey, from the XVIth to the XXth century. We focus on the third point, showing how distributional semantics systems can be used both to improve alignment between different French translations as well as between the Greek text and a French translation. Although we focus on French examples, the techniques we display are completely language independent.

Keywords: Translation studies, machine translation, computational linguistics, distributional semantics.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 972
58 A Similarity Measure for Clustering and its Applications

Authors: Guadalupe J. Torres, Ram B. Basnet, Andrew H. Sung, Srinivas Mukkamala, Bernardete M. Ribeiro

Abstract:

This paper introduces a measure of similarity between two clusterings of the same dataset produced by two different algorithms, or even the same algorithm (K-means, for instance, with different initializations usually produce different results in clustering the same dataset). We then apply the measure to calculate the similarity between pairs of clusterings, with special interest directed at comparing the similarity between various machine clusterings and human clustering of datasets. The similarity measure thus can be used to identify the best (in terms of most similar to human) clustering algorithm for a specific problem at hand. Experimental results pertaining to the text categorization problem of a Portuguese corpus (wherein a translation-into-English approach is used) are presented, as well as results on the well-known benchmark IRIS dataset. The significance and other potential applications of the proposed measure are discussed.

Keywords: Clustering Algorithms, Clustering Applications, Similarity Measures, Text Clustering

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1515
57 New Ways of Vocabulary Enlargement

Authors: T. Solonchak, S. Pesina

Abstract:

Lexical invariants, being a sort of stereotypes within the frames of ordinary consciousness, are created by the members of a language community as a result of uniform division of reality. The invariant meaning is formed in person’s mind gradually in the course of different actualizations of secondary meanings in various contexts. We understand lexical the invariant as abstract language essence containing a set of semantic components. In one of its configurations it is the basis or all or a number of the meanings making up the semantic structure of the word.

Keywords: Lexical invariant, invariant theories, polysemantic word, cognitive linguistics.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2306
56 A Content Vector Model for Text Classification

Authors: Eric Jiang

Abstract:

As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications. In this paper, an LSI-based content vector model for text classification is presented, which constructs multiple augmented category LSI spaces and classifies text by their content. The model integrates the class discriminative information from the training data and is equipped with several pertinent feature selection and text classification algorithms. The proposed classifier has been applied to email classification and its experiments on a benchmark spam testing corpus (PU1) have shown that the approach represents a competitive alternative to other email classifiers based on the well-known SVM and naïve Bayes algorithms.

Keywords: Feature Selection, Latent Semantic Indexing, Text Classification, Vector Space Model.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1840
55 A System of Automatic Speech Recognition based on the Technique of Temporal Retiming

Authors: Samir Abdelhamid, Noureddine Bouguechal

Abstract:

We report in this paper the procedure of a system of automatic speech recognition based on techniques of the dynamic programming. The technique of temporal retiming is a technique used to synchronize between two forms to compare. We will see how this technique is adapted to the field of the automatic speech recognition. We will expose, in a first place, the theory of the function of retiming which is used to compare and to adjust an unknown form with a whole of forms of reference constituting the vocabulary of the application. Then we will give, in the second place, the various algorithms necessary to their implementation on machine. The algorithms which we will present were tested on part of the corpus of words in Arab language Arabdic-10 [4] and gave whole satisfaction. These algorithms are effective insofar as we apply them to the small ones or average vocabularies.

Keywords: Continuous speech recognition, temporal retiming, phonetic decoding, algorithms, vocal signal, dynamic programming.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1298
54 Questions in the School

Authors: Jana M. Havigerová, Jiří Haviger

Abstract:

Paper deals with the topic of questions as important components of information behavior in the school. By analyzing the Corpus Schola2010, the state of contemporary education in terms of questioning is proven unsatisfactory: 80% of the questions are asked by teachers; most of teacher-s questions are asked at the beginning of the first grade, than their number decreases and is settling down on 80±10 questions per lesson. The average number of questions within one lesson per one pupil is generally less than one whole question. The highest values are achieved in the first, sixth, eighth and tenth grade,, i.e. in the transition years in which pupils are moving into higher levels of education and every following year it declines. We can state Czech school do not support questioning and question skill of their pupils, thereby typical Czech schools are neglecting the development of thinking, reasoning and cooperation of their pupils.

Keywords: information behavior, questions, primary and secondary education, Czech Republic

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1418
53 Resolving Dependency Ambiguity of Subordinate Clauses using Support Vector Machines

Authors: Sang-Soo Kim, Seong-Bae Park, Sang-Jo Lee

Abstract:

In this paper, we propose a method of resolving dependency ambiguities of Korean subordinate clauses based on Support Vector Machines (SVMs). Dependency analysis of clauses is well known to be one of the most difficult tasks in parsing sentences, especially in Korean. In order to solve this problem, we assume that the dependency relation of Korean subordinate clauses is the dependency relation among verb phrase, verb and endings in the clauses. As a result, this problem is represented as a binary classification task. In order to apply SVMs to this problem, we selected two kinds of features: static and dynamic features. The experimental results on STEP2000 corpus show that our system achieves the accuracy of 73.5%.

Keywords: Dependency analysis, subordinate clauses, binaryclassification, support vector machines.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1551
52 Growing Self Organising Map Based Exploratory Analysis of Text Data

Authors: Sumith Matharage, Damminda Alahakoon

Abstract:

Textual data plays an important role in the modern world. The possibilities of applying data mining techniques to uncover hidden information present in large volumes of text collections is immense. The Growing Self Organizing Map (GSOM) is a highly successful member of the Self Organising Map family and has been used as a clustering and visualisation tool across wide range of disciplines to discover hidden patterns present in the data. A comprehensive analysis of the GSOM’s capabilities as a text clustering and visualisation tool has so far not been published. These functionalities, namely map visualisation capabilities, automatic cluster identification and hierarchical clustering capabilities are presented in this paper and are further demonstrated with experiments on a benchmark text corpus.

Keywords: Text Clustering, Growing Self Organizing Map, Automatic Cluster Identification, Hierarchical Clustering.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1899
51 A New Time-Frequency Speech Analysis Approach Based On Adaptive Fourier Decomposition

Authors: Liming Zhang

Abstract:

In this paper, a new adaptive Fourier decomposition (AFD) based time-frequency speech analysis approach is proposed. Given the fact that the fundamental frequency of speech signals often undergo fluctuation, the classical short-time Fourier transform (STFT) based spectrogram analysis suffers from the difficulty of window size selection. AFD is a newly developed signal decomposition theory. It is designed to deal with time-varying non-stationary signals. Its outstanding characteristic is to provide instantaneous frequency for each decomposed component, so the time-frequency analysis becomes easier. Experiments are conducted based on the sample sentence in TIMIT Acoustic-Phonetic Continuous Speech Corpus. The results show that the AFD based time-frequency distribution outperforms the STFT based one.

Keywords: Adaptive fourier decomposition, instantaneous frequency, speech analysis, time-frequency distribution.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1681
50 Automatic Text Summarization

Authors: Mohamed Abdel Fattah, Fuji Ren

Abstract:

This work proposes an approach to address automatic text summarization. This approach is a trainable summarizer, which takes into account several features, including sentence position, positive keyword, negative keyword, sentence centrality, sentence resemblance to the title, sentence inclusion of name entity, sentence inclusion of numerical data, sentence relative length, Bushy path of the sentence and aggregated similarity for each sentence to generate summaries. First we investigate the effect of each sentence feature on the summarization task. Then we use all features score function to train genetic algorithm (GA) and mathematical regression (MR) models to obtain a suitable combination of feature weights. The proposed approach performance is measured at several compression rates on a data corpus composed of 100 English religious articles. The results of the proposed approach are promising.

Keywords: Automatic Summarization, Genetic Algorithm, Mathematical Regression, Text Features.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2274
49 Extraction of Temporal Relation by the Creation of Historical Natural Disaster Archive

Authors: Suguru Yoshioka, Seiichi Tani, Seinosuke Toda

Abstract:

In historical science and social science, the influence of natural disaster upon society is a matter of great interest. In recent years, some archives are made through many hands for natural disasters, however it is inefficiency and waste. So, we suppose a computer system to create a historical natural disaster archive. As the target of this analysis, we consider newspaper articles. The news articles are considered to be typical examples that prescribe the temporal relations of affairs for natural disaster. In order to do this analysis, we identify the occurrences in newspaper articles by some index entries, considering the affairs which are specific to natural disasters, and show the temporal relation between natural disasters. We designed and implemented the automatic system of “extraction of the occurrences of natural disaster" and “temporal relation table for natural disaster."

Keywords: Database, digital library, corpus, historical natural disaster, temporal relation

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1368
48 Foreign Elements in the Methodologies of Usul Fiqh: Analyzing the Orientalist Thought

Authors: Ariyanti Mustapha

Abstract:

The development of Islamic jurisprudence since the first century of the hijra has fascinated many orientalists to explore the historiography of Islamic legislation. The practice of usul fiqh began during the lifetime of the Prophet Muhammad and was continued by the companions as the legal reasoning due to the absence of the legal injunction in the Qur’an and Sunnah. The orientalists propagated that the Roman and Jewish legislation were transplanted into Islamic jurisprudence and it was the primary reason for its progression. We used qualitative and comparative methods to analyze the orientalists’ views. Results showed that many erroneous facts were propagated by Goldziher and Schacht by claiming the parallels between the principles, methodologies, and fundamental concepts in Islamic jurisprudence and Roman Provincial law. The orientalists claimed that Islamic jurisprudence was derived from the corpus of Jewish Mishnah and Ha-kol. These judgments are used by the orientalists to prove the inferiority of Islamic jurisprudence. Nevertheless, many evidences have proven that Islamic legislation is capable of developing independently without any foreign transplant.

Keywords: Foreign transplant, ijtihad, orientalist, usul fiqh.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 299
47 Assessment of the Validity of Sentiment Analysis as a Tool to Analyze the Emotional Content of Text

Authors: Trisha Malhotra

Abstract:

Sentiment analysis is a recent field of study that computationally assesses the emotional nature of a body of text. To assess its test-validity, sentiment analysis was carried out on the emotional corpus of text from a personal 15-day mood diary. Self-reported mood scores varied more or less accurately with daily mood evaluation score given by the software. On further assessment, it was found that while sentiment analysis was good at assessing ‘global’ mood, it was not able to ‘locally’ identify and differentially score synonyms of various emotional words. It is further critiqued for treating the intensity of an emotion as universal across cultures. Finally, the software is shown not to account for emotional complexity in sentences by treating emotions as strictly positive or negative. Hence, it is posited that a better output could be two (positive and negative) affect scores for the same body of text.

Keywords: Analysis, data, diary, emotions, mood, sentiment.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1062
46 Algorithm for Information Retrieval Optimization

Authors: Kehinde K. Agbele, Kehinde Daniel Aruleba, Eniafe F. Ayetiran

Abstract:

When using Information Retrieval Systems (IRS), users often present search queries made of ad-hoc keywords. It is then up to the IRS to obtain a precise representation of the user’s information need and the context of the information. This paper investigates optimization of IRS to individual information needs in order of relevance. The study addressed development of algorithms that optimize the ranking of documents retrieved from IRS. This study discusses and describes a Document Ranking Optimization (DROPT) algorithm for information retrieval (IR) in an Internet-based or designated databases environment. Conversely, as the volume of information available online and in designated databases is growing continuously, ranking algorithms can play a major role in the context of search results. In this paper, a DROPT technique for documents retrieved from a corpus is developed with respect to document index keywords and the query vectors. This is based on calculating the weight (

Keywords: Internet ranking,

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1428