Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 75

Search results for: annotation

75 Fuzzy Semantic Annotation of Web Resources

Authors: Sahar Maâlej Dammak, Anis Jedidi, Rafik Bouaziz

Abstract:

Given the great mass of pages managed around the world, especially since the advent of the Web, manual annotation is impossible. In this paper, we focus on the semi-automatic annotation of web pages. We propose an approach and a framework for the semantic annotation of web pages entitled “Querying Web”. Our solution enhances the initial annotation produced by the “Semantic Radar” plug-in on web resources with annotations drawn from an enriched domain ontology. The concepts in the Semantic Radar output may be connected to several terms of the ontology, but these connections may be uncertain. We therefore represent annotations as possibility distributions and use the hierarchy defined in the ontology to compute the degrees of possibility. Our aim is to automate the fuzzy semantic annotation of web resources.
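
The abstract does not state how the degrees of possibility are derived from the hierarchy. As a minimal sketch, assume the degree decays with the number of edges separating a matched concept from a candidate ontology term, and that the distribution is rescaled so the best candidate receives 1.0. The toy ontology, the function names, and the decay rule 1 / (1 + distance) are all illustrative assumptions, not the authors' method.

```python
# Hypothetical toy ontology: child -> parent (None marks the root).
PARENT = {
    "Thing": None,
    "Publication": "Thing",
    "Article": "Publication",
    "Book": "Publication",
    "Person": "Thing",
}


def ancestors(term):
    """Return the path from a term up to the root, the term itself first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def hierarchy_distance(a, b):
    """Count the edges between two terms via their lowest common ancestor."""
    path_a = ancestors(a)
    path_b = ancestors(b)
    for steps_a, node in enumerate(path_a):
        if node in path_b:
            return steps_a + path_b.index(node)
    raise ValueError("terms share no ancestor")


def possibility_distribution(concept, candidates):
    """Assumed rule: 1 / (1 + distance), rescaled so the maximum is 1.0."""
    raw = {t: 1.0 / (1 + hierarchy_distance(concept, t)) for t in candidates}
    top = max(raw.values())
    return {t: v / top for t, v in raw.items()}


dist = possibility_distribution("Article", ["Article", "Book", "Person"])
for term, degree in dist.items():
    print(f"{term}: {degree:.2f}")
```

A sibling term such as "Book" ends up more possible than a distant one such as "Person", which is the kind of graded, uncertain link the abstract describes.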

Keywords: fuzzy semantic annotation, semantic web, domain ontologies, querying web

Procedia PDF Downloads 283
74 A Method of the Semantic on Image Auto-Annotation

Authors: Lin Huo, Xianwei Liu, Jingxiong Zhou

Abstract:

Recently, due to the semantic gap between image visual features and human concepts, semantic image auto-annotation has become an important topic. First, low-level visual features are extracted from the image and mapped by a corresponding hash method into a hash code, which is eventually transformed into a binary string and stored. Because annotation by search is a popular approach to image auto-annotation, we use it to design and implement a method of semantic image auto-annotation. Finally, tests on the Corel image set show that the method is effective.
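
The pipeline described above (features, hash code, binary string, annotation by search) can be sketched as follows. The abstract does not name the hash method, so this sketch assumes a simple mean-threshold binarisation and a nearest-neighbour search by Hamming distance; the feature vectors and labels are made up for illustration.

```python
def to_hash(features):
    """Binarise a feature vector: '1' where a value exceeds the vector mean.

    This thresholding scheme is an assumption, not the paper's hash method.
    """
    mean = sum(features) / len(features)
    return "".join("1" if x > mean else "0" for x in features)


def hamming(a, b):
    """Number of positions at which two equal-length binary strings differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def annotate(query_features, index):
    """Return the label of the stored code closest to the query code."""
    query_code = to_hash(query_features)
    _, label = min(index, key=lambda entry: hamming(query_code, entry[0]))
    return label


# Hypothetical stored images: (binary code, annotation).
index = [
    (to_hash([0.9, 0.8, 0.1, 0.2]), "sunset"),
    (to_hash([0.1, 0.2, 0.9, 0.7]), "forest"),
]

label = annotate([0.7, 0.9, 0.2, 0.1], index)
print(label)
```

Storing short binary strings rather than raw feature vectors keeps the search cheap, since comparing two codes is a single pass over their bits.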

Keywords: image auto-annotation, color correlograms, Hash code, image retrieval

Procedia PDF Downloads 343
73 Towards a Large Scale Deep Semantically Analyzed Corpus for Arabic: Annotation and Evaluation

Authors: S. Alansary, M. Nagi

Abstract:

This paper presents an approach to the semantic annotation of an Arabic corpus using the Universal Networking Language (UNL) framework. UNL is intended as a promising strategy for providing a large collection of texts annotated with formal, deep semantics rather than shallow semantics. The result would constitute a semantic resource (semantic graphs) that is editable and that integrates various phenomena, including predicate-argument structure, scope, tense, thematic roles, and rhetorical relations, into a single semantic formalism for knowledge representation. The paper also presents the Interactive Analysis tool for automatic semantic annotation (IAN). In addition, the cornerstones of the proposed methodology, the disambiguation and transformation rules, are presented. Semantic annotation using UNL has been applied to a corpus of 20,000 Arabic sentences representing the most frequent structures in the Arabic Wikipedia. The representation at different linguistic levels is illustrated, starting from the morphological level and passing through the syntactic level until the semantic representation is reached. The output has been evaluated using the F-measure, on which it scores 90%. This demonstrates how powerful the formal environment is, as it enables intelligent text processing and search.
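
For readers unfamiliar with the F-measure used in the evaluation above, here is a minimal sketch of how it could be computed over sets of semantic relations. The abstract does not say which units were scored, so treating each (relation, head, dependent) triple as one unit is an assumption, and the triples below are illustrative rather than taken from the corpus.

```python
def f_measure(predicted, gold, beta=1.0):
    """Harmonic combination of precision and recall over two sets of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative relation triples: (relation, head, dependent).
gold = {("agt", "eat", "boy"), ("obj", "eat", "apple"), ("mod", "apple", "red")}
predicted = {("agt", "eat", "boy"), ("obj", "eat", "apple"), ("plc", "eat", "apple")}

score = f_measure(predicted, gold)
print(round(score, 3))
```

With beta set to 1 the measure weights precision and recall equally; a larger beta favours recall.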

Keywords: semantic analysis, semantic annotation, Arabic, universal networking language

Procedia PDF Downloads 501
72 Annotation Ontology for Semantic Web Development

Authors: Hadeel Al Obaidy, Amani Al Heela

Abstract:

The main purpose of this paper is to examine the concept of the semantic web and the role that ontology and semantic annotation play in the development of semantic web services. The paper focuses on semantic web infrastructure, illustrating how ontology and annotation work together to provide the learning capabilities for building content semantically. To improve the productivity and quality of software, the paper applies approaches, notations, and techniques offered by software engineering. It proposes a conceptual model for developing semantic web services for the infrastructure of a web information retrieval system for digital libraries. The developed system uses ontology and annotation to build a knowledge-based system that defines and links the meaning of web content in order to retrieve information for users’ queries. Keyword and ontology rule expansion makes the results more relevant and more accurate in satisfying the requested information. The accuracy of the results is further enhanced because the semantically analyzed query works with the conceptual architecture of the proposed system.

Keywords: semantic web services, software engineering, semantic library, knowledge representation, ontology

Procedia PDF Downloads 94
71 Automatic Multi-Label Image Annotation System Guided by Firefly Algorithm and Bayesian Method

Authors: Saad M. Darwish, Mohamed A. El-Iskandarani, Guitar M. Shawkat

Abstract:

Nowadays, the amount of available multimedia data is continuously on the rise, and finding a required image is a challenging task for an ordinary user. Content-based image retrieval (CBIR) computes relevance based on the visual similarity of low-level image features such as color and texture. However, there is a gap between low-level visual features and the semantic meanings required by applications. The typical method of bridging this semantic gap is automatic image annotation (AIA), which extracts semantic features using machine learning techniques. In this paper, a multi-label image annotation system guided by the Firefly algorithm and a Bayesian method is proposed. First, images are segmented using the maximum variance intra cluster and the Firefly algorithm, a swarm-based approach with high convergence speed and a low computation rate that searches for the optimal multiple thresholds. Feature extraction techniques based on color features and region properties are then applied to obtain representative features. After that, the images are annotated using a translation model based on the Net Bayes system, which is efficient for multi-label learning with high precision and low complexity. Experiments are performed using the Corel database. The results show that the proposed system performs better than traditional ones for automatic image annotation and retrieval.

Keywords: feature extraction, feature selection, image annotation, classification

Procedia PDF Downloads 510
70 The Omani Learner of English Corpus: Source and Tools

Authors: Anood Al-Shibli

Abstract:

Designing a learner corpus is not an easy task because learner language involves many variables that might affect the results of any study based on learners’ language production (spoken and written). It is also essential to design a learner corpus systematically, especially when it is intended as a reference for language research. The design of the Omani Learner Corpus (OLEC) has therefore followed many explicit and systematic considerations. These criteria can be regarded as a foundation for designing any learner corpus to be exploited effectively in studies of language use and language learning. In addition, OLEC is a manually error-annotated corpus. Error annotation in learner corpora is essential; however, it is time-consuming and prone to errors. Consequently, a navigation tool was designed to help annotators insert error codes and so make the error-annotation process more efficient and consistent. To ensure accuracy, an error-annotation procedure was followed to annotate OLEC, and some preliminary findings are noted. One of the main results of this procedure is an error-annotation system based on the language production of Omani learners of English. Because OLEC is still in its first stages, the preliminary findings relate to only one level of proficiency and one error type, namely verb-related errors. Omani learners in OLEC tend to make more errors in verb formation, followed by problems in verb agreement. Comparing the results with other error-based studies indicates that Omani learners tend to make basic verb errors that can be found at lower levels of proficiency. Finally, it is essential to acknowledge that examining learners’ errors can give insights into language acquisition and language learning, and that most errors do not happen randomly but occur systematically among language learners.

Keywords: error-annotation system, error-annotation manual, learner corpora, verbs related errors

Procedia PDF Downloads 64
69 The Automatisation of Dictionary-Based Annotation in a Parallel Corpus of Old English

Authors: Ana Elvira Ojanguren Lopez, Javier Martin Arista

Abstract:

The aims of this paper are to present the automatisation procedure adopted in the implementation of a parallel corpus of Old English and to assess the progress of automatisation with respect to tagging, annotation, and lemmatisation. The corpus consists of an aligned parallel text with word-for-word Old English-English comparison that provides the Old English segment with inflectional form tagging (gloss, lemma, category, and inflection) and lemma annotation (spelling, meaning, inflectional class, paradigm, word-formation, and secondary sources). This parallel corpus is intended to fill a gap in the field of Old English, in which no parallel and/or lemmatised corpora are available, while the average amount of corpus annotation is low. With this background, this presentation has two main parts. The first part, which focuses on tagging and annotation, selects the layouts and fields of lexical databases that are relevant for these tasks. Most information used for the annotation of the corpus can be retrieved from the lexical and morphological database Nerthus and the database of secondary sources Freya. These are the sources of linguistic and metalinguistic information that will be used for the annotation of the lemmas of the corpus, including morphological and semantic aspects as well as the references to the secondary sources that deal with the lemmas in question. Although substantially adapted and re-interpreted, the lemmatised part of these databases draws on the standard dictionaries of Old English, including The Student's Dictionary of Anglo-Saxon, An Anglo-Saxon Dictionary, and A Concise Anglo-Saxon Dictionary. The second part of this paper deals with lemmatisation. It presents the lemmatiser Norna, which has been implemented in FileMaker software. It is based on a concordance and an index to the Dictionary of Old English Corpus, which comprises around three thousand texts and three million words.
In its present state, the lemmatiser Norna can automatically assign a lemma to around 80% of textual forms by searching the index and the concordance for prefixes, stems, and inflectional endings. The conclusions of this presentation stress the limits of the automatisation of dictionary-based annotation in a parallel corpus. While tagging and annotation are largely automatic even at the present stage, the automatisation of alignment is left for future research. Lemmatisation and morphological tagging are expected to be fully automatic in the near future, once the database of secondary sources Freya and the lemmatiser Norna have been completed.
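
The lookup strategy described above (searching for prefixes, stems, and inflectional endings) can be illustrated with a small sketch. This is not Norna itself, which is implemented in FileMaker; the prefix list, ending list, and stem index below are hypothetical stand-ins for the index and concordance, reduced to a handful of entries.

```python
# Hypothetical miniature resources standing in for the real index.
PREFIXES = ["ge", ""]                      # try the prefixed analysis first
ENDINGS = ["as", "um", "de", "e", ""]      # a few inflectional endings
STEM_INDEX = {"cyning": "cyning", "stan": "stan", "hyr": "hyran"}


def lemmatise(form):
    """Strip a known prefix and ending, then look the stem up in the index.

    Returns the lemma, or None when no analysis matches, in which case the
    form would be left for manual lemmatisation.
    """
    form = form.lower()
    for prefix in PREFIXES:
        if not form.startswith(prefix):
            continue
        rest = form[len(prefix):]
        for ending in ENDINGS:
            if ending and not rest.endswith(ending):
                continue
            stem = rest[: len(rest) - len(ending)] if ending else rest
            if stem in STEM_INDEX:
                return STEM_INDEX[stem]
    return None


forms = ["cyningas", "stanum", "gehyrde", "wordum"]
lemmas = {form: lemmatise(form) for form in forms}
print(lemmas)
```

Forms that return None correspond to the share of textual forms that the automatic procedure cannot yet resolve.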

Keywords: corpus linguistics, historical linguistics, old English, parallel corpus

Procedia PDF Downloads 110
68 Extraction of Text Subtitles in Multimedia Systems

Authors: Amarjit Singh

Abstract:

In this paper, a method for the extraction of text subtitles from large videos is proposed. Video data needs to be annotated for many multimedia applications. Text is incorporated in digital video to provide useful information about that video, so the need arises to detect the text present in a video for video understanding and indexing. This is achieved in two steps: text localization, followed by text verification. The method of text detection can be extended to text recognition, which finds applications in automatic video indexing, video annotation, and content-based video retrieval. The method has been tested on various types of videos.

Keywords: video, subtitles, extraction, annotation, frames

Procedia PDF Downloads 504
67 BingleSeq: A User-Friendly R Package for Single-Cell RNA-Seq Data Analysis

Authors: Quan Gu, Daniel Dimitrov

Abstract:

BingleSeq was developed as a Shiny-based, intuitive, and comprehensive application that enables the analysis of single-cell RNA-sequencing count data. This was achieved by incorporating three state-of-the-art software packages for each type of RNA sequencing analysis, alongside functional annotation analysis and a way to assess the overlap of differential expression method results. In its current state, the functionality implemented within BingleSeq is comparable to that of other applications that were also developed to lower the entry requirements for RNA sequencing analyses. BingleSeq is available on GitHub and will be submitted to R/Bioconductor.

Keywords: bioinformatics, functional annotation analysis, single-cell RNA-sequencing, transcriptomics

Procedia PDF Downloads 63
66 Tagging a corpus of Media Interviews with Diplomats: Challenges and Solutions

Authors: Roberta Facchinetti, Sara Corrizzato, Silvia Cavalieri

Abstract:

Increasing interconnection between data digitalization and linguistic investigation has given rise to unprecedented potentialities and challenges for corpus linguists, who need to master IT tools for data analysis and text processing, as well as to develop techniques for efficient and reliable annotation in specific mark-up languages that encode documents in a format that is both human- and machine-readable. In the present paper, the challenges emerging from the compilation of a linguistic corpus will be taken into consideration, focusing on the English language in particular. To do so, the case study of the InterDiplo corpus will be illustrated. The corpus, currently under development at the University of Verona (Italy), represents a novelty in terms both of the data included and of the tag set used for its annotation. The corpus covers media interviews and debates with diplomats and international operators conversing in English with journalists who do not share the same lingua-cultural background as their interviewees. To date, this appears to be the first tagged corpus of international institutional spoken discourse, and it will be an important database not only for linguists interested in corpus analysis but also for experts operating in international relations. In the present paper, special attention will be dedicated to the structural mark-up, part-of-speech annotation, and tagging of discursive traits, which are the innovative parts of the project and the result of a thorough study to find the solution best suited to the analytical needs of the data. Several aspects will be addressed, with special attention to the tagging of the speakers’ identity, the communicative events, and anthropophagic. Prominence will be given to the annotation of question/answer exchanges to investigate the interlocutors’ choices and how such choices impact communication.
Indeed, the automated identification of questions, in relation to the expected answers, helps to understand how interviewers elicit information and how interviewees provide their answers to fulfill their respective communicative aims. A detailed description of the aforementioned elements will be given using the InterDiplo-Covid19 pilot corpus. The results of our preliminary analysis of the data will highlight the viable solutions found in the construction of the corpus in terms of XML conversion, metadata definition, tagging system, and discursive-pragmatic annotation to be included via Oxygen.

Keywords: spoken corpus, diplomats’ interviews, tagging system, discursive-pragmatic annotation, english linguistics

Procedia PDF Downloads 78
65 Contextual Sentiment Analysis with Untrained Annotators

Authors: Lucas A. Silva, Carla R. Aguiar

Abstract:

This work presents a proposal to perform contextual sentiment analysis using a supervised learning algorithm without extensive training of annotators. To achieve this goal, a web platform was developed to perform the entire procedure outlined in this paper. The main contribution of the pipeline described in this article is to simplify and automate the annotation process through a system that analyzes the congruence between annotations. This ensured satisfactory results even without annotators specialized in the context of the research, while avoiding the generation of biased training data for the classifiers. A case study was conducted on an entrepreneurship blog. The experimental results were consistent with those reported in the literature for annotation using a formalized process with experts.
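
The abstract does not specify how congruence between annotations is measured. One plausible reading, sketched below under that assumption, is a majority vote that keeps an item for training only when enough annotators agree on its label. The 0.6 threshold, the function name, and the sample votes are all hypothetical.

```python
from collections import Counter


def congruent_label(votes, min_agreement=0.6):
    """Return the majority label if its share of votes reaches the threshold.

    Items on which annotators disagree too much yield None and are dropped,
    so that noisy labels do not reach the classifier's training data.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


# Hypothetical annotations from three untrained annotators per post.
annotations = {
    "post-1": ["positive", "positive", "neutral"],
    "post-2": ["negative", "positive", "neutral"],
    "post-3": ["negative", "negative", "negative"],
}

training = {}
for post, votes in annotations.items():
    label = congruent_label(votes)
    if label is not None:
        training[post] = label

print(training)
```

Raising the threshold trades training-set size for label reliability, which is the kind of balance a congruence analysis has to strike when annotators are not experts.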

Keywords: sentiment analysis, untrained annotators, naive bayes, entrepreneurship, contextualized classifier

Procedia PDF Downloads 298
64 A Framework for Secure Information Flow Analysis in Web Applications

Authors: Ralph Adaimy, Wassim El-Hajj, Ghassen Ben Brahim, Hazem Hajj, Haidar Safa

Abstract:

Huge amounts of data and personal information are sent to and retrieved from web applications on a daily basis. Every application has its own confidentiality and integrity policies. Violating these policies can have a broad negative impact on the involved company’s financial status, while enforcing them is very hard even for developers with a good security background. In this paper, we propose a framework that enforces security-by-construction in web applications. Minimal developer effort is required, in the sense that the developer only needs to annotate database attributes with a security class. The web application code is then converted into an intermediate representation called the Extended Program Dependence Graph (EPDG). Using the EPDG, the provided annotations are propagated to the application code and checked against generic security enforcement rules that were carefully designed to detect insecure information flows as early as they occur. As a result, any violation of the data’s confidentiality or integrity policies is reported. As a proof of concept, two PHP web applications, Hotel Reservation and Auction, were used for testing and validation. The proposed system was able to catch all the existing insecure information flows at their source. Moreover, to highlight the simplicity of the suggested approach compared with existing approaches, two professional web developers assessed the annotation tasks needed in the presented case studies and provided very positive feedback on their simplicity.
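
The core idea above, propagating security classes from annotated database attributes through a dependence graph and flagging insecure flows, can be sketched in a few lines. This is a simplified illustration, not the EPDG or the authors' enforcement rules: it assumes a two-level confidentiality lattice, a plain adjacency list as the graph, and made-up node names, and it ignores integrity policies.

```python
from collections import deque

# Assumed two-level lattice: a higher number means more confidential.
LEVEL = {"public": 0, "secret": 1}


def propagate(edges, annotations):
    """Push each annotated class along dependence edges.

    Every node ends up with the highest class that can flow into it.
    """
    labels = dict(annotations)
    queue = deque(annotations)
    while queue:
        node = queue.popleft()
        for successor in edges.get(node, []):
            current = labels.get(successor)
            if current is None or LEVEL[labels[node]] > LEVEL[current]:
                labels[successor] = labels[node]
                queue.append(successor)
    return labels


def violations(labels, sinks):
    """List sinks that receive data above the class they are allowed to emit."""
    return sorted(
        sink
        for sink, allowed in sinks.items()
        if LEVEL[labels.get(sink, "public")] > LEVEL[allowed]
    )


# Hypothetical dependence graph: database attributes -> variables -> outputs.
edges = {
    "users.password": ["$row"],
    "users.name": ["$name"],
    "$row": ["$msg"],
    "$msg": ["echo@profile.php"],
    "$name": ["echo@header.php"],
}
annotations = {"users.password": "secret", "users.name": "public"}
sinks = {"echo@profile.php": "public", "echo@header.php": "public"}

labels = propagate(edges, annotations)
leaks = violations(labels, sinks)
print(leaks)
```

Only the database attributes carry hand-written annotations here, which mirrors the claim that the developer's effort is limited to labelling the schema.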

Keywords: web applications security, secure information flow, program dependence graph, database annotation

Procedia PDF Downloads 381
63 Grammatically Coded Corpus of Spoken Lithuanian: Methodology and Development

Authors: L. Kamandulytė-Merfeldienė

Abstract:

The paper deals with the main methodological issues of the Corpus of Spoken Lithuanian, whose development started in 2006. At present, the corpus consists of 300,000 grammatically annotated word forms. The creation of the corpus consists of three main stages: collecting the data, transcribing the recorded data, and grammatical annotation. Data collection was based on the principles of balance and naturalness. The recorded speech was transcribed according to the CHAT requirements of CHILDES. The transcripts were double-checked and annotated grammatically using CHILDES. The development of the Corpus of Spoken Lithuanian has led to a constant increase in studies on spontaneous communication, and various papers have dealt with the distribution of parts of speech, the use of different grammatical forms, the variation of inflectional paradigms, the distribution of fillers, the syntactic functions of adjectives, and the mean length of utterances.

Keywords: CHILDES, corpus of spoken Lithuanian, grammatical annotation, grammatical disambiguation, lexicon, Lithuanian

Procedia PDF Downloads 159
62 The Publishing Process and Results of the Chinese Annotated Edition of John Dewey’s “Experience and Education: The 60th Anniversary Edition”

Authors: Wen-jing Shan

Abstract:

The Chinese annotated edition of “Experience and Education: The 60th Anniversary Edition,” originally written in English by John Dewey (1859-1952), was published in 2015 by this author. The purpose of this paper is to report on the process and results of the translation and annotation of the book. The original 1938 edition is considered the best concise statement on education by John Dewey, one of the most important educational theorists of the twentieth century. One feature of the 60th anniversary edition is that the original publisher, Kappa Delta Pi International Honor Society, invited four contemporary Deweyan scholars who had been awarded the Society’s Laureate Scholar honor to write reviews of the book; Dewey himself was the first to receive this honor. The four scholars are Maxine Greene (1917-2014), Philip W. Jackson (1928-2015), Linda Darling-Hammond (1951-), and O. L. Davis, Jr. (1928-). After its publication in the U.S.A., the original 1938 edition was translated into Chinese five times: three times in the 1940s, once in the 1990s, and once in the 2010s. Nonetheless, the five translations have few or no annotations and suffer from misinterpretations and missing information. The author retranslated and annotated the book to make the rendering more faithful, expressive, and elegant, and to provide readers with better understanding and more accurate information. The translation and annotation project, sponsored by the Taiwan Ministry of Science and Technology, started in August 2011, and the work was finished and published by July 2015. The author’s work was divided into three stages.
First, in the preparatory stage of the project, the summary of each chapter, the rationale of the book, the textual commentary, the development of the original and Chinese editions, and reviews and criticisms, as well as Dewey’s biography and bibliography, were initially investigated. Second, on the basis of this preliminary work, the annotated translation of Experience and Education, an epitome of Dewey’s biography and bibliography, a chronology, and a critical introduction to Experience and Education were written. In the critical introduction, Dewey’s philosophy of experience and educational ideas are examined along the timeline of human thought, and the vast literature about Dewey and his work helps to reveal the historical significance of Experience and Education for the modern age and to make the critical introduction better informed. Third, the final stage took another two years to review and revise the draft of the work and send it for publication. The book has two parts. The first part is a scholarly introduction covering Dewey’s chronicle (in short form), Dewey’s mind, people and life, the importance of “Experience and Education”, and the necessity of re-translating and re-annotating “Experience and Education” into Chinese. The second part is the re-translated and re-annotated version, including Dewey’s “Experience and Education” and the four papers written by contemporary scholars.

Keywords: John Dewey, experience and education: the 60th anniversary edition, translation, annotation

Procedia PDF Downloads 46
61 MSIpred: A Python 2 Package for the Classification of Tumor Microsatellite Instability from Tumor Mutation Annotation Data Using a Support Vector Machine

Authors: Chen Wang, Chun Liang

Abstract:

Microsatellite instability (MSI) is characterized by a high degree of polymorphism in microsatellite (MS) length due to a deficiency in the mismatch repair (MMR) system. MSI is associated with several tumor types, and its status can be considered an important indicator for tumor prognosis. Conventional clinical diagnosis of MSI examines PCR products of a panel of MS markers using electrophoresis (MSI-PCR), which is laborious, time-consuming, and less reliable. This study releases MSIpred, a Python 2 package for the automatic classification of MSI. It computes important somatic mutation features from files in mutation annotation format (MAF) generated from paired tumor-normal exome sequencing data, and then uses these features to predict tumor MSI status with a support vector machine (SVM) classifier trained on MAF files of 1074 tumors belonging to four types. Evaluation of MSIpred on an independent 358-tumor test set achieved an overall accuracy of over 98% and an area under the receiver operating characteristic (ROC) curve of 0.967. These results indicate that MSIpred is a robust pan-cancer MSI classification tool that can serve as a complementary diagnostic to MSI-PCR in MSI diagnosis.

Keywords: microsatellite instability, pan-cancer classification, somatic mutation, support vector machine

Procedia PDF Downloads 91
60 Genome-Wide Functional Analysis of Phosphatase in Cryptococcus neoformans

Authors: Jae-Hyung Jin, Kyung-Tae Lee, Yee-Seul So, Eunji Jeong, Yeonseon Lee, Dongpil Lee, Dong-Gi Lee, Yong-Sun Bahn

Abstract:

Cryptococcus neoformans causes cryptococcal meningoencephalitis, mainly in immunocompromised patients but also in immunocompetent people. However, therapeutic options for treating cryptococcosis are limited. Signaling pathways including the cyclic AMP pathway, the MAPK pathway, and the calcineurin pathway play a central role in regulating the growth, differentiation, and virulence of C. neoformans. To understand the signaling networks regulating the virulence of C. neoformans, we selected 114 putative phosphatase genes, phosphatases being one of the major components of signaling networks, in the genome of C. neoformans. We identified putative phosphatases based on annotation in the C. neoformans var. grubii genome database provided by the Broad Institute and the National Center for Biotechnology Information (NCBI) and performed a BLAST search of phosphatases of Saccharomyces cerevisiae, Aspergillus nidulans, Candida albicans, and Fusarium graminearum against Cryptococcus neoformans. We classified the putative phosphatases into 14 groups based on InterPro phosphatase domain annotation. We constructed 170 signature-tagged gene-deletion strains for 91 putative phosphatases through homologous recombination and examined their phenotypic traits under 30 different in vitro conditions, including growth, differentiation, stress response, antifungal resistance, and virulence-factor production.

Keywords: human fungal pathogen, phosphatase, deletion library, functional genomics

Procedia PDF Downloads 274
59 Glycan Analyzer: Software to Annotate Glycan Structures from Exoglycosidase Experiments

Authors: Ian Walsh, Terry Nguyen-Khuong, Christopher H. Taron, Pauline M. Rudd

Abstract:

Glycoproteins and their covalently bonded glycans play critical roles in the immune system, cell communication, disease and disease prognosis. Ultra performance liquid chromatography (UPLC) coupled with mass spectrometry is conventionally used to qualitatively and quantitatively characterise glycan structures in a given sample. Exoglycosidases are enzymes that catalyze sequential removal of monosaccharides from the non-reducing end of glycans. They naturally have specificity for a particular type of sugar, its stereochemistry (α or β anomer) and its position of attachment to an adjacent sugar on the glycan. Thus, monitoring the peak movements (both in the UPLC and MS1) after application of exoglycosidases provides a unique and effective way to annotate sugars with high detail - i.e. differentiating positional and linkage isomers. Manual annotation of an exoglycosidase experiment is difficult and time consuming. As such, with increasing sample complexity and the number of exoglycosidases, the analysis could result in manually interpreting hundreds of peak movements. Recently, we have implemented pattern recognition software for automated interpretation of UPLC-MS1 exoglycosidase digestions. In this work, we explain the software, indicate how much time it will save and provide example usage showing the annotation of positional and linkage isomers in Immunoglobulin G, apolipoprotein J, and simple glycan standards.

Keywords: bioinformatics, automated glycan assignment, liquid chromatography, mass spectrometry

Procedia PDF Downloads 116
58 Video Object Segmentation for Automatic Image Annotation of Ethernet Connectors with Environment Mapping and 3D Projection

Authors: Marrone Silverio Melo Dantas, Pedro Henrique Dreyer, Gabriel Fonseca Reis de Souza, Daniel Bezerra, Ricardo Souza, Silvia Lins, Judith Kelner, Djamel Fawzi Hadj Sadok

Abstract:

The creation of a dataset is time-consuming and often discourages researchers from pursuing their goals. To overcome this problem, we present and discuss two solutions adopted for the automation of this process. Both save valuable user time and resources and support video object segmentation with object tracking and 3D projection. In our scenario, we acquire images from a moving robotic arm and, for each approach, generate distinct annotated datasets. We evaluated the precision of the annotations by comparing them with a manually annotated dataset, as well as their efficiency in the context of detection and classification problems. For detection support, we used YOLO and obtained, for the projection dataset, F1-score, accuracy, and mAP values of 0.846, 0.924, and 0.875, respectively. For the tracking dataset, we achieved an F1-score of 0.861 and an accuracy of 0.932, while mAP reached 0.894. To evaluate the quality of the annotated images used for classification problems, we employed deep learning architectures, adopting accuracy and F1-score as metrics for VGG, DenseNet, MobileNet, Inception, and ResNet. The VGG architecture outperformed the others for both the projection and tracking datasets. For the projection dataset, it reached an accuracy and F1-score of 0.997 and 0.993, respectively. Similarly, for the tracking dataset, it achieved an accuracy of 0.991 and an F1-score of 0.981.

Keywords: RJ45, automatic annotation, object tracking, 3D projection

Procedia PDF Downloads 68
57 Automatic Reporting System for Transcriptome Indel Identification and Annotation Based on Snapshot of Next-Generation Sequencing Reads Alignment

Authors: Shuo Mu, Guangzhi Jiang, Jinsa Chen

Abstract:

The analysis of Indels in RNA sequencing of clinical samples is easily affected by sequencing experiment errors and software selection. To improve the efficiency and accuracy of the analysis, we developed an automatic reporting system for Indel recognition and annotation based on image snapshots of transcriptome read alignments. The system includes sequence local assembly and realignment, target-point snapshot, and image-based recognition processes. We integrated high-confidence Indel datasets from several known databases as a training set to improve the accuracy of image processing and added a bioinformatics processing module to annotate and filter Indel artifacts. The system then automatically generates a report that includes data quality levels and image results. Sanger sequencing verification of the reference Indel mutations of cell line NA12878 showed that the process can achieve 83% sensitivity and 96% specificity. Analysis of the collected clinical samples showed that the interpretation accuracy of the process was equivalent to that of manual inspection, and the processing efficiency showed a significant improvement. This work shows the feasibility of accurate Indel analysis of clinical next-generation sequencing (NGS) transcriptomes. This result may be useful in the future for RNA studies of clinical samples with microsatellite instability in immunotherapy.
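
As a reminder of what the sensitivity and specificity figures above measure, the sketch below computes both from a comparison of automated calls against a validation method such as Sanger sequencing. The counts are invented for illustration and are not the study's data.

```python
def sensitivity(true_pos, false_neg):
    """Share of validated Indels that the system also called."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Share of validated non-Indel sites that the system correctly rejected."""
    return true_neg / (true_neg + false_pos)


# Hypothetical confusion counts from comparing calls with validation results.
counts = {"true_pos": 8, "false_neg": 2, "true_neg": 18, "false_pos": 2}

sens = sensitivity(counts["true_pos"], counts["false_neg"])
spec = specificity(counts["true_neg"], counts["false_pos"])
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting the two together matters for artifact filtering: a filter that rejects more candidates raises specificity but can lower sensitivity.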

Keywords: automatic reporting, indel, next-generation sequencing, NGS, transcriptome

Procedia PDF Downloads 62
56 TARF: Web Toolkit for Annotating RNA-Related Genomic Features

Authors: Jialin Ma, Jia Meng

Abstract:

Genomic features, expressed as genome-based coordinates, are commonly used to represent biological features such as genes, RNA transcripts, and transcription factor binding sites. For the analysis of RNA-related genomic features, such as RNA modification sites, a common task is to correlate these features with transcript components (5'UTR, CDS, 3'UTR) to explore their distribution characteristics in terms of transcriptomic coordinates, e.g., to examine whether a specific type of biological feature is enriched near transcription start sites. Existing approaches for performing these tasks involve the manipulation of a gene database, conversion from genome-based coordinates to transcript-based coordinates, and visualization methods capable of showing RNA transcript components and the distribution of the features. These steps are complicated and time-consuming, especially for researchers who are not familiar with the relevant tools. To overcome this obstacle, we developed TARF, a dedicated web app whose name stands for web toolkit for annotating RNA-related genomic features. TARF aims to provide a web-based way to easily annotate and visualize RNA-related genomic features. Once a user has uploaded the features in BED format and either specified a built-in transcript database or uploaded a customized gene database in GTF format, the tool performs its three main functions. First, it adds annotation on gene and RNA transcript components. For every feature provided by the user, the overlaps with RNA transcript components are identified, and the information is combined in one table that is available for copying and download. Summary statistics on features with ambiguous assignments are also computed. Second, the tool provides a convenient visualization of the features at the single gene/transcript level. 
For the selected gene, the tool shows the features together with the gene model in a genome-based view, and also maps the features to transcript-based coordinates and shows their distribution along a single spliced RNA transcript. Third, a global transcriptomic view of the genomic features is generated using the Guitar R/Bioconductor package. The distribution of features on RNA transcripts is normalized with respect to RNA transcript landmarks, and the enrichment of the features on different RNA transcript components is demonstrated. We tested the newly developed TARF toolkit with three different types of genomic features related to chromatin H3K4me3, RNA N6-methyladenosine (m6A), and RNA 5-methylcytosine (m5C), which were obtained from ChIP-Seq, MeRIP-Seq, and RNA BS-Seq data, respectively. TARF successfully revealed their respective distribution characteristics, i.e., H3K4me3, m6A, and m5C are enriched near transcription start sites, stop codons, and 5’UTRs, respectively. Overall, TARF is a useful web toolkit for the annotation and visualization of RNA-related genomic features, and should help simplify the analysis of various RNA-related genomic features, especially those related to RNA modifications.
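
The conversion from genome-based to transcript-based coordinates mentioned in this abstract can be illustrated with a minimal Python sketch. It is not TARF's implementation: the function names, the example exons, and the CDS boundaries are all hypothetical, and intervals are assumed to be 0-based and half-open, as in BED.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # genomic interval, 0-based and half-open: [start, end)


def genome_to_transcript(pos: int, exons: List[Exon], strand: str = "+") -> Optional[int]:
    """Map a genomic position to a 0-based position on the spliced transcript.

    Returns None when the position falls in an intron or outside the transcript.
    """
    # Walk the exons in transcription order (descending on the minus strand).
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # spliced length of the exons already passed
    for start, end in ordered:
        if start <= pos < end:
            within = pos - start if strand == "+" else end - 1 - pos
            return offset + within
        offset += end - start
    return None


def component(tx_pos: int, cds_start: int, cds_end: int) -> str:
    """Classify a transcript position; the CDS is [cds_start, cds_end) in transcript coordinates."""
    if tx_pos < cds_start:
        return "5'UTR"
    if tx_pos < cds_end:
        return "CDS"
    return "3'UTR"


# Hypothetical three-exon transcript used only for illustration.
EXAMPLE_EXONS: List[Exon] = [(100, 200), (300, 400), (500, 600)]

if __name__ == "__main__":
    tx_pos = genome_to_transcript(350, EXAMPLE_EXONS)
    print(tx_pos, component(tx_pos, cds_start=50, cds_end=250))
```

A feature that overlaps several transcripts would receive one such mapping per transcript, which is one way the ambiguous assignments mentioned in the abstract could arise.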

Keywords: RNA-related genomic features, annotation, visualization, web server

Procedia PDF Downloads 122
55 Saudi Twitter Corpus for Sentiment Analysis

Authors: Adel Assiri, Ahmed Emam, Hmood Al-Dossari

Abstract:

Sentiment analysis (SA) has received growing attention in Arabic language research. However, few studies have directly applied SA to Arabic, owing to the lack of a publicly available dataset for this language. This paper partially bridges this gap by focusing on one Arabic dialect, the Saudi dialect. It presents an annotated data set of 4700 tweets for Saudi dialect sentiment analysis (K = 0.807). Our next work is to extend this corpus and to create a large-scale lexicon for the Saudi dialect from it.
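
The abstract reports K = 0.807 without naming the statistic. Assuming it denotes Cohen's kappa between two annotators (an assumption, not stated in the source), the following Python sketch shows how such an agreement score is computed. The label lists are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(annotator_a: Sequence[Hashable], annotator_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(annotator_a)
    if n == 0 or n != len(annotator_b):
        raise ValueError("both annotators must label the same non-empty set of items")
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(x == y for x, y in zip(annotator_a, annotator_b)) / n
    # Expected agreement if both labelled independently with their own label rates.
    counts_a, counts_b = Counter(annotator_a), Counter(annotator_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators for four tweets.
LABELS_A = ["pos", "pos", "neg", "neg"]
LABELS_B = ["pos", "neg", "neg", "neg"]

if __name__ == "__main__":
    print(cohen_kappa(LABELS_A, LABELS_B))
```

With more than two annotators, a different statistic such as Fleiss' kappa would be needed. The abstract does not say which setting applies.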

Keywords: Arabic, sentiment analysis, Twitter, annotation

Procedia PDF Downloads 434
54 A Method for Clinical Concept Extraction from Medical Text

Authors: Moshe Wasserblat, Jonathan Mamou, Oren Pereg

Abstract:

Natural Language Processing (NLP) has made a major leap in the last few years in its practical integration into medical solutions, for example, in extracting clinical concepts such as medical conditions, medications, treatments, and symptoms from medical texts. However, training and deploying those models in real environments still demands a large amount of annotated data and NLP/Machine Learning (ML) expertise, which makes this process costly and time-consuming. We present a practical and efficient method for clinical concept extraction that requires neither costly labeled data nor ML expertise. The method includes three steps. Step 1: the user supplies a large in-domain text corpus (e.g., PubMed). The system then builds, in an unsupervised manner, a contextual model containing vector representations of concepts in the corpus (e.g., Phrase2Vec). Step 2: the user provides a seed set of terms representing a specific medical concept (e.g., for the concept of symptoms, the user may provide ‘dry mouth,’ ‘itchy skin,’ and ‘blurred vision’). The system then matches the seed set against the contextual model and extracts the most semantically similar terms (e.g., additional symptoms). The result is a complete set of terms related to the medical concept. Step 3: in production, medical concepts must be extracted from unseen medical text. The system extracts key-phrases from the new text and matches them against the complete set of terms from step 2; the most semantically similar key-phrases are annotated with the same medical concept category. As an example, the seed symptom concepts would result in the following annotation: “The patient complains of fatigue [symptom], dry skin [symptom], and weight loss [symptom], which can be an early sign of diabetes.” Our evaluations show promising results for extracting concepts from medical corpora. 
The method allows medical analysts to easily and efficiently build taxonomies (in step 2) representing their domain-specific concepts, and automatically annotate a large number of texts (in step 3) for classification/summarization of medical reports.
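
Step 2 of the method above, expanding a seed set with semantically similar terms, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' system: the two-dimensional vectors stand in for learned embeddings, and ranking candidates by cosine similarity to the seed centroid is one plausible matching rule, not one the abstract confirms.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def expand_seed_set(seed_terms: List[str],
                    term_vectors: Dict[str, Sequence[float]],
                    top_k: int) -> List[str]:
    """Return the top_k non-seed terms most similar to the centroid of the seed vectors."""
    center = centroid([term_vectors[term] for term in seed_terms])
    candidates = [term for term in term_vectors if term not in seed_terms]
    candidates.sort(key=lambda term: cosine(term_vectors[term], center), reverse=True)
    return candidates[:top_k]


# Hypothetical toy vectors; a real contextual model would be learned from a corpus.
TOY_VECTORS: Dict[str, Sequence[float]] = {
    "dry mouth": [0.90, 0.10],
    "itchy skin": [0.80, 0.20],
    "blurred vision": [0.85, 0.15],
    "fatigue": [0.88, 0.12],
    "aspirin": [0.10, 0.90],
    "insulin": [0.20, 0.80],
}
SEEDS = ["dry mouth", "itchy skin", "blurred vision"]

if __name__ == "__main__":
    print(expand_seed_set(SEEDS, TOY_VECTORS, top_k=1))
```

Step 3 could reuse the same similarity function, comparing key-phrases from unseen text against the expanded term set instead of the seed centroid.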

Keywords: clinical concepts, concept expansion, medical records annotation, medical records summarization

Procedia PDF Downloads 43
53 Reading as Moral Afternoon Tea: An Empirical Study on the Compensation Effect between Literary Novel Reading and Readers’ Moral Motivation

Authors: Chong Jiang, Liang Zhao, Hua Jian, Xiaoguang Wang

Abstract:

The belief that there is a strong relationship between reading narrative and morality has become a basic assumption of scholars, philosophers, critics, and cultural critics. The virtuality constructed by literary novels inspires readers to regard the narrative as a thought experiment, creating a distance between readers and events so that they can freely and morally experience the positions of different roles. Therefore, the virtual narrative combined with literary characteristics is often considered a "moral laboratory." Well-established findings have revealed that people show less lying and deceptive behavior in the morning than in the afternoon, which is called the morning morality effect. As a limited self-regulation resource, morality is steadily depleted over the course of the day under the influence of the morning morality effect. It can also be compensated and restored in various ways, such as eating and sleeping. As a common form of entertainment in modern society, literary novel reading gives people more virtual experience and emotional catharsis, like a relaxing afternoon tea that helps people break away from fast-paced work, restore physical strength, and relieve stress in a short period of leisure. In this paper, inspired by the compensation control theory, we ask whether reading literary novels in the digital environment could replenish a kind of spiritual energy for self-regulation and so compensate for people's moral loss in the afternoon. Based on this assumption, we use the social annotation text generated by readers in digital reading to represent the readers' reading attention. We then recognized the semantics, calculated the readers' moral motivation expressed in the annotations, and investigated the fine-grained dynamics of moral motivation in each time slot within the 24 hours of a day. 
Comparing different divisions of time intervals, extensive experiments showed that the moral motivation reflected in the annotations in the afternoon is significantly higher than that in the morning. The results robustly verified the hypothesis that reading compensates for moral motivation, which we call the moral afternoon tea effect. Moreover, we quantitatively identified that such moral compensation can last until 14:00 in the afternoon and 21:00 in the evening. In addition, it is interesting to find that the unit used to divide the time intervals affects the identification of moral rhythms. Dividing the day into four-hour time slots yields more insight into moral rhythms than three-hour or six-hour time slots.

Keywords: digital reading, social annotation, moral motivation, morning morality effect, control compensation

Procedia PDF Downloads 55
52 Enhancement of Indexing Model for Heterogeneous Multimedia Documents: User Profile Based Approach

Authors: Aicha Aggoune, Abdelkrim Bouramoul, Mohamed Khiereddine Kholladi

Abstract:

Recent research shows that the user profile is an important element that can improve heterogeneous information retrieval through its content. In this context, we present our indexing model for heterogeneous multimedia documents. This model is based on integrating the user profile into the indexing process. The general idea of our proposal is to exploit the concepts common to the representation of a document and the definition of a user through their profile. These two elements will be added as additional indexing entities to enrich the indexes of the heterogeneous corpus documents. We have developed the IRONTO domain ontology, which allows the annotation of documents. We also present the tool developed to validate the proposed model.

Keywords: indexing model, user profile, multimedia document, heterogeneity of sources, ontology

Procedia PDF Downloads 268
51 An Improvement of Multi-Label Image Classification Method Based on Histogram of Oriented Gradient

Authors: Ziad Abdallah, Mohamad Oueidat, Ali El-Zaart

Abstract:

Image Multi-label Classification (IMC) assigns a label or a set of labels to an image. The high demand for image annotation and archiving on the web has led researchers to develop many algorithms for this application domain. The existing techniques for IMC have two drawbacks: neither the description of the elementary characteristics of the image nor the correlation between labels is taken into account. In this paper, we present an algorithm (MIML-HOGLPP) that handles both limitations simultaneously. The algorithm uses the histogram of gradients as its feature descriptor. It applies the Label Priority Power-set as a multi-label transformation to solve the problem of label correlation. The experiment shows that the results of MIML-HOGLPP are better than those of the two existing techniques on some of the evaluation metrics.
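
To illustrate the kind of problem transformation this abstract refers to, here is a minimal Python sketch of the plain label power-set transformation, which turns each distinct set of labels into one class so that label co-occurrence is preserved. The abstract does not describe the priority mechanism of the Label Priority Power-set, so it is not reproduced here. The function names and example labels are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def label_powerset_encode(
    label_sets: Iterable[Iterable[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Map each distinct label set to one class id, in order of first appearance."""
    mapping: Dict[FrozenSet[str], int] = {}
    encoded: List[int] = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in mapping:
            mapping[key] = len(mapping)
        encoded.append(mapping[key])
    return encoded, mapping


def label_powerset_decode(class_id: int, mapping: Dict[FrozenSet[str], int]) -> Set[str]:
    """Recover the label set that a single-label class id stands for."""
    inverse = {value: key for key, value in mapping.items()}
    return set(inverse[class_id])


# Hypothetical image annotations.
EXAMPLE_LABELS = [{"sky", "sea"}, {"sky"}, {"sea", "sky"}, {"tree"}]

if __name__ == "__main__":
    classes, mapping = label_powerset_encode(EXAMPLE_LABELS)
    print(classes)
    print(label_powerset_decode(classes[0], mapping))
```

After this transformation, any single-label classifier can be trained on the class ids, and a predicted id is decoded back into a set of labels.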

Keywords: data mining, information retrieval system, multi-label, problem transformation, histogram of gradients

Procedia PDF Downloads 296
50 A Comparison of YOLO Family for Apple Detection and Counting in Orchards

Authors: Yuanqing Li, Changyi Lei, Zhaopeng Xue, Zhuo Zheng, Yanbo Long

Abstract:

In agricultural production and breeding, implementing automatic picking robots in orchard farming to reduce human labour and error is challenging. Their core function is automatic identification based on machine vision. This paper focuses on apple detection and counting in orchards and implements several deep learning methods. Extensive datasets are used, and a semi-automatic annotation method is proposed. The proposed deep learning models belong to the state-of-the-art YOLO family. Because the models use various backbones, a detailed multi-dimensional comparison is made in terms of counting accuracy, mAP, and model memory, laying the foundation for realising automatic precision agriculture.

Keywords: agricultural object detection, deep learning, machine vision, YOLO family

Procedia PDF Downloads 72
49 Scalable Learning of Tree-Based Models on Sparsely Representable Data

Authors: Fares Hedayatit, Arnauld Joly, Panagiotis Papadimitriou

Abstract:

Many machine learning tasks, such as text annotation, require training over very large datasets, e.g., millions of web documents, that can be represented in a sparse input space. State-of-the-art tree-based ensemble algorithms cannot scale to such datasets, since they include operations whose running time is a function of the input space size rather than of the number of non-zero input elements. In this paper, we propose an efficient splitting algorithm that leverages input sparsity within decision tree methods. Our algorithm improves training time over sparse datasets by more than two orders of magnitude, and it has been incorporated in the current version of scikit-learn, the most popular open source Python machine learning library.
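
The idea of making split search depend on the non-zero elements rather than on the full input size can be sketched in Python as below. This is not the algorithm from the paper or the scikit-learn implementation. It is a simplified illustration that assumes all stored values are strictly positive (for example, term counts), treats the implicit zeros as one block, and uses Gini impurity. All names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def gini(counts: Counter, n: int) -> float:
    """Gini impurity of a node holding n samples with the given class counts."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split_sparse(
    nonzero: List[Tuple[int, float]],
    labels: Sequence[int],
    total_counts: Counter,
) -> Tuple[Optional[float], float]:
    """Find the best threshold for one feature, touching only its non-zero entries.

    nonzero holds (sample_index, value) pairs with value > 0; every other sample
    has an implicit value of 0. total_counts is the class histogram of the node,
    computed once and shared across all features.
    Returns (threshold, weighted_gini), or (None, inf) if no valid split exists.
    """
    n = len(labels)
    entries = sorted(nonzero, key=lambda entry: entry[1])
    nz_counts = Counter(labels[i] for i, _ in entries)

    # Left child starts as the block of zeros, obtained without iterating over it.
    left = Counter(total_counts)
    left.subtract(nz_counts)
    right = Counter(nz_counts)
    n_left, n_right = n - len(entries), len(entries)

    best: Tuple[Optional[float], float] = (None, math.inf)
    prev_value = 0.0
    for i, value in entries:
        # A threshold between prev_value and value is valid if both sides are non-empty.
        if value > prev_value and n_left > 0 and n_right > 0:
            score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
            if score < best[1]:
                best = ((prev_value + value) / 2, score)
        # Move the current sample from the right child to the left child.
        left[labels[i]] += 1
        right[labels[i]] -= 1
        n_left += 1
        n_right -= 1
        prev_value = value
    return best


# Hypothetical node: six samples, only samples 4 and 5 have a non-zero feature value.
EXAMPLE_LABELS = [0, 0, 0, 0, 1, 1]
EXAMPLE_NONZERO = [(4, 2.0), (5, 3.0)]

if __name__ == "__main__":
    print(best_split_sparse(EXAMPLE_NONZERO, EXAMPLE_LABELS, Counter(EXAMPLE_LABELS)))
```

Per feature, the work is dominated by sorting the non-zero entries, so the cost scales with their number rather than with the number of samples in the node.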

Keywords: big data, sparsely representable data, tree-based models, scalable learning

Procedia PDF Downloads 186
48 Opinion Mining and Sentiment Analysis on DEFT

Authors: Najiba Ouled Omar, Azza Harbaoui, Henda Ben Ghezala

Abstract:

Current sentiment analysis research focuses on social networks. The DEfi Fouille de Texte (DEFT) (Text Mining Challenge) evaluation campaign focuses on opinion mining and sentiment analysis on social networks, especially Twitter. It aims to compare the systems produced by several teams from public and private research laboratories. DEFT offers participants the opportunity to work on regularly renewed themes and has proposed opinion mining tasks in several editions. The purpose of this article is to scrutinize and analyze the work on opinion mining and sentiment analysis on the Twitter social network carried out within DEFT. It examines the tasks proposed by the organizers of the challenge and the methods used by the participants.

Keywords: opinion mining, sentiment analysis, emotion, polarity, annotation, OSEE, figurative language, DEFT, Twitter, Tweet

Procedia PDF Downloads 47
47 Number Variation of the Personal Pronoun We in American Spoken English

Authors: Qiong Hu, Ming Yue

Abstract:

Language variation signals the newest usage in a language community, which may become the developmental trend of that language. The personal pronoun we is prescribed as a plural pronoun in grammar, but its number value is more flexible in actual use. Based on a self-built Friends corpus, the present research explores the number value of the first person pronoun we in contemporary spoken American English. Taking the subjectivity of we into consideration, this paper used ‘we + PCU (perception-cognition-utterance) verbs’ collocations and ‘we + plural categories’ as the parameters. Results from corpus data and manual annotation show that: 1) the overall frequency of we has been increasing; 2) we has been increasingly used with other plural categories, indicating a weakening of its plural reference; and 3) we has been increasingly used with PCU verbs of strong subjectivity, indicating a strengthening of its singular reference. All these seem to support our hypothesis that we is undergoing a process of further grammaticalization towards a singular reference, though further evidence is needed to confirm this bold prediction.

Keywords: number, PCU verbs, personal pronoun we

Procedia PDF Downloads 166
46 Isolation and Characterization of a Narrow-Host Range Aeromonas hydrophila Lytic Bacteriophage

Authors: Sumeet Rai, Anuj Tyagi, B. T. Naveen Kumar, Shubhkaramjeet Kaur, Niraj K. Singh

Abstract:

Since their discovery, indiscriminate use of antibiotics in human, veterinary, and aquaculture systems has resulted in the global emergence and spread of multidrug-resistant bacterial pathogens. Thus, the need for alternative approaches to control bacterial infections has become of utmost importance. The high selectivity/specificity of bacteriophages (phages) permits the targeting of specific bacteria without affecting the desirable flora. In this study, a lytic phage (Ahp1) specific to Aeromonas hydrophila subsp. hydrophila was isolated from a finfish aquaculture pond. The host range of Ahp1 was tested against 10 isolates of A. hydrophila, 7 isolates of A. veronii, 25 Vibrio cholerae isolates, 4 V. parahaemolyticus isolates, and one isolate each of V. harveyi and Salmonella enterica collected previously. Except for the host A. hydrophila subsp. hydrophila strain, no lytic activity was detected against any other bacterial isolate. During the adsorption rate and one-step growth curve analysis, 69.7% of phage particles were adsorbed on host cells, followed by the release of 93 ± 6 phage progeny per host cell after a latent period of ~30 min. Phage nucleic acid was extracted by column purification methods. After the phage nucleic acid was determined to be dsDNA, the phage genome was subjected to next-generation sequencing by generating paired-end (PE, 2 x 300 bp) reads on the Illumina MiSeq system. De novo assembly of the sequencing reads generated a circular phage genome of 42,439 bp with a G+C content of 58.95%. During open reading frame (ORF) prediction and annotation, 22 ORFs (out of 49 total predicted ORFs) were functionally annotated, and the rest encoded hypothetical proteins. Proteins involved in major functions such as phage structure formation and packaging, DNA replication and repair, DNA transcription, and host cell lysis were encoded by the phage genome. The complete genome sequence of Ahp1, along with gene annotation, was submitted to NCBI GenBank (accession number MF683623). 
The stability of Ahp1 preparations at storage temperatures of 4 °C, 30 °C, and 40 °C was studied over a period of 9 months. At 40 °C, phage counts declined by 4 log units within one month, with a total loss of viability after 2 months. At 30 °C, the phage preparation was stable for < 5 months. On the other hand, phage counts decreased by only 2 log units over a period of 9 months during storage at 4 °C. As some phages have also been reported to be glycerol sensitive, the stability of Ahp1 preparations in glycerol stocks (0%, 15%, 30%, and 45%) was also studied during storage at -80 °C over a period of 9 months. The phage counts decreased by only 2 log units during storage, and no significant difference in phage counts was observed between the different glycerol concentrations. The Ahp1 phage discovered in our study had a very narrow host range, and it may be useful for phage typing applications. Moreover, the endolysin and holin genes in the Ahp1 genome could be ideal candidates for recombinant cloning and expression of antimicrobial proteins.

Keywords: Aeromonas hydrophila, endolysin, phage, narrow host range

Procedia PDF Downloads 78