Search results for: annotation
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 98

Search results for: annotation

98 Fuzzy Semantic Annotation of Web Resources

Authors: Sahar Maâlej Dammak, Anis Jedidi, Rafik Bouaziz

Abstract:

With the great mass of pages managed through the world, and especially with the advent of the Web, their manual annotation is impossible. We focus, in this paper, on the semiautomatic annotation of the web pages. We propose an approach and a framework for semantic annotation of web pages entitled “Querying Web”. Our solution is an enhancement of the first result of annotation done by the “Semantic Radar” Plug-in on the web resources, by annotations using an enriched domain ontology. The concepts of the result of Semantic Radar may be connected to several terms of the ontology, but connections may be uncertain. We represent annotations as possibility distributions. We use the hierarchy defined in the ontology to compute degrees of possibilities. We want to achieve an automation of the fuzzy semantic annotation of web resources.

Keywords: fuzzy semantic annotation, semantic web, domain ontologies, querying web

Procedia PDF Downloads 337
97 A Method of the Semantic on Image Auto-Annotation

Authors: Lin Huo, Xianwei Liu, Jingxiong Zhou

Abstract:

Recently, due to the existence of semantic gap between image visual features and human concepts, the semantic of image auto-annotation has become an important topic. Firstly, by extract low-level visual features of the image, and the corresponding Hash method, mapping the feature into the corresponding Hash coding, eventually, transformed that into a group of binary string and store it, image auto-annotation by search is a popular method, we can use it to design and implement a method of image semantic auto-annotation. Finally, Through the test based on the Corel image set, and the results show that, this method is effective.

Keywords: image auto-annotation, color correlograms, Hash code, image retrieval

Procedia PDF Downloads 456
96 Towards a Large Scale Deep Semantically Analyzed Corpus for Arabic: Annotation and Evaluation

Authors: S. Alansary, M. Nagi

Abstract:

This paper presents an approach of conducting semantic annotation of Arabic corpus using the Universal Networking Language (UNL) framework. UNL is intended to be a promising strategy for providing a large collection of semantically annotated texts with formal, deep semantics rather than shallow. The result would constitute a semantic resource (semantic graphs) that is editable and that integrates various phenomena, including predicate-argument structure, scope, tense, thematic roles and rhetorical relations, into a single semantic formalism for knowledge representation. The paper will also present the Interactive Analysis​ tool for automatic semantic annotation (IAN). In addition, the cornerstone of the proposed methodology which are the disambiguation and transformation rules, will be presented. Semantic annotation using UNL has been applied to a corpus of 20,000 Arabic sentences representing the most frequent structures in the Arabic Wikipedia. The representation, at different linguistic levels was illustrated starting from the morphological level passing through the syntactic level till the semantic representation is reached. The output has been evaluated using the F-measure. It is 90% accurate. This demonstrates how powerful the formal environment is, as it enables intelligent text processing and search.

Keywords: semantic analysis, semantic annotation, Arabic, universal networking language

Procedia PDF Downloads 556
95 Annotation Ontology for Semantic Web Development

Authors: Hadeel Al Obaidy, Amani Al Heela

Abstract:

The main purpose of this paper is to examine the concept of semantic web and the role that ontology and semantic annotation plays in the development of semantic web services. The paper focuses on semantic web infrastructure illustrating how ontology and annotation work to provide the learning capabilities for building content semantically. To improve productivity and quality of software, the paper applies approaches, notations and techniques offered by software engineering. It proposes a conceptual model to develop semantic web services for the infrastructure of web information retrieval system of digital libraries. The developed system uses ontology and annotation to build a knowledge based system to define and link the meaning of a web content to retrieve information for users’ queries. The results are more relevant through keywords and ontology rule expansion that will be more accurate to satisfy the requested information. The level of results accuracy would be enhanced since the query semantically analyzed work with the conceptual architecture of the proposed system.

Keywords: semantic web services, software engineering, semantic library, knowledge representation, ontology

Procedia PDF Downloads 149
94 Automatic Multi-Label Image Annotation System Guided by Firefly Algorithm and Bayesian Method

Authors: Saad M. Darwish, Mohamed A. El-Iskandarani, Guitar M. Shawkat

Abstract:

Nowadays, the amount of available multimedia data is continuously on the rise. The need to find a required image for an ordinary user is a challenging task. Content based image retrieval (CBIR) computes relevance based on the visual similarity of low-level image features such as color, textures, etc. However, there is a gap between low-level visual features and semantic meanings required by applications. The typical method of bridging the semantic gap is through the automatic image annotation (AIA) that extracts semantic features using machine learning techniques. In this paper, a multi-label image annotation system guided by Firefly and Bayesian method is proposed. Firstly, images are segmented using the maximum variance intra cluster and Firefly algorithm, which is a swarm-based approach with high convergence speed, less computation rate and search for the optimal multiple threshold. Feature extraction techniques based on color features and region properties are applied to obtain the representative features. After that, the images are annotated using translation model based on the Net Bayes system, which is efficient for multi-label learning with high precision and less complexity. Experiments are performed using Corel Database. The results show that the proposed system is better than traditional ones for automatic image annotation and retrieval.

Keywords: feature extraction, feature selection, image annotation, classification

Procedia PDF Downloads 556
93 The Omani Learner of English Corpus: Source and Tools

Authors: Anood Al-Shibli

Abstract:

Designing a learner corpus is not an easy task to accomplish because dealing with learners’ language has many variables which might affect the results of any study based on learners’ language production (spoken and written). Also, it is very essential to systematically design a learner corpus especially when it is aimed to be a reference to language research. Therefore, designing the Omani Learner Corpus (OLEC) has undergone many explicit and systematic considerations. These criteria can be regarded as the foundation to design any learner corpus to be exploited effectively in language use and language learning studies. Added to that, OLEC is manually error-annotated corpus. Error-annotation in learner corpora is very essential; however, it is time-consuming and prone to errors. Consequently, a navigating tool is designed to help the annotators to insert errors’ codes in order to make the error-annotation process more efficient and consistent. To assure accuracy, error annotation procedure is followed to annotate OLEC and some preliminary findings are noted. One of the main results of this procedure is creating an error-annotation system based on the Omani learners of English language production. Because OLEC is still in the first stages, the primary findings are related to only one level of proficiency and one error type which is verb related errors. It is found that Omani learners in OLEC has the tendency to have more errors in forming the verb and followed by problems in agreement of verb. Comparing the results to other error-based studies indicate that the Omani learners tend to have basic verb errors which can found in lower-level of proficiency. To this end, it is essential to admit that examining learners’ errors can give insights to language acquisition and language learning and most errors do not happen randomly but they occur systematically among language learners.

Keywords: error-annotation system, error-annotation manual, learner corpora, verbs related errors

Procedia PDF Downloads 107
92 The Automatisation of Dictionary-Based Annotation in a Parallel Corpus of Old English

Authors: Ana Elvira Ojanguren Lopez, Javier Martin Arista

Abstract:

The aims of this paper are to present the automatisation procedure adopted in the implementation of a parallel corpus of Old English, as well as, to assess the progress of automatisation with respect to tagging, annotation, and lemmatisation. The corpus consists of an aligned parallel text with word-for-word comparison Old English-English that provides the Old English segment with inflectional form tagging (gloss, lemma, category, and inflection) and lemma annotation (spelling, meaning, inflectional class, paradigm, word-formation and secondary sources). This parallel corpus is intended to fill a gap in the field of Old English, in which no parallel and/or lemmatised corpora are available, while the average amount of corpus annotation is low. With this background, this presentation has two main parts. The first part, which focuses on tagging and annotation, selects the layouts and fields of lexical databases that are relevant for these tasks. Most information used for the annotation of the corpus can be retrieved from the lexical and morphological database Nerthus and the database of secondary sources Freya. These are the sources of linguistic and metalinguistic information that will be used for the annotation of the lemmas of the corpus, including morphological and semantic aspects as well as the references to the secondary sources that deal with the lemmas in question. Although substantially adapted and re-interpreted, the lemmatised part of these databases draws on the standard dictionaries of Old English, including The Student's Dictionary of Anglo-Saxon, An Anglo-Saxon Dictionary, and A Concise Anglo-Saxon Dictionary. The second part of this paper deals with lemmatisation. It presents the lemmatiser Norna, which has been implemented on Filemaker software. It is based on a concordance and an index to the Dictionary of Old English Corpus, which comprises around three thousand texts and three million words. In its present state, the lemmatiser Norna can assign lemma to around 80% of textual forms on an automatic basis, by searching the index and the concordance for prefixes, stems and inflectional endings. The conclusions of this presentation insist on the limits of the automatisation of dictionary-based annotation in a parallel corpus. While the tagging and annotation are largely automatic even at the present stage, the automatisation of alignment is pending for future research. Lemmatisation and morphological tagging are expected to be fully automatic in the near future, once the database of secondary sources Freya and the lemmatiser Norna have been completed.

Keywords: corpus linguistics, historical linguistics, old English, parallel corpus

Procedia PDF Downloads 169
91 Extraction of Text Subtitles in Multimedia Systems

Authors: Amarjit Singh

Abstract:

In this paper, a method for extraction of text subtitles in large video is proposed. The video data needs to be annotated for many multimedia applications. Text is incorporated in digital video for the motive of providing useful information about that video. So need arises to detect text present in video to understanding and video indexing. This is achieved in two steps. First step is text localization and the second step is text verification. The method of text detection can be extended to text recognition which finds applications in automatic video indexing; video annotation and content based video retrieval. The method has been tested on various types of videos.

Keywords: video, subtitles, extraction, annotation, frames

Procedia PDF Downloads 566
90 BingleSeq: A User-Friendly R Package for Single-Cell RNA-Seq Data Analysis

Authors: Quan Gu, Daniel Dimitrov

Abstract:

BingleSeq was developed as a shiny-based, intuitive, and comprehensive application that enables the analysis of single-Cell RNA-Sequencing count data. This was achieved via incorporating three state-of-the-art software packages for each type of RNA sequencing analysis, alongside functional annotation analysis and a way to assess the overlap of differential expression method results. At its current state, the functionality implemented within BingleSeq is comparable to that of other applications, also developed with the purpose of lowering the entry requirements to RNA Sequencing analyses. BingleSeq is available on GitHub and will be submitted to R/Bioconductor.

Keywords: bioinformatics, functional annotation analysis, single-cell RNA-sequencing, transcriptomics

Procedia PDF Downloads 160
89 Tagging a corpus of Media Interviews with Diplomats: Challenges and Solutions

Authors: Roberta Facchinetti, Sara Corrizzato, Silvia Cavalieri

Abstract:

Increasing interconnection between data digitalization and linguistic investigation has given rise to unprecedented potentialities and challenges for corpus linguists, who need to master IT tools for data analysis and text processing, as well as to develop techniques for efficient and reliable annotation in specific mark-up languages that encode documents in a format that is both human and machine-readable. In the present paper, the challenges emerging from the compilation of a linguistic corpus will be taken into consideration, focusing on the English language in particular. To do so, the case study of the InterDiplo corpus will be illustrated. The corpus, currently under development at the University of Verona (Italy), represents a novelty in terms both of the data included and of the tag set used for its annotation. The corpus covers media interviews and debates with diplomats and international operators conversing in English with journalists who do not share the same lingua-cultural background as their interviewees. To date, this appears to be the first tagged corpus of international institutional spoken discourse and will be an important database not only for linguists interested in corpus analysis but also for experts operating in international relations. In the present paper, special attention will be dedicated to the structural mark-up, parts of speech annotation, and tagging of discursive traits, that are the innovational parts of the project being the result of a thorough study to find the best solution to suit the analytical needs of the data. Several aspects will be addressed, with special attention to the tagging of the speakers’ identity, the communicative events, and anthropophagic. Prominence will be given to the annotation of question/answer exchanges to investigate the interlocutors’ choices and how such choices impact communication. Indeed, the automated identification of questions, in relation to the expected answers, is functional to understand how interviewers elicit information as well as how interviewees provide their answers to fulfill their respective communicative aims. A detailed description of the aforementioned elements will be given using the InterDiplo-Covid19 pilot corpus. The data yielded by our preliminary analysis of the data will highlight the viable solutions found in the construction of the corpus in terms of XML conversion, metadata definition, tagging system, and discursive-pragmatic annotation to be included via Oxygen.

Keywords: spoken corpus, diplomats’ interviews, tagging system, discursive-pragmatic annotation, english linguistics

Procedia PDF Downloads 152
88 Contextual Sentiment Analysis with Untrained Annotators

Authors: Lucas A. Silva, Carla R. Aguiar

Abstract:

This work presents a proposal to perform contextual sentiment analysis using a supervised learning algorithm and disregarding the extensive training of annotators. To achieve this goal, a web platform was developed to perform the entire procedure outlined in this paper. The main contribution of the pipeline described in this article is to simplify and automate the annotation process through a system of analysis of congruence between the notes. This ensured satisfactory results even without using specialized annotators in the context of the research, avoiding the generation of biased training data for the classifiers. For this, a case study was conducted in a blog of entrepreneurship. The experimental results were consistent with the literature related annotation using formalized process with experts.

Keywords: sentiment analysis, untrained annotators, naive bayes, entrepreneurship, contextualized classifier

Procedia PDF Downloads 359
87 A Framework for Secure Information Flow Analysis in Web Applications

Authors: Ralph Adaimy, Wassim El-Hajj, Ghassen Ben Brahim, Hazem Hajj, Haidar Safa

Abstract:

Huge amounts of data and personal information are being sent to and retrieved from web applications on daily basis. Every application has its own confidentiality and integrity policies. Violating these policies can have broad negative impact on the involved company’s financial status, while enforcing them is very hard even for the developers with good security background. In this paper, we propose a framework that enforces security-by-construction in web applications. Minimal developer effort is required, in a sense that the developer only needs to annotate database attributes by a security class. The web application code is then converted into an intermediary representation, called Extended Program Dependence Graph (EPDG). Using the EPDG, the provided annotations are propagated to the application code and run against generic security enforcement rules that were carefully designed to detect insecure information flows as early as they occur. As a result, any violation in the data’s confidentiality or integrity policies is reported. As a proof of concept, two PHP web applications, Hotel Reservation and Auction, were used for testing and validation. The proposed system was able to catch all the existing insecure information flows at their source. Moreover and to highlight the simplicity of the suggested approaches vs. existing approaches, two professional web developers assessed the annotation tasks needed in the presented case studies and provided a very positive feedback on the simplicity of the annotation task.

Keywords: web applications security, secure information flow, program dependence graph, database annotation

Procedia PDF Downloads 435
86 Grammatically Coded Corpus of Spoken Lithuanian: Methodology and Development

Authors: L. Kamandulytė-Merfeldienė

Abstract:

The paper deals with the main issues of methodology of the Corpus of Spoken Lithuanian which was started to be developed in 2006. At present, the corpus consists of 300,000 grammatically annotated word forms. The creation of the corpus consists of three main stages: collecting the data, the transcription of the recorded data, and the grammatical annotation. Collecting the data was based on the principles of balance and naturality. The recorded speech was transcribed according to the CHAT requirements of CHILDES. The transcripts were double-checked and annotated grammatically using CHILDES. The development of the Corpus of Spoken Lithuanian has led to the constant increase in studies on spontaneous communication, and various papers have dealt with a distribution of parts of speech, use of different grammatical forms, variation of inflectional paradigms, distribution of fillers, syntactic functions of adjectives, the mean length of utterances.

Keywords: CHILDES, corpus of spoken Lithuanian, grammatical annotation, grammatical disambiguation, lexicon, Lithuanian

Procedia PDF Downloads 204
85 Attitude in Academic Writing (CAAW): Corpus Compilation and Annotation

Authors: Hortènsia Curell, Ana Fernández-Montraveta

Abstract:

This paper presents the creation, development, and analysis of a corpus designed to study the presence of attitude markers and author’s stance in research articles in two different areas of linguistics (theoretical linguistics and sociolinguistics). These two disciplines are expected to behave differently in this respect, given the disparity in their discursive conventions. Attitude markers in this work are understood as the linguistic elements (adjectives, nouns and verbs) used to convey the writer's stance towards the content presented in the article, and are crucial in understanding writer-reader interaction and the writer's position. These attitude markers are divided into three broad classes: assessment, significance, and emotion. In addition to them, we also consider first-person singular and plural pronouns and possessives, modal verbs, and passive constructions, which are other linguistic elements expressing the author’s stance. The corpus, Corpus of Attitude in Academic Writing (CAAW), comprises a collection of 21 articles, collected from six journals indexed in JCR. These articles were originally written in English by a single native-speaker author from the UK or USA and were published between 2022 and 2023. The total number of words in the corpus is approximately 222,400, with 106,422 from theoretical linguistics (Lingua, Linguistic Inquiry and Journal of Linguistics) and 116,022 from sociolinguistics journals (International Journal of the Sociology of Language, Language in Society and Journal of Sociolinguistics). Together with the corpus, we present the tool created for the creation and storage of the corpus, along with a tool for automatic annotation. The steps followed in the compilation of the corpus are as follows. First, the articles were selected according to the parameters explained above. Second, they were downloaded and converted to txt format. Finally, examples, direct quotes, section titles and references were eliminated, since they do not involve the author’s stance. The resulting texts were the input for the annotation of the linguistic features related to stance. As for the annotation, two articles (one from each subdiscipline) were annotated manually by the two researchers. An existing list was used as a baseline, and other attitude markers were identified, together with the other elements mentioned above. Once a consensus was reached, the rest of articles were annotated automatically using the tool created for this purpose. The annotated corpus will serve as a resource for scholars working in discourse analysis (both in linguistics and communication) and related fields, since it offers new insights into the expression of attitude. The tools created for the compilation and annotation of the corpus will be useful to study author’s attitude and stance in articles from any academic discipline: new data can be uploaded and the list of markers can be enlarged. Finally, the tool can be expanded to other languages, which will allow cross-linguistic studies of author’s stance.

Keywords: academic writing, attitude, corpus, english

Procedia PDF Downloads 22
84 The Publishing Process and Results of the Chinese Annotated Edition of John Dewey’s “Experience and Education: The 60th Anniversary Edition”

Authors: Wen-jing Shan

Abstract:

The Chinese annotated edition of “Experience and education: The 60th anniversary edition,” originally written in English by John Dewey (1859-1952), was published in 2015 by this author. A report of the process and results of the translation and annotation of the book is the purpose of this paper. It is worth mentioning that the original 1938 edition was considered as the best concise statement on education by John Dewey, one the most important educational theorists of the twentieth century. One of the features of this The 60th anniversary edition is that the original publisher, Kappa Delta Pi International Honor Society, invited four contemporary Deweyan scholars who had been awarded the Society’s Laureate Scholar to write a review of the book published by Dewey, who was the first to receive this honor. The four scholars are Maxine Greene(1917-2014), Philip W. Jackson(1928-2015), Linda Darling-Hammond(1951-), and O. L. Davis, Jr.(1928-). The original 1938 edition, the best concise statement on education by the most important educational theorist of the twentieth century, was translated into Chinese for five times after its publication in the U.S.A, three in the 1940s, one in the 1990s, and one in 2010s. Nonetheless, the five translations have few or no annotations and have some flaws of mis-interpretations and lack of information. The author retranslated and annotated the book to make the interpretations more faithful, expressive, and elegant, and providing the readers with more understanding and more correct information. This author started the project of translation and annotation sponsored by Taiwan Ministry of Science and Technology in August 2011 and finished and published by July 2015. The work, the author, did was divided into three stages. First, in the preparatory stage of the project, the summary of each chapter, the rationale of the book, the textual commentary, the development of the original and Chinese editions, and reviews and criticisms, as well as Dewey’s biography and bibliography were initially investigated. Secondly, on the basis of the above preliminary work, the translation with annotation of Experience and Education, an epitome of Dewey’s biography and bibliography, a chronology, and a critical introduction for the Experience and Education were written. In the critical introduction, Dewey’s philosophy of experience and educational ideas will be examined along the timeline of human thought. And the vast literature about Dewey and his work will be instrumental to reveal the historical significance of Experience and Education on the modern age and make the critical introduction more knowledgeable. Third, the final stage took another two years to review and revise the draft of the work and send it for publication. There are two parts in the book. The first part is a scholarly introduction including Dewey’s chronicle (in short form), Dewey’s mind, people and life, the importance of “Experience and education”, the necessity of re-translation and re-annotation of “Experience and education” into Chinese. The second part is the re-translation and re-annotation version, including Dewey’s “Experience and education” and four papers written by contemporary scholars.

Keywords: John Dewey, experience and education: the 60th anniversary edition, translation, annotation

Procedia PDF Downloads 120
83 MSIpred: A Python 2 Package for the Classification of Tumor Microsatellite Instability from Tumor Mutation Annotation Data Using a Support Vector Machine

Authors: Chen Wang, Chun Liang

Abstract:

Microsatellite instability (MSI) is characterized by high degree of polymorphism in microsatellite (MS) length due to a deficiency in mismatch repair (MMR) system. MSI is associated with several tumor types and its status can be considered as an important indicator for tumor prognostic. Conventional clinical diagnosis of MSI examines PCR products of a panel of MS markers using electrophoresis (MSI-PCR) which is laborious, time consuming, and less reliable. MSIpred, a python 2 package for automatic classification of MSI was released by this study. It computes important somatic mutation features from files in mutation annotation format (MAF) generated from paired tumor-normal exome sequencing data, subsequently using these to predict tumor MSI status with a support vector machine (SVM) classifier trained by MAF files of 1074 tumors belonging to four types. Evaluation of MSIpred on an independent 358-tumor test set achieved overall accuracy of over 98% and area under receiver operating characteristic (ROC) curve of 0.967. These results indicated that MSIpred is a robust pan-cancer MSI classification tool and can serve as a complementary diagnostic to MSI-PCR in MSI diagnosis.

Keywords: microsatellite instability, pan-cancer classification, somatic mutation, support vector machine

Procedia PDF Downloads 134
82 Domain Adaptation Save Lives - Drowning Detection in Swimming Pool Scene Based on YOLOV8 Improved by Gaussian Poisson Generative Adversarial Network Augmentation

Authors: Simiao Ren, En Wei

Abstract:

Drowning is a significant safety issue worldwide, and a robust computer vision-based alert system can easily prevent such tragedies in swimming pools. However, due to domain shift caused by the visual gap (potentially due to lighting, indoor scene change, pool floor color etc.) between the training swimming pool and the test swimming pool, the robustness of such algorithms has been questionable. The annotation cost for labeling each new swimming pool is too expensive for mass adoption of such a technique. To address this issue, we propose a domain-aware data augmentation pipeline based on Gaussian Poisson Generative Adversarial Network (GP-GAN). Combined with YOLOv8, we demonstrate that such a domain adaptation technique can significantly improve the model performance (from 0.24 mAP to 0.82 mAP) on new test scenes. As the augmentation method only require background imagery from the new domain (no annotation needed), we believe this is a promising, practical route for preventing swimming pool drowning.

Keywords: computer vision, deep learning, YOLOv8, detection, swimming pool, drowning, domain adaptation, generative adversarial network, GAN, GP-GAN

Procedia PDF Downloads 56
81 Comparison of Rumen Microbial Analysis Pipelines Based on 16s rRNA Gene Sequencing

Authors: Xiaoxing Ye

Abstract:

To investigate complex rumen microbial communities, 16S ribosomal RNA (rRNA) sequencing is widely used. Here, we evaluated the impact of bioinformatics pipelines on the observation of OTUs and taxonomic classification of 750 cattle rumen microbial samples by comparing three commonly used pipelines (LotuS, UPARSE, and QIIME) with Usearch. In LotuS-based analyses, 189 archaeal and 3894 bacterial OTUs were observed. The observed OTUs for the Usearch analysis were significantly larger than the LotuS results. We discovered 1495 OTUs for archaea and 92665 OTUs for bacteria using Usearch analysis. In addition, taxonomic assignments were made for the rumen microbial samples. All pipelines had consistent taxonomic annotations from the phylum to the genus level. A difference in relative abundance was calculated for all microbial levels, including Bacteroidetes (QIIME: 72.2%, Usearch: 74.09%), Firmicutes (QIIME: 18.3%, Usearch: 20.20%) for the bacterial phylum, Methanobacteriales (QIIME: 64.2%, Usearch: 45.7%) for the archaeal class, Methanobacteriaceae (QIIME: 35%, Usearch: 45.7%) and Methanomassiliicoccaceae (QIIME: 35%, Usearch: 31.13%) for archaeal family. However, the most prevalent archaeal class varied between these two annotation pipelines. The Thermoplasmata was the top class according to the QIIME annotation, whereas Methanobacteria was the top class according to Usearch.

Keywords: cattle rumen, rumen microbial, 16S rRNA gene sequencing, bioinformatics pipeline

Procedia PDF Downloads 46
80 Computational Pipeline for Lynch Syndrome Detection: Integrating Alignment, Variant Calling, and Annotations

Authors: Rofida Gamal, Mostafa Mohammed, Mariam Adel, Marwa Gamal, Marwa kamal, Ayat Saber, Maha Mamdouh, Amira Emad, Mai Ramadan

Abstract:

Lynch Syndrome is an inherited genetic condition associated with an increased risk of colorectal and other cancers. Detecting Lynch Syndrome in individuals is crucial for early intervention and preventive measures. This study proposes a computational pipeline for Lynch Syndrome detection by integrating alignment, variant calling, and annotation. The pipeline leverages popular tools such as FastQC, Trimmomatic, BWA, bcftools, and ANNOVAR to process the input FASTQ file, perform quality trimming, align reads to the reference genome, call variants, and annotate them. It is believed that the computational pipeline was applied to a dataset of Lynch Syndrome cases, and its performance was evaluated. It is believed that the quality check step ensured the integrity of the sequencing data, while the trimming process is thought to have removed low-quality bases and adaptors. In the alignment step, it is believed that the reads were accurately mapped to the reference genome, and the subsequent variant calling step is believed to have identified potential genetic variants. The annotation step is believed to have provided functional insights into the detected variants, including their effects on known Lynch Syndrome-associated genes. The results obtained from the pipeline revealed Lynch Syndrome-related positions in the genome, providing valuable information for further investigation and clinical decision-making. The pipeline's effectiveness was demonstrated through its ability to streamline the analysis workflow and identify potential genetic markers associated with Lynch Syndrome. It is believed that the computational pipeline presents a comprehensive and efficient approach to Lynch Syndrome detection, contributing to early diagnosis and intervention. The modularity and flexibility of the pipeline are believed to enable customization and adaptation to various datasets and research settings. Further optimization and validation are believed to be necessary to enhance performance and applicability across diverse populations.

Keywords: Lynch Syndrome, computational pipeline, alignment, variant calling, annotation, genetic markers

Procedia PDF Downloads 36
79 Genome-Wide Functional Analysis of Phosphatase in Cryptococcus neoformans

Authors: Jae-Hyung Jin, Kyung-Tae Lee, Yee-Seul So, Eunji Jeong, Yeonseon Lee, Dongpil Lee, Dong-Gi Lee, Yong-Sun Bahn

Abstract:

Cryptococcus neoformans causes cryptococcal meningoencephalitis mainly in immunocompromised patients as well as immunocompetent people. But therapeutic options are limited to treat cryptococcosis. Some signaling pathways including cyclic AMP pathway, MAPK pathway, and calcineurin pathway play a central role in the regulation of the growth, differentiation, and virulence of C. neoformans. To understand signaling networks regulating the virulence of C. neoformans, we selected the 114 putative phosphatase genes, one of the major components of signaling networks, in the genome of C. neoformans. We identified putative phosphatases based on annotation in C. neoformans var. grubii genome database provided by the Broad Institute and National Center for Biotechnology Information (NCBI) and performed a BLAST search of phosphatases of Saccharomyces cerevisiae, Aspergillus nidulans, Candida albicans and Fusarium graminearum to Cryptococcus neoformans. We classified putative phosphatases into 14 groups based on InterPro phosphatase domain annotation. Here, we constructed 170 signature-tagged gene-deletion strains through homologous recombination methods for 91 putative phosphatases. We examined their phenotypic traits under 30 different in vitro conditions, including growth, differentiation, stress response, antifungal resistance and virulence-factor production.

Keywords: human fungal pathogen, phosphatase, deletion library, functional genomics

Procedia PDF Downloads 329
78 A Local Tensor Clustering Algorithm to Annotate Uncharacterized Genes with Many Biological Networks

Authors: Paul Shize Li, Frank Alber

Abstract:

A fundamental task of clinical genomics is to unravel the functions of genes and their associations with disorders. Although experimental biology has made efforts to discover and elucidate the molecular mechanisms of individual genes in the past decades, still about 40% of human genes have unknown functions, not to mention the diseases they may be related to. For those biologists who are interested in a particular gene with unknown functions, a powerful computational method tailored for inferring the functions and disease relevance of uncharacterized genes is strongly needed. Studies have shown that genes strongly linked to each other in multiple biological networks are more likely to have similar functions. This indicates that the densely connected subgraphs in multiple biological networks are useful in the functional and phenotypic annotation of uncharacterized genes. Therefore, in this work, we have developed an integrative network approach to identify the frequent local clusters, which are defined as those densely connected subgraphs that frequently occur in multiple biological networks and consist of the query gene that has few or no disease or function annotations. This is a local clustering algorithm that models multiple biological networks sharing the same gene set as a three-dimensional matrix, the so-called tensor, and employs the tensor-based optimization method to efficiently find the frequent local clusters. Specifically, massive public gene expression data sets that comprehensively cover dynamic, physiological, and environmental conditions are used to generate hundreds of gene co-expression networks. By integrating these gene co-expression networks, for a given uncharacterized gene that is of biologist’s interest, the proposed method can be applied to identify the frequent local clusters that consist of this uncharacterized gene. Finally, those frequent local clusters are used for function and disease annotation of this uncharacterized gene. This local tensor clustering algorithm outperformed the competing tensor-based algorithm in both module discovery and running time. We also demonstrated the use of the proposed method on real data of hundreds of gene co-expression data and showed that it can comprehensively characterize the query gene. Therefore, this study provides a new tool for annotating the uncharacterized genes and has great potential to assist clinical genomic diagnostics.

Keywords: local tensor clustering, query gene, gene co-expression network, gene annotation

Procedia PDF Downloads 101
77 Glycan Analyzer: Software to Annotate Glycan Structures from Exoglycosidase Experiments

Authors: Ian Walsh, Terry Nguyen-Khuong, Christopher H. Taron, Pauline M. Rudd

Abstract:

Glycoproteins and their covalently bonded glycans play critical roles in the immune system, cell communication, disease and disease prognosis. Ultra performance liquid chromatography (UPLC) coupled with mass spectrometry is conventionally used to qualitatively and quantitatively characterise glycan structures in a given sample. Exoglycosidases are enzymes that catalyze sequential removal of monosaccharides from the non-reducing end of glycans. They naturally have specificity for a particular type of sugar, its stereochemistry (α or β anomer) and its position of attachment to an adjacent sugar on the glycan. Thus, monitoring the peak movements (both in the UPLC and MS1) after application of exoglycosidases provides a unique and effective way to annotate sugars with high detail - i.e. differentiating positional and linkage isomers. Manual annotation of an exoglycosidase experiment is difficult and time consuming. As such, with increasing sample complexity and the number of exoglycosidases, the analysis could result in manually interpreting hundreds of peak movements. Recently, we have implemented pattern recognition software for automated interpretation of UPLC-MS1 exoglycosidase digestions. In this work, we explain the software, indicate how much time it will save and provide example usage showing the annotation of positional and linkage isomers in Immunoglobulin G, apolipoprotein J, and simple glycan standards.

Keywords: bioinformatics, automated glycan assignment, liquid chromatography, mass spectrometry

Procedia PDF Downloads 168
76 Video Object Segmentation for Automatic Image Annotation of Ethernet Connectors with Environment Mapping and 3D Projection

Authors: Marrone Silverio Melo Dantas Pedro Henrique Dreyer, Gabriel Fonseca Reis de Souza, Daniel Bezerra, Ricardo Souza, Silvia Lins, Judith Kelner, Djamel Fawzi Hadj Sadok

Abstract:

The creation of a dataset is time-consuming and often discourages researchers from pursuing their goals. To overcome this problem, we present and discuss two solutions adopted for the automation of this process. Both optimize valuable user time and resources and support video object segmentation with object tracking and 3D projection. In our scenario, we acquire images from a moving robotic arm and, for each approach, generate distinct annotated datasets. We evaluated the precision of the annotations by comparing these with a manually annotated dataset, as well as the efficiency in the context of detection and classification problems. For detection support, we used YOLO and obtained for the projection dataset an F1-Score, accuracy, and mAP values of 0.846, 0.924, and 0.875, respectively. Concerning the tracking dataset, we achieved an F1-Score of 0.861, an accuracy of 0.932, whereas mAP reached 0.894. In order to evaluate the quality of the annotated images used for classification problems, we employed deep learning architectures. We adopted metrics accuracy and F1-Score, for VGG, DenseNet, MobileNet, Inception, and ResNet. The VGG architecture outperformed the others for both projection and tracking datasets. It reached an accuracy and F1-score of 0.997 and 0.993, respectively. Similarly, for the tracking dataset, it achieved an accuracy of 0.991 and an F1-Score of 0.981.

Keywords: RJ45, automatic annotation, object tracking, 3D projection

Procedia PDF Downloads 132
75 PaSA: A Dataset for Patent Sentiment Analysis to Highlight Patent Paragraphs

Authors: Renukswamy Chikkamath, Vishvapalsinhji Ramsinh Parmar, Christoph Hewel, Markus Endres

Abstract:

Given a patent document, identifying distinct semantic annotations is an interesting research aspect. Text annotation helps the patent practitioners such as examiners and patent attorneys to quickly identify the key arguments of any invention, successively providing a timely marking of a patent text. In the process of manual patent analysis, to attain better readability, recognising the semantic information by marking paragraphs is in practice. This semantic annotation process is laborious and time-consuming. To alleviate such a problem, we proposed a dataset to train machine learning algorithms to automate the highlighting process. The contributions of this work are: i) we developed a multi-class dataset of size 150k samples by traversing USPTO patents over a decade, ii) articulated statistics and distributions of data using imperative exploratory data analysis, iii) baseline Machine Learning models are developed to utilize the dataset to address patent paragraph highlighting task, and iv) future path to extend this work using Deep Learning and domain-specific pre-trained language models to develop a tool to highlight is provided. This work assists patent practitioners in highlighting semantic information automatically and aids in creating a sustainable and efficient patent analysis using the aptitude of machine learning.

Keywords: machine learning, patents, patent sentiment analysis, patent information retrieval

Procedia PDF Downloads 61
74 Automatic Reporting System for Transcriptome Indel Identification and Annotation Based on Snapshot of Next-Generation Sequencing Reads Alignment

Authors: Shuo Mu, Guangzhi Jiang, Jinsa Chen

Abstract:

The analysis of Indel for RNA sequencing of clinical samples is easily affected by sequencing experiment errors and software selection. In order to improve the efficiency and accuracy of analysis, we developed an automatic reporting system for Indel recognition and annotation based on image snapshot of transcriptome reads alignment. This system includes sequence local-assembly and realignment, target point snapshot, and image-based recognition processes. We integrated high-confidence Indel dataset from several known databases as a training set to improve the accuracy of image processing and added a bioinformatical processing module to annotate and filter Indel artifacts. Subsequently, the system will automatically generate data, including data quality levels and images results report. Sanger sequencing verification of the reference Indel mutation of cell line NA12878 showed that the process can achieve 83% sensitivity and 96% specificity. Analysis of the collected clinical samples showed that the interpretation accuracy of the process was equivalent to that of manual inspection, and the processing efficiency showed a significant improvement. This work shows the feasibility of accurate Indel analysis of clinical next-generation sequencing (NGS) transcriptome. This result may be useful for RNA study for clinical samples with microsatellite instability in immunotherapy in the future.

Keywords: automatic reporting, indel, next-generation sequencing, NGS, transcriptome

Procedia PDF Downloads 147
73 TARF: Web Toolkit for Annotating RNA-Related Genomic Features

Authors: Jialin Ma, Jia Meng

Abstract:

Genomic features, the genome-based coordinates, are commonly used for the representation of biological features such as genes, RNA transcripts and transcription factor binding sites. For the analysis of RNA-related genomic features, such as RNA modification sites, a common task is to correlate these features with transcript components (5'UTR, CDS, 3'UTR) to explore their distribution characteristics in terms of transcriptomic coordinates, e.g., to examine whether a specific type of biological feature is enriched near transcription start sites. Existing approaches for performing these tasks involve the manipulation of a gene database, conversion from genome-based coordinate to transcript-based coordinate, and visualization methods that are capable of showing RNA transcript components and distribution of the features. These steps are complicated and time consuming, and this is especially true for researchers who are not familiar with relevant tools. To overcome this obstacle, we develop a dedicated web app TARF, which represents web toolkit for annotating RNA-related genomic features. TARF web tool intends to provide a web-based way to easily annotate and visualize RNA-related genomic features. Once a user has uploaded the features with BED format and specified a built-in transcript database or uploaded a customized gene database with GTF format, the tool could fulfill its three main functions. First, it adds annotation on gene and RNA transcript components. For every features provided by the user, the overlapping with RNA transcript components are identified, and the information is combined in one table which is available for copy and download. Summary statistics about ambiguous belongings are also carried out. Second, the tool provides a convenient visualization method of the features on single gene/transcript level. For the selected gene, the tool shows the features with gene model on genome-based view, and also maps the features to transcript-based coordinate and show the distribution against one single spliced RNA transcript. Third, a global transcriptomic view of the genomic features is generated utilizing the Guitar R/Bioconductor package. The distribution of features on RNA transcripts are normalized with respect to RNA transcript landmarks and the enrichment of the features on different RNA transcript components is demonstrated. We tested the newly developed TARF toolkit with 3 different types of genomics features related to chromatin H3K4me3, RNA N6-methyladenosine (m6A) and RNA 5-methylcytosine (m5C), which are obtained from ChIP-Seq, MeRIP-Seq and RNA BS-Seq data, respectively. TARF successfully revealed their respective distribution characteristics, i.e. H3K4me3, m6A and m5C are enriched near transcription starting sites, stop codons and 5’UTRs, respectively. Overall, TARF is a useful web toolkit for annotation and visualization of RNA-related genomic features, and should help simplify the analysis of various RNA-related genomic features, especially those related RNA modifications.

Keywords: RNA-related genomic features, annotation, visualization, web server

Procedia PDF Downloads 180
72 Mobi-DiQ: A Pervasive Sensing System for Delirium Risk Assessment in Intensive Care Unit

Authors: Subhash Nerella, Ziyuan Guan, Azra Bihorac, Parisa Rashidi

Abstract:

Intensive care units (ICUs) provide care to critically ill patients in severe and life-threatening conditions. However, patient monitoring in the ICU is limited by the time and resource constraints imposed on healthcare providers. Many critical care indices such as mobility are still manually assessed, which can be subjective, prone to human errors, and lack granularity. Other important aspects, such as environmental factors, are not monitored at all. For example, critically ill patients often experience circadian disruptions due to the absence of effective environmental “timekeepers” such as the light/dark cycle and the systemic effect of acute illness on chronobiologic markers. Although the occurrence of delirium is associated with circadian disruption risk factors, these factors are not routinely monitored in the ICU. Hence, there is a critical unmet need to develop systems for precise and real-time assessment through novel enabling technologies. We have developed the mobility and circadian disruption quantification system (Mobi-DiQ) by augmenting biomarker and clinical data with pervasive sensing data to generate mobility and circadian cues related to mobility, nightly disruptions, and light and noise exposure. We hypothesize that Mobi-DiQ can provide accurate mobility and circadian cues that correlate with bedside clinical mobility assessments and circadian biomarkers, ultimately important for delirium risk assessment and prevention. The collected multimodal dataset consists of depth images, Electromyography (EMG) data, patient extremity movement captured by accelerometers, ambient light levels, Sound Pressure Level (SPL), and indoor air quality measured by volatile organic compounds, and the equivalent CO₂ concentration. For delirium risk assessment, the system recognizes mobility cues (axial body movement features and body key points) and circadian cues, including nightly disruptions, ambient SPL, and light intensity, as well as other environmental factors such as indoor air quality. The Mobi-DiQ system consists of three major components: the pervasive sensing system, a data storage and analysis server, and a data annotation system. For data collection, six local pervasive sensing systems were deployed, including a local computer and sensors. A video recording tool with graphical user interface (GUI) developed in python was used to capture depth image frames for analyzing patient mobility. All sensor data is encrypted, then automatically uploaded to the Mobi-DiQ server through a secured VPN connection. Several data pipelines are developed to automate the data transfer, curation, and data preparation for annotation and model training. The data curation and post-processing are performed on the server. A custom secure annotation tool with GUI was developed to annotate depth activity data. The annotation tool is linked to the MongoDB database to record the data annotation and to provide summarization. Docker containers are also utilized to manage services and pipelines running on the server in an isolated manner. The processed clinical data and annotations are used to train and develop real-time pervasive sensing systems to augment clinical decision-making and promote targeted interventions. In the future, we intend to evaluate our system as a clinical implementation trial, as well as to refine and validate it by using other data sources, including neurological data obtained through continuous electroencephalography (EEG).

Keywords: deep learning, delirium, healthcare, pervasive sensing

Procedia PDF Downloads 62
71 Saudi Twitter Corpus for Sentiment Analysis

Authors: Adel Assiri, Ahmed Emam, Hmood Al-Dossari

Abstract:

Sentiment analysis (SA) has received growing attention in Arabic language research. However, few studies have yet to directly apply SA to Arabic due to lack of a publicly available dataset for this language. This paper partially bridges this gap due to its focus on one of the Arabic dialects which is the Saudi dialect. This paper presents annotated data set of 4700 for Saudi dialect sentiment analysis with (K= 0.807). Our next work is to extend this corpus and creation a large-scale lexicon for Saudi dialect from the corpus.

Keywords: Arabic, sentiment analysis, Twitter, annotation

Procedia PDF Downloads 589
70 A Method for Clinical Concept Extraction from Medical Text

Authors: Moshe Wasserblat, Jonathan Mamou, Oren Pereg

Abstract:

Natural Language Processing (NLP) has made a major leap in the last few years, in practical integration into medical solutions; for example, extracting clinical concepts from medical texts such as medical condition, medication, treatment, and symptoms. However, training and deploying those models in real environments still demands a large amount of annotated data and NLP/Machine Learning (ML) expertise, which makes this process costly and time-consuming. We present a practical and efficient method for clinical concept extraction that does not require costly labeled data nor ML expertise. The method includes three steps: Step 1- the user injects a large in-domain text corpus (e.g., PubMed). Then, the system builds a contextual model containing vector representations of concepts in the corpus, in an unsupervised manner (e.g., Phrase2Vec). Step 2- the user provides a seed set of terms representing a specific medical concept (e.g., for the concept of the symptoms, the user may provide: ‘dry mouth,’ ‘itchy skin,’ and ‘blurred vision’). Then, the system matches the seed set against the contextual model and extracts the most semantically similar terms (e.g., additional symptoms). The result is a complete set of terms related to the medical concept. Step 3 –in production, there is a need to extract medical concepts from the unseen medical text. The system extracts key-phrases from the new text, then matches them against the complete set of terms from step 2, and the most semantically similar will be annotated with the same medical concept category. As an example, the seed symptom concepts would result in the following annotation: “The patient complaints on fatigue [symptom], dry skin [symptom], and Weight loss [symptom], which can be an early sign for Diabetes.” Our evaluations show promising results for extracting concepts from medical corpora. The method allows medical analysts to easily and efficiently build taxonomies (in step 2) representing their domain-specific concepts, and automatically annotate a large number of texts (in step 3) for classification/summarization of medical reports.

Keywords: clinical concepts, concept expansion, medical records annotation, medical records summarization

Procedia PDF Downloads 102
69 Variables, Annotation, and Metadata Schemas for Early Modern Greek

Authors: Eleni Karantzola, Athanasios Karasimos, Vasiliki Makri, Ioanna Skouvara

Abstract:

Historical linguistics unveils the historical depth of languages and traces variation and change by analyzing linguistic variables over time. This field of linguistics usually deals with a closed data set that can only be expanded by the (re)discovery of previously unknown manuscripts or editions. In some cases, it is possible to use (almost) the entire closed corpus of a language for research, as is the case with the Thesaurus Linguae Graecae digital library for Ancient Greek, which contains most of the extant ancient Greek literature. However, concerning ‘dynamic’ periods when the production and circulation of texts in printed as well as manuscript form have not been fully mapped, representative samples and corpora of texts are needed. Such material and tools are utterly lacking for Early Modern Greek (16th-18th c.). In this study, the principles of the creation of EMoGReC, a pilot representative corpus of Early Modern Greek (16th-18th c.) are presented. Its design follows the fundamental principles of historical corpora. The selection of texts aims to create a representative and balanced corpus that gives insight into diachronic, diatopic and diaphasic variation. The pilot sample includes data derived from fully machine-readable vernacular texts, which belong to 4-5 different textual genres and come from different geographical areas. We develop a hierarchical linguistic annotation scheme, further customized to fit the characteristics of our text corpus. Regarding variables and their variants, we use as a point of departure the bundle of twenty-four features (or categories of features) for prose demotic texts of the 16th c. Tags are introduced bearing the variants [+old/archaic] or [+novel/vernacular]. On the other hand, further phenomena that are underway (cf. The Cambridge Grammar of Medieval and Early Modern Greek) are selected for tagging. The annotated texts are enriched with metalinguistic and sociolinguistic metadata to provide a testbed for the development of the first comprehensive set of tools for the Greek language of that period. Based on a relational management system with interconnection of data, annotations, and their metadata, the EMoGReC database aspires to join a state-of-the-art technological ecosystem for the research of observed language variation and change using advanced computational approaches.

Keywords: early modern Greek, variation and change, representative corpus, diachronic variables.

Procedia PDF Downloads 29