Search results for: datasets
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 697

Search results for: datasets

577 Genomic Sequence Representation Learning: An Analysis of K-Mer Vector Embedding Dimensionality

Authors: James Jr. Mashiyane, Risuna Nkolele, Stephanie J. Müller, Gciniwe S. Dlamini, Rebone L. Meraba, Darlington S. Mapiye

Abstract:

When performing language tasks in natural language processing (NLP), the dimensionality of word embeddings is chosen either ad-hoc or is calculated by optimizing the Pairwise Inner Product (PIP) loss. The PIP loss is a metric that measures the dissimilarity between word embeddings, and it is obtained through matrix perturbation theory by utilizing the unitary invariance of word embeddings. Unlike in natural language, in genomics, especially in genome sequence processing, unlike in natural language processing, there is no notion of a “word,” but rather, there are sequence substrings of length k called k-mers. K-mers sizes matter, and they vary depending on the goal of the task at hand. The dimensionality of word embeddings in NLP has been studied using the matrix perturbation theory and the PIP loss. In this paper, the sufficiency and reliability of applying word-embedding algorithms to various genomic sequence datasets are investigated to understand the relationship between the k-mer size and their embedding dimension. This is completed by studying the scaling capability of three embedding algorithms, namely Latent Semantic analysis (LSA), Word2Vec, and Global Vectors (GloVe), with respect to the k-mer size. Utilising the PIP loss as a metric to train embeddings on different datasets, we also show that Word2Vec outperforms LSA and GloVe in accurate computing embeddings as both the k-mer size and vocabulary increase. Finally, the shortcomings of natural language processing embedding algorithms in performing genomic tasks are discussed.

Keywords: word embeddings, k-mer embedding, dimensionality reduction

Procedia PDF Downloads 95
576 Big Data in Telecom Industry: Effective Predictive Techniques on Call Detail Records

Authors: Sara ElElimy, Samir Moustafa

Abstract:

Mobile network operators start to face many challenges in the digital era, especially with high demands from customers. Since mobile network operators are considered a source of big data, traditional techniques are not effective with new era of big data, Internet of things (IoT) and 5G; as a result, handling effectively different big datasets becomes a vital task for operators with the continuous growth of data and moving from long term evolution (LTE) to 5G. So, there is an urgent need for effective Big data analytics to predict future demands, traffic, and network performance to full fill the requirements of the fifth generation of mobile network technology. In this paper, we introduce data science techniques using machine learning and deep learning algorithms: the autoregressive integrated moving average (ARIMA), Bayesian-based curve fitting, and recurrent neural network (RNN) are employed for a data-driven application to mobile network operators. The main framework included in models are identification parameters of each model, estimation, prediction, and final data-driven application of this prediction from business and network performance applications. These models are applied to Telecom Italia Big Data challenge call detail records (CDRs) datasets. The performance of these models is found out using a specific well-known evaluation criteria shows that ARIMA (machine learning-based model) is more accurate as a predictive model in such a dataset than the RNN (deep learning model).

Keywords: big data analytics, machine learning, CDRs, 5G

Procedia PDF Downloads 110
575 Data Clustering Algorithm Based on Multi-Objective Periodic Bacterial Foraging Optimization with Two Learning Archives

Authors: Chen Guo, Heng Tang, Ben Niu

Abstract:

Clustering splits objects into different groups based on similarity, making the objects have higher similarity in the same group and lower similarity in different groups. Thus, clustering can be treated as an optimization problem to maximize the intra-cluster similarity or inter-cluster dissimilarity. In real-world applications, the datasets often have some complex characteristics: sparse, overlap, high dimensionality, etc. When facing these datasets, simultaneously optimizing two or more objectives can obtain better clustering results than optimizing one objective. However, except for the objectives weighting methods, traditional clustering approaches have difficulty in solving multi-objective data clustering problems. Due to this, evolutionary multi-objective optimization algorithms are investigated by researchers to optimize multiple clustering objectives. In this paper, the Data Clustering algorithm based on Multi-objective Periodic Bacterial Foraging Optimization with two Learning Archives (DC-MPBFOLA) is proposed. Specifically, first, to reduce the high computing complexity of the original BFO, periodic BFO is employed as the basic algorithmic framework. Then transfer the periodic BFO into a multi-objective type. Second, two learning strategies are proposed based on the two learning archives to guide the bacterial swarm to move in a better direction. On the one hand, the global best is selected from the global learning archive according to the convergence index and diversity index. On the other hand, the personal best is selected from the personal learning archive according to the sum of weighted objectives. According to the aforementioned learning strategies, a chemotaxis operation is designed. Third, an elite learning strategy is designed to provide fresh power to the objects in two learning archives. When the objects in these two archives do not change for two consecutive times, randomly initializing one dimension of objects can prevent the proposed algorithm from falling into local optima. Fourth, to validate the performance of the proposed algorithm, DC-MPBFOLA is compared with four state-of-art evolutionary multi-objective optimization algorithms and one classical clustering algorithm on evaluation indexes of datasets. To further verify the effectiveness and feasibility of designed strategies in DC-MPBFOLA, variants of DC-MPBFOLA are also proposed. Experimental results demonstrate that DC-MPBFOLA outperforms its competitors regarding all evaluation indexes and clustering partitions. These results also indicate that the designed strategies positively influence the performance improvement of the original BFO.

Keywords: data clustering, multi-objective optimization, bacterial foraging optimization, learning archives

Procedia PDF Downloads 110
574 A Multi-Release Software Reliability Growth Models Incorporating Imperfect Debugging and Change-Point under the Simulated Testing Environment and Software Release Time

Authors: Sujit Kumar Pradhan, Anil Kumar, Vijay Kumar

Abstract:

The testing process of the software during the software development time is a crucial step as it makes the software more efficient and dependable. To estimate software’s reliability through the mean value function, many software reliability growth models (SRGMs) were developed under the assumption that operating and testing environments are the same. Practically, it is not true because when the software works in a natural field environment, the reliability of the software differs. This article discussed an SRGM comprising change-point and imperfect debugging in a simulated testing environment. Later on, we extended it in a multi-release direction. Initially, the software was released to the market with few features. According to the market’s demand, the software company upgraded the current version by adding new features as time passed. Therefore, we have proposed a generalized multi-release SRGM where change-point and imperfect debugging concepts have been addressed in a simulated testing environment. The failure-increasing rate concept has been adopted to determine the change point for each software release. Based on nine goodness-of-fit criteria, the proposed model is validated on two real datasets. The results demonstrate that the proposed model fits the datasets better. We have also discussed the optimal release time of the software through a cost model by assuming that the testing and debugging costs are time-dependent.

Keywords: software reliability growth models, non-homogeneous Poisson process, multi-release software, mean value function, change-point, environmental factors

Procedia PDF Downloads 45
573 Profiling Risky Code Using Machine Learning

Authors: Zunaira Zaman, David Bohannon

Abstract:

This study explores the application of machine learning (ML) for detecting security vulnerabilities in source code. The research aims to assist organizations with large application portfolios and limited security testing capabilities in prioritizing security activities. ML-based approaches offer benefits such as increased confidence scores, false positives and negatives tuning, and automated feedback. The initial approach using natural language processing techniques to extract features achieved 86% accuracy during the training phase but suffered from overfitting and performed poorly on unseen datasets during testing. To address these issues, the study proposes using the abstract syntax tree (AST) for Java and C++ codebases to capture code semantics and structure and generate path-context representations for each function. The Code2Vec model architecture is used to learn distributed representations of source code snippets for training a machine-learning classifier for vulnerability prediction. The study evaluates the performance of the proposed methodology using two datasets and compares the results with existing approaches. The Devign dataset yielded 60% accuracy in predicting vulnerable code snippets and helped resist overfitting, while the Juliet Test Suite predicted specific vulnerabilities such as OS-Command Injection, Cryptographic, and Cross-Site Scripting vulnerabilities. The Code2Vec model achieved 75% accuracy and a 98% recall rate in predicting OS-Command Injection vulnerabilities. The study concludes that even partial AST representations of source code can be useful for vulnerability prediction. The approach has the potential for automated intelligent analysis of source code, including vulnerability prediction on unseen source code. State-of-the-art models using natural language processing techniques and CNN models with ensemble modelling techniques did not generalize well on unseen data and faced overfitting issues. However, predicting vulnerabilities in source code using machine learning poses challenges such as high dimensionality and complexity of source code, imbalanced datasets, and identifying specific types of vulnerabilities. Future work will address these challenges and expand the scope of the research.

Keywords: code embeddings, neural networks, natural language processing, OS command injection, software security, code properties

Procedia PDF Downloads 74
572 Imputation of Incomplete Large-Scale Monitoring Count Data via Penalized Estimation

Authors: Mohamed Dakki, Genevieve Robin, Marie Suet, Abdeljebbar Qninba, Mohamed A. El Agbani, Asmâa Ouassou, Rhimou El Hamoumi, Hichem Azafzaf, Sami Rebah, Claudia Feltrup-Azafzaf, Nafouel Hamouda, Wed a.L. Ibrahim, Hosni H. Asran, Amr A. Elhady, Haitham Ibrahim, Khaled Etayeb, Essam Bouras, Almokhtar Saied, Ashrof Glidan, Bakar M. Habib, Mohamed S. Sayoud, Nadjiba Bendjedda, Laura Dami, Clemence Deschamps, Elie Gaget, Jean-Yves Mondain-Monval, Pierre Defos Du Rau

Abstract:

In biodiversity monitoring, large datasets are becoming more and more widely available and are increasingly used globally to estimate species trends and con- servation status. These large-scale datasets challenge existing statistical analysis methods, many of which are not adapted to their size, incompleteness and heterogeneity. The development of scalable methods to impute missing data in incomplete large-scale monitoring datasets is crucial to balance sampling in time or space and thus better inform conservation policies. We developed a new method based on penalized Poisson models to impute and analyse incomplete monitoring data in a large-scale framework. The method al- lows parameterization of (a) space and time factors, (b) the main effects of predic- tor covariates, as well as (c) space–time interactions. It also benefits from robust statistical and computational capability in large-scale settings. The method was tested extensively on both simulated and real-life waterbird data, with the findings revealing that it outperforms six existing methods in terms of missing data imputation errors. Applying the method to 16 waterbird species, we estimated their long-term trends for the first time at the entire North African scale, a region where monitoring data suffer from many gaps in space and time series. This new approach opens promising perspectives to increase the accuracy of species-abundance trend estimations. We made it freely available in the r package ‘lori’ (https://CRAN.R-project.org/package=lori) and recommend its use for large- scale count data, particularly in citizen science monitoring programmes.

Keywords: biodiversity monitoring, high-dimensional statistics, incomplete count data, missing data imputation, waterbird trends in North-Africa

Procedia PDF Downloads 117
571 Data Augmentation for Early-Stage Lung Nodules Using Deep Image Prior and Pix2pix

Authors: Qasim Munye, Juned Islam, Haseeb Qureshi, Syed Jung

Abstract:

Lung nodules are commonly identified in computed tomography (CT) scans by experienced radiologists at a relatively late stage. Early diagnosis can greatly increase survival. We propose using a pix2pix conditional generative adversarial network to generate realistic images simulating early-stage lung nodule growth. We have applied deep images prior to 2341 slices from 895 computed tomography (CT) scans from the Lung Image Database Consortium (LIDC) dataset to generate pseudo-healthy medical images. From these images, 819 were chosen to train a pix2pix network. We observed that for most of the images, the pix2pix network was able to generate images where the nodule increased in size and intensity across epochs. To evaluate the images, 400 generated images were chosen at random and shown to a medical student beside their corresponding original image. Of these 400 generated images, 384 were defined as satisfactory - meaning they resembled a nodule and were visually similar to the corresponding image. We believe that this generated dataset could be used as training data for neural networks to detect lung nodules at an early stage or to improve the accuracy of such networks. This is particularly significant as datasets containing the growth of early-stage nodules are scarce. This project shows that the combination of deep image prior and generative models could potentially open the door to creating larger datasets than currently possible and has the potential to increase the accuracy of medical classification tasks.

Keywords: medical technology, artificial intelligence, radiology, lung cancer

Procedia PDF Downloads 39
570 Lost Maritime Culture in the Netherlands: Linking Material and Immaterial Datasets for a Modern Day Perception of the Late Medieval Maritime Cultural Landscape of the Zuiderzee Region

Authors: Y. T. van Popta

Abstract:

This paper focuses on the never thoroughly examined yet in native relevant late medieval maritime cultural landscape of the former Zuiderzee (A.D. 1170-1932) in the center part of the Netherlands. Especially the northeastern part of the region, nowadays known as the Noordoostpolder, testifies of the dynamic battle of the Dutch against the water. This highly dynamic maritime region developed from a lake district into a sea and eventually into a polder. By linking physical and cognitive datasets from the Noordoostpol-der region in a spatial environment, new information on a late medieval maritime culture is brought to light, giving the opportunity to: (i) create a modern day perception on the late medieval maritime cultural landscape of the region and (ii) to underline the value of interdisciplinary and spatial research in maritime archaeology in general. Since the large scale reclamations of the region (A.D. 1932-1968), many remains have been discovered of a drowned and eroded late medieval maritime culture, represented by lost islands, drowned settlements, cultivated lands, shipwrecks and socio-economic networks. Recent archaeological research has proved the existence of this late medieval maritime culture by the discovery of the remains of the drowned settlement Fenehuysen (Veenhuizen) and its surroundings. The fact that this settlement and its cultivated surroundings remained hidden for so long proves that a large part of the maritime cultural landscape is ‘invisible’ and can only be found by extensive interdisciplinary research.

Keywords: drowned settlements, late middle ages, lost islands, maritime cultural landscape, the Netherlands

Procedia PDF Downloads 175
569 Using Machine Learning to Build a Real-Time COVID-19 Mask Safety Monitor

Authors: Yash Jain

Abstract:

The US Center for Disease Control has recommended wearing masks to slow the spread of the virus. The research uses a video feed from a camera to conduct real-time classifications of whether or not a human is correctly wearing a mask, incorrectly wearing a mask, or not wearing a mask at all. Utilizing two distinct datasets from the open-source website Kaggle, a mask detection network had been trained. The first dataset that was used to train the model was titled 'Face Mask Detection' on Kaggle, where the dataset was retrieved from and the second dataset was titled 'Face Mask Dataset, which provided the data in a (YOLO Format)' so that the TinyYoloV3 model could be trained. Based on the data from Kaggle, two machine learning models were implemented and trained: a Tiny YoloV3 Real-time model and a two-stage neural network classifier. The two-stage neural network classifier had a first step of identifying distinct faces within the image, and the second step was a classifier to detect the state of the mask on the face and whether it was worn correctly, incorrectly, or no mask at all. The TinyYoloV3 was used for the live feed as well as for a comparison standpoint against the previous two-stage classifier and was trained using the darknet neural network framework. The two-stage classifier attained a mean average precision (MAP) of 80%, while the model trained using TinyYoloV3 real-time detection had a mean average precision (MAP) of 59%. Overall, both models were able to correctly classify stages/scenarios of no mask, mask, and incorrectly worn masks.

Keywords: datasets, classifier, mask-detection, real-time, TinyYoloV3, two-stage neural network classifier

Procedia PDF Downloads 127
568 Incorporating Lexical-Semantic Knowledge into Convolutional Neural Network Framework for Pediatric Disease Diagnosis

Authors: Xiaocong Liu, Huazhen Wang, Ting He, Xiaozheng Li, Weihan Zhang, Jian Chen

Abstract:

The utilization of electronic medical record (EMR) data to establish the disease diagnosis model has become an important research content of biomedical informatics. Deep learning can automatically extract features from the massive data, which brings about breakthroughs in the study of EMR data. The challenge is that deep learning lacks semantic knowledge, which leads to impracticability in medical science. This research proposes a method of incorporating lexical-semantic knowledge from abundant entities into a convolutional neural network (CNN) framework for pediatric disease diagnosis. Firstly, medical terms are vectorized into Lexical Semantic Vectors (LSV), which are concatenated with the embedded word vectors of word2vec to enrich the feature representation. Secondly, the semantic distribution of medical terms serves as Semantic Decision Guide (SDG) for the optimization of deep learning models. The study evaluate the performance of LSV-SDG-CNN model on four kinds of Chinese EMR datasets. Additionally, CNN, LSV-CNN, and SDG-CNN are designed as baseline models for comparison. The experimental results show that LSV-SDG-CNN model outperforms baseline models on four kinds of Chinese EMR datasets. The best configuration of the model yielded an F1 score of 86.20%. The results clearly demonstrate that CNN has been effectively guided and optimized by lexical-semantic knowledge, and LSV-SDG-CNN model improves the disease classification accuracy with a clear margin.

Keywords: convolutional neural network, electronic medical record, feature representation, lexical semantics, semantic decision

Procedia PDF Downloads 107
567 Enhanced Multi-Scale Feature Extraction Using a DCNN by Proposing Dynamic Soft Margin SoftMax for Face Emotion Detection

Authors: Armin Nabaei, M. Omair Ahmad, M. N. S. Swamy

Abstract:

Many facial expression and emotion recognition methods in the traditional approaches of using LDA, PCA, and EBGM have been proposed. In recent years deep learning models have provided a unique platform addressing by automatically extracting the features for the detection of facial expression and emotions. However, deep networks require large training datasets to extract automatic features effectively. In this work, we propose an efficient emotion detection algorithm using face images when only small datasets are available for training. We design a deep network whose feature extraction capability is enhanced by utilizing several parallel modules between the input and output of the network, each focusing on the extraction of different types of coarse features with fined grained details to break the symmetry of produced information. In fact, we leverage long range dependencies, which is one of the main drawback of CNNs. We develop this work by introducing a Dynamic Soft-Margin SoftMax.The conventional SoftMax suffers from reaching to gold labels very soon, which take the model to over-fitting. Because it’s not able to determine adequately discriminant feature vectors for some variant class labels. We reduced the risk of over-fitting by using a dynamic shape of input tensor instead of static in SoftMax layer with specifying a desired Soft- Margin. In fact, it acts as a controller to how hard the model should work to push dissimilar embedding vectors apart. For the proposed Categorical Loss, by the objective of compacting the same class labels and separating different class labels in the normalized log domain.We select penalty for those predictions with high divergence from ground-truth labels.So, we shorten correct feature vectors and enlarge false prediction tensors, it means we assign more weights for those classes with conjunction to each other (namely, “hard labels to learn”). By doing this work, we constrain the model to generate more discriminate feature vectors for variant class labels. Finally, for the proposed optimizer, our focus is on solving weak convergence of Adam optimizer for a non-convex problem. Our noteworthy optimizer is working by an alternative updating gradient procedure with an exponential weighted moving average function for faster convergence and exploiting a weight decay method to help drastically reducing the learning rate near optima to reach the dominant local minimum. We demonstrate the superiority of our proposed work by surpassing the first rank of three widely used Facial Expression Recognition datasets with 93.30% on FER-2013, and 16% improvement compare to the first rank after 10 years, reaching to 90.73% on RAF-DB, and 100% k-fold average accuracy for CK+ dataset, and shown to provide a top performance to that provided by other networks, which require much larger training datasets.

Keywords: computer vision, facial expression recognition, machine learning, algorithms, depp learning, neural networks

Procedia PDF Downloads 51
566 Tractography Analysis of the Evolutionary Origin of Schizophrenia

Authors: Asmaa Tahiri, Mouktafi Amine

Abstract:

A substantial number of traditional medical research has been put forward to managing and treating mental disorders. At the present time, to our best knowledge, it is believed that fundamental understanding of the underlying causes of the majority psychological disorders needs to be explored further to inform early diagnosis, managing symptoms and treatment. The emerging field of evolutionary psychology is a promising prospect to address the origin of mental disorders, potentially leading to more effective treatments. Schizophrenia as a topical mental disorder has been linked to the evolutionary adaptation of the human brain represented in the brain connectivity and asymmetry directly linked to humans higher brain cognition in contrast to other primates being our direct living representation of the structure and connectivity of our earliest common African ancestors. As proposed in the evolutionary psychology scientific literature the pathophysiology of schizophrenia is expressed and directly linked to altered connectivity between the Hippocampal Formation (HF) and Dorsolateral Prefrontal Cortex (DLPFC). This research paper presents the results of the use of tractography analysis using multiple open access Diffusion Weighted Imaging (DWI) datasets of healthy subjects, schizophrenia-affected subjects and primates to illustrate the relevance of the aforementioned brain regions connectivity and the underlying evolutionary changes in the human brain. Deterministic fiber tracking and streamline analysis were used to generate connectivity matrices from the DWI datasets overlaid to compute distances and highlight disconnectivity patterns in conjunction with other fiber tracking metrics; Fractional Anisotropy (FA), Mean Diffusivity (MD) and Radial Diffusivity (RD).

Keywords: tractography, evolutionary psychology, schizophrenia, brain connectivity

Procedia PDF Downloads 31
565 Differentially Expressed Genes in Atopic Dermatitis: Bioinformatics Analysis Of Pooled Microarray Gene Expression Datasets In Gene Expression Omnibus

Authors: Danna Jia, Bin Li

Abstract:

Background: Atopic dermatitis (AD) is a chronic and refractory inflammatory skin disease characterized by relapsing eczematous and pruritic skin lesions. The global prevalence of AD ranges from 1~ 20%, and its incidence rates are increasing. It affects individuals from infancy to adulthood, significantly impacting their daily lives and social activities. Despite its major health burden, the precise mechanisms underlying AD remain unknown. Understanding the genetic differences associated with AD is crucial for advancing diagnosis and targeted treatment development. This study aims to identify candidate genes of AD by using bioinformatics analysis. Methods: We conducted a comprehensive analysis of four pooled transcriptomic datasets (GSE16161, GSE32924, GSE130588, and GSE120721) obtained from the Gene Expression Omnibus (GEO) database. Differential gene expression analysis was performed using the R statistical language. The differentially expressed genes (DEGs) between AD patients and normal individuals were functionally analyzed using Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment. Furthermore, a protein-protein interaction (PPI) network was constructed to identify candidate genes. Results: Among the patient-level gene expression datasets, we identified 114 shared DEGs, consisting of 53 upregulated genes and 61 downregulated genes. Functional analysis using GO and KEGG revealed that the DEGs were mainly associated with the negative regulation of transcription from RNA polymerase II promoter, membrane-related functions, protein binding, and the Human papillomavirus infection pathway. Through the PPI network analysis, we identified eight core genes: CD44, STAT1, HMMR, AURKA, MKI67, and SMARCA4. Conclusion: This study elucidates key genes associated with AD, providing potential targets for diagnosis and treatment. The identified genes have the potential to contribute to the understanding and management of AD. The bioinformatics analysis conducted in this study offers new insights and directions for further research on AD. Future studies can focus on validating the functional roles of these genes and exploring their therapeutic potential in AD. While these findings will require further verification as achieved with experiments involving in vivo and in vitro models, these results provided some initial insights into dysfunctional inflammatory and immune responses associated with AD. Such information offers the potential to develop novel therapeutic targets for use in preventing and treating AD.

Keywords: atopic dermatitis, bioinformatics, biomarkers, genes

Procedia PDF Downloads 51
564 Shark Detection and Classification with Deep Learning

Authors: Jeremy Jenrette, Z. Y. C. Liu, Pranav Chimote, Edward Fox, Trevor Hastie, Francesco Ferretti

Abstract:

Suitable shark conservation depends on well-informed population assessments. Direct methods such as scientific surveys and fisheries monitoring are adequate for defining population statuses, but species-specific indices of abundance and distribution coming from these sources are rare for most shark species. We can rapidly fill these information gaps by boosting media-based remote monitoring efforts with machine learning and automation. We created a database of shark images by sourcing 24,546 images covering 219 species of sharks from the web application spark pulse and the social network Instagram. We used object detection to extract shark features and inflate this database to 53,345 images. We packaged object-detection and image classification models into a Shark Detector bundle. We developed the Shark Detector to recognize and classify sharks from videos and images using transfer learning and convolutional neural networks (CNNs). We applied these models to common data-generation approaches of sharks: boosting training datasets, processing baited remote camera footage and online videos, and data-mining Instagram. We examined the accuracy of each model and tested genus and species prediction correctness as a result of training data quantity. The Shark Detector located sharks in baited remote footage and YouTube videos with an average accuracy of 89\%, and classified located subjects to the species level with 69\% accuracy (n =\ eight species). The Shark Detector sorted heterogeneous datasets of images sourced from Instagram with 91\% accuracy and classified species with 70\% accuracy (n =\ 17 species). Data-mining Instagram can inflate training datasets and increase the Shark Detector’s accuracy as well as facilitate archiving of historical and novel shark observations. Base accuracy of genus prediction was 68\% across 25 genera. The average base accuracy of species prediction within each genus class was 85\%. The Shark Detector can classify 45 species. All data-generation methods were processed without manual interaction. As media-based remote monitoring strives to dominate methods for observing sharks in nature, we developed an open-source Shark Detector to facilitate common identification applications. Prediction accuracy of the software pipeline increases as more images are added to the training dataset. We provide public access to the software on our GitHub page.

Keywords: classification, data mining, Instagram, remote monitoring, sharks

Procedia PDF Downloads 89
563 Fake News Detection Based on Fusion of Domain Knowledge and Expert Knowledge

Authors: Yulan Wu

Abstract:

The spread of fake news on social media has posed significant societal harm to the public and the nation, with its threats spanning various domains, including politics, economics, health, and more. News on social media often covers multiple domains, and existing models studied by researchers and relevant organizations often perform well on datasets from a single domain. However, when these methods are applied to social platforms with news spanning multiple domains, their performance significantly deteriorates. Existing research has attempted to enhance the detection performance of multi-domain datasets by adding single-domain labels to the data. However, these methods overlook the fact that a news article typically belongs to multiple domains, leading to the loss of domain knowledge information contained within the news text. To address this issue, research has found that news records in different domains often use different vocabularies to describe their content. In this paper, we propose a fake news detection framework that combines domain knowledge and expert knowledge. Firstly, it utilizes an unsupervised domain discovery module to generate a low-dimensional vector for each news article, representing domain embeddings, which can retain multi-domain knowledge of the news content. Then, a feature extraction module uses the domain embeddings discovered through unsupervised domain knowledge to guide multiple experts in extracting news knowledge for the total feature representation. Finally, a classifier is used to determine whether the news is fake or not. Experiments show that this approach can improve multi-domain fake news detection performance while reducing the cost of manually labeling domain labels.

Keywords: fake news, deep learning, natural language processing, multiple domains

Procedia PDF Downloads 37
562 Using Visualization Techniques to Support Common Clinical Tasks in Clinical Documentation

Authors: Jonah Kenei, Elisha Opiyo

Abstract:

Electronic health records, as a repository of patient information, is nowadays the most commonly used technology to record, store and review patient clinical records and perform other clinical tasks. However, the accurate identification and retrieval of relevant information from clinical records is a difficult task due to the unstructured nature of clinical documents, characterized in particular by a lack of clear structure. Therefore, medical practice is facing a challenge thanks to the rapid growth of health information in electronic health records (EHRs), mostly in narrative text form. As a result, it's becoming important to effectively manage the growing amount of data for a single patient. As a result, there is currently a requirement to visualize electronic health records (EHRs) in a way that aids physicians in clinical tasks and medical decision-making. Leveraging text visualization techniques to unstructured clinical narrative texts is a new area of research that aims to provide better information extraction and retrieval to support clinical decision support in scenarios where data generated continues to grow. Clinical datasets in electronic health records (EHR) offer a lot of potential for training accurate statistical models to classify facets of information which can then be used to improve patient care and outcomes. However, in many clinical note datasets, the unstructured nature of clinical texts is a common problem. This paper examines the very issue of getting raw clinical texts and mapping them into meaningful structures that can support healthcare professionals utilizing narrative texts. Our work is the result of a collaborative design process that was aided by empirical data collected through formal usability testing.

Keywords: classification, electronic health records, narrative texts, visualization

Procedia PDF Downloads 90
561 Factors Affecting Air Surface Temperature Variations in the Philippines

Authors: John Christian Lequiron, Gerry Bagtasa, Olivia Cabrera, Leoncio Amadore, Tolentino Moya

Abstract:

Changes in air surface temperature play an important role in the Philippine’s economy, industry, health, and food production. While increasing global mean temperature in the recent several decades has prompted a number of climate change and variability studies in the Philippines, most studies still focus on rainfall and tropical cyclones. This study aims to investigate the trend and variability of observed air surface temperature and determine its major influencing factor/s in the Philippines. A non-parametric Mann-Kendall trend test was applied to monthly mean temperature of 17 synoptic stations covering 56 years from 1960 to 2015 and a mean change of 0.58 °C or a positive trend of 0.0105 °C/year (p < 0.05) was found. In addition, wavelet decomposition was used to determine the frequency of temperature variability show a 12-month, 30-80-month and more than 120-month cycles. This indicates strong annual variations, interannual variations that coincide with ENSO events, and interdecadal variations that are attributed to PDO and CO2 concentrations. Air surface temperature was also correlated with smoothed sunspot number and galactic cosmic rays, the results show a low to no effect. The influence of ENSO teleconnection on temperature, wind pattern, cloud cover, and outgoing longwave radiation on different ENSO phases had significant effects on regional temperature variability. Particularly, an anomalous anticyclonic (cyclonic) flow east of the Philippines during the peak and decay phase of El Niño (La Niña) events leads to the advection of warm southeasterly (cold northeasterly) air mass over the country. Furthermore, an apparent increasing cloud cover trend is observed over the West Philippine Sea including portions of the Philippines, and this is believed to lessen the effect of the increasing air surface temperature. However, relative humidity was also found to be increasing especially on the central part of the country, which results in a high positive trend of heat index, exacerbating the effects on human discomfort. Finally, an assessment of gridded temperature datasets was done to look at the viability of using three high-resolution datasets in future climate analysis and model calibration and verification. Several error statistics (i.e. Pearson correlation, Bias, MAE, and RMSE) were used for this validation. Results show that gridded temperature datasets generally follows the observed surface temperature change and anomalies. In addition, it is more representative of regional temperature rather than a substitute to station-observed air temperature.

Keywords: air surface temperature, carbon dioxide, ENSO, galactic cosmic rays, smoothed sunspot number

Procedia PDF Downloads 283
560 Similar Script Character Recognition on Kannada and Telugu

Authors: Gurukiran Veerapur, Nytik Birudavolu, Seetharam U. N., Chandravva Hebbi, R. Praneeth Reddy

Abstract:

This work presents a robust approach for the recognition of characters in Telugu and Kannada, two South Indian scripts with structural similarities in characters. To recognize the characters exhaustive datasets are required, but there are only a few publicly available datasets. As a result, we decided to create a dataset for one language (source language),train the model with it, and then test it with the target language.Telugu is the target language in this work, whereas Kannada is the source language. The suggested method makes use of Canny edge features to increase character identification accuracy on pictures with noise and different lighting. A dataset of 45,150 images containing printed Kannada characters was created. The Nudi software was used to automatically generate printed Kannada characters with different writing styles and variations. Manual labelling was employed to ensure the accuracy of the character labels. The deep learning models like CNN (Convolutional Neural Network) and Visual Attention neural network (VAN) are used to experiment with the dataset. A Visual Attention neural network (VAN) architecture was adopted, incorporating additional channels for Canny edge features as the results obtained were good with this approach. The model's accuracy on the combined Telugu and Kannada test dataset was an outstanding 97.3%. Performance was better with Canny edge characteristics applied than with a model that solely used the original grayscale images. The accuracy of the model was found to be 80.11% for Telugu characters and 98.01% for Kannada words when it was tested with these languages. This model, which makes use of cutting-edge machine learning techniques, shows excellent accuracy when identifying and categorizing characters from these scripts.

Keywords: base characters, modifiers, guninthalu, aksharas, vattakshara, VAN

Procedia PDF Downloads 26
559 Audit of TPS photon beam dataset for small field output factors using OSLDs against RPC standard dataset

Authors: Asad Yousuf

Abstract:

Purpose: The aim of the present study was to audit treatment planning system beam dataset for small field output factors against standard dataset produced by radiological physics center (RPC) from a multicenter study. Such data are crucial for validity of special techniques, i.e., IMRT or stereotactic radiosurgery. Materials/Method: In this study, multiple small field size output factor datasets were measured and calculated for 6 to 18 MV x-ray beams using the RPC recommend methods. These beam datasets were measured at 10 cm depth for 10 × 10 cm2 to 2 × 2 cm2 field sizes, defined by collimator jaws at 100 cm. The measurements were made with a Landauer’s nanoDot OSLDs whose volume is small enough to gather a full ionization reading even for the 1×1 cm2 field size. At our institute the beam data including output factors have been commissioned at 5 cm depth with an SAD setup. For comparison with the RPC data, the output factors were converted to an SSD setup using tissue phantom ratios. SSD setup also enables coverage of the ion chamber in 2×2 cm2 field size. The measured output factors were also compared with those calculated by Eclipse™ treatment planning software. Result: The measured and calculated output factors are in agreement with RPC dataset within 1% and 4% respectively. The large discrepancies in TPS reflect the increased challenge in converting measured data into a commissioned beam model for very small fields. Conclusion: OSLDs are simple, durable, and accurate tool to verify doses that delivered using small photon beam fields down to a 1x1 cm2 field sizes. The study emphasizes that the treatment planning system should always be evaluated for small field out factors for the accurate dose delivery in clinical setting.

Keywords: small field dosimetry, optically stimulated luminescence, audit treatment, radiological physics center

Procedia PDF Downloads 299
558 Observed Changes in Constructed Precipitation at High Resolution in Southern Vietnam

Authors: Nguyen Tien Thanh, Günter Meon

Abstract:

Precipitation plays a key role in water cycle, defining the local climatic conditions and in ecosystem. It is also an important input parameter for water resources management and hydrologic models. With spatial continuous data, a certainty of discharge predictions or other environmental factors is unquestionably better than without. This is, however, not always willingly available to acquire for a small basin, especially for coastal region in Vietnam due to a low network of meteorological stations (30 stations) on long coast of 3260 km2. Furthermore, available gridded precipitation datasets are not fine enough when applying to hydrologic models. Under conditions of global warming, an application of spatial interpolation methods is a crucial for the climate change impact studies to obtain the spatial continuous data. In recent research projects, although some methods can perform better than others do, no methods draw the best results for all cases. The objective of this paper therefore, is to investigate different spatial interpolation methods for daily precipitation over a small basin (approximately 400 km2) located in coastal region, Southern Vietnam and find out the most efficient interpolation method on this catchment. The five different interpolation methods consisting of cressman, ordinary kriging, regression kriging, dual kriging and inverse distance weighting have been applied to identify the best method for the area of study on the spatio-temporal scale (daily, 10 km x 10 km). A 30-year precipitation database was created and merged into available gridded datasets. Finally, observed changes in constructed precipitation were performed. The results demonstrate that the method of ordinary kriging interpolation is an effective approach to analyze the daily precipitation. The mixed trends of increasing and decreasing monthly, seasonal and annual precipitation have documented at significant levels.

Keywords: interpolation, precipitation, trend, vietnam

Procedia PDF Downloads 255
557 Automatic Near-Infrared Image Colorization Using Synthetic Images

Authors: Yoganathan Karthik, Guhanathan Poravi

Abstract:

Colorizing near-infrared (NIR) images poses unique challenges due to the absence of color information and the nuances in light absorption. In this paper, we present an approach to NIR image colorization utilizing a synthetic dataset generated from visible light images. Our method addresses two major challenges encountered in NIR image colorization: accurately colorizing objects with color variations and avoiding over/under saturation in dimly lit scenes. To tackle these challenges, we propose a Generative Adversarial Network (GAN)-based framework that learns to map NIR images to their corresponding colorized versions. The synthetic dataset ensures diverse color representations, enabling the model to effectively handle objects with varying hues and shades. Furthermore, the GAN architecture facilitates the generation of realistic colorizations while preserving the integrity of dimly lit scenes, thus mitigating issues related to over/under saturation. Experimental results on benchmark NIR image datasets demonstrate the efficacy of our approach in producing high-quality colorizations with improved color accuracy and naturalness. Quantitative evaluations and comparative studies validate the superiority of our method over existing techniques, showcasing its robustness and generalization capability across diverse NIR image scenarios. Our research not only contributes to advancing NIR image colorization but also underscores the importance of synthetic datasets and GANs in addressing domain-specific challenges in image processing tasks. The proposed framework holds promise for various applications in remote sensing, medical imaging, and surveillance where accurate color representation of NIR imagery is crucial for analysis and interpretation.

Keywords: computer vision, near-infrared images, automatic image colorization, generative adversarial networks, synthetic data

Procedia PDF Downloads 4
556 Transcriptomine: The Nuclear Receptor Signaling Transcriptome Database

Authors: Scott A. Ochsner, Christopher M. Watkins, Apollo McOwiti, David L. Steffen Lauren B. Becnel, Neil J. McKenna

Abstract:

Understanding signaling by nuclear receptors (NRs) requires an appreciation of their cognate ligand- and tissue-specific transcriptomes. While target gene regulation data are abundant in this field, they reside in hundreds of discrete publications in formats refractory to routine query and analysis and, accordingly, their full value to the NR signaling community has not been realized. One of the mandates of the Nuclear Receptor Signaling Atlas (NURSA) is to facilitate access of the community to existing public datasets. Pursuant to this mandate we are developing a freely-accessible community web resource, Transcriptomine, to bring together the sum total of available expression array and RNA-Seq data points generated by the field in a single location. Transcriptomine currently contains over 25,000,000 gene fold change datapoints from over 1200 contrasts relevant to over 100 NRs, ligands and coregulators in over 200 tissues and cell lines. Transcriptomine is designed to accommodate a spectrum of end users ranging from the bench researcher to those with advanced bioinformatic training. Visualization tools allow users to build custom charts to compare and contrast patterns of gene regulation across different tissues and in response to different ligands. Our resource affords an entirely new paradigm for leveraging gene expression data in the NR signaling field, empowering users to query gene fold changes across diverse regulatory molecules, tissues and cell lines, target genes, biological functions and disease associations, and that would otherwise be prohibitive in terms of time and effort. Transcriptomine will be regularly updated with gene lists from future genome-wide expression array and expression-sequencing datasets in the NR signaling field.

Keywords: target gene database, informatics, gene expression, transcriptomics

Procedia PDF Downloads 249
555 Gene Expression Meta-Analysis of Potential Shared and Unique Pathways Between Autoimmune Diseases Under anti-TNFα Therapy

Authors: Charalabos Antonatos, Mariza Panoutsopoulou, Georgios K. Georgakilas, Evangelos Evangelou, Yiannis Vasilopoulos

Abstract:

The extended tissue damage and severe clinical outcomes of autoimmune diseases, accompanied by the high annual costs to the overall health care system, highlight the need for an efficient therapy. Increasing knowledge over the pathophysiology of specific chronic inflammatory diseases, namely Psoriasis (PsO), Inflammatory Bowel Diseases (IBD) consisting of Crohn’s disease (CD) and Ulcerative colitis (UC), and Rheumatoid Arthritis (RA), has provided insights into the underlying mechanisms that lead to the maintenance of the inflammation, such as Tumor Necrosis Factor alpha (TNF-α). Hence, the anti-TNFα biological agents pose as an ideal therapeutic approach. Despite the efficacy of anti-TNFα agents, several clinical trials have shown that 20-40% of patients do not respond to treatment. Nowadays, high-throughput technologies have been recruited in order to elucidate the complex interactions in multifactorial phenotypes, with the most ubiquitous ones referring to transcriptome quantification analyses. In this context, a random effects meta-analysis of available gene expression cDNA microarray datasets was performed between responders and non-responders to anti-TNFα therapy in patients with IBD, PsO, and RA. Publicly available datasets were systematically searched from inception to 10th of November 2020 and selected for further analysis if they assessed the response to anti-TNFα therapy with clinical score indexes from inflamed biopsies. Specifically, 4 IBD (79 responders/72 non-responders), 3 PsO (40 responders/11 non-responders) and 2 RA (16 responders/6 non-responders) datasetswere selected. After the separate pre-processing of each dataset, 4 separate meta-analyses were conducted; three disease-specific and a single combined meta-analysis on the disease-specific results. The MetaVolcano R package (v.1.8.0) was utilized for a random-effects meta-analysis through theRestricted Maximum Likelihood (RELM) method. The top 1% of the most consistently perturbed genes in the included datasets was highlighted through the TopConfects approach while maintaining a 5% False Discovery Rate (FDR). Genes were considered as Differentialy Expressed (DEGs) as those with P ≤ 0.05, |log2(FC)| ≥ log2(1.25) and perturbed in at least 75% of the included datasets. Over-representation analysis was performed using Gene Ontology and Reactome Pathways for both up- and down-regulated genes in all 4 performed meta-analyses. Protein-Protein interaction networks were also incorporated in the subsequentanalyses with STRING v11.5 and Cytoscape v3.9. Disease-specific meta-analyses detected multiple distinct pro-inflammatory and immune-related down-regulated genes for each disease, such asNFKBIA, IL36, and IRAK1, respectively. Pathway analyses revealed unique and shared pathways between each disease, such as Neutrophil Degranulation and Signaling by Interleukins. The combined meta-analysis unveiled 436 DEGs, 86 out of which were up- and 350 down-regulated, confirming the aforementioned shared pathways and genes, as well as uncovering genes that participate in anti-inflammatory pathways, namely IL-10 signaling. The identification of key biological pathways and regulatory elements is imperative for the accurate prediction of the patient’s response to biological drugs. Meta-analysis of such gene expression data could aid the challenging approach to unravel the complex interactions implicated in the response to anti-TNFα therapy in patients with PsO, IBD, and RA, as well as distinguish gene clusters and pathways that are altered through this heterogeneous phenotype.

Keywords: anti-TNFα, autoimmune, meta-analysis, microarrays

Procedia PDF Downloads 144
554 Feature Evaluation Based on Random Subspace and Multiple-K Ensemble

Authors: Jaehong Yu, Seoung Bum Kim

Abstract:

Clustering analysis can facilitate the extraction of intrinsic patterns in a dataset and reveal its natural groupings without requiring class information. For effective clustering analysis in high dimensional datasets, unsupervised dimensionality reduction is an important task. Unsupervised dimensionality reduction can generally be achieved by feature extraction or feature selection. In many situations, feature selection methods are more appropriate than feature extraction methods because of their clear interpretation with respect to the original features. The unsupervised feature selection can be categorized as feature subset selection and feature ranking method, and we focused on unsupervised feature ranking methods which evaluate the features based on their importance scores. Recently, several unsupervised feature ranking methods were developed based on ensemble approaches to achieve their higher accuracy and stability. However, most of the ensemble-based feature ranking methods require the true number of clusters. Furthermore, these algorithms evaluate the feature importance depending on the ensemble clustering solution, and they produce undesirable evaluation results if the clustering solutions are inaccurate. To address these limitations, we proposed an ensemble-based feature ranking method with random subspace and multiple-k ensemble (FRRM). The proposed FRRM algorithm evaluates the importance of each feature with the random subspace ensemble, and all evaluation results are combined with the ensemble importance scores. Moreover, FRRM does not require the determination of the true number of clusters in advance through the use of the multiple-k ensemble idea. Experiments on various benchmark datasets were conducted to examine the properties of the proposed FRRM algorithm and to compare its performance with that of existing feature ranking methods. The experimental results demonstrated that the proposed FRRM outperformed the competitors.

Keywords: clustering analysis, multiple-k ensemble, random subspace-based feature evaluation, unsupervised feature ranking

Procedia PDF Downloads 308
553 Rd-PLS Regression: From the Analysis of Two Blocks of Variables to Path Modeling

Authors: E. Tchandao Mangamana, V. Cariou, E. Vigneau, R. Glele Kakai, E. M. Qannari

Abstract:

A new definition of a latent variable associated with a dataset makes it possible to propose variants of the PLS2 regression and the multi-block PLS (MB-PLS). We shall refer to these variants as Rd-PLS regression and Rd-MB-PLS respectively because they are inspired by both Redundancy analysis and PLS regression. Usually, a latent variable t associated with a dataset Z is defined as a linear combination of the variables of Z with the constraint that the length of the loading weights vector equals 1. Formally, t=Zw with ‖w‖=1. Denoting by Z' the transpose of Z, we define herein, a latent variable by t=ZZ’q with the constraint that the auxiliary variable q has a norm equal to 1. This new definition of a latent variable entails that, as previously, t is a linear combination of the variables in Z and, in addition, the loading vector w=Z’q is constrained to be a linear combination of the rows of Z. More importantly, t could be interpreted as a kind of projection of the auxiliary variable q onto the space generated by the variables in Z, since it is collinear to the first PLS1 component of q onto Z. Consider the situation in which we aim to predict a dataset Y from another dataset X. These two datasets relate to the same individuals and are assumed to be centered. Let us consider a latent variable u=YY’q to which we associate the variable t= XX’YY’q. Rd-PLS consists in seeking q (and therefore u and t) so that the covariance between t and u is maximum. The solution to this problem is straightforward and consists in setting q to the eigenvector of YY’XX’YY’ associated with the largest eigenvalue. For the determination of higher order components, we deflate X and Y with respect to the latent variable t. Extending Rd-PLS to the context of multi-block data is relatively easy. Starting from a latent variable u=YY’q, we consider its ‘projection’ on the space generated by the variables of each block Xk (k=1, ..., K) namely, tk= XkXk'YY’q. Thereafter, Rd-MB-PLS seeks q in order to maximize the average of the covariances of u with tk (k=1, ..., K). The solution to this problem is given by q, eigenvector of YY’XX’YY’, where X is the dataset obtained by horizontally merging datasets Xk (k=1, ..., K). For the determination of latent variables of order higher than 1, we use a deflation of Y and Xk with respect to the variable t= XX’YY’q. In the same vein, extending Rd-MB-PLS to the path modeling setting is straightforward. Methods are illustrated on the basis of case studies and performance of Rd-PLS and Rd-MB-PLS in terms of prediction is compared to that of PLS2 and MB-PLS.

Keywords: multiblock data analysis, partial least squares regression, path modeling, redundancy analysis

Procedia PDF Downloads 109
552 Combinational Therapeutic Targeting of BRD4 and CDK7 Synergistically Induces Anticancer Effects in Hepatocellular Carcinoma

Authors: Xinxiu Li, Chuqian Zheng, Yanyan Qian, Hong Fan

Abstract:

Objectives: In hepatocellular carcinoma (HCC), oncogenes are continuously and robustly transcribed due to aberrant expression of essential components of the trans-acting super-enhancers (SE) complex. Preclinical and clinical trials are now being conducted on small-molecule inhibitors that target core-transcriptional components, including as transcriptional bromodomain protein 4 (BRD4) and cyclin-dependent kinase 7 (CDK7), in a number of malignant tumors. This study aims to explore whether co-overexpression of BRD4 and CDK7 is a potential marker of worse prognosis and a combined therapeutic target in HCC. Methods: The expression pattern of BRD4 and CDK7 and their correlation with prognosis in HCC were analyzed by RNA sequencing data and survival data of HCC patients from TCGA and GEO datasets. The protein levels of BRD4 and CDK7 were determined by immunohistochemistry (IHC), and survival data of patients were analyzed using the Kaplan-Meier method. The mRNA expression levels of genes in HCC cell lines were evaluated by quantitative PCR (q-PCR). CCK-8 and colony formation assays were conducted to assess cell proliferation of HCC upon treatment with BRD4 inhibitor JQ1 or/and CDK7 inhibitor THZ1. Results: It was shown that BRD4 and CDK7 were often overexpressed in HCCs and were associated with poor prognosis of HCC by analyzing the TCGA and GEO datasets. BRD4 or CDK7 overexpression was related to a lower survival rate. It's interesting to note that co-overexpression of CDK7 and BRD4 was a worse prognostic factor in HCC. Treatment with JQ1 or THZ1 alone had an inhibitory effect on cell proliferation; however, when JQ1 and THZ1 were combined, there was a more notable suppression of cell growth. At the same time, the combined use of JQ1 and THZ1 synergistically suppresses the expression of HCC driver genes. Conclusion: Our research revealed that BRD4 and CDK7 coupled can be a useful biomarker in HCC prognosis and the combination of JQ1 and THZ1 can be a promising therapeutic therapy against HCC.

Keywords: BRD4, CDK7, cell proliferation, combined inhibition

Procedia PDF Downloads 32
551 Machine Learning Approaches Based on Recency, Frequency, Monetary (RFM) and K-Means for Predicting Electrical Failures and Voltage Reliability in Smart Cities

Authors: Panaya Sudta, Wanchalerm Patanacharoenwong, Prachya Bumrungkun

Abstract:

As With the evolution of smart grids, ensuring the reliability and efficiency of electrical systems in smart cities has become crucial. This paper proposes a distinct approach that combines advanced machine learning techniques to accurately predict electrical failures and address voltage reliability issues. This approach aims to improve the accuracy and efficiency of reliability evaluations in smart cities. The aim of this research is to develop a comprehensive predictive model that accurately predicts electrical failures and voltage reliability in smart cities. This model integrates RFM analysis, K-means clustering, and LSTM networks to achieve this objective. The research utilizes RFM analysis, traditionally used in customer value assessment, to categorize and analyze electrical components based on their failure recency, frequency, and monetary impact. K-means clustering is employed to segment electrical components into distinct groups with similar characteristics and failure patterns. LSTM networks are used to capture the temporal dependencies and patterns in customer data. This integration of RFM, K-means, and LSTM results in a robust predictive tool for electrical failures and voltage reliability. The proposed model has been tested and validated on diverse electrical utility datasets. The results show a significant improvement in prediction accuracy and reliability compared to traditional methods, achieving an accuracy of 92.78% and an F1-score of 0.83. This research contributes to the proactive maintenance and optimization of electrical infrastructures in smart cities. It also enhances overall energy management and sustainability. The integration of advanced machine learning techniques in the predictive model demonstrates the potential for transforming the landscape of electrical system management within smart cities. The research utilizes diverse electrical utility datasets to develop and validate the predictive model. RFM analysis, K-means clustering, and LSTM networks are applied to these datasets to analyze and predict electrical failures and voltage reliability. The research addresses the question of how accurately electrical failures and voltage reliability can be predicted in smart cities. It also investigates the effectiveness of integrating RFM analysis, K-means clustering, and LSTM networks in achieving this goal. The proposed approach presents a distinct, efficient, and effective solution for predicting and mitigating electrical failures and voltage issues in smart cities. It significantly improves prediction accuracy and reliability compared to traditional methods. This advancement contributes to the proactive maintenance and optimization of electrical infrastructures, overall energy management, and sustainability in smart cities.

Keywords: electrical state prediction, smart grids, data-driven method, long short-term memory, RFM, k-means, machine learning

Procedia PDF Downloads 22
550 Arabic Lexicon Learning to Analyze Sentiment in Microblogs

Authors: Mahmoud B. Rokaya

Abstract:

The study of opinion mining and sentiment analysis includes analysis of opinions, sentiments, evaluations, attitudes, and emotions. The rapid growth of social media, social networks, reviews, forum discussions, microblogs, and Twitter, leads to a parallel growth in the field of sentiment analysis. The field of sentiment analysis tries to develop effective tools to make it possible to capture the trends of people. There are two approaches in the field, lexicon-based and corpus-based methods. A lexicon-based method uses a sentiment lexicon which includes sentiment words and phrases with assigned numeric scores. These scores reveal if sentiment phrases are positive or negative, their intensity, and/or their emotional orientations. Creation of manual lexicons is hard. This brings the need for adaptive automated methods for generating a lexicon. The proposed method generates dynamic lexicons based on the corpus and then classifies text using these lexicons. In the proposed method, different approaches are combined to generate lexicons from text. The proposed method classifies the tweets into 5 classes instead of +ve or –ve classes. The sentiment classification problem is written as an optimization problem, finding optimum sentiment lexicons are the goal of the optimization process. The solution was produced based on mathematical programming approaches to find the best lexicon to classify texts. A genetic algorithm was written to find the optimal lexicon. Then, extraction of a meta-level feature was done based on the optimal lexicon. The experiments were conducted on several datasets. Results, in terms of accuracy, recall and F measure, outperformed the state-of-the-art methods proposed in the literature in some of the datasets. A better understanding of the Arabic language and culture of Arab Twitter users and sentiment orientation of words in different contexts can be achieved based on the sentiment lexicons proposed by the algorithm.

Keywords: social media, Twitter sentiment, sentiment analysis, lexicon, genetic algorithm, evolutionary computation

Procedia PDF Downloads 153
549 River Habitat Modeling for the Entire Macroinvertebrate Community

Authors: Pinna Beatrice., Laini Alex, Negro Giovanni, Burgazzi Gemma, Viaroli Pierluigi, Vezza Paolo

Abstract:

Habitat models rarely consider macroinvertebrates as ecological targets in rivers. Available approaches mainly focus on single macroinvertebrate species, not addressing the ecological needs and functionality of the entire community. This research aimed to provide an approach to model the habitat of the macroinvertebrate community. The approach is based on the recently developed Flow-T index, together with a Random Forest (RF) regression, which is employed to apply the Flow-T index at the meso-habitat scale. Using different datasets gathered from both field data collection and 2D hydrodynamic simulations, the model has been calibrated in the Trebbia river (2019 campaign), and then validated in the Trebbia, Taro, and Enza rivers (2020 campaign). The three rivers are characterized by a braiding morphology, gravel riverbeds, and summer low flows. The RF model selected 12 mesohabitat descriptors as important for the macroinvertebrate community. These descriptors belong to different frequency classes of water depth, flow velocity, substrate grain size, and connectivity to the main river channel. The cross-validation R² coefficient (R²𝒸ᵥ) of the training dataset is 0.71 for the Trebbia River (2019), whereas the R² coefficient for the validation datasets (Trebbia, Taro, and Enza Rivers 2020) is 0.63. The agreement between the simulated results and the experimental data shows sufficient accuracy and reliability. The outcomes of the study reveal that the model can identify the ecological response of the macroinvertebrate community to possible flow regime alterations and to possible river morphological modifications. Lastly, the proposed approach allows extending the MesoHABSIM methodology, widely used for the fish habitat assessment, to a different ecological target community. Further applications of the approach can be related to flow design in both perennial and non-perennial rivers, including river reaches in which fish fauna is absent.

Keywords: ecological flows, macroinvertebrate community, mesohabitat, river habitat modeling

Procedia PDF Downloads 58
548 Robustness of the Deep Chroma Extractor and Locally-Normalized Quarter Tone Filters in Automatic Chord Estimation under Reverberant Conditions

Authors: Luis Alvarado, Victor Poblete, Isaac Gonzalez, Yetzabeth Gonzalez

Abstract:

In MIREX 2016 (http://www.music-ir.org/mirex), the deep neural network (DNN)-Deep Chroma Extractor, proposed by Korzeniowski and Wiedmer, reached the highest score in an audio chord recognition task. In the present paper, this tool is assessed under acoustic reverberant environments and distinct source-microphone distances. The evaluation dataset comprises The Beatles and Queen datasets. These datasets are sequentially re-recorded with a single microphone in a real reverberant chamber at four reverberation times (0 -anechoic-, 1, 2, and 3 s, approximately), as well as four source-microphone distances (32, 64, 128, and 256 cm). It is expected that the performance of the trained DNN will dramatically decrease under these acoustic conditions with signals degraded by room reverberation and distance to the source. Recently, the effect of the bio-inspired Locally-Normalized Cepstral Coefficients (LNCC), has been assessed in a text independent speaker verification task using speech signals degraded by additive noise at different signal-to-noise ratios with variations of recording distance, and it has also been assessed under reverberant conditions with variations of recording distance. LNCC showed a performance so high as the state-of-the-art Mel Frequency Cepstral Coefficient filters. Based on these results, this paper proposes a variation of locally-normalized triangular filters called Locally-Normalized Quarter Tone (LNQT) filters. By using the LNQT spectrogram, robustness improvements of the trained Deep Chroma Extractor are expected, compared with classical triangular filters, and thus compensating the music signal degradation improving the accuracy of the chord recognition system.

Keywords: chord recognition, deep neural networks, feature extraction, music information retrieval

Procedia PDF Downloads 195