Search results for: manually annotated dataset
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 1468

Search results for: manually annotated dataset

1408 Applying Spanning Tree Graph Theory for Automatic Database Normalization

Authors: Chetneti Srisa-an

Abstract:

In Knowledge and Data Engineering field, relational database is the best repository to store data in a real world. It has been using around the world more than eight decades. Normalization is the most important process for the analysis and design of relational databases. It aims at creating a set of relational tables with minimum data redundancy that preserve consistency and facilitate correct insertion, deletion, and modification. Normalization is a major task in the design of relational databases. Despite its importance, very few algorithms have been developed to be used in the design of commercial automatic normalization tools. It is also rare technique to do it automatically rather manually. Moreover, for a large and complex database as of now, it make even harder to do it manually. This paper presents a new complete automated relational database normalization method. It produces the directed graph and spanning tree, first. It then proceeds with generating the 2NF, 3NF and also BCNF normal forms. The benefit of this new algorithm is that it can cope with a large set of complex function dependencies.

Keywords: relational database, functional dependency, automatic normalization, primary key, spanning tree

Procedia PDF Downloads 331
1407 2D Fingerprint Performance for PubChem Chemical Database

Authors: Fatimah Zawani Abdullah, Shereena Mohd Arif, Nurul Malim

Abstract:

The study of molecular similarity search in chemical database is increasingly widespread, especially in the area of drug discovery. Similarity search is an application in the field of Chemoinformatics to measure the similarity between the molecular structure which is known as the query and the structure of chemical compounds in the database. Similarity search is also one of the approaches in virtual screening which involves computational techniques and scoring the probabilities of activity. The main objective of this work is to determine the best fingerprint when compared to the other five fingerprints selected in this study using PubChem chemical dataset. This paper will discuss the similarity searching process conducted using 6 types of descriptors, which are ECFP4, ECFC4, FCFP4, FCFC4, SRECFC4 and SRFCFC4 on 15 activity classes of PubChem dataset using Tanimoto coefficient to calculate the similarity between the query structures and each of the database structure. The results suggest that ECFP4 performs the best to be used with Tanimoto coefficient in the PubChem dataset.

Keywords: 2D fingerprints, Tanimoto, PubChem, similarity searching, chemoinformatics

Procedia PDF Downloads 265
1406 A Large Dataset Imputation Approach Applied to Country Conflict Prediction Data

Authors: Benjamin Leiby, Darryl Ahner

Abstract:

This study demonstrates an alternative stochastic imputation approach for large datasets when preferred commercial packages struggle to iterate due to numerical problems. A large country conflict dataset motivates the search to impute missing values well over a common threshold of 20% missingness. The methodology capitalizes on correlation while using model residuals to provide the uncertainty in estimating unknown values. Examination of the methodology provides insight toward choosing linear or nonlinear modeling terms. Static tolerances common in most packages are replaced with tailorable tolerances that exploit residuals to fit each data element. The methodology evaluation includes observing computation time, model fit, and the comparison of known values to replaced values created through imputation. Overall, the country conflict dataset illustrates promise with modeling first-order interactions while presenting a need for further refinement that mimics predictive mean matching.

Keywords: correlation, country conflict, imputation, stochastic regression

Procedia PDF Downloads 95
1405 Implementation of a Paraconsistent-Fuzzy Digital PID Controller in a Level Control Process

Authors: H. M. Côrtes, J. I. Da Silva Filho, M. F. Blos, B. S. Zanon

Abstract:

In a modern society the factor corresponding to the increase in the level of quality in industrial production demand new techniques of control and machinery automation. In this context, this work presents the implementation of a Paraconsistent-Fuzzy Digital PID controller. The controller is based on the treatment of inconsistencies both in the Paraconsistent Logic and in the Fuzzy Logic. Paraconsistent analysis is performed on the signals applied to the system inputs using concepts from the Paraconsistent Annotated Logic with annotation of two values (PAL2v). The signals resulting from the paraconsistent analysis are two values defined as Dc - Degree of Certainty and Dct - Degree of Contradiction, which receive a treatment according to the Fuzzy Logic theory, and the resulting output of the logic actions is a single value called the crisp value, which is used to control dynamic system. Through an example, it was demonstrated the application of the proposed model. Initially, the Paraconsistent-Fuzzy Digital PID controller was built and tested in an isolated MATLAB environment and then compared to the equivalent Digital PID function of this software for standard step excitation. After this step, a level control plant was modeled to execute the controller function on a physical model, making the tests closer to the actual. For this, the control parameters (proportional, integral and derivative) were determined for the configuration of the conventional Digital PID controller and of the Paraconsistent-Fuzzy Digital PID, and the control meshes in MATLAB were assembled with the respective transfer function of the plant. Finally, the results of the comparison of the level control process between the Paraconsistent-Fuzzy Digital PID controller and the conventional Digital PID controller were presented.

Keywords: fuzzy logic, paraconsistent annotated logic, level control, digital PID

Procedia PDF Downloads 260
1404 Fine Grained Action Recognition of Skateboarding Tricks

Authors: Frederik Calsius, Mirela Popa, Alexia Briassouli

Abstract:

In the field of machine learning, it is common practice to use benchmark datasets to prove the working of a method. The domain of action recognition in videos often uses datasets like Kinet-ics, Something-Something, UCF-101 and HMDB-51 to report results. Considering the properties of the datasets, there are no datasets that focus solely on very short clips (2 to 3 seconds), and on highly-similar fine-grained actions within one specific domain. This paper researches how current state-of-the-art action recognition methods perform on a dataset that consists of highly similar, fine-grained actions. To do so, a dataset of skateboarding tricks was created. The performed analysis highlights both benefits and limitations of state-of-the-art methods, while proposing future research directions in the activity recognition domain. The conducted research shows that the best results are obtained by fusing RGB data with OpenPose data for the Temporal Shift Module.

Keywords: activity recognition, fused deep representations, fine-grained dataset, temporal modeling

Procedia PDF Downloads 204
1403 De Novo Assembly and Characterization of the Transcriptome from the Fluoroacetate Producing Plant, Dichapetalum Cymosum

Authors: Selisha A. Sooklal, Phelelani Mpangase, Shaun Aron, Karl Rumbold

Abstract:

Organically bound fluorine (C-F bond) is extremely rare in nature. Despite this, the first fluorinated secondary metabolite, fluoroacetate, was isolated from the plant Dichapetalum cymosum (commonly known as Gifblaar). However, the enzyme responsible for fluorination (fluorinase) in Gifblaar was never isolated and very little progress has been achieved in understanding this process in higher plants. Fluorinated compounds have vast applications in the pharmaceutical, agrochemical and fine chemicals industries. Consequently, an enzyme capable of catalysing a C-F bond has great potential as a biocatalyst in the industry considering that the field of fluorination is virtually synthetic. As with any biocatalyst, a range of these enzymes are required. Therefore, it is imperative to expand the exploration for novel fluorinases. This study aimed to gain molecular insights into secondary metabolite biosynthesis in Gifblaar using a high-throughput sequencing-based approach. Mechanical wounding studies were performed using Gifblaar leaf tissue in order to induce expression of the fluorinase. The transcriptome of the wounded and unwounded plant was then sequenced on the Illumina HiSeq platform. A total of 26.4 million short sequence reads were assembled into 77 845 transcripts using Trinity. Overall, 68.6 % of transcripts were annotated with gene identities using public databases (SwissProt, TrEMBL, GO, COG, Pfam, EC) with an E-value threshold of 1E-05. Sequences exhibited the greatest homology to the model plant, Arabidopsis thaliana (27 %). A total of 244 annotated transcripts were found to be differentially expressed between the wounded and unwounded plant. In addition, secondary metabolic pathways present in Gifblaar were successfully reconstructed using Pathway tools. Due to lack of genetic information for plant fluorinases, a transcript failed to be annotated as a fluorinating enzyme. Thus, a local database containing the 5 existing bacterial fluorinases was created. Fifteen transcripts having homology to partial regions of existing fluorinases were found. In efforts to obtain the full coding sequence of the Gifblaar fluorinase, primers were designed targeting the regions of homology and genome walking will be performed to amplify the unknown regions. This is the first genetic data available for Gifblaar. It has provided novel insights into the mechanisms of metabolite biosynthesis and will allow for the discovery of the first eukaryotic fluorinase.

Keywords: biocatalyst, fluorinase, gifblaar, transcriptome

Procedia PDF Downloads 251
1402 Reducing the Imbalance Penalty Through Artificial Intelligence Methods Geothermal Production Forecasting: A Case Study for Turkey

Authors: Hayriye Anıl, Görkem Kar

Abstract:

In addition to being rich in renewable energy resources, Turkey is one of the countries that promise potential in geothermal energy production with its high installed power, cheapness, and sustainability. Increasing imbalance penalties become an economic burden for organizations since geothermal generation plants cannot maintain the balance of supply and demand due to the inadequacy of the production forecasts given in the day-ahead market. A better production forecast reduces the imbalance penalties of market participants and provides a better imbalance in the day ahead market. In this study, using machine learning, deep learning, and, time series methods, the total generation of the power plants belonging to Zorlu Natural Electricity Generation, which has a high installed capacity in terms of geothermal, was estimated for the first one and two weeks of March, then the imbalance penalties were calculated with these estimates and compared with the real values. These modeling operations were carried out on two datasets, the basic dataset and the dataset created by extracting new features from this dataset with the feature engineering method. According to the results, Support Vector Regression from traditional machine learning models outperformed other models and exhibited the best performance. In addition, the estimation results in the feature engineering dataset showed lower error rates than the basic dataset. It has been concluded that the estimated imbalance penalty calculated for the selected organization is lower than the actual imbalance penalty, optimum and profitable accounts.

Keywords: machine learning, deep learning, time series models, feature engineering, geothermal energy production forecasting

Procedia PDF Downloads 81
1401 Unraveling Language Contact through Syntactic Dynamics of ‘Also’ in Hong Kong and Britain English

Authors: Xu Zhang

Abstract:

This article unveils an indicator of language contact between English and Cantonese in one of the Outer Circle Englishes, Hong Kong (HK) English, through an empirical investigation into 1000 tokens from the Global Web-based English (GloWbE) corpus, employing frequency analysis and logistic regression analysis. It is perceived that Cantonese and general Chinese are contextually marked by an integral underlying thinking pattern. Chinese speakers exhibit a reliance on semantic context over syntactic rules and lexical forms. This linguistic trait carries over to their use of English, affording greater flexibility to formal elements in constructing English sentences. The study focuses on the syntactic positioning of the focusing subjunct ‘also’, a linguistic element used to add new or contrasting prominence to specific sentence constituents. The English language generally allows flexibility in the relative position of 'also’, while there is a preference for close marking relationships. This article shifts attention to Hong Kong, where Cantonese and English converge, and 'also' finds counterparts in Cantonese ‘jaa’ and Mandarin ‘ye’. Employing a corpus-based data-driven method, we investigate the syntactic position of 'also' in both HK and GB English. The study aims to ascertain whether HK English exhibits a greater 'syntactic freedom,' allowing for a more distant marking relationship with 'also' compared to GB English. The analysis involves a random extraction of 500 samples from both HK and GB English from the GloWbE corpus, forming a dataset (N=1000). Exclusions are made for cases where 'also' functions as an additive conjunct or serves as a copulative adverb, as well as sentences lacking sufficient indication that 'also' functions as a focusing particle. The final dataset comprises 820 tokens, with 416 for GB and 404 for HK, annotated according to the focused constituent and the relative position of ‘also’. Frequency analysis reveals significant differences in the relative position of 'also' and marking relationships between HK and GB English. Regression analysis indicates a preference in HK English for a distant marking relationship between 'also' and its focused constituent. Notably, the subject and other constituents emerge as significant predictors of a distant position for 'also.' Together, these findings underscore the nuanced linguistic dynamics in HK English and contribute to our understanding of language contact. It suggests that future pedagogical practice should consider incorporating the syntactic variation within English varieties, facilitating leaners’ effective communication in diverse English-speaking environments and enhancing their intercultural communication competence.

Keywords: also, Cantonese, English, focus marker, frequency analysis, language contact, logistic regression analysis

Procedia PDF Downloads 26
1400 Audit of TPS photon beam dataset for small field output factors using OSLDs against RPC standard dataset

Authors: Asad Yousuf

Abstract:

Purpose: The aim of the present study was to audit treatment planning system beam dataset for small field output factors against standard dataset produced by radiological physics center (RPC) from a multicenter study. Such data are crucial for validity of special techniques, i.e., IMRT or stereotactic radiosurgery. Materials/Method: In this study, multiple small field size output factor datasets were measured and calculated for 6 to 18 MV x-ray beams using the RPC recommend methods. These beam datasets were measured at 10 cm depth for 10 × 10 cm2 to 2 × 2 cm2 field sizes, defined by collimator jaws at 100 cm. The measurements were made with a Landauer’s nanoDot OSLDs whose volume is small enough to gather a full ionization reading even for the 1×1 cm2 field size. At our institute the beam data including output factors have been commissioned at 5 cm depth with an SAD setup. For comparison with the RPC data, the output factors were converted to an SSD setup using tissue phantom ratios. SSD setup also enables coverage of the ion chamber in 2×2 cm2 field size. The measured output factors were also compared with those calculated by Eclipse™ treatment planning software. Result: The measured and calculated output factors are in agreement with RPC dataset within 1% and 4% respectively. The large discrepancies in TPS reflect the increased challenge in converting measured data into a commissioned beam model for very small fields. Conclusion: OSLDs are simple, durable, and accurate tool to verify doses that delivered using small photon beam fields down to a 1x1 cm2 field sizes. The study emphasizes that the treatment planning system should always be evaluated for small field out factors for the accurate dose delivery in clinical setting.

Keywords: small field dosimetry, optically stimulated luminescence, audit treatment, radiological physics center

Procedia PDF Downloads 303
1399 Analysis of Real Time Seismic Signal Dataset Using Machine Learning

Authors: Sujata Kulkarni, Udhav Bhosle, Vijaykumar T.

Abstract:

Due to the closeness between seismic signals and non-seismic signals, it is vital to detect earthquakes using conventional methods. In order to distinguish between seismic events and non-seismic events depending on their amplitude, our study processes the data that come from seismic sensors. The authors suggest a robust noise suppression technique that makes use of a bandpass filter, an IIR Wiener filter, recursive short-term average/long-term average (STA/LTA), and Carl short-term average (STA)/long-term average for event identification (LTA). The trigger ratio used in the proposed study to differentiate between seismic and non-seismic activity is determined. The proposed work focuses on significant feature extraction for machine learning-based seismic event detection. This serves as motivation for compiling a dataset of all features for the identification and forecasting of seismic signals. We place a focus on feature vector dimension reduction techniques due to the temporal complexity. The proposed notable features were experimentally tested using a machine learning model, and the results on unseen data are optimal. Finally, a presentation using a hybrid dataset (captured by different sensors) demonstrates how this model may also be employed in a real-time setting while lowering false alarm rates. The planned study is based on the examination of seismic signals obtained from both individual sensors and sensor networks (SN). A wideband seismic signal from BSVK and CUKG station sensors, respectively located near Basavakalyan, Karnataka, and the Central University of Karnataka, makes up the experimental dataset.

Keywords: Carl STA/LTA, features extraction, real time, dataset, machine learning, seismic detection

Procedia PDF Downloads 77
1398 Combining the Deep Neural Network with the K-Means for Traffic Accident Prediction

Authors: Celso L. Fernando, Toshio Yoshii, Takahiro Tsubota

Abstract:

Understanding the causes of a road accident and predicting their occurrence is key to preventing deaths and serious injuries from road accident events. Traditional statistical methods such as the Poisson and the Logistics regressions have been used to find the association of the traffic environmental factors with the accident occurred; recently, an artificial neural network, ANN, a computational technique that learns from historical data to make a more accurate prediction, has emerged. Although the ability to make accurate predictions, the ANN has difficulty dealing with highly unbalanced attribute patterns distribution in the training dataset; in such circumstances, the ANN treats the minority group as noise. However, in the real world data, the minority group is often the group of interest; e.g., in the road traffic accident data, the events of the accident are the group of interest. This study proposes a combination of the k-means with the ANN to improve the predictive ability of the neural network model by alleviating the effect of the unbalanced distribution of the attribute patterns in the training dataset. The results show that the proposed method improves the ability of the neural network to make a prediction on a highly unbalanced distributed attribute patterns dataset; however, on an even distributed attribute patterns dataset, the proposed method performs almost like a standard neural network.

Keywords: accident risks estimation, artificial neural network, deep learning, k-mean, road safety

Procedia PDF Downloads 125
1397 Affects Associations Analysis in Emergency Situations

Authors: Joanna Grzybowska, Magdalena Igras, Mariusz Ziółko

Abstract:

Association rule learning is an approach for discovering interesting relationships in large databases. The analysis of relations, invisible at first glance, is a source of new knowledge which can be subsequently used for prediction. We used this data mining technique (which is an automatic and objective method) to learn about interesting affects associations in a corpus of emergency phone calls. We also made an attempt to match revealed rules with their possible situational context. The corpus was collected and subjectively annotated by two researchers. Each of 3306 recordings contains information on emotion: (1) type (sadness, weariness, anxiety, surprise, stress, anger, frustration, calm, relief, compassion, contentment, amusement, joy) (2) valence (negative, neutral, or positive) (3) intensity (low, typical, alternating, high). Also, additional information, that is a clue to speaker’s emotional state, was annotated: speech rate (slow, normal, fast), characteristic vocabulary (filled pauses, repeated words) and conversation style (normal, chaotic). Exponentially many rules can be extracted from a set of items (an item is a previously annotated single information). To generate the rules in the form of an implication X → Y (where X and Y are frequent k-itemsets) the Apriori algorithm was used - it avoids performing needless computations. Then, two basic measures (Support and Confidence) and several additional symmetric and asymmetric objective measures (e.g. Laplace, Conviction, Interest Factor, Cosine, correlation coefficient) were calculated for each rule. Each applied interestingness measure revealed different rules - we selected some top rules for each measure. Owing to the specificity of the corpus (emergency situations), most of the strong rules contain only negative emotions. There are though strong rules including neutral or even positive emotions. Three examples of the strongest rules are: {sadness} → {anxiety}; {sadness, weariness, stress, frustration} → {anger}; {compassion} → {sadness}. Association rule learning revealed the strongest configurations of affects (as well as configurations of affects with affect-related information) in our emergency phone calls corpus. The acquired knowledge can be used for prediction to fulfill the emotional profile of a new caller. Furthermore, a rule-related possible context analysis may be a clue to the situation a caller is in.

Keywords: data mining, emergency phone calls, emotional profiles, rules

Procedia PDF Downloads 388
1396 Comparison between XGBoost, LightGBM and CatBoost Using a Home Credit Dataset

Authors: Essam Al Daoud

Abstract:

Gradient boosting methods have been proven to be a very important strategy. Many successful machine learning solutions were developed using the XGBoost and its derivatives. The aim of this study is to investigate and compare the efficiency of three gradient methods. Home credit dataset is used in this work which contains 219 features and 356251 records. However, new features are generated and several techniques are used to rank and select the best features. The implementation indicates that the LightGBM is faster and more accurate than CatBoost and XGBoost using variant number of features and records.

Keywords: gradient boosting, XGBoost, LightGBM, CatBoost, home credit

Procedia PDF Downloads 135
1395 PaSA: A Dataset for Patent Sentiment Analysis to Highlight Patent Paragraphs

Authors: Renukswamy Chikkamath, Vishvapalsinhji Ramsinh Parmar, Christoph Hewel, Markus Endres

Abstract:

Given a patent document, identifying distinct semantic annotations is an interesting research aspect. Text annotation helps the patent practitioners such as examiners and patent attorneys to quickly identify the key arguments of any invention, successively providing a timely marking of a patent text. In the process of manual patent analysis, to attain better readability, recognising the semantic information by marking paragraphs is in practice. This semantic annotation process is laborious and time-consuming. To alleviate such a problem, we proposed a dataset to train machine learning algorithms to automate the highlighting process. The contributions of this work are: i) we developed a multi-class dataset of size 150k samples by traversing USPTO patents over a decade, ii) articulated statistics and distributions of data using imperative exploratory data analysis, iii) baseline Machine Learning models are developed to utilize the dataset to address patent paragraph highlighting task, and iv) future path to extend this work using Deep Learning and domain-specific pre-trained language models to develop a tool to highlight is provided. This work assists patent practitioners in highlighting semantic information automatically and aids in creating a sustainable and efficient patent analysis using the aptitude of machine learning.

Keywords: machine learning, patents, patent sentiment analysis, patent information retrieval

Procedia PDF Downloads 66
1394 The Effectiveness and Accuracy of the Schulte Holt IOL Toric Calculator Processor in Comparison to Manually Input Data into the Barrett Toric IOL Calculator

Authors: Gabrielle Holt

Abstract:

This paper is looking to prove the efficacy of the Schulte Holt IOL Toric Calculator Processor (Schulte Holt ITCP). It has been completed using manually inputted data into the Barrett Toric Calculator and comparing the number of minutes taken to complete the Toric calculations, the number of errors identified during completion, and distractions during completion. It will then compare that data to the number of minutes taken for the Schulte Holt ITCP to complete also, using the Barrett method, as well as the number of errors identified in the Schulte Holt ITCP. The data clearly demonstrate a momentous advantage to the Schulte Holt ITCP and notably reduces time spent doing Toric Calculations, as well as reducing the number of errors. With the ever-growing number of cataract surgeries taking place around the world and the waitlists increasing -the Schulte Holt IOL Toric Calculator Processor may well demonstrate a way forward to increase the availability of ophthalmologists and ophthalmic staff while maintaining patient safety.

Keywords: Toric, toric lenses, ophthalmology, cataract surgery, toric calculations, Barrett

Procedia PDF Downloads 57
1393 Generation of High-Quality Synthetic CT Images from Cone Beam CT Images Using A.I. Based Generative Networks

Authors: Heeba A. Gurku

Abstract:

Introduction: Cone Beam CT(CBCT) images play an integral part in proper patient positioning in cancer patients undergoing radiation therapy treatment. But these images are low in quality. The purpose of this study is to generate high-quality synthetic CT images from CBCT using generative models. Material and Methods: This study utilized two datasets from The Cancer Imaging Archive (TCIA) 1) Lung cancer dataset of 20 patients (with full view CBCT images) and 2) Pancreatic cancer dataset of 40 patients (only 27 patients having limited view images were included in the study). Cycle Generative Adversarial Networks (GAN) and its variant Attention Guided Generative Adversarial Networks (AGGAN) models were used to generate the synthetic CTs. Models were evaluated by visual evaluation and on four metrics, Structural Similarity Index Measure (SSIM), Peak Signal Noise Ratio (PSNR) Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), to compare the synthetic CT and original CT images. Results: For pancreatic dataset with limited view CBCT images, our study showed that in Cycle GAN model, MAE, RMSE, PSNR improved from 12.57to 8.49, 20.94 to 15.29 and 21.85 to 24.63, respectively but structural similarity only marginally increased from 0.78 to 0.79. Similar, results were achieved with AGGAN with no improvement over Cycle GAN. However, for lung dataset with full view CBCT images Cycle GAN was able to reduce MAE significantly from 89.44 to 15.11 and AGGAN was able to reduce it to 19.77. Similarly, RMSE was also decreased from 92.68 to 23.50 in Cycle GAN and to 29.02 in AGGAN. SSIM and PSNR also improved significantly from 0.17 to 0.59 and from 8.81 to 21.06 in Cycle GAN respectively while in AGGAN SSIM increased to 0.52 and PSNR increased to 19.31. In both datasets, GAN models were able to reduce artifacts, reduce noise, have better resolution, and better contrast enhancement. Conclusion and Recommendation: Both Cycle GAN and AGGAN were significantly able to reduce MAE, RMSE and PSNR in both datasets. However, full view lung dataset showed more improvement in SSIM and image quality than limited view pancreatic dataset.

Keywords: CT images, CBCT images, cycle GAN, AGGAN

Procedia PDF Downloads 58
1392 Emotion-Convolutional Neural Network for Perceiving Stress from Audio Signals: A Brain Chemistry Approach

Authors: Anup Anand Deshmukh, Catherine Soladie, Renaud Seguier

Abstract:

Emotion plays a key role in many applications like healthcare, to gather patients’ emotional behavior. Unlike typical ASR (Automated Speech Recognition) problems which focus on 'what was said', it is equally important to understand 'how it was said.' There are certain emotions which are given more importance due to their effectiveness in understanding human feelings. In this paper, we propose an approach that models human stress from audio signals. The research challenge in speech emotion detection is finding the appropriate set of acoustic features corresponding to an emotion. Another difficulty lies in defining the very meaning of emotion and being able to categorize it in a precise manner. Supervised Machine Learning models, including state of the art Deep Learning classification methods, rely on the availability of clean and labelled data. One of the problems in affective computation is the limited amount of annotated data. The existing labelled emotions datasets are highly subjective to the perception of the annotator. We address the first issue of feature selection by exploiting the use of traditional MFCC (Mel-Frequency Cepstral Coefficients) features in Convolutional Neural Network. Our proposed Emo-CNN (Emotion-CNN) architecture treats speech representations in a manner similar to how CNN’s treat images in a vision problem. Our experiments show that Emo-CNN consistently and significantly outperforms the popular existing methods over multiple datasets. It achieves 90.2% categorical accuracy on the Emo-DB dataset. We claim that Emo-CNN is robust to speaker variations and environmental distortions. The proposed approach achieves 85.5% speaker-dependant categorical accuracy for SAVEE (Surrey Audio-Visual Expressed Emotion) dataset, beating the existing CNN based approach by 10.2%. To tackle the second problem of subjectivity in stress labels, we use Lovheim’s cube, which is a 3-dimensional projection of emotions. Monoamine neurotransmitters are a type of chemical messengers in the brain that transmits signals on perceiving emotions. The cube aims at explaining the relationship between these neurotransmitters and the positions of emotions in 3D space. The learnt emotion representations from the Emo-CNN are mapped to the cube using three component PCA (Principal Component Analysis) which is then used to model human stress. This proposed approach not only circumvents the need for labelled stress data but also complies with the psychological theory of emotions given by Lovheim’s cube. We believe that this work is the first step towards creating a connection between Artificial Intelligence and the chemistry of human emotions.

Keywords: deep learning, brain chemistry, emotion perception, Lovheim's cube

Procedia PDF Downloads 126
1391 Feature Based Unsupervised Intrusion Detection

Authors: Deeman Yousif Mahmood, Mohammed Abdullah Hussein

Abstract:

The goal of a network-based intrusion detection system is to classify activities of network traffics into two major categories: normal and attack (intrusive) activities. Nowadays, data mining and machine learning plays an important role in many sciences; including intrusion detection system (IDS) using both supervised and unsupervised techniques. However, one of the essential steps of data mining is feature selection that helps in improving the efficiency, performance and prediction rate of proposed approach. This paper applies unsupervised K-means clustering algorithm with information gain (IG) for feature selection and reduction to build a network intrusion detection system. For our experimental analysis, we have used the new NSL-KDD dataset, which is a modified dataset for KDDCup 1999 intrusion detection benchmark dataset. With a split of 60.0% for the training set and the remainder for the testing set, a 2 class classifications have been implemented (Normal, Attack). Weka framework which is a java based open source software consists of a collection of machine learning algorithms for data mining tasks has been used in the testing process. The experimental results show that the proposed approach is very accurate with low false positive rate and high true positive rate and it takes less learning time in comparison with using the full features of the dataset with the same algorithm.

Keywords: information gain (IG), intrusion detection system (IDS), k-means clustering, Weka

Procedia PDF Downloads 268
1390 MhAGCN: Multi-Head Attention Graph Convolutional Network for Web Services Classification

Authors: Bing Li, Zhi Li, Yilong Yang

Abstract:

Web classification can promote the quality of service discovery and management in the service repository. It is widely used to locate developers desired services. Although traditional classification methods based on supervised learning models can achieve classification tasks, developers need to manually mark web services, and the quality of these tags may not be enough to establish an accurate classifier for service classification. With the doubling of the number of web services, the manual tagging method has become unrealistic. In recent years, the attention mechanism has made remarkable progress in the field of deep learning, and its huge potential has been fully demonstrated in various fields. This paper designs a multi-head attention graph convolutional network (MHAGCN) service classification method, which can assign different weights to the neighborhood nodes without complicated matrix operations or relying on understanding the entire graph structure. The framework combines the advantages of the attention mechanism and graph convolutional neural network. It can classify web services through automatic feature extraction. The comprehensive experimental results on a real dataset not only show the superior performance of the proposed model over the existing models but also demonstrate its potentially good interpretability for graph analysis.

Keywords: attention mechanism, graph convolutional network, interpretability, service classification, service discovery

Procedia PDF Downloads 112
1389 Dataset Quality Index:Development of Composite Indicator Based on Standard Data Quality Indicators

Authors: Sakda Loetpiparwanich, Preecha Vichitthamaros

Abstract:

Nowadays, poor data quality is considered one of the majority costs for a data project. The data project with data quality awareness almost as much time to data quality processes while data project without data quality awareness negatively impacts financial resources, efficiency, productivity, and credibility. One of the processes that take a long time is defining the expectations and measurements of data quality because the expectation is different up to the purpose of each data project. Especially, big data project that maybe involves with many datasets and stakeholders, that take a long time to discuss and define quality expectations and measurements. Therefore, this study aimed at developing meaningful indicators to describe overall data quality for each dataset to quick comparison and priority. The objectives of this study were to: (1) Develop a practical data quality indicators and measurements, (2) Develop data quality dimensions based on statistical characteristics and (3) Develop Composite Indicator that can describe overall data quality for each dataset. The sample consisted of more than 500 datasets from public sources obtained by random sampling. After datasets were collected, there are five steps to develop the Dataset Quality Index (SDQI). First, we define standard data quality expectations. Second, we find any indicators that can measure directly to data within datasets. Thirdly, each indicator aggregates to dimension using factor analysis. Next, the indicators and dimensions were weighted by an effort for data preparing process and usability. Finally, the dimensions aggregate to Composite Indicator. The results of these analyses showed that: (1) The developed useful indicators and measurements contained ten indicators. (2) the developed data quality dimension based on statistical characteristics, we found that ten indicators can be reduced to 4 dimensions. (3) The developed Composite Indicator, we found that the SDQI can describe overall datasets quality of each dataset and can separate into 3 Level as Good Quality, Acceptable Quality, and Poor Quality. The conclusion, the SDQI provide an overall description of data quality within datasets and meaningful composition. We can use SQDI to assess for all data in the data project, effort estimation, and priority. The SDQI also work well with Agile Method by using SDQI to assessment in the first sprint. After passing the initial evaluation, we can add more specific data quality indicators into the next sprint.

Keywords: data quality, dataset quality, data quality management, composite indicator, factor analysis, principal component analysis

Procedia PDF Downloads 113
1388 Enhancing Cultural Heritage Data Retrieval by Mapping COURAGE to CIDOC Conceptual Reference Model

Authors: Ghazal Faraj, Andras Micsik

Abstract:

The CIDOC Conceptual Reference Model (CRM) is an extensible ontology that provides integrated access to heterogeneous and digital datasets. The CIDOC-CRM offers a “semantic glue” intended to promote accessibility to several diverse and dispersed sources of cultural heritage data. That is achieved by providing a formal structure for the implicit and explicit concepts and their relationships in the cultural heritage field. The COURAGE (“Cultural Opposition – Understanding the CultuRal HeritAGE of Dissent in the Former Socialist Countries”) project aimed to explore methods about socialist-era cultural resistance during 1950-1990 and planned to serve as a basis for further narratives and digital humanities (DH) research. This project highlights the diversity of flourished alternative cultural scenes in Eastern Europe before 1989. Moreover, the dataset of COURAGE is an online RDF-based registry that consists of historical people, organizations, collections, and featured items. For increasing the inter-links between different datasets and retrieving more relevant data from various data silos, a shared federated ontology for reconciled data is needed. As a first step towards these goals, a full understanding of the CIDOC CRM ontology (target ontology), as well as the COURAGE dataset, was required to start the work. Subsequently, the queries toward the ontology were determined, and a table of equivalent properties from COURAGE and CIDOC CRM was created. The structural diagrams that clarify the mapping process and construct queries are on progress to map person, organization, and collection entities to the ontology. Through mapping the COURAGE dataset to CIDOC-CRM ontology, the dataset will have a common ontological foundation with several other datasets. Therefore, the expected results are: 1) retrieving more detailed data about existing entities, 2) retrieving new entities’ data, 3) aligning COURAGE dataset to a standard vocabulary, 4) running distributed SPARQL queries over several CIDOC-CRM datasets and testing the potentials of distributed query answering using SPARQL. The next plan is to map CIDOC-CRM to other upper-level ontologies or large datasets (e.g., DBpedia, Wikidata), and address similar questions on a wide variety of knowledge bases.

Keywords: CIDOC CRM, cultural heritage data, COURAGE dataset, ontology alignment

Procedia PDF Downloads 122
1387 Plant Identification Using Convolution Neural Network and Vision Transformer-Based Models

Authors: Virender Singh, Mathew Rees, Simon Hampton, Sivaram Annadurai

Abstract:

Plant identification is a challenging task that aims to identify the family, genus, and species according to plant morphological features. Automated deep learning-based computer vision algorithms are widely used for identifying plants and can help users narrow down the possibilities. However, numerous morphological similarities between and within species render correct classification difficult. In this paper, we tested custom convolution neural network (CNN) and vision transformer (ViT) based models using the PyTorch framework to classify plants. We used a large dataset of 88,000 provided by the Royal Horticultural Society (RHS) and a smaller dataset of 16,000 images from the PlantClef 2015 dataset for classifying plants at genus and species levels, respectively. Our results show that for classifying plants at the genus level, ViT models perform better compared to CNN-based models ResNet50 and ResNet-RS-420 and other state-of-the-art CNN-based models suggested in previous studies on a similar dataset. ViT model achieved top accuracy of 83.3% for classifying plants at the genus level. For classifying plants at the species level, ViT models perform better compared to CNN-based models ResNet50 and ResNet-RS-420, with a top accuracy of 92.5%. We show that the correct set of augmentation techniques plays an important role in classification success. In conclusion, these results could help end users, professionals and the general public alike in identifying plants quicker and with improved accuracy.

Keywords: plant identification, CNN, image processing, vision transformer, classification

Procedia PDF Downloads 68
1386 PatchMix: Learning Transferable Semi-Supervised Representation by Predicting Patches

Authors: Arpit Rai

Abstract:

In this work, we propose PatchMix, a semi-supervised method for pre-training visual representations. PatchMix mixes patches of two images and then solves an auxiliary task of predicting the label of each patch in the mixed image. Our experiments on the CIFAR-10, 100 and the SVHN dataset show that the representations learned by this method encodes useful information for transfer to new tasks and outperform the baseline Residual Network encoders by on CIFAR 10 by 12% on ResNet 101 and 2% on ResNet-56, by 4% on CIFAR-100 on ResNet101 and by 6% on SVHN dataset on the ResNet-101 baseline model.

Keywords: self-supervised learning, representation learning, computer vision, generalization

Procedia PDF Downloads 63
1385 Rd-PLS Regression: From the Analysis of Two Blocks of Variables to Path Modeling

Authors: E. Tchandao Mangamana, V. Cariou, E. Vigneau, R. Glele Kakai, E. M. Qannari

Abstract:

A new definition of a latent variable associated with a dataset makes it possible to propose variants of the PLS2 regression and the multi-block PLS (MB-PLS). We shall refer to these variants as Rd-PLS regression and Rd-MB-PLS respectively because they are inspired by both Redundancy analysis and PLS regression. Usually, a latent variable t associated with a dataset Z is defined as a linear combination of the variables of Z with the constraint that the length of the loading weights vector equals 1. Formally, t=Zw with ‖w‖=1. Denoting by Z' the transpose of Z, we define herein, a latent variable by t=ZZ’q with the constraint that the auxiliary variable q has a norm equal to 1. This new definition of a latent variable entails that, as previously, t is a linear combination of the variables in Z and, in addition, the loading vector w=Z’q is constrained to be a linear combination of the rows of Z. More importantly, t could be interpreted as a kind of projection of the auxiliary variable q onto the space generated by the variables in Z, since it is collinear to the first PLS1 component of q onto Z. Consider the situation in which we aim to predict a dataset Y from another dataset X. These two datasets relate to the same individuals and are assumed to be centered. Let us consider a latent variable u=YY’q to which we associate the variable t= XX’YY’q. Rd-PLS consists in seeking q (and therefore u and t) so that the covariance between t and u is maximum. The solution to this problem is straightforward and consists in setting q to the eigenvector of YY’XX’YY’ associated with the largest eigenvalue. For the determination of higher order components, we deflate X and Y with respect to the latent variable t. Extending Rd-PLS to the context of multi-block data is relatively easy. Starting from a latent variable u=YY’q, we consider its ‘projection’ on the space generated by the variables of each block Xk (k=1, ..., K) namely, tk= XkXk'YY’q. Thereafter, Rd-MB-PLS seeks q in order to maximize the average of the covariances of u with tk (k=1, ..., K). The solution to this problem is given by q, eigenvector of YY’XX’YY’, where X is the dataset obtained by horizontally merging datasets Xk (k=1, ..., K). For the determination of latent variables of order higher than 1, we use a deflation of Y and Xk with respect to the variable t= XX’YY’q. In the same vein, extending Rd-MB-PLS to the path modeling setting is straightforward. Methods are illustrated on the basis of case studies and performance of Rd-PLS and Rd-MB-PLS in terms of prediction is compared to that of PLS2 and MB-PLS.

Keywords: multiblock data analysis, partial least squares regression, path modeling, redundancy analysis

Procedia PDF Downloads 113
1384 Automated Evaluation Approach for Time-Dependent Question Answering Pairs on Web Crawler Based Question Answering System

Authors: Shraddha Chaudhary, Raksha Agarwal, Niladri Chatterjee

Abstract:

This work demonstrates a web crawler-based generalized end-to-end open domain Question Answering (QA) system. An efficient QA system requires a significant amount of domain knowledge to answer any question with the aim to find an exact and correct answer in the form of a number, a noun, a short phrase, or a brief piece of text for the user's questions. Analysis of the question, searching the relevant document, and choosing an answer are three important steps in a QA system. This work uses a web scraper (Beautiful Soup) to extract K-documents from the web. The value of K can be calibrated on the basis of a trade-off between time and accuracy. This is followed by a passage ranking process using the MS-Marco dataset trained on 500K queries to extract the most relevant text passage, to shorten the lengthy documents. Further, a QA system is used to extract the answers from the shortened documents based on the query and return the top 3 answers. For evaluation of such systems, accuracy is judged by the exact match between predicted answers and gold answers. But automatic evaluation methods fail due to the linguistic ambiguities inherent in the questions. Moreover, reference answers are often not exhaustive or are out of date. Hence correct answers predicted by the system are often judged incorrect according to the automated metrics. One such scenario arises from the original Google Natural Question (GNQ) dataset which was collected and made available in the year 2016. Use of any such dataset proves to be inefficient with respect to any questions that have time-varying answers. For illustration, if the query is where will be the next Olympics? Gold Answer for the above query as given in the GNQ dataset is “Tokyo”. Since the dataset was collected in the year 2016, and the next Olympics after 2016 were in 2020 that was in Tokyo which is absolutely correct. But if the same question is asked in 2022 then the answer is “Paris, 2024”. Consequently, any evaluation based on the GNQ dataset will be incorrect. Such erroneous predictions are usually given to human evaluators for further validation which is quite expensive and time-consuming. To address this erroneous evaluation, the present work proposes an automated approach for evaluating time-dependent question-answer pairs. In particular, it proposes a metric using the current timestamp along with top-n predicted answers from a given QA system. To test the proposed approach GNQ dataset has been used and the system achieved an accuracy of 78% for a test dataset comprising 100 QA pairs. This test data was automatically extracted using an analysis-based approach from 10K QA pairs of the GNQ dataset. The results obtained are encouraging. The proposed technique appears to have the possibility of developing into a useful scheme for gathering precise, reliable, and specific information in a real-time and efficient manner. Our subsequent experiments will be guided towards establishing the efficacy of the above system for a larger set of time-dependent QA pairs.

Keywords: web-based information retrieval, open domain question answering system, time-varying QA, QA evaluation

Procedia PDF Downloads 76
1383 Cosmetic Recommendation Approach Using Machine Learning

Authors: Shakila N. Senarath, Dinesh Asanka, Janaka Wijayanayake

Abstract:

The necessity of cosmetic products is arising to fulfill consumer needs of personality appearance and hygiene. A cosmetic product consists of various chemical ingredients which may help to keep the skin healthy or may lead to damages. Every chemical ingredient in a cosmetic product does not perform on every human. The most appropriate way to select a healthy cosmetic product is to identify the texture of the body first and select the most suitable product with safe ingredients. Therefore, the selection process of cosmetic products is complicated. Consumer surveys have shown most of the time, the selection process of cosmetic products is done in an improper way by consumers. From this study, a content-based system is suggested that recommends cosmetic products for the human factors. To such an extent, the skin type, gender and price range will be considered as human factors. The proposed system will be implemented by using Machine Learning. Consumer skin type, gender and price range will be taken as inputs to the system. The skin type of consumer will be derived by using the Baumann Skin Type Questionnaire, which is a value-based approach that includes several numbers of questions to derive the user’s skin type to one of the 16 skin types according to the Bauman Skin Type indicator (BSTI). Two datasets are collected for further research proceedings. The user data set was collected using a questionnaire given to the public. Those are the user dataset and the cosmetic dataset. Product details are included in the cosmetic dataset, which belongs to 5 different kinds of product categories (Moisturizer, Cleanser, Sun protector, Face Mask, Eye Cream). An alternate approach of TF-IDF (Term Frequency – Inverse Document Frequency) is applied to vectorize cosmetic ingredients in the generic cosmetic products dataset and user-preferred dataset. Using the IF-IPF vectors, each user-preferred products dataset and generic cosmetic products dataset can be represented as sparse vectors. The similarity between each user-preferred product and generic cosmetic product will be calculated using the cosine similarity method. For the recommendation process, a similarity matrix can be used. Higher the similarity, higher the match for consumer. Sorting a user column from similarity matrix in a descending order, the recommended products can be retrieved in ascending order. Even though results return a list of similar products, and since the user information has been gathered, such as gender and the price ranges for product purchasing, further optimization can be done by considering and giving weights for those parameters once after a set of recommended products for a user has been retrieved.

Keywords: content-based filtering, cosmetics, machine learning, recommendation system

Procedia PDF Downloads 113
1382 Aiding Water Flow in Irrigation Technology with a Pedal Operated Manual Pump

Authors: Isaac Ali Kwasu, Aje Tokan

Abstract:

The research was set to design a manually pedal operated water pump to aid water flow technology for irrigation activities for rural farmers. The development was carried out first by a prototype design to guide the fabrication. All items needed for the fabrication were used for the final product. The machine is operated manually by pedaling. This engages all the parts of the machine into active motion. Energy is generated and transfer finally to the pumping unit which is wired with plastic pipes. The pumping unit which is wired with PVC pipes, both linked to the water source and the reservoir respectively. The (rpm) revolution per minute of the machine is approximated at 3130 depending on the pedaling speed of the user. The machine does not have gear arrangement yet can give high (rpm) for effective performance. The pumping performance of the machine is 125 liters in one minute and can sustain small scale irrigation farming activities and to supplement water management system to sustain crop growth.

Keywords: pump, development, manual, flywheel, sprocket, pulley, machine, v belt, chain, hub, pipe, steel, mechanism, irrigation, prototype, fabrication

Procedia PDF Downloads 180
1381 An Enhanced Support Vector Machine Based Approach for Sentiment Classification of Arabic Tweets of Different Dialects

Authors: Gehad S. Kaseb, Mona F. Ahmed

Abstract:

Arabic Sentiment Analysis (SA) is one of the most common research fields with many open areas. Few studies apply SA to Arabic dialects. This paper proposes different pre-processing steps and a modified methodology to improve the accuracy using normal Support Vector Machine (SVM) classification. The paper works on two datasets, Arabic Sentiment Tweets Dataset (ASTD) and Extended Arabic Tweets Sentiment Dataset (Extended-AATSD), which are publicly available for academic use. The results show that the classification accuracy approaches 86%.

Keywords: Arabic, classification, sentiment analysis, tweets

Procedia PDF Downloads 118
1380 Using Machine Learning to Build a Real-Time COVID-19 Mask Safety Monitor

Authors: Yash Jain

Abstract:

The US Center for Disease Control has recommended wearing masks to slow the spread of the virus. The research uses a video feed from a camera to conduct real-time classifications of whether or not a human is correctly wearing a mask, incorrectly wearing a mask, or not wearing a mask at all. Utilizing two distinct datasets from the open-source website Kaggle, a mask detection network had been trained. The first dataset that was used to train the model was titled 'Face Mask Detection' on Kaggle, where the dataset was retrieved from and the second dataset was titled 'Face Mask Dataset, which provided the data in a (YOLO Format)' so that the TinyYoloV3 model could be trained. Based on the data from Kaggle, two machine learning models were implemented and trained: a Tiny YoloV3 Real-time model and a two-stage neural network classifier. The two-stage neural network classifier had a first step of identifying distinct faces within the image, and the second step was a classifier to detect the state of the mask on the face and whether it was worn correctly, incorrectly, or no mask at all. The TinyYoloV3 was used for the live feed as well as for a comparison standpoint against the previous two-stage classifier and was trained using the darknet neural network framework. The two-stage classifier attained a mean average precision (MAP) of 80%, while the model trained using TinyYoloV3 real-time detection had a mean average precision (MAP) of 59%. Overall, both models were able to correctly classify stages/scenarios of no mask, mask, and incorrectly worn masks.

Keywords: datasets, classifier, mask-detection, real-time, TinyYoloV3, two-stage neural network classifier

Procedia PDF Downloads 131
1379 Application of Data Mining Techniques for Tourism Knowledge Discovery

Authors: Teklu Urgessa, Wookjae Maeng, Joong Seek Lee

Abstract:

Application of five implementations of three data mining classification techniques was experimented for extracting important insights from tourism data. The aim was to find out the best performing algorithm among the compared ones for tourism knowledge discovery. Knowledge discovery process from data was used as a process model. 10-fold cross validation method is used for testing purpose. Various data preprocessing activities were performed to get the final dataset for model building. Classification models of the selected algorithms were built with different scenarios on the preprocessed dataset. The outperformed algorithm tourism dataset was Random Forest (76%) before applying information gain based attribute selection and J48 (C4.5) (75%) after selection of top relevant attributes to the class (target) attribute. In terms of time for model building, attribute selection improves the efficiency of all algorithms. Artificial Neural Network (multilayer perceptron) showed the highest improvement (90%). The rules extracted from the decision tree model are presented, which showed intricate, non-trivial knowledge/insight that would otherwise not be discovered by simple statistical analysis with mediocre accuracy of the machine using classification algorithms.

Keywords: classification algorithms, data mining, knowledge discovery, tourism

Procedia PDF Downloads 270