Search results for: surveyed dataset.
383 Semi-Supervised Outlier Detection Using a Generative and Adversary Framework
Authors: Jindong Gu, Matthias Schubert, Volker Tresp
Abstract:
In many outlier detection tasks, only training data belonging to one class, i.e., the positive class, is available. The task is then to predict a new data point as belonging either to the positive class or to the negative class, in which case the data point is considered an outlier. For this task, we propose a novel corrupted Generative Adversarial Network (CorGAN). In the adversarial process of training CorGAN, the Generator generates outlier samples for the negative class, and the Discriminator is trained to distinguish the positive training data from the generated negative data. The proposed framework is evaluated using an image dataset and a real-world network intrusion dataset. Our outlier-detection method achieves state-of-the-art performance on both tasks.Keywords: Outlier detection, generative adversary networks, semi-supervised learning.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1074382 Upgraded Rough Clustering and Outlier Detection Method on Yeast Dataset by Entropy Rough K-Means Method
Authors: P. Ashok, G. M. Kadhar Nawaz
Abstract:
Rough set theory is used to handle uncertainty and incomplete information by applying two accurate sets, Lower approximation and Upper approximation. In this paper, the rough clustering algorithms are improved by adopting the Similarity, Dissimilarity–Similarity and Entropy based initial centroids selection method on three different clustering algorithms namely Entropy based Rough K-Means (ERKM), Similarity based Rough K-Means (SRKM) and Dissimilarity-Similarity based Rough K-Means (DSRKM) were developed and executed by yeast dataset. The rough clustering algorithms are validated by cluster validity indexes namely Rand and Adjusted Rand indexes. An experimental result shows that the ERKM clustering algorithm perform effectively and delivers better results than other clustering methods. Outlier detection is an important task in data mining and very much different from the rest of the objects in the clusters. Entropy based Rough Outlier Factor (EROF) method is seemly to detect outlier effectively for yeast dataset. In rough K-Means method, by tuning the epsilon (ᶓ) value from 0.8 to 1.08 can detect outliers on boundary region and the RKM algorithm delivers better results, when choosing the value of epsilon (ᶓ) in the specified range. An experimental result shows that the EROF method on clustering algorithm performed very well and suitable for detecting outlier effectively for all datasets. Further, experimental readings show that the ERKM clustering method outperformed the other methods.
Keywords: Clustering, Entropy, Outlier, Rough K-Means, validity index.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1412381 Comparative Evaluation of Accuracy of Selected Machine Learning Classification Techniques for Diagnosis of Cancer: A Data Mining Approach
Authors: Rajvir Kaur, Jeewani Anupama Ginige
Abstract:
With recent trends in Big Data and advancements in Information and Communication Technologies, the healthcare industry is at the stage of its transition from clinician oriented to technology oriented. Many people around the world die of cancer because the diagnosis of disease was not done at an early stage. Nowadays, the computational methods in the form of Machine Learning (ML) are used to develop automated decision support systems that can diagnose cancer with high confidence in a timely manner. This paper aims to carry out the comparative evaluation of a selected set of ML classifiers on two existing datasets: breast cancer and cervical cancer. The ML classifiers compared in this study are Decision Tree (DT), Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), Logistic Regression, Ensemble (Bagged Tree) and Artificial Neural Networks (ANN). The evaluation is carried out based on standard evaluation metrics Precision (P), Recall (R), F1-score and Accuracy. The experimental results based on the evaluation metrics show that ANN showed the highest-level accuracy (99.4%) when tested with breast cancer dataset. On the other hand, when these ML classifiers are tested with the cervical cancer dataset, Ensemble (Bagged Tree) technique gave better accuracy (93.1%) in comparison to other classifiers.Keywords: Artificial neural networks, breast cancer, cancer dataset, classifiers, cervical cancer, F-score, logistic regression, machine learning, precision, recall, support vector machine.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1553380 A Neural Network Approach in Predicting the Blood Glucose Level for Diabetic Patients
Authors: Zarita Zainuddin, Ong Pauline, C. Ardil
Abstract:
Diabetes Mellitus is a chronic metabolic disorder, where the improper management of the blood glucose level in the diabetic patients will lead to the risk of heart attack, kidney disease and renal failure. This paper attempts to enhance the diagnostic accuracy of the advancing blood glucose levels of the diabetic patients, by combining principal component analysis and wavelet neural network. The proposed system makes separate blood glucose prediction in the morning, afternoon, evening and night intervals, using dataset from one patient covering a period of 77 days. Comparisons of the diagnostic accuracy with other neural network models, which use the same dataset are made. The comparison results showed overall improved accuracy, which indicates the effectiveness of this proposed system.Keywords: Diabetes Mellitus, principal component analysis, time-series, wavelet neural network.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2988379 Analysis of a Population of Diabetic Patients Databases with Classifiers
Authors: Murat Koklu, Yavuz Unal
Abstract:
Data mining can be called as a technique to extract information from data. It is the process of obtaining hidden information and then turning it into qualified knowledge by statistical and artificial intelligence technique. One of its application areas is medical area to form decision support systems for diagnosis just by inventing meaningful information from given medical data. In this study a decision support system for diagnosis of illness that make use of data mining and three different artificial intelligence classifier algorithms namely Multilayer Perceptron, Naive Bayes Classifier and J.48. Pima Indian dataset of UCI Machine Learning Repository was used. This dataset includes urinary and blood test results of 768 patients. These test results consist of 8 different feature vectors. Obtained classifying results were compared with the previous studies. The suggestions for future studies were presented.
Keywords: Artificial Intelligence, Classifiers, Data Mining, Diabetic Patients.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 5431378 Anomaly Based On Frequent-Outlier for Outbreak Detection in Public Health Surveillance
Authors: Zalizah Awang Long, Abdul Razak Hamdan, Azuraliza Abu Bakar
Abstract:
Public health surveillance system focuses on outbreak detection and data sources used. Variation or aberration in the frequency distribution of health data, compared to historical data is often used to detect outbreaks. It is important that new techniques be developed to improve the detection rate, thereby reducing wastage of resources in public health. Thus, the objective is to developed technique by applying frequent mining and outlier mining techniques in outbreak detection. 14 datasets from the UCI were tested on the proposed technique. The performance of the effectiveness for each technique was measured by t-test. The overall performance shows that DTK can be used to detect outlier within frequent dataset. In conclusion the outbreak detection technique using anomaly-based on frequent-outlier technique can be used to identify the outlier within frequent dataset.
Keywords: Outlier detection, frequent-outlier, outbreak, anomaly, surveillance, public health
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2274377 Cascaded Neural Network for Internal Temperature Forecasting in Induction Motor
Authors: Hidir S. Nogay
Abstract:
In this study, two systems were created to predict interior temperature in induction motor. One of them consisted of a simple ANN model which has two layers, ten input parameters and one output parameter. The other one consisted of eight ANN models connected each other as cascaded. Cascaded ANN system has 17 inputs. Main reason of cascaded system being used in this study is to accomplish more accurate estimation by increasing inputs in the ANN system. Cascaded ANN system is compared with simple conventional ANN model to prove mentioned advantages. Dataset was obtained from experimental applications. Small part of the dataset was used to obtain more understandable graphs. Number of data is 329. 30% of the data was used for testing and validation. Test data and validation data were determined for each ANN model separately and reliability of each model was tested. As a result of this study, it has been understood that the cascaded ANN system produced more accurate estimates than conventional ANN model.Keywords: Cascaded neural network, internal temperature, three-phase induction motor, inverter.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 872376 Using Machine Learning Techniques for Autism Spectrum Disorder Analysis and Detection in Children
Authors: Norah Alshahrani, Abdulaziz Almaleh
Abstract:
Autism Spectrum Disorder (ASD) is a condition related to issues with brain development that affects how a person recognises and communicates with others which results in difficulties with interaction and communication socially and it is constantly growing. Early recognition of ASD allows children to lead safe and healthy lives and helps doctors with accurate diagnoses and management of conditions. Therefore, it is crucial to develop a method that will achieve good results and with high accuracy for the measurement of ASD in children. In this paper, ASD datasets of toddlers and children have been analyzed. We employed the following machine learning techniques to attempt to explore ASD: Random Forest (RF), Decision Tree (DT), Na¨ıve Bayes (NB) and Support Vector Machine (SVM). Then feature selection was used to provide fewer attributes from ASD datasets while preserving model performance. As a result, we found that the best result has been provided by SVM, achieving 0.98% in the toddler dataset and 0.99% in the children dataset.
Keywords: Autism Spectrum Disorder, ASD, Machine Learning, ML, Feature Selection, Support Vector Machine, SVM.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 598375 Genetic Programming Approach for Multi-Category Pattern Classification Appliedto Network Intrusions Detection
Authors: K.M. Faraoun, A. Boukelif
Abstract:
This paper describes a new approach of classification using genetic programming. The proposed technique consists of genetically coevolving a population of non-linear transformations on the input data to be classified, and map them to a new space with a reduced dimension, in order to get a maximum inter-classes discrimination. The classification of new samples is then performed on the transformed data, and so become much easier. Contrary to the existing GP-classification techniques, the proposed one use a dynamic repartition of the transformed data in separated intervals, the efficacy of a given intervals repartition is handled by the fitness criterion, with a maximum classes discrimination. Experiments were first performed using the Fisher-s Iris dataset, and then, the KDD-99 Cup dataset was used to study the intrusion detection and classification problem. Obtained results demonstrate that the proposed genetic approach outperform the existing GP-classification methods [1],[2] and [3], and give a very accepted results compared to other existing techniques proposed in [4],[5],[6],[7] and [8].Keywords: Genetic programming, patterns classification, intrusion detection
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1710374 Class Outliers Mining: Distance-Based Approach
Authors: Nabil M. Hewahi, Motaz K. Saad
Abstract:
In large datasets, identifying exceptional or rare cases with respect to a group of similar cases is considered very significant problem. The traditional problem (Outlier Mining) is to find exception or rare cases in a dataset irrespective of the class label of these cases, they are considered rare events with respect to the whole dataset. In this research, we pose the problem that is Class Outliers Mining and a method to find out those outliers. The general definition of this problem is “given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels". We introduce a novel definition of Outlier that is Class Outlier, and propose the Class Outlier Factor (COF) which measures the degree of being a Class Outlier for a data object. Our work includes a proposal of a new algorithm towards mining of the Class Outliers, presenting experimental results applied on various domains of real world datasets and finally a comparison study with other related methods is performed.Keywords: Class Outliers, Distance-Based Approach, Outliers Mining.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3387373 Continual Learning Using Data Generation for Hyperspectral Remote Sensing Scene Classification
Authors: Samiah Alammari, Nassim Ammour
Abstract:
When providing a massive number of tasks successively to a deep learning process, a good performance of the model requires preserving the previous tasks data to retrain the model for each upcoming classification. Otherwise, the model performs poorly due to the catastrophic forgetting phenomenon. To overcome this shortcoming, we developed a successful continual learning deep model for remote sensing hyperspectral image regions classification. The proposed neural network architecture encapsulates two trainable subnetworks. The first module adapts its weights by minimizing the discrimination error between the land-cover classes during the new task learning, and the second module tries to learn how to replicate the data of the previous tasks by discovering the latent data structure of the new task dataset. We conduct experiments on hyperspectral image (HSI) dataset on Indian Pines. The results confirm the capability of the proposed method.
Keywords: Continual learning, data reconstruction, remote sensing, hyperspectral image segmentation.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 232372 A Theory in Optimization of Ad-hoc Routing Algorithms
Authors: M. Kargar, F.Fartash, T. Saderi, M. Ebrahimi Dishabi
Abstract:
In this paper optimization of routing in ad-hoc networks is surveyed and a new method for reducing the complexity of routing algorithms is suggested. Using binary matrices for each node in the network and updating it once the routing is done, helps nodes to stop repeating the routing protocols in each data transfer. The algorithm suggested can reduce the complexity of routing to the least amount possible.Keywords: Ad-hoc Networks, Algorithm, Protocol, RoutingTrain.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1671371 Machine Learning-Enabled Classification of Climbing Using Small Data
Authors: Nicholas Milburn, Yu Liang, Dalei Wu
Abstract:
Athlete performance scoring within the climbing domain presents interesting challenges as the sport does not have an objective way to assign skill. Assessing skill levels within any sport is valuable as it can be used to mark progress while training, and it can help an athlete choose appropriate climbs to attempt. Machine learning-based methods are popular for complex problems like this. The dataset available was composed of dynamic force data recorded during climbing; however, this dataset came with challenges such as data scarcity, imbalance, and it was temporally heterogeneous. Investigated solutions to these challenges include data augmentation, temporal normalization, conversion of time series to the spectral domain, and cross validation strategies. The investigated solutions to the classification problem included light weight machine classifiers KNN and SVM as well as the deep learning with CNN. The best performing model had an 80% accuracy. In conclusion, there seems to be enough information within climbing force data to accurately categorize climbers by skill.
Keywords: Classification, climbing, data imbalance, data scarcity, machine learning, time sequence.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 567370 Analysis of Relation between Unlabeled and Labeled Data to Self-Taught Learning Performance
Authors: Ekachai Phaisangittisagul, Rapeepol Chongprachawat
Abstract:
Obtaining labeled data in supervised learning is often difficult and expensive, and thus the trained learning algorithm tends to be overfitting due to small number of training data. As a result, some researchers have focused on using unlabeled data which may not necessary to follow the same generative distribution as the labeled data to construct a high-level feature for improving performance on supervised learning tasks. In this paper, we investigate the impact of the relationship between unlabeled and labeled data for classification performance. Specifically, we will apply difference unlabeled data which have different degrees of relation to the labeled data for handwritten digit classification task based on MNIST dataset. Our experimental results show that the higher the degree of relation between unlabeled and labeled data, the better the classification performance. Although the unlabeled data that is completely from different generative distribution to the labeled data provides the lowest classification performance, we still achieve high classification performance. This leads to expanding the applicability of the supervised learning algorithms using unsupervised learning.Keywords: Autoencoder, high-level feature, MNIST dataset, selftaught learning, supervised learning.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1832369 Towards Real-Time Classification of Finger Movement Direction Using Encephalography Independent Components
Authors: Mohamed Mounir Tellache, Hiroyuki Kambara, Yasuharu Koike, Makoto Miyakoshi, Natsue Yoshimura
Abstract:
This study explores the practicality of using electroencephalographic (EEG) independent components to predict eight-direction finger movements in pseudo-real-time. Six healthy participants with individual-head MRI images performed finger movements in eight directions with two different arm configurations. The analysis was performed in two stages. The first stage consisted of using independent component analysis (ICA) to separate the signals representing brain activity from non-brain activity signals and to obtain the unmixing matrix. The resulting independent components (ICs) were checked, and those reflecting brain-activity were selected. Finally, the time series of the selected ICs were used to predict eight finger-movement directions using Sparse Logistic Regression (SLR). The second stage consisted of using the previously obtained unmixing matrix, the selected ICs, and the model obtained by applying SLR to classify a different EEG dataset. This method was applied to two different settings, namely the single-participant level and the group-level. For the single-participant level, the EEG dataset used in the first stage and the EEG dataset used in the second stage originated from the same participant. For the group-level, the EEG datasets used in the first stage were constructed by temporally concatenating each combination without repetition of the EEG datasets of five participants out of six, whereas the EEG dataset used in the second stage originated from the remaining participants. The average test classification results across datasets (mean ± S.D.) were 38.62 ± 8.36% for the single-participant, which was significantly higher than the chance level (12.50 ± 0.01%), and 27.26 ± 4.39% for the group-level which was also significantly higher than the chance level (12.49% ± 0.01%). The classification accuracy within [–45°, 45°] of the true direction is 70.03 ± 8.14% for single-participant and 62.63 ± 6.07% for group-level which may be promising for some real-life applications. Clustering and contribution analyses further revealed the brain regions involved in finger movement and the temporal aspect of their contribution to the classification. These results showed the possibility of using the ICA-based method in combination with other methods to build a real-time system to control prostheses.Keywords: Brain-computer interface, BCI, electroencephalography, EEG, finger motion decoding, independent component analysis, pseudo-real-time motion decoding.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 599368 A Survey on Principal Aspects of Secure Image Transmission
Authors: Ali Soleymani, Zulkarnain Md Ali, Md Jan Nordin
Abstract:
This paper is a review on the aspects and approaches of design an image cryptosystem. First a general introduction given for cryptography and images encryption and followed by different techniques in image encryption and related works for each technique surveyed. Finally, general security analysis methods for encrypted images are mentioned.
Keywords: Image, cryptography, encryption, security, analysis.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2384367 Feature Selection with Kohonen Self Organizing Classification Algorithm
Authors: Francesco Maiorana
Abstract:
In this paper a one-dimension Self Organizing Map algorithm (SOM) to perform feature selection is presented. The algorithm is based on a first classification of the input dataset on a similarity space. From this classification for each class a set of positive and negative features is computed. This set of features is selected as result of the procedure. The procedure is evaluated on an in-house dataset from a Knowledge Discovery from Text (KDT) application and on a set of publicly available datasets used in international feature selection competitions. These datasets come from KDT applications, drug discovery as well as other applications. The knowledge of the correct classification available for the training and validation datasets is used to optimize the parameters for positive and negative feature extractions. The process becomes feasible for large and sparse datasets, as the ones obtained in KDT applications, by using both compression techniques to store the similarity matrix and speed up techniques of the Kohonen algorithm that take advantage of the sparsity of the input matrix. These improvements make it feasible, by using the grid, the application of the methodology to massive datasets.Keywords: Clustering algorithm, Data mining, Feature selection, Grid, Kohonen Self Organizing Map.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3051366 A New Hybrid K-Mean-Quick Reduct Algorithm for Gene Selection
Authors: E. N. Sathishkumar, K. Thangavel, T. Chandrasekhar
Abstract:
Feature selection is a process to select features which are more informative. It is one of the important steps in knowledge discovery. The problem is that all genes are not important in gene expression data. Some of the genes may be redundant, and others may be irrelevant and noisy. Here a novel approach is proposed Hybrid K-Mean-Quick Reduct (KMQR) algorithm for gene selection from gene expression data. In this study, the entire dataset is divided into clusters by applying K-Means algorithm. Each cluster contains similar genes. The high class discriminated genes has been selected based on their degree of dependence by applying Quick Reduct algorithm to all the clusters. Average Correlation Value (ACV) is calculated for the high class discriminated genes. The clusters which have the ACV value as 1 is determined as significant clusters, whose classification accuracy will be equal or high when comparing to the accuracy of the entire dataset. The proposed algorithm is evaluated using WEKA classifiers and compared. The proposed work shows that the high classification accuracy.
Keywords: Clustering, Gene Selection, K-Mean-Quick Reduct, Rough Sets.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2298365 Exploiting Global Self Similarity for Head-Shoulder Detection
Authors: Lae-Jeong Park, Jung-Ho Moon
Abstract:
People detection from images has a variety of applications such as video surveillance and driver assistance system, but is still a challenging task and more difficult in crowded environments such as shopping malls in which occlusion of lower parts of human body often occurs. Lack of the full-body information requires more effective features than common features such as HOG. In this paper, new features are introduced that exploits global self-symmetry (GSS) characteristic in head-shoulder patterns. The features encode the similarity or difference of color histograms and oriented gradient histograms between two vertically symmetric blocks. The domain-specific features are rapid to compute from the integral images in Viola-Jones cascade-of-rejecters framework. The proposed features are evaluated with our own head-shoulder dataset that, in part, consists of a well-known INRIA pedestrian dataset. Experimental results show that the GSS features are effective in reduction of false alarmsmarginally and the gradient GSS features are preferred more often than the color GSS ones in the feature selection.
Keywords: Pedestrian detection, cascade of rejecters, feature extraction, self-symmetry, HOG.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2400364 A Comparison of Different Soft Computing Models for Credit Scoring
Authors: Nnamdi I. Nwulu, Shola G. Oroja
Abstract:
It has become crucial over the years for nations to improve their credit scoring methods and techniques in light of the increasing volatility of the global economy. Statistical methods or tools have been the favoured means for this; however artificial intelligence or soft computing based techniques are becoming increasingly preferred due to their proficient and precise nature and relative simplicity. This work presents a comparison between Support Vector Machines and Artificial Neural Networks two popular soft computing models when applied to credit scoring. Amidst the different criteria-s that can be used for comparisons; accuracy, computational complexity and processing times are the selected criteria used to evaluate both models. Furthermore the German credit scoring dataset which is a real world dataset is used to train and test both developed models. Experimental results obtained from our study suggest that although both soft computing models could be used with a high degree of accuracy, Artificial Neural Networks deliver better results than Support Vector Machines.Keywords: Artificial Neural Networks, Credit Scoring, SoftComputing Models, Support Vector Machines.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2128363 DCBOR: A Density Clustering Based on Outlier Removal
Authors: A. M. Fahim, G. Saake, A. M. Salem, F. A. Torkey, M. A. Ramadan
Abstract:
Data clustering is an important data exploration technique with many applications in data mining. We present an enhanced version of the well known single link clustering algorithm. We will refer to this algorithm as DCBOR. The proposed algorithm alleviates the chain effect by removing the outliers from the given dataset. So this algorithm provides outlier detection and data clustering simultaneously. This algorithm does not need to update the distance matrix, since the algorithm depends on merging the most k-nearest objects in one step and the cluster continues grow as long as possible under specified condition. So the algorithm consists of two phases; at the first phase, it removes the outliers from the input dataset. At the second phase, it performs the clustering process. This algorithm discovers clusters of different shapes, sizes, densities and requires only one input parameter; this parameter represents a threshold for outlier points. The value of the input parameter is ranging from 0 to 1. The algorithm supports the user in determining an appropriate value for it. We have tested this algorithm on different datasets contain outlier and connecting clusters by chain of density points, and the algorithm discovers the correct clusters. The results of our experiments demonstrate the effectiveness and the efficiency of DCBOR.Keywords: Data Clustering, Clustering Algorithms, Handling Noise, Arbitrary Shape of Clusters.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1932362 Index t-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings
Authors: G. Candel, D. Naccache
Abstract:
t-SNE is an embedding method that the data science community has widely used. It helps two main tasks: to display results by coloring items according to the item class or feature value; and for forensic, giving a first overview of the dataset distribution. Two interesting characteristics of t-SNE are the structure preservation property and the answer to the crowding problem, where all neighbors in high dimensional space cannot be represented correctly in low dimensional space. t-SNE preserves the local neighborhood, and similar items are nicely spaced by adjusting to the local density. These two characteristics produce a meaningful representation, where the cluster area is proportional to its size in number, and relationships between clusters are materialized by closeness on the embedding. This algorithm is non-parametric. The transformation from a high to low dimensional space is described but not learned. Two initializations of the algorithm would lead to two different embedding. In a forensic approach, analysts would like to compare two or more datasets using their embedding. A naive approach would be to embed all datasets together. However, this process is costly as the complexity of t-SNE is quadratic, and would be infeasible for too many datasets. Another approach would be to learn a parametric model over an embedding built with a subset of data. While this approach is highly scalable, points could be mapped at the same exact position, making them indistinguishable. This type of model would be unable to adapt to new outliers nor concept drift. This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved. The optimization process minimizes two costs, one relative to the embedding shape and the second relative to the support embedding’ match. The embedding with the support process can be repeated more than once, with the newly obtained embedding. The successive embedding can be used to study the impact of one variable over the dataset distribution or monitor changes over time. This method has the same complexity as t-SNE per embedding, and memory requirements are only doubled. For a dataset of n elements sorted and split into k subsets, the total embedding complexity would be reduced from O(n2) to O(n2/k), and the memory requirement from n2 to 2(n/k)2 which enables computation on recent laptops. The method showed promising results on a real-world dataset, allowing to observe the birth, evolution and death of clusters. The proposed approach facilitates identifying significant trends and changes, which empowers the monitoring high dimensional datasets’ dynamics.
Keywords: Concept drift, data visualization, dimension reduction, embedding, monitoring, reusability, t-SNE, unsupervised learning.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 489361 Evaluating Generative Neural Attention Weights-Based Chatbot on Customer Support Twitter Dataset
Authors: Sinarwati Mohamad Suhaili, Naomie Salim, Mohamad Nazim Jambli
Abstract:
Sequence-to-sequence (seq2seq) models augmented with attention mechanisms are increasingly important in automated customer service. These models, adept at recognizing complex relationships between input and output sequences, are essential for optimizing chatbot responses. Central to these mechanisms are neural attention weights that determine the model’s focus during sequence generation. Despite their widespread use, there remains a gap in the comparative analysis of different attention weighting functions within seq2seq models, particularly in the context of chatbots utilizing the Customer Support Twitter (CST) dataset. This study addresses this gap by evaluating four distinct attention-scoring functions—dot, multiplicative/general, additive, and an extended multiplicative function with a tanh activation parameter — in neural generative seq2seq models. Using the CST dataset, these models were trained and evaluated over 10 epochs with the AdamW optimizer. Evaluation criteria included validation loss and BLEU scores implemented under both greedy and beam search strategies with a beam size of k = 3. Results indicate that the model with the tanh-augmented multiplicative function significantly outperforms its counterparts, achieving the lowest validation loss (1.136484) and the highest BLEU scores (0.438926 under greedy search, 0.443000 under beam search, k = 3). These findings emphasize the crucial influence of selecting an appropriate attention-scoring function to enhance the performance of seq2seq models for chatbots, particularly highlighting the model integrating tanh activation as a promising approach to improving chatbot quality in customer support contexts.
Keywords: Attention weight, chatbot, encoder-decoder, neural generative attention, score function, sequence-to-sequence.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 89360 Distribution of Macrobenthic Polychaete Families in Relation to Environmental Parameters in North West Penang, Malaysia
Authors: Mohammad Gholizadeh, Khairun Yahya, Anita Talib, Omar Ahmad
Abstract:
The distribution of macrobenthic polychaetes along the coastal waters of Penang National Park was surveyed to estimate the effect of various environmental parameters at three stations (200m, 600m and 1200m) from the shoreline, during six sampling months, from June 2010 to April 2011.The use of polychaetes in descriptive ecology is surveyed in the light of a recent investigation particularly concerning the soft bottom biota environments. Polychaetes, often connected in the former to the notion of opportunistic species able to proliferate after an enhancement in organic matter, had performed a momentous role particularly with regard to effected soft-bottom habitats. The objective of this survey was to investigate different environment stress over soft bottom polychaete community along Teluk Ketapang and Pantai Acheh (Penang National Park) over a year period. Variations in the polychaete community were evaluated using univariate and multivariate methods. The results of PCA analysis displayed a positive relation between macrobenthic community structures and environmental parameters such as sediment particle size and organic matter in the coastal water. A total of 604 individuals were examined which was grouped into 23 families. Family Nereidae was the most abundant (22.68%), followed by Spionidae (22.02%), Hesionidae (12.58%), Nephtylidae (9.27%) and Orbiniidae (8.61%). It is noticeable that good results can only be obtained on the basis of good taxonomic resolution. We proposed that, in monitoring surveys, operative time could be optimized not only by working at a highertaxonomic level on the entire macrobenthic data set, but by also choosing an especially indicative group and working at lower taxonomic and good level.Keywords: Polychaete families, environment parameters, Bioindicators, Pantai Acheh, Teluk Ketapang.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1999359 Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm
Authors: Jehad A. H. Hammad, Nur'Aini binti Abdul Rashid
Abstract:
With the rapid development in the field of life sciences and the flooding of genomic information, the need for faster and scalable searching methods has become urgent. One of the approaches that were investigated is indexing. The indexing methods have been categorized into three categories which are the lengthbased index algorithms, transformation-based algorithms and mixed techniques-based algorithms. In this research, we focused on the transformation based methods. We embedded the N-gram method into the transformation-based method to build an inverted index table. We then applied the parallel methods to speed up the index building time and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the use of N-Gram transformation algorithm is an economical solution; it saves time and space too. The result shows that the size of the index is smaller than the size of the dataset when the size of N-Gram is 5 and 6. The parallel N-Gram transformation algorithm-s results indicate that the uses of parallel programming with large dataset are promising which can be improved further.Keywords: Biological sequence, Database index, N-gram indexing, Parallel computing, Sequence retrieval.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2135358 The Relationship between Representational Conflicts, Generalization, and Encoding Requirements in an Instance Memory Network
Authors: Mathew Wakefield, Matthew Mitchell, Lisa Wise, Christopher McCarthy
Abstract:
This paper aims to provide an interpretation of artificial neural networks (ANNs) and explore some of its implications. The interpretation views ANNs as a memory which encodes instances of experience. An experiment explores the behavior of encoding and retrieval of instances from memory. A localised representation ANN is created that allows control over encoding and retrieved memory sample size and is experimented with using the MNIST digits dataset. The relationship between input familiarity, conflict within retrieved samples, and error rates is described and demonstrated to be an effective driver for memory encoding. Results indicate that selective encoding and retrieval samples that allow detection of memory conflicts produce optimal performance, and that error rates are normally distributed with input familiarity and conflict. By using input familiarity and sample consistency to guide memory encoding, the number of encoding trials on the dataset were reduced to 18.33% of the training data while maintaining good recognition performance on the test data.
Keywords: Artificial Neural Networks, ANNs, representation, memory, conflict monitoring, confidence.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 506357 Early Recognition and Grading of Cataract Using a Combined Log Gabor/Discrete Wavelet Transform with ANN and SVM
Authors: Hadeer R. M. Tawfik, Rania A. K. Birry, Amani A. Saad
Abstract:
Eyes are considered to be the most sensitive and important organ for human being. Thus, any eye disorder will affect the patient in all aspects of life. Cataract is one of those eye disorders that lead to blindness if not treated correctly and quickly. This paper demonstrates a model for automatic detection, classification, and grading of cataracts based on image processing techniques and artificial intelligence. The proposed system is developed to ease the cataract diagnosis process for both ophthalmologists and patients. The wavelet transform combined with 2D Log Gabor Wavelet transform was used as feature extraction techniques for a dataset of 120 eye images followed by a classification process that classified the image set into three classes; normal, early, and advanced stage. A comparison between the two used classifiers, the support vector machine SVM and the artificial neural network ANN were done for the same dataset of 120 eye images. It was concluded that SVM gave better results than ANN. SVM success rate result was 96.8% accuracy where ANN success rate result was 92.3% accuracy.Keywords: Cataract, classification, detection, feature extraction, grading, log-gabor, neural networks, support vector machines, wavelet.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 993356 Comparative Study on Swarm Intelligence Techniques for Biclustering of Microarray Gene Expression Data
Authors: R. Balamurugan, A. M. Natarajan, K. Premalatha
Abstract:
Microarray gene expression data play a vital in biological processes, gene regulation and disease mechanism. Biclustering in gene expression data is a subset of the genes indicating consistent patterns under the subset of the conditions. Finding a biclustering is an optimization problem. In recent years, swarm intelligence techniques are popular due to the fact that many real-world problems are increasingly large, complex and dynamic. By reasons of the size and complexity of the problems, it is necessary to find an optimization technique whose efficiency is measured by finding the near optimal solution within a reasonable amount of time. In this paper, the algorithmic concepts of the Particle Swarm Optimization (PSO), Shuffled Frog Leaping (SFL) and Cuckoo Search (CS) algorithms have been analyzed for the four benchmark gene expression dataset. The experiment results show that CS outperforms PSO and SFL for 3 datasets and SFL give better performance in one dataset. Also this work determines the biological relevance of the biclusters with Gene Ontology in terms of function, process and component.
Keywords: Particle swarm optimization, Shuffled frog leaping, Cuckoo search, biclustering, gene expression data.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2663355 Improved Automated Classification of Alcoholics and Non-alcoholics
Authors: Ramaswamy Palaniappan
Abstract:
In this paper, several improvements are proposed to previous work of automated classification of alcoholics and nonalcoholics. In the previous paper, multiplayer-perceptron neural network classifying energy of gamma band Visual Evoked Potential (VEP) signals gave the best classification performance using 800 VEP signals from 10 alcoholics and 10 non-alcoholics. Here, the dataset is extended to include 3560 VEP signals from 102 subjects: 62 alcoholics and 40 non-alcoholics. Three modifications are introduced to improve the classification performance: i) increasing the gamma band spectral range by increasing the pass-band width of the used filter ii) the use of Multiple Signal Classification algorithm to obtain the power of the dominant frequency in gamma band VEP signals as features and iii) the use of the simple but effective knearest neighbour classifier. To validate that these two modifications do give improved performance, a 10-fold cross validation classification (CVC) scheme is used. Repeat experiments of the previously used methodology for the extended dataset are performed here and improvement from 94.49% to 98.71% in maximum averaged CVC accuracy is obtained using the modifications. This latest results show that VEP based classification of alcoholics is worth exploring further for system development.Keywords: Alcoholic, Multilayer-perceptron, Nearest neighbour, Gamma band, MUSIC, Visual evoked potential.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1377354 Meta-Learning for Hierarchical Classification and Applications in Bioinformatics
Authors: Fabio Fabris, Alex A. Freitas
Abstract:
Hierarchical classification is a special type of classification task where the class labels are organised into a hierarchy, with more generic class labels being ancestors of more specific ones. Meta-learning for classification-algorithm recommendation consists of recommending to the user a classification algorithm, from a pool of candidate algorithms, for a dataset, based on the past performance of the candidate algorithms in other datasets. Meta-learning is normally used in conventional, non-hierarchical classification. By contrast, this paper proposes a meta-learning approach for more challenging task of hierarchical classification, and evaluates it in a large number of bioinformatics datasets. Hierarchical classification is especially relevant for bioinformatics problems, as protein and gene functions tend to be organised into a hierarchy of class labels. This work proposes meta-learning approach for recommending the best hierarchical classification algorithm to a hierarchical classification dataset. This work’s contributions are: 1) proposing an algorithm for splitting hierarchical datasets into new datasets to increase the number of meta-instances, 2) proposing meta-features for hierarchical classification, and 3) interpreting decision-tree meta-models for hierarchical classification algorithm recommendation.Keywords: Algorithm recommendation, meta-learning, bioinformatics, hierarchical classification.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1370