Search results for: imbalanced data with class overlap
26784 Improved Classification Procedure for Imbalanced and Overlapped Situations
Authors: Hankyu Lee, Seoung Bum Kim
Abstract:
Imbalance and overlap in class distributions have become important issues in many data mining applications. An imbalanced dataset is a special case in classification problems in which the number of observations in one class (the majority class) heavily exceeds the number of observations in the other class (the minority class). An overlapped dataset is one in which many observations of the two classes occupy the same region of the data space. Imbalanced and overlapped data are frequently found in real applications, including fraud and abuse in healthcare, quality prediction in manufacturing, text classification, oil spill detection, and remote sensing. The class imbalance and overlap problem is challenging because it degrades the performance of most standard classification algorithms. In this study, we propose a classification procedure that can effectively handle imbalanced and overlapped datasets by splitting the data space into three parts (nonoverlapping, lightly overlapping, and severely overlapping) and applying a classification algorithm in each part. These three parts were determined based on the Hausdorff distance and the margin of a modified support vector machine. An experimental study was conducted to examine the properties of the proposed method and to compare it with other classification algorithms. The results showed that the proposed method outperformed the competitors under various imbalanced and overlapped situations. Moreover, the applicability of the proposed method was demonstrated through an experiment with real data.
Keywords: classification, imbalanced data with class overlap, split data space, support vector machine
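The abstract above names the Hausdorff distance as one ingredient for delimiting the overlap regions. As a rough illustration only, the following sketch computes the standard symmetric Hausdorff distance between two small point sets; the sample points and function names are hypothetical, and how the authors combine this distance with the SVM margin is not stated in the abstract.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical toy data: three majority points and two minority points.
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
minority = [(3.0, 0.0), (4.0, 0.0)]
print(hausdorff(majority, minority))
```

A small distance between the two class sets suggests they sit close together and may overlap, while a large one suggests well-separated classes.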
Procedia PDF Downloads 307
26783 Machine Learning Facing Behavioral Noise Problem in an Imbalanced Data Using One Side Behavioral Noise Reduction: Application to a Fraud Detection
Authors: Salma El Hajjami, Jamal Malki, Alain Bouju, Mohammed Berrada
Abstract:
With the expansion of machine learning and data mining in the context of Big Data analytics, a common problem that affects data is class imbalance, that is, an uneven distribution of instances across classes. This problem is present in many real-world applications such as fraud detection, network intrusion detection, and medical diagnostics. In these cases, instances labeled negative are significantly more numerous than instances labeled positive. When this difference is too large, a learning system may struggle, since it is typically designed to work with relatively balanced class distributions. Another important problem, which usually accompanies imbalanced data, is the overlap of instances between the two classes, commonly referred to as noise or overlapping data. In this article, we propose an approach called One Side Behavioral Noise Reduction (OSBNR). This approach presents a way to deal with the problem of class imbalance in the presence of a high noise level. OSBNR is based on two steps. First, a cluster analysis is applied to group similar instances from the minority class into several behavior clusters. Second, we select and eliminate the instances of the majority class, considered as behavioral noise, which overlap with behavior clusters of the minority class. The results of experiments carried out on a representative public dataset confirm that the proposed approach is efficient for the treatment of class imbalances in the presence of noise.
Keywords: machine learning, imbalanced data, data mining, big data
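The two steps described above (cluster the minority class, then drop majority instances that overlap a cluster) can be sketched roughly as follows. This is not the authors' implementation: the use of plain k-means, the seeding with the first k points, and the "inside the cluster radius" overlap test are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, groups

def remove_overlapping_majority(majority, minority, k=2):
    """Drop majority points that fall inside any minority cluster's radius."""
    centroids, groups = kmeans(minority, k)
    # Radius = distance from the centroid to its farthest member (assumed overlap rule).
    radii = [max((math.dist(p, c) for p in g), default=0.0)
             for c, g in zip(centroids, groups)]
    return [m for m in majority
            if all(math.dist(m, c) > r for c, r in zip(centroids, radii))]

# Hypothetical toy data: two minority "behavior clusters" and four majority points.
minority = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
majority = [(0.0, 0.6), (2.0, 2.0), (5.0, 5.4), (9.0, 9.0)]
cleaned = remove_overlapping_majority(majority, minority, k=2)
print(cleaned)
```

Only majority instances are removed, which matches the "one side" idea in the abstract: the minority class is left untouched.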
Procedia PDF Downloads 130
26782 An Adaptive Oversampling Technique for Imbalanced Datasets
Authors: Shaukat Ali Shahee, Usha Ananthakumar
Abstract:
A data set exhibits the class imbalance problem when one class has very few examples compared to the other class; this is also referred to as between-class imbalance. Traditional classifiers fail to classify the minority class examples correctly due to their bias towards the majority class. Apart from between-class imbalance, imbalance within classes, where classes are composed of different numbers of sub-clusters and these sub-clusters contain different numbers of examples, also deteriorates the performance of the classifier. Many methods have previously been proposed for handling the imbalanced dataset problem. These methods can be classified into four categories: data preprocessing, algorithm-based methods, cost-based methods, and ensembles of classifiers. Data preprocessing techniques have shown great potential as they attempt to improve the data distribution rather than the classifier. A data preprocessing technique handles class imbalance either by increasing the minority class examples or by decreasing the majority class examples. Decreasing the majority class examples leads to loss of information, and when the minority class is extremely rare, removing majority class examples is generally not recommended. Existing methods for handling class imbalance do not address between-class imbalance and within-class imbalance simultaneously. In this paper, we propose a method that handles between-class imbalance and within-class imbalance simultaneously for the binary classification problem. Removing both kinds of imbalance simultaneously eliminates the bias of the classifier towards bigger sub-clusters by minimizing the error domination of bigger sub-clusters in the total error. The proposed method uses model-based clustering to find the presence of sub-clusters or sub-concepts in the dataset. The number of examples oversampled among the sub-clusters is determined based on the complexity of the sub-clusters. 
The method also takes into consideration the scatter of the data in the feature space and adaptively copes with unseen test data using the Lowner-John ellipsoid to increase the accuracy of the classifier. In this study, a neural network is used because it is a classifier that minimizes the total error, so removing between-class and within-class imbalance simultaneously helps the classifier give equal weight to all sub-clusters irrespective of their classes. The proposed method is validated on 9 publicly available data sets and compared with three existing oversampling techniques that rely on the spatial location of minority class examples in the Euclidean feature space. The experimental results show the proposed method to be statistically significantly superior to the other methods in terms of various accuracy measures. Thus, the proposed method can serve as a good alternative for problem domains such as credit scoring, customer churn prediction, and financial distress, which typically involve imbalanced data sets.
Keywords: classification, imbalanced dataset, Lowner-John ellipsoid, model based clustering, oversampling
Procedia PDF Downloads 415
26781 Adaptive Swarm Balancing Algorithms for Rare-Event Prediction in Imbalanced Healthcare Data
Authors: Jinyan Li, Simon Fong, Raymond Wong, Mohammed Sabah, Fiaidhi Jinan
Abstract:
Clinical data analysis and forecasting have made great contributions to disease control, prevention and detection. However, such data usually suffer from highly unbalanced samples in class distributions. In this paper, we target the binary imbalanced dataset, where the positive samples make up only a minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and the bat-inspired algorithm, and combine both of them with the synthetic minority over-sampling technique (SMOTE) for processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reveal that while the performance improvements obtained by the former approach do not scale to larger data sizes, the latter, which we call Adaptive Swarm Balancing Algorithms, leads to significant efficiency and effectiveness improvements on large datasets. We also find it more consistent with the practice of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE, leading to more credible classifier performance and a shorter running time compared with the brute-force method.
Keywords: Imbalanced dataset, meta-heuristic algorithm, SMOTE, big data
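For readers unfamiliar with SMOTE, the core idea is to create each synthetic minority sample by interpolating between a minority point and one of its k nearest minority neighbours. The following is a minimal sketch of that idea only; the function name, toy data, and seed handling are illustrative assumptions, and the swarm-based tuning described in the abstract is not shown. The two tunable quantities here (`k` and `n_new`) are presumably the kind of parameters such tuning would target, but the abstract does not name them.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x, excluding x itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Hypothetical toy minority class on the corners of a unit square.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_new=6, k=2, seed=0)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class.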
Procedia PDF Downloads 439
26780 The Classification Performance in Parametric and Nonparametric Discriminant Analysis for a Class-Unbalanced Data of Diabetes Risk Groups
Authors: Lily Ingsrisawang, Tasanee Nacharoen
Abstract:
Introduction: Unbalanced data sets commonly appear in real-world applications. Many research papers have found that, due to unequal class distribution, the performance of existing classifiers tends to be biased towards the majority class. The k-nearest neighbors nonparametric discriminant analysis is one method that has been proposed for classifying unbalanced classes with good performance. Hence, the methods of discriminant analysis are of interest to us in investigating misclassification error rates for class-imbalanced data of three diabetes risk groups. Objective: The purpose of this study was to compare the classification performance of parametric discriminant analysis and nonparametric discriminant analysis in a three-class classification application of class-imbalanced data of diabetes risk groups. Methods: Data from a health project for 599 staff members of a government hospital in Bangkok were obtained for the classification problem. The staff members were classified into one of three diabetes risk groups: non-risk (90%), risk (5%), and diabetic (5%). The original data, comprising the variables diabetes risk group, age, gender, cholesterol, and BMI, were analyzed and bootstrapped into 50 and 100 samples, with 599 observations per sample, for additional estimation of the misclassification error rate. Each data set was explored for departure from multivariate normality and for equality of the covariance matrices of the three risk groups. Both the original data and the bootstrap samples show non-normality and unequal covariance matrices. The parametric linear discriminant function, the quadratic discriminant function, and the nonparametric k-nearest neighbors discriminant function were applied to the 50 and 100 bootstrap samples and to the original data. 
In finding the optimal classification rule, the prior probabilities were set to both equal proportions (0.33:0.33:0.33) and unequal proportions, with three choices: (0.90:0.05:0.05), (0.80:0.10:0.10), or (0.70:0.15:0.15). Results: The results from 50 and 100 bootstrap samples indicated that the k-nearest neighbors approach with k = 3 or k = 4 and prior probabilities for {non-risk:risk:diabetic} of {0.90:0.05:0.05} or {0.80:0.10:0.10} gave the smallest misclassification error rate. Conclusion: The k-nearest neighbors approach is suggested for classifying three-class imbalanced data of diabetes risk groups.
Keywords: error rate, bootstrap, diabetes risk groups, k-nearest neighbors
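To illustrate why the choice of prior probabilities matters for a k-nearest neighbors discriminant rule, the sketch below uses one common formulation in which the posterior for class c is proportional to prior_c * k_c / n_c, where k_c is the number of the k nearest neighbours in class c and n_c is the class size. Whether the study used exactly this rule is an assumption, and the one-dimensional data and function name are hypothetical.

```python
import math
from collections import Counter

def knn_posteriors(x, samples, labels, priors, k=3):
    """Posterior per class, proportional to prior * (neighbours in class / class size)."""
    class_sizes = Counter(labels)
    nearest = sorted(range(len(samples)), key=lambda i: math.dist(x, samples[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    scores = {c: priors[c] * votes[c] / class_sizes[c] for c in class_sizes}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical 1-D data: 6 non-risk, 2 risk, 2 diabetic observations.
samples = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.5,),
           (0.9,), (1.0,), (2.0,), (2.1,)]
labels = ["non-risk"] * 6 + ["risk"] * 2 + ["diabetic"] * 2
query = (0.72,)

equal = {"non-risk": 1 / 3, "risk": 1 / 3, "diabetic": 1 / 3}
skewed = {"non-risk": 0.90, "risk": 0.05, "diabetic": 0.05}
post_equal = knn_posteriors(query, samples, labels, equal, k=3)
post_skewed = knn_posteriors(query, samples, labels, skewed, k=3)
print(max(post_equal, key=post_equal.get), max(post_skewed, key=post_skewed.get))
```

With equal priors the query is assigned to the risk group, while priors that mirror the sample proportions pull the same query back to the non-risk group, so the prior is effectively a tuning knob for how much weight the rare classes receive.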
Procedia PDF Downloads 433
26779 A Survey in Techniques for Imbalanced Intrusion Detection System Datasets
Authors: Najmeh Abedzadeh, Matthew Jacobs
Abstract:
An intrusion detection system (IDS) is a software application that monitors for malicious activities and generates alerts if any are detected. However, most network activities in IDS datasets are normal, and the relatively small number of attacks makes the available data imbalanced. Consequently, cyber-attacks can hide inside a large number of normal activities, and machine learning algorithms have difficulty learning and classifying the data correctly. In this paper, a comprehensive literature review is conducted on different types of algorithms, both for implementing the IDS and for correcting the imbalanced IDS dataset. The most widely used approaches are machine learning (ML), deep learning (DL), synthetic minority over-sampling technique (SMOTE), and reinforcement learning (RL). Most of the research uses the CSE-CIC-IDS2017, CSE-CIC-IDS2018, and NSL-KDD datasets for evaluating the algorithms.
Keywords: IDS, imbalanced datasets, sampling algorithms, big data
Procedia PDF Downloads 323
26778 An Empirical Evaluation of Performance of Machine Learning Techniques on Imbalanced Software Quality Data
Authors: Ruchika Malhotra, Megha Khanna
Abstract:
The development of change prediction models can help software practitioners plan testing and inspection resources at early phases of software development. However, a major challenge faced during the training process of any classification model is the imbalanced nature of software quality data. Data with very few instances of the minority outcome categories lead to an inefficient learning process, and a classification model developed from imbalanced data generally does not predict these minority categories correctly. Thus, for a given dataset, a minority of classes may be change prone whereas a majority of classes may be non-change prone. This study explores various alternatives for adeptly handling imbalanced software quality data using different sampling methods and effective MetaCost learners. The study also analyzes and justifies the use of different performance metrics when dealing with imbalanced data. In order to empirically validate the different alternatives, the study uses change data from three application packages of an open-source Android data set and evaluates the performance of six different machine learning techniques. The results of the study indicate extensive improvement in the performance of the classification models when using resampling methods and robust performance measures.
Keywords: change proneness, empirical validation, imbalanced learning, machine learning techniques, object-oriented metrics
Procedia PDF Downloads 418
26777 Towards a Balancing Medical Database by Using the Least Mean Square Algorithm
Authors: Kamel Belammi, Houria Fatrim
Abstract:
An imbalanced data set, a problem often found in real-world applications, can have a seriously negative effect on the classification performance of machine learning algorithms. There have been many attempts at dealing with the classification of imbalanced data sets. In medical diagnosis classification, we often face an imbalanced number of data samples between the classes, in which there are not enough samples in the rare classes. In this paper, we propose a learning method based on a cost-sensitive extension of the Least Mean Square (LMS) algorithm that penalizes errors of different samples with different weights, together with some rules of thumb to determine those weights. After the balancing phase, we apply different classifiers (support vector machine (SVM), k-nearest neighbor (KNN) and multilayer neural networks (MNN)) to the balanced data set. We have also compared the results obtained before and after applying the balancing method.
Keywords: multilayer neural networks, k-nearest neighbor, support vector machine, imbalanced medical data, least mean square algorithm, diabetes
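A cost-sensitive LMS update can be sketched as the usual LMS rule with the error term scaled by a per-sample cost, so that mistakes on rare-class samples move the weights further. The exact rule and the rules of thumb for the weights are not given in the abstract; the inverse-class-frequency costs below are an assumption, and all names are hypothetical.

```python
from collections import Counter

def inverse_frequency_costs(targets):
    """Assumed rule of thumb: cost = n / (n_classes * class_count)."""
    counts = Counter(targets)
    n, n_classes = len(targets), len(counts)
    return {label: n / (n_classes * count) for label, count in counts.items()}

def lms_step(w, x, y, cost, lr=0.1):
    """One cost-weighted LMS update: w <- w + lr * cost * (y - w.x) * x."""
    error = y - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * cost * error * xi for wi, xi in zip(w, x)]

# Hypothetical labels: 8 majority samples (-1) and 2 minority samples (+1).
targets = [-1] * 8 + [1] * 2
costs = inverse_frequency_costs(targets)
print(costs)

# The same sample moves the weights twice as far when its cost is doubled.
plain = lms_step([0.0, 0.0], [1.0, 2.0], 1.0, cost=1.0)
weighted = lms_step([0.0, 0.0], [1.0, 2.0], 1.0, cost=2.0)
print(plain, weighted)
```

With these costs, the total weight carried by each class is equal, which is one simple way to offset the imbalance before or during training.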
Procedia PDF Downloads 531
26776 An Ensemble Deep Learning Architecture for Imbalanced Classification of Thoracic Surgery Patients
Authors: Saba Ebrahimi, Saeed Ahmadian, Hedie Ashrafi
Abstract:
Selecting appropriate patients for surgery is one of the main issues in thoracic surgery (TS). Both short-term and long-term risks and benefits of surgery must be considered in the patient selection criteria. The existing datasets of TS patients have some limitations because of missing attribute values and an imbalanced distribution of survival classes. In this study, a novel ensemble architecture of deep learning networks is proposed, based on stacking different linear and non-linear layers, to deal with imbalanced datasets. The categorical and numerical features are split using different layers with the ability to shrink unnecessary features. Then, after extracting insight from the raw features, a novel biased-kernel layer is applied to reinforce the gradient of the minority class and cause the network to be trained better compared with current methods. Finally, the performance and advantages of our proposed model over existing models are examined for predicting patient survival after thoracic surgery using real-life clinical data for lung cancer patients.
Keywords: deep learning, ensemble models, imbalanced classification, lung cancer, TS patient selection
Procedia PDF Downloads 144
26775 Enhancing Fault Detection in Rotating Machinery Using Wiener-CNN Method
Authors: Mohamad R. Moshtagh, Ahmad Bagheri
Abstract:
Accurate fault detection in rotating machinery is of utmost importance to ensure optimal performance and prevent costly downtime in industrial applications. This study presents a robust fault detection system based on vibration data collected from rotating gears under various operating conditions. The considered scenarios include: (1) both gears being healthy, (2) one healthy gear and one faulty gear, and (3) introducing an imbalanced condition to a healthy gear. Vibration data was acquired using a Hentek 1008 device and stored in a CSV file. Python code implemented in the Spider environment was used for data preprocessing and analysis. Winner features were extracted using the Wiener feature selection method. These features were then employed in multiple machine learning algorithms, including Convolutional Neural Networks (CNN), Multilayer Perceptron (MLP), K-Nearest Neighbors (KNN), and Random Forest, to evaluate their performance in detecting and classifying faults in both the training and validation datasets. The comparative analysis of the methods revealed the superior performance of the Wiener-CNN approach. The Wiener-CNN method achieved a remarkable accuracy of 100% for both the two-class (healthy gear and faulty gear) and three-class (healthy gear, faulty gear, and imbalanced) scenarios in the training and validation datasets. In contrast, the other methods exhibited varying levels of accuracy. The Wiener-MLP method attained 100% accuracy for the two-class training dataset and 100% for the validation dataset. For the three-class scenario, the Wiener-MLP method demonstrated 100% accuracy in the training dataset and 95.3% accuracy in the validation dataset. The Wiener-KNN method yielded 96.3% accuracy for the two-class training dataset and 94.5% for the validation dataset. In the three-class scenario, it achieved 85.3% accuracy in the training dataset and 77.2% in the validation dataset. 
The Wiener-Random Forest method achieved 100% accuracy for the two-class training dataset and 85% for the validation dataset, while in the three-class training dataset, it attained 100% accuracy and 90.8% accuracy for the validation dataset. The exceptional accuracy demonstrated by the Wiener-CNN method underscores its effectiveness in accurately identifying and classifying fault conditions in rotating machinery. The proposed fault detection system utilizes vibration data analysis and advanced machine learning techniques to improve operational reliability and productivity. By adopting the Wiener-CNN method, industrial systems can benefit from enhanced fault detection capabilities, facilitating proactive maintenance and reducing equipment downtime.
Keywords: fault detection, gearbox, machine learning, wiener method
Procedia PDF Downloads 79
26774 A Ratio-Weighted Decision Tree Algorithm for Imbalance Dataset Classification
Authors: Doyin Afolabi, Phillip Adewole, Oladipupo Sennaike
Abstract:
Most well-known classifiers, including the decision tree algorithm, can make predictions on balanced datasets efficiently. However, the decision tree algorithm tends to be biased when applied to imbalanced datasets because of the skewed distribution of such datasets. To overcome this problem, this study proposes a weighted decision tree algorithm that aims to remove the bias toward the majority class and avoids reducing the majority observations when classifying imbalanced datasets. The proposed weighted decision tree algorithm was tested on three imbalanced datasets: a cancer dataset, the German credit dataset, and a banknote dataset. The specificity, sensitivity, and accuracy metrics were used to evaluate the performance of the proposed decision tree algorithm on the datasets. The evaluation results show that, for some of the weights of our proposed decision tree, the specificity, sensitivity, and accuracy metrics gave better results than those of the ID3 decision tree and the decision tree induced with minority entropy for all three datasets.
Keywords: data mining, decision tree, classification, imbalance dataset
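The three metrics named above behave very differently on skewed data, which is why they are usually reported together. The sketch below, using hypothetical labels and a hypothetical helper name, shows a classifier that always predicts the majority class: it scores high accuracy while detecting none of the minority cases.

```python
def rates(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall on positives) and specificity (recall on negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Hypothetical data: 95 negatives, 5 positives, and a model that always says "negative".
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100
print(rates(y_true, always_majority))
```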
Procedia PDF Downloads 133
26773 A Priority Based Imbalanced Time Minimization Assignment Problem: An Iterative Approach
Authors: Ekta Jain, Kalpana Dahiya, Vanita Verma
Abstract:
This paper discusses a priority based imbalanced time minimization assignment problem dealing with the allocation of n jobs to m < n persons, in which the project is carried out in two stages, viz. Stage-I and Stage-II. Stage-I consists of n1 (< m) primary jobs and Stage-II consists of the remaining (n-n1) secondary jobs, which are commenced only after the primary jobs are finished. Each job is to be allocated to exactly one person, and each person has to do at least one job. It is assumed that the nature of the Stage-I jobs is such that one person can do exactly one primary job, whereas a person can do more than one secondary job in Stage-II. In a particular stage, all persons start doing the jobs simultaneously, but if a person is doing more than one job, he does them one after the other in any order. The aim of the proposed study is to find the feasible assignment which minimizes the total time for the two-stage execution of the project. For this, an iterative algorithm is proposed which, at each iteration, solves a constrained imbalanced time minimization assignment problem to generate a pair of Stage-I and Stage-II times. An algorithm for solving this constrained problem is developed in the current paper. Later, an alternate combinations based method to solve the priority based imbalanced problem is also discussed and a comparative study is carried out. Numerical illustrations are provided in support of the theory.
Keywords: assignment, imbalanced, priority, time minimization
Procedia PDF Downloads 233
26772 Retinal Vascular Tortuosity in Obstructive Sleep Apnea-COPD Overlap Patients
Authors: Rabab A. El Wahsh, Hatem M. Marey, Maha Yousif, Asmaa M. Ibrahim
Abstract:
Background: OSA and COPD are associated with microvascular changes. Retinal microvasculature can be directly and non-invasively examined. Aim: To evaluate retinal vascular tortuosity in patients with COPD, OSA, and overlap syndrome. Subjects and method: Sixty subjects were included: 15 OSA patients, 15 COPD patients, 15 COPD-OSA overlap patients, and 15 matched controls. They underwent digital retinal photography, polysomnography, arterial blood gases, spirometry, and the ESS and STOP-BANG questionnaires. Results: Tortuosity of most retinal vessels was higher in all patient groups compared to the control group; tortuosity was more marked in overlap syndrome. There was a negative correlation between tortuosity of retinal vessels and PO2, O2 saturation, and minimum O2 desaturation, and a positive correlation with PCO2, AHI, O2 desaturation index, BMI and smoking index. Conclusion: Retinal vascular tortuosity occurs in OSA, COPD and overlap syndrome. Retinal vascular tortuosity is correlated with arterial blood gas parameters, polysomnographic findings, smoking index and BMI.
Keywords: OSA, COPD, overlap syndrome, retinal vascular tortuosity
Procedia PDF Downloads 72
26771 One vs. Rest and Error Correcting Output Codes Principled Rebalancing Schemes for Solving Imbalanced Multiclass Problems
Authors: Alvaro Callejas-Ramos, Lorena Alvarez-Perez, Alexander Benitez-Buenache, Anibal R. Figueiras-Vidal
Abstract:
This contribution presents a promising formulation which allows to extend the principled binary rebalancing procedures, also known as neutral re-balancing mechanisms in the sense that they do not alter the likelihood ratio
Keywords: Bregman divergences, imbalanced multiclass classification, informed re-balancing, invariant likelihood ratio
Procedia PDF Downloads 213
26770 Kernel-Based Double Nearest Proportion Feature Extraction for Hyperspectral Image Classification
Authors: Hung-Sheng Lin, Cheng-Hsuan Li
Abstract:
Over the past few years, kernel-based algorithms have been widely used to extend some linear feature extraction methods such as principal component analysis (PCA), linear discriminant analysis (LDA), and nonparametric weighted feature extraction (NWFE) to their nonlinear versions, kernel principal component analysis (KPCA), generalized discriminant analysis (GDA), and kernel nonparametric weighted feature extraction (KNWFE), respectively. These nonlinear feature extraction methods can detect nonlinear directions with the largest nonlinear variance or the largest class separability based on the given kernel function. Moreover, they have been applied to improve target detection and image classification for hyperspectral images. The double nearest proportion feature extraction (DNP) can effectively reduce the overlap effect and performs well in hyperspectral image classification. The DNP structure is an extension of the k-nearest neighbor technique. For each sample, there are two corresponding nearest proportions of samples, the self-class nearest proportion and the other-class nearest proportion. The term “nearest proportion” used here considers both the local information and other, more global information. With these settings, the effect of the overlap between the sample distributions can be reduced. Usually, the maximum likelihood estimator and the related unbiased estimator are not ideal estimators in high dimensional inference problems, particularly in small-sample situations. Hence, an improved estimator based on shrinkage estimation (regularization) is proposed. Based on the DNP structure, LDA is included as a special case. In this paper, the kernel method is applied to extend DNP to kernel-based DNP (KDNP). In addition to the advantages of DNP, KDNP surpasses DNP in the experimental results. 
According to the experiments on the real hyperspectral image data sets, the classification performance of KDNP is better than that of PCA, LDA, NWFE, and their kernel versions, KPCA, GDA, and KNWFE.
Keywords: feature extraction, kernel method, double nearest proportion feature extraction, kernel double nearest feature extraction
Procedia PDF Downloads 341
26769 Artificial Reproduction System and Imbalanced Dataset: A Mendelian Classification
Authors: Anita Kushwaha
Abstract:
We propose a new evolutionary computational model called Artificial Reproduction System, which is based on the complex process of meiotic reproduction occurring between male and female cells of living organisms. Artificial Reproduction System is an attempt towards a new computational intelligence approach inspired by the theoretical reproduction mechanism and by observed reproduction functions, principles and mechanisms. A reproductive organism is programmed by genes and can be viewed as an automaton, mapping and reducing so as to create copies of those genes in its offspring. In Artificial Reproduction System, the binding mechanism between male and female cells is studied, parameters are chosen, a network is constructed, and a feedback system for self-regularization is established. The model then applies Mendel’s law of inheritance and allele-allele associations, and can be used to perform data analysis of imbalanced, multivariate, multiclass and big data. In the experimental study, Artificial Reproduction System is compared with other state-of-the-art classifiers such as SVM, Radial Basis Function, neural networks, and K-Nearest Neighbor on some benchmark datasets, and the comparison results indicate good performance.
Keywords: bio-inspired computation, nature-inspired computation, natural computing, data mining
Procedia PDF Downloads 271
26768 Semi-Supervised Outlier Detection Using a Generative and Adversary Framework
Authors: Jindong Gu, Matthias Schubert, Volker Tresp
Abstract:
In many outlier detection tasks, only training data belonging to one class, i.e., the positive class, is available. The task is then to predict a new data point as belonging either to the positive class or to the negative class, in which case the data point is considered an outlier. For this task, we propose a novel corrupted Generative Adversarial Network (CorGAN). In the adversarial process of training CorGAN, the Generator generates outlier samples for the negative class, and the Discriminator is trained to distinguish the positive training data from the generated negative data. The proposed framework is evaluated using an image dataset and a real-world network intrusion dataset. Our outlier-detection method achieves state-of-the-art performance on both tasks.
Keywords: one-class classification, outlier detection, generative adversary networks, semi-supervised learning
Procedia PDF Downloads 151
26767 Credit Card Fraud Detection with Ensemble Model: A Meta-Heuristic Approach
Authors: Gong Zhilin, Jing Yang, Jian Yin
Abstract:
The purpose of this paper is to develop a novel system for credit card fraud detection based on sequential modeling of data using hybrid deep learning models. The proposed model encapsulates five major phases: pre-processing, imbalance-data handling, feature extraction, optimal feature selection, and fraud detection with an ensemble classifier. The collected raw data (input) is pre-processed to enhance the quality of the data by alleviating missing data, noisy data, and null values. The pre-processed data are class imbalanced in nature, and therefore they are handled with the K-means clustering-based SMOTE model. From the class-balanced data, the most relevant features are extracted, such as improved Principal Component Analysis (PCA) features, statistical features (mean, median, standard deviation) and higher-order statistical features (skewness and kurtosis). Among the extracted features, the optimal features are selected with the Self-improved Arithmetic Optimization Algorithm (SI-AOA). This SI-AOA model is a conceptual improvement of the standard Arithmetic Optimization Algorithm. The deep learning models used are Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and optimized Quantum Deep Neural Network (QDNN). The LSTM and CNN are trained with the selected optimal features. The outcomes from the LSTM and CNN are fed as input to the optimized QDNN, which provides the final detection outcome. Since the QDNN is the ultimate detector, its weight function is fine-tuned with the SI-AOA.
Keywords: credit card, data mining, fraud detection, money transactions
Procedia PDF Downloads 128
26766 Prediction of All-Beta Protein Secondary Structure Using Garnier-Osguthorpe-Robson Method
Authors: K. Tejasri, K. Suvarna Vani, S. Prathyusha, S. Ramya
Abstract:
Proteins are chained sequences of amino acids joined by peptide bonds. Many different chain conformations are possible due to the multiple combinations of amino acids and rotation at numerous positions along the chain. Protein structure prediction is one of the crucial goals pursued by researchers in bioinformatics and theoretical chemistry. Among the four structure levels in proteins, we focus mainly on the secondary structure. Generally, protein secondary structure comprises alpha-helices and beta-sheets. Multi-class classification of data with class disparity is a genuine challenge and has to be addressed for the beta strands. In an imbalanced data distribution, a couple of the classes have very few training samples compared with the other classes. The secondary structure data is extracted from the protein primary sequence, and the beta-strands are predicted using suitable machine learning algorithms.
Keywords: proteins, secondary structure elements, beta-sheets, beta-strands, alpha-helices, machine learning algorithms
Procedia PDF Downloads 92
26765 Microarray Gene Expression Data Dimensionality Reduction Using PCA
Authors: Fuad M. Alkoot
Abstract:
Different experimental technologies, such as microarray sequencing, have been proposed to generate high-resolution genetic data in order to understand the complex dynamic interactions between complex diseases and the biological system components of genes and gene products. However, the generated samples have a very large dimension, reaching thousands, which hinders attempts to design a classifier system that can identify diseases based on such data. Additionally, the high overlap in the class distributions makes the task more difficult. The data we experiment with were generated for the identification of autism. They include 142 samples, which is small compared to the large dimension of the data. Classifier systems trained on these data yield very low classification rates that are almost equivalent to guessing. We aim to reduce the data dimension and make the data more suitable for classification. Here, we experiment with applying a multistage PCA to the genetic data to reduce its dimensionality. Results show a significant improvement in the classification rates, which increases the possibility of building an automated system for autism detection.
Keywords: PCA, gene expression, dimensionality reduction, classification, autism
Procedia PDF Downloads 559
26764 A Unique Multi-Class Support Vector Machine Algorithm Using MapReduce
Authors: Aditi Viswanathan, Shree Ranjani, Aruna Govada
Abstract:
With data sizes constantly expanding, and with classical machine learning algorithms that analyze such data requiring larger and larger amounts of computation time and storage space, the need to distribute computation and memory requirements among several computers has become apparent. Although substantial work has been done in developing distributed binary SVM algorithms and multi-class SVM algorithms individually, the field of multi-class distributed SVMs remains largely unexplored. This research seeks to develop an algorithm that implements the Support Vector Machine over a multi-class data set and is efficient in a distributed environment. For this, we recursively choose the best binary split of a set of classes using a greedy technique, much like the divide-and-conquer approach. Our algorithm has shown better computation time during the testing phase than the traditional sequential SVM methods (One vs. One, One vs. Rest) and outperforms them as the size of the data set grows. This approach also classifies the data with higher accuracy than the traditional multi-class algorithms.
Keywords: distributed algorithm, MapReduce, multi-class, support vector machine
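The idea of recursively choosing a binary split of a set of classes can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the "best" split is scored, so the heuristic here (seed the two groups with the two classes whose centroids are farthest apart, then attach every other class to the nearer seed) is hypothetical, as are the names `split_classes` and `build_tree`. The MapReduce distribution is not shown.

```python
import math
from itertools import combinations


def split_classes(centroids, labels):
    """Greedily split `labels` into two groups around the two farthest centroids."""
    seed_a, seed_b = max(
        combinations(labels, 2),
        key=lambda pair: math.dist(centroids[pair[0]], centroids[pair[1]]),
    )
    left, right = [seed_a], [seed_b]
    for label in labels:
        if label in (seed_a, seed_b):
            continue
        d_a = math.dist(centroids[label], centroids[seed_a])
        d_b = math.dist(centroids[label], centroids[seed_b])
        # Attach the class to whichever seed centroid is closer.
        (left if d_a <= d_b else right).append(label)
    return left, right


def build_tree(centroids, labels):
    """Return a nested-tuple tree; each internal node stands for one binary classifier."""
    if len(labels) == 1:
        return labels[0]  # leaf: a single class
    left, right = split_classes(centroids, labels)
    return (build_tree(centroids, left), build_tree(centroids, right))
```

With k classes, such a tree has k - 1 internal nodes, so a test sample would pass through at most k - 1 binary decisions on its way to a leaf.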
Procedia PDF Downloads 399
26763 The Facilitatory Effect of Phonological Priming on Visual Word Recognition in Arabic as a Function of Lexicality and Overlap Positions
Authors: Ali Al Moussaoui
Abstract:
An experiment was designed to assess the performance of 24 Lebanese adults (mean age 29:5 years) in a lexical decision-making (LDM) task to find out how the facilitatory effect of phonological priming (PP) affects the speed of visual word recognition in Arabic as lexicality (wordhood) and phonological overlap positions (POP) vary. The experiment falls in line with previous research on phonological priming in the light of the cohort theory and in relation to visual word recognition. The experiment also departs from the research on the Arabic language in which the importance of the consonantal root as a distinct morphological unit is confirmed. Based on previous research, it is hypothesized that (1) PP has a facilitating effect in LDM with words but not with nonwords and (2) final phonological overlap between the prime and the target is more facilitatory than initial overlap. An LDM task was programmed in the PsychoPy application. Participants had to decide whether a target (e.g., bayn ‘between’) preceded by a prime (e.g., bayt ‘house’) is a word or not. There were 4 conditions: no PP (NP), nonwords priming nonwords (NN), nonwords priming words (NW), and words priming words (WW). The conditions were simultaneously controlled for word length, wordhood, and POP. The interstimulus interval was 700 ms. Within the PP conditions, POP was controlled across 3 overlap positions between the primes and the targets: initial (e.g., asad ‘lion’ and asaf ‘sorrow’), final (e.g., kattab ‘cause to write’ 2sg-mas and rattab ‘organize’ 2sg-mas), or two-segmented (e.g., namle ‘ant’ and naħle ‘bee’). There were 96 trials, 24 in each condition, using a within-subject design. The results show that, concerning (1), the highest average reaction time (RT) is that in NN, followed by NW and then WW. The difference is statistically significant only for the pairs NN-NW and NN-WW.
Regarding (2), the shortest RT is that in the two-segmented overlap condition, followed by the final POP and then the initial POP. The difference between the two-segmented and the initial overlap is significant, while other pairwise comparisons are not. Based on these results, PP emerges as a facilitatory phenomenon that is highly sensitive to lexicality and POP. While PP can have a facilitating effect under lexicality, it shows no facilitation in its absence, which agrees with several previous findings. Participants are found to be more sensitive to the final phonological overlap than to the initial overlap, which also coincides with a body of earlier literature. The results contradict the cohort theory’s stress on the onset overlap position and, instead, give more weight to final overlap, and even heavier weight to the two-segmented one. In conclusion, this study confirms the facilitating effect of PP with words but not when stimuli (at least the primes and at most both the primes and targets) are nonwords. It also shows that the two-segmented priming is the most influential in LDM in Arabic.
Keywords: lexicality, phonological overlap positions, phonological priming, visual word recognition
Procedia PDF Downloads 183
26762 Empirical Exploration for the Correlation between Class Object-Oriented Connectivity-Based Cohesion and Coupling
Authors: Jehad Al Dallal
Abstract:
Attributes and methods are the basic contents of an object-oriented class. The connectivity among these class members and the relationship between the class and other classes play an important role in determining the quality of an object-oriented system. Class cohesion evaluates the degree of relatedness of class attributes and methods, whereas class coupling refers to the degree to which a class is related to other classes. Researchers have proposed several class cohesion and class coupling measures. However, the correlation between class coupling and class cohesion measures has not been thoroughly studied. In this paper, using classes of three open-source Java systems, we empirically investigate the correlation between several measures of connectivity-based class cohesion and coupling. Four connectivity-based cohesion measures and eight coupling measures are considered in the empirical study. The results show that the connectivity-based cohesion and coupling internal quality attributes of a class are inversely correlated. The strength of the correlation depends highly on the cohesion and coupling measurement approaches.
Keywords: object-oriented class, software quality, class cohesion measure, class coupling measure
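An inverse correlation between a cohesion measure and a coupling measure can be checked with a correlation coefficient computed over per-class measurements. The sketch below uses Pearson's r as an assumption; the abstract does not state which coefficient the study used, and the function name `pearson` is introduced here for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold one cohesion value per class and `ys` the coupling value of the same class; a result close to -1 would indicate a strong inverse relationship, and a result near 0 a weak one.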
Procedia PDF Downloads 319
26761 Powers of Class p-w A (s, t) Operators Associated with Generalized Aluthge Transformations
Authors: Mohammed Husein Mohammed Rashid
Abstract:
Let $T = U|T|$ be a polar decomposition of a bounded linear operator $T$ on a complex Hilbert space with $\ker U = \ker |T|$. $T$ is said to be class $p$-$wA(s,t)$ if $(|T^{*}|^{t} |T|^{2s} |T^{*}|^{t})^{\frac{tp}{s+t}} \geq |T^{*}|^{2tp}$ and $|T|^{2sp} \geq (|T|^{s} |T^{*}|^{2t} |T|^{s})^{\frac{sp}{s+t}}$ with 0
Keywords: class p-w A (s, t), normaloid, isoloid, finite, orthogonality
Procedia PDF Downloads 115
26760 Selection of Soil Quality Indicators of Rice Cropping Systems Using Minimum Data Set Influenced by Imbalanced Fertilization
Authors: Theresa K., Shanmugasundaram R., Kennedy J. S.
Abstract:
Nutrient supplements are indispensable for raising crops and securing productivity. The imbalance between nutrient replenishment and crop uptake is commonly addressed through the input of inorganic fertilizers. Excessive application of inorganic nutrients to the soil causes yields to stagnate and decline. An imbalanced N-P-K ratio in the soil disturbs the soil ecosystem. This study evaluated the effect of the fertilization practices of conventional farming (CF), organic farming, and the Integrated Nutrient Management (INM) system on soil quality using key indicators and soil quality indices. Twelve rice fields under a monocropping sequence in the Thondamuthur block of Coimbatore district were selected: ten under conventional cultivation practices, one under organic farming, and one under INM. Their physical, chemical, and biological properties were studied for four cropping seasons to determine the soil quality index (SQI). SQI was computed for the conventional, organic, and INM fields. Conventional farming recorded a lower SQI than organic farming and INM, which registered higher SQI values of 0.99 and 0.88, respectively. CF₄, which received a super-optimal dose of N (250%), showed a lower SQI (0.573) and a lower yield (3.20 t ha⁻¹), whereas CF₆, which received 125% N, recorded the highest SQI (0.715) and yield (6.20 t ha⁻¹). Likewise, most of the CFs, except CF₃ and CF₉, received N beyond the 125% level and recorded lower yields. The CFs that received super-optimal P, in the order CF₆ & CF₇ > CF₁ & CF₁₀, recorded lower yields, except for CF₆. Super-optimal K application was also associated with lower yields in CF₄, CF₇, and CF₉.
Keywords: rice cropping system, soil quality indicators, imbalanced fertilization, yield
Procedia PDF Downloads 156
26759 Fuglede-Putnam Theorem for ∗-Class A Operators
Authors: Mohammed Husein Mohammad Rashid
Abstract:
For a bounded linear operator T acting on a complex infinite-dimensional Hilbert space ℋ, we say that T is a ∗-class A operator (abbreviated T ∈ A*) if |T²| ≥ |T*|². In this article, we prove the following assertions: (i) we establish some conditions which imply the normality of ∗-class A operators; (ii) we consider a ∗-class A operator T ∈ ℬ(ℋ) with reducing kernel such that TX = XS for some X ∈ ℬ(K, ℋ) and prove a Fuglede-Putnam type theorem when the adjoint of S ∈ ℬ(K) is a dominant operator; (iii) furthermore, we extend the asymmetric Putnam-Fuglede theorem to the class of ∗-class A operators.
Keywords: fuglede-putnam theorem, normal operators, ∗-class a operators, dominant operators
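For orientation, the classical Fuglede-Putnam theorem that results of this kind generalize can be stated as below. This is the standard statement for normal operators, not the extension proved in the article; the symbols $N$ and $M$ are introduced here and do not appear in the abstract.

```latex
\textbf{Theorem (Fuglede-Putnam).}
Let $N \in \mathcal{B}(\mathcal{H})$ and $M \in \mathcal{B}(\mathcal{K})$ be normal operators,
and let $X \in \mathcal{B}(\mathcal{K}, \mathcal{H})$. Then
\[
  NX = XM \quad\Longrightarrow\quad N^{*}X = XM^{*}.
\]
```

In the setting of the abstract, $T$ plays the role of $N$ and $S$ the role of $M$, with normality replaced by the weaker hypotheses listed in (ii).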
Procedia PDF Downloads 86
26758 Knowledge Representation and Inconsistency Reasoning of Class Diagram Maintenance in Big Data
Authors: Chi-Lun Liu
Abstract:
Requirements modeling and analysis are important for the successful maintenance of information systems. Unified Modeling Language (UML) class diagrams are a useful standard for modeling information systems. To the best of our knowledge, there is a lack of a systems development methodology described by the organism metaphor. The core concept of this metaphor is adaptation. The use of knowledge representation and reasoning approaches and ontologies to accommodate new requirements has emerged in recent years. This paper proposes an organic methodology based on constructivism theory. This methodology is a knowledge representation and reasoning approach for analyzing new requirements in class diagram maintenance. The process and rules in the proposed methodology automatically analyze inconsistencies in the class diagram. In the big data era, developing an automatic tool based on the proposed methodology to analyze large amounts of class diagram data is an important topic for future research.
Keywords: knowledge representation, reasoning, ontology, class diagram, software engineering
Procedia PDF Downloads 241
26757 Predictive Modelling of Aircraft Component Replacement Using Imbalanced Learning and Ensemble Method
Authors: Dangut Maren David, Skaf Zakwan
Abstract:
Adequate monitoring of vehicle components in order to obtain high uptime is the goal of predictive maintenance. A major challenge faced by businesses in industry is the significant cost associated with delays in service delivery due to system downtime. Most of these businesses are interested in predicting such problems and proactively preventing them before they occur, which is the core advantage of Prognostic Health Management (PHM) applications. The recent emergence of Industry 4.0, or the industrial internet of things (IIoT), has led to the need for monitoring system activities and enhancing system-to-system or component-to-component interactions, which has resulted in the generation of large amounts of data known as big data. Analysis of big data is increasingly important; however, due to complexity inherent in the datasets, such as imbalanced classification problems, it becomes extremely difficult to build a model with high precision. Data-driven predictive modeling for condition-based maintenance (CBM) has recently drawn research interest, with growing attention from both academia and industry. The large data generated from industrial processes inherently come with different degrees of complexity, which poses a challenge for analytics. Thus, the imbalanced classification problem exists pervasively in industrial datasets and can degrade the performance of learning algorithms, leading to poor classifier accuracy in model development. Misclassification of faults can result in unplanned breakdowns, leading to economic loss.
In this paper, an advanced approach for handling the imbalanced classification problem is proposed, and a prognostic model is developed to predict aircraft component replacement in advance by exploring aircraft historical data. The approach is based on a hybrid ensemble-based method that improves the prediction of the minority class during learning. We also investigate the impact of our approach on the multiclass imbalance problem. We validate the feasibility and effectiveness of our approach using real-world aircraft operation and maintenance datasets, which span over 7 years. Our approach shows better performance compared to other similar approaches. We also validate the strength of our approach in handling multiclass imbalanced datasets; our results also show good performance compared to other base classifiers.
Keywords: prognostics, data-driven, imbalance classification, deep learning
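One common way an ensemble can give the minority class more weight during learning is to train each member on a class-balanced subset. The sketch below shows plain random undersampling of the majority class for each member; it is not the authors' hybrid method, whose details the abstract does not give, and the function name `balanced_subsets` and its parameters are hypothetical.

```python
import random


def balanced_subsets(samples, labels, minority_label, n_subsets, seed=0):
    """Build class-balanced training subsets by undersampling the majority class.

    Each subset keeps every minority sample and draws an equal number of
    majority samples without replacement.
    """
    rng = random.Random(seed)
    minority = [(s, y) for s, y in zip(samples, labels) if y == minority_label]
    majority = [(s, y) for s, y in zip(samples, labels) if y != minority_label]
    if len(majority) < len(minority):
        raise ValueError("the class given as minority is not the smaller one")
    subsets = []
    for _ in range(n_subsets):
        drawn = rng.sample(majority, len(minority))
        subset = minority + drawn
        rng.shuffle(subset)  # avoid feeding all minority samples first
        subsets.append(subset)
    return subsets
```

Each subset would then train one base classifier, and the ensemble prediction could be obtained by majority vote over the members.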
Procedia PDF Downloads 172
26756 Construction of a Fusion Gene Carrying E10A and K5 with 2A Peptide-Linked by Using Overlap Extension PCR
Authors: Tiancheng Lan
Abstract:
E10A is a replication-defective adenovirus that carries the human endostatin gene to inhibit the growth of tumors. Kringle 5 (K5) has almost the same function as angiostatin and also inhibits the growth of tumors, since both are byproducts of the proteolytic cleavage of plasminogen. Tumor growth can be suppressed because both endostatin and K5 restrain the angiogenesis process. Therefore, to improve the treatment effect on tumors, a 2A peptide is used to construct a fusion gene carrying both E10A and K5. Using a 2A peptide is an ideal strategy for expressing a fusion gene because it avoids many problems that arise when more than one protein is expressed. Overlap extension PCR is used to connect the 2A peptide with E10A and K5. The resulting fusion gene E10A-2A-K5 may provide a new method of anti-angiogenesis treatment with better expression performance.
Keywords: E10A, Kringle 5, 2A peptide, overlap extension PCR
Procedia PDF Downloads 149
26755 The Current Status of Middle Class Internet Use in China: An Analysis Based on the Chinese General Social Survey 2015 Data and Semi-Structured Investigation
Authors: Abigail Qian Zhou
Abstract:
In today's China, the well-educated middle class, with stable jobs and above-average income, is the driving force behind the country's Internet society. Through the analysis of data from the 2015 Chinese General Social Survey and from 50 interviewees, this study investigates the current state of this group’s Internet usage. The findings demonstrate that daily life among the members of this socioeconomic group is closely tied to the Internet. The Chinese middle class uses the Internet to socialize and to entertain themselves and others, to search for and share information, and to build their identities. The empirical results of this study will provide a reference, supported by factual data, for enterprises seeking to target the Chinese middle class through online marketing efforts.
Keywords: middle class, Internet use, network behaviour, online marketing, China
Procedia PDF Downloads 118