Search results for: Imbalanced dataset
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 1129

1129 Adaptive Swarm Balancing Algorithms for Rare-Event Prediction in Imbalanced Healthcare Data

Authors: Jinyan Li, Simon Fong, Raymond Wong, Mohammed Sabah, Fiaidhi Jinan

Abstract:

Clinical data analysis and forecasting have made great contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced class distributions. In this paper, we target binary imbalanced datasets, in which the positive samples are the minority. We investigate two meta-heuristic algorithms, particle swarm optimization and the bat-inspired algorithm, and combine each of them with the synthetic minority over-sampling technique (SMOTE) for processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reveal that while the performance improvements obtained by the former approach do not scale to larger datasets, the latter one, which we call Adaptive Swarm Balancing Algorithms, leads to significant efficiency and effectiveness improvements on large datasets. We also find it more consistent with the practice of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE, leading to more credible classifier performance and a shorter running time compared with the brute-force method.
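
To make the SMOTE step concrete, the following is a minimal sketch of the basic interpolation idea only, not the swarm-optimized procedure of the paper. The function name `smote`, the toy points, and the choice of `k` and `n_synthetic` as the tunable parameters are assumptions for illustration; the abstract does not say which two SMOTE parameters are optimized.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority points, the new samples stay inside the region already occupied by the minority class.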

Keywords: Imbalanced dataset, meta-heuristic algorithm, SMOTE, big data

Procedia PDF Downloads 411
1128 A Ratio-Weighted Decision Tree Algorithm for Imbalance Dataset Classification

Authors: Doyin Afolabi, Phillip Adewole, Oladipupo Sennaike

Abstract:

Most well-known classifiers, including the decision tree algorithm, can make predictions on balanced datasets efficiently. However, the decision tree algorithm tends to be biased on imbalanced datasets because of the skewness of their class distribution. To overcome this problem, this study proposes a weighted decision tree algorithm that aims to remove the bias toward the majority class and prevents the reduction of majority observations in imbalanced dataset classification. The proposed weighted decision tree algorithm was tested on three imbalanced datasets: a cancer dataset, a German credit dataset, and a banknote dataset. The specificity, sensitivity, and accuracy metrics were used to evaluate the performance of the proposed decision tree algorithm on the datasets. The evaluation results show that, for some of the weights of the proposed decision tree, the specificity, sensitivity, and accuracy metrics gave better results than those of the ID3 decision tree and the decision tree induced with minority entropy for all three datasets.
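
As a small illustration of why sensitivity and specificity are reported alongside accuracy, the sketch below computes the three metrics from predicted and true labels. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return sensitivity, specificity and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        # share of actual positives that were found
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # share of actual negatives that were found
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical data: 5 positives among 100 samples, and a classifier
# that always predicts the majority (negative) class.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
metrics = binary_metrics(y_true, y_pred)
```

In this toy case the accuracy is high even though no positive sample is detected, which is the bias toward the majority class that the abstract describes.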

Keywords: data mining, decision tree, classification, imbalance dataset

Procedia PDF Downloads 89
1127 Improved Classification Procedure for Imbalanced and Overlapped Situations

Authors: Hankyu Lee, Seoung Bum Kim

Abstract:

The issue of imbalance and overlap in the class distribution has become important in various applications of data mining. The imbalanced dataset is a special case in classification problems in which the number of observations of one class (i.e., the major class) heavily exceeds the number of observations of the other class (i.e., the minor class). An overlapped dataset is one in which many observations are shared between the two classes. Imbalanced and overlapped data can be frequently found in many real examples, including fraud and abuse in healthcare, quality prediction in manufacturing, text classification, oil spill detection, remote sensing, and so on. The class imbalance and overlap problem is a challenging issue because it degrades the performance of most standard classification algorithms. In this study, we propose a classification procedure that can effectively handle imbalanced and overlapped datasets by splitting the data space into three parts (nonoverlapping, light overlapping, and severe overlapping) and applying the classification algorithm in each part. These three parts were determined based on the Hausdorff distance and the margin of the modified support vector machine. An experimental study was conducted to examine the properties of the proposed method and to compare it with other classification algorithms. The results showed that the proposed method outperformed the competitors under various imbalanced and overlapped situations. Moreover, the applicability of the proposed method was demonstrated through an experiment with real data.

Keywords: classification, imbalanced data with class overlap, split data space, support vector machine

Procedia PDF Downloads 273
1126 A Survey in Techniques for Imbalanced Intrusion Detection System Datasets

Authors: Najmeh Abedzadeh, Matthew Jacobs

Abstract:

An intrusion detection system (IDS) is a software application that monitors malicious activities and generates alerts if any are detected. However, most network activities in IDS datasets are normal, and the relatively small number of attacks makes the available data imbalanced. Consequently, cyber-attacks can hide inside a large number of normal activities, and machine learning algorithms have difficulty learning and classifying the data correctly. In this paper, a comprehensive literature review is conducted on different types of algorithms for implementing the IDS and on methods for correcting the imbalanced IDS dataset. The most famous algorithms are machine learning (ML), deep learning (DL), synthetic minority over-sampling technique (SMOTE), and reinforcement learning (RL). Most of the research uses the CSE-CIC-IDS2017, CSE-CIC-IDS2018, and NSL-KDD datasets for evaluating the algorithms.

Keywords: IDS, imbalanced datasets, sampling algorithms, big data

Procedia PDF Downloads 274
1125 An Empirical Evaluation of Performance of Machine Learning Techniques on Imbalanced Software Quality Data

Authors: Ruchika Malhotra, Megha Khanna

Abstract:

The development of change prediction models can help software practitioners in planning testing and inspection resources at early phases of software development. However, a major challenge faced during the training process of any classification model is the imbalanced nature of the software quality data. Data with very few minority outcome categories lead to an inefficient learning process, and a classification model developed from imbalanced data generally does not predict these minority categories correctly. Thus, for a given dataset, a minority of classes may be change prone whereas a majority of classes may be non-change prone. This study explores various alternatives for adeptly handling imbalanced software quality data using different sampling methods and effective MetaCost learners. The study also analyzes and justifies the use of different performance metrics while dealing with imbalanced data. In order to empirically validate the different alternatives, the study uses change data from three application packages of an open-source Android data set and evaluates the performance of six different machine learning techniques. The results of the study indicate extensive improvement in the performance of the classification models when using resampling methods and robust performance measures.

Keywords: change proneness, empirical validation, imbalanced learning, machine learning techniques, object-oriented metrics

Procedia PDF Downloads 389
1124 Enhancing Fault Detection in Rotating Machinery Using Wiener-CNN Method

Authors: Mohamad R. Moshtagh, Ahmad Bagheri

Abstract:

Accurate fault detection in rotating machinery is of utmost importance to ensure optimal performance and prevent costly downtime in industrial applications. This study presents a robust fault detection system based on vibration data collected from rotating gears under various operating conditions. The considered scenarios include: (1) both gears being healthy, (2) one healthy gear and one faulty gear, and (3) introducing an imbalanced condition to a healthy gear. Vibration data was acquired using a Hentek 1008 device and stored in a CSV file. Python code implemented in the Spyder environment was used for data preprocessing and analysis. Winner features were extracted using the Wiener feature selection method. These features were then employed in multiple machine learning algorithms, including Convolutional Neural Networks (CNN), Multilayer Perceptron (MLP), K-Nearest Neighbors (KNN), and Random Forest, to evaluate their performance in detecting and classifying faults in both the training and validation datasets. The comparative analysis of the methods revealed the superior performance of the Wiener-CNN approach. The Wiener-CNN method achieved a remarkable accuracy of 100% for both the two-class (healthy gear and faulty gear) and three-class (healthy gear, faulty gear, and imbalanced) scenarios in the training and validation datasets. In contrast, the other methods exhibited varying levels of accuracy. The Wiener-MLP method attained 100% accuracy for the two-class training dataset and 100% for the validation dataset. For the three-class scenario, the Wiener-MLP method demonstrated 100% accuracy in the training dataset and 95.3% accuracy in the validation dataset. The Wiener-KNN method yielded 96.3% accuracy for the two-class training dataset and 94.5% for the validation dataset. In the three-class scenario, it achieved 85.3% accuracy in the training dataset and 77.2% in the validation dataset.
The Wiener-Random Forest method achieved 100% accuracy for the two-class training dataset and 85% for the validation dataset, while in the three-class training dataset, it attained 100% accuracy and 90.8% accuracy for the validation dataset. The exceptional accuracy demonstrated by the Wiener-CNN method underscores its effectiveness in accurately identifying and classifying fault conditions in rotating machinery. The proposed fault detection system utilizes vibration data analysis and advanced machine learning techniques to improve operational reliability and productivity. By adopting the Wiener-CNN method, industrial systems can benefit from enhanced fault detection capabilities, facilitating proactive maintenance and reducing equipment downtime.

Keywords: fault detection, gearbox, machine learning, wiener method

Procedia PDF Downloads 47
1123 Machine Learning Facing Behavioral Noise Problem in an Imbalanced Data Using One Side Behavioral Noise Reduction: Application to a Fraud Detection

Authors: Salma El Hajjami, Jamal Malki, Alain Bouju, Mohammed Berrada

Abstract:

With the expansion of machine learning and data mining in the context of Big Data analytics, a common problem that affects data is class imbalance. It refers to an imbalanced distribution of instances belonging to each class. This problem is present in many real-world applications such as fraud detection, network intrusion detection, medical diagnostics, etc. In these cases, data instances labeled negatively are significantly more numerous than the instances labeled positively. When this difference is too large, the learning system may face difficulty when tackling this problem, since it is initially designed to work in relatively balanced class distribution scenarios. Another important problem, which usually accompanies imbalanced data, is the overlapping of instances between the two classes. It is commonly referred to as noise or overlapping data. In this article, we propose an approach called One Side Behavioral Noise Reduction (OSBNR). This approach presents a way to deal with the problem of class imbalance in the presence of a high noise level. OSBNR is based on two steps. Firstly, a cluster analysis is applied to group similar instances from the minority class into several behavior clusters. Secondly, we select and eliminate the instances of the majority class, considered as behavioral noise, which overlap with behavior clusters of the minority class. The results of experiments carried out on a representative public dataset confirm that the proposed approach is efficient for the treatment of class imbalance in the presence of noise.

Keywords: machine learning, imbalanced data, data mining, big data

Procedia PDF Downloads 102
1122 Towards a Balancing Medical Database by Using the Least Mean Square Algorithm

Authors: Kamel Belammi, Houria Fatrim

Abstract:

An imbalanced data set, a problem often found in real-world applications, can have a seriously negative effect on the classification performance of machine learning algorithms. There have been many attempts at dealing with the classification of imbalanced data sets. In medical diagnosis classification, we often face an imbalanced number of data samples between the classes, in which there are not enough samples in the rare classes. In this paper, we propose a learning method based on a cost-sensitive extension of the Least Mean Square (LMS) algorithm that penalizes errors of different samples with different weights, together with some rules of thumb to determine those weights. After the balancing phase, we apply different classifiers (support vector machine (SVM), k-nearest neighbor (KNN) and multilayer neural networks (MNN)) to the balanced data set. We also compare the results obtained before and after applying the balancing method.
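
A cost-sensitive LMS update can be sketched as the usual Widrow-Hoff rule with the error of each sample scaled by a per-class cost. The sketch below is only an illustration under stated assumptions: the inverse-class-frequency costs, the function names and the toy data are hypothetical, and the abstract does not specify the paper's own rules of thumb for choosing the weights.

```python
def cost_sensitive_lms(X, y, costs, lr=0.05, epochs=200):
    """Fit a linear model with LMS updates scaled by a per-class cost.
    Labels are +1 (rare class) and -1 (common class)."""
    w = [0.0] * (len(X[0]) + 1)  # last entry is the bias term
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            xb = list(xi) + [1.0]
            error = yi - sum(wj * xj for wj, xj in zip(w, xb))
            step = lr * costs[yi] * error  # costlier class -> larger update
            w = [wj + step * xj for wj, xj in zip(w, xb)]
    return w


def predict(w, xi):
    score = sum(wj * xj for wj, xj in zip(w, list(xi) + [1.0]))
    return 1 if score >= 0 else -1


# Hypothetical one-feature data: 8 common samples, 2 rare samples
X = [(-1.0,), (-1.2,), (-0.8,), (-1.1,), (-0.9,),
     (-1.3,), (-0.7,), (-1.0,), (1.0,), (1.2,)]
y = [-1] * 8 + [1] * 2

# Assumed rule of thumb: cost inversely proportional to class frequency
n = len(y)
costs = {label: n / (2 * y.count(label)) for label in set(y)}
w = cost_sensitive_lms(X, y, costs)
```

With these costs, an error on a rare-class sample moves the weights four times as far as the same error on a common-class sample, so the rare class is not drowned out during training.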

Keywords: multilayer neural networks, k-nearest neighbor, support vector machine, imbalanced medical data, least mean square algorithm, diabetes

Procedia PDF Downloads 494
1121 A Priority Based Imbalanced Time Minimization Assignment Problem: An Iterative Approach

Authors: Ekta Jain, Kalpana Dahiya, Vanita Verma

Abstract:

This paper discusses a priority based imbalanced time minimization assignment problem dealing with the allocation of n jobs to m < n persons, in which the project is carried out in two stages, viz. Stage-I and Stage-II. Stage-I consists of n1 (< m) primary jobs and Stage-II consists of the remaining (n-n1) secondary jobs, which are commenced only after the primary jobs are finished. Each job is to be allocated to exactly one person, and each person has to do at least one job. It is assumed that the nature of the Stage-I jobs is such that one person can do exactly one primary job, whereas a person can do more than one secondary job in Stage-II. In a particular stage, all persons start doing the jobs simultaneously, but if a person is doing more than one job, he does them one after the other in any order. The aim of the proposed study is to find the feasible assignment which minimizes the total time for the two-stage execution of the project. For this, an iterative algorithm is proposed which, at each iteration, solves a constrained imbalanced time minimization assignment problem to generate a pair of Stage-I and Stage-II times. For solving this constrained problem, an algorithm is developed in the current paper. Later, an alternate combinations based method to solve the priority based imbalanced problem is also discussed, and a comparative study is carried out. Numerical illustrations are provided in support of the theory.

Keywords: assignment, imbalanced, priority, time minimization

Procedia PDF Downloads 199
1120 An Adaptive Oversampling Technique for Imbalanced Datasets

Authors: Shaukat Ali Shahee, Usha Ananthakumar

Abstract:

A data set exhibits the class imbalance problem when one class has very few examples compared to the other class; this is also referred to as between-class imbalance. Traditional classifiers fail to classify the minority class examples correctly due to their bias towards the majority class. Apart from between-class imbalance, imbalance within classes, where classes are composed of a different number of sub-clusters and these sub-clusters contain different numbers of examples, also deteriorates the performance of the classifier. Previously, many methods have been proposed for handling the imbalanced dataset problem. These methods can be classified into four categories: data preprocessing, algorithm-based methods, cost-based methods and ensembles of classifiers. Data preprocessing techniques have shown great potential as they attempt to improve the data distribution rather than the classifier. A data preprocessing technique handles class imbalance either by increasing the minority class examples or by decreasing the majority class examples. Decreasing the majority class examples leads to loss of information, and when the minority class has an absolute rarity, removing majority class examples is generally not recommended. Existing methods for handling class imbalance do not address both between-class imbalance and within-class imbalance simultaneously. In this paper, we propose a method that handles between-class imbalance and within-class imbalance simultaneously for the binary classification problem. Removing between-class imbalance and within-class imbalance simultaneously eliminates the bias of the classifier towards bigger sub-clusters by minimizing the error domination of bigger sub-clusters in the total error. The proposed method uses model-based clustering to find the presence of sub-clusters or sub-concepts in the dataset. The number of examples oversampled among the sub-clusters is determined based on the complexity of the sub-clusters.
The method also takes into consideration the scatter of the data in the feature space and adaptively copes with unseen test data using the Lowner-John ellipsoid to increase the accuracy of the classifier. In this study, a neural network is used, as it is a classifier in which the total error is minimized, and removing the between-class imbalance and within-class imbalance simultaneously helps the classifier give equal weight to all the sub-clusters irrespective of the classes. The proposed method is validated on 9 publicly available data sets and compared with three existing oversampling techniques that rely on the spatial location of minority class examples in the Euclidean feature space. The experimental results show the proposed method to be statistically significantly superior to the other methods in terms of various accuracy measures. Thus the proposed method can serve as a good alternative for handling various problem domains like credit scoring, customer churn prediction, financial distress, etc., that typically involve imbalanced data sets.

Keywords: classification, imbalanced dataset, Lowner-John ellipsoid, model based clustering, oversampling

Procedia PDF Downloads 387
1119 One vs. Rest and Error Correcting Output Codes Principled Rebalancing Schemes for Solving Imbalanced Multiclass Problems

Authors: Alvaro Callejas-Ramos, Lorena Alvarez-Perez, Alexander Benitez-Buenache, Anibal R. Figueiras-Vidal

Abstract:

This contribution presents a promising formulation which allows to extend the principled binary rebalancing procedures, also known as neutral re-balancing mechanisms in the sense that they do not alter the likelihood ratio

Keywords: Bregman divergences, imbalanced multiclass classification, informed re-balancing, invariant likelihood ratio

Procedia PDF Downloads 174
1118 Artificial Reproduction System and Imbalanced Dataset: A Mendelian Classification

Authors: Anita Kushwaha

Abstract:

We propose a new evolutionary computational model called Artificial Reproduction System, which is based on the complex process of meiotic reproduction occurring between male and female cells of living organisms. Artificial Reproduction System is an attempt towards a new computational intelligence approach inspired by the theoretical reproduction mechanism, observed reproduction functions, principles and mechanisms. A reproductive organism is programmed by genes and can be viewed as an automaton, mapping and reducing so as to create copies of those genes in its offspring. In Artificial Reproduction System, the binding mechanism between male and female cells is studied, parameters are chosen, a network is constructed, and a feedback system for self-regularization is established. The model then applies Mendel’s law of inheritance and allele-allele associations, and can be used to perform data analysis of imbalanced, multivariate, multiclass and big data. In the experimental study, Artificial Reproduction System is compared with other state-of-the-art classifiers like SVM, Radial Basis Function, neural networks and K-Nearest Neighbor on some benchmark datasets, and the comparison results indicate good performance.

Keywords: bio-inspired computation, nature-inspired computation, natural computing, data mining

Procedia PDF Downloads 234
1117 An Ensemble Deep Learning Architecture for Imbalanced Classification of Thoracic Surgery Patients

Authors: Saba Ebrahimi, Saeed Ahmadian, Hedie Ashrafi

Abstract:

Selecting appropriate patients for surgery is one of the main issues in thoracic surgery (TS). Both short-term and long-term risks and benefits of surgery must be considered in the patient selection criteria. There are some limitations in the existing datasets of TS patients because of missing attribute values and the imbalanced distribution of survival classes. In this study, a novel ensemble architecture of deep learning networks is proposed, based on stacking different linear and non-linear layers to deal with imbalanced datasets. The categorical and numerical features are split using different layers with the ability to shrink unnecessary features. Then, after extracting the insight from the raw features, a novel biased-kernel layer is applied to reinforce the gradient of the minority class and cause the network to be trained better compared with current methods. Finally, the performance and advantages of our proposed model over existing models are examined for predicting patient survival after thoracic surgery using real-life clinical data for lung cancer patients.

Keywords: deep learning, ensemble models, imbalanced classification, lung cancer, TS patient selection

Procedia PDF Downloads 110
1116 A Hybrid Feature Selection and Deep Learning Algorithm for Cancer Disease Classification

Authors: Niousha Bagheri Khulenjani, Mohammad Saniee Abadeh

Abstract:

Learning from very big datasets is a significant problem for most present data mining and machine learning algorithms. MicroRNA (miRNA) is one of the important big genomic and non-coding datasets presenting the genome sequences. In this paper, a hybrid method for the classification of miRNA data is proposed. Due to the variety of cancers and the high number of genes, analyzing the miRNA dataset has been a challenging problem for researchers. The number of features relative to the number of samples is high, and the data suffer from being imbalanced. A feature selection method is used to select the features with more ability to distinguish classes and to eliminate obscure features. Afterward, a Convolutional Neural Network (CNN) classifier is utilized for the classification of cancer types, which employs a Genetic Algorithm to identify optimized hyper-parameters of the CNN. In order to make the process of classification by the CNN faster, a Graphics Processing Unit (GPU) is recommended for calculating the mathematical equations in parallel. The proposed method is tested on a real-world dataset with 8,129 patients, 29 different types of tumors, and 1,046 miRNA biomarkers, taken from The Cancer Genome Atlas (TCGA) database.

Keywords: cancer classification, feature selection, deep learning, genetic algorithm

Procedia PDF Downloads 85
1115 Predictive Modelling of Aircraft Component Replacement Using Imbalanced Learning and Ensemble Method

Authors: Dangut Maren David, Skaf Zakwan

Abstract:

Adequate monitoring of vehicle components in order to obtain high uptime is the goal of predictive maintenance. The major challenge faced by businesses in industry is the significant cost associated with a delay in service delivery due to system downtime. Most of those businesses are interested in predicting such problems and proactively preventing them before they occur, which is the core advantage of Prognostic Health Management (PHM) applications. The recent emergence of Industry 4.0, or the industrial internet of things (IIoT), has led to the need for monitoring system activities and enhancing system-to-system or component-to-component interactions, which has resulted in the generation of large amounts of data, known as big data. Analysis of big data is increasingly important; however, due to complexity inherent in the datasets, such as imbalanced classification problems, it becomes extremely difficult to build a model with high precision. Data-driven predictive modeling for condition-based maintenance (CBM) has recently drawn research interest, with growing attention from both academia and industry. The large data generated from industrial processes inherently come with different degrees of complexity, which poses a challenge for analytics. Thus, the imbalanced classification problem exists pervasively in industrial datasets, which can affect the performance of learning algorithms and yield poor classifier accuracy in model development. Misclassification of faults can result in unplanned breakdowns, leading to economic loss.
In this paper, an advanced approach for handling the imbalanced classification problem is proposed, and a prognostic model is then developed to predict aircraft component replacement in advance by exploring aircraft historical data. The approach is based on a hybrid ensemble-based method which improves the prediction of the minority class during learning. We also investigate the impact of our approach on the multiclass imbalance problem. We validate the feasibility and effectiveness of our approach using real-world aircraft operation and maintenance datasets spanning over 7 years. Our approach shows better performance compared to other similar approaches. We also validate the strength of our approach in handling multiclass imbalanced datasets; our results also show good performance compared to other base classifiers.

Keywords: prognostics, data-driven, imbalance classification, deep learning

Procedia PDF Downloads 147
1114 An Analysis of Classification of Imbalanced Datasets by Using Synthetic Minority Over-Sampling Technique

Authors: Ghada A. Alfattni

Abstract:

Analysing unbalanced datasets is one of the challenges that practitioners in the machine learning field face. Much research has been carried out to determine the effectiveness of the synthetic minority over-sampling technique (SMOTE) in addressing this issue. The aim of this study was therefore to compare the effectiveness of SMOTE across different models on unbalanced datasets. Three classification models (Logistic Regression, Support Vector Machine and Nearest Neighbour) were tested with multiple datasets; the same datasets were then oversampled using SMOTE and applied again to the three models to compare the differences in performance. Results of experiments show that the highest number of nearest neighbours gives lower error rates.

Keywords: imbalanced datasets, SMOTE, machine learning, logistic regression, support vector machine, nearest neighbour

Procedia PDF Downloads 311
1113 Using Autoencoder as Feature Extractor for Malware Detection

Authors: Umm-E-Hani, Faiza Babar, Hanif Durad

Abstract:

Malware-detecting approaches suffer many limitations, due to which all anti-malware solutions have failed to be reliable enough for detecting zero-day malware. Signature-based solutions depend upon the signatures that can be generated only when malware surfaces at least once in the cyber world. Another approach that works by detecting the anomalies caused in the environment can easily be defeated by diligently and intelligently written malware. Solutions that have been trained to observe the behavior for detecting malicious files have failed to cater to the malware capable of detecting the sandboxed or protected environment. Machine learning and deep learning-based approaches greatly suffer in training their models with either an imbalanced dataset or an inadequate number of samples. AI-based anti-malware solutions that have been trained with enough samples targeted a selected feature vector, thus ignoring the input of leftover features in the maliciousness of malware just to cope with the lack of underlying hardware processing power. Our research focuses on producing an anti-malware solution for detecting malicious PE files by circumventing the earlier-mentioned shortcomings. Our proposed framework, which is based on automated feature engineering through autoencoders, trains the model over a fairly large dataset. It focuses on the visual patterns of malware samples to automatically extract the meaningful part of the visual pattern. Our experiment has successfully produced a state-of-the-art accuracy of 99.54 % over test data.

Keywords: malware, auto encoders, automated feature engineering, classification

Procedia PDF Downloads 38
1112 Distorted Document Images Dataset for Text Detection and Recognition

Authors: Ilia Zharikov, Philipp Nikitin, Ilia Vasiliev, Vladimir Dokholyan

Abstract:

With the increasing popularity of document analysis and recognition systems, text detection (TD) and optical character recognition (OCR) in document images have become challenging tasks. However, to the best of our knowledge, no publicly available datasets for these particular problems exist. In this paper, we introduce a Distorted Document Images dataset (DDI-100) and provide a detailed analysis of DDI-100 in its current state. To create the dataset, we collected 7000 unique document pages and extended them by applying different types of distortions and geometric transformations. In total, DDI-100 contains more than 100,000 document images together with binary text masks and text and character locations in terms of bounding boxes. We also present an analysis of several state-of-the-art TD and OCR approaches on the presented dataset. Lastly, we demonstrate the usefulness of DDI-100 for improving the accuracy and stability of the considered TD and OCR models.

Keywords: document analysis, open dataset, optical character recognition, text detection

Procedia PDF Downloads 132
1111 Selection of Soil Quality Indicators of Rice Cropping Systems Using Minimum Data Set Influenced by Imbalanced Fertilization

Authors: Theresa K., Shanmugasundaram R., Kennedy J. S.

Abstract:

Nutrient supplements are indispensable for raising crops and attaining good productivity. The nutrient imbalance between replenishment and crop uptake is addressed through the input of inorganic fertilizers. Excessive dumping of inorganic nutrients in soil causes yields to stagnate and decline. An imbalanced N-P-K ratio in the soil disturbs the soil ecosystem. The study evaluated the fertilization practices of conventional farming (CF), organic farming and the Integrated Nutrient Management (INM) system on soil quality using key indicators and soil quality indices. Twelve rice farming fields cultivated under a monocropping sequence in the Thondamuthur block of Coimbatore district were selected: ten fields under conventional cultivation practices, one under organic farming and one under INM. Their physical, chemical and biological properties were studied for four cropping seasons to determine the soil quality index (SQI). SQI was computed for the conventional, organic and INM fields. Compared with organic and INM, CF recorded a lower soil quality index, while the organic and INM fields registered higher SQI values of 0.99 and 0.88, respectively. CF₄, which received a super-optimal dose of N (250%), showed a lower SQI value (0.573) as well as a lower yield (3.20 t ha⁻¹), whereas CF₆, which received 125% N, recorded the highest SQI (0.715) and yield (6.20 t ha⁻¹). Likewise, most of the CFs received higher N beyond the level of 125% except CF₃ and CF₉, which recorded lower yields. CFs which received super-optimal P, in the order CF₆ & CF₇ > CF₁ & CF₁₀, recorded lower yields except for CF₆. Super-optimal K application also resulted in lower yields in CF₄, CF₇ and CF₉.

Keywords: rice cropping system, soil quality indicators, imbalanced fertilization, yield

Procedia PDF Downloads 117
1110 SAMRA: Dataset in Al-Soudani Arabic Maghrebi Script for Recognition of Arabic Ancient Words Handwritten

Authors: Sidi Ahmed Maouloud, Cheikh Ba

Abstract:

Much of West Africa’s cultural heritage is written in the Al-Soudani Arabic script, which was widely used in West Africa before the time of European colonization. This Al-Soudani Arabic script is an African version of the Maghrebi script, in particular, the Al-Mebssout script. However, the local African qualities were incorporated into the Al-Soudani script in a way that gave it a unique African diversity and character. Despite the existence of several Arabic datasets in Oriental script, allowing for the analysis, layout, and recognition of texts written in these calligraphies, many Arabic scripts and written traditions remain understudied. In this paper, we present a dataset of words from Al-Soudani calligraphy scripts. This dataset consists of 100 images selected from three different manuscripts written in Al-Soudani Arabic script by different copyists. The primary source for this database was the libraries of Boston University and Cambridge University. This dataset highlights the unique characteristics of the Al-Soudani Arabic script as well as the new challenges it presents in terms of automatic word recognition of Arabic manuscripts. An HTR system based on a hybrid ANN (CRNN-CTC) is also proposed to test this dataset. SAMRA is a dataset of annotated Arabic manuscript words in the Al-Soudani script that can help researchers automatically recognize and analyze manuscript words written in this script.

Keywords: dataset, CRNN-CTC, handwritten words recognition, Al-Soudani Arabic script, HTR, manuscripts

Procedia PDF Downloads 71
1109 Fuzzy-Machine Learning Models for the Prediction of Fire Outbreak: A Comparative Analysis

Authors: Uduak Umoh, Imo Eyoh, Emmauel Nyoho

Abstract:

This paper compares fuzzy-machine learning algorithms, namely Support Vector Machine (SVM) and K-Nearest Neighbor (KNN), for predicting cases of fire outbreak. The paper uses a fire outbreak dataset with three features (Temperature, Smoke, and Flame). The data is pre-processed using the Interval Type-2 Fuzzy Logic (IT2FL) algorithm, Min-Max Normalization, and Principal Component Analysis (PCA), which are used to predict feature labels in the dataset, normalize the dataset, and select relevant features, respectively. The output of the pre-processing is a dataset with two principal components (PC1 and PC2). The pre-processed dataset is then used to train the aforementioned machine learning models. K-fold cross-validation (with K=10) is used to evaluate the performance of the models using the metrics ROC (Receiver Operating Characteristic), Specificity, and Sensitivity. The models are also tested with 20% of the dataset. The validation results show that KNN is the better model for fire outbreak detection, with an ROC value of 0.99878, followed by SVM with an ROC value of 0.99753.
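The normalization and cross-validation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the IT2FL labelling and PCA steps are omitted, the readings are made-up synthetic values, the function names are hypothetical, and normalization is applied once to the whole dataset for brevity.

```python
import math
import random


def min_max_normalize(rows):
    """Rescale every feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(sorted(set(votes)), key=votes.count)


def k_fold_sensitivity_specificity(x, y, folds=10, k=3, seed=0):
    """Pool predictions over all folds; label 1 = fire, label 0 = no fire."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    tp = tn = fp = fn = 0
    for fold in range(folds):
        test_idx = order[fold::folds]  # every folds-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            predicted = knn_predict(train_x, train_y, x[i], k)
            if y[i] == 1:
                tp += predicted == 1
                fn += predicted == 0
            else:
                tn += predicted == 0
                fp += predicted == 1
    return tp / (tp + fn), tn / (tn + fp)


# Synthetic, made-up readings in the order (temperature, smoke, flame).
rng = random.Random(42)
no_fire = [[rng.uniform(20, 35), rng.uniform(0, 20), rng.uniform(0, 10)]
           for _ in range(50)]
fire = [[rng.uniform(80, 100), rng.uniform(80, 100), rng.uniform(90, 100)]
        for _ in range(50)]
features = min_max_normalize(no_fire + fire)
labels = [0] * 50 + [1] * 50
sensitivity, specificity = k_fold_sensitivity_specificity(features, labels)
print(sensitivity, specificity)
```

In a stricter evaluation, the minimum and maximum used for scaling would be computed on each training fold only, so that no information from the held-out fold leaks into training.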

Keywords: Machine Learning Algorithms, Interval Type-2 Fuzzy Logic, Fire Outbreak, Support Vector Machine, K-Nearest Neighbour, Principal Component Analysis

Procedia PDF Downloads 134
1108 Intelligent Recognition of Diabetes Disease via FCM Based Attribute Weighting

Authors: Kemal Polat

Abstract:

In this paper, an attribute weighting method called fuzzy C-means clustering based attribute weighting (FCMAW) is used for the classification of a diabetes disease dataset. The aims of this study are to reduce the variance within the attributes of the diabetes dataset and to improve the classification accuracy of classifier algorithms by transforming non-linearly separable datasets into linearly separable ones. The Pima Indians Diabetes dataset has two classes: normal subjects (500 instances) and diabetes subjects (268 instances). Fuzzy C-means clustering is an improved version of the K-means clustering method and is one of the most used clustering methods in data mining and machine learning applications. In this study, as the first stage, fuzzy C-means clustering is used to find the centers of the attributes in the Pima Indians diabetes dataset, and the dataset is then weighted according to the ratios of the means of the attributes to their centers. In the second stage, after the weighting process, support vector machine (SVM) and k-nearest neighbor (k-NN) classifiers are used to classify the weighted Pima Indians diabetes dataset. Experimental results show that the proposed attribute weighting method (FCMAW) obtains very promising results in the classification of the Pima Indians diabetes dataset.
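The weighting idea above can be sketched as follows. The abstract does not spell out the exact formula, so this is only one possible reading, labelled as an assumption: each attribute is clustered on its own with a one-dimensional fuzzy C-means, and its weight is the attribute mean divided by the average of the resulting cluster centers. All function names and the toy values are hypothetical.

```python
def fuzzy_c_means_1d(values, clusters=2, fuzziness=2.0, max_iter=100, tol=1e-9):
    """Return the sorted cluster centers of a 1-D fuzzy C-means run."""
    low, high = min(values), max(values)
    # Deterministic start: centers spread evenly across the value range.
    centers = [low + (i + 0.5) * (high - low) / clusters for i in range(clusters)]
    exponent = -2.0 / (fuzziness - 1.0)
    for _ in range(max_iter):
        memberships = []
        for v in values:
            dists = [abs(v - c) for c in centers]
            if min(dists) == 0.0:
                # Point sits on a center: share full membership among those centers.
                zeros = dists.count(0.0)
                row = [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
            else:
                inverse = [d ** exponent for d in dists]
                total = sum(inverse)
                row = [w / total for w in inverse]
            memberships.append(row)
        new_centers = []
        for j in range(clusters):
            weights = [row[j] ** fuzziness for row in memberships]
            denom = sum(weights)
            if denom == 0.0:
                new_centers.append(centers[j])  # empty cluster: keep old center
            else:
                new_centers.append(
                    sum(w * v for w, v in zip(weights, values)) / denom)
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return sorted(centers)


def attribute_weights(rows, clusters=2):
    """Assumed rule: weight = column mean / mean of that column's FCM centers."""
    weights = []
    for column in zip(*rows):
        centers = fuzzy_c_means_1d(list(column), clusters)
        center = sum(centers) / len(centers)
        mean = sum(column) / len(column)
        weights.append(mean / center if center != 0 else 1.0)
    return weights


def apply_weights(rows, weights):
    """Multiply every attribute value by its attribute weight."""
    return [[v * w for v, w in zip(row, weights)] for row in rows]


# Toy rows with two attributes; the values are illustrative only.
toy_rows = [[0.9, 30.0], [1.0, 32.0], [1.1, 31.0], [4.9, 80.0], [5.0, 35.0]]
toy_weights = attribute_weights(toy_rows)
weighted_rows = apply_weights(toy_rows, toy_weights)
print(toy_weights)
```

The weighted rows would then be passed to a classifier such as SVM or k-NN in place of the raw attributes.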

Keywords: fuzzy C-means clustering, fuzzy C-means clustering based attribute weighting, Pima Indians diabetes, SVM

Procedia PDF Downloads 378
1107 Optimizing the Capacity of a Convolutional Neural Network for Image Segmentation and Pattern Recognition

Authors: Yalong Jiang, Zheru Chi

Abstract:

In this paper, we study the factors that determine the capacity of a Convolutional Neural Network (CNN) model and propose ways to evaluate and adjust the capacity of a CNN model to best match a specific pattern recognition task. Firstly, a scheme is proposed to adjust the number of independent functional units within a CNN model so that it better fits a task. Secondly, the number of independent functional units in the capsule network is adjusted to fit it to the training dataset. Thirdly, a method based on Bayesian GAN is proposed to enrich the variances in the current dataset to increase its complexity. Experimental results on the PASCAL VOC 2010 Person Part dataset and the MNIST dataset show that, in both conventional CNN models and capsule networks, the number of independent functional units is an important factor that determines the capacity of a network model. By adjusting the number of functional units, the capacity of a model can better match the complexity of a dataset.

Keywords: CNN, convolutional neural network, capsule network, capacity optimization, character recognition, data augmentation, semantic segmentation

Procedia PDF Downloads 117
1106 Energy Complementary in Colombia: Imputation of Dataset

Authors: Felipe Villegas-Velasquez, Harold Pantoja-Villota, Sergio Holguin-Cardona, Alejandro Osorio-Botero, Brayan Candamil-Arango

Abstract:

Colombian electricity comes mainly from hydro resources, which are affected by environmental variations such as the El Niño phenomenon. Incorporating other types of resources is therefore necessary to provide electricity constantly. This research seeks to fill the gaps in the wind speed and global solar irradiance dataset for the two years with the highest amount of information. A further result is the characterization of the data by region, which made it possible to infer which errors occurred and produced the incomplete dataset.

Keywords: energy, wind speed, global solar irradiance, Colombia, imputation

Procedia PDF Downloads 111
1105 The Clustering of Multiple Sclerosis Subgroups through L2 Norm Multifractal Denoising Technique

Authors: Yeliz Karaca, Rana Karabudak

Abstract:

Multifractal denoising techniques are used to identify significant attributes by removing noise from a dataset. Magnetic resonance (MR) imaging is the most sensitive method for identifying chronic disorders of the nervous system such as Multiple Sclerosis (MS). MRI and Expanded Disability Status Scale (EDSS) data belonging to 120 individuals who have one of the subgroups of MS (Relapsing Remitting MS (RRMS), Secondary Progressive MS (SPMS), Primary Progressive MS (PPMS)) as well as 19 healthy individuals in the control group have been used in this study. The study comprises the following stages: (i) The L2 Norm Multifractal Denoising technique, one of the multifractal techniques, has been applied to the MS data (MRI and EDSS), yielding a new dataset. (ii) The new MS dataset, obtained by applying the L2 Norm Multifractal Denoising technique to the MS dataset, has been clustered with the K-Means and Fuzzy C-Means (FCM) algorithms, which are unsupervised methods, and the clustering performances have been compared. (iii) Identifying significant attributes in the MS dataset through the L2 Norm multifractal denoising technique and then applying the K-Means and FCM algorithms to the MS subgroups and the control group of healthy individuals has yielded excellent performance. According to the clustering results for the MS subgroups, both K-Means and FCM cluster successfully when the L2 Norm multifractal denoising technique is applied to the MS dataset. Clustering with K-Means and FCM has been more successful on the L2_Norm MS dataset, in which significant attributes are obtained by applying the L2 Norm denoising technique.

Keywords: clinical decision support, clustering algorithms, multiple sclerosis, multifractal techniques

Procedia PDF Downloads 132
1104 Supervised/Unsupervised Mahalanobis Algorithm for Improving Performance for Cyberattack Detection over Communications Networks

Authors: Radhika Ranjan Roy

Abstract:

Deployment of machine learning (ML)/deep learning (DL) algorithms for cyberattack detection in operational communications networks (wireless and/or wire-line) is being delayed because of low performance metrics (e.g., recall, precision, and f₁-score). If datasets become imbalanced, which is the usual case for communications networks, the performance tends to become worse. The complexity of reducing the dimensions of the feature sets to increase performance is also a major problem. Mahalanobis algorithms have been widely applied in scientific research because Mahalanobis distance metric learning is a successful framework. In this paper, we have investigated the Mahalanobis binary classifier algorithm for increasing cyberattack detection performance over communications networks as a proof of concept. We have also found that high-dimensional information in intermediate features, which is not utilized as much for classification tasks in ML/DL algorithms, is the main contributor to the state-of-the-art performance of the Mahalanobis method, even for imbalanced and sparse datasets. With no feature reduction, the Mahalanobis distance (MD) method offers uniform results for precision, recall, and f₁-score for unbalanced and sparse NSL-KDD datasets.
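A Mahalanobis-distance binary classifier of the kind mentioned above can be sketched as follows. This is a minimal two-feature illustration under stated assumptions, not the paper's algorithm: the model is fitted on normal traffic only, a record is flagged when its squared Mahalanobis distance exceeds a chi-square quantile, and the function names, the 99% threshold, and the sample values are all hypothetical.

```python
import math

# 99% quantile of the chi-square distribution with 2 degrees of freedom.
# For 2 degrees of freedom the CDF is 1 - exp(-x / 2), so the quantile has a
# closed form (about 9.21).
CHI2_99_2DOF = -2.0 * math.log(1.0 - 0.99)


def fit_gaussian_2d(points):
    """Return the mean and sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return (mean_x, mean_y), (sxx, sxy, syy)


def mahalanobis_squared(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # inverse([[sxx, sxy], [sxy, syy]]) = [[syy, -sxy], [-sxy, sxx]] / det
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def is_attack(point, mean, cov, threshold=CHI2_99_2DOF):
    """Flag a record whose distance from the normal-traffic model is too large."""
    return mahalanobis_squared(point, mean, cov) > threshold


# Made-up "normal traffic" records with two numeric features each.
normal_traffic = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = fit_gaussian_2d(normal_traffic)
print(is_attack((1.0, 1.0), mean, cov), is_attack((10.0, 10.0), mean, cov))
```

Because the distance is scaled by the covariance of the fitted class, it does not depend on how many attack records exist, which is one reason such a rule can remain usable on imbalanced data.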

Keywords: Mahalanobis distance, machine learning, deep learning, NSL-KDD, local intrinsic dimensionality, chi-square, positive semi-definite, area under the curve

Procedia PDF Downloads 47
1103 Facial Expression Phoenix (FePh): An Annotated Sequenced Dataset for Facial and Emotion-Specified Expressions in Sign Language

Authors: Marie Alaghband, Niloofar Yousefi, Ivan Garibay

Abstract:

Facial expressions are important parts of both gesture and sign language recognition systems. Despite the recent advances in both fields, annotated facial expression datasets in the context of sign language are still scarce. In this manuscript, we introduce an annotated sequenced facial expression dataset in the context of sign language, comprising over 3000 facial images extracted from the daily news and weather forecast of the public TV station PHOENIX. Unlike the majority of currently existing facial expression datasets, FePh provides sequenced semi-blurry facial images with different head poses, orientations, and movements. In addition, in the majority of images, the signers are mouthing the words, which makes the data more challenging. To annotate this dataset, we considered primary, secondary, and tertiary dyads of the seven basic emotions "sad", "surprise", "fear", "angry", "neutral", "disgust", and "happy". We also considered the "None" class if the image’s facial expression could not be described by any of the aforementioned emotions. Although we provide FePh as a facial expression dataset of signers in sign language, it has a wider application in gesture recognition and Human Computer Interaction (HCI) systems.

Keywords: annotated facial expression dataset, gesture recognition, sequenced facial expression dataset, sign language recognition

Procedia PDF Downloads 128
1102 Data Augmentation for Automatic Graphical User Interface Generation Based on Generative Adversarial Network

Authors: Xulu Yao, Moi Hoon Yap, Yanlong Zhang

Abstract:

As a branch of artificial neural networks, deep learning is widely used in the field of image recognition, but a lack of data leads to imperfect model learning. An analysis of the data scale requirements of deep learning, aimed at its application to GUI generation, finds that collecting a GUI dataset is a time-consuming and labor-intensive task that struggles to meet the needs of current deep learning networks. To solve this problem, this paper proposes a semi-supervised deep learning model that relies on the original small-scale datasets to produce a large amount of reliable data. By combining a recurrent neural network with a generative adversarial network, the recurrent neural network can learn the sequence relationships and characteristics of the data and guide the generative adversarial network to generate reasonable data, which is then used to expand the Rico dataset. Relying on the network structure, the characteristics of the collected data can be well analysed, and a large amount of reasonable data can be generated according to these characteristics. After data processing, a reliable dataset for model training can be formed, which alleviates the problem of dataset shortage in deep learning.

Keywords: GUI, deep learning, GAN, data augmentation

Procedia PDF Downloads 143
1101 The Classification Performance in Parametric and Nonparametric Discriminant Analysis for a Class-Unbalanced Data of Diabetes Risk Groups

Authors: Lily Ingsrisawang, Tasanee Nacharoen

Abstract:

Introduction: Unbalanced data sets commonly appear in real-world applications. Due to unequal class distribution, many research papers have found that the performance of existing classifiers tends to be biased towards the majority class. The k-nearest neighbors nonparametric discriminant analysis is one method that has been proposed for classifying unbalanced classes with good performance. Hence, the methods of discriminant analysis are of interest to us in investigating misclassification error rates for class-imbalanced data of three diabetes risk groups. Objective: The purpose of this study was to compare the classification performance of parametric and nonparametric discriminant analysis in a three-class classification application of class-imbalanced data of diabetes risk groups. Methods: Data from a health project for 599 staff members of a government hospital in Bangkok were obtained for the classification problem. The staff members were classified into one of three diabetes risk groups: non-risk (90%), risk (5%), and diabetic (5%). The original data, with the variables diabetes risk group, age, gender, cholesterol, and BMI, were analyzed and bootstrapped into 50 and 100 samples of 599 observations each for additional estimation of the misclassification error rate. Each data set was examined for departure from multivariate normality and for equality of the covariance matrices of the three risk groups. Both the original data and the bootstrap samples showed non-normality and unequal covariance matrices. The parametric linear discriminant function, the quadratic discriminant function, and the nonparametric k-nearest neighbors discriminant function were applied to the 50 and 100 bootstrap samples and to the original data. In finding the optimal classification rule, the prior probabilities were set to both equal proportions (0.33:0.33:0.33) and unequal proportions, with three choices of (0.90:0.05:0.05), (0.80:0.10:0.10), or (0.70:0.15:0.15). Results: The results from 50 and 100 bootstrap samples indicated that the k-nearest neighbors approach with k = 3 or k = 4 and prior probabilities of {non-risk:risk:diabetic} set to {0.90:0.05:0.05} or {0.80:0.10:0.10} gave the smallest misclassification error rate. Conclusion: The k-nearest neighbors approach is suggested for classifying three-class imbalanced data of diabetes risk groups.
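The role of the prior probabilities in a k-nearest neighbors discriminant rule can be sketched as follows. This is a hedged illustration rather than the study's procedure: it uses one common formulation in which class c is scored by prior_c × k_c / n_c (k_c neighbors of class c among the k nearest, n_c training observations in class c), the bootstrap error is estimated on out-of-bag observations (an assumption, since the abstract does not state the evaluation scheme), and the one-dimensional data and function names are made up.

```python
import math
import random
from collections import Counter


def knn_discriminant(train, query, priors, k=3):
    """Pick the class with the largest prior_c * k_c / n_c among k neighbors."""
    class_sizes = Counter(label for _, label in train)
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    hits = Counter(label for _, label in nearest)
    scores = {c: priors[c] * hits[c] / class_sizes[c] for c in class_sizes}
    return max(sorted(scores), key=scores.get)


def bootstrap_error(data, priors, k=3, samples=50, seed=0):
    """Average out-of-bag misclassification rate over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(data)
    rates = []
    for _ in range(samples):
        picked = [rng.randrange(n) for _ in range(n)]  # draw n with replacement
        chosen = set(picked)
        out_of_bag = [data[i] for i in range(n) if i not in chosen]
        if not out_of_bag:
            continue
        train = [data[i] for i in picked]
        errors = sum(knn_discriminant(train, x, priors, k) != label
                     for x, label in out_of_bag)
        rates.append(errors / len(out_of_bag))
    return sum(rates) / len(rates) if rates else 0.0


# Made-up 1-D data with a 90/5/5 class split.
data = ([([i * 0.1], "non-risk") for i in range(90)]
        + [([x], "risk") for x in (9.05, 20.0, 21.0, 22.0, 23.0)]
        + [([x], "diabetic") for x in (40.0, 41.0, 42.0, 43.0, 44.0)])
equal = {"non-risk": 1 / 3, "risk": 1 / 3, "diabetic": 1 / 3}
proportional = {"non-risk": 0.90, "risk": 0.05, "diabetic": 0.05}

# The query's 3 nearest neighbors are one "risk" and two "non-risk" points.
print(knn_discriminant(data, [9.0], equal))
print(knn_discriminant(data, [9.0], proportional))
print(bootstrap_error(data, proportional))
```

With equal priors, the single neighbor from the small class outweighs two neighbors from the large class because each count is divided by its class size; proportional priors cancel that scaling, so the choice of priors directly changes which class wins.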

Keywords: error rate, bootstrap, diabetes risk groups, k-nearest neighbors

Procedia PDF Downloads 406
1100 Pose Normalization Network for Object Classification

Authors: Bingquan Shen

Abstract:

Convolutional Neural Networks (CNN) have demonstrated their effectiveness in synthesizing 3D views of object instances at various viewpoints. Given the problem where one has limited viewpoints of a particular object for classification, we present a pose normalization architecture that transforms the object to existing viewpoints in the training dataset before classification to yield better classification performance. We have demonstrated that this Pose Normalization Network (PNN) can capture the style of the target object and is able to re-render it to a desired viewpoint. Moreover, we have shown that the PNN improves the classification result for the 3D chairs dataset and the ShapeNet airplanes dataset when given only images at limited viewpoints, as compared to a CNN baseline.

Keywords: convolutional neural networks, object classification, pose normalization, viewpoint invariant

Procedia PDF Downloads 303