Search results for: microarray datasets
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 286

196 Using Satellite Images Datasets for Road Intersection Detection in Route Planning

Authors: Fatma El-zahraa El-taher, Ayman Taha, Jane Courtney, Susan Mckeever

Abstract:

Understanding road networks plays an important role in navigation applications such as self-driving vehicles and route planning for individual journeys. Intersections of roads are essential components of road networks. Understanding the features of an intersection, from a simple T-junction to larger multi-road junctions, is critical to decisions such as crossing roads or selecting the safest routes. The identification and profiling of intersections from satellite images is a challenging task. While deep learning approaches offer state-of-the-art performance in image classification and detection, the availability of training datasets is a bottleneck in this approach. In this paper, a labelled satellite image dataset for the intersection recognition problem is presented. It consists of 14,692 satellite images of Washington DC, USA. To support other users of the dataset, an automated download and labelling script is provided for dataset replication. The challenges of construction and fine-grained feature labelling of a satellite image dataset are examined, including the issue of how to address features that are spread across multiple images. Finally, the accuracy of detection of intersections in satellite images is evaluated.

Keywords: Satellite images, remote sensing images, data acquisition, autonomous vehicles, robot navigation, route planning, road intersections.

Downloads: 757
195 Gene Selection Guided by Feature Interdependence

Authors: Hung-Ming Lai, Andreas Albrecht, Kathleen Steinhöfel

Abstract:

Cancers can normally be marked by a number of differentially expressed genes, which show enormous potential as biomarkers for a certain disease. In recent years, cancer classification based on the investigation of gene expression profiles derived from high-throughput microarrays has been widely used. The selection of discriminative genes is, therefore, an essential preprocessing step in carcinogenesis studies. In this paper, we propose a novel gene selector using information-theoretic measures for biological discovery. This multivariate filter is a four-stage framework based on the analyses of feature relevance, feature interdependence, feature redundancy-dependence and subset rankings, and it has been examined on the colon cancer data set. Our experimental results show that the proposed method outperformed other information-theory-based filters in all aspects of classification errors and classification performance.
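
The feature-relevance stage of an information-theoretic filter can be illustrated with mutual information between a discretized gene and the class label. The sketch below is not the authors' four-stage selector; the gene names, the low/high discretization and the toy labels are assumptions made only for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        # p(x,y) / (p(x) * p(y)) written with raw counts
        mi += p_joint * log2(count * n / (px[x] * py[y]))
    return mi


def rank_genes(expression, labels):
    """Return gene names sorted by relevance to the label, most relevant first."""
    scores = {gene: mutual_information(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: three discretized genes, six samples, two classes.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "gene_a": ["low", "low", "low", "high", "high", "high"],  # tracks the label exactly
    "gene_b": ["low", "high", "low", "high", "low", "high"],  # weakly related
    "gene_c": ["low", "low", "low", "low", "low", "low"],     # constant, uninformative
}
ranking = rank_genes(expression, labels)
print(ranking)
```

A univariate score like this captures relevance only; interdependence and redundancy between genes, which the abstract emphasizes, would need joint or conditional measures on top of it.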

Keywords: Colon cancer, feature interdependence, feature subset selection, gene selection, microarray data analysis.

Downloads: 2144
194 Facial Expression Phoenix (FePh): An Annotated Sequenced Dataset for Facial and Emotion-Specified Expressions in Sign Language

Authors: Marie Alaghband, Niloofar Yousefi, Ivan Garibay

Abstract:

Facial expressions are important parts of both gesture and sign language recognition systems. Despite the recent advances in both fields, annotated facial expression datasets in the context of sign language are still scarce resources. In this manuscript, we introduce an annotated sequenced facial expression dataset in the context of sign language, comprising over 3000 facial images extracted from the daily news and weather forecast of the public TV station PHOENIX. Unlike the majority of currently existing facial expression datasets, FePh provides sequenced semi-blurry facial images with different head poses, orientations, and movements. In addition, in the majority of images, identities are mouthing the words, which makes the data more challenging. To annotate this dataset we consider primary, secondary, and tertiary dyads of seven basic emotions of "sad", "surprise", "fear", "angry", "neutral", "disgust", and "happy". We also considered the "None" class if the image’s facial expression could not be described by any of the aforementioned emotions. Although we provide FePh as a facial expression dataset of signers in sign language, it has a wider application in gesture recognition and Human-Computer Interaction (HCI) systems.

Keywords: Annotated Facial Expression Dataset, Sign Language Recognition, Gesture Recognition, Sequenced Facial Expression Dataset.

Downloads: 720
193 Intelligent Recognition of Diabetes Disease via FCM Based Attribute Weighting

Authors: Kemal Polat

Abstract:

In this paper, an attribute weighting method called fuzzy C-means clustering based attribute weighting (FCMAW) has been used for classification of the diabetes disease dataset. The aims of this study are to reduce the variance within the attributes of the diabetes dataset and to improve the classification accuracy of the classifier algorithm by transforming non-linearly separable datasets into linearly separable ones. The Pima Indians Diabetes dataset has two classes: normal subjects (500 instances) and diabetes subjects (268 instances). Fuzzy C-means clustering is an improved version of the K-means clustering method and is one of the most used clustering methods in data mining and machine learning applications. In this study, as the first stage, the fuzzy C-means clustering process has been used to find the centers of the attributes in the Pima Indians diabetes dataset, and the dataset has then been weighted according to the ratios of the means of the attributes to their centers. Secondly, after the weighting process, classifier algorithms including support vector machine (SVM) and k-nearest neighbor (k-NN) classifiers have been used to classify the weighted Pima Indians diabetes dataset. Experimental results show that the proposed attribute weighting method (FCMAW) has obtained very promising results in the classification of the Pima Indians diabetes dataset.
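
The two steps described above (find attribute centers with fuzzy C-means, then weight by a mean-to-center ratio) can be sketched as follows. The fuzzy C-means update is the standard one; the exact weighting formula is not given in the abstract, so the ratio used in `weight_attribute` (attribute mean divided by the mean of the cluster centers) is an assumption, and the sample values are made up rather than taken from the Pima data.

```python
from statistics import mean


def fuzzy_c_means_1d(values, c=2, m=2.0, iters=100):
    """Standard fuzzy C-means on a single attribute; returns sorted cluster centers."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the initial centers evenly over the value range.
    centers = [lo + (hi - lo) * (k + 0.5) / c for k in range(c)]
    exponent = 2.0 / (m - 1.0)
    for _ in range(iters):
        # Membership of every value in every cluster (each row sums to 1).
        memberships = []
        for x in values:
            dists = [abs(x - ck) or 1e-12 for ck in centers]  # guard against zero distance
            memberships.append(
                [1.0 / sum((dk / dj) ** exponent for dj in dists) for dk in dists]
            )
        # Each center becomes the membership-weighted mean of the values.
        centers = [
            sum(u[k] ** m * x for u, x in zip(memberships, values))
            / sum(u[k] ** m for u in memberships)
            for k in range(c)
        ]
    return sorted(centers)


def weight_attribute(values, c=2):
    """ASSUMPTION: scale the attribute by mean(values) / mean(cluster centers)."""
    ratio = mean(values) / mean(fuzzy_c_means_1d(values, c))
    return [x * ratio for x in values]


attribute = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]  # made-up values with two obvious groups
centers = fuzzy_c_means_1d(attribute)
weighted = weight_attribute(attribute)
print(centers, weighted)
```

In a full pipeline each attribute would be weighted this way before the weighted table is passed to a classifier such as SVM or k-NN.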

Keywords: Fuzzy C-means clustering, Fuzzy C-means clustering based attribute weighting, Pima Indians diabetes dataset, SVM.

Downloads: 1763
192 Towards Real-Time Classification of Finger Movement Direction Using Encephalography Independent Components

Authors: Mohamed Mounir Tellache, Hiroyuki Kambara, Yasuharu Koike, Makoto Miyakoshi, Natsue Yoshimura

Abstract:

This study explores the practicality of using electroencephalographic (EEG) independent components to predict eight-direction finger movements in pseudo-real-time. Six healthy participants with individual-head MRI images performed finger movements in eight directions with two different arm configurations. The analysis was performed in two stages. The first stage consisted of using independent component analysis (ICA) to separate the signals representing brain activity from non-brain activity signals and to obtain the unmixing matrix. The resulting independent components (ICs) were checked, and those reflecting brain activity were selected. Finally, the time series of the selected ICs were used to predict eight finger-movement directions using Sparse Logistic Regression (SLR). The second stage consisted of using the previously obtained unmixing matrix, the selected ICs, and the model obtained by applying SLR to classify a different EEG dataset. This method was applied to two different settings, namely the single-participant level and the group level. For the single-participant level, the EEG dataset used in the first stage and the EEG dataset used in the second stage originated from the same participant. For the group level, the EEG datasets used in the first stage were constructed by temporally concatenating each combination without repetition of the EEG datasets of five participants out of six, whereas the EEG dataset used in the second stage originated from the remaining participant. The average test classification results across datasets (mean ± S.D.) were 38.62 ± 8.36% for the single-participant level, which was significantly higher than the chance level (12.50 ± 0.01%), and 27.26 ± 4.39% for the group level, which was also significantly higher than the chance level (12.49 ± 0.01%). The classification accuracy within [–45°, 45°] of the true direction was 70.03 ± 8.14% for the single-participant level and 62.63 ± 6.07% for the group level, which may be promising for some real-life applications. Clustering and contribution analyses further revealed the brain regions involved in finger movement and the temporal aspect of their contribution to the classification. These results showed the possibility of using the ICA-based method in combination with other methods to build a real-time system to control prostheses.

Keywords: Brain-computer interface, BCI, electroencephalography, EEG, finger motion decoding, independent component analysis, pseudo-real-time motion decoding.

Downloads: 599
191 Classification of Potential Biomarkers in Breast Cancer Using Artificial Intelligence Algorithms and Anthropometric Datasets

Authors: Aref Aasi, Sahar Ebrahimi Bajgani, Erfan Aasi

Abstract:

Breast cancer (BC) continues to be the most frequent cancer in females and causes the highest number of cancer-related deaths in women worldwide. Inspired by recent advances in studying the relationship between different patient attributes and features and the disease, in this paper, we investigate different classification methods for better diagnosis of BC in the early stages. In this regard, datasets from the University Hospital Centre of Coimbra were chosen, and different machine learning (ML)-based and neural network (NN) classifiers have been studied. For this purpose, we have selected favorable features among the nine provided attributes from the clinical dataset by using a random forest algorithm. This dataset consists of both healthy controls and BC patients, and it was noted that glucose, BMI, resistin, and age are the most important features, in that order. Moreover, we have analyzed these features with various ML-based classifier methods, including Decision Tree (DT), K-Nearest Neighbors (KNN), eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machine (SVM), along with the NN-based Multi-Layer Perceptron (MLP) classifier. The results revealed that among the different techniques, the SVM and MLP classifiers achieved the highest accuracy, 96% and 92%, respectively. These results indicate that the adopted procedure could be used effectively for the classification of cancer cells, and they encourage further experimental investigations with more collected data for other types of cancers.

Keywords: Breast cancer, health diagnosis, Machine Learning, biomarker classification, Neural Network.

Downloads: 321
190 Performance Assessment of Multi-Level Ensemble for Multi-Class Problems

Authors: Rodolfo Lorbieski, Silvia Modesto Nassar

Abstract:

Many supervised machine learning tasks require decision making across numerous different classes. Multi-class classification has several applications, such as face recognition, text recognition and medical diagnostics. The objective of this article is to analyze an adapted method of Stacking in multi-class problems, which combines ensembles within the ensemble itself. For this purpose, a training similar to Stacking was used, but with three levels, where the final decision-maker (level 2) performs its training by combining outputs from the tree-based pair of meta-classifiers (level 1) from Bayesian families. These are in turn trained by pairs of base classifiers (level 0) of the same family. This strategy seeks to promote diversity among the ensembles forming the level 2 meta-classifier. Three performance measures were used: (1) accuracy, (2) area under the ROC curve, and (3) time, for three factors: (a) datasets, (b) experiments and (c) levels. To compare the factors, a three-way ANOVA test was performed for each performance measure, considering 5 datasets by 25 experiments by 3 levels. A triple interaction between factors was observed only in time. The accuracy and area under the ROC curve presented similar results, showing a double interaction between level and experiment, as well as for the dataset factor. It was concluded that level 2 had an average performance above the other levels and that the proposed method is especially efficient for multi-class problems when compared to binary problems.

Keywords: Stacking, multi-layers, ensemble, multi-class.

Downloads: 1093
189 Experimental Evaluation of Drilling Damage on the Strength of Cores Extracted from RC Buildings

Authors: A. Masi, A. Digrisolo, G. Santarsiero

Abstract:

Concrete strength evaluated from compression tests on cores is affected by several factors causing differences from the in-situ strength at the location from which the core specimen was extracted. Among these factors is the damage that may occur during the drilling phase, which generally leads to underestimation of the actual in-situ strength. In order to quantify this effect, in this study two wide datasets have been examined, including: (i) about 500 core specimens extracted from existing reinforced concrete structures, and (ii) about 600 cube specimens taken during the construction of new structures in the framework of routine acceptance control. The two experimental datasets have been compared in terms of compression strength and specific weight values, accounting for the main factors affecting a concrete property, that is, type and amount of cement, aggregates' grading, type and maximum size of aggregates, water/cement ratio, placing and curing modality, and concrete age. The results show that the magnitude of the strength reduction due to drilling damage is strongly affected by the actual properties of concrete, being inversely proportional to its strength. Therefore, the application of a single value of the correction coefficient, as generally suggested in the technical literature and in structural codes, appears inappropriate. A set of values of the drilling damage coefficient is suggested as a function of the strength obtained from compressive tests on cores.

Keywords: RC Buildings, Assessment, In-situ concrete strength, Core testing, Drilling damage.

Downloads: 2059
188 Carbon Sources Utilization Profiles of Thermophilic Phytase Producing Bacteria Isolated from Hot-spring in Malaysia

Authors: Noor Muzamil Mohamad, Abdul Manaf Ali, Hamzah Mohd Salleh

Abstract:

Phytases (myo-inositol hexakisphosphate phosphohydrolases; EC 3.1.3.8) catalyze the hydrolysis of phytic acid (myo-inositol hexakisphosphate) to the mono-, di-, tri-, tetra-, and pentaphosphates of myo-inositol and inorganic phosphate. Thermophilic bacteria were isolated from water sampled from a hot spring. About 120 bacterial isolates were successfully obtained from the hot spring water sample and tested for extracellular phytase production. After 5 passages of screening on PSM media, 4 isolates were found to be stable in producing the phytase enzyme. 16S rDNA sequencing for molecular identification of the bacteria revealed that all phytase-positive isolates belong to Geobacillus spp. and Anoxybacillus spp. Anoxybacillus rupiensis UniSZA-7 was characterized for its carbon source utilization using the Biolog Phenotype Microarray plate and was found to utilize several of the carbon sources provided.

Keywords: Phytase, Phytic Acid, Thermophilic Bacteria, PSM Media and Phytase Assay

Downloads: 2165
187 Cutaneous Application of Royal Jelly Inhibits Skin Lesions in NC/Nga Mice, a Human-Like Mouse Model of Atopic Dermatitis

Authors: Junki Miyamoto, Mariko Kiyomi, Yuuki Nagashio, Takuya Suzuki, Soichi Tanabe

Abstract:

Anti-allergic effects of royal jelly were evaluated in a human-like mouse model of atopic dermatitis. Royal jelly was applied cutaneously to NC/Nga mice for 6 weeks. Royal jelly-treated mice exhibited lower levels of serum total immunoglobulin E in comparison with controls. We found that the treatment decreased (11% to the control) the expression of mRNA for aquaporin-3, which is involved in the modulation of epidermal hydration. Microarray analysis revealed more than 10-fold changes in the expression of several genes, such as transglutaminase 2, repetin, and keratins. In normal human epidermal keratinocytes, royal jelly extract suppressed interleukin-8 elevation induced by TNF-α and interferon-γ, suggesting direct anti-inflammatory activity in keratinocytes. Collectively, topical application of royal jelly may be useful for amelioration of lesions and inflammation in atopic dermatitis.

Keywords: Aquaporin 3, immunoglobulin E, NC/Nga, royal jelly.

Downloads: 1877
186 Software Product Quality Evaluation Model with Multiple Criteria Decision Making Analysis

Authors: C. Ardil

Abstract:

This paper presents a software product quality evaluation model based on the ISO/IEC 25010 quality model. The evaluation characteristics and sub-characteristics were identified from the ISO/IEC 25010 quality model. The multidimensional structure of the quality model is based on characteristics such as functional suitability, performance efficiency, compatibility, usability, reliability, security, maintainability, and portability, and their associated sub-characteristics. Random numbers are generated to establish the decision maker’s importance weights for each sub-characteristic. Random numbers are also generated to establish the decision matrix of the decision maker’s final scores for each software product against each sub-characteristic. Thus, objective criteria importance weights and index scores for the datasets were obtained from the random numbers. In the proposed model, five different software product quality evaluation datasets under three different weight vectors were applied to a multiple criteria decision analysis method, preference analysis for reference ideal solution (PARIS), for comparison and for a sensitivity analysis procedure. This study contributes to a better understanding of the application of MCDMA methods and ISO/IEC 25010 quality model guidelines in the software product quality evaluation process.

Keywords: ISO/IEC 25010 quality model, multiple criteria decisions making, multiple criteria decision making analysis, MCDMA, PARIS, Software Product Quality Evaluation Model, Software Product Quality Evaluation, Software Evaluation, Software Selection, Software

Downloads: 448
185 Incorporating Lexical-Semantic Knowledge into Convolutional Neural Network Framework for Pediatric Disease Diagnosis

Authors: Xiaocong Liu, Huazhen Wang, Ting He, Xiaozheng Li, Weihan Zhang, Jian Chen

Abstract:

The utilization of electronic medical record (EMR) data to establish disease diagnosis models has become an important research topic in biomedical informatics. Deep learning can automatically extract features from massive data, which has brought about breakthroughs in the study of EMR data. The challenge is that deep learning lacks semantic knowledge, which limits its practical use in medical science. This research proposes a method of incorporating lexical-semantic knowledge from abundant entities into a convolutional neural network (CNN) framework for pediatric disease diagnosis. Firstly, medical terms are vectorized into Lexical Semantic Vectors (LSV), which are concatenated with the embedded word vectors of word2vec to enrich the feature representation. Secondly, the semantic distribution of medical terms serves as a Semantic Decision Guide (SDG) for the optimization of deep learning models. The study evaluates the performance of the LSV-SDG-CNN model on four kinds of Chinese EMR datasets. Additionally, CNN, LSV-CNN, and SDG-CNN are designed as baseline models for comparison. The experimental results show that the LSV-SDG-CNN model outperforms the baseline models on the four kinds of Chinese EMR datasets. The best configuration of the model yielded an F1 score of 86.20%. The results clearly demonstrate that the CNN has been effectively guided and optimized by lexical-semantic knowledge, and that the LSV-SDG-CNN model improves the disease classification accuracy by a clear margin.

Keywords: lexical semantics, feature representation, semantic decision, convolutional neural network, electronic medical record

Downloads: 594
184 An Improved K-Means Algorithm for Gene Expression Data Clustering

Authors: Billel Kenidra, Mohamed Benmohammed

Abstract:

Data mining techniques for clustering are a subject of active research and assist in biological pattern recognition and the extraction of new knowledge from raw data. Clustering means the act of partitioning an unlabeled dataset into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Several clustering methods are based on partitional clustering. This category attempts to directly decompose the dataset into a set of disjoint clusters, leading to an integer number of clusters that optimizes a given criterion function. The criterion function may emphasize a local or a global structure of the data, and its optimization is an iterative relocation procedure. The K-Means algorithm is one of the most widely used partitional clustering techniques. Since K-Means is extremely sensitive to the initial choice of centers, and a poor choice of centers may lead to a local optimum that is quite inferior to the global optimum, we propose a strategy to initialize the K-Means centers. The improved K-Means algorithm is compared with the original K-Means, and the results show that the efficiency has been significantly improved.
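
The abstract does not say which initialization strategy is proposed, so the sketch below swaps in a well-known deterministic alternative, farthest-first (maximin) seeding, followed by the usual K-Means relocation loop. The function names and the toy 2-D points are assumptions for illustration only.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centers(points, k):
    """Seed with the first point, then repeatedly add the point farthest from all seeds."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def k_means(points, centers, iters=50):
    """Iterative relocation: assign points to the nearest center, then recompute means."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)  # an empty cluster keeps its old center
        ]
        if new_centers == centers:  # assignments are stable, so stop early
            break
        centers = new_centers
    return centers


# Hypothetical toy data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (0, 10), (0, 11), (1, 10)]
final_centers = sorted(k_means(points, farthest_first_centers(points, 3)))
print(final_centers)
```

Because the seeds are spread across the data, each well-separated group receives one seed here, which avoids the poor local optimum that two seeds in the same group could cause.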

Keywords: Microarray data mining, biological pattern recognition, partitional clustering, k-means algorithm, centroid initialization.

Downloads: 1284
183 Mining Genes Relations in Microarray Data Combined with Ontology in Colon Cancer Automated Diagnosis System

Authors: A. Gruzdz, A. Ihnatowicz, J. Siddiqi, B. Akhgar

Abstract:

The MATCH project [1] entails the development of an automatic diagnosis system that aims to support treatment of colon cancer diseases by discovering mutations that occur in tumour suppressor genes (TSGs) and contribute to the development of cancerous tumours. The constitution of the system is based on a) colon cancer clinical data and b) biological information that will be derived by data mining techniques from genomic and proteomic sources. The core mining module will consist of popular, well-tested hybrid feature extraction methods and new combined algorithms designed especially for the project. Elements of rough sets, evolutionary computing, cluster analysis, self-organizing maps and association rules will be used to discover the annotations between genes and their influence on tumours [2]-[11]. The methods used to process the data have to address their high complexity, potential inconsistency and the problem of dealing with missing values. They must integrate all the useful information necessary to answer the expert's question. For this purpose, the system has to learn from data, or allow a domain specialist to specify interactively, the part of the knowledge structure it needs to answer a given query. The program should also take into account the importance/rank of the particular parts of the data it analyses, and adjust the algorithms used accordingly.

Keywords: Bioinformatics, gene expression, ontology, self-organizing maps.

Downloads: 1974
182 Incorporating Semantic Similarity Measure in Genetic Algorithm: An Approach for Searching the Gene Ontology Terms

Authors: Razib M. Othman, Safaai Deris, Rosli M. Illias, Hany T. Alashwal, Rohayanti Hassan, Farhan Mohamed

Abstract:

The most important property of the Gene Ontology is the terms. These controlled vocabularies are defined to provide consistent descriptions of gene products that are shareable and computationally accessible by humans, software agents, or other machine-readable meta-data. Each term is associated with information such as definition, synonyms, database references, amino acid sequences, and relationships to other terms. This information has made the Gene Ontology broadly applied in microarray and proteomic analysis. However, the process of searching the terms is still carried out using the traditional approach, which is based on keyword matching. The weaknesses of this approach are that it ignores semantic relationships between terms and depends heavily on a specialist to find similar terms. Therefore, this study combines a semantic similarity measure and a genetic algorithm to perform a better retrieval process for searching semantically similar terms. The semantic similarity measure is used to compute the similitude strength between two terms. Then, the genetic algorithm is employed to perform batch retrievals and to handle the large search space of the Gene Ontology graph. The computational results are presented to show the effectiveness of the proposed algorithm.

Keywords: Gene Ontology, Semantic similarity measure, Genetic algorithm, Ontology search

Downloads: 1490
181 A Optimal Subclass Detection Method for Credit Scoring

Authors: Luciano Nieddu, Giuseppe Manfredi, Salvatore D'Acunto, Katia La Regina

Abstract:

In this paper, a non-parametric statistical pattern recognition algorithm for the problem of credit scoring is presented. The proposed algorithm is based on a k-means clustering algorithm and allows for the determination of subclasses of homogeneous elements in the data. The algorithm is tested on two benchmark datasets and its performance is compared with that of other well-known pattern recognition algorithms for credit scoring.

Keywords: Constrained clustering, Credit scoring, Statistical pattern recognition, Supervised classification.

Downloads: 2049
180 Application of KL Divergence for Estimation of Each Metabolic Pathway Genes

Authors: Shohei Maruyama, Yasuo Matsuyama, Sachiyo Aburatani

Abstract:

Development of a method to estimate gene functions is an important task in bioinformatics. One of the approaches for the annotation is the identification of the metabolic pathway that genes are involved in. Since gene expression data reflect various intracellular phenomena, those data are considered to be related to genes’ functions. However, it has been difficult to estimate gene function with high accuracy. The low accuracy of the estimation is thought to be caused by the difficulty of accurately measuring gene expression: even when measured under the same condition, gene expression values usually vary. In this study, we proposed a feature extraction method focusing on the variability of gene expression to estimate the genes' metabolic pathway accurately. First, we estimated the distribution of each gene's expression from replicate data. Next, we calculated the similarity between all gene pairs by KL divergence, which is a method for calculating the similarity between distributions. Finally, we utilized the similarity vectors as feature vectors and trained a multiclass SVM for identifying the genes' metabolic pathway. To evaluate our method, we applied it to budding yeast and trained the multiclass SVM to identify seven metabolic pathways. As a result, the accuracy obtained with our method was higher than that obtained from the raw gene expression data. Thus, our method combined with KL divergence is useful for identifying the genes' metabolic pathway.
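
The first two steps above (estimate a distribution per gene from replicates, then compare gene pairs by KL divergence) can be sketched as follows. The abstract does not state the distribution family, so a univariate Gaussian per gene is an assumption here, as are the gene names and replicate values. For two Gaussians the divergence has the closed form KL(p||q) = 0.5 * (ln(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1).

```python
from math import log
from statistics import mean, variance


def fit_gaussian(replicates):
    """Estimate (mean, sample variance) of one gene from its replicate measurements."""
    return mean(replicates), variance(replicates)


def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for two univariate Gaussians."""
    return 0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def similarity_vectors(replicates_by_gene):
    """For each gene, the vector of KL divergences to every gene (itself included)."""
    params = {gene: fit_gaussian(vals) for gene, vals in replicates_by_gene.items()}
    genes = list(params)
    return {g: [kl_gaussian(*params[g], *params[h]) for h in genes] for g in genes}


# Hypothetical replicate measurements for three genes.
replicates_by_gene = {
    "gene_x": [1.0, 1.2, 0.8],
    "gene_y": [1.1, 0.9, 1.0],
    "gene_z": [3.0, 3.4, 2.6],
}
features = similarity_vectors(replicates_by_gene)
print(features["gene_x"])
```

Each row of `features` could then serve as the feature vector of one gene for a multiclass classifier. KL divergence is not symmetric, so KL(x||y) and KL(y||x) generally differ.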

Keywords: Metabolic pathways, gene expression data, microarray, Kullback–Leibler divergence, KL divergence, support vector machines, SVM, machine learning.

Downloads: 2336
179 Combining Bagging and Additive Regression

Authors: Sotiris B. Kotsiantis

Abstract:

Bagging and boosting are among the most popular re-sampling ensemble methods that generate and combine a diversity of regression models using the same learning algorithm as base-learner. Boosting algorithms are considered stronger than bagging on noise-free data. However, there are strong empirical indications that bagging is much more robust than boosting in noisy settings. For this reason, in this work we built an ensemble using an averaging methodology of bagging and boosting ensembles with 10 sub-learners in each one. We performed a comparison with simple bagging and boosting ensembles with 25 sub-learners on standard benchmark datasets and the proposed ensemble gave better accuracy.

Keywords: Regressors, statistical learning.

Downloads: 1640
178 A Dataset of Program Educational Objectives Mapped to ABET Outcomes: Data Cleansing, Exploratory Data Analysis and Modeling

Authors: Addin Osman, Anwar Ali Yahya, Mohammed Basit Kamal

Abstract:

Datasets or collections are becoming important assets in themselves and can now be accepted as a primary intellectual output of research. The quality and usage of datasets depend mainly on the context in which they have been collected, processed, analyzed, validated, and interpreted. This paper presents a collection of program educational objectives mapped to student outcomes, collected from self-study reports prepared by 32 engineering programs accredited by ABET. The manual mapping (classification) of this data is a notoriously tedious, time-consuming process. In addition, it requires experts in the area, who are mostly not available. The operational settings under which the collection was produced are described. The collection has been cleansed and preprocessed, some features have been selected, and a preliminary exploratory data analysis has been performed to illustrate the properties and usefulness of the collection. Finally, the collection has been benchmarked using nine of the most widely used supervised multiclass classification techniques (Binary Relevance, Label Powerset, Classifier Chains, Pruned Sets, Random k-label sets, Ensemble of Classifier Chains, Ensemble of Pruned Sets, Multi-Label k-Nearest Neighbors and Back-Propagation Multi-Label Learning). The techniques have been compared to each other using five well-known measurements (Accuracy, Hamming Loss, Micro-F, Macro-F, and Macro-F). The Ensemble of Classifier Chains and Ensemble of Pruned Sets have achieved encouraging performance compared to the other multi-label classification methods tested. The Classifier Chains method has shown the worst performance. To recap, the benchmark has achieved promising results by utilizing the preliminary exploratory data analysis performed on the collection, proposing new trends for research and providing a baseline for future studies.
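
Two of the measurements named above can be made concrete with a short sketch. Hamming loss is the fraction of individual label slots predicted wrongly; for accuracy, the exact-match (subset) definition is used here as one common choice, which may differ from the definition used in the paper. The label matrices are made up for illustration.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots where the prediction disagrees with the truth."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical example: 2 samples, 3 binary labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 0]]  # one wrong label in the first sample
loss = hamming_loss(y_true, y_pred)
acc = subset_accuracy(y_true, y_pred)
print(loss, acc)
```

The example shows why the two measures can disagree: a single wrong label costs only one slot in Hamming loss but makes the whole sample count as wrong under exact-match accuracy.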

Keywords: Benchmark collection, program educational objectives, student outcomes, ABET, Accreditation, machine learning, supervised multiclass classification, text mining.

Downloads 837
177 Combining Bagging and Boosting

Authors: S. B. Kotsiantis, P. E. Pintelas

Abstract:

Bagging and boosting are among the most popular resampling ensemble methods that generate and combine a diversity of classifiers using the same learning algorithm for the base classifiers. Boosting algorithms are considered stronger than bagging on noise-free data. However, there are strong empirical indications that bagging is much more robust than boosting in noisy settings. For this reason, in this work we built an ensemble using a voting methodology of bagging and boosting ensembles with 10 sub-classifiers in each one. We compared it with simple bagging and boosting ensembles with 25 sub-classifiers, as well as with other well-known combining methods, on standard benchmark datasets, and the proposed technique was the most accurate.
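The idea of voting between a bagging ensemble and a boosting ensemble of 10 sub-classifiers each can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the abstract: the base learner is a decision stump, boosting is the standard AdaBoost update, the two ensembles are combined by summing their normalized scores, and the toy data is made up. All function names are hypothetical.

```python
import math
import random


def train_stump(X, y, w):
    """Find the one-feature threshold rule with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[f] >= t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, (f, t, sign))
    return best[1], best[0]


def stump_predict(stump, row):
    f, t, sign = stump
    return sign if row[f] >= t else -sign


def bagging(X, y, n_estimators, rng):
    """Train each stump on a bootstrap resample; all stumps get equal weight."""
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        stump, _ = train_stump(Xb, yb, [1.0 / n] * n)
        models.append((1.0, stump))
    return models


def adaboost(X, y, n_estimators):
    """Standard AdaBoost: reweight examples so later stumps focus on mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    models = []
    for _ in range(n_estimators):
        stump, err = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        models.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return models


def ensemble_score(models, row):
    """Weighted vote of one ensemble, normalized to the range [-1, 1]."""
    total = sum(a for a, _ in models)
    return sum(a * stump_predict(s, row) for a, s in models) / total


def vote(bag, boost, row):
    """Assumed combining rule: add the two normalized ensemble scores."""
    return 1 if ensemble_score(bag, row) + ensemble_score(boost, row) >= 0 else -1


# Made-up, linearly separable toy data with labels in {-1, +1}.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [-1, -1, -1, 1, 1, 1]

bag = bagging(X, y, 10, random.Random(0))
boost = adaboost(X, y, 10)
predictions = [vote(bag, boost, row) for row in X]
print(predictions)
```

Normalizing each ensemble's score before summing keeps the boosting weights from swamping the equally weighted bagged stumps; other combining rules (for example majority voting over all 20 sub-classifiers) would fit the same structure.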

Keywords: data mining, machine learning, pattern recognition.

Downloads 2562
176 Multi-Label Hierarchical Classification for Protein Function Prediction

Authors: Helyane B. Borges, Julio Cesar Nievola

Abstract:

Hierarchical classification is a problem with applications in many areas, such as protein function prediction, where the data are hierarchically structured. It is therefore necessary to develop algorithms able to induce hierarchical classification models. This paper presents experiments using a hierarchical classification algorithm called Multi-label Hierarchical Classification using a Competitive Neural Network (MHC-CNN). It was tested on ten datasets from the Gene Ontology (GO) Cellular Component domain. The results are compared with those of Clus-HMC and Clus-HSC using the hF-Measure.

Keywords: Hierarchical Classification, Competitive Neural Network, Global Classifier.

Downloads 2380
175 Bayesian Geostatistical Modelling of COVID-19 Datasets

Authors: I. Oloyede

Abstract:

The COVID-19 dataset was obtained by extracting weather, longitude, latitude, ISO 3166 country code, and cases and deaths of coronavirus patients across the globe. The data were extracted for a period of eight days, choosing a uniform time within the specified period. Cases and deaths were then mapped with reference to continents. Bayesian geostatistical modelling was carried out on the dataset. The study found that countries in the tropical region suffered fewer deaths and cases than countries in the temperate region, which the author attributes to the high temperature in the tropical region.

Keywords: COVID-19, Bayesian, geostatistical modelling, prior, posterior.

Downloads 471
174 An Ensemble of Weighted Support Vector Machines for Ordinal Regression

Authors: Willem Waegeman, Luc Boullart

Abstract:

Instead of traditional (nominal) classification, we investigate ordinal classification, or ranking. An enhanced method based on an ensemble of Support Vector Machines (SVMs) is proposed. Each binary classifier is trained with specific weights for each object in the training data set. Experiments on benchmark datasets and synthetic data indicate that the performance of our approach is comparable to state-of-the-art kernel methods for ordinal regression. The ensemble method, which is straightforward to implement, provides a very good sensitivity-specificity trade-off for the highest and lowest rank.

Keywords: Ordinal regression, support vector machines, ensemble learning.

Downloads 1642
173 Real-Time Visualization Using GPU-Accelerated Filtering of LiDAR Data

Authors: Sašo Pečnik, Borut Žalik

Abstract:

This paper presents a technique for real-time visualization and filtering of classified LiDAR point clouds. The visualization is capable of displaying filtered information organized in layers by the classification attribute saved within LiDAR datasets. We explain the data structure and data management used, which enable real-time presentation of layered LiDAR data. Real-time visualization is achieved with LOD optimization based on the distance from the observer, without loss of quality. The filtering process is done in two steps, is executed entirely on the GPU, and is implemented using programmable shaders.

Keywords: Filtering, graphics, level-of-details, LiDAR, real-time visualization.

Downloads 2546
172 Relevant LMA Features for Human Motion Recognition

Authors: Insaf Ajili, Malik Mallem, Jean-Yves Didier

Abstract:

Motion recognition from videos is a very complex task due to the high variability of motions. This paper describes the challenges of human motion recognition, especially the motion representation step with relevant features. Our descriptor vector is inspired by the Laban Movement Analysis method. We propose discriminative features using the Random Forest algorithm in order to remove redundant features and make learning algorithms operate faster and more effectively. We validate our method on the MSRC-12 and UTKinect datasets.

Keywords: Human motion recognition, Discriminative LMA features, random forest, features reduction.

Downloads 773
171 SVision: Visual Identification of Scanning and Denial of Service Attacks

Authors: Iosif-Viorel Onut, Bin Zhu, Ali A. Ghorbani

Abstract:

We propose a novel graphical technique (SVision) for intrusion detection, which pictures the network as a community of hosts independently roaming in a 3D space defined by the set of services that they use. The aim of SVision is to graphically cluster the hosts into normal and abnormal ones, highlighting only those that are considered a threat to the network. Our experimental results using the DARPA 1999 and 2000 intrusion detection evaluation datasets show the proposed technique to be a good candidate for detecting various threats to the network, such as vertical and horizontal scanning, Denial of Service (DoS), and Distributed DoS (DDoS) attacks.

Keywords: Anomaly Visualization, Network Security, Intrusion Detection.

Downloads 1711
170 An Enhanced Support Vector Machine-Based Approach for Sentiment Classification of Arabic Tweets of Different Dialects

Authors: Gehad S. Kaseb, Mona F. Ahmed

Abstract:

Arabic Sentiment Analysis (SA) is one of the most common research fields with many open areas. This paper proposes different pre-processing steps and a modified methodology to improve the accuracy using normal Support Vector Machine (SVM) classification. The paper works on two datasets, Arabic Sentiment Tweets Dataset (ASTD) and Extended Arabic Tweets Sentiment Dataset (Extended-ATSD), which are publicly available for academic use. The results show that the classification accuracy approaches 86%.

Keywords: Arabic, hybrid classification, sentiment analysis, tweets.

Downloads 475
169 A Testbed for the Experiments Performed in Missing Value Treatments

Authors: Dias de J. C. Lilian, Lobato M. F. Fábio, de Santana L. Ádamo

Abstract:

The occurrence of missing values in databases is a serious problem for Data Mining tasks, responsible for degrading data quality and the accuracy of analyses. In this context, the area has shown a lack of standardization in experiments on missing value treatment, which makes it difficult to compare evaluations across different studies because common parameters are not used. This paper proposes a testbed intended to facilitate the implementation of experiments and provide unbiased parameters, using available datasets and suitable performance metrics, in order to optimize the evaluation and comparison of state-of-the-art missing value treatments.

Keywords: Data imputation, data mining, missing values treatment, testbed.

Downloads 1513
168 Face Recognition using Features Combination and a New Non-linear Kernel

Authors: Essam Al Daoud

Abstract:

To improve the classification rate of face recognition, a combination of features and a novel non-linear kernel are proposed. The feature vector concatenates local binary patterns at three different radii with Gabor wavelet features. The Gabor features are the mean, standard deviation and skew of each scaling and orientation parameter. The aim of the new kernel is to combine the power of kernel methods with an optimal balance between the features. To verify the effectiveness of the proposed method, numerous methods are tested on four datasets, which consist of various emotions, orientations, configurations, expressions and lighting conditions. Empirical results show the superiority of the proposed technique when compared to other methods.

Keywords: Face recognition, Gabor wavelet, LBP, Non-linear kernel.
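The Gabor part of such a feature vector (mean, standard deviation and skew per scale and orientation) can be sketched as below. This is only an illustration under assumptions not stated in the abstract: the filter responses are assumed to be already computed and flattened into lists, skew is taken to be the population (Fisher-Pearson) skewness, and the response values and function names are hypothetical.

```python
import math


def moments(values):
    """Return (mean, standard deviation, skewness) of a flat list of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    # Population skewness: third central moment divided by std cubed.
    third = sum((v - mean) ** 3 for v in values) / n
    skew = third / std ** 3 if std > 0 else 0.0
    return mean, std, skew


def gabor_statistics(responses):
    """Concatenate the three statistics for every (scale, orientation) pair.

    `responses` maps (scale, orientation) to the flattened filter magnitudes.
    """
    features = []
    for key in sorted(responses):
        features.extend(moments(responses[key]))
    return features


# Hypothetical filter magnitudes for two (scale, orientation) pairs.
responses = {
    (0, 0): [1.0, 2.0, 3.0, 4.0, 10.0],
    (0, 1): [1.0, 2.0, 3.0],
}
features = gabor_statistics(responses)
print(features)
```

In a full pipeline this list would be concatenated with the local binary pattern histograms before being passed to the kernel; that step is omitted here because the abstract does not give its details.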

Downloads 1540
167 Application of a New Efficient Normal Parameter Reduction Algorithm of Soft Sets in Online Shopping

Authors: Xiuqin Ma, Hongwu Qin

Abstract:

A new efficient normal parameter reduction algorithm of soft sets for decision making has been proposed. However, few documents to date have focused on real-life applications of this algorithm. Accordingly, we apply the New Efficient Normal Parameter Reduction algorithm to real-life online shopping datasets, such as the Blackberry Mobile Phone Dataset. Experimental results show that this algorithm is not only suitable but also feasible for dealing with online shopping.

Keywords: Normal parameter reduction, Online shopping, Parameter reduction, Soft sets.

Downloads 1826