Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 19108

Search results for: preprocessing data

19108 LaPEA: Language for Preprocessing of Edge Applications in Smart Factory

Authors: Masaki Sakai, Tsuyoshi Nakajima, Kazuya Takahashi

Abstract:

In order to improve the productivity of a factory, it is often the case to create an inference model by collecting and analyzing operational data off-line and then to develop an edge application (EAP) that evaluates the quality of the products or diagnoses machine faults in real-time. To accelerate this development cycle, an edge application framework for the smart factory is proposed, which enables to create and modify EAPs based on prepared inference models. In the framework, the preprocessing component is the key part to make it work. This paper proposes a language for preprocessing of edge applications, called LaPEA, which can flexibly process several sensor data from machines into explanatory variables for an inference model, and proves that it meets the requirements for the preprocessing.

Keywords: edge application framework, edgecross, preprocessing language, smart factory

Procedia PDF Downloads 62
19107 Text Data Preprocessing Library: Bilingual Approach

Authors: Kabil Boukhari

Abstract:

In the context of information retrieval, the selection of the most relevant words is a very important step. In fact, the text cleaning allows keeping only the most representative words for a better use. In this paper, we propose a library for the purpose text preprocessing within an implemented application to facilitate this task. This study has two purposes. The first, is to present the related work of the various steps involved in text preprocessing, presenting the segmentation, stemming and lemmatization algorithms that could be efficient in the rest of study. The second, is to implement a developed tool for text preprocessing in French and English. This library accepts unstructured text as input and provides the preprocessed text as output, based on a set of rules and on a base of stop words for both languages. The proposed library has been made on different corpora and gave an interesting result.

Keywords: text preprocessing, segmentation, knowledge extraction, normalization, text generation, information retrieval

Procedia PDF Downloads 1
19106 Optimized Preprocessing for Accurate and Efficient Bioassay Prediction with Machine Learning Algorithms

Authors: Jeff Clarine, Chang-Shyh Peng, Daisy Sang

Abstract:

Bioassay is the measurement of the potency of a chemical substance by its effect on a living animal or plant tissue. Bioassay data and chemical structures from pharmacokinetic and drug metabolism screening are mined from and housed in multiple databases. Bioassay prediction is calculated accordingly to determine further advancement. This paper proposes a four-step preprocessing of datasets for improving the bioassay predictions. The first step is instance selection in which dataset is categorized into training, testing, and validation sets. The second step is discretization that partitions the data in consideration of accuracy vs. precision. The third step is normalization where data are normalized between 0 and 1 for subsequent machine learning processing. The fourth step is feature selection where key chemical properties and attributes are generated. The streamlined results are then analyzed for the prediction of effectiveness by various machine learning algorithms including Pipeline Pilot, R, Weka, and Excel. Experiments and evaluations reveal the effectiveness of various combination of preprocessing steps and machine learning algorithms in more consistent and accurate prediction.

Keywords: bioassay, machine learning, preprocessing, virtual screen

Procedia PDF Downloads 192
19105 Mean Shift-Based Preprocessing Methodology for Improved 3D Buildings Reconstruction

Authors: Nikolaos Vassilas, Theocharis Tsenoglou, Djamchid Ghazanfarpour

Abstract:

In this work we explore the capability of the mean shift algorithm as a powerful preprocessing tool for improving the quality of spatial data, acquired from airborne scanners, from densely built urban areas. On one hand, high resolution image data corrupted by noise caused by lossy compression techniques are appropriately smoothed while at the same time preserving the optical edges and, on the other, low resolution LiDAR data in the form of normalized Digital Surface Map (nDSM) is upsampled through the joint mean shift algorithm. Experiments on both the edge-preserving smoothing and upsampling capabilities using synthetic RGB-z data show that the mean shift algorithm is superior to bilateral filtering as well as to other classical smoothing and upsampling algorithms. Application of the proposed methodology for 3D reconstruction of buildings of a pilot region of Athens, Greece results in a significant visual improvement of the 3D building block model.

Keywords: 3D buildings reconstruction, data fusion, data upsampling, mean shift

Procedia PDF Downloads 252
19104 Estimation of Chronic Kidney Disease Using Artificial Neural Network

Authors: Ilker Ali Ozkan

Abstract:

In this study, an artificial neural network model has been developed to estimate chronic kidney failure which is a common disease. The patients’ age, their blood and biochemical values, and 24 input data which consists of various chronic diseases are used for the estimation process. The input data have been subjected to preprocessing because they contain both missing values and nominal values. 147 patient data which was obtained from the preprocessing have been divided into as 70% training and 30% testing data. As a result of the study, artificial neural network model with 25 neurons in the hidden layer has been found as the model with the lowest error value. Chronic kidney failure disease has been able to be estimated accurately at the rate of 99.3% using this artificial neural network model. The developed artificial neural network has been found successful for the estimation of chronic kidney failure disease using clinical data.

Keywords: estimation, artificial neural network, chronic kidney failure disease, disease diagnosis

Procedia PDF Downloads 258
19103 Microarray Data Visualization and Preprocessing Using R and Bioconductor

Authors: Ruchi Yadav, Shivani Pandey, Prachi Srivastava

Abstract:

Microarrays provide a rich source of data on the molecular working of cells. Each microarray reports on the abundance of tens of thousands of mRNAs. Virtually every human disease is being studied using microarrays with the hope of finding the molecular mechanisms of disease. Bioinformatics analysis plays an important part of processing the information embedded in large-scale expression profiling studies and for laying the foundation for biological interpretation. A basic, yet challenging task in the analysis of microarray gene expression data is the identification of changes in gene expression that are associated with particular biological conditions. Careful statistical design and analysis are essential to improve the efficiency and reliability of microarray experiments throughout the data acquisition and analysis process. One of the most popular platforms for microarray analysis is Bioconductor, an open source and open development software project based on the R programming language. This paper describes specific procedures for conducting quality assessment, visualization and preprocessing of Affymetrix Gene Chip and also details the different bioconductor packages used to analyze affymetrix microarray data and describe the analysis and outcome of each plots.

Keywords: microarray analysis, R language, affymetrix visualization, bioconductor

Procedia PDF Downloads 405
19102 Preprocessing and Fusion of Multiple Representation of Finger Vein patterns using Conventional and Machine Learning techniques

Authors: Tomas Trainys, Algimantas Venckauskas

Abstract:

Application of biometric features to the cryptography for human identification and authentication is widely studied and promising area of the development of high-reliability cryptosystems. Biometric cryptosystems typically are designed for patterns recognition, which allows biometric data acquisition from an individual, extracts feature sets, compares the feature set against the set stored in the vault and gives a result of the comparison. Preprocessing and fusion of biometric data are the most important phases in generating a feature vector for key generation or authentication. Fusion of biometric features is critical for achieving a higher level of security and prevents from possible spoofing attacks. The paper focuses on the tasks of initial processing and fusion of multiple representations of finger vein modality patterns. These tasks are solved by applying conventional image preprocessing methods and machine learning techniques, Convolutional Neural Network (SVM) method for image segmentation and feature extraction. An article presents a method for generating sets of biometric features from a finger vein network using several instances of the same modality. Extracted features sets were fused at the feature level. The proposed method was tested and compared with the performance and accuracy results of other authors.

Keywords: bio-cryptography, biometrics, cryptographic key generation, data fusion, information security, SVM, pattern recognition, finger vein method.

Procedia PDF Downloads 74
19101 Sentiment Analysis: An Enhancement of Ontological-Based Features Extraction Techniques and Word Equations

Authors: Mohd Ridzwan Yaakub, Muhammad Iqbal Abu Latiffi

Abstract:

Online business has become popular recently due to the massive amount of information and medium available on the Internet. This has resulted in the huge number of reviews where the consumers share their opinion, criticisms, and satisfaction on the products they have purchased on the websites or the social media such as Facebook and Twitter. However, to analyze customer’s behavior has become very important for organizations to find new market trends and insights. The reviews from the websites or the social media are in structured and unstructured data that need a sentiment analysis approach in analyzing customer’s review. In this article, techniques used in will be defined. Definition of the ontology and description of its possible usage in sentiment analysis will be defined. It will lead to empirical research that related to mobile phones used in research and the ontology used in the experiment. The researcher also will explore the role of preprocessing data and feature selection methodology. As the result, ontology-based approach in sentiment analysis can help in achieving high accuracy for the classification task.

Keywords: feature selection, ontology, opinion, preprocessing data, sentiment analysis

Procedia PDF Downloads 116
19100 Clothes Identification Using Inception ResNet V2 and MobileNet V2

Authors: Subodh Chandra Shakya, Badal Shrestha, Suni Thapa, Ashutosh Chauhan, Saugat Adhikari

Abstract:

To tackle our problem of clothes identification, we used different architectures of Convolutional Neural Networks. Among different architectures, the outcome from Inception ResNet V2 and MobileNet V2 seemed promising. On comparison of the metrices, we observed that the Inception ResNet V2 slightly outperforms MobileNet V2 for this purpose. So this paper of ours proposes the cloth identifier using Inception ResNet V2 and also contains the comparison between the outcome of ResNet V2 and MobileNet V2. The document here contains the results and findings of the research that we performed on the DeepFashion Dataset. To improve the dataset, we used different image preprocessing techniques like image shearing, image rotation, and denoising. The whole experiment was conducted with the intention of testing the efficiency of convolutional neural networks on cloth identification so that we could develop a reliable system that is good enough in identifying the clothes worn by the users. The whole system can be integrated with some kind of recommendation system.

Keywords: inception ResNet, convolutional neural net, deep learning, confusion matrix, data augmentation, data preprocessing

Procedia PDF Downloads 83
19099 The Implementation of the Javanese Lettered-Manuscript Image Preprocessing Stage Model on the Batak Lettered-Manuscript Image

Authors: Anastasia Rita Widiarti, Agus Harjoko, Marsono, Sri Hartati

Abstract:

This paper presents the results of a study to test whether the Javanese character manuscript image preprocessing model that have been more widely applied, can also be applied to segment of the Batak characters manuscripts. The treatment process begins by converting the input image into a binary image. After the binary image is cleaned of noise, then the segmentation lines using projection profile is conducted. If unclear histogram projection is found, then the smoothing process before production indexes line segments is conducted. For each line image which has been produced, then the segmentation scripts in the line is applied, with regard of the connectivity between pixels which making up the letters that there is no characters are truncated. From the results of manuscript preprocessing system prototype testing, it is obtained the information about the system truth percentage value on pieces of Pustaka Batak Podani Ma AjiMamisinon manuscript ranged from 65% to 87.68% with a confidence level of 95%. The value indicates the truth percentage shown the initial processing model in Javanese characters manuscript image can be applied also to the image of the Batak characters manuscript.

Keywords: connected component, preprocessing, manuscript image, projection profiles

Procedia PDF Downloads 311
19098 Identity Verification Using k-NN Classifiers and Autistic Genetic Data

Authors: Fuad M. Alkoot

Abstract:

DNA data have been used in forensics for decades. However, current research looks at using the DNA as a biometric identity verification modality. The goal is to improve the speed of identification. We aim at using gene data that was initially used for autism detection to find if and how accurate is this data for identification applications. Mainly our goal is to find if our data preprocessing technique yields data useful as a biometric identification tool. We experiment with using the nearest neighbor classifier to identify subjects. Results show that optimal classification rate is achieved when the test set is corrupted by normally distributed noise with zero mean and standard deviation of 1. The classification rate is close to optimal at higher noise standard deviation reaching 3. This shows that the data can be used for identity verification with high accuracy using a simple classifier such as the k-nearest neighbor (k-NN). 

Keywords: biometrics, genetic data, identity verification, k nearest neighbor

Procedia PDF Downloads 173
19097 Improve Student Performance Prediction Using Majority Vote Ensemble Model for Higher Education

Authors: Wade Ghribi, Abdelmoty M. Ahmed, Ahmed Said Badawy, Belgacem Bouallegue

Abstract:

In higher education institutions, the most pressing priority is to improve student performance and retention. Large volumes of student data are used in Educational Data Mining techniques to find new hidden information from students' learning behavior, particularly to uncover the early symptom of at-risk pupils. On the other hand, data with noise, outliers, and irrelevant information may provide incorrect conclusions. By identifying features of students' data that have the potential to improve performance prediction results, comparing and identifying the most appropriate ensemble learning technique after preprocessing the data, and optimizing the hyperparameters, this paper aims to develop a reliable students' performance prediction model for Higher Education Institutions. Data was gathered from two different systems: a student information system and an e-learning system for undergraduate students in the College of Computer Science of a Saudi Arabian State University. The cases of 4413 students were used in this article. The process includes data collection, data integration, data preprocessing (such as cleaning, normalization, and transformation), feature selection, pattern extraction, and, finally, model optimization and assessment. Random Forest, Bagging, Stacking, Majority Vote, and two types of Boosting techniques, AdaBoost and XGBoost, are ensemble learning approaches, whereas Decision Tree, Support Vector Machine, and Artificial Neural Network are supervised learning techniques. Hyperparameters for ensemble learning systems will be fine-tuned to provide enhanced performance and optimal output. The findings imply that combining features of students' behavior from e-learning and students' information systems using Majority Vote produced better outcomes than the other ensemble techniques.

Keywords: educational data mining, student performance prediction, e-learning, classification, ensemble learning, higher education

Procedia PDF Downloads 13
19096 Improving the Performance of Deep Learning in Facial Emotion Recognition with Image Sharpening

Authors: Ksheeraj Sai Vepuri, Nada Attar

Abstract:

We as humans use words with accompanying visual and facial cues to communicate effectively. Classifying facial emotion using computer vision methodologies has been an active research area in the computer vision field. In this paper, we propose a simple method for facial expression recognition that enhances accuracy. We tested our method on the FER-2013 dataset that contains static images. Instead of using Histogram equalization to preprocess the dataset, we used Unsharp Mask to emphasize texture and details and sharpened the edges. We also used ImageDataGenerator from Keras library for data augmentation. Then we used Convolutional Neural Networks (CNN) model to classify the images into 7 different facial expressions, yielding an accuracy of 69.46% on the test set. Our results show that using image preprocessing such as the sharpening technique for a CNN model can improve the performance, even when the CNN model is relatively simple.

Keywords: facial expression recognittion, image preprocessing, deep learning, CNN

Procedia PDF Downloads 47
19095 Heuristic to Generate Random X-Monotone Polygons

Authors: Kamaljit Pati, Manas Kumar Mohanty, Sanjib Sadhu

Abstract:

A heuristic has been designed to generate a random simple monotone polygon from a given set of ‘n’ points lying on a 2-Dimensional plane. Our heuristic generates a random monotone polygon in O(n) time after O(nℓogn) preprocessing time which is improved over the previous work where a random monotone polygon is produced in the same O(n) time but the preprocessing time is O(k) for n < k < n2. However, our heuristic does not generate all possible random polygons with uniform probability. The space complexity of our proposed heuristic is O(n).

Keywords: sorting, monotone polygon, visibility, chain

Procedia PDF Downloads 353
19094 Analysis of Expression Data Using Unsupervised Techniques

Authors: M. A. I Perera, C. R. Wijesinghe, A. R. Weerasinghe

Abstract:

his study was conducted to review and identify the unsupervised techniques that can be employed to analyze gene expression data in order to identify better subtypes of tumors. Identifying subtypes of cancer help in improving the efficacy and reducing the toxicity of the treatments by identifying clues to find target therapeutics. Process of gene expression data analysis described under three steps as preprocessing, clustering, and cluster validation. Feature selection is important since the genomic data are high dimensional with a large number of features compared to samples. Hierarchical clustering and K Means are often used in the analysis of gene expression data. There are several cluster validation techniques used in validating the clusters. Heatmaps are an effective external validation method that allows comparing the identified classes with clinical variables and visual analysis of the classes.

Keywords: cancer subtypes, gene expression data analysis, clustering, cluster validation

Procedia PDF Downloads 56
19093 An Adaptive Oversampling Technique for Imbalanced Datasets

Authors: Shaukat Ali Shahee, Usha Ananthakumar

Abstract:

A data set exhibits class imbalance problem when one class has very few examples compared to the other class, and this is also referred to as between class imbalance. The traditional classifiers fail to classify the minority class examples correctly due to its bias towards the majority class. Apart from between-class imbalance, imbalance within classes where classes are composed of a different number of sub-clusters with these sub-clusters containing different number of examples also deteriorates the performance of the classifier. Previously, many methods have been proposed for handling imbalanced dataset problem. These methods can be classified into four categories: data preprocessing, algorithmic based, cost-based methods and ensemble of classifier. Data preprocessing techniques have shown great potential as they attempt to improve data distribution rather than the classifier. Data preprocessing technique handles class imbalance either by increasing the minority class examples or by decreasing the majority class examples. Decreasing the majority class examples lead to loss of information and also when minority class has an absolute rarity, removing the majority class examples is generally not recommended. Existing methods available for handling class imbalance do not address both between-class imbalance and within-class imbalance simultaneously. In this paper, we propose a method that handles between class imbalance and within class imbalance simultaneously for binary classification problem. Removing between class imbalance and within class imbalance simultaneously eliminates the biases of the classifier towards bigger sub-clusters by minimizing the error domination of bigger sub-clusters in total error. The proposed method uses model-based clustering to find the presence of sub-clusters or sub-concepts in the dataset. The number of examples oversampled among the sub-clusters is determined based on the complexity of sub-clusters. The method also takes into consideration the scatter of the data in the feature space and also adaptively copes up with unseen test data using Lowner-John ellipsoid for increasing the accuracy of the classifier. In this study, neural network is being used as this is one such classifier where the total error is minimized and removing the between-class imbalance and within class imbalance simultaneously help the classifier in giving equal weight to all the sub-clusters irrespective of the classes. The proposed method is validated on 9 publicly available data sets and compared with three existing oversampling techniques that rely on the spatial location of minority class examples in the euclidean feature space. The experimental results show the proposed method to be statistically significantly superior to other methods in terms of various accuracy measures. Thus the proposed method can serve as a good alternative to handle various problem domains like credit scoring, customer churn prediction, financial distress, etc., that typically involve imbalanced data sets.

Keywords: classification, imbalanced dataset, Lowner-John ellipsoid, model based clustering, oversampling

Procedia PDF Downloads 331
19092 Analyzing On-Line Process Data for Industrial Production Quality Control

Authors: Hyun-Woo Cho

Abstract:

The monitoring of industrial production quality has to be implemented to alarm early warning for unusual operating conditions. Furthermore, identification of their assignable causes is necessary for a quality control purpose. For such tasks many multivariate statistical techniques have been applied and shown to be quite effective tools. This work presents a process data-based monitoring scheme for production processes. For more reliable results some additional steps of noise filtering and preprocessing are considered. It may lead to enhanced performance by eliminating unwanted variation of the data. The performance evaluation is executed using data sets from test processes. The proposed method is shown to provide reliable quality control results, and thus is more effective in quality monitoring in the example. For practical implementation of the method, an on-line data system must be available to gather historical and on-line data. Recently large amounts of data are collected on-line in most processes and implementation of the current scheme is feasible and does not give additional burdens to users.

Keywords: detection, filtering, monitoring, process data

Procedia PDF Downloads 472
19091 Arabic Text Representation and Classification Methods: Current State of the Art

Authors: Rami Ayadi, Mohsen Maraoui, Mounir Zrigui

Abstract:

In this paper, we have presented a brief current state of the art for Arabic text representation and classification methods. We decomposed Arabic Task Classification into four categories. First we describe some algorithms applied to classification on Arabic text. Secondly, we cite all major works when comparing classification algorithms applied on Arabic text, after this, we mention some authors who proposing new classification methods and finally we investigate the impact of preprocessing on Arabic TC.

Keywords: text classification, Arabic, impact of preprocessing, classification algorithms

Procedia PDF Downloads 330
19090 From Two-Way to Multi-Way: A Comparative Study for Map-Reduce Join Algorithms

Authors: Marwa Hussien Mohamed, Mohamed Helmy Khafagy

Abstract:

Map-Reduce is a programming model which is widely used to extract valuable information from enormous volumes of data. Map-reduce designed to support heterogeneous datasets. Apache Hadoop map-reduce used extensively to uncover hidden pattern like data mining, SQL, etc. The most important operation for data analysis is joining operation. But, map-reduce framework does not directly support join algorithm. This paper explains and compares two-way and multi-way map-reduce join algorithms for map reduce also we implement MR join Algorithms and show the performance of each phase in MR join algorithms. Our experimental results show that map side join and map merge join in two-way join algorithms has the longest time according to preprocessing step sorting data and reduce side cascade join has the longest time at Multi-Way join algorithms.

Keywords: Hadoop, MapReduce, multi-way join, two-way join, Ubuntu

Procedia PDF Downloads 351
19089 A t-SNE and UMAP Based Neural Network Image Classification Algorithm

Authors: Shelby Simpson, William Stanley, Namir Naba, Xiaodi Wang

Abstract:

Both t-SNE and UMAP are brand new state of art tools to predominantly preserve the local structure that is to group neighboring data points together, which indeed provides a very informative visualization of heterogeneity in our data. In this research, we develop a t-SNE and UMAP base neural network image classification algorithm to embed the original dataset to a corresponding low dimensional dataset as a preprocessing step, then use this embedded database as input to our specially designed neural network classifier for image classification. We use the fashion MNIST data set, which is a labeled data set of images of clothing objects in our experiments. t-SNE and UMAP are used for dimensionality reduction of the data set and thus produce low dimensional embeddings. Furthermore, we use the embeddings from t-SNE and UMAP to feed into two neural networks. The accuracy of the models from the two neural networks is then compared to a dense neural network that does not use embedding as an input to show which model can classify the images of clothing objects more accurately.

Keywords: t-SNE, UMAP, fashion MNIST, neural networks

Procedia PDF Downloads 70
19088 Sparse Coding Based Classification of Electrocardiography Signals Using Data-Driven Complete Dictionary Learning

Authors: Fuad Noman, Sh-Hussain Salleh, Chee-Ming Ting, Hadri Hussain, Syed Rasul

Abstract:

In this paper, a data-driven dictionary approach is proposed for the automatic detection and classification of cardiovascular abnormalities. Electrocardiography (ECG) signal is represented by the trained complete dictionaries that contain prototypes or atoms to avoid the limitations of pre-defined dictionaries. The data-driven trained dictionaries simply take the ECG signal as input rather than extracting features to study the set of parameters that yield the most descriptive dictionary. The approach inherently learns the complicated morphological changes in ECG waveform, which is then used to improve the classification. The classification performance was evaluated with ECG data under two different preprocessing environments. In the first category, QT-database is baseline drift corrected with notch filter and it filters the 60 Hz power line noise. In the second category, the data are further filtered using fast moving average smoother. The experimental results on QT database confirm that our proposed algorithm shows a classification accuracy of 92%.

Keywords: electrocardiogram, dictionary learning, sparse coding, classification

Procedia PDF Downloads 279
19087 Brain Age Prediction Based on Brain Magnetic Resonance Imaging by 3D Convolutional Neural Network

Authors: Leila Keshavarz Afshar, Hedieh Sajedi

Abstract:

Estimation of biological brain age from MR images is a topic that has been much addressed in recent years due to the importance it attaches to early diagnosis of diseases such as Alzheimer's. In this paper, we use a 3D Convolutional Neural Network (CNN) to provide a method for estimating the biological age of the brain. The 3D-CNN model is trained by MRI data that has been normalized. In addition, to reduce computation while saving overall performance, some effectual slices are selected for age estimation. By this method, the biological age of individuals using selected normalized data was estimated with Mean Absolute Error (MAE) of 4.82 years.

Keywords: brain age estimation, biological age, 3D-CNN, deep learning, T1-weighted image, SPM, preprocessing, MRI, canny, gray matter

Procedia PDF Downloads 57
19086 Mood Recognition Using Indian Music

Authors: Vishwa Joshi

Abstract:

The study of mood recognition in the field of music has gained a lot of momentum in the recent years with machine learning and data mining techniques and many audio features contributing considerably to analyze and identify the relation of mood plus music. In this paper we consider the same idea forward and come up with making an effort to build a system for automatic recognition of mood underlying the audio song’s clips by mining their audio features and have evaluated several data classification algorithms in order to learn, train and test the model describing the moods of these audio songs and developed an open source framework. Before classification, Preprocessing and Feature Extraction phase is necessary for removing noise and gathering features respectively.

Keywords: music, mood, features, classification

Procedia PDF Downloads 403
19085 Application of Data Mining Techniques for Tourism Knowledge Discovery

Authors: Teklu Urgessa, Wookjae Maeng, Joong Seek Lee

Abstract:

Application of five implementations of three data mining classification techniques was experimented for extracting important insights from tourism data. The aim was to find out the best performing algorithm among the compared ones for tourism knowledge discovery. Knowledge discovery process from data was used as a process model. 10-fold cross validation method is used for testing purpose. Various data preprocessing activities were performed to get the final dataset for model building. Classification models of the selected algorithms were built with different scenarios on the preprocessed dataset. The outperformed algorithm tourism dataset was Random Forest (76%) before applying information gain based attribute selection and J48 (C4.5) (75%) after selection of top relevant attributes to the class (target) attribute. In terms of time for model building, attribute selection improves the efficiency of all algorithms. Artificial Neural Network (multilayer perceptron) showed the highest improvement (90%). The rules extracted from the decision tree model are presented, which showed intricate, non-trivial knowledge/insight that would otherwise not be discovered by simple statistical analysis with mediocre accuracy of the machine using classification algorithms.

Keywords: classification algorithms, data mining, knowledge discovery, tourism

Procedia PDF Downloads 226
19084 Brain Tumor Segmentation Based on Minimum Spanning Tree

Authors: Simeon Mayala, Ida Herdlevær, Jonas Bull Haugsøen, Shamundeeswari Anandan, Sonia Gavasso, Morten Brun

Abstract:

In this paper, we propose a minimum spanning tree-based method for segmenting brain tumors. The proposed method performs interactive segmentation based on the minimum spanning tree without tuning parameters. The steps involve preprocessing, making a graph, constructing a minimum spanning tree, and a newly implemented way of interactively segmenting the region of interest. In the preprocessing step, a Gaussian filter is applied to 2D images to remove the noise. Then, the pixel neighbor graph is weighted by intensity differences and the corresponding minimum spanning tree is constructed. The image is loaded in an interactive window for segmenting the tumor. The region of interest and the background are selected by clicking to split the minimum spanning tree into two trees. One of these trees represents the region of interest and the other represents the background. Finally, the segmentation given by the two trees is visualized. The proposed method was tested by segmenting two different 2D brain T1-weighted magnetic resonance image data sets. The comparison between our results and the standard gold segmentation confirmed the validity of the minimum spanning tree approach. The proposed method is simple to implement and the results indicate that it is accurate and efficient.

Keywords: brain tumor, brain tumor segmentation, minimum spanning tree, segmentation, image processing

Procedia PDF Downloads 30
19083 Principle Component Analysis on Colon Cancer Detection

Authors: N. K. Caecar Pratiwi, Yunendah Nur Fuadah, Rita Magdalena, R. D. Atmaja, Sofia Saidah, Ocky Tiaramukti

Abstract:

Colon cancer or colorectal cancer is a type of cancer that attacks the last part of the human digestive system. Lymphoma and carcinoma are types of cancer that attack human’s colon. Colon cancer causes deaths about half a million people every year. In Indonesia, colon cancer is the third largest cancer case for women and second in men. Unhealthy lifestyles such as minimum consumption of fiber, rarely exercising and lack of awareness for early detection are factors that cause high cases of colon cancer. The aim of this project is to produce a system that can detect and classify images into type of colon cancer lymphoma, carcinoma, or normal. The designed system used 198 data colon cancer tissue pathology, consist of 66 images for Lymphoma cancer, 66 images for carcinoma cancer and 66 for normal / healthy colon condition. This system will classify colon cancer starting from image preprocessing, feature extraction using Principal Component Analysis (PCA) and classification using K-Nearest Neighbor (K-NN) method. Several stages in preprocessing are resize, convert RGB image to grayscale, edge detection and last, histogram equalization. Tests will be done by trying some K-NN input parameter setting. The result of this project is an image processing system that can detect and classify the type of colon cancer with high accuracy and low computation time.

Keywords: carcinoma, colorectal cancer, k-nearest neighbor, lymphoma, principle component analysis

Procedia PDF Downloads 91
19082 Assessment of Pre-Processing Influence on Near-Infrared Spectra for Predicting the Mechanical Properties of Wood

Authors: Aasheesh Raturi, Vimal Kothiyal, P. D. Semalty

Abstract:

We studied mechanical properties of Eucalyptus tereticornis using FT-NIR spectroscopy. Firstly, spectra were pre-processed to eliminate useless information. Then, prediction model was constructed by partial least squares regression. To study the influence of pre-processing on prediction of mechanical properties for NIR analysis of wood samples, we applied various pretreatment methods like straight line subtraction, constant offset elimination, vector-normalization, min-max normalization, multiple scattering. Correction, first derivative, second derivatives and their combination with other treatment such as First derivative + straight line subtraction, First derivative+ vector normalization and First derivative+ multiplicative scattering correction. The data processing methods in combination of preprocessing with different NIR regions, RMSECV, RMSEP and optimum factors/rank were obtained by optimization process of model development. More than 350 combinations were obtained during optimization process. More than one pre-processing method gave good calibration/cross-validation and prediction/test models, but only the best calibration/cross-validation and prediction/test models are reported here. The results show that one can safely use NIR region between 4000 to 7500 cm-1 with straight line subtraction, constant offset elimination, first derivative and second derivative preprocessing method which were found to be most appropriate for models development.

Keywords: FT-NIR, mechanical properties, pre-processing, PLS

Procedia PDF Downloads 263
19081 Impact Location From Instrumented Mouthguard Kinematic Data In Rugby

Authors: Jazim Sohail, Filipe Teixeira-Dias

Abstract:

Mild traumatic brain injury (mTBI) within non-helmeted contact sports is a growing concern due to the serious risk of potential injury. Extensive research is being conducted looking into head kinematics in non-helmeted contact sports utilizing instrumented mouthguards that allow researchers to record accelerations and velocities of the head during and after an impact. This does not, however, allow the location of the impact on the head, and its magnitude and orientation, to be determined. This research proposes and validates two methods to quantify impact locations from instrumented mouthguard kinematic data, one using rigid body dynamics, the other utilizing machine learning. The rigid body dynamics technique focuses on establishing and matching moments from Euler’s and torque equations in order to find the impact location on the head. The methodology is validated with impact data collected from a lab test with the dummy head fitted with an instrumented mouthguard. Additionally, a Hybrid III Dummy head finite element model was utilized to create synthetic kinematic data sets for impacts from varying locations to validate the impact location algorithm. The algorithm calculates accurate impact locations; however, it will require preprocessing of live data, which is currently being done by cross-referencing data timestamps to video footage. The machine learning technique focuses on eliminating the preprocessing aspect by establishing trends within time-series signals from instrumented mouthguards to determine the impact location on the head. An unsupervised learning technique is used to cluster together impacts within similar regions from an entire time-series signal. The kinematic signals established from mouthguards are converted to the frequency domain before using a clustering algorithm to cluster together similar signals within a time series that may span the length of a game. Impacts are clustered within predetermined location bins. The same Hybrid III Dummy finite element model is used to create impacts that closely replicate on-field impacts in order to create synthetic time-series datasets consisting of impacts in varying locations. These time-series data sets are used to validate the machine learning technique. The rigid body dynamics technique provides a good method to establish accurate impact location of impact signals that have already been labeled as true impacts and filtered out of the entire time series. However, the machine learning technique provides a method that can be implemented with long time series signal data but will provide impact location within predetermined regions on the head. Additionally, the machine learning technique can be used to eliminate false impacts captured by sensors saving additional time for data scientists using instrumented mouthguard kinematic data as validating true impacts with video footage would not be required.

Keywords: head impacts, impact location, instrumented mouthguard, machine learning, mTBI

Procedia PDF Downloads 80
19080 Hybrid Fuzzy Weighted K-Nearest Neighbor to Predict Hospital Readmission for Diabetic Patients

Authors: Soha A. Bahanshal, Byung G. Kim

Abstract:

Identification of patients at high risk for hospital readmission is of crucial importance for quality health care and cost reduction. Predicting hospital readmissions among diabetic patients has been of great interest to many researchers and health decision makers. We build a prediction model to predict hospital readmission for diabetic patients within 30 days of discharge. The core of the prediction model is a modified k Nearest Neighbor called Hybrid Fuzzy Weighted k Nearest Neighbor algorithm. The prediction is performed on a patient dataset which consists of more than 70,000 patients with 50 attributes. We applied data preprocessing using different techniques in order to handle data imbalance and to fuzzify the data to suit the prediction algorithm. The model so far achieved classification accuracy of 80% compared to other models that only use k Nearest Neighbor.

Keywords: machine learning, prediction, classification, hybrid fuzzy weighted k-nearest neighbor, diabetic hospital readmission

Procedia PDF Downloads 87
19079 Life Prediction Method of Lithium-Ion Battery Based on Grey Support Vector Machines

Authors: Xiaogang Li, Jieqiong Miao

Abstract:

As for the problem of the grey forecasting model prediction accuracy is low, an improved grey prediction model is put forward. Firstly, use trigonometric function transform the original data sequence in order to improve the smoothness of data , this model called SGM( smoothness of grey prediction model), then combine the improved grey model with support vector machine , and put forward the grey support vector machine model (SGM - SVM).Before the establishment of the model, we use trigonometric functions and accumulation generation operation preprocessing data in order to enhance the smoothness of the data and weaken the randomness of the data, then use support vector machine (SVM) to establish a prediction model for pre-processed data and select model parameters using genetic algorithms to obtain the optimum value of the global search. Finally, restore data through the "regressive generate" operation to get forecasting data. In order to prove that the SGM-SVM model is superior to other models, we select the battery life data from calce. The presented model is used to predict life of battery and the predicted result was compared with that of grey model and support vector machines.For a more intuitive comparison of the three models, this paper presents root mean square error of this three different models .The results show that the effect of grey support vector machine (SGM-SVM) to predict life is optimal, and the root mean square error is only 3.18%. Keywords: grey forecasting model, trigonometric function, support vector machine, genetic algorithms, root mean square error

Keywords: Grey prediction model, trigonometric functions, support vector machines, genetic algorithms, root mean square error

Procedia PDF Downloads 362