Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 15

Search results for: pre-processing

15 Optimized Preprocessing for Accurate and Efficient Bioassay Prediction with Machine Learning Algorithms

Authors: Jeff Clarine, Chang-Shyh Peng, Daisy Sang

Abstract:

Bioassay is the measurement of the potency of a chemical substance by its effect on a living animal or plant tissue. Bioassay data and chemical structures from pharmacokinetic and drug metabolism screening are mined from and housed in multiple databases. Bioassay prediction is calculated accordingly to determine further advancement. This paper proposes a four-step preprocessing of datasets for improving the bioassay predictions. The first step is instance selection in which dataset is categorized into training, testing, and validation sets. The second step is discretization that partitions the data in consideration of accuracy vs. precision. The third step is normalization where data are normalized between 0 and 1 for subsequent machine learning processing. The fourth step is feature selection where key chemical properties and attributes are generated. The streamlined results are then analyzed for the prediction of effectiveness by various machine learning algorithms including Pipeline Pilot, R, Weka, and Excel. Experiments and evaluations reveal the effectiveness of various combination of preprocessing steps and machine learning algorithms in more consistent and accurate prediction.

Keywords: Bioassay, machine learning, preprocessing, virtual screen.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
14 The Implementation of the Javanese Lettered-Manuscript Image Preprocessing Stage Model on the Batak Lettered-Manuscript Image

Authors: Anastasia Rita Widiarti, Agus Harjoko, Marsono, Sri Hartati

Abstract:

This paper presents the results of a study to test whether the Javanese character manuscript image preprocessing model that have been more widely applied, can also be applied to segment of the Batak characters manuscripts. The treatment process begins by converting the input image into a binary image. After the binary image is cleaned of noise, then the segmentation lines using projection profile is conducted. If unclear histogram projection is found, then the smoothing process before production indexes line segments is conducted. For each line image which has been produced, then the segmentation scripts in the line is applied, with regard of the connectivity between pixels which making up the letters that there is no characters are truncated. From the results of manuscript preprocessing system prototype testing, it is obtained the information about the system truth percentage value on pieces of Pustaka Batak Podani Ma AjiMamisinon manuscript ranged from 65% to 87.68% with a confidence level of 95%. The value indicates the truth percentage shown the initial processing model in Javanese characters manuscript image can be applied also to the image of the Batak characters manuscript.

Keywords: Connected component, preprocessing manuscript image, projection profiles.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
13 Evaluation of Clustering Based on Preprocessing in Gene Expression Data

Authors: Seo Young Kim, Toshimitsu Hamasaki

Abstract:

Microarrays have become the effective, broadly used tools in biological and medical research to address a wide range of problems, including classification of disease subtypes and tumors. Many statistical methods are available for analyzing and systematizing these complex data into meaningful information, and one of the main goals in analyzing gene expression data is the detection of samples or genes with similar expression patterns. In this paper, we express and compare the performance of several clustering methods based on data preprocessing including strategies of normalization or noise clearness. We also evaluate each of these clustering methods with validation measures for both simulated data and real gene expression data. Consequently, clustering methods which are common used in microarray data analysis are affected by normalization and degree of noise and clearness for datasets.

Keywords: Gene expression, clustering, data preprocessing.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
12 Performance Evaluation of Iris Region Detection and Localization for Biometric Identification System

Authors: Chit Su Htwe, Win Htay

Abstract:

The iris recognition technology is the most accurate, fast and less invasive one compared to other biometric techniques using for example fingerprints, face, retina, hand geometry, voice or signature patterns. The system developed in this study has the potential to play a key role in areas of high-risk security and can enable organizations with means allowing only to the authorized personnel a fast and secure way to gain access to such areas. The paper aim is to perform the iris region detection and iris inner and outer boundaries localization. The system was implemented on windows platform using Visual C# programming language. It is easy and efficient tool for image processing to get great performance accuracy. In particular, the system includes two main parts. The first is to preprocess the iris images by using Canny edge detection methods, segments the iris region from the rest of the image and determine the location of the iris boundaries by applying Hough transform. The proposed system tested on 756 iris images from 60 eyes of CASIA iris database images.

Keywords: Canny, C#, hough transform, image preprocessing.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
11 Classification Influence Index and its Application for k-Nearest Neighbor Classifier

Authors: Sejong Oh

Abstract:

Classification is an important topic in machine learning and bioinformatics. Many datasets have been introduced for classification tasks. A dataset contains multiple features, and the quality of features influences the classification accuracy of the dataset. The power of classification for each feature differs. In this study, we suggest the Classification Influence Index (CII) as an indicator of classification power for each feature. CII enables evaluation of the features in a dataset and improved classification accuracy by transformation of the dataset. By conducting experiments using CII and the k-nearest neighbor classifier to analyze real datasets, we confirmed that the proposed index provided meaningful improvement of the classification accuracy.

Keywords: accuracy, classification, dataset, data preprocessing

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
10 An Improved Preprocessing for Biosonar Target Classification

Authors: Turgay Temel, John Hallam

Abstract:

An improved processing description to be employed in biosonar signal processing in a cochlea model is proposed and examined. It is compared to conventional models using a modified discrimination analysis and both are tested. Their performances are evaluated with echo data captured from natural targets (trees).Results indicate that the phase characteristics of low-pass filters employed in the echo processing have a significant effect on class separability for this data.

Keywords: Cochlea model, discriminant analysis, neurospikecoding, classification.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
9 A Novel Approach to Improve Users Search Goal in Web Usage Mining

Authors: R. Lokeshkumar, P. Sengottuvelan

Abstract:

Web mining is to discover and extract useful Information. Different users may have different search goals when they search by giving queries and submitting it to a search engine. The inference and analysis of user search goals can be very useful for providing an experience result for a user search query. In this project, we propose a novel approach to infer user search goals by analyzing search web logs. First, we propose a novel approach to infer user search goals by analyzing search engine query logs, the feedback sessions are constructed from user click-through logs and it efficiently reflect the information needed for users. Second we propose a preprocessing technique to clean the unnecessary data’s from web log file (feedback session). Third we propose a technique to generate pseudo-documents to representation of feedback sessions for clustering. Finally we implement k-medoids clustering algorithm to discover different user search goals and to provide a more optimal result for a search query based on feedback sessions for the user.

Keywords: Data Preprocessing, Session Identification, Web log mining, Web Personalization.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
8 Data Preprocessing for Supervised Leaning

Authors: S. B. Kotsiantis, D. Kanellopoulos, P. E. Pintelas

Abstract:

Many factors affect the success of Machine Learning (ML) on a given task. The representation and quality of the instance data is first and foremost. If there is much irrelevant and redundant information present or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. It is well known that data preparation and filtering steps take considerable amount of processing time in ML problems. Data pre-processing includes data cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. It would be nice if a single sequence of data pre-processing algorithms had the best performance for each data set but this is not happened. Thus, we present the most well know algorithms for each step of data pre-processing so that one achieves the best performance for their data set.

Keywords: Data mining, feature selection, data cleaning.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
7 On Preprocessing of Speech Signals

Authors: Ayaz Keerio, Bhargav Kumar Mitra, Philip Birch, Rupert Young, Chris Chatwin

Abstract:

Preprocessing of speech signals is considered a crucial step in the development of a robust and efficient speech or speaker recognition system. In this paper, we present some popular statistical outlier-detection based strategies to segregate the silence/unvoiced part of the speech signal from the voiced portion. The proposed methods are based on the utilization of the 3 σ edit rule, and the Hampel Identifier which are compared with the conventional techniques: (i) short-time energy (STE) based methods, and (ii) distribution based methods. The results obtained after applying the proposed strategies on some test voice signals are encouraging.

Keywords: STE based methods, Mahalanobis distance, 3 edit σ rule, Hampel Identifier.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
6 The Influence of Preprocessing Parameters on Text Categorization

Authors: Jan Pomikalek, Radim Rehurek

Abstract:

Text categorization (the assignment of texts in natural language into predefined categories) is an important and extensively studied problem in Machine Learning. Currently, popular techniques developed to deal with this task include many preprocessing and learning algorithms, many of which in turn require tuning nontrivial internal parameters. Although partial studies are available, many authors fail to report values of the parameters they use in their experiments, or reasons why these values were used instead of others. The goal of this work then is to create a more thorough comparison of preprocessing parameters and their mutual influence, and report interesting observations and results.

Keywords: Text categorization, machine learning, electronic documents, classification.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
5 Mean Shift-based Preprocessing Methodology for Improved 3D Buildings Reconstruction

Authors: Nikolaos Vassilas, Theocharis Tsenoglou, Djamchid Ghazanfarpour

Abstract:

In this work, we explore the capability of the mean shift algorithm as a powerful preprocessing tool for improving the quality of spatial data, acquired from airborne scanners, from densely built urban areas. On one hand, high resolution image data corrupted by noise caused by lossy compression techniques are appropriately smoothed while at the same time preserving the optical edges and, on the other, low resolution LiDAR data in the form of normalized Digital Surface Map (nDSM) is upsampled through the joint mean shift algorithm. Experiments on both the edge-preserving smoothing and upsampling capabilities using synthetic RGB-z data show that the mean shift algorithm is superior to bilateral filtering as well as to other classical smoothing and upsampling algorithms. Application of the proposed methodology for 3D reconstruction of buildings of a pilot region of Athens, Greece results in a significant visual improvement of the 3D building block model.

Keywords: 3D buildings reconstruction, data fusion, data upsampling, mean shift.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
4 Restoration of Noisy Document Images with an Efficient Bi-Level Adaptive Thresholding

Authors: Abhijit Mitra

Abstract:

An effective approach for extracting document images from a noisy background is introduced. The entire scheme is divided into three sub- stechniques – the initial preprocessing operations for noise cluster tightening, introduction of a new thresholding method by maximizing the ratio of stan- dard deviations of the combined effect on the image to the sum of weighted classes and finally the image restoration phase by image binarization utiliz- ing the proposed optimum threshold level. The proposed method is found to be efficient compared to the existing schemes in terms of computational complexity as well as speed with better noise rejection.

Keywords: Document image extraction, Preprocessing, Ratio of stan-dard deviations, Bi-level adaptive thresholding.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
3 Clustering Approach to Unveiling Relationships between Gene Regulatory Networks

Authors: Hiba Hasan, Khalid Raza

Abstract:

Reverse engineering of genetic regulatory network involves the modeling of the given gene expression data into a form of the network. Computationally it is possible to have the relationships between genes, so called gene regulatory networks (GRNs), that can help to find the genomics and proteomics based diagnostic approach for any disease. In this paper, clustering based method has been used to reconstruct genetic regulatory network from time series gene expression data. Supercoiled data set from Escherichia coli has been taken to demonstrate the proposed method.

Keywords: Gene expression, gene regulatory networks (GRNs), clustering, data preprocessing, network visualization.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
2 A Comparative Study of SVM Classifiers and Artificial Neural Networks Application for Rolling Element Bearing Fault Diagnosis using Wavelet Transform Preprocessing

Authors: Commander Sunil Tyagi

Abstract:

Effectiveness of Artificial Neural Networks (ANN) and Support Vector Machines (SVM) classifiers for fault diagnosis of rolling element bearings are presented in this paper. The characteristic features of vibration signals of rotating driveline that was run in its normal condition and with faults introduced were used as input to ANN and SVM classifiers. Simple statistical features such as standard deviation, skewness, kurtosis etc. of the time-domain vibration signal segments along with peaks of the signal and peak of power spectral density (PSD) are used as features to input the ANN and SVM classifier. The effect of preprocessing of the vibration signal by Discreet Wavelet Transform (DWT) prior to feature extraction is also studied. It is shown from the experimental results that the performance of SVM classifier in identification of bearing condition is better then ANN and pre-processing of vibration signal by DWT enhances the effectiveness of both ANN and SVM classifier

Keywords: ANN, Artificial Intelligence, Fault Diagnosis, Pattern Recognition, Rolling Element Bearing, SVM. Wavelet Transform

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF
1 Designing Ontology-Based Knowledge Integration for Preprocessing of Medical Data in Enhancing a Machine Learning System for Coding Assignment of a Multi-Label Medical Text

Authors: Phanu Waraporn

Abstract:

This paper discusses the designing of knowledge integration of clinical information extracted from distributed medical ontologies in order to ameliorate a machine learning-based multilabel coding assignment system. The proposed approach is implemented using a decision tree technique of the machine learning on the university hospital data for patients with Coronary Heart Disease (CHD). The preliminary results obtained show a satisfactory finding that the use of medical ontologies improves the overall system performance.

Keywords: Medical Ontology, Knowledge Integration, Machine Learning, Medical Coding, Text Assignment.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF