Search results for: data preprocessing.
7517 Evaluation of Clustering Based on Preprocessing in Gene Expression Data
Authors: Seo Young Kim, Toshimitsu Hamasaki
Abstract:
Microarrays have become the effective, broadly used tools in biological and medical research to address a wide range of problems, including classification of disease subtypes and tumors. Many statistical methods are available for analyzing and systematizing these complex data into meaningful information, and one of the main goals in analyzing gene expression data is the detection of samples or genes with similar expression patterns. In this paper, we express and compare the performance of several clustering methods based on data preprocessing including strategies of normalization or noise clearness. We also evaluate each of these clustering methods with validation measures for both simulated data and real gene expression data. Consequently, clustering methods which are common used in microarray data analysis are affected by normalization and degree of noise and clearness for datasets.
Keywords: Gene expression, clustering, data preprocessing.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 17397516 Optimized Preprocessing for Accurate and Efficient Bioassay Prediction with Machine Learning Algorithms
Authors: Jeff Clarine, Chang-Shyh Peng, Daisy Sang
Abstract:
Bioassay is the measurement of the potency of a chemical substance by its effect on a living animal or plant tissue. Bioassay data and chemical structures from pharmacokinetic and drug metabolism screening are mined from and housed in multiple databases. Bioassay prediction is calculated accordingly to determine further advancement. This paper proposes a four-step preprocessing of datasets for improving the bioassay predictions. The first step is instance selection in which dataset is categorized into training, testing, and validation sets. The second step is discretization that partitions the data in consideration of accuracy vs. precision. The third step is normalization where data are normalized between 0 and 1 for subsequent machine learning processing. The fourth step is feature selection where key chemical properties and attributes are generated. The streamlined results are then analyzed for the prediction of effectiveness by various machine learning algorithms including Pipeline Pilot, R, Weka, and Excel. Experiments and evaluations reveal the effectiveness of various combination of preprocessing steps and machine learning algorithms in more consistent and accurate prediction.
Keywords: Bioassay, machine learning, preprocessing, virtual screen.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 9817515 The Influence of Preprocessing Parameters on Text Categorization
Authors: Jan Pomikalek, Radim Rehurek
Abstract:
Text categorization (the assignment of texts in natural language into predefined categories) is an important and extensively studied problem in Machine Learning. Currently, popular techniques developed to deal with this task include many preprocessing and learning algorithms, many of which in turn require tuning nontrivial internal parameters. Although partial studies are available, many authors fail to report values of the parameters they use in their experiments, or reasons why these values were used instead of others. The goal of this work then is to create a more thorough comparison of preprocessing parameters and their mutual influence, and report interesting observations and results.
Keywords: Text categorization, machine learning, electronic documents, classification.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 15737514 Mean Shift-based Preprocessing Methodology for Improved 3D Buildings Reconstruction
Authors: Nikolaos Vassilas, Theocharis Tsenoglou, Djamchid Ghazanfarpour
Abstract:
In this work, we explore the capability of the mean shift algorithm as a powerful preprocessing tool for improving the quality of spatial data, acquired from airborne scanners, from densely built urban areas. On one hand, high resolution image data corrupted by noise caused by lossy compression techniques are appropriately smoothed while at the same time preserving the optical edges and, on the other, low resolution LiDAR data in the form of normalized Digital Surface Map (nDSM) is upsampled through the joint mean shift algorithm. Experiments on both the edge-preserving smoothing and upsampling capabilities using synthetic RGB-z data show that the mean shift algorithm is superior to bilateral filtering as well as to other classical smoothing and upsampling algorithms. Application of the proposed methodology for 3D reconstruction of buildings of a pilot region of Athens, Greece results in a significant visual improvement of the 3D building block model.Keywords: 3D buildings reconstruction, data fusion, data upsampling, mean shift.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 20057513 The Implementation of the Javanese Lettered-Manuscript Image Preprocessing Stage Model on the Batak Lettered-Manuscript Image
Authors: Anastasia Rita Widiarti, Agus Harjoko, Marsono, Sri Hartati
Abstract:
This paper presents the results of a study to test whether the Javanese character manuscript image preprocessing model that have been more widely applied, can also be applied to segment of the Batak characters manuscripts. The treatment process begins by converting the input image into a binary image. After the binary image is cleaned of noise, then the segmentation lines using projection profile is conducted. If unclear histogram projection is found, then the smoothing process before production indexes line segments is conducted. For each line image which has been produced, then the segmentation scripts in the line is applied, with regard of the connectivity between pixels which making up the letters that there is no characters are truncated. From the results of manuscript preprocessing system prototype testing, it is obtained the information about the system truth percentage value on pieces of Pustaka Batak Podani Ma AjiMamisinon manuscript ranged from 65% to 87.68% with a confidence level of 95%. The value indicates the truth percentage shown the initial processing model in Javanese characters manuscript image can be applied also to the image of the Batak characters manuscript.Keywords: Connected component, preprocessing manuscript image, projection profiles.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 9237512 Data Preprocessing for Supervised Leaning
Authors: S. B. Kotsiantis, D. Kanellopoulos, P. E. Pintelas
Abstract:
Many factors affect the success of Machine Learning (ML) on a given task. The representation and quality of the instance data is first and foremost. If there is much irrelevant and redundant information present or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. It is well known that data preparation and filtering steps take considerable amount of processing time in ML problems. Data pre-processing includes data cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. It would be nice if a single sequence of data pre-processing algorithms had the best performance for each data set but this is not happened. Thus, we present the most well know algorithms for each step of data pre-processing so that one achieves the best performance for their data set.Keywords: Data mining, feature selection, data cleaning.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 60897511 Fractal - Wavelet Based Techniques for Improving the Artificial Neural Network Models
Authors: Reza Bazargan Lari, Mohammad H. Fattahi
Abstract:
Natural resources management including water resources requires reliable estimations of time variant environmental parameters. Small improvements in the estimation of environmental parameters would result in grate effects on managing decisions. Noise reduction using wavelet techniques is an effective approach for preprocessing of practical data sets. Predictability enhancement of the river flow time series are assessed using fractal approaches before and after applying wavelet based preprocessing. Time series correlation and persistency, the minimum sufficient length for training the predicting model and the maximum valid length of predictions were also investigated through a fractal assessment.
Keywords: Wavelet, de-noising, predictability, time series fractal analysis, valid length, ANN.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 20637510 A Novel Approach to Improve Users Search Goal in Web Usage Mining
Authors: R. Lokeshkumar, P. Sengottuvelan
Abstract:
Web mining is to discover and extract useful Information. Different users may have different search goals when they search by giving queries and submitting it to a search engine. The inference and analysis of user search goals can be very useful for providing an experience result for a user search query. In this project, we propose a novel approach to infer user search goals by analyzing search web logs. First, we propose a novel approach to infer user search goals by analyzing search engine query logs, the feedback sessions are constructed from user click-through logs and it efficiently reflect the information needed for users. Second we propose a preprocessing technique to clean the unnecessary data’s from web log file (feedback session). Third we propose a technique to generate pseudo-documents to representation of feedback sessions for clustering. Finally we implement k-medoids clustering algorithm to discover different user search goals and to provide a more optimal result for a search query based on feedback sessions for the user.Keywords: Data Preprocessing, Session Identification, Web log mining, Web Personalization.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 20227509 Reactive Neural Control for Phototaxis and Obstacle Avoidance Behavior of Walking Machines
Authors: Poramate Manoonpong, Frank Pasemann, Florentin Wörgötter
Abstract:
This paper describes reactive neural control used to generate phototaxis and obstacle avoidance behavior of walking machines. It utilizes discrete-time neurodynamics and consists of two main neural modules: neural preprocessing and modular neural control. The neural preprocessing network acts as a sensory fusion unit. It filters sensory noise and shapes sensory data to drive the corresponding reactive behavior. On the other hand, modular neural control based on a central pattern generator is applied for locomotion of walking machines. It coordinates leg movements and can generate omnidirectional walking. As a result, through a sensorimotor loop this reactive neural controller enables the machines to explore a dynamic environment by avoiding obstacles, turn toward a light source, and then stop near to it.Keywords: Recurrent neural networks, Walking robots, Modular neural control, Phototaxis, Obstacle avoidance behavior.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 17287508 On Preprocessing of Speech Signals
Authors: Ayaz Keerio, Bhargav Kumar Mitra, Philip Birch, Rupert Young, Chris Chatwin
Abstract:
Preprocessing of speech signals is considered a crucial step in the development of a robust and efficient speech or speaker recognition system. In this paper, we present some popular statistical outlier-detection based strategies to segregate the silence/unvoiced part of the speech signal from the voiced portion. The proposed methods are based on the utilization of the 3 σ edit rule, and the Hampel Identifier which are compared with the conventional techniques: (i) short-time energy (STE) based methods, and (ii) distribution based methods. The results obtained after applying the proposed strategies on some test voice signals are encouraging.
Keywords: STE based methods, Mahalanobis distance, 3 edit σ rule, Hampel Identifier.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 17087507 Hand Vein Image Enhancement With Radon Like Features Descriptor
Authors: Randa Boukhris Trabelsi, Alima Damak Masmoudi, Dorra Sellami Masmoudi
Abstract:
Nowadays, hand vein recognition has attracted more attentions in identification biometrics systems. Generally, hand vein image is acquired with low contrast and irregular illumination. Accordingly, if you have a good preprocessing of hand vein image, we can easy extracted the feature extraction even with simple binarization. In this paper, a proposed approach is processed to improve the quality of hand vein image. First, a brief survey on existing methods of enhancement is investigated. Then a Radon Like features method is applied to preprocessing hand vein image. Finally, experiments results show that the proposed method give the better effective and reliable in improving hand vein images.
Keywords: Hand Vein, Enhancement, Contrast, RLF, SDME
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 22407506 Object Tracking System Using Camshift, Meanshift and Kalman Filter
Authors: Afef Salhi, Ameni Yengui Jammaoussi
Abstract:
This paper presents a implementation of an object tracking system in a video sequence. This object tracking is an important task in many vision applications. The main steps in video analysis are two: detection of interesting moving objects and tracking of such objects from frame to frame. In a similar vein, most tracking algorithms use pre-specified methods for preprocessing. In our work, we have implemented several object tracking algorithms (Meanshift, Camshift, Kalman filter) with different preprocessing methods. Then, we have evaluated the performance of these algorithms for different video sequences. The obtained results have shown good performances according to the degree of applicability and evaluation criteria.
Keywords: Tracking, meanshift, camshift, Kalman filter, evaluation.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 82507505 Identity Verification Using k-NN Classifiers and Autistic Genetic Data
Authors: Fuad M. Alkoot
Abstract:
DNA data have been used in forensics for decades. However, current research looks at using the DNA as a biometric identity verification modality. The goal is to improve the speed of identification. We aim at using gene data that was initially used for autism detection to find if and how accurate is this data for identification applications. Mainly our goal is to find if our data preprocessing technique yields data useful as a biometric identification tool. We experiment with using the nearest neighbor classifier to identify subjects. Results show that optimal classification rate is achieved when the test set is corrupted by normally distributed noise with zero mean and standard deviation of 1. The classification rate is close to optimal at higher noise standard deviation reaching 3. This shows that the data can be used for identity verification with high accuracy using a simple classifier such as the k-nearest neighbor (k-NN).
Keywords: Biometrics, identity verification, genetic data, k-nearest neighbor.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 11207504 On-line Handwritten Character Recognition: An Implementation of Counterpropagation Neural Net
Authors: Muhammad Faisal Zafar, Dzulkifli Mohamad, Razib M. Othman
Abstract:
On-line handwritten scripts are usually dealt with pen tip traces from pen-down to pen-up positions. Time evaluation of the pen coordinates is also considered along with trajectory information. However, the data obtained needs a lot of preprocessing including filtering, smoothing, slant removing and size normalization before recognition process. Instead of doing such lengthy preprocessing, this paper presents a simple approach to extract the useful character information. This work evaluates the use of the counter- propagation neural network (CPN) and presents feature extraction mechanism in full detail to work with on-line handwriting recognition. The obtained recognition rates were 60% to 94% using the CPN for different sets of character samples. This paper also describes a performance study in which a recognition mechanism with multiple thresholds is evaluated for counter-propagation architecture. The results indicate that the application of multiple thresholds has significant effect on recognition mechanism. The method is applicable for off-line character recognition as well. The technique is tested for upper-case English alphabets for a number of different styles from different peoples.
Keywords: On-line character recognition, character digitization, counter-propagation neural networks, extreme coordinates.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 24307503 Restoration of Noisy Document Images with an Efficient Bi-Level Adaptive Thresholding
Authors: Abhijit Mitra
Abstract:
An effective approach for extracting document images from a noisy background is introduced. The entire scheme is divided into three sub- stechniques – the initial preprocessing operations for noise cluster tightening, introduction of a new thresholding method by maximizing the ratio of stan- dard deviations of the combined effect on the image to the sum of weighted classes and finally the image restoration phase by image binarization utiliz- ing the proposed optimum threshold level. The proposed method is found to be efficient compared to the existing schemes in terms of computational complexity as well as speed with better noise rejection.
Keywords: Document image extraction, Preprocessing, Ratio of stan-dard deviations, Bi-level adaptive thresholding.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 14567502 A Genetic Algorithm for Clustering on Image Data
Authors: Qin Ding, Jim Gasvoda
Abstract:
Clustering is the process of subdividing an input data set into a desired number of subgroups so that members of the same subgroup are similar and members of different subgroups have diverse properties. Many heuristic algorithms have been applied to the clustering problem, which is known to be NP Hard. Genetic algorithms have been used in a wide variety of fields to perform clustering, however, the technique normally has a long running time in terms of input set size. This paper proposes an efficient genetic algorithm for clustering on very large data sets, especially on image data sets. The genetic algorithm uses the most time efficient techniques along with preprocessing of the input data set. We test our algorithm on both artificial and real image data sets, both of which are of large size. The experimental results show that our algorithm outperforms the k-means algorithm in terms of running time as well as the quality of the clustering.
Keywords: Clustering, data mining, genetic algorithm, image data.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 20517501 Clustering Approach to Unveiling Relationships between Gene Regulatory Networks
Authors: Hiba Hasan, Khalid Raza
Abstract:
Reverse engineering of genetic regulatory network involves the modeling of the given gene expression data into a form of the network. Computationally it is possible to have the relationships between genes, so called gene regulatory networks (GRNs), that can help to find the genomics and proteomics based diagnostic approach for any disease. In this paper, clustering based method has been used to reconstruct genetic regulatory network from time series gene expression data. Supercoiled data set from Escherichia coli has been taken to demonstrate the proposed method.
Keywords: Gene expression, gene regulatory networks (GRNs), clustering, data preprocessing, network visualization.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 21527500 An Improved Preprocessing for Biosonar Target Classification
Authors: Turgay Temel, John Hallam
Abstract:
An improved processing description to be employed in biosonar signal processing in a cochlea model is proposed and examined. It is compared to conventional models using a modified discrimination analysis and both are tested. Their performances are evaluated with echo data captured from natural targets (trees).Results indicate that the phase characteristics of low-pass filters employed in the echo processing have a significant effect on class separability for this data.
Keywords: Cochlea model, discriminant analysis, neurospikecoding, classification.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 14917499 Peakwise Smoothing of Data Models using Wavelets
Authors: D Sudheer Reddy, N Gopal Reddy, P V Radhadevi, J Saibaba, Geeta Varadan
Abstract:
Smoothing or filtering of data is first preprocessing step for noise suppression in many applications involving data analysis. Moving average is the most popular method of smoothing the data, generalization of this led to the development of Savitzky-Golay filter. Many window smoothing methods were developed by convolving the data with different window functions for different applications; most widely used window functions are Gaussian or Kaiser. Function approximation of the data by polynomial regression or Fourier expansion or wavelet expansion also gives a smoothed data. Wavelets also smooth the data to great extent by thresholding the wavelet coefficients. Almost all smoothing methods destroys the peaks and flatten them when the support of the window is increased. In certain applications it is desirable to retain peaks while smoothing the data as much as possible. In this paper we present a methodology called as peak-wise smoothing that will smooth the data to any desired level without losing the major peak features.Keywords: smoothing, moving average, peakwise smoothing, spatialdensity models, planar shape models, wavelets.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 17497498 Efficient STAKCERT KDD Processes in Worm Detection
Authors: Madihah Mohd Saudi, Andrea J Cullen, Mike E Woodward
Abstract:
This paper presents a new STAKCERT KDD processes for worm detection. The enhancement introduced in the data-preprocessing resulted in the formation of a new STAKCERT model for worm detection. In this paper we explained in detail how all the processes involved in the STAKCERT KDD processes are applied within the STAKCERT model for worm detection. Based on the experiment conducted, the STAKCERT model yielded a 98.13% accuracy rate for worm detection by integrating the STAKCERT KDD processes.Keywords: data mining, incident response, KDD processes, security metrics and worm detection.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 16547497 Neural Network Based Speech to Text in Malay Language
Authors: H. F. A. Abdul Ghani, R. R. Porle
Abstract:
Speech to text in Malay language is a system that converts Malay speech into text. The Malay language recognition system is still limited, thus, this paper aims to investigate the performance of ten Malay words obtained from the online Malay news. The methodology consists of three stages, which are preprocessing, feature extraction, and speech classification. In preprocessing stage, the speech samples are filtered using pre emphasis. After that, feature extraction method is applied to the samples using Mel Frequency Cepstrum Coefficient (MFCC). Lastly, speech classification is performed using Feedforward Neural Network (FFNN). The accuracy of the classification is further investigated based on the hidden layer size. From experimentation, the classifier with 40 hidden neurons shows the highest classification rate which is 94%.
Keywords: Feed-Forward Neural Network, FFNN, Malay speech recognition, Mel Frequency Cepstrum Coefficient, MFCC, speech-to-text.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 7457496 Yield Prediction Using Support Vectors Based Under-Sampling in Semiconductor Process
Authors: Sae-Rom Pak, Seung Hwan Park, Jeong Ho Cho, Daewoong An, Cheong-Sool Park, Jun Seok Kim, Jun-Geol Baek
Abstract:
It is important to predict yield in semiconductor test process in order to increase yield. In this study, yield prediction means finding out defective die, wafer or lot effectively. Semiconductor test process consists of some test steps and each test includes various test items. In other world, test data has a big and complicated characteristic. It also is disproportionably distributed as the number of data belonging to FAIL class is extremely low. For yield prediction, general data mining techniques have a limitation without any data preprocessing due to eigen properties of test data. Therefore, this study proposes an under-sampling method using support vector machine (SVM) to eliminate an imbalanced characteristic. For evaluating a performance, randomly under-sampling method is compared with the proposed method using actual semiconductor test data. As a result, sampling method using SVM is effective in generating robust model for yield prediction.
Keywords: Yield Prediction, Semiconductor Test Process, Support Vector Machine, Under Sampling
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 23977495 A Neural-Network-Based Fault Diagnosis Approach for Analog Circuits by Using Wavelet Transformation and Fractal Dimension as a Preprocessor
Abstract:
This paper presents a new method of analog fault diagnosis based on back-propagation neural networks (BPNNs) using wavelet decomposition and fractal dimension as preprocessors. The proposed method has the capability to detect and identify faulty components in an analog electronic circuit with tolerance by analyzing its impulse response. Using wavelet decomposition to preprocess the impulse response drastically de-noises the inputs to the neural network. The second preprocessing by fractal dimension can extract unique features, which are the fed to a neural network as inputs for further classification. A comparison of our work with [1] and [6], which also employs back-propagation (BP) neural networks, reveals that our system requires a much smaller network and performs significantly better in fault diagnosis of analog circuits due to our proposed preprocessing techniques.
Keywords: Analog circuits, fault diagnosis, tolerance, wavelettransform, fractal dimension, box dimension.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 21987494 Empirical Process Monitoring Via Chemometric Analysis of Partially Unbalanced Data
Authors: Hyun-Woo Cho
Abstract:
Real-time or in-line process monitoring frameworks are designed to give early warnings for a fault along with meaningful identification of its assignable causes. In artificial intelligence and machine learning fields of pattern recognition various promising approaches have been proposed such as kernel-based nonlinear machine learning techniques. This work presents a kernel-based empirical monitoring scheme for batch type production processes with small sample size problem of partially unbalanced data. Measurement data of normal operations are easy to collect whilst special events or faults data are difficult to collect. In such situations, noise filtering techniques can be helpful in enhancing process monitoring performance. Furthermore, preprocessing of raw process data is used to get rid of unwanted variation of data. The performance of the monitoring scheme was demonstrated using three-dimensional batch data. The results showed that the monitoring performance was improved significantly in terms of detection success rate of process fault.
Keywords: Process Monitoring, kernel methods, multivariate filtering, data-driven techniques, quality improvement.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 17467493 A Comparative Study of SVM Classifiers and Artificial Neural Networks Application for Rolling Element Bearing Fault Diagnosis using Wavelet Transform Preprocessing
Authors: Commander Sunil Tyagi
Abstract:
Effectiveness of Artificial Neural Networks (ANN) and Support Vector Machines (SVM) classifiers for fault diagnosis of rolling element bearings are presented in this paper. The characteristic features of vibration signals of rotating driveline that was run in its normal condition and with faults introduced were used as input to ANN and SVM classifiers. Simple statistical features such as standard deviation, skewness, kurtosis etc. of the time-domain vibration signal segments along with peaks of the signal and peak of power spectral density (PSD) are used as features to input the ANN and SVM classifier. The effect of preprocessing of the vibration signal by Discreet Wavelet Transform (DWT) prior to feature extraction is also studied. It is shown from the experimental results that the performance of SVM classifier in identification of bearing condition is better then ANN and pre-processing of vibration signal by DWT enhances the effectiveness of both ANN and SVM classifierKeywords: ANN, Artificial Intelligence, Fault Diagnosis, Pattern Recognition, Rolling Element Bearing, SVM. Wavelet Transform
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 21177492 Analysis of Web User Identification Methods
Authors: Renáta Iváncsy, Sándor Juhász
Abstract:
Web usage mining has become a popular research area, as a huge amount of data is available online. These data can be used for several purposes, such as web personalization, web structure enhancement, web navigation prediction etc. However, the raw log files are not directly usable; they have to be preprocessed in order to transform them into a suitable format for different data mining tasks. One of the key issues in the preprocessing phase is to identify web users. Identifying users based on web log files is not a straightforward problem, thus various methods have been developed. There are several difficulties that have to be overcome, such as client side caching, changing and shared IP addresses and so on. This paper presents three different methods for identifying web users. Two of them are the most commonly used methods in web log mining systems, whereas the third on is our novel approach that uses a complex cookie-based method to identify web users. Furthermore we also take steps towards identifying the individuals behind the impersonal web users. To demonstrate the efficiency of the new method we developed an implementation called Web Activity Tracking (WAT) system that aims at a more precise distinction of web users based on log data. We present some statistical analysis created by the WAT on real data about the behavior of the Hungarian web users and a comprehensive analysis and comparison of the three methodsKeywords: Data preparation, Tracking individuals, Web useridentification, Web usage mining
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 43917491 Sparse Coding Based Classification of Electrocardiography Signals Using Data-Driven Complete Dictionary Learning
Authors: Fuad Noman, Sh-Hussain Salleh, Chee-Ming Ting, Hadri Hussain, Syed Rasul
Abstract:
In this paper, a data-driven dictionary approach is proposed for the automatic detection and classification of cardiovascular abnormalities. Electrocardiography (ECG) signal is represented by the trained complete dictionaries that contain prototypes or atoms to avoid the limitations of pre-defined dictionaries. The data-driven trained dictionaries simply take the ECG signal as input rather than extracting features to study the set of parameters that yield the most descriptive dictionary. The approach inherently learns the complicated morphological changes in ECG waveform, which is then used to improve the classification. The classification performance was evaluated with ECG data under two different preprocessing environments. In the first category, QT-database is baseline drift corrected with notch filter and it filters the 60 Hz power line noise. In the second category, the data are further filtered using fast moving average smoother. The experimental results on QT database confirm that our proposed algorithm shows a classification accuracy of 92%.Keywords: Electrocardiogram, dictionary learning, sparse coding, classification.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 20937490 Designing Ontology-Based Knowledge Integration for Preprocessing of Medical Data in Enhancing a Machine Learning System for Coding Assignment of a Multi-Label Medical Text
Authors: Phanu Waraporn
Abstract:
This paper discusses the designing of knowledge integration of clinical information extracted from distributed medical ontologies in order to ameliorate a machine learning-based multilabel coding assignment system. The proposed approach is implemented using a decision tree technique of the machine learning on the university hospital data for patients with Coronary Heart Disease (CHD). The preliminary results obtained show a satisfactory finding that the use of medical ontologies improves the overall system performance.
Keywords: Medical Ontology, Knowledge Integration, Machine Learning, Medical Coding, Text Assignment.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 18507489 Application of Data Mining Techniques for Tourism Knowledge Discovery
Authors: Teklu Urgessa, Wookjae Maeng, Joong Seek Lee
Abstract:
Application of five implementations of three data mining classification techniques was experimented for extracting important insights from tourism data. The aim was to find out the best performing algorithm among the compared ones for tourism knowledge discovery. Knowledge discovery process from data was used as a process model. 10-fold cross validation method is used for testing purpose. Various data preprocessing activities were performed to get the final dataset for model building. Classification models of the selected algorithms were built with different scenarios on the preprocessed dataset. The outperformed algorithm tourism dataset was Random Forest (76%) before applying information gain based attribute selection and J48 (C4.5) (75%) after selection of top relevant attributes to the class (target) attribute. In terms of time for model building, attribute selection improves the efficiency of all algorithms. Artificial Neural Network (multilayer perceptron) showed the highest improvement (90%). The rules extracted from the decision tree model are presented, which showed intricate, non-trivial knowledge/insight that would otherwise not be discovered by simple statistical analysis with mediocre accuracy of the machine using classification algorithms.
Keywords: Classification algorithms; data mining; tourism; knowledge discovery.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 25467488 System Identification Based on Stepwise Regression for Dynamic Market Representation
Authors: Alexander Efremov
Abstract:
A system for market identification (SMI) is presented. The resulting representations are multivariable dynamic demand models. The market specifics are analyzed. Appropriate models and identification techniques are chosen. Multivariate static and dynamic models are used to represent the market behavior. The steps of the first stage of SMI, named data preprocessing, are mentioned. Next, the second stage, which is the model estimation, is considered in more details. Stepwise linear regression (SWR) is used to determine the significant cross-effects and the orders of the model polynomials. The estimates of the model parameters are obtained by a numerically stable estimator. Real market data is used to analyze SMI performance. The main conclusion is related to the applicability of multivariate dynamic models for representation of market systems.Keywords: market identification, dynamic models, stepwise regression.
Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1617