Search results for: Data mining classification algorithms
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 9096

Search results for: Data mining classification algorithms

8946 Distributed Data-Mining by Probability-Based Patterns

Authors: M. Kargar, F. Gharbalchi

Abstract:

In this paper a new method is suggested for distributed data-mining by the probability patterns. These patterns use decision trees and decision graphs. The patterns are cared to be valid, novel, useful, and understandable. Considering a set of functions, the system reaches to a good pattern or better objectives. By using the suggested method we will be able to extract the useful information from massive and multi-relational data bases.

Keywords: Data-mining, Decision tree, Decision graph, Pattern, Relationship.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1555
8945 Ensembling Classifiers – An Application toImage Data Classification from Cherenkov Telescope Experiment

Authors: Praveen Boinee, Alessandro De Angelis, Gian Luca Foresti

Abstract:

Ensemble learning algorithms such as AdaBoost and Bagging have been in active research and shown improvements in classification results for several benchmarking data sets with mainly decision trees as their base classifiers. In this paper we experiment to apply these Meta learning techniques with classifiers such as random forests, neural networks and support vector machines. The data sets are from MAGIC, a Cherenkov telescope experiment. The task is to classify gamma signals from overwhelmingly hadron and muon signals representing a rare class classification problem. We compare the individual classifiers with their ensemble counterparts and discuss the results. WEKA a wonderful tool for machine learning has been used for making the experiments.

Keywords: Ensembles, WEKA, Neural networks [NN], SupportVector Machines [SVM], Random Forests [RF].

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1765
8944 Time Series Regression with Meta-Clusters

Authors: Monika Chuchro

Abstract:

This paper presents a preliminary attempt to apply classification of time series using meta-clusters in order to improve the quality of regression models. In this case, clustering was performed as a method to obtain subgroups of time series data with normal distribution from the inflow into wastewater treatment plant data, composed of several groups differing by mean value. Two simple algorithms, K-mean and EM, were chosen as a clustering method. The Rand index was used to measure the similarity. After simple meta-clustering, a regression model was performed for each subgroups. The final model was a sum of the subgroups models. The quality of the obtained model was compared with the regression model made using the same explanatory variables, but with no clustering of data. Results were compared using determination coefficient (R2), measure of prediction accuracy- mean absolute percentage error (MAPE) and comparison on a linear chart. Preliminary results allow us to foresee the potential of the presented technique.

Keywords: Clustering, Data analysis, Data mining, Predictive models.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1951
8943 Improvement in Power Transformer Intelligent Dissolved Gas Analysis Method

Authors: S. Qaedi, S. Seyedtabaii

Abstract:

Non-Destructive evaluation of in-service power transformer condition is necessary for avoiding catastrophic failures. Dissolved Gas Analysis (DGA) is one of the important methods. Traditional, statistical and intelligent DGA approaches have been adopted for accurate classification of incipient fault sources. Unfortunately, there are not often enough faulty patterns required for sufficient training of intelligent systems. By bootstrapping the shortcoming is expected to be alleviated and algorithms with better classification success rates to be obtained. In this paper the performance of an artificial neural network, K-Nearest Neighbour and support vector machine methods using bootstrapped data are detailed and shown that while the success rate of the ANN algorithms improves remarkably, the outcome of the others do not benefit so much from the provided enlarged data space. For assessment, two databases are employed: IEC TC10 and a dataset collected from reported data in papers. High average test success rate well exhibits the remarkable outcome.

Keywords: Dissolved gas analysis, Transformer incipient fault, Artificial Neural Network, Support Vector Machine (SVM), KNearest Neighbor (KNN)

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2739
8942 Efficient Implementation of Serial and Parallel Support Vector Machine Training with a Multi-Parameter Kernel for Large-Scale Data Mining

Authors: Tatjana Eitrich, Bruno Lang

Abstract:

This work deals with aspects of support vector learning for large-scale data mining tasks. Based on a decomposition algorithm that can be run in serial and parallel mode we introduce a data transformation that allows for the usage of an expensive generalized kernel without additional costs. In order to speed up the decomposition algorithm we analyze the problem of working set selection for large data sets and analyze the influence of the working set sizes onto the scalability of the parallel decomposition scheme. Our modifications and settings lead to improvement of support vector learning performance and thus allow using extensive parameter search methods to optimize classification accuracy.

Keywords: Support Vector Machines, Shared Memory Parallel Computing, Large Data

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1577
8941 Rapid Study on Feature Extraction and Classification Models in Healthcare Applications

Authors: S. Sowmyayani

Abstract:

The advancement of computer-aided design helps the medical force and security force. Some applications include biometric recognition, elderly fall detection, face recognition, cancer recognition, tumor recognition, etc. This paper deals with different machine learning algorithms that are more generically used for any health care system. The most focused problems are classification and regression. With the rise of big data, machine learning has become particularly important for solving problems. Machine learning uses two types of techniques: supervised learning and unsupervised learning. The former trains a model on known input and output data and predicts future outputs. Classification and regression are supervised learning techniques. Unsupervised learning finds hidden patterns in input data. Clustering is one such unsupervised learning technique. The above-mentioned models are discussed briefly in this paper.

Keywords: Supervised learning, unsupervised learning, regression, neural network.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 346
8940 Proposing an Efficient Method for Frequent Pattern Mining

Authors: Vaibhav Kant Singh, Vijay Shah, Yogendra Kumar Jain, Anupam Shukla, A.S. Thoke, Vinay KumarSingh, Chhaya Dule, Vivek Parganiha

Abstract:

Data mining, which is the exploration of knowledge from the large set of data, generated as a result of the various data processing activities. Frequent Pattern Mining is a very important task in data mining. The previous approaches applied to generate frequent set generally adopt candidate generation and pruning techniques for the satisfaction of the desired objective. This paper shows how the different approaches achieve the objective of frequent mining along with the complexities required to perform the job. This paper will also look for hardware approach of cache coherence to improve efficiency of the above process. The process of data mining is helpful in generation of support systems that can help in Management, Bioinformatics, Biotechnology, Medical Science, Statistics, Mathematics, Banking, Networking and other Computer related applications. This paper proposes the use of both upward and downward closure property for the extraction of frequent item sets which reduces the total number of scans required for the generation of Candidate Sets.

Keywords: Data Mining, Candidate Sets, Frequent Item set, Pruning.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1682
8939 An Improved Data Mining Method Applied to the Search of Relationship between Metabolic Syndrome and Lifestyles

Authors: Yi Chao Huang, Yu Ling Liao, Chiu Shuang Lin

Abstract:

A data cutting and sorting method (DCSM) is proposed to optimize the performance of data mining. DCSM reduces the calculation time by getting rid of redundant data during the data mining process. In addition, DCSM minimizes the computational units by splitting the database and by sorting data with support counts. In the process of searching for the relationship between metabolic syndrome and lifestyles with the health examination database of an electronics manufacturing company, DCSM demonstrates higher search efficiency than the traditional Apriori algorithm in tests with different support counts.

Keywords: Data mining, Data cutting and sorting method, Apriori algorithm, Metabolic syndrome

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1588
8938 On Speeding Up Support Vector Machines: Proximity Graphs Versus Random Sampling for Pre-Selection Condensation

Authors: Xiaohua Liu, Juan F. Beltran, Nishant Mohanchandra, Godfried T. Toussaint

Abstract:

Support vector machines (SVMs) are considered to be the best machine learning algorithms for minimizing the predictive probability of misclassification. However, their drawback is that for large data sets the computation of the optimal decision boundary is a time consuming function of the size of the training set. Hence several methods have been proposed to speed up the SVM algorithm. Here three methods used to speed up the computation of the SVM classifiers are compared experimentally using a musical genre classification problem. The simplest method pre-selects a random sample of the data before the application of the SVM algorithm. Two additional methods use proximity graphs to pre-select data that are near the decision boundary. One uses k-Nearest Neighbor graphs and the other Relative Neighborhood Graphs to accomplish the task.

Keywords: Machine learning, data mining, support vector machines, proximity graphs, relative-neighborhood graphs, k-nearestneighbor graphs, random sampling, training data condensation.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1919
8937 Wavelet and K-L Seperability Based Feature Extraction Method for Functional Data Classification

Authors: Jun Wan, Zehua Chen, Yingwu Chen, Zhidong Bai

Abstract:

This paper proposes a novel feature extraction method, based on Discrete Wavelet Transform (DWT) and K-L Seperability (KLS), for the classification of Functional Data (FD). This method combines the decorrelation and reduction property of DWT and the additive independence property of KLS, which is helpful to extraction classification features of FD. It is an advanced approach of the popular wavelet based shrinkage method for functional data reduction and classification. A theory analysis is given in the paper to prove the consistent convergence property, and a simulation study is also done to compare the proposed method with the former shrinkage ones. The experiment results show that this method has advantages in improving classification efficiency, precision and robustness.

Keywords: classification, functional data, feature extraction, K-Lseperability, wavelet.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1466
8936 Curvelet Transform Based Two Class Motor Imagery Classification

Authors: Nebi Gedik

Abstract:

One of the important parts of the brain-computer interface (BCI) studies is the classification of motor imagery (MI) obtained by electroencephalography (EEG). The major goal is to provide non-muscular communication and control via assistive technologies to people with severe motor disorders so that they can communicate with the outside world. In this study, an EEG signal classification approach based on multiscale and multi-resolution transform method is presented. The proposed approach is used to decompose the EEG signal containing motor image information (right- and left-hand movement imagery). The decomposition process is performed using curvelet transform which is a multiscale and multiresolution analysis method, and the transform output was evaluated as feature data. The obtained feature set is subjected to feature selection process to obtain the most effective ones using t-test methods. SVM and k-NN algorithms are assigned for classification.

Keywords: motor imagery, EEG, curvelet transform, SVM, k-NN

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 620
8935 The Implementation of the Multi-Agent Classification System (MACS) in Compliance with FIPA Specifications

Authors: Mohamed R. Mhereeg

Abstract:

The paper discusses the implementation of the MultiAgent classification System (MACS) and utilizing it to provide an automated and accurate classification of end users developing applications in the spreadsheet domain. However, different technologies have been brought together to build MACS. The strength of the system is the integration of the agent technology with the FIPA specifications together with other technologies, which are the .NET widows service based agents, the Windows Communication Foundation (WCF) services, the Service Oriented Architecture (SOA), and Oracle Data Mining (ODM). The Microsoft's .NET widows service based agents were utilized to develop the monitoring agents of MACS, the .NET WCF services together with SOA approach allowed the distribution and communication between agents over the WWW. The Monitoring Agents (MAs) were configured to execute automatically to monitor excel spreadsheets development activities by content. Data gathered by the Monitoring Agents from various resources over a period of time was collected and filtered by a Database Updater Agent (DUA) residing in the .NET client application of the system. This agent then transfers and stores the data in Oracle server database via Oracle stored procedures for further processing that leads to the classification of the end user developers.

Keywords: MACS, Implementation, Multi-Agent, SOA, Autonomous, WCF.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1709
8934 Predictions Using Data Mining and Case-based Reasoning: A Case Study for Retinopathy

Authors: Vimala Balakrishnan, Mohammad R. Shakouri, Hooman Hoodeh, Loo, Huck-Soo

Abstract:

Diabetes is one of the high prevalence diseases worldwide with increased number of complications, with retinopathy as one of the most common one. This paper describes how data mining and case-based reasoning were integrated to predict retinopathy prevalence among diabetes patients in Malaysia. The knowledge base required was built after literature reviews and interviews with medical experts. A total of 140 diabetes patients- data were used to train the prediction system. A voting mechanism selects the best prediction results from the two techniques used. It has been successfully proven that both data mining and case-based reasoning can be used for retinopathy prediction with an improved accuracy of 85%.

Keywords: Case-Based Reasoning, Data Mining, Prediction, Retinopathy.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3021
8933 Adaptive Network Intrusion Detection Learning: Attribute Selection and Classification

Authors: Dewan Md. Farid, Jerome Darmont, Nouria Harbi, Nguyen Huu Hoa, Mohammad Zahidur Rahman

Abstract:

In this paper, a new learning approach for network intrusion detection using naïve Bayesian classifier and ID3 algorithm is presented, which identifies effective attributes from the training dataset, calculates the conditional probabilities for the best attribute values, and then correctly classifies all the examples of training and testing dataset. Most of the current intrusion detection datasets are dynamic, complex and contain large number of attributes. Some of the attributes may be redundant or contribute little for detection making. It has been successfully tested that significant attribute selection is important to design a real world intrusion detection systems (IDS). The purpose of this study is to identify effective attributes from the training dataset to build a classifier for network intrusion detection using data mining algorithms. The experimental results on KDD99 benchmark intrusion detection dataset demonstrate that this new approach achieves high classification rates and reduce false positives using limited computational resources.

Keywords: Attributes selection, Conditional probabilities, information gain, network intrusion detection.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2698
8932 A Data Mining Model for Detecting Financial and Operational Risk Indicators of SMEs

Authors: Ali Serhan Koyuncugil, Nermin Ozgulbas

Abstract:

In this paper, a data mining model to SMEs for detecting financial and operational risk indicators by data mining is presenting. The identification of the risk factors by clarifying the relationship between the variables defines the discovery of knowledge from the financial and operational variables. Automatic and estimation oriented information discovery process coincides the definition of data mining. During the formation of model; an easy to understand, easy to interpret and easy to apply utilitarian model that is far from the requirement of theoretical background is targeted by the discovery of the implicit relationships between the data and the identification of effect level of every factor. In addition, this paper is based on a project which was funded by The Scientific and Technological Research Council of Turkey (TUBITAK).

Keywords: Risk Management, Financial Risk, Operational Risk, Financial Early Warning System, Data Mining, CHAID Decision Tree Algorithm, SMEs.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3123
8931 Application of Argumentation for Improving the Classification Accuracy in Inductive Concept Formation

Authors: Vadim Vagin, Marina Fomina, Oleg Morosin

Abstract:

This paper contains the description of argumentation approach for the problem of inductive concept formation. It is proposed to use argumentation, based on defeasible reasoning with justification degrees, to improve the quality of classification models, obtained by generalization algorithms. The experiment’s results on both clear and noisy data are also presented.

Keywords: Argumentation, justification degrees, inductive concept formation, noise, generalization.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1617
8930 Speech Data Compression using Vector Quantization

Authors: H. B. Kekre, Tanuja K. Sarode

Abstract:

Mostly transforms are used for speech data compressions which are lossy algorithms. Such algorithms are tolerable for speech data compression since the loss in quality is not perceived by the human ear. However the vector quantization (VQ) has a potential to give more data compression maintaining the same quality. In this paper we propose speech data compression algorithm using vector quantization technique. We have used VQ algorithms LBG, KPE and FCG. The results table shows computational complexity of these three algorithms. Here we have introduced a new performance parameter Average Fractional Change in Speech Sample (AFCSS). Our FCG algorithm gives far better performance considering mean absolute error, AFCSS and complexity as compared to others.

Keywords: Vector Quantization, Data Compression, Encoding, , Speech coding.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2403
8929 A Review on Image Segmentation Techniques and Performance Measures

Authors: David Libouga Li Gwet, Marius Otesteanu, Ideal Oscar Libouga, Laurent Bitjoka, Gheorghe D. Popa

Abstract:

Image segmentation is a method to extract regions of interest from an image. It remains a fundamental problem in computer vision. The increasing diversity and the complexity of segmentation algorithms have led us firstly, to make a review and classify segmentation techniques, secondly to identify the most used measures of segmentation performance and thirdly, discuss deeply on segmentation philosophy in order to help the choice of adequate segmentation techniques for some applications. To justify the relevance of our analysis, recent algorithms of segmentation are presented through the proposed classification.

Keywords: Classification, image segmentation, measures of performance.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2051
8928 A Network Traffic Prediction Algorithm Based On Data Mining Technique

Authors: D. Prangchumpol

Abstract:

This paper is a description approach to predict incoming and outgoing data rate in network system by using association rule discover, which is one of the data mining techniques. Information of incoming and outgoing data in each times and network bandwidth are network performance parameters, which needed to solve in the traffic problem. Since congestion and data loss are important network problems. The result of this technique can predicted future network traffic. In addition, this research is useful for network routing selection and network performance improvement.

Keywords: Traffic prediction, association rule, data mining.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3669
8927 Application of Association Rule Mining in Supplier Selection Criteria

Authors: A. Haery, N. Salmasi, M. Modarres Yazdi, H. Iranmanesh

Abstract:

In this paper the application of rule mining in order to review the effective factors on supplier selection is reviewed in the following three sections 1) criteria selecting and information gathering 2) performing association rule mining 3) validation and constituting rule base. Afterwards a few of applications of rule base is explained. Then, a numerical example is presented and analyzed by Clementine software. Some of extracted rules as well as the results are presented at the end.

Keywords: Association rule mining, data mining, supplierselection criteria.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1924
8926 A Comparison and Analysis of Name Matching Algorithms

Authors: Chakkrit Snae

Abstract:

Names are important in many societies, even in technologically oriented ones which use e.g. ID systems to identify individual people. Names such as surnames are the most important as they are used in many processes, such as identifying of people and genealogical research. On the other hand variation of names can be a major problem for the identification and search for people, e.g. web search or security reasons. Name matching presumes a-priori that the recorded name written in one alphabet reflects the phonetic identity of two samples or some transcription error in copying a previously recorded name. We add to this the lode that the two names imply the same person. This paper describes name variations and some basic description of various name matching algorithms developed to overcome name variation and to find reasonable variants of names which can be used to further increasing mismatches for record linkage and name search. The implementation contains algorithms for computing a range of fuzzy matching based on different types of algorithms, e.g. composite and hybrid methods and allowing us to test and measure algorithms for accuracy. NYSIIS, LIG2 and Phonex have been shown to perform well and provided sufficient flexibility to be included in the linkage/matching process for optimising name searching.

Keywords: Data mining, name matching algorithm, nominaldata, searching system.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 11090
8925 Performance Analysis of Artificial Neural Network Based Land Cover Classification

Authors: Najam Aziz, Nasru Minallah, Ahmad Junaid, Kashaf Gul

Abstract:

Landcover classification using automated classification techniques, while employing remotely sensed multi-spectral imagery, is one of the promising areas of research. Different land conditions at different time are captured through satellite and monitored by applying different classification algorithms in specific environment. In this paper, a SPOT-5 image provided by SUPARCO has been studied and classified in Environment for Visual Interpretation (ENVI), a tool widely used in remote sensing. Then, Artificial Neural Network (ANN) classification technique is used to detect the land cover changes in Abbottabad district. Obtained results are compared with a pixel based Distance classifier. The results show that ANN gives the better overall accuracy of 99.20% and Kappa coefficient value of 0.98 over the Mahalanobis Distance Classifier.

Keywords: Landcover classification, artificial neural network, remote sensing, SPOT-5.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1606
8924 Improved Wavelet Neural Networks for Early Cancer Diagnosis Using Clustering Algorithms

Authors: Zarita Zainuddin, Ong Pauline

Abstract:

Wavelet neural networks (WNNs) have emerged as a vital alternative to the vastly studied multilayer perceptrons (MLPs) since its first implementation. In this paper, we applied various clustering algorithms, namely, K-means (KM), Fuzzy C-means (FCM), symmetry-based K-means (SBKM), symmetry-based Fuzzy C-means (SBFCM) and modified point symmetry-based K-means (MPKM) clustering algorithms in choosing the translation parameter of a WNN. These modified WNNs are further applied to the heterogeneous cancer classification using benchmark microarray data and were compared against the conventional WNN with random initialization method. Experimental results showed that a WNN classifier with the MPKM algorithm is more precise than the conventional WNN as well as the WNNs with other clustering algorithms.

Keywords: Clustering, microarray, symmetry, wavelet neural networks.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1616
8923 Application of Neural Networks in Financial Data Mining

Authors: Defu Zhang, Qingshan Jiang, Xin Li

Abstract:

This paper deals with the application of a well-known neural network technique, multilayer back-propagation (BP) neural network, in financial data mining. A modified neural network forecasting model is presented, and an intelligent mining system is developed. The system can forecast the buying and selling signs according to the prediction of future trends to stock market, and provide decision-making for stock investors. The simulation result of seven years to Shanghai Composite Index shows that the return achieved by this mining system is about three times as large as that achieved by the buy and hold strategy, so it is advantageous to apply neural networks to forecast financial time series, the different investors could benefit from it.

Keywords: Data mining, neural network, stock forecasting.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 3590
8922 A Comparative Study of GTC and PSP Algorithms for Mining Sequential Patterns Embedded in Database with Time Constraints

Authors: Safa Adi

Abstract:

This paper will consider the problem of sequential mining patterns embedded in a database by handling the time constraints as defined in the GSP algorithm (level wise algorithms). We will compare two previous approaches GTC and PSP, that resumes the general principles of GSP. Furthermore this paper will discuss PG-hybrid algorithm, that using PSP and GTC. The results show that PSP and GTC are more efficient than GSP. On the other hand, the GTC algorithm performs better than PSP. The PG-hybrid algorithm use PSP algorithm for the two first passes on the database, and GTC approach for the following scans. Experiments show that the hybrid approach is very efficient for short, frequent sequences.

Keywords: Database, GTC algorithm, PSP algorithm, sequential patterns, time constraints.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 699
8921 Content Based Sampling over Transactional Data Streams

Authors: Mansour Tarafdar, Mohammad Saniee Abade

Abstract:

This paper investigates the problem of sampling from transactional data streams. We introduce CFISDS as a content based sampling algorithm that works on a landmark window model of data streams and preserve more informed sample in sample space. This algorithm that work based on closed frequent itemset mining tasks, first initiate a concept lattice using initial data, then update lattice structure using an incremental mechanism.Incremental mechanism insert, update and delete nodes in/from concept lattice in batch manner. Presented algorithm extracts the final samples on demand of user. Experimental results show the accuracy of CFISDS on synthetic and real datasets, despite on CFISDS algorithm is not faster than exist sampling algorithms such as Z and DSS.

Keywords: Sampling, data streams, closed frequent item set mining.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1709
8920 On Pattern-Based Programming towards the Discovery of Frequent Patterns

Authors: Kittisak Kerdprasop, Nittaya Kerdprasop

Abstract:

The problem of frequent pattern discovery is defined as the process of searching for patterns such as sets of features or items that appear in data frequently. Finding such frequent patterns has become an important data mining task because it reveals associations, correlations, and many other interesting relationships hidden in a database. Most of the proposed frequent pattern mining algorithms have been implemented with imperative programming languages. Such paradigm is inefficient when set of patterns is large and the frequent pattern is long. We suggest a high-level declarative style of programming apply to the problem of frequent pattern discovery. We consider two languages: Haskell and Prolog. Our intuitive idea is that the problem of finding frequent patterns should be efficiently and concisely implemented via a declarative paradigm since pattern matching is a fundamental feature supported by most functional languages and Prolog. Our frequent pattern mining implementation using the Haskell and Prolog languages confirms our hypothesis about conciseness of the program. The comparative performance studies on line-of-code, speed and memory usage of declarative versus imperative programming have been reported in the paper.

Keywords: Frequent pattern mining, functional programming, pattern matching, logic programming.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1343
8919 Meta Random Forests

Authors: Praveen Boinee, Alessandro De Angelis, Gian Luca Foresti

Abstract:

Leo Breimans Random Forests (RF) is a recent development in tree based classifiers and quickly proven to be one of the most important algorithms in the machine learning literature. It has shown robust and improved results of classifications on standard data sets. Ensemble learning algorithms such as AdaBoost and Bagging have been in active research and shown improvements in classification results for several benchmarking data sets with mainly decision trees as their base classifiers. In this paper we experiment to apply these Meta learning techniques to the random forests. We experiment the working of the ensembles of random forests on the standard data sets available in UCI data sets. We compare the original random forest algorithm with their ensemble counterparts and discuss the results.

Keywords: Random Forests [RF], ensembles, UCI.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2710
8918 Performance Comparison of Parallel Sorting Algorithms on the Cluster of Workstations

Authors: Lai Lai Win Kyi, Nay Min Tun

Abstract:

Sorting appears the most attention among all computational tasks over the past years because sorted data is at the heart of many computations. Sorting is of additional importance to parallel computing because of its close relation to the task of routing data among processes, which is an essential part of many parallel algorithms. Many parallel sorting algorithms have been investigated for a variety of parallel computer architectures. In this paper, three parallel sorting algorithms have been implemented and compared in terms of their overall execution time. The algorithms implemented are the odd-even transposition sort, parallel merge sort and parallel rank sort. Cluster of Workstations or Windows Compute Cluster has been used to compare the algorithms implemented. The C# programming language is used to develop the sorting algorithms. The MPI (Message Passing Interface) library has been selected to establish the communication and synchronization between processors. The time complexity for each parallel sorting algorithm will also be mentioned and analyzed.

Keywords: Cluster of Workstations, Parallel sorting algorithms, performance analysis, parallel computing and MPI.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1482
8917 Opinion Mining Framework in the Education Domain

Authors: A. M. H. Elyasir, K. S. M. Anbananthen

Abstract:

The internet is growing larger and becoming the most popular platform for the people to share their opinion in different interests. We choose the education domain specifically comparing some Malaysian universities against each other. This comparison produces benchmark based on different criteria shared by the online users in various online resources including Twitter, Facebook and web pages. The comparison is accomplished using opinion mining framework to extract, process the unstructured text and classify the result to positive, negative or neutral (polarity). Hence, we divide our framework to three main stages; opinion collection (extraction), unstructured text processing and polarity classification. The extraction stage includes web crawling, HTML parsing, Sentence segmentation for punctuation classification, Part of Speech (POS) tagging, the second stage processes the unstructured text with stemming and stop words removal and finally prepare the raw text for classification using Named Entity Recognition (NER). Last phase is to classify the polarity and present overall result for the comparison among the Malaysian universities. The final result is useful for those who are interested to study in Malaysia, in which our final output declares clear winners based on the public opinions all over the web.

Keywords: Entity Recognition, Education Domain, Opinion Mining, Unstructured Text.

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 2965