Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 15

k-means Related Abstracts

15 Multi-Cluster Overlapping K-Means Extension Algorithm (MCOKE)

Authors: Said Baadel, Fadi Thabtah, Joan Lu

Abstract:

Clustering involves the partitioning of n objects into k clusters. Many clustering algorithms use hard-partitioning techniques where each object is assigned to one cluster. In this paper, we propose an overlapping algorithm MCOKE which allows objects to belong to one or more clusters. The algorithm is different from fuzzy clustering techniques because objects that overlap are assigned a membership value of 1 (one) as opposed to a fuzzy membership degree. The algorithm is also different from other overlapping algorithms that require a similarity threshold to be defined as a priority which can be difficult to determine by novice users.

Keywords: Data Mining, k-means, MCOKE, overlapping

Procedia PDF Downloads 349
14 An Experiential Learning of Ontology-Based Multi-document Summarization by Removal Summarization Techniques

Authors: Pranjali Avinash Yadav-Deshmukh

Abstract:

Remarkable development of the Internet along with the new technological innovation, such as high-speed systems and affordable large storage space have led to a tremendous increase in the amount and accessibility to digital records. For any person, studying of all these data is tremendously time intensive, so there is a great need to access effective multi-document summarization (MDS) systems, which can successfully reduce details found in several records into a short, understandable summary or conclusion. For semantic representation of textual details in ontology area, as a theoretical design, our system provides a significant structure. The stability of using the ontology in fixing multi-document summarization problems in the sector of catastrophe control is finding its recommended design. Saliency ranking is usually allocated to each phrase and phrases are rated according to the ranking, then the top rated phrases are chosen as the conclusion. With regards to the conclusion quality, wide tests on a selection of media announcements are appropriate for “Jammu Kashmir Overflow in 2014” records. Ontology centered multi-document summarization methods using “NLP centered extraction” outshine other baselines. Our participation in recommended component is to implement the details removal methods (NLP) to enhance the results.

Keywords: Ontology, Disaster Management, NLP, k-means, extraction technique, multi-document summarization, sentence extraction

Procedia PDF Downloads 264
13 Approach Based on Fuzzy C-Means for Band Selection in Hyperspectral Images

Authors: Diego Saqui, José H. Saito, José R. Campos, Lúcio A. de C. Jorge

Abstract:

Hyperspectral images and remote sensing are important for many applications. A problem in the use of these images is the high volume of data to be processed, stored and transferred. Dimensionality reduction techniques can be used to reduce the volume of data. In this paper, an approach to band selection based on clustering algorithms is presented. This approach allows to reduce the volume of data. The proposed structure is based on Fuzzy C-Means (or K-Means) and NWHFC algorithms. New attributes in relation to other studies in the literature, such as kurtosis and low correlation, are also considered. A comparison of the results of the approach using the Fuzzy C-Means and K-Means with different attributes is performed. The use of both algorithms show similar good results but, particularly when used attributes variance and kurtosis in the clustering process, however applicable in hyperspectral images.

Keywords: fuzzy c-means, k-means, band selection, hyperspectral image

Procedia PDF Downloads 212
12 A Minimum Spanning Tree-Based Method for Initializing the K-Means Clustering Algorithm

Authors: X. Zhang, J. Yang, Y. Ma, S. Li, Y. Zhang

Abstract:

The traditional k-means algorithm has been widely used as a simple and efficient clustering method. However, the algorithm often converges to local minima for the reason that it is sensitive to the initial cluster centers. In this paper, an algorithm for selecting initial cluster centers on the basis of minimum spanning tree (MST) is presented. The set of vertices in MST with same degree are regarded as a whole which is used to find the skeleton data points. Furthermore, a distance measure between the skeleton data points with consideration of degree and Euclidean distance is presented. Finally, MST-based initialization method for the k-means algorithm is presented, and the corresponding time complexity is analyzed as well. The presented algorithm is tested on five data sets from the UCI Machine Learning Repository. The experimental results illustrate the effectiveness of the presented algorithm compared to three existing initialization methods.

Keywords: minimum spanning tree, degree, k-means, initial cluster center

Procedia PDF Downloads 211
11 A Fast, Reliable Technique for Face Recognition Based on Hidden Markov Model

Authors: Mohamed Ibrahim, Sameh Abaza, Tarek Mahmoud

Abstract:

Due to the development in the digital image processing, its wide use in many applications such as medical, security, and others, the need for more accurate techniques that are reliable, fast and robust is vehemently demanded. In the field of security, in particular, speed is of the essence. In this paper, a pattern recognition technique that is based on the use of Hidden Markov Model (HMM), K-means and the Sobel operator method is developed. The proposed technique is proved to be fast with respect to some other techniques that are investigated for comparison. Moreover, it shows its capability of recognizing the normal face (center part) as well as face boundary.

Keywords: Face Recognition, Accuracy, k-means, HMM, Sobel

Procedia PDF Downloads 154
10 Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency

Authors: Semeh Ben Salem, Sami Naouali, Moetez Sallami

Abstract:

Clustering is a well known data mining technique used in pattern recognition and information retrieval. The initial dataset to be clustered can either contain categorical or numeric data. Each type of data has its own specific clustering algorithm. In this context, two algorithms are proposed: the k-means for clustering numeric datasets and the k-modes for categorical datasets. The main encountered problem in data mining applications is clustering categorical dataset so relevant in the datasets. One main issue to achieve the clustering process on categorical values is to transform the categorical attributes into numeric measures and directly apply the k-means algorithm instead the k-modes. In this paper, it is proposed to experiment an approach based on the previous issue by transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previously method based on transforming the categorical datasets into binary values. The scalability and accuracy of the two methods are experimented. The obtained results show that our proposed method outperforms the binary method in all cases.

Keywords: Pattern Recognition, Knowledge Discovery, Clustering, unsupervised learning, k-means, categorical datasets

Procedia PDF Downloads 117
9 An Intrusion Detection Systems Based on K-Means, K-Medoids and Support Vector Clustering Using Ensemble

Authors: A. Mohammadpour, Ebrahim Najafi Kajabad, Ghazale Ipakchi

Abstract:

Presently, computer networks’ security rise in importance and many studies have also been conducted in this field. By the penetration of the internet networks in different fields, many things need to be done to provide a secure industrial and non-industrial network. Fire walls, appropriate Intrusion Detection Systems (IDS), encryption protocols for information sending and receiving, and use of authentication certificated are among things, which should be considered for system security. The aim of the present study is to use the outcome of several algorithms, which cause decline in IDS errors, in the way that improves system security and prevents additional overload to the system. Finally, regarding the obtained result we can also detect the amount and percentage of more sub attacks. By running the proposed system, which is based on the use of multi-algorithmic outcome and comparing that by the proposed single algorithmic methods, we observed a 78.64% result in attack detection that is improved by 3.14% than the proposed algorithms.

Keywords: Clustering, Intrusion Detection Systems, Ensemble, k-means, k-medoids, SV clustering

Procedia PDF Downloads 87
8 Implementation of an IoT Sensor Data Collection and Analysis Library

Authors: Minsoo Lee, Jihyun Song, Kyeongjoo Kim

Abstract:

Due to the development of information technology and wireless Internet technology, various data are being generated in various fields. These data are advantageous in that they provide real-time information to the users themselves. However, when the data are accumulated and analyzed, more various information can be extracted. In addition, development and dissemination of boards such as Arduino and Raspberry Pie have made it possible to easily test various sensors, and it is possible to collect sensor data directly by using database application tools such as MySQL. These directly collected data can be used for various research and can be useful as data for data mining. However, there are many difficulties in using the board to collect data, and there are many difficulties in using it when the user is not a computer programmer, or when using it for the first time. Even if data are collected, lack of expert knowledge or experience may cause difficulties in data analysis and visualization. In this paper, we aim to construct a library for sensor data collection and analysis to overcome these problems.

Keywords: Data Mining, Clustering, sensor data, k-means, k-medoids, DBSCAN

Procedia PDF Downloads 227
7 Road Traffic Accidents Analysis in Mexico City through Crowdsourcing Data and Data Mining Techniques

Authors: Gabriela V. Angeles Perez, Jose Castillejos Lopez, Araceli L. Reyes Cabello, Emilio Bravo Grajales, Adriana Perez Espinosa, Jose L. Quiroz Fabian

Abstract:

Road traffic accidents are among the principal causes of traffic congestion, causing human losses, damages to health and the environment, economic losses and material damages. Studies about traditional road traffic accidents in urban zones represents very high inversion of time and money, additionally, the result are not current. However, nowadays in many countries, the crowdsourced GPS based traffic and navigation apps have emerged as an important source of information to low cost to studies of road traffic accidents and urban congestion caused by them. In this article we identified the zones, roads and specific time in the CDMX in which the largest number of road traffic accidents are concentrated during 2016. We built a database compiling information obtained from the social network known as Waze. The methodology employed was Discovery of knowledge in the database (KDD) for the discovery of patterns in the accidents reports. Furthermore, using data mining techniques with the help of Weka. The selected algorithms was the Maximization of Expectations (EM) to obtain the number ideal of clusters for the data and k-means as a grouping method. Finally, the results were visualized with the Geographic Information System QGIS.

Keywords: Data Mining, Weka, k-means, road traffic accidents, Waze

Procedia PDF Downloads 194
6 Unsupervised Part-of-Speech Tagging for Amharic Using K-Means Clustering

Authors: Zelalem Fantahun

Abstract:

Part-of-speech tagging is the process of assigning a part-of-speech or other lexical class marker to each word into naturally occurring text. Part-of-speech tagging is the most fundamental and basic task almost in all natural language processing. In natural language processing, the problem of providing large amount of manually annotated data is a knowledge acquisition bottleneck. Since, Amharic is one of under-resourced language, the availability of tagged corpus is the bottleneck problem for natural language processing especially for POS tagging. A promising direction to tackle this problem is to provide a system that does not require manually tagged data. In unsupervised learning, the learner is not provided with classifications. Unsupervised algorithms seek out similarity between pieces of data in order to determine whether they can be characterized as forming a group. This paper explicates the development of unsupervised part-of-speech tagger using K-Means clustering for Amharic language since large amount of data is produced in day-to-day activities. In the development of the tagger, the following procedures are followed. First, the unlabeled data (raw text) is divided into 10 folds and tokenization phase takes place; at this level, the raw text is chunked at sentence level and then into words. The second phase is feature extraction which includes word frequency, syntactic and morphological features of a word. The third phase is clustering. Among different clustering algorithms, K-means is selected and implemented in this study that brings group of similar words together. The fourth phase is mapping, which deals with looking at each cluster carefully and the most common tag is assigned to a group. This study finds out two features that are capable of distinguishing one part-of-speech from others these are morphological feature and positional information and show that it is possible to use unsupervised learning for Amharic POS tagging. In order to increase performance of the unsupervised part-of-speech tagger, there is a need to incorporate other features that are not included in this study, such as semantic related information. Finally, based on experimental result, the performance of the system achieves a maximum of 81% accuracy.

Keywords: unsupervised learning, k-means, POS tagging, Amharic

Procedia PDF Downloads 257
5 A Method for False Alarm Recognition Based on Multi-Classification Support Vector Machine

Authors: Weiwei Cui, Dejian Lin, Leigang Zhang, Yao Wang, Zheng Sun, Lianfeng Li

Abstract:

Built-in test (BIT) is an important technology in testability field, and it is widely used in state monitoring and fault diagnosis. With the improvement of modern equipment performance and complexity, the scope of BIT becomes larger, and it leads to the emergence of false alarm problem. The false alarm makes the health assessment unstable, and it reduces the effectiveness of BIT. The conventional false alarm suppression methods such as repeated test and majority voting cannot meet the requirement for a complicated system, and the intelligence algorithms such as artificial neural networks (ANN) are widely studied and used. However, false alarm has a very low frequency and small sample, yet a method based on ANN requires a large size of training sample. To recognize the false alarm, we propose a method based on multi-classification support vector machine (SVM) in this paper. Firstly, we divide the state of a system into three states: healthy, false-alarm, and faulty. Then we use multi-classification with '1 vs 1' policy to train and recognize the state of a system. Finally, an example of fault injection system is taken to verify the effectiveness of the proposed method by comparing ANN. The result shows that the method is reasonable and effective.

Keywords: Fault diagnosis, false alarm, SVM, k-means, BIT

Procedia PDF Downloads 21
4 Heuristic Classification of Hydrophone Recordings

Authors: Daniel M. Wolff, Patricia Gray, Rafael de la Parra Venegas

Abstract:

An unsupervised machine listening system is constructed and applied to a dataset of 17,195 30-second marine hydrophone recordings. The system is then heuristically supplemented with anecdotal listening, contextual recording information, and supervised learning techniques to reduce the number of false positives. Features for classification are assembled by extracting the following data from each of the audio files: the spectral centroid, root-mean-squared values for each frequency band of a 10-octave filter bank, and mel-frequency cepstral coefficients in 5-second frames. In this way both time- and frequency-domain information are contained in the features to be passed to a clustering algorithm. Classification is performed using the k-means algorithm and then a k-nearest neighbors search. Different values of k are experimented with, in addition to different combinations of the available feature sets. Hypothesized class labels are 'primarily anthrophony' and 'primarily biophony', where the best class result conforming to the former label has 104 members after heuristic pruning. This demonstrates how a large audio dataset has been made more tractable with machine learning techniques, forming the foundation of a framework designed to acoustically monitor and gauge biological and anthropogenic activity in a marine environment.

Keywords: Machine Learning, k-means, anthrophony, hydrophone

Procedia PDF Downloads 53
3 Switched System Diagnosis Based on Intelligent State Filtering with Unknown Models

Authors: Faouzi Bouani, Nada Slimane, Foued Theljani

Abstract:

The paper addresses the problem of fault diagnosis for systems operating in several modes (normal or faulty) based on states assessment. We use, for this purpose, a methodology consisting of three main processes: 1) sequential data clustering, 2) linear model regression and 3) state filtering. Typically, Kalman Filter (KF) is an algorithm that provides estimation of unknown states using a sequence of I/O measurements. Inevitably, although it is an efficient technique for state estimation, it presents two main weaknesses. First, it merely predicts states without being able to isolate/classify them according to their different operating modes, whether normal or faulty modes. To deal with this dilemma, the KF is endowed with an extra clustering step based fully on sequential version of the k-means algorithm. Second, to provide state estimation, KF requires state space models, which can be unknown. A linear regularized regression is used to identify the required models. To prove its effectiveness, the proposed approach is assessed on a simulated benchmark.

Keywords: Diagnosis, Clustering, k-means, Kalman filtering, regularized regression

Procedia PDF Downloads 25
2 GBKMeans: A Genetic Based K-Means Applied to the Capacitated Planning of Reading Units

Authors: Anderson S. Fonseca, Italo F. S. Da Silva, Robert D. A. Santos, Mayara G. Da Silva, Pedro H. C. Vieira, Antonio M. S. Sobrinho, Victor H. B. Lemos, Petterson S. Diniz, Anselmo C. Paiva, Eliana M. G. Monteiro

Abstract:

In Brazil, the National Electric Energy Agency (ANEEL) establishes that electrical energy companies are responsible for measuring and billing their customers. Among these regulations, it’s defined that a company must bill your customers within 27-33 days. If a relocation or a change of period is required, the consumer must be notified in writing, in advance of a billing period. To make it easier to organize a workday’s measurements, these companies create a reading plan. These plans consist of grouping customers into reading groups, which are visited by an employee responsible for measuring consumption and billing. The creation process of a plan efficiently and optimally is a capacitated clustering problem with constraints related to homogeneity and compactness, that is, the employee’s working load and the geographical position of the consuming unit. This process is a work done manually by several experts who have experience in the geographic formation of the region, which takes a large number of days to complete the final planning, and because it’s human activity, there is no guarantee of finding the best optimization for planning. In this paper, the GBKMeans method presents a technique based on K-Means and genetic algorithms for creating a capacitated cluster that respects the constraints established in an efficient and balanced manner, that minimizes the cost of relocating consumer units and the time required for final planning creation. The results obtained by the presented method are compared with the current planning of a real city, showing an improvement of 54.71% in the standard deviation of working load and 11.97% in the compactness of the groups.

Keywords: Genetic Algorithm, k-means, capacitated clustering, districting problems

Procedia PDF Downloads 1
1 An Empirical Study to Predict Myocardial Infarction Using K-Means and Hierarchical Clustering

Authors: Md. Minhazul Islam, Shah Ashisul Abed Nipun, Majharul Islam, Md. Abdur Rakib Rahat, Jonayet Miah, Salsavil Kayyum, Anwar Shadaab, Faiz Al Faisal

Abstract:

The target of this research is to predict Myocardial Infarction using unsupervised Machine Learning algorithms. Myocardial Infarction Prediction related to heart disease is a challenging factor faced by doctors & hospitals. In this prediction, accuracy of the heart disease plays a vital role. From this concern, the authors have analyzed on a myocardial dataset to predict myocardial infarction using some popular Machine Learning algorithms K-Means and Hierarchical Clustering. This research includes a collection of data and the classification of data using Machine Learning Algorithms. The authors collected 345 instances along with 26 attributes from different hospitals in Bangladesh. This data have been collected from patients suffering from myocardial infarction along with other symptoms. This model would be able to find and mine hidden facts from historical Myocardial Infarction cases. The aim of this study is to analyze the accuracy level to predict Myocardial Infarction by using Machine Learning techniques.

Keywords: Machine Learning, Heart Disease, Hierarchical Clustering, myocardial infarction, k-means

Procedia PDF Downloads 1