Search results for: time series clustering
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 19651

Search results for: time series clustering

19651 Time Series Regression with Meta-Clusters

Authors: Monika Chuchro

Abstract:

This paper presents a preliminary attempt to apply classification of time series using meta-clusters in order to improve the quality of regression models. In this case, clustering was performed as a method to obtain a subgroups of time series data with normal distribution from inflow into waste water treatment plant data which Composed of several groups differing by mean value. Two simple algorithms: K-mean and EM were chosen as a clustering method. The rand index was used to measure the similarity. After simple meta-clustering, regression model was performed for each subgroups. The final model was a sum of subgroups models. The quality of obtained model was compared with the regression model made using the same explanatory variables but with no clustering of data. Results were compared by determination coefficient (R2), measure of prediction accuracy mean absolute percentage error (MAPE) and comparison on linear chart. Preliminary results allows to foresee the potential of the presented technique.

Keywords: clustering, data analysis, data mining, predictive models

Procedia PDF Downloads 437
19650 An Approach for Pattern Recognition and Prediction of Information Diffusion Model on Twitter

Authors: Amartya Hatua, Trung Nguyen, Andrew Sung

Abstract:

In this paper, we study the information diffusion process on Twitter as a multivariate time series problem. Our model concerns three measures (volume, network influence, and sentiment of tweets) based on 10 features, and we collected 27 million tweets to build our information diffusion time series dataset for analysis. Then, different time series clustering techniques with Dynamic Time Warping (DTW) distance were used to identify different patterns of information diffusion. Finally, we built the information diffusion prediction models for new hashtags which comprise two phrases: The first phrase is recognizing the pattern using k-NN with DTW distance; the second phrase is building the forecasting model using the traditional Autoregressive Integrated Moving Average (ARIMA) model and the non-linear recurrent neural network of Long Short-Term Memory (LSTM). Preliminary results of performance evaluation between different forecasting models show that LSTM with clustering information notably outperforms other models. Therefore, our approach can be applied in real-world applications to analyze and predict the information diffusion characteristics of selected topics or memes (hashtags) in Twitter.

Keywords: ARIMA, DTW, information diffusion, LSTM, RNN, time series clustering, time series forecasting, Twitter

Procedia PDF Downloads 361
19649 Fuzzy Time Series Forecasting Based on Fuzzy Logical Relationships, PSO Technique, and Automatic Clustering Algorithm

Authors: A. K. M. Kamrul Islam, Abdelhamid Bouchachia, Suang Cang, Hongnian Yu

Abstract:

Forecasting model has a great impact in terms of prediction and continues to do so into the future. Although many forecasting models have been studied in recent years, most researchers focus on different forecasting methods based on fuzzy time series to solve forecasting problems. The forecasted models accuracy fully depends on the two terms that are the length of the interval in the universe of discourse and the content of the forecast rules. Moreover, a hybrid forecasting method can be an effective and efficient way to improve forecasts rather than an individual forecasting model. There are different hybrids forecasting models which combined fuzzy time series with evolutionary algorithms, but the performances are not quite satisfactory. In this paper, we proposed a hybrid forecasting model which deals with the first order as well as high order fuzzy time series and particle swarm optimization to improve the forecasted accuracy. The proposed method used the historical enrollments of the University of Alabama as dataset in the forecasting process. Firstly, we considered an automatic clustering algorithm to calculate the appropriate interval for the historical enrollments. Then particle swarm optimization and fuzzy time series are combined that shows better forecasting accuracy than other existing forecasting models.

Keywords: fuzzy time series (fts), particle swarm optimization, clustering algorithm, hybrid forecasting model

Procedia PDF Downloads 220
19648 Hierarchical Piecewise Linear Representation of Time Series Data

Authors: Vineetha Bettaiah, Heggere S. Ranganath

Abstract:

This paper presents a Hierarchical Piecewise Linear Approximation (HPLA) for the representation of time series data in which the time series is treated as a curve in the time-amplitude image space. The curve is partitioned into segments by choosing perceptually important points as break points. Each segment between adjacent break points is recursively partitioned into two segments at the best point or midpoint until the error between the approximating line and the original curve becomes less than a pre-specified threshold. The HPLA representation achieves dimensionality reduction while preserving prominent local features and general shape of time series. The representation permits course-fine processing at different levels of details, allows flexible definition of similarity based on mathematical measures or general time series shape, and supports time series data mining operations including query by content, clustering and classification based on whole or subsequence similarity.

Keywords: data mining, dimensionality reduction, piecewise linear representation, time series representation

Procedia PDF Downloads 246
19647 Time-Series Load Data Analysis for User Power Profiling

Authors: Mahdi Daghmhehci Firoozjaei, Minchang Kim, Dima Alhadidi

Abstract:

In this paper, we present a power profiling model for smart grid consumers based on real time load data acquired smart meters. It profiles consumers’ power consumption behaviour using the dynamic time warping (DTW) clustering algorithm. Due to the invariability of signal warping of this algorithm, time-disordered load data can be profiled and consumption features be extracted. Two load types are defined and the related load patterns are extracted for classifying consumption behaviour by DTW. The classification methodology is discussed in detail. To evaluate the performance of the method, we analyze the time-series load data measured by a smart meter in a real case. The results verify the effectiveness of the proposed profiling method with 90.91% true positive rate for load type clustering in the best case.

Keywords: power profiling, user privacy, dynamic time warping, smart grid

Procedia PDF Downloads 107
19646 Flowing Online Vehicle GPS Data Clustering Using a New Parallel K-Means Algorithm

Authors: Orhun Vural, Oguz Bayat, Rustu Akay, Osman N. Ucan

Abstract:

This study presents a new parallel approach clustering of GPS data. Evaluation has been made by comparing execution time of various clustering algorithms on GPS data. This paper aims to propose a parallel based on neighborhood K-means algorithm to make it faster. The proposed parallelization approach assumes that each GPS data represents a vehicle and to communicate between vehicles close to each other after vehicles are clustered. This parallelization approach has been examined on different sized continuously changing GPS data and compared with serial K-means algorithm and other serial clustering algorithms. The results demonstrated that proposed parallel K-means algorithm has been shown to work much faster than other clustering algorithms.

Keywords: parallel k-means algorithm, parallel clustering, clustering algorithms, clustering on flowing data

Procedia PDF Downloads 188
19645 Mitigating the Negative Effect of Intrabrand Clustering: The Role of Interbrand Clustering and Firm Size

Authors: Moeen Naseer Butt

Abstract:

Clustering –geographic concentrations of entities– has recently received more attention in marketing research and has been shown to affect multiple outcomes. This study investigates the impact of intrabrand clustering (clustering of same-brand outlets) on an outlet’s quality performance. Further, it assesses the moderating effects of interbrand clustering (clustering of other-brand outlets) and firm size. An examination of approximately 21,000 food service establishments in New York State in 2019 finds that the impact of intrabrand clustering on an outlet’s quality performance is context-dependent. Specifically, intrabrand clustering decreases, whereas interbrand clustering and firm size help increase the outlet’s performance. Additionally, this study finds that the role of firm size is more substantial than interbrand clustering in mitigating the adverse effects of intrabrand clustering on outlet quality performance.

Keywords: intraband clustering, interbrand clustering, firm size, brand competition, outlet performance, quality violations

Procedia PDF Downloads 165
19644 Distributed Perceptually Important Point Identification for Time Series Data Mining

Authors: Tak-Chung Fu, Ying-Kit Hung, Fu-Lai Chung

Abstract:

In the field of time series data mining, the concept of the Perceptually Important Point (PIP) identification process is first introduced in 2001. This process originally works for financial time series pattern matching and it is then found suitable for time series dimensionality reduction and representation. Its strength is on preserving the overall shape of the time series by identifying the salient points in it. With the rise of Big Data, time series data contributes a major proportion, especially on the data which generates by sensors in the Internet of Things (IoT) environment. According to the nature of PIP identification and the successful cases, it is worth to further explore the opportunity to apply PIP in time series ‘Big Data’. However, the performance of PIP identification is always considered as the limitation when dealing with ‘Big’ time series data. In this paper, two distributed versions of PIP identification based on the Specialized Binary (SB) Tree are proposed. The proposed approaches solve the bottleneck when running the PIP identification process in a standalone computer. Improvement in term of speed is obtained by the distributed versions.

Keywords: distributed computing, performance analysis, Perceptually Important Point identification, time series data mining

Procedia PDF Downloads 395
19643 Power Iteration Clustering Based on Deflation Technique on Large Scale Graphs

Authors: Taysir Soliman

Abstract:

One of the current popular clustering techniques is Spectral Clustering (SC) because of its advantages over conventional approaches such as hierarchical clustering, k-means, etc. and other techniques as well. However, one of the disadvantages of SC is the time consuming process because it requires computing the eigenvectors. In the past to overcome this disadvantage, a number of attempts have been proposed such as the Power Iteration Clustering (PIC) technique, which is one of versions from SC; some of PIC advantages are: 1) its scalability and efficiency, 2) finding one pseudo-eigenvectors instead of computing eigenvectors, and 3) linear combination of the eigenvectors in linear time. However, its worst disadvantage is an inter-class collision problem because it used only one pseudo-eigenvectors which is not enough. Previous researchers developed Deflation-based Power Iteration Clustering (DPIC) to overcome problems of PIC technique on inter-class collision with the same efficiency of PIC. In this paper, we developed Parallel DPIC (PDPIC) to improve the time and memory complexity which is run on apache spark framework using sparse matrix. To test the performance of PDPIC, we compared it to SC, ESCG, ESCALG algorithms on four small graph benchmark datasets and nine large graph benchmark datasets, where PDPIC proved higher accuracy and better time consuming than other compared algorithms.

Keywords: spectral clustering, power iteration clustering, deflation-based power iteration clustering, Apache spark, large graph

Procedia PDF Downloads 152
19642 Fast Short-Term Electrical Load Forecasting under High Meteorological Variability with a Multiple Equation Time Series Approach

Authors: Charline David, Alexandre Blondin Massé, Arnaud Zinflou

Abstract:

In 2016, Clements, Hurn, and Li proposed a multiple equation time series approach for the short-term load forecasting, reporting an average mean absolute percentage error (MAPE) of 1.36% on an 11-years dataset for the Queensland region in Australia. We present an adaptation of their model to the electrical power load consumption for the whole Quebec province in Canada. More precisely, we take into account two additional meteorological variables — cloudiness and wind speed — on top of temperature, as well as the use of multiple meteorological measurements taken at different locations on the territory. We also consider other minor improvements. Our final model shows an average MAPE score of 1:79% over an 8-years dataset.

Keywords: short-term load forecasting, special days, time series, multiple equations, parallelization, clustering

Procedia PDF Downloads 74
19641 Hierarchical Clustering Algorithms in Data Mining

Authors: Z. Abdullah, A. R. Hamdan

Abstract:

Clustering is a process of grouping objects and data into groups of clusters to ensure that data objects from the same cluster are identical to each other. Clustering algorithms in one of the areas in data mining and it can be classified into partition, hierarchical, density based, and grid-based. Therefore, in this paper, we do a survey and review for four major hierarchical clustering algorithms called CURE, ROCK, CHAMELEON, and BIRCH. The obtained state of the art of these algorithms will help in eliminating the current problems, as well as deriving more robust and scalable algorithms for clustering.

Keywords: clustering, unsupervised learning, algorithms, hierarchical

Procedia PDF Downloads 848
19640 Progressive Multimedia Collection Structuring via Scene Linking

Authors: Aman Berhe, Camille Guinaudeau, Claude Barras

Abstract:

In order to facilitate information seeking in large collections of multimedia documents with long and progressive content (such as broadcast news or TV series), one can extract the semantic links that exist between semantically coherent parts of documents, i.e., scenes. The links can then create a coherent collection of scenes from which it is easier to perform content analysis, topic extraction, or information retrieval. In this paper, we focus on TV series structuring and propose two approaches for scene linking at different levels of granularity (episode and season): a fuzzy online clustering technique and a graph-based community detection algorithm. When evaluated on the two first seasons of the TV series Game of Thrones, we found that the fuzzy online clustering approach performed better compared to graph-based community detection at the episode level, while graph-based approaches show better performance at the season level.

Keywords: multimedia collection structuring, progressive content, scene linking, fuzzy clustering, community detection

Procedia PDF Downloads 72
19639 Chern-Simons Equation in Financial Theory and Time-Series Analysis

Authors: Ognjen Vukovic

Abstract:

Chern-Simons equation represents the cornerstone of quantum physics. The question that is often asked is if the aforementioned equation can be successfully applied to the interaction in international financial markets. By analysing the time series in financial theory, it is proved that Chern-Simons equation can be successfully applied to financial time-series. The aforementioned statement is based on one important premise and that is that the financial time series follow the fractional Brownian motion. All variants of Chern-Simons equation and theory are applied and analysed. Financial theory time series movement is, firstly, topologically analysed. The main idea is that exchange rate represents two-dimensional projections of three-dimensional Brownian motion movement. Main principles of knot theory and topology are applied to financial time series and setting is created so the Chern-Simons equation can be applied. As Chern-Simons equation is based on small particles, it is multiplied by the magnifying factor to mimic the real world movement. Afterwards, the following equation is optimised using Solver. The equation is applied to n financial time series in order to see if it can capture the interaction between financial time series and consequently explain it. The aforementioned equation represents a novel approach to financial time series analysis and hopefully it will direct further research.

Keywords: Brownian motion, Chern-Simons theory, financial time series, econophysics

Procedia PDF Downloads 441
19638 Hybrid Hierarchical Clustering Approach for Community Detection in Social Network

Authors: Radhia Toujani, Jalel Akaichi

Abstract:

Social Networks generally present a hierarchy of communities. To determine these communities and the relationship between them, detection algorithms should be applied. Most of the existing algorithms, proposed for hierarchical communities identification, are based on either agglomerative clustering or divisive clustering. In this paper, we present a hybrid hierarchical clustering approach for community detection based on both bottom-up and bottom-down clustering. Obviously, our approach provides more relevant community structure than hierarchical method which considers only divisive or agglomerative clustering to identify communities. Moreover, we performed some comparative experiments to enhance the quality of the clustering results and to show the effectiveness of our algorithm.

Keywords: agglomerative hierarchical clustering, community structure, divisive hierarchical clustering, hybrid hierarchical clustering, opinion mining, social network, social network analysis

Procedia PDF Downloads 324
19637 Investigation on Performance of Change Point Algorithm in Time Series Dynamical Regimes and Effect of Data Characteristics

Authors: Farhad Asadi, Mohammad Javad Mollakazemi

Abstract:

In this paper, Bayesian online inference in models of data series are constructed by change-points algorithm, which separated the observed time series into independent series and study the change and variation of the regime of the data with related statistical characteristics. variation of statistical characteristics of time series data often represent separated phenomena in the some dynamical system, like a change in state of brain dynamical reflected in EEG signal data measurement or a change in important regime of data in many dynamical system. In this paper, prediction algorithm for studying change point location in some time series data is simulated. It is verified that pattern of proposed distribution of data has important factor on simpler and smother fluctuation of hazard rate parameter and also for better identification of change point locations. Finally, the conditions of how the time series distribution effect on factors in this approach are explained and validated with different time series databases for some dynamical system.

Keywords: time series, fluctuation in statistical characteristics, optimal learning, change-point algorithm

Procedia PDF Downloads 400
19636 Nonstationarity Modeling of Economic and Financial Time Series

Authors: C. Slim

Abstract:

Traditional techniques for analyzing time series are based on the notion of stationarity of phenomena under study, but in reality most economic and financial series do not verify this hypothesis, which implies the implementation of specific tools for the detection of such behavior. In this paper, we study nonstationary non-seasonal time series tests in a non-exhaustive manner. We formalize the problem of nonstationary processes with numerical simulations and take stock of their statistical characteristics. The theoretical aspects of some of the most common unit root tests will be discussed. We detail the specification of the tests, showing the advantages and disadvantages of each. The empirical study focuses on the application of these tests to the exchange rate (USD/TND) and the Consumer Price Index (CPI) in Tunisia, in order to compare the Power of these tests with the characteristics of the series.

Keywords: stationarity, unit root tests, economic time series, ADF tests

Procedia PDF Downloads 397
19635 pscmsForecasting: A Python Web Service for Time Series Forecasting

Authors: Ioannis Andrianakis, Vasileios Gkatas, Nikos Eleftheriadis, Alexios Ellinidis, Ermioni Avramidou

Abstract:

pscmsForecasting is an open-source web service that implements a variety of time series forecasting algorithms and exposes them to the user via the ubiquitous HTTP protocol. It allows developers to enhance their applications by adding time series forecasting functionalities through an intuitive and easy-to-use interface. This paper provides some background on time series forecasting and gives details about the implemented algorithms, aiming to enhance the end user’s understanding of the underlying methods before incorporating them into their applications. A detailed description of the web service’s interface and its various parameterizations is also provided. Being an open-source project, pcsmsForecasting can also be easily modified and tailored to the specific needs of each application.

Keywords: time series, forecasting, web service, open source

Procedia PDF Downloads 48
19634 Performance Analysis of Hierarchical Agglomerative Clustering in a Wireless Sensor Network Using Quantitative Data

Authors: Tapan Jain, Davender Singh Saini

Abstract:

Clustering is a useful mechanism in wireless sensor networks which helps to cope with scalability and data transmission problems. The basic aim of our research work is to provide efficient clustering using Hierarchical agglomerative clustering (HAC). If the distance between the sensing nodes is calculated using their location then it’s quantitative HAC. This paper compares the various agglomerative clustering techniques applied in a wireless sensor network using the quantitative data. The simulations are done in MATLAB and the comparisons are made between the different protocols using dendrograms.

Keywords: routing, hierarchical clustering, agglomerative, quantitative, wireless sensor network

Procedia PDF Downloads 564
19633 Application of Data Mining for Aquifer Environmental Assessment

Authors: Saman Javadi, Mehdi Hashemy, Mohahammad Mahmoodi

Abstract:

Vulnerability maps are employed as an important solution in order to handle entrance of pollution into the aquifers. The common way to provide vulnerability map is DRASTIC. Meanwhile, application of the method is not easy to apply for any aquifer due to choosing appropriate constant values of weights and ranks. In this study, a new approach using k-means clustering is applied to make vulnerability maps. Four features of depth to groundwater, hydraulic conductivity, recharge value and vadose zone were considered at the same time as features of clustering. Five regions are recognized out of the case study represent zones with different level of vulnerability. The finding results show that clustering provides a realistic vulnerability map so that, Pearson’s correlation coefficients between nitrate concentrations and clustering vulnerability is obtained 61%.

Keywords: clustering, data mining, groundwater, vulnerability assessment

Procedia PDF Downloads 569
19632 Visualization of PM₂.₅ Time Series and Correlation Analysis of Cities in Bangladesh

Authors: Asif Zaman, Moinul Islam Zaber, Amin Ahsan Ali

Abstract:

In recent years of industrialization, the South Asian countries are being affected by air pollution due to a severe increase in fine particulate matter 2.5 (PM₂.₅). Among them, Bangladesh is one of the most polluting countries. In this paper, statistical analyses were conducted on the time series of PM₂.₅ from various districts in Bangladesh, mostly around Dhaka city. Research has been conducted on the dynamic interactions and relationships between PM₂.₅ concentrations in different zones. The study is conducted toward understanding the characteristics of PM₂.₅, such as spatial-temporal characterization, correlation of other contributors behind air pollution such as human activities, driving factors and environmental casualties. Clustering on the data gave an insight on the districts groups based on their AQI frequency as representative districts. Seasonality analysis on hourly and monthly frequency found higher concentration of fine particles in nighttime and winter season, respectively. Cross correlation analysis discovered a phenomenon of correlations among cities based on time-lagged series of air particle readings and visualization framework is developed for observing interaction in PM₂.₅ concentrations between cities. Significant time-lagged correlations were discovered between the PM₂.₅ time series in different city groups throughout the country by cross correlation analysis. Additionally, seasonal heatmaps depict that the pooled series correlations are less significant in warmer months, and among cities of greater geographic distance as well as time lag magnitude and direction of the best shifted correlated particulate matter time series among districts change seasonally. The geographic map visualization demonstrates spatial behaviour of air pollution among districts around Dhaka city and the significant effect of wind direction as the vital actor on correlated shifted time series. The visualization framework has multipurpose usage from gathering insight of general and seasonal air quality of Bangladesh to determining the pathway of regional transportation of air pollution.

Keywords: air quality, particles, cross correlation, seasonality

Procedia PDF Downloads 86
19631 Comparison of Applicability of Time Series Forecasting Models VAR, ARCH and ARMA in Management Science: Study Based on Empirical Analysis of Time Series Techniques

Authors: Muhammad Tariq, Hammad Tahir, Fawwad Mahmood Butt

Abstract:

Purpose: This study attempts to examine the best forecasting methodologies in the time series. The time series forecasting models such as VAR, ARCH and the ARMA are considered for the analysis. Methodology: The Bench Marks or the parameters such as Adjusted R square, F-stats, Durban Watson, and Direction of the roots have been critically and empirically analyzed. The empirical analysis consists of time series data of Consumer Price Index and Closing Stock Price. Findings: The results show that the VAR model performed better in comparison to other models. Both the reliability and significance of VAR model is highly appreciable. In contrary to it, the ARCH model showed very poor results for forecasting. However, the results of ARMA model appeared double standards i.e. the AR roots showed that model is stationary and that of MA roots showed that the model is invertible. Therefore, the forecasting would remain doubtful if it made on the bases of ARMA model. It has been concluded that VAR model provides best forecasting results. Practical Implications: This paper provides empirical evidences for the application of time series forecasting model. This paper therefore provides the base for the application of best time series forecasting model.

Keywords: forecasting, time series, auto regression, ARCH, ARMA

Procedia PDF Downloads 305
19630 Analysis of Dynamics Underlying the Observation Time Series by Using a Singular Spectrum Approach

Authors: O. Delage, H. Bencherif, T. Portafaix, A. Bourdier

Abstract:

The main purpose of time series analysis is to learn about the dynamics behind some time ordered measurement data. Two approaches are used in the literature to get a better knowledge of the dynamics contained in observation data sequences. The first of these approaches concerns time series decomposition, which is an important analysis step allowing patterns and behaviors to be extracted as components providing insight into the mechanisms producing the time series. As in many cases, time series are short, noisy, and non-stationary. To provide components which are physically meaningful, methods such as Empirical Mode Decomposition (EMD), Empirical Wavelet Transform (EWT) or, more recently, Empirical Adaptive Wavelet Decomposition (EAWD) have been proposed. The second approach is to reconstruct the dynamics underlying the time series as a trajectory in state space by mapping a time series into a set of Rᵐ lag vectors by using the method of delays (MOD). Takens has proved that the trajectory obtained with the MOD technic is equivalent to the trajectory representing the dynamics behind the original time series. This work introduces the singular spectrum decomposition (SSD), which is a new adaptive method for decomposing non-linear and non-stationary time series in narrow-banded components. This method takes its origin from singular spectrum analysis (SSA), a nonparametric spectral estimation method used for the analysis and prediction of time series. As the first step of SSD is to constitute a trajectory matrix by embedding a one-dimensional time series into a set of lagged vectors, SSD can also be seen as a reconstruction method like MOD. We will first give a brief overview of the existing decomposition methods (EMD-EWT-EAWD). The SSD method will then be described in detail and applied to experimental time series of observations resulting from total columns of ozone measurements. The results obtained will be compared with those provided by the previously mentioned decomposition methods. We will also compare the reconstruction qualities of the observed dynamics obtained from the SSD and MOD methods.

Keywords: time series analysis, adaptive time series decomposition, wavelet, phase space reconstruction, singular spectrum analysis

Procedia PDF Downloads 80
19629 Applying a Noise Reduction Method to Reveal Chaos in the River Flow Time Series

Authors: Mohammad H. Fattahi

Abstract:

Chaotic analysis has been performed on the river flow time series before and after applying the wavelet based de-noising techniques in order to investigate the noise content effects on chaotic nature of flow series. In this study, 38 years of monthly runoff data of three gauging stations were used. Gauging stations were located in Ghar-e-Aghaj river basin, Fars province, Iran. The noise level of time series was estimated with the aid of Gaussian kernel algorithm. This step was found to be crucial in preventing removal of the vital data such as memory, correlation and trend from the time series in addition to the noise during de-noising process.

Keywords: chaotic behavior, wavelet, noise reduction, river flow

Procedia PDF Downloads 439
19628 Semi-Supervised Hierarchical Clustering Given a Reference Tree of Labeled Documents

Authors: Ying Zhao, Xingyan Bin

Abstract:

Semi-supervised clustering algorithms have been shown effective to improve clustering process with even limited supervision. However, semi-supervised hierarchical clustering remains challenging due to the complexities of expressing constraints for agglomerative clustering algorithms. This paper proposes novel semi-supervised agglomerative clustering algorithms to build a hierarchy based on a known reference tree. We prove that by enforcing distance constraints defined by a reference tree during the process of hierarchical clustering, the resultant tree is guaranteed to be consistent with the reference tree. We also propose a framework that allows the hierarchical tree generation be aware of levels of levels of the agglomerative tree under creation, so that metric weights can be learned and adopted at each level in a recursive fashion. The experimental evaluation shows that the additional cost of our contraint-based semi-supervised hierarchical clustering algorithm (HAC) is negligible, and our combined semi-supervised HAC algorithm outperforms the state-of-the-art algorithms on real-world datasets. The experiments also show that our proposed methods can improve clustering performance even with a small number of unevenly distributed labeled data.

Keywords: semi-supervised clustering, hierarchical agglomerative clustering, reference trees, distance constraints

Procedia PDF Downloads 504
19627 Anomaly Detection Based Fuzzy K-Mode Clustering for Categorical Data

Authors: Murat Yazici

Abstract:

Anomalies are irregularities found in data that do not adhere to a well-defined standard of normal behavior. The identification of outliers or anomalies in data has been a subject of study within the statistics field since the 1800s. Over time, a variety of anomaly detection techniques have been developed in several research communities. The cluster analysis can be used to detect anomalies. It is the process of associating data with clusters that are as similar as possible while dissimilar clusters are associated with each other. Many of the traditional cluster algorithms have limitations in dealing with data sets containing categorical properties. To detect anomalies in categorical data, fuzzy clustering approach can be used with its advantages. The fuzzy k-Mode (FKM) clustering algorithm, which is one of the fuzzy clustering approaches, by extension to the k-means algorithm, is reported for clustering datasets with categorical values. It is a form of clustering: each point can be associated with more than one cluster. In this paper, anomaly detection is performed on two simulated data by using the FKM cluster algorithm. As a significance of the study, the FKM cluster algorithm allows to determine anomalies with their abnormality degree in contrast to numerous anomaly detection algorithms. According to the results, the FKM cluster algorithm illustrated good performance in the anomaly detection of data, including both one anomaly and more than one anomaly.

Keywords: fuzzy k-mode clustering, anomaly detection, noise, categorical data

Procedia PDF Downloads 21
19626 K-Means Clustering-Based Infinite Feature Selection Method

Authors: Seyyedeh Faezeh Hassani Ziabari, Sadegh Eskandari, Maziar Salahi

Abstract:

Infinite Feature Selection (IFS) algorithm is an efficient feature selection algorithm that selects a subset of features of all sizes (including infinity). In this paper, we present an improved version of it, called clustering IFS (CIFS), by clustering the dataset in advance. To do so, first, we apply the K-means algorithm to cluster the dataset, then we apply IFS. In the CIFS method, the spatial and temporal complexities are reduced compared to the IFS method. Experimental results on 6 datasets show the superiority of CIFS compared to IFS in terms of accuracy, running time, and memory consumption.

Keywords: feature selection, infinite feature selection, clustering, graph

Procedia PDF Downloads 95
19625 Fuzzy Optimization Multi-Objective Clustering Ensemble Model for Multi-Source Data Analysis

Authors: C. B. Le, V. N. Pham

Abstract:

In modern data analysis, multi-source data appears more and more in real applications. Multi-source data clustering has emerged as a important issue in the data mining and machine learning community. Different data sources provide information about different data. Therefore, multi-source data linking is essential to improve clustering performance. However, in practice multi-source data is often heterogeneous, uncertain, and large. This issue is considered a major challenge from multi-source data. Ensemble is a versatile machine learning model in which learning techniques can work in parallel, with big data. Clustering ensemble has been shown to outperform any standard clustering algorithm in terms of accuracy and robustness. However, most of the traditional clustering ensemble approaches are based on single-objective function and single-source data. This paper proposes a new clustering ensemble method for multi-source data analysis. The fuzzy optimized multi-objective clustering ensemble method is called FOMOCE. Firstly, a clustering ensemble mathematical model based on the structure of multi-objective clustering function, multi-source data, and dark knowledge is introduced. Then, rules for extracting dark knowledge from the input data, clustering algorithms, and base clusterings are designed and applied. Finally, a clustering ensemble algorithm is proposed for multi-source data analysis. The experiments were performed on the standard sample data set. The experimental results demonstrate the superior performance of the FOMOCE method compared to the existing clustering ensemble methods and multi-source clustering methods.

Keywords: clustering ensemble, multi-source, multi-objective, fuzzy clustering

Procedia PDF Downloads 145
19624 ACOPIN: An ACO Algorithm with TSP Approach for Clustering Proteins in Protein Interaction Networks

Authors: Jamaludin Sallim, Rozlina Mohamed, Roslina Abdul Hamid

Abstract:

In this paper, we proposed an Ant Colony Optimization (ACO) algorithm together with Traveling Salesman Problem (TSP) approach to investigate the clustering problem in Protein Interaction Networks (PIN). We named this combination as ACOPIN. The purpose of this work is two-fold. First, to test the efficacy of ACO in clustering PIN and second, to propose the simple generalization of the ACO algorithm that might allow its application in clustering proteins in PIN. We split this paper to three main sections. First, we describe the PIN and clustering proteins in PIN. Second, we discuss the steps involved in each phase of ACO algorithm. Finally, we present some results of the investigation with the clustering patterns.

Keywords: ant colony optimization algorithm, searching algorithm, protein functional module, protein interaction network

Procedia PDF Downloads 571
19623 A Weighted K-Medoids Clustering Algorithm for Effective Stability in Vehicular Ad Hoc Networks

Authors: Rejab Hajlaoui, Tarek Moulahi, Hervé Guyennet

Abstract:

In a highway scenario, the vehicle speed can exceed 120 kmph. Therefore, any vehicle can enter or leave the network within a very short time. This mobility adversely affects the network connectivity and decreases the life time of all established links. To ensure an effective stability in vehicular ad hoc networks with minimum broadcasting storm, we have developed a weighted algorithm based on the k-medoids clustering algorithm (WKCA). Indeed, the number of clusters and the initial cluster heads will not be selected randomly as usual, but considering the available transmission range and the environment size. Then, to ensure optimal assignment of nodes to clusters in both k-medoids phases, the combined weight of any node will be computed according to additional metrics including direction, relative speed and proximity. Empirical results prove that in addition to the convergence speed that characterizes the k-medoids algorithm, our proposed model performs well both AODV-Clustering and OLSR-Clustering protocols under different densities and velocities in term of end-to-end delay, packet delivery ratio, and throughput.

Keywords: communication, clustering algorithm, k-medoids, sensor, vehicular ad hoc network

Procedia PDF Downloads 205
19622 The Modelling of Real Time Series Data

Authors: Valeria Bondarenko

Abstract:

We proposed algorithms for: estimation of parameters fBm (volatility and Hurst exponent) and for the approximation of random time series by functional of fBm. We proved the consistency of the estimators, which constitute the above algorithms, and proved the optimal forecast of approximated time series. The adequacy of estimation algorithms, approximation, and forecasting is proved by numerical experiment. During the process of creating software, the system has been created, which is displayed by the hierarchical structure. The comparative analysis of proposed algorithms with the other methods gives evidence of the advantage of approximation method. The results can be used to develop methods for the analysis and modeling of time series describing the economic, physical, biological and other processes.

Keywords: mathematical model, random process, Wiener process, fractional Brownian motion

Procedia PDF Downloads 323