Search results for: Twitter data clustering
25470 Twitter's Impact on Print Media with Respect to Real World Events
Authors: Basit Shahzad, Abdullatif M. Abdullatif
Abstract:
Recent advancements in Information and Communication Technologies (ICT) and easy access to the Internet have made social media the first choice for sharing information related to important events or news. On Twitter, a trend is a common feature that quantifies the level of popularity of a certain news item or event. In this work, we examine the impact of Twitter trends on real-world events by hypothesizing that Twitter trends have an influence on print media in Pakistan. For this, Twitter is used as a platform and Twitter trends as a baseline. We first collect data from two sources (Twitter trends and print media) in the period May to August 2016. The data obtained from the two sources is analyzed, and it is observed that social media is significantly influencing the print media and that the majority of the news printed in newspapers is posted on Twitter earlier.
Keywords: twitter trends, text mining, effectiveness of trends, print media
Procedia PDF Downloads 256
25469 An Approach for Pattern Recognition and Prediction of Information Diffusion Model on Twitter
Authors: Amartya Hatua, Trung Nguyen, Andrew Sung
Abstract:
In this paper, we study the information diffusion process on Twitter as a multivariate time series problem. Our model concerns three measures (volume, network influence, and sentiment of tweets) based on 10 features, and we collected 27 million tweets to build our information diffusion time series dataset for analysis. Then, different time series clustering techniques with Dynamic Time Warping (DTW) distance were used to identify different patterns of information diffusion. Finally, we built information diffusion prediction models for new hashtags, which comprise two phases: the first phase recognizes the pattern using k-NN with DTW distance; the second phase builds the forecasting model using the traditional Autoregressive Integrated Moving Average (ARIMA) model and the non-linear Long Short-Term Memory (LSTM) recurrent neural network. Preliminary results of the performance evaluation of the different forecasting models show that LSTM with clustering information notably outperforms the other models. Therefore, our approach can be applied in real-world applications to analyze and predict the information diffusion characteristics of selected topics or memes (hashtags) on Twitter.
Keywords: ARIMA, DTW, information diffusion, LSTM, RNN, time series clustering, time series forecasting, Twitter
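The pattern-recognition phase described in this abstract (k-NN with DTW distance) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes univariate series, an absolute-difference local cost, and k = 1, and the names `dtw_distance` and `nearest_pattern` are hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Uses the textbook O(len(a) * len(b)) recurrence with |x - y| as the
    local cost (an assumption; the abstract does not state the cost).
    """
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nearest_pattern(series, labelled):
    """1-NN: return the label of the labelled series closest under DTW.

    `labelled` is a list of (series, label) pairs.
    """
    return min(labelled, key=lambda item: dtw_distance(series, item[0]))[1]


patterns = [([0, 0, 1, 5, 1, 0], "spike"), ([1, 1, 1, 1, 1], "flat")]
label = nearest_pattern([0, 1, 5, 1, 0], patterns)
```

Because DTW allows one sample to align with several samples of the other series, a pattern that is merely shifted or stretched in time can still match at low cost, which is why it suits hashtag series of different durations.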
Procedia PDF Downloads 390
25468 Visualization and Performance Measure to Determine Number of Topics in Twitter Data Clustering Using Hybrid Topic Modeling
Authors: Moulana Mohammed
Abstract:
Topic models have been widely used to build clusters of documents for more than a decade, yet problems remain in choosing the optimal number of topics. The main problem is the lack of a stable metric of the quality of the topics obtained during the construction of topic models. From an analysis of previous works, the authors found that most of the models used in determining the number of topics are non-parametric, with the quality of topics determined using perplexity and coherence measures, and concluded that they are not applicable in solving this problem. In this paper, we used the parametric method, which is an extension of the traditional topic model with visual access tendency for visualization of the number of topics (clusters), to complement clustering and to choose the optimal number of topics based on the results of cluster validity indices. The developed hybrid topic models are demonstrated with different Twitter datasets on various topics in obtaining the optimal number of topics and in measuring the quality of clusters. The experimental results showed that the Visual Non-negative Matrix Factorization (VNMF) topic model performs well in determining the optimal number of topics with interactive visualization and in measuring the quality of clusters with validity indices.
Keywords: interactive visualization, visual non-negative matrix factorization model, optimal number of topics, cluster validity indices, Twitter data clustering
Procedia PDF Downloads 133
25467 Topic Modelling Using Latent Dirichlet Allocation and Latent Semantic Indexing on SA Telco Twitter Data
Authors: Phumelele Kubheka, Pius Owolawi, Gbolahan Aiyetoro
Abstract:
Twitter is one of the most popular social media platforms where users can share their opinions on different subjects. As of 2010, the Twitter platform generates more than 12 terabytes of data daily, or about 4.3 petabytes in a single year. For this reason, Twitter is a great source for big data mining. Many industries, such as telecommunication companies, can leverage the availability of Twitter data to better understand their markets and make appropriate business decisions. This study performs topic modeling on Twitter data using Latent Dirichlet Allocation (LDA). The obtained results are benchmarked against another topic modeling technique, Latent Semantic Indexing (LSI). The study aims to retrieve topics from a Twitter dataset containing user tweets on South African telcos. Results from this study show that LSI is much faster than LDA. However, LDA yields better results, with topic coherence higher by 8% for the best-performing model represented in Table 1. A higher topic coherence score indicates better performance of the model.
Keywords: big data, latent Dirichlet allocation, latent semantic indexing, telco, topic modeling, twitter
Procedia PDF Downloads 149
25466 Hierarchical Clustering Algorithms in Data Mining
Authors: Z. Abdullah, A. R. Hamdan
Abstract:
Clustering is a process of grouping objects and data into clusters so that data objects from the same cluster are similar to each other. Clustering algorithms are one of the areas of data mining, and they can be classified into partition, hierarchical, density-based, and grid-based. Therefore, in this paper, we survey and review four major hierarchical clustering algorithms called CURE, ROCK, CHAMELEON, and BIRCH. The obtained state of the art of these algorithms will help in eliminating the current problems, as well as in deriving more robust and scalable algorithms for clustering.
Keywords: clustering, unsupervised learning, algorithms, hierarchical
Procedia PDF Downloads 884
25465 Analysis of Urban Population Using Twitter Distribution Data: Case Study of Makassar City, Indonesia
Authors: Yuyun Wabula, B. J. Dewancker
Abstract:
In the past decade, social networking apps have been growing very rapidly. Geolocation data is one of the important features of social media, as it can attach the user's real-world location coordinates. This paper proposes the use of geolocation data from the Twitter social media application to gain knowledge about urban dynamics, especially human mobility behavior. This paper aims to explore the relation between Twitter geolocation and the presence of people in the urban area. Firstly, the study analyzes the spread of people in a particular area within the city using Twitter social media data. Secondly, we match and categorize the existing places based on the same individuals visiting them. Then, we combine the Twitter data from the tracking result with the questionnaire data to capture the Twitter user profile. To do that, we used distribution frequency analysis to learn the visitors’ percentage. To validate the hypothesis, we compare it with the local population statistics and the land use mapping released by the city planning department of the Makassar local government. The results show that there is a correlation between Twitter geolocation and questionnaire data. Thus, integrating Twitter data and survey data can reveal the profile of social media users.
Keywords: geolocation, Twitter, distribution analysis, human mobility
Procedia PDF Downloads 314
25464 Survey on Arabic Sentiment Analysis in Twitter
Authors: Sarah O. Alhumoud, Mawaheb I. Altuwaijri, Tarfa M. Albuhairi, Wejdan M. Alohaideb
Abstract:
Large-scale data stream analysis has become one of the important business and research priorities lately. Social networks like Twitter and other micro-blogging platforms hold an enormous amount of data that is large in volume, velocity and variety. Extracting valuable information and trends out of these data would aid in a better understanding and decision-making. Multiple analysis techniques are deployed for English content. Moreover, one of the languages that produce a large amount of data over social networks and is least analyzed is the Arabic language. The proposed paper is a survey on the research efforts to analyze the Arabic content in Twitter focusing on the tools and methods used to extract the sentiments for the Arabic content on Twitter.
Keywords: big data, social networks, sentiment analysis, twitter
Procedia PDF Downloads 575
25463 Flowing Online Vehicle GPS Data Clustering Using a New Parallel K-Means Algorithm
Authors: Orhun Vural, Oguz Bayat, Rustu Akay, Osman N. Ucan
Abstract:
This study presents a new parallel approach to clustering GPS data. The evaluation compares the execution time of various clustering algorithms on GPS data. This paper proposes a neighborhood-based parallel K-means algorithm to make clustering faster. The proposed parallelization approach assumes that each GPS data point represents a vehicle and that vehicles close to each other communicate after the vehicles are clustered. This parallelization approach has been examined on continuously changing GPS data of different sizes and compared with the serial K-means algorithm and other serial clustering algorithms. The results demonstrated that the proposed parallel K-means algorithm works much faster than the other clustering algorithms.
Keywords: parallel k-means algorithm, parallel clustering, clustering algorithms, clustering on flowing data
Procedia PDF Downloads 220
25462 Fuzzy Optimization Multi-Objective Clustering Ensemble Model for Multi-Source Data Analysis
Authors: C. B. Le, V. N. Pham
Abstract:
In modern data analysis, multi-source data appears more and more in real applications. Multi-source data clustering has emerged as an important issue in the data mining and machine learning community. Different data sources provide information about different data. Therefore, multi-source data linking is essential to improve clustering performance. However, in practice multi-source data is often heterogeneous, uncertain, and large. This issue is considered a major challenge of multi-source data. Ensemble is a versatile machine learning model in which learning techniques can work in parallel, with big data. Clustering ensemble has been shown to outperform any standard clustering algorithm in terms of accuracy and robustness. However, most of the traditional clustering ensemble approaches are based on a single-objective function and single-source data. This paper proposes a new clustering ensemble method for multi-source data analysis. The fuzzy optimized multi-objective clustering ensemble method is called FOMOCE. Firstly, a clustering ensemble mathematical model based on the structure of multi-objective clustering function, multi-source data, and dark knowledge is introduced. Then, rules for extracting dark knowledge from the input data, clustering algorithms, and base clusterings are designed and applied. Finally, a clustering ensemble algorithm is proposed for multi-source data analysis. The experiments were performed on the standard sample data set. The experimental results demonstrate the superior performance of the FOMOCE method compared to the existing clustering ensemble methods and multi-source clustering methods.
Keywords: clustering ensemble, multi-source, multi-objective, fuzzy clustering
Procedia PDF Downloads 188
25461 Performance Analysis of Hierarchical Agglomerative Clustering in a Wireless Sensor Network Using Quantitative Data
Authors: Tapan Jain, Davender Singh Saini
Abstract:
Clustering is a useful mechanism in wireless sensor networks which helps to cope with scalability and data transmission problems. The basic aim of our research work is to provide efficient clustering using hierarchical agglomerative clustering (HAC). If the distance between the sensing nodes is calculated using their location, then it is quantitative HAC. This paper compares the various agglomerative clustering techniques applied in a wireless sensor network using the quantitative data. The simulations are done in MATLAB, and the comparisons between the different protocols are made using dendrograms.
Keywords: routing, hierarchical clustering, agglomerative, quantitative, wireless sensor network
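A minimal sketch of quantitative HAC, where the distance between sensing nodes comes from their locations, is given here. It assumes 2-D coordinates, Euclidean distance, and single linkage, which is only one of the agglomerative variants such a comparison could cover; the function name `single_linkage` is hypothetical, and the paper's own simulations are in MATLAB rather than Python.

```python
from math import dist


def single_linkage(points, num_clusters):
    """Merge the two closest clusters until `num_clusters` remain.

    `points` are node coordinates; clusters are lists of point indices.
    The cluster-to-cluster distance is the smallest pairwise Euclidean
    distance between their members (single linkage).
    """
    target = max(1, num_clusters)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge b into a
        del clusters[b]
    return clusters


nodes = [(0, 0), (0, 1), (10, 10), (10, 11)]
groups = single_linkage(nodes, 2)
```

Recording the merge distance at each step, instead of stopping at a fixed count, gives the information a dendrogram plots.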
Procedia PDF Downloads 614
25460 A Fuzzy Kernel K-Medoids Algorithm for Clustering Uncertain Data Objects
Authors: Behnam Tavakkol
Abstract:
Uncertain data mining algorithms use different ways to consider uncertainty in data such as by representing a data object as a sample of points or a probability distribution. Fuzzy methods have long been used for clustering traditional (certain) data objects. They are used to produce non-crisp cluster labels. For uncertain data, however, besides some uncertain fuzzy k-medoids algorithms, not many other fuzzy clustering methods have been developed. In this work, we develop a fuzzy kernel k-medoids algorithm for clustering uncertain data objects. The developed fuzzy kernel k-medoids algorithm is superior to existing fuzzy k-medoids algorithms in clustering data sets with non-linearly separable clusters.
Keywords: clustering algorithm, fuzzy methods, kernel k-medoids, uncertain data
Procedia PDF Downloads 215
25459 Improved K-Means Clustering Algorithm Using RHadoop with Combiner
Authors: Ji Eun Shin, Dong Hoon Lim
Abstract:
Data clustering is a common technique used in data analysis and is used in many applications, such as artificial intelligence, pattern recognition, economics, ecology, psychiatry and marketing. K-means clustering is a well-known clustering algorithm that aims to partition a set of data points into a predefined number of clusters. In this paper, we implement the K-means algorithm on the MapReduce framework with RHadoop to make the clustering method applicable to large-scale data. RHadoop is a collection of R packages that allow users to manage and analyze data with Hadoop. The main idea is to introduce a combiner that operates on our map output to decrease the amount of data that needs to be processed by the reducers. The experimental results demonstrated that the K-means algorithm using RHadoop can scale well and efficiently process large data sets on commodity hardware. We also showed that our K-means algorithm using RHadoop with a combiner was faster than the regular algorithm without a combiner as the size of the data set increases.
Keywords: big data, combiner, K-means clustering, RHadoop
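The combiner idea in this abstract can be sketched without Hadoop. The plain-Python simulation below is an illustration under stated assumptions, not the RHadoop code: each input split is processed by one mapper whose combiner collapses its output to a single (coordinate sum, count) pair per centroid, so the reducer receives at most k records per split instead of one record per point. All function names are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))


def map_and_combine(split, centroids):
    """Mapper plus combiner for one split: {centroid index: (sum vector, count)}."""
    partial = {}
    for point in split:
        k = nearest(point, centroids)
        total, count = partial.get(k, ([0.0] * len(point), 0))
        partial[k] = ([t + p for t, p in zip(total, point)], count + 1)
    return partial


def reduce_centroids(partials, centroids):
    """Reducer: merge the partial sums and divide by the counts."""
    merged = {}
    for partial in partials:
        for k, (total, count) in partial.items():
            acc, n = merged.get(k, ([0.0] * len(total), 0))
            merged[k] = ([a + t for a, t in zip(acc, total)], n + count)
    # A centroid that attracted no points keeps its previous position.
    return [[t / merged[k][1] for t in merged[k][0]] if k in merged else list(centroids[k])
            for k in range(len(centroids))]


def kmeans_step(splits, centroids):
    """One K-means iteration over all splits."""
    return reduce_centroids([map_and_combine(s, centroids) for s in splits], centroids)


splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
updated = kmeans_step(splits, [(1.0, 1.0), (9.0, 9.0)])
```

The combiner is valid here because sums and counts are associative: merging partial (sum, count) pairs yields the same mean as processing every point in the reducer.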
Procedia PDF Downloads 438
25458 EFL Saudi Students' Use of Vocabulary via Twitter
Authors: A. Alshabeb
Abstract:
Vocabulary is one of the elements that links the four skills of reading, writing, speaking, and listening and is very critical in learning a foreign language. This study aims to determine how Saudi Arabian EFL students learn English vocabulary via Twitter. The study adopts a mixed sequential research design in collecting and analysing data. The results of the study provide several recommendations for vocabulary learning. Moreover, the study can help teachers to consider the possibilities of using Twitter further, and perhaps to develop new approaches to vocabulary teaching and to support students in their use of social media.
Keywords: social media, twitter, vocabulary, web 2
Procedia PDF Downloads 417
25457 Effects of Twitter Interactions on Self-Esteem and Narcissistic Behaviour
Authors: Leena-Maria Alyedreessy
Abstract:
Self-esteem is thought to be determined by one’s own feeling of being included, liked and accepted by others. This research explores whether this concept may also be applied in the virtual world and assesses whether there is any relationship between Twitter users' self-esteem and the number of interactions they receive. Twenty female Arab participants were asked to fill out a survey about their Twitter interactions and their feelings of having an imagined audience, and to complete a Rosenberg Self-Esteem Assessment. After completion and statistical analysis, the results showed that high self-esteem correlated significantly with the feeling of being Twitter elite, the feeling of having a lot of people listening to one's tweets, and having a lot of interactions. However, no correlations were detected between low self-esteem and low interactions.
Keywords: twitter, social media, self-esteem, narcissism, interactions
Procedia PDF Downloads 410
25456 Text Mining of Twitter Data Using a Latent Dirichlet Allocation Topic Model and Sentiment Analysis
Authors: Sidi Yang, Haiyi Zhang
Abstract:
Twitter is a microblogging platform where millions of users share their attitudes, views, and opinions daily. Using a probabilistic Latent Dirichlet Allocation (LDA) topic model to discern the most popular topics in Twitter data is an effective way to analyze a large set of tweets and find a set of topics in a computationally efficient manner. Sentiment analysis provides an effective method to show the emotions and sentiments found in each tweet and an efficient way to summarize the results in a manner that is clearly understood. The primary goal of this paper is to explore text mining and to extract and analyze useful information from unstructured text using two approaches, LDA topic modelling and sentiment analysis, by examining Twitter plain text data in English. These two methods allow people to dig into data more effectively and efficiently. The LDA topic model and sentiment analysis can also be applied to provide insights in business and scientific fields.
Keywords: text mining, Twitter, topic model, sentiment analysis
Procedia PDF Downloads 177
25455 Collision Theory Based Sentiment Detection Using Discourse Analysis in Hadoop
Authors: Anuta Mukherjee, Saswati Mukherjee
Abstract:
Data is growing every day. Social networking sites such as Twitter are becoming an integral part of our daily lives, contributing to a large increase in the growth of data. It is a rich source, especially for sentiment detection or mining, since people often express honest opinions through tweets. However, although sentiment analysis is a well-researched topic in text, this analysis using Twitter data poses additional challenges, since these are unstructured data with abbreviations and without strict grammatical correctness. We have employed collision theory to achieve sentiment analysis in Twitter data. We have also incorporated discourse analysis in the collision theory based model to detect accurate sentiment from tweets. We have also used the retweet field to assign weights to certain tweets and obtained the overall weightage of a topic provided in the form of a query. Hadoop has been exploited for speed. Our experiments show effective results.
Keywords: sentiment analysis, twitter, collision theory, discourse analysis
Procedia PDF Downloads 534
25454 Twitter: The New Marketing Communication Tools
Authors: Mansur Ahmed Kazaure
Abstract:
The emergence of internet-based social media has made it possible for one person to communicate with hundreds or even thousands of people about a company's goods and services and the companies that provide them. Thus, the impact of customer-to-customer communications has been significantly magnified in the marketplace. Therefore, the essence of this paper is to critically evaluate the literature on social media and its implications for practice, with particular attention to Twitter as a new marketing communication tool. The author found that, despite the implications of companies using social media, especially Twitter, as part of their marketing communication tools, it can still enhance the opportunity for companies to develop and maintain long-term customer relationships. The paper concludes that using Twitter as a marketing communication tool is a market trend and the best way for marketers to add value for their customers; moreover, with Twitter, marketers can get feedback about the performance of their product and its brand in the marketplace. The paper is purely a conceptual discourse based on secondary data.
Keywords: social media, marketing communication, marketing communication tools, Twitter, Facebook
Procedia PDF Downloads 471
25453 Anomaly Detection Based Fuzzy K-Mode Clustering for Categorical Data
Authors: Murat Yazici
Abstract:
Anomalies are irregularities found in data that do not adhere to a well-defined standard of normal behavior. The identification of outliers or anomalies in data has been a subject of study within the statistics field since the 1800s. Over time, a variety of anomaly detection techniques have been developed in several research communities. Cluster analysis can be used to detect anomalies. It is the process of assigning data to clusters so that the members of a cluster are as similar as possible while dissimilar data fall into different clusters. Many of the traditional cluster algorithms have limitations in dealing with data sets containing categorical properties. To detect anomalies in categorical data, a fuzzy clustering approach can be used with its advantages. The fuzzy k-Mode (FKM) clustering algorithm, which is one of the fuzzy clustering approaches and an extension of the k-means algorithm, is reported for clustering datasets with categorical values. It is a form of clustering in which each point can be associated with more than one cluster. In this paper, anomaly detection is performed on two simulated data sets by using the FKM cluster algorithm. As a significance of the study, the FKM cluster algorithm allows anomalies to be determined together with their degree of abnormality, in contrast to numerous anomaly detection algorithms. According to the results, the FKM cluster algorithm showed good performance in the anomaly detection of data, including both one anomaly and more than one anomaly.
Keywords: fuzzy k-mode clustering, anomaly detection, noise, categorical data
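To make the statement that each point can be associated with more than one cluster concrete, here is a minimal sketch of the membership update commonly used in fuzzy k-modes. It assumes the simple matching dissimilarity (number of mismatched attributes) and a fuzziness exponent `alpha` greater than 1. How the paper derives an abnormality degree from the memberships is not stated in the abstract, so that step is not shown; the function names are hypothetical.

```python
def mismatches(record, mode):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(1 for a, b in zip(record, mode) if a != b)


def fuzzy_memberships(record, modes, alpha=2.0):
    """Membership of one categorical record in each cluster mode.

    u_k = 1 / sum_j (d_k / d_j) ** (1 / (alpha - 1)), with the usual
    special case that a record identical to a mode belongs fully to it.
    Splitting evenly between several identical modes is an assumption
    made here for definiteness.
    """
    distances = [mismatches(record, mode) for mode in modes]
    zero_hits = distances.count(0)
    if zero_hits:
        return [1.0 / zero_hits if d == 0 else 0.0 for d in distances]
    power = 1.0 / (alpha - 1.0)
    return [1.0 / sum((d_k / d_j) ** power for d_j in distances)
            for d_k in distances]


modes = [("a", "b", "c"), ("x", "y", "z")]
u = fuzzy_memberships(("a", "b", "z"), modes)
```

The memberships of a record always sum to 1, so a record lying between two modes receives a split such as roughly two thirds and one third rather than a single crisp label.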
Procedia PDF Downloads 51
25452 Extracting Actions with Improved Part of Speech Tagging for Social Networking Texts
Authors: Yassine Jamoussi, Ameni Youssfi, Henda Ben Ghezala
Abstract:
With the growing interest in social networking, the interaction of social actors has evolved into a source of knowledge in which it becomes possible to perform context-aware reasoning. Information extraction from social networking, especially Twitter and Facebook, is one of the problems in this area. To extract text from social networking, we need several lexical features and large-scale word clustering. We attempt to expand an existing tokenizer and to develop our own tagger in order to support the incorrect words currently in existence on Facebook and Twitter. Our goal in this work is to benefit from the lexical features developed for Twitter and online conversational text in previous works, and to develop an extraction model for constructing a huge knowledge base built on actions.
Keywords: social networking, information extraction, part-of-speech tagging, natural language processing
Procedia PDF Downloads 304
25451 Finding Bicluster on Gene Expression Data of Lymphoma Based on Singular Value Decomposition and Hierarchical Clustering
Authors: Alhadi Bustaman, Soeganda Formalidin, Titin Siswantining
Abstract:
DNA microarray technology is used to analyze thousands of gene expression data simultaneously and is a very important task for drug development and testing, function annotation, and cancer diagnosis. Various clustering methods have been used for analyzing gene expression data. However, when analyzing very large and heterogeneous collections of gene expression data, conventional clustering methods often cannot produce a satisfactory solution. The biclustering algorithm has been used as an alternative approach to identifying structures in gene expression data. In this paper, we introduce a transform technique based on singular value decomposition to identify the normalized matrix of gene expression data, followed by the Mixed-Clustering algorithm and the Lift algorithm, inspired by the node-deletion and node-addition phases proposed by Cheng and Church, based on Agglomerative Hierarchical Clustering (AHC). An experimental study on standard datasets demonstrated the effectiveness of the algorithm on gene expression data.
Keywords: agglomerative hierarchical clustering (AHC), biclustering, gene expression data, lymphoma, singular value decomposition (SVD)
Procedia PDF Downloads 276
25450 A Framework for Analyzing Public Interaction of Saudi Universities on Twitter
Authors: Sahar Al-Qahtani, Rabeeh Ayaz Abbasi, Naif Radi Aljohani
Abstract:
Many universities use social media platforms as new communication channels to disseminate information and promptly communicate with their audience. As Twitter is one of the widely used social media platforms, this research aims to explore the adoption and utilization of Twitter by universities. We propose a framework called 'Social Network Analysis for Universities on Twitter' (SNAUT) to analyze the usage of Twitter by universities and to measure their interaction with the public. The study includes a sample of around 110,000 tweets from 36 Saudi universities, including both public and private universities. Using SNAUT, we can (1) investigate the purpose of using Twitter by universities, (2) determine the broad topics discussed by them, and (3) identify the groups closely associated with the universities. The results show that most of the Saudi universities (whether public or private) actively use Twitter. Results also reveal that public universities respond to public queries more frequently, but private universities stand out more in terms of information dissemination using retweets and diverse hashtags. Finally, we develop a ranking mechanism in SNAUT for ranking universities based on their social interaction with the public on Twitter.
Keywords: social media, twitter, social network analysis, universities, higher education, Saudi Arabia
Procedia PDF Downloads 135
25449 Mitigating the Negative Effect of Intrabrand Clustering: The Role of Interbrand Clustering and Firm Size
Authors: Moeen Naseer Butt
Abstract:
Clustering (geographic concentrations of entities) has recently received more attention in marketing research and has been shown to affect multiple outcomes. This study investigates the impact of intrabrand clustering (clustering of same-brand outlets) on an outlet’s quality performance. Further, it assesses the moderating effects of interbrand clustering (clustering of other-brand outlets) and firm size. An examination of approximately 21,000 food service establishments in New York State in 2019 finds that the impact of intrabrand clustering on an outlet’s quality performance is context-dependent. Specifically, intrabrand clustering decreases, whereas interbrand clustering and firm size help increase, the outlet’s performance. Additionally, this study finds that the role of firm size is more substantial than that of interbrand clustering in mitigating the adverse effects of intrabrand clustering on outlet quality performance.
Keywords: intrabrand clustering, interbrand clustering, firm size, brand competition, outlet performance, quality violations
Procedia PDF Downloads 188
25448 A Non-parametric Clustering Approach for Multivariate Geostatistical Data
Authors: Francky Fouedjio
Abstract:
Multivariate geostatistical data have become omnipresent in the geosciences and pose substantial analysis challenges. One of them is the grouping of data locations into spatially contiguous clusters so that data locations within the same cluster are more similar while clusters are different from each other, in some sense. Spatially contiguous clusters can significantly improve the interpretation that turns the resulting clusters into meaningful geographical subregions. In this paper, we develop an agglomerative hierarchical clustering approach that takes into account the spatial dependency between observations. It relies on a dissimilarity matrix built from a non-parametric kernel estimator of the spatial dependence structure of data. It integrates existing methods to find the optimal cluster number and to evaluate the contribution of variables to the clustering. The capability of the proposed approach to provide spatially compact, connected and meaningful clusters is assessed using a bivariate synthetic dataset and a multivariate geochemical dataset. The proposed clustering method gives satisfactory results compared to other similar geostatistical clustering methods.
Keywords: clustering, geostatistics, multivariate data, non-parametric
Procedia PDF Downloads 476
25447 The Polarization on Twitter and COVID-19 Vaccination in Brazil
Authors: Giselda Cristina Ferreira, Carlos Alberto Kamienski, Ana Lígia Scott
Abstract:
The COVID-19 pandemic has enhanced the anti-vaccination movement in Brazil, supported by unscientific theories and false news and by the possibility of wide communication through social networks such as Twitter, Facebook, and YouTube. The World Health Organization (WHO) classified the large volume of information on the subject against COVID-19 as an Infodemic. In this paper, we present a protocol to identify polarizing users (called polarizers) and study the profiles of Brazilian polarizers on Twitter (renamed to X some weeks ago). We analyzed polarizing interactions on Twitter (in Portuguese) to identify the main polarizers and how the conflicts they caused influenced the COVID-19 vaccination rate throughout the pandemic. This protocol uses data from this social network, graph theory, and Java and R-studio scripts to model and analyze the data. The information about the vaccination rate was obtained from a public government database called OpenDataSus. The results present the profiles of Twitter’s polarizers (political position, gender, professional activity, immunization opinions). We observed that social and political events influenced the participation of these different profiles in conflicts and the vaccination rate.
Keywords: Twitter, polarization, vaccine, Brazil
Procedia PDF Downloads 75
25446 Investigation of Clustering Algorithms Used in Wireless Sensor Networks
Authors: Naim Karasekreter, Ugur Fidan, Fatih Basciftci
Abstract:
Wireless sensor networks are networks in which more than one sensor node is organized among themselves. The working principle is based on the transfer of the sensed data over the other nodes in the network to the central station. Research on wireless sensor networks concentrates on routing algorithms, energy efficiency and clustering algorithms. In the clustering method, the nodes in the network are divided into clusters using different parameters, and the most suitable cluster head is selected from among them. The data to be sent to the center is collected per cluster, and the cluster head transmits it to the center. With this method, the network traffic is reduced and the energy efficiency of the nodes is increased. In this study, clustering algorithms were examined in terms of clustering performance and cluster head selection characteristics to try to identify their weak and strong sides. This work is supported by the Project 17.Kariyer.123 of Afyon Kocatepe University BAP Commission.
Keywords: wireless sensor networks (WSN), clustering algorithm, cluster head, clustering
Procedia PDF Downloads 512
25445 Multimodal Optimization of Density-Based Clustering Using Collective Animal Behavior Algorithm
Authors: Kristian Bautista, Ruben A. Idoy
Abstract:
A bio-inspired metaheuristic algorithm inspired by the theory of collective animal behavior (CAB) was integrated into density-based clustering modeled as a multimodal optimization problem. The algorithm was tested on synthetic, Iris, Glass, Pima and Thyroid data sets in order to measure its effectiveness relative to the CDE-based clustering algorithm. Upon preliminary testing, it was found that one of the parameter settings used was ineffective in performing clustering when applied to the algorithm, prompting the researcher to investigate. It was revealed that fine-tuning the distance δ3, which determines the extent to which a given data point will be clustered, helped improve the quality of the cluster output. Even though the modification of distance δ3 significantly improved the solution quality and cluster output of the algorithm, the results suggest that there is no difference between the population means of the solutions obtained using the original and modified parameter settings for all data sets. This implies that using either the original or the modified parameter setting will not have any effect on obtaining the best global and local animal positions. The results also suggest that the CDE-based clustering algorithm is better than the CAB-density clustering algorithm for all data sets. Nevertheless, the CAB-density clustering algorithm is still a good clustering algorithm because it correctly identified the number of classes of some data sets more frequently in a thirty-trial run with a much smaller standard deviation, which shows potential in clustering high-dimensional data sets. Thus, the researcher recommends further investigation of the post-processing stage of the algorithm.
Keywords: clustering, metaheuristics, collective animal behavior algorithm, density-based clustering, multimodal optimization
Procedia PDF Downloads 23025444 The Paralinguistic Function of Emojis in Twitter Communication
Authors: Yasmin Tantawi, Mary Beth Rosson
Abstract:
In response to the dearth of information about emoji use for different purposes in different settings, this paper investigates the paralinguistic function of emojis within Twitter communication in the United States. To conduct this investigation, the Twitter feeds from 16 population centers spread throughout the United States were collected from the Twitter public API. One hundred tweets were collected from each population center, for a total of 1,600 tweets. Tweets containing emojis were next extracted using the “emot” Python package; these were then analyzed via the IBM Watson API Natural Language Understanding module to identify the topics discussed. A manual content analysis was then conducted to ascertain the paralinguistic and emotional features of the emojis used in these tweets. We present our characterization of emoji usage in Twitter and discuss implications for the design of Twitter and other text-based communication tools.
Keywords: computer-mediated communication, content analysis, paralinguistics, sociology
Procedia PDF Downloads 160
25443 Social Media Mining with R. Twitter Analyses
Authors: Diana Codat
Abstract:
Tweet analysis is part of text mining. Each document is a written text, so it is possible to apply the usual text search techniques, in particular by switching to the bag-of-words representation. Tweets, however, have peculiarities. Some may enrich the analysis: their length is calibrated (at least as far as public messages are concerned); special characters make it possible to identify authors (@) and themes (#); and the tweet and retweet mechanisms make it possible to follow the diffusion of information. Conversely, other characteristics may disrupt the analysis. Because space is limited, authors often use abbreviations and emoticons to express feelings, and they do not pay much attention to spelling. All this creates noise that can complicate the task. Tweets carry a lot of potentially interesting information, and their exploitation is one of the main axes of social network analysis. We show how to access Twitter-related messages. We then begin a study of the properties of the tweets and follow up by exploiting the content of the messages. We work in R with the package 'twitteR'. The study of tweets is a strong focus of social network analysis because Twitter has become an important vector of communication. This example shows that it is easy to initiate an analysis from data extracted directly online. The data preparation phase is of great importance.
Keywords: data mining, language R, social networks, Twitter
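The tweet-specific features named in this abstract (author mentions marked with @, themes marked with #, retweets, and a bag-of-words for the remaining text) can be illustrated with a minimal sketch. It is written in Python with the standard library only, not in R with 'twitteR' as the abstract describes, and it works on a plain string rather than on data fetched from Twitter. The function name `tweet_features`, the regular expressions, and the "RT @" retweet convention are assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative patterns (assumptions, not taken from the paper).
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")
TOKEN = re.compile(r"[a-z']+")


def tweet_features(text):
    """Split one tweet into mentions, hashtags, a retweet flag and a bag of words."""
    mentions = MENTION.findall(text)
    hashtags = [tag.lower() for tag in HASHTAG.findall(text)]
    # Remove URLs, mentions and hashtags before counting the remaining words.
    body = URL.sub(" ", text)
    body = MENTION.sub(" ", body)
    body = HASHTAG.sub(" ", body)
    bag_of_words = Counter(TOKEN.findall(body.lower()))
    return {
        "mentions": mentions,
        "hashtags": hashtags,
        # Assumes the classic "RT @user: ..." convention for retweets.
        "is_retweet": text.startswith("RT @"),
        "bag_of_words": bag_of_words,
    }


features = tweet_features("RT @alice: Loving #Rstats and #DataMining today http://t.co/x")
print(features["mentions"], features["hashtags"], features["is_retweet"])
```

Abbreviations, emoticons and misspellings, which the abstract lists as sources of noise, are not handled here; a real pipeline would need a dedicated normalization step.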
Procedia PDF Downloads 184
25442 Harmonic Data Preparation for Clustering and Classification
Authors: Ali Asheibi
Abstract:
The rapid increase in the size of databases required to store power quality monitoring data has demanded new techniques for analysing and understanding the data. One suggested technique to assist in analysis is data mining. Preparing raw data for data mining exploration takes up most of the effort and time spent in the whole data mining process. Clustering is an important technique in data mining and machine learning in which underlying and meaningful groups of data are discovered. Large amounts of harmonic data have been collected over three years from an actual harmonic monitoring system in a distribution system in Australia. This amount of acquired data makes it difficult to identify operational events that significantly impact the harmonics generated on the system. In this paper, harmonic data preparation processes that lead to a better understanding of the data are presented. Underlying classes in this data have then been identified using a clustering technique based on the Minimum Message Length (MML) method. The underlying operational information contained within the clusters can be rapidly visualised by the engineers. The C5.0 algorithm was used for classification and interpretation of the generated clusters.
Keywords: data mining, harmonic data, clustering, classification
Procedia PDF Downloads 246
25441 A Relative Entropy Regularization Approach for Fuzzy C-Means Clustering Problem
Authors: Ouafa Amira, Jiangshe Zhang
Abstract:
Clustering is an unsupervised machine learning technique; its aim is to extract the structure of the data, such that similar data objects are grouped in the same cluster, whereas dissimilar objects are grouped in different clusters. Clustering methods are widely utilized in different fields, such as image processing, computer vision, and pattern recognition. Fuzzy c-means (FCM) clustering is one of the most well-known fuzzy clustering methods. It is based on solving an optimization problem in which a given cost function is minimized. This minimization aims to decrease the dissimilarity inside clusters, where the dissimilarity is measured by the distances between data objects and cluster centers. The degree to which a data point belongs to a cluster is measured by a membership function whose values lie in the interval [0, 1]. In FCM clustering, the membership degree is constrained by the condition that the sum of a data object’s memberships in all clusters must be equal to one. This constraint can cause several problems, especially when the data objects lie in a noisy space. Regularization approaches have been applied to the fuzzy c-means clustering technique. Regularization introduces additional information in order to solve an ill-posed optimization problem. In this study, we focus on regularization by the relative entropy approach, where in our optimization problem we aim to minimize the dissimilarity inside clusters. Finding an appropriate membership degree for each data object is our objective, because an appropriate membership degree leads to an accurate clustering result. Our clustering results on synthetic data sets, Gaussian-based data sets, and real-world data sets show that our proposed model achieves good accuracy.
Keywords: clustering, fuzzy c-means, regularization, relative entropy
Procedia PDF Downloads 258
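To make the membership constraint described in the fuzzy c-means abstract above concrete, the following is a minimal sketch of standard (unregularized) fuzzy c-means in plain Python. It is not the authors' relative entropy regularized model. The function name `fuzzy_c_means`, the fuzzifier default `m=2.0`, the random initialisation from data points, and the stopping tolerance are assumptions chosen for illustration.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means; returns (centers, memberships).

    memberships[j][i] is the degree to which point j belongs to cluster i,
    and each row sums to one (the constraint discussed in the abstract).
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Initialise centers from randomly chosen data points (illustrative choice).
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    memberships = []
    for _ in range(n_iter):
        # Membership update: inverse-distance weights, normalised per point.
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if min(dists) == 0.0:
                # The point coincides with a center: full membership there.
                zero = dists.index(0.0)
                row = [1.0 if i == zero else 0.0 for i in range(n_clusters)]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                row = [v / total for v in inv]
            memberships.append(row)
        # Center update: mean of the points weighted by membership ** m.
        new_centers = []
        for i in range(n_clusters):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            new_centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, memberships = fuzzy_c_means(data, n_clusters=2)
for point, row in zip(data, memberships):
    print(point, [round(u, 3) for u in row])
```

Because every row of memberships is forced to sum to one, a noisy point far from all centers still receives substantial membership in some cluster, which is the kind of problem the abstract attributes to this constraint.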