Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 33522

Search results for: big data analysis

33522 The Application of Data Mining Technology in Building Energy Consumption Data Analysis

Authors: Liang Zhao, Jili Zhang, Chongquan Zhong


Energy consumption data, in particular those involving public buildings, are impacted by many factors: the building structure, climate/environmental parameters, construction, system operating condition, and user behavior patterns. Traditional methods for data analysis are insufficient. This paper delves into the data mining technology to determine its application in the analysis of building energy consumption data including energy consumption prediction, fault diagnosis, and optimal operation. Recent literature are reviewed and summarized, the problems faced by data mining technology in the area of energy consumption data analysis are enumerated, and research points for future studies are given.

Keywords: data mining, data analysis, prediction, optimization, building operational performance

Procedia PDF Downloads 752
33521 Multi-Source Data Fusion for Urban Comprehensive Management

Authors: Bolin Hua


In city governance, various data are involved, including city component data, demographic data, housing data and all kinds of business data. These data reflects different aspects of people, events and activities. Data generated from various systems are different in form and data source are different because they may come from different sectors. In order to reflect one or several facets of an event or rule, data from multiple sources need fusion together. Data from different sources using different ways of collection raised several issues which need to be resolved. Problem of data fusion include data update and synchronization, data exchange and sharing, file parsing and entry, duplicate data and its comparison, resource catalogue construction. Governments adopt statistical analysis, time series analysis, extrapolation, monitoring analysis, value mining, scenario prediction in order to achieve pattern discovery, law verification, root cause analysis and public opinion monitoring. The result of Multi-source data fusion is to form a uniform central database, which includes people data, location data, object data, and institution data, business data and space data. We need to use meta data to be referred to and read when application needs to access, manipulate and display the data. A uniform meta data management ensures effectiveness and consistency of data in the process of data exchange, data modeling, data cleansing, data loading, data storing, data analysis, data search and data delivery.

Keywords: multi-source data fusion, urban comprehensive management, information fusion, government data

Procedia PDF Downloads 293
33520 Implementation of an IoT Sensor Data Collection and Analysis Library

Authors: Jihyun Song, Kyeongjoo Kim, Minsoo Lee


Due to the development of information technology and wireless Internet technology, various data are being generated in various fields. These data are advantageous in that they provide real-time information to the users themselves. However, when the data are accumulated and analyzed, more various information can be extracted. In addition, development and dissemination of boards such as Arduino and Raspberry Pie have made it possible to easily test various sensors, and it is possible to collect sensor data directly by using database application tools such as MySQL. These directly collected data can be used for various research and can be useful as data for data mining. However, there are many difficulties in using the board to collect data, and there are many difficulties in using it when the user is not a computer programmer, or when using it for the first time. Even if data are collected, lack of expert knowledge or experience may cause difficulties in data analysis and visualization. In this paper, we aim to construct a library for sensor data collection and analysis to overcome these problems.

Keywords: clustering, data mining, DBSCAN, k-means, k-medoids, sensor data

Procedia PDF Downloads 295
33519 Customer Data Analysis Model Using Business Intelligence Tools in Telecommunication Companies

Authors: Monica Lia


This article presents a customer data analysis model using business intelligence tools for data modelling, transforming, data visualization and dynamic reports building. Economic organizational customer’s analysis is made based on the information from the transactional systems of the organization. The paper presents how to develop the data model starting for the data that companies have inside their own operational systems. The owned data can be transformed into useful information about customers using business intelligence tool. For a mature market, knowing the information inside the data and making forecast for strategic decision become more important. Business Intelligence tools are used in business organization as support for decision-making.

Keywords: customer analysis, business intelligence, data warehouse, data mining, decisions, self-service reports, interactive visual analysis, and dynamic dashboards, use cases diagram, process modelling, logical data model, data mart, ETL, star schema, OLAP, data universes

Procedia PDF Downloads 310
33518 Predictive Analysis for Big Data: Extension of Classification and Regression Trees Algorithm

Authors: Ameur Abdelkader, Abed Bouarfa Hafida


Since its inception, predictive analysis has revolutionized the IT industry through its robustness and decision-making facilities. It involves the application of a set of data processing techniques and algorithms in order to create predictive models. Its principle is based on finding relationships between explanatory variables and the predicted variables. Past occurrences are exploited to predict and to derive the unknown outcome. With the advent of big data, many studies have suggested the use of predictive analytics in order to process and analyze big data. Nevertheless, they have been curbed by the limits of classical methods of predictive analysis in case of a large amount of data. In fact, because of their volumes, their nature (semi or unstructured) and their variety, it is impossible to analyze efficiently big data via classical methods of predictive analysis. The authors attribute this weakness to the fact that predictive analysis algorithms do not allow the parallelization and distribution of calculation. In this paper, we propose to extend the predictive analysis algorithm, Classification And Regression Trees (CART), in order to adapt it for big data analysis. The major changes of this algorithm are presented and then a version of the extended algorithm is defined in order to make it applicable for a huge quantity of data.

Keywords: predictive analysis, big data, predictive analysis algorithms, CART algorithm

Procedia PDF Downloads 66
33517 Big Data Analysis with RHadoop

Authors: Ji Eun Shin, Byung Ho Jung, Dong Hoon Lim


It is almost impossible to store or analyze big data increasing exponentially with traditional technologies. Hadoop is a new technology to make that possible. R programming language is by far the most popular statistical tool for big data analysis based on distributed processing with Hadoop technology. With RHadoop that integrates R and Hadoop environment, we implemented parallel multiple regression analysis with different sizes of actual data. Experimental results showed our RHadoop system was much faster as the number of data nodes increases. We also compared the performance of our RHadoop with lm function and big lm packages available on big memory. The results showed that our RHadoop was faster than other packages owing to paralleling processing with increasing the number of map tasks as the size of data increases.

Keywords: big data, Hadoop, parallel regression analysis, R, RHadoop

Procedia PDF Downloads 362
33516 Incremental Learning of Independent Topic Analysis

Authors: Takahiro Nishigaki, Katsumi Nitta, Takashi Onoda


In this paper, we present a method of applying Independent Topic Analysis (ITA) to increasing the number of document data. The number of document data has been increasing since the spread of the Internet. ITA was presented as one method to analyze the document data. ITA is a method for extracting the independent topics from the document data by using the Independent Component Analysis (ICA). ICA is a technique in the signal processing; however, it is difficult to apply the ITA to increasing number of document data. Because ITA must use the all document data so temporal and spatial cost is very high. Therefore, we present Incremental ITA which extracts the independent topics from increasing number of document data. Incremental ITA is a method of updating the independent topics when the document data is added after extracted the independent topics from a just previous the data. In addition, Incremental ITA updates the independent topics when the document data is added. And we show the result applied Incremental ITA to benchmark datasets.

Keywords: text mining, topic extraction, independent, incremental, independent component analysis

Procedia PDF Downloads 233
33515 On Pooling Different Levels of Data in Estimating Parameters of Continuous Meta-Analysis

Authors: N. R. N. Idris, S. Baharom


A meta-analysis may be performed using aggregate data (AD) or an individual patient data (IPD). In practice, studies may be available at both IPD and AD level. In this situation, both the IPD and AD should be utilised in order to maximize the available information. Statistical advantages of combining the studies from different level have not been fully explored. This study aims to quantify the statistical benefits of including available IPD when conducting a conventional summary-level meta-analysis. Simulated meta-analysis were used to assess the influence of the levels of data on overall meta-analysis estimates based on IPD-only, AD-only and the combination of IPD and AD (mixed data, MD), under different study scenario. The percentage relative bias (PRB), root mean-square-error (RMSE) and coverage probability were used to assess the efficiency of the overall estimates. The results demonstrate that available IPD should always be included in a conventional meta-analysis using summary level data as they would significantly increased the accuracy of the estimates. On the other hand, if more than 80% of the available data are at IPD level, including the AD does not provide significant differences in terms of accuracy of the estimates. Additionally, combining the IPD and AD has moderating effects on the biasness of the estimates of the treatment effects as the IPD tends to overestimate the treatment effects, while the AD has the tendency to produce underestimated effect estimates. These results may provide some guide in deciding if significant benefit is gained by pooling the two levels of data when conducting meta-analysis.

Keywords: aggregate data, combined-level data, individual patient data, meta-analysis

Procedia PDF Downloads 308
33514 Multidimensional Item Response Theory Models for Practical Application in Large Tests Designed to Measure Multiple Constructs

Authors: Maria Fernanda Ordoñez Martinez, Alvaro Mauricio Montenegro


This work presents a statistical methodology for measuring and founding constructs in Latent Semantic Analysis. This approach uses the qualities of Factor Analysis in binary data with interpretations present on Item Response Theory. More precisely, we propose initially reducing dimensionality with specific use of Principal Component Analysis for the linguistic data and then, producing axes of groups made from a clustering analysis of the semantic data. This approach allows the user to give meaning to previous clusters and found the real latent structure presented by data. The methodology is applied in a set of real semantic data presenting impressive results for the coherence, speed and precision.

Keywords: semantic analysis, factorial analysis, dimension reduction, penalized logistic regression

Procedia PDF Downloads 348
33513 Collision Theory Based Sentiment Detection Using Discourse Analysis in Hadoop

Authors: Anuta Mukherjee, Saswati Mukherjee


Data is growing everyday. Social networking sites such as Twitter are becoming an integral part of our daily lives, contributing a large increase in the growth of data. It is a rich source especially for sentiment detection or mining since people often express honest opinion through tweets. However, although sentiment analysis is a well-researched topic in text, this analysis using Twitter data poses additional challenges since these are unstructured data with abbreviations and without a strict grammatical correctness. We have employed collision theory to achieve sentiment analysis in Twitter data. We have also incorporated discourse analysis in the collision theory based model to detect accurate sentiment from tweets. We have also used the retweet field to assign weights to certain tweets and obtained the overall weightage of a topic provided in the form of a query. Hadoop has been exploited for speed. Our experiments show effective results.

Keywords: sentiment analysis, twitter, collision theory, discourse analysis

Procedia PDF Downloads 423
33512 Analysis of Expression Data Using Unsupervised Techniques

Authors: M. A. I Perera, C. R. Wijesinghe, A. R. Weerasinghe


his study was conducted to review and identify the unsupervised techniques that can be employed to analyze gene expression data in order to identify better subtypes of tumors. Identifying subtypes of cancer help in improving the efficacy and reducing the toxicity of the treatments by identifying clues to find target therapeutics. Process of gene expression data analysis described under three steps as preprocessing, clustering, and cluster validation. Feature selection is important since the genomic data are high dimensional with a large number of features compared to samples. Hierarchical clustering and K Means are often used in the analysis of gene expression data. There are several cluster validation techniques used in validating the clusters. Heatmaps are an effective external validation method that allows comparing the identified classes with clinical variables and visual analysis of the classes.

Keywords: cancer subtypes, gene expression data analysis, clustering, cluster validation

Procedia PDF Downloads 66
33511 Data Mining Algorithms Analysis: Case Study of Price Predictions of Lands

Authors: Julio Albuja, David Zaldumbide


Data analysis is an important step before taking a decision about money. The aim of this work is to analyze the factors that influence the final price of the houses through data mining algorithms. To our best knowledge, previous work was researched just to compare results. Furthermore, before using the data of the data set, the Z-Transformation were used to standardize the data in the same range. Hence, the data was classified into two groups to visualize them in a readability format. A decision tree was built, and graphical data is displayed where clearly is easy to see the results and the factors' influence in these graphics. The definitions of these methods are described, as well as the descriptions of the results. Finally, conclusions and recommendations are presented related to the released results that our research showed making it easier to apply these algorithms using a customized data set.

Keywords: algorithms, data, decision tree, transformation

Procedia PDF Downloads 210
33510 A Modular Framework for Enabling Analysis for Educators with Different Levels of Data Mining Skills

Authors: Kyle De Freitas, Margaret Bernard


Enabling data mining analysis among a wider audience of educators is an active area of research within the educational data mining (EDM) community. The paper proposes a framework for developing an environment that caters for educators who have little technical data mining skills as well as for more advanced users with some data mining expertise. This framework architecture was developed through the review of the strengths and weaknesses of existing models in the literature. The proposed framework provides a modular architecture for future researchers to focus on the development of specific areas within the EDM process. Finally, the paper also highlights a strategy of enabling analysis through either the use of predefined questions or a guided data mining process and highlights how the developed questions and analysis conducted can be reused and extended over time.

Keywords: educational data mining, learning management system, learning analytics, EDM framework

Procedia PDF Downloads 246
33509 Quantile Coherence Analysis: Application to Precipitation Data

Authors: Yaeji Lim, Hee-Seok Oh


The coherence analysis measures the linear time-invariant relationship between two data sets and has been studied various fields such as signal processing, engineering, and medical science. However classical coherence analysis tends to be sensitive to outliers and focuses only on mean relationship. In this paper, we generalized cross periodogram to quantile cross periodogram and provide richer inter-relationship between two data sets. This is a general version of Laplace cross periodogram. We prove its asymptotic distribution under the long range process and compare them with ordinary coherence through numerical examples. We also present real data example to confirm the usefulness of quantile coherence analysis.

Keywords: coherence, cross periodogram, spectrum, quantile

Procedia PDF Downloads 286
33508 Modeling and Statistical Analysis of a Soap Production Mix in Bejoy Manufacturing Industry, Anambra State, Nigeria

Authors: Okolie Chukwulozie Paul, Iwenofu Chinwe Onyedika, Sinebe Jude Ebieladoh, M. C. Nwosu


The research work is based on the statistical analysis of the processing data. The essence is to analyze the data statistically and to generate a design model for the production mix of soap manufacturing products in Bejoy manufacturing company Nkpologwu, Aguata Local Government Area, Anambra state, Nigeria. The statistical analysis shows the statistical analysis and the correlation of the data. T test, Partial correlation and bi-variate correlation were used to understand what the data portrays. The design model developed was used to model the data production yield and the correlation of the variables show that the R2 is 98.7%. However, the results confirm that the data is fit for further analysis and modeling. This was proved by the correlation and the R-squared.

Keywords: General Linear Model, correlation, variables, pearson, significance, T-test, soap, production mix and statistic

Procedia PDF Downloads 334
33507 Saving Energy at a Wastewater Treatment Plant through Electrical and Production Data Analysis

Authors: Adriano Araujo Carvalho, Arturo Alatrista Corrales


This paper intends to show how electrical energy consumption and production data analysis were used to find opportunities to save energy at Taboada wastewater treatment plant in Callao, Peru. In order to access the data, it was used independent data networks for both electrical and process instruments, which were taken to analyze under an ISO 50001 energy audit, which considered, thus, Energy Performance Indexes for each process and a step-by-step guide presented in this text. Due to the use of aforementioned methodology and data mining techniques applied on information gathered through electronic multimeters (conveniently placed on substation switchboards connected to a cloud network), it was possible to identify thoroughly the performance of each process and thus, evidence saving opportunities which were previously hidden before. The data analysis brought both costs and energy reduction, allowing the plant to save significant resources and to be certified under ISO 50001.

Keywords: energy and production data analysis, energy management, ISO 50001, wastewater treatment plant energy analysis

Procedia PDF Downloads 120
33506 A Method of Detecting the Difference in Two States of Brain Using Statistical Analysis of EEG Raw Data

Authors: Digvijaysingh S. Bana, Kiran R. Trivedi


This paper introduces various methods for the alpha wave to detect the difference between two states of brain. One healthy subject participated in the experiment. EEG was measured on the forehead above the eye (FP1 Position) with reference and ground electrode are on the ear clip. The data samples are obtained in the form of EEG raw data. The time duration of reading is of one minute. Various test are being performed on the alpha band EEG raw data.The readings are performed in different time duration of the entire day. The statistical analysis is being carried out on the EEG sample data in the form of various tests.

Keywords: electroencephalogram(EEG), biometrics, authentication, EEG raw data

Procedia PDF Downloads 389
33505 Analysis of Big Data

Authors: Sandeep Sharma, Sarabjit Singh


As per the user demand and growth trends of large free data the storage solutions are now becoming more challenge-able to protect, store and to retrieve data. The days are not so far when the storage companies and organizations are start saying 'no' to store our valuable data or they will start charging a huge amount for its storage and protection. On the other hand as per the environmental conditions it becomes challenge-able to maintain and establish new data warehouses and data centers to protect global warming threats. A challenge of small data is over now, the challenges are big that how to manage the exponential growth of data. In this paper we have analyzed the growth trend of big data and its future implications. We have also focused on the impact of the unstructured data on various concerns and we have also suggested some possible remedies to streamline big data.

Keywords: big data, unstructured data, volume, variety, velocity

Procedia PDF Downloads 430
33504 Data Mining Meets Educational Analysis: Opportunities and Challenges for Research

Authors: Carla Silva


Recent development of information and communication technology enables us to acquire, collect, analyse data in various fields of socioeconomic – technological systems. Along with the increase of economic globalization and the evolution of information technology, data mining has become an important approach for economic data analysis. As a result, there has been a critical need for automated approaches to effective and efficient usage of massive amount of educational data, in order to support institutions to a strategic planning and investment decision-making. In this article, we will address data from several different perspectives and define the applied data to sciences. Many believe that 'big data' will transform business, government, and other aspects of the economy. We discuss how new data may impact educational policy and educational research. Large scale administrative data sets and proprietary private sector data can greatly improve the way we measure, track, and describe educational activity and educational impact. We also consider whether the big data predictive modeling tools that have emerged in statistics and computer science may prove useful in educational and furthermore in economics. Finally, we highlight a number of challenges and opportunities for future research.

Keywords: data mining, research analysis, investment decision-making, educational research

Procedia PDF Downloads 271
33503 Survey on Arabic Sentiment Analysis in Twitter

Authors: Sarah O. Alhumoud, Mawaheb I. Altuwaijri, Tarfa M. Albuhairi, Wejdan M. Alohaideb


Large-scale data stream analysis has become one of the important business and research priorities lately. Social networks like Twitter and other micro-blogging platforms hold an enormous amount of data that is large in volume, velocity and variety. Extracting valuable information and trends out of these data would aid in a better understanding and decision-making. Multiple analysis techniques are deployed for English content. Moreover, one of the languages that produce a large amount of data over social networks and is least analyzed is the Arabic language. The proposed paper is a survey on the research efforts to analyze the Arabic content in Twitter focusing on the tools and methods used to extract the sentiments for the Arabic content on Twitter.

Keywords: big data, social networks, sentiment analysis, twitter

Procedia PDF Downloads 460
33502 Data and Spatial Analysis for Economy and Education of 28 E.U. Member-States for 2014

Authors: Alexiou Dimitra, Fragkaki Maria


The objective of the paper is the study of geographic, economic and educational variables and their contribution to determine the position of each member-state among the EU-28 countries based on the values of seven variables as given by Eurostat. The Data Analysis methods of Multiple Factorial Correspondence Analysis (MFCA) Principal Component Analysis and Factor Analysis have been used. The cross tabulation tables of data consist of the values of seven variables for the 28 countries for 2014. The data are manipulated using the CHIC Analysis V 1.1 software package. The results of this program using MFCA and Ascending Hierarchical Classification are given in arithmetic and graphical form. For comparison reasons with the same data the Factor procedure of Statistical package IBM SPSS 20 has been used. The numerical and graphical results presented with tables and graphs, demonstrate the agreement between the two methods. The most important result is the study of the relation between the 28 countries and the position of each country in groups or clouds, which are formed according to the values of the corresponding variables.

Keywords: Multiple Factorial Correspondence Analysis, Principal Component Analysis, Factor Analysis, E.U.-28 countries, Statistical package IBM SPSS 20, CHIC Analysis V 1.1 Software, Statistics

Procedia PDF Downloads 386
33501 Pattern Recognition Using Feature Based Die-Map Clustering in the Semiconductor Manufacturing Process

Authors: Seung Hwan Park, Cheng-Sool Park, Jun Seok Kim, Youngji Yoo, Daewoong An, Jun-Geol Baek


Depending on the big data analysis becomes important, yield prediction using data from the semiconductor process is essential. In general, yield prediction and analysis of the causes of the failure are closely related. The purpose of this study is to analyze pattern affects the final test results using a die map based clustering. Many researches have been conducted using die data from the semiconductor test process. However, analysis has limitation as the test data is less directly related to the final test results. Therefore, this study proposes a framework for analysis through clustering using more detailed data than existing die data. This study consists of three phases. In the first phase, die map is created through fail bit data in each sub-area of die. In the second phase, clustering using map data is performed. And the third stage is to find patterns that affect final test result. Finally, the proposed three steps are applied to actual industrial data and experimental results showed the potential field application.

Keywords: die-map clustering, feature extraction, pattern recognition, semiconductor manufacturing process

Procedia PDF Downloads 326
33500 Iot Device Cost Effective Storage Architecture and Real-Time Data Analysis/Data Privacy Framework

Authors: Femi Elegbeleye, Omobayo Esan, Muienge Mbodila, Patrick Bowe


This paper focused on cost effective storage architecture using fog and cloud data storage gateway and presented the design of the framework for the data privacy model and data analytics framework on a real-time analysis when using machine learning method. The paper began with the system analysis, system architecture and its component design, as well as the overall system operations. The several results obtained from this study on data privacy model shows that when two or more data privacy model is combined we tend to have a more stronger privacy to our data, and when fog storage gateway have several advantages over using the traditional cloud storage, from our result shows fog has reduced latency/delay, low bandwidth consumption, and energy usage when been compare with cloud storage, therefore, fog storage will help to lessen excessive cost. This paper dwelt more on the system descriptions, the researchers focused on the research design and framework design for the data privacy model, data storage, and real-time analytics. This paper also shows the major system components and their framework specification. And lastly, the overall research system architecture was shown, its structure, and its interrelationships.

Keywords: IoT, fog, cloud, data analysis, data privacy

Procedia PDF Downloads 7
33499 Cloud Design for Storing Large Amount of Data

Authors: M. Strémy, P. Závacký, P. Cuninka, M. Juhás


Main goal of this paper is to introduce our design of private cloud for storing large amount of data, especially pictures, and to provide good technological backend for data analysis based on parallel processing and business intelligence. We have tested hypervisors, cloud management tools, storage for storing all data and Hadoop to provide data analysis on unstructured data. Providing high availability, virtual network management, logical separation of projects and also rapid deployment of physical servers to our environment was also needed.

Keywords: cloud, glusterfs, hadoop, juju, kvm, maas, openstack, virtualization

Procedia PDF Downloads 280
33498 Data Integration with Geographic Information System Tools for Rural Environmental Monitoring

Authors: Tamas Jancso, Andrea Podor, Eva Nagyne Hajnal, Peter Udvardy, Gabor Nagy, Attila Varga, Meng Qingyan


The paper deals with the conditions and circumstances of the integration of remotely sensed data for rural environmental monitoring purposes. The main task is to make decisions during the integration process when we have data sources with different resolutions, locations, spectral channels and dimensions. In order to have exact knowledge about the integration and data fusion possibilities, it is necessary to know the properties (metadata) that characterize the data. The paper explains the joining of these data sources using their attribute data through a sample project. The resulted product will be used for rural environmental analysis.

Keywords: remote sensing, GIS, metadata, integration, environmental analysis

Procedia PDF Downloads 15
33497 Analysis of Genomics Big Data in Cloud Computing Using Fuzzy Logic

Authors: Mohammad Vahed, Ana Sadeghitohidi, Majid Vahed, Hiroki Takahashi


In the genomics field, the huge amounts of data have produced by the next-generation sequencers (NGS). Data volumes are very rapidly growing, as it is postulated that more than one billion bases will be produced per year in 2020. The growth rate of produced data is much faster than Moore's law in computer technology. This makes it more difficult to deal with genomics data, such as storing data, searching information, and finding the hidden information. It is required to develop the analysis platform for genomics big data. Cloud computing newly developed enables us to deal with big data more efficiently. Hadoop is one of the frameworks distributed computing and relies upon the core of a Big Data as a Service (BDaaS). Although many services have adopted this technology, e.g. amazon, there are a few applications in the biology field. Here, we propose a new algorithm to more efficiently deal with the genomics big data, e.g. sequencing data. Our algorithm consists of two parts: First is that BDaaS is applied for handling the data more efficiently. Second is that the hybrid method of MapReduce and Fuzzy logic is applied for data processing. This step can be parallelized in implementation. Our algorithm has great potential in computational analysis of genomics big data, e.g. de novo genome assembly and sequence similarity search. We will discuss our algorithm and its feasibility.

Keywords: big data, fuzzy logic, MapReduce, Hadoop, cloud computing

Procedia PDF Downloads 222
33496 Analysis of Different Classification Techniques Using WEKA for Diabetic Disease

Authors: Usama Ahmed


Data mining is the process of analyze data which are used to predict helpful information. It is the field of research which solve various type of problem. In data mining, classification is an important technique to classify different kind of data. Diabetes is most common disease. This paper implements different classification technique using Waikato Environment for Knowledge Analysis (WEKA) on diabetes dataset and find which algorithm is suitable for working. The best classification algorithm based on diabetic data is Naïve Bayes. The accuracy of Naïve Bayes is 76.31% and take 0.06 seconds to build the model.

Keywords: data mining, classification, diabetes, WEKA

Procedia PDF Downloads 67
33495 Estimation of Missing Values in Aggregate Level Spatial Data

Authors: Amitha Puranik, V. S. Binu, Seena Biju


Missing data is a common problem in spatial analysis especially at the aggregate level. Missing can either occur in covariate or in response variable or in both in a given location. Many missing data techniques are available to estimate the missing data values but not all of these methods can be applied on spatial data since the data are autocorrelated. Hence there is a need to develop a method that estimates the missing values in both response variable and covariates in spatial data by taking account of the spatial autocorrelation. The present study aims to develop a model to estimate the missing data points at the aggregate level in spatial data by accounting for (a) Spatial autocorrelation of the response variable (b) Spatial autocorrelation of covariates and (c) Correlation between covariates and the response variable. Estimating the missing values of spatial data requires a model that explicitly account for the spatial autocorrelation. The proposed model not only accounts for spatial autocorrelation but also utilizes the correlation that exists between covariates, within covariates and between a response variable and covariates. The precise estimation of the missing data points in spatial data will result in an increased precision of the estimated effects of independent variables on the response variable in spatial regression analysis.

Keywords: spatial regression, missing data estimation, spatial autocorrelation, simulation analysis

Procedia PDF Downloads 285
33494 Fuzzy Optimization Multi-Objective Clustering Ensemble Model for Multi-Source Data Analysis

Authors: C. B. Le, V. N. Pham


In modern data analysis, multi-source data appears more and more in real applications. Multi-source data clustering has emerged as a important issue in the data mining and machine learning community. Different data sources provide information about different data. Therefore, multi-source data linking is essential to improve clustering performance. However, in practice multi-source data is often heterogeneous, uncertain, and large. This issue is considered a major challenge from multi-source data. Ensemble is a versatile machine learning model in which learning techniques can work in parallel, with big data. Clustering ensemble has been shown to outperform any standard clustering algorithm in terms of accuracy and robustness. However, most of the traditional clustering ensemble approaches are based on single-objective function and single-source data. This paper proposes a new clustering ensemble method for multi-source data analysis. The fuzzy optimized multi-objective clustering ensemble method is called FOMOCE. Firstly, a clustering ensemble mathematical model based on the structure of multi-objective clustering function, multi-source data, and dark knowledge is introduced. Then, rules for extracting dark knowledge from the input data, clustering algorithms, and base clusterings are designed and applied. Finally, a clustering ensemble algorithm is proposed for multi-source data analysis. The experiments were performed on the standard sample data set. The experimental results demonstrate the superior performance of the FOMOCE method compared to the existing clustering ensemble methods and multi-source clustering methods.

Keywords: clustering ensemble, multi-source, multi-objective, fuzzy clustering

Procedia PDF Downloads 82
33493 Analysis and Forecasting of Bitcoin Price Using Exogenous Data

Authors: J-C. Leneveu, A. Chereau, L. Mansart, T. Mesbah, M. Wyka


Extracting and interpreting information from Big Data represent a stake for years to come in several sectors such as finance. Currently, numerous methods are used (such as Technical Analysis) to try to understand and to anticipate market behavior, with mixed results because it still seems impossible to exactly predict a financial trend. The increase of available data on Internet and their diversity represent a great opportunity for the financial world. Indeed, it is possible, along with these standard financial data, to focus on exogenous data to take into account more macroeconomic factors. Coupling the interpretation of these data with standard methods could allow obtaining more precise trend predictions. In this paper, in order to observe the influence of exogenous data price independent of other usual effects occurring in classical markets, behaviors of Bitcoin users are introduced in a model reconstituting Bitcoin value, which is elaborated and tested for prediction purposes.

Keywords: big data, bitcoin, data mining, social network, financial trends, exogenous data, global economy, behavioral finance

Procedia PDF Downloads 281