Search results for: Data Analysis
42000 Women Entrepreneurial Resiliency Amidst COVID-19
Authors: Divya Juneja, Sukhjeet Kaur Matharu
Abstract:
Purpose: The paper is aimed at identifying the challenging factors experienced by the women entrepreneurs in India in operating their enterprises amidst the challenges posed by the COVID-19 pandemic. Methodology: The sample for the study comprised 396 women entrepreneurs from different regions of India. A purposive sampling technique was adopted for data collection. Data was collected through a self-administered questionnaire. Analysis was performed using the SPSS package for quantitative data analysis. Findings: The results of the study state that entrepreneurial characteristics, resourcefulness, networking, adaptability, and continuity have a positive influence on the resiliency of women entrepreneurs when faced with a crisis situation. Practical Implications: The findings of the study have some important implications for women entrepreneurs, organizations, government, and other institutions extending support to entrepreneurs.Keywords: women entrepreneurs, analysis, data analysis, positive influence, resiliency
Procedia PDF Downloads 11341999 Generation of Quasi-Measurement Data for On-Line Process Data Analysis
Authors: Hyun-Woo Cho
Abstract:
For ensuring the safety of a manufacturing process one should quickly identify an assignable cause of a fault in an on-line basis. To this end, many statistical techniques including linear and nonlinear methods have been frequently utilized. However, such methods possessed a major problem of small sample size, which is mostly attributed to the characteristics of empirical models used for reference models. This work presents a new method to overcome the insufficiency of measurement data in the monitoring and diagnosis tasks. Some quasi-measurement data are generated from existing data based on the two indices of similarity and importance. The performance of the method is demonstrated using a real data set. The results turn out that the presented methods are able to handle the insufficiency problem successfully. In addition, it is shown to be quite efficient in terms of computational speed and memory usage, and thus on-line implementation of the method is straightforward for monitoring and diagnosis purposes.Keywords: data analysis, diagnosis, monitoring, process data, quality control
Procedia PDF Downloads 48141998 Spatial Econometric Approaches for Count Data: An Overview and New Directions
Authors: Paula Simões, Isabel Natário
Abstract:
This paper reviews a number of theoretical aspects for implementing an explicit spatial perspective in econometrics for modelling non-continuous data, in general, and count data, in particular. It provides an overview of the several spatial econometric approaches that are available to model data that are collected with reference to location in space, from the classical spatial econometrics approaches to the recent developments on spatial econometrics to model count data, in a Bayesian hierarchical setting. Considerable attention is paid to the inferential framework, necessary for structural consistent spatial econometric count models, incorporating spatial lag autocorrelation, to the corresponding estimation and testing procedures for different assumptions, to the constrains and implications embedded in the various specifications in the literature. This review combines insights from the classical spatial econometrics literature as well as from hierarchical modeling and analysis of spatial data, in order to look for new possible directions on the processing of count data, in a spatial hierarchical Bayesian econometric context.Keywords: spatial data analysis, spatial econometrics, Bayesian hierarchical models, count data
Procedia PDF Downloads 59341997 Vibrations of Springboards: Mode Shape and Time Domain Analysis
Authors: Stefano Frassinelli, Alessandro Niccolai, Riccardo E. Zich
Abstract:
Diving is an important Olympic sport. In this sport, the effective performance of the athlete is related to his capability to interact correctly with the springboard. In fact, the elevation of the jump and the correctness of the dive are influenced by the vibrations of the board. In this paper, the vibrations of the springboard will be analyzed by means of typical tools for vibration analysis: Firstly, a modal analysis will be done on two different models of the springboard, then, these two model and another one will be analyzed with a time analysis, done integrating the equations of motion od deformable bodies. All these analyses will be compared with experimental data measured on a real springboard by means of a 6-axis accelerometer; these measurements are aimed to assess the models proposed. The acquired data will be analyzed both in frequency domain and in time domain.Keywords: springboard analysis, modal analysis, time domain analysis, vibrations
Procedia PDF Downloads 46041996 Analysis of an Alternative Data Base for the Estimation of Solar Radiation
Authors: Graciela Soares Marcelli, Elison Eduardo Jardim Bierhals, Luciane Teresa Salvi, Claudineia Brazil, Rafael Haag
Abstract:
The sun is a source of renewable energy, and its use as both a source of heat and light is one of the most promising energy alternatives for the future. To measure the thermal or photovoltaic systems a solar irradiation database is necessary. Brazil still has a reduced number of meteorological stations that provide frequency tests, as an alternative to the radio data platform, with reanalysis systems, quite significant. ERA-Interim is a global fire reanalysis by the European Center for Medium-Range Weather Forecasts (ECMWF). The data assimilation system used for the production of ERA-Interim is based on a 2006 version of the IFS (Cy31r2). The system includes a 4-dimensional variable analysis (4D-Var) with a 12-hour analysis window. The spatial resolution of the dataset is approximately 80 km at 60 vertical levels from the surface to 0.1 hPa. This work aims to make a comparative analysis between the ERA-Interim data and the data observed in the Solarimmetric Atlas of the State of Rio Grande do Sul, to verify its applicability in the absence of an observed data network. The analysis of the results obtained for a study region as an alternative to the energy potential of a given region.Keywords: energy potential, reanalyses, renewable energy, solar radiation
Procedia PDF Downloads 16441995 Analysis of ECGs Survey Data by Applying Clustering Algorithm
Authors: Irum Matloob, Shoab Ahmad Khan, Fahim Arif
Abstract:
As Indo-pak has been the victim of heart diseases since many decades. Many surveys showed that percentage of cardiac patients is increasing in Pakistan day by day, and special attention is needed to pay on this issue. The framework is proposed for performing detailed analysis of ECG survey data which is conducted for measuring the prevalence of heart diseases statistics in Pakistan. The ECG survey data is evaluated or filtered by using automated Minnesota codes and only those ECGs are used for further analysis which is fulfilling the standardized conditions mentioned in the Minnesota codes. Then feature selection is performed by applying proposed algorithm based on discernibility matrix, for selecting relevant features from the database. Clustering is performed for exposing natural clusters from the ECG survey data by applying spectral clustering algorithm using fuzzy c means algorithm. The hidden patterns and interesting relationships which have been exposed after this analysis are useful for further detailed analysis and for many other multiple purposes.Keywords: arrhythmias, centroids, ECG, clustering, discernibility matrix
Procedia PDF Downloads 35141994 An Exploratory Research of Human Character Analysis Based on Smart Watch Data: Distinguish the Drinking State from Normal State
Authors: Lu Zhao, Yanrong Kang, Lili Guo, Yuan Long, Guidong Xing
Abstract:
Smart watches, as a handy device with rich functionality, has become one of the most popular wearable devices all over the world. Among the various function, the most basic is health monitoring. The monitoring data can be provided as an effective evidence or a clue for the detection of crime cases. For instance, the step counting data can help to determine whether the watch wearer was quiet or moving during the given time period. There is, however, still quite few research on the analysis of human character based on these data. The purpose of this research is to analyze the health monitoring data to distinguish the drinking state from normal state. The analysis result may play a role in cases involving drinking, such as drunk driving. The experiment mainly focused on finding the figures of smart watch health monitoring data that change with drinking and figuring up the change scope. The chosen subjects are mostly in their 20s, each of whom had been wearing the same smart watch for a week. Each subject drank for several times during the week, and noted down the begin and end time point of the drinking. The researcher, then, extracted and analyzed the health monitoring data from the watch. According to the descriptive statistics analysis, it can be found that the heart rate change when drinking. The average heart rate is about 10% higher than normal, the coefficient of variation is less than about 30% of the normal state. Though more research is needed to be carried out, this experiment and analysis provide a thought of the application of the data from smart watches.Keywords: character analysis, descriptive statistics analysis, drink state, heart rate, smart watch
Procedia PDF Downloads 16741993 To Handle Data-Driven Software Development Projects Effectively
Authors: Shahnewaz Khan
Abstract:
Machine learning (ML) techniques are often used in projects for creating data-driven applications. These tasks typically demand additional research and analysis. The proper technique and strategy must be chosen to ensure the success of data-driven projects. Otherwise, even exerting a lot of effort, the necessary development might not always be possible. In this post, an effort to examine the workflow of data-driven software development projects and its implementation process in order to describe how to manage a project successfully. Which will assist in minimizing the added workload.Keywords: data, data-driven projects, data science, NLP, software project
Procedia PDF Downloads 8341992 Sentiment Analysis: Comparative Analysis of Multilingual Sentiment and Opinion Classification Techniques
Authors: Sannikumar Patel, Brian Nolan, Markus Hofmann, Philip Owende, Kunjan Patel
Abstract:
Sentiment analysis and opinion mining have become emerging topics of research in recent years but most of the work is focused on data in the English language. A comprehensive research and analysis are essential which considers multiple languages, machine translation techniques, and different classifiers. This paper presents, a comparative analysis of different approaches for multilingual sentiment analysis. These approaches are divided into two parts: one using classification of text without language translation and second using the translation of testing data to a target language, such as English, before classification. The presented research and results are useful for understanding whether machine translation should be used for multilingual sentiment analysis or building language specific sentiment classification systems is a better approach. The effects of language translation techniques, features, and accuracy of various classifiers for multilingual sentiment analysis is also discussed in this study.Keywords: cross-language analysis, machine learning, machine translation, sentiment analysis
Procedia PDF Downloads 71341991 Impact of Map Generalization in Spatial Analysis
Authors: Lin Li, P. G. R. N. I. Pussella
Abstract:
When representing spatial data and their attributes on different types of maps, the scale plays a key role in the process of map generalization. The process is consisted with two main operators such as selection and omission. Once some data were selected, they would undergo of several geometrical changing processes such as elimination, simplification, smoothing, exaggeration, displacement, aggregation and size reduction. As a result of these operations at different levels of data, the geometry of the spatial features such as length, sinuosity, orientation, perimeter and area would be altered. This would be worst in the case of preparation of small scale maps, since the cartographer has not enough space to represent all the features on the map. What the GIS users do is when they wanted to analyze a set of spatial data; they retrieve a data set and does the analysis part without considering very important characteristics such as the scale, the purpose of the map and the degree of generalization. Further, the GIS users use and compare different maps with different degrees of generalization. Sometimes, GIS users are going beyond the scale of the source map using zoom in facility and violate the basic cartographic rule 'it is not suitable to create a larger scale map using a smaller scale map'. In the study, the effect of map generalization for GIS analysis would be discussed as the main objective. It was used three digital maps with different scales such as 1:10000, 1:50000 and 1:250000 which were prepared by the Survey Department of Sri Lanka, the National Mapping Agency of Sri Lanka. It was used common features which were on above three maps and an overlay analysis was done by repeating the data with different combinations. Road data, River data and Land use data sets were used for the study. A simple model, to find the best place for a wild life park, was used to identify the effects. The results show remarkable effects on different degrees of generalization processes. It can see that different locations with different geometries were received as the outputs from this analysis. The study suggests that there should be reasonable methods to overcome this effect. It can be recommended that, as a solution, it would be very reasonable to take all the data sets into a common scale and do the analysis part.Keywords: generalization, GIS, scales, spatial analysis
Procedia PDF Downloads 32841990 Text Mining of Twitter Data Using a Latent Dirichlet Allocation Topic Model and Sentiment Analysis
Authors: Sidi Yang, Haiyi Zhang
Abstract:
Twitter is a microblogging platform, where millions of users daily share their attitudes, views, and opinions. Using a probabilistic Latent Dirichlet Allocation (LDA) topic model to discern the most popular topics in the Twitter data is an effective way to analyze a large set of tweets to find a set of topics in a computationally efficient manner. Sentiment analysis provides an effective method to show the emotions and sentiments found in each tweet and an efficient way to summarize the results in a manner that is clearly understood. The primary goal of this paper is to explore text mining, extract and analyze useful information from unstructured text using two approaches: LDA topic modelling and sentiment analysis by examining Twitter plain text data in English. These two methods allow people to dig data more effectively and efficiently. LDA topic model and sentiment analysis can also be applied to provide insight views in business and scientific fields.Keywords: text mining, Twitter, topic model, sentiment analysis
Procedia PDF Downloads 17941989 Value Chain Based New Business Opportunity
Authors: Seonjae Lee, Sungjoo Lee
Abstract:
Excavation is necessary to remain competitive in the current business environment. The company survived the rapidly changing industry conditions by adapting new business strategy and reducing technology challenges. Traditionally, the two methods are conducted excavations for new businesses. The first method is, qualitative analysis of expert opinion, which is gathered through opportunities and secondly, new technologies are discovered through quantitative data analysis of method patents. The second method increases time and cost. Patent data is restricted for use and the purpose of discovering business opportunities. This study presents the company's characteristics (sector, size, etc.), of new business opportunities in customized form by reviewing the value chain perspective and to contributing to creating new business opportunities in the proposed model. It utilizes the trademark database of the Korean Intellectual Property Office (KIPO) and proprietary company information database of the Korea Enterprise Data (KED). This data is key to discovering new business opportunities with analysis of competitors and advanced business trademarks (Module 1) and trading analysis of competitors found in the KED (Module 2).Keywords: value chain, trademark, trading analysis, new business opportunity
Procedia PDF Downloads 37241988 Impact on the Results of Sub-Group Analysis on Performance of Recommender Systems
Authors: Ho Yeon Park, Kyoung-Jae Kim
Abstract:
The purpose of this study is to investigate whether friendship in social media can be an important factor in recommender system through social scientific analysis of friendship in popular social media such as Facebook and Twitter. For this purpose, this study analyzes data on friendship in real social media using component analysis and clique analysis among sub-group analysis in social network analysis. In this study, we propose an algorithm to reflect the results of sub-group analysis on the recommender system. The key to this algorithm is to ensure that recommendations from users in friendships are more likely to be reflected in recommendations from users. As a result of this study, outcomes of various subgroup analyzes were derived, and it was confirmed that the results were different from the results of the existing recommender system. Therefore, it is considered that the results of the subgroup analysis affect the recommendation performance of the system. Future research will attempt to generalize the results of the research through further analysis of various social data.Keywords: sub-group analysis, social media, social network analysis, recommender systems
Procedia PDF Downloads 36341987 Analysis of Urban Population Using Twitter Distribution Data: Case Study of Makassar City, Indonesia
Authors: Yuyun Wabula, B. J. Dewancker
Abstract:
In the past decade, the social networking app has been growing very rapidly. Geolocation data is one of the important features of social media that can attach the user's location coordinate in the real world. This paper proposes the use of geolocation data from the Twitter social media application to gain knowledge about urban dynamics, especially on human mobility behavior. This paper aims to explore the relation between geolocation Twitter with the existence of people in the urban area. Firstly, the study will analyze the spread of people in the particular area, within the city using Twitter social media data. Secondly, we then match and categorize the existing place based on the same individuals visiting. Then, we combine the Twitter data from the tracking result and the questionnaire data to catch the Twitter user profile. To do that, we used the distribution frequency analysis to learn the visitors’ percentage. To validate the hypothesis, we compare it with the local population statistic data and land use mapping released by the city planning department of Makassar local government. The results show that there is the correlation between Twitter geolocation and questionnaire data. Thus, integration the Twitter data and survey data can reveal the profile of the social media users.Keywords: geolocation, Twitter, distribution analysis, human mobility
Procedia PDF Downloads 31441986 Web Search Engine Based Naming Procedure for Independent Topic
Authors: Takahiro Nishigaki, Takashi Onoda
Abstract:
In recent years, the number of document data has been increasing since the spread of the Internet. Many methods have been studied for extracting topics from large document data. We proposed Independent Topic Analysis (ITA) to extract topics independent of each other from large document data such as newspaper data. ITA is a method for extracting the independent topics from the document data by using the Independent Component Analysis. The topic represented by ITA is represented by a set of words. However, the set of words is quite different from the topics the user imagines. For example, the top five words with high independence of a topic are as follows. Topic1 = {"scor", "game", "lead", "quarter", "rebound"}. This Topic 1 is considered to represent the topic of "SPORTS". This topic name "SPORTS" has to be attached by the user. ITA cannot name topics. Therefore, in this research, we propose a method to obtain topics easy for people to understand by using the web search engine, topics given by the set of words given by independent topic analysis. In particular, we search a set of topical words, and the title of the homepage of the search result is taken as the topic name. And we also use the proposed method for some data and verify its effectiveness.Keywords: independent topic analysis, topic extraction, topic naming, web search engine
Procedia PDF Downloads 11941985 Fuzzy Wavelet Model to Forecast the Exchange Rate of IDR/USD
Authors: Tri Wijayanti Septiarini, Agus Maman Abadi, Muhammad Rifki Taufik
Abstract:
The exchange rate of IDR/USD can be the indicator to analysis Indonesian economy. The exchange rate as a important factor because it has big effect in Indonesian economy overall. So, it needs the analysis data of exchange rate. There is decomposition data of exchange rate of IDR/USD to be frequency and time. It can help the government to monitor the Indonesian economy. This method is very effective to identify the case, have high accurate result and have simple structure. In this paper, data of exchange rate that used is weekly data from December 17, 2010 until November 11, 2014.Keywords: the exchange rate, fuzzy mamdani, discrete wavelet transforms, fuzzy wavelet
Procedia PDF Downloads 57141984 Heritage and Tourism in the Era of Big Data: Analysis of Chinese Cultural Tourism in Catalonia
Authors: Xinge Liao, Francesc Xavier Roige Ventura, Dolores Sanchez Aguilera
Abstract:
With the development of the Internet, the study of tourism behavior has rapidly expanded from the traditional physical market to the online market. Data on the Internet is characterized by dynamic changes, and new data appear all the time. In recent years the generation of a large volume of data was characterized, such as forums, blogs, and other sources, which have expanded over time and space, together they constitute large-scale Internet data, known as Big Data. This data of technological origin that derives from the use of devices and the activity of multiple users is becoming a source of great importance for the study of geography and the behavior of tourists. The study will focus on cultural heritage tourist practices in the context of Big Data. The research will focus on exploring the characteristics and behavior of Chinese tourists in relation to the cultural heritage of Catalonia. Geographical information, target image, perceptions in user-generated content will be studied through data analysis from Weibo -the largest social networks of blogs in China. Through the analysis of the behavior of heritage tourists in the Big Data environment, this study will understand the practices (activities, motivations, perceptions) of cultural tourists and then understand the needs and preferences of tourists in order to better guide the sustainable development of tourism in heritage sites.Keywords: Barcelona, Big Data, Catalonia, cultural heritage, Chinese tourism market, tourists’ behavior
Procedia PDF Downloads 13841983 Big Data: Concepts, Technologies and Applications in the Public Sector
Authors: A. Alexandru, C. A. Alexandru, D. Coardos, E. Tudora
Abstract:
Big Data (BD) is associated with a new generation of technologies and architectures which can harness the value of extremely large volumes of very varied data through real time processing and analysis. It involves changes in (1) data types, (2) accumulation speed, and (3) data volume. This paper presents the main concepts related to the BD paradigm, and introduces architectures and technologies for BD and BD sets. The integration of BD with the Hadoop Framework is also underlined. BD has attracted a lot of attention in the public sector due to the newly emerging technologies that allow the availability of network access. The volume of different types of data has exponentially increased. Some applications of BD in the public sector in Romania are briefly presented.Keywords: big data, big data analytics, Hadoop, cloud
Procedia PDF Downloads 31041982 Data Stream Association Rule Mining with Cloud Computing
Authors: B. Suraj Aravind, M. H. M. Krishna Prasad
Abstract:
There exist emerging applications of data streams that require association rule mining, such as network traffic monitoring, web click streams analysis, sensor data, data from satellites etc. Data streams typically arrive continuously in high speed with huge amount and changing data distribution. This raises new issues that need to be considered when developing association rule mining techniques for stream data. This paper proposes to introduce an improved data stream association rule mining algorithm by eliminating the limitation of resources. For this, the concept of cloud computing is used. Inclusion of this may lead to additional unknown problems which needs further research.Keywords: data stream, association rule mining, cloud computing, frequent itemsets
Procedia PDF Downloads 50141981 Using SNAP and RADTRAD to Establish the Analysis Model for Maanshan PWR Plant
Authors: J. R. Wang, H. C. Chen, C. Shih, S. W. Chen, J. H. Yang, Y. Chiang
Abstract:
In this study, we focus on the establishment of the analysis model for Maanshan PWR nuclear power plant (NPP) by using RADTRAD and SNAP codes with the FSAR, manuals, and other data. In order to evaluate the cumulative dose at the Exclusion Area Boundary (EAB) and Low Population Zone (LPZ) outer boundary, Maanshan NPP RADTRAD/SNAP model was used to perform the analysis of the DBA LOCA case. The analysis results of RADTRAD were similar to FSAR data. These analysis results were lower than the failure criteria of 10 CFR 100.11 (a total radiation dose to the whole body, 250 mSv; a total radiation dose to the thyroid from iodine exposure, 3000 mSv).Keywords: RADionuclide, transport, removal, and dose estimation (RADTRAD), symbolic nuclear analysis package (SNAP), dose, PWR
Procedia PDF Downloads 46341980 Efficiency of the Slovak Commercial Banks Applying the DEA Window Analysis
Authors: Iveta Řepková
Abstract:
The aim of this paper is to estimate the efficiency of the Slovak commercial banks employing the Data Envelopment Analysis (DEA) window analysis approach during the period 2003-2012. The research is based on unbalanced panel data of the Slovak commercial banks. Undesirable output was included into analysis of banking efficiency. It was found that most efficient banks were Postovabanka, UniCredit Bank and Istrobanka in CCR model and the most efficient banks were Slovenskasporitelna, Istrobanka and UniCredit Bank in BCC model. On contrary, the lowest efficient banks were found Privatbanka and CitiBank. We found that the largest banks in the Slovak banking market were lower efficient than medium-size and small banks. Results of the paper is that during the period 2003-2008 the average efficiency was increasing and then during the period 2010-2011 the average efficiency decreased as a result of financial crisis.Keywords: data envelopment analysis, efficiency, Slovak banking sector, window analysis
Procedia PDF Downloads 35741979 Spatial Variability of Brahmaputra River Flow Characteristics
Authors: Hemant Kumar
Abstract:
Brahmaputra River is known according to the Hindu mythology the son of the Lord Brahma. According to this name, the river Brahmaputra creates mass destruction during the monsoon season in Assam, India. It is a state situated in North-East part of India. This is one of the essential states out of the seven countries of eastern India, where almost all entire Brahmaputra flow carried out. The other states carry their tributaries. In the present case study, the spatial analysis performed in this specific case the number of MODIS data are acquired. In the method of detecting the change, the spray content was found during heavy rainfall and in the flooded monsoon season. By this method, particularly the analysis over the Brahmaputra outflow determines the flooded season. The charged particle-associated in aerosol content genuinely verifies the heavy water content below the ground surface, which is validated by trend analysis through rainfall spectrum data. This is confirmed by in-situ sampled view data from a different position of Brahmaputra River. Further, a Hyperion Hyperspectral 30 m resolution data were used to scan the sediment deposits, which is also confirmed by in-situ sampled view data from a different position.Keywords: aerosol, change detection, spatial analysis, trend analysis
Procedia PDF Downloads 14741978 Attribute Analysis of Quick Response Code Payment Users Using Discriminant Non-negative Matrix Factorization
Authors: Hironori Karachi, Haruka Yamashita
Abstract:
Recently, the system of quick response (QR) code is getting popular. Many companies introduce new QR code payment services and the services are competing with each other to increase the number of users. For increasing the number of users, we should grasp the difference of feature of the demographic information, usage information, and value of users between services. In this study, we conduct an analysis of real-world data provided by Nomura Research Institute including the demographic data of users and information of users’ usages of two services; LINE Pay, and PayPay. For analyzing such data and interpret the feature of them, Nonnegative Matrix Factorization (NMF) is widely used; however, in case of the target data, there is a problem of the missing data. EM-algorithm NMF (EMNMF) to complete unknown values for understanding the feature of the given data presented by matrix shape. Moreover, for comparing the result of the NMF analysis of two matrices, there is Discriminant NMF (DNMF) shows the difference of users features between two matrices. In this study, we combine EMNMF and DNMF and also analyze the target data. As the interpretation, we show the difference of the features of users between LINE Pay and Paypay.Keywords: data science, non-negative matrix factorization, missing data, quality of services
Procedia PDF Downloads 13141977 Machine Learning Analysis of Student Success in Introductory Calculus Based Physics I Course
Authors: Chandra Prayaga, Aaron Wade, Lakshmi Prayaga, Gopi Shankar Mallu
Abstract:
This paper presents the use of machine learning algorithms to predict the success of students in an introductory physics course. Data having 140 rows pertaining to the performance of two batches of students was used. The lack of sufficient data to train robust machine learning models was compensated for by generating synthetic data similar to the real data. CTGAN and CTGAN with Gaussian Copula (Gaussian) were used to generate synthetic data, with the real data as input. To check the similarity between the real data and each synthetic dataset, pair plots were made. The synthetic data was used to train machine learning models using the PyCaret package. For the CTGAN data, the Ada Boost Classifier (ADA) was found to be the ML model with the best fit, whereas the CTGAN with Gaussian Copula yielded Logistic Regression (LR) as the best model. Both models were then tested for accuracy with the real data. ROC-AUC analysis was performed for all the ten classes of the target variable (Grades A, A-, B+, B, B-, C+, C, C-, D, F). The ADA model with CTGAN data showed a mean AUC score of 0.4377, but the LR model with the Gaussian data showed a mean AUC score of 0.6149. ROC-AUC plots were obtained for each Grade value separately. The LR model with Gaussian data showed consistently better AUC scores compared to the ADA model with CTGAN data, except in two cases of the Grade value, C- and A-.Keywords: machine learning, student success, physics course, grades, synthetic data, CTGAN, gaussian copula CTGAN
Procedia PDF Downloads 4441976 Frequent Itemset Mining Using Rough-Sets
Authors: Usman Qamar, Younus Javed
Abstract:
Frequent pattern mining is the process of finding a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set. It was proposed in the context of frequent itemsets and association rule mining. Frequent pattern mining is used to find inherent regularities in data. What products were often purchased together? Its applications include basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. However, one of the bottlenecks of frequent itemset mining is that as the data increase the amount of time and resources required to mining the data increases at an exponential rate. In this investigation a new algorithm is proposed which can be uses as a pre-processor for frequent itemset mining. FASTER (FeAture SelecTion using Entropy and Rough sets) is a hybrid pre-processor algorithm which utilizes entropy and rough-sets to carry out record reduction and feature (attribute) selection respectively. FASTER for frequent itemset mining can produce a speed up of 3.1 times when compared to original algorithm while maintaining an accuracy of 71%.Keywords: rough-sets, classification, feature selection, entropy, outliers, frequent itemset mining
Procedia PDF Downloads 43741975 Harmonic Data Preparation for Clustering and Classification
Authors: Ali Asheibi
Abstract:
The rapid increase in the size of databases required to store power quality monitoring data has demanded new techniques for analysing and understanding the data. One suggested technique to assist in analysis is data mining. Preparing raw data to be ready for data mining exploration take up most of the effort and time spent in the whole data mining process. Clustering is an important technique in data mining and machine learning in which underlying and meaningful groups of data are discovered. Large amounts of harmonic data have been collected from an actual harmonic monitoring system in a distribution system in Australia for three years. This amount of acquired data makes it difficult to identify operational events that significantly impact the harmonics generated on the system. In this paper, harmonic data preparation processes to better understanding of the data have been presented. Underlying classes in this data has then been identified using clustering technique based on the Minimum Message Length (MML) method. The underlying operational information contained within the clusters can be rapidly visualised by the engineers. The C5.0 algorithm was used for classification and interpretation of the generated clusters.Keywords: data mining, harmonic data, clustering, classification
Procedia PDF Downloads 24841974 Simulation Data Summarization Based on Spatial Histograms
Authors: Jing Zhao, Yoshiharu Ishikawa, Chuan Xiao, Kento Sugiura
Abstract:
In order to analyze large-scale scientific data, research on data exploration and visualization has gained popularity. In this paper, we focus on the exploration and visualization of scientific simulation data, and define a spatial V-Optimal histogram for data summarization. We propose histogram construction algorithms based on a general binary hierarchical partitioning as well as a more specific one, the l-grid partitioning. For effective data summarization and efficient data visualization in scientific data analysis, we propose an optimal algorithm as well as a heuristic algorithm for histogram construction. To verify the effectiveness and efficiency of the proposed methods, we conduct experiments on the massive evacuation simulation data.Keywords: simulation data, data summarization, spatial histograms, exploration, visualization
Procedia PDF Downloads 17641973 Structural Equation Modeling Semiparametric Truncated Spline Using Simulation Data
Authors: Adji Achmad Rinaldo Fernandes
Abstract:
SEM analysis is a complex multivariate analysis because it involves a number of exogenous and endogenous variables that are interconnected to form a model. The measurement model is divided into two, namely, the reflective model (reflecting) and the formative model (forming). Before carrying out further tests on SEM, there are assumptions that must be met, namely the linearity assumption, to determine the form of the relationship. There are three modeling approaches to path analysis, including parametric, nonparametric and semiparametric approaches. The aim of this research is to develop semiparametric SEM and obtain the best model. The data used in the research is secondary data as the basis for the process of obtaining simulation data. Simulation data was generated with various sample sizes of 100, 300, and 500. In the semiparametric SEM analysis, the form of the relationship studied was determined, namely linear and quadratic and determined one and two knot points with various levels of error variance (EV=0.5; 1; 5). There are three levels of closeness of relationship for the analysis process in the measurement model consisting of low (0.1-0.3), medium (0.4-0.6) and high (0.7-0.9) levels of closeness. The best model lies in the form of the relationship X1Y1 linear, and. In the measurement model, a characteristic of the reflective model is obtained, namely that the higher the closeness of the relationship, the better the model obtained. The originality of this research is the development of semiparametric SEM, which has not been widely studied by researchers.Keywords: semiparametric SEM, measurement model, structural model, reflective model, formative model
Procedia PDF Downloads 4041972 Prediction of Marine Ecosystem Changes Based on the Integrated Analysis of Multivariate Data Sets
Authors: Prozorkevitch D., Mishurov A., Sokolov K., Karsakov L., Pestrikova L.
Abstract:
The current body of knowledge about the marine environment and the dynamics of marine ecosystems includes a huge amount of heterogeneous data collected over decades. It generally includes a wide range of hydrological, biological and fishery data. Marine researchers collect these data and analyze how and why the ecosystem changes from past to present. Based on these historical records and linkages between the processes it is possible to predict future changes. Multivariate analysis of trends and their interconnection in the marine ecosystem may be used as an instrument for predicting further ecosystem evolution. A wide range of information about the components of the marine ecosystem for more than 50 years needs to be used to investigate how these arrays can help to predict the future.Keywords: barents sea ecosystem, abiotic, biotic, data sets, trends, prediction
Procedia PDF Downloads 11641971 Combining Diffusion Maps and Diffusion Models for Enhanced Data Analysis
Authors: Meng Su
Abstract:
High-dimensional data analysis often presents challenges in capturing the complex, nonlinear relationships and manifold structures inherent to the data. This article presents a novel approach that leverages the strengths of two powerful techniques, Diffusion Maps and Diffusion Probabilistic Models (DPMs), to address these challenges. By integrating the dimensionality reduction capability of Diffusion Maps with the data modeling ability of DPMs, the proposed method aims to provide a comprehensive solution for analyzing and generating high-dimensional data. The Diffusion Map technique preserves the nonlinear relationships and manifold structure of the data by mapping it to a lower-dimensional space using the eigenvectors of the graph Laplacian matrix. Meanwhile, DPMs capture the dependencies within the data, enabling effective modeling and generation of new data points in the low-dimensional space. The generated data points can then be mapped back to the original high-dimensional space, ensuring consistency with the underlying manifold structure. Through a detailed example implementation, the article demonstrates the potential of the proposed hybrid approach to achieve more accurate and effective modeling and generation of complex, high-dimensional data. Furthermore, it discusses possible applications in various domains, such as image synthesis, time-series forecasting, and anomaly detection, and outlines future research directions for enhancing the scalability, performance, and integration with other machine learning techniques. By combining the strengths of Diffusion Maps and DPMs, this work paves the way for more advanced and robust data analysis methods.Keywords: diffusion maps, diffusion probabilistic models (DPMs), manifold learning, high-dimensional data analysis
Procedia PDF Downloads 107