Search results for: censored data
7454 Survival Model for Partly Interval-Censored Data with Application to Anti D in Rhesus D Negative Studies
Authors: F. A. M. Elfaki, Amar Abobakar, M. Azram, M. Usman
Abstract:
This paper discusses regression analysis of partly interval-censored failure time data, which occur in many fields, including demographic, epidemiological, financial, medical and sociological studies. For this problem, we focus on the situation where the survival time of interest can be described by the additive hazards model in the presence of partly interval-censored data. A major advantage of the approach is its simplicity, and it can be easily implemented using R software. Simulation studies are conducted which indicate that the approach performs well in practical situations and is comparable to existing methods. The methodology is applied to a set of partly interval-censored failure time data arising from anti-D in Rhesus D negative studies.
Keywords: Anti D in Rhesus D negative, Cox’s model, EM algorithm.
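To make the notion of "partly interval-censored" concrete, here is a minimal Python sketch (not the authors' R implementation; the covariate, baseline hazard and inspection schedule are hypothetical) that simulates failure times under a constant-baseline additive hazards model and records some failures exactly and the rest only as inspection intervals.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_partly_interval_censored(n=200, lam0=0.10, beta=0.05, p_exact=0.4):
        # Constant-baseline additive hazard lambda(t|x) = lam0 + beta*x (assumed positive),
        # so the failure time is exponential with rate lam0 + beta*x.
        x = rng.binomial(1, 0.5, size=n)
        t = rng.exponential(1.0 / (lam0 + beta * x))
        inspections = np.arange(0.0, 60.0, 6.0)      # scheduled inspection times
        records = []
        for ti, xi in zip(t, x):
            if rng.random() < p_exact:
                records.append((ti, ti, xi))          # exactly observed failure time
            else:
                idx = np.searchsorted(inspections, ti)
                left = inspections[idx - 1] if idx > 0 else 0.0
                right = inspections[idx] if idx < len(inspections) else np.inf
                records.append((left, right, xi))     # failure known only up to an interval
        return records

    print(simulate_partly_interval_censored()[:5])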
7453 Estimating Regression Parameters in Linear Regression Model with a Censored Response Variable
Authors: Jesus Orbe, Vicente Nunez-Anton
Abstract:
In this work we study the effect of several covariates X on a censored response variable T with unknown probability distribution. In this context, most studies in the literature can be placed in two general classes of regression models: models that study the effect the covariates have on the hazard function, and models that study the effect the covariates have on the censored response variable. The proposals in this paper belong to the second class of models and, more specifically, to the least squares based model approach. Thus, using the bootstrap estimate of the bias, we try to improve the estimation of the regression parameters by reducing their bias for small sample sizes. Simulation results presented in the paper show that, for reasonable sample sizes and censoring levels, the bias is always smaller for the new proposals.
Keywords: Censored response variable, regression, bias.
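A minimal sketch of the bias-correction step, assuming ordinary (uncensored) least squares for simplicity; the paper works with a censored response, where the least squares criterion itself must be adapted, which this illustration does not reproduce.

    import numpy as np

    rng = np.random.default_rng(1)

    def ols(X, y):
        return np.linalg.lstsq(X, y, rcond=None)[0]

    def bootstrap_bias_corrected_ols(X, y, n_boot=500):
        # Generic bootstrap bias correction:
        #   bias_hat = mean(beta_boot) - beta_hat,  corrected = beta_hat - bias_hat.
        beta_hat = ols(X, y)
        boots = np.empty((n_boot, X.shape[1]))
        for b in range(n_boot):
            idx = rng.integers(0, len(y), size=len(y))   # resample cases with replacement
            boots[b] = ols(X[idx], y[idx])
        return beta_hat - (boots.mean(axis=0) - beta_hat)

    X = np.column_stack([np.ones(50), rng.normal(size=50)])
    y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)
    print(bootstrap_bias_corrected_ols(X, y))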
7452 Discovery of Fuzzy Censored Production Rules from Large Set of Discovered Fuzzy if then Rules
Authors: Tamanna Siddiqui, M. Afshar Alam
Abstract:
Censored Production Rules are an extension of standard production rules, concerned with the problems of reasoning with incomplete information subject to resource constraints and of reasoning efficiently with exceptions. A CPR has the form: IF A (Condition) THEN B (Action) UNLESS C (Censor), where C is the exception condition. Fuzzy CPRs are obtained by augmenting an ordinary fuzzy production rule "If X is A then Y is B" with an exception condition and are written in the form "If X is A then Y is B Unless Z is C". Such rules are employed in situations in which the fuzzy conditional statement "If X is A then Y is B" holds frequently and the exception condition "Z is C" holds rarely. Thus the "If X is A then Y is B" part of the fuzzy CPR expresses important information, while the Unless part acts only as a switch that changes the polarity of "Y is B" to "Y is not B" when the assertion "Z is C" holds. The proposed approach is an attempt to discover fuzzy censored production rules from a set of discovered fuzzy if-then rules in the form: A(X) ⇒ B(Y) || C(Z).
Keywords: Uncertainty Quantification, Fuzzy if-then rules, Fuzzy Censored Production Rules, Learning algorithm.
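A minimal Python sketch of how such a fuzzy CPR could be evaluated, assuming hypothetical trapezoidal membership functions and a hypothetical censor-activation threshold; it only illustrates the polarity switch described above, not the authors' discovery algorithm.

    def trapezoid(a, b, c, d):
        # Trapezoidal membership function with hypothetical breakpoints.
        def mu(x):
            if x <= a or x >= d:
                return 0.0
            if b <= x <= c:
                return 1.0
            return (x - a) / (b - a) if x < b else (d - x) / (d - c)
        return mu

    mu_A = trapezoid(20, 30, 40, 50)    # membership of "X is A"
    mu_C = trapezoid(80, 90, 100, 110)  # membership of the censor "Z is C"

    def fuzzy_cpr(x, z, censor_threshold=0.5):
        # Evaluate "If X is A then Y is B Unless Z is C": return the firing
        # strength, flipping the conclusion when the censor clearly holds.
        strength = mu_A(x)
        if mu_C(z) > censor_threshold:          # censor fires: switch polarity
            return ("Y is not B", strength)
        return ("Y is B", strength)

    print(fuzzy_cpr(x=35, z=50))   # censor inactive -> ('Y is B', 1.0)
    print(fuzzy_cpr(x=35, z=95))   # censor active   -> ('Y is not B', 1.0)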
7451 A Renovated Cook's Distance Based On The Buckley-James Estimate In Censored Regression
Authors: Nazrina Aziz, Dong Q. Wang
Abstract:
There have been various methods created based on regression ideas to resolve the problem of data sets containing censored observations, i.e. the Buckley-James method, Miller's method, Cox's method, and the Koul-Susarla-Van Ryzin estimators. Even though comparison studies show that the Buckley-James method performs better than some other methods, it is still rarely used by researchers, mainly because of the limited diagnostic analysis developed for the Buckley-James method thus far. Therefore, a diagnostic tool for the Buckley-James method is proposed in this paper. It is called the renovated Cook's Distance, RD*_i, and has been developed based on Cook's idea. The renovated Cook's Distance RD*_i has advantages (depending on the analyst's demand) over (i) the change in the fitted value for a single case, DFIT*_i, as it measures the influence of case i on all n fitted values Ŷ* (not just the fitted value for case i, as DFIT*_i does), and (ii) the change in the estimate of the coefficients when the ith case is deleted, DBETA*_i, since DBETA*_i corresponds to the number of variables p, so it is usually easier to look at a diagnostic measure such as RD*_i, where information from the p variables can be considered simultaneously. Finally, an example using the Stanford Heart Transplant data is provided to illustrate the proposed diagnostic tool.
Keywords: Buckley-James estimators, censored regression, censored data, diagnostic analysis, product-limit estimator, renovated Cook's Distance.
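For reference, the classical Cook's distance on which the renovated statistic builds (the paper's RD*_i adapts this idea to Buckley-James fitted values; the expression below is the standard uncensored form):

    D_i \;=\; \frac{\sum_{j=1}^{n}\bigl(\hat{y}_j - \hat{y}_{j(i)}\bigr)^2}{p\,s^2}
         \;=\; \frac{r_i^2}{p\,s^2}\,\frac{h_{ii}}{(1-h_{ii})^2},

where \hat{y}_{j(i)} are the fitted values computed with case i deleted, r_i is the ith residual, h_{ii} the leverage, p the number of parameters, and s^2 the residual mean square.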
7450 Computational Aspects of Regression Analysis of Interval Data
Authors: Michal Cerny
Abstract:
We consider linear regression models where both the input data (the values of the independent variables) and the output data (the observations of the dependent variable) are interval-censored. We introduce a possibilistic generalization of the least squares estimator, the so-called OLS-set for the interval model. This set captures the impact of the loss of information on the OLS estimator caused by interval censoring and provides a tool for quantifying this effect. We study complexity-theoretic properties of the OLS-set. We also deal with restricted versions of the general interval linear regression model, in particular the crisp input – interval output model. We give an argument that natural descriptions of the OLS-set in the crisp input – interval output model cannot be computed in polynomial time. Then we derive easily computable approximations for the OLS-set which can be used instead of the exact description. We illustrate the approach by an example.
Keywords: Linear regression, interval-censored data, computational complexity.
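One natural way to write down the OLS-set described above (a sketch of the definition; the paper's exact formulation may differ in details such as the treatment of singular configurations): for an interval matrix \mathbf{X} and interval vector \mathbf{y},

    \mathrm{OLS}(\mathbf{X},\mathbf{y}) \;=\; \bigl\{\, (X^{\top}X)^{-1}X^{\top}y \;:\; X \in \mathbf{X},\; y \in \mathbf{y},\; X^{\top}X \text{ nonsingular} \,\bigr\},

i.e. the set of all least squares estimates attainable as the unobserved crisp data range over their intervals.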
7449 A Cumulative Learning Approach to Data Mining Employing Censored Production Rules (CPRs)
Authors: Rekha Kandwal, Kamal K. Bharadwaj
Abstract:
Knowledge is indispensable, but voluminous knowledge becomes a bottleneck for efficient processing. A great challenge for data mining activity is the generation of a large number of potential rules as a result of the mining process. In fact, the result size is sometimes comparable to the original data. Traditional data mining pruning activities such as support do not sufficiently reduce the huge rule space. Moreover, many practical applications are characterized by continual change of data and knowledge, thereby making knowledge voluminous with each change. The most predominant representation of the discovered knowledge is the standard Production Rules (PRs) in the form If P Then D. Michalski & Winston proposed Censored Production Rules (CPRs), as an extension of production rules, that exhibit variable precision and support an efficient mechanism for handling exceptions. A CPR is an augmented production rule of the form: If P Then D Unless C, where C (Censor) is an exception to the rule. Such rules are employed in situations in which the conditional statement 'If P Then D' holds frequently and the assertion C holds rarely. By using a rule of this type we are free to ignore the exception conditions when the resources needed to establish their presence are tight or there is simply no information available as to whether they hold or not. Thus the 'If P Then D' part of the CPR expresses important information, while the Unless part acts only as a switch that changes the polarity of D to ~D. In this paper a scheme based on a Dempster-Shafer Theory (DST) interpretation of a CPR is suggested for discovering CPRs from the discovered flat PRs. The discovery of CPRs from flat rules would result in considerable reduction of the already discovered rules. The proposed scheme incrementally incorporates new knowledge and also reduces the size of the knowledge base considerably with each episode. Examples are given to demonstrate the behaviour of the proposed scheme. The suggested cumulative learning scheme would be useful in mining data streams.
Keywords: Censored production rules, cumulative learning, data mining, machine learning.
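As a rough illustration of turning flat PRs into CPR candidates (a sketch only: the rule names are hypothetical and the paper's Dempster-Shafer weighting of belief in the rule and its censor is not reproduced), a general rule 'If P Then D' can be paired with a more specific contradicting rule 'If P and C Then ~D' to propose 'If P Then D Unless C':

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Rule:
        antecedent: frozenset   # set of condition literals
        consequent: str         # e.g. "D" or "~D"

    def propose_cprs(flat_rules):
        # Pair 'If P Then D' with 'If P+C Then ~D' to propose
        # 'If P Then D Unless C' candidates (no DST weighting here).
        cprs = []
        for general in flat_rules:
            for specific in flat_rules:
                if (specific.consequent == "~" + general.consequent
                        and general.antecedent < specific.antecedent):
                    censor = specific.antecedent - general.antecedent
                    cprs.append((general.antecedent, general.consequent, censor))
        return cprs

    rules = [
        Rule(frozenset({"bird(x)"}), "flies(x)"),
        Rule(frozenset({"bird(x)", "penguin(x)"}), "~flies(x)"),
    ]
    for p, d, c in propose_cprs(rules):
        print(f"IF {set(p)} THEN {d} UNLESS {set(c)}")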
7448 Evolutionary Approach for Automated Discovery of Censored Production Rules
Authors: Kamal K. Bharadwaj, Basheer M. Al-Maqaleh
Abstract:
In the recent past, there has been an increasing interest in applying evolutionary methods to Knowledge Discovery in Databases (KDD), and a number of successful applications of Genetic Algorithms (GA) and Genetic Programming (GP) to KDD have been demonstrated. The most predominant representation of the discovered knowledge is the standard Production Rules (PRs) in the form If P Then D. The PRs, however, are unable to handle exceptions and do not exhibit variable precision. The Censored Production Rules (CPRs), an extension of PRs, were proposed by Michalski & Winston; they exhibit variable precision and support an efficient mechanism for handling exceptions. A CPR is an augmented production rule of the form: If P Then D Unless C, where C (Censor) is an exception to the rule. Such rules are employed in situations in which the conditional statement 'If P Then D' holds frequently and the assertion C holds rarely. By using a rule of this type we are free to ignore the exception conditions when the resources needed to establish their presence are tight or there is simply no information available as to whether they hold or not. Thus, the 'If P Then D' part of the CPR expresses important information, while the Unless C part acts only as a switch and changes the polarity of D to ~D. This paper presents a classification algorithm, based on an evolutionary approach, that discovers comprehensible rules with exceptions in the form of CPRs. The proposed approach has a flexible chromosome encoding, where each chromosome corresponds to a CPR. Appropriate genetic operators are suggested and a fitness function is proposed that incorporates the basic constraints on CPRs. Experimental results are presented to demonstrate the performance of the proposed algorithm.
Keywords: Censored Production Rule, Data Mining, Machine Learning, Evolutionary Algorithms.
7447 Maximum Likelihood Estimation of Burr Type V Distribution under Left Censored Samples
Abstract:
The paper deals with the maximum likelihood estimation of the parameters of the Burr type V distribution based on left censored samples. The maximum likelihood estimators (MLE) of the parameters have been derived and the Fisher information matrix for the parameters of the said distribution has been obtained explicitly. The confidence intervals for the parameters have also been discussed. A simulation study has been conducted to investigate the performance of the point and interval estimates.
Keywords: Fisher information matrix, confidence intervals, censoring.
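For context, the generic likelihood for a left-censored sample takes a standard form, with the paper specializing f and F to the Burr type V distribution: if the r smallest observations are known only to lie below the censoring point T and the remaining n - r values x_{(r+1)}, ..., x_{(n)} are fully observed,

    L(\theta) \;\propto\; \bigl[F(T;\theta)\bigr]^{\,r}\;\prod_{i=r+1}^{n} f\bigl(x_{(i)};\theta\bigr),

and the MLEs are the roots of \partial \log L / \partial \theta = 0, with the Fisher information obtained from the second derivatives of \log L.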
7446 On Bayesian Analysis of Failure Rate under Topp Leone Distribution using Complete and Censored Samples
Abstract:
The article is concerned with the analysis of the failure rate (shape parameter) under the Topp Leone distribution using a Bayesian framework. Different loss functions and a couple of noninformative priors have been assumed for posterior estimation. The posterior predictive distributions have also been derived. A simulation study has been carried out to compare the performance of different estimators. A real-life example has been used to illustrate the applicability of the results obtained. The findings of the study suggest that the precautionary loss function based on the Jeffreys prior and singly type II censored samples can effectively be employed to obtain the Bayes estimate of the failure rate under the Topp Leone distribution.
Keywords: loss functions, type II censoring, posterior distribution, Bayes estimators.
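For reference, one standard form of the precautionary loss function singled out in the findings, together with the resulting Bayes estimator (a textbook result; the paper combines it with the Jeffreys prior and singly type II censored data):

    L(\hat{\theta},\theta) \;=\; \frac{(\hat{\theta}-\theta)^2}{\hat{\theta}},
    \qquad
    \hat{\theta}_{P} \;=\; \sqrt{\,E\bigl(\theta^{2}\mid \text{data}\bigr)\,},

i.e. the estimator minimizing the posterior expected precautionary loss is the square root of the posterior second moment.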
7445 A Forward Automatic Censored Cell-Averaging Detector for Multiple Target Situations in Log-Normal Clutter
Authors: Musa'ed N. Almarshad, Saleh A. Alshebeili, Mourad Barkat
Abstract:
A challenging problem in radar signal processing is to achieve reliable target detection in the presence of interferences. In this paper, we propose a novel algorithm for automatic censoring of radar interfering targets in log-normal clutter. The proposed algorithm, termed the forward automatic censored cell averaging detector (F-ACCAD), consists of two steps: removing the corrupted reference cells (censoring) and the actual detection. Both steps are performed dynamically by using a suitable set of ranked cells to estimate the unknown background level and set the adaptive thresholds accordingly. The F-ACCAD algorithm requires no prior information about the clutter parameters nor the number of interfering targets. The effectiveness of the F-ACCAD algorithm is assessed by computing, using Monte Carlo simulations, the probability of censoring and the probability of detection in different background environments.
Keywords: CFAR, Log-normal clutter, Censoring, Probability of detection, Probability of false alarm, Probability of false censoring.
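A minimal sketch, in Python, of the two steps of a censored cell-averaging CFAR (this is not the F-ACCAD algorithm, which chooses the censoring depth dynamically and is derived for log-normal clutter; the fixed censoring depth and threshold scale below are hypothetical):

    import numpy as np

    def censored_ca_cfar(reference_cells, cell_under_test, k_censor=4, scale=5.0):
        # 1) rank the reference cells and discard the k largest (possible
        #    interfering targets) -- the censoring step;
        # 2) estimate the background from the remaining cells and compare the
        #    cell under test against an adaptive threshold -- the detection step.
        ranked = np.sort(np.asarray(reference_cells))
        kept = ranked[:len(ranked) - k_censor]
        background = kept.mean()
        threshold = scale * background
        return cell_under_test > threshold, threshold

    rng = np.random.default_rng(2)
    ref = rng.exponential(1.0, size=24)
    ref[[3, 7]] += 30.0                 # two interfering targets in the reference window
    print(censored_ca_cfar(ref, cell_under_test=12.0))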
7444 Learning Classifier Systems Approach for Automated Discovery of Censored Production Rules
Authors: Suraiya Jabin, Kamal K. Bharadwaj
Abstract:
In the recent past, Learning Classifier Systems have been successfully used for data mining. A Learning Classifier System (LCS) is basically a machine learning technique which combines evolutionary computing, reinforcement learning, supervised or unsupervised learning and heuristics to produce adaptive systems. An LCS learns by interacting with an environment from which it receives feedback in the form of numerical reward. Learning is achieved by trying to maximize the amount of reward received. All LCS models, more or less, comprise four main components: a finite population of condition-action rules, called classifiers; the performance component, which governs the interaction with the environment; the credit assignment component, which distributes the reward received from the environment to the classifiers accountable for the rewards obtained; and the discovery component, which is responsible for discovering better rules and improving existing ones through a genetic algorithm. The concatenation of the production rules in the LCS forms the genotype, and therefore the GA should operate on a population of classifier systems. This approach is known as the 'Pittsburgh' Classifier System. Other LCSs that perform their GA at the rule level within a population are known as 'Michigan' Classifier Systems. The most predominant representation of the discovered knowledge is the standard production rules (PRs) in the form IF P THEN D. The PRs, however, are unable to handle exceptions and do not exhibit variable precision. The Censored Production Rules (CPRs), an extension of PRs, were proposed by Michalski and Winston; they exhibit variable precision and support an efficient mechanism for handling exceptions. A CPR is an augmented production rule of the form: IF P THEN D UNLESS C, where the Censor C is an exception to the rule. Such rules are employed in situations in which the conditional statement IF P THEN D holds frequently and the assertion C holds rarely. By using a rule of this type we are free to ignore the exception conditions when the resources needed to establish their presence are tight or there is simply no information available as to whether they hold or not. Thus, the IF P THEN D part of the CPR expresses important information, while the UNLESS C part acts only as a switch and changes the polarity of D to ~D. In this paper the Pittsburgh style LCS approach is used for automated discovery of CPRs. An appropriate encoding scheme is suggested to represent a chromosome consisting of a fixed size set of CPRs. Suitable genetic operators are designed for the set of CPRs and for individual CPRs, and an appropriate fitness function is proposed that incorporates the basic constraints on CPRs. Experimental results are presented to demonstrate the performance of the proposed learning classifier system.
Keywords: Censored Production Rule, Data Mining, Genetic Algorithm, Learning Classifier System, Machine Learning, Pittsburgh Approach, Reinforcement Learning.
7443 A Study on the Waiting Time for the First Employment of Arts Graduates in Sri Lanka
Authors: Imali T. Jayamanne, K. P. Asoka Ramanayake
Abstract:
Transition from tertiary level education to employment is one of the challenges that many fresh university graduates face after graduation. The transition period, or the waiting time to obtain the first employment, varies with socio-economic factors and the general characteristics of a graduate. Compared to other fields of study, Arts graduates in Sri Lanka have to wait a long time to find their first employment. The objective of this study is to identify the determinants of the transition from higher education to employment of these graduates using survival models. The study is based on a survey that was conducted in the year 2016 on a stratified random sample of Arts graduates from Sri Lankan universities who had graduated in 2012. Among the 469 responses, 36 (8%) waiting times were interval censored and 13 (3%) were right censored. Waiting time for the first employment varied between zero and 51 months. Initially, the log-rank and the Gehan-Wilcoxon tests were performed to identify the significant factors. Gender, ethnicity, GCE Advanced Level English grade, civil status, university, class received, degree type, sector of first employment, type of first employment and the educational qualifications required for the first employment were significant at 10%. The Cox proportional hazards model was fitted to model the waiting time for first employment with these significant factors. All factors, except ethnicity and type of employment, were significant at 5%. However, since the proportional hazards assumption was violated, the lognormal accelerated failure time (AFT) model was fitted to model the waiting time for the first employment. The same factors were significant in the AFT model as in the Cox proportional hazards model.
Keywords: AFT model, first employment, proportional hazard, survey design, waiting time.
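A minimal sketch of the modelling sequence described above using the Python lifelines package on hypothetical, right-censored toy data (the survey's interval-censored cases and full covariate set are not reproduced here):

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter, LogNormalAFTFitter

    rng = np.random.default_rng(3)
    n = 60
    female = rng.binomial(1, 0.5, n)
    first_class = rng.binomial(1, 0.3, n)
    # Hypothetical data-generating process: shorter waits for first-class graduates.
    wait = rng.lognormal(mean=2.5 - 0.6 * first_class + 0.2 * female, sigma=0.6, size=n)
    df = pd.DataFrame({
        "waiting_months": np.minimum(wait, 51),      # right-censor at 51 months
        "employed": (wait < 51).astype(int),         # 0 = still unemployed at follow-up
        "female": female,
        "first_class": first_class,
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="waiting_months", event_col="employed")
    cph.print_summary()

    aft = LogNormalAFTFitter()      # the family the study falls back on when PH fails
    aft.fit(df, duration_col="waiting_months", event_col="employed")
    aft.print_summary()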
7442 Additional Considerations on a Sequential Life Testing Approach using a Weibull Model
Authors: D. I. De Souza, D. R. Fonseca, R. Rocha
Abstract:
In this paper we will develop further the sequential life test approach presented in a previous article [1], using an underlying two-parameter Weibull sampling distribution. The minimum life will be considered equal to zero. We will again provide rules for making one of the three possible decisions as each observation becomes available; that is: accept the null hypothesis H0; reject the null hypothesis H0; or obtain additional information by making another observation. The product being analyzed is a new type of low alloy-high strength steel product. To estimate the shape and the scale parameters of the underlying Weibull model we will use a maximum likelihood approach for censored failure data. A new example will further develop the proposed sequential life testing approach.
Keywords: Sequential Life Testing, Underlying Weibull Model, Maximum Likelihood Approach, Hypothesis Testing.
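The three-way decision rule described above follows the Wald sequential probability ratio test pattern, shown here in its generic form as a sketch (the specific bounds and hypotheses of [1] may differ): with target type I and type II error probabilities \alpha and \beta, and Weibull densities under H_0 and H_1,

    \Lambda_k \;=\; \prod_{i=1}^{k}\frac{f(t_i\mid H_1)}{f(t_i\mid H_0)},
    \qquad
    \text{accept } H_0 \text{ if } \Lambda_k \le \frac{\beta}{1-\alpha},
    \quad
    \text{reject } H_0 \text{ if } \Lambda_k \ge \frac{1-\beta}{\alpha},

and otherwise take another observation.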
7441 Retail Strategy to Reduce Waste Keeping High Profit Utilizing Taylor's Law in Point-of-Sales Data
Authors: Gen Sakoda, Hideki Takayasu, Misako Takayasu
Abstract:
Waste reduction is a fundamental problem for sustainability. Methods for waste reduction using point-of-sales (POS) data are proposed, utilizing the knowledge of a recent econophysics study on a statistical property of POS data. Concretely, a non-stationary time series analysis method based on the Particle Filter is developed, which accounts for the anomalous fluctuation scaling known as Taylor's law. This method is extended to handle sales data that are incomplete because of stock-outs by introducing maximum likelihood estimation for censored data. A way of determining the optimal stock level that prices the cost of waste reduction is also proposed. This study focuses on examining the methods for large sales numbers, where Taylor's law is evident. Numerical analysis using aggregated POS data shows the effectiveness of the methods in reducing food waste while maintaining a high profit for large sales numbers. Moreover, pricing the cost of waste reduction reveals that a small profit loss realizes substantial waste reduction, especially when the proportionality constant of Taylor's law is small. Specifically, around a 1% profit loss halves disposal at a proportionality constant of 0.12, which is the actual value for the processed food items used in this research. The methods provide practical and effective solutions for waste reduction while keeping a high profit, especially with large sales numbers.
Keywords: Food waste reduction, particle filter, point of sales, sustainable development goals, Taylor's Law, time series analysis.
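For orientation, Taylor's law (fluctuation scaling) relates the variance of the sales count to its mean; a commonly used crossover form in the POS literature is (an assumption here, not necessarily the authors' exact parameterisation):

    \sigma^2(\mu) \;=\; \mu \;+\; c^{2}\mu^{2},

so that fluctuations are Poissonian (\sigma \approx \sqrt{\mu}) for small mean sales and proportional to the mean (\sigma \approx c\,\mu) for large mean sales, with c playing the role of the proportionality constant (about 0.12 for the processed food items mentioned above).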
7440 Further Thoughts on a Sequential Life Testing Approach Using an Inverse Weibull Model
Authors: D. I. De Souza, G. P. Azevedo, D. R. Fonseca
Abstract:
In this paper we will develop further the sequential life test approach presented in a previous article [1], using an underlying two-parameter Inverse Weibull sampling distribution. The location parameter or minimum life will be considered equal to zero. Once again we will provide rules for making one of the three possible decisions as each observation becomes available; that is: accept the null hypothesis H0; reject the null hypothesis H0; or obtain additional information by making another observation. The product being analyzed is a new electronic component. There is little information available about the possible values the parameters of the corresponding Inverse Weibull underlying sampling distribution could have. To estimate the shape and the scale parameters of the underlying Inverse Weibull model we will use a maximum likelihood approach for censored failure data. A new example will further develop the proposed sequential life testing approach.
Keywords: Sequential Life Testing, Inverse Weibull Model, Maximum Likelihood Approach, Hypothesis Testing.
7439 Inferences on Compound Rayleigh Parameters with Progressively Type-II Censored Samples
Authors: Abdullah Y. Al-Hossain
Abstract:
This paper considers inference under progressive type II censoring with a compound Rayleigh failure time distribution. The maximum likelihood (ML) and Bayes methods are used for estimating the unknown parameters as well as some lifetime parameters, namely the reliability and hazard functions. We obtained Bayes estimators using conjugate priors for the shape and scale parameters. When the two parameters are unknown, closed-form expressions of the Bayes estimators cannot be obtained, so we use Lindley's approximation to compute the Bayes estimates. Another Bayes estimator has been obtained based on a continuous-discrete joint prior for the unknown parameters. An example with real data is discussed to illustrate the proposed method. Finally, we make comparisons between these estimators and the maximum likelihood estimators using a Monte Carlo simulation study.
Keywords: Progressive type II censoring, compound Rayleigh failure time distribution, maximum likelihood estimation, Bayes estimation, Lindley's approximation method, Monte Carlo simulation.
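For reference, the likelihood under progressive type II censoring that both the ML and Bayes analyses start from takes the standard form (with f and F here specialized to the compound Rayleigh distribution): if m failures x_1 < ... < x_m are observed out of n units and R_i surviving units are withdrawn at the i-th failure,

    L(\theta) \;\propto\; \prod_{i=1}^{m} f(x_i;\theta)\,\bigl[1-F(x_i;\theta)\bigr]^{R_i},
    \qquad \sum_{i=1}^{m}(R_i+1)=n.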
7438 A New Brazilian Friction-Resistant Low Alloy High Strength Steel – A Life Testing Approach
Authors: D. I. De Souza, G. P. Azevedo, R. Rocha
Abstract:
In this paper we will develop a sequential life test approach applied to a modified low alloy-high strength steel part used in highway overpasses in Brazil. We will consider two possible underlying sampling distributions: the Normal and the Inverse Weibull models. The minimum life will be considered equal to zero. We will use the two underlying models to analyze a fatigue life test situation, comparing the results obtained from both. Since a major chemical component of this low alloy-high strength steel part has been changed, there is little information available about the possible values that the parameters of the corresponding Normal and Inverse Weibull underlying sampling distributions could have. To estimate the shape and the scale parameters of these two sampling models we will use a maximum likelihood approach for censored failure data. We will also develop a truncation mechanism for the Inverse Weibull and Normal models. We will provide rules to truncate a sequential life testing situation, making one of the two possible decisions at the moment of truncation; that is, accept or reject the null hypothesis H0. An example will develop the proposed truncated sequential life testing approach for the Inverse Weibull and Normal models.
Keywords: Sequential life testing, normal and inverse Weibull models, maximum likelihood approach, truncation mechanism.
7437 A Combined Approach of a Sequential Life Testing and an Accelerated Life Testing Applied to a Low-Alloy High Strength Steel Component
Authors: D. I. De Souza, D. R. Fonseca, G. P. Azevedo
Abstract:
Sometimes the amount of time available for testing can be considerably less than the expected lifetime of the component. To overcome such a problem, there is the accelerated life-testing alternative, aimed at forcing components to fail by testing them at much higher-than-intended application conditions. These models are known as acceleration models. One possible way to translate test results obtained under accelerated conditions to normal use conditions is through the application of the "Maxwell Distribution Law." In this paper we will apply a combined approach of a sequential life testing and an accelerated life testing to a low alloy high-strength steel component used in the construction of overpasses in Brazil. The underlying sampling distribution will be the three-parameter Inverse Weibull model. To estimate the three parameters of the Inverse Weibull model we will use a maximum likelihood approach for censored failure data. We will be assuming a linear acceleration condition. To evaluate the accuracy (significance) of the parameter values obtained under normal conditions for the underlying Inverse Weibull model, we will apply to the expected normal failure times a sequential life testing using a truncation mechanism. An example will illustrate the application of this procedure.
Keywords: Sequential Life Testing, Accelerated Life Testing, Underlying Three-Parameter Weibull Model, Maximum Likelihood Approach, Hypothesis Testing.
7436 Duration Analysis of New Firms in the Banking Industry
Authors: Jesus Orbe, Vicente Nunez-Anton
Abstract:
This paper studies the duration or survival time of commercial banks active in the Moscow three-month Rouble deposit market during the 1994-1997 period. The privatization process of the Russian commercial banking industry, after the 1988 banking reform, caused a massive entry of new banks followed by a period of high rates of exit. As a consequence, many firms went bankrupt without refunding their deposits. Therefore, both for the banks and for the banks' depositors, it is of interest to analyze which characteristics significantly motivate the exit or closing of a bank. We propose a different methodology based on penalized weighted least squares, which represents a very general, flexible and innovative approach for this type of analysis. The more relevant results are that smaller banks exit sooner and that banks entering the market in the later part of the study period have shorter durations. As expected, the more experienced banks have a longer duration in the market. In addition, the mean survival time is lower for banks which offer extreme interest rates.
Keywords: Banking, censored, duration, Kaplan-Meier.
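The Kaplan-Meier estimator listed in the keywords, which underlies the descriptive survival analysis here, is

    \hat{S}(t) \;=\; \prod_{t_{(i)}\le t}\Bigl(1-\frac{d_i}{n_i}\Bigr),

where d_i banks exit at time t_{(i)} and n_i banks are still at risk just before t_{(i)}; banks still operating at the end of the 1994-1997 window enter as right-censored observations.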
7435 Big Data: Big Challenges to Privacy and Data Protection
Authors: Abu Bakar Munir, Siti Hajar Mohd Yasin, Firdaus Muhammad-Sukki
Abstract:
This paper seeks to analyse the benefits of big data and, more importantly, the challenges it poses to privacy and data protection. First, the nature of big data will be briefly considered before presenting its potential in the present day. Afterwards, the issue of privacy and data protection is highlighted before the challenges of addressing this issue in big data are discussed. In conclusion, the paper will put forward the debate on the adequacy of the existing legal framework in protecting personal data in the era of big data.
Keywords: Big data, data protection, information, privacy.
7434 Data Preprocessing for Supervised Learning
Authors: S. B. Kotsiantis, D. Kanellopoulos, P. E. Pintelas
Abstract:
Many factors affect the success of Machine Learning (ML) on a given task. The representation and quality of the instance data is first and foremost. If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. It is well known that data preparation and filtering steps take a considerable amount of processing time in ML problems. Data pre-processing includes data cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. It would be nice if a single sequence of data pre-processing algorithms had the best performance for every data set, but this does not happen. Thus, we present the most well-known algorithms for each step of data pre-processing so that one achieves the best performance for their data set.
Keywords: Data mining, feature selection, data cleaning.
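A minimal sketch of chaining typical pre-processing steps before a learner, using scikit-learn; the concrete steps, dataset and learner are illustrative choices, not a recommendation from the paper.

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # data cleaning: missing values
        ("scale", StandardScaler()),                    # normalization / transformation
        ("select", SelectKBest(f_classif, k=10)),       # feature selection
        ("model", LogisticRegression(max_iter=1000)),
    ])
    print(cross_val_score(pipe, X, y, cv=5).mean())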
7433 Applications of Big Data in Education
Authors: Faisal Kalota
Abstract:
Big Data and analytics have gained huge momentum in recent years. Big Data feeds into the field of Learning Analytics (LA), which may allow academic institutions to better understand learners' needs and proactively address them. Hence, it is important to have an understanding of Big Data and its applications. The purpose of this descriptive paper is to provide an overview of Big Data, the technologies used in Big Data, and some of the applications of Big Data in education. Additionally, it discusses some of the concerns related to Big Data and current research trends. While Big Data can provide big benefits, it is important that institutions understand their own needs, infrastructure, resources, and limitations before jumping on the Big Data bandwagon.
Keywords: Analytics, Big Data in Education, Hadoop, Learning Analytics.
7432 Research of Data Cleaning Methods Based on Dependency Rules
Authors: Yang Bao, Shi Wei Deng, Wang Qun Lin
Abstract:
This paper introduces the concept and principles of data cleaning, analyzes the types and causes of dirty data, and proposes several key steps of a typical cleaning process. It puts forward a data cleaning framework with good scalability and versatility. For data with attribute dependency relations, it designs several violation data discovery algorithms expressed by formal formulas, which can identify data that are inconsistent with the target columns under conditional attribute dependencies, whether the data are structured (SQL) or unstructured (NoSQL), and it gives six data cleaning methods based on these algorithms.
Keywords: Data cleaning, dependency rules, violation data discovery, data repair.
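A minimal sketch of one dependency-rule check in pandas (hypothetical columns; the paper's algorithms are more general and cover both SQL and NoSQL sources): rows violating the functional dependency zip → city are those sharing a zip value but not a city value.

    import pandas as pd

    def fd_violations(df, determinant, dependent):
        # Groups with the same determinant value but more than one dependent
        # value violate the dependency determinant -> dependent.
        bad = (df.groupby(determinant)[dependent]
                 .nunique()
                 .loc[lambda s: s > 1]
                 .index)
        return df[df[determinant].isin(bad)]

    df = pd.DataFrame({
        "zip":  ["10001", "10001", "94105", "94105"],
        "city": ["New York", "Newark", "San Francisco", "San Francisco"],
    })
    print(fd_violations(df, "zip", "city"))   # flags the inconsistent 10001 rows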
7431 Coalescing Data Marts
Authors: N. Parimala, P. Pahwa
Abstract:
OLAP uses multidimensional structures to provide access to data for analysis. Traditionally, OLAP operations are more focused on retrieving data from a single data mart. An exception is the drill across operator. This, however, is restricted to retrieving facts on the common dimensions of the multiple data marts. Our concern is to define further operations for retrieving data from multiple data marts. Towards this, we have defined six operations which coalesce data marts. While doing so, we consider the common as well as the non-common dimensions of the data marts.
Keywords: Data warehouse, Dimension, OLAP, Star Schema.
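A rough sketch, in pandas, of the difference between a plain drill-across (a join on the common dimensions only) and a coalescing-style combination that also keeps facts present in just one mart; the tables are hypothetical and the paper's six operators over star schemas are not reproduced, this only conveys the flavour.

    import pandas as pd

    sales = pd.DataFrame({
        "date": ["2024-01", "2024-01", "2024-02"],
        "store": ["A", "B", "A"],
        "revenue": [100, 80, 120],
    })
    shipping = pd.DataFrame({
        "date": ["2024-01", "2024-02", "2024-02"],
        "store": ["A", "A", "B"],
        "shipping_cost": [10, 12, 9],
    })

    common = ["date", "store"]
    # Drill-across in spirit: keep only facts that match on the common dimensions.
    drill_across = sales.merge(shipping, on=common, how="inner")
    # Coalescing-style combination: also keep facts present in only one mart.
    coalesced = sales.merge(shipping, on=common, how="outer")
    print(drill_across)
    print(coalesced)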
7430 Mining Big Data in Telecommunications Industry: Challenges, Techniques, and Revenue Opportunity
Authors: Hoda A. Abdel Hafez
Abstract:
Mining big data represents a big challenge nowadays. Many types of research are concerned with mining massive amounts of data and big data streams. Mining big data faces a lot of challenges, including scalability, speed, heterogeneity, accuracy, provenance and privacy. In the telecommunication industry, mining big data is like mining for gold; it represents a big opportunity for maximizing the revenue streams in this industry. This paper discusses the characteristics of big data (volume, variety, velocity and veracity), data mining techniques and tools for handling very large data sets, mining big data in telecommunication, and the benefits and opportunities gained from them.
Keywords: Mining Big Data, Big Data, Machine learning, Data Streams, Telecommunication.
7429 Comparative Analysis of Diverse Collection of Big Data Analytics Tools
Authors: S. Vidhya, S. Sarumathi, N. Shanthi
Abstract:
Over the past era, many efforts and studies have been carried out to develop proficient tools for performing various tasks on big data. Recently, big data have received a lot of publicity, for good reasons. Because of the large and complex collections of datasets involved, they are difficult to process with traditional data processing applications. This concern makes it all the more necessary to produce dedicated tools for big data. Moreover, the main aim of big data analytics is to apply advanced analytic techniques to very large, diverse datasets, ranging in size from terabytes to zettabytes and of diverse types, such as structured or unstructured and batch or streaming. Big data is relevant for data sets whose size or type is beyond the capability of traditional relational databases to capture, manage and process with low latency. The resulting challenges have led to the emergence of powerful big data tools. In this survey, a varied collection of big data tools is described and the tools are compared with respect to their salient features.
Keywords: Big data, Big data analytics, Business analytics, Data analysis, Data visualization, Data discovery.
7428 Multi-labeled Data Expressed by a Set of Labels
Authors: Tetsuya Furukawa, Masahiro Kuzunishi
Abstract:
Collected data must be organized to be utilized efficiently, and hierarchical classification of data is an efficient approach to organizing it. When data are classified into multiple categories or annotated with a set of labels, users request multi-labeled data by giving a set of labels. There are several interpretations of the data expressed by a set of labels. This paper discusses which data are expressed by a set of labels by introducing orders for sets of labels, and it shows that there are four types of orders, which are characterized by whether the label set of the expressed data includes every label of the given set of labels within the range of the set. Desirable properties of the orders, namely that data are also expressed by higher sets of labels and that different sets of labels express different data, are discussed.
Keywords: Classification Hierarchies, Multi-labeled Data, Multiple Classification, Orders of Sets of Labels.
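A minimal Python sketch of retrieving multi-labeled data for a query set of labels under two simple set-based interpretations (these are illustrative only; the paper defines and characterizes four specific orders, which are not reproduced here):

    def select_by_labels(items, query, mode="superset"):
        # "superset": keep items carrying every queried label;
        # "overlap":  keep items sharing at least one queried label.
        query = frozenset(query)
        if mode == "superset":
            return [name for name, labels in items.items() if query <= labels]
        return [name for name, labels in items.items() if query & labels]

    items = {
        "doc1": frozenset({"survival", "censoring"}),
        "doc2": frozenset({"censoring"}),
        "doc3": frozenset({"fuzzy", "rules"}),
    }
    print(select_by_labels(items, {"censoring"}))              # doc1, doc2
    print(select_by_labels(items, {"survival", "censoring"}))  # doc1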
7427 The Comparison of Data Replication in Distributed Systems
Authors: Iman Zangeneh, Mostafa Moradi, Ali Mokhtarbaf
Abstract:
The necessity of the ever-increasing use of distributed data in computer networks is obvious to all. One technique that is applied to distributed data to increase efficiency and reliability is data replication. In this paper, after introducing this technique and its advantages, we examine some dynamic data replication strategies. We examine their characteristics for several scenarios and then propose some suggestions for their improvement.
Keywords: Data replication, data hiding, consistency, dynamic data replication strategy.
7426 Implementation of an IoT Sensor Data Collection and Analysis Library
Authors: Jihyun Song, Kyeongjoo Kim, Minsoo Lee
Abstract:
Due to the development of information technology and wireless Internet technology, various data are being generated in many fields. These data are advantageous in that they provide real-time information to the users themselves. However, when the data are accumulated and analyzed, even more varied information can be extracted. In addition, the development and dissemination of boards such as the Arduino and Raspberry Pi have made it possible to easily test various sensors, and it is possible to collect sensor data directly by using database application tools such as MySQL. These directly collected data can be used for various research purposes and can be useful as data for data mining. However, there are many difficulties in using a board to collect data, particularly when the user is not a computer programmer or is using it for the first time. Even if data are collected, a lack of expert knowledge or experience may cause difficulties in data analysis and visualization. In this paper, we aim to construct a library for sensor data collection and analysis to overcome these problems.
Keywords: Clustering, data mining, DBSCAN, k-means, k-medoids, sensor data.
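A minimal end-to-end sketch of what such a library might wrap: store readings in a small SQL database and run the clustering algorithms listed in the keywords. SQLite stands in for MySQL here so the example is self-contained; the table and column names are hypothetical.

    import sqlite3
    import numpy as np
    from sklearn.cluster import KMeans, DBSCAN

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (ts REAL, temperature REAL, humidity REAL)")

    rng = np.random.default_rng(4)
    for ts in range(300):
        # Simulated sensor stream with an occasional warmer regime.
        temp = 22 + 3 * rng.standard_normal() + (5 if ts % 100 > 80 else 0)
        hum = 45 + 5 * rng.standard_normal()
        conn.execute("INSERT INTO readings VALUES (?, ?, ?)", (float(ts), temp, hum))
    conn.commit()

    data = np.array(conn.execute("SELECT temperature, humidity FROM readings").fetchall())

    print(KMeans(n_clusters=2, n_init=10).fit_predict(data)[:10])
    print(DBSCAN(eps=2.0, min_samples=5).fit_predict(data)[:10])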
7425 Government (Big) Data Ecosystem: Definition, Classification of Actors, and Their Roles
Authors: Syed Iftikhar Hussain Shah, Vasilis Peristeras, Ioannis Magnisalis
Abstract:
Organizations, including governments, generate (big) data that are high in volume, velocity, and veracity, and come from a variety of sources. Public administrations are using (big) data, implementing base registries, and enforcing data sharing within the entire government to deliver (big) data related integrated services, provide insights to users, and support good governance. Government (big) data ecosystem actors represent distinct entities that provide data, consume data, manipulate data to offer paid services, and extend data services, such as data storage and hosting services, to other actors. In this research work, we perform a systematic literature review. The key objectives of this paper are to propose a robust definition of the government (big) data ecosystem and a classification of government (big) data ecosystem actors and their roles. We present a graphical view of the actors, their roles, and their relationships in the government (big) data ecosystem. We also discuss our research findings. We did not find many published research articles about the government (big) data ecosystem, including its definition and the classification of actors and their roles. Therefore, we borrowed ideas for the government (big) data ecosystem from numerous related areas in the literature, including scientific research data, humanitarian data, open government data, and industry data.
Keywords: Big data, big data ecosystem, classification of big data actors, big data actors roles, definition of government (big) data ecosystem, data-driven government, eGovernment, gaps in data ecosystems, government (big) data, public administration, systematic literature review.