Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 5386

Search results for: data normalization

5386 Applying Spanning Tree Graph Theory for Automatic Database Normalization

Authors: Chetneti Srisa-an


In Knowledge and Data Engineering field, relational database is the best repository to store data in a real world. It has been using around the world more than eight decades. Normalization is the most important process for the analysis and design of relational databases. It aims at creating a set of relational tables with minimum data redundancy that preserve consistency and facilitate correct insertion, deletion, and modification. Normalization is a major task in the design of relational databases. Despite its importance, very few algorithms have been developed to be used in the design of commercial automatic normalization tools. It is also rare technique to do it automatically rather manually. Moreover, for a large and complex database as of now, it make even harder to do it manually. This paper presents a new complete automated relational database normalization method. It produces the directed graph and spanning tree, first. It then proceeds with generating the 2NF, 3NF and also BCNF normal forms. The benefit of this new algorithm is that it can cope with a large set of complex function dependencies.

Keywords: relational database, functional dependency, automatic normalization, primary key, spanning tree

Procedia PDF Downloads 257
5385 Investigating Data Normalization Techniques in Swarm Intelligence Forecasting for Energy Commodity Spot Price

Authors: Yuhanis Yusof, Zuriani Mustaffa, Siti Sakira Kamaruddin


Data mining is a fundamental technique in identifying patterns from large data sets. The extracted facts and patterns contribute in various domains such as marketing, forecasting, and medical. Prior to that, data are consolidated so that the resulting mining process may be more efficient. This study investigates the effect of different data normalization techniques, which are Min-max, Z-score, and decimal scaling, on Swarm-based forecasting models. Recent swarm intelligence algorithms employed includes the Grey Wolf Optimizer (GWO) and Artificial Bee Colony (ABC). Forecasting models are later developed to predict the daily spot price of crude oil and gasoline. Results showed that GWO works better with Z-score normalization technique while ABC produces better accuracy with the Min-Max. Nevertheless, the GWO is more superior that ABC as its model generates the highest accuracy for both crude oil and gasoline price. Such a result indicates that GWO is a promising competitor in the family of swarm intelligence algorithms.

Keywords: artificial bee colony, data normalization, forecasting, Grey Wolf optimizer

Procedia PDF Downloads 393
5384 Decision Making System for Clinical Datasets

Authors: P. Bharathiraja


Computer Aided decision making system is used to enhance diagnosis and prognosis of diseases and also to assist clinicians and junior doctors in clinical decision making. Medical Data used for decision making should be definite and consistent. Data Mining and soft computing techniques are used for cleaning the data and for incorporating human reasoning in decision making systems. Fuzzy rule based inference technique can be used for classification in order to incorporate human reasoning in the decision making process. In this work, missing values are imputed using the mean or mode of the attribute. The data are normalized using min-ma normalization to improve the design and efficiency of the fuzzy inference system. The fuzzy inference system is used to handle the uncertainties that exist in the medical data. Equal-width-partitioning is used to partition the attribute values into appropriate fuzzy intervals. Fuzzy rules are generated using Class Based Associative rule mining algorithm. The system is trained and tested using heart disease data set from the University of California at Irvine (UCI) Machine Learning Repository. The data was split using a hold out approach into training and testing data. From the experimental results it can be inferred that classification using fuzzy inference system performs better than trivial IF-THEN rule based classification approaches. Furthermore it is observed that the use of fuzzy logic and fuzzy inference mechanism handles uncertainty and also resembles human decision making. The system can be used in the absence of a clinical expert to assist junior doctors and clinicians in clinical decision making.

Keywords: decision making, data mining, normalization, fuzzy rule, classification

Procedia PDF Downloads 362
5383 A New Scheme for Chain Code Normalization in Arabic and Farsi Scripts

Authors: Reza Shakoori


This paper presents a structural correction of Arabic and Persian strokes using manipulation of their chain codes in order to improve the rate and performance of Persian and Arabic handwritten word recognition systems. It collects pure and effective features to represent a character with one consolidated feature vector and reduces variations in order to decrease the number of training samples and increase the chance of successful classification. Our results also show that how the proposed approaches can simplify classification and consequently recognition by reducing variations and possible noises on the chain code by keeping orientation of characters and their backbone structures.

Keywords: Arabic, chain code normalization, OCR systems, image processing

Procedia PDF Downloads 306
5382 Evaluating the Performance of Color Constancy Algorithm

Authors: Damanjit Kaur, Avani Bhatia


Color constancy is significant for human vision since color is a pictorial cue that helps in solving different visions tasks such as tracking, object recognition, or categorization. Therefore, several computational methods have tried to simulate human color constancy abilities to stabilize machine color representations. Two different kinds of methods have been used, i.e., normalization and constancy. While color normalization creates a new representation of the image by canceling illuminant effects, color constancy directly estimates the color of the illuminant in order to map the image colors to a canonical version. Color constancy is the capability to determine colors of objects independent of the color of the light source. This research work studies the most of the well-known color constancy algorithms like white point and gray world.

Keywords: color constancy, gray world, white patch, modified white patch

Procedia PDF Downloads 169
5381 Author Name Disambiguation for Biomedical Literature

Authors: Parthiban Srinivasan


PubMed provides online access to the National Library of Medicine database (MEDLINE) and other publications, which contain close to 25 million scientific citations from 1865 to the present. There are close to 80 million author name instances in those close to 25 million citations. For any work of literature, a fundamental issue is to identify the individual(s) who wrote it, and conversely, to identify all of the works that belong to a given individual. Due to the lack of universal standards for name information, there are two aspects of name ambiguity: name synonymy (a single author with multiple name representations), and name homonymy (multiple authors sharing the same name representation). In this talk, we present some results from our extensive work in author name disambiguation for PubMed citations. Information will be presented on the effectiveness and shortcomings of different aspects of successful name disambiguation such as parsing, validation, standardization and normalization.

Keywords: disambiguation, normalization, parsing, PubMed

Procedia PDF Downloads 186
5380 An Evaluation Method of Accelerated Storage Life Test for Typical Mechanical and Electronic Products

Authors: Jinyong Yao, Hongzhi Li, Chao Du, Jiao Li


Reliability of long-term storage products is related to the availability of the whole system, and the evaluation of storage life is of great necessity. These products are usually highly reliable and little failure information can be collected. In this paper, an analytical method based on data from accelerated storage life test is proposed to evaluate the reliability index of the long-term storage products. Firstly, singularities are eliminated by data normalization and residual analysis. Secondly, with the pre-processed data, the degradation path model is built to obtain the pseudo life values. Then by life distribution hypothesis, we can get the estimator of parameters in high stress levels and verify failure mechanisms consistency. Finally, the life distribution under the normal stress level is extrapolated via the acceleration model and evaluation of the true average life available. An application example with the camera stabilization device is provided to illustrate the methodology we proposed.

Keywords: accelerated storage life test, failure mechanisms consistency, life distribution, reliability

Procedia PDF Downloads 306
5379 Improved Pitch Detection Using Fourier Approximation Method

Authors: Balachandra Kumaraswamy, P. G. Poonacha


Automatic Music Information Retrieval has been one of the challenging topics of research for a few decades now with several interesting approaches reported in the literature. In this paper we have developed a pitch extraction method based on a finite Fourier series approximation to the given window of samples. We then estimate pitch as the fundamental period of the finite Fourier series approximation to the given window of samples. This method uses analysis of the strength of harmonics present in the signal to reduce octave as well as harmonic errors. The performance of our method is compared with three best known methods for pitch extraction, namely, Yin, Windowed Special Normalization of the Auto-Correlation Function and Harmonic Product Spectrum methods of pitch extraction. Our study with artificially created signals as well as music files show that Fourier Approximation method gives much better estimate of pitch with less octave and harmonic errors.

Keywords: pitch, fourier series, yin, normalization of the auto- correlation function, harmonic product, mean square error

Procedia PDF Downloads 324
5378 FT-NIR Method to Determine Moisture in Gluten Free Rice-Based Pasta during Drying

Authors: Navneet Singh Deora, Aastha Deswal, H. N. Mishra


Pasta is one of the most widely consumed food products around the world. Rapid determination of the moisture content in pasta will assist food processors to provide online quality control of pasta during large scale production. Rapid Fourier transform near-infrared method (FT-NIR) was developed for determining moisture content in pasta. A calibration set of 150 samples, a validation set of 30 samples and a prediction set of 25 samples of pasta were used. The diffuse reflection spectra of different types of pastas were measured by FT-NIR analyzer in the 4,000-12,000 cm-1 spectral range. Calibration and validation sets were designed for the conception and evaluation of the method adequacy in the range of moisture content 10 to 15 percent (w.b) of the pasta. The prediction models based on partial least squares (PLS) regression, were developed in the near-infrared. Conventional criteria such as the R2, the root mean square errors of cross validation (RMSECV), root mean square errors of estimation (RMSEE) as well as the number of PLS factors were considered for the selection of three pre-processing (vector normalization, minimum-maximum normalization and multiplicative scatter correction) methods. Spectra of pasta sample were treated with different mathematic pre-treatments before being used to build models between the spectral information and moisture content. The moisture content in pasta predicted by FT-NIR methods had very good correlation with their values determined via traditional methods (R2 = 0.983), which clearly indicated that FT-NIR methods could be used as an effective tool for rapid determination of moisture content in pasta. The best calibration model was developed with min-max normalization (MMN) spectral pre-processing (R2 = 0.9775). The MMN pre-processing method was found most suitable and the maximum coefficient of determination (R2) value of 0.9875 was obtained for the calibration model developed.

Keywords: FT-NIR, pasta, moisture determination, food engineering

Procedia PDF Downloads 181
5377 Applications of Big Data in Education

Authors: Faisal Kalota


Big Data and analytics have gained a huge momentum in recent years. Big Data feeds into the field of Learning Analytics (LA) that may allow academic institutions to better understand the learners’ needs and proactively address them. Hence, it is important to have an understanding of Big Data and its applications. The purpose of this descriptive paper is to provide an overview of Big Data, the technologies used in Big Data, and some of the applications of Big Data in education. Additionally, it discusses some of the concerns related to Big Data and current research trends. While Big Data can provide big benefits, it is important that institutions understand their own needs, infrastructure, resources, and limitation before jumping on the Big Data bandwagon.

Keywords: big data, learning analytics, analytics, big data in education, Hadoop

Procedia PDF Downloads 249
5376 Analysis of Big Data

Authors: Sandeep Sharma, Sarabjit Singh


As per the user demand and growth trends of large free data the storage solutions are now becoming more challenge-able to protect, store and to retrieve data. The days are not so far when the storage companies and organizations are start saying 'no' to store our valuable data or they will start charging a huge amount for its storage and protection. On the other hand as per the environmental conditions it becomes challenge-able to maintain and establish new data warehouses and data centers to protect global warming threats. A challenge of small data is over now, the challenges are big that how to manage the exponential growth of data. In this paper we have analyzed the growth trend of big data and its future implications. We have also focused on the impact of the unstructured data on various concerns and we have also suggested some possible remedies to streamline big data.

Keywords: big data, unstructured data, volume, variety, velocity

Procedia PDF Downloads 395
5375 Research of Data Cleaning Methods Based on Dependency Rules

Authors: Yang Bao, Shi Wei Deng, WangQun Lin


This paper introduces the concept and principle of data cleaning, analyzes the types and causes of dirty data, and proposes several key steps of typical cleaning process, puts forward a well scalability and versatility data cleaning framework, in view of data with attribute dependency relation, designs several of violation data discovery algorithms by formal formula, which can obtain inconsistent data to all target columns with condition attribute dependent no matter data is structured (SQL) or unstructured (NoSQL), and gives 6 data cleaning methods based on these algorithms.

Keywords: data cleaning, dependency rules, violation data discovery, data repair

Procedia PDF Downloads 307
5374 Computer Aide Discrimination of Benign and Malignant Thyroid Nodules by Ultrasound Imaging

Authors: Akbar Gharbali, Ali Abbasian Ardekani, Afshin Mohammadi


Introduction: Thyroid nodules have an incidence of 33-68% in the general population. More than 5-15% of these nodules are malignant. Early detection and treatment of thyroid nodules increase the cure rate and provide optimal treatment. Between the medical imaging methods, Ultrasound is the chosen imaging technique for assessment of thyroid nodules. The confirming of the diagnosis usually demands repeated fine-needle aspiration biopsy (FNAB). So, current management has morbidity and non-zero mortality. Objective: To explore diagnostic potential of automatic texture analysis (TA) methods in differentiation benign and malignant thyroid nodules by ultrasound imaging in order to help for reliable diagnosis and monitoring of the thyroid nodules in their early stages with no need biopsy. Material and Methods: The thyroid US image database consists of 70 patients (26 benign and 44 malignant) which were reported by Radiologist and proven by the biopsy. Two slices per patient were loaded in Mazda Software version 4.6 for automatic texture analysis. Regions of interests (ROIs) were defined within the abnormal part of the thyroid nodules ultrasound images. Gray levels within an ROI normalized according to three normalization schemes: N1: default or original gray levels, N2: +/- 3 Sigma or dynamic intensity limited to µ+/- 3σ, and N3: present intensity limited to 1% - 99%. Up to 270 multiscale texture features parameters per ROIs per each normalization schemes were computed from well-known statistical methods employed in Mazda software. From the statistical point of view, all calculated texture features parameters are not useful for texture analysis. So, the features based on maximum Fisher coefficient and the minimum probability of classification error and average correlation coefficients (POE+ACC) eliminated to 10 best and most effective features per normalization schemes. We analyze this feature under two standardization states (standard (S) and non-standard (NS)) with Principle Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Non-Linear Discriminant Analysis (NDA). The 1NN classifier was performed to distinguish between benign and malignant tumors. The confusion matrix and Receiver operating characteristic (ROC) curve analysis were used for the formulation of more reliable criteria of the performance of employed texture analysis methods. Results: The results demonstrated the influence of the normalization schemes and reduction methods on the effectiveness of the obtained features as a descriptor on discrimination power and classification results. The selected subset features under 1%-99% normalization, POE+ACC reduction and NDA texture analysis yielded a high discrimination performance with the area under the ROC curve (Az) of 0.9722, in distinguishing Benign from Malignant Thyroid Nodules which correspond to sensitivity of 94.45%, specificity of 100%, and accuracy of 97.14%. Conclusions: Our results indicate computer-aided diagnosis is a reliable method, and can provide useful information to help radiologists in the detection and classification of benign and malignant thyroid nodules.

Keywords: ultrasound imaging, thyroid nodules, computer aided diagnosis, texture analysis, PCA, LDA, NDA

Procedia PDF Downloads 196
5373 Reviewing Privacy Preserving Distributed Data Mining

Authors: Sajjad Baghernezhad, Saeideh Baghernezhad


Nowadays considering human involved in increasing data development some methods such as data mining to extract science are unavoidable. One of the discussions of data mining is inherent distribution of the data usually the bases creating or receiving such data belong to corporate or non-corporate persons and do not give their information freely to others. Yet there is no guarantee to enable someone to mine special data without entering in the owner’s privacy. Sending data and then gathering them by each vertical or horizontal software depends on the type of their preserving type and also executed to improve data privacy. In this study it was attempted to compare comprehensively preserving data methods; also general methods such as random data, coding and strong and weak points of each one are examined.

Keywords: data mining, distributed data mining, privacy protection, privacy preserving

Procedia PDF Downloads 363
5372 How to Use Big Data in Logistics Issues

Authors: Mehmet Akif Aslan, Mehmet Simsek, Eyup Sensoy


Big Data stands for today’s cutting-edge technology. As the technology becomes widespread, so does Data. Utilizing massive data sets enable companies to get competitive advantages over their adversaries. Out of many area of Big Data usage, logistics has significance role in both commercial sector and military. This paper lays out what big data is and how it is used in both military and commercial logistics.

Keywords: big data, logistics, operational efficiency, risk management

Procedia PDF Downloads 401
5371 Suppression Subtractive Hybridization Technique for Identification of the Differentially Expressed Genes

Authors: Tuhina-khatun, Mohamed Hanafi Musa, Mohd Rafii Yosup, Wong Mui Yun, Aktar-uz-Zaman, Mahbod Sahebi


Suppression subtractive hybridization (SSH) method is valuable tool for identifying differentially regulated genes in disease specific or tissue specific genes important for cellular growth and differentiation. It is a widely used method for separating DNA molecules that distinguish two closely related DNA samples. SSH is one of the most powerful and popular methods for generating subtracted cDNA or genomic DNA libraries. It is based primarily on a suppression polymerase chain reaction (PCR) technique and combines normalization and subtraction in a solitary procedure. The normalization step equalizes the abundance of DNA fragments within the target population, and the subtraction step excludes sequences that are common to the populations being compared. This dramatically increases the probability of obtaining low-abundance differentially expressed cDNAs or genomic DNA fragments and simplifies analysis of the subtracted library. SSH technique is applicable to many comparative and functional genetic studies for the identification of disease, developmental, tissue specific, or other differentially expressed genes, as well as for the recovery of genomic DNA fragments distinguishing the samples under comparison.

Keywords: suppression subtractive hybridization, differentially expressed genes, disease specific genes, tissue specific genes

Procedia PDF Downloads 302
5370 Combining ASTER Thermal Data and Spatial-Based Insolation Model for Identification of Geothermal Active Areas

Authors: Khalid Hussein, Waleed Abdalati, Pakorn Petchprayoon, Khaula Alkaabi


In this study, we integrated ASTER thermal data with an area-based spatial insolation model to identify and delineate geothermally active areas in Yellowstone National Park (YNP). Two pairs of L1B ASTER day- and nighttime scenes were used to calculate land surface temperature. We employed the Emissivity Normalization Algorithm which separates temperature from emissivity to calculate surface temperature. We calculated the incoming solar radiation for the area covered by each of the four ASTER scenes using an insolation model and used this information to compute temperature due to solar radiation. We then identified the statistical thermal anomalies using land surface temperature and the residuals calculated from modeled temperatures and ASTER-derived surface temperatures. Areas that had temperatures or temperature residuals greater than 2σ and between 1σ and 2σ were considered ASTER-modeled thermal anomalies. The areas identified as thermal anomalies were in strong agreement with the thermal areas obtained from the YNP GIS database. Also the YNP hot springs and geysers were located within areas identified as anomalous thermal areas. The consistency between our results and known geothermally active areas indicate that thermal remote sensing data, integrated with a spatial-based insolation model, provides an effective means for identifying and locating areas of geothermal activities over large areas and rough terrain.

Keywords: thermal remote sensing, insolation model, land surface temperature, geothermal anomalies

Procedia PDF Downloads 279
5369 Imputation Technique for Feature Selection in Microarray Data Set

Authors: Younies Saeed Hassan Mahmoud, Mai Mabrouk, Elsayed Sallam


Analysing DNA microarray data sets is a great challenge, which faces the bioinformaticians due to the complication of using statistical and machine learning techniques. The challenge will be doubled if the microarray data sets contain missing data, which happens regularly because these techniques cannot deal with missing data. One of the most important data analysis process on the microarray data set is feature selection. This process finds the most important genes that affect certain disease. In this paper, we introduce a technique for imputing the missing data in microarray data sets while performing feature selection.

Keywords: DNA microarray, feature selection, missing data, bioinformatics

Procedia PDF Downloads 379
5368 High Performance Computing and Big Data Analytics

Authors: Branci Sarra, Branci Saadia


Because of the multiplied data growth, many computer science tools have been developed to process and analyze these Big Data. High-performance computing architectures have been designed to meet the treatment needs of Big Data (view transaction processing standpoint, strategic, and tactical analytics). The purpose of this article is to provide a historical and global perspective on the recent trend of high-performance computing architectures especially what has a relation with Analytics and Data Mining.

Keywords: high performance computing, HPC, big data, data analysis

Procedia PDF Downloads 393
5367 A Study of Cloud Computing Solution for Transportation Big Data Processing

Authors: Ilgin Gökaşar, Saman Ghaffarian


The need for fast processed big data of transportation ridership (eg., smartcard data) and traffic operation (e.g., traffic detectors data) which requires a lot of computational power is incontrovertible in Intelligent Transportation Systems. Nowadays cloud computing is one of the important subjects and popular information technology solution for data processing. It enables users to process enormous measure of data without having their own particular computing power. Thus, it can also be a good selection for transportation big data processing as well. This paper intends to examine how the cloud computing can enhance transportation big data process with contrasting its advantages and disadvantages, and discussing cloud computing features.

Keywords: big data, cloud computing, Intelligent Transportation Systems, ITS, traffic data processing

Procedia PDF Downloads 310
5366 Proposal of Data Collection from Probes

Authors: M. Kebisek, L. Spendla, M. Kopcek, T. Skulavik


In our paper we describe the security capabilities of data collection. Data are collected with probes located in the near and distant surroundings of the company. Considering the numerous obstacles e.g. forests, hills, urban areas, the data collection is realized in several ways. The collection of data uses connection via wireless communication, LAN network, GSM network and in certain areas data are collected by using vehicles. In order to ensure the connection to the server most of the probes have ability to communicate in several ways. Collected data are archived and subsequently used in supervisory applications. To ensure the collection of the required data, it is necessary to propose algorithms that will allow the probes to select suitable communication channel.

Keywords: communication, computer network, data collection, probe

Procedia PDF Downloads 262
5365 The Application of Data Mining Technology in Building Energy Consumption Data Analysis

Authors: Liang Zhao, Jili Zhang, Chongquan Zhong


Energy consumption data, in particular those involving public buildings, are impacted by many factors: the building structure, climate/environmental parameters, construction, system operating condition, and user behavior patterns. Traditional methods for data analysis are insufficient. This paper delves into the data mining technology to determine its application in the analysis of building energy consumption data including energy consumption prediction, fault diagnosis, and optimal operation. Recent literature are reviewed and summarized, the problems faced by data mining technology in the area of energy consumption data analysis are enumerated, and research points for future studies are given.

Keywords: data mining, data analysis, prediction, optimization, building operational performance

Procedia PDF Downloads 724
5364 Impact of Charging PHEV at Different Penetration Levels on Power System Network

Authors: M. R. Ahmad, I. Musirin, M. M. Othman, N. A. Rahmat


Plug-in Hybrid-Electric Vehicle (PHEV) has gained immense popularity in recent years. PHEV offers numerous advantages compared to the conventional internal-combustion engine (ICE) vehicle. Millions of PHEVs are estimated to be on the road in the USA by 2020. Uncoordinated PHEV charging is believed to cause severe impacts to the power grid; i.e. feeders, lines and transformers overload and voltage drop. Nevertheless, improper PHEV data model used in such studies may cause the findings of their works is in appropriated. Although smart charging is more attractive to researchers in recent years, its implementation is not yet attainable on the street due to its requirement for physical infrastructure readiness and technology advancement. As the first step, it is finest to study the impact of charging PHEV based on real vehicle travel data from National Household Travel Survey (NHTS) and at present charging rate. Due to the lack of charging station on the street at the moment, charging PHEV at home is the best option and has been considered in this work. This paper proposed a technique that comprehensively presents the impact of charging PHEV on power system networks considering huge numbers of PHEV samples with its traveling data pattern. Vehicles Charging Load Profile (VCLP) is developed and implemented in IEEE 30-bus test system that represents a portion of American Electric Power System (Midwestern US). Normalization technique is used to correspond to real time loads at all buses. Results from the study indicated that charging PHEV using opportunity charging will have significant impacts on power system networks, especially whereas bigger battery capacity (kWh) is used as well as for higher penetration level.

Keywords: plug-in hybrid electric vehicle, transportation electrification, impact of charging PHEV, electricity demand profile, load profile

Procedia PDF Downloads 183
5363 Algorithms used in Spatial Data Mining GIS

Authors: Vahid Bairami Rad


Extracting knowledge from spatial data like GIS data is important to reduce the data and extract information. Therefore, the development of new techniques and tools that support the human in transforming data into useful knowledge has been the focus of the relatively new and interdisciplinary research area ‘knowledge discovery in databases’. Thus, we introduce a set of database primitives or basic operations for spatial data mining which are sufficient to express most of the spatial data mining algorithms from the literature. This approach has several advantages. Similar to the relational standard language SQL, the use of standard primitives will speed-up the development of new data mining algorithms and will also make them more portable. We introduced a database-oriented framework for spatial data mining which is based on the concepts of neighborhood graphs and paths. A small set of basic operations on these graphs and paths were defined as database primitives for spatial data mining. Furthermore, techniques to efficiently support the database primitives by a commercial DBMS were presented.

Keywords: spatial data base, knowledge discovery database, data mining, spatial relationship, predictive data mining

Procedia PDF Downloads 331
5362 Data Stream Association Rule Mining with Cloud Computing

Authors: B. Suraj Aravind, M. H. M. Krishna Prasad


There exist emerging applications of data streams that require association rule mining, such as network traffic monitoring, web click streams analysis, sensor data, data from satellites etc. Data streams typically arrive continuously in high speed with huge amount and changing data distribution. This raises new issues that need to be considered when developing association rule mining techniques for stream data. This paper proposes to introduce an improved data stream association rule mining algorithm by eliminating the limitation of resources. For this, the concept of cloud computing is used. Inclusion of this may lead to additional unknown problems which needs further research.

Keywords: data stream, association rule mining, cloud computing, frequent itemsets

Procedia PDF Downloads 387
5361 Semantic Data Schema Recognition

Authors: Aïcha Ben Salem, Faouzi Boufares, Sebastiao Correia


The subject covered in this paper aims at assisting the user in its quality approach. The goal is to better extract, mix, interpret and reuse data. It deals with the semantic schema recognition of a data source. This enables the extraction of data semantics from all the available information, inculding the data and the metadata. Firstly, it consists of categorizing the data by assigning it to a category and possibly a sub-category, and secondly, of establishing relations between columns and possibly discovering the semantics of the manipulated data source. These links detected between columns offer a better understanding of the source and the alternatives for correcting data. This approach allows automatic detection of a large number of syntactic and semantic anomalies.

Keywords: schema recognition, semantic data profiling, meta-categorisation, semantic dependencies inter columns

Procedia PDF Downloads 309
5360 Survey on Big Data Stream Classification by Decision Tree

Authors: Mansoureh Ghiasabadi Farahani, Samira Kalantary, Sara Taghi-Pour, Mahboubeh Shamsi


Nowadays, the development of computers technology and its recent applications provide access to new types of data, which have not been considered by the traditional data analysts. Two particularly interesting characteristics of such data sets include their huge size and streaming nature .Incremental learning techniques have been used extensively to address the data stream classification problem. This paper presents a concise survey on the obstacles and the requirements issues classifying data streams with using decision tree. The most important issue is to maintain a balance between accuracy and efficiency, the algorithm should provide good classification performance with a reasonable time response.

Keywords: big data, data streams, classification, decision tree

Procedia PDF Downloads 373
5359 Data-Mining Approach to Analyzing Industrial Process Information for Real-Time Monitoring

Authors: Seung-Lock Seo


This work presents a data-mining empirical monitoring scheme for industrial processes with partially unbalanced data. Measurement data of good operations are relatively easy to gather, but in unusual special events or faults it is generally difficult to collect process information or almost impossible to analyze some noisy data of industrial processes. At this time some noise filtering techniques can be used to enhance process monitoring performance in a real-time basis. In addition, pre-processing of raw process data is helpful to eliminate unwanted variation of industrial process data. In this work, the performance of various monitoring schemes was tested and demonstrated for discrete batch process data. It showed that the monitoring performance was improved significantly in terms of monitoring success rate of given process faults.

Keywords: data mining, process data, monitoring, safety, industrial processes

Procedia PDF Downloads 264
5358 A Privacy Protection Scheme Supporting Fuzzy Search for NDN Routing Cache Data Name

Authors: Feng Tao, Ma Jing, Guo Xian, Wang Jing


Named Data Networking (NDN) replaces IP address of traditional network with data name, and adopts dynamic cache mechanism. In the existing mechanism, however, only one-to-one search can be achieved because every data has a unique name corresponding to it. There is a certain mapping relationship between data content and data name, so if the data name is intercepted by an adversary, the privacy of the data content and user’s interest can hardly be guaranteed. In order to solve this problem, this paper proposes a one-to-many fuzzy search scheme based on order-preserving encryption to reduce the query overhead by optimizing the caching strategy. In this scheme, we use hash value to ensure the user’s query safe from each node in the process of search, so does the privacy of the requiring data content.

Keywords: NDN, order-preserving encryption, fuzzy search, privacy

Procedia PDF Downloads 328
5357 Healthcare Big Data Analytics Using Hadoop

Authors: Chellammal Surianarayanan


Healthcare industry is generating large amounts of data driven by various needs such as record keeping, physician’s prescription, medical imaging, sensor data, Electronic Patient Record(EPR), laboratory, pharmacy, etc. Healthcare data is so big and complex that they cannot be managed by conventional hardware and software. The complexity of healthcare big data arises from large volume of data, the velocity with which the data is accumulated and different varieties such as structured, semi-structured and unstructured nature of data. Despite the complexity of big data, if the trends and patterns that exist within the big data are uncovered and analyzed, higher quality healthcare at lower cost can be provided. Hadoop is an open source software framework for distributed processing of large data sets across clusters of commodity hardware using a simple programming model. The core components of Hadoop include Hadoop Distributed File System which offers way to store large amount of data across multiple machines and MapReduce which offers way to process large data sets with a parallel, distributed algorithm on a cluster. Hadoop ecosystem also includes various other tools such as Hive (a SQL-like query language), Pig (a higher level query language for MapReduce), Hbase(a columnar data store), etc. In this paper an analysis has been done as how healthcare big data can be processed and analyzed using Hadoop ecosystem.

Keywords: big data analytics, Hadoop, healthcare data, towards quality healthcare

Procedia PDF Downloads 288