Search results for: Large Data
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 8890

8740 A New Approach for Classifying Large Number of Mixed Variables

Authors: Hashibah Hamid

Abstract:

The problem of classifying objects into one of several predefined groups when the measured variables are of mixed types has interested statisticians for many years. Several methods for dealing with this situation have been introduced, including parametric, semi-parametric and nonparametric approaches. This paper discusses the problem of classifying data when the number of measured mixed variables is larger than the sample size. A proposed idea that integrates dimensionality reduction via principal component analysis with a discriminant function based on the location model is discussed. The study aims to offer practitioners another potential tool for classification problems in which the observed variables are mixed and too numerous.

Keywords: classification, location model, mixed variables, principal component analysis.

Downloads: 1508

8739 A Genetic-Neural-Network Modeling Approach for Self-Heating in GaN High Electron Mobility Transistors

Authors: Anwar Jarndal

Abstract:

In this paper, a genetic-neural-network (GNN) based large-signal model for GaN HEMTs is presented along with its parameter extraction procedure. The model is easy to construct and implement in CAD software and requires only DC and S-parameter measurements. An improved decomposition technique is used to model the self-heating effect. Two GNN models are constructed to simulate isothermal drain current and power dissipation, respectively. The two models are then combined to simulate the drain current. The modeling procedure was applied to a packaged GaN-on-Si HEMT, and the developed model was validated by comparing its large-signal simulation with measured data. Very good agreement between simulation and measurement was obtained.

Keywords: GaN HEMT, computer-aided design & modeling, neural networks, genetic optimization.

Downloads: 1599

8738 Studies on Determination of the Optimum Distance Between the Tmotes for Optimum Data Transfer in a Network with WLL Capability

Authors: N C Santhosh Kumar, N K Kishore

Abstract:

Using mini modules of Tmotes, it is possible to automate a small personal area network. This idea can be extended to large networks by implementing multi-hop routing. By linking the various Tmotes, each with transmitter and receiver sections, using programming languages such as nesC and Java, a network can be monitored. It is foreseen that, depending on the application, a long range at a low data transfer rate or average throughput may be an acceptable trade-off. To reduce the overall costs involved, an optimum number of Tmotes to be used under various conditions (indoor/outdoor) is to be deduced. By analyzing the data rates or throughputs at various locations of Tmotes, it is possible to deduce an optimal number of Tmotes for a specific network. This paper deals with the determination of optimum distances to reduce the cost and increase the reliability of the entire sensor network with Wireless Local Loop (WLL) capability.

Keywords: Average throughput, data rate, multi-hop routing, optimum data transfer, throughput, Tmotes, wireless local loop.

Downloads: 1311

8737 Reflections on Opportunities and Challenges for Systems Engineering

Authors: Ali E. Abbas

Abstract:

This paper summarizes some of the discussions that occurred in a workshop in West Virginia, U.S.A., sponsored by the National Science Foundation (NSF) in February 2016. The goal of the workshop was to explore the opportunities and challenges for applying systems engineering in large enterprises, and some of the issues that still persist. The main topics of discussion included challenges with elaboration and abstraction in large systems, interfacing physical and social systems, and the need for axiomatic frameworks for large enterprises. We summarize these main points of discussion, drawing parallels with decision making in organizations, to instigate research in these areas.

Keywords: Decision analysis, systems engineering, framing, value creation.

Downloads: 884

8736 Adaptive Kernel Principal Analysis for Online Feature Extraction

Authors: Mingtao Ding, Zheng Tian, Haixia Xu

Abstract:

The batch nature of standard kernel principal component analysis (KPCA) methods limits their use in numerous applications, especially for dynamic or large-scale data. In this paper, an efficient adaptive approach is presented for online extraction of the kernel principal components (KPC). The contribution of this paper may be divided into two parts. First, the kernel covariance matrix is correctly updated to adapt to the changing characteristics of the data. Second, the KPC are recursively formulated to overcome the batch nature of standard KPCA. This formulation is derived from the recursive eigen-decomposition of the kernel covariance matrix and indicates the KPC variation caused by the new data. The proposed method not only alleviates the sub-optimality of the KPCA method for non-stationary data, but also maintains constant update speed and memory usage as the data size increases. Experiments on simulated data and real applications demonstrate that our approach yields improvements in terms of both computational speed and approximation accuracy.

Keywords: adaptive method, kernel principal component analysis, online extraction, recursive algorithm

Downloads: 1514

8735 Development of a Numerical Model to Predict Wear in Grouted Connections for Offshore Wind Turbine Generators

Authors: Paul Dallyn, Ashraf El-Hamalawi, Alessandro Palmeri, Bob Knight

Abstract:

In order to better understand the long-term implications of the grout wear failure mode in large-diameter plain-sided grouted connections, a numerical model has been developed and calibrated that can take advantage of existing operational plant data to predict the wear accumulation for the actual load conditions experienced over a given period, thus limiting the requirement for expensive monitoring systems. This model has been derived and calibrated based on site structural condition monitoring (SCM) data and supervisory control and data acquisition (SCADA) system data for two operational wind turbine generator substructures afflicted with this challenge, along with experimentally derived wear rates.

Keywords: Grouted Connection, Numerical Model, Offshore Structure, Wear, Wind Energy.

Downloads: 2600

8734 Proposing an Efficient Method for Frequent Pattern Mining

Authors: Vaibhav Kant Singh, Vijay Shah, Yogendra Kumar Jain, Anupam Shukla, A.S. Thoke, Vinay Kumar Singh, Chhaya Dule, Vivek Parganiha

Abstract:

Data mining is the exploration of knowledge from the large sets of data generated as a result of various data processing activities. Frequent pattern mining is a very important task in data mining. Previous approaches to generating frequent sets generally adopt candidate generation and pruning techniques to meet this objective. This paper shows how the different approaches achieve the objective of frequent mining, along with the complexity each requires to perform the job. It also considers a hardware approach based on cache coherence to improve the efficiency of this process. Data mining is helpful in building support systems for management, bioinformatics, biotechnology, medical science, statistics, mathematics, banking, networking and other computer-related applications. This paper proposes the use of both the upward and downward closure properties for the extraction of frequent item sets, which reduces the total number of scans required for the generation of candidate sets.

Keywords: Data Mining, Candidate Sets, Frequent Item set, Pruning.
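
The candidate generation and pruning approach that the abstract attributes to earlier work can be sketched roughly as follows. This is a minimal Apriori-style illustration that uses only the downward closure property (every subset of a frequent item set must itself be frequent); it is not the method proposed in the paper. The function name `apriori`, the `min_support` parameter and the sample baskets are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the data per level: count each candidate's occurrences.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    level = count(items)
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: merge frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical sample data for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
result = apriori(baskets, min_support=2)
```

Each pass through the `while` loop costs one scan of the transactions, which is the cost the abstract says the proposed method aims to reduce.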

Downloads: 1640

8733 Large Eddy Simulation of Hydrogen Deflagration in Open Space and Vented Enclosure

Authors: T. Nozu, K. Hibi, T. Nishiie

Abstract:

This paper discusses the applicability of a numerical model to a damage prediction method for accidental hydrogen explosions occurring in a hydrogen facility. The numerical model was based on an unstructured finite volume method (FVM) code “NuFD/FrontFlowRed”. For simulating unsteady turbulent combustion of leaked hydrogen gas, a combination of Large Eddy Simulation (LES) and a combustion model was used. The combustion model was based on a two-scalar flamelet approach, where a G-equation model and a conserved scalar model expressed the propagation of the premixed flame surface and the diffusion combustion process, respectively. To validate this numerical model, we simulated two types of previous hydrogen explosion tests. One is an open-space explosion test, in which the source was a prismatic 5.27 m³ volume with a 30% hydrogen-air mixture. A reinforced concrete wall was set 4 m away from the front surface of the source. The source was ignited at the bottom center by a spark. The other is a vented enclosure explosion test, in which the chamber was 4.6 m × 4.6 m × 3.0 m with a vent opening on one side. A vent area of 5.4 m² was used. The test was performed with ignition at the center of the wall opposite the vent. Hydrogen-air mixtures with hydrogen concentrations close to 18% vol. were used in the tests. The results of the numerical simulations are compared with the previous experimental data to assess the accuracy of the numerical model, and we have verified that the simulated overpressures and flame time-of-arrival data were in good agreement with the results of the two previous explosion tests.

Keywords: Deflagration, Large Eddy Simulation, Turbulent combustion, Vented enclosure.

Downloads: 1432

8732 A Review on Soft Computing Technique in Intrusion Detection System

Authors: Noor Suhana Sulaiman, Rohani Abu Bakar, Norrozila Sulaiman

Abstract:

Intrusion detection systems are significant in network security. An intrusion detection system (IDS) detects and identifies intrusion behavior or intrusion attempts in a computer system by monitoring and analyzing the network packets in real time. In recent years, intelligent algorithms applied in IDS have attracted increasing attention with the rapid growth of network security. An IDS deals with a huge amount of data that contains irrelevant and redundant features, causing slow training and testing, higher resource consumption and a poor detection rate. Since the amount of audit data that an IDS needs to examine is very large even for a small network, classification by hand is impossible. Hence, the primary objective of this paper is to review the techniques, applied prior to the classification process, that are suited to IDS data.

Keywords: Intrusion Detection System, security, soft computing, classification.

Downloads: 1822

8731 On Speeding Up Support Vector Machines: Proximity Graphs Versus Random Sampling for Pre-Selection Condensation

Authors: Xiaohua Liu, Juan F. Beltran, Nishant Mohanchandra, Godfried T. Toussaint

Abstract:

Support vector machines (SVMs) are considered to be the best machine learning algorithms for minimizing the predictive probability of misclassification. However, their drawback is that for large data sets the computation of the optimal decision boundary is a time consuming function of the size of the training set. Hence several methods have been proposed to speed up the SVM algorithm. Here three methods used to speed up the computation of the SVM classifiers are compared experimentally using a musical genre classification problem. The simplest method pre-selects a random sample of the data before the application of the SVM algorithm. Two additional methods use proximity graphs to pre-select data that are near the decision boundary. One uses k-Nearest Neighbor graphs and the other Relative Neighborhood Graphs to accomplish the task.

Keywords: Machine learning, data mining, support vector machines, proximity graphs, relative-neighborhood graphs, k-nearest-neighbor graphs, random sampling, training data condensation.
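
Two of the pre-selection ideas named in the abstract can be sketched in a few lines. The example below is a hedged illustration, not the authors' implementation: it keeps a point when at least one of its k nearest neighbours carries a different label (a simple stand-in for "near the decision boundary" in a k-Nearest Neighbor graph), and it draws a random sample for comparison. The function names, the `fraction` and `seed` parameters and the toy data are assumptions; the condensed set would then be passed to whatever SVM trainer is in use.

```python
import math
import random


def knn_boundary_preselect(points, labels, k=3):
    """Return indices of points with a differently labelled point among their k nearest neighbours."""
    keep = []
    for i, p in enumerate(points):
        # Brute-force neighbour search: O(n^2), acceptable for a sketch.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        if any(labels[j] != labels[i] for j in others[:k]):
            keep.append(i)
    return keep


def random_preselect(n, fraction, seed=0):
    """Return a sorted random sample of indices covering roughly `fraction` of n points."""
    rng = random.Random(seed)
    size = max(1, round(n * fraction))
    return sorted(rng.sample(range(n), size))


# Hypothetical one-dimensional toy data: class "a" on the left, "b" on the right.
points = [(float(x),) for x in range(8)]
labels = ["a"] * 4 + ["b"] * 4
boundary = knn_boundary_preselect(points, labels, k=2)
sample = random_preselect(len(points), fraction=0.5)
```

On this toy data only the two points that face each other across the class boundary are retained by the neighbour-based rule, whereas the random sample ignores the labels entirely.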

Downloads: 1876

8730 The Role Played by Swift Change of the Stability Characteristic of Mean Flow in Bypass Transition

Authors: Dong Ming, Su Caihong

Abstract:

The scenario of bypass transition is generally described as follows: low-frequency disturbances in the free stream may generate long stream-wise streaks in the boundary layer, which may later trigger secondary instability, leading to a rapid increase of high-frequency disturbances. Turbulent spots may then emerge and, through their merging, lead to fully developed turbulence. This description, however, is insufficient in the sense that it does not provide the inherent mechanism by which, during the transition, a large number of waves with different frequencies and wave numbers appear almost simultaneously, producing a Reynolds stress large enough for the mean flow profile to change rapidly from laminar to turbulent. In this paper, such a mechanism is identified by analyzing DNS data of transition.

Keywords: boundary layer, breakdown, bypass transition, stability, streak.

Downloads: 1453

8729 Levenberg-Marquardt Algorithm for Karachi Stock Exchange Share Rates Forecasting

Authors: Syed Muhammad Aqil Burney, Tahseen Ahmed Jilani, C. Ardil

Abstract:

Financial forecasting is an example of a signal processing problem. A number of ways to train the network are available. We have used the Levenberg-Marquardt algorithm for error back-propagation and weight adjustment. Pre-processing of the data has reduced much of the variation from large scale to small scale, reducing the variation of the training data.

Keywords: Gradient descent method, Jacobian matrix, Levenberg-Marquardt algorithm, quadratic error surfaces.
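
The Levenberg-Marquardt update named in the abstract can be illustrated outside a neural network. The sketch below is a minimal, assumption-laden version for a model with exactly two parameters: it solves the damped normal equations (JᵀJ + λ·diag(JᵀJ))δ = Jᵀr at each step, lowers λ after a successful step and raises it otherwise. The function names, the `lam` and `iters` defaults and the straight-line test model are all hypothetical and are not taken from the paper.

```python
def levenberg_marquardt(model, jacobian, xs, ys, params, lam=1e-3, iters=50):
    """Fit a two-parameter model by damped Gauss-Newton (Levenberg-Marquardt) steps."""
    def sse(p):
        return sum((y - model(x, p)) ** 2 for x, y in zip(xs, ys))

    p = list(params)
    err = sse(p)
    for _ in range(iters):
        # Accumulate J^T J (2x2) and J^T r (2-vector) over the data.
        a11 = a12 = a22 = g1 = g2 = 0.0
        for x, y in zip(xs, ys):
            j1, j2 = jacobian(x, p)
            r = y - model(x, p)
            a11 += j1 * j1
            a12 += j1 * j2
            a22 += j2 * j2
            g1 += j1 * r
            g2 += j2 * r
        # Marquardt damping: scale the diagonal of J^T J by (1 + lam).
        d11, d22 = a11 * (1 + lam), a22 * (1 + lam)
        det = d11 * d22 - a12 * a12
        if det == 0:
            break
        # Solve the 2x2 system by Cramer's rule.
        step = ((d22 * g1 - a12 * g2) / det, (d11 * g2 - a12 * g1) / det)
        trial = [p[0] + step[0], p[1] + step[1]]
        trial_err = sse(trial)
        if trial_err < err:
            # Accept the step and trust the quadratic model more.
            p, err, lam = trial, trial_err, lam / 10
        else:
            # Reject the step and move toward gradient descent.
            lam *= 10
    return p


def line(x, p):
    return p[0] * x + p[1]


def line_jacobian(x, p):
    # Partial derivatives of the model with respect to p[0] and p[1].
    return (x, 1.0)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
fitted = levenberg_marquardt(line, line_jacobian, xs, ys, params=[0.0, 0.0])
```

A straight line is used only because its behaviour is easy to check; the same routine accepts any two-parameter model for which a Jacobian function is supplied.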

Downloads: 2424

8728 Object Alignment for Military Optical Surveillance

Authors: Oscar J.G. Somsen, Fok Bolderheij

Abstract:

Electro-optical devices are increasingly used for military sea-, land- and air applications to detect, recognize and track objects. Typically, these devices produce video information that is presented to an operator. However, with increasing availability of electro-optical devices the data volume is becoming very large, creating a rising need for automated analysis. In a military setting, this typically involves detecting and recognizing objects at a large distance, i.e. when they are difficult to distinguish from background and noise. One may consider combining multiple images from a video stream into a single enhanced image that provides more information for the operator. In this paper we investigate a simple algorithm to enhance simulated images from a military context and investigate how the enhancement is affected by various types of disturbance.

Keywords: Electro-Optics, Automated Image alignment

Downloads: 1561

8727 Assessment of Performance Measures of Large-Scale Power Systems

Authors: Mohamed A. El-Kady, Badr M. Alshammari

Abstract:

In a recent major industry-supported research and development study, a novel framework was developed and applied for assessment of reliability and quality performance levels in real-life power systems with practical large-scale sizes. The new assessment methodology is based on three metaphors (dimensions) representing the relationship between available generation capacities and required demand levels. The paper shares the results of the successfully completed study and describes the implementation of the new methodology on practical zones in the Saudi electricity system.

Keywords: Power systems, large-scale analysis, reliability, performance assessment, linear programming.

Downloads: 1809

8726 A New Variant of RC4 Stream Cipher

Authors: Lae Lae Khine

Abstract:

RC4 was used as the encryption algorithm in the WEP (Wired Equivalent Privacy) protocol, which is standardized for 802.11 wireless networks. A few attacks followed, indicating certain weaknesses in the design. In this paper, we propose a new variant of the RC4 stream cipher. The new version of the cipher not only appears to be more secure, but its keystream also has a large period, large complexity and good statistical properties.

Keywords: Cryptography, New variant, RC4, Stream Cipher.
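
For reference, the baseline that the abstract builds on can be sketched as follows. This is the standard, publicly documented RC4 (key scheduling followed by keystream generation), not the new variant proposed in the paper, whose details the abstract does not give. The function names are chosen for this example. RC4 is shown for illustration only and should not be used to protect real data.

```python
def rc4_keystream(key: bytes, n: int) -> bytes:
    """Return the first n keystream bytes of standard RC4 for the given key."""
    # Key-scheduling algorithm (KSA): permute S under control of the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Pseudo-random generation algorithm (PRGA): emit one byte per step.
    out = bytearray()
    i = j = 0
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XORing the data with the keystream."""
    return bytes(d ^ k for d, k in zip(data, rc4_keystream(key, len(data))))


ciphertext = rc4_crypt(b"Key", b"Plaintext")
recovered = rc4_crypt(b"Key", ciphertext)
```

Because encryption is a plain XOR with the keystream, applying `rc4_crypt` twice with the same key returns the original message.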

Downloads: 1843

8725 A Prediction Method for Large-Size Event Occurrences in the Sandpile Model

Authors: S. Channgam, A. Sae-Tang, T. Termsaithong

Abstract:

In this research, the occurrences of large-size events in various system sizes of the Bak-Tang-Wiesenfeld sandpile model are considered. The system sizes (square lattices) considered here are 25×25, 50×50, 75×75 and 100×100. The cross-correlation between the time series of the ratio of sites containing 3 grains and the time series of large-size events is also analyzed for these 4 system sizes. Moreover, a prediction method for large-size events in the 50×50 system is introduced. Lastly, it is shown that this prediction method provides slightly higher efficiency than random predictions.

Keywords: Bak-Tang-Wiesenfeld sandpile model, avalanches, cross-correlation, prediction method.
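
The two time series mentioned in the abstract can be produced with a short simulation. The sketch below assumes the usual Bak-Tang-Wiesenfeld rules (a site holding 4 or more grains topples and sends one grain to each of its four neighbours, and grains leave through the open boundary) and measures event size as the number of topplings; the paper may define event size differently. The function names, the `seed` parameter and the small lattice are assumptions, and the prediction method itself is not reproduced.

```python
import random


def add_grain(grid, row, col):
    """Drop one grain at (row, col), relax the lattice, return the number of topplings."""
    size = len(grid)
    grid[row][col] += 1
    topplings = 0
    unstable = [(row, col)]
    while unstable:
        r, c = unstable.pop()
        if grid[r][c] < 4:
            continue
        # A site holding 4 or more grains topples: one grain to each neighbour.
        grid[r][c] -= 4
        topplings += 1
        if grid[r][c] >= 4:
            unstable.append((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            # Grains pushed over the edge leave the system (open boundary).
            if 0 <= nr < size and 0 <= nc < size:
                grid[nr][nc] += 1
                if grid[nr][nc] >= 4:
                    unstable.append((nr, nc))
    return topplings


def simulate(size, steps, seed=0):
    """Return the final grid, the avalanche sizes and the ratio of 3-grain sites per step."""
    rng = random.Random(seed)
    grid = [[0] * size for _ in range(size)]
    avalanche_sizes, ratio_of_threes = [], []
    for _ in range(steps):
        # Record the fraction of sites holding 3 grains before each drop.
        threes = sum(cell == 3 for line in grid for cell in line)
        ratio_of_threes.append(threes / (size * size))
        avalanche_sizes.append(add_grain(grid, rng.randrange(size), rng.randrange(size)))
    return grid, avalanche_sizes, ratio_of_threes


final_grid, sizes, ratios = simulate(size=10, steps=500)
```

The `ratios` and `sizes` lists are aligned step by step, so they can be fed directly into a cross-correlation routine of the reader's choice.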

Downloads: 1130

8724 Omni: Data Science Platform for Evaluate Performance of a LoRaWAN Network

Authors: Emanuele A. Solagna, Ricardo S. Tozetto, Roberto dos S. Rabello

Abstract:

Nowadays, physical processes are becoming digitized through the evolution of communication, sensing and storage technologies, which promote the development of smart cities. The evolution of this technology has generated multiple challenges related to the generation of big data and the active participation of electronic devices in society. Devices can send information that is captured and processed over large areas, but there is no guarantee that all of the data obtained will be effectively stored and correctly persisted, because, depending on the technology used, there are parameters that have a huge influence on the full delivery of information. This article aims to characterize a project, currently under development, for a data science platform that will evaluate the performance and effectiveness of an industrial network implementing LoRaWAN technology, considering the configuration of its main parameters and relating these parameters to information loss.

Keywords: Internet of Things, LoRa, LoRaWAN, smart cities.

Downloads: 662

8723 Shear Buckling of a Large Pultruded Composite I-Section under Asymmetric Loading

Authors: Jin Y. Park, Jeong Wan Lee

Abstract:

An experimental and analytical study of shear buckling of a comparatively large polymer composite I-section is presented. It is known that the shear buckling load of a large-span composite beam is difficult to determine experimentally. In order to detect shear buckling of the tested I-section with sufficient sensitivity, twenty strain rosettes and eight displacement sensors were attached to the web and flange surfaces. The tested specimen was a pultruded composite beam made of vinylester resin, E-glass, carbon fibers and micro-fillers. Various coupon tests were performed before the shear buckling test to obtain fundamental material properties of the I-section. An asymmetric four-point bending loading scheme was used for the shear test. The loading scheme resulted in a high-shear and almost zero-moment condition at the center of the web panel. The shear buckling load was successfully determined after analyzing the test data obtained from the strain rosettes and displacement sensors. An analytical approach was also used to verify the experimental results and to support the experimental program.

Keywords: Strain sensor, displacement sensor, shear buckling, polymer composite I-section, asymmetric loading.

Downloads: 1914

8722 Computational Investigations of Concrete Footing Rotational Rigidity

Authors: E. S. Fraser, G. P. A. G. van Zijl

Abstract:

In many buildings we rely on large footings to offer structural stability. Designers often compensate for the lack of knowledge available with regard to foundation-soil interaction by furnishing structures with overly large footings. This may lead to a significant increase in building expenditures if many large foundations are present. This paper describes the interface material law that governs the behavior along the contact surface of adjacent materials, and the behavior of a large foundation under ultimate limit loading. A case study is chosen that represents a common foundation-soil system frequently used in general practice and therefore relevant to other structures. Investigations include compressing versus uplifting wind forces, alterations to the foundation size and subgrade compositions, the role of the slab stiffness and presence and the effect of commonly used structural joints and connections. These investigations aim to provide the reader with an objective design approach, efficiently preventing structural instability.

Keywords: Computational investigation of footing rotation.

Downloads: 1559

8721 An Efficient Run Time Interface for Heterogeneous Architecture of Large Scale Supercomputing System

Authors: Prabu D., Andrew Aaron James, Vanamala V., Vineeth Simon, Sanjeeb Kumar Deka, Sridharan R., Prahlada Rao B.B., Mohanram N.

Abstract:

In this paper we propose a novel Run Time Interface (RTI) technique to provide an efficient environment for MPI jobs on the heterogeneous architecture of PARAM Padma. It suggests an innovative, unified framework for the job management interface system in parallel and distributed computing. This approach employs a proxy scheme. The implementation shows that the proposed RTI is highly scalable and stable. Moreover, the RTI provides storage access for MPI jobs on various operating system platforms and improves data access performance through the high-performance C-DAC Parallel File System (C-PFS). The performance of the RTI is evaluated using standard HPC benchmark suites, and the simulation results show that the proposed RTI gives good performance on a large-scale supercomputing system.

Keywords: RTI, C-MPI, C-PFS, Scheduler Interface.

Downloads: 1398

8720 Scatter Analysis of Fatigue Life and Pore Size Data of Die-Cast AM60B Magnesium Alloy

Authors: S. Mohd, Y. Mutoh, Y. Otsuka, Y. Miyashita, T. Koike, T. Suzuki

Abstract:

Scatter behavior of fatigue life in die-cast AM60B alloy was investigated. For comparison, the scatter in rolled AM60B alloy and die-cast A365-T5 aluminum alloy was also studied. Scatter behavior of pore size was also investigated to discuss the dominant factors for fatigue life scatter in die-cast materials. A three-parameter Weibull function was suitable for describing the scatter behavior of both fatigue life and pore size. The scatter of fatigue life in die-cast AM60B alloy was almost comparable to that in die-cast A365-T5 alloy, while it was significantly larger than that in the rolled AM60B alloy. Scatter behavior of pore size observed at the fracture nucleation site on the fracture surface was comparable to that observed on the specimen cross-section and also to that of fatigue life. Therefore, the dominant factor for the large scatter of fatigue life in die-cast alloys would be the large scatter of pore size. This speculation was confirmed by a fracture mechanics fatigue life prediction, in which the pore observed at the fatigue crack nucleation site was assumed to be the pre-existing crack.

Keywords: Fatigue life, Pore size, Scatter, Weibull distribution, Die-cast magnesium alloy
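
The three-parameter Weibull function mentioned in the abstract has the cumulative form F(x) = 1 - exp(-((x - location) / scale)^shape) for x above the location parameter. The sketch below evaluates that textbook form and its inverse; the parameter names `shape`, `scale` and `location` and the numeric values are illustrative assumptions, not fitted values from the paper.

```python
import math


def weibull3_cdf(x, shape, scale, location):
    """Three-parameter Weibull CDF: probability that the variable is at most x."""
    if x <= location:
        # Nothing falls at or below the location (threshold) parameter.
        return 0.0
    return 1.0 - math.exp(-(((x - location) / scale) ** shape))


def weibull3_quantile(p, shape, scale, location):
    """Inverse CDF: the value below which a fraction p of the population falls."""
    return location + scale * (-math.log(1.0 - p)) ** (1.0 / shape)


# Hypothetical parameters for illustration only.
shape, scale, location = 2.0, 1.0e5, 2.0e4
median_life = weibull3_quantile(0.5, shape, scale, location)
prob_at_median = weibull3_cdf(median_life, shape, scale, location)
```

The location parameter acts as a minimum life (or minimum pore size) below which no observations are expected, which is what distinguishes this form from the two-parameter Weibull distribution.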

Downloads: 2338

8719 Application of Neural Networks in Financial Data Mining

Authors: Defu Zhang, Qingshan Jiang, Xin Li

Abstract:

This paper deals with the application of a well-known neural network technique, the multilayer back-propagation (BP) neural network, in financial data mining. A modified neural network forecasting model is presented, and an intelligent mining system is developed. The system can forecast buying and selling signals according to the predicted future trends of the stock market, and support decision-making for stock investors. A simulation over seven years of the Shanghai Composite Index shows that the return achieved by this mining system is about three times as large as that achieved by the buy-and-hold strategy. It is therefore advantageous to apply neural networks to forecasting financial time series, and different investors could benefit from it.

Keywords: Data mining, neural network, stock forecasting.

Downloads: 3543

8718 Standard Languages for Creating a Database to Display Financial Statements on a Web Application

Authors: Vladimir Simovic, Matija Varga, Predrag Oreski

Abstract:

XHTML and XBRL are the standard languages for creating a database for the purpose of displaying financial statements on web applications. Today, XBRL is one of the most popular languages for business reporting. A large number of countries recognize the role of XBRL in financial reporting and the benefits that the reporting format provides in the collection, analysis, preparation, publication and exchange of data (information), which is the positive side of this language. Here we present the advantages and opportunities that a company may gain by using the XBRL format for business reporting. This paper also presents XBRL and other languages that are used for creating the database, such as XML and XHTML. The role of the AJAX model and technology during the exchange of financial data between the web client and web server is explained in detail. The basic layers of the network for data exchange via the web are also mentioned.

Keywords: XHTML, XBRL, XML, JavaScript, AJAX technology, data exchange.

Downloads: 1032

8717 Visual Text Analytics Technologies for Real-Time Big Data: Chronological Evolution and Issues

Authors: Siti Azrina B. A. Aziz, Siti Hafizah A. Hamid

Abstract:

New approaches to analyzing and visualizing data streams in real time are important for enabling decision makers to make prompt decisions. Financial market trading and surveillance, large-scale emergency response and crowd control are some example scenarios that require real-time analytics and data visualization. This situation has led to the development of techniques and tools that support humans in analyzing the source data. With the emergence of Big Data and social media, new techniques and tools are required in order to process the streaming data. Today, a range of tools that implement some of these functionalities is available. In this paper, we present a chronological evaluation of the evolution of technologies that support real-time analytics and visualization of data streams. Based on research papers published from 2002 to 2014, we gathered the general information, main techniques, challenges and open issues. The techniques for streaming text visualization are identified based on the Text Visualization Browser in chronological order. This paper aims to review the evolution of streaming text visualization techniques and tools, as well as to discuss the problems and challenges of each identified tool.

Keywords: Information visualization, visual analytics, text mining, visual text analytics tools, big data visualization.

Downloads: 960

8716 Data Preprocessing for Supervised Leaning

Authors: S. B. Kotsiantis, D. Kanellopoulos, P. E. Pintelas

Abstract:

Many factors affect the success of Machine Learning (ML) on a given task. The representation and quality of the instance data is first and foremost. If much irrelevant and redundant information is present, or the data is noisy and unreliable, then knowledge discovery during the training phase is more difficult. It is well known that data preparation and filtering steps take a considerable amount of processing time in ML problems. Data pre-processing includes data cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. It would be convenient if a single sequence of data pre-processing algorithms had the best performance for every data set, but this is not the case. Thus, we present the best-known algorithms for each step of data pre-processing so that one can achieve the best performance for a given data set.

Keywords: Data mining, feature selection, data cleaning.
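
Two of the pre-processing steps listed in the abstract, cleaning and normalization, can be illustrated on a single numeric column. The sketch below shows mean imputation, min-max scaling and z-score standardization as commonly defined; the function names and the choice of these particular techniques are assumptions made for the example and are not a summary of the algorithms the paper surveys.

```python
from statistics import mean, pstdev


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mu = mean(observed)
    return [mu if v is None else v for v in values]


def min_max_scale(values):
    """Normalization: map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardization: zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


raw = [1, None, 3]
cleaned = fill_missing(raw)
scaled = min_max_scale(cleaned)
standardized = z_score(cleaned)
```

The order matters in practice: missing values are filled first so that the scaling statistics are computed on a complete column.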

Downloads: 5935

8715 Reducing SAGE Data Using Genetic Algorithms

Authors: Cheng-Hong Yang, Tsung-Mu Shih, Li-Yeh Chuang

Abstract:

Serial Analysis of Gene Expression is a powerful quantification technique for generating cell or tissue gene expression data. The profile of the gene expression of cell or tissue in several different states is difficult for biologists to analyze because of the large number of genes typically involved. However, feature selection in machine learning can successfully reduce this problem. The method allows reducing the features (genes) in specific SAGE data, and determines only relevant genes. In this study, we used a genetic algorithm to implement feature selection, and evaluate the classification accuracy of the selected features with the K-nearest neighbor method. In order to validate the proposed method, we used two SAGE data sets for testing. The results of this study conclusively prove that the number of features of the original SAGE data set can be significantly reduced and higher classification accuracy can be achieved.

Keywords: Serial Analysis of Gene Expression, Feature selection, Genetic Algorithm, K-nearest neighbor method.

Downloads: 1566

8714 Applications of Big Data in Education

Authors: Faisal Kalota

Abstract:

Big Data and analytics have gained huge momentum in recent years. Big Data feeds into the field of Learning Analytics (LA), which may allow academic institutions to better understand learners’ needs and proactively address them. Hence, it is important to have an understanding of Big Data and its applications. The purpose of this descriptive paper is to provide an overview of Big Data, the technologies used in Big Data, and some of the applications of Big Data in education. Additionally, it discusses some of the concerns related to Big Data and current research trends. While Big Data can provide big benefits, it is important that institutions understand their own needs, infrastructure, resources, and limitations before jumping on the Big Data bandwagon.

Keywords: Analytics, Big Data in Education, Hadoop, Learning Analytics.

Downloads 4818
8713 Improving Convergence of Parameter Tuning Process of the Additive Fuzzy System by New Learning Strategy

Authors: Thi Nguyen, Lee Gordon-Brown, Jim Peterson, Peter Wheeler

Abstract:

An additive fuzzy system comprising m rules with n inputs and p outputs in each rule has at least t = m(2n + 2p + 1) parameters that need to be tuned. The system consists of a large number of if-then fuzzy rules and takes a long time to tune, especially when there are many training samples. In this paper, a new learning strategy is investigated to cope with this obstacle. Parameters that tend toward constant values during the learning process are fixed early and are not tuned again until the end of the learning time. Experiments based on applications of the additive fuzzy system in function approximation demonstrate that the proposed approach reduces the learning time and hence improves convergence speed considerably.
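To give a feel for how quickly the tuning problem grows, the parameter bound from the abstract can be evaluated directly. This is a small sketch that assumes the expression reads t = m(2n + 2p + 1); the function name and the example sizes are hypothetical.

```python
def min_tunable_parameters(m: int, n: int, p: int) -> int:
    """Lower bound on tunable parameters, read as t = m(2n + 2p + 1).

    m: number of fuzzy rules, n: inputs per rule, p: outputs per rule.
    """
    if min(m, n, p) < 1:
        raise ValueError("m, n and p must all be positive")
    return m * (2 * n + 2 * p + 1)


# Hypothetical system sizes: the bound grows linearly in the number of rules.
for rules in (10, 100, 1000):
    print(rules, min_tunable_parameters(rules, n=2, p=1))
```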

Keywords: Additive fuzzy system, improving convergence, parameter learning process, unsupervised learning.

Downloads 1472
8712 Research of Data Cleaning Methods Based on Dependency Rules

Authors: Yang Bao, Shi Wei Deng, Wang Qun Lin

Abstract:

This paper introduces the concept and principles of data cleaning, analyzes the types and causes of dirty data, and outlines the key steps of a typical cleaning process. It puts forward a scalable and versatile data cleaning framework. For data with attribute dependency relations, it formally defines several violation-data discovery algorithms, which identify inconsistent data in all target columns that depend on condition attributes, whether the data is structured (SQL) or unstructured (NoSQL). Finally, it gives six data cleaning methods based on these algorithms.
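The abstract does not spell out its discovery algorithms. As a generic illustration of what detecting a dependency-rule violation can look like, the sketch below checks a simple functional dependency (condition attributes determine a target column) over a list of records. The function name, the rule, and the sample rows are assumptions, not the paper's method.

```python
from collections import defaultdict


def find_fd_violations(rows, condition_attrs, target_attr):
    """Find groups of rows that violate condition_attrs -> target_attr.

    Returns a dict mapping each offending tuple of condition values to the
    sorted list of conflicting target values seen for it.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in condition_attrs)
        seen[key].add(row[target_attr])
    # A dependency holds only if every key maps to exactly one target value.
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


# Hypothetical records: the rule "zip determines city" is broken for 10001.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]
violations = find_fd_violations(rows, ["zip"], "city")
print(violations)
```

Because the check only needs dictionary-like records, the same function applies to rows fetched from a SQL table or to documents from a NoSQL store.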

Keywords: Data cleaning, dependency rules, violation data discovery, data repair.

Downloads 2562
8711 Revised PLWAP Tree with Non-frequent Items for Mining Sequential Pattern

Authors: R. Vishnu Priya, A. Vadivel

Abstract:

Sequential pattern mining is a challenging task in the data mining area with many applications. One of those applications is mining patterns from weblogs. Weblogs today are highly dynamic, and some of their entries may become obsolete over time. In addition, users may frequently change the threshold value during the data mining process until they acquire the required output or mine interesting rules. Some of the recently proposed algorithms for mining weblogs build the tree with two scans and consume a large amount of time and space. In this paper, we build the Revised PLWAP with Non-frequent Items (RePLNI-tree) with a single scan for all items. While mining sequential patterns, the links related to the non-frequent items are not considered. Hence, it is not required to delete or maintain the information of nodes while revising the tree for mining updated transactions. The algorithm supports both incremental and interactive mining. It is not required to re-compute the patterns each time the weblog is updated or the minimum support is changed. The performance of the proposed tree is better even when the size of the incremental database is more than 50% of the existing one. For evaluation purposes, we have used a benchmark weblog dataset and found that the performance of the proposed tree is encouraging compared to some of the recently proposed approaches.
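The benefit of retaining non-frequent items can be illustrated without the tree itself. The sketch below is not the RePLNI-tree; it is a much simpler single-scan support table, with hypothetical names and data, showing why keeping counts for all items lets the minimum support change and new sequences arrive without rescanning earlier data.

```python
from collections import Counter


class SupportTable:
    """Keeps per-item support for all items, frequent or not."""

    def __init__(self):
        self.counts = Counter()
        self.n_sequences = 0

    def add_sequences(self, sequences):
        """Incremental update: scan only the newly arrived sequences."""
        for seq in sequences:
            self.counts.update(set(seq))  # count each item once per sequence
            self.n_sequences += 1

    def frequent_items(self, min_support):
        """Interactive query: filter by a relative threshold, no rescan needed."""
        threshold = min_support * self.n_sequences
        return sorted(item for item, c in self.counts.items() if c >= threshold)


table = SupportTable()
table.add_sequences([["a", "b", "a"], ["a", "c"], ["b", "a"]])
first = table.frequent_items(0.6)    # threshold chosen by the user
relaxed = table.frequent_items(0.3)  # threshold lowered, same stored counts

table.add_sequences([["c"], ["c", "d"]])  # incremental batch
updated = table.frequent_items(0.6)
print(first, relaxed, updated)
```

Here item "c" is non-frequent at first but becomes frequent after the incremental batch, which is only detectable because its earlier count was kept.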

Keywords: Sequential pattern mining, weblog, frequent and non-frequent items, incremental and interactive mining.

Downloads 1885