Search results for: hadoop
Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 34

Search results for: hadoop

34 Healthcare Big Data Analytics Using Hadoop

Authors: Chellammal Surianarayanan

Abstract:

Healthcare industry is generating large amounts of data driven by various needs such as record keeping, physician’s prescription, medical imaging, sensor data, Electronic Patient Record(EPR), laboratory, pharmacy, etc. Healthcare data is so big and complex that they cannot be managed by conventional hardware and software. The complexity of healthcare big data arises from large volume of data, the velocity with which the data is accumulated and different varieties such as structured, semi-structured and unstructured nature of data. Despite the complexity of big data, if the trends and patterns that exist within the big data are uncovered and analyzed, higher quality healthcare at lower cost can be provided. Hadoop is an open source software framework for distributed processing of large data sets across clusters of commodity hardware using a simple programming model. The core components of Hadoop include Hadoop Distributed File System which offers way to store large amount of data across multiple machines and MapReduce which offers way to process large data sets with a parallel, distributed algorithm on a cluster. Hadoop ecosystem also includes various other tools such as Hive (a SQL-like query language), Pig (a higher level query language for MapReduce), Hbase(a columnar data store), etc. In this paper an analysis has been done as how healthcare big data can be processed and analyzed using Hadoop ecosystem.

Keywords: big data analytics, Hadoop, healthcare data, towards quality healthcare

Procedia PDF Downloads 374
33 Big Data Analysis with RHadoop

Authors: Ji Eun Shin, Byung Ho Jung, Dong Hoon Lim

Abstract:

It is almost impossible to store or analyze big data increasing exponentially with traditional technologies. Hadoop is a new technology to make that possible. R programming language is by far the most popular statistical tool for big data analysis based on distributed processing with Hadoop technology. With RHadoop that integrates R and Hadoop environment, we implemented parallel multiple regression analysis with different sizes of actual data. Experimental results showed our RHadoop system was much faster as the number of data nodes increases. We also compared the performance of our RHadoop with lm function and big lm packages available on big memory. The results showed that our RHadoop was faster than other packages owing to paralleling processing with increasing the number of map tasks as the size of data increases.

Keywords: big data, Hadoop, parallel regression analysis, R, RHadoop

Procedia PDF Downloads 403
32 Software Defined Storage: Object Storage over Hadoop Platform

Authors: Amritesh Srivastava, Gaurav Sharma

Abstract:

The purpose of this project is to develop an open source object storage system that is highly durable, scalable and reliable. There are two representative systems in cloud computing: Google and Amazon. Their storage systems for Google GFS and Amazon S3 provide high reliability, performance and stability. Our proposed system is highly inspired from Amazon S3. We are using Hadoop Distributed File System (HDFS) Java API to implement our system. We propose the architecture of object storage system based on Hadoop. We discuss the requirements of our system, what we expect from our system and what problems we may encounter. We also give detailed design proposal along with the abstract source code to implement it. The final goal of the system is to provide REST based access to our object storage system that exists on top of HDFS.

Keywords: Hadoop, HBase, object storage, REST

Procedia PDF Downloads 297
31 Verification & Validation of Map Reduce Program Model for Parallel K-Mediod Algorithm on Hadoop Cluster

Authors: Trapti Sharma, Devesh Kumar Srivastava

Abstract:

This paper is basically a analysis study of above MapReduce implementation and also to verify and validate the MapReduce solution model for Parallel K-Mediod algorithm on Hadoop Cluster. MapReduce is a programming model which authorize the managing of huge amounts of data in parallel, on a large number of devices. It is specially well suited to constant or moderate changing set of data since the implementation point of a position is usually high. MapReduce has slowly become the framework of choice for “big data”. The MapReduce model authorizes for systematic and instant organizing of large scale data with a cluster of evaluate nodes. One of the primary affect in Hadoop is how to minimize the completion length (i.e. makespan) of a set of MapReduce duty. In this paper, we have verified and validated various MapReduce applications like wordcount, grep, terasort and parallel K-Mediod clustering algorithm. We have found that as the amount of nodes increases the completion time decreases.

Keywords: hadoop, mapreduce, k-mediod, validation, verification

Procedia PDF Downloads 333
30 Scientific Linux Cluster for BIG-DATA Analysis (SLBD): A Case of Fayoum University

Authors: Hassan S. Hussein, Rania A. Abul Seoud, Amr M. Refaat

Abstract:

Scientific researchers face in the analysis of very large data sets that is increasing noticeable rate in today’s and tomorrow’s technologies. Hadoop and Spark are types of software that developed frameworks. Hadoop framework is suitable for many Different hardware platforms. In this research, a scientific Linux cluster for Big Data analysis (SLBD) is presented. SLBD runs open source software with large computational capacity and high performance cluster infrastructure. SLBD composed of one cluster contains identical, commodity-grade computers interconnected via a small LAN. SLBD consists of a fast switch and Gigabit-Ethernet card which connect four (nodes). Cloudera Manager is used to configure and manage an Apache Hadoop stack. Hadoop is a framework allows storing and processing big data across the cluster by using MapReduce algorithm. MapReduce algorithm divides the task into smaller tasks which to be assigned to the network nodes. Algorithm then collects the results and form the final result dataset. SLBD clustering system allows fast and efficient processing of large amount of data resulting from different applications. SLBD also provides high performance, high throughput, high availability, expandability and cluster scalability.

Keywords: big data platforms, cloudera manager, Hadoop, MapReduce

Procedia PDF Downloads 325
29 Distributed Processing for Content Based Lecture Video Retrieval on Hadoop Framework

Authors: U. S. N. Raju, Kothuri Sai Kiran, Meena G. Kamal, Vinay Nikhil Pabba, Suresh Kanaparthi

Abstract:

There is huge amount of lecture video data available for public use, and many more lecture videos are being created and uploaded every day. Searching for videos on required topics from this huge database is a challenging task. Therefore, an efficient method for video retrieval is needed. An approach for automated video indexing and video search in large lecture video archives is presented. As the amount of video lecture data is huge, it is very inefficient to do the processing in a centralized computation framework. Hence, Hadoop Framework for distributed computing for Big Video Data is used. First, step in the process is automatic video segmentation and key-frame detection to offer a visual guideline for the video content navigation. In the next step, we extract textual metadata by applying video Optical Character Recognition (OCR) technology on key-frames. The OCR and detected slide text line types are adopted for keyword extraction, by which both video- and segment-level keywords are extracted for content-based video browsing and search. The performance of the indexing process can be improved for a large database by using distributed computing on Hadoop framework.

Keywords: video lectures, big video data, video retrieval, hadoop

Procedia PDF Downloads 487
28 Regression Approach for Optimal Purchase of Hosts Cluster in Fixed Fund for Hadoop Big Data Platform

Authors: Haitao Yang, Jianming Lv, Fei Xu, Xintong Wang, Yilin Huang, Lanting Xia, Xuewu Zhu

Abstract:

Given a fixed fund, purchasing fewer hosts of higher capability or inversely more of lower capability is a must-be-made trade-off in practices for building a Hadoop big data platform. An exploratory study is presented for a Housing Big Data Platform project (HBDP), where typical big data computing is with SQL queries of aggregate, join, and space-time condition selections executed upon massive data from more than 10 million housing units. In HBDP, an empirical formula was introduced to predict the performance of host clusters potential for the intended typical big data computing, and it was shaped via a regression approach. With this empirical formula, it is easy to suggest an optimal cluster configuration. The investigation was based on a typical Hadoop computing ecosystem HDFS+Hive+Spark. A proper metric was raised to measure the performance of Hadoop clusters in HBDP, which was tested and compared with its predicted counterpart, on executing three kinds of typical SQL query tasks. Tests were conducted with respect to factors of CPU benchmark, memory size, virtual host division, and the number of element physical host in cluster. The research has been applied to practical cluster procurement for housing big data computing.

Keywords: Hadoop platform planning, optimal cluster scheme at fixed-fund, performance predicting formula, typical SQL query tasks

Procedia PDF Downloads 200
27 Big Data Analysis with Rhipe

Authors: Byung Ho Jung, Ji Eun Shin, Dong Hoon Lim

Abstract:

Rhipe that integrates R and Hadoop environment made it possible to process and analyze massive amounts of data using a distributed processing environment. In this paper, we implemented multiple regression analysis using Rhipe with various data sizes of actual data. Experimental results for comparing the performance of our Rhipe with stats and biglm packages available on bigmemory, showed that our Rhipe was more fast than other packages owing to paralleling processing with increasing the number of map tasks as the size of data increases. We also compared the computing speeds of pseudo-distributed and fully-distributed modes for configuring Hadoop cluster. The results showed that fully-distributed mode was faster than pseudo-distributed mode, and computing speeds of fully-distributed mode were faster as the number of data nodes increases.

Keywords: big data, Hadoop, Parallel regression analysis, R, Rhipe

Procedia PDF Downloads 470
26 Ontology-Based Backpropagation Neural Network Classification and Reasoning Strategy for NoSQL and SQL Databases

Authors: Hao-Hsiang Ku, Ching-Ho Chi

Abstract:

Big data applications have become an imperative for many fields. Many researchers have been devoted into increasing correct rates and reducing time complexities. Hence, the study designs and proposes an Ontology-based backpropagation neural network classification and reasoning strategy for NoSQL big data applications, which is called ON4NoSQL. ON4NoSQL is responsible for enhancing the performances of classifications in NoSQL and SQL databases to build up mass behavior models. Mass behavior models are made by MapReduce techniques and Hadoop distributed file system based on Hadoop service platform. The reference engine of ON4NoSQL is the ontology-based backpropagation neural network classification and reasoning strategy. Simulation results indicate that ON4NoSQL can efficiently achieve to construct a high performance environment for data storing, searching, and retrieving.

Keywords: Hadoop, NoSQL, ontology, back propagation neural network, high distributed file system

Procedia PDF Downloads 232
25 Cloud Design for Storing Large Amount of Data

Authors: M. Strémy, P. Závacký, P. Cuninka, M. Juhás

Abstract:

Main goal of this paper is to introduce our design of private cloud for storing large amount of data, especially pictures, and to provide good technological backend for data analysis based on parallel processing and business intelligence. We have tested hypervisors, cloud management tools, storage for storing all data and Hadoop to provide data analysis on unstructured data. Providing high availability, virtual network management, logical separation of projects and also rapid deployment of physical servers to our environment was also needed.

Keywords: cloud, glusterfs, hadoop, juju, kvm, maas, openstack, virtualization

Procedia PDF Downloads 322
24 Big Data: Concepts, Technologies and Applications in the Public Sector

Authors: A. Alexandru, C. A. Alexandru, D. Coardos, E. Tudora

Abstract:

Big Data (BD) is associated with a new generation of technologies and architectures which can harness the value of extremely large volumes of very varied data through real time processing and analysis. It involves changes in (1) data types, (2) accumulation speed, and (3) data volume. This paper presents the main concepts related to the BD paradigm, and introduces architectures and technologies for BD and BD sets. The integration of BD with the Hadoop Framework is also underlined. BD has attracted a lot of attention in the public sector due to the newly emerging technologies that allow the availability of network access. The volume of different types of data has exponentially increased. Some applications of BD in the public sector in Romania are briefly presented.

Keywords: big data, big data analytics, Hadoop, cloud

Procedia PDF Downloads 273
23 Extracting Opinions from Big Data of Indonesian Customer Reviews Using Hadoop MapReduce

Authors: Veronica S. Moertini, Vinsensius Kevin, Gede Karya

Abstract:

Customer reviews have been collected by many kinds of e-commerce websites selling products, services, hotel rooms, tickets and so on. Each website collects its own customer reviews. The reviews can be crawled, collected from those websites and stored as big data. Text analysis techniques can be used to analyze that data to produce summarized information, such as customer opinions. Then, these opinions can be published by independent service provider websites and used to help customers in choosing the most suitable products or services. As the opinions are analyzed from big data of reviews originated from many websites, it is expected that the results are more trusted and accurate. Indonesian customers write reviews in Indonesian language, which comes with its own structures and uniqueness. We found that most of the reviews are expressed with “daily language”, which is informal, do not follow the correct grammar, have many abbreviations and slangs or non-formal words. Hadoop is an emerging platform aimed for storing and analyzing big data in distributed systems. A Hadoop cluster consists of master and slave nodes/computers operated in a network. Hadoop comes with distributed file system (HDFS) and MapReduce framework for supporting parallel computation. However, MapReduce has weakness (i.e. inefficient) for iterative computations, specifically, the cost of reading/writing data (I/O cost) is high. Given this fact, we conclude that MapReduce function is best adapted for “one-pass” computation. In this research, we develop an efficient technique for extracting or mining opinions from big data of Indonesian reviews, which is based on MapReduce with one-pass computation. In designing the algorithm, we avoid iterative computation and instead adopt a “look up table” technique. The stages of the proposed technique are: (1) Crawling the data reviews from websites; (2) cleaning and finding root words from the raw reviews; (3) computing the frequency of the meaningful opinion words; (4) analyzing customers sentiments towards defined objects. The experiments for evaluating the performance of the technique were conducted on a Hadoop cluster with 14 slave nodes. The results show that the proposed technique (stage 2 to 4) discovers useful opinions, is capable of processing big data efficiently and scalable.

Keywords: big data analysis, Hadoop MapReduce, analyzing text data, mining Indonesian reviews

Procedia PDF Downloads 177
22 Frequent Item Set Mining for Big Data Using MapReduce Framework

Authors: Tamanna Jethava, Rahul Joshi

Abstract:

Frequent Item sets play an essential role in many data Mining tasks that try to find interesting patterns from the database. Typically it refers to a set of items that frequently appear together in transaction dataset. There are several mining algorithm being used for frequent item set mining, yet most do not scale to the type of data we presented with today, so called “BIG DATA”. Big Data is a collection of large data sets. Our approach is to work on the frequent item set mining over the large dataset with scalable and speedy way. Big Data basically works with Map Reduce along with HDFS is used to find out frequent item sets from Big Data on large cluster. This paper focuses on using pre-processing & mining algorithm as hybrid approach for big data over Hadoop platform.

Keywords: frequent item set mining, big data, Hadoop, MapReduce

Procedia PDF Downloads 382
21 From Two-Way to Multi-Way: A Comparative Study for Map-Reduce Join Algorithms

Authors: Marwa Hussien Mohamed, Mohamed Helmy Khafagy

Abstract:

Map-Reduce is a programming model which is widely used to extract valuable information from enormous volumes of data. Map-reduce designed to support heterogeneous datasets. Apache Hadoop map-reduce used extensively to uncover hidden pattern like data mining, SQL, etc. The most important operation for data analysis is joining operation. But, map-reduce framework does not directly support join algorithm. This paper explains and compares two-way and multi-way map-reduce join algorithms for map reduce also we implement MR join Algorithms and show the performance of each phase in MR join algorithms. Our experimental results show that map side join and map merge join in two-way join algorithms has the longest time according to preprocessing step sorting data and reduce side cascade join has the longest time at Multi-Way join algorithms.

Keywords: Hadoop, MapReduce, multi-way join, two-way join, Ubuntu

Procedia PDF Downloads 447
20 A High-Level Co-Evolutionary Hybrid Algorithm for the Multi-Objective Job Shop Scheduling Problem

Authors: Aydin Teymourifar, Gurkan Ozturk

Abstract:

In this paper, a hybrid distributed algorithm has been suggested for the multi-objective job shop scheduling problem. Many new approaches are used at design steps of the distributed algorithm. Co-evolutionary structure of the algorithm and competition between different communicated hybrid algorithms, which are executed simultaneously, causes to efficient search. Using several machines for distributing the algorithms, at the iteration and solution levels, increases computational speed. The proposed algorithm is able to find the Pareto solutions of the big problems in shorter time than other algorithm in the literature. Apache Spark and Hadoop platforms have been used for the distribution of the algorithm. The suggested algorithm and implementations have been compared with results of the successful algorithms in the literature. Results prove the efficiency and high speed of the algorithm.

Keywords: distributed algorithms, Apache Spark, Hadoop, job shop scheduling, multi-objective optimization

Procedia PDF Downloads 327
19 Collision Theory Based Sentiment Detection Using Discourse Analysis in Hadoop

Authors: Anuta Mukherjee, Saswati Mukherjee

Abstract:

Data is growing everyday. Social networking sites such as Twitter are becoming an integral part of our daily lives, contributing a large increase in the growth of data. It is a rich source especially for sentiment detection or mining since people often express honest opinion through tweets. However, although sentiment analysis is a well-researched topic in text, this analysis using Twitter data poses additional challenges since these are unstructured data with abbreviations and without a strict grammatical correctness. We have employed collision theory to achieve sentiment analysis in Twitter data. We have also incorporated discourse analysis in the collision theory based model to detect accurate sentiment from tweets. We have also used the retweet field to assign weights to certain tweets and obtained the overall weightage of a topic provided in the form of a query. Hadoop has been exploited for speed. Our experiments show effective results.

Keywords: sentiment analysis, twitter, collision theory, discourse analysis

Procedia PDF Downloads 495
18 A Hybrid Distributed Algorithm for Solving Job Shop Scheduling Problem

Authors: Aydin Teymourifar, Gurkan Ozturk

Abstract:

In this paper, a distributed hybrid algorithm is proposed for solving the job shop scheduling problem. The suggested method executes different artificial neural networks, heuristics and meta-heuristics simultaneously on more than one machine. The neural networks are used to control the constraints of the problem while the meta-heuristics search the global space and the heuristics are used to prevent the premature convergence. To attain an efficient distributed intelligent method for solving big and distributed job shop scheduling problems, Apache Spark and Hadoop frameworks are used. In the algorithm implementation and design steps, new approaches are applied. Comparison between the proposed algorithm and other efficient algorithms from the literature shows its efficiency, which is able to solve large size problems in short time.

Keywords: distributed algorithms, Apache Spark, Hadoop, job shop scheduling, neural network

Procedia PDF Downloads 349
17 A Hybrid Distributed Algorithm for Multi-Objective Dynamic Flexible Job Shop Scheduling Problem

Authors: Aydin Teymourifar, Gurkan Ozturk

Abstract:

In this paper, a hybrid distributed algorithm has been suggested for multi-objective dynamic flexible job shop scheduling problem. The proposed algorithm is high level, in which several algorithms search the space on different machines simultaneously also it is a hybrid algorithm that takes advantages of the artificial intelligence, evolutionary and optimization methods. Distribution is done at different levels and new approaches are used for design of the algorithm. Apache spark and Hadoop frameworks have been used for the distribution of the algorithm. The Pareto optimality approach is used for solving the multi-objective benchmarks. The suggested algorithm that is able to solve large-size problems in short times has been compared with the successful algorithms of the literature. The results prove high speed and efficiency of the algorithm.

Keywords: distributed algorithms, apache-spark, Hadoop, flexible dynamic job shop scheduling, multi-objective optimization

Procedia PDF Downloads 313
16 Sensor Data Analysis for a Large Mining Major

Authors: Sudipto Shanker Dasgupta

Abstract:

One of the largest mining companies wanted to look at health analytics for their driverless trucks. These trucks were the key to their supply chain logistics. The automated trucks had multi-level sub-assemblies which would send out sensor information. The use case that was worked on was to capture the sensor signal from the truck subcomponents and analyze the health of the trucks from repair and replacement purview. Open source software was used to stream the data into a clustered Hadoop setup in Amazon Web Services cloud and Apache Spark SQL was used to analyze the data. All of this was achieved through a 10 node amazon 32 core, 64 GB RAM setup real-time analytics was achieved on ‘300 million records’. To check the scalability of the system, the cluster was increased to 100 node setup. This talk will highlight how Open Source software was used to achieve the above use case and the insights on the high data throughput on a cloud set up.

Keywords: streaming analytics, data science, big data, Hadoop, high throughput, sensor data

Procedia PDF Downloads 372
15 Genodata: The Human Genome Variation Using BigData

Authors: Surabhi Maiti, Prajakta Tamhankar, Prachi Uttam Mehta

Abstract:

Since the accomplishment of the Human Genome Project, there has been an unparalled escalation in the sequencing of genomic data. This project has been the first major vault in the field of medical research, especially in genomics. This project won accolades by using a concept called Bigdata which was earlier, extensively used to gain value for business. Bigdata makes use of data sets which are generally in the form of files of size terabytes, petabytes, or exabytes and these data sets were traditionally used and managed using excel sheets and RDBMS. The voluminous data made the process tedious and time consuming and hence a stronger framework called Hadoop was introduced in the field of genetic sciences to make data processing faster and efficient. This paper focuses on using SPARK which is gaining momentum with the advancement of BigData technologies. Cloud Storage is an effective medium for storage of large data sets which is generated from the genetic research and the resultant sets produced from SPARK analysis.

Keywords: human genome project, Bigdata, genomic data, SPARK, cloud storage, Hadoop

Procedia PDF Downloads 220
14 Analysis of Genomics Big Data in Cloud Computing Using Fuzzy Logic

Authors: Mohammad Vahed, Ana Sadeghitohidi, Majid Vahed, Hiroki Takahashi

Abstract:

In the genomics field, the huge amounts of data have produced by the next-generation sequencers (NGS). Data volumes are very rapidly growing, as it is postulated that more than one billion bases will be produced per year in 2020. The growth rate of produced data is much faster than Moore's law in computer technology. This makes it more difficult to deal with genomics data, such as storing data, searching information, and finding the hidden information. It is required to develop the analysis platform for genomics big data. Cloud computing newly developed enables us to deal with big data more efficiently. Hadoop is one of the frameworks distributed computing and relies upon the core of a Big Data as a Service (BDaaS). Although many services have adopted this technology, e.g. amazon, there are a few applications in the biology field. Here, we propose a new algorithm to more efficiently deal with the genomics big data, e.g. sequencing data. Our algorithm consists of two parts: First is that BDaaS is applied for handling the data more efficiently. Second is that the hybrid method of MapReduce and Fuzzy logic is applied for data processing. This step can be parallelized in implementation. Our algorithm has great potential in computational analysis of genomics big data, e.g. de novo genome assembly and sequence similarity search. We will discuss our algorithm and its feasibility.

Keywords: big data, fuzzy logic, MapReduce, Hadoop, cloud computing

Procedia PDF Downloads 263
13 RA-Apriori: An Efficient and Faster MapReduce-Based Algorithm for Frequent Itemset Mining on Apache Flink

Authors: Sanjay Rathee, Arti Kashyap

Abstract:

Extraction of useful information from large datasets is one of the most important research problems. Association rule mining is one of the best methods for this purpose. Finding possible associations between items in large transaction based datasets (finding frequent patterns) is most important part of the association rule mining. There exist many algorithms to find frequent patterns but Apriori algorithm always remains a preferred choice due to its ease of implementation and natural tendency to be parallelized. Many single-machine based Apriori variants exist but massive amount of data available these days is above capacity of a single machine. Therefore, to meet the demands of this ever-growing huge data, there is a need of multiple machines based Apriori algorithm. For these types of distributed applications, MapReduce is a popular fault-tolerant framework. Hadoop is one of the best open-source software frameworks with MapReduce approach for distributed storage and distributed processing of huge datasets using clusters built from commodity hardware. However, heavy disk I/O operation at each iteration of a highly iterative algorithm like Apriori makes Hadoop inefficient. A number of MapReduce-based platforms are being developed for parallel computing in recent years. Among them, two platforms, namely, Spark and Flink have attracted a lot of attention because of their inbuilt support to distributed computations. Earlier we proposed a reduced- Apriori algorithm on Spark platform which outperforms parallel Apriori, one because of use of Spark and secondly because of the improvement we proposed in standard Apriori. Therefore, this work is a natural sequel of our work and targets on implementing, testing and benchmarking Apriori and Reduced-Apriori and our new algorithm ReducedAll-Apriori on Apache Flink and compares it with Spark implementation. Flink, a streaming dataflow engine, overcomes disk I/O bottlenecks in MapReduce, providing an ideal platform for distributed Apriori. Flink's pipelining based structure allows starting a next iteration as soon as partial results of earlier iteration are available. Therefore, there is no need to wait for all reducers result to start a next iteration. We conduct in-depth experiments to gain insight into the effectiveness, efficiency and scalability of the Apriori and RA-Apriori algorithm on Flink.

Keywords: apriori, apache flink, Mapreduce, spark, Hadoop, R-Apriori, frequent itemset mining

Procedia PDF Downloads 248
12 Applications of Big Data in Education

Authors: Faisal Kalota

Abstract:

Big Data and analytics have gained a huge momentum in recent years. Big Data feeds into the field of Learning Analytics (LA) that may allow academic institutions to better understand the learners’ needs and proactively address them. Hence, it is important to have an understanding of Big Data and its applications. The purpose of this descriptive paper is to provide an overview of Big Data, the technologies used in Big Data, and some of the applications of Big Data in education. Additionally, it discusses some of the concerns related to Big Data and current research trends. While Big Data can provide big benefits, it is important that institutions understand their own needs, infrastructure, resources, and limitation before jumping on the Big Data bandwagon.

Keywords: big data, learning analytics, analytics, big data in education, Hadoop

Procedia PDF Downloads 375
11 Parallel Fuzzy Rough Support Vector Machine for Data Classification in Cloud Environment

Authors: Arindam Chaudhuri

Abstract:

Classification of data has been actively used for most effective and efficient means of conveying knowledge and information to users. The prima face has always been upon techniques for extracting useful knowledge from data such that returns are maximized. With emergence of huge datasets the existing classification techniques often fail to produce desirable results. The challenge lies in analyzing and understanding characteristics of massive data sets by retrieving useful geometric and statistical patterns. We propose a supervised parallel fuzzy rough support vector machine (PFRSVM) for data classification in cloud environment. The classification is performed by PFRSVM using hyperbolic tangent kernel. The fuzzy rough set model takes care of sensitiveness of noisy samples and handles impreciseness in training samples bringing robustness to results. The membership function is function of center and radius of each class in feature space and is represented with kernel. It plays an important role towards sampling the decision surface. The success of PFRSVM is governed by choosing appropriate parameter values. The training samples are either linear or nonlinear separable. The different input points make unique contributions to decision surface. The algorithm is parallelized with a view to reduce training times. The system is built on support vector machine library using Hadoop implementation of MapReduce. The algorithm is tested on large data sets to check its feasibility and convergence. The performance of classifier is also assessed in terms of number of support vectors. The challenges encountered towards implementing big data classification in machine learning frameworks are also discussed. The experiments are done on the cloud environment available at University of Technology and Management, India. The results are illustrated for Gaussian RBF and Bayesian kernels. The effect of variability in prediction and generalization of PFRSVM is examined with respect to values of parameter C. It effectively resolves outliers’ effects, imbalance and overlapping class problems, normalizes to unseen data and relaxes dependency between features and labels. The average classification accuracy for PFRSVM is better than other classifiers for both Gaussian RBF and Bayesian kernels. The experimental results on both synthetic and real data sets clearly demonstrate the superiority of the proposed technique.

Keywords: FRSVM, Hadoop, MapReduce, PFRSVM

Procedia PDF Downloads 454
10 Association Rules Mining and NOSQL Oriented Document in Big Data

Authors: Sarra Senhadji, Imene Benzeguimi, Zohra Yagoub

Abstract:

Big Data represents the recent technology of manipulating voluminous and unstructured data sets over multiple sources. Therefore, NOSQL appears to handle the problem of unstructured data. Association rules mining is one of the popular techniques of data mining to extract hidden relationship from transactional databases. The algorithm for finding association dependencies is well-solved with Map Reduce. The goal of our work is to reduce the time of generating of frequent itemsets by using Map Reduce and NOSQL database oriented document. A comparative study is given to evaluate the performances of our algorithm with the classical algorithm Apriori.

Keywords: Apriori, Association rules mining, Big Data, Data Mining, Hadoop, MapReduce, MongoDB, NoSQL

Procedia PDF Downloads 123
9 Improved K-Means Clustering Algorithm Using RHadoop with Combiner

Authors: Ji Eun Shin, Dong Hoon Lim

Abstract:

Data clustering is a common technique used in data analysis and is used in many applications, such as artificial intelligence, pattern recognition, economics, ecology, psychiatry and marketing. K-means clustering is a well-known clustering algorithm aiming to cluster a set of data points to a predefined number of clusters. In this paper, we implement K-means algorithm based on MapReduce framework with RHadoop to make the clustering method applicable to large scale data. RHadoop is a collection of R packages that allow users to manage and analyze data with Hadoop. The main idea is to introduce a combiner as a function of our map output to decrease the amount of data needed to be processed by reducers. The experimental results demonstrated that K-means algorithm using RHadoop can scale well and efficiently process large data sets on commodity hardware. We also showed that our K-means algorithm using RHadoop with combiner was faster than regular algorithm without combiner as the size of data set increases.

Keywords: big data, combiner, K-means clustering, RHadoop

Procedia PDF Downloads 393
8 BigCrypt: A Probable Approach of Big Data Encryption to Protect Personal and Business Privacy

Authors: Abdullah Al Mamun, Talal Alkharobi

Abstract:

As data size is growing up, people are became more familiar to store big amount of secret information into cloud storage. Companies are always required to need transfer massive business files from one end to another. We are going to lose privacy if we transmit it as it is and continuing same scenario repeatedly without securing the communication mechanism means proper encryption. Although asymmetric key encryption solves the main problem of symmetric key encryption but it can only encrypt limited size of data which is inapplicable for large data encryption. In this paper we propose a probable approach of pretty good privacy for encrypt big data using both symmetric and asymmetric keys. Our goal is to achieve encrypt huge collection information and transmit it through a secure communication channel for committing the business and personal privacy. To justify our method an experimental dataset from three different platform is provided. We would like to show that our approach is working for massive size of various data efficiently and reliably.

Keywords: big data, cloud computing, cryptography, hadoop, public key

Procedia PDF Downloads 287
7 [Keynote]: No-Trust-Zone Architecture for Securing Supervisory Control and Data Acquisition

Authors: Michael Okeke, Andrew Blyth

Abstract:

Supervisory Control And Data Acquisition (SCADA) as the state of the art Industrial Control Systems (ICS) are used in many different critical infrastructures, from smart home to energy systems and from locomotives train system to planes. Security of SCADA systems is vital since many lives depend on it for daily activities and deviation from normal operation could be disastrous to the environment as well as lives. This paper describes how No-Trust-Zone (NTZ) architecture could be incorporated into SCADA Systems in order to reduce the chances of malicious intent. The architecture is made up of two distinctive parts which are; the field devices such as; sensors, PLCs pumps, and actuators. The second part of the architecture is designed following lambda architecture, which is made up of a detection algorithm based on Particle Swarm Optimization (PSO) and Hadoop framework for data processing and storage. Apache Spark will be a part of the lambda architecture for real-time analysis of packets for anomalies detection.

Keywords: industrial control system (ics, no-trust-zone (ntz), particle swarm optimisation (pso), supervisory control and data acquisition (scada), swarm intelligence (SI)

Procedia PDF Downloads 309
6 MapReduce Logistic Regression Algorithms with RHadoop

Authors: Byung Ho Jung, Dong Hoon Lim

Abstract:

Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. Logistic regression is used extensively in numerous disciplines, including the medical and social science fields. In this paper, we address the problem of estimating parameters in the logistic regression based on MapReduce framework with RHadoop that integrates R and Hadoop environment applicable to large scale data. There exist three learning algorithms for logistic regression, namely Gradient descent method, Cost minimization method and Newton-Rhapson's method. The Newton-Rhapson's method does not require a learning rate, while gradient descent and cost minimization methods need to manually pick a learning rate. The experimental results demonstrated that our learning algorithms using RHadoop can scale well and efficiently process large data sets on commodity hardware. We also compared the performance of our Newton-Rhapson's method with gradient descent and cost minimization methods. The results showed that our newton's method appeared to be the most robust to all data tested.

Keywords: big data, logistic regression, MapReduce, RHadoop

Procedia PDF Downloads 241
5 Cloud Computing in Data Mining: A Technical Survey

Authors: Ghaemi Reza, Abdollahi Hamid, Dashti Elham

Abstract:

Cloud computing poses a diversity of challenges in data mining operation arising out of the dynamic structure of data distribution as against the use of typical database scenarios in conventional architecture. Due to immense number of users seeking data on daily basis, there is a serious security concerns to cloud providers as well as data providers who put their data on the cloud computing environment. Big data analytics use compute intensive data mining algorithms (Hidden markov, MapReduce parallel programming, Mahot Project, Hadoop distributed file system, K-Means and KMediod, Apriori) that require efficient high performance processors to produce timely results. Data mining algorithms to solve or optimize the model parameters. The challenges that operation has to encounter is the successful transactions to be established with the existing virtual machine environment and the databases to be kept under the control. Several factors have led to the distributed data mining from normal or centralized mining. The approach is as a SaaS which uses multi-agent systems for implementing the different tasks of system. There are still some problems of data mining based on cloud computing, including design and selection of data mining algorithms.

Keywords: cloud computing, data mining, computing models, cloud services

Procedia PDF Downloads 442