Commenced in January 2007
Frequency: Monthly
Edition: International
Paper Count: 4

Apache Spark Related Abstracts

4 Tracking and Classifying Client Interactions with Personal Coaches

Authors: Kartik Thakore, Anna-Roza Tamas, Adam Cole


The world health organization (WHO) reports that by 2030 more than 23.7 million deaths annually will be caused by Cardiovascular Diseases (CVDs); with a 2008 economic impact of $3.76 T. Metabolic syndrome is a disorder of multiple metabolic risk factors strongly indicated in the development of cardiovascular diseases. Guided lifestyle intervention driven by live coaching has been shown to have a positive impact on metabolic risk factors. Individuals’ path to improved (decreased) metabolic risk factors are driven by personal motivation and personalized messages delivered by coaches and augmented by technology. Using interactions captured between 400 individuals and 3 coaches over a program period of 500 days, a preliminary model was designed. A novel real time event tracking system was created to track and classify clients based on their genetic profile, baseline questionnaires and usage of a mobile application with live coaching sessions. Classification of clients and coaches was done using a support vector machines application build on Apache Spark, Stanford Natural Language Processing Library (SNLPL) and decision-modeling.

Keywords: natural language processing, guided lifestyle intervention, metabolic risk factors, personal coaching, support vector machines application, Apache Spark

Procedia PDF Downloads 315
3 A High-Level Co-Evolutionary Hybrid Algorithm for the Multi-Objective Job Shop Scheduling Problem

Authors: Aydin Teymourifar, Gurkan Ozturk


In this paper, a hybrid distributed algorithm has been suggested for the multi-objective job shop scheduling problem. Many new approaches are used at design steps of the distributed algorithm. Co-evolutionary structure of the algorithm and competition between different communicated hybrid algorithms, which are executed simultaneously, causes to efficient search. Using several machines for distributing the algorithms, at the iteration and solution levels, increases computational speed. The proposed algorithm is able to find the Pareto solutions of the big problems in shorter time than other algorithm in the literature. Apache Spark and Hadoop platforms have been used for the distribution of the algorithm. The suggested algorithm and implementations have been compared with results of the successful algorithms in the literature. Results prove the efficiency and high speed of the algorithm.

Keywords: Distributed Algorithms, Multi-objective optimization, Hadoop, Apache Spark, job shop scheduling

Procedia PDF Downloads 198
2 A Hybrid Distributed Algorithm for Solving Job Shop Scheduling Problem

Authors: Aydin Teymourifar, Gurkan Ozturk


In this paper, a distributed hybrid algorithm is proposed for solving the job shop scheduling problem. The suggested method executes different artificial neural networks, heuristics and meta-heuristics simultaneously on more than one machine. The neural networks are used to control the constraints of the problem while the meta-heuristics search the global space and the heuristics are used to prevent the premature convergence. To attain an efficient distributed intelligent method for solving big and distributed job shop scheduling problems, Apache Spark and Hadoop frameworks are used. In the algorithm implementation and design steps, new approaches are applied. Comparison between the proposed algorithm and other efficient algorithms from the literature shows its efficiency, which is able to solve large size problems in short time.

Keywords: Neural Network, Distributed Algorithms, Hadoop, Apache Spark, job shop scheduling

Procedia PDF Downloads 192
1 Power Iteration Clustering Based on Deflation Technique on Large Scale Graphs

Authors: Taysir Soliman


One of the current popular clustering techniques is Spectral Clustering (SC) because of its advantages over conventional approaches such as hierarchical clustering, k-means, etc. and other techniques as well. However, one of the disadvantages of SC is the time consuming process because it requires computing the eigenvectors. In the past to overcome this disadvantage, a number of attempts have been proposed such as the Power Iteration Clustering (PIC) technique, which is one of versions from SC; some of PIC advantages are: 1) its scalability and efficiency, 2) finding one pseudo-eigenvectors instead of computing eigenvectors, and 3) linear combination of the eigenvectors in linear time. However, its worst disadvantage is an inter-class collision problem because it used only one pseudo-eigenvectors which is not enough. Previous researchers developed Deflation-based Power Iteration Clustering (DPIC) to overcome problems of PIC technique on inter-class collision with the same efficiency of PIC. In this paper, we developed Parallel DPIC (PDPIC) to improve the time and memory complexity which is run on apache spark framework using sparse matrix. To test the performance of PDPIC, we compared it to SC, ESCG, ESCALG algorithms on four small graph benchmark datasets and nine large graph benchmark datasets, where PDPIC proved higher accuracy and better time consuming than other compared algorithms.

Keywords: Spectral Clustering, Apache Spark, power iteration clustering, deflation-based power iteration clustering, large graph

Procedia PDF Downloads 1