High Performance in Parallel Data Integration: An Empirical Evaluation of the Ratio Between Processing Time and Number of Physical Nodes

Caspar von Seckendorff; Eldar Sultanow

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33122

High Performance in Parallel Data Integration: An Empirical Evaluation of the Ratio Between Processing Time and Number of Physical Nodes

Authors: Caspar von Seckendorff, Eldar Sultanow

Abstract:

Many studies have shown that parallelization decreases efficiency [1], [2]. There are many reasons for these decrements. This paper investigates those which appear in the context of parallel data integration. Integration processes generally cannot be allocated to packages of identical size (i. e. tasks of identical complexity). The reason for this is unknown heterogeneous input data which result in variable task lengths. Process delay is defined by the slowest processing node. It leads to a detrimental effect on the total processing time. With a real world example, this study will show that while process delay does initially increase with the introduction of more nodes it ultimately decreases again after a certain point. The example will make use of the cloud computing platform Hadoop and be run inside Amazon-s EC2 compute cloud. A stochastic model will be set up which can explain this effect.

Keywords: Process delay, speedup, efficiency, parallel computing, data integration, E-Commerce, Amazon Elastic Compute Cloud (EC2), Hadoop, Nutch.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1063238

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1634

References:

[1] Kumar, V.: Introduction to Parallel Computing, 2nd edition. Addison- Wesley Longman Publishing Co., Inc., (2002)
[2] Eager, D. L., Zahorjan, J., Lozowska, E. D.: Speedup Versus Efficiency in Parallel Systems. IEEE Transactions on Computers, Vol. 38, No. 3, pp. 408--423 (1989)
[3] Cutting, D.: Nutch: an Open-Source Platform for Web Search. In: Beigbeder, M., Yee, W. G. (eds.) Workshop on Open Source Web Information Retrieval (OSWIR), pp. 31--33 (2005)
[4] Khare, R., Cutting, D., Sitaker, K., Rifkin, A.: Nutch: A Flexible and Scalable Open-Source Web Search Engine (2004)
[5] Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google File System. SOSP '03: Proceedings of 19th ACM symposium on Operating systems principles, ACM Press, pp. 29--43, NY (2003)
[6] Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA (2004)
[7] Cafarella, M. J., Etzioni, O.: A Search Engine for Natural Language Applications. WWW '05: Proceedings of the 14th international conference on World Wide Web, pp. 442--452. ACM Press, NY (2005)
[8] Drost, I., Scheffer, T.: Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam. ECML, pp. 96--107 (2005)
[9] Kimball, A., Michels-Slettvet, S., Bisciglia, C.: Cluster Computing for Web-Scale Data Processing. SIGCSE '08: Proceedings of the 39th SIGCSE technical symposium on Computer science education, ACM, pp. 116--120 (2008)
[10] Butler, M. H., Rutherford, J.: Distributed Lucene : A distributed free text index for Hadoop. HP Laboratories (2008)
[11] Hasan, W., Motwani, R.: Optimization Algorithms for Exploiting the Parallelism-Communication Tradeoff in Pipelined Parallelism. VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., pp. 36--47 (1994)
[12] Ahn, J. H., Erez, M., Dally, W. J.: Tradeoff between Data-, Instruction-, and Thread-Level Parallelism in Stream Processors. ICS '07: Proceedings of the 21st annual international conference on Supercomputing, ACM, pp. 126--137 (2007)
[13] Amini, H., Kazakov D., Ridge, E.: Parallelism vs Communication Overhead Trade-off in a JADE Multi-Agent Implementation of Cellular Automata. The First International Symposium on Nature-Inspired Systems for Parallel, Asynchronous and Decentralised Environments (NISPADE), AISB convention, Bristol (2006)
[14] Balasubramonian, R., Dwarkadas, S., Albonesi, D. H.: Dynamically managing the communication-parallelism trade-off in future clustered processors. SIGARCH Comput. Archit. News, ACM, pp. 275--287 (2003)
[15] Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. H., Stoica, I.: Improving MapReduce Performance in Heterogeneous Environments. 8th Symposium on Operating Systems Design and Implementation, pp. 29--42 (2008)
[16] Ukkonen, E.: Approximate string-matching with q-grams and maximal matches. Theor. Comput. Sci., Elsevier Science Publishers Ltd., 92, pp. 191--211 (1992)