A Survey on Data-Centric and Data-Aware Techniques for Large Scale Infrastructures
Authors: Silvina Caíno-Lores, Jesús Carretero
Abstract:
Large-scale computing infrastructures have been widely developed with the core objective of providing a suitable platform for high-performance and high-throughput computing. These systems are designed to support resource-intensive and complex applications, which arise in many scientific and industrial areas. Currently, large-scale data-intensive applications are hindered by the high latencies that result from accessing vastly distributed data. Recent works suggest that improving data locality is key to moving towards exascale infrastructures efficiently, since locality-oriented solutions reduce the bandwidth consumed by data transfers and the overheads that arise from them. Several techniques attempt to move computation closer to the data. In this survey we analyse the mechanisms that have been proposed to provide data locality in large-scale high-performance and high-throughput systems. The survey intends to help the scientific computing community understand the technical aspects and strategies reported in recent literature regarding data locality. We present an overview of locality-oriented techniques, grouped into four main categories: application development, task scheduling, in-memory computing and storage platforms. Finally, we discuss future research lines and synergies among these techniques.
Keywords: Co-scheduling, data-centric, data-intensive, data locality, in-memory storage, large scale.
Digital Object Identifier (DOI): https://doi.org/10.5281/zenodo.1112258
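To make the task-scheduling category above concrete, the following minimal sketch illustrates one common locality-aware placement policy in the spirit of Hadoop-style schedulers: each task is matched to a node that already holds a replica of its input block, falling back to a remote node (and a network transfer) only when no data-local slot is free. This is an illustrative example, not an algorithm from the surveyed paper; all names (schedule, replicas, free_slots) are hypothetical.

```python
# Hypothetical locality-aware scheduler sketch: prefer placing each task on a
# node that already stores a replica of its input block; fall back to the
# least-loaded node (paying a remote read) only when no local slot is free.

def schedule(tasks, replicas, free_slots):
    """tasks: {task_id: input_block}
    replicas: {block: set of nodes holding a copy}
    free_slots: {node: number of free execution slots}
    Returns ({task_id: node}, number of data-local placements)."""
    placement, local_hits = {}, 0
    for task, block in tasks.items():
        # Candidate nodes that hold the input block and still have a free slot.
        candidates = [n for n in replicas.get(block, set()) if free_slots.get(n, 0) > 0]
        if candidates:
            # Data-local placement: break ties towards the least-loaded replica.
            node = max(candidates, key=lambda n: free_slots[n])
            local_hits += 1
        else:
            # No local slot available: schedule remotely on the least-loaded node.
            node = max(free_slots, key=free_slots.get)
        free_slots[node] -= 1
        placement[task] = node
    return placement, local_hits

if __name__ == "__main__":
    tasks = {"t1": "b1", "t2": "b2", "t3": "b1"}
    replicas = {"b1": {"nodeA", "nodeB"}, "b2": {"nodeC"}}
    free_slots = {"nodeA": 1, "nodeB": 1, "nodeC": 1}
    placement, local_hits = schedule(tasks, replicas, free_slots)
    print(placement, f"{local_hits}/{len(tasks)} data-local")
```

In this toy run all three tasks can be served locally; real schedulers surveyed in the paper (e.g., delay scheduling) additionally trade a bounded waiting time against launching a non-local task immediately.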