Analyzing the Factors that Cause Parallel Performance Degradation in Parallel Graph-Based Computations Using Graph500
Authors: Mustafa Elfituri, Jonathan Cook
Abstract:
Graph-based computations have become increasingly important in large-scale scientific computing because they provide a natural way to model many kinds of relations between independent objects. They are actively used in fields as varied as biology, social networks, cybersecurity, and computer networking. At the same time, graph problems exhibit properties such as irregular data access and poor locality that make their performance behavior different from that of regular applications, so parallelizing graph algorithms is a hard and challenging task. Initial evidence shows that standard computer architectures do not perform well on graph algorithms, yet little is known about exactly what causes this. The Graph500 benchmark is a representative parallel graph-based computation: its data accesses are highly irregular, and it is driven more by traversing connected data than by arithmetic computation. In this paper, we present results from analyzing the performance of several implementations of Graph500, including a shared-memory (OpenMP) version, a distributed-memory (MPI) version, and a hybrid version. We measured and analyzed the factors that affect their performance in order to identify changes that would improve it, and we discuss the results in terms of which factors contribute most to performance degradation.
Keywords: Graph computation, Graph500 benchmark, parallel architectures, parallel programming, workload characterization.
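To illustrate the traversal-driven, irregular access pattern that the abstract refers to, the sketch below shows a minimal level-synchronous breadth-first search over a CSR graph parallelized with OpenMP. It is not the Graph500 reference code; the csr_graph layout and all identifiers are illustrative assumptions, but the kernel structure (scan the current frontier, chase adjacency pointers into essentially random memory locations, atomically claim unvisited vertices) is the kind of workload the benchmark exercises.

/*
 * Minimal sketch (illustrative only, not the Graph500 reference code) of a
 * level-synchronous BFS over a CSR graph with OpenMP. The neighbor lookups
 * into adj[] and parent[] touch essentially random cache lines, which is the
 * irregularity and poor locality discussed in the paper.
 */
#include <stdint.h>
#include <stdlib.h>
#include <omp.h>

typedef struct {
    int64_t nv;          /* number of vertices                     */
    int64_t *row_start;  /* nv+1 offsets into the adjacency array  */
    int64_t *adj;        /* concatenated adjacency lists           */
} csr_graph;

/* BFS from 'root'; writes the BFS parent of each vertex into parent[]. */
static void bfs_parallel(const csr_graph *g, int64_t root, int64_t *parent)
{
    int64_t *frontier = malloc(g->nv * sizeof(int64_t));
    int64_t *next     = malloc(g->nv * sizeof(int64_t));
    int64_t fsize = 0, nsize = 0;

    for (int64_t v = 0; v < g->nv; v++) parent[v] = -1;
    parent[root] = root;
    frontier[fsize++] = root;

    while (fsize > 0) {
        nsize = 0;
        /* Threads share a slice of the frontier; dynamic scheduling helps
         * with the skewed degree distribution of Graph500-style graphs. */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int64_t i = 0; i < fsize; i++) {
            int64_t u = frontier[i];
            for (int64_t k = g->row_start[u]; k < g->row_start[u + 1]; k++) {
                int64_t v = g->adj[k];
                /* Optimistic check, then atomically claim unvisited vertex. */
                if (parent[v] == -1 &&
                    __sync_bool_compare_and_swap(&parent[v], -1, u)) {
                    int64_t pos;
                    #pragma omp atomic capture
                    pos = nsize++;
                    next[pos] = v;
                }
            }
        }
        /* Swap frontiers and continue with the next BFS level. */
        int64_t *tmp = frontier; frontier = next; next = tmp;
        fsize = nsize;
    }
    free(frontier);
    free(next);
}

Even in this small sketch, almost all of the work is pointer chasing and synchronization rather than arithmetic, which is why the factors analyzed in the paper (memory latency, load imbalance, and communication or synchronization overhead) dominate Graph500 performance.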