Parallel Pipelined Conjugate Gradient Algorithm on Heterogeneous Platforms
Authors: Sergey Kopysov, Nikita Nedozhogin, Leonid Tonkov
Abstract:
The article presents a parallel iterative solver for large sparse linear systems intended for heterogeneous platforms. Traditionally, linear-system solvers do not scale well on clusters containing multiple central processing units (multi-CPU clusters) or multiple graphics processing units (multi-GPU clusters); for example, most attempts to parallelize the classical conjugate gradient method at best kept the solution time constant as the problem size grew. The paper proposes a pipelined variant of the conjugate gradient method (PCG), a formulation that is potentially better suited to hybrid CPU/GPU computing because it requires only one synchronization point per iteration instead of the two required by standard CG (Conjugate Gradient). Both the standard and pipelined CG methods need, for the matrix-vector product, vector entries generated on the local GPU as well as on other GPUs, so communication between GPUs becomes a major performance bottleneck on a multi-GPU cluster. The article presents an approach that minimizes communication between the parallel parts of the algorithm; additionally, computation and communication are overlapped to reduce the impact of data exchange. The combination of the pipelined CG method with its single synchronization point, asynchronous computation and communication, and load balancing between CPU and GPU makes solving large linear systems scalable. The algorithm is implemented with a combination of MPI, OpenMP, and CUDA. We show that a near-optimal speedup can be reached on an 8-CPU/2-GPU configuration (relative to execution on one GPU). The parallelized solver achieves a speedup of up to 5.49 times on 16 NVIDIA Tesla GPUs, as compared to one GPU.
Keywords: conjugate gradient, GPU, parallel programming, pipelined algorithm.
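The single-synchronization structure the abstract describes can be illustrated with a minimal sketch of the communication-hiding (pipelined) CG recurrence, here without a preconditioner and in serial NumPy. The function name `pipelined_cg` and all parameter names are illustrative, not taken from the paper; the point is that the two dot products for gamma and delta form one fused global reduction per iteration, and on a cluster the sparse matrix-vector product `q = A @ w` can overlap with that reduction.

```python
import numpy as np

def pipelined_cg(A, b, x0=None, tol=1e-10, maxiter=1000):
    """Sketch of pipelined (communication-hiding) CG: one fused
    reduction per iteration; names and structure are illustrative."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x            # initial residual r_0
    w = A @ r                # auxiliary vector w_i = A r_i
    z = np.zeros(n)          # z_i = A s_i
    s = np.zeros(n)          # s_i = A p_i
    p = np.zeros(n)          # search direction
    alpha = gamma_old = 0.0
    for i in range(maxiter):
        # These two dot products are ONE global reduction; on a
        # multi-GPU cluster the SpMV below overlaps with it.
        gamma = r @ r
        delta = w @ r
        q = A @ w
        if i == 0:
            beta = 0.0
            alpha = gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha)
        # Vector updates: no further synchronization needed.
        z = q + beta * z
        s = w + beta * s
        p = r + beta * p
        x = x + alpha * p
        r = r - alpha * s
        w = w - alpha * z
        gamma_old = gamma
        if np.sqrt(gamma) < tol * np.linalg.norm(b):
            break
    return x
```

The recurrences for `s` and `z` reproduce `A @ p` and `A @ s` without extra matrix-vector products, which is what allows standard CG's two synchronization points to be fused into one.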