Accelerating Sparse Matrix Vector Multiplication on Many-Core GPUs

Authors: Weizhi Xu, Zhiyong Liu, Dongrui Fan, Shuai Jiao, Xiaochun Ye, Fenglong Song, Chenggang Yan

Abstract:

Many-core GPUs provide high computing power and substantial memory bandwidth; however, optimizing irregular applications such as sparse matrix-vector multiplication (SpMV) on GPUs remains a difficult but worthwhile task. In this paper, we propose a novel method to improve the performance of SpMV on GPUs. A new storage format called HYB-R is introduced to exploit the GPU architecture more efficiently. When the HYB-R format is created, the COO portion of the matrix is recursively partitioned into an ELL portion and a COO portion, so that as many non-zeros as possible are stored in ELL format. How to partition the matrix is an important question for the HYB-R kernel, so we also tune the partitioning parameters for higher performance. Experimental results show that our method outperforms the fastest kernel (HYB) in NVIDIA's SpMV library, with speedups of up to 17%.
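
As background for the HYB-R idea, the short CUDA sketch below shows an ELL-format SpMV kernel with one thread per row and column-major padded storage, the regular building block that HYB-style formats pair with a COO kernel for the rows that overflow the fixed ELL width; HYB-R, as described above, re-applies the ELL/COO split to that leftover COO part recursively. This is a minimal illustrative sketch, not the authors' implementation: the array names, the -1 padding marker, the toy matrix, and the launch configuration are all assumptions made for the example.

// Minimal illustrative sketch of an ELL-format SpMV kernel in CUDA
// (one thread per row, column-major padded storage for coalesced loads).
// Not the authors' code; names, padding convention, and the toy matrix
// are assumptions for this example.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spmv_ell(int num_rows, int max_cols_per_row,
                         const int *col_idx,   // padded column indices (-1 = padding)
                         const float *values,  // padded values, column-major
                         const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    float sum = 0.0f;
    for (int i = 0; i < max_cols_per_row; ++i) {
        int idx = i * num_rows + row;   // column-major: slot i of all rows is contiguous
        int col = col_idx[idx];
        if (col >= 0)                   // skip ELL padding entries
            sum += values[idx] * x[col];
    }
    y[row] = sum;
}

int main()
{
    // Toy 3x3 matrix with at most 2 non-zeros per row, stored in ELL layout:
    //   [10  0  2]
    //   [ 0  3  0]
    //   [ 1  0  4]
    const int n = 3, width = 2;
    int   h_col[n * width] = { 0, 1, 0,   2, -1, 2 };          // slot 0 of rows 0..2, then slot 1
    float h_val[n * width] = { 10.f, 3.f, 1.f,   2.f, 0.f, 4.f };
    float h_x[n] = { 1.f, 1.f, 1.f }, h_y[n];

    int *d_col; float *d_val, *d_x, *d_y;
    cudaMalloc(&d_col, sizeof(h_col));  cudaMalloc(&d_val, sizeof(h_val));
    cudaMalloc(&d_x, sizeof(h_x));      cudaMalloc(&d_y, sizeof(h_y));
    cudaMemcpy(d_col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    spmv_ell<<<1, 128>>>(n, width, d_col, d_val, d_x, d_y);
    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);

    for (int r = 0; r < n; ++r) printf("y[%d] = %g\n", r, h_y[r]);  // expect 12, 3, 5
    cudaFree(d_col); cudaFree(d_val); cudaFree(d_x); cudaFree(d_y);
    return 0;
}

The column-major ELL layout is what makes this kernel efficient on GPUs: adjacent threads (rows) read adjacent addresses, but every row must be padded to the width of the longest row, which is why HYB-style formats move the excess entries of long rows into a COO part instead.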

Keywords: GPU, HYB-R, Many-core, Performance Tuning, SpMV

Digital Object Identifier (DOI): https://doi.org/10.5281/zenodo.1057071

