Performance Analysis and Optimization for Diagonal Sparse Matrix-Vector Multiplication on Machine Learning Unit
Authors: Qiuyu Dai, Haochong Zhang, Xiangrong Liu
Abstract:
Efficient matrix-vector multiplication with diagonal sparse matrices is pivotal in a multitude of computational domains, ranging from scientific simulations to machine learning workloads. When these matrices are encoded in the conventional Diagonal (DIA) format, extensive zero-padding and irregular memory accesses often introduce computational overhead, hampering throughput and consuming compute and memory resources beyond necessity. The "DIA-Adaptive" approach introduced in this paper confronts these challenges by leveraging the parallel instruction sets of Machine Learning Units (MLUs). This research presents a thorough analysis of the DIA-Adaptive scheme's efficacy in optimizing Sparse Matrix-Vector Multiplication (SpMV) operations. The evaluation spans a variety of hardware architectures, examining the effects of distinct thread allocation strategies and cluster configurations across multiple storage formats. A dedicated computational kernel, intrinsic to the DIA-Adaptive approach, has been developed to match the performance characteristics of MLUs. Empirical results reveal that the DIA-Adaptive methodology not only alleviates the performance bottlenecks associated with the DIA format but also yields pronounced improvements in execution speed and resource utilization. The analysis shows a marked improvement in parallelism and demonstrates the DIA-Adaptive scheme's ability to manage the interplay among storage formats, hardware capabilities, and algorithmic design. These findings suggest that the approach could set a precedent for accelerating SpMV tasks, contributing significantly to the broader domain of high-performance computing and data-intensive applications.
Keywords: Adaptive method, DIA, diagonal sparse matrices, MLU, sparse matrix-vector multiplication.
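To make the DIA-format overheads concrete, the minimal scalar C sketch below implements a baseline (non-adaptive) DIA SpMV. The diagonal-major data layout and offsets array follow the standard DIA format; all identifiers are illustrative assumptions, and this is not the MLU kernel developed in the paper. The explicit zeros stored where a diagonal runs off the matrix are the padding the abstract identifies as wasted storage and work.

```c
#include <stdio.h>

/* Baseline DIA SpMV: y = A*x for an n x n matrix stored as ndiag
 * diagonals. data is diagonal-major and row-aligned: data[d*n + i]
 * holds the element of diagonal d in row i; positions falling outside
 * the matrix are stored as explicit zeros (the padding overhead).
 * offsets[d] is the column shift of diagonal d (0 = main diagonal,
 * negative = below it, positive = above it). */
void dia_spmv(int n, int ndiag, const int *offsets,
              const double *data, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int d = 0; d < ndiag; ++d) {
            int j = i + offsets[d];          /* column touched by diagonal d */
            if (j >= 0 && j < n)             /* skip padded (out-of-range) slots */
                sum += data[(long)d * n + i] * x[j];
        }
        y[i] = sum;
    }
}

int main(void)
{
    /* 4x4 tridiagonal example with offsets -1, 0, +1. The first slot of
     * the sub-diagonal and the last slot of the super-diagonal are
     * explicit zero padding. */
    int offsets[3] = { -1, 0, 1 };
    double data[12] = {
        0, 1, 1, 1,   /* sub-diagonal   (padded at i = 0)     */
        2, 2, 2, 2,   /* main diagonal                        */
        3, 3, 3, 0    /* super-diagonal (padded at i = n - 1) */
    };
    double x[4] = { 1, 1, 1, 1 }, y[4];
    dia_spmv(4, 3, offsets, data, x, y);
    for (int i = 0; i < 4; ++i)
        printf("%g\n", y[i]);                /* prints 5 6 6 3 */
    return 0;
}
```

Every padded slot occupies storage, and each iteration spends a bounds check (or, in branch-free variants, a multiply by zero) on it, so matrices whose diagonals are short relative to n pay a large overhead; this is the inefficiency an adaptive scheme targets.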