Large-scale distributed-memory High-Performance Computing (HPC) systems are indispensable for solving complex computational problems in many domains. This Innovation Study addresses the challenge of improving the scalability of sparse irregular operations on such systems, where bandwidth-related communication costs are a significant factor.
Sparse irregular operations, which are distinguished by irregular data access patterns and sparse communication requirements, are difficult to scale on distributed-memory HPC systems. As system sizes increase, the disparity between computation and communication costs becomes more pronounced, so new approaches are needed to keep communication from limiting scalability. The objective of this study is to enhance the scalability of such operations, with a particular focus on Sparse Matrix-Dense Matrix Multiplication (SpMM), Sparse General Matrix-Matrix Multiplication (SpGEMM), and sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) kernels, which are commonly encountered in iterative applications. Effective parallelisation requires that the input and output data are partitioned conformally, and that the partition balances both the computational and the communication loads.
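To make the link between partitioning and communication load concrete, the following minimal sketch counts how many dense rows each processor sends and receives in a row-parallel SpMM, Y = A X. It assumes a square sparse matrix A and a conformal partition in which row i of A, X, and Y all reside on the same processor. The function name `send_volumes` and the small example matrix are hypothetical and serve only as an illustration; they are not taken from the study.

```python
from collections import defaultdict


def send_volumes(nonzeros, part, num_procs):
    """Per-processor send/receive volume (in dense rows) for row-parallel Y = A X.

    nonzeros: iterable of (i, j) positions of the nonzeros of a square matrix A.
    part[i]:  processor that owns row i of A, X, and Y (conformal partition).
    """
    # For every row j of X, collect the remote processors that need it:
    # a nonzero a_ij forces the owner of row i to obtain row j of X.
    needers = defaultdict(set)
    for i, j in nonzeros:
        if part[i] != part[j]:
            needers[j].add(part[i])

    send = [0] * num_procs
    recv = [0] * num_procs
    for j, procs in needers.items():
        send[part[j]] += len(procs)  # the owner sends one copy per remote needer
        for p in procs:
            recv[p] += 1
    return send, recv


# Hypothetical 6 x 6 example: diagonal plus a few off-diagonal nonzeros.
part = [0, 0, 1, 1, 2, 2]
nonzeros = [(i, i) for i in range(6)]
nonzeros += [(0, 2), (1, 2), (4, 2), (5, 3), (2, 0), (3, 5)]

send, recv = send_volumes(nonzeros, part, num_procs=3)
imbalance = max(send) / (sum(send) / len(send))
print(send, recv, round(imbalance, 2))
```

In this example the computational load is perfectly balanced (two rows per processor), yet one processor sends three of the five rows that are communicated. Multiplying each count by the number of columns of X gives the volume in words.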
To meet this need, CVolBal, the method proposed in this study, combines conformal input/output partitioning with balancing of both computational and communication loads. It is a two-phase framework that exploits a characteristic of Tier-0 HPC systems: their networks have low diameters and therefore a low risk of message congestion. The first phase employs existing graph and hypergraph partitioning tools to establish an initial task and data partition. The second phase introduces a novel formulation for communication volume balancing, which optimises communication loads without disturbing the initial partition.
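The formulation used in the second phase is not spelled out here, so the sketch below is only a stand-in that shows what balancing communication volume on top of a fixed task partition can look like. It assumes a simplified, hypothetical setting in which each shared data item (for example a dense row) may be owned by any of the processors that need it, and it uses a plain greedy heuristic; the names `naive_owners` and `greedy_owners` are illustrative, and neither is the study's algorithm.

```python
def naive_owners(needers, num_procs):
    """Baseline: the lowest-ranked processor that needs an item owns it."""
    send = [0] * num_procs
    for procs in needers.values():
        send[min(procs)] += len(procs) - 1
    return send


def greedy_owners(needers, num_procs):
    """Greedy stand-in: give each item to its currently least-loaded needer.

    needers: dict mapping item -> set of processors that need the item.
    The owner sends the item to every other needer, so the total volume is
    the same for any choice of owners; only its distribution changes.
    """
    send = [0] * num_procs
    owner = {}
    # Place the most widely shared items first so cheaper ones can even out loads.
    for item in sorted(needers, key=lambda r: (-len(needers[r]), r)):
        best = min(needers[item], key=lambda p: (send[p], p))
        owner[item] = best
        send[best] += len(needers[item]) - 1
    return owner, send


# Hypothetical sharing pattern produced by some fixed task partition.
needers = {0: {0, 1, 2}, 1: {0, 1}, 2: {0, 2}, 3: {0, 1, 2}, 4: {1, 2}}

baseline = naive_owners(needers, num_procs=3)
owner, balanced = greedy_owners(needers, num_procs=3)
print(baseline, balanced)
```

Both assignments move the same total volume; the greedy choice merely spreads it more evenly, which reflects the distinction between minimising communication volume and balancing it.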
The methodology is demonstrated through experiments on a Graph Neural Network (GNN) model implemented in C with the Message Passing Interface (MPI). The results indicate a notable improvement in scalability, particularly in terms of communication load balance. Because communication loads are optimised while computational load balance is maintained, system resources are used more efficiently.
Optimising communication loads while maintaining load balance is essential when scaling sparse irregular operations on distributed-memory HPC systems. CVolBal addresses this scalability challenge, and the experiments on GNN training indicate its potential to improve scalability in various applications.