Graphs are an effective approach for data representation and organization,and graph analysis is a promising killer application for AI systems.However,recently emerging extremely large graphs(consisting of trillions of...Graphs are an effective approach for data representation and organization,and graph analysis is a promising killer application for AI systems.However,recently emerging extremely large graphs(consisting of trillions of vertices and edges)exceed the capacity of any small-/medium-scale clusters and thus necessitate the adoption of supercomputers for efficient graph processing.Graph500 is the de facto standard for benchmarking supercomputers’graph processing performance,and connected component(CC)is an important basic algorithm for Graph500’s BFS and SSSP tests.However,current CC algorithms are inefficient on supercomputers and fast CC is expensive and challenging.In this paper,we propose VPC,an efficient method that prunes connected components using vector-based path compression.It includes the following innovations:(i)The data structure of the traversal algorithm is customized with the two-dimensional adjacency vector.(ii)The vector-based path compression is proposed for the union-find algorithm.(iii)Parallel VPC is proposed customized with Tianhe.Experimental results validate that the two-dimensional adjacency vector has better performance than other data structures and the vectorbased path compression is used in the realization of the union-find algorithm.When the scale is 26,the performance of our algorithm is 1.38×,1.69×and 2.57×that of other algorithms.The union-find algorithm is used for connected components,and the performance of the algorithm is 5.14×and 5.01×that of BFS and DFS respectively.展开更多
大数据时代,Graph500是评测超级计算机处理数据密集型应用能力的重要工具,E级验证系统的图遍历处理能力主要受限于内存空间和访存带宽,尤其是内存空间利用率直接决定了图的测试规模和测试性能.针对天河E级验证系统小内存特征,提出了基...大数据时代,Graph500是评测超级计算机处理数据密集型应用能力的重要工具,E级验证系统的图遍历处理能力主要受限于内存空间和访存带宽,尤其是内存空间利用率直接决定了图的测试规模和测试性能.针对天河E级验证系统小内存特征,提出了基于双向位图的大规模图数据压缩存储方法(bidirectional-bitmap based CSR,Bi-CSR),Bi-CSR在CSR矩阵压缩的基础上引入行方向位图和列方向位图协同完成稀疏矩阵压缩存储,行方向位图主要负责行方向位图的压缩存储与索引,列方向位图除了进一步压缩图存储空间,还负责为顶点遍历向量并行优化提供加速空间.Bi-CSR大幅度减少了稀疏矩阵存储空间.面向天河E级验证系统,当图输入规模为237时,Graph500的图存储空间节约效率接近70%,全系统稳定测试性能为2.131E+12TEPS,性能最大加速比超过100倍.展开更多
Graph is one of the most important data structures in modern big data applications and is widely used in various fields.Among many graph algorithms,the Breadth-First Search(BFS)algorithm is a classic algorithm to solv...Graph is one of the most important data structures in modern big data applications and is widely used in various fields.Among many graph algorithms,the Breadth-First Search(BFS)algorithm is a classic algorithm to solve the graph traversal problem and also the key kernel of Graph500 benchmark.On modern CPU architecture,the implementation of graph traversal on single-node systems has achieved significant improvement.However,due to the low resource utilization and high communications overhead,graph traversal on distributed clusters suffers from poor performance and energy inefficiency.Highthroughput cluster(HTCs)adopt High-Throughput many-core architecture,which has the characteristics of high concurrency,strong real-time,and low-power consumption.In this work,we propose several techniques,including asynchronous virtual ring method,thread caching scheme and vertex ID reordering to solve above problems and improve BFS performance on HTCs.We systematically evaluate optimized BFS algorithm and achieve 249.74 giga-traversed edges per second(GTEPS)on 72 nodes(2880 cores)HTCs.Compared with results on Graph500 list,the optimized algorithm achieves the highest node efficiency under the same cluster scale and the performance shows weakly linear scalability as the number of cluster nodes increases.With regard to efficiency,the average performance on HTCs is 3.47 GTEPS/node,which is the best among CPU-based distributed systems on the November 2019 Graph500 list.展开更多
基金supported by the National Numerical Wind Tunnel Project(NNW2019ZT6-B21)the National Key Research and Development Program of China(2018YFB0204301)+1 种基金the Hunan Natural Science Foundation of China(2020JJ4669)the Foundation of Parallel and Distributed Processing Laboratory(6142110190206).
文摘Graphs are an effective approach for data representation and organization,and graph analysis is a promising killer application for AI systems.However,recently emerging extremely large graphs(consisting of trillions of vertices and edges)exceed the capacity of any small-/medium-scale clusters and thus necessitate the adoption of supercomputers for efficient graph processing.Graph500 is the de facto standard for benchmarking supercomputers’graph processing performance,and connected component(CC)is an important basic algorithm for Graph500’s BFS and SSSP tests.However,current CC algorithms are inefficient on supercomputers and fast CC is expensive and challenging.In this paper,we propose VPC,an efficient method that prunes connected components using vector-based path compression.It includes the following innovations:(i)The data structure of the traversal algorithm is customized with the two-dimensional adjacency vector.(ii)The vector-based path compression is proposed for the union-find algorithm.(iii)Parallel VPC is proposed customized with Tianhe.Experimental results validate that the two-dimensional adjacency vector has better performance than other data structures and the vectorbased path compression is used in the realization of the union-find algorithm.When the scale is 26,the performance of our algorithm is 1.38×,1.69×and 2.57×that of other algorithms.The union-find algorithm is used for connected components,and the performance of the algorithm is 5.14×and 5.01×that of BFS and DFS respectively.
文摘大数据时代,Graph500是评测超级计算机处理数据密集型应用能力的重要工具,E级验证系统的图遍历处理能力主要受限于内存空间和访存带宽,尤其是内存空间利用率直接决定了图的测试规模和测试性能.针对天河E级验证系统小内存特征,提出了基于双向位图的大规模图数据压缩存储方法(bidirectional-bitmap based CSR,Bi-CSR),Bi-CSR在CSR矩阵压缩的基础上引入行方向位图和列方向位图协同完成稀疏矩阵压缩存储,行方向位图主要负责行方向位图的压缩存储与索引,列方向位图除了进一步压缩图存储空间,还负责为顶点遍历向量并行优化提供加速空间.Bi-CSR大幅度减少了稀疏矩阵存储空间.面向天河E级验证系统,当图输入规模为237时,Graph500的图存储空间节约效率接近70%,全系统稳定测试性能为2.131E+12TEPS,性能最大加速比超过100倍.
基金supported by the National Key Research and Development Program(Grant No.2018YFB1003501)the Strategic Priority Research Program of Chinese Academy of Sciences(XDC05000000)+2 种基金the National Natural Science of China(11904370,61872335,61732018,61672499)the Innovation Project of the State Key Laboratory of Computer Architecture(CARCH4509)the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing(2019A07).
文摘Graph is one of the most important data structures in modern big data applications and is widely used in various fields.Among many graph algorithms,the Breadth-First Search(BFS)algorithm is a classic algorithm to solve the graph traversal problem and also the key kernel of Graph500 benchmark.On modern CPU architecture,the implementation of graph traversal on single-node systems has achieved significant improvement.However,due to the low resource utilization and high communications overhead,graph traversal on distributed clusters suffers from poor performance and energy inefficiency.Highthroughput cluster(HTCs)adopt High-Throughput many-core architecture,which has the characteristics of high concurrency,strong real-time,and low-power consumption.In this work,we propose several techniques,including asynchronous virtual ring method,thread caching scheme and vertex ID reordering to solve above problems and improve BFS performance on HTCs.We systematically evaluate optimized BFS algorithm and achieve 249.74 giga-traversed edges per second(GTEPS)on 72 nodes(2880 cores)HTCs.Compared with results on Graph500 list,the optimized algorithm achieves the highest node efficiency under the same cluster scale and the performance shows weakly linear scalability as the number of cluster nodes increases.With regard to efficiency,the average performance on HTCs is 3.47 GTEPS/node,which is the best among CPU-based distributed systems on the November 2019 Graph500 list.