GraphX is a graph computing library built on Spark, where fault tolerance is necessary to guarantee high availability. However, existing fault tolerance methods are mostly pessimistic and target general computing tasks. Considering the characteristics of iterative computation, this paper presents a method that combines optimistic fault tolerance with checkpointing to recover data under different failure conditions. First, for single-node failures, we propose an optimistic fault tolerance mechanism based on a compensation function. It adds no fault tolerance measures in advance and incurs no additional cost when no failure occurs. Second, for multiple-node failures, we propose an automatic checkpoint management strategy based on RDD importance. It jointly considers the lineage length, dependency relationships, and computation time of each RDD in order to choose appropriate RDDs as checkpoints. Finally, we implement our proposals in GraphX of Spark 3.5.1 and evaluate their performance using representative iterative graph algorithms on a high-performance computing cluster. The results verify the correctness of the iteration results produced by the mechanism and show that, when recovering RDD partitions, the mechanism and strategy substantially reduce job execution time.
Funding: Supported by the National Key Research and Development Program of China (Grant No. 2021YFB0301200), the Hunan Natural Science Foundation Project (Grant No. 2023JJ40555), the Hunan Provincial Graduate Student Research and Innovation Project (Grant No. LXBZZ2024035), and the Hunan Provincial Department of Education Scientific Research Project (Grant No. 22B0451).
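The abstract does not spell out the compensation function, so the following is only a minimal sketch of the general idea behind optimistic recovery for a fixpoint algorithm, not the paper's implementation. It assumes PageRank as the iterative algorithm, a tiny hand-made graph with no dangling vertices, and a hypothetical `compensate` helper that refills the ranks of lost vertices so that the rank vector sums to 1 again. No checkpoint is written in advance; after the simulated failure the iteration simply continues from the compensated state.

```python
def pagerank_step(ranks, out_edges, damping=0.85):
    """One synchronous PageRank iteration over an adjacency-list graph."""
    n = len(ranks)
    nxt = {v: (1.0 - damping) / n for v in ranks}
    for src, dsts in out_edges.items():
        share = damping * ranks[src] / len(dsts)
        for dst in dsts:
            nxt[dst] += share
    return nxt


def compensate(ranks, lost):
    """Hypothetical compensation function (an assumption, not the paper's):
    discard the ranks of lost vertices and spread the missing probability
    mass uniformly over them so the vector sums to 1 again."""
    surviving_mass = sum(r for v, r in ranks.items() if v not in lost)
    refill = (1.0 - surviving_mass) / len(lost)
    return {v: (refill if v in lost else r) for v, r in ranks.items()}


def run(out_edges, iterations, fail_at=None, lost=frozenset()):
    """Run PageRank; optionally simulate losing `lost` before iteration `fail_at`."""
    n = len(out_edges)
    ranks = {v: 1.0 / n for v in out_edges}
    for i in range(iterations):
        if i == fail_at:
            # Simulated single-node failure: the partition holding `lost` is gone.
            ranks = compensate(ranks, lost)
        ranks = pagerank_step(ranks, out_edges)
    return ranks


# Illustrative graph: every vertex has at least one outgoing edge.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}

clean = run(graph, 100)
recovered = run(graph, 100, fail_at=5, lost={2, 3})
max_diff = max(abs(clean[v] - recovered[v]) for v in graph)
print(f"largest rank difference after recovery: {max_diff:.2e}")
```

Because PageRank converges to the same fixpoint from any valid starting distribution, the compensated run ends up with essentially the same ranks as the failure-free run, at the cost of a few extra iterations rather than a restart. In a failure-free run `compensate` is never called, which mirrors the claim that the optimistic mechanism adds no overhead when nothing fails.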