Classification and design of fault-tolerant parallel algorithms (cited by: 1)

Abstract: The design of a fault-tolerant parallel algorithm (FTPA) is a key factor in its fault-tolerance performance. Designing an FTPA means partitioning a program into program sections and turning each one into a fault-tolerant program section by inserting a data saving section, a failure detection section, and a recovery section. First, FTPAs are classified according to their design method, and the characteristics of each class are analyzed. Second, following this classification, two typical parallel algorithms are selected, parallel matrix triangular decomposition and the fast Fourier transform, and an FTPA is designed for each. Finally, the performance of the designed FTPAs is evaluated on a cluster system with 256 nodes. The experimental results show that FTPAs can achieve a very low fault-tolerance overhead.
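The section structure described in the abstract can be illustrated with a minimal, single-process sketch. It is not the authors' implementation: the names `run_fault_tolerant`, `SectionFailure`, and `make_flaky` are hypothetical, failures are simulated by raising an exception, and recovery simply re-executes the failed section from the saved data rather than using any parallel recovery scheme.

```python
import copy
from typing import Callable, Dict, List

State = Dict[str, float]
Section = Callable[[State], State]


class SectionFailure(Exception):
    """Hypothetical signal raised when a failure is detected in a section."""


def run_fault_tolerant(sections: List[Section], state: State,
                       max_retries: int = 3) -> State:
    """Run each program section with data saving, detection, and recovery."""
    for index, section in enumerate(sections):
        saved = copy.deepcopy(state)              # data saving section
        for _attempt in range(max_retries + 1):
            try:
                # The section works on a copy, so partial updates made
                # before a failure never reach the saved data.
                state = section(copy.deepcopy(saved))
                break
            except SectionFailure:                # failure detection section
                continue                          # recovery: redo from saved data
        else:
            raise RuntimeError(f"section {index} failed {max_retries + 1} times")
    return state


def add_one(s: State) -> State:
    s["x"] += 1.0
    return s


def double(s: State) -> State:
    s["x"] *= 2.0
    return s


def make_flaky(fn: Section, failures: int) -> Section:
    """Wrap a section so that it fails a fixed number of times (simulation only)."""
    remaining = [failures]

    def wrapped(s: State) -> State:
        if remaining[0] > 0:
            remaining[0] -= 1
            s["x"] = float("nan")                 # corrupt partial work, then fail
            raise SectionFailure()
        return fn(s)

    return wrapped


if __name__ == "__main__":
    result = run_fault_tolerant([add_one, make_flaky(double, failures=1)], {"x": 1.0})
    print(result)  # {'x': 4.0}
```

In this sketch the cost of fault tolerance in a failure-free run is the copy made at each section boundary, so coarser sections mean fewer saves but more recomputation after a failure.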

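Authors: Du Yunfei, Tang Yuhua

Affiliation: School of Computer Science, National University of Defense Technology

Source: Journal of Huazhong University of Science and Technology (Natural Science Edition), 2011, No. 4, pp. 49-52 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.

Funding: National Natural Science Foundation of China (61003087, 60903059); National Science and Technology Major Project (2009ZX01036-001-003-001)

Keywords: parallel programming; fault tolerance; classification; fault-tolerant parallel algorithm; matrix triangular decomposition; fast Fourier transform

Classification code: TP301 [Automation and Computer Technology: Computer System Architecture]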

References (10)

1. Joubert W, Kothe D, Nam H A. Preparing for exascale: ORNL leadership computing facility application requirements and strategy, Technical Report ORNL/TM-2009/308[R]. Oak Ridge: National Center for Computational Sciences, 2009.
2. Scarpazza D P, Mullaney P, Villa O, et al. Transparent system-level migration of PGAS applications using Xen on InfiniBand[C]//2007 IEEE International Conference on Cluster Computing. Washington: IEEE, 2007: 74-83.
3. Wang Chao, Mueller F, Engelmann C, et al. Hybrid checkpointing for MPI jobs in HPC environments[C]//16th IEEE International Conference on Parallel and Distributed Systems (ICPADS). Shanghai: IEEE, 2010: 524-533.
4. Moody A, Bronevetsky G, Mohror K, et al. Design, modeling, and evaluation of a scalable multi-level checkpointing system[C]//2010 Supercomputing Conference. New Orleans: IEEE/ACM, 2010: 267-277.
5. Bronevetsky G, Schulz M, Szwed P, et al. Application-level checkpointing for shared memory programs[C]//Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). New York: ACM, 2004: 235-247.
6. Yang Xuejun, Du Yunfei, Wang Panfeng, et al. The fault tolerant parallel algorithm: the parallel recomputing based failure recovery[C]//The Sixteenth International Conference on Parallel Architectures and Compilation Techniques (PACT 2007). Brasov, Romania: IEEE, 2007: 199-209.
7. Yang X, Du Y, Wang P, et al. FTPA: supporting fault tolerant parallel computing through parallel recomputing[J]. IEEE Transactions on Parallel and Distributed Systems, 2009, 20(10): 1471-1486.
8. Chen G L. Parallel computing: architecture, algorithm, programming[M]. Revised ed. Beijing: Higher Education Press, 2003.
9. Dongarra J J, Duff I S, Sorensen D C, et al. Solving linear systems on vector and shared memory computers[M]. Philadelphia: SIAM, 1991.
10. Bailey D, Harris T, Saphir W, et al. The NAS parallel benchmarks 2.0, Technical Report NAS-95-020[R]. Ames: NASA Ames Research Center, 1995.

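Citing literature (1)

1. Song Xiaodong, Liu Xuejun, Tang Guoan, Dou Wanfeng, Jiang Ling, Yang Kun. Research on fault-tolerant algorithms for parallel digital terrain analysis[J]. Geography and Geo-Information Science, 2013, 29(2): 1-5. Cited by: 3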
