Classification and design of fault-tolerant parallel algorithms (容错并行算法的分类和设计)

Cited by: 1
Abstract: The design of a fault-tolerant parallel algorithm (FTPA) is the key factor determining its fault-tolerance performance. Designing an FTPA means partitioning a program into program sections and turning each one into a fault-tolerant program section by inserting a data saving section, a failure detection section, and a recovery section. First, FTPAs are classified according to their design methodology, and the characteristics of each class are analyzed. Second, following this classification, two typical parallel algorithms are selected, parallel matrix triangular decomposition and the fast Fourier transform, and a corresponding FTPA is designed for each. Finally, the performance of the designed FTPAs is evaluated on a 256-node cluster. The experimental results show that FTPAs can achieve a low fault-tolerance overhead.
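The section structure described in the abstract can be illustrated with a minimal, single-process sketch. It is not the authors' implementation: the names `run_fault_tolerant`, `SectionFailure`, and `detect` are hypothetical, failures are simulated with an exception, and recovery here simply recomputes the failed section from the saved data, whereas the FTPA work cited in the references recovers through parallel recomputing.

```python
import copy


class SectionFailure(Exception):
    """Hypothetical signal that a program section failed (illustration only)."""


def run_fault_tolerant(sections, state, detect, max_retries=3):
    """Run program sections in order, wrapping each with the three inserted
    parts named in the abstract: data saving, failure detection, recovery.

    sections: list of functions mapping a state to a new state (assumed).
    detect:   function returning True when a section's result looks valid (assumed).
    """
    for section in sections:
        # Data saving section: keep a private copy of the section's input.
        saved = copy.deepcopy(state)
        for attempt in range(max_retries + 1):
            try:
                candidate = section(copy.deepcopy(saved))
                # Failure detection section: reject results that fail the check.
                if not detect(candidate):
                    raise SectionFailure(section.__name__)
                state = candidate
                break
            except SectionFailure:
                if attempt == max_retries:
                    raise
                # Recovery section: drop the failed result and recompute
                # the same section from the saved data.
                continue
    return state
```

Saving data per section rather than for the whole program keeps the amount of recomputation after a failure bounded by one section, which is the property the partitioning is meant to provide in this sketch.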
Source: Journal of Huazhong University of Science and Technology (Natural Science Edition) (《华中科技大学学报(自然科学版)》), indexed in EI, CAS, CSCD, and the Peking University Core list; 2011, No. 4, pp. 49-52 (4 pages)
Funding: National Natural Science Foundation of China (61003087, 60903059); National Science and Technology Major Project (2009ZX01036-001-003-001)
Keywords: parallel programming; fault tolerance; classification; fault-tolerant parallel algorithm; matrix triangular decomposition; fast Fourier transformation

References (10)

  • 1. Wayne Joubert, Douglas Kothe, Hai Ah Nam. Preparing for exascale: ORNL leadership computing facility application requirements and strategy, Technical Report ORNL/TM-2009/308[R]. Oak Ridge: National Center for Computational Sciences, 2009.
  • 2. Scarpazza D P, Mullaney P, Villa O, et al. Transparent system-level migration of PGAS applications using Xen on InfiniBand[C]//2007 IEEE International Conference on Cluster Computing. Washington: IEEE, 2007: 74-83.
  • 3. Wang Chao, Mueller F, Engelmann C, et al. Hybrid checkpointing for MPI jobs in HPC environments[C]//16th IEEE International Conference on Parallel and Distributed Systems (ICPADS). Shanghai: IEEE, 2010: 524-533.
  • 4. Moody D, Greg B, Kathryn M, et al. Design, modeling, and evaluation of a scalable multi-level checkpointing system[C]//2010 Supercomputing Conference. New Orleans: IEEE/ACM, 2010: 267-277.
  • 5. Bronevetsky G, Schulz M, Szwed P, et al. Application-level checkpointing for shared memory programs[C]//Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). New York: ACM, 2004: 235-247.
  • 6. Yang Xuejun, Du Yunfei, Wang Panfeng, et al. The fault tolerant parallel algorithm: the parallel recomputing based failure recovery[C]//The Sixteenth International Conference on Parallel Architectures and Compilation Techniques (PACT 2007). Brasov, Romania: IEEE, 2007: 199-209.
  • 7. Yang X, Du Y, Wang P, et al. FTPA: supporting fault tolerant parallel computing through parallel recomputing[J]. IEEE Transactions on Parallel and Distributed Systems, 2009, 20(10): 1471-1486.
  • 8. Chen G L. Parallel computing-architecture algorithm programming[M]. Revised Edition. Beijing: Higher Education Press, 2003.
  • 9. Dongarra J J, Duff I S, Sorensen D C, et al. Solving linear systems on vector and shared memory computers[M]. Philadelphia: SIAM, 1991.
  • 10. Bailey D, Harris T, Saphir W, et al. The NAS parallel benchmarks 2.0, Technical Report NAS-95-020[R]. Ames: NASA Ames Research Center, 1995.

Co-cited literature (19)

  • 1. RANDELL B. System structure for software fault tolerance[J]. IEEE Transactions on Software Engineering, 1975, 1(2): 221-232.
  • 2. LEVITIN G. Optimal structure of fault-tolerant software systems[J]. Reliability Engineering and System Safety, 2005, 89(3): 286-295.
  • 3. LEVITIN G, XIE M, ZHANG T. Reliability of fault-tolerant systems with parallel task processing[J]. European Journal of Operational Research, 2007, 177(1): 420-430.
  • 4. HANMER R S. Patterns for Fault Tolerant Software[M]. John Wiley & Sons Ltd, 2007.
  • 5. HUANG K H, ABRAHAM J A. Algorithm-based fault tolerance for matrix operations[J]. IEEE Transactions on Computers, 1984, 33(6): 518-528.
  • 6. OBORIL F, TAHOORI M B, HEUVELINE V, et al. Numerical defect correction as an algorithm-based fault tolerance technique for iterative solvers[A]. 17th IEEE Pacific Rim International Symposium on Dependable Computing. Pasadena, CA, USA, 2011. 144-153.
  • 7. BOSILCA G, DELMAS R, DONGARRA J, et al. Algorithm-based fault tolerance applied to high performance computing[J]. Journal of Parallel and Distributed Computing, 2009, 69(4): 410-416.
  • 8. BANERJEE P, ABRAHAM J A. Bounds on algorithm-based fault tolerance in multiple processor systems[J]. IEEE Transactions on Computers, 1986, 35(4): 296-306.
  • 9. CHEN Z, DONGARRA J. Algorithm-based fault tolerance for fail-stop failures[J]. IEEE Transactions on Parallel and Distributed Systems, 2008, 19(12): 1628-1641.
  • 10. MISHRA A, MILI L, PHADKE A G. Algorithm based fault tolerant state estimation of power systems[A]. Proceedings of the 8th International Conference on Probabilistic Methods Applied to Power Systems. Iowa State University, Ames, IA, 2004. 97-103.

Citing literature (1)

Second-level citing literature (3)
