
Static Analysis for the Placement of Application-Level Checkpoints on Heterogeneous System

Original title: 静态分析面向异构系统的应用级Checkpoint设置问题

Cited by: 2
Abstract: Application-level checkpointing is a fault-tolerance technique that has attracted considerable attention in large-scale scientific computing. The programmer chooses appropriate points in the program at which to save critical data, which reduces fault-tolerance overhead. Two key problems in optimizing the technique are choosing suitable checkpoint locations and reducing the volume of data saved in a global checkpoint. The same problems arise for application-level checkpointing on the heterogeneous systems with general-purpose GPUs that have been introduced in recent years. Based on the architecture of heterogeneous systems and the characteristics of their programs, this paper performs a static analysis of checkpoint placement for application-level checkpointing on such systems and proposes two placement methods that use different mechanisms: synchronous checkpoint placement and asynchronous checkpoint placement. For each method, the optimal placement problem is modeled mathematically and solved. Finally, experiments validate and evaluate the performance of the two proposed methods.
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list, 2013, No. 6, pp. 1361-1375 (15 pages)
Funding: National Natural Science Foundation of China (60921062, 61003087)
Keywords: application-level checkpointing; heterogeneous system; general purpose computation on GPU; synchronous checkpoint placement; asynchronous checkpoint placement

