Static Analysis for the Placement of Application-Level Checkpoints on Heterogeneous System

Cited by: 2

Abstract: Application-level checkpointing is a fault-tolerance technique that has attracted wide attention in large-scale scientific computing. The programmer chooses suitable points in the program at which to save critical data, which reduces fault-tolerance overhead. The key issues in optimizing application-level checkpointing are choosing suitable checkpoint locations and reducing the amount of data saved in global checkpoints. Application-level checkpointing on the recently introduced heterogeneous systems with general-purpose GPUs faces the same issues. Based on the architecture of heterogeneous systems and the characteristics of their programs, this paper performs a static analysis of checkpoint placement for application-level checkpointing on heterogeneous systems and proposes two placement methods with different mechanisms: synchronous checkpoint placement and asynchronous checkpoint placement. For each method, the optimal checkpoint placement problem is modeled mathematically and solved. Finally, experiments validate and evaluate the performance of the two proposed methods.

Authors: JIA Jia (贾佳), YANG Xuejun (杨学军), MA Yaqing (马亚青)

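Affiliations: National Laboratory for Parallel and Distributed Processing, School of Computer Science, National University of Defense Technology; Beijing Institute of System Engineering; China North Vehicle Research Institute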

Source: Journal of Software (《软件学报》), 2013, No. 6, pp. 1361-1375 (15 pages). Indexed in EI, CSCD, and the Peking University Core Journals list (北大核心).

Funding: National Natural Science Foundation of China (60921062, 61003087)

Keywords: application-level checkpointing; heterogeneous system; general purpose computation on GPU; synchronous checkpoint placement; asynchronous checkpoint placement

Classification: TP306 [Automation and Computer Technology: Computer System Architecture]
