Abstract
Application-level checkpointing is a fault-tolerance technique that has attracted considerable attention in large-scale scientific computing. The programmer chooses appropriate places in the program at which to save critical data, which reduces fault-tolerance overhead. Choosing suitable checkpoint locations and reducing the amount of data saved in a global checkpoint are the key problems in optimizing application-level checkpointing. Application-level checkpointing on the heterogeneous systems with general-purpose GPUs introduced in recent years faces the same problems. Based on the architecture of heterogeneous systems and the characteristics of their programs, this paper statically analyzes checkpoint placement for application-level checkpointing on heterogeneous systems and proposes two placement methods built on different mechanisms: synchronous checkpoint placement and asynchronous checkpoint placement. The optimal checkpoint placement problem is modeled mathematically and solved for each method. Finally, experiments verify the two proposed methods and evaluate their performance.
Source
《软件学报》
EI
CSCD
PKU Core Journal (北大核心)
2013, Issue 6, pp. 1361-1375 (15 pages)
Journal of Software
Funding
National Natural Science Foundation of China (60921062, 61003087)
Keywords
application-level checkpointing
heterogeneous system
general purpose computation on GPU
synchronous checkpoint placement
asynchronous checkpoint placement