Abstract
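In high-performance geoscientific computing systems, a failed computing task can have serious consequences, so such systems must provide reliability guarantees. Software fault-tolerance models are an effective way to improve the fault tolerance of parallel computation. To address the resource waste of traditional checkpoint/rollback fault-tolerance strategies, this paper takes parallel terrain analysis as its research object and, building on a software fault-tolerance model, proposes a fault-tolerance strategy for neighborhood-type algorithms: N-ABFT (Neighboring-Algorithm Based Fault-Tolerant). For neighborhood-type terrain factors, the strategy adds redundant check rows or check columns to each data block produced by partitioning in the parallel program. Finally, a fault-tolerant scheduling algorithm built on N-ABFT is proposed, which effectively improves the system's fault tolerance and reduces the overhead of error detection.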
In recent years, owing to the growing computational demands of massive spatial data analysis, parallel computing on high-performance computers has become an inevitable trend in Digital Terrain Analysis (DTA). At the same time, the reliability of the parallel system has become a key concern, because the stability of clusters with tens of thousands of processors is constantly threatened by a large number of hardware and software failures. This paper takes parallel DTA technologies as its research object and proposes a Neighboring-Algorithm Based Fault-Tolerant (N-ABFT) strategy to improve the accuracy of failure detection in fault-tolerant software. By means of a check row/column, the N-ABFT algorithm can detect transient and fail-stop failures after all computing nodes have finished the calculation. Finally, two algorithms based on different analytical windows are tested and the preliminary results are discussed.
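The abstract does not spell out how the check rows and columns are encoded, so the following is only a minimal sketch of the general checksum idea, under stated assumptions: each data block gets one extra check column holding its row sums and one extra check row holding its column sums, and a mismatch found after computation points to the damaged row and column. The function names `add_checks` and `find_faults` and the tolerance `tol` are hypothetical and are not taken from the paper.

```python
def add_checks(block):
    """Append a check column (row sums) and a check row (column sums).

    Assumption: plain sums are used as the checksum; the paper's actual
    N-ABFT encoding is not given in the abstract.
    """
    encoded = [list(row) + [sum(row)] for row in block]
    col_sums = [sum(col) for col in zip(*block)]
    encoded.append(col_sums + [sum(col_sums)])
    return encoded


def find_faults(encoded, tol=1e-9):
    """Return (bad_rows, bad_cols) where the stored checksums disagree with the data."""
    data = [row[:-1] for row in encoded[:-1]]
    bad_rows = [i for i, row in enumerate(data)
                if abs(sum(row) - encoded[i][-1]) > tol]
    bad_cols = [j for j, col in enumerate(zip(*data))
                if abs(sum(col) - encoded[-1][j]) > tol]
    return bad_rows, bad_cols


# Hypothetical 2x3 data block standing in for one partition of a DEM.
block = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
encoded = add_checks(block)
print(find_faults(encoded))   # no mismatch on an intact block

encoded[1][2] += 0.5          # simulate a transient error in one cell
print(find_faults(encoded))   # the mismatching row and column locate the cell
```

With a single corrupted cell, the intersection of the mismatching row and column identifies its position, which is why a check row and a check column together carry more information than either alone.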
Source
《地理与地理信息科学》
CSCD
Peking University Core Journal (北大核心)
2013, Issue 2, pp. 1-5 (5 pages)
Geography and Geo-Information Science
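Funding
National 863 Program of China (2011AA120303)
National Natural Science Foundation of China (41171298, 41071244)
Jiangsu Province Graduate Research and Innovation Program for Regular Universities (CXZZ12_0393)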
Keywords
parallel computing (并行计算)
DEM
fault-tolerant software (软件容错)