Abstract
Coordinated checkpointing with rollback recovery is an effective fault-tolerance technique that has been widely used in parallel and distributed computer systems such as clusters. To further reduce the time and space overhead of coordinated checkpointing, this paper proposes a coordinated checkpointing algorithm based on message counting. The algorithm makes no assumption that the underlying message channels are FIFO, and it reduces the complexity of the control messages introduced during the synchronization phase from the usual O(n²) to O(n), improving the system's efficiency and scalability.
Source
Journal of Software (《软件学报》)
Indexed in: EI, CSCD, PKU Core Journals (北大核心)
2003, No. 1, pp. 43-48 (6 pages)
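Funding
National High Technology Research and Development Program of China (863 Program), No. 2001AA111010
Foundation for University Key Teachers, Ministry of Education of China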