Fault Tolerant Scheduling in Distributed Networks

Date

1996-09-25

Authors

Weissman, Jon B.
Womack, David

Publisher

UTSA Department of Computer Science

Abstract

We present a model of application-level fault tolerance for parallel applications. The objective is to achieve high reliability with minimal impact on the application. Our approach fully replicates all components of a parallel application across a distributed wide-area environment, with each replica scheduled independently at a different site. We describe a system architecture for coordinating the replicas. The fault tolerance mechanism is being added to a wide-area scheduler prototype in the Legion parallel processing system. We plan a performance evaluation of the fault-tolerant scheduler and a comparison with checkpoint-recovery, the traditional means of fault tolerance.
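The replication idea in the abstract can be illustrated with a minimal sketch. It is not the Legion scheduler or its API: the names run_replica, run_replicated, SiteFailure, and the failed_sites parameter are hypothetical, sites are simulated by threads, and the "first successful replica wins" rule is an assumed coordination policy that the abstract does not specify.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class SiteFailure(Exception):
    """Raised when a simulated site fails before its replica finishes."""


def run_replica(site, task, failed_sites):
    # Hypothetical stand-in for scheduling and running one replica at one site.
    if site in failed_sites:
        raise SiteFailure(site)
    return site, task()


def run_replicated(task, sites, failed_sites=frozenset()):
    """Run one replica of `task` per site and return the first success.

    Returns a (site, result) pair; raises RuntimeError if every replica fails.
    """
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = [pool.submit(run_replica, s, task, failed_sites) for s in sites]
        for future in as_completed(futures):
            try:
                return future.result()
            except SiteFailure:
                continue  # this replica was lost; wait for another one
    raise RuntimeError("all replicas failed")
```

For example, run_replicated(lambda: 6 * 7, ["A", "B", "C"], failed_sites={"A", "B"}) still produces a result from site C, because the application only fails if every site hosting a replica fails.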

Department

Computer Science