Weissman, Jon B.Womack, David2023-10-192023-10-191996-09-25https://hdl.handle.net/20.500.12588/2110We present a model for application-level fault tolerance for parallel applications. The objective is to achieve high reliability with minimal impact on the application. Our approach is based on a full replication of all parallel application components in a distributed wide-area environment in which each replica is independently scheduled in a different site. A system architecture for coordinating the replicas is described. The fault tolerance mechanism is being added to a wide-area scheduler prototype in the Legion parallel processing system. A performance evaluation of the fault tolerant scheduler and a comparison to the traditional means of fault tolerance, checkpoint-recovery, is planned.en-USFault Tolerant Scheduling in Distributed NetworksTechnical Report