Hierarchical distributed loop self-scheduling schemes on cluster and cloud systems
Loops are the largest source of parallelism in many scientific applications. Parallelizing irregular loop applications so that they achieve scalable performance on cluster and cloud systems is a challenging problem. In distributed systems, load balance, communication, and synchronization overhead must all be taken into account. For cluster systems, previous research proposed an effective Master-Worker model for distributed self-scheduling schemes that apply to parallel loops with independent iterations. However, this model has not been applied to large-scale clusters. Cloud computing infrastructure offers computing resources as a collection of virtual machines backed by different hardware configurations, a difference that is transparent to end users. In practice, the computing power of these virtual machine instances differs, so the system behaves as a heterogeneous environment. Thus, scheduling and load balancing for high-performance computations become challenging issues. We propose a hierarchical distributed approach suitable for scheduling parallel loops. We implemented our algorithms (or schemes) on a large-scale homogeneous cluster and on a heterogeneous cloud environment, and we evaluated various performance aspects of these distributed scheduling algorithms. Modern cloud systems provide high availability, fault tolerance, disaster recovery, and monitoring for the most critical environments. In the event of failures, cloud systems with fault tolerance can continue to operate properly. We also propose fault-tolerant hierarchical distributed algorithms that survive hardware/software faults and reschedule the remaining workload.
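To make the idea of hierarchical self-scheduling concrete, the following is a minimal single-process sketch, not the implementation evaluated in this work. It assumes a two-level hierarchy in which a top-level master hands large chunks to group masters, and each group master splits its chunk among its own workers. The chunk-size rule used here is the well-known guided self-scheduling rule (each request receives the remaining iterations divided by the number of requesters, rounded up), chosen purely for illustration because the abstract does not name a specific rule. The names `gss_chunks` and `hierarchical_schedule` are hypothetical, and the round-robin assignment stands in for "the next idle requester", which a real run would determine dynamically.

```python
import math


def gss_chunks(start, end, num_requesters):
    """Yield half-open (lo, hi) ranges over [start, end).

    Guided self-scheduling rule (illustrative assumption): each request
    receives ceil(remaining / num_requesters) iterations, so chunks shrink
    as the loop nears completion.
    """
    lo = start
    while lo < end:
        size = math.ceil((end - lo) / num_requesters)
        yield lo, lo + size
        lo += size


def hierarchical_schedule(total_iterations, num_groups, workers_per_group):
    """Return a list of (group, worker, lo, hi) assignments.

    Level 1: the top-level master splits the loop among group masters.
    Level 2: each group master splits its chunk among its workers.
    Round-robin order is a simplification of dynamic request order.
    """
    assignments = []
    top_level = gss_chunks(0, total_iterations, num_groups)
    for i, (g_lo, g_hi) in enumerate(top_level):
        group = i % num_groups
        for j, (lo, hi) in enumerate(gss_chunks(g_lo, g_hi, workers_per_group)):
            assignments.append((group, j % workers_per_group, lo, hi))
    return assignments


if __name__ == "__main__":
    for group, worker, lo, hi in hierarchical_schedule(100, 2, 2):
        print(f"group {group} worker {worker}: iterations [{lo}, {hi})")
```

In this sketch only the group masters contact the top-level master, which is one plausible way a hierarchy can reduce contention at a single master on large clusters. Handling heterogeneous virtual machines or failed workers would require weighting chunk sizes and re-queuing unfinished ranges, neither of which is shown here.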