Automatically tuning task scheduling policies on multicore architectures
The emergence of multicore architectures requires software applications to be partitioned into concurrent tasks to achieve high efficiency. However, due to varying issues such as load imbalance, contention, and reuse of data among different tasks, the performance of multi-tasking applications often critically depends on the efficient scheduling of the concurrent tasks, e.g., how to distribute tasks among different processing cores and in what order to evaluate tasks assigned to the same core. In this thesis, I present a framework which automatically exploits alternative task scheduling strategies for multi-threaded applications written using a variety of parallel programming models, systematically experiments with different scheduling strategies by measuring their performance on varying architectures, and then selects the best performing strategies for each application based on the empirical tuning results.
Most parallel programming models, e.g., OpenMP, permit users to explicitly specify desired scheduling policies within their applications among a set of pre-defined options. However, manually selecting a satisfactory option is difficult as the best policies typically vary for each application and are not portable across different machine architectures. For some applications, the policies could also be sensitive to varying input data. It is unrealistic to expect most users to have the complete knowledge of the scheduling options while manually exploring the underlying space of varying scheduling policies. We overcome these difficulties by integrating the knowledge of alternative scheduling policies within our auto-tuning framework, which can then be configured to automatically select portable decisions based on empirical studies of the user application on varying architectures.
Our framework currently support automatically tuning the scheduling policies of three parallel pro-gramming models, OpenMP, TBB, and the Galois system, and have been applied to six C/C++ benchmarks, including both regular (e.g., Blackscholes and Freqmine) and irregular computations (i.e., operations on linked data structures, e.g.,DMR, DT, SP and ConvexHull). Our results show an average of 80% variation in performance and 24% variation in system power consumption rates when these applications are configured with the worst and best scheduling policies. Implemented using POET (Programmable Optimization for Empirical Tuning), an interpreted source-to-source program transformation language designed to support fine-grained parameterization and empirical tuning of compiler optimizations, our framework can be easily extended to support additional parallel programming models (e.g., Cilk) and languages (e.g., Java).