Detecting Memory Model Bugs in Multi-Core and Many-Core Systems
Abstract
With parallel architectures now ubiquitous, the burden of writing correct parallel programs falls on programmers. One major complication is the intricacy of the underlying memory consistency models. It is therefore important to develop techniques and approaches that tackle memory model bugs. This thesis aims to make parallel programming easier by detecting memory model bugs in both multi-core and many-core systems.
Among the various memory models, Sequential Consistency (SC) is the most intuitive. However, most modern multi-core architectures aggressively reorder and overlap memory accesses, which can cause memory model bugs (e.g., SC violation bugs). To understand these bugs better, we conduct the first comprehensive characterization study of SC violation (SCV) bugs that appear in real-world codebases on multi-core systems. Our study uncovers many interesting findings and implications of SCV bugs.
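To make the notion of an SC violation concrete, the following is a minimal Python sketch, not taken from the thesis, that enumerates every outcome of the classic store-buffering litmus test under two toy machine models: SC, and a simplified TSO model in which each thread has a FIFO store buffer. The program, the function name `outcomes`, and the machine model are illustrative assumptions.

```python
# Store-buffering litmus test (illustrative, not from the thesis):
#   Thread 0: x = 1; r0 = y
#   Thread 1: y = 1; r1 = x
PROGRAM = (
    (("store", "x", 1), ("load", "y", "r0")),
    (("store", "y", 1), ("load", "x", "r1")),
)


def outcomes(tso):
    """Return every reachable (r0, r1) pair.

    tso=False: stores update memory immediately (SC).
    tso=True:  stores enter a per-thread FIFO buffer and drain later
               (a simplified TSO model, assumed for illustration).
    """
    results = set()
    # State: (program counters, memory, store buffers, registers), all hashable.
    start = ((0, 0), (("x", 0), ("y", 0)), ((), ()), ())
    stack, seen = [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, mem_items, bufs, reg_items = state
        mem, regs = dict(mem_items), dict(reg_items)
        if all(pcs[t] == len(PROGRAM[t]) for t in (0, 1)):
            results.add((regs["r0"], regs["r1"]))
            continue
        for t in (0, 1):
            # Choice 1: drain the oldest buffered store of thread t.
            if bufs[t]:
                addr, val = bufs[t][0]
                new_mem = dict(mem)
                new_mem[addr] = val
                new_bufs = tuple(bufs[i][1:] if i == t else bufs[i] for i in (0, 1))
                stack.append((pcs, tuple(sorted(new_mem.items())), new_bufs, reg_items))
            # Choice 2: execute the next instruction of thread t.
            if pcs[t] < len(PROGRAM[t]):
                op, addr, arg = PROGRAM[t][pcs[t]]
                new_pcs = tuple(pcs[i] + (1 if i == t else 0) for i in (0, 1))
                new_mem, new_bufs, new_regs = dict(mem), bufs, dict(regs)
                if op == "store":
                    if tso:
                        new_bufs = tuple(
                            bufs[i] + ((addr, arg),) if i == t else bufs[i] for i in (0, 1)
                        )
                    else:
                        new_mem[addr] = arg
                else:
                    # A load sees the thread's own newest buffered store first.
                    own = [v for a, v in bufs[t] if a == addr]
                    new_regs[arg] = own[-1] if own else mem[addr]
                stack.append((
                    new_pcs,
                    tuple(sorted(new_mem.items())),
                    new_bufs,
                    tuple(sorted(new_regs.items())),
                ))
    return results
```

In this toy model, the outcome `r0 == 0 and r1 == 0` is reachable only when store buffering is enabled; that extra outcome is the kind of behavior an SC-minded programmer does not expect.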
Based on our findings, we propose two approaches to detect SCVs in multi-core systems. The first, Dissector, is a hardware-software co-designed SCV detector for a typical TSO machine. The Dissector hardware piggybacks information about pending stores on cache coherence messages and later detects whether any of those pending stores causes an SCV cycle. The post-processing software filters out false positives and extracts detailed debugging information. The second, Orion, is an active testing technique that detects, exposes, and classifies any SCV, no matter how many threads and variables are involved or how complex the required thread interleavings are. Orion works in two phases. In the first phase, it finds potential SCV cycles by focusing on racing accesses. In the second phase, it exposes each SCV cycle by enforcing the exact scheduling order.
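The idea of an SCV cycle can be illustrated with a small sketch. The code below is not Dissector's or Orion's algorithm; it is a generic check, assumed here for illustration, that builds a graph from program-order edges and conflict edges (same address, different threads, at least one write) and reports whether the graph contains a cycle. The event encoding and the function name `has_scv_cycle` are hypothetical.

```python
from collections import defaultdict


def has_scv_cycle(performed):
    """Report whether an observed execution contains an SCV cycle.

    performed: list of (thread, program_index, op, addr) tuples, listed in
    the order the accesses took effect; op is "R" or "W". This encoding is
    an assumption made for this sketch.
    """
    graph = defaultdict(set)

    # Program-order edges: consecutive accesses of the same thread.
    by_thread = defaultdict(list)
    for ev in performed:
        by_thread[ev[0]].append(ev)
    for evs in by_thread.values():
        evs.sort(key=lambda e: e[1])
        for a, b in zip(evs, evs[1:]):
            graph[a].add(b)

    # Conflict edges: same address, different threads, at least one write,
    # directed from the access that took effect first.
    for i, a in enumerate(performed):
        for b in performed[i + 1:]:
            if a[0] != b[0] and a[3] == b[3] and "W" in (a[2], b[2]):
                graph[a].add(b)

    # Depth-first search for a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(node):
        color[node] = GREY
        for nxt in graph.get(node, ()):
            if color[nxt] == GREY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in performed)


# Store-buffering execution in which both loads take effect before both stores.
violating = [(0, 1, "R", "y"), (1, 1, "R", "x"), (0, 0, "W", "x"), (1, 0, "W", "y")]
# The same program with thread 0 running to completion before thread 1.
sequential = [(0, 0, "W", "x"), (0, 1, "R", "y"), (1, 0, "W", "y"), (1, 1, "R", "x")]
```

Under these assumptions, `has_scv_cycle(violating)` is true because program order and conflict order contradict each other, while `has_scv_cycle(sequential)` is false.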
Like multi-core systems, modern many-core architectures such as GPUs aggressively reorder and buffer memory accesses. Updates to shared and global data are not guaranteed to be immediately visible to concurrent threads, which can cause subtle memory model bugs. We propose Whistle Blower to expose these bugs in arbitrary GPU programs. It statically instruments the code to buffer shared and global data for as long as possible without violating the semantics of any instruction. Any program failure that results from such buffering indicates the presence of a subtle memory model bug in the program. Whistle Blower then provides detailed debugging information about the failure. Whistle Blower is the first proposal to expose memory model bugs in GPU programs by enforcing the worst-case scenario of memory buffering.
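The effect of worst-case buffering can be sketched with a toy simulator. This is not Whistle Blower's instrumentation; it is a simplified model, assumed for illustration, in which each thread's writes stay in a private buffer until that thread executes a fence. The schedule format and the function name `run` are hypothetical.

```python
def run(schedule):
    """Execute a fixed schedule under worst-case write buffering.

    schedule: list of (thread, op, addr, value) tuples, where op is
    "write", "read", or "fence" (format assumed for this sketch).
    Returns the values observed by the reads, in schedule order.
    """
    memory = {}    # globally visible state
    buffers = {}   # per-thread writes that are not yet visible to others
    observed = []
    for thread, op, addr, value in schedule:
        buf = buffers.setdefault(thread, {})
        if op == "write":
            buf[addr] = value              # held back as long as possible
        elif op == "fence":
            memory.update(buf)             # a fence forces visibility
            buf.clear()
        elif op == "read":
            # A thread sees its own buffered writes, otherwise global memory.
            observed.append(buf.get(addr, memory.get(addr, 0)))
    return observed


# Thread 0 publishes data without a fence; thread 1 then reads it.
buggy = [(0, "write", "data", 42), (1, "read", "data", None)]
# The same exchange with a fence after the write.
fixed = [(0, "write", "data", 42), (0, "fence", None, None), (1, "read", "data", None)]
```

In this model, `run(buggy)` yields a stale value while `run(fixed)` yields the written one, so a check that expects the written value fails only for the program that omits the fence.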
We evaluated each of these techniques in detail and showed that they are effective in detecting memory model bugs. More importantly, using these techniques we uncovered several new, previously unreported bugs in popular open-source codebases.