Improving Software Reliability via Record and Replay
Although sophisticated techniques are adopted for software testing, most in-production software still contains latent bugs. Fixing these bugs may take weeks or even months, leading to unpredictable financial losses caused by service interruptions or defective functionality. This thesis aims to identify the exact root causes of in-production software failures, so that developers can fix these bugs easily without further confirmation.
First, a lightweight record-and-replay framework, iReplayer, has been designed and implemented. iReplayer aims to identically reproduce the executions of multithreaded programs within the failing process (that is, under the "in-situ" setting). It imposes only 3% recording overhead, which allows it to be always on in the production environment. iReplayer enables a range of possibilities in error detection, vulnerability identification, and security forensics. We have implemented several tools on top of it: detectors for heap buffer overflows and use-after-free bugs, and an interactive debugging tool that is integrated with GDB.
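The general idea behind record and replay can be illustrated with a minimal sketch. It is not iReplayer's mechanism, which operates in-situ on multithreaded native programs; it only shows the principle of logging nondeterministic inputs during the original run and feeding the same values back during re-execution. All names here (`Recorder`, `Replayer`, `workload`, `record_and_replay`) are hypothetical and chosen for illustration.

```python
import random


class Recorder:
    """Hands out nondeterministic values and logs each one (hypothetical helper)."""

    def __init__(self, source):
        self.source = source  # callable that produces a nondeterministic value
        self.log = []         # values in the order the program consumed them

    def next_value(self):
        value = self.source()
        self.log.append(value)
        return value


class Replayer:
    """Hands out previously logged values in the same order (hypothetical helper)."""

    def __init__(self, log):
        self._values = iter(log)

    def next_value(self):
        return next(self._values)


def workload(env):
    """Stand-in for a program whose result depends on nondeterministic input."""
    total = 0
    for _ in range(5):
        total += env.next_value()
    return total


def record_and_replay():
    # Original execution: consume random values and record them.
    recorder = Recorder(lambda: random.randint(0, 99))
    original = workload(recorder)
    # Re-execution: consume the recorded values instead of fresh ones.
    replayed = workload(Replayer(recorder.log))
    return original, replayed, recorder.log
```

Because the replayed run consumes exactly the recorded values, it produces the same result as the original run, which is the property that makes a failure reproducible.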
Second, an in-situ failure diagnosis system, Watcher, is proposed to identify the root causes of software failures. Watcher is built on top of iReplayer and can perform the identification at the users' site. This design not only protects users' sensitive information, but also preserves the execution environment (e.g., dependent libraries). Watcher integrates binary analysis with identical record-and-replay to diagnose the root causes level by level without manual effort. It reports the complete fault propagation chain to developers, so that they can understand the reasons for the failures. The results show that Watcher can accurately identify the root causes of failures in just a few seconds.
Third, Watcher+ is further developed to enhance the efficiency and accuracy of Watcher. For efficiency, because Watcher lacks execution state, it may need multiple replays to determine possible execution paths. Watcher+ instead uses hardware support (e.g., Intel PT) to track the execution path, which bounds the number of re-executions to 2X + 1, where X is the number of levels of root causes. For accuracy, Watcher may misidentify the root cause under a control-flow hijack, since it relies only on binary analysis to enumerate all possibilities. Watcher+ eliminates this issue with hardware-collected traces. Watcher+ additionally overcomes several related challenges, including precisely mapping the trace to the execution.
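As a worked instance of the 2X + 1 bound stated above, the following sketch simply evaluates the formula for a given number of root-cause levels. The function name `max_reexecutions` is hypothetical, and the sketch makes no claim about how Watcher+ schedules its replays; it only restates the arithmetic.

```python
def max_reexecutions(levels):
    """Upper bound on re-executions for `levels` root-cause levels (2X + 1)."""
    if levels < 0:
        raise ValueError("number of root-cause levels must be non-negative")
    return 2 * levels + 1
```

For example, a failure whose fault propagation chain has three levels of root causes would need at most 2 * 3 + 1 = 7 re-executions under this bound.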