Automating the Detection and Correction of Failures in Modern Persistent Memory Systems
Modern software systems are deeply embedded into our daily lives; the failures of these systems can therefore result in massive real-world harm.
Consequently, considerable resources are spent finding and fixing bugs in testing.
Overall, the software industry spends billions of dollars each year on fixing bugs, and ultimately loses trillions of dollars each year due to poor software quality (as a result of bugs that escape testing and wreak havoc once deployed).
One particularly challenging domain of software development is Persistent Memory (PM) programming, an abstraction in which developers write software that accesses and updates long-term storage with direct memory operations.
The PM programming abstraction has become popular in recent years due to new hardware advances in low-latency, byte-addressable storage devices. Unfortunately, writing crash-consistent PM applications is challenging: an untimely program crash can corrupt or lose data if the application does not carefully order its updates to PM, and testing all possible crashes for data consistency is intractable. Furthermore, crash-consistency bugs are difficult to debug and repair manually, taking a developer weeks or months to fix correctly. Without advancements in PM testing and program repair tools, developers will be unable to effectively write correct and efficient applications for modern PM platforms, hampering the adoption of those platforms.
Motivated by these PM software development challenges, this dissertation develops software techniques that automate difficult and time-consuming PM development tasks. We study PM system design, bugs, and bug fixes and observe that we can automatically provide scalable and high-coverage bug detection and correction by approximating the reasoning performed by developers as they develop their applications.
Based on this insight, we first explore automated bug detection and correction for PM application bugs caused by the misuse of platform-specific PM primitives. We develop a testing technique that prioritizes testing program paths that heavily modify PM, as these paths are more likely to misuse PM. We implement this technique in AGAMOTTO, a symbolic-execution tool that thoroughly explores PM applications to uncover platform-specific bugs, which we use to find 84 new bugs while incurring no false positives. We then develop a technique for generating provably correct fixes for PM platform-specific bugs, coupled with heuristic performance optimizations that do not compromise correctness, and implement the technique in a compiler tool, HIPPOCRATES.
Second, this dissertation explores automated bug detection for general crash-consistency bugs in PM applications (i.e., bugs caused by the improper ordering of PM updates). We develop a technique that automatically identifies groups of PM program behaviors that are likely to result in the same crash-consistency bugs and tests only one behavior from each group, thus providing high testing accuracy (by testing all types of behaviors thoroughly) while also increasing efficiency (by eliminating redundant testing of functionally similar behaviors). We implement this technique in SQUINT, a model-checking tool that selectively tests groups of PM program behaviors identified from a dynamic program trace, which we use to find 108 PM crash-consistency bugs.
Together, the works presented in this dissertation provide a holistic automated testing and program repair solution for PM software developers. In sum, these tools have been used to find and fix over two hundred PM bugs in real-world PM systems, demonstrating both the need for such tools and the efficacy of the tools presented in this dissertation.