CheckSync: Transparent Primary-Backup Replication for Go Applications Using Checkpoints
Author(s)
Kaashoek, Nicolaas M.
Advisor
Morris, Robert Tappan
Abstract
Many distributed systems have singular, mission-critical components. Examples include the MapReduce coordinator and lock servers. Because of their importance, these components require high availability and fault tolerance. The most common way to achieve this is through replicated state machines, an approach in which the application is replicated across multiple machines. There could be as few as two in a primary/backup arrangement, or more to reduce the risk of downtime. Each instance starts in the same state and then advances to new states in the same order. This allows for easy failover to one of the replicas if the primary machine fails.
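To make the replicated state machine idea concrete, the following is a minimal sketch in Go. The `Op` and `Replica` types and the in-memory key-value state are hypothetical and chosen only for illustration; they are not taken from the thesis. The sketch shows two replicas that start empty, apply the same operations in the same order, and end in the same state.

```go
package main

import "fmt"

// Op is a hypothetical deterministic operation: set Key to Value.
type Op struct {
	Key   string
	Value int
}

// Replica holds one copy of the (illustrative) application state.
type Replica struct {
	state map[string]int
}

// NewReplica returns a replica in the common initial (empty) state.
func NewReplica() *Replica {
	return &Replica{state: make(map[string]int)}
}

// Apply advances the replica by one operation.
func (r *Replica) Apply(op Op) {
	r.state[op.Key] = op.Value
}

// Equal reports whether two replicas hold identical state.
func Equal(a, b *Replica) bool {
	if len(a.state) != len(b.state) {
		return false
	}
	for k, v := range a.state {
		if w, ok := b.state[k]; !ok || w != v {
			return false
		}
	}
	return true
}

func main() {
	// The ordered operation stream that every replica must see.
	log := []Op{{"x", 1}, {"y", 5}, {"x", 2}}

	primary, backup := NewReplica(), NewReplica()
	for _, op := range log {
		primary.Apply(op)
		backup.Apply(op)
	}
	fmt.Println(Equal(primary, backup)) // prints "true"
}
```

Because the operations here overwrite values, applying them in a different order on one replica could leave it with a different final state, which is why the shared ordering matters.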
The use of replicated state machines, however, requires an application to expose the correct stream of operations to ensure that each machine ends up in the same final state. This abstraction is not well-suited to all applications, as it can’t support multithreading and can add extra complexity for application developers. This thesis proposes CheckSync, a protocol for achieving high availability and fault tolerance via the use of checkpoints. CheckSync is designed with transparency as a primary goal: applications require little to no modification to use it. It achieves this by checkpointing the memory of an application and replicating that state from the primary to a backup. Upon failure, the backup resumes from the checkpoint and continues running.
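The checkpoint-and-resume flow can be sketched in Go as follows. This is only an analogy under stated assumptions: CheckSync captures a process's memory transparently, whereas this sketch serializes an explicit, hypothetical `AppState` struct with the standard `encoding/gob` package. The names `AppState`, `Checkpoint`, and `Restore` are invented for illustration and are not CheckSync's API.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// AppState is a hypothetical stand-in for an application's in-memory state.
type AppState struct {
	Counter int
	Locks   map[string]string
}

// Checkpoint serializes the primary's state into bytes that could be
// shipped to a backup.
func Checkpoint(s *AppState) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// Restore rebuilds the state on the backup from a checkpoint.
func Restore(data []byte) (*AppState, error) {
	var s AppState
	if err := gob.NewDecoder(bytes.NewReader(data)).Decode(&s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	primary := &AppState{Counter: 41, Locks: map[string]string{"a": "client1"}}

	// The primary takes a checkpoint and replicates it to the backup.
	ckpt, err := Checkpoint(primary)
	if err != nil {
		panic(err)
	}

	// Assume the primary fails here; the backup resumes from the checkpoint.
	backup, err := Restore(ckpt)
	if err != nil {
		panic(err)
	}
	backup.Counter++
	fmt.Println(backup.Counter, backup.Locks["a"]) // prints "42 client1"
}
```

Note that this sketch requires the application to name its state explicitly, which is exactly the kind of application-level cooperation the thesis aims to avoid by working at the level of process memory.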
CheckSync’s transparency sets it apart. Unlike the operation stream required for replicated state machines, CheckSync doesn’t place constraints on the design of the application. It can suspend and capture the memory of Go applications without knowledge of the specifics of the application, as well as restore them on the backup. This is accomplished through careful analysis and recreation of the application’s memory space, as well as efficient transmission of the checkpoint files to minimize performance overhead. CheckSync is evaluated with three different applications, and supports all three without any changes to their code.
Date issued
2021-06
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology