DPR Cluster: An Automated Framework for Deploying Resilient Stateful Cloud Microservices

Author(s)
Raicevic, Nikola
Download: Thesis PDF (740.7 KB)
Advisor
Madden, Samuel
Terms of use
In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Recent advances in distributed recovery protocols enable application builders to achieve strong prefix recovery guarantees in distributed systems of cache-stores (pairs of a fast cache backed by persistent storage that answer storage requests) with low overhead. Specifically, Distributed Prefix Recovery (DPR) is a general-purpose protocol that implements the prefix recovery guarantee for an arbitrary cluster of cache-stores with the help of a centralized management node. However, deploying such a cluster is still challenging, as it involves timely detection and restart of failed nodes, incremental roll-out of new cache-store implementations and deployments, and routing requests in a dynamic cluster with failures. Cluster administrators must manually configure DPR with this information and program cache-stores with the necessary capabilities in a fault-tolerant manner. In this thesis, we introduce DPR Cluster, an automated framework for quickly and easily deploying clusters of DPR-enhanced cache-stores. DPR Cluster uses Kubernetes as its cluster manager and features a declarative Python management API for scripting. Cluster administrators merely specify the desired cluster, and Kubernetes automatically deploys and manages the relevant components and restarts them on failure. Clients can dynamically discover a cluster and its components and communicate with them through DPR Cluster’s dynamic, fault-tolerant, DNS-based networking layer. Additionally, DPR Cluster implements a suite of fault-tolerance features beyond cache-store consistency, such as automatic reconnects. Our evaluation shows that DPR Cluster is highly resilient and functional with a simple API, and that it significantly lowers the barrier to entry for DPR deployments.
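
The declarative Python management API mentioned in the abstract is not documented on this page. As a purely illustrative sketch, the snippet below shows one way a declarative cluster specification could be expressed in plain Python; every name in it (CacheStoreSpec, DPRClusterSpec, the field names, and the example images) is a hypothetical stand-in, not the thesis's actual API.

# Purely illustrative sketch: these class and field names are hypothetical
# stand-ins and are NOT taken from the thesis or its actual API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CacheStoreSpec:
    """One cache-store: a fast cache backed by persistent storage."""
    name: str
    cache_image: str     # container image for the cache tier (hypothetical field)
    storage_image: str   # container image for the persistent store (hypothetical field)
    replicas: int = 1

@dataclass
class DPRClusterSpec:
    """Desired state of a DPR cluster of cache-stores.

    A framework like the one described in the abstract would translate a
    declarative spec of this kind into Kubernetes objects (deployments,
    services, DNS entries) and restart failed components automatically.
    """
    name: str
    cache_stores: List[CacheStoreSpec] = field(default_factory=list)
    dpr_manager_replicas: int = 1  # centralized DPR management node(s)

if __name__ == "__main__":
    spec = DPRClusterSpec(
        name="demo-dpr-cluster",
        cache_stores=[
            CacheStoreSpec(
                name="orders",
                cache_image="example/cache:latest",
                storage_image="example/store:latest",
                replicas=3,
            ),
        ],
    )
    print(spec)  # a real framework would submit this spec to the cluster manager

The actual API may differ substantially; the thesis PDF linked above is the authoritative reference.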
Date issued
2022-09
URI
https://hdl.handle.net/1721.1/147510
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
