Investigating system resilience in distributed evolutionary GAN training

Author(s)

Mustafi, Urmi.

Other Contributors

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.

Advisor

Erik Hemberg and Jamal Toutouh.

Terms of use

MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. http://dspace.mit.edu/handle/1721.1/7582

Abstract

Generative Adversarial Networks (GANs) are a useful approach to generating new data, but they commonly suffer from mode collapse and oscillating behavior. Lipizzaner improves the performance of distributed GAN training by combining a spatially distributed coevolutionary algorithm with gradient-based optimizers. In its current state, however, Lipizzaner is of limited use on systems that encounter frequent node failures: a single node failure brings the entire experiment to a halt, and the experiment must be restarted. To increase Lipizzaner's resilience to such failures, we apply a combination of uncoordinated checkpointing, attempted reconnecting, and restarting nodes to form a simple and efficient solution for system resilience. We find that checkpointing and reconnecting are essential and simple solutions to failure recovery in Lipizzaner, while restarting nodes requires a more nuanced approach that shows promising results when used correctly to address node failures.
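
As a rough illustration of two of the techniques named in the abstract, the sketch below shows uncoordinated checkpointing (each node saves its own state without synchronizing with other nodes) and attempted reconnecting (a bounded number of retries). It is a minimal sketch under stated assumptions and not the thesis's implementation or Lipizzaner's API: the function names, the JSON state format, and the retry parameters are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write this node's state to `path` without coordinating with other nodes."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # Replace the old checkpoint in one step so a crash cannot leave a partial file.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as handle:
        return json.load(handle)


def call_with_reconnect(connect, attempts=3, delay=0.0):
    """Call `connect` up to `attempts` times; return its result, or None if all fail."""
    for _ in range(attempts):
        try:
            return connect()
        except ConnectionError:
            time.sleep(delay)  # hypothetical fixed wait between attempts
    return None


def run_node(path, iterations, checkpoint_every=1):
    """Resume from the local checkpoint and run the remaining iterations."""
    state = load_checkpoint(path, {"iteration": 0})
    while state["iteration"] < iterations:
        state["iteration"] += 1  # stand-in for one training step (assumption)
        if state["iteration"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

In this sketch, a node that is restarted after a failure calls `run_node` with the same checkpoint path and continues from its last saved iteration instead of from zero. A caller that receives `None` from `call_with_reconnect` would then have to decide whether to restart the unreachable node, which is the step the abstract describes as needing a more nuanced approach.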

Description

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021

Cataloged from the official PDF of thesis.

Includes bibliographical references (pages 57-58).

Date issued

2021

URI

https://hdl.handle.net/1721.1/130707

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Keywords

Electrical Engineering and Computer Science.

Collections

Graduate Theses