Investigating system resilience in distributed evolutionary GAN training

Mustafi, Urmi.

dc.contributor.advisor	Erik Hemberg and Jamal Toutouh.	en_US
dc.contributor.author	Mustafi, Urmi.	en_US
dc.contributor.other	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.	en_US
dc.date.accessioned	2021-05-24T19:52:31Z
dc.date.available	2021-05-24T19:52:31Z
dc.date.copyright	2021	en_US
dc.date.issued	2021	en_US
dc.identifier.uri	https://hdl.handle.net/1721.1/130707
dc.description	Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021	en_US
dc.description	Cataloged from the official PDF of thesis.	en_US
dc.description	Includes bibliographical references (pages 57-58).	en_US
dc.description.abstract	Generative Adversarial Networks (GANs) provide a useful approach to generating new data, but they suffer from common problems such as mode collapse and oscillating behavior. Lipizzaner improves the performance of distributed GAN training through the use of a spatially distributed coevolutionary algorithm and gradient-based optimizers. However, in its current state, the use of Lipizzaner is limited by its vulnerability on systems that encounter frequent node failures. When faced with a single node failure, Lipizzaner's entire experiment comes to a halt and must be restarted. To increase Lipizzaner's resilience to such failures, we apply a combination of uncoordinated checkpointing, attempted reconnecting, and restarting nodes to form a simple and efficient solution for system resilience in Lipizzaner. We find that checkpointing and reconnecting are essential and simple solutions to failure recovery in Lipizzaner, while restarting nodes requires a more nuanced approach that shows promising results when used correctly to address node failures.	en_US
dc.description.statementofresponsibility	by Urmi Mustafi.	en_US
dc.format.extent	58 pages	en_US
dc.language.iso	eng	en_US
dc.publisher	Massachusetts Institute of Technology	en_US
dc.rights	MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided.	en_US
dc.rights.uri	http://dspace.mit.edu/handle/1721.1/7582	en_US
dc.subject	Electrical Engineering and Computer Science.	en_US
dc.title	Investigating system resilience in distributed evolutionary GAN training	en_US
dc.type	Thesis	en_US
dc.description.degree	M. Eng.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.identifier.oclc	1251801498	en_US
dc.description.collection	M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science	en_US
dspace.imported	2021-05-24T19:52:31Z	en_US
mit.thesis.degree	Master	en_US
mit.thesis.department	EECS	en_US

Files in this item

Name: 1251801498-MIT.pdf
Size: 2.426Mb
Format: PDF

This item appears in the following Collection(s)

Graduate Theses
