| dc.contributor.advisor | Erik Hemberg and Jamal Toutouh. | en_US |
| dc.contributor.author | Mustafi, Urmi. | en_US |
| dc.contributor.other | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. | en_US |
| dc.date.accessioned | 2021-05-24T19:52:31Z | |
| dc.date.available | 2021-05-24T19:52:31Z | |
| dc.date.copyright | 2021 | en_US |
| dc.date.issued | 2021 | en_US |
| dc.identifier.uri | https://hdl.handle.net/1721.1/130707 | |
| dc.description | Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021 | en_US |
| dc.description | Cataloged from the official PDF of thesis. | en_US |
| dc.description | Includes bibliographical references (pages 57-58). | en_US |
| dc.description.abstract | Generative Adversarial Networks (GANs) provide a useful approach to generating new data, but they suffer from a few common problems, namely mode collapse and oscillating behavior. Lipizzaner improves the performance of distributed GAN training by using a spatially distributed coevolutionary algorithm and gradient-based optimizers. However, in its current state, Lipizzaner's usefulness is limited by its vulnerability on systems that encounter frequent node failures. When a single node fails, the entire Lipizzaner experiment comes to a halt and must be restarted. To increase Lipizzaner's resilience to such failures, we apply a combination of uncoordinated checkpointing, attempted reconnecting, and restarting nodes to form a simple and efficient solution for system resilience. We find that checkpointing and reconnecting are essential and simple solutions to failure recovery in Lipizzaner, while restarting nodes requires a more nuanced approach that shows promising results when used correctly to address node failures. | en_US |
| dc.description.statementofresponsibility | by Urmi Mustafi. | en_US |
| dc.format.extent | 58 pages | en_US |
| dc.language.iso | eng | en_US |
| dc.publisher | Massachusetts Institute of Technology | en_US |
| dc.rights | MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. | en_US |
| dc.rights.uri | http://dspace.mit.edu/handle/1721.1/7582 | en_US |
| dc.subject | Electrical Engineering and Computer Science. | en_US |
| dc.title | Investigating system resilience in distributed evolutionary GAN training | en_US |
| dc.type | Thesis | en_US |
| dc.description.degree | M. Eng. | en_US |
| dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | en_US |
| dc.identifier.oclc | 1251801498 | en_US |
| dc.description.collection | M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science | en_US |
| dspace.imported | 2021-05-24T19:52:31Z | en_US |
| mit.thesis.degree | Master | en_US |
| mit.thesis.department | EECS | en_US |