
dc.contributor.advisor: Erik Hemberg and Jamal Toutouh. [en_US]
dc.contributor.author: Mustafi, Urmi. [en_US]
dc.contributor.other: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science. [en_US]
dc.date.accessioned: 2021-05-24T19:52:31Z
dc.date.available: 2021-05-24T19:52:31Z
dc.date.copyright: 2021 [en_US]
dc.date.issued: 2021 [en_US]
dc.identifier.uri: https://hdl.handle.net/1721.1/130707
dc.description: Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2021 [en_US]
dc.description: Cataloged from the official PDF of thesis. [en_US]
dc.description: Includes bibliographical references (pages 57-58). [en_US]
dc.description.abstract: Generative Adversarial Networks (GANs) provide a useful approach to generating new data, but they suffer from a few common problems, notably mode collapse and oscillating behavior. Lipizzaner improves the performance of distributed GAN training by combining a spatially distributed coevolutionary algorithm with gradient-based optimizers. In its current state, however, Lipizzaner is of limited use on systems that encounter frequent node failures: when a single node fails, the entire experiment comes to a halt and must be restarted. To increase Lipizzaner's resilience to such failures, we apply a combination of uncoordinated checkpointing, attempted reconnecting, and restarting nodes to form a simple and efficient solution for system resilience. We find that checkpointing and reconnecting are essential and simple solutions to failure recovery in Lipizzaner, while restarting nodes requires a more nuanced approach that shows promising results when used correctly to address node failures. [en_US]
dc.description.statementofresponsibility: by Urmi Mustafi. [en_US]
dc.format.extent: 58 pages [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Massachusetts Institute of Technology [en_US]
dc.rights: MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, which is available through the URL provided. [en_US]
dc.rights.uri: http://dspace.mit.edu/handle/1721.1/7582 [en_US]
dc.subject: Electrical Engineering and Computer Science. [en_US]
dc.title: Investigating system resilience in distributed evolutionary GAN training [en_US]
dc.type: Thesis [en_US]
dc.description.degree: M. Eng. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science [en_US]
dc.identifier.oclc: 1251801498 [en_US]
dc.description.collection: M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science [en_US]
dspace.imported: 2021-05-24T19:52:31Z [en_US]
mit.thesis.degree: Master [en_US]
mit.thesis.department: EECS [en_US]

