Show simple item record

dc.contributor.advisorAgrawal, Pulkit
dc.contributor.authorLamp, Avery
dc.date.accessioned2022-01-14T14:59:56Z
dc.date.available2022-01-14T14:59:56Z
dc.date.issued2021-06
dc.date.submitted2021-06-17T20:13:33.943Z
dc.identifier.urihttps://hdl.handle.net/1721.1/139258
dc.description.abstractAs AI/ML research progresses, the amount of compute needed to train and evaluate state-of-the-art AI algorithms consistently increases. With increasing needs for compute, researchers spend time designing distributed systems to scalably train and hyper-parameter optimize their latest model rather than focusing on their core research. We aim to build a fault-tolerant distributed system capable of cheaply and flexibly scheduling reproducible research training jobs on heterogeneous hybrid-cloud compute clusters including local machines and provider agnostic cloud machines. Our system focuses on ML researchers with two main goals, minimizing costs (using preemptible/spot-instances) and user friendliness. The system aims to require minimal user setup and configuration, allowing researchers to quickly get started training models. The Monkey System includes a web console and visualization dashboard to track, evaluate, and compare multiple jobs’ progress and results.
dc.publisherMassachusetts Institute of Technology
dc.rightsIn Copyright - Educational Use Permitted
dc.rightsCopyright MIT
dc.rights.urihttp://rightsstatements.org/page/InC-EDU/1.0/
dc.titleMonkey: Platform-Agnostic Hybrid-Cloud Cluster Compute Orchestration Designed for AI/ML
dc.typeThesis
dc.description.degreeM.Eng.
dc.contributor.departmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degreeMaster
thesis.degree.nameMaster of Engineering in Electrical Engineering and Computer Science


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record