dc.contributor.advisor | Agrawal, Pulkit | |
dc.contributor.author | Lamp, Avery | |
dc.date.accessioned | 2022-01-14T14:59:56Z | |
dc.date.available | 2022-01-14T14:59:56Z | |
dc.date.issued | 2021-06 | |
dc.date.submitted | 2021-06-17T20:13:33.943Z | |
dc.identifier.uri | https://hdl.handle.net/1721.1/139258 | |
dc.description.abstract | As AI/ML research progresses, the amount of compute needed to train and evaluate state-of-the-art AI algorithms consistently increases. With increasing needs for compute, researchers spend time designing distributed systems to scalably train and hyper-parameter optimize their latest model rather than focusing on their core research. We aim to build a fault-tolerant distributed system capable of cheaply and flexibly scheduling reproducible research training jobs on heterogeneous hybrid-cloud compute clusters including local machines and provider agnostic cloud machines. Our system focuses on ML researchers with two main goals, minimizing costs (using preemptible/spot-instances) and user friendliness. The system aims to require minimal user setup and configuration, allowing researchers to quickly get started training models. The Monkey System includes a web console and visualization dashboard to track, evaluate, and compare multiple jobs’ progress and results. | |
dc.publisher | Massachusetts Institute of Technology | |
dc.rights | In Copyright - Educational Use Permitted | |
dc.rights | Copyright MIT | |
dc.rights.uri | http://rightsstatements.org/page/InC-EDU/1.0/ | |
dc.title | Monkey: Platform-Agnostic Hybrid-Cloud Cluster Compute Orchestration Designed for AI/ML | |
dc.type | Thesis | |
dc.description.degree | M.Eng. | |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
mit.thesis.degree | Master | |
thesis.degree.name | Master of Engineering in Electrical Engineering and Computer Science | |