Monkey: Platform-Agnostic Hybrid-Cloud Cluster Compute Orchestration Designed for AI/ML
Author(s)
Lamp, Avery
Advisor
Agrawal, Pulkit
Abstract
As AI/ML research progresses, the amount of compute needed to train and evaluate state-of-the-art AI algorithms steadily increases. As compute needs grow, researchers spend time designing distributed systems to scalably train their latest models and tune hyperparameters rather than focusing on their core research. We aim to build a fault-tolerant distributed system capable of cheaply and flexibly scheduling reproducible research training jobs on heterogeneous hybrid-cloud compute clusters, including local machines and provider-agnostic cloud machines. Our system targets ML researchers with two main goals: minimizing costs (by using preemptible/spot instances) and user-friendliness. The system aims to require minimal user setup and configuration, allowing researchers to quickly get started training models. The Monkey System includes a web console and visualization dashboard to track, evaluate, and compare the progress and results of multiple jobs.
Date issued
2021-06
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology