Enabling Efficient ML Inference in SigmaOS with Model-Aware Scheduling
Author(s)
Liu, Katie
Advisor
Szekely, Ariel
Kaashoek, M. Frans
Abstract
Machine learning inference in multi-tenant cloud environments poses significant challenges for minimizing latency and resource contention, especially as models grow in size and complexity. This thesis addresses the cold-start overhead and scheduling inefficiencies of multi-tenant ML serving by integrating the RayServe distributed model-serving framework into σOS, a cloud operating system that unifies container and serverless paradigms. The thesis also proposes two model-aware schedulers within σOS that intelligently route inference requests to reduce the number of cold starts: Model Colocation, which prioritizes placing requests on machines where the required model is already loaded, and Centralized Model Registry, which tracks globally available models to inform scheduling decisions. These policies proactively reduce model load times by reusing cached models. Experimental results on language-translation workloads in an 8-node cluster show that these schedulers achieve a ≈50% reduction in average inference latency and eliminate roughly 4–5 cold starts per workload compared to σOS's default scheduler. Through this model-aware approach to scheduling, our work enables more efficient, scalable, and low-latency ML inference serving in multi-tenant cloud settings.
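The Model Colocation policy described above can be summarized as: route a request to a node that already holds the requested model, and fall back to a least-loaded node (accepting a cold start) only when no such node exists. The following is a minimal, hypothetical Python sketch of that idea; the names (Node, schedule, active_requests) are invented for illustration and are not the σOS or RayServe API.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    loaded_models: set = field(default_factory=set)   # models currently warm in memory
    active_requests: int = 0                          # crude load signal

def schedule(model: str, nodes: list) -> Node:
    """Pick a node for an inference request on `model` (Model Colocation sketch)."""
    # Prefer nodes where the model is already loaded; otherwise accept a cold start.
    warm = [n for n in nodes if model in n.loaded_models]
    candidates = warm if warm else nodes
    chosen = min(candidates, key=lambda n: n.active_requests)
    chosen.loaded_models.add(model)    # the model is now resident on the chosen node
    chosen.active_requests += 1
    return chosen

# Example: a 3-node cluster where only node0 has the "en-fr" model loaded.
cluster = [Node("node0", {"en-fr"}), Node("node1"), Node("node2", {"en-de"})]
print(schedule("en-fr", cluster).name)   # -> node0 (warm start, no model load)

The Centralized Model Registry policy extends this by consulting a cluster-wide view of which models are loaded where, rather than per-node state alone.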
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology