Enabling Efficient ML Inference in SigmaOS with Model-Aware Scheduling
Author(s)
Liu, Katie
Advisor
Szekely, Ariel
Kaashoek, M. Frans
Abstract
Machine learning inference in multi-tenant cloud environments poses significant challenges for minimizing latency and resource contention, especially as models grow in size and complexity. This thesis addresses the cold-start overhead and scheduling inefficiencies of multi-tenant ML serving by integrating the RayServe distributed model-serving framework into σOS, a cloud operating system that unifies container and serverless paradigms. The thesis also proposes two model-aware schedulers within σOS that intelligently route inference requests to reduce the number of cold starts: Model Colocation, which prioritizes placing requests on machines where the required model is already loaded, and Centralized Model Registry, which tracks globally available models to inform scheduling decisions. These policies proactively reduce model load times by reusing cached models. Experimental results on language-translation workloads in an 8-node cluster show that these schedulers achieve a ≈50% reduction in average inference latency and eliminate roughly 4–5 cold starts per workload compared to σOS's default scheduler. Through this model-aware approach to scheduling, our work enables more efficient, scalable, and low-latency ML inference serving in multi-tenant cloud settings.
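The Model Colocation policy described above can be summarized as: route a request to a node that already holds the requested model, and fall back to a least-loaded node (accepting a cold start) only when no such node exists. The following is a minimal, hypothetical Python sketch of that idea; the names (Node, schedule, active_requests) are invented for illustration and are not the σOS or RayServe API.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    loaded_models: set = field(default_factory=set)   # models currently warm in memory
    active_requests: int = 0                          # crude load signal

def schedule(model: str, nodes: list) -> Node:
    """Pick a node for an inference request on `model` (Model Colocation sketch)."""
    # Prefer nodes where the model is already loaded; otherwise accept a cold start.
    warm = [n for n in nodes if model in n.loaded_models]
    candidates = warm if warm else nodes
    chosen = min(candidates, key=lambda n: n.active_requests)
    chosen.loaded_models.add(model)    # the model is now resident on the chosen node
    chosen.active_requests += 1
    return chosen

# Example: a 3-node cluster where only node0 has the "en-fr" model loaded.
cluster = [Node("node0", {"en-fr"}), Node("node1"), Node("node2", {"en-de"})]
print(schedule("en-fr", cluster).name)   # -> node0 (warm start, no model load)

The Centralized Model Registry policy extends this by consulting a cluster-wide view of which models are loaded where, rather than per-node state alone.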
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology