Stable Machine Learning
Author(s)
Paskov, Ivan Spassimirov
Advisor
Bertsimas, Dimitris
Abstract
This thesis explores one of the most fundamental questions in Machine Learning: how should the "learning" component of Machine Learning be done? For essentially the entire history of the field, ever since Mosteller and Tukey proposed the paradigm in 1968, the answer has remained constant: use randomization. That is, randomly split the data into training, validation, and test sets; train the model on the training set; pick parameters based on the validation set; and report performance on the test set. Conceptually and practically simple, this methodology has gained near-unanimous adoption. Despite this popularity, however, the methodology is fraught with issues relating to the instability of the trained models, and the question remains whether we can do better.
In this thesis, we answer that question in the affirmative. By taking a robust, combinatorial optimization approach, we propose a new way of training all machine learning models based on optimization rather than randomization. Rather than requiring that the model perform well on a single, randomly chosen training set, as is typically done, we require that it be robust against every training set of a fixed size. In this way, we extract what is common among all training sets, rather than the idiosyncrasies of any particular one, which are unlikely to generalize to new, as yet unseen datasets.
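To make "robust against every training set of a fixed size" concrete, the following is a minimal sketch under stated assumptions, not the thesis's formulation or algorithm (which this abstract does not give). It assumes robustness is measured by the worst-case mean loss over all size-k subsets of the data, for a fixed candidate model. Under that assumption the worst case needs no enumeration: it is the mean of the k largest per-sample losses. The function names, the toy data, and the one-parameter model are all hypothetical.

```python
from itertools import combinations


def worst_case_subset_loss(losses, k):
    """Largest mean loss over all size-k subsets: the mean of the k largest losses."""
    if not 1 <= k <= len(losses):
        raise ValueError("k must be between 1 and the number of samples")
    top_k = sorted(losses)[-k:]
    return sum(top_k) / k


def brute_force_worst_case(losses, k):
    """Same quantity by enumerating every size-k subset (tiny inputs only)."""
    return max(sum(subset) / k for subset in combinations(losses, k))


def squared_errors(xs, ys, slope):
    """Per-sample squared error of the hypothetical model y = slope * x."""
    return [(y - slope * x) ** 2 for x, y in zip(xs, ys)]


# Toy data, invented for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 6.5]

losses = squared_errors(xs, ys, slope=1.0)
k = 4
fast = worst_case_subset_loss(losses, k)
slow = brute_force_worst_case(losses, k)
average = sum(losses) / len(losses)
print(fast, slow, average)
```

The sorted shortcut and the brute-force enumeration agree, and the worst-case value is never smaller than the plain average loss. Training would then choose model parameters to minimize this worst-case quantity rather than the loss on one random split; how the thesis actually formulates and solves that problem is not stated here.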
We begin by developing the methodology within the context of spatial, cross-sectional methods, and then proceed to extend the framework to time-series methods, where the contiguous structure of time now plays a key role. We next derive efficient algorithms that make the approach extremely scalable. Finally, we demonstrate the efficacy of the methodology across all methods on a large set of datasets, synthetic and real, derived from both academia and industry.
Date issued
2022-02
Department
Massachusetts Institute of Technology. Operations Research Center
Publisher
Massachusetts Institute of Technology