Sparse Learning using Discrete Optimization: Scalable Algorithms and Statistical Insights
Author(s)
Hazimeh, Hussein
Advisor
Mazumder, Rahul
Abstract
Sparsity is a central concept in interpretable machine learning and high-dimensional statistics. While sparse learning problems can be naturally modeled using discrete optimization, computational challenges have historically shifted the focus towards alternatives based on continuous optimization and heuristics. Recently, growing evidence suggests that discrete optimization methods can obtain more interpretable models than popular alternatives. However, scalability issues limit the adoption of discrete methods and hinder our understanding of their statistical properties. This thesis develops scalable discrete optimization methods and presents new statistical insights for a fundamental class of sparse learning problems.
In the first chapter, we consider the L0-regularized linear regression problem, which aims to select a subset of features that best predict the outcome. We propose fast, approximate algorithms based on coordinate descent and local combinatorial optimization, and establish convergence guarantees. Empirically, we identify important high-dimensional settings where L0-based estimators achieve better statistical performance than popular sparse learning methods (e.g., those based on L1 regularization). Our open-source implementation (L0Learn) handles instances with millions of features and runs up to 3x faster than state-of-the-art sparse learning toolkits.
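To make the coordinate descent idea concrete, the following is a minimal sketch of cyclic coordinate descent for the L0-penalized least squares objective 0.5*||y - X b||^2 + lambda*||b||_0. It is an illustration of the general technique, not the L0Learn implementation: it assumes unit-norm columns of X and a fixed penalty, and it omits the local combinatorial search, active sets, and screening used in the actual toolkit.

import numpy as np

def l0_coordinate_descent(X, y, lam, n_iters=100):
    # Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_0,
    # assuming the columns of X have unit l2 norm.
    p = X.shape[1]
    b = np.zeros(p)
    r = y.copy()  # residual y - X b (b starts at zero)
    for _ in range(n_iters):
        for j in range(p):
            # Unconstrained least-squares minimizer for coordinate j, others fixed.
            rho = X[:, j] @ r + b[j]
            # Hard thresholding: keep the coordinate only if it lowers the objective.
            b_new = rho if 0.5 * rho**2 > lam else 0.0
            r += X[:, j] * (b[j] - b_new)  # maintain the residual
            b[j] = b_new
    return b

# Illustrative usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
X /= np.linalg.norm(X, axis=0)  # normalize columns to unit norm
beta = np.zeros(50)
beta[:5] = 2.0
y = X @ beta + 0.1 * rng.standard_normal(200)
print("selected features:", np.nonzero(l0_coordinate_descent(X, y, lam=0.05))[0])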
In the second chapter, we propose an exact, scalable approach for L0-regularized linear regression. In particular, we develop a specialized nonlinear branch-and-bound (BnB) framework that solves a mixed integer programming (MIP) formulation of the problem. In a radical shift from modern MIP solvers, we solve the BnB subproblems using a specialized first-order method that exploits sparsity. Our open-source solver L0BnB can scale to instances with ~10^7 features, over 1000x larger than what modern MIP solvers can handle.
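For reference, one standard big-M MIP formulation of L0-regularized least squares introduces a binary selection variable z_i per feature; this is an illustrative sketch, and the formulation solved by L0BnB may add ridge regularization or strengthened relaxations:

\min_{\beta \in \mathbb{R}^p,\; z \in \{0,1\}^p} \ \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{i=1}^{p} z_i
\quad \text{s.t.} \quad -M z_i \le \beta_i \le M z_i, \quad i = 1, \dots, p,

where z_i = 0 forces \beta_i = 0, so \sum_i z_i counts the selected features and M bounds the coefficient magnitudes. The BnB framework branches on the z_i and handles the node subproblems with the sparsity-exploiting first-order method.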
In the third chapter, we focus on L0-regularized classification. We propose a novel exact algorithm that solves the problem via a sequence of MIP subproblems, each involving a relatively small number of binary variables. The algorithm can scale to instances with 50,000 features. We also develop fast, approximate algorithms that generalize those of the first chapter. We show theoretically and empirically that our proposals can outperform popular sparse classification methods.
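A representative form of the objective, with the logistic loss chosen purely for illustration, is

\min_{\beta \in \mathbb{R}^p} \ \sum_{i=1}^{n} \log\!\big(1 + \exp(-y_i\, x_i^\top \beta)\big) + \lambda \|\beta\|_0,

which can again be modeled with binary selection variables as in the regression case; each MIP subproblem in the sequence then involves only a small number of these binary variables.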
In the last two chapters, we consider structured sparse learning problems, in which group or hierarchy constraints are imposed to enhance interpretability. We develop specialized convex and discrete optimization algorithms for these problems. Our experiments indicate that the proposed algorithms are more scalable and can achieve better statistical performance than existing methods.
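As an illustration of the structures involved (the exact formulations in these chapters may differ), group sparsity replaces the coordinate-wise L0 penalty with a count of nonzero groups,

\min_{\beta} \ \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{g \in \mathcal{G}} \mathbf{1}\big[\beta_g \neq 0\big],

while hierarchy constraints couple the binary selection variables, e.g., requiring z_{\text{interaction}} \le z_{\text{main}} so that an interaction term may enter the model only when its parent main effects do.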
Date issued
2021-09
Department
Massachusetts Institute of Technology. Operations Research Center; Sloan School of Management
Publisher
Massachusetts Institute of Technology