DSpace@MIT


Sparse hierarchical regression with polynomials

Author(s)
Bertsimas, Dimitris; Van Parys, Bart
Download: 10994_2020_5868_ReferencePDF.pdf (369.4 kB)
Publisher Policy

Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.

Terms of use
Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract
Abstract We present a novel method for sparse polynomial regression. We are interested in that degree r polynomial which depends on at most k inputs, counting at most $$\ell$$ℓ monomial terms, and minimizes the sum of the squares of its prediction errors. Such highly structured sparse regression was denoted by Bach (Advances in neural information processing systems, pp 105–112, 2009) as sparse hierarchical regression in the context of kernel learning. Hierarchical sparse specification aligns well with modern big data settings where many inputs are not relevant for prediction purposes and the functional complexity of the regressor needs to be controlled as to avoid overfitting. We propose an efficient two-step approach to this hierarchical sparse regression problem. First, we discard irrelevant inputs using an extremely fast input ranking heuristic. Secondly, we take advantage of modern cutting plane methods for integer optimization to solve the remaining reduced hierarchical $$(k, \ell )$$(k,ℓ)-sparse problem exactly. The ability of our method to identify all k relevant inputs and all $$\ell$$ℓ monomial terms is shown empirically to experience a phase transition. Crucially, the same transition also presents itself in our ability to reject all irrelevant features and monomials as well. In the regime where our method is statistically powerful, its computational complexity is interestingly on par with Lasso based heuristics. Hierarchical sparsity can retain the flexibility of general nonparametric methods such as nearest neighbors or regression trees (CART), without sacrificing much statistical power. The presented work hence fills a void in terms of a lack of powerful disciplined nonlinear sparse regression methods in high-dimensional settings. Our method is shown empirically to scale to regression problems with $$n\approx 10{,}000$$n≈10,000 observations for input dimension $$p\approx 1000$$p≈1000.
Date issued
2020-01-24
URI
https://hdl.handle.net/1721.1/131533
Department
Massachusetts Institute of Technology. Operations Research Center
Publisher
Springer US

Collections
  • MIT Open Access Articles
