Algorithms and Algorithmic Barriers in High-Dimensional Statistics and Random Combinatorial Structures
Author(s)
Kizildag, Eren C.
Advisor
Gamarnik, David
Abstract
We focus on several algorithmic problems arising from the study of random combinatorial structures and neural network models, with a particular emphasis on their computational aspects. Our main contributions are summarized as follows.
1. Our first focus is on two algorithmic problems arising from the study of random combinatorial structures: the random number partitioning problem (NPP) and the symmetric binary perceptron (SBP) model. Both models exhibit a so-called statistical-to-computational gap: a striking discrepancy between the best existential guarantees and the best guarantees known to be achievable by algorithms with bounded computational power (such as polynomial-time algorithms). We investigate the nature of this gap for the NPP and the SBP by studying their solution landscapes through the lens of statistical physics, in particular spin glass theory. We establish that both models exhibit the Overlap Gap Property (OGP), an intricate geometric property known to be a rigorous barrier for large classes of algorithms. We then leverage the OGP to rule out certain important classes of algorithms, including stable algorithms and Markov chain Monte Carlo (MCMC) methods. The former is a rather powerful abstract class that captures several important algorithmic paradigms, including approximate message passing and methods based on low-degree polynomials. Our hardness results for stable algorithms rest on Ramsey theory from extremal combinatorics; to the best of our knowledge, this is the first use of Ramsey theory to establish algorithmic hardness for a model with random parameters. (A toy illustration of the NPP objective appears in the first sketch following this list.)
2. Our second focus is on the Sherrington-Kirkpatrick (SK) spin glass model, a mean-field model of disordered random media. We establish that the algorithmic problem of exactly computing the partition function of the SK model is average-case hard under the assumption that P ≠ #P (an assumption milder than P ≠ NP and widely believed to be true), both in the finite-precision arithmetic model and in the real-valued model of computation. This is the first provable hardness result, based on standard complexity-theoretic assumptions, for a statistical physics model with random parameters. (The second sketch following this list spells out the partition function on a tiny instance.)
3. Our last focus is on neural network (NN) models arising from modern machine learning and high-dimensional statistical inference tasks.
• Our first set of results in this direction establishes self-regularity for two-layer NNs with sigmoid, binary step, or rectified linear unit (ReLU) activation functions and non-negative output weights, in an algorithm-independent manner. That is, under very mild distributional assumptions on the training data, any such network has a bounded output norm provided it attains a small training error on polynomially many data points. Our results help explain why overparameterization does not hurt the generalization ability of such architectures, a phenomenon observed empirically in NNs that defies classical statistical wisdom. (The third sketch following this list fixes the architecture.)
• Our final focus is on the problem of learning two-layer NNs with quadratic activation functions under the assumption that the training data are generated by a so-called teacher network with planted weights. We first investigate the training aspect, establishing that there exists an energy barrier E_0 below which any stationary point of the empirical risk is necessarily a global optimum; that is, there are no spurious stationary points below E_0. Consequently, we show that the gradient descent algorithm, when initialized below E_0, nearly recovers the planted weights in polynomial time. We then investigate the question of proper initialization under the assumption that the planted weights are generated randomly. By leveraging a certain semicircle law from random matrix theory, we show that a deterministic initialization suffices, provided the network is sufficiently overparameterized. Finally, we identify a simple necessary and sufficient geometric condition on the training data under which any minimizer of the empirical risk generalizes well, and we show that randomly generated data satisfy this condition almost surely under very mild distributional assumptions. (The fourth sketch following this list sets up this teacher-student problem.)
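First sketch, for item 1: a minimal illustration of the NPP objective. The problem size, the brute-force search, and the greedy heuristic below are illustrative choices only and are not the algorithms analyzed in the thesis.

    # Random number partitioning problem (NPP): given i.i.d. Gaussian items
    # X_1, ..., X_n, choose signs sigma in {-1, +1}^n minimizing |<sigma, X>|.
    # Exhaustive search typically finds a discrepancy exponentially small in n,
    # while simple polynomial-time heuristics (like the greedy rule below)
    # achieve far larger values -- a toy view of the statistical-to-computational gap.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n = 16
    X = rng.standard_normal(n)

    # Optimal discrepancy via brute force over all 2^n sign patterns.
    best = min(abs(np.dot(sigma, X))
               for sigma in itertools.product([-1.0, 1.0], repeat=n))

    # Greedy heuristic: assign each item (largest magnitude first) to the
    # lighter side; runs in polynomial time but is far from optimal.
    sums = [0.0, 0.0]
    for x in sorted(np.abs(X), reverse=True):
        side = int(sums[0] > sums[1])
        sums[side] += x
    greedy = abs(sums[0] - sums[1])

    print(f"brute-force discrepancy ~ {best:.3e}, greedy discrepancy ~ {greedy:.3e}")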
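Second sketch, for item 2: what "exactly computing the partition function" asks for, via brute force on a tiny instance. The inverse temperature and system size are illustrative; the thesis result concerns the hardness of exact computation, and no efficient exact algorithm is implied here.

    # Sherrington-Kirkpatrick (SK) partition function:
    #   Z_n(beta) = sum over sigma in {-1,+1}^n of
    #               exp( beta/sqrt(n) * sum_{i<j} J_ij sigma_i sigma_j ),
    # with couplings J_ij i.i.d. standard Gaussian. Brute force takes 2^n time;
    # the thesis shows exact computation is average-case hard under P != #P.
    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    n, beta = 12, 1.0
    J = np.triu(rng.standard_normal((n, n)), k=1)   # keep couplings with i < j

    Z = 0.0
    for sigma in itertools.product([-1.0, 1.0], repeat=n):
        s = np.asarray(sigma)
        H = s @ J @ s                               # sum_{i<j} J_ij sigma_i sigma_j
        Z += np.exp(beta / np.sqrt(n) * H)

    print(f"Z_{n}(beta={beta}) ~ {Z:.6e}")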
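Third sketch, for the first bullet of item 3: the two-layer architecture in question, f(x) = sum_j a_j * phi(<w_j, x>) with non-negative output weights. The code only fixes the model class with illustrative dimensions; the norm bound itself is a theoretical statement and is not reproduced here.

    # Two-layer network with non-negative output weights:
    #   f(x) = sum_{j=1}^m a_j * phi(<w_j, x>),  a_j >= 0,
    # where phi is ReLU, sigmoid, or the binary step. The thesis shows such
    # networks are self-regularized: small training error on polynomially many
    # samples already forces a bounded output norm.
    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def two_layer_net(x, W, a, phi=relu):
        """W: (m, d) hidden weights; a: (m,) non-negative output weights."""
        assert np.all(a >= 0.0), "analyzed architecture requires a_j >= 0"
        return a @ phi(W @ x)

    rng = np.random.default_rng(2)
    d, m = 5, 20
    W = rng.standard_normal((m, d))
    a = np.abs(rng.standard_normal(m))              # non-negative output layer
    print(two_layer_net(rng.standard_normal(d), W, a))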
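Fourth sketch, for the last bullet: the quadratic-activation teacher-student setup with plain gradient descent on the empirical risk. The step size, sample size, random initialization, and loss normalization below are illustrative choices; the thesis analyzes when the initialization lands below the energy barrier E_0 and how to initialize deterministically.

    # Teacher-student setup with quadratic activations: a planted teacher
    # W_star generates labels y = ||W_star x||^2, and a student of the same
    # form is trained by gradient descent on the empirical squared loss.
    import numpy as np

    rng = np.random.default_rng(3)
    d, m, N = 5, 3, 500
    W_star = rng.standard_normal((m, d)) / np.sqrt(d)       # planted weights

    X = rng.standard_normal((N, d))
    y = np.sum((X @ W_star.T) ** 2, axis=1)                 # y_i = ||W_star x_i||^2

    W = rng.standard_normal((m, d)) / np.sqrt(d)            # student (random init)
    lr, steps = 1e-3, 3000
    for _ in range(steps):
        Z = X @ W.T                                         # row i is (W x_i)^T
        r = np.sum(Z ** 2, axis=1) - y                      # residuals
        grad = (4.0 / N) * (Z * r[:, None]).T @ X           # grad of mean squared loss
        W -= lr * grad

    risk = np.mean((np.sum((X @ W.T) ** 2, axis=1) - y) ** 2)
    print(f"empirical risk after {steps} GD steps: {risk:.3e}")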
Date issued
2022-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology