Learning and testing junta distributions over hypercubes
Author(s)
Aliakbarpour, Maryam
DownloadFull printable version (5.147Mb)
Other Contributors
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science.
Advisor
Ronitt Rubinfeld.
Terms of use
Metadata
Show full item recordAbstract
Many tasks related to the analysis of high-dimensional datasets can be formalized as problems involving learning or testing properties of distributions over a high-dimensional domain. In this work, we initiate the study of the following general question: when many of the dimensions of the distribution correspond to "irrelevant" features in the associated dataset, can we learn the distribution efficiently? We formalize this question with the notion of junta distribution. The distribution D over {0, 1}n is a k-junta distribution if the probability mass function p of D is a k-junta-- i. e., if there is a set J [subset][n] of at most k coordinates such that for every x [set membership] {0, 1}7, the value of p(x) is completely determined by the value of x on the coordinates in J. We show that it is possible to learn k-junta distributions with a number of samples that depends only logarithmically on the total number n of dimensions. We give two proofs of this result; one using the cover method and one by developing a Fourier-based learning algorithm inspired by the Low-Degree Algorithm of Linial, Mansour, and Nisan (1993). We also consider the problem of testing whether an unknown distribution is a k-junta distribution. We introduce an algorithm for this task with sample complexity Õ(2n/²k⁴) and show that this bound is nearly optimal for constant values of k. As a byproduct of the analysis of the algorithm, we obtain an optimal bound on the number of samples required to test a weighted collection of distribution for uniformity. Finally, we establish the sample complexity for learning and testing other classes of distributions related to junta-distributions. Notably, we show that the task of testing whether a distribution on {0, 1}n contains a coordinate i [set membership] [n] such that xi is drawn independently from the remaining coordinates requires [theta]](2²n/³) samples. This is in contrast to the task of testing whether all of the coordinates are drawn independently from each other, which was recently shown to have sample complexity [theta](2n/²) by Acharya, Daskalakis, and Kamath (2015).
Description
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015. Cataloged from PDF version of thesis. Includes bibliographical references (pages 77-80).
Date issued
2015Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology
Keywords
Electrical Engineering and Computer Science.