New Models And Algorithms For Distribution Testing: Beyond Standard Sampling
Author(s)
Narayanan, Shyam
Advisor
Indyk, Piotr
Abstract
Distribution testing is a crucial area at the interface of statistics and algorithms, where one wishes to learn properties of datasets from a small number of samples. Classic distribution testing problems arise in many applications, including biology, genomics, computer systems, and linguistics. In this thesis, we study distribution testing under two models: the Conditional Sampling Model and the Learning-Based Frequency Model. Whereas the traditional distribution testing framework allows only random samples from the data, these two models permit more powerful queries, as described below. We improve query/sample complexity bounds for classic distribution testing problems in these models.
In the conditional sampling model, each query specifies a subset 𝑆 of the domain, and the output received is a sample drawn from the distribution conditioned on lying in 𝑆. In this model, we first prove that tolerant uniformity testing can be solved using Õ(𝜀⁻²) queries, which is optimal and improves upon the Õ(𝜀⁻²⁰)-query algorithm of Canonne et al. [18]. This bound even holds under a restricted version of the conditional sampling model called the pair conditional sampling model. Next, we prove that tolerant identity testing in the conditional sampling model can be solved in Õ(𝜀⁻⁴) queries, the first known bound for this problem that is independent of the support size of the distribution. We then use our algorithm for tolerant uniformity testing to obtain an Õ(𝜀⁻⁴)-query algorithm for monotonicity testing in the conditional sampling model, improving on the Õ(𝜀⁻²²)-query algorithm of Canonne [14]. Finally, we study (non-tolerant) identity testing under the pair conditional sampling model and provide a tight bound of Θ̃(√(log 𝑁) · 𝜀⁻²) on the query complexity, where 𝑁 is the size of the distribution's domain. This improves upon both the known upper and lower bounds in [18].
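As a concrete illustration of the query model, a COND oracle over a known distribution can be simulated in a few lines. This sketch is ours, not the thesis's: the function name, the representation of the distribution as a probability dictionary, and the convention for zero-mass query sets are all illustrative assumptions; the thesis treats COND as an abstract oracle.

```python
import random

def cond_query(p, S, rng=random):
    """Simulate one COND query: draw a sample from distribution p
    conditioned on lying in the query set S.

    p:   dict mapping domain elements to probabilities (summing to 1).
    S:   iterable of domain elements (the query set).
    rng: source of randomness (the `random` module or a random.Random).
    """
    weights = {x: p.get(x, 0.0) for x in S}
    total = sum(weights.values())
    if total == 0.0:
        # p assigns no mass to S; conventions vary, here we return a
        # uniformly random element of S (an illustrative choice).
        return rng.choice(list(S))
    # Inverse-CDF sampling restricted to S.
    r = rng.random() * total
    acc = 0.0
    for x, w in weights.items():
        acc += w
        if r <= acc:
            return x
    return x  # guard against floating-point edge cases
```

The pair conditional sampling (PCOND) restriction corresponds to only ever calling this oracle with |S| = 2 (or S equal to the whole domain).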
We next consider the problem of estimating the number of distinct elements (also known as support size estimation) in a large data set from a random sample of its elements. This problem has been especially well-studied, with a partial bibliography (available at https://courses.cit.cornell.edu/jab18/bibliography.html) from 2007 containing over 900 references, both theoretical and applied, relating to this problem alone! A line of research spanning the last decade resulted in algorithms that estimate the support up to ±𝜀𝑁 from a sample of size 𝑂(log²(1/𝜀) · 𝑁/log 𝑁) [61], where 𝑁 is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. To overcome this issue, we introduce the Learning-Based Frequency Model, where we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimate of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to log(1/𝜀) · 𝑁^(1−Θ(1/log(1/𝜀))). In addition, we evaluate the proposed algorithms on a collection of data sets, using the neural-network-based estimators from Hsu et al. [35] as predictors. Our experiments demonstrate substantial (up to 3x) improvements in estimation accuracy compared to state-of-the-art algorithms.
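To see why frequency predictions help, consider a toy inverse-frequency estimator. This is our illustration under simplifying assumptions, not the thesis's actual algorithm: if each sampled element's frequency f(x) (its count in the data set) were known exactly, then for a uniform sample x from the data, E[1/f(x)] = D/N, where D is the number of distinct elements, so averaging N/f(x) over samples estimates D. A predictor correct up to a constant factor then yields a constant-factor estimate of D; the thesis's algorithm is substantially more careful and achieves ±𝜀𝑁 error.

```python
import random
from collections import Counter

def estimate_support(data, num_samples, freq_predictor, rng=random):
    """Toy inverse-frequency estimator for the number of distinct elements.

    data:           list of elements (the data set of size N).
    num_samples:    number of uniform samples to draw.
    freq_predictor: function mapping an element to an estimate of its
                    count in `data` (the "learned frequency oracle").
    Since E[1/f(x)] = D/N for a uniform sample x, the average of
    N / f_hat(x) over the samples is an estimate of D.
    """
    n = len(data)
    total = 0.0
    for _ in range(num_samples):
        x = rng.choice(data)
        total += 1.0 / max(freq_predictor(x), 1)  # guard against bad predictions
    return n * total / num_samples
```

For example, with a perfect predictor (exact counts from a `Counter`), the estimate concentrates around the true number of distinct elements.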
This thesis combines two papers:
• Shyam Narayanan. On Tolerant Distribution Testing in the Conditional Sampling Model. In Proceedings of the 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2021.
• Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, and Tal Wagner. Learning-based Support Estimation in Sublinear Time. In Proceedings of the 9th Annual International Conference on Learning Representations (ICLR), 2021 (Spotlight Presentation).
Date issued
2021-06
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology