
dc.contributor.author: Goldfeld, Ziv
dc.contributor.author: Greenewald, Kristjan
dc.contributor.author: Niles-Weed, Jonathan
dc.contributor.author: Polyanskiy, Yury
dc.date.accessioned: 2021-10-27T20:30:15Z
dc.date.available: 2021-10-27T20:30:15Z
dc.date.issued: 2020
dc.identifier.uri: https://hdl.handle.net/1721.1/135991
dc.description.abstract: © 1963-2012 IEEE. This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating P ∗ N_σ, for N_σ ≜ N(0, σ²I_d), by P̂_n ∗ N_σ under different statistical distances, where P̂_n is the empirical measure. We examine the convergence in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and χ²-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance (W_1) converges at the rate e^{O(d)} n^{-1/2}, in remarkable contrast to a (typical) n^{-1/d} rate for unsmoothed W_1 (and d ≥ 3). Similarly, for the KL divergence, squared 2-Wasserstein distance (W_2²), and χ²-divergence, the convergence rate is e^{O(d)} n^{-1}, but only if P achieves finite input-output χ² mutual information across the additive white Gaussian noise (AWGN) channel. If the latter condition is not met, the rate changes to ω(n^{-1}) for the KL divergence and W_2², while the χ²-divergence becomes infinite, a curious dichotomy. As an application, we consider estimating the differential entropy h(S+Z), where S ∼ P and Z ∼ N_σ are independent d-dimensional random variables. The distribution P is unknown and belongs to some nonparametric class, but n independent and identically distributed (i.i.d.) samples from it are available. Despite the regularizing effect of noise, we first show that any good estimator (within an additive gap) for this problem must have a sample complexity that is exponential in d. We then leverage the above empirical approximation results to show that the absolute-error risk of the plug-in estimator converges as e^{O(d)} n^{-1/2}, thus attaining the parametric rate in n. This establishes the plug-in estimator as minimax rate-optimal for the considered problem, with sharp dependence of the convergence rate on both n and d. We provide numerical results comparing the performance of the plug-in estimator to that of general-purpose (unstructured) differential entropy estimators (based on kernel density estimation (KDE) or k nearest neighbors (kNN) techniques) applied to samples of S+Z. These results reveal a significant empirical superiority of the plug-in estimator over state-of-the-art KDE and kNN methods. As a motivating application of the plug-in approach, we estimate information flows in deep neural networks and discuss Tishby's Information Bottleneck and the compression conjecture, among others.
dc.language.iso: en
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.isversionof: 10.1109/TIT.2020.2975480
dc.rights: Creative Commons Attribution-Noncommercial-Share Alike
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source: arXiv
dc.title: Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation
dc.type: Article
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.contributor.department: MIT-IBM Watson AI Lab
dc.relation.journal: IEEE Transactions on Information Theory
dc.eprint.version: Original manuscript
dc.type.uri: http://purl.org/eprint/type/JournalArticle
eprint.status: http://purl.org/eprint/status/NonPeerReviewed
dc.date.updated: 2021-03-09T20:09:08Z
dspace.orderedauthors: Goldfeld, Z; Greenewald, K; Niles-Weed, J; Polyanskiy, Y
dspace.date.submission: 2021-03-09T20:09:09Z
mit.journal.volume: 66
mit.journal.issue: 7
mit.license: OPEN_ACCESS_POLICY
mit.metadata.status: Authority Work and Publication Information Needed
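
Note on the plug-in estimator described in the abstract: it evaluates h(P̂_n ∗ N_σ), the differential entropy of the Gaussian mixture obtained by convolving the empirical measure of the n samples with the noise kernel N(0, σ²I_d). The Python sketch below is an illustrative Monte Carlo implementation of that idea, not the authors' code; the function name plug_in_entropy and the parameter n_mc are made up here, and NumPy/SciPy are assumed.

import numpy as np
from scipy.special import logsumexp

def plug_in_entropy(samples, sigma, n_mc=10_000, rng=None):
    """Monte Carlo estimate of h(P_hat_n * N_sigma) in nats (illustrative sketch).

    samples: (n, d) array of i.i.d. draws from the unknown distribution P.
    sigma:   standard deviation of the isotropic noise N(0, sigma^2 I_d).
    n_mc:    number of Monte Carlo points used to estimate the entropy.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = samples.shape

    # Draw Monte Carlo points from the mixture P_hat_n * N_sigma:
    # pick a sample center uniformly at random, then add Gaussian noise.
    centers = samples[rng.integers(0, n, size=n_mc)]
    x = centers + sigma * rng.standard_normal((n_mc, d))

    # Log-density of the n-component Gaussian mixture at each point:
    # log q(x) = logsumexp_i[-||x - S_i||^2 / (2 sigma^2)] - log n - (d/2) log(2 pi sigma^2)
    sq_dists = ((x[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)  # (n_mc, n)
    log_q = (logsumexp(-sq_dists / (2 * sigma**2), axis=1)
             - np.log(n) - 0.5 * d * np.log(2 * np.pi * sigma**2))

    # h(P_hat_n * N_sigma) ≈ -E[log q(X)] with X drawn from the mixture.
    return -log_q.mean()

As a sanity check under this sketch, if the samples are drawn from a standard Gaussian P and σ = 1, the returned value should approach h(N(0, 2I_d)) = (d/2) log(4πe) as n and n_mc grow.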

