DSpace@MIT
Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation

Author(s)
Goldfeld, Ziv; Greenewald, Kristjan; Niles-Weed, Jonathan; Polyanskiy, Yury
Download: Submitted version (1.267 MB)

Open Access Policy
Terms of use
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (http://creativecommons.org/licenses/by-nc-sa/4.0/)
Abstract
© IEEE. This paper studies the convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating P ∗ N_σ, for N_σ ≜ N(0, σ²I_d), by P̂_n ∗ N_σ under different statistical distances, where P̂_n is the empirical measure. We examine the convergence in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and χ²-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance (W_1) converges at the rate e^{O(d)} n^{-1/2}, in remarkable contrast to the (typical) n^{-1/d} rate for unsmoothed W_1 (for d ≥ 3). Similarly, for the KL divergence, squared 2-Wasserstein distance (W_2²), and χ²-divergence, the convergence rate is e^{O(d)} n^{-1}, but only if P achieves finite input-output χ² mutual information across the additive white Gaussian noise (AWGN) channel. If the latter condition is not met, the rate changes to ω(n^{-1}) for the KL divergence and W_2², while the χ²-divergence becomes infinite, a curious dichotomy. As an application, we consider estimating the differential entropy h(S+Z), where S ∼ P and Z ∼ N_σ are independent d-dimensional random variables. The distribution P is unknown and belongs to some nonparametric class, but n independent and identically distributed (i.i.d.) samples from it are available. Despite the regularizing effect of noise, we first show that any good estimator (within an additive gap) for this problem must have a sample complexity that is exponential in d. We then leverage the above empirical-approximation results to show that the absolute-error risk of the plug-in estimator converges as e^{O(d)} n^{-1/2}, thus attaining the parametric rate in n. This establishes the plug-in estimator as minimax rate-optimal for the considered problem, with sharp dependence of the convergence rate on both n and d.
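As a minimal illustrative sketch (not the authors' code), the plug-in estimator of h(S+Z) can be computed by noting that the smoothed empirical measure P̂_n ∗ N_σ is a Gaussian mixture with components centered at the samples, whose differential entropy can be approximated by Monte Carlo. Function names and sample sizes below are hypothetical choices:

```python
import numpy as np
from scipy.special import logsumexp

def plugin_entropy(samples, sigma, n_mc=10_000, seed=None):
    """Monte Carlo plug-in estimate of h(P_hat_n * N_sigma), in nats.

    samples: (n, d) array of i.i.d. draws from the unknown P.
    The smoothed empirical measure is the Gaussian mixture
    (1/n) * sum_i N(samples[i], sigma^2 I_d); its entropy is
    estimated as -E[log q(X)] with X drawn from the mixture itself.
    """
    rng = np.random.default_rng(seed)
    n, d = samples.shape
    # Draw X ~ P_hat_n * N_sigma: pick a random center, add Gaussian noise.
    idx = rng.integers(n, size=n_mc)
    x = samples[idx] + sigma * rng.standard_normal((n_mc, d))
    # Evaluate the mixture log-density log q(x) stably via logsumexp.
    diff = x[:, None, :] - samples[None, :, :]          # (n_mc, n, d)
    sq = np.sum(diff**2, axis=-1) / (2.0 * sigma**2)    # (n_mc, n)
    log_norm = -0.5 * d * np.log(2.0 * np.pi * sigma**2)
    log_q = logsumexp(-sq, axis=1) - np.log(n) + log_norm
    return -np.mean(log_q)
```

As a sanity check, if P is a point mass at the origin then S+Z ~ N(0, σ²I_d), whose entropy in one dimension with σ = 1 is ½ log(2πe) ≈ 1.419 nats; the Monte Carlo estimate should land near this value.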
We provide numerical results comparing the performance of the plug-in estimator to that of general-purpose (unstructured) differential entropy estimators (based on kernel density estimation (KDE) or k-nearest-neighbors (kNN) techniques) applied to samples of S+Z. These results reveal a significant empirical superiority of the plug-in estimator over state-of-the-art KDE and kNN methods. As a motivating application of the plug-in approach, we estimate information flows in deep neural networks and discuss Tishby's Information Bottleneck and the compression conjecture, among other topics.
Date issued
2020
URI
https://hdl.handle.net/1721.1/135991
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; MIT-IBM Watson AI Lab
Journal
IEEE Transactions on Information Theory
Publisher
Institute of Electrical and Electronics Engineers (IEEE)

Collections
  • MIT Open Access Articles