Continuous LWE is as Hard as LWE (and Applications to Gaussian Mixture Learning)

Aparna Gupte∗ (MIT), Neekon Vafa† (MIT), Vinod Vaikuntanathan‡ (MIT)

January 31, 2022

Abstract

We show a direct and conceptually simple reduction from the classical learning with errors (LWE) problem to its continuous analog called CLWE (Bruna, Regev, Song and Tang, STOC 2021). This allows us to bring to bear the powerful machinery of LWE-based cryptography to the applications of CLWE. As a concrete application, we show a nearly tight hardness result for the problem of distinguishing between a mixture of Gaussians in R^n and the standard multivariate Gaussian, under the (plausible and widely believed) exponential hardness of the classical LWE problem. In particular, we demonstrate a mixture of roughly O(√(log n)) Gaussians in R^n which is indistinguishable from the standard multivariate Gaussian N(0, I_{n×n}) with poly(log n) samples and poly(n) time. This gives us a tight computational gap as the problem can be solved in slightly quasipolynomial time, even with only roughly log n samples.

Our result improves on Bruna, Regev, Song and Tang (STOC 2021) who show the hardness of learning mixtures of more than √n Gaussians under the worst-case quantum hardness of lattice problems. The best known polynomial-time algorithms can learn any mixture of O(1) Gaussians.

Our key technique is an improved reduction from classical LWE to LWE with k-sparse secrets (Goldwasser, Kalai, Peikert and Vaikuntanathan, ITCS 2010; Micciancio, Theory of Computing, 2018) where the multiplicative increase in the noise is only O(√k), independent of the ambient dimension n.

∗Email: agupte@mit.edu
†Email: nvafa@mit.edu
‡Email: vinodv@mit.edu

Contents

1 Introduction
1.1 Our Results
1.2 Other Applications
1.3 Technical Overview
1.4 Open Questions and Future Directions
2 Preliminaries
2.1 Lattices and Discrete Gaussians
2.2 Learning with Errors
3 Reducing LWE to CLWE
4 Hardness of k-sparse LWE
5 Hardness of Density Estimation for Mixtures of Gaussians
6 Low-Sample Algorithm for hCLWE^(g)

1 Introduction

The problem of learning a mixture of Gaussians is of fundamental importance in many fields of science [TTM+85, MP00]. Given a set of g multivariate Gaussians in n dimensions, parameterized by their means µ_i ∈ R^n, covariance matrices Σ_i ∈ R^{n×n}, and non-negative weights w_1, . . . , w_g summing to one, the Gaussian mixture model is defined to be the distribution generated by picking a Gaussian i ∈ [g] with probability w_i and outputting a sample from N(µ_i, Σ_i).

Dasgupta [Das99] initiated the study of this problem in computer science. A strong notion of learning mixtures of Gaussians is that of parameter estimation, that is, to estimate all µ_i, Σ_i and w_i given samples from the distribution. If one assumes the Gaussians in the mixture are well-separated, then the problem is known to be tractable for a constant number of Gaussians [Das99, SK01, VW02, AM05, KSV05, DS07, BV08, MV10, BS15, HP15, RV17, HL18, KSS18, DKS18]. Moitra and Valiant [MV10] and Hardt and Price [HP15] also show that for parameter estimation, there is an information-theoretic lower bound on the sample complexity that is exponential in the number of Gaussian components, namely g.

Consequently, it makes sense to ask for a weaker notion of learning the mixture of Gaussians in this case, where the goal is to output a density estimate for the mixture of Gaussians that is ε-close in statistical distance to the underlying mixture of Gaussians [FSO06].
That is, given some samples from the Gaussian mixture, one can ask if there is an efficient algorithm that outputs some density oracle (e.g. a circuit) that on any input x ∈ R^n, outputs an estimate of the density at x which closely approximates the density of the underlying Gaussian mixture. The sample complexity of the density estimation problem does not suffer from the exponential dependence on g, as was the case for parameter estimation. In fact, Diakonikolas, Kane, and Stewart [DKS17] show a poly(n, g, 1/ε) upper bound on the sample complexity, by giving an exponential time algorithm for the problem of density estimation.

Given this, one could hope for a more efficient, ideally poly(n, g, 1/ε)-time, algorithm for the problem of density estimation. Unfortunately, Diakonikolas, Kane, and Stewart [DKS17] show that even this weaker notion of learning mixtures of Gaussians has a super-polynomial lower bound in the restricted statistical query (SQ) model (see [Kea98, FGR+17] for a formal description of the SQ model). Explicitly, they show that any SQ algorithm giving density estimates requires n^{Ω(g)} queries to an SQ oracle of precision n^{−O(g)}; this is super-polynomial as long as g is super-constant. However, this lower bound does not show anything about arbitrary polynomial time algorithms for density estimation.

Known algorithms for density estimation (e.g. [MV10]) all run in poly(n, 1/ε) time in dimension n when outputting an estimate with statistical distance ε, where the number of Gaussians g in the mixture is some fixed constant. However, the dependence on g is exponential; in [MV10], the dependence is n^{f(g)} for some f, meaning they only show that this runs in polynomial time for constant g.

Recently, Bruna, Regev, Song and Tang [BRST21] show that an algorithm for outputting a density estimate implies an algorithm for (widely believed to be hard) worst-case lattice problems.
That is, they give a reduction from worst-case lattice problems to outputting a density estimate for mixtures of g > √n Gaussians, giving a lower bound for outputting Gaussian mixture density estimates under well-founded cryptographic assumptions.

1.1 Our Results

Instead of focusing on density estimation for mixtures of Gaussians, we focus our attention on showing hardness for the easier problem of distinguishing N := N(0, I_{n×n}) and a certain mixture of g Gaussians which has large statistical distance from it. Let S be some distribution over R^n, let H^(g)(s) be some mixture of g Gaussians over R^n indexed by a vector s ∈ R^n, and let H^(g)(S) denote the resulting distribution over mixtures of Gaussians when choosing and fixing some s ∼ S across all samples. As long as ∆(N, H^(g)(s)) ≥ 1/2 for all s, it turns out that distinguishing N and H^(g)(S) is no harder than density estimation.

From here on out, we will fix H^(g)(s) to be some particular mixture of Gaussians that we consider (indexed by s). (In particular, jumping ahead, we remark that it will be the homogeneous CLWE distribution truncated to g Gaussians with secret direction s.) Our main results consist of a reduction from LWE to CLWE and an improved leakage-resilience theorem for k-sparse LWE, which when put together, give us an exponential improvement to roughly √(log n) Gaussians as compared to the √n Gaussians of Bruna et al. [BRST21], at the expense of a stronger computational assumption.

Theorem 1 (Informal). Assume that the exponential LWE assumption holds. Then, the problem of distinguishing N and H^(g)(S) for roughly g = O(√(log n)) Gaussians with poly(log n) samples requires quasipolynomial time.
The learning with errors (LWE) problem [Reg09] asks to distinguish “LWE samples” (a_i, b_i = ⟨a_i, s⟩ + e_i (mod q)) from uniformly random pairs, where the LWE secret s ∈ Z_q^n is chosen at random and fixed for all samples, a_i ∼ Z_q^n is uniformly random, and e_i ∼ D_{Z,σ} is drawn from a discrete Gaussian distribution with standard deviation σ. The LWE problem has been very well-studied in the cryptography community and lies at the center of efforts by the National Institute of Standards and Technology (NIST) to develop post-quantum cryptosystems. In particular, for q = poly(n) and σ = O(√n), the LWE problem is believed to be exponentially hard; that is, hard for 2^{n^ε}-time algorithms that have 2^{n^ε} LWE samples, for any ε < 1 (see, e.g. [LP11]).

In other words, we show an exponentially tighter result for density estimation for mixtures of Gaussians than [BRST21] (√(log n) vs. √n Gaussians) under a stronger hardness assumption. We remark that translating the stronger hardness assumption into the tighter lower bound requires substantially new techniques which we elaborate on in the rest of the introduction.

One crucial difference in the mixture of Gaussians we consider is that the secret direction distribution S is now discrete, whereas in [BRST21], it was continuous over R^n. This allows us to give simple algorithmic upper bounds, letting us state a tight computational gap for the task of distinguishing these distributions, N and H^(g)(S).

Theorem 2 (Informal). There is an algorithm running in quasipolynomial time in n distinguishing N and H^(g)(S) using roughly O(log n) samples (for the same H^(g)(S) as in Theorem 1).

Combining the above two theorems, assuming exponential LWE, we get that the time complexity of distinguishing N and H^(g)(S) with poly(log n) samples, where g is roughly O(√(log n)), is exactly quasipolynomial in n. (The remaining question is then which exact quasipolynomial time bound it is.)
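To make the shape of the LWE samples described in this section concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not the paper's construction: the helper names (`rounded_gaussian`, `sample_lwe`, `centered_residuals`) are hypothetical, the error is a rounded continuous Gaussian used as a stand-in for the discrete Gaussian D_{Z,σ} (the two distributions are close in spirit but not identical), and the parameters used below are far too small to be secure.

```python
import random

def rounded_gaussian(sigma, rng):
    # Assumption: a rounded continuous Gaussian stands in for D_{Z,sigma}.
    return round(rng.gauss(0.0, sigma))

def sample_lwe(n, m, q, sigma, rng):
    """Return (s, samples), where samples[i] = (a_i, b_i) and
    b_i = <a_i, s> + e_i (mod q) for one fixed uniform secret s."""
    s = [rng.randrange(q) for _ in range(n)]
    samples = []
    for _ in range(m):
        a = [rng.randrange(q) for _ in range(n)]
        e = rounded_gaussian(sigma, rng)
        b = (sum(ai * si for ai, si in zip(a, s)) + e) % q
        samples.append((a, b))
    return s, samples

def centered_residuals(s, samples, q):
    # With the secret in hand, b - <a, s> (mod q) recovers the small error,
    # reported here as a representative in (-q/2, q/2].
    out = []
    for a, b in samples:
        r = (b - sum(ai * si for ai, si in zip(a, s))) % q
        out.append(r - q if r > q // 2 else r)
    return out
```

For example, `sample_lwe(8, 50, 3329, 3.0, random.Random(0))` returns a secret and 50 samples. Without the secret, the second components are meant to look uniform over Z_q; with it, `centered_residuals` exposes errors that are small relative to q.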
1.2 Other Applications

We mention that our hardness result for CLWE can also be applied in showing (further) hardness of learning single periodic neurons, i.e., neural networks with no hidden layers and a periodic activation function ϕ(t) = cos(2πγt) with frequency γ. Song, Zadik, and Bruna [SZB21] give a direct reduction from CLWE to learning single periodic neurons, showing hardness of learning this class of functions assuming the hardness of CLWE. Our reduction from LWE to CLWE shows that this hardness result can be based directly on LWE instead of worst-case lattice assumptions, as done in [BRST21]. Furthermore, our results expand the scope of their reduction in two ways. First, their reduction shows hardness of learning periodic neurons with frequency γ ≥ √n, while ours, based on exponential hardness of LWE, applies to frequencies almost as small as γ = O(√(log n)), which covers a substantially larger class of periodic neurons. Second, our hardness of k-sparse CLWE from (standard) LWE shows that even learning sparse features (instead of features drawn from the unit sphere S^{n−1}) is hard under LWE for appropriate parameter settings.

We also note that our hardness result for k-sparse LWE may be useful in other settings. In particular, sparse binary secrets are attractive in practical contexts (e.g. post-quantum cryptographic objects [NIS]) as well as theoretical ones (e.g. where having a low-norm secret is beneficial, such as in reducing noise blowup in fully-homomorphic encryption).

1.3 Technical Overview

Bruna, Regev, Song and Tang [BRST21] introduced a continuous version of LWE called CLWE, and showed that CLWE is hard assuming worst-case lattice assumptions, in a similar way to how LWE is hard assuming worst-case lattice assumptions.

Definition 1 (CLWE Distribution [BRST21], informally and rescaled). Let γ, β ∈ R, and let S be a distribution over the (n−1)-sphere, S^{n−1} ⊂ R^n.
Let CLWE(m, S, γ, β) be the distribution given by sampling a_1, . . . , a_m ∼ N(0, I_{n×n}), w ∼ S, e_1, . . . , e_m ∼ N(0, β^2) and outputting (a_i, γ · ⟨a_i, w⟩ + e_i (mod 1)) for all i ∈ [m]. We refer to n as the dimension and m as the number of samples.

While hardness from worst-case lattice assumptions phrases the hardness of LWE and CLWE as an analogy, our main conceptual contribution is a direct reduction from LWE to CLWE. At a high level, the goal of this reduction is to reduce samples from Z_q^n to N := N(0, I_{n×n}) (the multivariate Gaussian), secrets from Z_q^n to S^{n−1}, the (n−1)-dimensional sphere embedded in R^n, and errors from discrete to continuous Gaussians (and also mod q to mod 1). One useful tool in going from discrete to continuous for these distributions is adding continuous Gaussian noise in various places. As an example, to make samples from T_q^n := U([0, q)^n) (i.e. the n-wise product of the continuous uniform distribution over [0, q)) instead of Z_q^n, we add a sufficiently wide continuous Gaussian to the samples, and argue that this converts Z_q^n to T_q^n at some small cost to the width of the noise.

Two other types of changes are needed to make the reduction go through. First, we need to fix the norm of the LWE secret, to make it a direction in R^n, and second, we have to convert continuous uniform samples to continuous Gaussian samples. We make the secret direction have fixed norm by using instead a binary secret s ∼ {+1,−1}^n and relying on a work of Micciancio [Mic18] to argue hardness. We describe how we convert uniform samples to Gaussian samples immediately below. See Figure 1 for a full breakdown of the reduction.

To go from uniform to Gaussian samples, Boneh et al.
[BLMR13] give a general reduction from discrete uniform samples to “coset-sampleable” distributions, and as one example, they show how to reduce discrete uniform samples to discrete Gaussian samples, at the cost of a log(q) multiplicative overhead in the dimension, which in some sense is unavoidable information-theoretically. We improve this reduction and circumvent this lower bound in the continuous version by having no overhead in the dimension, i.e. the dimensions of both samples are the same. The key ingredient to this improvement is a simple Gaussian pre-image sampling algorithm, which on input z ∼ U([0, q)), outputs y such that q · y = z (mod q) and y is statistically close to a continuous Gaussian (when marginalized over z ∼ U([0, q))). (See Lemma 12 for a more precise statement.)

Bruna et al. [BRST21] show that a homogeneous version of CLWE, called hCLWE, which we denote here as H^(g)(S), has a natural interpretation as a certain distribution of mixtures of Gaussians. They show that any distinguisher between H^(g)(S) and the standard multivariate Gaussian N turns out to be enough to solve CLWE, which thus solves worst-case lattice problems. Therefore, density estimation for Gaussian mixtures implies a solver for CLWE, and so under worst-case lattice assumptions, density estimation for g > √n Gaussian mixtures is hard. (The condition that g > √n is a consequence of their worst-case to average-case reduction.)

This direct reduction from LWE to CLWE opens up a large toolkit of techniques that were developed in LWE-based cryptography. In this work, we leverage tools from leakage-resilient cryptography [Mic18, BD20] to greatly improve the hard instance of [BRST21]. It turns out that the number of Gaussians g in the final mixture roughly corresponds to the norm of the secrets in LWE. Thus, if we can assume hardness of LWE with low-norm secrets, then we get hardness for a small number of Gaussians.
Indeed, we achieve hardness of LWE with low-norm secrets by reducing LWE to k-sparse LWE, using an improved leakage-resilience theorem for LWE with k-sparse secrets. We call a vector s ∈ {+1, 0, −1}^n k-sparse if it has exactly k non-zero entries.

Theorem 3 (Informal). Assume LWE in dimension ℓ with n samples is hard with secrets s ∼ Z_q^ℓ and errors of width σ. Then, LWE in dimension n with k-sparse secrets is hard for errors of width O(√k · σ), as long as k log_2(n) ≫ ℓ log_2(q).

We note that showing hardness for LWE with sparse binary secrets is attractive in other settings, both practical and theoretical. In practice, LWE-based cryptosystems sometimes use sparse secrets (and small corresponding errors) to get concrete efficiency gains, and in theory, sparse secrets allow LWE hardness to be interpreted in a fine-grained way.

It turns out that for our purposes, it is crucial that the blowup in the noise is only a multiplicative O(√k) factor. Micciancio [Mic18] gives a simple proof for the hardness of LWE with secrets s ∼ {+1,−1}^n with an O(√n) blowup in the noise. In fact, we can view our k-sparse hardness result as a generalization of the work of Micciancio [Mic18] to arbitrary sparsity k, instead of sparsity k = n, which corresponds to {+1,−1}^n secrets. At the same time, we wish to get a smaller √k blowup in the noise. Brakerski and Döttling [BD20] give a general reduction from LWE to LWE with arbitrary secret distributions with large enough entropy, but the noise blowup when applying their results directly to k-sparse secrets is √(kmn) ≫ √k (where m is the number of samples), which is too large for our purposes.

We now describe how we get this improvement to only O(√k) blowup in the noise. Our starting point is the reduction of Micciancio [Mic18], which gives a reduction for {+1,−1}^n secrets with O(√n) noise blowup.
Typically, reductions like this would map standard LWE to the binary LWE distribution and uniform to uniform, but the reduction of [Mic18] takes a different form. The main insight of that work is that it suffices to give an efficiently computable randomized mapping ϕ that maps the uniform distribution, U(Z_q^{n×m}) (or just Z_q^{n×m} to abuse notation) to the binary LWE distribution, LWE_bin (where secrets are binary) but also maps the standard LWE distribution (with secret matrices instead of vectors) to another standard LWE distribution (with secret matrices instead of vectors). Very informally, we have ϕ(Z_q^{m×n}) = LWE_bin and ϕ(LWE) = LWE. The argument for why such a ϕ is sufficient is that under the LWE assumption, LWE ≈ Z_q^{n×m}, so a distinguisher for LWE_bin and Z_q^{n×m} would imply a distinguisher for LWE_bin = ϕ(Z_q^{n×m}) and LWE = ϕ(LWE), which would then imply a distinguisher for Z_q^{n×m} and LWE, by applying our mapping ϕ in the reduction. Thus, constructing some efficient ϕ is sufficient.

As a first attempt, one might try ϕ(B) = [B, Bz + e], where z ∼ {+1,−1}^n and e is some noise of width σ. This indeed maps ϕ(Z_q^{m×n}) = LWE_bin. Furthermore, ϕ(LWE) is almost the same as LWE by the leftover hash lemma, except that the noise matrix becomes [E, Ez + e]. However, this noise is no longer Gaussian, as the noise is correlated with the secret z. To salvage this, Micciancio [Mic18] carefully constructs a gadget matrix Q ∈ Z_q^{n×O(n)} to make the correlations cancel out and modifies the mapping ϕ appropriately, along with adding more sources of randomness. Explicitly, the mapping becomes

\[ \varphi(B) = \Big[\, [\mathbf{s},\ \mathbf{s}\cdot\mathbf{a}^{\top} + B,\ G]\, Q^{\top} Z,\ \ \mathbf{s} + \mathbf{e} \,\Big], \]

where s ∼ Z_q^m, a ∼ Z_q^{n−1}, G ∼ D_{Z,σ}^{m×O(n)}, e ∼ D_{Z,2σ}^m, and Z = diag(z) for z ∼ {+1,−1}^n.

Our main technical contribution is to give a similar mapping ϕ that works for the case when z is k-sparse. Ultimately, this boils down to carefully adjusting the gadget matrix Q and the matrix Z to work in the k-sparse case.
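The first-attempt mapping ϕ(B) = [B, Bz + e] discussed above can be checked numerically. The sketch below is illustrative only: the matrix shapes (B with m rows and n columns, A with ℓ columns), the helper names, and the rounded-Gaussian noise are all assumptions made for the example, and it deliberately implements the naive mapping rather than the gadget-based one. It confirms that when B = AS + E is itself an LWE matrix, the appended column equals A(Sz) + (Ez + e), so the new noise term Ez + e depends on the secret z.

```python
import random

def matvec(M, v, q):
    # Matrix-vector product modulo q.
    return [sum(x * y for x, y in zip(row, v)) % q for row in M]

def matmul(A, B, q):
    # Matrix-matrix product modulo q.
    cols = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) % q for col in cols] for row in A]

def phi_first_attempt(B, z, e, q):
    # Naive mapping: append the column B z + e (mod q) to B.
    last = [(bz + ei) % q for bz, ei in zip(matvec(B, z, q), e)]
    return [list(row) + [c] for row, c in zip(B, last)]

def demo(rng, m=6, ell=3, n=5, q=97, sigma=2.0):
    small = lambda: round(rng.gauss(0.0, sigma))  # stand-in for discrete Gaussian noise
    z = [rng.choice((1, -1)) for _ in range(n)]
    e = [small() for _ in range(m)]
    A = [[rng.randrange(q) for _ in range(ell)] for _ in range(m)]
    S = [[rng.randrange(q) for _ in range(n)] for _ in range(ell)]
    E = [[small() for _ in range(n)] for _ in range(m)]
    AS = matmul(A, S, q)
    B = [[(x + y) % q for x, y in zip(r1, r2)] for r1, r2 in zip(AS, E)]
    out = phi_first_attempt(B, z, e, q)
    # Expected last column: A (S z) + (E z + e), with noise E z + e tied to z.
    Sz = matvec(S, z, q)
    noise = [(ez + ei) % q for ez, ei in zip(matvec(E, z, q), e)]
    expected = [(x + y) % q for x, y in zip(matvec(A, Sz, q), noise)]
    return [row[-1] for row in out], expected
```

Calling `demo(random.Random(1))` returns two equal lists. The identity itself is not the problem; the issue raised in the text is that Ez + e is correlated with z, which is what the gadget matrix Q is designed to cancel.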
For a full description, see Section 4 (particularly Lemma 17 and Lemma 18).

1.4 Open Questions and Future Directions

The best algorithms for learning mixtures of Gaussians run in polynomial time only for constantly many Gaussians. We show hardness (under a plausible setting of LWE) for roughly √(log n) Gaussians. In fact, for our distribution of Gaussians, we know from Bruna et al. [BRST21] that there exists an algorithm running in time roughly 2^{O(g^2)}, which becomes almost polynomial at the extremes of our parameter settings, which makes our lower bound nearly tight assuming LWE in these parameter settings. One way to interpret our result is that if there is an algorithm for estimating the density of mixtures of g Gaussians (even just in our case) in time poly(n) · 2^{g^{2−ε}} using poly(log n) samples, then we get a state-of-the-art algorithm for LWE (runtime 2^{n^δ} where δ < 1). Moitra and Valiant [MV10] have an n^{f(g)} dependence in their runtime. Is it possible to do any better, and does this improve any state-of-the-art algorithms for LWE?

2 Preliminaries

For a distribution D, we write x ∼ D to denote a random variable x being sampled from D. For any n ∈ N, we let D^n denote the n-fold product distribution, i.e. (x_1, . . . , x_n) ∼ D^n is generated by sampling x_i ∼ D i.i.d. For any finite set S, we write U(S) to denote the discrete uniform distribution over S; we abuse notation and write x ∼ S to denote x ∼ U(S). For any continuous set S, we write U(S) to denote the continuous uniform distribution over S (i.e. having support S and constant density); we also abuse notation and write x ∼ S to denote x ∼ U(S).

For distributions D_1, D_2 supported on a measurable set X, we define the statistical distance between D_1 and D_2 to be

\[ \Delta(D_1, D_2) = \frac{1}{2} \int_{x \in X} |D_1(x) - D_2(x)| \, dx. \]

We say that distributions D_1, D_2 are ε-close if ∆(D_1, D_2) ≤ ε.
For a distinguisher A running on two distributions D_1, D_2, we say that A has advantage ε if

\[ \left| \Pr_{x \sim D_1}[A(x) = 1] - \Pr_{x \sim D_2}[A(x) = 1] \right| \ge \varepsilon, \]

where the probability is also over any internal randomness of A.

We let I_{n×n} ∈ {0, 1}^{n×n} denote the n × n identity matrix. When n is clear from context, we write this simply as I. For any matrix M ∈ R^{m×n}, we let M^⊤ be its transpose matrix, and for ℓ ∈ [n], we write M_{[ℓ]} ∈ R^{m×ℓ} to denote the submatrix of M consisting of just the first ℓ columns, and we write M_{]ℓ[} ∈ R^{m×(n−ℓ)} to denote the submatrix of M consisting of all but the first ℓ columns. For any vector v ∈ R^n, we write ‖v‖ to mean the standard ℓ_2-norm of v, and we write ‖v‖_∞ to denote the ℓ_∞-norm of v, meaning the maximum absolute value of any component. For n ∈ N, we let S^{n−1} ⊂ R^n denote the (n−1)-dimensional sphere embedded in R^n, or equivalently the set of unit vectors in R^n.

By Z_q, we refer to the ring of integers modulo q, represented by {0, . . . , q − 1}. By T_q, we refer to the set R/qZ = [0, q) ⊆ R where addition (and subtraction) is taken modulo q (i.e. T_q is the torus scaled up by q). We denote T := T_1 to be the standard torus. By taking a real number mod q, we refer to taking its representative as an element of T_q in [0, q) unless stated otherwise.

Definition 2 (Min-Entropy). For a discrete distribution D with support S, we let H̃_∞(D) denote the min-entropy of D,

\[ \tilde{H}_\infty(D) = -\log_2\left( \max_{s \in S} \Pr_{x \sim D}[x = s] \right). \]

Lemma 1 (Leftover Hash Lemma [HILL99]). Let ℓ, n, q ∈ N, ε ∈ R_{>0}, and let S be a distribution over {−1, 0, 1}^n ⊆ Z_q^n. Suppose H̃_∞(S) ≥ ℓ log_2(q) + 2 log_2(1/ε). Then, the distributions given by (A, As (mod q)) and (A, b), where A ∼ Z_q^{ℓ×n}, s ∼ S, b ∼ Z_q^ℓ, have statistical distance at most ε.

2.1 Lattices and Discrete Gaussians

A rank n integer lattice is a set Λ = BZ^n ⊆ Z^d of all integer linear combinations of n linearly independent vectors B = [b_1, . . . , b_n] in Z^d.
The dual lattice Λ* of a lattice Λ is defined as the set of all vectors y ∈ R^d such that ⟨x, y⟩ ∈ Z for all x ∈ Λ.

For arbitrary x ∈ R^n and c ∈ R^n, let

\[ \rho_{s,c}(x) = \frac{1}{s^n} \exp\left( -\pi \| (x - c)/s \|^2 \right) \]

denote the density function of the standard Gaussian over R^n of width s ∈ R_{>0} centered at c. Let D_{s,c} be the corresponding distribution. Note that D_{s,c} is the n-dimensional Gaussian distribution with mean c and covariance matrix s^2/(2π) · I_{n×n}. When c = 0, we omit the subscript c on ρ and D.

For an n-dimensional lattice Λ ⊆ R^n and point c ∈ R^n, we can define the discrete Gaussian of width s to be given by the mass function

\[ D_{\Lambda + c, s}(x) = \frac{\rho_s(x)}{\rho_s(\Lambda + c)} \]

supported on x ∈ Λ + c, where by ρ_s(Λ + c) we mean Σ_{y∈Λ} ρ_s(y + c).

We now give the smoothing parameter as defined by [Reg09] and some of its standard properties.

Definition 3 ([Reg09], Definition 2.10). For an n-dimensional lattice Λ and ε > 0, we define η_ε(Λ) to be the smallest s such that ρ_{1/s}(Λ* \ {0}) ≤ ε.

Lemma 2 ([Reg09], Lemma 2.12). For an n-dimensional lattice Λ and ε > 0, we have

\[ \eta_\varepsilon(\Lambda) \le \sqrt{\frac{\ln(2n(1 + 1/\varepsilon))}{\pi}} \cdot \lambda_n(\Lambda). \]

Here λ_i(Λ) is defined as the minimum length of the longest vector in a set of i linearly independent vectors in Λ.

Lemma 3 ([Reg09], Corollary 3.10). For any n-dimensional lattice Λ, ε ∈ (0, 1/2), σ, σ′ ∈ R_{>0}, and z ∈ R^n, if

\[ \eta_\varepsilon(\Lambda) \le \frac{1}{\sqrt{1/(\sigma')^2 + (\|z\|/\sigma)^2}}, \]

then for v ∼ D_{Λ,σ′} and e ∼ D_σ, the distribution of ⟨z, v⟩ + e has statistical distance at most 4ε from D_{√((σ′‖z‖)^2 + σ^2)}.

Lemma 4 ([MR07], Lemma 4.1). For an n-dimensional lattice Λ, ε > 0, c ∈ R^n, and all s ≥ η_ε(Λ), we have ∆(D_{s,c} mod P(Λ), U(P(Λ))) ≤ ε/2, where P(Λ) is the half-open fundamental parallelepiped of Λ.

Lemma 5 ([MR07], implicit in Lemma 4.4). For an n-dimensional lattice Λ, for all ε > 0, c ∈ R^n, and all s ≥ η_ε(Λ), we have

\[ \rho_s(\Lambda + c) = \rho_{s,-c}(\Lambda) \in \left[ \frac{1 - \varepsilon}{1 + \varepsilon},\ 1 \right] \cdot \rho_s(\Lambda). \]

Now we recall other facts related to lattices.

Lemma 6 ([MP13], Theorem 3). Suppose v ∈ Z^m with gcd(v) = 1, and suppose y_i ∼ D_{Z,σ_i} for all i ∈ [m]. As long as σ_i ≥ √2 ‖v‖_∞ η_{ε/(2m)}(Z) for all i ∈ [m], we have that y = Σ_{i∈[m]} y_i v_i is O(ε)-close to D_{Z,σ}, where σ = √(Σ_{i∈[m]} σ_i^2 v_i^2).

Lemma 7 ([Mic18], Lemma 2.2). For w ∼ U(Z_q^ℓ), the probability that gcd(w, q) ≠ 1 is at most log(q)/2^ℓ.

Definition 4. We say that a matrix T ∈ Z^{k×m} is primitive if TZ^m = Z^k, i.e., if T : Z^m → Z^k is surjective.

Lemma 8 ([Mic18], Lemma 2.6). For any primitive matrix T ∈ Z^{k×m} and positive reals α, σ > 0, if TT^⊤ = α^2 I and η_ε(ker(T)) ≤ σ, then T(D_{Z^m,σ}) and D_{Z^k,ασ} are ε-close.

We also use the notation N(µ, Σ) to denote a multivariate Gaussian distribution with mean µ ∈ R^n and covariance matrix Σ ∈ R^{n×n} for symmetric positive semi-definite Σ.

We now define mixtures of Gaussians, following the definition of density estimation for mixtures of Gaussians given in [BRST21].

Definition 5. Let G_{n,k} be the set of all mixtures of k Gaussians in R^n. That is, G_{n,k} contains exactly the distributions P that can be written as

\[ P = \sum_{i \in [k]} w_i \cdot N(\mu_i, \Sigma_i), \]

for weights w_i ∈ [0, 1] summing to 1 and arbitrary means µ_i ∈ R^n and covariance matrices Σ_i ∈ R^{n×n}.

We define the problem of density estimation for G_{n,k} to be the following problem. Given sample access to an arbitrary (and unknown) P ∈ G_{n,k}, with probability ≥ 9/10, output a distribution Q (as an evaluation oracle) such that ∆(P, Q) ≤ 1/10.

2.2 Learning with Errors

Throughout, we work with decisional versions of LWE, CLWE, and hCLWE.

Definition 6 (LWE Distribution). Let n, m, q ∈ N, let A be a distribution over R^n, S be a distribution over Z^n, and E be a distribution over R. Let LWE(q, m, A, S, E) be the distribution given by sampling a_1, . . . , a_m ∼ A, s ∼ S, and e_1, . . . , e_m ∼ E, and outputting (a_i, s^⊤a_i + e_i (mod q)) for all i ∈ [m]. We refer to n as the dimension and m as the number of samples. Whenever q is clear from the distribution on A, we omit it for brevity.

We also consider the case where S is a distribution over Z^{n×j} and E is a distribution over R^j.
In this case, the ouput of each sample is (a , S>i ai + ei (mod q)), where S ∼ S and ei ∼ E. Definition 7 (CLWE Distribution [BRST21]). Let n,m, q ∈ N, γ, β ∈ R, and let A be a distribution over Rn×m and S be a distribution over Sn−1. Let CLWE(q,m,A,S, γ, β) be the distribution given by sampling a1, · · · ,am ∼ A, w ∼ S, e1, · · · , em ∼ Dβ and outputting (ai, γ · 〈ai,w〉+ ei (mod q)) for all i ∈ [m]. We refer to n as the dimension and m as the number of samples. We omit q if q = 1 and omit S if S = U(Sn−1), as is standard for CLWE. Definition 8 (hCLWE Distribution [BRST21]). Let n,m ∈ N, γ, β ∈ R, and let A be a distribution over Rn×m and S be a distribution over Sn−1. Let hCLWE(m,A,S, γ, β) be the the distribution CLWE(m,A,S, γ, β), but conditioned on the fact that for all samples second entries are 0 (mod 1). We refer to n as the dimension and m as the number of samples. We omit S if S = U(Sn−1), as is standard for hCLWE. Note that the hCLWE distribution is itself a mixture of Gaussians. Explicitly, for a secret s ∼ S, we can write the density of hCLWE(1, D1, s, γ, β) at point x ∈ Rn as ( ) ∑ ∑ γ ρ(x) · ρβ(k − γ · 〈s,x〉) = ρ√ 2 2(k) · ρ(πs⊥(x)) · ρ √ 〈s,x〉 − k , (1) β +γ β/ β2+γ2 β2 + γ2 k∈Z k∈Z 8 where πs⊥(x) denotes the projection onto the orthogonal complement of s. Thus, we can view√ hCLWE samples as being drawn from a mixture of Gaussians of width β/ β2 + γ2 ≈ β/γ in the secret direction, and width 1 in all other directions. Definition 9 (Truncated hCLWE Distribution [BRST21]). Let n,m, g ∈ N, γ, β ∈ R, and let S be a distribution over Sn−1. Let hCLWE(g)(m,S, γ, β) be the the distribution hCLWE(m,Dn1 ,S, γ, β), but restricted to the central g Gaussians, where by central g Gaussians, we mean the central g Gaussians in writing hCLWE samples as a mixture of Gaussians, as in Eq. 1. Explicitly, for secret s ∼ S, the density of one sample at a point x ∈ Rn is b(g−1)/2c ( ) ∑ ρ√ γ 2 2(k) · ρ(πs⊥(x)) · ρ √ 2 2 〈s,x〉 − k . 
(2)β +γ β/ β +γ β2 + γ2 k=−bg/2c The following theorem tells us that distinguishing a truncated version of the hCLWE Gaussian mixture from the standard Gaussian is enough to distinguish the original Gaussian mixture from the standard Gaussian. In particular, we can use density estimation to solve hCLWE since the truncated version has a finite number of Gaussians. Theorem 4 (Proposition 5.2 of [BRST21]). Let n,m ∈ N, γ, β ∈ R>0 with β < 1/32 and γ ≥ 1.√ Let S be a distribution over Sn−1. For sufficiently large m and for g = 2γ lnm/π, if there is an algorithm running in time T that distinguishes hCLWE(2g+1)(m,S, γ, β) and Dn×m1 with constant probability, then there is a time T + poly(n,m) algorithm distinguishing hCLWE(m,Dn1 ,S, γ, β) and Dn×m1 with constant probability. In particular, if there is an algorithm running in time T that solves density estimation for Gn,2g+1, then there is a time T + poly(n,m) algorithm distinguishing hCLWE(m,Dn,S, γ, β) and Dn×m1 1 with constant probability. We also use a Lemma which says that if CLWE is hard, then so is hCLWE. Lemma 9 (Lemma 4.1 of [BRST21]). There is a poly(n, 1/β)-time reduction M such that M maps samples CLWE(Dn1 , s, γ, β) to hCLWE(D n 1 , s, γ, 2β) and maps D n × U(T ) to Dn1 1 1 . 3 Reducing LWE to CLWE Our main result in this section is a reduction from decisional LWE to decisional CLWE. Explicitly: Theorem 5. Let q, n,m,m1 ∈ N with m1 ≥ m, and let γ, β, σ,  ∈ R>0. If there is a T -time m×m distinguisher with advantage  between CLWE(m1, (D m 1) , γ, β) and D 1 1 ×U(Tm1), then there is a time T + poly(n,m,m1, q, λ) time distinguisher with advantage /O(m1) up to additive negl(λ) factors between LWE(m,Znq ,Znq , DZ,σ) and U(Zn×mq × Zmq ), for (√ ) γ = O m(lnm1 + ω(log λ)) , (√ ) m √ β = O · σ2 + lnm1 + ω(log λ) , q √ as long as log(q)/2n = negl(λ), m ≥ 2n log2 q, and σ ≥ C · lnm1 + ω(log λ) for some universal constant C. 9 Reducing LWE to CLWE (Theorem 5) Samples Secrets (· γ) Errors # Samples Adv. 
Start (LWE) Znq Znq DZ,σ m O(m1) Step 1 (Theorem 6) Zn1q {+1,−1}n1 DZ,σ m1 1  Step 2 (Lemma 10) Zn1 {+1,−1}n1q Dσ m1 2 Step 3 (Lemma 11) Tn1q {+1,−1}n1 D√ σ m1 3 Step 4 (Lemma 13) Dn1 √1 {+1,−1}n1τ (· q n1) Dσ m1 n1 3√ Step 5 (Lemma 14) Dn1 Sn1−1τ (· q n√ 1) Dσ m1 3 CLWE (Lemma 15) nD 1 Sn1−11 (· τ n1) Dβ m1  Figure 1: This tables shows the steps in the reduction from LWE to CLWE. When a distribution is not explicitly specified, it is taken to be the uniform distribution. Here, “Adv.” is short for advantage of the distinguisher, which holds up to additive negl(λ) factors. Remark 1. The use of the security parameter λ here is only to talk about disitnguishing advantage; in particular, one can set λ = Θ(1) independently of all other parameters (so negl(λ) = o(1)) to get that a distinguisher with advantage Ω(1) for CLWE implies a distinguisher with advantage Ω(1/m1) for LWE. This reduction goes via a series of transformations, which we briefly outline below: 1. We convert secrets s ∼ Znq to binary secrets s ∼ {+1,−1}n1 for slightly larger n1 and noise σ1, now with m1 samples instead of m. 2. We convert discrete Gaussian errors e ∼ DZ,σ to continuous Gaussian errors e ∼ Dσ for σ2 2 slightly larger than σ1. 3. We convert discrete uniform samples a ∼ Zn1q to continuous uniform samples a ∼ Tn1q with errors from Dσ , where σ3 is slightly larger than σ3 2. 4. We convert uniform a ∼ Tn1q to Gaussian a ∼ Dn1τ where the secret is effectively scaled up√ by a factor of q; viewing it as a CLWE distribution, we have parameter γ0 = q n1 and unit vector secret w ∼ √1 {+1,−1}n1 . n1 5. We now re-randomize the secret distribution to be a continuously uniformly random element of Sn1−1 instead of discrete uniform over √1 {+1,−1}n1 . n1 6. We scale variables to bring us to the standard formulation of CLWE for parameters γ, β set appropriately. Setting of parameters. If we start with dimension n and m samples with error width σ: √ 1. 
After the first step, we get $n_1 = m$, $m_1$ samples, and $\sigma_1 = 2\sigma\sqrt{m}$, with a multiplicative advantage loss of $O(m_1)$.
2. After the second step, we get $\sigma_2 = \sqrt{\sigma_1^2 + 4\ln(m_1) + \omega(\log \lambda)}$.
3. After the third step, we get $\sigma_3 = \sqrt{\sigma_2^2 + 9n_1(\ln n_1 + \ln m_1 + \omega(\log \lambda))}$, as long as $\sigma_2 \geq 3\sqrt{n_1} \cdot \sqrt{\ln n_1 + \ln m_1 + \omega(\log \lambda)}$.
4. After the fourth step, we get $\tau = \sqrt{\ln n_1 + \ln m_1 + \omega(\log \lambda)}$, where the secret now has norm $\gamma_0 = q\sqrt{n_1}$.
5. Nothing changes in the fifth step.
6. After the sixth step, we have $\gamma = \gamma_0 \cdot \tau/q$ and $\beta = \sigma_3/q$.

Step 1: Converting uniform secrets to binary secrets. We can reduce the standard LWE problem above to a version where secrets are drawn uniformly from $\{+1,-1\}^{n_1}$ for some slightly larger $n_1$. This makes the secret short and gives it $\ell_2$ norm (in $\mathbb{R}^{n_1}$) exactly $\sqrt{n_1}$. From [Mic18], we know a reasonably tight reduction between these two problems.

Theorem 6 ([Mic18], Theorem 3.1 and Lemma 2.9). Let $q, n, m, m_1 \in \mathbb{Z}$, $\sigma \in \mathbb{R}$. If a $T$-time algorithm has advantage $\epsilon$ in distinguishing $\mathrm{LWE}(m_1, \mathbb{Z}_q^{m+1}, \{+1,-1\}^{m+1}, D_{\mathbb{Z},\sigma'})$ and $U(\mathbb{Z}_q^{(m+1) \times m_1} \times \mathbb{Z}_q^{m_1})$, then there is a time $T + \mathrm{poly}(n, m, q, \lambda)$ algorithm with advantage $\epsilon/O(m_1)$ (up to additive $\mathrm{negl}(\lambda)$) in distinguishing $\mathrm{LWE}(m+1, \mathbb{Z}_q^n, \mathbb{Z}_q^n, D_{\mathbb{Z},\sigma})$ and $U(\mathbb{Z}_q^{n \times (m+1)} \times \mathbb{Z}_q^{m+1})$, as long as $\log(q)/2^n = \mathrm{negl}(\lambda)$, $\sigma \geq 4\sqrt{\omega(\log \lambda) + \ln m + \ln m_1}$, $m \geq 2n\log_2 q + \omega(\log \lambda)$, and $\sigma' = 2\sigma\sqrt{m+1}$.

Remark 2. Note that we phrase the parameter requirements differently here than is done in [Mic18], mainly because we want to delink the security parameter from $n$. Explicitly:

• The requirements $q \leq 2^{\mathrm{poly}(m)}$ and $n \geq \omega(\log m)$ in [Mic18] are needed only to make sure that the first row of a primitive matrix is close to uniform over $\mathbb{Z}_q$. Indeed, Lemma 2.2 of [Mic18] shows the statistical distance is at most $\log(q)/2^n$. Thus, the requirement $\log(q)/2^n = \mathrm{negl}(\lambda)$ is sufficient.
• We require $\sigma \geq 4\sqrt{\omega(\log \lambda) + \ln m + \ln m_1}$ instead of $\sigma \geq \omega(\sqrt{\log m})$ so that the various triangle inequalities go through and give $\mathrm{negl}(\lambda)$ overall statistical distance.
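The parameter bookkeeping listed under "Setting of parameters" can be traced numerically. The following is a minimal sketch, not part of the original reduction: the function name `clwe_parameters` and the argument `slack` are hypothetical, and `slack` is an assumed concrete stand-in for every $\omega(\log \lambda)$ term (which the text leaves asymptotic and which need not be the same function at each step).

```python
import math


def clwe_parameters(m, m1, q, sigma, slack):
    """Follow the widths through Steps 1-6, starting from m LWE samples of width sigma.

    `slack` is a hypothetical numeric placeholder for the omega(log lambda) terms.
    """
    n1 = m                                           # Step 1: new dimension n1 = m
    log_term = math.log(n1) + math.log(m1) + slack   # ln n1 + ln m1 + omega(log lambda)

    sigma1 = 2 * sigma * math.sqrt(m)                               # Step 1
    sigma2 = math.sqrt(sigma1 ** 2 + 4 * math.log(m1) + slack)      # Step 2
    step3_ok = sigma2 >= 3 * math.sqrt(n1) * math.sqrt(log_term)    # Step 3 precondition
    sigma3 = math.sqrt(sigma2 ** 2 + 9 * n1 * log_term)             # Step 3
    tau = math.sqrt(log_term)                                       # Step 4
    gamma0 = q * math.sqrt(n1)                                      # Step 4: secret norm
    gamma = gamma0 * tau / q                                        # Step 6
    beta = sigma3 / q                                               # Step 6

    return {
        "n1": n1, "sigma1": sigma1, "sigma2": sigma2, "sigma3": sigma3,
        "tau": tau, "gamma0": gamma0, "gamma": gamma, "beta": beta,
        "step3_ok": step3_ok,
    }


if __name__ == "__main__":
    for name, value in clwe_parameters(m=64, m1=128, q=2 ** 20, sigma=50.0, slack=10.0).items():
        print(name, value)
```

In this sketch $\gamma$ comes out as $\sqrt{m} \cdot \tau$, independent of $q$, while $\beta$ shrinks as $q$ grows, matching the shape of the bounds in Theorem 5.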
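Step 2: Converting discrete errors to continuous errors. Now, we make the error distribution statistically close to a continuous Gaussian instead of a discrete Gaussian. Essentially, all we do is add a small continuous Gaussian noise to the second component and argue that this makes the noise look like a continuous Gaussian instead of a discrete one.

Lemma 10. Let $n, m, q \in \mathbb{N}$, $\sigma \in \mathbb{R}_{>0}$. For any distribution $\mathcal{S}$ over $\mathbb{Z}^n$, suppose there is a $T$-time distinguisher between $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathcal{S}, D_{\sigma'})$ and $U(\mathbb{Z}_q^{n \times m} \times \mathbb{T}_q^m)$, where

$$\sigma' = \sqrt{\sigma^2 + 4\ln(m) + \omega(\log \lambda)}.$$

If $\sigma > \sqrt{4\ln m + \omega(\log \lambda)}$, then there is a distinguisher between $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathcal{S}, D_{\mathbb{Z},\sigma})$ and $U(\mathbb{Z}_q^{n \times m} \times \mathbb{Z}_q^m)$ running in time $T + \mathrm{poly}(m, n, q, \lambda)$.

Proof. We run our original distinguisher for $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathcal{S}, D_{\sigma'})$ and $U(\mathbb{Z}_q^{n \times m} \times \mathbb{T}_q^m)$. For every sample $(\mathbf{a}, b)$ (from either $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathcal{S}, D_{\mathbb{Z},\sigma})$ or $U(\mathbb{Z}_q^{n \times m} \times \mathbb{Z}_q^m)$), we sample a continuous Gaussian $e' \sim D_{\sigma''}$, where $\sigma''$ will be set later, and send $(\mathbf{a}, b + e' \pmod q)$ to the distinguisher.

By Lemma 4, we know that the distribution of $e' \pmod 1$ has statistical distance at most $\epsilon$ to $U([0,1))$ as long as $\sigma'' \geq \eta_\epsilon(\mathbb{Z})$. Therefore, if we are given samples from $U(\mathbb{Z}_q^{n \times m} \times \mathbb{Z}_q^m)$, then by symmetry of $b \sim \mathbb{Z}_q$ we can set $\epsilon = \lambda^{-\omega(1)}/m$ to have $b + e' \pmod q$ be $\mathrm{negl}(\lambda)/m$-close to uniform over $\mathbb{T}_q$, making the samples look like samples from $U(\mathbb{Z}_q^{n \times m} \times \mathbb{T}_q^m)$.

If we are given samples from $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathcal{S}, D_{\mathbb{Z},\sigma})$, then the second component can be seen as having noise $e + e'$, where $e \sim D_{\mathbb{Z},\sigma}$ and $e' \sim D_{\sigma''}$. Applying Lemma 3, as long as $1/\sqrt{1/\sigma^2 + 1/(\sigma'')^2} \geq \eta_\epsilon(\mathbb{Z})$, the sum $e + e'$ is $O(\epsilon)$-close to $D_{\sqrt{\sigma^2 + (\sigma'')^2}}$. Thus, as long as $\sigma, \sigma'' \geq \sqrt{2} \cdot \eta_\epsilon(\mathbb{Z})$, it all goes through, as taking errors mod $q$ (i.e. in $\mathbb{T}_q$ instead of $\mathbb{R}$) can only decrease statistical distance.

Now, applying Lemma 2, we can set $\epsilon = \lambda^{-\omega(1)}/m$ and $\sigma'' = \sqrt{4\ln(m) + \omega(\log \lambda)}$, and as long as $\sigma > \sqrt{4\ln(m) + \omega(\log \lambda)}$, all goes through. Finally, applying the triangle inequality over all $m$ samples, we get $\mathrm{negl}(\lambda)$-closeness of all samples.

Step 3: Converting discrete to continuous samples.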
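Now, we convert discrete uniform samples $\mathbf{a} \sim \mathbb{Z}_q^n$ to continuous uniform samples $\mathbf{a} \sim \mathbb{T}_q^n$.

Lemma 11. Let $n, m, q \in \mathbb{N}$, $\sigma \in \mathbb{R}$. Let $\mathcal{S}$ be a distribution over $\mathbb{Z}^n$ where all elements in the support have fixed norm $r$, and suppose that

$$\sigma \geq 3r\sqrt{\ln n + \ln m + \omega(\log \lambda)}.$$

Suppose there is a $T$-time distinguisher between the distributions $\mathrm{LWE}(m, \mathbb{T}_q^n, \mathcal{S}, D_{\sigma'})$ and $U(\mathbb{T}_q^{n \times m} \times \mathbb{T}_q^m)$, where we set

$$\sigma' = \sqrt{\sigma^2 + 9r^2(\ln n + \ln m + \omega(\log \lambda))}.$$

Then, there is a $T + \mathrm{poly}(m, n, \lambda, q)$ time distinguisher between the distributions $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathcal{S}, D_\sigma)$ and $U(\mathbb{Z}_q^{n \times m} \times \mathbb{T}_q^m)$.

Proof. We run our distinguisher for $\mathrm{LWE}(m, \mathbb{T}_q^n, \mathcal{S}, D_{\sigma'})$ and $U(\mathbb{T}_q^{n \times m} \times \mathbb{T}_q^m)$. Let $\epsilon = \mathrm{negl}(\lambda)/m$, and let $\sigma'' \geq \sqrt{2} \cdot \eta_\epsilon(\mathbb{Z}^n)$. For each sample $(\mathbf{a}, b)$ (from either $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathcal{S}, D_\sigma)$ or $U(\mathbb{Z}_q^{n \times m} \times \mathbb{T}_q^m)$), we sample a continuous Gaussian $\mathbf{a}' \sim (D_{\sigma''})^n$ and send $(\mathbf{a} + \mathbf{a}' \pmod q, b)$ to the distinguisher.

By Lemma 4, we know that the distribution of $\mathbf{a}' \pmod 1$ has statistical distance at most $\epsilon = \mathrm{negl}(\lambda)/m$ to $U([0,1)^n)$. Thus, by symmetry over $\mathbf{a} \sim \mathbb{Z}_q^n$, the distribution of $\mathbf{a} + \mathbf{a}' \pmod q$ is $\mathrm{negl}(\lambda)/m$-close to uniform over $\mathbb{T}_q^n$. Therefore, by the triangle inequality, if we are given samples from $U(\mathbb{Z}_q^{n \times m} \times \mathbb{T}_q^m)$, the reduction gives samples to the distinguisher that are $\mathrm{negl}(\lambda)$-close to $U(\mathbb{T}_q^{n \times m} \times \mathbb{T}_q^m)$.

If we are given samples from $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathcal{S}, D_\sigma)$, then the reduction gives us (taking everything mod $q$)

$$(\mathbf{a} + \mathbf{a}', \langle \mathbf{a}, \mathbf{s} \rangle + e) = (\mathbf{a} + \mathbf{a}', \langle \mathbf{a} + \mathbf{a}', \mathbf{s} \rangle + e - \langle \mathbf{a}', \mathbf{s} \rangle) = (\mathbf{a} + \mathbf{a}', \langle \mathbf{a} + \mathbf{a}', \mathbf{s} \rangle + e'),$$

where we define $e' = e - \langle \mathbf{a}', \mathbf{s} \rangle$ over $\mathbb{R}$. Conditioned on $\mathbf{a} + \mathbf{a}' \bmod q$, the vector $\mathbf{a}'$ is a discrete Gaussian distributed according to $D_{\mathbb{Z}^n + (\mathbf{a} + \mathbf{a}'), \sigma''}$. By Lemma 3, as long as $\sigma \geq r\sigma''$, the distribution of $e'$ is $O(\epsilon) = \mathrm{negl}(\lambda)/m$ close to $D_{\sigma'}$, where

$$\sigma' = \sqrt{\sigma^2 + r^2(\sigma'')^2}.$$

Averaging the distribution of $e'$ over $\mathbf{s}$ will not change the distribution of $e'$. Therefore, if we are given the $m$ samples from $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathcal{S}, D_\sigma)$, the reduction gives us samples $\mathrm{negl}(\lambda)$-close to $\mathrm{LWE}(m, \mathbb{T}_q^n, \mathcal{S}, D_{\sigma'})$, as desired.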
To set parameters, we choose $\sigma'' = 3\sqrt{\ln n + \ln m + \omega(\log \lambda)}$ to ensure that $\sigma'' \geq \sqrt{2} \cdot \eta_{\mathrm{negl}(\lambda)/m}(\mathbb{Z}^n)$. This gives

$$\sigma' = \sqrt{\sigma^2 + 9r^2(\ln n + \ln m + \omega(\log \lambda))},$$

along with the requirement that

$$\sigma \geq r\sigma'' = 3r\sqrt{\ln n + \ln m + \omega(\log \lambda)}.$$

Step 4: Converting uniform to Gaussian samples.

Lemma 12. Let $t \in \mathbb{R}_{>0}$ be a parameter. There is a $\mathrm{poly}(n, t, q, \lambda)$-time algorithm such that on input $\mathbf{z} \in \mathbb{T}_q^n$, the algorithm outputs some $\mathbf{y} \in \mathbb{R}^n$ such that $q \cdot \mathbf{y} = \mathbf{z} \pmod q$. Moreover, if $\mathbf{z}$ is uniform over $\mathbb{T}_q^n$, then the distribution of the output $\mathbf{y}$ is $\mathrm{negl}(\lambda)/t$-close to $(D_\tau)^n$, where $\tau = \sqrt{\ln n + \ln t + \omega(\log \lambda)}$.

Remark 3. In the discrete setting, there is in some sense a necessary multiplicative $\Omega(\log q)$ overhead in the dimension due to entropy arguments, but the above shows that we can overcome that barrier in the continuous case.

Proof. We handle each coordinate of $\mathbf{y}$ separately. By the triangle inequality, it suffices to show how to sample $y \in \mathbb{R}$ such that $qy = z \pmod q$ and such that if $z \sim \mathbb{T}_q$, then $y$ is $\mathrm{negl}(\lambda)/(tn)$-close to $D_\tau$. We sample $y \sim D_{\mathbb{Z} + z/q, \tau}$, which can be done efficiently (see e.g. [BLP+13], Section 5.1 of the full version): the sampler's output $y$ is within statistical distance $\mathrm{negl}(\lambda)/(tn)$ of $D_{\mathbb{Z} + z/q, \tau}$ and always satisfies $y \in \mathbb{Z} + z/q$. Since $y \in \mathbb{Z} + z/q$, it follows that $qy \in q\mathbb{Z} + z$, which implies that $qy = z \pmod q$.

Now, we need to argue that the distribution of $y$ is $\mathrm{negl}(\lambda)/(tn)$-close to $D_\tau$ when $z \sim U([0, q))$. To see this, observe that $z/q$ is distributed uniformly on $[0,1)$, so it suffices to show that $D_{\mathbb{Z}+r,\tau}$ for $r \sim U([0,1))$ is statistically close to $D_\tau$. Note that for fixed $r \in [0,1)$, we can view the distribution $D_{\mathbb{Z}+r,\tau}$ as a continuous distribution with density

$$D_{\mathbb{Z}+r,\tau}(x) = \delta(x - r \bmod 1) \cdot \frac{\rho_\tau(x)}{\rho_\tau(\mathbb{Z} + r)}$$

for arbitrary $x \in \mathbb{R}$, where $\delta(\cdot)$ is the Dirac delta function.
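Thus, as long as $\tau \geq \eta_\epsilon(\mathbb{Z})$ (for $\epsilon$ set later), the density of the marginal distribution $D_{\mathbb{Z}+r,\tau}$ where $r \sim U([0,1))$ is given by

$$\begin{aligned}
D_{\mathbb{Z}+U([0,1)),\tau}(x) &= \int_0^1 1 \cdot D_{\mathbb{Z}+r,\tau}(x) \, dr \\
&= \int_0^1 \delta(x - r \bmod 1) \cdot \frac{\rho_\tau(x)}{\rho_\tau(\mathbb{Z} + r)} \, dr \\
&= \frac{\rho_\tau(x)}{\rho_\tau(\mathbb{Z} + x)} \\
&\in \left[1, \frac{1+\epsilon}{1-\epsilon}\right] \cdot \frac{\rho_\tau(x)}{\rho_\tau(\mathbb{Z})} \\
&\propto \left[1, \frac{1+\epsilon}{1-\epsilon}\right] \cdot \rho_\tau(x),
\end{aligned}$$

where the inclusion comes from Lemma 5. Therefore, a standard calculation shows that the statistical distance between $D_{\mathbb{Z}+U([0,1)),\tau}$ and $D_\tau$ is at most $O(\epsilon)$. Setting $\epsilon = \lambda^{-\omega(1)}/(t \cdot n)$, we need to take $\tau \geq \eta_{\lambda^{-\omega(1)}/(t \cdot n)}(\mathbb{Z})$, which we can do by setting $\tau = \sqrt{\ln n + \ln t + \omega(\log \lambda)}$ by Lemma 2.

Lemma 13. Let $n, m, q \in \mathbb{N}$, $\sigma, r, \gamma \in \mathbb{R}$. Let $\mathcal{S}$ be a distribution over $\mathbb{Z}^n$ where all elements in the support have fixed norm $r$. Suppose there is a $T$-time distinguisher between the distributions $\mathrm{LWE}(m, q, D_\tau^n, q \cdot \mathcal{S}, D_\sigma) = \mathrm{CLWE}(m, q, D_\tau^n, \frac{1}{r} \cdot \mathcal{S}, \gamma, \sigma)$ and $D_\tau^{n \times m} \times U(\mathbb{T}_q^m)$, where $\gamma = r \cdot q$ and

$$\tau = \sqrt{\ln n + \ln m + \omega(\log \lambda)}.$$

Then, there is a $T$-time distinguisher between the distributions $\mathrm{LWE}(m, \mathbb{T}_q^n, \mathcal{S}, D_\sigma)$ and $U(\mathbb{T}_q^{n \times m} \times \mathbb{T}_q^m)$.

Proof. We run the distinguisher for $\mathrm{LWE}(m, q, D_\tau^n, q \cdot \mathcal{S}, D_\sigma)$ and $D_\tau^{n \times m} \times U(\mathbb{T}_q^m)$. For each sample $(\mathbf{a}, b)$ from either $\mathrm{LWE}(m, \mathbb{T}_q^n, \mathcal{S}, D_\sigma)$ or $U(\mathbb{T}_q^{n \times m} \times \mathbb{T}_q^m)$, we invoke Lemma 12 on $\mathbf{a}$ with parameter $t = m$ to get some $\mathbf{y} \in \mathbb{R}^n$ with statistical distance $\mathrm{negl}(\lambda)/m$ from $D_\tau^n$ such that $q \cdot \mathbf{y} = \mathbf{a} \pmod q$. We then send $(\mathbf{y}, b)$ to the distinguisher.

If $(\mathbf{a}, b)$ is a sample from $\mathrm{LWE}(m, \mathbb{T}_q^n, \mathcal{S}, D_\sigma)$, then for secret $\mathbf{s} \sim \mathcal{S}$, since $\mathbf{s} \in \mathbb{Z}^n$, we have

$$(\mathbf{y}, b) = (\mathbf{y}, \langle \mathbf{a}, \mathbf{s} \rangle + e \pmod q) = (\mathbf{y}, \langle q \cdot \mathbf{y}, \mathbf{s} \rangle + e \pmod q) = (\mathbf{y}, \langle \mathbf{y}, q \cdot \mathbf{s} \rangle + e \pmod q),$$

which is $\mathrm{negl}(\lambda)/m$ close to a sample from $\mathrm{LWE}(m, q, D_\tau^n, q \cdot \mathcal{S}, D_\sigma)$. Applying this reduction to $U(\mathbb{T}_q^{n \times m} \times \mathbb{T}_q^m)$ gives a sample statistically close to $D_\tau^{n \times m} \times U(\mathbb{T}_q^m)$, by Lemma 12 and the triangle inequality over all $m$ samples.

Step 5: Converting the secret to a random direction. The distribution on the secret as given above is not uniform over the sphere, so we apply the worst-case to average-case reduction for CLWE (Claim 2.22 in [BRST21]).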
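For completeness, we provide a proof.

Lemma 14 ([BRST21], Claim 2.22). Let $n, m, q \in \mathbb{N}$, and let $\tau, \sigma \in \mathbb{R}_{>0}$. Let $\mathcal{S}$ be a distribution over $\mathbb{R}^n$ of fixed norm 1. Suppose there is a $T$-time distinguisher between the distributions $\mathrm{CLWE}(m, q, D_\tau^n, \gamma, \sigma)$ and $D_\tau^{n \times m} \times U(\mathbb{T}_q^m)$. Then, there is a $T + \mathrm{poly}(n, m, q)$ time distinguisher between the distributions $\mathrm{CLWE}(m, q, D_\tau^n, \mathcal{S}, \gamma, \sigma)$ and $D_\tau^{n \times m} \times U(\mathbb{T}_q^m)$.

That is, we can reduce CLWE to CLWE to randomize the secret to be a uniformly random unit vector instead of drawn from the (possibly discrete) distribution $\mathcal{S}$.

Proof. We run the distinguisher for $\mathrm{CLWE}(m, q, D_\tau^n, \gamma, \sigma)$ and $D_\tau^{n \times m} \times U(\mathbb{T}_q^m)$. Let $R \in \mathbb{R}^{n \times n}$ be a uniformly random rotation matrix in $\mathbb{R}^n$, fixed for all samples. When giving the distinguisher a sample, we get $(\mathbf{a}, b)$ from either $\mathrm{CLWE}(m, q, D_\tau^n, \mathcal{S}, \gamma, \sigma)$ or $D_\tau^{n \times m} \times U(\mathbb{T}_q^m)$, and send $(R\mathbf{a}, b)$ to the distinguisher. If $(\mathbf{a}, b)$ is drawn from $\mathrm{CLWE}(m, q, D_\tau^n, \mathcal{S}, \gamma, \sigma)$, then we have

$$(R\mathbf{a}, b) = (R\mathbf{a}, \gamma\langle \mathbf{a}, \mathbf{s} \rangle + e \pmod q) = (R\mathbf{a}, \gamma\langle R\mathbf{a}, R\mathbf{s} \rangle + e \pmod q) = (\mathbf{a}', \gamma \cdot \langle \mathbf{a}', \mathbf{w} \rangle + e \pmod q),$$

for $\mathbf{a} \sim D_\tau^n$, $\mathbf{s} \sim \mathcal{S}$, and $e \sim D_\sigma$, where we set $\mathbf{a}' = R\mathbf{a}$ and $\mathbf{w} = R\mathbf{s}$ (fixed for all samples). For an arbitrary rotation $R$, since the distribution of $\mathbf{a}$ is spherically symmetric, we have $\mathbf{a}' = R\mathbf{a} \sim (D_\tau)^n$, independently of $R$. For a random rotation matrix $R$ and arbitrary $\mathbf{s}$, the vector $\mathbf{w} = R\mathbf{s}$ is a uniformly random unit vector in $\mathbb{R}^n$. Since this holds for arbitrary $\mathbf{s}$, it also holds when averaging over the distribution $\mathbf{s} \sim \mathcal{S}$.

If $(\mathbf{a}, b)$ is drawn from $D_\tau^{n \times m} \times U(\mathbb{T}_q^m)$, then $(R\mathbf{a}, b)$ is distributed identically to $(\mathbf{a}, b)$, since the distribution of $\mathbf{a}' = R\mathbf{a}$ is spherically symmetric. Thus, the reduction maps the distributions perfectly.

Step 6: Going from mod $q$ to mod 1. Now, we divide $\mathbf{a}$ by $\tau$, multiply $\gamma$ by $\tau/q$, and divide $e$ by $q$ to finally reduce to decisional CLWE as defined in [BRST21]. To be precise:

Lemma 15. Suppose there is a $T$-time distinguisher between the distributions $\mathrm{CLWE}(m, D_1^n, \gamma', \sigma/q)$ and $D_1^{n \times m} \times U(\mathbb{T}^m)$, where $\gamma' = \gamma \cdot \tau/q$.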
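Then, there is a $T + \mathrm{poly}(n, m, q, \lambda)$ time distinguisher between the distributions $\mathrm{CLWE}(m, q, D_\tau^n, \gamma, \sigma)$ and $D_\tau^{n \times m} \times U(\mathbb{T}_q^m)$.

Proof. The reduction follows by rescaling the samples appropriately.

Now, we are ready to give a proof of Theorem 5. (See Figure 1 for a sketch.)

Proof of Theorem 5. Throughout this proof, when we say advantage, we omit additive $\mathrm{negl}(\lambda)$ terms. Suppose there is no $T + \mathrm{poly}(n, m, m_1, q, \lambda)$ time distinguisher with advantage $\epsilon/O(m_1)$ between $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathbb{Z}_q^n, D_{\mathbb{Z},\sigma})$ and $U(\mathbb{Z}_q^{n \times m} \times \mathbb{Z}_q^m)$.

Then, by Theorem 6, there is no $T + \mathrm{poly}(n, m, m_1, q, \lambda)$-time distinguisher between $\mathrm{LWE}(m_1, \mathbb{Z}_q^m, \{+1,-1\}^m, D_{\mathbb{Z},\sigma_1})$ and $U(\mathbb{Z}_q^{m \times m_1} \times \mathbb{Z}_q^{m_1})$ with advantage $\epsilon$, where $\sigma_1 = 2\sigma\sqrt{m}$, and all other sufficient conditions are met by the hypotheses of the theorem. Note that we are setting $n_1 = m$.

Then, by Lemma 10, there is no $T + \mathrm{poly}(n, m, m_1, q, \lambda)$-time distinguisher between $\mathrm{LWE}(m_1, \mathbb{Z}_q^m, \{+1,-1\}^m, D_{\sigma_2})$ and $U(\mathbb{Z}_q^{m \times m_1} \times \mathbb{T}_q^{m_1})$ with advantage $\epsilon$, where $\sigma_2^2 = \sigma_1^2 + 4\ln m_1 + \omega(\log \lambda)$. Note that $\sigma_1^2 = 4\sigma^2 m \geq 4m \cdot C^2(\ln m_1 + \omega(\log \lambda)) > 4\ln m + \omega(\log \lambda)$, as needed.

Then, by Lemma 11, there is no $T + \mathrm{poly}(n, m, m_1, q, \lambda)$-time distinguisher between $\mathrm{LWE}(m_1, \mathbb{T}_q^m, \{+1,-1\}^m, D_{\sigma_3})$ and $U(\mathbb{T}_q^{m \times m_1} \times \mathbb{T}_q^{m_1})$ with advantage $\epsilon$, where $\sigma_3^2 = \sigma_2^2 + 9m(\ln m + \ln m_1 + \omega(\log \lambda))$, as long as $\sigma_2 \geq 3\sqrt{m} \cdot \sqrt{\ln m + \ln m_1 + \omega(\log \lambda)}$, which we are given, since

$$\sigma_2^2 > \sigma_1^2 = 4\sigma^2 \cdot m \geq C^2(\ln m_1 + \omega(\log \lambda)) \cdot m,$$

where $C$ is chosen to be a sufficiently large constant.

Then, by Lemma 13, there is no $T + \mathrm{poly}(n, m, m_1, q, \lambda)$-time distinguisher between

$$\mathrm{LWE}(m_1, q, D_\tau^m, \{+q,-q\}^m, D_{\sigma_3}) = \mathrm{CLWE}\left(m_1, q, D_\tau^m, \left\{\frac{1}{\sqrt{m}}, -\frac{1}{\sqrt{m}}\right\}^m, \sigma_3, \gamma_0\right)$$

and $D_\tau^{m \times m_1} \times U(\mathbb{T}_q^{m_1})$ with advantage $\epsilon$, for $\gamma_0 = \sqrt{m} \cdot q$ and $\tau = \sqrt{\ln m + \ln m_1 + \omega(\log \lambda)}$.

Then, by Lemma 14, there is no $T + \mathrm{poly}(n, m, m_1, q, \lambda)$-time distinguisher between $\mathrm{CLWE}(m_1, q, D_\tau^m, \sigma_3, \gamma_0)$ and $D_\tau^{m \times m_1} \times U(\mathbb{T}_q^{m_1})$ with advantage $\epsilon$.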
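Lastly, by Lemma 15, there is no $T$-time distinguisher between $\mathrm{CLWE}(m_1, D_1^m, \beta, \gamma)$ and $D_1^{m \times m_1} \times U(\mathbb{T}^{m_1})$ with advantage $\epsilon$, where $\gamma = \gamma_0 \cdot \tau/q = \sqrt{m} \cdot \tau$ and $\beta = \sigma_3/q$. Unraveling parameters, we have

$$\gamma = \sqrt{m} \cdot \tau = O\left(\sqrt{m(\ln m_1 + \omega(\log \lambda))}\right),$$

and

$$\beta^2 = \frac{\sigma_3^2}{q^2} = O\left(\frac{\sigma_2^2 + m\ln m_1 + m \cdot \omega(\log \lambda)}{q^2}\right) = O\left(\frac{\sigma_1^2 + m\ln m_1 + m \cdot \omega(\log \lambda)}{q^2}\right) = O\left(m \cdot \frac{\sigma^2 + \ln m_1 + \omega(\log \lambda)}{q^2}\right),$$

as desired.

4 Hardness of k-sparse LWE

In this section, we reduce from standard LWE to a version where secrets are sparse, in the sense that they have few non-zero entries.

Definition 10. For $k, n \in \mathbb{N}$ with $k \leq n$, let $S_{n,k}$ be the subset of vectors in $\{-1, 0, +1\}^n$ with exactly $k$ non-zero entries. We call $\mathbf{s} \in \mathbb{Z}^n$ $k$-sparse if $\mathbf{s} \in S_{n,k}$.

Lemma 16. We have $\tilde{H}_\infty(S_{n,k}) \geq k\log_2(n/k)$.

Proof. Observe that $|S_{n,k}| = \binom{n}{k} \cdot 2^k$. Using the bound $(n/k)^k \leq \binom{n}{k}$, we have

$$\tilde{H}_\infty(S_{n,k}) \geq \log_2\left(\left(2 \cdot \frac{n}{k}\right)^k\right) \geq k\log_2(n/k),$$

as desired.

Micciancio [Mic18] gave a simplified proof of hardness for $n$-sparse secrets (i.e. binary secrets with entries in $\{+1,-1\}$), and we show that his result extends, with slight modification, to the $k$-sparse setting in a natural way. Explicitly, we have the following.

Theorem 7. Let $q, m, n, \ell, k \in \mathbb{N}$ with $1 < k < n$, and let $\sigma, \epsilon \in \mathbb{R}_{>0}$. Suppose $\log(q)/2^\ell = \mathrm{negl}(\lambda)$, $\sigma \geq 4\sqrt{\omega(\log \lambda) + \ln n + \ln m}$, and $k\log_2(n/k) \geq (\ell+1)\log_2(q) + \omega(\log \lambda)$. Suppose there is no $T + \mathrm{poly}(n, m, q, \lambda)$-time distinguisher with advantage $\epsilon$ between $\mathrm{LWE}(n-1, \mathbb{Z}_q^\ell, \mathbb{Z}_q^{m \times \ell}, D_{\mathbb{Z}^m,\sigma})$ and $U(\mathbb{Z}_q^{\ell \times (n-1)} \times \mathbb{Z}_q^{m \times (n-1)})$, and further suppose there is no $T$ time distinguisher with advantage $\epsilon$ between $\mathrm{LWE}(n+1, \mathbb{Z}_q^{\ell+1}, \mathbb{Z}_q^{m \times (\ell+1)}, D_{\mathbb{Z}^m,2\sigma})$ and $U(\mathbb{Z}_q^{(\ell+1) \times (n+1)} \times \mathbb{Z}_q^{m \times (n+1)})$. Then, there is no $T$ time distinguisher with advantage $2\epsilon$ (up to additive $\mathrm{negl}(\lambda)$ factors) between $\mathrm{LWE}(m, \mathbb{Z}_q^n, S_{n,k}, D_{\mathbb{Z},\sigma'})$ and $U(\mathbb{Z}_q^{m \times n} \times \mathbb{Z}_q^m)$, where $\sigma' = 2\sigma\sqrt{k+1}$.

Definition 11. Let $n, k \in \mathbb{Z}$ with $k \leq n$. For all $i \in [n]$, we define $\mathbf{e}_i$ to be the $i$th standard basis column vector, i.e. having a 1 in the $i$th coordinate and 0s elsewhere.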
We then define $\mathbf{u} \in \mathbb{Z}^n$ to be $\mathbf{u} = \sum_{i=1}^{k} \mathbf{e}_i$, i.e. 1s in the first $k$ coordinates and 0s elsewhere.

Lemma 17. There is a $\mathrm{poly}(n)$-time computable matrix $Q \in \mathbb{Z}^{n \times (2n+5)}$ such that $Q_{[n]}$ is invertible, $\mathbf{u}^\top Q_{[n]} = \mathbf{e}_1^\top$, the vector $\mathbf{v}^\top = \mathbf{u}^\top Q_{]n[} \in \mathbb{Z}^{n+5}$ satisfies $\|\mathbf{v}\|_2 = 2\sqrt{k}$ and $\|\mathbf{v}\|_\infty = 2$, and $Q_{]1[}(D_{\mathbb{Z}^{2n+4},\sigma})$ and $D_{\mathbb{Z}^n,2\sigma}$ are $\mathrm{negl}(\lambda)/t$ close as long as $\sigma \geq \sqrt{6} \cdot \sqrt{\omega(\log \lambda) + \ln n + \ln t}$ for a parameter $t$.

Proof. We use essentially the same gadget $Q$ as in Lemma 2.7 of [Mic18], except that we modify two entries of the matrix and add two columns. Specifically, we set $Q_{k,k+1} = 0$ (instead of $-1$), $Q_{k,n+k+1} = 0$ (instead of 1), and add two columns to the end that are all 0 except for two entries of 1 in $Q_{k,2n+4}$ and $Q_{k,2n+5}$. We give it explicitly as follows. Let the matrix $X \in \mathbb{Z}^{n \times (n-1)}$ be defined by

$$X = \begin{pmatrix}
-1 & & & & & & & \\
1 & -1 & & & & & & \\
 & \ddots & \ddots & & & & & \\
 & & 1 & -1 & & & & \\
 & & & 1 & 0 & & & \\
 & & & & 1 & -1 & & \\
 & & & & & \ddots & \ddots & \\
 & & & & & & 1 & -1 \\
 & & & & & & & 1
\end{pmatrix},$$

where the row with the abnormal 0 is the $k$th row. Similarly, let $Y \in \mathbb{Z}^{n \times (n-1)}$ be defined by

$$Y = \begin{pmatrix}
1 & & & & & & & \\
1 & 1 & & & & & & \\
 & \ddots & \ddots & & & & & \\
 & & 1 & 1 & & & & \\
 & & & 1 & 0 & & & \\
 & & & & 1 & 1 & & \\
 & & & & & \ddots & \ddots & \\
 & & & & & & 1 & 1 \\
 & & & & & & & 1
\end{pmatrix},$$

where the row with the abnormal 0 is again the $k$th row. We then define $Q \in \mathbb{Z}^{n \times (2n+5)}$ by

$$Q = [\mathbf{e}_1, X, -\mathbf{e}_n, Y, \mathbf{e}_n, \mathbf{e}_1, \mathbf{e}_1, \mathbf{e}_k, \mathbf{e}_k].$$

First, notice that $Q_{[n]}$ is invertible, since it is upper-triangular with 1s on the diagonal. Next, notice that $\mathbf{u}^\top Q_{[n]} = \mathbf{e}_1^\top$, as $\mathbf{u}^\top \mathbf{e}_1 = 1$ and the sum of the first $k$ entries in each column of $X$ is 0 by construction. We can write

$$\mathbf{v}^\top = \mathbf{u}^\top Q_{]n[} = [0, 2, 2, \cdots, 2, 0, \cdots, 0, 1, 1, 1, 1],$$

which has $\ell_2$ norm

$$\sqrt{(k-1) \cdot 2^2 + 4 \cdot 1^2} = 2\sqrt{k}.$$

It is also clear that $\|\mathbf{v}\|_\infty = 2$. All that remains is to show that $Q_{]1[}(D_{\mathbb{Z}^{2n+4},\sigma})$ and $D_{\mathbb{Z}^n,2\sigma}$ are $\mathrm{negl}(\lambda)/t$-close. To do so, we first verify the preconditions of Lemma 8 and then invoke it. Let $T = Q_{]1[} \in \mathbb{Z}^{n \times (2n+4)}$. First, we show that $T$ is primitive.
It suffices to show that for every standard basis column vector ei, there is some g 2n+4i ∈ Z such that ei = Tgi. For all j ∈ [2n + 4], we define fj to be the jth standard basis column vector in R2n+4. Let g1 = f2n+1, and gk+1 = fk. It can be easily checked that e1 = Tg1 and ek+1 = Tgk+1. Then, for all i such that 1 < i ≤ k and k + 1 < i ≤ n, let gi = fi−1 + gi−1. Using an inductive argument, and by the construction of T , it follows that Tgi = T (fi−1 + gi−1) = T fi−1 + Tgi−1 = (ei − ei−1) + ei−1 = ei. It is easy to check that TT> = 4I. Finally, we bound the smoothing parameter of the lattice Λ = ker(T ). Since T ∈ Zn×(2n+4) and T has full rank, its kernel Λ has dimension n + 4. The columns of the following matrix give a basis for the lattice Λ.   Ỹ e1 −ek−1    −X̃ −e1 −ek−1     1 1  V = ∈ Z(2n+4)×(n+4)  ,  1 −1    −Z̃k−1 1 1  −Z̃k−1 1 −1 where we define   −1  1 −1    X̃ = n×n .  . . . .  ∈ Z , .  1 −1   1 1 1    Ỹ = n×n . .  . . .  ∈ Z , and .  1 1 [ ] Z̃k−1 = 0 . . . 0 1 0 . . . 0 ∈ Z1×n. 18 Here Z̃k−1 is the zero matrix except for the (k − 1)th column which has a 1 entry. By direct computation, it is easy to see that the columns of V lie in ker(T ). To see that V is a basis for ker(T ), we can show that its columns are linearly independent by constructing a matrix W ∈ Z(n+4)×(2n+4) such that WV = 2I(n+4)×(n+4). Indeed, we can do so in the following way. We can first define matrices     1 1  . .  . .     . .       1   1      I =  0 1  ∈ Zn×n, I =  + 0 −1 ∈ Zn×n,  −    1      1   . .   . .  .   .  1 1 where the abnormal row is the (k − 1)th row, and then define   I+ I−  1 1    W =  1 −1  ∈ Z(n+4)×(2n+4),   Z̃ k −Z̃k 1 1 1 −1 where similarly to before, Z̃ ∈ Z1×nk is the one-hot vector with a 1 in the kth column. It is straightforward to verify that WV = 2I(n+4)×(n+4), showing that the columns of V are linearly independent. 
By looking at the columns of $V$, we have $\lambda_{n+4}(\Lambda) \leq \sqrt{6}$, so by Lemma 2, we have $\eta_\epsilon(\Lambda) \leq \sqrt{6} \cdot \sqrt{\omega(\log \lambda) + \ln n + \ln t} \leq \sigma$, where we set $\epsilon = \mathrm{negl}(\lambda)/t$. Therefore, by Lemma 8, we get that $Q_{]1[}(D_{\mathbb{Z}^{2n+4},\sigma})$ and $D_{\mathbb{Z}^n,2\sigma}$ are $\mathrm{negl}(\lambda)/t$ close if $\sigma \geq \sqrt{6} \cdot \sqrt{\omega(\log \lambda) + \ln n + \ln t}$.

Lemma 18. There is a $\mathrm{poly}(n)$ time algorithm that on input $\mathbf{z} \in S_{n,k}$ outputs a matrix $Z \in \mathbb{Z}^{n \times n}$ (as a function of $\mathbf{z}$) that satisfies the following properties:

• $Z$ is a permutation matrix with signs, i.e. a permutation matrix where the non-zero entries could be $\pm 1$ instead of just 1,
• $Z = Z^\top = Z^{-1}$, and
• $Z\mathbf{z} = \mathbf{u}$.

Proof. We can define $Z$ as follows. Let

$$T_{\leq k} = \{i \in [k] : z_i \neq 0\}, \quad T_{>k} = \{i \in [n] \setminus [k] : z_i \neq 0\}, \quad T^*_{\leq k} = \{i \in [k] : z_i = 0\}, \quad T^*_{>k} = \{i \in [n] \setminus [k] : z_i = 0\}.$$

Intuitively, $T_{\leq k}$ and $T_{>k}$ partition the non-zero coordinates of $\mathbf{z}$ based on whether they lie in the first $k$ coordinates, and $T^*_{\leq k}$ and $T^*_{>k}$ partition the zero coordinates of $\mathbf{z}$ in the same way. Note that by $k$-sparsity of $\mathbf{z}$, we have

$$|T_{>k}| = k - |T_{\leq k}| = |[k] \setminus T_{\leq k}| = |T^*_{\leq k}|.$$

Therefore, we can choose an arbitrary bijection $f : T_{>k} \to T^*_{\leq k}$. For all $i \in T_{\leq k}$, we set $Z_{i,i} = z_i \in \{+1,-1\}$. For all $i \in T^*_{>k}$, we set $Z_{i,i} = 1$. For all $i \in T_{>k}$, we set $Z_{f(i),i} = z_i \in \{+1,-1\}$ and $Z_{i,f(i)} = Z_{f^{-1}(f(i)),f(i)} = z_i \in \{+1,-1\}$. We set all other entries of $Z$ to be 0. It is clear from this definition that $Z = Z^\top$.

First, observe that $Z$ is a signed permutation matrix. For all $i \in T_{\leq k} \cup T^*_{>k}$, $Z$ is the identity map up to signs (on basis vectors $\mathbf{e}_i$), and for all $i \in T_{>k}$, $Z$ consists of signed transpositions $Z\mathbf{e}_i = z_i\mathbf{e}_{f(i)}$ and $Z\mathbf{e}_{f(i)} = z_i\mathbf{e}_{f^{-1}(f(i))} = z_i\mathbf{e}_i$. Therefore, $Z$ is a signed permutation matrix, and furthermore we have also shown $Z^2 = I_{n \times n}$. Therefore, $Z = Z^{-1}$.

Lastly, we show $Z\mathbf{z} = \mathbf{u}$. We can decompose $\mathbf{z}$ as $\mathbf{z} = \mathbf{z}_{\leq k} + \mathbf{z}_{>k}$ in the natural way by considering the non-zero coordinates of $\mathbf{z}$ on $[k]$ and $[n] \setminus [k]$ respectively. We then have

$$Z\mathbf{z} = Z(\mathbf{z}_{\leq k} + \mathbf{z}_{>k}) = Z\mathbf{z}_{\leq k} + Z\mathbf{z}_{>k} = \mathbf{1}_{T_{\leq k}} + \mathbf{1}_{T^*_{\leq k}} = \mathbf{u},$$

as desired.

Definition 12.
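We define a randomized mapping $\varphi$ as follows. Let $Q$ be as defined in Lemma 17. We sample $\mathbf{z} \sim S_{n,k}$, $\mathbf{s} \sim \mathbb{Z}_q^m$, $\mathbf{a} \sim \mathbb{Z}_q^{n-1}$, $\mathbf{e} \sim D_{\mathbb{Z}^m,2\sigma}$, $G \sim D_{\mathbb{Z}^{m \times (n+5)},\sigma}$. Let $Z \in \mathbb{Z}^{n \times n}$ be as defined in Lemma 18 as a function of $\mathbf{z}$. On input $B \in \mathbb{Z}_q^{m \times (n-1)}$, we define

$$\varphi(B; \mathbf{z}, \mathbf{s}, \mathbf{a}, \mathbf{e}, G) = \left[\left[\mathbf{s}, \mathbf{s} \cdot \mathbf{a}^\top + B, G\right] Q^\top Z, \; \mathbf{s} + \mathbf{e}\right].$$

First, we show that $\varphi$ maps $B \sim U(\mathbb{Z}_q^{m \times (n-1)})$ to $\mathrm{LWE}(m, \mathbb{Z}_q^n, S_{n,k}, D_{\mathbb{Z},\sigma'})$.

Lemma 19. Assume the same hypotheses as Theorem 7. For $B \sim U(\mathbb{Z}_q^{m \times (n-1)})$, we have that $\varphi(B)$ and $\mathrm{LWE}(m, \mathbb{Z}_q^n, S_{n,k}, D_{\mathbb{Z},\sigma'})$ are $\mathrm{negl}(\lambda)$-close.

Proof. We fix $\mathbf{a} \in \mathbb{Z}_q^{n-1}$ and $\mathbf{z} \in S_{n,k}$, and we argue that $\varphi(B)$ maps to $\mathrm{LWE}(m, \mathbb{Z}_q^n, \mathbf{z}, D_{\mathbb{Z},\sigma'})$, i.e. the LWE distribution with secret $\mathbf{z}$. Averaging over $\mathbf{a}$ and $\mathbf{z}$ gives the desired result.

First, we show that $X = \left[\mathbf{s}, \mathbf{s} \cdot \mathbf{a}^\top + B, G\right] Q^\top Z$ looks uniform. By construction, $[\mathbf{s}, \mathbf{s} \cdot \mathbf{a}^\top + B]$ has distribution $U(\mathbb{Z}_q^{m \times n})$, by using the independent randomness of $\mathbf{s}$ and $B$. We can write

$$X = [\mathbf{s}, \mathbf{s} \cdot \mathbf{a}^\top + B] Q_{[n]}^\top Z + G Q_{]n[}^\top Z.$$

Since $Q_{[n]}$ and $Z$ are invertible, by a one-time pad argument, we have $X \sim U(\mathbb{Z}_q^{m \times n})$, independent of $G$ and $\mathbf{e}$.

Now, we have to argue that the conditional distribution of $\mathbf{x} = \mathbf{s} + \mathbf{e}$ is equal to $X\mathbf{z} + \mathbf{e}'$ for some Gaussian noise $\mathbf{e}'$. We can directly write

$$\begin{aligned}
\mathbf{x} - X\mathbf{z} &= \mathbf{s} + \mathbf{e} - \left([\mathbf{s}, \mathbf{s} \cdot \mathbf{a}^\top + B] Q_{[n]}^\top Z + G Q_{]n[}^\top Z\right)\mathbf{z} \\
&= \mathbf{s} + \mathbf{e} - [\mathbf{s}, \mathbf{s} \cdot \mathbf{a}^\top + B] Q_{[n]}^\top \mathbf{u} - G Q_{]n[}^\top \mathbf{u} \\
&= \mathbf{s} + \mathbf{e} - [\mathbf{s}, \mathbf{s} \cdot \mathbf{a}^\top + B] \mathbf{e}_1 - G\mathbf{v} \\
&= \mathbf{e} - G\mathbf{v},
\end{aligned}$$

where we use the facts that $Z\mathbf{z} = \mathbf{u}$, $\mathbf{u}^\top Q_{[n]} = \mathbf{e}_1^\top$ and $\mathbf{u}^\top Q_{]n[} = \mathbf{v}^\top$.

For all $j \in [m]$, let $\mathbf{g}_j \in \mathbb{Z}^{n+5}$ be the $j$th row of $G$. For each entry (row) $\tilde{e}_j$ of $\mathbf{e} - G\mathbf{v}$, we can write $\tilde{e}_j = e_j - \mathbf{g}_j^\top \mathbf{v} = \langle [e_j, \mathbf{g}_j], [1, -\mathbf{v}] \rangle$ and apply Lemma 6 with the vector $\mathbf{v}' = [1, -\mathbf{v}]$ to argue that $\tilde{e}_j$ is $O(\epsilon)$-close to $D_{\mathbb{Z},\sigma'}$ with

$$\sigma' = \sqrt{(2\sigma)^2 + \sum_{i \in [n+5]} (\sigma v_i)^2} = \sigma\sqrt{4 + \|\mathbf{v}\|_2^2} = 2\sigma\sqrt{k+1},$$

as long as $\sigma \geq \sqrt{2}\,\|\mathbf{v}\|_\infty\,\eta_{\epsilon/(2(n+6))}(\mathbb{Z})$. Now, using the triangle inequality over all $m$ rows to get overall statistical distance $\mathrm{negl}(\lambda)$, we can set $\epsilon = \mathrm{negl}(\lambda)/m$, for which

$$\sigma \geq \sqrt{2} \cdot 2 \cdot \eta_{\mathrm{negl}(\lambda)/(mn)}(\mathbb{Z})$$

is sufficient. By Lemma 2, this holds as long as $\sigma \geq 4\sqrt{\ln m + \ln n + \omega(\log \lambda)}$, which we are given.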
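The width computation in the proof above relies on the identities for the gadget $Q$ stated in Lemma 17. As a sanity check, the following sketch builds $Q = [\mathbf{e}_1, X, -\mathbf{e}_n, Y, \mathbf{e}_n, \mathbf{e}_1, \mathbf{e}_1, \mathbf{e}_k, \mathbf{e}_k]$ for small $n$ and $k$ and verifies $\mathbf{u}^\top Q_{[n]} = \mathbf{e}_1^\top$, $\|\mathbf{v}\|_2^2 = 4k$, $\|\mathbf{v}\|_\infty = 2$, $TT^\top = 4I$ and $\sigma' = 2\sigma\sqrt{k+1}$. The helper names (`build_gadget`, `gadget_facts`, `sparse_width`) are illustrative and not from the paper, and the construction assumes the reading of $X$ and $Y$ in which column $j$ has a 1 in row $j+1$ and $\mp 1$ in row $j$, with the row-$k$ entry of column $k$ zeroed.

```python
import math


def build_gadget(n, k):
    """Return Q (n rows, 2n+5 columns) as nested lists, assuming 1 < k < n."""
    def e(i):
        # Standard basis column e_i (1-indexed) as a list of length n.
        return [1 if r == i else 0 for r in range(1, n + 1)]

    def band(sign):
        # Columns j = 1..n-1 with a 1 in row j+1 and `sign` in row j,
        # except that the row-k entry of column k is replaced by 0.
        cols = []
        for j in range(1, n):
            col = [0] * n
            col[j] = 1                 # row j+1 (0-indexed position j)
            if j != k:
                col[j - 1] = sign      # row j
            cols.append(col)
        return cols

    minus_en = [-x for x in e(n)]
    cols = [e(1)] + band(-1) + [minus_en] + band(+1) + [e(n), e(1), e(1), e(k), e(k)]
    return [[col[r] for col in cols] for r in range(n)]  # transpose columns into rows


def gadget_facts(n, k):
    """Return (u^T Q_[n], v = u^T Q_]n[, Gram matrix T T^T with T = Q_]1[)."""
    Q = build_gadget(n, k)
    u = [1] * k + [0] * (n - k)
    uQ = [sum(u[r] * Q[r][c] for r in range(n)) for c in range(2 * n + 5)]
    T = [row[1:] for row in Q]
    gram = [[sum(a * b for a, b in zip(T[i], T[j])) for j in range(n)] for i in range(n)]
    return uQ[:n], uQ[n:], gram


def sparse_width(sigma, v):
    # Width of e_j - <g_j, v> when e_j has width 2*sigma and each entry of g_j has width sigma.
    return math.sqrt((2 * sigma) ** 2 + sum((sigma * vi) ** 2 for vi in v))


if __name__ == "__main__":
    first, v, gram = gadget_facts(7, 3)
    print(first, v, gram[0], sparse_width(1.5, v))
```

Such a check only confirms the algebraic identities for specific small parameters; it says nothing about the statistical closeness claims, which rest on Lemmas 6 and 8.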
Next, we show that $\varphi$ maps standard LWE (with matrices as secrets) to standard LWE in slightly different dimensions, very much following the proof of Claim 3.3 of [Mic18].

Lemma 20. Assume the same hypotheses as Theorem 7. Let $\mathcal{D}_1$ denote the distribution of $SA + E \pmod q$, where $A \sim U(\mathbb{Z}_q^{\ell \times (n-1)})$, $S \sim U(\mathbb{Z}_q^{m \times \ell})$, $E \sim D_{\mathbb{Z},\sigma}^{m \times (n-1)}$. Let $\mathcal{D}_2$ denote the distribution of $\hat{S}\hat{A} + \hat{E} \pmod q$, where $\hat{A} \sim U(\mathbb{Z}_q^{(\ell+1) \times (n+1)})$, $\hat{S} \sim U(\mathbb{Z}_q^{m \times (\ell+1)})$, $\hat{E} \sim D_{\mathbb{Z},2\sigma}^{m \times (n+1)}$. Then, $\varphi(\mathcal{D}_1)$ is $\mathrm{negl}(\lambda)$-close to $\mathcal{D}_2$.

The proof goes exactly as in Claim 3.3 of [Mic18]. The only differences are in our matrices $Q$, $Z$, and our distribution of secrets $\mathbf{z} \sim S_{n,k}$. The full differences are as follows.

• While our $Z$ is different, since $Z = Z^\top$ is a permutation matrix with signs, it still holds that $Z \cdot D_{\mathbb{Z},2\sigma}^n = D_{\mathbb{Z},2\sigma}^n$ due to symmetry.
• We have that $Q_{]1[}(D_{\mathbb{Z},\sigma}^{2n+4})$ is $\mathrm{negl}(\lambda)/m$-close to $D_{\mathbb{Z},2\sigma}^n$ by Lemma 17.
• The probability that $\mathbf{w}$ (in their notation) is not primitive is at most $\log(q)/2^\ell = \mathrm{negl}(\lambda)$, as desired.
• When applying the leftover hash lemma (Lemma 1), the min-entropy of $\mathbf{z} \sim S_{n,k}$ is now $k\log_2(n/k)$. Thus, we require $k\log_2(n/k) \geq (\ell+1)\log_2(q) + \omega(\log \lambda)$ instead of $n \geq (\ell+1)\log_2(q) + \omega(\log m)$.

For completeness, we provide a self-contained proof, exactly following Claim 3.3 of [Mic18].

Proof of Lemma 20. Let $B \sim \mathcal{D}_1$. Let $Y = [\mathbf{s}, \mathbf{s}\mathbf{a}^\top + B]$. By linearity, we can decompose $Y$ as $Y = Y_s + Y_e$, where $Y_s = [\mathbf{s}, \mathbf{s}\mathbf{a}^\top + SA]$ and $Y_e = [\mathbf{0}, E]$. Similarly, we can write

$$\varphi(B) = \left[\left[\mathbf{s}, \mathbf{s} \cdot \mathbf{a}^\top + B, G\right] Q^\top Z, \; \mathbf{s} + \mathbf{e}\right] = [X_s, \mathbf{s}] + [X_e, \mathbf{e}],$$

where $X_s = Y_s Q_{[n]}^\top Z$ and $X_e = [Y_e, G] Q^\top Z = [E, G] Q_{]1[}^\top Z$. Our goal is now to show that $[X_s, \mathbf{s}]$ is statistically close to $\hat{S}\hat{A}$, and that $[X_e, \mathbf{e}]$ is statistically close to $\hat{E}$, where $\hat{S}\hat{A} + \hat{E}$ is a sample from $\mathcal{D}_2$. If this holds, then $\varphi(B)$ is statistically close to $\hat{S}\hat{A} + \hat{E}$, which completes the proof.

First, let us look at $[X_e, \mathbf{e}]$. Note that $\mathbf{e}$ is a discrete Gaussian vector of width $2\sigma$ independent of everything else, so the last column has the desired distribution.
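Furthermore, note that $E$ and $G$ have entries that are discrete Gaussian of width $\sigma$, so $[E, G] \sim D_{\mathbb{Z},\sigma}^{m \times (2n+4)}$. By Lemma 17, setting $t = m$, we can use the triangle inequality over all $m$ rows to get that $[E, G] Q_{]1[}^\top$ is $\mathrm{negl}(\lambda)$ close to $D_{\mathbb{Z},2\sigma}^{m \times n}$ as long as $\sigma \geq \sqrt{6} \cdot \sqrt{\omega(\log \lambda) + \ln n + \ln m}$. Since $Z$ is a signed permutation, by symmetry, we then know that $X_e = [E, G] Q_{]1[}^\top Z$ is $\mathrm{negl}(\lambda)$ close to $D_{\mathbb{Z},2\sigma}^{m \times n}$, and thus $[X_e, \mathbf{e}]$ is $\mathrm{negl}(\lambda)$ close to $D_{\mathbb{Z},2\sigma}^{m \times (n+1)}$, which is the same distribution as $\hat{E}$. Note that this depends only on $\mathbf{e}$, $G$, $E$.

To finish, we look at $[X_s, \mathbf{s}]$. We now define

$$\hat{S} = [\mathbf{s}, S] W^{-1} \in \mathbb{Z}_q^{m \times (\ell+1)},$$

where $W$ is a uniformly random invertible matrix over $\mathbb{Z}_q^{(\ell+1) \times (\ell+1)}$. Since $W$ is invertible, using the randomness of $S$ and $\mathbf{s}$, $\hat{S}$ is uniformly random independently of $W$. Next, we define

$$\hat{A} = W H Q_{[n]}^\top Z^\top [I_{n \times n}, \mathbf{z}] \in \mathbb{Z}_q^{(\ell+1) \times (n+1)},$$

where

$$H = \begin{bmatrix} 1 & \mathbf{a}^\top \\ \mathbf{0} & A \end{bmatrix} \in \mathbb{Z}_q^{(\ell+1) \times n}.$$

Note that we have the identity $Q_{[n]}^\top Z^\top \mathbf{z} = Q_{[n]}^\top Z\mathbf{z} = Q_{[n]}^\top \mathbf{u} = \mathbf{e}_1$ by Lemmas 18 and 17, as well as the identity $\hat{S} W H = [\mathbf{s}, S] H = Y_s$. Therefore,

$$\hat{S}\hat{A} = \hat{S} W H Q_{[n]}^\top Z^\top [I_{n \times n}, \mathbf{z}] = Y_s Q_{[n]}^\top Z^\top [I_{n \times n}, \mathbf{z}] = [Y_s Q_{[n]}^\top Z, Y_s \mathbf{e}_1] = [X_s, \mathbf{s}],$$

as desired.

Now, we have to show that $\hat{S}$ and $\hat{A}$ have the correct distributions. We have already shown that $\hat{S}$ has the correct distribution (only depending on $S$ and $\mathbf{s}$), so it suffices to show that $\hat{A}$ has the correct distribution given $S$ and $\mathbf{s}$, using the randomness of $A$, $\mathbf{a}$, $W$ and $\mathbf{z}$.

First, let us look at the matrix $WH$. Let $\mathbf{w}$ be the first column of $W$. The first column of $WH$ will be exactly $\mathbf{w}$. Since $W$ is a uniformly random invertible matrix, $\mathbf{w}$ is distributed uniformly among all primitive vectors in $\mathbb{Z}_q^{\ell+1}$, i.e. so that $\gcd(\mathbf{w}, q) = 1$. By Lemma 7, as long as $\log(q)/2^\ell = \mathrm{negl}(\lambda)$, which we have assumed, the distribution of $\mathbf{w}$ is $\mathrm{negl}(\lambda)$-close to uniform over $\mathbb{Z}_q^{\ell+1}$. The remaining columns of $WH$ will be $W \begin{bmatrix} \mathbf{a}^\top \\ A \end{bmatrix}$, which by the uniform randomness of $\mathbf{a}$ and $A$, and the invertibility of $W$, will be uniformly random and independent of $\mathbf{w}$. Therefore, $WH \in \mathbb{Z}_q^{(\ell+1) \times n}$ is $\mathrm{negl}(\lambda)$-close to uniformly random.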
Now, since $Q_{[n]}^\top$ and $Z^\top$ are invertible, we have that $WHQ_{[n]}^\top Z^\top$ is $\mathrm{negl}(\lambda)$-close to uniform, independently of $\mathbf{z}$. Let $A' = WHQ_{[n]}^\top Z^\top$. Note that $\hat{A} = A'[I_{n \times n}, \mathbf{z}] = [A', A'\mathbf{z}]$. Applying the leftover hash lemma (Lemma 1) and Lemma 16, since $k\log_2(n/k) \geq (\ell+1)\log_2(q) + \omega(\log \lambda)$, we know $\hat{A}$ is $\mathrm{negl}(\lambda)$-close to uniform, independently of $\hat{S}$ and $\hat{E}$. This completes the proof that $\varphi(\mathcal{D}_1)$ and $\mathcal{D}_2$ are $\mathrm{negl}(\lambda)$-close.

With the above claims, we are ready to prove the main theorem of this section.

Proof of Theorem 7. We will show the contrapositive. Suppose we have a $T$-time distinguisher between $\mathrm{LWE}(m, \mathbb{Z}_q^n, S_{n,k}, D_{\mathbb{Z},\sigma'})$ and $U(\mathbb{Z}_q^{m \times n} \times \mathbb{Z}_q^m)$ with advantage $2\epsilon$. We have two cases.

Suppose that this distinguisher distinguishes between $U(\mathbb{Z}_q^{m \times n} \times \mathbb{Z}_q^m) = U(\mathbb{Z}_q^{m \times (n+1)})$ and $\mathcal{D}_2$ as given in Lemma 20, with advantage $\epsilon$. Then, we have a $T$ time distinguisher between $\mathrm{LWE}(n+1, \mathbb{Z}_q^{\ell+1}, \mathbb{Z}_q^{m \times (\ell+1)}, D_{\mathbb{Z}^m,2\sigma})$ and $U(\mathbb{Z}_q^{(\ell+1) \times (n+1)} \times \mathbb{Z}_q^{m \times (n+1)})$, where we simply discard the samples, i.e. the first part in $\mathbb{Z}_q^{(\ell+1) \times (n+1)}$ (the matrix $\hat{A}$).

Now, for the second case, suppose that this distinguisher does not distinguish between $U(\mathbb{Z}_q^{m \times n} \times \mathbb{Z}_q^m) = U(\mathbb{Z}_q^{m \times (n+1)})$ and $\mathcal{D}_2$ with advantage $\epsilon$. Then, we have a $T$-time distinguisher between $\mathrm{LWE}(m, \mathbb{Z}_q^n, S_{n,k}, D_{\mathbb{Z},\sigma'})$ and $\mathcal{D}_2$ with advantage $\geq 2\epsilon - \epsilon = \epsilon$ by the triangle inequality. Now, we can use this distinguisher to distinguish $\mathrm{LWE}(n-1, \mathbb{Z}_q^\ell, \mathbb{Z}_q^{m \times \ell}, D_{\mathbb{Z}^m,\sigma})$ and $U(\mathbb{Z}_q^{\ell \times (n-1)} \times \mathbb{Z}_q^{m \times (n-1)})$ by once again discarding the samples, i.e. the first part in $\mathbb{Z}_q^{\ell \times (n-1)}$ (the matrix $A$), and then by applying $\varphi$ to the remaining part in $\mathbb{Z}_q^{m \times (n-1)}$. Using Lemmas 19 and 20, the resulting distributions coming out of $\varphi$ when given $U(\mathbb{Z}_q^{m \times (n-1)})$ and $\mathcal{D}_1$ will be $\mathrm{negl}(\lambda)$-close to $\mathrm{LWE}(m, \mathbb{Z}_q^n, S_{n,k}, D_{\mathbb{Z},\sigma'})$ and $\mathcal{D}_2$, respectively. Thus, our assumed distinguisher will be correct, where the only runtime increase is in the randomized transformation $\varphi$, taking time $\mathrm{poly}(n, m, q, \lambda)$.
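The entropy hypothesis of Theorem 7, which enters through Lemma 16 and the leftover hash lemma, is easy to check for concrete parameters. The sketch below compares the exact value $\log_2 |S_{n,k}| = \log_2 \binom{n}{k} + k$ with the lower bound $k\log_2(n/k)$ and tests the condition $k\log_2(n/k) \geq (\ell+1)\log_2 q + \omega(\log \lambda)$. The function names are illustrative, and `slack_bits` is an assumed numeric stand-in for the asymptotic $\omega(\log \lambda)$ term.

```python
import math


def sparse_secret_entropy_bits(n, k):
    """Exact log2 of |S_{n,k}| = C(n, k) * 2^k, i.e. the entropy of a uniform k-sparse secret."""
    return math.log2(math.comb(n, k)) + k


def entropy_lower_bound_bits(n, k):
    """The bound k * log2(n / k) from Lemma 16."""
    return k * math.log2(n / k)


def leftover_hash_condition(n, k, ell, q, slack_bits):
    """Check k*log2(n/k) >= (ell + 1)*log2(q) + slack_bits.

    `slack_bits` is a hypothetical placeholder for omega(log lambda).
    """
    return entropy_lower_bound_bits(n, k) >= (ell + 1) * math.log2(q) + slack_bits


if __name__ == "__main__":
    n, k = 1024, 16
    print(sparse_secret_entropy_bits(n, k), entropy_lower_bound_bits(n, k))
    print(leftover_hash_condition(n, k, ell=8, q=2 ** 8, slack_bits=20))
```

For fixed $n$ and $q$, the condition bounds the LWE dimension $\ell$ that a given sparsity $k$ can support, which is the trade-off exploited in Section 5.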
Now, we state a simpler version of Theorem 7 that is easier to use.

Corollary 1. Suppose $\log(q)/2^\ell = \mathrm{negl}(\lambda)$, $\sigma \geq 4\sqrt{\omega(\log \lambda) + \ln n + \ln m}$, and $k\log_2(n/k) \geq (\ell+1)\log_2(q) + \omega(\log \lambda)$. Then, if $\mathrm{LWE}(n, \mathbb{Z}_q^\ell, \mathbb{Z}_q^\ell, D_{\mathbb{Z},\sigma})$ and $U(\mathbb{Z}_q^{\ell \times n} \times \mathbb{Z}_q^n)$ have no $T + \mathrm{poly}(n, m, q, \lambda)$ time distinguisher with advantage $\epsilon$, then $\mathrm{LWE}(m, \mathbb{Z}_q^n, S_{n,k}, D_{\mathbb{Z},\sigma'})$ and $U(\mathbb{Z}_q^{n \times m} \times \mathbb{Z}_q^m)$ have no $T$-time distinguisher with advantage $2m\epsilon$ (up to additive $\mathrm{negl}(\lambda)$ factors), where $\sigma' = 2\sigma\sqrt{k+1}$.

Proof. If $\mathrm{LWE}(n, \mathbb{Z}_q^\ell, \mathbb{Z}_q^\ell, D_{\mathbb{Z},\sigma})$ and $U(\mathbb{Z}_q^{\ell \times n} \times \mathbb{Z}_q^n)$ cannot be distinguished with advantage $\epsilon$, then by a hybrid argument, the version where the secrets are matrices (with $m$ dimensions instead of 1) cannot be distinguished with advantage $m\epsilon$ (up to additive $\mathrm{negl}(\lambda)$ factors). Then, applying Theorem 7, $\mathrm{LWE}(m, \mathbb{Z}_q^n, S_{n,k}, D_{\mathbb{Z},\sigma'})$ and $U(\mathbb{Z}_q^{n \times m} \times \mathbb{Z}_q^m)$ cannot be distinguished with advantage $2m\epsilon$, where we reparameterize to absorb small additive factors, with the observation that LWE is harder when the dimension and noise grow, and easier when the number of samples grows.

5 Hardness of Density Estimation for Mixtures of Gaussians

Now, using tools from the previous sections, we reduce LWE to density estimation for mixtures of Gaussians, using similar ideas as [BRST21]. Our machinery from the previous sections now allows us to give a fine-grained version of hardness of learning mixtures of Gaussians.

Theorem 8 (Reducing k-sparse LWE to k-sparse hCLWE). Let $n, m, q, k, g \in \mathbb{N}$, $\sigma \in \mathbb{R}_{>0}$. Suppose $q = \omega\left(\sqrt{\sigma^2 + k(\ln m + \ln n)}\right)$ and $\sigma \geq 3\sqrt{k} \cdot \sqrt{\ln n + \ln m + \omega(1)}$. Suppose that $\mathrm{LWE}(m, \mathbb{Z}_q^n, S_{n,k}, D_{\mathbb{Z},\sigma})$ and $U(\mathbb{Z}_q^{n \times m} \times \mathbb{Z}_q^m)$ have no $T + \mathrm{poly}(n, m, q, 1/\beta)$-time distinguisher with advantage $\Omega(1)$. Then, there is no $T$-time algorithm distinguishing $\mathrm{hCLWE}^{(g)}(m, \frac{1}{\sqrt{k}} S_{n,k}, \gamma, \beta)$ and $D_1^{n \times m}$ with advantage $\Omega(1)$ for

$$\gamma = 2\sqrt{k(\ln m + \ln n)}, \qquad \beta = 20 \cdot \frac{\sqrt{\sigma^2 + k(\ln m + \ln n)}}{q}, \qquad g = 8\sqrt{k} \cdot \sqrt{(\ln m)^2 + \ln(m)\ln(n)}.$$

Proof. Throughout, we set the security parameter to a constant, say $\lambda = 2$.
First, we apply Lemma 10 to make the errors continuous, for width $\sigma_2$ with $\sigma_2^2 = \sigma^2 + 4\ln m + \omega(1)$. Note that $\sigma^2 \geq 9k(\ln n + \ln m + \omega(1)) > 4\ln m + \omega(1)$, as needed for Lemma 10.

Then, we apply Lemma 11 to make the samples continuous uniform as opposed to discrete uniform, where the width becomes

$$\sigma_3 = \sqrt{\sigma_2^2 + 9k(\ln n + \ln m + \omega(1))} \leq 10\sqrt{\sigma^2 + k(\ln m + \ln n)},$$

as long as $\sigma_2 \geq 3\sqrt{k} \cdot \sqrt{\ln n + \ln m + \omega(1)}$, which is true because $\sigma_2 \geq \sigma$.

Then, we apply Lemma 13 to make the samples look Gaussian instead of uniform, for $\tau \geq \sqrt{\ln n + \ln m + \omega(1)}$, so we can set $\tau = 2\sqrt{\ln m + \ln n}$.

Then, we apply Lemma 15 to rescale, to get

$$\gamma = \sqrt{k} \cdot \tau = 2\sqrt{k(\ln m + \ln n)},$$

and

$$\beta = \frac{\sigma_3}{q} \leq 10 \cdot \frac{\sqrt{\sigma^2 + k(\ln m + \ln n)}}{q} = o(1).$$

Let $\beta' = 2\beta$. We can then reduce the problem of distinguishing $\mathrm{CLWE}(m, D_1^n, \frac{1}{\sqrt{k}} S_{n,k}, \gamma, \beta)$ and $D_1^{n \times m} \times U(\mathbb{T}^m)$ to the problem of distinguishing $\mathrm{hCLWE}(m, D_1^n, \frac{1}{\sqrt{k}} S_{n,k}, \gamma, \beta')$ and $D_1^{n \times m}$, in additive time $\mathrm{poly}(n, 1/\beta)$, by Lemma 9. Lastly, by Theorem 4, since $\beta' = o(1) < 1/32$, we know there is no $T$ time algorithm distinguishing $\mathrm{hCLWE}^{(g)}(m, \frac{1}{\sqrt{k}} S_{n,k}, \gamma, \beta')$ and $D_1^{n \times m}$ with constant advantage, for $g = 4\gamma\sqrt{\ln m/\pi} < 8\sqrt{k} \cdot \sqrt{(\ln m)^2 + \ln(m)\ln(n)}$.

Remark 4. Note that we do not apply the worst-case to average-case reduction for the secrets to reduce to CLWE. Instead, we keep our secret distribution discrete over the sphere.

Now, we combine this with our reduction from regular LWE to k-sparse LWE to get the following:

Theorem 9 (Reducing LWE to k-sparse hCLWE). Suppose for some constant $\epsilon < 1$ we have $4\sqrt{\omega(\log \ell) + \ln n + \ln m} \leq \sigma \leq \ell$, $3\sqrt{k} \cdot \sqrt{\ln n + \ln m + \omega(1)} \leq \sigma$, $\ell^2 \leq q \leq \mathrm{poly}(\ell)$, $k \leq O(n^{1-\epsilon})$, $\ell \leq n$, and $k\log_2(n/k) = 2\ell\log_2(q)$. Suppose that $\mathrm{LWE}(n, \mathbb{Z}_q^\ell, \mathbb{Z}_q^\ell, D_{\mathbb{Z},\sigma})$ has no $T(\ell) + \mathrm{poly}(n)$ time distinguisher with advantage at least $1/\mathrm{poly}(\ell)$. Then, there is no algorithm distinguishing $\mathrm{hCLWE}^{(g)}(m, \frac{1}{\sqrt{k}} S_{n,k}, \gamma, \beta)$ and $D_1^{n \times m}$ with $\Omega(1)$ advantage for $g = O\left(\sqrt{k \cdot \log \ell \cdot \log n}\right)$ in time $T(\ell)$ in $\mathbb{R}^n$, where $m = \mathrm{poly}(\ell) = \mathrm{poly}(k\log n/\log q)$, $\gamma = 2\sqrt{k(\ln m + \ln n)}$, and some $\beta = o(q^{-1/5})$.

Proof.
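First, we know $\mathrm{LWE}(n, \mathbb{Z}_q^{\ell}, \mathbb{Z}_q^{\ell}, D_{\mathbb{Z},\sigma})$ has no $T(\ell)$-time distinguisher with advantage at least $1/(100m)$, where we set $m = \mathrm{poly}(\ell) \le n$ and $\lambda = \ell$. Applying Corollary 1, we know this implies that $\mathrm{LWE}(m, \mathbb{Z}_q^{n}, S_{n,k}, D_{\mathbb{Z},3\sigma\sqrt{k}})$ has no $T + \mathrm{poly}(n,q)$-time distinguisher with advantage at least $1/50$, as long as $k\log_2(n/k) \ge 2\ell\log_2(q)$, $\log(q)/2^{\ell} = \mathrm{negl}(\ell)$, and $\sigma \ge 4\sqrt{\ln n + \ln m + \omega(\log\ell)}$. Note that all of these conditions are met by the hypotheses on the parameters. Now, we can apply Theorem 8 to argue that as long as $q = \omega\left(\sqrt{k\sigma^2 + k(\ln m + \ln n)}\right)$ and $\sigma \ge 3\sqrt{k}\cdot\sqrt{\ln n + \ln m + \omega(1)}$, we get no $T$-time distinguisher between $\mathrm{hCLWE}^{(g)}(m, \tfrac{1}{\sqrt{k}}S_{n,k}, \gamma, \beta)$ and $D_1^{n\times m}$ for $g = 8\sqrt{k}\cdot\sqrt{(\ln m)^2 + \ln(m)\ln(n)}$, $\gamma = 2\sqrt{k(\ln m + \ln n)}$, and
\[ \beta = O\left(\frac{\sqrt{k\sigma^2 + k(\ln m + \ln n)}}{q}\right). \]

By our assumptions on parameters, we get
\[ \sqrt{k\sigma^2 + k(\ln m + \ln n)} \le O\left(\sigma\sqrt{k\ln n}\right) \le O\left(\sigma\sqrt{\frac{k}{\epsilon}\ln(n/k)}\right) \le O\left(\ell\cdot\sqrt{\ell\log(q)}\right) = O\left(\ell^{3/2}\sqrt{\log\ell}\right) = o(q^{4/5}), \]
so the assumption $q = \omega\left(\sqrt{k\sigma^2 + k(\ln m + \ln n)}\right)$ is satisfied, and in fact $\beta = o(q^{4/5}/q) = o(q^{-1/5})$. Note that the number of Gaussians $g$ can be bounded as
\[ g = 8\sqrt{k}\cdot\sqrt{O((\log\ell)^2) + O(\log\ell\cdot\log n)} = O\left(\sqrt{k}\cdot\sqrt{\log\ell\cdot\log n}\right), \]
as desired. Lastly, $m = \mathrm{poly}(\ell) = \mathrm{poly}(k\log n/\log q)$, as desired.

Corollary 2. Let $\epsilon, \delta \in (0,1)$ be arbitrary constants with $\delta < \epsilon$. Assuming
\[ \mathrm{LWE}\left(2^{\ell^{\delta}}, \mathbb{Z}_q^{\ell}, \mathbb{Z}_q^{\ell}, D_{\mathbb{Z},\sigma}\right) \]
has no $T(\ell) = 2^{O(\ell^{\epsilon})}$-time distinguisher from $U\left(\mathbb{Z}_q^{\ell\times 2^{\ell^{\delta}}}\times\mathbb{Z}_q^{2^{\ell^{\delta}}}\right)$ with advantage at least $1/\mathrm{poly}(\ell)$, where $\sigma = \ell^{2/3}$ and $q = \ell^2$, then there is no algorithm distinguishing $\mathrm{hCLWE}^{(g)}(m, \tfrac{1}{\sqrt{k}}S_{n,k}, \gamma, \beta)$ and $D_1^{n\times m}$ in time $2^{\log_2(n)^{\epsilon/\delta}}$ (which is quasipolynomial in $n$), where $m = \mathrm{poly}(\log n)$, $g = O\left((\log n)^{1/(2\delta)}\cdot\log\log n\right)$, $\gamma = O\left((\log n)^{1/(2\delta)}\sqrt{\log\log n}\right)$ and $\beta = o(q^{-1/5})$.

Proof. We set $n = 2^{\ell^{\delta}}$ and $k = 4\ell^{1-\delta}\log_2(\ell)$ in Theorem 9. (Since $k = n^{o(1)}$, we replace $k\log_2(n/k)$ with $k\log_2(n)$ at the cost of a $(1-o(1))$ factor.) Let us first confirm that all the hypotheses of Theorem 9 hold. First, observe that
\[ 4\sqrt{\omega(\log\ell) + \ln n + \ln m} = O\left(\sqrt{\omega(\log\ell) + \ell^{\delta} + O(\log\ell)}\right) = O(\ell^{\delta/2}) = o(\sqrt{\ell}) \le \sigma. \]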
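Next, we have
\[ 3\sqrt{k}\cdot\sqrt{\ln n + \ln m + \omega(1)} \le O\left(\sqrt{k\ell^{\delta}}\right) = O\left(\sqrt{\ell^{1-\delta}\log\ell\cdot\ell^{\delta}}\right) = O\left(\sqrt{\ell\log\ell}\right) = o(\sigma). \]
For the last non-trivial condition, we have
\[ k\log_2(n) = 4\ell^{1-\delta}\log_2(\ell)\,\ell^{\delta} = 4\ell\log_2(\ell) = 2\ell\log_2(q). \]
If we have a $2^{\ell^{\epsilon}} = 2^{\log_2(n)^{\epsilon/\delta}}$-time distinguisher for the mixture of Gaussians, we get a $2^{\ell^{\epsilon}} + \mathrm{poly}(n) = 2^{O(\ell^{\epsilon})} = T(\ell)$-time algorithm for LWE. The number of samples here is $m = \mathrm{poly}(\ell) = \mathrm{poly}(\log n)$. The number of Gaussians becomes
\[ g = O\left(\sqrt{k}\cdot\sqrt{\log\ell\cdot\log n}\right) = O\left((\log n)^{1/(2\delta)}\cdot\log\log n\right), \]
and furthermore
\[ \gamma = O\left(\sqrt{k(\ln m + \ln n)}\right) = O\left((\log n)^{1/(2\delta)}\sqrt{\log\log n}\right), \]
and $\beta$ is unchanged at $o(q^{-1/5})$.

We give another setting of parameters where the number of Gaussians in the mixture is larger, but the assumption on LWE is weaker because the number of samples is reduced.

Corollary 3. Let $\alpha > 1$ be an arbitrary constant. Assuming $\mathrm{LWE}(n, \mathbb{Z}_q^{\ell}, \mathbb{Z}_q^{\ell}, D_{\mathbb{Z},\sigma})$ and $U(\mathbb{Z}_q^{\ell\times n}\times\mathbb{Z}_q^{n})$ have no $T(\ell) + \mathrm{poly}(n)$-time distinguisher with advantage $1/\mathrm{poly}(m)$, where $n = \ell^{\alpha}$, $\sigma = \sqrt{k}$, and $q = \ell^2$, then there is no algorithm distinguishing $\mathrm{hCLWE}^{(g)}(m, \tfrac{1}{\sqrt{k}}S_{n,k}, \gamma, \beta)$ and $D_1^{n\times m}$ with constant advantage in time $T(\ell) = T(n^{1/\alpha})$, where $g = O\left(n^{1/(2\alpha)}\cdot\log n\right)$, $\gamma = O\left(n^{1/(2\alpha)}\cdot\sqrt{\log n}\right)$, and some $\beta = o(q^{-1/5})$.

In particular, if $T(\ell) = \mathrm{poly}(\ell)$, then assuming the LWE problem is hard to distinguish for $\mathrm{poly}(\ell)$-time algorithms, so is the problem on $\mathrm{hCLWE}^{(g)}$.

Proof. We set $k = 4\ell/(\alpha-1) = 4n^{1/\alpha}/(\alpha-1)$ and apply Theorem 9. Observe that
\[ k\log_2(n/k) = \frac{4\ell}{\alpha-1}\cdot\log_2\left(\frac{\ell^{\alpha}}{4\ell/(\alpha-1)}\right) = \frac{4\ell}{\alpha-1}\cdot\big((\alpha-1)\log_2(\ell) - O(1)\big) = 4\ell\log_2(\ell) - O(\ell) = 2\ell\log_2(q) - O(\ell), \]
as necessary (the $O(\ell)$ term doesn't change the proof of Theorem 8). Let us see that the other hypotheses of Theorem 9 hold. We have
\[ 4\sqrt{\omega(\log\ell) + \ln n + \ln m} = O\left(\sqrt{\omega(\log\ell)}\right) = o(\sigma), \]
and also
\[ 3\sqrt{k}\cdot\sqrt{\ln n + \ln m + \omega(1)} = O\left(\sqrt{\ell}\cdot\sqrt{\ln\ell}\right) = o(\sigma), \]
as desired. Also note that $k = O(n^{1/\alpha})$, and since $1/\alpha < 1$, there exists some $\epsilon' > 0$ such that $k \le O(n^{1-\epsilon'})$. If we have a time $T(n^{1/\alpha}) = T(\ell)$ distinguisher for hCLWE, we get a time $T(\ell) + \mathrm{poly}(n)$ distinguisher for LWE.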
The number of Gaussians becomes
\[ g = O\left(\sqrt{k}\cdot\sqrt{\log\ell\cdot\log n}\right) = O\left(n^{1/(2\alpha)}\cdot\log n\right), \]
and furthermore
\[ \gamma = O\left(\sqrt{k(\ln m + \ln n)}\right) = O\left(n^{1/(2\alpha)}\sqrt{\log n}\right), \]
and $\beta$ is unchanged at $o(q^{-1/5})$.

6 Low-Sample Algorithm for hCLWE(g)

Theorem 10. Let $\gamma = 2\sqrt{k(\ln n + \ln m)}$ and $\beta = o(q^{-1/5})$. Further, let $t := |S_{n,k}| = \binom{n}{k}\cdot 2^k$ denote the number of $k$-sparse $\{-1,0,+1\}$-secrets and suppose $\log\log(\log t/\log q) = o(\log q)$. Then, for some $m = O(k\log n/\log q)$, there is an $O\left(m\cdot 2^k\binom{n}{k}\right)$-time algorithm that distinguishes between $\mathrm{hCLWE}^{(g)}(m, D_1^n, \tfrac{1}{\sqrt{k}}S_{n,k}, \gamma, \beta)$ and $D_1^{n\times m}$ with advantage at least $1/2$.

Remark 5. This theorem can be generalized to other settings of $\beta, \gamma$, but we state it this way because it suffices for our purposes. It also works for the setting of non-truncated hCLWE.

Remark 6. While the runtime of this algorithm is similar to that of the algorithm solving hCLWE given in Theorem 7.5 of [BRST21] as applied in a black-box way, the sample complexity needed here is $O(k\log n/\log q)$, as opposed to roughly $2^{O(\gamma^2)} = n^{\Omega(k)}$.

Algorithm 1: Low Sample algorithm for hCLWE(g)
Input: Sampling oracle to distribution $D$.
Output: 1 to indicate $D = \mathrm{hCLWE}^{(g)}$ and 0 otherwise.
Draw $m$ samples $\mathbf{a}_1, \ldots, \mathbf{a}_m \sim D$.
for $\mathbf{s} \in \tfrac{1}{\sqrt{k}}S_{n,k}$ do
    Compute $f_{\mathbf{s}}(\mathbf{a}_i) = \langle\mathbf{a}_i, \mathbf{s}\rangle \bmod \gamma/\gamma'^2$ for all $i \in [m]$.
    if $f_{\mathbf{s}}(\mathbf{a}_i) \in [-a\beta/\gamma', a\beta/\gamma']$ for all $i \in [m]$ then return 1.
return 0.

Proof. For the sake of this proof, we take the representatives of $\mathbb{T}_q$ to be in the interval $[-q/2, q/2)$. Further, let $\gamma' = \sqrt{\gamma^2 + \beta^2}$, $\mathbf{a} \in \mathbb{R}^n$ and $\mathbf{s} \in \tfrac{1}{\sqrt{k}}S_{n,k}$. We define $f_{\mathbf{s}} : \mathbb{R}^n \to \mathbb{T}_{\gamma/\gamma'^2}$ by $f_{\mathbf{s}}(\mathbf{a}) := \langle\mathbf{a}, \mathbf{s}\rangle \bmod \gamma/\gamma'^2$.

We use the main idea in the proof of Claim 5.3 in [BRST21] to give an algorithm that distinguishes the two distributions. Given $m$ samples $\mathbf{a}_1, \ldots, \mathbf{a}_m$ from an unknown distribution $D$, we compute $\langle\mathbf{a}_i, \mathbf{s}\rangle \bmod \gamma/\gamma'^2$ for all possible secret directions $\mathbf{s} \in \tfrac{1}{\sqrt{k}}S_{n,k}$ and for all samples $i \in [m]$. This takes time $O(mt)$.
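If there is some $\mathbf{s}$ such that $f_{\mathbf{s}}(\mathbf{a}_i)$ is small for all samples $i \in [m]$, then we guess $D = \mathrm{hCLWE}^{(g)}$, and otherwise we guess $D = D_1^{n\times m}$.

Now, suppose that the input distribution is $D = \mathrm{hCLWE}^{(g)}(m, D_1^n, \tfrac{1}{\sqrt{k}}S_{n,k}, \gamma, \beta)$. Let $\mathbf{s}^*$ be the randomly sampled but fixed secret direction. Then for all the $m$ samples $\mathbf{a}_i$, we have that $f_{\mathbf{s}^*}(\mathbf{a}_i)$ is distributed as $D_{\beta/\gamma'} \bmod \gamma/\gamma'^2$. This can be seen from Equation 2. As an aside, note that by Claim 5.3 of [BRST21] this holds even when the input distribution is not truncated, that is, $D = \mathrm{hCLWE}(m, D_1^n, \tfrac{1}{\sqrt{k}}S_{n,k}, \gamma, \beta)$.

For a parameter $\delta > 0$ specified later, let $a = \sqrt{\log(1/\delta)}$. By a standard Chernoff bound, the probability mass of $D_{\beta/\gamma'}$ that is outside the interval $[-a\beta/\gamma', a\beta/\gamma']$ is at most $\delta$. Taking a union bound over the $m$ samples $\mathbf{a}_i$, the probability that there exists some sample indexed by $i \in [m]$ such that
\[ f_{\mathbf{s}^*}(\mathbf{a}_i) = \left(\langle\mathbf{a}_i, \mathbf{s}^*\rangle \bmod \gamma/\gamma'^2\right) \notin [-a\beta/\gamma', a\beta/\gamma'] \]
is at most $m\delta$. Therefore, if $D = \mathrm{hCLWE}^{(g)}(m, D_1^n, \tfrac{1}{\sqrt{k}}S_{n,k}, \gamma, \beta)$, the algorithm outputs 1 with probability at least $1 - m\delta$.

On the other hand, if $\mathbf{a}_i \sim D_1^n$, then for any fixed $\mathbf{s} \in \tfrac{1}{\sqrt{k}}S_{n,k}$, we have that $\langle\mathbf{a}_i, \mathbf{s}\rangle \sim D_1$, independently of $\mathbf{s}$. By Lemma 4 and Lemma 2,
\[ \Delta\left(D_1^m \bmod \gamma/\gamma'^2,\ U\left(\mathbb{T}^m_{\gamma/\gamma'^2}\right)\right) \le m\exp(-\gamma'^4/\gamma^2)/2 \le m\exp(-\gamma^2)/2. \]
Therefore, for a fixed $\mathbf{s}$ and independent samples $\mathbf{a}_i$, we have that $\left(f_{\mathbf{s}}(\mathbf{a}_i) = \langle\mathbf{a}_i, \mathbf{s}\rangle \bmod \gamma/\gamma'^2\right)_{i\in[m]}$ is $m\exp(-\gamma^2)/2$-close to $(u_i)_{i\in[m]} \sim U\left(\mathbb{T}^m_{\gamma/\gamma'^2}\right)$. The probability that $u_i \in [-a\beta/\gamma', a\beta/\gamma']$ for all $i \in [m]$ is at most $(2a\beta\gamma'/\gamma)^m$. This means that
\[ \Pr_{\mathbf{a}_i}\left[(f_{\mathbf{s}}(\mathbf{a}_i))_{i\in[m]} \in [-a\beta/\gamma', a\beta/\gamma']^m\right] \le \Delta\left(D_1^m \bmod \gamma/\gamma'^2,\ U\left(\mathbb{T}^m_{\gamma/\gamma'^2}\right)\right) + \left(2a\beta\cdot\frac{\gamma'}{\gamma}\right)^m \le m\exp(-\gamma^2)/2 + \left(2a\beta\cdot\frac{\gamma'}{\gamma}\right)^m. \]
Taking a union bound over all the $t$ secret directions $\mathbf{s} \in \tfrac{1}{\sqrt{k}}S_{n,k}$, we get that the probability that there exists some $\mathbf{s} \in \tfrac{1}{\sqrt{k}}S_{n,k}$ such that for all $i \in [m]$, $f_{\mathbf{s}}(\mathbf{a}_i) \in [-a\beta/\gamma', a\beta/\gamma']$ is at most
\[ t\cdot m\exp(-\gamma^2)/2 + t\left(2a\beta\cdot\frac{\gamma'}{\gamma}\right)^m. \]
Putting all parts together, we get that the advantage of the distinguisher is at least
\[ 1 - m\delta - t\left(2a\beta\cdot\frac{\gamma'}{\gamma}\right)^m - t\cdot m\exp(-\gamma^2)/2. \]

Since $\gamma = 2\sqrt{k(\ln m + \ln n)}$ and $\beta = o(q^{-1/5})$, let us now set $\delta = \frac{1}{10m}$ and $m = \Theta\left(\frac{\log t}{\log q}\right)$. Then $m\delta \le 1/10$ and
\[ \frac{t\cdot m\cdot\exp(-\gamma^2)}{2} \le \frac{2^k\cdot n^k\cdot m\cdot\exp(-4k(\ln m + \ln n))}{2} = \frac{(2n)^k\cdot m\cdot(mn)^{-4k}}{2} \le \frac{1}{10}. \]
Lastly, to get advantage greater than $1/2$, we want $m$ such that
\[ t \le \frac{1}{10}\cdot\frac{1}{\left(2a\beta\frac{\gamma'}{\gamma}\right)^m}. \]
Taking the log on both sides,
\[ \log t \le -\log 10 - m\left(\log 2 + \log a + \log\beta + \log\frac{\gamma'}{\gamma}\right) = \Theta\big(m(-\log\log m + \log q)\big) = \Theta(m\log q), \]
where we use that $\log a = O(\log\log m) = O(\log\log(\log t/\log q)) = o(\log q)$, $\beta = o(q^{-1/5})$ and $\gamma'/\gamma \le 2$. This gives us that the advantage of the distinguisher is at least $1 - 3/10 > 1/2$. Since $\log t \le k\log(2n) = O(k\log n)$, this makes $m = \Theta(k\log n/\log q)$.

Now, we combine Theorem 10 and Corollary 2 to get the following tightness for the mixtures of Gaussians we consider.

Corollary 4. Let $\epsilon, \delta \in (0,1)$ be arbitrary constants with $\delta < \epsilon$. Assuming
\[ \mathrm{LWE}\left(2^{\ell^{\delta}}, \mathbb{Z}_q^{\ell}, \mathbb{Z}_q^{\ell}, D_{\mathbb{Z},\sigma}\right) \]
has no $T(\ell) = 2^{O(\ell^{\epsilon})}$-time distinguisher from $U\left(\mathbb{Z}_q^{\ell\times 2^{\ell^{\delta}}}\times\mathbb{Z}_q^{2^{\ell^{\delta}}}\right)$ with advantage at least $1/\mathrm{poly}(\ell)$, where $\sigma = \sqrt{\ell}$ and $q = \ell^2$, then there is no algorithm distinguishing $\mathrm{hCLWE}^{(g)}(m, \tfrac{1}{\sqrt{k}}S_{n,k}, \gamma, \beta)$ and $D_1^{n\times m}$ with $\Omega(1)$ advantage in time $2^{\log_2(n)^{\epsilon/\delta}}$ (which is quasipolynomial in $n$), where $m = \mathrm{poly}(\log n)$, $g = O\left((\log n)^{1/(2\delta)}\cdot\log\log n\right)$, $\gamma = O\left((\log n)^{1/(2\delta)}\sqrt{\log\log n}\right)$ and some $\beta = o(q^{-1/5})$. Yet, there is a distinguisher running in time $2^{O((\log n)^{1/\delta}\log\log n)}$ using $O(\log(n)^{1/\delta})$ samples.

Proof. The first part of the statement is immediate from Corollary 2. The second part of the statement follows from Theorem 10. To see this, in Corollary 2, we set $k = O(\ell^{1-\delta}\log(\ell)) = O\left((\log n)^{(1-\delta)/\delta}\log\log n\right)$, which implies the run-time of the algorithm becomes
\[ m\cdot n^{O(k)} = \mathrm{poly}(\log\ell)\cdot n^{O(k)} = n^{O(k)} = n^{O((\log n)^{(1-\delta)/\delta}\cdot\log\log n)} = 2^{O((\log n)^{1/\delta}\cdot\log\log n)}. \]
Moreover, the number of samples necessary is
\[ O(\log t/\log q) = O(k\log n/\log q) = O(\ell) = O\left(\log(n)^{1/\delta}\right). \]
Lastly, as needed by Theorem 10, we have
\[ \log\log(\log t/\log q) = O(\log\log\ell) = O(\log\log q) = o(\log q), \]
as desired.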
References

[AM05] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In International Conference on Computational Learning Theory, pages 458–469. Springer, 2005.

[BD20] Zvika Brakerski and Nico Döttling. Hardness of LWE on general entropic distributions. In Anne Canteaut and Yuval Ishai, editors, Advances in Cryptology - EUROCRYPT 2020 - 39th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Zagreb, Croatia, May 10-14, 2020, Proceedings, Part II, volume 12106 of Lecture Notes in Computer Science, pages 551–575. Springer, 2020.

[BLMR13] Dan Boneh, Kevin Lewi, Hart William Montgomery, and Ananth Raghunathan. Key homomorphic PRFs and their applications. In Ran Canetti and Juan A. Garay, editors, Advances in Cryptology - CRYPTO 2013 - 33rd Annual Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2013. Proceedings, Part I, volume 8042 of Lecture Notes in Computer Science, pages 410–428. Springer, 2013.

[BLP+13] Zvika Brakerski, Adeline Langlois, Chris Peikert, Oded Regev, and Damien Stehlé. Classical hardness of learning with errors. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 575–584, 2013.

[BRST21] Joan Bruna, Oded Regev, Min Jae Song, and Yi Tang. Continuous LWE. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 694–707, 2021.

[BS15] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. SIAM Journal on Computing, 44(4):889–911, 2015.

[BV08] S Charles Brubaker and Santosh S Vempala. Isotropic PCA and affine-invariant clustering. In Building Bridges, pages 241–281. Springer, 2008.

[Das99] Sanjoy Dasgupta. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA, pages 634–644. IEEE Computer Society, 1999.

[DKS17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84. IEEE, 2017.

[DKS18] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. List-decodable robust mean estimation and learning mixtures of spherical Gaussians. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1047–1060, 2018.

[DS07] Sanjoy Dasgupta and Leonard J Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8:203–226, 2007.

[FGR+17] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh S Vempala, and Ying Xiao. Statistical algorithms and a lower bound for detecting planted cliques. Journal of the ACM (JACM), 64(2):1–37, 2017.

[FSO06] Jon Feldman, Rocco A Servedio, and Ryan O'Donnell. PAC learning axis-aligned mixtures of Gaussians with no separation assumption. In International Conference on Computational Learning Theory, pages 20–34. Springer, 2006.

[HILL99] Johan Håstad, Russell Impagliazzo, Leonid A Levin, and Michael Luby. A pseudorandom generator from any one-way function. SIAM Journal on Computing, 28(4):1364–1396, 1999.

[HL18] Samuel B Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1021–1034, 2018.

[HP15] Moritz Hardt and Eric Price. Tight bounds for learning a mixture of two Gaussians. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 753–760, 2015.

[Kea98] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.

[KSS18] Pravesh K Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035–1046, 2018.

[KSV05] Ravindran Kannan, Hadi Salmasian, and Santosh Vempala. The spectral method for general mixture models. In International Conference on Computational Learning Theory, pages 444–457. Springer, 2005.

[LP11] Richard Lindner and Chris Peikert. Better key sizes (and attacks) for LWE-based encryption. In Cryptographers' Track at the RSA Conference, pages 319–339. Springer, 2011.

[Mic18] Daniele Micciancio. On the hardness of learning with errors with binary secrets. Theory Comput., 14(1):1–17, 2018.

[MP00] G. J. McLachlan and D. Peel. Finite mixture models. Wiley Series in Probability and Statistics, 2000.

[MP13] Daniele Micciancio and Chris Peikert. Hardness of SIS and LWE with small parameters. In Annual Cryptology Conference, pages 21–39. Springer, 2013.

[MR07] Daniele Micciancio and Oded Regev. Worst-case to average-case reductions based on Gaussian measures. SIAM Journal on Computing, 37(1):267–302, 2007.

[MV10] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 93–102. IEEE, 2010.

[NIS] NIST. Post-quantum cryptography standardization. https://csrc.nist.gov/Projects/Post-Quantum-Cryptography.

[Reg09] Oded Regev. On lattices, learning with errors, random linear codes, and cryptography. Journal of the ACM (JACM), 56(6):1–40, 2009.

[RV17] Oded Regev and Aravindan Vijayaraghavan. On learning mixtures of well-separated Gaussians. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 85–96. IEEE, 2017.

[SK01] Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 247–257, 2001.

[SZB21] Min Jae Song, Ilias Zadik, and Joan Bruna. On the cryptographic hardness of learning single periodic neurons. arXiv preprint arXiv:2106.10744, 2021.

[TTM+85] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. Applied section. Wiley, 1985.

[VW02] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixtures of distributions. In The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings., pages 113–122. IEEE, 2002.