Self-regularizing Property of Nonparametric Maximum Likelihood Estimator in Mixture Models
Yury Polyanskiy and Yihong Wu
August 19, 2020

Introduced by Kiefer and Wolfowitz [KW56], the nonparametric maximum likelihood estimator (NPMLE) is a widely used methodology for learning mixture models and empirical Bayes estimation. Sidestepping the non-convexity in mixture likelihood, the NPMLE estimates the mixing distribution by maximizing the total likelihood over the space of probability measures, which can be viewed as an extreme form of overparameterization. In this paper we discover a surprising property of the NPMLE solution. Consider, for example, a Gaussian mixture model on the real line with a subgaussian mixing distribution. Leveraging complex-analytic techniques, we show that with high probability the NPMLE based on a sample of size $n$ has $O(\log n)$ atoms (mass points), significantly improving the deterministic upper bound of $n$ due to Lindsay [Lin83a]. Notably, any such Gaussian mixture is statistically indistinguishable from a finite one with $O(\log n)$ components (and this is tight for certain mixtures). Thus, absent any explicit form of model selection, the NPMLE automatically chooses the right model complexity, a property we term self-regularization. Extensions to other exponential families are given. As a statistical application, we show that this structural property can be harnessed to bootstrap the existing Hellinger risk bound of the (parametric) MLE for finite Gaussian mixtures to the NPMLE for general Gaussian mixtures, recovering a result of Zhang [Zha09].

The nonparametric maximum likelihood estimator (NPMLE) is a useful methodology for various statistical problems such as density estimation, regression, censoring models, deconvolution, and mixture models (see the monographs [GW92, GJ14]). Oftentimes optimizing over a massive (infinite-dimensional) parameter space can lead to undesirable properties, such as non-existence and roughness, and runs the risk of overfitting. These shortcomings can be remedied by the method of sieves [Gre81] or explicit regularization [GG71, Sil82], at the expense of losing the main advantages of the NPMLE: full adaptivity (no tuning parameters) and computational tractability. However, for certain problems, including shape constraints (such as monotonicity [Gre56, Bir89] and log-concavity [DR09, CSS10, DW16, KS16]) and mixture models [Lin95, Zha09, SG20], a striking observation is that the unpenalized NPMLE achieves superior performance and has become the method of choice for both theoretical investigation and practical computation. While basic structural properties of the NPMLE have been well understood, these results are frequently too conservative to explain its superior statistical performance. This paper studies the typical structure of the NPMLE for mixture models as well as its statistical consequences.

Consider a parametric family of densities $\{p_\theta : \theta \in \Theta\}$ with respect to some dominating measure $\mu$ on $\mathbb{R}$, where the parameter space $\Theta$ is assumed to be a subset of $\mathbb{R}$. Given a mixing distribution (prior) $\pi$ on $\Theta$, we denote the induced mixture density by

$$p_\pi(x) = \int_\Theta p_\theta(x)\, \pi(d\theta). \quad (1)$$

Introduced by Kiefer and Wolfowitz [KW56] (see also an earlier abstract by Robbins [Rob50]), the NPMLE for the mixing distribution is defined as a maximizer of the mixture likelihood given $n$ data points $x_1, \ldots, x_n$:

$$\hat{\pi}_{\mathrm{NPMLE}} \in \operatorname*{arg\,max}_{\pi \in \mathcal{M}(\Theta)} \sum_{i=1}^{n} \log p_\pi(x_i), \quad (2)$$
where $\mathcal{M}(\Theta)$ denotes the collection of all probability measures on $\Theta$. We refer the reader to the monograph of Lindsay [Lin95] for a systematic treatment of the NPMLE. Although the convex optimization problem (2) is infinite-dimensional, over the years various computationally efficient algorithms have been obtained; see [Lin95, Chapter 6] and more recent developments in [JZ09, KM14]. The NPMLE provides a highly useful primitive for empirical Bayes and compound estimation problems, in which one first applies the NPMLE to learn a prior and then executes the corresponding Bayes estimator under the learned prior. This strategy can be used as a universal means for denoising and achieves state-of-the-art empirical Bayes performance [JZ09].

We summarize a few known structural properties of the NPMLE. The first existence and uniqueness result was obtained by Simar [Sim76] for the Poisson mixture, followed by Jewell [Jew82] for mixtures of exponential distributions. It was shown that the (unique) solution $\hat{\pi}_{\mathrm{NPMLE}}$ to the optimization problem (2) is a discrete distribution, whose number of atoms (mass points) is at most the number of distinct values among the observations and consequently at most the sample size $n$. These results have been significantly extended in [Lai78, Lin83a, Lin83b, GW92, LR93], which show that the NPMLE solution is unique and at most $n$-atomic for all exponential families with densities with respect to the Lebesgue measure. Although the bound $|\mathrm{supp}(\hat{\pi}_{\mathrm{NPMLE}})| \le n$ is the best possible, which seems to suggest that the estimator exhibits significant overfitting (since an $n$-component mixture requires $2n-1$ parameters to describe), in practice the support size is much smaller than $n$. Understanding this phenomenon is the main motivation behind this work.

To anchor the discussion, let us focus on the Gaussian location model, where $p_\theta(x) = \varphi(x-\theta)$ is the density of $N(\theta,1)$ and $\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is the standard normal density, so that $p_\pi$ is the convolution $\pi * \varphi$. It is well known that for finite Gaussian mixtures the likelihood is non-concave in the location parameters; furthermore, spurious local maxima can exist even with infinite sample size [JZB+16], which poses difficulties for heuristic methods such as the EM algorithm. To sidestep the non-convexity, the approach of the NPMLE can be viewed as an extreme form of overparameterization, which postulates a potentially infinite Gaussian mixture so as to convexify the optimization problem. Since overparameterized models are prone to overfitting, it is of significant interest to understand the typical model size fitted by the NPMLE. To this end, the worst-case bound $|\mathrm{supp}(\hat{\pi}_{\mathrm{NPMLE}})| \le n$ is not useful. In fact, this bound can be tight, e.g., when the $n$ observations are extremely spaced out [Lin95, p. 116]. This, however, is not a typical configuration of the sample if it consists of independent observations. Indeed, in practice it has been observed that the NPMLE tends to fit a Gaussian mixture with far fewer components than $n$. This not only explains the absence of overfitting, but is also a highly desirable property for the interpretability of the NPMLE solution, and it is clearly not explained by the worst-case bound. Based on numerical evidence, Koenker and Gu [KG19] suggested that the number of atoms of the NPMLE is typically $O(\sqrt{n})$. As our main result shows next, it is in fact $O(\log n)$.

Theorem 1 (Gaussian mixture model). Let $p_\theta$ be the density of $N(\theta,1)$. Let $x_{\min} = \min_{i\in[n]} x_i$ and $x_{\max} = \max_{i\in[n]} x_i$.
Then there exists a universal constant C 0 , such that Consequently, suppose x 1 , . . . , x n are drawn independently from π * N (0, 1) for some s-subgaussian mixing distribution π, i.e., π(dθ)e tθ ≤ e st 2 /2 for all t ∈ R. Then for any τ > 0, there exists some constant C = C(s, τ ) such that with probability at least 1 − n −τ , A few remarks are in order: Remark 1 (Tightness of Theorem 1). The O(log n) upper bound in Theorem 1 is tight in the following sense: • First, it is necessary to select a model of size Ω(log n) in order to be compatible with existing statistical guarantees on the NPMLE. Indeed, suppose the true density p π is N (0, σ 2 ) for some σ > 1 (i.e. the mixing distribution is another Gaussian). It is known that the Hellinger distance between N (0, σ 2 ) and any k-Gaussian mixture (k-GM) with unit variance is at least exp(−O(k)) [WV10] . Therefore, if |supp( π NPMLE )| ≤ c log n for some small constant c, the bias would be too big, violating the Hellinger risk bound E[H 2 (p π , p π NPMLE )] = O( log 2 n n ) on the NPMLE [Zha09] (see Section 4). • On the other hand, there is no statistical value to fit a model of size bigger than Ω(log n). Indeed, it is easy to show by moment matching (see, e.g., [WY20, Lemma 8]) that for any subgaussian π, p π = π * N (0, 1) can be approximated by a k-GM within total variation (TV) distance exp(−Ω(k)). Therefore, there exists a k-GM density p π ′ with k = C log n, such that TV(p π , p π ′ ) = o(1/n). As such, one can couple the original sample X 1 , . . . , X n drawn from p π with the sample X ′ 1 , . . . , X ′ n drawn from p π ′ , so that with probability 1 − o(1), X i = X ′ i for all i = 1, . . . , n. From this simulation perspective, p π ′ is an equally plausible "ground truth" that explains the data, and hence, statistically speaking, there is no reason to fit a mixture model with more than C log n components. From the above two aspects, one can view Θ(log n) as the "effective dimension" of the Gaussian mixture model with subgaussian mixing distributions (i.e. each doubling of the sample size unlocks a new parameter of the model class). Thus it is a remarkable fact that NPMLE picks up the right model size without explicit model selection penalty. For this reason, we refer to the phenomenon described in Theorem 1 as self-regularization. In order to further quantify self-regularization and determine what the correct model size is, we formalize a framework called the statistical degree in Section 5.1. Remark 2 (Poisson mixture). Using only classical results, one can get a glimpse of the selfregularization property of the NPMLE by considering the Poisson model. Suppose x 1 , . . . , x n are drawn independently from a Poisson mixture for some subexponential mixing distribution on the mean parameter. Since the observations are non-negative integers, the number of distinct values of in the sample is at most x max + 1, which is O(log n) with probability 1 − o(1) by a union bound. Thus the NPMLE for the Poisson mixture is O(log n)-atomic with high probability, which is again the optimal model size. Clearly, this argument does not generalize to continuous distributions such as the Gaussian mixture model in which all observations are distinct with probability one. Nevertheless, the range of the data still grows logarithmically and Theorem 1 shows that the number of atoms in the NPMLE can be bounded by the squared range. Remark 3 (Model selection and penalized MLE). 
Define the likelihood value of the best k-GM fit as Note that this is a non-convex optimization problem, since the likelihood is not concave in the location parameters. As k → ∞, L opt (k) approaches the objective value of the (convex optimization) NPMLE (2). Theorem 1 shows that with high probability with respect to the randomness of the sample, the curve k → L opt (k) flattens when k surpasses C log n for some constant C. This has the following immediate bearing on model selection. Various criteria (such as AIC or BIC [Ler92, Ker00] ) have been proposed for the mixture model: given a penalty function pen(k) that strictly increases in k, select a model size by solving max k=1,...,K where K is a pre-defined maximal model size. Theorem 1 shows that for Gaussian mixtures, regardless of the actual model size, there is no need to choose K bigger than C log n, which also suggests K = C log n a universal choice. It is shown in [Ler92, Ker00] that BIC (with pen(k) = k 2 log n) is consistent in estimating the order of the mixture model. Complementing this result, Theorem 1 shows that regardless of the choice of penalty, any penalized MLE will not choose a model size bigger than C log n with high probability. Remark 4 (Comparison with shape-constrained estimation). The structure of the NPMLE is much less well understood for mixture models than for shape-constrained estimation. For example, for monotone density, the NPMLE (known as the Grenander estimator [Gre56] ) of a decreasing density on [0, 1] with n observations is known to be piecewise constant with at most n pieces. Denote by k n by its number of pieces. Under appropriate conditions it is shown that in general k n = O(n 1/3 ) with high probability [Gro11, Lemma 3.1]. In the special case where the data are drawn from the uniform distribution, k n is asymptotically N (log n, log n) [GL93, Theorem 2]. These results are made possible thanks to an explicit characterization of the NPMLE in terms of empirical processes, a luxury we do not have in mixture models. On the other hand, there is a clear analogy for the structural behavior of NPMLE for monotone density and mixture models: In the former, if the data are drawn from uniform distribution (onepiecewise constant), the NPMLE will fit a piecewise constant density with O(log n) pieces; in the latter, if the data are drawn from a single Gaussian, the NPMLE will fit a Gaussian mixture with O(log n) components. From this perspective one could say there is some mild overfitting in NPMLE; nevertheless, it is a modest (and fair) price to pay for being completely automatic and computationally attractive. Theorem 1 is further extended in Theorem 3 to general exponential families, which shows that there is some degree of universality to the O(log n) upper bound. As we will see in Section 2, bounding the number of atoms in NPMLE boils down to counting critical points of functions of the form F (θ) = n i=1 w i p θ (x i ), where w i 's are nonnegative weights. We accomplish this task using methods from complex analysis. Roughly speaking, the strategy is as follows: First, we localize the roots of F ′ in a compact interval, say [−r, r]. Then, we bound the number of zeros of F ′ in the complex disk of radius r, in terms of its maximal modulus on the complex disks. This leads to a deterministic upper bound, as a function of the sample, on the number of atoms of the NPMLE. Finally, we analyze the high-probability behavior of this upper bound when the sample consists of iid observations. 
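To make the counting step concrete, here is a minimal numerical sketch (an illustration, not part of the paper's argument) for the Gaussian case: it draws an i.i.d. sample, forms F'(θ) = Σ_i w_i (x_i − θ) φ(x_i − θ) for arbitrary nonnegative weights w_i (the counting question above is posed for any nonnegative weights, not only the ones induced by the NPMLE), and counts sign changes of F' on a grid of [x_min, x_max]. The grid resolution, sample sizes, and random weights are assumptions of the illustration; the resulting count can be compared with the squared range of the data, which is the scale appearing in Theorem 1.

```python
# Minimal sketch, not from the paper: grid-based count of critical points of
# F(theta) = sum_i w_i * phi(x_i - theta) in the Gaussian location model, for
# arbitrary nonnegative weights. F'(theta) = sum_i w_i (x_i - theta) phi(x_i - theta).
import numpy as np

rng = np.random.default_rng(0)

def count_critical_points(x, w, grid_size=2000):
    """Crude count of sign changes of F'(theta) on a grid of [x_min, x_max]."""
    thetas = np.linspace(x.min(), x.max(), grid_size)
    diff = x[None, :] - thetas[:, None]                  # shape (grid_size, n)
    fprime = (w[None, :] * diff * np.exp(-0.5 * diff**2)).sum(axis=1)
    return int(np.sum(np.diff(np.sign(fprime)) != 0))

for n in [100, 500, 2000]:
    x = rng.standard_normal(n)     # sample from N(0, 1), a subgaussian mixture
    w = rng.random(n)              # arbitrary nonnegative weights (illustrative)
    squared_range = (x.max() - x.min()) ** 2
    print(n, count_critical_points(x, w), round(squared_range, 1))
```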
We note that in the special case of Gaussian model, counting the number of critical points of F has been studied, independently, in the context of a seemingly unrelated information-theoretic problem [DYPS20] ; see also Section 5.4. Note that statistical guarantees on NPMLE, typically in terms of Hellinger risk of density estimation, have been obtained in [GvdV01, GvdV07, Zha09, SG20] . These results follow the usual route of analyzing MLE using entropy numbers and only uses the zeroth order optimality condition, and therefore cannot produce any structural information on the optimizer such as the number of atoms. (For example, such analysis applies equally to π NPMLE convolved with an arbitrarily small Gaussian, which now has infinitely many atoms.) A structural result, such as Theorem 1, can only be obtained by "opening up the optimization blackbox", by examining the exact optimality conditions, as we indeed do below. In turn, a pleasant consequence of the self-regularizing property is a simpler proof of the statistical guarantee of the NPMLE in [Zha09] , by bootstrapping existing that of the (parametric) MLE for finite Gaussian mixtures [MM11] to general mixtures. The remainder of the paper is organized as follows: Section 2 recalls the first-order optimality condition for the NPMLE. Following [Lin83b] , Section 3 studies the NPMLE for mixtures of exponential family and bounds its number of atoms as well as analyzing its typical behavior. Section 4 derives Hellinger risk bounds for the NPMLE in the Gaussian mixture model. Section 5 concludes the paper by discussing the concept of "self-regularization", its ramifications and open problems. Throughout the paper, we use standard asymptotic notations: For any sequences {a n } and {b n } of positive numbers, we write a n b n if a n ≥ cb n holds for all n and some absolute constant c > 0, a n b n if a n b n , and a n ≍ b n if both a n b n and a n b n hold; the notations O(·), Ω(·), and Θ(·) are similarly defined. We write a n = o(b n ) or b n = ω(b n ) or a n ≪ b n or b n ≫ a n if a n /b n → 0 as n → ∞. In this section we review the first-order optimality condition (both necessary and sufficient) for characterizing the NPMLE. Denote the objective function in (2) by Let π = π NPMLE . Since ℓ( π) ≥ ℓ((1 − ǫ) π + ǫδ θ ) for any ǫ ∈ [0, 1] and any θ ∈ R, we arrive at the first-order optimality condition d Furthermore, averaging the LHS of (6) over π and using the definition of the mixture density in (1), we have In particular, the number of atoms of π is at most the number of critical points of D π . Example 1 (Poisson mixture). As a concrete example, let us consider the Poisson model, where for some nonnegative weights {w i }. Note that the quantity inside the parenthesis is a polynomial of θ of degree at most x max . Therefore, the number of critical points of D π (θ) and hence the number of atoms of π NPMLE are at most x max . This result is first observed 4 in [Sim76] , which slightly improves the bound x max + 1 in Remark 2. For other models, the first-order condition typically does not reduce to a polynomial equation. Following [Lin83a, Lin83b] , we consider the following exponential family. Let p 0 be a base density (with respect to some dominating measure µ) on R, whose moment generating function (MGF) and cumulant generating function is defined as 3 The condition (6) is also sufficient for the global optimality of π. 
Indeed, for any π, by the concavity of ℓ, The derivation here differs slightly with the original argument of [Sim76] which treats the system of {1, x, . . . , x k , e x }. and is assumed to be finite for all θ ∈ (θ, θ), where θ, θ ∈ [−∞, ∞]. Define the following exponential family of densities with natural parameter θ: Notable examples include: 2s . • Poisson model Poi(e θ ): p 0 = Poi(1), L(θ) = exp e θ − 1 and κ(θ) = e θ − 1. We need the following facts on the MGF: 2. κ is strictly convex and hence is strictly increasing in θ. Furthermore, if the distribution p 0 is fully supported on R, then µ(±∞) = ±∞. and sup z∈D(z 0 ,r) Next we focus on continuous exponential families for which the NPMLE solution is known to be unique [LR93] . The following is a deterministic bound on the number of atoms of the NPMLE. Remark 5. Roughly speaking, by choosing δ ≍ r, Theorem 3 shows that |supp( π NPMLE )| |θ| max a + κ max . Proof. Starting from (7), we bound the number of critical points of the following function where n i=1 w i = 1 and w i = c p 0 (x i ) p π (x i ) and c is the normalization constant. Then Since µ = κ ′ = L ′ L and L(θ) has no real roots (Lemma 2), we conclude that the critical points of F (θ) are the real roots of the following function: We first notice that all real roots of G should be on [θ min , θ max ]. Indeed, by the strict monotonicity of µ, we have G(θ) > 0 for θ > θ max and G(θ) < 0 for θ < θ min . Next, let us extend definition (12) to a complex argument z and modify the function by introducing: Note that the number of zeros of g in z ∈ [−r, r] is the same as the total number of real zeros of G. We will overbound this quantity by counting all zeros of g in a disk of radius r on C. To that end, we define M g (ρ) sup{|g(z)| : |z| ≤ ρ}. We next fix δ 4 > δ 3 > δ 2 > 0 such that θ max + δ 4 < θ and θ min − δ 4 > θ. Set r 2 = r + δ 2 and r 1 = r + δ 3 . On one hand, since µ(θ max ) = x max and µ(θ min ) = x min , we have and similarly On the other hand, by Lemma 2 we have sup |z|≤r 1 Now setting δ 4 = 3δ, δ 3 = 2δ, δ 2 = δ we get The result then follows by the following lemma after also simplifying Lemma 4. Let f be a non-zero holomorphic function on a disk of radius r 1 . Let n f (r) |{z : |z| ≤ r, f (z) = 0}| and M f (r) sup |z| 0. Then Since a 2 + r 2 2 < 1 + a 2 r 2 2 we find that (18) is maximized at φ = π, thus proving (17). Furthermore, from (18) applied with r 2 = 1 we also note that |B a (z)| = 1 whenever |z| = 1, which via (16) implies M g (1) = M f (1). Finally, from (16)-(17) and the fact that |g(z)| ≤ M f (1) for all z ∈ D we conclude that for any |z| ≤ r 2 we have This concludes the proof of (15) after noticing that each factor is upper bounded by 1 C(r,r 2 ) . As an application of Theorem 3, we now prove Theorem 1 for Gaussian location mixtures. Proof. Choose x max = max i∈[n] x i and x min = min i∈[n] x i . Recall that p θ denote the density of N (θ, 1). In this model, we have θ = −∞, θ = ∞, κ(θ) = θ 2 /2 and µ(θ) = θ. Thus θ min = x min , θ max = x max , r = a = 1 2 (x max − x min ), τ = δ. Conveniently, note that for location family, we have the following translation invariance: Let T x (π) denote the pushforward of π under the translation · + x. Then π NPMLE (x 1 + x, . . . , x n + x) = T x ( π NPMLE (x 1 , . . . , x n ) ). Therefore, without loss of generality, we can assume x min = −r ≤ 0 ≤ x max = r, so that x 0 = 0 and |x| max = r. Choosing δ = r yields (3). 
Finally, the high-probability statement follows from P [|x i | ≥ τ ] ≤ exp(−cτ 2 ) for some constant c and a union bound. The examples of Gaussian and Poisson models (Theorem 1 and Example 1) seem to suggest that NPMLE is always O(log n)-atomic with high probability. Indeed, there is some degree of universality to this bound, as the following result shows. The extra condition we impose essentially says that the tail probability P 0 {|X| ≥ a} behaves as exp(−a c ) for some c > 1. For notational convenience, we will assume that the base measure p 0 is symmetric around zero. Theorem 5. Fix 2 < K 0 ≤ K 1 and θ 0 , b, β > 0. Then there exist n 0 , C depending on (K 0 , K 1 , β, b) with the following property. Consider any density p 0 symmetric around zero whose log-MGF satisfies ∼ p π for some mixing distribution π supported on the interval [−b, b]. Then for all n ≥ n 0 , with probability 1 − 2n −β , π NPMLE has at most C log n atoms. Remark 6. Theorem 5 shows that the Gaussian tail is not essential for the O(log n) result to hold. In fact, consider any smooth density p 0 such that − log p 0 (x) ≍ |x| α for α > 1. Then by saddle-point approximation we have κ(θ) ≍ θ α/(α−1) as θ → ∞, which satisfies (19). On the other hand, compactly supported families are excluded since for those distributions κ(θ) is asymptotically linear (with a slope given by the essential supremum of p 0 ) as θ → ∞. Furthermore, exponential tails are also excluded. This is directly related to the open problem with mixtures of exponential distributions which will be discussed in Section 5.2. Proof of Theorem 5. We start by establishing properties of κ(·) and µ(·) implied by conditions of the theorem. Under the symmetry assumption of p 0 , κ(θ) is an even convex function with κ(θ) ≥ κ(0) = 0 and µ(0) = 0. From the convexity of κ we get for any θ > 0 And thus, for θ > θ 0 we get Clearly, also, µ(θ) → ∞ as θ → ∞ and hence p 0 is supported on the whole of R. Thus we have θ = −∞ and θ = ∞. Define the rate function for a > 0: which is achieved at θ = ρ µ −1 (a), so that E(a) = aρ − κ(ρ). From (20)-(21) (noting C 0 > 1) we conclude that as a → ∞ (and hence ρ → ∞) we get: We need to establish one more consequence of (19). Namely, there exists θ ′ 0 > 0 and m 0 ∈ N such that for any To show this, we select m 0 > log 2 4(K 0 −1) Denoting θ 2 2 m 0 θ 1 , applying the above inequality repeatedly yields µ(θ 2 ) ≤ µ(θ 1 ) + m 0 . Consequently, from the convexity of κ we have On the other hand, κ(θ 2 /2) ≥ κ(θ 1 ) + (2 m 0 −1 − θ 1 )µ(θ 1 ) . Taking the ratio of these, we get from (19): Rearranging terms we arrive at Dropping all negative terms on the right, and noticing that by the choice of θ ′ 0 we have ( K 0 2 − 1)µ(θ 1 ) − m 0 ≥ 1 2 ( K 0 2 − 1)µ(θ 1 ), we conclude 2 m 0 1 2 By the choice of m 0 , however, this is impossible. Hence, there must exist θ * satisfying (23). Having established (22) and (23) we proceed to the proof of the theorem. Let X ∼ p π . By the Chernoff bound, for any θ > 0, where the last inequality follows from the fact that L is an even function lower bounded by L(0) = 1, and L(θ ′ + θ) ≤ L(θ + b) for any θ > 0 and θ ′ ∈ [−b, b]. Optimizing over θ we set θ = ρ − b and obtain P [X ≥ a] ≤ e ab−E(a) , provided that ρ > b. Since we aim to apply Theorem 3, we need to choose x max and x min . We set them as follows. First we set θ 1 so that a 1 = µ(θ 1 ) verifies E(a 1 ) − a 1 b = (1 + β) log n. Note that as n → ∞ we have a 1 , θ 1 → ∞. In the sequel, we assume that n is so large that θ 1 > θ ′ 0 and θ 1 > b. 
Notice that from (22) we have E(a 1 ) ≫ a 1 and hence E(a 1 ) ≍ θ 1 a 1 ≍ κ(θ 1 ) ≍ log n . Next, having selected θ 1 we use (23) to select θ max = θ * . We set x max = µ(θ max ) and x min = −x max , θ min = −θ max , so that x 0 = (x min +x max )/2 = 0. Then we have P [X ≥ x max ] ≤ P [X ≥ a 1 ] ≤ n −(1+β) . Similarly, P [X ≤ −x min ] ≤ n −(1+β) . By the union bound this implies that with probability at least 1 − 2n −β , we have x min ≤ min i x i ≤ max i x i ≤ x max . Now we apply Theorem 3 with δ = 2θ max , obtaining where τ = µ(3θ max ) − µ(θ max ) ≥ µ(2θ max ) − µ(θ max ) > 1 by (23). Consequently, the last term in (25) is dominated by the first. Finally, we note that θ max ∈ [θ 1 , 2 m 0 −1 θ 1 ] and thus θ max ≍ θ 1 . From (19) we have κ(θ max ) ≍ κ(θ 1 ) ≍ log n. Similarly, from (22) we get θ max x max ≍ κ(θ max ) ≍ log n. In all, the right-hand side of (25) is ≍ log n as claimed. In this section we show how the self-regularization property of the NPMLE allows one to "bootstrap" existing results on MLE in finite Gaussian models to infinite mixtures. The following statistical guarantee on NPMLE is due to Zhang [Zha09] , improving over previous result of [ GvdV01, GvdV07] . ∼ p π π * ϕ and let π = π NPMLE (X 1 , . . . , X n ) be given in (2). Then where M SG (s) denote the collection of all s-subgaussian distributions on R. Next we show that using the self-regularization of the NPMLE, Theorem 6 can be deduced from existing guarantees on MLE in finite mixture models. We need a couple of auxiliary results, whose proofs are deferred to the end of this section. The following result is on approximating a general Gaussian mixture by finite mixtures: Lemma 7. Let π be 1-subgaussian. For any a > 0 and any k ∈ N, there exists a k-atomic π ′ supported on [−a, a], such that TV(p π , p π ′ ) ≤ 2e −a 2 /2 + 2e a 2 /4 ea 2 2k k Next we recall the statistical guarantee on the parametric MLE in finite Gaussian mixtures. By standard results on MLE (cf. e.g. [vdG00] ), this can be deduced from the bracketing entropy for this class, which has been thoroughly investigated in the literature [GvdV01,GW00,Zha09,MM11]. The following result is a corollary of the entropy bound of Maugis and Michel in [MM11] . Lemma 8. Let a ≥ 1 and k ∈ N. Let π k,a = π k,a (Y 1 , . . . , Y n ) is the (parametric) MLE defined in ∼ p π . There exists a universal constant C such that where M k,a denotes the collection of all k-atomic distributions on [−a, a]. Proof of Theorem 6. Let X 1 , . . . , X n i.i.d. ∼ p π = π * N (0, 1) for some 1-subgaussian π. Define the event E 0 {|X max | ≤ a 0 }, where a 0 = √ C 0 log n for some large absolute constant C 0 . Then E 0 has probability at least 1 − n −2 . By Theorem 1, on the event E 0 , π is supported on [−a 0 , a 0 ] and |supp( π)| ≤ C 1 a 2 0 = C 1 C 0 log n k 0 . Then for any k ≥ k 0 and a ≥ a 0 , on the event E 0 , we have π = π k,a (X 1 , . . . , X n ) argmax π∈M k,a n i=1 log p π (X i ). (In case that (28) has multiple maximizers, π is chosen to be any one of them.) Pick a = √ C 1 log n and k = C 2 log n such that a ≥ a 0 , k ≥ k 0 , and a 2 /k ≤ 1/10. Applying Lemma 7 with this choice, we obtain a k-atomic distribution π ′ supported on [−a, a] such that TV(p π , p π ′ ) ≤ n −3 . Let Y 1 , . . . , Y n i.i.d. ∼ p π = π * N (0, 1) for some 1-subgaussian π. Then TV(Law(X 1 , . . . , X n ), Law(Y 1 , . . . , Y n )) ≤ n −2 . Then there exists a coupling such that X i = Y i for i = 1, . . . , n with probability at least 1 − n −2 . Let E 2 denote this event. 
On the event of E 1 ∩ E 2 , we have π = π NPMLE (X 1 , . . . , X n ) = π k,a (X 1 , . . . , X n ) = π k,a (Y 1 , . . . , Y n ). Now we are in a position to pass the statistical guarantee on the parametric MLE in finite Gaussian mixtures to the NPMLE. Applying Lemma 8 with k ≍ log n and a ≍ √ log n, we have Finally, using the fact that H 2 ≤ TV and the triangle inequality for Hellinger, we have The proof is completed since H 2 ≤ 2 and E 0 ∩ E 1 has probability at least 1 − 2n −2 . Remark 7. The following minimax lower bound is shown in [Kim14] : which differs from the upper bound in Theorem 6 by log n. As frequently observed in the density estimation literature, such a logarithmic factor can be attributed to the fact that the analysis of the MLE is based on the global entropy bound. Thus obtaining a local version of the entropy bound in [MM11] can potentially close this gap and establish the sharp optimality of the NPMLE in achieving the lower bound in (30). Proof of Lemma 7. Without loss of generality, assume that π has zero mean. Let π denote the conditional version of π on [−a, a]. By the data processing inequality of total variation, TV(π * N (0, 1), π * N (0, 1)) ≤ TV(π, π) = π([−a, a] c ) ≤ 2e −a 2 /2 where the last inequality follows from π being 1-subgaussian. Next, let π ′ denote the k-point Gauss quadrature of π, such that π ′ and π have identical first 2k − 1 moments, and π ′ is also supported on [−a, a]. Then by moment-matching approximation (see [WY20, Lemma 8]), we have χ 2 (π ′ * N (0, 1) π * N (0, 1)) ≤ 4e a 2 /2 ea 2 2k 2k . Using the fact that 2TV 2 ≤ χ 2 and the triangle inequality, the previous two displays yield the desired bound. Thus E[H 2 (p π k,a , p π )] ǫ 2 n , where √ nǫ 2 n = J(ǫ n ) so that ǫ n ≍ k n log na 2 k . In this subsection we discuss the concept of self-regularization. Loosely speaking, an unregularized estimator can be said to achieve some form of self-regularization if it returns a density with o(n) components, which improves over the worst-case upper bound of n. Expanding on the reasoning in Remark 1, below we introduce a formal framework and provide a perspective on what may be the correct model size. Consider a sequence of nested statistical models M 1 ⊂ M 2 ⊂ · · · M ⊂ M(X ), where k is a parameter that encodes the "model complexity" of M k . For example, in linear models, M k denotes those with k-sparse regression coefficients; in shape-constrained setting, M k can be the set of kpiecewise constant or log-affine densities; in our setting of mixture models, M k is the set of all k-GM densities. Given a sample of size n, we define the statistical degree K n as where d max (A, B) sup P ∈A inf Q∈B H(P, Q) denotes the best approximation error (in the Hellinger distance) of the model class A by members of B. By definition, K n is the largest k so that any density in M can be made statistically indistinguishable (on the basis of n observations) from some density in M k ; in other words, given a sample of size n drawn independently from any f ∈ M , one can simulate it with probability at least 1 − c for some constant c using one drawn from some f k ∈ M k . From this simulation perspective, there is no statistical reason to fit a model of complexity bigger than K n ; on the other hand, it does not compromise the statistical performance (in terms of the Hellinger rate) to restrict to models of complexity at most K n . Thus, we view achieving the statistical degree K n as a criterion of self-regularization. 
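As a numerical companion to the moment-matching argument behind this framework (Remark 1 and Lemma 7), the following sketch approximates p_π = π * N(0,1), for a Gaussian mixing law π = N(0, s²), by the k-GM whose mixing law is the k-point Gauss-Hermite quadrature of π (which matches the first 2k − 1 moments), and evaluates the TV distance by numerical integration. The Gaussian choice of π, the integration grid, and the quadrature construction are assumptions of this illustration rather than the exact construction used in Lemma 7.

```python
# Minimal sketch (assumes a Gaussian mixing law N(0, s^2) and TV evaluated by
# numerical integration): approximate pi * N(0, 1) by the k-GM whose mixing law
# is the k-point Gauss-Hermite quadrature of pi (matching 2k - 1 moments).
import numpy as np

s = 1.0                                       # mixing law N(0, s^2); p_pi = N(0, 1 + s^2)
xs = np.linspace(-15, 15, 100001)
dx = xs[1] - xs[0]

def phi(t):
    return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)

p_true = phi(xs / np.sqrt(1 + s**2)) / np.sqrt(1 + s**2)

for k in [2, 4, 8, 16]:
    nodes, weights = np.polynomial.hermite.hermgauss(k)   # quadrature for weight e^{-t^2}
    atoms = np.sqrt(2) * s * nodes                        # k-point quadrature of N(0, s^2)
    probs = weights / np.sqrt(np.pi)                      # probabilities summing to 1
    p_k = sum(p * phi(xs - a) for p, a in zip(probs, atoms))
    tv = 0.5 * np.sum(np.abs(p_true - p_k)) * dx          # Riemann-sum TV distance
    print(k, f"TV(p_pi, k-GM) ~ {tv:.2e}")
```

If the printed distances decay exponentially in k, as the moment-matching bound predicts, then k of order log n already drives the TV distance to o(1/n), consistent with the statistical degree Θ(log n) discussed above.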
As shown in Remark 1, for the class M of Gaussian mixtures with subgaussian mixing distributions, we have K n = Θ(log n), which coincides with the typical model size fitted by the NPMLE. Next, we discuss a simple example where the self-regularization of the unpenalized NPMLE can be established directly. ∼ P ∈ M , the NPMLE for P (without enforcing the subgaussianity) is simply the empirical distribution P , where By a union bound, there exists a constant C, such that with probability 1 − o(1), P (j) = 0 for all j ≥ k = C √ log n. In other words, with high probability we automatically have P ∈ M k for some k that agrees with the statistical degree. Note that the self-regularizing property in Example 2 is a simple consequence of the explicit expression of the NPMLE in (33). In contrast, for mixture models in Theorems 1 and 3 we need to resort to the optimality condition and complex-analytic techniques, due to the lack of close-form expression of NPMLE in mixture models. Another major difference is that for mixture models M k is non-convex and hence optimizing the likelihood over M k can be expensive. Quite spectacularly, the full relaxation over all measures somehow automatically solves the nonconvex optimization (and for the right k). Although we have not identified an example of a mixture model where the number of atoms of NPMLE is ω(log n), the program of analyzing the NPMLE in Theorem 3 and Theorem 5 does have its limitations. As a leading example, let us consider mixtures of exponential distributions, which is among the earliest results on the structure of NPMLE [Jew82] (see also [GW92, Sec. 2.1]). Since the tail is exponential, this model is outside the scope of Theorem 5. Example 3 (Exponential mixture). Consider the exponential distribution Exp(θ) with density p θ (x) = θe −θx 1 {x>0} and θ > 0. In this case, the NPMLE is defined as Upon normalization, the gradient (6) is proportional to the function where n i=1 w i = 1 and w i ≥ 0. Thus the atoms of the NPMLE are roots of F ′ (θ) = n i=1 w i e −θx i (1− θx i ), which are localized in the interval [a, b] with a = 1/x max and b = 1/x min . Following the proof of Theorem 3, to bound the number of roots of F ′ , we can apply Lemma 4 to f (θ) = F ′ (θ − a+b 2 ) and r = b−a 2 . Choose r 2 = a+b 2 and r 1 = 2r = b − a. Since f (r 2 ) = F ′ (0) = 1, we have M f (r 2 ) ≥ 1. Moreover, it is clear that M f (r 1 ) ≤ exp(Cbx max ) for some constant C. Thus an application of Lemma 4 shows that However, in the stochastic setting the above bound is too loose to be useful. Indeed, suppose x 1 , . . . , x n are drawn independently from a single exponential distribution, say, Exp(1). Then with high probability, we have x min = Θ P ( 1 n ) and x max = Θ P (log n). Thus (36) yields | π NPMLE | = O(n log n), which is even worse than the deterministic bound of | π NPMLE | ≤ n. Clearly, the culprit of this looseness stems from the fact that the data-generating distribution is supported on R + which has a boundary at zero. Since the smallest observation will be on the order of 1 n , a priori one can only localize the atoms of the NPMLE in an interval of width Θ(n), which is much worse than Θ( √ log n) in the Gaussian model. Similar problems also arise in other distributions whose support has boundary points, such as Gamma or Beta families. 
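As an exploratory probe of the quantity discussed in Example 3 (and not evidence in either direction for the open question stated next), one can count sign changes of F'(θ) = Σ_i w_i e^{−θ x_i}(1 − θ x_i) on the interval [1/x_max, 1/x_min] for an Exp(1) sample. The random nonnegative weights and the grid resolution below are assumptions of the sketch; for the actual NPMLE the weights are tied to the fitted likelihood, so this only illustrates the object being bounded.

```python
# Exploratory sketch (not a proof): grid-based count of the real roots of
# F'(theta) = sum_i w_i * exp(-theta * x_i) * (1 - theta * x_i)
# on [1/x_max, 1/x_min] for an Exp(1) sample and arbitrary nonnegative weights.
import numpy as np

rng = np.random.default_rng(1)

def count_sign_changes(x, w, grid_size=50000):
    thetas = np.linspace(1.0 / x.max(), 1.0 / x.min(), grid_size)
    fprime = np.array([np.sum(w * np.exp(-t * x) * (1.0 - t * x)) for t in thetas])
    return int(np.sum(np.diff(np.sign(fprime)) != 0))

for n in [100, 1000]:
    x = rng.exponential(1.0, size=n)
    w = rng.random(n)
    print(n, count_sign_changes(x, w), f"interval width ~ {1.0 / x.min():.0f}")
```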
Open question: ∼ p π where supp(π) ⊂ [1, 2], prove that with probability 1 − o(1), we have |supp( π NPMLE )| = Θ(log n) The crucial O(log n) upper bound would follow from the following analytic conjecture: For any distribution π on [−a, a] the convolution (π * h)(x) h(x − y)π(dy) has at most O(a) critical points, where h(x) = e −e x +x is the density of a Gompertz distribution. So far we have focused on unconstrained NPMLE, where the likelihood is maximized over all mixing distributions. In case where one has extra knowledge such as compact support, moment constraint, or sparsity, these information can be incorporated into the optimization problem as linear constraints leading to potentially improved statistical performance. This begs the question: to what extent does constraint help the self-regularization of the NPMLE. Specifically, 1. If the unconstrained solution fails to self-regularize, does adding constraints make it so? 2. If the unconstrained solution is already self-regularizing, does adding constraints make it more so? We briefly discuss these two aspects below. For the first problem, let us continue Example 3 on exponential mixtures, where we pointed out that the program in Theorem 3 does not resolve the self-regularization of unconstrained NPMLE. Nevertheless, it is easy to show that adding a support constraint to NPMLE does resolve conjecture (37). Indeed, suppose that the parameter θ is bounded from above by some constant θ 0 , in which case one can consider the following support-constrained version of (34): Thanks to the constraint, we only need to count the number of critical points of (35) in the interval [0, θ 0 ]. Applying the same argument in Example 3 now with r = θ 0 /2 = O(1) yields | π ′ NPMLE | x max = O P (log n). Note that using moment matching and Taylor expansion we can show that the statistical degree for exponential mixtures with parameters bounded away from zero and infinity (say, supp(π) ⊂ [1, 2]) is O(log n). We conjecture that the statistical degree K n in this case is Θ(log n) and if so, the argument above shows that π ′ NPMLE does self-regularize. For the second problem, let us revisit the Gaussian location mixture. Suppose the mixing distribution is supported on a compact interval, say, [−1, 1]. Theorem 1 shows that the unconstrained NPMLE is O(log n)-atomic with high probability. However, when the mixing distribution is compactly supported, the moment-matching argument in [WY20, Lemma 8] shows that the statistical degree in fact reduces to O( log n log log n ). Again, we conjecture that in this case K n ≍ log n log log n . Then a natural question is whether NPMLE with support constraint π ′ NPMLE defined as in (38) with maximization over {π : supp(π) ∈ [−1, 1]} achieves a better self-regularization of O( log n log log n ) atoms. 5 The main bottleneck of proving this is the following. Note that similar to the proof of Theorem 3 we can reduce to the problem of counting the critical point of (11), which for Gaussian model simplifies to However this time we are not interested in bounding the number of all critical points of F , but only those in [−1, 1]. Thus, we can set r = 1, r 1 ≍ √ log n in the application of Lemma 4. The issue is with setting r 2 . If we could show that F must have at least one point z 0 inside a disk of radius for some C (with high probability), then invoking Lemma 4 with r 2 = O(1) would conclude that F ′ has at most O( log n log log n ) roots inside the unit disk. 
It is tempting to conjecture further that z 0 satisfying (40) such that |F ′ (z)| ≤ n −C log log n for all |z| = O(1). Therefore unlike the proof of Theorem 1, here we cannot ignore the stochastic origin of x i and that w i is inversely proportional to the fitted likelihood at x i (see (39)). Since π ′ NPMLE itself is random, proving this property of G(z) seems to require a delicate analysis of "small-ball" probabilities of the empirical process. This is left for future work. In the special case of the Gaussian location mixture, Theorem 3 translates to the following statement on the Gaussian convolution: For any distribution π supported on the interval [−a, a], the convolution π * ϕ has at most O(a 2 ) critical points. This result has been shown independently in the recent work [DYPS20, Theorem 6] by similar techniques using a corollary of Jensen's formula from [Tij71] . Can this bound be improved? The answer is negative and we next give a simple construction of a Gaussian mixture with Ω(a 2 ) local maxima. 6 Lemma 9. Let h be a continuous probability density on R with characteristic function h and CDF H. Suppose we have ω 0 and a > 0 such that | h(ω 0 )| > 2(H(−a/2) + 1 − H(a/2)) . (41) Let π(x) = c(1 + sin(ω 0 x))1{|x| ≤ a} with c > 0 chosen to make π a probability density. Then h * π has at least ω 0 a 2π local maxima on [−a/2, a/2]. Proof. Let π 0 (x) = 1+sin(ω 0 x). Then 0 ≤ π 0 ≤ 0. Then (π 0 * h)(x) = | h(ω 0 )| sin(ω 0 x−arg h(ω 0 ))+ 1, which is a shifted and scaled sinusoid. Let S + and S − be the sets of global maxima and minima of π 0 * h (which are lattices with step 2π ω 0 ). Let π 1 (x) = π 0 (x)1{|x| ≤ a}. Define ∆ h * π 0 − h * π 1 . Then ∆ ≥ 0 everywhere. Furthermore, Thus, by assumption (41), for any |x| ≤ a/2 we have ∆(x) ≤ | h(ω 0 )|. Consequently, for any x ∈ S + ∩ [−a/2, a/2] we have (π 1 * h)(x) = (π 0 * h)(x) − ∆(x) > 1 and for any x ∈ S − ∩ [−a/2, a/2] we have (π 1 * h)(x) < 1 − | h(ω 0 )|. Thus, the level 1 − 1 2 | h( 0 )| must be crossed in between any two consecutive points from S + and S − , implying the statement. Corollary 10. There exists a compactly supported density π on [−a, a] so that π * ϕ has Ω(a 2 ) local maxima on [−a/2, a/2]. Proof. By the Gaussian tail bound, we have H(−a/2) + 1 − H(a/2) ≤ 2e −a 2 /8 . Choosing ω 0 = a/4, the claim follows from Lemma 9 for sufficiently large a. Consider the following question: Given a convex combination of k unimodal densities, how many modes can it have? A moment of thought shows that the answer is trivial as the sum of two unimodal densities, e.g. can have infinitely many modes; the same example also applies even if unimodality is replaced by log-concavity. A natural question is what happens to strongly log-concave densities. 7 By replacing (43) with which is strongly log-concave, we again see that f (x) + f (x − 1) can have a flat piece. Furthermore, it is possible to construct infinitely differentiable f (by convolving with a mollifier) with the same property; however, such a density is not analytic. Thus, we ask the question: Given a convex combination of k analytic densities that are strongly log-concave, how many modes can it have? The following result gives an Ω(k 2 ) lower bound. Whether this is tight is an open question. Corollary 11. There exist strongly log-concave analytic densities f 1 , . . . , f k on R and weights α 1 , . . . , α k such that α 1 f 1 + . . . + α k f k has Ω(k 2 ) local maxima. Proof. Take the π supported on [−a, a] from Corollary 10. 
Partition [−a, a] into k = 4a consecutive intervals I 1 , . . . , I k of length 1/4. Let π i denote the conditional version of π on I i and set α i = π(I i ). Recall the fact that (log(µ * ϕ)) ′′ ≥ 1 − b 2 for any probability measure µ supported on an interval of length 2b; this follows from the well-known identity (log(µ * ϕ)) ′′ (y) = 1 − Var(X|X + Z = y), where X ∼ µ and Z ∼ N (0, 1) are independent. Then f i π i * ϕ is strongly log-concave satisfying (log f i ) ′′ ≥ 3/4. Since π * ϕ = k i=1 α i f i , the desired conclusion then follows from Corollary 10. In addition to those on exponential mixtures and constrained NPMLE mentioned in Section 5.3 and Section 5.2, we end the paper by describing some further open problems on the structure of NPMLE: Lower bound for NPMLE A particular consequence of Theorem 1 is the following: when the sample are generated from a finite Gaussian mixture, say, N (0, 1), with high probability the NPMLE outputs a Gaussian mixture with at most O(log n) components. To understand the NPMLE from the perspective of overparameterization, it is of great interest to determine whether this bound is tight. (Note that the reasoning in Remark 1 only shows that this is tight when the true density is N (0, σ 2 ) for any σ 2 > 1.) If so, it would show that the unpenalized NPMLE indeed selects a slightly inflated model (at the price of being fully automatic) and the model selection criterion, such as BIC [Ler92, Ker00] , is genuinely needed for achieving consistency in estimating the order of the mixture. As mentioned in Section 1, such lower bound is known to hold for the Grenander estimator (NPMLE for monotone density): if the true density f is uniform, then the number of pieces in the Grenander estimator is asymptotically N (log n, log n) [GL93] . This is a direct consequence of a celebrated result of Sparre Andersen on the least concave majorant of empirical CDF [SA54, Gro20] , whose discontinuity in slope correspond to the atoms of Grenander estimator. For the NPMLE in mixture models, no such simple characterization is known other than the first-order condition (6). Multivariate models Compared to the univariate case, the structure of the NPMLE is far less well understood for multivariate models. Indeed, the general theory developed in [Lin95] relies on the parameter space being one-dimensional. For instance, for the simplest Gaussian location mixture, even the uniqueness of the solution is open in dimension d ≥ 2. Similar to the analysis in the current paper, bounding the number of atoms in the NPMLE boils down to counting the critical points of a Gaussian mixture (39) with centers being the individual observations, which, if drawn from a subgaussian distribution, lie in a hypercube of size O( √ log n) with high probability. The construction in [KK20, Proposition 3] shows that there exists a mixing distribution on [−a, a] d whose Gaussian location mixture has Ω(a 2d ) modes. However, it is unclear whether this is tight and directly extending the complex-analytic technique in this paper to multiple dimensions appears challenging. On the other hand, although the uniqueness of the NPMLE is not settled, the usual analysis of maximal likelihood (zeroth-order optimality) yields statistical guarantees that apply to any solution of the NPMLE [DZ16, SG20] . For example, extending the work of [Zha09] , [SG20, Corollary 2.2] showed that if the mixing distribution is compactly supported, then the estimated mixture density has squared Hellinger accuracy of O d ((log n) d+1 /n). 
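As a hypothetical illustration of the mode-counting question in the multivariate discussion above (and unrelated to the explicit construction of [KK20]), the following sketch counts grid-detected local maxima of a two-dimensional Gaussian location mixture whose centers are the sample points themselves, with uniform weights. The grid resolution, sample size, and strict 8-neighbor criterion are assumptions of the sketch.

```python
# Minimal sketch (crude 8-neighbor grid detection, uniform weights): count local
# maxima of a 2-D Gaussian location mixture whose centers are the sample points.
import numpy as np

rng = np.random.default_rng(2)

def mixture_density_2d(centers, ts):
    gx, gy = np.meshgrid(ts, ts)
    dens = np.zeros_like(gx)
    for cx, cy in centers:                 # accumulate one unit-variance bump per center
        dens += np.exp(-0.5 * ((gx - cx)**2 + (gy - cy)**2))
    return dens / len(centers)

def count_grid_local_maxima(dens):
    g = dens.shape[0]
    inner = dens[1:-1, 1:-1]
    is_max = np.ones_like(inner, dtype=bool)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            is_max &= inner > dens[1 + di:g - 1 + di, 1 + dj:g - 1 + dj]
    return int(is_max.sum())

n = 200
centers = rng.standard_normal((n, 2))      # subgaussian sample in R^2
a = float(np.abs(centers).max()) + 1.0     # covering box [-a, a]^2
ts = np.linspace(-a, a, 400)
print(count_grid_local_maxima(mixture_density_2d(centers, ts)))
```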
In view of the above results, we conjecture that the solution to the NPMLE for multivariate Gaussian mixtures is unique and, furthermore, given a subgaussian sample of size n, it is typically $(\log n)^{C(d)}$-atomic when the dimension d is not too big.

Log-concave NPMLE The NPMLE for log-concave densities is well studied in the nonparametric statistics literature. Basic properties (such as almost sure existence and uniqueness) and computational algorithms were obtained in [PWM07, DR09] in one dimension and extended to multiple dimensions in [CSS10]. In particular, similar to the NPMLE for monotone densities (the Grenander estimator), which is piecewise constant, the logarithm of the NPMLE for a log-concave density is piecewise affine with at most n pieces; however, unlike the Grenander estimator, its typical structure (e.g., the number of pieces) is little understood, partly because the optimality condition is more complicated. In terms of statistical results, in one dimension the minimax squared Hellinger rate is shown to be $\Theta(n^{-4/5})$ [DW16, KS16]. For dimension d ≥ 2, [KS16] proved the minimax lower bound $\Omega(n^{-2/(d+1)})$ and showed that it can be attained by the NPMLE up to logarithmic factors for d = 2 and 3. This near-optimality of the NPMLE was recently extended to any dimension in [KDR19, Han19]. In view of the corresponding results for the Grenander estimator, if one interprets the minimax rate as the effective dimension divided by the sample size, it is reasonable to conjecture that the typical number of pieces in the log-concave NPMLE is $O(n^{1/5})$ in one dimension and $O(n^{(d-1)/(d+1)})$ for d ≥ 2.

References

The Grenander estimator: A nonasymptotic approach. The Annals of Statistics
Maximum likelihood estimation of a multi-dimensional log-concave density
Maximum likelihood estimation of a logconcave density and its distribution function: Basic properties and uniform consistency
Global rates of convergence of the MLEs of logconcave and s-concave densities
The capacity achieving distribution for the amplitude constrained additive Gaussian channel: An upper bound on the number of mass points
High-dimensional classification via nonparametric empirical Bayes and maximum likelihood inference
Nonparametric roughness penalties for probability densities
Nonparametric estimation under shape constraints
Isotonic estimators of monotone densities and distribution functions: basic facts
On the theory of mortality measurement. Part II
Abstract inference
Vertices of the least concave majorant of Brownian motion with parabolic drift
Grenander functionals and Cauchy's formula
Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities
Posterior convergence rates of Dirichlet mixtures at smooth densities. The Annals of Statistics
Information bounds and nonparametric maximum likelihood estimation
Rates of convergence for the Gaussian mixture sieve
Global empirical risk minimizers with "shape constraints" are rate optimal in general dimensions
Mixtures of exponential distributions
General maximum likelihood empirical Bayes estimation of normal means
Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences
Optimality of maximum likelihood for log-concave density estimation and bounded convex regression
Consistent estimation of the order of mixture models
Comment: Minimalist g-modeling
Minimax bounds for estimation of normal mixtures
How many modes can a constrained Gaussian mixture have?
Convex optimization, shape constraints, compound decisions, and empirical Bayes rules
Global rates of convergence in log-concave density estimation
Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters
Nonparametric maximum likelihood estimation of a mixing distribution
Consistent estimation of a mixing distribution
The geometry of mixture likelihoods: a general theory
The geometry of mixture likelihoods, part II: the exponential family
Mixture models: theory, geometry and applications
Uniqueness of estimation and identifiability in mixture models
A non asymptotic penalized criterion for Gaussian mixture model selection
Note on approximating the Laplace transform of a Gaussian on a unit disk
Estimating a Polya frequency function 2
A generalization of the method of maximum likelihood: Estimating a mixing distribution (Abstract)
On the fluctuations of sums of random variables
On the nonparametric maximum likelihood estimator for Gaussian location mixture densities with application to Gaussian denoising
On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics
Maximum likelihood estimation of a compound Poisson process. The Annals of Statistics
Log-concavity and strong log-concavity: a review. Statistics Surveys
On the number of zeros of general exponential polynomials
Empirical Processes in M-Estimation
The impact of constellation cardinality on Gaussian channel capacity
Optimal estimation of Gaussian mixtures with denoised method of moments
Generalized maximum likelihood estimation of normal mixture densities

Acknowledgments This work was partially completed when the authors were visiting the Information Processing Group at the School of Computer and Communication Sciences of EPFL, whose generous support is gratefully acknowledged and whose seminar, canceled due to COVID-19, nevertheless brought the independent work [DYPS20] to our attention. The authors thank Pengkun Yang for helpful discussions at the onset of the project and for informing us of [KK20]. The authors are also grateful to Roger Koenker for helpful discussions on [KG19] and for providing numerical simulations. Y. Wu is supported in part by the NSF Grant CCF-1900507, NSF CAREER award CCF-1651588, and an Alfred Sloan fellowship. Y. Polyanskiy is supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-09-39370, and the MIT-IBM Watson AI Lab.