Support Estimation with Sampling Artifacts and Errors
Chien, Eli; Milenkovic, Olgica; Nedich, Angelia
2020-06-14

The problem of estimating the support of a distribution is of great importance in many areas of machine learning, computer science, physics and biology. Most of the existing work in this domain has focused on settings that assume perfectly accurate sampling, which is seldom true in practical data science. Here we introduce the first known approach to support estimation in the presence of sampling artifacts and errors, where each sample is assumed to arise from a Poisson repeat channel that simultaneously captures repetitions and deletions of samples. The proposed estimator is based on regularized weighted Chebyshev approximations, with weights governed by evaluations of so-called Touchard (Bell) polynomials. The supports in the presence of sampling artifacts are calculated using discretized semi-infinite programming methods. The estimation approach is tested on synthetic and textual data, as well as on GISAID data collected to address a new problem in computational biology: mutational support estimation in genes of the SARS-Cov-2 virus. In the latter setting, the Poisson channel captures the fact that many individuals are tested multiple times for the presence of viral RNA, thereby leading to repeated samples, while other individuals' results are not recorded due to test errors. For all experiments performed, we observed significant improvements of our integrated methods over adequate modifications of state-of-the-art noiseless support estimation methods.

Estimating the support size of a discrete distribution is an important theoretical and data processing problem [1, 2]. In computer science, this task frequently arises in large-scale database mining and network monitoring, where the objective is to estimate the types of database entries or IP addresses from a limited number of observations [3, 4, 5]. In machine learning, support estimation is used to bound the number of clusters in clustering problems encountered in semi-supervised or active learning [6, 7, 8, 9]. In the life sciences, support estimation arises when estimating population sizes or increases in population sizes [10]. The most challenging practical support estimation issues are encountered in the "small sample set" regime, in which one has only a limited number of observations from a distribution with a large support. In such a setting, classical maximum likelihood frequency techniques are known to perform poorly [11]. It is for this sampling regime that the estimation problem has received significant attention from the theoretical computer science and machine learning communities, as well as from researchers in various computational data processing areas [12, 13, 14, 15, 5, 16, 17, 18]. By now, a number of efficient and near-optimal support estimation techniques have been reported in the literature [19, 20, 21, 22, 23, 24, 25]. All of these methods rely on the assumption that the samples are observed without errors. In practice, sampling artifacts and noise are ubiquitous, especially when dealing with data acquired from biological and medical experiments.
As a motivating example, consider the problem of estimating the number of mutations in a viral genome (such as that of SARS-Cov-2) during the early stages of an outbreak. Viral RNA/DNA is usually of length ∼10,000 nucleotides, and testing is time consuming, expensive, and additionally hampered by privacy issues. Consequently, only a small number of sequenced genomes may be available when trying to determine in a timely manner whether the virus is mutating at a high rate and is therefore potentially dangerous to the population (note that a large body of work reports mutational rates of viruses as indicators of their virulence and potential to cause epidemic and pandemic outbreaks [26, 27]). In this particular case, the actual alphabet size is known and equal to the length of the genome, but not all genomic sites are subject to mutations. Furthermore, sequencing errors introduce counting artifacts, and so do sampling biases caused by some individuals being tested multiple times (e.g., health workers [28]) or not being tested at all. To address these issues, we propose a novel noisy support estimation problem under the Poisson repeat channel [29] model. The Poisson repeat channel models both deletions and repetitions of particular mutational sites, and is adequate for capturing unknown sampling biases and sequencing error phenomena. To the best of our knowledge, this work is the first to consider the noisy support estimation problem in which symbols are potentially repeated or missed. The only other line of work addressing a similar problem was reported in [30], with a focus on Good-Turing distribution estimation in the presence of sample insertion errors.

We address the noisy small-sample support estimation problem under the Poisson repeat channel by using novel regularized weighted Chebyshev approximation techniques. Weighted polynomial approximation techniques are largely unknown in the machine learning community [31], since all previously proposed approximation-based methods have focused on unweighted and noiseless settings [20, 22, 24, 25]. As will be shown in our subsequent analysis, exponential smoothing and "noise-compensating" weights play a major role in improving the performance of polynomial methods, as well as in making them computationally tractable. Within this framework, the Mhaskar-Saff theorem and extensions thereof presented in this work are of great importance [31, 32, 33]. In addition, our regularization term arises from consideration of the variance of the estimators, as the weighted Chebyshev approximation component only takes the bias into account. Hence, solving our regularized weighted Chebyshev approximation problem is equivalent to jointly optimizing the bias and variance of the estimator. We also show that Touchard (Bell) polynomials [34] naturally arise when incorporating the Poisson repeat channel into the model. To numerically solve the underlying optimization problem, we use discretized semi-infinite programming (SIP) techniques. We prove that the solution of the discretized SIP converges to the true unique optimal solution of the noisy support estimation problem. Through extensive experiments on both synthetic and real-world data, we show that our methods are able to accurately estimate the support size under the Poisson repeat channel.

Prior work. Noiseless support estimation methods operating in the small-sample regime can be roughly grouped into two categories [19, 20, 21, 22, 23, 24, 25].
The first line of works [19, 21, 23] makes use of the maximum likelihood principle. While [21] constructs estimators based on the Profile Maximum Likelihood (PML) [35], the work reported in [19] focuses on Sequence Maximum Likelihood (SML) estimators [36]. The main advantage of ML-based methods is that they easily generalize to many other estimation tasks. For example, the authors of [21] showed that a single method may be used for entropy estimation, support coverage and distance-to-uniformity analysis. However, most ML-based estimators require large computational resources [23, 20]. To address the computational issue, a sophisticated approximate PML technique that reduces the computational complexity of support estimation at the expense of some performance loss was proposed in [23]. The second line of works [20, 22, 24, 25] formulates support estimation as an approximation problem. The underlying methods, which we henceforth refer to as approximation-based methods, design estimators by minimizing the worst-case risk. In particular, [20] uses shifted and scaled Chebyshev polynomials of the first kind to construct efficient estimators. In contrast, the authors of [25] suggest dispensing with minmax estimators and instead implement a data amplification technique with analytical performance guarantees; the resulting estimator is based on polynomial smoothing [37], which is related to approximation techniques. All described approximation-based estimators are computationally efficient, with the exception of [24], as reported in [25]. Note that sampling artifacts lead to non-i.i.d. observations, which are very different from the Markovian models discussed in [38].

The paper is organized as follows. In Section 2, we introduce the relevant notation and the support estimation problem under the Poisson repeat channel. We also describe a class of estimators termed polynomial class estimators. In Section 3, we outline our analysis of polynomial class estimators and describe the main results needed to overcome the technical challenges associated with regularized weighted minmax polynomial approximations. Section 4 is devoted to experimental verification and testing, both on synthetically generated data and on real-world data. Our real-world data includes Shakespeare's plays, used to illustrate the performance of our methods, and a collection of ∼4,100 SARS-Cov-2 viral genomes retrieved from GISAID [39, 40].

Let $P = (p_1, p_2, \dots)$ be a discrete distribution over some countable alphabet and let $X_1, \dots, X_n$ be i.i.d. samples drawn according to the distribution $P$. The problem of interest is to estimate the support size, defined as $S(P) = \sum_i 1_{\{p_i > 0\}}$, where $1_A$ stands for the indicator of the event $A$. When clear from the context, we use $S$ instead of $S(P)$ to avoid notational clutter. We make the assumption that the minimum nonzero probability of the distribution $P$ is at least $1/k$, for some $k \in \mathbb{R}^+$, i.e., $\inf\{p \in P : p > 0\} \ge \frac{1}{k}$. Furthermore, we let $D_k$ denote the space of all probability distributions satisfying $\inf\{p \in P : p > 0\} \ge \frac{1}{k}$. Clearly, $S \le k$ for all $P \in D_k$. A sufficient statistic for $X_1, \dots, X_n$ is the empirical distribution (i.e., histogram) $N = (N_1, N_2, \dots)$, where $N_i$ denotes the number of times symbol $i$ appears among the $n$ samples.

The Poisson repeat channel. For each sample index $1 \le i \le n$, the Poisson repeat channel with parameter $\eta$ outputs $R_i$ copies of the sample $X_i$, where $R_i \sim \text{Poisson}(\eta)$ is a Poisson distributed random variable with mean $\eta$. For simplicity of analysis, the variables $R_i$ are assumed to be i.i.d. and the corresponding channel memoryless; the value $R_i = 0$ indicates that the sample $X_i$ has been deleted.
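To make the sampling model concrete, the following minimal Python sketch simulates the Poisson repeat channel; the function name and interface are ours, for illustration only, and are not code from the paper.

```python
import numpy as np

def poisson_repeat_channel(samples, eta, rng=None):
    """Pass a sequence of samples through a Poisson repeat channel.

    Each sample X_i is emitted R_i ~ Poisson(eta) times; R_i = 0 means the
    sample is deleted. The R_i are drawn i.i.d., so the channel is memoryless.
    """
    rng = np.random.default_rng() if rng is None else rng
    repeats = rng.poisson(eta, size=len(samples))
    return np.repeat(np.asarray(samples), repeats)

# Example: n = 10000 draws from a uniform distribution over k = 1000 symbols,
# observed through a channel with eta = 0.5 (heavy deletions).
rng = np.random.default_rng(0)
x = rng.integers(0, 1000, size=10000)
y = poisson_repeat_channel(x, eta=0.5, rng=rng)
print(len(x), len(y))  # the output length concentrates around eta * n = 5000
```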
We also define the empirical distribution of the output sequence of the Poisson repeat channel as $N' = (N'_1, N'_2, \dots)$, where $N'_i$ denotes the number of times symbol $i$ appears at the channel output. Throughout the paper, we assume that the parameter $\eta$ is known, although it is possible to learn it simultaneously with the support. The focal point of our analysis is to upper bound the minmax risks under normalized squared loss, without and with the Poisson repeat channel, respectively. We focus on the case including the Poisson repeat channel, as a similar analysis can easily be performed for the case without Poisson repeats. We seek a support estimator $\hat S$ that minimizes

$$\sup_{P \in D_k} \left\{ \left(\frac{\mathbb{E}\hat S - S}{k}\right)^2 + \frac{\text{Var}(\hat S)}{k^2} \right\}.$$

The first term within the supremum captures the expected bias of the estimator $\hat S$; the second term represents the variance of $\hat S$. Hence, "good" estimators are required to balance out the worst-case contributions of the bias and variance. We define a class of polynomial based estimators as follows. Given a parameter $L \in \mathbb{N}$, we say that an estimator $\hat S$ is a polynomial class estimator with parameter $L$ (i.e., a $\text{Poly}(L)$ estimator) if it takes the form $\hat S = \sum_i g_L(N_i)$, where

$$g_L(j) = \begin{cases} 1 + a_j\, j!, & 0 \le j \le L, \\ 1, & j > L. \end{cases}$$

Here, $a_j \in \mathbb{R}$, and $a_0 = -1$, since this choice ensures that $g_L(0) = 0$. One can associate an estimator $\hat S$ with its corresponding coefficients $a$, and define a family of estimators $\text{Poly}(L) = \{a \in \mathbb{R}^{L+1} : a_0 = -1\}$. Next, we show that the problem of minimizing the worst-case risk within the class $\text{Poly}(L)$ can be cast as a regularized exponentially weighted Chebyshev approximation problem [31]. We start by analyzing the minmax risk under the Poisson channel model through Poissonization arguments. These assert that the number of samples drawn is Poisson distributed, $N \sim \text{Poisson}(n)$, and that the counts $N_i \sim \text{Poisson}(\lambda_i)$ are independent, where $\lambda_i = n p_i$. Poissonization was also used in [20, 24] to derive the following tight upper bound on the minmax risk.

Lemma 3.1 (Lemma 1 in [20]). Let $R^*_P(n, k)$ be the minmax risk under the Poissonized model. Then, for any $\beta > 1$, the minmax risk with $n$ samples can be bounded in terms of the Poissonized risk, up to an additive term that vanishes exponentially fast in $n$.

Note that it is straightforward to show that Lemma 3.1 also remains valid under the Poisson repeat channel. Let $\mathcal{L} = \{\ell : \lambda_\ell > 0\}$ be the set of symbols with positive probability. A simple calculation reveals that for any $\hat S \in \text{Poly}(L)$, the risk decomposes into a variance term and a squared-bias term, where the inequality uses the Cauchy-Bunyakovsky-Schwarz inequality for the cross terms. An interesting observation is that the resulting objective function is symmetric in the parameters $\lambda_i$. The little-known problem of establishing when the optima of such constrained symmetric functions are achieved when all parameters are equal has been studied in [41]. We first analyze the bias term $\mathbb{E}(g(N'_i) - 1)$. By the definition of the Poisson repeat channel, we have $N'_i \sim \text{Poisson}(\eta N_i)$, which allows us to express $\mathbb{E}(g(N'_i))$ through moments of the random variable $N^* \sim \text{Poisson}(\lambda_i e^{-\eta})$; these moments are evaluations of the Touchard polynomials, $\sum_r {l \brace r} (\lambda_i e^{-\eta})^r$, where ${l \brace r}$ denotes the Stirling number of the second kind, counting the number of partitions of a set of size $l$ into $r$ disjoint nonempty subsets. Following a procedure similar to the one used for the bias, one can show that the term corresponding to the variance admits an analogous expression involving $N^* \sim \text{Poisson}(\lambda e^{-\eta})$. The inequality (7) is due to the increase in the domain over which we take the supremum. The inequality (8) follows from (4). The last inequality is a consequence of $|\mathcal{L}| = S \le k$ and the fact that all terms in the summation are nonnegative.
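The appearance of the Touchard polynomials can be sanity-checked numerically: the $l$-th moment of a Poisson($x$) random variable equals $T_l(x) = \sum_r {l \brace r} x^r$. The sketch below is our own illustration of this classical identity, not code from the paper.

```python
import numpy as np

def stirling2(l, r):
    """Stirling number of the second kind, via the standard recurrence
    S(l, r) = r * S(l-1, r) + S(l-1, r-1)."""
    if l == r:
        return 1
    if r == 0 or r > l:
        return 0
    return r * stirling2(l - 1, r) + stirling2(l - 1, r - 1)

def touchard(l, x):
    """Touchard (Bell) polynomial T_l(x) = sum_r S(l, r) x^r."""
    return sum(stirling2(l, r) * x ** r for r in range(l + 1))

# Monte Carlo check of the moment identity E[N^l] = T_l(lam), N ~ Poisson(lam).
rng = np.random.default_rng(1)
lam, l = 2.3, 4
samples = rng.poisson(lam, size=1_000_000).astype(float)
print(np.mean(samples ** l))   # approximately 140.3 for lam = 2.3, l = 4
print(touchard(l, lam))        # exact value 140.316...
```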
Hence, we have to minimize an objective with respect to the coefficients $a_1, \dots, a_L$; we refer to the resulting regularized problem for the noisy case as (10). For the case in which the Poisson repeat channel is not present, one only needs to readjust the exponential weights and replace the weighted Touchard terms $\eta^l M_{N^*}$ by plain monomials, which yields

$$\min_{a \in \text{Poly}(L)}\; \sup_{\lambda \in [\frac{n}{k},\, n]} \left\{ \frac{1}{k}\sum_{l=0}^{L} e^{-\lambda} a_l^2 \lambda^l\, l! + \left(e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l\right)^2 \right\}. \qquad (11)$$

Note that both (10) and (11) take the form of a regularized weighted Chebyshev approximation problem. For simplicity, we first focus on the noiseless case (11), as similar but more tedious arguments apply to the noisy case (10). If we ignore the first term in (11), the optimization problem reads as

$$\min_{a \in \text{Poly}(L)}\; \sup_{\lambda \in [\frac{n}{k},\, n]} \left| e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l \right|. \qquad (12)$$

The term $e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l$ corresponds to the bias of the estimator, and it is straightforward to see that the optimal choices of $a$ for the two problems (with and without the square) coincide. Problem (12) is an exponentially weighted Chebyshev approximation problem [42]. Note that one can further upper bound (12) according to

$$\sup_{\lambda \in [\frac{n}{k},\, n]} \left| e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l \right| \;\le\; e^{-\frac{n}{k}} \sup_{\lambda \in [\frac{n}{k},\, n]} \left| \sum_{l=0}^{L} a_l \lambda^l \right|, \qquad (13)$$

resulting in a standard Chebyshev approximation problem, whose solution takes the form of scaled and shifted Chebyshev polynomials. Although the authors of [20] obtained the coefficients of their Chebyshev estimator using a different interval in the supremum that accounts for the variance, (13) together with our extensive simulation results shows that ignoring the exponential weights leads to a worse bound on the risk and to worse practical performance.

The first term in (11), $\frac{1}{k}\sum_{l=0}^{L} e^{-\lambda} a_l^2 \lambda^l\, l!$, corresponds to the variance and may be rewritten as $\frac{1}{k}\|a\|^2_{\mathrm{M}(\lambda)}$, where $\mathrm{M}(\lambda)$ denotes the diagonal matrix with entries $e^{-\lambda}\lambda^l\, l!$, $l = 0, \dots, L$. Clearly, $\|\cdot\|_{\mathrm{M}(\lambda)}$ is a valid norm, and consequently the first term in (11) may be viewed as a regularizer. For the case including the Poisson repeat channel, since a sum of Touchard polynomials is still a polynomial, the bias term in (10) also gives rise to an exponentially weighted Chebyshev approximation problem, while the variance term may be written as a weighted norm of $a$ with a weight matrix built from the Touchard polynomial evaluations (i.e., the moments of $N^*$). The resulting problem (10) is once again a regularized weighted Chebyshev approximation problem.

Solving problems (10) and (11). Solving (10) and (11) directly appears to be difficult, so we instead numerically solve the epigraph formulations of the semi-infinite programs (10) and (11) and prove that the numerical solutions are asymptotically consistent. Once again, we start with the case that excludes the Poisson repeat channel, (11). The epigraph formulation of (11) is a semi-infinite program of the form ([43], Chapter 6.1)

$$\min_{a,\, t}\; t \quad \text{s.t.} \quad \frac{1}{k}\sum_{l=0}^{L} e^{-\lambda} a_l^2 \lambda^l\, l! + \left(e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l\right)^2 \le t \;\; \forall \lambda \in \left[\tfrac{n}{k}, n\right], \qquad a_0 = -1. \qquad (14)$$

Many algorithms can be used to numerically solve (14), including discretization methods, the central cutting plane method, and KKT and SQP reductions [44, 45]. For simplicity, we focus on the discretization method. For this purpose, we first form a grid of the interval $[\frac{n}{k}, n]$ comprising $s$ points, denoted by $\text{Grid}([\frac{n}{k}, n], s)$. Problem (14) is an LP with infinitely many quadratic constraints and hence cannot be solved directly. Instead of solving (14), we therefore focus on the relaxed problem in which the constraints are enforced only for $\lambda \in \text{Grid}([\frac{n}{k}, n], s)$. (15)

As will be discussed in greater detail later, the solution of the relaxed problem is asymptotically consistent with the solution of the original problem (i.e., as $s \to \infty$, the optimal values of the objectives of the original and relaxed problems coincide). Problem (15) is an LP with a finite number of quadratic constraints that may be solved using standard optimization tools. Unfortunately, the number of constraints scales with the length of the grid interval, which in the case of interest is linear in $n$.
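A minimal sketch of the discretized problem (15) in its epigraph form, written with cvxpy. It assumes the variance and bias terms take the forms reconstructed in (11) above; the grid already uses the shortened interval $[\frac{n}{k}, 6.5L]$ anticipated by Theorem 3.2 below, and all parameter choices are illustrative.

```python
import math
import numpy as np
import cvxpy as cp

def solve_discretized_sip(n, k, L, lam_grid):
    """Epigraph form of the discretized regularized weighted Chebyshev problem:
    minimize t subject to, for every grid point lam,
      (1/k) * sum_l exp(-lam) * a_l^2 * lam^l * l!
        + (exp(-lam) * sum_l a_l * lam^l)^2  <=  t,
    with the normalization a_0 = -1."""
    a = cp.Variable(L + 1)
    t = cp.Variable()
    fact = np.array([math.factorial(l) for l in range(L + 1)])
    constraints = [a[0] == -1]
    for lam in lam_grid:
        powers = lam ** np.arange(L + 1)
        w = np.exp(-lam) * powers * fact / k   # variance (regularizer) weights
        v = np.exp(-lam) * powers              # bias weights
        constraints.append(w @ cp.square(a) + cp.square(v @ a) <= t)
    prob = cp.Problem(cp.Minimize(t), constraints)
    prob.solve()
    return a.value, t.value

k = 10**6
n = 2 * 10**5
L = int(0.558 * math.log(k))
grid = np.linspace(n / k, 6.5 * L, 1000)
a_opt, t_opt = solve_discretized_sip(n, k, L, grid)
```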
This appears to be an undesirable feature of the approach, but it can easily be mitigated through the following theorem, which demonstrates that an optimal solution of the problem may be found over an interval of length proportional to the significantly smaller quantity $\log k$ (recall that $n \propto k/\log k$ samples are needed for accurate estimation [20]). We relegate the proof of this result to the Supplement.

Theorem 3.2. For any $a \in \text{Poly}(L)$, with $L = \lfloor c_0 \log k \rfloor$ and $c_0 = 0.558$, let $g(a, \lambda)$ denote the objective of (11), i.e., $g(a, \lambda) = \frac{1}{k}\sum_{l=0}^{L} e^{-\lambda} a_l^2 \lambda^l\, l! + \left(e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l\right)^2$. Then, we have

$$\sup_{\lambda \in [\frac{n}{k},\, n]} g(a, \lambda) = \begin{cases} \sup_{\lambda \in [\frac{n}{k},\, 6.5L]} g(a, \lambda), & \text{if } \frac{n}{k} \le 6.5L, \\ g\!\left(a, \frac{n}{k}\right), & \text{if } \frac{n}{k} > 6.5L. \end{cases}$$

Remark 3.1. In weighted approximation theory [31], the problem of bounding the interval over which the supremum is achieved is a topic of significant interest, with many important results readily available. For example, if we ignore the regularization term, we can directly use the Mhaskar-Saff theorem [32, 33] (Theorem 7.1 in the Supplement) to reduce the length of the interval in the supremum to $\frac{\pi}{2}L$. Our Theorem 3.2 shows that even when a regularization term is present, we can still restrict the length of the interval to $6.5L$. Our proof differs from that of the more general Mhaskar-Saff theorem, since we exploit the specific structure of the problem. It remains an open problem to extend the approach of [31] used in our proof to account for more general weights.

Using the previous derivations, we arrive at the following optimization problem:

$$\min_{a,\, t}\; t \quad \text{s.t.} \quad g(a, \lambda) \le t \;\; \forall \lambda \in \text{Grid}\!\left(\left[\tfrac{n}{k}, 6.5L\right], s\right), \qquad a_0 = -1. \qquad (16)$$

Since $L = \lfloor c_0 \log k \rfloor$ for the case excluding the Poisson repeat channel, the length of the optimization interval in (16) is proportional to $\log k$, and thus (16) can be solved efficiently. For the case including the Poisson repeat channel, the same arguments lead to an analogous problem (17), in which the optimization interval is $[\frac{n}{k}, CL]$, where $C > 0$ is a constant. Unfortunately, it is very hard to precisely characterize $C$. Nevertheless, we find that in practice the choice $C = 2$ works well with $L = \lfloor \eta c_0 \log k \rfloor$. Furthermore, since the Poisson repeat channel introduces an average of $\eta$ replicas of each sample, the "cut-off" value $L$ for $g_L$ in (2) is set to be $\eta$ times larger than the corresponding value without Poisson repeats.

Convergence of the discretized method. For objective functions and constraints that are "well-behaved" (see [46] and [47]), as $s$ grows, the solution of the relaxed semi-infinite program approaches the optimal solution of the original problem. We use the above results, in conjunction with a number of properties of our SIP, to establish the claim in the following theorem, whose proof is delegated to the Supplement.

Theorem 3.3. Let $d$ denote the spacing of the discretization grid. As $d \to 0$, the optimal objective value $t_d$ of the discretized SIP (16) or (17) (with $\eta > 0$) converges to the optimal objective value $t^\star$ of the original SIP. Moreover, the optimal solution is unique. The convergence rate of $t_d$ to $t^\star$ equals $O(d^2)$. If the optimal solution of the SIP is a strict minimum of order one (i.e., if $t - t^\star \ge C\|a - a^\star\|$ for some constant $C > 0$ and all feasible points in a neighborhood of $a^\star$), then the solution of the discretized SIP also converges to an optimal solution with rate $O(d^2)$.

Next, we compare our estimator, referred to as the Regularized Weighted Chebyshev (RWC) estimator, with the Good-Turing (GT) estimator, the WY estimator of [20], the PJW estimator described in [23], and the HOSW estimator of [25]. We do not compare our method with the estimators introduced in [19, 24] due to their high computational complexity [20, 25].
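Putting the pieces together, the estimate itself is the plug-in sum $\hat S = \sum_i g_L(N_i)$. The display defining $g_L$ did not survive extraction; the sketch below uses the reading consistent with the bias and variance expressions above, namely $g_L(j) = 1 + a_j\, j!$ for $j \le L$ and $g_L(j) = 1$ for $j > L$ (so that $a_0 = -1$ gives $g_L(0) = 0$). Treat it as our reconstruction rather than the paper's verbatim definition.

```python
import math
from collections import Counter

def rwc_support_estimate(samples, a):
    """Plug-in Poly(L) estimate S_hat = sum_i g_L(N_i), where N_i are the
    empirical symbol counts and a are the coefficients returned by the
    discretized SIP solver (a[0] = -1)."""
    L = len(a) - 1
    counts = Counter(samples)

    def g(j):
        # Reconstructed form: unseen symbols contribute g_L(0) = 0, small
        # counts are reweighted, counts above the cut-off L contribute 1 each.
        return 1.0 + a[j] * math.factorial(j) if j <= L else 1.0

    return sum(g(N_i) for N_i in counts.values())

# Usage with the solver from the earlier sketch:
# a_opt, _ = solve_discretized_sip(n, k, L, grid)
# S_hat = rwc_support_estimate(observed_samples, a_opt)
```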
Synthetic data experiments. We first evaluate the maximum risk under normalized squared loss of all listed estimators on six different distributions without Poisson repeats: the uniform distribution with $p_i = \frac{1}{k}$; the Zipf distributions with $p_i \propto i^{-\alpha}$ and $\alpha$ equal to 1.5, 1, 0.5 or 0.25; and the Benford distribution with $p_i \propto \log(i+1) - \log(i)$. We choose the support sizes for the Zipf and Benford distributions so that the minimum nonzero probability mass is roughly $10^{-6}$. We run each estimator 100 times to calculate the risk. For solving (16), we use a grid with $s = 1000$ points in the interval $[\frac{n}{k}, 6.5L]$, with $L = \lfloor 0.558 \log k \rfloor$. The GT method used for comparison first estimates the total probability of the seen symbols (i.e., the sample coverage) according to $\hat C = 1 - h_1/n$, and then estimates the support size according to $\hat S_{GT} = \hat S_c / \hat C$; here, $\hat S_c$ stands for the simple counting estimator, and $h_1$ equals the number of distinct alphabet symbols observed exactly once in the $n$ samples (a code sketch of this baseline is given at the end of this subsection). Detailed findings are presented in the Supplement.

Figures 1(a) and 1(b) indicate that the RWC estimator has significantly better worst-case performance than all other methods when tested on the above distributions, provided that $n \ge 0.2k$. Also, both the RWC and WY estimators have significantly better error exponents than GT, PJW and HOSW. Interestingly, we find that the worst-case risk with normalization $(1/k)^2$ tends to severely bias the results towards near-uniform distributions. We mitigate this issue by changing the normalization from $(1/k)^2$ to $(1/S)^2$, as was also done in [25]. We repeat the experiment using the normalization $(1/S)^2$, corresponding to what we refer to as the RWC-S estimator. A detailed description of this algorithm and its analysis is available in the Supplement. Figures 1(c) and 1(d) illustrate that the RWC-S estimator significantly outperforms all other estimators.

Next, we turn our attention to the case of the Poisson repeat channel (PRC). Our RWC-S-prc estimator requires solving (17) with $1/k$ replaced by $1/\hat S_c$, and setting $C = 2$, $L = \lfloor 0.558\, \eta \log k \rfloor$ and $s = 1000$. Since the noisy support estimation problem is new, there is no standard benchmark to compare against. We therefore consider the performance of RWC-S-prc, the naive counting estimator, and two simple modifications of the WY estimator, since the latter offers the best performance in the noiseless setting. The WY-naive method first divides the empirical counts $N'$ by $\eta$ and then applies the WY estimator; this is intuitive, since each symbol is repeated $\eta$ times in expectation. The WY-prc method modifies the coefficients of the WY estimator with Touchard polynomial multipliers; note that this modification does not take into account the exponential weighting and regularization term that we introduced for both the noiseless and noisy settings. In the experiments, we choose $\eta$ from {0.5, 1, 1.5}, since the replication rate is small in practice. As may be seen from Figures 1(e), 1(f) and 1(g), our method significantly outperforms all other methods. Notably, WY-prc performs poorly for $\eta > 1$, while WY-naive performs poorly for $\eta < 1$.
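For reference, the Good-Turing baseline described above takes only a few lines. This sketch follows the formulas $\hat C = 1 - h_1/n$ and $\hat S_{GT} = \hat S_c/\hat C$ verbatim; the guard against $\hat C \le 0$ is our addition.

```python
from collections import Counter

def good_turing_support(samples):
    """Good-Turing support estimate: inflate the naive count of distinct
    symbols by the estimated sample coverage C_hat = 1 - h1/n."""
    counts = Counter(samples)
    n = len(samples)
    h1 = sum(1 for c in counts.values() if c == 1)  # symbols seen exactly once
    S_c = len(counts)                               # naive counting estimate
    C_hat = 1.0 - h1 / n
    return float("inf") if C_hat <= 0 else S_c / C_hat
```

When every observed symbol is a singleton, $\hat C = 0$ and the estimate diverges, which is consistent with the Inf entry visible for the ORF6 region in Table 6.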
Real-world data experiments. We start by estimating the number of distinct words in selected books, as suggested in [20, 19]. We use Hamlet, Othello, Macbeth and King Lear for our comparative study, with the results presented in Figure 1(h) (for Hamlet) and in the Supplement (all other plays). In the experiments, we randomly sampled words from the text with replacement and used the obtained counts to estimate the number of distinct words. For simplicity, we set $k$ equal to the total number of words; for example, as the number of words in Hamlet equals 30,364, we set $k = 30{,}364$. As may be clearly seen, our methods significantly outperform all other competitive techniques, both in terms of convergence rate and in terms of the accuracy of the estimated support.

To estimate the mutational support of the SARS-Cov-2 virus in the presence of sampling artifacts, we first create the histogram of mutations in the sequenced genomes, using as reference the genome of Patient 1 (the first infected individual that was sequenced). The mutational support of a population of individuals equals the size of the union of the individual supports. The datasets used in the study were retrieved from the GISAID repository [39, 40] on 04-14-2020, and they pertain to European patients only. The analysis of the datasets acquired from Asia and North America is relegated to the Supplement. We conduct three experiments: first, we examine the results of the noiseless support estimation methods (Table 3); next, we manually corrupt the samples by Poisson repeats with $\eta = 0.5, 1, 1.5$ (Table 1; see the Supplement for $\eta = 1, 1.5$). A good noisy support estimation method should produce a support estimate close to that of its noiseless counterpart, and in this setting our method shows superior performance compared to the other techniques. Finally, we also report the results of noisy support estimation applied to the unperturbed data (Table 2). Note that the naive estimator gives a lower bound on the true support size, and that the results of WY-naive for $\eta = 0.5$ are erroneous. On the other hand, WY-prc produces estimates that violate the known maximum support size or that are negative for $\eta = 1.5$, as reported in the Supplement. A more detailed discussion of the relevant biological findings is also available in the Supplement.

Proof of Theorem 3.2. To prove the result, we need to show that $\frac{\partial}{\partial\lambda} g(a, \lambda) < 0$ for all $\lambda \ge 6.5L$. The derivative of the first term of $g$ equals

$$\frac{1}{k}\sum_{l=0}^{L} a_l^2\, l!\, e^{-\lambda}\lambda^{l-1}(l - \lambda).$$

Clearly, the right hand side in the above expression is negative for all $\lambda > L$. To analyze the second term of the derivative, we introduce auxiliary vectors $y$, $z$, $\mathbf{1}$ and a diagonal matrix $\mathrm{D}$, chosen so that showing $\frac{\partial}{\partial\lambda} g(a, \lambda) < 0$ for all polynomials of degree $L$ whenever $\lambda > CL$ reduces to showing that the matrix $\frac{e^{\lambda}}{k}\mathrm{D} + (\mathbf{1}z^T + z\mathbf{1}^T)$ is negative definite whenever $\lambda > CL$, for some constant $C > 0$. It suffices to show that the sum of the maximum eigenvalues of $\frac{e^{\lambda}}{k}\mathrm{D}$ and $(\mathbf{1}z^T + z\mathbf{1}^T)$ is negative, since $\frac{e^{\lambda}}{k}\mathrm{D}$ is a diagonal matrix. Thus, we turn our attention to determining the maximum eigenvalues of these two matrices. For $\frac{e^{\lambda}}{k}\mathrm{D}$, the maximum eigenvalue over the indices $i \in \{0, 1, \dots, L\}$ may be bounded using Stirling's formula, which asserts that $n! \ge (n/e)^n$. Next, we derive an upper bound on the maximum eigenvalue of the second matrix. The $(i, j)$ entry of the matrix $(\mathbf{1}z^T + z\mathbf{1}^T)$ equals $\frac{i+j-2}{\lambda} - 2$, and all these values are negative when $\lambda > L$. Moreover, it is clear that the matrix of interest has rank 2 and therefore exactly two nonzero eigenvalues. Let $\mathrm{A} = -(\mathbf{1}z^T + z\mathbf{1}^T)$. All entries of $\mathrm{A}$ are positive whenever $\lambda > L$. By Gershgorin's theorem, we can upper bound the maximum eigenvalue of the matrix $\mathrm{A}$ by its maximum row sum.
The maximum row sum of $\mathrm{A}$ can be computed explicitly, and the trace of $\mathrm{A}$ can be evaluated in closed form; together, these imply that the minimum eigenvalue of $\mathrm{A}$ is lower bounded by $-\frac{L(L+1)}{2\lambda}$, which directly implies that the maximum eigenvalue of $(\mathbf{1}z^T + z\mathbf{1}^T)$ is upper bounded by $\frac{L(L+1)}{2\lambda}$. Combining the two eigenvalue bounds, a sufficient condition for negative definiteness is

$$\frac{L(L+1)}{2\lambda} < \frac{e^{\lambda}}{k}\left(\frac{L}{e\lambda}\right)^L \;\Longleftrightarrow\; \log(L) + \log(L+1) + \log(k) - L\log(L) + L < \lambda + \log(\lambda) - L\log(\lambda). \qquad (18)$$

The function $\lambda + \log(\lambda) - L\log(\lambda)$ is nondecreasing in $\lambda$ whenever $\lambda > L$, since its derivative equals $1 + \frac{1-L}{\lambda}$, which is positive for $\lambda > L - 1$. By the definition of $L = \lfloor c_0 \log(k) \rfloor$, we also have $\log(k) \le \frac{L+1}{c_0}$. Using $\log(x+1) \le x$, which holds for all $x \ge 1$, it follows that for all $\lambda > CL$ with $C > 2$, a sufficient condition for (18) to hold can be derived; rearranging terms shows that it suffices to have $\log(C) \ge \frac{1}{c_0}$ and $C - \log(C) - 2 - \frac{1}{c_0} > 0$. The first condition implies $C \ge e^{1/c_0} \approx 6.002$, while the second condition holds with $C = 6.5$, for which the first condition is also satisfied. This completes the proof.

Proof of Theorem 3.3. The proof consists of two parts. In the first part, we establish the conditions for convergence, while in the second part, we determine the convergence rate. For simplicity, we present the proofs for the case without Poisson repeats, and then outline how the analysis can be modified to account for the repeats. We start by introducing the relevant terminology. Let $\Pi \subset \mathbb{R}^{L+1}$ be a closed set of parameters, and let $f$ be a continuous functional on $\Pi$. Assume that $B \subset \mathbb{R}$ is compact and that $g : \Pi \times B \to \mathbb{R}$ is continuous. We also make the following two assumptions.

Assumption 6.1. The sequence of grids is arbitrarily fine, i.e., the grid spacing tends to zero.

Assumption 6.2. There exists a $c_0$ such that the level set $\text{Level}(c_0, B_0)$ of the objective is bounded.

Theorem 6.3 ([46]). Under Assumptions 6.1 and 6.2, the solution of the discretized problem converges to the optimal solution. More formally, if $c^*$ is the unique optimal solution of the original problem, and $c^*_i$ is the optimal solution of the discretized relaxation with grid $B_i$, then $c^*_i \to c^*$.

It is straightforward to see that our chosen grid is arbitrarily fine. Hence, we only need to prove that there exists a $c_0$ such that the level set $\text{Level}(c_0, B_0)$ is bounded. Let $c = (a; t)$, and note that in our setting $f(c) = t$. Rewrite the constraint function in matrix form as $g(c, \lambda) = a^T\left(\mathrm{M}(\lambda) + \Lambda\Lambda^T\right)a - t$, where $\Lambda \triangleq e^{-\lambda}(\lambda^0, \lambda^1, \dots, \lambda^L)^T$. Note that only $a_1, \dots, a_L$ are allowed to vary, since we fixed $a_0 = -1$. Obviously, $\Lambda\Lambda^T$ is positive semi-definite, and the previously introduced $\mathrm{M}(\lambda)$ is positive definite for all $\lambda > 0$. Since the quadratic forms in the constraints of (16) are positive definite with respect to $a_1, \dots, a_L$, $g$ is coercive in $a_1, \dots, a_L$. Furthermore, for any given $t$, the set of feasible coefficients $a_1, \dots, a_L$ is bounded. Therefore, for a given $t_0$, the level set $\text{Level}(c_0, B_0)$ is bounded. This ensures that Assumption 6.2 holds for our optimization problem.

Next, we prove the uniqueness of the optimal solution $c^\star$. Note that proving this result is equivalent to proving the uniqueness of $a^\star$. Hence, we once again refer to the original minmax formulation of our problem, $\min_a \sup_{\lambda \in [\frac{n}{k}, 6.5L]} h_\lambda(a)$, where $h_\lambda(a) = a^T(\mathrm{M}(\lambda) + \Lambda\Lambda^T)a$. Clearly, for all $\lambda \in [\frac{n}{k}, 6.5L]$, the function $h_\lambda(a)$ is strictly convex, since $\mathrm{M}(\lambda) + \Lambda\Lambda^T \succ 0$ for all such $\lambda$. Taking the supremum over $\lambda$ preserves strict convexity, since for all $\theta \in (0, 1)$ one has

$$\sup_{\lambda \in [\frac{n}{k}, 6.5L]} h_\lambda(\theta a + (1-\theta)a') < \theta \sup_{\lambda \in [\frac{n}{k}, 6.5L]} h_\lambda(a) + (1-\theta) \sup_{\lambda \in [\frac{n}{k}, 6.5L]} h_\lambda(a').$$

Hence $\sup_{\lambda \in [\frac{n}{k}, 6.5L]} h_\lambda(a)$ is strictly convex, which consequently implies the uniqueness of $a^\star$, and hence of $c^\star$. For the case of samples passed through a Poisson repeat channel, it is not hard to see that the constraints are again strictly convex in $a$; one only needs to replace $\mathrm{M}(\lambda)$ and $\Lambda$ by their counterparts built from the Touchard polynomial evaluations at $\lambda e^{-\eta}$. Thus, a similar analysis applies and the details are omitted. The above arguments establish the convergence claim of Theorem 3.3.
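The convergence behavior is easy to probe empirically: refining the grid and re-solving the discretized SIP shows the optimal values $t_d$ stabilizing, consistent with Theorem 6.3. The snippet below is our own illustration; it assumes the `solve_discretized_sip` function from the sketch following (15) is in scope, and the grid sizes are arbitrary.

```python
import math
import numpy as np

k, n = 10**6, 2 * 10**5
L = int(0.558 * math.log(k))
for s in (50, 100, 200, 400, 800, 1600):
    grid = np.linspace(n / k, 6.5 * L, s)
    _, t_d = solve_discretized_sip(n, k, L, grid)
    print(s, t_d)  # successive values approach the SIP optimum t_star
```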
In what follows, and for reasons of simplicity, we omit the constraint $a_0 = -1$ in the SIP formulation; the described proof requires only small modifications to accommodate $a_0 = -1$. Recall that we use $B_d$ to denote the grid with grid spacing $d$. In order to use the results in [47], we require the convergence assumption below, which was shown to hold for the SIP of interest in the first part of the proof.

Assumption 6.4. The discretized solutions converge to an optimal solution $\bar c$ of the original SIP.

Assumption 6.5. The following hold true:
• There is a neighborhood $\bar U$ of $\bar c$ such that the function $\frac{\partial^2}{\partial\lambda^2} g(c, \lambda)$ is continuous on $\bar U \times B$.
• The set $B$ is compact, nonempty and explicitly given as the solution set of a system of inequalities, $B = \{\lambda \in \mathbb{R} : v_i(\lambda) \le 0,\, i \in I\}$, where $I$ is a finite index set and $v_i \in C^2(B)$.
• For any $\bar\lambda \in B$, the vectors $\frac{\partial}{\partial\lambda} v_i(\bar\lambda)$, $i \in \{i \in I : v_i(\bar\lambda) = 0\}$, are linearly independent.

Recall that our constraint function is of the form $g(c, \lambda) = a^T(\mathrm{M}(\lambda) + \Lambda\Lambda^T)a - t$, where $\Lambda \triangleq e^{-\lambda}(\lambda^0, \lambda^1, \dots, \lambda^L)^T$ and $c = (a; t)$. It is straightforward to see that the first condition in Assumption 6.5 holds. For the second condition, recall that $B = [\frac{n}{k}, 6.5L]$; hence, the second condition can be satisfied by choosing $I = \{1\}$ and $v_1(\lambda) = (\lambda - \frac{n}{k})(\lambda - 6.5L)$. Since we only have one constraint function $v_1$, it is also easy to see that the third condition is met.

Assumption 6.6. The grids $B_d$ include the boundary points of $B$. This assumption clearly holds for the grid of choice; note that it is crucial to include the boundary points for the proof in [47] to be applicable.

Assumption 6.7. $\nabla_c g(c, \lambda)$ is continuous on $\bar U \times B$, where $\bar U$ is a neighborhood of $\bar c$. Moreover, there exists a vector $\xi$ such that $\xi^T \nabla_c g(\bar c, \lambda) < 0$ for all active $\lambda$.

Note that $\nabla_c g(c, \lambda) = [\nabla_a g(c, \lambda); \nabla_t g(c, \lambda)]$, with $\nabla_a g(c, \lambda) = 2(\mathrm{M}(\lambda) + \Lambda\Lambda^T)a$ and $\nabla_t g(c, \lambda) = -1$. Also note that for all $\lambda \in B$, $\mathrm{M}(\lambda) + \Lambda\Lambda^T$ is positive definite. Hence, choosing $\xi$ to be colinear with, and of the same direction as, $[-a^T\; 1]^T$, as well as of sufficiently large norm, allows us to satisfy the inequality. Hence, Assumption 6.7 holds as well. The next results follow from the above assumptions and observations, and from the results in [47].

Lemma 6.8 (Corollary 1 in [47]). Let $t_d$ be the optimal objective value of the discretized SIP used for support estimation with the grid $B_d$, and let $t^\star$ be the optimal objective value of the original SIP. Since Assumptions 6.4, 6.5, 6.6 and 6.7 hold, for some $c_3 > 0$ and $d$ sufficiently small we have $|t_d - t^\star| \le c_3 d^2$. Consequently, $t_d \to t^\star$ with a convergence rate of $O(d^2)$.

Lemma 6.9 (Theorem 2 in [47]). Assume that all assumptions of Lemma 6.8 hold. If there exists a constant $c_4 > 0$ such that $t - t^\star \ge c_4 \|c - \bar c\|$ for all feasible $c$ in a neighborhood of $\bar c$, then for sufficiently small $d$ and some $\sigma > 0$ we have $\|c_d - \bar c\| \le \sigma d^2$.

This result implies that if $\bar c$ is also a strict minimum of order one, then the solution of the discretized SIP converges to that of the original SIP with rate $O(d^2)$. For the Poisson repeat channel, the constraints are also strictly convex in $a$; therefore, a similar analysis applies, and the details are once again omitted. Combining these results completes the proof.

7 Theoretical results supporting Remark 3.1

The result described in the main text follows from Theorem 6.2 in [31], originally proved in [32, 33] and [48].

Theorem 7.1 (Mhaskar-Rakhmanov-Saff). For any polynomial $P$ of degree at most $L$ and weight $W$, the supremum of $|P(x)W(x)|$ over the half-line is attained on $[0, M_L]$. Here, $M_L$ stands for the Mhaskar-Rakhmanov-Saff (MRS) number, defined as the smallest positive root of the integral equation (20) associated with the weight $W$. In our setting, the weight equals $\exp(-x)$; solving (20) gives an MRS number equal to $M_L = \frac{\pi}{2}L$. Thus, in the absence of the regularization term, we can restrict our optimization interval to $[\frac{n}{k}, \frac{\pi}{2}L + \frac{n}{k}]$.

We next introduce the optimization problem needed for minimizing the risk $\mathbb{E}\left(\frac{S - \hat S}{S}\right)^2$.
Poissonization arguments once again establish an upper bound on this risk, where the last inequality is due to the fact that $\hat S_c \le S$. Note that the only difference between (21) and (11) is that the normalization $1/k$ is replaced by $1/\hat S_c$.

9 Detailed description of the SARS-Cov-2 genomic data

Organization of the SARS-Cov-2 genome. A breakdown of the genomic structure of SARS-Cov-2 is shown in Figure 2 and described in more detail in [49]. SARS-Cov-2 comprises the following open reading frames (ORFs): ORF1a and ORF1b, spike (S), membrane (M), envelope (E), and nucleocapsid (N), as well as ORF10. Since all these ORFs encode proteins that play different roles in the process of evading the immune system of the host, we perform the mutational support analysis in the presence of sampling artifacts for each individual region. Note that the overall support is the sum of the mutational supports of all ORF regions.

Processing pipeline. We used data from the GISAID EpiCov repository [40], which contains sequenced viral strains collected from patients across the world. We downloaded the data aggregated up to the date of 04-14-2020. We filtered out all incomplete genomic datasets, resulting in a total of 8,893 samples. To obtain the mutation counts, we first aligned the samples using the alignment software MUSCLE [50] with respect to the reference genome of Patient 1, published under the name Wuhan-Hu-1 and collected at the Central Hospital of Wuhan in December 2019 (GenBank accession number MN908947). For each sample aligned to the reference, we generate a list containing the positions in the reference genome at which the patient's genome has a mutation. The mutation profile lists are aggregated into a mutation histogram for each of the viral genome regions depicted in Figure 2. To counter alignment artifacts caused by sequencing errors, we removed all gaps encountered in the prefixes and suffixes of the alignments, as well as sufficiently long gaps (> 10 nts) within the alignments (a sketch of this pipeline is given below). For the experiments, we partitioned the mutation histograms based on geographic location (Asia / North America / Europe). Since the number of samples across the three regions varies significantly, we subsampled the sequence sets to arrive at 636 samples from Asia and 1,774 samples from each of Europe and North America.

Interpretation of the results. The mutational supports without Poisson repeats may be used to estimate how quickly SARS-Cov-2, as well as any other virus, mutates early on in an outbreak. The Poisson repeat model, as previously mentioned, accounts for resampling of patients who may have tested positive in a previous round or who may have been exposed to new sources of infection. Erroneous samples are accounted for through deletions. The mean value of the Poisson repeats is chosen to accommodate the fact that most individuals are sequenced once or not at all, but that certain high-risk groups (such as health workers) may be subject to multiple tests.
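The processing pipeline described above can be summarized in a short sketch. All details below are our own illustration: we assume MUSCLE produces equal-length pairwise alignments with '-' for gaps, the trimming threshold mirrors the 10-nt rule above, and treating short interior deletions as mutated positions is an assumption on our part.

```python
from collections import Counter

def mutation_positions(ref_aln, sample_aln, max_gap=10):
    """Given a pairwise alignment of the reference and a patient genome
    (equal-length strings, '-' for gaps), return the reference positions
    at which the sample differs. Prefix/suffix gaps and gap runs longer
    than max_gap nts are dropped as likely sequencing/alignment artifacts."""
    positions, ref_pos, run, seen = [], 0, [], False
    for r, s in zip(ref_aln, sample_aln):
        if r != '-':
            ref_pos += 1                  # advance reference coordinate
        if s == '-':
            if r != '-':
                run.append(ref_pos)       # a deleted reference position
            continue
        if run and seen and len(run) <= max_gap:
            positions.extend(run)         # short interior gap: count as mutated
        run = []
        seen = True
        if r != '-' and r != s:
            positions.append(ref_pos)     # substitution
    return positions                      # trailing gap run is dropped

def mutation_histogram(alignments):
    """Aggregate per-patient mutation lists into a histogram mapping each
    reference position to the number of patients mutated there."""
    hist = Counter()
    for ref_aln, sample_aln in alignments:
        hist.update(set(mutation_positions(ref_aln, sample_aln)))
    return hist
```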
Remark. The estimators are based on the assumption that the symbol counts are independent and identically distributed, which may clearly not be the case when analyzing viral mutations (i.e., mutations at different locations in the genome may be, and are expected to be, correlated).

Genomic region | RWC-S | Naive | WY | GT | PJW | HOSW | Maximum support
ORF1a | 1746 | 911 | 2022 | 1082 | 1010 | 2770 | 13203
ORF1b | 858 | 477 | 1046 | 549 | 598 | 1346 | 8087
S | 438 | 246 | 563 | 277 | 309 | 700 | 3822
ORF3a | 166 | 99 | 204 | 108 | 108 | 272 | 828
E | 26 | 15 | 25 | 43 | 16 | 30 | 228
M | 81 | 51 | 85 | 63 | 64 | 103 | 669
ORF6 | 90 | 52 | 105 | 158 | 61 | 118 | 186
ORF7a | 116 | 66 | 131 | 157 | 76 | 156 | 366
ORF8 | 49 | 32 | 53 | 42 | 43 | 60 | 366
N | 221 | 139 | 274 | 146 | 175 | 336 | 1260
ORF10 | 39 | 30 | 21 | 35 | 40 | 21 | 117
All | 3830 | 2118 | 4529 | 2660 | 2500 | 5912 | 29132

Table 3: Results of the noiseless support estimation methods on samples from Europe.

Table 4 (columns: genomic region; RWC-S-prc, Naïve, WY-Naïve and WY-prc for each of η = 0.5, 1, 1.5): Support estimation with synthetic Poisson repeats on samples from Europe. We report the relative difference (in %) of the results for the noisy data and their noiseless counterparts (the closer the value to 0, the better the performance).

Table 5 (columns: genomic region; RWC-S-prc, Naïve, WY-Naïve and WY-prc for each of η = 0.5, 1, 1.5; maximum support): Results of the noisy support estimation methods directly applied to the real-world samples from Europe. We underline the results that are obviously false (i.e., those violating the maximum support constraint, taking negative values, or taking values smaller than the results of the naive estimator).

Genomic region | RWC-S | Naive | WY | GT | PJW | HOSW | Maximum support
ORF1a | 1752 | 835 | 2139 | 1543 | 897 | 2752 | 13203
ORF1b | 624 | 316 | 762 | 747 | 335 | 983 | 8087
S | 353 | 188 | 455 | 375 | 213 | 551 | 3822
ORF3a | 171 | 93 | 185 | 186 | 107 | 242 | 828
E | 63 | 36 | 82 | 185 | 38 | 84 | 228
M | 51 | 31 | 65 | 55 | 31 | 72 | 669
ORF6 | 3 | 3 | 3 | Inf | 3 | 5 | 186
ORF7a | 214 | 109 | 234 | 633 | 117 | 301 | 366
ORF8 | 339 | 340 | 330 | 340 | 340 | 316 | 366
N | 93 | 60 | 115 | 70 | 74 | 132 | 1260
ORF10 | 17 | 11 | 17 | 15 | 13 | 20 | 117
All | 3680 | 2022 | 4387 | 4149 | 2168 | 5458 | 29132

Table 6: Results of the noiseless support estimation methods on samples from Asia.

Table 7 (columns: genome region; RWC-S-prc, Naive, WY-Naive and WY-prc for each of η = 0.5, 1, 1.5): Tests of the noisy support estimation methods against synthetic Poisson repeats introduced into samples from Asia. We report the relative difference (in %) compared to the results of the noiseless counterparts (the closer the value to 0, the better the estimator).

Genome region | η = 0.5 (RWC-S-prc / Naive / WY-Naive / WY-prc) | η = 1 (RWC-S-prc / Naive / WY-Naive / WY-prc) | η = 1.5 (RWC-S-prc / Naive / WY-Naive / WY-prc) | Maximum support
ORF1a | 2401 / 835 / 289 / 1254 | 3278 / 835 / 2139 / 3864 | 4138 / 835 / 3508 / 183124 | 13203
ORF1b | 832 / 316 / 192 / 372 | 1144 / 316 / 762 / 1072 | 1578 / 316 / 1060 / 29256 | 8087
S | 465 / 188 / 110 / 223 | 621 / 188 / 455 / 654 | 809 / 188 / 627 / 4799 | 3822
ORF3a | 210 / 93 / 79 / 112 | 266 / 93 / 185 / 207 | 314 / 93 / 257 / 2764 | 828
E | 74 / 36 / 28 / 47 | 90 / 36 / 82 / 100 | 105 / 36 / 114 / 560 | 228
M | 58 / 31 / 25 / 38 | 70 / 31 / 65 / 76 | 81 / 31 / 79 / 401 | 669
ORF6 | 3 / 3 / 3 / 3 | 3 / 3 / 3 / 3 | 3 / 3 / 3 / 3 | 186
ORF7a | 269 / 109 / 90 / 135 | 346 / 109 / 234 / 265 | 409 / 109 / 314 / 4372 | 366
ORF8 | 339 / 340 / 340 / 340 | 341 / 340 / 330 / 324 | 346 / 340 / 352 / -11001 | 366
N | 113 / 60 / 41 / 68 | 137 / 60 / 115 / 162 | 163 / 60 / 170 / 1015 | 1260
ORF10 | 21 / 11 / 11 / 11 | 26 / 11 / 17 / 16 | 33 / 11 / 22 / 44 | 117
All | 4785 / 2022 / 1208 / 2603 | 6322 / 2022 / 4387 / 6743 | 7979 / 2022 / 6506 / 215337 | 29132

Table 8: Results of the noisy support estimation methods applied directly to samples from Asia. We underline the results that are obviously false (i.e., those that violate the maximum support size constraint, produce negative values, or produce estimates smaller than the results of the naive estimator).

Genome region | RWC-S | Naive | WY | GT | PJW | HOSW | Maximum support
ORF1a | 1509 | 804 | 1765 | 944 | 980 | 2374 | 13203
ORF1b | 727 | 403 | 912 | 437 | 442 | 1245 | 8087
S | 375 | 209 | 515 | 237 | 229 | 636 | 3822
ORF3a | 134 | 81 | 161 | 86 | 97 | 214 | 828
E | 25 | 15 | 24 | 33 | 15 | 29 | 228
M | 44 | 28 | 52 | 38 | 32 | 59 | 669
ORF6 | 33 | 21 | 43 | 32 | 23 | 46 | 186
ORF7a | 269 | 135 | 285 | 587 | 142 | 384 | 366
ORF8 | 44 | 29 | 45 | 30 | 36 | 51 | 366
N | 227 | 138 | 272 | 174 | 185 | 331 | 1260
ORF10 | 16 | 10 | 16 | 20 | 14 | 16 | 117
All | 3403 | 1873 | 4090 | 2618 | 2195 | 5385 | 29132

Table 9: Results of the noiseless support estimation methods applied to samples from North America.

Table 10 (columns: genome region; RWC-S-prc, Naïve, WY-Naïve and WY-prc for each of η = 0.5, 1, 1.5): Tests of the noisy support estimation methods against synthetic Poisson repeats introduced into samples from North America. We report the relative difference (in %) compared to the results of the noiseless counterparts (the closer the value to 0, the better the estimator).

Genome region | η = 0.5 (RWC-S-prc / Naïve / WY-Naïve / WY-prc) | η = 1 (RWC-S-prc / Naïve / WY-Naïve / WY-prc) | η = 1.5 (RWC-S-prc / Naïve / WY-Naïve / WY-prc) | Maximum support
ORF1a | 2005 / 804 / 407 / 1114 | 2716 / 804 / 1765 / 2851 | 3502 / 804 / 2802 / 95895 | 13203
ORF1b | 924 / 403 / 206 / 558 | 1202 / 403 / 912 / 1474 | 1535 / 403 / 1391 / 55948 | 8087
S | 471 / 209 / 82 / 305 | 608 / 209 / 515 / 901 | 799 / 209 / 793 / 50434 | 3822
ORF3a | 167 / 81 / 58 / 91 | 209 / 81 / 161 / 213 | 253 / 81 / 240 / 3656 | 828
E | 31 / 15 / 15 / 15 | 39 / 15 / 24 / 22 | 50 / 15 / 30 / 72 | 228
M | 51 / 28 / 24 / 34 | 60 / 28 / 52 / 60 | 71 / 28 / 77 / 235 | 669
ORF6 | 39 / 21 / 17 / 26 | 46 / 21 / 43 / 52 | 54 / 21 / 62 / 252 | 186
ORF7a | 345 / 135 / 113 / 165 | 448 / 135 / 285 / 317 | 535 / 135 / 381 / 4729 | 366
ORF8 | 52 / 29 / 22 / 32 | 61 / 29 / 45 / 61 | 70 / 29 / 93 / 168 | 366
N | 287 / 138 / 95 / 157 | 369 / 138 / 272 / 372 | 469 / 138 / 408 / 8471 | 1260
ORF10 | 18 / 10 / 10 / 10 | 23 / 10 / 16 / 15 | 28 / 10 / 22 / 47 | 117
All | 4390 / 1873 / 1049 / 2507 | 5781 / 1873 / 4090 / 6338 | 7366 / 1873 / 6299 / 219907 | 29132

Table 11: Results of the noisy support estimation methods applied directly to samples from North America. We underline the results that are obviously false (i.e., those violating the maximum support constraint, taking negative values, or taking values smaller than the results of the naive estimator).
References

[1] The relation between the number of species and the number of individuals in a random sample of an animal population.
[2] Estimating the number of unseen species: How many words did Shakespeare know?
[3] Strong lower bounds for approximating distribution support size and the distinct elements problem.
[4] Counting distinct elements in a data stream.
[5] Towards estimation error guarantees for distinct values.
[6] Query k-means clustering and the double Dixie cup problem.
[7] Clustering with same-cluster queries.
[8] HS^2: Active learning over hypergraphs.
[9] Active learning in the geometric block model.
[10] The number of new species, and the increase in population coverage, when a sample is increased.
[11] Always Good Turing: Asymptotically optimal probability estimation.
[12] Competitive closeness testing.
[13] Estimation of entropy and mutual information.
[14] Estimating the number of species: a review.
[15] Sampling algorithms: lower bounds and applications.
[16] Testing that distributions are close.
[17] An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people.
[18] Recent explosive human population growth has resulted in an excess of rare genetic variants.
[19] Estimating the unseen: improved estimators for entropy and other properties.
[20] Chebyshev polynomials, moment matching, and optimal estimation of the unseen.
[21] A unified maximum likelihood approach for estimating symmetric properties of discrete distributions.
[22] Sample complexity of the distinct elements problem.
[23] Approximate profile maximum likelihood.
[24] Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance.
[25] Data amplification: A unified and competitive approach to property estimation.
[26] Mutation rate and genotype variation of Ebola virus from Mali case sequences.
[27] Quantifying the diversification of hepatitis C virus (HCV) during primary infection: estimates of the in vivo mutation rate.
[28] Coronavirus pandemic (COVID-19).
[29] Sharp analytical capacity upper bounds for sticky and related channels.
[30] Small-sample distribution estimation over sticky channels.
[31] A survey of weighted polynomial approximation with exponential weights.
[32] Extremal problems for polynomials with exponential weights.
[33] Where does the sup norm of a weighted polynomial live?
[34] Sur les cycles des substitutions.
[35] On modeling profiles instead of values.
[36] R.A. Fisher and the making of maximum likelihood 1912-1922.
[37] Optimal prediction of the number of unseen species.
[38] Entropy rate estimation for Markov chains with large state space.
[39] Data, disease and diplomacy: GISAID's innovative contribution to global health.
[40] GISAID: Global initiative on sharing all influenza data - from vision to reality.
[41] Do symmetric problems have symmetric solutions?
[43] Convex optimization.
[44] Semi-infinite programming.
[45] Numerical methods for semi-infinite programming: a survey.
[46] Discretization methods for the solution of semi-infinite programming problems.
[47] Discretization in semi-infinite programming: the rate of convergence.
[48] On asymptotic properties of polynomials orthogonal on the real axis.
[49] Genotype and phenotype of COVID-19: Their roles in pathogenesis.
[50] MUSCLE: a multiple sequence alignment method with reduced time and space complexity.