Support Estimation with Sampling Artifacts and Errors
Chien, Eli; Milenkovic, Olgica; Nedich, Angelia
2020-06-14

The problem of estimating the support of a distribution is of great importance in many areas of machine learning, computer science, physics and biology. Most of the existing work in this domain has focused on settings that assume perfectly accurate sampling, which is seldom true in practical data science. Here we introduce the first known approach to support estimation in the presence of sampling artifacts and errors, where each sample is assumed to arise from a Poisson repeat channel that simultaneously captures repetitions and deletions of samples. The proposed estimator is based on regularized weighted Chebyshev approximations, with weights governed by evaluations of so-called Touchard (Bell) polynomials. The supports in the presence of sampling artifacts are calculated using discretized semi-infinite programming methods. The estimation approach is tested on synthetic and textual data, as well as on GISAID data collected to address a new problem in computational biology: mutational support estimation in genes of the SARS-Cov-2 virus. In the latter setting, the Poisson channel captures the fact that many individuals are tested multiple times for the presence of viral RNA, thereby leading to repeated samples, while other individuals' results are not recorded due to test errors. For all experiments performed, we observed significant improvements of our integrated methods over adequate modifications of state-of-the-art noiseless support estimation methods.

Estimating the support size of a discrete distribution is an important theoretical and data processing problem [1, 2]. In computer science, this task frequently arises in large-scale database mining and network monitoring, where the objective is to estimate the types of database entries or IP addresses from a limited number of observations [3, 4, 5]. In machine learning, support estimation is used to bound the number of clusters in clustering problems encountered in semi-supervised or active learning [6, 7, 8, 9]. In the life sciences, support estimation arises when estimating population sizes or increases in population sizes [10]. The most challenging practical support estimation issues are encountered in the "small sample set" regime, in which one has only a limited number of observations from a distribution with a large support. In such a setting, classical maximum likelihood frequency techniques are known to perform poorly [11]. It is for this sampling regime that the estimation problem has received significant attention from the theoretical computer science and machine learning communities, as well as from researchers in various computational data processing areas [12, 13, 14, 15, 5, 16, 17, 18]. By now, a number of efficient and near-optimal support estimation techniques have been reported in the literature [19, 20, 21, 22, 23, 24, 25]. All of these methods rely on the assumption that the samples are observed without errors. In practice, sampling artifacts and noise are ubiquitous, especially when dealing with data acquired from biological and medical experiments.
As a motivating example, consider the problem of estimating the number of mutations in a viral genome (such as that of SARS-Cov-2) during the early stages of an outbreak. Viral RNA/DNA is usually of length ∼10,000 nucleotides, and testing is time consuming, expensive, and additionally hampered by privacy issues. Consequently, only a small number of sequenced genomes may be available when trying to determine in a timely manner whether the virus is mutating at a high rate and is therefore potentially dangerous to the population (note that a large body of work reports mutational rates of viruses as indicators of their virulence and potential to cause epidemic and pandemic outbreaks [26, 27]). In this particular case, the actual alphabet size is known and equal to the length of the genome, but not all genomic sites are subject to mutations. Furthermore, sequencing errors introduce counting artifacts, and so do sampling biases caused by some individuals being tested multiple times (e.g., health workers [28]) or not being tested at all. To address these issues, we propose a novel noisy support estimation problem under the Poisson repeat channel [29] model. The Poisson repeat channel models both deletions and repetitions of particular mutational sites, and is adequate for capturing unknown sampling biases and sequencing error phenomena. To the best of our knowledge, this work is the first to consider the noisy support estimation problem in which symbols are potentially repeated or missed. The only other line of work addressing a similar problem was reported in [30], with a focus on Good-Turing distribution estimation in the presence of sample insertion errors.

We address the noisy small-sample support estimation problem under the Poisson repeat channel by using novel regularized weighted Chebyshev approximation techniques. Weighted polynomial approximation techniques are largely unknown in the machine learning community [31], since all previously proposed approximation-based methods have focused on unweighted and noiseless settings [20, 22, 24, 25]. As will be shown in our subsequent analysis, exponential smoothing and "noise-compensating" weights play a major role in improving the performance of polynomial methods, as well as in making them computationally tractable. Within this framework, the Mhaskar-Saff theorem and extensions thereof presented in this work are of great importance [31, 32, 33]. In addition, our regularization term arises from consideration of the variance of the estimators, as the weighted Chebyshev approximation component only takes the bias into account. Hence, solving our regularized weighted Chebyshev approximation problem is equivalent to jointly optimizing the bias and variance of the estimator. We also show that Touchard (Bell) polynomials [34] naturally arise when incorporating the Poisson repeat channel into the model. To numerically solve the underlying optimization problem, we use discretized semi-infinite programming (SIP) techniques. We prove that the solution of the discretized SIP converges to the true unique optimal solution of the noisy support estimation problem. Through extensive experiments on both synthetic and real-world data, we show that our methods are able to accurately estimate the support size under the Poisson repeat channel.

Prior work. Noiseless support estimation methods operating in the small-sample regime can be roughly grouped into two categories [19, 20, 21, 22, 23, 24, 25].
The first line of works [19, 21, 23] makes use of the maximum likelihood principle. While [21] constructs estimators based on the Profile Maximum Likelihood (PML) [35], the work reported in [19] focuses on Sequence Maximum Likelihood (SML) estimators [36]. The main advantage of ML-based methods is that they easily generalize to many other estimation tasks. For example, the authors of [21] showed that a single method may be used for entropy estimation, support coverage and distance-to-uniformity analysis. However, most ML-based estimators require large computational resources [23, 20]. To address the computational issue, a sophisticated approximate PML technique that reduces the computational complexity of support estimation at the expense of some performance loss was proposed in [23]. The second line of works [20, 22, 24, 25] formulates support estimation as an approximation problem. The underlying methods, which we henceforth refer to as approximation-based methods, design estimators by minimizing the worst-case risk. In particular, [20] uses shifted and scaled Chebyshev polynomials of the first kind to construct efficient estimators. In contrast, the authors of [25] suggest dispensing with minmax estimators and instead implement a data amplification technique with analytical performance guarantees; the resulting estimator is based on polynomial smoothing [37], which is related to approximation techniques. All described approximation-based estimators are computationally efficient, with the exception of [24], as reported in [25]. Note that sampling artifacts lead to non-i.i.d. observations, which are very different from the Markovian models discussed in [38].

The paper is organized as follows. In Section 2, we introduce the relevant notation and the support estimation problem under the Poisson repeat channel. We also describe a class of estimators termed polynomial class estimators. In Section 3, we outline our analysis of polynomial class estimators and describe the main results needed to overcome the technical challenges associated with regularized weighted minmax polynomial approximations. Section 4 is devoted to experimental verification and testing, both on synthetically generated data and on real-world data. Our real-world data includes Shakespeare's plays, used to illustrate the performance of our methods, and a collection of ∼4,100 SARS-Cov-2 viral genomes retrieved from GISAID [39, 40].

Let $P = (p_1, p_2, \dots)$ be a discrete distribution over some countable alphabet and let $X_1, \dots, X_n$ be i.i.d. samples drawn according to the distribution $P$. The problem of interest is to estimate the support size, defined as $S(P) = \sum_i 1_{\{p_i > 0\}}$, where $1_A$ stands for the indicator of the event $A$. When clear from the context, we use $S$ instead of $S(P)$ to avoid notational clutter. We make the assumption that the minimum nonzero probability of the distribution $P$ is at least $1/k$, for some $k \in \mathbb{R}^+$, i.e., $\inf\{p \in P : p > 0\} \ge \frac{1}{k}$. Furthermore, we let $D_k$ denote the space of all probability distributions satisfying $\inf\{p \in P : p > 0\} \ge \frac{1}{k}$. Clearly, $S \le k$ for all $P \in D_k$. A sufficient statistic for $X_1, \dots, X_n$ is the empirical distribution (i.e., histogram) $N = (N_1, N_2, \dots)$, where $N_i$ denotes the number of times symbol $i$ appears among the $n$ samples.

The Poisson repeat channel. For each sample index $1 \le i \le n$, the Poisson repeat channel with parameter $\eta$ outputs $R_i$ copies of the sample $X_i$, where $R_i \sim \text{Poisson}(\eta)$ is a Poisson distributed random variable with mean $\eta$. For simplicity of analysis, the variables $R_i$ are assumed to be i.i.d. and the corresponding channel memoryless; the value $R_i = 0$ indicates that the sample $X_i$ has been deleted.
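To make the sampling model concrete, the following minimal Python sketch simulates the Poisson repeat channel; the function name and interface are ours, for illustration only, and are not code from the paper.

```python
import numpy as np

def poisson_repeat_channel(samples, eta, rng=None):
    """Pass a sequence of samples through a Poisson repeat channel.

    Each sample X_i is emitted R_i ~ Poisson(eta) times; R_i = 0 means the
    sample is deleted. The R_i are drawn i.i.d., so the channel is memoryless.
    """
    rng = np.random.default_rng() if rng is None else rng
    repeats = rng.poisson(eta, size=len(samples))
    return np.repeat(np.asarray(samples), repeats)

# Example: n = 10000 draws from a uniform distribution over k = 1000 symbols,
# observed through a channel with eta = 0.5 (heavy deletions).
rng = np.random.default_rng(0)
x = rng.integers(0, 1000, size=10000)
y = poisson_repeat_channel(x, eta=0.5, rng=rng)
print(len(x), len(y))  # the output length concentrates around eta * n = 5000
```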
We also define the empirical distribution of the output sequence of the Poisson repeat channel as $N' = (N'_1, N'_2, \dots)$, where $N'_i$ denotes the number of times symbol $i$ appears at the channel output. Throughout the paper, we assume that the parameter $\eta$ is known, although it is possible to learn it simultaneously with the support. The focal point of our analysis is to upper bound the minmax risks under normalized squared loss, without and with the Poisson repeat channel, respectively. We focus on the case including the Poisson repeat channel, as a similar analysis can easily be performed for the case without Poisson repeats. We seek a support estimator $\hat S$ that minimizes

$$\sup_{P \in D_k} \left\{ \left(\frac{\mathbb{E}\hat S - S}{k}\right)^2 + \frac{\text{Var}(\hat S)}{k^2} \right\}.$$

The first term within the supremum captures the expected bias of the estimator $\hat S$; the second term represents the variance of $\hat S$. Hence, "good" estimators are required to balance out the worst-case contributions of the bias and variance. We define a class of polynomial based estimators as follows. Given a parameter $L \in \mathbb{N}$, we say that an estimator $\hat S$ is a polynomial class estimator with parameter $L$ (i.e., a $\text{Poly}(L)$ estimator) if it takes the form $\hat S = \sum_i g_L(N_i)$, where

$$g_L(j) = \begin{cases} 1 + a_j\, j!, & 0 \le j \le L, \\ 1, & j > L. \end{cases}$$

Here, $a_j \in \mathbb{R}$, and $a_0 = -1$, since this choice ensures that $g_L(0) = 0$. One can associate an estimator $\hat S$ with its corresponding coefficients $a$, and define a family of estimators $\text{Poly}(L) = \{a \in \mathbb{R}^{L+1} : a_0 = -1\}$. Next, we show that the problem of minimizing the worst-case risk within the class $\text{Poly}(L)$ can be cast as a regularized exponentially weighted Chebyshev approximation problem [31]. We start by analyzing the minmax risk under the Poisson channel model through Poissonization arguments. These assert that the number of samples drawn is Poisson distributed, $N \sim \text{Poisson}(n)$, and that the counts $N_i \sim \text{Poisson}(\lambda_i)$ are independent, where $\lambda_i = n p_i$. Poissonization was also used in [20, 24] to derive the following tight upper bound on the minmax risk.

Lemma 3.1 (Lemma 1 in [20]). Let $R^*_P(n, k)$ be the minmax risk under the Poissonized model. Then, for any $\beta > 1$, the minmax risk with $n$ samples can be bounded in terms of the Poissonized risk, up to an additive term that vanishes exponentially fast in $n$.

Note that it is straightforward to show that Lemma 3.1 also remains valid under the Poisson repeat channel. Let $\mathcal{L} = \{\ell : \lambda_\ell > 0\}$ be the set of symbols with positive probability. A simple calculation reveals that for any $\hat S \in \text{Poly}(L)$, the risk decomposes into a variance term and a squared-bias term, where the inequality uses the Cauchy-Bunyakovsky-Schwarz inequality for the cross terms. An interesting observation is that the resulting objective function is symmetric in the parameters $\lambda_i$. The little-known problem of establishing when the optima of such constrained symmetric functions are achieved when all parameters are equal has been studied in [41]. We first analyze the bias term $\mathbb{E}(g(N'_i) - 1)$. By the definition of the Poisson repeat channel, we have $N'_i \sim \text{Poisson}(\eta N_i)$, which allows us to express $\mathbb{E}(g(N'_i))$ through moments of the random variable $N^* \sim \text{Poisson}(\lambda_i e^{-\eta})$; these moments are evaluations of the Touchard polynomials, $\sum_r {l \brace r} (\lambda_i e^{-\eta})^r$, where ${l \brace r}$ denotes the Stirling number of the second kind, counting the number of partitions of a set of size $l$ into $r$ disjoint nonempty subsets. Following a procedure similar to the one used for the bias, one can show that the term corresponding to the variance admits an analogous expression involving $N^* \sim \text{Poisson}(\lambda e^{-\eta})$. The inequality (7) is due to the increase in the domain over which we take the supremum. The inequality (8) follows from (4). The last inequality is a consequence of $|\mathcal{L}| = S \le k$ and the fact that all terms in the summation are nonnegative.
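The appearance of the Touchard polynomials can be sanity-checked numerically: the $l$-th moment of a Poisson($x$) random variable equals $T_l(x) = \sum_r {l \brace r} x^r$. The sketch below is our own illustration of this classical identity, not code from the paper.

```python
import numpy as np

def stirling2(l, r):
    """Stirling number of the second kind, via the standard recurrence
    S(l, r) = r * S(l-1, r) + S(l-1, r-1)."""
    if l == r:
        return 1
    if r == 0 or r > l:
        return 0
    return r * stirling2(l - 1, r) + stirling2(l - 1, r - 1)

def touchard(l, x):
    """Touchard (Bell) polynomial T_l(x) = sum_r S(l, r) x^r."""
    return sum(stirling2(l, r) * x ** r for r in range(l + 1))

# Monte Carlo check of the moment identity E[N^l] = T_l(lam), N ~ Poisson(lam).
rng = np.random.default_rng(1)
lam, l = 2.3, 4
samples = rng.poisson(lam, size=1_000_000).astype(float)
print(np.mean(samples ** l))   # approximately 140.3 for lam = 2.3, l = 4
print(touchard(l, lam))        # exact value 140.316...
```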
Hence, we have to minimize an objective with respect to the coefficients $a_1, \dots, a_L$; we refer to the resulting regularized problem for the noisy case as (10). For the case in which the Poisson repeat channel is not present, one only needs to readjust the exponential weights and replace the weighted Touchard terms $\eta^l M_{N^*}$ by plain monomials, which yields

$$\min_{a \in \text{Poly}(L)}\; \sup_{\lambda \in [\frac{n}{k},\, n]} \left\{ \frac{1}{k}\sum_{l=0}^{L} e^{-\lambda} a_l^2 \lambda^l\, l! + \left(e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l\right)^2 \right\}. \qquad (11)$$

Note that both (10) and (11) take the form of a regularized weighted Chebyshev approximation problem. For simplicity, we first focus on the noiseless case (11), as similar but more tedious arguments apply to the noisy case (10). If we ignore the first term in (11), the optimization problem reads as

$$\min_{a \in \text{Poly}(L)}\; \sup_{\lambda \in [\frac{n}{k},\, n]} \left| e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l \right|. \qquad (12)$$

The term $e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l$ corresponds to the bias of the estimator, and it is straightforward to see that the optimal choices of $a$ for the two problems (with and without the square) coincide. Problem (12) is an exponentially weighted Chebyshev approximation problem [42]. Note that one can further upper bound (12) according to

$$\sup_{\lambda \in [\frac{n}{k},\, n]} \left| e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l \right| \;\le\; e^{-\frac{n}{k}} \sup_{\lambda \in [\frac{n}{k},\, n]} \left| \sum_{l=0}^{L} a_l \lambda^l \right|, \qquad (13)$$

resulting in a standard Chebyshev approximation problem, whose solution takes the form of scaled and shifted Chebyshev polynomials. Although the authors of [20] obtained the coefficients of their Chebyshev estimator using a different interval in the supremum that accounts for the variance, (13) together with our extensive simulation results shows that ignoring the exponential weights leads to a worse bound on the risk and to worse practical performance.

The first term in (11), $\frac{1}{k}\sum_{l=0}^{L} e^{-\lambda} a_l^2 \lambda^l\, l!$, corresponds to the variance and may be rewritten as $\frac{1}{k}\|a\|^2_{\mathrm{M}(\lambda)}$, where $\mathrm{M}(\lambda)$ denotes the diagonal matrix with entries $e^{-\lambda}\lambda^l\, l!$, $l = 0, \dots, L$. Clearly, $\|\cdot\|_{\mathrm{M}(\lambda)}$ is a valid norm, and consequently the first term in (11) may be viewed as a regularizer. For the case including the Poisson repeat channel, since a sum of Touchard polynomials is still a polynomial, the bias term in (10) also gives rise to an exponentially weighted Chebyshev approximation problem, while the variance term may be written as a weighted norm of $a$ with a weight matrix built from the Touchard polynomial evaluations (i.e., the moments of $N^*$). The resulting problem (10) is once again a regularized weighted Chebyshev approximation problem.

Solving problems (10) and (11). Solving (10) and (11) directly appears to be difficult, so we instead numerically solve the epigraph formulations of the semi-infinite programs (10) and (11) and prove that the numerical solutions are asymptotically consistent. Once again, we start with the case that excludes the Poisson repeat channel, (11). The epigraph formulation of (11) is a semi-infinite program of the form ([43], Chapter 6.1)

$$\min_{a,\, t}\; t \quad \text{s.t.} \quad \frac{1}{k}\sum_{l=0}^{L} e^{-\lambda} a_l^2 \lambda^l\, l! + \left(e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l\right)^2 \le t \;\; \forall \lambda \in \left[\tfrac{n}{k}, n\right], \qquad a_0 = -1. \qquad (14)$$

Many algorithms can be used to numerically solve (14), including discretization methods, the central cutting plane method, and KKT and SQP reductions [44, 45]. For simplicity, we focus on the discretization method. For this purpose, we first form a grid of the interval $[\frac{n}{k}, n]$ comprising $s$ points, denoted by $\text{Grid}([\frac{n}{k}, n], s)$. Problem (14) is an LP with infinitely many quadratic constraints and hence cannot be solved directly. Instead of solving (14), we therefore focus on the relaxed problem in which the constraints are enforced only for $\lambda \in \text{Grid}([\frac{n}{k}, n], s)$. (15)

As will be discussed in greater detail later, the solution of the relaxed problem is asymptotically consistent with the solution of the original problem (i.e., as $s \to \infty$, the optimal values of the objectives of the original and relaxed problems coincide). Problem (15) is an LP with a finite number of quadratic constraints that may be solved using standard optimization tools. Unfortunately, the number of constraints scales with the length of the grid interval, which in the case of interest is linear in $n$.
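A minimal sketch of the discretized problem (15) in its epigraph form, written with cvxpy. It assumes the variance and bias terms take the forms reconstructed in (11) above; the grid already uses the shortened interval $[\frac{n}{k}, 6.5L]$ anticipated by Theorem 3.2 below, and all parameter choices are illustrative.

```python
import math
import numpy as np
import cvxpy as cp

def solve_discretized_sip(n, k, L, lam_grid):
    """Epigraph form of the discretized regularized weighted Chebyshev problem:
    minimize t subject to, for every grid point lam,
      (1/k) * sum_l exp(-lam) * a_l^2 * lam^l * l!
        + (exp(-lam) * sum_l a_l * lam^l)^2  <=  t,
    with the normalization a_0 = -1."""
    a = cp.Variable(L + 1)
    t = cp.Variable()
    fact = np.array([math.factorial(l) for l in range(L + 1)])
    constraints = [a[0] == -1]
    for lam in lam_grid:
        powers = lam ** np.arange(L + 1)
        w = np.exp(-lam) * powers * fact / k   # variance (regularizer) weights
        v = np.exp(-lam) * powers              # bias weights
        constraints.append(w @ cp.square(a) + cp.square(v @ a) <= t)
    prob = cp.Problem(cp.Minimize(t), constraints)
    prob.solve()
    return a.value, t.value

k = 10**6
n = 2 * 10**5
L = int(0.558 * math.log(k))
grid = np.linspace(n / k, 6.5 * L, 1000)
a_opt, t_opt = solve_discretized_sip(n, k, L, grid)
```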
This appears to be an undesirable feature of the approach, but it can easily be mitigated through the following theorem, which demonstrates that an optimal solution of the problem may be found over an interval of length proportional to the significantly smaller quantity $\log k$ (recall that $n \propto k/\log k$ samples are needed for accurate estimation [20]). We relegate the proof of this result to the Supplement.

Theorem 3.2. For any $a \in \text{Poly}(L)$, with $L = \lfloor c_0 \log k \rfloor$ and $c_0 = 0.558$, let $g(a, \lambda)$ denote the objective of (11), i.e., $g(a, \lambda) = \frac{1}{k}\sum_{l=0}^{L} e^{-\lambda} a_l^2 \lambda^l\, l! + \left(e^{-\lambda}\sum_{l=0}^{L} a_l \lambda^l\right)^2$. Then, we have

$$\sup_{\lambda \in [\frac{n}{k},\, n]} g(a, \lambda) = \begin{cases} \sup_{\lambda \in [\frac{n}{k},\, 6.5L]} g(a, \lambda), & \text{if } \frac{n}{k} \le 6.5L, \\ g\!\left(a, \frac{n}{k}\right), & \text{if } \frac{n}{k} > 6.5L. \end{cases}$$

Remark 3.1. In weighted approximation theory [31], the problem of bounding the interval over which the supremum is achieved is a topic of significant interest, with many important results readily available. For example, if we ignore the regularization term, we can directly use the Mhaskar-Saff theorem [32, 33] (Theorem 7.1 in the Supplement) to reduce the length of the interval in the supremum to $\frac{\pi}{2}L$. Our Theorem 3.2 shows that even when a regularization term is present, we can still restrict the length of the interval to $6.5L$. Our proof differs from that of the more general Mhaskar-Saff theorem, since we exploit the specific structure of the problem. It remains an open problem to extend the approach of [31] used in our proof to account for more general weights.

Using the previous derivations, we arrive at the following optimization problem:

$$\min_{a,\, t}\; t \quad \text{s.t.} \quad g(a, \lambda) \le t \;\; \forall \lambda \in \text{Grid}\!\left(\left[\tfrac{n}{k}, 6.5L\right], s\right), \qquad a_0 = -1. \qquad (16)$$

Since $L = \lfloor c_0 \log k \rfloor$ for the case excluding the Poisson repeat channel, the length of the optimization interval in (16) is proportional to $\log k$, and thus (16) can be solved efficiently. For the case including the Poisson repeat channel, the same arguments lead to an analogous problem (17), in which the optimization interval is $[\frac{n}{k}, CL]$, where $C > 0$ is a constant. Unfortunately, it is very hard to precisely characterize $C$. Nevertheless, we find that in practice the choice $C = 2$ works well with $L = \lfloor \eta c_0 \log k \rfloor$. Furthermore, since the Poisson repeat channel introduces an average of $\eta$ replicas of each sample, the "cut-off" value $L$ for $g_L$ in (2) is set to be $\eta$ times larger than the corresponding value without Poisson repeats.

Convergence of the discretized method. For objective functions and constraints that are "well-behaved" (see [46] and [47]), as $s$ grows, the solution of the relaxed semi-infinite program approaches the optimal solution of the original problem. We use the above results, in conjunction with a number of properties of our SIP, to establish the claim in the following theorem, whose proof is delegated to the Supplement.

Theorem 3.3. Let $d$ denote the spacing of the discretization grid. As $d \to 0$, the optimal objective value $t_d$ of the discretized SIP (16) or (17) (with $\eta > 0$) converges to the optimal objective value $t^\star$ of the original SIP. Moreover, the optimal solution is unique. The convergence rate of $t_d$ to $t^\star$ equals $O(d^2)$. If the optimal solution of the SIP is a strict minimum of order one (i.e., if $t - t^\star \ge C\|a - a^\star\|$ for some constant $C > 0$ and all feasible points in a neighborhood of $a^\star$), then the solution of the discretized SIP also converges to an optimal solution with rate $O(d^2)$.

Next, we compare our estimator, referred to as the Regularized Weighted Chebyshev (RWC) estimator, with the Good-Turing (GT) estimator, the WY estimator of [20], the PJW estimator described in [23], and the HOSW estimator of [25]. We do not compare our method with the estimators introduced in [19, 24] due to their high computational complexity [20, 25].
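Putting the pieces together, the estimate itself is the plug-in sum $\hat S = \sum_i g_L(N_i)$. The display defining $g_L$ did not survive extraction; the sketch below uses the reading consistent with the bias and variance expressions above, namely $g_L(j) = 1 + a_j\, j!$ for $j \le L$ and $g_L(j) = 1$ for $j > L$ (so that $a_0 = -1$ gives $g_L(0) = 0$). Treat it as our reconstruction rather than the paper's verbatim definition.

```python
import math
from collections import Counter

def rwc_support_estimate(samples, a):
    """Plug-in Poly(L) estimate S_hat = sum_i g_L(N_i), where N_i are the
    empirical symbol counts and a are the coefficients returned by the
    discretized SIP solver (a[0] = -1)."""
    L = len(a) - 1
    counts = Counter(samples)

    def g(j):
        # Reconstructed form: unseen symbols contribute g_L(0) = 0, small
        # counts are reweighted, counts above the cut-off L contribute 1 each.
        return 1.0 + a[j] * math.factorial(j) if j <= L else 1.0

    return sum(g(N_i) for N_i in counts.values())

# Usage with the solver from the earlier sketch:
# a_opt, _ = solve_discretized_sip(n, k, L, grid)
# S_hat = rwc_support_estimate(observed_samples, a_opt)
```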
Synthetic data experiments. We first evaluate the maximum risk under normalized squared loss of all listed estimators on six different distributions without Poisson repeats: the uniform distribution with $p_i = \frac{1}{k}$; the Zipf distributions with $p_i \propto i^{-\alpha}$ and $\alpha$ equal to 1.5, 1, 0.5 or 0.25; and the Benford distribution with $p_i \propto \log(i+1) - \log(i)$. We choose the support sizes for the Zipf and Benford distributions so that the minimum nonzero probability mass is roughly $10^{-6}$. We run each estimator 100 times to calculate the risk. For solving (16), we use a grid with $s = 1000$ points in the interval $[\frac{n}{k}, 6.5L]$, with $L = \lfloor 0.558 \log k \rfloor$. The GT method used for comparison first estimates the total probability of the seen symbols (i.e., the sample coverage) according to $\hat C = 1 - h_1/n$, and then estimates the support size according to $\hat S_{GT} = \hat S_c / \hat C$; here, $\hat S_c$ stands for the simple counting estimator, and $h_1$ equals the number of distinct alphabet symbols observed exactly once in the $n$ samples (a code sketch of this baseline is given at the end of this subsection). Detailed findings are presented in the Supplement.

Figures 1(a) and 1(b) indicate that the RWC estimator has significantly better worst-case performance than all other methods when tested on the above distributions, provided that $n \ge 0.2k$. Also, both the RWC and WY estimators have significantly better error exponents than GT, PJW and HOSW. Interestingly, we find that the worst-case risk with normalization $(1/k)^2$ tends to severely bias the results towards near-uniform distributions. We mitigate this issue by changing the normalization from $(1/k)^2$ to $(1/S)^2$, as was also done in [25]. We repeat the experiment using the normalization $(1/S)^2$, corresponding to what we refer to as the RWC-S estimator. A detailed description of this algorithm and its analysis is available in the Supplement. Figures 1(c) and 1(d) illustrate that the RWC-S estimator significantly outperforms all other estimators.

Next, we turn our attention to the case of the Poisson repeat channel (PRC). Our RWC-S-prc estimator requires solving (17) with $1/k$ replaced by $1/\hat S_c$, and setting $C = 2$, $L = \lfloor 0.558\, \eta \log k \rfloor$ and $s = 1000$. Since the noisy support estimation problem is new, there is no standard benchmark to compare against. We therefore consider the performance of RWC-S-prc, the naive counting estimator, and two simple modifications of the WY estimator, since the latter offers the best performance in the noiseless setting. The WY-naive method first divides the empirical counts $N'$ by $\eta$ and then applies the WY estimator; this is intuitive, since each symbol is repeated $\eta$ times in expectation. The WY-prc method modifies the coefficients of the WY estimator with Touchard polynomial multipliers; note that this modification does not take into account the exponential weighting and regularization term that we introduced for both the noiseless and noisy settings. In the experiments, we choose $\eta$ from {0.5, 1, 1.5}, since the replication rate is small in practice. As may be seen from Figures 1(e), 1(f) and 1(g), our method significantly outperforms all other methods. Notably, WY-prc performs poorly for $\eta > 1$, while WY-naive performs poorly for $\eta < 1$.
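For reference, the Good-Turing baseline described above takes only a few lines. This sketch follows the formulas $\hat C = 1 - h_1/n$ and $\hat S_{GT} = \hat S_c/\hat C$ verbatim; the guard against $\hat C \le 0$ is our addition.

```python
from collections import Counter

def good_turing_support(samples):
    """Good-Turing support estimate: inflate the naive count of distinct
    symbols by the estimated sample coverage C_hat = 1 - h1/n."""
    counts = Counter(samples)
    n = len(samples)
    h1 = sum(1 for c in counts.values() if c == 1)  # symbols seen exactly once
    S_c = len(counts)                               # naive counting estimate
    C_hat = 1.0 - h1 / n
    return float("inf") if C_hat <= 0 else S_c / C_hat
```

When every observed symbol is a singleton, $\hat C = 0$ and the estimate diverges, which is consistent with the Inf entry visible for the ORF6 region in Table 6.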
Real-world data experiments. We start by estimating the number of distinct words in selected books, as suggested in [20, 19]. We use Hamlet, Othello, Macbeth and King Lear for our comparative study, with the results presented in Figure 1(h) (for Hamlet) and in the Supplement (all other plays). In the experiments, we randomly sampled words from the text with replacement and used the obtained counts to estimate the number of distinct words. For simplicity, we set $k$ equal to the total number of words; for example, as the number of words in Hamlet equals 30,364, we set $k = 30{,}364$. As may be clearly seen, our methods significantly outperform all other competitive techniques, both in terms of convergence rate and in terms of the accuracy of the estimated support.

To estimate the mutational support of the SARS-Cov-2 virus in the presence of sampling artifacts, we first create the histogram of mutations in the sequenced genomes, using as reference the genome of Patient 1 (the first infected individual that was sequenced). The mutational support of a population of individuals equals the size of the union of the individual supports. The datasets used in the study were retrieved from the GISAID repository [39, 40] on 04-14-2020, and they pertain to European patients only. The analysis of the datasets acquired from Asia and North America is relegated to the Supplement. We conduct three experiments: first, we examine the results of the noiseless support estimation methods (Table 3); next, we manually corrupt the samples by Poisson repeats with $\eta = 0.5, 1, 1.5$ (Table 1; see the Supplement for $\eta = 1, 1.5$). A good noisy support estimation method should produce a support estimate close to that of its noiseless counterpart, and in this setting our method shows superior performance compared to the other techniques. Finally, we also report the results of noisy support estimation applied to the unperturbed data (Table 2). Note that the naive estimator gives a lower bound on the true support size, and that the results of WY-naive for $\eta = 0.5$ are erroneous. On the other hand, WY-prc produces estimates that violate the known maximum support size or that are negative for $\eta = 1.5$, as reported in the Supplement. A more detailed discussion of the relevant biological findings is also available in the Supplement.

Proof of Theorem 3.2. To prove the result, we need to show that $\frac{\partial}{\partial\lambda} g(a, \lambda) < 0$ for all $\lambda \ge 6.5L$. The derivative of the first term of $g$ equals

$$\frac{1}{k}\sum_{l=0}^{L} a_l^2\, l!\, e^{-\lambda}\lambda^{l-1}(l - \lambda).$$

Clearly, the right hand side in the above expression is negative for all $\lambda > L$. To analyze the second term of the derivative, we introduce auxiliary vectors $y$, $z$, $\mathbf{1}$ and a diagonal matrix $\mathrm{D}$, chosen so that showing $\frac{\partial}{\partial\lambda} g(a, \lambda) < 0$ for all polynomials of degree $L$ whenever $\lambda > CL$ reduces to showing that the matrix $\frac{e^{\lambda}}{k}\mathrm{D} + (\mathbf{1}z^T + z\mathbf{1}^T)$ is negative definite whenever $\lambda > CL$, for some constant $C > 0$. It suffices to show that the sum of the maximum eigenvalues of $\frac{e^{\lambda}}{k}\mathrm{D}$ and $(\mathbf{1}z^T + z\mathbf{1}^T)$ is negative, since $\frac{e^{\lambda}}{k}\mathrm{D}$ is a diagonal matrix. Thus, we turn our attention to determining the maximum eigenvalues of these two matrices. For $\frac{e^{\lambda}}{k}\mathrm{D}$, the maximum eigenvalue over the indices $i \in \{0, 1, \dots, L\}$ may be bounded using Stirling's formula, which asserts that $n! \ge (n/e)^n$. Next, we derive an upper bound on the maximum eigenvalue of the second matrix. The $(i, j)$ entry of the matrix $(\mathbf{1}z^T + z\mathbf{1}^T)$ equals $\frac{i+j-2}{\lambda} - 2$, and all these values are negative when $\lambda > L$. Moreover, it is clear that the matrix of interest has rank 2 and therefore exactly two nonzero eigenvalues. Let $\mathrm{A} = -(\mathbf{1}z^T + z\mathbf{1}^T)$. All entries of $\mathrm{A}$ are positive whenever $\lambda > L$. By Gershgorin's theorem, we can upper bound the maximum eigenvalue of the matrix $\mathrm{A}$ by its maximum row sum.
The maximum row sum of $\mathrm{A}$ can be computed explicitly, and the trace of $\mathrm{A}$ can be evaluated in closed form; together, these imply that the minimum eigenvalue of $\mathrm{A}$ is lower bounded by $-\frac{L(L+1)}{2\lambda}$, which directly implies that the maximum eigenvalue of $(\mathbf{1}z^T + z\mathbf{1}^T)$ is upper bounded by $\frac{L(L+1)}{2\lambda}$. Combining the two eigenvalue bounds, a sufficient condition for negative definiteness is

$$\frac{L(L+1)}{2\lambda} < \frac{e^{\lambda}}{k}\left(\frac{L}{e\lambda}\right)^L \;\Longleftrightarrow\; \log(L) + \log(L+1) + \log(k) - L\log(L) + L < \lambda + \log(\lambda) - L\log(\lambda). \qquad (18)$$

The function $\lambda + \log(\lambda) - L\log(\lambda)$ is nondecreasing in $\lambda$ whenever $\lambda > L$, since its derivative equals $1 + \frac{1-L}{\lambda}$, which is positive for $\lambda > L - 1$. By the definition of $L = \lfloor c_0 \log(k) \rfloor$, we also have $\log(k) \le \frac{L+1}{c_0}$. Using $\log(x+1) \le x$, which holds for all $x \ge 1$, it follows that for all $\lambda > CL$ with $C > 2$, a sufficient condition for (18) to hold can be derived; rearranging terms shows that it suffices to have $\log(C) \ge \frac{1}{c_0}$ and $C - \log(C) - 2 - \frac{1}{c_0} > 0$. The first condition implies $C \ge e^{1/c_0} \approx 6.002$, while the second condition holds with $C = 6.5$, for which the first condition is also satisfied. This completes the proof.

Proof of Theorem 3.3. The proof consists of two parts. In the first part, we establish the conditions for convergence, while in the second part, we determine the convergence rate. For simplicity, we present the proofs for the case without Poisson repeats, and then outline how the analysis can be modified to account for the repeats. We start by introducing the relevant terminology. Let $\Pi \subset \mathbb{R}^{L+1}$ be a closed set of parameters, and let $f$ be a continuous functional on $\Pi$. Assume that $B \subset \mathbb{R}$ is compact and that $g : \Pi \times B \to \mathbb{R}$ is continuous. We also make the following two assumptions.

Assumption 6.1. The sequence of grids is arbitrarily fine, i.e., the grid spacing tends to zero.

Assumption 6.2. There exists a $c_0$ such that the level set $\text{Level}(c_0, B_0)$ of the objective is bounded.

Theorem 6.3 ([46]). Under Assumptions 6.1 and 6.2, the solution of the discretized problem converges to the optimal solution. More formally, if $c^*$ is the unique optimal solution of the original problem, and $c^*_i$ is the optimal solution of the discretized relaxation with grid $B_i$, then $c^*_i \to c^*$.

It is straightforward to see that our chosen grid is arbitrarily fine. Hence, we only need to prove that there exists a $c_0$ such that the level set $\text{Level}(c_0, B_0)$ is bounded. Let $c = (a; t)$, and note that in our setting $f(c) = t$. Rewrite the constraint function in matrix form as $g(c, \lambda) = a^T\left(\mathrm{M}(\lambda) + \Lambda\Lambda^T\right)a - t$, where $\Lambda \triangleq e^{-\lambda}(\lambda^0, \lambda^1, \dots, \lambda^L)^T$. Note that only $a_1, \dots, a_L$ are allowed to vary, since we fixed $a_0 = -1$. Obviously, $\Lambda\Lambda^T$ is positive semi-definite, and the previously introduced $\mathrm{M}(\lambda)$ is positive definite for all $\lambda > 0$. Since the quadratic forms in the constraints of (16) are positive definite with respect to $a_1, \dots, a_L$, $g$ is coercive in $a_1, \dots, a_L$. Furthermore, for any given $t$, the set of feasible coefficients $a_1, \dots, a_L$ is bounded. Therefore, for a given $t_0$, the level set $\text{Level}(c_0, B_0)$ is bounded. This ensures that Assumption 6.2 holds for our optimization problem.

Next, we prove the uniqueness of the optimal solution $c^\star$. Note that proving this result is equivalent to proving the uniqueness of $a^\star$. Hence, we once again refer to the original minmax formulation of our problem, $\min_a \sup_{\lambda \in [\frac{n}{k}, 6.5L]} h_\lambda(a)$, where $h_\lambda(a) = a^T(\mathrm{M}(\lambda) + \Lambda\Lambda^T)a$. Clearly, for all $\lambda \in [\frac{n}{k}, 6.5L]$, the function $h_\lambda(a)$ is strictly convex, since $\mathrm{M}(\lambda) + \Lambda\Lambda^T \succ 0$ for all such $\lambda$. Taking the supremum over $\lambda$ preserves strict convexity, since for all $\theta \in (0, 1)$ one has

$$\sup_{\lambda \in [\frac{n}{k}, 6.5L]} h_\lambda(\theta a + (1-\theta)a') < \theta \sup_{\lambda \in [\frac{n}{k}, 6.5L]} h_\lambda(a) + (1-\theta) \sup_{\lambda \in [\frac{n}{k}, 6.5L]} h_\lambda(a').$$

Hence $\sup_{\lambda \in [\frac{n}{k}, 6.5L]} h_\lambda(a)$ is strictly convex, which consequently implies the uniqueness of $a^\star$, and hence of $c^\star$. For the case of samples passed through a Poisson repeat channel, it is not hard to see that the constraints are again strictly convex in $a$; one only needs to replace $\mathrm{M}(\lambda)$ and $\Lambda$ by their counterparts built from the Touchard polynomial evaluations at $\lambda e^{-\eta}$. Thus, a similar analysis applies and the details are omitted. The above arguments establish the convergence claim of Theorem 3.3.
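The convergence behavior is easy to probe empirically: refining the grid and re-solving the discretized SIP shows the optimal values $t_d$ stabilizing, consistent with Theorem 6.3. The snippet below is our own illustration; it assumes the `solve_discretized_sip` function from the sketch following (15) is in scope, and the grid sizes are arbitrary.

```python
import math
import numpy as np

k, n = 10**6, 2 * 10**5
L = int(0.558 * math.log(k))
for s in (50, 100, 200, 400, 800, 1600):
    grid = np.linspace(n / k, 6.5 * L, s)
    _, t_d = solve_discretized_sip(n, k, L, grid)
    print(s, t_d)  # successive values approach the SIP optimum t_star
```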
In what follows, and for reasons of simplicity, we omit the constraint $a_0 = -1$ in the SIP formulation; the described proof requires only small modifications to accommodate $a_0 = -1$. Recall that we use $B_d$ to denote the grid with grid spacing $d$. In order to use the results in [47], we require the convergence assumption below, which was shown to hold for the SIP of interest in the first part of the proof.

Assumption 6.4. The discretized solutions converge to an optimal solution $\bar c$ of the original SIP.

Assumption 6.5. The following hold true:
• There is a neighborhood $\bar U$ of $\bar c$ such that the function $\frac{\partial^2}{\partial\lambda^2} g(c, \lambda)$ is continuous on $\bar U \times B$.
• The set $B$ is compact, nonempty and explicitly given as the solution set of a system of inequalities, $B = \{\lambda \in \mathbb{R} : v_i(\lambda) \le 0,\, i \in I\}$, where $I$ is a finite index set and $v_i \in C^2(B)$.
• For any $\bar\lambda \in B$, the vectors $\frac{\partial}{\partial\lambda} v_i(\bar\lambda)$, $i \in \{i \in I : v_i(\bar\lambda) = 0\}$, are linearly independent.

Recall that our constraint function is of the form $g(c, \lambda) = a^T(\mathrm{M}(\lambda) + \Lambda\Lambda^T)a - t$, where $\Lambda \triangleq e^{-\lambda}(\lambda^0, \lambda^1, \dots, \lambda^L)^T$ and $c = (a; t)$. It is straightforward to see that the first condition in Assumption 6.5 holds. For the second condition, recall that $B = [\frac{n}{k}, 6.5L]$; hence, the second condition can be satisfied by choosing $I = \{1\}$ and $v_1(\lambda) = (\lambda - \frac{n}{k})(\lambda - 6.5L)$. Since we only have one constraint function $v_1$, it is also easy to see that the third condition is met.

Assumption 6.6. The grids $B_d$ include the boundary points of $B$. This assumption clearly holds for the grid of choice; note that it is crucial to include the boundary points for the proof in [47] to be applicable.

Assumption 6.7. $\nabla_c g(c, \lambda)$ is continuous on $\bar U \times B$, where $\bar U$ is a neighborhood of $\bar c$. Moreover, there exists a vector $\xi$ such that $\xi^T \nabla_c g(\bar c, \lambda) < 0$ for all active $\lambda$.

Note that $\nabla_c g(c, \lambda) = [\nabla_a g(c, \lambda); \nabla_t g(c, \lambda)]$, with $\nabla_a g(c, \lambda) = 2(\mathrm{M}(\lambda) + \Lambda\Lambda^T)a$ and $\nabla_t g(c, \lambda) = -1$. Also note that for all $\lambda \in B$, $\mathrm{M}(\lambda) + \Lambda\Lambda^T$ is positive definite. Hence, choosing $\xi$ to be colinear with, and of the same direction as, $[-a^T\; 1]^T$, as well as of sufficiently large norm, allows us to satisfy the inequality. Hence, Assumption 6.7 holds as well. The next results follow from the above assumptions and observations, and from the results in [47].

Lemma 6.8 (Corollary 1 in [47]). Let $t_d$ be the optimal objective value of the discretized SIP used for support estimation with the grid $B_d$, and let $t^\star$ be the optimal objective value of the original SIP. Since Assumptions 6.4, 6.5, 6.6 and 6.7 hold, for some $c_3 > 0$ and $d$ sufficiently small we have $|t_d - t^\star| \le c_3 d^2$. Consequently, $t_d \to t^\star$ with a convergence rate of $O(d^2)$.

Lemma 6.9 (Theorem 2 in [47]). Assume that all assumptions of Lemma 6.8 hold. If there exists a constant $c_4 > 0$ such that $t - t^\star \ge c_4 \|c - \bar c\|$ for all feasible $c$ in a neighborhood of $\bar c$, then for sufficiently small $d$ and some $\sigma > 0$ we have $\|c_d - \bar c\| \le \sigma d^2$.

This result implies that if $\bar c$ is also a strict minimum of order one, then the solution of the discretized SIP converges to that of the original SIP with rate $O(d^2)$. For the Poisson repeat channel, the constraints are also strictly convex in $a$; therefore, a similar analysis applies, and the details are once again omitted. Combining these results completes the proof.

7 Theoretical results supporting Remark 3.1

The result described in the main text follows from Theorem 6.2 in [31], originally proved in [32, 33] and [48].

Theorem 7.1 (Mhaskar-Rakhmanov-Saff). For any polynomial $P$ of degree at most $L$ and weight $W$, the supremum of $|P(x)W(x)|$ over the half-line is attained on $[0, M_L]$. Here, $M_L$ stands for the Mhaskar-Rakhmanov-Saff (MRS) number, defined as the smallest positive root of the integral equation (20) associated with the weight $W$. In our setting, the weight equals $\exp(-x)$; solving (20) gives an MRS number equal to $M_L = \frac{\pi}{2}L$. Thus, in the absence of the regularization term, we can restrict our optimization interval to $[\frac{n}{k}, \frac{\pi}{2}L + \frac{n}{k}]$.

We next introduce the optimization problem needed for minimizing the risk $\mathbb{E}\left(\frac{S - \hat S}{S}\right)^2$.
Poissonization arguments once again establish an upper bound on this risk, where the last inequality is due to the fact that $\hat S_c \le S$. Note that the only difference between (21) and (11) is that the normalization $1/k$ is replaced by $1/\hat S_c$.

9 Detailed description of the SARS-Cov-2 genomic data

Organization of the SARS-Cov-2 genome. A breakdown of the genomic structure of SARS-Cov-2 is shown in Figure 2 and described in more detail in [49]. SARS-Cov-2 comprises the following open reading frames (ORFs): ORF1a and ORF1b, spike (S), membrane (M), envelope (E), and nucleocapsid (N), as well as ORF10. Since all these ORFs encode proteins that play different roles in the process of evading the immune system of the host, we perform the mutational support analysis in the presence of sampling artifacts for each individual region. Note that the overall support is the sum of the mutational supports of all ORF regions.

Processing pipeline. We used data from the GISAID EpiCov repository [40], which contains sequenced viral strains collected from patients across the world. We downloaded the data aggregated up to the date of 04-14-2020. We filtered out all incomplete genomic datasets, resulting in a total of 8,893 samples. To obtain the mutation counts, we first aligned the samples using the alignment software MUSCLE [50] with respect to the reference genome of Patient 1, published under the name Wuhan-Hu-1 and collected at the Central Hospital of Wuhan in December 2019 (GenBank accession number MN908947). For each sample aligned to the reference, we generate a list containing the positions in the reference genome at which the patient's genome has a mutation. The mutation profile lists are aggregated into a mutation histogram for each of the viral genome regions depicted in Figure 2. To counter alignment artifacts caused by sequencing errors, we removed all gaps encountered in the prefixes and suffixes of the alignments, as well as sufficiently long gaps (> 10 nts) within the alignments (a sketch of this pipeline is given below). For the experiments, we partitioned the mutation histograms based on geographic location (Asia / North America / Europe). Since the number of samples across the three regions varies significantly, we subsampled the sequence sets to arrive at 636 samples from Asia and 1,774 samples from each of Europe and North America.

Interpretation of the results. The mutational supports without Poisson repeats may be used to estimate how quickly SARS-Cov-2, as well as any other virus, mutates early on in an outbreak. The Poisson repeat model, as previously mentioned, accounts for resampling of patients who may have tested positive in a previous round or who may have been exposed to new sources of infection. Erroneous samples are accounted for through deletions. The mean value of the Poisson repeats is chosen to accommodate the fact that most individuals are sequenced once or not at all, but that certain high-risk groups (such as health workers) may be subject to multiple tests.
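The processing pipeline described above can be summarized in a short sketch. All details below are our own illustration: we assume MUSCLE produces equal-length pairwise alignments with '-' for gaps, the trimming threshold mirrors the 10-nt rule above, and treating short interior deletions as mutated positions is an assumption on our part.

```python
from collections import Counter

def mutation_positions(ref_aln, sample_aln, max_gap=10):
    """Given a pairwise alignment of the reference and a patient genome
    (equal-length strings, '-' for gaps), return the reference positions
    at which the sample differs. Prefix/suffix gaps and gap runs longer
    than max_gap nts are dropped as likely sequencing/alignment artifacts."""
    positions, ref_pos, run, seen = [], 0, [], False
    for r, s in zip(ref_aln, sample_aln):
        if r != '-':
            ref_pos += 1                  # advance reference coordinate
        if s == '-':
            if r != '-':
                run.append(ref_pos)       # a deleted reference position
            continue
        if run and seen and len(run) <= max_gap:
            positions.extend(run)         # short interior gap: count as mutated
        run = []
        seen = True
        if r != '-' and r != s:
            positions.append(ref_pos)     # substitution
    return positions                      # trailing gap run is dropped

def mutation_histogram(alignments):
    """Aggregate per-patient mutation lists into a histogram mapping each
    reference position to the number of patients mutated there."""
    hist = Counter()
    for ref_aln, sample_aln in alignments:
        hist.update(set(mutation_positions(ref_aln, sample_aln)))
    return hist
```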
Remark. The estimators are based on the assumption that the symbol counts are independent and identically distributed, which may clearly not be the case when analyzing viral mutations (i.e., mutations at different locations in the genome may be, and are expected to be, correlated).

Genomic region | RWC-S | Naive | WY | GT | PJW | HOSW | Maximum support
ORF1a | 1746 | 911 | 2022 | 1082 | 1010 | 2770 | 13203
ORF1b | 858 | 477 | 1046 | 549 | 598 | 1346 | 8087
S | 438 | 246 | 563 | 277 | 309 | 700 | 3822
ORF3a | 166 | 99 | 204 | 108 | 108 | 272 | 828
E | 26 | 15 | 25 | 43 | 16 | 30 | 228
M | 81 | 51 | 85 | 63 | 64 | 103 | 669
ORF6 | 90 | 52 | 105 | 158 | 61 | 118 | 186
ORF7a | 116 | 66 | 131 | 157 | 76 | 156 | 366
ORF8 | 49 | 32 | 53 | 42 | 43 | 60 | 366
N | 221 | 139 | 274 | 146 | 175 | 336 | 1260
ORF10 | 39 | 30 | 21 | 35 | 40 | 21 | 117
All | 3830 | 2118 | 4529 | 2660 | 2500 | 5912 | 29132

Table 3: Results of the noiseless support estimation methods on samples from Europe.

Table 4 (columns: genomic region; RWC-S-prc, Naïve, WY-Naïve and WY-prc for each of η = 0.5, 1, 1.5): Support estimation with synthetic Poisson repeats on samples from Europe. We report the relative difference (in %) of the results for the noisy data and their noiseless counterparts (the closer the value to 0, the better the performance).

Table 5 (columns: genomic region; RWC-S-prc, Naïve, WY-Naïve and WY-prc for each of η = 0.5, 1, 1.5; maximum support): Results of the noisy support estimation methods directly applied to the real-world samples from Europe. We underline the results that are obviously false (i.e., those violating the maximum support constraint, taking negative values, or taking values smaller than the results of the naive estimator).

Genomic region | RWC-S | Naive | WY | GT | PJW | HOSW | Maximum support
ORF1a | 1752 | 835 | 2139 | 1543 | 897 | 2752 | 13203
ORF1b | 624 | 316 | 762 | 747 | 335 | 983 | 8087
S | 353 | 188 | 455 | 375 | 213 | 551 | 3822
ORF3a | 171 | 93 | 185 | 186 | 107 | 242 | 828
E | 63 | 36 | 82 | 185 | 38 | 84 | 228
M | 51 | 31 | 65 | 55 | 31 | 72 | 669
ORF6 | 3 | 3 | 3 | Inf | 3 | 5 | 186
ORF7a | 214 | 109 | 234 | 633 | 117 | 301 | 366
ORF8 | 339 | 340 | 330 | 340 | 340 | 316 | 366
N | 93 | 60 | 115 | 70 | 74 | 132 | 1260
ORF10 | 17 | 11 | 17 | 15 | 13 | 20 | 117
All | 3680 | 2022 | 4387 | 4149 | 2168 | 5458 | 29132

Table 6: Results of the noiseless support estimation methods on samples from Asia.

Table 7 (columns: genome region; RWC-S-prc, Naive, WY-Naive and WY-prc for each of η = 0.5, 1, 1.5): Tests of the noisy support estimation methods against synthetic Poisson repeats introduced into samples from Asia. We report the relative difference (in %) compared to the results of the noiseless counterparts (the closer the value to 0, the better the estimator).

Genome region | η = 0.5 (RWC-S-prc / Naive / WY-Naive / WY-prc) | η = 1 (RWC-S-prc / Naive / WY-Naive / WY-prc) | η = 1.5 (RWC-S-prc / Naive / WY-Naive / WY-prc) | Maximum support
ORF1a | 2401 / 835 / 289 / 1254 | 3278 / 835 / 2139 / 3864 | 4138 / 835 / 3508 / 183124 | 13203
ORF1b | 832 / 316 / 192 / 372 | 1144 / 316 / 762 / 1072 | 1578 / 316 / 1060 / 29256 | 8087
S | 465 / 188 / 110 / 223 | 621 / 188 / 455 / 654 | 809 / 188 / 627 / 4799 | 3822
ORF3a | 210 / 93 / 79 / 112 | 266 / 93 / 185 / 207 | 314 / 93 / 257 / 2764 | 828
E | 74 / 36 / 28 / 47 | 90 / 36 / 82 / 100 | 105 / 36 / 114 / 560 | 228
M | 58 / 31 / 25 / 38 | 70 / 31 / 65 / 76 | 81 / 31 / 79 / 401 | 669
ORF6 | 3 / 3 / 3 / 3 | 3 / 3 / 3 / 3 | 3 / 3 / 3 / 3 | 186
ORF7a | 269 / 109 / 90 / 135 | 346 / 109 / 234 / 265 | 409 / 109 / 314 / 4372 | 366
ORF8 | 339 / 340 / 340 / 340 | 341 / 340 / 330 / 324 | 346 / 340 / 352 / -11001 | 366
N | 113 / 60 / 41 / 68 | 137 / 60 / 115 / 162 | 163 / 60 / 170 / 1015 | 1260
ORF10 | 21 / 11 / 11 / 11 | 26 / 11 / 17 / 16 | 33 / 11 / 22 / 44 | 117
All | 4785 / 2022 / 1208 / 2603 | 6322 / 2022 / 4387 / 6743 | 7979 / 2022 / 6506 / 215337 | 29132

Table 8: Results of the noisy support estimation methods applied directly to samples from Asia. We underline the results that are obviously false (i.e., those that violate the maximum support size constraint, produce negative values, or produce estimates smaller than the results of the naive estimator).

Genome region | RWC-S | Naive | WY | GT | PJW | HOSW | Maximum support
ORF1a | 1509 | 804 | 1765 | 944 | 980 | 2374 | 13203
ORF1b | 727 | 403 | 912 | 437 | 442 | 1245 | 8087
S | 375 | 209 | 515 | 237 | 229 | 636 | 3822
ORF3a | 134 | 81 | 161 | 86 | 97 | 214 | 828
E | 25 | 15 | 24 | 33 | 15 | 29 | 228
M | 44 | 28 | 52 | 38 | 32 | 59 | 669
ORF6 | 33 | 21 | 43 | 32 | 23 | 46 | 186
ORF7a | 269 | 135 | 285 | 587 | 142 | 384 | 366
ORF8 | 44 | 29 | 45 | 30 | 36 | 51 | 366
N | 227 | 138 | 272 | 174 | 185 | 331 | 1260
ORF10 | 16 | 10 | 16 | 20 | 14 | 16 | 117
All | 3403 | 1873 | 4090 | 2618 | 2195 | 5385 | 29132

Table 9: Results of the noiseless support estimation methods applied to samples from North America.

Table 10 (columns: genome region; RWC-S-prc, Naïve, WY-Naïve and WY-prc for each of η = 0.5, 1, 1.5): Tests of the noisy support estimation methods against synthetic Poisson repeats introduced into samples from North America. We report the relative difference (in %) compared to the results of the noiseless counterparts (the closer the value to 0, the better the estimator).

Genome region | η = 0.5 (RWC-S-prc / Naïve / WY-Naïve / WY-prc) | η = 1 (RWC-S-prc / Naïve / WY-Naïve / WY-prc) | η = 1.5 (RWC-S-prc / Naïve / WY-Naïve / WY-prc) | Maximum support
ORF1a | 2005 / 804 / 407 / 1114 | 2716 / 804 / 1765 / 2851 | 3502 / 804 / 2802 / 95895 | 13203
ORF1b | 924 / 403 / 206 / 558 | 1202 / 403 / 912 / 1474 | 1535 / 403 / 1391 / 55948 | 8087
S | 471 / 209 / 82 / 305 | 608 / 209 / 515 / 901 | 799 / 209 / 793 / 50434 | 3822
ORF3a | 167 / 81 / 58 / 91 | 209 / 81 / 161 / 213 | 253 / 81 / 240 / 3656 | 828
E | 31 / 15 / 15 / 15 | 39 / 15 / 24 / 22 | 50 / 15 / 30 / 72 | 228
M | 51 / 28 / 24 / 34 | 60 / 28 / 52 / 60 | 71 / 28 / 77 / 235 | 669
ORF6 | 39 / 21 / 17 / 26 | 46 / 21 / 43 / 52 | 54 / 21 / 62 / 252 | 186
ORF7a | 345 / 135 / 113 / 165 | 448 / 135 / 285 / 317 | 535 / 135 / 381 / 4729 | 366
ORF8 | 52 / 29 / 22 / 32 | 61 / 29 / 45 / 61 | 70 / 29 / 93 / 168 | 366
N | 287 / 138 / 95 / 157 | 369 / 138 / 272 / 372 | 469 / 138 / 408 / 8471 | 1260
ORF10 | 18 / 10 / 10 / 10 | 23 / 10 / 16 / 15 | 28 / 10 / 22 / 47 | 117
All | 4390 / 1873 / 1049 / 2507 | 5781 / 1873 / 4090 / 6338 | 7366 / 1873 / 6299 / 219907 | 29132

Table 11: Results of the noisy support estimation methods applied directly to samples from North America. We underline the results that are obviously false (i.e., those violating the maximum support constraint, taking negative values, or taking values smaller than the results of the naive estimator).
References

[1] The relation between the number of species and the number of individuals in a random sample of an animal population.
[2] Estimating the number of unseen species: How many words did Shakespeare know?
[3] Strong lower bounds for approximating distribution support size and the distinct elements problem.
[4] Counting distinct elements in a data stream.
[5] Towards estimation error guarantees for distinct values.
[6] Query k-means clustering and the double Dixie cup problem.
[7] Clustering with same-cluster queries.
[8] HS^2: Active learning over hypergraphs.
[9] Active learning in the geometric block model.
[10] The number of new species, and the increase in population coverage, when a sample is increased.
[11] Always Good Turing: Asymptotically optimal probability estimation.
[12] Competitive closeness testing.
[13] Estimation of entropy and mutual information.
[14] Estimating the number of species: a review.
[15] Sampling algorithms: lower bounds and applications.
[16] Testing that distributions are close.
[17] An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people.
[18] Recent explosive human population growth has resulted in an excess of rare genetic variants.
[19] Estimating the unseen: improved estimators for entropy and other properties.
[20] Chebyshev polynomials, moment matching, and optimal estimation of the unseen.
[21] A unified maximum likelihood approach for estimating symmetric properties of discrete distributions.
[22] Sample complexity of the distinct elements problem.
[23] Approximate profile maximum likelihood.
[24] Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance.
[25] Data amplification: A unified and competitive approach to property estimation.
[26] Mutation rate and genotype variation of Ebola virus from Mali case sequences.
[27] Quantifying the diversification of hepatitis C virus (HCV) during primary infection: estimates of the in vivo mutation rate.
[28] Coronavirus pandemic (COVID-19).
[29] Sharp analytical capacity upper bounds for sticky and related channels.
[30] Small-sample distribution estimation over sticky channels.
[31] A survey of weighted polynomial approximation with exponential weights.
[32] Extremal problems for polynomials with exponential weights.
[33] Where does the sup norm of a weighted polynomial live?
[34] Sur les cycles des substitutions.
[35] On modeling profiles instead of values.
[36] R.A. Fisher and the making of maximum likelihood 1912-1922.
[37] Optimal prediction of the number of unseen species.
[38] Entropy rate estimation for Markov chains with large state space.
[39] Data, disease and diplomacy: GISAID's innovative contribution to global health.
[40] GISAID: Global initiative on sharing all influenza data - from vision to reality.
[41] Do symmetric problems have symmetric solutions?
[43] Convex optimization.
[44] Semi-infinite programming.
[45] Numerical methods for semi-infinite programming: a survey.
[46] Discretization methods for the solution of semi-infinite programming problems.
[47] Discretization in semi-infinite programming: the rate of convergence.
[48] On asymptotic properties of polynomials orthogonal on the real axis.
[49] Genotype and phenotype of COVID-19: Their roles in pathogenesis.
[50] MUSCLE: a multiple sequence alignment method with reduced time and space complexity.