Efficient Noise-Blind ℓ_1-Regression of Nonnegative Compressible Signals
Hendrik Bernd Petersen, Bubacarr Bah, Peter Jung
2020-03-29

Abstract. In compressed sensing the goal is to recover a signal from as few noisy linear measurements as possible. The general assumption is that the signal has only a few non-zero entries. Given an estimate of the noise level, a common convex approach to recover the signal is basis pursuit denoising (BPDN). If the measurement matrix has the robust null space property with respect to the ℓ_2-norm, BPDN obeys stable and robust recovery guarantees. In the case of unknown noise levels, nonnegative least squares recovers non-negative signals if the measurement matrix fulfills an additional property (sometimes called the M^+-criterion). However, if the measurement matrix is the biadjacency matrix of a random left regular bipartite graph, it obeys with high probability the null space property with respect to the ℓ_1-norm with optimal parameters. Therefore, we discuss non-negative least absolute deviation (NNLAD). For these measurement matrices, we prove a uniform, stable and robust recovery guarantee. Such guarantees are important, since binary expander matrices are sparse and thus allow for fast sketching and recovery. We further present a method to solve the NNLAD numerically and show that it is comparable to state-of-the-art methods. Lastly, we explain how the NNLAD can be used for group testing in the recent COVID-19 crisis and why contamination of specimens may be modeled as peaky noise, which favors ℓ_1-based data fidelity terms.

Since it has been realized that many signals admit a sparse representation in some frame, the question arose whether such signals can be recovered from fewer samples than the dimension of the domain by exploiting the low-dimensional structure of the signal. The question was answered positively at the beginning of the millennium [CRT06], [Don06]. By now there are multiple different decoders to recover a sparse signal from noisy measurements with robust recovery guarantees. Most of them, however, rely on some form of tuning, depending on either the signal or the noise. Basis pursuit denoising (BPDN) requires an upper bound on the norm of the noise [FR13, Theorem 4.22], the least absolute shrinkage and selection operator (LASSO) an estimate of the ℓ_1-norm of the signal [HTW15, Theorem 11.1], iterative hard thresholding an upper bound on the sparsity of the signal [FR13, Theorem 6.21], and the Lagrangian version of LASSO allegedly needs to be tuned in the order of the noise level [HTW15, Theorem 11.1]. If this side information is not known a priori, many decoders yield either no recovery guarantees or, in their imperfectly tuned versions, sub-optimal estimation errors [FR13, Theorem 11.12]. Even though the problem of sparse recovery from under-sampled measurements was answered long ago, finding tuning-free decoders that achieve robust recovery guarantees is still a topic of interest. The most prominent achievement in this direction is the non-negative least squares (NNLS) [BEZ08], [DT10], [WXT11], [SH11], [SH13]. It is completely tuning-free [KKRT16], and in [KJ18], [SJC19] it was proven that it achieves robust recovery guarantees if the measurement matrix consists of certain independent sub-Gaussian random variables.
We will replace the least squares with an arbitrary norm and obtain the non-negative least residual (NNLR). We prove recovery guarantees under conditions similar to those for the NNLS. In particular, we consider the case where we minimize the ℓ_1-norm of the residual (NNLAD) and give a recovery guarantee if the measurement matrix is a random walk matrix of a uniformly at random drawn D-left regular bipartite graph. This has two crucial advantages over the NNLS. The NNLS with a sub-Gaussian measurement matrix might lose some level of sparsity for the recovery guarantee, while the left regularity implies that the sparsity level is preserved when decoding with the NNLAD and measuring with the random walk matrix of a left regular graph. On the other hand, the need for fast encoding and decoding favors sparse measurement matrices over dense measurement matrices. The encoding time can be reduced to a small fraction by using sparse matrix multiplications for encoding. Moreover, this also affects the decoding time, since decoders are often motivated as first order methods for an optimization problem minimizing the residual, so that in every iteration an iterate is being encoded. Thus, sparse measurement matrices often also allow for fast decoding. Here the NNLAD has crucial advantages. Using [CP11] we can solve the NNLAD with a first order method for a single optimization problem with a sparse measurement matrix. Other state-of-the-art decoders often use non-convex optimization, computationally complex projections, or need to solve multiple different optimization problems. For instance, to solve the BPDN given a tuning parameter, a common approach is to solve a sequence of LASSO problems to approximate where the Pareto curve attains the value of the tuning parameter of BPDN [vdBF09]. Note that other approaches for non-negative recovery are possible. In [DT05] a non-negative basis pursuit was used. For super resolution, a problem closely related to the NNLAD has been studied in [MC16].

For K ∈ N we denote the set of integers from 1 to K by [K]. For a set T ⊂ [N] we denote the number of elements in T by #(T). Vectors are denoted by lower case bold face symbols, while their components are denoted by lower case italic letters. Matrices are denoted by upper case bold face symbols, while their components are denoted by upper case italic letters. For x ∈ R^N we denote its ℓ_p-norm by ‖x‖_p. Given A ∈ R^{M×N} we denote its operator norm as an operator from ℓ_q to ℓ_p by ‖A‖_{q→p} := sup_{v∈R^N, ‖v‖_q ≤ 1} ‖Av‖_p. By R^N_+ we denote the non-negative orthant. Given a closed convex set C ⊂ R^N, we denote the projection onto C, i.e. the unique minimizer of argmin_{z∈C} (1/2)‖z − v‖_2^2, by P_C(v). For a vector x ∈ R^N and a set T ⊂ [N], x|_T denotes the vector in R^N whose n-th component is x_n if n ∈ T and 0 else. Given N, S ∈ N we will often need sets T ⊂ [N] with #(T) ≤ S, and we abbreviate this by #(T) ≤ S if no confusion is possible.

Given a measurement matrix A ∈ R^{M×N}, a decoder is any map Q_A : R^M → R^N. A signal is any possible x ∈ R^N. If x ∈ R^N_+ = {z ∈ R^N : z_n ≥ 0 for all n ∈ [N]}, we say the signal is non-negative and write x ≥ 0 for short. If additionally x_n > 0 for all n ∈ [N], we write x > 0. An observation is any possible input of a decoder, i.e. any y ∈ R^M. We allow all possible inputs of the decoder as observations, since in general the transmitted codeword Ax is disturbed by some noise. Thus, given a signal x and an observation y we call e := y − Ax the noise.
A signal x is called S-sparse if ‖x‖_0 := #({n ∈ [N] : x_n ≠ 0}) ≤ S. We denote the set of S-sparse vectors by Σ_S := {z ∈ R^N : ‖z‖_0 ≤ S}. Given some S ∈ [N], the compressibility of a signal x can be measured by d_1(x, Σ_S) := inf_{z∈Σ_S} ‖x − z‖_1. Given N and S, the general non-negative compressed sensing task is to find a measurement matrix A ∈ R^{M×N} and a decoder Q_A : R^M → R^N with M as small as possible such that the following holds true: there exist a q ∈ [1, ∞] and a continuous function C : R × R^M → R_+ with C(0, 0) = 0 such that ‖Q_A(y) − x‖_q ≤ C(d_1(x, Σ_S), y − Ax) holds for all x ∈ R^N_+ and y ∈ R^M. This will ensure that if we can control the compressibility and the noise, we can also control the estimation error and, in particular, decode every noiseless observation of an S-sparse signal exactly.

Given a measurement matrix A ∈ R^{M×N} and a norm ‖·‖ on R^M we propose to define the decoder as follows: given y ∈ R^M, set Q_A(y) as any minimizer of argmin_{z≥0} ‖Az − y‖. We call this problem the non-negative least residual (NNLR). In particular, for ‖·‖ = ‖·‖_1 this problem is called non-negative least absolute deviation (NNLAD), and for ‖·‖ = ‖·‖_2 this problem is known as the non-negative least squares (NNLS) studied in [KJ18]. In fact, the proof techniques translate fairly directly. We just need to introduce the dual norm.

Definition 3.1. Let ‖·‖ be a norm on R^M. The norm ‖·‖_* on R^M defined by ‖v‖_* := sup_{‖u‖≤1} ⟨v, u⟩ is called the dual norm of ‖·‖.

Note that the dual norm is indeed a norm. To obtain a recovery guarantee for the NNLR we have certain requirements on the measurement matrix A. As for most other convex optimization problems in compressed sensing, we use a null space property.

Definition 3.2. Let S ∈ [N], q ∈ [1, ∞) and ‖·‖ be any norm on R^M. Further let A ∈ R^{M×N}. Suppose there exist constants ρ ∈ [0, 1) and τ ∈ [0, ∞) such that ‖v|_T‖_q ≤ ρ S^{1/q − 1} ‖v|_{T^c}‖_1 + τ ‖Av‖ for all v ∈ R^N and all #(T) ≤ S. Then we say A has the ℓ_q-robust null space property of order S with respect to ‖·‖, or in short A has the ℓ_q-RNSP of order S with respect to ‖·‖ with constants ρ and τ. The constant ρ is called the stableness constant and τ the robustness constant.

In order to deal with the non-negativity, we need A to be biased in a certain way. In [KJ18] this bias was guaranteed with the M^+ criterion: if there exists a vector t ∈ R^M with A^T t > 0, we say A obeys the M^+ criterion with vector t and constant κ := max_{n∈[N]} (A^T t)_n / min_{n∈[N]} (A^T t)_n. Note that κ is actually a condition number of the diagonal matrix with diagonal A^T t. Condition numbers are frequently used in error bounds in numerical linear algebra. The general recovery guarantee is the following; similar results have been obtained in the matrix case in [JJ20].

Theorem 3.4 (NNLR Recovery Guarantee). Let S ∈ [N], q ∈ [1, ∞) and ‖·‖ be any norm on R^M with dual norm ‖·‖_*. Further, suppose that A ∈ R^{M×N} obeys a) the ℓ_q-RNSP of order S with respect to ‖·‖ with constants ρ and τ and b) the M^+ criterion with vector t and constant κ. If κρ < 1, the following recovery guarantee holds true: for all x ∈ R^N_+ and y ∈ R^M, any minimizer x^# obeys an error bound on ‖x^# − x‖_q in terms of the compressibility d_1(x, Σ_S) and the noise; if q = 1, this bound can be improved. Proof. The proof can be found in Section 7.

Certain random measurement matrices guarantee uniform bounds on κ for fixed vectors t. In [KJ18, Theorem 12] it was proven that if the A_{m,n} are i.i.d. 0/1 Bernoulli random variables, A obeys the M^+ criterion with t = (1, . . . , 1)^T ∈ R^M and κ ≤ 3 with high probability. This is problematic, since if κ > 1 it might happen that κρ < 1 is no longer fulfilled. Since the stableness constant ρ(S′) as a function of S′ is monotonically increasing, the condition κρ(S′) < 1 might only hold if S′ < S.
If that is the case, there are vectors x ∈ Σ_S that are recovered by BPDN but not by NNLS! This is for instance the case for the matrix A = [1 0 1; 0 1 1], which has the ℓ_1-robust null space property of order 1 with stableness constant ρ := 1/2 and the M^+ criterion with κ ≥ 2 for any possible choice of t. In particular, the vector x = (0, 0, 1)^T is not necessarily recovered by the NNLAD and the NNLS. Hence, it is crucial that the vector t is chosen to minimize κ and ideally achieves the optimal value κ = 1. This motivates us to use random walk matrices of regular graphs, since they obey exactly this. We will only consider random walk matrices and no biadjacency matrices. Note that we have made a slight abuse of notation: the term D-LRBG, as a short form for D-left regular bipartite graph, refers in our case to the random walk matrix A and not to the graph itself. We omit this minor technical differentiation for the sake of shortening the frequently used term "random walk matrix of a D-left regular bipartite graph".

Lossless expanders are bipartite graphs that have a low number of edges but are still highly connected, see for instance [Vad12, Chapter 4]. As a consequence their random walk matrices have good properties for compressed sensing. It is well known that random walk matrices of (2S, D, θ)-lossless expanders obey the ℓ_1-RNSP of order S with respect to ‖·‖_1, see [FR13, Theorem 13.11]. The dual norm of ‖·‖_1 is the norm ‖·‖_∞, and the M^+ criterion is easily fulfilled, since the columns sum up to one. From Theorem 3.4 we can thus draw the following corollary.

Corollary 3.6 (Lossless Expander Recovery Guarantee). Let S ∈ [N], θ ∈ (0, 1/6). If A ∈ {0, D^{-1}}^{M×N} is a random walk matrix of a (2S, D, θ)-lossless expander, then the following recovery guarantee holds true: for all x ∈ R^N_+ and y ∈ R^M any minimizer x^# of argmin_{z≥0} ‖Az − y‖_1 obeys the bound.

Proof. By [FR13, Theorem 13.11] A has the ℓ_1-RNSP with respect to ‖·‖_1 with constants ρ = 2θ/(1−4θ) and τ = 1/(1−4θ). The dual norm of the norm ‖·‖_1 is ‖·‖_∞. If we set t := (1, . . . , 1)^T ∈ R^M we get A^T t = (1, . . . , 1)^T, since every column of A sums to one. Hence, A obeys the M^+ criterion with vector t and constant κ = 1, and the condition κρ < 1 is immediately fulfilled. We obtain ‖t‖_* = ‖t‖_∞ = 1 and max_{n∈[N]} (A^T t)^{-1}_n = 1. Applying Theorem 3.4 with the improved bound for q = 1 and these values yields the claim; substituting the values for ρ and τ finishes the proof.

Note that, if M ≥ (2/θ) exp(2/θ) S ln(eN/S). If A is a random walk matrix of a (2S, D, θ)-lossless expander with θ ∈ (0, 1/6), then we can also draw a recovery guarantee for the NNLS. By [FR13, Theorem 13.11] A has the ℓ_1-RNSP with respect to ‖·‖_1 with constants ρ = 2θ/(1−4θ) and τ = 1/(1−4θ) and hence also the ℓ_1-RNSP with respect to ‖·‖_2 with constants ρ′ = ρ and τ′ = τ M^{1/2}. Similarly to the proof of Corollary 3.6 we can use Theorem 3.4 to deduce that any minimizer x^# of argmin_{z≥0} ‖Az − y‖_2 obeys the corresponding bound. If the measurement error e = y − Ax is a constant vector, i.e. e = α𝟙, then ‖e‖_1 = M^{1/2} ‖e‖_2. In this case the error bound of the NNLS is just as good as the error bound of the NNLAD. However, if e is a standard unit vector, then ‖e‖_1 = ‖e‖_2. In this case the error bound of the NNLS is significantly worse than the error bound of the NNLAD. Thus, the NNLAD performs better under peaky noise, while the NNLS and NNLAD are tied under noise with evenly distributed mass. We will verify this numerically in Subsection 5.1.
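The experiments in Section 5 use exactly such matrices. For concreteness, a random walk matrix of a uniformly at random drawn D-left regular bipartite graph can be sampled as in the following minimal sketch, in which each column places D entries of value 1/D in D distinct, uniformly chosen rows; whether repeated rows (multigraphs) are allowed is a convention the paper does not fix, so treat the sampler as an assumption.

```python
import numpy as np
from scipy import sparse

def sample_dlrbg(M, N, D, rng=None):
    """Sample a random walk matrix A in {0, 1/D}^{M x N} of a D-left regular
    bipartite graph: every column has exactly D nonzero entries equal to 1/D,
    placed in D distinct rows chosen uniformly at random (a sketch; the exact
    sampling convention of the paper is left open)."""
    rng = np.random.default_rng(rng)
    rows = np.concatenate([rng.choice(M, size=D, replace=False) for _ in range(N)])
    cols = np.repeat(np.arange(N), D)
    data = np.full(N * D, 1.0 / D)
    return sparse.csc_matrix((data, (rows, cols)), shape=(M, N))

# every column sums to one
A = sample_dlrbg(M=256, N=1024, D=10, rng=0)
assert np.allclose(A.sum(axis=0), 1.0)
```

Since every column sums to one, t = (1, . . . , 1)^T certifies the M^+ criterion with the optimal constant κ = 1, exactly as used in the proof of Corollary 3.6.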
One can draw a complementary result for matrices with biased sub-Gaussian entries, which obey the ℓ_2-RNSP with respect to ‖·‖_2 and the M^+ criterion in the optimal regime [KJ18]. Table 1 states the methods which have an advantage over the other in each scenario.

Table 1: Methods with an advantage in each scenario.
  Noise                                    D-LRBG (ℓ_1)    biased sub-Gaussian (ℓ_2)
  peaky, ‖e‖_1 ≈ ‖e‖_2                     NNLAD           -
  even mass, ‖e‖_1 ≈ M^{1/2} ‖e‖_2         -               NNLS
  unknown noise                            NNLAD           NNLS

For ‖·‖ = ‖·‖_p the NNLR is a convex optimization problem and the objective function has a simple and globally bounded subdifferential. Thus, the NNLR can directly be solved with a projected subgradient method using a problem-independent step size. Such subgradient methods achieve only a convergence rate of O(log(k) k^{-1/2}) towards the optimal objective value [Nes04, Section 3.2.3], where k is the number of iterations performed. In the case that the norm is the ℓ_2-norm, we can transfer the problem into a differentiable version, i.e. the NNLS argmin_{z≥0} (1/2)‖Az − y‖_2^2. Since the gradient of such an objective is Lipschitz, this problem can be solved by a projected gradient method with constant step size, which achieves a convergence rate of O(k^{-2}) towards the optimal objective value [BT09], [AP16]. However, this does not generalize to the ℓ_1-norm. We will show below that the case of the ℓ_1-norm can be solved with a convergence rate of O(k^{-1}) towards the optimal objective value by using the proximal point method proposed in [CP11]. This results in the following algorithm.

Algorithm 4.1 (NNLAD as First Order Method). A convergence guarantee can be deduced from the following result.

Proposition 4.2. Let (x^k, w^k)_{k∈N} be the iterates generated by Algorithm 4.1 with step sizes σ, τ ∈ (0, ∞) satisfying στ < ‖A‖_{2→2}^{-2}, and let (x̄^k, w̄^k)_{k∈N} denote the corresponding averages. Then, the following statements hold true:
(1) The iterates and averages converge: the sequences (x^k)_{k∈N} and (x̄^k)_{k∈N} converge to a minimizer of argmin_{z≥0} ‖Az − y‖_1.
(2) The iterates and averages are feasible: x^k ≥ 0 and ‖w^k‖_∞ ≤ 1, and likewise x̄^k ≥ 0 and ‖w̄^k‖_∞ ≤ 1.
(3) There is a stopping criterion for the iterates: lim_{k→∞} A^T w^k ≥ 0 and lim_{k→∞} ( ‖Ax^k − y‖_1 + ⟨y, w^k⟩ ) = 0; in particular, if A^T w^k ≥ 0 and ‖Ax^k − y‖_1 + ⟨y, w^k⟩ ≤ 0, then x^k is a minimizer of argmin_{z≥0} ‖Az − y‖_1.
(4) The stopping criterion also holds for the averages, replacing x^k with x̄^k and w^k with w̄^k.
(5) The averages obey a convergence rate of O(k^{-1}) towards the optimal objective value.
Proof. The proof can be found in Section 8.

Note that Algorithm 4.1 yields a convergence guarantee for both the iterates and the averages, but the convergence rate is only guaranteed for the averages. Algorithm 4.1 is optimized in the sense that it uses the least possible number of matrix-vector multiplications per iteration, since these govern the computational complexity. As stated, the NNLS achieves the convergence rate O(k^{-2}) [AP16] while the NNLAD only achieves the convergence rate O(k^{-1}) towards the optimal objective value. However, this should not be considered weaker, since the objective function of the NNLS is the square of a norm. If x^k are the iterates of the NNLS implementation of [AP16], algebraic manipulation yields ‖Ax^k − y‖_2 = ( ‖Ax^k − y‖_2^2 )^{1/2} ∈ O(k^{-1}). Thus, the ℓ_2-norm of the residual of the NNLS iterates only decays in the same order as the ℓ_1-norm of the residual of the NNLAD averages.

Let ‖·‖ be a norm on R^M. Since the exact optimal objective value inf_{z≥0} ‖Az − y‖ is unknown, checking for an optimizer can only be done with the Karush-Kuhn-Tucker conditions or by checking whether zero is in the subdifferential. In particular, one needs to find a vector w ∈ R^M which satisfies the conditions
  ‖w‖_* ≤ 1, (Opt 1)
  ⟨w, Ax − y⟩ = ‖Ax − y‖, (Opt 2)
  A^T w ≥ 0, (Opt 3)
  ⟨A^T w, x⟩ = 0, (Opt 4)
where ‖·‖_* is the dual norm of ‖·‖. The first two conditions ensure that A^T w is in the subdifferential of z ↦ ‖Az − y‖ at z = x and the second two conditions ensure that −A^T w is in the subdifferential of the characteristic function of R^N_+ at x.
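Before returning to these optimality conditions, the following is a minimal sketch of the primal-dual proximal iteration of [CP11] applied to the NNLAD, together with the stopping check suggested by Proposition 4.2. The proximal steps shown (a dual step that projects w + σ(Av − y) onto the ℓ_∞-unit ball and a primal step that projects x − τ A^T w onto the non-negative orthant) and the step sizes σ = τ = 0.99 ‖A‖_{2→2}^{-1} follow the description in this paper, but the function below is an illustration under these assumptions, not a verbatim transcription of Algorithm 4.1.

```python
import numpy as np

def nnlad_pdhg(A, y, n_iter=5000, tol=1e-9):
    """Sketch of a primal-dual (Chambolle-Pock type) iteration for
    argmin_{x >= 0} ||A x - y||_1, cf. Algorithm 4.1 / Proposition 4.2.
    A is assumed to be a dense numpy array here."""
    M, N = A.shape
    # step sizes as in the experiments: sigma = tau = 0.99 / ||A||_{2->2}
    op_norm = np.linalg.norm(A, 2)
    sigma = tau = 0.99 / op_norm
    x = np.zeros(N)          # primal iterate
    v = x.copy()             # extrapolated primal point
    w = np.zeros(M)          # dual iterate
    for k in range(n_iter):
        # dual step: projection onto {||w||_inf <= 1} after a gradient-like step
        w = np.clip(w + sigma * (A @ v - y), -1.0, 1.0)
        # primal step: projection onto the non-negative orthant
        x_new = np.maximum(x - tau * (A.T @ w), 0.0)
        # extrapolation
        v = 2.0 * x_new - x
        x = x_new
        # stopping check in the spirit of Proposition 4.2 (3):
        # duality-gap surrogate small and A^T w (approximately) non-negative
        if np.abs(A @ x - y).sum() + y @ w <= tol and (A.T @ w >= -tol).all():
            break
    return x
```

For the sparse matrices used later one would compute the operator norm with, e.g., a power iteration and use sparse matrix-vector products; each iteration then costs essentially one multiplication with A and one with A^T.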
If ‖·‖ is the ℓ_p-norm for p ∈ (1, ∞) and Ax − y ≠ 0, by Hölder's inequality the vector w with components w_m = sgn((Ax − y)_m) |(Ax − y)_m|^{p−1} / ‖Ax − y‖_p^{p−1} is the only vector that satisfies (Opt 1) and (Opt 2). Checking (Opt 3) and (Opt 4) is then easy. However, if ‖·‖ is the ℓ_1- or ℓ_∞-norm, the first two conditions have multiple solutions and checking for an optimizer is a non-trivial feasibility problem. Statements (3) and (4) of Proposition 4.2 give us a simple condition to check for an optimizer, which can be evaluated in O(M + N) with a proper implementation.

The question arises whether it is better to estimate with the averages or the iterates. Numerical testing suggests that the iterates reach tolerance thresholds significantly faster than the averages. We can only give a heuristic explanation for this phenomenon. From statement (3) of Proposition 4.2 we obtain that lim_{k→∞} A^T w^k ≥ 0. In practice we observe that A^T w^k ≥ 0 for all sufficiently large k. However, A^T w^{k+1} ≥ 0 yields x^{k+1} ≤ x^k. This monotonicity promotes the convergence of the iterates and gives a clue why the iterates seem to converge better in practice. See Figure 5 and Figure 6.

In this section we compare the NNLAD with several state-of-the-art recovery methods in terms of achieved sparsity levels and decoding time. We recall that the goal is to recover x from the noisy linear measurements y = Ax + e. To investigate properties of the minimizers of the NNLAD we compare it to the minimizers of the well-studied problems basis pursuit (BP), optimally tuned BPDN, optimally tuned ℓ_1-constrained least residual (CLR) and the NNLS, given by argmin_{z:Az=y} ‖z‖_1 (BP), argmin_{z:‖Az−y‖_1≤ε} ‖z‖_1 (BPDN), argmin_{z:‖z‖_1≤τ} ‖Az − y‖ (CLR) and argmin_{z≥0} ‖Az − y‖_2 (NNLS). Further, we compare the NNLAD to any cluster point of the sequence of the expander iterative hard thresholding (EIHT); here median(z)_n is the median of (z_m)_{m∈Row({n})}, and P_{Σ_S}(v) is a hard thresholding operator, i.e. any minimizer of argmin_{z∈Σ_S} (1/2)‖z − v‖_2^2. By convex decoders we refer to BPDN, BP, CLR, NNLAD, and NNLS. We choose the optimal tuning ε = ‖e‖_1 for the BPDN and τ = ‖x‖_1 for the CLR. Thus, the optimally tuned BPDN and CLR represent a best case benchmark. In [K19, Figure 1.1] it was noticed that tuning the BPDN with ε > ‖e‖_p often leads to worse estimation errors than tuning with ε < ‖e‖_p for p = 2. Thus, BP represents a worst case benchmark, as a version of BPDN with no prior knowledge about the noise. At the moment we do not care about the method used to calculate the minimizers of the optimization problems, and thus we solve all optimization problems with the CVX package for Matlab [GB14], [MS08].

For given SNR, r, N, M, D, S we will do the following experiment multiple times: 4. Define the observation y := Ax + e. For each decoder Q_A calculate an estimator x^# := Q_A(y) and collect the relative estimation error ‖x^# − x‖_1/‖x‖_1. In this experiment we have SNR = ‖Ax‖_1/‖e‖_1, and since A is a D-LRBG and x ≥ 0, we further have ‖Ax‖_1 = ‖x‖_1 = 1. Note that for r = 0 and r = 1 we obtain two different noise distributions. If e is uniformly distributed on S^{M−1}_1, then the absolute value of each component |e_m| concentrates around its expected value 1/M; by testing one can observe in particular that ‖e‖_1 ≈ M^{1/2} ‖e‖_2 up to a constant. If e is a scaled standard unit vector, then ‖e‖_2 = ‖e‖_1. Thus, these two noise distributions each represent randomly drawn noise vectors obeying one of the two norm equivalences asymptotically tightly up to a constant. From (1) and (2) we expect that the NNLS has roughly the same estimation errors as the NNLAD for r = 1, i.e.
the evenly distributed noise, and significantly worse estimation errors for r = 0, i.e. the peaky noise.

We fix the constants r = 1, N = 1024, M = 256, D = 10, SNR = 1000 and vary the sparsity level S ∈ [64]. For each S we repeat Experiment 1 100 times. We plot the mean of the relative ℓ_1-estimation error and the mean of the logarithmic relative ℓ_1-estimation error, i.e. 10 log_10( ‖x^# − x‖_1 / ‖x‖_1 ), over the sparsity. The result can be found in Figure 1a and Figure 1b. For S ≥ 30 the estimation error of the EIHT randomly peaks high. We deduce that the EIHT fails to recover the signal reliably for S ≥ 30, while the NNLAD and the other convex decoders succeed. This is not surprising, since by [FR13, Theorem 13.15] the EIHT obeys a robust recovery guarantee for S-sparse signals whenever A is the random walk matrix of a (3S, D, θ′)-lossless expander with θ′ < 1/12. This is significantly stronger than the (2S, D, θ)-lossless expander property with θ < 1/6 required for the null space property. However, if the EIHT recovers a signal, it recovers it significantly better than any convex method. This might be because the originally generated signal is indeed from Σ_S, which is enforced by the hard thresholding of the EIHT but not by the convex decoders. This suggests that it might be useful to apply thresholding to the output of any convex decoder to increase the accuracy if the original signal is indeed sparse and not only compressible.

For the remainder of this subsection we focus on convex decoders. Contrary to our expectation, the BPDN achieves worse estimation errors than all other convex decoders for S ≥ 60, even worse than the untuned BP. The authors have no explanation for this phenomenon. Apart from that we observe that the CLR and BP indeed perform as best and worst case benchmarks, respectively. However, the difference between BP and CLR becomes rather small for high S. We deduce that tuning becomes less important near the optimal sampling rate. The NNLAD, NNLS and CLR perform roughly the same. This is quite strong, since BPDN and CLR are optimally tuned using unknown prior information. As expected, the NNLS performs roughly the same as the NNLAD, see Table 1. However, this is a result of the noise distribution for r = 1.

We repeat Experiment 1 with the same constants but r = 0, i.e. e is a unit vector scaled by ±‖Ax‖_1/SNR. We plot the mean of the relative ℓ_1-estimation error and the mean of the logarithmic relative ℓ_1-estimation error over the sparsity. The result can be found in Figure 2a and Figure 2b. We want to note that, similarly to Figure 1a, the EIHT works only unreliably for S ≥ 30. Even though the mean of the logarithmic relative ℓ_1-estimation error of the NNLS is worse than the one of the EIHT for 30 ≤ S ≤ 60, the NNLS does not fail but only approximates with a weak error bound. As the theory suggests, the NNLS performs significantly worse than the NNLAD, see Table 1. It is worth mentioning that the estimation errors of the NNLS seem to be bounded by the estimation errors of BP. This suggests that A obeys an ℓ_1-quotient property, which bounds the estimation error of any instance optimal decoder, see [FR13, Lemma 11.15].

Theorem 3.4 states that the NNLAD has an error bound similar to the optimally tuned CLR and BPDN. Further, by (1) the ratio of the relative ℓ_1-estimation error to the ℓ_1-noise power should stay constant as the SNR varies. The logarithmic relative ℓ_1-estimation errors of the different decoders stay in a constant relation to each other over the whole range of SNR. This relation is roughly the relation we can find in Figure 1b for S = 32.
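For reference, the two noise models of Experiment 1 can be generated as follows. The paper does not spell out its sampler, so this is one standard construction (uniform signs combined with a uniformly distributed point on the simplex for r = 1) and should be read as an assumption on our part; the scaling is chosen so that ‖e‖_1 = ‖Ax‖_1/SNR = 1/SNR.

```python
import numpy as np

def sample_noise(M, noise_l1, r, rng=None):
    """Sketch of the two noise models of Experiment 1 with ||e||_1 = noise_l1.
    r = 1: e uniform on the scaled l1-sphere (evenly spread mass).
    r = 0: e a randomly signed, randomly placed scaled unit vector (peaky).
    One standard construction; the paper's exact sampler is not specified."""
    rng = np.random.default_rng(rng)
    if r == 1:
        magnitudes = rng.exponential(size=M)
        magnitudes /= magnitudes.sum()            # uniform on the simplex
        signs = rng.choice([-1.0, 1.0], size=M)   # uniformly chosen facet of the l1-sphere
        return noise_l1 * signs * magnitudes
    e = np.zeros(M)
    e[rng.integers(M)] = rng.choice([-1.0, 1.0]) * noise_l1
    return e
```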
As expected, the ratio of the relative ℓ_1-estimation error and the ℓ_1-noise power stays constant independently of the SNR for all decoders. We deduce that the NNLAD is noise-blind. We repeat the experiment with r = 0 and obtain Figure 4a and Figure 4b. Against our expectation, the relative estimation error ‖x − x^#‖_1/‖x‖_1 itself, and not its ratio to the noise power, seems to be constant. Since it is fairly small, we suspect that this is the result of CVX reaching a tolerance parameter of √eps ≈ 1.5 · 10^{-8} and terminating, while the actual optimizer might in fact be the original signal. It is definitely noteworthy that even with the incredibly small signal-to-noise ratio of 10 the signal can be recovered by the NNLAD with an estimation error of 1.0 · 10^{-7} for this noise distribution.

To investigate the convergence rates of the NNLAD as stated in Proposition 4.2, we compare it to different types of decoders when e = 0. As a best case benchmark, we consider the EIHT, which has a linear convergence rate O(c^{-k}) towards the signal [FR13, Theorem 13.15]. As a direct competitor we consider the NNLS implemented by the methods of [AP16], which has a convergence rate of O(k^{-2}) towards the optimal objective value. [AP16] can also be used to calculate the LASSO. However, calculating the projection onto the ℓ_1-ball in R^N is computationally slightly more complex than the projection onto R^N_+. Thus, the NNLS will also serve as a lower bound for the LASSO. As a worst case benchmark we consider a simple projected subgradient implementation of the NNLAD using the Polyak step size, which has a convergence rate of O(k^{-1/2}) towards the optimal objective value [Pol87, Section 7.2.2 & Section 5.3.2], [Boy14, Section 6]. We always initialize all iterative methods with zero vectors. The EIHT will always use the parameter S′ = ‖x‖_0, the NNLAD σ = τ = 0.99 ‖A‖_{2→2}^{-1} and the NNLS the parameters s = 0.99 ‖A‖_{2→2}^{-2} and α = 3.01, see [AP16]. Parameters that can be computed from A will be calculated before the timers start. This includes the adjacency structure of A for the EIHT, σ, τ for the NNLAD and s, α for the NNLS, since these are considered to be part of the decoder.

We will do the following experiment multiple times: For r = 2 this represents a biased sub-Gaussian random ensemble [KJ18] with optimal recovery guarantees for the NNLS. For r = 1 this represents a D-LRBG random ensemble with optimal recovery guarantees for the NNLAD. We fix the constants r = 1, N = 1024, M = 256, S = 16, D = 10 and repeat Experiment 2 100 times. We plot the mean of the logarithmic relative ℓ_1-estimation error and the mean of the logarithmic relative ℓ_1-norm of the residual, i.e. Mean(LNℓ_1E) = Mean( 10 log_10( ‖x^k − x‖_1 / ‖x‖_1 ) ) and Mean(LNℓ_1R) = Mean( 10 log_10( ‖Ax^k − y‖_1 / ‖y‖_1 ) ), over the iterations and the time. The result can be found in Figure 5 and Figure 6.

The averages of the NNLAD converge significantly more slowly than the iterates, even though we lack a convergence rate for the iterates. We deduce that one should always use the iterates of the NNLAD to recover a signal. Surprisingly, the averages converge even more slowly than the subgradient method. However, this is not because the averages converge slowly, but rather because the subgradient method and all other methods converge faster than expected. In particular, the NNLAD iterates, the EIHT and the NNLS all converge linearly towards the optimal objective value and towards the signal. Even the subgradient method converges almost linearly. We deduce that the NNLS is the fastest method to recover a signal if A is a D-LRBG.
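For completeness, the projected subgradient baseline of Experiment 2 can be sketched as follows. Since the experiment is noiseless, the optimal objective value is zero and the Polyak step size reduces to ‖Ax^k − y‖_1/‖g^k‖_2^2; the update below is our illustration of this standard scheme under that assumption, not code from the paper.

```python
import numpy as np

def nnlad_subgradient(A, y, n_iter=5000):
    """Sketch of a projected subgradient method with Polyak step size for
    argmin_{x >= 0} ||A x - y||_1, assuming the optimal value is 0 (noiseless data)."""
    x = np.zeros(A.shape[1])
    for k in range(n_iter):
        residual = A @ x - y
        objective = np.abs(residual).sum()
        if objective == 0.0:
            break
        g = A.T @ np.sign(residual)          # a subgradient of x -> ||A x - y||_1
        gg = g @ g
        if gg == 0.0:
            break
        step = objective / gg                # Polyak step with optimal value 0
        x = np.maximum(x - step * g, 0.0)    # projection onto the non-negative orthant
    return x
```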
Apart from a constant, the NNLAD iterates, the EIHT and the NNLS converge in the same order. However, this behavior does not hold if we consider a different distribution for A, as one can verify by setting each component A_{m,n} to be an independent 0/1 Bernoulli random variable. While the EIHT makes more progress per iteration than the NNLS, it still takes more time to achieve the same estimation errors and residuals. We plot the mean of the time required to calculate the first k iterations in Figure 7. The EIHT requires roughly 6 times as long as any other method to calculate each iteration. All methods but the EIHT can be implemented with only two matrix-vector multiplications, namely once with A and once with A^T. Both of these require roughly 2DN floating point operations. Hence, each iteration requires O(4DN) floating point operations. The EIHT only calculates one matrix-vector multiplication, but additionally the median. This calculation is significantly slower than a matrix-vector multiplication. For every n ∈ [N] we need to sort a vector with D elements, which can be performed in O(D log(D)). Hence, each iteration of the EIHT requires O(DN log(D)) floating point operations, which explains why the EIHT requires significantly more time for each iteration.

As we have seen, the NNLS is able to recover signals faster than any other method; however, it also only obeys sub-optimal robustness guarantees, as we have seen in Figure 4a. We ask ourselves whether the NNLS is also faster with a more natural measurement scheme, i.e. if the A_{m,n} are independent 0/1 Bernoulli random variables. We repeat Experiment 2 100 times with r = 2 for the NNLS and r = 1 for the other methods. We again plot the mean of the logarithmic relative ℓ_1-estimation error and the mean of the logarithmic relative ℓ_1-norm of the residual in Figure 8 and Figure 9. The NNLAD and the EIHT converge to the solution in roughly the same time. Even the subgradient implementation of the NNLAD recovers a signal in less time than the NNLS. Further, the convergence of the NNLS no longer seems to be linear. We deduce that the sparse structure of A has a more significant influence on the decoding time than the smoothness of the data fidelity term. We also deduce that even the subgradient method is a viable choice to recover a signal.

As a last test we compare the NNLAD to the SPGL1 toolbox for Matlab [vdBF09], [vdBF19]. Experiment 3. 1. Generate the measurement matrix A ∈ {0, D^{-1}}^{M×N} as a uniformly at random drawn D-LRBG. 3. Define the observation y := Ax. Run each decoder until a relative ℓ_2-error threshold for x^# is reached and collect the time needed to perform these iterations. If this threshold cannot be reached after 10^5 iterations, the recovery failed and the time is set to ∞. We again fix the dimensions N = 1024, M = 256, D = 10 and vary S ∈ [128]. For both the BP implementation of SPGL1 and the LASSO implementation of SPGL1 we repeat Experiment 3 100 times for each S. We plot the mean of the time to calculate the estimators over the sparsity in Figure 10a and Figure 10b. The NNLAD implementation is slower than both SPGL1 methods for small S. However, if we have the optimal number of measurements M ∈ O(S log(N/S)), the NNLAD is faster than both SPGL1 methods. The implementation of the NNLAD as presented in Algorithm 4.1 is a reliable recovery method for sparse non-negative signals. There are methods that might be faster, but these either recover a smaller number of coefficients (EIHT, greedy methods) or obey sub-optimal recovery guarantees (NNLS).
The implementation is as fast as the commonly used SPGL1 toolbox, but has the advantage that it requires no tuning depending on the unknown x or e. Lastly, the NNLAD can handle peaky noise remarkably well.

With the outbreak and rapid spread of the COVID-19 virus, we are in need of testing a large number of people for an infection. Since we can only test a fixed number of persons in a given time, the number of persons tested for the virus grows at most linearly. On the other hand, models suggest that the number of possibly infected persons grows exponentially. At some point, if that is not already the case, we will have a shortage of test kits and we will not be able to test every person. It is thus desirable to test as many persons as possible with as few test kits as possible. The field of group testing develops strategies to test groups of individuals instead of single individuals in order to reduce the number of tests required to identify individuals with a certain property. The first advances in group testing were made in [Dor43]. For a general overview of group testing we refer to [AJS19].

The problem of testing a large group for a virus can be modeled as a compressed sensing problem in the following way: suppose we want to test N persons, where x_n is the quantity of viruses in the specimen of the n-th person, e^spec_n is the error made while taking this specimen, and e^pro_m is the amount of viruses in the m-th sample originating from a possible contamination of the sample. Usually, when using a test kit the viruses are duplicated multiple times, for instance with a polymerase chain reaction. The test kit will then detect a known, bijective transformation of each individual component y_m. Hence, we can calculate y_m directly and thus assume that y_m is the result of the test. After all M tests we detect the quantity y = Ax + e, where e = Ae^spec + e^pro. For now we assume that e is peaky in terms of Table 1, and we will later argue why such a model is natural.

Often each specimen is tested separately, meaning that A is the identity. In particular, we need at least as many test kits as specimens. Further, we estimate the true quantity of viruses x_n by x^#_n := y_n, which results in the estimation error x^#_n − x_n = e_n = e^spec_n + e^pro_n. In this scenario the estimation error for the n-th person is only affected by the errors made while taking their specimen and testing their sample. Since the noise vector e is peaky, some but few tests will be inaccurate and might result in false positives or false negatives.

In general, only a fraction of the persons is indeed affected by the virus. Thus, we assume that ‖x‖_0 ≤ S for some small S. Since the amount of viruses is a non-negative value, we also have x ≥ 0. Corollary 3.6 suggests choosing A as the random walk matrix of a lossless expander or, by [FR13, Theorem 13.7], as a uniformly at random chosen D-LRBG. Such a matrix A has non-negative entries and the column sums of A are not greater than one. This is a necessary requirement, since each column sum is the total amount of specimen used in the test procedure. In particular, a fraction of D^{-1} of each specimen is used in each of exactly D test kits. In order to calculate an estimate of the true quantity of viruses x we propose to use the NNLAD. By Corollary 3.6 and [FR13, Theorem 13.7] this allows us to reduce the number of test kits required to M ≈ C S log(eN/S). Further, as we have seen in Figure 4a and Figure 4b, the approximation error is remarkably small since the noise e is peaky. In this manner, the estimation will even be more exact than by testing each specimen separately.
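As a toy illustration of this pooling scheme, the sketch below builds a D-left regular pooling design, contaminates a single test, and recovers the viral loads by solving the NNLAD as a linear program with scipy.optimize.linprog. All sizes and noise levels are hypothetical and chosen only for the example; the LP reformulation (auxiliary variables t with t ≥ |Az − y| component-wise) is a generic way to compute an NNLAD minimizer and is not the solver used in the paper.

```python
import numpy as np
from scipy.optimize import linprog
from scipy import sparse

rng = np.random.default_rng(1)
N, M, D, S = 200, 60, 5, 5             # hypothetical sizes, for illustration only

# pooling matrix: each specimen is split over D tests, a fraction 1/D each
rows = np.concatenate([rng.choice(M, size=D, replace=False) for _ in range(N)])
A = sparse.csc_matrix((np.full(N * D, 1.0 / D), (rows, np.repeat(np.arange(N), D))),
                      shape=(M, N)).toarray()

x = np.zeros(N)                         # true viral loads, S infected persons
x[rng.choice(N, size=S, replace=False)] = rng.uniform(1.0, 2.0, size=S)

e_pro = np.zeros(M)                     # peaky noise: one contaminated test
e_pro[rng.integers(M)] = 0.5
y = A @ x + e_pro

# NNLAD as a linear program: minimize sum(t) over (z, t) subject to
#   A z - t <= y,  -A z - t <= -y,  z >= 0  (so t >= |A z - y| at the optimum)
c = np.concatenate([np.zeros(N), np.ones(M)])
A_ub = np.block([[A, -np.eye(M)], [-A, -np.eye(M)]])
b_ub = np.concatenate([y, -y])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
x_hat = res.x[:N]
print("max recovery error:", np.abs(x_hat - x).max())
```

A single contaminated test perturbs only one entry of y, which is exactly the peaky regime in which Table 1 favors the ℓ_1 data fidelity term.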
It remains to argue why a peaky noise model is natural for the problem of testing a large group for a virus. In general, the sample of the m-th test kit will be affected by the noise e^pro_m ≠ 0 if, for instance, the sample is contaminated by a specimen of a different person or a laboratory employee. Due to the caution and expertise of the members of the health care system this is a rare occasion, but when it happens the effect will be rather strong. Thus, it is natural to assume that e^pro is peaky. The specimen of the n-th person is affected by the noise e^spec_n under similar circumstances. Thus, it is also natural to assume that e^spec is peaky. If we test each specimen separately, it follows that e = e^spec + e^pro is peaky. On the other hand, if A is a D-LRBG, the sample of the m-th test kit is affected by the noise of the n-th specimen if and only if A_{m,n} ≠ 0. Since there are exactly D non-zeros per column, exactly D samples are affected by the noise of the n-th specimen, namely by D^{-1} e^spec_n each. If n_e denotes the number of components of e^spec with significant absolute value and m_e denotes the number of components of e^pro with significant absolute value, then the number of components of e with significant absolute value is at most D n_e + m_e. If D is sufficiently small, as required for [FR13, Theorem 13.7], the noise will be peaky. Note that the lack of knowledge of the noise e favors the NNLAD recovery method over a (BPDN) approach. Further, since the total sum of viruses in all patients, given by Σ_{n∈[N]} x_n = ‖x‖_1, is unknown, it is undesirable to use (CLR).

By 𝟙 we denote the all-ones vector in R^N or R^M, respectively. The proof is an adaptation of the steps used in [KJ18]. As for most convex optimization problems in compressed sensing, we require [FR13, Theorem 4.25] and [FR13, Theorem 4.20], respectively.

Theorem 7.1. Let q ∈ [1, ∞) and suppose A has the ℓ_q-RNSP of order S with respect to ‖·‖ with constants ρ and τ. Then an error bound on ‖z − x‖_q holds for all x, z ∈ R^N; if q = 1, this bound can be improved.

Note that by a modification of the proof this result also holds for q = ∞ [PJ20]. As a consequence, all our statements also hold for q = ∞ with 1/q := 0. If W ∈ R^{N×N} is a diagonal matrix, we can calculate some operator norms fairly easily: ‖W‖_{q→q} = max_{n∈[N]} |W_{n,n}| for all q ∈ [1, ∞]. We use this relation frequently throughout this section. Furthermore, we use [KJ18, Lemma 5] without adaptation. For the sake of completeness we add a short proof.

Lemma 7.2 ([KJ18, Lemma 5]). Let q ∈ [1, ∞) and suppose that A ∈ R^{M×N} has the ℓ_q-RNSP of order S with respect to ‖·‖ with constants ρ and τ. Let W ∈ R^{N×N} be a diagonal matrix with W_{n,n} > 0. If ρ′ = ‖W‖_{q→q} ‖W^{-1}‖_{1→1} ρ < 1, then AW^{-1} has the ℓ_q-RNSP of order S with respect to ‖·‖ with constants ρ′ = ‖W‖_{q→q} ‖W^{-1}‖_{1→1} ρ and τ′ = ‖W‖_{q→q} τ.

Proof. Let v ∈ R^N and #(T) ≤ S. If we apply the RNSP of A to the vector W^{-1}v, we get the claimed inequality. This finishes the proof.

Next we adapt [KJ18, Theorem 4] to account for arbitrary norms. Further, we obtain a slight improvement in the form of the dimensional scaling constant S^{1/q − 1}. With this, our error bound asymptotically becomes, for S → ∞, the error bound of basis pursuit denoising, whenever κ = 1 and q > 1 [FR13].

Proposition 7.3 (Similar to [KJ18, Theorem 4]). Let q ∈ [1, ∞) and ‖·‖ be a norm on R^M with dual norm ‖·‖_*. Suppose A has the ℓ_q-RNSP of order S with respect to ‖·‖ with constants ρ and τ. Suppose A obeys the M^+ criterion with vector t and constant κ and that κρ < 1. Then an error bound on ‖z − x‖_q holds for all x, z ≥ 0; if q = 1, this bound can be improved.

Proof. Let x, z ≥ 0.
In order to apply Lemma 7.2 we set W as the matrix with diagonal A T t and zero else. It follows that W n,n > 0 and W q→q W −1 1→1 ρ = κρ < 1. We can apply Lemma 7.2, which yields that AW −1 has ℓ q -RNSP with constants ρ ′ = W q→q W −1 1→1 ρ = κρ and τ ′ = W q→q τ = max n∈[N ] A T t n τ . We apply Theorem 7.1 with the matrix AW −1 , the vectors Wx, Wz and the constants ρ ′ and τ ′ and get We lower bound the left hand side further to get We want to estimate the term Wz 1 − Wx 1 using the M + criterion. Since z, x ≥ 0, W n,n = A T t n > 0 and W is a diagonal matrix, we have Applying this to (4) we get If q = 1 we can repeat the proof with the improved bound of Theorem 7.1. After these auxiliary statements it remains to prove the main result of Section 3 about the properties of the NNLR minimizer. Proof of Theorem 3.4. By applying Proposition 7.3 with x and z := x # ≥ 0 we get (1 + κρ) where in the last step we used that x # is a minimizer and x is feasible. If q = 1 we can repeat the proof with the improved bound of Proposition 7.3. In this section we only care about minimizers of argmin z≥0 Az − y 1 , and thus x will denote any element in R N and does not need to be sparse or compressible. In order to prove Proposition 4.2 we introduce saddle point problems and technical notations from optimization. holds true, then x # , w # is called saddle point of f . In general we have for any point ( This yields that the inequality holds true, but not necessarily with equality. The equality is a condition of the existence of a saddle point. The ≥ 0 is called the duality gap. Further, (5) and (6) yield the logical statement Given a function F : The fenchel conjugate has several interesting properties, however we only require that if F is proper, convex and lower semicontinuous 3 , then also F * is proper, convex and lower semicontinuous and F * * = F holds true [Roc70, Theorem 12.2]. Given a proper, convex, lower-semicontinuous function F : R N → R, the proximal point operator of F is the function Prox F (·) : Then, the function f (x, w) := Ax, w + G (x) − F * (w) has a saddle point. Further, let τ, σ ∈ (0, ∞) be parameters with στ < A −2 2→2 and x 0 ∈ R N , w 0 ∈ R M be initializations. Set v 0 := x 0 and for all k ∈ N 0 inductively w k+1 =Prox σF * w k + σAv k (PP 1) The sequence x k , w k converges to a saddle point of f . Lastly, for any bounded sets By a proper choice of F and G any saddle point of f will also give a minimizer of NNLAD. We denote this proper choice in the next lemma. Then F, G, F * , G * are proper, convex and lower semicontinuous. F * and G * are given by Proof. From the definition it is clear that F, G are proper, convex and lower semicontinuous. Hence F * and G * are also proper, convex and lower semicontinuous. By a direct calculation we have For the other fenchel conjugate we calculate where in the last step we used that each summand depends on exactly one component of w * . Now w m w * − |w * | is larger for sgn (w * ) = sgn (w m ), than for sgn (w * ) = sgn (w m ). Hence, we can restrict the supremum to the case sgn (w * ) = sgn (w m ) and obtain Since F is proper, convex and lower semicontinuous, we have F * * = F . Thus, And lastly we have which finishes the proof. For the sake of completeness we added a proof. and hence σ −1 (w ′ m − w m ) ∈ [−1, 1]. It follows that zero is a possible subgradient, i.e. the subdifferential contains zero. Hence, w ′ is the unique minimizer. To prove (9), we apply the first statement to calculate . 
It follows that, by Lemma 8.2, F, G, F*, G* are proper, convex and lower semicontinuous. Thus, the requirements of Theorem 8.1 are fulfilled, which yields that f has a saddle point and thus the duality gap is zero. By Lemma 8.2 and the fact that the duality gap is zero, it follows that x^# ∈ argmin_{z≥0} ‖Az − y‖_1 and w^# ∈ argmax_{w∈R^M : A^T w ≥ 0, ‖w‖_∞ ≤ 1} −⟨w, y⟩. If (x′, w′) are any points with x′ ≥ 0, ‖w′‖_∞ ≤ 1 and A^T w′ ≥ 0, then we have by Lemma 8.2 that ‖Ax′ − y‖_1 + ⟨y, w′⟩ = sup_{w∈R^M} f(x′, w) − inf_{x∈R^N} f(x, w′). If this is non-positive, (7) and (10) yield that x′ is a minimizer of the NNLAD. Hence, it holds true that

‖Ax′ − y‖_1 + ⟨y, w′⟩ ≤ 0 and x′ ≥ 0 and ‖w′‖_∞ ≤ 1 and A^T w′ ≥ 0 ⇒ x′ is a minimizer of the NNLAD. (11)

Lastly, by Lemma 8.3 the iterates calculated in (iter 1), (iter 2) and (iter 3) are exactly the iterates calculated in (PP 1), (PP 2) and (PP 3), respectively. We will now prove all statements.

By Theorem 8.1 the sequence (x^k, w^k) converges to some saddle point (x^#, w^#). Hence, x^k converges to x^#, which is a minimizer of the NNLAD by (10). Since any sequence of averages converges to the same value as the original sequence, statement (1) follows.

Since x^k and w^k are in the image of the proximal point operator of G and F* respectively, they need to obey G(x^k) < ∞ and F*(w^k) < ∞. Lemma 8.2 yields x^k ≥ 0 and ‖w^k‖_∞ ≤ 1. By convexity we obtain also x̄^k ≥ 0 and ‖w̄^k‖_∞ ≤ 1. Statement (2) is proven.

By Theorem 8.1 the sequence (x^k, w^k) converges to some saddle point (x^#, w^#). By taking the limit, statement (2) yields x^# ≥ 0 and ‖w^#‖_∞ ≤ 1. The saddle point property and Lemma 8.2 imply that G*(−A^T w^#) < ∞. By Lemma 8.2 again, this is only possible if w^# is feasible, i.e. A^T w^# ≥ 0. Hence, lim_{k→∞} A^T w^k ≥ 0 follows. By Lemma 8.2 and the feasibility of x^# and w^# we have lim_{k→∞} ( ‖Ax^k − y‖_1 + ⟨y, w^k⟩ ) = ‖Ax^# − y‖_1 + ⟨y, w^#⟩ = sup_{w∈R^M} f(x^#, w) − inf_{x∈R^N} f(x, w^#), which is zero, since (x^#, w^#) is a saddle point. This yields the convergence in statement (3). Since any sequence of averages converges to the same value as the original sequence, we also get the convergence of statement (4). The "in particular" part of statements (3) and (4) follows from (11) and statement (2). Hence, statements (3) and (4) are proven.

To prove the remaining statement (5) we choose B_1 := {x^#} and B_2 := {w : ‖w‖_∞ ≤ 1}. The bound of Theorem 8.1 becomes (1/k) ( (1/(2σ)) ‖x^# − x^0‖_2^2 + (1/(2τ)) ( ‖w^0‖_2^2 + 2‖w^0‖_1 + M ) ), where we used sup_{‖w‖_∞ ≤ 1} ‖w − w^0‖_2^2 ≤ ‖w^0‖_2^2 + 2‖w^0‖_1 + M. By using Lemma 8.2 and the feasibility of x^# we get the corresponding bound on the objective value. Now let w̃ be a maximizer of sup_{w∈R^M} f(x̄^k, w). By statement (2) we get G(x̄^k) = 0 and thus w̃ is also a minimizer of the convex function w ↦ −⟨Ax̄^k, w⟩ + F*(w). Hence, the subdifferential of this function needs to contain zero at w̃. Since ∂F*(w) = ∅ whenever ‖w‖_∞ > 1, we get ‖w̃‖_∞ ≤ 1. This together with the feasibility of x̄^k yields the required estimate. Combining (13), (14) and (15) yields the bound with the constant (1/(2σ)) ‖x^# − x^0‖_2^2 + (1/(2τ)) ( ‖w^0‖_2^2 + 2‖w^0‖_1 + M ) and finishes the proof.

We want to remark that the other feasibility assumptions A^T w^k ≥ 0 and A^T w̄^k ≥ 0 do not need to hold.

References
Group testing: An information theory perspective. Foundations and Trends in Communications and Information Theory
The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than 1/k^2
On the uniqueness of non-negative sparse & redundant representations
Subgradient Methods, Notes for EE364b
A fast iterative shrinkage-thresholding algorithm for linear inverse problems
A first-order primal-dual algorithm for convex problems with applications to imaging
Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information
Compressed sensing
The detection of defective members of large populations
Sparse nonnegative solution of underdetermined linear equations by linear programming
Counting the faces of randomly-projected hypercubes and orthants, with applications. Discrete and Computational Geometry
A Mathematical Introduction to Compressive Sensing. Birkhäuser Basel
CVX: Matlab software for disciplined convex programming
Statistical Learning with Sparsity: The Lasso and Generalizations
Robust recovery of sparse nonnegative weights from mixtures of positive-semidefinite matrices
Understanding and Enhancing Data Recovery Algorithms. Dissertation, Technische Universität München, München
Robust nonnegative sparse recovery and the nullspace property of 0/1 measurements
Stable low-rank matrix recovery via null space properties
Super-resolution of positive sources: The discrete setup
Graph implementations for nonsmooth convex programs
Introductory Lectures on Convex Optimization - A Basic Course
Robust instance-optimal recovery of sparse signals at unknown noise levels
Introduction to optimization. Translations series in mathematics and engineering
Convex analysis. Princeton Mathematical Series
Sparse recovery by thresholded non-negative least squares
Non-negative least squares for high-dimensional linear models: Consistency and sparse recovery without regularization
Sparse non-negative recovery from biased subgaussian measurements using NNLS. CoRR
Probing the Pareto frontier for basis pursuit solutions
SPGL1: A solver for large-scale sparse reconstruction
A unique "nonnegative" solution to an underdetermined system: From vectors to matrices

Acknowledgments. The work was partially supported by DAAD grant 57417688. PJ has been supported by DFG grant JU 2795/3. BB has been supported by BMBF through the German Research Chair at AIMS, administered by the Humboldt Foundation.

Proof. The proximal point operator of an indicator function of a closed, convex set is always the projection onto the set, hence the identity for G follows. For F this is more difficult. Note that w′ is a minimizer if and only if zero is in the subdifferential at w′. Since the minimizer of the proximal operator is always unique, it remains to verify that zero is in the subdifferential at the vector from the statement. So let w′ be the vector from the right hand side of (8) and m ∈ [M]. If w_m − y_m > σ, then w′_m = w_m − σ > y_m and thus the corresponding component of the subdifferential contains zero. If w_m − y_m < −σ, then w′_m = w_m + σ < y_m and thus the same holds. If |w_m − y_m| ≤ σ, we have w′_m = y_m and