Crackovid: Optimizing Group Testing
Louis Abraham, Gary Bécigneul, Bernhard Schölkopf
May 13, 2020

Abstract. We study the problem usually referred to as group testing in the context of COVID-19. Given $n$ samples taken from patients, how should we select mixtures of samples to be tested, so as to maximize information and minimize the number of tests? We consider both adaptive and non-adaptive strategies, and take a Bayesian approach with a prior both for infection of patients and for test errors. We start by proposing a mathematically principled objective, grounded in information theory. We then optimize non-adaptive strategies using genetic algorithms, and leverage the mathematical framework of adaptive sub-modularity to obtain theoretical guarantees for the greedy-adaptive method.

Lacking effective treatments or vaccinations, the most effective way to save lives in an ongoing epidemic is to mitigate and control its spread. This can be done by testing and isolating positive cases early enough to prevent subsequent infections. If done sufficiently regularly and for a sufficiently large fraction of individuals at risk, this has the potential to prevent a large fraction of the infections a positive case would normally cause. However, a number of factors, such as limits on material resources as well as on work force, necessitate economical and efficient use of test resources. Group testing aims to improve properties of tests by testing groups of items simultaneously. We wish to leverage this framework to improve COVID-19 testing.

One needs to differentiate between two different settings: adaptive and non-adaptive. In the former, tests can be decided one at a time, taking into account previous test results. In the latter, one has to select all tests before seeing any lab result and can thus run them in parallel. One can also imagine a semi-adaptive setting in which tests are selected in small batches between each lab evaluation. A simple example of a semi-adaptive group test is to first split n samples into g groups of (roughly) equal size, pool the samples within the groups and perform g tests on the pooled samples. All samples in negatively tested pools are marked as negative, and all samples in positively tested pools are subsequently tested individually. This strategy has recently been validated for COVID-19 PCR tests [Schmidt et al., 2020].

Non-adaptive group testing. Most of the existing research on non-adaptive group testing is concerned with identifying at most k positive samples amongst n items, which is referred to as non-adaptive hypergeometric group testing [Hwang and Sós, 1987]. This assumption yields asymptotic bounds on the number of tests needed to recover the ground truth [Knill et al., 1998, Indyk et al., 2010, Cheraghchi et al., 2012, Chan et al., 2014]. However, these are of limited practical relevance when constructive results on small numbers of samples are required.

A different problem formulation. We formulate the problem differently: given n people and m testing kits, the characteristics of the test and prior probabilities for each person to be sick, we seek to optimize the way the tests are used by combining several samples. For simplicity, samples are assumed to be independent. [Mazumdar, 2016] considers these assumptions as well, for the non-adaptive setting. However, his results are also asymptotic, i.e., valid for large n.

Adaptive group testing.
In the adaptive setting, subsequent test designs may take into account earlier test results. By leveraging the framework of adaptive sub-modularity, initially developed for sensor covering by [Golovin and Krause, 2011], we prove near-optimality of the greedy-adaptive strategy.

Our contributions are as follows:
• We provide a mathematically grounded objective function to optimize when designing a strategy, leveraging information theory.
• We implement different strategies, both non-adaptive and adaptive, which can readily be used in a web browser to know (i) which tests to run and (ii) how to interpret the outcome.
• We provide a guarantee of near-optimality of the greedy-adaptive strategy, based on the mathematical objective we proposed.
• Software is available at https://louisabraham.github.io/crackovid/crackovid.html.

Intuition. Our mathematical objective is designed such that the mixture tests it proposes to run in the lab will maximize the amount of information we gain on the ground truth once their lab results are revealed − in expectation, over the randomness of both imperfect tests and prior probabilities of infection per individual.

Notations are progressively introduced throughout, but are also all gathered in Appendix A. Denote the number of patient samples by n, and the number of tests to run by m. Tests are assumed to be imperfect, with a true positive rate (or sensitivity) tpr and a true negative rate (or specificity) tnr. Patient sample i is infected with probability p_i ∈ [0, 1] and we assume statistical independence of infection of patient samples. Denoting by a '1' a positive result (infection), the unknown ground truth describing who is infected and who is not is a vector of size n made up of '0's and '1's: we call this the secret, denoted as s ∈ {0, 1}^n. A design of a test d ∈ {0, 1}^n to run in the lab is a subset of patient samples to mix together into the same sample, where d_i = 1 if patient sample i is mixed into design d and d_i = 0 otherwise. Note that the outcome of a perfect design d for a given secret s can simply be obtained as

    T(s, d) = 1_{⟨d, s⟩ > 0},

i.e., a test result is positive if there is at least one patient i for which d_i = 1 (i is included in the sample) and s_i = 1 (i is infected).

Recall that the secret s is unknown. However, since we assume initially that patient sample i is infected with probability p_i and that patient samples are independent, this yields a prior probability distribution over the possible values of s. We hence represent the random value of s as a random variable (r.v.), denoted by S, with probability distribution p_S(s) := Pr[S = s] over {0, 1}^n.

Let us now recall the definition of the entropy of our random variable, representing the amount of uncertainty that we have about its outcome, measured in bits:

    H(S) := − Σ_{s ∈ {0,1}^n} p_S(s) log_2 p_S(s).

It is maximized when S follows a uniform distribution, and minimized when S constantly outputs the same value. As we perform tests, we gain additional knowledge about S. For instance, if a test pools all samples and returns a negative result, then our posterior probability that all patients are healthy goes up, i.e., p_S((0, . . . , 0)) increases, governed by Bayes' rule of probability theory. More generally, we may perform a sequence of tests of varying composition, updating our posterior after each test. Our goal will be to select designs of tests so as to minimize entropy, resulting in the least amount of uncertainty about the infection status of all individuals.
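To make these definitions concrete, here is a minimal Python sketch (illustrative only, not the authors' released tool) that enumerates the 2^n possible secrets, builds the independent prior p_S, and computes the entropy H(S) as well as the outcome of a perfect pooled test; the prior values in the example are arbitrary.

```python
from itertools import product
from math import log2

def prior(p):
    """Independent prior p_S over all 2^n secrets, where p[i] = Pr[s_i = 1]."""
    n = len(p)
    dist = {}
    for s in product((0, 1), repeat=n):
        pr = 1.0
        for si, pi in zip(s, p):
            pr *= pi if si else 1 - pi
        dist[s] = pr
    return dist

def entropy(dist):
    """H(S) in bits: uncertainty of the secret under distribution dist."""
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

def perfect_test(s, d):
    """Outcome of a perfect pooled test: positive iff <d, s> > 0."""
    return int(any(di and si for di, si in zip(d, s)))

# Example: 3 patients, each infected with probability 0.05 a priori.
p_S = prior([0.05, 0.05, 0.05])
print(entropy(p_S))  # prior uncertainty, in bits
```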
Note that since tests are imperfect, for a given pool design d ∈ {0, 1}^n and a given secret s ∈ {0, 1}^n, the Boolean outcome T(s, d) of the test in the lab is not deterministic. If tests were perfect, we would have T(s, d) = 1_{⟨d, s⟩ > 0}. To allow for imperfect tests, we model T(s, d) as a r.v. whose distribution is described by

    Pr[T(s, d) = 1] = tpr        if ⟨d, s⟩ > 0,
    Pr[T(s, d) = 1] = 1 − tnr    if ⟨d, s⟩ = 0.

Since the secret s is also unknown (and described by the r.v. S), the outcome T(S, d) now has two sources of randomness: imperfection of tests and the unknown secret.

In practice, one will not run one test but multiple tests. We now suppose that m tests of pool designs are run and let their designs be represented as a multiset D ∈ ({0, 1}^n)^m. This leads us to the following question: given an initial prior probability distribution p_S over the secret, how should we select the pool designs to test in the lab? We want to select them such that once we have their outcomes, we have as much information as possible about S, i.e., the entropy (uncertainty) of S has been minimized. Since we cannot know in advance the outcome of the tests, we have to minimize this quantity in expectation over the randomness coming from both the imperfect tests and the unknown secret. This requires the notion of conditional entropy.

Conditional Entropy. Given pool designs D, we consider two random variables: S (secret) and T := T(S, D) (test results). The conditional entropy of S given T is given by:

    H(S | T) := − Σ_{s ∈ {0,1}^n} Σ_{t ∈ {0,1}^m} Pr[S = s, T = t] log_2 p_{S|T=t}(s).

In this formula, the joint probability Pr[S = s, T = t] is computed with the conditional probability formula Pr[S = s, T = t] = Pr[S = s] Pr[T = t | S = s], and the posterior distribution is computed using Bayesian updating, i.e.,

    p_{S|T=t}(s) := Pr[S = s] Pr[T = t | S = s] / Pr[T = t],

where Pr[T = t] = Σ_s Pr[S = s] Pr[T = t | S = s]. It represents the amount of information (measured in bits) needed to describe the outcome of S, given that the result of T is known.

Mutual Information. Equivalently, one can define the mutual information between S and T as:

    I(S; T) := H(S) − H(S | T).

It quantifies the amount of information obtained about S by observing T.

A well-motivated criterion for test selection. Since H(S) does not depend on D, selecting the pool designs D minimizing the conditional entropy of S given the outcome of T(S, D) is equivalent to selecting those maximizing the mutual information between S and T(S, D). We now have a clear criterion for selecting D:

    D* ∈ argmax_D I(S; T(S, D)).    (5)

This criterion selects the pool designs D whose outcome will maximize our information about S.

Expected Confidence. We report another evaluation metric of interest called the expected confidence. It is the mean average precision of the maximum likelihood outcome. The maximum likelihood outcome is defined by:

    ML(t) := argmax_{s ∈ {0,1}^n} p_{S|T=t}(s),

which yields the following definition of Expected Confidence:

    Confidence(S, T) := Σ_{t ∈ {0,1}^m} Pr[T = t] p_{S|T=t}(ML(t)).

ML is of particular practical interest: given test results t, a physician wants to make a prediction. In this case, it makes sense to use the maximum likelihood predictor. The interpretation of Confidence is straightforward: it is the probability that the prediction is correct (across all possible secrets).

Updating the priors. Both scoring functions described above compute an expectation, over the test results, of a score on the posterior distribution p_{S|T=t}(s). After observing the test results, we are able to replace the prior distribution p_S by the posterior. By the rules of Bayesian computation, this update operation is commutative, i.e., the order in which designs d_1 and d_2 are tested does not matter, and compositional in the sense that we can test {d_1, d_2} simultaneously with the same results. Thus, we can decompose those steps and make different choices as we run tests (see the adaptive method below).
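For small n and m, the quantities above can be computed by brute force. The sketch below extends the previous one (reusing its `prior` function); it evaluates the test likelihood Pr[T = t | S = s], the mutual information I(S; T(S, D)) and the expected confidence of a multiset of designs D. Treating the errors of distinct tests as conditionally independent given the secret is our modelling assumption for this illustration.

```python
from itertools import product
from math import log2

def likelihood(t, s, D, tpr, tnr):
    """Pr[T = t | S = s] for a multiset of designs D, assuming independent test errors."""
    pr = 1.0
    for ti, d in zip(t, D):
        p_pos = tpr if any(di and si for di, si in zip(d, s)) else 1 - tnr
        pr *= p_pos if ti else 1 - p_pos
    return pr

def scores(p_S, D, tpr, tnr):
    """Return (I(S; T(S, D)), expected confidence) for designs D under prior p_S."""
    m = len(D)
    H_prior = -sum(pr * log2(pr) for pr in p_S.values() if pr > 0)
    H_cond, conf = 0.0, 0.0
    for t in product((0, 1), repeat=m):
        joint = {s: p_S[s] * likelihood(t, s, D, tpr, tnr) for s in p_S}
        p_t = sum(joint.values())
        if p_t == 0:
            continue
        post = {s: pr / p_t for s, pr in joint.items()}
        H_cond += p_t * -sum(pr * log2(pr) for pr in post.values() if pr > 0)
        conf += p_t * max(post.values())   # posterior mass of the ML outcome
    return H_prior - H_cond, conf

# Example: 3 patients, 2 tests, each test pooling a pair of samples.
p_S = prior([0.05, 0.05, 0.05])            # `prior` from the previous sketch
D = [(1, 1, 0), (0, 1, 1)]
print(scores(p_S, D, tpr=0.95, tnr=0.99))
```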
Remark. Compare two outcomes: (1) the posterior is concentrated on two secrets s ∈ {0, 1}^n, with probabilities 0.6 and 0.4; (2) it is concentrated on three secrets, with probabilities 0.6, 0.2 and 0.2. The entropy of (2) is higher, but maybe we still prefer (2), since the margin to the second-best explanation is larger? This is one motivation for also reporting the expected confidence.

Foreword. Non-adaptive methods have the benefit over their adaptive counterparts that they can be run in parallel. On the flip side, they describe a strictly more restrictive class of algorithms, since any non-adaptive method is an adaptive one ignoring the information obtained adaptively. Moreover, non-asymptotic (i.e., for small values of n) performance guarantees are harder to obtain than for adaptive methods.

Given numbers n and m, test characteristics tpr and tnr, as well as prior probabilities of sample infection p_i, the best multiset D of m pool designs is the one maximizing some score, like I(S; T(S, D)) or Confidence(S, T(S, D)). The tests are order insensitive, which gives a search space of cardinality (2^n + m − 1 choose m), the number of multisets of m designs among the 2^n possible ones. Evaluating the score of every multiset separately takes O(2^{n+m}) operations. Hence, brute-forcing this search space is prohibitive even for small values of n and m. We resort to randomized algorithms to find a good enough solution. Our approach is to use Evolution Strategies. We apply a variant of the (1 + λ) ES with optimal restarts [Luby et al., 1993] to optimize any objective function over individuals (multisets of tests).

Detailed Description. We maintain a population of 1 individual between steps. At every step of the ES, we mutate it into λ ∈ ℕ⁺ offspring. In the standard (1 + λ) ES, each offspring is mutated from the population, whereas our offspring are mutated iteratively, each one being a mutation of the previous one. These offspring are added to the population, and the best element of the population is selected as the next generation. We initialize our population with the "zero" individual that doesn't test anybody. Our mutation step is straightforward: flipping one bit d_i of one pool design d, both chosen uniformly at random. Our iterative mutation scheme allows us to step out of local optima. After choosing a basis b proportional to n × m (which is approximately the logarithm of the size of our search space), we apply restarts according to the Luby sequence:

    L_i = 2^{k−1}               if i = 2^k − 1 for some integer k,
    L_i = L_{i − 2^{k−1} + 1}   if 2^{k−1} ≤ i < 2^k − 1,          (10)

i.e., 1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, ..., the i-th run of the ES being allotted b · L_i steps before restarting. This sequence of restarts is optimal for Las Vegas algorithms [Luby et al., 1993], and our ES can be viewed as such under two conditions: (i) that the population never be stuck in a local optimum, which can be achieved in our algorithm using λ = n × m (note that much smaller constant values are used in practice); (ii) the second condition is purely conceptual and consists in defining a success as having a score larger than some (unknown) threshold. The fact that our algorithm does not use this threshold as an input yields the following result:

Theorem 1. Under condition (i), the evolutionary strategy using the sequence of restarts illustrated in Eq. (10) yields a Las Vegas algorithm that restarts optimally (as defined by [Luby et al., 1993]) to achieve any target score threshold.

Future improvements would consist in (i) initializing the population with random pool designs (following randomized testing methods from the literature), (ii) designing new mutation rules, and (iii) using a dynamic restart strategy based on the detection of lack of progress.
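Below is a hedged Python sketch of the procedure just described: the Luby sequence and a (1 + λ) ES with iterative mutations and Luby restarts. Names and default parameters (`base`, `max_restarts`, `lam`) are illustrative choices, not those of the released implementation.

```python
import random

def luby(i):
    """i-th term (1-indexed) of the Luby sequence: 1, 1, 2, 1, 1, 2, 4, 1, 1, 2, ..."""
    k = 1
    while 2 ** k - 1 < i:
        k += 1
    if i == 2 ** k - 1:
        return 2 ** (k - 1)
    return luby(i - 2 ** (k - 1) + 1)

def es_search(n, m, score, base, max_restarts=20, lam=8, seed=0):
    """(1 + lambda) ES with iterative mutations and Luby restarts, over multisets of
    m pool designs (tuples of n bits). `score` is the objective to maximize."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for r in range(1, max_restarts + 1):
        # Each run starts from the "zero" individual that tests nobody.
        indiv = tuple(tuple(0 for _ in range(n)) for _ in range(m))
        indiv_score = score(indiv)
        for _ in range(base * luby(r)):                 # run length set by the Luby sequence
            parent = indiv
            best_child, best_child_score = None, float("-inf")
            for _ in range(lam):                        # offspring are mutated iteratively
                child = [list(d) for d in parent]
                child[rng.randrange(m)][rng.randrange(n)] ^= 1   # flip one bit of one design
                child = tuple(tuple(d) for d in child)
                child_score = score(child)
                if child_score > best_child_score:
                    best_child, best_child_score = child, child_score
                parent = child                          # next offspring mutates this one
            if best_child_score > indiv_score:          # (1 + lambda) selection
                indiv, indiv_score = best_child, best_child_score
        if indiv_score > best_score:
            best, best_score = indiv, indiv_score
    return best, best_score
```

For instance, `score = lambda D: scores(p_S, list(D), tpr=0.95, tnr=0.99)[0]` (using the earlier `scores` snippet) searches for m designs maximizing the mutual information under the prior p_S.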
Foreword. Adaptive methods have the clear advantage over their non-adaptive counterparts of being more efficient in the number of tests, although they require waiting for the lab results after each sequential test run. Although searching the space of all possible adaptive strategies would yield a prohibitive complexity of Ω(2^{2^m}), it turns out that a simple adaptive strategy can yield provably near-optimal results.

Detailed Description. We describe our adaptive method as Algorithm 1, which greedily optimizes the criterion defined in Eq. (5): at each step, it selects the pool design with the largest expected information gain under the current posterior, observes the lab result, and updates the posterior accordingly. Leveraging the framework of adaptive sub-modularity [Golovin and Krause, 2011], and the fact that the criterion defined by Eq. (5) is adaptive sub-modular and adaptive monotone, Algorithm 1 has the guarantee below. We prove it in Appendix B.1.

Theorem 2. Denote by 'Algo' the expected value of our objective achieved by Algorithm 1, the expectation being taken over all 2^m outcomes of lab results, and denote by 'Optimal' the best (unknown) adaptive strategy. If we run Algorithm 1 for m_1 tests and Optimal for m_2 tests, we have:

    Algo(m_1) ≥ (1 − e^{−m_1/(α m_2)}) · Optimal(m_2),

where α is defined as follows: assume that our priors p_i are wrong, in the sense that there exist constants c, d with c·p_i ≤ p′_i ≤ d·p_i for i ∈ {1, ..., n}, with c ≤ 1 and d ≥ 1, where p′_i denotes the true prior; we set α := d/c.

Remarks. Theorem 2 states that Algorithm 1 is (i) robust to wrong priors and (ii) near-optimal in the sense that the ratio of its performance to that of the optimal strategy goes to 1 exponentially fast in the ratio of the numbers of tests run by each algorithm. For α = 1 and m_1 = m_2, this yields 1 − 1/e ≃ 0.63.

Related work. There exists recent work applying group testing to COVID-19: [Seifried and Ciesek, 2020] report using mini-pools of patient samples of size 5, yielding no error in the prediction of healthy patients over a set of 50 patient samples; [Yelin et al., 2020] report the possibility of mixing up to 32 patient samples together, with a false positive rate of 10%; [Jeffay, 2020] made a press release about the possibility of reliably testing mixtures including as many as 64 patient samples; [Kadri, 2020] mentions simple mathematical methods to approach the problem; [Deleforge, 2020] published a blog post with appealing illustrations popularizing the mathematics of group testing.

Discussion. We have discussed some interesting special cases using relatively restrictive assumptions. While theoretical results often require such assumptions, our implementation can in principle be generalized in various directions. We have used the number of tests and samples as given, and then optimized a conditional entropy. However, from a practical point of view, other quantities are relevant and may need to be included in the objective, e.g., the expectation (over a population) of the waiting time before an individual is "cleared" as negative (and can then go to work, visit a nursing home, or perform other actions which may require a confirmation of non-infectiousness).

Semi-adaptive tests. Instead of performing m consecutive tests, one could do them in k batches of respective sizes m_1, ..., m_k satisfying m_1 + ... + m_k = m. Adaptivity over the sequence of length k could be handled greedily as in Algorithm 1, except that instead of selecting a single pool design d*, we would select m_i designs at the i-th step. We name this semi-adaptive algorithm the k-greedy strategy.
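As an illustration of the greedy step underlying Algorithm 1 (and, with minor changes, the k-greedy strategy), here is a sketch that at each round scores every candidate design by its expected information gain under the current posterior, queries the lab, and performs the Bayesian update. It reuses `scores` and `likelihood` from the earlier snippets; enumerating all 2^n − 1 candidate designs is only meant for small n, and the actual implementation may restrict or sample this candidate set.

```python
from itertools import product

def greedy_adaptive(p_S, n, m, tpr, tnr, run_lab_test):
    """Greedy adaptive strategy (illustrative): at each step, pick the single design with
    the largest expected information gain under the current posterior, then update on
    the observed lab result. `run_lab_test(design)` returns the observed 0/1 outcome."""
    candidates = [d for d in product((0, 1), repeat=n) if any(d)]  # brute force, small n only
    posterior = dict(p_S)
    history = []
    for _ in range(m):
        # `scores` (defined earlier) returns (mutual information, expected confidence).
        d_star = max(candidates, key=lambda d: scores(posterior, [d], tpr, tnr)[0])
        outcome = run_lab_test(d_star)
        # Bayesian update of the posterior given the observed outcome of d_star.
        joint = {s: pr * likelihood((outcome,), s, [d_star], tpr, tnr)
                 for s, pr in posterior.items()}
        z = sum(joint.values())
        posterior = {s: pr / z for s, pr in joint.items()}
        history.append((d_star, outcome))
    return posterior, history
```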
Pool size optimization. One could analyze the simple setting of [Schmidt et al., 2020], i.e., what is the best pool size given an average prior probability. "Best" here could mean something different than in our setting. One might want to take into account the number of test kits used as well as the number of people who are "cleared" as negative after a given time duration.

Pool allocation. How do we allocate people to the (first-round) pool designs in this setting if prior probabilities are non-uniform? This will naturally arise if we apply group testing as per [Schmidt et al., 2020] every day for part of the workforce of a company, in which case a negative result for a person on a given day would imply a low prior on the next day. One could also wonder how to allocate people to the pool designs in the presence of correlations between patient samples. One initial guess could be to place correlated people into the same pool in order to minimize the expected number of what one might call false pool alarms, i.e., cases where a person ends up in a positively tested pool even though they are personally negative.

Further practical considerations. A good practical strategy could be to perform one round of pooled tests on disjoint groups every morning as individuals arrive at work, to be evaluated during work hours. Those who are in a positive group (adaptively) get assigned to a second pool design tested later, which can consist of a non-adaptive combination of multiple designs, tested overnight. They receive the result in the morning before they go to work, and if individually positive, they enter quarantine. If the test is so sensitive that it detects infections even before individuals become contagious (which may be the case for PCR tests), such a strategy could avoid most infections at work.

Dependencies between tpr, tnr and pool size. The reliability of tests may vary with pool size. In our notation, the outcome of a test is a random variable that need not only depend on whether at least one pooled person is sick (1_{⟨d, s⟩ > 0}): it may also depend on the number of tested people |d| and the number of sick people ⟨d, s⟩; it could even assign different values of tpr and tnr to different people. The tpr may in practice be an increasing function of the proportion of sick people ⟨d, s⟩/|d|.

Estimating prior infection probabilities. Currently, we start with a factorized prior of infection that not only assumes independence between the tested patients but is also oblivious to individual characteristics. We could, however, build a simple ML system that estimates the prior probabilities based on a set of features such as: job, number of people living in the same household, number of children, location of home, movement or contact data, etc. (subject to privacy considerations). Those prior probabilities can then readily be used by our approach to optimize the pool designs, and the ML system can gradually be improved as we gather more test results.

Prevalence estimation. Similar methods can be applied to the question of estimating prevalence. Note that this is an easier problem in the sense that we need not necessarily estimate which individuals are positive, but only how many.

Ethical considerations. We identify two families of concerns to address. The first family concerns the accuracy of the tests. Indeed, when the numbers of tests and patients are equal, it is natural to compare the tpr/tnr of the individual test to the tpr/tnr of the individual results in our grouped test framework (obtained by marginalizing the posterior distribution). In some situations with unbalanced priors, the marginal tpr/tnr of some people in the group could be lower than the test tpr/tnr, even if the test will be more successful overall.
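For reference, the marginal result for person i mentioned above is obtained by summing the posterior over all secrets with s_i = 1; a one-function sketch, applicable to any posterior dictionary produced by the earlier snippets:

```python
def marginals(posterior):
    """Per-person infection probabilities obtained by marginalizing a posterior over secrets."""
    n = len(next(iter(posterior)))
    return [sum(pr for s, pr in posterior.items() if s[i] == 1) for i in range(n)]
```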
However, reporting the marginal individual results gives doctors a tool to decide whether further testing is needed; even so, we cannot rule out that individuals might be worse off by being tested in a group. The second family of concerns, directly resulting from the first, is the responsibility of the doctor when assigning people to batches and giving them prior probabilities (using another model). The assignment of people to batches should be dealt with in a future extension of our framework, while the sensitivity of our protocol to priors should be studied in more depth. In particular, the adaptive framework is more robust with respect to the choice of priors than the non-adaptive one.

Acknowledgments. Gary Bécigneul is funded by the Max Planck ETH Center for Learning Systems.

References
[Chan et al., 2014] Nonadaptive group testing: Explicit bounds and novel algorithms.
[Cheraghchi et al., 2012] Graph-constrained group testing.
[Cover and Thomas, 2012] Elements of information theory.
[Deleforge, 2020] The maths of group testing: Mixing samples to speed up COVID-19 detection.
[Golovin and Krause, 2011] Adaptive submodularity: Theory and applications in active learning and stochastic optimization.
[Guestrin et al., 2005] Near-optimal sensor placements in Gaussian processes.
[Hwang and Sós, 1987] Non-adaptive hypergeometric group testing.
[Indyk et al., 2010] Efficiently decodable non-adaptive group testing.
[Jeffay, 2020] To ease global virus test bottleneck, Israeli scientists suggest pooling samples.
[Kadri, 2020] Enhancing the number of lab tests with a 'poisoned wine' approach.
[Knill et al., 1998] Non-adaptive group testing in the presence of errors.
[Luby et al., 1993] Optimal speedup of Las Vegas algorithms.
[Mazumdar, 2016] Nonadaptive group testing with random set of defectives.
[Schmidt et al., 2020] FACT − Frankfurt adjusted COVID-19 testing − a novel method enables high-throughput SARS-CoV-2 screening without loss of sensitivity.
[Seifried and Ciesek, 2020] Pool testing of SARS-CoV-2 samples increases worldwide test capacities many times over.
[Yelin et al., 2020] Evaluation of COVID-19 RT-qPCR test.

A. Notations
We use upper case letters exclusively for random variables (r.v.), except for mutual information I and entropy H.
• n: number of patient samples
• m: number of tests to run in the lab
• s ∈ {0, 1}^n: the secret to unveil, with s_i = 1 if and only if patient sample i is positive
• S: r.v. over possible values of s, whose law describes the current information we have about s
• d ∈ {0, 1}^n: a pool design, with d_i = 1 if and only if patient sample i belongs to pool design d
• D ∈ ({0, 1}^n)^m: multiset describing the pool designs output by the strategy
• t ∈ {0, 1}^m: lab result of a list of m tests
• T: r.v. over possible values of t describing lab results
• tpr: true positive rate, sensitivity, hit rate, detection rate, recall
• tnr: true negative rate, specificity, correct rejection rate, selectivity
• Pr[A]: probability of event A

B.1 Proof that the criterion defined in Eq. (5) is both adaptive monotone and adaptive sub-modular
Direct respective correspondence between our notations and those of [Golovin and Krause, 2011]:
• set D of selected designs : set E(π, Φ) of items selected by policy π;
• H(p_{S|T=t}) : f(E(π, Φ), Φ);
• H(S | T) : f_avg := E[f(E(π, Φ), Φ)].

This allows one to define, following Definition 1 of [Golovin and Krause, 2011], the conditional expected marginal benefit of a pool design d given results t as:

    ∆(d | t) := H(S | T = t) − E[ H(S | T = t, T(S, d)) ],    (12)

the expectation being over the outcome of T(S, d) given T = t. It represents the marginal gain of information obtained, in expectation, by observing the outcome of d at a given stage (this stage being defined by p_S, i.e., after having observed test results t). Adaptive monotonicity holds if ∆(d | t) ≥ 0 for any d and t. Adaptive sub-modularity holds if for any two sets of results t and t′ such that t is a sub-realization of t′ (i.e., there exist D and D′ such that T(S, D) = t, T(S, D′) = t′ and D ⊂ D′), and for any pool design d:

    ∆(d | t) ≥ ∆(d | t′).

The lemma below concludes the proof.

Lemma. With respect to ∆ defined in Eq. (12), adaptive monotonicity and adaptive sub-modularity both hold.

Proof. Adaptive monotonicity is a consequence of the "information never hurts" bound H(X | Y) ≤ H(X) [Cover and Thomas, 2012]. Moreover, as mentioned in Lemma 1 of [Guestrin et al., 2005], sub-modularity also follows directly from this bound.

Using the program. In this section, we show how our program (available at https://louisabraham.github.io/crackovid/crackovid.html) can be used for practical applications.

Suppose a doctor has already made some pooled tests. Our program can compute the posterior distribution and report the most probable diagnosis, its confidence, as well as the marginal probability for each person to be sick. In our numerical example, tpr = 0.95, tnr = 0.99, and we test 3 persons with 3 tests. The i-th test is applied to everybody but the i-th person. We can then enter the results of each test; the input of our program is thus given by [results], a binary code representing the test observations.

If the results are 000, the program indicates that with very high probability nobody is sick:

    most probable diagnosis: 000
    confidence: 0.999963
    marginals: 1.23414e-05 1.23414e-05 1.23414e-05

If the results are 011, the program predicts that the first person is sick and the other two are healthy:

    most probable diagnosis: 100
    confidence: 0.973086
    marginals: 0.975488 0.00292 0.00292

Interestingly, we observe error correction when the results are 001 (an impossible outcome with perfect tests), as the program still infers that nobody is sick:

    most probable diagnosis: 000
    confidence: 0.955646
    marginals: 0.0221854 0.0221854 6.64093e-05

The second mode of the program, given prior probabilities of people to be sick and a number of tests, finds a pooling scheme optimizing the expectation of some score on the posterior probabilities, e.g., minimizing the entropy or maximizing the confidence. In fact, the pooling scheme we applied above is optimal. Suppose that instead of testing 3 people, we want to test 6 people. A possible return value of our randomized algorithm achieves an expected confidence of 0.937214. As 0.958704² = 0.919113 < 0.937214, grouping 6 people together gives much more accurate results than dividing them into two groups of 3 people.
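For completeness, outputs of this kind can be reproduced with the earlier snippets (`prior`, `likelihood`, `marginals`): the sketch below builds the leave-one-out design for 3 persons and reports the maximum likelihood diagnosis, its confidence and the marginals. The per-person prior used here (0.02) is a placeholder, since the prior behind the numbers above is not restated, so exact values may differ.

```python
def diagnose(results, p, tpr=0.95, tnr=0.99):
    """ML diagnosis, its confidence and per-person marginals for the leave-one-out
    design used above: test i pools everybody except person i."""
    n = len(p)
    D = [tuple(0 if j == i else 1 for j in range(n)) for i in range(n)]
    t = tuple(int(c) for c in results)
    p_S = prior(p)                                   # `prior` from the first sketch
    joint = {s: pr * likelihood(t, s, D, tpr, tnr) for s, pr in p_S.items()}
    z = sum(joint.values())
    posterior = {s: pr / z for s, pr in joint.items()}
    ml = max(posterior, key=posterior.get)
    return "".join(map(str, ml)), posterior[ml], marginals(posterior)

print(diagnose("011", p=[0.02, 0.02, 0.02]))   # placeholder prior: numbers above may differ
```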