key: cord-0633217-y52qqc42 authors: Bertolotti, Paolo; Jadbabaie, Ali title: Network Group Testing date: 2020-12-04 journal: nan DOI: nan sha: 2c602786c268803f5b6a39f306b43c4b36bacc98 doc_id: 633217 cord_uid: y52qqc42

We consider the problem of identifying infected individuals in a population of size N. We introduce a group testing approach that uses significantly fewer than N tests when infection prevalence is low. The most common approach to group testing, Dorfman testing, groups individuals randomly. However, as communicable diseases spread from individual to individual through underlying social networks, our approach utilizes network information to improve performance. Network grouping, which groups individuals by community, weakly dominates Dorfman testing in terms of the expected number of tests used. Network grouping's outperformance is determined by the strength of community structure in the network. When networks have strong community structure, network grouping achieves the lower bound for two-stage testing procedures. As an empirical example, we consider the scenario of a university testing its population for COVID-19. Using social network data from a Danish university, we demonstrate network grouping requires significantly fewer tests than Dorfman. In contrast to many proposed group testing approaches, network grouping is simple for practitioners to implement. In practice, individuals can be grouped by family unit, social group, or work group.

Group testing improves testing capabilities for infectious diseases when resources are limited. In normal scenarios, infected individuals from a population of size N are identified by testing all population members individually, which uses N tests. In the simplest form of group testing, individual samples are pooled together into groups of size n for an initial stage of testing. If a group tests negative, all individuals within the group are classified as negative for the disease.
If a group tests positive, all individual samples from the group are retested individually to identify the infected members. To illustrate the power of group testing, consider the scenario where N = 50 and one individual is infected. If individuals are pooled into groups of size n = 10 for an initial stage of testing, one group will test positive and all 10 samples from the group will be retested. The group testing approach uses 15 tests (5 pooled tests plus 10 individual retests) compared to the 50 used under individual testing.

Group testing was introduced by Dorfman (1943) to screen for syphilis in the US military. Dorfman's insight was simple but powerful. As a result, group testing has been employed numerous times in the medical field for diseases including influenza, chlamydia, and malaria (Van et al., 2012; Currie et al., 2004; Taylor et al., 2010). Within the US, group testing is used in blood banks and infertility prevention programs where large numbers of individuals are routinely tested (FDA, 2012; Bilder et al., 2010). Group testing's efficient use of resources has made it a valuable technique in developing areas. Notably, group testing was used during the early stages of the HIV pandemic in Africa when polymerase chain reaction (PCR) test costs were high (Emmanuel et al., 1988). By reducing testing costs and increasing access to diagnostic information, group testing plays an important role in increasing health equity.

Under Dorfman's approach, each individual's infection probability is treated as homogeneous and individuals are placed into groups randomly, which is equivalent to ignoring any information regarding an individual's susceptibility to infection. However, as communicable diseases spread from individual to individual through underlying social networks, an individual's network location affects their infection probability. In this work, we utilize network information to pool individuals for group testing.
Specifically, we group individuals by community as infections are more likely to spread between closely connected community members than between members of distinct communities. In order to analyze the performance of a network grouping strategy, we introduce a generative network model and epidemic model. We derive the number of tests used under network grouping and prove the expected number of tests is upper bounded by Dorfman testing, which implies network grouping weakly dominates Dorfman. The outperformance of network grouping is determined by the strength of community structure in the network. In networks with strong community structure, network grouping performs optimally and achieves the lower bound for two-stage testing procedures. In networks with no structure, network grouping is equivalent to Dorfman testing. We end by considering the scenario of a university testing its population for COVID-19 cases. Using social network data from a Danish university, we demonstrate network grouping outperforms Dorfman testing.

Our work reinforces the benefit of group testing for communicable diseases, which is consequential for the current COVID-19 pandemic. Multiple labs have demonstrated the efficacy of group testing for detecting the SARS-CoV-2 virus (Hogan et al., 2020; Yelin et al., 2020) and several countries have implemented group testing to increase their testing capabilities (FDA, 2020b; WSJ, 2020). As testing resources remain constrained (WSJ, 2021), we hope more institutions and governments will take advantage of the power of group testing.

Since Dorfman's work in 1943, numerous group testing approaches with strong performance have been introduced (Litvak et al., 1994; Cheraghchi et al., 2012; Ghosh et al., 2020). However, the complexity of the proposed methods has limited their adoption in the medical field. As a result, Dorfman testing remains the most common approach to group testing in practice (McMahan et al., 2012; FDA, 2020a).
Importantly, our proposed approach is simple for practitioners to implement. In practice, individuals can be grouped by family unit, social group, work group, or other community structure.

In this section, we describe Dorfman two-stage testing and the lower bound for two-stage group testing procedures. Under two-stage testing, a population of size N is split into N/n groups of size n for an initial stage of testing. Let G denote the number of positive groups after the initial stage. In the second stage of testing, all n samples from each positive group are retested individually. In total, N/n + nG tests are used.

Under Dorfman testing, one individual is infected with probability one and the remaining N − 1 individuals are infected independently with probability v. The expected number of infected individuals is therefore E[I_D] = 1 + (N − 1)v. The expected number of tests used under Dorfman testing is

$$E[T_D] = \frac{N}{n} + n\left[1 + \left(\frac{N}{n} - 1\right)\left(1 - (1 - v)^n\right)\right]. \tag{1}$$

The derivation of E[T_D] was provided by Dorfman and can be found in appendix A.1 for completeness. When the infection prevalence v is low, Dorfman testing uses significantly fewer than N tests in expectation. As an example, consider the scenario where N = 1000 and v = 0.05 (5%). If we employ Dorfman testing and a group size of n = 10, only 507 tests are needed in expectation to test the entire population, a reduction of nearly 50% compared to the N = 1000 tests used under individual testing.

Given a population, a certain number of infected individuals, and a group size, the minimum number of tests is achieved by minimizing the number of positive groups G. G is minimized by perfect grouping, in which all infected individuals are pooled together into the minimum possible number of groups. The lower bound for two-stage testing procedures when 1 + (N − 1)v individuals are infected is

$$T_{LB} = \frac{N}{n} + n \max\left\{1, \frac{1 + (N - 1)v}{n}\right\}. \tag{2}$$

The derivation can be found in appendix A.2. Revisiting our example, if N = 1000, v = 0.05, and n = 10, the minimum number of tests needed under two-stage group testing is 151.
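The two worked examples above can be reproduced numerically. The following is a minimal Python sketch, assuming the closed forms E[T_D] = N/n + n[1 + (N/n − 1)(1 − (1 − v)^n)] and T_LB = N/n + n·max{1, [1 + (N − 1)v]/n}, which follow from the derivations in appendices A.1 and A.2. The function names are illustrative and do not come from the paper.

```python
def dorfman_expected_tests(N: int, n: int, v: float) -> float:
    """Expected number of tests under Dorfman testing (assumed closed form)."""
    # Probability that a group not containing the guaranteed case tests positive.
    group_positive = 1 - (1 - v) ** n
    # N/n pooled tests, plus n retests for each positive group:
    # one group is positive for sure, the other N/n - 1 with prob. group_positive.
    return N / n + n * (1 + (N / n - 1) * group_positive)


def two_stage_lower_bound(N: int, n: int, v: float) -> float:
    """Lower bound for two-stage testing under perfect grouping (assumed form)."""
    expected_infected = 1 + (N - 1) * v
    # At least one group is always positive; otherwise the infected individuals
    # fill the minimum possible number of groups.
    return N / n + n * max(1.0, expected_infected / n)


if __name__ == "__main__":
    N, n, v = 1000, 10, 0.05
    print("Dorfman expected tests:", dorfman_expected_tests(N, n, v))
    print("Two-stage lower bound: ", two_stage_lower_bound(N, n, v))
```

Setting v = 0 recovers the introductory example: with N = 50 and n = 10, only the single guaranteed case is present and the count is 5 pooled tests plus 10 retests.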
The lower bound is unattainable in most scenarios as we do not know which samples are infected a priori.

In this work, we consider the population of N individuals to be embedded in a network, where each individual corresponds to a node and their physical interactions correspond to edges. In our framework, the network underlying the population is generated by a stochastic block model (SBM). Specifically, we consider an SBM with N nodes split into N/m communities of size m. Within each community of m nodes, edges exist between nodes independently with probability p. Edges exist between nodes in different communities independently with probability q, where q ≤ p. As a result, nodes are more likely to be connected to other nodes in the same community than to nodes in other communities.

For our epidemic model, we consider the initial stage of a branching process model. Specifically, an epidemic starts with a single infected seed node, which is chosen at random from the population. The seed node infects each of its neighbors independently with probability α. The seed node has m − 1 possible neighbors within its community, each connected with probability p, and N − m possible neighbors outside of its community, each connected with probability q. As a result, the expected number of infected individuals under this model, which we will use for network grouping, is

$$E[I_N] = 1 + \alpha\left[(m - 1)p + (N - m)q\right].$$

The epidemic model describes the initial stage of an outbreak or, alternatively, a super-spreader event. We set α such that the expected number of infected individuals in the epidemic model is equal to the expected number of infected individuals in the Dorfman setting. Setting E[I_N] = E[I_D] gives

$$\alpha = \frac{(N - 1)v}{(m - 1)p + (N - m)q}.$$

For the remainder of this work, we assume the following.

In this section, we introduce our main results regarding network grouping and its performance compared to Dorfman testing. Under network grouping, we group individuals by community.
In the simplest case, if communities have the same size as groups, m = n, each community is pooled into a unique group. If community size is divisible by group size, the m community members are pooled into m/n groups. If group size is divisible by community size, each group of size n consists of n/m communities. For example, if m = 20 and n = 10, each community is pooled into two groups and if m = 5 and n = 10, each group consists of two communities. When m is not divisible by n and n is not divisible by m, we keep communities intact as much as possible and remainder community members are pooled into the remaining groups. The expected number of tests used under network grouping is

$$E[T_{NG}] = \frac{N}{n} + n\left[1 + \left(\frac{m}{n} - 1\right)^{+} p' + \left(\frac{N}{n} - 1 - \left(\frac{m}{n} - 1\right)^{+}\right) q'\right], \tag{3}$$

where p' = 1 − (1 − pα)^n, q' = 1 − (1 − qα)^n, and (x)^+ = max(x, 0).

Theorem 1. The expected number of tests used under network grouping satisfies T_LB ≤ E[T_NG] ≤ E[T_D]. If q = p, then E[T_NG] = E[T_D]. If q = 0 and n ≥ m, then E[T_NG] = T_LB.

The proof of theorem 1 can be found in appendix A.4. Theorem 1 states network grouping weakly dominates Dorfman testing in terms of the expected number of tests used. The outperformance of network grouping is driven by q, the probability an edge exists between nodes in different communities. In settings where networks have extremely strong community structure, network grouping performs optimally and achieves the lower bound. Specifically, when q = 0, communities are disconnected from each other and all infected individuals will reside within the same community. When n ≥ m, each group is large enough to capture each entire community and, as a result, all infected individuals will be grouped together. However, there are also scenarios where network grouping is equivalent to Dorfman testing, notably when q = p. Interestingly, even though we assume a network model, epidemic model, and network grouping, we end up back where we started with Dorfman testing. The reasoning is simple: since the network has no structure, all nodes have the same probability of being infected and the network provides no useful information for grouping. In normal cases when 0 < q < p, network grouping significantly outperforms Dorfman testing.
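The three regimes discussed above (q = p, q = 0 with n ≥ m, and 0 < q < p) can be compared numerically. The sketch below is illustrative only. It assumes α = (N − 1)v / [(m − 1)p + (N − m)q], as obtained by matching expected infections to the Dorfman setting, and the expected test count E[T_NG] = N/n + n[1 + (m/n − 1)^+ p' + (N/n − 1 − (m/n − 1)^+) q'] derived in appendix A.3. All function names and parameter values are hypothetical.

```python
def dorfman_expected_tests(N, n, v):
    """Expected tests under Dorfman testing (assumed closed form)."""
    return N / n + n * (1 + (N / n - 1) * (1 - (1 - v) ** n))


def two_stage_lower_bound(N, n, v):
    """Two-stage lower bound under perfect grouping (assumed closed form)."""
    return N / n + n * max(1.0, (1 + (N - 1) * v) / n)


def calibrate_alpha(N, m, p, q, v):
    """Transmission probability that matches the Dorfman expected infections."""
    alpha = (N - 1) * v / ((m - 1) * p + (N - m) * q)
    if alpha > 1:
        raise ValueError("prevalence v is too high for these p, q (alpha > 1)")
    return alpha


def network_expected_tests(N, m, n, p, q, v):
    """Expected tests under network grouping (assumed closed form)."""
    alpha = calibrate_alpha(N, m, p, q, v)
    p_group = 1 - (1 - p * alpha) ** n   # group drawn from the seed's community
    q_group = 1 - (1 - q * alpha) ** n   # group drawn from other communities
    same = max(m / n - 1, 0)             # extra groups holding the seed's community
    other = N / n - 1 - same             # all remaining groups
    return N / n + n * (1 + same * p_group + other * q_group)


if __name__ == "__main__":
    N, m, n, p, v = 1000, 20, 10, 0.5, 0.01
    print("lower bound:", two_stage_lower_bound(N, n, v))
    for q in (0.0, 0.01, 0.1, p):
        print(f"q = {q}: network grouping uses",
              network_expected_tests(N, m, n, p, q, v))
    print("Dorfman:", dorfman_expected_tests(N, n, v))
```

The `ValueError` guard reflects the requirement α ≤ 1, which caps the prevalence v that the model can represent for a given p and q.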
Consider the scenario of a university testing its population for COVID-19 cases. Using data from the Copenhagen Networks Study of Sapiezynski et al. (2019), we build the social network of first-year students at the Technical University of Denmark. The network contains 310 nodes, which correspond to students, and 1503 edges, which correspond to their physical interactions recorded using Bluetooth-enabled smartphones. We apply the Louvain algorithm of Blondel et al. (2008) to detect communities, resulting in 11 communities with an average size of 28 individuals. The social network with nodes colored by their community is displayed in figure 1a. We estimate p to be 0.18 and q to be 0.01, indicating a sparse network with strong community structure. We simulate 1000 epidemic processes on the network using the model outlined in section 3 with α = 0.95, resulting in an estimated infection prevalence of 0.03 (3%). We apply network grouping, which groups individuals by community, and Dorfman testing, which groups individuals randomly. Figure 1b displays the average number of tests used under the different approaches as a function of group size. Network grouping strongly outperforms Dorfman testing. When n = 10, Dorfman testing uses 112 tests on average while network grouping uses 75 tests to screen the population of 310 students, a reduction of 33%. In addition, figure 1b demonstrates our analytical result for the number of tests used under network grouping, provided in equation 3, is a strong approximation for the number of tests used in a real network setting.

In this work, we have introduced the idea of using social network information to improve group testing. When networks have strong community structure, network grouping outperforms Dorfman testing in terms of the number of tests used. It turns out network grouping also outperforms Dorfman in terms of false positives and false negatives when tests are imperfect.
However, we leave discussion of imperfect tests to future write-ups. Importantly, network grouping is simple for practitioners to implement; individuals can be grouped by family unit, friend group, or other community structure.

of whole blood and blood components, including source plasma, to reduce the risk of transmission of hepatitis B virus. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/use-nucleic-acid-tests-pooled-and-individual-samples-donors-whole-blood-and-blood-components, October 2012.

In vitro diagnostics EUAs - molecular diagnostic template for laboratories. https://www.fda.gov/medical-devices/coronavirus-disease-2019-covid-19-emergency-use-authorizations-medical-devices/vitro-diagnostics-euas, July 28, 2020a.

Pooled sample testing and screening testing for COVID-19. https://www.fda.gov/medical-devices/coronavirus-covid-19-and-medical-devices/pooled-sample-testing-and-screening-testing-covid-19, August 24, 2020b.

A.1 DERIVATION OF DORFMAN TESTING

Under Dorfman testing, a population of size N is split into N/n groups of size n for an initial stage of testing. Let G denote the number of positive groups after the initial stage. In the second stage of testing, all n samples from each positive group are retested individually. In total, N/n + nG tests are used. G is a random variable. Of the N/n groups, one is positive with probability one as there is at least one infected individual. The remaining N/n − 1 groups are positive independently with some probability v'. As a result, G is distributed 1 + Bin(N/n − 1, v'). The probability v' is derived as follows. Each of the remaining N − n individuals (that are not in the first group) is infected with probability v and not infected with probability 1 − v. The probability that all n individuals in a group are not infected is (1 − v)^n.
The probability that at least one individual in the group is infected, and therefore the group tests positive, is

$$v' = 1 - (1 - v)^n.$$

Putting everything together, the number of tests used under Dorfman testing is distributed

$$T_D \sim \frac{N}{n} + n\left[1 + \mathrm{Bin}\left(\frac{N}{n} - 1,\ v'\right)\right].$$

Taking the expectation of T_D provides E[T_D] as displayed in equation 1.

Under two-stage group testing, a population of size N is split into N/n groups of size n for an initial stage of testing. Let G denote the number of positive groups after the initial stage. In the second stage of testing, all n samples from each positive group are retested individually. In total, N/n + nG tests are used. Given a population, a certain number of infected individuals, and a group size, the minimum number of tests is achieved by minimizing the number of positive groups G. G is minimized by perfect grouping, in which all infected individuals are pooled together into the minimum possible number of groups. When 1 + (N − 1)v individuals are infected, the minimum number of positive groups of size n is [1 + (N − 1)v]/n. For example, if 20 individuals are infected and n = 10, the minimum number of positive groups is two. When the number of infected individuals is greater than or equal to one but less than or equal to n, the minimum number of positive groups will be one. Note, there is always at least one infected individual in our framework. Putting everything together, the lower bound for the number of tests needed under two-stage group testing is

$$T_{LB} = \frac{N}{n} + n \max\left\{1, \frac{1 + (N - 1)v}{n}\right\}.$$

Under two-stage group testing, a population of size N is split into N/n groups of size n for an initial stage of testing. Let G denote the number of positive groups after the initial stage. In the second stage of testing, all n samples from each positive group are retested individually. In total, N/n + nG tests are used. G is a random variable. The network contains N/m communities of size m. We consider cases where n is divisible by m or m is divisible by n. First consider the case where m ≤ n.
Since we group individuals by community (as described at the beginning of section 4), the infected seed node and its m − 1 community members will be contained in the same group. This group will be positive with probability one. The remaining N/n − 1 groups each contain n nodes that belong to different communities than the seed node. As a result, each node in the remaining N/n − 1 groups is not infected with probability 1 − qα, as they are only infected if they are both connected to the seed, with probability q, and infected by the seed, with probability α. The probability all n nodes within a group are not infected is (1 − qα)^n. The probability that at least one individual in a group is infected, and therefore the group tests positive, is q' = 1 − (1 − qα)^n. In summary, the remaining N/n − 1 groups are positive independently with probability q'. Putting everything together, the distribution of the number of tests used under network grouping when m ≤ n is

$$T_{NG} \sim \frac{N}{n} + n\left[1 + \mathrm{Bin}\left(\frac{N}{n} - 1,\ q'\right)\right].$$

Now consider the case where m > n. As we group individuals by community, there will be one group that contains the infected seed node and n − 1 of its community members. This group will be positive with probability one. The remaining m − n nodes from the seed node's community will be pooled into (m − n)/n = m/n − 1 other groups. Each node in these groups is not infected with probability 1 − pα, as they are only infected if they are both connected to the seed, with probability p, and infected by the seed, with probability α. Following the same logic as the m ≤ n case, each of these m/n − 1 groups is positive independently with probability p' = 1 − (1 − pα)^n. After accounting for the infected seed's group and the other m/n − 1 groups, N/n − m/n groups still remain. Each of the n nodes in these groups is a member of a different community than the seed node. Therefore, each of the N/n − m/n groups is positive independently with probability q' = 1 − (1 − qα)^n.
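The per-group probabilities derived above can be checked by simulation. The following is a minimal Monte Carlo sketch, assuming nodes are indexed so that communities and groups are contiguous blocks, that n divides m or m divides n, and that each non-seed node is infected independently with probability pα (same community as the seed) or qα (different community). It compares the simulated mean with the expectation those probabilities imply. Function names and parameter values are illustrative, not taken from the paper.

```python
import random


def expected_tests_network(N, m, n, p, q, alpha):
    """Expected tests implied by the per-group positive probabilities."""
    p_group = 1 - (1 - p * alpha) ** n
    q_group = 1 - (1 - q * alpha) ** n
    same = max(m / n - 1, 0)      # other groups holding the seed's community
    other = N / n - 1 - same      # groups from different communities
    return N / n + n * (1 + same * p_group + other * q_group)


def simulate_tests(N, m, n, p, q, alpha, rng):
    """One epidemic draw followed by two-stage testing with network grouping."""
    seed = rng.randrange(N)
    infected = [False] * N
    infected[seed] = True
    for node in range(N):
        if node == seed:
            continue
        # Edge to the seed exists w.p. p or q; transmission then occurs w.p. alpha.
        edge_prob = p if node // m == seed // m else q
        if rng.random() < edge_prob * alpha:
            infected[node] = True
    # Groups are consecutive blocks of n nodes, so communities stay intact.
    positive_groups = sum(any(infected[s:s + n]) for s in range(0, N, n))
    return N // n + n * positive_groups


def mean_simulated_tests(N, m, n, p, q, alpha, trials, seed=0):
    """Average test count over repeated epidemic draws."""
    rng = random.Random(seed)
    total = sum(simulate_tests(N, m, n, p, q, alpha, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    args = (200, 20, 10, 0.3, 0.02, 0.5)  # N, m, n, p, q, alpha (m > n case)
    print("formula:   ", expected_tests_network(*args))
    print("simulation:", mean_simulated_tests(*args, trials=2000))
```

Sampling infections directly, rather than building the full network first, is a shortcut that is valid here because only the seed's own edges matter in the initial stage of the epidemic.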
Putting everything together, the distribution of the number of tests used under network grouping when m > n is

$$T_{NG} \sim \frac{N}{n} + n\left[1 + \mathrm{Bin}\left(\frac{m}{n} - 1,\ p'\right) + \mathrm{Bin}\left(\frac{N}{n} - \frac{m}{n},\ q'\right)\right].$$

The two cases, m ≤ n and m > n, can be easily combined. Defining (x)^+ = max(x, 0), we have (m/n − 1)^+ = 0 when m ≤ n. Therefore, we can write the distribution of the number of tests used under network grouping in the general case as

$$T_{NG} \sim \frac{N}{n} + n\left[1 + \mathrm{Bin}\left(\left(\frac{m}{n} - 1\right)^{+},\ p'\right) + \mathrm{Bin}\left(\frac{N}{n} - 1 - \left(\frac{m}{n} - 1\right)^{+},\ q'\right)\right],$$

where p' = 1 − (1 − pα)^n and q' = 1 − (1 − qα)^n.

To show that equation 9 is increasing in q, we consider the cases where m > n and m ≤ n separately.

Case 1: We first consider the case where m > n. When m > n, E[T_NG] is given by

$$E[T_{NG}] = \frac{N}{n} + n\left[1 + \left(\frac{m}{n} - 1\right) p' + \left(\frac{N}{n} - \frac{m}{n}\right) q'\right], \tag{10}$$

where p' = 1 − (1 − pα)^n and q' = 1 − (1 − qα)^n. Under assumption 1, taking the derivative of equation 10 with respect to q and simplifying yields equation 11. Demonstrating equation 11 is nonnegative proves equation 10 is weakly increasing in q. The denominator is nonnegative due to the square and n, v, N − 1, and N − m are nonnegative by assumption 1. Examining the bracket term in the numerator, we note p(m − 1) ≥ p(m − n) as n ≥ 1 and (1 − qα)^(n−1) ≥ (1 − pα)^(n−1) as p ≥ q. Note both 1 − qα and 1 − pα are probabilities between 0 and 1 as p, q, and α are between 0 and 1. As a result, the bracket term is nonnegative and the entirety of equation 11 is nonnegative.

Case 2: We now consider the case where m ≤ n. When m ≤ n, E[T_NG] is given by

$$E[T_{NG}] = \frac{N}{n} + n\left[1 + \left(\frac{N}{n} - 1\right) q'\right]. \tag{12}$$

Taking the derivative of equation 12 with respect to q and simplifying yields equation 13. All terms in the numerator and the first bracket term in the denominator are nonnegative by assumption 1. The second bracket term in the denominator is nonnegative if the condition in equation 14 holds. Rearranging equation 14 yields equation 15, which is true by assumption 1. As a result, equation 13 is nonnegative and equation 12 is weakly increasing in q.

We have shown E[T_NG] is increasing in q for m > n and m ≤ n. Setting q to its maximum value under assumption 1, q = p, we have p' = q' as 1 − (1 − pα)^n = 1 − (1 − qα)^n. In addition, α simplifies to v/p and pα = v.
Therefore, E[T_NG] in the general case simplifies to

$$E[T_{NG}] = \frac{N}{n} + n\left[1 + \left(\frac{N}{n} - 1\right)\left(1 - (1 - v)^n\right)\right] = E[T_D].$$

As E[T_NG] is increasing in q and equals E[T_D] at q = p, we have E[T_NG] ≤ E[T_D].

Lower bound

To prove E[T_NG] ≥ T_LB, we prove E[T_NG] − T_LB ≥ 0 for the three cases of 1) group size larger than (or equal to) the expected number of infected individuals, n ≥ 1 + (N − 1)v, 2) group size less than infected individuals and less than community size, n < 1 + (N − 1)v and n < m, and 3) group size less than infected individuals and greater than (or equal to) community size, n < 1 + (N − 1)v and n ≥ m.

Case 1: When group size is larger than or equal to the expected number of infected individuals, n ≥ 1 + (N − 1)v, the lower bound in equation 2 simplifies to N/n + n. Therefore,

$$E[T_{NG}] - T_{LB} = n\left[\left(\frac{m}{n} - 1\right)^{+} p' + \left(\frac{N}{n} - 1 - \left(\frac{m}{n} - 1\right)^{+}\right) q'\right], \tag{19}$$

where p' = 1 − (1 − pα)^n and q' = 1 − (1 − qα)^n. By assumption 1, n ≥ 1 and both p' and q' are probabilities between 0 and 1, as 1 − pα and 1 − qα are between 0 and 1. In addition, the term (N/n − 1 − (m/n − 1)^+) is nonnegative as N ≥ n and N > m. As a result, the entirety of equation 19 is nonnegative.

Case 2: When group size is smaller than the expected number of infected individuals and community size, n < 1 + (N − 1)v and n < m, we can write the lower bound T_LB as

$$T_{LB} = \frac{N}{n} + 1 + (N - 1)v = \frac{N}{n} + 1 + (m - 1)p\alpha + (N - m)q\alpha.$$

Subtracting from E[T_NG] for m > n gives

$$E[T_{NG}] - T_{LB} = (n - 1)(1 - p') + (m - 1)(p' - p\alpha) + (N - m)(q' - q\alpha).$$

By assumption 1, we have m > 1, N > m, and n ≥ 1. In addition, 1 ≥ p' as p' = 1 − (1 − pα)^n is a probability between 0 and 1. Lastly, p' = 1 − (1 − pα)^n ≥ 1 − (1 − pα) = pα as (1 − pα)^n ≤ (1 − pα). Similarly, q' = 1 − (1 − qα)^n ≥ 1 − (1 − qα) = qα. As a result, E[T_NG] − T_LB is nonnegative.

Case 3: We consider the case where group size is smaller than the expected number of infected individuals but larger than (or equal to) community size, n < 1 + (N − 1)v and n ≥ m. Using equation 21 and the inequality m ≥ 1 + (m − 1)pα, we have the following inequality for the lower bound T_LB:

$$T_{LB} \le \frac{N}{n} + m + (N - m)q\alpha.$$

With E[T_NG] = N/n + n + (N − n)q' for m ≤ n, this gives

$$E[T_{NG}] - T_{LB} \ge (n - m)(1 - q') + (N - m)(q' - q\alpha).$$

By assumption, N > m and n ≥ m. In addition, 1 ≥ q' as q' is a probability between 0 and 1. Lastly, q' = 1 − (1 − qα)^n ≥ 1 − (1 − qα) = qα as (1 − qα)^n ≤ (1 − qα). As a result, the difference E[T_NG] − T_LB is nonnegative.
We have proven E[T_NG] − T_LB ≥ 0 for the three cases under consideration, completing the lower bound portion of the proof.

When q = 0 and n ≥ m, we have q' = 0 and (m/n − 1)^+ = 0, so E[T_NG] = N/n + n. Recall the number of infected individuals is 1 + (N − 1)v. With q = 0, (N − 1)v = α(m − 1)p, and as α ≤ 1 we now have 1 + (N − 1)v ≤ 1 + (m − 1)p and 1 + (m − 1)p ≤ m as p ≤ 1. Since m ≤ n, we have 1 + (N − 1)v ≤ n. Therefore, the lower bound T_LB is N/n + n and E[T_NG] = T_LB, completing the proof.

Note, the assumption α ≤ 1 in assumption 1 sets an upper bound for the infection prevalence v as α is a function of v. However, this is not restrictive as group testing is employed in cases when v is low. The statement "If q = p, then E[T_NG] = E[T_D]" is proved above during the upper bound portion of the proof.

Informative retesting
Fast unfolding of communities in large networks
Graph-constrained group testing
Pooling of clinical specimens prior to testing for chlamydia trachomatis by PCR is accurate and cost saving
The detection of defective members of large populations
Pooling of sera for human immunodeficiency virus (HIV) testing: an economical method for use in developing countries
Tapestry: a single-round smart pooling technique for COVID-19 testing. medRxiv
Sample pooling as a strategy to detect community transmission of SARS-CoV-2
Screening for the presence of a disease by pooling sera samples
Informative Dorfman screening
Interaction data from the Copenhagen Networks Study
High-throughput pooling and real-time PCR-based strategy for malaria detection
Pooling nasopharyngeal/throat swab specimens to increase testing capacity for influenza viruses by PCR
Wuhan tests nine million people for coronavirus in 10 days
COVID-19 tests are still hard to get in many communities
Evaluation of COVID-19 RT-qPCR test in multi sample pools

We thank Alberto Abadie, Jordan Brooks, Ben Deaner, Yash Deshpande, David Hughes, Peter Kempthorne, and Noelle Wyman for helpful comments and conversations. We are grateful to Eric Lai and the NIH RADx team for helpful discussions.
The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing high performance computing resources. Paolo Bertolotti was supported by a National Defense Science and Engineering Graduate (NDSEG) Fellowship.