key: cord-0990917-1wdlf0ms
authors: de Belleville, L.-M. H.
title: Group Testing with Homophily to Curb Epidemics with Asymptomatic Carriers
date: 2020-10-13
journal: nan
DOI: 10.1101/2020.10.09.20210260
sha: c5a7bfa68f881438360c7930be545ae09bbd4c06
doc_id: 990917
cord_uid: 1wdlf0ms

The global fight against COVID-19 is plagued by asymptomatic transmission and false negatives. Group testing is increasingly recognized as necessary to fight this epidemic. I examine the gains from considering heterogeneous interpersonal interactions (homophily), which induce potential contamination, when designing testing pools. Homophily can be identified ex ante at a scale commensurate with pool size, so that the risk of contamination is higher within a well-designed pool than with an outsider. This makes it possible to overcome the usual information-theoretic limits of group testing which rely on an implicit homogeneity assumption. More importantly, group testing with homophily detects asymptomatic carriers that would be missed even by exhaustive individual testing because of false negatives. Such a strategy should be implemented at least at a weekly frequency to fit the time profile of test positivity. It can be used either to avoid unnecessary lockdowns or to make lockdowns more efficient.

The present study analyzes the potential gains from taking homophily into account when implementing pool testing to fight epidemics with asymptomatic carriers and false negatives. It shows that designing test pools that encompass potential contamination clusters improves the efficiency of tests significantly, and makes it possible, in combination with more advanced complementary exams, to identify carriers that would be missed even by exhaustive (and unfeasible) individual testing.

Various strategies have been implemented to curb the COVID-19 epidemic. Trace and track may be efficient (Normile, 2020) but can be thwarted by asocial behaviors or religious beliefs (see patient 31 in South Korea), and tracking teams are overwhelmed when incidence is too high. Lockdowns and quarantine work (World Health Organization, 2020; Kupferschmidt and Cohen, 2020) but are costly (Gourinchas, 2020). The need for testing, which can substantially reduce the need for indiscriminate quarantines, was identified early in the epidemic (Piguillem and Shi, 2020).

Massive and timely identification of asymptomatic disease carriers is crucial if human-to-human asymptomatic transmission happens. Clinical diagnosis based on symptoms is inefficient in that case, while Yelin et al. (2020) lament that focusing tests on acutely ill patients leaves potentially infectious carriers undiagnosed in the community 2. Testing of asymptomatic people is also useful if the disease has long-lasting consequences even without symptoms or if subsequent phases induce a higher fatality rate. Chan, Yuan, Kok et al. (2020) and find ground-glass opacities for the vast majority of COVID19 asymptomatic patients; although further research may be needed on that point, this may signal potential sequelae even for asymptomatic patients.

More or less stringent definitions of asymptomatic carriers exist. document the existence of both presymptomatic and truly asymptomatic carriers. In order to tackle the contamination induced by the former, one could consider implementing trace and track, at least if the presymptomatic contagious period is short. However, as noted already, tracking teams may soon be overwhelmed. Thus, massive and timely identification of presymptomatic carriers may be necessary. More generally, I follow Harpedanne (2020) and use "asymptomatic" transmission to cover the transmission of a disease by asymptomatic but also presymptomatic, subclinical, or only mildly sick patients. Post-symptomatic patients may also carry a viral load, but these patients cause fewer problems for disease transmission since they can be isolated easily.

First, many studies document the existence of asymptomatic carriers. In a meta-analysis of 66 articles and pre-prints, Koh, Naing, Rozledzana et al. (2020) find an average asymptomatic proportion at diagnosis of 25.9%, of which two thirds are presymptomatic and one third truly asymptomatic. Among other studies, document a low proportion of asymptomatic carriers (4%), probably due to sample selection issues (virologically confirmed COVID-19 patients in Shanghai Public Health Centre). Kimball et al. (2020) find 13% asymptomatic carriers and 43% presymptomatic; Mizumoto et al. (2020) find 18% asymptomatic carriers; Qiu et al. (2020) find 28%; Nishiura, Kobayashi, Suzuki et al. (2020) find 31%; and Day (2020) cites the China National Health Commission pointing to 78% asymptomatic carriers in new cases observed over the 24 hours to April 1, 2020.

Second, human-to-human transmission of COVID-19/SARS-CoV-2 was documented early by Xu et al. (2020), Li, Guan et al. (2020), Chan, Yuan, Kok et al. (2020) and Phan et al. (2020). More specifically, biological and epidemiologic evidence for asymptomatic transmission is provided by Bai et al. (2020), Rothe et al. (2020), Zou et al. (2020) and Santarpia et al. (2020) 3, while Wong, Aziz, Chaw, Mahamud, Griffith Ying-Ru et al. (2020) strengthen the evidence for both asymptomatic and presymptomatic transmission. Koh, Naing, Rozledzana et al. (2020) find that the risk of transmission is 2.55 times higher when the index case is symptomatic. Li, Pei et al. (2020) find that although the transmission rate of undocumented carriers is only 55% that of documented carriers, the former are responsible for 80% of contaminations, due to their high absolute numbers. Thus, massive testing of asymptomatic people has gained popularity during the epidemic (Allen, Block, Cohen et al., 2020).

However, individual testing of asymptomatic people is hopeless. For instance, France (the sixth largest world economy, with a population of 67 million) had reached 1.19 million COVID-19 tests per week as of September 2020. Even in the unlikely case in which all these tests were dedicated to detecting asymptomatic carriers, each person would be tested less than once a year.

Group testing (batching the samples of different people and implementing a single test on the pooled sample), first proposed by Dorfman (1943), makes asymptomatic testing much more efficient. According to Mutesa et al. (2020), when the prevalence of COVID-19 is 0.1 percent, Dorfman's design decreases 17-fold the number of tests required to identify asymptomatic COVID-19 carriers (0.06 test per person), while the new strategy they suggest decreases this number 55-fold (0.018 test per person). Still, the gains of group testing over individual testing are lower for higher prevalence, and there exist information-theoretic limits to the potential improvements allowed by group testing (see Section 3 below).
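The order of magnitude quoted for Dorfman's design can be checked with a minimal sketch. It assumes the textbook two-stage scheme (one pooled test, then individual retests of every member of a positive pool), noiseless tests and independent infections; the function names are illustrative and do not come from Mutesa et al. (2020).

```python
def dorfman_tests_per_person(prevalence: float, pool_size: int) -> float:
    """Expected tests per person in a two-stage Dorfman design.

    Assumes noiseless tests and independent infections (no homophily).
    """
    if pool_size < 2:
        return 1.0  # a pool of one is just individual testing
    # One pooled test is shared by pool_size people ...
    first_stage = 1.0 / pool_size
    # ... and everyone is retested when the pool contains at least one carrier.
    p_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    return first_stage + p_pool_positive


def best_pool_size(prevalence: float, max_pool_size: int = 100) -> int:
    """Pool size (between 2 and max_pool_size) minimizing expected tests per person."""
    return min(
        range(2, max_pool_size + 1),
        key=lambda n: dorfman_tests_per_person(prevalence, n),
    )


if __name__ == "__main__":
    p = 0.001  # prevalence of 0.1 percent, as in the text
    n = best_pool_size(p)
    cost = dorfman_tests_per_person(p, n)
    print(f"pool size {n}: {cost:.4f} tests per person, {1 / cost:.1f}-fold saving")
```

Under these assumptions the cost comes out at roughly 0.06 test per person, in line with the figure quoted above. The independence assumption is exactly what homophily relaxes in the rest of the paper.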

I examine the benefits of taking a priori information on interpersonal relations into account when designing the pools used for group testing. Considering this information (which I label "homophily", see Section 2) makes it possible to push the information-theoretic limits mentioned above and improve test efficiency further. Also, group testing with homophily proves very efficient at tackling false negatives, a major deficiency of the usual COVID19 RT-PCR tests based on nasopharyngeal swabs.

Group testing is used to fight COVID19 in China, India, Germany, the United States (Mallapaty, 2020) and Rwanda (Mutesa et al., 2020). In the United States, it is authorized for pools of up to four people. The specific literature on group testing and COVID19 includes Gollier and Gossner (2020), Conger et al. (2020), Eberhardt, Breuckmann and Eberhardt (2020), Mutesa et al. (2020), Mallapaty (2020), Lohse et al. (2020) and Yelin et al. (2020).

Group testing methods are either adaptive or non-adaptive. A group testing framework is adaptive if the results of a given round of tests influence the design of subsequent rounds. For instance, Dorfman (1943) proposes a two-step procedure in which groups are tested in a first round and, if a group is positive, individual tests are implemented in this group. Mutesa et al. (2020, Section IV and Appendices B and C) discuss the pros and cons of various adaptive and non-adaptive tests to fight COVID-19.

Section 2 presents homophily and clustering as well as their links to contamination channels and contamination clusters; it shows that homophily can be identified ex ante, so that the strategy analyzed here, and especially in Section 5, is feasible. Section 3 then provides counterexamples to the limits computed by Chan et al. (2011) and Baldassini et al. (2013) and concludes that the homophily structure provides relevant information. Section 4 shows that homophily makes it possible to reduce the effects of the dilution induced by group testing. Finally, Section 5 considers false negatives induced by idiosyncratic noise: I find evidence that an adaptive strategy combining a first step of group testing with homophily and a second step based on more advanced individual complementary exams can help identify asymptomatic carriers that could not be detected even by exhaustive individual tests. The appendix analyzes the efficiency of this strategy in identifying carriers at reduced cost and minimizing the risk of false negatives.

Although many examples and references relate to the COVID19 epidemic, most results are mathematical or logical in nature and may be useful more generally during epidemics in which massive and repeated asymptomatic testing is necessary.

In this Section, I present homophily and clustering, which are concepts used in social sciences, and clusters, which are used in epidemiology. I show that these concepts, which I summarize as "homophily", can be used ex ante to identify potential contamination structures in order to design testing pools accordingly, which is a necessary condition for implementing the testing strategy analyzed in the rest of the present paper.

Homophily was defined in 1954 by Lazarsfeld and Merton. It "refers to the fact that people are more prone to maintain relationships with people who are similar to themselves" (Jackson, 2008, p. 68). McPherson et al. (2003) document the prevalence of homophily in many social networks. Jackson and Lopez-Pintado (2013) analyze the effects of homophily on contagion. In particular, they show that starting from a small initial seed (a small number of infected people), homophily facilitates diffusion under rather limited conditions. Since homophily is prevalent in network analysis and relates to contagion, it makes sense to consider this phenomenon when designing testing strategies. Surprisingly enough, the literature on group testing has disregarded these aspects until now.

A related econometric concept is clustering. Clustering refers to the nondeterministic correlation of outcomes between individuals that are somewhat related. Moulton (1986, 1990) introduced this idea and showed that failing to take it into account induces significant errors when estimating standard errors. Clustering has been popularized by Bertrand et al. (2004), who have shown that the standard errors of difference-in-difference estimates were not properly estimated when neglecting clustering. Clustering may be "multi-way" (Cameron, Gelbach and Miller, 2011), meaning for instance that an individual may be correlated with people working in the same firm on the one hand, with people living in the same village on the other hand, but also with people going to the same gym club, those having their children in the same school, etc., without these different clusters being nested.

Nowadays, correcting for potential clustering is a sine qua non for scientific work in applied economics. Many results show that overlooking clustering would underestimate standard errors, which means that there exists a positive correlation between outcomes for individuals belonging to groups identified ex ante on rather simple criteria. This pattern is verified for a wide range of outcomes in many settings.

In other words, various branches of social sciences converge on both the necessity and the possibility of taking heterogeneous interpersonal interactions into account when analyzing many mechanisms, and contagion especially.

The medical literature confirms that contamination occurs through clusters. Han and Yang (2020) cite a Chinese-language article asserting that "In some cities, cases involving cluster transmission accounted for 50% to 80% of all confirmed cases of COVID-19." The strategy analyzed in the present study requires grouping COVID cases together in the same testing pool, or in a small number of pools. Pools may include a few dozen individuals, at most one hundred (see Section 4 on dilution). Thus, potential clusters of limited size must be identified ex ante. Madewell et al. (2020) report that "To better understand clustering within households, it would also be useful for researchers to report the number of infections by household in addition to the total number of infected individuals." Unfortunately, this is rarely done, so that the feasibility of the strategy analyzed here must be evaluated indirectly. This can be done for instance by looking at studies on small clusters with a high attack rate (ratio of contaminated people in a given group) or secondary attack rate (SAR is the number of people contaminated by an index case, divided by the number of people in contact with this index case). Koh, Naing, Rozledzana et al. (2020) provide a meta-analysis of 20 studies on secondary attack rate. The household is quite often the place where the SAR is highest (15.4% on average), and Qiu et al. (2020) find that out of 36 children infected in a Chinese city, 32 (89%) had transmission by close contact with family members. But high SARs have also been observed in a chalet (73.3%), at a choir (53.3%) and at a religious event; high SARs are sometimes observed for travel and eating with an index case. Koh, Naing, Rozledzana et al. (2020) report other cases of clusters with very high attack rates: "a nursing home in Kings County, Washington (64%) […] a church in Arkansas (38%), a homeless shelter in Boston (36%), a fitness dance class (26.3%) and the Diamond Princess cruise ship in Japan (18.8%)." Park et al. (2020) analyze an outbreak in a Korean building: 94 of 97 cases worked on the same floor (11th), and 79 cases worked in the same open space (attack rate of 52%). Many clusters have also been observed in slaughterhouses. Thus, most of these results show high concentrations of cases in groups of limited size that can be tested in one or a few pools.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted October 13, 2020. https://doi.org/10.1101/2020.10.09.20210260

Is it possible to derive from these results general ex ante contamination patterns in small clusters? Many studies (Li, Zhang, Lu et al., 2020; Madewell et al., 2020; Liu, Lian, Zhong et al., 2020) and the meta-analysis by Koh, Naing, Rozledzana et al. (2020) underline that longer and more intense exposure to infection sources increases the risk of infection. Crowded indoor environments with sustained close contact and conversations are a particularly high-risk setting (Nishiura, Oshitani, Kobayashi, et al., 2020). Interestingly, Park et al. (2020) find a high concentration of cases in open spaces but only one case in small offices. This is consistent with the theoretical analysis in Harpedanne de Belleville (2020), who shows an almost convex effect of the number of room users on contamination.

Using these general patterns and theoretical results, it is possible to identify ex ante potential clusters, and to design pools that encompass these clusters. Sections 3 to 5 analyze the gains from this strategy.

More generally, Harpedanne (2020) relates contamination channels and contamination probabilities between individuals who have interpersonal interactions. For instance, airborne and droplet contagion increases the probability of contagion between people sharing the same office, open space, corridor, etc. Contagion through fomites increases the probability of contagion between people successively using the same toilet, the same seat in a train coach, etc.

Thus, like Harpedanne (2020), the present paper deals with curbing epidemics with asymptomatic contamination and heterogeneous social interactions inducing specific expected contamination patterns. For Harpedanne (2020), asymptomatic carriers are unidentified but organizational measures can affect interpersonal interactions; conversely, the present paper takes interpersonal interactions for granted and proposes to take them into account to better identify asymptomatic carriers.

In Section 5, I analyze a two-step strategy in which the first step (pool test on nasopharyngeal swabs) draws on homophily to design test pools. From a policy perspective, many patterns of homophily and potential clusters are identified; they can be used to make testing more efficient and therefore reduce the need for unnecessary lockdowns (Piguillem and Shi, 2020). Still, households are documented as frequent clusters (high homophily inside households), and lockdowns may aggravate this fact. Thus, the two-step strategy can also be implemented to identify household contamination and make lockdowns more efficient and shorter.

In this section, I show that if the homophily structure is known before implementing a pooled test, it contains information that can make the test more efficient. For that purpose, I show that if homophily is "strong" enough and can be properly identified when designing the pooled tests (more specifically, when designing the pools), it makes it possible to overcome the most recent and tightest information-theoretic bounds on the efficiency of group testing. These limits have been identified by Chan et al. (2011), who, for the first time in the literature, define limits in terms of actual numbers and not only rate or capacity, and by Baldassini et al. (2013), who follow the same path and provide a new and tighter bound.

Unlike Sections 4 and 5, the present Section focuses on noiseless tests. Thus, a few definitions may be useful here. A group test is noiseless if a negative test outcome is guaranteed when all items in the testing pool are nondefective, and a positive outcome is guaranteed when at least one item in the pool is defective (Aldridge, Johnson and Scarlett, 2019). Otherwise, the test is noisy.

Noisy tests are often examined under the assumptions of constant (Chan et al., 2011) or worst-case (Macula, 1997) noise. However, the results by Yelin et al. (2020) point instead to an increasing risk of false negatives when dilution increases, and the results by Yang, Yang, Shen et al. (2020), Wang, Xu, Gao et al. (2020) and Wang, Tan, Wang et al. (2020) point to patient-specific idiosyncratic noise. These two issues are analyzed in Sections 4 and 5, respectively, and I show that taking homophily into account brings specific gains. Conversely, general noise models (such as the symmetric error model, see e.g. Chan et al., 2011, or the additive model, see e.g. Atia and Saligrama, 2012) are not relevant to analyze these forms of noise, so that there is no need to provide a general analysis of noisy models. I consider noiseless tests for two other reasons. First, this makes it clear that the benefits of taking homophily into account in group testing are not limited to noise-related issues. Second, it makes the comparison with the information-theoretic limit of Baldassini et al. (2013) easier. Baldassini et al. (2013, Section III) analyze a noiseless test in a population of size N, with K defectives (K is known for simplicity). They show that if the number of tests is limited to T, the probability of correct identification of the set of defectives satisfies:

(1) $P(\mathrm{suc}) \le \dfrac{2^{T}}{\binom{N}{K}}$

Let N be 64 and K be 8. Using only 6 tests (T=6), one can cut the population into 8 groups of 8 people each and determine which group contains carriers if only one group contains carriers (think of the population of 64 as a 4x4x4 cube and cut the cube in half along each dimension, that is, implement 6 tests over 32 people each). According to (1):

(2) $P(\mathrm{suc}) \le \dfrac{2^{6}}{\binom{64}{8}} \approx 1.45 \times 10^{-8}$
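The arithmetic behind (2) can be checked with a short sketch. It assumes that (1) is the counting bound 2^T divided by the number of possible defective sets, which is the form that reproduces the figure of 1.45 × 10^-8; the function name is illustrative.

```python
from math import comb


def counting_bound(n_items: int, n_defectives: int, n_tests: int) -> float:
    """Upper bound on the success probability of a noiseless group test.

    Assumed form: 2**T possible outcome vectors divided by the C(N, K)
    equally likely defective sets, capped at 1.
    """
    n_outcomes = 2 ** n_tests
    n_defective_sets = comb(n_items, n_defectives)
    return min(1.0, n_outcomes / n_defective_sets)


if __name__ == "__main__":
    bound = counting_bound(n_items=64, n_defectives=8, n_tests=6)
    print(f"C(64, 8) = {comb(64, 8)}")
    print(f"bound for T = 6: {bound:.3g}")
```

The bound treats every set of 8 defectives among the 64 individuals as equally likely, which is the homogeneity assumption that homophily relaxes.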

Let us now introduce homophily. Homophily means that there exists a high potential for contamination within each group, while the potential for intergroup contamination is low. Let us assume that only one individual has imported the disease into the population of 64: this is a reasonable assumption if the prevalence is low in the general population. This assumption may be verified with probability (1-ε1), and let ε2 be the probability that intergroup contamination has happened. Then with a probability higher than (1-ε1)(1-ε2), all 8 carriers are in the same group, which the 6 tests identify. For instance, if ε1=0.2 and ε2=0.5, we get:

(3) $P(\mathrm{suc}) \ge (1-\varepsilon_1)(1-\varepsilon_2) = 0.8 \times 0.5 = 0.4$

This of course contradicts (2). Strong homophily provides information that makes it possible to overcome information-theoretic limits based on the implicit assumption of absence of homophily.

The counterexample just provided is extreme and not very useful in practice; the aim of this example is merely to illustrate that the usual information-theoretic limits rely on an implicit homogeneity assumption. By taking homophily into account, we relax this assumption. More realistic adaptive frameworks may provide a rather high probability of success with a limited number of tests. Think of testing the 8 groups independently in a first step and individually testing all people in the first two groups that turn positive in the first step. With decent homophily, this strategy would likely provide a very good probability of success while (1) would give a bound equal to 0.3 %. Actually, (1) would not apply since the strategies examined by Baldassini et al. (2013) are only non-adaptive, but since the authors point to the limited gain in efficiency brought by adaptive designs, it is likely that gains from homophily could be formally proven for adaptive designs. This is beyond the scope of the present paper, and (3) suffices to prove that homophily, if identified ex ante when designing a group test, may provide relevant information.


Dilution is a crucial issue for group testing applied to disease detection. If a positive swab is pooled with many negative swabs, it may be difficult to identify traces of the virus. Dilution is already dealt with in the seminal paper on group testing by Dorfman (1943), who tackles syphilis detection among young drafted Americans. Dorfman finds that "diagnostic tests for syphilis are extremely sensitive and will show positive results for even great dilution of antigen". Still, dilution may be more of an issue for other diseases. Dilution has been examined in a group testing framework by Hwang (1976). Warasi et al. (2017) propose a parametric model to tackle dilution.

A series of studies have examined dilution for RT-PCR tests targeting genes of SARS-CoV-2. Lohse et al. (2020) target the E gene (envelope gene) and S gene (spike gene) of SARS-CoV-2 and show that it is possible to identify positive samples correctly, even when diluted 30-fold (that is, one positive swab pooled with 29 negative swabs). Yelin et al. (2020) find that group testing can be implemented with up to 32 individuals per group with a false negative rate of 10 % in the case of COVID19, which they claim is low when compared to other sources of noise. They also suggest that implementing additional amplification cycles would make it possible to implement group testing with up to 64 individuals per group. Similarly, Mutesa et al. (2020) show that using a Ct-value of 40 makes it possible to detect positive swabs diluted up to one hundred-fold. More precisely, they show that for tests targeting the N gene of SARS-CoV-2, the upper 95 % bound is below 40 and for tests targeting the Orf1ab gene, the upper 90 % bound is below 40. Subsequently, they confirm these results with tests targeting the E and RdRp genes.

If homophily is taken into account when designing the pools, it is rather likely that carriers will be concentrated in a few pools, and therefore that no single carrier will be isolated in a pool. If many carriers are concentrated in a pool, this counteracts the effects of dilution. For instance, if a pool with 32 swabs contains 2 positive swabs, the dilution is 1/16, which is rather limited.

In the present paper, I do not analyze further the interactions between homophily and dilution. Indeed, the available literature points to a limited effect of dilution for COVID-19. Still, this issue remains open to further research.

Idiosyncratic noise in tests can occur for many different reasons: contamination of the samples, error or insufficient training of the person in charge of collecting the swabs, etc. I focus here on a type of noise that has been extensively documented by the literature on COVID19: the swabs used for tests may fail to contain any viral load for many disease carriers. To solve this issue, a strategy based on group testing with homophily can identify more asymptomatic carriers than group testing alone, but also more than exhaustive individual testing.

Many methods can be used to identify SARS-CoV-2 carriers: clinical diagnosis, chest radiograph and CT-scan, fibrobronchoscope brush biopsy, RT-PCR on bronchoalveolar lavage fluid, sputum, nasal swabs, pharyngeal swabs, feces, etc. By definition, clinical diagnosis does not work for asymptomatic carriers; chest radiograph and CT-scan are not available to implement massive identification of asymptomatic carriers; as underlined by Yang, Yang, Shen et al. (2020), collecting lower respiratory samples (bronchoalveolar lavage fluid, fibrobronchoscope brush biopsy) requires specific equipment and skilled operators, and can be painful; among upper respiratory samples, sputum is produced in only 28 % of COVID cases examined by Huang et al. (2020). Thus, only nasal swabs and pharyngeal swabs may be used for large-scale asymptomatic testing. Yang, Yang, Shen et al. (2020) analyze four types of specimens (bronchoalveolar lavage fluid, nasal swabs, pharyngeal swabs and sputum) from 213 confirmed COVID patients and find strong evidence of false negatives for individual tests based on nasal or pharyngeal swabs: the rate of positives is only 50 % to 73.3 % for nasal swabs; it is higher (72.1 % to 73.3 %) for the swabs collected over the first week after the onset of the disease. The rate is even lower for the pharyngeal swabs: 11 % to 61.3 %; once again it is higher for the swabs collected over the first week (60 % to 61.3 %). Wang, Xu, Gao et al. (2020) analyze eight types of samples (bronchoalveolar lavage fluid, fibrobronchoscope brush biopsy, sputum, nasal swabs, pharyngeal swabs, feces, blood and urine). They also find a high rate of false negatives for individual nasal swabs and even more for pharyngeal swabs. They also point to the very poor performance of rRT-PCR tests based on blood and urine. Wang, Tan, Wang et al. (2020) compare the detection performance of individual nasopharyngeal and oropharyngeal swabs for 535 patients. They also find better results for the former, and find that using both tests increases detection slightly over nasopharyngeal swabs alone, which confirms the existence of false negatives for both. Overall, nasal swabs, which are widely used for rRT-PCR identification of SARS-CoV-2 (Wang, Hu, Hu et al., 2020), display a significant rate of false negatives. Yang, Yang, Shen et al. (2020) underline cases in which all upper respiratory tests (or all upper respiratory tests over a given period) are negative for confirmed COVID-19 patients. This points to the fact that false negative results are related to specific individuals rather than mere technical errors.

Group testing with homophily can be very beneficial here. Indeed, even if a carrier is "false negative" (meaning that there is no viral load in the sample collected for this individual), with homophily it is likely that other carriers belong to the same group/pool, so that the pooled test has more chances to turn positive 4. If α is the proportion of false negatives among carriers and false negative individuals are i.i.d. in the population of carriers, the risk of missing the identification of a carrier is α for individual tests and for tests with only one carrier in a pool, but it is $\alpha^{2}$ if there are 2 carriers in the pool, and $\alpha^{n}$ if there are n carriers. Since the literature shows that $0 < \alpha < 1$, we get $\alpha^{n} < \alpha^{n-1} < \dots < \alpha^{2} < \alpha$: more carriers in a group increase the probability of a correct (positive) result at the group level.
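The geometric decay of the miss probability can be tabulated with a minimal sketch. It assumes, as in the paragraph above, that false negatives are i.i.d. across carriers and that dilution is neglected; the function name and the values of α are illustrative.

```python
def pool_miss_probability(alpha: float, n_carriers: int) -> float:
    """Probability that a pool containing n_carriers carriers tests negative.

    Assumes each carrier's swab is independently virus-free with
    probability alpha, and that dilution plays no role.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability")
    if n_carriers < 1:
        raise ValueError("the pool must contain at least one carrier")
    # The pooled test is negative only if every carrier's swab is virus-free.
    return alpha ** n_carriers


if __name__ == "__main__":
    for alpha in (0.25, 0.33, 0.5):  # illustrative false negative rates
        row = [round(pool_miss_probability(alpha, n), 4) for n in range(1, 6)]
        print(f"alpha = {alpha}: {row}")
```

Each additional carrier in the pool multiplies the probability of a false negative pool by α, which is why concentrating carriers in a few pools pays off.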

The multistage testing strategies usually considered in the group testing literature are clearly not optimal in the presence of individual false negatives: the last stage usually implemented is individual testing in positive groups, which would pick up only part of the true positives in each group. But if groups are properly defined by taking homophily into account, carriers are likely to be concentrated in the positive groups. Thus, rather than implementing the same tests as in the first step, it makes sense to isolate and take care of all people in the positive groups and to implement more advanced examinations on them; for COVID-19: clinical diagnosis, chest radiograph or CT-scan, fibrobronchoscope brush biopsy, and rRT-PCR on bronchoalveolar lavage fluid, sputum if available and feces (this choice of samples for rRT-PCR is based on Wang, Xu, Gao et al., 2020).

In this multistage strategy, two "costs" depend on homophily. First, even if this strategy is more efficient at tackling false negatives and identifying carriers than existing testing strategies, some carriers are still missed. Second, if carriers are identified but are not concentrated, many pools must undergo the costly second-step process.

Graph 1 analyzes the quantitative gains from homophily in the two-step strategy described above, for an absolute number of defectives ranging from 2 to 5. This covers a large range of different situations. For instance, two defectives in a population of two thousand correspond to a rate of 0.1%, while five in a population of 50 correspond to a rate of 10%. The size (and number) of pools does not affect the graphs, so that the graphs also cover a large range of pool sizes. From left to right, each graph provides statistics (described below) for increasing concentration of the defectives in a few testing pools.

As in Section 3, K is the number of defectives and α is the probability that a test result is a false negative. According to Wang, Xu, Gao et al. (2020), RT-PCR on nasal swabs identifies 63% of carriers (72% to 74% for swabs collected over the first two weeks after onset according to Yang, Yang, Shen et al., 2020). Thus, realistic figures for α range from 0.25 to 0.5, with an average value close to 0.33. I provide results for these three values. For instance, in the upper graphs (2 defectives), the left case (1 1) corresponds to one defective in a pool and one in another, while the right case (2) corresponds to the two defectives in the same pool. Increasing concentration can be denoted by the operator <C. <C is transitive but is not a total order. Using ≈ for configurations that cannot be ordered through <C, we obtain, for 5 defectives: (1 1 1 1 1) <C (2 1 1 1) <C (2 2 1) ≈ (3 1 1) <C (3 2) ≈ (4 1) <C (5).

4 For the sake of simplicity, I neglect dilution. This is valid under mild conditions, including a limited pool size.
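The text does not spell out how <C is constructed. One reading that is consistent with the pairs it reports as non-comparable, (3 2) and (4 1), is that a configuration is more concentrated than another when it can be obtained from it by merging the carriers of several pools into one pool. The sketch below checks this reading for five carriers; the function names and the merging interpretation are assumptions made for illustration, not definitions taken from the paper.

```python
from itertools import combinations

def partitions(n, largest=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

def can_merge_into(fine, coarse):
    """True if `coarse` can be obtained from `fine` by merging pools.

    Assumed reading of the concentration order: each part of `fine`
    is assigned to one part of `coarse`, and the assigned parts must
    add up exactly to that part.
    """
    if sum(fine) != sum(coarse):
        return False
    remaining = list(coarse)
    parts = sorted(fine, reverse=True)

    def place(i):
        if i == len(parts):
            return True  # sums match, so every bin is exactly filled
        tried = set()
        for j, room in enumerate(remaining):
            if room >= parts[i] and room not in tried:
                tried.add(room)
                remaining[j] -= parts[i]
                if place(i + 1):
                    return True
                remaining[j] += parts[i]
        return False

    return place(0)

def incomparable_pairs(n):
    """Pairs of configurations that cannot be ordered by merging."""
    return [
        (a, b)
        for a, b in combinations(list(partitions(n)), 2)
        if not can_merge_into(a, b) and not can_merge_into(b, a)
    ]

if __name__ == "__main__":
    for a, b in incomparable_pairs(5):
        print(a, "and", b, "cannot be ordered")
```

Under this reading, exactly two pairs of configurations of five carriers cannot be ordered, and each pair has the same number of contaminated pools.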

The bars describe the expected number of missed carriers due to false negatives. When carriers are concentrated in a few pools or a single pool (on the right of each graph), they are detected more easily and the expected number of missed carriers decreases. Consider for instance the case with exactly 5 carriers. If homophily was disregarded when designing the pools and carriers are distributed i.i.d. across the pools, it is likely that five groups each contain 1 carrier (1 1 1 1 1), or that one group contains 2 carriers and 3 groups each contain 1 (2 1 1 1). Conversely, if homophily was taken into account, carriers are likely to be concentrated in a few groups. For instance, (3 2) denotes the case in which one group contains 3 carriers and another contains 2, and (5) the extreme case where one group contains all the carriers. In the latter case, the expected number of missed carriers may be infinitesimal.

More generally, when homophily is better taken into account when designing the groups for first-stage testing, the "expected missed", i.e. the expected number of carriers belonging to groups with false negative results, decreases. Even when the proportion of individual false negatives is rather high (α=.5), almost all carriers are expected to be identified when they are concentrated in, say, two of the pools, and the risk of missing them altogether vanishes when they belong to the same pool.
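As a rough illustration of the "expected missed" statistic, the sketch below assumes that dilution is neglected and that each carrier in a pool independently fails to trigger a positive result with probability α, so that a pool containing k carriers tests negative with probability α^k. This independence assumption and the function name are illustrative choices; the exact computation behind the graphs may differ.

```python
def expected_missed(config, alpha):
    """Expected number of carriers sitting in pools that test negative.

    config: tuple of carrier counts per contaminated pool, e.g. (3, 2).
    alpha:  individual false negative probability.
    Assumption: a pool with k carriers is negative with probability alpha**k.
    """
    return sum(k * alpha ** k for k in config)

if __name__ == "__main__":
    configs = [(1, 1, 1, 1, 1), (2, 1, 1, 1), (3, 2), (5,)]
    for alpha in (0.25, 0.33, 0.5):
        for config in configs:
            print(alpha, config, round(expected_missed(config, alpha), 4))
```

With one carrier per pool the expected number of missed carriers equals K·α, which is also what exhaustive individual testing would miss on average; any concentration of carriers lowers it.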

It is interesting to note that even without homophily, that is if carriers are i.i.d. in the different pools, it is rather likely that at least two carriers will be in the same pool, which will reduce the expected number of "missed" carriers. That is, group testing alone, without considering potential homophily, can help identify and isolate carriers who would be missed by exhaustive individual testing. To the best of my knowledge, this simple and striking result has been overlooked in the scientific and policy debates about group testing to fight epidemics. Given the documented high prevalence of individual false negatives for tests implemented all over the world to fight COVID-19, this fact alone may deserve careful consideration.
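How likely it is that two carriers share a pool without homophily can be gauged with a birthday-problem computation. The sketch below assumes that each of the K carriers falls independently and uniformly into one of m pools, which ignores the fixed capacity of pools and is only a reasonable approximation when pools are large relative to K. The function name and the parameter m are introduced here for illustration.

```python
from math import prod

def prob_shared_pool(K, m):
    """Probability that at least two of K carriers fall in the same pool.

    Assumption: carriers are assigned independently and uniformly
    to one of m pools (pool capacity is ignored).
    """
    if K > m:
        return 1.0  # pigeonhole: more carriers than pools
    all_distinct = prod((m - j) / m for j in range(K))
    return 1.0 - all_distinct

if __name__ == "__main__":
    for m in (5, 10, 20, 50):
        print(m, round(prob_shared_pool(5, m), 4))
```

For five carriers and ten pools the probability is about 0.70, and it falls as the number of pools grows, so the size of this effect depends on how many pools the population is split into.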

As already noted, in the presence of individual false negatives, implementing the same (failing) tests over the final phase(s) of a group testing process is useless, and individuals in positive groups should rather be isolated and submitted to medical examinations and tests based on more reliable samples.

This process induces a second cost related to groups that are positive in the first step. Homophily is crucial here. If it is possible to assess ex ante the homophily structure of contamination, to take that information into account when designing the pools, and thus to increase the concentration of carriers in a few pools, then the expected number of contaminated groups, and therefore the expected number of positive groups (groups that are identified as contaminated by the first step), are reduced sharply. The absolute gains are especially high for low values of α, since for low α, most contaminated groups are identified correctly.
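The second cost can be illustrated in the same way. The sketch below counts the expected number of pools sent to the second step, again assuming no dilution and that a pool with k carriers tests positive with probability 1 − α^k. Both the function name and this independence assumption are illustrative, and false positives are ignored.

```python
def expected_positive_groups(config, alpha):
    """Expected number of contaminated pools flagged by the first step.

    config: tuple of carrier counts per contaminated pool.
    alpha:  individual false negative probability.
    Assumption: a pool with k carriers is positive with probability 1 - alpha**k.
    """
    return sum(1.0 - alpha ** k for k in config)

if __name__ == "__main__":
    spread, concentrated = (1, 1, 1, 1, 1), (5,)
    for alpha in (0.25, 0.33, 0.5):
        gain = (expected_positive_groups(spread, alpha)
                - expected_positive_groups(concentrated, alpha))
        print(alpha, round(gain, 4))
```

The reduction in the number of pools to process when moving from (1 1 1 1 1) to (5) is larger at α = 0.25 than at α = 0.5, in line with the remark on low values of α.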

In this framework, determining the optimal pool size is different from the usual optimization introduced by Dorfman (1943) . Here, the optimal size increases as usual with the cost of missing a carrier, but also decreases with the cost of the second step.
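For reference, the classical two-stage computation of Dorfman (1943) can be sketched as follows. It assumes noiseless tests and an i.i.d. prevalence p, and it minimizes the expected number of tests per person, 1/n + 1 − (1 − p)^n, over the pool size n. This is only the usual baseline, not the optimization relevant to the framework above; the function names and the search bound `max_size` are choices made for illustration.

```python
def dorfman_tests_per_person(n, p):
    """Expected tests per person in two-stage Dorfman pooling.

    One pooled test per n people, plus n individual tests whenever
    the pool contains at least one carrier (noiseless tests).
    """
    return 1.0 / n + 1.0 - (1.0 - p) ** n

def dorfman_optimal_pool_size(p, max_size=100):
    """Pool size in [2, max_size] minimizing expected tests per person."""
    best = min(range(2, max_size + 1),
               key=lambda n: dorfman_tests_per_person(n, p))
    return best, dorfman_tests_per_person(best, p)

if __name__ == "__main__":
    for p in (0.001, 0.01, 0.05, 0.1):
        size, cost = dorfman_optimal_pool_size(p)
        print(p, size, round(cost, 4))
```

In this baseline the pool size depends only on prevalence, whereas in the present framework it also depends on the cost of missing a carrier and on the cost of the second step.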

The present paper shows that heterogeneous interpersonal interactions with more relations within specific groups (homophily) contain relevant information to help detect asymptomatic carriers of diseases such as COVID-19. Specifically, designing test pools that encompass potential small-scale clusters makes it possible to overcome information-theoretic limits on the minimal number of tests required to identify all carriers with a given (non-one) probability, even in a noiseless framework.

Still, the practical benefits of considering homophily for group testing may be related to noise in tests. Indeed, homophily may help counteract the detrimental dilution effect of group testing, which exists for COVID-19. Above all, group testing makes it possible to identify asymptomatic carriers who would be missed even by exhaustive individual testing, a benefit apparently unnoticed in the group testing literature. In this context, homophily helps increase the identification power while reducing the costs of the second step, which includes individual complementary examinations and tests on people belonging to positive groups.

From a policy perspective, homophily can be used to increase test efficiency and therefore to avoid unnecessary lockdowns. Conversely, since households are one of the main clusters, especially during lockdowns, a testing strategy based on homophily could prove very useful to identify household contamination and make a lockdown shorter and more efficient.

The present paper illustrates the interest of analyzing social interactions and their group structure to design testing strategies. I hope this work will encourage more research on group testing, false negatives, and the structure of social relationships in order to curb the COVID-19 epidemic and other epidemics with asymptomatic carriers. Madewell et al. (2020) suggest reporting contaminations by household. More generally, the cluster structure of contamination should be reported whenever possible, the links with contamination channels should be established, and general patterns should be identified in order to better implement the strategy analyzed in the present study.

The computations in the appendix are based on the simplifying assumption that the results of different tests for the same individual are uncorrelated, so that the probability of a false negative using different swabs is the product of the rates of false negatives for each type of swab. More work on the correlation of test results would be very useful to fine-tune this analysis. Also, the time dimension is not taken into account in the present document. Optimization tools could determine the optimal frequency together with other parameters (size of the pools, etc.), which is not the aim of the present analysis. Still, from the appendix, a first-order requirement is that the test procedure be implemented at least every fortnight, and even weekly if CT scans cannot be generalized for the second step, in order to fit the time profile of test positivity.

The approach analyzed here cannot fight the COVID-19 epidemic alone, and should be implemented in combination with other methods (such as track and trace). However, these other approaches, as well as more advanced testing methods recently proposed, cannot tackle false negatives specifically. Thus, the combination of group testing and homophily provides a unique opportunity to tackle this crucial issue in the fight against COVID-19.

As shown in Section 5, with false negatives, a test strategy with a first step based on RT-PCR tests on nasopharyngeal swabs and a second step based on a series of complementary exams can identify asymptomatic carriers better than exhaustive individual RT-PCR tests on nasopharyngeal swabs if multiple carriers are in the same pool. In Section 5, I examine the capacity of such a strategy to identify groups of carriers and I show that taking homophily into account decreases both the risk of missing individual carriers and the number of people on whom the second step must be implemented.

The present appendix does not tackle homophily and provides more technical details on the different steps of such a strategy, with the aim of identifying as many carriers as possible while reducing the costs as well as the sufferings, constraints and potential sequelae on tested people.

RT-PCR on nasal swabs identifies only 63% of COVID-19 carriers (Wang, Xu, Gao et al., 2020). The rate is higher for swabs collected over the first two weeks after the onset of the disease (Yang, Yang, Shen et al., 2020) but lower afterwards. Thus, one could consider using other swabs instead, or using a combination of different swabs and exams. CT scans and chest radiographs are available in limited numbers and must be performed individually. Thus, they are not suitable for group analysis, which is the cornerstone of the first step. Furthermore, CT scans or radiographs may induce sequelae if performed repeatedly over a long period.

Fibrobronchoscope brush biopsy and RT-PCR on bronchoalveolar lavage fluid require specific equipment and specifically trained operators to collect the samples. Although the RT-PCR or biopsy analyses could be grouped, sample collection is an individual and lengthy process. Thus, it cannot be implemented repeatedly over the whole population. Furthermore, the collection of bronchoalveolar lavage fluid is painful. The social acceptance of its large-scale repeated collection is unlikely for COVID-19, which has a rather limited mortality rate. 5

The following table describes the steps (column 2). In column 3, I compute the cumulative probability of identification of a carrier, given the probabilities documented by Wang, Xu, Gao et al. (2020), under the simplifying assumption that the results of different types of individual tests or exams are independent for a given carrier. Wang, Xu, Gao et al. (2020) measure the probability of a positive test for identified carriers for eight different types of swabs: sputum, nasal, (oro)pharyngeal, fibrobronchoscope brush biopsy, bronchoalveolar lavage fluid, blood and feces. In column 4, I compute the same cumulative probability given the probabilities documented by Yang, Yang, Shen et al. (2020) for swabs collected over the week following the onset of the disease for mild cases. Columns 5 and 6 provide the same computations for swabs collected 8 to 14 days after onset (d.a.o.) and after 15 d.a.o., respectively, also for mild cases.
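Under the independence assumption stated above, the cumulative probability of identification after a sequence of exams is one minus the product of the individual miss probabilities. The sketch below applies this to a two-exam sequence. The 63% figure for nasal swabs and the 28% availability of sputum come from the text; the 70% sputum sensitivity is a hypothetical placeholder chosen only to make the example run, and the function name is likewise illustrative.

```python
def cumulative_identification(sensitivities):
    """Cumulative probability of identifying a carrier after each exam.

    sensitivities: per-exam probability of a positive result for a carrier.
    Assumption: exam results are independent for a given carrier.
    """
    cumulative = []
    miss = 1.0
    for s in sensitivities:
        miss *= 1.0 - s              # carrier still undetected after this exam
        cumulative.append(1.0 - miss)
    return cumulative

if __name__ == "__main__":
    nasal = 0.63                     # from the text (Wang, Xu, Gao et al., 2020)
    sputum_availability = 0.28       # from the text (Huang et al., 2020)
    sputum_sensitivity = 0.70        # HYPOTHETICAL placeholder, not from the paper
    sputum_effective = sputum_availability * sputum_sensitivity
    print(cumulative_identification([nasal, sputum_effective]))
```

Multiplying the sensitivity of the sputum test by its availability mirrors the adjustment described in the comments of the table.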

The first step, based on nasal swabs, must be implemented at the very least every fortnight. Indeed, RT-PCR based on nasal swabs is efficient over the first two weeks after onset, so that a test implemented every fortnight is likely to identify asymptomatic carriers if they follow a pattern similar to mild cases. Conversely, the rate of false negatives soars afterwards. The efficiency of BALF is questionable for asymptomatic carriers. Thus, the second step should be based either on upper respiratory tract swabs alone or on a combination with CT scans or radiographs. If the latter are not widely available, it is crucial to implement complementary exams, and the whole procedure, weekly.

5 Whether sputum can be used in the first step instead of or together with nasal swabs is an open question. In the second step (individual complementary exams), if the sputum test is negative, the individual undergoes additional exams which are likely to discover the presence of the disease and may include painful bronchoalveolar lavage. Thus, cheating during the series of complementary exams is irrational.

Group Testing: An Information Theory Perspective

Roadmap to pandemic resilience: Massive scale testing, tracing, and supported isolation (TTSI) as the path to pandemic resilience for a free society

Boolean compressed sensing and noisy group testing

Presumed asymptomatic carrier transmission of COVID-19

The Capacity of Adaptive Group Testing

How Much Should We Trust Difference-in-Difference Estimates?

Robust Inference with Multi-Way Clustering

Non-adaptive probabilistic group testing with noisy measurements: Near-optimal bounds with efficient algorithms

A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster

Testing pooled samples for COVID-19 helps Stanford researchers track early viral spread in Bay Area

Elements of Information Theory

Covid-19: four fifths of cases are asymptomatic, China figures indicate

The detection of defective members of large populations

Group Testing against COVID-19

Flatten the Curve of Infection and the Curve of Recession at the Same Time

The transmission and diagnosis of 2019 novel coronavirus infection disease (COVID-19): A Chinese perspective

Act Now or Forever Hold Your Peace: Slowing contagion with Unknown Spreaders, Limited Cleaning Capacities, and Costless Measures

Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China

Group Testing with a Dilution Effect

Social and Economic Networks

Diffusion and contagion in networks with heterogeneous agents and homophily

Asymptomatic and Presymptomatic SARS-CoV-2 Infections in Residents of a Long-Term Care Skilled Nursing Facility

What do we know about SARS-CoV-2 transmission? A systematic review and meta-analysis of the secondary attack rate, serial interval, and asymptomatic infection

China's aggressive measures have slowed the coronavirus. They may not work in other countries

Getting Americans Back to Work (and School) with Pooled Testing

Group testing for coronavirus - called pooled testing - could be the fastest and cheapest way to increase screening nationwide

Friendship as a Social Process: A Substantive and Methodological Analysis

The fraction of influenza virus infections that are asymptomatic: a systematic review and meta-analysis

Early transmission dynamics in Wuhan, China, of novel Coronavirus-infected pneumonia

The characteristics of household transmission of COVID-19

Risk factors associated with COVID-19 infection: a retrospective cohort study based on contacts tracing

Pooling of Samples for testing for SARS-CoV-2 in asymptomatic people

Household transmission of SARS-CoV-2: a systematic review and meta-analysis of secondary attack rate

Error-correcting nonadaptive group testing with d^e-disjunct matrices

The mathematical strategy that could transform coronavirus testing

Birds of a Feather: Homophily in Social Networks

Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship

Random Group Effects and the Precision of Regression Estimates

An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units

Estimation of the asymptomatic ratio of novel coronavirus infections (COVID-19)

Closed environments facilitate secondary transmission of coronavirus disease 2019 (COVID-19)

Coronavirus cases have dropped sharply in South Korea. What's the secret to its success?

A family cluster of Middle East respiratory syndrome coronavirus infections related to a likely unrecognized asymptomatic or mild case

Importation and Human-to-Human Transmission of a Novel Coronavirus in Vietnam

Optimal COVID-19 Quarantine and Testing Policies

Clinical and epidemiological features of 36 children with coronavirus disease 2019 (COVID-19) in Zhejiang, China: an observational cohort study

Transmission of 2019-nCoV infection from an asymptomatic contact in Germany

Transmission potential of SARS-CoV-2 in viral shedding observed at the University of Nebraska Medical Center

Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan

Detection of SARS-CoV-2 in different types of clinical specimens

Group testing regression models with dilution submodels

Asymptomatic SARS Coronavirus infection among healthcare workers

High proportion of asymptomatic and presymptomatic COVID-19 infections in travelers and returning residents to Brunei

Comparison of nasopharyngeal and oropharyngeal swabs for SARS-CoV-2 detection in 353 patients received tests with both specimens simultaneously

Clinical findings in a group of patients infected with the 2019 novel coronavirus SARS-Cov-2 outside of Wuhan, China

Evaluating the accuracy of different respiratory specimens in the laboratory diagnosis and monitoring the viral shedding of 2019-nCoV infections

RT-qPCR test in multi-sample pools

Follow-up of the asymptomatic patients with SARS-CoV-2 infection

SARS-CoV-2 viral load in upper respiratory specimens of infected patients

Table: Cumulative probability of identification of a disease carrier, by step, with comments (numerical cells not recoverable from the extraction).

Comment on sputum: Sputum is available only for 28% of carriers (Huang C, Wang Y, Li X, et al., 2020). This percentage of availability is applied to the percentage of positive identification in columns 3 to 6. When available, RT-PCR is highly positive for actual carriers.

Comment on lower respiratory samples: Requires trained operator and specific suction device; the procedure is painful for the patient. Very efficient according to Wang, Xu, Gao et al. (2020) but inefficient for mild cases according to Yang, Yang, Shen et al. (2020).

2g CT scan: "? High %" in columns 3 to 6. CT scan is implemented by Yang, Yang, Shen et al. (2020) for 3 carriers, including one without positive results for any previous RT-PCR test on upper or lower respiratory tract swabs. Results are positive but the sample is too small.

Note: Computations of cumulative probability of identification are based on the assumption that the probabilities of identification at different steps are uncorrelated.