key: cord-0990561-4ej6f8qo
authors: Jain, S.; Jonasson, J. O.; Pauphilet, J.; Flower, B.; Moshe, M.; Fontana, G.; Satkunarajah, S.; Tedder, R.; McClure, M.; Ashrafian, H.; Elliott, P.; Barclay, W. S.; Atchison, C. J.; Ward, H.; Cooke, G.; Darzi, A.; Ramdas, K.
title: A new combination testing methodology to identify accurate and economical point-of-care testing strategies
date: 2021-06-16
journal: nan
DOI: 10.1101/2021.06.15.21257351
sha: ef41a8b91b33f6fd767d5cd7a40959be3ff4b379
doc_id: 990561
cord_uid: 4ej6f8qo

Background: Quick, cheap and accurate point-of-care testing is urgently needed to enable frequent, large-scale testing to contain COVID-19. Lateral flow tests for antigen and antibody detection are an obvious candidate for use in community-wide testing, because they are quick and cheap relative to lab-processed tests. However, their low accuracy has limited their adoption. We develop a new methodology to increase the diagnostic accuracy of a combination of cheap, quick and inaccurate index tests with correlated or discordant outcomes, and illustrate its performance on commercially available lateral flow immunoassays (LFIAs) for SARS-CoV-2 antibody detection.

Methods and Findings: We analyze laboratory test outcomes of 300 serum samples from health care workers with PCR-confirmed SARS-CoV-2 infection at least 21 days prior to sample collection, and 500 pre-pandemic serum samples, from a national seroprevalence survey, tested using eight LFIAs (Abbott, Biosure/Mologic, Orientgene-Menarini, Fortress, Biopanda I, Biopanda II, SureScreen and Wondfo) and hybrid DABA as reference test. For each of 16 two-test combinations (e.g., Abbott, Fortress) and 14 three-test combinations (e.g., Abbott, Fortress, Biosure/Mologic) used on at least 100 positive and 100 negative samples, we classify an outcome sequence -- e.g., (+,-) for (Abbott, Fortress) -- as positive if its combination positive predictive value (CPPV) exceeds a given threshold, set between 0 and 1.
Our main outcome measures are the sensitivity and specificity of different classification rules for classifying the outcomes of a combination test. We define testing possibility frontiers, which represent sensitivity and false positive rates for different thresholds. The envelope of these frontiers further enables test selection. The eight index tests individually meet neither the UK Medicines and Healthcare Products Regulatory Agency's (MHRA) 98% sensitivity and 98% specificity criterion, nor the US Centers for Disease Control and Prevention's (CDC) 99.5% specificity criterion. Among these eight tests, the highest single-test LFIA specificity is 99.4% (with a sensitivity of 65.2%) and the highest single-test LFIA sensitivity is 93.4% (with a specificity of 97.4%). Using our methodology, a two-test combination meets the MHRA's criterion, achieving a sensitivity of 98.4% and a specificity of 98.0%. While two-test combinations meeting the CDC's 99.5% specificity criterion have sensitivity below 83.6%, a three-test combination delivers a specificity of 99.6% and a sensitivity of 95.8%.

Conclusions: Current CDC guidelines suggest combining tests, noting that 'performance of orthogonal testing algorithms has not been systematically evaluated' and highlighting discordant outcomes. Our methodology combines available LFIAs to meet desired accuracy criteria, by identifying testing possibility frontiers which encompass benchmarks, enabling cost savings. Our methodology applies equally to antigen testing and can greatly expand testing capacity by combining less accurate tests, especially for use cases needing quick, accurate tests, e.g., entry to public spaces such as airports, nursing homes or hospitals.

Quick, cheap and accurate point-of-care (POC) antigen and antibody tests are urgently needed to enable frequent, large-scale testing to contain COVID-19.
[1] [2] [3] Accurate rapid antigen tests could allow for significant scale-up of the frequency and scope of diagnostic testing [4] [5] [6] and enable quick and accurate testing in use cases such as entry to public spaces including airports, nursing homes and hospitals. Accurate POC antibody tests will allow for large-scale serological surveys to estimate prevalence, [7] identify individuals who receive false negative RT-PCR results [8] or have high quality convalescent plasma, and enable redeployment of recovered individuals into the community. [9] Yet, the accuracy of existing lateral flow antigen and antibody tests is low. We contribute a new and universal methodology to improve diagnostic accuracy, which relies on using multiple index tests, and we illustrate this methodology in the context of COVID-19 POC antibody tests.

For POC antibody detection, lateral flow immunoassays (LFIAs) are an obvious candidate for use in community-wide testing, because they are quick and cheap relative to lab-processed antibody tests (e.g., ELISAs). However, their low accuracy has limited their adoption. [1, [10] [11] [12] Many LFIAs meeting the US Centers for Disease Control and Prevention's (CDC) 99.5% specificity criterion have low sensitivity. Similarly, few LFIAs have won approval by meeting the UK Medicines and Healthcare Products Regulatory Agency's (MHRA) 98% sensitivity and 98% specificity criteria. [13]

Combination testing - i.e., assessing each sample using more than one test - can in theory increase the accuracy of tests whose outcomes are independent. [14] The CDC's current guidelines for COVID-19 antibody testing suggest combining tests, stating that "the performance of orthogonal testing algorithms has not been systematically evaluated but can be estimated using an online calculator from the FDA", 1 which assumes independent test outcomes.
The FDA guidelines highlight that results can be "discordant" biologically - for instance, when the antigens (e.g., spike protein or nucleocapsid protein) or the Ig classes (total Ig, IgG, or IgM) detected in two separate tests are different. There are no guidelines on how to handle correlated or discordant results. However, in practice, outcomes of tests of different formats which detect the same antibodies or use the same substrate are likely to be correlated but not identical, generating discordant results. Any combination of tests requires a method to classify each outcome sequence - e.g., (+, +) or (+, -) for a particular two-test combination - as positive or negative. Existing methods include the 'majority', 'believe the positive' (i.e., 'any'), and 'believe the negative' (i.e., 'all') heuristic rules, or in some cases additional arbitrator tests. [15] [16] [17] [18] [19] [20] [21] Instead, we classify each outcome sequence based on its combination positive predictive value, and identify the testing possibility frontier, i.e., the set of achievable sensitivity and specificity parameters for a combination test. We compare this method to existing benchmarks using data from a national seroprevalence study. [13] Our methodology complements prior literature on the utility of cheap and less accurate tests [22, 23] and on simple combination testing, which assumes classification heuristics such as any, all or majority. [19] [20] [21]

Our study sample comprises 300 antibody positive and 500 pre-pandemic antibody negative serum samples from the REACT-2 study, which evaluated the accuracy and suitability of commercially available SARS-CoV-2 antibody LFIAs for a national seroprevalence survey based on a random population sample. [13] We consider eight index tests (see Table 1, Column 2), each performed on a subset of our study sample.
The antibody positive samples were collected in May 2020 from adult National Health Service (NHS) workers who had tested positive for SARS-CoV-2 through a nasopharyngeal swab PCR test, or had shown symptoms, at least 21 days prior to sample collection, and who had not been hospitalized. Hybrid DABA (a hybrid spike protein receptor binding domain double antigen binding assay, detecting antibody to RBD) served as the reference test. The antibody negative samples are from 2018, collected during the Airwaves study. [24] Sample sizes were determined in the prior study. [13] The REACT-2 study had significant patient and public input. We will disseminate the results of our study to participants, once published.

Index tests were performed by laboratory technicians blind to the participants' underlying condition, according to manufacturers' instructions, and each result was scored as either positive or negative for IgG. Since the data was originally collected to evaluate the diagnostic performance of each index test, data collection was not designed to prioritize applying each test to each sample. As a result, our analysis relies on fewer observations for each combination of tests. Specifically, we consider two-test and three-test combinations whose index tests were each evaluated on at least 100 antibody positive and at least 100 antibody negative samples. More observations for each combination of tests (e.g., through prospective data collection) would result in smaller confidence intervals for CPPVs and for combination testing sensitivities and specificities (reported in Supplementary information sections 2 and 3). We calculate the sensitivity and specificity of each index test against the DABA reference test, with 95% confidence intervals using the Wilson method. [25] Supplementary information section 1 includes the pairwise correlation of outcomes (as well as of true positives, false positives, true negatives, and false negatives).
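The Wilson interval used throughout can be sketched as follows; a minimal illustration (the function name and the 280-out-of-300 counts are made up for the example, not taken from the study):

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion, e.g., a sensitivity
    estimated as (number of true positives) / (number of condition-positives)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half, centre + half)

# Hypothetical example: 280 of 300 condition-positive samples test positive.
lo, hi = wilson_ci(280, 300)   # interval around 280/300, roughly 93.3%
```

Unlike the normal-approximation (Wald) interval, the Wilson interval stays within [0, 1] and behaves sensibly for proportions near 0 or 1, which matters when estimating specificities near 99.5%.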
Applying two (three) index tests to the same subject can result in four (eight) outcome sequences, e.g., (+, +), (+, -), (-, +), or (-, -) for a two-test combination. A useful combination testing methodology must classify each outcome sequence as either positive or negative, in a data-driven way that leverages the fact that some of the index tests are more accurate than others and that accounts for correlations amongst the test outcomes. Given any combination test, our simple, data-driven, four-step methodology classifies a sample as either positive or negative, depending on the outcome sequence obtained (see Figure 1).

Step 1. Extending the concept of the PPV of a single test, we calculate the combination positive predictive value (CPPV) of each outcome sequence as the number of cases with that outcome sequence that are condition-positive, divided by the total number of cases with that outcome sequence. Supplementary information sections 2 and 3 include the CPPVs of all outcome sequences for all combinations of two or three tests, along with confidence intervals.

Step 2. We rank outcome sequences by their CPPV. The highest-CPPV outcome sequence has the highest proportion of true positives.

Step 3. We use a CPPV threshold between 0 and 1 to classify each outcome sequence: outcome sequences with CPPVs at or above the threshold are classified as positive, and the rest as negative.

This preprint (which was not certified by peer review) is made available under a CC-BY-ND 4.0 International license. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version was posted June 16, 2021.

Step 4.
For each threshold value, we calculate the 'implied' sensitivity and specificity of the combination test in the traditional manner. A higher (lower) threshold will emphasize specificity (sensitivity). We represent the distinct (sensitivity, specificity) possibilities for each combination test through a testing possibility frontier, which gives the maximum achievable sensitivity (specificity) for a given level of specificity (sensitivity). Points on the frontier dominate those within it on at least one of these two dimensions.

This methodology has three main advantages. First, it enables flexibly choosing a threshold for a given combination of tests, to emphasize high sensitivity, high specificity, or both. It also generates an envelope of the frontiers of multiple combination tests, enabling even greater flexibility in test selection. Second, it combines tests with different predictive accuracy in a data-driven way. By ranking outcome sequences by CPPV, it essentially places more weight on more accurate index tests when classifying outcome sequences as positive or negative. Third, unlike the formula provided on the CDC website as of September 25, 2020, which gives the maximum possible sensitivity advantage for a combination test 2 assuming independent outcomes, we develop data-driven implied sensitivity and specificity estimates that reflect the inherent correlations. High correlation results in minimal improvement in sensitivity/specificity over the index tests.

We compare the accuracy obtained using our threshold rule with that of three common benchmark heuristics. [18] The ANY heuristic classifies an outcome sequence as positive if any of its index tests returns a positive outcome. The ALL heuristic classifies an outcome sequence as positive only if all of its index tests return a positive outcome. The MAJORITY heuristic classifies an outcome sequence as positive only if a majority of its index tests return a positive outcome.
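As a sketch, the three benchmark heuristics amount to fixed rules on an outcome sequence (representing outcomes as '+'/'-' symbols is our own illustrative convention, not the study's notation):

```python
def classify_any(seq):
    """ANY ('believe the positive'): positive if any index test is positive."""
    return '+' in seq

def classify_all(seq):
    """ALL ('believe the negative'): positive only if every index test is positive."""
    return all(outcome == '+' for outcome in seq)

def classify_majority(seq):
    """MAJORITY: positive only if a strict majority of index tests are positive."""
    return seq.count('+') > len(seq) / 2
```

Each heuristic fixes one classification of every outcome sequence, and hence yields a single (sensitivity, specificity) point per combination, in contrast to the menu of points produced by sweeping the CPPV threshold.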
2 https://www.cdc.gov/coronavirus/2019-ncov/lab/resources/antibody-tests-guidelines.html

These benchmark heuristics have two significant drawbacks relative to the threshold rule. First, they treat all tests in a combination as equal, regardless of their predictive accuracy. Thus, applying the MAJORITY heuristic to a combination test with one accurate and two inaccurate index tests will worsen accuracy. In contrast, our methodology assigns greater weight to tests with higher predictive accuracy. Second, each benchmark heuristic results in a single set of sensitivity and specificity values. By design, the ANY heuristic often results in high sensitivity (and low specificity) and the ALL heuristic in high specificity (and low sensitivity). In contrast, our methodology offers the policy maker a menu of attainable sensitivity-specificity values given any combination of index tests, so that the same tests can be used for populations with different accuracy needs.

The sensitivity and specificity of the eight index tests, using DABA as reference test, are reported in Table 1. Specificity is generally high (97.1% to 99.4%) and sensitivity low (65.2% to 93.4%). Moreover, the highest specificity (sensitivity) index test had the lowest sensitivity (specificity). No single test meets the CDC or MHRA criteria. Test outcomes are highly, but not perfectly, correlated (average pairwise correlation: 0.75; range: 0.57 to 0.89).

Figure 1 illustrates our four-step methodology, using a combination of two index tests - test 6 (Biopanda II) and test 7 (Surescreen). Each of the four possible outcome sequences (+, +), (+, -), (-, +), and (-, -) provides a threshold value that generates a combination test strategy with distinct sensitivity and specificity.
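The four steps can be sketched end-to-end as follows; a minimal illustration on labeled data (the function name and the synthetic counts in the test data are ours, not the study's):

```python
from collections import Counter

def cppv_frontier(results, truth):
    """
    results: one outcome sequence per sample, e.g., ('+', '-')
    truth:   1 if the sample is condition-positive, else 0
    Returns (threshold, sensitivity, specificity) points: for each candidate
    threshold, outcome sequences with CPPV >= threshold are classified positive.
    """
    total = Counter()          # samples observed per outcome sequence
    positive = Counter()       # condition-positives per outcome sequence
    for seq, y in zip(results, truth):
        total[seq] += 1
        positive[seq] += y
    # Step 1: CPPV of a sequence = condition-positives / samples with that sequence.
    cppv = {seq: positive[seq] / total[seq] for seq in total}

    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    points = []
    # Steps 2-4: sweep thresholds (each distinct CPPV value), classify each
    # sequence, and compute the implied sensitivity and specificity.
    for t in sorted(set(cppv.values())):
        tp = sum(positive[s] for s in total if cppv[s] >= t)
        fp = sum(total[s] - positive[s] for s in total if cppv[s] >= t)
        points.append((t, tp / n_pos, (n_neg - fp) / n_neg))
    return points
```

On synthetic data where (+, +) is usually condition-positive and (-, -) never is, raising the threshold trades sensitivity for specificity, tracing out the testing possibility frontier.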
Ignoring the lowest threshold value (which trivially classifies all samples as positive), we observe that combining these two tests can achieve testing strategies with (sensitivity, specificity) values ranging from a high-sensitivity strategy to a high-specificity one; the former meets the MHRA's but not the CDC's criteria, while the latter meets the CDC's criteria, albeit with very low sensitivity, but not the MHRA's. In particular, a threshold between 0.01 and 0.80 classifies outcome sequences in which either test returns a positive outcome (corresponding to the ANY heuristic) as positive, for a sensitivity of 98.4% and a specificity of 98.0%.

For ease of viewing, we ignore thresholds which result in sensitivity or specificity below 50%. To gain intuition, in Figure 2 (bottom) we include the results of the ANY and ALL heuristics (the MAJORITY heuristic requires three or more tests) for the same 16 two-test combinations. The marker points with highest sensitivity generally correspond to the ANY heuristic and those with highest specificity to the ALL heuristic. No benchmark improves on a strategy obtained through our methodology. Figure 2 also shows that (6, 7) - i.e., (Biopanda II, Surescreen) - is the only two-test combination that meets the MHRA criteria, while (2, 8) and (6, 8) - i.e., (Fortress, Abbott) and (Biopanda II, Abbott) - achieve an estimated sensitivity above 98% but a specificity above only 96%. Our point estimates indicate that the CDC's 99.5% specificity criterion is met by 15 out of the 16 two-test combinations, for high-enough thresholds (see full details in Supplementary information section 2). Only outcome sequences with suitably high CPPVs are classified as positive in this case.
However, the resulting sensitivities are very low: the highest sensitivity achievable while meeting the CDC criterion is 83.7% [CI: 80.8%-86.2%], for the combination test (2, 7), i.e., (Fortress, Surescreen).

For illustration, we consider test 6 (Biopanda II) and test 7 (Surescreen) as before, adding test 8 (Abbott). Figure 4 shows that for this three-test combination, no benchmark heuristic (ANY/ALL/MAJORITY) achieves sensitivity higher than 94.8% together with specificity above 98%. The threshold rule achieves higher sensitivity because, at a threshold of 0.6, it classifies the outcome sequence (-, +, -) as positive, despite two negative index tests.

Figure 5 presents the testing possibility frontiers for the 14 three-test combinations that satisfy our inclusion criteria on sample size for the index tests (all estimates and their confidence intervals are included in Supplementary information section 3). Again, the marked points correspond to distinct threshold values on each frontier, and we ignore threshold values which result in sensitivity or specificity below 50%. Only one three-test combination - of index tests 2 (Fortress), 6 (Biopanda II), and 7 (Surescreen) - achieves point estimates that meet the MHRA criteria. At a threshold of 0.67, we obtain a sensitivity/specificity of (98.4% [CI: 97.1%-99.1%], 98.0% [CI: 96.7%-98.9%]).
Interestingly, there is no improvement in predictive accuracy over the two-test combination of tests 6 and 7. In fact, no three-test combination performs better than this two-test combination in meeting the MHRA's accuracy criteria. Adding a third test is of more value in meeting the CDC's 99.5% specificity criterion. All 14 three-test combinations included in our analysis can achieve an estimated specificity of 99.5%. However, with a three-test combination, the achievable sensitivity at that specificity is significantly higher. Specifically, combining index tests 2 (Fortress), 6 (Biopanda II), and 8 (Abbott) and applying a threshold of 0.88 (which coincidentally corresponds to the MAJORITY heuristic) achieves an estimated sensitivity and specificity of (95.8%, 99.6%).

While many of the classification thresholds we identify as promising for two-test and three-test combinations correspond to the benchmark heuristics, our results demonstrate the value of exploring other classification schemes. As an example, the ANY, MAJORITY, and ALL heuristics applied to the three-test combination of tests 6 (Biopanda II), 7 (Surescreen), and 8 (Abbott) each yield only a single sensitivity/specificity pair. Our threshold-based classification methodology results in 112 achievable sensitivity/specificity values on the testing possibility frontiers associated with the 14 three-test combinations included in our analysis. While the MAJORITY heuristic represents one point on each frontier, the ANY (ALL) heuristic fails to appear on the frontier for 5 (4) of the 14 combination tests, because it is strictly dominated by points on the frontier.
Thus, even when the goal is to maximize sensitivity (specificity), for which applying the ANY (ALL) heuristic may seem reasonable, our threshold-based methodology often achieves better results.

We develop and illustrate a methodology for combining diagnostic tests to simultaneously achieve high sensitivity and high specificity, even with index tests that are vastly inferior to the best available. The objective of this methodology is not to compare individual index tests and their ability to identify various antigen markers, but to combine index tests for better diagnostic accuracy. We use combination positive predictive value (CPPV) estimates to classify each possible outcome sequence of a combination test, accounting for correlated or discordant outcomes. For any combination test, we are able to evaluate the common ANY, ALL, and MAJORITY heuristics against the 'testing possibility frontier', which represents the trade-off between achievable sensitivity and specificity for that combination. Our approach reveals when benchmark heuristics are inferior. From the 'envelope' of frontiers we generate, policy makers can choose points on the envelope - each representing a distinct sensitivity/specificity combination - ideal for particular subpopulations. Volume discounts can accrue by using the same index tests for different subpopulations, e.g., with a threshold chosen for higher sensitivity for older individuals (for whom false negatives can be costly) than for frontline workers (for whom false positives are costly). Strikingly, we observe large gains in accuracy by combining just two or three LFIAs, an approach that is cheaper than a laboratory-based test.
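The frontier and envelope constructions described above reduce to filtering out dominated (sensitivity, specificity) points; a minimal sketch (the function name and the example values are ours):

```python
def pareto_frontier(points):
    """
    points: (sensitivity, specificity) pairs, e.g., one per threshold value;
    pooling points from several combination tests yields their 'envelope'.
    Returns the non-dominated points, sorted by increasing sensitivity.
    A point is dominated if another is at least as good on both axes and
    strictly better on at least one.
    """
    # Sort by sensitivity (then specificity) descending; sweep, keeping a
    # point only when it strictly improves on the best specificity seen so far.
    frontier, best_spec = [], -1.0
    for sens, spec in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if spec > best_spec:
            frontier.append((sens, spec))
            best_spec = spec
    return sorted(frontier)

# Hypothetical threshold and heuristic points for one combination test:
pts = [(0.98, 0.98), (0.99, 0.90), (0.84, 0.995), (0.98, 0.97), (0.90, 0.98)]
# (0.98, 0.97) and (0.90, 0.98) are dominated by (0.98, 0.98).
```

In this framing, a heuristic such as ANY or ALL "fails to appear on the frontier" exactly when its point is filtered out by this dominance check.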
The cost of additional tests also needs to be weighed against the cost of a false result - at under $10 per LFIA, offering a three-test kit is likely to be far less costly than a false result. An LFIA requires little serum, so blood samples can be shared across hospital systems and even countries to identify the best LFIA combinations. FIND (Foundation for Innovative New Diagnostics, a WHO Collaborating Center) and the CDC are evaluating numerous SARS-CoV-2 immunoassays on standardized samples. Analyzing this data using our methodology would identify the best available combinations of sensitivity and specificity attainable, so that policymakers can choose optimal combinations for specific populations. This approach would overcome limitations of small sample size and enable more precise parameter estimation. Future research can examine larger samples, and subpopulations with different prevalence rates.

Our methodology also applies when using the same LFIA test multiple times on a sample. Tests with low kappa values are useful candidates for repeat testing, as their outcomes have low correlation. Beyond LFIAs, our threshold-based classification methodology applies to any diagnostic tests that can be used in combination, e.g., quick, cheap and inaccurate rapid antigen tests for COVID-19, [27] and would enable quick and accurate testing in use cases such as entry to public spaces including airports, nursing homes and hospitals. It can even be used to combine readings from infrared thermometers that rapidly screen for body temperature to permit entrance into buildings.

With our approach, test development is no longer a high-stakes winner-take-all game, though clearly continued development of better single tests is desirable. 'Close enough' can be good enough - in combination with other tests. Test developers can cooperate and develop complementary tests whose outcomes are less correlated, rather than compete in the same space. Clinicians often informally combine tests, without considering underlying correlations between test outcomes. Our methodology codifies this practice and provides a pathway that the FDA can use for evaluation and approval of combination tests. Home self-testing for SARS-CoV-2 antibodies using LFIAs is feasible. [28] With a pathway in place, combination test kits can be offered for home use, with clear instructions on how to interpret the results, e.g., "read as positive only if both tests give a positive result". Another option might be to create a single test kit which combines two or three tests "under the hood", with no extra effort required on the consumer's part. This would enable the widespread, frequent community testing urgently needed to curb COVID-19.

SI 1: Correlation between index tests

We examine the correlation of outcomes, true positives, false positives, false negatives, and true negatives between the eight index tests in Tables S1, S2, S3, S4, and S5, respectively.
SI 2: Two-test combination performance

Table 6 provides statistics for each two-test combination that meets the criteria of both tests having been applied to at least 100 antibody negative and 100 antibody positive samples. The panel on the left includes the CPPV estimates for each outcome sequence (along with the number of observations), and the panel on the right describes the possible range of (Sensitivity, Specificity) values and the associated threshold values that produce them. All confidence intervals are calculated using the Wilson method, reflecting the uncertainty in a given ratio as a function of the sample size.

SI 3: Three-test combination performance

Table 7 provides statistics for each three-test combination that meets the criteria of all tests having been applied to at least 100 antibody negative and 100 antibody positive samples. The panel on the left includes the CPPV estimates for each outcome sequence (along with the number of observations), and the panel on the right describes the possible range of (Sensitivity, Specificity) values and the associated threshold values that produce them. All confidence intervals are calculated using the Wilson method, reflecting the uncertainty in a given ratio as a function of the sample size.
References

Diagnostic accuracy of serological tests for covid-19: systematic review and meta-analysis
False Negative Tests for SARS-CoV-2 Infection - Challenges and Implications
Rational use of SARS-CoV-2 polymerase chain reaction tests within institutions caring for the vulnerable
Evaluation of the accuracy, ease of use and limit of detection of novel, rapid, antigen-detecting point-of-care diagnostics for SARS-CoV-2. medRxiv
Clarifying the evidence on SARS-CoV-2 antigen rapid tests in public health responses to COVID-19
Rapid coronavirus tests: a guide for the perplexed
Fundamental principles of epidemic spread highlight the immediate need for large-scale serological surveys to assess the stage of the SARS-CoV-2 epidemic
Antibody responses to SARS-CoV-2 in patients with COVID-19
Modeling shield immunity to reduce COVID-19 epidemic spread
"AbC-19 Rapid Test" for detection of previous SARS-CoV-2 infection in key workers: test accuracy study
COVID-19 spread in the UK: the end of the beginning?
Diagnostic testing for severe acute respiratory syndrome-related coronavirus-2: A narrative review
Clinical and laboratory evaluation of SARS-CoV-2 lateral flow assays for use in a national COVID-19 seroprevalence survey
'Test, re-test, re-test': using inaccurate tests to greatly increase the accuracy of COVID-19 testing
Methods for calculating sensitivity and specificity of clustered data: a tutorial
Statistical combination schemes of repeated diagnostic test data

Acknowledgments

The authors thank Anna Daunt, Anjna Badhan, Jonathan Brown, Rebecca Frise, Ruthiran Kugathasan, and Srishti Katuri for research assistance, and Maitreyee Hazarika and Sadhana Kapur for useful discussions. Interpretation, critical review and feedback came from GF, SS, RT, BF, Maya M, Myra M, HA, PE, WB, CA, HW, GC and AD. KR led the research team.