key: cord-0280094-rbvsobgw authors: Beigel, R.; Webber, M. J. title: A Partition-Based Group Testing Algorithm for Estimating the Number of Infected Individuals date: 2021-07-27 journal: nan DOI: 10.1101/2021.07.27.21260924 sha: d5ffe618dfdd6947e9046f17f258d6648e804c38 doc_id: 280094 cord_uid: rbvsobgw

The dangers of COVID-19 remain ever-present worldwide. The asymptomatic nature of COVID-19 obscures the signs policy makers look for when deciding whether to reopen public areas or extend quarantine. In much of the world, testing resources are scarce, creating a need for efficient testing of potentially infected individuals. This report presents an advancement to Beigel and Kasif's Approximate Counting Algorithm (ACA). ACA estimates the infection rate with a number of tests that is logarithmic in the population size. Our newer version of the algorithm provides an extra level of efficiency: each subject is tested exactly once. A simulation of the algorithm, created for and presented as part of this paper, can be used to find a linear regression of the results with R^2 > 0.999. This allows stakeholders and members of the biomedical community to estimate infection rates for varying population sizes and ranges of infection rates.

The COVID-19 pandemic appears to be abating in the United States, thanks to the availability of vaccines and testing materials. To date, over 500 million individual coronavirus tests have been performed in the US. Meanwhile, Mexico and Bangladesh, both heavily impacted by the pandemic, have yet to exceed even 8 million tests [1]. Demand for testing materials remains high worldwide as production is stretched thin. With the emergence of the more infectious delta variant of the coronavirus [2], there is an impetus to find the number of infected individuals within a given population.
medRxiv preprint doi: https://doi.org/10.1101/2021.07.27.21260924; this version posted July 27, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

Group disease testing has been used to find the set of positives in a testing population with fewer tests ever since Dorfman established the method during World War II for testing prospective draftees for syphilis antigens. If the samples from a group of individuals are pooled and the pooled sample tests negative, then those individuals are all negative. Where disease rates are low, most of the negatives can be identified with few tests. Group testing is therefore an economical alternative to individual testing in these cases [3, 4, 5, 6]. Scientists in the United States, Israel and Germany have been using group testing since 2020 to detect the presence of COVID-19 in test groups. Computer-run algorithms have helped to mitigate the challenges to identification presented by multiple infected individuals [7]. One goal of such an algorithm is to count approximately the number of infected individuals using probabilistic

We present a new version of the ACA algorithm which tests each subject exactly once. Then, we simulate both versions of the algorithm. We will show how ⌈log₂(n+1)⌉ tests can be used to approximate the number of infected individuals.

For a set S of n test subjects, we take one sample from each subject and label the samples 1 through n. A small number, k, of the subjects are infected. The subjects are then randomly assigned to groups in one of two ways, one for each version of the algorithm:
• ACAI: Sample independent random subsets of size ⌈n/2⌉, ⌈n/4⌉, ⌈n/8⌉, ⌈n/16⌉, ..., 1, each drawn without replacement (as with a Fisher-Yates "out of a hat" shuffle).
• ACAP: Randomly partition the subjects into disjoint subsets of size ⌈n/2⌉, ⌈⌊n/2⌋/2⌉, ⌈⌊n/4⌋/2⌉, ⌈⌊n/8⌋/2⌉, ⌈⌊n/16⌋/2⌉, …, 1.

In both cases, test all groups and count the number of groups in which one or more members test positive. Let Y be the random variable counting the positive groups. Then E(Y) ≈ log₂(k), which suggests that E(2^Y) ≈ k. Only ⌈log₂(n+1)⌉ tests must be performed, and they can be performed in parallel.

The simulation we used to test the efficacy of the algorithm was written in Perl and consists of 4 files:
• generateTestData.pl generates a .csv file with rows of binary strings. Each row is n bits long, and k of those bits are 1s (representing infected individuals). The remaining 0s represent the uninfected. The positions of the 1s and 0s are randomly determined.
• ACAI.pl runs a trial of ACAI for each row of the .csv file, returning the sample means of Y and 2^Y.
• ACAP.pl runs a trial of ACAP for each row of the .csv file, returning the sample means of Y and 2^Y.
• driver.pl runs generateTestData.pl and either ACAI.pl or ACAP.pl for a range of k values. Its output is a .csv file recording the sample mean and sample variance of Y and 2^Y for every k.

Using Excel, we graph the recorded k values against the sample mean of 2^Y for each algorithm, compare the efficiency of both ACA variants, and use a linear regression to determine how closely E(2^Y) and k are correlated. The parameters used are n = 1023 and 10000 trials per k. The k values range from 1 to 50 (an infection rate of about 0.1% to 5%).
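The grouping and counting steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' Perl code: the function names are hypothetical, the infected subjects are drawn directly rather than read from a .csv file, and the ACAI variant is assumed to use the same number of subsets as ACAP, namely ⌈log₂(n+1)⌉.

```python
import random


def acap_group_sizes(n):
    """ACAP partition sizes: ceil(n/2), ceil(floor(n/2)/2), ..., 1."""
    sizes = []
    remaining = n
    while remaining > 0:
        size = (remaining + 1) // 2   # ceil(remaining / 2)
        sizes.append(size)
        remaining -= size             # floor(remaining / 2) subjects remain
    return sizes


def acai_group_sizes(n):
    """ACAI subset sizes ceil(n/2), ceil(n/4), ..., 1.

    Assumption: one subset per ACAP group, i.e. n.bit_length() subsets,
    which equals ceil(log2(n + 1)) for n >= 1.
    """
    return [-(-n // 2 ** i) for i in range(1, n.bit_length() + 1)]


def count_positive_groups(n, k, variant, rng):
    """Run one trial and return Y, the number of groups that test positive."""
    infected = set(rng.sample(range(n), k))
    if variant == "ACAP":
        order = list(range(n))
        rng.shuffle(order)            # random order, then cut into disjoint groups
        groups, start = [], 0
        for size in acap_group_sizes(n):
            groups.append(order[start:start + size])
            start += size
    else:  # "ACAI": each subset is drawn independently, without replacement
        groups = [rng.sample(range(n), size) for size in acai_group_sizes(n)]
    # A pooled test is positive when the group holds at least one infected subject.
    return sum(1 for group in groups if any(s in infected for s in group))


def sample_means(n, k, trials, variant, seed=0):
    """Return (sample mean of Y, sample mean of 2^Y) over the given trials."""
    rng = random.Random(seed)
    ys = [count_positive_groups(n, k, variant, rng) for _ in range(trials)]
    return sum(ys) / trials, sum(2 ** y for y in ys) / trials
```

A call such as `sample_means(1023, 10, 10000, "ACAP")` would mirror the parameters quoted above for a single k. Looping it over k = 1 to 50 gives the kind of table that driver.pl is described as producing, minus the sample variances.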
When we widen the range of k values to 1 to 100, the R-squared values decrease somewhat, and both linear regressions look equally effective with an R-squared value of 0.9995. The trendline fits most tightly around the middle of the graph. Despite this, the correlation is still strong throughout. These formulae would be acceptable estimators for k, but it is worth noting how the linear formulae differ between the 1 to 50 and 1 to 100 ranges for k. Both the multiplicative and additive constants are higher.

The COVID-19 pandemic remains an ever-present concern worldwide. Approximating the number of infected individuals can inform decision-makers responsible for reducing the spread of COVID-19. For example, an approximate count can be the tipping point for whether schools reopen in person or live events are held. This form of batch testing can be both economical and revelatory. While E(Y) was a useful estimator for log₂(k) in this case, 2^Y is a biased estimator of k. Exponentiating the relationship between E(Y) and log₂(k) does not yield directly useful results. It was the linear regression of k versus 2^Y that proved to be the best estimator for k.
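One way to see why exponentiating falls short: 2^y is convex, so Jensen's inequality gives E(2^Y) ≥ 2^E(Y), and the two sides agree only when Y is constant. The regression step itself can be sketched as an ordinary least-squares fit. The Python below is a minimal illustration, not the authors' Excel or Perl procedure: `fit_line` is a hypothetical helper, and the data points are synthetic values chosen to lie exactly on a line, not simulation output.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept.

    Returns (slope, intercept, r_squared). Both xs and ys must vary,
    otherwise the denominators below are zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)   # squared Pearson correlation
    return slope, intercept, r_squared


# Synthetic, exactly linear points used only to exercise the function.
slope, intercept, r_squared = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

In the setting of this paper, one plausible use (an assumption about the direction of the fit) is to pass the sample means of 2^Y as `xs` and the known k values as `ys`, so that the fitted line maps an observed 2^Y to an estimate of k.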
Detecting the presence of the disease as early as possible coincides with the goal of approximately counting the infected at low prevalence.

The code used for this simulation can be freely accessed and downloaded at https://github.com/maxjwebber/covidcounting. Users should install Perl, move the files to the same directory, and run driver.pl with command-line parameters. For example:

    driver.pl -n 1023 -kmin 1 -kmax 50 -trialsperk 10000

We recommend a lowest k value of 0 or 1 for the best fit (the program warns a user who chooses a different value). A pair of .csv files, ACAP_results.csv and ACAI_results.csv, will be created with the results for the individual k values, and the linear regression formulae will be written to linear_regression.csv along with the R-squared value for each. We recommend using the formula with the higher R-squared value (though both should have acceptable R-squared values).

References
Number of coronavirus (COVID-19) tests performed in the most impacted countries worldwide as of
The Delta Variant Isn't Just Hyper-Contagious. It Also Grows More Rapidly Inside You
Analysis and Applications of Adaptive Group Testing Methods for COVID-19. Cassidy Mentus, Martin Romeo, Christian DiPaola
Evaluation of Group Testing for SARS-CoV-2 RNA
Coronavirus Test Shortages Trigger a New Strategy: Group Screening. Scientific American