key: cord-0760681-2nwzfa3i
authors: Bilder, Christopher R.; Tebbs, Joshua M.; McMahan, Christopher S.
title: Discussion on “Is group testing ready for prime‐time in disease identification”
date: 2021-07-12
journal: Stat Med
DOI: 10.1002/sim.8988
sha: f99fb2d088d4a024a3e9671591854175397d8b02
doc_id: 760681
cord_uid: 2nwzfa3i


Is group testing ready for prime-time in disease identification? Yes! Prior to the COVID-19 pandemic, group testing (also known as pooled testing and specimen pooling) was widely used in areas including blood donation screening, 1, 2 infectious disease testing in animals, 3, 4 sexually transmitted infection testing, 5, 6 and surveillance for pathogen contamination of food. 7, 8 More areas could benefit as well, as pointed out by Haber et al 9 (HMA). The use of group testing for SARS-CoV-2 detection was isolated at the beginning of the pandemic, but early accounts of its successful implementation [10] [11] [12] with little if any loss of accuracy led to its widespread use. More than 100 papers published since April 2020 detail its implementation. A large number of news media accounts, such as those in the Washington Post, 13 ABC News, 14 and National Public Radio, 15 likely contributed to its adoption as well. Large laboratories, like LabCorp 16 and Quest Diagnostics, 17 received Emergency Use Authorizations (EUAs) from the Food and Drug Administration to use their assays with group testing. There is even a Wikipedia 18 web page dedicated to the use of group testing during the pandemic. Therefore, not only is group testing ready for "prime-time", but it is an established "hit" among laboratories around the world for increasing testing capacity.

The focus of HMA is on the accuracy of group testing, in particular the first-stage sensitivity for Dorfman testing. We completely agree that quantifying this accuracy is extremely important for laboratories. Where we differ from HMA is in how to account for the first-stage sensitivity. In our discussion, we first provide comments on the methods used by HMA. Next, we discuss how laboratories choose a group size to ensure first-stage sensitivity is not compromised. Relative to this discussion, we provide information about how to find the optimal group size. We conclude with reasons beyond accuracy for why some laboratories may remain hesitant toward group testing and with what is next for laboratory implementation of group testing. Because of the worldwide importance of SARS-CoV-2 detection right now, our discussion focuses mostly on this group testing application.

If the first stage of Dorfman testing has reduced sensitivity, the main cause is the "dilution effect." This problem can occur because a smaller portion of each individual specimen is tested in a group than if each was tested separately. Thus, less of a pathogen may be present in a group containing one positive than if the positive specimen was tested alone. Whether reduced sensitivity for the group occurs is dependent on the type, target, and implementation of the assay, in addition to the group size. Laboratories can control some of these factors to prevent the problem from occurring. HMA advocated using a deterministic model to account for the potential dilution effect. The authors primarily focus on a model proposed by Hwang 19, f_H(p, k, d), that is a function of the overall infection prevalence (p) in a population of interest, the group size (k), and the amount of dilution (d). Section 7.1 of HMA also includes other models that are functions of group size alone. While it is correct to have these expressions as decreasing functions of group size, there is no evidence presented that these particular models describe what truly occurs. This leaves the reader uncertain whether the results presented in Sections 6 and 7 of HMA occur in actual practice.

To illustrate our concern, we focus on the Hwang model. We are not aware of assays with a sensitivity that is a function of the prevalence, and Hwang 19 does not provide any evidence of this either. This is especially concerning because nucleic acid amplification tests (NAATs) were not available when Hwang 19 was published in 1976. In addition, examinations of assay sensitivity are always given by an assay's product insert and associated validation materials. While prevalence may be stated, this is done to summarize the specimens within a validation experiment, rather than to imply that sensitivity changes as a function of prevalence (eg, Aptima Mycoplasma genitalium assay 20 and BD Max CT/GC/TV 21 ).

For a prevalence of 0.10, Figure 1 evaluates the Hwang model as a function of group size and dilution effect. Most of these dilution settings lead to unrealistic first-stage sensitivities, such as d = 0.3, where sensitivities drop below 0.5. At the very least, it is doubtful that a laboratory would consider using an assay for group testing across such a large range of group sizes when first-stage sensitivities are less than 0.9.
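To make the behavior discussed for Figure 1 concrete, the following minimal Python sketch evaluates a Hwang-type dilution function over a few group sizes. The functional form p / (1 − (1 − p)^(k^d)) is an assumption on our part: it is the form we understand to be commonly attributed to Hwang, but it is not stated in this discussion, and the function name `hwang_sensitivity` is hypothetical.

```python
def hwang_sensitivity(p: float, k: int, d: float) -> float:
    """First-stage sensitivity under an ASSUMED Hwang-type dilution form.

    ASSUMPTION (not given in the text): f_H(p, k, d) = p / (1 - (1 - p)**(k**d)),
    where d = 0 means no dilution (sensitivity 1) and larger d means more dilution.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("prevalence p must be strictly between 0 and 1")
    if k < 1:
        raise ValueError("group size k must be at least 1")
    return p / (1.0 - (1.0 - p) ** (k ** d))


if __name__ == "__main__":
    prevalence = 0.10  # the prevalence used for Figure 1 in the text
    for d in (0.0, 0.1, 0.3, 0.5):
        row = [round(hwang_sensitivity(prevalence, k, d), 3) for k in (2, 5, 10, 20)]
        print(f"d = {d}: sensitivities for k = 2, 5, 10, 20 -> {row}")
```

Under this assumed form, the sensitivity depends on the prevalence p as well as on k and d, which is the property questioned in the preceding paragraphs.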

We emphatically agree with HMA that accuracy should be a top concern whenever a laboratory implements group testing. Others in the group testing literature, such as Kim et al 22 and Hitt et al, 23 have emphasized this point as well. Drawing upon the accuracy concern, HMA proposed a way to validate the group testing process using their dilution models. Unfortunately, this validation could lead to very large sample sizes. In this section, we describe the simpler process that many laboratories use instead to ensure there is little if any loss of accuracy when compared to individual testing.

The most common NAAT approach for SARS-CoV-2 detection is real-time reverse transcription polymerase chain reaction (RT-qPCR). The response from the test is a cycle threshold (CT) value. This value represents the number of heating/cooling cycles completed when enough fluorescence is detected to declare a specimen positive. Each of these cycles results in an approximate doubling of the viral genetic material present. The maximum number of cycles used is dependent on the assay, where common values are between 35 and 40 (eg, the Centers for Disease Control and Prevention assay uses 40 cycles 24 ). Thus, a CT value of 30 indicates a specimen was determined to be positive. If not enough fluorescence is detected prior to the maximum number of cycles, the specimen is declared negative. The amount of fluorescence during the cycles is tracked with an amplification curve plot, where examples are given by Yelin et al 12 and by Anderson et al 25 (see video demonstrating the testing process at the Nebraska Public Health Laboratory). While the CT is not the viral load for a specimen, it is closely related to it. A higher viral load will lead to a lower CT value (amplification threshold is reached earlier due to more virus present), and a lower viral load will lead to a higher CT value (amplification threshold is reached later due to less virus present).

Rather than the process outlined in Section 7 of HMA, laboratories use the spiking procedure briefly described in Section 9. One known positive will be combined with k − 1 known negative specimens to evaluate how well a group size of k will work in practice. Because the known positive is diluted by the known negatives, there will likely be less virus present in the grouped specimen than if the positive specimen was tested separately (assuming the same volume in microliters is used for a group and individual test). Thus, it is likely to take more cycles to declare the group positive than the individual positive. Fortunately, this increase in cycles is predictable due to the approximate doubling from the PCR process. Because of this predictable trend, low viral load specimens are of most concern when making sure no positive specimens are missed.

What can be done to make sure the sensitivity is not reduced? An assay used with group testing should have a low limit of detection. With this type of assay, the maximum number of cycles can be increased whenever group tests are performed. For example, rather than using a maximum of 40 cycles, the Nebraska Public Health Laboratory used a maximum of 45 cycles. To reduce the possibility of an individual false positive, specimens within a positive group were retested using the original cycle threshold limit. Alternatively, because amplification curves can be examined, individual group members can be retested if their group test appears to be approaching the minimum fluorescence needed to be declared positive. 12 This approach is similar to the early group testing work on Chlamydia trachomatis detection by NAAT methods that were used at the time. 27 Other approaches are possible as well. Some laboratories will forgo these small accommodations. For example, this appears to be what LabCorp does based on their EUA. They present a histogram of approximately 150,000 CT values from individual testing and conclude 2.3% of positives may be missed using groups of size 5. It is important to note, however, that the EUA does not address whether some of these "missed" positives may represent a previous infection due to their high CT values. [28] [29] [30] [31]

Laboratories will only implement a group testing algorithm if the group test has the same or similar accuracy as if each specimen were tested separately with the same assay. For this reason, it may be best to refer to first-stage "sensitivity" as first-stage "positive percent agreement" (PPA). Some product inserts, such as the Aptima Combo 2 Assay for Chlamydia trachomatis and Neisseria gonorrhoeae detection, 32 have switched to using this more precise terminology when validating their assay compared to another for individual testing. Fitzpatrick et al 33 advocated a move to this terminology as well for test validation in general. Therefore, rather than including sensitivity in an expected value of tests expression, it may be best to use PPA.
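The predictable increase in cycles from the spiking procedure can be illustrated with a short Python sketch. It assumes idealized amplification (the target exactly doubles each cycle when `efficiency` is 1.0) and equal specimen volumes, so a 1-in-k dilution shifts the CT value by about log2(k) cycles. The function names, the `efficiency` parameter, and the example CT values are hypothetical and serve only as an illustration, not as a validated laboratory rule.

```python
import math


def expected_ct_shift(k: int, efficiency: float = 1.0) -> float:
    """Approximate extra cycles needed when one positive is diluted 1-in-k.

    ASSUMPTION: each cycle multiplies the target by (1 + efficiency), so
    efficiency = 1.0 corresponds to the approximate doubling described in the text.
    """
    if k < 1:
        raise ValueError("group size k must be at least 1")
    return math.log(k) / math.log(1.0 + efficiency)


def group_expected_positive(individual_ct: float, k: int, max_cycles: int) -> bool:
    """Would the diluted specimen be expected to cross threshold by max_cycles?"""
    return individual_ct + expected_ct_shift(k) <= max_cycles


if __name__ == "__main__":
    # A hypothetical low viral load specimen with an individual CT of 38.5
    for max_cycles in (40, 45):
        detected = group_expected_positive(38.5, k=5, max_cycles=max_cycles)
        print(f"k = 5, maximum of {max_cycles} cycles: expected positive = {detected}")
```

In this idealized setting, a specimen with an individual CT near the 40-cycle limit would be expected to be missed in a group of size 5 but found when the maximum is raised to 45 cycles, which mirrors the accommodation described above.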

Laboratories want to use the most efficient group size possible. This group size will lead to the largest possible increase in testing capacity because the resources saved can be used to test more specimens. As discussed by McMahan et al, 34 choosing the optimal group size is similar to a power study. A power study makes assumptions regarding parameters and determines a sample size such that the power for a hypothesis test of interest is above a threshold. To find an optimal group size for group testing, one needs to make assumptions about parameter values, such as the prevalence and first-stage sensitivity. These values are determined by past testing and/or validation experiments, so that they provide the best set of information available. A number of different parameter values can be used in this process to understand how these assumptions could affect group size and other operating characteristics of interest. A group size is chosen by minimizing an objective function, most often the expected number of tests per individual, among those group sizes that either do not reduce or result in a very small reduction in the first-stage sensitivity. Group sizes after implementation can be adjusted as needed to reflect changes in prevalence and/or laboratory resources. Even if the optimal group size is not used, it is important to note that Dorfman testing will very likely be more efficient than testing specimens separately unless a significant increase in the prevalence occurs.
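As a minimal sketch of this selection process, the Python code below computes the expected number of tests per individual for two-stage Dorfman testing and searches a set of candidate group sizes for the minimizer. It assumes a common prevalence p for all individuals and a sensitivity (`se`) and specificity (`sp`) for the group test that do not change with group size; the function names and the `candidate_sizes` argument (standing in for the group sizes a laboratory has validated) are hypothetical.

```python
from typing import Iterable, Optional


def dorfman_tests_per_individual(p: float, k: int, se: float = 1.0, sp: float = 1.0) -> float:
    """Expected number of tests per individual for Dorfman testing with group size k.

    ASSUMPTIONS: common prevalence p; the group test has sensitivity se and
    specificity sp regardless of k. A group size of 1 means individual testing.
    """
    if k == 1:
        return 1.0
    prob_all_negative = (1.0 - p) ** k
    prob_group_positive = se * (1.0 - prob_all_negative) + (1.0 - sp) * prob_all_negative
    # One group test shared by k individuals, plus a retest whenever the group is positive
    return 1.0 / k + prob_group_positive


def optimal_group_size(p: float, se: float = 1.0, sp: float = 1.0,
                       candidate_sizes: Optional[Iterable[int]] = None) -> int:
    """Candidate group size that minimizes the expected tests per individual."""
    sizes = list(candidate_sizes) if candidate_sizes is not None else list(range(1, 51))
    return min(sizes, key=lambda k: dorfman_tests_per_individual(p, k, se, sp))


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.10, 0.40):
        k = optimal_group_size(p)
        print(f"p = {p}: k = {k}, tests per individual = "
              f"{dorfman_tests_per_individual(p, k):.4f}")
```

Restricting `candidate_sizes` to the validated group sizes (for example, `range(1, 6)` if only groups of up to 5 showed little or no loss of first-stage sensitivity) corresponds to the constrained minimization described above, and rerunning the search over several assumed prevalences plays the role of the sensitivity analysis in a power study.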

As described in HMA, other objective functions are possible. Some include subjective weighting for accuracy measures. Malinovsky et al 35 proposed an objective function free from these subjective weights: the ratio of the expected number of correct classifications to the expected number of tests. Interestingly, Hitt et al 23 showed this objective function generally resulted in the same optimal group sizes as when the expected number of tests was used as an objective function. Even when there were differences, the differences were small and would not be meaningful to a laboratory. A key point made by Hitt et al 23 was that the optimal design should be chosen relative to an objective function and then one should "examine the accuracy associated with it." If higher accuracy was needed, a "new suboptimal testing configuration would be chosen with accuracies that are acceptable."
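The two objective functions can be compared with a small Python sketch for Dorfman testing. It relies on simplifying assumptions that are ours rather than taken from the cited papers: a common prevalence, the same sensitivity and specificity for group and individual tests, and test outcomes that are independent given the true statuses. All function names are hypothetical.

```python
from typing import Tuple


def dorfman_characteristics(p: float, k: int, se: float, sp: float) -> Tuple[float, float]:
    """Return (expected tests, expected correct classifications), both per individual.

    ASSUMPTIONS: common prevalence p; same se and sp at both stages;
    conditionally independent test outcomes.
    """
    if k == 1:
        return 1.0, p * se + (1.0 - p) * sp
    all_neg = (1.0 - p) ** k
    others_neg = (1.0 - p) ** (k - 1)
    tests = 1.0 / k + se * (1.0 - all_neg) + (1.0 - sp) * all_neg
    # A positive individual is found only if both the group test and the retest are positive
    pooling_se = se * se
    # A negative individual is misclassified only if the group tests positive
    # and the individual retest is a false positive
    group_pos_given_negative = se * (1.0 - others_neg) + (1.0 - sp) * others_neg
    pooling_sp = 1.0 - (1.0 - sp) * group_pos_given_negative
    correct = p * pooling_se + (1.0 - p) * pooling_sp
    return tests, correct


def best_k_by_tests(p: float, se: float, sp: float, max_k: int = 50) -> int:
    """Group size minimizing the expected number of tests per individual."""
    return min(range(1, max_k + 1), key=lambda k: dorfman_characteristics(p, k, se, sp)[0])


def best_k_by_ratio(p: float, se: float, sp: float, max_k: int = 50) -> int:
    """Group size maximizing expected correct classifications per expected test."""
    def ratio(k: int) -> float:
        tests, correct = dorfman_characteristics(p, k, se, sp)
        return correct / tests
    return max(range(1, max_k + 1), key=ratio)


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.10):
        print(p, best_k_by_tests(p, 0.95, 0.99), best_k_by_ratio(p, 0.95, 0.99))
```

After a group size is chosen with either function, the second element returned by `dorfman_characteristics` gives one accuracy summary to examine, in the spirit of the approach quoted from Hitt et al.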

While group testing is widely used for SARS-CoV-2 detection, there remains reluctance by some laboratories to implement it. Why? One reason involves how groups are formed within a laboratory. Small portions of individual specimens need to be extracted and combined for testing. This extraction/combination process may be performed manually by a laboratory technician using a single-channel pipetting instrument. It can be monotonous and even result in repetitive-strain injuries if a considerable number of specimens need to be grouped per day. Automated liquid handling instruments can greatly ease this process and make laboratories more receptive to group testing. A second reason involves laboratory information management systems. These systems may be designed to track only individual test results, rather than group test results. In those situations, we have seen laboratory technicians record group test results by hand rather than have a computer completely track them. While this will work fine in some settings, it is not ideal for others with a high volume of clinical specimens. A third reason is that specimen storage and associated logistics become more complicated. For example, a higher stage hierarchical group testing algorithm requires specimens to be capped/re-capped and moved in/out of storage multiple times. Despite these obstacles, a very large number of laboratories have been able to overcome them and significantly increase their testing capacity through group testing.

An alternative to HMA's question is "Are non-Dorfman algorithms ready for prime-time?" Since Dorfman testing was conceived in the early 1940s, many algorithms have been developed that are much more efficient and/or have more desirable properties. Too often, we find laboratories think of Dorfman testing as group testing, rather than one way to implement it. There have been a few examples of these other algorithms in use for SARS-CoV-2 detection. 
These examples include the three-stage hierarchical algorithm used by Lohse et al 36 and the array testing covered by LabCorp's 16 EUA. Kim et al 22 and Bilder et al 37 provided nice summaries for these types of algorithms in general, while the binGroup2 package 38 in R provides computational tools. Statisticians need to get the word out to more laboratories that these other algorithms exist.

Overall infection prevalence is a large factor in determining whether any group testing algorithm is more efficient than testing each specimen separately. Some laboratories experiencing higher SARS-CoV-2 positivity rates deemed group testing to be no longer useful for fear that most groups will test positive. 39 However, this is an example of where non-Dorfman algorithms can make group testing quite useful. Informative group testing takes into account individual-specific probabilities of infection. [40] [41] [42] For example, the threshold optimal Dorfman algorithm of McMahan et al 43 provides the simplest implementation. For this algorithm, probabilities of infection for each individual are used to find a threshold for those specimens that are tested separately and those specimens that are tested in groups. Individuals displaying symptoms likely have a higher probability of infection and could be those tested separately. This type of approach has been investigated in Nebraska and implemented in Germany. 44 The end result is an algorithm that allows groups to be used only on those individuals with a low probability of infection and that resolves changing prevalence issues as described in Section 8.1 of HMA.
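The idea of testing high-risk individuals separately and low-risk individuals in groups can be sketched in Python as follows. This is a deliberately simplified illustration and not the threshold optimal Dorfman algorithm of McMahan et al: it sorts individuals by their probability of infection, tries every split point and every common group size, and keeps the combination with the fewest expected tests. The function names, the brute-force search, and the example probabilities are all hypothetical.

```python
from math import prod
from typing import List, Sequence, Tuple


def expected_tests_for_group(probs: Sequence[float], se: float = 1.0, sp: float = 1.0) -> float:
    """Expected tests for one Dorfman group with individual-specific probabilities."""
    k = len(probs)
    if k == 1:
        return 1.0  # a group of one is just an individual test
    all_negative = prod(1.0 - p for p in probs)
    prob_group_positive = se * (1.0 - all_negative) + (1.0 - sp) * all_negative
    return 1.0 + k * prob_group_positive


def threshold_split_sketch(probs: Sequence[float], max_k: int = 10,
                           se: float = 1.0, sp: float = 1.0) -> Tuple[float, int, int]:
    """Return (expected total tests, number tested separately, common group size).

    SIMPLIFICATION: the m highest-risk individuals are tested separately and the
    rest are grouped in sorted order into groups of size k (the last may be smaller).
    """
    ordered: List[float] = sorted(probs, reverse=True)
    n = len(ordered)
    best = (float(n), n, 1)  # baseline: everyone tested separately
    for m in range(n + 1):
        low_risk = ordered[m:]
        for k in range(2, max_k + 1):
            groups = [low_risk[i:i + k] for i in range(0, len(low_risk), k)]
            total = m + sum(expected_tests_for_group(g, se, sp) for g in groups)
            if total < best[0]:
                best = (total, m, k)
    return best


if __name__ == "__main__":
    # Two hypothetical symptomatic individuals and ten low-risk individuals
    risks = [0.5, 0.5] + [0.01] * 10
    total, n_separate, group_size = threshold_split_sketch(risks)
    print(f"test {n_separate} separately, group the rest in size {group_size}: "
          f"{total:.3f} expected tests for {len(risks)} individuals")
```

The probability of the last individual tested separately plays the role of the threshold. In practice, the individual probabilities would come from a risk model or from information such as symptom status, as described above.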

American Red Cross Infectious disease testing

To pool or not to pool? guidelines for pooling samples for use in surveillance testing of infectious diseases in aquatic animals

Nebraska Veterinary Diagnostic Center Diagnostic tests & fees

Pooled nucleic acid testing strategy for monitoring HIV-1 treatment in resource limited settings

To pool or not to pool samples for sexually transmitted infections detection in men who have sex with men? an evaluation of a new pooling method using the GeneXpert instrument in West Africa

Pathogen detection, testing, and control in fresh broccoli sprouts

Optimization and evaluation of the qPCR-based pooling strategy DEP-pooling in dairy production for the detection of Listeria monocytogenes

Is group testing ready for prime-time in disease identification

Assessment of specimen pooling to conserve SARS CoV-2 testing resources

Large-scale implementation of pooled RNA extraction and RT-PCR for SARS-CoV-2 detection

Evaluation of COVID-19 RT-qPCR test in multi sample pools

A temporary coronavirus testing fix: use each kit on 50 people at a time. The Washington Post

With all eyes on coronavirus testing, some researchers say 'group testing' could make up the shortage. ABC News

Pooling coronavirus tests can spare scarce supplies, but there's a catch. National Public Radio

LabCorp Emergency use authorization for COVID-19 RT-PCR test

Quest Diagnostics Emergency use authorization for SARS-CoV-2 RNA, qualitative real-time RT-PCR

Wikipedia List of countries implementing pool testing strategy against COVID-19

Group testing with a dilution effect

Hologic Aptima mycoplasma genitalium assay

Comparison of group testing algorithms for case identification in the presence of test error

The objective function controversy for group testing: much ado about nothing

CDC 2019-Novel Coronavirus (2019-nCoV) real-time RT-PCR diagnostic panel

Group testing for COVID-19, used in Nebraska, seen as 'promising way' to conserve testing supplies. Omaha World-Herald

Considerations for group testing: a practical approach for the clinical laboratory

Pooling urine samples for ligase chain reaction screening for genital Chlamydia trachomatis infection in asymptomatic women

This week in virology 641: COVID-19 with Dr. Anthony Fauci

Your coronavirus test is positive. maybe it shouldn't be. The New York Times

World Health Organization Nucleic acid testing (NAT) technologies that use real-time polymerase chain reaction (RT-PCR) for detection of SARS-CoV-2

World Health Organization Nucleic acid testing (NAT) technologies that use real-time polymerase chain reaction (RT-PCR) for detection of SARS-CoV-2

Hologic Aptima combo 2 assay for CT/NG

Buyer beware: inflated claims of sensitivity for rapid COVID-19 tests

Rejoinder reaction: a note on the evaluation of group testing algorithms in the presence of misclassification

Reader reaction: a note on the evaluation of group testing algorithms in the presence of misclassification

Pooling of samples for testing for SARS-CoV-2 in asymptomatic people

Tests in short supply? try group testing

binGroup2: Identification and estimation using group testing

Why pooled testing for the coronavirus isn't working in America. The New York Times

Informative retesting

Optimal retesting configurations for hierarchical group testing

Two-dimensional informative array testing

Informative Dorfman screening

Simple questionnaires to improve pooling strategies for SARS-CoV-2 laboratory testing

Discussion on "Is group testing ready for prime-time in disease identification