title: Certifying One-Phase Technology-Assisted Reviews
authors: Lewis, David D.; Yang, Eugene; Frieder, Ophir
date: 2021-08-29
DOI: 10.1145/3459637.3482415

Technology-assisted review (TAR) workflows based on iterative active learning are widely used in document review applications. Most stopping rules for one-phase TAR workflows lack valid statistical guarantees, which has discouraged their use in some legal contexts. Drawing on the theory of quantile estimation, we provide the first broadly applicable and statistically valid sample-based stopping rules for one-phase TAR. We further show theoretically and empirically that overshooting a recall target, which has been treated as innocuous or desirable in past evaluations of stopping rules, is a major source of excess cost in one-phase TAR workflows. Counterintuitively, incurring a larger sampling cost to reduce excess recall leads to lower total cost in almost all scenarios.

Technology-assisted review (TAR) is the use of technological means to accelerate manual document review workflows. A prominent application is document review in legal cases, known as electronic discovery or eDiscovery [3], a multi-billion dollar industry. Another application area is systematic reviews of scientific literature [52], which have played a revolutionary role in empirical medicine [21] and other fields [15]. More generally, TAR is applicable to a range of high recall retrieval tasks [1, 9, 11, 12, 30, 47]. The TREC-COVID project was an emergency deployment of a TAR process early in the Covid-19 pandemic [36].

Two categories of TAR workflows can be distinguished. Two-phase TAR workflows (sometimes called culling workflows) are focused on iterative training of a text classifier by active learning [44], which is then used to select a subset of a collection for review [8, 32, 56]. A distinction is drawn between the training phase (Phase 1) and the review phase (Phase 2). While review of documents is done in both phases, most review effort occurs in Phase 2, after training is over. Two-phase reviews are preferred when per-document costs vary among review personnel [56].

In contrast, one-phase workflows do not distinguish between training and review, and are preferable when review costs are constant. Iterative training of models, use of those models to prioritize documents for review, review of the prioritized documents, and feeding of the reviewed documents back into training continue during the entire review. This is the structure of a classical relevance feedback workflow in information retrieval [37, 41] and, indeed, relevance feedback is widely used in one-phase TAR reviews [8].

Since TAR is used when it would be too expensive to review all documents [35], a stopping rule is necessary to decide when the review ends. However, one wants confidence, and ideally a certification by statistical guarantee, that a certain proportion of relevant documents have been found by the stopping point, i.e., that a recall target has been achieved [28]. A stopping rule can thus fail in one of two ways: failing to hit its recall target, or incurring unacceptably high costs in doing so. Unfortunately, no statistically valid, generally applicable stopping rule for one-phase TAR has been available (Section 6). The lack of such certification rules has limited the adoption of one-phase TAR workflows.
For instance, the US Department of Justice Antitrust Division's model agreement for use of supervised learning includes only two-phase culling workflows. In response to this need, we reconsider TAR stopping rules from the perspective of statistical quality control in manufacturing [16]. Our contributions are:

• A taxonomy of TAR stopping rules by the application contexts in which they can be used
• A theoretical framework for understanding stopping a TAR review as a problem in quantile estimation
• The first two general purpose certification rules for one-phase TAR: the Quantile Point Estimate Threshold (QPET) rule and the Quantile Binomial Confidence Bound (QBCB) rule. Both can be used with any sample size and recall target. The latter also provides a confidence interval on recall at any specified confidence level.
• A theoretical and empirical demonstration that, for many TAR tasks, the counterintuitive key to reducing total TAR review cost is to incur more cost for sampling and thereby reduce excess recall

We begin by proposing a taxonomy of TAR stopping rules and zeroing in on those with broad applicability (Section 2). We identify sequential bias as the key challenge to certification rules and apply the theory of quantile estimation to evade this bias (Section 3). This leads first to the QPET rule (Section 3) and then to the QBCB rule (Section 4), whose properties we analyze. We also examine previously proposed certification rules and find that only one narrowly applicable rule, Cormack and Grossman's Target rule [6], is statistically valid, and indeed is a special case of the QBCB rule (Section 6.3). Finally, we demonstrate theoretically and empirically that minimizing sample size, as suggested by Cormack and Grossman, is almost always suboptimal from a total cost standpoint (Sections 5 and 8).

Many TAR stopping rules that have been proposed would be unusable in most operational TAR contexts. In this section, we propose a taxonomy of stopping rules that clarifies their range of applicability. TAR evaluation conferences [17, 23-25, 38] have emphasized interventional stopping rules, i.e., rules that alter the method used to select documents for review. These rules include SCAL [8], Autostop-Conservative [31], Autostop-Optimistic [31], and a recent rule by Callaghan and Müller-Hansen [5]. By modifying the document selection process, these methods gather information that enables more accurate stopping (if not always valid statistical guarantees). While powerful, an interventional rule requires that all documents selected be chosen by a particular novel active learning algorithm. Most document reviews rely on commercial TAR software whose document selection algorithms cannot be modified by the user. Further, review managers often prefer (and may be legally required) to select documents not just by active learning, but also by Boolean text or metadata searches. Documents from other sources (related projects, direct attorney knowledge, or legal actions) may also need to be reviewed at arbitrary times.

In contrast, we call a stopping rule a standoff rule if it can be applied to any TAR review, regardless of how documents are selected or in what order. Some rules allow arbitrary review combined with interventional portions: we call these hybrid rules. Standoff and hybrid rules usually require drawing a random sample for estimation purposes.
Some of these rules assume that all review team decisions are correct (self-evaluation rules), while others assume only that the decisions on the sample are correct (gold standard rules).

A cross-cutting distinction for all rules is how strong a guarantee of quality they provide. Heuristic rules make a stopping decision based on general patterns observed for review processes, such as declining precision with increasing manual search effort or diminishing impact from new training data [6, 7, 43, 52, 55]. Heuristic stopping rules for one-phase TAR reviews are closely related to stopping rules for active learning in two-phase TAR reviews [32] and in generalization tasks [26, 44, 49]. Certification rules, on the other hand, use a random sample to provide a formal statistical guarantee that the stopping point has certain properties and/or to provide a formal statistical estimate of effectiveness at the stopping point. If correctly designed, they give a degree of confidence that heuristic rules cannot. However, with one narrow exception, previously proposed certification rules fail to meet their purported statistical guarantees (Section 6). The consequences of such failures can be severe: parties in legal cases have been sanctioned for failing to meet stated targets on information retrieval effectiveness measures. In Sections 3 and 4 we provide the first standoff gold standard certification rules for one-phase TAR workflows that can be used with any sample size, recall target, and confidence level.

Certification rules condition stopping on some statistical guarantee of effectiveness of the TAR process. We consider here the usual collection-level binary contingency table measures, where the four outcomes TP (true positives), FP (false positives), FN (false negatives), and TN (true negatives) sum to the number of documents in the collection. For a one-phase TAR workflow, a positive prediction or detection corresponds to the document having been reviewed before the workflow is stopped. Recall, TP/(TP + FN), is the most common measure on which TAR processes are evaluated [28]. Other measures of interest in TAR are precision = TP/(TP + FP) and elusion = FN/(FN + TN). Elusion (which one desires to be low) can be thought of as precision in the unreviewed documents and has mostly seen use in the law [40].

Effectiveness must be estimated. Estimates are produced by estimators, i.e., functions that define a random variable in terms of a random sample from a population [27]. An estimate is the value taken on by that random variable for a particular random sample. A point estimate is an estimate which is a scalar value. A common point estimator is the plug-in estimator, which replaces population values by the random variables for the corresponding sample values [14]. The plug-in estimator for recall, based on a simple random sample annotated for both category and detection status, is X/Y, where X is a random variable for the number of detected positive examples in the sample and Y is a random variable for the total number of positive examples in that sample. In other words, recall on the labeled random sample is used as the point estimate of recall in the population.

The plug-in estimator of recall assumes that both class labels and detection statuses are known. If we are using the estimate in a stopping rule, however, we must stop to have an estimate, but must have an estimate to stop.
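As a concrete illustration, the sketch below (our own code, not the authors'; function and variable names are illustrative) computes the contingency-table measures and the plug-in recall estimate from a labeled simple random sample.

```python
# Illustrative sketch of the contingency-table measures and the plug-in
# recall estimator described above; names are our own, not from the paper.

def contingency_measures(tp, fp, fn, tn):
    """Recall, precision, and elusion from collection-level counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    elusion = fn / (fn + tn)   # "precision" among the unreviewed documents
    return recall, precision, elusion

def plugin_recall(sample):
    """Plug-in recall estimate from a labeled simple random sample.

    `sample` is a sequence of (is_relevant, is_detected) pairs, where a
    document counts as detected if it was reviewed before the workflow
    stopped. Assumes the sample contains at least one relevant document.
    """
    positives = [(rel, det) for rel, det in sample if rel]
    detected = [1 for rel, det in positives if det]
    return len(detected) / len(positives)
```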
The usual resolution of this dilemma in TAR is to compute, after each batch of documents is reviewed, what the estimated effectiveness would be if the TAR process were stopped at that point. The TAR process is stopped the first time one of these trial estimates exceeds the recall goal. We refer to this rule, widely used in practice, as the Point Estimate Threshold (PET) stopping rule. Unfortunately, the PET rule is statistically biased: the expected value of effectiveness at the stopping point typically falls short of the claimed effectiveness level. We demonstrate this with an example that, while simple, exhibits the core phenomena at play.

Consider a large collection, A, with an even number N of documents, all of which are relevant. If we ran the TAR process until all documents were found, each document would be assigned a rank corresponding to the order in which it was found. Call that number the A-rank of the document. Suppose our recall goal is 0.5. Since all documents are relevant in our example, a TAR process achieves recall of 0.5 or more if it stops at or after A-rank N/2.

Now draw a simple random sample of even size n from A and have it coded, prior to starting the TAR process, as a gold standard. On coding those documents we will find all are relevant, and so we have a simple random sample, D, of size n from the relevant documents in A. At first we do not know the A-rank of any document in sample D. However, as reviewers examine documents, they periodically find one of the sample documents, at which point we know its A-rank. When the (n/2)'th document from the sample is found, the plug-in estimate of recall on the sample will be 0.5, and the PET rule would stop the TAR process.

Since D is a random sample, the value of the (n/2)'th lowest A-rank in D, our stopping point, is a random variable D_(n/2), the (n/2)'th order statistic of the sample [2]. Its probability mass function corresponds to draws without replacement from three bins (A-ranks less than, equal to, and greater than a given value k) [2, Chapter 3]:

P[D_(n/2) = k] = C(k − 1, n/2 − 1) C(N − k, n/2) / C(N, n),

where C(a, b) denotes the binomial coefficient. The expected value of D_(n/2) is (n/2)(N + 1)/(n + 1) [2, Chapter 3], and thus the expected recall of the PET rule in this case is (1/2)(n/(n + 1))((N + 1)/N). This is less than 0.5 for any n < N.

The PET rule makes multiple tests on estimates and stops when the first one succeeds, thus biasing the estimate at the stopping point. This phenomenon is the focus of sequential analysis [13, 46, 51, 53], which is central to statistical quality control in manufacturing [16]. A key insight from sequential analysis is that conditioning stopping of a process on a random variable makes the stopping point itself a random variable. It is that latter random variable we need to have the desired statistical properties.

Suppose we view the PET rule more abstractly, as a rule that stops a TAR process when we have found j items from a sample D of n positive documents from A. The A-rank of the j'th item found will be the j'th lowest A-rank in our sample. That value is the realization for our sample of the random variable D_(j), where D_(1), ..., D_(n) are the order statistics for the sample [2]. For such a rule to be valid, given a recall goal t and positive sample size n, one strategy would be to choose j such that the worst case expected value of recall for any data set, averaged over the realizations of D_(j) for that data set, is at least t. Computing this worst case expected value is nontrivial. Fortunately, an alternative perspective is possible when recall is the measure of interest.
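The bias in this example can be checked numerically. The following sketch (our own illustration; N, n, and the trial count are arbitrary choices) simulates the all-relevant-collection scenario and compares the simulated expected recall of the PET rule with the closed form above.

```python
# Monte Carlo check of the PET-rule bias in the all-relevant-collection
# example; N, n, and the number of trials are illustrative choices.
import random

N, n, trials = 10_000, 20, 50_000   # collection size, positive sample size
total_recall = 0.0
for _ in range(trials):
    sample = random.sample(range(1, N + 1), n)   # A-ranks of the sample
    stop_rank = sorted(sample)[n // 2 - 1]       # the (n/2)'th order statistic
    total_recall += stop_rank / N                # all N documents are relevant
print("simulated expected recall:", total_recall / trials)
print("closed form:", 0.5 * (n / (n + 1)) * ((N + 1) / N))  # < 0.5 whenever n < N
```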
A t-quantile is a population parameter such that a fraction t of the population is at or below that value. Formally, for 0 < t < 1, we define Q_t to be the t-quantile of finite population B if Q_t = inf {x : P[X ≤ x] ≥ t} for X drawn uniformly from the population [10, Chapter 7]. Let B be just the relevant documents within collection A, in sorted order by their A-ranks, and let the A-rank Q_t be the t-quantile for B. Then the recall of a TAR process stopping at Q_t is the smallest t′ such that t′ ≥ t and t′ is a recall value achievable at some stopping point. The quantile perspective links recall at a stopping point to a single population parameter.

We then require an estimator that maps from order statistics in the positive sample D to population quantiles within the positive subpopulation B, i.e., a quantile point estimator. Hyndman & Fan [22] review the properties of nine quantile point estimators, of which their Q7 is the default in the R statistical package. Q7 is defined by letting h = (n − 1)t + 1 and j = ⌊h⌋, and using x_(j) + (h − j)(x_(j+1) − x_(j)) as the estimator of the t-quantile, where x_(j) is the j'th order statistic of the sample. Figure 1 diagrams the logic of quantile estimation using Q7.

Using Q7 as our quantile point estimator, we define the Quantile Point Estimate Threshold (QPET) stopping rule as follows. Given a sample size n and recall goal t, we compute h = (n − 1)t + 1 and j = ⌊h⌋. We need to stop at a point where we can apply our estimator D_(j) + (h − j)(D_(j+1) − D_(j)). If h is an integer, then j = h and we only need the value of D_(j). We therefore stop at D_(j), i.e., after finding the j'th positive sample document. If h is not an integer, then we need the values of both D_(j) and D_(j+1), so we stop at D_(j+1), i.e., after finding the (j + 1)'th positive sample document. In either scenario, D_(j) + (h − j)(D_(j+1) − D_(j)) is our point estimate of the t-quantile at the stopping point (with the second term 0 if h is an integer). If a point estimate of recall at the stopping point is required, we can use t as that estimate. This point estimate of recall is conservative in the sense that recall at the t-quantile is always at least t, but can be higher.

When does the QPET rule stop in comparison with the PET rule? Assume a nontrivial recall goal 0 < t < 1 and positive sample size n. The PET rule stops after finding j′ positive sample documents, where j′ is the lowest value such that j′/n ≥ t. If nt is an integer, the PET rule stops when j′ = nt. In this case the QPET rule has h = (n − 1)t + 1 = nt + (1 − t); since 0 < 1 − t < 1 and nt is an integer, h is not an integer, so j = ⌊h⌋ = nt and QPET stops at D_(nt+1), one positive sample document after the PET stopping point. If nt is not an integer, the PET rule stops when j′ = ⌈nt⌉. For QPET, ⌊h⌋ is either ⌊nt⌋ or ⌈nt⌉, so whether or not h is an integer, QPET stops at either D_(⌈nt⌉) (the same point as PET) or D_(⌈nt⌉+1). In all cases, then, the QPET rule requires finding at most one more positive sample document than the PET rule, and sometimes no additional documents.

The QPET rule outputs a point estimate of the t-quantile (i.e., of the number of documents to review), along with a point estimate of recall equal to t. However, we should surely feel less confident in these estimates if they are based, say, on a sample of 10 positive examples rather than on a sample of 1000 positive examples. A confidence interval is an estimate that consists of a pair of scalar values with an associated confidence level, conventionally expressed as 1 − α [20, Chapter 2]. A confidence interval estimator specifies a closed interval [L, U], where L and U are random variables defined in terms of the sample. We say that such an estimator produces a 1 − α confidence interval for a population value v when the probability is at least 1 − α that, over many draws of a random sample of the specified type, the sample-based realizations ℓ of L and u of U are such that ℓ ≤ v ≤ u.
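A minimal sketch of the QPET computations follows (our own code, following the Q7 and QPET definitions above; function names are ours). It returns the index of the positive sample document whose discovery triggers stopping, and the Q7 point estimate of the t-quantile once the needed order statistics are known.

```python
# Sketch of the QPET stopping index and the Q7 quantile point estimate,
# following the definitions in the text; names are our own.
import math

def qpet_stop_index(n, t):
    """Stop after finding this many positive sample documents."""
    h = (n - 1) * t + 1
    j = math.floor(h)
    return j if h == j else j + 1   # D_(j+1) is needed when h is not an integer

def q7_point_estimate(order_stats, t):
    """Q7 estimate of the t-quantile from the sorted A-ranks of the sample."""
    n = len(order_stats)
    h = (n - 1) * t + 1
    j = math.floor(h)
    x_j = order_stats[j - 1]                  # order statistics are 1-based
    x_j1 = order_stats[j] if j < n else x_j
    return x_j + (h - j) * (x_j1 - x_j)

# Example: with n = 50 and t = 0.8, h = 49 * 0.8 + 1 = 40.2, so the review
# stops when the 41st positive sample document is found.
print(qpet_stop_index(50, 0.8))   # -> 41
```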
Confidence interval estimators are used in two ways in TAR. The first is to use a power analysis on the estimator as a guide to sample size [42]. A review manager will draw a random sample large enough to guarantee that any confidence interval estimate produced from it will have some property, such as a maximum margin of error of 0.05. The second use of confidence intervals is in reporting, i.e., in making a statistical claim based on a labeled random sample about the effectiveness achieved by a TAR process. If the only use of the random sample is in reporting, this is unproblematic. But if (as is common) the same sample is used to decide when to stop the review, the reported confidence interval estimate will have sequential bias [53].

As with point estimates, the quantile perspective can rescue confidence intervals on recall from sequential bias. A quantile confidence interval is a confidence interval on a quantile [54, Chapter 4]. To avoid distributional assumptions about the TAR process, we can use a nonparametric quantile confidence interval estimator [10, Chapter 7]. This takes the form of a pair of order statistics, [D_(a), D_(b)]. The estimator determines the values a and b based on the quantile level t, sample size n, and confidence level 1 − α. It provides the guarantee that, with at least 1 − α probability over draws of the random sample, the t-quantile falls within the sample-based realization of [D_(a), D_(b)]. If an estimator of this form is available, we can define the stopping point of a TAR review to be the realized value of D_(b) for our positive random sample, and have 1 − α confidence that the t-quantile in B falls within [D_(a), D_(b)]. By the definition of the t-quantile, we thus have 1 − α confidence that stopping at D_(b) gives a recall of at least t.

For many uses of confidence intervals we want estimators that make the interval narrow (the realization of D_(b) − D_(a) is likely to be small) and/or symmetric (i.e., D_(b) − m and m − D_(a) are likely to be similar, where m is some point estimate). For a stopping rule, however, the most important criterion is that D_(b) is likely to be small, since this reduces the number of documents the TAR process must review before stopping. We can minimize the likely value of D_(b) by using a nonparametric one-sided upper confidence interval (UCI) on a quantile [20, Chapter 5]. Such an interval has the form [D_(0), D_(r)], where D_(0) is the 0th order statistic, i.e., the lowest logically possible value. For us this is D_(0) = 1 (the lowest A-rank), so the interval is [1, D_(r)]. We refer to the pair as a 1-s UCI, and to the upper end of the interval as a 1-s UCB (one-sided upper confidence bound).

The estimator must choose r such that the realization of D_(r) will be, with 1 − α probability, at or above the t-quantile. This is equivalent to requiring a probability of 1 − α or higher that fewer than r elements of the positive random sample D have A-rank less than the t-quantile. Suppose there are R positives in B, that K of them have A-rank less than the t-quantile, and that our sample of positives is of size n. Then our estimator should choose the smallest r such that:

Σ_{i=0}^{r−1} C(K, i) C(R − K, n − i) / C(R, n) ≥ 1 − α.

In a TAR setting we do not know R. However, if R is large relative to n, the binomial distribution is a good approximation to the above hypergeometric distribution [48, Chapter 3]. In this case, we want the smallest r such that:

Σ_{i=0}^{r−1} C(n, i) t^i (1 − t)^{n−i} ≥ 1 − α.

In fact, we can use the binomial approximation safely even when we are not confident that R is large relative to n. For values of t greater than 0.5, the fact that the binomial has larger variance than the hypergeometric means that the r chosen using the binomial approximation will never be less than the one chosen using the hypergeometric.
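The sketch below (our own code; the values of R and K are illustrative, not from the paper) computes the smallest r under both the exact hypergeometric condition and the binomial approximation, illustrating that for t > 0.5 the binomial choice is never smaller.

```python
# Comparing the exact hypergeometric choice of r with the binomial
# approximation; R and K below are illustrative values.
import math
from scipy.stats import binom, hypergeom

def smallest_r_binomial(n, t, alpha):
    for r in range(1, n + 1):
        if binom.cdf(r - 1, n, t) >= 1 - alpha:
            return r
    return None   # no nontrivial stopping point at this sample size

def smallest_r_hypergeometric(n, R, K, alpha):
    # K = number of positives in B with A-rank below the t-quantile
    for r in range(1, n + 1):
        if hypergeom.cdf(r - 1, R, K, n) >= 1 - alpha:
            return r
    return None

n, t, alpha = 50, 0.8, 0.05
R = 500                     # hypothetical number of positives in the collection
K = math.ceil(t * R) - 1    # at most this many positives lie strictly below Q_t
print(smallest_r_binomial(n, t, alpha),
      smallest_r_hypergeometric(n, R, K, alpha))
```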
Values of recall less than 0.5 are rarely of interest, but if needed we could find a similarly conservative value of r for such a t by running the summation downward from n instead of upward from 0.

Based on the above analysis, we define the Quantile Binomial Confidence Bound (QBCB) stopping rule. Given a sample size n, recall target t, and confidence level 1 − α, it specifies stopping a one-phase TAR process when the r'th positive sample document is found. Here r is the smallest integer such that

Σ_{i=0}^{r−1} C(n, i) t^i (1 − t)^{n−i} ≥ 1 − α.

We observed that the recall goal t can be used as a conservative point estimate of recall at the QPET stopping point. By the same logic, t is a conservative point estimate of recall at the QBCB stopping point. If we prefer an interval estimate, we can use a 1 − α one-sided lower confidence interval (1-s LCI) (or one-sided lower confidence bound, 1-s LCB) estimator [20, Chapter 2]. This defines a pair [L, 1.0] where, with probability at least 1 − α over random samples of size n, the realization [ℓ, 1.0] contains a desired population value. Given the definition of the t-quantile, we know that [t, 1.0] is a 1 − α 1-s LCI on recall at the QBCB stopping point. This interval estimate on recall may seem unsatisfying: it is identical regardless of sample size. However, this simply reflects the task we have set for the QBCB rule: stop as soon as one has confidence that a recall goal has been met. Larger sample sizes translate to earlier stopping, not a tighter 1-s LCI.

We can also compute more conventional estimates of recall at the QBCB stopping point. As long as those estimates depend only on r (which is fixed as soon as we choose n and t) and not on D_(r) (the actual A-rank at which we stop), these estimates are not affected by sequential bias. These estimates give insight into the behavior of the QBCB rule. Table 1 shows the QBCB values of r for recall goal 0.8, confidence level 95% (1 − α = 0.95), and selected sample sizes from 14 to 457. (The choice of the sample sizes is discussed in Section 5.) Sample sizes 8 to 13 are also included. However, with these sample sizes, the only 95% 1-s UCI based on order statistics that includes the 0.8-quantile is the trivial interval [D_(0), D_(n+1)] (using the convention that the (n + 1)'th order statistic for a sample of size n is the maximum population value), i.e., the rule cannot stop before the last relevant document is found. So for n < 14, the QBCB value of r is n + 1, and the rule does not provide meaningful stopping behavior. For these sizes we instead show r* = r − 1 = n, the largest non-trivial stopping point. We also show both the QBCB r and r* = r − 1 for the case n = 21, discussed in Section 5. Rows showing r* = r − 1 values are indicated by "*".

We show the values of three estimates of recall based solely on r or r*. The first is a 95% 1-s LCI, but for recall rather than for the t-quantile. In particular, we use the Clopper-Pearson exact interval [4]. Second, we show the plug-in estimate r/n discussed earlier for the PET rule. Finally, we show a 95% 1-s UCI on recall, again computed using the Clopper-Pearson method. For rows with the QBCB value of r, the lower end of the 95% 1-s LCI is always at or above 0.80, but fluctuates and is closer to 0.80 when the sample size is larger. This reflects the fact that the Clopper-Pearson LCI is based on the same binomial approximation used in the QBCB rule. The only difference is that the QBCB computation solves for integer r based on fixed real t, while the LCI computation solves for real t based on fixed integer r.
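The QBCB condition and the Table 1 estimates can be computed directly. The following sketch (our own implementation of the stated condition and of Clopper-Pearson one-sided bounds; scipy is assumed) reproduces the quantities for one sample size.

```python
# Sketch: the QBCB value of r and the Clopper-Pearson one-sided bounds on
# recall that appear in Table 1; our own code, using scipy.
from scipy.stats import binom, beta

def qbcb_r(n, t, alpha):
    """Smallest r with sum_{i=0}^{r-1} C(n,i) t^i (1-t)^(n-i) >= 1 - alpha."""
    for r in range(1, n + 1):
        if binom.cdf(r - 1, n, t) >= 1 - alpha:
            return r
    return None   # sample too small for a nontrivial stopping point

def clopper_pearson_one_sided(r, n, alpha):
    """1 - alpha one-sided lower and upper Clopper-Pearson bounds on r/n."""
    lower = beta.ppf(alpha, r, n - r + 1) if r > 0 else 0.0
    upper = beta.ppf(1 - alpha, r + 1, n - r) if r < n else 1.0
    return lower, upper

n, t, alpha = 30, 0.8, 0.05
r = qbcb_r(n, t, alpha)   # -> 28: two sample positives may remain unfound
lcb, ucb = clopper_pearson_one_sided(r, n, alpha)
print(r, lcb, r / n, ucb)   # 1-s LCB, plug-in estimate, 1-s UCB on recall
```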
The QBCB requirement that r be an integer means that the recall value at the lower end of the LCI is typically slightly more than 0.8, with the difference decreasing as n increases and more values of r become available. The plug-in point estimates (which are simply r/n or r*/n depending on the row) for small sample sizes are much higher than 0.8. We can think of these as the estimated recall at which the naive PET rule would need to stop to achieve the same confidence bound as the QBCB rule; they reflect how uncertain recall estimates from small samples are. The last column shows a 95% 1-s UCB on recall at the QBCB stopping point. This estimate shows that, as sample sizes increase, we slowly become more confident that the QBCB stopping point will not have very high recall. Section 5 discusses why, counterintuitively, we should want such confidence.

Past evaluations of stopping rules have often treated overshooting a recall goal as a lucky outcome [7]. By definition, however, a certification rule that stops with a recall higher than its goal has incurred extra costs. A TAR process that incurs high costs, particularly unpredictably high costs, while overshooting stakeholder requirements is not a success. Further, in some contexts exceeding a recall goal may be a negative outcome even if costs are ignored. A litigant that would like to produce 0% of responsive documents to an adversary, but has a legal obligation to produce 80% of responsive documents, is not happier if their legal service provider delivers 90% of responsive documents to the adversary.

Recall is an expensive measure on which to overshoot a goal. As a TAR method pushes for high recall, relevant documents tend to be spaced increasingly farther apart. This is the basis of the common heuristic rule that stops review when batch precision drops below some minimum value. Larger intervals between relevant documents mean that each percentage point of recall achieved beyond the goal value comes at increasing marginal cost. Thus part of the benefit of using a larger random sample in a certification rule is lower excess recall. Indeed, jointly choosing an order statistic and a sample size so that both a UCB and an LCB are bounded is an old technique from statistical quality control [18].

For Table 1 we chose the sample sizes n ≥ 30 to be the smallest sizes for which the 95% 1-s UCB on recall is less than or equal to each of the values 0.99 to 0.86, decreasing by increments of 0.01. For instance, 158 is the smallest sample size such that the 95% 1-s UCB on recall is 0.90 or lower. For sample sizes of 14 and above we always have 95% confidence that we achieve the specified minimum recall, 0.80. What we get for larger sample sizes is a lower expected recall.

As the rows (n, r) = (21, 20), (21, 21), and (22, 21) show, n = 22 is the lowest sample size for which n − r = 1, i.e., we can leave one sample positive undetected and still meet our criterion. For sample sizes from n = 23 through n = 29 the LCB, point estimate, and UCB of recall all increase steadily with increasing sample size, with the largest values at n = 29. This is a lose-lose situation: increasing sample size in this range both increases sampling costs and increases TAR costs (since we expect to stop at a higher recall). The pattern is not broken until n = 30, the lowest sample size for which we can leave two examples undetected, at which point the pattern starts again. This pattern results from the fact that a sample of size n only provides n + 1 possible stopping points if stopping is at an order statistic.
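The sample-size choice just described is a simple search. The sketch below (our code, reusing the hypothetical qbcb_r and clopper_pearson_one_sided helpers from the previous sketch) finds the smallest n whose 95% one-sided UCB on recall falls at or below a given ceiling.

```python
# Sketch of the sample-size search described above, reusing the qbcb_r and
# clopper_pearson_one_sided helpers defined in the previous sketch.

def smallest_n_for_ucb_ceiling(ceiling, t=0.8, alpha=0.05, n_max=2000):
    for n in range(14, n_max + 1):   # 14 is the first nontrivial size here
        r = qbcb_r(n, t, alpha)
        if r is None:
            continue
        _, ucb = clopper_pearson_one_sided(r, n, alpha)
        if ucb <= ceiling:
            return n, r
    return None

# The paper reports 158 as the smallest sample size meeting a 0.90 ceiling.
print(smallest_n_for_ucb_ceiling(0.90))
```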
Some combinations of sample size, confidence level, and population parameter (recall goal) inevitably match the available choices poorly. This problem diminishes for larger sample sizes, since more order statistics are available. As in other estimation situations with small sample sizes, careful choice of sample size can reduce costs substantially [42, Chapter 2]. This phenomenon is also relevant to empirical studies of certification rules: poor choices of sample size will introduce unneeded variation into the relationship between sample size and achieved recall (and thus cost). In our tests in Section 8 we use the optimal sample sizes from Table 1.

For the most part, however, larger samples reduce excess recall. How large a sample is appropriate depends on how much overshooting the recall goal costs. This depends on numerous details, including the difficulty of the classification problem, the size of the collection, the type of classifier, the active learning approach, and the batch size. In Section 8, we examine some typical situations.

We previously discussed the PET rule and our proposed QPET and QBCB rules. In this section, we examine other certification stopping rules in common TAR practice or proposed in the scientific literature. Practitioners often carry out a one-phase TAR workflow until a heuristic rule suggests that they have found most relevant documents. A common hybrid stopping approach is to first do this, then draw a random sample from the unreviewed documents, and make some statistical test on this sample. If the test succeeds, review stops. If the test fails, the sample is recycled as training data, and the review is restarted until the heuristic again indicates stopping and sampling. This can be thought of as a repeated PET (RPET) rule: we repeatedly test against some threshold value until succeeding. One statistical test used is accept on zero [19, 33, 39], i.e., recycle unless no relevant documents are found in the sample. More generally, one can estimate elusion from the sample, and recycle unless elusion is low enough. A variant on this uses the elusion estimate to compute an ad hoc estimate of recall [50], and recycles unless estimated recall is high enough. Regardless of the particular variant, all RPET approaches suffer from sequential bias induced by multiple testing: the process is more likely to stop when sampling fluctuation gives an over-optimistic estimate of effectiveness. Dimm [13] provides a detailed analysis of how accept on zero fails when used in an RPET rule.

Shemilt et al. discuss systematic review projects in which several stopping criteria were considered [45]. One is based on what they call the BIR (Baseline Inclusion Rate): simply the plug-in estimate p̂ = s/n of the proportion of relevant documents in the collection, computed from a random sample of size n containing s relevant documents. They convert this to an estimate p̂N = (sN)/n of the number of relevant documents in a collection of N documents. They propose stopping the TAR process when the number of relevant documents found equals this value, or the budgeted time runs out. This is equivalent to using kn/(sN) as an estimator for recall, where k is the number of relevant documents found so far, and stopping when estimated recall hits a recall target, which for Shemilt et al. was 1.0. This stopping rule is known in e-discovery as the "countdown method" or "indirect method". The method is seriously flawed. First, the countdown estimator can produce recall estimates greater than 1.0. Second, in those cases where the point estimate of the number of relevant documents in the population is an overestimate, the TAR process may reach the end of the collection without stopping.
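A toy calculation (our own, with made-up numbers) makes the first two flaws concrete.

```python
# Toy illustration of the countdown ("indirect") method's failure modes;
# all numbers are made up for illustration.

def countdown_recall(found, sample_positives, sample_size, collection_size):
    estimated_relevant = (sample_positives / sample_size) * collection_size
    return found / estimated_relevant

# A sample of 200 documents with 12 positives from a 10,000-document
# collection estimates 600 relevant documents.
# Underestimate case: if 700 documents are truly relevant, finding 650 of
# them yields an estimated "recall" above 1.0.
print(countdown_recall(found=650, sample_positives=12, sample_size=200,
                       collection_size=10_000))   # -> 1.083...
# Overestimate case: if only 550 are truly relevant, estimated recall tops
# out at 550/600 ~ 0.92, so a target of 1.0 is never reached and the review
# runs to the end of the collection.
print(countdown_recall(found=550, sample_positives=12, sample_size=200,
                       collection_size=10_000))   # -> 0.916...
```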
Finally, the countdown method does not take into account sampling variation, and so provides no statistical characterization of the actual recall achieved.

The Target rule [7] uses a simple random sample of 10 positive examples (the target set) and stops when the one-phase TAR process has found all of them. It would be viewed in our framework as implicitly computing a 1-s UCI [1, D_(10)] based on a positive sample of size 10, and stopping when the realization of D_(10) is reached. Cormack and Grossman analyze the Target rule and conclude it achieves a recall of 0.70 with 95% confidence. However, their analysis applies the binomial approximation in an unnecessarily conservative fashion. In fact, Table 1 shows that a target set of only 9 positive documents is sufficient to achieve a recall goal of 0.70 with 95% confidence, while their suggested target set of 10 positive documents achieves a recall goal slightly over 0.74. The Target rule satisfies (actually exceeds) its claimed statistical guarantee, but does not allow any flexibility in recall goal or confidence level. Further, as shown in Section 8, using the minimum possible positive sample size usually increases total review cost. Requiring that every positive sample document be found also means a single coding error would have large consequences.

The correctness of the QPET and QBCB stopping rules is completely determined by the theory of quantile statistics, regardless of sample size. Our goal in empirical work here is not, therefore, to verify the correctness of the rules, but simply to provide a demonstration of how sample size and cost interact in perhaps counterintuitive ways.

We worked with a random 20% subset of the RCV1-v2 [29] text categorization collection defined in a prior TAR study [56]. An advantage of RCV1-v2 over collections used in past TAR evaluations is the ability to explore a range of category difficulties and prevalences simultaneously. That study defined three levels of category prevalence and three of classification difficulty. For our demonstration, we selected the category closest to median difficulty and prevalence from each of their nine bins, and, for each category, the seed document closest to median difficulty. Based on that seed document, iterative relevance feedback with a batch size of 200 was carried out until the collection was exhausted (805 iterations). Supervised learning used the logistic regression implementation in scikit-learn with L2 regularization and 1.0 as the penalty strength. The resulting batches were concatenated in order. When applying the QBCB rule we considered stopping points only at the end of each batch, so order within batches had no effect. For each category and each positive sample size value, we then generated 100 simple random samples constrained to have exactly that number of positive examples.

Figure 2: Relationship between positive sample sizes (x-axis) and collection recall at the stopping point (y-axis) for the QBCB rule on category E12. 100 replications of each sample size are displayed using boxplot conventions: the box ranges from the 25% (q1) to 75% (q3) quartiles of recall over the 100 replications, the green and purple lines are the median and mean recall respectively, and whiskers extend to the lesser of the most extreme value observed or a distance of 1.5(q3 − q1) from the edges of the box. Outliers are presented as dots above and below the whiskers.
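For readers who want to reproduce a similar simulation, the sketch below gives a minimal relevance feedback loop in the spirit of the experimental setup described above (seed document, batch size 200, scikit-learn logistic regression with L2 regularization and C = 1.0). It is our own illustrative code, not the authors' experimental pipeline: data loading and feature extraction are left abstract, and the random fallback used before both classes have been labeled is our simplification.

```python
# Minimal relevance-feedback loop (illustrative sketch, not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def relevance_feedback_order(X, y, seed_idx, batch_size=200, rng=None):
    """Return document indices in the order a one-phase review would see them.

    X: feature matrix (dense or sparse); y: numpy array of 0/1 relevance
    labels; seed_idx: index of the seed document. The random fallback before
    both classes are present is our simplification, not the paper's procedure.
    """
    rng = rng or np.random.default_rng(0)
    n_docs = X.shape[0]
    reviewed = [seed_idx]
    while len(reviewed) < n_docs:
        remaining = np.setdiff1d(np.arange(n_docs), reviewed)
        labels = y[reviewed]
        if labels.min() == labels.max():      # only one class labeled so far
            order = rng.permutation(remaining)
        else:
            clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
            clf.fit(X[reviewed], labels)
            order = remaining[np.argsort(-clf.decision_function(X[remaining]))]
        reviewed.extend(order[:batch_size].tolist())
    return reviewed
```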
We applied the QBCB rule with 95% confidence and recall goal 0.80 to those samples, found the stopping iteration, and computed actual recall and cost at that point. Sample sizes used were all those from Table 1 that allow the confidence level and recall goal to be met. We separated the review cost at a stopping point into four components for analysis purposes: the positive and negative documents in the random sample, and the positive and negative documents encountered during relevance feedback prior to the stopping point. We assume that, to avoid bias, the random sample is reviewed by different personnel than those conducting the main review. Thus a document encountered both in the sample and during relevance feedback is charged twice. We discuss costs further in the next section.

We would expect that reducing the occurrence of very high recall values would also reduce the occurrence of very high costs. Figure 3 explores this in detail. It is again a boxplot over 100 replications, but this time for all 9 of our exemplar TAR workflows and displaying total cost rather than recall. Category ALG (Algeria) is a category where Cormack and Grossman's approach of using the minimum possible sample size works well. For most categories, however, investing in random samples large enough to get more than the minimum number of positives brings down the maximum cost over 100 replications substantially. For I22100 (medium frequency and medium difficulty) the maximum cost over 100 replications is a factor of 14 greater for a sample of 14 positives than for an optimal sample of 30 positives. The graphs also emphasize the importance of a power analysis in choosing small sample sizes. For most categories and cost statistics, 21 positives is actually worse than 14, while 22 is better. For categories E12 (Common-Hard) and I300003 (Common-Medium), using a larger than minimum sample size brings down not just the worst case cost, but even the median cost. It is worth noting that while these categories are in our "Common" bin, their prevalences are 3% and 1% respectively, which is typical or even low for e-discovery projects, depending on collection strategies. Sample-based stopping will be even more practical for a project in, say, the 10% prevalence range.

Our focus in this study has been on one-phase TAR reviews. Was anything lost by not considering two-phase review? Figure 4 uses cost dynamics graphs [56] to provide a perspective on this question. For a single TAR run (i.e., one seed) on category E12 we plot the total cost at stopping points from 0 to 200 iterations for four sample sizes. In addition to the four costs accounted for in Figure 3, for iterations where stopping would give recall less than 0.80 we add the cost of an optimal second-phase review to reach 0.80 recall. That is, for each iteration we rank the unreviewed documents and assume that a top-down review through that ranking is carried out until 0.80 recall is reached. This is the minimum cost that a two-phase review reaching 0.80 recall would incur.

The graphs immediately show that a one-phase review is optimal for this example: the minimum cost is at a point where no second-phase cost is incurred. This is typical for the setting of this paper, where the costs of all forms of review (sampling, phase 1, and phase 2 if present) are equal. One-phase review is typically not optimal when costs are unequal [56]. The graphs also provide an interesting perspective on the role of sample size in minimizing cost.
A horizontal dashed line shows the worst case total cost for QBCB over 100 replications for each sample size, while the vertical line shows the corresponding stopping point. As the sample size is increased, the stopping point comes closer to the minimum of the cost landscape, but the entire landscape is raised. The sample size that minimizes the worst case cost over our 100 replications, sample size 129 in this case, strikes a balance between the two effects.

The QBCB rule makes use only of the positive documents in a random sample. Exploiting both positive and negative documents using a hypergeometric distribution should modestly reduce sample size, if the unknown number of relevant documents can be addressed. The bounding technique proposed in Callaghan and Müller-Hansen [5] is one possible approach, as is treating the positive subpopulation size as a nuisance parameter [27, Chapter 6]. Other approaches to reducing sample size that could be applied are stratified sampling [48, Chapter 11] and multi-stage or sequential sampling [48, Chapter 13]. Dimm [13] has presented promising results on using multi-stage sampling to reduce costs in making a binary acceptance decision for a complete TAR production, and this approach likely can be adapted to stopping rules.

Desirable extensions of QBCB would be to two-sided confidence intervals, to two-phase workflows [56], to multiple assessors who may disagree, to effectiveness measures other than recall, and to rolling collections (where the TAR workflow must be started before all documents have arrived). Techniques from survey research for repeated sampling may be applicable to the last [34].

Finally, the QPET and QBCB rules are based on viewing a one-phase TAR process as incrementally exposing a ranking of a collection. The rules may also be applied to actual rankings of collections produced by, for instance, search engines and text classifiers. In this scenario, QPET and QBCB become rules for avoiding sequential bias in choosing a sample-based cutoff that hits an estimated recall target.

The philosophy of statistical quality control is to accurately characterize and control a process [16]. We have shown in this study that previously proposed certification rules for one-phase TAR reviews are statistically invalid, inflexible, expensive, or all three. Drawing on the statistical theory of quantile estimation, we derive a new rule, the QBCB rule, that avoids sequential bias and allows controlling the risk of excessive costs. The rule applies to any one-phase TAR workflow, and can immediately be put into practice in real-world TAR environments. By using this rule, valid statistical guarantees of recall can be produced for the first time, while mitigating the risks of extreme cost.
References
• A system for efficient high-recall retrieval
• A First Course in Order Statistics
• Perspectives on Predictive Coding: And Other Advanced Search Methods for the Legal Practitioner. American Bar Association, Section of Litigation
• Interval estimation for a binomial proportion
• Statistical stopping criteria for automated screening in systematic reviews
• Evaluation of machine-learning protocols for technology-assisted review in electronic discovery
• Engineering Quality and Reliability in Technology-Assisted Review
• Scalability of continuous active learning for reliable high-recall text classification
• High-recall information retrieval from linked big data
• Order statistics
• Active bucket categorization for high recall video retrieval
• TriviR: A visualization system to support document retrieval with high recall
• Confirming Recall Adequacy With Unbiased Multi-Stage Acceptance Testing
• A first course in multivariate statistics
• An introduction to systematic reviews
• Statistical quality control
• TREC 2016 Total Recall Track Overview
• Tolerance intervals for univariate distributions
• Minimum size sampling plans
• Statistical intervals: a guide for practitioners
• Cochrane handbook for systematic reviews of interventions
• Sample quantiles in statistical packages
• Technologically Assisted Reviews in Empirical Medicine Overview
• CLEF 2018 Technologically Assisted Reviews in Empirical Medicine Overview. CEUR Workshop Proceedings 2125
• CLEF 2019 Technology Assisted Reviews in Empirical Medicine Overview
• Limitations of assessing active learning performance at runtime
• Theory of point estimation
• Defining and Estimating Effectiveness in Document Review
• RCV1: A New Benchmark Collection for Text Categorization Research
• Req-rec: High recall retrieval with query pooling and interactive classification
• When to Stop Reviewing in Technology-Assisted Reviews: Sampling from an Adaptive Distribution to Estimate Residual Relevant Documents
• Active learning strategies for technology assisted sensitivity review
• An alternative approach to accept on zero and accept on one sampling plans
• Sequential Poisson sampling
• Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery. RAND Corporation
• TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19
• Relevance feedback in information retrieval
• TREC 2015 Total Recall Track Overview
• Measurement in E-Discovery
• Search and information retrieval science
• A survey on the use of relevance feedback for information access systems
• Sample size determination and power
• Batch-mode active learning for technology-assisted review
• Active learning literature survey
• Pinpointing needles in giant haystacks: use of text mining to reduce impractical screening workload in extremely large scoping reviews
• Sequential analysis: tests and confidence intervals
• Relevance maximization for high-recall retrieval problem: finding all needles in a haystack
• Theory of sample surveys
• Approximating Learning Curves for Active-Learning-Driven Annotation
• TAR for Smart People: How Technology Assisted Review Works and Why It Matters for Legal Professionals. Catalyst Repository Systems
• Sequential analysis. Courier Corporation
• Semi-automated screening of biomedical citations for systematic reviews
• Sequential testing in classifier evaluation yields biased estimates of effectiveness
• Introduction to robust estimation and hypothesis testing
• Heuristic Stopping Rules For Technology-Assisted Review
• On Minimizing Cost in Legal Document Review Workflows

Acknowledgments: We thank Lilith Bat-Leah and William Webber for their thoughtful feedback on drafts of this paper, and Tony Dunnigan for the Figure 1 diagram. All errors are the responsibility of the authors.