title: ROC Analyses Based on Measuring Evidence
authors: Labadi, Luai Al; Evans, Michael; Liang, Qiaoyu
date: 2021-03-01

ROC analyses are considered under a variety of assumptions concerning the distributions of a measurement $X$ in two populations. These include the binormal model as well as nonparametric models where little is assumed about the form of the distributions. The methodology is based on a characterization of statistical evidence which is dependent on the specification of prior distributions for the unknown population distributions as well as for the relevant prevalence $w$ of the disease in a given population. In all cases, elicitation algorithms are provided to guide the selection of the priors. Inferences are derived for the AUC as well as the cutoff $c$ used for classification and the associated error characteristics.

An ROC analysis is used in medical science to determine whether or not a real-valued diagnostic X for a disease or condition is useful. If the diagnostic indicates that an individual has the condition, then this will typically mean that a more expensive or invasive medical procedure is undertaken. So it is important to assess the accuracy of X. These methods have a wider class of applications but our terminology will focus on the medical context. An approach to such analyses is presented here that is based on a characterization of statistical evidence and which incorporates all available information as expressed via prior probability distributions. For example, while p-values are often used in such analyses, there are questions concerning the validity of these quantities as characterizations of statistical evidence. As will be seen, there are many advantages to the framework adopted here.

A common approach to the assessment of X is to estimate its AUC, namely, the probability that an individual sampled from the diseased population will have a higher value of X than an individual independently sampled from the nondiseased population. A good X should give a value of the AUC near 1, while a value near 1/2 indicates a poor diagnostic (if the AUC is near 0, then the classification is reversed). It is possible, however, that a diagnostic with AUC ≈ 1 may not be suitable (see Examples 1 and 6). In particular, a cutoff value c needs to be selected so that, if X > c, then an individual is classified as requiring the more invasive procedure. Inferences about the error characteristics for the combination (X, c), such as the false positive rate, etc., are also required. This paper is concerned with inferences about the AUC, the cutoff c, and the error characteristics.

A key aspect of the analysis is the relevant prevalence w. The phrase "relevant prevalence" means that X will be applied to a certain population, such as those patients who exhibit certain symptoms, and w represents the proportion of this subpopulation who are diseased. The value of w may vary by geography, medical unit, time, etc. To make a valid assessment of X in an application, it is necessary that the information available concerning w be incorporated. This information is expressed here via an elicited prior probability distribution for w, which may be degenerate at a single value if w is known, or be quite diffuse when little is known about w. In fact, all unknown population quantities are given elicited priors.
There are many contexts where data is available relevant to the value of w and this leads to a full posterior analysis for w as well as for the other quantities of interest. Even when such data is not available, however, it is still possible to take the prior for w into account so the uncertainties concerning w always play a role in the analysis. While there are many methods available for the choice of c, see López-Ratón et al. (2014) , Unal (2017) , these often do not depend on the prevalence w which is a key factor in determining the true error characteristics of (X, c) in an application, see Verbakel et al. (2020) . So it is preferable to take w into account when considering the value of a diagnostic in a particular context. One approach to choosing c is to minimize some error criterion that depends on w to obtain c opt . As will be demonstrated in the examples, however, sometimes c opt results in a classification that is useless. In such a situation a suboptimal choice of c is required but the error characteristics can still be based on what is known about w so that these are directly relevant to the application. Others have pointed out deficiencies in the AUC statistic and proposed alternatives. Hand (2009) takes into account the costs associated with various misclassification errors and argues that using the AUC is implicitly making unrealistic assumptions concerning these costs. While costs are relevant, costs are not incorporated here as these are often difficult to quantify. Our goal is to express clearly what the evidence is saying about how good (X, c) is via an assessment of its error characteristics. With the error characteristics in hand, a user can decide whether or not the costs of misclassifications are such that the diagnostic is usable. This may be a qualitative assessment although, if numerical costs are available, these could be subsequently incorporated. The principle here is that economic or social factors be considered separately from what the evidence in the data says, as it is a goal of statistics to clearly state the latter. The framework for the analysis is Bayesian as proper priors are placed on the unknown distribution F N D (the distribution of X in the nondiseased population), on F D (the distribution of X in the diseased population) and the prevalence w. In all the problems considered, elicitation algorithms are presented for how to choose these priors. Also, all inferences are based on the relative belief characterization of statistical evidence where, for a given quantity, evidence in favor (against) is obtained when posterior beliefs are greater (less) than prior beliefs, see Evans (2015) . So evidence is determined by how the data changes beliefs. Section 2 discusses the general framework and defines relevant quantities. Section 3 develops inferences for these quantities for three contexts (1) X is an ordered discrete variable with no constraints on (F N D , F D ) (2) X is a continuous variable and (F N D , F D ) are normal distributions (the binormal model ) (3) X is a continuous variable and no constraints are placed on (F N D , F D ). There is previous work on using Bayesian methods in ROC analyses. For example, Gu et al. (2008) estimate the ROC using the Bayesian bootstrap. Carvalho et al. (2013) consider ROC analyses when there are covariates using priors similar to those discussed in Section 3.4. Ladouceur et al. 
(2011) also use priors similar to those used here but only consider the sampling regime where the data can be used for inference about the relevant prevalence and where a gold standard classifier is not assumed to exist.

The contributions of this paper are as follows. An elicitation algorithm is provided for every prior used. As described in Section 2, two different sampling regimes are considered, as sampling regime (i) seems more relevant in many medical applications than sampling regime (ii). While a prior on the relevant prevalence is used in both sampling regimes, the posterior distribution of this quantity is only available in sampling regime (ii), but the prior is still used when making inferences about relevant quantities under sampling regime (i). Inferences about the AUC, the optimal cutoff and various error quantities associated with the cutoff are implemented for both sampling regimes. These inferences include estimates of the AUC and the optimal cutoff as well as exact assessments of the error in these estimates. In addition, estimates are provided for the error characteristics of the classification, at the cutoff used, that determine the value of the diagnostic in an application. It is shown that sometimes a useful optimal cutoff does not exist, so some other choice is necessary. In each case the hypothesis H_0 : AUC > 1/2 is first assessed and, if evidence is found in favor of this, the prior is then conditioned on this event being true for inferences about the remaining quantities. Three contexts are considered: the diagnostic takes finitely many values, the diagnostic is normally distributed, and the diagnostic is continuous but not normal. A thorough analysis is made of the binormal model and it is shown that, unless certain conditions on the model parameters are satisfied, a useful optimal cutoff is not available. Hypothesis assessments are made to determine if these conditions hold. Based on the binormal model, a nonparametric Bayes model is developed that allows for deviation from normality.

Consider the formulation of the problem as presented in Obuchowski and Bullen (2018), Zhou et al. (2011), but with somewhat different notation. There is a measurement X : Ω → R defined on a population Ω = Ω_D ∪ Ω_ND, where Ω_D is comprised of those with a particular disease and Ω_ND represents those without the disease. So F_ND(x) = #({ω ∈ Ω_ND : X(ω) ≤ x})/#(Ω_ND) is the conditional cdf of X in the nondiseased population, and F_D(x) = #({ω ∈ Ω_D : X(ω) ≤ x})/#(Ω_D) is the conditional cdf of X in the diseased population. It is assumed that there is a gold standard classifier, typically much more difficult to use than X, such that for any ω ∈ Ω it can be determined definitively if ω ∈ Ω_D or ω ∈ Ω_ND. There are two ways in which one can sample from Ω, namely, (i) take samples from each of Ω_D and Ω_ND separately or (ii) take a sample from Ω. The sampling method used affects the inferences that can be drawn and for many studies (i) is the relevant sampling mode. It is supposed that the greater the value X(ω) is for individual ω, the more likely it is that ω ∈ Ω_D. For the classification, a cutoff value c is required such that, if X(ω) > c, then ω is classified as being in Ω_D and otherwise is classified as being in Ω_ND. But X is an imperfect classifier for any c and it is necessary to assess the performance of (X, c).

Table 1: Error probabilities when X > c indicates a positive.

                      classified positive (X > c)                  classified negative (X ≤ c)
diseased (Ω_D)        sensitivity (true positive rate) 1 − F_D(c)   false negative rate FNR(c) = F_D(c)
nondiseased (Ω_ND)    false positive rate FPR(c) = 1 − F_ND(c)      specificity (true negative rate) F_ND(c)
It seems natural that a value of c be used that is optimal in some sense related to the error characteristics of this classification. Table 1 gives the relevant probabilities for classification into Ω_D and Ω_ND, together with some common terminology, in a confusion matrix. Another key ingredient is the prevalence w = #(Ω_D)/#(Ω) of the disease in Ω. In practical situations it is necessary to also take w into account in assessing the error in (X, c). The following error characteristics depend on w:
Error(c) = w FNR(c) + (1 − w) FPR(c),
FDR(c) = (1 − w) FPR(c) / {w(1 − FNR(c)) + (1 − w) FPR(c)},
FNDR(c) = w FNR(c) / {w FNR(c) + (1 − w)(1 − FPR(c))}.
Under sampling regime (ii) and cutoff c, Error(c) is the probability of making an error, FDR(c) is the conditional probability of misclassifying a subject as positive given that it is classified as positive and FNDR(c) is the conditional probability of misclassifying a subject as negative given that it is classified as negative. It is often observed that, when w is very small and FNR(c) and FPR(c) are small, FDR(c) can be big. This is sometimes referred to as the base rate fallacy as, even though the test appears to be a good one, there is a high probability that an individual classified as having the disease will be misclassified. In these cases the false nondiscovery rate is quite small while the false discovery rate is large. If the disease is highly contagious, then these probabilities may be considered acceptable but indeed they need to be estimated. Similarly, FNDR(c) may be small when FNR(c) is large and w is very small.

It is naturally desirable to make inference about an optimal cutoff c_opt and its associated error quantities. For a given value of w, the optimal cutoff will be defined here as c_opt = arg inf_c Error(c), the value which minimizes the probability of making an error. Other choices for determining a c_opt can be made, and the analysis and computations will be quite similar, but our thesis is that, when possible, any such criterion should involve the prior distribution of the relevant prevalence w. As demonstrated in Example 6, this can sometimes lead to useless values of c_opt even when the AUC is large. While this situation calls into question the value of the diagnostic, a suboptimal choice of c can still be made according to some alternative methodology like the use of Youden's index (maximizing 1 − 2Error(c) over c with w = 1/2). The methodology developed here provides an estimate of the c to be used, together with an exact assessment of the error in this estimate, as well as providing estimates of the associated error characteristics of the classification.

Consider two situations where F_ND, F_D are either both absolutely continuous or both discrete. In the discrete case, suppose that these distributions are concentrated on a set of points c_1 < c_2 < · · · < c_m. When ω_D, ω_ND are selected using sampling scheme (i), then the probability that a higher score is received on diagnostic X by a diseased individual than a nondiseased individual is
AUC = P(X(ω_D) > X(ω_ND)) = ∫ (1 − F_D(x)) F_ND(dx), (1)
which in the discrete case equals Σ_{i=1}^m (1 − F_D(c_i)) p_NDi, where p_NDi = P(X = c_i | ND). Under the assumption that F_D(c) is constant on {c : F_ND(c) = p} for every p ∈ [0, 1], there is a function ROC (the receiver operating characteristic curve) such that the true positive rate 1 − F_D(c) depends on c only through the false positive rate p = 1 − F_ND(c), namely 1 − F_D(c) = ROC(p). In the absolutely continuous case, AUC = ∫_0^1 ROC(p) dp, which is the area under the curve given by the ROC function. The area under the curve interpretation is geometrically evocative but is not necessary for (1) to be meaningful. It is commonly suggested that a good diagnostic X will have AUC close to 1 while a value close to 1/2 suggests a poor diagnostic.
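To make the dependence on w concrete, the following is a minimal Python sketch (not from the paper) of how the prevalence-dependent error characteristics combine the within-population rates FNR(c) and FPR(c); the illustrative numbers are hypothetical.

```python
# Sketch: prevalence-dependent error characteristics from FNR(c), FPR(c) and w.
# Uses the definitions in the text: Error(c) = w*FNR + (1-w)*FPR,
# FDR(c) = P(nondiseased | classified positive), FNDR(c) = P(diseased | classified negative).

def error_characteristics(w, fnr, fpr):
    tpr, tnr = 1.0 - fnr, 1.0 - fpr                        # sensitivity, specificity
    error = w * fnr + (1.0 - w) * fpr                      # overall misclassification probability
    fdr = (1.0 - w) * fpr / ((1.0 - w) * fpr + w * tpr)    # false discovery rate
    fndr = w * fnr / (w * fnr + (1.0 - w) * tnr)           # false nondiscovery rate
    return {"Error": error, "FDR": fdr, "FNDR": fndr}

# Base rate fallacy: small w with small FNR and FPR can still give a large FDR.
print(error_characteristics(w=0.01, fnr=0.05, fpr=0.05))
# -> FDR is roughly 0.84 even though both within-population error rates are 5%.
```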
It is surely the case, however, that the utility of X in practice will depend on the cutoff c chosen and the various error characteristics associated with this choice. So, while the AUC can be used to screen diagnostics, it is only part of the analysis and inferences about the error characteristics are required to truly assess the performance of a diagnostic. Consider an example.

Example 1. Suppose that F_D = F_ND^q for some q > 1, where F_ND is continuous and strictly increasing with associated density f_ND. Then, using (1), AUC = 1 − 1/(q + 1), which is approximately 1 when q is large. The optimal c minimizes Error(c) = w F_ND^q(c) + (1 − w)(1 − F_ND(c)), which implies that c satisfies F_ND(c) = {(1 − w)/(qw)}^{1/(q−1)} when q > (1 − w)/w, and the optimal c is otherwise c = ∞. If q = 99, then AUC = 0.99 and, with w = 0.025, (1 − w)/w = 39 < q, so FNR(c_opt) = 0.390, FPR(c_opt) = 0.009, Error(c_opt) = 0.019, FDR(c_opt) = 0.009 and FNDR(c_opt) = 0.010. So X seems like a good diagnostic via the AUC and the error characteristics that depend on the prevalence, although within the diseased population the probability is 0.39 of not detecting the disease. If instead w = 0.01, then the AUC is the same but q = 99 = (1 − w)/w and the optimal classification always classifies an individual as nondiseased, which is useless. So the AUC does not indicate enough about the characteristics of the diagnostic to determine whether it is useful or not. It is necessary to look at the error characteristics of the classification at the cutoff value that will actually be used to determine if a diagnostic is suitable, and this implies that information about w is necessary in an application.

Suppose we have a sample of n_D from Ω_D, namely x_D = (x_D1, . . . , x_Dn_D), and a sample of n_ND from Ω_ND, namely x_ND = (x_ND1, . . . , x_NDn_ND), and the goal is to make inference about the AUC, some cutoff c and the error characteristics FNR(c), FPR(c), Error(c), FDR(c) and FNDR(c). For the AUC it makes sense to first assess the hypothesis H_0 : AUC > 1/2 by stating whether there is evidence for or against H_0, together with an assessment of the strength of this evidence. Estimates are required for all of these quantities, together with an assessment of the accuracy of the estimates. As stated in the Introduction, several different contexts are considered and the approach here is Bayesian, with a prior placed on (F_ND, F_D) as well as on the relevant prevalence w. The specific inferences are derived via the principle of evidence: if the posterior probability of an event is greater (smaller) than the prior probability of the event, then there is evidence in favor of (against) the event being true. This approach is implemented via the relative belief ratio (see Evans (2015)), which is effectively the ratio of the posterior probability to the prior probability of the event in question. So if the relative belief ratio is greater than (less than) 1 there is evidence in favor of (against) the event being true.

Consider first inferences for the relevant prevalence w. If w is known then nothing further needs to be done, but otherwise this quantity needs to be taken into account when assessing the value of the diagnostic and so uncertainty about w needs to be addressed. If the full data set is based on sampling scheme (ii), then n_D ∼ binomial(n, w). A natural prior π_W to place on w is a beta(α_1w, α_2w) distribution. The hyperparameters are chosen based on the elicitation algorithm discussed in Evans et al. (2017).
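As a quick numerical check of Example 1, the following Python sketch (not part of the paper's code) evaluates the AUC, the optimal cutoff condition and the resulting error rates when F_D = F_ND^q.

```python
# Sketch of the Example 1 calculations when F_D = F_ND^q (F_ND continuous, strictly increasing).
# The formulas restate those in the text; q = 99 and w = 0.025 match the example.

def example1(q, w):
    auc = 1.0 - 1.0 / (q + 1.0)                          # AUC = 1 - 1/(q+1)
    if q <= (1.0 - w) / w:                               # no finite minimizer of Error(c)
        return auc, None
    F_nd = ((1.0 - w) / (q * w)) ** (1.0 / (q - 1.0))    # F_ND(c_opt)
    fpr = 1.0 - F_nd                                     # FPR(c_opt) = 1 - F_ND(c_opt)
    fnr = F_nd ** q                                      # FNR(c_opt) = F_D(c_opt) = F_ND(c_opt)^q
    error = w * fnr + (1.0 - w) * fpr                    # Error(c_opt)
    return auc, {"FNR": fnr, "FPR": fpr, "Error": error}

print(example1(q=99, w=0.025))   # AUC = 0.99, FNR ~ 0.390, FPR ~ 0.009, Error ~ 0.019
print(example1(q=99, w=0.01))    # AUC = 0.99 but no finite optimal cutoff since q = (1-w)/w
```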
In the algorithm of Evans et al. (2017), an interval [l, u] is chosen such that it is believed that w ∈ [l, u] with prior probability γ. Here [l, u] is chosen so that we are virtually certain that w ∈ [l, u], and γ = 0.99 then seems like a reasonable choice. Note that choosing l = u corresponds to w being known and so γ = 1 in that case. Next pick a point ξ_w ∈ [l, u] for the mode of the prior; a reasonable choice might be ξ_w = (l + u)/2. Then putting τ_w = α_1w + α_2w − 2 leads to the parameterization beta(α_1w, α_2w) = beta(1 + τ_w ξ_w, 1 + τ_w(1 − ξ_w)), where ξ_w locates the mode and τ_w controls the spread of the distribution about ξ_w. Here τ_w = 0 gives the uniform distribution and τ_w = ∞ gives the distribution degenerate at ξ_w. With ξ_w specified, τ_w is then taken to be the smallest value such that the probability content of [l, u] is γ, and this is found iteratively. For example, if [l, u] = [0.60, 0.70] and γ = 0.99, so w is known reasonably well, then ξ_w = (l + u)/2 = 0.65 and τ_w = 601.1, so the prior is beta(391.72, 211.39) and the posterior is beta(391.72 + n_D, 211.39 + n_ND).

The estimate of w is then obtained by maximizing the relative belief ratio RB(w | n_D, n_ND) = π_W(w | n_D, n_ND)/π_W(w), the ratio of the posterior to the prior, as this value has the most evidence in its favor. In this case the estimate is the MLE, namely w(n_D, n_ND) = n_D/(n_D + n_ND). The accuracy of this estimate is measured by the size of the plausible region Pl(n_D, n_ND) = {w : RB(w | n_D, n_ND) > 1}, the set of all w values for which there is evidence in favor. For example, if n = 100 and n_D = 68, then w(68, 32) = 0.68 and Pl(68, 32) = [0.647, 0.712], which has posterior content 0.651. So the data suggest that the upper bound of u = 0.70 is too strong, although the posterior belief in this interval is not very high.

The prior and posterior distributions of w play a role in inferences about all the quantities that depend on the prevalence. In the case where the cutoff is determined by minimizing the probability of a misclassification, then c_opt, FNR(c_opt), FPR(c_opt), Error(c_opt), FDR(c_opt) and FNDR(c_opt) all depend on the prevalence. Under sampling scheme (i), however, only the prior on w has any influence when considering the effectiveness of X. Inference for these quantities is now discussed in both cases.

Suppose X takes values on the finite ordered scale c_1 < c_2 < · · · < c_m and let p_NDi = P(X = c_i | ND) and p_Di = P(X = c_i | D), so that FNR(c_i) = Σ_{j=1}^i p_Dj, FPR(c_i) = Σ_{j=i+1}^m p_NDj and AUC = Σ_{i=1}^m (1 − FNR(c_i)) p_NDi, with the remaining quantities defined similarly. Evans et al. (2017) can be used to obtain independent elicited Dirichlet priors on these probabilities by placing either upper or lower bounds on each cell probability that hold with virtual certainty γ, as discussed for the beta prior on the prevalence. If little information is available, it is reasonable to use uniform (Dirichlet(1, . . . , 1)) priors on p_ND and p_D. This, together with the independent prior on w, leads to prior distributions for the AUC, c_opt and all the quantities associated with error assessment such as FNR(c_opt), etc. The cell counts f_ND = (f_ND1, . . . , f_NDm) and f_D = (f_D1, . . . , f_Dm) obtained from the two samples in turn lead to the independent posteriors, which are again Dirichlet with the counts added to the corresponding prior parameters. Under sampling regime (ii) this, together with the independent posterior on w, leads to posterior distributions for all the quantities of interest. Under sampling regime (i), however, the logical thing to do, so the inferences reflect the uncertainty about w, is to only use the prior on w when deriving inferences about any quantities that depend on this, such as c_opt and the various error assessments.
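The elicitation of the beta prior for w and the relative belief inferences described above can be sketched as follows (Python with scipy; an illustration under the stated choices, not the authors' code).

```python
# Sketch: elicit beta(1 + tau*xi, 1 + tau*(1 - xi)) for w so that P(l <= w <= u) ~ gamma,
# then compute the relative belief estimate and plausible region for w given counts.
import numpy as np
from scipy.stats import beta

def elicit_tau(l, u, xi, gamma=0.99, tau_max=1e6):
    # smallest tau giving prior content gamma to [l, u]; the content increases with tau
    # here (mode inside [l, u]), so bisection applies
    lo, hi = 0.0, tau_max
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        content = beta.cdf(u, 1 + mid * xi, 1 + mid * (1 - xi)) - \
                  beta.cdf(l, 1 + mid * xi, 1 + mid * (1 - xi))
        lo, hi = (lo, mid) if content >= gamma else (mid, hi)
    return hi

l, u = 0.60, 0.70
xi = (l + u) / 2
tau = elicit_tau(l, u, xi)                 # roughly 601, as in the text
a1, a2 = 1 + tau * xi, 1 + tau * (1 - xi)

# relative belief ratio for w on a grid, given n_D = 68, n_ND = 32 under sampling regime (ii)
nD, nND = 68, 32
w = np.linspace(0.001, 0.999, 999)
rb = beta.pdf(w, a1 + nD, a2 + nND) / beta.pdf(w, a1, a2)
w_hat = w[np.argmax(rb)]                   # relative belief estimate (here the MLE)
plaus = w[rb > 1]                          # plausible region {w : RB > 1}
content = beta.cdf(plaus.max(), a1 + nD, a2 + nND) - beta.cdf(plaus.min(), a1 + nD, a2 + nND)
print(round(w_hat, 3), round(plaus.min(), 3), round(plaus.max(), 3), round(content, 3))
```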
Consider inferences for the AUC. The first inference should be to assess the hypothesis H_0 : AUC > 1/2 for, if H_0 is false, then X would seem to have no value as a diagnostic (the possibility that the directionality is wrong is ignored here). The relative belief ratio is computed and compared to 1. If it is concluded that H_0 is true, then perhaps the next inference of interest is to estimate the AUC via the relative belief estimate. The prior and posterior densities of the AUC are not available in closed form, so estimates are required and density histograms are employed here for this. The set (0, 1] is discretized into L subintervals, (0, 1] = ∪_{i=1}^L ((i − 1)/L, i/L], and, putting a_i = (i − 1/2)/L, the value of the prior density p_AUC(a_i) is estimated by L × (proportion of prior simulated values of the AUC in ((i − 1)/L, i/L]), and similarly for the posterior density p_AUC(a_i | f_ND, f_D). Then RB_AUC(a | f_ND, f_D) is maximized to obtain the relative belief estimate AUC(f_ND, f_D), together with the plausible region and its posterior content. These quantities are obtained for c_opt in a similar fashion, although c_opt has prior and posterior distribution concentrated on {c_1, c_2, . . . , c_m} so there is no need to discretize. For the estimation of the error characteristics, the prior and posterior distributions of FNR(c_opt), FPR(c_opt), Error(c_opt), FDR(c_opt) and FNDR(c_opt) are obtained, as these indicate the performance of the diagnostic in practice. The relative belief estimates of these quantities are easily obtained in a second simulation. Consider now an example.

Example 2. Simulated example. For k = 5, data was generated for the diseased and nondiseased populations. Supposing that the relevant prevalence is known to be w = 0.65, Figure 1 contains plots of the prior and posterior densities and the relative belief ratio. The plausible region for the optimal cutoff has posterior probability content 0.53, so the correct optimal cut-off has been identified but there is a degree of uncertainty concerning this. The error characteristics that tell us about the utility of X as a diagnostic are given by the relative belief estimates (column (a)) in Table 2. It is interesting to note that the estimate of Error(c_opt) is determined by the prior and posterior distributions of a convex combination of FPR(c_opt) and FNR(c_opt), and the estimate is not the same convex combination of the estimates of FPR(c_opt) and FNR(c_opt). So, in this case Error(c_opt) seems like a much better assessment of the performance of the diagnostic.

Suppose now that the prevalence is not known but there is a beta(1 + τ_w ξ_w, 1 + τ_w(1 − ξ_w)) prior specified for w, and consider the choice discussed in Section 3.1 where ξ_w = 0.65 and τ_w = 601.1. When the data is produced according to sampling regime (i), then there is no posterior for w, but this prior can still be used in determining the prior and posterior distributions of c_opt and the associated error characteristics. When this simulation was carried out, c_opt(f_ND, f_D) = 2 with Pl_copt(f_ND, f_D) = {2} with posterior probability content 0.53, and column (b) of Table 2 gives the estimates of the error characteristics. So, other than the estimate of the FPR, the results are similar. Finally, assuming that the data arose under sampling scheme (ii), then w has a posterior distribution, and using this gives the same estimate of c_opt with posterior probability content 0.52 and error characteristics as in column (c) of Table 2. These results are the same as when the prevalence is known, which is sensible as the posterior concentrates about the true value more than the prior.

[Table 2: The estimates of the error characteristics of X at c_opt = 2 in Example 2, where (a) w is assumed known, (b) only the prior for w is available, (c) the posterior for w is also available; the FNDR(c_opt) row equals 0.34 in all three columns.]
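To illustrate the discretized relative belief computations described above, here is a minimal Python sketch that simulates the prior and posterior of the AUC under independent Dirichlet priors and extracts the relative belief estimate and plausible region; the cell counts used are hypothetical, not those of Example 2.

```python
# Sketch: relative belief inference for the AUC when X takes values c_1 < ... < c_m.
# Uniform Dirichlet(1,...,1) priors on p_ND and p_D; hypothetical cell counts f_ND, f_D.
import numpy as np
rng = np.random.default_rng(1)

def auc(p_d, p_nd):
    # AUC = sum_i (1 - FNR(c_i)) p_NDi with FNR(c_i) = sum_{j<=i} p_Dj
    return np.sum((1.0 - np.cumsum(p_d, axis=-1)) * p_nd, axis=-1)

m, L, N = 5, 20, 10**5
f_nd = np.array([10, 8, 6, 4, 2])          # hypothetical counts from the nondiseased sample
f_d  = np.array([2, 4, 6, 8, 10])          # hypothetical counts from the diseased sample

prior_auc = auc(rng.dirichlet(np.ones(m), N), rng.dirichlet(np.ones(m), N))
post_auc  = auc(rng.dirichlet(1 + f_d, N),   rng.dirichlet(1 + f_nd, N))

edges = np.linspace(0, 1, L + 1)
prior_dens, _ = np.histogram(prior_auc, bins=edges, density=True)
post_dens, _  = np.histogram(post_auc,  bins=edges, density=True)
rb = np.divide(post_dens, prior_dens, out=np.zeros_like(post_dens), where=prior_dens > 0)

mids = 0.5 * (edges[:-1] + edges[1:])
est = mids[np.argmax(rb)]                  # relative belief estimate of the AUC
plausible = mids[rb > 1]                   # bin midpoints with evidence in favor
content = np.mean((post_auc >= plausible.min() - 0.5 / L) &
                  (post_auc <= plausible.max() + 0.5 / L))   # approx. posterior content
print(round(est, 3), round(plausible.min(), 3), round(plausible.max(), 3), round(content, 3))

# Evidence for H0: AUC > 1/2 via the relative belief ratio of the event
print(round(np.mean(post_auc > 0.5) / np.mean(prior_auc > 0.5), 2))
```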
Another somewhat anomalous feature of this example is the fact that uniform priors on p_D and p_ND do not lead to a prior on the AUC that is even close to uniform. In fact, these choices put more weight against a diagnostic with AUC > 1/2 and indeed most choices of p_D and p_ND will not satisfy this. Another possibility is to require p_ND1 ≥ · · · ≥ p_NDk and p_D1 ≤ · · · ≤ p_Dk, namely, require monotonicity of the probabilities. A result in Englert et al. (2018) implies that p_ND satisfies this iff p_ND = A_k p*_ND, where p*_ND ∈ S_k, the standard (k − 1)-dimensional simplex, and A_k ∈ R^{k×k} has i-th row equal to (0, . . . , 0, 1/i, 1/(i + 1), . . . , 1/k), and that p_D satisfies this iff p_D = B_k p*_D, where p*_D ∈ S_k and B_k = I*_k A_k, with I*_k ∈ R^{k×k} containing all 0's except for 1's on the cross-diagonal. If p*_ND and p*_D are independent and uniform on S_k, then p_D and p_ND are independent and uniform on the sets of probabilities satisfying the corresponding monotonicities, and Figure 2 has a plot of the prior of the AUC when this is the case. It is seen that this prior puts most of its weight in favor of AUC > 1/2. Figure 2 also has a plot of the prior of the AUC when p_D is uniform on the set of all nondecreasing probabilities and p_ND is uniform on S_k. This reflects a much more modest belief that X will satisfy AUC > 1/2 and indeed this may be a more appropriate prior than using uniform distributions on S_k. Englert et al. (2018) also provides elicitation algorithms for choosing alternative Dirichlet distributions for p*_ND and p*_D.

When H_0 : AUC > 0.5 is accepted, it makes sense to use the conditional prior, given that this event is true, in the inferences. As such it is necessary to condition the prior on the event Σ_{i=1}^m Σ_{j=1}^i p_Dj p_NDi ≤ 1/2. In general it isn't clear how to generate from this conditional prior but, depending on the size of m and the prior, a brute force approach is to simply generate from the unconditional prior and select those samples for which the condition is satisfied, and the same approach works with the posterior.

Example 2. Simulated example (continued). Here m = 5 and, using uniform priors for p_ND and p_D, the prior probability of AUC > 0.5 is 0.281 while the posterior probability is 0.998, so the posterior sampling is much more efficient. Choosing priors that are more favorable to AUC > 0.5 will improve the efficiency of the prior sampling. Using the conditional priors led to AUC(f_ND, f_D) = 0.66 with Pl_AUC(f_ND, f_D) = [0.60, 0.76] with posterior content 0.85. This is similar to the results obtained using the unconditional prior, but the conditional prior puts more mass on larger values of the AUC, hence the wider plausible region with lower posterior content. Also, c_opt(f_ND, f_D) = 2 with Pl_copt(f_ND, f_D) = {1, 2} with posterior probability content approximately 1.00 (actually 0.99999), which reflects virtual certainty that the true optimal value is in {1, 2}.
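The monotone reparameterization described above, via the matrices A_k and B_k, is easy to simulate from; the following Python sketch (an illustration, not the authors' code) constructs the matrices and checks by simulation how much prior weight the resulting priors put on AUC > 1/2.

```python
# Sketch: uniform priors on monotone probability vectors via p_ND = A_k p*_ND, p_D = B_k p*_D.
import numpy as np
rng = np.random.default_rng(2)

def A(k):
    # i-th row (1-indexed): (0, ..., 0, 1/i, 1/(i+1), ..., 1/k), so p = A p* is nonincreasing
    M = np.zeros((k, k))
    for i in range(k):
        M[i, i:] = 1.0 / np.arange(i + 1, k + 1)
    return M

def auc(p_d, p_nd):
    return np.sum((1.0 - np.cumsum(p_d, axis=-1)) * p_nd, axis=-1)

k, N = 5, 10**5
A_k = A(k)
B_k = np.flipud(np.eye(k)) @ A_k            # cross-diagonal reversal: p_D = B_k p*_D is nondecreasing

p_nd = rng.dirichlet(np.ones(k), N) @ A_k.T # nonincreasing over c_1 < ... < c_k
p_d  = rng.dirichlet(np.ones(k), N) @ B_k.T # nondecreasing over c_1 < ... < c_k

# prior probability of AUC > 1/2 under different prior choices
print(np.mean(auc(p_d, p_nd) > 0.5))        # both constrained to be monotone
unif_d, unif_nd = rng.dirichlet(np.ones(k), N), rng.dirichlet(np.ones(k), N)
print(np.mean(auc(unif_d, unif_nd) > 0.5))  # both uniform on S_k (about 0.28 for k = 5, as in the text)
print(np.mean(auc(p_d, unif_nd) > 0.5))     # p_D monotone, p_ND uniform on S_k
```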
Suppose now that X is a continuous diagnostic variable and it is assumed that the distributions F_D and F_ND are normal distributions. The assumption of normality should be checked by an appropriate test and it will be assumed here that this has been carried out and normality was not rejected. While the normality assumption may seem somewhat unrealistic, many aspects of the analysis can be expressed in closed form and this allows for a deeper understanding of ROC analyses more generally. With Φ denoting the N(0, 1) cdf, FNR(c) = Φ((c − µ_D)/σ_D) and FPR(c) = 1 − Φ((c − µ_ND)/σ_ND). For given (µ_D, σ_D, µ_ND, σ_ND) and c, all these values can be computed using Φ except the AUC, and for that quadrature or simulation via generating z ∼ N(0, 1) is required. The following results hold for the AUC, with the proofs in the Appendix.

Lemma 2. AUC > 1/2 iff µ_D > µ_ND and, when µ_D > µ_ND, the AUC is a strictly increasing function of σ_ND/σ_D.

From Lemma 2 it is clear that it makes sense to restrict the parameterization so that µ_D > µ_ND, but we need to test the hypothesis H_0 : µ_D > µ_ND first. Clearly Error(c) = wFNR(c) + (1 − w)FPR(c) → 1 − w as c → −∞ and Error(c) → w as c → ∞ so, if Error(c) does not achieve a minimum at a finite value of c, then the optimal cut-off is infinite and the optimal error is min{w, 1 − w}. It is possible to give conditions under which a finite cutoff exists and to express c_opt in closed form when the parameters and the relevant prevalence w are all known.

Lemma 3. (i) When σ²_D = σ²_ND = σ², then a finite optimal cut-off minimizing Error(c) exists iff µ_D > µ_ND and in that case
c_opt = (µ_D + µ_ND)/2 + σ² log((1 − w)/w)/(µ_D − µ_ND). (4)
(ii) When σ²_D ≠ σ²_ND, then a finite optimal cut-off exists iff
(µ_D/σ²_D − µ_ND/σ²_ND)² − (1/σ²_D − 1/σ²_ND)(µ²_D/σ²_D − µ²_ND/σ²_ND + 2 log((1 − w)σ_D/(wσ_ND))) ≥ 0 (5)
and in that case c_opt is the root of the quadratic in the proof satisfying (1/σ²_D − 1/σ²_ND)c_opt < µ_D/σ²_D − µ_ND/σ²_ND, namely
c_opt = {(µ_D/σ²_D − µ_ND/σ²_ND) − [(µ_D/σ²_D − µ_ND/σ²_ND)² − (1/σ²_D − 1/σ²_ND)(µ²_D/σ²_D − µ²_ND/σ²_ND + 2 log((1 − w)σ_D/(wσ_ND)))]^{1/2}} / (1/σ²_D − 1/σ²_ND). (6)

Note that when w = 1/2, then in (i) c_opt = (µ_D + µ_ND)/2 as one might expect. In the case of unequal variances there is an additional restriction, beyond µ_D ≥ µ_ND, required to hold if the diagnostic is to serve as a reasonable classifier. The following shows that these can be combined in a natural way.

Corollary 4. The restrictions µ_D ≥ µ_ND and (5) hold iff
µ_ND ≤ µ_D − {max(0, −a)}^{1/2}, where a = 2(σ²_D − σ²_ND) log((1 − w)σ_D/(wσ_ND)). (7)

So, if one is unwilling to assume constant variance, then the hypothesis H_0 that (7) holds needs to be assessed. There is some importance to these results as they demonstrate that a finite optimal cutoff may in fact not exist, at least when considering both types of error. For example, when µ_ND = 1, µ_D = 2, σ_D = 1, σ_ND = 1.5, then for any w ≤ 0.30885 the optimal cutoff is c_opt = ∞ with Error(∞) = w. When c_opt is infinite, then one may need to consider various cutoffs c and find one that is acceptable at least with respect to some of the error characteristics FNR(c), FPR(c), Error(c), FDR(c) and FNDR(c).
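The closed-form cutoff and the existence condition of Lemma 3 are easy to evaluate numerically; the following Python sketch (based on the statements as reconstructed above, not the authors' code) computes c_opt and Error(c_opt) for given parameters and reproduces the behaviour of the µ_ND = 1, µ_D = 2, σ_D = 1, σ_ND = 1.5 example, where no finite optimal cutoff exists for small w.

```python
# Sketch: optimal cutoff for the binormal model when the parameters and w are known.
import numpy as np
from scipy.stats import norm

def binormal_copt(mu_nd, sd_nd, mu_d, sd_d, w):
    if np.isclose(sd_d, sd_nd):                              # Lemma 3(i): equal variances
        return np.inf if mu_d <= mu_nd else \
            (mu_d + mu_nd) / 2 + sd_d**2 * np.log((1 - w) / w) / (mu_d - mu_nd)
    # Lemma 3(ii): c_opt solves A c^2 - 2 B c + C = 0 subject to A c < B
    A = 1 / sd_d**2 - 1 / sd_nd**2
    B = mu_d / sd_d**2 - mu_nd / sd_nd**2
    C = mu_d**2 / sd_d**2 - mu_nd**2 / sd_nd**2 + 2 * np.log((1 - w) * sd_d / (w * sd_nd))
    disc = B**2 - A * C                                      # condition (5)
    return np.inf if disc < 0 else (B - np.sqrt(disc)) / A

def error(c, mu_nd, sd_nd, mu_d, sd_d, w):
    fnr = norm.cdf((c - mu_d) / sd_d)                        # FNR(c)
    fpr = 1 - norm.cdf((c - mu_nd) / sd_nd)                  # FPR(c)
    return w * fnr + (1 - w) * fpr

# mu_ND = 1, mu_D = 2, sigma_D = 1, sigma_ND = 1.5: c_opt = inf for w <= 0.30885,
# while, e.g., w = 0.40 gives the finite cutoff c_opt = 1.6
for w in (0.25, 0.30, 0.40):
    c = binormal_copt(1.0, 1.5, 2.0, 1.0, w)
    print(w, c, error(c, 1.0, 1.5, 2.0, 1.0, w) if np.isfinite(c) else w)
```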
Consider now examples with equal and unequal variances.

Example 3. Binormal with σ²_ND = σ²_D. There may be reasons why the assumption of equal variance is believed to hold, but this needs to be assessed and evidence in favor found. If evidence against the assumption is found, then the approach of Example 4 can be used. A possible prior is given by π_1(µ_ND, σ²)π_2(µ_D | σ²) where µ_ND | σ² ∼ N(µ_0, τ²_0σ²), µ_D | σ² ∼ N(µ_0, τ²_0σ²), 1/σ² ∼ gamma(λ_1, λ_2), and this is a conjugate prior. The hyperparameters that need to be elicited are (µ_0, τ²_0, λ_1, λ_2).

Consider first eliciting the prior for (µ_ND, σ²). For this, an interval (m_1, m_2) is specified such that it is believed that µ_ND ∈ (m_1, m_2) with virtual certainty (say with probability γ = 0.99). Then put µ_0 = (m_1 + m_2)/2, the midpoint of this interval. The interval µ_ND ± σz_{(1+γ)/2} will contain an observation from F_ND with virtual certainty; let (l_0, u_0) be lower and upper bounds on the half-length of this interval, so l_0/z_{(1+γ)/2} ≤ σ ≤ u_0/z_{(1+γ)/2} with virtual certainty. This implies τ_0 = (m_2 − m_1)/2u_0. This leaves specifying the hyperparameters (λ_1, λ_2) and, letting G(·, λ_1, λ_2) denote the cdf of the gamma(λ_1, λ_2) distribution, then (λ_1, λ_2) satisfying
G(z²_{(1+γ)/2}/l²_0, λ_1, λ_2) = (1 + γ)/2,  G(z²_{(1+γ)/2}/u²_0, λ_1, λ_2) = (1 − γ)/2 (8)
will give the specified γ coverage. Noting that G(x, λ_1, λ_2) = G(λ_2x, λ_1, 1), first specify λ_1 and solve the first equation in (8) for λ_2, then solve the second equation in (8) for λ_1, and continue this iteration until the values give a probability content to (l_0/z_{(1+γ)/2}, u_0/z_{(1+γ)/2}) that is sufficiently close to γ. For the prior elicitation, suppose it is known with virtual certainty that both means lie in (−5, 5) and (l_0, u_0) = (1, 10), so we take µ_0 = (−5 + 5)/2 = 0, τ_0 = (m_2 − m_1)/2u_0 = 0.5, and the iterative process leads to (λ_1, λ_2) = (1.787, 1.056).
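The iteration for (λ_1, λ_2) in (8) can be implemented directly; the following Python sketch (an illustration, not the authors' code) alternates between the two equations with one-dimensional root finding, and should reproduce values close to (1.787, 1.056) for γ = 0.99 and (l_0, u_0) = (1, 10), assuming the alternation converges for these inputs.

```python
# Sketch: solve (8) for the gamma(lambda1, lambda2) prior on 1/sigma^2 by alternating
# one-dimensional solves; G(x, lambda1, lambda2) is the gamma cdf with shape lambda1, rate lambda2.
import numpy as np
from scipy.stats import gamma, norm
from scipy.optimize import brentq

gam = 0.99
l0, u0 = 1.0, 10.0
z = norm.ppf((1 + gam) / 2)
hi, lo = z**2 / l0**2, z**2 / u0**2        # bounds for 1/sigma^2

def G(x, lam1, lam2):
    return gamma.cdf(x, a=lam1, scale=1.0 / lam2)

lam1 = 1.0                                  # initial guess for the shape
for _ in range(50):
    # solve G(hi, lam1, lam2) = (1 + gam)/2 for the rate lam2 (the cdf increases with lam2)
    lam2 = brentq(lambda r: G(hi, lam1, r) - (1 + gam) / 2, 1e-6, 1e3)
    # solve G(lo, lam1, lam2) = (1 - gam)/2 for the shape lam1 (the cdf decreases with lam1)
    lam1 = brentq(lambda a: G(lo, a, lam2) - (1 - gam) / 2, 1e-3, 1e3)

print(round(lam1, 3), round(lam2, 3))                      # approximately (1.787, 1.056)
print(round(G(hi, lam1, lam2) - G(lo, lam1, lam2), 4))     # coverage of (lo, hi), close to 0.99
```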
For inference about c_opt it is necessary to specify a prior distribution for the prevalence w. This can range from w being completely known to being completely unknown, whence a uniform(0, 1) (beta(1, 1)) prior would be appropriate. Following the developments of Section 3.1, suppose it is known that w ∈ [l, u] = [0.2, 0.6] with prior probability γ = 0.99, so in this case ξ_w = (l + u)/2 = 0.4 and τ_w = 35.89725, and the prior is w ∼ beta(15.3589, 22.53835).

The first inference step is to assess the hypothesis H_0 : AUC > 1/2, which is equivalent to H_0 : µ_ND < µ_D, by computing the prior and posterior probabilities of this event to obtain the relative belief ratio. The prior probability of H_0 given σ² is
∫_{−∞}^{∞} Φ((µ_D − µ_0)/(τ_0σ)) (τ_0σ)^{−1} ϕ((µ_D − µ_0)/(τ_0σ)) dµ_D = 1/2,
and averaging this quantity over the prior for σ² we get 1/2. The posterior probability of this event can be easily obtained via simulating from the joint posterior. When this is done in the specific numerical example, the relative belief ratio of this event is 2.011 with posterior content 0.999, so there is strong evidence that H_0 : AUC > 1/2 is true. If evidence is found against H_0, then this would indicate a poor diagnostic. If evidence is found in favor, then we can proceed conditionally, given that H_0 holds, and so condition the joint prior and joint posterior on this event being true when making inferences about the AUC, c_opt, etc. So for the prior it is necessary to generate 1/σ² ∼ gamma(λ_1, λ_2) and then generate (µ_D, µ_ND) from the joint conditional prior given σ² and that µ_D > µ_ND. Denoting the conditional priors given σ² by π_D(µ_D | σ²) and π_ND(µ_ND | σ²), this joint conditional prior is proportional to π_D(µ_D | σ²)π_ND(µ_ND | σ²) restricted to the set where µ_ND < µ_D. While generally it is not possible to generate efficiently from this distribution, we can use importance sampling to calculate any expectations by generating from the unconstrained prior and weighting by the indicator of this event. Note that, if we take the posterior from the unconditioned prior and condition that, we will get the same conditioned posterior as when we use the conditioned prior to obtain the posterior. This implies that in the joint posterior for (µ_ND, µ_D, σ²) it is only necessary to adjust the posterior for µ_ND as was done with the prior, and this is also easy to generate from. Note that Lemma 3(i) implies that it is necessary to use the conditional prior and posterior to guarantee that c_opt exists finitely. Since H_0 was accepted, the conditional sampling was implemented and the estimate of the AUC is 0.795 with plausible region [0.670, 0.880], which has posterior content 0.856. So the estimate is close to the true value but there is substantial uncertainty. Figure 3 is a plot of the conditioned prior, the conditioned posterior and the relative belief ratio for this data.

With the specified prior for w, the posterior based on the given data is beta(35.3589, 47.53835), which leads to the estimate 0.444 for w with plausible interval (0.374, 0.516) having posterior probability content 0.782. Using this prior and posterior for w and the conditioned prior and posterior for (µ_D, µ_ND, σ²), we proceed to inference about c_opt and the error characteristics associated with this classification. A computational problem arises when obtaining the prior and posterior distributions of c_opt, as it is clear from (4) that these distributions can be extremely long-tailed. As such, we transform to c_mod = 0.5 + arctan(c_opt)/π ∈ [0, 1] (the Cauchy cdf), obtain the estimate c_mod(d), where d = (n_ND, x̄_ND, s²_ND, n_D, x̄_D, s²_D), and its plausible region and then, applying the inverse transform, obtain c_opt(d) = tan(π(c_mod(d) − 0.5)) and its plausible region. It is notable that relative belief inferences are invariant under 1-1 smooth transformations, so it doesn't matter which parameterization is used, but it is much easier computationally to work with a bounded quantity. Also, if a shorter tailed cdf is used rather than a Cauchy, e.g. a N(0, 1) cdf, then errors can arise due to extreme negative values being always transformed to 0 and very extreme positive values always transformed to 1. Figure 4 is a plot of the prior density, posterior density and relative belief ratio of c_mod. For this data c_opt(d) = 0.715 with plausible interval (0.316, 1.228) having posterior content 0.860. Large Monte Carlo samples were used to get smooth estimates of the densities and relative belief ratio but these only required a few minutes of computer time.

Example 4. Binormal with σ²_ND ≠ σ²_D. Here (µ_ND, σ²_ND) and (µ_D, σ²_D) are given independent priors of the form used in Example 3. Although this specifies the same prior for the two populations, this is easily modified to use different priors and, in any case, the posteriors are different. Again it is necessary to check that the AUC > 1/2 but also to check that c_opt exists finitely using the full posterior based on this prior, and for this we have the hypothesis H_0 given by Corollary 4. If evidence in favor of H_0 is found, the prior is replaced by the conditional prior given this event for inference about c_opt. This can be implemented via importance sampling as was done in Example 3 and similarly for the posterior. Using the same data and hyperparameters as in Example 3, the relative belief ratio of H_0 is 3.748 with posterior content 0.828, so there is reasonably strong evidence in favor of H_0. Estimating the value of the AUC is then based on conditioning on H_0 being true. Using the conditional prior given that H_0 is true, the relative belief estimate of the AUC is 0.793 with plausible interval (0.683, 0.857) with posterior content 0.839. The optimal cutoff is estimated in the same manner as in Example 3. It is notable that these inferences are very similar to those in Example 3. It is also noted that the sample sizes are not big and so the only situation where it might be expected that the inferences will be quite different between the two analyses is when the variances are substantially different.

Suppose that X is a continuous variable, of course still measured to some finite accuracy, and available information is such that no particular finite dimensional family of distributions is considered feasible.
The situation is considered where a normal distribution N(µ, σ²), perhaps after transforming the data, is considered as a possible base distribution for X, but we want to allow for deviation from this form. Alternative choices can also be made for the base distribution. The statistical model is then to assume that x_ND and x_D are generated as samples from F_ND and F_D, where these are independent values from DP(a, H) (Dirichlet) processes with base H = N(µ, σ²) for some (µ, σ²) and concentration parameter a. Actually, since it is difficult to argue for some particular choice of (µ, σ²), it is supposed that (µ, σ²) is generated from a prior π(µ, σ²). To complete the prior it is necessary to specify π and the concentration parameters a_ND and a_D. For π the prior is taken to be a normal distribution elicited as discussed in Section 3.3, although other choices are possible.

For eliciting the concentration parameters, consider how strongly it is believed that normality holds and, for convenience, suppose a = a_ND = a_D. For ε > 0, an upper bound (10), expressed in terms of beta probabilities where B(·, β_1, β_2) denotes the beta(β_1, β_2) measure, can be placed on the probability that the random F differs from H by at least ε on an event. This upper bound can be made as small as desirable by choosing a large enough. For example, if ε = 0.25 and it is required that this upper bound be less than 0.1, then this is satisfied when a ≥ 9.8, and if instead ε = 0.1, then a ≥ 66.8 is necessary. Note that, since this bound holds for every continuous probability measure H, it also holds when H is random, as considered here. So a is controlling how close it is believed that the true distribution is to H. Alternative methods for eliciting a can be found in Swartz (1993, 1999).

Generating (F_ND, F_D) from the prior for given (a, H) can only be done approximately and the approach of Ishwaran and Zarepour (2002) is adopted. For this, an integer n* is specified and the measure P_n* = Σ_{i=1}^{n*} p_i,n* δ_{c_i} is generated, where (p_1,n*, . . . , p_n*,n*) ∼ Dirichlet(a/n*, . . . , a/n*) independent of c_1, . . . , c_n* iid ∼ H, since P_n* converges weakly to DP(a, H) as n* → ∞. So to carry out a priori calculations, proceed as follows. Generate (µ_ND, σ²_ND) from its prior, c_ND1, . . . , c_NDn* iid from N(µ_ND, σ²_ND) and (p_ND1,n*, . . . , p_NDn*,n*) ∼ Dirichlet(a/n*, . . . , a/n*), and similarly for (p_D1,n*, . . . , p_Dn*,n*), (µ_D, σ²_D) and (c_D1, . . . , c_Dn*). Then F_ND,n*(c) = Σ_{i: c_NDi ≤ c} p_NDi,n* is the random cdf at c ∈ R, and similarly for F_D,n*, and AUC = Σ_{i=1}^{n*} (1 − F_D,n*(c_NDi))p_NDi,n* is a value from the prior distribution of the AUC. This is done repeatedly to get the prior distribution of the AUC, as in our previous discussions, and we proceed similarly for the other quantities of interest.

The posterior of F_ND given (µ_ND, σ²_ND) is a Dirichlet process with concentration a + n_ND and base (a H_ND + n_ND Ĥ_ND)/(a + n_ND), where Ĥ_ND(c) = Σ_{i=1}^{n_ND} I_{(−∞,c]}(x_NDi)/n_ND is the empirical cdf (ecdf) based on x_ND, and similarly for H_D. The posteriors of (µ_ND, σ²_ND) and (µ_D, σ²_D) are obtained via results in Antoniak (1974) and Doss (1994). The posterior density of (µ_ND, σ²_ND) given x_ND is proportional to the prior density times the N(µ_ND, σ²_ND) likelihood of the distinct data values, where ñ_ND is the number of unique values in x_ND and {x̃_ND1, . . . , x̃_NDñ_ND} is the set of unique values with mean x̄̃_ND and sum of squared deviations s̃²_ND. From this it is immediate that the posterior of (µ_ND, σ²_ND) has the usual conjugate form with the updating based on (ñ_ND, x̄̃_ND, s̃²_ND). A similar result holds for the posterior of (µ_D, σ²_D). To approximately generate from the full posterior, specify some n**, put p_a,n_ND = a/(a + n_ND), q_a,n_ND = 1 − p_a,n_ND, and generate (p_ND1,n**, . . . , p_NDn**,n**) | x_ND ∼ Dirichlet(((a + n_ND)/n**)1_n**), (µ_ND, σ²_ND) from its posterior, and c_ND1, . . . , c_NDn** iid from the mixture p_a,n_ND N(µ_ND, σ²_ND) + q_a,n_ND Ĥ_ND, and similarly for (p_D1,n**, . . . , p_Dn**,n**), (µ_D, σ²_D) and (c_D1, . . . , c_Dn**).
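The approximate prior and posterior sampling just described can be sketched as follows (Python; hypothetical data and hyperparameters, with the base-measure parameters held fixed for simplicity rather than updated as in the paper).

```python
# Sketch: approximate DP prior and posterior draws of (F_ND, F_D) and the AUC, with the
# base-measure parameters (mu, sigma) held fixed (the paper also samples these from their posteriors).
import numpy as np
rng = np.random.default_rng(3)

def dp_draw(a, n_grid, base_sampler):
    # Ishwaran-Zarepour approximation: atoms iid from the base, Dirichlet(a/n_grid, ...) weights
    atoms = base_sampler(n_grid)
    weights = rng.dirichlet(np.full(n_grid, a / n_grid))
    return atoms, weights

def cdf(atoms, weights, x):
    return np.sum(weights[atoms <= x])

def auc_draw(atoms_d, w_d, atoms_nd, w_nd):
    # AUC = sum_i (1 - F_D(c_NDi)) p_NDi
    return sum(w * (1.0 - cdf(atoms_d, w_d, c)) for c, w in zip(atoms_nd, w_nd))

a, n_star = 20.0, 200
x_nd = rng.normal(0.0, 1.0, 25)            # hypothetical samples
x_d = rng.normal(1.0, 1.0, 20)

# prior draw: base H = N(mu, sigma^2)
atoms_nd, w_nd = dp_draw(a, n_star, lambda n: rng.normal(0.0, 1.0, n))
atoms_d, w_d = dp_draw(a, n_star, lambda n: rng.normal(1.0, 1.0, n))
print("prior AUC draw:", auc_draw(atoms_d, w_d, atoms_nd, w_nd))

# posterior draw: concentration a + n, base = mixture of H and the ecdf with weight a/(a + n)
def post_sampler(x, mu, sd):
    def sample(n):
        from_base = rng.random(n) < a / (a + len(x))
        return np.where(from_base, rng.normal(mu, sd, n), rng.choice(x, n))
    return sample

atoms_nd, w_nd = dp_draw(a + len(x_nd), n_star, post_sampler(x_nd, 0.0, 1.0))
atoms_d, w_d = dp_draw(a + len(x_d), n_star, post_sampler(x_d, 1.0, 1.0))
print("posterior AUC draw:", auc_draw(atoms_d, w_d, atoms_nd, w_nd))
```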
If the data does not comprise a sample from the full population, then the posterior for w is replaced by its prior. There is an issue that arises when making inference about c_opt, namely, the distributions for c_opt that arise from this approach can be very irregular, particularly the posterior distribution. In part this is due to the discreteness of the posterior distributions of F_ND and F_D. This doesn't affect the prior distribution, because the points on which the generated distributions are concentrated vary quite continuously among the realizations and this leads to a relatively smooth prior density for c_opt. For the posterior, however, the sampling from the ecdf leads to a very irregular, multimodal density for c_opt. So some smoothing is necessary in this case. Consider now applying such an analysis to the dataset of Example 3, where we know the true values of the quantities of interest, and then to a dataset concerned with the COVID-19 epidemic.

Example 5. The data of Example 3 revisited. The data used in Example 3 is now analyzed but using the methods of this section. The prior on (µ_ND, σ²_ND), (µ_D, σ²_D) and w is taken to be the same as that used in Example 4, so the variances are not assumed to be the same. The value ε = 0.25 is used and requiring (10) to be less than 0.018 leads to a = 20. So the true distributions are allowed to differ quite substantially from a normal distribution. Testing the hypothesis H_0 : AUC > 1/2 led to the relative belief ratio 1.992 (the maximum possible value is 2) and the strength of the evidence is 0.997, so there is strong evidence that H_0 is true. The AUC, based on the prior conditioned on H_0 being true, is estimated to be equal to 0.839 with plausible interval (0.691, 0.929) having posterior content 0.814. For this data c_opt(d) = 0.850 with plausible interval (0.45, 1.75) having posterior content 0.835. The true value of the AUC is 0.760 and the true value of c_opt is 0.905, so these inferences are certainly reasonable although, as one might expect, when the lengths of the plausible intervals are taken into account they are not as accurate as those when binormality is assumed, as this is correct for this data. So the DP approach worked here, although the posterior density for c_opt was quite multimodal and required some smoothing (averaging 3 consecutive values).

Example 6. COVID-19 data. A dataset was downloaded from https://github.com/YasinKhc/Covid-19 containing data on 3397 individuals diagnosed with COVID-19 and includes whether or not the patient survived the disease, their gender and their age. There are 1136 complete cases on these variables, of which 646 are male, with 52 having died, and 490 are female, with 25 having died. Our interest is in the use of a patient's age X to predict whether or not they will survive. More detail on this dataset can be found in Charvadeh and Yi (2020). The goal is to determine a cutoff age so that extra medical attention can be paid to patients beyond that age. Also, it is desirable to see whether or not gender leads to differences, so separate analyses can be carried out by gender. So, for example, in the male group ND refers to those males with COVID-19 that will not die and D refers to the population that will. Looking at histograms of the data, it is quite clear that binormality is not a suitable assumption and no transformation of the age variable seems to be available to make a normality assumption more suitable.
Table 3 gives summary statistics for the subgroups. Of some note is that condition (7), when using standard estimates for population quantities like w = 52/646 = 0.08 for Males and w = 25/490 = 0.05 for females, is not satisfied which suggests that in a binormal analysis no finite optimal cutoff exists. For the prior, it is assumed that (µ N D , σ 2 N D ) and (µ D , σ 2 D ) are independent values from the same prior distribution as in (9). For the prior elicitation suppose it is known with virtual certainty that both means lie in (20, 70) and (l 0 , u 0 ) = (20, 50) so we take µ 0 = 45, τ 0 = (m 2 − m 1 )/2u 0 = 0.75 and the iterative process leads to (λ 1 , λ 2 ) = (8.545, 1080.596) which implies a prior on the σ's with mode at 10.932 and the interval (7.764, 19.411) containing 0.99 of the prior probability. Here the relevant prevalence refers to the proportion of COVID-19 patients that will die and it is supposed that w ∈ [0.00, 0.15] with virtual certainty which implies w ∼ beta(9.81, 109.66). So the prior probability that someone with COVID-19 will die is assumed to be less than 15% with virtual certainty. Since normality is not an appropriate assumption for the distribution of X, the choice ε = 0.25 with the upper bound (10) equal to 0.1 seems reasonable and so a = 9.8. This specifies the prior that is used for the analysis with both genders and it is to be noted that it is not highly informative. For males the hypothesis AUC > 1/2 is assessed and RB = 1.991 (maximum value 2) with strength effectively equal to 1.00 was obtained, so there is extremely strong evidence that this is true. The unconditional estimate of the AUC is 0.808 with plausible region [0.698, 0.888] having posterior content 0.959, so there is a fair bit of uncertainty concerning the true value. For the conditional analysis, given that AUC > 1/2, the estimate of the AUC is 0.806 with plausible region [0.731, 0.861] having posterior content 0.932. So the conditional analysis gives a similar estimate for the AUC with a small increase in accuracy. In either case it seems that the AUC is indicating that Age should be a reasonable diagnostic. Note that the standard nonparametric estimate of the AUC is 0.810 so the two approaches agree here. For females the hypothesis AUC > 1/2 is assessed and RB = 1.994 with strength effectively equal to 1 was obtained, so there is extremely strong evidence that this is true. The unconditional estimate of the AUC is 0.873 with plausible region (0.742, 0.948) having posterior content 0.968. For the conditional analysis, given that AUC > 1/2, the estimate of the AUC is 0.874 with plausible region (0.791, 0.936) having posterior content 0.956. The traditional estimate of the AUC is 0.902 so the two approaches are again in close agreement. Inferences for c opt are more problematical in both genders. Consider the male data. The data set is very discrete as there are many repeats and the approach samples from the ecdf about 84% of the time for the males that died and 98% of the time for the males that didn't die. The result is a plausible region that is not contiguous even with smoothing. Without smoothing the estimate is c opt (d) = 85.5 for males, which is a very dominant peak for the relative belief ratio. The plausible region contains 0.928 of the posterior probability and, although it is not a contiguous interval, the subinterval [85.2, 85.8] is a 0.58-credible interval for c opt that is in agreement with the evidence. 
If we continuize the data by adding a uniform(0, 1) random error to each age in the data set, then c_opt(d) = 86.1 and a plausible interval [75.9, 86.7] with posterior content 0.968 is obtained. These cutoffs are both greater than the maximum value in the ND data, so there is ample protection against false positives, but it is undoubtedly false negatives that are of most concern in this context. If instead the FNDR is used as the error criterion to minimize, then c_opt(d) = 35.7 and a plausible interval [26.1, 35.7] with posterior content 0.826 is obtained, and so in this case there will be too many false positives. So a useful optimal cutoff incorporating the relevant prevalence does not seem to exist with this data.

If the relevant prevalence is ignored and w_0 FNR + (1 − w_0) FPR is used, for some fixed weight w_0, to determine c_opt(d), then more reasonable values are obtained. Table 4 gives the estimates for various w_0 values. With w_0 = 0.5 (corresponding to using Youden's index) c_opt(d) = 65.7, while if w_0 = 0.7, then c_opt(d) = 56.7. When w_0 is too small or too large, then the value of c_opt(d) is not useful. While these estimates do not depend on the relevant prevalence, the error characteristics that do depend on this prevalence (as expressed via its prior and posterior distributions) can still be quoted and a decision made as to whether or not to use the diagnostic. Table 5 contains the estimates of the error characteristics at c_opt(d) for various values of w_0, where these are determined using the prior and posterior on the relevant prevalence w. Note that these estimates are determined as the values that maximize the corresponding relative belief ratios and take into account the posterior of w. So, for example, the estimate of the Error is not the convex combination of the estimates of FNR and FPR based on the w_0 weight. Another approach is to simply set the cutoff age at a value c_0 and then investigate the error characteristics at that value, for example obtaining FNDR(c_0) = 0.028. Similar results are obtained for the cutoff with the female data, although with different values.

Overall, Age by itself does not seem to be a useful classifier, although that is a decision for medical practitioners. Perhaps it is more important to treat those who stand a significant chance of dying more extensively and not worry too much that some treatments are not necessary. The clear message from this data, however, is that a relatively high AUC does not immediately imply that a diagnostic is useful, and the relevant prevalence is a key aspect of this determination.

Inferences for an ROC analysis have been implemented using a characterization of statistical evidence based on how data changes beliefs. Several contexts have been considered, namely, a diagnostic variable taking finitely many values with no restrictions on the distributions, a continuous diagnostic with both distributions normal, and a continuous diagnostic with no restrictions on the distributions. A central theme is that it is not enough to simply quote the AUC, as a high value does not imply a good diagnostic. An analysis of a diagnostic should also involve the relevant prevalence of the condition in question, as this affects the error characteristics at a specific cutoff. While sometimes a usable optimal cutoff can be determined that takes into account the relevant prevalence, this is not always the case and then some other criterion needs to be considered to determine the cutoff to be used.
For the cutoff used, the error characteristics that involve the relevant prevalence can still be assessed.

Appendix

Proof of Lemma 2. The derivative of ∫_{−∞}^{∞} Φ(a + bz)ϕ(z) dz with respect to b involves the factor ab/(1 + b²). When a > 0, then ∫_{−∞}^{∞} Φ(a + bz)ϕ(z) dz is increasing in b for b > 0, decreasing in b for b < 0, equals 0 when b = 0 and, when a < 0, it is decreasing in b for b > 0, increasing in b for b < 0. Therefore, when a > 0, b > 0, then ∫_{−∞}^{∞} Φ(a + bz)ϕ(z) dz ≥ Φ(a) > 1/2 and, when a ≤ 0, b > 0, then ∫_{−∞}^{∞} Φ(a + bz)ϕ(z) dz ≤ Φ(a) ≤ 1/2.

Proof of Lemma 3. Note that c_opt will satisfy (d/dc)Error(c) = 0, so c_opt is a root of the quadratic
(1/σ²_D − 1/σ²_ND)c² − 2(µ_D/σ²_D − µ_ND/σ²_ND)c + (µ²_D/σ²_D − µ²_ND/σ²_ND + 2 log((1 − w)σ_D/(wσ_ND))).
A single real root exists when σ²_D = σ²_ND = σ² and is given by (4). When σ²_D ≠ σ²_ND, there are two real roots iff the discriminant
4(µ_D/σ²_D − µ_ND/σ²_ND)² − 4(1/σ²_D − 1/σ²_ND)(µ²_D/σ²_D − µ²_ND/σ²_ND + 2 log((1 − w)σ_D/(wσ_ND))) ≥ 0,
establishing (5). To be a minimum, the root c has to satisfy 0 < (d²/dc²)Error(c) and, by (11), this holds iff (1/σ²_D − 1/σ²_ND)c < µ_D/σ²_D − µ_ND/σ²_ND. When σ²_D = σ²_ND this is true iff µ_D > µ_ND, which completes the proof of (i). When σ²_D ≠ σ²_ND this, together with the formula for the roots of a quadratic, establishes (6).

Proof of Corollary 4. Suppose µ_D ≥ µ_ND and (5) hold. Then, putting a = 2(σ²_D − σ²_ND) log((1 − w)w^{−1}σ_Dσ_ND^{−1}), we have that, for fixed µ_D, σ²_D, σ²_ND and w, (µ_D − µ_ND)² + a is a quadratic in µ_ND. This quadratic has discriminant −4a and so has no real roots whenever a > 0 and, noting that a does not depend on µ_D, the only restriction on µ_ND is µ_ND ≤ µ_D. When a ≤ 0, the roots of the quadratic are µ_D ± √−a and so, since the quadratic is negative between the roots and µ_D − √−a ≤ µ_D ≤ µ_D + √−a, the two restrictions imply µ_ND ≤ µ_D − √−a. Combining the two cases gives (7). Now suppose (7) holds. Then µ_ND ≤ µ_D − {max(0, −a)}^{1/2} ≤ µ_D, which gives the first restriction, and also µ_ND − µ_D ≤ −{max(0, −a)}^{1/2} ≤ 0, which implies (µ_ND − µ_D)² ≥ max(0, −a) and so (µ_ND − µ_D)² + a ≥ max(0, −a) + a, and by examining the cases a ≤ 0 and a > 0 we conclude that (5) holds.

References

Antoniak (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems.
Carvalho et al. (2013). Bayesian nonparametric ROC regression modeling.
Charvadeh and Yi (2020). Data visualization and descriptive analysis for understanding epidemiological characteristics of COVID-19: a case study of a dataset from.
Doss (1994). Bayesian nonparametric estimation for incomplete data via successive substitution sampling.
Englert et al. (2018). Checking the model and the prior for the constrained multinomial.
Evans (2015). Measuring Statistical Evidence Using Relative Belief. Monographs on Statistics and Applied Probability.
Evans et al. (2017). Prior elicitation, assessment and inference with a Dirichlet prior.
Gu et al. (2008). Bayesian bootstrap estimation of ROC curve.
Hand (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve.
Ishwaran and Zarepour (2002). Exact and approximate sum representations for the Dirichlet process.
Ladouceur et al. (2011). Modeling continuous diagnostic test data using approximate Dirichlet process distributions.
López-Ratón et al. (2014). OptimalCutpoints: An R package for selecting optimal cutpoints in diagnostic tests.
"Proper" binormal ROC curves: theory and maximum-likelihood estimation.
Obuchowski and Bullen (2018). Receiver operating characteristic (ROC) curves: review of methods with applications in diagnostic medicine.
Swartz (1993). Subjective priors for the Dirichlet process.
Swartz (1999). Nonparametric goodness-of-fit.
Unal (2017). Defining an optimal cut-point value in ROC analysis: an alternative approach.
Verbakel et al. (2020). ROC plots showed no added value above the AUC when evaluating the performance of clinical prediction models. In press.
Zhou et al. (2011). Statistical Methods in Diagnostic Medicine.

This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada and a University of Toronto Excellence Award. Qiaoyu Liang thanks Zhanhua He, Justin Ko, Zeyong Jin and Jiyuan Cheng for their help.