Probabilistic Prediction for Binary Treatment Choice: with focus on personalized medicine
Charles F. Manski
October 2, 2021

This paper extends my research applying statistical decision theory to treatment choice with sample data, using maximum regret to evaluate the performance of treatment rules. The specific new contribution is to study as-if optimization using estimates of illness probabilities in clinical choice between surveillance and aggressive treatment. Beyond its specifics, the paper sends a broad message. Statisticians and computer scientists have addressed conditional prediction for decision making in indirect ways, the former applying classical statistical theory and the latter measuring prediction accuracy in test samples. Neither approach is satisfactory. Statistical decision theory provides a coherent, generally applicable methodology.

A classic concern of probability theory and statistics has been to predict realizations of a real random variable y conditional on realizations of a covariate vector x. A standard formalization of the problem begins with a population characterized by a joint distribution P(y, x). A member is drawn at random from the subpopulation with a specified value of x. The problem is to predict y conditional on x. The conditional distribution P(y|x) provides the complete feasible probabilistic prediction. Rather than study P(y|x) in totality, researchers often focus on a real-valued feature of P(y|x) that is interpretable as a best point predictor of y conditional on x. The standard approach has been to minimize the conditional expected loss from prediction errors, with respect to a given loss function. Thus, one solves the minimization problem min p ∈ (−∞, ∞) E[L(y − p)|x], where p is a predictor value, y − p is the prediction error, and L(·) is the loss function. Familiar cases include square and absolute loss, which yield the conditional mean and median as best predictors. For expositions, see Ferguson (1967) and Manski (2007a).
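To see concretely how the choice of loss function determines the best point predictor, the following sketch (added here for illustration; the simulated gamma outcome distribution and all variable names are my own, not from the paper) verifies numerically that square loss is minimized at the mean and absolute loss at the median of outcomes within a single covariate cell.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.5, size=20_000)  # skewed outcomes for one value of x

grid = np.linspace(0.0, 15.0, 1501)                # candidate point predictions p
sq = [np.mean((y - p) ** 2) for p in grid]         # E[L(y - p)|x] under square loss
ab = [np.mean(np.abs(y - p)) for p in grid]        # E[L(y - p)|x] under absolute loss

print("square-loss minimizer:", grid[np.argmin(sq)], " sample mean:", y.mean())
print("absolute-loss minimizer:", grid[np.argmin(ab)], " sample median:", np.median(y))
```

Because the gamma distribution is skewed, the two minimizers differ visibly, which is the point of the paper's remark that the choice of loss function is substantive rather than a technicality.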
A standard formalization of the statistical problem supposes that one does not know P(y, x). Instead, one observes (yi, xi), i = 1, . . . , N, in a random sample of N persons drawn from a study population that has distribution P(y, x). One uses the sample data to estimate P(y|x), or a best point predictor. The standard formalization considers probabilistic prediction as a self-contained problem, without reference to an external application. Obtaining a best point prediction by minimizing expected loss solves a decision problem that may have applications. However, the commonly used loss functions, notably square and absolute loss, have usually been motivated by tradition and tractability rather than by losses incurred in actual decisions that require choice of point predictions.

This paper considers probabilistic predictions used to inform decisions in an important class of applications, namely choice between two treatments for members of a heterogeneous population. To flesh out abstract ideas, I analyze choice between surveillance and aggressive treatment in personalized medicine. Building on earlier work in Manski (2018, 2019a), I consider a clinician caring for patients with observed covariates x. The setting supposes that there are two care options for a specified disease, with A denoting surveillance and B denoting aggressive treatment. Let y = 1 if a patient is ill with the disease and y = 0 if not. I pose a model of patient welfare in which surveillance is the better option if y = 0 and aggressive treatment if y = 1.

The analysis in this paper applies as well to other decision problems that have the same mathematical structure. One such is judicial treatment of criminal defendants. Here the choice is to find a defendant guilty or not guilty of a crime. A defendant is analogous to a patient. A guilty decision is analogous to aggressive treatment, and a not-guilty one is analogous to surveillance. Uncertainty about whether a defendant committed the crime is analogous to uncertainty about whether a patient is ill. See Manski (2020) for discussion relating judicial and clinical decisions.

Decision making would be simple if y were observable at the time of treatment choice. However, the clinician must choose without knowing the illness status of the patient. This generates a rationale to predict illness. Given knowledge of x, the most that a clinician can do is to learn P(y = 1|x). The model of patient welfare implies that surveillance is the better option if P(y = 1|x) is less than a known threshold px* and aggressive treatment is better if P(y = 1|x) exceeds px*. Hence, the best point prediction is y = 1 if P(y = 1|x) > px* and y = 0 if P(y = 1|x) < px*.

Empirical research on medical risk assessment has used sample data on illness in study populations to estimate conditional probabilities of illness or to make point predictions of illness. Risk assessment has long been performed by biostatisticians who use classical frequentist statistical theory to propose inferential methods and assess findings. Although the motivation may be to improve patient care, biostatistical analysis commonly views prediction as a self-contained inferential problem rather than as a task undertaken specifically to inform treatment choice. In the 21st century, medical risk assessment is increasingly performed by computer scientists, who view prediction methods as computational algorithms rather than as approaches to statistical inference.

Frequentist statisticians maintain an ex ante perspective, studying how methods perform across repetitions of a sampling process. In contrast, computer scientists perform ex post evaluation, fitting an algorithm on a "training" sample and examining the accuracy of the predictions it yields on a "test" sample. Breiman (2001) argues for this approach, writing (p. 201): "Predictive accuracy on test sets is the criterion for how good the model is." Breiman does not explain why this is or should be the criterion. He just states it. Measuring performance in this ex post manner may have appeal to clinicians who lack expertise in statistical methodology but who may feel that they can appraise ex post prediction accuracy heuristically. However, by its nature, evaluation on a test sample cannot yield lessons that generalize beyond the particular test performed. Efron (2020), in an article contrasting the perspectives of statisticians and computer scientists, writes (p. S49): "In place of theoretical criteria, various prediction competitions have been used to grade algorithms in the so-called 'Common Task Framework.' . . . None of this is a good substitute for a so-far nonexistent theory of optimal prediction." Efron is correct that prediction competitions are not a satisfactory way to evaluate prediction methods. However, he is not correct when he states that a theory of optimal prediction is "so-far nonexistent."
" Wald (1939 " Wald ( , 1945 " Wald ( , 1950 considered the general problem of using sample data to make decisions. He posed the task as choice of a statistical decision function, which maps potentially available data into a choice among the feasible actions. His development of statistical decision theory provides a broad framework for decision making with sample data, yielding optimal decisions when these are well-defined and proposing criteria for "reasonable" decision making more generally. Wald recommended ex ante (frequentist) evaluation of statistical decision functions as procedures applied as the sampling process is engaged repeatedly to draw independent data samples. Whereas computer scientists measure performance when a prediction method is trained on one sample and used to predict outcomes in a test sample, statistical decision theory measures average performance across all possible training samples, when the objective is to predict outcomes in an entire population rather than a test sample. The idea of a procedure transforms the inductive problem of evaluating a prediction method based on its performance in a single setting into the deductive problem of assessing the performance of a statistical decision function across realizations of the sampling process. It enables coherent study of treatment choice using sample data to make probabilistic predictions, with application to personalized medicine and elsewhere. This paper shows how. The findings reported here add to a recent econometric literature using statistical decision theory to study treatment choice with sample data. See Manski (2004 Manski ( , 2005 Manski ( , 2007b Manski ( , 2019b Manski ( , 2021 , Manski and Tetenov (2007 , 2016 , 2019 , 2021 , Porter (2009, 2020) , Stoye (2009 Stoye ( , 2012 , Tetenov (2012) , Kitagawa and Tetenov (2018) , Mbakop and Tabord-Meehan (2021) , and Athey and Wager (2021) . Relative to this precedent work, part of the contribution of the present paper is its consideration of a class of treatment-choice problems that differs in some respects from those studied earlier. Part is its new application of the theme of as-if optimization, developed abstractly in Manski (2021) . Part is its cautionary advice to clinical researchers and clinicians as they seek to interpret personalized medical risk assessments evaluated using traditional biostatistical criteria or prediction competitions. Section 2 explains in broad terms how statistical decision theory enables study of treatment choice using sample data to make probabilistic predictions. I draw on and extend exposition in Manski (2021) . Section 3 explains use of as-if optimization to choose between surveillance and aggressive treatment, first in generality and then when the data are generated by random sampling from P(y|x). Section 4 studies as-if optimization using estimates of P(y|x) that combine data on persons with different covariate values. A particular innovation is to introduce a new form of analysis of kernel estimation. Section 5 concludes. Wald began with the standard decision theoretic problem of a planner who must choose an action yielding welfare that depends on an unknown state of nature. The planner specifies a state space listing the states considered possible. He chooses without knowing the true state. Wald added to this standard problem by supposing that the planner observes sample data that may be informative about the true state. 
In the context of this paper, the action is a treatment choice, the unknown state of nature is the conditional probability distribution P(y|x), and the sample data are informative about P(y|x). I describe basic ideas in abstraction before applying them to this context.

First consider decisions without sample data. A planner faces a choice set C and believes that the true state of nature s* lies in state space S. An objective function w(·, ·): C × S → R1 maps actions and states into welfare. The planner ideally would maximize w(·, s*) over C, but he does not know s*. To choose an action, decision theorists have proposed various ways of using w(·, ·) to form functions of actions alone, which can be optimized. When posing extremum problems, I use max and min notation, without concern for the subtleties that sometimes make it necessary to use sup and inf operations.

One approach places a subjective probability distribution π on the state space, computes average state-dependent welfare with respect to π, and maximizes subjective average welfare over C. The criterion solves

(1) max c ∈ C ∫ w(c, s) dπ.

Another approach seeks an action that, in some sense, works uniformly well over all of S. This yields the maximin and minimax-regret (MMR) criteria. The maximin criterion solves the problem

(2) max c ∈ C min s ∈ S w(c, s).

The MMR criterion solves

(3) min c ∈ C max s ∈ S [max d ∈ C w(d, s) − w(c, s)].

Here, max d ∈ C w(d, s) − w(c, s) is the regret of action c in state s. The true state being unknown, one evaluates c by its maximum regret over all states and selects an action that minimizes maximum regret. The maximum regret of an action measures its maximum distance from optimality across states.

Statistical decision problems suppose that the planner observes data generated by a sampling distribution. Knowledge of the sampling distribution is generally incomplete. To express this, one extends state space S to list the feasible sampling distributions, denoted (Qs, s ∈ S). Let Ψs denote the sample space in state s; Ψs is the set of samples that may be drawn under distribution Qs. The literature typically assumes that the sample space does not vary with s and is known. I assume this and denote the sample space as Ψ. Then a statistical decision function (SDF), c(·): Ψ → C, maps the sample data into a chosen action. An SDF is a deterministic function after realization of the sample data, but it is a random function ex ante. Hence, the welfare achieved is a random variable ex ante. Wald's theory evaluates the performance of SDF c(·) in state s by Qs{w[c(ψ), s]}, the ex ante distribution of welfare that it yields across realizations ψ of the sampling process.

It remains to ask how a planner might compare the welfare distributions yielded by different SDFs. Statistical decision theory has mainly studied the same decision criteria as has decision theory without sample data. Let Γ be a specified set of SDFs, each mapping Ψ → C. The statistical versions of criteria (1), (2), and (3) are

(4) max c(·) ∈ Γ ∫ Es{w[c(ψ), s]} dπ,

(5) max c(·) ∈ Γ min s ∈ S Es{w[c(ψ), s]},

(6) min c(·) ∈ Γ max s ∈ S (max d ∈ C w(d, s) − Es{w[c(ψ), s]}).

Observe that these ex ante criteria for evaluation of performance differ fundamentally from the computer-science practice of ex post evaluation of predictions on test samples. The Wald framework evaluates a decision criterion by its mean performance across all feasible samples, not by its performance in a particular sample. In the computer-science approach, some data collection process generates two datasets, say ψtr and ψtest, denoted the training and test samples. The training sample is used to compute a predictor function, say pred(ψtr).
The predictor function is applied to the test sample, the test being how well it predicts some feature of ψtest deemed to be of interest. Judgement of how well the predictor function performs is typically subjective rather than through use of a formal statistical criterion.1 Computer scientists often motivate their approach by stating that it protects against drawing misleading conclusions from prediction performance on the study sample, which may be unrealistically high due to so-called "overfitting" of the data. Concern with overfitting does not arise in the Wald framework, which evaluates performance across all feasible samples, not in a particular study sample.

1 The possibility of formal statistical analysis depends on how training and test samples are generated. In some cases, a well-understood sampling process generates data, which are then randomly split into training and testing subsamples. In other cases, an idiosyncratic process generates data, which are then randomly split as above. In yet other cases, separate idiosyncratic processes generate training and test samples. In principle, it should be possible to study cases of the first type using classical frequentist statistical theory. In cases of the second type, the initial data generation is not interpretable with statistical theory, but the randomized split into training and test samples may enable randomization inference. Cases of the third type may not be amenable to any formal statistical analysis.

Manski (2004, 2021) discusses and compares the properties of criteria (4)−(6). To summarize some main points, maximization of subjective average welfare (4) may be appealing if one has a credible basis to place a subjective probability distribution on the state space, but not otherwise. Concern with specification of priors motivated Wald to study the maximin criterion (5). However, I see conceptual reasons to focus instead on the MMR criterion (6). The conceptual appeal of using maximum regret to measure performance is that it quantifies how lack of knowledge of the true state of nature diminishes the quality of decisions. The term "maximum regret" is a shorthand for the maximum sub-optimality of a decision criterion across the feasible states of nature. An SDF with small maximum regret is uniformly near-optimal across all states. This is a desirable property.

Subject to regularity conditions ensuring that the relevant expectations and extrema exist, problems (4)−(6) are computable in principle. Monte Carlo integration can be used to approximate the expected welfare of an SDF in each state, and it can also be used in criterion (4) to approximate the subjective average of expected welfare. The main computational challenges are determination of the extrema across actions in problem (6), across states in problems (5)−(6), and across SDFs in problems (4)−(6). Solution of max d ∈ C w(d, s) in (6) is often straightforward but sometimes difficult. Finding extrema over S must cope with the fact that the state space commonly is uncountable. In applications where the quantity to be optimized varies smoothly over S, a simple approach is to compute the extremum over a suitable finite grid of states. The most difficult computational challenge usually is to optimize over the feasible SDFs, Γ. No generally applicable approach is available. Hence, applications of statistical decision theory necessarily proceed case-by-case. It may not be tractable to find the best feasible SDF, but one often can evaluate the performance of relatively simple SDFs that researchers use in practice. This paper studies illustrative cases.
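As a concrete instance of the grid-over-states computation just described, the sketch below (my construction; the welfare functions and the threshold SDF are illustrative choices, not from the paper) evaluates the maximum regret of a simple SDF for a two-action problem with Binomial data, applying criterion (6) one state at a time.

```python
import numpy as np
from scipy.stats import binom

# Illustrative two-action problem: w(a, s) = s, w(b, s) = 0.5, unknown s in [0, 1].
# Data: psi ~ Binomial(N, s). SDF under study: choose a iff psi / N >= 0.5.
N = 20
states = np.linspace(0.0, 1.0, 201)  # finite grid approximating the state space S

def max_regret(N, states):
    worst = 0.0
    for s in states:
        w_a, w_b = s, 0.5
        p_choose_a = 1.0 - binom.cdf(int(np.ceil(0.5 * N)) - 1, N, s)  # P(psi/N >= 0.5)
        exp_welfare = p_choose_a * w_a + (1.0 - p_choose_a) * w_b      # Es{w[c(psi), s]}
        regret = max(w_a, w_b) - exp_welfare                           # regret in state s
        worst = max(worst, regret)
    return worst

print(f"maximum regret of the threshold SDF at N={N}: {max_regret(N, states):.4f}")
```

The same pattern, exact expected welfare per state followed by a grid search for the worst state, reappears below when maximum regret is computed for treatment choice.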
SDFs for binary choice have a simple structure. Manski (2021) shows that they can be viewed as hypothesis tests. Yet the Wald perspective on testing differs from the classical perspective of Neyman and Pearson (1928, 1933). Decision theory does not restrict attention to tests that yield a predetermined upper bound on the probability of a Type I error.

Wald (1939) proposed evaluation of the performance of an SDF for binary choice by the expected welfare that it yields across realizations of the sampling process. The welfare distribution in state s in a binary choice problem is Bernoulli, with mass points max[w(a, s), w(b, s)] and min[w(a, s), w(b, s)]. These coincide if w(a, s) = w(b, s). When w(a, s) ≠ w(b, s), let Rc(·)(s) denote the probability that c(·) yields an error, choosing the inferior action over the superior one. That is,

(7) Rc(·)(s) = Qs[c(ψ) = b] if w(a, s) > w(b, s), Rc(·)(s) = Qs[c(ψ) = a] if w(b, s) > w(a, s).

These are the probabilities of Type I and Type II errors. The probabilities that welfare equals max[w(a, s), w(b, s)] and min[w(a, s), w(b, s)] are 1 − Rc(·)(s) and Rc(·)(s). Hence, expected welfare is

(8) Es{w[c(ψ), s]} = max[w(a, s), w(b, s)] − Rc(·)(s)·|w(a, s) − w(b, s)|.

Observe that Rc(·)(s)·|w(a, s) − w(b, s)| is the expected regret of c(·). Thus expected regret, which was defined in abstraction in (6), has a simple form when choice is binary. It is the product of the error probability and the magnitude of the welfare loss when an error occurs.

The concept of a statistical decision function embraces all mappings [data → action]. An SDF need not perform inference; that is, it need not use data to draw conclusions about the true state of nature. Although SDFs need not perform inference, some do. These have the form [data → inference → action], first performing inference and then using the inference to make a decision. There has been no accepted term for such SDFs, so Manski (2021) calls them inference-based. A common type of inference-based SDF performs as-if optimization, also called plug-in or two-step decision making, choosing an action that optimizes welfare as if an estimate of the true state of nature actually is the true state. Formally, a point estimate is a function s(·): Ψ → S that maps data into the state space, and as-if optimization chooses an action

(9) c[s(ψ)] ∈ argmax c ∈ C w[c, s(ψ)].

Traditionally, researchers have given computational and asymptotic statistical rationales for acting in the manner of (9). Computationally, using a point estimate to maximize welfare is easier than solving problems (4) to (6). To further motivate as-if optimization, statisticians and econometricians cite limit theorems of asymptotic theory. They hypothesize a sequence of sampling processes indexed by sample size and a corresponding sequence of estimates. They show that the sequence is consistent when specified assumptions hold. They may also derive the rate of convergence and limiting distribution of the estimate. Computational and asymptotic arguments do not prove that as-if optimization provides a well-performing SDF. Statistical decision theory evaluates as-if optimization in state s by the expected welfare, Es{w{c[s(ψ)], s}}, that it yields across samples of specified size, not asymptotically. This is how I proceed below when studying treatment choice with sample data.

I now use statistical decision theory to study a version of the medical problem of choice between surveillance and aggressive treatment. The broad problem concerns a clinician caring for patients with observed covariates x. There are two care options for a specified disease, with A denoting surveillance and B denoting aggressive treatment. The clinician must choose without knowing a patient's illness status; y = 1 if a patient is ill and y = 0 if not.
Observing x, the clinician can attempt to learn the conditional probability of illness, px ≡ P(y = 1|x). I suppose that the planner performs as-if optimization, using sample data to estimate px and acting as if the estimate is correct.

The version of the decision problem studied here maintains simplifying assumptions used in parts of the analysis in Manski (2018, 2019a). I assume that patient welfare with care option c ∈ {A, B} has the known form Ux(c, y); thus, welfare may vary with whether the disease occurs and with the patient covariates x. Aggressive treatment is better if the disease occurs, and surveillance is better otherwise. That is,

(10a) Ux(B, 1) > Ux(A, 1),
(10b) Ux(A, 0) > Ux(B, 0).

The specific form of welfare function Ux(·, ·) necessarily depends on the clinical context, but inequalities (10a)−(10b) are realistic in many settings. I assume that the chosen care option does not affect whether the disease occurs; hence, a patient's illness probability is simply px rather than a function px(c) of the care option. With this assumption, treatment choice still matters because it may affect the severity of illness and patient experience of side effects. Aggressive treatment is beneficial to the extent that it lessens the severity of illness, but harmful if it yields side effects that do not occur with surveillance.

Illustration: A patient presents to a clinician with symptoms of a sore throat. The patient may have a streptococcal infection (y = 1) or a throat irritation (y = 0). Treatment A is to counsel the patient to rest and monitor body temperature until the result of a throat culture is obtained. Treatment B is immediate prescription of an antibiotic. The antibiotic will lessen the severity of illness if y = 1, but it will have no beneficial effect if y = 0. Whether or not infection is present, the patient may suffer an adverse side effect from receipt of the antibiotic. ∎

The central difference between Manski (2018, 2019a) and the present paper is that I earlier supposed the clinician may have deterministic partial knowledge of px but has no informative sample data. Here I study treatment choice using sample data to estimate px. In the first part of this section, the estimate is an abstract function of sample data. This makes it easy to explain general principles. I subsequently specialize to settings where the data are drawn by random sampling of y conditional on x.

3.1. Treatment Choice with Knowledge of px

3.1.1. Optimal Treatment Choice

Before considering decision making with sample data, suppose that the clinician knows px and chooses a treatment that maximizes expected patient welfare conditional on x. Then an optimal decision is

(11) choose A if Ux(A, 0)(1 − px) + Ux(A, 1)px ≥ Ux(B, 0)(1 − px) + Ux(B, 1)px, choose B otherwise.

The decision yields optimal expected patient welfare

(12) max[Ux(A, 0)(1 − px) + Ux(A, 1)px, Ux(B, 0)(1 − px) + Ux(B, 1)px].

The optimal decision is easy to characterize when inequalities (10a)−(10b) hold. Let px* denote the threshold value of px that makes options A and B have the same expected utility. This value is

(13) px* = [Ux(A, 0) − Ux(B, 0)] / {[Ux(A, 0) − Ux(B, 0)] + [Ux(B, 1) − Ux(A, 1)]}.

Observe that 0 < px* < 1. Option A is optimal if px ≤ px* and B if px ≥ px*. Thus, optimal treatment choice does not require exact knowledge of px. It only requires knowing whether px is larger or smaller than px*.
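The threshold formula (13) is easy to evaluate. A minimal sketch (the utility values below are hypothetical numbers chosen to satisfy (10a)−(10b), not values from the paper):

```python
def threshold(U_A0, U_A1, U_B0, U_B1):
    """Threshold px* of equation (13): expected welfare of A and B are equal here.

    Requires the welfare inequalities (10a)-(10b): U_B1 > U_A1 and U_A0 > U_B0.
    """
    gain_A_healthy = U_A0 - U_B0   # advantage of surveillance when y = 0
    gain_B_ill = U_B1 - U_A1       # advantage of aggressive treatment when y = 1
    return gain_A_healthy / (gain_A_healthy + gain_B_ill)

def choose(p_x, p_star):
    # Optimal rule: A if px <= px*, B if px >= px* (either choice at the threshold).
    return "A (surveillance)" if p_x <= p_star else "B (aggressive treatment)"

# Hypothetical utilities with U_B0 = U_B1 = 0.6:
p_star = threshold(U_A0=1.0, U_A1=0.0, U_B0=0.6, U_B1=0.6)
print(p_star, choose(0.25, p_star), choose(0.55, p_star))
```

The utilities chosen here set Ux(B, 0) = Ux(B, 1), so the computed threshold is 1 − UxB = 0.4, previewing the neutralizing-disease special case developed next.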
An instructive special case occurs when aggressive treatment neutralizes disease, in the sense that Ux(B, 0) = Ux(B, 1). For example, aggressive treatment might be surgery to remove a localized tumor that may (y = 1) or may not (y = 0) be malignant. Suppose that surgery always eliminates cancer when present. Then surgery neutralizes disease. Being invasive and costly, performance of surgery has a negative side effect on welfare that is the same regardless of whether cancer is present. Let UxB denote welfare with aggressive treatment. Then (10)−(13) reduce to

(14) Ux(A, 1) < UxB < Ux(A, 0),
(15) choose A if Ux(A, 0)(1 − px) + Ux(A, 1)px ≥ UxB, choose B otherwise,
(16) max[Ux(A, 0)(1 − px) + Ux(A, 1)px, UxB],
(17) px* = [Ux(A, 0) − UxB] / [Ux(A, 0) − Ux(A, 1)].

Further simplification occurs when one normalizes the location and scale of welfare by setting Ux(A, 0) = 1 and Ux(A, 1) = 0. Then (14)−(17) become

(18) 0 < UxB < 1,
(19) choose A if 1 − px ≥ UxB, choose B otherwise,
(20) max(1 − px, UxB),
(21) px* = 1 − UxB.

I henceforth assume that aggressive treatment neutralizes disease and I normalize the welfare of surveillance as above.

Now suppose that the clinician does not know px. I assume that the clinician does not know whether px is smaller or larger than px* = 1 − UxB. Formally, pmx < 1 − UxB < pMx, where pmx ≡ min s ∈ S psx and pMx ≡ max s ∈ S psx. Thus, the clinician cannot maximize expected patient welfare conditional on x. Consider as-if optimization with an estimate φx(ψ) of px, choosing A if 1 − φx(ψ) ≥ UxB and B otherwise. Let the error indicator e[psx, φx(ψ), UxB] equal 1 when psx and φx(ψ) yield different treatments and equal 0 when psx and φx(ψ) yield the same treatment. Regret using estimate φx(ψ) is

(22) Rsx[φx(ψ)] = e[psx, φx(ψ), UxB]·|(1 − psx) − UxB|.

Expected regret across repeated samples is

(23) Es{Rsx[φx(ψ)]} = Es{e[psx, φx(ψ), UxB]}·|(1 − psx) − UxB|.

Maximum expected regret across the state space is max s ∈ S Es{Rsx[φx(ψ)]}.

Evaluation of estimate φx(·) by the maximum regret of the treatment choices it yields is reminiscent of, but distinct in various respects from, analysis of plug-in decision making in the statistical-learning literature on pattern recognition or classification (e.g., Devroye et al., 1996; Yang, 1999). Interpreted in the context of the present paper, researchers in that field have supposed that the welfare function is antisymmetric.2

2 Another form of as-if optimization has been studied in econometrics and statistical learning theory. Rather than estimate px and then determine whether the estimate exceeds the threshold px*, one directly estimates the indicator function 1[px > px*] and makes a decision accordingly. When the data are a random sample from a population with heterogeneous covariates, this approach yields the maximum score method of econometrics (Manski, 1975, 1985; Manski and Thompson, 1989) and the empirical risk minimization methods of statistical learning theory (Vapnik, 1999, 2000). Here too, researchers have mainly studied asymptotic questions of consistency and rates of convergence.

Expected regret is easy to compute; it equals the error probability times |(1 − psx) − UxB|. Maximizing expected regret must cope with the fact that the set (psx, s ∈ S) commonly is uncountable. Being a subset of [0, 1], this set is relatively simple in structure. A pragmatic approach is to maximize over a suitable finite grid of feasible probability values. Refining the grid increases the accuracy of the approximate solution.

Computation is particularly straightforward when the data are illness outcomes (yi, i = 1, . . . , Nx) that have been observed in a random sample of Nx persons drawn from a study population with illness probability px. The ordering of the observations in a random sample is immaterial, so the sample space may be defined to be the number nx of observed illness outcomes; thus, Ψ = {0, 1, 2, . . . , Nx}. The sampling distribution in state s is the Binomial distribution Qs = B(psx, Nx), where

(24) Qs(nx) = [Nx!/(nx!(Nx − nx)!)]·(psx)^nx·(1 − psx)^(Nx − nx)

is the probability of observing nx illnesses. In this setting, expected regret has the form

(25a) Es{Rsx(nx/Nx)} = [(1 − psx) − UxB]·Qs(1 − nx/Nx < UxB) for s ∈ SA,
(25b) Es{Rsx(nx/Nx)} = [UxB − (1 − psx)]·Qs(1 − nx/Nx ≥ UxB) for s ∈ SB,

where SA ≡ (s ∈ S: 1 − psx ≥ UxB) and SB ≡ (s ∈ S: 1 − psx < UxB). Computing maximum regret when performing as-if optimization with estimate φx(·) requires maximizing (25a) over s ∈ SA and maximizing (25b) over s ∈ SB. Maximum regret over S is the larger of these sub-maxima. The sub-maximization problems usually do not have explicit solutions. However, the fact that nx has the Binomial distribution makes it easy to perform numerical maximization. Illustrative findings are given below.
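The following sketch implements the numerical maximization just described, computing expected regret (25a)−(25b) exactly from the Binomial distribution and maximizing over a probability grid. The value UxB = 0.6 and the grid sizes are illustrative choices of mine.

```python
import numpy as np
from scipy.stats import binom

def expected_regret(p, U_B, N):
    """Expected regret (23) of as-if optimization with estimate n/N in a state with
    illness probability p, when aggressive treatment neutralizes disease (px* = 1 - U_B)."""
    n = np.arange(N + 1)
    choose_B = (1.0 - n / N) < U_B               # as-if rule: B iff 1 - estimate < U_B
    prob_B = binom.pmf(n, N, p)[choose_B].sum()  # probability the rule selects B
    if 1.0 - p >= U_B:                           # s in SA: A is optimal, use (25a)
        return ((1.0 - p) - U_B) * prob_B
    return (U_B - (1.0 - p)) * (1.0 - prob_B)    # s in SB: B is optimal, use (25b)

def max_regret(U_B, N, p_lo=0.0, p_hi=1.0, grid=1001):
    return max(expected_regret(p, U_B, N) for p in np.linspace(p_lo, p_hi, grid))

for N in (1, 10, 100):
    print(N, round(max_regret(U_B=0.6, N=N), 4))
```

For Nx = 1 the printed value is 0.09 = (UxB/2)², matching the exact analysis of Proposition 3 below, and maximum regret shrinks steadily as Nx grows.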
While numerical computation of maximum regret must be the norm, extreme cases with no informative sample data or with one observation of y drawn at random from px are amenable to simple analysis. An analytical upper bound on maximum regret is available when multiple observations of y are drawn at random from px and the sample illness frequency is used to estimate px. Sections 3.3 through 3.6 present the findings.

Suppose that one observes no sample data. Then the estimate must be a constant φx rather than a random variable φx(ψ). Under the maintained assumption that pmx ≡ min s ∈ S psx < 1 − UxB < pMx ≡ max s ∈ S psx, the minimax regret estimate is given in Proposition 1. The proofs of this and all subsequent propositions are collected in an Appendix.

Suppose that one does not observe data that are informative about the true state of nature s*. Nevertheless, it is always possible to generate random data that are uninformative about s* and use them to estimate px. One may specify a distribution on the interval [0, 1] and estimate px by a realization drawn from this distribution. As-if optimization with an estimate based on uninformative data may seem pointless, but it opens new possibilities for treatment choice relative to the situation with no data at all. Estimates were deterministic in that case, so error probabilities could take only the value 0 or 1. Now estimates can be random variables and error probabilities can take any value in [0, 1]. I show that this makes it possible to reduce maximum regret. The present analysis is broadly similar to my earlier work (Manski, 2009, 2021) showing that randomization improves on deterministic binary treatment choice, but it differs in the details. Formally, one specifies a sample space Ψ0, a sampling distribution Q0 on Ψ0, and an estimate φ0x(·): Ψ0 → [0, 1]. One programs a random number generator to draw realizations ψ with distribution Q0, and one uses φ0x(ψ) to estimate px. The minimax regret estimate is given in Proposition 2.

Proposition 1 showed that minimum achievable maximum regret using no sample data is (26), the value min[(1 − pmx) − UxB, UxB − (1 − pMx)]. Expression (28), the value [(1 − pmx) − UxB]·[UxB − (1 − pMx)]/(pMx − pmx), is smaller than (26) under the maintained assumption that pmx < 1 − UxB < pMx. Thus, as-if optimization with uninformative sample data reduces minimum achievable maximum regret relative to treatment choice with no sample data.

Here and in Section 3.6, I suppose that one observes Nx illness outcomes drawn at random from px. It is then natural to consider as-if optimization with estimate φx(nx) = nx/Nx, which uses the sample rate of illness to estimate the illness probability. Proposition 3 gives the exact value of maximum regret when Nx = 1.

Proposition 3: Suppose that pmx ≤ (1 − UxB)/2 and pMx ≥ 1 − UxB/2. Consider as-if optimization with estimate φx(nx) = nx. The value of maximum regret is

(29) max{[(1 − UxB)/2]², (UxB/2)²}.

An analytical finite-sample justification for estimation of px by nx/Nx stems from a large-deviations inequality of Hoeffding (1963) for averages of bounded random variables, which shows that Qs(nx/Nx − psx > δ) ≤ exp(−2Nxδ²) and Qs(psx − nx/Nx > δ) ≤ exp(−2Nxδ²) for all δ > 0. These inequalities yield an upper bound on maximum regret, whose magnitude depends on the known values of (pmx, pMx, UxB). Proposition 4 gives the bound.

Proposition 4: Consider as-if optimization with estimate φx(nx) = nx/Nx. Then, for all δ > 0,

(30) max s ∈ S Es{Rsx(nx/Nx)} ≤ δ + max[(1 − pmx) − UxB, UxB − (1 − pMx)]·exp(−2Nxδ²).

Minimizing the analytical bound (30) over δ > 0 yields a tighter bound that can be determined numerically. Given any value of (pmx, pMx, UxB), the bound decreases to zero if δ → 0 and Nxδ² → ∞. Hence, the maximum regret of as-if optimization with estimate φx(nx) = nx/Nx converges to zero as Nx → ∞. Bound (30) is simple and useful, but it is not sharp. Numerical computation of maximum regret is straightforward and shows that the exact value is sometimes much less than the bound. Hence, exact numerical computation is recommended in practice.
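A sketch of the bound computation, assuming the form of (30) stated above (which is a reconstruction from the surrounding text); the grid search over δ approximates the minimization:

```python
import numpy as np

def hoeffding_bound(U_B, N, p_m, p_M):
    """Upper bound (30) on maximum regret, minimized numerically over delta > 0."""
    K = max((1.0 - p_m) - U_B, U_B - (1.0 - p_M))  # maximal welfare loss from an error
    deltas = np.linspace(1e-4, 1.0, 10_000)
    return float(np.min(deltas + K * np.exp(-2.0 * N * deltas**2)))

for N in (10, 100, 1000):
    print(N, round(hoeffding_bound(U_B=0.6, N=N, p_m=0.0, p_M=1.0), 4))
```

Comparing these values with the exact Binomial computation in the previous sketch illustrates the paper's point that the bound is far from sharp at moderate sample sizes.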
Researchers often analyze data on outcomes for persons with heterogeneous covariates. When the decision problem is to choose a treatment for someone with a specific covariate value, data on persons with other covariates are not informative per se. However, these data may be informative when assumptions are imposed that relate the outcome distributions of persons with different covariates. An important subject for methodological research is to learn what is achievable with various combinations of data and assumptions. I demonstrate here, continuing to focus on maximum regret.

I focus on the instructive, simple setting where persons have either of two covariate values, x = 0 and x = 1. Persons with x = 0 and x = 1 may be similar in some respects, but they differ in some way. State s now indexes a possible pair (ps0, ps1) of conditional illness probabilities. Let random samples of N0 outcomes be drawn from p0 and N1 outcomes be drawn from p1, these sample sizes being predetermined. Then the data are the numbers of ill persons in each sample, n0 and n1 respectively.3 Let the decision problem be to choose a treatment for a person with x = 0. The question of interest is the extent to which observation of (n0, n1) improves treatment choice relative to observation of n0 alone. Proposition 2 showed that observation of sample data can improve treatment choice even when the data are uninformative about p0, because the data provide a means to randomize treatment choice. In this section, I consider settings in which one maintains assumptions that make observation of n1 informative about p0.

3 Considering N0 and N1 to be predetermined simplifies regret analysis relative to a setting where persons are sampled at random from the population at large. In that setting, (N0, N1, n0, n1) are jointly random variables. With (N0, N1) predetermined, only (n0, n1) are random. Moreover, n0 and n1 are statistically independent of one another. The analysis performed here applies with sampling at random from the population at large if one measures performance by expected regret conditional on realized values of (N0, N1) rather than by unconditional expected regret.

In the absence of assumptions that suitably restrict the state space, observation of n1 is not informative about p0. Under random sampling, the joint sampling distribution of (n0, n1) in state s is the Binomial product B(ps0, N0) × B(ps1, N1). The distribution of n1 varies with the value of ps1, but not with the value of ps0. Hence, n1 is uninformative about p0. Observation of n1 becomes informative when the state space has non-rectangular structure. A rectangular state space has the form S = S0 × S1, where S0 and S1 index the feasible values of p0 and p1 respectively. Then the feasible values of p0 do not vary with the value of p1. If S is non-rectangular, the feasible p0 vary with p1. Hence, observation of n1 may be informative about p0, via p1. Sections 4.1 and 4.2 examine two settings with non-rectangular state spaces.

One may find it credible to assume that p0 and p1 are not too different from one another.
Thus, one may impose a bounded-variation assumption of the form

(31) ps1 + λ− ≤ ps0 ≤ ps1 + λ+, all s ∈ S,

for specified λ− ≤ λ+. The implications of bounded-variation assumptions for identification of conditional probabilities have been studied by Manski and Pepper (2000, 2018) and Manski (2018). As far as I am aware, the only precedent work studying the implications for decision making with finite-sample data is Stoye (2012), who studied a class of treatment choice problems whose structure differs from the problem examined here. Manski (2018) gives an illustration in a medical context.

One might use the combined sample average (n0 + n1)/(N0 + N1) to estimate p0. In the statistical literature, estimation by a combined average rather than by n0/N0 is called dimension reduction. Statisticians usually analyze dimension reduction as a tradeoff between variance and bias, the objective being to minimize the mean square error of prediction. Combining samples increases the total sample size from N0 to N0 + N1, increasing precision. However, the quantity being estimated is now a weighted average of p0 and p1, which differs from p0 if p1 ≠ p0.

The intuition of a tradeoff between variance and bias extends to evaluation of maximum regret in binary treatment choice. However, maximum regret when using an estimate of p0 in treatment choice differs from the maximum mean square error of the estimate. Hence, the mathematical analysis differs. As-if optimization with estimate (n0 + n1)/(N0 + N1) yields smaller maximum regret than using n0/N0 for some values of the parameters (pm0, pM0, U0B, N0, N1, λ−, λ+), but larger maximum regret for other values. Combining samples is obviously preferable when λ− = λ+ = 0, as (31) then reduces to ps1 = ps0 for all s. It is also preferable when λ− and λ+ are not too far from zero. In these cases, the benefit of increasing sample size from N0 to (N0 + N1) exceeds the imperfection of using data on persons with illness probability p1 to estimate illness probability p0. Given specified values of the parameters, maximum regret using the two estimates can be computed numerically and compared. See Section 4.1.2. As a prelude, Propositions 5 and 6 present analytical findings indicating when combining samples outperforms using n0/N0. To simplify analysis, these propositions assume that pm1 = 0, pM1 = 1, and that the bound in (31) is symmetric.

The first result concerns the special case where N0 = 0 and N1 = 1. When N0 = 0, disregarding data for persons with x = 1 implies treatment choice with a data-invariant estimate. The estimate that uses the data is n1. Proposition 5, which extends Proposition 3, gives maximum regret for as-if optimization using this estimate, provided that the bound on p0 specified in (31) is symmetric and not too wide.

Proposition 5: Let 0 ≤ λ ≤ min(U0B, 1 − U0B). Let (31) hold, with λ+ = λ and λ− = −λ. Let pm1 = 0 and pM1 = 1. Let the sample data be one realization drawn at random from p1. Consider as-if optimization with estimate n1. The value of maximum regret is

(32) max{[(1 − U0B + λ)/2]², [(U0B + λ)/2]²}.

Proposition 6, which extends Proposition 4, gives the second result.

Proposition 6: Let 0 ≤ λ. Let (31) hold, with λ+ = λ and λ− = −λ. Let pm1 = 0 and pM1 = 1. Consider as-if optimization with the estimate (n0 + n1)/(N0 + N1). Let α1 ≡ N1/(N0 + N1). Then, for all δ > 0,

(33) max s ∈ S Es{Rs0[(n0 + n1)/(N0 + N1)]} ≤ δ + α1λ + max[(1 − pm0) − U0B, U0B − (1 − pM0)]·exp[−2(N0 + N1)δ²].

Comparison of (33) and (30) shows that the first term of (33) exceeds that in (30), being (δ + α1λ) rather than δ. However, the second term of (33) is less than that in (30), as exp[−2(N0 + N1)δ²] is less than exp(−2N0δ²). Hence, the upper bound on maximum regret using estimate (n0 + n1)/(N0 + N1) is smaller than the one using estimate n0/N0 when α1λ is sufficiently small and N1 is sufficiently large.
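To see when pooling helps, the sketch below evaluates the minimized bounds (30) and (33) side by side. It uses the reconstructed bound forms stated above and illustrative parameter values of my own choosing.

```python
import numpy as np

def min_bound(K, N, shift=0.0):
    """Minimize shift + delta + K * exp(-2 * N * delta**2) over a grid of delta."""
    deltas = np.linspace(1e-4, 1.0, 10_000)
    return float(np.min(shift + deltas + K * np.exp(-2.0 * N * deltas**2)))

# Illustrative parameters: wide bounds on p0 and a small bounded-variation constant.
U0B, pm0, pM0, lam = 0.6, 0.0, 1.0, 0.05
K = max((1.0 - pm0) - U0B, U0B - (1.0 - pM0))  # maximal welfare loss from an error

for N0, N1 in [(10, 10), (10, 50), (50, 10)]:
    a1 = N1 / (N0 + N1)
    own = min_bound(K, N0)                          # bound (30): data on x = 0 only
    pooled = min_bound(K, N0 + N1, shift=a1 * lam)  # bound (33): pooled samples
    print((N0, N1), round(own, 3), round(pooled, 3))
```

With λ small, the pooled bound is smaller whenever N1 adds substantially to the total sample size, matching the verbal comparison of the two bounds.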
The sample averages (n0 + n1)/(N0 + N1) and n0/N0 provide polar ways to estimate p0. The former acts as if p1 = p0, whereas the latter acts as if p1 and p0 may be arbitrarily different from one another. Between the two poles, one might consider estimation by a weighted average, data with x = 0 being weighted more heavily than data with x = 1. Such an estimate is

(34) (w0n0 + w1n1)/(w0N0 + w1N1),

where ½ ≤ w0 ≤ 1 and w1 = 1 − w0 are the weights. Weighted-average estimates perform partial dimension reduction, bridging the gap between complete dimension reduction (w0 = ½) and no reduction (w0 = 1).

Equation (34) is a simple form of the kernel estimate studied in the statistical literature on nonparametric regression. However, maximum-regret analysis of the performance of the estimate when used in binary treatment choice differs considerably from standard analysis of kernel estimation. To the extent that statisticians have performed finite-sample analysis, the usual concern has been the maximum mean square error of an estimate. The literature mainly studies asymptotic properties of estimates: convergence of mean square error to zero, convergence in probability, and rates of convergence as sample size increases. Theorems typically assume that x is a real vector whose distribution has positive density in a neighborhood of a value of interest. With some exceptions, theorems assume that the conditional expectation E(y|x) varies smoothly with x in a local sense, such as being differentiable, rather than a global sense such as being Lipschitz or Hölder continuous. Thus, they typically do not impose bounded-variation assumptions such as (31), which bound the difference between E(y|x = x1) and E(y|x = x0) at specified covariate values. Donoho et al. (1995) reviews many findings.

Given specified values of (pm0, pM0, U0B, N0, N1, λ−, λ+, w0), maximum regret using a weighted-average estimate can be computed numerically. Moreover, one can vary w0 and determine the weighting that minimizes maximum regret among all weighted averages. To illustrate, I use the problem of treating bleeds in patients with immune thrombocytopenia (ITP).

Illustration: ITP is an autoimmune disease characterized by low platelet counts and increased risk of bleeding. When a patient with ITP presents in a hospital emergency department, a difficult clinical problem is to predict whether the patient is experiencing a critical bleed (y = 1) or not (y = 0).4 A critical bleed warrants aggressive treatment, while surveillance is preferable otherwise. Assume that aggressive treatment neutralizes disease by stopping a critical bleed, but it may have side effects whose implications for patient welfare are measured by UxB. The treatment decision is made with knowledge of patient covariates x, but without knowledge of y. Given knowledge of px, the conditional probability that a bleed is critical, the optimal decision is surveillance if 1 − px ≥ UxB and aggressive treatment if 1 − px ≤ UxB. Suppose that, px not being known, treatment choice will be made by as-if optimization with a weighted-average estimate. For specificity, let x = 0 and x = 1 respectively denote female and male patients who have the same observed attributes other than gender. Suppose that available clinical knowledge and assessment of patient welfare makes it credible to set pm0 = 0.2, pM0 = 0.6, λ− = −0.1, λ+ = 0.1, and U0B = 0.6.
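The sketch below carries out this computation for the ITP parameter values just stated. It assumes the kernel-style form of estimate (34) given above (itself a reconstruction from the surrounding text) and approximates the non-rectangular state space and the weight search by coarse grids; the grids are chosen for speed, not to reproduce Table 1 exactly.

```python
import numpy as np
from scipy.stats import binom

# Parameters from the ITP illustration:
pm0, pM0, lam_lo, lam_hi, U0B = 0.2, 0.6, -0.1, 0.1, 0.6

def max_regret(N0, N1, w0, grid=81):
    w1 = 1.0 - w0
    n0, n1 = np.arange(N0 + 1), np.arange(N1 + 1)
    est = (w0 * n0[:, None] + w1 * n1[None, :]) / (w0 * N0 + w1 * N1)  # estimate (34)
    choose_B = (1.0 - est) < U0B                                        # as-if rule
    worst = 0.0
    for p0 in np.linspace(pm0, pM0, grid):
        # Bounded variation (31) makes the feasible p1 depend on p0 (non-rectangular S).
        for p1 in np.linspace(max(0.0, p0 + lam_lo), min(1.0, p0 + lam_hi), 11):
            prob = binom.pmf(n0, N0, p0)[:, None] * binom.pmf(n1, N1, p1)[None, :]
            prob_B = prob[choose_B].sum()
            regret = ((1 - p0) - U0B) * prob_B if (1 - p0) >= U0B \
                else (U0B - (1 - p0)) * (1.0 - prob_B)
            worst = max(worst, regret)
    return worst

N0, N1 = 10, 10
weights = np.arange(0.50, 1.001, 0.05)                 # coarse version of the w0 grid
best = min(weights, key=lambda w: max_regret(N0, N1, w))
print("optimal w0 on coarse grid:", round(best, 2),
      "max regret:", round(max_regret(N0, N1, best), 4))
```

Refining the p0, p1, and w0 grids trades computation time for accuracy, in line with the paper's remark that refining the grid increases the accuracy of the approximate solution.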
Table 1 reports maximum regret computed for various values of (N0, N1, w0), as well as the value of minimax regret, with the optimal weight in parentheses.5 Several findings stand out. Holding w0 and the total sample size N0 + N1 fixed, re-allocating sample from x = 1 to x = 0 always reduces maximum regret. And holding (N0, N1) fixed, maximum regret is minimized when the weight lies between the polar cases w0 = 0.5 and w0 = 1. The optimal weight is closer to w0 = 1 when sample size is larger. ∎

5 At each specified value of (N0, N1), the optimal weight was approximated by computing maximum regret over the uniform grid w0 ∈ [0.50, 0.51, 0.52, . . . , 0.98, 0.99, 1].

Estimation of p0 by a weighted average of outcomes extends easily from the case of a binary covariate to ones where patients have multiple observed covariate values. Let k = 0, . . . , K index distinct covariate values, with pk denoting the conditional probability of illness for persons with value xk. For each k, let Nk be the number of sampled such patients, which I take to be predetermined, and let nk be the number observed to be ill.6 Let the weights satisfy 0 ≤ wk for all k and ∑k = 0, . . . , K wk = 1. Let wkNk > 0 for at least one value of k. Then p0 may be estimated by the weighted average (∑k wknk)/(∑k wkNk), and bounded-variation assumptions analogous to (31) take the form

(36) psk + λk− ≤ ps0 ≤ psk + λk+, all s ∈ S, k = 1, . . . , K.

The problem is often described by considering a random sample of specified size, each of whose members has an observed J-dimensional real covariate vector x and an observed outcome y. Under usual local smoothness assumptions, the sampling probability with which the covariate value xi of each observation i lies within a specified Euclidean distance ε > 0 of a covariate value of interest, say x0, is of order ε^J when ε is small. Hence, increasing the dimensionality of the covariate space lessens the information that the data yield about the conditional expectation E(y|x = x0). In this paper, covariates may lie in a general space, not necessarily a real vector space. To maintain comparability with the traditional setup, let x be a real vector and consider increasing dimensionality. Thus, the extended covariate vector is (x, w), where w is a real vector. Let (x = x0, w = w0) be the extended covariate vector of a patient to be treated. Now the objective is to learn P(y = 1|x = x0, w = w0) rather than P(y = 1|x = x0). Whereas the data originally were [(yki, xki), i = 1, . . . , Nk, k = 0, . . . , K], they now are [(yki, xki, wki), i = 1, . . . , Nk, k = 0, . . . , K]. How does dimensional refinement affect estimation with bounded-variation assumptions? The answer depends on the applied setting. Whereas inequalities (36) bounded the difference between P(y = 1|x = x0) and [P(y = 1|x = xk), k = 1, . . . , K], a clinician might now seek to credibly bound the difference between P(y = 1|x = x0, w = w0) and [P(y = 1|x = xk, w = wk), k = 1, . . . , K]. Some of the latter bounds may be tighter than the former ones and others may be looser, depending on the illness and the covariates. Overall, refinement of dimensionality may improve estimation in some applications and weaken it in others.

Return to the setting with two covariate values. A bounded-variation assumption directly connects the illness probabilities p0 and p1. A different way to connect p0 and p1 materializes if one has empirical knowledge of the marginal probability of illness in the patient population, say p, and the fractions of the population who have each covariate value, say r0 and r1. For example, suppose that x is a binary measure of obesity and the illness of concern is liver cirrhosis.
Public data sources may record the overall rates of obesity and cirrhosis in a population, but not the rate of cirrhosis conditional on obesity status. Then the public data reveal (p, r0, r1), but not (p0, p1). Research on the ecological inference problem studies the logic of inference on (p0, p1) given knowledge of (p, r0, r1). The connection among these quantities is shown by the Law of Total Probability

(37) p = p0r0 + p1r1.

Duncan and Davis (1953) observed that (37) implies a computable bound on p0, namely

(38) max[0, (p − r1)/r0] ≤ p0 ≤ min(1, p/r0).

The subsequent literature generalizes this finding to settings with general real-valued outcomes. Manski (2018) reviews this work and gives an application to medical decision making.

When evaluating the performance of treatment-choice rules, one may use equation (37) to shrink the state space relative to what it would be in the absence of knowledge of (p, r0, r1). Consider any initial state space S, embodying the available restrictions on the true state without knowledge of (p, r0, r1). The state space using this knowledge is the non-rectangular set (s ∈ S: p = ps0r0 + ps1r1). To date, analysis of ecological inference has assumed empirical knowledge only of (p, r0, r1), with no sample data on outcomes conditional on covariates. Suppose that one can combine knowledge of (p, r0, r1) with such sample data. To achieve greater precision in estimation, classical statistical thinking suggests minimization of total sample prediction error under square loss, subject to constraint (37). The mean of a probability distribution is the best predictor of a random draw under square loss; hence, p0 and p1 are the best population predictors of illness conditional on x = 0 and x = 1 respectively. When (p, r0, r1) are unknown, this motivates unconstrained least squares estimation, yielding n0/N0 and n1/N1 as estimates. When (p, r0, r1) are known, it suggests constrained least squares estimation, namely

(39) min (q0, q1) ∈ [0, 1]² N0(q0 − n0/N0)² + N1(q1 − n1/N1)², subject to p = q0r0 + q1r1.7

7 I am grateful to Michael Gmeiner for the derivation, which is obtainable from the author of this paper.

Given specified values of (pm0, pM0, U0B, N0, N1, p, r0, r1), the approximate maximum regret of as-if optimization using the constrained least squares estimate can be computed numerically. To illustrate, let (pm0 = 0, pM0 = 1, U0B = ½, p = ½, r0 = 0.7, r1 = 0.3) and consider the two sample-size pairs (N0, N1) = (10, 10) and (20, 20). Approximating the state space by a grid of 100 values for p0 and maximizing over this grid, the resulting values of maximum regret are 0.011 and 0.008 respectively.
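A sketch of the two ecological-inference computations: the Duncan-Davis bound (38) and a Lagrange-multiplier solution of the constrained least squares problem (39). The closed form below is my own derivation sketch, not the derivation referenced in footnote 7, and the final clipping step can break the constraint in corner cases.

```python
import numpy as np

def duncan_davis_bound(p, r0, r1):
    """Bound (38) on p0 implied by the Law of Total Probability (37)."""
    return max(0.0, (p - r1) / r0), min(1.0, p / r0)

def constrained_ls(n0, N0, n1, N1, p, r0, r1):
    """Fit (q0, q1) to the sample frequencies under constraint (37).

    Minimizes N0*(q0 - f0)^2 + N1*(q1 - f1)^2 subject to p = q0*r0 + q1*r1,
    via the stationarity conditions qk = fk + lam * rk / Nk.
    """
    f0, f1 = n0 / N0, n1 / N1
    lam = (p - (r0 * f0 + r1 * f1)) / (r0**2 / N0 + r1**2 / N1)
    q0 = np.clip(f0 + lam * r0 / N0, 0.0, 1.0)
    q1 = np.clip(f1 + lam * r1 / N1, 0.0, 1.0)
    return q0, q1

print(duncan_davis_bound(p=0.5, r0=0.7, r1=0.3))   # bound on p0 from public data alone
print(constrained_ls(n0=4, N0=10, n1=7, N1=10, p=0.5, r0=0.7, r1=0.3))
```

In the printed example the unconstrained frequencies (0.4, 0.7) violate (37) slightly, and the constrained fit moves each frequency toward satisfying p = q0r0 + q1r1, with the smaller sample adjusted proportionally less per unit of r.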
This paper carries further my research applying statistical decision theory to treatment choice with sample data, using maximum regret to evaluate the performance of treatment rules. The methodological innovation relative to past work is to study as-if optimization with alternative estimates of illness probabilities, when choosing between surveillance and aggressive treatment. To render the analysis transparent and informative, I studied a relatively simple but decidedly nontrivial formalization of the decision problem. Extending the analysis to more complex and realistic forms of the problem offers much scope for future research.

Beyond the specific analysis performed here, the paper sends a broad message. It is always important to address decision making with care, but particularly so in medical settings, where the stakes are often high. Biostatisticians and computer scientists have addressed medical risk assessment in indirect ways, the former applying classical statistical theory and the latter measuring prediction accuracy in test samples. Neither approach is satisfactory. Statistical decision theory provides a coherent, generally applicable methodology.

Appendix: Proofs of Propositions

Proof of Proposition 1: The error probability in state s takes the value 0 or 1, with

(A1a) Qs[e(psx, φx, UxB) = 1] = 0 if min(1 − psx, 1 − φx) ≥ UxB or max(1 − psx, 1 − φx) < UxB,
(A1b) Qs[e(psx, φx, UxB) = 1] = 1 otherwise.

To compute maximum regret across S, recall the maintained assumption that pmx ≡ min s ∈ S psx < 1 − UxB < pMx ≡ max s ∈ S psx. It follows from (A2a)−(A2b) that maximum regret is (1 − pmx) − UxB when the constant estimate φx yields treatment B and is UxB − (1 − pMx) when it yields treatment A. The minimax regret estimate yields the smaller of these two values, min[(1 − pmx) − UxB, UxB − (1 − pMx)], which is expression (26). ∎

Proof of Proposition 2: Expected regret with a randomized estimate differs from (23) in that the sampling distribution is the known Q0 rather than a state-dependent Qs. To study maximum regret, partition S into the regions SA = (s ∈ S: 1 − psx ≥ UxB) and SB = (s ∈ S: 1 − psx < UxB). It follows from (A4) that maximum regret over SA is proportional to the probability that the randomized estimate yields treatment B. Now consider states in SB; an analogous derivation shows that maximum regret over SB is proportional to the probability that the estimate yields treatment A. Equating the two maxima gives the minimax regret probability of choosing B,

[UxB − (1 − pMx)] / {[UxB − (1 − pMx)] + [(1 − pmx) − UxB]} = [UxB − (1 − pMx)]/(pMx − pmx).

Inserting this into (A7) gives the minimum achievable value of maximum regret, expression (28). ∎

Proof of Proposition 3: The sampling probabilities in state s are Qs(0) = 1 − psx and Qs(1) = psx. Consider the estimate φx(nx) = nx. By (25a)−(25b),

(A10a) Es{Rsx(nx)} = [(1 − psx) − UxB]·psx for s ∈ SA,
(A10b) Es{Rsx(nx)} = [UxB − (1 − psx)]·(1 − psx) for s ∈ SB.

Examination of first and second-order conditions shows that (A10a) is globally maximized at (1 − UxB)/2 and (A10b) is globally maximized at 1 − UxB/2. These maxima lie within SA and SB respectively. Hence,

(A11a) max s ∈ SA Es{Rsx(nx)} = [(1 − UxB)/2]²,
(A11b) max s ∈ SB Es{Rsx(nx)} = (UxB/2)². ∎

Proof of Proposition 4: Decompose expected regret by conditioning on whether |nx/Nx − psx| ≤ δ. On this event, an error can occur only if psx lies within δ of the threshold 1 − UxB, so Rsx(nx/Nx) ≤ δ. On the complementary event, regret is bounded by the maximal welfare loss, and the Hoeffding inequality bounds the probability of the event by exp(−2Nxδ²). Combining (A15) and (A16) yields result (30). ∎
Proof of Proposition 5: The sampling probabilities in state s are Qs(n1 = 0) = 1 − ps1 and Qs(n1 = 1) = ps1. Consider s ∈ S0A; thus, ps0 ≤ 1 − U0B. By assumption, λ ≤ min(U0B, 1 − U0B); hence, ps0 + λ ≤ 1. Holding ps0 fixed, maximum expected regret in (A17a) over ps1, subject to (31), occurs when ps1 = ps0 + λ. This yields maximum regret (1 − ps0 − U0B)·(ps0 + λ), a quadratic function of ps0 alone. Examination of first and second-order conditions shows that (A17a) is globally maximized at (1 − U0B − λ)/2. This maximum lies within S0A. Now consider s ∈ S0B; thus, ps0 > 1 − U0B. By assumption, λ ≤ min(U0B, 1 − U0B); hence, ps0 − λ > 0. Holding ps0 fixed, maximum expected regret in (A17b) over ps1, subject to (31), occurs when ps1 = ps0 − λ. This yields maximum regret [ps0 − (1 − U0B)]·(1 − ps0 + λ), a quadratic function of ps0 alone. Examination of first and second-order conditions shows that (A17b) is globally maximized at 1 − U0B/2 + λ/2. This maximum lies within S0B. Combining the maxima across S0A and S0B yields result (32). ∎

Proof of Proposition 6: In state s, the estimate (n0 + n1)/(N0 + N1) has mean α0ps0 + α1ps1, where α0 ≡ N0/(N0 + N1) and α1 ≡ N1/(N0 + N1). Hence, by (31),

(A18) |ps0 − (α0ps0 + α1ps1)| = α1|ps0 − ps1| ≤ α1λ.

The large-deviations inequality of Hoeffding (1963) shows that, for all δ ∈ (0, 1) and s ∈ S, the probability that (n0 + n1)/(N0 + N1) deviates from its mean by more than δ is at most exp[−2(N0 + N1)δ²]. The rest of the proof is similar to the proof of Proposition 4, with δ + α1λ replacing δ when the Law of Iterated Expectations is used to decompose expected regret. Consider s ∈ S0A = (s ∈ S: 1 − ps0 ≥ U0B). Then

(A21) Es{Rs0[(n0 + n1)/(N0 + N1)]} = Es{Rs0[(n0 + n1)/(N0 + N1)] │ (n0 + n1)/(N0 + N1) − ps0 ≤ δ + α1λ}·Qs[(n0 + n1)/(N0 + N1) − ps0 ≤ δ + α1λ] + Es{Rs0[(n0 + n1)/(N0 + N1)] │ (n0 + n1)/(N0 + N1) − ps0 > δ + α1λ}·Qs[(n0 + n1)/(N0 + N1) − ps0 > δ + α1λ].

Note that Qs[(n0 + n1)/(N0 + N1) − ps0 ≤ δ + α1λ] ≤ 1 and that Rs0[(n0 + n1)/(N0 + N1)] ≤ (1 − ps0) − U0B for all values of (n0 + n1)/(N0 + N1). Combining these results with (A20a) gives an upper bound on expected regret. Maximizing expected regret over S0A yields (A24), the maximum of Es{Rs0[(n0 + n1)/(N0 + N1)]} over s ∈ S0A. For S0B = (s ∈ S: 1 − ps0 < U0B), an analogous derivation yields (A25), the maximum over s ∈ S0B. Combining (A24) and (A25) yields result (33). ∎

References

Athey, S. and S. Wager (2021), "Policy Learning with Observational Data."
Breiman, L. (2001), "Statistical Modeling: The Two Cultures."
"Definition of a Critical Bleed in Patients with Immune Thrombocytopenia: Communication from the ISTH SSC Subcommittee on Platelet Immunology."
Devroye, L., L. Györfi, and G. Lugosi (1996), A Probabilistic Theory of Pattern Recognition.
Donoho, D., I. Johnstone, G. Kerkyacharian, and D. Picard (1995), "Wavelet Shrinkage: Asymptopia?"
Duncan, O. and B. Davis (1953), "An Alternative to Ecological Correlation."
Efron, B. (2020), "Prediction, Estimation, and Attribution."
Ferguson, T. (1967), Mathematical Statistics: A Decision Theoretic Approach.
Hirano, K. and J. Porter (2009), "Asymptotics for Statistical Treatment Rules."
Hirano, K. and J. Porter (2020), "Statistical Decision Rules in Econometrics."
Hoeffding, W. (1963), "Probability Inequalities for Sums of Bounded Random Variables."
Kitagawa, T. and A. Tetenov (2018), "Who Should Be Treated? Empirical Welfare Maximization Methods for Treatment Choice."
Manski, C. (1975), "Maximum Score Estimation of the Stochastic Utility Model of Choice."
Manski, C. (1985), "Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator."
Manski, C. (2004), "Statistical Treatment Rules for Heterogeneous Populations."
Manski, C. (2005), Social Choice with Partial Knowledge of Treatment Response.
Manski, C. (2007a), Identification for Prediction and Decision.
Manski, C. (2007b), "Minimax-Regret Treatment Choice with Missing Outcome Data."
Manski, C. (2009), "Diversified Treatment under Ambiguity."
Manski, C. (2018), "Credible Ecological Inference for Medical Decisions with Personalized Risk Assessment."
Manski, C. (2019a), "Treatment Choice with Trial Data: Statistical Decision Theory Should Supplant Hypothesis Testing," The American Statistician.
Manski, C. (2020), "Judicial and Clinical Decision Making under Uncertainty."
Manski, C. (2021), Econometrics for Decision Making: Building Foundations Sketched by Haavelmo and Wald.
Manski, C. and J. Pepper (2000), "Monotone Instrumental Variables: With an Application to the Returns to Schooling."
Manski, C. and J. Pepper (2018), "How Do Right-to-Carry Laws Affect Crime Rates? Coping with Ambiguity Using Bounded-Variation Assumptions."
Manski, C. and A. Tetenov (2007), "Admissible Treatment Rules for a Risk-Averse Planner with Experimental Data on an Innovation."
Manski, C. and A. Tetenov (2016), "Sufficient Trial Size to Inform Clinical Practice."
Manski, C. and A. Tetenov (2019), "Trial Size for Near-Optimal Choice Between Surveillance and Aggressive Treatment: Reconsidering MSLT-II."
Manski, C. and A. Tetenov (2021), "Statistical Decision Properties of Imprecise Trials Assessing COVID-19 Drugs."
Manski, C. and T. S. Thompson (1989), "Estimation of Best Predictors of Binary Response."
Mbakop, E. and M. Tabord-Meehan (2021), "Model Selection for Treatment Choice: Penalized Welfare Maximization."
Neyman, J. and E. Pearson (1928), "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference."
Neyman, J. and E. Pearson (1933), "On the Problem of the Most Efficient Tests of Statistical Hypotheses."
Stoye, J. (2009), "Minimax Regret Treatment Choice with Finite Samples."
Stoye, J. (2012), "Minimax Regret Treatment Choice with Covariates or with Limited Validity of Experiments."
Tetenov, A. (2012), "Statistical Treatment Choice Based on Asymmetric Minimax Regret Criteria."
Vapnik, V. (1999), "An Overview of Statistical Learning Theory."
Vapnik, V. (2000), The Nature of Statistical Learning Theory.
Wald, A. (1939), "Contribution to the Theory of Statistical Estimation and Testing Hypotheses."
Wald, A. (1945), "Statistical Decision Functions Which Minimize the Maximum Risk."
Wald, A. (1950), Statistical Decision Functions.
Yang, Y. (1999), "Minimax Nonparametric Classification - Part I: Rates of Convergence."