The Philosophical Significance of Stein's Paradox

Olav Vassend*, Elliott Sober*, and Branden Fitelson†
*Philosophy Department, University of Wisconsin, Madison, Wisconsin
†Philosophy Department, Northeastern University, Boston, Massachusetts

Abstract: Charles Stein discovered a paradox in 1955 that many statisticians think is of fundamental importance. Here we explore its philosophical implications. We outline the nature of Stein's result and of subsequent work on shrinkage estimators; then we describe how these results are related to Bayesianism and to model selection criteria like the Akaike Information Criterion. We also discuss their bearing on scientific realism and instrumentalism. We argue that results concerning shrinkage estimators underwrite a surprising form of holistic pragmatism.

1. Shrinkage is better than straight MLE when k≥3[1]

If you sample at random (with replacement) from a human population and find that the average height in your sample is 5 feet, what could be more natural than the conclusion that the average height in the whole population is about 5 feet? The principle underwriting this inference has gone by different names. Philosophers have called it "the principle of induction." Frequentist statisticians say that the inference is justified by a method called "maximum likelihood estimation" (MLE). Here the word "likelihood" is used in its technical sense. The estimate that the population mean is 5 feet maximizes likelihood, not in the sense that this is the most probable estimate given the observations, but in the sense that it makes the observations more probable than other estimates are able to do. The likelihood of hypothesis H relative to observation O is the quantity Pr(O|H), not the quantity Pr(H|O). If the population mean were 7 feet, a sample mean of (about) 5 feet would be very improbable, "almost a miracle." If the population mean were 5 feet, a sample mean of approximately 5 feet would be much less surprising.[2]

[Footnote 1] Here we are indebted to the excellent exposition in Efron and Morris (1977).
[Footnote 2] Here we are assuming the variances are the same in the two cases.

Statisticians have done for MLE something that their philosophical predecessors did not do for the principle of induction. They proved that MLE uniquely possesses various desirable properties. Gauss showed that if the distribution of heights in the population is normal, then the ML estimate of the mean height is the sample mean (Edwards 1974, p. 11). Gauss also realized that MLE is unbiased, meaning that repeated ML estimates based on different samples drawn from the same population will tend to be centered on the population mean. There are infinitely many unbiased estimators; Gauss (1823) proved, finally, that ML estimates of normal means have lower expected mean-squared error than any other unbiased estimator that is a linear function of the observations.

The case for MLE was strengthened in the 1930's with the development of statistical decision theory. Suppose estimates are good to the degree that they come close to the true (but unknown) value of the quantity being estimated. In particular, consider a "loss function" that measures an estimator's inaccuracy by its squared deviation from the true value of the target quantity. Figure 1 depicts three estimators that might be used for the mean height in a population – the sample average, one-half the sample average, and the sample median (the middle value).
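To make the comparison concrete, here is a minimal simulation sketch (our illustration, not from the original sources) that approximates risk curves of the sort Figure 1 depicts, for the sample mean, half the sample mean, and the sample median under squared-error loss; the sample size, variance, and grid of θ values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(estimator, theta, n=25, sigma=1.0, reps=20000):
    """Monte Carlo estimate of E[(estimate - theta)^2] for samples of size n."""
    samples = rng.normal(theta, sigma, size=(reps, n))
    estimates = estimator(samples)
    return np.mean((estimates - theta) ** 2)

estimators = {
    "sample mean": lambda s: s.mean(axis=1),
    "half the sample mean": lambda s: 0.5 * s.mean(axis=1),
    "sample median": lambda s: np.median(s, axis=1),
}

for theta in [0.0, 0.5, 1.0, 2.0]:
    print(f"theta = {theta}")
    for name, est in estimators.items():
        print(f"  {name:22s} risk ≈ {risk(est, theta):.4f}")
```

On this sketch the mean's risk stays flat and below the median's for every θ, while half the sample mean beats both near θ = 0 but degrades rapidly as θ grows – the pattern the figure is meant to convey.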
Figure 1 tracks how accurate an estimator can be expected to be as a function of the unknown value of the population's mean height (θ). Of course, how well an estimator does may vary from sample to sample; what can be plotted precisely is the average performance (the mathematical expectation). The greater the expected squared error, the higher the estimator's "risk." As Figure 1 makes plain, the mean has a lower expected inaccuracy than the median for each possible value of θ. The mean strongly dominates the median, in the technical sense of that term used in decision theory. For that reason, the median is deemed an inadmissible estimator. An admissible estimator isn't dominated (either strongly or weakly) by any other estimator.[3] Figure 1 does not say whether the sample mean and half-the-sample-mean are admissible, since the figure leaves open that there might be other estimators that dominate both. Blyth (1951) and Lehmann and Hodges (1951) settled this question; they proved, for data drawn from a normal population, that there is no estimator that dominates MLE. MLE is admissible. This doesn't mean that MLE dominates all other estimators, but only that no estimator dominates MLE.

[Footnote 3] X weakly dominates Y precisely when X's risk is never higher than Y's, and for some values of θ, X's risk is lower. Inadmissible estimators are weakly dominated; they may or may not be strongly dominated.

Figure 1[4]

[Footnote 4] Adapted from Efron and Morris (1977), p. 124.

Stein (1956) surprised the statistics community by showing that MLE has a dark side. He proved that MLE is inadmissible when the estimation problem concerns three or more statistically independent measurements of the means of normal distributions with identical known variances. MLE is admissible if you are estimating average height in the United States. It also is admissible if you are estimating average height in Norway. And it is admissible if you are estimating average height in Japan. But if you want to estimate all three averages at the same time, and your goal is to minimize expected error across the three estimates, MLE is inadmissible.[5] James and Stein (1961) showed that you do better in expectation if you construct your estimates by shrinking the three observed sample means towards zero by multiplying each by c = 1 − σ²/(X1² + X2² + X3²). Here, Xi is the ML estimate of the average height of population i, and σ² is the variance within each population. Multiplying by c shrinks estimates towards 0 because c will, in general, be less than 1 and greater than 0.[6] Efron and Morris (1973) showed how to generalize Stein's result when the variances of the populations are unknown and/or different from each other, and they also showed that shrinking towards the grand mean of the samples is better than straight MLE when four or more quantities are estimated.[7] When you estimate three or more parameters, you should not expect a shrinkage estimator to better MLE for each parameter; in fact, for each parameter, there are parameter values for which shrinkage will do worse than MLE and other values for which it will do better. However, your total error over the parameters collectively will be lower in expectation if you shrink.

[Footnote 5] In the theory of random walks, a similarly interesting transition happens when you move from two to three dimensions. Random walks in one or two dimensions are recurrent, meaning they have a probability of 1 of returning to their starting point. Random walks in dimensions greater than two are not recurrent. Brown (1971) discovered a connection between Stein's result and this fact about random walks.
[Footnote 6] Although it's unlikely, it's possible for c to be a negative number.
[Footnote 7] The estimator that shrinks the ML estimates towards the grand sample mean is now standardly referred to as the "Efron-Morris estimator," and we will follow this practice in our paper; however, it is worth pointing out that the Efron-Morris estimator was originally suggested by Dennis Lindley in the discussion section of Stein (1962).
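The following sketch (our own, with arbitrarily chosen true means and known variance 1) applies the James-Stein factor just described to simulated measurements of three means and compares the average total squared error with straight MLE.

```python
import numpy as np

rng = np.random.default_rng(1)

true_means = np.array([1.0, -0.5, 2.0])   # arbitrary illustrative values
sigma2 = 1.0
reps = 100_000

x = rng.normal(true_means, np.sqrt(sigma2), size=(reps, 3))   # one measurement of each mean per trial

# James-Stein for k = 3: multiply each trial's observations by c = 1 - sigma^2 / sum_i X_i^2.
c = 1.0 - sigma2 / np.sum(x ** 2, axis=1, keepdims=True)
js = c * x

mle_error = np.sum((x - true_means) ** 2, axis=1)
js_error = np.sum((js - true_means) ** 2, axis=1)

print("average total squared error, MLE:        ", round(mle_error.mean(), 3))
print("average total squared error, James-Stein:", round(js_error.mean(), 3))
```

With these particular true means the improvement is modest, which accords with the point made in Section 4 below: the gain from shrinkage is largest when the true means are close to the point you shrink towards.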
2. Why this is paradoxical

Stein's result is bizarre. The three quantities can be entirely unrelated to and independent of each other and these shrinkage estimators still do better than straight MLE when judged by the criterion of expected squared error. Efron and Morris (1977) give the example of estimating the batting abilities of 18 Major League baseball players from data on their first 45 times at bat in a year. If you estimate each player's probability of getting a hit, and use those estimates to predict how well each player will do at the end of the year, the data from the end of the season reveals that you do worse if you use MLE than if you shrink those ML estimates towards the grand sample mean. In this case, real data behave in conformity with Stein's result. This may lead you to think that the baseball players are influenced by a common cause that exerts a "gravitational attraction" on batting ability as the season unfolds. The point of importance is that shrinkage results depend on no such assumption (Stigler 1990, pp. 147-148).

Efron and Morris (1977) make this point graphic. After describing their baseball example, they add to it the problem of inferring the percentage of foreign cars in Chicago. This enlarged problem involves quantities that come in different units –

(number of hits)/(number of times at bat)   and   (number of foreign cars in Chicago)/(number of cars in Chicago).

Shrinkage estimators are better than straight MLE estimates here, since there are more than three parameters being estimated. The mathematical results that ground this fact do not turn on whether hits in baseball and foreign cars in Chicago are causally related to each other.

Which quantity do you want to estimate?   % domestic cars        % foreign cars
Your data                                 60% domestic cars      40% foreign cars
Your shrinkage estimate                   <60% domestic cars     <40% foreign cars

This is weird, but there is more. The additional strangeness is that there are different, equally correct, ways of coding your data, and your choice of code will affect how you shrink your estimates away from the MLE estimates. This point is illustrated in the accompanying table. The average success rate at the start of the season for the 18 baseball players that Efron and Morris examined was 26.5%. Suppose you sample the cars in Chicago and find that 40% of the cars in your sample are foreign. Since there are 18 baseball players and only one Chicago, the grand mean of these 19 frequencies is close to 27%. So, if you shrink your estimates of those 19 parameters towards that grand mean, you'll do better in expectation than if you use straight MLE. Notice that if you shrink, you will shrink your car estimate towards 27% regardless of whether you estimate the percentage of foreign cars or the percentage of domestic. If you do both, you'll have contradictory estimates. So what should you do? Of course, only one of these shrinkages will move you closer to the truth, but both in expectation will do so when they are part of the 19-parameter problem.
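The coding point can be seen in a toy sketch. The batting figures below are placeholders rather than the Efron-Morris data, and the fixed shrinkage weight of 0.5 is an arbitrary simplification (the real Efron-Morris estimator chooses the amount of shrinkage from the data); the point is only that the two codings of the car datum yield incompatible estimates.

```python
import numpy as np

# 18 hypothetical early-season batting averages (stand-ins for the Efron-Morris data)
batting = np.full(18, 0.265)
foreign_share = 0.40            # observed share of foreign cars in the sample

def shrink_toward_grand_mean(observations, weight=0.5):
    """Pull each observation part-way toward the grand mean of all the observations."""
    grand_mean = observations.mean()
    return grand_mean + weight * (observations - grand_mean)

# Coding 1: include the car datum as "% foreign"
est_foreign = shrink_toward_grand_mean(np.append(batting, foreign_share))[-1]

# Coding 2: include the car datum as "% domestic"
est_domestic = shrink_toward_grand_mean(np.append(batting, 1 - foreign_share))[-1]

print("shrunk estimate of % foreign:      ", round(est_foreign, 3))
print("1 - shrunk estimate of % domestic: ", round(1 - est_domestic, 3))
```

The two numbers do not match: shrinking the foreign-car frequency and shrinking the domestic-car frequency both pull toward a grand mean near 27%, and those two pulls cannot be reconciled with each other.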
Moreover, what is true in the 19-parameter problem also holds for the initial problem of the 18 baseball players. The James-Stein theorem does not require that you code the 18 sample means in terms of each player's percentage of hits in their first 45 times at bat. You could code some players in terms of their success rates and others in terms of their failure rates, and the theorem would still apply. As Efron and Morris tell the story, Roberto Clemente's estimated season-long batting average gets shrunk from his initial success rate of 0.400 towards 0.265, but MLE also will be bettered if you shrink his initial failure rate of 0.600 towards 0.265. What goes for Chicago also goes for Roberto.

3. Stein's Paradox and Linguistic Invariance

The reader has a right to be shocked by the fact that what you say about Roberto Clemente's batting ability depends on whether you seek to estimate his ability to succeed or his tendency to fail. The oddity arises if your problem is to estimate Clemente's ability and that of 17 other baseball players simultaneously. This finding shows that shrinkage estimators are not linguistically invariant in the following weak sense:

Linear Invariance: An estimator E of parameter θ is linearly invariant iff, for any data set D, and any linear function f, f(E[θ | D]) = E[f(θ) | D], where a linear function is any function of the form f(θ) = aθ + b, for real-valued a and b.

Linear invariance seems like a very reasonable requirement to impose on estimators. Demanding linear invariance means that if your estimator says that 5 feet is the best estimate of the mean height in a population, it had better say that 60 inches is also the best estimate. MLE is invariant in this sense, as are the sample mean, the sample median, and the sample mode.

There is another sort of invariance that also seems reasonable to impose on estimators. Suppose you are estimating the mass m of some object and that you end up with a data set D of measurements. Now suppose someone adds the number 1 to each of your measurements, thereby yielding the new data set D′. Intuitively, it seems reasonable in this case to demand that your estimator should be invariant in the following sense: E(m | D′) = E(m | D) + 1. This type of invariance is known as translation invariance. Note that whereas Linear Invariance is invariance under certain transformations of the parameters, translation invariance is invariance under certain transformations of the data. The sample mean, median, and mode are all clearly translation invariant, and MLE satisfies translation invariance too given very plausible restrictions on the distribution. Shrinkage estimators, however, are not translation invariant.

Given weak restrictions on the loss function and the probability distribution, any translation invariant estimator has a constant risk function (cf. the median and the average in Figure 1). Translation invariant estimators are therefore easy to compare with each other: the best translation invariant estimator is simply the one that has the lowest (constant) risk. Stein's result shows that, under quadratic loss, the best translation invariant estimator of normal means, namely MLE, is dominated by shrinkage estimators that are not translation invariant and not invariant under linear transformations, when more than two parameters are being estimated. MLE is the best invariant estimator if you are estimating the mean of a normal distribution and you are using the squared loss function to measure inaccuracy.
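The failure of translation invariance is easy to check numerically. The sketch below (our own; the observations and the shift are arbitrary) adds a constant to every measurement and then subtracts it back out of the resulting estimates: MLE returns to where it started, the James-Stein estimator does not.

```python
import numpy as np

def mle(x):
    return x                          # the ML estimate of each normal mean is the observation itself

def james_stein(x, sigma2=1.0):
    k = len(x)
    return (1.0 - (k - 2) * sigma2 / np.sum(x ** 2)) * x   # shrink toward zero

x = np.array([1.3, 0.4, -0.8])        # hypothetical observations of three means
shift = 10.0                           # add a constant to every measurement

print("MLE:         shifted estimate - shift =", mle(x + shift) - shift)
print("James-Stein: shifted estimate - shift =", james_stein(x + shift) - shift)
print("unshifted James-Stein estimate        =", james_stein(x))
```

The shifted James-Stein estimates, once the shift is subtracted back out, do not match the unshifted ones, because the point shrunk towards (zero) does not move with the data. Note, too, that the optimality claim just made for MLE is tied to the squared loss function.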
However, if you change your loss function, then MLE might no longer be the best invariant estimator. For example, if you were to measure inaccuracy by taking the absolute value of the deviation from the target quantity, rather than the square of this deviation, then the best invariant estimator would instead be the sample median, while the ML estimator would still be the sample mean. On the other hand, if you were estimating the mean of a Laplace distribution rather than a normal distribution, then the ML estimator and the best invariant estimator would both be the sample median.[8]

[Footnote 8] The Laplace distribution, like the normal distribution, has an exponential functional form. The main difference between the two distributions is that in the Laplace distribution, the exponent is not squared; instead, its absolute value is taken.

The reader might naturally wonder whether Stein's result is a mathematical curiosity that arises only in estimation problems that involve normal distributions where the loss function is squared deviation. It is not. Brown (1966) showed that for a very large range of loss functions and distributions, the best translation invariant estimator is never admissible. In each of the cases considered by Brown, there is a way of modifying the James-Stein estimator to yield a non-invariant estimator that dominates the best invariant estimator.[9]

[Footnote 9] Nor is Stein's result in the case of the squared loss function attributable to the fact that the squared loss function is unbounded (James and Stein 1961, p. 367). The result also does not depend on the assumption that the measurements of the different means are independent of one another (Bock 1975).

Returning to the baseball players, it seems natural to restrict your attention to the class C of estimation methods that yield the same answer whether you code your data in terms of failure rates or success rates, and that aren't affected by uniformly shifting the data by some constant amount. However, Brown's result shows that, under widely applicable conditions, the best estimators in C will be dominated by strange shrinkage estimators that are sensitive to how you encode your data. The real Stein paradox is therefore not just that MLE is inadmissible in dimensions greater than 2. The real paradox is that if total accuracy is your goal, then under widely applicable conditions, you have to go outside the class of invariant estimators if you want your estimator to be admissible.[10] If you value linguistic invariance above total accuracy, you might choose to use an invariant estimator even though shrinkage is more accurate.[11]

[Footnote 10] We thank an anonymous reviewer for pointing this out to us.
[Footnote 11] George Barnard seems to endorse this view (see Stein 1962, p. 288), as do Perlman and Chaudhuri (2012, p. 139n18), who maintain that shrinkage estimation should not be used in the absence of prior information that non-arbitrarily singles out some point towards which you shrink the ML estimates.

However, if you opt for a shrinkage estimator SE(-), there is a kind of invariance that your estimator will possess, at least if you use the squared loss function. If SE(θ, data) is more accurate in expectation than MLE(θ, data), then SE(θ′, data) will also be more accurate in expectation than MLE(θ′, data), provided that θ′ is a linear transformation of θ. In this sense, it doesn't matter whether you score all 18 baseball players by their success rates, or all by their failure rates, or use success for some and failure for others. Regardless of coding, shrinkage can be expected to better MLE.
4. Making the shrinkage result intuitive[12]

[Footnote 12] Here we are indebted to Stigler (1990).

Suppose you make independent measurements of three unknown parameters where the measurement of each parameter is modeled as a normal distribution with variance 1 and mean θi. That is, you model the measurements Xi by the formula Xi = θi + error, where the error is Gaussian. For simplicity, suppose you have just three measurements, x1, x2, and x3, of θ1, θ2, and θ3, respectively. MLE says that you should estimate that θi = xi for each i. Figure 2 represents this estimate in the plane as a 45 degree straight line that goes through the origin. Each θ̂ is the ML estimate for a given observed value of X. MLE is therefore a linear estimator of θ given X, but is MLE the best linear estimator?

Figure 2

To investigate this question, it is useful to indulge in a fiction: suppose you know the true value for θ that is associated with each of the three observed x values. The three pairs are represented in Figure 3. What is the best linear estimator given these three data points? To answer this question, you need to decide which of two estimation problems you want to address. These are shown in Figures 3 and 4.

Suppose your goal is to find the line that minimizes the vertical distances between points and line. This line is shown in Figure 3; it obeys the equation θ = aX + b. Unfortunately, you don't know the values of the θ's, so you can't estimate the coefficients a and b in this equation in the usual way. However, you can try to approximate them; indeed, all of the different shrinkage estimators, beginning with the one described by James and Stein (1961), can be regarded as clever methods for estimating a and b from the data.

Figure 3

Alternatively, your goal might be to minimize the horizontal distances between points and line. The best line is then the one shown in Figure 4. It obeys the equation X = αθ + β. As Galton (1988) recognized, the lines in Figures 3 and 4 are different. You need to decide whether you want a line that, for an observed x value, is close to the true θ value (Figure 3), or a line that, for a given θ value, is close to the observed x value (Figure 4).

Figure 4

The ML estimate shown in Figure 2 may be regarded as an approximation of the least-squares line in Figure 4. Indeed, the 45 degree ML estimate is equivalent to the theoretical regression line of X on θ, and therefore may be said to be the best possible approximation of the least squares line in Figure 4, given that the θ values are unknown. However, if you want to minimize error in estimating θ, then the line in Figure 4 is not the line you should try to approximate; instead, you should try to approximate the least squares line in Figure 3, which is what shrinkage estimators all attempt to do.

The fact that there are two least-squares lines explains why MLE leaves something to be desired if your goal is to find a straight line that is close to the true θ value for a given observed value of X. However, it explains more than that.

First, note that if you were to move the θs in Figure 3 and Figure 4 further apart from each other, the two regression lines would then converge toward each other. Shrinkage can therefore be expected to offer only a minor improvement over MLE if the θs are far apart from each other (although the improvement will always be greater than 0). On the other hand, if the θs are very close together, the slopes of the two regression lines will be very different, and shrinkage estimation can therefore be expected to offer a more dramatic improvement over MLE.
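The two least-squares lines can be seen concretely in the following sketch (our own illustration; the two sets of θ values are arbitrary). It fits both regressions – θ on X and X on θ – to simulated measurements, once with the true means close together and once with them spread far apart.

```python
import numpy as np

rng = np.random.default_rng(2)

def two_regression_slopes(thetas, reps=200):
    """Fit theta-on-X and X-on-theta by least squares over many simulated measurements."""
    th = np.tile(thetas, reps)
    x = th + rng.normal(0.0, 1.0, size=th.size)    # X_i = theta_i + Gaussian error
    slope_theta_on_x = np.polyfit(x, th, 1)[0]      # the line shrinkage estimators try to approximate
    slope_x_on_theta = np.polyfit(th, x, 1)[0]      # the line MLE approximates (slope about 1)
    return slope_theta_on_x, slope_x_on_theta

for thetas in (np.array([0.9, 1.0, 1.1]), np.array([-10.0, 0.0, 10.0])):
    a, b = two_regression_slopes(thetas)
    print(f"true means {thetas}:  theta-on-X slope ≈ {a:.2f},  X-on-theta slope ≈ {b:.2f}")
```

When the true means are nearly equal, the θ-on-X line is almost flat (pull nearly all the way toward the common value), while the X-on-θ line has slope near 1; when the means are far apart, the two lines nearly coincide, which is why shrinkage then offers only a small improvement.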
Second, suppose there are just two parameters (θ1 and θ2) that you want to estimate. In that circumstance, the two least squares lines collapse into one, as shown in Figure 5. Since the ML line is the best approximation of the X = αθ + β least squares line, and since this line is necessarily identical to the θ = aX + b least squares line, it follows that the ML line must also be the best possible approximation of the θ = aX + b least squares line. In other words, the ML estimator is admissible by coincidence (literally) when you try to estimate fewer than three parameters.

Figure 5

We hope this connection of shrinkage estimators to the two regression problems – fitting θ to X and fitting X to θ – is instructive. However, it is no substitute for the detailed mathematics of James and Stein (1961) and of Efron and Morris (1973), where a particular shrinkage estimator is shown to have lower expected mean squared error than straight MLE. Our point here is to show why MLE is suboptimal and how constraining three or more estimates to be connected to each other can provide an improvement. The assumption that X and θ are related to each other by a straight line is such a constraint. This is not to say that shrinkage estimators assume linearity. Rather, the point is that shrinkage towards some single value is a constraint that links distinct estimation problems to each other; by doing so, shrinkage provides an improvement over straight MLE. In Section 9, we provide a different perspective on how introducing a common constraint can improve estimation.

5. Why Care About Admissibility?

So far we've sometimes been a bit careless. We've said that Stein's result asserts that shrinkage estimators are "better" than MLE when three or more parameters are estimated. But, of course, Stein's result is a mathematical theorem and doesn't have any implications regarding what's better or worse unless additional assumptions are added. In particular, Stein's result says that there are shrinkage estimators that dominate MLE; this result obviously doesn't allow you to conclude that shrinkage estimators are "better" than MLE unless you add the following normative premise:

(D) If estimator E dominates estimator E′, then E is better than E′.

Not everyone will accept (D). For example, Bayesians might reject (D) because the more natural Bayesian approach to estimation is to choose the estimator that maximizes expected utility, and it's not clear whether (D) is consistent with such an approach. We postpone a more thorough discussion of the relation between Stein's paradox and Bayesianism until Section 8.

Some frequentists will also want to reject (D). For example, Spanos (2016) argues that the criterion of admissibility and the use of loss functions go against frequentism, and he consequently (implicitly) rejects (D). Spanos's main objection to (D) is that it will sometimes judge a biased or inconsistent estimator to be better than an unbiased or consistent estimator. In particular, as we shall see in Section 9, all shrinkage estimators are biased. Hence, according to (D), shrinkage estimators are better than ML estimators even though shrinkage estimators are biased.
According to Spanos (2016, p. 15), any biased or inconsistent estimator should be immediately disqualified, because the "primary objective" of estimation is to pinpoint the true value of θ, and using a biased or inconsistent estimator means abandoning this goal from the outset.

Of course, unbiased and consistent estimators can also fail to pinpoint the true value of θ, given the data at hand, but they will tend to converge on the true value as the data set is enlarged. This asymptotic property is nice for an estimator to have, but is it necessary? The answer is far from clear since the estimation problems we face always involve finite data. And for finite data, no estimator is guaranteed or even likely to get you the true value. We think it is reasonable to prefer one estimator over another when the first has a lower expected error than the second, even if the first estimator is biased while the second one isn't. Of course, this requires that "error" be quantified in terms of some loss function. Spanos asks where the extra "information" is supposed to come from (p. 17) that allows you to pick a particular loss function. It may be true that the choice of a particular loss function is to some extent arbitrary, and that several loss functions may reasonably be said to quantify inaccuracy. However, the fact that the choice of a particular loss function is to some extent arbitrary does not mean that you should give up on loss functions altogether.

Estimators can be compared along many dimensions: they can be biased or unbiased, they can be consistent or inconsistent, they can be variant or invariant, they can be admissible or inadmissible, etc. One of the lessons of Stein's result is that no estimator will be superior to all other estimators along all the dimensions that you might care about. Thus, ultimately, whether one estimator is "better" than another comes down to which properties you care about and how you weigh them against each other. (D) is therefore not an assumption that is unconditionally true; rather, it is a premise that it is reasonable to accept given that you have certain interests. In particular, if you care about minimizing your error given finite data – as opposed to making sure you pinpoint the true value of θ in the infinite long run limit – then (D) is a reasonable premise for you to accept. But, of course, accepting (D) leads you to prefer shrinkage estimators over MLE and biased estimators over unbiased ones.

6. Choosing an Estimator

If your goal is to minimize expected squared error, which estimator should you use? One possible answer is that if you know that estimator E1 weakly dominates estimator E2, you should not use E2. We endorse this answer, with one small qualification that we'll note at the end of this section.[13] How can this negative advice be supplemented with something positive?

[Footnote 13] This is not quite to say that inadmissibility suffices for refusing to use an estimator. You may know that E is inadmissible, but not know the identity of an estimator that weakly dominates E. This may lead you to think that E is better than nothing. We take no stand on whether you should use E or simply refuse to make an estimate.

We begin with two cautions. The first is that admissibility isn't sufficient for using an estimator. Indeed, in any realistic estimation problem there are infinitely many admissible estimators. For example, suppose you are estimating a single parameter and your estimation method is to just guess that θ=0 regardless of what your data are. If θ=0, your estimator has zero risk. Hence, always guessing that θ=0 is an estimator that isn't weakly dominated by any other estimator that has positive risk when θ=0, and any estimator that is not constant is going to have positive risk. More generally, guessing that θ=c, regardless of the data, for any constant number c, will be an admissible estimator.[14] Inadmissibility is bad but admissibility isn't anything special.

[Footnote 14] The constant estimator E(X)=c has risk 0 for θ=c. Any estimator that weakly dominates E(X)=c must also have risk 0 for θ=c. But such an estimator will therefore have a variance of 0, which means (given any reasonable error distribution) that the estimator doesn't vary given different data, and hence it must also be a constant estimator. Thus, any estimator that dominates E(X)=c must itself be a constant estimator, but a constant estimator with E(X)=d≠c can't dominate E(X)=c. So all constant estimators of the form E(X)=c are admissible.
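A quick sketch of the point about constant estimators (our illustration; the sample size and grid of θ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

def risk(estimator, theta, n=25, reps=20000):
    samples = rng.normal(theta, 1.0, size=(reps, n))
    return np.mean((estimator(samples) - theta) ** 2)

constant_zero = lambda s: np.zeros(s.shape[0])     # always guess theta = 0
sample_mean = lambda s: s.mean(axis=1)             # the ML estimator

for theta in [0.0, 0.2, 1.0]:
    print(f"theta = {theta}: risk of 'always guess 0' ≈ {risk(constant_zero, theta):.3f}, "
          f"risk of MLE ≈ {risk(sample_mean, theta):.3f}")
```

Nothing can match the constant estimator's zero risk at θ = 0, so it cannot be dominated; yet it is hopeless almost everywhere else, which is the sense in which admissibility "isn't anything special."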
The second caution is that admissibility isn't necessary for using an estimator. Perhaps you are in a situation of ignorance. You know that estimator E weakly dominates all the other estimators you have considered, but you don't know whether there exists an as yet unknown estimator that dominates E. In this case you are entitled to use E. That entitlement may lapse if you learn more.[15]

[Footnote 15] Suppose estimators E1 and E2 each weakly dominate the others you have considered, but neither dominates the other. You may have reason to prefer one over the other if you have reason to believe that some values of θ are more probable than others. Consider the relation of the mean and half-the-mean in Figure 1.

It turns out that while MLE is dominated by the James-Stein estimator when three or more parameters are being estimated, James-Stein is itself dominated by other estimators (Baranchik 1964). The same point holds of the Efron-Morris estimator; it isn't admissible, either (Brown 1971). Are there any shrinkage estimators that are admissible and that dominate MLE? The answer (for k≥5) is yes; there are several (Strawderman 1971). The implication of what we've just said would therefore seem to be that you should never use either the James-Stein or the Efron-Morris estimator, since they are both dominated by other known estimators. However, in practice, statisticians care about computational tractability as well as about minimizing global inaccuracy. Since James-Stein and Efron-Morris are both very simple estimators, and since they have risk functions that are numerically close to those of the known estimators that dominate them (Larry Brown, personal communication), you arguably are justified in using James-Stein or Efron-Morris even though these estimators are known to be inadmissible.

7. Holistic Pragmatism

Many epistemologists are evidentialists – they think that what you believe (or your degree of belief) should be guided by your evidence and by your evidence alone. Evidentialism has its dissenters. Carnap (1950), for example, argues that some propositions can be accepted because they represent convenient conventions even though there is no evidence that they are true. Pascal (1662) anticipated this pragmatic turn by arguing that belief in God should be influenced by the positive utility of going to heaven and the negative utility of going to hell.
So did James (1896), except that for him the question need not involve the existence of God or the afterlife. Another departure from strict evidentialism may be found in Rudner's (1953) argument that "the scientist qua scientist makes value judgments." Rudner maintains that science is in the business of accepting and rejecting hypotheses and that your standards concerning how much evidence is required for you to accept or reject should depend on the ethical consequences of error.

None of these pragmatisms covers what Stein discovered. We call this Steinian pragmatism Holistic Pragmatism (not to be confused with Morton White's (2005) pragmatism of the same name). Holistic pragmatism is the thesis that when an estimation problem has several parts, it's a pragmatic decision whether your goal is to minimize error across the whole problem, or to minimize error within each part. Of course, considered separately, MLE and shrinkage estimation are each "evidentialist" in the sense that ML estimators and shrinkage estimators are functions of the evidence and only of the evidence. The reason we think Stein's result licenses a kind of pragmatism is that the result tells you that your goals are relevant to deciding whether you should use MLE or a shrinkage estimator.

Compare Ms. Multi-Tasker and Mr. One-at-a-Timer. Ms. Multi-Tasker takes up three estimation problems and wants to minimize her expected sum of squared errors across the three. Mr. One-at-a-Timer takes up the same three problems, and uses the same evidence and background information that Ms. Multi-Tasker has at hand, but he cares about each problem for its own sake, wanting to minimize his expected squared error on each. According to Stein, they should reach different estimates, with Ms. Multi-Tasker shrinking and Mr. One-at-a-Timer doing straight MLE.

To an evidentialist, whether estimator E is a good estimator of θ given data D is purely a question of the relation between E, θ and D. Estimation is in other words a three-place relation. However, according to holistic pragmatism, whether your goal is to maximize local or global accuracy also matters. Thus, on holistic pragmatism, estimation involves a four-place relation between E, θ, D, and the goals of the agent, G.

It is utterly familiar that rational action requires assumptions about utilities. It is controversial that rational belief must involve such assumptions.[16] Some of the utilities involved in deciding whether to use a shrinkage estimator are epistemic – the goal is to have one's estimates be close to the truth (where this is quantified by using the expected sum of the squared errors). Here we see a departure from Pascal (and from James). But an additional type of utility is relevant to estimation: should you care about estimation problems separately or should you seek to minimize the sum of squared errors that arises in the lot? That is, should you be a lumper or a splitter in your conception of the estimation problems you face? The surprise is that answering this question matters. In Section 11, we investigate whether this question has an objective answer.

[Footnote 16] For example, see the reply to Rudner (1953) by Jeffrey (1956).

8. The relation of shrinkage estimation to Bayesianism

So far our discussion has been in a frequentist framework. However, there is another prominent statistical framework, decision-theoretic Bayesianism, which conceives of estimation in a very different way.
According to this framework, an agent should choose the estimate that has the lowest expected loss, where expected loss is calculated relative to the agent's posterior probability distribution.[17] Are shrinkage estimation and the holistic pragmatism it underwrites compatible with this Bayesian understanding of estimation? Our answer to this question is that it all depends on which type of Bayesianism is at issue.

[Footnote 17] More precisely, the expected loss of estimate e relative to distribution p(θ | D) and loss function L is given by the formula ∑_θ p(θ | D) L(θ, e), where the sum is over all possible values of the parameter θ.

The most minimal type of Bayesianism says that all it takes for an agent's probability distribution to be rational is synchronic coherence. A rational agent can have any probability distribution at any given time, so long as it obeys the axioms of probability. There is no further constraint on how prior probability values are assigned to different propositions, nor is there any diachronic constraint on how the agent's distributions at different times must be related. Minimal Bayesians are therefore free to embrace shrinkage, since they can often engineer their probability distribution in such a way that a shrinkage estimator will end up having the lowest expected loss.

There are two stronger forms of Bayesianism that we think conflict with shrinkage estimation. Both embrace synchronic coherence but insist that rationality demands something more. The first interprets probabilities as rational degrees of belief and insists that prior probabilities should, in some sense, accurately and reasonably reflect agents' background knowledge or their "initial state" of information. The second type of Bayesianism updates probabilities by strict conditionalization or by Jeffrey (1983) conditionalization. Of course, these two types of Bayesianism are not mutually incompatible, and flesh and blood Bayesians often sign up under both banners.

To see the difficulties posed for Bayesians of the first kind, note that, if probabilities represent the degrees of belief of an agent, then an agent should never lump together estimation problems that the agent thinks are completely unrelated and independent of each other, even if the goal is to maximize global accuracy. This is because, for the kind of Bayesian we now are discussing, the estimate e′ of the parameter θ′ can affect the estimate e of a different parameter θ only if there is some possible data x such that p(θ | θ′ & x) ≠ p(θ | x) – i.e., only if it is possible that θ′ could provide information about θ over and above the information provided by the data itself.[18]

[Footnote 18] Note that this is a purely synchronic constraint on the conditional probability distribution.

It may be helpful to demonstrate why this is the case, so we illustrate it for the case of just two parameters, although it holds in general. Suppose your goal is to maximize global accuracy over θ and θ′ in a Bayesian framework. Hence, you want to choose the estimates e of θ and e′ of θ′, given data D, that jointly minimize the posterior expected total loss, where this total loss is the sum of the loss on each of the parameters. In other words, you want to pick the e and e′ that jointly minimize:

∑_{θ, θ′} p(θ, θ′ | D) [L(θ, e) + L′(θ′, e′)]

But if p(θ | θ′ & x) = p(θ | x) for all x, then simple algebra and applications of the probability axioms show that the above formula reduces to:

∑_θ p(θ | D) L(θ, e) + ∑_{θ′} p(θ′ | D) L′(θ′, e′)

Since these two summands have no common variables, minimizing their sum just reduces to the problem of minimizing each of the summands. Hence, if θ and θ′ are thought by the Bayesian agent to be completely informationally independent in the sense that p(θ | θ′ & x) = p(θ | x) for all x, then the problem of minimizing global accuracy just reduces to the problem of minimizing inaccuracy over each of the parameters, which means that there is no gain in lumping together estimation problems that are thought by the agent to be unrelated and independent, even if the goal is maximal global accuracy.
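A small numerical check of this separation argument (our own toy example, with arbitrarily chosen discrete probabilities, squared loss, and a coarse grid of candidate estimates):

```python
import itertools
import numpy as np

# A toy factorized posterior over two parameters, each taking the values 0, 1, 2.
values = np.array([0.0, 1.0, 2.0])
p_theta = np.array([0.2, 0.5, 0.3])        # p(theta | D)
p_theta_prime = np.array([0.6, 0.3, 0.1])  # p(theta' | D)
joint = np.outer(p_theta, p_theta_prime)   # independence: p(theta, theta' | D)

loss = lambda true, est: (true - est) ** 2  # squared loss for both parameters
candidates = np.linspace(0.0, 2.0, 201)     # candidate estimates to search over

def expected_total_loss(e, e_prime):
    """Posterior expectation of L(theta, e) + L'(theta', e')."""
    return sum(joint[i, j] * (loss(values[i], e) + loss(values[j], e_prime))
               for i in range(3) for j in range(3))

best_joint = min(itertools.product(candidates, repeat=2),
                 key=lambda pair: expected_total_loss(*pair))

best_e = min(candidates, key=lambda e: np.dot(p_theta, loss(values, e)))
best_e_prime = min(candidates, key=lambda e: np.dot(p_theta_prime, loss(values, e)))

print("joint minimizer:    ", tuple(round(v, 2) for v in best_joint))
print("separate minimizers:", (round(best_e, 2), round(best_e_prime, 2)))
```

Both searches land on the same pair of estimates (the two posterior means), which is the point of the reduction above: for an agent whose posterior factorizes, lumping buys nothing.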
On the other hand, as we have argued in this paper, the holistic pragmatism underwritten by Stein's result sometimes recommends that agents who want to maximize global accuracy lump together unrelated and independent estimation problems. Hence, holistic pragmatism is clearly in conflict with a decision-theoretic Bayesianism that interprets prior and posterior probabilities as representing the rational degrees of belief of some agent. Bayesians who want to embrace shrinkage estimation must consequently reject the view that the probabilities they use always represent their rational degrees of belief. This is not to deny that Bayesians can mathematically accommodate shrinkage estimators by, for example, using so-called "shrinkage priors," which are priors that are explicitly designed to induce Steinian shrinkage, as described in Efron (2011, pp. 2-6), Efron and Morris (1973), or Lehmann (1983, p. 299). But such "shrinkage priors" cannot reasonably be interpreted as representing the rational degrees of belief of some agent.

We have heard it said that Bayesians automatically incorporate shrinkage and that Bayesians therefore don't need to worry about artificially inducing shrinkage in order to maximize global accuracy. However, this is not correct. It is true that all proper priors – that is, priors that sum to 1 – impose a kind of shrinkage. In particular, any proper prior will "shrink" the estimate of a parameter towards values of the parameter that have a high prior probability. However, Stein shrinkage only happens when multiple parameters are shrunk towards a common point. The only way for a Bayesian to induce this kind of shrinkage is by imposing a common proper prior over all the parameters that are being estimated that has the effect of shrinking all the parameters towards the same point. If priors are interpreted as degrees of belief, however, then any such prior will represent a prior belief that the parameters are actually somewhat close together. Of course, sometimes such a prior belief is warranted, but our point here is that holistic pragmatism says that if your goal is to maximize global accuracy, you should always shrink the parameters towards a common point, regardless of whether you actually believe the parameters are close together. Now, as we saw in Section 4, the practical benefit of shrinkage will not be large unless the parameters are actually close to each other, but this does not alter the fact that the expected benefit of shrinkage is always greater than 0, provided that global accuracy is the goal.
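Here is a sketch of the kind of common prior just described, under textbook normal-normal assumptions: with a shared normal prior centered at prior_mean, each posterior mean is a precision-weighted average of the observation and prior_mean, so every estimate is pulled toward the same point. All of the numbers below are illustrative choices of ours.

```python
import numpy as np

def posterior_means_under_common_prior(x, sigma2=1.0, prior_mean=0.0, prior_var=0.5):
    """Normal likelihood N(theta_i, sigma2) and a common normal prior N(prior_mean, prior_var).
    Each posterior mean is prior_mean + w * (x_i - prior_mean), with w = prior_var / (prior_var + sigma2),
    so all the estimates are shrunk toward the one shared prior mean."""
    w = prior_var / (prior_var + sigma2)
    return prior_mean + w * (x - prior_mean)

x = np.array([1.8, -0.3, 0.9, 2.5])               # hypothetical observations
print("ML estimates:   ", x)
print("posterior means:", posterior_means_under_common_prior(x))
```

Every estimate is shrunk toward prior_mean by the same factor; read as a degree of belief, such a prior says the parameters are probably near that point, which is exactly the interpretive cost described above.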
Note, however, that if local accuracy rather than global accuracy is your goal, then Stein's result does not give Bayesian agents any reason for imposing a common prior over parameters that they think are independent. Thus, whether you should impose a common prior over the parameters you are estimating or keep the parameters separate depends on your goals.

This observation brings us to the tension that we find with the second type of Bayesianism, in which there is a commitment to updating by strict conditionalization or by Jeffrey conditionalization. The impediment here is that Stein's results show that you can and should change your estimates simply because your goals change. To see the problem, consider Ms. Multi-Tasker and Mr. One-at-a-Timer as two stages in a single Jekyll-and-Hyde personality – namely you. On Monday you want to reduce your risk in estimating each of three parameters, while on Tuesday you want to reduce your total risk. MLE is admissible on Monday but not on Tuesday. Your estimates change value but not because of any new evidence you acquired. Updating by conditionalization is a way to take new evidence into account. It does not allow your estimates to change merely because you have changed your goals.

Of course, the Bayesian framework does allow you to change your estimate if your loss function changes, even if you do not gain any new evidence. But that is not what's going on in the above example. Even if your loss function remains the same throughout the whole week (e.g. you are using squared loss), Stein's result gives you a reason for using one estimate on Monday and another estimate on Tuesday. But if your loss function stays the same and you don't gain any new evidence, then the only way you can change your estimate in the Bayesian framework is by changing your probability distribution in a way that violates conditionalization. That is why shrinkage estimation and holistic pragmatism are in tension with the second type of Bayesianism.

We think the most plausible Bayesian response to Stein's results is to either reject them outright or to adopt an instrumentalist view of personal probabilities. The instrumentalist response is to abandon the idea that probabilities are rational degrees of belief. Rather, they are sometimes irrational (or arational), but agents should adopt them anyhow because they help agents get what they want. The former option, of outright rejection, has several motivations. The most obvious motive is perhaps that Stein's result is fundamentally a frequentist result, since it is couched in terms of expected inaccuracy over all possible data sets. Bayesians might insist that Bayesians should not care about frequentist risk, but we know of few Bayesians who hold this view.[19]

[Footnote 19] Angers and Berger (1985, p. 5) emphasize that frequentist risk should be important even to pure Bayesians because frequentist risk gives an indication of the average posterior expected loss.

Even if Bayesians grant that frequentist risk is relevant, they still might insist that you shouldn't use frequentist risk, by itself, to decide which estimator to use. A more fully Bayesian decision-theoretic solution would require that you calculate the expected utility of choosing each candidate estimator by averaging the frequentist risk of the various candidate estimators over a prior probability distribution. An estimator that maximizes expected utility relative to a particular prior is known as a "Bayes rule" relative to that prior. As it happens, MLE and shrinkage estimation both are Bayes rules relative to the (improper) prior that is flat over all the parameters you are estimating.
In other words, relative to the flat prior, MLE and shrinkage estimation have the same expected utility. Consequently, if you come to the table with a flat prior in hand, you apparently have no reason (from a Bayesian point of view) for preferring shrinkage estimation to MLE.[20]

[Footnote 20] We thank Teddy Seidenfeld for pressing us on this point. We note that this point only makes sense given certain presuppositions. In particular, countable additivity must be discarded.

We grant that an expected utility calculation will not tell you that shrinkage is preferable to MLE, if you use a flat prior in calculating expected utility. However, it's also true that such a calculation will not tell you that MLE is preferable to shrinkage, given that MLE and shrinkage have the same expected utility. If you want a non-arbitrary way of picking an estimator, you therefore need a tie-breaking criterion aside from expected utility. A reasonable tie-breaker, we think, is to pick the estimator that has lower frequentist risk.

The above discussion assumes that you have already adopted a flat prior. But what if you are trying to decide what prior to adopt in the first place? An expected utility calculation cannot tell you what prior to adopt, since expected utility must be calculated relative to a prior. If you are unsure of what prior to adopt, but your goal is to minimize global error, then Stein's result gives you a reason for preferring a prior that imposes shrinkage to a prior that does not.

Some Bayesians argue that shrinkage estimators should not be used in the absence of genuine prior information, and in particular that "shrinkage priors" of the sort described in this section should never be used. For example, Angers and Berger (1985) show that under certain (strong) assumptions, Bayesians who are certain or nearly certain that several estimation problems are probabilistically independent should not combine the problems and impose shrinkage, because it is possible to obtain more robust estimates (i.e., estimates that are less sensitive to the choice of prior) by considering the problems separately.[21]

[Footnote 21] Perlman and Chaudhuri (2012) offer a different argument for a similar conclusion. They claim, without offering any explanation, that agents who use shrinkage estimation in the absence of prior information will unwittingly end up using a procedure that has the effect of reversing the Stein effect (see footnote 24 for a description of the procedure).

We concede that shrinkage priors do not provide you with robust individual estimates under the assumptions that Angers and Berger describe. Indeed, as we discuss further in the next section, what point you shrink towards (or what shrinkage prior you choose to adopt) can greatly affect which of your individual estimates end up being more accurate and which ones end up being less accurate when compared to estimates provided by MLE. Nonetheless, if your main concern is to maximize global accuracy – as opposed to making sure that you have robust estimates – then Stein's result shows that you ought to shrink.

There are two remaining and perhaps more fundamental reasons why Bayesians might reject the use of shrinkage estimation.
One reason is that shrinkage estimation is a type of estimation, and some Bayesians have claimed that estimation is simply not that interesting from a Bayesian perspective. Another fundamental reason that Bayesians might want to reject the use of shrinkage estimation is that they value language invariance, and shrinkage estimators are not language invariant; we discussed language invariance in Section 3.

9. Shrinkage Estimation, AIC, and the Bias/Variance Tradeoff

Complex models (ones with more adjustable parameters) will in general exhibit less estimation error or bias than simpler models (ones with fewer parameters), because the extra flexibility that accompanies complexity enables complex models to accurately fit real patterns in the data.[22] For example, an nth degree polynomial can fit a data set containing n+1 observations perfectly, whereas polynomials with fewer adjustable parameters will almost never do as well. Unfortunately, complexity comes at a cost; more complex models are also more likely than simpler ones to be misled by noise. Hence, more complex models have greater approximation error or variance, since a complex model is likely to "bounce around" quite a bit when it is fitted to different data sets drawn from the same underlying distribution.

[Footnote 22] In speaking of the "bias" and "variance" of models and estimators, we are following the (perhaps unfortunate) statistical practice of using these terms ambiguously. In the context of parameter estimation, "bias" and "variance" have precise technical meanings, as we also note in the text. In model selection, on the other hand, "bias" just means something like "the inability of a model to mimic the true curve, whatever the true curve happens to be." For example, LIN, the family of linear models, is "biased" in this latter sense because it can adequately mimic the true curve only if the true curve happens to be roughly linear.

The Akaike Information Criterion (AIC) gives advice concerning how bias should be traded off against variance when the goal is to maximize predictive accuracy. According to AIC, the predictive inaccuracy of model M given data D can be estimated by the number of parameters in M minus the log-likelihood of the best-fitting member of M. AIC tells you that the model that has the lowest AIC score will be the one that is most predictively accurate; this is the model that achieves the best balance between fit and complexity, or, in other words, between bias and variance. The reason scientists often should prefer simple models that they know are false (i.e., ones that are biased away from the truth) over more complex models that they know to be true is that simpler models have lower variance and hence often have a higher degree of predictive accuracy (see Forster and Sober 1994).

Shrinkage estimators rely on a similar kind of bias-variance trade-off. However, in the context of parameter estimation, the terms "bias" and "variance" take on technical meanings that differ from the meanings of those terms in model selection. For example, an unbiased estimator is one whose expected value equals the true value of the estimated parameter. Using these technical concepts makes it possible to decompose the error of an estimator in a precise way. In particular, if you look at a single estimator, m, of some quantity θ, you can decompose its mean squared error in the following way (see e.g., Wasserman 2004, p. 91):

MSE(m) = [bias(m)]² + variance(m).
If m is an ML estimator, then [bias(m)]² = 0 when the distribution of the data is normal, since MLE is unbiased in this case. Hence, in this case, the MSE of m simply reduces to the variance of m. If we independently estimate several parameters θ1, θ2, …, θn and obtain m1, m2, …, mn as their ML estimates, then the total MSE of all the estimates is simply the sum of all their variances:

MSE(m1, m2, …, mn) = variance(m1) + variance(m2) + … + variance(mn).

Since using a shrinkage estimator yields a lower total MSE than MLE, and since shrinkage estimators are clearly biased (as noted, the ML estimator is unbiased, and shrinkage means shrinking your estimates away from the ML estimates; note that we are here using the terms "biased" and "unbiased" in the technical sense), it is clear that shrinkage estimators work because they lead to a reduction in overall variance. By shrinking all the ML estimates toward some common point, you bias your estimate of each parameter, but at the same time you reduce the freedom of each estimate to vary in response to noise; by "tethering" all the estimates to a single point, shrinkage thereby reduces the overall variance in your estimates. Just as AIC sometimes prefers simpler false models because they have lower variance, so shrinkage estimators sacrifice the unbiasedness of MLE for the sake of obtaining a lower variance.

Trading off bias against variance arises in many statistical contexts[23] – in model selection, in classification problems (see, e.g., von Luxburg and Scholkopf 2009, pp. 662-664), and – as we have seen – in parameter estimation. What is especially striking about shrinkage estimators is that there are several conflicting ways of sacrificing bias to achieve a reduction in variance; recall that the James-Stein estimator shrinks all estimates towards zero while the Efron-Morris estimator shrinks them towards the grand sample mean. However, whether you introduce this bias or that bias is less important than the fact that you introduce some bias or other.[24] What bias you introduce (i.e. what the point is towards which you shrink) will have consequences for which individual estimates end up being less accurate than they might have been had you used MLE, but overall the reduction in variance justifies sacrificing accuracy at the level of estimating individual parameters.

[Footnote 23] An anonymous reviewer pointed out to us that similar trade-offs also arguably occur outside of statistics, e.g. in the "Runge phenomenon" in numerical analysis.
[Footnote 24] It is worth noting that how you choose to bias your estimates is important. Perlman and Chaudhuri (2012) show that there are procedures for picking the point towards which you shrink that lead to a "reverse" Stein effect wherein the resulting shrinkage estimator does worse than MLE in expectation. In particular, the point you shrink towards needs to be relatively stable given different data sets; otherwise, your shrinkage estimator is not going to reduce total variance. Here's an example: given data (x1, x2, …, xn) about parameters (θ1, θ2, …, θn), consider (x1, x2, …, xn) as the center of a sphere in n-dimensional space and then randomly pick some point within the sphere towards which you shrink your data. This procedure gives you an estimator that bounces around given different data sets, and that therefore doesn't help you reduce overall variance.
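To make the decomposition concrete, here is a simulation sketch (with arbitrary true means of our own choosing) that estimates squared bias and variance per parameter for straight MLE and for an Efron-Morris-style estimator that shrinks toward the grand sample mean with a fixed weight – a simplification of the real estimator, which chooses the amount of shrinkage from the data.

```python
import numpy as np

rng = np.random.default_rng(4)
true_means = np.array([0.2, 0.5, 0.8, 1.1, 1.4])   # arbitrary, fairly close together
reps = 200_000

x = rng.normal(true_means, 1.0, size=(reps, len(true_means)))     # MLE = the raw observations
grand = x.mean(axis=1, keepdims=True)
shrunk = grand + 0.5 * (x - grand)        # fixed-weight shrinkage toward the grand sample mean

for name, est in [("MLE", x), ("shrinkage", shrunk)]:
    bias_sq = (est.mean(axis=0) - true_means) ** 2
    variance = est.var(axis=0)
    print(f"{name:9s} total squared bias ≈ {bias_sq.sum():.3f}, "
          f"total variance ≈ {variance.sum():.3f}, total MSE ≈ {(bias_sq + variance).sum():.3f}")
```

Shrinkage buys a small amount of squared bias in exchange for a much larger drop in variance, which is where its lower total MSE comes from; with a fixed weight like this one, of course, the trade can go the other way if the true means are far apart.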
10. Shrinkage, Realism, and Instrumentalism

Model selection criteria like AIC legitimate a kind of instrumentalism (Sober 2008). According to this interpretation, the goal of AIC is not to decide which model is true, or has the highest probability of being true, but rather to determine which model (among the candidate models considered) will make the most accurate predictions. As noted, a surprising property of AIC is that it sometimes (correctly) judges that a model known to be false will be more predictively accurate than a model known to be true.

Should shrinkage estimators be placed under the same instrumentalist umbrella? According to this interpretation, the goal is not to discover the true value of a parameter, but to find the estimate that will most accurately predict new data. The surprise is that achieving this goal depends on whether you want to predict new observations of one or two parameters, or of three or more. And if you want to predict three or more, a second surprise presents itself; it matters whether you want to reduce total expected error or rather want to minimize expected error on each parameter.

Although this instrumentalist gloss makes sense, shrinkage estimation is something that realists also need to take on board.[25] The realist view of estimation is that the goal is to get as close as possible to the true value of the quantity being estimated. Shrinkage estimators achieve that goal better than straight MLE when there are three or more parameters being estimated and the goal is to minimize total inaccuracy. However, some realists will be lumpers while others will be splitters, and so they will disagree about how Stein's result should bear on their scientific practice.

[Footnote 25] The same point holds for AIC – realists can embrace this estimator so long as closeness to the truth is understood in the right way (Sober 2015).

11. When to lump and when to split?

Can the question of when to lump and when to split be settled in an objective way? Lumping the success rates of baseball players is a good idea if the goal is to minimize global inaccuracy. However, as we emphasized earlier, Stein's result does not guarantee that shrinkage estimation will do better than MLE when you treat each baseball player as a separate estimation problem. As noted, MLE is admissible when you estimate the value of a single parameter. This does not mean that MLE dominates shrinkage estimators when k<3; it means only that no shrinkage estimator dominates MLE. Furthermore, the fact that no shrinkage estimator dominates MLE in single-parameter estimation just means that there is no shrinkage estimator that does at least as well as MLE across all possible values that the parameter might have; this leaves open that there may be estimators that do significantly better than MLE when the parameter has some particular true value. Indeed, if you return to Figure 1, you'll notice that dividing the sample mean in half actually has lower risk than the sample mean itself when the true value of the parameter is sufficiently close to 0. "Admissible" does not mean "optimal in all cases."

We now change our focus to a question that the concept of admissibility cannot answer: should you be a lumper or a splitter in the way you formulate your estimation problems? In general, there are two things that can happen when you lump three or more estimation problems and use shrinkage, rather than split them and use MLE:

(1) The risk of some of the individual estimates increases substantially.
We now change our focus to a question that the concept of admissibility cannot answer: should you be a lumper or a splitter in the way you formulate your estimation problems? In general, there are two things that can happen when you lump three or more estimation problems and use shrinkage, rather than split them and use MLE:

(1) The risk of some of the individual estimates increases substantially.

(2) The risk of each individual estimate remains about the same or decreases.26

If (1) holds, you face a dilemma; shrinkage estimation will decrease your global risk, but only at the price of increasing the risk of some of the individual estimates. Should you trade an increase in risk at the individual level for a reduction in risk at the global level? The answer to this question depends on what your goals are and what kind of error you prefer to avoid.27 On the other hand, if (2) holds, you objectively ought to use a shrinkage estimator; by shrinking, you’ll reduce global risk without increasing individual risk, or you’ll reduce both. In general, (2) holds if and only if the true means and variances of the different parameters you are estimating are sufficiently close to each other. To see why, recall that shrinkage estimators work by introducing a bias in order to reduce variance. If the true means are sufficiently close to each other, then the bias introduced by using a shrinkage estimator such as the Efron-Morris estimator will be smaller than the reduction in variance even at the level of individual estimates.

How do you know whether the quantities you are studying have true means and variances that are “sufficiently close” before you have the relevant data? Sometimes background knowledge can serve as a guide. Suppose, for example, that you wish to estimate the mean heights in all the northern European countries. Your background knowledge might lead you to believe that the true mean heights and variances for the different countries, whatever they are, are close together. If your belief is correct, each true population mean will be close to the true grand mean, which entails that the bias introduced by the Efron-Morris estimator will be small for each of the individual estimates you make. At the same time, combining your samples from several countries in the way that the Efron-Morris estimator does will reduce the total variance of your estimates, as we discussed in Section 9. Lumping therefore seems justified in this case; you have reason to believe that (2) is true. A simulation sketch illustrating the contrast between (1) and (2) is given below, after the footnotes.

26 It may seem that (2) contradicts the fact that MLE is admissible when you are estimating a single parameter. However, the fact that MLE is admissible just means that, given (normally distributed) data D and a single parameter p, there is no function of D that weakly dominates the ML estimator. This does not exclude the possibility that you can get a better estimate of p by lumping D with another data set D′ and using a shrinkage estimator on D&D′.

27 The measure of global inaccuracy that we have relied on so far in this paper implicitly places an equal weight on each of the individual estimation problems, since global inaccuracy is simply the unweighted sum of the inaccuracies in all the individual estimates. It may, however, happen that getting accurate estimates for some of the parameters is more important than getting accurate estimates for others. Brown (1975) models this by weighting each inaccuracy term (xi − θi)² by a factor ci that measures the importance of getting an accurate estimate of θi, and he shows that Stein’s result is surprisingly robust across different assignments of weights (although he does not propose explicit estimators corresponding to the different weightings of the local estimation problems). In a similar vein, Efron and Morris (1972) introduce “compromise estimators” that aim at lowering total inaccuracy while at the same time limiting the loss in accuracy at the level of individual estimates.
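Here is the promised sketch (ours, not the authors’). It estimates the per-parameter mean squared error of straight MLE and of a grand-mean shrinkage rule in two scenarios: true means that are close together, and true means that include one outlier. The shrinkage rule used, which multiplies each deviation from the grand sample mean by 1 − (k−3)σ²/∑(Xᵢ − X̄)², is a standard grand-mean variant of the James-Stein rule; we use it as a stand-in for the Efron-Morris estimator discussed in the text, and the particular means, variance, and trial count are assumptions made only for the example.

```python
# A sketch (ours): per-parameter mean squared error of MLE versus a grand-mean
# shrinkage rule, with true means close together versus one outlying true mean.
# The shrinkage factor 1 - (k-3)*sigma^2 / sum_i (X_i - Xbar)^2 is a standard
# grand-mean variant of the James-Stein rule, used here as an illustrative
# stand-in for the Efron-Morris estimator; all numbers are assumptions.
import numpy as np

def per_parameter_mse(true_means, sigma2=1.0, trials=100_000, seed=1):
    rng = np.random.default_rng(seed)
    theta = np.asarray(true_means, dtype=float)
    k = len(theta)
    x = rng.normal(theta, np.sqrt(sigma2), size=(trials, k))   # the k ML estimates, per trial
    xbar = x.mean(axis=1, keepdims=True)                       # grand sample mean, per trial
    c = 1.0 - (k - 3) * sigma2 / ((x - xbar) ** 2).sum(axis=1, keepdims=True)
    shrunk = xbar + c * (x - xbar)                             # shrink towards the grand mean
    return ((x - theta) ** 2).mean(axis=0), ((shrunk - theta) ** 2).mean(axis=0)

scenarios = {
    "true means close together (situation 2)":
        [0.0, 0.1, -0.1, 0.2, -0.2, 0.15, -0.15, 0.05, -0.05, 0.1],
    "one outlying true mean (situation 1)":
        [0.0, 0.1, -0.1, 0.2, -0.2, 0.15, -0.15, 0.05, -0.05, 4.0],
}
for label, means in scenarios.items():
    mse_mle, mse_shrink = per_parameter_mse(means)
    print(label)
    print("  per-parameter MSE, MLE      :", np.round(mse_mle, 2))
    print("  per-parameter MSE, shrinkage:", np.round(mse_shrink, 2))
```

With these illustrative numbers, the first scenario is the one the text describes as situation (2), where every individual estimate should benefit from shrinkage; in the second scenario the outlying parameter’s estimate should suffer even though the others improve, which is situation (1).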
Background knowledge would also seem to justify lumping together Major League baseball players. Although the players do vary in ability, there is reason to believe that their true abilities are sufficiently close that, by lumping the estimation problems together, you can expect to increase the accuracy of your estimate of almost every player. Of course, your estimate of players who are truly outstanding (i.e., whose true batting success probability is far from the grand mean) will suffer, but if you know who those players are, you can just exclude them from shrinkage.

These examples show how using background knowledge to lump together estimation problems can be expected to produce better individual estimates when the lumping is done right, but the examples offer no general guidelines for how to do this. According to Efron, “[s]cientific guidance would be most welcome at this point…” (2011, p. 185). Efron goes on to quote Miller’s (1981) pronouncement (in a slightly different context) that this is where “statistics takes leave of mathematics and must be guided by subjective judgment.” Efron explains that subjective judgment is still what dominates statistical practice, but goes on to offer “hints … of a more principled approach to separation and combination … but at this stage they remain just hints” (p. 186).

The upshot is that lumping together several estimation problems and using a shrinkage estimator is sometimes objectively better than keeping the problems separate and using MLE, in the sense that lumping can sometimes yield individual estimates that are more accurate than the estimates you would get by handling the problems separately. This situation arises when the populations under study are “similar enough.” But when and how problems should be lumped together or split apart remains an important open problem in statistics.28

28 There is a parallel situation for AIC. Consider NULL and DIFF as claims about how the mean heights in two (or more) human populations are related. NULL says they have the same mean height; DIFF says that the heights may differ. Since NULL says there is a single mean height that characterizes each population, it has a single adjustable parameter. DIFF has one adjustable parameter for each population. AIC tells you to prefer NULL if the sample means are close together and DIFF when they are very far apart, where the question “how close is close enough?” is answered by considering how the two models differ in their numbers of adjustable parameters. NULL lumps whereas DIFF splits. A worked version of this comparison is given below.
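Footnote 28’s question can be made precise in one simple special case (this worked comparison is ours, and the equal-sample-size, known-equal-variance setting is an added assumption). Suppose each of k populations is sampled n times, the within-population variance σ² is known and the same in every population, NULL fits a single common mean, and DIFF fits k separate means. Fitting the separate means then lowers −2 times the maximized log-likelihood by exactly ∑ᵢ n(x̄ᵢ − x̄)²/σ², while adding k − 1 adjustable parameters, each of which AIC charges 2 for. Hence

\[
\mathrm{AIC}(\text{NULL}) < \mathrm{AIC}(\text{DIFF})
\quad\Longleftrightarrow\quad
\sum_{i=1}^{k}\frac{n(\bar{x}_i-\bar{x})^2}{\sigma^2} \;<\; 2(k-1).
\]

On this reckoning, the sample means are “close enough” for NULL, the lumping model, to be preferred when their spread, measured in squared standard-error units, falls below twice the difference in the models’ numbers of adjustable parameters.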
12. Concluding Comment

Stein’s result seems paradoxical when estimating a quantity is taken to be closely connected to assigning a probability to a proposition.29 When propositions A, B, and C are probabilistically independent of each other, the probability of their conjunction is determined by the probabilities of the conjuncts:

If Pr(A)=x, Pr(B)=y, Pr(C)=z, and A, B, C are probabilistically independent of each other, then Pr(A&B&C) = xyz.

This is true by definition. Analogy might suggest that something similar is true of estimation. One might expect, when the quantities Q, R, and S are probabilistically independent of each other, that the following principle holds:

If BE(Q)=x, BE(R)=y, BE(S)=z, and Q, R, S are probabilistically independent of each other, then BE(<Q, R, S>) = <x, y, z>.

Here BE(θ) is the best estimate of θ, given the data at hand, and “<…>” denotes a vector of variables or values. This principle says that the best estimate for a vector of quantities is determined by the best estimate for each quantity when the quantities are probabilistically independent. The principle would be right if the best estimate of a quantity were the one that has the highest probability (density) of being true. This is an intuitive interpretation of estimation, but it is not the one that frequentists embrace. Their interpretation is built on the concept of admissibility. In this paper, we have attempted to describe the surprising form of holistic pragmatism that issues from that conception.30, 31

29 This is not to say that the puzzling quality of Stein’s result derives entirely from the assumption that estimating a quantity and assigning a probability to a proposition are closely connected projects.

30 In thinking about how assigning a probability to a proposition is related to estimating a quantity, it is important to note that an estimation problem must involve infinitely many possible values if shrinkage estimators are to have lower expected error than MLE (Guttmann 1982).

31 We thank Marty Barrett, Larry Brown, Jan-Willem Romeijn, Teddy Seidenfeld, Mike Steel, Reuben Stern, and the anonymous referees for very useful comments. This paper is dedicated to the memory of Charles Stein (1920-2016).

References

Angers, Jean-Francois and Berger, James O. (1985) “The Stein Effect and Bayesian Analysis: A Reexamination.” Technical Report #85-6. Department of Statistics, Purdue University.

Baranchik, Alvin J. (1964) “Multiple Regression and Estimation of the Mean of a Multivariate Normal Distribution.” Technical Report 51. Department of Statistics, Stanford University.

Blyth, Colin (1951) “On Minimax Statistical Decision Procedures and their Admissibility.” The Annals of Mathematical Statistics, Vol. 22 (1): 22-42.

Bock, Mary E. (1975) “Minimax Estimators of the Mean of a Multivariate Distribution.” Annals of Statistics, Vol. 3 (1): 209-218.

Brown, Lawrence D. (1966) “On the Admissibility of Invariant Estimators of One or More Location Parameters.” Annals of Mathematical Statistics, Vol. 37 (5): 1087-1136.

Brown, Lawrence D. (1971) “Admissible Estimators, Recurrent Diffusions, and Insoluble Boundary Value Problems.” The Annals of Mathematical Statistics, Vol. 42 (3): 855-903.

Brown, Lawrence D. (1975) “Estimation with Incompletely Specified Loss Functions (the Case of Several Location Parameters).” Journal of the American Statistical Association, Vol. 70 (350): 417-427.

Carnap, Rudolf (1950) “Empiricism, Semantics, and Ontology.” Revue Internationale de Philosophie, Vol. 4 (2): 20-40.

Edwards, A. W. F. (1974) “The History of Likelihood.” International Statistical Review, Vol. 42 (1): 9-15.

Efron, Bradley (2013) Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press.

Efron, Bradley and Morris, Carl (1972) “Limiting the Risk of Bayes and Empirical Bayes Estimators – Part II: The Empirical Bayes Case.” Journal of the American Statistical Association, Vol. 67 (337): 130-139.

Efron, Bradley and Morris, Carl (1973) “Stein’s Estimation Rule and Its Competitors – An Empirical Bayes Approach.” Journal of the American Statistical Association, Vol. 68 (341): 117-130.

Efron, Bradley and Morris, Carl (1977) “Stein’s Paradox in Statistics.” Scientific American, Vol. 236 (5): 119-127.
Forster, Malcolm and Sober, Elliott (1994) “How to Tell When Simpler, More Unified, or Less Ad Hoc Theories Will Provide More Accurate Predictions.” British Journal for the Philosophy of Science, Vol. 45: 1-36.

Galton, Francis (1888) “Co-relations and their Measurement, chiefly from Anthropometric Data.” Proceedings of the Royal Society of London, Vol. 45: 135-145.

Gauss, Carl F. (1823) “Theoria Combinationis Observationum Erroribus Minimis Obnoxiae: Pars Posterior.” Translated by G. W. Stewart (1995) as Theory of the Combination of Observations Least Subject to Error: Part One, Part Two, Supplement. Society for Industrial and Applied Mathematics.

Gruber, Marvin (1998) Improving Efficiency by Shrinkage: The James-Stein and Ridge Regression Estimators. CRC Press.

Guttmann, Sam (1982) “Stein’s Paradox is Impossible in Problems with Finite Sample Space.” Annals of Statistics, Vol. 10 (3): 1017-1020.

Hodges, Joseph and Lehmann, Erich (1951) “Some Applications of the Cramér-Rao Inequality.” Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley and Los Angeles: University of California Press, pp. 13-22.

James, Willard and Stein, Charles (1961) “Estimation with Quadratic Loss.” Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1: 361-379.

James, William (1896, 1979) “The Will to Believe.” In F. Burkhardt et al. (eds.), The Will to Believe and Other Essays in Popular Philosophy, Cambridge, MA: Harvard, pp. 291-341.

Jeffrey, Richard (1956) “Valuation and Acceptance of Scientific Hypotheses.” Philosophy of Science, Vol. 23 (3): 237-246.

Jeffrey, Richard (1983) The Logic of Decision. Second edition. Cambridge University Press, Cambridge.

von Luxburg, Ulrike and Schölkopf, Bernhard (2009) “Statistical Learning Theory: Models, Concepts, and Results.” In D. Gabbay, S. Hartmann, and J. Woods (eds.), Handbook of the History of Logic, Vol. 10: Inductive Logic.

Miller, Rupert G. Jr. (1981) Simultaneous Statistical Inference. Second edition. Springer.

Pascal, Blaise (1662) Pensées. Translated by W. Trotter, New York: J. M. Dent Co., 1958, fragments 233-241.

Perlman, Michael D. and Chaudhuri, Sanjay (2012) “Reversing the Stein Effect.” Statistical Science, Vol. 27 (1): 135-143.

Rudner, Richard (1953) “The Scientist Qua Scientist Makes Value Judgments.” Philosophy of Science, Vol. 20 (1): 1-6.

Sober, Elliott (2008) Evidence and Evolution – the Logic Behind the Science. Cambridge University Press.

Sober, Elliott (2015) Ockham’s Razors – A User’s Manual. Cambridge: Cambridge University Press.

Spanos, Aris (2016) “How the Decision Theoretic Perspective Misrepresents Frequentist Inference: ‘Nuts and Bolts’ vs Learning from Data.” Available at https://arxiv.org/pdf/1211.0638v3.pdf

Stein, Charles (1956) “Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution.” Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1: 197-206.

Stein, Charles (1962) “Confidence Sets for the Mean of a Multivariate Normal Distribution (with discussion).” Journal of the Royal Statistical Society, Series B (Methodological), Vol. 24 (2): 265-296.

Stigler, Stephen (1990) “The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators.” Statistical Science, Vol. 5 (1): 147-155.

Strawderman, William E. (1971) “Proper Bayes Minimax Estimators of the Multivariate Normal Mean.” Annals of Mathematical Statistics, Vol. 42 (1): 385-388.
Wasserman, Larry (2004) All of Statistics: A Concise Course in Statistical Inference. Springer.

White, Morton (2005) A Philosophy of Culture: The Scope of Holistic Pragmatism. Princeton University Press.