Instrumentalism, Parsimony, and the Akaike Framework

Elliott Sober†‡

University of Wisconsin and London School of Economics and Political Science

†Send requests for reprints to the author, Department of Philosophy, University of Wisconsin, Madison, WI 53706; e-mail: ersober@facstaff.wisc.edu.

‡My thanks to Martin Barrett, Kevin de Queiroz, Michael Donoghue, Branden Fitelson, Malcolm Forster, Dan Hausman, Paul Lewis, Richard Lewontin, and Mike Steel for useful discussion.

Philosophy of Science, 69 (September 2002) pp. S112–S123. 0031-8248/2002/69supp-0011$10.00. Copyright 2002 by the Philosophy of Science Association. All rights reserved.

Akaike’s framework for thinking about model selection in terms of the goal of predictive accuracy and his criterion for model selection have important philosophical implications. Scientists often test models whose truth values they already know, and they often decline to reject models that they know full well are false. Instrumentalism helps explain this pervasive feature of scientific practice, and Akaike’s framework helps provide instrumentalism with the epistemology it needs. Akaike’s criterion for model selection also throws light on the role of parsimony considerations in hypothesis evaluation. I explain the basic ideas behind Akaike’s framework and criterion; several biological examples, including the use of maximum likelihood methods in phylogenetic inference, are considered.

Philosophers of science usually agree that the point of testing theories—indeed, the point of doing science—is to determine which theories are true. We recognize, of course, that scientists never have access to all possible theories on a given subject; they are limited by the theories they have at hand. But given a set of competing theories, the point of theory assessment is to ascertain which of these competitors is one’s best guess as to what the truth is. Bayesians tend to see things this way; so do scientists who use orthodox Neyman-Pearson methods, and likelihoodists tend to fall into this pattern as well. To be sure, there are deep differences among these outlooks. Bayesians assess which hypotheses are most probable, frequentists evaluate which hypotheses should be rejected, and likelihoodists say which hypotheses are best supported. But these assessments typically invoke the concept of truth; the question is which hypotheses are most probably true, or should be rejected as false, or are most likely to be true.

This obsession with truth also finds expression in the debate between realism and empiricism. Realism says that the goal of science is to discover which theories are true; empiricism maintains that the goal is to discover which theories are empirically adequate (Van Fraassen 1980). A theory is empirically adequate if what it says about observables is true. Realists think that theories should be assessed by considering the truth values of everything they say, while empiricists hold that theories should be assessed by considering the truth values of part of what they say. In both cases, truth is the property that matters.

An older tradition, now not much in evidence in these post-positivist times, holds that the point of science is to provide accurate predictions, not to tell us which theories are true. This is instrumentalism, stripped of the defective philosophy of language that led instrumentalists to deny that theories have truth values (Morgenbesser 1960).
Ernest Nagel (1979) is often taken to have punctured the instrumentalist balloon with his suggestion that the difference between instrumentalism and realism is nonsubstantive; if true theories are the ones that maximize predictive accuracy, then the goal of seeking predictive accuracy and the goal of seeking truth come to the same thing.

With the demise of positivism and the ascendancy of realism, why even consider instrumentalism? The reason the case needs to be reopened has two parts. First, there are aspects of scientific practice that don’t make sense on the model of science as the quest for truth. And second, there is an alternative framework for understanding scientific inference, one that is used increasingly by scientists themselves, which says that the goal of theory evaluation is to estimate predictive accuracy. It turns out in this framework that a true theory can be less predictively accurate than a false one. Nagel’s suggestion that truth and predictive accuracy always coincide is not correct.

The simple but pervasive fact about scientific practice is that scientists often test hypotheses that they know full well are false, and they often refuse to reject such hypotheses in the light of evidence. Consider, for example, the simple statistical problem of deciding whether two large populations of corn plants have the same mean heights. Where u1 and u2 are the two means, the two hypotheses to consider are

(Null)  u1 = u2
(Diff)  u1 ≠ u2.

Surely no scientist could or should believe that these populations have exactly the same average heights. Yet, this and similar hypotheses are tested every day, and sometimes the conclusion is drawn that one should not reject the null hypothesis. Scientists must be crazy if this assessment concerns what is true. But if their goal is to assess which model is more predictively accurate, there may be method in this madness (Sober 1998; Forster 2000a).

My next example (for others, see Yoccoz 1991; Johnson 1995; Burnham and Anderson 1998) concerns hypotheses about a “molecular clock.” Consider two lineages that stem from a common ancestor and connect to contemporary descendant species B and C. There are millions of nucleotides in the DNA of the organisms in these two lineages. Let b = the rate of nucleotide substitution in the lineage leading to B, and c = the rate of substitution in the lineage leading to C. The clock hypothesis is expressed by the first of the following two hypotheses (Felsenstein 1983):

(Constrained)    b = c
(Unconstrained)  b ≤ c or b ≥ c.

The constrained model entails the unconstrained model, but not conversely. I think we know, before any data are gathered, that the constrained model is almost certainly false and that the unconstrained model must be true; the latter is, after all, a tautology. Furthermore, we know that if we gather data, our observations will not dislodge this two-part verdict. Yet, scientists go to the trouble of gathering data, and when they run statistical tests, they sometimes decline to reject the clock hypothesis. The puzzlement is why scientists bother to run these tests in the first place, if the goal is to discover which models are true.
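The clock case can be made concrete with a toy relative-rate comparison, sketched below. The substitution counts are hypothetical, and real tests of the molecular clock are considerably more sophisticated; the sketch only illustrates how a hypothesis known in advance to be false can nonetheless survive a significance test.

```python
from scipy import stats

# Hypothetical counts of substitutions that occurred along the lineage leading
# to B and along the lineage leading to C (a crude relative-rate comparison,
# not any published clock test).
subs_B, subs_C = 480, 520

# If the clock hypothesis (b = c) were true, each substitution would be equally
# likely to fall on either lineage, so the split should look binomial with p = 1/2.
result = stats.binomtest(subs_B, subs_B + subs_C, p=0.5)
print(f"p = {result.pvalue:.3f}")

# A p-value this large means the clock hypothesis is not rejected, even though
# no one believes that the two rates are exactly equal.
```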
Does science, like poetry, demand the willing suspension of disbelief? These peculiar practices start to make sense in the light of an inferential framework developed by the Japanese statistician H. Akaike (1973) and his school (Sakamoto et al. 1986) for thinking about how models are used to make predictions.1 Models, first of all, are statements that contain adjustable parameters. They are disjunctions, often infinite disjunctions, over all the different parameter values that are consistent with the constraints the model specifies. If models are disjunctive in this way, how can they be used to make predictions? The answer is that one estimates the values of parameters by finding the parameter values that maximize the probability of the data—i.e., the values that have maximum likelihood.

1. Perhaps the simple explanation of why scientists behave in the peculiar way I have described is that they accept frequentist statistics. This raises two questions: Is the frequentist approach sound? Does frequentism have instrumentalist commitments? Space does not permit me to pursue either question here.

In the case of (Null) and (Diff), suppose one samples from each population and finds that the mean height in the first sample is 62 inches and the mean height in the second is 64 inches. If so, the likeliest members of (Null) and (Diff) are:

L(Null)  u1 = u2 = 63 inches
L(Diff)  u1 = 62 inches; u2 = 64 inches.

L(Diff) fits the data better than L(Null) does. However, the goal is not to fit old data, but to predict new data. The question is how well L(Null) and L(Diff) will do in predicting new data drawn from the same two populations. It is perfectly possible that L(Null) will predict new data better than L(Diff), even though (Null) is false and (Diff) is true.

The concept of predictive accuracy describes a two-step process. One uses old data to find the most likely member of each model; then one uses those likeliest members to predict new data. Imagine repeating this process again and again. The predictive accuracy of a model is its average performance in this iterated task. Predictive accuracy is a mathematical expectation.

Akaike not only articulated a framework in which predictive accuracy is the goal of inference; in addition, he provided a methodology for estimating a model’s predictive accuracy. Given the data at hand, how is one to estimate how well a model will do in predicting new data—data that one does not yet have? Akaike’s criterion for model selection is expressed by a theorem he proved.2

An unbiased estimate of the predictive accuracy of model M = log Pr[Data | L(M)] − k.

2. Forster and Sober (1994) describe Akaike’s estimated predictive accuracy as a quantity per datum, and so divided the right side of this equation by N, the number of data.

The probability that L(M) confers on the data is relevant to assessing M’s predictive accuracy, but it is not the only consideration. The other factor that matters is k, the number of adjustable parameters the model contains. Akaike’s theorem imposes a penalty for complexity.
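The two-step process and the theorem’s estimate can be put side by side in a small simulation of the corn example. Everything in the sketch is stipulated for illustration: the true means, the sample sizes, and the simplifying assumption that the variance is known; it reproduces no actual analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_means, sigma, n = (63.0, 63.2), 4.0, 25     # (Null) is false: the means differ slightly

def sample():
    return [rng.normal(m, sigma, n) for m in true_means]

def loglik(data, means):
    # log Pr[data | fitted means], treating sigma as known for simplicity
    return sum(stats.norm.logpdf(x, m, sigma).sum() for x, m in zip(data, means))

old, new = sample(), sample()

# Step 1: fit each model to the old data by maximum likelihood.
L_null = (np.concatenate(old).mean(),) * 2        # one adjustable parameter (a single mean)
L_diff = (old[0].mean(), old[1].mean())           # two adjustable parameters

# Step 2: see how well each fitted model does on new data, and compute
# Akaike's estimate, log Pr[old data | L(M)] - k, from the old data alone.
for name, fitted, k in [("Null", L_null, 1), ("Diff", L_diff, 2)]:
    print(name,
          " fit to old data:", round(loglik(old, fitted), 2),
          " fit to new data:", round(loglik(new, fitted), 2),
          " Akaike estimate:", round(loglik(old, fitted) - k, 2))

# L(Diff) fits the old data at least as well as L(Null); whether it predicts the
# new data better is a different question, and the k term is the correction that
# Akaike's theorem prescribes.
```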
In the examples we have considered, it is inevitable that L(Null) will have a lower likelihood than L(Diff) and that L(Constrained) will be less likely than L(Unconstrained). However, it also is true that (Null) is simpler than (Diff) and that (Constrained) is simpler than (Unconstrained). Likelihood and simplicity are in conflict; Akaike’s theorem shows how each contributes to estimating a model’s predictive accuracy. If two models fit the existing data about equally well, then the simpler model can be expected to do a better job predicting new data. For the more complex model to receive the higher AIC (Akaike information criterion) score, it must fit the data a lot better, not just modestly better. Akaike’s theorem quantifies this trade-off—it describes how much of a gain in likelihood there must be to offset a given loss in simplicity. Just as Akaike’s framework breathes new life into instrumentalism, his theorem provides powerful insights into the relevance of parsimony considerations in many inference problems (Forster and Sober 1994).

Akaike’s theorem is a theorem, so we should note the assumptions that go into its proof. First, in his definition of predictive accuracy, Akaike defines the distance between a fitted model and the truth by using the Kullback-Leibler distance. Second, he assumes that the new data will be drawn from the same underlying reality that generated the old (this has two parts—that the true function that connects independent to dependent variables is the same across data sets, and that the distribution that determines how the values of independent variables are selected is also the same); this might be termed a Humean “uniformity of nature” assumption (Forster and Sober 1994). And third, Akaike makes a normality assumption; roughly, this is the idea that repeated estimates of each parameter are normally distributed. In the model selection literature, there is discussion of other distance measures, such as mean-squared error and Kolmogorov’s absolute difference measure (Linhart and Zucchini 1986; McQuarrie and Tsai 1998). It also turns out to matter whether one’s data set is small or large. These are two reasons why model selection criteria other than Akaike’s have attracted attention (see also Burnham and Anderson 1998). Proposed criteria differ in terms of the penalty imposed for complexity, but it is not controversial that high likelihood of the fitted model is good news and that complexity is bad.3

3. Not only can AIC be compared with other criteria that have been defended as methods for maximizing predictive accuracy; in addition, one can assess methods that have been developed for quite other reasons by seeing how well they do in prediction problems. Two examples are BIC, the Bayesian information criterion of Schwarz (1978), and Neyman-Pearson likelihood ratio tests. BIC was developed as a method for assessing the average likelihood of a model; it proposes a criterion that gives more weight to simplicity than AIC does. Similarly, when the likelihood ratio test is applied to nested models (with 1 degree of freedom), it too embodies a policy that gives more weight to simplicity than AIC does (Forster 2000b).
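To give a feel for what closeness to the truth in the Kullback-Leibler sense comes to, here is a small simulation in the spirit of the corn example. It assumes, purely for convenience, normal populations with a known and shared variance, so that the Kullback-Leibler distance between a true and a fitted distribution has a simple closed form; the means and sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
true_means, sigma, n = (63.0, 63.2), 4.0, 10      # small samples, nearly equal true means

def kl(mu_true, mu_fit):
    # Kullback-Leibler distance between two normals sharing a known variance
    return (mu_true - mu_fit) ** 2 / (2 * sigma ** 2)

samples = [rng.normal(m, sigma, n) for m in true_means]

L_null = (np.concatenate(samples).mean(),) * 2    # fitted member of (Null)
L_diff = (samples[0].mean(), samples[1].mean())   # fitted member of (Diff)

for name, fitted in [("L(Null)", L_null), ("L(Diff)", L_diff)]:
    distance = np.mean([kl(t, f) for t, f in zip(true_means, fitted)])
    print(name, "average KL distance from the truth:", round(distance, 4))

# With small samples and nearly equal true means, L(Null) is often the closer of
# the two fitted models, even though (Null) itself is false.
```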
The uniformity of nature and normality assumptions that go into the proof of Akaike’s theorem are empirical claims about the prediction problem at hand. This has important implications for the question of whether simplicity is a “super-empirical virtue.” Although it is clear that simplicity is a consideration in model selection separate from fit to data, the justification provided by Akaike’s theorem for using simplicity depends on empirical assumptions. Simplicity is therefore an empirical consideration. This is good news for empiricism, since empiricists have had a hard time reconciling their epistemology with the role that simplicity evidently plays in scientific inference, and sometimes have gone so far as to claim that simplicity considerations are merely pragmatic (e.g., Van Fraassen 1980).

With this sketch of Akaike’s criterion added to the previous sketch of the Akaike framework, we can fine-tune our claim concerning instrumentalism and realism. These philosophies are usually understood globally; the goal of inference is always to find theories that make accurate predictions, or the goal is always to find theories that are true. Akaike’s framework and theorem show that each must be reformulated locally. The assessment of models containing adjustable parameters conforms to instrumentalism. But the assessment of fitted models, all of whose parameters have been adjusted, can be construed realistically. The data may lead us to judge that (Null) is a better predictor than (Diff), even though we know that (Null) is false and (Diff) is true,4 but, if so, we also will judge that L(Null) is closer to the truth (in the sense of Kullback-Leibler distance) than L(Diff). The operative slogan is: instrumentalism for models, realism for fitted models.5 Notice that the realism that pertains to fitted models does not mean that one regards them as true (surely one knows that they are not).

4. If (Diff) is true, then there exists a member of (Diff)—call it T(Diff)—that is true. If only one knew the identity of T(Diff), one could use that hypothesis to predict new data, and no other assignment of parameter values, to either (Null) or (Diff), will yield more accurate predictions. In this sense, Nagel was right to suggest that the truth is the best predictor. But what is true of the members of (Null) and (Diff) is not true of those models themselves.

5. Can a realist adopt the ecumenical view that one of the aims of science is predictive accuracy and, therefore, Akaike’s framework and criterion do not conflict with realism? That depends on whether realism is the innocuous claim that one of the goals of theory evaluation is to discover which theories are true, or the more substantive claim that the unique ultimate goal of theorizing is the discovery of truth (Sober 1998).

Akaike’s criterion allows one to compare both nested and non-nested models. To explain what this means, I will describe some of the models that Burnham and Anderson (1998, 110–114) consider as explanations of data gathered by Schoener (1970) on resource utilization in two species of Anolis lizard in Jamaica. Schoener repeatedly inspected the sites occupied by different lizards in an area that had been cleared of trees and shrubs. Each time he observed a lizard perching, he noted which of two species (S) it belonged to, the height (H) of the perch (≤ or > 5 feet), the perch’s diameter (D) (≤ or > 2 inches), the site’s insolation (I) (sunny or shady), and the time of day (T) (early morning, midday, or late afternoon).

The most complex model that Burnham and Anderson consider says that an individual’s probability of perching on a site may be influenced by S, H, D, I, and T, its being left to the data to say how much of a difference, if any, each makes. The simplest model says that an individual’s probability of perching is not affected by any of these factors; (NULL) is the nihilistic model that nothing matters. In between are singleton models, two-factor models, three-factor models, and so on; each says that the variables cited may matter, and that the ones that go unmentioned do not.
Let’s consider just the following:

most complex model:            STH
two-factor models:       ST    TH    SH
singleton models:         S     T     H
simplest model:               NULL

These models form a partial ordering, with each model nested in the models at the levels above it that contain its factors. NULL is the logically strongest model; it entails all the others. NULL is nested inside of S, S is nested inside of ST, and so on. Simpler models can be obtained from the more complex models in which they are nested by setting parameters equal to zero.

Akaike’s theorem allows nested models to be compared for their estimated predictive accuracy, and disjoint models to be compared as well, with the answer always depending on the data at hand. This isn’t so for either Bayesianism or for Neyman-Pearson procedures. When Bayesians look at nested models, the verdict is preordained; for example, since NULL entails T, NULL cannot be more probable than T, no matter what the data say. This means that Bayesians cannot represent the fact that scientists often evaluate logically stronger models as “better” than logically weaker models. They can’t be better in Bayesian terms, because they can’t be more probable (Popper 1959; Forster and Sober 1994). The standard Bayesian response is to “change the subject.” Instead of comparing NULL with T, they will compare NULL with T*; where T asserts that time of day may make a difference, T* says that time does make a difference. T* is not nested in NULL; they are disjoint. With the problem redefined in this way, there now is no logical prohibition against claiming that NULL is more probable than T*. Of course, two problems remain—how are the likelihoods of composite hypotheses to be assessed, and how is one to justify an assignment of prior probabilities (especially one that says that NULL is more probable a priori than T*)?

The limitation imposed by Neyman-Pearson procedures is different. The likelihood ratio test used in frequentist statistics allows nested models to be compared, but not models that are disjoint. The members of the set consisting of NULL, S, ST, and STH can all be compared, but S cannot be compared with T, nor with TH, for example. In this situation, frequentists often compare each singleton model with the null hypothesis, and then construct a multi-factor model that includes all and only the singleton factors that were able to “beat” the null hypothesis. This is an expedient procedure that has no mathematical rationale within frequentist philosophy. If STH beats each of S, T, and H, and each of these singleton models beats the null hypothesis, then it makes sense to embrace the STH model and reject the simpler alternatives. However, it is perfectly possible that NULL is rejected each time it is compared with the singleton models S, T, and H, but that one of these singleton models does not get rejected when it is compared with the three-factor model STH. In this case, what is one to do? Neyman-Pearson statistics provides no answer. This is why some scientists have embraced Akaike-style model selection procedures. Most scientists follow Neyman-Pearson methods when they can; however, when they want to compare non-nested models, they have had to find a different approach.
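What an Akaike-style comparison looks like can be illustrated with a setting loosely modeled on the lizard example. Everything in the sketch is a hypothetical simplification: the counts, the reduction of the response to a single binary outcome (sunny versus shady), and the candidate models NULL, S, T, and ST. It is not Schoener’s data and not Burnham and Anderson’s analysis; the point is only that the score log-likelihood minus k applies to nested and non-nested comparisons alike.

```python
import numpy as np

# Hypothetical (sunny, shady) perch counts, cross-classified by species S and
# time of day T.
counts = {
    ("A", "morning"): (30, 10), ("A", "midday"): (18, 22), ("A", "afternoon"): (25, 15),
    ("B", "morning"): (28, 12), ("B", "midday"): (20, 20), ("B", "afternoon"): (22, 18),
}

def max_loglik(grouping):
    """Pool the cells a model treats alike, fit one probability per pool by
    maximum likelihood, and return the log-likelihood and the number of
    adjustable parameters."""
    pools = {}
    for (s, t), (sunny, shady) in counts.items():
        key = grouping(s, t)
        a, b = pools.get(key, (0, 0))
        pools[key] = (a + sunny, b + shady)
    ll = 0.0
    for sunny, shady in pools.values():
        p = sunny / (sunny + shady)
        ll += sunny * np.log(p) + shady * np.log(1 - p)
    return ll, len(pools)

models = {"NULL": lambda s, t: "all",
          "S":    lambda s, t: s,
          "T":    lambda s, t: t,
          "ST":   lambda s, t: (s, t)}

for name, grouping in models.items():
    ll, k = max_loglik(grouping)
    print(f"{name:4s} logL = {ll:8.2f}  k = {k}  score (logL - k) = {ll - k:8.2f}")

# The score applies indifferently to nested pairs (NULL versus S) and to
# non-nested pairs (S versus T), which is the feature stressed above.
```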
I now want to describe an area of research in evolutionary biology in which Akaike’s ideas are just starting to be used (see, for example, Posada and Crandall 2001). In the 1960s biologists began developing maximum likelihood methods for inferring phylogenetic relationships. Consider the simplest case, in which the inference problem involves three species (humans, chimpanzees, and gorillas, for example). Assuming that there is a common ancestor that unites all three and that the phylogeny is bifurcating, there are three possible rooted trees. The goal is to use the observed characteristics of these species to assess which tree is most likely; the task is to say how the likelihoods Pr[Data | (HC)G], Pr[Data | H(CG)], and Pr[Data | (HG)C] are ordered.

The problem, however, is that phylogenetic hypotheses are composite. The probability that a tree topology confers on the data depends on a model of the evolutionary process and on the values of the parameters in that model; that is, the likelihood of the hypothesis is an average:

Pr[Data | (HC)G] = Σi Σv Pr[Data | (HC)G & Model-i & parameters in Model-i have values v] × Pr[Model-i & parameters in Model-i have values v | (HC)G].

This poses a problem for the likelihood approach, since no one has the slightest idea how to evaluate the second product term on the right side of the equality.

In this circumstance, the parameters in a model and the model itself are “nuisance parameters”; they affect a tree’s likelihood, but they are not what one wishes to infer. One way to deal with nuisance parameters is to “change the subject”—instead of trying to determine the average likelihood of a topology, given a range of possible models, one chooses a single model as the true one and assesses its likelihood under the assumption that the parameters in the model have their maximum likelihood values. This expedient solution is not entirely satisfactory (Edwards 1972; Sober 1988; Royall 1997; Forster 1986, 1988 disagrees). But even granting this reformulation, two new problems arise. First, the evaluation of tree topologies can depend on the process model used:

(*)  Pr[Data | (HC)G & L(Model-1)] > Pr[Data | H(CG) & L(Model-1)]
     Pr[Data | (HC)G & L(Model-2)] < Pr[Data | H(CG) & L(Model-2)].

Unfortunately, biologists who don’t already know which phylogeny is correct usually don’t know which process model is correct for the taxa and traits at hand (Felsenstein 1978; Sober 1988). Furthermore, the ability to discriminate among tree topologies tends to decline as more complex and realistic process models are employed (Steel et al. 1994; Lewis 1998, 139). Although the likelihood of a topology goes up as more complex process models are employed, the likelihoods of different topologies come closer together. How depressing that greater realism about the evolutionary process should impair, rather than enhance, our ability to reconstruct phylogenies!

The solution to both these problems has been to use the frequentist methodology of likelihood ratio tests. Instead of passing automatically from simpler to more complex models, one asks whether the shift represents a significant improvement in fit. However, we now need to attend to the fact, noted earlier, that likelihood ratio tests are meaningful only for nested models. To explore the import of this point, I should correct a bit of misleading notation. In (*), L(Model-1) appears on both sides of the inequality. The point that needs to be recognized is that the same process model, when conjoined with different tree topologies, often yields different maximum likelihood estimates of parameter values. It would be better to write L[(HC)G & Model-1] and L[H(CG) & Model-1] to make this point clear.
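Before moving on, the contrast drawn above between averaging over nuisance parameters and fixing them at their maximum likelihood values can be illustrated with a toy example far removed from phylogenetics. The composite hypothesis below simply leaves a coin’s bias p open, and the flat weighting over p is a stipulation; it is precisely this kind of weighting that no one knows how to justify in the phylogenetic case.

```python
from scipy import stats, integrate

heads, tosses = 7, 10                  # hypothetical data

def likelihood(p):
    return stats.binom.pmf(heads, tosses, p)

# Average likelihood of the composite hypothesis that leaves p open, using a
# flat weighting over p in [0, 1].
average, _ = integrate.quad(likelihood, 0.0, 1.0)

# The expedient described in the text: plug in the maximum likelihood value of
# the nuisance parameter instead of averaging over it.
maximized = likelihood(heads / tosses)

print(f"average likelihood   = {average:.4f}")
print(f"maximized likelihood = {maximized:.4f}")

# Maximizing always yields at least as high a value as averaging, and the two
# policies can rank competing composite hypotheses differently.
```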
Process models, when separated from tree topologies, are partially ordered (Swofford et al. 1996, 434). For example, in the case of inferring phylogenies from DNA sequence data, the first process model to be explored was also the simplest—that of Jukes and Cantor (1969). This model assumes that all changes at a site have the same probability, that all sites in a lineage evolve independently and according to the same rules, and that a site in one lineage obeys the same rules as the same site in any other. The model therefore assumes that selection does not favor one nucleotide at a site over any other; this is a pure drift model in which the effective population sizes in different lineages are the same. Subsequent models have relaxed different assumptions in the Jukes-Cantor model in different ways. It isn’t that the newer models assume the opposite of what the Jukes and Cantor model stipulates. Rather, these models leave this or that matter open, and let the data decide what the best settings of the parameters are (Lewis 1998).

Consider the following four conjunctions; each includes a tree topology and a process model fitted to the data on the assumption that the topology is correct:

L[(HC)G & Model-2]     L[H(CG) & Model-2]
L[(HC)G & Model-1]     L[H(CG) & Model-1]

Likelihood ratio tests permit vertical comparisons, if Model-1 is nested in Model-2. However, it isn’t so easy to determine, within frequentist statistics, how one should make horizontal comparisons (Swofford et al. 1996, 506). How is one to determine whether L[(HC)G & Model-1] fits the data significantly better than L[H(CG) & Model-1]? But even more puzzling are diagonal comparisons. Likelihood ratio tests do not permit L[(HC)G & Model-2] to be tested against L[H(CG) & Model-1]. However, this comparison and the others as well make perfect sense in the Akaike framework. The likelihood of each conjunction is relevant, but so too is the number of adjustable parameters. If it turns out that L[H(CG) & Model-1] has the highest AIC score among the conjunctions considered, there is no need to apologize for the fact that Model-1 is obviously false. One’s best estimate is that the fitted model L[H(CG) & Model-1] is closer to the truth than its competitors. Using the Akaike framework has the liberating effect that a plurality of process models can and should be considered; there is no reason to drop an idealized model and use only the most realistic model one can analyze.

In closing, I want to discuss one more example, just for fun. Some years ago, cognitive psychologists discussed the phenomenon of “hot hands” in sports. Everyone with even the most superficial familiarity with professional basketball believes that players occasionally have “hot hands.” When players are hot, their chance of scoring improves, and teammates try to feed the ball to them. However, a statistical analysis of scoring patterns in the NBA yielded the result that one cannot reject the null hypothesis that each player has a constant probability of scoring throughout the season (Gilovich et al. 1985). The scientists concluded that belief in hot hands is a “cognitive illusion,” while basketball mavens reacted to this statistical pronouncement with total incredulity. Placing this dispute in the Akaike framework allows it to make more sense. Scientists should not feel shy about admitting that the null hypothesis is false. The idea that players never waver in their probabilities of scoring is preposterous. The point of doing statistics is not to see whether this silly hypothesis is true, but to see how good it is at predicting new data. Presumably, the truth about basketball players is very complex. Their scoring probabilities change as subtle responses to a large number of interacting causes. Given this complexity, players and coaches may make better predictions by relying on simplified models. Hot hands may be a reality, but trying to predict when players have hot hands may be a fool’s errand.
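Here is a sketch of how the dispute looks when it is recast as a model selection problem. The shot sequence is simulated from a mildly streaky shooter, so the constant-probability model is false by construction; the two candidate models and the numbers are hypothetical and are not the models or data analyzed by Gilovich et al. (1985).

```python
import numpy as np

rng = np.random.default_rng(3)
p_after_make, p_after_miss, n_shots = 0.52, 0.48, 200   # a mildly streaky shooter

shots = [1]
for _ in range(n_shots - 1):
    p = p_after_make if shots[-1] == 1 else p_after_miss
    shots.append(int(rng.random() < p))
shots = np.array(shots)
prev, curr = shots[:-1], shots[1:]

def loglik(outcomes, p):
    return np.sum(outcomes * np.log(p) + (1 - outcomes) * np.log(1 - p))

# Constant-probability model: one adjustable parameter.
ll_constant = loglik(curr, curr.mean())

# Simple "hot hands" model: the scoring probability depends on whether the
# previous shot was made -- two adjustable parameters.
ll_hot = sum(loglik(curr[prev == s], curr[prev == s].mean()) for s in (0, 1))

for name, ll, k in [("constant p", ll_constant, 1), ("hot hands", ll_hot, 2)]:
    print(f"{name:11s} logL = {ll:7.2f}  k = {k}  score (logL - k) = {ll - k:7.2f}")

# With a short season and a small real difference, the false but simpler
# constant-probability model will often receive the higher score: it is
# estimated to be the better predictor of new shots.
```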
REFERENCES

Akaike, H. (1973), “Information Theory as an Extension of the Maximum Likelihood Principle”, in B. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory. Budapest: Akademiai Kiado, 267–281.
Burnham, K. and D. Anderson (1998), Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer.
Edwards, A. (1972), Likelihood. Cambridge: Cambridge University Press.
Felsenstein, J. (1978), “Cases in which Parsimony and Compatibility Methods Can Be Positively Misleading”, Systematic Zoology 27: 401–410.
Felsenstein, J. (1983), “Statistical Inference of Phylogenies”, Journal of the Royal Statistical Society A 146: 246–272.
Forster, M. (1986), “Statistical Covariance as a Measure of Phylogenetic Relationship”, Cladistics 2: 297–317.
Forster, M. (1988), “Sober’s Principle of Common Cause and the Problem of Comparing Incomplete Hypotheses”, Philosophy of Science 55: 538–559.
Forster, M. (2000a), “Hard Problems in the Philosophy of Science: Idealisation and Commensurability”, in R. Nola and H. Sankey (eds.), After Popper, Kuhn, and Feyerabend. London: Kluwer, 231–250.
Forster, M. (2000b), “Key Concepts in Model Selection: Performance and Generality”, Journal of Mathematical Psychology 44: 205–231.
Forster, M. and E. Sober (1994), “How to Tell When Simpler, More Unified, or Less Ad Hoc Theories Will Provide More Accurate Predictions”, British Journal for the Philosophy of Science 45: 1–36.
Forster, M. and E. Sober (2003), “Why Likelihood?”, in M. Taper and S. Lee (eds.), The Nature of Scientific Evidence. Chicago: University of Chicago Press, forthcoming.
Gilovich, T., R. Vallone, and A. Tversky (1985), “The Hot Hand in Basketball: On the Misperception of Random Sequences”, Cognitive Psychology 17: 295–314.
Johnson, D. (1995), “Statistical Sirens: The Allure of Nonparametrics”, Ecology 76: 1998–2000.
Jukes, T. and C. Cantor (1969), “Evolution of Protein Molecules”, in H. Munro (ed.), Mammalian Protein Metabolism. New York: Academic Press, 21–132.
Lewis, P. (1998), “Maximum Likelihood as an Alternative to Parsimony for Inferring Phylogeny Using Nucleotide Sequence Data”, in D. Soltis, P. Soltis, and J. Doyle (eds.), Molecular Systematics of Plants II. Boston: Kluwer, 132–163.
Linhart, H. and W. Zucchini (1986), Model Selection. New York: Wiley.
McQuarrie, A. and C. Tsai (1998), Regression and Time Series Model Selection. Singapore: World Scientific.
Morgenbesser, S. (1960), “The Realist-Instrumentalist Controversy”, in S. Morgenbesser, P. Suppes, and M. White (eds.), Philosophy, Science, and Method. New York: Harcourt, Brace, and World, 106–122.
Nagel, E. (1979), The Structure of Science. Indianapolis: Hackett.
Popper, K. (1959), Logic of Scientific Discovery. London: Hutchinson.
Posada, D. and K. Crandall (2001), “Selecting Models of Nucleotide Substitution: An Application to Human Immunodeficiency Virus 1 (HIV-1)”, Molecular Biology and Evolution 18 (6): 897–906.
Royall, R. (1997), Statistical Evidence: A Likelihood Paradigm. Boca Raton: Chapman and Hall.
Sakamoto, Y., M. Ishiguro, and G. Kitagawa (1986), Akaike Information Criterion Statistics. New York: Springer.
Schoener, T. (1970), “Nonsynchronous Spatial Overlap of Lizards in Patchy Habitats”, Ecology 51: 408–418.
Schwarz, G. (1978), “Estimating the Dimension of a Model”, Annals of Statistics 6: 461–465.
Sober, E. (1988), Reconstructing the Past: Parsimony, Evolution, and Inference. Cambridge, Mass.: MIT Press.
Sober, E. (1998), “Instrumentalism Revisited”, Critica 31: 3–38.
Steel, M., L. Szekely, and M. Hendy (1994), “Reconstructing Trees When Sequence Sites Evolve at Variable Rates”, Journal of Computational Biology 1: 153–163.
Swofford, D., G. Olsen, P. Waddell, and D. Hillis (1996), “Phylogenetic Inference”, in D. Hillis, C. Moritz, and B. Mable (eds.), Molecular Systematics, 2nd ed. Sunderland, Mass.: Sinauer, 407–514.
Van Fraassen, B. (1980), The Scientific Image. New York: Oxford University Press.
Yoccoz, N. (1991), “Use, Overuse, and Misuse of Significance Tests in Evolutionary Biology and Ecology”, Bulletin of the Ecological Society of America 32: 106–111.