Philosophy of Science, 74 (December 2007) pp. 000–000. 0031-8248/2007/7405-0002$10.00 Copyright 2007 by the Philosophy of Science Association. All rights reserved.

Monday Feb 11 2008 03:36 PM PHOS v74n5 740502 JH Proof 1

A New Solution to the Puzzle of Simplicity

Kevin T. Kelly†

Explaining the connection, if any, between simplicity and truth is among the deepest problems facing the philosophy of science, statistics, and machine learning. Say that an efficient truth finding method minimizes worst case costs en route to converging to the true answer to a theory choice problem. Let the costs considered include the number of times a false answer is selected, the number of times opinion is reversed, and the times at which the reversals occur. It is demonstrated that (1) always choosing the simplest theory compatible with experience, and (2) hanging onto it while it remains simplest, is both necessary and sufficient for efficiency.

1. The Puzzle of Simplicity. Philosophy of science, statistics, and machine learning all recommend the selection of simple theories or models on the basis of empirical data, where simplicity has something to do with minimizing independent entities, principles, causes, or equational coefficients. This intuitive preference for simplicity is called Ockham’s razor, after the fourteenth century theologian and logician William of Ockham. But in spite of its intuitive appeal, how could Ockham’s razor help us find the true theory? For, in an updated version of Plato’s Meno paradox, if we already know that the truth is simple, we don’t need Ockham’s help. And if we don’t already know that the truth is simple, what entitles us to assume that it is?
†To contact the author, please write to: Department of Philosophy, Carnegie Mellon University, Baker Hall 135, Pittsburgh, PA 15213-3890; e-mail: kk3n@andrew.cmu.edu.

It does not help to say that simplicity is associated with other virtues such as testability (Popper 1968), unity (Friedman 1983), better explanations (Harman 1965), higher “confirmation” (Carnap 1950; Glymour 1980), or minimum description length (Rissanen 1983), since if the truth were not simple, it would not have these nice properties either. To assume otherwise is to engage in wishful thinking (van Fraassen 1981).

Over-fitting arguments (Akaike 1973; Forster and Sober 1994) show that using a simple model for predictive purposes in the presence of random noise can decrease the expected squared error of predictions. But that is still the case when one knows in advance that the truth is complex, so over-fitting arguments concern accuracy of prediction rather than finding the true theory. Furthermore, if one is interested in predicting the causal outcome of a policy on the basis of non-experimental data, the prediction could end up far from the mark because the counterfactual distribution after the policy is enacted may be quite different from the distribution sampled (Spirtes and Zhang 2003). Finally, such arguments work only in statistical settings, but Ockham’s razor seems no less compelling in deterministic ones.

Nor is Ockham’s razor explained by a prior probabilistic bias in favor of simple possibilities, for the propriety of a systematic bias in favor of simplicity is precisely what is at issue.
The argument remains circular even if complex and simple theories receive equal prior probabilities, for theories with more free parameters can be true in more ‘ways’, so that each way the complex theory might be true ends up carrying less prior probability than each of the ways the simple theory might be true; that prior bias toward simple possibilities is merely passed through Bayes’ theorem (e.g., Rosenkrantz 1983 and the discussion of the Bayes information criterion in Wasserman 2004).

There are noncircular, relevant arguments for Ockham’s razor, if one is willing to grant premises far more dubious than the theories Ockham’s razor is used to justify. Leibniz ([1714] 1875) appealed to the Creator’s taste for elegance. More recently, some “naturalistic” philosophers and machine learning researchers have replaced Providence with an equally vague and optimistic appeal to Evolution (e.g., Giere 1985; Duda et al. 2000, 464–465). But whereas a sufficiently powerful and kind Deity could save us from error in scientific questions never before encountered, it is hardly clear how selective pressures on our hominid ancestors could do so, unless Ockham’s razor is invoked to argue that our evolved penchant for simplicity is a reliable guide to the truth in questions never before encountered.

Even if Providence or Evolution did arrange the truth of simple theories after a fashion that may remain eternally obscure, it would surely be nice, in addition, to have a clear, normative argument to the effect that Ockham’s razor is the most efficient possible method for finding the true theory when the problem involves theory choice. This note presents just such an argument.¹ The idea is that it is hopeless to provide an a priori explanation of how simplicity points at the truth immediately, since the truth may depend upon subtle empirical effects that have not yet been observed or even conceived. The best that Ockham’s razor could guarantee a priori is to keep us on the straightest possible path to the truth, allowing for unavoidable twists and turns along the way as new effects are discovered, and that is just what it does guarantee. Readers who wish to cut to the chase may prefer to peek immediately at Theorem 1 in Section 5 prior to reviewing the relevant definitions.

1. The approach is based on concepts from computational learning theory. An early appearance of retractions as a fundamental cost of inquiry is in Putnam 1965. An abstract theory of complexity of inductive inference is presented in Daley and Smith 1986. For a survey of results concerning retractions in inductive inference see Jain et al. 1999. Earlier versions of the following argument may be found in Schulte 1999; Kelly 2002, 2004; Kelly and Glymour 2004; and especially Kelly 2005, 2007, 2008.

2. Illustration: Empirical Effects. Suppose that you are interested in the structure S of an unknown polynomial law

$$f(x) = \sum_{i \in S} a_i x^i, \tag{1}$$

where S is assumed to be a finite set of indices such that for each $i \in S$, $a_i \neq 0$. It seems that structures involving fewer monomial terms are simpler, so Ockham’s razor favors them. Suppose that patience and improvements in measurement technology allow one to obtain ever tighter open intervals around $f(x)$ for each specified value of x as time progresses.² Suppose that the true degree is zero, so that f is a constant function. Each finite collection of open intervals around values of f is compatible with degree one (linearity), since there is always a bit of wiggle room within finitely many open intervals to tilt the line. So suppose that the truth is the tilted line that fits the data received so far. Eventually you can obtain data from this line that refutes degree zero. Call such data a (first order) effect. Any further, finite, amount of data collected for the linear theory is compatible (due to the remaining minute wiggle room) with a quadratic law, etc. The truth is assumed to be polynomial, so the story must end, eventually, at some finite set S of effects. Thus, determining the true polynomial law amounts, essentially, to determining the finite set S of all monomial effects that one will ever see. So conceived, empirical effects have the property that they never appear if they do not exist but may appear arbitrarily late if they do exist.³

2. In statistics, the situation is analogous: increasing the sample size reduces the interval estimates of the values of the function at each argument.

3. In typical statistical applications, something similar is true: effects probably do not appear at each sample size if they don’t exist and probably appear at some sample size onward if they do exist. The data model under discussion may be viewed as a logical approximation of the statistical situation, if one thinks of samples accumulating through time.

To reduce the curve fitting problem to its essential elements, let E be a denumerable set of potential effects and assume that at most finitely many of these effects will ever occur. Assume that your laboratory merely reports the finite set of all effects that have been detected so far, so an input sequence is an upwardly nested sequence of finite subsets of E that converges to some finite subset S of E. An input stream or empirical world is an infinite input sequence. Let the effects presented in input sequence e be denoted $S_e$. The true answer to the effect accounting problem in empirical world w is then just $S_w$. Call this abstract problem the effect accounting problem.
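The wiggle-room step in the polynomial illustration can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes the intervals are symmetric around a constant value c, and the function name `tilted_line_slope` is hypothetical. Given finitely many open intervals around the values of a constant function, it returns a nonzero slope whose line through (0, c) still passes strictly inside every interval, so degree one remains compatible with the data.

```python
from typing import Sequence


def tilted_line_slope(xs: Sequence[float], half_widths: Sequence[float]) -> float:
    """Return a nonzero slope m such that the line c + m*x stays strictly
    inside the open interval (c - w, c + w) at every sampled x.

    Assumption (illustrative only): each interval is symmetric around the
    constant value c, with positive half-width w.
    """
    # The line deviates from c by |m * x| at x, so we need |m| < w / |x|.
    bounds = [w / abs(x) for x, w in zip(xs, half_widths) if x != 0]
    # Any slope works if every sample sits at x = 0; otherwise take half
    # of the tightest bound so the inequality is strict.
    return 0.5 * min(bounds) if bounds else 1.0


xs = [-2.0, 0.0, 1.0, 5.0]
half_widths = [0.1, 0.01, 0.2, 0.05]
m = tilted_line_slope(xs, half_widths)
```

However tight the finitely many intervals become, the returned slope is positive, which is why a first order effect can always still appear later.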
The effect accounting problem reflects, approximately, the structure of a number of naturally posed inference problems, such as determining the set of independent variables a dependent variable depends upon, determining quantum numbers from a set of reactions (Schulte 2000), and causal inference (Spirtes et al. 2000), in addition to the polynomial inference problem already mentioned.⁴

A strategy for effect accounting responds to an arbitrary input sequence either with a finite set of effects or with ‘?’, indicating a refusal to choose. Strategy M solves the effect accounting problem if and only if M converges to the true set of effects $S_w$ in each empirical world w. One obvious solution to the effect accounting problem is the strategy $M_0(e) = S_e$, which guesses exactly the effects it has seen so far. If the possibility of infinitely many effects were admitted, then the effect accounting problem would not be solvable at all, due to a classic result by Gold (1978).

Ockham’s razor is the principle that one should never output an informative answer unless that answer is among the simplest answers compatible with experience. In the effect accounting problem, there is a uniquely simplest answer compatible with experience e, namely, the set $S_e$ of effects reported so far along e.⁵ Thus, strategy M is Ockham at e if and only if M produces either $S_e$ or ‘?’ in response to finite input sequence e.

If the inputs received so far are $e = (e_0, \ldots, e_{n+1})$, then let the immediately preceding evidential state be $e_- = (e_0, \ldots, e_n)$ (where $e_-$ is stipulated to denote the empty sequence if e does). Say that solution M is stalwart at e if and only if $M(e) = S_e$ when $M(e_-) = S_e$; that is, if you are already accepting the simplest answer, don’t drop it until it is no longer simplest.
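The definitions of strategy, Ockham, and stalwart can be sketched in code. This is an illustration under stated assumptions, not the author's formalism: effects are represented as strings, an input sequence as a tuple of cumulative `frozenset` reports, and the names `effects_so_far`, `m0`, `is_ockham_at`, and `is_stalwart_at` are hypothetical.

```python
from typing import Callable, FrozenSet, Tuple, Union

Effects = FrozenSet[str]
Evidence = Tuple[Effects, ...]       # upwardly nested, cumulative reports
Answer = Union[Effects, str]         # a finite set of effects, or "?"
Strategy = Callable[[Evidence], Answer]


def effects_so_far(e: Evidence) -> Effects:
    """S_e: the effects reported so far (the last cumulative report)."""
    return e[-1] if e else frozenset()


def m0(e: Evidence) -> Answer:
    """The obvious solution M_0(e) = S_e."""
    return effects_so_far(e)


def is_ockham_at(strategy: Strategy, e: Evidence) -> bool:
    """Ockham at e: the output is either S_e or the refusal '?'."""
    answer = strategy(e)
    return answer == "?" or answer == effects_so_far(e)


def is_stalwart_at(strategy: Strategy, e: Evidence) -> bool:
    """Stalwart at e: if M(e_-) = S_e then M(e) = S_e.

    e_- drops the last entry of e and is empty when e is empty.
    """
    s_e = effects_so_far(e)
    previous = e[:-1]
    return strategy(previous) != s_e or strategy(e) == s_e


def anticipator(e: Evidence) -> Answer:
    """Ockham violator: always anticipates an unseen effect 'b'."""
    return effects_so_far(e) | {"b"}


def waverer(e: Evidence) -> Answer:
    """Stalwartness violator: drops S_e at stage 2 without new effects."""
    return "?" if len(e) == 2 else effects_so_far(e)
```

Note that `waverer` remains Ockham at every stage (it only ever outputs $S_e$ or ‘?’), which shows that the two properties are independent requirements.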
One may speak of stalwartness and of Ockham’s razor as being satisfied from e onward (i.e., at each extension $e'$ of e compatible with K).

4. Strictly speaking, the inference of causal structure is more complicated because some finite sets of effects are routinely ruled out a priori. A more general setting of that sort is discussed below.

5. Again, this condition is false in some applications and its relaxation is discussed below.

The simplicity puzzle now arises because, although every convergent strategy must agree with an Ockham strategy eventually (since the true structure $S_w$ in w is eventually the uniquely simplest structure compatible with the data presented along w), convergence is compatible with arbitrarily severe violations of Ockham’s razor and stalwartness in the short run; for example, one could start with some complex answer $S \neq \emptyset$ and retract back to $M_0$ if $S \neq S_e$ at stage 1000 (Salmon 1967). The trouble is that there are infinitely many ways to converge to the truth in the accounting problem, just as there are infinitely many algorithmic solutions to a solvable computational problem. The nuances of programming practice, the very stuff of textbook computer science, are derived not from solvability itself, but from efficiency or computational complexity (e.g., the time or storage space required to find the right answer). The proposal is that Ockham’s razor is similarly grounded in the efficiency of empirical inquiry, rather than in mere convergence (solvability).

3. Costs of Inquiry. An obvious, doxastic cost of inquiry is the total number of times one’s strategy produces a false answer prior to convergence to the true answer.
Another is the number of times a conclusion is ‘taken back’ or retracted prior to convergence, which corresponds to the degree of ‘straightness’ of the path followed to the truth.⁶ One might also wish to minimize the respective times by which these retractions occur, since there is no point ‘living a lie’ longer than necessary or allowing subsidiary conclusions to accumulate prior to being ‘flushed’ when the retraction occurs. Taken together, these costs reflect the directness and timeliness with which one surmounts obstacles on one’s way to the truth, and a strategy that minimizes them can be said to have the strongest possible connection with the truth. Insofar as epistemology is distinguishable from ‘psychologism’ by its regard for truth conduciveness (Bonjour 1985), minimization of retractions, retraction times, and errors is a properly epistemic consideration; indeed, more so than coherence, plausibility, confirmation, or rhetorical force.

6. Retractions are called mind-changes in computational learning theory (cf. Jain et al. 1999) and contractions in the literature on belief revision (Gärdenfors 1988).

For a given strategy M and infinite input stream w, let the total loss of M in w be represented by the pair

$$\lambda(M, w) = (q, (r_1, \ldots, r_k)), \tag{2}$$

where q is the total number of errors or false answers output by M in w, k is the total number of retractions performed by M in w, and $r_i$ is the stage of inquiry at which the ith retraction occurs. Happily, it turns out that one need only consider comparisons in which one cost sequence is as good as or better than another in each of the above dimensions (i.e., Pareto comparisons). Accordingly, let $(q, (r_1, \ldots, r_k)) \le (q', (r'_1, \ldots, r'_{k'}))$ if and only if $q \le q'$ and there exists a subsequence $(u_1, \ldots, u_k)$ of $(r'_1, \ldots, r'_{k'})$ such that for each i from 1 to k, $r_i \le u_i$. Then for cost pairs $v, v'$, define $v < v'$ iff $v \le v'$ but $v' \not\le v$.

A potential cost bound is like a cost pair except that the first infinite ordinal $\omega$ may occur. Potential cost bound $b$ is a cost bound on set X of cost pairs if and only if each $v$ in X is $\le b$. If $b, b'$ are both potential cost bounds, say that $b \le b'$ if and only if for each cost pair $v$, if $v \le b$ then $v \le b'$. Then each set X of cost pairs has a unique, least upper cost bound $\sup(X)$ (see Kelly 2007).

4. Empirical Complexity and Efficiency. No solution to the effect accounting problem achieves a nontrivial cost bound over the whole effect accounting problem, since each theory can be overturned by future effects in the arbitrarily remote future. Computational complexity theory (Aho et al. 1974) has long since sidestepped a similar conceptual difficulty by partitioning potential inputs into respective sizes (i.e., lengths) and by then examining worst case resource bounds over the finitely many inputs of a given length. In empirical problems, each input stream w has infinite length, but it remains natural to partition potential input streams by empirical complexity. After finite input sequence e has been received, let the conditional empirical complexity of w given e be defined as $c(w, e) = |S_w| - |S_e|$, where $|S|$ is the cardinality of S, and let the nth empirical complexity cell $C_e(n)$ given e be the set of all worlds w such that $c(w, e) = n$. Let M be an arbitrary solution to the effect accounting problem. Define the worst case loss of solution M over complexity class $C_e(n)$ as $\lambda_e(M, n) = \sup_{w \in C_e(n)} \lambda(M, w)$, where the supremum is understood in the sense of the preceding section.

Suppose that input sequence e has just been received and the question concerns the efficiency of one’s strategy M.
Since the past cannot be altered, the only relevant alternatives are strategies that produce the same answers as M along $e_-$ (recall that $e_-$ denotes the result of deleting the last entry of e). Say that such a strategy $M'$ agrees with M along $e_-$ (abbreviated $M' \equiv_{e_-} M$).

Given solutions $M, M'$, the following, natural, worst case performance comparisons can be defined at e:

$M \le_e M'$ iff $(\forall n)\ \lambda_e(M, n) \le \lambda_e(M', n)$;
$M <_e M'$ iff $M \le_e M'$ and $M' \not\le_e M$;
$M \prec_e M'$ iff $(\forall n)\ C_e(n) \neq \emptyset \Rightarrow \lambda_e(M, n) < \lambda_e(M', n)$.

These comparisons give rise to the following natural properties of strategies:

M is strongly beaten at e iff $(\exists$ solution $M' \equiv_{e_-} M)\ M' \prec_e M$;
M is beaten at e iff $(\exists$ solution $M' \equiv_{e_-} M)\ M' <_e M$;
M is efficient at e iff $(\forall$ solution $M' \equiv_{e_-} M)\ M' \ge_e M$.

A solution that is strongly beaten does worse than some alternative solution in worst case performance in each nonempty, empirical complexity cell. A solution that is beaten does worse than some solution in some complexity cell and no better in the rest of the cells. An efficient solution is as good as an arbitrary solution in worst case performance in each empirical complexity cell. One may speak of being efficient from e onward. Being strongly beaten implies being beaten, which implies inefficiency.

5. The New Solution. Here is the proposed efficiency argument for Ockham’s razor. The proof is in the appendix.

Theorem 1 (Ockham efficiency characterization). Let M solve the effect accounting problem. Let e be a finite input sequence. Then the following statements are equivalent:
1. M is stalwart and Ockham from e onward;
2. M is efficient from e onward;
3. M is never strongly beaten from e onward.
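The comparison behind Theorem 1 can be illustrated on a toy instance. The sketch below is hedged in two ways: it computes losses only along a finite prefix of one world in the complexity-zero cell (the worst case losses in the text are suprema over infinite worlds, which no finite simulation computes), and all names (`loss_on_prefix`, `pareto_leq`, `violator`) are hypothetical. The Pareto order on loss pairs is checked with a greedy subsequence match.

```python
from typing import Callable, FrozenSet, Sequence, Tuple, Union

Effects = FrozenSet[str]
Evidence = Tuple[Effects, ...]
Answer = Union[Effects, str]
Loss = Tuple[int, Tuple[int, ...]]   # (errors, retraction stages)


def effects_so_far(e: Evidence) -> Effects:
    return e[-1] if e else frozenset()


def loss_on_prefix(strategy: Callable[[Evidence], Answer],
                   prefix: Sequence[Effects], truth: Effects) -> Loss:
    """Errors and retraction stages of a strategy along a finite prefix.

    Stage n is the evidence of length n; stage 0 is the empty sequence.
    """
    errors, retractions, previous = 0, [], None
    for stage in range(len(prefix) + 1):
        answer = strategy(tuple(prefix[:stage]))
        if answer != "?" and answer != truth:
            errors += 1                      # a false informative answer
        if previous not in (None, "?") and answer != previous:
            retractions.append(stage)        # an informative answer dropped
        previous = answer
    return errors, tuple(retractions)


def pareto_leq(v: Loss, w: Loss) -> bool:
    """v <= w: no more errors, and each retraction time of v is bounded by
    the matching entry of some subsequence of w's retraction times."""
    (q, r), (q2, r2) = v, w
    if q > q2:
        return False
    pos = 0
    for r_i in r:
        while pos < len(r2) and r2[pos] < r_i:
            pos += 1                         # skip entries that are too early
        if pos == len(r2):
            return False
        pos += 1                             # consume the matched entry
    return True


def m0(e: Evidence) -> Answer:
    return effects_so_far(e)


def violator(e: Evidence) -> Answer:
    """Guesses an unseen effect 'x' before stage 3, then reverts to S_e."""
    return frozenset({"x"}) if len(e) < 3 else effects_so_far(e)


world_prefix = [frozenset()] * 6             # no effect ever appears
loss_m0 = loss_on_prefix(m0, world_prefix, frozenset())
loss_violator = loss_on_prefix(violator, world_prefix, frozenset())
```

On this prefix the Ockham strategy incurs no errors and no retractions, while the violator pays three errors and one retraction at stage 3, so the first loss is strictly below the second in the Pareto order.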
So the set of all solutions to the effect accounting problem is cleanly partitioned given e into two groups: the solutions that are stalwart, Ockham, and efficient from e onward and the solutions that are strongly beaten at some stage $e' \ge e$ due to future violations of the stalwart, Ockham property. As promised, the argument is a priori, normative, truth directed, and yet noncircular. The argument presumes no prior probabilistic bias, so there is no question of a circular appeal to such a bias. The argument is driven only by efficient convergence to the truth, so there is no bait-and-switch from truth finding to some other aim. There is no confusion between ‘confirmation’ and truth finding, since the concept of confirmation is never mentioned. There is no wishful presumption that the truth must be testable or nice in any other way. There is no appeal to the hidden hands of Providence, the Synthetic a Priori, Convention, or Evolution. There is nothing built into the argument other than a question, simplicity relative to the question, and efficient convergence to the true answer to the question.

Furthermore, the argument is stable in the sense that born again Ockhamism strongly beats recidivism at each contemplated violation, so past violations, no matter how severe, do not undermine the normative force of the argument at each moment. That is important, for Ockham violations are practically unavoidable in real science, either due to a failure to think of the simplest answer in time or due to spurious, auxiliary objections that are resolved only later.

The argument does not accomplish the impossible.
Ockham’s razor cannot be shown, without circularity, to point at or track the truth immediately, for some effects may be arbitrarily hard to detect given current technologies and sample sizes, in which case all possible, convergent strategies, Ockham strategies included, can be forced to retract their opinions any finite number of times. Nor can one demand a stronger notion of efficiency with respect to retractions and errors. (1) One cannot establish weak dominance for Ockham methods with respect to all problem instances jointly, because anticipation of unseen effects might be vindicated immediately, saving retractions that the Ockham method would have to perform when the effects appear. (2) Nor can one show that Ockham’s razor does best in terms of a global worst case bound over all problem instances (minimax theory), for such worst case bounds on errors and retractions are trivially infinite for all methods at every stage. (3) Nor can one show a decisive advantage for Ockham’s razor in terms of expected retractions. For example, if the question is whether one will see at least one effect, then the expected retractions of the obvious strategy $M(e) = S_e$ are less than those of an arbitrary Ockham violator only if the prior probability of the simpler answer is at least one half, so that if more than one complex world carries nonzero probability, no complex world is as probable as the simplest world, which begs the question in favor of simplicity.⁷ If the prior probability of the simple hypothesis drops below 0.5, the advantage lies not only with violating Ockham’s razor, but with violating it more rather than less. So Bayesians must either beg the question or rule strongly against Ockham.

7. Let $M_i$ be a non-Ockham strategy that starts by guessing answer ‘$\ge 1$’ until no effect is seen by stage i, at which point $M_i$ returns 0. If the effect is ever seen, $M_i$ returns answer ‘$\ge 1$’. Consider the competing Ockham method M that always guesses 0 until the effect is seen, at which time M returns answer ‘$\ge 1$’. Consider probabilities at stage 0. Let a denote the probability that no effect occurs, let b denote the probability that an effect occurs no later than stage i and let c denote the probability that an effect occurs after stage i. Then, a priori, the expected retractions of $M_i$ are given by $a + 2c$, whereas the expected retractions of M are $b + c$. So the Ockham strategy M does better when $a + c > b$. Since $a + c + b = 1$, this is true if and only if $b < 0.5$. By increasing i, one can drive c arbitrarily small (by countable additivity), so if the Ockham strategy is to beat the expected retractions of an arbitrary $M_i$, then $a \ge b$. That implies that each of the several (complex) possibilities over which mass b is distributed receives less probability than the simple world carrying probability a. This bias increases with i and with the number of ways the complex theory can be true.

Some applications, like the search for causal structure (Spirtes et al. 2000), imply a priori restrictions on the possible, finite sets of effects that correspond to possible answers. Let $\Gamma$ be the set of a priori possible, finite sets of effects that nature might reveal for eternity. Let $\Gamma_e$ denote the subset of $\Gamma$ whose elements are all consistent with e (where S is consistent with e if $S_e \subseteq S$). A directed path in $\Gamma_e$ is just an upwardly nested, finite sequence of elements of $\Gamma_e$. Now define the conditional empirical complexity $c(S, e)$ of world $S \in \Gamma_e$ given e as one less than the length of a longest path in $\Gamma_e$ terminating in S and let $c(w, e) = c(S_w, e)$. Theorem 1 extends to such cases (cf. Kelly 2008), except that the beating incurred by Ockham violators may fail to be strong when there is more than one simplest answer compatible with e.
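The path-length definition of complexity can be sketched as follows. This is an illustrative reading rather than the author's construction: it assumes that "upwardly nested" means strictly nested (each element a proper subset of the next), and the function name `conditional_complexity` is hypothetical.

```python
from functools import lru_cache
from itertools import combinations
from typing import FrozenSet, Iterable

Effects = FrozenSet[str]


def conditional_complexity(gamma: Iterable[Effects], s_e: Effects,
                           target: Effects) -> int:
    """c(S, e): one less than the length of a longest strictly nested
    chain in Gamma_e that terminates in S.

    Gamma_e keeps only the answers consistent with e, i.e. supersets of S_e.
    """
    gamma_e = [s for s in set(gamma) if s_e <= s]
    if target not in gamma_e:
        raise ValueError("target answer is not consistent with the evidence")

    @lru_cache(maxsize=None)
    def longest_chain_ending_at(s: Effects) -> int:
        # A chain ending at s is s alone, or a chain ending at a proper
        # subset of s within Gamma_e, extended by s.
        best = 1
        for t in gamma_e:
            if t < s:
                best = max(best, 1 + longest_chain_ending_at(t))
        return best

    return longest_chain_ending_at(target) - 1


# A restricted answer set: the answer {"b"} alone is ruled out a priori.
gamma = [frozenset(), frozenset({"a"}), frozenset({"a", "b"}), frozenset({"c"})]

# The unrestricted case over three potential effects, for comparison.
powerset = [frozenset(c) for n in range(4) for c in combinations("abc", n)]
```

When every finite set of effects is admitted, the chain length reduces to counting the effects not yet seen, which matches the earlier definition $c(w, e) = |S_w| - |S_e|$.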
The preceding approach still assumes that the theorist is fed pre-digested empirical effects, rather than raw experience itself. Here is a very general definition of empirical complexity that agrees with the preceding account when applied to pre-digested problems (cf. Kelly 2008). In general, an empirical problem $\mathcal{P}$ consists of a set K of possible input streams or worlds and an empirical question $\Pi$, which is just a partition of K into potential answers. No objectionable pre-digestion is assumed here: the successive inputs presented by $w \in K$ could be boolean bits in a highly ‘gruified’ coding scheme with an ocean of information irrelevant to the question $\Pi$ thrown in. If e is a finite input sequence, let $K_e$ denote the restriction of K to input streams extending finite input sequence e. Let $\pi$ be a finite sequence of answers drawn from $\Pi$. Say that $\pi$ is forcible by nature given finite input sequence e in $\mathcal{P}$ if and only if for each strategy M guaranteed to converge to the true answer in $\mathcal{P}$, there exists w in $K_e$ such that M responds to w, after the end of e, with a sequence of outputs of which $\pi$ is a subsequence. Let $S_e$ denote the set of all finite sequences of answers forcible in $\mathcal{P}$ given e. Restrict attention to the natural problems $\mathcal{P}$ in which $\lim_{i \to \infty} S_{w|i}$ exists, for each $w \in K$, and let $S_w$ denote this limit. Let $\Gamma_e$ denote the set of all $S_w$ such that $w \in K_e$. If $S, S' \in \Gamma_e$, say that $S \le S'$ if and only if for each $e'$ extending e such that $S_{e'} = S$, there exists $e''$ extending $e'$ such that $S_{e''} = S'$. Now define $c(w, e)$ in terms of longest $\le$-path length in $\Gamma_e$ to $S_w$, just as in the preceding paragraph. This definition of simplicity depends only upon the (semantic) structures of $K, \Pi$, so it is invariant under arbitrary, grue-like (Goodman 1983) recodings of the inputs (which leave the semantics of the problem intact).
Moreover, if $\mathcal{P}_\Gamma$ is the (pre-digested) sort of problem discussed in the preceding paragraph, then it can be shown that the complexity degree assignment $c(w, e)$ just defined is identical to the one defined in the preceding paragraph. Finally, applying this definition to problems that look, intuitively, like effect accounting problems (e.g., polynomial structure, causal structure, or conservation laws) identifies what intuition would point out as the empirical effects relevant to the question $\Pi$ given. So careful attention to truth finding efficiency provides not only a novel explanation of Ockham’s razor, but also a fresh perspective on the nature of simplicity, itself.

Appendix: Proof of Theorem 1.

$(2 \Rightarrow 3)$ is immediate from the definitions. For $(3 \Rightarrow 1)$, suppose that M violates Ockham’s razor or stalwartness at finite input sequence e. Let M be a solution that is stalwart and Ockham from $e'$ onward. Let $e \ge e'$ have length j. Then M is Ockham and stalwart from e onward. Let $M'$ be an arbitrary solution such that $M' \equiv_{e_-} M$. Let $r_1, \ldots, r_k$ be the retraction times for both M and $M'$ along $e_-$. Let q denote the number of times M produces an answer other than $S_e$ along $e_-$. Consider the hard case in which M retracts at e. Let $w \in C_e(0)$. In w, M retracts at e but never retracts after e and M produces only the true answer $S_e$ after e. Hence:

$$\lambda_e(M, 0) \le (q, (r_1, \ldots, r_k, j)). \tag{A1}$$

There exists $w_0 \in C_e(0)$ (just extend e by repeating $S_e$ forever). Then $M(e_-) = M'(e_-)$ is false in $w_0$. So since $M'$ is a solution, $M'$ converges to the true answer $S_e$ in $w_0$ at some point after $e_-$, which implies a retraction at some point no sooner than e. Hence:

$$\lambda_e(M', 0) \ge (q, (r_1, \ldots, r_k, j)) \ge \lambda_e(M, 0). \tag{A2}$$

If $C_e(n+1) = \emptyset$, then every method succeeds under the trivial bound $(0, ())$, so suppose that $C_e(n+1) \neq \emptyset$. Since M is a stalwart, Ockham solution, M retracts at most once at each new effect, so

$$\lambda_e(M, n+1) \le (\omega, (r_1, \ldots, r_k, j, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})). \tag{A3}$$

Let arbitrary natural number i be given. Since $M'$ is a solution, $M'$ eventually converges to $A_0 = S_e$ in $w_0$, so there exists $e_0$ such that $e \le e_0 < w_0$ by which $M'$ has retracted the false answer $M'(e_-)$ and has produced the true answer $A_0$ successively at least i times after the end of e, so $M'$ retracts at least as late as e in $e_0$. Then there exists $w_1 \in C_e(1)$ such that $e_0 < w_1$ (since $C_e(n+1) \neq \emptyset$, nature can choose some $x_0 \in E - A_0$ and extend $e_0$ forever with answer $A_1 = A_0 \cup \{x_0\}$). Again, $M'$ must converge to $A_1$ in $w_1$ and, therefore, produces $A_1$ successively at least i times by some initial segment $e_1$ of $w_1$ that extends $e_0$. Continuing in this manner, construct $w_{n+1} \in C_e(n+1)$. Then

$$\lambda(M', w_{n+1}) \ge (i, (r_1, \ldots, r_k, j, j + i, j + 2i, \ldots, j + (n+1)i)). \tag{A4}$$

Since i is arbitrary,

$$\lambda_e(M', n+1) \ge (\omega, (r_1, \ldots, r_k, j, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})) \ge \lambda_e(M, n+1). \tag{A5}$$

Now consider the easy case in which M does not retract at e. Then the argument is similar to that in the preceding case except that the retraction at j is dropped from all the bounds.

For the proof of $(1 \Rightarrow 2)$, let M be a solution that violates either Ockham’s razor or stalwartness at e of length j. Let $M'$ return $S_{e'}$ at each $e'$ such that $e' \in K_{\mathrm{fin}}$ and $e' \ge e$ and let $M'$ agree with M otherwise. Then $M' \equiv_{e_-} M$ by construction and $M'$ is evidently a solution. Let $r_1, \ldots, r_k$ be the retraction times for both M and $M'$ along e up to but not including the last entry in e.

Consider the case in which M violates Ockham’s razor at e. So for some $A \subseteq E$, $M(e) = A \neq S_e$. Let $w \in C_e(0)$. Then A is false in w and $S_e$ is true in w. Let q denote the number of times both M and $M'$ produce an answer other than $S_e$ along $e_-$. Since $M'$ produces the true answer $S_e$ at e in w and continues to produce it thereafter:

$$\lambda_e(M', 0) \le (q, (r_1, \ldots, r_k, j)).$$

There exists $w_0$ in $C_e(0)$ (just extend e forever with $S_e$). Since A is false in $w_0$ and M is a solution, M retracts A in $w_0$ at some stage greater than j, so

$$\lambda_e(M, 0) \ge \lambda(M, w_0) \ge (q + 1, (r_1, \ldots, r_k, j + 1)) > \lambda_e(M', 0). \tag{A6}$$

As in the proof of $(3 \Rightarrow 1)$, it suffices to consider the case in which $C_e(n+1) \neq \emptyset$. Since $M'$ produces $S_{e'}$ at each $e' \ge e$,

$$\lambda_e(M', n+1) \le (\omega, (r_1, \ldots, r_k, j, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})). \tag{A7}$$

Let $i \in \omega$. Answer $A = M(e)$ is false in $w_0$, so since M is a solution, M eventually converges to $A_0 = S_e$ in $w_0$, so there exists $e_0$ properly extending e by which M has produced $A_0$ successively at least i times after the end of e and M retracts A back to $A_0$ no sooner than stage $j + 1$. Now continue according to the recipe described in the proof of $(3 \Rightarrow 1)$ to construct $w_{n+1} \in C_e(n+1)$ such that:

$$\lambda(M, w_{n+1}) \ge (i, (r_1, \ldots, r_k, j + 1, j + i, j + 2i, \ldots, j + (n+1)i)). \tag{A8}$$

Since i is arbitrary,

$$\lambda_e(M, n+1) \ge (\omega, (r_1, \ldots, r_k, j + 1, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})) > \lambda_e(M', n+1). \tag{A9}$$

Next, consider the case in which M violates stalwartness at e. So $M(e_-) = S_e$ but $M(e) \neq S_e$. Let $w \in C_e(0)$. Let q denote the number of errors committed in w by both M and $M'$ along $e_-$. Since $M'(e_-) = S_e$, it follows that $M'$ does not retract in w from j onward, so:

$$\lambda_e(M', 0) \le (q, (r_1, \ldots, r_k)). \tag{A10}$$

Again, there exists $w_0$ in $C_e(0)$. Since M retracts at j,

$$\lambda_e(M, 0) \ge (q, (r_1, \ldots, r_k, j)) > \lambda_e(M', 0). \tag{A11}$$

Let $C_e(n+1) \neq \emptyset$. Since $M'$ produces $S_{e'}$ at each $e' \ge e$,

$$\lambda_e(M', n+1) \le (\omega, (r_1, \ldots, r_k, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})). \tag{A12}$$

Let arbitrary natural number i be given.
Since M retracts at j, one may continue according to the recipe described in the proof of $(3 \Rightarrow 1)$ to construct $w_{n+1}$ extending e in $C_e(n+1)$ such that:

$$\lambda(M, w_{n+1}) \ge (i, (r_1, \ldots, r_k, j, j + i, j + 2i, \ldots, j + (n+1)i)). \tag{A13}$$

Since i is arbitrary,

$$\lambda_e(M, n+1) \ge (\omega, (r_1, \ldots, r_k, j, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})) > \lambda_e(M', n+1). \tag{A14}$$

REFERENCES

Aho, A., J. Hopcroft, and J. Ullman (1974), The Design and Analysis of Computer Algorithms. New York: Addison-Wesley.
Akaike, H. (1973), “Information Theory and an Extension of the Maximum Likelihood Principle”, in B. N. Petrov and F. Csaki (eds.), The Second International Symposium on Information Theory. Budapest: Akadémiai Kiadó, 267–281.
Bonjour, L. (1985), The Structure of Empirical Knowledge. Cambridge, MA: Harvard University Press.
Carnap, R. (1950), Logical Foundations of Probability. Chicago: University of Chicago Press.
Daley, R., and C. Smith (1986), “On the Complexity of Inductive Inference”, Information and Control 69: 12–40.
Duda, R., D. Stork, and P. Hart (2000), Pattern Classification. Vol. 1. New York: Wiley.
Forster, M., and E. Sober (1994), “How to Tell When Simpler, More Unified, or Less Ad Hoc Theories Will Provide More Accurate Predictions”, British Journal for the Philosophy of Science 45: 1–35.
Friedman, M. (1983), Foundations of Space-Time Theories: Relativistic Physics and Philosophy of Science. Princeton, NJ: Princeton University Press.
Gärdenfors, P. (1988), Knowledge in Flux. Cambridge, MA: MIT Press.
Giere, R. (1985), “Philosophy of Science Naturalized”, Philosophy of Science 52: 331–356.
Gold, E. (1978), “Language Identification in the Limit”, Information and Control 10: 447–474.
Goodman, N. (1983), Fact, Fiction, and Forecast. Cambridge, MA: Harvard University Press.
Glymour, C. (1980), Theory and Evidence. Princeton, NJ: Princeton University Press.
Harman, G. (1965), “The Inference to the Best Explanation”, Philosophical Review 74: 88–95.
Jain, S., D. Osherson, J. Royer, and A. Sharma (1999), Systems That Learn. Cambridge, MA: MIT Press.
Kelly, K. (2002), “Efficient Convergence Implies Ockham’s Razor”, paper delivered at the 2002 International Workshop on Computational Models of Scientific Reasoning and Applications, Las Vegas, June 24–27.
Kelly, K. (2004), “Justification as Truth-Finding Efficiency: How Ockham’s Razor Works”, Minds and Machines 14: 485–505.
Kelly, K. (2007), “Ockham’s Razor, Empirical Complexity, and Truth-Finding Efficiency”, Theoretical Computer Science, 270–289.
Kelly, K. (2008), “Ockham’s Razor, Truth, and Information”, in J. Van Benthem and P. Adriaans (eds.), Philosophy of Information. Amsterdam: Elsevier, forthcoming.
Kelly, K., and C. Glymour (2004), “Why Probability Does Not Capture the Logic of Scientific Justification”, in C. Hitchcock (ed.), Contemporary Debates in the Philosophy of Science. Oxford: Blackwell, 94–114.
Leibniz, G. W. ([1714] 1875), Monadologie, in L. E. Loemker (ed.), Die Philosophischen Schriften von G. W. Leibniz, vol. 4. Berlin: Gerhardt, 607–623.
Popper, K. (1968), The Logic of Scientific Discovery. New York: Harper.
Putnam, H. (1965), “Trial and Error Predicates and a Solution to a Problem of Mostowski”, Journal of Symbolic Logic 30: 49–57.
Rissanen, J. (1983), “A Universal Prior for Integers and Estimation by Minimum Description Length”, Annals of Statistics 11: 416–431.
Rosenkrantz, R. (1983), “Why Glymour is a Bayesian”, in J. Earman (ed.), Testing Scientific Theories. Minneapolis: University of Minnesota Press, 69–98.
Salmon, W. (1967), The Logic of Scientific Inference. Pittsburgh: University of Pittsburgh Press.
Schulte, O. (1999), “Means-Ends Epistemology”, British Journal for the Philosophy of Science 50: 1–31.
——— (2000), “Inferring Conservation Principles in Particle Physics: A Case Study in the Problem of Induction”, British Journal for the Philosophy of Science 51: 771–806. Spirtes, P., C. Glymour, and R. Scheines (2000), Causation, Prediction, and Search. Cam- bridge, MA: MIT Press. Spirtes, P., and J. Zhang (2003), “Strong Faithfulness and Uniform Consistency in Causal Inference”, in Christopher Meek and Uffe Kjærulff (eds.), Proceedings of the 19th Conference in Uncertainty in Artificial Intelligence. San Mateo, CA: Kaufmann, 632– 639. van Fraassen, B. (1981), The Scientific Image. Oxford: Clarendon. Wasserman, L. (2004), All of Statistics: A Concise Course in Statistical Inference. New York: Springer. CHECKED 14 KEVIN T. KELLY Monday Feb 11 2008 03:36 PM PHOS v74n5 740502 JH QUERIES TO THE AUTHOR 1 Au: In the references, I changed: Kelly, K. (2002), Efficient Conver- gence Implies Ockhams Razor, paper delivered at the 2002 International Workshop on Computational Models of Scientific Reasoning and Ap- plications, Las Vegas, June 24–27. Correct as modified? 2 Au: For the Akaike reference, I changed the final o to an accented ó in the publisher name “Akadémiai Kiadó”; there was a stray “l” after the comma, which I have removed. Does this now appear to be correct? 3 Au: Please check the author date citations for Kelly in n. 2 very carefully. Should Kelly be cited twice in the last sentence of this note? Please note that Kelly 2005 is not in the reference list. For the Kelly 2007 ref. list entry, please add vol. number for this journal, if available. 4 Au: There is no square in the text to show where the proof ends; should a square be added?