Philosophy of Science, 74 (December 2007) pp. 000–000. 0031-8248/2007/7405-0002$10.00
Copyright 2007 by the Philosophy of Science Association. All rights reserved.

Monday Feb 11 2008 03:36 PM PHOS v74n5 740502 JH

Proof 1

A New Solution to the Puzzle of
Simplicity

Kevin T. Kelly†

Explaining the connection, if any, between simplicity and truth is among the deepest
problems facing the philosophy of science, statistics, and machine learning. Say that
an efficient truth finding method minimizes worst case costs en route to converging to
the true answer to a theory choice problem. Let the costs considered include the number
of times a false answer is selected, the number of times opinion is reversed, and the
times at which the reversals occur. It is demonstrated that (1) always choosing the
simplest theory compatible with experience, and (2) hanging onto it while it remains
simplest, is both necessary and sufficient for efficiency.

1. The Puzzle of Simplicity. Philosophy of science, statistics, and machine
learning all recommend the selection of simple theories or models on the
basis of empirical data, where simplicity has something to do with min-
imizing independent entities, principles, causes, or equational coefficients.
This intuitive preference for simplicity is called Ockham’s razor, after the
fourteenth century theologian and logician William of Ockham. But in
spite of its intuitive appeal, how could Ockham’s razor help us find the
true theory? For, in an updated version of Plato’s Meno paradox, if we
already know that the truth is simple, we don’t need Ockham’s help. And
if we don’t already know that the truth is simple, what entitles us to
assume that it is?

It does not help to say that simplicity is associated with other virtues
such as testability (Popper 1968), unity (Friedman 1983), better expla-
nations (Harman 1965), higher “confirmation” (Carnap 1950; Glymour
1980), or minimum description length (Rissanen 1983), since if the truth
were not simple, it would not have these nice properties either. To assume
otherwise is to engage in wishful thinking (van Fraassen 1981).

Over-fitting arguments (Akaike 1973; Forster and Sober 1994) show
that using a simple model for predictive purposes in the presence of random
noise can decrease the expected squared error of predictions. But
that is still the case when one knows in advance that the truth is complex,
so over-fitting arguments concern accuracy of prediction rather than finding
the true theory. Furthermore, if one is interested in predicting the
causal outcome of a policy on the basis of non-experimental data, the
prediction could end up far from the mark because the counterfactual
distribution after the policy is enacted may be quite different from the
distribution sampled (Spirtes and Zhang 2003). Finally, such arguments
work only in statistical settings, but Ockham’s razor seems no less compelling
in deterministic ones.

†To contact the author, please write to: Department of Philosophy, Carnegie Mellon
University, Baker Hall 135, Pittsburgh, PA 15213-3890; e-mail: kk3n@andrew.cmu.edu.

Nor is Ockham’s razor explained by a prior probabilistic bias in favor
of simple possibilities, for the propriety of a systematic bias in favor of
simplicity is precisely what is at issue. The argument remains circular even
if complex and simple theories receive equal prior probabilities, for the-
ories with more free parameters can be true in more ‘ways’, so that each
way the complex theory might be true ends up carrying less prior prob-
ability than each of the ways the simple theory might be true; that prior
bias toward simple possibilities is merely passed through Bayes’ theorem
(e.g., Rosenkrantz 1983 and the discussion of the Bayes information cri-
terion in Wasserman 2004).

There are noncircular, relevant arguments for Ockham’s razor, if one
is willing to grant premises far more dubious than the theories Ockham’s
razor is used to justify. Leibniz ([1714] 1875) appealed to the Creator’s
taste for elegance. More recently, some “naturalistic” philosophers and
machine learning researchers have replaced Providence with an equally
vague and optimistic appeal to Evolution (e.g., Giere 1985; Duda et al.
2000, 464–465). But whereas a sufficiently powerful and kind Deity could
save us from error in scientific questions never before encountered, it is
hardly clear how selective pressures on our hominid ancestors could do
so—unless Ockham’s razor is invoked to argue that our evolved penchant
for simplicity is a reliable guide to the truth in questions never before
encountered.

Even if Providence or Evolution did arrange the truth of simple theories
after a fashion that may remain eternally obscure, it would surely be nice,
in addition, to have a clear, normative argument to the effect that Ock-
ham’s razor is the most efficient possible method for finding the true
theory when the problem involves theory choice. This note presents just
such an argument.1 The idea is that it is hopeless to provide an a priori
explanation of how simplicity points at the truth immediately, since the
truth may depend upon subtle empirical effects that have not yet been
observed or even conceived. The best that Ockham’s razor could guarantee
a priori is to keep us on the straightest possible path to the truth,
allowing for unavoidable twists and turns along the way as new effects
are discovered, and that is just what it does guarantee. Readers who wish
to cut to the chase may prefer to peek immediately at Theorem 1 in Section
5 prior to reviewing the relevant definitions.

1. The approach is based on concepts from computational learning theory. An early
appearance of retractions as a fundamental cost of inquiry is in Putnam 1965. An
abstract theory of complexity of inductive inference is presented in Daley and Smith
1986. For a survey of results concerning retractions in inductive inference see Jain et
al. 1999. Earlier versions of the following argument may be found in Schulte 1999;
Kelly 2002, 2004; Kelly and Glymour 2004; and especially Kelly 2005, 2007, 2008.

2. Illustration: Empirical Effects. Suppose that you are interested in the
structure S of an unknown polynomial law

$$f(x) = \sum_{i \in S} a_i x^i, \tag{1}$$

where S is assumed to be a finite set of indices such that for each $i \in S$,
$a_i \neq 0$. It seems that structures involving fewer monomial terms are simpler,
so Ockham’s razor favors them. Suppose that patience and improvements
in measurement technology allow one to obtain ever tighter open
intervals around $f(x)$ for each specified value of x as time progresses.2
Suppose that the true degree is zero, so that f is a constant function. Each
finite collection of open intervals around values of f is compatible with
degree one (linearity), since there is always a bit of wiggle room within
finitely many open intervals to tilt the line. So suppose that the truth is
the tilted line that fits the data received so far. Eventually you can obtain
data from this line that refutes degree zero. Call such data a (first order)
effect. Any further, finite, amount of data collected for the linear theory
is compatible (due to the remaining minute wiggle room) with a quadratic
law, etc. The truth is assumed to be polynomial, so the story must end,
eventually, at some finite set S of effects. Thus, determining the true
polynomial law amounts, essentially, to determining the finite set S of all
monomial effects that one will ever see.

2. In statistics, the situation is analogous: increasing the sample size reduces the interval
estimates of the values of the function at each argument.

So conceived, empirical effects have the property that they never appear
if they do not exist but may appear arbitrarily late if they do exist.3 To
reduce the curve fitting problem to its essential elements, let E be a denumerable
set of potential effects and assume that at most finitely many
of these effects will ever occur. Assume that your laboratory merely reports
the finite set of all effects that have been detected so far, so an input
sequence is an upwardly nested sequence of finite subsets of E that converges
to some finite subset S of E. An input stream or empirical world is
an infinite input sequence. Let the effects presented in input sequence e
be denoted $S_e$. The true answer to the effect accounting problem in empirical
world w is then just $S_w$. Call this abstract problem the effect accounting
problem. The effect accounting problem reflects, approximately,
the structure of a number of naturally posed inference problems, such as
determining the set of independent variables a dependent variable depends
upon, determining quantum numbers from a set of reactions (Schulte
2000), and causal inference (Spirtes et al. 2000), in addition to the polynomial
inference problem already mentioned.4

3. In typical statistical applications, something similar is true: effects probably do not
appear at each sample size if they don’t exist and probably appear at some sample
size onward if they do exist. The data model under discussion may be viewed as a
logical approximation of the statistical situation, if one thinks of samples accumulating
through time.
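The data model just described can be made concrete with a short sketch. The following Python fragment is an illustration only and not part of the original argument: it assumes that effects are labeled by strings and that a finite input sequence is a tuple of frozensets, and the names `is_input_sequence` and `effects_so_far` are hypothetical.

```python
from typing import FrozenSet, Tuple

Effect = str                                   # hypothetical label for an effect in E
InputSequence = Tuple[FrozenSet[Effect], ...]  # finite prefix of an empirical world


def is_input_sequence(e: InputSequence) -> bool:
    """Check upward nesting: each report contains every earlier report."""
    return all(earlier <= later for earlier, later in zip(e, e[1:]))


def effects_so_far(e: InputSequence) -> FrozenSet[Effect]:
    """Return S_e, the effects presented along e (empty for the empty sequence)."""
    return e[-1] if e else frozenset()


# A prefix in which a linear effect is reported at the third input
# and a quadratic effect at the fifth.
e = (frozenset(), frozenset(), frozenset({"x^1"}), frozenset({"x^1"}),
     frozenset({"x^1", "x^2"}))
print(is_input_sequence(e))       # True
print(sorted(effects_so_far(e)))  # ['x^1', 'x^2']
```

Because the reports are cumulative, the last entry of a prefix already carries all the information in the prefix, which is why `effects_so_far` only inspects the final report.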

A strategy for effect accounting responds to an arbitrary input sequence
either with a finite set of effects or with ‘?’, indicating a refusal to choose.
Strategy M solves the effect accounting problem if and only if M converges
to the true set of effects $S_w$ in each empirical world w. One obvious solution
to the effect accounting problem is the strategy $M_0(e) = S_e$, which guesses
exactly the effects it has seen so far. If the possibility of infinitely many
effects were admitted, then the effect accounting problem would not be
solvable at all, due to a classic result by Gold (1978).
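As an illustration of what it means for a strategy to respond to input sequences, the sketch below writes the strategy that guesses exactly the effects seen so far as a Python function and lists its answers along a finite prefix of a world. The representation (frozensets of string labels, `None` for ‘?’) and the helper `outputs_along` are assumptions made for the example. Convergence concerns the infinite stream, so it cannot be verified on a prefix; the sketch only shows the answers stabilizing once no new effect is reported.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

InputSequence = Tuple[FrozenSet[str], ...]
Answer = Optional[FrozenSet[str]]            # None stands for '?'
Strategy = Callable[[InputSequence], Answer]


def guess_seen_effects(e: InputSequence) -> Answer:
    """Answer with exactly the effects reported so far."""
    return e[-1] if e else frozenset()


def outputs_along(strategy: Strategy, e: InputSequence) -> List[Answer]:
    """Answers on the successive nonempty prefixes of e."""
    return [strategy(e[:n]) for n in range(1, len(e) + 1)]


# Hypothetical world prefix: effect "a" is first reported at the third input.
prefix = (frozenset(), frozenset(), frozenset({"a"}), frozenset({"a"}))
answers = outputs_along(guess_seen_effects, prefix)
print([sorted(a) for a in answers])   # [[], [], ['a'], ['a']]
```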

Ockham’s razor is the principle that one should never output an infor-
mative answer unless that answer is among the simplest answers compatible
with experience. In the effect accounting problem, there is a uniquely simplest
answer compatible with experience e, namely, the set $S_e$ of effects
reported so far along e.5 Thus, strategy M is Ockham at e if and only if M
produces either $S_e$ or ‘?’ in response to finite input sequence e.

If the inputs received so far are $e = (e_0, \ldots, e_{n+1})$, then let the immediately
preceding evidential state be $e_- = (e_0, \ldots, e_n)$ (where $e_-$ is stipulated
to denote the empty sequence if e does). Say that solution M is stalwart at
e if and only if $M(e) = M(e_-)$ when $M(e_-) = S_e$; that is, if you are already
accepting the simplest answer, don’t drop it until it is no longer simplest.
One may speak of stalwartness and of Ockham’s razor as being satisfied
from e onward (i.e., at each extension $e'$ of e compatible with K).
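The two conditions can be stated as executable checks. This is a minimal sketch under the same assumed representation as before (tuples of frozensets, `None` for ‘?’); the two violating strategies, `anticipates_x` and `wavering`, are hypothetical examples invented for the illustration.

```python
from typing import Callable, FrozenSet, Optional, Tuple

InputSequence = Tuple[FrozenSet[str], ...]
Answer = Optional[FrozenSet[str]]            # None stands for '?'
Strategy = Callable[[InputSequence], Answer]


def effects_so_far(e: InputSequence) -> FrozenSet[str]:
    return e[-1] if e else frozenset()


def is_ockham_at(strategy: Strategy, e: InputSequence) -> bool:
    """Ockham at e: the answer is either S_e or '?'."""
    answer = strategy(e)
    return answer is None or answer == effects_so_far(e)


def is_stalwart_at(strategy: Strategy, e: InputSequence) -> bool:
    """Stalwart at e: if the previous answer was S_e, it is not dropped."""
    s_e = effects_so_far(e)
    previous = strategy(e[:-1])              # e[:-1] is empty when e is empty
    return previous != s_e or strategy(e) == previous


def anticipates_x(e: InputSequence) -> Answer:
    """Hypothetical violator: always adds the effect 'x' to its guess."""
    return effects_so_far(e) | {"x"}


def wavering(e: InputSequence) -> Answer:
    """Hypothetical violator: answers S_e on odd lengths, '?' on even ones."""
    return effects_so_far(e) if len(e) % 2 == 1 else None


e = (frozenset(), frozenset())
print(is_ockham_at(anticipates_x, e))                           # False
print(is_ockham_at(wavering, e), is_stalwart_at(wavering, e))   # True False
```

The second strategy shows that the two conditions are independent: refusing to answer never violates Ockham’s razor, but dropping the simplest answer while it is still simplest violates stalwartness.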

4. Strictly speaking, the inference of causal structure is more complicated because some
finite sets of effects are routinely ruled out a priori. A more general setting of that sort
is discussed below.

5. Again, this condition is false in some applications and its relaxation is discussed
below.

The simplicity puzzle now arises because, although every convergent
strategy must agree with an Ockham strategy eventually (since the true
structure $S_w$ in w is eventually the uniquely simplest structure compatible
with the data presented along w), convergence is compatible with arbitrarily
severe violations of Ockham’s razor and stalwartness in the short
run; for example, one could start with some complex answer $S \neq \emptyset$ and
retract back to $\emptyset$ if $S \neq S_e$ at stage 1000 (Salmon 1967). The trouble is
that there are infinitely many ways to converge to the truth in the accounting
problem, just as there are infinitely many algorithmic solutions
to a solvable computational problem. The nuances of programming practice,
the very stuff of textbook computer science, are derived not from
solvability itself, but from efficiency or computational complexity (e.g.,
the time or storage space required to find the right answer). The proposal
is that Ockham’s razor is similarly grounded in the efficiency of empirical
inquiry, rather than in mere convergence (solvability).

3. Costs of Inquiry. An obvious, doxastic cost of inquiry is the total num-
ber of times one’s strategy produces a false answer prior to convergence
to the true answer. Another is the number of times a conclusion is ‘taken
back’ or retracted prior to convergence, which corresponds to the degree
of ‘straightness’ of the path followed to the truth.6 One might also wish
to minimize the respective times by which these retractions occur, since
there is no point ‘living a lie’ longer than necessary or allowing subsidiary
conclusions to accumulate prior to being ‘flushed’ when the retraction
occurs. Taken together, these costs reflect the directness and timeliness
with which one surmounts obstacles on one’s way to the truth, and a
strategy that minimizes them can be said to have the strongest possible
connection with the truth. Insofar as epistemology is distinguishable from
‘psychologism’ by its regard for truth conduciveness (Bonjour 1985), min-
imization of retractions, retraction times, and errors is a properly epistemic
consideration—indeed, more so than coherence, plausibility, confirma-
tion, or rhetorical force. For a given strategy M and infinite input stream
w, let the total loss of M in w be represented by the pair

$$\lambda(M, w) = (q, (r_1, \ldots, r_k)), \tag{2}$$

where q is the total number of errors or false answers output by M in w,
k is the total number of retractions performed by M in w, and $r_i$ is the
stage of inquiry at which the ith retraction occurs.
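The loss pair can be tallied mechanically on a finite prefix. The sketch below is an approximation for illustration: the loss in the text is defined over an infinite input stream, whereas this function only counts what has happened so far. It assumes that the stage of an answer is the length of the input sequence that prompted it, and that a retraction occurs whenever an informative answer is followed by anything different (including ‘?’, represented as `None`). Both conventions and the name `loss_on_prefix` are assumptions of the example.

```python
from typing import FrozenSet, List, Optional, Tuple

Answer = Optional[FrozenSet[str]]   # None stands for '?'


def loss_on_prefix(outputs: List[Answer],
                   truth: FrozenSet[str]) -> Tuple[int, Tuple[int, ...]]:
    """Return (errors, retraction stages) for answers given at stages 1, 2, ..."""
    errors = sum(1 for answer in outputs if answer is not None and answer != truth)
    retraction_stages = []
    for stage in range(2, len(outputs) + 1):
        previous, current = outputs[stage - 2], outputs[stage - 1]
        # An informative answer is taken back when the next answer differs.
        if previous is not None and current != previous:
            retraction_stages.append(stage)
    return errors, tuple(retraction_stages)


truth = frozenset({"a"})
steady = [frozenset(), frozenset(), frozenset({"a"}), frozenset({"a"})]
dithering = [frozenset(), None, frozenset(), frozenset({"a"})]
print(loss_on_prefix(steady, truth))      # (2, (3,))
print(loss_on_prefix(dithering, truth))   # (2, (2, 4))
```

Both sequences of answers commit two errors, but the second takes back the same false answer twice, which is exactly the kind of difference the retraction component is meant to register.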

Happily, it turns out that one need only consider comparisons in which
one cost sequence is as good as or better than another in each of the above
dimensions (i.e., Pareto comparisons). Accordingly, let
$(q, (r_1, \ldots, r_k)) \leq (q', (r'_1, \ldots, r'_{k'}))$ if and only if $q \leq q'$ and there exists a subsequence
$(u_1, \ldots, u_k)$ of $(r'_1, \ldots, r'_{k'})$ such that for each i from 1 to k, $r_i \leq u_i$. Then
for cost pairs $v, v'$, define $v < v'$ iff $v \leq v'$ but $v' \not\leq v$.

6. Retractions are called mind-changes in computational learning theory (cf. Jain et
al. 1999) and contractions in the literature on belief revision (Gärdenfors 1988).

A potential cost bound is like a cost pair except that the first infinite
ordinal $\omega$ may occur. Potential cost bound b is a cost bound on set X of
cost pairs if and only if each v in X is $\leq b$. If $b, b'$ are both potential cost
bounds, say that $b \leq b'$ if and only if for each cost pair v, if $v \leq b$ then
$v \leq b'$. Then each set X of cost pairs has a unique, least upper cost bound
$\sup(X)$ (see Kelly 2007).
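The Pareto comparison of cost pairs amounts to a subsequence-matching test, which can be sketched as follows. The greedy scan used here is one way to decide whether a suitable subsequence exists; it is offered as an illustration and is not taken from the text. Cost pairs are assumed to be Python tuples `(errors, retraction_times)` with retraction times listed in increasing order, and the function names are hypothetical.

```python
from typing import Tuple

Cost = Tuple[int, Tuple[int, ...]]   # (number of errors, retraction times)


def cost_leq(cost: Cost, other: Cost) -> bool:
    """True iff cost is at least as good as other in every dimension."""
    (q, r), (q_other, r_other) = cost, other
    if q > q_other:
        return False
    position = 0
    for r_i in r:
        # Skip entries of the other sequence that are too early to dominate r_i.
        while position < len(r_other) and r_other[position] < r_i:
            position += 1
        if position == len(r_other):
            return False             # no later-or-equal retraction left to match
        position += 1                # matched entries must be used in order
    return True


def cost_lt(cost: Cost, other: Cost) -> bool:
    """Strictly better: at least as good, and not conversely."""
    return cost_leq(cost, other) and not cost_leq(other, cost)


print(cost_leq((2, (3,)), (2, (3, 7))))   # True
print(cost_leq((2, (3, 7)), (2, (3,))))   # False
print(cost_leq((1, (5,)), (2, (3,))))     # False
print(cost_lt((1, (2,)), (1, (3,))))      # True
```

The third comparison fails even though the first pair has fewer errors, because its single retraction comes later than any retraction in the second pair; the ordering is partial, not total.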

4. Empirical Complexity and Efficiency. No solution to the effect ac-
counting problem achieves a nontrivial cost bound over the whole effect
accounting problem, since each theory can be overturned by future effects
in the arbitrarily remote future. Computational complexity theory (Aho
et al. 1974) has long since sidestepped a similar conceptual difficulty by
partitioning potential inputs into respective sizes (i.e., lengths) and by
then examining worst case resource bounds over the finitely many inputs
of a given length. In empirical problems, each input stream w has infinite
length, but it remains natural to partition potential input streams by
empirical complexity. After finite input sequence e has been received, let
the conditional empirical complexity of w given e be defined as
$c(w, e) = |S_w| - |S_e|$, where $|S|$ is the cardinality of S, and let the nth
empirical complexity cell $C_e(n)$ given e be the set of all worlds w such that
$c(w, e) = n$. Let M be an arbitrary solution to the effect accounting problem.
Define the worst case loss of solution M over complexity class
$C_e(n)$ as $\lambda_e(M, n) = \sup_{w \in C_e(n)} \lambda(M, w)$, where the supremum is understood
in the sense of the preceding section.

Suppose that input sequence e has just been received and the question
concerns the efficiency of one’s strategy M. Since the past cannot be
altered, the only relevant alternatives are strategies that produce the same
answers as M along $e_-$ (recall that $e_-$ denotes the result of deleting the
last entry of e). Say that such a strategy $M'$ agrees with M along $e_-$ (abbreviated
$M' \equiv_{e_-} M$).

Given solutions $M, M'$, the following, natural, worst case performance
comparisons can be defined at e:

$M \leq_e M'$ iff $(\forall n)\ \lambda_e(M, n) \leq \lambda_e(M', n)$;

$M <_e M'$ iff $M \leq_e M'$ and $M' \not\leq_e M$;

$M \prec_e M'$ iff $(\forall n)\ C_e(n) \neq \emptyset \Rightarrow \lambda_e(M, n) < \lambda_e(M', n)$.



These comparisons give rise to three natural properties of strategies:

M is strongly beaten at e iff ($\exists$ solution $M' \equiv_{e_-} M$) $M' \prec_e M$;

M is beaten at e iff ($\exists$ solution $M' \equiv_{e_-} M$) $M' <_e M$;

M is efficient at e iff ($\forall$ solution $M' \equiv_{e_-} M$) $M' \geq_e M$.

A solution that is strongly beaten does worse than some alternative so-
lution in worst case performance in each nonempty, empirical complexity
cell. A solution that is beaten does worse than some solution in some
complexity cell and no better in the rest of the cells. An efficient solution
is as good as an arbitrary solution in worst case performance in each
empirical complexity cell. One may speak of being efficient from e onward.
Being strongly beaten implies being beaten, which implies inefficiency.
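To see how the cell-by-cell comparisons behave, the sketch below compares two hypothetical tables of worst case bounds, one entry per nonempty complexity cell. Several things here are assumptions of the illustration rather than claims of the text: only finitely many cells are tabulated (the definitions quantify over all n), `math.inf` stands in for the ordinal $\omega$, and the table entries are made-up values patterned on the kind of bounds derived in the appendix, for a violation at stage 5 with no earlier errors or retractions.

```python
import math
from typing import Dict, Tuple

Cost = Tuple[float, Tuple[float, ...]]   # math.inf stands in for omega


def cost_leq(cost: Cost, other: Cost) -> bool:
    """Pareto comparison of cost pairs via greedy subsequence matching."""
    (q, r), (q_other, r_other) = cost, other
    if q > q_other:
        return False
    position = 0
    for r_i in r:
        while position < len(r_other) and r_other[position] < r_i:
            position += 1
        if position == len(r_other):
            return False
        position += 1
    return True


def weakly_better(m: Dict[int, Cost], m_prime: Dict[int, Cost]) -> bool:
    """No worse in every tabulated cell."""
    return all(cost_leq(m[n], m_prime[n]) for n in m)


def strongly_beats(m: Dict[int, Cost], m_prime: Dict[int, Cost]) -> bool:
    """Strictly better in every tabulated (nonempty) cell."""
    return all(cost_leq(m[n], m_prime[n]) and not cost_leq(m_prime[n], m[n])
               for n in m)


w = math.inf
ockham = {0: (0, ()), 1: (w, (w,)), 2: (w, (w, w))}             # hypothetical bounds
violator = {0: (0, (5,)), 1: (w, (5, w)), 2: (w, (5, w, w))}    # one extra retraction

print(strongly_beats(ockham, violator))   # True
print(weakly_better(violator, ockham))    # False
```

In each cell the violator’s bound contains one retraction that cannot be matched by anything in the Ockham bound, which is what makes the comparison strict in every cell rather than in just one.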

5. The New Solution. Here is the proposed efficiency argument for Ock-
ham’s razor. The proof is in the appendix.

Theorem 1 (Ockham efficiency characterization). Let M solve the effect
accounting problem. Let e be a finite input sequence. Then the following
statements are equivalent:

1. M is stalwart and Ockham from e onward;
2. M is efficient from e onward;
3. M is never strongly beaten from e onward.

So the set of all solutions to the effect accounting problem is cleanly
partitioned given e into two groups: the solutions that are stalwart, Ockham,
and efficient from e onward and the solutions that are strongly
beaten at some stage $e' \geq e$ due to future violations of the stalwart, Ockham
property. As promised, the argument is a priori, normative, truth
directed, and yet noncircular. The argument presumes no prior proba-
bilistic bias, so there is no question of a circular appeal to such a bias.
The argument is driven only by efficient convergence to the truth, so there
is no bait-and-switch from truth finding to some other aim. There is no
confusion between ‘confirmation’ and truth finding, since the concept of
confirmation is never mentioned. There is no wishful presumption that
the truth must be testable or nice in any other way. There is no appeal
to the hidden hands of Providence, the Synthetic a Priori, Convention,
or Evolution. There is nothing built into the argument other than a ques-
tion, simplicity relative to the question, and efficient convergence to the
true answer to the question.

Furthermore, the argument is stable in the sense that born again Ockhamism
strongly beats recidivism at each contemplated violation, so past
violations, no matter how severe, do not undermine the normative force
of the argument at each moment. That is important, for Ockham viola-
tions are practically unavoidable in real science, either due to a failure to
think of the simplest answer in time or due to spurious, auxiliary objec-
tions that are resolved only later.

The argument does not accomplish the impossible. Ockham’s razor
cannot be shown, without circularity, to point at or track the truth im-
mediately, for some effects may be arbitrarily hard to detect given current
technologies and sample sizes, in which case all possible, convergent strat-
egies—Ockham strategies included—can be forced to retract their opin-
ions any finite number of times. Nor can one demand a stronger notion
of efficiency with respect to retractions and errors. (1) One cannot establish
weak dominance for Ockham methods with respect to all problem in-
stances jointly, because anticipation of unseen effects might be vindicated
immediately, saving retractions that the Ockham method would have to
perform when the effects appear. (2) Nor can one show that Ockham’s
razor does best in terms of a global worst case bound over all problem
instances (minimax theory), for such worst case bounds on errors and
retractions are trivially infinite for all methods at every stage. (3) Nor can
one show a decisive advantage for Ockham’s razor in terms of expected
retractions. For example, if the question is whether one will see at least
one effect, then the expected retractions of the obvious strategy $M(e) = S_e$
are less than those of an arbitrary Ockham violator only if the prior
probability of the simpler answer is at least one half, so that if more than
one complex world carries nonzero probability, no complex world is as
probable as the simplest world, which begs the question in favor of sim-
plicity.7 If the prior probability of the simple hypothesis drops below 0.5,
the advantage lies not only with violating Ockham’s razor, but with vi-
olating it more rather than less. So Bayesians must either beg the question
or rule strongly against Ockham.

7. Let $M_i$ be a non-Ockham strategy that starts by guessing answer $\geq 1$ until no effect
is seen by stage i, at which point $M_i$ returns 0. If the effect is ever seen, $M_i$ returns
answer $\geq 1$. Consider the competing Ockham method M that always guesses 0 until
the effect is seen, at which time M returns answer $\geq 1$. Consider probabilities at stage
0. Let a denote the probability that no effect occurs, let b denote the probability that
an effect occurs no later than stage i and let c denote the probability that an effect
occurs after stage i. Then, a priori, the expected retractions of $M_i$ are given by
$a + 2c$, whereas the expected retractions of M are $b + c$. So the Ockham strategy M does
better when $a + c > b$. Since $a + c + b = 1$, this is true if and only if $b < 0.5$. By increasing
i, one can drive c arbitrarily small (by countable additivity), so if the Ockham
strategy is to beat the expected retractions of an arbitrary $M_i$, then $a \geq b$. That implies
that each of the several (complex) possibilities over which mass b is distributed receives
less probability than the simple world carrying probability a. This bias increases with
i and with the number of ways the complex theory can be true.
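The arithmetic behind this comparison can be checked directly. In the sketch below, a, b, and c are the three probabilities named in the footnote; the function name and the two numerical priors are hypothetical, and exact fractions are used so that the comparison is not disturbed by floating point rounding.

```python
from fractions import Fraction
from typing import Tuple


def expected_retractions(a: Fraction, b: Fraction, c: Fraction) -> Tuple[Fraction, Fraction]:
    """Return (violator, ockham) expected retractions for priors a + b + c = 1."""
    assert a + b + c == 1
    # Violator: one retraction if no effect ever occurs, none if the effect
    # occurs by stage i, two if it occurs only after stage i.
    violator = a * 1 + b * 0 + c * 2
    # Ockham method: one retraction exactly when an effect occurs.
    ockham = (b + c) * 1
    return violator, ockham


# Prior favoring the simple answer (b < 1/2): Ockham expects fewer retractions.
print(expected_retractions(Fraction(3, 5), Fraction(3, 10), Fraction(1, 10)))

# Prior favoring an early effect (b > 1/2): the violator expects fewer.
print(expected_retractions(Fraction(3, 10), Fraction(3, 5), Fraction(1, 10)))
```

With the first prior the pair is (4/5, 2/5) and with the second it is (1/2, 7/10), so the advantage switches sides exactly as b crosses one half.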



Some applications, like the search for causal structure (Spirtes et al.
2000), imply a priori restrictions on the possible, finite sets of effects that
correspond to possible answers. Let $\Gamma$ be the set of a priori possible, finite
sets of effects that nature might reveal for eternity. Let $\Gamma_e$ denote the subset
of $\Gamma$ whose elements are all consistent with e (where S is consistent with
e if $S_e \subseteq S$). A directed path in $\Gamma_e$ is just an upwardly nested, finite sequence
of elements of $\Gamma_e$. Now define the conditional empirical complexity
$c(S, e)$ of world $S \in \Gamma_e$ given e as one less than the length of a longest
path in $\Gamma_e$ terminating in S and let $c(w, e) = c(S_w, e)$. Theorem 1 extends
to such cases (cf. Kelly 2008), except that the beating incurred by Ockham
violators may fail to be strong when there is more than one simplest
answer compatible with e.
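The path-length definition can be computed for a small, finite family of possible answers. The following sketch is illustrative only: it reads “upwardly nested” as strictly increasing under set inclusion (an assumption, since otherwise paths could repeat elements), represents the a priori possible effect sets as a collection of frozensets, and uses the hypothetical name `conditional_complexity`.

```python
from functools import lru_cache
from typing import FrozenSet, Iterable


def conditional_complexity(gamma: Iterable[FrozenSet[str]],
                           s_e: FrozenSet[str],
                           s: FrozenSet[str]) -> int:
    """One less than the length of a longest strictly nested path ending in s."""
    gamma_e = frozenset(a for a in gamma if s_e <= a)   # answers consistent with e
    if s not in gamma_e:
        raise ValueError("s is not an a priori possible answer consistent with e")

    @lru_cache(maxsize=None)
    def longest_path_ending_in(a: FrozenSet[str]) -> int:
        below = [b for b in gamma_e if b < a]           # proper subsets within gamma_e
        return 1 + max((longest_path_ending_in(b) for b in below), default=0)

    return longest_path_ending_in(s) - 1


empty = frozenset()
ab, abc = frozenset("ab"), frozenset("abc")

# Unrestricted case on {a, b}: complexity equals |S| - |S_e|.
unrestricted = [empty, frozenset("a"), frozenset("b"), ab]
print(conditional_complexity(unrestricted, empty, ab))    # 2

# Restricted case: singletons and pairs other than {a, b} are ruled out a priori.
restricted = [empty, ab, abc]
print(conditional_complexity(restricted, empty, abc))     # 2
```

In the restricted family the answer {a, b, c} has complexity 2 rather than 3, because the effects a and b cannot appear separately; complexity counts the steps nature can force, not the number of effects.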

The preceding approach still assumes that the theorist is fed pre-digested
empirical effects, rather than raw experience itself. Here is a very general
definition of empirical complexity that agrees with the preceding account
when applied to pre-digested problems (cf. Kelly 2008). In general, an
empirical problem $\mathcal{P}$ consists of a set K of possible input streams or worlds
and an empirical question $\Pi$, which is just a partition of K into potential
answers. No objectionable pre-digestion is assumed here: the successive
inputs presented by $w \in K$ could be boolean bits in a highly ‘gruified’
coding scheme with an ocean of information irrelevant to the question $\Pi$
thrown in. If e is a finite input sequence, let $K_e$ denote the restriction of
K to input streams extending finite input sequence e. Let $\pi$ be a finite
sequence of answers drawn from $\Pi$. Say that $\pi$ is forcible by nature given
finite input sequence e in $\mathcal{P}$ if and only if for each strategy M guaranteed
to converge to the true answer in $\mathcal{P}$, there exists w in $K_e$ such that M
responds to w, after the end of e, with a sequence of outputs of which $\pi$
is a subsequence. Let $S_e$ denote the set of all finite sequences of answers
forcible in $\mathcal{P}$ given e. Restrict attention to the natural problems $\mathcal{P}$ in which
$\lim_{i \to \infty} S_{w|i}$ exists, for each $w \in K$, and let $S_w$ denote this limit. Let $\Gamma_e$ denote
the set of all $S_w$ such that $w \in K_e$. If $S, S' \in \Gamma_e$, say that $S \leq S'$ if and only
if for each $e'$ extending e such that $S_{e'} = S$, there exists $e''$ extending $e'$
such that $S_{e''} = S'$. Now define $c(w, e)$ in terms of longest $\leq$-path length
in $\Gamma_e$ to $S_w$, just as in the preceding paragraph. This definition of simplicity
depends only upon the (semantic) structures of $K, \Pi$, so it is invariant
under arbitrary, grue-like (Goodman 1983) recodings of the inputs (which
leave the semantics of the problem intact). Moreover, if $\mathcal{P}_\Gamma$ is the (pre-digested)
sort of problem discussed in the preceding paragraph, then it
can be shown that the complexity degree assignment $c(w, e)$ just defined
is identical to the one defined in the preceding paragraph. Finally, applying
this definition to problems that look, intuitively, like effect accounting
problems (e.g., polynomial structure, causal structure, or conservation
laws) identifies what intuition would point out as the empirical effects
relevant to the question $\Pi$ given. So careful attention to truth finding
efficiency provides not only a novel explanation of Ockham’s razor, but
also a fresh perspective on the nature of simplicity, itself.

Appendix: Proof of Theorem 1.

$(2 \Rightarrow 3)$ is immediate from the definitions. For $(3 \Rightarrow 1)$, suppose that M
violates Ockham’s razor or stalwartness at finite input sequence e. Let M
be a solution that is stalwart and Ockham from $e'$ onward. Let $e \geq e'$ have
length j. Then M is Ockham and stalwart from e onward. Let $M'$ be an
arbitrary solution such that $M' \equiv_{e_-} M$. Let $r_1, \ldots, r_k$ be the retraction times
for both M and $M'$ along $e_-$. Let q denote the number of times M produces
an answer other than $S_e$ along $e_-$. Consider the hard case in which M
retracts at e. Let $w \in C_e(0)$. In w, M retracts at e but never retracts after
e and M produces only the true answer $S_e$ after e. Hence:

$$\lambda_e(M, 0) \leq (q, (r_1, \ldots, r_k, j)). \tag{A1}$$

There exists $w_0 \in C_e(0)$ (just extend e by repeating $S_e$ forever). Then
$M(e_-) = M'(e_-)$ is false in $w_0$. So since $M'$ is a solution, $M'$ converges to
the true answer $S_e$ in $w_0$ at some point after $e_-$, which implies a retraction
at some point no sooner than e. Hence:

$$\lambda_e(M', 0) \geq (q, (r_1, \ldots, r_k, j)) \geq \lambda_e(M, 0). \tag{A2}$$

If $C_e(n+1) = \emptyset$, then every method succeeds under the trivial bound
$(0, ())$, so suppose that $C_e(n+1) \neq \emptyset$. Since M is a stalwart, Ockham
solution, M retracts at most once at each new effect, so

$$\lambda_e(M, n+1) \leq (\omega, (r_1, \ldots, r_k, j, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})). \tag{A3}$$

Let arbitrary natural number i be given. Since $M'$ is a solution, $M'$ eventually
converges to $A_0 = S_e$ in $w_0$, so there exists $e_0$ such that $e \leq e_0 < w_0$
by which $M'$ has retracted the false answer $M'(e_-)$ and has produced
the true answer $A_0$ successively at least i times after the end of e, so $M'$
retracts at least as late as e in $e_0$. Then there exists $w_1 \in C_e(1)$ such that
$e_0 < w_1$ (since $C_e(n+1) \neq \emptyset$, nature can choose some $x_0 \in E - A_0$ and
extend $e_0$ forever with answer $A_1 = A_0 \cup \{x_0\}$). Again, $M'$ must converge
to $A_1$ in $w_1$ and, therefore, produces $A_1$ successively at least i times by
some initial segment $e_1$ of $w_1$ that extends $e_0$. Continuing in this manner,
construct $w_{n+1} \in C_e(n+1)$. Then

$$\lambda(M', w_{n+1}) \geq (i, (r_1, \ldots, r_k, j, j + i, j + 2i, \ldots, j + (n+1)i)). \tag{A4}$$


Since i is arbitrary,

$$\lambda_e(M', n+1) \geq (\omega, (r_1, \ldots, r_k, j, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})) \geq \lambda_e(M, n+1). \tag{A5}$$

Now consider the easy case in which M does not retract at e. Then the
argument is similar to that in the preceding case except that the retraction
at j is dropped from all the bounds.

For the proof of $(1 \Rightarrow 2)$, let M be a solution that violates either Ockham’s
razor or stalwartness at e of length j. Let $M'$ return $S_{e'}$ at each
$e' \in K_{\mathrm{fin}}$ such that $e' \geq e$ and let $M'$ agree with M otherwise. Then
$M' \equiv_{e_-} M$ by construction and $M'$ is evidently a solution. Let $r_1, \ldots, r_k$ be
the retraction times for both M and $M'$ along e up to but not including
the last entry in e.

Consider the case in which M violates Ockham’s razor at e. So for
some $A \subseteq E$, $M(e) = A \neq S_e$. Let $w \in C_e(0)$. Then A is false in w and
$S_e$ is true in w. Let q denote the number of times both M and $M'$ produce
an answer other than $S_e$ along $e_-$. Since $M'$ produces the true answer at
e in w and continues to produce it thereafter:

$$\lambda_e(M', 0) \leq (q, (r_1, \ldots, r_k, j)).$$

There exists $w_0$ in $C_e(0)$ (just extend e forever with $S_e$). Since A is false in
$w_0$ and M is a solution, M retracts A in $w_0$ at some stage greater than j,
so

$$\lambda_e(M, 0) \geq \lambda(M, w_0) \geq (q + 1, (r_1, \ldots, r_k, j + 1)) > \lambda_e(M', 0). \tag{A6}$$
As in the proof of $(3 \Rightarrow 1)$, it suffices to consider the case in which
$C_e(n+1) \neq \emptyset$. Since $M'$ produces $S_{e'}$ at each $e' \geq e$,

$$\lambda_e(M', n+1) \leq (\omega, (r_1, \ldots, r_k, j, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})). \tag{A7}$$

Let $i \in \omega$. Answer $A = M(e)$ is false in $w_0$, so since M is a solution, M
eventually converges to $A_0 = S_e$ in $w_0$, so there exists $e_0$ properly extending
e by which M has produced $A_0$ successively at least i times after the end
of e and M retracts A back to $A_0$ no sooner than stage $j + 1$. Now continue
according to the recipe described in the proof of $(3 \Rightarrow 1)$ to construct
$w_{n+1} \in C_e(n+1)$ such that:

$$\lambda(M, w_{n+1}) \geq (i, (r_1, \ldots, r_k, j + 1, j + i, j + 2i, \ldots, j + (n+1)i)). \tag{A8}$$

Since i is arbitrary,

$$\lambda_e(M, n+1) \geq (\omega, (r_1, \ldots, r_k, j + 1, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})) > \lambda_e(M', n+1). \tag{A9}$$



Next, consider the case in which M violates stalwartness at e. So
$M(e_-) = S_e$ but $M(e) \neq S_e$. Let $w \in C_e(0)$. Let q denote the number of
errors committed in w by both M and $M'$ along $e_-$. Since $M'(e_-) = S_e$,
it follows that $M'$ does not retract in w from j onward, so:

$$\lambda_e(M', 0) \leq (q, (r_1, \ldots, r_k)). \tag{A10}$$

Again, there exists $w_0$ in $C_e(0)$. Since M retracts at j,

$$\lambda_e(M, 0) \geq (q, (r_1, \ldots, r_k, j)) > \lambda_e(M', 0). \tag{A11}$$

Let $C_e(n+1) \neq \emptyset$. Since $M'$ produces $S_{e'}$ at each $e' \geq e$,

$$\lambda_e(M', n+1) \leq (\omega, (r_1, \ldots, r_k, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})). \tag{A12}$$

Let arbitrary natural number i be given. Since M retracts at j, one may
continue according to the recipe described in the proof of $(3 \Rightarrow 1)$ to
construct $w_{n+1}$ extending e in $C_e(n+1)$ such that:

$$\lambda(M, w_{n+1}) \geq (i, (r_1, \ldots, r_k, j, j + i, j + 2i, \ldots, j + (n+1)i)). \tag{A13}$$

Since i is arbitrary,

$$\lambda_e(M, n+1) \geq (\omega, (r_1, \ldots, r_k, j, \underbrace{\omega, \ldots, \omega}_{n+1 \text{ times}})) > \lambda_e(M', n+1). \tag{A14}$$

REFERENCES

Aho, A., J. Hopcroft, and J. Ullman (1974), The Design and Analysis of Computer Algorithms.
New York: Addison-Wesley.

Akaike, H. (1973), “Information Theory and an Extension of the Maximum Likelihood
Principle”, in B. N. Petrov and F. Csaki (eds.), The Second International Symposium
on Information Theory. Budapest: Akadémiai Kiadó, 267–281.

Bonjour, L. (1985), The Structure of Empirical Knowledge. Cambridge, MA: Harvard
University Press.

Carnap, R. (1950), Logical Foundations of Probability. Chicago: University of Chicago Press.

Daley, R., and C. Smith (1986), “On the Complexity of Inductive Inference”, Information
and Control 69: 12–40.

Duda, R., D. Stork, and P. Hart (2000), Pattern Classification. Vol. 1. New York: Wiley.

Forster, M., and E. Sober (1994), “How to Tell When Simpler, More Unified, or Less Ad
Hoc Theories Will Provide More Accurate Predictions”, British Journal for the
Philosophy of Science 45: 1–35.

Friedman, M. (1983), Foundations of Space-Time Theories: Relativistic Physics and
Philosophy of Science. Princeton, NJ: Princeton University Press.

Gärdenfors, P. (1988), Knowledge in Flux. Cambridge, MA: MIT Press.

Giere, R. (1985), “Philosophy of Science Naturalized”, Philosophy of Science 52: 331–356.

Glymour, C. (1980), Theory and Evidence. Princeton, NJ: Princeton University Press.

Gold, E. (1978), “Language Identification in the Limit”, Information and Control 10:
447–474.

Goodman, N. (1983), Fact, Fiction, and Forecast. Cambridge, MA: Harvard University
Press.


Harman, G. (1965), “The Inference to the Best Explanation”, Philosophical Review 74: 88–95.

Jain, S., D. Osherson, J. Royer, and A. Sharma (1999), Systems That Learn. Cambridge,
MA: MIT Press.

Kelly, K. (2002), “Efficient Convergence Implies Ockham’s Razor”, paper delivered at the
2002 International Workshop on Computational Models of Scientific Reasoning and
Applications, Las Vegas, June 24–27.

——— (2004), “Justification as Truth-Finding Efficiency: How Ockham’s Razor Works”,
Minds and Machines 14: 485–505.

——— (2007), “Ockham’s Razor, Empirical Complexity, and Truth-Finding Efficiency”,
Theoretical Computer Science, 270–289.

——— (2008), “Ockham’s Razor, Truth, and Information”, in J. Van Benthem and
P. Adriaans (eds.), Philosophy of Information. Amsterdam: Elsevier, forthcoming.

Kelly, K., and C. Glymour (2004), “Why Probability Does Not Capture the Logic of
Scientific Justification”, in C. Hitchcock (ed.), Contemporary Debates in the Philosophy of
Science. Oxford: Blackwell, 94–114.

Leibniz, G. W. ([1714] 1875), Monadologie, in L. E. Loemker (ed.), Die Philosophischen
Schriften von G. W. Leibniz, vol. 4. Berlin: Gerhardt, 607–623.

Popper, K. (1968), The Logic of Scientific Discovery. New York: Harper.

Putnam, H. (1965), “Trial and Error Predicates and a Solution to a Problem of Mostowski”,
Journal of Symbolic Logic 30: 49–57.

Rissanen, J. (1983), “A Universal Prior for Integers and Estimation by Minimum Description
Length”, Annals of Statistics 11: 416–431.

Rosenkrantz, R. (1983), “Why Glymour is a Bayesian”, in J. Earman (ed.), Testing Scientific
Theories. Minneapolis: University of Minnesota Press, 69–98.

Salmon, W. (1967), The Logic of Scientific Inference. Pittsburgh: University of Pittsburgh
Press.

Schulte, O. (1999), “Means-Ends Epistemology”, British Journal for the Philosophy of
Science 50: 1–31.

——— (2000), “Inferring Conservation Principles in Particle Physics: A Case Study in the
Problem of Induction”, British Journal for the Philosophy of Science 51: 771–806.

Spirtes, P., C. Glymour, and R. Scheines (2000), Causation, Prediction, and Search.
Cambridge, MA: MIT Press.

Spirtes, P., and J. Zhang (2003), “Strong Faithfulness and Uniform Consistency in Causal
Inference”, in Christopher Meek and Uffe Kjærulff (eds.), Proceedings of the 19th
Conference in Uncertainty in Artificial Intelligence. San Mateo, CA: Kaufmann, 632–639.

van Fraassen, B. (1981), The Scientific Image. Oxford: Clarendon.

Wasserman, L. (2004), All of Statistics: A Concise Course in Statistical Inference. New York:
Springer.

