Climate Models, Calibration and Confirmation

Katie Steele and Charlotte Werndl∗
k.s.steele@lse.ac.uk, c.s.werndl@lse.ac.uk
Department of Philosophy, Logic and Scientific Method
London School of Economics

This article has been accepted for publication in The British Journal for the Philosophy of Science published by Oxford University Press. Published version: British Journal for the Philosophy of Science, 64 (3), pp. 609–635. DOI: 10.1093/bjps/axs036.

April 23, 2012

Abstract

We argue that concerns about double-counting—using the same evidence both to calibrate or tune climate models and also to confirm or verify that the models are adequate—deserve more careful scrutiny in climate modelling circles. It is widely held that double-counting is bad and that separate data must be used for calibration and confirmation. We show that this is far from obviously true, and that climate scientists may be confusing their targets. Our analysis turns on a Bayesian/relative-likelihood approach to incremental confirmation. According to this approach, double-counting is entirely proper. We go on to discuss plausible difficulties with calibrating climate models, and we distinguish more and less ambitious notions of confirmation. Strong claims of confirmation may not, in many cases, be warranted, but it would be a mistake to regard double-counting as the culprit.

∗ Authors are listed alphabetically; this work is fully collaborative.

Contents

1 Introduction
2 Remarks about models and adequacy-for-purpose
3 Evidence for calibration can also yield comparative confirmation
  3.1 Double-counting I
  3.2 Double-counting II
4 Climate science examples: comparative confirmation in practice
  4.1 Confirmation due to better and worse best fits
  4.2 Confirmation due to more and less plausible forcings values
5 Old evidence
6 Doubts about the relevance of past data
7 Non-comparative confirmation and catch-alls
8 Climate science example: non-comparative confirmation and catch-alls in practice
9 Concluding remarks
References

1 Introduction

Climate scientists express concern about the practice of ‘calibrating’ climate models to observational data (another widely-used word for ‘calibration’ is ‘tuning’). Calibration occurs when a model includes parameters or forcings about which there is much uncertainty, and the value of the parameter or forcing is determined by finding best fit with the data. That is, the parameter or forcing in question is effectively a free parameter, and calibration determines which value(s) for the free parameter best explain(s) the data. A prominent example, which we refer to later, is the fitting of the aerosol forcing. The apparent concern about calibration is that it may result or always results in data being double-counted: data used to construct the fully-specified model is also used to evaluate the model’s accuracy, in a problematic way. Indeed, various climate scientists worry about circular reasoning:

In addition some commentators feel that there is an unscientific circularity in some of the arguments provided by GCMers [general circulation modelers]; for example, the claim that GCMs may produce a good simulation sits uneasily with the fact that important aspects of the simulation rely upon [...] tuning. (Shackley et al. 1998, 170).

This is just one particularly suggestive quote about the badness of double-counting. But what exactly is the badness here? We will see that this depends crucially on the details.

This paper seeks to clarify and evaluate worries surrounding calibration and double-counting. We appeal to statements made by various climate scientists, but our aim is not to rebut particular individuals. Our main concern is that, in general, climate scientists’ statements about calibration/tuning/double-counting do not attend to the details, and are, at worst, misleading. A number of different issues are bundled together as the ‘problem of double-counting’, and each of these issues deserves to be carefully articulated.

It is necessary to introduce some terminology. Calibration is introduced above. Confirmation refers to the evaluation of a model’s accuracy for particular purposes.1 Note also that there is an important difference between incremental and absolute confirmation; the former concerns whether confidence in a model hypothesis has increased, while the latter concerns whether confidence in a model hypothesis is sufficient, or above some threshold. This paper focuses on (varieties of) incremental confirmation.2 The central question is: what is the inevitable/proper relationship between calibration and confirmation?

1 Some authors, e.g., Frame et al. (2007), use the term ‘verification’ in lieu of ‘confirmation’. We use the latter term in the interests of making a clear connection with the philosophical literature.
2 From now on, when we use the term ‘confirmation’ we mean incremental confirmation, unless otherwise indicated. As will become clear, we distinguish two varieties of incremental confirmation: comparative and non-comparative.
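To fix intuitions about the incremental/absolute distinction, here is a minimal numerical sketch in Python; all numbers are invented for illustration, and nothing in the example is specific to climate models.

    # Minimal sketch (invented numbers): incremental vs. absolute confirmation.
    prior = 0.2            # initial probability of some model hypothesis H
    pr_E_given_H = 0.6     # Pr(E | H)
    pr_E_given_notH = 0.3  # Pr(E | not-H)

    # Bayes' theorem
    posterior = (pr_E_given_H * prior) / (
        pr_E_given_H * prior + pr_E_given_notH * (1 - prior))

    print(round(posterior, 2))  # 0.33 > 0.2, so H is incrementally confirmed by E
    # Whether H is confirmed in the absolute sense depends on a further threshold,
    # say 0.9; 0.33 falls well short, so incremental confirmation is the weaker notion.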
Some climate scientists appear to claim that calibration is bad and should therefore be avoided : Climate change simulations should, in general, only incorporate forcings for which the magnitude and uncertainty have been quan- tified using basic physical and chemical principles (Rodhe et al. 2000, 421). This statement may be stronger than the authors intend.3 In any case, an anti-calibration position is not defensible, because it would preclude refining models in response to observational evidence. This is common practice in all areas of science. In short, whatever the details of the relationship between calibration and confirmation, it had better be the case that calibration is not something that is bad or needs to be avoided. Our main target here is the widespread view amongst climate scientists that calibration and confirmation should be kept ‘separate’. The following quotes suggest that evidence used in calibration should not (or cannot) yield incremental confirmation; only separate data, not already used for calibra- tion, can boost confidence in a model. In other words, tuning is fine if it simply amounts to calibration, but double-counting is not fine: The inverse calculations [calibration] are also based on sound physical principles. However, to the extent that climate models rely on inverse calculations, the possibility of circular reasoning arises—that is, using the temperature record to derive a key input to climate models that are then tested against the temperature record (Anderson et al. 2003, 1103). If the model has been tuned to give a good representation of a particular observed quantity, the agreement with that observation 2From now on, when we use the term ‘confirmation’ we mean incremental confirma- tion, unless otherwise indicated. As will become clear, we distinguish two varieties of incremental confirmation: comparative and non-comparative. 3Perhaps the authors want to exclude forcings that have no physical plausibility at all, rather than forcings that merely cannot be well quantified. 4 cannot be used to build confidence in that model (IPCC report – Randall and Wood 2007, 596, our underlining). Indeed, the need for separate data for calibration and confirmation is usually simply taken for granted in the climate science literature, or else the reasoning is ambiguous.4 But this position is far from being obviously true, and requires further argument. The first part of the paper argues that separate data for calibration and confirmation is not an uncontroversial tenet of confirmation logic, because it does not follow (in fact, quite the contrary) from at least one major approach to confirmation—the Bayesian approach.5 After some remarks in Section 2 about climate models and adequacy-for-purpose that are useful to bear in mind throughout the discussion, in Sections 3 and 4, we demonstrate, using a very basic model and examples from climate science, that evidence may be used to calibrate and also to incrementally confirm a model relative to another model (we call this comparative confirmation). We then go on to address some complicating issues—reasons why in some contexts data are useless for calibration or confirmation. Some climate scien- tists’ worries about double-counting are most charitably reconstructed along these lines, i.e. as concerning the inapplicability, rather than the inherent badness of double-counting. 
Section 5 considers the issue of ‘old evidence’— if evidence already informs the prior probability distribution over models, it cannot be used a second time over for further calibration and confirmation. Section 6 discusses the worry that past data are irrelevant for model ade- quacy in the future and hence cannot be used for calibration or confirmation. Section 7 discusses a different sense of incremental confirmation that cli- mate scientists may have in mind: non-comparative confirmation, which con- cerns our confidence in a model tout court, i.e. relative to its entire comple- ment. While also for non-comparative confirmation evidence may be used to calibrate and confirm a model, the worry arises that climate models are based on assumptions that may be wrong, especially in the future; hence there is 4See, for instance, Anderson et al. (2003, 1103), Knutti (2008, 4651), Knutti (2010, 399), IPCC report – Randall and Wood (2007, 596), Shackley et al. (1998, 170), and Tebaldi and Knutti (2007, 2070). 5We try to deal minimally in Bayesian assumptions that may be objectionable to some readers, chiefly, prior probabilities. While we restrict our attention to Bayesian confirma- tion logic, the lessons apply more broadly, and we note this where appropriate. In any case, our aim is simply to show that it is not uncontroversial to claim that separate data must be used for calibration and confirmation. 5 considerable uncertainty about the full space of models, implying that data will not confirm a model. Section 8 presents an example from climate science that brings these subtler issues to the fore. The paper ends with a conclusion in Section 9. Let us now turn to the remarks about the predictive purposes of climate models and how this bears on what evidence is relevant for assessing them. 2 Remarks about models and adequacy-for-purpose A variety of climate models are used to study the Earth’s climate. In the words of Parker (2010, 1084): [Climate models] range from the highly simplified to the extremely complex and are constructed with the goal of simulating in greater or lesser detail the transport of mass, energy, moisture, and other quantities by various processes in the climate system. These pro- cesses include the movement of large-scale weather systems, the formation of clouds and precipitation, ocean currents, the melt- ing of sea ice, the absorption and emission of radiation by atmo- spheric gases, and many others. Climate scientists note that many of the aforementioned processes are still poorly understood, and, moreover, that these processes can typically be only approximated in a model, even one of maximum possible precision. Consequently, it is clear from the outset that climate models will not cor- rectly represent or predict the target systems in all their details. This means that climate models themselves cannot be confirmed. As Parker (2009) has convincingly argued, instead what can be confirmed is the adequacy of cli- mate models for particular purposes. The hypotheses about the purposes of climate models need to be specified by climate scientists. A prime example of such a hypothesis is: ‘this climate model with these initial conditions is adequate for predicting the mean surface temperature changes within 0.5 de- grees in the next 50 years under this emission scenario’. In climate science typically some model error is allowed. Therefore, an important part of specifying the hypothesis about the purpose of a model is to state the assumptions about the model error. 
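As a rough illustration, the example hypothesis above might be operationalised as follows; the function, numbers and tolerance are our own illustrative assumptions rather than part of any particular climate study.

    # Hypothetical sketch: an adequacy-for-purpose hypothesis with an explicit
    # error tolerance, modelled on the 0.5-degree example in the text.
    def adequate_for_purpose(simulated_change, actual_change, tolerance=0.5):
        """Is the simulated mean surface temperature change within `tolerance`
        degrees Celsius of the actual change over the prediction period?"""
        return abs(simulated_change - actual_change) <= tolerance

    # e.g. a model simulating a 1.8 degree change when the realised change is 2.1
    print(adequate_for_purpose(1.8, 2.1))  # True: within the stated margin of error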
There are two main kinds of error. First, for discrete model error all that counts is whether the actual out- come is within a certain distance from the simulated outcome, e.g., whether the actual and simulated mean surface temperature is less than 0.5oC apart. 6 Second, there is probabilistic model error when the error is described by a probability distribution. To give a simple example, the error might be mod- eled by a Gaussian distribution around the true value. In this framework of adequacy-for-purpose one needs to be cautious about what data are actually relevant to assess whether a model fulfills a particu- lar purpose. We have to determine the observational consequences that are likely to follow if the model is adequate; the data about these consequences will then be relevant. To come back to our example about mean surface temperature changes: here many will regard past temperature changes as relevant (although we return to this issue later in Section 6). However, it is less clear whether, e.g., past precipitation changes are relevant. As Parker (2009) has argued, if climate scientists have obtained a good understanding of the relation between mean surface temperature changes and precipitation changes, then precipitation changes will be relevant. However, when lacking any knowledge about the interdependence of these two variables, then pre- cipitation changes will not be relevant. Which data are relevant is crucial for two reasons: only relevant data can confirm or disconfirm the adequacy of a model and can meaningfully be used to calibrate the free parameters of a model. This paper does not, for the most part, focus on the question of what data are relevant to assess a model’s adequacy for purpose. General points about the suitability of data for confirmation will, however, become important in Sections 5 and 6. Here it is just important to realise that this question is a separate issue and should not be confused with the worry of double-counting. That is, if data are not relevant to a model’s adequacy for purpose, then test- ing the model against the data even once would be counting the data one too many times; likewise, calibrating the free parameters of the model against the data would be counting the data one too many times. The next section discusses calibration/double-counting in the context of more simple models. The aim is to elucidate calibration vis-à-vis Bayesian confirmation. 7 3 Evidence for calibration can also yield comparative confirma- tion Here we argue against the view that double-counting, in the sense of using evidence for both calibration and confirmation, is obviously bad practice. We show that, by Bayesian or likelihoodist standards at least, double-counting simply amounts to using evidence in a regular and proper way. This is best demonstrated in the context of comparing two well-specified hypotheses. We distinguish two interpretations of double-counting—I (subsection 3.1) and II (subsection 3.2)—because the legitimacy of the latter is more controversial than the former. 3.1 Double-counting I Let us start with a straightforward case, and then add complexity. Consider just one type of base model with very simple structure: a linear relationship between variables y and t. Because, as outlined in the previous section, cli- mate scientists typically allow for model error, we will assume a probabilistic model error term that is distributed normally with standard deviation σ:6 L : y(t) = αt + β + N(0,σ). 
(1) The Bayesian account of model calibration depends crucially on the follow- ing setup: there is a whole family of specific instances of the base model L, where each specific instance has particular values for the unknown parame- ters or forcings α and β. For instance, assume that possible values for α are {1, 2, 3, 4}, and likewise for β. So the scientist associates with L a (discrete) set of specific model instances that we might label L1,1, L1,2, . . ., where the subscripts indicate the values for α and β. Calibration of L then just amounts to comparing specific instances of the base model—L1,1, L1,2, . . .—with respect to the data, i.e. observed values for y(t). Of course, strictly speaking, what we are comparing are model hypothe- ses; assume that the hypotheses here postulate that the model in question accurately describes the data generation process for y(t). Calibration is sim- ply the common practice of testing hypotheses against evidence. Given the probabilistic error term, none of the hypotheses L1,1,L1,2, . . ., can be falsified 6Alternatively, the error term could be interpreted as observational error or as a com- bined term for observational error and model error. We focus on model error because it seems particularly widespread in climate science papers. However, all we say carries over to any other interpretation of the error term. 8 by the data, even if the data lies very far away from the specified line. Note also that since the model error is probabilistic, the hypotheses are mutually exclusive. This is important: calibration is best understood as the compar- ison, given new evidence, of the mutually exclusive hypotheses constituting a base model.7 Calibration, understood in this way, may well result in confirmation of Li,j, say, with respect to Lk,l. By Bayesian logic, the extent of confirmation depends on the likelihood ratio: Pr(E|Li,j)/Pr(E|Lk,l), where Pr(E|Li,j) is just the probability, Pr, of the evidence, E, i.e. the observed data points, given the model Li,j. 8 The likelihoods are related, in a manner that depends on the assumed error probability distribution (in our case Gaussian), to the sum-of-squares distance of the data points from the line. If the likelihood ratio is greater than 1, then Li,j is confirmed by the data relative to Lk,l, and vice versa if the likelihood ratio is less than 1. When the likelihood ratio equals 1, neither hypothesis is confirmed relative to the other. Note that the relative posterior (post-evidence) probabilities of Li,j and Lk,l is a further matter of absolute rather than incremental confirmation (cf. comments in Section 1); absolute confirmation depends also on their relative prior (ini- tial) probabilities.9 7Where model error is discrete, identifying mutually exclusive model hypotheses is more complicated. For instance, consider a simple example of two hypotheses involving discrete model error: L1,1 is the hypothesis that y(t) = t + 1 accurately predicts y(t) within ±2, and L1,2 is the hypothesis that y(t) = t + 2 accurately predicts y(t) within ±2. These two hypotheses could both be correct. Indeed, the model hypotheses in Knutti et al. (2002, 2003) discussed later in Section 4 and 7 deserve further scrutiny on this basis. We will not discuss this further here; we merely want to flag the issue. 8To be more precise, we should also explicitly state the background knowledge B in the likelihood expressions, such that they read Pr(E|Li,j&B). 
In the interests of readability, we will not use these more precise expressions, but the B should be understood as implicit. 9This is the Bayesian wisdom, anyhow. The complete Bayesian expression is as follows: Prf (Li,j) Prf (Lk,l) = Pr(Li,j|E) Pr(Lk,l|E) = Pr(E|Li,j) Pr(E|Lk,l) × Pr(Li,j) Pr(Lk,l) (2) where the first term is the ratio of posterior probabilities, i.e. the ratio of probabilities after receipt of the evidence. The final term is the ratio of prior or initial probabilities for the model hypotheses, i.e. before the evidence. In short, the ratio of posteriors for the model hypotheses, given new evidence E, is a product of the ratio of prior probabilities and the likelihood ratio. As mentioned, it is the likelihood ratio that governs the relative extent to which the model hypotheses are confirmed by E. Note that the likelihood ratio plays a key role in other theories of confirmation too, not just the Bayesian. 9 We begin with this case to show that there is a straightforward way in which double-counting is fine: calibration of L involves ascertaining appro- priate values for α and β; thus the whole point is to consider which specific model hypotheses are confirmed relative to others in light of the data. Call this double-counting I ; we do not expect its legitimacy to be controversial, given a hypothesis space as described above. So we already see that un- qualified statements about the badness of calibration/double-counting are problematic. 3.2 Double-counting II An interesting qualification may be deduced from the work of Worrall (2010). He suggests that the real double-counting sin would be to use evidence to calibrate a base model such as L above, and also hold that the same evidence confirms not only specific instances of this base model relative to others, but the base-model hypothesis itself: Using empirical data e to construct a specific theory T ′ within an already accepted general framework T leads to a T ′ that is indeed (generally maximally) supported by e; but e will not, in such case, supply any support at all for the underlying general theory T . (Worrall 2010, 143) Call this double-counting II. In this quote Worrall refers to a general the- ory T that is already ‘accepted’. In such a case, the general theory cannot be incrementally confirmed, as it already has maximal probability.10 Worrall’s remarks are thus consistent with Bayesian confirmation. We take Worrall’s work to be highly suggestive, however, of the more general claim against double-counting II. We will show that, according to Bayesian confirmation theory, double-counting II is legitimate—thus conflicting with the more gen- eral claim against double-counting II. Perhaps when climate scientists claim that separate data is required for confirmation and calibration, they take for granted, along the lines of Wor- rall, that double-counting II is illegitimate, i.e. calibration of a base-model hypothesis cannot result in that hypothesis being confirmed relative to an- other base-model hypothesis, and thus other data is needed for any such confirmation. 10Note also that Worrall considers only cases where the evidence falsifies all but one instance of a base model. 
10 This position, however, is not born out by Bayesian confirmation logic (at least).11 On the contrary, double-counting II is legitimate and can arise for two reasons: 1) ‘average’ fit with the evidence may be better for one base model relative to another, and/or 2) the specific instances of one base model that are favoured by the evidence may be more plausible than those of the other base model that are favoured by the evidence.12 As per double-counting I, our analysis revolves around straightforward likelihood ratios, although here we must introduce prior probability distri- butions over the specific model instances, conditional on each base-model hypothesis being true.13 In the interests of a more concrete discussion, we first introduce a second base-model hypothesis, a quadratic of the form: Q : y(t) = αt2 + β + N(0,σ). (3) Assume that the specific model instances, like those of L above, are all com- binations of α and β, where each may take any value in the discrete set {1, 2, 3, 4}. As before, the error standard deviation, σ, is fixed. Specific model instances are labelled Q1,1, Q1,2, . . .. Note that the base-model hypotheses L and Q are of the same complexity, i.e. they have the same number of free parameters. This is an intentional choice; we do not want to introduce a fur- ther issue of relative model complexity and penalties for overfitting. While an important and controversial issue that is certainly tied up with calibra- tion, the overfitting debate only confounds the question of double-counting. (Nonetheless we will return to this debate briefly at the end of the subsec- tion.) In standard Bayesian terms, the confirmation of one base-model hypoth- esis, e.g., L, with respect to another, e.g., Q, depends on the likelihood ratio Pr(E|L)/Pr(E|Q). As before, if the ratio is greater than 1, then L is con- firmed relative to Q, and if it is less than 1, then Q is confirmed relative to 11We remark on frequentist ‘model selection’ methods at the end of this section; ac- cording to these methods, double-counting II is legitimate—in conflict with the general claim we are attributing to Worrall. Note that Mayo’s ‘severe testing’ approach to con- firmation does not support the Worrall conclusion either (see Mayo’s 2010 response to Worrall). What is important for the severe testing approach is not whether evidence has already been used to calibrate a base-model, but whether the evidence severely tests this base-model hypothesis; these two considerations do not always match up. It is beyond the scope and aims of this paper, however, to elaborate further on the severe testing approach or any other alternative vis-à-vis Bayesian confirmation. 12Our analysis is thus more in line with Howson (1988). 13For double-counting I we were able to eschew prior probabilities altogether when assessing confirmation. 11 L.14 In this case, the relevant likelihoods, however, are not entirely straight- forward: Pr(E|L) = Pr(E|L1,1) ×Pr(L1,1|L) + . . . + Pr(E|L4,4) ×Pr(L4,4|L), (4) Pr(E|Q) = Pr(E|Q1,1) ×Pr(Q1,1|Q) + . . . + Pr(E|Q4,4) ×Pr(Q4,4|Q). Note that Pr(L1,1|L) is the prior probability (i.e. probability before the data is received) of y(t) = t+1+N(0,σ) being the true description of the data gen- eration process for y(t), given that the true model is linear. The expressions above provide formal support for our earlier statement that confirmation of base models depends on 1) fit with the evidence and 2) the conditional priors of all specific instances of these base models. 
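To make these expressions concrete, here is a minimal computational sketch of the toy comparison between L and Q. The data points, error standard deviation and uniform conditional priors are invented for illustration; the point is only that the same evidence E enters both the instance-level comparison (calibration, double-counting I) and the base-model comparison of expression (4) (double-counting II).

    # Sketch (invented data): calibration and base-model comparison for L and Q.
    import itertools
    import math

    SIGMA = 1.0                    # fixed error standard deviation
    ALPHAS = BETAS = [1, 2, 3, 4]  # discrete parameter grid, as in the text
    data = [(0.0, 2.1), (1.0, 3.2), (2.0, 4.0), (3.0, 5.3)]  # made-up (t, y) observations

    def normal_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    def likelihood(model, a, b):
        """Pr(E | model instance): product of Gaussian error densities over the data."""
        p = 1.0
        for t, y in data:
            p *= normal_pdf(y, model(t, a, b), SIGMA)
        return p

    def linear(t, a, b):      # instances L_{a,b}: y(t) = a*t + b
        return a * t + b

    def quadratic(t, a, b):   # instances Q_{a,b}: y(t) = a*t**2 + b
        return a * t ** 2 + b

    # Double-counting I: calibration compares instances of L by likelihood ratios.
    lik_L = {(a, b): likelihood(linear, a, b) for a, b in itertools.product(ALPHAS, BETAS)}
    best = max(lik_L, key=lik_L.get)
    print("best-fitting instance of L:", best,
          "; likelihood ratio against L_{1,1}:", lik_L[best] / lik_L[(1, 1)])

    # Double-counting II: Pr(E|L) and Pr(E|Q) are prior-weighted averages of instance
    # likelihoods, as in expression (4), here with uniform conditional priors.
    prior = 1.0 / (len(ALPHAS) * len(BETAS))
    pr_E_given_L = sum(prior * v for v in lik_L.values())
    pr_E_given_Q = sum(prior * likelihood(quadratic, a, b)
                       for a, b in itertools.product(ALPHAS, BETAS))
    print("Pr(E|L)/Pr(E|Q) =", pr_E_given_L / pr_E_given_Q)  # > 1 for this invented data

The same evidence E thus fixes the plausible values of α and β and, at the same time, bears on the comparison between the base-model hypotheses.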
Consider first the special case where the conditional prior probabilities of all specific instances of L and Q are equivalent. That is: Pr(L1,1|L) =. . .=Pr(L4,4|L) =. . . = Pr(Q1,1|Q) =. . .=Pr(Q4,4|Q) =x. (5) Suppose the observed data E yield on balance greater likelihoods for in- stances of L than Q. Then L is confirmed relative to Q because of reason 1), viz. the average fit with the evidence is better for base-model hypothesis L than for Q. Furthermore, there is calibration because E is used to determine the most likely values of α and β. Another special case is where the base-model hypotheses have equivalent fit with the data when all specific models are weighted equally, but the priors are not in fact equal. Suppose that the specific instances of L that have the higher likelihoods for E are in fact more plausible (higher conditional priors) than the specific instances of Q that have the higher likelihoods. Then L is confirmed relative to Q because of reason 2), viz. the specific instances of L favoured by the evidence are more plausible than the specific instances of Q favoured by evidence. Furthermore, there is calibration: E is used to deter- mine the most likely values of α and β. Alongside these two special cases there is also the case of double-counting II because of both 1) and 2). Worrall (2010) has claimed that in cases where data seem to be used for calibration and confirmation of a base-model hy- pothesis, what really happens is that only some of the data is needed to determine the values of the initial free parameters, and the rest of the data 14Again, as before, the relative posterior probabilities of L and Q, i.e. Pr(L|E)/Pr(Q|E), depend also on their prior probability ratio. 12 then confirms the hypothesis; thus there is no double-counting. However, this splitting of the data can throw away valuable information about the free parameters and is not in keeping with Bayesian logic of confirmation. Rather, as we see for the cases discussed here, all of the data are used to determine the values of the free parameters as well as for confirmation of base-model hypotheses, and thus we have a genuine case of double-counting. Finally, while the Bayesian approach to confirmation is far from marginal, there have been interesting challenges to this approach in the context of double-counting II. Concerns about comparing base models of differing com- plexity have lead to special methods for assessing base models, i.e. families of models. This is the field of model selection (see Burnham and Anderson 2002). Our analysis above is standard Bayesian, but it is important to note that various alternative methods for comparing base models have been sug- gested, including the Akaike approach (see Forster and Sober 1994). The controversies here run deep and extend to whether the basic unit of analysis should be a family of models or a specific model, and also to what we are trying to assess: the truth of model hypotheses, or their predictive accu- racy? It is beyond the scope of this paper to enter into this debate. We note simply that even if an alternative (frequentist) approach to confirma- tion of base models is taken, the legitimacy of both double-counting I and II holds: evidence used for calibrating base models is also used for determining their relative standing, or, in other words, for confirmation (see, for instance, Hitchcock and Sober 2004). Section 4 presents two analyses from the climate literature that exem- plify the two special cases of double-counting II. 
The aim here is to show that climate scientists do engage in double-counting, even if they do not acknowledge it as such. 4 Climate science examples: comparative confirmation in prac- tice There is considerable discussion in climate science about calibrating aerosol forcing. To give some background: aerosols are small particles in the atmo- sphere. They vary widely in size and chemical composition and arise, e.g., from industrial processes. Aerosols alter the Earth’s radiation balance, and the aerosol forcing measures the extent that anthropogenic aerosols alter this balance. Anthropogenic aerosols influence the climate in two ways: first, they 13 reflect and scatter solar and infrared radiation in the atmosphere (measured by the direct aerosol forcing). Second, they change the properties of clouds and ice (measured by the indirect aerosol forcing). Overall aerosols are be- lieved to exert a cooling effect on the climate. The uncertainty about the magnitude of the aerosol forcing, in particu- lar about the indirect aerosol forcing, is huge because little is known about the physical and chemical principles of how aerosols change the properties of clouds and ice and how they scatter radiation. Consequently, it is standard practice to calibrate the aerosol forcing against data, and the aerosol forcing constitutes a prime example of calibration in climate science. We will now show that in climate papers about the aerosol forcing we can find the two special cases of double counting II. 4.1 Confirmation due to better and worse best fits The first paper we look at is Harvey and Kaufmann (2002). They compare the adequacy of two climate models (with model error) for simulating the observed warming of the past two and a half centuries. The two base models are (the climate models are derived from an energy balance model coupled to a two-dimensional ocean model):15 • M1: model instances that consider both natural and anthropogenic forcings to describe climate change (plus model error). • M2: model instances that consider only anthropogenic forcings to de- scribe climate change (plus model error). They assume that the model error is such that none of the base-model hypotheses can be falsified by the data but where, roughly, the closer the simulations are to the observations, the better.16 The evidence regarded as relevant for assessing the adequacy of the base models are the past record of mean surface temperature changes, interhemispheric surface temperature changes, surface temperature changes in the northern hemisphere and surface temperature changes in the southern hemisphere. This evidence is used to simultaneously calibrate the aerosol forcing and the climate sensitivity. (The 15The base model M1 (M2) does not consist of one model to which different forcing values can be assigned. It consists of several different models, which consider different an- thropogenic and natural influences (different anthropogenic influences), to which different forcing values can be assigned. Hence Harvey and Kaufmann compare two sets of models. 16They do not assume any observation error. 14 climate sensitivity measures the mean temperature change resulting from a doubling of the concentration of carbon dioxide in the atmosphere). Moti- vated by physical considerations, the initial ranges considered are [0,-3] for the aerosol forcing and [1, 5] for the climate sensitivity. They proceed as follows: among all the model instances of M1 and M2, Harvey and Kaufmann identify a model instance which best matches the data. 
Then they apply a statistical test to determine whether other model instances differ significantly from the best instance. In this way they arrive at a set of best performing models instances. (Denote this set by MB and let MBC be the model instances of M1 and M2 which are not in MB.) It turns out that MB only includes instances of M1. Consequently, they conclude that there is confirmation: M1 (natural and anthropogenic forcings) is more ade- quate for simulating the past temperature record than M2 (only anthropogenic forcings). Furthermore, they use the same data to calibrate the aerosol forc- ing : the instances of M1 in MB correspond to an aerosol forcing range of (-1.5, 0], which is thus regarded as the likely range. Harvey and Kaufmann can be seen as engaging in double-counting II. Their procedure can (roughly) be reconstructed in Bayesian terms, as per Section 4. The model error is probabilistic.17 Further, because initially they are indifferent about the exact forcing values, they assume a uniform prior over the aerosol forcing and climate sensitivity conditional on M1 and M2.18 Their procedure comes close to assigning to the probability of the data given MBC a much smaller value than to the probability of the data given MB. (That is, Pr(E|MBC)/ Pr(E|MB) is much smaller than 1, e.g., 1/9.) Then, because MB only includes instances of M1, it follows that the probability of the data given M1 is much higher than the probability of the data given M2. Consequently, probabilistic confirmation theory yields that M1 is confirmed relative to M2 and that very likely the aerosol forcing is in the range (−1.5, 0]. To conclude, Harvey and Kaufmann justifiably use the same data for calibration and comparative confirmation: They engage in case 1) of double counting II, i.e. there is confirmation because the average fit with the evidence is better for M1 than for M2. Note that we are not here assessing other 17Their method implies that (roughly) the smaller the model error, the better, and that none of the models can be falsified. However, apart from this, the assumptions about the model error remain unclear. It would be desirable to spell these assumptions out because this is needed for specifying the models’ adequacy. 18Likewise, we assume that each of the different models in M1 (M2) are equiprobable (see footnote 15). 15 aspects of the experimental design; for instance, climate scientists may debate the relevance of the past ocean temperature change data for comparing the models’ adequacy. As stressed earlier, that is a different question not to be confused with double-counting. 4.2 Confirmation due to more and less plausible forcings values As a second case let us compare the models of Knutti et al. (2002) and Knutti et al. (2003). Knutti et al.’s (2002, 2003) concern is to construct models which are adequate for long-term predictions of temperature changes (within the error bounds) until 2100 under two important emission scenarios. They assume that the model error is discrete (cf. Section 3). The two base models are (the climate models are derived from a dynamical ocean model coupled to an energy- and moisture-balance model of the atmosphere): • M1: model instances considered by Knutti et al. (2002). There are five different ocean setups and the carbon cycle is not accounted for explicitly (the carbon cycle determines how emissions are converted into concentrations in the atmosphere).19 • M2: model instances considered by Knutti et al. (2003). 
There are ten different ocean model setups and the carbon cycle and its uncertainty are explicitly accounted for with a parameterization.20 The evidence which they regard as relevant for assessing the adequacy of these models are past mean surface temperature changes and ocean temper- ature changes. All the elements needed to compare the two base-model hypotheses in the framework of probabilistic confirmation theory are present in Knutti et al. (2002, 2003). The evidence is used to simultaneously calibrate the indirect aerosol forcing and the climate sensitivity. Motivated by physical estimates, Knutti et al. (2002, 2003) assume that, conditional on M1 and M2, the indi- rect aerosol forcing is initially normally distributed with the mean at -1 and a standard deviation of 1.21 The climate sensitivity is assumed to be initially 19The ocean setups of M1 and M2 differ: the ten ocean setups of M2 do not include the five ocean setups of M1. 20Because of the different ocean setups, the base model M1 (M2) does not consist of one model to which different forcing values can be assigned but of five (ten) different models to which different forcing values can be assigned. Hence the sets of models M1 and M2 are compared. 21They also discuss the case of a uniformly distributed aerosol forcing. However, the case of the normal distribution will be more insightful here. 16 uniformly distributed over [1,10], conditional on M1 and M2. Knutti et al. (2002, 2003) then calculate the a posterior probabilities for model instances, i.e. the likelihood of an arbitrary model-hypothesis instance given the data, assuming that M1 (M2) is true. A model-hypothesis in- stance is regarded as consistent if the average difference between the actual and the simulated observations is smaller than a constant.22 The a posterior probability is zero for inconsistent model-hypothesis instances; consistent model-hypothesis instances are assigned a probability proportional to the prior probability over the forcings values (i.e. over the model instances23). It turns out that the a posterior probability distribution over the forcings are the same for M1 and M2, implying the indirect aerosol forcing is likely (with approximate probability 0.90) to be in the range [-1.5,0.2). In short, the con- sistent model instances of M1 span the same range of forcing values as the consistent model instances of M2. Since all consistent model instances are regarded as having equivalent fit with the data (because postulated model error is discrete), we conclude that there is no comparative confirmation. Now suppose that for M1 the a posterior probability distribution over the forcings would have been different, say, that the likely (with probability 0.90) aerosol forcing range would have been [-2.7,-1]. Then the data would have been justifiably used both for calibration and comparative confirmation of the base-model hypotheses. This would have been an example of case 2 of double counting II : M2 would have been confirmed relative to M1 because the specific instances of M2 favoured by the evidence are more plausible than the specific instances of M1 favoured by the evidence. 5 Old evidence We have seen that double-counting is not illegitimate by Bayesian confirma- tion standards, at least, and is, moreover, practised by some climate scien- tists. This problematises assertions that double-counting is clearly bad. The remainder of the paper considers reasons why double-counting may yet be, for the most part, inapplicable in the climate-model context. 
Note that the reasons we canvas concern the failure of calibration and/or confirmation of base models; nothing we say in these final sections supports the position that separate data should be used for calibration and confirmation.

22 The constant equals the standard deviation of the model ensemble, which in climate science is regarded as a measure of model error. They also assume that there is observation error. To account for it, the difference of the observed and modelled temperature is divided by the uncertainty of the observed warming (Knutti et al. 2002, 2003).
23 Knutti et al. (2002, 2003) assume that each of the five (ten) different models constituting the base model class M1 (M2) are equiprobable (cf. footnote 20).

We start with what seems a prevalent concern: that the evidence in question was used to formulate the climate-model hypotheses, and so is old evidence that is not suitable for further confirmation purposes. This appears to be a concern of Stainforth et al. (2007a):

Development and improvement of long time-scale processes are therefore reliant solely on tests of internal consistency and physical understanding of the processes involved, guided by information on past climatic states deduced from proxy data. Such data are inapplicable for calibration or confirmation as they are in-sample, having guided the development process.

The term ‘in-sample’ is ambiguous here: on the one hand it apparently refers to evidence belonging to a different time(/spatial) period from the predictions of interest (we discuss this issue in subsequent sections), yet on the other hand it seems to refer to old evidence, i.e., evidence already taken into account in model development. Since these two issues come apart,24 they deserve separate treatment.

24 Consider: It is possible to find ‘new’ evidence from the same time period as the ‘old’ evidence.

Our current concern is updating on old evidence. How might this problem manifest? It helps to consider a paradigm case: imagine that a detective announces that the most plausible hypothesis, given the expensive earring and strands of hair found at the crime scene, is that the rich Lady visiting the manor killed the host. Clearly the evidence has already been taken into account in announcing that this hypothesis is the most plausible one. In Bayesian terms, the current plausibility of the hypothesis—its relatively high probability—is already a posterior probability, given the evidence. It would thus be a mistake to further confirm the rich-Lady hypothesis with respect to the same evidence. One can still assess the confirmatory power of the old evidence, but this requires estimating ‘counterfactual’ probabilities, such as the likelihood Pr(E|rich-Lady hypothesis where E is not already known). One can also entertain, if necessary, a prior probability for ‘rich-Lady hypothesis where E is not already known’—this is evidently what the detective’s belief in the rich Lady’s culpability would have been, before the evidence E was known.25

25 Admittedly, these ‘counterfactual’ probabilities may be difficult to estimate, and the controversy about their interpretation runs deep, but there are nonetheless ways to make sense of them (see, for instance, Eells and Fitelson 2000).

To better appreciate the problem, it is helpful to consider the overall confirmation from two independent pieces of evidence, say E1 and E2, according to Bayes’ theorem. In such a case, the overall confirmation of, say, H1 relative to H2, depends on the product of the two likelihood ratios:

[Pr(E1|H1)/Pr(E1|H2)] × [Pr(E2|H1)/Pr(E2|H2)].     (6)

It would be a mistake, of course, to treat the one piece of evidence, E, as if it were two pieces of independent evidence, and thus take confirmation due to E as:

[Pr(E|H1)/Pr(E|H2)] × [Pr(E|H1)/Pr(E|H2)].     (7)

This is what it means to update again on old evidence, or use the same evidence two times over for confirmation. It is effectively what would happen if, say, our detective further confirmed the rich-Lady hypothesis with respect to the same crime-scene data, and concluded that it was even more plausible that she was the murderer.

Let us now return to climate models. The way we have characterised calibration in Section 3 already guards against this old-evidence updating, to some extent. As mentioned, the problem set-up is crucial to a defensible Bayesian analysis: when calibrating and comparing two base-model hypotheses, we must assign all the specific instances of these models appropriate conditional priors, i.e., probabilities that do not yet take the evidence into account. Then the evidence can be used to calibrate or discriminate further between the model instances (and between the base models too, as per double-counting II). This is effectively the procedure that is followed in the case studies of Section 4; suitable conditional prior probabilities are initially selected, and then updated in light of the temperature data.

Of course, evidence might be unwittingly used two times over for calibration and/or confirmation. Indeed, Frame et al. (2007) note this danger in the context of assessing climate models. They caution against calibrating and/or confirming twice with the same evidence, not realising that the evidence already informed the conditional prior probability distributions over instances of the base models. In short, updating on old evidence is problematic, and practitioners should be careful to avoid doing this. But this is not an inevitable problem, and the remedy is not to use separate data for calibration and confirmation; the remedy is simply not to calibrate and confirm model hypotheses two times over with the same evidence.

There may be a lingering concern that prior probabilities for the base-model hypotheses themselves already incorporate the evidence, especially if base models with additional forcings or parameters are constructed expressly to achieve better fit with the data. So the base-model hypotheses are only a subset of the full space of possible models, and hence assigning each an equal prior probability would be to over-estimate their initial plausibility. The situation seems analogous to the murder case above—the base models that climate scientists work with are considered plausible precisely because the evidence has already been taken into account in selecting them. Just as the murder detective does not bother to mention various people near the crime scene who may have been under greater suspicion if the evidence were otherwise, climate scientists have presumably already dismissed a large number of possible base models in favour of the few under consideration that seem to have the potential to permit a reasonable fit with the evidence. It would then seem wrong to use the evidence a second time over for confirmation.
Notwithstanding this concern, we can still calibrate and assess comparative (incremental) confirmation in terms of the likelihoods Pr(E|Hi), where it is assumed in the condition that E is not already known. Furthermore, as men- tioned above, even if the base-model hypotheses are only a subset of the full space of model hypotheses—the ones deemed most plausible in light of the evidence—one can still estimate ‘counterfactual’ prior probabilities for the base-model hypotheses where the evidence E is not taken into account. Pre- sumably, the counterfactual prior probabilities for these base models should not add to 1, but to some probability less than 1. Determining the appropri- ate probability mass to assign to the set of base-model hypotheses may be quite tricky. But this problem affects only non-comparative, and ultimately, absolute confirmation, where we want to assess how confident we should be, overall, in our models, and again, has nothing to do with double-counting. In any case, the assessment of non-comparative and absolute confirmation of climate models is plagued with even bigger difficulties, and we will get to these in Section 7. For now we continue to analyse why even calibration and comparative confirmation may fail in the climate-model context. In particular, we turn now to concerns about the (ir)relevance of past data. 20 6 Doubts about the relevance of past data There is an important difference between the climate studies discussed in Subsections 4.1 and 4.2. In the Harvey and Kaufmann study, past data was used to calibrate/confirm base-model hypotheses concerning past cli- mate behaviour, whereas in the Knutti et al. studies, past data was used to calibrate/confirm base-model hypotheses concerning long-term future cli- mate behaviour (policy makers are most interested in this long-term future climate behaviour). The latter is more controversial than the former, and, as we will see in this and the next section, may be what some climate scientists have in mind when they make negative comments about calibration and con- firmation. This section discusses whether particular past data are relevant for assessing the adequacy of climate-model hypotheses in predicting future climate variables of interest. The next section will discuss the concern that climate models are based on assumptions that may not hold in the future, and hence there is considerable uncertainty about the full space of models that are possibly adequate for predicting future climate. Let us initially confine our analysis to the model instances of a single base-model hypothesis, e.g., L (equation (1) in Section 3). Assume that the model hypotheses denoted L1,1,L1,2 . . . this time concern whether the line in question (plus probabilistic model error) accurately predicts y(t) for future times t ≥ t∗. Our question here is: Can past data, i.e. data for t < t∗, help in calibrating L? The answer: it all depends on what is the implicit relationship between t < t∗ and t ≥ t∗, i.e. the implicit extension of the model instances of L that span t ≥ t∗ into the past. One possibility is that the past values de- pend strongly on the future values, and vice versa, a special case being where each line in L for t ≥ t∗ is associated with just one and the same line for t < t∗. In this case, past data E (past values for y(t)) is clearly relevant for comparing L1,1,L1,2 . . .. 
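Before turning to that issue, here is a minimal sketch (with invented numbers) of the bookkeeping point made in this section: in odds form, a body of evidence E should contribute its likelihood ratio to an update exactly once; feeding it in twice, as in expression (7), manufactures spurious extra confirmation.

    # Sketch (invented numbers): use the evidence once, not twice.
    def posterior_odds(prior_odds, likelihood_ratios):
        """Bayes in odds form: posterior odds = prior odds x product of likelihood ratios."""
        odds = prior_odds
        for ratio in likelihood_ratios:
            odds *= ratio
        return odds

    prior_odds = 1.0  # H1 and H2 taken to be equally plausible before the evidence
    lr_E = 3.0        # Pr(E|H1)/Pr(E|H2) for a single body of evidence E

    print(posterior_odds(prior_odds, [lr_E]))        # 3.0: legitimate confirmation of H1 over H2
    print(posterior_odds(prior_odds, [lr_E, lr_E]))  # 9.0: the expression-(7) mistake; E counted twice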
26 The likelihood ratios Pr(E|Li,j)/Pr(E|Lk,l) may be calculated as before.27 Another possibility, of course, is that the past values are independent of 26Note that the various frequentist estimators used in model selection, such as the Akaike estimator, assume an unchanging physical reality or data generation process. 27Recall our earlier footnote 8, which notes that the likelihoods are more precisely stated Pr(E|Li,j&B), etc., where B is background knowledge. Here background knowledge about the implicit relationship between past and future is very important for determining the value of the likelihood. 21 the future values, a special case being where each line in L for t ≥ t∗ is associated with any line for t < t∗. That is, each line hypothesis in L, such as L1,1, is implicitly associated with a whole set of extended models: 28 y(t) = { t + 1 + N(0,σ) if t ≥ t∗; γt + θ + N(0,σ) if t < t∗. (8) Here E, i.e. past values for y(t), will be irrelevant for comparing instances of L, the reason being that all instances of L are associated with the same pasts, and so E does not distinguish these instances. That is to say that the pertinent likelihoods for calibration—Pr(E|Li,j)/Pr(E|Lk,l)—all equal 1. So in this case there is no calibration of L and thus, in a sense, no double- counting I. The analysis of double-counting II is essentially the same. In this case, we are comparing two base-model hypotheses, for example, L and Q (equa- tions (1) and (3) in Section 3) where the concern is whether the models accurately predict y(t) for future times t ≥ t∗. Consider the special case where every model instance of L or Q is implicitly extended into the past in the same variety of ways.29 In this case past data E again does not favour any instance of either model over any other instance of either model, and we obtain Pr(E|L)/Pr(E|Q) = 1. Neither base hypothesis is confirmed relative to the other. So in a sense there is no double-counting II (in addition to no calibration and no double-counting I). Of course, this is just a special case; if the values of past and future variables were dependent, past data may confirm one base-model hypothesis over another. This scenario of independence is what some climate scientists seem to have in mind when they say: Statements about future climate relate to a never before experi- enced state of the system; thus, it is impossible to either calibrate the model for the forecast regime of interest or confirm the use- fulness of the forecasting process (Stainforth et al. 2007a, 2146). We have here the grounds for a charitable interpretation of climate scien- tists’ claim that data cannot be used to calibrate and confirm climate models. As suggested by the quote, one might say that calibration is impossible when 28Also, the implicit conditional probabilities for the past extensions are assumed not to vary for the Li,j. 29Again, the implicit conditional probabilities of the extensions are assumed not to vary for the Li,j and Qi,j. 22 the future climate variables in question (or the equations that adequately pre- dict them) are considered independent of the past data at hand (or the equa- tions that adequately predict them).30 It is important to note that the extent to which the point applies in climate science is controversial. Some climate scientists suggest that the future values of prominent climate variables, in- cluding precipitation and even average global temperature rise, are more or less unconstrained by the past values of these or other variables (e.g., Frame et al. 
2002; Stainforth et al. 2007a). Other climate scientists apparently do not think it so plausible that past values for at least some prominent climate variables are irrelevant to their future values (e.g., Knutti et al. 2002, 2003; Randall and Wood, 2007). In any case, the claim that calibration fails and there is no confirmation of model instances or model hypotheses in a par- ticular context is very different from the claim that double-counting is ‘bad practice’. Moreover, using separate past data for calibration and confirmation is no remedy for this problem. 7 Non-comparative confirmation and catch-alls We have thus far been concerned with confirmation of one model hypothesis relative to another. Yet certain statements from climate scientists concern- ing calibration suggest that what is at issue is whether the evidence confirms the predictions of a model tout court, i.e. relative to its complement (non- comparative confirmation). We first show that double-counting is also legit- imate for non-comparative confirmation. Then we explain why, nonetheless, confidence in future climate predictions may be hard to amass. The difficul- ties arise when climate models are based on assumptions which are suspected to be wrong in the future. Again, the problem cannot be solved by employing separate data for calibration and confirmation. In some cases, assessing non-comparative confirmation is relatively straight- forward. The relevant likelihood ratio involves a model (a base model or a specific instance) and its entire complement. For instance, the degree to which evidence E confirms base model hypothesis M relative to its entire complement is (where N,.. . ,Z are the mutually exclusive base model hy- potheses that exhaust the complement of M): Pr(E|M) Pr(E|¬M) = Pr(E|M) Pr(E|N)×Pr(N|¬M)+. . .+Pr(E|Z)×Pr(Z|¬M) . (9) 30A case which often arises in climate science is that the equations for adequately pre- dicting the past and future climate variables are considered identical in form, yet the parameters in these equations have values for past and future that are independent. 23 As before, this likelihood ratio may be greater than, less than, or equal to 1, corresponding to M being confirmed, disconfirmed, or neither, relative to its complement. Here again it must be noted that the final probability of M, i.e. Pr(M|E), is a further matter, and depends also on the prior probability Pr(M). This section too focuses just on the extent to which evidence incrementally con- firms or raises confidence in a model, this time relative to its complement. An examination of the above expression reveals, however, that non-comparative confirmation nonetheless requires substantial information regarding the prior probabilities of base models, in the form of conditional probabilities like Pr(N|¬M). So the comments at the end of Section 5 regarding difficulties in estimating the prior probabilities of base models are pertinent here. Further problems arise when the full set of base models under considera- tion is believed not to be exhaustive, and yet we are unable to specify what is missing (there are ‘known unknowns’). In other words, we have a range of plausible base-model hypotheses plus a catch-all, i.e. a hypothesis to the effect ‘none of the above is true’. One can easily see that non-comparative confirmation in these conditions is difficult to assess. The relevant likelihood is (where M is a base-model hypothesis, and hypotheses N,... 
Further problems arise when the full set of base models under consideration is believed not to be exhaustive, and yet we are unable to specify what is missing (there are 'known unknowns'). In other words, we have a range of plausible base-model hypotheses plus a catch-all, i.e. a hypothesis to the effect that 'none of the above is true'. One can easily see that non-comparative confirmation in these conditions is difficult to assess. The relevant likelihood ratio is (where M is a base-model hypothesis, and the hypotheses N, ..., together with the catch-all C, exhaust the complement of M):

Pr(E|M)/Pr(E|¬M) = Pr(E|M) / [Pr(E|N)×Pr(N|¬M) + ... + Pr(E|C)×Pr(C|¬M)].    (10)

The problem is that the likelihood associated with the catch-all, Pr(E|C), let alone the probability Pr(C|¬M), is very difficult to evaluate. How do we estimate the probability of some evidence conditional on the truth of a hypothesis which we cannot actually specify?

The common sentiment in climate science seems to be that there is indeed a catch-all, especially when the models' purpose is to predict future climate. Nonetheless, some studies appear to proceed under the assumption that model hypotheses may be confirmed (or disconfirmed) to some degree in non-comparative terms, given evidence. Most plausibly, in these cases the catch-all is either negligible, or else it is not completely unspecified, and some climate scientists think they know enough about it to at least have rough estimates for Pr(E|C). If at least a rough estimate for Pr(E|C) can be given (as well as rough estimates for all other terms in the expression above), the main conclusions drawn about double-counting and comparative confirmation carry over. In particular, double-counting II is legitimate for non-comparative confirmation and can arise for two reasons (cf. Section 3): 1) better fit of the model or the complement of the model with the evidence, and/or 2) the specific instances of the model that are favoured by the evidence may be more plausible or less plausible than the instances of the complement favoured by the evidence.
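How much turns on the catch-all term can be seen in a toy version of expression (10), with a single specified alternative N plus the catch-all C (all probability values below are invented for illustration and are not estimates of any real climate quantities). Holding the fit of M and N with the evidence fixed, the verdict flips as the assumed value of Pr(E|C) varies:

```python
# Toy calculation of expression (10): E confirms M relative to its complement
# just in case the ratio exceeds 1. All numbers are invented; the point is only
# that the verdict is hostage to the hard-to-estimate catch-all likelihood Pr(E|C).

def ratio_M_vs_complement(pr_E_given_M, pr_E_given_N, pr_E_given_C,
                          pr_N_given_notM, pr_C_given_notM):
    denominator = (pr_E_given_N * pr_N_given_notM +
                   pr_E_given_C * pr_C_given_notM)
    return pr_E_given_M / denominator

# Invented fits of the specified hypotheses with the evidence:
pr_E_given_M, pr_E_given_N = 0.20, 0.05
# Invented prior weights of the alternatives within the complement of M:
pr_N_given_notM, pr_C_given_notM = 0.5, 0.5

for pr_E_given_C in (0.01, 0.20, 0.60):
    r = ratio_M_vs_complement(pr_E_given_M, pr_E_given_N, pr_E_given_C,
                              pr_N_given_notM, pr_C_given_notM)
    print(f"Pr(E|C) = {pr_E_given_C:.2f}  ->  ratio = {r:.2f}",
          "(confirmed)" if r > 1 else "(not confirmed)")
```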
So far so good, but some climate scientists do not think the prospects for non-comparative confirmation of model hypotheses concerning the future are so rosy. First, note that if past data is considered independent of the future (cf. the discussion in Section 6), there cannot be non-comparative confirmation because there is no confirmation of one base-model hypothesis relative to another or indeed the catch-all. Second, even if past data are relevant, many scientists worry that climate models (which are based on our understanding of climate processes to date) invoke assumptions which may not hold in the future.31 Consider:

For these processes, and therefore for climate forecasting, there is no possibility of a true cycle of improvement and confirmation, the problem is always one of extrapolation and the life cycle of a model is significantly less than the lead time of interest. (Stainforth et al. 2007a, 2147)

One might interpret this view as follows: if base-model hypotheses concern future predictions, then the catch-all is overwhelming. Future climate behaviour may differ from that of the past/present in unanticipated ways, and so we are unable to specify even roughly the appropriate likelihoods of the relevant catch-all.

31Note that while these two concerns are logically distinct, they are of course closely related in the climate context. This is because the scientific reasons for doubting the relevance of past climate data have much overlap with the reasons for positing significant uncertainty about the future.

At this point it should be mentioned that climate models are designed to accurately simulate mean surface temperature changes; they fail to simulate absolute mean surface temperatures to a similar level of accuracy. In particular, the simulated mean surface temperature changes are derived from simulated surface temperature values that show biases of several degrees Celsius in many regions of the Earth; and the same holds for other variables such as ocean temperatures (Knutti et al. 2010; Randall et al. 2007, 608 and supplementary material). There is nothing in principle wrong with modelling temperature changes rather than absolute temperatures. When one variable is too difficult to predict, scientists often succeed instead in predicting a simpler variable such as an average or a change in that variable. However, many climate scientists argue that climate models fail to accurately simulate absolute temperatures because important processes are ignored, processes which may become relevant for adequately predicting the long-term future climate behaviour of interest (e.g., Stainforth et al. 2007a). This raises doubts about whether current climate models will adequately describe the relevant aspects of the future climate.

Climate scientists seem to take different views on the extent of our uncertainty about the future. But in the case of radical uncertainty, non-comparative confirmation of any one, or the whole set of, our climate-model hypotheses concerning the future is indeterminate, even if past data are relevant for comparing pairs of hypotheses. Overall confidence in any single model or the full set of models cannot increase.32 This position regarding non-comparative confirmation is reflected in the following statement concerning the modelling of future climate:

We take climate ensembles exploring model uncertainty as potentially providing a lower bound on the maximum range of uncertainty and thus a non-discountable [unable-to-be-ignored] climate change envelope [range of climate-change predictions]. (Stainforth et al. 2007b, 2167)

32Moreover, applying full Bayesian reasoning: the posterior probabilities of the climate-model hypotheses would also be indeterminate, due to the indeterminate likelihood ratios. Most plausibly, in the case of a radically unspecified catch-all, the prior probabilities would be indeterminate as well.

We now turn to an example in climate science which highlights the controversies surrounding the relevance of past data and the overall adequacy of climate models for future predictions.

8 Climate science example: non-comparative confirmation and catch-alls in practice

Our example for non-comparative confirmation with a catch-all again concerns the aerosol forcing and is Knutti et al. (2003), already discussed in Subsection 4.2. Recall that Knutti et al. aim to construct models which are adequate for long-term predictions of the temperature changes until 2100 under two emission scenarios (within the error bounds), and that the model error is discrete. The two base models are:

• M: the model instances of Knutti et al. (2003);
• C: the catch-all.

Recall that mean surface temperature changes and the ocean warming are regarded as relevant to assess the adequacy of the models, and they are used to constrain the indirect aerosol forcing and the climate sensitivity. Motivated by physical estimates, for the aerosol forcing a uniform distribution over [−2, 0] is chosen conditional on M or C.33 For the climate sensitivity a uniform distribution over [1, 10] is chosen conditional on M or C.

33Knutti et al. (2003) also discuss the case of a normally distributed aerosol forcing—see footnote 21.

The data are used for calibration: Knutti et al. (2003) calculate the likelihood of an arbitrary model-hypothesis instance given the data, assuming that M is true. Because of the uniform prior distribution over the forcing values, consistent model-hypothesis instances are equiprobable given the data; inconsistent model-hypothesis instances have zero probability (a model-hypothesis instance is regarded as consistent if the average difference between the actual and the simulated observations is smaller than a constant). The conclusion is that the likely range (summing to probability 0.93) of the indirect aerosol forcing is [−1.2, 0).
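This kind of calibration step can be sketched schematically as follows (this is not Knutti et al.'s code: the stand-in model, the single 'observation' and the consistency threshold are hypothetical placeholders, whereas in the actual study it is the average difference across the observations that must be smaller than a constant). Parameter values are drawn from the stated uniform priors, inconsistent instances are discarded, and the surviving instances are treated as equiprobable given the data, so the probability of any forcing range is just the fraction of survivors in it:

```python
# Schematic sketch of a calibration step of this kind (not Knutti et al.'s code).
# `run_toy_model`, OBSERVED_WARMING and THRESHOLD are invented placeholders.

import random

random.seed(0)

def run_toy_model(forcing, sensitivity):
    # Stand-in for the climate model: returns a fake simulated warming value.
    return 0.5 * sensitivity + 0.8 * forcing

OBSERVED_WARMING = 0.6      # invented 'observation'
THRESHOLD = 0.3             # invented consistency threshold

surviving = []
for _ in range(100000):
    forcing = random.uniform(-2.0, 0.0)        # uniform prior over [-2, 0]
    sensitivity = random.uniform(1.0, 10.0)    # uniform prior over [1, 10]
    if abs(run_toy_model(forcing, sensitivity) - OBSERVED_WARMING) < THRESHOLD:
        surviving.append((forcing, sensitivity))

# Consistent instances are equiprobable given the data, inconsistent ones get
# probability zero; the probability of a forcing range is the share of survivors
# that fall within it.
share = sum(1 for f, _ in surviving if -1.2 <= f < 0.0) / len(surviving)
print(len(surviving), round(share, 2))
```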
Furthermore, Knutti et al. seem to claim that the data confirm M relative to the catch-all because the fit with the data is very good and the model could have (easily) failed to simulate the data.

As already discussed in Subsection 4.2, Knutti et al. (2003) use elements of probabilistic confirmation theory. However, when reconstructing this as a case of non-comparative confirmation, what is missing are the values of Pr(E|M) and, in particular, of Pr(E|C). The crucial question is whether Pr(E|M)/Pr(E|C) > 1. If it is, then probabilistic confirmation theory will yield that the data are justifiably used for non-comparative confirmation and calibration; there will be double-counting II for reason 1)—the model instances of M provide a better fit with the data than the catch-all.

It should come as no surprise that the answer to this question is controversial. Knutti et al. (2003) tend towards an affirmative answer; they seem to claim that confidence in the future predictions of M has increased. However, if Stainforth et al. (2007a) are right that past data are not relevant to the future climate predictions of interest (as discussed in Section 6) or that the probabilities associated with the catch-all cannot be precisely specified (as discussed in Section 7), then the answer will be negative: the data simply will not confirm M relative to the catch-all. The fact that there is controversy among climate scientists about such fundamental and policy-relevant questions highlights the need to think more carefully about them. Whatever the outcome, this controversy is not about the problem of double-counting.

9 Concluding remarks

The main contribution of this paper is the untangling and clarification of worries concerning double-counting. We have argued that the common position—that double-counting is bad and that separate data must be used for calibration and confirmation of base-model hypotheses—is by no means obviously true. This is not to say there are no other fundamental concerns about the confirmatory power of evidence or about uncertainty in climate science. It is crucial, however, that the various issues are articulated and distinguished, if we are to make progress in assessing confidence in climate models and their predictions.

Our claim is that double-counting, in the sense of using evidence for calibration and confirmation, is justified by at least one major approach to confirmation—the Bayesian or relative likelihood approach. Calibration of a base-model hypothesis is all about determining which specific instances of the base model are confirmed relative to other specific instances. We call this double-counting I. Furthermore, we showed that, according to Bayesian standards, the same evidence may be used for calibration and for incrementally confirming one base-model hypothesis relative to another, or relative to its entire complement. We call this double-counting II.
We appealed to studies in climate science to show that these two forms of double-counting are in fact practised by some climate scientists, even if they are not acknowledged as such.

In the latter parts of the paper, we acknowledged and discussed important worries about calibration and confirmation in the climate-modelling context that may be marring the double-counting debate. In some cases, evidence already informs the prior assessment of model instances. If so, it cannot be used again for calibration and confirmation—this would be using the same evidence twice over. More fundamentally, there is often controversy about what evidence is relevant to whether a model achieves its purpose. Treating irrelevant evidence as if it were relevant and using this evidence for confirmation or calibration is also bad practice. Indeed, some climate scientists state strongly that future climate variables of interest are more or less unconstrained by the available past climate data. The upshot is that this past climate data is irrelevant for assessing the adequacy of models for predicting the future; hence there can be no calibration or double-counting. A related but subtly different concern is that climate models are based on assumptions which may not be applicable in the future. This would imply that one cannot hope to even roughly determine the likelihood of the catch-all hypothesis with respect to adequately predicting the future, and non-comparative confirmation, let alone absolute confirmation, would be indeterminate.

We noted that climate scientists disagree about whether these worries are all justified. In any case, the worries concern whether data are useless for confirmation and/or calibration. Problems of this kind cannot be remedied by using separate data for calibration and confirmation. We thus suggest that practitioners be clearer about their targets. Suspicions about the legitimacy of double-counting should not be confused with other important issues, such as what evidence is relevant for confirmation given the modelling context at hand, whether issues of old evidence are appropriately handled, or whether the worry is justified that climate models are based on assumptions which will not hold in the future.

Acknowledgements

Earlier versions of this paper have been presented at the third conference of the European Philosophy of Science Association, the 2010/2011 London School of Economics Discussion Group Meetings on Climate Science and Decision-making, the 2011 Bristol Workshop on Philosophical Issues in Climate Science, the first Annual Ghent Metaphysics, Methodology and Science Program, the 2011 Geneva Workshop on Causation and Confirmation, the 2011 Stockholm Workshop on Preferences and Decisions, and the 2012 Popper seminar. We would like to thank the audiences for valuable discussions. We also want to thank Reto Knutti, Wendy Parker and David Stainforth for helpful comments.

References

Anderson, T.L., Charlson, R.J., Schwartz, S.E., Knutti, R., Boucher, O., Rodhe, H. and J. Heintzenberg (2003). 'Climate Forcing by Aerosols – a Hazy Picture.' Science 300, 1103–1104.

Burnham, K.P. and D.R. Anderson (1998). Model Selection and Multimodel Inference. Berlin and New York: Springer.

Eells, E. and B. Fitelson (2000). 'Measuring Confirmation and Evidence.' Journal of Philosophy 97, 663–672.

Forster, M. and E. Sober (1994). 'How to Tell When Simpler, More Unified or Less Ad Hoc Hypotheses Will Provide More Accurate Predictions.' British Journal for the Philosophy of Science 45, 1–35.
Frame, D.J., Faull, N.E., Joshi, M.M. and M.R. Allen (2007). 'Probabilistic Climate Forecasts and Inductive Problems.' Philosophical Transactions of the Royal Society A 365 (20), 1971–1992.

Harvey, D. and R.K. Kaufmann (2002). 'Simultaneously Constraining Climate Sensitivity and Aerosol Radiative Forcing.' Journal of Climate 15 (20), 2837–2861.

Hitchcock, C.R. and E. Sober (2004). 'Prediction Versus Accommodation and the Risk of Overfitting.' British Journal for the Philosophy of Science 55, 1–34.

Howson, C. (1988). 'Accommodation, Prediction and Bayesian Confirmation Theory.' PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association 1988, 381–392.

Knutti, R. (2008). 'Should We Believe Model Predictions of Future Climate Change?' Philosophical Transactions of the Royal Society A 366, 4647–4664.

Knutti, R. (2010). 'The End of Model Democracy – an Editorial Comment.' Climatic Change 102, 395–404.

Knutti, R., Stocker, T.F., Joos, F. and G.-K. Plattner (2002). 'Constraints on Radiative Forcing and Future Climate Change from Observations and Climate Model Ensembles.' Nature 416, 719–723.

Knutti, R., Stocker, T.F., Joos, F. and G.-K. Plattner (2003). 'Probabilistic Climate Change Projections Using Neural Networks.' Climate Dynamics 21, 257–272.

Knutti, R., Furrer, R., Tebaldi, C., Cermak, J. and G. Meehl (2010). 'Challenges in Combining Projections from Multiple Climate Models.' Journal of Climate 23, 2739–2758.

Mayo, D.G. (2010). 'An Ad Hoc Save of a Theory of Adhocness? Exchanges with John Worrall.' In: D.G. Mayo and A. Spanos (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science. Cambridge: Cambridge University Press, 155–169.

Parker, W.S. (2009). 'Confirmation and Adequacy for Purpose in Climate Modelling.' Aristotelian Society Proceedings, Supplementary Volume 83 (5), 233–249.

Parker, W.S. (2010). 'Comparative Process Tracing and Climate Change Fingerprints.' Philosophy of Science (Proceedings) 77 (5), 1083–1095.

Randall, D.A. and B.A. Wielicki (1997). 'Measurements, Models, and Hypotheses in the Atmospheric Sciences.' Bulletin of the American Meteorological Society 78, 399–406.

Randall, D.A. and R.A. Wood (2007). 'Climate Models and Their Evaluation.' In: S. Solomon, D. Qin, M. Manning, Z. Chen, M. Marquis, K.B. Averyt, M. Tignor and H.L. Miller (eds.), Climate Change 2007: The Physical Science Basis. Cambridge: Cambridge University Press, 589–662.

Rodhe, H., Charlson, R.J. and T.L. Anderson (2000). 'Avoiding Circular Logic in Climate Modeling.' Climatic Change 44, 419–422.

Shackley, S., Young, P., Parkinson, S. and B. Wynne (1998). 'Uncertainty, Complexity and Concepts of Good Science in Climate Change Modelling: Are GCMs the Best Tools?' Climatic Change 38, 159–205.

Stainforth, D.A., Allen, M.R., Tredger, E.R. and L.A. Smith (2007a). 'Confidence, Uncertainty and Decision-support Relevance in Climate Predictions.' Philosophical Transactions of the Royal Society A 365, 2145–2161.
Stainforth, D.A., Downing, T.E., Washington, M., Lopez, A. and M. New (2007b). 'Issues in the Interpretation of Climate Model Ensembles to Inform Decisions.' Philosophical Transactions of the Royal Society A 365, 2163–2177.

Tebaldi, C. and R. Knutti (2007). 'The Use of the Multi-Model Ensemble in Probabilistic Climate Projections.' Philosophical Transactions of the Royal Society A 365, 2053–2075.

Worrall, J. (2010). 'Error, Tests, and Theory Confirmation.' In: D.G. Mayo and A. Spanos (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science. Cambridge: Cambridge University Press, 125–154.