Climate Models, Calibration and Confirmation

Katie Steele and Charlotte Werndl∗
k.s.steele@lse.ac.uk, c.s.werndl@lse.ac.uk
Department of Philosophy, Logic and Scientific Method
London School of Economics

This article has been accepted for publication in The British Journal for the Philosophy of Science published by Oxford University Press. Published version: British Journal for the Philosophy of Science, 64 (3), pp. 609–635. DOI: 10.1093/bjps/axs036.

April 23, 2012

Abstract

We argue that concerns about double-counting—using the same evidence both to calibrate or tune climate models and also to confirm or verify that the models are adequate—deserve more careful scrutiny in climate modelling circles. It is widely held that double-counting is bad and that separate data must be used for calibration and confirmation. We show that this is far from obviously true, and that climate scientists may be confusing their targets. Our analysis turns on a Bayesian/relative-likelihood approach to incremental confirmation. According to this approach, double-counting is entirely proper. We go on to discuss plausible difficulties with calibrating climate models, and we distinguish more and less ambitious notions of confirmation. Strong claims of confirmation may not, in many cases, be warranted, but it would be a mistake to regard double-counting as the culprit.

∗ Authors are listed alphabetically; this work is fully collaborative.

Contents

1 Introduction
2 Remarks about models and adequacy-for-purpose
3 Evidence for calibration can also yield comparative confirmation
  3.1 Double-counting I
  3.2 Double-counting II
4 Climate science examples: comparative confirmation in practice
  4.1 Confirmation due to better and worse best fits
  4.2 Confirmation due to more and less plausible forcings values
5 Old evidence
6 Doubts about the relevance of past data
7 Non-comparative confirmation and catch-alls
8 Climate science example: non-comparative confirmation and catch-alls in practice
9 Concluding remarks
References

1 Introduction

Climate scientists express concern about the practice of ‘calibrating’ climate models to observational data (another widely-used word for ‘calibration’ is ‘tuning’). Calibration occurs when a model includes parameters or forcings about which there is much uncertainty, and the value of the parameter or forcing is determined by finding best fit with the data. That is, the parameter or forcing in question is effectively a free parameter, and calibration determines which value(s) for the free parameter best explain(s) the data. A prominent example, which we refer to later, is the fitting of the aerosol forcing. The apparent concern about calibration is that it may result or always results in data being double-counted: data used to construct the fully-specified model is also used to evaluate the model’s accuracy, in a problematic way. Indeed, various climate scientists worry about circular reasoning:

In addition some commentators feel that there is an unscientific circularity in some of the arguments provided by GCMers [general circulation modelers]; for example, the claim that GCMs may produce a good simulation sits uneasily with the fact that important aspects of the simulation rely upon [...] tuning. (Shackley et al. 1998, 170).

This is just one particularly suggestive quote about the badness of double-counting. But what exactly is the badness here? We will see that this depends crucially on the details.

This paper seeks to clarify and evaluate worries surrounding calibration and double-counting. We appeal to statements made by various climate scientists, but our aim is not to rebut particular individuals. Our main concern is that, in general, climate scientists’ statements about calibration/tuning/double-counting do not attend to the details, and are, at worst, misleading. A number of different issues are bundled together as the ‘problem of double-counting’, and each of these issues deserves to be carefully articulated.

It is necessary to introduce some terminology. Calibration is introduced above. Confirmation refers to the evaluation of a model’s accuracy for particular purposes.1 Note also that there is an important difference between incremental and absolute confirmation; the former concerns whether confidence in a model hypothesis has increased, while the latter concerns whether confidence in a model hypothesis is sufficient, or above some threshold. This paper focuses on (varieties of) incremental confirmation.2 The central question is: what is the inevitable/proper relationship between calibration and confirmation?

1 Some authors, e.g., Frame et al. (2007), use the term ‘verification’ in lieu of ‘confirmation’. We use the latter term in the interests of making a clear connection with the philosophical literature.
2 From now on, when we use the term ‘confirmation’ we mean incremental confirmation, unless otherwise indicated. As will become clear, we distinguish two varieties of incremental confirmation: comparative and non-comparative.
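To fix intuitions about the incremental/absolute distinction, here is a minimal numerical sketch in Python; all numbers are invented for illustration, and nothing in the example is specific to climate models.

    # Minimal sketch (invented numbers): incremental vs. absolute confirmation.
    prior = 0.2            # initial probability of some model hypothesis H
    pr_E_given_H = 0.6     # Pr(E | H)
    pr_E_given_notH = 0.3  # Pr(E | not-H)

    # Bayes' theorem
    posterior = (pr_E_given_H * prior) / (
        pr_E_given_H * prior + pr_E_given_notH * (1 - prior))

    print(round(posterior, 2))  # 0.33 > 0.2, so H is incrementally confirmed by E
    # Whether H is confirmed in the absolute sense depends on a further threshold,
    # say 0.9; 0.33 falls well short, so incremental confirmation is the weaker notion.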
Some climate scientists appear to claim that calibration is bad and should therefore be avoided : Climate change simulations should, in general, only incorporate forcings for which the magnitude and uncertainty have been quan- tified using basic physical and chemical principles (Rodhe et al. 2000, 421). This statement may be stronger than the authors intend.3 In any case, an anti-calibration position is not defensible, because it would preclude refining models in response to observational evidence. This is common practice in all areas of science. In short, whatever the details of the relationship between calibration and confirmation, it had better be the case that calibration is not something that is bad or needs to be avoided. Our main target here is the widespread view amongst climate scientists that calibration and confirmation should be kept ‘separate’. The following quotes suggest that evidence used in calibration should not (or cannot) yield incremental confirmation; only separate data, not already used for calibra- tion, can boost confidence in a model. In other words, tuning is fine if it simply amounts to calibration, but double-counting is not fine: The inverse calculations [calibration] are also based on sound physical principles. However, to the extent that climate models rely on inverse calculations, the possibility of circular reasoning arises—that is, using the temperature record to derive a key input to climate models that are then tested against the temperature record (Anderson et al. 2003, 1103). If the model has been tuned to give a good representation of a particular observed quantity, the agreement with that observation 2From now on, when we use the term ‘confirmation’ we mean incremental confirma- tion, unless otherwise indicated. As will become clear, we distinguish two varieties of incremental confirmation: comparative and non-comparative. 3Perhaps the authors want to exclude forcings that have no physical plausibility at all, rather than forcings that merely cannot be well quantified. 4 cannot be used to build confidence in that model (IPCC report – Randall and Wood 2007, 596, our underlining). Indeed, the need for separate data for calibration and confirmation is usually simply taken for granted in the climate science literature, or else the reasoning is ambiguous.4 But this position is far from being obviously true, and requires further argument. The first part of the paper argues that separate data for calibration and confirmation is not an uncontroversial tenet of confirmation logic, because it does not follow (in fact, quite the contrary) from at least one major approach to confirmation—the Bayesian approach.5 After some remarks in Section 2 about climate models and adequacy-for-purpose that are useful to bear in mind throughout the discussion, in Sections 3 and 4, we demonstrate, using a very basic model and examples from climate science, that evidence may be used to calibrate and also to incrementally confirm a model relative to another model (we call this comparative confirmation). We then go on to address some complicating issues—reasons why in some contexts data are useless for calibration or confirmation. Some climate scien- tists’ worries about double-counting are most charitably reconstructed along these lines, i.e. as concerning the inapplicability, rather than the inherent badness of double-counting. 
Section 5 considers the issue of ‘old evidence’— if evidence already informs the prior probability distribution over models, it cannot be used a second time over for further calibration and confirmation. Section 6 discusses the worry that past data are irrelevant for model ade- quacy in the future and hence cannot be used for calibration or confirmation. Section 7 discusses a different sense of incremental confirmation that cli- mate scientists may have in mind: non-comparative confirmation, which con- cerns our confidence in a model tout court, i.e. relative to its entire comple- ment. While also for non-comparative confirmation evidence may be used to calibrate and confirm a model, the worry arises that climate models are based on assumptions that may be wrong, especially in the future; hence there is 4See, for instance, Anderson et al. (2003, 1103), Knutti (2008, 4651), Knutti (2010, 399), IPCC report – Randall and Wood (2007, 596), Shackley et al. (1998, 170), and Tebaldi and Knutti (2007, 2070). 5We try to deal minimally in Bayesian assumptions that may be objectionable to some readers, chiefly, prior probabilities. While we restrict our attention to Bayesian confirma- tion logic, the lessons apply more broadly, and we note this where appropriate. In any case, our aim is simply to show that it is not uncontroversial to claim that separate data must be used for calibration and confirmation. 5 considerable uncertainty about the full space of models, implying that data will not confirm a model. Section 8 presents an example from climate science that brings these subtler issues to the fore. The paper ends with a conclusion in Section 9. Let us now turn to the remarks about the predictive purposes of climate models and how this bears on what evidence is relevant for assessing them. 2 Remarks about models and adequacy-for-purpose A variety of climate models are used to study the Earth’s climate. In the words of Parker (2010, 1084): [Climate models] range from the highly simplified to the extremely complex and are constructed with the goal of simulating in greater or lesser detail the transport of mass, energy, moisture, and other quantities by various processes in the climate system. These pro- cesses include the movement of large-scale weather systems, the formation of clouds and precipitation, ocean currents, the melt- ing of sea ice, the absorption and emission of radiation by atmo- spheric gases, and many others. Climate scientists note that many of the aforementioned processes are still poorly understood, and, moreover, that these processes can typically be only approximated in a model, even one of maximum possible precision. Consequently, it is clear from the outset that climate models will not cor- rectly represent or predict the target systems in all their details. This means that climate models themselves cannot be confirmed. As Parker (2009) has convincingly argued, instead what can be confirmed is the adequacy of cli- mate models for particular purposes. The hypotheses about the purposes of climate models need to be specified by climate scientists. A prime example of such a hypothesis is: ‘this climate model with these initial conditions is adequate for predicting the mean surface temperature changes within 0.5 de- grees in the next 50 years under this emission scenario’. In climate science typically some model error is allowed. Therefore, an important part of specifying the hypothesis about the purpose of a model is to state the assumptions about the model error. 
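As a rough illustration, the example hypothesis above might be operationalised as follows; the function, numbers and tolerance are our own illustrative assumptions rather than part of any particular climate study.

    # Hypothetical sketch: an adequacy-for-purpose hypothesis with an explicit
    # error tolerance, modelled on the 0.5-degree example in the text.
    def adequate_for_purpose(simulated_change, actual_change, tolerance=0.5):
        """Is the simulated mean surface temperature change within `tolerance`
        degrees Celsius of the actual change over the prediction period?"""
        return abs(simulated_change - actual_change) <= tolerance

    # e.g. a model simulating a 1.8 degree change when the realised change is 2.1
    print(adequate_for_purpose(1.8, 2.1))  # True: within the stated margin of error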
There are two main kinds of error. First, for discrete model error all that counts is whether the actual out- come is within a certain distance from the simulated outcome, e.g., whether the actual and simulated mean surface temperature is less than 0.5oC apart. 6 Second, there is probabilistic model error when the error is described by a probability distribution. To give a simple example, the error might be mod- eled by a Gaussian distribution around the true value. In this framework of adequacy-for-purpose one needs to be cautious about what data are actually relevant to assess whether a model fulfills a particu- lar purpose. We have to determine the observational consequences that are likely to follow if the model is adequate; the data about these consequences will then be relevant. To come back to our example about mean surface temperature changes: here many will regard past temperature changes as relevant (although we return to this issue later in Section 6). However, it is less clear whether, e.g., past precipitation changes are relevant. As Parker (2009) has argued, if climate scientists have obtained a good understanding of the relation between mean surface temperature changes and precipitation changes, then precipitation changes will be relevant. However, when lacking any knowledge about the interdependence of these two variables, then pre- cipitation changes will not be relevant. Which data are relevant is crucial for two reasons: only relevant data can confirm or disconfirm the adequacy of a model and can meaningfully be used to calibrate the free parameters of a model. This paper does not, for the most part, focus on the question of what data are relevant to assess a model’s adequacy for purpose. General points about the suitability of data for confirmation will, however, become important in Sections 5 and 6. Here it is just important to realise that this question is a separate issue and should not be confused with the worry of double-counting. That is, if data are not relevant to a model’s adequacy for purpose, then test- ing the model against the data even once would be counting the data one too many times; likewise, calibrating the free parameters of the model against the data would be counting the data one too many times. The next section discusses calibration/double-counting in the context of more simple models. The aim is to elucidate calibration vis-à-vis Bayesian confirmation. 7 3 Evidence for calibration can also yield comparative confirma- tion Here we argue against the view that double-counting, in the sense of using evidence for both calibration and confirmation, is obviously bad practice. We show that, by Bayesian or likelihoodist standards at least, double-counting simply amounts to using evidence in a regular and proper way. This is best demonstrated in the context of comparing two well-specified hypotheses. We distinguish two interpretations of double-counting—I (subsection 3.1) and II (subsection 3.2)—because the legitimacy of the latter is more controversial than the former. 3.1 Double-counting I Let us start with a straightforward case, and then add complexity. Consider just one type of base model with very simple structure: a linear relationship between variables y and t. Because, as outlined in the previous section, cli- mate scientists typically allow for model error, we will assume a probabilistic model error term that is distributed normally with standard deviation σ:6 L : y(t) = αt + β + N(0,σ). 
(1) The Bayesian account of model calibration depends crucially on the follow- ing setup: there is a whole family of specific instances of the base model L, where each specific instance has particular values for the unknown parame- ters or forcings α and β. For instance, assume that possible values for α are {1, 2, 3, 4}, and likewise for β. So the scientist associates with L a (discrete) set of specific model instances that we might label L1,1, L1,2, . . ., where the subscripts indicate the values for α and β. Calibration of L then just amounts to comparing specific instances of the base model—L1,1, L1,2, . . .—with respect to the data, i.e. observed values for y(t). Of course, strictly speaking, what we are comparing are model hypothe- ses; assume that the hypotheses here postulate that the model in question accurately describes the data generation process for y(t). Calibration is sim- ply the common practice of testing hypotheses against evidence. Given the probabilistic error term, none of the hypotheses L1,1,L1,2, . . ., can be falsified 6Alternatively, the error term could be interpreted as observational error or as a com- bined term for observational error and model error. We focus on model error because it seems particularly widespread in climate science papers. However, all we say carries over to any other interpretation of the error term. 8 by the data, even if the data lies very far away from the specified line. Note also that since the model error is probabilistic, the hypotheses are mutually exclusive. This is important: calibration is best understood as the compar- ison, given new evidence, of the mutually exclusive hypotheses constituting a base model.7 Calibration, understood in this way, may well result in confirmation of Li,j, say, with respect to Lk,l. By Bayesian logic, the extent of confirmation depends on the likelihood ratio: Pr(E|Li,j)/Pr(E|Lk,l), where Pr(E|Li,j) is just the probability, Pr, of the evidence, E, i.e. the observed data points, given the model Li,j. 8 The likelihoods are related, in a manner that depends on the assumed error probability distribution (in our case Gaussian), to the sum-of-squares distance of the data points from the line. If the likelihood ratio is greater than 1, then Li,j is confirmed by the data relative to Lk,l, and vice versa if the likelihood ratio is less than 1. When the likelihood ratio equals 1, neither hypothesis is confirmed relative to the other. Note that the relative posterior (post-evidence) probabilities of Li,j and Lk,l is a further matter of absolute rather than incremental confirmation (cf. comments in Section 1); absolute confirmation depends also on their relative prior (ini- tial) probabilities.9 7Where model error is discrete, identifying mutually exclusive model hypotheses is more complicated. For instance, consider a simple example of two hypotheses involving discrete model error: L1,1 is the hypothesis that y(t) = t + 1 accurately predicts y(t) within ±2, and L1,2 is the hypothesis that y(t) = t + 2 accurately predicts y(t) within ±2. These two hypotheses could both be correct. Indeed, the model hypotheses in Knutti et al. (2002, 2003) discussed later in Section 4 and 7 deserve further scrutiny on this basis. We will not discuss this further here; we merely want to flag the issue. 8To be more precise, we should also explicitly state the background knowledge B in the likelihood expressions, such that they read Pr(E|Li,j&B). 
In the interests of readability, we will not use these more precise expressions, but the B should be understood as implicit. 9This is the Bayesian wisdom, anyhow. The complete Bayesian expression is as follows: Prf (Li,j) Prf (Lk,l) = Pr(Li,j|E) Pr(Lk,l|E) = Pr(E|Li,j) Pr(E|Lk,l) × Pr(Li,j) Pr(Lk,l) (2) where the first term is the ratio of posterior probabilities, i.e. the ratio of probabilities after receipt of the evidence. The final term is the ratio of prior or initial probabilities for the model hypotheses, i.e. before the evidence. In short, the ratio of posteriors for the model hypotheses, given new evidence E, is a product of the ratio of prior probabilities and the likelihood ratio. As mentioned, it is the likelihood ratio that governs the relative extent to which the model hypotheses are confirmed by E. Note that the likelihood ratio plays a key role in other theories of confirmation too, not just the Bayesian. 9 We begin with this case to show that there is a straightforward way in which double-counting is fine: calibration of L involves ascertaining appro- priate values for α and β; thus the whole point is to consider which specific model hypotheses are confirmed relative to others in light of the data. Call this double-counting I ; we do not expect its legitimacy to be controversial, given a hypothesis space as described above. So we already see that un- qualified statements about the badness of calibration/double-counting are problematic. 3.2 Double-counting II An interesting qualification may be deduced from the work of Worrall (2010). He suggests that the real double-counting sin would be to use evidence to calibrate a base model such as L above, and also hold that the same evidence confirms not only specific instances of this base model relative to others, but the base-model hypothesis itself: Using empirical data e to construct a specific theory T ′ within an already accepted general framework T leads to a T ′ that is indeed (generally maximally) supported by e; but e will not, in such case, supply any support at all for the underlying general theory T . (Worrall 2010, 143) Call this double-counting II. In this quote Worrall refers to a general the- ory T that is already ‘accepted’. In such a case, the general theory cannot be incrementally confirmed, as it already has maximal probability.10 Worrall’s remarks are thus consistent with Bayesian confirmation. We take Worrall’s work to be highly suggestive, however, of the more general claim against double-counting II. We will show that, according to Bayesian confirmation theory, double-counting II is legitimate—thus conflicting with the more gen- eral claim against double-counting II. Perhaps when climate scientists claim that separate data is required for confirmation and calibration, they take for granted, along the lines of Wor- rall, that double-counting II is illegitimate, i.e. calibration of a base-model hypothesis cannot result in that hypothesis being confirmed relative to an- other base-model hypothesis, and thus other data is needed for any such confirmation. 10Note also that Worrall considers only cases where the evidence falsifies all but one instance of a base model. 
10 This position, however, is not born out by Bayesian confirmation logic (at least).11 On the contrary, double-counting II is legitimate and can arise for two reasons: 1) ‘average’ fit with the evidence may be better for one base model relative to another, and/or 2) the specific instances of one base model that are favoured by the evidence may be more plausible than those of the other base model that are favoured by the evidence.12 As per double-counting I, our analysis revolves around straightforward likelihood ratios, although here we must introduce prior probability distri- butions over the specific model instances, conditional on each base-model hypothesis being true.13 In the interests of a more concrete discussion, we first introduce a second base-model hypothesis, a quadratic of the form: Q : y(t) = αt2 + β + N(0,σ). (3) Assume that the specific model instances, like those of L above, are all com- binations of α and β, where each may take any value in the discrete set {1, 2, 3, 4}. As before, the error standard deviation, σ, is fixed. Specific model instances are labelled Q1,1, Q1,2, . . .. Note that the base-model hypotheses L and Q are of the same complexity, i.e. they have the same number of free parameters. This is an intentional choice; we do not want to introduce a fur- ther issue of relative model complexity and penalties for overfitting. While an important and controversial issue that is certainly tied up with calibra- tion, the overfitting debate only confounds the question of double-counting. (Nonetheless we will return to this debate briefly at the end of the subsec- tion.) In standard Bayesian terms, the confirmation of one base-model hypoth- esis, e.g., L, with respect to another, e.g., Q, depends on the likelihood ratio Pr(E|L)/Pr(E|Q). As before, if the ratio is greater than 1, then L is con- firmed relative to Q, and if it is less than 1, then Q is confirmed relative to 11We remark on frequentist ‘model selection’ methods at the end of this section; ac- cording to these methods, double-counting II is legitimate—in conflict with the general claim we are attributing to Worrall. Note that Mayo’s ‘severe testing’ approach to con- firmation does not support the Worrall conclusion either (see Mayo’s 2010 response to Worrall). What is important for the severe testing approach is not whether evidence has already been used to calibrate a base-model, but whether the evidence severely tests this base-model hypothesis; these two considerations do not always match up. It is beyond the scope and aims of this paper, however, to elaborate further on the severe testing approach or any other alternative vis-à-vis Bayesian confirmation. 12Our analysis is thus more in line with Howson (1988). 13For double-counting I we were able to eschew prior probabilities altogether when assessing confirmation. 11 L.14 In this case, the relevant likelihoods, however, are not entirely straight- forward: Pr(E|L) = Pr(E|L1,1) ×Pr(L1,1|L) + . . . + Pr(E|L4,4) ×Pr(L4,4|L), (4) Pr(E|Q) = Pr(E|Q1,1) ×Pr(Q1,1|Q) + . . . + Pr(E|Q4,4) ×Pr(Q4,4|Q). Note that Pr(L1,1|L) is the prior probability (i.e. probability before the data is received) of y(t) = t+1+N(0,σ) being the true description of the data gen- eration process for y(t), given that the true model is linear. The expressions above provide formal support for our earlier statement that confirmation of base models depends on 1) fit with the evidence and 2) the conditional priors of all specific instances of these base models. 
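To make these expressions concrete, here is a minimal computational sketch of the toy comparison between L and Q. The data points, error standard deviation and uniform conditional priors are invented for illustration; the point is only that the same evidence E enters both the instance-level comparison (calibration, double-counting I) and the base-model comparison of expression (4) (double-counting II).

    # Sketch (invented data): calibration and base-model comparison for L and Q.
    import itertools
    import math

    SIGMA = 1.0                    # fixed error standard deviation
    ALPHAS = BETAS = [1, 2, 3, 4]  # discrete parameter grid, as in the text
    data = [(0.0, 2.1), (1.0, 3.2), (2.0, 4.0), (3.0, 5.3)]  # made-up (t, y) observations

    def normal_pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    def likelihood(model, a, b):
        """Pr(E | model instance): product of Gaussian error densities over the data."""
        p = 1.0
        for t, y in data:
            p *= normal_pdf(y, model(t, a, b), SIGMA)
        return p

    def linear(t, a, b):      # instances L_{a,b}: y(t) = a*t + b
        return a * t + b

    def quadratic(t, a, b):   # instances Q_{a,b}: y(t) = a*t**2 + b
        return a * t ** 2 + b

    # Double-counting I: calibration compares instances of L by likelihood ratios.
    lik_L = {(a, b): likelihood(linear, a, b) for a, b in itertools.product(ALPHAS, BETAS)}
    best = max(lik_L, key=lik_L.get)
    print("best-fitting instance of L:", best,
          "; likelihood ratio against L_{1,1}:", lik_L[best] / lik_L[(1, 1)])

    # Double-counting II: Pr(E|L) and Pr(E|Q) are prior-weighted averages of instance
    # likelihoods, as in expression (4), here with uniform conditional priors.
    prior = 1.0 / (len(ALPHAS) * len(BETAS))
    pr_E_given_L = sum(prior * v for v in lik_L.values())
    pr_E_given_Q = sum(prior * likelihood(quadratic, a, b)
                       for a, b in itertools.product(ALPHAS, BETAS))
    print("Pr(E|L)/Pr(E|Q) =", pr_E_given_L / pr_E_given_Q)  # > 1 for this invented data

The same evidence E thus fixes the plausible values of α and β and, at the same time, bears on the comparison between the base-model hypotheses.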
Consider first the special case where the conditional prior probabilities of all specific instances of L and Q are equivalent. That is: Pr(L1,1|L) =. . .=Pr(L4,4|L) =. . . = Pr(Q1,1|Q) =. . .=Pr(Q4,4|Q) =x. (5) Suppose the observed data E yield on balance greater likelihoods for in- stances of L than Q. Then L is confirmed relative to Q because of reason 1), viz. the average fit with the evidence is better for base-model hypothesis L than for Q. Furthermore, there is calibration because E is used to determine the most likely values of α and β. Another special case is where the base-model hypotheses have equivalent fit with the data when all specific models are weighted equally, but the priors are not in fact equal. Suppose that the specific instances of L that have the higher likelihoods for E are in fact more plausible (higher conditional priors) than the specific instances of Q that have the higher likelihoods. Then L is confirmed relative to Q because of reason 2), viz. the specific instances of L favoured by the evidence are more plausible than the specific instances of Q favoured by evidence. Furthermore, there is calibration: E is used to deter- mine the most likely values of α and β. Alongside these two special cases there is also the case of double-counting II because of both 1) and 2). Worrall (2010) has claimed that in cases where data seem to be used for calibration and confirmation of a base-model hy- pothesis, what really happens is that only some of the data is needed to determine the values of the initial free parameters, and the rest of the data 14Again, as before, the relative posterior probabilities of L and Q, i.e. Pr(L|E)/Pr(Q|E), depend also on their prior probability ratio. 12 then confirms the hypothesis; thus there is no double-counting. However, this splitting of the data can throw away valuable information about the free parameters and is not in keeping with Bayesian logic of confirmation. Rather, as we see for the cases discussed here, all of the data are used to determine the values of the free parameters as well as for confirmation of base-model hypotheses, and thus we have a genuine case of double-counting. Finally, while the Bayesian approach to confirmation is far from marginal, there have been interesting challenges to this approach in the context of double-counting II. Concerns about comparing base models of differing com- plexity have lead to special methods for assessing base models, i.e. families of models. This is the field of model selection (see Burnham and Anderson 2002). Our analysis above is standard Bayesian, but it is important to note that various alternative methods for comparing base models have been sug- gested, including the Akaike approach (see Forster and Sober 1994). The controversies here run deep and extend to whether the basic unit of analysis should be a family of models or a specific model, and also to what we are trying to assess: the truth of model hypotheses, or their predictive accu- racy? It is beyond the scope of this paper to enter into this debate. We note simply that even if an alternative (frequentist) approach to confirma- tion of base models is taken, the legitimacy of both double-counting I and II holds: evidence used for calibrating base models is also used for determining their relative standing, or, in other words, for confirmation (see, for instance, Hitchcock and Sober 2004). Section 4 presents two analyses from the climate literature that exem- plify the two special cases of double-counting II. 
The aim here is to show that climate scientists do engage in double-counting, even if they do not acknowledge it as such. 4 Climate science examples: comparative confirmation in prac- tice There is considerable discussion in climate science about calibrating aerosol forcing. To give some background: aerosols are small particles in the atmo- sphere. They vary widely in size and chemical composition and arise, e.g., from industrial processes. Aerosols alter the Earth’s radiation balance, and the aerosol forcing measures the extent that anthropogenic aerosols alter this balance. Anthropogenic aerosols influence the climate in two ways: first, they 13 reflect and scatter solar and infrared radiation in the atmosphere (measured by the direct aerosol forcing). Second, they change the properties of clouds and ice (measured by the indirect aerosol forcing). Overall aerosols are be- lieved to exert a cooling effect on the climate. The uncertainty about the magnitude of the aerosol forcing, in particu- lar about the indirect aerosol forcing, is huge because little is known about the physical and chemical principles of how aerosols change the properties of clouds and ice and how they scatter radiation. Consequently, it is standard practice to calibrate the aerosol forcing against data, and the aerosol forcing constitutes a prime example of calibration in climate science. We will now show that in climate papers about the aerosol forcing we can find the two special cases of double counting II. 4.1 Confirmation due to better and worse best fits The first paper we look at is Harvey and Kaufmann (2002). They compare the adequacy of two climate models (with model error) for simulating the observed warming of the past two and a half centuries. The two base models are (the climate models are derived from an energy balance model coupled to a two-dimensional ocean model):15 • M1: model instances that consider both natural and anthropogenic forcings to describe climate change (plus model error). • M2: model instances that consider only anthropogenic forcings to de- scribe climate change (plus model error). They assume that the model error is such that none of the base-model hypotheses can be falsified by the data but where, roughly, the closer the simulations are to the observations, the better.16 The evidence regarded as relevant for assessing the adequacy of the base models are the past record of mean surface temperature changes, interhemispheric surface temperature changes, surface temperature changes in the northern hemisphere and surface temperature changes in the southern hemisphere. This evidence is used to simultaneously calibrate the aerosol forcing and the climate sensitivity. (The 15The base model M1 (M2) does not consist of one model to which different forcing values can be assigned. It consists of several different models, which consider different an- thropogenic and natural influences (different anthropogenic influences), to which different forcing values can be assigned. Hence Harvey and Kaufmann compare two sets of models. 16They do not assume any observation error. 14 climate sensitivity measures the mean temperature change resulting from a doubling of the concentration of carbon dioxide in the atmosphere). Moti- vated by physical considerations, the initial ranges considered are [0,-3] for the aerosol forcing and [1, 5] for the climate sensitivity. They proceed as follows: among all the model instances of M1 and M2, Harvey and Kaufmann identify a model instance which best matches the data. 
Then they apply a statistical test to determine whether other model instances differ significantly from the best instance. In this way they arrive at a set of best performing models instances. (Denote this set by MB and let MBC be the model instances of M1 and M2 which are not in MB.) It turns out that MB only includes instances of M1. Consequently, they conclude that there is confirmation: M1 (natural and anthropogenic forcings) is more ade- quate for simulating the past temperature record than M2 (only anthropogenic forcings). Furthermore, they use the same data to calibrate the aerosol forc- ing : the instances of M1 in MB correspond to an aerosol forcing range of (-1.5, 0], which is thus regarded as the likely range. Harvey and Kaufmann can be seen as engaging in double-counting II. Their procedure can (roughly) be reconstructed in Bayesian terms, as per Section 4. The model error is probabilistic.17 Further, because initially they are indifferent about the exact forcing values, they assume a uniform prior over the aerosol forcing and climate sensitivity conditional on M1 and M2.18 Their procedure comes close to assigning to the probability of the data given MBC a much smaller value than to the probability of the data given MB. (That is, Pr(E|MBC)/ Pr(E|MB) is much smaller than 1, e.g., 1/9.) Then, because MB only includes instances of M1, it follows that the probability of the data given M1 is much higher than the probability of the data given M2. Consequently, probabilistic confirmation theory yields that M1 is confirmed relative to M2 and that very likely the aerosol forcing is in the range (−1.5, 0]. To conclude, Harvey and Kaufmann justifiably use the same data for calibration and comparative confirmation: They engage in case 1) of double counting II, i.e. there is confirmation because the average fit with the evidence is better for M1 than for M2. Note that we are not here assessing other 17Their method implies that (roughly) the smaller the model error, the better, and that none of the models can be falsified. However, apart from this, the assumptions about the model error remain unclear. It would be desirable to spell these assumptions out because this is needed for specifying the models’ adequacy. 18Likewise, we assume that each of the different models in M1 (M2) are equiprobable (see footnote 15). 15 aspects of the experimental design; for instance, climate scientists may debate the relevance of the past ocean temperature change data for comparing the models’ adequacy. As stressed earlier, that is a different question not to be confused with double-counting. 4.2 Confirmation due to more and less plausible forcings values As a second case let us compare the models of Knutti et al. (2002) and Knutti et al. (2003). Knutti et al.’s (2002, 2003) concern is to construct models which are adequate for long-term predictions of temperature changes (within the error bounds) until 2100 under two important emission scenarios. They assume that the model error is discrete (cf. Section 3). The two base models are (the climate models are derived from a dynamical ocean model coupled to an energy- and moisture-balance model of the atmosphere): • M1: model instances considered by Knutti et al. (2002). There are five different ocean setups and the carbon cycle is not accounted for explicitly (the carbon cycle determines how emissions are converted into concentrations in the atmosphere).19 • M2: model instances considered by Knutti et al. (2003). 
There are ten different ocean model setups and the carbon cycle and its uncertainty are explicitly accounted for with a parameterization.20 The evidence which they regard as relevant for assessing the adequacy of these models are past mean surface temperature changes and ocean temper- ature changes. All the elements needed to compare the two base-model hypotheses in the framework of probabilistic confirmation theory are present in Knutti et al. (2002, 2003). The evidence is used to simultaneously calibrate the indirect aerosol forcing and the climate sensitivity. Motivated by physical estimates, Knutti et al. (2002, 2003) assume that, conditional on M1 and M2, the indi- rect aerosol forcing is initially normally distributed with the mean at -1 and a standard deviation of 1.21 The climate sensitivity is assumed to be initially 19The ocean setups of M1 and M2 differ: the ten ocean setups of M2 do not include the five ocean setups of M1. 20Because of the different ocean setups, the base model M1 (M2) does not consist of one model to which different forcing values can be assigned but of five (ten) different models to which different forcing values can be assigned. Hence the sets of models M1 and M2 are compared. 21They also discuss the case of a uniformly distributed aerosol forcing. However, the case of the normal distribution will be more insightful here. 16 uniformly distributed over [1,10], conditional on M1 and M2. Knutti et al. (2002, 2003) then calculate the a posterior probabilities for model instances, i.e. the likelihood of an arbitrary model-hypothesis instance given the data, assuming that M1 (M2) is true. A model-hypothesis in- stance is regarded as consistent if the average difference between the actual and the simulated observations is smaller than a constant.22 The a posterior probability is zero for inconsistent model-hypothesis instances; consistent model-hypothesis instances are assigned a probability proportional to the prior probability over the forcings values (i.e. over the model instances23). It turns out that the a posterior probability distribution over the forcings are the same for M1 and M2, implying the indirect aerosol forcing is likely (with approximate probability 0.90) to be in the range [-1.5,0.2). In short, the con- sistent model instances of M1 span the same range of forcing values as the consistent model instances of M2. Since all consistent model instances are regarded as having equivalent fit with the data (because postulated model error is discrete), we conclude that there is no comparative confirmation. Now suppose that for M1 the a posterior probability distribution over the forcings would have been different, say, that the likely (with probability 0.90) aerosol forcing range would have been [-2.7,-1]. Then the data would have been justifiably used both for calibration and comparative confirmation of the base-model hypotheses. This would have been an example of case 2 of double counting II : M2 would have been confirmed relative to M1 because the specific instances of M2 favoured by the evidence are more plausible than the specific instances of M1 favoured by the evidence. 5 Old evidence We have seen that double-counting is not illegitimate by Bayesian confirma- tion standards, at least, and is, moreover, practised by some climate scien- tists. This problematises assertions that double-counting is clearly bad. The remainder of the paper considers reasons why double-counting may yet be, for the most part, inapplicable in the climate-model context. 
Note that the reasons we canvas concern the failure of calibration and/or confirmation of base models; nothing we say in these final sections supports the position that separate data should be used for calibration and confirmation.

22 The constant equals the standard deviation of the model ensemble, which in climate science is regarded as a measure of model error. They also assume that there is observation error. To account for it, the difference of the observed and modelled temperature is divided by the uncertainty of the observed warming (Knutti et al. 2002, 2003).
23 Knutti et al. (2002, 2003) assume that each of the five (ten) different models constituting the base model class M1 (M2) are equiprobable (cf. footnote 20).

We start with what seems a prevalent concern: that the evidence in question was used to formulate the climate-model hypotheses, and so is old evidence that is not suitable for further confirmation purposes. This appears to be a concern of Stainforth et al. (2007a):

Development and improvement of long time-scale processes are therefore reliant solely on tests of internal consistency and physical understanding of the processes involved, guided by information on past climatic states deduced from proxy data. Such data are inapplicable for calibration or confirmation as they are in-sample, having guided the development process.

The term ‘in-sample’ is ambiguous here: on the one hand it apparently refers to evidence belonging to a different time(/spatial) period from the predictions of interest (we discuss this issue in subsequent sections), yet on the other hand it seems to refer to old evidence, i.e., evidence already taken into account in model development. Since these two issues come apart,24 they deserve separate treatment.

24 Consider: It is possible to find ‘new’ evidence from the same time period as the ‘old’ evidence.

Our current concern is updating on old evidence. How might this problem manifest? It helps to consider a paradigm case: imagine that a detective announces that the most plausible hypothesis, given the expensive earring and strands of hair found at the crime scene, is that the rich Lady visiting the manor killed the host. Clearly the evidence has already been taken into account in announcing that this hypothesis is the most plausible one. In Bayesian terms, the current plausibility of the hypothesis—its relatively high probability—is already a posterior probability, given the evidence. It would thus be a mistake to further confirm the rich-Lady hypothesis with respect to the same evidence. One can still assess the confirmatory power of the old evidence, but this requires estimating ‘counterfactual’ probabilities, such as the likelihood Pr(E|rich-Lady hypothesis where E is not already known). One can also entertain, if necessary, a prior probability for ‘rich-Lady hypothesis where E is not already known’—this is evidently what the detective’s belief in the rich Lady’s culpability would have been, before the evidence E was known.25

25 Admittedly, these ‘counterfactual’ probabilities may be difficult to estimate, and the controversy about their interpretation runs deep, but there are nonetheless ways to make sense of them (see, for instance, Eells and Fitelson 2000).

To better appreciate the problem, it is helpful to consider the overall confirmation from two independent pieces of evidence, say E1 and E2, according to Bayes’ theorem. In such a case, the overall confirmation of, say, H1 relative to H2, depends on the product of the two likelihood ratios:

[Pr(E1|H1)/Pr(E1|H2)] × [Pr(E2|H1)/Pr(E2|H2)].     (6)

It would be a mistake, of course, to treat the one piece of evidence, E, as if it were two pieces of independent evidence, and thus take confirmation due to E as:

[Pr(E|H1)/Pr(E|H2)] × [Pr(E|H1)/Pr(E|H2)].     (7)

This is what it means to update again on old evidence, or use the same evidence two times over for confirmation. It is effectively what would happen if, say, our detective further confirmed the rich-Lady hypothesis with respect to the same crime-scene data, and concluded that it was even more plausible that she was the murderer.

Let us now return to climate models. The way we have characterised calibration in Section 3 already guards against this old-evidence updating, to some extent. As mentioned, the problem set-up is crucial to a defensible Bayesian analysis: when calibrating and comparing two base-model hypotheses, we must assign all the specific instances of these models appropriate conditional priors, i.e., probabilities that do not yet take the evidence into account. Then the evidence can be used to calibrate or discriminate further between the model instances (and between the base models too, as per double-counting II). This is effectively the procedure that is followed in the case studies of Section 4; suitable conditional prior probabilities are initially selected, and then updated in light of the temperature data.

Of course, evidence might be unwittingly used two times over for calibration and/or confirmation. Indeed, Frame et al. (2007) note this danger in the context of assessing climate models. They caution against calibrating and/or confirming twice with the same evidence, not realising that the evidence already informed the conditional prior probability distributions over instances of the base models. In short, updating on old evidence is problematic, and practitioners should be careful to avoid doing this. But this is not an inevitable problem, and the remedy is not to use separate data for calibration and confirmation; the remedy is simply not to calibrate and confirm model hypotheses two times over with the same evidence.

There may be a lingering concern that prior probabilities for the base-model hypotheses themselves already incorporate the evidence, especially if base models with additional forcings or parameters are constructed expressly to achieve better fit with the data. So the base-model hypotheses are only a subset of the full space of possible models, and hence assigning each an equal prior probability would be to over-estimate their initial plausibility. The situation seems analogous to the murder case above—the base models that climate scientists work with are considered plausible precisely because the evidence has already been taken into account in selecting them. Just as the murder detective does not bother to mention various people near the crime scene who may have been under greater suspicion if the evidence were otherwise, climate scientists have presumably already dismissed a large number of possible base models in favour of the few under consideration that seem to have the potential to permit a reasonable fit with the evidence. It would then seem wrong to use the evidence a second time over for confirmation.
Notwithstanding this concern, we can still calibrate and assess comparative (incremental) confirmation in terms of the likelihoods Pr(E|Hi), where it is assumed in the condition that E is not already known. Furthermore, as men- tioned above, even if the base-model hypotheses are only a subset of the full space of model hypotheses—the ones deemed most plausible in light of the evidence—one can still estimate ‘counterfactual’ prior probabilities for the base-model hypotheses where the evidence E is not taken into account. Pre- sumably, the counterfactual prior probabilities for these base models should not add to 1, but to some probability less than 1. Determining the appropri- ate probability mass to assign to the set of base-model hypotheses may be quite tricky. But this problem affects only non-comparative, and ultimately, absolute confirmation, where we want to assess how confident we should be, overall, in our models, and again, has nothing to do with double-counting. In any case, the assessment of non-comparative and absolute confirmation of climate models is plagued with even bigger difficulties, and we will get to these in Section 7. For now we continue to analyse why even calibration and comparative confirmation may fail in the climate-model context. In particular, we turn now to concerns about the (ir)relevance of past data. 20 6 Doubts about the relevance of past data There is an important difference between the climate studies discussed in Subsections 4.1 and 4.2. In the Harvey and Kaufmann study, past data was used to calibrate/confirm base-model hypotheses concerning past cli- mate behaviour, whereas in the Knutti et al. studies, past data was used to calibrate/confirm base-model hypotheses concerning long-term future cli- mate behaviour (policy makers are most interested in this long-term future climate behaviour). The latter is more controversial than the former, and, as we will see in this and the next section, may be what some climate scientists have in mind when they make negative comments about calibration and con- firmation. This section discusses whether particular past data are relevant for assessing the adequacy of climate-model hypotheses in predicting future climate variables of interest. The next section will discuss the concern that climate models are based on assumptions that may not hold in the future, and hence there is considerable uncertainty about the full space of models that are possibly adequate for predicting future climate. Let us initially confine our analysis to the model instances of a single base-model hypothesis, e.g., L (equation (1) in Section 3). Assume that the model hypotheses denoted L1,1,L1,2 . . . this time concern whether the line in question (plus probabilistic model error) accurately predicts y(t) for future times t ≥ t∗. Our question here is: Can past data, i.e. data for t < t∗, help in calibrating L? The answer: it all depends on what is the implicit relationship between t < t∗ and t ≥ t∗, i.e. the implicit extension of the model instances of L that span t ≥ t∗ into the past. One possibility is that the past values de- pend strongly on the future values, and vice versa, a special case being where each line in L for t ≥ t∗ is associated with just one and the same line for t < t∗. In this case, past data E (past values for y(t)) is clearly relevant for comparing L1,1,L1,2 . . .. 
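Before turning to that issue, here is a minimal sketch (with invented numbers) of the bookkeeping point made in this section: in odds form, a body of evidence E should contribute its likelihood ratio to an update exactly once; feeding it in twice, as in expression (7), manufactures spurious extra confirmation.

    # Sketch (invented numbers): use the evidence once, not twice.
    def posterior_odds(prior_odds, likelihood_ratios):
        """Bayes in odds form: posterior odds = prior odds x product of likelihood ratios."""
        odds = prior_odds
        for ratio in likelihood_ratios:
            odds *= ratio
        return odds

    prior_odds = 1.0  # H1 and H2 taken to be equally plausible before the evidence
    lr_E = 3.0        # Pr(E|H1)/Pr(E|H2) for a single body of evidence E

    print(posterior_odds(prior_odds, [lr_E]))        # 3.0: legitimate confirmation of H1 over H2
    print(posterior_odds(prior_odds, [lr_E, lr_E]))  # 9.0: the expression-(7) mistake; E counted twice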
26 The likelihood ratios Pr(E|Li,j)/Pr(E|Lk,l) may be calculated as before.27 Another possibility, of course, is that the past values are independent of 26Note that the various frequentist estimators used in model selection, such as the Akaike estimator, assume an unchanging physical reality or data generation process. 27Recall our earlier footnote 8, which notes that the likelihoods are more precisely stated Pr(E|Li,j&B), etc., where B is background knowledge. Here background knowledge about the implicit relationship between past and future is very important for determining the value of the likelihood. 21 the future values, a special case being where each line in L for t ≥ t∗ is associated with any line for t < t∗. That is, each line hypothesis in L, such as L1,1, is implicitly associated with a whole set of extended models: 28 y(t) = { t + 1 + N(0,σ) if t ≥ t∗; γt + θ + N(0,σ) if t < t∗. (8) Here E, i.e. past values for y(t), will be irrelevant for comparing instances of L, the reason being that all instances of L are associated with the same pasts, and so E does not distinguish these instances. That is to say that the pertinent likelihoods for calibration—Pr(E|Li,j)/Pr(E|Lk,l)—all equal 1. So in this case there is no calibration of L and thus, in a sense, no double- counting I. The analysis of double-counting II is essentially the same. In this case, we are comparing two base-model hypotheses, for example, L and Q (equa- tions (1) and (3) in Section 3) where the concern is whether the models accurately predict y(t) for future times t ≥ t∗. Consider the special case where every model instance of L or Q is implicitly extended into the past in the same variety of ways.29 In this case past data E again does not favour any instance of either model over any other instance of either model, and we obtain Pr(E|L)/Pr(E|Q) = 1. Neither base hypothesis is confirmed relative to the other. So in a sense there is no double-counting II (in addition to no calibration and no double-counting I). Of course, this is just a special case; if the values of past and future variables were dependent, past data may confirm one base-model hypothesis over another. This scenario of independence is what some climate scientists seem to have in mind when they say: Statements about future climate relate to a never before experi- enced state of the system; thus, it is impossible to either calibrate the model for the forecast regime of interest or confirm the use- fulness of the forecasting process (Stainforth et al. 2007a, 2146). We have here the grounds for a charitable interpretation of climate scien- tists’ claim that data cannot be used to calibrate and confirm climate models. As suggested by the quote, one might say that calibration is impossible when 28Also, the implicit conditional probabilities for the past extensions are assumed not to vary for the Li,j. 29Again, the implicit conditional probabilities of the extensions are assumed not to vary for the Li,j and Qi,j. 22 the future climate variables in question (or the equations that adequately pre- dict them) are considered independent of the past data at hand (or the equa- tions that adequately predict them).30 It is important to note that the extent to which the point applies in climate science is controversial. Some climate scientists suggest that the future values of prominent climate variables, in- cluding precipitation and even average global temperature rise, are more or less unconstrained by the past values of these or other variables (e.g., Frame et al. 
2002; Stainforth et al. 2007a). Other climate scientists apparently do not think it so plausible that past values for at least some prominent climate variables are irrelevant to their future values (e.g., Knutti et al. 2002, 2003; Randall and Wood, 2007). In any case, the claim that calibration fails and there is no confirmation of model instances or model hypotheses in a par- ticular context is very different from the claim that double-counting is ‘bad practice’. Moreover, using separate past data for calibration and confirmation is no remedy for this problem. 7 Non-comparative confirmation and catch-alls We have thus far been concerned with confirmation of one model hypothesis relative to another. Yet certain statements from climate scientists concern- ing calibration suggest that what is at issue is whether the evidence confirms the predictions of a model tout court, i.e. relative to its complement (non- comparative confirmation). We first show that double-counting is also legit- imate for non-comparative confirmation. Then we explain why, nonetheless, confidence in future climate predictions may be hard to amass. The difficul- ties arise when climate models are based on assumptions which are suspected to be wrong in the future. Again, the problem cannot be solved by employing separate data for calibration and confirmation. In some cases, assessing non-comparative confirmation is relatively straight- forward. The relevant likelihood ratio involves a model (a base model or a specific instance) and its entire complement. For instance, the degree to which evidence E confirms base model hypothesis M relative to its entire complement is (where N,.. . ,Z are the mutually exclusive base model hy- potheses that exhaust the complement of M): Pr(E|M) Pr(E|¬M) = Pr(E|M) Pr(E|N)×Pr(N|¬M)+. . .+Pr(E|Z)×Pr(Z|¬M) . (9) 30A case which often arises in climate science is that the equations for adequately pre- dicting the past and future climate variables are considered identical in form, yet the parameters in these equations have values for past and future that are independent. 23 As before, this likelihood ratio may be greater than, less than, or equal to 1, corresponding to M being confirmed, disconfirmed, or neither, relative to its complement. Here again it must be noted that the final probability of M, i.e. Pr(M|E), is a further matter, and depends also on the prior probability Pr(M). This section too focuses just on the extent to which evidence incrementally con- firms or raises confidence in a model, this time relative to its complement. An examination of the above expression reveals, however, that non-comparative confirmation nonetheless requires substantial information regarding the prior probabilities of base models, in the form of conditional probabilities like Pr(N|¬M). So the comments at the end of Section 5 regarding difficulties in estimating the prior probabilities of base models are pertinent here. Further problems arise when the full set of base models under considera- tion is believed not to be exhaustive, and yet we are unable to specify what is missing (there are ‘known unknowns’). In other words, we have a range of plausible base-model hypotheses plus a catch-all, i.e. a hypothesis to the effect ‘none of the above is true’. One can easily see that non-comparative confirmation in these conditions is difficult to assess. The relevant likelihood is (where M is a base-model hypothesis, and hypotheses N,... 
Further problems arise when the full set of base models under consideration is believed not to be exhaustive, and yet we are unable to specify what is missing (there are 'known unknowns'). In other words, we have a range of plausible base-model hypotheses plus a catch-all, i.e. a hypothesis to the effect that 'none of the above is true'. One can easily see that non-comparative confirmation in these conditions is difficult to assess. The relevant likelihood ratio is (where M is a base-model hypothesis, and the hypotheses N, ..., together with the catch-all C, exhaust the complement of M):

Pr(E|M)/Pr(E|¬M) = Pr(E|M) / [Pr(E|N)×Pr(N|¬M) + ... + Pr(E|C)×Pr(C|¬M)].    (10)

The problem is that the likelihood associated with the catch-all, Pr(E|C), let alone the probability Pr(C|¬M), is very difficult to evaluate. How do we estimate the probability of some evidence conditional on the truth of a hypothesis which we cannot actually specify?

The common sentiment in climate science seems to be that there is indeed a catch-all, especially when the models' purpose is to predict future climate. Nonetheless, some studies appear to proceed under the assumption that model hypotheses may be confirmed (or disconfirmed) to some degree in non-comparative terms, given evidence. Most plausibly, in these cases the catch-all is either negligible, or else it is not completely unspecified, and some climate scientists think they know enough about it to at least have rough estimates for Pr(E|C). If at least a rough estimate for Pr(E|C) can be given (as well as rough estimates for all other terms in the expression above), the main conclusions drawn about double-counting and comparative confirmation carry over. In particular, double-counting II is legitimate for non-comparative confirmation and can arise for two reasons (cf. Section 3): 1) better fit of the model or the complement of the model with the evidence, and/or 2) the specific instances of the model that are favoured by the evidence may be more plausible or less plausible than the instances of the complement favoured by the evidence.
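How much turns on the catch-all term can be seen in a toy version of expression (10), with a single specified alternative N plus the catch-all C (all probability values below are invented for illustration and are not estimates of any real climate quantities). Holding the fit of M and N with the evidence fixed, the verdict flips as the assumed value of Pr(E|C) varies:

```python
# Toy calculation of expression (10): E confirms M relative to its complement
# just in case the ratio exceeds 1. All numbers are invented; the point is only
# that the verdict is hostage to the hard-to-estimate catch-all likelihood Pr(E|C).

def ratio_M_vs_complement(pr_E_given_M, pr_E_given_N, pr_E_given_C,
                          pr_N_given_notM, pr_C_given_notM):
    denominator = (pr_E_given_N * pr_N_given_notM +
                   pr_E_given_C * pr_C_given_notM)
    return pr_E_given_M / denominator

# Invented fits of the specified hypotheses with the evidence:
pr_E_given_M, pr_E_given_N = 0.20, 0.05
# Invented prior weights of the alternatives within the complement of M:
pr_N_given_notM, pr_C_given_notM = 0.5, 0.5

for pr_E_given_C in (0.01, 0.20, 0.60):
    r = ratio_M_vs_complement(pr_E_given_M, pr_E_given_N, pr_E_given_C,
                              pr_N_given_notM, pr_C_given_notM)
    print(f"Pr(E|C) = {pr_E_given_C:.2f}  ->  ratio = {r:.2f}",
          "(confirmed)" if r > 1 else "(not confirmed)")
```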
So far so good, but some climate scientists do not think the prospects for non-comparative confirmation of model hypotheses concerning the future are so rosy. First, note that if past data is considered independent of the future (cf. the discussion in Section 6), there cannot be non-comparative confirmation because there is no confirmation of one base-model hypothesis relative to another or indeed the catch-all. Second, even if past data are relevant, many scientists worry that climate models (which are based on our understanding of climate processes to date) invoke assumptions which may not hold in the future.31 Consider:

For these processes, and therefore for climate forecasting, there is no possibility of a true cycle of improvement and confirmation, the problem is always one of extrapolation and the life cycle of a model is significantly less than the lead time of interest. (Stainforth et al. 2007a, 2147)

One might interpret this view as follows: if base-model hypotheses concern future predictions, then the catch-all is overwhelming. Future climate behaviour may differ from that of the past/present in unanticipated ways, and so we are unable to specify even roughly the appropriate likelihoods of the relevant catch-all.

31Note that while these two concerns are logically distinct, they are of course closely related in the climate context. This is because the scientific reasons for doubting the relevance of past climate data have much overlap with the reasons for positing significant uncertainty about the future.

At this point it should be mentioned that climate models are designed to accurately simulate mean surface temperature changes; they fail to simulate absolute mean surface temperatures to a similar level of accuracy. In particular, the simulated mean surface temperature changes are derived from simulated surface temperature values that show biases of several degrees Celsius in many regions of the Earth; and the same holds for other variables such as ocean temperatures (Knutti et al. 2010; Randall et al. 2007, 608 and supplementary material). There is nothing in principle wrong with modelling temperature changes rather than absolute temperatures. When one variable is too difficult to predict, scientists often succeed instead in predicting a simpler variable such as an average or a change in that variable. However, many climate scientists argue that climate models fail to accurately simulate absolute temperatures because important processes are ignored, processes which may become relevant for adequately predicting the long-term future climate behaviour of interest (e.g., Stainforth et al. 2007a). This raises doubts about whether current climate models will adequately describe the relevant aspects of the future climate.

Climate scientists seem to take different views on the extent of our uncertainty about the future. But in the case of radical uncertainty, non-comparative confirmation of any one, or the whole set of, our climate-model hypotheses concerning the future is indeterminate, even if past data are relevant for comparing pairs of hypotheses. Overall confidence in any single model or the full set of models cannot increase.32 This position regarding non-comparative confirmation is reflected in the following statement concerning the modelling of future climate:

We take climate ensembles exploring model uncertainty as potentially providing a lower bound on the maximum range of uncertainty and thus a non-discountable [unable-to-be-ignored] climate change envelope [range of climate-change predictions]. (Stainforth et al. 2007b, 2167)

32Moreover, applying full Bayesian reasoning: the posterior probabilities of the climate-model hypotheses would also be indeterminate, due to the indeterminate likelihood ratios. Most plausibly, in the case of a radically unspecified catch-all, the prior probabilities would be indeterminate as well.

We now turn to an example in climate science which highlights the controversies surrounding the relevance of past data and the overall adequacy of climate models for future predictions.

8 Climate science example: non-comparative confirmation and catch-alls in practice

Our example for non-comparative confirmation with a catch-all again concerns the aerosol forcing and is Knutti et al. (2003), already discussed in Subsection 4.2. Recall that Knutti et al. aim to construct models which are adequate for long-term predictions of the temperature changes until 2100 under two emission scenarios (within the error bounds), and that the model error is discrete. The two base models are:

• M: the model instances of Knutti et al. (2003);
• C: the catch-all.

Recall that mean surface temperature changes and the ocean warming are regarded as relevant to assess the adequacy of the models, and they are used to constrain the indirect aerosol forcing and the climate sensitivity. Motivated by physical estimates, for the aerosol forcing a uniform distribution over [−2, 0] is chosen conditional on M or C.33 For the climate sensitivity a uniform distribution over [1, 10] is chosen conditional on M or C.

33Knutti et al. (2003) also discuss the case of a normally distributed aerosol forcing—see footnote 21.

The data are used for calibration: Knutti et al. (2003) calculate the likelihood of an arbitrary model-hypothesis instance given the data, assuming that M is true. Because of the uniform prior distribution over the forcing values, consistent model-hypothesis instances are equiprobable given the data; inconsistent model-hypothesis instances have zero probability (a model-hypothesis instance is regarded as consistent if the average difference between the actual and the simulated observations is smaller than a constant). The conclusion is that the likely range (summing to probability 0.93) of the indirect aerosol forcing is [−1.2, 0).
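This kind of calibration step can be sketched schematically as follows (this is not Knutti et al.'s code: the stand-in model, the single 'observation' and the consistency threshold are hypothetical placeholders, whereas in the actual study it is the average difference across the observations that must be smaller than a constant). Parameter values are drawn from the stated uniform priors, inconsistent instances are discarded, and the surviving instances are treated as equiprobable given the data, so the probability of any forcing range is just the fraction of survivors in it:

```python
# Schematic sketch of a calibration step of this kind (not Knutti et al.'s code).
# `run_toy_model`, OBSERVED_WARMING and THRESHOLD are invented placeholders.

import random

random.seed(0)

def run_toy_model(forcing, sensitivity):
    # Stand-in for the climate model: returns a fake simulated warming value.
    return 0.5 * sensitivity + 0.8 * forcing

OBSERVED_WARMING = 0.6      # invented 'observation'
THRESHOLD = 0.3             # invented consistency threshold

surviving = []
for _ in range(100000):
    forcing = random.uniform(-2.0, 0.0)        # uniform prior over [-2, 0]
    sensitivity = random.uniform(1.0, 10.0)    # uniform prior over [1, 10]
    if abs(run_toy_model(forcing, sensitivity) - OBSERVED_WARMING) < THRESHOLD:
        surviving.append((forcing, sensitivity))

# Consistent instances are equiprobable given the data, inconsistent ones get
# probability zero; the probability of a forcing range is the share of survivors
# that fall within it.
share = sum(1 for f, _ in surviving if -1.2 <= f < 0.0) / len(surviving)
print(len(surviving), round(share, 2))
```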
Furthermore, Knutti et al. seem to claim that the data confirm M relative to the catch-all because the fit with the data is very good and the model could have (easily) failed to simulate the data.

As already discussed in Subsection 4.2, Knutti et al. (2003) use elements of probabilistic confirmation theory. However, when reconstructing this as a case of non-comparative confirmation, what is missing are the values of Pr(E|M) and, in particular, of Pr(E|C). The crucial question is whether Pr(E|M)/Pr(E|C) > 1. If it is, then probabilistic confirmation theory will yield that the data are justifiably used for non-comparative confirmation and calibration; there will be double-counting II for reason 1)—the model instances of M provide a better fit with the data than the catch-all.

It should come as no surprise that the answer to this question is controversial. Knutti et al. (2003) tend towards an affirmative answer; they seem to claim that confidence in the future predictions of M has increased. However, if Stainforth et al. (2007a) are right that past data are not relevant to the future climate predictions of interest (as discussed in Section 6) or that the probabilities associated with the catch-all cannot be precisely specified (as discussed in Section 7), then the answer will be negative: the data simply will not confirm M relative to the catch-all. The fact that there is controversy among climate scientists about such fundamental and policy-relevant questions highlights the need to think more carefully about them. Whatever the outcome, this controversy is not about the problem of double-counting.

9 Concluding remarks

The main contribution of this paper is the untangling and clarification of worries concerning double-counting. We have argued that the common position—that double-counting is bad and that separate data must be used for calibration and confirmation of base-model hypotheses—is by no means obviously true. This is not to say there are no other fundamental concerns about the confirmatory power of evidence or about uncertainty in climate science. It is crucial, however, that the various issues are articulated and distinguished, if we are to make progress in assessing confidence in climate models and their predictions.

Our claim is that double-counting, in the sense of using evidence for calibration and confirmation, is justified by at least one major approach to confirmation—the Bayesian or relative likelihood approach. Calibration of a base-model hypothesis is all about determining which specific instances of the base model are confirmed relative to other specific instances. We call this double-counting I. Furthermore, we showed that, according to Bayesian standards, the same evidence may be used for calibration and for incrementally confirming one base-model hypothesis relative to another, or relative to its entire complement. We call this double-counting II.
We appealed to studies in climate science to show that these two forms of double-counting are in fact practised by some climate scientists, even if they are not acknowledged as such.

In the latter parts of the paper, we acknowledged and discussed important worries about calibration and confirmation in the climate-modelling context that may be marring the double-counting debate. In some cases, evidence already informs the prior assessment of model instances. If so, it cannot be used again for calibration and confirmation—this would be using the same evidence twice over. More fundamentally, there is often controversy about what evidence is relevant to whether a model achieves its purpose. Treating irrelevant evidence as if it were relevant and using this evidence for confirmation or calibration is also bad practice. Indeed, some climate scientists state strongly that future climate variables of interest are more or less unconstrained by the available past climate data. The upshot is that this past climate data is irrelevant for assessing the adequacy of models for predicting the future; hence there can be no calibration or double-counting. A related but subtly different concern is that climate models are based on assumptions which may not be applicable in the future. This would imply that one cannot hope to even roughly determine the likelihood of the catch-all hypothesis with respect to adequately predicting the future, and non-comparative confirmation, let alone absolute confirmation, would be indeterminate.

We noted that climate scientists disagree about whether these worries are all justified. In any case, the worries concern whether data are useless for confirmation and/or calibration. Problems of this kind cannot be remedied by using separate data for calibration and confirmation. We thus suggest that practitioners be clearer about their targets. Suspicions about the legitimacy of double-counting should not be confused with other important issues, such as what evidence is relevant for confirmation given the modelling context at hand, whether issues of old evidence are appropriately handled, or whether the worry is justified that climate models are based on assumptions which will not hold in the future.

Acknowledgements

Earlier versions of this paper have been presented at the third conference of the European Philosophy of Science Association, the 2010/2011 London School of Economics Discussion Group Meetings on Climate Science and Decision-making, the 2011 Bristol Workshop on Philosophical Issues in Climate Science, the first Annual Ghent Metaphysics, Methodology and Science Program, the 2011 Geneva Workshop on Causation and Confirmation, the 2011 Stockholm Workshop on Preferences and Decisions, and the 2012 Popper seminar. We would like to thank the audiences for valuable discussions. We also want to thank Reto Knutti, Wendy Parker and David Stainforth for helpful comments.

References

Anderson, T.L., Charlson, R.J., Schwartz, S.E., Knutti, R., Boucher, O., Rodhe, H. and J. Heintzenberg (2003). 'Climate Forcing by Aerosols – a Hazy Picture.' Science 300, 1103–1104.

Burnham, K.P. and D.R. Anderson (1998). Model Selection and Multimodel Inference. Berlin and New York: Springer.

Eells, E. and B. Fitelson (2000). 'Measuring Confirmation and Evidence.' Journal of Philosophy 97, 663–672.

Forster, M. and E. Sober (1994). 'How to Tell When Simpler, More Unified or Less Ad Hoc Hypotheses Will Provide More Accurate Predictions.' British Journal for the Philosophy of Science 45, 1–35.
Frame, D.J., Faull, N.E., Joshi, M.M. and M.R. Allen (2007). 'Probabilistic Climate Forecasts and Inductive Problems.' Philosophical Transactions of the Royal Society A 365 (20), 1971–1992.

Harvey, D. and R.K. Kaufmann (2002). 'Simultaneously Constraining Climate Sensitivity and Aerosol Radiative Forcing.' Journal of Climate 15 (20), 2837–2861.

Hitchcock, C.R. and E. Sober (2004). 'Prediction Versus Accommodation and the Risk of Overfitting.' British Journal for the Philosophy of Science 55, 1–34.

Howson, C. (1988). 'Accommodation, Prediction and Bayesian Confirmation Theory.' PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association 1988, 381–392.

Knutti, R. (2008). 'Should We Believe Model Predictions of Future Climate Change?' Philosophical Transactions of the Royal Society A 366, 4647–4664.

Knutti, R. (2010). 'The End of Model Democracy – an Editorial Comment.' Climatic Change 102, 395–404.

Knutti, R., Stocker, T.F., Joos, F. and G.-K. Plattner (2002). 'Constraints on Radiative Forcing and Future Climate Change from Observations and Climate Model Ensembles.' Nature 416, 719–723.

Knutti, R., Stocker, T.F., Joos, F. and G.-K. Plattner (2003). 'Probabilistic Climate Change Projections Using Neural Networks.' Climate Dynamics 21, 257–272.

Knutti, R., Furrer, R., Tebaldi, C., Cermak, J. and G. Meehl (2010). 'Challenges in Combining Projections from Multiple Climate Models.' Journal of Climate 23, 2739–2758.

Mayo, D.G. (2010). 'An Ad Hoc Save of a Theory of Adhocness? Exchanges with John Worrall.' In: D.G. Mayo and A. Spanos (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science. Cambridge: Cambridge University Press, 155–169.

Parker, W.S. (2009). 'Confirmation and Adequacy for Purpose in Climate Modelling.' Aristotelian Society Proceedings, Supplementary Volume 83 (5), 233–249.

Parker, W.S. (2010). 'Comparative Process Tracing and Climate Change Fingerprints.' Philosophy of Science (Proceedings) 77 (5), 1083–1095.

Randall, D.A. and B.A. Wielicki (1997). 'Measurements, Models, and Hypotheses in the Atmospheric Sciences.' Bulletin of the American Meteorological Society 78, 399–406.

Randall, D.A. and R.A. Wood (2007). 'Climate Models and Their Evaluation.' In: S. Solomon, D. Qin, M. Manning, Z. Chen, M. Marquis, K.B. Averyt, M. Tignor and H.L. Miller (eds.), Climate Change 2007: The Physical Science Basis. Cambridge: Cambridge University Press, 589–662.

Rodhe, H., Charlson, R.J. and T.L. Anderson (2000). 'Avoiding Circular Logic in Climate Modeling.' Climatic Change 44, 419–422.

Shackley, S., Young, P., Parkinson, S. and B. Wynne (1998). 'Uncertainty, Complexity and Concepts of Good Science in Climate Change Modelling: Are GCMs the Best Tools?' Climatic Change 38, 159–205.

Stainforth, D.A., Allen, M.R., Tredger, E.R. and L.A. Smith (2007a). 'Confidence, Uncertainty and Decision-support Relevance in Climate Predictions.' Philosophical Transactions of the Royal Society A 365, 2145–2161.
Stainforth, D.A., Downing, T.E., Washington, M., Lopez, A. and M. New (2007b). 'Issues in the Interpretation of Climate Model Ensembles to Inform Decisions.' Philosophical Transactions of the Royal Society A 365, 2163–2177.

Tebaldi, C. and R. Knutti (2007). 'The Use of the Multi-Model Ensemble in Probabilistic Climate Projections.' Philosophical Transactions of the Royal Society A 365, 2053–2075.

Worrall, J. (2010). 'Error, Tests, and Theory Confirmation.' In: D.G. Mayo and A. Spanos (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science. Cambridge: Cambridge University Press, 125–154.