title: Simulating normalising constants with referenced thermodynamic integration: application to COVID-19 model selection
authors: Hawryluk, Iwona; Mishra, Swapnil; Flaxman, Seth; Bhatt, Samir; Mellan, Thomas A.
date: 2020-09-08

Model selection is a fundamental part of Bayesian statistical inference and a widely used tool in the field of epidemiology. Simple methods such as the Akaike Information Criterion are commonly used, but they do not incorporate the uncertainty of the model's parameters, which can lead to misleading choices when comparing models with similar fits to the data. One approach to model selection that is more rigorous, and uses the models' full posterior distributions, is to compute the ratio of the normalising constants (or model evidences), known as the Bayes factor. These normalising constants integrate the posterior distribution over all parameters and balance over- and under-fitting. However, normalising constants often come in the form of intractable, high-dimensional integrals, so special probabilistic techniques need to be applied to estimate Bayes factors correctly. One such method is thermodynamic integration (TI), which can be used to estimate the ratio of two models' evidences by integrating over a continuous path between the two un-normalised densities. In this paper we introduce a variation of the TI method, referred to here as referenced TI, which computes a single model's evidence efficiently by using a reference density, such as a multivariate normal, whose normalising constant is known. We show that referenced TI, an asymptotically exact Monte Carlo method of calculating the normalising constant of a single model, in practice converges to the correct result much faster than competing approaches such as the method of power posteriors.
We illustrate the implementation of the algorithm on informative 1- and 2-dimensional examples, apply it to a popular linear regression problem, and use it to select parameters for a model of the COVID-19 epidemic in South Korea.

Mathematical modelling of infectious diseases is widely used to understand the processes underlying pathogen transmission and to inform public health policies. With advances in both computing power and the availability of data, it is possible to build more complex, robust and accurate models. The increasing importance of epidemiological models requires synthesis with rigorous statistical methods. This synthesis is required to robustly estimate necessary parameters, quantify uncertainty in predictions, and test hypotheses [1]. In practice, the decision to choose a model is often based on heuristics, relying on the knowledge and experience of the modeller, rather than on a systematic selection process [2, 3]. A number of model selection methods are available, but those methods often come with a trade-off between accuracy and computational complexity. For example, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), both widely used in epidemiology, are easy to compute but come with certain limitations [1]. Specifically, they do not take into account the uncertainty of the parameters or the prior probabilities, and might favour excessively complex models. The ratio of two normalising constants, the Bayes factor (BF), is a popular model selection method in the Bayesian setting [4]. In general, normalising constants cannot be computed analytically or through common quadrature methods, and more complex probabilistic algorithms need to be employed. Thermodynamic integration (TI) [5, 6, 7, 8] provides a useful way to estimate the log ratio of the normalising constants of two densities.
Instead of marginalising the densities explicitly, which results in high-dimensional integrals, with TI we only need to evaluate a 1-dimensional integral, whose integrand can easily be sampled with Markov chain Monte Carlo (MCMC). To see how this works, consider a pair of normalising constants

z_i = ∫_Ω q_i(θ) dθ , i = 1, 2 ,    (1)

where q_i is an un-normalised density for model M_i with parameters θ, which can be normalised to give the model's Bayesian posterior density p_i(θ) = q_i(θ)/z_i. To apply thermodynamic integration we introduce the concept of a path between q_1(θ) and q_2(θ), linking the two densities via a series of intermediate ones. This family of densities is denoted q(λ; θ). An example path in λ is shown in Figure 1, for selected values of the coupling parameter. Note that the coupling parameter is often denoted β (or t) in reference to a physical thermodynamic integration in inverse temperature (or temperature). In many instances this analogy makes sense, but a more generic procedure is a thermodynamic integration between two systems with distinct Hamiltonians coupled by a switching parameter λ, which is closer to the spirit of this work. The parametric density q(λ; θ), linking q_1 to q_2 and defining the intermediate densities, can be chosen to follow an optimal or otherwise convenient path, but a common choice is simply the geometric one:

q(λ; θ) = q_2^λ(θ) q_1^{1−λ}(θ) , λ ∈ [0, 1] .

The important point to note is that for λ = 0, q(λ; θ) returns the first density, q(0; θ) = q_1(θ); for λ = 1 it gives q(1; θ) = q_2(θ); and for in-between λ values, a log-linear mixture of the end-point densities. Just as we have defined a family of densities, there is an associated normalising constant at any point along the path, given for any value of λ by

z(λ) = ∫_Ω q(λ; θ) dθ .

A further small but important point, needed to avoid complications, is that the densities should have common support, e.g. Ω(1) = Ω(0). Hereafter the support is denoted Ω.
Having set up the definitions of q(λ; θ) and z(λ), the TI expression can be derived to compute the log ratio of z_1 = z(0) and z_2 = z(1), while avoiding explicit integrals over the models' parameters θ. By the fundamental theorem of calculus, assuming that the order of ∂_λ and the integral over θ can be exchanged, and by the elementary rules of differentiating logarithms, we get:

log (z_2/z_1) = ∫_0^1 (d/dλ) log z(λ) dλ
             = ∫_0^1 (1/z(λ)) ∫_Ω ∂_λ q(λ; θ) dθ dλ
             = ∫_0^1 ∫_Ω (q(λ; θ)/z(λ)) ∂_λ log q(λ; θ) dθ dλ
             = ∫_0^1 E_{q(λ;θ)} [ ∂_λ log q(λ; θ) ] dλ .    (2)

The notation E_{q(λ;θ)} denotes an expectation with respect to the sampling distribution q(λ; θ); for the geometric path, ∂_λ log q(λ; θ) = log (q_2(θ)/q_1(θ)). The final line of the expression summarises the usefulness of TI: instead of having to work with the complicated high-dimensional integrals of Equation 1 to find the log-Bayes factor log (z_2/z_1), we only need to consider a 1-dimensional integral of an expectation, and the expectation can be readily produced by MCMC. Here we set out in detail a variation on the TI theme that we have found to be useful in practice. The variation is to work primarily in terms of referenced normalising constants. The approach of using exactly integrable references has provided us with a particularly efficient method of selecting between different hierarchical Bayesian time-series models, and we hope the approach will be useful to others working on similar problems, for example in phylogenetic model selection, where TI is already a popular established method [9, 8]. In the following, we begin by introducing the reference density, then go on to illustrate different practical choices of reference normalising constant, along with theoretical and practical considerations. Next, the mechanics of applying the method are set out for two simple pedagogical examples. Performance benchmarks are discussed for a well-known problem in the statistical literature [10], which show that the method performs favourably in terms of accuracy and the number of iterations to convergence. Finally the technique is applied to a hierarchical Bayesian time-series model describing the COVID-19 epidemic in South Korea.
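As a concrete check of the identity in Equation 2, it can be verified numerically for a pair of made-up un-normalised 1-d Gaussian densities (our own toy choice, not from the paper), evaluating both z(λ) and the path expectations by quadrature rather than MCMC:

```python
import numpy as np
from scipy.integrate import quad
from scipy.interpolate import CubicSpline

# Two made-up un-normalised 1-d densities; z1 = sqrt(2*pi), z2 = sqrt(8*pi).
log_q1 = lambda t: -t ** 2 / 2.0
log_q2 = lambda t: -(t - 1.0) ** 2 / 8.0

def z_of(lam):
    # z(lambda) along the geometric path q2^lam * q1^(1-lam)
    return quad(lambda t: np.exp(lam * log_q2(t) + (1 - lam) * log_q1(t)),
                -np.inf, np.inf)[0]

def expectation(lam):
    # E_{q(lam;theta)}[log(q2/q1)], by quadrature instead of MCMC
    z = z_of(lam)
    return quad(lambda t: (log_q2(t) - log_q1(t))
                * np.exp(lam * log_q2(t) + (1 - lam) * log_q1(t)) / z,
                -np.inf, np.inf)[0]

lams = np.linspace(0.0, 1.0, 11)
# Interpolate the 11 expectations with a cubic spline and integrate over lambda
ti = CubicSpline(lams, [expectation(l) for l in lams]).integrate(0.0, 1.0)
direct = np.log(z_of(1.0)) - np.log(z_of(0.0))  # exact log(z2/z1) = 0.5*log(4)
```

Both routes give log (z_2/z_1) = 0.5 log 4; the 11-point λ grid matches the λ-placement used throughout the paper.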
In the COVID-19 infections model we select technical parameters of the reproduction number (R_t) model, such as the autoregression window size and the AR(k) lag, as well as epidemiologically meaningful parameters such as the serial interval distribution time for generating infections. An efficient approach to compute Bayes factors, or more generally to marginalise a given density for any application, is to introduce a reference:

log z = log z_ref + log (z/z_ref)
      = log z_ref + ∫_0^1 E_{q(λ;θ)} [ log (q(θ)/q_ref(θ)) ] dλ .    (3)

To clarify notation, z is the normalising constant of interest with density q, and z_ref is a reference normalising constant with associated density q_ref. The second line replaces the ratio z/z_ref with a thermodynamic integral as per the identity derived in Equation 2, along the geometric path q(λ; θ) = q^λ(θ) q_ref^{1−λ}(θ). The introduction of a reference naturally facilitates the marginalisation of a single density, rather than requiring pairwise model comparisons by direct application of Equation 2. This is useful when comparing multiple models, since the number of pairwise comparisons, n(n−1)/2, is at least n for n ≥ 3. However, the primary motivation for referencing the TI is the computational efficiency of converging the TI expectation. In Equation 3, with a judicious choice of q_ref, the reference normalising constant z_ref can be evaluated analytically and accounts for most of z. In this case log (q(θ)/q_ref(θ)) tends to have a small expectation and variance, and converges quickly. The idea of using an exact reference to aid in the solution of computationally intractable problems is a fundamental and perennial one throughout the computational sciences. Within normalising constant evaluation methods, our suggested algorithm is closely related to several other techniques [11, 7, 12, 8, 9, 13, 14]. In the generalised stepping stone method, a reference is introduced to speed up convergence of the importance sampling at each temperature [13, 14]. In the power posteriors (PP) approach, the reference in Equation 3 is the prior distribution of q and thus z_ref = 1 [11].
This is elegant, as the reference need not be chosen: it is simply the prior. The downside of this simplicity is that for poorly chosen or uninformative priors, the thermodynamic integral will be slow to converge and susceptible to instability; in particular for complex hierarchical models with uninformative priors this can often be an issue. The reference density in Equation 3 can be chosen at convenience, but the main desirable features are that it should be easily formed without special consideration or adjustments, similarly to the power posteriors method, and that z_ref should be analytically integrable and account for as much of z as possible. Such a choice of z_ref ensures that the part requiring expensive sampling is small and convergence is fast. An obvious choice in this regard is a Laplace-type reference, where the log density is approximated by a second-order one, i.e. a multivariate Gaussian. For densities with a single concentration, Laplace-type approximations are ubiquitous, and an excellent natural choice for many problems. In the following section we consider three approaches that can be used to formulate a reference normalising constant z_ref from a second-order log density (though more generally other tractable references are possible). In each referenced TI scenario, we note that even if the reference approximation is poor, the estimate of the normalising constant based on Equation 3 remains asymptotically exact; only the speed of convergence may be reduced (provided assumptions such as matching support of the end-point densities remain satisfied). The most straightforward way to generate a reference density is to Taylor expand log q(θ) to second order about a mode θ_0. Noting there is no linear term, the reference density is

q_ref(θ) = q(θ_0) exp( (1/2) (θ − θ_0)^T H (θ − θ_0) ) ,    (4)

where H is the Hessian matrix of log q evaluated at the mode, and θ_0 is the vector of mode parameters.
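The Taylor-expansion reference of Equation 4 can be sketched with a finite-difference Hessian; the helper name and the quadratic test density below are our own illustrative choices, not from the paper's code. For an exactly quadratic log density the Laplace reference is exact, so z_ref recovers z itself:

```python
import numpy as np

def hessian_fd(log_q, theta0, eps=1e-4):
    """Central finite-difference Hessian of log_q at the mode theta0."""
    d = len(theta0)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            def shifted(si, sj):
                t = np.array(theta0, dtype=float)
                t[i] += si * eps
                t[j] += sj * eps
                return log_q(t)
            H[i, j] = (shifted(1, 1) - shifted(1, -1)
                       - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * eps ** 2)
    return H

# Made-up quadratic log density with mode at 0; exact z = 2*pi*sqrt(2)
log_q = lambda th: -th[0] ** 2 / 2.0 - th[1] ** 2 / 4.0
theta0 = np.zeros(2)
H = hessian_fd(log_q, theta0)
# z_ref = q(theta0) * sqrt(det(2*pi*(-H)^-1)), computed in log form
log_z_ref = log_q(theta0) + 0.5 * np.linalg.slogdet(
    2.0 * np.pi * np.linalg.inv(-H))[1]
```

In practice the mode θ_0 would first be located by optimisation; here it is known to be the origin.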
The associated normalising constant is

z_ref = q(θ_0) sqrt( det( 2π (−H)^{−1} ) ) .    (5)

The Taylor expansion method tends to produce a reference density that works well in the neighbourhood of θ_0, but can be less suitable if the density is asymmetric, has long or short tails, or if the derivatives at the mode are poorly approximated, for example due to cusps or, conversely, very flat curvature at the mode. In many instances a more robust choice of reference can be found by using MCMC samples from the whole posterior density. This elementary but often more robust approach forms a reference density by drawing samples from the true density q(θ) to estimate the mean parameters θ̂ and covariance matrix Σ̂, such that

q_ref(θ) = q(θ̂) exp( −(1/2) (θ − θ̂)^T Σ̂^{−1} (θ − θ̂) ) .

The reference normalising constant is

z_ref = q(θ̂) sqrt( det(2π Σ̂) ) .    (6)

This method of generating a reference is simple and reliable. It requires sampling from the posterior q(θ), so it is more expensive than derivative methods, but the cost associated with drawing enough samples to generate a sufficiently good reference tends to be quite low. In the primary application discussed later, concerning relatively complex high-dimensional Bayesian hierarchical models, we use this approach to generate the reference density and normalising constant. The sampled covariance reference is typically a good approach, but it is not in general optimal: typically another Gaussian reference with different parameters can generate a normalising constant closer to the true one, thus leading to faster convergence of the thermodynamic integral to the exact value. Such an optimal reference can be identified variationally. The conditions identifying an optimal reference normalising constant can be derived by elementary perturbation theory.
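The sampled-covariance reference of Equation 6 amounts to a few lines of linear algebra; the helper name below is ours, and the standard-Gaussian target is only a check case (for it, log z_ref should recover log 2π up to sampling noise):

```python
import numpy as np

def gaussian_reference(samples, log_q):
    """Build log q_ref and log z_ref from draws of the target density q.

    samples: (n, d) array of MCMC draws; log_q: un-normalised log density."""
    theta_hat = samples.mean(axis=0)                          # sampled mean
    sigma_hat = np.atleast_2d(np.cov(samples, rowvar=False))  # sampled covariance
    prec = np.linalg.inv(sigma_hat)
    # z_ref = q(theta_hat) * sqrt(det(2*pi*Sigma_hat)), in log form
    log_z_ref = log_q(theta_hat) + 0.5 * np.linalg.slogdet(
        2.0 * np.pi * sigma_hat)[1]

    def log_q_ref(theta):
        delta = np.asarray(theta) - theta_hat
        return log_q(theta_hat) - 0.5 * delta @ prec @ delta

    return log_q_ref, log_z_ref

# Check case: standard bivariate Gaussian target, exact z = 2*pi
rng = np.random.default_rng(0)
draws = rng.standard_normal((200_000, 2))
log_q = lambda th: -0.5 * float(np.asarray(th) @ np.asarray(th))
log_q_ref, log_z_ref = gaussian_reference(draws, log_q)
```

In the referenced TI algorithm, `draws` would come from a preliminary MCMC run on q itself.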
Consider a Taylor expansion of the log normalising constant log z(λ) about λ = 0:

log z(λ) = log z(0) + λ ∂_λ log z(λ)|_{λ=0} + (λ²/2) ∂²_λ log z(λ)|_{λ=0} + ... .

The first derivative gives the expectation, as per the derivation in Equation 2,

∂_λ log z(λ) = E_{q(λ;θ)} [ log (q(θ)/q_ref(θ)) ] ,

and the second derivative is a variance,

∂²_λ log z(λ) = Var_{q(λ;θ)} [ log (q(θ)/q_ref(θ)) ] ≥ 0 .

As the curvature of log z(λ) is non-negative, to first order we see

log z(λ) ≥ log z(0) + λ E_{q_ref(θ)} [ log (q(θ)/q_ref(θ)) ] ,

and for the specific case of λ = 1,

log z ≥ log z_ref + E_{q_ref(θ)} [ log (q(θ)/q_ref(θ)) ] .    (7)

This inequality establishes a bound that can be maximised with respect to the position (µ) and scale (S) parameters of a reference density such as

q_ref(θ) = q(µ) exp( −(1/2) (θ − µ)^T S^{−1} (θ − µ) ) .

The parameters that maximise the bound provide a reference density that is variationally optimal. We note this is simply an application of the Gibbs-Feynman-Bogoliubov inequality [15, 16, 17], and that finding such an approximation to the true density is a well-studied problem, with numerous approaches that can be used to determine q_ref [18, 19]. In itself the existence of a variational bound provides no guarantee of a good approximation to the true normalising constant, so alone it is not a satisfactory general approach. However, as a point of reference from which to estimate the true normalising constant, it provides an optimal density within the family of trial reference functions considered, thereby improving convergence of the referenced TI estimate. Having set out three approaches to find a single reference for the TI expression in Equation 3, a natural generalisation is the telescopic expansion

log z = log z_0 + Σ_{i=0}^{n−1} log (z_{i+1}/z_i) , with q_n = q ,

where each log ratio is evaluated by the TI identity of Equation 2. Note that here the analytic reference is denoted z_0 rather than z_ref, to generalise the indexing. In cases where q_0 differs substantially from q, the telescopic generalisation can improve numerical stability. By bridging the endpoints through intermediate density pairs q_{i+1}(θ)/q_i(θ), we can form a series of lower-variance MCMC simulations with favourable convergence properties. A reasonable choice for generating intermediate densities is for the q_i-th density to be the 2(i+1)-th order Taylor expansion of q(θ). If a model has parameters with limits, e.g.
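The bound of Equation 7 can be checked numerically in one dimension. Below, the target is a made-up quartic density (not from the paper) and the reference is a zero-mean Gaussian with variance s²; both expectations under the reference are available in closed form, using E[θ⁴] = 3s⁴ and E[θ²] = s²:

```python
import numpy as np
from scipy.integrate import quad

log_q = lambda t: -t ** 4 / 4.0          # made-up target; z found by quadrature
log_z = np.log(quad(lambda t: np.exp(log_q(t)), -np.inf, np.inf)[0])

def lower_bound(s2):
    """log z_ref + E_ref[log q - log q_ref] for the reference N(0, s2)."""
    log_z_ref = 0.5 * np.log(2.0 * np.pi * s2)
    e_log_q = -0.75 * s2 ** 2            # E[-t^4/4] = -3*s2^2/4 under N(0, s2)
    e_log_q_ref = -0.5                   # E[-t^2/(2*s2)] = -1/2 under N(0, s2)
    return log_z_ref + e_log_q - e_log_q_ref

bounds = [lower_bound(s2) for s2 in np.linspace(0.1, 2.0, 20)]
```

Every trial variance obeys the inequality, and maximising over s² gives the variationally optimal Gaussian reference for this target.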
θ_1 ∈ [0, ∞), θ_2 ∈ (−1, ∞), etc., then in referenced TI the exact analytic integration of the reference density should be commensurately limited. However, the calculation of arbitrary probability density function orthants, even for well-known analytic functions such as the multivariate Gaussian, is in general a difficult problem. Computing high-dimensional orthants usually requires advanced techniques, the use of approximations, or sampling methods [20, 21, 22, 23, 24, 25]. Fortunately we can simplify our reference density, to create one with a tractable analytic integral under limits, by using a diagonal approximation to the sampled covariance or Hessian matrix. For example, the orthant of a diagonal multivariate Gaussian can be given in terms of the error function [26], leading to the expression

z_ref = q(θ̂) sqrt( det(2π Σ_diag) ) ∏_{i∈K} (1/2) [ 1 + erf( (θ̂_i − a_i) / sqrt(2 Σ_diag,i) ) ] ,    (8)

where K denotes the set of indices of the parameters with lower limits a_i, Σ_diag is a diagonal covariance matrix, that is, one containing only the variance of each of the parameters without the covariance terms, and Σ_diag,i denotes the i-th element of the diagonal. Restricting the reference to a diagonal density is a poorer approximation than using the full covariance matrix. In practice, however, this has not been particularly detrimental to the convergence of the thermodynamic integral, and again we note that the quality of the reference only affects convergence, rather than the eventual theoretical Monte Carlo accuracy of the normalising constant. This behaviour is observed in the practical examples considered later, though the distinction between accuracy and convergence, and matters of asymptotic consistency for an MCMC estimator run for finitely many iterations, are naturally less clear cut. The algorithm described in this section was implemented in the Python and Stan programming languages. Using Stan enables fast MCMC simulations, using the Hamiltonian Monte Carlo (HMC) and No-U-Turn Sampler (NUTS) algorithms [27, 28], and portability to other statistical languages, such as R or Julia.
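The diagonal-Gaussian orthant correction of Equation 8 is straightforward to implement; the helper name and argument layout below are our own:

```python
import numpy as np
from math import erf, sqrt, log, pi

def diag_orthant_log_z_ref(theta_hat, var_diag, log_q_at_theta_hat, lower_limits):
    """log z_ref for a diagonal Gaussian reference with lower parameter limits.

    lower_limits: dict mapping parameter index i -> lower limit a_i."""
    # Unconstrained diagonal Gaussian: q(theta_hat) * prod_i sqrt(2*pi*var_i)
    log_z = log_q_at_theta_hat + 0.5 * sum(log(2.0 * pi * v) for v in var_diag)
    for i, a_i in lower_limits.items():
        # P(theta_i > a_i) under N(theta_hat_i, var_i), via the error function
        p = 0.5 * (1.0 + erf((theta_hat[i] - a_i) / sqrt(2.0 * var_diag[i])))
        log_z += log(p)
    return log_z

# Check: a standard 1-d Gaussian restricted to [0, inf) has z = sqrt(2*pi)/2
val = diag_orthant_log_z_ref([0.0], [1.0], 0.0, {0: 0.0})
```

Upper limits, or two-sided intervals, would add analogous error-function factors.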
Additionally it is familiar to many epidemiologists using Bayesian statistics [29]. The code for all examples shown in this paper is available at https://github.com/mrc-ide/referenced-TI. In the examples shown in Section 3, we used 4 chains with 20,000 iterations per chain for the pedagogical examples, and 4 chains with 2,000 iterations for the other applications. In all cases, half of the iterations were used for burn-in. Mixing of the MCMC chains and sampling convergence were checked in each case, by ensuring that the R̂ value was ≤ 1.05 for each parameter in every model. In all examples in the remaining part of this paper, the integral given in Equation 2 was discretised to allow computer simulation. Each expectation E_{q(λ;θ)} [log (q(θ)/q_ref(θ))] was evaluated at λ = 0.0, 0.1, 0.2, ..., 0.9, 1.0, unless stated otherwise. To obtain the value of the integral in Equation 2, we interpolated a curve linking the expectations using a cubic spline, which was then integrated numerically. The pseudo-code of the algorithm with the sampled-covariance Laplace reference is shown in Algorithm 1.

Algorithm 1: Referenced TI
Input: q, an un-normalised density; q_ref, an un-normalised reference density; Λ, a set of coupling parameters λ; N, the number of MCMC iterations.
Output: z, the normalising constant of the density q.
1: Sample from q(θ) to compute the mean θ̂ and covariance Σ̂, forming q_ref and z_ref
2: for λ in Λ do
3:   Sample q(λ; θ) = q^λ(θ) q_ref^{1−λ}(θ) for N iterations
4:   Compute E_λ = E_{q(λ;θ)} [log q(θ) − log q_ref(θ)] from the samples
5: end for
6: Interpolate between the consecutive E_λ values to obtain a curve ∂_λ log z(λ)
7: Integrate ∂_λ log z(λ) over λ ∈ [0, 1] to get log (z/z_ref)
8: return z = z_ref exp( log (z/z_ref) )

To illustrate the technique, consider a 1-dimensional density q(θ) with normalising constant z = ∫_{−∞}^{∞} q(θ) dθ.

[Figure 1: b) Expectation E_{q(λ;θ)} [log (q(θ)/q_ref(θ))] vs MCMC iteration, shown at each value of λ sampled. c) λ-dependence of the TI contribution to the log-evidence. d) Convergence of the evidence z, with 1% convergence after 500 iterations and 0.1% after 17,000 iterations per λ.]
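Algorithm 1 can be sketched end-to-end in plain Python. To keep the sketch self-contained we use a random-walk Metropolis sampler in place of the paper's Stan/HMC sampling, and a stand-in 1-d target with a cusp at zero, q(θ) = exp(−|θ|), whose exact normalising constant z = 2 lets the result be checked; its variance is 2, which fixes the moment-matched Gaussian reference:

```python
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(1)

log_q = lambda t: -abs(t)                 # stand-in cusp density, z = 2 exactly
var = 2.0                                 # its variance (sampled in practice)
log_q_ref = lambda t: -t * t / (2.0 * var)
log_z_ref = 0.5 * np.log(2.0 * np.pi * var)   # z_ref = sqrt(2*pi*var)

def expectation_at(lam, n_iter=20_000, step=1.5):
    """E_{q(lam;theta)}[log q - log q_ref] by random-walk Metropolis."""
    log_path = lambda t: lam * log_q(t) + (1.0 - lam) * log_q_ref(t)
    t, vals = 0.0, []
    for _ in range(n_iter):
        prop = t + step * rng.standard_normal()
        if np.log(rng.uniform()) < log_path(prop) - log_path(t):
            t = prop
        vals.append(log_q(t) - log_q_ref(t))
    return float(np.mean(vals[n_iter // 2:]))   # drop first half as burn-in

lams = np.linspace(0.0, 1.0, 11)
exps = [expectation_at(l) for l in lams]
# Cubic-spline interpolation of the expectations, integrated over lambda
log_z = log_z_ref + CubicSpline(lams, exps).integrate(0.0, 1.0)
z = np.exp(log_z)                         # should be close to the exact z = 2
```

For this target the expectations vary only mildly with λ (analytically, from −2/√π + 1/2 at λ = 0 to −1/2 at λ = 1), which is why a coarse λ grid and modest chain lengths suffice.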
This density has a cusp, one of the more awkward pathologies of some naturally occurring densities [30, 31], and it does not have an analytical integral that easily generalises to multiple dimensions, but it is otherwise a made-up 1-dimensional example that could be interchanged with another. In this instance the Laplace approximation based on the second-order Taylor expansion at the mode will fail due to the cusp, so we use the more robust covariance sampling method. Sampling from the 1-d density q(θ) we find a variance of σ̂² = 0.424, giving the Gaussian reference density q_ref shown in Figure 1. In this simple example, the integral can easily be evaluated to high accuracy using quadrature [32, 33], giving a value of 1.523. Referenced thermodynamic integration reproduces this value, with the convergence of z shown in Figure 1: converging to within 1% of z after 500 iterations and within 0.1% after 17,000 iterations. This example illustrates notable characteristic features of referenced thermodynamic integration. The reference q_ref(θ) is a good approximation to q(θ), with z_ref accounting for most (102%) of z. Consequently z/z_ref is close to 1, and the expectations E_{q(λ;θ)} [log (q(θ)/q_ref(θ))] evaluated by MCMC are small. For the same reasons the variance at each λ is small, leading to favourable convergence within a small number of iterations. Finally, E_{q(λ;θ)} [log (q(θ)/q_ref(θ))] depends only weakly on λ, so there is no need to use a very fine grid of λ values or to consider optimal paths: satisfactory convergence is easily achieved using a simple geometric path with 4 λ intervals.

2D pedagogical example with constrained parameters

As a second example, consider a 2-dimensional un-normalised density q(θ) with a constrained parameter space, where θ_1 ∈ [0, +∞) and θ_2 ∈ (−∞, +∞). A reference density q_ref(θ) can be constructed from the Hessian at the mode of q(θ).
Notice that because parameter θ_1 is constrained to be ≥ 0, integrating the Gaussian approximation q_ref(θ) using the formula given in Equation 5 will give an overestimate. To account for this we use the reference density q_ref^diag(θ), based on a diagonal Hessian, which has an exact and easy to calculate orthant. All densities are shown in Figure 2. To obtain the log-evidence of the model, we calculated the exact value numerically [34, 32], using the full covariance Laplace method as per Equation 6, and using the diagonal covariance with a correction added to take into account the lower bound of the parameter θ_1, as per Equation 8. The Gaussian reference densities were then used to carry out referenced thermodynamic integration. The results of all methods are given in Table 1. As expected, without applying the correction the value of the evidence is overestimated.

To benchmark the application of referenced TI in a model selection task, consider fitting non-nested linear regression models to the radiata pine data [10]. This example is widely used for testing normalising constant calculation methods, because the exact value of the model evidence can be computed. The data consist of 42 3-dimensional datapoints: y_i, the maximum compression strength; x_i, the density; and z_i, the density adjusted for resin content. In this example, we follow the approach of Friel and Wyse [12], and test which of the two models, M_1 and M_2, provides better predictions for the compression strength:

M_1 : y_i = α + β (x_i − x̄) + ε_i , ε_i ~ N(0, σ²) ,
M_2 : y_i = γ + δ (z_i − z̄) + η_i , η_i ~ N(0, τ²) .

In other words, we want to know whether density or adjusted density better predicts the compression strength. The priors for the models were selected in a way that enables an exact solution, and can be found in Friel and Wyse [12].
Five methods of estimating the model evidence were used in this example: the Laplace approximation using a sampled covariance matrix, model-switch TI along a path directly connecting the models [8, 35], referenced TI, power posteriors with 11 equidistant λ-placements (labelled here as PP_11), and power posteriors with 100 λs (PP_100) as in [12]. For the model-switch TI, referenced TI and PP_11 we used λ ∈ {0.0, 0.1, ..., 1.0}. The expectations from MCMC sampling at each λ for model-switch TI, referenced TI, PP_11 and PP_100, with fitted cubic splines between the expectations, are shown in Figure 3. Immediately we notice that both TI methods eliminate the problem of the diverging expectation at λ = 0 that is observed with the power posteriors, where samples at λ = 0 come from the prior distribution. The PP_11 method failed to estimate the log-evidence correctly. The 1-dimensional curves estimated by fitting a cubic spline to the expectations were integrated for each of the models to obtain the log-evidence of M_1 and M_2, and the log ratio of the two models' evidences for the model-switch TI. The rolling mean of the integral over 1,500 iterations for referenced TI and PP_100 is shown in Figure 4 a-b. We can see from the plots that referenced TI exhibits excellent convergence to the exact value, whereas PP_100 oscillates around it. In the same Figure 4, plots c-d show the distribution of the log-evidence for each model generated by 15 runs of three algorithms: the Laplace approximation with sampled covariance matrix, referenced TI, and PP_100. Although all three methods produced log-evidences satisfactorily close to the exact solution, referenced TI was the most accurate and, importantly, converged fastest. The BFs calculated to assess whether model M_2 fits the data better than model M_1, and the number of iterations needed to achieve a standard error of 0.5% (excluding the iterations needed for the MCMC burn-in), are presented in Table 2.
Notably, both TI methods gave a BF very close to the exact value. Referenced TI performed the best of the tested methods: it converged faster than all other methods, requiring only 308 MCMC draws, compared with the 55,000 draws needed for the power posteriors method and over 2,000 for the model-switch TI. Referenced TI also showed excellent accuracy, both in estimating the individual models' evidence and the BF.

The final example of using referenced TI for calculating model evidence is fitting a renewal model to COVID-19 case data from South Korea. The data were obtained from https://opendata.ecdc.europa.eu/covid19/casedistribution/csv (accessed 19-07-2020) and contained time-series data of confirmed cases from 31-12-2019 to 18-07-2020. The model is based on the Bellman-Harris branching process, whose expectation is the renewal equation. Its derivation and details are explained in Mishra et al. [36], and a short explanation of the model is provided in the Appendix. Briefly, the model is fitted to the time-series case data and estimates a number of parameters, including the serial interval and the effective reproduction number R_t. The number of cases for each day is modelled by a negative binomial distribution, with shape and overdispersion parameters estimated through a renewal equation. Three modifications of the original model were tested:

• GI = k, k = 5, 6, 6.5, 7, 8, 9, 10, 20 - Rayleigh-distributed generation interval with mean k (which we assume to be the same as the serial interval),
• AR(k), k = 2, 3, 4 - autoregressive model with k days lag,
• W = k, k = 1, 2, 3, 4, 7 - changing the length k of the sliding window W.

[Table 2 caption: Both TI and referenced TI methods used 11 equidistant λs. The power posteriors method was used with 11 (PP_11) and 100 (PP_100) λs. The third column shows the total number of MCMC steps required to achieve a standard error of 0.5%, excluding the burn-in steps. * using sampled covariance matrix.]
Within each group of models (GI, AR and W) we want to select the best model as the one with the highest evidence. For example, we want to check whether GI = 6 fits the data better than GI = 10, etc. The dimension of each model depended on the modifications applied, but in all cases the normalising constant was a 40- to 200-dimensional integral. The log-evidence of each model was calculated using the Laplace approximation with a sampled covariance matrix, and a correction to the estimate was then obtained using the referenced TI method. Values of the log-evidence for each model calculated by both the Laplace and referenced TI methods are given in Table 3. Interestingly, the favoured model in each group, i.e. the model with the highest log-evidence, differed depending on whether the evidence was evaluated using the Laplace approximation or referenced TI. For example, using the Laplace method, the sliding window of length 7 was incorrectly identified as the best model, whereas with referenced TI the window of length 2 was chosen as the best of the tested sliding-window models, which agrees with previous studies of window-length selection in H1N1 influenza and SARS outbreaks [37]. This exposes how essential it is to determine the evidence accurately: even good approximations can give misleading results. Log-Bayes factors for all model pairs within each of the three groups are shown in Figure 7 in the Appendix. The importance of performing model selection in a rigorous way is clear from Figure 5, where the posterior densities of the parameters φ and σ and the generated R_t time-series are plotted for the models favoured by the Laplace and referenced TI methods (the meaning of the parameters is given in the Appendix). The differences in the densities and time-series show the pitfalls of selecting an incorrect model. For example, the parameter σ was overestimated by the models selected by the Laplace approximation in comparison to those selected by referenced TI.
[Table 3: Log-evidence estimated by the Laplace approximation, the added referenced TI correction, and the total log-evidence from referenced TI, with 95% credible intervals given in brackets. In each section, the model with the highest log-evidence estimated by the Laplace or referenced TI method is indicated in bold.]

The differences between the two favoured models were most extreme for the GI = 8 and GI = 20 models. While GI = 8 is plausible, even likely, for COVID-19, GI = 20 is implausible given the observed data [38]. This is further supported by observing that for GI = 20, favoured by the Laplace method, R_t reached a value of over 125 in the first peak, around 100 more than for GI = 8. The second peak was also largely overestimated, with R_t reaching a value of 75. Particularly interesting is the fact that all the models fit the confirmed COVID-19 case data similarly well, as shown in Figure 8 in the Appendix. That makes it impossible to select the best model just by visually comparing the model fits, or by using model selection methods that do not take the full posterior distributions into account. It also shows that although the models might fit the data well, other generated quantities, which are often of interest to the modeller, might be completely incorrect. Moreover, it emphasises the need to test multiple models before any conclusion or inference is drawn, especially with complex, hierarchical models. In epidemiology this is particularly important, as modellers can be tempted to pick arbitrary parameters for their model as long as the predictions fit the data [37]. Although the fit might be accurate, other inferred parameters, or the uncertainty related to the predictions, might be completely inappropriate for making meaningful predictions. In the radiata pine example, the contribution of TI to the marginal likelihood estimated by the Laplace approximation was not substantial, and using the Laplace approximation would suffice to make an informed model choice.
In this example, by contrast, we see that the TI contribution is relatively large, and it changes the decision we would make if we calculated the model evidence using only the Laplace method. Moreover, from Table 3 we see that when the Laplace approximation was applied, the evidence was highest for the "boundary" models. For example, for the fixed sliding window length models, the log-evidence increased monotonically with the value of W when the Gaussian approximation was applied. A monotonically increasing log-evidence makes it unclear what the best value of W is: in general, in that case more values should be tested until the preferred model is no longer on the boundary. With referenced TI that issue no longer arises, as the preferred model was not the boundary one and the log-evidence change was concave. The examples shown in Section 3 illustrate the applicability of the referenced TI algorithm for calculating model evidence and for the model selection process. In the radiata pine example, referenced TI performed better than the other tested methods, in terms of both accuracy and speed. The power posteriors method required a much denser placement of the coupling parameters around λ = 0, where the values are sampled purely from the prior distribution. In the case of referenced TI, at λ = 0 values are sampled from the reference density, which should be close to the original density (in the sense of Kullback-Leibler divergence); this results not only in a more accurate estimate of the normalising constant, but also in much faster convergence of the MCMC samples. It is worth noting that referenced TI performed better even than the model-switch TI method. The speed of convergence of the referenced TI algorithm remains to be theoretically characterised, but the practical tests shown in this paper suggest faster convergence than other approaches. This is a very important property for costly models where MCMC iterations are computationally demanding.
In the Bellman-Harris renewal model for the COVID-19 epidemic in South Korea, we showed that for complex models, model selection through the Laplace approximation of the normalising constant can give misleading results. Using referenced TI, we calculated the model evidence for 16 models, which enabled a quick comparison between chosen pairs of competing models. Importantly, the evidence given by referenced TI was not monotonic with increasing values of one of the parameters, as it was for the Laplace approximation. Referenced TI proves useful in situations where the posterior distribution is uni-modal but non-Gaussian. For Gaussian posteriors, the BFs calculated using the Laplace approximation of the model evidence provide almost immediate and sufficient information on which model fits the data better, although the error introduced by this approximation grows quickly with dimension [39]. Since for multi-dimensional datasets it is not always clear whether the posterior distribution is Gaussian in shape, it is still worth verifying the outcome with a more accurate method. In epidemiology, as in other fields, the best model is often picked using simple criteria such as AIC or BIC [1]. Their main advantage is that they can be computed at virtually no additional computational cost and are often provided by off-the-shelf statistical software. It is important to note, however, that neither AIC nor BIC incorporates the uncertainty of the parameters or of the model's predictions [4]. As a result, AIC often favours models with a larger number of parameters, and although BIC penalises more complex models, it can give misleading answers when multiple models with similar scores are compared [4]. Additionally, neither method takes into account the prior probabilities of the parameters [9], nor do they force the researcher to consider prior predictive checks of model appropriateness [40].
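For reference, AIC and BIC for a model with maximised log-likelihood ℓ̂, k parameters, and n observations are AIC = 2k − 2ℓ̂ and BIC = k log n − 2ℓ̂. The short sketch below (an illustration on synthetic data, unrelated to the paper's models) computes both for two nested regression fits; note that the BIC penalty per extra parameter exceeds the AIC penalty whenever log n > 2:

```python
import numpy as np

# AIC/BIC on synthetic data: degree-1 vs degree-5 polynomial fits.
rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 1, n)
y = 2.0 * x + rng.normal(0, 0.3, n)           # truth is linear

def gaussian_aic_bic(y, yhat, k):
    # Gaussian likelihood with MLE noise variance sigma^2 = RSS / n
    rss = float(np.sum((y - yhat) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

scores = {}
for deg in (1, 5):
    coef = np.polyfit(x, y, deg)
    k = deg + 2                                # coefficients + noise variance
    scores[deg] = gaussian_aic_bic(y, np.polyval(coef, x), k)
```

Neither criterion, of course, touches the posterior: both depend only on the maximised likelihood and the parameter count, which is precisely the limitation discussed above.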
Fully Bayesian model selection methods that compare the models' evidences are often more appropriate, as they take into account the full posterior densities rather than just maximum a posteriori estimates [41]. Using model evidence and BFs for model selection has been disputed in the scientific literature. BFs have been shown to be sensitive to the choice of the model's priors, which some consider a disadvantage, although we have shown in the COVID-19 example that this sensitivity can be useful when the competing models generate outputs that cannot be compared to empirical data. It is also possible for all competing models to make similar predictions under sensible priors, while their marginal likelihoods differ substantially once integrated over the parameter space [42]. Due to this sensitivity, it is recommended to evaluate BFs over a range of possible prior choices, which may become computationally expensive [4]. In general, however, if the sample size is sufficiently large, the effect of the priors is small [42, 4]. It is also worth noting that BFs indicate which of two models is relatively better, and do not give an absolute score of an individual model's accuracy. For a comprehensive discussion of the applicability of BFs we refer the reader to [42, 43]. Although referenced thermodynamic integration and other path-sampling methods are asymptotically exact Monte Carlo estimators in theory, in practice a number of considerations affect their accuracy. For example, a bias is introduced to the referenced TI estimate if one endpoint density differs substantially from the other, because the volume of parameter space that must be explored to produce an unbiased estimate of the expectation cannot be covered by proposals generated from the reference density within a practical number of iterations.
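Concretely, once two log-evidences are in hand, the Bayes factor is simply their exponentiated difference, and under equal prior model probabilities it yields a posterior model probability. The numbers below are invented purely for illustration:

```python
import numpy as np

# Bayes factor from two log-evidences (illustrative values, not the paper's).
log_Z1, log_Z2 = -314.2, -317.3
log_BF12 = log_Z1 - log_Z2        # log BF_{1,2} = log Z_1 - log Z_2
BF12 = np.exp(log_BF12)           # relative support for model 1 over model 2
post_M1 = BF12 / (1 + BF12)       # P(M1 | data) under equal model priors
# Note: BF_{1,2} is purely relative; it says nothing about the absolute
# adequacy of either model.
```

Working on the log scale, as here, avoids overflow when the evidences themselves are hundreds of log units below zero.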
Generally, the larger the mismatch between the endpoint densities, the larger the bias within a finite number of iterations. This point is illustrated for a simple 1D example in Figure 6. Similarly, the larger the mismatch, the higher the variance and the slower the expectation is to converge. This illustrates the advantage of using a reference that matches the posterior as closely as possible, as opposed to a typically wide reference such as the prior distribution, which gives the characteristic divergence at λ = 0 with power posteriors. Measures of density similarity, and how reference performance scales with dimension, should be considered in future work. Furthermore, discretisation of the coupling-parameter path in λ introduces a discretisation bias. For the power posteriors method, Friel et al. (2017) propose an iterative way of selecting the λ-placements that reduces the discretisation error [7]. Calderhead and Girolami (2009) test multiple λ-placements for 2- and 20-dimensional regression models, and report the relative bias for each tested scenario [44]. In the referenced TI algorithm, however, the discretisation bias is negligible: the use of the reference density results in TI expectations that are small and have low variance, and therefore converge quickly. Consequently, both the expected difference between the two densities estimated from the MCMC draws and the curvature in λ of the per-λ expectations are small. In our framework we use geometric paths with equidistant coupling parameters λ between the un-normalised posterior densities, but other path constructions are possible, e.g. a harmonic [6] or hypergeometric path [35]. Further optimisation with alternative choices of λ placement might be worth exploring, to estimate the evidence correctly with fewer MCMC draws. Normalising constants are fundamental in Bayesian statistics and allow the best model to be selected for given data.
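The effect of λ discretisation can be seen in a stylised numerical experiment (our illustration; the two integrands below are invented stand-ins, not the paper's actual expectations). A steep power-posterior-like curve near λ = 0 incurs a visible trapezoidal bias on an equidistant grid, while a small, nearly flat referenced-TI-like curve does not:

```python
import numpy as np

# Stylised illustration of lambda-discretisation bias: trapezoid rule over
# 11 equidistant lambda points for two invented TI integrands.
lams = np.linspace(0, 1, 11)

def trap(y, x):                           # composite trapezoid rule
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

# Power-posterior-like integrand: steep near lambda = 0 (wide prior reference).
steep = lambda l: 10 * l ** 0.1           # true integral: 10 / 1.1
# Referenced-TI-like integrand: small and nearly flat (close reference).
flat = lambda l: 0.05 * (1 + l ** 2)      # true integral: 0.05 * 4 / 3

err_steep = abs(trap(steep(lams), lams) - 10 / 1.1)
err_flat = abs(trap(flat(lams), lams) - 0.05 * 4 / 3)
```

On the same equidistant grid, the bias of the steep integrand is orders of magnitude larger, which is why power posteriors require a dense clustering of λ points near zero while referenced TI does not.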
In this paper we describe referenced thermodynamic integration, which allows efficient calculation of model evidence by sampling from geometric paths between the un-normalised density of interest and a reference, here a multivariate normal. In the examples presented, the method converges to an accurate solution faster than other approaches such as power posteriors or model switch thermodynamic integration. The referenced TI approach was applied to several examples, in which normalising constants of integrals over 1 to 200 dimensions were calculated. The referenced TI method can be applied to model selection in the epidemiology of infectious diseases, as it allows Bayes factors to be calculated efficiently for competing models, even when they are high-dimensional and complex.

The AR(3) and AR(4) models used an autoregressive process with a longer lag (3 and 4 days, respectively). Finally, the models W = k, k = 1, ..., 7 were similar to the AR(2) model, but with the underlying assumption that R_t stays constant for the duration of the sliding window of length W = k. In the original model developed in [36], imported infections are generated with an exogenous component; the simplified version used in this paper neglects these and focuses only on the endogenous component.

Figure 7: Logarithms of Bayes factors for the analysed COVID-19 renewal models, evaluated using the normalising-constant ratios obtained by referenced TI. In each cell, the colour indicates the value of BF_{1,2} for models M_1 (row) and M_2 (column). Higher values, shown in brighter orange, suggest that M_1 is strongly better than M_2, while values below 0, shown in blue, indicate that M_1 is worse than M_2. GI = 8 performed best of the fixed GI models, W = 2 best of the sliding-window models, and AR(3) performed better than AR(2) and AR(4). For the interpretation of the BF values see [4].
References

Mathematical models of infectious disease transmission
Parameter inference and model selection in deterministic and stochastic dynamical models via approximate Bayesian computation: modeling a wildlife epidemic
Model selection and parameter estimation for dynamic epidemic models via iterated filtering: application to rotavirus in Germany
Statistical Mechanics of Fluid Mixtures
Simulating normalizing constants: From importance sampling to bridge sampling to path sampling
Improving power posterior estimation of statistical evidence
Computing Bayes Factors Using Thermodynamic Integration
Improving Marginal Likelihood Estimation for Bayesian Phylogenetic Model Selection
Regression Analysis
Marginal likelihood estimation via power posteriors
Estimating the evidence: a review
Choosing among partition models in Bayesian phylogenetics
Genealogical Working Distributions for Bayesian Model Testing with Phylogenetic Uncertainty
On model dynamical systems in statistical mechanics
Variational principle of Bogoliubov and generalized mean fields in many-particle interacting systems
The application of the Gibbs-Bogoliubov-Feynman inequality in mean field calculations for Markov random fields
A view of the EM algorithm that justifies incremental, sparse, and other variants
An introduction to variational methods for graphical models
Computation of Gaussian orthant probabilities in high dimension
Estimating orthant probabilities of high-dimensional Gaussian vectors with an application to set estimation
Orthant probabilities
The evaluation of general non-centred orthant probabilities
The numerical evaluation of certain multivariate normal integrals
An Asymptotic Expansion for the Multivariate
A generalized error function in n dimensions
The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo
Stan: A Probabilistic Programming Language
Stan for epidemiology
Star distribution around a massive black hole in a globular cluster
On the eigenfunctions of many-particle systems in quantum mechanics
QUADPACK: A Subroutine Package for Automatic Integration
SciPy: Open Source Scientific Tools for Python: scipy.integrate
SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python
Thermodynamic Bayesian model comparison
On the derivation of the renewal equation from an age-dependent branching process: an epidemic modelling perspective
Using information theory to optimise epidemic models for real-time prediction and estimation
Epidemiology and transmission of COVID-19 in 391 cases and 1286 of their close contacts in Shenzhen, China: a retrospective cohort study
On the error in the Laplace approximations of high-dimensional integrals
Can the Strengths of AIC and BIC Be Shared? A Conflict between Model Identification and Regression Estimation
Bayesian model evidence as a practical alternative to deviance information criterion
A survey of Bayesian predictive methods for model assessment, selection and comparison
Marginal likelihood computation for model selection and hypothesis testing: an extensive review
Estimating Bayes factors via thermodynamic integration and population MCMC

A.0.1 COVID-19 model

The COVID-19 model shown is based on the renewal equation derived from the Bellman-Harris process. The details of the model and its derivation are provided in Mishra et al. [36]. Here, we give a short overview of the AR(2) model. The model has a Bayesian hierarchical structure and is fitted to time-series data of the number of new confirmed COVID-19 cases per day in South Korea, obtained from https://opendata.ecdc.europa.eu/covid19/casedistribution/csv. New infections y(t) are modelled by a negative binomial distribution, with a mean parameter in the form of a renewal equation.
The number of confirmed cases y(t) is modelled as

y(t) ∼ NegativeBinomial(f(t), φ),

where φ is an overdispersion (variance) parameter and the mean of the negative binomial distribution, denoted f(t), represents the daily case data through the renewal equation

f(t) = R_t ∫_0^t y(τ) g(t − τ) dτ.

As the case data are not continuous but reported per day, f(t) can be represented in a discretised, binned form as

f_t = R_t Σ_{τ=1}^{t−1} y_τ g_{t−τ}.

Here, g(τ) is a Rayleigh-distributed serial interval with mean GI, discretised into daily bins. R_t, the effective reproduction number, is parametrised as R_t = exp(ε_t), with the exponent ensuring positivity. ε_t is an autoregressive process with a two-day lag, that is AR(2), with ε_1 ∼ N(−1, 0.1), ε_2 ∼ N(−1, σ) and ε_t ∼ N(ρ_1 ε_{t−1} + ρ_2 ε_{t−2}, σ_t) for t = {3, 4, 5, ...}. The model's priors are specified in Mishra et al. [36]. Modifications were applied to this basic model to obtain the different variants described in section 3. The first group of models analysed was the AR(2) model described above, but with the GI parameter fixed to a certain value instead of being inferred from the data. The AR(3) and AR(4) models had additional parameters ρ_3 and ρ_4, which allow the autoregressive process for ε_t to have a longer lag.
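A forward simulation of this discretised renewal process can be sketched as follows. This is a minimal illustration with assumed parameter values (ρ₁ = 0.7, ρ₂ = 0.2, σ = 0.05, 10 seed cases), not the fitted posterior; negative binomial observation noise is omitted and only the mean f_t is propagated:

```python
import numpy as np

# Forward sketch of the discretised renewal model (simplified from Mishra et
# al. [36]): f_t = R_t * sum_{tau < t} y_tau * g_{t-tau}, with a Rayleigh
# serial interval of mean GI and an AR(2) log-reproduction number.
# All parameter values below are illustrative assumptions, not estimates.
rng = np.random.default_rng(1)
T, GI = 60, 8.0
scale = GI * np.sqrt(2 / np.pi)                 # Rayleigh scale giving mean GI

# Serial interval discretised into daily bins via the Rayleigh CDF.
edges = np.arange(0, T + 1)
cdf = 1 - np.exp(-edges ** 2 / (2 * scale ** 2))
g = np.diff(cdf)                                # g[i] = P(i < tau <= i + 1)

# AR(2) process for eps_t, with R_t = exp(eps_t) ensuring positivity.
rho1, rho2, sigma = 0.7, 0.2, 0.05
eps = np.empty(T)
eps[0] = rng.normal(-1, 0.1)
eps[1] = rng.normal(-1, sigma)
for t in range(2, T):
    eps[t] = rng.normal(rho1 * eps[t - 1] + rho2 * eps[t - 2], sigma)
R = np.exp(eps)

# Renewal recursion seeded with 10 initial cases; mean dynamics only.
y = np.zeros(T)
y[0] = 10.0
for t in range(1, T):
    y[t] = R[t] * np.sum(y[:t] * g[t - 1 :: -1])   # pairs y_tau with g_{t-tau}
```

Within the model selection study, the GI = k variants fix GI rather than inferring it, and the AR(3)/AR(4) variants extend the `eps` recursion with ρ₃ and ρ₄ terms.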