Pseudo-marginal Inference for CTMCs on Infinite Spaces via Monotonic Likelihood Approximations

Miguel Biron-Lattes, Alexandre Bouchard-Côté, Trevor Campbell

May 28, 2021

Bayesian inference for Continuous-Time Markov Chains (CTMCs) on countably infinite spaces is notoriously difficult because evaluating the likelihood exactly is intractable. One way to address this challenge is to first build a non-negative and unbiased estimate of the likelihood, involving the matrix exponential of finite truncations of the true rate matrix, and then to use the estimates in a pseudo-marginal inference method. In this work, we show that we can dramatically increase the efficiency of this approach by avoiding the computation of exact matrix exponentials. In particular, we develop a general methodology for constructing an unbiased, non-negative estimate of the likelihood using doubly-monotone matrix exponential approximations. We further develop a novel approximation in this family, the skeletoid, as well as theory regarding its approximation error and how that relates to the variance of the estimates used in pseudo-marginal inference. Experimental results show that our approach yields more efficient posterior inference for a wide variety of CTMCs.

Continuous-Time Markov Chains (CTMCs) (Anderson, 1991) are a class of stochastic processes with piecewise constant paths, taking values in a countable set X and switching between states at random real-valued times. Notable examples are the Stochastic Susceptible-Infectious-Recovered (SSIR) model in epidemiology (Keeling and Ross, 2008) and the Stochastic Lotka-Volterra (SLV) predator-prey model in ecology (Spencer and Tanner, 2008), but CTMCs have also been applied in a wide range of other fields, such as population genetics (Beerenwinkel and Sullivant, 2009), engineering (Yin et al., 2003; Lipsky, 2008), biochemistry (Schaeffer et al., 2015), and phylogenetics (Maddison et al., 2007).

As an example, consider the SSIR model in its simplest form. At any time t ≥ 0, the system is described by a triplet of integers (S(t), I(t), R(t)), denoting the number S(t) of healthy people susceptible to infection, the number I(t) of people infected, and the number R(t) of recovered people who have become immune. The system switches states at random times whenever a susceptible individual becomes infected, an infected person recovers, or a new susceptible individual enters the population.

Our interest lies in using observations to perform Bayesian statistical inference for the unknown parameters of the CTMC model. In particular, we aim to infer the rate matrix Q that governs the dynamics of the process. For any pair of states i ≠ j, Q_{i,j} is related to the probability of the system jumping from i to j in an "infinitesimally small" interval of time. Evaluating the likelihood for any particular Q involves computing the matrix exponential of Q, denoted e^{tQ}; in particular, the matrix exponential lets us move from "infinitesimal time intervals" to transition functions for time intervals of any length t ≥ 0 without approximation error. However, we can only compute e^{tQ} exactly when the system takes at most a finite number of different states.
And even in this case, the computation is feasible only when the size of X is relatively small, due to the O(|X|^3) complexity of the matrix exponential (Moler and Van Loan, 2003). In many applications the set X where the CTMC takes values is infinite, which in turn implies that Q is infinite-dimensional too. One way to address this issue is to use the pseudo-marginal method for exact Bayesian inference (Andrieu and Roberts, 2009), where one is allowed to replace exact evaluation of the likelihood by a noisy estimate of the likelihood that is non-negative (i.e., guaranteed to be at least 0) and unbiased (i.e., with mean equal to the true likelihood). Georgoulas et al. (2017) applied this strategy to a subclass of CTMCs called reaction networks, of which the SSIR model is a particular example. The idea is to define a growing, infinite sequence of finite sets X_1, X_2, X_3, etc., that eventually covers the entirety of X. Then, a corresponding sequence of finite rate matrices Q_r is obtained, so that each e^{tQ_r} can be computed exactly. Finally, an unbiased estimate is obtained through a randomization-based debiasing method (McLeish, 2011; Rhee and Glynn, 2015; Lyne et al., 2015) that preserves non-negativity.

The method described above can be recast as exploiting monotone approximations to the transition probabilities of an infinite CTMC. Indeed, Georgoulas et al. (2017) construct a monotone approximation by computing matrix exponentials for rate matrices arising from increasing truncations of the state-space X. However, we argue that exactly computing (expensive) matrix exponentials is wasteful. We show that this can be avoided by building a monotone approximation that simultaneously increases the truncation of the underlying state-space as well as the accuracy of an approximation to the finite matrix exponential. By preserving monotonicity, this new sequence fits into the pseudo-marginal framework described earlier; and because it involves only approximate matrix exponentials, it improves computational efficiency. We demonstrate that these monotone approximations can substantially improve the efficiency of pseudo-marginal methods for CTMCs. Crucial to this new approach is the use of doubly-monotone approximations to the finite rate matrix exponential, i.e., those that are monotone non-decreasing in both the state-space truncation size and the approximation iteration, because they enable the construction of the monotone approximations to the transition probability required by the pseudo-marginal method. Past approximations applicable to this setting, such as the well-known uniformization method (Jensen, 1953), are not doubly-monotone except in some limited circumstances. Therefore we propose the skeletoid, a novel matrix exponential approximation. We demonstrate that the skeletoid is doubly-monotone, and provide bounds on its approximation error as a function of the number of iterations. Empirical results show that our pseudo-marginal method based on approximate matrix exponentials, using either the skeletoid or uniformization, gives substantial efficiency improvements.

The remainder of the paper is organized as follows. Section 2 introduces the transition function for a CTMC, a description of reaction networks, and the general pseudo-marginal Markov chain Monte Carlo (MCMC) method. Sections 3.1 and 3.2 introduce the strategy to construct pseudo-marginal samplers using monotone transition function approximations together with a debiasing scheme.
Section 3.3 shows how to improve the efficiency of this approach using doubly-monotone matrix exponential approximations. Section 3.4 introduces the skeletoid approximation, provides an analysis of its error, and compares it with uniformization. Finally, Section 5 provides a comparison of our proposed method and alternative approaches. We conclude with a discussion of avenues of improvement and generalizations to other classes of stochastic processes.

We begin this section by defining transition functions in the context of CTMCs, and providing conditions under which the inference problem is well-posed. Then, we define the subclass of CTMCs known as reaction networks. Finally, we give a general description of the pseudo-marginal approach to Bayesian inference.

Let {X(t)}_{t≥0} denote a continuous-time Markov chain (CTMC) on a countable space X (Çinlar, 2013, Ch. 8). The set of equations

M_{x,y}(t) := P(X(t) = y | X(0) = x),  x, y ∈ X, t ≥ 0,   (1)

defines the transition function of the process {X(t)}_{t≥0}. In applications, the usual input available to produce M(t) is the rate matrix Q = (q_{x,y}), such that q_{x,y} ≥ 0 for x ≠ y. When |X| < ∞ and Q is conservative, i.e., q_{x,x} = −Σ_{y≠x} q_{x,y}, then M(t) is given by the matrix exponential of Q,

M(t) = e^{tQ} := Σ_{n=0}^∞ (tQ)^n / n!.

One can use this relationship to compute the exact likelihood of observations. For countably infinite state-spaces |X| = ∞, there is no such closed-form expression relating M(t) and Q (they are generally related via Kolmogorov's equations (Anderson, 1991, Thm. 2.2.2)). Therefore, specialized inference methods are required. Note that the rate matrix Q is still the inferential target in this setting, as it still determines the dynamics of the CTMC: given a conservative and non-explosive rate matrix Q (Çinlar, 2013, Def. 8.3.22), Algorithm 1 produces realizations of the CTMC that are distributed according to Eq. (1) (Gillespie, 1977).

Algorithm 1: Gillespie's algorithm for sampling a path of a CTMC
input: x_0, Q, t_end > 0
x ← x_0; t ← 0                                          (initialize starting point)
X ← {(x, t)}                                            (initialize path storage)
loop
    τ ∼ Exp(−q_{x,x}); t_jump ← t + τ                   (sample next jump time)
    if t_jump > t_end then break                        (reached the end)
    x ∼ Categorical((q_{x,y}/(−q_{x,x}))_{y≠x}); t ← t_jump   (sample next state)
    X ← X ∪ {(x, t)}                                    (append new jump to path)
end
output: path X

Reaction networks are a class of CTMCs that admit a structured parametrization of their rate matrices. The state space X of a reaction network is a subset of the integer lattice Z^{n_s}, where n_s is the number of species in the network. In most cases, the state spaces of reaction networks can be described by (possibly infinite) rectangular sets

X = {x ∈ Z^{n_s} : B_l ≤ x ≤ B_u},

where B_l ∈ (Z ∪ {−∞})^{n_s} and B_u ∈ (Z ∪ {∞})^{n_s} define the lower and upper bounds on the counts of each species. At any x ∈ X, the process can only evolve by moving in a finite set of directions in the lattice, specified by an integer-valued update matrix U of size n_r × n_s, where n_r is the number of reactions. Each row in U corresponds to an update vector, such that the next state is any one of {x + U_{r,·}}_{r=1}^{n_r}. The rate of jumps towards these states is characterized by the propensity functions {f^r_θ}_{r=1}^{n_r}, where f^r_θ : X → [0, ∞) and θ ∈ Θ is a parameter vector governing the system, on which we aim to do inference. The x-th row of Q is then given by

q_{x,y} = Σ_{r : x + U_{r,·} = y} f^r_θ(x) for y ≠ x,  and  q_{x,x} = −Σ_{r=1}^{n_r} f^r_θ(x).

Note that this definition imposes a high level of sparsity on Q, as its rows contain at most n_r + 1 non-zero elements. As mentioned earlier, the running SSIR model example is a reaction network with X = Z^3_+; i.e., with B_l = (0, 0, 0) and B_u = (∞, ∞, ∞).
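To make the parametrization above concrete, the following Python sketch (ours, not the paper's) assembles the rate matrix of a reaction network restricted to an explicitly enumerated finite set of states. Jumps that leave the set contribute to the diagonal but to no off-diagonal entry, which is exactly the kind of non-conservative truncation used later in Section 3.

    import numpy as np

    def truncated_rate_matrix(states, U, propensities):
        # states: list of integer tuples; U: n_r x n_s integer update matrix;
        # propensities: list of n_r callables f_r(x) >= 0.
        index = {x: i for i, x in enumerate(states)}
        Q = np.zeros((len(states), len(states)))
        for x, i in index.items():
            for r, f in enumerate(propensities):
                y = tuple(np.add(x, U[r]))
                if y in index:          # keep jumps that stay inside the truncation
                    Q[i, index[y]] += f(x)
                Q[i, i] -= f(x)         # diagonal uses all reactions, as in the full model
        return Q

Each row of the resulting matrix has at most n_r + 1 non-zero entries, matching the sparsity noted above; the full infinite-space rate matrix is never built.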
The SSIR model is represented by the following reaction diagram, where θ := (θ_1, θ_2, θ_3) are the model's parameters and the formulae above the arrows correspond to the propensity functions:

S + I --(θ_1 S I)--> 2I,    I --(θ_2 I)--> R,    ∅ --(θ_3)--> S.

The update matrix induced by the diagram is

U = [ −1  1  0 ;  0  −1  1 ;  1  0  0 ].   (2)

Most of the theory in this paper is applicable to general CTMCs, the only exception being Section 3.2. However, we will focus our applications and experiments on reaction networks because of their broad applicability in scientific practice and the convenient structure they exhibit.

Consider a general statistical problem where one has collected a dataset D with the intention of performing inference for an unknown parameter θ ∈ Θ. The likelihood function is given by L(θ), and we put a prior distribution on θ with density π(θ). The posterior density over θ is given by

π(θ | D) = π(θ) L(θ) / p(D),  p(D) := ∫ π(θ) L(θ) dθ.

When π(θ) and L(θ) can be evaluated point-wise, standard tools such as Markov chain Monte Carlo (MCMC) can be used to carry out Bayesian inference without the need to calculate p(D). However, in many situations the likelihood L(θ) cannot be evaluated, either because it does not admit a closed-form expression, or because using such an expression would require infinite computational resources.

The pseudo-marginal approach (Andrieu and Roberts, 2009) is a useful tool for cases where L(θ) is intractable. Suppose that we have access to a non-negative estimator of the likelihood, L̃(θ, U) ≥ 0, where U is an auxiliary U-valued independent random variable with density m(u), such that E[L̃(θ, U)] = L(θ). Define the augmented posterior distribution

π(θ, u | D) := γ(θ, u) / p(D),  γ(θ, u) := π(θ) L̃(θ, u) m(u).   (3)

We can check that this augmented density has the correct marginal for θ, since

∫ π(θ, u | D) du = (π(θ)/p(D)) ∫ L̃(θ, u) m(u) du = π(θ) L(θ) / p(D) = π(θ | D).

Moreover, γ(θ, u) can be evaluated point-wise because it does not explicitly depend on L(θ). Therefore, we can perform inference for θ by employing any suitable MCMC algorithm targeting γ.

Suppose that we observe a CTMC at a finite set of times t_0 < t_1 < ... < t_{n_d}, collecting observations x_i := X(t_i), and let Δt_i := t_i − t_{i−1}. We aim to perform inference for the parameters θ ∈ Θ that define the rate matrix Q = Q(θ). The likelihood has the form

L(θ) = Π_{i=1}^{n_d} M(Δt_i; θ)_{x_{i−1}, x_i},

where M(δ; θ)_{x,y} is the transition probability from state x to y after a time interval of length δ for the CTMC governed by the rate matrix Q(θ) (we write M(t) instead of M(t; θ) to simplify notation). As we have previously discussed, for countably infinite spaces there is no general exact method to compute M(δ; θ)_{x,y} in finite time. In this section we describe a novel approach that uses the pseudo-marginal method to address this problem.

To construct an unbiased estimator of the likelihood L(θ), we design a debiasing scheme (Kahn, 1954; McLeish, 2011; Rhee and Glynn, 2015; Lyne et al., 2015; Jacob et al., 2015) that exploits the monotonicity of increasing sequences converging to a quantity of interest (all proofs in Appendix B).

Proposition 1. Let {a_n}_{n∈Z_+} be a monotone increasing sequence with a_n ↑ α < ∞. Fix ω ∈ Z_+ and let N ∈ Z_+ be an independent random variable with probability mass function p(n) satisfying

p(n) > 0 for all n ∈ Z_+.   (4)

Define the random variable

Z := a_ω + (a_{ω+N+1} − a_{ω+N}) / p(N).   (5)

Then Z ≥ a_ω almost surely, E[Z] = α, and

Var(Z) = Σ_{n≥0} (a_{ω+n+1} − a_{ω+n})^2 / p(n) − (α − a_ω)^2.   (6)

Furthermore, if Var(Z) < ∞ for a fixed ω ∈ Z_+, then Var(Z) → 0 as ω → ∞.

We refer to the above method as the Offset Single Term Estimator (OSTE), because it builds on the single-term estimator of Rhee and Glynn (2015) by incorporating the offset ω as a tuning parameter. OSTE has two advantages over other debiasing schemes. First, it requires computing at most 3 (random) elements of the sequence {a_n}_{n∈Z_+}, and only 2 with probability P(N = 0).
This provides a large reduction in computation in situations, as in the present work, where a_n can be evaluated directly without needing to first evaluate a_1, ..., a_{n−1}. Second, the offset allows us to control the trade-off between computational cost and variance separately from the design of the distribution p. In the context of pseudo-marginal inference, it is important to control the variance of the estimate, as it directly influences the convergence rate of the sampler.

Aside from the offset, Eq. (6) suggests that the variance of the OSTE is determined by the relative values of the difference sequence a_{n+1} − a_n and the probability mass function p(n). The faster a_n converges, the lighter-tailed p(n) can be; in situations where a_n becomes more expensive to compute as n increases, as in the present work, it is desirable to use a light-tailed p(n) that rarely returns large values of N. In particular, a case of special interest to the present work is when the sequence {a_n}_{n∈Z_+} converges exponentially fast to its limit. Proposition 2 shows that in this case, we can use a geometric distribution p while still guaranteeing that Var(Z) < ∞.

Proposition 2. Suppose there exist constants c, ρ > 0 such that

∀n ∈ Z_+: a_{n+1} − a_n ≤ c e^{−nρ}.   (7)

Then, if N ∼ Geom(1 − e^{−β}) for 0 < β < 2ρ, Var(Z) < ∞.

In this work we follow this general strategy of creating monotone sequences that converge at least exponentially quickly to their limit. In practice, the constants c, ρ above are typically unknown; in Appendix A.5 we describe a procedure to automatically tune OSTE that is suitable for the present work.

It remains to design an appropriate monotone sequence that converges to L(θ). A first step in this direction is given by the following result due to Anderson (1991), which shows that any increasing sequence of state spaces induces an increasing sequence of transition functions.

Proposition 3 (Anderson (1991, Prop. 2.2.14)). Let {X_r}_{r∈Z_+} be an increasing sequence of sets such that ∪_{r≥0} X_r = X. Define a corresponding sequence of rate matrices {Q_r}_{r∈Z_+} by

(Q_r)_{x,y} := q_{x,y} 1[x ∈ X_r] 1[y ∈ X_r],

i.e., Q_r is the restriction of Q to X_r. Then for all x, y ∈ X and t ≥ 0, [e^{tQ_r}]_{x,y} ↑ M(t)_{x,y} as r → ∞.

Proposition 3 tells us how to construct an increasing sequence of probabilities which we can pass to OSTE to produce a non-negative and unbiased estimator of the transition probability of a CTMC. Note that the truncated rate matrices Q_r are not guaranteed to be conservative: some of their rows may sum to a strictly negative value, due to the missing states. We refer to these matrices as non-conservative. Nonetheless, Proposition 13 and its proof show that [e^{tQ_r}]_{x,y} can be interpreted as a transition probability in an enlarged space.

Let us assume that for each observation (x_{i−1}, x_i, Δt_i) we have constructed an increasing sequence of state-spaces {X^i_r}_{r∈Z_+}, and write M_r(Δt_i) for the matrix exponential based on the corresponding truncated rate matrix. Fix ω_i ∈ Z_+ and let N_i be an independent random integer with probability mass function p_i(n). Then an estimator of L_i(θ) := M(Δt_i)_{x_{i−1},x_i} is given by applying the OSTE debiasing scheme to the monotone sequence of transition probabilities {M_r(Δt_i)_{x_{i−1},x_i}}_{r∈Z_+} given by Proposition 3:

L̃^IE_i(θ, N_i) := M_{ω_i}(Δt_i)_{x_{i−1},x_i} + (M_{ω_i+N_i+1}(Δt_i)_{x_{i−1},x_i} − M_{ω_i+N_i}(Δt_i)_{x_{i−1},x_i}) / p_i(N_i).

IE stands for an estimator for Irregular time series based on Exact matrix exponentials. When the increasing sequence of state-spaces is designed such that the conditions of Proposition 1 are satisfied, L̃^IE_i(θ, N_i) is unbiased. Furthermore, by the independence of {N_i}_{i=1}^{n_d}, we can obtain an unbiased estimator of the likelihood via

L̃^IE(θ, U) := Π_{i=1}^{n_d} L̃^IE_i(θ, N_i),

where U := (N_1, ..., N_{n_d}).
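The OSTE itself is simple to implement. Here is a minimal Python sketch (ours), assuming a callable a(n) that returns the n-th element of the monotone sequence, with N drawn from the geometric family justified by Proposition 2:

    import numpy as np

    def oste(a, omega, p, rng=np.random.default_rng()):
        # N ~ Geom(p) on {0, 1, ...}; numpy's geometric is supported on {1, 2, ...}
        N = rng.geometric(p) - 1
        pmf = p * (1.0 - p) ** N               # p(N) = p (1 - p)^N
        # Z = a_omega + (a_{omega+N+1} - a_{omega+N}) / p(N): at most 3 sequence elements
        return a(omega) + (a(omega + N + 1) - a(omega + N)) / pmf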
As the emphasis on irregular time series suggests, there is another estimator that is suitable only for regular time series; i.e., those with Δt_i = Δt for all i ∈ {1, ..., n_d}. It is motivated by a potential efficiency gain associated with the fact that in this case we can obtain all the transition probabilities {M(Δt)_{x_{i−1},x_i}}_{i=1}^{n_d} from the same matrix exponential M(Δt). To exploit this property we begin by collapsing the collection of sequences of state-spaces {X^i_r}_{i,r} into a single sequence {X_r}_{r∈Z_+},

X_r := ∪_{i=1}^{n_d} X^i_r.   (8)

Since each sequence {X^i_r}_{r∈Z_+} is increasing to X, the collapsed sequence also satisfies this property. Then we proceed as before: fix ω ∈ Z_+ and let N be a random integer with probability mass function p(n). Define

L̃^RE_r(θ) := Π_{i=1}^{n_d} M_r(Δt)_{x_{i−1},x_i},

which is monotone increasing in r because it is a product of non-negative monotone increasing sequences. Applying the OSTE debiasing scheme to the sequence {L̃^RE_r(θ)}_{r∈Z_+} yields a new estimator of the likelihood,

L̃^RE(θ, N) := L̃^RE_ω(θ) + (L̃^RE_{ω+N+1}(θ) − L̃^RE_{ω+N}(θ)) / p(N).

RE now stands for "Regular time series with Exact matrix exponentials." The estimator L̃^RE(θ, N) is unbiased whenever the increasing state-space sequence is designed such that the conditions of Proposition 1 hold.

The strategy presented in this section is not restricted in principle to CTMCs. In other words, the problem of performing exact inference on stochastic processes with intractable transition functions can be reduced to constructing monotone approximations to this function. Then, passing this sequence through a debiasing scheme such as OSTE produces a likelihood estimator which can be readily used as the basis of a pseudo-marginal sampler targeting the correct (augmented) posterior distribution.

Up until this point, we have assumed the existence of a sequence of state spaces increasing to X. In this section we describe a procedure to create such a sequence that leverages the structure of reaction networks. For every observation (x_{i−1}, x_i, Δt_i), we set X^i_0 to be the set of states visited by a positive-probability path between x_{i−1} and x_i. Then, we form an increasing sequence of sets using a simple iterative rule. Let D be any collection of vectors that span X. To grow the current space X^i_r, we get new points by "moving" all the elements of X^i_r one step along every direction in D, and then discarding points outside X:

X^i_{r+1} := X^i_r ∪ ({x + d : x ∈ X^i_r, d ∈ D} ∩ X).

In this work we set D to the canonical basis vectors of Z^{n_s} and their negations, D := {±e_j}_{j=1}^{n_s}; this is the default choice in the literature (Georgoulas et al., 2017; Sherlock and Golightly, 2019). Note that this choice implies that the truncated state space X^i_r grows polynomially, i.e., |X^i_r| = O(r^{n_s}). Another possible choice of D, for example, is to use the rows of the update matrix U of the CTMC (e.g., Eq. (2) for the SSIR model).

The benefit of including a valid path between endpoints in X^i_0 is that it ensures that the transition probability M_0(Δt_i)_{x_{i−1},x_i} estimated with the corresponding truncated rate matrix Q^{(i)}_0 is positive. When this condition fails, the performance of the pseudo-marginal sampler is impaired because proposals with zero estimated likelihood are immediately rejected. In Appendix A.3 we describe a simple heuristic based on linear programming to obtain X^i_0. More general-purpose pathfinding algorithms like A* (Hart et al., 1968) could also be used. Regardless of the algorithm used, one only needs to build X^i_0 once, before running the sampler. We now investigate the convergence rate of the estimated transition probability sequence under this state-space truncation scheme.
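For concreteness, the growth rule above can be sketched as follows (ours; the membership test for X is a model-specific assumption):

    import numpy as np

    def grow_once(X_r, D, in_X):
        # Move every element of X_r one step along every direction in D,
        # discard points outside X, and take the union with X_r.
        moved = {tuple(np.add(x, d)) for x in X_r for d in D}
        return X_r | {y for y in moved if in_X(y)}

    # Example: canonical directions D = {+/- e_j} for a 3-species network on Z^3_+.
    D = [tuple(v) for v in np.vstack([np.eye(3, dtype=int), -np.eye(3, dtype=int)])]
    X1 = grow_once({(2, 1, 0)}, D, lambda y: min(y) >= 0)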
The goal is to show that the error decays exponentially quickly, and hence that we may use light-tailed distributions for the random truncation levels N_i, by Proposition 2. Let (x, y, t) represent a generic observation, and let P_x be the law of the CTMC initialized at X(0) = x. As Proposition 13 shows, the r-th estimated transition probability M_r(t)_{x,y} can be written as the probability of moving from x to y without ever leaving X_r:

M_r(t)_{x,y} = P_x(X(t) = y and ∀s ∈ [0, t] : X(s) ∈ X_r).

Reaching a point y ∈ X \ X_r from a point x ∈ X_0 involves, at the very minimum, a trip from the boundary of X_0 to the boundary of X_r. Since the directions D span X, there exists a constant γ > 0 such that this trip requires at least γr jumps. Therefore,

M(t)_{x,y} − M_r(t)_{x,y} ≤ P_x(N(0, t] ≥ γr),

where N(0, t] is the number of jumps the process makes in (0, t]. To our knowledge, there are no analytically tractable expressions for tail probabilities of N(0, t] for general, non-explosive CTMCs in countably infinite state spaces. Nevertheless, when the full rate matrix Q is uniformizable, i.e.,

q̄ := inf_{x∈X} q_{x,x} > −∞,

it is possible to derive an upper bound. In particular, one can add self-transitions to the CTMC, a process known as uniformization (Jensen, 1953), so that the enlarged set of jump events is given by a homogeneous Poisson process N with intensity −q̄. Then, standard results for the Poisson distribution (see e.g. Boucheron et al., 2013) yield, for γr > λ := (−q̄)t,

P_x(N(0, t] ≥ γr) ≤ e^{−λ} (eλ / (γr))^{γr}.

Thus, under the assumption of uniformizability, the error in the truncation sequence decreases super-exponentially. This is desirable given our aim to use the OSTE scheme, because it implies that Eq. (7) in Proposition 2 is satisfied.

Fig. 1 shows the error incurred versus the truncation index r for selected observation pairs in two CTMCs. The top row corresponds to a simplified model of a queue: jobs arrive at a rate λ and are processed by c servers, so that the system is the classical M/M/c queue. The plots show that the decrease in error is slightly faster than exponential, in agreement with the derivation above. The bottom row in Fig. 1 corresponds to the Schlögl model, another CTMC which will be described in detail in Section 5. In contrast to the M/M/c queue, the Schlögl model has a rate matrix that is not uniformizable. Nevertheless, the observed decay in error still seems super-exponential, which suggests that fast convergence is robust to violations of the uniformization assumption. In practical applications, the specific constants governing the rate of (super-)exponential error decay are not known. In Appendix A.5 we show how to empirically estimate these rates as part of the process of tuning a pseudo-marginal sampler.

The IE and RE estimators described in Section 3.1 require algorithms that produce an exact evaluation of e^{tQ_r} for each truncated rate matrix Q_r. In this section, we show how to construct new unbiased likelihood estimators that take advantage of monotonic approximations of each e^{tQ_r} to avoid the cost of exact computation. In particular, suppose for each r ∈ Z_+ we are given a sequence of approximations {M^{(k)}_r(t)}_{k∈Z_+} that is monotone in k, in the sense that M^{(k)}_r(t)_{x,y} ↑ [e^{tQ_r}]_{x,y} as k → ∞ for all entries (x, y). The following result shows that for a doubly-monotone collection {M^{(k)}_r(t)}_{r,k∈Z_+}, any sequence of matrices constructed from pairs of indices {(r_n, k_n)}_{n∈Z_+} with r_n ↑ ∞ and k_n ↑ ∞ converges monotonically to the correct limit.

Proposition 4. Suppose that for all r ∈ Z_+, a^{(k)}_r ↑ a_r ∈ R as k → ∞, where a_r ↑ α ∈ R, and that for all k ∈ Z_+, {a^{(k)}_r}_{r∈Z_+} is monotone increasing. Let {(r_n, k_n)}_{n∈Z_+} be any sequence of pairs r_n, k_n ∈ Z_+ such that r_n ↑ ∞ and k_n ↑ ∞. Then a^{(k_n)}_{r_n} ↑ α.
Proposition 4 applied to the collection {M^{(k)}_r(t)_{x,y} : r, k ∈ Z_+} for each matrix entry (x, y) ensures that we can pick any sequence of increasing pairs {(r_n, k_n)}_{n∈Z_+} and obtain a sequence M^{(k_n)}_{r_n}(t) that converges monotonically to the true transition probability. Therefore, we can construct a debiased estimator using the result of Proposition 1 without ever having to evaluate the exact matrix exponential M_r(t) of any state-space truncation. In order to control the variance of the estimator via Proposition 2, we require the further conditions on the approximation sequence given by Proposition 5.

Proposition 5. In addition to the conditions in Proposition 4, suppose that there exist c_1, c_2, ρ, κ > 0 such that

∀r, k ∈ Z_+: α − a_r ≤ c_1 e^{−ρr}  and  a_r − a^{(k)}_r ≤ c_2 e^{−κk}.   (9)

Then, setting r_n = n and k_n = ⌈(ρ/κ)n⌉ gives, for all n ∈ Z_+,

a^{(k_{n+1})}_{r_{n+1}} − a^{(k_n)}_{r_n} ≤ 2 max{c_1, c_2} e^{−ρn}.

The first assumption in Eq. (9) is discussed in Section 3.2. The second condition is satisfied whenever the matrix exponential approximation admits an upper bound on the error that is invertible (Section 3.4 gives such bounds for the methods described therein). Finally, we again note that the constants in Proposition 5 are unknown in practice, but nonetheless the sequences r_n, k_n can still be tuned, as we show in Appendix A.5.

We now proceed as in Section 3.1 to define a new estimator of the true transition probability M(Δt_i)_{x_{i−1},x_i} for the i-th observation. Suppose {M^{(k)}_r}_{r,k∈Z_+} is doubly-monotone, let {(r_i(n), k_i(n))}_{n∈Z_+} be a sequence of increasing pairs, and write a^i_n := M^{(k_i(n))}_{r_i(n)}(Δt_i)_{x_{i−1},x_i}. Then

L̃^IA_i(θ, N_i) := a^i_{ω_i} + (a^i_{ω_i+N_i+1} − a^i_{ω_i+N_i}) / p_i(N_i)

is an unbiased estimator of M(Δt_i)_{x_{i−1},x_i}. In turn, these estimators can be composed into an unbiased estimator of the likelihood,

L̃^IA(θ, U) := Π_{i=1}^{n_d} L̃^IA_i(θ, N_i),

where U := (N_1, ..., N_{n_d}). We use the suffix IA to denote the use of Approximate solutions in the setting of Irregular time-series. Likewise, for regular time series let

L̃^RA_n(θ) := Π_{i=1}^{n_d} M^{(k(n))}_{r(n)}(Δt)_{x_{i−1},x_i}.   (11)

Applying the OSTE scheme to the sequence {L̃^RA_n(θ)}_{n∈Z_+} yields a new estimator of the likelihood,

L̃^RA(θ, N) := L̃^RA_ω(θ) + (L̃^RA_{ω+N+1}(θ) − L̃^RA_{ω+N}(θ)) / p(N).   (12)

Since the IE and RE estimators based on exact solutions are closely related to the ones based on approximate solutions, the rest of the paper will focus on the IA and RA estimators.

The relative efficiency of the IA and RA estimators is not obvious. On the one hand, RA needs only one call to the matrix exponentiation algorithm for each likelihood estimate, while IA needs n_d. Also, a likelihood estimate produced by RA involves a smaller number of states, since from Eq. (8) we see that

|X_r| = |∪_{i=1}^{n_d} X^i_r| ≤ Σ_{i=1}^{n_d} |X^i_r|.

On the other hand, IA allows for finer tuning, since each observation gets its own joint sequence. For example, transition probabilities for observations of the form (x, x, Δt) in models with low rates of transition can be well approximated by the probability of not moving, i.e., e^{q_{x,x}Δt}, and therefore we can set both of its offsets to a low value. In contrast, for RA we need to set its parameters so that all transition probabilities are well approximated. Due to these competing effects, the relative efficiency of IA and RA will depend on the particulars of the problem at hand, as we shall see in experiments later on. Nevertheless, a good rule of thumb is to use the RA approach whenever the size of the union |∪_i X^i_0| is substantially smaller than the sum Σ_i |X^i_0|. This rule selects RA whenever there is sufficient redundancy in the base state-spaces, so that it is more efficient to work with the union of them.

In this section we describe doubly-monotone approximations that can be used as the basis of the IA and RA unbiased likelihood estimators described in the previous section.
In particular, we begin with a simple approach based on uniformization (Jensen, 1953), which is efficient for rate matrices that have low norm. We present two variants, which differ in their applicability to more general classes of CTMCs, as well as in their fulfillment of the double monotonicity property. Additionally, we develop a novel matrix exponential approximation, the skeletoid, which is doubly monotonic and excels in the high-rate regime. Finally, we compare the efficiency of the two methods in practical application to the exponentiation of rate matrices for a number of CTMCs.

The process of uniformization described in Section 3.2 can be used to produce a monotone sequence of approximations to e^{tQ}. Suppose q̄ := inf_{x∈X} q_{x,x} > −∞ and let P := I + Q/(−q̄). Then

e^{tQ} = Σ_{n=0}^∞ e^{q̄t} ((−q̄t)^n / n!) P^n,

and the elements of the sum on the right are all (entry-wise) non-negative. Using this fact, we can obtain an increasing sequence of approximate solutions,

M^{(s)}(t) := Σ_{n=0}^s e^{q̄t} ((−q̄t)^n / n!) P^n,   (13)

such that M^{(s)}(t) ↑ M(t) as s → ∞. Moreover, for r ∈ Z_+, let P_r := I + Q_r/(−q̄) (with Q_r defined in Proposition 3). Define

{M^{(s)}_r(t) := Σ_{n=0}^s e^{q̄t} ((−q̄t)^n / n!) P^n_r : r, s ∈ Z_+}.   (14)

Then, for any x, y ∈ X and t ≥ 0, the collection {M^{(s)}_r(t)_{x,y}}_{r,s∈Z_+} is doubly monotonic, as Proposition 6 shows.

Proposition 6. The collection in Eq. (14) is doubly monotone.

We now study the error incurred by uniformization. Let ‖A‖_∞ be the ℓ_∞ operator norm of the matrix A,

‖A‖_∞ := sup_{x∈X} Σ_{y∈X} |A_{x,y}|.

The ℓ_∞ error incurred by uniformization admits a simple bound.

Proposition 7. Consider a possibly non-conservative rate matrix Q with associated state space X. Suppose there exists q̄ ∈ (−∞, 0) such that q̄ ≤ inf_{x∈X} q_{x,x}. Let λ := (−q̄)t and F(n; λ) be the cumulative distribution function of a Poisson(λ) distribution. Then,

‖e^{tQ} − M^{(s)}(t)‖_∞ ≤ 1 − F(s; λ).   (15)

Given any ε > 0, we can invert Eq. (15) to find s ∈ Z_+ such that the error is less than ε. Indeed, let F^{−1}(u; λ) := min{n ∈ Z_+ : F(n; λ) ≥ u} be the quantile function of a Poisson(λ) distribution. Then, setting

s(ε) := F^{−1}(1 − ε; λ)   (16)

guarantees that the ℓ_∞ error is bounded by ε. In particular, note that by imposing ε = ε(k) = e^{−κk}, for any κ > 0, the second requirement in Proposition 5 is satisfied.

For the purposes of pseudo-marginal inference for CTMCs (Section 3.3), the uniformization approach presented so far, which we call global uniformization, assumes that the rate matrix Q for the infinite space X satisfies inf_{x∈X} q_{x,x} > −∞. Although appealing due to its double monotonicity, the global uniformization assumption does not hold in many infinite state space CTMCs of interest, like the ones included in our experiments (Section 5), and in particular in the running example of the SSIR model.

Sequential uniformization is an alternative approach that is applicable to CTMCs that are not uniformizable; i.e., models for which inf_{x∈X} q_{x,x} = −∞. Here, a decreasing sequence of lower bounds q̄_r := inf_{x∈X_r} q_{x,x} is computed for every truncated rate matrix Q_r. The corresponding double sequence of approximations becomes

M^{(s)}_r(t) := Σ_{n=0}^s e^{q̄_r t} ((−q̄_r t)^n / n!) P̃^n_r,

where P̃_r := I + Q_r/(−q̄_r). For every r ∈ Z_+, t ≥ 0, and x, y ∈ X_r, {M^{(s)}_r(t)_{x,y}}_{s∈Z_+} is increasing. However, the sequential approach is not doubly monotonic. Consider the following counterexample. Let (x, x, 1) be the observation of interest, so that the base state space is a singleton X_0 = {x} (see Section 3.2). Suppose further that the associated truncated rate matrix is the one-by-one matrix Q_0 = (−1), so that q̄_0 = −1. This yields (M^{(0)}_0(1))_{1,1} = e^{q̄_0} = e^{−1}. Now suppose that X_1 contains one additional state at which the total jump rate is larger, so that q̄_1 = −2, say. Then, the base sequential uniformization approximation gives

(M^{(0)}_1(1))_{1,1} = e^{q̄_1} (P̃^0_1)_{1,1} = e^{−2}.

Thus, (M^{(0)}_1(1))_{1,1} < (M^{(0)}_0(1))_{1,1}, so sequential uniformization is not doubly monotonic. In practice, situations that break double monotonicity for the sequential approach are rare.
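For reference, here is a dense-matrix Python sketch of the sequential uniformization approximation defined above (ours; a robust implementation should follow Sherlock (2018), since the factor e^{q̄_r t} underflows for large rates):

    import numpy as np

    def unif_approx(Q_r, t, s):
        # M^(s)_r(t) = sum_{n=0}^s e^{q_bar t} ((-q_bar t)^n / n!) P^n
        q_bar = Q_r.diagonal().min()             # q_bar_r = inf_x q_{x,x} over the truncation
        P = np.eye(Q_r.shape[0]) + Q_r / (-q_bar)
        lam = -q_bar * t
        term = np.exp(-lam) * np.eye(Q_r.shape[0])   # n = 0 term
        out = term.copy()
        for n in range(1, s + 1):
            term = (term @ P) * (lam / n)        # fold in one Poisson weight factor
            out += term
        return out                               # entry-wise lower bound, increasing in s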
For observations (x_{i−1}, x_i, Δt_i) that do exhibit such monotonicity failures, we can always set the values of the sequence k_i(n) high enough so that, up to numerical accuracy, we effectively fall back to using the IE estimator (which does not require double monotonicity). As we show in Section 5, this implementation of sequential uniformization works well for models with small |q̄_r| values. As a consequence, we take uniformization to mean sequential uniformization unless mentioned otherwise.

Besides the issue with double monotonicity, an important drawback of uniformization is that it is inefficient when λ = (−q̄)t is large. In particular, for fixed ε, the Poisson(λ) quantile function behaves like λ + O(√λ) (Giles, 2016), and so we must compute s = O(λ) terms in Eq. (13) if we wish to have ℓ_∞ error below ε. Therefore uniformization can be prohibitively slow in cases with large λ, such as the Schlögl model (described in Section 5). Finally, we note that a naive implementation of Eq. (13) is prone to numerical instability at high values of λ. In our experiments we use the more robust algorithm described in Sherlock (2018).

In the following we describe a novel monotone approximation to the matrix exponential, which we refer to as the skeletoid, that is doubly monotonic and requires only O(log λ) computation, therefore excelling in the large-λ setting. For any δ > 0, let S(δ) be a matrix with entries x, y ∈ X defined by

(S(δ))_{x,y} := e^{q_{x,x}δ}                                                  if y = x,
               q_{x,y} δ e^{q_{x,x}δ}                                         if y ≠ x and q_{x,x} = q_{y,y},
               q_{x,y} (e^{q_{y,y}δ} − e^{q_{x,x}δ}) / (q_{y,y} − q_{x,x})    otherwise.   (17)

The skeletoid approximation to the transition function is given by

M^{(s)}(t) := S(t 2^{−s})^{2^s}.

Note that, once S(t2^{−s}) is computed, its 2^s-th power can be evaluated using only s matrix-matrix multiplications via repeated squaring. Indeed, setting A_0 := S(t2^{−s}) and A_{l+1} := A^2_l for l = 0, 1, ..., s − 1, we obtain A_s = S(t2^{−s})^{2^s}. Proposition 8 shows that S(t2^{−s})^{2^s} defines a novel monotone increasing sequence of approximations to the transition function of a CTMC.

Proposition 8. If Q is non-explosive, then for all x, y ∈ X and t ≥ 0,

(S(t2^{−s})^{2^s})_{x,y} ↑ M(t)_{x,y} as s → ∞.

The proof for Proposition 8 can be found in Appendix B. It is important to highlight that the non-explosivity condition of Proposition 8 is qualitatively less stringent than the assumption that inf_x q_{x,x} > −∞, which is required by the global uniformization approach. Additionally, for r ∈ Z_+, let S_r(δ) be the matrix constructed from Eq. (17) using the truncated rate matrix Q_r. Let

{M^{(s)}_r(t) := S_r(t2^{−s})^{2^s} : r, s ∈ Z_+}   (18)

be the collection of approximations induced by the skeletoid method. Proposition 9 shows that this collection is doubly monotonic.

Proposition 9. The skeletoid double sequence in Eq. (18) is doubly monotone.

In order to find an s ∈ Z_+ that achieves an ℓ_∞ error lower than any given ε > 0, we require an error bound that admits a straightforward inverse. The following proposition gives such a bound.

Proposition 10. Consider a possibly non-conservative rate matrix Q with q̄ := inf_{x∈X} q_{x,x} > −∞, and let λ := (−q̄)t. Then,

‖e^{tQ} − S(t2^{−s})^{2^s}‖_∞ ≤ 2^s (1 − e^{−λ2^{−s}} (1 + λ2^{−s})).   (19)

Discarding the higher-order terms in Eq. (19) and solving for s yields the rule

s(ε) := ⌈log_2(λ^2 / (2ε))⌉.   (20)

Eq. (20) shows that the cost of achieving a given error is O(log λ) (with λ := (−q̄)t as before), which explains the differences in performance between the skeletoid and uniformization methods in the large-λ setting. Also, and notwithstanding the fact that Eq. (19) already implies exponential decay of the error as 2^{−s}, we can achieve any other rate of decay κ > 0 by setting ε = ε(k) = e^{−κk} in Eq. (20) (c.f. Proposition 5). Finally, we note that the skeletoid approximation is related to the "scaling and squaring" method, a common trick used as part of various algorithms for computing the matrix exponential (Moler and Van Loan, 2003).
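Before detailing that connection, here is a dense-matrix Python sketch of the skeletoid of Eqs. (17) and (18) (ours; Appendix A.1 describes the numerically stable implicit-squaring variant that should be used when δ is tiny):

    import numpy as np

    def skeletoid_approx(Q_r, t, s):
        d = Q_r.diagonal()
        delta = t / 2 ** s
        num = np.exp(d[None, :] * delta) - np.exp(d[:, None] * delta)  # e^{q_yy delta} - e^{q_xx delta}
        den = d[None, :] - d[:, None]                                  # q_yy - q_xx
        with np.errstate(divide="ignore", invalid="ignore"):
            S = Q_r * num / den                       # generic off-diagonal case of Eq. (17)
        tied = np.isclose(den, 0.0)
        S[tied] = (Q_r * delta * np.exp(d[:, None] * delta))[tied]     # q_xx = q_yy case
        np.fill_diagonal(S, np.exp(d * delta))        # diagonal: probability of no jump
        for _ in range(s):                            # repeated squaring: s matrix products
            S = S @ S
        return S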
Scaling and squaring reduces the problem of computing e^{tQ} to that of obtaining the transition function e^{δQ} of the δ-skeleton, with δ := t2^{−s}, by applying the identity e^{tQ} = (e^{δQ})^{2^s} (a consequence of the Chapman-Kolmogorov equations). The key difference here is that S(δ) is a specific, computationally cheap approximation to the δ-skeleton solution e^{δQ}, hence the name "skeletoid." This specific choice is critical to establishing double monotonicity in a general non-explosive setting, making the skeletoid approximation particularly suitable for pseudo-marginal methods.

To compare the skeletoid and uniformization empirically, we apply them to random rate matrices drawn from four classes. In the sparse class, we draw for each row a random integer giving the number of non-zero entries, and then sample the positions of these elements in each row at random. In contrast, matrices in the dense class have all of their off-diagonal entries non-zero. In the absorbing state class we make the first state absorbing, meaning that q_{1,j} = 0 for all j; then, we draw a random dense rate sub-matrix for the remaining states, and finally complete the first column to satisfy the conservative property. Lastly, in the General Time Reversible (GTR) class we simulate matrices satisfying detailed balance: μ_x q_{x,y} = μ_y q_{y,x} for all x, y, for some probability vector μ. In all these classes, the non-zero off-diagonal elements of the matrices are drawn from Exp(1) distributions; then, the diagonal is set to enforce the conservative property, and finally the matrix is scaled to achieve a mean rate of one. The skeletoid is much faster than uniformization in the t = 1000 setting, across all four CTMC classes, whereas in the t = 1 setting, the ranking of the two methods depends on the computational budget.

Fig. 3 shows the result of using Eqs. (16) and (20) to produce approximations with a given error bound. We see a high level of agreement between requested and realized errors in most of the settings, with a bias towards more conservative outcomes when disagreement occurs. Notwithstanding the usefulness of Eqs. (15) and (19) in providing bounds on errors that can be evaluated before doing any computation, monotone increasing approximations (not necessarily doubly monotone) to the matrix exponential of rate matrices, like the skeletoid and uniformization, offer a simple way of obtaining the exact error of a given approximation M̂(t), since for all x, y ∈ X,

M(t)_{x,y} − M̂(t)_{x,y} ≤ Σ_{y'∈X} (M(t)_{x,y'} − M̂(t)_{x,y'}) ≤ 1 − Σ_{y'∈X} M̂(t)_{x,y'},

where the second inequality arises from the possibility that Q is non-conservative (in which case Σ_{y'} M(t)_{x,y'} ≤ 1). The right-hand side is computable since the formula does not depend on M(t). The inequality becomes an equality in the case of a conservative Q, and in this case a stronger statement holds for all x ∈ X:

Σ_{y∈X} (M(t)_{x,y} − M̂(t)_{x,y}) = 1 − Σ_{y∈X} M̂(t)_{x,y}.

Finally, we note that it is always possible to improve a collection of monotone increasing approximations by computing all of them in parallel and returning their element-wise maximum, which will again be monotone increasing. In our tests, however, the overhead needed to parallelize the skeletoid and uniformization methods was not compensated by the efficiency gains obtained. Still, it is possible that for models not considered in this study, an algorithm returning the element-wise maximum of the uniformization, skeletoid, and potentially other monotone approximations could be advantageous.

Georgoulas et al. (2017) were the first to describe a pseudo-marginal method to perform exact inference for CTMCs on countably infinite spaces by debiasing a sequence of likelihood estimates arising from an increasing sequence of truncated state-spaces.
They used the randomly-truncated series debiasing scheme first introduced by McLeish (2011), which has the downside of requiring an unbounded random number of calls to the matrix exponentiation algorithm. In contrast, the OSTE scheme needs at most 3 such calls. Additionally, Georgoulas et al. (2017) used a stopping time whose distribution has super-exponential tails. This introduces the risk of obtaining likelihood estimates with infinite variance, unless one is able to prove that the approximating sequence converges faster. As discussed in Section 3.2, there is evidence that the sequences converge at super-exponential speed, but the precise rate must still be estimated on a case-by-case basis.

Building on Georgoulas et al. (2017), Sherlock and Golightly (2019) introduced the MESA algorithm. This method expands the state space of the sampler to include the random index r of the smallest truncated state space that contains the full (unobserved) path of the process for all observation pairs. This strategy is appealing because it dispenses with the need to design debiasing methods with appropriate stopping times. The variant nMESA is also proposed, which has a different index r_i for each observation pair. In terms of computational cost, the sampler used in MESA requires 3 calls to matrix exponentiation routines, which are based on uniformization and scaling-and-squaring. On the other hand, because they are not pseudo-marginal samplers, neither MESA nor nMESA can take advantage of the accuracy relaxation technique described in Section 3.4. Thus, they require computing matrix exponentials up to numerical accuracy.

Previous work has explored the use of particle MCMC to design positive unbiased estimators in the context of infinite state-space CTMCs. The most straightforward way to use particle MCMC for infinite CTMCs is to impute full paths and set weights to zero when a particle does not match an observed endpoint (Saeedi and Bouchard-Côté, 2011; Golightly and Wilkinson, 2011). However, this approach can suffer from severe weight degeneracy, including particle population collapse, where all weights are equal to zero. A more sophisticated approach builds a martingale to guide particle paths toward the end-point (Hajiaghayi et al., 2014), but this requires the construction of a problem-specific potential function.

The literature on exact inference for CTMCs on finite state spaces is extensive (see e.g. Geweke et al., 1986; Fearnhead and Sherlock, 2006; Ferreira and Suchard, 2008), but methods therein cannot be extended to the countably infinite case in a straightforward manner. In particular, as we have previously mentioned, uniformization cannot be applied to infinite state-space CTMCs whose rate matrices do not admit a uniform bound on the diagonal. Another line of work has built advanced extensions for the related but distinct problem of end-point simulation (Rao and Teh, 2013; Zhang and Rao, 2020).

Algorithm 2: Metropolis pseudo-marginal sampler with RA estimator
input: n_samples, θ_0, Σ_prop, p, {r(n), k(n)}_{n∈Z_+}
C ← {X_0}                                          (construct and store X_0)
r̄ ← 0                                              (index of the largest state-space available)
N ∼ Geom(p)
L_cur ← L̃^RA(θ_0, N)                               (using Eq. (12) with {r(n), k(n)}_{n∈Z_+})
for s = 1, 2, ..., n_samples − 1 do
    θ' ∼ N(θ_{s−1}, Σ_prop); N' ∼ Geom(p)          (propose)
    if r(N' + 1) > r̄ then                          (need to build larger state-spaces)
        C ← C ∪ {X_r} for all r ∈ {r̄ + 1, ..., r(N' + 1)}; r̄ ← r(N' + 1)
    end
    L_prop ← L̃^RA(θ', N')
    with probability min{1, [π(θ') L_prop] / [π(θ_{s−1}) L_cur]}:
        θ_s ← θ'; L_cur ← L_prop
    otherwise: θ_s ← θ_{s−1}
end
output: samples θ_0, ..., θ_{n_samples−1}
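A compact log-domain Python sketch of Algorithm 2 follows (ours): loglik_hat(theta, N) stands in for log L̃^RA(θ, N) of Eq. (12) (computed stably as in Appendix A.4), log_prior for log π(θ), and state-space caching is left out.

    import numpy as np

    def pm_metropolis(n_samples, theta0, Sigma_prop, p, loglik_hat, log_prior,
                      rng=np.random.default_rng()):
        theta = np.asarray(theta0, dtype=float)
        N = rng.geometric(p) - 1                         # N ~ Geom(p) on {0, 1, ...}
        log_post = log_prior(theta) + loglik_hat(theta, N)
        chain = [theta]
        for _ in range(n_samples - 1):
            theta_prop = rng.multivariate_normal(theta, Sigma_prop)
            N_prop = rng.geometric(p) - 1                # independent auxiliary proposal
            log_post_prop = log_prior(theta_prop) + loglik_hat(theta_prop, N_prop)
            if np.log(rng.uniform()) < log_post_prop - log_post:   # MH accept/reject
                theta, log_post = theta_prop, log_post_prop
            chain.append(theta)
        return np.array(chain)

The m(u) factors cancel in the acceptance ratio because the auxiliary variables are proposed independently from their distribution m.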
We evaluate our methods using the experimental benchmarks introduced in Georgoulas et al. (2017) and Sherlock and Golightly (2019). We run a random walk Metropolis sampler targeting the augmented posterior distribution π(θ, U | D) of Eq. (3). The proposal for θ uses a multivariate normal with covariance matrix Σ_prop = α V̂ar(θ|D), for some α > 0 and with V̂ar(θ|D) estimated in a trial run. For the auxiliary variables U we take independent draws from the distribution used by the OSTE scheme, which we set to Geom(p) distributions with p ∈ (0, 1). Appendix A.5 contains detailed explanations of how to tune all the parameters involved. An overview of the process to run the Metropolis sampler using the RA estimator is shown in Algorithm 2. The process used for a sampler using the IA estimator is similar, but involves also looping over observations.

In order to fairly compare the efficiency of algorithms implemented in different languages, we use the effective sample size per billion floating point operations (ESS/GFLOP), counting the FLOPs used in matrix multiplications within the matrix exponentiation algorithms. The reason to focus on these operations is that they account for most of the computational cost incurred by the samplers. The ESS is measured using the method implemented in the R package mcmcse (Flegal et al., 2020) and described by Gong and Flegal (2016). Due to the importance of correctly tuning samplers to obtain good results, we only compare against other methods that have been previously tuned for and tested on a given dataset by their respective authors.

The Schlögl model is defined by four reactions in which the numbers of A and B molecules are kept constant, so that the only species evolving in time is X. The model is known in the mathematical biology literature as the canonical example of a chemical reaction system exhibiting "bi-stability" (Vellela and Qian, 2009). This means that for some parameter configurations, the system oscillates between two meta-states or regimes, one with very low or zero counts of X, and one with high counts. For the purpose of inference, it is simpler to work with the birth-death version of the process, obtained by collapsing the two reactions that increase X into a single birth reaction, and the two that decrease it into a single death reaction.

To give a more detailed description of the performance of our pseudo-marginal approach, we first focus on the results obtained for the Schlögl model, then present summaries for several models. The dataset we use is taken from Sherlock and Golightly (2019), and is depicted in Fig. 4. Note that it exhibits the bi-stability phenomenon described above. These data are taken every Δt = 4 from a path simulated with θ_true = (3, 0.5, 0.5, 3). Sherlock and Golightly (2019) use a Metropolis sampler on the log-transformed variables ξ = log(θ), and place independent standard normal priors on ξ. Since we work directly with θ, we use i.i.d. standard log-normals. After following the tuning procedure outlined in Appendix A.5, we run three parallel chains for 10,000 iterations each, initialized at θ_true. Fig. 5 shows the trace-plots for an RA sampler using the skeletoid method, evidencing good mixing in all parameters and chains. Fig. 6 summarizes the approximated posterior. Note how there is almost perfect correlation between θ_1 and θ_2, demonstrating that having a proposal with non-diagonal covariance is useful. The second panel in Fig. 7 shows the ESS/GFLOP achieved by the IA and RA samplers with the skeletoid, compared to the results obtained for the MESA and nMESA methods. Our RA sampler achieves an improvement of more than 450% over nMESA.
The Lotka-Volterra (LV) process has state-space X = Z^2_+; i.e., it is defined by boundaries B_l = (0, 0) and B_u = (∞, ∞). We consider two versions of this model with different numbers of reactions: a three-reaction version used by Sherlock and Golightly (2019), and a four-reaction version used by Georgoulas et al. (2017). Here, R denotes the number of predators and Y the size of the prey population. For our experiments with the three-reaction version, we focus on the LV20 dataset, shown in the top-left panel of Fig. 4, containing 20 observations obtained at regular time intervals Δt = 1 of an LV path simulated with θ_true = (0.3, 0.4, 0.01). Again, Sherlock and Golightly (2019) use a Metropolis sampler on the log-transformed variables ξ, putting independent Gaussian priors with unit variance and means (log(0.2), log(0.2), log(0.02)). To match these, we put independent log-normal priors on θ with the same parameters. In the first panel of Fig. 7 we show that the IA sampler with uniformization achieves more than a 20% average increase in ESS/GFLOP when compared against nMESA, although the ranges do not show a strong separation. These results also show that the relative performance between IA and RA samplers depends on the particulars of the model and data.

For the four-reaction version, the data used by Georgoulas et al. (2017) are depicted in the bottom-left panel of Fig. 4. They represent observations at regular intervals with Δt = 200 of a path simulated using θ_true = (1, 5, 5, 1)·10^{−4}. We compare our methods against the pseudo-marginal sampler proposed by the authors, which builds a chain directly targeting the augmented posterior for θ using i.i.d. Gamma(4, 10^4) priors. As the third panel in Fig. 7 shows, we achieve an improvement of 3 orders of magnitude using the IA sampler with uniformization. AMH11 denotes the matrix exponentiation algorithm described in Al-Mohy and Higham (2011).

The last CTMC we consider is the SSIR model, described in detail in Section 2.2. The data, taken from Georgoulas et al. (2017) and shown in the bottom-right panel of Fig. 4, are obtained from observations taken at irregular time steps from a path simulated using θ_true = (0.4, 0.5, 0.4). In our experiments, we match the prior used by the authors, corresponding to i.i.d. Gamma(1.5, 5) distributions. We can see in the last panel of Fig. 7 that the IA sampler with uniformization achieves an improvement of almost 4 orders of magnitude.

In this work we have proposed pseudo-marginal methods based on monotonic approximations to the likelihood of stochastic processes with intractable transition functions. A monotonic approximation can be seen as one of two inputs, the other being a debiasing scheme, of an almost automatic procedure for designing pseudo-marginal samplers targeting the correct posterior distribution. To this end, we have described the OSTE scheme, which improves over other debiasing methods by using a bounded number of calls to the approximating sequence, and by offering a flexible way to balance the variance-cost trade-off. We have also shown how to accelerate a recently proposed pseudo-marginal sampler for countable state-space CTMCs by considering matrix exponentiation algorithms that produce increasing approximations when applied to finite rate matrices. Moreover, we have introduced the skeletoid as a novel probabilistic approach for computing monotonic approximations to the exponential of rate matrices, one that excels in cases where the norm of these matrices is high. Finally, we have given experimental evidence that our proposed algorithms offer sizable efficiency gains.
Note that the only component of the proposed algorithms that is tailored to reaction networks is the design of the increasing sequence of state-spaces. Therefore, extending our algorithms to CTMCs other than the ones considered in this study should not be cumbersome. Some interesting areas of application include CTMCs arising from the study of the secondary structure of nucleic acids through elementary-step kinetic models (Schaeffer et al., 2015; Zolaktaf et al., 2019), as well as CTMCs underlying state-dependent speciation and extinction models in phylogenetics (Maddison et al., 2007; Louca and Pennell, 2019).

More generally, we aim to extend the "monotone approximation plus debiasing" framework to other stochastic processes with intractable likelihoods. For example, exact inference methods for diffusions exist only for a restricted class of such processes (Beskos et al., 2006; Sermaidis et al., 2013). On the other hand, there is a growing literature on asymptotic expansions to the transition functions of diffusions (Aït-Sahalia et al., 2008; Li et al., 2013) which are valid in more general settings. These expansions could potentially be modified to produce monotone approximations, thus enabling the use of our framework for exact inference. A different way to tackle this problem would be to rely on techniques to approximate diffusions using CTMCs (Kushner, 1980; Di Masi and Runggaldier, 1981). As long as the transition function of the latter converges monotonically to the transition function of the diffusion, the methods in this paper could be applied.

A Implementation details

A.1 Numerically stable implementation of the skeletoid method

When δ = t2^{−k} is close to the machine's precision, the straightforward computation of S(t2^{−k})^{2^k} produces rounding errors. The reason is that in these cases S(δ) ≈ I. A simple way to bypass this issue is to rely on the identity

A^2 = I + 2(A − I) + (A − I)^2,

which holds for any square matrix A. Defining B := A − I, the identity becomes

A^2 = I + (2B + B^2).

An inductive argument then shows that we can compute A^{2^k} via the recursion depicted in Algorithm 3.

Algorithm 3: Implicit matrix squaring
input: B := A − I, k
for i = 1, 2, ..., k do
    B ← 2B + B^2
end
output: A^{2^k} = I + B

To fully take advantage of Algorithm 3 we must implicitly compute S(δ) − I. This is achieved by replacing the formula for the diagonal elements in Eq. (17) by

(S(δ) − I)_{x,x} = expm1(q_{x,x}δ),

where expm1 is the standard numerical routine for accurately computing e^x − 1 when x ≈ 0.

A.2 Computing a subset of the entries of the approximations

When computing the likelihood function we are generally interested in only a handful of entries of the transition function. This can be used to speed up computation, by only performing computation for the rows of the matrix exponential that contain those entries. Let Q_r be the rate matrix associated to X_r, and define b_r := |X_r|. Suppose we are interested in the entries {(i_l, j_l)}_{l=1}^m of the s-th approximation to the transition function, and let L be the m × b_r matrix whose l-th row is the standard basis row vector corresponding to i_l. In the case of uniformization, recall that P := I + Q_r/(−q̄). Then

(M^{(s)}(t))_{i_l,j_l} = (L M^{(s)}(t))_{l,j_l} = Σ_{n=0}^s e^{q̄t} ((−q̄t)^n / n!) (L P^n)_{l,j_l}.

Note that the matrix LP^n can be efficiently computed recursively, setting LP^0 = L and updating via LP^n = (LP^{n−1})P. Suppose that P has η non-zero elements per column on average. Then the update operation involves ηb_r FLOPs per row of L, so we can obtain (M^{(s)}(t))_{i,j} using only sηb_r operations; compare to sηb^2_r, which would be the cost of computing the full matrix exponential.
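A Python sketch of this row-selection recursion with a sparse P (ours):

    import numpy as np
    import scipy.sparse as sp

    def unif_selected_rows(Q_r, t, s, row_idx):
        # Rows {i_l} of M^(s)(t) via LP^n = (LP^{n-1}) P, at O(s * eta * b_r) cost per row.
        b_r = Q_r.shape[0]
        q_bar = Q_r.diagonal().min()
        P = sp.identity(b_r, format="csr") + sp.csr_matrix(Q_r) / (-q_bar)
        PT = P.T.tocsr()
        lam = -q_bar * t
        L = np.zeros((len(row_idx), b_r))
        L[np.arange(len(row_idx)), row_idx] = 1.0   # l-th row of L is the basis vector e_{i_l}
        LPn = L.copy()
        weight = np.exp(-lam)                       # Poisson weight e^{-lam} lam^n / n!
        out = weight * L
        for n in range(1, s + 1):
            LPn = (PT @ LPn.T).T                    # sparse-times-dense update of L P^n
            weight *= lam / n
            out += weight * LPn
        return out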
For the skeletoid method, we can apply a trick used for computing the action of the matrix exponential obtained through scaling and squaring (Sherlock, 2018). Consider the squaring decomposition

S(δ)^{2^k} = (S(δ)^{2^{k_1}})^{2^{k_2}},  k = k_1 + k_2.

Now, in general S(δ)^{2^{k_1}} is a dense matrix of size b_r × b_r (due to CTMCs being irreducible in general), so we model its computational cost as k_1 b^3_r (ignoring sub-cubic algorithms, due to their higher associated constants, for simplicity). Then the product L S(δ)^{2^{k_1}} costs m b^2_r, and so does every subsequent multiplication by S(δ)^{2^{k_1}} on the right. Therefore, the total cost becomes

C(k_1, k_2) = β k_1 b^3_r + m b^2_r 2^{k_2},

where we have added the parameter β < 1 to account for the fact that n vector-matrix multiplications of cost n^2 each are slower in practice than one matrix-matrix multiplication of cost n^3. We have found that β = 0.1 gives consistent results. We can easily optimize the function above if we work in R, by setting x ← k_2 and k_1 = k − x. Indeed, let

C(x) := β(k − x) b^3_r + m b^2_r 2^x.

Differentiating gives the condition

0 = C'(x) = −β b^3_r + log(2) m b^2_r 2^x,

whose solution is

x* = log_2(β b_r / (log(2) m)).

Constraining x* to an integer in the correct interval gives k*_2 := min{k, max{0, ⌈x*⌉}}. Finally, setting k*_1 = k − k*_2 yields a strategy for computing L S(δ)^{2^k} which is in general faster than the naive approach of forming S(δ)^{2^k}.

A.3 Computing seed paths for reaction networks

Here we present a heuristic for computing a "seed" path between pairs of observations that is specialized to reaction networks with X = Z^{n_s}_+. Fix an observed change in the state of the system (x_{i−1}, x_i, Δt_i), with Δt_i := t_i − t_{i−1}. Assuming non-explosivity, there exists a finite sequence of states {s_p}_{p=0}^P and a corresponding sequence of reactions {r_p}_{p=1}^P, with s_0 = x_{i−1} and s_P = x_i, such that for all p ∈ {1, ..., P},

s_p = s_{p−1} + U_{r_p,·},

where U is the reaction matrix of the model, of size n_r × n_s. Unrolling the above recursion yields

Δx_i := x_i − x_{i−1} = U^T ν,   (21)

where ν ∈ Z^{n_r}_+ is an integer vector denoting the number of times each reaction is used to form the path. A solution to Eq. (21), although not necessarily unique, can be computed by expressing the problem as an integer program,

min_{ν ∈ Z^{n_r}_+} 1^T ν  subject to  U^T ν = Δx_i.

Since for the reaction networks considered in this paper n_r, n_s < 10, the problem is computationally tractable. Moreover, since Δx_i is an integer vector, when U is totally unimodular (which is the case for all the reaction networks we experiment with), the corresponding linear program relaxation yields an integer solution (Hoffman and Kruskal, 1956). Finally, once ν is obtained, we recover the path {s_p}_{p=1}^P, with P := 1^T_{n_r} ν, using a simple heuristic search algorithm.

A.4 Numerically stable evaluation of the log-likelihood estimate

The formula for computing L̃^RA(θ, U) in Eq. (12) is numerically unstable because it depends on the products in Eq. (11), which in most cases will be ≈ 0. However, we can use the following identities to directly compute log(L̃^RA(θ, U)) without explicitly computing the products. The function expm1 is the standard numerical routine for accurately computing e^x − 1 when x ≈ 0. Similarly, log1p refers to standard numerical routines that accurately compute log(1 + x) for |x| ≪ 1.

Proposition 11. Let 0 ≤ p_1 ≤ p_2 ≤ p_3. Fix α ∈ (0, 1] and let

z := p_1 + (p_3 − p_2)/α.

Define s_i := log(p_i) for each i ∈ {1, 2, 3}. Then

log(z) = −∞                                                      if p_3 = 0,
log(z) = s_3 − log(α)                                            if p_2 = 0 < p_3,
log(z) = s_2 + log(expm1(s_3 − s_2)) − log(α)                    if p_1 = 0 < p_2,
log(z) = s_1 + log1p(exp(s_2 − s_1 − log(α)) expm1(s_3 − s_2))   otherwise.

Proof. The first two cases follow directly. For the third case, note that

z = (p_3 − p_2)/α = p_2 (e^{s_3 − s_2} − 1)/α = p_2 expm1(s_3 − s_2)/α.

Hence log(z) = s_2 + log(expm1(s_3 − s_2)) − log(α). The general case follows similarly:

z = p_1 (1 + exp(s_2 − s_1 − log(α)) expm1(s_3 − s_2)).

Taking log(·) on both sides gives the required expression,

log(z) = s_1 + log1p(exp(s_2 − s_1 − log(α)) expm1(s_3 − s_2)).
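In code, the case analysis of Proposition 11 reads as follows (a sketch, with log_alpha = log(α) and s_i = log(p_i) as in the proposition):

    import numpy as np

    def log_oste_term(s1, s2, s3, log_alpha):
        # log(p1 + (p3 - p2)/alpha), assuming 0 <= p1 <= p2 <= p3 and alpha in (0, 1].
        if s3 == -np.inf:                  # p3 = 0, hence z = 0
            return -np.inf
        if s2 == -np.inf:                  # p2 = 0 (so p1 = 0): z = p3/alpha
            return s3 - log_alpha
        if s1 == -np.inf:                  # p1 = 0: z = (p3 - p2)/alpha
            return s2 + np.log(np.expm1(s3 - s2)) - log_alpha
        return s1 + np.log1p(np.exp(s2 - s1 - log_alpha) * np.expm1(s3 - s2))

Summing such terms over observations (for IA), or applying the identity once to the log-products of Eq. (11) (for RA), yields the log-likelihood estimate without ever forming the underflowing products.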
A.5 Tuning the samplers

The approach we have presented depends crucially on correctly tuning the joint truncation-accuracy sequences (see Section 3.4) and the probability mass function p(n) of the stopping times, so that de-biasing behaves properly. As Fig. 8 shows, without proper tuning the sampler can give rise to chains that get stuck for long stretches (in stark contrast to Fig. 5). The reason for this behavior is to be found in Eq. (5): the ratio (a_{ω+N+1} − a_{ω+N})/p(N) can give rise to arbitrarily high estimates of the likelihood if the probability mass function p(n) is much smaller than the observed decrease in error (i.e., the numerator). In this respect, the design of p(n) bears similarity with the design of proposal distributions for importance sampling, as we must ensure that the tails of p(n) match the decay of the first differences a_{ω+n+1} − a_{ω+n}.

There is a growing literature concerned with optimally tuning pseudo-marginal samplers to maximize some measure of efficiency (Pitt et al., 2012; Doucet et al., 2015; Sherlock et al., 2015; Schmon et al., 2020). Recall the augmented target density in Eq. (3), and define ζ(θ, U) := log(L̃(θ, U)) − log(L(θ)). Moreover, let σ^2_ζ(θ) := Var(ζ|θ). The recommendations in this literature suggest tuning the sampler to achieve σ_ζ(θ_MAP) = σ̄, for some pre-specified target σ̄, and with θ_MAP a mode of π(θ|D). The benefit of this approach is that it does not require running the chains for long periods of time with different configurations, since σ_ζ(θ_MAP) can be estimated using simple Monte Carlo whenever we can sample U ∼ m. On the other hand, depending on the particular assumptions of each study, the recommended σ̄ ranges from 0.92 to 1.8. Given the ambiguity of these different recommendations, we use σ_ζ(θ_MAP) not as something to tune in itself, but as a summary of the characteristics of different samplers. In short, we tune our samplers by first tuning a candidate with a default value of p_min = 0.9 (as shown in the next paragraphs) and doing a preliminary run to find θ_MAP and an estimate V̂ar(θ|D) of Var(θ|D), and finally selecting the sampler giving the highest ESS per unit of compute time.

To implement the general outline described above, let us assume that for every observation i ∈ {1, ..., n_d} and every r ∈ Z_+, the sequence {M^{(k)}_r(Δt_i)_{x_{i−1},x_i}}_{k∈Z_+} satisfies a version of the second condition in Eq. (9) (Proposition 5) in which the approximation index is extended to a continuum; that is, we work with a collection {M^{(k)}_r(Δt_i)_{x_{i−1},x_i}}_{r∈Z_+, k∈R_+}. This allows us to have finer control in the tuning process. In order to have a generic tuning procedure that works for both IA and RA estimators, let a^{(k)}_r denote the generic quantity being approximated, with limit α: a single transition probability for IA, or the product in Eq. (11) for RA. Following the conclusions of Proposition 5, we parametrize the sequence of pairs {(r_n, k_n)}_{n∈Z_+} using a simple linear form,

∀n ∈ Z_+: r_n := ω^{(T)} + n  and  k_n := ω^{(S)} + σn,

where ω^{(T)} ∈ Z_+ is a truncation offset, ω^{(S)} ∈ R an approximation quality offset, and σ > 0 a relative step-size. Indeed, in Proposition 5 the offsets are 0 and σ = ρ/κ, if one allows for k_n ∈ R_+ as we do here. We shall see that having non-zero offsets lets us control the variance of the estimators.

Figure 9: Visualization of the tuning process for an RA pseudo-marginal sampler for the Schlögl model, with θ = arg max π(θ|D). First row: output of Algorithm 4. Second row: profiles of the first differences computed from that output.

Let us begin by setting ω^{(T)}, ω^{(S)} to preliminary values. Define a^∞_r := lim_{k→∞} a^{(k)}_r, the value obtained with an exact matrix exponential at truncation level r; the limit α is not known. Moreover, we do not have a theoretical bound for the truncation error that would tell us how high to set r to approximate α.
A similar issue arises with the matrix exponential approximation error when we use the skeletoid method, since its bound can be loose in some settings. We must therefore begin by empirically inspecting the behavior of $\{a^{\infty}_r\}_{r \in \mathbb{Z}_+}$ and $\{a^{k}_{\infty}\}_{k \in \mathbb{R}_+}$. Algorithm 4 achieves this by first exploring the truncation sequence with the matrix exponentiation algorithm set to a fixed high-precision setting $\bar{k}$, producing a truncation index $r'$ at which the difference between consecutive solutions is less than a pre-specified threshold $\epsilon > 0$. It also gives a best guess $a^* \approx \alpha$ for the true value. After this, it sets the truncation to $r'$ and proceeds to check the convergence in $k$ by scanning $a^k_{r'}$ at integer values of $k$, until it finds $k'$ such that $a^* - a^{k'}_{r'} < \epsilon$. The algorithm returns all these values plus the complete profiles $D^{(T)}, D^{(S)}$. The first row of Fig. 9 gives an example of the output for an RA sampler for the Schlögl model. We see that truncating at $r' = 20$ gives a good approximation, while the skeletoid method gives a very accurate solution already at $k = 0$, indicating that the associated bound is indeed loose for this problem.

The output from Algorithm 4 allows us to proceed with tuning the offsets. As mentioned in Section 5, we use stopping times with distribution $N \sim \mathrm{Geom}(p)$; i.e., $p(n) = p(1-p)^n$ for all $n \in \mathbb{Z}_+$. Note that this probability mass function is exponentially decreasing with rate $\beta := -\log(1-p)$. On the other hand, we must set $\sigma$ high enough to ensure that monotonicity holds along the joint sequence. The process is summarized in Algorithm 5: whenever we encounter a failure of monotonicity, $a^{(k_{n+1})}_{r_{n+1}} < a^{(k_n)}_{r_n}$, we double $\sigma$ and recompute $a^{(k_{n+1})}_{r_{n+1}}$ until the condition is met.

Having set $\sigma$, we can adjust the offsets to match the behavior of the joint sequence. Let $n_{\mathrm{peak}}$ be the index of the rightmost peak in the sequence of differences $\Delta a_{n+1} := a^{(k_{n+1})}_{r_{n+1}} - a^{(k_n)}_{r_n}$. Define
$$n_{\mathrm{offset}} := \min\{n \ge n_{\mathrm{peak}} : a^{(k_n)}_{r_n} \ge p_{\min} a^*\},$$
and update the offsets to $\omega^{(T)} = r_{n_{\mathrm{offset}}}$ and $\omega^{(S)} = k_{n_{\mathrm{offset}}}$. Finally, to tune $p_g$ we fit a linear regression without intercept to the logarithm of the first differences,
$$\log(\Delta a_{n+1}) = -\beta_g n + e_n, \qquad e_n \sim N(0, \sigma_e^2).$$
Then we set $p_g = \underline{p} \vee (\overline{p} \wedge (1 - e^{-\beta_g}))$, where $[\underline{p}, \overline{p}]$ defines a pre-specified safe interval (we use $\underline{p} = 0.4$ and $\overline{p} = 0.9$).

Proof of Proposition 1. The fact that $Z \ge a_\omega$ almost surely follows directly from the definition of $Z$ and the non-decreasing property of the sequence. Unbiasedness follows from the monotone convergence theorem:
$$\mathbb{E}[Z] = a_\omega + \sum_{n \ge 0} p(n)\,\frac{a_{\omega+n+1} - a_{\omega+n}}{p(n)} \quad (\text{Eq. (4)}) \; = a_\omega + \sum_{n \ge 0}(a_{\omega+n+1} - a_{\omega+n}) = \lim_{n \to \infty} a_n = \alpha.$$
Then
$$Z^2 = a_\omega^2 + 2 a_\omega \frac{a_{\omega+N+1} - a_{\omega+N}}{p(N)} + \left(\frac{a_{\omega+N+1} - a_{\omega+N}}{p(N)}\right)^2.$$
Applying expectations on both sides and using the monotone convergence theorem,
$$\mathbb{E}[Z^2] = a_\omega^2 + 2a_\omega(\alpha - a_\omega) + \sum_{n \ge 0} \frac{(a_{\omega+n+1} - a_{\omega+n})^2}{p(n)}.$$
Finally, the expression for $\mathrm{Var}(Z)$ follows from $\mathrm{Var}(Z) = \mathbb{E}[Z^2] - \alpha^2$:
$$\mathrm{Var}(Z) = \sum_{n \ge 0} \frac{(a_{\omega+n+1} - a_{\omega+n})^2}{p(n)} - (\alpha - a_\omega)^2.$$
If the variance is finite, then the series above is finite; the fact that $a_\omega \uparrow \alpha$ then yields $\mathrm{Var}(Z) \to 0$ as $\omega \to \infty$.

Proof of Proposition 2. Since $p(n) > 0$ for all $n \in \mathbb{Z}_+$, Proposition 1 applies; substituting $p(n) = p(1-p)^n$ into the variance series, the result follows from the geometric series formula.

Proof of Proposition 4. From the monotonicity assumptions, $a^{(k_n)}_{r_n}$ converges to some limit $\beta \le \alpha$. Suppose, for the purpose of contradiction, that $\beta < \alpha$. From the definition of $a_r$ and the assumption $r_n \uparrow \infty$, we can pick $n_0 \in \mathbb{Z}_+$ and $\epsilon > 0$ such that
$$a_{r_{n_0}} \ge \beta + 2\epsilon. \tag{24}$$
Then, for $j \in \mathbb{N}$, iteratively pick $n_j > n_{j-1}$ such that $k_{n_j} > k_{n_{j-1}}$; this is possible by the assumption $k_n \uparrow \infty$. Since $a^{(k)}_{r_{n_0}} \uparrow a_{r_{n_0}}$ as $k \to \infty$, for all $j$ large enough,
$$a^{(k_{n_j})}_{r_{n_0}} \ge a_{r_{n_0}} - \epsilon. \tag{25}$$
Since $n_j > n_0$ and $r_n$ is increasing, $r_{n_j} \ge r_{n_0}$, and hence
$$a^{(k_{n_j})}_{r_{n_j}} \ge a^{(k_{n_j})}_{r_{n_0}}. \tag{26}$$
Combining Eqs. (24) to (26) yields $a^{(k_{n_j})}_{r_{n_j}} \ge \beta + \epsilon$, which contradicts $a^{(k_n)}_{r_n} \uparrow \beta$. We conclude that $\beta = \alpha$.

Proof of Proposition 5. Since $r_n \uparrow \infty$ and $k_n \uparrow \infty$, Proposition 4 shows that $a^{(k_n)}_{r_n} \uparrow \alpha$.
Moreover,
$$0 \le \alpha - a^{(k_n)}_{r_n} = (\alpha - a_{r_n}) + (a_{r_n} - a^{(k_n)}_{r_n}) \le c_1 e^{-\rho n} + c_2 e^{-\kappa(\rho/\kappa)n} = (c_1 + c_2)e^{-\rho n} \le 2\max\{c_1, c_2\} e^{-\rho n}.$$

Lemma 12. Let $Q$ be a non-conservative rate matrix. Define
$$R := \begin{pmatrix} 0 & \mathbf{0}^\mathsf{T} \\ -Q\mathbf{1} & Q \end{pmatrix}. \tag{27}$$
Then $R$ is conservative, with $\inf_x R_{x,x} = \inf_x Q_{x,x}$, and for $n \in \mathbb{N}$,
$$R^n = \begin{pmatrix} 0 & \mathbf{0}^\mathsf{T} \\ -Q^n\mathbf{1} & Q^n \end{pmatrix}.$$
Moreover, for any $t \ge 0$,
$$e^{tR} = \begin{pmatrix} 1 & \mathbf{0}^\mathsf{T} \\ \mathbf{1} - e^{tQ}\mathbf{1} & e^{tQ} \end{pmatrix}.$$

Proof. The first property, $\inf_x R_{x,x} = \inf_x Q_{x,x}$, is clear from the definition of $R$ in Eq. (27), since only a 0 is added to the diagonal, whose elements are always non-positive. For the second property, we proceed by induction. The case $n = 2$ gives
$$R^2 = \begin{pmatrix} 0 & \mathbf{0}^\mathsf{T} \\ -Q\mathbf{1} & Q \end{pmatrix}\begin{pmatrix} 0 & \mathbf{0}^\mathsf{T} \\ -Q\mathbf{1} & Q \end{pmatrix} = \begin{pmatrix} 0 & \mathbf{0}^\mathsf{T} \\ -Q^2\mathbf{1} & Q^2 \end{pmatrix}.$$
Assuming the property holds for some $n$, then
$$R^{n+1} = R^n R = \begin{pmatrix} 0 & \mathbf{0}^\mathsf{T} \\ -Q^n\mathbf{1} & Q^n \end{pmatrix}\begin{pmatrix} 0 & \mathbf{0}^\mathsf{T} \\ -Q\mathbf{1} & Q \end{pmatrix} = \begin{pmatrix} 0 & \mathbf{0}^\mathsf{T} \\ -Q^{n+1}\mathbf{1} & Q^{n+1} \end{pmatrix}.$$
We conclude that the result holds for all $n \in \mathbb{N}$. Next, note that
$$e^{tR} = I + \sum_{n \ge 1}\frac{t^n R^n}{n!} = \begin{pmatrix} 1 & \mathbf{0}^\mathsf{T} \\ -(e^{tQ} - I)\mathbf{1} & e^{tQ} \end{pmatrix} = \begin{pmatrix} 1 & \mathbf{0}^\mathsf{T} \\ \mathbf{1} - e^{tQ}\mathbf{1} & e^{tQ} \end{pmatrix}.$$

Proposition 13. For $r \in \mathbb{Z}_+$, let $Q_r$ be the non-conservative rate matrix given by Proposition 3, with associated truncated state-space $X_r$. Then for all $t \ge 0$ and $x, y \in X_r$,
$$[e^{tQ_r}]_{x,y} = \mathbb{P}_x(X(t) = y \text{ and } \forall s \in [0, t] : X(s) \in X_r),$$
where $\{X(t)\}_{t \ge 0}$ is the CTMC on $X$ driven by the full non-explosive rate matrix $Q$ and initialized at $X(0) = x$.

Proof. Let $\overline{X}_r = X_r \cup \{\Delta\}$, and consider expanding $Q_r$ to $\overline{Q}_r$ as in Lemma 12, so that $\Delta$ becomes the absorbing state. Let $\{\overline{X}(t)\}_{t \ge 0}$ be a CTMC defined on $\overline{X}_r$, driven by $\overline{Q}_r$ and initialized at $\overline{X}(0) = x$. By Lemma 12, we obtain that for all $x, y \in X_r$,
$$[e^{tQ_r}]_{x,y} = [e^{t\overline{Q}_r}]_{x,y} = \mathbb{P}_x(\overline{X}(t) = y).$$
Now, we can couple $(X(t), \overline{X}(t))$ by sharing the random variables used in their simulation (see Algorithm 1). Denote by $\tau_\Delta := \inf\{t \ge 0 : \overline{X}(t) = \Delta\}$ the first hitting time of $\Delta$. Then
$$\forall t \in [0, \tau_\Delta) : \overline{X}(t) = X(t) \in X_r, \qquad \text{and} \qquad \forall t \ge \tau_\Delta : \overline{X}(t) = \Delta.$$
Therefore,
$$\mathbb{P}_x(\overline{X}(t) = y) = \mathbb{P}_x(\overline{X}(t) = y,\; t < \tau_\Delta) = \mathbb{P}_x(X(t) = y \text{ and } \forall s \in [0, t] : X(s) \in X_r).$$

Proof of Proposition 6. To show double monotonicity, we must prove that for all $r, s \in \mathbb{Z}_+$, $t \ge 0$, and $x, y \in X_r$,
$$(M^{(s)}_r(t))_{x,y} \le (M^{(s)}_{r+1}(t))_{x,y}.$$
We first show by induction that for all $r \in \mathbb{Z}_+$, $x, y \in X_r$, and $n \in \mathbb{N}$, $(P^n_r)_{x,y} \le (P^n_{r+1})_{x,y}$. For $n = 1$, $(P_{r+1})_{x,y} = \delta_{x,y} - \bar{q}^{-1}(Q_{r+1})_{x,y} = \delta_{x,y} - \bar{q}^{-1}(Q_r)_{x,y} = (P_r)_{x,y}$. Now suppose that the result holds for some $n \in \mathbb{N}$. Then
$$(P^{n+1}_{r+1})_{x,y} = \sum_{z \in X_{r+1}} (P^n_{r+1})_{x,z}(P_{r+1})_{z,y} \ge \sum_{z \in X_r} (P^n_{r+1})_{x,z}(P_{r+1})_{z,y} \quad (\text{discard non-negative summands})$$
$$\ge \sum_{z \in X_r} (P^n_r)_{x,z}(P_r)_{z,y} \quad (\text{induction hypothesis and case } n = 1) \; = (P^{n+1}_r)_{x,y}.$$
Thus the result holds for all $n \in \mathbb{N}$. Finally,
$$(M^{(s)}_{r+1}(t))_{x,y} = \sum_{n=0}^{s} e^{\bar{q}t}\frac{(-\bar{q}t)^n}{n!}[P^n_{r+1}]_{x,y} \ge \sum_{n=0}^{s} e^{\bar{q}t}\frac{(-\bar{q}t)^n}{n!}[P^n_r]_{x,y} = (M^{(s)}_r(t))_{x,y}.$$

Proof of Proposition 7. Note that $P = I - Q/\bar{q}$ is sub-stochastic; moreover, for all $n \in \mathbb{Z}_+$, $P^n$ is sub-stochastic too. Then,
$$0 \le \sum_{y}\big([M(t)]_{x,y} - [M^{(s)}(t)]_{x,y}\big) = \sum_{n=s+1}^{\infty} e^{\bar{q}t}\frac{(-\bar{q}t)^n}{n!}\sum_{y}[P^n]_{x,y} \le \sum_{n=s+1}^{\infty} e^{\bar{q}t}\frac{(-\bar{q}t)^n}{n!}.$$
The interchange of sums is justified by Tonelli's theorem, since the summands are non-negative. Taking the supremum over $x$ gives the required result.

Lemma 14. Let $N_\delta$ be the number of jumps of the CTMC in $(0, \delta]$. Then, for all $x, y \in X$, $\mathbb{P}_x(X(\delta) = y, N_\delta \le 1) = (S(\delta))_{x,y}$, for $S(\delta)$ defined in Eq. (17).

Proof. The case $y = x$ is given by $\mathbb{P}_x(N_\delta = 0) = \mathbb{P}_x(\text{first jump occurs after } \delta) = e^{q_{x,x}\delta}$. For $y \ne x$,
$$\mathbb{P}_x(X(\delta) = y, N_\delta \le 1) = \mathbb{P}_x(X(\delta) = y, N_\delta = 1) = \int_0^\delta \mathbb{P}_x(X(\delta) = y, T_1 \in du, T_2 > \delta)$$
$$= \int_0^\delta (-q_{x,x})e^{q_{x,x}u}\,\frac{q_{x,y}}{(-q_{x,x})}\,e^{q_{y,y}(\delta-u)}\,du = q_{x,y}e^{q_{y,y}\delta}\int_0^\delta e^{(q_{x,x}-q_{y,y})u}\,du.$$
If $q_{x,x} = q_{y,y}$, we get $\mathbb{P}_x(X(\delta) = y, N_\delta \le 1) = q_{x,y}\delta e^{q_{x,x}\delta}$. Otherwise,
$$\mathbb{P}_x(X(\delta) = y, N_\delta \le 1) = q_{x,y}e^{q_{y,y}\delta}\left[\frac{e^{(q_{x,x}-q_{y,y})u}}{q_{x,x}-q_{y,y}}\right]_0^\delta = q_{x,y}e^{q_{y,y}\delta}\,\frac{e^{(q_{x,x}-q_{y,y})\delta} - 1}{q_{x,x}-q_{y,y}} = q_{x,y}\,\frac{e^{q_{x,x}\delta} - e^{q_{y,y}\delta}}{q_{x,x}-q_{y,y}}.$$

Lemma 15. Fix $k \in \mathbb{Z}_+$ and $\delta = t2^{-k}$. Let $N_{l,k}$ be the number of jumps in $(l\delta, (l+1)\delta]$, for $l \in \{0, 1, \ldots, 2^k - 1\}$. Then for $x, y \in X$ and $L \in \mathbb{N}$,
$$\mathbb{P}_x\left(X(L\delta) = y,\; \bigcap_{l=0}^{L-1}\{N_{l,k} \le 1\}\right) = [S(\delta)^L]_{x,y}. \tag{28}$$

Proof. If $Q$ is non-conservative, append additional entries corresponding to an absorbing state $\Delta$ as in Eq. (27).
Since $\Delta$ is absorbing, all transitions between $x, y \in X$ must necessarily avoid it. Let $\mathcal{F}_t := \sigma(X(s), 0 \le s \le t)$ be the natural filtration of the process. By the Markov property and Lemma 14, we can derive the following recursion:
$$\mathbb{P}_x\left(X((L+1)\delta) = y,\; \bigcap_{l=0}^{L}\{N_{l,k} \le 1\}\right) = \sum_{z \in X}\mathbb{P}_x\left(X(L\delta) = z,\; \bigcap_{l=0}^{L-1}\{N_{l,k} \le 1\}\right) S(\delta)_{z,y}.$$
Now proceed to prove Eq. (28) by induction; the base case $L = 1$ is Lemma 14. Using $L = 1$ in the recursion above yields
$$\mathbb{P}_x(X(2\delta) = y, N_{0,k} \le 1, N_{1,k} \le 1) = \sum_{z \in X}\mathbb{P}_x(X(\delta) = z, N_{0,k} \le 1)S(\delta)_{z,y} = \sum_{z \in X} S(\delta)_{x,z}S(\delta)_{z,y} = [S(\delta)^2]_{x,y},$$
and the inductive step follows in the same way.

Lemma 16. Let
$$A_k := \bigcap_{l=0}^{2^k - 1}\{N_{l,k} \le 1\} \tag{29}$$
be the event which prescribes that at most one jump occurs in each bin. If the process is non-explosive, then for all $x \in X$,
$$\mathbb{P}_x\left(\bigcup_{k \in \mathbb{Z}_+} A_k\right) = 1.$$

Proof. Let $T$ denote the total number of jump events. For every $n \in \mathbb{Z}_+$, let $B_n = \{T \le n\}$. We first show that $B_n \subset \bigcup_k A_k$. Suppose $\omega \in B_n$, and let $\epsilon_\omega$ denote the smallest inter-arrival time between the jump events in $\omega$. Since $\omega \in B_n$, the number of jump events is finite, and therefore $\epsilon_\omega > 0$. Hence there exists $k_\omega$ with $t2^{-k_\omega} < \epsilon_\omega$, namely $k_\omega = \lceil\log_2(t/\epsilon_\omega)\rceil$, such that for all $k > k_\omega$ we have $\omega \in A_k$ (to see why, note that the smallest inter-arrival time in $\omega$ is larger than the mesh size of $A_k$, hence there can be at most one event in each block of $A_k$). It follows that $\omega \in \bigcup_k A_k$. Now, by non-explosivity, $\mathbb{P}_x(\bigcup_n B_n) = 1$. Additionally, $\{B_n\}_{n \in \mathbb{Z}_+}$ is an increasing family of sets. Hence,
$$\mathbb{P}_x\left(\bigcup_k A_k\right) \ge \lim_{n \to \infty}\mathbb{P}_x(B_n) \quad (\text{continuity from below}) \; = \mathbb{P}_x\left(\bigcup_n B_n\right) = 1.$$

Proof of Proposition 8. Note that $\{A_k\}_{k \in \mathbb{Z}_+}$ (as defined in Lemma 16) is an increasing sequence of sets, because the $k$-th grid is strictly contained in the $(k+1)$-th grid:
$$(tl2^{-k}, t(2l+1)2^{-(k+1)}] \cup (t(2l+1)2^{-(k+1)}, t(l+1)2^{-k}] = (tl2^{-k}, t(l+1)2^{-k}].$$
Applying Lemma 15 with $L = 2^k$ yields $\mathbb{P}_x(X(t) = y, A_k) = [S(\delta)^{2^k}]_{x,y}$. Since $A_k$ is increasing, $\{X(t) = y\} \cap A_k$ is also increasing. Therefore, by continuity from below and Lemma 16,
$$\lim_{k \to \infty}[S(t2^{-k})^{2^k}]_{x,y} = \mathbb{P}_x\left(X(t) = y,\; \bigcup_k A_k\right) = \mathbb{P}_x(X(t) = y).$$

Proof of Proposition 9. To show double monotonicity, we must prove that for all $t \ge 0$, $r, s \in \mathbb{Z}_+$, and all $x, y \in X_r$,
$$(S_r(t2^{-s})^{2^s})_{x,y} \le (S_{r+1}(t2^{-s})^{2^s})_{x,y}. \tag{30}$$
We proceed by induction on $s \in \mathbb{Z}_+$. The base case $s = 0$ follows from the definition of $Q_r$, since for all $x, y \in X_r$, $(Q_r)_{x,y} = Q_{x,y} = (Q_{r+1})_{x,y}$; thus, for all $t \ge 0$ and $x, y \in X_r$, $[S_r(t)]_{x,y} = [S_{r+1}(t)]_{x,y}$. Now, assume Eq. (30) holds for some $s \in \mathbb{Z}_+$ and all $t \ge 0$. Then
$$(S_r(t2^{-(s+1)})^{2^{s+1}})_{x,y} = \sum_{z \in X_r}(S_r(\tfrac{t}{2}2^{-s})^{2^s})_{x,z}(S_r(\tfrac{t}{2}2^{-s})^{2^s})_{z,y} \le \sum_{z \in X_{r+1}}(S_{r+1}(\tfrac{t}{2}2^{-s})^{2^s})_{x,z}(S_{r+1}(\tfrac{t}{2}2^{-s})^{2^s})_{z,y} = (S_{r+1}(t2^{-(s+1)})^{2^{s+1}})_{x,y},$$
where the inequality uses the induction hypothesis (at time $t/2$) together with the non-negativity of the entries.

Proof of Proposition 10. Note that
$$[M(t)]_{x,y} - [S(t2^{-k})^{2^k}]_{x,y} = \mathbb{P}_x(X(t) = y) - \mathbb{P}_x(X(t) = y, A_k) = \mathbb{P}_x(X(t) = y, A^c_k),$$
where the event $A_k$ is defined in Eq. (29). Summing over $y$,
$$\sum_y\big([M(t)]_{x,y} - [S(t2^{-k})^{2^k}]_{x,y}\big) = \sum_y \mathbb{P}_x(X(t) = y, A^c_k) \le \mathbb{P}_x(A^c_k),$$
where the last inequality recognizes the possibility that $Q$ may be non-conservative. Since $\bar{q} > -\infty$, the process can be simulated via uniformization. Let
$$\overline{N}_{l,k} \overset{\text{i.i.d.}}{\sim} \mathrm{Poisson}(-\bar{q}t2^{-k}), \qquad l \in \{0, 1, \ldots, 2^k - 1\},$$
denote the number of uniformized events in each block of the partition; these are independent and identically distributed by the definition of uniformization. Also, each of the uniformized events is either a self-transition or a jump; hence $\overline{N}_{l,k} \le 1 \implies N_{l,k} \le 1$. Then,
$$\mathbb{P}_x(A^c_k) \le 1 - \mathbb{P}\left(\bigcap_{l}\{\overline{N}_{l,k} \le 1\}\right) = 1 - \left(e^{\bar{q}t2^{-k}}(1 - \bar{q}t2^{-k})\right)^{2^k}.$$
Now, use the expansion
$$e^{-x}\left(1 + \frac{x}{n}\right)^n = 1 - \frac{x^2}{2n} + O\!\left(\frac{x^3}{n^2}\right)$$
with $x = -\bar{q}t$ and $n = 2^k$ to obtain
$$\mathbb{P}_x(A^c_k) \le \frac{(\bar{q}t)^2}{2^{k+1}} + O\big((\bar{q}t)^3 2^{-2k}\big).$$
Since the bound on the right-hand side does not depend on $x$, we conclude that
$$\|S(t2^{-k})^{2^k} - M(t)\|_\infty \le \sup_{x \in X}\mathbb{P}_x(A^c_k) \le (\bar{q}t)^2 2^{-k-1} + O((\bar{q}t)^3 2^{-2k}).$$
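As an illustration of the object analyzed in Propositions 8 to 10, the following NumPy sketch builds $S(\delta)$ entrywise from the formulas established in Lemma 14 and then squares it $k$ times. The function name and the dense representation are ours; a practical implementation would exploit sparsity:

    import numpy as np

    def skeletoid(Q, t, k):
        """Skeletoid approximation S(t 2^{-k})^{2^k} to the transition
        function M(t) of a finite rate matrix Q (Lemma 14, Proposition 8)."""
        delta = t / 2 ** k
        d = np.diag(Q)                              # diagonal rates q_{x,x} <= 0
        S = np.empty_like(Q, dtype=float)
        for x in range(Q.shape[0]):
            for y in range(Q.shape[1]):
                if x == y:
                    S[x, y] = np.exp(d[x] * delta)  # no jump within the bin
                elif d[x] != d[y]:
                    S[x, y] = Q[x, y] * (np.exp(d[x] * delta) - np.exp(d[y] * delta)) / (d[x] - d[y])
                else:
                    S[x, y] = Q[x, y] * delta * np.exp(d[x] * delta)
        for _ in range(k):                          # S <- S^2, repeated k times
            S = S @ S
        return S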
References

Closed-form likelihood expansions for multivariate diffusions
Computing the action of the matrix exponential, with an application to exponential integrators
Continuous-Time Markov Chains: An Applications-Oriented Approach
The pseudo-marginal approach for efficient Monte Carlo computations
Markov models for accumulating mutations
Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion)
Concentration inequalities: A nonasymptotic theory of independence
Introduction to Stochastic Processes
Continuous-time approximations for the nonlinear filtering problem
Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator
An exact Gibbs sampler for the Markov-modulated Poisson process
Bayesian analysis of elapsed times in continuous-time Markov chains
mcmcse: Monte Carlo Standard Errors for MCMC
Unbiased Bayesian inference for population Markov jump processes via random truncations
Exact inference for continuous time Markov chain models
Algorithm 955: Approximation of the Inverse Poisson Cumulative Distribution Function
Exact stochastic simulation of coupled chemical reactions
Bayesian parameter inference for stochastic biochemical network models using particle Markov chain Monte Carlo
A practical sequential stopping rule for high-dimensional Markov chain Monte Carlo
Efficient continuous-time Markov chain estimation
A formal basis for the heuristic determination of minimum cost paths
Integral Boundary Points of Convex Polyhedra
On nonnegative unbiased estimators
Markoff chains as an aid in the study of Markoff processes
Applications of Monte Carlo
On methods for studying stochastic disease dynamics
A robust discrete state approximation to the optimal nonlinear filter for a diffusion
Maximum-likelihood estimation for diffusion processes via closed-form density expansions
Queueing Theory: A linear algebraic approach
A General and Efficient Algorithm for the Likelihood of Diversification and Discrete-Trait Evolutionary Models
On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods
Estimating a Binary Character's Effect on Speciation and Extinction
A general method for debiasing a Monte Carlo estimator
Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later
On some properties of Markov chain Monte Carlo simulation methods based on the particle filter
Fast MCMC Sampling for Markov Jump Processes and Extensions
Unbiased estimation with square root convergence for SDE models
Priors over recurrent continuous time processes
Stochastic Simulation of the Kinetics of Multiple Interacting Nucleic Acid Strands
Large-sample asymptotics of the pseudo-marginal method
Markov chain Monte Carlo for exact inference for diffusions
Direct statistical inference for finite Markov jump processes via the matrix exponential
Exact Bayesian inference for discretely observed Markov Jump Processes using finite rate matrices
On the efficiency of pseudo-marginal random walk Metropolis algorithms
Lotka-Volterra competition models for sessile organisms
Stochastic dynamics and non-equilibrium thermodynamics of a bistable chemical system: the Schlögl model revisited
Simulation of population dynamics using continuous-time finite-state Markov chains
Efficient Parameter Sampling for Markov Jump Processes
Efficient Parameter Estimation for DNA Kinetics Modeled as Continuous-Time Markov Chains

Algorithm 4: Separate analysis of convergence of sequences
Input: $\epsilon = 10^{-8}$, $r_{\mathrm{explore}} = 15$.
    Set the accuracy $\bar{k}$ needed to achieve error $\epsilon$ using Eq. (16) or Eq. (20)
    Step 1: analyze truncation $r$ at the fixed algorithm setting $k = \bar{k}$
        $D^{(T)} \leftarrow \emptyset$  (storage for the truncation profile)
        $a_{\mathrm{old}} \leftarrow 0$; $\Delta \leftarrow \infty$; $r \leftarrow 0$
        while $r < r_{\mathrm{explore}}$ or $\Delta \ge \epsilon$ do
            $a_{\mathrm{new}} \leftarrow a^{\bar{k}}_r$  (get estimate at $(r, \bar{k})$ accuracy)
            $\Delta \leftarrow a_{\mathrm{new}} - a_{\mathrm{old}}$; $D^{(T)} \leftarrow D^{(T)} \cup \{(r, a_{\mathrm{new}})\}$
            $a_{\mathrm{old}} \leftarrow a_{\mathrm{new}}$; $r \leftarrow r + 1$
        end
        $r' \leftarrow r$; $a^* \leftarrow a_{\mathrm{new}}$  (best guess of the exact solution)
    Step 2: analyze algorithm setting $k$ at the fixed truncation $r = r'$
        $D^{(S)} \leftarrow \emptyset$; $k \leftarrow k_{\min}$  (small initial setting; may be negative)
        while $a^* - a^{k}_{r'} \ge \epsilon$ do
            $D^{(S)} \leftarrow D^{(S)} \cup \{(k, a^{k}_{r'})\}$; $k \leftarrow k + 1$
        end
        $k' \leftarrow k$  (algorithm setting achieving $\Delta < \epsilon$)
    Output: $D^{(T)}$, $D^{(S)}$, $a^*$, $r'$, $k'$.

Here $\alpha$ denotes the target quantity to be estimated; that is, either the transition probability $M(\Delta t_i)_{x_{i-1},x_i}$ if using the IA method, or the likelihood $L(\theta)$ if using the RA approach. Proposition 1 shows that we can make the variance of $Z$ arbitrarily small by increasing the offset $\omega$. On the other hand, the cost of computing $Z$ is at least the cost of computing $a_\omega$, which in our context increases to $\infty$ as $\omega \to \infty$. It follows that there should be a sweet spot maximizing the efficiency of the sampler. Following this intuition, we would set
$$\omega = \min\{n \in \mathbb{Z}_+ : a_n \ge p_{\min}\,\alpha\}$$
for some $p_{\min} \in (0, 1)$, which reflects a trade-off between guaranteed accuracy and cost. In turn, $p_{\min}$ would be set by maximizing ESS per compute time. Implementing the above strategy has the obvious drawback that $\alpha$ is unknown.

Algorithm 5: setting the relative step-size $\sigma$.
    $a_{\mathrm{old}} \leftarrow a^{\omega^{(S)}}_{\omega^{(T)}}$  (initialize with the solution at both offsets)
    $n \leftarrow 0$; $r \leftarrow \omega^{(T)}$; $k \leftarrow \omega^{(S)}$  (initialize sequences)
    while $r \le r'$ and $k \le k'$ do
        $r \leftarrow \omega^{(T)} + n$; $k \leftarrow \omega^{(S)} + \sigma n$
        $a_{\mathrm{new}} \leftarrow a^{k}_{r}$  (get estimate at $(r, k)$ accuracy)
        while $a_{\mathrm{old}} > a_{\mathrm{new}}$ and $k \le k'$ do  (monotonicity fails)
            $\sigma \leftarrow 2\sigma$  (double $\sigma$ until monotonicity holds)
            $k \leftarrow \omega^{(S)} + \sigma n$
            $a_{\mathrm{new}} \leftarrow a^{k}_{r}$  (update estimate)
        end
        $a_{\mathrm{old}} \leftarrow a_{\mathrm{new}}$; $n \leftarrow n + 1$
    end
    Output: $\sigma$.

One can verify that the distribution given by
$$\forall n \in \mathbb{Z}_+ : \quad p(n) := \frac{a_{\omega+n+1} - a_{\omega+n}}{\alpha - a_\omega} \propto a_{\omega+n+1} - a_{\omega+n}$$
satisfies all the requirements in Proposition 1 and gives $\mathrm{Var}(Z) = 0$. Aiming to mimic this optimal distribution using a $\mathrm{Geom}(p)$ imposes the requirement that the sequence of first differences $\Delta a_{\omega+n} := a_{\omega+n+1} - a_{\omega+n}$ be strictly decreasing. This is in general not true for every offset, as the second row of Fig. 9 shows. Indeed, one can observe peaks at $r_{\mathrm{peak}} = 15$ and $k_{\mathrm{peak}} = -2$ for the truncation and matrix exponential accuracy profiles, respectively. Thus, we impose that the offsets be set to at least the last peak observed in the corresponding $\{\Delta a_{\omega+n}\}_{n \ge 0}$ profiles:
$$\omega^{(T)} = r_{\mathrm{peak}} \vee \min\{r : (r, a^{\bar{k}}_r) \in D^{(T)} \text{ and } a^{\bar{k}}_r \ge p_{\min} a^*\},$$
$$\omega^{(S)} = k_{\mathrm{peak}} \vee \min\{k : (k, a^{k}_{r'}) \in D^{(S)} \text{ and } a^{k}_{r'} \ge p_{\min} a^*\}.$$
Having set preliminary values for $\omega^{(S)}, \omega^{(T)}$, we proceed to set $\sigma$. Following the intuition from Proposition 5, we initialize this parameter with the rule
$$\sigma \leftarrow \sigma_{\mathrm{default}} \vee \frac{k' - \omega^{(S)}}{r' - \omega^{(T)}}.$$
The ratio on the right-hand side corresponds to an "average relative velocity" of the convergence of the matrix exponential accuracy sequence versus the truncation sequence, while $\sigma_{\mathrm{default}} > 0$ corrects border cases. For doubly-monotone sequences, there is no additional tuning required. In other cases, we must set $\sigma$ high enough to ensure that monotonicity holds along the joint sequence, as described in Algorithm 5.
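To close, here is a minimal sketch of the debiasing estimator whose optimal stopping distribution is discussed above, using a geometric stopping time. The names are ours; the argument a stands for any monotone non-decreasing approximation sequence $a_n \uparrow \alpha$, supplied as a callable:

    import numpy as np

    def debiased_estimate(a, omega, p, rng=None):
        """One draw of Z = a(omega) + (a(omega+N+1) - a(omega+N)) / p(N),
        with N ~ Geom(p) on {0, 1, ...}; Z is non-negative and unbiased
        for alpha = lim a(n) when a is monotone (Proposition 1)."""
        rng = np.random.default_rng() if rng is None else rng
        N = rng.geometric(p) - 1       # NumPy's Geom(p) is supported on {1, 2, ...}
        pmf = p * (1.0 - p) ** N       # p(N) = p (1 - p)^N
        return a(omega) + (a(omega + N + 1) - a(omega + N)) / pmf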