key: cord-0552179-o27an9jn authors: Park, Sewon; Oh, Hee-Seok; Lee, Jaeyong title: L'{e}vy Adaptive B-spline Regression via Overcomplete Systems date: 2021-01-28 journal: nan DOI: nan sha: f84378a4842815d4af09a5bbcb000a8b91294862 doc_id: 552179 cord_uid: o27an9jn The estimation of functions with varying degrees of smoothness is a challenging problem in the nonparametric function estimation. In this paper, we propose the LABS (L'{e}vy Adaptive B-Spline regression) model, an extension of the LARK models, for the estimation of functions with varying degrees of smoothness. LABS model is a LARK with B-spline bases as generating kernels. The B-spline basis consists of piecewise k degree polynomials with k-1 continuous derivatives and can express systematically functions with varying degrees of smoothness. By changing the orders of the B-spline basis, LABS can systematically adapt the smoothness of functions, i.e., jump discontinuities, sharp peaks, etc. Results of simulation studies and real data examples support that this model catches not only smooth areas but also jumps and sharp peaks of functions. The proposed model also has the best performance in almost all examples. Finally, we provide theoretical results that the mean function for the LABS model belongs to the certain Besov spaces based on the orders of the B-spline basis and that the prior of the model has the full support on the Besov spaces. Suppose we observe n pairs of observations, (x 1 , y 1 ), (x 2 , y 2 ), . . . , (x n , y n ) where and η is an unknown real-valued function which maps R to R. We wish to estimate the mean function η that may have varying degrees of smoothness including discontinuities. In nonparametric function estimation, we often face smooth curves except for a finite number of jump discontinuities and sharp peaks, which are common in many climate and economic datasets. Heavy rainfalls cause a sudden rise in the water level of a river. The COVID-19 epidemic brought about a sharp drop in unemployment rates. Policymakers' decisions can give rise to abrupt changes. For instance, the United States Congress passed the National Minimum Drinking Age Act in 1984, which has been debated over several decades in the United States, establishing 21 as the minimum legal alcohol purchase age. This act caused a sudden rise in mortality for young Americans around 21. The abrupt changes can provide us with meaningful information on these issues, and it is important to grasp the changes. There has been much research into the estimation of local smoothness of the functions. The first approach is to minimize the penalized sum of squares based on a locally varying smoothing parameter or penalty function across the whole domain. Pintore et al. (2006) , Liu and Guo (2010) , and Wang et al. (2013) modeled the smoothing parameter of smoothing spline to vary over the domain. Ruppert and Carroll (2000) , Crainiceanu et al. (2007) , Krivobokova et al. (2008) , and Yang and Hong (2017) suggested the penalized splines based on the local penalty that adapts to spatial heterogeneity in the regression function. The second approach is the adaptive free-knot splines that choose the number and location of the knots from the data. Friedman (1991) and Luo and Wahba (1997) determined a set of knots using stepwise forward/backward knot selection procedures. Zhou and Shen (2001) avoided the problems of stepwise schemes and proposed optimal knot selection schemes introducing the knot relocation step. Smith and Kohn (1996) , Denison et al. 
(1998b) , Denison et al. (1998a), and DiMatteo et al. (2001) studied Bayesian estimation of free knot splines using MCMC techniques. The third approach is to use wavelet shrinkage estimators including VisuShrink based on the universal threshold 1994) , SureShrink based on Stein's unbiased risk estimator (SURE) function 1995) , Bayesian thresholding rules by utilizing a mixture prior (Abramovich et al.; , and empirical Bayes methods for level-dependent threshold selection (Johnstone and Silverman; . The fourth approach is to detect jump discontinuities in the regression curve. Koo (1997) , Lee (2002) , and Yang and Song (2014) dealt with the estimation of discontinuous function using B-spline basis functions. Qiu and Yandell (1998) , Qiu (2003) , Gijbels et al. (2007) , and Xia and Qiu (2015) identified jumps based on local polynomial kernel estimation. In this paper, we consider a function estimation method using overcomplete systems. A subset of the vectors {φ} j∈J of Banach space F is called a complete system if where β j ∈ R and J ∈ N ∪ ∞. Such a complete system is overcomplete if removal of a vector φ j from the system does not alter the completeness. In other words, an overcomplete system is constructed by adding basis functions to a complete basis (Lewicki and Sejnowski; . Coefficients β j in the expansion of η with an overcomplete system are not unique owing to the redundancy intrinsic in the overcomplete system. The non-uniqueness property can provide more parsimonious representations than those with a complete system (Simoncelli et al.; 1992) . The Lévy Adaptive Regression Kernels (LARK) model, first proposed by Tu (2006) , is a Bayesian regression model utilizing overcomplete systems with Lévy process priors. Tu (2006) showed the LARK model had sparse representations for η from an overcomplete system and improvements in nonparametric function estimation. Pillai et al. (2007) found out the relationship between the LARK model and a reproducing kernel Hilbert space (RKHS), and Pillai (2008) proved the posterior consistency of the LARK model. Ouyang et al. (2008) extended the LARK method to the classification problems. Chu et al. (2009) used continuous wavelets as the elements of an overcomplete system. Wolpert et al. (2011) obtained sufficient conditions for LARK models to lie in the some Besov space or Sobolev space. Lee et al. (2020) devised an extended LARK model with multiple kernels instead of only one type kernel. In this paper, we develop a fully Bayesian approach with B-spline basis functions as the elements of an overcomplete system and call it the Lévy Adaptive B-Spline regression (LABS). Our main contributions of this work can be summarized as follows. The rest of the paper is organized as follows. In section 2, we introduce the Lévy Adaptive Regression Kernels and discuss its properties. In section 3, we propose the LABS model and present an equivalent model with latent variables that make the posterior computation tractable. We present three theorems that the function spaces for the proposed model depend upon the degree of B-spline basis and that the prior has large support in some function spaces. We describe the detailed algorithm of posterior sampling using reversible jump Markov chain Monte Carlo in section 4. In section 5, the LABS model is compared with other methods in two simulation studies and in section 6 three real-world data sets are analysed using the LABS model. 
In the last section, we discuss some improvements and possible extensions of the proposed model. In this section, we give a brief introduction to the LARK model. Let Ω be a complete separable metric space, and ν be a positive measure on R×Ω with ν({0}, Ω) = 0 satisfying L 1 integrability condition, (1 ∧ |β|)ν(dβ, dω) < ∞, for each compact set A ⊂ Ω. The Lévy random measure L with Lévy measure ν is defined as where N is a Poisson random measure with intensity measure ν. We denote L ∼ Lévy(ν). For Let g(x, ω) be a real-valued function defined on X × Ω where X is another set. By integrating g with respect to a Lévy random measure L, we define a real-valued function on X : We call g a generating function of η. When ν(R × Ω) = M is finite, a Lévy random measure can be represented as L(dω) = j≤J β j δ ω j , where J has a Poisson distribution with mean M > 0 and {(β j , ω j )} iid ∼ π(dβ j , dω j ) := ν/M, j = 1, 2, . . . , J. In this case, equation (4) can be expressed as where {(β j , ω j )} is the random set of finite support points of a Poisson random measure. If g is bounded, L 1 integrability condition (2) implies the existence of (4) for all x. See Lee et al. (2020) . If a Lévy measure satisfying (2) is infinite, the number of the support points of N (R, Ω) is infinite almost surely. Tu (2006) proved that the truncated Lévy random field L [g] converges in distribution to L[g] as → 0, where and N is a Poisson measure on R with mean measure This truncation often used as an approximation of the posterior. For posterior computation methods for the Poisson random measure without truncation, see Lee (2007) and Lee and Kim (2004) . Together with data generating mechanism (1), the LARK model is defined as follows: where Lèvy(ν) denotes the Lèvy process which has the characteristic function and ν is a Lèvy measure satisfying (2). Tu (2006) used gamma, symmetric gamma, and symmetric α-stable (SαS) (0 < α < 2) Lèvy random fields. The conditional distribution for Y has a hyperparameter θ and π θ denotes the prior distribution of θ. The generating function g(x, ω) is used as elements of an overcomplete system. Tu (2006) and Lee et al. (2020) used the Gaussian kernel, the Laplace kernel, and Haar wavelet as generating functions: • Haar kernel: g(x, ω) with ω := (χ, λ) ∈ R × R + := Ω. All of the above generating functions are bounded. This LARK model can be represented in a hierarchical structure as follows: for j = 1, . . . , J. J is the random number that is stochastically determined by Lèvy random measure, (β 1 , . . . , β J ) is the unknown coefficients of a mean function and (ω 1 , . . . , ω J ) is the parameters of the generating functions. To obtain samples from the posterior distribution under the LARK model, the reversible jump Markov chain Monte Carlo (RJMCMC) proposed by Green (1995) is used because some parameters have varying dimensions. The LARK model stochastically extracts features and finds a compact representation for η(·) based on an overcomplete system. That is, it enables functions to be represented by the small number of elements from an overcomplete system. However, both the LARK model and most methods for function estimation use only one type of kernel or basis and can find out the restricted smoothness of the target function. These models cannot afford to capture all parts of the function with various degrees of smoothness. For example, we consider a noisy modified Heavisine function sampled at n = 512 equally spaced points on [0, 1] in Figure 1 . 
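As a concrete illustration, data of this kind can be generated as follows. The sketch below uses the standard Donoho-Johnstone Heavisine in place of the modified version considered in the paper, and the sample size and noise level are illustrative choices only.

import numpy as np

def heavisine(x):
    # Standard Donoho-Johnstone Heavisine: a smooth sinusoid with jumps at 0.3 and 0.72.
    return 4.0 * np.sin(4.0 * np.pi * x) - np.sign(x - 0.3) - np.sign(0.72 - x)

rng = np.random.default_rng(0)
n = 512
x = np.linspace(0.0, 1.0, n)        # equally spaced design points on [0, 1]
f = heavisine(x)                    # true mean function with jump discontinuities
sigma = f.std() / 3.0               # noise level set relative to the signal's spread (illustrative)
y = f + rng.normal(0.0, sigma, n)   # observed noisy responses
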
The data contains both smooth and non-smooth regions such as peaks and jumps. As shown in panel (a) of Figure 1 , it is difficult for the LARK model with a finite Lèvy measure using Gaussian kernel to estimate jump parts of the data. We, therefore, propose a new model which can adapt the smoothness of function systematically by using a variety of B-spline bases as the generating elements of an overcomplete system. We consider a general type of basis function as the generating elements of an overcomplete system instead of specific kernel functions such as Haar, Laplacian, and Gaussian. The LABS model uses B-spline basis functions which can all systematically express jumps, sharp peaks, and smooth parts of the function. The B-spline basis function consists of piecewise k degree polynomials with k − 1 continuous derivatives. In general, the B-spline basis of degree k can be derived utilizing the Cox-de Boor recursion formula: where t i are called knots which must be in non-descending order t i ≤ t i+1 (De Boor; 1972), (Cox; . The B-spline basis of degree k, B * k,i (x) then needs (k + 2) knots, (t i , . . . , t i+k+1 ). For convenience of notation, we redefine the B-spline basis of degree k with a knot sequence ξ k := (ξ k,1 , . . . , ξ k,k+2 ) as follows. where ξ k := (ξ k,1 , ξ k,2 , . . . , ξ k,(k+1) ) and ξ k := (ξ k,2 , ξ k,3 , . . . , ξ k,(k+2) ). The LARK model with one type of kernel can not estimate well functions with both continuous and discontinuous parts. To improve this, we consider various a B-spline basis functions simultaneously for estimating all parts of the unknown function. The new model uses B-spline basis to systematically generate an overcomplete system with varying degrees of smoothness. For example, the B-spline basis functions of degrees 0, 1 and 2 or more are for jumps, sharp peaks and smooth parts of the function, respectively. We consider the mean function can be expressed as a random finite sum: where S denotes the subset of degree numbers of B-spline basis and B k (x; ξ k ) is a B-spline basis of degree k with knots, ξ k ∈ X (k+2) := Ω. Generating functions of the LARK model are replaced by the B-spline basis functions. J k has a Poisson distribution with M k > 0 and In this paper, we assume The mean function can be also defined as The stochastic integral representation of the mean function is determined by Although the Lévy measure ν k satisfying (2) may be infinite, the Poisson integrals and sums aboves are well defined for all bounded measurable compactly-supported B k (·, ·) for which for all k ∈ S, In this paper, we consider only finite Lévy measures in the proposed model. In other words, we restrict our attention to the Lévy measure of a compound Poisson process. The new model is more complex than the LARK model with one kernel and expected to give a more accurate estimate of the regression function. It can estimate a mean function having both smooth and peak shapes. The proposed model can write in hierarchical form as In this section, we present three theorems on the support of the LABS model. We first define the modulus of smoothness and Besov spaces. Definition 1 Let 0 < p ≤ ∞ and h > 0. For f ∈ L p (X ), the rth order modulus of smoothness of f is defined by If r = 1, ω 1 (f, t) p is the modulus of continuity. There exist equivalent definitions in defining Besov spaces. We follow DeVore and Lorentz (1993)[2.10, page 54]. Definition 2 Let α > 0 be given and let r be the smallest integer such that r > α. 
For The Besov space is a general function space depending on the smoothness of functions in L p (X ) and especially can allow smoothness of spatially inhomogeneous functions, including spikes and jumps. The Besov space has three parameters, α, p, and q, where α is the degree of smoothness, p represents that L p (Ω) is the function space where smoothness is measured, and q is a parameter for a finer tuning on the degree of smoothness. Theorem 1 For fixed k ∈ S and ξ k ∈ X (k+2) , the B-spline basis B k (x; ξ k ) falls in B α p,q (X ) for all 1 ≤ p, q < ∞ and α < k + 1/p. The proof is given in Appendix A. For instance, the B-spline basis with degree 0 satisfies B k (·, ξ k ) ∈ B α p,q for α < 1/p, the B-spline basis with degree 1 is in B α p,q for 1 + 1/p and the B-spline basis with degree 2 falls in B s p,q for 2 + 1/p. The following theorem describes the mean function of the LABS model, η, is in a Besov space with smoothness corresponding to degrees of B-spline bases used in the LABS model. The proof of the theorem closely follows that of Wolpert et al. (2011) . The proof of Theorem 2 is given in Theorem 2 Suppose X is a compact subset of R. Let ν k be a Lévy measure on R × X (k+2) that satisfies the following integrability condition, and L k ∼ Lévy(ν k ) for all k ∈ S. Define the mean function of the LABS model, (12) for each fixed x ∈ X . Then, η has the convergent series where S is a finite set including degree numbers of B-spline basis. Furthermore, η lies in the Besov space B α p,q (X ) with α < min(S) + 1 p almost surely. For example, if a zero element is included in S then the mean function of the LABS, η falls in B α p,q with α < 1 p almost surely, which consists of functions no longer continuous. If S = {3, 5, 8}, then, η belongs to B α p,q with α < 3 + 1 p almost surely. Moreover, it is highly possible that the function spaces for the LABS model may be larger than those of the LARK model using one type of kernel function. Specifically, the mean function for the LABS model with S = {0, 1} falls in B α p,p with α < 1 p almost surely. If that of the LARK model using only one Laplacian kernel falls in B α p,p with α < 1 + 1 p , then the function spaces of the LABS model with given α < 1 p are larger than those of the LARK model for the range of smoothness parameter, 1 p < α < 1 + 1 p by the properties of the Besov space. The next theorem shows that the prior distribution of our LABS model has sufficiently large support on the Besov space B α p,q with 1 ≤ p, q < ∞ and α > 0. For η 0 ∈ B α p,q (X ), denote the ball around η 0 of radius δ,b Theorem 3 Let X be a bounded domain in R. Let ν k be a finite measure on R × X (k+2) and L k ∼ Levy(ν k ) for all k ∈ S. Suppose η has a prior Π for the LABS model (11). Then, Based on the prior specifications and the likelihood function, the joint posterior distribution of the LABS model (11) is The parameters β and ξ of the LABS model have varying dimensions as J k is a random variable. We use the Reversible Jump Markov Chain Monte Carlo (RJMCMC) algorithm (Green; 1995) for the posterior computation. We consider three transitions in the generation of posterior samples: (a) the addition of basis functions and coefficients; (b) the deletion of basis functions and coefficients; (c) the relocation of knots which affects the shape of basis functions and coefficients. Note that in step (c) the numbers of basis functions and coefficients do not change. We call such move types birth step, death step, and relocation step, respectively. 
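The quantities these moves act on, namely the per-degree counts J_k, the coefficients beta_{k,l}, and the ordered knot vectors xi_{k,l}, can be made concrete with a small prior-simulation sketch of the LABS mean function under finite Levy measures. The degree set, Poisson means, coefficient distribution, and uniform knot placement below are illustrative choices, not the paper's hyperparameter settings.

import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
S = [0, 1, 3]                       # degrees of B-spline bases in the overcomplete system
M = {0: 10.0, 1: 10.0, 3: 10.0}     # Poisson means M_k of the (finite) Levy measures
x = np.linspace(0.0, 1.0, 512)

eta = np.zeros_like(x)              # mean function eta(x); an intercept beta_0 could be added
for k in S:
    J_k = rng.poisson(M[k])                         # number of degree-k basis functions
    for _ in range(J_k):
        beta = rng.normal(0.0, 1.0)                 # coefficient draw (illustrative prior)
        xi = np.sort(rng.uniform(0.0, 1.0, k + 2))  # k + 2 ordered knots in [0, 1]
        B = BSpline.basis_element(xi, extrapolate=False)  # degree-k B-spline basis element
        vals = B(x)                                 # NaN outside the knot support
        eta += beta * np.nan_to_num(vals)           # basis contributes zero outside its support
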
A type of move is determined with probabilities p b , p d and p w with p b + p d + p w = 1, where p b , p d and p w are probabilities of choosing the birth, death, and relocation steps, respectively. Let us denote θ k,l = (β k,l , ξ k,l ) by an element of θ k = {θ k,1 , θ k,2 , . . . , θ k,j , . . . , θ k,J k }, where each ξ k,l has the (k + 2) dimensions. In general, the acceptance ratio of the RJMCMC can be expressed as In our problem the acceptance ratio for each move types is given by where θ k and J k refer to the current model parameters and the number of basis functions in the current state, θ k and J k denote the proposed model parameters and the number of basis functions of the new state. Here, the Jacobian is 1 in all move types. q(θ k |θ k ) is the jump proposal distribution that proposes a new state θ k given a current state θ k . Specifically, we choose the following jump proposal density proposed by Lee et al. (2020) for each move steps: where b(·) is a candidate distribution which proposes a new element. For death and change steps, a randomly chosen rth element of θ k is deleted and rearranged, respectively. The details regarding updating schemes of each move steps are as follows. (a) [Birth step] Assume that the current model is composed of J k basis functions. If the birth step is selected, a new basis function B k,J k+1 and θ k,J k+1 is accepted with the acceptance ratio min 1, Especially, a coefficient β k,J k+1 and an ordered knot set ξ k,J k+1 are drawn from their generating distributions and added at the end of (β k,1 , . . . , β k,J k ) and (ξ k,1 , . . . , ξ k,J k ). When J k = 0, the birth step must be exceptionally selected until J k becomes one. (b) [Death step] If the death step is selected, a rth element, θ k,r uniformly chosen is removed from the existing set of basis functions, coefficients and ordered knot sets. We can find out the acceptance ratio for a death step similarly. The acceptance ratio is given by (c) [Relocation step] Unlike the other steps, the relocation step keeps the numbers of basis functions or coefficients or ordered knot sets fixed. Therefore, the updating scheme of this step is a Metropolis-Hastings within Gibbs sampler. If the relocation step is selected, a current location θ k,r is moved to a new location θ k,r generated by proposal distributions with the acceptance ratio (16). Particularly, since knots of basis function must be in non-descending order, ξ k,r,i which is the ith element of an ordered knot set is sequentially replaced with a new knot location ξ k,r,i generated by U(ξ k,r,i−1 , ξ k,r,i+1 ), i = 1, . . . , (k + 2), where ξ k,r,0 and ξ k,r,k+1 are boundary points of X . That is, each element of a specific knot set ξ k,r = (ξ k,r,1 , . . . , ξ k,r,k+2 ) is moved to new knot locations ξ k,r = (ξ k,r,1 , . . . , ξ k,r,k+2 ) in turn. The corresponding acceptance ratio is given by When using an independent proposal distribution (i.e. q w (θ k,r |θ k,r ) = π(θ k,r )), the acceptance ratio can reduce to min 1, Finally, β k,r is sampled from its conditional posterior distribution by using the Gibbs sampling. The posterior samples of σ and M k can be generated from their conditional posterior distributions. See Appendix C. The pseudo-code for the proposed strategy is given in Algorithm Initialize parameters J , β, ξ, M , σ 2 from prior distributions. for iteration i = 1 to N do 4: for k = 1 to |S| do J := {J 1 , . . . , J k , . . . , J |S| } Update (J k , β k , ξ k ) through a reversible jump MCMC. 
Sample M k from the full conditional π(M k |others). Gibbs step 7: end for 8: Sample σ 2 from the full conditional π(σ 2 |β, ξ, J , M , y). Store ith MCMC samples. Second, we consider three functions that we created ourselves with jumps and peaks to assess the practical performance of the proposed model. The simulated data sets are generated from equally spaced x's on X = [0, 1] with sample sizes n = 128 and 512. Independent normally distributed noises N (0, σ 2 ) are added to the true function η(·). The root signal-to-noise ratio (RSNR) is defined as wheref := 1 |X | X f (x) dx and set at 3, 5 and 10. We also run the LABS model for 200,000 iterations, with the first 100,000 iterations discarded as burn-in and retain every 10th sample. For comparison between the methods, we compute the mean squared errors of all methods using 100 replicate data sets for each test function. The average of the posterior curves is used for the estimate of the test function. We carry out a simulation study using the benchmark test functions suggested by Donoho and Johnstone (1994) The hyperparameters and types of basis functions displayed in Table 1 were used in (11). For Bumps and Doppler, the parameter r of prior distribution for σ 2 was set to 100 to speed up convergence. We also took account (2005) Our main interest lies in estimating smooth functions with either discontinuity such as jumps or sharp peaks or both. We design three test functions to assess the practical performance of the proposed method for our concerns. The first and second example is modified by adding some smooth parts, unlike the original version of the Bumps and Blocks of DJ test functions. Each They are displayed in Figure 6 . We call in turn them modified Blocks, modified Bumps, and modified Heavisine, respectively. In these experiments, we use two or more types of B-spline basis as elements of overcomplete systems since three functions have different shapes, unlike previous simulation studies. Hyper- Table 3 furnishes that the LABS model has the best outcomes when the sample size is 128, which is difficult to estimate. Furthermore, when n = 512, we find out from both Table 4 and Figure 7 that the LABS model performs well in most cases with either the lowest or the second lowest average MSE values across 100 replicates. In particular, the LABS outperforms competitors in modified Blocks, irrespective of the sample size and noise levels as expected. Among all models, the worst performing method is the BASS-2 since it cannot estimate well many jumps or peak points for given test functions. Figure 8 supports that the LABS model has the abilities to overcome the noise and adapt to smooth functions with either discontinuity such as jumps or sharp peaks or both. We now apply LABS model (11) This data has been widely used to estimate the causal effect of policies on the minimum legal drinking age in the area of Regression Discontinuity Design (RDD). Figure 9 (a) highlights that the MLDA data might represent a piecewise smooth function with a single jump discontinuity at minimum drinking age of 21 referred to as cutoff in the RDD. Specifically, each observation (or point) in Figure 9 corresponds to the death rate from all causes in the monthly interval and the number of all observations is 48. Using the MLDA data, we estimate unknown functions of death rates via LABS and several competing models including BPP-21, BASS-1, LARMuK, and GP-R. 
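Performance on this and the following real data set is summarized by cross-validated prediction error. A generic sketch of the 10-fold cross-validation MSE computation is given below; the fit_and_predict argument is a placeholder for whichever fitted model (LABS or a competitor) is being evaluated.

import numpy as np

def cv_mse(x, y, fit_and_predict, n_folds=10, seed=0):
    # Mean and standard deviation of out-of-fold squared prediction error.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for test in folds:
        train = np.setdiff1d(idx, test)
        y_hat = fit_and_predict(x[train], y[train], x[test])  # fit on train, predict held-out fold
        errors.append(np.mean((y[test] - y_hat) ** 2))
    return np.mean(errors), np.std(errors)
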
Figure 9 (b) shows that In Figure 9 (b), both LABS and BASS-1 provide similar posterior curves to the piecewise polynomial regression of Figure 9 (a). The estimated curves of them also have a jump point at 21. While the estimated function of GP-R is smooth, the mean function for the LABS model has both smooth and jump parts. We calculated the mean squared error with 10-folds cross-validation for comparison between methods. The mean and standard deviation values of cross-validation prediction errors are given in Table 5 . The smaller CV error rate of LABS implies that LABS has a better performance of estimating a smooth function with discontinuous points than the others. Bitcoin is the best-known cryptocurrency based on Blockchain technology. The demands for bitcoin have increased globally because of offering a higher yield and easy access. The primary characteristic of bitcoin is to enable financial transactions from user to user on the peer-to-peer network configuration without a central bank. Unlimited trading methods and smaller market size than the stock market lead to high volatility in the bitcoin price. We collected a daily We also added LOESS (locally estimated scatterplot smoothing) regression line to a scatter plot of a daily closing price in Figure 10 . The dataset shows locally strong upward and downward movements. We apply LABS and other models to estimate the curve of daily bitcoin Figure 10 : Daily bitcoin closing price with a smoothing line closing price. Figure 11 illustrates the predicted curves of the LABS and competing models for approximating an unknown function of daily bitcoin closing price. There are no significant differences between the estimated posterior curves. Alternatively, we calculate cross-validation errors to assess model performances. The values of cross-validation errors are given in Table 6 . Both Table 6 and Figure 12 demonstrate that the LABS model provides more accurate function estimation and consistent performance through both minimum mean and relatively low standard deviation values of the cross-validation errors. They also indicate that the Gaussian process is not proper in the cases with locally varying smoothness. We found that the LABS gives more reliable estimated functions that may have both discontinuous and smooth parts than other methods. We suggested general function estimation methodologies using the B-spline basis function as the elements of an overcomplete system. The B-spline basis can systematically represent functions with varying smoothness since it has nice properties such as local support and differentiability. The overcomplete system and a Lévy random measure enable a function that has both continuous and discontinuous parts to capture all features of the unknown regression function. Simulation studies and real data analysis also present that the proposed models show better performance than other competing models. We also showed that the prior has full support in certain Besov spaces. The prominent limitation of the LABS model is the slow mixing of the MCMC algorithm. Future work will develop an efficient algorithm for the LABS model and extend the LABS model for multivariate analysis. A Proof of Theorems 1-3 A.1 Proof of Theorem 1 For simplicity, we assume that X = [0, 1]. Since the B-spline basis has local support and is bounded, B k (x; ξ k ) p is finite for all k ≥ 0. It is enough to show that if the Besov semi-norm, The definition of the modulus of smoothness and the property that Let k be zero. 
Then, the B-spline basis is piecewise constant with 2 knots, ξ 0 := (ξ 01 , ξ 02 ). By the definition of the B-spline basis (6), we divide into two cases to calculate the modulus of continuity. Case 1 Assume ξ 01 + h < ξ 02 , h > 0. Thus, Case 2 Assume ξ 01 + h > ξ 02 , h > 0. Thus, Therefore, in all cases, By definition, The upper bound of |B 0 (x; ξ 0 )| B α p,q is finite if and only if α < 1 p and q < ∞. Let k ≥ 1. Since the B-spline basis of order k is a piecewise polynomial and has (k − 1) continuous derivatives at the knots, it falls in W k p (X ), where W k p (X ) is the Sobolev space, which is a vector space of functions that have weak derivatives. See the definition of the Sobolev space described in chapter 2.5 of DeVore and Lorentz (1993) . We use the following property of the modulus of smoothness, Using (18) and (19), it follows that For all k ≥ 1, |B k (x; ξ k )| B α p,q is finite if and only if α < k + 1 p and q < ∞, so the proof is complete. By Theorem 3 of Wolpert et al. (2011) , the L p norm and Besov semi-norm of η satisfy the following upper bounds, respectively. Since the condition for (12) is satisfied and B-spline basis is bounded and locally supported, η p is almost surely finite. To obtain finite Besov semi-norms for all k ∈ S, the smoothness parameter α should be α < min(S) + 1 p by Theorem 1. Therefore, η belongs to B α p,q with α < min(S) + 1 p almost surely. This appendix contains the full simulations results of the four DJ test functions. We simulated two scenarios: (a) small sample size (n = 128) and (b) large sample size (n = 512) with different noise levels (RSNR = 3, 5, and 10). In this appendix, we derive the full conditional distributions of some parameters required for Gibbs sampling. The full conditional posterior of each parameter can be easily obtained via conjugacy properties. Let us first find the full conditional posterior for β p,q . • Full conditional posterior for β p,q For each q = 1, . . . , J p , For convenience, we set c i = β 0 + k∈S\{p} J k l=1 B k,l (x i ; ξ k,l )β k,l as a constant term. 
Then, the full conditional posterior of β_{p,q} follows from the conjugacy noted above, with c_i collecting the contribution of all other basis functions.

References

Wavelet thresholding via a Bayesian approach
Mastering 'Metrics: The Path from Cause to Effect
Bayesian function estimation using continuous wavelet dictionaries
Numerical Analysis of Wavelet Methods
The numerical evaluation of B-splines
Spatially adaptive Bayesian penalized splines with heteroscedastic errors
On calculating with B-splines
Bayesian MARS
Automatic Bayesian curve fitting
Constructive Approximation
Bayesian curve-fitting with free-knot splines
Adapting to unknown smoothness via wavelet shrinkage
Ideal spatial adaptation by wavelet shrinkage
miscF: Miscellaneous functions
BASS: Bayesian adaptive spline surfaces
Sensitivity analysis and emulation for functional data using Bayesian adaptive splines
Multivariate adaptive regression splines
Jump-preserving regression and smoothing using local linear fitting: a compromise
Reversible jump Markov chain Monte Carlo computation and Bayesian model determination
Empirical Bayes selection of wavelet thresholds
kernlab: Kernel-based machine learning lab
Spline estimation of discontinuous regression functions
Fast adaptive penalized splines
Sampling methods of neutral to the right processes
A new algorithm to generate beta processes
Automatic smoothing for discontinuous regression functions
Bayesian curve fitting for discontinuous functions using an overcomplete system with multiple kernels
Learning overcomplete representations
Data driven adaptive spline smoothing
Hybrid adaptive splines
Bayesian additive regression kernels
Direct and converse theorems for spline and rational approximation and Besov spaces, Function Spaces and Applications
Lévy random measures: posterior consistency and applications, PhD dissertation
Characterizing the function space for Bayesian kernel models
Spatially adaptive smoothing splines
A jump-preserving curve fitting procedure based on local piecewise-linear kernel estimation
Local polynomial jump-detection algorithm in nonparametric regression
R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing
Theory & Methods: Spatially-adaptive penalties for spline fitting
EbayesThresh: Empirical Bayes thresholding and related methods
Shiftable multiscale transforms
Nonparametric regression using Bayesian variable selection
Bayesian nonparametric modeling using Lévy process priors with applications for function estimation, time series modeling and spatio-temporal modeling
Smoothing splines with varying smoothing parameter
fANCOVA: Nonparametric analysis of covariance
Stochastic expansions using continuous dictionaries: Lévy adaptive regression kernels
Jump information criterion for statistical inference in estimating discontinuous curves
Adaptive penalized splines for data smoothing
Jump detection in time series nonparametric regression models: a polynomial spline approach
Spatially adaptive regression splines and accurate knot selection schemes

A.3 Proof of Theorem 3 For the sake of simplicity we assume X = [0, 1]. Fix δ > 0 and η_0 ∈ B^α_{p,q}([0, 1]) with α > 0, 1 ≤ p, q < ∞. If 1 ≤ p′ ≤ p < ∞, then η_0 also belongs to B^α_{p′,q}([0, 1]) by a property of the Besov space (Cohen)[3.2, page 163]. From Theorem 2.1 of Petrushev (1988), we can show that there exists a spline s ∈ S(n′, q) that approximates η_0 in the L_p norm to within an error controlled by a constant C = C(α, p, q), where S(n′, q) denotes the set of all splines of degree (q − 1) with a sufficiently large number n′ of knots.
Since any spline of given degree can be represented as a linear combination of B-spline basis functions with same degree, we can define a spline s(x) bywhere B * (q−1),j (x) is the B-spline basis of degree (q − 1) with a sequence of knots ξ * in (6). Set n := k∈S J δ k , A := k∈S J δ k l=1 |β k,l | < ∞ , ρ := sup B k (x, ξ k ) p < ∞ and := δ 2(A+ρ) . We denote the range of a sequence of knots ξ k,l by r(ξ k,l ), e.g., r(ξ k,l ) = (ξ k,l,(k+2) − ξ k,l,1 ). For convenience, we reindex the coefficients and knots of the spline s(x) in (20) such that β * k,l and ξ * k,l for l = 1, . . . , J δ k , k ∈ S. Then, the spline s(x) can be expressed as follows:where ξ * k,l := (ξ * l , . . . , ξ * l+(q−1)+1 ) is a subsequence of given knots ξ * . Using the definitions of B-spline basis in (6) and (7), we can find a ζ > 0 such thatProof It suffices to show that η ∈b δ (η 0 ) =⇒ η ∈b δ (η 0 ) to finish the proof of the lemma. ForBy the triangle inequality,Thus, η ∈b δ (η 0 ) and this finishes the proof of the lemma.To complete the proof of this theorem, we have to show that Π η ∈b δ (η 0 ) > 0 by using the previous lemma. Let J := max k∈S J δ k . By the triangle inequality,• Full conditional posterior of M k For each k ∈ S,The full conditional distribution for M k is given by• Full conditional posterior of σ 2The full conditional distribution for σ 2 is [σ 2 | others, Y] ∼ IG r 0 2 , r 0 R 0 2 with r 0 = r + n,B k (x; ξ k,l )β k,l ) 2 + rR r 0 .
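A minimal sketch of these conjugate Gibbs updates follows. The σ² draw implements the inverse-gamma full conditional stated above; for M_k, a Gamma(a_k, b_k) prior is assumed here purely for illustration, in which case the Poisson count J_k yields a standard Gamma(a_k + J_k, b_k + 1) update.

import numpy as np

rng = np.random.default_rng(2)

def update_M_k(J_k, a_k=1.0, b_k=1.0):
    # Assumed Gamma(a_k, b_k) prior (rate parameterization) on the Poisson mean M_k;
    # given J_k basis functions of degree k, the full conditional is Gamma(a_k + J_k, b_k + 1).
    return rng.gamma(shape=a_k + J_k, scale=1.0 / (b_k + 1.0))

def update_sigma2(y, eta_hat, r=1.0, R=1.0):
    # Inverse-gamma full conditional IG(r0/2, r0*R0/2) with r0 = r + n and
    # r0*R0 = sum of squared residuals + r*R, as stated in the appendix.
    n = len(y)
    r0 = r + n
    r0R0 = np.sum((y - eta_hat) ** 2) + r * R
    return 1.0 / rng.gamma(shape=r0 / 2.0, scale=2.0 / r0R0)
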