Robust optimal estimation of location from discretely sampled functional data

Ioannis Kalogridis and Stefan Van Aelst

Scandinavian Journal of Statistics, 3 August 2020. DOI: 10.1111/sjos.12586

Abstract. Estimating location is a central problem in functional data analysis, yet most current estimation procedures either unrealistically assume completely observed trajectories or lack robustness with respect to the many kinds of anomalies one can encounter in the functional setting. To remedy these deficiencies we introduce the first class of optimal robust location estimators based on discretely sampled functional data. The proposed method is based on M-type smoothing spline estimation with repeated measurements and is suitable for both commonly and independently observed trajectories that are subject to measurement error. We show that under suitable assumptions the proposed family of estimators is minimax rate optimal both for commonly and independently observed trajectories, and we illustrate its highly competitive performance and practical usefulness in a Monte Carlo study and a real-data example involving recent Covid-19 data.

In recent years, technological innovations and improved storage capabilities have led practitioners to observe and record increasingly complex high-dimensional data that are characterized by an underlying functional structure. Such data are nowadays commonly referred to as functional data, and relevant research has been enjoying considerable popularity following works such as (Ramsay, 1982), (Ramsay and Dalzell, 1991; Rice and Silverman, 1991) and (Ramsay and Silverman, 2005). While the field of functional data analysis has become very broad with many specialized subpaths, see, e.g., (Ferraty and Vieu, 2006; Horváth and Kokoszka, 2012; Kokoszka and Reimherr, 2017), a central problem is inference regarding location functions µ of a random function X with sample paths in some nicely behaved function space.

In the early days of functional data analysis, it was commonly assumed that a sample of fully observed curves X_1, ..., X_n was readily available to practitioners. This implies, among other things, that the mean function can be optimally estimated by the sample mean of the curves, see, e.g., (Horváth and Kokoszka, 2012; Hsing and Eubank, 2015). More recently, emphasis has been placed on the more realistic setting of discretely observed curves, possibly distorted by additive random noise. An early contribution is (Yao et al., 2005), who proposed local linear estimation with repeated measurements and obtained uniform rates of convergence for sparsely observed trajectories. Degras (2008) derived consistency conditions for a broad family of weighted linear estimators and anticipated later developments by remarking that the rate of convergence with respect to the L_2-norm can at most be of order n^{-1} for densely observed data. The delicate interplay between sample size and sampling frequency was further illustrated by Li and Hsing (2010), who showed that the uniform rate of convergence lies between the parametric and nonparametric rates, the exact rate depending on the sampling frequency. A major breakthrough was obtained by Cai and Yuan (2011), who established minimax rates of convergence for mean function estimation and showed that these optimal rates can be achieved by a least-squares smoothing spline estimator with repeated measurements.
Xiao (2020) demonstrated that these convergence rates can also be attained by a least-squares penalized spline estimator under similar assumptions. A common drawback of all the estimation procedures above is the stringent conditions required for the errors, because they rely on the minimization of L_2 distances.

Resistant estimation of central tendency in functional data is a problem that has received significantly less attention in the literature. An early contribution in the area was made by Wei and He (2006), who developed a quantile estimator for longitudinal data based on spline expansions. In the same vein, Lima et al. (2019) proposed replacing the check loss with a more general loss function and similarly used an unpenalized spline expansion in order to recover the mean function from discretely sampled functional data. To the best of our knowledge, most other existing proposals require completely observed trajectories. Cuesta-Albertos and Fraiman (2006) proposed a functional equivalent of the trimmed mean estimator and showed its consistency in the ideal setting of fully observed trajectories. The robust spatial median has received the lion's share of attention in the literature. Gervini (2008) proved its root-n convergence rate for fully observed finite-dimensional trajectories. These convergence results were greatly strengthened by Cardot et al. (2013) to cover separable Hilbert spaces of functions. An even deeper treatment of the spatial median was offered by Chakraborty and Chaudhuri (2014), who established general Glivenko-Cantelli and Donsker-type results for the empirical spatial distribution process in general Banach spaces. More recently, Sinova et al. (2018) proposed a broad family of M-estimators for functional location and derived its consistency, again under the finite-dimensional assumption of the complete trajectories.

Practitioners who seek robust estimates often overcome the discreteness of the data and the presence of measurement error either by a pre-smoothing step or by altogether ignoring these problematic aspects of the data and directly applying any of the aforementioned robust estimators. Although these are popular strategies due to the lack of a better alternative, it should be stressed that their theoretical side-effects are not well understood and consequently it is impossible to decide on best practices. A question that arises naturally is whether it is possible to construct theoretically optimal robust estimators for functional location that can operate in the ubiquitous functional data setting of curves that are discretely observed with measurement error. We answer this question positively by studying a broad class of estimators, including the resistant quantile and Huber-type smoothing splines among others. Our results do not require uncorrelated measurement errors nor existence of any moments of the error distributions for well-chosen robust loss functions, thereby providing an important relaxation of the conditions of (Cai and Yuan, 2011). The estimators can be efficiently computed with the convenient B-spline representation and well-established fast iterative algorithms, so that the associated computational burden is minimal, even if the trajectories are densely observed.

Our contributions are threefold. We propose a broad class of M-type smoothing spline estimators to estimate location functions based on discretely sampled functional data.
Next to estimators for the central tendency, this class of estimators also includes quantile and expectile estimators. We study the rate of convergence of these estimators under weak assumptions for both common and independent designs and show that these estimators are rate-optimal. We do not only provide convergence rates for the location functionals, but also for their derivatives, which are often helpful in uncovering intricate features of functional data.

Let us first assume that {X(t) : t ∈ [0, 1]} is a process with a well-defined mean function, that is, sup_{t∈[0,1]} E{|X(t)|} < ∞. All subsequent arguments can be modified to encompass any location function, e.g., the median or expectile functions, provided that these functions are also well-defined. Let X_1, ..., X_n denote n independent and identically distributed copies of X. Our goal is to recover the mean function µ(t) = E{X(t)} from noisy observations of the discretized curves:

Y_ij = X_i(T_ij) + ζ_ij, (j = 1, ..., m_i; i = 1, ..., n), (1)

where the T_ij are sampling points and the ζ_ij are random noise variables. In (Yao et al., 2005; Cai and Yuan, 2011) and (Xiao, 2020) it is assumed that the noise variables ζ_ij are independent of the X_i and independent and identically distributed with zero mean and finite variance. However, we allow for correlated errors that are not necessarily independent of the curves. We may view the errors ζ_ij as part of n independent copies of the process ζ : (Ω, A, P) × [0, 1] → R evaluated at the T_ij, that is, ζ_ij = ζ_i(T_ij), j = 1, ..., m_i; i = 1, ..., n. Expressing model (1) in terms of mean deviations, we may equivalently write

Y_ij = µ(T_ij) + ε_ij, (j = 1, ..., m_i; i = 1, ..., n), (2)

where ε_ij = X_i(T_ij) − µ(T_ij) + ζ_i(T_ij) denotes the error process associated with the ith response evaluated at T_ij. The problem is thus reformulated as a regression problem with repeated measurements and possibly correlated errors.

While there are many parametric and non-parametric estimators that can potentially be applied to the problem at hand, smoothing spline estimation arises naturally after we restrict µ to the Hilbert-Sobolev space of order r denoted by W^{r,2}([0, 1]), viz.,

W^{r,2}([0, 1]) = {f : [0, 1] → R : f, f^{(1)}, ..., f^{(r−1)} absolutely continuous and f^{(r)} ∈ L_2([0, 1])}.

This separable Hilbert space may be understood as the completion of the space of r-times continuously differentiable functions under a suitable norm (Adams and Fournier, 2003). Therefore, it lends itself to slightly greater generality than the space of r-times continuously differentiable functions commonly employed in penalized spline and local polynomial estimation. Under this assumption, an intuitively appealing estimator for µ may be obtained by solving the smoothing spline problem (3), see the display below, for some convex nonnegative loss function ρ satisfying ρ(0) = 0 and a penalty parameter λ > 0, whose positivity is needed to make the problem well-defined.

The two extreme cases λ ↓ 0 and λ ↑ ∞ reveal the compromise smoothing spline estimators aim to achieve. On the one hand, for λ ↓ 0 the solution becomes arbitrarily close to a natural spline of order 2r which interpolates the measurements Y_ij. On the other hand, for λ ↑ ∞ the dominance of the penalty term in (3) forces ||µ̂_n^{(r)}||_2 = 0, where µ̂_n denotes a solution of (3). A Taylor expansion with integral remainder of µ̂_n about 0 shows that µ̂_n(t) = P_r(t) + ∫_0^t (t − s)^{r−1} µ̂_n^{(r)}(s)/(r − 1)! ds, where P_r(t) is the Taylor polynomial of order r. But by the Schwarz inequality the integral remainder is bounded by a multiple of ||µ̂_n^{(r)}||_2, so the Taylor remainder is equal to zero. This implies that for large λ, µ̂_n is roughly equal to its Taylor polynomial of degree at most (r − 1).
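The display for problem (3) is reconstructed below. The arg-min form and the 1/m_i weighting are inferred from the common-design special case (4) and from the role of the harmonic mean of the m_i in Theorem 2, so this should be read as a sketch of the objective rather than a verbatim quotation:

```latex
% Reconstruction of the smoothing spline problem (3); the 1/m_i weighting
% is an assumption inferred from the harmonic-mean rate in Theorem 2.
\[
\widehat{\mu}_n = \operatorname*{arg\,min}_{f \in W^{r,2}([0,1])}
\left\{ \frac{1}{n}\sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i}
\rho\bigl(Y_{ij} - f(T_{ij})\bigr)
+ \lambda \int_0^1 \bigl|f^{(r)}(t)\bigr|^{2}\, dt \right\}.
\tag{3}
\]
```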
For in-between values of λ the penalty functional can be viewed as providing a bound on how far µ is allowed to depart from a polynomial, as noted by Eubank (1999). While for general convex ρ-functions it cannot be assumed that the solution is unique (Eggermont and LaRiccia, 2009, Chapter 17), for λ > 0 and Σ_{i=1}^n m_i ≥ r it can be shown that a solution to the above variational problem may be found in the space of 2rth order natural splines with knots at the unique T_ij, j = 1, ..., m_i, i = 1, ..., n; see Theorem 1 of (Kalogridis, 2020). Since this is a finite-dimensional space, with dimension equal to the number of unique knots, a solution to (3) may be expediently found by solving a ridge-type regression problem, as discussed in detail in Section 4 below.

The proposed estimator is a generalization of the robust smoothing spline introduced by Huber (1979) for the case of independent and identically distributed errors. It is clear that the squared loss ρ(x) = x² fulfils the requirements on ρ. In this case one recovers the least-squares smoothing spline with repeated measurements proposed by Rice and Silverman (1991) and theoretically investigated by Cai and Yuan (2011). However, the benefit of the formulation in (3) is that it includes loss functions that increase less rapidly as their argument increases in absolute value, so that better resistance towards outlying observations is achieved. Well-known examples of such loss functions are the L_q loss ρ_q(x) = |x|^q, 1 ≤ q ≤ 2, which includes both the absolute and square losses as special cases, and Huber's loss, given by ρ_k(x) = x²/2 for |x| ≤ k and ρ_k(x) = k(|x| − k/2) for |x| > k, for some k > 0 that controls the blending of square and absolute losses. For large k, ρ_k(x) essentially behaves like a square loss, while for small values of k, ρ_k(x) provides a smooth approximation to the absolute loss.

For heterogeneous functional data a single location function may not be appropriate to summarize the data-generating process. In such situations quantiles or expectiles provide much richer information. Since we do not require symmetry of the loss function in (3), such versatile estimators may be readily incorporated into the present framework. Indeed, to compute conditional quantiles and expectiles one would only need to select the loss function according to ρ_τ(x) = x(τ − I(x < 0)) for τ ∈ (0, 1) and ρ_α(x) = (x²/2)|α − I(x < 0)| for α ∈ (0, 1), respectively. Furthermore, the asymptotic properties of quantile and expectile smoothing spline estimators are covered by the theory developed in Section 3.

We now examine the asymptotic properties of M-type smoothing spline estimators in two very different scenarios: the common design case and the independent design case. In the case of common design the curves are recorded at the same locations, that is, T_ij = T_j for j = 1, ..., m, while in the case of independent design we allow the T_ij to be random variables and consequently to be distinct across the subjects. Central to our theoretical development in both types of designs is the theory of reproducing kernel Hilbert spaces. In particular, it is well-known that W^{r,2}([0, 1]) is a reproducing kernel Hilbert space, but the reproducing kernel depends on the inner product. In this paper we make use of the inner product

⟨f, g⟩_{r,λ} = ⟨f, g⟩_2 + λ ⟨f^{(r)}, g^{(r)}⟩_2, f, g ∈ W^{r,2}([0, 1]),

where ⟨·,·⟩_2 denotes the usual L_2([0, 1]) inner product. These quantities are well-defined and interestingly depend on the smoothing parameter λ, which typically varies with n.
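As a small illustration of the loss functions just introduced, the following R snippet (our own; the function names are illustrative) implements the Huber, check and expectile losses in the parametrizations above, together with the Huber derivative ψ_k:

```r
# Loss functions discussed above and their derivatives (psi functions).
# Parametrizations follow the text; function names are illustrative.
rho_huber <- function(x, k) ifelse(abs(x) <= k, x^2 / 2, k * (abs(x) - k / 2))
psi_huber <- function(x, k) pmax(pmin(x, k), -k)       # x on [-k, k], +/- k outside

rho_check <- function(x, tau) x * (tau - (x < 0))      # quantile (check) loss
rho_expectile <- function(x, alpha) (x^2 / 2) * abs(alpha - (x < 0))

# Quick sanity check: all losses are nonnegative and vanish at zero.
x <- seq(-3, 3, by = 0.01)
stopifnot(all(rho_huber(x, 0.7) >= 0), rho_check(0, 0.3) == 0,
          all(rho_expectile(x, 0.25) >= 0))
```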
As we shall see, formulating our results in terms of ||·||_{r,λ} instead of the commonly used L_2([0, 1])-norm will enable us to obtain uniform convergence rates as well as rates of convergence of the derivatives for a broad class of estimators, greatly extending the results of (Cai and Yuan, 2011). Our first result establishes the continuity of point evaluations under ||·||_{r,λ} and characterizes the form of the resulting reproducing kernel. The interested reader is referred to (Eggermont and LaRiccia, 2009, Lemma 2.11, Chapter 13) for an alternative proof of the first part of Proposition 1 below.

Proposition 1. Let r ≥ 1 be an integer. For all λ ∈ (0, 1] it holds that:

(a) There exists a constant c_r depending only on r such that, for all f ∈ W^{r,2}([0, 1]) and all x ∈ [0, 1], |f(x)| ≤ c_r λ^{−1/4r} ||f||_{r,λ}.

(b) For every x ∈ [0, 1] there exists a function R_{r,λ}(x, ·) ∈ W^{r,2}([0, 1]), the reproducing kernel, satisfying f(x) = ⟨f, R_{r,λ}(x, ·)⟩_{r,λ} for all f ∈ W^{r,2}([0, 1]).

(c) There exists a sequence {φ_j}_j of uniformly bounded basis functions of L_2([0, 1]) not depending on λ, orthonormal in L_2([0, 1]) and satisfying ⟨φ_j^{(r)}, φ_k^{(r)}⟩_2 = γ_j δ_jk, where γ_1 = ... = γ_r = 0 and C_1 j^{2r} ≤ γ_j ≤ C_2 j^{2r}, j ≥ r + 1, with C_1, C_2 > 0. Furthermore, the reproducing kernel R_{r,λ}(x, y) admits the representation R_{r,λ}(x, y) = Σ_{j=1}^∞ φ_j(x)φ_j(y)/(1 + λγ_j), where the series converges absolutely and uniformly.

The proposition requires that λ ∈ (0, 1], but this comes without any loss of generality, as in all our theoretical results below we will assume that λ decays to zero as n → ∞. The constant c_r does not depend on x; thus Proposition 1 defines a Sobolev embedding of W^{r,2}([0, 1]) into C([0, 1]) equipped with the standard sup-norm (Adams and Fournier, 2003). For general r, the explicit form of {φ_j}_j is difficult to determine, but for r = 1 we may use the results of (Hsing and Eubank, 2015, Chapter 2) to deduce that φ_1 = 1 and φ_j(x) = 2^{1/2} cos((j − 1)πx), j ≥ 2, which is the standard cosine basis of L_2([0, 1]), with γ_j = ((j − 1)π)², j ≥ 2. It is interesting to observe the rapid decay of the summands, which implies that the φ_j with larger frequencies contribute little to the value of the reproducing kernel at each point (x, y). Of course, the precise rate of decay depends on the smoothing parameter, with smaller values of λ leading to a more slowly converging series. Another interesting observation is that while the φ_j are infinitely differentiable, the map x → R_{r,λ}(x, y) is not differentiable at x = y. This is not a contradiction to Proposition 1, however, as that result only asserts the much weaker x → R_{r,λ}(x, y) ∈ W^{1,2}([0, 1]). Figure 1 plots this mapping for different values of λ and y in the left and right panels, respectively.

With this result at our disposal, we begin with the asymptotic study of M-type smoothing spline estimators in the common design. Notice that in this case the notation simplifies and the M-type estimator µ̂_n is defined as a minimizer of

(nm)^{−1} Σ_{i=1}^n Σ_{j=1}^m ρ(Y_ij − f(T_j)) + λ ||f^{(r)}||_2², f ∈ W^{r,2}([0, 1]). (4)

We require the following assumptions on the loss function, the error process and the design points.

(A1) ρ is an absolutely continuous convex function on R with derivative ψ existing almost everywhere.

(A2) The error processes ε_i(ω, t) : (Ω, A, P) × [0, 1] → R, i = 1, ..., n, are independent and identically distributed.

(A3) There exist finite constants κ and M_1 such that for all x ∈ R and |y| < κ, as u → 0.

(A5) sup_{j≤m} E{|ψ(ε_1j)|²} < ∞, E{ψ(ε_1j)} = 0, and there exist positive constants δ_j, j = 1, ..., m, such that 0 < inf_{j≤m} δ_j ≤ sup_{j≤m} δ_j < ∞ and E{ψ(ε_1j + u)} = δ_j u + o(u) as u → 0.

(A6) The family of discretization points T_j satisfies T_0 := 0 < T_1 < T_2 < ... < T_m < T_{m+1} := 1 and max_{0≤j≤m} (T_{j+1} − T_j) ≤ c/m for some constant c > 0.

Assumption (A1) is standard in regression M-estimation, see, e.g., (Wu, 2007; Li et al., 2011).
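For r = 1 the series representation in Proposition 1(c) can be evaluated numerically. The following R sketch (our illustration; the truncation level J is an arbitrary choice) uses φ_1 = 1, φ_j(x) = √2 cos((j − 1)πx) and γ_j = ((j − 1)π)²:

```r
# Truncated series evaluation of the reproducing kernel R_{1,lambda}(x, y)
# for r = 1, using the cosine basis of Proposition 1(c).
# The truncation level J is an illustrative choice.
R1 <- function(x, y, lambda, J = 2000) {
  j <- 2:J
  phi_x <- c(1, sqrt(2) * cos((j - 1) * pi * x))
  phi_y <- c(1, sqrt(2) * cos((j - 1) * pi * y))
  gamma <- c(0, ((j - 1) * pi)^2)
  sum(phi_x * phi_y / (1 + lambda * gamma))
}

# Smaller lambda => slower decay of the summands => larger kernel values:
R1(0.5, 0.5, lambda = 0.1)
R1(0.5, 0.5, lambda = 0.001)
```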
In our setting it ensures that (3) has a solution in W^{r,2}([0, 1]) and that all solutions will asymptotically be contained in the same ball of W^{r,2}([0, 1]), so that one does not need to worry about "bad" solutions in the limit. Assumption (A2) specifies the permissible error structure; it is substantially weaker than the assumption of Cai and Yuan (2011), as we do not require independence of the measurement errors ζ_ij, j = 1, ..., m, associated with the ith subject. In our opinion, this is an important generalization, because measurement errors occurring close to each other in time or space are likely to be correlated in practice.

Assumptions (A3)-(A5) permit the use of discontinuous score functions by imposing some regularity on the error process and its finite-dimensional distributions instead. It is clear that assumptions (A3)-(A5) are satisfied for smooth ψ-functions. However, as Bai and Wu (1994) point out, these conditions hold quite generally. For example, they show that assumption (A4) may hold with the tighter |u|² on the right-hand side, even if ψ has jumps. On the whole, assumption (A5) ensures the Fisher-consistency of the estimators. The first part consists of two weak conditions that are satisfied, for example, if ψ is bounded and odd and the error has a symmetric distribution about zero. The second part of assumption (A5) essentially requires that each of the functions m_j(u) := E{ψ(ε_1j + u)}, j = 1, ..., m, is differentiable with strictly positive derivative at the origin. This is a necessary condition for the minimum to be well-separated in the limit. It is also not a stringent assumption and can be shown to hold for many popular loss functions. We now provide several examples to illustrate the broad applicability of this assumption.

Example 1 (Squared loss). In this case ψ(x) = 2x and the second part of assumption (A5) holds with δ_j = 2, provided that E{ε_1j} = 0 for all j ≤ m, as in (Cai and Yuan, 2011).

Example 2 (Smooth loss functions). All monotone everywhere-differentiable ψ-functions with bounded second derivative ψ''(x), such as ρ(x) = log(cosh(x)), satisfy the second part of assumption (A5) if δ_j := E{ψ'(ε_1j)} > 0. This is a straightforward generalization of the classical Fisher-consistency condition for spline estimation with smooth ψ-functions, see (Cox, 1983).

Example 3 (Check loss). First, consider the absolute loss, for which ψ(x) = sign(x). If each ε_1j, j = 1, ..., m, has a distribution function F_j with positive density f_j on an interval about zero, then E{sign(ε_1j + u)} = 1 − 2F_j(−u) = 2f_j(0)u + o(u), so that the assumption in (A5) holds with δ_j = 2f_j(0). This easily generalizes to the check loss, provided that in this case one views the regression function µ as the τ-quantile function, that is, Pr(Y_ij ≤ µ(T_j)) = τ, see (Wei and He, 2006).

Example 4 (Huber loss). Now ψ_k(x) = xI(|x| ≤ k) + k sign(x)I(|x| > k) for some k > 0. Assuming that each F_j, j = 1, ..., m, is absolutely continuous and symmetric about zero, we have E{ψ_k(ε_1j + u)} = (2F_j(k) − 1)u + o(u). It is clear that the assumption in (A5) holds in this case with δ_j = 2F_j(k) − 1, provided that min_j F_j(k) > 1/2.

Example 5 (L_q loss with q ∈ (1, 2)). Clearly, ψ_q(x) = q|x|^{q−1} sign(x), and if we assume that each F_j is symmetric about zero, E{|ε_1j|^{q−1}} < ∞ and E{|ε_1j|^{q−2}} < ∞, then the assumption in (A5) holds with δ_j = q(q − 1)E{|ε_1j|^{q−2}}, see (Arcones, 2000). The latter expectation is finite if, e.g., F_j possesses a Lebesgue density f_j that is bounded on an interval about zero.

Example 6 (Expectile loss).
As an alternative to the check loss, consider the continuously differentiable expectile loss ρ_α(x) = (x²/2)|α − I(x < 0)| with derivative ψ_α(x) = x|α − I(x < 0)|. Assuming that there is an interval about the origin in which no F_j has atoms, we have E{ψ_α(ε_1j + u)} = {α + (1 − 2α)F_j(0)}u + o(u). The term in curly braces is positive for each j and α ∈ (0, 1). Therefore, the assumption in (A5) holds with δ_j = α + (1 − 2α)F_j(0).

Condition (A6) ensures that the sampling points are unique and observed on a sufficiently regular grid. It is again a weak assumption that can be shown to hold, for example, if T_j = 2j/(2m + 2), j = 1, ..., m, and for many other designs. See (Cai and Yuan, 2011) and (Xiao, 2020) for close variants of this condition.

With these assumptions we can now state our first main asymptotic result. To ensure consistency of the estimators we assume that the discretization points become more numerous as the sample size grows larger. That is, we require that m depends on n and m(n) → ∞ as n → ∞. This is a natural requirement in order to examine consistency in norms that involve integrals. Also the penalty parameter λ needs to depend on n. To avoid making the notation too heavy, we often do not explicitly indicate the dependence of m and λ on n.

Theorem 1. Consider model (2) and the M-type smoothing spline estimator µ̂_n satisfying (4). Assume that assumptions (A1)-(A6) hold. Moreover, assume that for n → ∞, m and λ vary in such a way that m → ∞, λ → 0, nλ^{1/2+(1−1/8r)/r} → ∞ and lim inf_{n→∞} mλ^{1/2r} ≥ 2c_r, where c_r is the constant of Proposition 1. Then, every sequence of M-type smoothing spline estimators µ̂_n satisfies ||µ̂_n − µ||²_{r,λ} = O_P(n^{−1} + m^{−2r} + λ).

The asymptotic behaviour of the smoothing spline estimator in this functional setting given in Theorem 1 is very different from the asymptotic behaviour of M-type smoothing splines in nonparametric regression with i.i.d. errors as studied by (Cox, 1983; Kalogridis, 2020). In fact, the only similarity is the role of λ, which needs to tend to zero in order for the regularization bias to become negligible. The limit conditions of the theorem are satisfied if λ ∼ C(m^{−2r} + n^{−1}) for sufficiently large C and r ≥ 2. They can also be satisfied for r = 1, i.e., the case of linear smoothing splines, provided that m is "small" relative to n, which precludes densely observed functional data. For r = 1 this rate of convergence can be attained without any conditions on m, provided that one uses a smoother (Lipschitz-continuous) score function ψ. Thus, smoother score functions allow more flexibility in the choice of r and the penalty parameter λ.

It is interesting to observe that for λ ≍ m^{−2r} + n^{−1} the L_2([0, 1])-error decays like ||µ̂_n − µ||_2² ≤ ||µ̂_n − µ||²_{r,λ} = O_P(n^{−1} + m^{−2r}), which is the optimal rate of convergence in the common design (Cai and Yuan, 2011). This rate of convergence implies that a phase transition between the two sources of error occurs at m ≍ n^{1/2r}. Indeed, if m ≫ n^{1/2r} then the asymptotic error behaves like n^{−1}, which is the rate of convergence for many functional location statistics that assume the curves are observed entirely, see (Horváth and Kokoszka, 2012) and (Gervini, 2008) for the functional mean and median, respectively. On the other hand, if m = O(n^{1/2r}) then the asymptotic error behaves like m^{−2r}, which is the error associated with piecewise polynomial interpolation of a W^{r,2}([0, 1])-function (DeVore and Lorentz, 1993). This result confirms the intuitive notion that as long as the grid is dense, the error will decay with the parametric rate n^{−1}.
However, Theorem 1 also shows that for sparsely observed data the discretization error cannot be ignored in the way many currently available resistant estimation procedures require.

Based on Theorem 1 we can now also obtain rates of convergence for the derivatives µ̂_n^{(s)}, s = 1, ..., r − 1, and tightness of µ̂_n^{(r)} in the classical L_2([0, 1])-metric. These are summarized in Corollary 1 below.

Corollary 1. Under the assumptions of Theorem 1, it holds that ||µ̂_n^{(s)} − µ^{(s)}||_2² = O_P(λ^{−s/r}(n^{−1} + m^{−2r} + λ)) for all s ≤ r.

Since, by assumption, λ → 0 as n → ∞, the corollary implies that derivatives of higher order are more difficult to estimate, ceteris paribus. This is not surprising, as differentiability is a local property and one expects to need a larger sample size and higher grid resolution in order to examine local properties with the same degree of precision. Convergence in L_2([0, 1]) is, in general, too weak to imply pointwise convergence, but the stronger mode of convergence obtained in Theorem 1 and Proposition 1 allows us to derive a uniform rate of convergence as follows.

Corollary 2. Under the assumptions of Theorem 1, it holds that sup_{x∈[0,1]} |µ̂_n(x) − µ(x)| = O_P(λ^{−1/4r}(n^{−1} + m^{−2r} + λ)^{1/2}).

Besides its intrinsic interest, Corollary 2 may be viewed as the first step towards establishing uniform confidence bands for µ. We aim to study this problem in detail in forthcoming work.

We now turn to the problem of general location estimation in the independent design case, i.e., when trajectories are observed at different points and their number can vary from trajectory to trajectory. A convenient way to model this difference is by treating the T_ij, j = 1, ..., m_i, i = 1, ..., n, as independent and identically distributed random variables. The assumptions that we need in order to obtain the main result in this case are for the most part straightforward generalizations of the assumptions governing the common design case. We denote the set of sampling points {T_ij}_{i,j} by T.

(B6) The sampling points T_ij, j = 1, ..., m_i, i = 1, ..., n, are independent and identically distributed random variables whose common distribution is the uniform (Lebesgue) distribution on [0, 1]. They are also independent of the error processes ε_i(t), i = 1, ..., n.

Assumptions (B4)-(B5) are adaptations of assumptions (A4)-(A5) to the case of distinct sampling points. Assumption (B6) ensures that the discretization points are well spread throughout the interval [0, 1]. The assumption of uniformity can be marginally weakened to requiring that the density is bounded away from zero and infinity on [0, 1], at the cost of heavier notation and lengthier derivations in the proofs.

A quantity of importance for the asymptotic properties of M-type estimators is the harmonic mean m of the m_i, that is, m = (n^{−1} Σ_{i=1}^n m_i^{−1})^{−1}. We may think of m as a measure of how densely the data are sampled: lower values indicate sparse functional data, while larger values indicate more densely sampled functional data. As in (Xiao, 2020), the m_i are deterministic in our analysis. The more complex situation of random m_i from a distribution with support on the integers will be considered in future work.

Theorem 2. Consider model (2) and the M-type smoothing spline estimator µ̂_n solving (3). Assume that assumptions (B1)-(B6) hold. Moreover, assume that for n → ∞, λ varies in such a way that λ → 0 and nλ^{1/2+(1−1/8r)/r} → ∞. Then, every sequence of M-type smoothing spline estimators µ̂_n satisfies ||µ̂_n − µ||²_{r,λ} = O_P((nmλ^{1/2r})^{−1} + λ + n^{−1}).

The limit conditions of Theorem 2 parallel those of Theorem 1 and are given in such a way that they do not involve m, which in the case of an independent design cannot be assumed to tend to infinity.
Taking λ ≍ (mn)^{−2r/(2r+1)} ensures that the limit assumptions are satisfied and we are led to ||µ̂_n − µ||²_{r,λ} = O_P((mn)^{−2r/(2r+1)} + n^{−1}), which are the optimal rates of convergence for the independent design (Cai and Yuan, 2011). Similarly to the common design case, the phase transition occurs at m ≍ n^{1/2r}, although in the present case the conclusions are more subtle. For m ≫ n^{1/2r} it can again be seen that the asymptotic error behaves like n^{−1}. However, a rather interesting phenomenon occurs when m = O(n^{1/2r}). In this case, the asymptotic error behaves like (mn)^{−2r/(2r+1)}, which would have been the optimal nonparametric rate of convergence had we possessed mn independent observations at mn different sites (Cox, 1983; Kalogridis, 2020). Thus, in these extreme cases the estimator behaves either as if the curves were fully observed, or as if they were only observed at nm points in an independent manner.

Corollaries 1 and 2 regarding derivative estimation and uniform convergence follow in the same manner as previously through Sobolev embeddings. It is curious that while derivatives in many instances provide insightful tools for the analysis of functional data, they have received little theoretical attention in the sparse setting and, to the best of our knowledge, all existing results, such as (Liu and Müller, 2009), concern least-squares estimators. The present methodology provides not only rates of convergence for a broad class of estimators, but also an expedient way of computing these derivatives through the use of the B-spline basis, which we outline next.

As discussed in Section 2, there exists at least one solution of (3) in the space of natural splines of order 2r with knots at the unique T_ij. Thus we may restrict attention to the linear subspace of natural splines for the computation of the estimator. Assume for simplicity that all T_ij are distinct and let a = min_ij T_ij > 0 and b = max_ij T_ij < 1. Then the natural spline has v := Σ_{i=1}^n m_i − 2 interior knots and we may write µ(t) = Σ_{k=1}^{v+2r} µ_k B_k(t), where the B_k(·) are the B-spline basis functions of order 2r supported by the knots at the interior points T_ij and the µ_k are scalar coefficients. It is well known that the space of natural splines with v interior knots is v-dimensional due to the boundary conditions, but for the moment we operate in the larger (v + 2r)-dimensional spline subspace. This is computationally convenient due to the local support of B-splines. As will be explained below, the penalty automatically imposes the boundary conditions (see also Hastie et al., 2009, pp. 161-162). With the introduction of the B-spline basis, to compute the estimator it suffices to find the vector of coefficients µ̂ = (µ̂_1, ..., µ̂_{v+2r})' minimizing the ridge-type criterion (5) obtained by substituting f = Σ_k µ_k B_k into (3). Note that the derivatives of B-splines may be written in terms of B-splines of lower order by taking weighted differences of the coefficients (de Boor, 2001, p. 117).

Initially, it may seem that this formulation ignores the boundary constraints that govern natural splines, but it turns out that the penalty term automatically imposes them. The reasoning is as follows: if that were not the case, it would always be possible to find a 2rth order natural interpolating spline that leaves the first term in (5) unchanged, but because it is a polynomial of order r outside of [a, b] ⊂ [0, 1] the penalty semi-norm would be strictly smaller. Hence the minimizer of (5) incorporates the boundary conditions.
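To make the computation of (5) concrete, here is a minimal R sketch for the Huber loss. It assumes a cubic smoothing spline (r = 2, B-splines of order 4), approximates the penalty matrix by quadrature, and runs the penalized iteratively reweighted least-squares scheme discussed next; the function name, the residual scale being fixed at one, and the exact form of the GCV score are our assumptions, not the authors' released implementation (which is linked at the end of this section):

```r
library(splines)

fit_mspline <- function(tij, y, lambda, k = 0.70, maxit = 100, tol = 1e-8) {
  ord <- 4                                          # 2r with r = 2 (cubic splines)
  knots <- sort(unique(tij))
  aug <- c(rep(min(knots), ord - 1), knots, rep(max(knots), ord - 1))
  B <- splineDesign(aug, tij, ord = ord)            # B-spline design matrix
  grid <- seq(min(knots), max(knots), length.out = 2000)
  B2 <- splineDesign(aug, grid, ord = ord, derivs = 2)
  Omega <- crossprod(B2) * diff(grid)[1]            # quadrature approx. of the penalty

  beta <- solve(crossprod(B) + lambda * Omega, crossprod(B, y))   # LS start
  for (it in seq_len(maxit)) {
    r <- y - drop(B %*% beta)
    w <- ifelse(abs(r) <= k, 1, k / abs(r))         # Huber weights psi(r)/r, scale = 1
    beta_new <- solve(crossprod(B * w, B) + lambda * Omega, crossprod(B, w * y))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  # Weighted hat-matrix at convergence and a GCV score of the form described
  # in the text (our reading of criterion (6)).
  H <- B %*% solve(crossprod(B * w, B) + lambda * Omega, t(B * w))
  r <- y - drop(B %*% beta)
  gcv <- mean(w * r^2) / (1 - mean(diag(H)))^2
  list(coef = drop(beta), gcv = gcv, knots = aug, ord = ord)
}
```

Minimizing the returned gcv over a grid of λ values mimics the data-driven selection rule in (6); this is a sketch for illustration only.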
For ρ-functions whose derivative exists everywhere, the solution to (5) may be expediently found through a minor modification of the penalized iteratively reweighted least-squares algorithm, see, e.g., (Maronna, 2011). The algorithm consists of solving a weighted penalized least-squares problem at each iteration until convergence, which is guaranteed irrespective of the starting values and yields a stationary point of (3), under mild conditions on ρ that include the boundedness of ρ'(x)/x near zero (Huber and Ronchetti, 2009). These conditions are satisfied for smooth loss functions as well as the Huber and expectile losses, but are not satisfied for the popular quantile loss. Nevertheless, the easily implementable recipe of (Nychka et al., 1995) may be used in order to obtain an approximate solution of (5). In particular, in the algorithm the loss function can be replaced by a smooth approximation ρ̃ for some small ε > 0. Whenever such a modification of the objective function is not feasible, we recommend utilizing a convex-optimization program in order to identify a minimizer of (5), as given, for example, by Fu et al. (2020).

To determine the penalty parameter λ in a data-driven way we propose to select the λ that minimizes the generalized cross-validation (GCV) criterion (6), in which r_ij is the ijth residual, W(r_ij) = ρ'(r_ij)/r_ij, and the h_ij are measures of the influence of the ijth observation, which can for instance be obtained from the diagonal of the weighted hat-matrix obtained upon convergence of the iteratively reweighted least-squares algorithm. The GCV criterion employed herein is a generalization of the criterion proposed by Cunningham et al. (1991). Implementations and illustrative examples of the Huber and quantile-type smoothing spline estimators are available at https://github.com/ioanniskalogridis/Robust-optimal-estimation-of-functional-location.

We now compare the numerical performance of the proposed robust M-type smoothing spline estimator to several competitors. In particular, we include in the comparison the least-squares smoothing spline estimator of (Rice and Silverman, 1991; Cai and Yuan, 2011), the robust regression spline estimator of (Lima et al., 2019), the least-squares local linear estimator of (Yao et al., 2005; Degras, 2011) and the robust functional M-estimator of (Sinova et al., 2018). For simplicity and for ease of comparison with the functional M-estimator of (Sinova et al., 2018), which at the very least requires densely sampled trajectories, we consider the case of common design.

First, we briefly review the construction of the competing estimators. Let K : R → [0, 1] denote a symmetric nonnegative kernel (weight) function on R that is Lipschitz continuous. Then, for any given point t the local linear estimator of (Yao et al., 2005; Degras, 2011) is given by µ̂_LSLP(t) = â, where (â, b̂) minimizes Σ_{j=1}^m K((T_j − t)/h){Ȳ_j − a − b(T_j − t)}². Here, Ȳ_j denotes the average of the responses at T_j and h denotes the smoothing parameter (i.e., the bandwidth). In our implementation we use the Gaussian kernel for K. The bandwidth h is chosen by the direct plug-in method, which essentially works by replacing the unknown functionals in the asymptotic expression of the optimal bandwidth with Gaussian kernel estimates.

For fully observed i.i.d. curves X_1, ..., X_n, Sinova et al. (2018) propose to estimate µ(t) by the functional M-estimator µ̂_HF minimizing n^{−1} Σ_{i=1}^n ρ(||X_i − g||_2) over g, with ρ a real-valued loss function. The well-known spatial median can be recovered from (7) by setting ρ(x) = x.
For this comparison we use the Huber loss function with tuning parameter equal to 0.70, which corresponds to 85% efficiency in the Gaussian location model. To apply the estimator in our setting, L_2([0, 1])-norms and inner products were computed with trapezoidal Riemann approximations wherever necessary.

For discretely but commonly observed functional data, Lima et al. (2019) proposed estimating µ by Σ_{j=1}^p µ̂_j B_j, where µ̂ = (µ̂_1, ..., µ̂_p)' minimizes the unpenalized criterion Σ_{i=1}^n Σ_{j=1}^m ρ(Y_ij − µ'B_j), with B_1, ..., B_p B-spline basis functions. The loss function ρ is taken to be the Huber loss with tuning constant again equal to 0.70. The vector B_j stands for a collection of cubic B-spline basis functions defined on equidistant knots evaluated at T_j, i.e., B_j = (B_1(T_j), ..., B_p(T_j))'. Motivated by asymptotic considerations, the authors propose to choose the dimension of the approximating spline subspace as p = max{4, [0.3 n^{1/4} log(n)]}.

In our simulation experiments we are particularly interested in the effect of the sampling frequency, the level of noise and the effect of atypical observations on the estimates. To reflect these considerations we have generated independent and identically distributed curves according to a Karhunen-Loève expansion for t ∈ (0, 1). The mean function µ(t) is either sin(6πt)(t + 1) or 3 exp[−(0.25 − t)²/0.1]. The W_k are independent and identically distributed random variables following the t_5 distribution. Using the t_5 distribution instead of the commonly used standard Gaussian distribution implies that some of the curves exhibit some outlying behaviour, so that our samples may contain observations with large influence. Once the curves have been generated, we discretize them in m ∈ {20, 50} equispaced points 0 < T_1 < ... < T_m < 1 and generate the noisy observations according to Y_ij = X_i(T_j) + σζ_ij, (j = 1, ..., m; i = 1, ..., n), with σ > 0 a constant that controls the level of noise in the data. We consider the values σ ∈ {0.2, 0.5, 1}, reflecting low, medium and high levels of noise, respectively. The errors ζ_ij are i.i.d. random variables generated according to each of the following distributions: (i) standard Gaussian, (ii) t_3 distribution, (iii) skewed t_3 distribution with non-centrality parameter equal to 0.5, (iv) a convex combination of independent Gaussians with means equal to zero, variances equal to 1 and 9, and weights equal to 0.85 and 0.15, respectively, and (v) Tukey's Slash distribution, i.e., the quotient of independent standard Gaussian and (0, 1)-uniform random variables. Figure 2 presents examples of a typical set of observations under standard Gaussian errors and small noise with the first and second mean function, respectively.

For this comparison we consider the M-type smoothing spline estimator based on the Huber function, also with tuning constant equal to 0.70, and henceforth denote this estimator by µ̂_HSP. The least-squares smoothing spline estimator described by Cai and Yuan (2011) is denoted by µ̂_LSSP. These authors do not recommend a method to select the penalty parameter in finite samples. Hence, similarly as for the M-type smoothing spline estimator, we use the GCV criterion in (6). Both spline estimators were fitted with r = 2, resulting in cubic splines. All estimators were implemented in the freeware R (R Core Team, 2018). For the local linear estimator we used the package KernSmooth (Wand and Ripley, 2016), for the regression spline estimator we used a convex optimization routine provided by Fu et al. (2020), while the functional M-estimator was implemented according to the algorithm provided by Sinova et al. (2018).
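Since the exact Karhunen-Loève eigenfunctions used in the simulations are not reproduced above, the following R sketch generates data of the same flavour with a hypothetical Fourier basis and geometrically decaying scores; only the t_5 scores, the equispaced grid, the mean function and the noise scaling follow the text:

```r
# Illustrative data generation in the spirit of the simulation design.
# The eigenfunctions phi_k and score scales are hypothetical choices;
# W_k ~ t_5 scores, grid size m and noise level sigma follow the text.
set.seed(1)
n <- 60; m <- 20; sigma <- 0.2; K <- 10
tj <- (1:m) / (m + 1)                      # equispaced points in (0, 1)
mu <- function(t) sin(6 * pi * t) * (t + 1)

phi <- sapply(1:K, function(k) sqrt(2) * sin(k * pi * tj))  # hypothetical basis
scales <- 2^(-(1:K))                                        # hypothetical decay
W <- matrix(rt(n * K, df = 5), n, K)                        # t_5 scores
X <- rep(1, n) %o% mu(tj) + W %*% (scales * t(phi))         # n x m discretized curves
Y <- X + sigma * matrix(rnorm(n * m), n, m)                 # noisy observations

# Monte Carlo error of a fitted estimate mu_hat evaluated on tj,
# matching the MSE criterion used in the comparison:
# mse <- mean((mu_hat - mu(tj))^2)
```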
Table 1: Mean and standard error of the MSE for the competing estimators over 1000 datasets of size n = 60 with mean function µ(t) = sin(6πt)(t + 1). Best performances are in bold.

To evaluate the performance of the estimators we calculate their mean-squared error (MSE), given by MSE = m^{−1} Σ_{j=1}^m {µ̂(T_j) − µ(T_j)}², which for large grids is an approximation to the L_2([0, 1]) distance. Tables 1 and 2 below report the mean-squared errors and their standard errors based on 1000 simulated datasets of size n = 60. In our experience such a moderately small sample size occurs fairly often in practice, see, e.g., the well-known Canadian weather dataset (Ramsay and Silverman, 2005).

The results indicate that the least-squares smoothing spline estimator, µ̂_LSSP, behaves well under all noise and discretization settings provided that the measurement errors follow a Gaussian distribution. However, its performance quickly deteriorates as soon as the errors deviate from the Gaussian ideal. Similar remarks apply to the local linear estimator µ̂_LSLP, except that its behavior is not good for the sinusoidal mean function when m is small. The reason for this weaker performance is that µ̂_LSLP regularly oversmooths in this case and thus misses the peaks and troughs of the sinusoidal mean function. We believe that its performance may be greatly improved if more effort is invested in the selection of its bandwidth, which naturally would come at the cost of a higher computing time.

Comparing the robust estimators µ̂_HSP, µ̂_HF and µ̂_RS in detail leads to a number of interesting observations. First, while µ̂_HF performs well in general for lower levels of noise, it is highly sensitive to larger levels of noise, especially when that is accompanied by heavy-tailed measurement errors. In particular, for the Gaussian mixture, µ̂_HF tends to perform nearly as badly as the least-squares estimator µ̂_LSSP with respect to both mean functions. The regression spline estimator µ̂_RS does not perform well with respect to the sinusoidal mean function of our first example. This may be attributed to the local characteristics of this function. For such functions regression spline estimators require not only an appropriate number of knots but also good placement, see, e.g., (Eubank, 1999) for possible strategies in that respect. Inevitably, this results in much increased computational effort that negates the computational advantage of regression spline estimators. The performance of µ̂_RS greatly improves in our second example, but it is outperformed by the competing estimators in all but one of the settings considered.

The Huber-type smoothing spline estimator, µ̂_HSP, performs as well as µ̂_LSSP under Gaussian errors and exhibits a high degree of resilience towards high levels of noise and contamination. In particular, it can be seen that symmetric contamination, i.e., t_3, mixture Gaussian and Slash errors, only has a small effect on the performance of the estimator, which clearly outperforms its competitors. Interestingly, µ̂_HSP as well as the other robust estimators seem more vulnerable to asymmetric contamination resulting from the right-skewed t-distribution. Although its performance deteriorates in this case, the Huber-type smoothing spline estimator best mitigates the effect of this contamination relative to its competitors.
Overall, the present simulation experiments suggest that robust smoothing spline estimators perform well in clean data and safeguard against outlying observations, either in the form of outlying curves or heavy-tailed measurement errors.

Since its identification at the end of 2019, the Covid-19 virus has been responsible for hundreds of millions of infected cases and considerable economic and social upheaval. In this study we aim to shed light on the course of the epidemic in Europe and identify the overall trends as well as the countries which have been most or least afflicted. Our dataset consists of the number of daily new cases per million persons from 34 European countries in the period between the 24th of January 2020 and the 21st of March 2021, for a total of 423 days. The data is not observed on the same grid, as most countries only began systematically reporting the number of new daily cases around the end of March 2020. Hence, according to our previous definitions, we are in the situation of an independent design. The data is depicted in the left panel of Figure 3.

Despite the fact that the raw data exhibits a number of isolated spikes corresponding to occasionally very large numbers of daily cases, two peaks are easily discernible. The first peak, taking place between the end of March and the beginning of April, is commonly identified with the culmination of the first wave of infections. The number of infections then plummeted in the aftermath of the lock-down measures, but rose rapidly after the end of the summer, possibly due to the gradual abandonment of restrictions. The so-called second wave of infections peaked towards the end of 2020 and subsided afterwards. However, as of February 2021 the number of infections seems to be yet again on the increase.

In order to draw formal conclusions, we employ the quantile cubic smoothing spline with values of τ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, yielding the right panel of Figure 3. The estimated quantiles serve to confirm the suspected bimodality in the number of infections. Although these two peaks are identified by all estimated quantiles, higher quantiles differ significantly from lower ones in that they are much steeper. This suggests that countries most afflicted by the first wave of infections experienced only a brief period of calm in between the two waves. For these countries the number of new infections rose very fast towards the end of the summer. By contrast, the median curve indicates that for countries less affected by the first wave, i.e., countries with fewer than 25 cases per million persons, the number of infections only starts to gradually increase again in August. The estimates further suggest a structural break in early September, when the number of new infections picks up pace. This change may be attributable to the opening of schools in many countries.

To visualize the estimates, we have constructed heat maps of Europe based on the quantiles for four days covering the whole period. These are shown in Figure 4. The heat maps in the top row indicate that Spain, Portugal, Luxembourg, Italy, Switzerland, Sweden and the United Kingdom were hit hardest in the first wave of the disease. Similarly, from the heat maps in the bottom row it can be seen that the countries worst afflicted by the second wave include Luxembourg, Switzerland, Serbia, Hungary, Poland and Estonia.
A tentative conclusion from these maps may be that the first wave was more detrimental to western European countries, while during the second wave it was the countries of eastern Europe that were worse afflicted.

Lastly, the above analysis can be complemented by a study of the daily deaths per million persons attributed to Covid-19. The left panel of Figure 5 plots the corresponding trajectories, while the right panel presents the quantile smoothing spline estimates. It can clearly be seen that the number of daily deaths is also bimodal, but the second peak is only marginally larger than the first peak. In other words, the number of deaths has not kept up with the number of cases, which exploded during the second wave. Two plausible explanations for this difference are that the fraction of younger, less vulnerable people that contract the disease is much higher than in the first wave of the epidemic, and that the treatment of patients has improved.

This paper provides theoretical justification for robust smoothing spline estimators under both common and independent designs with trajectories that may be densely or sparsely sampled. From here there are several possible research directions worth pursuing. Of particular importance is robust inference for the covariance structure of second-order processes. Despite its importance, this problem has not received much attention. To the best of our knowledge, current work, such as (Panaretos and Kraus, 2012), requires completely observed trajectories. To construct robust dispersion estimates from discretely sampled functional data, the current framework can be extended by replacing the one-dimensional smoothing spline with a two-dimensional thin-plate spline (Wahba, 1990; Wood, 2017). Robust estimation of the covariance structure would then allow for robust functional principal component analysis based on discretely sampled data in the manner outlined in (Yao et al., 2005; Li and Hsing, 2010).

Another important area for the application of robust smoothing spline estimators which has attracted great interest recently is functional regression and its variants. To date, most estimation proposals assume that the functional predictor is fully observed. Hence, the discreteness of the data is essentially ignored. A smoothing spline estimator for discretely observed curves based on the MM principle was proposed by Maronna and Yohai (2013), but without any theoretical support. We are confident that under appropriate conditions this MM-type smoothing spline estimator can achieve optimal rates of convergence, as defined in (Crambes et al., 2009), whilst also providing a much safer estimation method in practice. We aim to study this in future work.

Proof of Proposition 1. First, we recall the existence of a complete orthonormal sequence {φ_j}_j of L_2([0, 1]) and values 0 = γ_1 = ... = γ_r < γ_{r+1} < ... which satisfy C_1 k^{2r} ≤ γ_{k+r} ≤ C_2 k^{2r}, k ≥ 1, for some 0 < C_1 ≤ C_2, see (Hsing and Eubank, 2015, Theorem 2.8.3). Furthermore, since W^{r,2}([0, 1]) ⊂ L_2([0, 1]) and the φ_j are orthogonal in W^{r,2}([0, 1]) under ⟨·,·⟩_{r,λ}, we have that ⟨φ_j, φ_k⟩_{r,λ} = (1 + λγ_j)δ_jk, so that φ_j/(1 + λγ_j)^{1/2} is an orthonormal basis of W^{r,2}([0, 1]) under the inner product ⟨·,·⟩_{r,λ}. It follows that for every f ∈ W^{r,2}([0, 1]) we can write f = Σ_{j=1}^∞ f_j φ_j/(1 + λγ_j)^{1/2} for some square-summable sequence {f_j}_j. Using this representation and the Schwarz inequality we find |f(x)| ≤ sup_j |φ_j(x)| (Σ_{j=1}^∞ |f_j|²)^{1/2} (Σ_{j=1}^∞ (1 + λγ_j)^{−1})^{1/2}. The first factor on the right-hand side is bounded, say sup_{j≥1} sup_{x∈[0,1]} |φ_j(x)| ≤ M, as the φ_j are uniformly bounded in both x and j, see (Hsing and Eubank, 2015, p. 60).
Moreover, Parseval's theorem yields Σ_{j=1}^∞ |f_j|² = ||f||²_{r,λ} < ∞, since f ∈ W^{r,2}([0, 1]). Finally, by the properties of the γ_j and integral approximation, we obtain Σ_{j=1}^∞ (1 + λγ_j)^{−1} ≤ r + ∫_r^∞ (1 + λC_1 x^{2r})^{−1} dx ≤ r + c_r λ^{−1/2r}, for some finite constant c_r depending only on r. Putting everything together and recalling that λ ∈ (0, 1], |f(x)| ≤ M max{r, c_r}^{1/2} λ^{−1/4r} ||f||_{r,λ}, which establishes part (a). Part (b) now follows from the Riesz representation theorem for bounded linear functionals on W^{r,2}([0, 1]).

To prove part (c) note that, since x → R_{r,λ}(x, y) ∈ W^{r,2}([0, 1]) and φ_j/(1 + λγ_j)^{1/2}, j ≥ 1, is an orthonormal basis, we may expand R_{r,λ}(·, y) in this basis. But R_{r,λ}(x, y) is the reproducing kernel, hence ⟨R_{r,λ}(·, y), φ_j⟩_{r,λ} = φ_j(y), resulting in the representation asserted in part (c). Now, for every n > m we have sup_{(x,y)∈[0,1]²} Σ_{j=m+1}^n |φ_j(x)φ_j(y)|/(1 + λγ_j) ≤ M² Σ_{j=m+1}^n (1 + λγ_j)^{−1}. Since for every λ > 0 the latter series is convergent, it follows that for every ε > 0 there exists an N such that sup_{n,m≥N} sup_{(x,y)∈[0,1]²} Σ_{j=m+1}^n |φ_j(x)φ_j(y)|/(1 + λγ_j) < ε. The absolute series is thus uniformly Cauchy and hence uniformly convergent, as was to be proven.

Proposition 1 implies that ||R_{r,λ}(x, ·)||²_{r,λ} = R_{r,λ}(x, x) ≤ c_r λ^{−1/4r} ||R_{r,λ}(x, ·)||_{r,λ} for λ ∈ (0, 1]. Upon dividing both sides by ||R_{r,λ}(x, ·)||_{r,λ} we obtain sup_{x∈[0,1]} ||R_{r,λ}(x, ·)||_{r,λ} ≤ c_r λ^{−1/4r}. (8) We will make frequent use of this inequality in our proofs below.

First, we give two lemmas that will be used to prove Theorem 1. The first lemma concerns the minimization of convex lower semicontinuous functionals in a Hilbert space and is proven in (Kalogridis, 2020).

Lemma 1. Let H denote a real Hilbert space of functions endowed with inner product ⟨·,·⟩_H and associated norm ||·||_H, and let L : H → R_+ denote a convex lower semicontinuous functional. If inf_{||f||_H = 1} L(f) > L(0), then there exists a minimizer of L in the unit ball {f ∈ H : ||f||_H ≤ 1}.

The second lemma concerns the approximation of an integral of a smooth W^{r,2}([0, 1])-function by a sum.

Lemma 2. Under assumption (A6) there exists a constant c_0 such that, for all f ∈ W^{r,2}([0, 1]) and m ≥ 2, |m^{−1} Σ_{j=1}^m f(T_j) − ∫_0^1 f(t) dt| ≤ c_0 m^{−1} ∫_0^1 |f^{(1)}(t)| dt.

Proof. We first claim the existence of a universal constant c such that |∫_0^1 f dQ_m − ∫_0^1 f(t) dt| ≤ (c/m) ∫_0^1 |f^{(1)}(t)| dt for all f ∈ W^{1,1}([0, 1]), where Q_m denotes the distribution function that jumps m^{−1} at each design point T_j. Integration by parts then shows that the left-hand side is bounded by sup_t |Q_m(t) − t| ∫_0^1 |f^{(1)}(t)| dt ≤ (c/m) ∫_0^1 |f^{(1)}(t)| dt, by assumption (A6). The claim thus holds. The proof is completed with an appeal to (Eggermont and LaRiccia, 2009, Chapter 13, Lemma 2.27).

For notational convenience, in the proof of Theorem 1 we denote positive constants by c_0. Note that this constant may change value from appearance to appearance.

Proof of Theorem 1. Let L_n(f) denote the objective function, that is, L_n(f) = (nm)^{−1} Σ_{i=1}^n Σ_{j=1}^m ρ(Y_ij − f(T_j)) + λ||f^{(r)}||_2². Let NS^{2r}(T_1, ..., T_m) denote the m-dimensional natural spline subspace with knots at T_1, ..., T_m and let Q : W^{r,2}([0, 1]) → NS^{2r}(T_1, ..., T_m) denote the operator associated with rth order spline interpolation at T_j, j = 1, ..., m. We use R(µ) := µ − Q(µ) to denote the error of approximation and we write R_j := R(µ)(T_j), j = 1, ..., m, in what follows. Putting g := Q(µ) − f, it is easy to see that minimizing L_n(f) is equivalent to minimizing the corresponding functional of g. Denoting C_n := n^{−1} + m^{−2r} + λ, we aim to show that for every ε > 0 there exists a D ≥ 1 (in the remainder we will drop the dependence on ε to avoid heavy notation, i.e., we simply write D) such that

Pr( inf_{||g||_{r,λ} = D} L_n(C_n^{1/2} g) > L_n(0) ) ≥ 1 − ε. (9)

By Lemma 1 this entails the existence of a minimizer ĝ_n such that ||ĝ_n||²_{r,λ} = O_P(C_n), provided that we can establish the convexity and lower semicontinuity of L_n on W^{r,2}([0, 1]), which we now check.
Convexity follows easily from assumption (A1) and the convexity of the map as the composition of a convex function f → ||f (r) || 2 and an increasing convex function on [0, ∞), namely h(x) = x 2 . To check lower semi-continuity recall that by assumption (A1) ρ is convex and hence continuous, thus also lower semicontinuous. In addition, the seminorm ||f (r) || 2 2 is continuous with respect to convergence in W r,2 ([0, 1]). Since the sum of two continuous functions is continuous, we have thus proved the (lower semi-)continuity of the objective function. Using the one-to-one relation between g and f if (9) holds we obtain Now, note that which implies that if (9) holds, then the claim follows from the properties of spline interpolation. In particular, it is well-known (see, e.g., DeVore and Lorentz, 1993, Theorem 7.3) that for some positive constant c 0 > 0. Here, the last bound follows from assumption (A6) and the fact that µ ∈ W r,2 ([0, 1]). On the other hand, Theorem 3.4 of (Schultz, 1970) implies that the spline interpolant can be chosen such that for some c 0 > 0. From (12) and (13) we have since, by definition of C n , m −2r + λ < C n . Thus, if (9) holds, (11) and (14) imply that which is the desired result. We thus have to establish (9). To that end, decompose L n (C 1/2 n g) − L n (0) as follows After adding and subtracting ψ( ij ) and E{ψ( ij )} = 0 from the second integrand, we may equivalently write L n (C 1/2 n g) − L n (0) := I 1 (g) + I 2 (g) + I 3 (g) + I 4 (g), By the superadditivity of the infimum we have the lower bound inf ||g|| r,λ =D We will show that for D sufficiently large inf ||g|| r,λ =D I 1 (g) is positive and dominates all other terms in the decomposition. For this, we need to determine the order of each one of the four terms. Starting with inf ||g|| r,λ =D I 4 (g), since ||Q(µ) (r) || 2 ≤ c 0 ||µ (r) || 2 , the Schwarz inequality yields | inf ||g|| r,λ =D where we have used that µ ∈ W r,2 ([0, 1]), λ 1/2 < C 1/2 n and λ 1/2 ||g (r) || 2 ≤ ||g|| r,λ , by definition of || · || r,λ . Turning to I 3 (g), observe that by assumption (A2) the errors ij = i (T j ) are independent in i and identically distributed for a common j. With the Schwarz inequality we obtain For the first factor we clearly have by assumption (A5). Using Markov's inequality we thus find To bound the second factor, we note that as a consequence of Lemma 2 since, by assumption, lim inf n→∞ mλ 1/2r > 0. Combining these two bounds we obtain sup ||g|| r,λ ≤D as n 1/2 < C 1/2 n . Before examining the terms I 1 (g) and I 2 (g) note that, by the reproducing property and inequality (8), where the second inequality follows from (14). Our limit conditions now ensure that We now consider I 1 (g). Applying Fubini's theorem and assumption (A5) yields for all large n. Consequently, say. We establish a lower bound for I 11 (g) and an upper bound for I 12 (g). From assumption (A5) and Lemma 2 we obtain and, by the Schwarz inequality, Since ||g|| r,λ ≤ D, the first factor in curly braces may be bounded by c 0 D using a previous argument. For the second factor, by Lemma 2, (14) and our assumption lim inf mλ 1/2r > 0, We may thus conclude that sup ||g|| r,λ ≤D for all large n. Combining (17) and (18) we find that, for all g ∈ {f ∈ W r,2 ([0, 1]) : ||f || r,λ ≤ D}, since, by assumption, lim inf n→∞ mλ 1/2r ≥ 2c r . Consequently, inf ||g|| r,λ =D To conclude the proof we now show that for every > 0, which in combination with (15), (16) and (19) establishes (9) and thus yields the result stated in the theorem. 
To accomplish this we introduce some notation. By assumption, the tuples ( i1 , . . . , im ), i = 1, . . . , n, are independent and identically distributed according to P. Moreover, the design points are common across i. For g ∈ {f ∈ W r,2 ([0, 1]) : ||f || r,λ ≤ D} := B D consider the class of real-valued functions f g : R m → R given by This class of functions depends on n via m and C n but we suppress this dependence for notational convenience. Letting P n denote the empirical measure associated with ( i1 , . . . , im ), i = 1, . . . , n, we may rewrite I 2 (g) as where, as in (van der Vaart, 1998) , P n f g and Pf g stand for expectations of f g with respect to P n and P respectively. By Markov's inequality we have Pr sup and thus it suffices to show that the right-hand side of (20) tends to zero as n → ∞. Let N [ ] ( , F, L 2 (P)) denote the -bracketing number for a set of functions F in the L 2 (P)norm and let N ( , F, || · || ∞ ) denote the -covering number in the sup-norm. Furthermore, let J [ ] (δ, F, L 2 (P) denote the -bracketing integral, that is, By Lemma 19.36 of (van der Vaart, 1998), for any class of real-valued functions F such that E{|f | 2 } < δ 2 and ||f || ∞ ≤ M for all f ∈ F we have E{| sup f ∈F n 1/2 |(P n − P)f |} ≤ c 0 J [ ] (δ, F, L 2 (P)) 1 + J [ ] (δ, F, L 2 (P)) δ 2 n 1/2 M . First we determine a bound on the bracketing number and then we estimate δ and M . By our previous discussion, max j≤m |R j | → 0 and C 1/2 n sup g∈B D ||g|| ∞ → 0 as n → ∞. Hence, for any (g 1 , g 2 ) ∈ B D × B D we obtain, by assumption (A3), By Theorem 2.7.11 of (van der Vaart and Wellner, 1996) , we now have We develop a bound for this last covering number. By our assumptions λ ∈ (0, 1] for all large n and therefore for all f ∈ W r,2 ([0, 1]), ||f || r,λ ≥ λ 1/2 ||f || r,1 , with || · || r,1 the standard Sobolev norm. Hence, By Proposition 6 of (Cucker and Smale, 2001) , we deduce that for all > 0, We proceed to estimate δ and M , as required by the Lemma. For the former observe that by the Schwarz inequality and assumption (A4) we have ≤ c 0 C 3/2 n λ −1/4r , where the last inequality follows by (8). Since this estimate is uniform on B D , we may take δ = c 0 C 3/4 n λ −1/8r . In addition, using assumption (A3) and Lemma 2 we obtain ≤ c 0 C 1/2 n so that M = c 0 C 1/2 n and consequently M/δ 2 ≤ c 0 λ 1/4r /C n . With our estimate of δ and the bound on the bracketing number implied by (23)-(24) the bracketing integral in (22) over n 1/2 C n may be bounded as follows J [ ] (δ, {f g , g ∈ B D }, L 2 (P))/(n 1/2 C n ) ≤ c 0 (n 1/2 C 1/8r+1/4 n λ 3/8r−1/16r 2 ) −1 . Now, (n 1/2 C 1/8r+1/4 n λ 3/8r−1/16r 2 ) 2 ≥ nλ 1/2+1/r−1/8r 2 → ∞, by our limit assumptions. We have thus shown that J [ ] (δ, {f g , g ∈ B D }, L 2 (P))/(n 1/2 C n ) → 0 as n → ∞. Furthermore, λ → 0 as n → ∞, which implies as n → ∞, whence we may deduce that the right-hand side of (21) tends to zero. Consequently, (20) holds and the proof is complete. for all s ≤ r and f ∈ W r,2 ([0, 1]) with the constant c s,r depending only on s and r. Since W r,2 ([0, 1]) is a vector space, Theorem 1 now implies that for any 1 ≤ s ≤ r λ s/r || µ (s) n − µ (s) || 2 2 ≤ c s,r || µ n − µ|| 2 r,λ = O P n −1 + m −2r + λ . The result follows by the positivity of λ. To prove Theorem 2 we use the following uniform law of large numbers on Sobolev spaces. Lemma 3. If assumption (B6) holds, λ → 0 and nλ (3−1/2r)/(2r) → ∞ as n → ∞, then for every fixed D > 0 and δ > 0 the following uniform law holds Proof. 
Proof of Lemma 3. First observe that the set $B_D := \{g \in W^{r,2}([0,1]) : \|g\|_{r,\lambda} \le D\}$ is compact in the $\|\cdot\|_\infty$-topology and, from the proof of Theorem 1, its entropy satisfies $\log N(\delta, B_D, \|\cdot\|_\infty) \le c_0\lambda^{-1/(2r)}\delta^{-1/r}$. Furthermore, by assumption (B6), $E\{|g(T_{ij})|^2\} = \|g\|_2^2$ and, by (8), $\sup_{g \in B_D}\|g\|_\infty^2 \le c_0\lambda^{-1/(2r)}$, where $c_0$ depends on $D$.

For $i = 1, \ldots, n$, define independent stochastic processes $U_{i,g^2}$, $g \in B_D$, given by the centered row averages of $g^2$. Fix $g_0 \in B_D$ and observe the resulting decomposition (25). We first deal with the first term on the right-hand side of (25) and show that it converges to zero in probability. Note that each $U_{i,g_0^2}$ is uniformly bounded for $g_0 \in B_D$, as $\max_{1 \le i \le n}|U_{i,g_0^2}| \le c_0\lambda^{-1/(2r)}$ for some $c_0$ not depending on $n$. For every $\delta > 0$, Hoeffding's inequality (see, e.g., van de Geer, 2000) now yields
$$\Pr\left(\Big|\frac{1}{n}\sum_{i=1}^n U_{i,g_0^2}\Big| \ge \delta\right) \;\le\; 2\exp\big(-c\,\delta^2\, n\,\lambda^{1/r}\big).$$
Our limit assumptions imply $n\lambda^{1/r} \to \infty$ as $n \to \infty$, and therefore the exponential tends to zero.

To treat the second term in (25), we first note that $\max_{1 \le i \le n}|U_{i,g^2} - U_{i,\tilde{g}^2}| \le 2\|g^2 - \tilde{g}^2\|_\infty$ for $g, \tilde{g} \in B_D$, and that there exists a $c_0 > 0$ such that $\|g^2 - \tilde{g}^2\|_\infty \le c_0\lambda^{-1/(4r)}\|g - \tilde{g}\|_\infty$. This shows that $N(\delta, \{g^2 : g \in B_D\}, \|\cdot\|_\infty) \le N(c_0\,\delta\lambda^{1/(4r)}, B_D, \|\cdot\|_\infty)$. From our bound on this last covering number it now follows that the entropy of $\{g^2 : g \in B_D\}$ satisfies the required growth condition. By our limit assumptions, $n\lambda^{3/(2r)-1/(4r^2)} \to \infty$. Therefore, applying Lemma 8.5 of van de Geer (2000) (with $d_i(g^2, g_0^2) = \|g^2 - g_0^2\|_\infty$) yields the desired uniform bound for some $c_0 > 0$, which completes the proof.

Proof of Theorem 2. Let $L_n(f)$ denote the objective function and write $C_n = (nm\lambda^{1/(2r)})^{-1} + 1/n + \lambda$. It will be shown that for every $\epsilon > 0$ there exists a $D \ge 1$ such that
$$\lim_{n \to \infty}\Pr\Big(\inf_{\|g\|_{r,\lambda} = D} L_n(\mu + C_n^{1/2}g) > L_n(\mu)\Big) \;\ge\; 1 - \epsilon. \qquad (26)$$
As before, this then implies the existence of a minimizer in the ball $\{f \in W^{r,2}([0,1]) : \|f - \mu\|_{r,\lambda}^2 \le D^2 C_n\}$ with high probability. Decomposing $L_n(\mu + C_n^{1/2}g) - L_n(\mu)$ into a loss-difference term, a linear term involving $g(T_{ij})\psi(\epsilon_{ij})$, the penalty contribution $\lambda C_n\|g^{(r)}\|_2^2$ and the cross term $2\lambda C_n^{1/2}\langle\mu^{(r)}, g^{(r)}\rangle_2$, we write
$$L_n(\mu + C_n^{1/2}g) - L_n(\mu) := I_1(g) + I_2(g) + I_3(g),$$
with $I_3(g) := 2\lambda C_n^{1/2}\langle\mu^{(r)}, g^{(r)}\rangle_2$. Letting $T = \{T_{ij}\}_{i,j}$, we proceed to establish the bounds (27)–(30). These bounds are sufficient for (26) to hold, since by virtue of (27) $\inf_{\|g\|_{r,\lambda} = D} E\{I_1(g)|T\}$ will be positive and dominate all other terms for sufficiently large $D$. The bound (30) can be readily verified with the argument employed in the proof of Theorem 1, so it remains to establish (27)–(29).

Starting with (29), note that with the reproducing property we may write the summands of $I_2(g)$ as inner products. Consequently, by the Schwarz inequality and assumptions (B3), (B5) and (B6), the problem reduces to bounding two terms, $Z_1$ and $Z_2$, say. For $Z_1$ we immediately find the required order by assumption (B5) and (8). To bound $E\{Z_2\}$ note that, by assumptions (B3) and (B6), it may be expressed through the kernel $R_{r,\lambda}$. Now, let $T_{R_{r,\lambda}}$ denote the integral operator that maps $L_2([0,1])$ into itself with $R_{r,\lambda}$ as its kernel, viz.,
$$T_{R_{r,\lambda}}f(\cdot) = \int_0^1 R_{r,\lambda}(\cdot, t)\,f(t)\,dt.$$
By Proposition 1, $T_{R_{r,\lambda}}$ is positive semidefinite and compact with largest eigenvalue equal to 1. Combining these observations with assumption (B5) we find the required bound on $E\{Z_2\}$; see, e.g., Hsing and Eubank (2015, Theorem 4.2.5) for the second inequality. Applying Markov's inequality now yields $\sup_{\|g\|_{r,\lambda} \le D}|I_2(g)| = O_P(1)\,D\,C_n$, as asserted.

Turning to $I_1(g)$, a similar derivation using the reproducing property as in the proof of Theorem 1 shows that $\lim_{n \to \infty} C_n^{1/2}\max_{1 \le i \le n}\max_{1 \le j \le m_i}|g(T_{ij})| = 0$ under our limit conditions. Applying Fubini's theorem and assumption (B5) yields a lower bound on $E\{I_1(g)|T\}$ that is valid for all sufficiently large $n$ with arbitrarily high probability. Our limit conditions imply those of Lemma 3 and therefore (32) holds, where the $o_P(1)$ term is uniform in $g \in B_D$.
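To see how the uniform law enters at this point, the following is a schematic version of the step (our own paraphrase; the norm convention $\|g\|_{r,\lambda}^2 = \|g\|_2^2 + \lambda\|g^{(r)}\|_2^2$ is an assumption consistent with the bounds used above, and $c_0$ is generic):
$$E\{I_1(g)\,|\,T\} \;\ge\; c_0\,C_n\left\{\frac{1}{n}\sum_{i=1}^n\frac{1}{m_i}\sum_{j=1}^{m_i} g^2(T_{ij}) + \lambda\,\|g^{(r)}\|_2^2\right\} \;=\; c_0\,C_n\big\{\|g\|_{r,\lambda}^2 + o_P(1)\big\},$$
uniformly over $B_D$ by Lemma 3. On the sphere $\{\|g\|_{r,\lambda} = D\}$ the right-hand side then reduces to $c_0\,D^2 C_n\{1 + o_P(1)\}$, which is the content of the next display.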
From (32) it follows that
$$\inf_{\|g\|_{r,\lambda} = D} E\{I_1(g)\,|\,T\} \;\ge\; c_0\,D^2\,C_n\{1 + o_P(1)\},$$
so we have established (27). To complete the proof we need to show (28), for which we use an empirical-process result given by Theorem 8.13 of van de Geer (2000). Let $(\Omega, \mathcal{A}, \Pr)$ denote the underlying probability space and write $T_i = (T_{i1}, \ldots, T_{im_i})$ and $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{im_i})$. Let $\mathcal{F}_i = \sigma((T_1, \epsilon_1), \ldots, (T_i, \epsilon_i))$, $i = 1, \ldots$, be an increasing sequence of sigma-fields generated by the $T_i$ and the $\epsilon_i$. Clearly $\mathcal{F}_i \subset \mathcal{A}$ for all $i$, as the $T_i$ and $\epsilon_i$ are measurable. We have $I_1(g) - E\{I_1(g)|T\} = n^{-1}\sum_{i=1}^n Z_{i,g}$, with the $Z_{i,g}$ independent mean-zero random variables. The $Z_{i,g}$ are $\mathcal{F}_i$-measurable since, by assumption (B6), they are measurable functions of $(T_i, \epsilon_i)$. Furthermore, the $Z_{i,g}$ are uniformly bounded, since $C_n^{1/2}\sup_{g \in B_D}\|g\|_\infty = o(1)$ and consequently, by assumption (B3),
$$\sup_{g \in B_D}\max_{1 \le i \le n}|Z_{i,g}| \;\le\; c_0\,C_n^{1/2}\,\lambda^{-1/(4r)} \qquad (33)$$
for all large $n$. Thus, the required expectations exist and the partial-sum process $\sum_{i \le k} Z_{i,g}$, $k = 1, \ldots, n$, is a martingale.

Let $Z_g = (Z_{1,g}, \ldots, Z_{n,g})$ and write $\rho_K$ for the Bernstein "norm", that is,
$$\rho_K^2(Z_{i,g}) = 2K^2\,E\big\{e^{|Z_{i,g}|/K} - 1 - |Z_{i,g}|/K\big\}, \quad i = 1, \ldots, n,$$
for some $K > 0$. Furthermore, define the "average" squared Bernstein norm $\bar{\rho}_K^2(Z_g) = n^{-1}\sum_{i=1}^n \rho_K^2(Z_{i,g})$.

It is helpful to recall the definition of the generalized entropy with bracketing (van de Geer, 2000). For $0 < \delta \le R$ and $F \in \mathcal{A}$, this is defined as the logarithm of the smallest nonrandom value of $N$ for which there exists a collection of pairs of random vectors $Z_k^L = (Z_{1,k}^L, \ldots, Z_{n,k}^L)$ and $Z_k^U = (Z_{1,k}^U, \ldots, Z_{n,k}^U)$, with $[Z_{i,k}^L, Z_{i,k}^U]$ $\mathcal{F}_i$-measurable, $i = 1, \ldots, n$, $k = 1, \ldots, N$, such that for every $g \in B_D$ there is a $j = j(g) \in \{1, \ldots, N\}$, with $g \mapsto j(g)$ non-random, such that $Z_{i,j(g)}^L \le Z_{i,g} \le Z_{i,j(g)}^U$, $i = 1, \ldots, n$, and $\bar{\rho}_K(Z_{j(g)}^U - Z_{j(g)}^L) \le \delta$, on $\{\bar{\rho}_K(Z_g) \le R\} \cap F$. We denote this value of $N$ by $N_{B,K}(\delta, R, F)$ and its logarithm, the generalized entropy, by $H_{B,K}(\delta, R, F)$.

We take $F = \Omega$ for our purposes and develop a bound on the generalized entropy with bracketing. In particular, for specific $K$ and $R$ we show that
$$H_{B,K}(\delta, R, \Omega) \;\le\; \log N\big(c_0\,\delta/C_n^{1/2},\, B_D,\, \|\cdot\|_\infty\big), \qquad (34)$$
where $c_0$ is a generic constant. To prove (34) we take $K = c_0\,C_n^{1/2}\lambda^{-1/(4r)}$, where $c_0$ is the constant given in (33), and we determine a bound on $\bar{\rho}_K(Z_g)$, that is, we determine a value of $R$. Since, by (33), $|Z_{i,g}| \le K$, Lemma 5.8 of van de Geer (2000) implies $\rho_K^2(Z_{i,g}) \le c_0\sup_{g \in B_D}\max_{1 \le i \le n} E\{|Z_{i,g}|^2\}$. Now, applying the Schwarz inequality twice along with assumption (B4) and iterating expectations using assumption (B6), we obtain
$$\sup_{g \in B_D}\max_{1 \le i \le n} E\{|Z_{i,g}|^2\} \;\le\; c_0\,\lambda^{-1/(4r)}\,C_n^{3/2}.$$
Thus, we may take $R = c_0\,\lambda^{-1/(8r)}\,C_n^{3/4}$ in the definition of the generalized entropy with bracketing. The above also reveals that $\{\bar{\rho}_K(Z_g) \le R\} = \Omega$.

For this $R$, let us now establish (34). Fix $\delta > 0$ and let $g_1, \ldots, g_N$ denote a $\delta$-net covering $B_D$ in the supremum norm. Note that for every $g, \tilde{g} \in B_D$ we have $\max_{1 \le i \le n}|Z_{i,g} - Z_{i,\tilde{g}}| \le c_0\,C_n^{1/2}\,\|g - \tilde{g}\|_\infty$. It is then easy to see that for each $g \in B_D$ there exists a non-random $g_k$, $k \in \{1, \ldots, N\}$, such that
$$Z_{i,g_k} - \delta c_0 C_n^{1/2} \;\le\; Z_{i,g} \;\le\; Z_{i,g_k} + \delta c_0 C_n^{1/2}, \quad i = 1, \ldots, n.$$
Take $Z_{i,k}^L = Z_{i,g_k} - \delta c_0 C_n^{1/2}$ and $Z_{i,k}^U = Z_{i,g_k} + \delta c_0 C_n^{1/2}$. Since $\lambda^{-1/(4r)} \ge 1$ for all large $n$, applying Lemma 5.8 of van de Geer (2000) leads to
$$\rho_K^2(Z_{i,k}^U - Z_{i,k}^L) = \rho_K^2(2\delta c_0 C_n^{1/2}) \;\le\; c_0\,\delta^2\,C_n.$$
The constant $c_0$ does not depend on $i$, so after averaging we still have $\bar{\rho}_K^2(Z_k^U - Z_k^L) \le c_0\,\delta^2\,C_n$, where $Z_k^U = (Z_{1,k}^U, \ldots, Z_{n,k}^U)$ and likewise $Z_k^L = (Z_{1,k}^L, \ldots, Z_{n,k}^L)$.
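As an aside on the two appeals to Lemma 5.8 above, the following elementary computation (ours, with constants not necessarily matching the cited lemma) shows why the Bernstein norm of a bounded variable is controlled by its second moment: since
$$e^x - 1 - x = \sum_{k \ge 2}\frac{x^k}{k!} \le \frac{x^2}{2}\,e^x \quad (x \ge 0),$$
any random variable with $|Z| \le K$ satisfies
$$\rho_K^2(Z) = 2K^2\,E\Big\{e^{|Z|/K} - 1 - \frac{|Z|}{K}\Big\} \;\le\; e\,E\{|Z|^2\}.$$
In particular, with $K = c_0 C_n^{1/2}\lambda^{-1/(4r)}$, the second-moment bound $E\{|Z_{i,g}|^2\} \le c_0\lambda^{-1/(4r)}C_n^{3/2}$ immediately yields the stated choice $R = c_0\lambda^{-1/(8r)}C_n^{3/4}$, up to constants.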
We have thus shown that a cover of $B_D$ with radii at most $\delta/(c_0 C_n^{1/2})$ provides a set of $\delta$-brackets according to the definition of the generalized entropy. In other words, $H_{B,K}(\delta, R, \Omega) \le \log N(c_0\,\delta/C_n^{1/2}, B_D, \|\cdot\|_\infty)$ for some $c_0$, as was to be shown.

With these values of $K$, $R$ and the bound on $H_{B,K}(\delta, R, \Omega)$ developed above, we now check the conditions of Theorem 8.13 in van de Geer (2000). First, $n^{1/2}C_n = o(n^{1/2}R^2/K)$ as $n \to \infty$, which holds by virtue of $\lambda \to 0$ and $n\lambda^{1/(2r)} \to \infty$. Furthermore, by (34), the bracketing integral may be bounded by the corresponding entropy integral of $B_D$. Evaluating the latter integral we find
$$\int_0^{c_0 C_n^{3/4}\lambda^{-1/(8r)}}\big(H_{B,K}(u,\, c_0 C_n^{3/4}\lambda^{-1/(8r)},\, \Omega)\big)^{1/2}\,du \;\le\; c_0\, C_n^{1/(4r)}\lambda^{-1/(4r)}\cdot C_n^{3/4}\lambda^{-1/(8r)}\cdot C_n^{-3/(8r)}\lambda^{1/(16r^2)} \;=\; c_0\, C_n^{3/4 - 1/(8r)}\,\lambda^{-3/(8r)+1/(16r^2)}.$$
Consequently,
$$\int_0^{c_0 C_n^{3/4}\lambda^{-1/(8r)}}\big(H_{B,K}(u,\, c_0 C_n^{3/4}\lambda^{-1/(8r)},\, \Omega)\big)^{1/2}\,du\,\big/\,(n^{1/2}C_n) \;\le\; c_0\, n^{-1/2}\, C_n^{-1/4 - 1/(8r)}\,\lambda^{-3/(8r)+1/(16r^2)}.$$
The definition of $C_n$ now yields
$$\big(n^{1/2}\,C_n^{1/4+1/(8r)}\,\lambda^{3/(8r)-1/(16r^2)}\big)^2 \;\ge\; n\,\lambda^{1/2+1/r-1/(8r^2)} \to \infty,$$
by our limit assumptions. The conditions of the theorem are thus satisfied, which implies that
$$\Pr\Big(n^{1/2}\sup_{g \in B_D}|I_1(g) - E\{I_1(g)|T\}| \ge \epsilon\, n^{1/2}C_n\Big) \;\le\; c_0\exp\big(-c_0\,\epsilon^2\, n\,C_n^{1/2}\,\lambda^{1/(4r)}\big)$$
for every $\epsilon > 0$. The exponential tends to zero, as our limit assumptions imply $nC_n^{1/2}\lambda^{1/(4r)} \to \infty$ as $n \to \infty$. Hence (28) holds and the result follows.

References
Sobolev Spaces
Asymptotic distribution of regression M-estimators
Limiting behavior of M-estimators of regression coefficients in high dimensional linear models I. Scale-dependent case
Optimal estimation of the mean function based on discretely sampled functional data: Phase transition
Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm
The spatial distribution in infinite dimensional spaces and related quantiles and depths
Asymptotics for M-type smoothing splines
Smoothing splines estimators for functional linear regression
On the mathematical foundations of learning
Impartial trimmed means for functional data
M-type smoothing splines with auxiliary scale estimation
A Practical Guide to Splines
Asymptotics for the nonparametric estimation of the mean function of a random process
Simultaneous confidence bands for nonparametric regression with functional data
Constructive Approximation
Maximum Penalized Likelihood Estimation
Nonparametric Regression and Spline Smoothing
Nonparametric Functional Data Analysis: Theory and Practice
CVXR: An R package for disciplined convex optimization
Robust functional estimation using the median and spherical principal components
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Inference for Functional Data With Applications
Theoretical Foundations of Functional Data Analysis
Robust smoothing
Robust Statistics
Asymptotics for M-type smoothing splines with non-smooth objective functions
Introduction to Functional Data Analysis
Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data
Nonconcave penalized M-estimation with a diverging number of parameters
M-based simultaneous inference for the mean function of functional data
Estimating derivatives for samples of sparsely observed functions, with application to online auction dynamics
Robust ridge regression for high-dimensional data
Robust functional linear regression based on splines
Robust Statistics: Theory and Methods
A nonparametric regression approach to syringe grading for quality improvement
Dispersion operators and resistant second-order functional data analysis
R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing
When the data are functions
Some tools for functional data analysis
Functional Data Analysis
Estimating the mean and covariance structure nonparametrically when the data are curves
Error bounds for polynomial spline interpolation
M-estimators of location for functional data
Asymptotic properties of penalized splines for functional data
Spline Models for Observational Data
KernSmooth: Functions for kernel smoothing for Wand & Jones
Conditional growth charts
Generalized Additive Models: An Introduction with R
M-estimation of linear models with dependent errors
Empirical Processes in M-Estimation
Weak Convergence and Empirical Processes
Asymptotic Statistics
Functional data analysis for sparse longitudinal data