Projected Statistical Methods for Distributional Data on the Real Line with the Wasserstein Metric
Matteo Pegoraro and Mario Beraha
January 22, 2021

We present a novel class of projected methods to perform statistical analysis on a data set of probability distributions on the real line with the 2-Wasserstein metric. We focus in particular on Principal Component Analysis (PCA) and regression. To define these models, we exploit a representation of the Wasserstein space closely related to its weak Riemannian structure, by mapping the data to a suitable linear space and using a metric projection operator to constrain the results in the Wasserstein space. By carefully choosing the tangent point, we are able to derive fast empirical methods, exploiting a constrained B-spline approximation. As a byproduct of our approach, we are also able to derive faster routines for previous work on PCA for distributions. By means of simulation studies, we compare our approaches to previously proposed methods, showing that our projected PCA has similar performance for a fraction of the computational cost and that the projected regression is extremely flexible even under misspecification. Several theoretical properties of the models are investigated and asymptotic consistency is proven. Two real-world applications, to Covid-19 mortality in the US and to wind speed forecasting, are discussed.

In many fields of machine learning and statistics, performing inference on a set of distributions is a ubiquitous but arduous task. The Wasserstein distance provides a powerful tool to compare distributions, as it requires very few assumptions on them and is at the same time reasonably easy to compute numerically. In fact, many other distances for distributions either require the existence of a probability density function or are impossible to evaluate, cf. Cuturi (2013), Peyré et al. (2019), Panaretos and Zemel (2020). The Wasserstein distance recently gained popularity both in the statistics and in the machine learning community. See for instance Bassetti et al. (2006), Bernton et al. (2019a), Catalano et al. (2021) for statistical properties of the Wasserstein distance, Cao et al. (2019) and Cuturi and Doucet (2014) for applications in the field of machine and deep learning, and Bernton et al. (2019b) and Srivastava et al. (2015) for applications in Bayesian computation. In this work, we focus on the situation in which the single observation itself can be seen as a distribution, as in the analysis of images (Cuturi and Doucet, 2014; Banerjee et al., 2015), census data (Cazelles et al., 2018), econometric surveys (Potter et al., 2017) and process monitoring (Hron et al., 2014). In particular, we consider observations to be distributions on the real line. There exist several possible ways to represent distributions, such as histograms, probability density functions (pdfs) and cumulative distribution functions (cdfs), each characterized by different constraints. For instance, histograms sum to one, pdfs integrate to one, and the limits of cdfs are 0 and 1; moreover, all of these functions are nonnegative. These constraints translate into complex geometrical structures that characterize the underlying spaces these objects live in.
One of the first works defining PCA for a data set of distributions is Kneip and Utikal (2001), where the authors apply tools from functional data analysis (FDA) directly to a collection of probability density functions. This approach, however, completely ignores the constrained nature of probability density functions, leading to poor interpretability of the results. Based on the theoretical results in Egozcue et al. (2006), who define a Hilbert structure on a space of probability density functions on a compact interval (called a Bayes space), Delicado (2011) and Hron et al. (2014) propose a more reasonable approach to the problem of PCA for density functions. In particular, in Hron et al. (2014) the authors use the geometric properties of the Bayes space, coupled with a suitable transformation from the Bayes space to an L_2 space, to perform PCA on a set of pdfs using FDA tools, and then map the results back to the Bayes space. Another, perhaps less widely used, approach focuses on borrowing tools from symbolic data analysis (SDA) in the context of histogram data (Nagabhushan and Pradeep Kumar, 2007; Rodríguez et al., 2000; Le-Rademacher and Billard, 2017). Moreover, in Verde et al. (2015) some of these attempts are extended to generic distributional data using Wasserstein metrics. Finally, Bigot et al. (2017) and Cazelles et al. (2018) propose two PCA formulations based on the geometric structure of the Wasserstein space: a geodesic PCA and a log PCA. In a similar fashion, the recent preprints of Chen et al. (2021) and Zhang et al. (2020) propose linear regression and autoregressive models, respectively, for distributional data using the Wasserstein geometry.

We now highlight some key aspects of the aforementioned approaches. Hron et al. (2014) assume that all the probability measures have the same support. This is hardly verified in practice, so that to apply their techniques one needs either to truncate the support of some of the probability density functions or to extend others (for instance, by adding a small constant value and renormalizing), leading to numerical instability as discussed in Sections 7 and 8. The SDA-based methods in Nagabhushan and Pradeep Kumar (2007); Rodríguez et al. (2000); Le-Rademacher and Billard (2017) and Verde et al. (2015) share the poor interpretability of SDA. The methods in Bigot et al. (2017), Cazelles et al. (2018), Chen et al. (2021) and Zhang et al. (2020) are based on the weak Riemannian structure of the Wasserstein space, cf. Section 2.2. Such a structure enables the authors to borrow ideas and terminology from statistical frameworks defined on Riemannian manifolds (see Bhattacharya et al., 2012; Pennec, 2006, 2008; Huckemann et al., 2010; Patrangenaru and Ellingson, 2015; Fletcher, 2013; Banerjee et al., 2015). We can roughly distinguish these frameworks into two main approaches: the intrinsic/geodesic one and the extrinsic/log one. Briefly, intrinsic methods are defined using the metric structure of the Wasserstein space, working with geodesic curves and geodesic subsets, so that they faithfully respect the metric of the underlying space. However, in general, intrinsic methods present many practical difficulties, in that the optimization problems they lead to are usually nontrivial, as we discuss in Section 5.3. Instances of intrinsic methods for distributional data are the geodesic PCA in Bigot et al. (2017) and, under some rather restrictive assumptions, the linear models in Chen et al.
(2021) and the autoregressive models in Zhang et al. (2020), see Sections 3.3 and 3.4. On the other hand, extrinsic methods resort to the linear structure of suitably defined tangent spaces, by mapping data from the Wasserstein space to the tangent space (through the so-called log map) and then mapping the results back to the Wasserstein space (through the exp map). Of course, this approach is less respectful of the underlying geometry than the intrinsic one, but it usually presents several numerical advantages. An example of such extrinsic methods defined in the Wasserstein space is the log PCA in Cazelles et al. (2018). The main issue with this log PCA is that the image of the log map inside the tangent space of the Wasserstein space is not a linear space, but rather a convex cone embedded in a linear space (see Section 2.2). Hence, while exploiting the linear structure of the tangent space, it is possible that the projections of some points onto the principal components end up outside of the cone. For these points, the exp map from the tangent space to the Wasserstein space used in Cazelles et al. (2018) is not a metric projection (which, in general, is not available), so that the results in this setting are hardly interpretable.

The contribution of this work is threefold. First, we propose alternative PCA and regression models for distributional data in the Wasserstein space. We term these models projected, in opposition to the log PCA in Cazelles et al. (2018). Second, by exploiting a geometric characterization of the Wasserstein space closely related to its weak Riemannian structure, we build a novel approximation of the Wasserstein space using monotone B-splines. This allows us to represent the space of probability measures as a convex polytope in R^J. Lastly, we obtain faster optimization routines for the geodesic PCAs defined in Bigot et al. (2017), exploiting the aforementioned B-spline representation. Our projected framework lies in between the log one and the geodesic one: we use an analogue of the log map to transform our data, as for extrinsic methods, but we do not resort to the exp map to return to the Wasserstein space, using instead the metric projection operator. Thanks to this, our projected methods are more respectful of the underlying geometry than the log ones, while at the same time retaining the same reduced computational complexity. Thus, the projected methods expand the range of situations where extrinsic methods are an effective and efficient alternative to intrinsic tools: in our examples, the performance loss is in general marginal (see Section 7). By centering the analysis at appropriate points of the Wasserstein space, one can identify the space of probability measures (with finite second moment) with the space of square integrable monotonically non-decreasing functions on a compact set. We use a suitable quadratic B-spline expansion to get a very handy representation of such functions. Through such a B-spline expansion, it is possible to approximate the metric projection onto the Wasserstein space as a constrained quadratic optimization problem over a convex polytope, which is a well-established problem, cf. Potra and Wright (2000). This allows us to exploit the underlying linear structure of an L_2 space, so that all the machinery developed for functional data analysis can be directly applied to this setting.
We address the issue of interpretability of the results, tackling a number of diverse applications and developing different ways to measure the loss of information caused by the extrinsic nature of our methods. We observe that the idea of representing non-decreasing functions through B-splines for statistical purposes has also been proposed by Das and Ghosal (2017), in the context of Bayesian quantile regression, where the authors use B-splines with (random) monotonic coefficients as a generative model for random quantile functions. However, their focus is on defining a generative model, and not on developing a statistical setting exploiting the geometry given by the constrained representation. Accordingly, they do not restrict their attention to quadratic splines and also consider cubic ones. As already mentioned, a further contribution of this work is the derivation of alternative numerical optimization schemes for the geodesic PCA in Bigot et al. (2017) and Cazelles et al. (2018), based on the proposed quadratic B-spline expansion.

The remainder of the paper is organized as follows. Section 2 covers the basic concepts of the Wasserstein distance and the weak Riemannian structure of the Wasserstein space, along with a brief discussion of a suitable way to exploit such structure for our purposes. Section 3 defines the projected PCA and projected regression in a general setting. In Section 4 we discuss the choice of the base point in which we center our analysis and how to efficiently approximate the metric projection through B-splines; in Section 5 we present the numerical algorithms needed to compute our projected methods and an alternative optimization routine for the geodesic PCA in Cazelles et al. (2018). Section 6 discusses the asymptotic properties of the spline approximation and of the projected models, establishing consistency of the estimators under some assumptions. Numerical illustrations on real and simulated data sets are shown in Sections 7 and 8. In particular, we apply our projected methods to two real world problems: we perform PCA on US data on Covid-19 mortality by age and sex, and we perform a distribution regression to forecast the wind speed near a wind farm. Finally, Section 9 concludes the article. The Appendix collects all the proofs of the theoretical results, additional details on the simplicial PCA and regression, and further simulations. Code for reproducing the numerical results is available at https://github.com/mberaha/ProjectedWasserstein.

In the following, we consider probability measures on the real line R endowed with the usual Borel σ-field; we omit references to the σ-field whenever it is clear from the context. Given a measure µ on R, define its cumulative distribution function F_µ(x) = µ((−∞, x]) for x ∈ R and the associated quantile function F_µ^-(t) = inf{x ∈ R : t ≤ F_µ(x)}. When F_µ is continuous and strictly monotonically increasing, F_µ^- = (F_µ)^{-1}. We start by recalling the definition of the 2-Wasserstein distance between two probability measures µ, ν on R:

W_2^2(µ, ν) = inf_{γ ∈ Γ(µ,ν)} ∫_{R×R} |x − y|^2 dγ(x, y),    (1)

where Γ(µ, ν) is the collection of all probability measures on R × R with marginals µ and ν. Closely related to the definition of the Wasserstein distance is that of Optimal Transport (OT). In particular, (1) identifies the Wasserstein distance with the minimal total transportation cost between µ and ν in the Kantorovich problem with quadratic cost (Ambrosio et al., 2008); a small numerical illustration is given below.
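As a concrete, self-contained illustration of the Kantorovich formulation (1) (not part of the original paper; the atoms, the weights, and the use of scipy.optimize.linprog are illustrative choices), the following Python snippet computes the squared 2-Wasserstein distance between two small discrete measures by solving the coupling problem as a linear program.

```python
import numpy as np
from scipy.optimize import linprog

# Two small discrete measures on the real line: atoms x, y with weights a, b.
x = np.array([0.0, 1.0, 2.0, 3.0])
a = np.array([0.25, 0.25, 0.25, 0.25])
y = np.array([0.5, 1.5, 2.5])
b = np.array([0.5, 0.25, 0.25])

# Quadratic cost matrix C_ij = |x_i - y_j|^2.
C = (x[:, None] - y[None, :]) ** 2

# Kantorovich problem (1): minimize <C, gamma> over couplings gamma >= 0
# with row sums equal to a and column sums equal to b.
n, m = C.shape
A_eq = np.zeros((n + m, n * m))
for i in range(n):          # row-marginal constraints
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):          # column-marginal constraints
    A_eq[n + j, j::m] = 1.0
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
print("W_2^2(mu, nu) =", res.fun)   # 0.25 for this toy instance
```

For measures on the real line, the same value can be recovered much more cheaply through quantile functions, as discussed next.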
For our purposes, it is convenient to consider another formulation of the OT problem, originally introduced in Monge (1781). Given two measures µ, ν as before, the optimal transport map from µ to ν is the solution of the problem

T_µ^ν = arg min_{T : T#µ = ν} ∫_R |T(x) − x|^2 dµ(x),    (2)

where # denotes the pushforward operator, that is,

T#µ(B) = µ(T^{-1}(B))    (3)

for any measurable set B and measurable function T. Note that any solution of (2) induces one and only one solution of (1); moreover, if the OT problem has a unique solution, then the Wasserstein distance problem also has only one solution. However, not all Wasserstein distance problems can be solved through Monge's formulation (Ambrosio et al., 2008). The unidimensional setting is a remarkable exception, in that there exist explicit formulas for both problems. In particular, the Wasserstein distance can be computed as

W_2(µ, ν) = ( ∫_0^1 |F_µ^-(t) − F_ν^-(t)|^2 dt )^{1/2}    (4)

and, if the measure µ has no atoms, then there exists a unique solution to Monge's problem, given by T_µ^ν = F_ν^- ∘ F_µ. For a proof of these results, see Chapter 6 of Ambrosio et al. (2008). It is clear that, in general, the Wasserstein distance between two probability measures can be unbounded (for instance, when F_µ^- in (4) is not square integrable on [0, 1]). Nonetheless, when restricting the focus to the set of probability measures with finite second moment, W_2 defines a metric (see, for instance, Chapter 7 of Villani, 2008). Formally, let the Wasserstein space be W_2(R) = {µ ∈ P(R) : ∫_R x^2 dµ < +∞}; then (W_2(R), W_2) is a separable and complete metric space.

Thanks to the uniqueness of the transport maps, by fixing an absolutely continuous (a.c.) probability measure µ ∈ W_2(R), we can associate to any ν ∈ W_2(R) the optimal transport map T_µ^ν. Since ∫_R |T_µ^ν(x)|^2 dµ = ∫_R x^2 dν, we can define the map ϕ_µ : W_2(R) → L_µ^2(R) with the rule ϕ_µ(ν) = T_µ^ν. We note several immediate but interesting properties of the map ϕ_µ; they are illustrated numerically in the sketch at the end of this discussion. First, it is an isometry (and so a homeomorphism onto its image), since W_2(ν, η) = ||ϕ_µ(ν) − ϕ_µ(η)||_{L_µ^2(R)}. Second, the image of ϕ_µ is a closed convex cone in L_µ^2(R): a set closed under addition and positive scalar multiplication. In fact, for any λ ≥ 0, λT_µ^ν is still a transport map from µ to another measure, namely the one whose quantile function is λF_ν^-; similarly, T_µ^ν + T_µ^η is a transport map from µ to the measure whose quantile function is F_ν^- + F_η^-. Third, ϕ_µ(µ) = id_R (where id_C denotes the identity map of the set C). Finally, as shown in Panaretos and Zemel (2020), ϕ_µ is not surjective, and ϕ_µ(W_2(R)) is the set of µ-a.e. non-decreasing functions in L_µ^2(R). The inverse of the map ϕ_µ is the pushforward of µ (see Equation (3)) and it is defined on the whole L_µ^2(R): given f ∈ L_µ^2(R), ν = f#µ is a measure in W_2(R); in fact, ∫_R x^2 d(f#µ)(x) = ∫_R |f(x)|^2 dµ(x) < +∞.

A natural way to define a tangent structure for W_2(R) is therefore to take advantage of the cone structure given by ϕ_µ; in fact, for closed convex cones there already exist notions of tangent cones. Similarly to Rockafellar and Wets (1998), Theorem 6.9, we can define

Tan_µ(W_2(R)) := cl{ f ∈ L_µ^2(R) : id + hf ∈ ϕ_µ(W_2(R)) for some h > 0 },    (5)

with the closure taken in L_µ^2(R). We remark that Theorem 6.9 in Rockafellar and Wets (1998) is stated in R^n, but it also holds more generally, for instance in a Hilbert space (see Aubin and Frankowska (2009), Chapter 4). A geometric interpretation of (5) is the following: the tangent space consists of all the vectors f that move the base point inside the cone ϕ_µ(W_2(R)) when considered up to a scale factor h; hence, f plays the role of the direction of a tangent vector going out from the tangent point. Furthermore, since f + id ∈ ϕ_µ(W_2(R)) for every f ∈ ϕ_µ(W_2(R)), we have that ϕ_µ(W_2(R)) is included in the tangent space.
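To make the map ϕ_µ and its isometry property concrete, here is a minimal Python sketch (an illustration with arbitrary toy measures, not taken from the paper): it fixes µ = N(0, 1) as the a.c. base measure, builds the transport maps T_µ^ν = F_ν^- ∘ F_µ for two target measures ν and η, checks that they are non-decreasing, and verifies numerically that their L_µ^2 distance matches the quantile formula (4) for W_2(ν, η).

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)
t = (np.arange(20000) + 0.5) / 20000

# Base a.c. measure mu = N(0, 1) (arbitrary choice); nu and eta are two target measures.
F_mu = norm.cdf
F_nu_inv = lambda u: norm.ppf(u, loc=2.0, scale=0.5)
F_eta_inv = lambda u: expon.ppf(u, scale=1.5)

# W_2(nu, eta) via the quantile formula (4).
W2_quantile = np.sqrt(np.mean((F_nu_inv(t) - F_eta_inv(t)) ** 2))

# phi_mu(nu) = T_mu^nu = F_nu^- o F_mu, and similarly for eta; both are non-decreasing.
x = np.sort(rng.normal(size=200000))        # Monte Carlo sample from mu
T_nu, T_eta = F_nu_inv(F_mu(x)), F_eta_inv(F_mu(x))
print("transport maps non-decreasing:",
      bool(np.all(np.diff(T_nu) >= -1e-12)), bool(np.all(np.diff(T_eta) >= -1e-12)))

# Isometry: W_2(nu, eta) = || phi_mu(nu) - phi_mu(eta) ||_{L^2(mu)}.
W2_isometry = np.sqrt(np.mean((T_nu - T_eta) ** 2))
print("quantile formula:", W2_quantile, " L2(mu) distance of transport maps:", W2_isometry)
```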
Returning to the tangent structure: as shown later in this Section, the inclusion is strict and the tangent space is much larger than ϕ_µ(W_2(R)). Note that we can recover the definition of tangent space given by Ambrosio et al. (2008) and Panaretos and Zemel (2020) by a simple "change of variable": calling g = id + hf and substituting f = (g − id)/h in (5) gives Tan_µ(W_2(R)) = cl{ h(g − id) : g ∈ ϕ_µ(W_2(R)), h > 0 }, which is the definition given in Ambrosio et al. (2008) and Panaretos and Zemel (2020). As shown in Panaretos and Zemel (2020), the tangent cone Tan_µ(W_2(R)) is indeed a linear space; for this reason, we refer to it as the tangent space, instead of cone. In analogy with Riemannian geometry, following Ambrosio et al. (2008) and Panaretos and Zemel (2020), we define the log_µ and exp_µ maps: having fixed µ absolutely continuous, log_µ(ν) = ϕ_µ(ν) − id = T_µ^ν − id for ν ∈ W_2(R), and exp_µ(f) = (f + id)#µ for f ∈ Tan_µ(W_2(R)). We briefly highlight some properties of these maps, which follow immediately from the discussion above.

Remark 1 The map log_µ is defined on the whole space W_2(R). Moreover, it is clearly an isometry: W_2(η, ν) = ||log_µ(η) − log_µ(ν)||_{L_µ^2(R)} (Panaretos and Zemel, 2020). This shows that there is no local-approximation issue when working in the tangent space, in contrast with the usual Riemannian manifold setting, where the tangent space usually provides a good approximation only in a neighborhood of the tangent point. The map log_µ is not surjective on Tan_µ; indeed, its image Im(log_µ) is a closed convex subset of L_µ^2(R) given by all the maps f such that f + id ∈ ϕ_µ(W_2(R)), that is, such that f + id is µ-a.e. non-decreasing. The restriction of exp_µ to Im(log_µ), henceforth denoted by exp_µ|log_µ(W_2(R)), is an isometric homeomorphism and its inverse is log_µ.

Remark 2 In particular, we observe that log_µ ∘ exp_µ is not a metric projection in L_µ^2. That is, in general, log_µ ∘ exp_µ(f) ≠ arg min_{g ∈ Im(log_µ)} ||f − g||_{L_µ^2}.

As mentioned in Section 1.1, borrowing ideas from Riemannian geometry leads to distinguishing statistical methods on the Wasserstein space into the classes of intrinsic and extrinsic methods. The weak Riemannian structure presented in Section 2.2 provides a suitable environment for developing intrinsic methods. In fact, the geodesic structure of W_2(R) can be recovered through the linear structure of any L_µ^2(R) space through the isometry ϕ_µ: pointwise interpolation of the transport maps coincides with the geodesic between measures. In other words, given µ a.c., the geodesic between ν and η is given by

t ↦ ((1 − t) T_µ^ν + t T_µ^η)#µ,   t ∈ [0, 1].    (7)

Thus, such a geodesic structure can be recovered in many different (but equivalent) ways, depending on µ. On the other hand, Remark 1 motivates the development of extrinsic tools, since working in the image of log_µ inside the tangent space Tan_µ is exactly like working in W_2(R). This is not common in the Riemannian manifold framework, where usually the tangent space provides a good approximation only near the tangent point. As a consequence, while in the general Riemannian manifold framework the choice of the tangent point µ is crucial (since the results of extrinsic methods might be significantly altered by different choices of µ), when working with W_2(R) this is not the case. To further motivate this key point, consider µ and ν a.c. measures; the maps log_ν ∘ (exp_µ|log_µ(W_2(R))) and ϕ_ν ∘ ϕ_µ^{-1} are isometric homeomorphisms (as compositions of isometries and homeomorphisms).
In other words, they preserve distances and send border elements of log_µ(W_2(R)) or ϕ_µ(W_2(R)) into border elements of log_ν(W_2(R)) and ϕ_ν(W_2(R)), respectively, and the same with internal points (so, in particular, they preserve distances from any point to the border). In Chen et al. (2021), Bigot et al. (2017) and Zhang et al. (2020), µ is chosen as the barycentric measure x̄ of the observations x_i ∈ W_2(R). The discussion above implies that, for our statistical purposes, considering the tangent space at the Wasserstein barycenter x̄ and working on log_x̄(x_i) = log_x̄(x_i) − log_x̄(x̄) is exactly the same as considering the tangent space at any a.c. µ and working on log_µ(x_i) − log_µ(x̄). So the choice of the tangent space is, from the theoretical point of view, completely arbitrary. Moreover, centering the analysis in the barycenter presents a drawback when studying asymptotic properties of the models under consideration, since x̄ changes as the sample size grows. In Section 4.1 we propose to fix µ as the uniform measure on [0, 1]. This choice not only allows us to derive empirical methods that are extremely simple to implement, cf. Section 5, but also allows us to study asymptotic properties of the models in Section 6.2 without resorting to parallel transport, as done for instance in Chen et al. (2021). Lastly, we briefly discuss the major differences between using a tangent space representation of W_2(R) and using the representation given by some ϕ_µ. We recall that, for a fixed a.c. µ, the two representations are indeed quite similar: ϕ_µ(ν) = T_µ^ν and log_µ(ν) = T_µ^ν − id. A priori, one may prefer the tangent representation because it already expresses data as vectors coming out of a point. For instance, it might then be practically more convenient to center the analysis in the barycenter and work on vectors, taking away any "data centering" issues; notational coherence with already existing methods might also benefit from this choice. However, especially when dealing with extrinsic techniques, we find it slightly more practical to use the ϕ_µ representation, in that ϕ_µ(W_2(R)) is more straightforward to represent than log_µ(W_2(R)): the former can in fact be represented directly as the cone of µ-a.e. non-decreasing functions.

In this section, exploiting the embeddings given by ϕ_µ, we define a class of projected statistical methods to perform extrinsic analysis for data in the Wasserstein space. To give a general framework, we do not restrict our attention to a particular ϕ_µ yet, even though in Section 4 we argue that a natural choice, which allows an easier implementation of the empirical methods, is letting µ be the uniform distribution on [0, 1]. Hence, for the sake of notation, we consider the generic case of data lying in a closed convex cone X inside a separable Hilbert space H. In our setting, H would be L_µ^2(R) and X = ϕ_µ(W_2(R)), for some µ ∈ W_2(R) absolutely continuous. We start by defining one of the main contributions of our work: the projected PCA. We recall that for an H-valued random variable X, PCA is a well-established technique and amounts to finding the eigenfunctions of the Karhunen-Loève expansion of the covariance operator of X, see Ramsay (2004). Observe that any X-valued random variable can be considered as an H-valued one (by the inclusion map), so that a notion of PCA is already available. When defining principal components, a key notion is that of the dimension of a principal component (PC).
In this work, principal components will be closed convex subsets of H, and we will always define the dimension of a subset of H as the dimension of the smallest affine subset of H containing it. For a generic closed convex set C ⊂ H, let Π_C denote the metric projection onto C, Π_C(x) := arg min_{c ∈ C} ||x − c||, and, for a set of vectors U, denote by Sp(U) its linear span. In what follows, we denote by x_0 the "center" of the PCA; for us, x_0 = E[X], or its empirical counterpart. To have a well-defined PCA, we always assume that x_0 belongs to the relative interior of the convex hull of the support of X, see Appendix A for the definition of relative interior and further details. This is a rather technical hypothesis, but it is not a restrictive one: for instance, it is always verified for empirical measures and when X ⊆ R^d, and hence for our empirical methods, cf. Section 5.1.

Definition 1 (Projected PCA) Given X a random variable with values in X ⊂ H, let U_k = {w_1, ..., w_k} be its first k H-principal components centered in x_0, and let U_{x_0,k}^X denote the associated k-dimensional projected principal component.

In other words, the projected principal component is obtained by approximating the span of the principal components found in H with convex subsets in X. Note that the principal components in H might "capture" some variability which is not present when measuring distances inside X. In fact, the projection of a point belonging to X onto a direction w_j might end up being outside X, see Section 3.3. However, as we show in Section 7, in our examples the projected PCA behaves well and this issue does not seem to affect the performance significantly.

Remark 3 Convex sets are essential in our analysis since, thanks to (7), convex sets in X are precisely the subsets of W_2(R) which are geodesically complete: the geodesic connecting any pair of points in the subset is contained in the subset. Geodesic subsets are a natural generalization of linear spaces.

Remark 4 The metric projection of a linear subspace onto a convex subset can end up being a nonconvex set. In addition, while losing convexity, the dimension of the metric projection of a convex subset can be larger than the dimension of the original subset. A simple example where both cases happen is the projection of the line y = −x onto the cone {x, y ≥ 0} in R^2.

We observe that inside a projected principal component we have a preferential orthonormal basis given by the principal components in H; for this reason we call U_k = {w_1, ..., w_k} principal directions. Although it might seem impractical to find the projected component, Lemma 1 (with x_0 and U_{x_0,k}^X as in Definition 1) provides a more convenient alternative characterization of U_{x_0,k}^X. Natural alternatives to Definition 1 would be, for instance, to let the projected principal directions (respectively, the projected component) be the metric projection of w_1, . . . , w_k (respectively, of the linear span of {w_1, . . . , w_k}) onto X. In the former case, the projection would not guarantee the orthogonality of the projected directions, which is instead essential to properly explore the variability. Moreover, since the "tip" of the projected unit vectors would likely lie on the border of X, the projection of a new observation on a direction would still lie outside of X as soon as the score associated to that direction is larger than 1. The latter case, instead, presents the drawbacks pointed out in Remark 4. We argue that, despite its simplicity, Definition 1 is indeed very well suited for statistical analysis in the Wasserstein space; a toy finite-dimensional illustration is sketched below.
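The following Python toy example mimics the construction behind Definition 1 in a finite-dimensional setting (all of it is illustrative and not from the paper: the data are synthetic and X is taken to be the nonnegative orthant of R^2, for which the metric projection is a simple componentwise clipping). Ambient PCA reconstructions that fall outside the cone are pulled back into X by the metric projection onto X, used here as a simplified stand-in for the projection onto the projected principal component.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data in the closed convex cone X = {(x, y): x >= 0, y >= 0} of H = R^2.
data = np.abs(rng.normal(size=(200, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]]))

x0 = data.mean(axis=0)                      # center of the PCA
cov = np.cov(data - x0, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
w1 = eigvec[:, np.argmax(eigval)]           # first ambient principal direction

# Reconstruction on the one-dimensional affine span x0 + Sp({w1}) ...
scores = (data - x0) @ w1
recon_H = x0 + np.outer(scores, w1)

# ... may leave X; the projected method maps it back with the metric projection,
# which for the nonnegative orthant is componentwise clipping at zero.
recon_X = np.maximum(recon_H, 0.0)

outside = (recon_H < 0).any(axis=1)
print(f"{outside.sum()} of {len(data)} ambient reconstructions fall outside X")
print("mean reconstruction error, ambient  :", np.linalg.norm(data - recon_H, axis=1).mean())
print("mean reconstruction error, projected:", np.linalg.norm(data - recon_X, axis=1).mean())
```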
For instance, we are guaranteed that, as the dimension grows, the k projected components provide a monotonically better fit to the data. This is easily verified because Π_X is a strictly non-expansive operator, X being closed and convex (see Deutsch (2012)), which implies the following Proposition.

Proposition 1 With the same notation as Definition 1, for any x ∈ X, the distance from x to U_{x_0,k+1}^X is no larger than the distance from x to U_{x_0,k}^X.

Once a principal component is found, a classical task one may want to perform is to project a new "observation" x* ∈ X onto U_{x_0,k}^X, for instance for dimensionality-reduction purposes. In general, the metric projection onto a generic convex subset might be arduous to find; we deal with this issue in Section 4. Nevertheless, we can use the following Proposition to reduce in advance the dimension of the parameters involved in the problem, turning it into a projection problem inside the projected principal component, which allows for faster computations (see Equation 13).

Proposition 2 Let x* ∈ X and let Π_k be the orthogonal projection on Span(U_k). The projection of x* onto U_{x_0,k}^X is obtained by first projecting x* orthogonally onto the affine span x_0 + Span(U_k), that is, computing x_0 + Π_k(x* − x_0), and then projecting the result onto U_{x_0,k}^X.

Lastly, we observe that, since projected principal components are not linear subspaces, the scores of some points on a principal direction can vary as we increase the dimension of the principal component.

Broadly speaking, a regression model between two variables with values in two different spaces is given by an operator between such spaces which, for every value of the independent variable, returns a predicted value for the dependent variable. In the following, let us denote by Z the independent variable and by Y the dependent one. A regression model is usually understood as an operator Γ specifying the conditional expected value of Y given Z, that is, E[Y|Z] = Γ(Z). If the spaces where Z and Y take values possess a linear structure, this linearity is usually exploited by means of a (kernel) linear operator, possibly with an "intercept" term. To define our projected regression model, we want to exploit the cone structure of X in a similar fashion. In fact, such linear kernel operators combine good optimization properties and interpretability, since their kernels can provide insights into the analysis, much like coefficients in multivariate linear regression. We treat separately the cases where the X-valued variable is the independent or the dependent one; the case when both variables are X-valued follows naturally. To keep the notation light, in what follows we do not distinguish between "proper" linear operators and linear operators with an added intercept term, which could be employed as well in all the upcoming definitions to gain flexibility. Consider first the case in which we have an independent X-valued random variable, and denote by V the space where the dependent variable takes values. Despite the fact that X is not a linear space, with an abuse of notation we call "linear" an operator which respects sums and positive scalar multiplication of elements in X. Such operators are in fact obtained by restricting to X linear operators defined on H. Following this idea, in order to define linear regression for an X-valued independent random variable, we consider such a variable as H-valued, obtain the regression operator, and then take the restriction of the operator to X. In this way, when H = L_µ^2(R) and X = ϕ_µ(W_2(R)), it is possible to exploit the classical FDA framework to perform all kinds of distribution-on-scalar (or vector, etc.) regression. For brevity, we report only the definition with V = R.
Definition 2 Let Z be an X-valued random variable and Y a real-valued one. Let Γ_β : H → R be a functional linear regression model for such variables, with Z considered as H-valued and Γ_β(v) = ⟨β, v⟩. A projected linear regression model for (Z, Y) is given by (Γ_β)|_X.

Now we turn to the case which features an X-valued dependent variable and a Z-valued independent one, for Z a generic Hilbert space. Through the inclusion X → H, we can consider a regression problem with an X-valued dependent variable as a problem with an H-valued dependent variable. Comparing this situation with the previous one, it is clear that we now face a "dual" problem: while before we needed to restrict the domain from H to X, we now need to force the codomain of Γ to lie inside X. We would like to retain the same properties that make linear kernel operators appealing as regression operators between Hilbert spaces. A possibility could be considering a linear kernel operator Γ with values in H and restricting it to Γ^{-1}(X). However, this would imply that for any z ∉ Γ^{-1}(X) no prediction would be available. We argue that a more reasonable approach consists in finding an operator Γ^P : Z → X as close as possible (in a sense that will be made clear later) to the aforementioned linear kernel operator Γ. Hence, we relax the linearity assumption in favor of Lipschitzianity and take as regression operator Π_X ∘ Γ, whose image always lies in X. Note that Γ^P inherits the interpretability of the kernel of Γ. To motivate this choice, we give the following notion of a projected operator.

Definition 3 Let Z be a normed space and consider Z a Z-valued random variable. Let Γ : Z → H be a generic Lipschitz operator between Z and H. A (Z, X)-projection of Γ is an operator Γ^P : Z → X such that

Γ^P ∈ arg min_{T : Z → X Lipschitz} E[ ||Γ(Z) − T(Z)||^2 ].

In other words, Γ^P provides the best pointwise approximation of the H-valued operator Γ, averaged w.r.t. the measure induced by Z. Hence, given Z a Z-valued random variable, Y an X-valued random variable, and a linear regression model Γ : Z → H for (Z, Y), the projected regression model induced by Γ is Γ^P.

Proposition 3 With the same notation as above, if E||Z||^2 < ∞, then Γ^P = Π_X ∘ Γ.

Proof For any T : Z → X it holds that E[||Γ(Z) − Π_X(Γ(Z))||^2] ≤ E[||Γ(Z) − T(Z)||^2], since Π_X(Γ(z)) is the point of X closest to Γ(z) for every z. Moreover, Γ and Π_X ∘ Γ are Lipschitz and, Π_X being non-expansive, they share the same constant L > 0: ||Π_X(Γ(z)) − Π_X(Γ(z'))|| ≤ ||Γ(z) − Γ(z')|| ≤ L ||z − z'||.

The only case left out by the treatment above is when both the independent and the dependent variables are X-valued. This case, however, follows naturally by combining the two approaches, and we report the definition below.

Definition 4 Let Z and Y be two X-valued random variables. Let Γ : H → H be a functional linear regression model for the variables considered as H-valued. A projected linear regression model for (Z, Y) is given by (Π_X ∘ Γ)|_X.

Remark 5 When considering a regression with an X-valued independent variable, one may want to relax the restriction to X in Definition 2 for various reasons; for instance, one may have measurement errors, or by design the test set may contain points also outside X. In such cases it is worth considering the problem of how many continuous linear extensions of Γ|_X are possible on the whole H. A sufficient condition for the uniqueness of such an extension is the following: there exists a sequence of linear subspaces of H, say {H_J}_{J≥1}, such that ∪_J H_J is dense in H and X_J := H_J ∩ X contains a basis of H_J for every J.

Remark 6 When H = L_µ^2(R) and X = ϕ_µ(W_2(R)), the condition in Remark 5 is verified, for instance, thanks to Remark 8 in Section 4.3.
Moreover, observe that the uniqueness of the extension can also be proven thanks to Jordan's representation of functions f : R → R with bounded variation (BV). In fact, any f with BV can be written as the difference of two monotone functions, and thus Γ(f) is fixed; then, by the density of BV functions in H, Γ is determined on the remaining elements of H.

We now compare the projected methods defined earlier in this Section with their intrinsic counterparts. In particular, we focus on the geodesic PCA defined in Bigot et al. (2017) and Cazelles et al. (2018) and on the distribution-on-distribution regression model in Chen et al. (2021). Bigot et al. (2017) and Cazelles et al. (2018) define two different PCAs, namely a global and a nested one; in particular, the nested approach presents analogies with other PCAs developed for manifold-valued random variables (Jung et al., 2012; Huckemann and Eltzner, 2018; Pennec, 2018); we report the two definitions below. The first key difference between the global and the nested geodesic PCA is that the latter provides a notion of preferential directions in the principal component, while the former does not. In fact, the first nested principal component corresponds to the first principal direction, and it is possible to find the remaining principal directions by imposing orthogonality constraints as we obtain nested PCs of higher dimension. Thus, the nested geodesic PCA is more suitable for exploring and visualizing the variability in a data set, see also Section 7. On the other hand, exactly because of the lack of such constraints, the global PCA is in general more flexible and provides superior performance in terms of reconstruction error, cf. Section 7. Comparing these definitions with that of our projected PCA, the key difference is that the geodesic PCAs do not exploit the Hilbert structure of H. Thus, as we discuss in Section 5.3, the numerical routines needed to find such principal components rely on nonlinear constrained optimization, which can be extremely demanding and nontrivial to implement. This is in sharp contrast with our projected PCA of Definition 1 which, thanks to Lemma 1, can be straightforwardly computed. However, as a result, the projected PCA is in general less respectful of the underlying metric structure. By investigating this issue in simpler settings, for instance when H = R^d and X is a convex polytope in R^d, we noticed that the differences between the projected principal directions and the nested geodesic ones become appreciable only if the random variable X gives significant probability to values near the border of X; see for instance Figure 1. While this intuition remains valid also in the more complex setting investigated in this paper, it is harder to imagine realizations of X near the border of X there. Note that the interpretability of the projected PCA is determined by the level of discrepancy between the definitions, as in Figure 1, which depends on how much variability is correctly captured by the component, that is, how much of the variability captured by the projected component lies in X. This intuition is formalized in Section 7.2, where two measures of "reliability" of the projected PCA are proposed. Turning to the regression context, Chen et al. (2021) define a distribution-on-distribution linear regression model in the Wasserstein space.
Their approach considers two different tangent spaces of W_2(R) (the first centered in the barycenter of the independent variable and the second centered in the barycenter of the dependent variable) and maps the observations to the corresponding tangent spaces. They then use FDA tools to estimate a functional linear model Γ between those two spaces. When the image of the regression operator Γ lies inside the image of the log map centered in the dependent variable's barycenter, their distribution-on-distribution regression can be considered a properly intrinsic method. This assumption is used to prove asymptotic properties of their methodology but, as the authors of Chen et al. (2021) notice, it is hardly verified in practice, so that whenever the output of the regression operator is not a distribution they resort to squeezing such a value with a scalar multiplication, namely a "boundary projection", which in general is not a metric projection. The boundary projection step gives an extrinsic nature to their model, and we provide further comparisons with our methods in Section 3.4.

In this section, we offer a comparison of our projected methods with other extrinsic methods, namely the log PCA in Cazelles et al. (2018) and the distribution-on-distribution regression in Chen et al. (2021) which, as outlined in the previous section, may behave as an extrinsic method. Let us start with the former. Cazelles et al. (2018) propose the definition of a log PCA as an alternative to the geodesic PCAs in Bigot et al. (2017). Both the log and the projected PCA are extrinsic methods: they proceed by carrying out the PCA in a linear space H and then mapping the results back to the Wasserstein space, following an approach which had already been proposed by Fletcher et al. (2004). For the log PCA, H is the tangent space at µ; for the projected PCA, H is L_µ^2(R). Given U_k = {w_1, . . . , w_k} the first k H-principal components, the log principal component in W_2(R) is exp_µ(Sp(U_k)). Analogously, by considering the convex cone X := log_µ(W_2(R)) ⊆ H, the principal component in X is log_µ(exp_µ(Sp(U_k))). We notice two key differences between the log and the projected PCA. First, as pointed out in Remark 2, log_µ ∘ exp_µ is not a metric projection in L_µ^2, so that given a point x ∈ H \ X, log_µ(exp_µ(x)) might end up being extremely different from x. See for instance Figure 2, where for a point x (blue line) that is close (in the L_µ^2 norm) to X, log_µ(exp_µ(x)) turns out to be quite far from x. In the context of PCA, this means that as soon as the projection of an observation onto Sp(U_k) lies outside of X, the log PCA quickly loses its interpretability. Second, as discussed in Remark 4, there is no guarantee that log_µ(exp_µ(Sp(U_k))) is contained in Sp(U_k); its dimension might increase and it might not even be convex. For this same reason, in general, the log PCA cannot define a set of (orthogonal) principal directions which span the principal component. Hence, it is not possible to work directly on the scores of the PCA. Combined, we believe that the above mentioned issues represent a major drawback of the log PCA when compared to the projected PCA, as they prevent the possibility of doing proper dimensionality reduction and working on the scores of data points on the principal components.
Finally, we also point out that approximating the exp_µ map is a nontrivial task, involving the numerical computation of the preimages of an arbitrarily large number of sets as well as numerical differentiation, which can lead to numerical instability of the log PCA. We end this discussion with a comparison between the boundary projection in Chen et al. (2021) and the metric projection. Their difference, for a possible regression output x ∈ H \ X, is depicted in Figure 2. Note that, by construction, the boundary projection shrinks the tails of the output. Even when the regression output is only slightly outside the image of the log map, the boundary projection result can be extremely far from the regression output and from the metric projection in terms of Wasserstein distance. For example, in Figure 2, the regression output and the projected method assign positive probability to values in the range [−45, 45], while the output of the boundary projection assigns zero probability to values outside [−17, 17]. This underrepresentation of the variability might be a crucial issue depending on the application considered.

The projected methods defined in Section 3 depend heavily on the availability of projection operators onto the closed convex cone X = ϕ_µ(W_2(R)). X being a cone inside a linear space, such operators are always well defined, but their implementation might be nontrivial. In this Section, we present a possible solution to this problem, based on choosing a particular µ as base point and constructing a B-spline representation of the cone X. As already mentioned, our projected methods can be carried out by choosing µ arbitrarily, and there is no theoretical difference between different choices of µ, cf. Section 2.2. Nonetheless, in practice, a clever choice of µ can lead to substantially easier and more numerically stable algorithms. For instance, by choosing a measure µ with compact support C in R, the ambient space becomes L_µ^2(C), since we work up to zero-measure sets. This greatly simplifies any numerical procedure, since we can work with grids over bounded sets and do not need to resort to any truncation procedure, which would be mandatory in case the support of µ were unbounded. Moreover, note that evaluating the map ϕ_µ at a certain measure ν amounts to computing the transport map T_µ^ν = F_ν^- ∘ F_µ; hence it is clear that the choice of F_µ numerically influences the results. For the aforementioned reasons, we argue that a reasonable choice is to center our analysis in µ = U([0, 1]). In fact, in this case, L_µ^2(R) = L_2([0, 1]) and F_µ = id_[0,1] (the transport maps are simply given by quantile functions). Having chosen µ as in Section 4.1 leads to an explicit characterization of the image of ϕ_µ as the set of square integrable a.e. non-decreasing functions on [0, 1]. Hence, the operator Π_X of Section 3 is the metric projection onto the cone of a.e. non-decreasing functions in L_2([0, 1]). Projection onto monotone functions has been widely studied in the field of order restricted inference (Anevski et al., 2006; Dykstra et al., 2012). For instance, in Anevski and Soulier (2011) an explicit characterization of such a projection is given, which however does not lead to a closed form solution, while in Ayer et al. (1955) several numerical algorithms to approximate the projection operator are proposed.
Those algorithms are based on approximating the function to be projected with a step function defined on n intervals and can be shown to have a computational complexity that is linear in n (Best and Chakravarti, 1990). Despite the numerical convenience of the aforementioned approximations, we believe that they are not suited for distributional data analysis. First and foremost, suppose that observations are given as probability density functions, so that one may want to interpret the results of a PCA, for instance, in terms of pdfs and not of quantile functions. If one were to estimate discontinuous principal directions through any of the algorithms in Ayer et al. (1955), it would not be possible to do so, as the corresponding cdfs would not be differentiable. In addition, the choice of the number of intervals n is not obvious when quantile functions are not directly observed but obtained via transformation. If n needs to be large to faithfully approximate the true quantile functions, this projection can be quite slow. For these reasons, we propose to resort to a B-spline expansion, through which we can derive an alternative approximation of the projection operator Π_X without incurring the issues of the algorithms in Ayer et al. (1955). Moreover, we will also show in Section 5.3 that the proposed B-spline expansion leads to a simpler and faster reformulation of the geodesic PCA in Bigot et al. (2017).

In what follows, let µ = U([0, 1]). Moreover, denote by x = [x_1, . . . , x_k] ∈ R^k a generic vector. As already said, through the ϕ_µ map we can identify W_2(R) with the space L_2([0, 1])^↑ of square integrable a.e. non-decreasing functions on [0, 1]. This leads us to consider a suitable B-spline basis for this space, both to efficiently evaluate all the computations needed in our algorithms and to express in a convenient way the constraints which define L_2([0, 1])^↑. In particular, we consider the basis of quadratic splines with equispaced knots in [0, 1]. The reason for this particular choice is twofold. First of all, splines of degree greater than one enjoy the nice property of uniformly approximating all continuous functions as the maximum distance between knots goes to zero; in turn, this means that the closure of the linear space generated by the spline basis w.r.t. the L_2 norm coincides with L_2([0, 1]). Secondly, quadratic splines are particularly well suited to characterize monotone functions by looking at the coefficients of the (quadratic) B-spline expansion, as shown in the next Proposition.

Proposition 4 Let {ψ_j^k}_{j=1}^J be a basis of B-splines of order k defined over equispaced knots on [0, 1]. Then: 1. if the coefficients of a spline in this basis are non-decreasing, the spline is a non-decreasing function; 2. if k = 2, then 1. holds with an "if and only if".

Before proceeding, let us fix some notation. From now on, we omit the order index "k" for the spline basis, writing ψ_j for ψ_j^2; moreover, we let {ψ_j}_{j=1}^J, with fixed J > 0, denote a quadratic B-spline basis in L_2([0, 1]).

Remark 7 Let R^{J↑} be the set of vectors v ∈ R^J with non-decreasing coefficients, that is, R^{J↑} = {v ∈ R^J : v_1 ≤ . . . ≤ v_J}. Then X_J := L_2([0, 1])^↑ ∩ Span{ψ_j}_{j=1}^J is fully identifiable with R^{J↑}, endowed with the metric given by the symmetric positive definite matrix E with entries E_{ij} = ⟨ψ_i, ψ_j⟩_{L_2([0,1])}. The induced norm is therefore ||v||_E = (v^T E v)^{1/2}.

Remark 8 It is possible to find a basis of R^J with vectors lying in R^{J↑} (and so in X_J), namely the vectors (0, . . . , 0, 1), (0, . . . , 0, 1, 1), etc. In other words, Span(L_2([0, 1])^↑ ∩ Span{ψ_j}_{j=1}^J) = Span{ψ_j}_{j=1}^J for every J > 0. This tells us that the convex cone of monotone splines is indeed quite big inside the spline space, which a priori is beneficial for extrinsic methods, especially for PCA. A small numerical check of Proposition 4 and of the matrix E is given below.
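As a sanity check of Proposition 4 and Remark 7 (an illustration, not from the paper; it relies on scipy.interpolate.BSpline and a midpoint quadrature for the integrals), the following Python sketch builds a quadratic B-spline basis with equispaced knots on [0, 1], verifies that non-decreasing coefficients yield a non-decreasing spline, and assembles the Gram matrix E.

```python
import numpy as np
from scipy.interpolate import BSpline

J, k = 8, 2                           # number of basis functions, spline degree (quadratic)
# Clamped, equispaced knot vector yielding exactly J basis elements.
interior = np.linspace(0.0, 1.0, J - k + 1)
knots = np.r_[np.zeros(k), interior, np.ones(k)]
basis = [BSpline.basis_element(knots[j:j + k + 2], extrapolate=False) for j in range(J)]

# Proposition 4: non-decreasing coefficients  =>  non-decreasing quadratic spline.
coeffs = np.sort(np.random.default_rng(0).normal(size=J))
t = np.linspace(0.0, 1.0, 1000)
values = BSpline(knots, coeffs, k)(t)
print("spline is non-decreasing:", bool(np.all(np.diff(values) >= -1e-12)))

# Remark 7: Gram matrix E_ij = <psi_i, psi_j>_{L2([0,1])} (midpoint rule) and induced norm.
tm = (np.arange(4000) + 0.5) / 4000
Phi = np.array([np.nan_to_num(b(tm)) for b in basis])    # J x 4000 basis evaluations
E = Phi @ Phi.T / tm.size
norm_from_E = np.sqrt(coeffs @ E @ coeffs)
norm_from_fn = np.sqrt(np.mean(BSpline(knots, coeffs, k)(tm) ** 2))
print("||v||_E vs L2 norm of the spline:", norm_from_E, norm_from_fn)   # should agree
```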
From now on, to lighten the notation, we deliberately confuse the coefficients of the splines, living in R^J or R^{J↑} (with the metric given by E), with the corresponding spline functions, living in the subsets Span{ψ_j}_{j=1}^J and X_J of L_2([0, 1]), respectively.

Remark 9 Lastly, we point out that R^{J↑} has the structure of a convex polytope, since the constraints Gv ≥ 0 (guaranteeing that v ∈ R^{J↑}) are linear. This geometric property makes optimization on R^{J↑} handy and is key for the empirical methods developed in the remainder of the paper.

As a consequence of Remark 9, the optimization problem given by the projection of a vector v ∈ R^J onto R^{J↑} can be formulated as follows:

arg min_{w ∈ R^J} (v − w)^T E (v − w)   subject to   Gw ≥ 0.    (10)

The computational complexity required to solve (10) is at most cubic in the number of basis elements J (Potra and Wright, 2000). A preliminary analysis showed that solving the optimization problem in (10) compares favorably with the Pool Adjacent Violators Algorithm (PAVA) in Ayer et al. (1955). In particular, computing PAVA with n = 100 approximation intervals is roughly eight times slower than (10) with J = 20 (a reasonable choice, leading to negligible approximation error in our examples with a quadratic spline basis). Increasing to n = 1000 intervals makes PAVA 700 times slower than (10). In addition, resorting to a discretized approximation of quantiles would also increase the cost of the projected PCA, due to the need of using some functional PCA implementation, as opposed to the low-dimensional multivariate model we are able to implement with the B-spline basis functions.

In this Section, we present the empirical counterparts of the projected PCA defined in Section 3 and provide an illustrative example of projected linear regression, namely the case in which both the dependent and the independent variables are distributions. Let {ψ_j}_{j=1}^J be a fixed quadratic B-spline basis. Upon approximating the observed quantile functions with their spline expansion, thanks to Remark 7 we can develop our methodology in R^J, considering the metric induced by E instead of the usual one. Indeed, given a vector w ∈ R^J, we can identify the corresponding function in L_2 through the map w ↦ Σ_{j=1}^J w_j ψ_j. For the projected PCA in Section 5.1 and for the geodesic PCA in Section 5.3, we consider observations F_1^-, . . . , F_n^-, and let F_0^- be the centering point of the PCA; in our examples, F_0^- will always be the barycenter of the observations. As a preprocessing step, we approximate each of these quantile functions through a B-spline expansion and denote by a_i = {a_ij}_j and a_0 = {a_0j}_j the coefficients of the spline representations associated to F_i^- and F_0^-, respectively. For the projected regression in Section 5.2, the observations are pairs of quantile functions, where the F_{z_i}^-'s are realizations of the independent variable Z and the F_{y_i}^-'s are realizations of the dependent variable Y. We apply the same preprocessing step and let a_i^{(z)} and a_i^{(y)} denote the coefficients of the spline approximations of F_{z_i}^- and F_{y_i}^-, respectively. Denote by A the (n × J) matrix with rows a_1, . . . , a_n. As in standard PCA, the first principal component centered in a_0 is found by solving the optimization problem

w_1^* = arg max_{w : ||w||_E = 1} w^T E Ā^T Ā E w,    (11)

where Ā is the matrix whose i-th row is given by a_i − a_0. The optimization problem (11) can be solved similarly to a Rayleigh quotient: using Lagrange multipliers, (11) is equivalent to maximizing

L(w) = w^T E Ā^T Ā E w − λ (w^T E w − 1).    (12)

Differentiating (12) w.r.t. w and equating the derivative to zero shows that the solutions of dL(w)/dw = 0 are the eigenvectors of the matrix Ā^T Ā E.
Hence, ordering the eigenvalues of Ā^T Ā E in decreasing order, the first principal component w_1^* corresponds to the first eigenvector. Using similar arguments, it can be shown that w_2^*, . . . , w_J^* correspond to the remaining eigenvectors. Once the first k principal directions w_1^*, . . . , w_k^* are found, the projection of a new observation x^* = Σ_{j=1}^J a_j^* ψ_j onto U_{k,x_0}^X (see Definition 1) is found by exploiting Proposition 2. In particular, the following optimization problem is to be solved:

arg min_{λ ∈ R^k} || a^* − a_0 − Σ_{h=1}^k λ_h w_h^* ||_E^2   subject to   a_0 + Σ_{h=1}^k λ_h w_h^* ∈ R^{J↑},    (13)

which is equivalent to the minimization of a norm inside a polytope, a well-studied problem in R^J (see Sekitani and Yamamoto, 1993) for which a variety of fast numerical routines exist.

In this section, we provide the details of the estimation procedure for a projected regression model where both the independent and the dependent variables are distribution-valued. It is straightforward to extend our methodology to the cases in which only one of these variables is distribution-valued and the other one takes values in R^q. First, we outline how to obtain an estimator for the linear operator Γ in Definition 4. Following Section 3.2, we first embed both Y and Z in L_2([0, 1]) through the inclusion operator L_2([0, 1])^↑ → L_2([0, 1]) and assume the functional linear model presented in Ramsay (2004) and Prchal and Sarda (2007):

F_y^-(t) = α(t) + ∫_0^1 β(t, s) F_z^-(s) ds + ε(t).    (14)

The goal is then to estimate α ∈ L_2([0, 1]) and β ∈ L_2([0, 1]^2). Further, we assume that the error ε and Z are uncorrelated. Denote by a_i^{(z)} and a_i^{(y)} the spline coefficients of F_{z_i}^- and F_{y_i}^-, respectively. Further, we project α(t) on the same spline basis, so that α ≈ Σ_{j=1}^J θ_{αj} ψ_j, and β(t, s) on the tensor-product basis on [0, 1]^2 with J × J elements, so that β(t, s) ≈ Σ_{i,j=1}^J Θ_{βij} ψ_i(t) ψ_j(s). Neglecting the spline approximation error, model (14) entails

a_i^{(y)} = θ_α + Θ_β E a_i^{(z)} + a_i^{(ε)},    (15)

where a_i^{(ε)} denotes the spline expansion coefficients of the unobserved error ε_i(t). We propose to estimate (15) using the same approach of Prchal and Sarda (2007), but extending it to account for spline approximations of both the dependent and the independent variables. We focus only on the estimate Θ̂_β of Θ_β since, once such an estimate is obtained, the estimate of θ_α can be straightforwardly derived (see Cai and Hall, 2006) as θ̂_α = ā^{(y)} − Θ̂_β E ā^{(z)}, where ā^{(y)} and ā^{(z)} are the means of the a_i^{(y)}'s and of the a_i^{(z)}'s, respectively. The estimator Θ̂_β is found by penalized least squares minimization, cf. (16), where ρ > 0 is a penalization parameter to be fixed (usually through cross-validation) and Pen(1, Θ) is a penalization term defined in Prchal and Sarda (2007). Briefly, the term Pen(1, Θ) in (16) penalizes both the norm of β(t, s) and its derivatives, thus favoring smoother solutions. As shown in Prchal and Sarda (2007), (16) has a closed form solution. Nonetheless, the form of our solution differs from the one presented in Prchal and Sarda (2007), since they work directly on discretized functions while we estimate spline coefficients, and some care must be taken: they can use (up to scaling) the usual inner product in the Euclidean space of discretized functions, while we must consider the inner product induced by E. However, the procedure for obtaining our result is identical to the one in Prchal and Sarda (2007); hence, we only report the expression for the estimate. Let Ĉ be the matrix with entries Ĉ_{ks}, where b_k and b_s are the k-th and s-th elements of the standard Euclidean basis in R^J.
Further, let D̂ denote the matrix with entries D̂_{ks}. Finally, let E' denote the matrix with entries E'_{ij} = ⟨ψ'_i, ψ'_j⟩ (where ψ'_i denotes the first derivative of the B-spline basis function ψ_i), C_ρ = E^T ⊗ (Ĉ + ρE'), and P = E'^T ⊗ E + E^T ⊗ E', where ⊗ denotes the Kronecker product. Then the solution of (16) can be expressed as vec(Θ̂_β) = (C_ρ + ρP)^{-1} vec(D̂), where vec(·) denotes the vectorization of a matrix. Finally, our projected regression model is the composition of the operator induced by (θ̂_α, Θ̂_β) with the projection onto R^{J↑}.

We now show how the framework in Section 4 can be employed to derive faster numerical algorithms for the global and nested geodesic PCA of Definition 5 and Definition 6. Proposition 5 (Global geodesic PCA) states that a k-dimensional global geodesic PC centered in a_0 is the subset of R^{J↑} spanned by linearly independent {w_1, . . . , w_k} which solve a constrained minimization problem, (17). Proposition 6 (Nested geodesic PCA) states that, with the same notation, a k-dimensional nested geodesic PC centered in a_0 is the set spanned by {w_1, . . . , w_k} in R^{J↑}, where the w_i's are found recursively from w_1 to w_k, each w_h solving an analogous constrained problem, (18). To solve (17) and (18) we employ an interior point method implemented in the solver Ipopt (Waechter and Biegler, 2006). When comparing our implementation with J = 20 spline basis functions to the one in Cazelles et al. (2018), we notice a substantial performance improvement, by a factor of 35 for a data set of n = 100 distributions, due to the fact that working with spline approximations greatly reduces the number of parameters in the optimization problem. Further, note that (17) and (8) seem extremely similar. However, in (8) the optimization is carried out having fixed w_1^*, . . . , w_k^* and for a single observation, while in (17) the optimization is done over a much larger set of parameters. In fact, the number of parameters in (17) is (n + k)J; hence the computational complexity needed to solve (17) is cubic in both the number of basis functions and the number of observations. On the other hand, the projected PCA requires linear time in the number of observations (computation of Ā^T Ā E) and cubic time in the number of basis functions J (eigendecomposition and projection of new observations).

In this section, we study the convergence of the proposed projected empirical methods. First of all, we show that, as the number of spline basis functions J increases, the error due to the spline approximation vanishes if the data are sufficiently regular. Further, under a suitable set of assumptions, we establish consistency results for the projected PCA and for the projected distribution-on-distribution regression. In the following, denote by W_k^r([0, 1]) the space of functions whose weak derivatives up to order k belong to L_r([0, 1]); further, denote by D the (weak) derivative operator, so that Df = f', D^2 f = f'' and so on. Proposition 7 then guarantees that any sufficiently regular non-decreasing function f (for instance, f ∈ W_3^∞([0, 1])) can be approximated, uniformly and with an error vanishing as J grows, by quadratic splines whose coefficients are monotonically non-decreasing in j for every J. Let us remark two important facts.

Remark 10 Since the inclusion L_∞([0, 1]) ⊂ L_2([0, 1]) is continuous, thanks to Hölder's inequality the convergence rates hold also for the L_2 norm. By default, we use the L_2 norm if not stated differently.

Remark 11 By the Poincaré inequality, if ||D^3 f||_∞ < C then f belongs to a sphere in W_3^∞([0, 1]) whose radius depends on C and on the Poincaré constant of [0, 1]; vice versa, all the elements in the sphere of radius C in W_3^∞([0, 1]) clearly have (weak) derivatives bounded by C. An informal numerical illustration of this approximation behavior is given below.
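To illustrate the kind of spline-approximation behavior discussed above (an informal numerical check, not the statement of Proposition 7; the truncated Gaussian quantile function is an arbitrary toy target and scipy's unconstrained least-squares spline fit is used for simplicity), the following Python sketch fits a smooth non-decreasing function with quadratic B-splines of increasing dimension J and reports the uniform error together with a check on the monotonicity of the fitted coefficients.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline
from scipy.stats import norm

# Toy smooth non-decreasing target on [0, 1]: a truncated Gaussian quantile function.
t_grid = np.linspace(0.0, 1.0, 2000)
target = norm.ppf(0.01 + 0.98 * t_grid)          # smooth and increasing on [0, 1]

k = 2                                            # quadratic splines
for J in (5, 10, 20, 40):
    interior = np.linspace(0.0, 1.0, J - k + 1)
    knots = np.r_[np.zeros(k), interior, np.ones(k)]
    fit = make_lsq_spline(t_grid, target, knots, k=k)
    err = np.max(np.abs(fit(t_grid) - target))                  # sup-norm approximation error
    mono = bool(np.all(np.diff(fit.c) >= -1e-10))               # fitted coefficients non-decreasing?
    print(f"J = {J:3d}   sup-error = {err:.2e}   non-decreasing coefficients: {mono}")
```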
In this section we prove the consistency of the projected methods under some assumptions on the data-generating process. In particular, we show that there exist a number of basis functions J > 0 and a sample size n such that the error committed by the empirical models in Section 5 is smaller than ε > 0, for any fixed ε. Consistency of spline-based PCA for functional data has been addressed, among the first, by Silverman et al. (1996) and Qi and Zhao (2011). As one of the main building blocks of our projected PCA is the PCA in the ambient space, that is L^2([0, 1]), it is natural to follow Qi and Zhao (2011) in making assumptions (P1)-(P5) on the data µ_1, . . . , µ_n. Before stating the main results, let us comment on assumptions (P1)-(P5). First of all, (P2) is essential in order to apply Proposition 7 and get uniform errors on the data set. Moreover, (P2) is satisfied, for instance, if the F^-_i's lie in the L^2-closure of a ball of radius M > 0 in W^∞_3. (P4) is a rather standard condition and is satisfied if µ_1, . . . , µ_n ∈ W_4(R). (P4) and (P5) imply the assumptions that are used in Qi and Zhao (2011) for the consistency results. In particular, (P5) is stronger than the corresponding assumption in Qi and Zhao (2011), where the eigenfunctions are assumed to belong to W^2_2([a, b]). Similarly, in that work there is no counterpart of assumption (P2); in fact, we need these stronger regularity conditions to get uniform errors when using B-splines. Still, some of the examples that Qi and Zhao (2011) provide of situations satisfying their assumptions also meet our requirements. Finally, the zero-mean assumption in (P1) might seem a little odd, since we know that the quantile functions are monotonically non-decreasing. However, observe that it is always possible to subtract the empirical mean from the observations to satisfy (asymptotically) this assumption.

Let J denote the dimension of a quadratic B-spline basis on [0, 1] and let a^J_i be the coefficients of the B-spline approximation of F^-_i. In what follows, to lighten the notation, we refer to a set of spline coefficients both as elements of R^J with the E-norm and as functions in L^2, without making explicit reference to the coordinate operator and its inverse.

Proposition 8 Under assumptions (P1)-(P5), for any ε > 0 there exist a sample size n > 0 and a number of basis functions J > 0 such that the estimated projected principal directions are within ε of the true ones.

Proposition 8 ensures the consistency of the B-spline approximation of the PCA for monotone functional data in H, which is equivalent to the consistent estimation of the projected principal directions. Since for any set of coefficients λ_h we have the convergence λ_h w^{J*}_h → λ_h w*_h, we obtain that the projection of a point onto Sp(U^J_k) ∩ L^2([0, 1])↑ converges to the projection onto Sp(U_k) ∩ L^2([0, 1])↑. Thus we also have convergence of the projection onto the principal components.

We now consider the regression model (14) and make assumptions (R1)-(R3), where (R1) requires that the data generating process satisfies (14). Without loss of generality, suppose that both the dependent and the independent variables have been centered by subtracting their mean, so that E[Z] = E[Y] = 0 and α = 0. The strategy to prove the consistency of the projected linear regression is the following. First of all, we prove that the estimator Θ̂_J converges to the estimator Θ̂_PS, defined in Prchal and Sarda (2007), for large enough n and J.
Second, we exploit the consistency of the estimator in Prchal and Sarda (2007), combined with the approximation results for the metric projection, to establish consistency in terms of the prediction error of our projected regression operator. Briefly, Θ̂_PS is obtained by minimizing an objective function similar to the one in (16), but where the spline approximation is used only for Θ, while the F^-_zi's and the F^-_yi's are assumed fully observed and not approximated through splines. Calling B the vector of functions with entries ψ_1, . . . , ψ_J, Θ̂_PS is defined as the minimizer of the corresponding penalized criterion. Convergence of Θ̂_J to Θ̂_PS is shown in the next proposition.

Proposition 9 Under assumptions (R1)-(R3), if the number of samples is large enough, Θ̂_PS and Θ̂_J exist with probability close to 1 and, for any ε > 0, there is J > 0 such that ∥Θ̂_PS − Θ̂_J∥_{E⊗E} < ε.

Let β̂_PS and β̂_J be the kernels β̂_PS = B^T Θ̂_PS B and β̂_J = B^T Θ̂_J B. Since ∥β̂_PS(s, t) − β̂_J(s, t)∥_{L^2([0,1]^2)} = ∥Θ̂_PS − Θ̂_J∥_{E⊗E}, we have established strong convergence of our kernel to the estimator of Prchal and Sarda (2007). This implies that the consistency results for the estimator Θ̂_PS hold also for Θ̂_J, with respect to the seminorm induced by the covariance operator of Z. Specifically, given an H-valued random variable Z with covariance operator C_Z, for any ϕ ∈ L^2([0, 1]^2) we consider the seminorm ∥ϕ∥_{C_Z} on L^2([0, 1]^2) induced by C_Z. Thus, the following result is immediately implied, since strong convergence implies seminorm convergence (see Appendix A).

Proof We use the triangle inequality for the seminorm, bounding ∥β − β̂_J∥_{C_Z} by ∥β − β̂_PS∥_{C_Z} + ∥β̂_PS − β̂_J∥_{C_Z}. The first term on the right hand side converges to zero thanks to Theorem 2 in Prchal and Sarda (2007), while the second term converges to zero thanks to Proposition 9 and the previous observations.

Lastly, we need to take into account the projection step. First, we notice that ∥β − β̂∥_{C_Z} corresponds to the expected prediction error, as in Prchal and Sarda (2007); further, by Hölder's inequality, E|<Z, β − β̂_J>| → 0, which straightforwardly yields convergence of the predicted values. Thus, the following simple lemma ensures the consistency of the spline approximation of the projection on X and leads to the consistency of the projected regression in terms of prediction error. Again, following Remark 7, we can identify the space of monotone B-splines with J basis functions with R^{J↑}. Hence, to lighten the notation, we denote by Π_{R^{J↑}} the metric projection operator onto the space of monotone B-splines with J basis functions.

Lemma 2 Given β_n → β in H, for any ε > 0 there exist n, J > 0 such that the error committed by the projected predictions is smaller than ε.

In this section we perform PCA on different simulated data sets and on a real data set of Covid-19 mortality in the US. In particular, on the simulated data sets we compare the performance of our projected PCA (in terms of approximation error and interpretability of the directions) with those of intrinsic methods, showing that the projected PCA is a valid competitor in a diverse set of situations. For the Covid-19 data set, we compare the inference obtained using the projected, nested and log PCAs, highlighting the practical benefits of the projected PCA over the log one. For the projected, nested and global PCAs we need to fix a B-spline basis to express the quantile functions. In particular, we fix an equispaced quadratic B-spline basis with J interior knots on [0, 1]. Here, the number of basis functions J is always fixed to 20, which provided a negligible approximation error for the quantile functions. We did not observe any appreciable change when increasing it.
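For concreteness, the following sketch shows one way to obtain the spline coefficients of a quantile function with such a basis. SciPy ≥ 1.8 is assumed for BSpline.design_matrix; the grid size and the unconstrained least-squares fit are our own choices, and in the paper's method the coefficients are additionally kept non-decreasing (cf. Proposition 4).

import numpy as np
from scipy.interpolate import BSpline

def quantile_spline_coefficients(quantile_fn, n_basis=20, degree=2, n_grid=500):
    # equispaced, clamped quadratic B-spline basis on [0, 1]
    n_interior = n_basis - degree - 1
    interior = np.linspace(0, 1, n_interior + 2)[1:-1]
    knots = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]
    grid = np.linspace(0, 1, n_grid)
    design = BSpline.design_matrix(grid, knots, degree).toarray()  # (n_grid, n_basis)
    coef, *_ = np.linalg.lstsq(design, quantile_fn(grid), rcond=None)
    return coef, knots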
In Appendix C we show further simulations where we perform a sensitivity analysis as the number of basis functions increases for a fixed sample size; we also provide empirical confirmation of the consistency results in Section 6 and give practical guidance on how to choose J.

We consider three different simulations to compare both the interpretability and the ability to compress information of the different PCAs. We compare our projected PCA with the nested and global geodesic PCAs (Bigot et al., 2017; Cazelles et al., 2018) and the simplicial PCA (Hron et al., 2014). Briefly, the simplicial PCA applies a transformation, called the centered log ratio, that maps densities defined on the same compact interval I into functions in L^2(I). Then, a standard L^2 PCA is performed on the transformed pdfs and, by the inverse of the centered log ratio transform, the results are mapped back to the space of densities, called the Bayes space (for a more accurate definition, see Egozcue et al., 2006). In particular, we remark that, to be well defined, the simplicial PCA requires that all the pdfs have support equal to I, which is a strong assumption in practice. Further details about the simplicial PCA are given in Appendix B. As for the projected PCA, to compute the simplicial PCA we resort to a B-spline approximation, but this time of the transformed pdfs. Hence, we need to select a B-spline basis on the support I of the pdfs. In this case, we fix a cubic B-spline basis with 20 interior knots on I, as this choice yielded a negligible approximation error for the transformed pdfs.

In the first scenario, we simulate data from the generating process (19), whose densities are proportional to Gaussian densities with random location µ_i and scale σ_i; here "proportional to" stands for the fact that we confine the density to the support [−10, 10] and renormalize it so that it integrates to 1. Observe that there are two sources of variability across the pdfs from the data generating process (19). The first one is the location of the peak µ_i and the second one is the width of the distribution around the peak, controlled by σ_i. See Figure 3.

Figure 4 shows the first two principal directions obtained using the different methods; each line represents the pdf associated with λw_i, where w_i is the i-th principal direction (i = 1, 2) and λ is a score ranging from −2 (darkest blue) to +2 (darkest red). We can notice several differences between them. Focusing on the first principal direction, we can see that the simplicial, projected and nested PCAs detect a change in the location of the peak of the pdf. In particular, the first direction for the Wasserstein PCAs represents a shift from left to right of this peak, while for the simplicial PCA the first direction is associated with a peak at 3 (blue lines, negative values of the scores) or with a peak at −3 (red lines, positive values of the scores). This also highlights the difference in the geometries underlying the Wasserstein and Bayes spaces. Looking at the second principal direction instead, we can see how in the Wasserstein PCAs it clearly represents a change in the width of the distribution, while for the simplicial PCA the interpretation is somewhat obscure. The global geodesic PCA deserves a separate discussion. Indeed, from Definition 5 it is clear that a global principal component is a convex set without any notion of preferential directions, so that it is not possible to interpret separately the variation along the first and second directions found by the global PCA. Now we present two additional simulations that quantify the amount of information that is "lost" by performing the PCA.
As a metric, we consider the reconstruction error (20), that is, the discrepancy, measured in the Wasserstein distance, between the observed probability measures F^-_i and the reconstructed ones F̃^-_i, where k is the dimension of the principal component. More in detail, F̃^-_i is found by first projecting (F^-_i − F̄^-_0) into R^k using the PCA and then applying the inverse transformation. Informally, the reconstruction error is a measure of the quantity of information lost by applying the PCA as a black-box dimensionality reduction. As is evident from Equation (20), we measure the performance of the PCAs just in terms of the Wasserstein metric. This is likely to favor the performance of the Wasserstein PCAs over the simplicial one. Thus, the interesting performance comparison is the one between the geodesic PCAs and the projected PCA. Nevertheless, we think that it is worth reporting also the results for the simplicial PCA, which is an intrinsic method in the Bayes space, to show that the underlying metric structures are extremely different. This also helps to appreciate the results in Section 8. Given the difference in the metric structure between the Wasserstein and Bayes spaces, we believe that the choice between the simplicial and Wasserstein frameworks is not trivial and should be application-driven.

To measure raw performance differences between the geodesic and projected PCAs, we simulate data so that there is little recognizable structure in them, unlike in the previous example. The data generating process is given in (21); observe that (21) is a finite dimensional approximation of the Dirichlet process mixture model, a popular workhorse in Bayesian nonparametric statistics that is well known to be dense in the space of densities on R, see for instance Ferguson (1983). An example of the kind of pdfs generated from (21) is shown in Figure 5(a). To separate the effect of the B-spline smoothing procedure, in this scenario we evaluate the reconstruction error in (20) considering µ̃_i to be the reconstructed quantile functions (for the Wasserstein PCAs) or pdfs (for the simplicial PCA), and µ_i to be the probability measure represented by the B-spline approximation of the quantile function or of the (centered log ratio of the) pdf, respectively.

Figure 5(b) shows the reconstruction error as a function of the dimension of the principal component, that is, RE_k as a function of k; the shaded areas represent ± one standard deviation across data sets generated from (21). We can see how the three Wasserstein PCAs consistently outperform the simplicial one. Moreover, as is to be expected, the global geodesic PCA obtains the lowest reconstruction error for all the choices of dimension k, with the nested geodesic PCA being a close runner-up. However, the computational cost of finding the nested or global geodesic PCA can become prohibitive as the sample size, the number of basis functions in the B-spline expansion or the dimension k increases. For comparison, finding the 10-dimensional projected PCA is around 1,000 times quicker than finding the corresponding global geodesic PCA and 200 times quicker than finding the nested geodesic one. As an additional simulation, in Appendix C we investigate the effect of the number of B-spline basis functions J. In particular, we conclude that, for a fixed dimension k, the reconstruction error (20) increases with the number of basis functions, both for the projected and the simplicial PCA. Furthermore, we also observe that the reconstruction error for the simplicial PCA exhibits a larger variance than the reconstruction error for the projected PCA.
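In practice, computing (20) is straightforward: since the 2-Wasserstein distance between measures on the real line equals the L^2 distance between their quantile functions, the reconstruction error reduces to a few lines of code. The sketch below treats (20) as an average squared distance over the data set, which is one plausible normalization; the names are ours.

import numpy as np

def reconstruction_error(F_obs, F_rec, grid):
    # F_obs, F_rec: (n, G) arrays of observed / reconstructed quantile functions on grid
    sq_dist = np.trapz((F_obs - F_rec) ** 2, grid, axis=1)  # squared W_2 distances
    return sq_dist.mean()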
Our insight is that this larger variance is due to the different degrees of smoothness of the pdfs and the quantile functions. Since the quantile functions are in general smoother than the pdfs, their B-spline expansion should have lower variance.

A classical measure of performance of the standard Euclidean PCA, also useful to determine the dimension of the principal component to use, is the proportion of explained variance. For a k-dimensional Euclidean principal component, this quantity is easily computed as a ratio of eigenvalues, Σ_{j=1}^k λ_j / Σ_{j≥1} λ_j. Upon truncating the series at the denominator, the same quantity can also be computed for PCA in infinite dimensional Hilbert spaces. Due to the projection step involved in our definition of PCA, we argue that the proportion of explained variance might not be a reliable indicator of performance, nor should it be used to guide the choice of the dimension k. Instead, we propose a fast alternative based on the Wasserstein distance that we believe better represents the properties of the projected PCA, that is, the normalized reconstruction error NRE_k, whose numerator corresponds to the reconstruction error in (20) and whose denominator is the average distance between the observed measures and their barycenter. Observe that in Euclidean spaces this quantity is closely related to the proportion of explained variance, since maximizing the variance captured by a subspace amounts to minimizing the average distance from the data points to the subspace.

Given its extrinsic nature, for a fixed dimension, the projected PCA might sometimes fail to capture the variability of some particular data set and, in those situations, an intrinsic approach should be preferred. However, given the high computational cost associated with geodesic PCAs, one would carry out such an analysis only knowing that the results would be significantly better than the ones obtained by the projected PCA. This calls for discerning whether a poor performance of the projected PCA is due to its extrinsic nature or rather to the scarcity of structure in the data set under consideration: in the former situation it is likely that a geodesic approach would yield better results; in the latter, instead, it is likely that the results would remain the same. We now propose two empirical indicators of the "reliability" of the empirical projected PCA. The first one measures, once a k-dimensional principal component is found, how reliable the projected principal directions are, and the second one gives an idea of how different the projected PCA and the L^2 PCA are.

To assess the interpretability of the principal directions and the scores obtained with the projected PCA, we first compute, for every principal direction w*_h, the quantity η^min_h, defined as the smallest value of η such that a_0 + ηw*_h remains in R^{J↑}, where a_0 is the spline coefficient vector associated with the barycenter F̄^-_0; the scalar η^max_h is found analogously. Hence (η^min_h w*_h, η^max_h w*_h) is the segment spanned by the principal direction that lives inside the convex cone R^{J↑}. If the scores of all observations along this direction lie within the range (η^min_h, η^max_h), then the variability captured by the (empirical) projected PCA can be decomposed along the principal directions, whose scores are then highly interpretable. On the contrary, PCA scores outside (η^min_h, η^max_h) will be associated with functions which are not quantile functions, thus limiting the interpretability of the direction.
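Under the identification of R^{J↑} with non-decreasing spline coefficients (Proposition 4), the extremes η^min_h and η^max_h can be computed explicitly; the short sketch below does so (function and variable names are ours).

import numpy as np

def eta_range(a0, w, tol=1e-12):
    # a0: barycenter coefficients (non-decreasing); w: a principal direction in R^J
    # returns the largest interval of eta such that a0 + eta * w has non-decreasing coefficients
    d_a0, d_w = np.diff(a0), np.diff(w)
    lower = [-da / dw for da, dw in zip(d_a0, d_w) if dw > tol]
    upper = [-da / dw for da, dw in zip(d_a0, d_w) if dw < -tol]
    eta_min = max(lower) if lower else -np.inf
    eta_max = min(upper) if upper else np.inf
    return eta_min, eta_max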
Based on these quantities, we propose the interpretability score IS_h defined in (22), where s_ih is the score of observation i along direction h according to the projected PCA. A value of IS_h equal to one corresponds to perfect interpretability, that is, the projected PCA behaves like a standard Euclidean PCA along direction h. On the other hand, values of IS_h closer to zero indicate that, for direction h, the decomposition of the variance along the principal directions falls outside R^{J↑}. The interpretability score can be fruitfully used also to evaluate the directions found with the nested PCA, upon replacing the s_ih's in (22) with the scores given by the nested PCA. Note that the IS_h score is useful to interpret the directions one at a time. However, it can be the case that some scores along one direction h lie outside the (η^min_h, η^max_h) range but that the L^2 projection onto a component of dimension h′ ≥ h still lies within the projected component. For instance, this could imply that a projected PC is similar to a nested one despite having very different directions. A discrepancy between the two can appear when the projections of some data points on the L^2 PCA lie outside R^{J↑}. Using the terminology of Proposition 2, this can be measured in terms of the difference between the two projections of a given observation F^-_*. To quantify the loss of information at the level of the component (instead of the direction), we propose to measure the "ghost variance" GV_k captured by the L^2 PCA: the GV_k score measures the quantity of information that is lost due to the projection step or, in other words, the information that we trained our PCA on but that does not appear in the Wasserstein space. If GV_k = 0 then all the information captured by the L^2 PCA lies inside the Wasserstein space, and the projected PCA coincides with the nested one by definition. Finally, although this situation never occurred in our experience, it might happen that GV_k is small but some IS_{k′} (k′ ≤ k) is small as well. This means that the subspace identified by the projected PCA is suitable for representing the data, but the single principal directions are not interpretable. In this case, we suggest taking a hybrid approach: use the projected PCA as a fast black-box dimensionality reduction step, thus reducing the dimensionality of each observation from J to k, and then use the nested PCA, in dimension k, to estimate the directions; the main advantage is the reduction in the computational cost of estimating the nested PCA in this lower dimensional space.

We perform a PCA analysis on the Covid-19 mortality data publicly available at data.cdc.gov as of December 1st, 2020. The data set collects the total number of deaths due to Covid-19 in the US from January 1st, 2020 to the current date; data are subdivided by state, sex, and age. In particular, the ages of the deceased are grouped in eleven bins: [0, 1), [1, 5), [5, 15), [15, 25), [25, 35), [35, 45), [45, 55), [55, 65), [65, 75), [75, 85), [85, +∞), but we truncate the last bin to 95 years for numerical convenience. Further, we remove Puerto Rico from the analysis because it presented too many missing values. Our final data set, shown in Figure 6(a), consists of 106 samples of the distribution of the ages of patients deceased due to Covid-19, divided by sex and pertaining to 53 US states and inhabited territories. We apply our usual B-spline approximation with J = 20 basis functions to the quantile functions obtained starting from the histograms in Figure 6.
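As a small preprocessing example, the empirical quantile function of a binned data set such as this one can be obtained by inverting the piecewise-linear cdf of the histogram; the helper below is our own sketch (it assumes strictly positive bin counts) and uses the truncated age bins of the Covid-19 data.

import numpy as np

def quantile_from_histogram(counts, edges, probs):
    # counts: deaths per age bin; edges: bin boundaries; probs: evaluation points in [0, 1]
    counts = np.asarray(counts, dtype=float)
    cdf = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))
    return np.interp(probs, cdf, edges)  # piecewise-linear quantile function

# e.g., edges = [0, 1, 5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
# F_inv = quantile_from_histogram(counts, edges, np.linspace(0, 1, 500))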
The choice J = 20 yields an average approximation error, in terms of the Wasserstein distance, of 0.02. An error this low is to be expected, since the quantile functions are piecewise linear functions defined on eleven intervals. We use this real data set to make a hands-on comparison of the inference that can be obtained employing the projected, nested and log PCAs. We start by comparing the projected and nested PCAs. The first direction found by the nested PCA is identical to the one found by the projected PCA, while the second is extremely close: the cosine between the two principal directions is approximately 0.99. In line with this, the interpretability scores equal IS_1 = 1 and IS_2 ≈ 0.89, while GV_2 = 0.05. Moreover, the two-dimensional projected principal component explains more than 90% of the L^2 variability and NRE_2 ≈ 0.05 for both the projected and the nested PCA. Given the reconstruction error and the GV_2 score, we can conclude that the two-dimensional projected principal component provides a very good fit to the data, and that both selected principal directions are well behaved with respect to their scores, guaranteeing interpretable results. Considering the discussion above and the fact that both the projected and nested PCAs employ the metric projection to map data points to the k-dimensional principal component, the inference obtained with the nested PCA and with the projected one is almost identical in this case, and we show results only for the projected PCA in Figure 7. In particular, the first principal direction shows that the greatest variability is due to the elderly: low negative values along this direction correspond to most of the mortality being concentrated in the 80+ range. The red and the green distributions shown in the rightmost panel show two antithetic behaviors, which correspond to scores along the first principal direction of roughly −8.5 and 7, as shown in the third panel of Figure 7. In fact, the red distribution is concentrated almost exclusively on the last two bins of the histogram, with the 85+ bin accounting for more than 60% of the deaths. At the opposite extreme, the green distribution gives more weight to lower age values. The second direction instead shows variability in the 40-80 range. The purple distribution, characterized by the highest score along this direction, shows that a significant percentage of deaths occurred in the age range 60-75. Finally, the third panel of Figure 7 reports the scores along the first two principal directions for the whole data set, blue dots representing males and orange dots representing females. We can appreciate how women tend to have lower scores on both directions. This is in line with our understanding that Covid-19 is more severe among the male population (see for instance Mandavilli, 2020), which explains why males are more susceptible to death even at younger ages, while deaths among women are more concentrated in the 70+ age range, the elderly being in general more fragile. The comparison with the log PCA requires more attention. First of all, note that the directions obtained with the projected and log PCAs are the same by definition, since they are both obtained performing PCA in L^2([0, 1]), but the principal components may differ because different projection operators are employed when the orthogonal projection of a point onto the principal component lies outside of the image of ϕ_µ, as discussed in Section 3.4.
Figure 7: The third panel reports the scores of the projections on the two-dimensional principal component (orange for women and blue for men), and the fourth panel shows three particular distributions, also highlighted in the third panel: the red distribution is that of women in Vermont, the green one is that of males in Alaska and the purple one is that of women in West Virginia.

As expected from the comparison between the metric projection and the pushforward operator in Figure 2, the fit to the data of the projected and log PCAs will be different. In particular, in this case we observe that the log PCA does a worse job in terms of NRE, as shown in Figure 6(b), especially when the dimension increases. This behavior can also be in part explained by the complexity of the numerical routines needed to approximate the pushforward operator (required by the log PCA), where it is natural to expect some numerical errors. More in general, as discussed also in Cazelles et al. (2018), we can conclude that the log PCA is not suited to study this particular data set because the L^2 PCA is different from the nested geodesic PCA (as testified by the GV_2 score). In fact, apart from the visual inspection of the L^2 principal directions (which are not guaranteed to span the log-principal components), not much can be obtained from the log PCA in this case, since it does not provide a consistent way of projecting data points on the principal component, as pointed out in Section 3.4.

In this section, we propose a comparison between the Wasserstein projected and simplicial (see Appendix B) approaches when the task at hand is distribution on distribution regression, and show an application of the Wasserstein projected regression framework to a problem of wind speed forecasting. We consider two data generating processes as follows. In the first setting, data are generated from the Wasserstein regression: the independent variables z_1, . . . , z_n are generated by considering quantile functions F^-_z1, . . . , F^-_zn such that F^-_zi = Σ_{h=1}^{30} a^(z)_{ih} ψ_h, where ψ_1, . . . , ψ_30 is a cubic spline basis over equispaced knots in [0, 1], the coefficients are built recursively as a^(z)_{ij} = a^(z)_{i,j-1} + δ_{i,j-1}, and (δ_{i2}, . . . , δ_{i30}) ∼ Dirichlet(1, . . . , 1). This data generating procedure ensures that F^-_zi(0) = 0, F^-_zi(1) = 1 and that F^-_zi is monotonically increasing, cf. Proposition 4. The dependent variables F^-_y1, . . . , F^-_yn are generated using the same spline expansion as the independent variables, with coefficients a^(y)_i obtained by applying a randomly generated matrix B, with rows b_1, . . . , b_30, to the a^(z)_i's. Each b_i is generated as follows: b_{i1} ∼ Uniform(0, 0.5), b_{ij} = b_{i,j-1} + b̃_{ij} with b̃_{ij} ∼ Uniform(0, 0.5), so that the coefficients a^(y)_{ij} are monotonically non-decreasing for each i and thus the F^-_yi's can be considered quantile functions. We compute the pushforward of the uniform distribution via numerical inversion and differentiation and obtain the pdf associated with each quantile function. Observe that this task is easier than approximating the pushforward of a generic µ through a generic f (as Cazelles et al. (2018) do), since the quantile functions are monotonic and we have simple expressions for all the quantities related to µ.
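The first simulation scenario just described can be reproduced in a few lines. In the sketch below, the orientation in which B is applied to the coefficients is our assumption, chosen so that non-decreasing rows of B indeed yield non-decreasing coefficients a^(y)_i as stated above.

import numpy as np

rng = np.random.default_rng(0)

def simulate_first_scenario(n, J=30):
    # independent variables: cumulative Dirichlet increments => F(0) = 0, F(1) = 1, monotone
    delta = rng.dirichlet(np.ones(J - 1), size=n)
    a_z = np.hstack([np.zeros((n, 1)), np.cumsum(delta, axis=1)])
    # B: each row starts at Uniform(0, 0.5) and accumulates Uniform(0, 0.5) increments
    B = np.cumsum(rng.uniform(0.0, 0.5, size=(J, J)), axis=1)
    a_y = a_z @ B  # columns of B are componentwise non-decreasing, so a_y is monotone in j
    return a_z, a_y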
Since the simplicial regression takes as input (a transformation of) the pdfs while the Wasserstein regression works directly on the quantile functions, and also because numerical errors can be introduced in the data set during the inversion and differentiation, we consider the pdfs as ground truth and, for the Wasserstein approach, re-compute the quantile functions numerically. In the second setting, instead, we generate data from the simplicial regression model: the independent variables z_1, . . . , z_n are generated by applying the inverse of the centered log ratio to a random spline expansion as follows. For each i = 1, . . . , n, let p̃_zi = Σ_{j=1}^{30} a^(z)_{ij} ψ_j be a random spline expansion, and let the coefficients of p̃_yi be obtained by applying B to those of p̃_zi, where B is a randomly generated 30 × 30 matrix with entries drawn iid from a standard normal distribution. Finally, the pdfs p_zi (p_yi) are recovered by applying the inverse of the centered log ratio to p̃_zi (p̃_yi); see Appendix B for more details. Note that under the second data generating process both the dependent and independent distributions have support in [0, 1] by construction, whereas under the first data generating process the independent variables might have a larger support. Thus, to fit the simplicial regression in the first scenario, as is common practice (cf. Appendix B), we extend the support of all the distributions (both dependent and independent) to the smallest interval of the real line containing all the supports. This is done by adding a small term to the pdfs (in our example, 10^-12) and then renormalizing them. For both examples, we simulated 100 observations and compared the projected-Wasserstein and simplicial regressions using leave-one-out cross-validation. In particular, for both approaches we use J = 20 quadratic spline basis functions and choose the penalty term ρ in (16) through grid search. Table 1 shows the pairs of mean squared error and standard deviation of the cross-validation; the metric used to compare the ground truth and the prediction is the 2-Wasserstein distance.

Table 1: Mean squared error and standard deviation (in parentheses) of the leave-one-out cross-validation, in terms of the 2-Wasserstein distance.
              First scenario          Second scenario
Wasserstein   (4 × 10^-7, 7 × 10^-8)  (5 × 10^-3, 6 × 10^-3)
Simplicial    (0.9, 2.66)             (4 × 10^-4, 5 × 10^-4)

As one might expect, the Wasserstein regression performs better in the first scenario while the simplicial regression performs better in the second scenario. However, it is surprising how well the Wasserstein geometry can capture (in terms of the Wasserstein metric) dependence generated by a linear structure in the Bayes space, which we have shown to be rather different from the Wasserstein one.

We consider the problem of forecasting the distribution of the wind speed near a wind farm from a set of experts. The data set is publicly available at www.kaggle.com/theforcecoder/wind-power-forecasting. In particular, the data consist of measurements of the wind speed collected every ten minutes for a period of 821 days starting from December 31st, 2017. The daily average wind speed is shown in Figure 8. We assume we have access to a set of experts, that is, a set of trained models, that provide a probabilistic one-day-ahead forecast for the average wind speed. Here, our goal is to combine this set of experts and provide a point estimate of the wind speed distribution for the whole day, which can be helpful, for instance, when planning the maintenance of the windmills. Formally, let K denote the number of experts considered; F^-_zij is the quantile function associated with the probabilistic forecast of the average wind speed for day i given by expert j = 1, . . . , K, and F^-_yi is the empirical quantile function of the wind speed for day i.
In particular, we consider K = 4 experts built from the Prophet model by Facebook (Taylor and Letham, 2018) as follows: model M1 is the classical Prophet, without additional covariates or seasonality trends; model M2 includes the ambient temperature as a covariate but no seasonality; model M3 includes a yearly seasonality and no covariates; and model M4 includes both the yearly seasonality and the ambient temperature as a covariate. The models are estimated using variational inference on rolling samples of 365 days and produce one-day-ahead probabilistic forecasts for the average wind speed. The final sample size corresponds to n = 456. We consider a trivial extension of the distribution on distribution regression model in Section 5.2, in which each expert j contributes its own regression operator with kernel β_j; we refer to this model as (23). Having approximated all the functions through a B-spline expansion, the model is expressed in terms of the coefficients θ_α and Θ_β1, . . . , Θ_βK, and the procedure for estimating θ_α and Θ_β1, . . . , Θ_βK is analogous to the one outlined in Section 5.2. We compare the prediction performance of five distribution on distribution regression models. Models R1 to R4 are obtained by fitting model (23) using only one of the four experts, M1 to M4, while the fifth model (RF) is the "full" model in (23) considering all four experts. For this comparison, we perform a train-test split of the 456 days for which the experts produced a prediction, considering the last 100 days as the test set. We select the hyperparameters (namely, the penalty coefficient ρ in (16) and whether or not to include the intercept term α) by a grid-search cross-validation on the training set, and compare the mean square error on the held-out test set. Results of the comparison are reported in Table 2. As expected, the model with the four predictors (RF) is the best performer. Interestingly, all the other models R1-R4 perform similarly and present a much higher mean square error when compared to RF, thus suggesting that the best performance is achieved by combining the different experts together and that no expert alone is a good predictor. This is possibly explained by some experts being able to better forecast one scenario (for instance, light winds) and other experts being able to better forecast other scenarios. We conclude with some descriptive analysis. Figure 9 shows the point estimates for the coefficients β_j. We can interpret as highly influential for the regression the areas of the β_j's with high absolute value, and as negligible the areas with values close to zero. We can highlight some differences among the coefficients in Figure 9. In particular, model M1 seems influential when predicting the tails of the distribution, with negative weights for the left tail and positive weights for the right tail. Model M2 seems to affect all parts of the prediction and, in particular, appears to be the model affecting the median of the distribution the most. Model M3 appears to be, together with M2, the most important model for the prediction: the absolute value of the corresponding regressor β_3 is often very high, with noticeable peaks in the areas predicting the left tail and towards the right tail. Finally, the regressor corresponding to M4 has very low values, resulting in a minor importance in terms of regression influence. Interestingly, the experts providing the most valuable inputs to our regression model are M2 and M3, which incorporate only the temperature covariate and the yearly seasonality, respectively, while M4, which incorporates both, seems to be less important.
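To illustrate the structure of the multi-expert model (23), the sketch below evaluates a prediction of the form α(t) + Σ_j ∫ β_j(t, s) F^-_zij(s) ds on a grid. All names are ours, and the isotonic-regression step is used only as a simple stand-in for the paper's metric projection onto R^{J↑}.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def predict_quantile(alpha, betas, expert_quantiles, grid):
    # alpha: intercept on grid; betas: list of K kernels evaluated on grid x grid
    # expert_quantiles: list of K expert quantile forecasts on the same grid
    pred = alpha.copy()
    for beta_j, Fz_j in zip(betas, expert_quantiles):
        pred += np.trapz(beta_j * Fz_j[None, :], grid, axis=1)  # integral of beta_j(t, s) F(s) ds
    # enforce monotonicity of the predicted quantile function
    return IsotonicRegression().fit_transform(grid, pred)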
All in all, the regression model in (23) finds it more effective to combine experts trained on different covariates than to correct an expert already trained on all the covariates. In particular, our insight is that M2 is responsible for centering the median of the output distribution. The tails of the distribution seem to also need the contribution of the seasonality information, given by M3. Finally, we also observe that the left tail of the wind distribution seems the most difficult to predict, requiring very high positive and negative weights across the different models.

In this paper, we propose a novel class of projected statistical methods for distributional data on the real line, focusing in particular on the definition of a projected PCA and a projected linear regression. By investigating the weak Riemannian structure of the Wasserstein space and the transport maps between probability measures, we represent the Wasserstein space as a closed convex cone inside a Hilbert space. Similarly to log methods, our models exploit the possibility of mapping the data into a linear space to perform statistics in an extrinsic fashion. However, instead of using operators like the exp map or some kind of boundary projection to return to the Wasserstein space, we rely on a metric projection operator that is more respectful of the underlying metric. By choosing as base point the uniform measure on [0, 1], we are able to efficiently approximate the metric projection operator, so that our models combine the ease of implementation of extrinsic methods with a performance similar to that of intrinsic methods. Further, through a quadratic B-spline approximation, we can greatly reduce the dimensionality of the optimization problems involved, resulting in fast empirical methods. As a byproduct of this approach, we also derive faster numerical routines for the geodesic PCA in Bigot et al. (2017). We study the asymptotic properties of the proposed methods, concluding that, under reasonable regularity assumptions, our projected models provide consistent estimates and the B-spline approximation error becomes negligible. We showcase our approach in several simulation studies and on two real-world data sets, comparing our models to intrinsic and extrinsic ones and to the simplicial approach in Hron et al. (2014), and concluding that the projected PCA and regression constitute valid candidates for performing inference on a data set of distributions.

Although our projected framework has proven viable in many practical situations, some care must be taken when adopting it, especially when performing PCA. In fact, the extrinsic nature of our method might not fit every data set, in which case a more computationally demanding intrinsic PCA might be preferred; see for instance Appendix D.1 for an example where the projected principal directions are not interpretable. On top of that, performing PCA in the Wasserstein space requires more attention than performing the usual Euclidean PCA: as pointed out in Appendix D.2, since principal components are not linear subspaces, decomposing the variance along the directions (i.e., looking at the scores) must be done carefully, making sure that the directions are indeed interpretable. To assist practitioners, in Section 7.2 we have also proposed two scores that quantify the interpretability of the principal directions and the discrepancy between the nested and projected principal components. Several extensions and modifications of our approach are possible.
One possibility is to extend our framework to encompass more models, such as generalized linear models and independent component analysis. Although this should be straightforward in theory, the numerical computations could become more burdensome. Furthermore, as an alternative to our approach based on B-splines approximation, one could use such B-spline expansion only to approximate the metric projection operator. Another interesting line of research would consist in building hybrid approaches (as anticipated in Section 7.2) to analyze distributions in the Wasserstein space, using both extrinsic and intrinsic methods to exploit the advantages of both worlds, while mitigating the disadvantages. We also think that a deeper comparison between the Wasserstein and the simplicial geometries could help practitioners in choosing between them. Finally, as pointed out by an anonymous referee, extensions to encompass measures supported on R d , d > 1, are of great interest. This is surely a very challenging problem, due to the geometric structure of W 2 (R d ). We identify three main obstacles in this sense. First, the map onto the tangent space is not an isometry because the Wasserstein space is curved. Second, we lose the nice characterization of the tangent space and of the image of log µ , so that the metric projection operator becomes harder to derive. Third, the computational cost would greatly increase due to the need of numerically approximating the transport maps needed to compute the distances. Assumptions on x 0 . Let B ε (x 0 ) = {x ∈ H ||x − x 0 || < ε}, a ball of radius ε in H. Given a set C, we refer to aff(C) as the smallest affine subset containing C, found as the intersection of all affine subspaces containing C. Similarly H(C) is the convex hull of C, the smallest convex subset of H containing it. The relative interior of a set C is defined as its interior considering as ambient space aff(C): Throughout our paper we assume that the random variable X is such that (i) there exists x 0 = E[X ] and (ii) x 0 ∈ relint(H(supp(X ))) where supp(X ) is the support of X . These assumptions are indeed quite natural and require that the distribution of X has a well defined barycenter, which is not in a "degenerate" position with respect to the convex hull of its support, which may happen in infinite dimensional Hilbert Spaces. See, for instance, Berezin and Miftakhov (2019) for an example of distributions not verifying this second assumption. Proof of Lemma 1. The proof is divided in two steps. First, we prove that (x 0 + Sp(U k )) ∩ X has dimension k. Then, we show that U x 0 ,k X = (x 0 + Sp(U k )) ∩ X. Without loss of generality, for ease of notation, we perform an affine change of variable so that x 0 = 0, but, with a slight abuse of notation, we keep denoting with X and X the transformed random variable and the convex cone respectively. To prove the first part, let H(X ) be the convex hull of the support of X and aff(H(X )) = K be the smallest affine subset of H containing H(X ). We know by assumption that there is an open ball in K which contains x 0 = 0 and is contained in H(X ). Moreover, for every k ≤ dim(K), Sp(U k ) ⊂ K. Note that we can clearly suppose k ≤ dim(K), otherwise principal components analysis is useless. With this assumption, since x 0 = 0 is in the relative intern of H(X ), we have k = dim(Sp(U k ) ∩ H(X )) ≤ dim(Sp(U k ) ∩ X) ≤ k. Now we prove that a (k, 0)-projected principal component is given by Sp(U k ) ∩ X. 
To prove this, let C * be a (k, 0)-projected principal component and A * = A ∩ X, with A = Sp(U k ): we know (i) x 0 = 0 ∈ A * , (ii) dim(A * ) = k by definition and (iii) A * ⊆ Π X (A), so we have A * ⊂ C * . Since dim(C * ) = k there is C linear subspace of dimension k such that C * ⊂ C. Consider C = C ∩ X: clearly C * ⊂ C , so that A * ⊂ C * ⊂ C . Moreover, A * ⊂ C , which implies A ∩ X ⊂ C ∩ X and thus Sp(A ∩ X) ⊂ Sp(C ∩ X). The proof is concluded if dim(Sp(A ∩ X)) = dim(Sp(C ∩ X)) = k. In fact, in this case A = Sp(A ∩ X) and C = Sp(C ∩ X) which means that A ⊂ C and since dim(A) = dim(C) = k, A and C coincide, proving A * = C * . To prove this final claim, observe that dim(Sp(A ∩ X)) < k implies dim(A ∩ X) < k, which contradicts the proof of the first part of this Lemma. Similarly, dim(Sp(C ∩X)) = k since dim(C * ) = k by hypothesis. Proof of Proposition 1. Now, to prove that Π U x 0 ,k X (x) − x → 0 as k increases, we first notice that, by the properties of the principal components in H we have Π Sp(U k ) (x − x 0 ) k − → x − x 0 for every x ∈ X, which implies Π Sp(U k )+x 0 (x) − x → 0. Then, denote x 1 = Π U x 0 ,1 X (x) and let r k be the line between x 1 and x. Let: Proof of Proposition 2. Again, without loss of generality, for ease of notation, we perform an affine change of variable so that x 0 = 0, but, with a slight abuse of notation, we keep denoting with X and X the transformed random variable and convex cone respectively. We start by noticing that being Π k the orthogonal projection onto a subspace, x − Π k (x)⊥Span(U k ) and thus for v ∈ Span(U k ): and the result follows. Proof of Proposition 4. 1. As shown in the supplementary of Pya and Wood (2015) by standard B-spline formulas we obtain that given f (x) = J j=1 a j ψ k j (x), then f (x) = J j=1 (a j − a j−1 ) · ψ k−1 j (x). Being the B-spline basis function nonnegative by definition, we obtain the result. 2. With k = 2, f (x) on the interval [x j+1 , x j ] has the following expression: and the result follows. Proof of Proposition 5 and 6. We report here Propositions 3.3 and 3.4 of Bigot et al. (2017) , with the notation adapted to our manuscript. In the following H is a separable Hilbert space, X is a closed convex subset of H, X is an X-valued square integrable random variable, x 0 a point in X and k ≥ 1 an integer. Proposition 10 Let U * = {u * 1 , .., u * k } be a minimizer over orthonormal sets U of H of cardinality k, of D x 0 X (X , U ) := Ed 2 (X , (x 0 + Sp(U )) ∩ X), then U x 0 X := (x 0 + Sp(U )) ∩ X is a (k, x 0 )−global principal component of X . Proposition 11 Let U * = {u * 1 , .., u * k } be an orthonormal set such that U * i = {u * 1 , .., u * i } is a minimimizer of D x 0 X (X , U ) over the orthonormal sets of cardinality "i" such that U ⊃ U * i−1 ; then U * x 0 X is a (k, x 0 )−nested principal convex component of X . Applying Propositions 10 and 11 we can obtain equivalent definitions of geodesic and nested PCA as optimization problems in L 2 ([0, 1]). If we fix J ∈ N > 0 and a quadratic B-spline basis {ψ j } J j=1 , we can use Propositions 10 and 11 with X = L 2 ([0, 1]) J↑ and H = L 2 ([0, 1]) J . Thanks to Remark 7 we obtain the results. Proof of Proposition 7. denotes the linear spline basis on the same equispaced grid in [0, 1]. Let f − µ = (F − µ ) , of course it can be seen that f − µ is non-negative. Moreover, it is obvious that f − µ ∈ W ∞ 2 ([0, 1]). 
Then, from De Boor and Daniel (1974) we get that there exist s J such that where C is a constant depending on the interval [0, 1] but not on n. Hence, we can determine the coefficients {λ (J) j }, starting from the spline s J , up to a translation factor. We fix a particular set of coefficients by letting S J (0) = λ By using the previous result, the integral we have that S J (x) − F − µ (x) ≤ CJ −2 for all x which proves the proposition. Proof of Proposition 8. By the Assumptions in Section 6.2.1 and Remark 10 there exists a ball B K in W ∞ 3 ([0, 1]) of radius K for some K > 0, such that each F − i can be ε-approximated by F − i ∈ W ∞ 3 ([0, 1]) with F − ∈ B K . We can suppose that also the eigenvectors of the covariance operator of the generating process belong to such sphere, otherwise we just increase its radius of some finite amount. By Proposition 7 we can choose a spline basis (that is, a number of elements J > 0), such that we get a ε-uniformly good approximation of B K (and thus we can 2ε-approximate its L 2 closure). To lighten notation, thanks to Remark 7 we deliberately confuse R J↑ and the space monotone B-splines with J basis functions, the inner product we are referring to will always be clear by looking at its entries. Now consider the following inequalities, with a J i obtained as 2ε approximations of where the inner product a J i , w is to be intended as the L 2 inner product between the spline function with coefficients a J i and the L 2 function w. Consider now: Similarly: We know that a solution to the problem max w L 2 =1 1 n i F − i , w 2 is given by the first eigenfunction w of the covariance operator of the empirical process. Now we are in the condition to apply results in Dauxois et al. (1982) , or in Qi and Zhao (2011) (with α → 0) to conclude that w converges to the first eigenfunctionw of the covariance operator of the process that generates F − i . By hypothesis, such eigenfunctionw lies in B K and thus can be approximated with our fixed spline basis. Thus for high enough n, also w can be approximated up to 2ε. Let a w be the coefficients of the spline expansion of w spline approximation, that is, w − a w ≤ 2ε. Observe that w 2 − a w E ≤ 2ε, just as a i J ≤ K + 2ε. Thus, up to adding another ε to the approximation error w − a w , we can suppose a w 2 = 1. Hence: Which leads to: Finally, combining the above results and the fact that | max f − max g| ≤ max |f − g| for any pair of real valued functions f and g, we obtain: Thus for instance if we ask that ε < 1, we obtain the desired result with D = 18 · K. Consistency follows since a w −w ≤ a w − w + w −w . Proof of Lemma 2. Since for any x ∈ X we have Π R J↑ (x) → x, for any v ∈ H: . Consider now β n → β in H; we have the inequality: the first term of the right hand side of the inequality can be sent to 0 by increasing J, the other by increasing n. Proof of Proposition 9. We call a i the spline coefficients associated to x i and b i the ones associated to y i . Again we deliberately confuse the spaces where the coefficients live to lighten the notation. Since the penalty term does not depend on the data, we have: Then for some constant K depending on the bounds in the Assumptions, we get: Thus, if J is such that we have ε-approximations of the data, by Cauchy-Schwartz we obtain: Thanks to the results in Prchal and Sarda (2007) , for any ε > 0, if the number of samples is big, Θ and Θ J exist with probability 1 − ε and are unique. 
Since the values of the minimization problems they solve are arbitrarily close, the minimizers converge in R^{J×J} with the metric given by the spline basis.

Strong convergence implies semi-norm convergence. Let Z be an H-valued random variable and C_Z the covariance operator associated with Z, that is, (C_Z f)(s) = ∫ cov(Z(s), Z(t)) f(t) dt. In the following, we denote with ∥·∥_{L^2} the L^2([0, 1]^2) norm. Further, recall that ∥cov(Z(s), Z(t))∥_{L^2([0,1]^2)} = E[∥Z∥^2]. We want to look at the behavior of ∥β̂_PS − β̂_J∥_{C_Z}.

The simplicial approach to distributional data analysis is based on the definition of the Bayes space B^2(I) (Egozcue et al., 2006). Formally, let I ⊂ R be a closed interval; the Bayes space B^2(I) is defined as the set of (equivalence classes of) probability densities p(x) on I (that is, p(x) ≥ 0 and ∫_I p(x) dx = 1) with square integrable logarithm. The Bayes space is endowed with a linear structure starting from the definition of the perturbation and powering operators, which are analogous to the sum and the multiplication by a scalar, and with an inner product. Moreover, one can define an isometric isomorphism between B^2(I) and L^2(I) through the so-called centered log ratio (clr) map, defined as clr(p)(x) = log p(x) − (1/|I|) ∫_I log p(y) dy for every p ∈ B^2(I). The inverse map is defined as clr^{-1}(p̃)(x) = exp(p̃(x)) / ∫_I exp(p̃(y)) dy. Thus, it is possible to define a simplicial PCA and a simplicial regression on the Bayes space starting from the clr map. In particular, let p_1, . . . , p_n be observed densities on the interval I and let p̃_i = clr(p_i). Denote with w_1, . . . , w_k the first k principal directions estimated from the p̃_i's; then a k-dimensional simplicial principal component is obtained by mapping the span of w_1, . . . , w_k back to B^2(I). Similarly, for pdfs {(p_z, p_y)_i}_{i=1}^n, a simplicial regression model is defined starting from the clr-transformed variables. Let Γ̃ denote a functional regression model in L^2 for the variables {(p̃_z, p̃_y)_i}_{i=1}^n; then the simplicial regression is obtained by composing Γ̃ with the clr map and its inverse.

Apart from the different geometries of the Wasserstein and Bayes spaces, which are discussed in Sections 7 and 8, we can highlight one particular drawback of the simplicial approach, which we believe poses a significant limit to its usefulness. In fact, the main assumption is that all the pdfs p_i share the same support, which might not be the case (for instance, it is not the case for our example in Section 8.2). In practice, one may circumvent this need by "padding" all the pdfs to the same support, i.e., considering densities proportional to p_i(x) + ε I[x ∈ I], as in (25), where I[·] denotes the indicator function and the proportionality is due to the need of re-normalizing the p_i's so that they integrate to 1. Another approach consists in considering I as the intersection of all the supports of the different p_i's and truncating all the pdfs to the shared interval I. Both approaches present undesired side effects that can greatly alter the results. The second approach might end up with a very small interval I, so that a lot of information is lost in this pre-processing step. The drawback of the first approach, instead, is due to numerical instability. In fact, one would like ε in (25) to be small in order not to corrupt the true signal, given by p_i. However, considering the transformation in (24), a small ε would cause the clr-transformed pdfs to present some extreme (negative) values in correspondence of the padded regions. Performing PCA on a data set processed in this way would greatly alter the results, as most of the variability of the p_i's would be masked by the difference in their supports.
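For reference, the clr map and its inverse can be implemented in a few lines on a discretization grid of I; the sketch below assumes densities that are strictly positive on I, and the function names are ours.

import numpy as np

def clr(p, x):
    # centered log ratio on the grid x: subtract the average of log p over I
    logp = np.log(p)
    return logp - np.trapz(logp, x) / (x[-1] - x[0])

def clr_inverse(p_tilde, x):
    # exponentiate and renormalize so that the result integrates to 1 over I
    q = np.exp(p_tilde)
    return q / np.trapz(q, x)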
Figure 12: Results for the third scenario. All the panels show the reconstruction error as a function of the number of spline basis functions. From left to right, the results are obtained using the 2, 5 and 10 dimensional PCA. The solid lines represent the mean of 10 independent runs on independent data sets from (26) and the shaded areas represent ± one standard deviation.

Figure 12 shows the results. We can see that the reconstruction errors decrease when the dimension of the principal component increases, both for the simplicial and the projected PCA. Moreover, as the number of B-spline basis functions increases, the performance tends to get slightly worse for both approaches. We believe that this is due to an increased variance in the B-spline estimation of the quantile functions and of the (clr of the) pdfs. In fact, computing the spline approximation for a single function amounts to solving a linear regression problem, and increasing the dimension of the B-spline basis corresponds to increasing the number of regressors. Hence, letting B be the matrix with columns ψ_1, . . . , ψ_J (evaluated on a grid), the variance of the OLS estimate of the coefficients a is proportional to (B^T B)^{-1}. When increasing the number of B-splines, the entries in B^T B become closer to zero, since the support of each spline basis function becomes smaller. This leads to a lower precision (and higher variance) in the estimator for a. Another interesting thing to notice is that the simplicial PCA exhibits a much larger variance in the reconstruction error. This is possibly due to the different degrees of smoothness of the quantile functions and of the pdfs. As the quantile functions are smoother than the pdfs, their B-spline basis expansion should have lower variance and be more similar to the true quantiles.

In this section, we provide additional simulations to verify the consistency results established in Section 6. For the PCA, we consider the two data generating processes in equations (19) (Gaussian) and (21) (DPM). First, we fix J = 20 spline basis functions (as we do throughout Section 7) and let n increase. Then, we also let J increase linearly with n. We estimate the "true" principal directions by simulating 10^5 observations and using 2500 elements in the B-spline basis. Then, for any choice of n and J, we generate another data set, compute the corresponding first two principal directions via the projected PCA and compute the L^2 distance between the "true" directions and the estimated ones.

Figure 13: L^2 distance between the estimated and true principal directions when J = 20, as a function of n. The solid line represents the median and the shaded area a 90% confidence interval estimated from 100 independent repetitions.

Figure 13 shows the case of fixed J for both data generation strategies. It is clear that in both cases the error quickly decreases to zero (observe that both the x and y axes are in log scale), but the convergence speed is surely sub-exponential when looking, for instance, at the second principal direction. When increasing the number of basis elements with n, we consider three strategies, letting J = n/10, n/2 and 9n/10 respectively (rounded to the closest integer). Figure 14 shows the errors between the true and estimated principal directions in this case. Note that the convergence rate looks exponential for both data generating processes for every choice of J = J(n) (increasing with n).
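The variance effect discussed above in terms of B^T B can be checked directly by monitoring, for instance, the trace of (B^T B)^{-1} as the basis grows; the sketch below (grid size and basis construction are our own choices, and SciPy ≥ 1.8 is assumed) returns this rough proxy for the variance of the OLS spline coefficients.

import numpy as np
from scipy.interpolate import BSpline

def coefficient_variance_proxy(n_basis, degree=2, n_grid=200):
    n_interior = n_basis - degree - 1
    interior = np.linspace(0, 1, n_interior + 2)[1:-1]
    knots = np.r_[np.zeros(degree + 1), interior, np.ones(degree + 1)]
    grid = np.linspace(0, 1, n_grid)
    B = BSpline.design_matrix(grid, knots, degree).toarray()
    return np.trace(np.linalg.inv(B.T @ B))  # grows as the basis functions' supports shrink

# e.g. [coefficient_variance_proxy(J) for J in (10, 20, 40, 80)] typically increases with J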
For the Gaussian data generating process, we observe smaller errors (as low as 10^-5 for the first direction and 10^-4 for the second direction) than for the more challenging DPM data set, see Figure 14. For the former data set, using a large number of basis functions such as 9n/10 or n/2 provides a much better fit than using n/10 basis functions on the second principal direction. For the DPM data, the errors are in general two orders of magnitude higher than with Gaussian data. This is likely due to the different data generating process, which results in a more challenging problem. Interestingly, the errors are almost equal for all values of J (when fixing n).

Let us now analyze the projected regression. The independent variables are generated similarly to Section 8, by discretizing the interval [0, 1] into 1,000 equispaced intervals; the value of the quantile function F^-_zi in the j-th interval equals Σ_{k=1}^{j} δ_ik, with (δ_i1, . . . , δ_i1000) ∼ Dirichlet(0.01, . . . , 0.01) + U([0, 5]). We fix the kernel β(t, s) (details are given below) and let the quantile functions be F^-_yi = Π_{L^2([0,1])↑} ∘ Γ_β(F^-_zi) + N(0, (0.1)^2). We consider two different choices of β: a smooth function β_1(t, s) = (t − 1/2)^3 + (s − 1/2)^3, for which we expect that a small number of spline basis functions will give a low error, and a rougher function β_2(t, s), which corresponds to an approximation of β_1 on a 10 × 10 grid. As in the case of the PCA, we present two simulations for each choice of β_i, i = 1, 2, where we first fix the number of spline basis functions to J = 20 while increasing the sample size n, and then compare the performance for various values of J. We do not adopt the same strategy of setting J as a fraction of n, since the number of parameters to estimate grows quadratically with J, which makes the computational cost substantial when J ≥ 100. We measure both the seminorm error ∥β − β̂∥_{C_Z} and the mean square prediction error on an unseen "test" set of 1,000 samples. Figure 15 shows the seminorm error and the prediction error when J = 20 as n increases, while in Figure 16 various values of J are also considered; in both figures, the solid lines represent the median and the shaded areas a 90% confidence interval estimated from 100 independent repetitions. When data are generated from β_1, J = 20 spline basis functions are more than enough (and actually J = 10 would suffice): the seminorm error in Figure 15(a) and Figure 16(a) decays exponentially, while the prediction error reaches the irreducible error with n = 10^3 samples. When data are generated from β_2, the seminorm error does not show the same exponential decay when J = 20 (see Figure 15(b)), but it does for larger values of J; in particular, it seems that the error obtained with J = 50 is the same as that obtained with J = 100, see Figure 16(b). Hence, it is clear that the choice of J is crucial to obtain a fast decay of the error: when the kernel to be approximated is not very smooth, a larger number of spline basis elements is needed, as one would expect. We conclude this discussion by giving practical advice on how to select J for a given data set. Our suggestion is to let J be the smallest value that allows for a reconstruction error smaller than a given threshold, which may depend on the specific inferential task. For instance, if the problem is PCA and the goal is to provide a descriptive analysis of the variability, a (relative) approximation error below 0.05 will typically give satisfactory results.
We conclude this discussion with practical advice on how to select J for a given data set. Our suggestion is to let J be the smallest value that yields a reconstruction error smaller than a given threshold, which may depend on the specific inferential task; a minimal sketch of this selection rule is given at the end of this section. For instance, if the problem is PCA and the goal is to provide a descriptive analysis of the variability, a (relative) approximation error below 0.05 will typically give satisfactory results. If instead the goal is only to perform dimensionality reduction and to work on the scores of a PCA as Euclidean data, one should aim for a lower approximation error, possibly of the order of 10^-4. A similar reasoning applies to the regression: if the goal is mainly to interpret the estimate β̂, a larger reconstruction error can be allowed, while if one is interested in obtaining very accurate predictions, a lower error is preferred. For instance, when β_1 is used to generate the data, the reconstruction error for both dependent and independent variables is below 10^-4 for J ≥ 20, while to reach the same error when β_2 is used one must take J = 100 basis functions.

In this example, the coefficients are drawn as v_{ij} ∼ max{0, N(0, 1)}, independently; see Figure 17 for a random sample from this data generating process, which is described in detail further below. In this case, computing the projected PCA results in an interpretability score IS_k equal to one for k = 1, 2 and equal to zero for k = 3, 4, .... Hence, from the third principal direction onward, the projected PCA does not give any reliable information and, if those directions are needed, a nested PCA could be preferred in this case. Despite the poor interpretability scores from the third direction onward, the reconstruction errors remain good, with NRE_1 = 0.26 and NRE_k ≈ 10^-6 for k ≥ 2. Moreover, the ghost variances GV_k are smaller than 10^-10 for all values of k, so that this particular data set would be a good candidate for the hybrid methods mentioned in Section 7.2. In summary, in our experience, the performance of the projected PCA can suffer in terms of the interpretability of the directions associated with lower variability, but it usually (always, in our examples) gives a reasonable reconstruction error and ghost variance.

Here, we highlight a feature shared by both the projected and the nested PCA: the scores of the projection onto a projected principal component depend on the dimension of the principal component, as already noted in Section 3.1. This can be considered a limitation of those frameworks, because it contributes to the complexity of the analysis: one always has to fix the dimension of the chosen principal component and use the scores obtained accordingly. For instance, the scores, both for the nested and the projected PCA, coincide with the L^2 scores when the dimension of the principal component is equal to the cardinality J of the spline basis. This happens because the principal components are not linear subspaces. As a consequence, the interpretability score of a direction is also dimension-dependent. Hence, the choice of the dimension k must be carried out by balancing (i) a parsimonious representation, (ii) a low reconstruction error, so that the projections onto the principal components yield good approximations of the data, and (iii) the interpretability score of the directions. Thus, as opposed to standard Euclidean PCA, where the (k+1)-th direction does not change the behavior of the data along the previous k directions (i.e., the scores), when performing (any) PCA in the Wasserstein space the whole picture must always be taken into account, both for the nested and the projected PCA, to assess the interpretability of the results. Finally, note that such interpretability might be low for both intrinsic and extrinsic methods, but in that case the Wasserstein metric may simply not be the most adequate choice to capture and explain the variability of the data set.
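As an illustration of the J-selection rule suggested above, the following sketch (Python with NumPy and SciPy; the function names are ours) fits each curve by unconstrained least squares on a clamped cubic B-spline basis and returns the smallest J in a candidate grid whose relative reconstruction error falls below the chosen threshold. Note that the paper's actual fit imposes monotonicity (or other) constraints on the coefficients; a plain least-squares fit is used here only to illustrate the selection rule.

```python
import numpy as np
from scipy.interpolate import BSpline

def design_matrix(t, J, degree=3):
    """Clamped B-spline basis with J elements on [0, 1], evaluated on the grid t."""
    interior = np.linspace(0.0, 1.0, J - degree + 1)
    knots = np.r_[[0.0] * degree, interior, [1.0] * degree]
    return np.column_stack([BSpline(knots, np.eye(J)[j], degree)(t) for j in range(J)])

def relative_error(curves, t, J):
    """Least-squares B-spline fit of each curve, then relative L^2 error on the grid."""
    B = design_matrix(t, J)
    coef, *_ = np.linalg.lstsq(B, curves.T, rcond=None)    # one coefficient column per curve
    return np.linalg.norm(curves - (B @ coef).T) / np.linalg.norm(curves)

def smallest_J(curves, t, tol=0.05, candidates=(5, 10, 20, 50, 100)):
    """Smallest J among the candidates whose relative reconstruction error is below tol."""
    for J in candidates:
        err = relative_error(curves, t, J)
        if err < tol:
            return J, err
    return candidates[-1], err

# toy usage: empirical quantile functions of a few Gaussian samples
rng = np.random.default_rng(42)
t = np.linspace(0.0, 1.0, 200)
curves = np.sort(rng.normal(size=(5, 200)), axis=1)
print(smallest_J(curves, t, tol=0.05))
```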
We thank two anonymous referees for their detailed and helpful reviews, which helped us improve the quality and the clarity of this work. We also thank Riccardo Scimone for helpful feedback and comments on an earlier draft of this paper, and Federico Bassetti, Alessandra Guglielmi and Piercesare Secchi for helpful discussions.

In this simulation, we show how the number of B-spline basis functions affects the inference in our projected PCA and in the simplicial one. In this scenario, the probability measures are simulated as mixtures of beta densities, also known as Bernstein polynomials, as in equation (26), where β(x; a, b) denotes the density of a beta-distributed random variable with parameters (a, b) evaluated at x; an illustrative sketch of such a mixture is given further below. By definition, the p_i's generated from (26) have a fixed support I = [0, 1]. See Figure 11. In this setting, instead, we let µ_i in (20) be the probability measure associated with p_i and not its smoothed version. Hence, in addition to the amount of information lost during the PCA, another factor comes into play: the amount of information that is lost due to the B-spline representation.

Here, we show an example to highlight the limitations of the proposed framework, specifically of the projected PCA. The main idea behind this example is that the projected principal directions will differ from the nested geodesic ones when the data are concentrated around the "borders" of X, as in the trivial example shown in Figure 1. In the Wasserstein case, X is the space of quantile functions, so that the border is composed of functions that are constant on a subset of [0, 1]. Hence, we consider a data generating process that models the quantile functions directly, defined piecewise with a different behavior for t ≤ 0.5 and for t > 0.5.
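For concreteness, the sketch below (Python with NumPy and SciPy) draws a density from a generic Bernstein-polynomial mixture, i.e. a mixture of Beta(j, K − j + 1) densities for j = 1, ..., K. The number of components and the law of the mixture weights used here are illustrative placeholders, since the actual specification is given by equation (26) in the paper and is not reproduced in this excerpt.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def bernstein_mixture_pdf(x, weights):
    """Mixture of Beta(j, K - j + 1) densities, j = 1, ..., K: a Bernstein
    polynomial density with support [0, 1]."""
    K = len(weights)
    components = np.array([beta_dist.pdf(x, j, K - j + 1) for j in range(1, K + 1)])
    return weights @ components

# illustrative draw: K components with Dirichlet weights (placeholder for (26))
rng = np.random.default_rng(1)
K = 10
w = rng.dirichlet(np.ones(K))
x = np.linspace(0.0, 1.0, 500)
p = bernstein_mixture_pdf(x, w)          # a random density with support [0, 1]
```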