key: cord-0612730-3is6o9ay authors: Spilak, Bruno; Hardle, Wolfgang Karl title: Does non-linear factorization of financial returns help build better and stabler portfolios? date: 2022-04-06 journal: nan DOI: nan sha: 3dd5a7b23719120d704bb86012f6fae4abc4a797 doc_id: 612730 cord_uid: 3is6o9ay A portfolio allocation method based on linear and non-linear latent constrained conditional factors is presented. The factor loadings are constrained to always be positive in order to obtain long-only portfolios, which is not guaranteed by classical factor analysis or PCA. In addition, the factors are to be uncorrelated among clusters in order to build long-only portfolios. Our approach is based on modern machine learning tools: convex Non-negative Matrix Factorization (NMF) and autoencoder neural networks, designed in a specific manner to enforce the learning of useful hidden data structure such as correlation between the assets' returns. Our technique finds lowly correlated linear and non-linear conditional latent factors which are used to build outperforming global portfolios consisting of cryptocurrencies and traditional assets, similar to hierarchical clustering method. We study the dynamics of the derived non-linear factors in order to forecast tail losses of the portfolios and thus build more stable ones. Portfolio selection is at the heart of robust wealth management and is one of the main tasks of "quants". Its goal is to find an allocation of capital across various assets. There are two main approaches to this problem. The first is a "model-free" or "data-based" approach (Ľuboš Pástor; 2000) assuming functional forms of the stochastics of asset returns; an approach taken by Markowitz (1952) that led him to critical line algorithms (CLA), which are based on sample estimates of the mean and covariance of the asset returns. Practitioners have recognized the limits of Markowitz approach. A major caveat is that small deviations in the estimate of the returns mean will cause CLA to produce very different portfolios (Michaud and Michaud; . On top, this method requires the inversion of the covariance matrix, which is prone to large errors when the covariance matrix is numerically ill-conditioned (Bailey and López de Prado; 2012) . These instability concerns led practitioners to develop more robust "data-based" approaches, in particular Robust Markowitz , the hierarchical risk parity (HRP) (López de Prado; 2016) and hierarchical clustering asset allocation (HCAA) (Raffinot; 2017 ). Nevertheless, model-free approaches rely mostly on the study of the returns covariance matrix structures and ignore the potential usefulness of "model-based" approaches, as, for example, financial returns modeling like asset pricing models. In this second approach, the investor uses his prior knowledge to build a linear model for the returns as follows: xt = W zt + εt, where t ∈ [1, T ], xt ∈ R d,1 are the asset returns, W ∈ R d,p is the unknown factor loading matrix corresponding to the factors zt ∈ R p,1 for p d. The εt correspond to the idiosyncratic errors. This linear model tells us that the asset returns can be explained by a few specific variables, which are either defined by prior established knowledge about the empirical behavior of average returns , as in the Fama and French model (Fama and French; 1993) ; or they are treated as latent variables and directly estimated from the data as a linear combination of the asset returns, using factors analysis or principal component analysis (PCA). 
This method has the main advantage of requiring no ex ante knowledge of the structure of average returns. On top, factor analysis explains the correlation between assets, which has to be taken into account in order to build a diversified portfolio. In this paper, both approaches are combined to find the optimal portfolio by developing a data-driven factor model for the asset returns. In fact, there is no obvious theoretical justification for the linearity assumption of the factors. Here we modernize traditional factor analysis by using a non-linear autoencoder to explain the returns as a function of non-linear conditional factors. They are discovered in an unsupervised fashion, therefore without any prior knowledge about the asset returns. Autoencoders are classical dimension reduction tools in machine learning and are often described as the non-linear counterpart of PCA. Nevertheless, PCA allows only for a linear lower-dimensional representation of dimension p of the d individual returns and is not easily constrained to guarantee long-only portfolios. This is why we compare the non-linear factor model with convex Non-negative Matrix Factorization (NMF), which constrains the loading matrix W to be non-negative, thus making W interpretable as encoding cluster indicators. Convex NMF extends the classical NMF method to mixed-sign data, such as financial returns. NMF has had great success in various applications such as text mining (Pauca et al.) or pattern recognition (2001), but its application in finance is mostly restricted to price data analysis (Drakakis et al.). Even so, convex NMF can handle mixed-sign data, and its clustering ability, inherent sparsity and enhanced interpretability make it particularly appealing for the long-only portfolio allocation problem. To find the optimal portfolio based on the proposed model, as in Gu et al. (2021), the autoencoder is structured in a specific fashion in order to impose uncorrelated and interpretable factors as well as long-only portfolios, by forcing the factors to be linear combinations of individual equity returns with strictly positive factor loadings, as in convex NMF. Our model is then a non-linear non-negative latent factor model, where the non-linearities manifest through the learning of a boundary between a normal linear and an extreme non-linear market regime. This boundary is used to define signals for tail events whose probability is predicted in order to hedge the associated extreme losses. Our approach is evaluated in two universes: one global portfolio consisting of bonds, stock indices, commodities, Forex pairs and cryptocurrencies, which have been gaining strong interest from traditional investors in the last two years; and one consisting of bonds at various tenors, stock and commodity indices on a 30 year long time frame. Our contribution is threefold. First, we offer a new portfolio allocation method based on a sparse factor-based representation of the asset returns. Second, we give a conditional heteroscedastic model for the activation of the ReLU neurons of the encoder network, which allows us to explain the dynamics of the encoder over time. Finally, we propose a hedging strategy based on the forecast of the probability of activation of the ReLU neurons. Our implementation and results are available on quantlet.de (EmbeddingPortfolio) for reproducible and transparent research.
Our findings show that, on our datasets:
• Our method is successful at finding robust, lowly correlated portfolios out-of-sample across our test sets, and it outperforms classical allocation methods in terms of risk-adjusted returns, and naive strategies in some specific cases, an important incentive for portfolio managers to integrate machine learning into their portfolio allocation methodologies.
• Linear and non-linear NMF have very similar performance in terms of reconstruction error and factor interpretation.
• Non-linear factors of asset returns share the stylized facts of financial returns, such as conditional heteroscedasticity, which allows us to accurately predict the probability of extreme losses at the portfolio level.
The structure of the paper is as follows: first, the model and portfolio allocation methodology are described. In a second part, we present its application with the main findings; finally, we study the dynamics of the ReLU activations and the hedging strategy.

2 Linear and non-linear matrix factorization for portfolio allocation

Here, we present the linear and non-linear factor approaches in formal terms. We start by explaining convex NMF and autoencoders and their differences from PCA, then we describe the specific architecture and the proposed portfolio allocation methods. One main aspect of portfolio management is diversification, which is defined as spreading investments across a variety of assets that have different risk exposures in order to increase the return while reducing the risk of the portfolio. One classical measure of linear co-dependence is correlation; thus diversification aims at finding assets that present low correlation. A simple approach is to build a global portfolio with various asset classes, for example bonds and stocks, since they are driven by different fundamentals. However, in this paper, we are interested in finding such assets without prior economic knowledge. The goal is then to learn how to map the asset returns into p clusters with low inter-cluster correlation and large intra-cluster correlation, that is, clusters must group highly correlated assets and separate lowly correlated ones. A well-known approach to clustering is based on dimension reduction techniques such as PCA. PCA finds an orthogonal projection from the original data space to a lower dimensional linear space, such that the variance of the projected data, known as the principal components, is maximized. PCA guarantees that the principal components are uncorrelated, since they are mutually orthogonal. Nevertheless, PCA suffers from a few well known limitations. Indeed, even though the multivariate normality assumption is not necessary to apply PCA, without it the principal components lose their interpretability power, as the covariance matrix does not fully describe the data in the absence of normality (Jolliffe; 2002). Moreover, if the data lie in a high dimensional feature space with non-linear relationships, PCA has been shown to perform badly, hence the need for non-linear PCA methodologies such as kernel PCA (1997; Bakir et al.) or non-linear autoencoders. Finally, PCA loadings are not necessarily positive, which makes the principal components difficult to interpret as long-only portfolios. Some alternatives are constrained optimization techniques such as non-negative matrix factorization (NMF) or autoencoders, which allow for non-linear representations.
Non-negative matrix factorization (NMF) (Lee and Seung; 1999) is a widely used method in computer vision and natural language processing, which aims at representing the data as a linear combination of basis features. Describing the problem in a matrix factorization framework, with X an n × d data matrix, the goal of NMF, as of PCA, is to find a factor matrix Z and an associated factor loading matrix W, of size n × p and p × d respectively, such that X = ZW + ε. The rank p of the factorization is a hyperparameter and must be carefully chosen so that p ≪ d. We will explain how to select it in Section 3.4.1. The difference between PCA and NMF lies in the constraints imposed on the factorization. Indeed, PCA forces the rows of W to be orthogonal and the columns of Z, the principal components, to be orthonormal, while NMF applies to data matrices X with positive entries only and does not allow negative entries in the matrix factors W and Z. In contrast to PCA, only additive operations are allowed in the linear combination of the factors, which makes NMF more easily interpretable than PCA and intuitively compatible with the notion of parts-based representation. NMF cannot be applied directly to mixed-sign data such as financial returns, thus Ding et al. (2010) proposed a semi-NMF method in which Z is unconstrained while W is restricted to be non-negative. On top, for reasons of interpretability, the vectors defining Z are constrained to lie within the convex column space of X, so that the columns of Z, the factors z_k, can be interpreted as weighted sums of certain data points, that is, z_k = X g_k with non-negative weights g_k. Convex NMF tends to produce very sparse factors W and Z, which effectively allows for a parts-of-whole representation of the data. Ding et al. (2005) show that, in practice, W in convex NMF is close to orthogonal, and that if W is indeed orthogonal (W⊤W = I), then NMF and convex NMF are relaxations of K-means clustering, and soft relaxations in the general case. Thus, NMF is expected to find clusters in the asset universe, with centroids encoded by Z, similar to K-means, and posterior probabilities encoded by W. W can be directly used for cluster assignment, as orthogonality and non-negativity together imply that each row of W has only one non-zero element, which makes W interpretable as encoding cluster indicators, which is not the case for PCA. Let C_j denote the cluster that includes all asset returns projected on factor z_j; it is defined as C_j = {i : 1 ≤ i ≤ d, argmax_k(w_ik) = j}, since w_ij can be interpreted as the posterior probability of asset i belonging to cluster j. A simple autoencoder is a special case of a feed-forward neural network that maps the input into itself, X ≈ X̂, where X̂ = f_θ̂(X), X ∈ R^{n×d} and θ̂ corresponds to the estimated weights of the neural network. The autoencoder consists of two parts: for a data vector x_t ∈ R^d, we have f_θ(x_t) = v_θ_D{u_θ_E(x_t)}, where:
• u_θ_E(x_t) is called the encoder and maps the input of size d into the factors z_t of size p. The encoder acts as a dimension reduction tool and finds the p latent factors that define the lower dimensional space. It estimates the conditional distribution of the latent factors given the input.
• v_θ_D(z_t) is called the decoder and maps the factors z_t back to the original input space. It estimates the conditional distribution of the observation given the latent factors.
An autoencoder is thus a combined neural network of the encoder followed by the decoder network, where the former finds latent features and the latter acts as a generative model.
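To make the cluster assignment rule concrete, the following minimal Python sketch (our own illustration, not the quantlet code; `W` and the asset names are hypothetical inputs) assigns each asset to the factor with the largest loading, as in the definition of C_j above, and leaves unassigned the assets whose loadings are all numerically zero.

```python
import numpy as np

def assign_clusters(W, asset_names, tol=1e-8):
    """Assign each asset to the factor with the largest loading.

    W : (d, p) non-negative loading matrix (rows = assets, columns = factors).
    Returns a dict {factor index j: list of asset names in C_j}; assets whose
    loadings are all (numerically) zero are left unassigned.
    """
    d, p = W.shape
    clusters = {j: [] for j in range(p)}
    unassigned = []
    for i in range(d):
        if W[i].max() <= tol:          # asset not explained by any factor
            unassigned.append(asset_names[i])
            continue
        j = int(np.argmax(W[i]))       # w_ij read as posterior P(asset i in cluster j)
        clusters[j].append(asset_names[i])
    return clusters, unassigned

# toy example with d = 4 assets and p = 2 factors
W = np.array([[0.9, 0.0],
              [0.7, 0.1],
              [0.0, 0.8],
              [0.0, 0.0]])
print(assign_clusters(W, ["UK_B", "GE_B", "BTC", "CNY_FX"]))
```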
u_θ_E(x_t) and v_θ_D(z_t) can take the form of any feedforward neural network. In this paper, the autoencoder is restricted to a single-layer encoder and decoder, as represented in Figure 1, to keep a model with a similar number of free parameters as in convex NMF. We explain in the next paragraph what follows from this choice. The encoder is u_θ_E(x_t) = σ(H⊤x_t + b_E), where σ is some non-linear activation function, H ∈ R^{d×p} is the matrix of the encoder weights and b_E ∈ R^p the bias vector. The decoder corresponds to v_θ_D(z_t) = W z_t + b_D, where W ∈ R^{d×p} is the matrix of the decoder weights and b_D ∈ R^d the bias vector. The autoencoder thus learns a non-linear and compressed representation of the d asset returns in a p-dimensional space of factors, z_t. At the same time, it is guaranteed that the factors are interpretable as portfolios, i.e., that they are linear combinations of individual equity returns as in Gu et al. (2021), through the linear decoding layer with weights W. From this approach, the well studied model for the asset returns, which are decomposed into factor based portfolios W z_t + b_D and some errors ε_t = x_t − W z_t − b_D, can easily be retrieved. In fact, the autoencoder can be seen as a non-linear counterpart of PCA in specific cases only. Bourlard and Kamp (1988) show that, when the mean squared error is minimized and the decoder layers are linear, the autoencoder learns to span the same subspace as PCA, even if the encoder layers have sigmoidal activation functions. Nevertheless, the latter is not true when the encoder layers are ReLU operations (Montufar et al.; 2014). Thus, in this paper, ReLU (Nair and Hinton; 2010) is used as the activation function for the encoding layer. Nevertheless, naively training a plain autoencoder is not advisable. Indeed, an autoencoder that simply succeeds at learning to set f_θ(x) = x everywhere is useless (Goodfellow et al.; 2016). In practice, autoencoders are restricted so as to allow them to copy only input that resembles the training data. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data. In the next paragraph, we discuss which restrictions are used in this paper in order to learn a useful representation z of the asset returns x for the portfolio allocation problem. The following constraints are imposed:
1. The factor loading matrix has non-negative entries, w_ik ≥ 0, in order to obtain long-only portfolios.
2. To avoid exploding weights, each column of the factor loading matrix must have unit norm, ||w_:k||_2 = 1.
3. On top, W must be orthogonal, i.e. W⊤W = I, so the autoencoder can be compared to K-means clustering. Indeed, this constraint, associated with the positivity constraint, will ensure that each asset falls into one cluster defined by the non-zero entry in W.
4. The autoencoder is forced to learn lowly correlated portfolios by constraining the non-linear factors (z = σ(H⊤x)) to have a close to diagonal covariance matrix.
That way, intuitively, the number of free parameters in the autoencoder differs from convex NMF only by the bias vectors. As for convex NMF, W is interpreted as encoding cluster indicators, and C_j is easily built for the autoencoder model with the same formula as for convex NMF (Section 2.3).

A regularized loss function

The training of the autoencoder is done by minimizing the mean squared error (MSE). The following minimization program is solved on the training data X = {x_1, . . . , x_T}: min_θ (1/T) Σ_{t=1}^T ||x_t − x̂_t||_2^2, where x̂_t = f_θ(x_t).
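For illustration, here is a minimal sketch of the constrained single-layer architecture described above, assuming a TensorFlow/Keras implementation (the actual quantlet code may differ). The `NonNegUnitNorm` constraint is our own helper enforcing constraints (1) and (2) on the decoder weights; the orthogonality and decorrelation penalties (constraints 3 and 4) are added to the loss as described in the next paragraphs.

```python
import tensorflow as tf

class NonNegUnitNorm(tf.keras.constraints.Constraint):
    """Project the decoder kernel so that W >= 0 and each column of W has unit norm."""
    def __init__(self, axis=1):
        self.axis = axis  # Keras stores the decoder kernel as W', hence axis=1

    def __call__(self, w):
        w = w * tf.cast(w >= 0.0, w.dtype)
        norm = tf.sqrt(tf.reduce_sum(tf.square(w), axis=self.axis, keepdims=True))
        return w / (norm + tf.keras.backend.epsilon())

d, p = 21, 4  # 21 assets and 4 latent factors, as for dataset 1

inputs = tf.keras.Input(shape=(d,))
# single-layer ReLU encoder: z = relu(H'x + b_E)
z = tf.keras.layers.Dense(p, activation="relu", name="encoder")(inputs)
# linear decoder x_hat = W z + b_D with non-negative, unit-norm loadings (long-only portfolios)
outputs = tf.keras.layers.Dense(d, name="decoder",
                                kernel_constraint=NonNegUnitNorm(axis=1))(z)
autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
# the loading matrix W (d x p) is the transpose of the decoder kernel:
# W = autoencoder.get_layer("decoder").kernel.numpy().T
```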
In our case, the constrained problem, with the constraints defined in the previous paragraph, must be solved. This is done by direct regularization or via penalties in the loss function. For example, at each weight update, the weights can be forced to verify the non-negativity constraint (1) by projecting, ∀ i ∈ {1, . . . , d}, k ∈ {1, . . . , p}, w_ik ← w_ik 1{w_ik ≥ 0}, and the unit norm constraint (2) with w_ik ← w_ik / (Σ_{i=1}^d w_ik^2)^{1/2}. The factor loading orthogonality and the uncorrelated factors constraint are added to the loss function as penalty terms, so that the full loss is the MSE plus the penalty terms (4), (5) and (6), weighted by λ1, λ2 and λ3. Σ_z ∈ R^{p×p} is the sample covariance matrix of the batch of factors Z = {z_{t−b+1}, . . . , z_t} at some training iteration, and (4), (5), (6) correspond respectively to the classic L1 regularization to control the number of free parameters, the soft orthogonality penalty used in (Xie et al.; 2017; Bansal et al.) and the uncorrelated factors constraint, which penalizes off-diagonal terms of the factor covariance. The training of the autoencoder is done with the Adaptive moment estimation algorithm (Adam) (Kingma and Ba;), a variant of Stochastic Gradient Descent (SGD) which computes adaptive learning rates for the neural network parameters. On top, before optimizing the objective, we introduce more regularization. Indeed, as we describe in Section 3.4, overfitting is controlled by reducing the effective size of the parameter dimensions with the widely used Early Stopping technique (Yao et al.; 2007), monitoring the loss on the validation set. On top, batch normalization (Ioffe and Szegedy; 2015) is performed on the encoder, which ensures a stable factor covariance matrix across batches. Indeed, before computing Ω3, the factors are centered and scaled, allowing for a stable penalization across batches. Finally, as a last regularization technique, the autoencoder is trained using a deep ensemble approach with multiple random seeds (15, for example) to initialize the neural network parameters. This is a classical method to improve the stability of the results, because optimization via SGD can cause different seeds to settle at different optima (Hansen and Salamon; 1990). As in any gradient descent optimization framework, the parameters of the autoencoder must be initialized. A popular approach for neural network weight initialization is the Glorot or Xavier initialization, which induces an equal variance of the activations across layers (Glorot and Bengio;), but for ReLU activations the He initialization is preferred, as it accounts for the ReLU non-linearity. The layer biases are usually initialized with ones. In this paper, the initialization is done by first performing a convex NMF on the data and replacing the weight matrices W and H by their NMF solution. This method, similar to what is used in transfer learning with a pretraining phase or finetuning, drastically improves the convergence rate. The proposed approach requires the tuning of a few hyperparameters:
• The most important hyperparameter is the number of latent factors p. As p approaches d, the reconstruction error will decrease; nevertheless, any meaningful information from the factors is lost and a model with a large p overfits.
• The learning rate, η, for the weight updates in gradient descent.
• Finally, the penalty parameters for regularization, λ1, λ2 and λ3, which control the trade-off between MSE minimization and weight decay, orthogonality preservation and correlation penalization, respectively.
Section 3.4.1 explains how to tune these hyperparameters.
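The penalty terms and projection steps can be sketched as follows in plain NumPy. This is our own reading of the text: the exact functional forms (e.g. squared Frobenius norm for the soft orthogonality, squared off-diagonal covariance terms, L1 applied to the loadings) are assumptions, since the original formulas are not reproduced above.

```python
import numpy as np

def l1_penalty(W, lam1):
    # classic L1 regularization on the loadings (4)
    return lam1 * np.abs(W).sum()

def soft_orthogonality(W, lam2):
    # soft orthogonality penalty ||W'W - I||_F^2 (5), as in Xie et al. (2017)
    p = W.shape[1]
    return lam2 * np.linalg.norm(W.T @ W - np.eye(p), ord="fro") ** 2

def decorrelation_penalty(Z, lam3):
    # penalize off-diagonal terms of the factor covariance (6);
    # Z holds one batch of (centered and scaled) factors, shape (batch, p)
    cov = np.cov(Z, rowvar=False)
    off_diag = cov - np.diag(np.diag(cov))
    return lam3 * (off_diag ** 2).sum()

def project_loadings(W):
    # projection enforcing non-negativity (1) and unit-norm columns (2)
    W = np.where(W >= 0.0, W, 0.0)
    norms = np.sqrt((W ** 2).sum(axis=0, keepdims=True))
    return W / np.maximum(norms, 1e-12)
```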
2.5 Model-free approaches to the portfolio allocation problem

After the 2008 crisis, a lot of attention has been given to the risk contribution portfolio (RCP). The RCP allocates risk among the assets instead of capital, following a risk budget rule b ∈ R^d defined by the portfolio manager, with respect to a certain risk measure (Maillard et al.; Roncalli; 2013). The goal of the RCP is then to find the capital allocation where the marginal risk contribution of each asset corresponds to its predefined budget. Let us consider the volatility of a portfolio as risk measure, given by σ(a) = (a⊤Σa)^{1/2}, where a is the capital allocation. Euler's theorem for homogeneous functions of degree one gives the following additive decomposition of the variance: a⊤Σa = Σ_{i=1}^d a_i(Σa)_i. The relative risk contribution (RRC) is the following normalized version: RRC_i = a_i(Σa)_i / (a⊤Σa), so that Σ_{i=1}^d RRC_i = 1. The Risk Parity Portfolio (RPP) or Equal Risk Contribution (ERC) portfolio corresponds to the simplest risk budget rule, where the risk contribution is the same for all assets in the portfolio; that is, the goal is to find a, with Σ_{i=1}^d a_i = 1 and a_i ≥ 0 for all i, such that RRC_i = 1/d for all i. When the assets are perfectly uncorrelated, Maillard et al. show that the RPP is equivalent to an inverse variance allocation portfolio. More generally, the risk budgeting portfolio or risk contribution portfolio attempts to allocate the risk according to the risk budget b. The goal is to find a such that RRC_i = b_i for all i, with Σ_{i=1}^d b_i = 1, b_i > 0 and Σ_{i=1}^d a_i = 1, a_i ≥ 0 for all i. The Hierarchical Risk Parity (HRP) approach (López de Prado; 2016) introduces a hierarchy in the correlation matrix to build a diversified portfolio. An advantage of HRP is that it does not require the inversion of the covariance matrix to compute asset weights. The algorithm consists of three steps. First, by using tree-clustering, the algorithm finds a linkage matrix based on the correlation to group assets together in clusters. Then, one performs quasi-diagonalization of the correlation matrix to place assets that are highly correlated together along the first diagonal, and the ones that are not correlated far apart from each other, in order to obtain a quasi-diagonal correlation matrix. When the covariance matrix is diagonal, the optimal allocation, for a Markowitz type investor, corresponds to inverse-variance weights (see Chapter 19 in Härdle and Simar (2019)). Thus, the final step of the algorithm uses a top-down recursive bisection to split allocations between adjacent clusters in inverse proportion to their aggregated variances. Nevertheless, as has been underlined by Raffinot (2017), on top of suffering from certain disadvantages of minimum spanning trees (MST) and single linkage with chaining, HRP does not leverage the structure of the dendrogram obtained from tree-clustering. Indeed, the MST is only used to reorder assets, but the dendrogram is not used when building the allocation with bisection; only the number of assets matters. Raffinot (2017) overcomes that issue by directly using the dendrogram shape for the bisection in his Hierarchical Clustering Asset Allocation (HCAA) method. In this paper, we propose to use the graph edges of v_θ_D (the decoding part of the neural network presented in Figure 1), that is, the factor loading matrix W, for asset selection.
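As a reference for the model-free benchmarks, the risk contribution and inverse-variance quantities defined above can be computed as in the following short NumPy sketch (our own illustration with a toy covariance matrix).

```python
import numpy as np

def relative_risk_contributions(a, Sigma):
    """RRC_i = a_i (Sigma a)_i / (a' Sigma a); the RRCs sum to 1."""
    marginal = Sigma @ a
    return a * marginal / float(a @ marginal)

def inverse_variance_weights(Sigma):
    """Minimum-variance (Markowitz-type) weights when Sigma is (quasi-)diagonal."""
    iv = 1.0 / np.diag(Sigma)
    return iv / iv.sum()

# toy example with three uncorrelated assets
Sigma = np.diag([0.04, 0.01, 0.09])
a = inverse_variance_weights(Sigma)
print(a, relative_risk_contributions(a, Sigma))
```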
2.6 Factor models for diversified portfolio allocation

2.6.1 Convex NMF and autoencoder as interpretable factor models

The linear and non-linear models give a clustered representation of the correlation structure between the assets, as HRP does, but with two main advantages. Indeed, neural networks are directed graphs and, in our autoencoder, the factor loading matrix W corresponds to the edges of the graph. It allows us to directly link assets and relate them to their representation on the subspace defined by z. As the dimensions of z are forced to have low correlation, the covariance of the cluster portfolios will be diagonal or quasi-diagonal, as in HRP. On top, our autoencoder gives a model for the returns. Thus, the selection of the assets that explain most of the correlation structure in the basket is clear. Those assets will have a higher loading: the highest being 1 and the lowest 0, due to the unit norm and positivity constraints on the loadings. In other words, the autoencoder learns to assign 0 weight to the assets that do not explain any correlation, as long as the number of factors p is not too large. Economically, this behavior can be desirable, as having too many assets in a frequently rebalanced portfolio causes high transaction costs if the turnover is high. It is worth noting that the linearity of the decoder makes the resulting factors interpretable, and they may be studied over time. To obtain the wealth allocation across the assets, one possibility is to build a naive allocation as in HRP, since the autoencoder gives a quasi-diagonalization of the explained asset returns covariance matrix. Nevertheless, as stated previously, the autoencoder directly gives the relative importance of each asset in one cluster for explaining the intra-cluster correlation. It is then natural to build an allocation which relies first on the intra-cluster correlation importance, expressed by the factor loading w_ij. This approach leverages the learned structure of the autoencoder. Second, we apply a risk allocation similar to HRP based on an inter-cluster measure such as inverse-variance, equal risk contribution (ERC) or another risk measure like Value-at-Risk (VaR) or Conditional Value-at-Risk (CVaR), also called Expected Shortfall (ES).

AutoEncoder and convex NMF Risk Parity - AERP and NMFRP

First, we present the convex NMF (NMFRP) and AutoEncoder Risk Parity (AERP) portfolios, which are naive allocations based on the risk parity approach, similar to HRP. Since the obtained factors present low correlation, the covariance matrix of the associated cluster portfolios is quasi-diagonal. Thus, it makes sense to apply a simple inverse-variance allocation at the cluster level. First, the weight for cluster j is computed, for which the cluster variance σ_j^2 needs to be derived from Σ_j, the covariance matrix of the assets in cluster j. As the cluster covariance is quasi-diagonal, the inverse-variance weight is optimal for a Markowitz type investor (see Chapter 19 in Härdle and Simar (2019)), which gives the cluster weight c_j = σ_j^{-2} / Σ_{k=1}^p σ_k^{-2}. Then, as in HRP, the intra-cluster weight for asset i is set to the inverse-variance weight a_i^c = σ_i^{-2} / Σ_{k∈C_j} σ_k^{-2}, where σ_i^2 is the variance of the asset i returns. The final weight for each asset is simply the intra-cluster weight rescaled by its associated cluster weight, that is, a_i = a_i^c × c_j. In other words, the final weight for each asset is the inverse-variance weight at the cluster level rescaled by the individual inverse-variance weight within each cluster.
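A sketch of this two-level inverse-variance scheme follows (our own implementation of the description above; the exact definition of the cluster variance σ_j^2 in the paper may differ, here it is the variance of the intra-cluster inverse-variance portfolio).

```python
import numpy as np

def aerp_weights(returns, clusters):
    """Naive AERP / NMFRP allocation: inverse-variance within clusters,
    inverse-variance of the cluster portfolios across clusters.

    returns  : (T, d) array of asset returns.
    clusters : dict {j: list of asset column indices in C_j}.
    """
    d = returns.shape[1]
    a = np.zeros(d)
    intra, cluster_var = {}, {}
    for j, idx in clusters.items():
        sub = returns[:, idx]
        iv = 1.0 / sub.var(axis=0)
        w = iv / iv.sum()                                   # intra-cluster weights a^c
        cov = np.atleast_2d(np.cov(sub, rowvar=False))
        intra[j], cluster_var[j] = w, float(w @ cov @ w)    # sigma_j^2
    inv_cv = {j: 1.0 / v for j, v in cluster_var.items()}
    total = sum(inv_cv.values())
    for j, idx in clusters.items():
        a[idx] = (inv_cv[j] / total) * intra[j]             # a_i = a_i^c * c_j
    return a
```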
This allocation reflects the disparities between asset classes more accurately than a simple inverse-variance allocation. Indeed, when we are dealing with a global portfolio with various asset classes, for example bonds, stocks, cryptos and Forex, the goal is to take advantage of the diversification effect from the relatively low correlation between asset classes. Nevertheless, they might have different volatility levels. Typically, Forex and bonds have low volatility, while stocks have high and cryptos extremely high volatility. By applying an inverse-variance allocation to this universe, cryptos would have close to 0 allocation, neutralizing any potential diversification effect from the crypto class. When using an allocation at the cluster level, the assets belonging to other clusters are disregarded when computing the individual weights. HRP and the proposed autoencoder based allocation achieve that goal. There are two main issues with the previous approach. First, the intra-cluster weight for asset i, a_i^c, is defined as an inverse-variance weight within cluster j. However, this allocation only makes sense if the cluster covariance is quasi-diagonal. In our model, as in HRP, there is no guarantee that the assets within each cluster have low correlation; on the contrary, the opposite is expected to happen. Thus, the inverse-variance weighting seems inappropriate, and a good alternative could be an ERC allocation within each cluster, which takes into consideration the non-diagonal terms of the intra-cluster covariance matrix. Second, the previous allocation does not leverage the model-based approach, as it does not take into consideration the actual factor loading values w_ij. Indeed, for scaled inputs, factor loadings can be interpreted as the relative importance of factor j for the explanation of the returns of asset i. Thus, our idea is to allocate more weight within each cluster, in terms of risk contribution, to the assets that are well explained by the model. In other words, more risk is allocated to the assets whose correlation is better explained by the factor. That way, we avoid giving arbitrarily more weight to an asset that is not modeled, reducing model risk. In order to find the intra-cluster weight a_i^c, we propose to use an RCP allocation within each cluster with risk budget b_i = w_ij^2 / Σ_{l∈C_j} w_lj^2 for i ∈ C_j, forming the AERC_W allocation for the autoencoder based portfolio and NMFRC_W for the convex NMF one. The risk budget can be simplified to b_i = w_ij^2, i ∈ C_j, for AERC_W, since each column of W has unit norm. Indeed, in classical linear factor analysis, the squared loading w_ij^2 corresponds to the proportion of variance explained by factor j for asset i. Finally, at the cluster level, we keep the inverse-variance allocation from the previous approach, taking advantage of the quasi-diagonalization, and the final portfolio allocation is a_i = a_i^c × c_j.

AutoEncoder and convex NMF Asset Allocation - AEAA and NMFAA

Similar to HCAA, we also propose a simple allocation based on cluster item numbers, denoted AEAA and NMFAA for the autoencoder and the convex NMF based portfolios respectively, where the cluster weights are equal to 1/p and the intra-cluster weights are set within each cluster to a_i^c = 1/|C_j|, where |C_j| is the cardinality of cluster j. We have a_i = a_i^c / p. This method relies only on the clustering result and does not use the embedding W; thus we will compare it with a simple K-means clustering.
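For the AERC_W / NMFRC_W step, the risk budgets and the intra-cluster risk budgeting problem can be sketched as follows. The SciPy-based solver is our own choice (the paper does not specify how the risk budgeting problem is solved numerically).

```python
import numpy as np
from scipy.optimize import minimize

def aercw_budgets(W, cluster_assets, j):
    """Risk budgets b_i = w_ij^2 / sum_{l in C_j} w_lj^2 for assets in cluster j."""
    w2 = W[cluster_assets, j] ** 2
    return w2 / w2.sum()

def risk_budget_weights(Sigma, b):
    """Long-only weights whose relative risk contributions match the budgets b."""
    n = len(b)

    def objective(a):
        rrc = a * (Sigma @ a) / (a @ Sigma @ a)
        return ((rrc - b) ** 2).sum()

    cons = ({"type": "eq", "fun": lambda a: a.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * n
    res = minimize(objective, x0=np.full(n, 1.0 / n), bounds=bounds,
                   constraints=cons, method="SLSQP")
    return res.x
```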
In this section, we apply the linear and non-linear factorization to two specific cases; it can of course be applied to other asset universes. The first universe consists of daily returns from 2016-06-30 until 2021-11-01 of d = 21 assets from five asset classes, including cryptocurrencies. It is relatively easy to get a long history of daily prices for traditional assets, but not for cryptocurrencies. Thus, we simply chose the stock indices, bonds and Forex pairs with the largest market capitalization and liquidity. The prices are collected from the Blockchain Research Center (BRC). The crypto market is decentralized and operates 24/7 all around the world; Forex is also decentralized and open 24 hours a day in different parts of the world, but closed during the weekend (typically from 4 p.m. EST on Friday until 5 p.m. EST on Sunday). Moreover, indices cannot be traded, but they can easily be replicated with futures contracts or ETFs. In the empirical study, we assume that a trader replicates the various index performances via futures, for example. The futures market is regulated and centralized with specific opening and closing times, as is the bond market. For simplicity, the price at 4 p.m. EST, which corresponds to the NYSE closing time, is selected for all assets to compute the daily returns; doing so, price variations of cryptos or Forex outside these hours are not captured. Since dataset 1 has a very short history and the test set consists mostly of the post COVID crash period, which can be considered as a crisis market regime, we propose to investigate how the proposed method performs on a longer history covering multiple market regimes. Alas, it is impossible to test on the cryptocurrency market given its recent creation. That is why we implemented another portfolio on a second universe, which consists of daily returns from 1989-02-01 until 2021-11-10 of 17 assets from three asset classes (bonds at various tenors, stock indices and commodities). The recent prices missing from this dataset are collected from Bloomberg. Finally, this dataset has 8551 daily returns over more than 30 years of history, which allows us to backtest the proposed strategy over multiple market regimes on a relatively long history. The inputs are standardized with the training sample mean µ and standard deviation σ, which avoids having artificially large weights on assets with high volatility. In clustering, it is difficult to find a single metric for model selection, as one main aspect is the interpretability of the clusters, which cannot be assessed quantitatively, making hyperparameter tuning on a large scale difficult. On top, the autoencoder model must have a small enough reconstruction error on unseen data in order to make sense. Thus, for model selection, two metrics are chosen: the reconstruction error, measured with the Root Mean Squared Error (RMSE), and the cluster assignment stability, measured with the Adjusted Rand Index (ARI) (Hubert and Arabie; 1985). The ARI computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned to the same or different clusters in the predicted and true clustering. When the Adjusted Rand Index is close to 0.0, the clustering technique is equivalent to random labeling, independently of the number of clusters and samples; when it is exactly 1.0, the two clusterings are identical. Of course, the true clustering is unknown in our case; nevertheless, here we simply verify that, for a given set of hyperparameters, the selected model is stable in terms of cluster assignment across runs on different block-bootstrap datasets.
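The two model selection metrics can be computed as in the following sketch, using scikit-learn's `adjusted_rand_score` (the label vectors below are hypothetical).

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def reconstruction_rmse(X, X_hat):
    """Out-of-sample reconstruction error of the factor model."""
    return np.sqrt(np.mean((X - X_hat) ** 2))

def clustering_stability(reference_labels, candidate_labels):
    """ARI between a reference cluster assignment and another run's assignment.

    Labels are arrays of length d giving, for each asset, the index of the
    factor with the largest loading. ARI close to 1 means the two runs agree.
    """
    return adjusted_rand_score(reference_labels, candidate_labels)

# toy example: two runs that agree on 20 of 21 assets
ref = np.array([0] * 6 + [1] * 6 + [2] * 5 + [3] * 4)
run = ref.copy(); run[-1] = 2
print(clustering_stability(ref, run))   # high but below 1.0
```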
As convex NMF is much faster to train than autoencoders, the number of factors p is tuned based on the NMF results. N training runs are launched on the cross-validation sets. For each validation set, a random clustering result is selected as the reference ("true") clustering and serves as the comparison base for the other cluster assignments on the same validation set, where the cluster assignment is given by C_j as defined in Section 2.3. The final metrics for model selection are then the average RMSE and ARI across validation sets. The model with the lowest RMSE is selected while controlling for a high ARI (ARI ≥ 0.95). As a final step, the clustering result for the selected hyperparameters must be interpretable and meet the model constraints; in particular, the factor covariance must be quasi-diagonal. Once p is fixed, the other autoencoder hyperparameters can be tuned using the same procedure. In this study, we tested 4 types of portfolios with 2, 3, 4 and 5 factors. We tune the penalty hyperparameters λ1, λ2, λ3 and the learning rate η with a simple grid search, where (λ1, λ2, λ3) ∈ {1e-1, 1e-2, 1e-3, 1e-4, 1e-5}^3 and η ∈ {1e-2, 1e-3, 1e-4}. To summarize, the main steps of the portfolio allocation procedure are as follows:
• Split the sample into a training set for model selection and a test set for out-of-sample evaluation.
For each set of hyperparameters, we launch N = 30 training runs, which takes around 2 hours for dataset 1 on a classical 4 CPU machine with parallelization. We retain the final model with p = 4, λ1 = 1e-3, λ2 = 1e-2, λ3 = 1e-2 and η = 1e-3 for dataset 1, and with p = 5 for dataset 2. For the out-of-sample evaluation of the selected model, inference on the test set is performed by retraining the model every month using the same setup and the previous month as validation set for Early Stopping. The retraining frequency is chosen to simulate a portfolio manager who needs to rebalance his portfolio at a monthly frequency. Let us now present the out-of-sample backtest performance and model evaluation. We backtest the factor based strategies on the test set with a monthly rebalancing period, including 2 bps transaction costs, and the strategy parameters are estimated using the last 252 observations. The choice of this window is not discussed here and is out of the scope of this paper; nevertheless, a yearly window is widely used in the portfolio management literature with daily data. In order to fairly compare the strategies' total returns, the portfolio leverage is set to reach an annualized volatility target of 5%. As proposed by Jaeger et al. (2021), the realized volatility estimate is the maximum of the portfolio volatilities measured over 20 and 60 trading days, σ20 and σ60 respectively. This increases the probability that the strategy will not show higher out-of-sample volatility than the ex ante volatility target. The target leverage is calculated as a_target = σ_target / max(σ20, σ60) and the unnormalized portfolio weights are ã_i = a_target × a_i. The obtained strategies are compared with various benchmarks: the model-free approaches presented in Section 2.5, the equally weighted allocation (a_i = 1/d) and the equally weighted allocation per asset class (a_i = 1/d_c, where d_c is the cardinality of the asset class, as defined in Tables 1 and 2), the classic Markowitz and Robust Markowitz allocations proposed by Härdle et al. (2021), and the SP500, Russell 2000 and EuroStoxx50 indices with volatility target.
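The volatility targeting step translates directly into code (a minimal sketch; the annualization convention of 252 trading days is our assumption).

```python
import numpy as np

def target_leverage(portfolio_returns, sigma_target=0.05, ann_factor=252):
    """a_target = sigma_target / max(sigma_20, sigma_60), as in Jaeger et al. (2021)."""
    r = np.asarray(portfolio_returns)
    sigma_20 = r[-20:].std(ddof=1) * np.sqrt(ann_factor)
    sigma_60 = r[-60:].std(ddof=1) * np.sqrt(ann_factor)
    return sigma_target / max(sigma_20, sigma_60)

def scale_weights(a, portfolio_returns, sigma_target=0.05):
    """Unnormalized weights a_tilde_i = a_target * a_i."""
    return target_leverage(portfolio_returns, sigma_target) * np.asarray(a)
```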
For comparison, the following measures are used:
• the traditional Value-at-Risk (VaR) and Expected Shortfall (ES) at the 5% level, the Sharpe Ratio (SR = (µ − r_f)/σ, where µ and σ are the sample average return and volatility) and the Maximum Drawdown (MDD);
• the Probabilistic Sharpe Ratio (PSR) (Bailey and López de Prado; 2012), which, based on the cdf Z of the standard normal distribution and the sample skewness γ3 and kurtosis γ4, evaluates the probability that the estimated Sharpe ratio is larger than 0 in the presence of non-normal returns;
• the Minimum Track Record Length (minTRL) (Bailey and López de Prado; 2012), which indicates how long a backtest history should be in order to have statistical confidence, at a given level α, that its Sharpe ratio is above 0 (a computational sketch of both measures is given at the end of this section).
One reason could be that these strategies do not achieve diversification, having the largest SSPW after the Markowitz, AERP and NMFRP allocations (0.36, 0.27, 0.93, 0.39 and 0.40 respectively) and a high turnover. This causes a large concentration of capital in assets with low volatility, such as bonds, which did not perform well in this period, as shown in Figure 4. Indeed, this is the main drawback of these two procedures when applied to such investment universes, where assets show different volatility structures. On the other hand, it is clear that the proposed methods outperform the other advanced portfolio allocation approaches in terms of total and risk adjusted return, having a smaller minTRL and a PSR close to 0.9, for AERCW and NMFRCW, and close to 1, for AEAA and NMFAA. On top, they achieve high diversification, with an SSPW closer to the naive equally weighted portfolio, and a lower turnover (Figure 3: backtest statistics comparison on dataset 1). Nevertheless, the AERCW and NMFRCW methods did not provide better protection against the COVID crash, having a larger exposure to the cryptocurrency market. Indeed, the proposed strategies have a relatively larger tail risk compared to the naive ones, as both the Value-at-Risk and Expected Shortfall are much larger. Finally, they have a minTRL of the order of 1000 days, showing that the test set is not long enough to be confident at 95% of obtaining a positive average return with these strategies. Finally, it is worth noticing that the proposed clustering method, which ensures that the associated portfolios have low correlation out-of-sample (Figure 21), enhances the portfolio performance in terms of risk adjusted and total return, which can be observed when comparing HRP with AERP/NMFRP, HCAA with AEAA/NMFAA or KMAA with AEAA/NMFAA, since those methods differ only by the clustering approach. On dataset 2, the pattern is inverted, with the French bonds having more weight than the US ones. The main difference between HRP and the two proposed methods comes from diversification: on average, the proposed methods allocate much more capital to the assets with larger volatilities (commodities and stocks) than HRP, around 10% and below 5% respectively. It is worth noticing that on both datasets the Markowitz allocation achieves a good risk adjusted return. In short, the factor based strategies outperform the hierarchical clustering and Markowitz allocations on both datasets in terms of total and risk adjusted return controlled for non-normality. This indicates that, for a Markowitz type investor, the proposed strategies should be preferred, even if they allocate more weight to assets with larger volatility. This is mainly due to the fact that the allocation is diversified among clusters that present no or low correlation.
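A sketch of the two backtest-quality measures defined at the start of this section, using the standard Bailey and López de Prado (2012) expressions as we recall them (the exact formulas should be checked against the original reference; `returns` is assumed to be a one-dimensional array of periodic portfolio returns).

```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

def probabilistic_sharpe_ratio(returns, sr_benchmark=0.0):
    """PSR: probability that the true Sharpe ratio exceeds sr_benchmark."""
    n = len(returns)
    sr = returns.mean() / returns.std(ddof=1)
    g3 = skew(returns)
    g4 = kurtosis(returns, fisher=False)      # non-excess kurtosis
    denom = np.sqrt(1.0 - g3 * sr + (g4 - 1.0) / 4.0 * sr ** 2)
    return norm.cdf((sr - sr_benchmark) * np.sqrt(n - 1) / denom)

def min_track_record_length(returns, sr_benchmark=0.0, alpha=0.05):
    """minTRL: number of observations needed so that PSR >= 1 - alpha."""
    sr = returns.mean() / returns.std(ddof=1)
    g3 = skew(returns)
    g4 = kurtosis(returns, fisher=False)
    z = norm.ppf(1.0 - alpha)
    return 1.0 + (1.0 - g3 * sr + (g4 - 1.0) / 4.0 * sr ** 2) * (z / (sr - sr_benchmark)) ** 2
```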
However, the factor based strategies show higher or similar tail risk. From the backtest performance measures, the autoencoder based approach does not present any advantage over the linear one. On the contrary, the convex NMF method should be preferred since, on top of being simpler, its portfolio weights are more stable, inducing lower rebalancing costs, which is shown by a lower turnover on the test sets on average over all convex NMF based allocations.

Model interpretability

A first interpretation of the model is based on the analysis of the R² measure, defined for each asset i as R²_i = 1 − Σ_t (x_it − x̂_it)² / Σ_t (x_it − x̄_i)², where x̄_i is the sample mean. Figure 8 (average out-of-sample R² in % on dataset 1 for the autoencoder and convex NMF models) shows that the factors of both the autoencoder and convex NMF models explain very well the returns of UK_B, GE_B, EUR_FX, GBP_FX, MXWD_X, SX5E_X, UKX_X, BTC, ETH and LTC, but not the JPY_FX and CNY_FX Forex pairs, SHCOMP_X, JP_B or NKY_X. On top, the linear model outperforms its non-linear counterpart in terms of explainability. As expected, the corresponding factor loadings, shown in Figure 9, are close to 0. The factor models give cluster portfolios and also automatically select the assets whose returns and correlation structure they can explain. The others should not be included in the portfolio, as they cannot be modeled with the proposed method. This is the main advantage over model-free approaches. Interestingly, the models seem to exclude assets from the Asian region. Figure 9 shows the learned average factor loadings over the various training periods. It can be clearly observed that factor 0 corresponds mostly to a bond portfolio, factor 1 to a stock indices portfolio, factor 2 to a portfolio consisting of EUR_FX, GBP_FX and GOLDS_C, and finally the last factor summarizes the crypto asset class. The CNY_FX pair is not projected on any factor during the whole test period. Let Bond factor denote factor number 0, Stock number 1, Forex number 2 and Crypto factor number 3. This intuition is confirmed by Figure 10 (average Pearson correlation of (input, factor) pairs over the test sets for the autoencoder and convex NMF models). On top, one can clearly see that both the autoencoder and NMF models ensure that the factors have low correlation out-of-sample, where the correlation between each input i and each factor k is measured with the Pearson correlation coefficient. Both models find an intuitive segmentation of the assets in the portfolio, based mostly on the asset class, and are equivalent at least in terms of correlation. The same analysis is performed on dataset 2. Figure 11 (average out-of-sample R² in % for the autoencoder and convex NMF models) shows that the factors on dataset 2 of both the autoencoder and convex NMF models explain very well the returns of the SP500, Russell 2000, EuroStoxx Small, the US 5Y and 10Y bonds and the same tenors for the French bonds. In Figures 12 and 13, 5 clusters with very low pairwise correlation can be identified: the first one represented mainly by the SP500 and Russell 2000, that is, North American stock indices; the second one by stock indices from other regions; the third one by commodities; the fourth one by US bond yields; and the last one by French bond yields.
As for dataset 1, the linear model outperforms its non-linear counterpart in terms of explainability. In the previous sections, we have shown that the autoencoder factors do not differ in terms of interpretation from the convex NMF factors on both test sets. On top, the NMF factors have more explanatory power than their non-linear counterparts. Nevertheless, the non-linear autoencoder can still provide valuable information thanks to its ReLU activations in the encoding layer. Indeed, the autoencoder learns p cutoffs, given by the p ReLU neurons, between a normal linear regime, on the right part of the data distribution, and an extreme non-linear one on the left part. Autoencoders have indeed been very useful for anomaly and outlier detection in various applications; the literature on that topic is vast and we refer the reader to e.g. (Mienye et al.; Lopez-Martin et al.; 2017; Kieu et al.; Zhao et al.; 2017). For each neuron k = 1, . . . , p in the encoding layer, this cutoff moves as more data is presented to the autoencoder during the training phase, depending on how big the subset of observations from the normal and extreme regimes that belong to the same hyperplane is, where this hyperplane is defined as {x ∈ R^d : (H⊤x + b_E)_k = 0}. In the simplest case, the ReLU autoencoder is linear, which means that the boundary of the linearity region is outside the training data. The autoencoder model can explain all past left tail events that belong to the region of linearity, but not the ones falling outside it, where the ReLU is constant and equal to 0. In this region, the loss gradient associated with each observation is 0 and the autoencoder does not learn. Formally, let H and b_E denote the trained weights and bias of the encoder; the linear activations are defined for all t as z_t^lin = H⊤x_t + b_E ∈ R^p. For each neuron k, the true regime classes are defined for all t as c_{k,t} = 1{z^lin_{k,t} ≤ 0}. The goal is to build a classifier that can predict c_{k,t+1} using only the information available at time t. Instead of building d hedging strategies, one for each asset in the portfolio separately, we are left with finding p classifiers, which drastically reduces the difficulty of the problem. As the activations are linear combinations of the asset returns, we expect them to present stylized facts of financial returns such as weak autocorrelation and conditional heteroscedasticity, as shown in Figure 23 in the appendix. Thus, we propose to fit p ARMA(P, Q)-GARCH(p, q) models, one on each activation series. This hypothesis is validated using the Box-Pierce and Ljung-Box (Box and Pierce; 1970) and Lagrange Multiplier (Engle; 1982) tests, on top of a standard analysis of the residuals in Figure 15. We have for each neuron k: z^lin_{k,t} = µ + a_1 z^lin_{k,t−1} + . . . + a_P z^lin_{k,t−P} + ε_t + b_1 ε_{t−1} + . . . + b_Q ε_{t−Q}, with ε_t = σ_t η_t and σ_t² = ω + Σ_{i=1}^{p} α_i ε²_{t−i} + Σ_{j=1}^{q} β_j σ²_{t−j}. For each new out-of-sample AE training, the parameters and the innovation distribution for the ARMA-GARCH model are selected by minimizing the Akaike Information Criterion (AIC) on the last 250 observations of the train set. We tested (P, Q) ∈ {0, . . . , 5}², (p, q) ∈ {0, . . . , 2}² and Normal, skewed-Normal, Student and skewed-Student distributions for the GARCH innovations. Then, the next volatility σ_{t+1} and linear activation z^lin_{k,t+1} are forecast using the fitted parameters on the test set, refitting the model at each new observation.
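A sketch of the per-neuron model fitting and one-step-ahead forecast, using the `arch` package (our own tooling choice; note that `arch` only supports autoregressive mean dynamics, so the MA part of the ARMA(P, Q) mean equation is dropped in this simplified illustration).

```python
import numpy as np
from arch import arch_model

def fit_and_forecast(z_lin, garch_p=1, garch_q=1):
    """Fit an AR(1)-GARCH(garch_p, garch_q) with skewed-Student innovations
    on one linear activation series and forecast the next mean and volatility."""
    y = 100 * np.asarray(z_lin)          # rescaling helps the optimizer on small daily values
    model = arch_model(y, mean="AR", lags=1, vol="GARCH",
                       p=garch_p, q=garch_q, dist="skewt")
    res = model.fit(disp="off")
    # model selection can loop over orders/distributions and keep the lowest res.aic
    fc = res.forecast(horizon=1)
    mu_next = fc.mean.values[-1, 0] / 100
    sigma_next = np.sqrt(fc.variance.values[-1, 0]) / 100
    return res, mu_next, sigma_next
```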
For example, we find that, for the test period from 2020-02-13 to 2020-03-11 of dataset 1, the linear neuron corresponding to the Stock factor follows an ARMA(0, 1)-GARCH(1, 1) with skewed-Student innovations, whose estimated parameters are reported in Table 3. The ACF of the squared standardized residuals in Figure 15 fails to suggest any significant serial correlation or conditional heteroscedasticity in the standardized residual series. On top, the QQ-plot shows that the assumption of a skewed-Student distribution is fair. Thus, the model appears to be adequate in describing the linear dependence in the activation and volatility series. Some discrepancies with the skewed-Student distribution at the tails are present, which could be addressed by using the Generalized Pareto Distribution from Extreme Value Theory, as in (Packham et al.; 2017; Spilak and Härdle). Figure 15 shows the ACF of the squared standardized residuals (left) and the quantile-quantile plot (right). Now, we describe how to obtain hedging signals from the activation forecasts. Let p_{k,t} = P(c_{k,t+1} = 1) denote the class probability. p_{k,t} is the probability of exceedance of the loss series of the linear activation, P(−z^lin_{k,t} ≥ 0), for all t. Based on the prediction of the ARMA-GARCH model, p_{k,t} is estimated at time t using F, the cumulative distribution function of the loss L (Spilak and Härdle; 2020). As for any classifier, a probability threshold u must be defined to obtain the final predicted class ĉ_{k,t+1} = 1{p̂_{k,t} ≥ u}. This threshold is usually taken with respect to the optimal trade-off between type I and II errors given by the False Positive Rate, FPR(u) = P̂(ĉ_t = 1 | c_t = 0), and the False Negative Rate, FNR(u) = P̂(ĉ_t = 0 | c_t = 1), where P̂ is the sample probability. We propose to set the optimal threshold by backtesting the exceedance probability in sample, as for a Value-at-Risk coverage test. That is, for each train set T_k, the predicted tail events {ĉ_{k,t} = 1} are compared with the actual tail events {c_{k,t} = 1} for multiple thresholds, and u is set such that |α̂ − α| is minimal, where α is the expected sample exceedance on the train set T_k, α = P̂(z^lin_{k,t} ≤ 0) = Σ_{t∈T_k} 1{c_{k,t} = 1} / #T_k, and α̂ = Σ_{t∈T_k} 1{ĉ_{k,t} = 1} / #T_k is the expected predicted exceedance. That way, the probability of tail events is well calibrated on the train set, and it should also be the case on the test set if the stationarity assumption of the model holds. For illustration, Figure 16 shows the true and the out-of-sample predicted probabilities on the concatenated test sets of dataset 1 (top) and dataset 2 (bottom). Similar to the tail-risk protection trading strategies from Packham et al. (2017); Spilak and Härdle (2020), some tail events are missed (false negatives), which corresponds to tail events that occur "out of the blue" and cannot be predicted. Nevertheless, some periods with extreme losses are well predicted, in particular during the 2008 crisis and the COVID crash, where the risk build-up in the market is well captured by the activation probability. Finally, some increases of the probability do not correspond to any true tail event (false positives), because of the weak autocorrelation of the linear activations. Based on these predictions and the optimal thresholds calibrated on the respective train sets with the procedure described above, trading signals are built. In the next section, we present the actual results of the hedging strategy.
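The exceedance probability and the threshold calibration can be sketched as follows. This is our reading of the procedure above: p̂ is the probability that the next linear activation is non-positive under the fitted model, and `innov_cdf` stands for the cdf of the fitted standardized innovations.

```python
import numpy as np

def exceedance_probability(mu_next, sigma_next, innov_cdf):
    """p_hat = P(z_lin_{t+1} <= 0) = F((0 - mu_{t+1}) / sigma_{t+1})."""
    return innov_cdf(-mu_next / sigma_next)

def calibrate_threshold(p_hat_train, c_train, grid=None):
    """Choose u so that the predicted exceedance rate matches the sample rate alpha."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    alpha = np.asarray(c_train, dtype=float).mean()          # sample exceedance on the train set
    alpha_hat = np.array([(np.asarray(p_hat_train) >= u).mean() for u in grid])
    return grid[np.argmin(np.abs(alpha_hat - alpha))]

# hedging signal on the test set: c_hat_{k,t+1} = 1{p_hat_{k,t} >= u}
```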
Figures 17 and 18 show that the described hedging approach greatly improves the risk adjusted return performance of the portfolio for both datasets and all portfolio allocation strategies, except the AERP portfolio on dataset 2. The hedging strategies present a drastic improvement for all tail risk measures, such as MDD, skewness, kurtosis, ES and VaR, as expected. We must underline that this outperformance does not come at a greater cost in terms of average return, as the Sharpe Ratio and the Probabilistic Sharpe Ratio are also greatly improved. We can conclude that, on this backtest, the hedging strategy is well calibrated in terms of the type I and type II error trade-off, except for the AERP portfolio on dataset 2, and helps protect the portfolio against extreme losses, such as the ones the market experienced during the COVID-19 crash. Figure 19 reports the AUC (in %), the expected exceedance and the excess exceedance (in % points) of the different classifiers on the test sets. It indicates that the tail event classifiers are better calibrated for the dataset 2 linear activations, while the classifier is over-protecting for the SPX_X and EUR_FX activations on dataset 1. Indeed, the type I and II error trade-off is better captured by the classifiers on dataset 2, as their AUC are larger. It is interesting to see that the worst performance is achieved by the crypto factor classifier. Indeed, the crypto market might be more sensitive to sudden news, which cannot be predicted from a risk build-up pattern in the returns.

In this paper, we build a new long-only portfolio allocation method based on a sparse factor model for the returns, in a linear and a non-linear setting, via convex non-negative matrix factorization and a ReLU autoencoder. Our method outperforms classical portfolio allocation strategies in terms of risk-adjusted return by ensuring that the factors are lowly correlated out-of-sample and that the portfolio is indeed diversified, including assets with low volatility, such as bonds or Forex pairs, and with high volatility, such as cryptocurrencies. We verify that the factors are robust using a block-bootstrap sampling technique, which supports the portfolios' stability across time. Finally, even if we find that the non-linear factors cannot be interpreted differently from their linear counterparts, they still provide valuable information, as the autoencoder learns to distinguish between a normal linear and an extreme non-linear market regime. The portfolios' regime can be predicted by modeling the time series of the ReLU activations, which present the same stylized facts as financial returns, with standard econometric models such as ARMA-GARCH. Forecasting the activation map of the autoencoder allows us to effectively build a successful hedging strategy against extreme losses of the portfolios.

Table 4: Average RMSE (1e-2) and ARI for convex NMF on validation sets from dataset 1. Naturally, as p increases, the RMSE decreases, but this is not indicative that a higher p will provide good factors for the portfolio allocation problem, since for p = 5 the ARI is much lower than for p = 4 (0.88 and 1.00 respectively).
Figure 20: Average consensus matrix over the validation sets of dataset 1 for p = 2, 3, 4, 5, 6 and 7, from left to right. On each heatmap, p + 1 clusters can be identified, the extra cluster corresponding to assets that are not assigned to any factor.
The highest consensus is obtained for p = 4.

B Model validation

References (titles)
The Sharpe ratio efficient frontier
Learning to find pre-images
Can we gain more from orthogonality regularizations in training deep CNNs?
Auto-association by multilayer perceptrons and singular value decomposition
Distribution of residual autocorrelations in autoregressive-integrated moving average time series models
Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy?
Convex and semi-nonnegative matrix factorizations
On the equivalence of nonnegative matrix factorization and spectral clustering
Analysis of financial data using non-negative matrix factorisation, International Mathematical Forum 3(38), Special Issue in honor of Henk van der Vorst
Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation
Common risk factors in the returns on stocks and bonds
Understanding the difficulty of training deep feedforward neural networks
Equity portfolio diversification
Deep Learning
Annals Issue: Financial Econometrics in the Age of the Digital Economy
Neural network ensembles
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification
Comparing partitions
Applied Multivariate Statistical Analysis, 5th edn
Robustifying Markowitz
International Research Training Group 1792 "High Dimensional Nonstationary Time Series"
Batch normalization: Accelerating deep network training by reducing internal covariate shift
Interpretable machine learning for diversified portfolio construction
Principal Component Analysis, 2nd edn
Characteristics are covariances: A unified model of risk and return
Outlier detection for time series with recurrent autoencoder ensembles
Adam: A method for stochastic optimization
Learning the parts of objects by non-negative matrix factorization
Learning spatially localized, parts-based representation
Conditional variational autoencoder for prediction and feature recovery applied to intrusion detection in IoT
Building diversified portfolios that outperform out of sample
The properties of equally weighted risk contribution portfolios
Portfolio selection
Efficient Asset Management: A Practical Guide to Stock Portfolio Optimization and Asset Allocation, 2nd edn
Improved sparse autoencoder based artificial neural network approach for prediction of heart disease
On the number of linear regions of deep neural networks
Rectified linear units improve restricted Boltzmann machines
Tail-risk protection trading strategies
Text Mining using Non-Negative Matrix Factorizations
Investing with cryptocurrencies - evaluating their potential for portfolio allocation strategies
Hierarchical clustering-based asset allocation
Introduction to Risk Parity and Budgeting
Kernel principal component analysis
Tail-risk protection: Machine learning meets modern econometrics, Encyclopedia of Finance
Portfolio selection and asset pricing models
All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation
On early stopping in gradient descent learning
Spatio-temporal autoencoder for video anomaly detection, MM '17