Arimond, Alexander; Borth, Damian; Hoepner, Andreas; Klawunn, Michael; Weisheit, Stefan
Neural Networks and Value at Risk
2020-05-04

Utilizing a generative regime switching framework, we perform Monte-Carlo simulations of asset returns for Value at Risk threshold estimation. Using equity markets and long term bonds as test assets in the global, US, Euro area and UK setting over an up to 1,250 weeks sample horizon ending in August 2018, we investigate neural networks along three design steps relating (i) to the initialization of the neural network, (ii) its incentive function according to which it has been trained and (iii) the amount of data we feed. First, we compare neural networks with random seeding with networks that are initialized via estimations from the best-established model (i.e. the Hidden Markov). We find the latter to outperform in terms of the frequency of VaR breaches (i.e. the realized return falling short of the estimated VaR threshold). Second, we balance the incentive structure of the loss function of our networks by adding a second objective to the training instructions so that the neural networks optimize for accuracy while also aiming to stay in empirically realistic regime distributions (i.e. bull vs. bear market frequencies). In particular, this design feature enables the balanced incentive recurrent neural network (RNN) to outperform the single incentive RNN as well as any other neural network or established approach by statistically and economically significant levels. Third, we halve our training data set of 2,000 days. We find our networks, when fed with substantially less data (i.e. 1,000 days), to perform significantly worse, which highlights a crucial weakness of neural networks in their dependence on very large data sets.

While leading papers on machine learning in asset pricing focus predominantly on returns and stochastic discount factors (Chen, Pelger & Zhu 2020; Gu, Kelly & Xiu 2020), we are motivated by the global Covid-19 virus crisis and the subsequent stock market crash to investigate if and how machine learning methods can enhance Value at Risk (VaR) threshold estimates. In line with Gu, Kelly & Xiu (2020: 7), we would like to open by disclaiming our awareness that "[m]achine learning methods on their own do not identify deep fundamental associations" without human scientists designing hypothesized mechanisms into an estimation problem. Nevertheless, measurement errors can be reduced based on machine learning methods. Hence, machine learning methods employed as means to an end instead of as ends in themselves can significantly support researchers in challenging estimation tasks. In their already legendary paper, Gu, Kelly & Xiu (GKX in the following, 2020) apply machine learning to a key problem in the academic finance literature: measuring asset risk premia. They observe that machine learning improves the description of expected returns relative to traditional econometric forecasting methods based on (i) better out-of-sample R-squared and (ii) forecasts earning larger Sharpe ratios. More specifically, they compare four 'traditional' methods (OLS, GLM, PCR/PCA, PLS) with regression trees (e.g. random forests) and a simple 'feed forward neural network' based on 30k stocks over 720 months, using 94 firm characteristics, 74 sectors and 900+ baseline signals.
Crediting inter alia (i) the flexibility of functional form and (ii) the enhanced ability to prioritize vast sets of baseline signals, they find the feed forward neural networks (FFNN) to perform best. Contrary to results reported from computer vision, GKX further observe that "'shallow' learning outperforms 'deep' learning" (p. 47), as their neural network with 3 hidden layers excels beyond neural networks with more hidden layers. They interpret this result as a consequence of a relatively much lower signal to noise ratio and much smaller data sets in finance. Interestingly, the outperformance of NNs over the other 5 methods widens at portfolio compared to stock level, another indication that an understanding of the signal to noise ratio in financial markets is crucial when training neural networks. That said, while classic OLS is statistically significantly weaker than all other models, NN3 beats all others but not always at statistically significant levels. GKX finally confirm their results via Monte Carlo simulations. They show that if one generated two hypothetical security price datasets, one linear and un-interacted and one nonlinear and interactive, OLS and GLM would dominate in the former, while NNs dominate in the latter. They conclude by attributing the "predictive advantage [of neural networks] to accommodation of nonlinear interactions that are missed by other methods." (p. 47)

Following GKX, an extensive literature on machine learning in finance is rapidly emerging. Chen, Pelger and Zhu (CPZ in the following, 2020) introduce more advanced (i.e. recurrent) neural networks and estimate a (i) non-linear asset pricing model (ii) regularized under no-arbitrage conditions operationalized via a stochastic discount factor (iii) while considering economic conditions. In particular, they attribute the time varying dependency of the stochastic discount factor of about ten thousand US stocks to macroeconomic state processes via a recurrent Long Short Term Memory (LSTM) network. In CPZ's (2020: 5) view "it is essential to identify the dynamic pattern in macroeconomic time series before feeding them into a machine learning model". Avramov et al. (2020) replicate the approaches of GKX (2020), CPZ (2020), and two conditional factor pricing models, Kelly, Pruitt, and Su's (2019) linear instrumented principal component analysis (IPCA) and Gu, Kelly, and Xiu's (2019) nonlinear conditional autoencoder, in the context of real-world economic restrictions. While they find strong Fama French six factor (FF6) adjusted returns in the original setting without real world economic constraints, these returns reduce by more than half if microcaps or firms without credit ratings are excluded. In fact, Avramov et al. (2020: 3) find that when "[e]xcluding distressed firms, all deep learning methods no longer generate significant (value-weighted) FF6-adjusted return at the 5% level." They confirm this finding by showing that the GKX (2020) and CPZ (2020) machine learning signals perform substantially weaker in economic conditions that limit arbitrage (i.e. low market liquidity, high market volatility, high investor sentiment). Curiously though, Avramov et al. (2020: 5) find that the only linear model they analyse, Kelly et al.'s (2019) IPCA, "stands out … as it is less sensitive to market episodes of high limits to arbitrage." Their finding as well as the results of CPZ (2020) imply that economic conditions have to be explicitly accounted for when analysing the abilities and performance of neural networks.
Furthermore, Avramov et al. (2020) as well as GKX (2020) and CPZ (2020) make anecdotal observations that machine learning methods appear to reduce drawdowns. While their manuscripts focused on return predictability, we devote our work to risk predictability in the context of market wide economic conditions. The Covid-19 crisis as well as the density of economic crises in the previous three decades imply that catastrophic 'black swan' type risks occur more frequently than predicted by symmetric economic distributions. Consequently, underestimating tail risks can have catastrophic consequences for investors. Hence, the analysis of risks with the ambition to avoid underestimations deserves, in our view, equivalent attention to the analysis of returns with its ambition to identify investment opportunities resulting from mispricing. More specifically, since a symmetric approach such as the "mean-variance framework implicitly assumes normality of asset returns, it is likely to underestimate the tail risk for assets with negatively skewed payoffs" (Agarwal & Naik, 2004: 85). Empirically, equity market indices usually exhibit, not only since Covid-19, negative skewness in their return payoffs (Albuquerque, 2012, Kozhan et al. 2013). Consequently, it is crucial for a post Covid-19 world with its substantial tail risk exposures (e.g. second pandemic wave, climate change, cyber security) that investors are provided with tools which avoid the underestimation of risks as far as possible. Naturally, neural networks with their near unlimited flexibility in modelling non-linearities appear suitable candidates for such conservative tail risk modelling that focuses on avoiding underestimation. We also regard Giglio & Xiu (2019) and Kozak, Nagel & Santosh (2020) as noteworthy, as are efforts by Fallahgouly and Franstiantoz (2020) and Horel and Giesecke (2019) to develop significance tests for neural networks. Our paper investigates whether basic and/or more advanced neural networks have the capability of underestimating tail risk less often at common statistical significance levels. We operationalize tail risk as Value at Risk, which is the most used tail risk measure in both commercial practice as well as the academic literature (Billio et al. 2012, Billio and Pellizon 2000, Jorion 2005, Nieto & Ruiz 2015). Specifically, we estimate VaR thresholds using classic methods (i.e. Mean/Variance, Hidden Markov Model) as well as machine learning methods (i.e. feed forward, convolutional, recurrent), which we advance via initialization of input parameters and regularization of the incentive function. Recognizing the importance of economic conditions (Avramov et al. 2020, Chen et al. 2020), we embed our analysis in a regime-based asset allocation setting. Specifically, we perform Monte-Carlo simulations of asset returns for Value at Risk threshold estimation in a generative regime switching framework. Using equity markets and long term bonds as test assets in the global, US, Euro area and UK setting over an up to 1,250 weeks sample horizon ending in August 2018, we investigate neural networks along three design steps relating (i) to the initialization of the neural network's input parameters, (ii) its incentive function according to which it has been trained and which can lead to extreme outputs if it is not regularized, as well as (iii) the amount of data we feed. First, we compare neural networks with random seeding with networks that are initialized via estimations from the best-established model (i.e. the Hidden Markov).
We find the latter to outperform in terms of the frequency of VaR breaches (i.e. the realized return falling short of the estimated VaR threshold). Second, we balance the incentive structure of the loss function of our networks by adding a second objective to the training instructions so that the neural networks optimize for accuracy while also aiming to stay in empirically realistic regime distributions (i.e. bull vs. bear market frequencies). This design feature leads to better regularization of the neural network, as it substantially reduces the extreme outcomes that can result from a single incentive function. In particular, this design feature enables the balanced incentive recurrent neural network (RNN) to outperform the single incentive RNN as well as any other neural network or established approach by statistically and economically significant levels. Third, we halve our training data set of 2,000 days. We find our networks, when fed with substantially less data (i.e. 1,000 days), to perform significantly worse, which highlights a crucial weakness of neural networks in their dependence on very large data sets.

Our contributions are fivefold. First, we extend the currently return focused literature of machine learning in finance (Avramov et al. 2020, Chen et al. 2020, Gu et al. 2020) to also focus on the estimation of risk thresholds. Assessing the advancements that machine learning can bring to risk estimation potentially offers valuable innovation to asset owners such as pension funds and can better protect the retirement savings of their members. Second, we advance the design of our three types of neural networks by initializing their input parameters with the best established model. While initializations are a common research topic in core machine learning fields such as image classification or machine translation (Glorot & Bengio, 2010), we are not aware of any systematic application of initialized neural networks in the field of finance. Hence, demonstrating the statistical superiority of an initialized neural network over its non-initialized counterpart appears a relevant contribution to the community. Third, while CPZ (2020) regularize their neural networks via no arbitrage conditions, we regularize via balancing the incentive function of our neural networks on multiple objectives (i.e. estimation accuracy and empirically realistic regime distributions). This prevents any single objective from leading to extreme outputs and hence balances the computational power of the trained neural network in desirable directions. In fact, our results show that amendments to the incentive function may be the strongest tool available to us in engineering neural networks. Fourth, we also hope to make a marginal contribution to the literature on value at risk estimation. Whereas our paper is focused on advancing machine learning techniques and is therefore, following Billio and Pellizon (2000), anchored in a regime based asset allocation setting to account for time varying economic states (CPZ, 2020), we still believe that the nonlinearity and flexible form especially of recurrent neural networks may be of interest to the VaR (forecasting) literature (Billio et al. 2012, Nieto & Ruiz 2015, Patton et al. 2019). Fifth, our final contribution lies in the documentation of weaknesses of neural networks as applied to finance. While Avramov et al.
(2020) subject neural networks to real world economic constraints and find these to substantially reduce their performance, we expose our neural networks to data scarcity and document just how much data these new approaches need to advance the estimation of risk thresholds. Naturally, such a long data history may not always be available in practice when estimating asset management VaR thresholds, and therefore established methods and neural networks are likely to be used in parallel for the foreseeable future.

In section two, we describe our testing methodology including all five competing models (i.e. Mean/Variance, Hidden Markov Model, Feed Forward Neural Network, Convolutional Neural Network, Recurrent Neural Network). Section three describes data, model training, Monte Carlo simulations and baseline results. Section four then advances our neural networks via initialization and balancing the incentive functions and discusses the results of both features. Section five conducts robustness tests and sensitivity analyses before section six concludes. (We acknowledge that most recent statistical advances in Value at Risk estimation have concentrated on jointly modelling Value at Risk and Expected Shortfall and were therefore naturally less focused on time varying economic states (Patton et al. 2019, Taylor 2019, 2020).)

Value at Risk estimation with Mean/Variance approach

When modelling financial time series related to investment decisions, the asset return of portfolio p at time t as defined in equation (1) below is the focal point of interest instead of the asset price, since investors earn the difference between the price at which they bought and the price at which they sold:

r_{p,t} = (P_{p,t} − P_{p,t−1}) / P_{p,t−1}   (1)

Value-at-Risk (VaR) metrics are an important tool in many areas of risk management. Our particular focus is on VaR measures as a means to perform risk budgeting in asset allocation. Asset owners such as pension funds or insurances as well as asset managers often incorporate VaR measures into their investment processes (Jorion, 2005). Value at Risk is defined in equation (2) as the lower bound of a portfolio's return, which the portfolio or asset is not expected to fall short of with a certain probability a within the next period of allocation n:

Pr(r_{p,t+n} < −VaR_t(a)) = 1 − a   (2)

For example, an investment fund indicates that, based on the composition of its portfolio and on current market conditions, there is a 95% or 99% probability it will not lose more than a specified amount of assets over the next 5 trading days. The VaR measurement can be interpreted as a threshold (Billio and Pellizon 2000). If the actual portfolio or asset return falls below this threshold, we refer to this as a VaR breach. The classic mean variance approach of measuring VaR values is based on the assumption that asset returns follow a (multivariate) normal distribution. VaR thresholds can then be measured by estimating the mean and covariance (μ, Σ) of the asset returns by calculating the sample mean and sample covariance of the respective historical window. The 1% or 5% percentile of the resulting normal distribution will be an appropriate estimator of the 99% or 95% VaR threshold. We refer to this way of estimating VaR thresholds as the "classical" approach and use it as the baseline of our evaluation.
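The classical estimate can be summarised in a few lines of code. The following is a minimal sketch, assuming a single asset return series and a square-root-of-time scaling of volatility to the 5-day horizon (a standard simplification that is not spelled out above); function names, parameters and numbers are illustrative.

```python
# Minimal sketch of the classical Mean/Variance VaR estimate.
import numpy as np
from scipy.stats import norm

def mean_variance_var(daily_returns: np.ndarray, alpha: float = 0.05,
                      horizon_days: int = 5) -> float:
    """(1 - alpha) VaR threshold over `horizon_days`, quoted as a positive loss."""
    mu = daily_returns.mean()            # sample mean of the historical window
    sigma = daily_returns.std(ddof=1)    # sample standard deviation
    horizon_mu = horizon_days * mu
    horizon_sigma = np.sqrt(horizon_days) * sigma   # assumed sqrt-of-time scaling
    # The alpha-quantile of the fitted normal distribution gives the threshold.
    return -(horizon_mu + norm.ppf(alpha) * horizon_sigma)

# Example: 95% and 99% 5-day VaR from a 2,000-day window of simulated returns.
rng = np.random.default_rng(0)
window = rng.normal(0.0003, 0.01, size=2000)
print(mean_variance_var(window, 0.05), mean_variance_var(window, 0.01))
```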
This classic approach, however, does not sufficiently reflect the skewness of real world equity markets and the divergences of return distributions across different economic regimes. In other words, the classic approach does not take into account longer term market dynamics, which express themselves as phases of growth or of downside, also commonly known as bull and bear markets. For this reason, regime switching models had grown in popularity well before machine learning entered finance (Billio and Pellizon 2000). In this study, we model financial markets inter alia using neural networks while accounting for shifts in economic regimes (Avramov et al. 2020, Chen et al. 2020). Due to the generative nature of these networks, they are able to perform Monte-Carlo simulation of future returns, which could be beneficial for VaR estimation. In an asset manager's risk budgeting it is advantageous to know about the current market phase (regime) and to estimate the probability that the regime changes (Schmeding et al., 2019). The most common way of modelling market regimes is by distinguishing between bull markets and bear markets. Unfortunately, market regimes are not directly observable, but rather have to be derived indirectly from market data. Regime switching models based on Hidden Markov Models are an established tool for regime based modelling. Hidden Markov Models (HMM), which are based on Markov chains, are models that allow for analysing and representing characteristics of time series such as negative skewness (Ang and Bekaert, 2002; Timmerman, 2000). We employ the HMM for the special case of two economic states, called 'regimes' in the HMM context. Specifically, we model asset returns y_t ∈ R^n (we are looking at n ≥ 1 assets) at time t to follow an n-dimensional Gaussian process with hidden states S_t ∈ {1, 2} as shown in equation (3):

y_t | S_t = s ~ N(μ_s, Σ_s)   (3)

The returns are modelled to have state dependent expected returns μ_s ∈ R^n as well as covariances Σ_s ∈ R^{n×n}. The dynamic of S_t follows a homogenous Markov chain with transition probability matrix A, with a_11 = Pr(S_t = 1 | S_{t−1} = 1) and a_22 = Pr(S_t = 2 | S_{t−1} = 2). This definition describes if and how states change over time. It is also important to note the 'Markov property': the probability of being in any state at the next point in time only depends on the present state, not on the sequence of states that preceded it. Furthermore, the probability of being in a state at a certain point in time is given as π_t = Pr(S_t = 1) and (1 − π_t) = Pr(S_t = 2). This is also called the smoothed state probability. By estimating the smoothed probability π_T of the last element of the historical window as the present regime probability, we can use the model to start from there and perform Monte-Carlo simulations of future asset returns for the next days. This is outlined for the two-regime case in Figure 1 below.

Figure 1: Algorithm for the Hidden Markov Monte-Carlo simulation (for two regimes); step 1: estimate θ = (π_0, A, μ, Σ) from the historical window.
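To make the Hidden Markov Monte-Carlo procedure of Figure 1 concrete, below is a minimal sketch which assumes the HMM parameters (smoothed probability π_T, transition matrix A, regime means and covariances) have already been estimated, e.g. via the Baum-Welch algorithm; all numerical values and names are illustrative.

```python
# Minimal sketch of the two-regime Hidden Markov Monte-Carlo simulation.
import numpy as np

def simulate_hmm_paths(pi_T, A, mus, Sigmas, n_days=5, n_paths=100_000, seed=0):
    """Simulate future daily returns from a fitted k-regime Gaussian HMM.

    pi_T   : (k,) smoothed state probabilities at the end of the window.
    A      : (k, k) transition matrix, A[i, j] = P(S_t = j | S_{t-1} = i).
    mus    : (k, n) state-dependent mean returns.
    Sigmas : (k, n, n) state-dependent covariance matrices.
    Returns an array of shape (n_paths, n_days, n).
    """
    rng = np.random.default_rng(seed)
    k, n = mus.shape
    paths = np.empty((n_paths, n_days, n))
    for p in range(n_paths):
        state = rng.choice(k, p=pi_T)             # draw the present regime
        for t in range(n_days):
            state = rng.choice(k, p=A[state])     # regime transition
            paths[p, t] = rng.multivariate_normal(mus[state], Sigmas[state])
    return paths

# Illustrative bull/bear parameters for one equity index (daily returns).
A = np.array([[0.98, 0.02], [0.05, 0.95]])
mus = np.array([[0.0006], [-0.0008]])
Sigmas = np.array([[[0.00006]], [[0.0003]]])
paths = simulate_hmm_paths(np.array([0.9, 0.1]), A, mus, Sigmas, n_paths=10_000)
weekly = (1.0 + paths).prod(axis=1) - 1.0          # compound to weekly returns
print(np.quantile(weekly, [0.01, 0.05]))           # 99% / 95% 5-day VaR quantiles
```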
When Graves (2013) successfully made use of a Long Short-Term Memory (LSTM) based recurrent neural network to generate realistic sequences of handwriting, he followed the idea of using a Mixture Density Network (MDN) to parametrize a Gaussian mixture predictive distribution (Bishop, 1995). Compared to standard neural networks (Multi-Layer Perceptron) as used by GKX (2020), this network does not only predict the conditional average of the target variable as a point estimate (in GKX's case expected risk premia), but rather estimates the conditional distribution of the target variable. Given the autoregressive nature of Graves' approach, the output distributions are not assumed to be static over time, but are dynamically conditioned on previous outputs, thus capturing the temporal context of the data. We consider both characteristics as being beneficial for modelling financial market returns, which experience a low signal to noise ratio, as highlighted by GKX's results, due to inherently high levels of intertemporal uncertainty. The core of the proposed neural network regime switching framework is a (swappable) neural network architecture, which takes as input the historical sequence of daily asset returns. At the output level, the framework computes regime probabilities and provides learnable Gaussian mixture distribution parameters, which can be used to sample new asset returns for Monte-Carlo simulation. A multivariate Gaussian mixture model (GMM) is a weighted sum of k different components, each following a distinct multivariate normal distribution, as shown in equation (5):

p(y) = Σ_{i=1..k} φ_i N(y; μ_i, Σ_i), with Σ_{i=1..k} φ_i = 1   (5)

A GMM by its nature does not assume a single normal distribution, but naturally models a random variable as being the interleave of different (multivariate) normal distributions. In our model, we interpret k as the number of regimes and φ_i explains how much each regime contributes to the (current) output. In other words, φ_i can be seen as the probability that we are in regime i. In this sense the GMM output provides a suitable level of interpretability for the use case of regime based modelling. With regard to the neural network regime switching model, we extend the notion of a Gaussian mixture by conditioning φ_i via a yet undefined neural network f on the historic asset returns within a window of a certain size. We call this window the receptive field and denote its size by r:

φ_i(t) = f_i(y_{t−r}, …, y_{t−1})   (6)

This extension makes the Gaussian mixture weights dependent on the (recent) history of the time varying asset returns. Note that we only condition φ on the historical returns. The other parameters of the Gaussian mixture, (μ_i, Σ_i), are modelled as unconditioned, yet optimizable parameters of the model. This basically means we assume the parameters of the Gaussians to be constant over time (per regime). This is in contrast to the standard MDN, where (μ_i, Σ_i) are also conditioned on the input and therefore can change over time. Keeping these remaining parameters unconditional is crucial to allow for a fair comparison between the neural networks and the HMM, which also exhibits time invariant parameters (μ, Σ) in its regime shift probabilities. Following Graves (2013), we define the probability given by the network and the corresponding sequence loss as shown in equations (7) and (8), respectively:

Pr(y) = Π_t Pr(y_{t+1} | y_1, …, y_t)   (7)
L(y) = − Σ_t log Pr(y_{t+1} | y_1, …, y_t)   (8)

Since financial markets operate in weekly cycles, with many investors shying away from exposure to substantial leverage during the illiquid weekend period, we are not surprised to observe that model training is more stable when choosing the predictive distribution to not only be responsible for the next day, but for the next 5 days (Hann and Steuer, 1995). We call this forward looking window the lookahead. This is also practically aligned with the overall investment process, in which we want to appropriately model the upcoming allocation period, which usually spans multiple days. It also fits with the intuition that regimes do not switch daily but have stability for at least a week.
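A minimal PyTorch sketch of this mixture construction is given below: a stand-in backbone maps the receptive field of past returns to regime probabilities φ_i(t), while (μ_i, Σ_i) are unconditional trainable parameters and the loss is the negative log-likelihood of the lookahead returns. The layer sizes and the diagonal covariance parameterisation are simplifying assumptions, not the exact design used in our experiments.

```python
# Minimal sketch of the neural network regime switching mixture head.
import torch
import torch.nn as nn

class RegimeMixtureHead(nn.Module):
    def __init__(self, n_assets: int, k_regimes: int = 2, receptive_field: int = 10):
        super().__init__()
        self.k = k_regimes
        # Stand-in for the swappable architecture (FFNN / CNN / LSTM).
        self.backbone = nn.Sequential(
            nn.Flatten(),
            nn.Linear(receptive_field * n_assets, 32), nn.Tanh(),
            nn.Linear(32, k_regimes),
        )
        # Unconditional, time-invariant per-regime parameters (as in the HMM).
        self.mu = nn.Parameter(torch.zeros(k_regimes, n_assets))
        self.log_sigma = nn.Parameter(torch.zeros(k_regimes, n_assets))

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (batch, receptive_field, n_assets) -> regime probabilities phi.
        return torch.softmax(self.backbone(window), dim=-1)

    def loss(self, window: torch.Tensor, future: torch.Tensor) -> torch.Tensor:
        """Negative log-likelihood of the lookahead returns under the mixture.

        future: (batch, lookahead, n_assets) returns to be explained.
        """
        phi = self.forward(window)                                  # (batch, k)
        dist = torch.distributions.Normal(self.mu, self.log_sigma.exp())
        # log p_i(y), summed over assets and lookahead days, per regime.
        log_p = dist.log_prob(future.unsqueeze(2)).sum(dim=(1, 3))  # (batch, k)
        mix = torch.logsumexp(torch.log(phi + 1e-12) + log_p, dim=-1)
        return -mix.mean()
```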
The extended sequence probability and sequence loss are denoted accordingly in equations (9) and (10), where l = 5 is the lookahead:

Pr(y) = Π_t Pr(y_{t+1}, …, y_{t+l} | y_1, …, y_t)   (9)
L(y) = − Σ_t log Pr(y_{t+1}, …, y_{t+l} | y_1, …, y_t)   (10)

An important feature of the neural network regime model is how it simulates future returns. We follow Graves' (2013) approach and conduct sequential sampling from the network. When we want to simulate a path of returns for the next N business days, we do this according to the algorithm displayed in Figure 2. In accordance with GKX (2020), we first focus our analysis on traditional "feed-forward" neural networks before engaging in more sophisticated neural network architectures for time series analysis within the neural network regime model. The traditional model of neural networks, also called Multi-Layer Perceptron, consists of an "input layer" which contains the raw input predictors, one or more "hidden layers" that combine input signals in a nonlinear way, and an "output layer", which aggregates the output of the hidden layers into a final predictive signal. The nonlinearity of the hidden layers arises from the application of nonlinear "activation functions" on the combined signals. We visualise the traditional feed forward neural network and its input layers in Figure 4. We set up our network structure in alignment with GKX's (2020) best performing neural network 'NN3'. The setup of our network is thus given with 3 hidden layers with a decreasing number of hidden units (32, 16, 8). Since we want to capture the temporal aspect of our time series data, we condition the network output on at least a receptive field of 10 days. Even though the receptive field of the network is not very high in this case, the dense structure of the network results in a very high number of parameters (1,698 in total, including the GMM parameters). In between layers, we make use of the tanh activation function.

Convolutional Neural Networks (CNNs) can also be applied within the proposed neural network regime switching model. Recently, CNNs gained popularity for time series analysis, as for example Van den Oord et al. (2015) successfully applied convolutional neural networks on time series data for generating audio waveforms, the state of the art in text-to-speech and music generation. Their adaption of Convolutional Neural Networks, called WaveNet, has shown to be able to capture long ranging dependencies on sequences very well. In its essence, a WaveNet consists of multiple layers of stacked convolutions along the time axis. Crucial features of these convolutions are that they have to be causal and dilated. Causal means that the output of a convolution only depends on past elements of the input sequence. Dilated convolutions are ones that exhibit "holes" in their respective kernel, which effectively means that the filter size increases while being dilated with zeros in between. WaveNet typically is constructed with an increasing dilation factor (doubling in size) in each (hidden) layer. By doing so, the model is capable of capturing an exponentially growing number of elements from the input sequence depending on the number of hidden convolutional layers in the network. The number of captured sequence elements is called the receptive field of the network (and in this sense is equal to the receptive field defined for the neural network regime model). The Convolutional Neural Network (CNN), due to its structure of stacked dilated convolutions, has a much greater receptive field than the simple feed forward network and needs far fewer weights to be trained.
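Below is a minimal sketch of such a stack of causal, dilated 1-D convolutions. The channel counts and layer list are assumptions; with kernel size 3 and dilations doubling per layer, the receptive field grows as 1 + 2 * sum(dilations), reaching 255 days for dilations 1 to 64, consistent with the 255-day receptive field of the network described below.

```python
# Minimal sketch of a WaveNet-style stack of causal, dilated 1-D convolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past elements (left padding)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                       # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

dilations = [1, 2, 4, 8, 16, 32, 64]            # doubling dilation per layer
layers, ch = [], 2                              # e.g. two assets as input channels
for d in dilations:
    layers += [CausalConv1d(ch, 3, kernel_size=3, dilation=d), nn.Tanh()]
    ch = 3
stack = nn.Sequential(*layers)

receptive_field = 1 + 2 * sum(dilations)        # = 255 days
x = torch.randn(1, 2, 400)                      # (batch, assets, days)
print(stack(x).shape, receptive_field)          # time length is preserved
```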
Figure 5 illustrates the network's basic structure as a combination of stacked causal convolutions with a dilation factor of D = 2. To illustrate the idea, the figure restricts the number of hidden layers to 3, whereas our network structure has 7 hidden layers; each hidden layer furthermore exhibits a number of channels, which are not visualized. The backing model presented in this investigation is inspired by WaveNet. We restrict the model to the basic layout, using a causal structure and increasing dilation between layers. The output layer computes the regime predictive distributions by applying a SoftMax function to the hidden layers' outputs. Our network consists of 6 hidden layers, each layer having 3 channels. The convolutions each have a kernel size of 3. In total, the network exhibits 242 weights (including GMM parameters), and the receptive field has a size of 255 days.

As Graves (2013) was very successful in applying LSTMs for generating sequences, we also adapt this approach for the neural network regime switching model. Originally introduced by Hochreiter and Schmidhuber (1997), a main characteristic of LSTMs, which are a subclass of recurrent neural networks, is their purpose-built memory cells, which allow them to capture long range dependencies in the data. From a model perspective, LSTMs differ from other neural network architectures in that they are applied recurrently (see Figure 6). The output from the previous sequence element of the network function serves, in combination with the next sequence element, as input for the next application of the network function. In this sense, the LSTM can be interpreted as being similar to an HMM, in that there is a hidden state which conditions the output distribution. However, the LSTM hidden state not only depends on its previous states, it also captures long term sequence dependencies through its recurrent nature. Perhaps most notably, the receptive field size of an LSTM is not bounded architecture-wise as in the case of the simple feed forward network and the CNN. Instead, the LSTM's receptive field depends solely on the LSTM's ability to memorize the past input. In our architecture we have one LSTM layer with a hidden state size of 5. In total, the model exhibits 236 parameters (including the GMM parameters). The potential of LSTMs was noted by CPZ (2020: 6), who state that "LSTMs are designed to find patterns in time series data and … are among the most successful commercial AIs".

3 Assessment Procedure

We obtain daily price data for stock and bond indices globally and for three major markets (i.e. EU, UK, US) to study the presented regime based neural network approaches on a variety of stock markets and bond markets. For each stock market, we focus on one major stock index. For bond markets, we further distinguish between long term bond indices (7-10 years) and short term bond indices (1-3 years). The markets in scope are thus (1) global, (2) the US, (3) the Euro area (EMU) and (4) the UK. The data dates back to at least January 1990 and ends in August 2018, which means it covers almost 30 years of market development. Hence, the data also accounts for crises like the dot-com bubble in the early 2000s as well as the financial crisis of 2008. This is especially important for testing the regime based approaches. The price indices are given as total return indices (i.e. dividends are treated as being reinvested) to properly reflect market development. The data is taken from Refinitiv's DataStream. Descriptive statistics are displayed in Table 1, whereby Panel A displays a daily frequency and Panel B a weekly frequency.
Mean returns for equities exceed the returns for bonds, whereby the longer bonds return more than the shorter ones. Equities naturally have a much higher standard deviation and a far worse minimum return. In fact, equity returns in all four regions lose substantially more money than bond returns even at the 25th percentile, which highlights that the holy grail of asset allocation is the ability to predict equity market drawdowns. Furthermore, equity markets tend to be quite negatively skewed as expected, while short bonds experience a positive skewness, which reflects previous findings (Albuquerque, 2012, Kozhan et al. 2013) and the inherent differential in the riskiness of both assets' payoffs. [Insert Table 1 about here]

The back testing is done on a weekly basis via a moving window approach. At each point in time, the respective model is fitted by providing the last 2,000 days (which is roughly 8 years) as training data. We choose this long window because neural networks are known to need big datasets as inputs and it is reasonable to assume that eight years simultaneously include times of (at least relative) crisis and times of market growth. Covering both bull and bear markets in the training sample is crucial to allow the model to "learn" these types of regimes. For all our models we set the number of regimes to k = 2. As we back test an allocation strategy with a weekly re-allocation, we set the lookahead for the neural network regime models to 5 days. We further configure the back testing dates to always align with the end of a business week (i.e. Fridays). The classic approach does not need any configuration; model fitting is the same as computing the sample mean and sample covariance of the asset returns within the respective window. The HMM also does not need any further configuration; the Baum-Welch algorithm is guaranteed to converge the parameters into a local optimum with respect to the likelihood function (Baum, 1970). For the neural network regime models, additional data processing is required to learn network weights that lead to meaningful regime probabilities and distribution parameters. An important pre-processing step is input normalization, as it is considered good practice for neural network training (Bishop, 1995). For this purpose, we normalize the input data by x' = (x − mean(x)) / var(x). In other words, we demean the input data and scale them by their variance, but without removing the interactions between the assets. We train the network by using the AdaMax optimizing algorithm (Kingma & Ba, 2014) while at the same time applying weight decay to reduce overfitting (Krogh & Hertz, 1992). The learning rate and number of epochs configured for training vary depending on the model. In general, estimating the parameters of a neural network model is a non-convex optimization problem. Thus, the optimization algorithm might become stuck in an infeasible local optimum. In order to mitigate this problem, it is common practice to repeat the training multiple times, starting off with different (usually randomly chosen) parameter initializations, and then averaging over the resulting models or picking the best in terms of loss. In this paper, we follow a best-out-of-5 approach, which means each training is done five times with varying initialization and the best one is selected for simulation. The initialization strategy, which we will show in chapter 4.1, further mitigates this problem by starting off from an economically reasonable parameter set.
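A minimal sketch of this training configuration is shown below: variance-based input normalization, AdaMax with weight decay, and best-out-of-5 restarts. The learning rate, epoch count, full-batch training and the `RegimeMixtureHead`-style model interface are illustrative assumptions.

```python
# Minimal sketch of input normalization and best-out-of-5 training with AdaMax.
import torch

def normalize(x: torch.Tensor) -> torch.Tensor:
    # Demean and scale by the variance per asset (cross-asset interactions untouched).
    return (x - x.mean(dim=0)) / x.var(dim=0)

def train_best_of_5(make_model, windows, futures, epochs=200, lr=1e-2):
    best_model, best_loss = None, float("inf")
    for restart in range(5):                       # best-out-of-5 restarts
        model = make_model()                       # fresh (random) initialization
        opt = torch.optim.Adamax(model.parameters(), lr=lr, weight_decay=1e-4)
        for _ in range(epochs):
            opt.zero_grad()
            loss = model.loss(windows, futures)    # sequence loss of the model
            loss.backward()
            opt.step()
        if loss.item() < best_loss:                # keep the best restart
            best_model, best_loss = model, loss.item()
    return best_model, best_loss

# normalize() would be applied to the raw input windows before training.
```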
We observe that the in-sample regime probabilities learned by the neural network regime switching models are, compared to those estimated by the HMM based regime switching model, generally comparable in terms of distribution and temporal dynamics. When we set k = 2, the model fits two regimes, nearly invariably one having a positive corresponding equity mean and low volatility, and the other experiencing a low or negative equity mean and high volatility. These regimes can be interpreted as bull and bear market, respectively. The respective in-sample regime probabilities over time also show strong alignment with growth and drawdown phases. This holds true for the vast majority of seeds and hence indicates that the neural network regime model is a valid practical alternative for regime modelling when compared to a Hidden Markov Model. After training the model for a specific point in time, we start a Monte Carlo simulation of asset returns for the next 5 days (one week, Monday to Friday). For the purpose of calculating statistically solid quantiles of the resulting distribution, we simulate 100,000 paths for each model. We do this for at least 1,093 (EMU) and at most 1,250 (global) points in time within the back-test history window. As soon as we have simulated all return paths, we calculate a total (weekly) return for each path. The generated weekly returns follow a non-trivial distribution, which arises from the respective model and its underlying temporal dynamics. Based on the simulations we compute quantiles for Value at Risk estimation. For example, the 0.01 and 0.05 quantiles of the resulting distribution represent the 99% and 95% 5-day VaR metrics, respectively. We evaluate the quality of our Value at Risk estimations by counting the number of breaches of the asset returns. If the actual return is below the estimated VaR threshold, we count this as a breach. For an average performing model, it is reasonable to expect e.g. 5% breaches for a 95% VaR measurement. We compare the breaches of all models with each other. We classify a model as being superior to another model if its number of VaR breaches is lower than that of the compared model. A comparison value comp = 1.0 (= 0.0) indicates that the row model is superior (inferior) to the column model. We perform significance tests by applying paired t-tests. We further evaluate a dominance value, which is defined in equation (11).
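A minimal sketch of this breach-counting evaluation is given below: VaR thresholds are read off as quantiles of the simulated weekly returns, breaches are counted against realized returns, and two models' breach series are compared with a paired t-test. All inputs are placeholders.

```python
# Minimal sketch of VaR threshold extraction, breach counting and a paired t-test.
import numpy as np
from scipy import stats

def var_thresholds(simulated_weekly: np.ndarray, levels=(0.01, 0.05)):
    # simulated_weekly: (n_paths,) Monte-Carlo weekly returns for one back-test date.
    return {lvl: np.quantile(simulated_weekly, lvl) for lvl in levels}

def count_breaches(realized: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    # A breach occurs when the realized weekly return falls below the threshold.
    return (realized < thresholds).astype(float)

# Example: compare two models' breach indicator series with a paired t-test.
rng = np.random.default_rng(1)
realized = rng.normal(0.001, 0.02, size=1250)      # weekly realized returns
thr_a = np.full(1250, -0.031)                      # model A 5% VaR thresholds
thr_b = np.full(1250, -0.028)                      # model B 5% VaR thresholds
breaches_a = count_breaches(realized, thr_a)
breaches_b = count_breaches(realized, thr_b)
print(breaches_a.mean(), breaches_b.mean())        # realized breach frequencies
print(stats.ttest_rel(breaches_a, breaches_b))     # paired t-test across dates
```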
In our view, the three most crucial design features of neural networks in finance, where the sheer number of hidden layers appears less helpful due to the low signal to noise ratio (GKX, 2020), are: the amount of input data, the initializing information and the incentive function. Big input data is important for neural networks, as they need to consume sufficient evidence also of rarer empirical features to ensure that their nonlinear abilities in fitting virtually any functional form are used in a relevant instead of an exotic manner. Similarly, the initialization of input parameters should be based as much as possible on empirically established estimates to ensure that the gradient descent inside the neural network takes off from a suitable point of departure, thereby substantially reducing the risk that a neural network confuses itself into irrelevant local minima. On the output side, every neural network is trained according to an incentive (i.e. loss) function. It is this particular loss function which determines the direction of travel for the neural network, which has no other ambition than to minimize its loss as far as possible. Hence, if the loss function only represents one of several practically relevant parameters, the neural network may arrive at bizarre outcomes for those parameters not included in its incentive function. In our case, for instance, the baseline incentive is just estimation accuracy, which could lead to forecasts dominated much more by a single regime than ever observed in practice. In other words, after a long bull market, the neural network could "conclude" that bear markets do not exist. Metaphorically speaking, a unidimensional loss function in a neural network has little decency (Marcus, 2018). Commencing with the initialization and the incentive functions, we will assess our three neural networks in the following vis-a-vis the classic and HMM approaches, where each of the three networks is once displayed with an advanced design feature and once with a naïve design feature.

If no specific initialization strategy for a neural network is defined, initialization occurs entirely at random, normally via computer-generated random numbers. Where established econometric approaches use naïve priors (i.e. the mean), neural networks originally relied on brute force computing power and a bit of luck. Hence, it is unsurprising that initializations are nowadays a common research topic in core machine learning fields such as image classification or machine translation (Glorot & Bengio, 2010). However, we are not aware of any systematic application of initialized neural networks in the field of finance. Hence, we compare naïve neural networks, which are not initialized, with neural networks that have been initialized with the best available prior. In our case, the best available prior for μ, Σ of the model is the equivalent HMM estimation based on the same window. Such initialization is feasible, since the structure of the neural network, due to its similarity with respect to μ, Σ, is broadly comparable with the HMM. In other words, we make use of the already trained parameters from the HMM training as starting parameters for the neural network training. In this sense, initialized neural networks are not only flexible in their functional form, they are also adaptable to "learn" from the best established model in the field if suitably supervised by the human data scientists. Metaphorically speaking, our neural networks can stand on the shoulders of the giant that the HMM is for regime based estimations.

Table 2 presents the results by comparing breaches between the two classic approaches (Mean/Variance, HMM) and the non-initialized and HMM initialized neural networks across all four regions. Panels A and B display the 1% VaR threshold for equities and long bonds, respectively, while Panels C and D show the equivalent comparison for 5% VaR thresholds. Note that for model training we apply a best-out-of-5 strategy as described in section 3.2. That means we repeat the training five times, starting off with random parameter initializations each time. In case of the presented HMM initialized model, we apply the same strategy, with the exception that μ, Σ of the model are initialized identically for each of the five iterations. All residual parameters are initialized randomly as fits best the neural network part of the model.
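A minimal sketch of this HMM based initialization is shown below, reusing the mixture-head sketch from earlier: the HMM's regime means and (here, for simplicity, only the diagonal of the) covariances are copied into the network's unconditional mixture parameters before training, while all residual weights keep their random initialization.

```python
# Minimal sketch: initialize the network's mixture parameters from HMM estimates.
import numpy as np
import torch

def initialize_from_hmm(model, hmm_mu: np.ndarray, hmm_sigma: np.ndarray):
    """hmm_mu: (k, n) regime means, hmm_sigma: (k, n, n) regime covariances."""
    with torch.no_grad():
        model.mu.copy_(torch.as_tensor(hmm_mu, dtype=model.mu.dtype))
        # Transfer only the diagonal standard deviations (simplification of Sigma).
        diag_std = np.sqrt(np.diagonal(hmm_sigma, axis1=1, axis2=2))
        model.log_sigma.copy_(
            torch.log(torch.as_tensor(diag_std, dtype=model.log_sigma.dtype)))
    return model
```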
Three findings are observable. First, not a single VaR threshold estimation process in a single region and in either of the two asset classes was able to uphold its promise in that an estimated 1% VaR threshold should be breached no more than 1% of the time. This is very disappointing and quite alarming for institutional investors such as pension funds and insurances, since it implies that all approaches, established and machine learning based, fail to sufficiently capture downside tail risks and hence underestimate 1% VaR thresholds. The vast majority of approaches estimate VaR thresholds that are breached in more than 2% of the cases, and the LSTM fails entirely if not initialised. In fact, even the best method, the HMM for US equities, estimates VaR thresholds which are breached in 1.34% of the cases. Second, when inspecting the ability of our eight methods to estimate 5% VaR thresholds, the result remains bad but is less catastrophic. The Mean/Variance approach, the HMM and the initialised LSTM display cases where their VaR thresholds were breached in less than the expected 5%. The Mean/Variance and HMM approaches make their thresholds in 3 out of 8 cases and the initialised LSTM in 1 out of 8. Overall, this is still a disappointing performance, especially for the feed forward neural network and the CNN. (Even though we initialize μ, Σ from the HMM parameters, we still have weights to be initialized arising from the temporal neural network part of the model; we do this on a per-layer level by sampling uniformly with bounds proportional to 1/√i, where i is the number of input units of the layer. We focus our discussion of results on equities and long bonds since these have more variation, lower skewness and hence risk; results for the short bonds are available upon request from the contact author.) Third, when comparing the initialised with the non-initialised neural networks, the performance is like day versus night. The non-initialised neural networks always perform worse, and the LSTM performs entirely dismally without a suitable prior. When comparing across all eight approaches, the HMM appears most competitive, which means that we either have to further advance the design of our neural networks or their marginal value add beyond classic econometric approaches appears nonexistent. To advance the design of our neural networks further, we aim to balance their utility function to avoid the extreme, unrealistic results possible in the single objective (univariate loss) case. [Insert Table 2 about here]

Whereas CPZ (2020) regularize their neural networks via no arbitrage conditions, we regularize via balancing the incentive function of our neural networks on multiple objectives. Specifically, we extend the loss function to not only focus on the accuracy of point estimates but to also give some weight to achieving empirically realistic regime distributions (i.e. in our data sample across all four regions no regime displays more than 60% frequency on a weekly basis). This balanced extension of the loss function prevents the neural networks from arriving at bizarre outcomes such as the conclusion that bear markets (or even bull markets) barely exist. Technically, such bizarre outcomes result from cases where the regime probabilities φ_i(t) tend to converge globally either to 0 or to 1 for all t, which basically means the neural network only recognises one regime.
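One way such a balancing regularizer can be implemented is sketched below. The exact functional form of equation (13) did not survive extraction, so the code is an assumption that merely reproduces the behaviour described in the next paragraph: the term equals 0.5 when both regimes contribute equally on average and tends towards 1 as one regime dominates, and it rescales the original sequence loss so that full imbalance doubles it while full balance adds only 50%.

```python
# Hypothetical sketch of a balancing regularizer consistent with the text;
# the paper's exact equation (13) is not reproduced here.
import torch

def balance_regularizer(phi: torch.Tensor) -> torch.Tensor:
    # phi: (T, k) regime probabilities over the training window.
    avg_per_regime = phi.mean(dim=0)          # average contribution per regime
    return avg_per_regime.max()               # 0.5 if balanced (k = 2), -> 1 if not

def balanced_loss(base_loss: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
    # Conditioning the extension on the original loss keeps the scales comparable.
    return base_loss * (1.0 + balance_regularizer(phi))
```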
To balance the incentive function of the neural network and facilitate balancing between regime contributions, we introduce an additional regularization term reg into the loss function which penalizes unbalanced regime probabilities. The regularization term is displayed in equation (13) below. If bear and bull market have equivalent regime probabilities, the term converges to 0.5, while it converges towards 1 the larger the imbalance between the two regimes becomes. Substituting equation (13) into our loss function of equation (10) leads to equation (14) below, which doubles the point estimation based standard loss function in case of total regime balance inaccuracy but adds only 50% of the original loss function in case of full balance. Conditioning the extension of the loss function on its origin is important to avoid biases due to diverging scales. Setting the additional incentive function to initially have half the marginal weight of the original function also seems appropriate for comparability.

The outcomes of balancing the incentive functions of our neural networks are displayed in Table 3, where Panels A-D are distributed as previously in Table 2. The results are very encouraging, especially with regard to the LSTM. The regularized LSTM is in all 32 cases (i.e. 2 thresholds, 2 asset classes, 4 regions) better than the non-regularized LSTM. For the 5% VaR thresholds, it reaches realized occurrences of less than 4% in half the cases. This implies that the regularized LSTM can even be more cautious than required. The regularized LSTM also sets a new record for the 1% VaR threshold. [Insert Table 4 about here]

To measure how much value the regularized LSTM can add compared to alternative approaches, we compute the annual accumulated costs of breaches as well as the average cost per breach. They are displayed in Table 5 for the 5% VaR threshold. The regularized LSTM is for both numbers in every case better than the classic approaches (Mean/Variance and HMM) and the difference is economically meaningful. For equities, the regularized LSTM results in annual accumulated costs 97-130 basis points lower than the classic Mean/Variance approach, which would be up to over one billion US$ of avoided losses per annum for a > US$100 billion equity portfolio of a pension fund such as CalPERS or PGGM. Compared to the HMM approach, the regularized LSTM avoids annual accumulated costs of 44-88 basis points, which is still a substantial amount of money for the vast majority of asset owners. With respect to long bonds, where total returns are naturally lower, the regularized LSTM's avoided annual costs against the Mean/Variance and the HMM approach range between 23-30 basis points, which is high for bond markets. [Insert Table 5 about here]

These statistically and economically attractive results have been achieved, however, based on 2,000 days of training data. Such "big" amounts of data may not always be available for newer investment strategies. Hence, it is natural to ask if the performance of the regularized neural networks drops when fed with just half the data (i.e. 1,000 days). Apart from reducing statistical power, a period of just over 4 years may also comprise less information on downside tail risks. Indeed, the results displayed in Table 6 show that for all combinations of VaR thresholds and asset classes, the regularized networks trained on 2,000 days substantially outperform and usually dominate their equivalently designed neural networks trained with half the training data.
Hence, the attractive risk management features of HMM initialised, balanced incentive LSTMs are likely only available for established discretionary investment strategies where sufficient historical data is available or for entirely rules-based approaches whose history can be replicated ex-post with sufficient confidence. [Insert Table 6 about here]

We further conduct an array of robustness tests and sensitivity analyses to challenge our results and the applicability of neural network based regime switching models. As a first robustness test, we extend the regularization in a manner that the balancing incentive function of equation (13) has the same marginal weight as the original loss function instead of just half the marginal weight. The performance of both types of regularized LSTMs is essentially equivalent. Second, we study higher VaR thresholds such as 10% and find the results to be very comparable to the 5% VaR results. Third, we estimate monthly instead of weekly VaR. Accounting for the loss of statistical power in comparison tests due to the lower number of observations, the results are equivalent again. We conduct two sensitivity analyses. First, we set up our neural networks to be regularized by the two balancing incentive functions but without HMM initialisation. The results show that the regularization enhances performance compared to the naïve non-regularized and non-initialized models, but that both design features are needed to achieve the full performance. In other words, initialization and regularization seem to be additive design features in terms of neural network performance. Second, we run analytical approaches with k > 2 regimes. Adding a third or even fourth regime when asset prices only know two directions leads to substantial instability in the neural networks and tends to depreciate the quality of results.

Inspired by GKX's (2020) and CPZ's (2020) findings, we investigated neural networks for Value at Risk threshold estimation along three design steps: initialization, incentive function and amount of training data. First, we found neural networks initialized via estimations from the best-established model (i.e. the Hidden Markov) to outperform randomly seeded networks in terms of the frequency of VaR breaches. Second, balancing the incentive structure of the loss function enables the balanced incentive recurrent neural network (RNN) to outperform the single incentive RNN as well as any other neural network or established approach by statistically and economically significant levels. Third, we halve our training data set of 2,000 days. We find our networks, when fed with substantially less data (i.e. 1,000 days), to perform significantly worse, which highlights a crucial weakness of neural networks in their dependence on very large data sets. Hence, we conclude that a well designed neural network, i.e. a recurrent LSTM neural network initialized with best current evidence and balanced incentives, can potentially advance the protection offered to institutional investors by VaR thresholds through a reduction in threshold breaches. However, such advancements rely on the availability of a long data history, which may not always be available in practice when estimating asset management VaR thresholds.

Table 1: Descriptive statistics of the daily returns of the main equity index (Equity), the main sovereign bond with short (1-3 years) maturity (SB1-3y) and the main sovereign bond with long (7-10 years) maturity (SB7-10). Descriptive statistics include sample length, the first three moments of the return distribution and 11 thresholds along the return distribution.

References

Risks and Portfolio Decisions Involving Hedge Funds
Skewness in Stock Returns: Reconciling the Evidence on Firm Versus Aggregate Returns
Can Machines Learn Capital Structure Dynamics? Working Paper
International asset allocation with regime shifts
Machine Learning, Human Experts, and the Valuation of Real Assets
Machine Learning versus Economic Restrictions: Evidence from Stock Return Predictability
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
Bond risk premia with machine learning
Value-at-risk: a multivariate switching regime approach
Econometric measures of connectedness and systemic risk in the finance and insurance sectors
Neural networks for pattern recognition
Deep learning in asset pricing
Subsampled Factor Models for Asset Pricing: The Rise of Vasa
Microstructure in the Machine Age
Towards Explaining Deep Learning: Significance Tests for Multi-Layer Perceptrons
Asset Pricing with Omitted Factors
How to Deal with Small Data Sets in Machine Learning: An Analysis on the CAT Bond Market
Understanding the difficulty of training deep feedforward neural networks
Generating sequences with recurrent neural networks
Autoencoder asset pricing models
Much ado about nothing? Exchange rate forecasting: Neural networks vs. linear models using monthly and weekly data
Long short-term memory
Towards explainable AI: Significance tests for neural networks
Improving Earnings Predictions with Machine Learning. Working Paper
Jorion, P. Value at risk
Characteristics are covariances: A unified model of risk and return
Adam: A method for stochastic optimization
Shrinking the cross-section
The Skew Risk Premium in the Equity Index Market
A simple weight decay can improve generalization
Advances in Financial Machine Learning
Deep Learning: a critical appraisal
Frontiers in VaR forecasting and backtesting
Dynamic semiparametric models for expected shortfall (and value-at-risk)
Maschinelles Lernen bei der Entwicklung von Wertsicherungsstrategien. Zeitschrift für das gesamte Kreditwesen
Deep learning for mortgage risk
Forecasting value at risk and expected shortfall using a semiparametric approach based on the asymmetric Laplace distribution
Forecast combinations for value at risk and expected shortfall
Moments of Markov switching models
Verstyuk, S. 2020. Modeling Multivariate Time Series in Economics: From Auto-Regressions to Recurrent Neural Networks. Working Paper
Fixup initialization: Residual learning without normalization. International Conference on Learning Representations (ICLR) Paper

Acknowledgments: We are grateful for comments from Theodor Cojoianu, James Hodson, Juho Kanniainen, Qian Li, Yanan, Andrew Vivian, Xiaojun Zeng and participants at the 2019 Financial Data Science Association conference in San Francisco and the International Conference on Fintech and Financial Data Science at University College Dublin (UCD). The views expressed in this manuscript are not necessarily shared by Sociovestix Labs, the Technical Expert Group of DG FISMA or Warburg Invest AG. Authors are listed in alphabetical order, whereby Hoepner serves as the contact author (andreas.hoepner@ucd.ie). Any remaining errors are our own.