An Interpretable Neural Network for Parameter Inference
Johann Pfitzinger
2021-06-10

Abstract: Adoption of deep neural networks in fields such as economics or finance has been constrained by the lack of interpretability of model outcomes. This paper proposes a generative neural network architecture - the parameter encoder neural network (PENN) - capable of estimating local posterior distributions for the parameters of a regression model. The parameters fully explain predictions in terms of the inputs and permit visualization, interpretation and inference in the presence of complex heterogeneous effects and feature dependencies. The use of Bayesian inference techniques offers an intuitive mechanism to regularize local parameter estimates towards a stable solution, and to reduce noise-fitting in settings of limited data availability. The proposed neural network is particularly well-suited to applications in economics and finance, where parameter inference plays an important role. An application to an asset pricing problem demonstrates how the PENN can be used to explore nonlinear risk dynamics in financial markets, and to compare empirical nonlinear effects to behavior posited by financial theory.

Deep learning is rapidly emerging as the most influential sub-field of machine learning, due in large part to substantial gains in predictive performance achieved by deep neural networks (DNN) in the fields of computer vision and natural language processing (Goodfellow et al., 2016). Fueled by successes in these and other domains, a large literature has developed, applying DNN techniques to economic and financial problems. However, despite some appealing properties of DNN (e.g. the oft-cited universal approximation (Hornik et al., 1989) or predictive accuracy (Fernandez-Delgado et al., 2014)), the scope for their use in econometrics remains limited for several reasons.

The most important barrier to the wider adoption of DNN in the field of econometrics is their lack of interpretability. Econometric analysis is typically concerned with inference about the causal dynamics governing economic processes (e.g. expected responses to policy innovations). This requires an identifiable parametric representation of the process, precluding the use of DNN as well as most other machine learning methods. A growing literature proposes the use of post hoc algorithms to interpret the results of neural networks; however, these methods often lack robustness, and are cumbersome to implement and computationally demanding (Alvarez-Melis & Jaakkola, 2018). In addition, and perhaps more importantly, post hoc interpretation is generally facilitated by imposing simplifying assumptions, such as feature independence, onto complex algorithms. Since the structure of DNN makes them appropriate precisely for problems characterized by feature dependencies, removing this property from the interpretation vastly reduces its power to describe the underlying data generating process.

Another impediment to the use of DNN in econometrics is the relatively small data sets typically available in empirical applications. Skillful consideration of the network architecture and regularization is required to avoid overfitting of DNN in small samples, while simultaneously capturing systematic nonlinearities.
The reliance on pseudo out-of-sample model selection algorithms, such as cross-validation, further exacerbates the issue, and puts DNN at a disadvantage compared to simpler methods like linear regression or nonlinear additive models.

In this paper, I propose a neural network architecture that aims to solve both of the above obstacles to the application of DNN in econometrics. The parameter encoder neural network (PENN) represents a novel contribution to the nascent field of self-explaining neural networks, where the architecture of the DNN is designed in such a manner as to produce interpretable outputs natively. The method retains the flexibility of the neural network to encode complex nonlinear behavior, but simultaneously generates interpretable posterior densities of local regression parameters. A Bayesian prior shrinks the local parameter estimates towards global means, reducing the process to static parameters when the data do not support nonlinearity. This form of regularization is extremely intuitive and permits the PENN to be used even in comparatively data-constrained environments.

The contribution of the PENN model can be viewed from two perspectives. On the one hand, it represents an explainability method that compels a complex neural network to encode effects via a latent channel of local regression parameters. The regression parameters define a linear decomposition of each prediction that corresponds to the local contributions produced by popular explainability algorithms such as SHAP (Lundberg & Lee, 2017) or LIME (Ribeiro et al., 2016), with the important distinction that the PENN requires no assumption of feature independence. 3 On the other hand, the PENN is a nonlinear regression technique that can be used to conduct parameter inference in econometric models and to explore marginal effects at the local level. Examples are presented in this paper to highlight each of these facets.

The role of the PENN as an explainability method is explored using a series of simulations, which demonstrate that the PENN can generate consistent local parameter estimates superior, in several respects, to the feature contributions obtained using popular post hoc explainability algorithms, as well as other interpretable nonlinear frameworks. The presence of interaction effects among covariates can lead to large inaccuracies in estimators that assume an additive effects structure (as is the case for most existing explainability algorithms). The parameters estimated using the PENN capture non-additive effects correctly, and are comparatively robust to several characteristics commonly observed in economic data, such as reduced data availability, multicollinearity and a low signal-to-noise ratio.

In an applied econometric setting, the PENN method is used to estimate a nonlinear version of the popular capital asset pricing model (CAPM). The approach permits the exploration of dynamic dependencies between equity risk premia and the economic regime, and can be used to test theoretical assumptions about equity risk and return characteristics. Financial theory suggests that an asset's sensitivity to systematic risk sources is not static, but depends on the state that the economy resides in at any point in time. The PENN is uniquely suited to the estimation of dynamic risk premia conditional on a nonlinear and highly flexible function of the macroeconomic state.

3 The explainability algorithms mentioned here are discussed in detail in subsequent sections.

A widely used interpretable framework for nonlinear modeling is the generalized additive model (GAM) of Hastie & Tibshirani (1999).
The GAM estimates nonlinear functions of the covariates, which are additively combined to produce predictions. Unless accounted for explicitly, the model is therefore not able to capture dependencies between two or more covariates. Some examples of applications of the GAM framework to machine learning algorithms exist. The nonlinear covariate-specific functions of the GAM have been represented, for instance, by neural networks (Lisboa et al., 2020; Potts, 1999), or by a random forest (Caruana et al., 2015). This results in an additive version of the respective underlying machine learning algorithm, which is appealing due to its inherent simplicity, but unsatisfactory when interaction effects are expected to exist, or when the researcher wishes to remain agnostic about their existence.

The PENN model proposed in this paper modifies a neural network architecture to approximate local posterior parameter distributions, and to generate interpretable outputs without imposing an additive dependence structure. The approach is most closely related to Al-Shedivat et al. (2017) and Alvarez-Melis & Jaakkola (2018), both of which propose conceptually related self-explaining frameworks. The PENN can be seen as a variant of the self-explaining neural network framework proposed by Alvarez-Melis & Jaakkola (2018), but is distinct in a few important respects: (i) It utilizes Bayesian inference techniques to produce parameter distributions, rather than point estimates; (ii) The PENN is conceived specifically as an econometric tool, whereas the emphasis of comparable studies has been on computer vision tasks; (iii) The concepts of stability and regularization are derived directly from a posterior distribution of the parameters, differing from the gradient-regularized objective proposed in Alvarez-Melis & Jaakkola (2018). An implication of this final point is that, as an approach to model explainability, the gradient-regularized self-explaining network - like the post hoc algorithms described above - embeds an assumption of feature independence, while the PENN does not.

Econometric inference using machine learning models is in its infancy, with comparatively few examples in the literature. Those approaches proposed to date either build on existing explainability algorithms (primarily Shapley values) (Lundberg & Lee, 2017; Shapley, 1953; Štrumbelj & Kononenko, 2014) or modify machine learning models directly, with the PENN falling into the latter of these two categories. The former group includes, for instance, Joseph (2019), who proposes a framework to conduct statistical inference in machine learning models using standard regression analysis with Shapley values as inputs. Bracke et al. (2019) show how a Shapley-based algorithm developed in Datta et al. (2016), which can account for some degree of feature dependence, can be used to obtain a systematic analytical framework for explainability in financial and econometric applications. The authors apply the algorithm to determine key drivers of mortgage default. Finally, Tiffin (2019) demonstrates how Shapley values can be used to quantify the impact of financial crises on growth.

Arguably the most prominent application of machine learning methods to econometric inference is in post machine learning semiparametric inference, typically for the calculation of average treatment effects (Belloni et al., 2014, 2017; Chernozhukov et al., 2018; Farrell, 2015; Farrell et al., 2021).
This literature studies various methods of obtaining valid causal inference on a static parameter (the average treatment effect), when the first-stage machine learning method is subject to regularization bias. The methods require notional modifications to the underlying machine learning techniques, in the form of partially linear designs. Other approaches that modify machine learning algorithms directly include Wager & Athey (2018), Athey et al. (2019) and Friedberg et al. (2020), who propose random-forest-based algorithms to learn causal effects, introducing the concept of a causal forest to estimate heterogeneous treatment effects. Mullainathan & Spiess (2017) propose constructing an explicit correspondence between an econometric and a machine learning approach, by treating a decision tree like a regression with multiple interaction terms. Interpretability frameworks to test for nonlinear Granger causality have been put forward by Tank et al. (2018), Nauta et al. (2019), Wu et al. (2020), Löwe et al. (2020), Khanna & Tan (2020) and Marcinkevičs & Vogt (2021). Marcinkevičs & Vogt (2021) is noteworthy in that the authors employ a variant of a self-explaining neural network architecture to examine Granger causality. Finally, Horel & Giesecke (2020) propose a general method of conducting significance tests in a nonlinear setting using neural networks, with an application to economic data. For further detailed reviews of the role of machine learning in economics and finance, the reader is referred to Tiffin (2019) and Varian (2014).

The PENN methodology introduced in the following sections differs from most of the aforementioned approaches in that it is capable of capturing heterogeneous effect structures, in the form of a locally parameterized difference equation, without reducing the flexibility of the underlying machine learning algorithm. Much of the current literature focusses on methods to accommodate the black-box property of neural networks (by approximating the gradient of a fitted model, or by correcting the bias inherent to a semiparametric framework). The PENN instead aims to circumvent the black-box property entirely, and to permit a degree of statistical inference on the effects revealed by the DNN.

For a standard DNN, the objective is to learn a probabilistic function p_θ(y|x), where y is the dependent variable, x a matrix of covariates, and θ a vector of neural network weights. While its multilayer structure coupled with a nonlinear activation function can capture nonlinearity with a high degree of flexibility, the predictions (ŷ) generated by the neural network are not immediately interpretable. For regression tasks, θ is optimized using a gradient descent algorithm, by minimizing the loss in Eq. 3.1 (Goodfellow et al., 2016):

L(θ) = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)², with ŷ_i = f_θ(x_i). (3.1)

In contrast to the above, econometric analysis typically employs some form of a linear regression or classification framework, parameterizing the problem with a vector of coefficients that results in the conditional data likelihood p(y|β, x), which is maximized with respect to β. The coefficients uniquely map predictions to covariates, and can - subject to a set of assumptions - be treated as causal effects (Hansen, 2019). The cost of this interpretability is the supposition that the underlying data generating process (DGP) is linear in β. A standard Gaussian likelihood for regression tasks results in the loss function in Eq.
3.2: The PENN proposes a synthesis of the flexible neural network and the interpretable linear regression, uniting both approaches in the context of an encoder-decoder framework. 5 The encoder is an inference network that generates posterior densities for a vector of local regression parameters β i , i ∈ 1, ...N , and is denoted q θ (β|x). The decoder uses the posterior densities to form predictions 4 A more detailed discussion of the structure of feed-forward neural networks is omitted here and has been covered extensively in the related literature. 5 Encoder-decoder frameworks are neural network architectures that consist of two separate entities -an encoder and a decoder -, which are chained and trained together using a combined loss. Examples of encoder-decoder frameworks are autoencoders for data compression applications or sequence-to-sequence models, often used in natural language translation tasks (Goodfellow et al., 2016) . over y i based on a parameterized likelihood. This framework retains the rich flexibility of a DNN, but compels the neural network to encode predictions via an interpretable locally linear model. Combining the neural network with the linear likelihood function results in the (conceptual) loss in Eq. 3.3, which represents the expected linear likelihood with parameters generated by the inference network: ( 3.3) The aim in this and subsequent sections is to convert Eq. 3.3 into a loss function that can be used to train a neural network, and to obtain parameters for the locally linear regression model. Each local linear model, while interpretable, can be useful only insofar as the coefficients are uniquely identified. A logical starting point for the derivation of a loss function is therefore the unknown true density distribution of the coefficients, p(β|y, x), which the encoder q θ (β|x) aims to approximate, such that q θ (β|x) ≈ p(β|y, x). This manner of framing the objective is closely related to variational inference problems, where the divergence between a latent and an approximating density distribution is minimized. Variational inference is a Bayesian technique of approximating the posterior density of a latent variable, typically denoted z, and has become an important tool in several branches of machine learning (Blei et al., 2017; Jordan et al., 1999; Zhang et al., 2018) . Its objective is to infer a posterior distribution of model parameters given data. In contrast to Markov Chain Monte Carlo (MCMC) sampling, variational inference converts the process of obtaining a posterior from a sampling problem into an optimization problem. The aim in variational inference is to determine a density, q(z), that is as similar as possible to the posterior p(z|x) -i.e. that minimizes the divergence between the distributions. The approximate posterior q(z) is represented using a parametric distribution whose parameters are optimized with the objective of resembling p(z|x). For instance, the latent variable could be assumed to follow a normal distribution, with z ∼ N (µ z , σ 2 z ), where the mean and variance (µ z and σ 2 z ) are optimized to ensure that q(z) ≈ p(z|x). In the context of the PENN model, the latent variable z is replaced by β, q θ (β|x) is the approximating function, and p(β|y, x) is the posterior. 
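Before the variational machinery is developed further, it may help to restate the conceptual encoder-decoder objective sketched above in explicit form. The display below is an illustration consistent with the description of Eq. 3.3 (an expected linear log-likelihood with parameters drawn from the inference network), written under the Gaussian assumptions used throughout the paper rather than as a verbatim reproduction of the original equation:

```latex
% Conceptual PENN objective (cf. Eq. 3.3): maximize the expected linear
% log-likelihood under the approximate posterior produced by the encoder.
\mathcal{L}(\theta) \;=\; \sum_{i=1}^{N}
  \mathbb{E}_{q_\theta(\beta_i \mid x_i)}\big[\, \log p(y_i \mid \beta_i, x_i) \,\big],
\qquad \hat{y}_i \;=\; x_i^{\top} \beta_i .
```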
In order to make the problem tractable, it is common to assume that the approximating function follows a mean field variational distribution, with independently distributed latent variables (Blei et al., 2017) , such that: The vector β i is furthermore assumed to follow an amortized multivariate Gaussian distribution, The distribution is referred to as amortized, since it depends on shared parameter functions µ θ (·) and σ 2 θ (·), substantially reducing the complexity of the problem. The functions take as input a vector x i and infer from it parameters for the approximate posterior of β i . The parameter functions are estimated using a DNN, referred to as an inference network, since it infers parameters for the posterior based on the data. This removes the need to parameterize N × K distributions, and instead requires a single neural network -the inference network -that predicts K parameterizations given an arbitrary input. This general methodology has found wide application in machine learning, most prominently in the optimization of variational autoencoders (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) . The assumption of an amortized Gaussian mean field variational distribution for the latent variables is also standard in machine learning applications (Zhang et al., 2018) . A contribution of the PENN is to interpret the latent variables as the posteriors of local regression parameters and training a supervised inference network using y. The similarity between the approximating and posterior densities can be captured using a Kullback-Leibler divergence, D KL , which measures the expectation of the information difference between any two distributions and is defined, for the general case, as: Note that D KL is not symmetrical (i.e. D KL (q||p) = D KL (p||q)). Reversing the arguments can be more intuitive, however in the context of variational inference -where p is unknown -, integrating over q is preferred since it allows the expectation to be computed (Zhang et al., 2018) . Several approaches exist to "symmetrize" D KL (e.g. Pu et al. (2017) , Chen et al. (2017) , and Arjovsky & Bottou (2017)), however, these are not explored here. Substituting the approximated and true posteriors for any parameter vector β i ∈ β into a multivariate D KL (where i = 1, ..., N ) yields Eq. 3.7: Given the distributional assumption in Eq. 3.5, the additive property implies that the multivariate D KL in Eq. 3.7 is equal to the sum of the univariate Kullback-Leibler divergences for the individual parameters β ik ∈ β i . This property is used in subsequent discussions to simplify the exposition and derivation of the closed form solution. A composite learning objective of the PENN can be defined by combining the Kullback-Leibler divergence in Eq. 3.7 with Eq. 3.1. The aim is to minimize the aggregate D KL , while simultaneously maximizing the likelihood of the data, p(y|x). Since q θ (β|x) follows a mean field variational distribution with independently distributed parameters, the aggregate D KL can be found by summing over N , leading to the objective captured in Eq. 3.8: (3.8) Eq. 3.8 is not yet a computable loss function since it contains the unknown density p(β i |y i , x i ). Instead, Proposition 1 operationalizes the estimation of the PENN in a manner related to the VAE loss function, by applying Bayes' rule to Eq. 3.7 (see Appendix A for a proof): Proposition 1. 
Where θ is a vector of neural network weights, β is a matrix of local regression coefficients with β i representing one coefficient vector, q θ (β i |x i ) is an approximation function of the latent posterior density given by p(β i |y i , x i ), and p(β i |x i ) is a conditional prior: Proposition 1 warrants detailed examination. The LHS of the equality is identical to the (negative) objective function of the PENN in Eq. 3.8. The RHS is analogous to the evidence lower bound (ELBO) in variational inference, which is equal to the original Kullback-Leibler divergence in Eq. 3.7 upto a constant term -log p(y|x) (constant in the sense that it does not depend on q θ ), and can therefore be used as a proxy for optimizing D KL . Dissecting the ELBO further, the first term encapsulates the encoder-decoder framework (Eq. , and is the expected parametric likelihood log p(y|β, x) (decoder), with β generated by q θ (β|x) (encoder). The second term introduces a conditional prior on the parameters, p(β|x), acting as a regularization loss, which penalizes any solution that diverges from a stable process. The role of the prior is discussed in more detail below. The Kullback-Leibler divergence facilitates a form of identification of the local parameter estimates, since it restricts estimates of β i to the most stable -in the extreme case, globally static -path, thus ensuring a solution that is both unique and meaningful. The RHS of the equality in Proposition 1 can be converted into a neural network loss function by making a few assumptions about the shape of the likelihood, the coefficients and the prior. The parametric log-likelihood log p(y|β, x) can be defined using any linear regression or classification model. For the applications in this paper, the simple Gaussian regression is considered, with: whereŷ i = x i β m i , and β m i is a draw from the estimated posterior for the ith observation. The prediction mean squared error is obtained by computing the expectation over M Monte Carlo draws from the parameter posterior: (3.10) The second term in the loss function, the Kullback-Leibler loss, can be computed in closed form given the Gaussian assumption for β ik : where µ p ik , µ q ik , σ p ik and σ q ik parameterize the prior and posterior distributions, such that (3.13) The parameters µ q ik and σ q ik are inferred by the neural network given data at observation i. The parameters of the prior are discussed below. The derivation of Eq. 3.11 is provided in Appendix B. Finally, for the estimation of nonlinear regression parameters, it is desirable to control the relative importance of the Kullback-Leibler penalty within the loss function using a hyperparameter. This permits the extent of nonlinearity encoded in the PENN to be determined in a data-driven manner. For instance, when increasing the weight of the prior, the parameters vary less over i, converging to a static solution as the weight goes to infinity. The trade-off between prior and likelihood is achieved using the hyperparameter λ. 6 Assembling the various components, the overall objective of the PENN is given in Eq. 3.14: (3.14) The loss function in Eq. 3.14 is readily computable and can be used to train a neural networkin this case the inference network of the PENN framework. The prior p(β|x) is not expressed in absolute terms, but rather encodes an expected conditional relationship with x. 
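Before turning to the construction of the prior moments, the composite loss assembled above can be made concrete. The following NumPy sketch combines the Monte Carlo estimate of the expected squared error (cf. Eq. 3.10) with the λ-weighted closed-form Gaussian Kullback-Leibler penalty (cf. Eq. 3.11). The array names, the default of M = 100 draws and the 1/N scaling of the penalty are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def penn_loss(y, x, mu_q, sigma2_q, mu_p, sigma2_p, lam, M=100, rng=None):
    """Composite PENN objective (cf. Eq. 3.14): Monte Carlo expected squared error
    of the locally linear decoder plus a lambda-weighted Gaussian KL penalty.
    y has length N; all other arrays are N x K. Names are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    N, K = x.shape

    # Decoder: M reparameterized draws beta^m = mu_q + sigma_q * eps, eps ~ N(0, 1)
    eps = rng.standard_normal((N, M, K))
    beta = mu_q[:, None, :] + np.sqrt(sigma2_q)[:, None, :] * eps
    y_hat = np.einsum("nk,nmk->nm", x, beta)          # N x M predictive draws
    mse = np.mean((y[:, None] - y_hat) ** 2)          # expected likelihood loss

    # Closed-form KL between the Gaussian posterior q and the conditional prior p
    kl = 0.5 * (np.log(sigma2_p / sigma2_q)
                + (sigma2_q + (mu_q - mu_p) ** 2) / sigma2_p
                - 1.0)
    return mse + lam * np.sum(kl) / N
```

The same two terms reappear below: the prior moments (mu_p, sigma2_p) are not fixed constants but are derived from the data, which is what allows the penalty to act as shrinkage towards locally stable parameters.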
Given the underlying independent Gaussian assumption, the prior density p(β i |x i ) is fully described by its conditonal moments µ p i |x i and σ p i |x i , for any vector of parameters β i . An important function of the prior is to induce stability in the estimates, shrinking the conditional moments of the posterior towards static parameters in the neighborhood of x i . With an appropriate definition of the prior moments, the inference network can be trained to infer posterior parameterizations in adherence to "stability rules", ensuring that similar neighborhoods in x result in similar posteriors. Or, expressed differently, the gradient x β i ≈ 0 (the gradient of the parameters with respect to the inputs approximates zero). A parameter gradient approximating zero leads naturally to interpretable parameters that resemble local marginal effects, as shown (for the independent features case) in Proposition 2: The proof of Proposition 2, as well as a discussion of the dependent features case, is provided in Appendix C. Apart from suggesting that stable parameters are indeed interpretable, Proposition 2 also implies an approach to their estimation. By constructing a loss function that penalizes the deviation of the parameters from backpropagated network gradients (i.e. penalizing the deviation from x f (x)), parameter estimates are encouraged to act as locally stable gradients. 7 However, since the backpropagated gradient is composed of partial derivatives of f (x) with respect to x, the approach produces biased estimates when the assumption of feature independence fails. The point is illustrated using the example of the gradient-regularized self-explaining network (SENN) of in Section 4, and in Appendix C. Given its implicit assumption of feature independence, the explanations generated by a SENN are closely related to those produced by popular post hoc algorithms such as SHAP or LIME, which also assume feature independence. In the case of the PENN, p(β|x) is instead obtained using multivariate predictions of the prior Any estimator of the prior conditional moments should satisfy three properties: (i) The estimator should induce stability, with lim λ→∞ x β i = 0 ∀ i in the empirical joint data distribution. (ii) The estimator should be nonparametric. By using an estimator with minimal distributional assumptions, the more rigorous assumptions required by other explainability algorithms, such as feature independence, can be avoided. (iii) The extent of regularization should be variable. The estimator should accommodate shrinkage towards global static parameters, by allowing the degree of prior support to vary based on a hyperparameter. A stable and nonparametric method that satisfies the above criteria and captures the relational character of the prior is the k-nearest neighbors estimator. Formally, the prior is a function that computes expected moments of q θ (β|x) over any neighborhood, I, of points in x, such that: and I ⊂ [1, N ] indexes a neighborhood of vectors in x that lie in close mutual proximity. Here The expectation is computed using an N × N kernel weighting matrix, π x;D , that indexes blocks of neighboring points in x, and whose rows sum to unity. This is closely related to conventional kernel weighting methods, particularly compact support kernels, with positive weights over a region of neighboring points. The definition of I in Eq. 3.16 results in disjoint neighborhoods (defined based on the distance threshold D) that can be interpreted as parameter regimes. 
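One way to make the neighborhood construction concrete is the sketch below of the disjoint-neighborhood condition and the row-stochastic weighting matrix; the uniform within-neighborhood weights are an assumption consistent with the description of π_{x;D}, not a statement of the paper's exact kernel:

```latex
% Disjoint neighborhoods (cf. Eq. 3.16) and a row-normalized weighting matrix.
I \subset \{1,\dots,N\} \quad \text{such that} \quad d(x_i, x_j) < D
  \;\;\forall\, i, j \in I,
\qquad
\big[\pi_{x;D}\big]_{ij} \;=\; \frac{\mathbb{1}\{\, j \in I(i) \,\}}{\lvert I(i) \rvert},
\qquad \sum_{j=1}^{N} \big[\pi_{x;D}\big]_{ij} \;=\; 1 .
```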
This custom compact support kernel is found to permit more nuanced control over the support range (and hence over the extent and manner of regularization in the inference network). A comparison to more traditional compact support kernels is provided in Appendix D. Given π x;D , the prior moments are obtained from q θ (β|x), aŝ where µ q and σ q are generated in the inference network. When D is large, the condition in Eq. 3.16 is satisfied for all i and j, resulting in a single (static) prior parameter vector. As D decreases, the number of regimes encoded in the prior increases. When D = 0, π x;D collapses to an identity matrix, with the regularization loss equal to zero. Apart from ensuring that the gradient x β i ≈ 0, this definition of the prior has the intuitive appeal that the broader a neighborhood in x is defined, the more the posterior is shrunken towards a globally static parameter value, with -in the extreme caseβ i = β * ∀ i ∈ 1, ..., N , where β * is the least-squares optimal static parameter vector. In addition, regularizing the overall gradient of the inference network with respect to its inputs results in a network that is less sensitive to the chosen topology, and is therefore simpler to train. 8 By constraining nonlinearity to a stable range of solutions that are meaningful for purposes of interpretation, the regression parameters β i can be uniquely identified in the inference network. The prior permits a single unique solution to exist, which maximizes model fit under the condition that similar input vectors lead to a similarly parameterized posterior. An equivalent, but more intuitive alternative to the scalar D, is to define the number of separate 8 Gradient regularization has played an important role recently in improving the adversarial robustness of neural networks (i.e. the extent to which small perturbations can lead to large changes in prediction). A good overview of current research is provided by Finlay & Oberman (2021). neighborhoods in x over which expectations are computed. The normalized number of neighborhoods is denoted δ, where δ = 0 is equivalent to a single regime (D = ∞), and δ = 1 is equivalent to N regimes (D = 0). δ is treated as a second hyperparameter in the PENN model (in addition to λ), and governs the number of distinct parameter regimes (or neighborhoods) encoded in the prior. The optimal composition of neighborhoods can conveniently be obtained using complete linkage agglomerative clustering (see Kaufman & Rousseeuw (2005) for an overview). Complete linkage clustering begins by placing each observation vector into a cluster of its own, and merging the clusters with the minimum cluster distance, D IJ , where I and J denote clusters. Clusters are merged until exactly 1 + δ(N − 1) clusters remain. The cluster distance is formally defined as the largest distance between any two sample vectors in the respective clusters, such that: (3.17) Complete linkage clustering ensures that for any value of δ, the distance threshold D that satisfies the condition d ij < D, i, j ∈ I for all neighborhoods, is minimized. Thus, there is no cluster arrangement that meets the condition set forth in Eq. 3.16 with a smaller number of clusters. Treating δ as a hyperparameter in conjunction with complete linkage agglomerative clustering is therefore equivalent to the use of D, with δ representing a choice that is at once more intuitive and can easily be combined with standard software implementations of hierarchical clustering algorithms. 
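A minimal sketch of this construction using a standard hierarchical clustering implementation is given below. The mapping from δ to the number of clusters follows the text; the block-averaging of the posterior moments within each regime (including the simple averaging of the variances) is an illustrative assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def prior_moments(x, mu_q, sigma2_q, delta):
    """Regime-based prior via complete-linkage agglomerative clustering.
    delta in [0, 1] maps to 1 + delta*(N - 1) clusters, as in Section 3."""
    N = x.shape[0]
    n_clusters = int(round(1 + delta * (N - 1)))
    Z = linkage(x, method="complete")                  # complete-linkage dendrogram
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")

    mu_p, sigma2_p = np.empty_like(mu_q), np.empty_like(sigma2_q)
    for c in np.unique(labels):
        idx = labels == c                              # one disjoint neighborhood I
        mu_p[idx] = mu_q[idx].mean(axis=0)             # prior mean: regime average
        sigma2_p[idx] = sigma2_q[idx].mean(axis=0)     # prior variance: regime average
    return mu_p, sigma2_p
```

With delta = 0 every observation falls into a single cluster and the prior collapses to a globally static parameterization; with delta = 1 each observation forms its own regime and the penalty in the loss vanishes, matching the limiting cases described above.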
The interplay between the hyperparameters λ and δ permits a great deal of flexibility in controlling the strength of shrinkage of the coefficients. Where λ governs the overall weight of the prior in relation to the training mean squared error, δ controls the nuance in the nonlinear patterns captured by the model. If δ = (N − 1) −1 (resulting exactly two neighborhoods in x), the parameters are shrunken towards a two-regime representation. As δ increases, so does the number of regimes. δ can therefore be described as defining the resolution of the static estimates. At maximum resolution (δ = 1), the problem is unregularized. Increasing λ limits the extent to which parameters can vary around the regime-specific static parameter values. As λ grows very large, parameters become static, with the number of static parameters driven by δ. When λ = 0, the problem is unregularized. The simulations introduced in Section 4 provide a good setting within which to explore the role of the hyperparameters λ and δ. The parameter prior enables the inference network to learn a set of relational rules governing the manner in which it infers posterior parameterizations from the input data. When optimal hyperparameters are very restrictive, the resulting rules dictate identical parameters inferred from all input vectors (the static solution). Conversely, as the support for nonlinearity increases, the nature of the rules can become highly complex, permitting a nuanced trade-off between model fit and parameter stability. Since the rules are encoded in the weights of the inference network, outof-sample parameter inference becomes a simple matter of predicting posterior parameterizations using an appropriately trained network. Having derived a loss function, the following section proposes a network architecture for the PENN. Significant discretion exists in the choice of the network topology for the inference network (e.g. the number of hidden layers, node types etc.). The layout presented below utilizes a standard feed forward network with two hidden layers, which is appropriate for both the simulated and empirical applications included in this paper. However, the type of architecture used should be chosen to suit the specific application. The network consists of five components. Standardized input features are passed to a stack of fully-connected feed forward hidden layers. The sigmoid activation function is used to introduce nonlinearity within the hidden layers. 10 The hidden layers feed into two variational layers of equal dimension as the inputs, which learn µ q and σ q , respectively. The purpose of the variational layers is to infer parameters (mean and variance) for the distribution q θ (β|x), permitting direct sampling from the approximated posterior. The variance layer utilizes an exponential activation function to ensure that the output is limited to R ≥0 . Since it is not possible to backpropagate the gradient through a stochastic process, the so-called "reparameterization trick" is used in the sampling layer to connect the encoder and decoder sections of the PENN (see Kingma & Welling (2014) layers are subsequently used to generate the corresponding M Monte Carlo samples from the approximate posterior, with β m = µ q + σ q s m (where is element-wise multiplication). Note that the sampling layer therefore consists of a three dimensional tensor with dimensions N × M × K. 
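A compact sketch of the network components described in this section is given below, written with the Python Keras API (the paper's implementation uses the R interfaces to keras and tensorflow). The layer widths, variable names and handling of the Monte Carlo dimension are illustrative choices, the non-trainable decoder discussed immediately below is included for completeness, and the composite loss of Eq. 3.14 would still need to be attached via a custom training step:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

K_feat, M = 3, 100          # number of covariates and Monte Carlo draws (illustrative)

x_in = layers.Input(shape=(K_feat,), name="features")

# Inference network: two fully-connected hidden layers with sigmoid activations
h = layers.Dense(32, activation="sigmoid")(x_in)
h = layers.Dense(32, activation="sigmoid")(h)

# Variational layers: posterior mean and variance of the local parameters beta_i
mu_q = layers.Dense(K_feat, name="mu_q")(h)
sigma2_q = layers.Dense(K_feat, activation="exponential", name="sigma2_q")(h)

# Sampling layer (reparameterization trick): beta^m = mu_q + sigma_q * eps, eps ~ N(0, 1)
def sample(args):
    mu, s2 = args
    eps = tf.random.normal((tf.shape(mu)[0], M, K_feat))
    return mu[:, None, :] + tf.sqrt(s2)[:, None, :] * eps   # shape N x M x K

beta = layers.Lambda(sample, name="beta_samples")([mu_q, sigma2_q])

# Non-trainable decoder: y_hat^m_i = x_i . beta^m_i, giving N x M predictive draws
y_hat = layers.Lambda(lambda t: tf.reduce_sum(t[0][:, None, :] * t[1], axis=-1),
                      name="decoder")([x_in, beta])

penn = Model(inputs=x_in, outputs=[y_hat, mu_q, sigma2_q])
```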
Finally, the samples from the parameter posterior, β m , m = 1, ..., M , feed into a non-trainable output layer that simply calculates the dot productŷ m i = x i β m i , to generate M draws from the predictive density. The input layer is an auxiliary input to the final layer, which requires both β m and x, and the output layer is again a three dimensional tensor with dimensions N × M × 1. In the applications in Sections 4 and 5, the PENN is implemented using keras and tensorflow frameworks (Allaire & Chollet, 2020; Allaire & Tang, 2019) . Models are trained using the popular adaptive moment (Adam) algorithm introduced in Kingma & Ba (2017), which is a stochastic gradient descent algorithm with adaptive learning rates. The number of Monte Carlo draws M is set to 100 across all applications. The size of the hidden layers is determined individually for each application, with larger specifications preferred, in order to permit a high capacity for nonlinearity in the network. A more comprehensive overview of the neural network hyperparameters is provided in Appendix E. A useful variation on the standard PENN architecture in econometric or financial applications may be to constrain some parameters to be static. This is desirable, for instance, when the data availability is limited and a full neural network is impractical, or when the DGP is assumed to be partially linear. Static parameters are easily accommodated in the framework, by removing connections between the nodes associated with linear parameters in the variational layers, and the hidden layer. By removing the connections between variational and hidden layers, only the bias term is retained, resulting in a posterior with static mean and variance. Fig. 3.3 Finally, a local mean can be included in the PENN, by adding a column of ones to x. A global mean is computed by holding the local mean constant using the architecture described in Fig. 3 .3, or simply by adding a trainable bias to the decoder. The role of the PENN as an explainability method in the presence of a nonlinear and non-additive DGP can be demonstrated using a simple simulation. Consider a DGP consisting of three covariates, x = x 1 x 2 x 3 , a dependent variable, y, and an error term, ∼ N (0, σ 2 ): The marginal effects are nonlinear and denoted by the coefficient functions B k (x ij ). For the first covariate, the coefficient follows a sine curve, with B 1 (x i1 ) = 5 sin x i1 . The second and third covariates exhibit a simple interaction, with B 3 (x i2 ) = τ (x i2 ). Notice that x 2 has no effect on y directly, but influences the output via a threshold function τ (x 2 ), such that The first marginal effect function, B 1 (x 1 ), evaluates the method's ability to capture arbitrary nonlinearity. Conversely, B 3 (x 2 ) tests whether the PENN is capable of identifying the true effect in the presence of an interaction. The latter task is challenging particularly for post hoc explainability algorithms that assume feature independence. Such an approach usually fails to disentangle the indirect effect of x 2 from the effect of x 3 . Finally, the covariates are independent samples from a correlated multivariate normal distribution, The whereφ ik is the contribution estimate. The contribution is defined using a linear model, witĥ where β ik is a coefficient or weight. Note that φ ik is not equivalent to a marginal effect, but can be computed using β ik . 
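The simulation design above can be generated in a few lines of Python, as sketched below. Only B_1(x_1) = 5 sin(x_1), the general form of the interaction through τ(x_2) and the default settings N = 1000, σ² = 1, ρ = 0 are taken from the text; the specific threshold levels inside tau are hypothetical stand-ins:

```python
import numpy as np

def simulate_dgp(N=1000, rho=0.0, sigma2=1.0, seed=0):
    """Toy version of the Section 4 DGP: y = B1(x1)*x1 + B3(x2)*x3 + eps,
    with B1(x1) = 5*sin(x1) and B3(x2) = tau(x2) a threshold function of x2.
    The threshold used in tau below is a hypothetical placeholder."""
    rng = np.random.default_rng(seed)
    cov = np.full((3, 3), rho) + (1 - rho) * np.eye(3)   # equi-correlated covariates
    x = rng.multivariate_normal(np.zeros(3), cov, size=N)

    b1 = 5 * np.sin(x[:, 0])                  # nonlinear own effect of x1
    tau = np.where(x[:, 1] > 0, 2.0, 0.0)     # hypothetical threshold: x2 switches x3 on
    y = b1 * x[:, 0] + tau * x[:, 2] + rng.normal(0.0, np.sqrt(sigma2), size=N)
    return x, y
```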
In the case of a local linear model, withŷ i = β 0 + k β ik x ik , summing the contributions yields the predicted value for the ith observation, normalized by the average prediction: (4.6) The two explainability methods used in this application are among the most widely applied algorithms and warrant more detailed discussion. The first is based on the game theoretic concept of Shapley values (Shapley, 1953) , and conceptually determines a contribution for each covariate x k , by approximating are obtained using the R-package iml (Molnar, 2018) , which implements the method according to Štrumbelj & Kononenko (2014) . The second explainability method used as a benchmark for the PENN is the local interpretable model-agnostic explanations (LIME) algorithm, introduced in Ribeiro et al. (2016) . The algorithm fits an interpretable model (e.g. a linear regression) for each observation, by perturbing the data set at x i to generate a simulated data set z, and subsequently minimizing where f is the black-box model, g is a linear regression, and π x is a weighting function measuring the proximity between x i and z (based on the Euclidean distance in this application). Sampling z uniformly as suggested in Ribeiro et al. (2016) and implemented in various software packages is found to work poorly in this context. Instead z is drawn randomly from a narrow region around , where σ 2 z is treated as a hyperparameter. Finally, a GAM is estimated, which generates contribution functions using smoothing splines, and is computed using the R-package mgcv (Wood, 2011) . Note that GAM and SHAP offer limited comparability to the PENN, since the methods only produce contributionsφ ik and no local coefficients. Conversely,φ ik can easily be calculated from the estimates generated by PENN, SENN and LIME using Eq. 4.3. The accuracy of the various methods is computed formally using the mean absolute error (MAE) of the contributions: (4.9) for the various methods, where N = 1000, σ 2 = 1 and ρ = 0: The PENN model generates coefficients and contributions that closely resemble the DGP, outperforming all benchmark methods. Particularly the values associated with x 2 and x 3 suggest that the PENN is capable of identifying the effects correctly in the presence of feature dependencies. All benchmark methods produce reasonable results for φ 1 , but fail to identify the correct contributions for the remaining covariates, due to an implicit assumption of feature independence. Since SHAP values are calculated by measuring the average impact that the inclusion of a covariate has on predictions, it tends to allocate contribution to both x 2 and x 3 equally. The additivity underlying the model structures of LIME and GAM is similarly incapable of capturing feature dependencies. Finally, by forcing the parameter vectors to act as gradients, the SENN embeds feature independence into its concept of explainable parameters and contributions, inducing smoothness of B k over the range of x k and hence penalizing zero values of B 2 . simulation runs with the settings used in Fig. 4 .1 above. Given its additive structure, the GAM performs comparatively poorly. The three neural networks capture the DGP almost identically well, illustrating that the structure of the PENN is capable of retaining the complexity of an equally-sized DNN. 
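The accuracy comparisons in this section are based on the contribution MAE introduced above. Written out, a natural form of the metric (cf. Eq. 4.9) is the following, where the averaging over both observations and covariates is an assumption:

```latex
% Mean absolute error of the estimated feature contributions (cf. Eq. 4.9).
\mathrm{MAE}_{\phi} \;=\; \frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K}
  \big|\, \hat{\phi}_{ik} - \phi_{ik} \,\big| .
```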
In the same vein, the poor explanatory performance of the SHAP and LIME algorithms, as well as the SENN parameters observed above, does not stem from an insufficiently flexible underlying neural network, but is the result of the theoretical particularities of the explainability methods themselves. Overall, the PENN achieves the highest accuracy in all scenarios. The left panel plots the effect of decreasing the sample size on MAE φ , the middle panel the role of multicollinearity, and the right panel of the signal-to-noise ratio. The PENN is comparatively robust in small sample sizes, which results from the regularizing effect of the parameter prior. Multicollinearity is not a substantial problem for any of the methods, with relatively minor changes to the accuracy. In fact, the SENN even improves slightly as ρ increases. Decreasing the signal-to-noise ratio results in a lower accuracy across all methods. Of the benchmark methods, the SENN performs best, which may reflect an advantage in embedding explainability directly into the neural network architecture as opposed to adding a post hoc explainability layer (with an associated additional source for errors). The key comparative advantage of the PENN is its ability to infer the interaction between x 2 and x 3 without prior knowledge. The interaction could be modeled explicitly by the benchmark methods, under the assumption that the existence of the interaction is known a priori. In the case of the GAM, LIME and SENN this is simply a matter of adding the appropriate interaction term to the regression equations, and adding the interaction to the input feature set of the SENN. For SHAP it is more complicated and currently not implemented in the software packages used here. 12 The interaction is therefore only examined for the PENN, SENN, GAM and LIME. Fig. 4 .4 plots the resulting MAE φ , illustrating that the PENN achieves superior or closely matched accuracy even when the interaction is modeled with prior knowledge by the benchmark methods. The GAM performs particularly poorly when the correlation between the interacting variables is high. In this case, differentiating between x 2 (which has no effect) and the interaction between x 2 and x 3 is far more difficult in the GAM, but does not pose as large a challenge for the PENN. The LIME, while substantially more accurate than in Fig. 4 .3, continues to capture the effects poorly. The PENN model has several advantages over alternative explainability algorithms. The primary aspect explored in this section is its ability to estimate parameters consistently in the presence of dependent effects structures, and without reducing model interpretability. This is an important achievement, since interpretability is usually obtained at the cost of complexity -more specifically at the cost of imposing additivity and removing any feature interactions. Aside from the ability to generate explainable results without reducing complexity, the approach is less computationally costly when compared to DNN-based alternatives -particularly SHAP -which requires extensive simulations and quickly becomes unfeasible as the number of covariates grows. Furthermore, the posterior densities of the estimates produced by the PENN incorporate a measure of confidence in the parameter values, and permit parameter inference in highly nonlinear settings. This latter point is explored in detail using the empirical example in the following section. 
The application introduced in this section illustrates how a PENN architecture can be used (i) to explore nonlinear behavior in asset markets, and (ii) to conduct parameter inference. I estimate a nonlinear version of the capital asset pricing model (CAPM) for 10 global sector classifications, with the nonlinear dynamics of systematic risk driven by the economic regime. The application is sufficiently simple to illustrate key aspects of the PENN methodology, while simultaneously providing a novel perspective on risk in equity markets. The underlying neural network facilitates the prediction of systematic and idiosyncratic risk components in real-time, addressing two of the most important empirical critiques to the CAPM: its static and its backward-looking character. The application is embedded within the theoretical literature on the conditional CAPM, which posits that risk premia vary based on the state that the economy resides in at a given point in time (see Lewellen & Nagel (2006) for a review). The proposed PENN model can be viewed as a nonlinear rendition of the conditional CAPM introduced in Jagannathan & Wang (1996) and Petkova & Zhang (2005) , where the economic state is described using a data set of external macroeconomic instruments. The results reveal substantial variation in risk premia over time, with a closer examination of risk dynamics suggesting that the nonlinear character of the proposed conditional CAPM is indeed appropriate. The capital asset pricing model, developed by Sharpe (1964) and Lintner (1965) , represents an important cornerstone of academic financial theory. It is constructed based on the assumptions of efficient markets with rational risk-averse investors, and characterizes a trade-off between risk and return of financial assets. Despite several prominent empirical and theoretical critiques, the CAPM continues to play an important role for financial and investment practitioners as an objective and intuitive approach to cost of capital estimation, business valuation and performance measurement. The investment decision was originally framed by Markowitz (1952) as a trade-off between expected risk and return. For any level of expected return, a rational investor attempts to minimize risk, and for any level of risk, the investor aims to maximize return. The feasible set of portfolios emerging from these decision criteria traces a pareto efficient frontier, which optimally trades off risk and return. By introducing a risk-free asset to the investable universe, the efficient frontier collapses to a linear frontier -the security market line (SML) -between the risk-free asset and a tangency portfolio situated on the Markowitz risk-return frontier (Sharpe, 1964) . Given appropriate assumptions, the linear SML is described mathematically by the CAPM, which formulates the expected return on asset k as a function of the risk-free return (r f ) and the return on the tangency portfolio (given by the value-weighted market portfolio, r m ): 13 13 If investors can borrow and lend freely at the risk-free rate, expected utility is maximized by holding only the tangency portfolio and the risk-free asset, with the weights determined by the investor's degree of risk aversion. Under the assumptions of information efficiency (all market participants form equivalent risk and return expectations) and market clearing (all assets have an owner), the tangency portfolio must be a value-weighted portfolio of the entire investable universe. 
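For reference, the pricing relation and the estimating regression discussed in this section take the standard textbook forms below (notation follows the text; these are the usual CAPM expressions rather than a verbatim reproduction of Eqs. 5.1 and 5.2):

```latex
% Static CAPM: expected return (cf. Eq. 5.1) and the time-series estimating
% regression in excess returns (cf. Eq. 5.2).
\mathbb{E}[r_k] \;=\; r_f + \beta_k \big( \mathbb{E}[r_m] - r_f \big),
\qquad
\tilde{r}_{tk} \;=\; \alpha_k + \beta_k \, \tilde{r}_{tm} + \epsilon_t .
```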
(5.1) Thus, asset k earns the risk-free rate of return plus a risk premium. The risk premium depends on the expected excess return earned by the market portfolio, as well as the asset's market beta (β k ). The market beta measures the overall systematic risk exposure of asset k and is typically estimated using a time series regression of excess asset returns on a proxy for the market risk premium (Rossi, 2016) :r wherer tk = r tk −r tf andr tm = r tm −r tf are the excess returns on the asset and market, respectively, and t indexes time. Idiosyncratic variation ( t ) is argued to be diversifiable and hence does not earn a return premium, with E[ t ] = 0. Finally, α k (or simply "alpha") captures market risk-adjusted excess return. In the CAPM, the market return represents the only systematic risk factor, and it must hold that α k = 0. The case when α k = 0 can be viewed as evidence against the validity of the CAPM, suggesting either the existence of additional uncaptured risk sources, or the failure of another related assumption (e.g. the assumption of a static β k ). There have been several important theoretical and empirical challenges to the CAPM (see Brown & Reilly (2012) for an overview). Observed failures of the assumption that α k = 0 have been explained most prominently in two separate strands of the literature: (i) Fama & French (1992) , Fama & French (1993) , and Fama & French (1995) criticize the supposition of a single systematic risk source, and posit the existence of several additional risk premia using a multi-factor version of the CAPM; (ii) The observed instability of market beta over time has given rise to a conditional version of the CAPM, that permits variation in β k conditional on the state of the economy. As highlighted in Brown & Reilly (2012) , empirical studies have generally found the market beta to be unstable over time with static estimates highly sensitive to the chosen sample period. Estimates of β k obtained from historical data therefore tend to be poor predictors of future risk (Rossi, 2016) . A theoretical literature explores a conditional version of the CAPM, arguing that β k should be expected to vary over the course of the business cycle, as market participants demand a higher hurdle rate of return to compensate equity risk during recessionary periods (see for instance Jensen (1968), Dybvig & Ross (1985) , and Hansen & Richard (1987) for a discussion of the theoretical models). The conditional CAPM emerged both in response to the empirical observation of unstable market betas, and (with mixed success) in response to the failure of the zero alpha assumption, suggesting that unexplained excess return (α k = 0) stems not from unaccounted risk premia, but from the existence of multiple beta-regimes. Examples of empirical estimates of conditional CAPM include Jagannathan & Wang (1996 ), Adrian & Franzoni (2004 , Petkova & Zhang (2005) , Lustig & Van Nieuwerburgh (2005) and Santos & Veronesi (2006) . A particularly interesting pendant to the approach presented here is found in Petkova & Zhang (2005) , who determine market beta based on a data set of macroeconomic variables (denoted z t ): Here γ 0 and γ are coefficients. In the case of Petkova & Zhang (2005) , r tk is the return on high and low book-to-market value portfolios, and the authors attempt to explain the apparent existence of a value premium using the economic state. Substituting Eq. 5.4 into Eq. 5.3 allows for estimation ofβ tk in a regression framework with interaction terms betweenr tm and z t . 
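Writing out the substitution makes the interaction structure explicit. The following is the standard way a linear conditional beta (cf. Eq. 5.4) enters the return regression, with the timing of the instruments z following the text:

```latex
% Conditional beta substituted into the return regression: the macro
% instruments enter through interaction terms with the market excess return.
\beta_{tk} \;=\; \gamma_{0} + \gamma^{\top} z_{t},
\qquad
\tilde{r}_{tk} \;=\; \alpha_k + \gamma_{0}\,\tilde{r}_{tm}
  + \gamma^{\top}\!\big( z_{t}\,\tilde{r}_{tm} \big) + \epsilon_t .
```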
14 A PENN model can be used to estimate a nonlinear version of the above framework, where parameter estimates are obtained using: and q θ;k ([α tk β tk ]|z t−1 ) is the inference network. This results in a PENN architecture, with local regression parameters α tk and β tk , whose parameterizations are inferred based on an encoder input data set z t−1 , a decoder inputr tm , and the outputr tk . The setup corresponds to the architecture described in Fig. 3 .4. Other approaches to the nonlinear estimation of CAPM typically employ time-varying parameter frameworks, such as rolling sample regressions (e.g. Koutmos & Knif (2002) , Adrian & Franzoni (2004) , and Glova (2015)). The approach proposed here differs in the important respect that, instead of inferring the evolution of β tk over the temporal dimension, the PENN learns correlative associations between β tk and the regime that the economy and financial markets reside in at time t. Since the regimes are inferred from input data, estimates are less dependent on their immediate history, and the model yields better real-time forecasts than time-varying parameter alternatives. Section 5.4 highlights this distinction by comparing PENN estimates to rolling OLS estimates over different sample periods. Equity risk premia are estimated for each of 10 sectors using a global universe of stock returns. 15 The market return (r tm ) is proxied using the MSCI All Countries World Index (ACWI), which is a value-weighted index of global large and mid-cap stocks comprising over 3000 constituents. The sector returns (r tk ) are given by the associated sector-specific sub-indices of the MSCI ACWI. Returns are calculated over a rolling one month period for each index and adjusted for dividends and distributions. Individual sectors are referred to by the two letter acronyms listed in Fig. 5.1 (e.g. EN = Energy). The risk-free rate (r tf ) is taken to be the 3-month US Treasury bill rate, as is typically assumed in 15 A complete set of global sectors according to the global industry classifiction standard (GICS) includes 11 sectors. The real estate sector is excluded here due to limited data availability. the related literature. US economic variables are used to proxy for both the risk-free rate and the macroeconomic state. While this implies a regional mismatch to the MSCI ACWI index, the high percentage of US equities in the index and the important role of the US financial system suggest that the generalization is reasonable. The sample consists of daily data for the period January 16 The Moody's Seasoned 10-year Baa Corporate Bond Index is used to measure US corporate bond yields. The networks are regularized using the two hyperparameters λ and δ, with linear constant parameters resulting as λ → ∞ and δ → 0. Model selection for machine learning algorithms (i.e. selecting optimal hyperparameter values) is most commonly performed using some form of cross-validation (CV) algorithm. CV algorithms perform pseudo out-of-sample model evaluations by training the network several times on different subsets of the data, and evaluating each fit using validation samples that are withheld during training. Hyperparameters are chosen based on a measure of predictive accuracy (validation MSE). To accommodate the time-dependence structure of time series data, model selection in this special case usually involves expanding or rolling window CV procedures. Since these methods discard a substantial portion of the data during training (e.g. 
recent data is only used in a single training slice), I employ the hv-block CV algorithm described in Racine (2000) . Data is divided into v consecutive validation blocks that retain their ordering. A margin of h samples before and after each validation set is masked from training, to prevent data leakage between training and validation sets, which may occur due to time dependencies and autocorrelation. Bergmeir & Benítez (2012) use rigorous empirical tests to demonstrate that the benefit of the more efficient use of data during training outweighs the theoretical inconsistency of hv-block CV, that results from an evaluation with past data. The authors find that hv-block CV achieves significantly better results than expanding window alternatives. In the application, I set v = 10 and h = 10, and evaluate a grid of candidate values of λ and δ that includes both the static and the unregularized extremes, with the optimal model minimizing the validation error. Hyperparameter tuning is performed individually for each of the sector indices. The spikes in systematic risk exposure are concentrated around periods of economic crisis, suggesting that market participants require a higher compensation for holding equity during tumultuous market phases. This finding aligns with theoretical notions underpinning the development of the conditional CAPM. Distinct differences between sectors' cyclical responses can be observed. For instance, the financial sector exhibits a large risk increase during the financial crisis of 2008, while the IT sector realizes a peak during the dot-com bust. Other sectors, such as CD or UT spike during all recessionary periods (see, for instance, the middle panel in Fig. 5.4 ). Systematic risk in the energy sector is closely related to the oil price. As stated at the outset, the PENN model represents a nonlinear version of a conditional CAPM. In order to study the manner in which variation inβ tk andα tk is embedded in the economic state, Model interpretability is oftentimes presented as an inverse function of model complexity. On the one extreme lies the linear regression, characterized by complete interpretability. On the other are machine learning algorithms such as neural networks which -while highly flexible -are opaque. Post hoc algorithms that attempt to understand the inner workings of uninterpretable models, typically shift along the same continuum by superimposing a less complex and thus more interpretable model onto the black box. This paper has introduced a method that breaks with such a straightforward portrayal by generating interpretable outcomes similar to those of a linear regression, but in the context of a neural network. The PENN does not simply aim to interpret a fitted model, but rather aims to understand an underlying DGP, estimating posterior densities for a locally linear model parameterization, that are capable of accounting for the rich feature interactivity encoded in the neural network architecture. Simulations illustrate that the PENN is capable of producing consistent coefficient estimates and feature contributions in the presence of dependent features, thus achieving a high degree of interpretability without imposing an additive structure. In addition, an empirical application demonstrates how the method can be deployed in the context of econometric analysis, both to explore nonlinear parameter behavior and to conduct parameter inference. 
The spikes in systematic risk exposure are concentrated around periods of economic crisis, suggesting that market participants require higher compensation for holding equity during tumultuous market phases. This finding aligns with the theoretical notions underpinning the development of the conditional CAPM. Distinct differences between sectors' cyclical responses can be observed. For instance, the financial sector exhibits a large risk increase during the financial crisis of 2008, while the IT sector peaks during the dot-com bust. Other sectors, such as CD or UT, spike during all recessionary periods (see, for instance, the middle panel in Fig. 5.4). Systematic risk in the energy sector is closely related to the oil price. As stated at the outset, the PENN model represents a nonlinear version of a conditional CAPM, in which variation in $\hat\beta_{tk}$ and $\hat\alpha_{tk}$ is embedded in the economic state.

Model interpretability is often presented as an inverse function of model complexity. At one extreme lies the linear regression, characterized by complete interpretability. At the other are machine learning algorithms such as neural networks, which, while highly flexible, are opaque. Post hoc algorithms that attempt to understand the inner workings of uninterpretable models typically shift along the same continuum by superimposing a less complex, and thus more interpretable, model onto the black box. This paper has introduced a method that breaks with such a straightforward portrayal by generating interpretable outcomes similar to those of a linear regression, but in the context of a neural network. The PENN does not simply aim to interpret a fitted model; it aims to understand the underlying DGP, estimating posterior densities for a locally linear model parameterization that are capable of accounting for the rich feature interactivity encoded in the neural network architecture. Simulations illustrate that the PENN produces consistent coefficient estimates and feature contributions in the presence of dependent features, thus achieving a high degree of interpretability without imposing an additive structure. In addition, an empirical application demonstrates how the method can be deployed in the context of econometric analysis, both to explore nonlinear parameter behavior and to conduct parameter inference.

A nonlinear version of the conditional CAPM is estimated, which is capable of producing real-time predictions of the systematic and idiosyncratic risk components of financial assets that are embedded in the economic regime. The results suggest a substantial amount of nonlinear variation in the risk structure of equity returns depending on the economic regime, as well as variation in the extent to which the assumptions underlying the CAPM are met. Specifically, the theoretical assumptions underlying the CAPM appear to be violated with high probability during economic downturns and financial crises. As an explainability concept, the PENN is interesting insofar as it achieves interpretability without inherently sacrificing complexity. The trend in explainability research is arguably towards model-agnostic algorithms, such as SHAP or LIME (Molnar, 2020). Nonetheless, by embedding the concept of interpretability into the neural network loss function, the need to impose a simpler explainability model is obviated. While the difference may appear trivial, it is precisely to capture non-additive behavior that a neural network is generally deployed. If the econometrician's objective is to explore a presumed nonlinear additive DGP, a sufficient toolkit of interpretable and more readily implemented methods exists that is fit for the task. Using a neural network to explore a complex DGP must be accompanied by an explainability approach that is equally complex. The PENN provides an extremely flexible environment, capable of capturing asymmetries, thresholds, regime changes and many other types of nonlinear behavior. The flexibility of the underlying inference network facilitates the estimation of local parameters without imposing onerous assumptions on the DGP. The approach to regularization is highly intuitive, taking the form of shrinkage towards the static linear solution. This permits the PENN to be employed when data availability is comparatively limited, and makes the task of model training more transparent, with the complexity of the neural network directly observable in the heterogeneity of the posterior densities.

Let $q_\theta(\beta|x)$ be an inference network that approximates the posterior density of $\beta$, $p(\beta|y,x)$. The density $q_\theta(\beta|x)$ follows a mean field variational distribution with

$$q_\theta(\beta|x) = \prod_{i=1}^{N} q_\theta(\beta_i|x_i), \qquad q_\theta(\beta_i|x_i) = \mathcal{N}\big(\mu_\theta(x_i),\, \sigma^2_\theta(x_i)\big),$$

where $i \in 1, \dots, N$. The functions $\mu_\theta(\cdot)$ and $\sigma^2_\theta(\cdot)$ are components of the inference network that return the parameters of $q_\theta(\beta_i|x_i)$ given the data. The information difference between the approximate and true posteriors can be measured using a Kullback-Leibler divergence, $D_{KL}$, with

$$D_{KL}\big(q_\theta(\beta_i|x_i)\,\|\,p(\beta_i|y_i,x_i)\big) = \int q_\theta(\beta_i|x_i)\,\log\frac{q_\theta(\beta_i|x_i)}{p(\beta_i|y_i,x_i)}\,d\beta_i. \tag{A.3}$$

The LHS of Eq. A.3 is simply referred to as $D_{KL}$ below. Applying Bayes' rule to Eq. A.3 results in

$$D_{KL} = \int q_\theta(\beta_i|x_i)\,\log\frac{q_\theta(\beta_i|x_i)\,p(y_i|x_i)}{p(y_i|\beta_i,x_i)\,p(\beta_i|x_i)}\,d\beta_i.$$

Next, use the law of logarithms and distribute the integrand:

$$D_{KL} = \int q_\theta(\beta_i|x_i)\,\log\frac{q_\theta(\beta_i|x_i)}{p(y_i|\beta_i,x_i)\,p(\beta_i|x_i)}\,d\beta_i + \int q_\theta(\beta_i|x_i)\,\log p(y_i|x_i)\,d\beta_i. \tag{A.6}$$

The term $\log p(y_i|x_i)$ can be removed from the second integral in Eq. A.6, since it does not depend on $\beta_i$:

$$D_{KL} = \int q_\theta(\beta_i|x_i)\,\log\frac{q_\theta(\beta_i|x_i)}{p(y_i|\beta_i,x_i)\,p(\beta_i|x_i)}\,d\beta_i + \log p(y_i|x_i)\int q_\theta(\beta_i|x_i)\,d\beta_i. \tag{A.7}$$

Given that $q_\theta(\beta_i|x_i)$ is a probability distribution and integrates to one, Eq. A.7 can be simplified further:

$$D_{KL} = \int q_\theta(\beta_i|x_i)\,\log\frac{q_\theta(\beta_i|x_i)}{p(y_i|\beta_i,x_i)\,p(\beta_i|x_i)}\,d\beta_i + \log p(y_i|x_i).$$

Now, $\log p(y_i|x_i)$ is moved to the LHS, and the law of logarithms is applied once again on the RHS, followed by a distribution of the integrand:

$$D_{KL} - \log p(y_i|x_i) = -\int q_\theta(\beta_i|x_i)\,\log p(y_i|\beta_i,x_i)\,d\beta_i + \int q_\theta(\beta_i|x_i)\,\log\frac{q_\theta(\beta_i|x_i)}{p(\beta_i|x_i)}\,d\beta_i. \tag{A.10}$$

The first term on the RHS of Eq. A.10 is an expectation, while the second represents another Kullback-Leibler divergence. Adjusting Eq. A.10 to reflect this results in

$$D_{KL} - \log p(y_i|x_i) = -\mathbb{E}_{q_\theta(\beta_i|x_i)}\big[\log p(y_i|\beta_i,x_i)\big] + D_{KL}\big(q_\theta(\beta_i|x_i)\,\|\,p(\beta_i|x_i)\big). \tag{A.11}$$

Eq. A.11 can be rearranged to yield the disaggregated form of the equation in Proposition 1:

$$\log p(y_i|x_i) - D_{KL}\big(q_\theta(\beta_i|x_i)\,\|\,p(\beta_i|y_i,x_i)\big) = \mathbb{E}_{q_\theta(\beta_i|x_i)}\big[\log p(y_i|\beta_i,x_i)\big] - D_{KL}\big(q_\theta(\beta_i|x_i)\,\|\,p(\beta_i|x_i)\big). \tag{A.12}$$

Finally, aggregating Eq. A.12 over $N$ results in the equation in Proposition 1:

$$\sum_{i=1}^{N}\Big[\log p(y_i|x_i) - D_{KL}\big(q_\theta(\beta_i|x_i)\,\|\,p(\beta_i|y_i,x_i)\big)\Big] = \sum_{i=1}^{N}\Big[\mathbb{E}_{q_\theta(\beta_i|x_i)}\big[\log p(y_i|\beta_i,x_i)\big] - D_{KL}\big(q_\theta(\beta_i|x_i)\,\|\,p(\beta_i|x_i)\big)\Big]. \tag{A.13}$$
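The derivation above yields an objective of the familiar evidence-lower-bound form: an expected log-likelihood minus a Kullback-Leibler penalty. As a minimal illustration (not the paper's implementation), the following Python sketch evaluates such a loss for a single observation of a locally linear model with Gaussian approximate posterior and prior. The Gaussian likelihood, the noise_var argument and the use of lam as the weight on the KL term are assumptions made for the example; the closed form used for the Gaussian KL divergence is derived below.

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL divergence D_KL(q || p) between diagonal Gaussians,
    summed over the parameter dimensions."""
    return np.sum(0.5 * np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5)

def negative_elbo(y_i, x_i, beta_sample, mu_q, var_q, mu_p, var_p,
                  noise_var=1.0, lam=1.0):
    """Per-observation negative ELBO for a locally linear prediction x_i' beta.

    The first term is a Gaussian negative log-likelihood evaluated at a
    reparameterized sample of beta; the second shrinks the local posterior
    towards the prior, weighted by lam (playing the role of lambda above)."""
    resid = y_i - x_i @ beta_sample
    nll = 0.5 * resid ** 2 / noise_var + 0.5 * np.log(2.0 * np.pi * noise_var)
    return nll + lam * gaussian_kl(mu_q, var_q, mu_p, var_p)

# Example with K = 2 local parameters:
loss = negative_elbo(y_i=0.3, x_i=np.array([1.0, 0.5]),
                     beta_sample=np.array([0.2, 0.1]),
                     mu_q=np.array([0.2, 0.1]), var_q=np.array([0.05, 0.05]),
                     mu_p=np.zeros(2), var_p=np.ones(2))
```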
Let $q_\theta(\beta_{ik}|x_i) = \mathcal{N}(\mu^q_{ik}, \sigma^{q\,2}_{ik})$ and $p(\beta_{ik}|x_i) = \mathcal{N}(\mu^p_{ik}, \sigma^{p\,2}_{ik})$ be the normally distributed approximate posterior and prior densities of a parameter $\beta_{ik} \in \beta$. Given the definition of the Kullback-Leibler divergence, the regularization term becomes

$$D_{KL}\big(q_\theta(\beta_{ik}|x_i)\,\|\,p(\beta_{ik}|x_i)\big) = \int q_\theta(\beta_{ik}|x_i)\,\log\frac{q_\theta(\beta_{ik}|x_i)}{p(\beta_{ik}|x_i)}\,d\beta_{ik}.$$

The term in the logarithm can be simplified, resulting in

$$D_{KL} = \int q_\theta(\beta_{ik}|x_i)\left[\log\frac{\sigma^p_{ik}}{\sigma^q_{ik}} - \frac{(\beta_{ik}-\mu^q_{ik})^2}{2\sigma^{q\,2}_{ik}} + \frac{(\beta_{ik}-\mu^p_{ik})^2}{2\sigma^{p\,2}_{ik}}\right]d\beta_{ik}. \tag{B.4}$$

Eq. B.4 can be expressed as an expectation and transformed using the rules of the expectations operator, such that

$$D_{KL} = \log\frac{\sigma^p_{ik}}{\sigma^q_{ik}} - \frac{\mathbb{E}_q\big[(\beta_{ik}-\mu^q_{ik})^2\big]}{2\sigma^{q\,2}_{ik}} + \frac{\mathbb{E}_q\big[(\beta_{ik}-\mu^p_{ik})^2\big]}{2\sigma^{p\,2}_{ik}}.$$

The expectation of the squared difference from the mean is simply the variance, i.e. $\mathbb{E}_q[(\beta_{ik}-\mu^q_{ik})^2] = \sigma^{q\,2}_{ik}$. The final expectation is evaluated by expanding the bracket, adding and subtracting $\mu^q_{ik}$; grouping the terms in the bracket and multiplying out now yields $\mathbb{E}_q[(\beta_{ik}-\mu^p_{ik})^2] = \sigma^{q\,2}_{ik} + (\mu^q_{ik}-\mu^p_{ik})^2$, so that

$$D_{KL}\big(q_\theta(\beta_{ik}|x_i)\,\|\,p(\beta_{ik}|x_i)\big) = \log\frac{\sigma^p_{ik}}{\sigma^q_{ik}} + \frac{\sigma^{q\,2}_{ik} + (\mu^q_{ik}-\mu^p_{ik})^2}{2\sigma^{p\,2}_{ik}} - \frac{1}{2}.$$

With the assumption of independently distributed prior and posterior parameters, the stacked Kullback-Leibler penalty is obtained by summing over $K$, such that

$$D_{KL}\big(q_\theta(\beta_i|x_i)\,\|\,p(\beta_i|x_i)\big) = \sum_{k=1}^{K}\left[\log\frac{\sigma^p_{ik}}{\sigma^q_{ik}} + \frac{\sigma^{q\,2}_{ik} + (\mu^q_{ik}-\mu^p_{ik})^2}{2\sigma^{p\,2}_{ik}} - \frac{1}{2}\right].$$

The prior density $p(\beta|x)$ is parameterized using a kernel weighting function that identifies disjoint neighborhoods ($I$) of points in $x$, with the number of neighborhoods determined by the hyperparameter $\delta$. Instead of defining disjoint neighborhoods, traditional kernels can be used to calculate $\pi_x$. This section examines the effect of two compact support kernels (Epanechnikov and tri-cube), with support regions determined using a bandwidth parameter ($b$). The Epanechnikov kernel is defined as

$$H(a) = \tfrac{3}{4}\,(1-a^2)\,\mathbb{1}\{|a|\leq 1\}.$$

Now, a weighting matrix $\pi_{x;b}$ is defined, whose $i$th row elements are given by $H(a_{ij})/\sum_j H(a_{ij})$, and which subsequently substitutes $\pi_{x;\delta}$ during PENN training. In the example considered here, the overall weight of the prior in the loss function is high ($\lambda = 100$). The default kernel results in disjoint parameter regimes, while both the Epanechnikov and tri-cube kernels smoothly approximate the data generating process as the bandwidth is increased (i.e. the area of support is reduced). No clear preference is stated here, with all kernels presenting valid approaches to learning relational prior behavior. The smooth (as opposed to disjoint) neighborhoods tend to result in much stronger regularization towards static parameters when the data are noisy. Since smoothness in the parameter estimates is also attained by reducing $\lambda$ from the illustratively high level used in the example, the disjoint kernel does not limit flexibility. In fact, $\delta$ has the intuitive appeal that it defines clear parameter regimes, with the smoothness of regime transitions depending on the level of $\lambda$.
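As an illustration of the kernel weighting just described, the sketch below constructs a row-normalized Epanechnikov or tri-cube weight matrix from pairwise distances. The scaling of distances by the bandwidth and all function names are assumptions made for the example; the paper's exact scaling convention may differ.

```python
import numpy as np

def epanechnikov(a):
    """Epanechnikov kernel with compact support on |a| <= 1."""
    return np.where(np.abs(a) <= 1.0, 0.75 * (1.0 - a ** 2), 0.0)

def tricube(a):
    """Tri-cube kernel with compact support on |a| <= 1."""
    return np.where(np.abs(a) <= 1.0, (1.0 - np.abs(a) ** 3) ** 3, 0.0)

def kernel_weight_matrix(x, bandwidth, kernel=epanechnikov):
    """Row-normalized weighting matrix pi_{x;b}.

    Pairwise Euclidean distances between rows of x are scaled by the
    bandwidth and passed through a compact-support kernel; each row is
    then normalized to sum to one."""
    diffs = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    weights = kernel(dist / bandwidth)
    return weights / weights.sum(axis=1, keepdims=True)

# Example: weights for 100 observations of a 5-dimensional input.
x = np.random.default_rng(0).standard_normal((100, 5))
pi_xb = kernel_weight_matrix(x, bandwidth=2.0)
```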
Proposition 2 states that when $\nabla_x \beta_i = 0$ and features are independent, the parameter vector $\beta^m_i$ equals the gradient of the PENN network $f(x)$ with respect to the inputs, $\nabla_x f^m(x_i) = \beta^m_i$. Using the product rule to compute the partial derivative of $f^m(x_i)$ with respect to $x_{ik}$, where $f^m(x_i) = x_i^\top \beta^m_i$, yields

$$\frac{\partial f^m(x_i)}{\partial x_{ik}} = \beta^m_{ik} + x_i^\top \frac{\partial \beta^m_i}{\partial x_{ik}} = \beta^m_{ik} + x_i^\top\, 0,$$

where $0$ is a vector of zeroes. Thus, when the gradient of the inference network with respect to the inputs is zero, the parameters are equal to the overall network gradient. The derivative in the presence of dependent features changes slightly, as shown in Eq. C.3:

$$\frac{\partial f^m(x_i)}{\partial x_{ik}} = \beta^m_{ik} + x_i^\top \frac{\partial \beta^m_i}{\partial x_{ik}}. \tag{C.3}$$

Enforcing local stability such that $\partial q_\theta(\beta|x)/\partial x_k \approx 0$ results in a neural network that encodes $\beta^m$ as an approximately constant function of the inputs. It is clear that the constraint is not equivalent to the parameter values: $\nabla_x f^m(x_i) \neq \beta^m_i$ whenever $\nabla_x \beta^m_i \neq 0$. Training a neural network using a gradient penalty such that $\|\nabla_x f^m(x_i) - \beta^m_i\| \approx 0$, as in the case of the SENN, therefore results in a parameter bias when features are dependent.
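The parameter bias described in the proof can be illustrated numerically. The sketch below (illustrative only, not from the paper) defines a toy locally linear function $f(x) = x^\top\beta(x)$ in which $\beta(\cdot)$ varies with $x$, and compares a finite-difference gradient of $f$ to $\beta(x)$ itself; the gap between the two corresponds to the second term in Eq. C.3.

```python
import numpy as np

def beta_fn(x):
    """Toy local parameter function beta(x); it varies with x, mimicking an
    inference network trained on dependent features."""
    return np.array([1.0 + 0.5 * x[1], -2.0 + 0.3 * x[0] ** 2])

def f(x):
    """Locally linear prediction f(x) = x' beta(x)."""
    return x @ beta_fn(x)

def numerical_gradient(fn, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        grad[k] = (fn(x + e) - fn(x - e)) / (2 * eps)
    return grad

x0 = np.array([0.8, -0.4])
print("local parameters beta(x0): ", beta_fn(x0))
print("network gradient grad f(x0):", numerical_gradient(f, x0))
# The two differ because d(beta)/dx is nonzero, illustrating the bias that a
# gradient-matching penalty (as in the SENN) would introduce.
```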
References

Explaining Individual Predictions When Features Are Dependent: More Accurate Approximations to Shapley Values
Learning about Beta: Time-Varying Factor Loadings, Expected Returns, and the Conditional CAPM (Staff Report 193)
Keras: R Interface to 'Keras'
Tensorflow: R Interface to 'TensorFlow'
On the Robustness of Interpretability Methods
Towards Principled Methods for Training Generative Adversarial Networks
Approximate Residual Balancing: Debiased Inference of Average Treatment Effects in High Dimensions
Generalized Random Forests
Inference on Treatment Effects after Selection among High-Dimensional Controls
Program Evaluation and Causal Inference With High-Dimensional Data
On the Use of Cross-Validation for Time Series Predictor Evaluation
Variational Inference: A Review for Statisticians
Application to Default Risk Analysis (Staff Working Paper 816)
Random Forests
Analysis of Investments and Management of Portfolios
Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission
Symmetric Variational Autoencoder and Connections to Adversarial Learning
Double/Debiased Machine Learning for Treatment and Structural Parameters
Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems
Differential Information and Performance Measurement Using a Security Market Line
Regulation (EU) 2016/679 of the European Parliament, Directive 95/46/EC (General Data Protection Regulation)
The Cross-Section of Expected Stock Returns
Common Risk Factors in the Returns on Stocks and Bonds
Size and Book-to-Market Factors in Earnings and Returns
Deep Neural Networks for Estimation and Inference
Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?
Scaleable Input Gradient Regularization for Adversarial Robustness
All Models Are Wrong, but Many Are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously
Local Linear Forests
Greedy Function Approximation: A Gradient Boosting Machine
Time-Varying CAPM and Its Applicability in Cost of Equity Determination
Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation
Deep Learning (Adaptive Computation and Machine Learning)
The Role of Conditioning Information in Deducing Testable Restrictions Implied by Dynamic Asset Pricing Models
Generalized Additive Models
Generalized Additive Models
Learning Basic Visual Concepts with a Constrained Variational Framework (ICLR Conference Paper)
Significance Tests for Neural Networks
Multilayer Feedforward Networks Are Universal Approximators
The Conditional CAPM and the Cross-Section of Expected Returns
The Performance of Mutual Funds in the Period 1945-1964
An Introduction to Variational Methods for Graphical Models
Shapley Regressions: A Framework for Statistical Inference on Machine Learning Models (Staff Working Paper 784)
Finding Groups in Data: An Introduction to Cluster Analysis
Economy Statistical Recurrent Units For Inferring Nonlinear Granger Causality
Adam: A Method for Stochastic Optimization
Auto-Encoding Variational Bayes
Estimating Systematic Risk Using Time Varying Distributions
The Conditional CAPM Does Not Explain Asset-Pricing Anomalies
The Valuation of Risk Assets and the Selection of Risky Investments in Stock Portfolios and Capital Budgets
Efficient Estimation of General Additive Neural Networks: A Case Study for CTG Data
Amortized Causal Discovery: Learning to Infer Causal Graphs from Time-Series Data
A Unified Approach to Interpreting Model Predictions (Conference on Neural Information Processing Systems)
Housing Collateral, Consumption Insurance, and Risk Premia: An Empirical Perspective
Interpretable Models for Granger Causality Using Self-explaining Neural Networks
Portfolio Selection
Towards Robust Interpretability with Self-Explaining Neural Networks (32nd Conference on Neural Information Processing Systems)
Iml: An R Package for Interpretable Machine Learning
Interpretable Machine Learning
Causal Discovery with Attention-Based Convolutional Neural Networks
Is Value Riskier Than Growth?
Generalized Additive Neural Networks
Consistent Cross-Validatory Model-Selection for Dependent Data: Hv-Block Cross-Validation
Stochastic Backpropagation and Approximate Inference in Deep Generative Models
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
The Capital Asset Pricing Model: A Critical Literature Review
Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead
Labor Income and Predictable Stock Returns
A Value for N-Person Games
Capital Asset Prices: A Theory of Market Equilibrium Under Conditions of Risk
Learning Important Features Through Propagating Activation Differences
Explaining Prediction Models and Individual Predictions with Feature Contributions
An Interpretable and Sparse Neural Network Model for Nonlinear Granger Causality Discovery
Machine Learning and Causality: The Impact of Financial Crises on Growth
Big Data: New Tricks for Econometrics
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests
Fast Stable Restricted Maximum Likelihood and Marginal Likelihood Estimation of Semiparametric Generalized Linear Models
Discovering Nonlinear Relations with Minimum Predictive Information Regularization
Advances in Variational Inference