Hierarchical Regularizers for Mixed-Frequency Vector Autoregressions
Alain Hecq, Marie Ternes, Ines Wilms
February 23, 2021

Abstract. Mixed-frequency Vector AutoRegressions (MF-VAR) model the dynamics between variables recorded at different frequencies. However, as the number of series and high-frequency observations per low-frequency period grow, MF-VARs suffer from the "curse of dimensionality". We curb this curse through a regularizer that permits hierarchical sparsity patterns by prioritizing the inclusion of coefficients according to the recency of the information they contain. Additionally, we investigate the presence of nowcasting relations by sparsely estimating the MF-VAR error covariance matrix. We study predictive Granger causality relations in a MF-VAR for the U.S. economy and construct a coincident indicator of GDP growth. Supplementary Materials for this article are available online.

Vector AutoRegressive (VAR) models are a cornerstone for modeling multivariate time series, studying their dynamics, and forecasting. However, standard VARs require all component series to enter the model at the same frequency, while in practice macro and financial series are typically recorded at different frequencies: quarterly, monthly, or weekly, for instance. One could aggregate high-frequency variables to one common low frequency and continue the analysis with a standard VAR, but such a practice wastes valuable information contained in high-frequency data for two main reasons. First, high-frequency data are inherently more timely; they closely track the state of the economy in real time. Second, they can help unmask dynamics that would be hidden under temporal aggregation (see e.g., recent discussions in Cimadomo et al., 2021; Paccagnini and Parla, 2021).

Mixed-frequency (MF) models, instead, exploit the information available in series recorded at different frequencies. One commonly used MF model is the MIxed DAta Sampling (MIDAS) regression (Ghysels et al., 2004). While the literature first focused on a single-equation framework for modelling the low-frequency response, the multivariate extension by Ghysels (2016) enabled one to model the relations between high- and low-frequency series in a mixed-frequency VAR (MF-VAR) system. This is known as the stacked MF-VAR approach since the MF-VAR is estimated at the lowest frequency and all higher-frequency variables are treated as separate component series which are stacked in the MF-VAR.¹

A complication with MF-VARs is that they are severely affected by the "curse of dimensionality". This curse arises from two sources. First, the number of parameters grows quadratically with the number of component series, just like for standard VARs. Second, specific to MF-VARs, we have many high-frequency observations per low-frequency observation, which each enter as different component series in the model, thereby adding to the dimensionality. Without further adjustments, one would be limited to MF-VARs with few series and/or a small number of high-frequency observations per low-frequency observation.

¹ Alternatively to stacking, MF-VARs can be accommodated within the framework of state-space models, where the low-frequency variables are modelled as high-frequency series with latent observations (e.g., Kuzin et al., 2011; Foroni and Marcellino, 2014; Schorfheide and Song, 2015; Brave et al., 2019; Gefang et al., 2020; Koelbl and Deistler, 2020; the latent "L-BVAR" of Cimadomo et al., 2021). We contribute to the literature stream of Ghysels (2016) as the stacked MF-VAR system allows for the application of standard VAR tools such as Granger causality to the mixed-frequency setting, as discussed in Section 2.

This curse of dimensionality has mostly been addressed through mixed-frequency factor models (e.g., Marcellino and Schumacher, 2010; Foroni and Marcellino, 2014; Andreou et al., 2019) or Bayesian estimation (e.g., Schorfheide and Song, 2015; Ghysels, 2016; Götz et al., 2016; Cimadomo et al., 2021; Paccagnini and Parla, 2021). Sparsity-inducing regularizers form an appealing alternative (see Hastie et al., 2015 for an introduction), but despite their popularity in regression and standard VAR settings (e.g., Hsu et al., 2008; Davis et al., 2016; Callot et al., 2017; Demirer et al., 2018; Smeekes and Wijler, 2018; Barigozzi and Brownlees, 2019; Hecq et al., 2021a), they have only rarely been explored as a tool for dimension reduction in mixed-frequency models. An exception is Babii et al. (2021), who recently used the sparse-group lasso to accommodate the dynamic nature of high-dimensional, mixed-frequency data, thereby providing a complementary structured machine learning perspective to the penalized Bayesian MIDAS approach of Mogliani and Simoni (2021). Nonetheless, they address univariate MIDAS regressions, leaving regularization of MF-VARs unexplored.

Our paper's first contribution concerns the introduction of a novel convex regularizer that extends the univariate MIDAS approach of Babii et al. (2021) to the multivariate MF-VAR setting. To this end, we propose a mixed-frequency extension of the single-frequency hierarchical regularizer by Nicholson et al. (2020) used for standard VARs that accounts for covariates at different (high-frequency) lags being temporally ordered. We build upon the group lasso with nested groups and encourage a hierarchical sparsity pattern that prioritizes the inclusion of coefficients according to the recency of the information the corresponding series contains about the state of the economy.

In addition to the development of a new MF-VAR with hierarchical lag structure, our paper investigates the presence of nowcasting restrictions in a high-dimensional mixed-frequency setting. According to Eurostat's glossary, a nowcast is "a rapid [estimate] produced during the current reference period [, say T* a particular quarter,] for a hard economic variable of interest observed for the same reference period [T*]" (Eurostat: Statistics Explained, 2014). In this narrow sense, and contrary to forecasting, nowcasting makes use of all information becoming available between (and strictly speaking not including) T*−1 and T*. Götz and Hecq (2014) show that nowcasting in (low-dimensional) MF-VARs can be studied through contemporaneous Granger causality tests, by testing the null of a block-diagonal error covariance matrix of the MF-VAR. We build on Götz and Hecq (2014) to study nowcasting relations in high-dimensional MF-VARs by sparsely estimating the covariance matrix of the MF-VAR errors. Its sparsity pattern then provides evidence on those high-frequency variables (i.e., the series and their particular time period) one can use to build coincident indicators for the low-frequency main economic indicators.
In a simulation study (Section 4), we find that the hierarchical regularizer performs well in terms of estimation accuracy and variable selection when compared to alternative methods. Furthermore, we accurately retrieve nowcasting relations between the low- and high-frequency variables by sparsely estimating the error covariance matrix. In the application (Section 5), we study a high-dimensional MF-VAR for the U.S. economy. We apply the hierarchical regularizer to characterize the predictive Granger causality relations through a network analysis. Moreover, we investigate which high-frequency series nowcast quarterly U.S. real gross domestic product (GDP) growth, use those to construct a reliable coincident indicator of GDP growth, and evaluate its performance pre and post Covid-19.

The remainder of this paper is structured as follows. Section 2 introduces the MF-VAR with hierarchically structured parameters and defines the nowcasting causality relations. Section 3 describes the regularized estimation procedure for MF-VARs and nowcasting causality. Section 4 presents the results of the simulation study, Section 5 the empirical application to the U.S. economy. Section 6 concludes. Additional results are available in the online appendix.

We start from Ghysels' (2016) mixed-frequency VAR systems and include d different high-frequency components. Let y(t) denote a $k_L$-dimensional vector collecting the low-frequency variables for t = 1, ..., T. Further, let $x_{m_1}(t), x_{m_2}(t), \ldots, x_{m_d}(t)$ denote the d different, multivariate high-frequency components with $m_1, m_2, \ldots, m_d$ high-frequency observations per low-frequency period t, respectively. Without loss of generality, we take $m_1 < m_2 < \cdots < m_d$. For instance, in a quarter/month/week example, $m_1 = 3$ and $m_2 = 12$. Let each component contain $k_1, k_2, \ldots, k_d$ time series of the same high frequency, respectively. To be precise, each high-frequency component stacks its within-period observations,
$$x_{m_i}(t) = \big(x_{m_i}(t,1)', \ldots, x_{m_i}(t,m_i)'\big)',$$
where the couple (t, j) indicates higher-frequency period $j = 1, \ldots, m_i$ during the low-frequency period t. The MF-VAR$_K(\ell)$ for a lag length of $\ell = 1$ is then given by
$$\bar{y}_t = B \, \bar{y}_{t-1} + u_t, \qquad \bar{y}_t = \big(y(t)', x_{m_1}(t)', \ldots, x_{m_d}(t)'\big)', \qquad (1)$$
where $B \in \mathbb{R}^{K \times K}$ denotes the autoregressive coefficient matrix at lag 1, the total number of series is $K = k_L + \sum_{i=1}^{d} k_i m_i$, and $u_t$ is a mean zero error vector with nonsingular contemporaneous covariance matrix $\Sigma_u$. We focus on stationary time series and assume that all series are mean-centered such that no intercept is included.

As for the standard VAR, the $ij$-th entry of B, denoted by $\beta_{ij}$, captures the lagged effect of the jth series on the ith series. The entries of the autoregressive coefficient matrix thus permit one to study one-step-ahead predictive Granger causality, an often investigated feature also in the context of MF-VARs (e.g., Ghysels et al., 2016; Götz et al., 2016). Besides, given the mixed-frequency nature of the variables, it may be of interest to analyze whether knowing the values of the high-frequency variable at time t helps to predict the low-frequency variable during the same time period and vice versa. This form of hidden, contemporaneous Granger causality between high- and low-frequency variables can be derived from the error covariance matrix $\Sigma_u$ (see Section 2.2). We focus on first-order mixed-frequency VARs throughout the paper; extensions to higher-order systems can easily be made, see Remark 2.2.
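To make the stacked representation concrete, here is a minimal sketch for the simplest quarterly/monthly case (d = 1, $k_L = k_1 = 1$, $m_1 = 3$). The function name and array layout are ours, chosen for illustration only.

```python
import numpy as np

def stack_mixed_frequency(y_q, x_m, m=3):
    """Stack one quarterly series and one monthly series into the
    MF-VAR vectors ybar_t = (y(t), x(t,1), ..., x(t,m))'.

    y_q : (T,) array of quarterly observations y(1), ..., y(T)
    x_m : (T*m,) array of monthly observations, ordered chronologically
    Returns a (T, 1+m) array whose t-th row is ybar_t'.
    """
    T = len(y_q)
    assert len(x_m) == T * m, "need exactly m high-frequency obs per period"
    X = x_m.reshape(T, m)             # row t holds x(t,1), ..., x(t,m)
    return np.column_stack([y_q, X])  # K = k_L + k_1 * m_1 columns

# toy example: 8 quarters of data
rng = np.random.default_rng(0)
ybar = stack_mixed_frequency(rng.normal(size=8), rng.normal(size=24))
print(ybar.shape)  # (8, 4): K = 1 + 1 * 3
```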
If the true model is multivariate, each component series follows an ARMA$(K\ell, (K-1)\ell)$; see the final equation representation of VARs in Zellner and Palm (1974), or Cubadda et al. (2009) for a more recent discussion. Besides, a VAR(1) can even yield long-memory processes for $K \to \infty$ (Chevillon et al., 2018). Hence, as a large system with a small lag length can generate smaller systems with rich and versatile dynamics, it is plausible to assume that the data generating process of a high-dimensional MF-VAR has a small $\ell$. But even a MF-VAR$_K$(1) model requires one to estimate $K^2$ parameters, which quickly becomes large since for a fixed T the parameter vector grows (quadratically) with the number of time series included but also due to the high-per-low frequency observations $m_1, \ldots, m_d$.

In the remainder, we use matrix notation to compactly express the mixed-frequency VAR. To this end, define
$$Y = (\bar{y}_2, \ldots, \bar{y}_T)', \quad \bar{X} = (\bar{y}_1, \ldots, \bar{y}_{T-1})', \quad U = (u_2, \ldots, u_T)', \quad X = I_K \otimes \bar{X},$$
where $N = T - 1$ is the number of time points available given the MF-VAR$_K$(1), $I_K$ denotes the identity matrix of dimension K and ⊗ the Kronecker product. Then the MF-VAR$_K$(1) can be written as $y = X\beta + u$, with $y = \mathrm{vec}(Y)$, $\beta = \mathrm{vec}(B')$, and $u = \mathrm{vec}(U)$.

In the classical low-dimensional setting K < N, the MF-VAR can be estimated by least squares. However, as the number of parameters grows relative to the time series length T, least squares becomes unreliable as it results in high variance, overfitting and poor out-of-sample forecasting. We therefore resort to penalization methods. Many authors have used the lasso (Tibshirani, 1996), which attains so-called "patternless" sparsity in the parameter matrices of the VAR (e.g., Hsu et al., 2008; Demirer et al., 2018). We, instead, use the dynamic structure of the MF-VAR as a guide in our sparse estimation procedure. Therefore, we propose regularized estimation of MF-VARs that translates information about the hierarchical structure of low- and high-frequency variables into a convex regularizer that delivers structured sparsity patterns appropriate to the context of MF-VARs.

We describe the hierarchical sparsity patterns that arise in the autoregressive coefficient matrix B of a MF-VAR. We demonstrate the intuition, for ease of notation, with $k_L = k_1 = \cdots = k_d = 1$; see Remark 2.1 for an extension to multiple series. The parameters of the matrix B can be divided into $(d+1)^2$ different groups as depicted by the sub-matrices in the left panel of Figure 1. We distinguish three types of sub-matrices capturing respectively Own-on-Own, Higher-on-Lower and Lower-on-Higher effects, to extend standard VAR estimation with single frequencies to mixed frequencies. The Own-on-Own sub-matrices are $m \times m$ square matrices that lie on the main diagonal and describe the effects of a series' own lags on itself (with $m_L = 1$ for the low-frequency variable). The Higher-on-Lower sub-matrices are short, wide $m_L \times m_H$ matrices, where L and H refer to the corresponding lower- and higher-frequency variable. They lie above the main diagonal and contain the lagged effect of a higher-frequency series onto a series with respective lower frequency. Note that L (and H) just identify which of the two variables is of lower (and higher) frequency. Thus, this group not only contains the effects of the higher-frequency variables onto the variable with the lowest frequency but also describes the interactions between the higher-frequency variables.
For instance, in a quarter/month/week example, these incorporate the effects of the monthly and weekly variable onto the quarterly variable but also the effects of the weekly onto the monthly variable. The Lower-on-Higher sub-matrices are long, thin $m_H \times m_L$ matrices. They lie below the main diagonal and contain the lagged effect of a lower-frequency series onto a higher-frequency series.

For each sub-matrix, we impose a hierarchical priority structure for parameter inclusion. Parameters with higher priority within one group should be included in the model before parameters with lower priority within the same group, where the priority value of each parameter depends on how informative a practitioner finds the associated regressor. A priority value of one indicates highest priority for inclusion. More precisely, we introduce a priority value $p^{g}_{ij} \in \{1, \ldots, P_g\}$ for each element ij belonging to parameter group $g = 1, \ldots, G$ in the matrix B to denote its inclusion priority. Here, we order the groups in Figure 1 such that the top-left sub-matrix is group g = 1 and the rectangle to its right is group g = 2. If $p^{g}_{ij} < p^{g}_{i'j'}$, we prioritize parameter $\beta_{ij}$ over $\beta_{i'j'}$ in the model. The hierarchical structure is imposed for each group g individually and hence in each group the priority values start with value 1 (highest priority) and go up to $P_g$ (lowest priority). We do not encode structures where certain parameters of one group should enter the model before certain parameters of another group, as our aim is to encode priority of parameter inclusion with respect to the effects of one frequency component (Own/Higher/Lower) on another (Own/Higher/Lower).

These hierarchical priority structures are highly general and could potentially accommodate various structured sparsity patterns in the AR parameter matrix that researchers or practitioners want to encourage. We give special attention to a recency-based priority structure, where the priority value $p^{g}_{ij}$ is set according to the recency of the information the jth time series of $\bar{y}_{t-1}$ contains relative to the ith series of $\bar{y}_t$. The more recent the information contained in the lagged predictor, the more informative, and thus the higher its inclusion priority. Figure 2 visualizes the recency-based structure for an example with only one quarterly and one monthly variable, having a $4 \times 4$ coefficient matrix B and G = 4 groups. For instance, consider the Higher-on-Lower parameter block, where month three of the previous quarter contains the most recent information, followed by month two and then month one. The priority values thus decrease from left to right, with month three receiving value one.

The recency-based priority structure is conceptually similar to other often used restriction schemes in the (MF-)VAR literature. For the Higher-on-Lower group, it is similar to a MIDAS weighting function with decaying shape, for instance an exponential Almon lag polynomial. Both approaches assume a decaying memory pattern in the economic processes; however, in our setting, we do not restrict the parameters to a specific nonlinear function. Besides, our recency-based priorities are similar in spirit to the Minnesota prior in the Bayesian literature (e.g., Cimadomo et al., 2021) in the sense that longer lags are treated differently than shorter lags as they are expected to contain less information about a variable's current value.
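The recency-based priorities of Figure 2 can be generated mechanically. The sketch below is our reading of the recency rule, not code from the paper: each coefficient's priority value is the dense rank, within its group, of the elapsed time between the target's sub-period and the lagged predictor's sub-period. For the Higher-on-Lower block this yields values (3, 2, 1) for months one to three, matching the description above.

```python
import numpy as np

def recency_priorities(m=3):
    """Priority values for the (1+m) x (1+m) coefficient matrix B of a
    quarterly/monthly MF-VAR(1) with k_L = k_1 = 1 and m_1 = m.

    Each series is stamped at the end of its own sub-period (the quarter
    at 1, month j at j/m); the elapsed time between target i at t and
    predictor j at t-1 is 1 + stamp[i] - stamp[j].  Within each of the
    (d+1)^2 = 4 groups, smaller elapsed time = smaller priority value,
    with ties sharing the same value.
    """
    stamps = np.array([1.0] + [j / m for j in range(1, m + 1)])
    elapsed = 1.0 + stamps[:, None] - stamps[None, :]
    blocks = [np.s_[:1, :1], np.s_[:1, 1:], np.s_[1:, :1], np.s_[1:, 1:]]
    prio = np.zeros((1 + m, 1 + m), dtype=int)
    for b in blocks:
        d = elapsed[b]
        # dense ranking of the (rounded) elapsed times within the group
        _, ranks = np.unique(d.ravel().round(10), return_inverse=True)
        prio[b] = ranks.reshape(d.shape) + 1
    return prio

print(recency_priorities())
# [[1 3 2 1]     row 1: Own-on-Own (quarterly) | Higher-on-Lower
#  [1 3 2 1]     rows 2-4: Lower-on-Higher     | Own-on-Own (monthly)
#  [2 4 3 2]
#  [3 5 4 3]]
```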
Remark 2.1. If one were to extend to a MF-VAR$_K$(1) with multiple time series per frequency component, the total number of groups G becomes $(k_L + k_1 + \cdots + k_d)^2$. The priority values describing the dependence of a dependent variable with frequency $m_i$ on an independent variable with frequency $m_j$ would then be replicated $k_i \times k_j$ times. E.g., in a setting with two quarterly and three monthly variables, the priority values describing the effect of monthly series on quarterly series are replicated six times.

Remark 2.2. If one were to increase the lag length, the total number of groups G would remain unchanged, but each group would additionally incorporate all its higher-order lagged coefficients. Since higher-order lags contain older information, we decrease their priority of inclusion. For example, the Higher-on-Lower group for a MF-VAR$_4$(2) would have priority values one up to three for lag 1 and four up to six for lag 2. We discuss the sensitivity of our empirical results to the maximum lag length in Section 5.3.

The hidden contemporaneous links between high- and low-frequency variables can be investigated in the covariance matrix $\Sigma_u$ of the error terms $u_t$ in the MF-VAR model (1). Götz and Hecq (2014) test for block diagonality of $\Sigma_u$ to investigate the null of no contemporaneous Granger causality or, to put it differently, the absence of nowcasting relationships between high- and low-frequency indicators. In the remainder, we refer to contemporaneous Granger causality as "nowcasting causality". The authors show that the conditional single-equation model (e.g., MIDAS) with contemporaneous regressors can be misleading as it can change the dynamics observed in a MF-VAR system (e.g., Granger causality relations). We do not face this problem since we work with the reduced-form MF-VAR and not with a single-equation conditional model derived from the MF-VAR.

In the context of our paper, a detailed inspection of the residual covariance matrix $\widehat{\Sigma}_u = \frac{1}{T-\ell} \sum_{t=\ell+1}^{T} \hat{u}_t \hat{u}_t'$ of the high-dimensional MF-VAR is interesting for (at least) two reasons. First, from an economic perspective, we aim to investigate whether there exist high-frequency months, weeks or days of some series that nowcast low-frequency variables. Since this is a correlation measure observed in the symmetric blocks of the residual covariance matrix, we obviously cannot point towards a direction of this contemporaneous link. Lütkepohl (2005, pages 45-48) stresses that the "direction of (nowcasting) causation must be obtained from further knowledge (e.g., economic theory) on the relationship between the variables". We can only agree on that. Second, from a statistical perspective, we aim to compare the performance of the MF-VAR from Section 2.1 when additional restrictions on the error covariance matrix are considered. Intuitively, this is a Generalized Least Squares (GLS) type improvement over the "unrestricted" hierarchical MF-VAR.

In our economic application (to be discussed in Section 5), we consider quarterly real GDP growth as one of the main low-frequency indicators, and are interested in investigating whether some high-frequency monthly (e.g., retail sales) and/or weekly (e.g., money stock, federal funds rate) series can deliver a coincident indicator of GDP growth, with the advantage being that they are released earlier. We do not claim that they directly impact GDP but simply that the business cycle movements detected in those high-frequency variables track the fluctuations of GDP growth well.
To investigate such nowcasting relations, let us assume a single low-frequency variable and decompose $\Sigma_u$ into four blocks,
$$\Sigma_u = \begin{pmatrix} \sigma^2_{1.1} & (\sigma^2_{.K-1})' \\ \sigma^2_{.K-1} & \Sigma_{2:K} \end{pmatrix},$$
with $\Sigma_{2:K}$ the $(K-1) \times (K-1)$ block of $\Sigma_u$ corresponding to the covariances between errors of the high-frequency variables. It contains both (co)variances per high-frequency variable (i.e., its main-diagonal blocks) as well as covariances between different high-frequency variables (i.e., its off-diagonal blocks). $\sigma^2_{1.1}$ is the error variance of the low-frequency variables; it is a scalar when there is only one such variable, but in general it is a square matrix. $\sigma^2_{.K-1}$ is a vector when there is a single low-frequency variable, containing the cross-covariances of the errors between low and high frequencies. This is the block we focus on to detect nowcasting relations between each low-frequency variable and each high-frequency variable.

To detect nowcasting relations, we investigate the existence of a sparse matrix $\Sigma^*_u$, where we leave $\sigma^{*2}_{1.1}$ as well as the main-diagonal blocks in $\Sigma^*_{2:K}$ unrestricted. The sparsity of the block $\sigma^{*2}_{.K-1}$ is our main focus as it allows us to detect nowcasting relations between the low-frequency variables and each of the high-frequency ones. The sparsity we allow for in the off-diagonal blocks of $\Sigma^*_{2:K}$ is not of main economic interest to our nowcasting application, but it may be of interest in other applications and it does facilitate retrieval of a positive definite $\Sigma^*_u$ matrix (see Section 3.2).

Remark 2.3. The case $\sigma^{*2}_{.K-1} = 0_{(K-1) \times 1}$ (with one low-frequency variable) corresponds to block diagonality of $\Sigma^*_u$ and hence the null of no nowcasting causality.

Remark 2.4. Unlike for the autoregressive parameter matrix, we do not impose a hierarchical sparsity structure on $\sigma^{*2}_{.K-1}$ because the middle month could, for instance, be a better coincident indicator for a quarterly variable than the last month. Since this is an empirical issue, we prefer to stick to a regularized estimator of the residual covariance matrix that encourages "patternless" sparsity in $\sigma^{*2}_{.K-1}$.

Finally, to build our coincident indicator, we use a simple procedure which consists of first selecting the variable-period combinations corresponding to non-zero entries in $\sigma^{*2}_{.K-1}$, standardizing the series and subsequently constructing the first principal component. We rescale the coincident indicator to GDP growth as in Lewis et al. (2021).
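A compact sketch of this indicator construction follows. The function name is ours, and the final rescaling step (matching the in-sample mean and variance of GDP growth, plus a sign convention) is our reading of the Lewis et al. (2021) rescaling, not a verbatim reproduction.

```python
import numpy as np

def coincident_indicator(hf_data, nowcast_cov, gdp):
    """Sketch of the coincident-indicator construction of Section 2.2.

    hf_data     : (T, K-1) high-frequency series, one column per
                  variable-period combination
    nowcast_cov : (K-1,) estimated covariances between the GDP error
                  and each high-frequency error (the block sigma^2_{.K-1})
    gdp         : (T,) quarterly GDP growth, used only for rescaling
    """
    sel = np.flatnonzero(nowcast_cov != 0)   # selected variable-periods
    Z = hf_data[:, sel]
    Z = (Z - Z.mean(0)) / Z.std(0)           # standardize -> correlation PCA
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    pc1 = Z @ Vt[0]                          # first principal component
    if np.corrcoef(pc1, gdp)[0, 1] < 0:      # sign convention (assumption)
        pc1 = -pc1
    # rescale to the mean and variance of GDP growth (assumption)
    return gdp.mean() + gdp.std() * (pc1 - pc1.mean()) / pc1.std()
```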
To attain the hierarchical structure presented in Section 2.1, we use a nested group lasso (Zhao et al., 2009), which has been successfully employed for various statistical problems, among which time series models (e.g., Nicholson et al., 2020), generalized additive models (e.g., Lou et al., 2016), regression models with interactions (e.g., Haris et al., 2016) or banded covariance estimation (e.g., Bien et al., 2016). The group lasso uses the sum of (unsquared) Euclidean norms as penalty term to encourage sparsity on the group level: either all parameters of a group are set to zero or none. By using nested groups, hierarchical sparsity constraints are imposed where one set of parameters being zero implies that another set is also set to zero. We encourage hierarchical sparsity within each group of parameters by prioritizing parameters with a lower priority value over parameters with a higher one. The proposed hierarchical group estimator for the MF-VAR is given by
$$\widehat{\beta} = \underset{\beta}{\operatorname{argmin}} \ \tfrac{1}{2} \lVert y - X\beta \rVert_2^2 + \lambda_{\beta} \, P_{\text{Hier}}(\beta), \qquad (2)$$
where $P_{\text{Hier}}(\beta)$ denotes the hierarchical group penalty and $\lambda_{\beta} \geq 0$ is a tuning parameter.

Before introducing the hierarchical group penalty, recall that we distinguish $g = 1, \ldots, G$ parameter groups in the autoregressive parameter matrix and that $P_g$ is the maximum priority value of parameter group g. To impose the hierarchical structure within each parameter group g, we consider $P_g$ nested sub-groups: $s^{(1)}_g$ contains all parameters of group g, $s^{(2)}_g$ omits those parameters having priority value one, and finally the last sub-group $s^{(P_g)}_g$ only contains those parameters having the highest priority value $P_g$. Clearly, a nested structure arises with $s^{(1)}_g \supset s^{(2)}_g \supset \cdots \supset s^{(P_g)}_g$. Let $\beta_{s^{(p)}_g} = [\beta^{(p)\prime}_g, \ldots, \beta^{(P_g)\prime}_g]'$, for $1 \leq p \leq P_g$, where $\beta^{(p)}_g$ collects the parameters of group g having priority value p. We are now ready to define the hierarchical penalty function as
$$P_{\text{Hier}}(\beta) = \sum_{g=1}^{G} \sum_{p=1}^{P_g} w_{s^{(p)}_g} \big\lVert \beta_{s^{(p)}_g} \big\rVert_2. \qquad (3)$$
The hierarchical structure within each group g is built in through the condition that if the parameters with priority value p are zeroed out, then so are all parameters of the same group with priority values larger than p.

Remark 3.1. A related sparse-group lasso penalty has been proposed by Babii et al. (2021). Similar to our penalty, theirs is tailored to the dynamic nature of the mixed-frequency model to account for different high-frequency lags being temporally related. As such, both penalties induce sparsity at two levels: whether a variable is important for explaining another or not; and if important, both steer the effect's duration via an additional sparsity layer. Via our nested group lasso, we do this by building a hierarchy to cut off the effect of one variable on another after a certain (high-frequency) lag. Via the sparse-group lasso, Babii et al. (2021) regulate the shape of the MIDAS weight function for each group, which consists of all high-frequency lags of a single variable.

The weights $w_{s^{(p)}_g}$ balance unequally sized nested sub-groups. We take $w_{s^{(p)}_g} = \operatorname{card}(s^{(1)}_g) - \operatorname{card}(s^{(p)}_g) + 1$, where card(·) corresponds to the cardinality. As the cardinality of the sub-groups $s^{(p)}_g$ is decreasing with p, the weights of the nested sub-groups are increasing with p. Sub-groups containing parameters with lower priority, i.e., with older information about the state of the economy, are thus penalized more and hence more likely to be zeroed out.

Remark 3.2. A simple alternative to our proposed weights are equal weights as used in Zhao et al. (2009) or Nicholson et al. (2020), which would lead to less aggressive shrinkage. Simulations in Section 4.1 reveal that the proposed more aggressive weights are preferable for sparsity recognition. In Section 4.4, we include recommendations to guide forecasters in choosing the weights.

Finally, we propose a proximal gradient algorithm (see e.g., Tseng, 2008) to efficiently solve the optimization problem in Equation (2), as detailed in Appendix A.

To detect nowcasting relations between the low- and high-frequency variables, we use a lasso penalty to impose "patternless" sparsity on the covariances between low- and high-frequency errors and the covariances between different high-frequency errors (see Section 2.2). The proposed sparse covariance estimator is then given by
$$\widehat{\Sigma}^*_u = \underset{\Sigma \succ 0}{\operatorname{argmin}} \ \tfrac{1}{2} \big\lVert \Sigma - \widehat{\Sigma}_u \big\rVert_F^2 + \lambda_{\Sigma} \lVert \Sigma^- \rVert_1, \qquad (4)$$
where $\lambda_{\Sigma} \geq 0$ is a tuning parameter and $\Sigma^-$ collects the elements of the off-diagonal blocks of $\Sigma$.

Remark 3.3. The mere addition of the $\ell_1$-penalty in problem (4) does not guarantee the estimator $\widehat{\Sigma}^*_u$ to be positive definite (see Rothman et al., 2009). To ensure its positive definiteness, the constraint $\Sigma \succ 0$ implies that we only consider solutions with strictly positive eigenvalues. Rothman et al. (2009) show that (4) without the constraint $\Sigma \succ 0$ essentially boils down to element-wise soft-thresholding of $\widehat{\Sigma}^-_u$: the sparse estimate $\widehat{\Sigma}^*_u$ is given by $\operatorname{sign}(\widehat{\Sigma}^-_u) \max(|\widehat{\Sigma}^-_u| - \lambda_{\Sigma}, 0)$. If the minimum eigenvalue of the unconstrained solution is greater than 0, then the soft-thresholded matrix is the correct solution to (4). However, if the minimum eigenvalue of the soft-thresholded matrix is below 0, we follow Bien and Tibshirani (2011) and perform the optimization using the alternating direction method of multipliers (Boyd et al., 2011), which is implemented in the R function ProxADMM of the package spcov (Bien and Tibshirani, 2012). Note that, similarly to the estimation of the MF-VAR, we solve (4) for a decrementing log-spaced grid of $\lambda_{\Sigma}$ values.
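The soft-thresholding step of Remark 3.3 is easy to state in code. Below is a minimal sketch, assuming a boolean mask marking the penalized off-diagonal-block entries $\Sigma^-$; the positive-definiteness check mirrors the logic above, while the ADMM fallback itself (the ProxADMM routine of spcov) is not reproduced here.

```python
import numpy as np

def soft_threshold_cov(S, lam, penalized):
    """Element-wise soft-thresholding of a residual covariance matrix S
    (Rothman et al., 2009), applied only to the penalized entries.

    penalized : boolean mask selecting the off-diagonal blocks Sigma^-
                (covariances between low- and high-frequency errors and
                between different high-frequency variables).
    Returns the thresholded matrix and a flag indicating whether it is
    positive definite; if not, one would fall back on an ADMM solver.
    """
    out = S.copy()
    out[penalized] = np.sign(S[penalized]) * np.maximum(
        np.abs(S[penalized]) - lam, 0.0)
    min_eig = np.linalg.eigvalsh(out).min()
    return out, min_eig > 0
```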
If one wishes to incorporate the estimated nowcasting relations, the autoregressive parameters can be re-estimated by taking the error covariance matrix into account. This results in a type of generalized least squares (GLS) estimator of β, given by
$$\widehat{\beta}_{\text{GLS}} = \underset{\beta}{\operatorname{argmin}} \ \tfrac{1}{2} \lVert y^* - X^*\beta \rVert_2^2 + \lambda_{\beta} \, P_{\text{Hier}}(\beta), \qquad (5)$$
where $y^* = \Sigma^{-1/2} y$, $X^* = \Sigma^{-1/2} X$ and $\Sigma = \widehat{\Sigma}^*_u \otimes I_N$.

We assess the performance of the proposed hierarchical group estimator through a simulation study where we compare its performance to three alternatives, namely ordinary least squares (OLS), the ridge and the lasso. The set-up of the simulation study is driven by our empirical application (Section 5), namely the small MF-VAR with K = 22 (one quarterly and seven monthly variables) and T = 125. The parameter matrix is set to reflect the obtained coefficients, which result in a stable MF-VAR. To make a clear distinction between zero and nonzero coefficients, we set all coefficients smaller than 0.01 to zero. As a result, the coefficient matrix does not strictly follow the recency-based hierarchical structure anymore, thereby favoring the hierarchical estimator less compared to its benchmarks. Throughout the paper, we standardize each series to have sample mean zero and variance one, as commonly done in the regularization literature before applying a sparse method, such that all coefficients have comparable sizes after standardization. To reduce the influence of initial conditions on the data generating process (DGP), we burn in the first 300 observations for each simulation run.

We consider four simulation designs and run R = 500 simulations in each. The first design compares the estimators in terms of their estimation accuracy and variable selection performance of the autoregressive parameter vector. In the second design, we analyze how well the proposed regularization method can detect the nowcasting relations between the low- and high-frequency variables in $\Sigma^*_u$. The third design compares the point forecasts between the hierarchical estimator and its GLS-type version. The fourth design investigates the performance of the proposed estimator for DGPs with varying degrees of sparsity.

We take the error covariance matrix to be the identity matrix and compare estimation accuracy of the autoregressive parameter vector by calculating the mean squared error
$$\text{MSE} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{K^2} \sum_{k=1}^{K^2} \big(\hat{\beta}^{(r)}_k - \beta_k\big)^2,$$
where $\hat{\beta}^{(r)}_k$ refers to the kth element of the estimated parameter vector in simulation run r. To investigate variable selection performance, we use the false positive rate (FPR), the false negative rate (FNR) and Matthews correlation coefficient (MCC):
$$\text{FPR} = \frac{FP}{FP + TN}, \quad \text{FNR} = \frac{FN}{FN + TP}, \quad \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$
where TP (and TN) are the number of regression coefficients that are estimated as nonzero (zero) and are also truly nonzero (zero) in the model, and FP (and FN) are the number of regression coefficients that are estimated as zero (nonzero) but are truly nonzero (zero) in the model. Both FPR and FNR should be as small as possible. The MCC balances the two measures and is in essence a correlation coefficient between the true and estimated binary classifications. It returns a value between −1 and +1, with +1 representing a perfect prediction, 0 no better than random prediction and −1 complete discrepancy between prediction and observation.
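These selection measures are straightforward to compute; the sketch below follows the convention just stated (FP counts truly nonzero coefficients estimated as zero, FN the reverse — the opposite of the usual usage — while MCC is invariant to that swap). The rate denominators are our assumption.

```python
import numpy as np

def selection_metrics(beta_hat, beta_true):
    """FPR, FNR and MCC for sparsity recovery, using the paper's
    convention for FP and FN as stated in the text."""
    est = beta_hat != 0
    tru = beta_true != 0
    TP = np.sum(est & tru)
    TN = np.sum(~est & ~tru)
    FP = np.sum(~est & tru)   # estimated zero, truly nonzero
    FN = np.sum(est & ~tru)   # estimated nonzero, truly zero
    fpr = FP / (FP + TN) if (FP + TN) else 0.0
    fnr = FN / (FN + TP) if (FN + TP) else 0.0
    denom = np.sqrt(float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))
    mcc = (TP * TN - FP * FN) / denom if denom else 0.0
    return fpr, fnr, mcc
```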
For the regularization methods (i.e., the proposed hierarchical estimator, lasso and ridge), we each time use a grid of 10 tuning parameters and select the one that minimizes the MSE between the estimated and true parameter vector.

Results. We first focus on estimation accuracy, see the left panel in Figure 3. The hierarchical estimator generates the lowest estimation errors. It significantly outperforms all others in terms of MSE, as confirmed by paired sample t-tests at the 5% significance level. OLS suffers as it is an unregularized estimator and thus cannot impose the necessary sparsity on the parameter vector; similarly for ridge, which can only perform shrinkage but no selection; its MSE overlaps with the grey area of the hierarchical estimator, but the lasso's does not. Finally, we repeat the simulation study using the hierarchical estimator with equal weights; see Figure 8 in Appendix B.1. While the equal weights version generates lower estimation errors, the original weights version performs better in terms of sparsity recognition (MCC). The hierarchical estimator with equal weights shrinks less aggressively, thereby having a higher FNR as many zero coefficients are estimated as small non-zeros.

We next evaluate the detection of the nowcasting relations. To this end, we set the error covariance matrix to the regularized error covariance matrix estimated in the application of Section 5.2. Coefficients smaller than 0.03 are set to zero to again ensure a clear distinction between the zeros and non-zeros. We first estimate the model using the four different estimators, calculate the resulting residual covariance matrix and then compute its regularized version through optimization problem (4). In line with our empirical application, we are mainly concerned with the sparsity pattern of the first row/column, namely the one corresponding to the low-frequency variable. Precisely, we investigate its variable selection performance using MCC, FPR and FNR. We estimate the MF-VAR and the covariance matrix of the corresponding residuals for a two-dimensional (10 $\lambda_\beta$ × 10 $\lambda_\Sigma$) grid of tuning parameters. We select the tuning parameter couple that maximizes the MCC of the regularized covariance matrix in each simulation run. Table 1 contains the results.

Results. The performance across estimators is very similar, but the hierarchical estimator and lasso perform best. Their MCCs lie at roughly 0.78. Their FPRs are slightly higher than their FNRs, which implies that the nowcasting relations tend to be estimated too sparsely. On the other hand, the low FNR suggests that in general we do not select variables which do not nowcast the low-frequency variable. Lastly, all estimators perform comparably across the tuning parameter grid for $\lambda_\Sigma$, but the variability around the selected $\lambda_\Sigma$ is higher for OLS and ridge than for the lasso and the hierarchical estimator.

We assess whether the best possible forecasting performance of the hierarchical estimator can be improved with its GLS version that incorporates the best nowcasting relations. The error covariance matrix is set to the same matrix as in Section 4.2.
We generate time series of length T = 105 (as in the forecast exercise of Section 5.3), fit the models to the first T − 1 observations and use the last observation to compute the one-step-ahead mean squared forecast error for each series. In line with the empirical application, we focus on forecast accuracy of the first series, which represents the low-frequency variable, and select the tuning parameter $\lambda_\beta$ that minimizes its squared forecast error. For the selected model, we then estimate its regularized covariance matrix using 10 $\lambda_\Sigma$ values and choose the one that maximizes the MCC. Finally, with the selected $\widehat{\Sigma}^*_u$, we re-estimate the model according to Equation (5) to compare the forecast performance of the MF-VAR when the additional restrictions on the error covariance matrix are accounted for.

Results. The one-step-ahead MSFEs for the first series of the unrestricted hierarchical estimator and its GLS version are 0.5508 and 0.5810, respectively. The former significantly outperforms the latter, as confirmed with a paired sample t-test at the 1% significance level. The addition of the nowcasting relations may not improve the forecast because the values of the covariances in the covariance matrix of the DGP, particularly of the first row/column, are relatively small in absolute terms. Moreover, it is important to point out that the running time for the estimation of the restricted version is substantially larger than for the unrestricted version. Thus, even if the covariance matrix were denser, there is a clear trade-off between forecast accuracy and computational efficiency one needs to make.

In light of the recent debate on the validity of sparse methods for macroeconomic data, we revisit simulation studies 1 and 2 by varying the degree of sparsity of the DGP. Full details on the simulation designs are given in Appendix B.2. First, we vary the degree of sparsity of the autoregressive parameters in simulation study 1. Figure 9 demonstrates that the MSE of the hierarchical estimator stays relatively constant when the degree of sparsity decreases. Only for the fully dense DGP do we observe a small increase in MSE. The hierarchical estimator with equal weights does not suffer from this. Hence, if one expects a dense DGP or variable selection is not the main priority, the hierarchical estimator with equal weights can be used. On the other hand, if one expects a sparse DGP or seeks parsimonious models to facilitate interpretation, more aggressive shrinkage can be enforced via the hierarchical estimator with the proposed weights. Next, we vary the degree of sparsity in the first row/column of the error covariance matrix of simulation study 2. Here, denser settings do have an effect on the detection of the nowcasting relations. Table 4 shows that the MCC decreases mainly due to an increase in the FPR, meaning that the nowcasting relations are estimated too sparsely.

We investigate a high-dimensional MF-VAR for the U.S. economy. We use data from 1987 Q3 until 2018 Q4 (T = 126) on various aspects of the economy, amongst others output, income, prices and employment; see Table 5 in Appendix C.1. All series are transformed to stationarity using the transformation codes of McCracken and Ng (2016) (column T-code of Table 5), which we apply to all series, thereby facilitating replicability and comparison with related research. To evaluate the influence of additional variables and higher-frequency components, we estimate three MF-VAR models. The small MF-VAR (K = 22) consists of the quarterly variable of interest, real GDP (GDPC1), and seven monthly variables focusing on (industrial) production, employment and inflation.
The medium MF-VAR (K = 55) contains the small group and eleven additional monthly variables with further information on different aspects of the economy, including financial variables. The large MF-VAR (K = 91) incorporates the variables of the medium group and replaces four monthly variables (CLAIMSx, M2SL, FEDFUNDS, S&P 500) with their equivalent weekly series. See columns "K = 22", "K = 55" and "K = 91" of Table 5. To ensure that $m_2 = 12$ in the large MF-VAR, we consider all months with more than four weekly observations and disregard the excess weeks at the beginning of the corresponding month, as in Götz et al. (2016).

We aim to investigate several aspects. First, we focus on the autoregressive parameter estimates of the hierarchical estimator to investigate the predictive Granger causality relations between the series through a network analysis. Second, we concentrate on nowcasting and investigate whether some high-frequency monthly and/or weekly economic series nowcast quarterly U.S. GDP growth and can thus deliver a coincident indicator of GDP growth. Third, we perform a sensitivity analysis of these results. Finally, we perform an out-of-sample expanding window nowcasting exercise including recent data on the Covid-19 pandemic. For ease of the discussion of the results, we follow the variable classification of McCracken and Ng (2016), which can be found in Table 6 in Appendix C.1.

We first investigate predictive Granger causality relations. To that end, we estimate the MF-VAR using the hierarchical estimator with the proposed weights and recency-based priority structure (seasonally adjusted data are used) for all parameter groups and a grid of 10 tuning parameters $\lambda_\beta$. The tuning parameter is selected using rolling window time series cross-validation with window size $T_1$. For each rolling window whose in-sample period ends at time $t = T_1, \ldots, T-1$, we first standardize each time series to have sample mean zero and variance one using the most recent $T_1$ observations, thereby taking possible time variations in the first and second moment of the data into account (see e.g., the Great Moderation; Campbell, 2007; Stock and Watson, 2007). Given the evaluation period $[T_1, T]$, we use the one-step-ahead mean squared forecast error as cross-validation score. We set $T_1 = 105$, leaving us with 20 observations for tuning parameter selection. We first discuss the results for the small MF-VAR, then summarize the findings for the medium and large MF-VARs.

Small MF-VAR. We summarize the linkages between the macroeconomic categories in Table 2. The columns reflect a macroeconomic category's out-degree (influence), the rows its in-degree (responsiveness). We first concentrate on our main variable of interest, GDPC1. Eight variables contribute towards its prediction (see the first row of Figure 4(a) or the incoming edges in panel (b)). Next, we focus on the linkages between the macroeconomic groups containing the higher-frequency variables (off-diagonal cells of Table 2). Finally, we inspect the linkages within each macroeconomic group (diagonal of Table 2). Employment and Housing display the highest within-group interaction. Zooming in on PAYEMS or USFIRE in Figure 4, we see that their own lagged effects are the strongest, despite them belonging to a group containing more than one high-frequency variable. Within macroeconomic groups, a series' own lags thus remain the most informative. When focusing again on the network, we find that the third months of the variables have the most outgoing edges, whereas the first months have the most incoming edges. This is a logical consequence of the recency-based priority structure that we imposed.
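Summaries such as Table 2 can be obtained by counting nonzero coefficients of the estimated B between blocks of series. A minimal sketch, assuming one category label per component series (so a monthly series contributes $m_1$ entries); the function is ours:

```python
import numpy as np

def linkage_table(B_hat, cats):
    """Count estimated nonzero linkages between macroeconomic categories.

    B_hat : (K, K) estimated autoregressive coefficient matrix
    cats  : length-K sequence of category labels, one per component of
            ybar (a monthly series contributes m_1 repeated labels)
    Returns (labels, counts), where counts[i, j] is the number of nonzero
    coefficients of category-j series onto category-i series; row sums
    give in-degrees, column sums out-degrees.
    """
    cats = np.asarray(cats)
    labels = sorted(set(cats))
    idx = {c: np.flatnonzero(cats == c) for c in labels}
    counts = np.array([[np.count_nonzero(B_hat[np.ix_(idx[r], idx[c])])
                        for c in labels] for r in labels])
    return labels, counts
```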
Medium and Large MF-VAR. Next, we focus on the linkages between the macroeconomic groups. For the medium MF-VAR, we again find high within-group interaction. In contrast to the small system, the within-group linkages among Prices increase strongly, which is likely due to the addition of four price variables. The introduction of the weekly variables in the large system does not change the relations among the macroeconomic categories (see Table 8).

We investigate whether some high-frequency monthly and/or weekly economic series nowcast quarterly GDP growth and can thus deliver a coincident indicator. More specifically, we analyze how the performance of the indicator is influenced through (i) the formalization of "sparse" nowcasting relations in the MF-VAR, in comparison to a naive approach of constructing a coincident indicator based on the first principal component of all high-frequency variables, and (ii) the number of high-frequency components included in the model. To construct the coincident indicator, we first estimate the MF-VAR using the hierarchical estimator as in Section 5.1. Secondly, we compute the regularized $\widehat{\Sigma}^*_u$ from the MF-VAR residuals. We then select the high-frequency variables having a non-zero covariance element in the GDP column and construct the first principal component of the corresponding correlation matrix. We estimate the MF-VAR and the corresponding covariance matrix of the MF-VAR residuals for a two-dimensional (10 × 10) grid of tuning parameters $\lambda_\beta$ and $\lambda_\Sigma$. We report the results for the tuning parameter couple that maximizes the correlation between the most parsimonious coincident indicator and GDP growth.

Small MF-VAR. Selecting the nowcasting relations in the MF-VAR, where the lagged and instantaneous dynamics are separated, increases the correlation with GDP growth by 2 percentage points relative to the naive first principal component of all high-frequency variables. The advantage of carefully selecting variables before conducting a principal component analysis is a result consistent with Bai and Ng (2008). Furthermore, the correlations with GDP growth are fairly stable across the two-dimensional grid of tuning parameters. In roughly half of the cases, the correlation is at least as high as the one computed from all high-frequency variables. Our findings are thus rather robust to a different choice of selection criterion for the tuning parameters.

GDPC1 is typically released with a relatively long delay (usually one month after the quarter for the first release), whereas the monthly variables are released in blocks at different dates throughout the following month. For instance, the previous month's UNRATE, PAYEMS and USFIRE are typically released on the first Friday of the following month, whereas the remaining variables only become available around the middle of the month. We follow the release scheme of Giannone et al. (2008) to study the marginal impact of these data releases on the construction of our coincident indicator. We do not take data revisions into account; hence, our vintages are "pseudo" real-time vintages rather than true real-time vintages. The second-month releases have a large impact on the accuracy of the coincident indicator. Particularly, the addition of the variables INDPRO and CUMFNS raises the correlation significantly. In fact, it is possible to construct an almost equally reliable indicator with the data from months one and two compared to one constructed with the data from the entire quarter.
Hence, one can build a reliable coincident indicator roughly 1.5 months before the first release of GDP, thereby accounting for the publishing lag of approximately half a month for the monthly series.

Medium and Large MF-VAR. Table 3 summarizes the correlations between GDP growth and the coincident indicators constructed from the different MF-VAR systems. The correlation for the large system (0.7928) is slightly higher than for the medium MF-VAR (0.7715), and both outperform the small system (0.7429). Full details on the medium and large MF-VAR are available in Appendix C.2: Figures 12(a) and 13(a) show that both indicators behave similarly and can better pick up the drop in GDP growth during the financial crisis than the coincident indicator of the small system. Table 3 illustrates that the advantage of selecting the nowcasting relations in the MF-VARs persists, as the correlation achieved from the first principal component with all variables is lower. Lastly, the coincident indicator constructed from the selected variables from months one and two for the medium system performs comparably to the one constructed from all selected variables, in line with our finding for the small MF-VAR.

We investigate the sensitivity of our results regarding (i) the choice of weights for the hierarchical estimator, (ii) the maximal lag length of the MF-VAR, (iii) the inclusion of daily high-frequency data, and (iv) forecast comparisons.

Equal Weights. We use the hierarchical estimator with equal weights and present the autoregressive linkages between the macroeconomic groups for the MF-VARs in Tables 9-11 of Appendix C.3. Using equal weights corresponds to weaker penalization, hence the networks become denser, but the main conclusions regarding the inter-dependencies remain unchanged. Next, we revisit the results on the coincident indicators in Table 12. For the small MF-VAR, the results are identical. For the medium and the large MF-VAR, respectively more and fewer high-frequency variables are selected for the coincident indicator, which only affects its correlation with GDP growth in the first two months.

Lag Length. We re-estimate the small MF-VAR with the hierarchical estimator using a maximal lag $\ell$ = 1, 2, or 4. Practically all coefficients from the second lag onwards (more than 98%) are zeroed out, and more first-order coefficients are shrunken towards zero. Both the Bayesian and Akaike Information Criterion indicate one to be the "optimal" lag length.

Daily Data. To investigate the behavior of the hierarchical estimator in the presence of daily high-frequency components, we replace the weekly financial variables (FEDFUNDS and S&P 500) with their daily counterparts. The resulting coincident indicator performs comparably when constructed from the daily data of the first two months, but not when using the daily data from the whole quarter.

Forecast Comparison. We investigate the out-of-sample forecast performance of the hierarchical estimator, as detailed in Appendix C.5. We compare its performance to the random walk (RW) and AR(1) model as two popular, simple univariate benchmarks; a traditional quarterly VAR(1), estimated by OLS, with all higher-frequency variables aggregated to the quarterly level; and the three alternative estimators from Section 4 (OLS, ridge and lasso).

The Covid-19 pandemic has caused a dramatic drop in economic activity worldwide, including in the U.S. We end by assessing the performance of our coincident indicators in an out-of-sample expanding window nowcasting exercise with evaluation period 2019 Q1 until 2021 Q2, thereby covering the recent pandemic. For each current quarter in the expanding window approach, we estimate the MF-VAR with the hierarchical estimator (using data until the previous quarter), select the high-frequency variables for the coincident indicator from the residual covariance matrix as in Section 5.2, and obtain their loadings on the first principal component. The nowcasts are then calculated by multiplying these loadings with the out-of-sample high-frequency data of the current quarter (see the sketch below).
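A minimal sketch of one such nowcast follows; standardizing the current quarter's data with in-sample moments and rescaling with in-sample GDP moments are our assumptions, mirroring the indicator construction of Section 5.2.

```python
import numpy as np

def pc_nowcast(hf_insample, hf_current, loadings, sel, gdp_insample):
    """One expanding-window nowcast: apply first-PC loadings (estimated
    on data up to the previous quarter) to the current quarter's
    high-frequency observations.

    hf_insample : (T, K-1) in-sample high-frequency data
    hf_current  : (K-1,) current-quarter high-frequency observations
    loadings    : (len(sel),) first-PC loadings of the selected series
    sel         : indices of the selected variable-period combinations
    """
    mu = hf_insample[:, sel].mean(0)
    sd = hf_insample[:, sel].std(0)
    score = ((hf_current[sel] - mu) / sd) @ loadings   # new PC score
    # rescale with in-sample moments, mirroring the coincident indicator
    pc_in = ((hf_insample[:, sel] - mu) / sd) @ loadings
    return (gdp_insample.mean()
            + gdp_insample.std() * (score - pc_in.mean()) / pc_in.std())
```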
The mean squared nowcast errors between GDP growth and the coincident indicators constructed from the three MF-VARs are depicted on the last line of the table in Figure 6(a). The coincident indicator of the small MF-VAR attains the best accuracy; the error more than doubles for the larger MF-VARs. In addition, we track the nowcasting accuracy as the quarter progresses, thereby constructing the coincident indicator only from the selected high-frequency variables from the first month (first line) or the first two months (second line).

Finally, we compare the performance of the coincident indicator of the small MF-VAR with two state-of-the-art benchmarks, one Bayesian and one survey-based. We use the "blocking" Bayesian MF-VAR (B-BVAR) by Cimadomo et al. (2021), since it also relies on the stacked MF-VAR approach. We would like to thank Michele Lenza for providing the code for the B-BVAR method, which we applied (with default settings for stationary data) to the small and medium MF-VAR; in line with our findings, their small MF-VAR was best performing, hence the results for their medium MF-VAR are omitted but available upon request. Second, we use the GDP nowcasts of the Survey of Professional Forecasters (SPF). Figure 6(b) compares the resulting nowcasts over the evaluation period; note that all insights obtained from Figure 6(b) are still subject to GDP being revised.

We introduce a convex regularization method tailored towards the hierarchically ordered structure of mixed-frequency VARs. To this end, we use a group lasso with nested groups, which permits various forms of hierarchical sparsity patterns that allow one to discriminate between recent and obsolete information. Our simulation study shows that the proposed regularizer can improve estimation and variable selection performance. Furthermore, nowcasting relations can be detected from the sparsity pattern of the covariance matrix of the MF-VAR errors. Those high-frequency variables that nowcast the low-frequency variables, as evident from their non-zero contemporaneous link, can deliver a coincident indicator of the low-frequency variable. Constructing coincident indicators from a group of selected variables rather than all permits policy makers to get an earlier grasp of the state of the economy, as can be seen from our economic application on U.S. GDP growth.

The proposed MF-VAR method is quite flexible and can be extended in various ways. First, regularization via restrictions other than sparsity can be explored. Temporal aggregation restrictions, for instance, can be imposed in the MF-VAR by exploiting fusion penalties (e.g., Yan and Bien, 2021) that encourage similarity across certain coefficients. For monthly stock data it could, for instance, be interesting to encourage the effects of all months on the quarterly variable to be similar, thereby implicitly aggregating the monthly variable to the quarterly level. Second, an interesting path for future research concerns the extension of our method to enable structural analysis, as recently done in a Bayesian set-up by e.g., Cimadomo et al.
(2021), who use generalized impulse responses to track transmission mechanisms of low-frequency shocks hitting the U.S. economy, or Paccagnini and Parla (2021), who use orthogonalized impulse responses to identify the impact of high-frequency shocks, thereby revealing a temporal aggregation bias when adopting single low-frequency models instead of mixed-frequency ones. Lastly, while we consider the MF-VAR for stationary data, a natural next step would be to allow for non-stationarity by building on the lag-augmentation idea of Toda and Yamamoto (1995), as done in Götz and Hecq (2019).

To assess the effect on the hierarchical estimator of varying degrees of sparsity in the autoregressive parameters and the covariance matrix of the error terms, we repeat simulation study 1 and simulation study 2 with varying degrees of sparsity in their respective DGPs. For simulation study 1, we gradually increase the percentage of non-zero coefficients in the autoregressive parameters across four settings. The sparsest setting (55% of non-zero coefficients) starts from the design used in Section 4.1, where all coefficients smaller than 0.01 are set to zero. For the second setting (70% non-zeros), all coefficients smaller than 0.002 are set to zero. For the third setting (84% non-zeros), we take the unaltered parameter matrix from Section 4.1. For the densest setting (100% non-zeros), all zero coefficients of the sparsest setting are assigned a value of 0.01. Figure 9 gives the results. For simulation study 2, we gradually increase the percentage of non-zero coefficients in the first row/column of the error covariance matrix across four settings. The sparsest setting corresponds to the design of Section 4.2 with 62% non-zeros. We then increase the density towards respectively 76%, 90% and 100% non-zeros by each time setting the remaining zero coefficients in the first row/column to 0.03. Table 4 displays the results.

C Macroeconomic Application

C.1 Additional Tables

Table 5: Data description table. Column T-code denotes the data transformation applied to a time series, which are: (1) not transformed, (2) $\Delta x_t$, (3) $\Delta^2 x_t$, (4) $\log(x_t)$, (5) $\Delta \log(x_t)$, (6) $\Delta^2 \log(x_t)$. Columns K = 22, K = 55 and K = 91 indicate whether and at which frequency the variable was included in the model.

We use a rolling-window set-up with window size $T_1$ = 105, providing us 20 quarterly observations for forecast comparison. For each rolling window, we select the value of $\lambda_\beta$ that minimizes the one-step-ahead squared forecast error for our main variable of interest, GDPC1. The MSFE for GDPC1 for the unrestricted hierarchical estimator across all MF-VARs is given in Table 15. We have also performed the forecasting exercise with the restricted hierarchical estimator imposing the nowcasting restrictions (through GLS); in line with our simulation study, the GLS version did not result in an improved point forecast and is thus omitted.
References

Inference in group factor models with an application to mixed-frequency data
Machine learning time series regressions with an application to nowcasting
Forecasting economic time series using targeted predictors
Nets: Network estimation for time series
Regularized estimation in sparse high-dimensional time series models
Network Granger causality with inherent grouping structure
Convex banding of the covariance matrix
spcov: Sparse estimation of a covariance matrix (R package)
Sparse estimation of a covariance matrix
Distributed optimization and statistical learning via the alternating direction method of multipliers
Forecasting economic activity with mixed frequency BVARs
Modeling and forecasting large realized covariance matrices and portfolio choice
Macroeconomic volatility, predictability, and uncertainty in the great moderation: Evidence from the Survey of Professional Forecasters
Generating univariate fractional integration within a large VAR(1)
Nowcasting with large Bayesian vector autoregressions
Studying co-movements in large multivariate data prior to multivariate modelling
Sparse vector autoregressive modeling
Estimating global bank network connectedness
Glossary: Nowcasting (Eurostat: Statistics Explained)
A comparison of mixed frequency approaches for nowcasting Euro area macroeconomic aggregates
Computationally efficient inference in large Bayesian mixed frequency VARs
Macroeconomics and the reality of mixed frequency data
Testing for Granger causality with mixed frequency data
The MIDAS touch: Mixed data sampling regression models
Economic predictions with big data: The illusion of sparsity
Nowcasting: The real-time informational content of macroeconomic data
Nowcasting causality in mixed frequency vector autoregressive models
Testing for Granger causality in large mixed-frequency VARs
Granger causality testing in mixed-frequency VARs with possibly (co)integrated processes
The model confidence set
Convex modeling of interactions with strong heredity
Statistical learning with sparsity: The lasso and generalizations
Granger causality testing in high-dimensional VARs: A post-double-selection procedure
Inference in non-stationary high-dimensional VARs
Subset selection for vector autoregressive processes using lasso
A new approach for estimating VAR systems in the mixed-frequency case
MIDAS vs. mixed-frequency VAR: Nowcasting GDP in the Euro area
Measuring real activity using a weekly economic index
Sparse partially linear additive models
New introduction to multiple time series analysis
Factor MIDAS for nowcasting and forecasting with ragged-edge data: A model comparison for German GDP
FRED-MD: A monthly database for macroeconomic research
FRED-QD: A quarterly database for macroeconomic research
Real-time forecasting and scenario analysis using a large mixed-frequency Bayesian VAR
Bayesian MIDAS penalized regressions: Estimation, selection, and prediction
High dimensional forecasting via interpretable vector autoregression
Identifying high-frequency shocks with Bayesian mixed-frequency VARs
Generalized thresholding of large covariance matrices
Real-time forecasting with a mixed-frequency VAR
Macroeconomic forecasting using penalized regression methods
Why has U.S. inflation become harder to forecast?
Regression shrinkage and selection via the lasso
Statistical inference in vector autoregressions with possibly integrated processes
On accelerated proximal gradient methods for convex-concave optimization (submitted)
Rare feature selection in high dimensions
Time series analysis and simultaneous equation econometric models
The composite absolute penalties family for grouped and hierarchical variable selection

We thank the editor, associate editor and reviewers for their thorough review and highly appreciate their constructive comments, which substantially improved the quality of the paper. Wilms gratefully acknowledges funding from the European Union's Horizon 2020 research and innovation programme (Marie Skłodowska-Curie grant No 832671).

Algorithm 1 provides an overview of the proximal gradient algorithm (see e.g., Tseng, 2008) to efficiently solve the optimization problem in Equation (2). A key ingredient of the algorithm concerns the proximal operator $\text{Prox}_{v\lambda P(\cdot)}$, which has a closed-form solution, making it extremely efficient to compute. Indeed, the updates of each hierarchical group $p = 1, \ldots, P_g$ correspond to a groupwise soft-thresholding operation given by
$$\beta_{s^{(p)}_g} \leftarrow \max\left(1 - \frac{v \lambda \, w_{s^{(p)}_g}}{\lVert \beta_{s^{(p)}_g} \rVert_2}, \ 0\right) \beta_{s^{(p)}_g},$$
with v being the step size, which we set equal to the largest singular value of X. Algorithm 1 requires a starting value β[0], which we initially set equal to 0. Finally, while the algorithm is given for a fixed value of the tuning parameter λ, it is standard in the regularization literature to implement it for a decrementing log-spaced grid of λ values. The starting value $\lambda_{\max}$ is an estimate of the smallest value that sets all coefficients equal to zero. For each smaller λ along the grid, we use the outcome of the previous run as a warm start for β[0]. A code sketch of the groupwise update is given at the very end of the document.

Algorithm 1: Proximal gradient algorithm for Equation (2).
Require: y, X, P(β), λ, ε.
Initialization: step size v set equal to the largest singular value of X; β[0] = 0.
for r = 3, 4, ... do: for g = 1, ..., G do: apply the groupwise soft-thresholding updates for p = 1, ..., P_g; stop when the change in β falls below ε.

Linkages between the macroeconomic categories (rows: in-degree, columns: out-degree):

              GDPC1 Output Sales Housing Empl. Prices CLAIMSx Money Int.Rate Stocks | In-degree
GDPC1             0      4     4       1     5      1       2     5        5      3 |       30
Output            6     68    35       7    45     56      49    62       52     57 |      437
Sales             5     32    24       8    42     51      32    37       38     39 |      308
Housing           1      1     2       9     8      0       0     0        2      1 |       24
Employment        4     35    33       4    69     26      35    27       38     28 |      299
Prices           10    101    50      22    57    152      93    72       78    104 |      739
CLAIMSx           2     60    41      13    35     49     123   110      116     87 |      636
Money             2     63    29      22    45     71     103   140      123     94 |      692
Interest Rate     3     74    28       3    52     54      84   115      135    105 |      653
Stock Prices      3     52    33      18    48     79      59   111      106    121 |      630
Out-degree       36    490   279     107   406    539     580   679      693    639 |     4448
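As referenced in Appendix A above, here is a sketch of the groupwise soft-thresholding update for one parameter group. The innermost-to-outermost ordering of the nested sub-groups follows standard results on proximal operators for nested (tree-structured) group penalties; it is our implementation choice, not code from the paper.

```python
import numpy as np

def prox_hier_group(beta_g, prio_g, v, lam, weights):
    """Proximal operator for one group g of the hierarchical penalty (3).

    beta_g  : 1-d array with the group's coefficients
    prio_g  : integer array, priority value (1..P_g) of each coefficient
    weights : weights[p-1] = w_{s_g^(p)}
    The nested sub-group s_g^(p) collects all coefficients with priority
    value >= p; updates run from the innermost sub-group outwards.
    """
    beta = beta_g.astype(float).copy()
    for p in range(prio_g.max(), 0, -1):
        sub = prio_g >= p
        norm = np.linalg.norm(beta[sub])
        if norm > 0:
            beta[sub] *= max(1.0 - v * lam * weights[p - 1] / norm, 0.0)
    return beta

# toy usage: three coefficients with priority values 1, 2, 3; the
# lowest-priority coefficient is zeroed out first
print(prox_hier_group(np.array([0.9, 0.5, 0.1]),
                      np.array([1, 2, 3]), v=1.0, lam=0.2,
                      weights=[1.0, 2.0, 3.0]))
```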