Parsimonious Hidden Markov Models for Matrix-Variate Longitudinal Data

Salvatore D. Tomarchio, Antonio Punzo, Antonello Maruotti

2021-07-09

Abstract. Hidden Markov models (HMMs) have been extensively used in the univariate and multivariate literature. However, there has been an increased interest in the analysis of matrix-variate data over recent years. In this manuscript we introduce HMMs for matrix-variate longitudinal data, by assuming a matrix normal distribution in each hidden state. Such data are arranged in a four-way array. To address possible overparameterization issues, we consider the spectral decomposition of the covariance matrices, leading to a total of 98 HMMs. An expectation-conditional maximization algorithm is discussed for parameter estimation. The proposed models are firstly investigated on simulated data, in terms of parameter recovery, computational times and model selection. Then, they are fitted to a four-way real data set concerning the unemployment rates of the Italian provinces, evaluated by gender and age classes, over the last 16 years.

Multivariate longitudinal data are commonly arranged in the standard three-way format, where units, times and variables are organized in software-ready manners. Because of their three-way structure, multivariate longitudinal data have recently been arranged in a matrix-variate fashion (Huang et al., 2019; Viroli, 2011b): for each unit i = 1, . . . , I, we observe a P × T matrix, where P and T denote the number of variables and times, respectively. Such data have then been used for model-based clustering via matrix-variate mixture models (see, e.g., Wang and Melnykov, 2020; Melnykov and Zhu, 2019; Tomarchio et al., 2020a,b). This allows for both clustering units in homogeneous groups, defined according to similarities between matrix-variate data, and separately modeling the association between variables and times. Unfortunately, this procedure has two side effects:

(a) Placing time on either the rows or the columns of the matrices reduces the types of longitudinal data structures that can be arranged in a matrix-variate framework. For instance, spatio-temporal data are used either to analyze P variables observed at T times for R different locations (Viroli, 2011b) or to evaluate one measurement on R locations at T times on a set of I units (Viroli, 2011a). However, it is not possible to jointly consider P variables at R locations for T times on I units. A possible solution could be to combine locations and times into a single RT-dimensional mode, as done by Viroli (2011a), but this implies a loss in terms of interpretability as well as an increase in the number of parameters of the models, given the higher dimensionality of the matrices. Another example consists of two-factor data, which have been commonly considered in longitudinal settings (see, e.g., Brunner and Puri, 2001; Fitzmaurice and Ravichandran, 2008; Noguchi et al., 2012). Such data have recently been used in matrix-variate mixture models by Sarkar et al. (2020) in a non-longitudinal way, given that the factors fill the two dimensions of the P × R matrices for the I units, so an additional dimension for time is required. In other terms, in both examples it would be necessary to move from three-way to four-way data structures.

(b) Currently available clustering approaches for matrix-variate data assume time-constant clustering, i.e.
it is not possible for the sample units to move across clusters over time, and the evolution of the clustering structure over time is completely overlooked. Zhu and Melnykov (2021) recognize the importance of allowing for time-dependence in a matrix-variate clustering setting and propose a procedure capable of capturing the heterogeneity pattern and estimating change points from all data groups simultaneously. Indeed, time-varying heterogeneity is an important feature of longitudinal data, and appropriate modeling strategies should account for it. Hidden Markov models (HMMs) have been extensively used to address this peculiarity of longitudinal data (Zucchini et al., 2017; Bartolucci et al., 2012; Maruotti, 2011; Altman, 2007). Being (dependent) mixtures, HMMs simultaneously allow for clustering units and for modeling the evolution of the clustering over time.

To jointly consider the aspects in (a) and (b), in this manuscript we introduce and discuss HMMs for matrix-variate longitudinal data, with a specific application to the two-factor longitudinal case. Such data can be arranged in a four-way array of dimension P × R × I × T. Under this four-way longitudinal data setting, HMMs simultaneously allow several further (null) hypotheses to be investigated, beyond the recovery of the time-varying clustering:

- all means at different time points are the same;
- there is an association between the factor levels.

The association is often the primary interest when researchers aim to study whether levels differentially affect outcomes in different hidden states.

A side effect of working with four-way data is the large number of parameters involved. This often occurs because of the (row- and column-specific) covariance matrices, since P(P + 1)/2 and R(R + 1)/2 unique parameters must be estimated. One of the most classical ways of addressing this overparameterization issue involves the spectral decomposition of the covariance matrices (Celeux and Govaert, 1995). This decomposition offers remarkable flexibility and a geometric interpretation in terms of volume, shape, and orientation of the hidden states (for other approaches available in the HMM literature see, e.g., Maruotti et al., 2017). By using the spectral decomposition of the covariance matrices, we obtain a family of 98 parsimonious HMMs that will be described in Section 2.2, after the presentation of the general model (Section 2.1). In this framework, model parameters can be estimated by a full maximum likelihood method based on the expectation-conditional maximization (ECM) algorithm (Meng and Rubin, 1993) and on recursions widely used in the HMM literature (Baum et al., 1970). An iterative minorization-maximization (MM) algorithm (Browne and McNicholas, 2014) is also adopted to estimate a subset of the parsimonious HMMs.

We illustrate the proposal by a large-scale simulation study, in order to investigate the empirical behavior of the proposed approach with respect to several aspects, such as the number of observed units and times, the number of hidden states and the association structure between factor levels. We focus on goodness of clustering and parameter recovery, as well as on computational times and model selection procedures. Furthermore, we test the proposal by analyzing a sample taken from the Italian National Institute of Statistics on the unemployment rate in 98 Italian provinces recorded for 16 years, also covering the 2008 crisis. We examine the unemployment rate arranged as a two-factor design, i.e.
taking into account gender and age classes, by allowing some dynamics in the evolution of unemployment. We obtain a flexible model by including different associations across levels, changing according to the inferred dynamics, and by accounting for unobserved characteristics influencing changes in the provinces' unemployment patterns. Finally, Section 5 summarizes the key aspects of our proposal along with possible future extensions.

Let $\{X_{it};\ i = 1, \ldots, I,\ t = 1, \ldots, T\}$ be a sequence of matrix-variate longitudinal observations recorded on I units over T times, with $X_{it} \in \mathbb{R}^{P \times R}$, and let $\{S_{it};\ i = 1, \ldots, I,\ t = 1, \ldots, T\}$ be a first-order Markov chain defined on the state space $\{1, \ldots, k, \ldots, K\}$. As mentioned in Section 1, an HMM is a particular type of dependent mixture model consisting of two parts: an underlying unobserved process $\{S_{it}\}$ that satisfies the Markov property, i.e.
$$\Pr\left(S_{it} = s_{it} \mid S_{i1} = s_{i1}, \ldots, S_{i,t-1} = s_{i,t-1}\right) = \Pr\left(S_{it} = s_{it} \mid S_{i,t-1} = s_{i,t-1}\right),$$
and a state-dependent observation process $\{X_{it}\}$ for which the conditional independence property holds, i.e.
$$f\left(X_{it} \mid X_{i1}, \ldots, X_{i,t-1}, S_{i1} = s_{i1}, \ldots, S_{it} = s_{it}\right) = f\left(X_{it} \mid S_{it} = s_{it}\right),$$
where f(·) is a generic probability density function (pdf). Therefore, the unknown parameters in an HMM involve both the parameters of the Markov chain and those of the state-dependent pdfs. In detail, the parameters of the Markov chain are the initial probabilities $\pi_{ik} = \Pr(S_{i1} = k)$, $k = 1, \ldots, K$, K being the number of states, and the transition probabilities
$$\pi_{ik|j} = \Pr\left(S_{it} = k \mid S_{i,t-1} = j\right), \quad t = 2, \ldots, T, \quad j, k = 1, \ldots, K,$$
where k refers to the current state and j refers to the one previously visited. To simplify the discussion, we will consider homogeneous HMMs, that is $\pi_{ik|j} = \pi_{k|j}$ and $\pi_{ik} = \pi_k$, $i = 1, \ldots, I$. We collect the initial probabilities in the K-dimensional vector $\boldsymbol{\pi}$, whereas the time-homogeneous transition probabilities are inserted in the K × K transition matrix $\boldsymbol{\Pi}$.

Regarding the conditional density for the observed process, it is given by a matrix-normal distribution, i.e.
$$f\left(X_{it} \mid S_{it} = k\right) = \phi_{P \times R}\left(X_{it}; M_k, \Sigma_k, \Psi_k\right) = \frac{\exp\left\{-\frac{1}{2}\operatorname{tr}\left[\Sigma_k^{-1}\left(X_{it} - M_k\right)\Psi_k^{-1}\left(X_{it} - M_k\right)^\top\right]\right\}}{(2\pi)^{PR/2}\,|\Sigma_k|^{R/2}\,|\Psi_k|^{P/2}}, \quad (1)$$
where $M_k$ is the P × R matrix of means, $\Sigma_k$ is the P × P covariance matrix containing the variances and covariances between the P rows, $\Psi_k$ is the R × R covariance matrix containing the variances and covariances of the R columns, and $\theta_k = \{M_k, \Sigma_k, \Psi_k\}$. For an exhaustive description of the matrix-normal distribution and its properties see Gupta and Nagar (2018).

As introduced in Section 1, a way to reduce the number of parameters of the model is to introduce parsimony in the covariance matrices via the well-known eigen decomposition. Specifically, a Q × Q covariance matrix $\Phi_k$ can be decomposed as
$$\Phi_k = \lambda_k \Gamma_k \Delta_k \Gamma_k^\top, \quad (2)$$
where $\lambda_k = |\Phi_k|^{1/Q}$, $\Gamma_k$ is a Q × Q orthogonal matrix of the eigenvectors of $\Phi_k$, and $\Delta_k$ is a diagonal matrix, with $|\Delta_k| = 1$, having the normalized eigenvalues of $\Phi_k$ on the main diagonal. From a geometric point of view, $\lambda_k$ determines the volume, $\Gamma_k$ indicates the orientation, and $\Delta_k$ denotes the shape of the kth state. By imposing constraints on the three components of (2), the fourteen parsimonious models of Table 1 are obtained. Considering that we have two covariance matrices in (1), this would yield 14 × 14 = 196 parsimonious HMMs. However, there is a non-identifiability issue, since $\Psi \otimes \Sigma = \Psi^* \otimes \Sigma^*$ if $\Sigma^* = a\Sigma$ and $\Psi^* = a^{-1}\Psi$, for some $a > 0$. As a result, $\Sigma$ and $\Psi$ are identifiable only up to a multiplicative constant (Sarkar et al., 2020). To avoid such an issue, the column covariance matrix $\Psi$ is restricted to have $|\Psi| = 1$, implying that in (2) the parameter $\lambda_k$ is unnecessary. This reduces the number of models related to $\Psi$ from 14 to 7, i.e., $I$, $\Delta$, $\Delta_k$, $\Gamma\Delta\Gamma^\top$, $\Gamma\Delta_k\Gamma^\top$, $\Gamma_k\Delta\Gamma_k^\top$, $\Gamma_k\Delta_k\Gamma_k^\top$. Therefore, we obtain 14 × 7 = 98 parsimonious HMMs.
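To make the state-dependent density in (1) concrete, the following R sketch evaluates the matrix-normal log-density via the trace formula above. The function name dmatnorm_log and its interface are our own illustrative choices, not part of any package used by the authors.

```r
# Log-density of the matrix-normal distribution in (1), a minimal sketch.
# X, M: P x R matrices; Sigma: P x P row covariance; Psi: R x R column covariance.
dmatnorm_log <- function(X, M, Sigma, Psi) {
  P <- nrow(X); R <- ncol(X)
  E <- X - M
  # Quadratic form tr(Sigma^{-1} E Psi^{-1} E'), computed without explicit inverses:
  # solve(Sigma, E) = Sigma^{-1} E and solve(Psi, t(E)) = Psi^{-1} E'.
  quad <- sum(diag(solve(Sigma, E) %*% solve(Psi, t(E))))
  ldS <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)  # log|Sigma|
  ldP <- as.numeric(determinant(Psi, logarithm = TRUE)$modulus)    # log|Psi|
  -0.5 * (P * R * log(2 * pi) + R * ldS + P * ldP + quad)
}
```

Note that, consistently with the identifiability restriction above, imposing |Ψ| = 1 makes the term involving log|Ψ| vanish.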
To fit our HMMs, we use the expectation-conditional maximization (ECM) algorithm (Meng and Rubin, 1993). The ECM algorithm is a variant of the classical expectation-maximization (EM) algorithm (Dempster et al., 1977), from which it differs in that the M-step is replaced by a sequence of simpler and computationally more convenient CM-steps.

Let $\mathcal{S} = \{X_{it};\ i = 1, \ldots, I,\ t = 1, \ldots, T\}$ be a sample of matrix-variate longitudinal observations. Then, the incomplete-data likelihood function is
$$L(\Theta) = \prod_{i=1}^{I} \boldsymbol{\pi}^\top P(X_{i1})\, \boldsymbol{\Pi}\, P(X_{i2}) \cdots \boldsymbol{\Pi}\, P(X_{iT})\, \mathbf{1}_K,$$
where $P(X_{it})$ is a K × K diagonal matrix with the state-dependent densities $\phi_{P \times R}(X_{it}; \theta_k)$, $k = 1, \ldots, K$, on the main diagonal, $\mathbf{1}_K$ is a vector of K ones, and $\Theta$ contains all the model parameters. In this setting, $\mathcal{S}$ is viewed as incomplete because, for each observation, we do not know its state membership and its evolution over time. For this reason, let us define the unobserved state memberships $z_{it} = (z_{it1}, \ldots, z_{itk}, \ldots, z_{itK})^\top$, where $z_{itk} = 1$ if unit i is in state k at time t and 0 otherwise, and the unobserved state transitions $zz_{itjk}$, with $zz_{itjk} = 1$ if unit i is in state j at time t − 1 and in state k at time t, and 0 otherwise. The complete-data log-likelihood can then be written as
$$\ell_c(\Theta) = \sum_{i=1}^{I}\sum_{k=1}^{K} z_{i1k} \log \pi_k + \sum_{i=1}^{I}\sum_{t=2}^{T}\sum_{j=1}^{K}\sum_{k=1}^{K} zz_{itjk} \log \pi_{k|j} + \sum_{i=1}^{I}\sum_{t=1}^{T}\sum_{k=1}^{K} z_{itk} \log \phi_{P \times R}(X_{it}; \theta_k), \quad (3)$$
with $\theta = \{\theta_k;\ k = 1, \ldots, K\}$. In the following, by adopting the notation used in Melnykov and Zhu (2019), the parameters marked with one dot represent the updates at the previous iteration and those marked with two dots are the updates at the current iteration.

E-Step. The E-step requires the calculation of the conditional expectation of (3), given $\mathcal{S}$ and the current estimates $\dot\Theta$. Therefore, we need to replace $z_{itk}$ and $zz_{itjk}$ with their conditional expectations, namely $\ddot{z}_{itk}$ and $\ddot{zz}_{itjk}$. This can be efficiently done by exploiting a forward recursion approach (Baum et al., 1970; Zucchini et al., 2017). Let us start by defining the forward probability
$$\alpha_{itk} = f\left(X_{i1}, \ldots, X_{it}, S_{it} = k\right),$$
that is, the probability of seeing the partial sequence $X_{i1}, \ldots, X_{it}$ and finishing up in state k at time t, and the corresponding backward probability
$$\beta_{itk} = f\left(X_{i,t+1}, \ldots, X_{iT} \mid S_{it} = k\right).$$
It is known that the computation of the forward and backward probabilities is susceptible to numerical underflow errors (Zucchini et al., 2017). To prevent, or at least to reduce, the risk of such errors, a scaling procedure is used. Specifically, it is convenient to work on the log-scale (Farcomeni, 2012). In order to do so, we use a simple computational device based on the following equality:
$$\log(a + b) = \log(a) + \log\left[1 + \exp\left\{\log(b) - \log(a)\right\}\right].$$
Therefore, if one only has the logs of two quantities, log(a) and log(b), only their difference must be exponentiated to obtain log(a + b), reducing the risk of underflow. By iterating this reasoning, one can sum a vector of quantities on the log-scale. This operation is called log-sum-exp. Thus, when t = 1, the forward recursion on the log-scale is given by
$$\log \alpha_{i1k} = \log \dot\pi_k + \log \phi_{P \times R}(X_{i1}; \dot\theta_k),$$
whereas, for t = 2, . . . , T, it is
$$\log \alpha_{itk} = \underset{j = 1, \ldots, K}{\operatorname{logsumexp}} \left\{\log \alpha_{i,t-1,j} + \log \dot\pi_{k|j}\right\} + \log \phi_{P \times R}(X_{it}; \dot\theta_k).$$
In a similar way, the backward recursion on the log-scale is initialized as $\log \beta_{iTk} = 0$ and, for t = T − 1, . . . , 1, we obtain
$$\log \beta_{itk} = \underset{j = 1, \ldots, K}{\operatorname{logsumexp}} \left\{\log \dot\pi_{j|k} + \log \phi_{P \times R}(X_{i,t+1}; \dot\theta_j) + \log \beta_{i,t+1,j}\right\}.$$
Then, the updates required in the E-step can be computed as
$$\ddot{z}_{itk} = \frac{\exp\left\{\log \alpha_{itk} + \log \beta_{itk}\right\}}{\sum_{k'=1}^{K} \exp\left\{\log \alpha_{itk'} + \log \beta_{itk'}\right\}} \quad \text{and} \quad \ddot{zz}_{itjk} = \frac{\exp\left\{\log \alpha_{i,t-1,j} + \log \dot\pi_{k|j} + \log \phi_{P \times R}(X_{it}; \dot\theta_k) + \log \beta_{itk}\right\}}{\sum_{j'=1}^{K}\sum_{k'=1}^{K} \exp\left\{\log \alpha_{i,t-1,j'} + \log \dot\pi_{k'|j'} + \log \phi_{P \times R}(X_{it}; \dot\theta_{k'}) + \log \beta_{itk'}\right\}}.$$
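The log-scale recursions above can be sketched in R as follows, for a single unit i; here log_dens is a T × K matrix with generic entry log φ(X_it; θ_k) (computable, e.g., with the dmatnorm_log sketch above), and all function names are illustrative.

```r
# Log-sum-exp of a vector of log-quantities, obtained by iterating the
# equality log(a + b) = log(a) + log(1 + exp(log(b) - log(a))).
logsumexp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

# Log-scale forward recursion for one unit, a minimal sketch.
# log_dens: T x K matrix of state-dependent log-densities;
# log_pi: K-vector of log initial probabilities;
# log_Pi: K x K matrix of log transition probabilities (row j = from state j).
forward_log <- function(log_dens, log_pi, log_Pi) {
  TT <- nrow(log_dens); K <- ncol(log_dens)
  log_alpha <- matrix(NA_real_, TT, K)
  log_alpha[1, ] <- log_pi + log_dens[1, ]
  for (t in 2:TT) {
    for (k in 1:K) {
      log_alpha[t, k] <- logsumexp(log_alpha[t - 1, ] + log_Pi[, k]) + log_dens[t, k]
    }
  }
  log_alpha  # logsumexp(log_alpha[TT, ]) gives the unit's log-likelihood contribution
}
```

The backward recursion follows the same pattern, run from t = T − 1 down to 1.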
CM-Step 1. At the first CM-step, we maximize the expectation of the complete-data log-likelihood with respect to $\Theta_1 = \{\boldsymbol{\pi}, \boldsymbol{\Pi}, M_k, \Sigma_k;\ k = 1, \ldots, K\}$, fixing $\Theta_2 = \{\Psi_k;\ k = 1, \ldots, K\}$ at $\dot\Theta_2$. In particular, we obtain
$$\ddot\pi_k = \frac{\sum_{i=1}^{I} \ddot{z}_{i1k}}{I}, \qquad \ddot\pi_{k|j} = \frac{\sum_{i=1}^{I}\sum_{t=2}^{T} \ddot{zz}_{itjk}}{\sum_{i=1}^{I}\sum_{t=2}^{T}\sum_{k'=1}^{K} \ddot{zz}_{itjk'}}, \qquad \ddot{M}_k = \frac{\sum_{i=1}^{I}\sum_{t=1}^{T} \ddot{z}_{itk} X_{it}}{\sum_{i=1}^{I}\sum_{t=1}^{T} \ddot{z}_{itk}}.$$
The update for $\Sigma_k$ depends on the parsimonious structure considered. For notational simplicity, let $\ddot{Y} = \sum_{k=1}^{K} \ddot{Y}_k$ be the update of the within-state row scatter matrix, where
$$\ddot{Y}_k = \sum_{i=1}^{I}\sum_{t=1}^{T} \ddot{z}_{itk} \left(X_{it} - \ddot{M}_k\right) \dot\Psi_k^{-1} \left(X_{it} - \ddot{M}_k\right)^\top$$
is the update of the row scatter matrix related to the kth state, and let $\ddot{n}_k = \sum_{i=1}^{I}\sum_{t=1}^{T} \ddot{z}_{itk}$. The updates for the 14 parsimonious structures of $\Sigma_k$ are:

- Model EII [$\Sigma_k = \lambda I$]. In this setting, the row covariance matrices of all states are spherical and have equal volume. We need to estimate only $\lambda$, as
$$\ddot\lambda = \frac{\operatorname{tr}(\ddot{Y})}{PRTI}.$$

- Model VII [$\Sigma_k = \lambda_k I$]. In this case, the row covariance matrices are spherical but their volumes differ. Thus, the update for $\lambda_k$ is
$$\ddot\lambda_k = \frac{\operatorname{tr}(\ddot{Y}_k)}{PR\,\ddot{n}_k}.$$

- Model EEI [$\Sigma_k = \lambda\Delta$]. Here, the row covariance matrices of all states have equal volume and shape and are axis-aligned. The updates for $\Delta$ and $\lambda$ are
$$\ddot\Delta = \frac{\operatorname{diag}(\ddot{Y})}{\left|\operatorname{diag}(\ddot{Y})\right|^{1/P}}, \qquad \ddot\lambda = \frac{\left|\operatorname{diag}(\ddot{Y})\right|^{1/P}}{RTI}.$$

- Model VEI [$\Sigma_k = \lambda_k\Delta$]. In this setting, the row covariance matrices of all states have equal shape and are axis-aligned, but they are allowed to have different volumes. We need to update $\Delta$ and $\lambda_k$ as
$$\ddot\Delta = \frac{\operatorname{diag}\left(\sum_{k=1}^{K} \ddot{Y}_k / \dot\lambda_k\right)}{\left|\operatorname{diag}\left(\sum_{k=1}^{K} \ddot{Y}_k / \dot\lambda_k\right)\right|^{1/P}}, \qquad \ddot\lambda_k = \frac{\operatorname{tr}\left(\ddot{Y}_k \ddot\Delta^{-1}\right)}{PR\,\ddot{n}_k}.$$

- Model EVI [$\Sigma_k = \lambda\Delta_k$]. In this case, the row covariance matrices have equal volume and are axis-aligned, but they have different shapes. The updates for $\Delta_k$ and $\lambda$ are
$$\ddot\Delta_k = \frac{\operatorname{diag}(\ddot{Y}_k)}{\left|\operatorname{diag}(\ddot{Y}_k)\right|^{1/P}}, \qquad \ddot\lambda = \frac{\sum_{k=1}^{K}\left|\operatorname{diag}(\ddot{Y}_k)\right|^{1/P}}{RTI}.$$

- Model VVI [$\Sigma_k = \lambda_k\Delta_k$]. The most generic case in the diagonal family has row covariance matrices with different volumes and shapes, in addition to being axis-aligned. Then, the updates for $\Delta_k$ and $\lambda_k$ are
$$\ddot\Delta_k = \frac{\operatorname{diag}(\ddot{Y}_k)}{\left|\operatorname{diag}(\ddot{Y}_k)\right|^{1/P}}, \qquad \ddot\lambda_k = \frac{\left|\operatorname{diag}(\ddot{Y}_k)\right|^{1/P}}{R\,\ddot{n}_k}.$$

- Model EEE [$\Sigma_k = \lambda\Gamma\Delta\Gamma^\top$]. The most constrained member of the general family has row covariance matrices with the same volume, shape and orientation. Thus, the update for $\Sigma$ is given by
$$\ddot\Sigma = \frac{\ddot{Y}}{RTI}.$$

- Model VEE [$\Sigma_k = \lambda_k\Gamma\Delta\Gamma^\top$]. This model assumes row covariance matrices with the same shape and orientation but different volumes. Let $C = \Gamma\Delta\Gamma^\top$. The updates for $C$ and $\lambda_k$ are
$$\ddot{C} = \frac{\sum_{k=1}^{K} \ddot{Y}_k / \dot\lambda_k}{\left|\sum_{k=1}^{K} \ddot{Y}_k / \dot\lambda_k\right|^{1/P}}, \qquad \ddot\lambda_k = \frac{\operatorname{tr}\left(\ddot{Y}_k \ddot{C}^{-1}\right)}{PR\,\ddot{n}_k}.$$

- Model EVE [$\Sigma_k = \lambda\Gamma\Delta_k\Gamma^\top$]. Here, the row covariance matrices have equal volume and orientation but different shapes. Given that there is no analytical solution for $\Gamma$ while keeping the other parameters fixed, an iterative minorization-maximization (MM) algorithm (Browne and McNicholas, 2014) is employed. In detail, a surrogate function is constructed by exploiting $\omega_k$, the largest eigenvalue of $\ddot{Y}_k$; its maximization reduces to the singular value decomposition of a matrix $F$, say $F = \dot{G}\dot{D}\dot{H}^\top$, so that the update of $\Gamma$ is given by $\ddot\Gamma = \dot{G}\dot{H}^\top$. This process is repeated until a specified convergence criterion is met, and the estimate $\ddot\Gamma$ is obtained from the last iteration. Then, we obtain the updates for $\Delta_k$ and $\lambda$ as
$$\ddot\Delta_k = \frac{\operatorname{diag}\left(\ddot\Gamma^\top \ddot{Y}_k \ddot\Gamma\right)}{\left|\operatorname{diag}\left(\ddot\Gamma^\top \ddot{Y}_k \ddot\Gamma\right)\right|^{1/P}}, \qquad \ddot\lambda = \frac{\sum_{k=1}^{K}\left|\operatorname{diag}\left(\ddot\Gamma^\top \ddot{Y}_k \ddot\Gamma\right)\right|^{1/P}}{RTI}.$$

- Model VVE [$\Sigma_k = \lambda_k\Gamma\Delta_k\Gamma^\top$]. In this case, the row covariance matrices have the same orientation, but varying volumes and shapes. Again, there is no analytical solution for $\Gamma$, and its update is obtained by employing the MM algorithm as described for the EVE model. Then, the updates for $\Delta_k$ and $\lambda_k$ are
$$\ddot\Delta_k = \frac{\operatorname{diag}\left(\ddot\Gamma^\top \ddot{Y}_k \ddot\Gamma\right)}{\left|\operatorname{diag}\left(\ddot\Gamma^\top \ddot{Y}_k \ddot\Gamma\right)\right|^{1/P}}, \qquad \ddot\lambda_k = \frac{\left|\operatorname{diag}\left(\ddot\Gamma^\top \ddot{Y}_k \ddot\Gamma\right)\right|^{1/P}}{R\,\ddot{n}_k}.$$

- Model EEV [$\Sigma_k = \lambda\Gamma_k\Delta\Gamma_k^\top$]. Here, the row covariance matrices have the same volume and shape, but different orientations. An algorithm similar to the one proposed by Celeux and Govaert (1995) can be employed here. In detail, the eigen-decomposition $\ddot{Y}_k = L_k \Omega_k L_k^\top$ is firstly considered, with the eigenvalues in the diagonal matrix $\Omega_k$ in descending order and the orthogonal matrix $L_k$ composed of the corresponding eigenvectors. Then, we obtain the updates for $\Gamma_k$, $\Delta$ and $\lambda$ as
$$\ddot\Gamma_k = L_k, \qquad \ddot\Delta = \frac{\sum_{k=1}^{K}\Omega_k}{\left|\sum_{k=1}^{K}\Omega_k\right|^{1/P}}, \qquad \ddot\lambda = \frac{\left|\sum_{k=1}^{K}\Omega_k\right|^{1/P}}{RTI}.$$

- Model VEV [$\Sigma_k = \lambda_k\Gamma_k\Delta\Gamma_k^\top$]. In this setting, the row covariance matrices have the same shape, but different volumes and orientations. By using the same algorithm applied for the EEV model, the updates for $\Gamma_k$, $\Delta$ and $\lambda_k$ are
$$\ddot\Gamma_k = L_k, \qquad \ddot\Delta = \frac{\sum_{k=1}^{K}\Omega_k / \dot\lambda_k}{\left|\sum_{k=1}^{K}\Omega_k / \dot\lambda_k\right|^{1/P}}, \qquad \ddot\lambda_k = \frac{\operatorname{tr}\left(\Omega_k \ddot\Delta^{-1}\right)}{PR\,\ddot{n}_k}.$$

- Model EVV [$\Sigma_k = \lambda\Gamma_k\Delta_k\Gamma_k^\top$]. In this model, the row covariance matrices have varying shapes and orientations but equal volume. The updates can be obtained in a similar fashion to those of the EVI model. Thus, by considering $C_k = \Gamma_k\Delta_k\Gamma_k^\top$, we estimate $C_k$ (from which $\Gamma_k$ and $\Delta_k$ follow) and $\lambda$ as
$$\ddot{C}_k = \frac{\ddot{Y}_k}{\left|\ddot{Y}_k\right|^{1/P}}, \qquad \ddot\lambda = \frac{\sum_{k=1}^{K}\left|\ddot{Y}_k\right|^{1/P}}{RTI}.$$

- Model VVV [$\Sigma_k = \lambda_k\Gamma_k\Delta_k\Gamma_k^\top$]. In the fully unconstrained case, we obtain
$$\ddot\Sigma_k = \frac{\ddot{Y}_k}{R\,\ddot{n}_k}.$$
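The SVD step closing each MM iteration admits a compact R implementation. We do not reproduce the exact construction of F here (it follows Browne and McNicholas, 2014); the snippet below only illustrates the projection Γ̈ = ĠḢ⊤, which maps an arbitrary matrix to an orthogonal one.

```r
# Orthogonal update of Gamma from a given matrix Fmat, as in the MM step:
# if Fmat = G D H' is its singular value decomposition, the update is G H'
# (a Procrustes-type solution, orthogonal by construction).
update_Gamma <- function(Fmat) {
  s <- svd(Fmat)
  s$u %*% t(s$v)
}
```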
CM-Step 2. At the second CM-step, we maximize the expectation of the complete-data log-likelihood with respect to $\Theta_2$, keeping $\Theta_1$ fixed at $\ddot\Theta_1$. The update for $\Psi_k$ depends on which of the 7 parsimonious structures is considered. For notational simplicity, let $\ddot{W} = \sum_{k=1}^{K} \ddot{W}_k$ be the update of the within-state column scatter matrix, where
$$\ddot{W}_k = \sum_{i=1}^{I}\sum_{t=1}^{T} \ddot{z}_{itk} \left(X_{it} - \ddot{M}_k\right)^\top \ddot\Sigma_k^{-1} \left(X_{it} - \ddot{M}_k\right)$$
is the update of the column scatter matrix related to the kth state. In detail, we have:

- Model II [$\Psi_k = I$]. This is the simplest model, since the column covariance matrices are spherical and assumed to be R × R identity matrices. Therefore, there are no parameters to be estimated.

- Model EI [$\Psi_k = \Delta$]. In this setting, the column covariance matrices have the same shape and are axis-aligned. The update for $\Delta$ is
$$\ddot\Delta = \frac{\operatorname{diag}(\ddot{W})}{\left|\operatorname{diag}(\ddot{W})\right|^{1/R}}.$$

- Model VI [$\Psi_k = \Delta_k$]. Here, the column covariance matrices have different shapes and are axis-aligned. Thus, we update $\Delta_k$ as
$$\ddot\Delta_k = \frac{\operatorname{diag}(\ddot{W}_k)}{\left|\operatorname{diag}(\ddot{W}_k)\right|^{1/R}}.$$

- Model EE [$\Psi_k = \Gamma\Delta\Gamma^\top$]. In this case, the column covariance matrices have equal shapes and orientations. Therefore, we can directly obtain
$$\ddot\Psi = \frac{\ddot{W}}{\left|\ddot{W}\right|^{1/R}}.$$

- Model VE [$\Psi_k = \Gamma\Delta_k\Gamma^\top$]. In this setting, the column covariance matrices have equal orientation but different shapes. Similarly to the EVE and VVE models in CM-Step 1, there is no analytical solution for $\Gamma$ while keeping the other parameters fixed. Therefore, the MM algorithm is implemented by following the same procedure explained for the EVE model and by replacing $\ddot{Y}_k$ with $\ddot{W}_k$. Then, the update of $\Delta_k$ is
$$\ddot\Delta_k = \frac{\operatorname{diag}\left(\ddot\Gamma^\top \ddot{W}_k \ddot\Gamma\right)}{\left|\operatorname{diag}\left(\ddot\Gamma^\top \ddot{W}_k \ddot\Gamma\right)\right|^{1/R}}.$$

- Model EV [$\Psi_k = \Gamma_k\Delta\Gamma_k^\top$]. Here, the column covariance matrices have a common shape but varying orientations. By using the same approach as for the EEV and VEV models, with $\ddot{W}_k$ in place of $\ddot{Y}_k$ in the eigen-decomposition $\ddot{W}_k = L_k \Omega_k L_k^\top$, we obtain the updates of $\Gamma_k$ and $\Delta$ as
$$\ddot\Gamma_k = L_k, \qquad \ddot\Delta = \frac{\sum_{k=1}^{K}\Omega_k}{\left|\sum_{k=1}^{K}\Omega_k\right|^{1/R}}.$$

- Model VV [$\Psi_k = \Gamma_k\Delta_k\Gamma_k^\top$]. In the fully unconstrained case, we obtain
$$\ddot\Psi_k = \frac{\ddot{W}_k}{\left|\ddot{W}_k\right|^{1/R}}.$$

To start our ECM algorithm, we followed the approach of Tomarchio et al. (2020b), where a generalization of the short-EM initialization strategy proposed by Biernacki et al. (2003) has been implemented. It consists of H short runs of the algorithm from several random positions. The term "short" means that the algorithm is run for a few iterations s, without waiting for convergence. In this manuscript, we set H = 100 and s = 1. Then, the parameter set producing the largest log-likelihood is used to initialize the ECM algorithm. In both the simulated and real data analyses, this procedure has shown stable results after multiple runs.
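A minimal sketch of the short-EM initialization just described; random_start and run_ecm are hypothetical helpers standing for a random parameter initializer and for the ECM iterations described above (they are not actual functions from the paper or from any package).

```r
# Short-EM initialization, a sketch: H short runs from random starting values,
# each stopped after s ECM iterations; the best run seeds the full algorithm.
short_em_init <- function(data, K, H = 100, s = 1) {
  best <- NULL
  for (h in seq_len(H)) {
    init <- random_start(data, K)                  # hypothetical random initializer
    fit  <- run_ecm(data, K, init, max_iter = s)   # hypothetical ECM wrapper
    if (is.null(best) || fit$loglik > best$loglik) best <- fit
  }
  best$par  # the parameter set with the largest log-likelihood seeds the full ECM
}
```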
In this section, we examine different aspects of our HMMs through large-scale simulation studies. Considering the high number of models, we will only focus on two of them for the sake of simplicity. In detail, the EII-II HMM (which is the most parsimonious model) and the VVE-VE HMM (which is one of the two models for which an MM algorithm is used both for $\Sigma_k$ and $\Psi_k$) are considered. For each model, several experimental conditions are evaluated. Specifically, we set P = R = 2, I = 100, T ∈ {5, 10, 15}, K ∈ {2, 4} and two levels of overlap, labeled "Overlap 1" and "Overlap 2". Therefore, 3 × 2 × 2 = 12 scenarios are analyzed and, for each of them, 50 data sets are generated by the considered HMMs. Regarding the parameters used to generate the data, when K = 2 the mean matrix of the second state ($M_2$) is obtained by adding a constant c to each element of $M_1$, with c depending on the level of overlap. Specifically, we consider c = 2 under the "Overlap 1" scenarios, whereas c = 5 under the "Overlap 2" scenarios. When K = 4, the first two hidden states have the same $\{\Sigma_k, \Psi_k, M_k;\ k = 1, 2\}$ as before. Clearly, the covariance matrices of the third and fourth hidden states for the EII-II model are still equal to those of the first two states, whereas for the VVE-VE model they differ across states. To obtain $M_3$ and $M_4$, we add c = 4 and c = −2 to each element of $M_1$, respectively.

First of all, we evaluate the recovery and the consistency of the estimated parameters by computing the mean square errors (MSEs). Considering the high number of parameters that would have to be reported, we follow an approach similar to the one used by Farcomeni and Punzo (2020), i.e. we calculate the average of the MSEs of each parameter of the model over the K states, allowing us to summarize the MSE of each parameter in a single number. Furthermore, before showing the obtained results, it is important to underline the well-known label switching issue, caused by the invariance of the likelihood function under relabeling of the model states (Frühwirth-Schnatter, 2006). There are no generally accepted labeling methods, and we simply attribute the labels by looking at the estimated $M_k$.

Table 2 and Table 3 report the average MSEs, computed after fitting the EII-II and VVE-VE HMMs, with the corresponding K, to the respective data sets. Note that the $\Psi$ covariance matrix is not reported in Table 2 since it is not estimated in the EII-II HMM. As we can see, the MSEs can be considered negligible in all the considered scenarios. It is interesting to note that, for a fixed overlap, their values improve as T increases and that, for fixed T, their values roughly improve as we move from "Overlap 1" to "Overlap 2", reflecting the decreased level of overlap. Additionally, when the VVE-VE HMM is considered, the MM algorithms used for estimating the covariance matrices seem to produce reliable estimates.

Another interesting aspect to evaluate is the computational time required for fitting the HMMs. In detail, on each of the above data sets, all the 98 HMMs are now fitted for the corresponding K, and their computational times (in seconds) are illustrated by the heat maps of Figures 1-4. Computation is performed on a Windows 10 PC, with an AMD Ryzen 7 3700x CPU and 16.0 GB of RAM, using the R 64-bit statistical software (R Core Team, 2019); the proc.time() function of the base package is used to measure the time. As is reasonable to expect, the computational time grows as T increases in each scenario, and it approximately halves when we pass from "Overlap 1" to "Overlap 2", highlighting the easier estimation in the latter case. Furthermore, with the exclusion of Fig. 4a, the computational time approximately triples when we move from fitting HMMs with K = 2 to HMMs with K = 4 hidden states. It is interesting to note that the EVE-VE and VVE-VE HMMs, which are the two models for which we use an MM algorithm for estimating both covariance matrices, are the most time consuming, with a computational burden that seems to double with respect to the other models. This is particularly evident in the "Overlap 2" scenarios.

The total computational time can be strongly reduced by exploiting parallel computing. In detail, Table 4 shows the overall time taken to fit the 98 HMMs sequentially (the default in R) and by parallelizing them over 14 cores.

Fig. 2: Heat maps of the average computational time for the 98 HMMs, computed over 50 data sets, when the data are generated by a EII-II HMM with K = 4 and "Overlap 1" (a) or "Overlap 2" (b).

Fig. 3: Heat maps of the average computational time for the 98 HMMs, computed over 50 data sets, when the data are generated by a VVE-VE HMM with K = 2 and "Overlap 1" (a) or "Overlap 2" (b).

As we can see, the computational burden is decreased by about 10 times, and all the models can be fitted reasonably fast (with some exceptions in the "Overlap 1" scenarios).

Lastly, the capability of the Bayesian information criterion (BIC; Schwarz, 1978) to identify the true parsimonious structure and the correct number of groups is investigated. This is because, so far, we have fitted models with K equal to the true number of states present in the data, and we need to assess whether the BIC, which is one of the most popular tools in model-based clustering, works accurately.

Fig. 4: Heat maps of the average computational time for the 98 HMMs, computed over 50 data sets, when the data are generated by a VVE-VE HMM with K = 4 and "Overlap 1" (a) or "Overlap 2" (b).

Therefore, on each of the above data sets, the 98 HMMs are fitted for K ∈ {1, . . . , K + 1}, and the results are reported in Table 5. First of all, in each scenario, the true K has always been selected by the best fitting model according to the BIC (for this reason, this information is not reported in Table 5). Additionally, we notice that in almost all the cases the true data-generating model has been identified by the BIC. In those few cases where the BIC selects a wrong model, this is because of an incorrect choice of the parsimonious structure for one of the two covariance matrices $\Sigma$ or $\Psi$.
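For completeness, a sketch of how the BIC can be computed from a fitted model; the sign convention (here: smaller is better) and the choice of the sample size n are our assumptions, since the text does not spell them out.

```r
# BIC under the convention -2*loglik + npar*log(n), to be minimized.
# Using n = I (number of units) is an assumption; n = I*T is another common choice.
bic <- function(loglik, npar, n) {
  -2 * loglik + npar * log(n)
}
```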
4 Real data example

In this section, we analyze data concerning the unemployment rate in the Italian provinces (NUTS3, according to the European Nomenclature of Territorial Units for Statistics). The data come from the Italian National Institute of Statistics (ISTAT), a public research organization and the main producer of official statistics in the service of citizens and policy-makers in Italy, and are freely accessible at http://dati.istat.it/#. In detail, we investigate the I = 98 Italian provinces for which the unemployment rate is available from the beginning of the data collection at the provincial level (2004) to the most recent year (2019). This implies that we are considering T = 16 years of data. Note that some provinces are not included in the analysis since their data were available for only a few years. For each province, the unemployment rate is recorded in a two-factor format. The first factor, gender, has two levels (i.e. P = 2): males and females. The second factor, age, has three levels (i.e. R = 3) driven by the age category: 15-24, 25-34 and 35 and older. Therefore, the whole data set is presented in a four-way array having dimensions 2 × 3 × 98 × 16. The unemployment rates are then mapped to the real line by using the logit transformation, as commonly done in this branch of the literature for overcoming boundary bias problems (see, e.g., Wallis, 1987; Koop and Potter, 1999; Hudomiet, 2015).
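In R, the logit transformation of rates expressed as proportions in (0, 1), and its inverse, are available through the base functions qlogis() and plogis(); the rates below are illustrative values, not taken from the ISTAT data.

```r
# Map unemployment rates in (0, 1) to the real line and back.
rates <- c(0.056, 0.121, 0.304)  # illustrative proportions
logit_rates <- qlogis(rates)     # logit: log(p / (1 - p))
plogis(logit_rates)              # inverse logit recovers the original rates
```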
In analyzing this data set, several questions arise concerning the existence of areas with similar unemployment levels among the Italian provinces. Historically, Southern Italy has always shown worse economic performance than the rest of the country (Daniele et al., 2007). However, within each region (NUTS2, according to the European Nomenclature of Territorial Units for Statistics) there can be considerable differences among the provinces that are part of it. Also of interest is the strength of time dependence, as measured by the transition probability matrix, as well as how the provinces move between the hidden states. This latter aspect can be of particular interest in light of the two main recessions that the Italian economy faced in 2008 and 2011. Relatedly, an overview of which provinces have best withstood the crises, or which have been able to recover from them, can be easily obtained.

Our 98 HMMs are fitted to the data for K ∈ {1, . . . , 10} and, according to the BIC, the best model is the VEV-EE with K = 7 hidden states. Despite the logit scale, we can easily interpret the K = 7 states by looking at the estimated mean matrices. As we can note from these estimates, it is possible to sort the states according to growing unemployment levels, in both the gender and age factors. Specifically, as we move from the first to the seventh state the unemployment rises, and each state becomes worse than the previous ones from every point of view. The only partial exception concerns states two and three. Indeed, regardless of gender, state two shows better rates for people over 25 and, conversely, state three is preferable for people under 25. However, since state two is preferable in 4 cases out of 6, and considering that these involve the majority of people, we can globally consider it better than state three. We can also observe that, regardless of the considered state, the unemployment levels are higher for females and become lower as age increases. It might also be interesting to report that, regardless of the age class, the lowest relative differences between the two genders are in the seventh state, the worst one.

Useful insights can also be obtained from the analysis of the estimated covariance matrices. The best fitting model suggests various volumes and orientations but equal shapes for the gender-related covariance matrices, whereas all shapes and orientations are found to be the same for the age-based covariance matrices. In detail, from the estimated age-based covariance matrices we can note that the variances become lower as the age category grows.

Lastly, before showing how these states cluster the Italian provinces, it is worth analyzing the estimated transition probability matrix. As we can note from it, transitions between adjacent states are not uncommon, whereas they are essentially null among distant states. Furthermore, it seems that the persistence of staying in a state increases as we move from the fourth to the seventh state, i.e. it appears increasingly difficult for the provinces clustered in the troubled states to improve their position. In particular, the last state shows a high persistence.

From the analysis of the bar plot, the highest number of switches between the states occurs over the years 2009-2013, as a consequence of the two aforementioned economic recessions. Indeed, all the provinces showed an increase in the unemployment rates in those years, causing a change towards worse states. This can be better understood by looking at the maps of the Italian provinces in Figure 6, which are colored according to state memberships. Note that the provinces not included in the analysis are colored in gray. For simplicity, we avoid plotting a map for each of the 16 years of data, and we limit ourselves to reporting some key years. It is interesting to note that, in almost all the cases, not all the provinces belonging to the same region are clustered in the same state. Therefore, working at the province level may offer additional insights compared to working at the regional level. Starting from the first year of analysis, i.e. 2004, in Figure 6a we can recognize several clusters of provinces that, as we move towards the south, belong to states with higher unemployment rates. After some years (2005-2008) characterized by relatively few changes among the states, the first economic recession began to produce its effects in 2009, when many provinces started to perform badly (see Figure 6b). Such an event, further strengthened by the second economic recession of 2011, led to a continuous rise in unemployment levels that (see Figures 6c-6f) drove the majority of provinces towards the worst states, reducing the pre-existing differences between them.
After two years of small changes (2014-2015), the provinces located in the upper part of the country started to recover, whereas the rest of the provinces failed to restart, remaining anchored in the post-recession difficulties (see Figures 6g and 6h, which show the two most recent years). In any case, these signs of recovery are going to be dramatically arrested by the COVID-19 pandemic, and its effects will have serious repercussions in the coming years.

It is also interesting to report the behavior of some specific provinces. For example, there is only one province that, over the 16 years, has never changed its state. Specifically, the province of Bolzano, the northernmost in Italy, has always been in state one, the best. This means that, regardless of the economic crises, the unemployment issue has been virtuously managed in this province. We can also mention the two provinces that have worsened their position the most over the 16 years. In detail, the provinces of Ancona and Ferrara were in state one in the first years of our analysis, but they are currently located in the fifth state (since 2013 and 2016, respectively). Conversely, none of the provinces is currently in a state that can be considered better than the one it occupied in 2004. This means that all the provinces have kept (Bolzano), lost (most of the provinces) or at best regained (only some provinces) the state they had at the beginning of our analysis.

In this manuscript we introduced parsimonious hidden Markov models for matrix-variate longitudinal data. Being (dependent) mixture models, they allow the recovery of homogeneous latent subgroups and, simultaneously, provide a meaningful interpretation of how the sample units move between the hidden states over time. Parsimony has been introduced via the eigen decomposition of the state covariance matrices, producing a family of 98 HMMs. An ECM algorithm has been illustrated for parameter estimation. At first, the parameter recovery of our algorithm has been evaluated under different scenarios, providing good results. This is particularly interesting for those HMMs that use an MM algorithm at each step of the ECM algorithm. Relatedly, we have analyzed the computational times for fitting all the 98 HMMs. The computational burden of the HMMs using the MM algorithm is definitely higher, even though all the HMMs can be fitted reasonably fast when parallel computing is considered. The BIC has proven to be effective in detecting the true number of states in the data, as well as the parsimonious structure. The real data example has shown the usefulness of our HMMs. Indeed, in addition to identifying different states, they have provided a tool for easily analyzing the evolution of unemployment over time at the province level, and for obtaining useful insights into the behavior of some specific provinces.

There are different possibilities for further work, some of which are worth mentioning. First of all, we could extend our HMMs by using skewed or heavy-tailed state-dependent probability density functions (Gallaugher and McNicholas, 2017, 2019; Tomarchio et al., 2020a,b), in order to model features commonly present in real data. Another extension would deal with the regression setting (Viroli, 2012), where covariates shared by all units in the same hidden state are used. This can be done in both fixed and random covariates frameworks.
References

Mixed hidden Markov models
Latent Markov models for longitudinal data
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models
Estimating common principal components in high dimensions
Nonparametric methods in factorial designs
Gaussian parsimonious clustering models
Il prodotto delle regioni e il divario Nord-Sud in Italia
Maximum likelihood from incomplete data via the EM algorithm
Quantile regression for longitudinal data based on latent Markov subject-specific parameters
Robust model-based clustering with mild and gross outliers
Dimension reduction for longitudinal multivariate data by optimizing class separation of projected latent Markov models
A primer in longitudinal data analysis
Finite mixture and Markov switching models
A matrix variate skew-t distribution
Three skewed matrix variate distributions
Multilevel matrix-variate analysis and its application to accelerometry-measured physical activity in clinical populations
The role of occupation specific adaptation costs in explaining the educational gap in unemployment
Mixed hidden Markov models for longitudinal data: An overview
Dynamic mixtures of factor analyzers to characterize multivariate air pollutant exposures
Studying crime trends in the USA over the years 2000-2012
Maximum likelihood estimation via the ECM algorithm: A general framework
nparLD: an R software package for the nonparametric analysis of longitudinal data in factorial experiments
R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing
On parsimonious models for modeling matrix data
Estimating the dimension of a model
Mixtures of contaminated matrix variate normal distributions
Two new matrix-variate distributions with application in model-based clustering
The analysis of multivariate longitudinal data: A review
The analysis of multivariate longitudinal data: An instructive application of the longitudinal three-mode model
Finite mixtures of matrix normal distributions for classifying three-way data
Model based clustering for three-way data structures
On matrix-variate regression analysis
Time series analysis of bounded economic variables
On variable selection in matrix mixture modelling
On finite mixture modeling of change-point processes
Hidden Markov models for time series: an introduction using R