Learning Hidden Markov Models from Aggregate Observations
Rahul Singh, Qinsheng Zhang, Yongxin Chen
November 23, 2020

Abstract: In this paper, we propose an algorithm for estimating the parameters of a time-homogeneous hidden Markov model from aggregate observations. This problem arises when only population-level counts of the number of individuals at each time step are available, from which one seeks to learn the individual hidden Markov model. Our algorithm is built upon expectation-maximization and the recently proposed aggregate inference algorithm, the Sinkhorn belief propagation. Compared with existing methods such as expectation-maximization with non-linear belief propagation, our algorithm exhibits convergence guarantees. Moreover, our learning framework naturally reduces to the standard Baum-Welch learning algorithm when observations corresponding to a single individual are recorded. We further extend our learning algorithm to handle HMMs with continuous observations. The efficacy of our algorithm is demonstrated on a variety of datasets.

There has been a growing interest in applications where data about individuals are not accessible; instead, aggregate population-level observations in the form of counts of individuals are available [29, 18]. For various reasons, including measurement fidelity, privacy preservation, cost of data collection, and scalability, data is often collected in aggregate form. For example, in human ensemble flow analysis, individual trajectories may not be readily accessible due to privacy concerns, but the number of individuals in a certain geographical area can typically be counted by cell phone carriers. More examples include voter turnout based on demography from census data [15] and bird migration analysis [31]. One fundamental part of modeling such aggregate data is estimating the individual model parameters. Learning the underlying individual model from aggregate observations is a challenging task since the full trajectory of each individual is not accessible.

We are interested in learning hidden Markov models (HMMs) using aggregate data. HMMs are popular graphical models used in various scenarios involving unobservable (hidden) data sequences arising in ecology, social dynamics, and the emergence of epidemics [26, 5, 9, 30]. Due to their ability to address nonstationarity in observed data sequences, HMMs are capable of modeling a rich class of problems. In the aggregate HMM setting, a large set of homogeneous individuals transition from one state to another according to the underlying HMM and, at each time step, the corresponding aggregate observations are recorded. For example, in epidemiology, one can model the spread of an infectious disease such as COVID-19 over time in a geographical area using population-level aggregate data generated by an HMM.

In this work, we consider the problem of estimating the parameters of a time-homogeneous hidden Markov model, i.e., its transition and observation probabilities, from noisy aggregate data. A traditional method for learning HMMs is the Baum-Welch algorithm [1, 2], which is a special case of the expectation-maximization (EM) algorithm [8, 23]. Given observations sampled from a model containing latent (unobservable) variables with unknown parameters, the EM algorithm seeks the maximum likelihood estimates of the model parameters.
In its first step (E-step), the EM algorithm estimates a function of the expected values of the latent variables, and subsequently, in the second step (M-step), it finds the maximum likelihood parameter estimates. For learning HMM parameters, an inference algorithm such as belief propagation (BP) [25] is utilized in the E-step of the EM algorithm. The Baum-Welch algorithm for estimating an HMM uses the forward-backward inference algorithm, one type of BP algorithm, in the E-step to complete the data. Unfortunately, traditional HMM learning methods such as the Baum-Welch algorithm [2] cannot be applied in the aggregate setting. Learning the individual model from such population-level observations is challenging since a great amount of information about the individuals is lost due to data aggregation and observation noise. Recently, the learning and inference problems in aggregate settings have been formalized under the collective graphical model (CGM) framework [29]. Within the CGM framework, several aggregate inference methods, such as the non-linear belief propagation (NLBP) [31] and Bethe-RDA [34] algorithms, have been utilized in the E-step of the EM algorithm for learning the parameters of the individual model, aiming to maximize the complete-data likelihood. Both of these inference algorithms work with an explicit observation noise model. In addition, since NLBP does not exhibit a convergence guarantee, it does not lead to stable learning methods.

The primary contribution of our work is a novel algorithm, with theoretical guarantees, for estimating the HMM parameters from noisy aggregate observations. We utilize a modified EM algorithm for the learning task, where the E-step is solved using the recently proposed aggregate inference method, the Sinkhorn belief propagation (SBP) algorithm [30]. We show that our algorithm exhibits a convergence guarantee. Instead of explicitly considering the noise model, we incorporate the observation noise into the underlying graph; as a result, our algorithm reduces to the standard Baum-Welch algorithm when only one individual is considered. We further extend our algorithm to learn the model parameters under a continuous observation noise model. We evaluate the performance of our algorithm in a variety of scenarios, including human ensemble flow on real-world data.

Related Work: Estimating Markov chains from aggregate data, also referred to as macro data in earlier works, has a long history. It was first studied in [17], where the transition matrices were estimated based on the maximum likelihood method. In [32, 21, 14], the modeling of a single Markov chain was studied by maximizing the aggregate posterior. More recent learning methods from aggregate data include [18, 24]. After the introduction of the CGM framework in [29], there have been a few works on learning the underlying individual model from aggregate data. The NLBP algorithm [31], a message-passing type algorithm for approximate inference in CGMs, has been utilized in EM for the task of learning a Markov chain. Another existing aggregate inference algorithm utilized in the E-step of the EM algorithm is Bethe-RDA [34], which exhibits convergence guarantees. Finally, [3] proposed a method of moments estimator for learning a Markov chain within the CGM framework.
Other works along this line include estimating spatio-temporal population flow [11] and recurrent estimation of HMMs [19] from aggregate data, learning the stochastic behaviour of aggregate data [20], learning hidden nonlinear dynamics from aggregate data [36], and estimating group behavior from ensemble observations [37].

The rest of the paper is organized as follows. In Section 2, we briefly discuss related background. We present our main results and algorithms in Section 3 for discrete observations. The counterpart with continuous observations is developed in Section 4, followed by experimental results in Section 5 and concluding remarks in Section 6.

In this section, we present related background on HMMs, their extension to aggregate settings, and the collective forward-backward (CFB) inference algorithm. An HMM is a Markov chain whose variables are not directly observable; instead, corresponding noisy variables are observed. Denote the unobserved hidden variables as X_1, X_2, ... and the observed variables as O_1, O_2, .... Here X_t and O_t are random variables taking values in the sets X and O, respectively. In general, both X and O can be either finite or infinite sets. For discrete HMMs, X and O are finite sets with cardinalities |X| = d and |O| = s, respectively. A time-homogeneous HMM is parameterized by the initial distribution π(X_1), the state transition probabilities p(X_{t+1} | X_t), and the observation probabilities p(O_t | X_t), independent of the time step t = 1, 2, .... An HMM is a special type of probabilistic graphical model (PGM) [35]. The graphical representation of a length-T HMM is shown in Figure 1. The joint distribution of an HMM of length T factorizes as

p(x, o) = π(x_1) ∏_{t=1}^{T-1} p(x_{t+1} | x_t) ∏_{t=1}^{T} p(o_t | x_t),

where x = {x_1, x_2, ..., x_T} and o = {o_1, o_2, ..., o_T} denote particular assignments to the hidden and observation variables, respectively. One of the most important problems in HMMs is Bayesian inference, where the goal is to calculate the posterior distributions of the hidden states X_t given a sequence of observations o = {o_1, o_2, ..., o_T}. This is also known as filtering/smoothing [22] in the systems and control community. A well-known algorithm for this task is the standard forward-backward algorithm [27], which is itself a special case of belief propagation [25] for Bayesian inference over general graphical models.

Another important problem in HMMs is parameter learning, also known as system identification. Denote the set of parameters to be learned as θ = {π(x_1), p(x_{t+1} | x_t), p(o_t | x_t)}. Let {o^{(m)}}_{m=1}^{M}, with o^{(m)} = {o_1^{(m)}, ..., o_T^{(m)}}, be a set of observed trajectories. The objective of parameter learning in HMMs is to estimate the parameter θ using the available data {o^{(m)}}_{m=1}^{M}. Since the HMM is a latent variable model in which the latent variables X_t are not observable, maximum likelihood estimation cannot be applied directly. A popular approach for learning latent variable models is the expectation-maximization (EM) algorithm [23, 4]. The EM algorithm is an iterative method that involves two steps in each iteration: the E-step and the M-step. In the E-step, the values associated with the hidden variables are estimated to complete the data; then, in the M-step, the parameters of the underlying model are optimized based on the complete-data likelihood. When specialized to HMMs, the EM algorithm reduces to the Baum-Welch algorithm [16].

The aggregate HMM is a framework for learning and inference from noisy aggregate data generated by an HMM describing the behavior of individuals. It is a special case of the collective graphical model [29], the corresponding framework for general probabilistic graphical models.
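To make the aggregate setting concrete before describing it formally, the following minimal sketch samples M independent trajectories from a discrete HMM, with the parameters θ = {π(x_1), p(x_{t+1}|x_t), p(o_t|x_t)} written in matrix form as (pi, A, B), and keeps only the per-time-step observation counts. The code and variable names are ours and are meant purely for illustration.

```python
import numpy as np

def sample_aggregate_observations(pi, A, B, T, M, rng=None):
    """Sample M individual HMM trajectories and keep only the aggregate
    counts y[t, o] = number of individuals emitting symbol o at time t."""
    rng = np.random.default_rng() if rng is None else rng
    d, s = B.shape
    y = np.zeros((T, s), dtype=int)
    for _ in range(M):                      # individuals evolve independently
        x = rng.choice(d, p=pi)             # initial hidden state
        for t in range(T):
            o = rng.choice(s, p=B[x])       # noisy observation of the hidden state
            y[t, o] += 1                    # only the count is recorded
            if t < T - 1:
                x = rng.choice(d, p=A[x])   # hidden-state transition
    return y

# Example: 3 hidden states, 4 observation symbols, 100 individuals, 5 time steps.
rng = np.random.default_rng(0)
pi = np.full(3, 1.0 / 3.0)
A = rng.dirichlet(np.ones(3), size=3)       # rows: p(x_{t+1} | x_t)
B = rng.dirichlet(np.ones(4), size=3)       # rows: p(o_t | x_t)
y = sample_aggregate_observations(pi, A, B, T=5, M=100, rng=rng)
print(y)   # individual trajectories are discarded; only these histograms remain
```

The resulting array y is the kind of data considered next: the individual trajectories are discarded, and only the histograms of observations at each time step are retained.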
The aggregate data is generated by M independent individuals following a common HMM. The individuals are aggregated in the sense that they are indistinguishable from one another. Let X_t^{(m)} denote the (unobservable) state of the m-th individual at time t, with corresponding observation O_t^{(m)}; the available data are the aggregate observations y_t, recording the number of individuals producing each observation value at time step t, as depicted in Figure 2. Given these aggregate observations, the goal of inference in aggregate HMMs is to estimate the latent distributions n = {n_t, n_t^o, n_{t,t}, n_{t,t+1}} of the hidden states, the observations, and their pairwise combinations. Exact inference is proved to be computationally infeasible [29] for problems with large T and M. It is proposed in [30] that this aggregate inference can be approximately achieved by solving a free energy minimization problem. Moreover, the approximation error vanishes as the population size M goes to infinity. The integral constraints on n can be relaxed [30] without affecting the precision much. With this relaxation, the latent distributions n = {n_t, n_t^o, n_{t,t}, n_{t,t+1}} satisfy the local polytope constraints given in (3). Over the local polytope described in (3), the inference objective is the free energy F(n, θ) defined in (4); it is in fact equal to the Kullback-Leibler divergence between the inferred distribution and the prior distribution over the space of trajectories [30]. The aggregate inference problem is equivalent [30] to the following convex optimization problem (Problem 1): minimize the free energy F(n, θ) over the local polytope (3), with the observation marginals fixed by the observed aggregate histograms y. Thanks to large deviation theory [33, 30], the conditional distribution p(n | y, θ) of n given the observation y approximately concentrates on the solution to Problem 1. The very same approximation is the foundation of the Schrödinger bridge problem [6, 7], which has been explored extensively in stochastic control.

In [30], we proposed the SBP algorithm for solving aggregate inference problems over more general tree-structured CGMs. The SBP algorithm has convergence guarantees with a linear rate [30]. There exist some other algorithms for aggregate inference problems in CGMs, including approximate MAP [28], NLBP [31], and Bethe-RDA [34]. One major difference between SBP and these methods is the observation model. In [28, 31, 34], the noise is added to the aggregate observation y_t directly, meaning the actually observed histogram is a version of y_t perturbed by some random noise. In contrast, in Problem 1, we assume that the observation noise enters the system at the individual level and that the measurement of the histogram is precise. This observation model has the nice property that when M = 1, the problem reduces to a standard inference problem for PGMs or HMMs. We refer the reader to [30] for more details on the comparison of the observation models.

The collective forward-backward (CFB) algorithm is a special case of the general SBP algorithm when the underlying graphical model is an HMM [30]. It is a message-passing type algorithm, similar to BP, consisting of four types of messages. Figure 3 depicts the messages employed by the CFB algorithm, with α_t(x_t) being the messages in the forward direction and β_t(x_t) being the messages in the backward direction. Moreover, γ_t(x_t) denotes the messages from the observation node to the hidden node, and ξ_t(o_t) denotes the messages from the hidden nodes to the observation nodes. These messages are characterized by a set of fixed-point equations with boundary conditions α_1(x_1) = π(x_1) and β_T(x_T) ≡ 1; the sequence of update steps is listed in Algorithm 1. Once the algorithm converges, which is guaranteed, the latent marginals can be estimated from the fixed-point messages, with n_t(x_t) ∝ α_t(x_t) β_t(x_t) γ_t(x_t). Clearly, when the population size is 1, i.e., M = 1, the CFB algorithm reduces to the standard forward-backward algorithm for the inference of HMMs [30, 10].
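As a rough illustration of how such Sinkhorn-style message passing can be organized, the following sketch alternates forward, backward, and observation-rescaling sweeps. The specific update formulas, normalizations, and scheduling are our assumptions for illustration only and are not a transcription of Algorithm 1 in [30].

```python
import numpy as np

def cfb_marginals(pi, A, B, y, n_iters=200):
    """Sinkhorn-style message passing for aggregate HMM inference (sketch).

    pi : (d,)   initial hidden-state distribution
    A  : (d, d) transition matrix, A[i, j] = p(x_{t+1}=j | x_t=i)
    B  : (d, s) emission matrix,   B[i, k] = p(o_t=k | x_t=i)
    y  : (T, s) aggregate observation histograms, rows normalized to sum to 1
    Returns approximate hidden-state marginals n[t, i] ~ n_t(x_t = i)."""
    T, _ = y.shape
    d = pi.shape[0]
    alpha = np.tile(pi, (T, 1))   # alpha[0] stays at pi (boundary condition)
    beta = np.ones((T, d))        # beta[T-1] stays at 1 (boundary condition)
    gamma = np.ones((T, d))
    for _ in range(n_iters):
        # forward sweep (assumed form): alpha_t ∝ A^T (alpha_{t-1} * gamma_{t-1})
        for t in range(1, T):
            a = A.T @ (alpha[t - 1] * gamma[t - 1])
            alpha[t] = a / a.sum()
        # backward sweep (assumed form): beta_t ∝ A (beta_{t+1} * gamma_{t+1})
        for t in range(T - 2, -1, -1):
            b = A @ (beta[t + 1] * gamma[t + 1])
            beta[t] = b / b.sum()
        # observation sweep: xi_t rescales toward the observed histogram y_t,
        # and gamma_t carries the correction back to the hidden chain
        for t in range(T):
            pred = B.T @ (alpha[t] * beta[t])        # predicted observation marginal
            xi = y[t] / np.maximum(pred, 1e-300)
            g = B @ xi
            gamma[t] = g / g.sum()
    n = alpha * beta * gamma
    return n / n.sum(axis=1, keepdims=True)
```

In this sketch, when M = 1 and each y_t is a one-hot histogram at the recorded symbol ô_t, the rescaling step makes γ_t(x_t) proportional to p(ô_t | x_t), so the recursion collapses to the standard forward-backward algorithm, consistent with the reduction noted above.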
The learning problem in CGMs is concerned with estimating the individual model parameters of the underlying graphical model from aggregate observations. For learning the parameters of a latent variable model, the EM algorithm [8] is a standard approach. The EM algorithm consists of two operations: the E-step, which computes the expected complete-data log-likelihood given the current estimate of the parameters, and the M-step, which maximizes it over the parameters. The challenge in applying the EM algorithm to learning CGMs lies in the fact that the E-step requires inferring the conditional distribution of n given the observation y, which is intractable [28, 30]. In this section, we propose the approximate EM algorithm (Algorithm 2) for learning HMMs from observations in aggregate form. The key idea is to use the tractable CFB algorithm to approximately infer the aggregate distributions n. Algorithm 2 iterates, for ℓ = 1, 2, ...: E-step: obtain the solution n* to Problem 1 using CFB with parameters θ^{ℓ-1}; M-step: θ^ℓ = argmin_θ F(n*, θ). Note that the SBP algorithm can be used for learning more general CGMs.

The approximate EM algorithm converges. Proof. The E-step and M-step in Algorithm 2 are coordinate descent updates of the free energy F(n, θ) with respect to n and θ, and thus the objective function is monotonically decreasing. Moreover, since the free energy F in (4) is equal to the Kullback-Leibler divergence between the inferred distribution and the prior distribution over the space of trajectories [30], it is bounded below by 0. Thus, in view of the fact that F is continuously differentiable, the approximate EM algorithm converges to a local minimum. We next argue that the log-likelihood L(θ) := log p(y | θ) approximately monotonically increases. The improvement of L at the ℓ-th iteration is L(θ^ℓ) − L(θ^{ℓ-1}). Since p(n | y, θ^{ℓ-1}) approximately concentrates on n*, this improvement is approximately log p(y, n* | θ^ℓ) − log p(y, n* | θ^{ℓ-1}). Again, due to large deviation theory [30], p(y, n | θ) ≈ exp[−M F(n, θ)], and thus the improvement is approximately M [F(n*, θ^{ℓ-1}) − F(n*, θ^ℓ)] ≥ 0. The approximate monotonicity of the likelihood then follows from the definition of the M-step in Algorithm 2.

Thanks to the special structure of HMMs, the M-step can be implemented efficiently in closed form; the solution of the M-step for aggregate HMMs is given by the update equations (7a)-(7c). Proof. See Appendix A.

Algorithm 2 is for learning from a single sequence of aggregate data generated from a certain number of samples. Learning from multiple sequences of observations was initially explored in the Baum-Welch algorithm [27]. Building on the same idea, we extend our method to the setting (Algorithm 3) with an ensemble of K aggregate observation sequences generated from the same HMM. Note that each aggregate observation is based on the collective information of M individuals, so K such aggregate observations in fact correspond to N = MK individuals. Denote the ensemble of aggregate observations by {y^k}_{k=1}^{K}. In the E-step, we find the solution n^k to Problem 1 for each of these observations; in the M-step, one solves min_θ ∑_{k=1}^{K} F(n^k, θ), whose solution is given in (8). Algorithm 3 (learning HMMs with an ensemble of aggregate observations) repeats the following until convergence: compute n^k by solving Problem 1 with measurement y^k using CFB for all k = 1, ..., K, and then update the parameters using (8). When the aggregate data consist of a single individual (M = 1), this learning framework reduces to the standard Baum-Welch algorithm. Proof. See Appendix B.
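For reference, a Baum-Welch-style form of these closed-form updates, written in our notation under the assumption that the aggregate marginals n simply play the role of the expected counts in the classical algorithm (the authoritative expressions are those referenced as (7) and (8)):

```latex
% Schematic single-sequence M-step; our reconstruction, not a quotation of (7).
\pi(x_1) = n_1(x_1), \qquad
p(x' \mid x) = \frac{\sum_{t=1}^{T-1} n_{t,t+1}(x, x')}{\sum_{t=1}^{T-1} n_t(x)}, \qquad
p(o \mid x) = \frac{\sum_{t=1}^{T} n_{t,t}(x, o)}{\sum_{t=1}^{T} n_t(x)}.
```

For the ensemble case, the same expressions would carry an additional sum over k = 1, ..., K of the marginals n^k (with the initial distribution averaged over the K sequences); at M = 1 these formulas coincide with the familiar Baum-Welch re-estimation equations.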
Next, we turn our attention to the parameter learning problem for HMMs with a continuous observation space O = R^s (the state space X is still discrete). Such an HMM with continuous observations is similar to the discrete HMM except that it has a continuous emission density. The continuous observation model in standard HMMs has been studied in [12, 13]. In this section, we extend our learning algorithm to aggregate HMMs with continuous emission densities.

Recently, the inference problem in aggregate HMMs with continuous emission densities has been studied in [38]. It was shown that the latent marginals can be estimated from the messages as in (9) (Corollary 2 of [38]), where α_t(x_t), β_t(x_t), and γ_t(x_t) are the messages in aggregate HMMs as depicted in Figure 3; they correspond to the fixed point of message updates analogous to those of the discrete case. The inference estimates given by (9) are applicable to aggregate HMMs with any general continuous emission density. Next, we derive the formulas for parameter estimation of the underlying continuous-observation HMM with a Gaussian emission density. Assuming a Gaussian noise model, the emission density takes the form p(o_t | x_t) = N(o_t; µ(x_t), Σ(x_t)) as in (11), i.e., each (discrete) hidden state corresponds to a single Gaussian density parameterized by the mean µ(x_t) and variance Σ(x_t). In such a model, an observation o_t^{(m)} corresponding to the m-th individual at time t is simply a sample from one of the Gaussian components. The learning of aggregate HMMs with a Gaussian emission density can be achieved using the approximate EM algorithm with slight modifications in the two steps. The E-step is an inference step using (9). The M-step has a closed-form expression given by the following proposition (Proposition 3): the M-step updates for the aggregate HMM with Gaussian emission density take the form given in (12), where the prime denotes matrix transpose. Proof. See Appendix C. Based on Proposition 3, the parameters of a Gaussian-HMM are estimated using Algorithm 4, which repeats the following until convergence: compute n_{t,t+1}(x_t, x_{t+1}), n_t(x_t), and n_t^{(m)}(x_t) using CFB, and then update the parameters using (12). Note that in this aggregate Gaussian-HMM setting, the estimation updates for the initial distribution π(x_1) and the transition probabilities p(x_{t+1} | x_t) are the same as in Algorithm 2. The convergence of Algorithm 4 follows from the same arguments as in the proof of Theorem 2. Remark 3. Similar to the discrete case, one can extend Algorithm 4 to the setting with an ensemble of continuous aggregate observations.

To illustrate the efficacy of the proposed aggregate learning algorithms, we perform multiple sets of experiments on synthetic data as well as a real-world dataset of population flow over a geographical area. In this section, we consider synthetic data for evaluating our learning algorithms. We perform multiple sets of experiments comparing the performance of fitted time-invariant HMM models with discrete as well as continuous observations. The initial state probability π(x_1) is sampled from the uniform distribution over the probability simplex. To produce the transition matrix, we first randomly permute the rows of the noised identity matrix I + 0.05 × √d × exp(Uniform[−1, 1]); we then scale the rows of the permuted matrix so that the resulting matrix is a valid conditional distribution. For the discrete observation setting, the emission matrix is generated in a similar way to the transition matrix, but with a different random seed. In the case of HMMs with continuous observations, we consider the Gaussian emission model: for each state, the corresponding Gaussian distribution is parameterized by a random mean and variance, with the mean sampled from Uniform[−5d, 5d] and the variance from Uniform[1, 5]. In the continuous observation setting, the algorithm is required to estimate the initial distribution, the transition matrix, and the means of the Gaussian emission densities. We generate N individual trajectories from the HMM parameterized by θ* and aggregate them. Each aggregate sequence consists of collective observations of M independent trajectories of length T.
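A minimal sketch of the synthetic parameter generation just described (our code; the elementwise noise, the permutation step, and the choice s = d for the emission matrix are our reading of the construction):

```python
import numpy as np

def random_transition_matrix(d, rng):
    """Noised, row-permuted, row-normalized identity matrix as described above."""
    noise = 0.05 * np.sqrt(d) * np.exp(rng.uniform(-1.0, 1.0, size=(d, d)))
    A = np.eye(d) + noise                       # noised identity
    A = A[rng.permutation(d), :]                # randomly permute the rows
    return A / A.sum(axis=1, keepdims=True)     # each row is a valid distribution

rng = np.random.default_rng(0)
d = 5
pi_star = rng.dirichlet(np.ones(d))             # uniform over the probability simplex
A_star = random_transition_matrix(d, rng)       # transition matrix
B_star = random_transition_matrix(d, rng)       # discrete emission matrix (here s = d)
mu_star = rng.uniform(-5 * d, 5 * d, size=d)    # Gaussian means (continuous case)
var_star = rng.uniform(1.0, 5.0, size=d)        # Gaussian variances (continuous case)
```

These sampled quantities then play the role of the ground-truth parameters θ* from which the N training trajectories are generated.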
The HMM parameters are learned from K = N/M aggregate sequences. For testing purposes, we generate another set of N individual trajectories. We use the negative log-likelihood (NLL) as the metric for evaluating the performance of our learning algorithm; specifically, we report the difference of NLLs between the learned model θ and the ground truth θ*, ∆NLL = NLL(θ) − NLL(θ*). The HMM with the learned parameters is evaluated on a test dataset with the same total number of trajectories N as in the training data.

Figure 4 shows the performance of our algorithm for different values of the state dimension d and population size M on HMMs with discrete observations. Curves in the same figure show the learning performance for different values of M but with fixed values of d, T, and N. It can be observed that the best performance is achieved in the case of no aggregation (M = 1), and as the aggregate size M increases, ∆NLL also increases. Similar observations can be made for the Gaussian observation model, as depicted in Figure 5. This shows that our algorithm can effectively learn the generative models. A larger aggregate size corresponds to a lower convergence rate, as expected; with a larger aggregate size, more information about the individuals is lost. To further demonstrate the scalability of our algorithm, we conduct experiments with various HMM lengths and sample sizes, as depicted in Figure 6 and Figure 7, respectively. In Figure 6, the curves in different colors depict the learning performance for different HMM lengths T. We observe that larger T leads to better performance; this is because larger T is associated with more training data. Moreover, one can also observe that as the aggregate size M increases, the performance degrades, as expected. Figure 7 demonstrates the effect of varying the data size N on HMM learning. It can be observed that with more data available, the performance of our algorithm improves. With data sizes smaller than N = 500, overfitting occurs; even though the algorithm converges on the training data, the ∆NLL evaluated on the test data tends to increase.

Next, we compare our algorithm with the learning frameworks involving NLBP [31], Bethe-RDA [34], and Prox [30] in Figure 8. Since those algorithms assume different observation models, for a fair comparison we only learn the initial distribution and the transition matrix, with the observation models given. For NLBP, Bethe-RDA, and Prox, we choose the explicit aggregate noise model following independent Poisson distributions for each aggregate state (see [31] for more details on this aggregate noise model). We conduct experiments on a discrete HMM with T = 5, N = 5000, and M = 10. The comparison of learning performance for different values of d is depicted in Figure 8. One can clearly observe from Figure 8 that our learning framework based on the CFB algorithm converges faster and performs better than the existing aggregate learning approaches.

We now turn to real-world aggregated data of population flow within the Japanese city of Tokyo. We assume that the observations are corrupted by Gaussian noise. Additionally, with a small probability, a point in the center block can be incorrectly assigned to one of the eight neighbouring blocks, which accounts for sensor inaccuracy (Figure 9). We estimate the HMM parameters directly from the aggregated data. We consider the parameters estimated with M = 1 as the ground truth, while assuming that the observation noise model is known.
Figure 10 depicts the comparison between our estimate and the ground-truth movement at four timestamps. The red arrows in the figure implicitly represent the underlying transition probabilities multiplied by the total population N = 49538. One can observe that our algorithm successfully recovers the underlying movement of the population from noisy aggregate observations.

In this paper, we proposed an algorithm for learning the parameters of a time-homogeneous HMM from aggregate data. Our algorithm is based on a modified version of the EM algorithm, wherein we utilized the Sinkhorn belief propagation algorithm to infer the unobservable states. In contrast to the existing state-of-the-art algorithms that explicitly model the aggregate observation noise, our algorithm incorporates the observation noise within the graphical model, and consequently it is consistent with the standard Baum-Welch algorithm when the aggregate data consist of only a single individual. Moreover, our algorithm enjoys convergence guarantees. We further extended our algorithm to incorporate continuous observations and presented estimates for the Gaussian observation model. In this work, we have assumed

Appendix A. The M-step minimizes F(n*, θ) over θ subject to the normalization constraints (A.1b)-(A.1d), where θ = {π(x_1), p(x_{t+1} | x_t), p(o_t | x_t)} and F(n, θ) is as in (4). Let the Lagrange multipliers corresponding to the constraints (A.1b), (A.1c), and (A.1d) be λ, ν, and µ, respectively, and form the corresponding Lagrangian. Setting the derivatives of the Lagrangian with respect to the variables to zero and solving the resulting equations, in view of the constraints (A.1b)-(A.1d), we obtain π(x_1) = n_1(x_1), (A.2a) together with the corresponding updates for p(x_{t+1} | x_t) and p(o_t | x_t).

Appendix B. When M = 1, the aggregate observation at time t reduces to y_t(o_t) = δ(o_t − ô_t), where δ(·) denotes the Dirac function and ô_t is the recorded observation. Then the messages in the collective forward-backward algorithm coincide with the messages in the standard forward-backward algorithm [30] and take the following form (with α_t(x_t) satisfying the analogous forward recursion (B.2a)):

β_t(x_t) = ∑_{x_{t+1}} p(x_{t+1} | x_t) β_{t+1}(x_{t+1}) p(ô_{t+1} | x_{t+1}), (B.2b)
γ_t(x_t) = p(ô_t | x_t). (B.2c)

Using the above messages, the required marginals can be estimated as in (B.3); in particular,

n_{t,t+1}(x_t, x_{t+1}) ∝ α_t(x_t) p(x_{t+1} | x_t) β_{t+1}(x_{t+1}) p(ô_t | x_t) p(ô_{t+1} | x_{t+1}), (B.3b)
n_{t,t}(x_t, ô_t) = n_t(x_t). (B.3c)

Finally, the parameter update equations given in Algorithm 2 reduce to the standard Baum-Welch algorithm.

Appendix C. The M-step minimizes F(n*, θ) over θ subject to the normalization constraints (C.1b)-(C.1c), where θ = {π(x_1), p(x_{t+1} | x_t), µ(x_t), Σ(x_t)}. The free energy F is basically the same as in (4); the only difference is that p(o_t | x_t) is a Gaussian parametrized as in (11). Let the Lagrange multipliers corresponding to the constraints (C.1b) and (C.1c) be λ and ν, respectively, and form the Lagrangian as in Appendix A. Setting the derivatives of the Lagrangian with respect to {π(x_1), p(x_{t+1} | x_t)} to zero, we obtain

∂L/∂π(x_1) = n_1(x_1)/π(x_1) − λ = 0,

together with the analogous condition for p(x_{t+1} | x_t).
Solving the above equations, we arrive at the updates (C.2a)-(C.2b) for π(x_1) and p(x_{t+1} | x_t). To find the Gaussian density parameter µ(x_t), we differentiate the objective F(n, θ) with respect to µ(x_t) and equate it to zero, which yields the update for µ(x_t) in (12). Similarly, the update (12d) can be obtained by setting ∂F/∂Σ(x_t) to zero.

References
[1] An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology.
[2] A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics.
[3] Consistently estimating Markov chains with noisy aggregate data.
[4] Pattern recognition and machine learning.
[5] Inference in hidden Markov models.
[6] On the relation between optimal transport and Schrödinger bridges: A stochastic control viewpoint.
[7] Stochastic control liaisons: Richard Sinkhorn meets Gaspard Monge on a Schrödinger bridge.
[8] Maximum likelihood from incomplete data via the EM algorithm.
[9] Graph-coupled HMMs for modeling the spread of infection.
[10] Multi-marginal optimal transport and probabilistic graphical models.
[11] Neural collective graphical models for estimating spatio-temporal population flow from aggregated data.
[12] Maximum-likelihood estimation for mixture multivariate stochastic observations of Markov chains.
[13] Maximum likelihood estimation for multivariate mixture observations of Markov chains (Corresp.).
[14] Estimation in Markov models from aggregate data.
[15] A solution to the ecological inference problem: Reconstructing individual behavior from aggregate data.
[16] Probabilistic graphical models: principles and techniques.
[17] Estimating the parameters of the Markov probability model from aggregate time series data.
[18] Learning mixtures of Markov chains from aggregate data with structural constraints.
[19] Recurrent estimation of hidden Markov model transition probabilities from aggregate data.
[20] Learning stochastic behaviour of aggregate data.
[21] Estimation of time-varying Markov processes with aggregate data.
[22] Machine learning: a probabilistic perspective.
[23] A view of the EM algorithm that justifies incremental, sparse, and other variants.
[24] Estimating discrete Markov models from various incomplete data schemes.
[25] Probabilistic reasoning in intelligent systems: Networks of plausible inference.
[26] An introduction to hidden Markov models.
[27] A tutorial on hidden Markov models and selected applications in speech recognition.
[28] Approximate inference in collective graphical models.
[29] Collective graphical models.
[30] Inference with aggregate data: An optimal transport approach.
[31] Message passing for collective graphical models.
[32] Some results about decomposable (or Markov-type) models for multidimensional contingency tables: distribution of marginals and partitioning of tests.
[33] Large deviations and applications.
[34] Bethe projections for non-local inference.
[35] Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning.
[36] Learning deep hidden nonlinear dynamics from aggregate data.
[37] Sample-based population observers.
[38] Filtering for aggregate hidden Markov models with continuous observations.