Efficient Inference of Nonparametric Interaction in Spiking-neuron Networks
Feng Zhou, Yixuan Zhang, Jun Zhu (2020-06-23)

Hawkes processes provide an effective statistical framework for analyzing the time-dependent interaction of neuronal spiking activities. Although utilized in many real applications, the classical Hawkes process is incapable of modelling inhibitory interactions among neurons. The nonlinear Hawkes process, in contrast, allows for a more flexible influence pattern with excitatory or inhibitory interactions. In this paper, three sets of auxiliary latent variables (Pólya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented to make the synaptic connection weights appear in a Gaussian form, which allows for a simple iterative algorithm with analytical updates. As a result, an efficient expectation-maximization (EM) algorithm is derived to obtain the maximum a posteriori (MAP) estimate. We demonstrate the accuracy and efficiency of our algorithm on synthetic and real data. For real neural recordings, we show that our algorithm can estimate the temporal dynamics of interaction and reveal the interpretable synaptic structure underlying neural spike trains.

One of the most important tasks in neuroscience is to examine neuronal activity in the cerebral cortex under varying experimental conditions. Recordings of neuronal activity are represented as series of action potentials, or spike trains. The information transmitted between neurons and their synaptic connections are considered to be primarily represented by spike trains [1; 2; 3; 4]. A spike train is a sequence of recorded times at which a neuron fires an action potential, and each spike may be considered a timestamp. Spikes occur irregularly both within and across multiple trials, so it is reasonable to treat a spike train as a point process with the instantaneous firing rate being the intensity function of the point process [5; 6; 7]. An example of spike trains for multiple neurons is shown in our real-data experiment. Despite many existing applications, classical point process models, e.g. Poisson processes, neglect the time-dependent interaction within one neuron and between multiple neurons, and thus fail to capture the complex temporal dynamics of a neural population. In contrast, the Hawkes process is a type of point process able to model the self-exciting interaction between past and future events. Existing applications cover a wide range of domains, including seismology [8; 9], criminology [10; 11], financial engineering [12; 13] and epidemics [14; 15]. Unfortunately, due to its linearly additive intensity, the vanilla Hawkes process can only represent purely excitatory interaction, because inhibitory interaction could drive the firing rate negative. This makes the vanilla version inappropriate in the neuroscience domain, where the influence between neurons is a mixture of excitation and inhibition [16; 17]. To reconcile the Hawkes process with inhibition, various nonlinear Hawkes process variants have been proposed to allow for both excitatory and inhibitory interactions. The core of the nonlinear Hawkes process is a nonlinearity that maps the convolution of the spike train with a causal influence kernel to a nonnegative conditional intensity, such as the rectifier [18], exponential [19] and sigmoid [20; 21].
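To make the role of the nonlinearity concrete, the following minimal sketch (illustrative values only; `lam_bar` is a hypothetical sigmoid upper bound, not a quantity from the paper) maps a real-valued activation through the three cited link functions; all three yield nonnegative intensities, and only the sigmoid is bounded:

```python
import numpy as np

# A real-valued activation grid; lam_bar is a hypothetical sigmoid upper bound.
h = np.linspace(-3.0, 3.0, 7)
lam_bar = 2.0

rectifier = np.maximum(h, 0.0)           # rectifier link [18]: nonnegative, unbounded
exponential = np.exp(h)                  # exponential link [19]: positive, unbounded
sigmoid = lam_bar / (1.0 + np.exp(-h))   # scaled sigmoid link [20; 21]: bounded in (0, lam_bar)

print(np.column_stack([h, rectifier, exponential, sigmoid]).round(2))
```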
The sigmoid mapping function has the advantage that the Pólya-Gamma augmentation scheme can be utilized to convert the likelihood into a Gaussian form, which makes the inference tractable. In [20], a discrete-time model is proposed to convert the likelihood from a Poisson process to a Poisson distribution; Pólya-Gamma random variables are then augmented on the discrete observations to derive a Gibbs sampler. This method is extended to the continuous-time regime in [21] by augmenting thinned points and Pólya-Gamma random variables. However, the influence function there is limited to a purely excitatory or inhibitory exponential decay. Besides, due to the non-conjugacy of the excitation parameter of the exponential-decay influence function, a Metropolis-Hastings sampling step has to be embedded into the Gibbs sampler, which makes the Markov chain Monte Carlo (MCMC) algorithm even less efficient.

To address the parametric restrictions and inefficiency of the aforementioned works, we develop a sigmoid nonparametric nonlinear multivariate Hawkes processes (SNNMHP) model in the continuous-time regime, which (1) can represent the nonparametric excitation-inhibition-mixture temporal dynamics among the neural population, and (2) admits efficient conjugate inference. An EM inference algorithm is proposed to fit neural spike trains. Inspired by [22; 23], three auxiliary latent variable sets (Pólya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented to make the synaptic connection weights appear in a Gaussian form. As a result, the EM algorithm has analytical updates with drastically improved efficiency. As shown in the experiments, it is even more efficient than maximum likelihood estimation (MLE) for the parametric Hawkes process in high-dimensional cases.

Neurons communicate with each other by action potentials (spikes) and chemical neurotransmitters. A spike causes the pre-synaptic neuron to release a chemical neurotransmitter that induces impulse responses, either exciting or inhibiting the post-synaptic neuron from firing its own spikes. The sum of excitatory and inhibitory influences on a neuron determines whether a spike will occur. Meanwhile, the impulse response characterizes the temporal dynamics of the exciting or inhibiting influence, which can be complex and flexible [24; 25; 26]. The nonparametric nonlinear multivariate Hawkes process is therefore a suitable choice for representing the temporal dynamics of mutually excitatory or inhibitory interactions and the synaptic connectivity of neural networks.

The vanilla multivariate Hawkes processes [27] are sequences of timestamps $\{\{t_n^i\}_{n=1}^{N_i}\}_{i=1}^{M}$ on $[0, T]$, where $t_n^i$ is the timestamp of the $n$-th event on the $i$-th dimension, $N_i$ is the number of points on the $i$-th dimension, $M$ is the number of dimensions, and $T$ is the observation window. The $i$-th dimensional conditional intensity, i.e. the probability of an event occurring on the $i$-th dimension in an infinitesimal interval $[t, t + dt)$ given the history on all dimensions before $t$, is designed in a linear superposition form:
$$\lambda_i(t) = \mu_i + \sum_{j=1}^{M} \sum_{t_n^j < t} \phi_{ij}(t - t_n^j),$$
where $\mu_i > 0$ is the baseline rate of the $i$-th dimension and $\phi_{ij}(\cdot) \ge 0$ is the causal influence function (impulse response) from the $j$-th dimension to the $i$-th dimension, which is normally a parameterized function, e.g. an exponential decay. The summation explains the self- and mutual-excitation phenomenon, i.e. the occurrence of previous events increases the intensity of future events.
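As an illustration of this linear superposition, here is a small sketch with hypothetical exponential-decay kernels and made-up parameters (not the paper's settings):

```python
import numpy as np

# Hypothetical M=2 example of the linear Hawkes conditional intensity
# lambda_i(t) = mu_i + sum_j sum_{t_n^j < t} phi_ij(t - t_n^j),
# with exponential-decay kernels phi_ij(dt) = a[i, j] * exp(-dt).
mu = np.array([0.5, 0.3])
a = np.array([[0.8, 0.2],
              [0.1, 0.6]])                # a[i, j] >= 0: influence of j on i
events = [np.array([0.5, 1.7, 2.4]),      # spike times on dimension 0
          np.array([1.1, 3.0])]           # spike times on dimension 1

def intensity(i, t):
    lam = mu[i]
    for j, ts in enumerate(events):
        dts = t - ts[ts < t]              # only strictly past events contribute
        lam += np.sum(a[i, j] * np.exp(-dts))
    return lam

print([round(intensity(i, 3.5), 3) for i in range(2)])
```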
Unfortunately, one blemish is that the vanilla multivariate Hawkes processes allow only nonnegative (excitatory) influence functions, because negative (inhibitory) influence functions may yield a negative intensity, which is meaningless. To reconcile the vanilla version with inhibitory effects and nonparametric influence functions, we propose the SNNMHP. The $i$-th dimensional conditional intensity of SNNMHP is defined as
$$\lambda_i(t) = \lambda_i\, \sigma(h_i(t)), \qquad h_i(t) = \mu_i + \sum_{j=1}^{M} \sum_{t_n^j < t} \phi_{ij}(t - t_n^j),$$
where $\mu_i$ is the base activation of neuron $i$, $h_i(t)$ is a real-valued activation, and $\sigma(\cdot)$ is the logistic (sigmoid) function, which maps the activation to a value in $(0, 1)$, with $\lambda_i$ being an upper bound that scales the intensity to $(0, \lambda_i)$. The sigmoid function is chosen because, as seen later, the Pólya-Gamma augmentation scheme can be utilized to make the inference tractable. After incorporating the nonlinearity, it is straightforward to see that the influence functions $\phi_{ij}(\cdot)$ can be positive or negative. If $\phi_{ij}(\cdot)$ is negative, the superposition of $\phi_{ij}(\cdot)$ leads to a negative activation $h_i(t)$ that drives the intensity towards 0; conversely, the intensity tends to $\lambda_i$ with a positive $\phi_{ij}(\cdot)$.

To achieve a nonparametric impulse response, the influence function is assumed to be a weighted sum of basis functions:
$$\phi_{ij}(t) = \sum_{b=1}^{B} w_{ijb}\, \phi_b(t),$$
where $\{\phi_b\}_{b=1}^{B}$ are predefined basis functions and $w_{ijb}$ is the weight capturing the influence from the $j$-th dimension to the $i$-th dimension through the $b$-th basis function, with positive values indicating excitation and negative values indicating inhibition. A similar nonparametric scheme is used in [28]. The basis functions are nonnegative functions capturing the temporal dynamics of the interaction. Although the basis functions can be of any form, in order for the weights to represent synaptic connection strength, they are chosen to be probability densities with compact support, meaning they have bounded support $[0, T_\phi]$ and integrate to one. As a result, the $i$-th dimensional activation is
$$h_i(t) = \mu_i + \sum_{j=1}^{M} \sum_{b=1}^{B} w_{ijb}\, \Phi_{jb}(t),$$
where $\Phi_{jb}(t) = \sum_{t_n^j < t} \phi_b(t - t_n^j)$ is the convolution of the $j$-th dimensional observation with the $b$-th basis function and can be precomputed. A related construction appears in [20], where a binary variable is included to characterize the sparsity of the synaptic connections; as shown later, the sparsity in our model is instead guaranteed by a Laplace prior on the weights.

In this paper, the basis functions are scaled (shifted) beta densities, but alternatives such as Gaussian or gamma densities can also be used. We choose the beta distribution because the inference of the weights is subject to edge effects near the endpoints of $[0, T_\phi]$ when densities with infinite support are used. The weighted sum of beta densities is a natural choice: with appropriate mixing, it can approximate functions on bounded intervals arbitrarily well [29].

The likelihood of a point process model is provided in [30]. Correspondingly, the probability density (likelihood) of SNNMHP on the $i$-th dimension, as a function of the parameters in continuous time, is
$$p\!\left(\{t_n^i\}_{n=1}^{N_i} \mid \mathbf{w}_i, \lambda_i\right) = \prod_{n=1}^{N_i} \lambda_i\, \sigma(h_i(t_n^i))\, \exp\!\left(-\int_0^T \lambda_i\, \sigma(h_i(t))\, dt\right). \tag{5}$$
It is worth noting that $h_i(t)$ depends on $\mathbf{w}_i$ and the observations on all dimensions. Our goal is to infer the parameters, i.e. the weights and the intensity upper bounds, from observations, e.g. neural spike trains, over a time interval $[0, T]$. As shown in neuroscience [31; 32], the synaptic connectivity in cortical circuits is sparse. To encode sparsity, a factorizing Laplace prior is applied to the weights, which characterize the synaptic connections. With the likelihood Eq. (5) and the Laplace prior $p(\mathbf{w}_i \mid \alpha) \propto \prod_{j,b} e^{-\alpha |w_{ijb}|}$, the log-posterior corresponds to an L1-penalized log-likelihood.
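To make the objective concrete, the sketch below evaluates the $i$-th dimensional log-likelihood of Eq. (5) by simple quadrature and adds the L1 penalty from the Laplace prior. All parameters, basis shapes and spike times are hypothetical placeholders, and the trapezoidal grid is a crude stand-in for the Gaussian quadrature used later:

```python
import numpy as np
from scipy.stats import beta as beta_dist

# Hypothetical setup: M=2 neurons, B=2 beta basis densities rescaled to [0, T_phi].
M, B, T, T_phi, alpha = 2, 2, 50.0, 6.0, 0.05
mu = np.array([0.1, -0.2])                           # base activations (real-valued)
lam_bar = np.array([1.0, 1.2])                       # intensity upper bounds
w = np.array([[[0.9, 0.4], [-0.7, -0.3]],
              [[-0.5, -0.2], [0.8, 0.5]]])           # w[i, j, b], +/- = excite/inhibit
rng = np.random.default_rng(0)
events = [np.sort(rng.uniform(0, T, 40)), np.sort(rng.uniform(0, T, 30))]

def Phi(j, b, t):
    # Convolution of the j-th spike train with the b-th (compact-support) basis density.
    dts = t - events[j][events[j] < t]
    a, c = (2, 5) if b == 0 else (5, 2)              # illustrative shape pairs
    return np.sum(beta_dist.pdf(dts / T_phi, a, c) / T_phi)

def h(i, t):
    return mu[i] + sum(w[i, j, b] * Phi(j, b, t) for j in range(M) for b in range(B))

def log_posterior(i, n_grid=2000):
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    loglik = sum(np.log(lam_bar[i] * sig(h(i, t))) for t in events[i])
    grid = np.linspace(0.0, T, n_grid)               # quadrature for the integral term
    loglik -= np.trapz([lam_bar[i] * sig(h(i, t)) for t in grid], grid)
    return loglik - alpha * np.abs(w[i]).sum()       # L1 penalty from the Laplace prior

print(round(log_posterior(0), 3))
```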
The $i$-th dimensional MAP estimate can be expressed as
$$\{\mathbf{w}_i^*, \lambda_i^*\} = \operatorname*{arg\,max}_{\mathbf{w}_i,\, \lambda_i}\ \log p\!\left(\mathbf{w}_i, \lambda_i \mid \{t_n^i\}_{n=1}^{N_i}\right), \tag{6}$$
where $\mathbf{w}_i^*$ and $\lambda_i^*$ are the MAP estimates. The dependency of the log-posterior on the parameters is complex because the sigmoid function appears in the log-likelihood term and the absolute value function appears in the log-prior term. As a result, we have no closed-form solutions for the MAP estimates. Numerical optimization methods can be applied, but their efficiency is low due to the high dimensionality of the parameters, which is $(MB + 2) \times M$. To circumvent this issue, three sets of auxiliary latent variables (Pólya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented to make the weights appear in a Gaussian form in the posterior. As a result, an efficient EM algorithm with analytical updates is derived to obtain the MAP estimate.

Following [33], binomial likelihoods parametrized by log odds can be represented as mixtures of Gaussians w.r.t. a Pólya-Gamma distribution. Therefore, we can define a Gaussian representation of the sigmoid function:
$$\sigma(z) = \int_0^\infty e^{f(\omega, z)}\, p_{\mathrm{PG}}(\omega \mid 1, 0)\, d\omega, \tag{7}$$
where $f(\omega, z) = z/2 - z^2\omega/2 - \log 2$ and $p_{\mathrm{PG}}(\omega \mid 1, 0)$ is the Pólya-Gamma distribution with $\omega \in \mathbb{R}^+$. Substituting Eq. (7) into the likelihood Eq. (5), the product over observations $\sigma(h_i(t_n^i))$ is transformed into a Gaussian form.

Inspired by [23], a latent marked Poisson process is augmented to linearize the exponential integral term in the likelihood. Applying the property of the sigmoid function $\sigma(z) = 1 - \sigma(-z)$ and Eq. (7), the exponential integral term is transformed to
$$\exp\left(-\int_0^T \lambda_i\, \sigma(h_i(t))\, dt\right) = \exp\left(\int_0^T\!\!\int_0^\infty \left(e^{f(\omega, -h_i(t))} - 1\right) \lambda_i\, p_{\mathrm{PG}}(\omega \mid 1, 0)\, d\omega\, dt\right). \tag{8}$$
The right-hand side is a characteristic functional of a marked Poisson process. According to Campbell's theorem [34] (Appendix I), the exponential integral term can be rewritten as
$$\exp\left(-\int_0^T \lambda_i\, \sigma(h_i(t))\, dt\right) = \mathbb{E}_{p_{\lambda_i}}\left[\prod_{(t_k^i, \omega_k^i) \in \Pi_i} e^{f(\omega_k^i, -h_i(t_k^i))}\right], \tag{9}$$
where $\Pi_i = \{(t_k^i, \omega_k^i)\}_{k=1}^{K_i}$ denotes a realization of a marked Poisson process and $p_{\lambda_i}$ is the probability measure of the marked Poisson process $\Pi_i$ with intensity $\lambda_i(t, \omega) = \lambda_i\, p_{\mathrm{PG}}(\omega \mid 1, 0)$. The events $\{t_k^i\}_{k=1}^{K_i}$ follow a Poisson process with rate $\lambda_i$, and the latent Pólya-Gamma variable $\omega_k^i$ is an independent mark at each location $t_k^i$. After substituting Eq. (9) into the likelihood Eq. (5), the exponential integral term is also transformed into a Gaussian form.

The augmentation of the two sets of auxiliary latent variables above puts the augmented likelihood in a Gaussian form w.r.t. the weights. However, the absolute value in the exponent of the Laplace prior hampers the Gaussian form of the weights in the posterior. To circumvent this issue, we augment the third set of auxiliary latent variables: the sparsity variables. It has been proved that a Laplace distribution can be represented as an infinite mixture of Gaussians [22; 35]:
$$\frac{1}{2}\, e^{-|w_{ijb}|} = \int_0^\infty \mathcal{N}\!\left(w_{ijb} \mid 0, \beta_{ijb}^{-1}\right) p(\beta_{ijb})\, d\beta_{ijb}, \qquad p(\beta_{ijb}) = \frac{1}{2}\, \beta_{ijb}^{-2} \exp\!\left(-\frac{1}{2\beta_{ijb}}\right), \tag{10}$$
where the general scale $\alpha$ follows by rescaling $w_{ijb}$. It is straightforward to see that the weights appear in a Gaussian form in the prior after the augmentation of the latent sparsity variables $\beta$. After the augmentation of the three sets of latent variables, we obtain the augmented joint likelihood and prior (derivation in Appendix II):
$$p(\{t_n^i\}_{n=1}^{N_i}, \omega_i, \Pi_i \mid \mathbf{w}_i, \lambda_i) = \prod_{n=1}^{N_i} e^{f(\omega_n^i,\, h_i(t_n^i))}\, \lambda_i(t_n^i, \omega_n^i)\; p_{\lambda_i}(\Pi_i) \prod_{(t, \omega) \in \Pi_i} e^{f(\omega,\, -h_i(t))}, \tag{11a}$$
$$p(\mathbf{w}_i, \beta_i) = \prod_{j=1}^{M}\prod_{b=1}^{B} \mathcal{N}\!\left(w_{ijb} \mid 0, \beta_{ijb}^{-1}\right) p(\beta_{ijb}), \tag{11b}$$
where $\omega_i$ is the vector of $\omega_n^i$ at each $t_n^i$ and $\lambda_i(t_n^i, \omega_n^i) = \lambda_i\, p_{\mathrm{PG}}(\omega_n^i \mid 1, 0)$. The motivation for augmenting the auxiliary latent variables should now be clear: the augmented likelihood and prior contain the weights in a Gaussian form, which corresponds to a quadratic expression for the log-posterior (the L1-penalized log-likelihood). The original MAP estimate was expressed in Eq. (6). With the support of the auxiliary latent variables, we propose an analytical EM algorithm to obtain the MAP estimate instead of performing numerical optimization.
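Eq. (7) is easy to verify numerically. The sketch below samples PG(1, 0) through the standard infinite-convolution representation $\omega = \frac{1}{2\pi^2}\sum_{k \ge 1} g_k/(k - \tfrac12)^2$ with $g_k \sim \mathrm{Exp}(1)$, truncated at $K$ terms (the truncation level and sample size are our choices, not the paper's), and checks that the Monte Carlo average of $e^{f(\omega, z)}$ recovers $\sigma(z)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pg_1_0(n, K=200):
    # PG(1, 0) via its infinite-convolution representation, truncated at K terms:
    # omega = (1 / (2 pi^2)) * sum_k g_k / (k - 1/2)^2, with g_k ~ Exp(1).
    g = rng.exponential(size=(n, K))
    k = np.arange(1, K + 1)
    return (g / (k - 0.5) ** 2).sum(axis=1) / (2.0 * np.pi ** 2)

z = 1.7
omega = sample_pg_1_0(50_000)
f = z / 2.0 - z ** 2 * omega / 2.0 - np.log(2.0)   # f(omega, z) from Eq. (7)
print(np.exp(f).mean(), 1.0 / (1.0 + np.exp(-z)))  # both approximate sigma(z) ~ 0.8455
```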
In the standard EM algorithm framework, the lower bound (surrogate function) of the log-posterior can be represented as
$$Q(\mathbf{w}_i, \lambda_i) = \mathbb{E}\left[\log p(\{t_n^i\}_{n=1}^{N_i}, \omega_i, \Pi_i \mid \mathbf{w}_i, \lambda_i) + \log p(\mathbf{w}_i \mid \beta_i)\right] + \mathrm{const}, \tag{12}$$
where the expectation is taken w.r.t. the posteriors of the latent variables from the previous iteration: the posterior of $\omega_i$ is a tilted Pólya-Gamma distribution, $\Lambda_i(t, \omega)$ is the posterior intensity of $\Pi_i$, and the posterior of $\beta_i$ is the inverse Gaussian distribution $p_{\mathrm{IG}}$. It is worth noting that $h_i(t) = \mathbf{w}_i^\top \Phi(t)$, where $\mathbf{w}_i$ stacks $\mu_i$ and the weights $\{w_{ijb}\}$, and $\Phi(t)$ stacks a constant 1 and the convolutions $\{\Phi_{jb}(t)\}$. The updated parameters are obtained by maximizing the lower bound. Due to the augmentation of the auxiliary latent variables, the update has a closed-form solution: the weights $\mathbf{w}_i$ solve a Gaussian linear system whose precision matrix accumulates the Pólya-Gamma expectations over the observed points and over the posterior intensity $\Lambda_i(t, \omega)$, plus the diagonal matrix of expected sparsity variables from the prior, with $\mathrm{diag}(\cdot)$ indicating the diagonal matrix of a vector and $\delta(\cdot)$ being the Dirac delta function that places the Pólya-Gamma marks at the observed timestamps; $\lambda_i$ is updated from the expected total number of observed and latent points on $[0, T]$. It is worth noting that numerical quadrature methods, e.g. Gaussian quadrature, need to be applied to the intractable integrals above.

The complexity of our proposed EM algorithm is $O(\bar{N}\bar{N}_{T_\phi} M^2 B + M(MB + 1)^3)$, where $\bar{N}$ is the average number of observations per dimension and $\bar{N}_{T_\phi}$ is the average number of observations within a window of length $T_\phi$ per dimension. The first term is due to the convolutional nature of the Hawkes process and the second to the matrix inverse in the M step. For a given application, the number of dimensions $M$ and the number of basis functions $B$ are fixed and much smaller than $\bar{N}$; the complexity therefore simplifies to $O(\bar{N}\bar{N}_{T_\phi})$ with $\bar{N}_{T_\phi} \ll \bar{N}$. The hyperparameter $\alpha$ in the Laplace prior, which encodes the sparsity of the weights, and the parameters of the basis functions can be chosen by cross-validation or by maximizing the lower bound $Q$ using numerical methods. The number of basis functions trades accuracy against efficiency: a large number yields a more flexible functional space, while a small number results in faster inference. In the experiments, we gradually increase it until no further significant improvement is observed. Similarly, the numbers of quadrature nodes and EM iterations are gradually increased until suitable values are reached. The pseudocode is provided in Alg. 1: given spike trains and predefined basis functions $\{\phi_b(\cdot)\}_{b=1}^{B}$, it iterates the E and M steps and returns the fitted intensities $\{\lambda_i(t) = \lambda_i\, \sigma(\mathbf{w}_i^\top \Phi(t))\}_{i=1}^{M}$.

We validate the EM algorithm for SNNMHP on both synthetic data and real-world spike data collected from the cat primary visual cortex. For comparison, the following baselines are considered: (1) parametric linear multivariate Hawkes processes, i.e. vanilla multivariate Hawkes processes with exponential-decay influence functions, with inference performed by MLE; (2) nonparametric linear multivariate Hawkes processes with flexible influence functions, with inference by majorization-minimization Euler-Lagrange (MMEL) [28]; (3) parametric nonlinear multivariate Hawkes processes with exponential-decay influence functions, with inference by an MCMC algorithm based on augmentation and Poisson thinning (MCMC-Aug) [21].

We analyze spike trains obtained from the synthetic network model shown in Fig. 1a. The synthetic neural network contains 4 groups of neurons with a pair of neurons in each group. Within each group, the 2 neurons are self-exciting and mutually inhibitive, while the groups are independent of each other. We assume 4 scaled (shifted) beta distributions as basis functions with support $[0, T_\phi = 6]$, shown in Fig. 1b. We use the thinning algorithm [8] to generate two sets of synthetic spike data on the time window $[0, T = 1000]$, one as the training dataset and the other as the test dataset. Each dataset contains 8 sequences, and each sequence consists of 3340 events on average.
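For concreteness, a minimal sketch of such thinning-based simulation is given below. Because $\lambda_i(t) = \lambda_i\, \sigma(h_i(t)) \le \lambda_i$, candidates can be proposed from the constant upper bounds and accepted with probability $\sigma(h_i(t))$; the network parameters and basis shapes here are illustrative, not the paper's ground truth:

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(1)

# Hypothetical 2-neuron network with B=4 beta basis densities rescaled to [0, T_phi].
M, B, T, T_phi = 2, 4, 100.0, 6.0
mu = np.array([0.2, -0.1])                 # base activations (real-valued)
lam_bar = np.array([1.0, 1.5])             # intensity upper bounds
w = rng.normal(0.0, 0.5, size=(M, M, B))   # synaptic weights, +/- = excite/inhibit
shapes = [(2, 5), (3, 4), (4, 3), (5, 2)]  # illustrative beta shape pairs

def phi(i, j, dt):
    # Influence function: weighted sum of compact-support basis densities.
    return sum(w[i, j, b] * beta_dist.pdf(dt / T_phi, a, c) / T_phi
               for b, (a, c) in enumerate(shapes))

def h(i, t, events):
    # Activation: base rate plus causal influence of all accepted past spikes.
    s = mu[i]
    for j in range(M):
        for tn in events[j]:
            if 0.0 < t - tn <= T_phi:
                s += phi(i, j, t - tn)
    return s

def simulate():
    events, t, Lam = [[] for _ in range(M)], 0.0, lam_bar.sum()
    while True:
        t += rng.exponential(1.0 / Lam)    # next candidate from rate-Lam Poisson
        if t > T:
            return events
        i = rng.choice(M, p=lam_bar / Lam) # attribute candidate to a neuron
        if rng.random() < 1.0 / (1.0 + np.exp(-h(i, t, events))):
            events[i].append(t)            # accept w.p. lambda_i(t) / lam_bar_i

spikes = simulate()
print([len(s) for s in spikes])
```

Each accepted point immediately enters the history, so later candidates see the updated activation, which is what makes the procedure exact for self- and mutually interacting processes.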
We aim to identify the synaptic connectivity (functional connectivity) of the neural population and the temporal dynamics of the influence functions from statistically dependent spike trains. More experimental details, e.g. hyperparameters, are given in Appendix IV. The temporal dynamics of interactions among the neural population are shown in Fig. 1c, where we plot the estimated influence functions of the 7th and 8th neurons (the other neurons are shown in Appendix IV). The estimated $\hat\phi_{77}$ and $\hat\phi_{88}$ exhibit the self-exciting relation, while $\hat\phi_{78}$ and $\hat\phi_{87}$ characterize the mutually inhibitive interactions. All estimated influence functions are flexible in form and close to the ground truth. Besides, as shown in Fig. 1d, the estimated functional connectivity is compared with the ground truth. The functional connectivity is defined as $\int |\hat\phi_{ij}(t)|\, dt$, so there is no connection only if neither excitation nor inhibition exists. We can see that the estimated functional connectivity successfully recovers the synaptic connection structure.

To verify the convergence and efficiency of our EM algorithm, we analyze the log-likelihood curve and the running time. The training and test log-likelihood curves w.r.t. EM iterations are shown in Fig. 1e: our EM algorithm converges fast, with only 50 iterations needed to reach a plateau. Moreover, we compare the running time of our method with the alternatives w.r.t. the average number of points per dimension in Fig. 1f, where the number of dimensions $M$ is fixed to 2, the number of basis functions $B$ to 4, the quadrature nodes to 200, the EM iterations to 200, the gradient descent steps to 200, the MMEL iterations to 200 and the MCMC iterations to 200. We observe that our EM algorithm is the most efficient, even faster than MLE for the classical parametric case. We also compare our model's fitting and prediction ability with the baseline models (parametric or linear) on the 1st and 2nd neurons. Training and test log-likelihood results are shown in Tab. 1, where our SNNMHP model with EM inference performs best owing to its superior expressiveness and generalization.

In this section, we analyze our model's performance on a multi-neuron spike train dataset. We aim to draw conclusions about the functional connectivity of cortical circuits and to infer the temporal dynamics of influence. Results: The estimated functional connectivity is shown in Fig. 2b, where we can see that the synaptic connection structure among the neural population is sparse. An interesting phenomenon is that the 11th neuron is very "outgoing": it has a large impact on other neurons, but the others have nearly no influence on it. We speculate that this is because most synapses of the 11th neuron are monodirectional. In Appendix IV, we plot the estimated influence functions of the 11th neuron. The training and test log-likelihood curves are shown in Fig. 2c, where both reach a plateau.

Based on the augmented likelihood and prior, we can obtain the conditional densities of the latent variables and parameters in closed form, which constitutes a Gibbs sampler with better efficiency than MCMC-Aug, since the time-consuming Metropolis-Hastings sampling step in MCMC-Aug is not needed. For the model in [21], a tighter intensity upper bound is used to reduce the number of thinned points and accelerate the sampler. Our EM algorithm does not encounter this problem, as we compute expectations rather than sampling.
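Returning to the connectivity measure $\int |\hat\phi_{ij}(t)|\, dt$ used in both experiments, a small sketch (with made-up estimated weights and illustrative basis shapes, not fitted values) shows how it can be tabulated from the weights:

```python
import numpy as np
from scipy.stats import beta as beta_dist

# Hypothetical estimated weights w_hat[i, j, b] for M=2 neurons and B=4
# beta basis densities rescaled to [0, T_phi]; all values are made up.
M, B, T_phi = 2, 4, 6.0
w_hat = np.array([[[0.8, 0.3, 0.0, 0.1], [-0.6, -0.2, 0.0, 0.0]],
                  [[-0.5, 0.0, -0.1, 0.0], [0.7, 0.4, 0.2, 0.0]]])
shapes = [(2, 5), (3, 4), (4, 3), (5, 2)]

ts = np.linspace(1e-6, T_phi, 2000)
basis = np.stack([beta_dist.pdf(ts / T_phi, a, c) / T_phi for a, c in shapes])

# phi_ij(t) = sum_b w_hat[i, j, b] * basis_b(t); connectivity = integral of |phi_ij|.
phi = np.einsum('ijb,bt->ijt', w_hat, basis)
C = np.trapz(np.abs(phi), ts, axis=-1)
print(np.round(C, 3))  # C[i, j] ~ 0 indicates no synapse from neuron j to neuron i
```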
Moreover, [21] uses only one basis function, which limits the influence function to a purely excitatory or inhibitory exponential decay; we instead utilize multiple basis functions to characterize influence functions that mix excitation and inhibition.

In this paper, we develop the SNNMHP model in the continuous-time regime, which can characterize excitation-inhibition-mixture temporal dependencies among a neural population. Three sets of auxiliary latent variables are augmented to give the corresponding EM algorithm closed-form updates and thereby improve efficiency. The synthetic and real-data experimental results confirm that our model's accuracy and efficiency are superior to the state of the art. From the application perspective, although our model is proposed in the neuroscience domain, it can be applied to other settings where inhibition is a vital factor; e.g. in the coronavirus (COVID-19) spread, the inhibitive effect may represent medical treatment or cure, or government-enforced isolation. From the inference perspective, our EM algorithm is a point-estimation method; other efficient distribution-estimation methods can be developed, e.g. the Gibbs sampler mentioned above or a mean-field variational inference method. Like regular latent variable models, our model is efficient and provides an inference algorithm that is easy to implement, which is particularly beneficial for big data in the point process community. However, it also inherits the drawback of latent variable models: the derivation procedure requires a deep understanding of probabilistic graphical models and Bayesian inference, which hampers its broad application by non-experts in machine learning.

Appendix I (Campbell's theorem). Let $\Pi_{\hat{Z}} = \{(z_n, \omega_n)\}_{n=1}^{N}$ be a marked Poisson process on the product space $\hat{Z} = Z \times \Omega$ with intensity $\Lambda(z, \omega) = \Lambda(z)\, p(\omega \mid z)$, where $\Lambda(z)$ is the intensity of the unmarked Poisson process $\{z_n\}_{n=1}^{N}$ and $\omega_n \sim p(\omega_n \mid z_n)$ is an independent mark drawn at each $z_n$. Furthermore, define a function $h(z, \omega): Z \times \Omega \to \mathbb{R}$ and the sum $H(\Pi_{\hat{Z}}) = \sum_{(z, \omega) \in \Pi_{\hat{Z}}} h(z, \omega)$. If $\int_{\hat{Z}} \Lambda(z, \omega)\, dz\, d\omega < \infty$, then
$$\mathbb{E}\left[e^{\xi H(\Pi_{\hat{Z}})}\right] = \exp\left(\int_{\hat{Z}} \left(e^{\xi h(z, \omega)} - 1\right) \Lambda(z, \omega)\, dz\, d\omega\right)$$
for any $\xi \in \mathbb{C}$. The above equation defines the characteristic functional of a marked Poisson process. Setting $\xi = 1$, $h(t, \omega) = f(\omega, -h_i(t))$ and $\Lambda(t, \omega) = \lambda_i\, p_{\mathrm{PG}}(\omega \mid 1, 0)$, and noting by Eq. (7) that the resulting exponent equals $\int_0^T \lambda_i (\sigma(-h_i(t)) - 1)\, dt = -\int_0^T \lambda_i\, \sigma(h_i(t))\, dt$, this proves Eq. (9) in the main paper. The mean is
$$\mathbb{E}\left[H(\Pi_{\hat{Z}})\right] = \int_{\hat{Z}} h(z, \omega)\, \Lambda(z, \omega)\, dz\, d\omega, \tag{13}$$
which is used when substituting Eq. (13) into Eq. (12).

Appendix II (augmented likelihood and prior). Substituting Eq. (7) and Eq. (9) into Eq. (5) of the main paper, the augmented likelihood is obtained:
$$p(\{t_n^i\}_{n=1}^{N_i}, \omega_i, \Pi_i \mid \mathbf{w}_i, \lambda_i) = \prod_{n=1}^{N_i} e^{f(\omega_n^i,\, h_i(t_n^i))}\, \lambda_i(t_n^i, \omega_n^i)\; p_{\lambda_i}(\Pi_i) \prod_{(t, \omega) \in \Pi_i} e^{f(\omega,\, -h_i(t))},$$
where $\omega_i$ is the vector of $\omega_n^i$ and $\lambda_i(t_n^i, \omega_n^i) = \lambda_i\, p_{\mathrm{PG}}(\omega_n^i \mid 1, 0)$. It is straightforward to see that marginalizing out the auxiliary variables recovers Eq. (5); the expression above is Eq. (11a). Similarly, the integrand in Eq. (10) is just the augmented prior in Eq. (11b).

Appendix III (posteriors of the latent variables). In the standard EM algorithm framework, the lower bound of the log-posterior has been provided in Eq. (12). The posteriors of the latent variables can be derived from the joint distribution in Eq. (11). The derivation is relatively easy for $\omega_i$ and $\beta_i$, while that for $\Pi_i$ is more difficult. In the following, $s - 1$ and $s$ denote the last and current iterations of the EM algorithm. The posterior of $\omega_i$ utilizes the tilted Pólya-Gamma density $p_{\mathrm{PG}}(\omega \mid b, c) \propto e^{-c^2 \omega / 2}\, p_{\mathrm{PG}}(\omega \mid b, 0)$ [33] with $c = h_i^{s-1}(t_n^i)$; the posterior of the sparsity variables $\beta_i$ is an inverse Gaussian distribution that depends on the weights $\mathbf{w}_i^{s-1}$. It is worth noting that numerical quadrature methods need to be applied to the intractable integrals above.

Appendix IV (experimental details). In this appendix, we elaborate on some experimental details. For the synthetic data, since we already know the ground truth, the basis functions are chosen as the ground truth: $\phi_{\{1,2,3,4\}} = \mathrm{Beta}(\alpha = 50, \beta = 50, \mathrm{scale} = 6, \mathrm{shift} = \{-2, -1, 1, 0\})$. By cross-validation, the hyperparameter $\alpha$ is chosen to be 0.05.
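A sketch of how this ground-truth basis can be tabulated, assuming the scale/shift notation corresponds to scipy's loc/scale convention with support [shift, shift + scale] (an assumption on our part):

```python
import numpy as np
from scipy.stats import beta

# Ground-truth synthetic basis: Beta(50, 50) densities with scale 6 and shifts
# {-2, -1, 1, 0}; we assume scipy's loc/scale convention, support [loc, loc + scale].
T_phi, shifts = 6.0, [-2.0, -1.0, 1.0, 0.0]
ts = np.linspace(0.0, T_phi, 601)
basis = np.stack([beta.pdf(ts, 50, 50, loc=s, scale=6.0) for s in shifts])
print(basis.shape, basis.argmax(axis=1) * T_phi / 600)  # four bumps peaking at shift + 3
```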
We plot the estimated influence functions from the 1st to the 6th neuron. The running-time experiment and the fitting-and-prediction experiment are both conducted with 2 neurons, because the baseline models cannot finish within 2 days on 8 neurons due to the curse of dimensionality. All hyperparameters in the real-data experiments are chosen by cross-validation. The number of basis functions is chosen to be 7, $\phi_{\{1,\dots,7\}} = \mathrm{Beta}(\alpha = 50, \beta = 50, \mathrm{scale} = 10, \mathrm{shift} = \{-3, -2, -1, 0, 1, 2, 3\})$, and the hyperparameter $\alpha$ is optimized to be 100. The 11th neuron is "outgoing": it has a large impact on other neurons, but the others have nearly no influence on it. To visualize this, we plot the estimated influence functions from the 11th neuron to the others and from the other neurons to the 11th one.

Acknowledgments. The neural data was recorded by Tim Blanche in the laboratory of Nicholas Swindale, University of British Columbia, and downloaded from the NSF-funded CRCNS Data Sharing website.

References
Analysis of neural data.
A spike-train probability model.
Multiple neural spike train data analysis: state-of-the-art and future challenges.
The time-rescaling theorem and its application to neural spike train data analysis.
Neuronal spike trains and stochastic point processes: II. Simultaneous spike trains.
Maximum likelihood estimation of cascade point-process neural encoding models.
Dynamic analysis of neural encoding by point process adaptive filtering.
Space-time point-process models for earthquake occurrences.
Seismicity patterns, their statistical significance and physical meaning.
Self-exciting point process modeling of crime.
Self-exciting point process models of civilian deaths in Iraq.
Hawkes processes in finance.
Apparent criticality and calibration issues in the Hawkes self-excited point process model: application to high-frequency financial data.
Generating functions and stability study of multivariate self-excited epidemic processes.
SIR-Hawkes: linking epidemic models and Hawkes processes to model diffusions in finite populations.
Selective reconfiguration of layer 4 visual cortical circuitry by visual deprivation.
Inhibitory connectivity defines the realm of excitatory plasticity.
Inference of functional connectivity in neurosciences via Hawkes processes.
On the stability and dynamics of stochastic spiking neuron models: Nonlinear Hawkes process and point process GLMs.
Bayesian methods for discovering structure in neural spike trains.
Mutually regressive point processes.
Inverse Ising problem in continuous time: A latent variable approach.
Efficient Bayesian inference of sigmoidal Gaussian Cox processes.
Fundamental neuroscience.
Network neuroscience.
Spectra of some self-exciting and mutually exciting point processes.
Learning triggering kernels for multi-dimensional Hawkes processes.
Dirichlet process mixtures of Beta distributions, with applications to density and intensity estimation.
An introduction to the theory of point processes, Vol. I: Probability and its applications.
Interlaminar connections in the neocortex.
Rate, timing, and cooperativity jointly determine cortical synaptic plasticity.
Bayesian inference for logistic models using Pólya-Gamma latent variables.
Poisson processes.
On the noise model of support vector machines regression.