title: Neural Spectral Marked Point Processes
authors: Zhu, Shixiang; Wang, Haoyun; Dong, Zheng; Cheng, Xiuyuan; Xie, Yao
date: 2021-06-20

Self- and mutually-exciting point processes are popular models in machine learning and statistics for dependent discrete event data. To date, most existing models assume stationary kernels (including the classical Hawkes processes) and simple parametric forms. Modern applications with complex event data require more general point process models that can incorporate contextual information about the events, called marks, besides the temporal and location information. Moreover, such applications often require non-stationary models to capture more complex spatio-temporal dependence. To tackle these challenges, a key question is to devise a versatile influence kernel in the point process model. In this paper, we introduce a novel and general neural network-based non-stationary influence kernel with high expressiveness for handling complex discrete event data while providing theoretical performance guarantees. We demonstrate the superior performance of our proposed method compared with the state-of-the-art on synthetic and real data.

Event sequence data are ubiquitous in our daily life, ranging from traffic incidents, 911 calls, and social media posts to earthquake catalog data and COVID-19 data (see, e.g., Bertozzi et al. (2020)). Such data consist of a sequence of events indicating when and where each event occurred, together with additional descriptive information (called marks) about the event (such as category, volume, or free text). The distribution of events is of scientific and practical interest, both for prediction purposes and for inferring the events' underlying generative mechanism. A popular framework for modeling events is point processes (Daley & Vere-Jones, 2008), which can be continuous over time and the space of marks. An important aspect of such models is capturing an event's triggering effect on subsequent events. Since the distribution of a point process is completely specified by the conditional intensity function (the occurrence rate of events conditioned on the history), this triggering effect is captured by an influence kernel function embedded in the conditional intensity. In the statistical literature, the kernel function usually assumes a parametric form. For example, the original work by Hawkes (Hawkes, 1971) considers an exponentially decaying influence function over time, and the seminal work (Ogata, 1998) introduces the epidemic-type aftershock sequence (ETAS) model, which considers an influence function that decays exponentially over space and time.

With the increasing complexity of modern applications, there has been much recent effort in developing recurrent neural network (RNN)-based point processes, leveraging the rich representation power of RNNs (Du et al., 2016; Mei & Eisner, 2017; Xiao et al., 2017b). However, there are several limitations of existing RNN-based models. First, such models typically do not consider the kernel function (Du et al., 2016; Li et al., 2018; Mei & Eisner, 2017; Upadhyay et al., 2018; Xiao et al., 2017a;b); thus, the RNN approach does not enjoy the interpretability of kernel-function-based models.
Second, popular RNN models such as the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) still implicitly discount the influence of events over time due to their recursive structure (Vaswani et al., 2017; Zhu et al., 2021d). Such an assumption may not hold in many real-world applications. Take the earthquake catalog as an example, which is a typical type of discrete event data: most aftershocks occur along the fault plane or other faults within the volume affected by the mainshock's strain (Zhu et al., 2021b). This means that different regions may be correlated with their surrounding areas differently according to their geological structure, which creates a complex non-stationary spatial profile that we would like to capture through the model. Third, a majority of the existing works focus on one-dimensional temporal point processes. Although there are works on marked point processes (Du et al., 2016; Mei & Eisner, 2017; Reinhart, 2018), they are primarily based on the simplifying assumption that the marks are conditionally independent of the event's time and location, which is equivalent to assuming the kernel is separable; these assumptions may fail to capture the complex non-stationary, time- and location-dependent triggering effects observed in many real-world applications (see, e.g., Bertozzi et al. (2020)).

Contribution. In this paper, we present a novel general non-stationary point process model based on neural networks, referred to as the neural spectral marked point process (NSMPP). The key component is a new powerful representation of the kernel function using neural networks, which enables us to go beyond stationarity (and thus beyond standard Hawkes processes) and has the capacity to model high-dimensional marks. Figure 1 gives an example of a non-stationary influence kernel that measures the influence of past events on a future time t. The premise of the model design is that the conditional intensity function uniquely specifies the distribution of the point process, and the most important component in the intensity function is the influence kernel. In summary, the novelty of our approach includes the following:
• The kernel function is represented by a spectral decomposition of the influence kernel with a finite-rank truncation in practice. Such a kernel representation enables us to capture the most general non-stationary processes as well as high-dimensional marks. The model also allows the distribution of marks to depend on time, which is drastically different from the separable kernels considered in the existing literature (Reinhart, 2018).
• The spectral decomposition of the asymmetric influence kernel consists of a sum of products of feature maps, which can be parameterized by neural networks. This enables us to harvest the powerful expressiveness of neural networks and their scalability to high-dimensional inputs for complicated tasks involving discrete event data.
• We establish theoretical guarantees of the maximum likelihood estimate for the true kernel function based on functional variational analysis and finite-dimensional asymptotic analysis, which shed light on the theoretical understanding of neural network-based kernel functions.
• Using synthetic and real data (seismic and police data), we demonstrate the superior performance of our proposed method in complex situations; the performance gain is particularly outstanding for cases involving non-stationary point processes.
Related work. Seminal works in point process modeling (Ogata, 1988; 1998) assume parametric forms of the intensity functions. Such methods enjoy good interpretability and are efficient to estimate. However, classical parametric models are not expressive enough to capture the events' dynamics in modern applications. Recent research aims to improve the expressive power of point process models, where a recurrent neural network (RNN)-based structure is introduced to represent the conditional intensity function (Du et al., 2016; Mei & Eisner, 2017; Xiao et al., 2017b). However, most of these works either explicitly or implicitly specify the inter-event dependence in a limited form with restrained representational power. For example, Du et al. (2016) expresses the influence of two consecutive events in the form exp{w(t_{i+1} − t_i)}, an exponential function of the length of the time interval t_{i+1} − t_i with weight w; Mei & Eisner (2017) enhances the expressiveness of the model and represents the entire history using the hidden state of an LSTM, which still implicitly assumes that the influence of the history decays over time due to the recurrent structure of the LSTM. Another line of research uses neural networks to directly model the dependence of sequential events without specifying the conditional intensity function explicitly (Li et al., 2018; Xiao et al., 2017a). Some studies consider non-stationary influence kernels using neural networks in spatio-temporal point processes (Zhu et al., 2021b;a). Recent work (Omi et al., 2019) also uses a neural network to parameterize the hazard function, the derivative of which gives the conditional intensity function. However, the above approaches either capture only the temporal dependence or assume the temporal and mark dependence are separable, rather than jointly accounting for the marked-temporal dependence.

Recently, attention models have become popular in computer vision and sequential data modeling (Britz et al., 2017; Luong et al., 2015; Vaswani et al., 2017). This motivates works including Zhang et al. (2019), Zhu et al. (2021c), and Zuo et al. (2020) to model the conditional intensity of point processes using the attention mechanism and characterize the inter-event dependence by a score function. The attention mechanism has proven to be more flexible in capturing long-range dependence regardless of how far apart two events are separated and greatly enhances the performance in practice. However, the main limitation of Zhang et al. (2019) and Zuo et al. (2020) is that they rely on a conventional score function, the dot-product between linear mappings of events, which is still limited in representing non-linear dependence between events for some applications. Zhu et al. (2019; 2021c;d) used a more flexible and general Fourier kernel as a substitute for the dot-product score; however, the expressive power of the proposed Fourier kernel is still limited, and the spectrum of the Fourier basis is represented by a generative neural network, which is difficult to learn in some cases (Arjovsky & Bottou, 2017). There are also works considering point processes with non-stationary intensities. Chen & Hall (2013) proposed a time-varying background intensity for point processes, while we focus on a non-stationary triggering kernel depicting complex event dependency. Remes et al. (2017) studied non-stationary kernels combined with Gaussian processes, assuming specific structures of the kernels in the Fourier domain.
Such kernels are more restricted than ours since the nature of Gaussian processes requires that the kernel be positive semi-definite.

Marked temporal point processes (MTPPs) (Reinhart, 2018) consist of a sequence of events over time. Each event is associated with a (possibly multi-dimensional) mark that contains detailed information about the event, such as location, nodal information (if the observations are over networks, such as sensor or social networks), categorical data, and contextual information (such as tokens, images, and text descriptions). Let T > 0 be a fixed time horizon, and let M ⊆ R^d be the space of marks. We denote the space of observations as X = [0, T) × M and a data point in the discrete event sequence as x = (t, m) ∈ X, where t is the event time and m represents the mark. Let N_t be the number of events up to time t < T (which is random), and let H_t := {x_1, x_2, . . . , x_{N_t}} denote the historical events. Let N be the counting measure on X, i.e., for any measurable S ⊆ X, N(S) = |H_T ∩ S|. For any function f : X → R, the integral with respect to the counting measure is defined as

∫_X f(x) dN(x) = Σ_{x_i ∈ H_T} f(x_i). (1)

The events' distribution in MTPPs can be characterized via the conditional intensity function λ(x), which is defined to be the conditional probability of observing an event in the marked temporal space X given the events' history H_{t(x)}, that is,

λ(x) dx = E[N(dx) | H_{t(x)}]. (2)

Above, t(x) extracts the occurrence time of event x, and we omit the dependence on H_{t(x)} in the notation of λ(x) for simplicity. As self- and mutually-exciting point processes, Hawkes processes (Hawkes, 1971) have been widely used to capture the mutual excitation dynamics among temporal events. The model assumes that influences from past events are linearly additive towards the current event. The conditional intensity function for a self-exciting point process takes the form

λ[k](x) = µ + Σ_{x' ∈ H_{t(x)}} k(x', x), (3)

where µ > 0 stands for the background intensity, and the so-called "influence kernel" k : X × X → R is crucial in capturing the influence of past events on the likelihood of an event occurring at the current time. Here we use the notation [k] to stress the dependence of the conditional intensity function on the kernel function k(x', x). Written in the form of an integral over the counting measure, we have

λ[k](x) = µ + ∫_{X_{t(x)}} k(x', x) dN(x'), (4)

where X_t is the subset of X with the first component smaller than t. The most commonly made assumption in the literature is that the process is stationary, where the influence of past events is shift-invariant, such that k(x', x) = k(x − x'); a classical one-dimensional example is the exponentially decaying kernel k(t', t) = α exp(−β(t − t')), where β controls the decay rate and α > 0 controls the magnitude of the influence of an event. The current work aims at going beyond stationary point processes, which enables us to better capture the heterogeneity in the events' influence across the spatio-temporal space, which naturally arises in many applications. We propose to represent more general non-stationary influence kernels in the conditional intensity function λ(x) specified in (4).

Kernel representation. The main idea of the proposed model is to represent the influence kernel k using a general finite-rank decomposition

k(x', x) = Σ_{r=1}^{R} ν_r ψ_r(x') φ_r(x), (5)

where ψ_r : X → R, φ_r : X → R, r = 1, . . . , R, are two sets of feature functions in some smooth functional space F ⊂ C^0(X), and ν_r is the corresponding weight, or "spectrum". This representation is motivated by the spectral decomposition of a general kernel function. While a functional spectral decomposition is usually infinite-dimensional, for practical considerations we truncate the "spectrum" and only consider a finite-rank representation.
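To make (4) and (5) concrete, here is a minimal NumPy sketch of evaluating a finite-rank kernel and the corresponding conditional intensity. The particular feature functions, spectrum values, and event history below are arbitrary illustrations chosen by us, not the learned quantities used in the paper.

```python
import numpy as np

R = 3                                   # rank of the truncated decomposition
nu = np.array([1.0, 0.5, 0.25])         # spectrum (weights) nu_r

def psi(x, r):                          # feature map psi_r evaluated at a past event x' = (t', m')
    return np.exp(-0.1 * (r + 1) * x[0]) * (1 + 0.1 * x[1])

def phi(x, r):                          # feature map phi_r evaluated at the current point x = (t, m)
    return np.sin(0.05 * (r + 1) * x[0]) ** 2 * (1 + 0.1 * x[1])

def kernel(x_past, x_now):
    """Finite-rank kernel (5): k(x', x) = sum_r nu_r * psi_r(x') * phi_r(x)."""
    return sum(nu[r] * psi(x_past, r) * phi(x_now, r) for r in range(R))

def intensity(x_now, history, mu=1.0):
    """Conditional intensity (4): lambda[k](x) = mu + sum of k(x_i, x) over past events x_i."""
    return mu + sum(kernel(x_i, x_now) for x_i in history if x_i[0] < x_now[0])

history = [np.array([1.2, 0.3]), np.array([2.5, 0.8])]   # events x_i = (t_i, m_i)
print(intensity(np.array([3.0, 0.5]), history))
```

In the learned model, the hand-written ψ_r and φ_r above are simply replaced by the neural feature maps described next.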
Note that while we view (5) as similar to a spectral decomposition, it can be better understood as feature maps in a kernel representation, and, in particular, we do not need the feature functions ψ_r and φ_r to be orthogonal. The decomposition (5) represents the kernel function using three parts: two sets of (normalized) feature functions and the energy spectrum; the spectrum ν_r plays the role of weights combining the feature maps. In the learning process, we can train the feature functions (typically neural networks) and the weights separately, since learning normalized feature maps tends to be more numerically stable. The proposed form of the kernel is not necessarily positive semi-definite, or even symmetric.

Figure 2: (a) The architecture of the proposed non-stationary neural spectral kernel. A hidden embedding first summarizes the data point; then the embedding is mapped to different features via multi-branch neural networks. The kernel value is calculated by summing up the products between two sets of learned features, {ψ_r} and {φ_r}, weighted by the spectrum {ν_r}. The spectrum, feature functions, and hidden embeddings are jointly learned from data. (b) The structure of the multi-branch neural network. There are two shared layers with n nodes per layer that generate the hidden embedding; d denotes the dimension of the input data point; p denotes the dimension of the middle layer that generates the feature.

Note that the spectral representation allows for a completely general kernel that can be used for non-stationary processes (since our influence function does not impose a shift-invariant structure). Moreover, the spectral representation allows us to go beyond the monotone decay or parametric forms commonly assumed in the prior literature.

Neural network feature function. As one of the most salient features of our method, the feature functions {ψ_r, φ_r} are represented using neural networks, leveraging their known universal approximation power. First, the input data point x ∈ X is projected to a hidden embedding space via a multi-layer shared network, aiming to extract key information from the input data. We adopt the Softplus non-linear activation in this feature extraction sub-network, while other options are possible. Next, the hidden embedding is mapped to the different features {φ_r(x)} through R branched sub-networks. To ensure the output feature is constrained to a bounded space, we choose a scaled sigmoid as the activation of the output layer, i.e., f(x) = s/(1 + exp(−x)), where s is a constant that rescales the output to a proper range (in our setting, we set s to 100). The overall architecture of our kernel formulation and the structure of the multi-branch neural network are summarized in Figure 2(a) and (b), respectively. Further specifications of the neural networks will be provided in the experiments in Section 4.
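A minimal PyTorch sketch of this architecture is given below, using the sizes quoted in the experiment configuration (shared network with hidden width n = 128 and embedding dimension p = 10, R = 5 branches, scaled sigmoid with s = 100). The class and variable names are ours, the exponential parameterization that keeps the spectrum non-negative is our own simplification, and the exact layer counts may differ from the released code; this is an illustration of the structure, not the authors' implementation.

```python
import torch
import torch.nn as nn

class NeuralSpectralKernel(nn.Module):
    """Sketch of k(x', x) = sum_r nu_r * psi_r(x') * phi_r(x) with neural feature maps."""
    def __init__(self, d, R=5, n=128, p=10, s=100.0):
        super().__init__()
        # Shared network: maps a data point x in R^d to a p-dimensional hidden embedding.
        self.shared = nn.Sequential(
            nn.Linear(d, n), nn.Softplus(),
            nn.Linear(n, n), nn.Softplus(),
            nn.Linear(n, p), nn.Softplus(),
        )
        # R branched sub-networks per feature set; the scaled sigmoid keeps outputs in (0, s).
        def branch():
            return nn.Sequential(nn.Linear(p, p), nn.Softplus(),
                                 nn.Linear(p, p), nn.Softplus(),
                                 nn.Linear(p, 1))
        self.psi_branches = nn.ModuleList([branch() for _ in range(R)])
        self.phi_branches = nn.ModuleList([branch() for _ in range(R)])
        self.log_nu = nn.Parameter(torch.zeros(R))   # spectrum, kept non-negative via exp
        self.s = s

    def _features(self, x, branches):
        h = self.shared(x)                                          # (batch, p) embedding
        outs = [self.s * torch.sigmoid(b(h)) for b in branches]     # each (batch, 1)
        return torch.cat(outs, dim=-1)                              # (batch, R)

    def forward(self, x_past, x_now):
        psi = self._features(x_past, self.psi_branches)             # psi_r(x'), (batch, R)
        phi = self._features(x_now, self.phi_branches)              # phi_r(x),  (batch, R)
        return (torch.exp(self.log_nu) * psi * phi).sum(-1)         # k(x', x),  (batch,)

kernel = NeuralSpectralKernel(d=2)
print(kernel(torch.rand(4, 2), torch.rand(4, 2)))
```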
An essential task in learning the neural point process model is to estimate the influence kernel function of the point process. The reason is two-fold. First, the kernel function is the most important component in representing the point process. Second, in practice, the influence kernel offers clear interpretations, such as "how events at a particular time and location will influence future events at a given time and location." Such interpretation is essential for prediction using event data, one of the main applications of point process models.

To estimate the influence kernel function for point process models, we consider a popular approach through maximum likelihood estimation (MLE). Formally, the optimal kernel can be found by solving the following optimization problem given M training event sequences observed over the time horizon [0, T):

max_{k ∈ K} ℓ(k), where ℓ(k) := (1/M) Σ_{j=1}^{M} ( ∫_X log λ_j[k](x) dN_j(x) − ∫_X λ_j[k](x) dx ), (6)

where λ_j and N_j denote the conditional intensity and counting measure associated with the j-th trajectory, respectively, and K ⊂ C^0(X × X) represents the family of regular kernel functions induced by the feature function family F and the finite-rank decomposition (5).

Learning algorithm. To solve the MLE problem (6) when the kernel function k is parameterized by neural networks, we use a stochastic gradient method, as summarized in Algorithm 1 (Appendix A). Note that to calculate the log-likelihood function in (6), we need to evaluate an integral (the second term), which does not have a closed-form expression. We approximate the integral numerically by Monte Carlo integration: drawing samples and taking the average, as described in Algorithm 2 (Appendix A).
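As a concrete illustration of the per-trajectory objective in (6) and the Monte Carlo integral of Algorithm 2, here is a minimal PyTorch sketch. It assumes a kernel module such as the one sketched above, a single trajectory stored as an (N, 2) tensor of (time, mark) pairs on a normalized space [0, T) × [0, M_range), and a constant background rate; these conventions and the helper names are ours.

```python
import torch

def log_likelihood(kernel, events, T=100.0, M_range=100.0, mu=1.0, n_mc=2000):
    """Per-trajectory log-likelihood, cf. (6): sum of log-intensities at the observed events
    minus a Monte Carlo estimate of the integral of the intensity over X."""

    def intensity(x):                                     # x: (batch, 2) query points
        lam = torch.full((x.shape[0],), mu)
        for x_i in events:                                # add k(x_i, x) for strictly past events
            mask = (x_i[0] < x[:, 0]).float()
            lam = lam + mask * kernel(x_i.expand(x.shape[0], -1), x)
        return lam

    term1 = torch.log(intensity(events)).sum()            # log-intensity at observed events

    u = torch.rand(n_mc, 2) * torch.tensor([T, M_range])  # uniform samples over X (Algorithm 2)
    term2 = T * M_range * intensity(u).mean()              # |X| times average sampled intensity

    return term1 - term2
```

For training, the negative of this quantity, averaged over a mini-batch of trajectories, serves as the loss.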
We first consider the MLE as a functional optimization problem, since when using neural networks to approximate the kernel function, we are interested in such functional approximation results. We show that the expected log-likelihood reaches its maximum at the true kernel with the second-order derivative bounded away from zero, and thus the true kernel function is identifiable by solving the MLE problem under some regularity conditions. Consider the log-likelihood of point processes over a family of kernel functions K̄ which contains K, the family induced by feature functions in F and the non-negative spectrum {ν_r}_{r=1}^{R}. Note that K̄ may go beyond the finite-rank decomposition in (5). Throughout the theoretical analysis we omit the spectrum, as it can be absorbed into the feature functions; the details are discussed in Remark 3.2.

Assumption 3.1. (A1) The kernel function family K̄ ⊂ C^0(X × X) is uniformly bounded, and the true kernel k* ∈ K̄; (A2) there exist positive constants c_1, c_2 such that for any k ∈ K̄, c_1 ≤ λ[k](x) ≤ c_2 almost surely along any event data trajectory.

Note that, apart from (A2), we only need the kernel functions to be measurable for the theoretical analysis, which is guaranteed by (A1). In practice, the proposed neural network parametrization leads to a continuous and smooth (low-rank) kernel function, which induces a non-singular λ. We have the following lemma, which shows that a local perturbation around the true kernel function leads to a decrease in the log-likelihood function value in expectation. Note that we assume k* lies in (or can be well approximated by) the function class K̄, thanks to the well-known universal approximation power of neural networks.

Lemma 3.1 (Local perturbation of the likelihood function around the true kernel function). Under Assumption 3.1, for any k̃ ∈ K̄ and δk = k̃ − k*, the difference ℓ(k̃) − ℓ(k*) admits the decomposition given in (7).

The implication of Lemma 3.1 is the following. In (7), the difference in the likelihood function caused by a small perturbation around the true kernel function is nicely decomposed into two terms: the first term in (7) is a martingale integral (since the conditional expectation of dN_j(x) − λ_j[k*](x) dx given the history is zero), and the second term is the integral of a quadratic term against the counting measure. The expectation of the first term is zero due to the martingale property. For the second term, the intensity is bounded away from zero per trajectory j by (A2), and thus a perturbation of the intensity function, which corresponds to the second (quadratic) term in (7), is reflected in a decrease of the log-likelihood. Furthermore, we have the identifiability of the kernel itself, as stated in the following theorem.

Theorem 3.2 (Kernel identifiability using maximum likelihood). Under Assumption 3.1, the true kernel function k* is locally identifiable in that k* is a local maximum solution of the maximum likelihood problem (6) in expectation.

Remark 3.1 (Function identification and neural network parametrization). Theorem 3.2 derives a result which holds for variational perturbations of the kernel function k in the possibly parametric family K induced by the feature function family F. When F is the class of functions that a neural network can represent, the perturbation δk is induced by a change of the network parameters. It is common that in neural network models parameter identification is difficult (e.g., due to the symmetry of permuting hidden neurons); however, neural network function identification may still hold, e.g., under the mean-field approximation (Mei et al., 2018). Thus the kernel function identification result in Theorem 3.2 is important when learning kernel functions with neural networks.

Remark 3.2 (Finite-rank kernel representation). When assuming the representation (5), we write the true kernel as k*(x', x) = Σ_{r=1}^{R} ν_r ψ_r(x') φ_r(x). For theoretical analysis purposes, without loss of generality, we can assume ν_r = 1, r = 1, . . . , R, since the spectrum can be absorbed into the feature functions. Consider a perturbed kernel k̃ whose feature functions are ψ̃_r = ψ_r + δψ_r and φ̃_r = φ_r + δφ_r, and which still satisfies Assumption 3.1. The kernel function variation δk in (8) can then be written as δk(x', x) = Σ_{r=1}^{R} (δψ_r(x') φ_r(x) + ψ_r(x') δφ_r(x) + δψ_r(x') δφ_r(x)). With a rank-R representation of the kernel and smooth ψ_r, φ_r represented by neural networks, we potentially prevent the overfitting that leads to singular kernels, which may be a problem for the over-complete kernel family K̄ in (A1). Later we show experimentally that the learned kernel is smooth and close to the true kernel in Figure 3 and Figure 4. We also study the MLE for parameterized kernels in Appendix D, where the target feature function belongs to a parametric function class with a finite number of parameters. This includes, for instance, spline functions, functions represented by a Fourier basis, and neural networks, which demonstrates that our proposed method acts as a fundamental framework for finite-rank and neural network kernel representations.

This section presents experimental results on both synthetic and real data and compares our method with several state-of-the-art approaches, including (i) the standard Hawkes process with an exponentially decaying kernel function (Hawkes) (Hawkes, 1971); (ii) recurrent marked temporal point processes (RMTPP) (Du et al., 2016); and (iii) the Neural Hawkes process (NH) (Mei & Eisner, 2017). See Appendix B for a detailed review of these existing methods. To perform the experiments, we use the following procedure: given training data, we first estimate the kernel function to obtain k̂(·, ·) by solving the maximum likelihood problem (6) using stochastic gradient descent, as described in Section 2.3. Note that according to (2), the conditional intensity function can be treated as a prediction of the chance of having an event at a given time t after observing the past events.
Thus, to evaluate the prediction performance on test data, given a test trajectory, we perform online prediction by feeding the past events (in the test trajectory) into the conditional intensity function according to (3), which gives the conditional probability of a future event given the past observations.

Performance metrics. We consider two performance metrics. (i) The performance on synthetic data is evaluated by the out-of-sample mean absolute error (MAE) of the conditional intensity function. For synthetic data, we know the true kernel, denoted as k*. Given a sequence of test data, let λ[k*] be the "true" conditional intensity function defined by (3) using the true kernel k*, and let λ[k̂] be the estimated conditional intensity function for the same sequence using the estimated kernel k̂. Thus, λ[k̂] can be viewed as a probabilistic prediction of the events (since it specifies the likelihood of an event happening given the test trajectory's history) when the kernel k̂ is estimated separately from training data. The out-of-sample MAE for one test trajectory is defined as

MAE(k̂) = (1/|X|) ∫_X | λ[k̂](x) − λ[k*](x) | dx.

Then we average this over all test trajectories to obtain our performance measure. (ii) For real data, since we do not know the true intensity function, we report the average out-of-sample predictive log-likelihood. Given a test trajectory (with information contained in the counting measure N(x)), the predictive log-likelihood for the trajectory using an estimated kernel k̂ is given as

ℓ(k̂) = ∫_X log λ[k̂](x) dN(x) − ∫_X λ[k̂](x) dx.

The average out-of-sample predictive log-likelihood is obtained by averaging over test trajectories. This has been widely adopted as a metric for real data (Mei & Eisner, 2017; Omi et al., 2019; Zhang et al., 2019; Zhu et al., 2021a;b): the higher the log-likelihood, the better the model predicts. See all experiment configurations in Appendix B.

Synthetic data. In the experiments, we assume the background rate is a constant (µ = 1) and focus on recovering both stationary and non-stationary kernel structures. We generate four one-dimensional and three two-dimensional synthetic data sets, which are simulated by a vanilla Hawkes process and by our model described in Section 2.2 with randomly initialized parameters. The simulated data are generated by the thinning algorithm described in Algorithm 3 (Appendix A), which is an accept-reject method for simulating a point process (Daley & Vere-Jones, 2008; Gabriel et al., 2013). For ease of comparison, we normalize the time and mark spaces for all data sets to the range from 0 to 100. Each data set is composed of 1,000 event sequences with an average length of 121. We fit the models using 80% of the synthetic data set and use the remaining 20% as the testing set.
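The thinning algorithm itself is deferred to Algorithm 3 in the appendix. For orientation, below is a generic Ogata-style thinning sketch for the special case of a purely temporal Hawkes process with an exponential kernel; the marked, neural-kernel version used to generate the data follows the same accept-reject logic but additionally samples marks and uses a model-specific intensity bound. Parameter values here are arbitrary.

```python
import numpy as np

def simulate_hawkes_thinning(mu, alpha, beta, T, seed=0):
    """Ogata-style thinning (accept-reject) for a temporal Hawkes process with kernel
    alpha * exp(-beta * (t - t_j)). Between events the intensity is non-increasing, so the
    intensity evaluated at the current time is a valid upper bound until the next event."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0

    def intensity(s):
        return mu + alpha * sum(np.exp(-beta * (s - tj)) for tj in events)

    while t < T:
        lam_bar = intensity(t)                        # upper bound on lambda over (t, next event]
        t += rng.exponential(1.0 / lam_bar)           # candidate event time
        if t >= T:
            break
        if rng.uniform() <= intensity(t) / lam_bar:   # accept with probability lambda(t) / lam_bar
            events.append(t)
    return events

print(len(simulate_hawkes_thinning(mu=1.0, alpha=0.8, beta=1.0, T=100.0)))
```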
Kernel recovery. One of the most important abilities of our approach is to recover the true kernel. We report the kernel recovery results for the one- and two-dimensional synthetic data sets in Figure 3 and Figure 4, respectively. In Figure 3, we observe that our proposed model can accurately recover the true kernel for one-dimensional data. In contrast, the vanilla Hawkes process with a stationary kernel fails to capture such non-stationarity. We also present results for two-dimensional data in Figure 4, where a continuous one-dimensional mark is introduced to the events. The first three columns show two-dimensional "slices" of the four-dimensional kernel evaluation, indicating that our model is also capable of recovering complex high-dimensional kernel structure.

Figure 3: Kernel recovery results on a one-dimensional synthetic data set. The first three panels show the true kernel that generates the data, the kernel learned by our model, and the kernel learned by a vanilla Hawkes process, respectively. The fourth panel shows the true and predicted conditional intensity functions of a test sequence.

We note that event points are sparsely scattered in the two-dimensional marked-temporal space, whereas the events' correlation is measured in a high-dimensional (in our example, four-dimensional) kernel space.

Intensity prediction. We evaluate the predictive accuracy of the conditional intensity λ[k̂](x) in (3) using the kernel k̂ estimated from the training data. Note that λ[k̂](x) is a probabilistic prediction since it can be viewed as the instantaneous probability of an event given the historical events. The last panel in Figure 3 shows the predicted one-dimensional conditional intensity functions given a sequence randomly selected from the testing set. We compare our predicted λ[k̂](x) with the one estimated by a vanilla Hawkes model with a stationary kernel. The result shows that the conditional intensity λ[k̂](x) suggested by our model closely matches the one using the true kernel, λ[k*](x). In contrast, the Hawkes model only provides limited flexibility for modeling the non-stationary process. Two panels in the last column of Figure 4 give another comparison between the true and predicted two-dimensional conditional intensities over the marked-temporal space. This result confirms that our model can accurately predict the two-dimensional conditional intensity function. To validate the robustness of our proposed method, we also test our model on another one-dimensional data set generated by a vanilla Hawkes process with a stationary parametric kernel, as shown in Figure 5. Though our model is evidently overparameterized for these data, it is still able to recover the kernel structure and predict the conditional intensity accurately, which demonstrates the adaptiveness of our model and confirms that it can approximate a stationary kernel without incurring overfitting. More kernel recovery experiments can be found in Appendix C. Additionally, we compare our method with three other baselines on the synthetic data sets. Figure 6 shows the conditional intensities based on the true kernel, λ[k*](x) (solid black lines), and the predicted intensity using each method. We observe that our method obtains much better predictive performance than the other baseline approaches. Table 1 summarizes two performance metrics for these methods: the predictive log-likelihood and the MAE. The results show that our method greatly outperforms the other baseline approaches in both metrics. In particular, the MAE of our method is at least 90% lower than that of the other methods for both the one-dimensional and two-dimensional data sets.

Figure 6: Predicted conditional intensity using our method and other baselines for a sequence selected from the test data. We aggregate the conditional intensity over the mark space for ease of presentation and visualize the average conditional intensity over time for the two-dimensional synthetic data in the third panel.

Real-data results. Finally, we test our method on two large-scale real data sets. (1) Atlanta 911 calls-for-service data. The data set contains the 911 calls-for-service data in Atlanta from 2015 to 2017. We extract 7,831 reported robberies from the data set since robbers usually follow a particular modus operandi (M.O.).
Each robbery report is associated with a timestamp indicating when the robbery occurred. We consider each series of robberies as a sequence. (2) Northern California seismic data. The Northern California Earthquake Data Center (NCEDC) provides public time series data (Northern California Earthquake Data Center, UC Berkeley Seismological Laboratory, 2014) that come from broadband, short-period, strong-motion seismic sensors, GPS, and other geophysical sensors. We extract 16,401 seismic records with a magnitude larger than 3.0 from 1978 to 2018 in Northern California and partition the data into multiple sequences, one per quarter. We fit the models using 80% of the data set and use the remaining 20% as the testing set. In Table 1, we report the out-of-sample predictive log-likelihood of each method, since the ground truth is not available for real data. We can see that our model attains the highest predictive log-likelihood for both the synthetic and real data sets. In particular, the better performance on real data, with a significantly higher out-of-sample predictive log-likelihood than the other existing methods, suggests that the non-stationary kernel captures the nature of the real data better in these cases. This shows the merit of our approach for real data where we do not know the ground truth.

Training / testing time. In Table 1, we compare the running time of our model and the other baselines on the one-dimensional synthetic data set; the size and performance of each model are also presented in Table 1. The running time refers to wall-clock time. We observe that the training time of our model is similar to that of the other neural point process models (RMTPP and NH). The Hawkes process model has the smallest training time because it has only two parameters to estimate. The testing (inference) time of our model is significantly lower than that of the two recurrent neural point process models.

We have presented a new non-stationary marked point process model with a general kernel represented by neural network feature maps, motivated by the spectral decomposition of the kernel. The flexible kernel with the expressive power enabled by neural networks can capture complex dependence across temporal, spatial, and mark spaces, which can be valuable for real-data modeling in various applications. The benefits of our approach are demonstrated by the superior out-of-sample predictive log-likelihood on real data. The model can be learned efficiently with stochastic gradient descent. We also develop a theoretical guarantee for maximum likelihood identifiability. Generally, model architectures may well depend on specific tasks. For example, the neural networks can be based on CNNs if the high-dimensional marks are image-like, and on LSTMs or even BERT if the marks are text-based. Thus the choice of the deep model architecture and optimization algorithm can be more systematically explored, especially the comparison of non-stationary neural kernels to stationary ones. Understanding and characterizing what kernels can be learned through such an approach is left for future study. Also, when extending to high-dimensional mark spaces, more efficient algorithms are needed for the computation of the log-likelihood.
Algorithm 1: Stochastic gradient-based learning algorithm
Input: X = {x_j}, j = 1, . . . , N, the training set with N sequences; η, the number of learning iterations; γ, the learning rate; M, the size of a mini-batch.
Initialization: model parameters θ_0; l ← 0.
while l < η do
  Randomly draw M sequences from X, denoted X_l = {x_j}, j = 1, . . . , M;
  θ_{l+1} ← θ_l + γ ∂ℓ/∂θ_l evaluated on X_l, where ℓ is the log-likelihood in (6);
  l ← l + 1;
end

Algorithm 2: Monte Carlo estimation of the integral of the conditional intensity in (6)
Input: λ, the model to be evaluated; x_j, an event sequence, where N_j is the number of events in the sequence; N, the number of samples uniformly drawn from X.
Output: Λ, the integral of the conditional intensity over the data space.

Algorithm 3: Efficient thinning algorithm for simulating a point process
Input: θ, T, M.
Output: A set of events H_T ordered by time.

Now we describe the experiment configurations. We set the rank to R = 5. The shared network that summarizes the input data into a hidden embedding is a fully connected three-layer network, and each sub-network is a fully connected network with two hidden layers. The width of the hidden layers in the shared network is n = 128, and the width of the input layers of the sub-networks (i.e., the output layer of the shared network) is p = 10. We adopt the Softplus activation f(x) = log(1 + exp(x)) in each layer of the network. To learn the model's parameters, we use the Adam optimizer (Kingma & Ba, 2014) with a constant learning rate of 10^−2 and a batch size of 32. All experiments are performed on Google Colaboratory (Pro version) with 12 GB RAM and dual-core Intel processors clocked at up to 2.3 GHz (without GPU). Code to reproduce the experimental results is publicly available.

This study considers three baseline methods: (1) the standard Hawkes process with an exponentially decaying kernel function (Hawkes): it specifies the conditional intensity function as λ(t) = µ + α Σ_{t_j < t} exp(−β(t − t_j)).
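Putting Algorithm 1 and the configuration above together, a minimal training-loop sketch (building on the kernel and log-likelihood sketches shown earlier) might look as follows; the function name and the epoch-based batching are our illustrative choices rather than the released implementation.

```python
import torch

def train(train_set, d=2, R=5, epochs=10, batch_size=32, lr=1e-2):
    """Mini-batch stochastic gradient training in the spirit of Algorithm 1, using Adam with
    the learning rate and batch size quoted above. `train_set` is assumed to be a list of
    (N_j, 2) tensors of (time, mark) pairs; NeuralSpectralKernel and log_likelihood refer to
    the earlier sketches."""
    kernel = NeuralSpectralKernel(d=d, R=R)
    optimizer = torch.optim.Adam(kernel.parameters(), lr=lr)
    for epoch in range(epochs):
        perm = torch.randperm(len(train_set))                      # shuffle sequences each epoch
        for start in range(0, len(train_set), batch_size):
            batch = [train_set[int(i)] for i in perm[start:start + batch_size]]
            loss = -sum(log_likelihood(kernel, seq) for seq in batch) / len(batch)
            optimizer.zero_grad()
            loss.backward()                # gradient of the average negative log-likelihood
            optimizer.step()               # Adam update of the kernel parameters
    return kernel
```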