title: Neural Spectral Marked Point Processes
authors: Zhu, Shixiang; Wang, Haoyun; Dong, Zheng; Cheng, Xiuyuan; Xie, Yao
date: 2021-06-20

Self- and mutually-exciting point processes are popular models in machine learning and statistics for dependent discrete event data. To date, most existing models assume stationary kernels (including the classical Hawkes processes) and simple parametric forms. Modern applications with complex event data require more general point process models that can incorporate contextual information about the events, called marks, besides the temporal and location information. Moreover, such applications often require non-stationary models to capture more complex spatio-temporal dependence. To tackle these challenges, a key question is to devise a versatile influence kernel in the point process model. In this paper, we introduce a novel and general neural network-based non-stationary influence kernel with high expressiveness for handling complex discrete event data while providing theoretical performance guarantees. We demonstrate the superior performance of our proposed method compared with the state-of-the-art on synthetic and real data.

Event sequence data are ubiquitous in our daily life, ranging from traffic incidents, 911 calls, and social media posts to earthquake catalog data and COVID-19 data (see, e.g., Bertozzi et al. (2020)). Such data consist of a sequence of events indicating when and where each event occurred, together with additional descriptive information (called marks) about the event (such as category, volume, or free text). The distribution of events is of scientific and practical interest, both for prediction purposes and for inferring the events' underlying generative mechanism. A popular framework for modeling events is point processes (Daley & Vere-Jones, 2008), which can be continuous over time and the space of marks. An important aspect of such models is capturing an event's triggering effect on subsequent events. Since the distribution of a point process is completely specified by the conditional intensity function (the occurrence rate of events conditioned on the history), this triggering effect is captured by an influence kernel function embedded in the conditional intensity. In the statistical literature, the kernel function usually assumes a parametric form. For example, the original work by Hawkes (Hawkes, 1971) considers an exponentially decaying influence function over time, and the seminal work (Ogata, 1998) introduces the epidemic-type aftershock sequence (ETAS) model, which considers an influence function that decays exponentially over space and time.

With the increasing complexity of modern applications, there has been much recent effort in developing recurrent neural network (RNN)-based point processes, leveraging the rich representation power of RNNs (Du et al., 2016; Mei & Eisner, 2017; Xiao et al., 2017b). However, there are several limitations of existing RNN-based models. First, such models typically do not consider the kernel function (Du et al., 2016; Li et al., 2018; Mei & Eisner, 2017; Upadhyay et al., 2018; Xiao et al., 2017a;b); thus, the RNN approach does not enjoy the interpretability of kernel-function-based models.
Second, popular RNN models such as the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) still implicitly discount the influence of events over time due to their recursive structure (Vaswani et al., 2017; Zhu et al., 2021d). Such an assumption may not hold in many real-world applications. Take the earthquake catalog as an example, which is a typical type of discrete event data: most aftershocks occur along the fault plane or other faults within the volume affected by the mainshock's strain (Zhu et al., 2021b). This means that different regions may be correlated with their surrounding areas differently according to their geological structure, which creates a complex non-stationary spatial profile that we would like to capture through the model. Third, a majority of the existing works focus on one-dimensional temporal point processes. Although there are works on marked point processes (Du et al., 2016; Mei & Eisner, 2017; Reinhart, 2018), they are primarily based on the simplifying assumption that the marks are conditionally independent of the event's time and location, which is equivalent to assuming the kernel is separable; these assumptions may fail to capture the complex non-stationary, time- and location-dependent triggering effects observed in many real-world applications (see, e.g., Bertozzi et al. (2020)).

Contribution. In this paper, we present a novel general non-stationary point process model based on neural networks, referred to as the neural spectral marked point process (NSMPP). The key component is a new powerful representation of the kernel function using neural networks, which enables us to go beyond stationarity (and thus beyond standard Hawkes processes) and has the capacity to model high-dimensional marks. Figure 1 gives an example of a non-stationary influence kernel that measures the influence of past events on a future time t. The premise of the model design is that the conditional intensity function uniquely specifies the distribution of the point process, and the most important component in the intensity function is the influence kernel. In summary, the novelty of our approach includes the following:
• The kernel function is represented by a spectral decomposition of the influence kernel with a finite-rank truncation in practice. Such a kernel representation enables us to capture the most general non-stationary processes as well as high-dimensional marks. The model also allows the distribution of marks to depend on time, which is drastically different from the separable kernels considered in the existing literature (Reinhart, 2018).
• The spectral decomposition of the asymmetric influence kernel consists of a sum of products of feature maps, which can be parameterized by neural networks. This enables us to harvest the powerful expressiveness of neural networks and their scalability to high-dimensional inputs for complicated tasks involving discrete event data.
• We establish theoretical guarantees of the maximum likelihood estimate for the true kernel function based on functional variational analysis and finite-dimensional asymptotic analysis, which shed light on the theoretical understanding of neural network-based kernel functions.
• Using synthetic and real data (seismic and police data), we demonstrate the superior performance of our proposed method in complex situations; the performance gain is particularly outstanding for cases involving non-stationary point processes.
Related work. Seminal works in point process modeling (Ogata, 1988; 1998) assume parametric forms of the intensity functions. Such methods enjoy good interpretability and are efficient to estimate. However, classical parametric models are not expressive enough to capture the events' dynamics in modern applications. Recent research aims to improve the expressive power of point process models, where a recurrent neural network (RNN)-based structure is introduced to represent the conditional intensity function (Du et al., 2016; Mei & Eisner, 2017; Xiao et al., 2017b). However, most of these works either explicitly or implicitly specify the inter-event dependence in a limited form with restrained representational power. For example, Du et al. (2016) expresses the influence of two consecutive events in the form exp{w(t_{i+1} − t_i)}, an exponential function of the length of the time interval t_{i+1} − t_i with weight w; Mei & Eisner (2017) enhances the expressiveness of the model and represents the entire history using the hidden state of an LSTM, which still implicitly assumes that the influence of the history decays over time due to the recurrent structure of the LSTM. Another line of research uses neural networks to directly model the dependence of sequential events without specifying the conditional intensity function explicitly (Li et al., 2018; Xiao et al., 2017a). Some studies consider non-stationary influence kernels using neural networks in spatio-temporal point processes (Zhu et al., 2021b;a). Recent work (Omi et al., 2019) also uses a neural network to parameterize the hazard function, the derivative of which gives the conditional intensity function. However, the above approaches either capture only the temporal dependence or assume the temporal and mark dependence are separable, rather than jointly accounting for the marked-temporal dependence.

Recently, attention models have become popular in computer vision and sequential data modeling (Britz et al., 2017; Luong et al., 2015; Vaswani et al., 2017). This motivates works including Zhang et al. (2019), Zhu et al. (2021c), and Zuo et al. (2020) to model the conditional intensity of point processes using the attention mechanism and characterize the inter-event dependence by a score function. The attention mechanism has proven to be more flexible in capturing long-range dependence regardless of how far apart two events are separated and greatly enhances the performance in practice. However, the main limitation of Zhang et al. (2019) and Zuo et al. (2020) is that they rely on a conventional score function, the dot-product between linear mappings of events, which is still limited in representing non-linear dependence between events for some applications. Zhu et al. (2019; 2021c;d) used a more flexible and general Fourier kernel as a substitute for the dot-product score; however, the expressive power of the proposed Fourier kernel is still limited, and the spectrum of the Fourier basis is represented by a generative neural network, which is difficult to learn in some cases (Arjovsky & Bottou, 2017). There are also works considering point processes with non-stationary intensities. Chen & Hall (2013) proposed a time-varying background intensity for point processes, while we focus on a non-stationary triggering kernel depicting complex event dependency. Remes et al. (2017) studied non-stationary kernels combined with Gaussian processes, assuming specific structures of the kernels in the Fourier domain.
Such kernels are more restricted than ours since the nature of Gaussian processes requires that the kernel be positive semi-definite.

Marked temporal point processes (MTPPs) (Reinhart, 2018) consist of a sequence of events over time. Each event is associated with a (possibly multi-dimensional) mark that contains detailed information about the event, such as location, nodal information (if the observations are over networks, such as sensor or social networks), categorical data, and contextual information (such as tokens, images, and text descriptions). Let T > 0 be a fixed time horizon, and let M ⊆ R^d be the space of marks. We denote the space of observations as X = [0, T) × M and a data point in the discrete event sequence as x = (t, m) ∈ X, where t is the event time and m represents the mark. Let N_t be the number of events up to time t < T (which is random), and let H_t := {x_1, x_2, . . . , x_{N_t}} denote the historical events. Let N be the counting measure on X, i.e., for any measurable S ⊆ X, N(S) = |H_T ∩ S|. For any function f : X → R, the integral with respect to the counting measure is defined as

∫_X f(x) dN(x) = Σ_{x_i ∈ H_T} f(x_i). (1)

The events' distribution in MTPPs can be characterized via the conditional intensity function λ(x), which is defined to be the conditional probability of observing an event in the marked temporal space X given the events' history H_{t(x)}, that is,

λ(x) dx = E[N(dx) | H_{t(x)}]. (2)

Above, t(x) extracts the occurrence time of event x, and we omit the dependence on H_{t(x)} in the notation of λ(x) for simplicity. As self- and mutually-exciting point processes, Hawkes processes (Hawkes, 1971) have been widely used to capture the mutual excitation dynamics among temporal events. The model assumes that influences from past events are linearly additive towards the current event. The conditional intensity function for a self-exciting point process takes the form

λ[k](x) = µ + Σ_{x' ∈ H_{t(x)}} k(x', x), (3)

where µ > 0 stands for the background intensity, and the so-called "influence kernel" k : X × X → R is crucial in capturing the influence of past events on the likelihood of an event occurring at the current time. Here we use the notation [k] to stress the dependence of the conditional intensity function on the kernel function k(x', x). Written in the form of an integral over the counting measure, we have

λ[k](x) = µ + ∫_{X_{t(x)}} k(x', x) dN(x'), (4)

where X_t is the subset of X with the first component smaller than t. The most commonly made assumption in the literature is that the process is stationary, where the influence of past events is shift-invariant, such that k(x', x) = k(x − x'); a classical one-dimensional example is the exponentially decaying kernel k(t', t) = α exp(−β(t − t')), where β controls the decay rate and α > 0 controls the magnitude of the influence of an event. The current work aims at going beyond stationary point processes, which enables us to better capture the heterogeneity in the events' influence across the spatio-temporal space, which naturally arises in many applications. We propose to represent more general non-stationary influence kernels in the conditional intensity function λ(x) specified in (4).

Kernel representation. The main idea of the proposed model is to represent the influence kernel k using a general finite-rank decomposition

k(x', x) = Σ_{r=1}^{R} ν_r ψ_r(x') φ_r(x), (5)

where ψ_r : X → R, φ_r : X → R, r = 1, . . . , R, are two sets of feature functions in some smooth functional space F ⊂ C^0(X), and ν_r is the corresponding weight, or "spectrum". This representation is motivated by the spectral decomposition of a general kernel function. While a functional spectral decomposition is usually infinite-dimensional, for practical considerations we truncate the "spectrum" and only consider a finite-rank representation.
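To make (4) and (5) concrete, here is a minimal NumPy sketch of evaluating a finite-rank kernel and the corresponding conditional intensity. The particular feature functions, spectrum values, and event history below are arbitrary illustrations chosen by us, not the learned quantities used in the paper.

```python
import numpy as np

R = 3                                   # rank of the truncated decomposition
nu = np.array([1.0, 0.5, 0.25])         # spectrum (weights) nu_r

def psi(x, r):                          # feature map psi_r evaluated at a past event x' = (t', m')
    return np.exp(-0.1 * (r + 1) * x[0]) * (1 + 0.1 * x[1])

def phi(x, r):                          # feature map phi_r evaluated at the current point x = (t, m)
    return np.sin(0.05 * (r + 1) * x[0]) ** 2 * (1 + 0.1 * x[1])

def kernel(x_past, x_now):
    """Finite-rank kernel (5): k(x', x) = sum_r nu_r * psi_r(x') * phi_r(x)."""
    return sum(nu[r] * psi(x_past, r) * phi(x_now, r) for r in range(R))

def intensity(x_now, history, mu=1.0):
    """Conditional intensity (4): lambda[k](x) = mu + sum of k(x_i, x) over past events x_i."""
    return mu + sum(kernel(x_i, x_now) for x_i in history if x_i[0] < x_now[0])

history = [np.array([1.2, 0.3]), np.array([2.5, 0.8])]   # events x_i = (t_i, m_i)
print(intensity(np.array([3.0, 0.5]), history))
```

In the learned model, the hand-written ψ_r and φ_r above are simply replaced by the neural feature maps described next.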
Note that while we view (5) as similar to a spectral decomposition, it can be better understood as feature maps in a kernel representation, and, in particular, we do not need the feature functions ψ_r and φ_r to be orthogonal. The decomposition (5) represents the kernel function using three parts: two sets of (normalized) feature functions and the energy spectrum; the spectrum ν_r plays the role of weights combining the feature maps. In the learning process, we can train the feature functions (typically neural networks) and the weights separately, since learning normalized feature maps tends to be more numerically stable. The proposed form of the kernel is not necessarily positive semi-definite, or even symmetric.

Figure 2: (a) The architecture of the proposed non-stationary neural spectral kernel. A hidden embedding first summarizes the data point; then the embedding is mapped to different features via multi-branch neural networks. The kernel value is calculated by summing up the products between two sets of learned features, {ψ_r} and {φ_r}, weighted by the spectrum {ν_r}. The spectrum, feature functions, and hidden embeddings are jointly learned from data. (b) The structure of the multi-branch neural network. There are two shared layers with n nodes per layer that generate the hidden embedding; d denotes the dimension of the input data point; p denotes the dimension of the middle layer that generates the feature.

Note that the spectral representation allows for a completely general kernel that can be used for non-stationary processes (since our influence function does not impose a shift-invariant structure). Moreover, the spectral representation allows us to go beyond the monotone decay or parametric forms commonly assumed in the prior literature.

Neural network feature function. As one of the most salient features of our method, the feature functions {ψ_r, φ_r} are represented using neural networks, leveraging their known universal approximation power. First, the input data point x ∈ X is projected to a hidden embedding space via a multi-layer shared network, aiming to extract key information from the input data. We adopt the Softplus non-linear activation in this feature extraction sub-network, while other options are possible. Next, the hidden embedding is mapped to the different features {φ_r(x)} through R branched sub-networks. To ensure the output feature is constrained to a bounded space, we choose a scaled sigmoid as the activation of the output layer, i.e., f(x) = s/(1 + exp(−x)), where s is a constant that rescales the output to a proper range (in our setting, we set s to 100). The overall architecture of our kernel formulation and the structure of the multi-branch neural network are summarized in Figure 2(a) and (b), respectively. Further specifications of the neural networks will be provided in the experiments in Section 4.
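A minimal PyTorch sketch of this architecture is given below, using the sizes quoted in the experiment configuration (shared network with hidden width n = 128 and embedding dimension p = 10, R = 5 branches, scaled sigmoid with s = 100). The class and variable names are ours, the exponential parameterization that keeps the spectrum non-negative is our own simplification, and the exact layer counts may differ from the released code; this is an illustration of the structure, not the authors' implementation.

```python
import torch
import torch.nn as nn

class NeuralSpectralKernel(nn.Module):
    """Sketch of k(x', x) = sum_r nu_r * psi_r(x') * phi_r(x) with neural feature maps."""
    def __init__(self, d, R=5, n=128, p=10, s=100.0):
        super().__init__()
        # Shared network: maps a data point x in R^d to a p-dimensional hidden embedding.
        self.shared = nn.Sequential(
            nn.Linear(d, n), nn.Softplus(),
            nn.Linear(n, n), nn.Softplus(),
            nn.Linear(n, p), nn.Softplus(),
        )
        # R branched sub-networks per feature set; the scaled sigmoid keeps outputs in (0, s).
        def branch():
            return nn.Sequential(nn.Linear(p, p), nn.Softplus(),
                                 nn.Linear(p, p), nn.Softplus(),
                                 nn.Linear(p, 1))
        self.psi_branches = nn.ModuleList([branch() for _ in range(R)])
        self.phi_branches = nn.ModuleList([branch() for _ in range(R)])
        self.log_nu = nn.Parameter(torch.zeros(R))   # spectrum, kept non-negative via exp
        self.s = s

    def _features(self, x, branches):
        h = self.shared(x)                                          # (batch, p) embedding
        outs = [self.s * torch.sigmoid(b(h)) for b in branches]     # each (batch, 1)
        return torch.cat(outs, dim=-1)                              # (batch, R)

    def forward(self, x_past, x_now):
        psi = self._features(x_past, self.psi_branches)             # psi_r(x'), (batch, R)
        phi = self._features(x_now, self.phi_branches)              # phi_r(x),  (batch, R)
        return (torch.exp(self.log_nu) * psi * phi).sum(-1)         # k(x', x),  (batch,)

kernel = NeuralSpectralKernel(d=2)
print(kernel(torch.rand(4, 2), torch.rand(4, 2)))
```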
An essential task in learning the neural point process model is to estimate the influence kernel function of the point process. The reason is two-fold. First, the kernel function is the most important component in representing the point process. Second, in practice, the influence kernel offers clear interpretations, such as "how events at a particular time and location will influence future events at a given time and location." Such interpretation is essential for prediction using event data, one of the main applications of point process models.

To estimate the influence kernel function for point process models, we consider a popular approach through maximum likelihood estimation (MLE). Formally, the optimal kernel can be found by solving the following optimization problem given M training event sequences observed over the time horizon [0, T):

max_{k ∈ K} ℓ(k), where ℓ(k) := (1/M) Σ_{j=1}^{M} ( ∫_X log λ_j[k](x) dN_j(x) − ∫_X λ_j[k](x) dx ), (6)

where λ_j and N_j denote the conditional intensity and counting measure associated with the j-th trajectory, respectively, and K ⊂ C^0(X × X) represents the family of regular kernel functions induced by the feature function family F and the finite-rank decomposition (5).

Learning algorithm. To solve the MLE problem (6) when the kernel function k is parameterized by neural networks, we use a stochastic gradient method, as summarized in Algorithm 1 (Appendix A). Note that to calculate the log-likelihood function in (6), we need to evaluate an integral (the second term), which does not have a closed-form expression. We approximate the integral numerically by Monte Carlo integration: drawing samples and taking the average, as described in Algorithm 2 (Appendix A).
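As a concrete illustration of the per-trajectory objective in (6) and the Monte Carlo integral of Algorithm 2, here is a minimal PyTorch sketch. It assumes a kernel module such as the one sketched above, a single trajectory stored as an (N, 2) tensor of (time, mark) pairs on a normalized space [0, T) × [0, M_range), and a constant background rate; these conventions and the helper names are ours.

```python
import torch

def log_likelihood(kernel, events, T=100.0, M_range=100.0, mu=1.0, n_mc=2000):
    """Per-trajectory log-likelihood, cf. (6): sum of log-intensities at the observed events
    minus a Monte Carlo estimate of the integral of the intensity over X."""

    def intensity(x):                                     # x: (batch, 2) query points
        lam = torch.full((x.shape[0],), mu)
        for x_i in events:                                # add k(x_i, x) for strictly past events
            mask = (x_i[0] < x[:, 0]).float()
            lam = lam + mask * kernel(x_i.expand(x.shape[0], -1), x)
        return lam

    term1 = torch.log(intensity(events)).sum()            # log-intensity at observed events

    u = torch.rand(n_mc, 2) * torch.tensor([T, M_range])  # uniform samples over X (Algorithm 2)
    term2 = T * M_range * intensity(u).mean()              # |X| times average sampled intensity

    return term1 - term2
```

For training, the negative of this quantity, averaged over a mini-batch of trajectories, serves as the loss.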
We first consider the MLE as a functional optimization problem, since when using neural networks to approximate the kernel function, we are interested in such functional approximation results. We show that the expected log-likelihood reaches its maximum at the true kernel with the second-order derivative bounded away from zero, and thus the true kernel function is identifiable by solving the MLE problem under some regularity conditions. Consider the log-likelihood of point processes over a family of kernel functions K̄ which contains K, the family induced by feature functions in F and the non-negative spectrum {ν_r}_{r=1}^{R}. Note that K̄ may go beyond the finite-rank decomposition in (5). Throughout the theoretical analysis we omit the spectrum, as it can be absorbed into the feature functions; the details are discussed in Remark 3.2.

Assumption 3.1. (A1) The kernel function family K̄ ⊂ C^0(X × X) is uniformly bounded, and the true kernel k* ∈ K̄; (A2) there exist positive constants c_1, c_2 such that for any k ∈ K̄, c_1 ≤ λ[k](x) ≤ c_2 almost surely along any event data trajectory.

Note that, apart from (A2), we only need the kernel functions to be measurable for the theoretical analysis, which is guaranteed by (A1). In practice, the proposed neural network parametrization leads to a continuous and smooth (low-rank) kernel function, which induces a non-singular λ. We have the following lemma, which shows that a local perturbation around the true kernel function leads to a decrease in the log-likelihood function value in expectation. Note that we assume k* lies in (or can be well approximated by) the function class K̄, thanks to the well-known universal approximation power of neural networks.

Lemma 3.1 (Local perturbation of the likelihood function around the true kernel function). Under Assumption 3.1, for any k̃ ∈ K̄ and δk = k̃ − k*, the difference ℓ(k̃) − ℓ(k*) admits the decomposition given in (7).

The implication of Lemma 3.1 is the following. In (7), the difference in the likelihood function caused by a small perturbation around the true kernel function is nicely decomposed into two terms: the first term in (7) is a martingale integral (since the conditional expectation of dN_j(x) − λ_j[k*](x) dx given the history is zero), and the second term is the integral of a quadratic term against the counting measure. The expectation of the first term is zero due to the martingale property. For the second term, the intensity is bounded away from zero per trajectory j by (A2), and thus a perturbation of the intensity function, which corresponds to the second (quadratic) term in (7), is reflected in a decrease of the log-likelihood. Furthermore, we have the identifiability of the kernel itself, as stated in the following theorem.

Theorem 3.2 (Kernel identifiability using maximum likelihood). Under Assumption 3.1, the true kernel function k* is locally identifiable in that k* is a local maximum solution of the maximum likelihood problem (6) in expectation.

Remark 3.1 (Function identification and neural network parametrization). Theorem 3.2 derives a result which holds for variational perturbations of the kernel function k in the possibly parametric family K induced by the feature function family F. When F is the class of functions that a neural network can represent, the perturbation δk is induced by a change of the network parameters. It is common that in neural network models parameter identification is difficult (e.g., due to the symmetry of permuting hidden neurons); however, neural network function identification may still hold, e.g., under the mean-field approximation (Mei et al., 2018). Thus the kernel function identification result in Theorem 3.2 is important when learning kernel functions with neural networks.

Remark 3.2 (Finite-rank kernel representation). When assuming the representation (5), we write the true kernel as k*(x', x) = Σ_{r=1}^{R} ν_r ψ_r(x') φ_r(x). For theoretical analysis purposes, without loss of generality, we can assume ν_r = 1, r = 1, . . . , R, since the spectrum can be absorbed into the feature functions. Consider a perturbed kernel k̃ whose feature functions are ψ̃_r = ψ_r + δψ_r and φ̃_r = φ_r + δφ_r, and which still satisfies Assumption 3.1. The kernel function variation δk in (8) can then be written as δk(x', x) = Σ_{r=1}^{R} (δψ_r(x') φ_r(x) + ψ_r(x') δφ_r(x) + δψ_r(x') δφ_r(x)). With a rank-R representation of the kernel and smooth ψ_r, φ_r represented by neural networks, we potentially prevent the overfitting that leads to singular kernels, which may be a problem for the over-complete kernel family K̄ in (A1). Later we show experimentally that the learned kernel is smooth and close to the true kernel in Figure 3 and Figure 4. We also study the MLE for parameterized kernels in Appendix D, where the target feature function belongs to a parametric function class with a finite number of parameters. This includes, for instance, spline functions, functions represented by a Fourier basis, and neural networks, which demonstrates that our proposed method acts as a fundamental framework for finite-rank and neural network kernel representations.

This section presents experimental results on both synthetic and real data and compares our method with several state-of-the-art approaches, including (i) the standard Hawkes process with an exponentially decaying kernel function (Hawkes) (Hawkes, 1971); (ii) recurrent marked temporal point processes (RMTPP) (Du et al., 2016); and (iii) the Neural Hawkes process (NH) (Mei & Eisner, 2017). See Appendix B for a detailed review of these existing methods. To perform the experiments, we use the following procedure: given training data, we first estimate the kernel function to obtain k̂(·, ·) by solving the maximum likelihood problem (6) using stochastic gradient descent, as described in Section 2.3. Note that according to (2), the conditional intensity function can be treated as a prediction of the chance of having an event at a given time t after observing the past events.
Thus, to evaluate the prediction performance on test data, given a test trajectory, we perform online prediction by feeding the past events (in the test trajectory) into the conditional intensity function according to (3), which gives the conditional probability of a future event given the past observations.

Performance metrics. We consider two performance metrics. (i) The performance on synthetic data is evaluated by the out-of-sample mean absolute error (MAE) of the conditional intensity function. For synthetic data, we know the true kernel, denoted as k*. Given a sequence of test data, let λ[k*] be the "true" conditional intensity function defined by (3) using the true kernel k*, and let λ[k̂] be the estimated conditional intensity function for the same sequence using the estimated kernel k̂. Thus, λ[k̂] can be viewed as a probabilistic prediction of the events (since it specifies the likelihood of an event happening given the test trajectory's history) when the kernel k̂ is estimated separately from training data. The out-of-sample MAE for one test trajectory is defined as

MAE(k̂) = (1/|X|) ∫_X | λ[k̂](x) − λ[k*](x) | dx.

Then we average this over all test trajectories to obtain our performance measure. (ii) For real data, since we do not know the true intensity function, we report the average out-of-sample predictive log-likelihood. Given a test trajectory (with information contained in the counting measure N(x)), the predictive log-likelihood for the trajectory using an estimated kernel k̂ is given as

ℓ(k̂) = ∫_X log λ[k̂](x) dN(x) − ∫_X λ[k̂](x) dx.

The average out-of-sample predictive log-likelihood is obtained by averaging over test trajectories. This has been widely adopted as a metric for real data (Mei & Eisner, 2017; Omi et al., 2019; Zhang et al., 2019; Zhu et al., 2021a;b): the higher the log-likelihood, the better the model predicts. See all experiment configurations in Appendix B.

Synthetic data. In the experiments, we assume the background rate is a constant (µ = 1) and focus on recovering both stationary and non-stationary kernel structures. We generate four one-dimensional and three two-dimensional synthetic data sets, which are simulated by a vanilla Hawkes process and by our model described in Section 2.2 with randomly initialized parameters. The simulated data are generated by the thinning algorithm described in Algorithm 3 (Appendix A), which is an accept-reject method for simulating a point process (Daley & Vere-Jones, 2008; Gabriel et al., 2013). For ease of comparison, we normalize the time and mark spaces for all data sets to the range from 0 to 100. Each data set is composed of 1,000 event sequences with an average length of 121. We fit the models using 80% of the synthetic data set and use the remaining 20% as the testing set.
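The thinning algorithm itself is deferred to Algorithm 3 in the appendix. For orientation, below is a generic Ogata-style thinning sketch for the special case of a purely temporal Hawkes process with an exponential kernel; the marked, neural-kernel version used to generate the data follows the same accept-reject logic but additionally samples marks and uses a model-specific intensity bound. Parameter values here are arbitrary.

```python
import numpy as np

def simulate_hawkes_thinning(mu, alpha, beta, T, seed=0):
    """Ogata-style thinning (accept-reject) for a temporal Hawkes process with kernel
    alpha * exp(-beta * (t - t_j)). Between events the intensity is non-increasing, so the
    intensity evaluated at the current time is a valid upper bound until the next event."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0

    def intensity(s):
        return mu + alpha * sum(np.exp(-beta * (s - tj)) for tj in events)

    while t < T:
        lam_bar = intensity(t)                        # upper bound on lambda over (t, next event]
        t += rng.exponential(1.0 / lam_bar)           # candidate event time
        if t >= T:
            break
        if rng.uniform() <= intensity(t) / lam_bar:   # accept with probability lambda(t) / lam_bar
            events.append(t)
    return events

print(len(simulate_hawkes_thinning(mu=1.0, alpha=0.8, beta=1.0, T=100.0)))
```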
Kernel recovery. One of the most important abilities of our approach is to recover the true kernel. We report the kernel recovery results for the one- and two-dimensional synthetic data sets in Figure 3 and Figure 4, respectively. In Figure 3, we observe that our proposed model can accurately recover the true kernel for one-dimensional data. In contrast, the vanilla Hawkes process with a stationary kernel fails to capture such non-stationarity. We also present results for two-dimensional data in Figure 4, where a continuous one-dimensional mark is introduced to the events. The first three columns show two-dimensional "slices" of the four-dimensional kernel evaluation, indicating that our model is also capable of recovering complex high-dimensional kernel structure.

Figure 3: Kernel recovery results on a one-dimensional synthetic data set. The first three panels show the true kernel that generates the data, the kernel learned by our model, and the kernel learned by a vanilla Hawkes process, respectively. The fourth panel shows the true and predicted conditional intensity functions of a test sequence.

We note that event points are sparsely scattered in the two-dimensional marked-temporal space, whereas the events' correlation is measured in a high-dimensional (in our example, four-dimensional) kernel space.

Intensity prediction. We evaluate the predictive accuracy of the conditional intensity λ[k̂](x) in (3) using the kernel k̂ estimated from the training data. Note that λ[k̂](x) is a probabilistic prediction since it can be viewed as the instantaneous probability of an event given the historical events. The last panel in Figure 3 shows the predicted one-dimensional conditional intensity functions given a sequence randomly selected from the testing set. We compare our predicted λ[k̂](x) with the one estimated by a vanilla Hawkes model with a stationary kernel. The result shows that the conditional intensity λ[k̂](x) suggested by our model closely matches the one using the true kernel, λ[k*](x). In contrast, the Hawkes model only provides limited flexibility for modeling the non-stationary process. Two panels in the last column of Figure 4 give another comparison between the true and predicted two-dimensional conditional intensities over the marked-temporal space. This result confirms that our model can accurately predict the two-dimensional conditional intensity function. To validate the robustness of our proposed method, we also test our model on another one-dimensional data set generated by a vanilla Hawkes process with a stationary parametric kernel, as shown in Figure 5. Though our model is evidently overparameterized for these data, it is still able to recover the kernel structure and predict the conditional intensity accurately, which demonstrates the adaptiveness of our model and confirms that it can approximate a stationary kernel without incurring overfitting. More kernel recovery experiments can be found in Appendix C. Additionally, we compare our method with three other baselines on the synthetic data sets. Figure 6 shows the conditional intensities based on the true kernel, λ[k*](x) (solid black lines), and the predicted intensity using each method. We observe that our method obtains much better predictive performance than the other baseline approaches. Table 1 summarizes two performance metrics for these methods: the predictive log-likelihood and the MAE. The results show that our method greatly outperforms the other baseline approaches in both metrics. In particular, the MAE of our method is at least 90% lower than that of the other methods for both the one-dimensional and two-dimensional data sets.

Figure 6: Predicted conditional intensity using our method and other baselines for a sequence selected from the test data. We aggregate the conditional intensity over the mark space for ease of presentation and visualize the average conditional intensity over time for the two-dimensional synthetic data in the third panel.

Real-data results. Finally, we test our method on two large-scale real data sets. (1) Atlanta 911 calls-for-service data. The data set contains the 911 calls-for-service data in Atlanta from 2015 to 2017. We extract 7,831 reported robberies from the data set since robbers usually follow a particular modus operandi (M.O.).
Each robbery report is associated with a timestamp indicating when the robbery occurred. We consider each series of robberies as a sequence. (2) Northern California seismic data. The Northern California Earthquake Data Center (NCEDC) provides public time series data (Northern California Earthquake Data Center, UC Berkeley Seismological Laboratory, 2014) that come from broadband, short-period, strong-motion seismic sensors, GPS, and other geophysical sensors. We extract 16,401 seismic records with a magnitude larger than 3.0 from 1978 to 2018 in Northern California and partition the data into multiple sequences, one per quarter. We fit the models using 80% of the data set and use the remaining 20% as the testing set. In Table 1, we report the out-of-sample predictive log-likelihood of each method, since the ground truth is not available for real data. We can see that our model attains the highest predictive log-likelihood for both the synthetic and real data sets. In particular, the better performance on real data, with a significantly higher out-of-sample predictive log-likelihood than the other existing methods, suggests that the non-stationary kernel captures the nature of the real data better in these cases. This shows the merit of our approach for real data where we do not know the ground truth.

Training / testing time. In Table 1, we compare the running time of our model and the other baselines on the one-dimensional synthetic data set; the size and performance of each model are also presented in Table 1. The running time refers to wall-clock time. We observe that the training time of our model is similar to that of the other neural point process models (RMTPP and NH). The Hawkes process model has the smallest training time because it has only two parameters to estimate. The testing (inference) time of our model is significantly lower than that of the two recurrent neural point process models.

We have presented a new non-stationary marked point process model with a general kernel represented by neural network feature maps, motivated by the spectral decomposition of the kernel. The flexible kernel with the expressive power enabled by neural networks can capture complex dependence across temporal, spatial, and mark spaces, which can be valuable for real-data modeling in various applications. The benefits of our approach are demonstrated by the superior out-of-sample predictive log-likelihood on real data. The model can be learned efficiently with stochastic gradient descent. We also develop a theoretical guarantee for maximum likelihood identifiability. Generally, model architectures may well depend on specific tasks. For example, the neural networks can be based on CNNs if the high-dimensional marks are image-like, and on LSTMs or even BERT if the marks are text-based. Thus the choice of the deep model architecture and optimization algorithm can be more systematically explored, especially the comparison of non-stationary neural kernels to stationary ones. Understanding and characterizing what kernels can be learned through such an approach is left for future study. Also, when extending to high-dimensional mark spaces, more efficient algorithms are needed for the computation of the log-likelihood.
Algorithm 1: Stochastic gradient-based learning algorithm
Input: X = {x_j}, j = 1, . . . , N, the training set with N sequences; η, the number of learning iterations; γ, the learning rate; M, the size of a mini-batch.
Initialization: model parameters θ_0; l ← 0.
while l < η do
  Randomly draw M sequences from X, denoted X_l = {x_j}, j = 1, . . . , M;
  θ_{l+1} ← θ_l + γ ∂ℓ/∂θ_l evaluated on X_l, where ℓ is the log-likelihood in (6);
  l ← l + 1;
end

Algorithm 2: Monte Carlo estimation of the integral of the conditional intensity in (6)
Input: λ, the model to be evaluated; x_j, an event sequence, where N_j is the number of events in the sequence; N, the number of samples uniformly drawn from X.
Output: Λ, the integral of the conditional intensity over the data space.

Algorithm 3: Efficient thinning algorithm for simulating a point process
Input: θ, T, M.
Output: A set of events H_T ordered by time.

Now we describe the experiment configurations. We set the rank to R = 5. The shared network that summarizes the input data into a hidden embedding is a fully connected three-layer network, and each sub-network is a fully connected network with two hidden layers. The width of the hidden layers in the shared network is n = 128, and the width of the input layers of the sub-networks (i.e., the output layer of the shared network) is p = 10. We adopt the Softplus activation f(x) = log(1 + exp(x)) in each layer of the network. To learn the model's parameters, we use the Adam optimizer (Kingma & Ba, 2014) with a constant learning rate of 10^−2 and a batch size of 32. All experiments are performed on Google Colaboratory (Pro version) with 12 GB RAM and dual-core Intel processors clocked at up to 2.3 GHz (without GPU). Code to reproduce the experimental results is publicly available.

This study considers three baseline methods: (1) the standard Hawkes process with an exponentially decaying kernel function (Hawkes): it specifies the conditional intensity function as λ(t) = µ + α Σ_{t_j < t} exp(−β(t − t_j)).
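Putting Algorithm 1 and the configuration above together, a minimal training-loop sketch (building on the kernel and log-likelihood sketches shown earlier) might look as follows; the function name and the epoch-based batching are our illustrative choices rather than the released implementation.

```python
import torch

def train(train_set, d=2, R=5, epochs=10, batch_size=32, lr=1e-2):
    """Mini-batch stochastic gradient training in the spirit of Algorithm 1, using Adam with
    the learning rate and batch size quoted above. `train_set` is assumed to be a list of
    (N_j, 2) tensors of (time, mark) pairs; NeuralSpectralKernel and log_likelihood refer to
    the earlier sketches."""
    kernel = NeuralSpectralKernel(d=d, R=R)
    optimizer = torch.optim.Adam(kernel.parameters(), lr=lr)
    for epoch in range(epochs):
        perm = torch.randperm(len(train_set))                      # shuffle sequences each epoch
        for start in range(0, len(train_set), batch_size):
            batch = [train_set[int(i)] for i in perm[start:start + batch_size]]
            loss = -sum(log_likelihood(kernel, seq) for seq in batch) / len(batch)
            optimizer.zero_grad()
            loss.backward()                # gradient of the average negative log-likelihood
            optimizer.step()               # Adam update of the kernel parameters
    return kernel
```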