key: cord-0648080-x3pqd26o authors: Zhou, Zihao; Yang, Xingyi; Rossi, Ryan; Zhao, Handong; Yu, Rose title: Neural Point Process for Learning Spatiotemporal Event Dynamics date: 2021-12-12 journal: nan DOI: nan sha: 528adda5488bcab09e253fd286cee0ebeb0ba361 doc_id: 648080 cord_uid: x3pqd26o

Learning the dynamics of spatiotemporal events is a fundamental problem. Neural point processes enhance the expressivity of point process models with deep neural networks. However, most existing methods only consider temporal dynamics without spatial modeling. We propose Deep Spatiotemporal Point Process (DeepSTPP), a deep dynamics model that integrates spatiotemporal point processes. Our method is flexible, efficient, and can accurately forecast irregularly sampled events over space and time. The key construction of our approach is the nonparametric space-time intensity function, governed by a latent process. The intensity function enjoys closed-form integration for the density. The latent process captures the uncertainty of the event sequence. We use amortized variational inference to infer the latent process with deep networks. Using synthetic datasets, we validate that our model can accurately learn the true intensity function. On real-world benchmark datasets, our model demonstrates superior performance over state-of-the-art baselines. Our code and data can be found at https://github.com/Rose-STL-Lab/DeepSTPP.

Accurate modeling of spatiotemporal event dynamics is fundamentally important for disaster response (Veen and Schoenberg, 2008), logistics optimization (Safikhani et al., 2018), and social media analysis (Liang et al., 2019). Compared to other sequence data such as texts or time series, spatiotemporal events occur irregularly with uneven time and space intervals. Discrete-time deep dynamics models such as recurrent neural networks (RNNs) (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) assume events to be evenly sampled.
Interpolating an irregularly sampled sequence into a regular sequence can introduce significant biases (Rehfeld et al., 2011). Furthermore, event sequences contain strong spatiotemporal dependencies. The rate of an event depends on the preceding events, as well as the events geographically correlated to it. Spatiotemporal point processes (STPP) (Daley and Vere-Jones, 2007; Reinhart et al., 2018) provide the statistical framework for modeling continuous-time event dynamics.

Figure 1: Illustration of learning a spatiotemporal point process. We aim to learn the space-time intensity function given the historical event sequence and representative points as background.

The machine learning community is showing a growing interest in continuous-time deep dynamics models that can handle irregular time intervals. For example, Neural ODE (Chen et al., 2018) parametrizes the hidden states in an RNN with an ODE. Shukla and Marlin (2018) use a separate network to interpolate between reference time points. Neural temporal point process (TPP) (Mei and Eisner, 2017; Zhang et al., 2020; Zuo et al., 2020) is an exciting area that combines fundamental concepts from temporal point processes with deep learning to model continuous-time event sequences; see Shchur et al. (2021) for a recent review on neural TPP. However, most of the existing models only focus on temporal dynamics without considering spatial modeling. In the real world, while time is a unidirectional process (arrow of time), space extends in multiple directions. This fundamental difference from TPP makes it nontrivial to design a unified STPP model. Naively approximating the intensity function with a deep neural network would make the integral in the likelihood intractable. Prior research such as Du et al. (2016) discretizes the space as "markers" and uses marked TPP to classify the events. This approach cannot produce the space-time intensity function. Okawa et al.
(2019) models the spatiotemporal density using a mixture of symmetric kernels, which ignores the unidirectional property of time. Chen et al. (2021) proposes to model temporal intensity and spatial density separately with neural ODE, which is computationally expensive. We propose a simple yet computationally efficient approach to learning STPP. Our model, Deep Spatiotemporal Point Process (DeepSTPP) marries the principles of spatiotemporal point processes with deep learning. We take a non-parametric approach and model the space-time intensity function as a mixture of kernels. The parameters of the intensity function are governed by a latent stochastic process which captures the uncertainty of the event sequence. The latent process is then inferred via amortized variational inference. That is, we draw a sample from the variational distribution for every event. We use a Transformer network to parametrize the variational distribution conditioned on the previous events. Compared with existing approaches, our model is non-parametric, hence does not make assumptions on the parametric form of the distribution. Our approach learns the space-time intensity function jointly without requiring separate models for temporal intensity function and spatial density as in Chen et al. (2021) . Our model is probabilistic by nature and can describe various uncertainties in the data. More importantly, our model enjoys closed form integration, making it feasible for processing large-scale event datasets. To summarize, our work makes the following key contributions: • Deep Spatiotemporal Point Process. We propose a novel Deep Point Process model for forecasting unevenly sampled spatiotemporal events. It integrates deep learning with spatiotemporal point processes to learn continuous space-time dynamics. • Neural Latent Process. We model the space-time intensity function using a non-parametric approach, governed by a latent stochastic process. 
We use amortized variational inference to perform inference on the latent process conditioned on the previous events.
• Effectiveness. We demonstrate our model on several synthetic and real-world spatiotemporal event forecasting tasks, where it achieves superior performance in accuracy and efficiency. We also derive and implement efficient algorithms for simulating STPPs.

We first introduce the background of the spatiotemporal point process and then describe our approach to learning the underlying spatiotemporal event dynamics.

Spatiotemporal Point Process. A spatiotemporal point process (STPP) models the number of events $N(S \times (a, b])$ that occurred in the Cartesian product of the spatial domain $S \subseteq \mathbb{R}^2$ and the time interval $(a, b]$. It is characterized by a non-negative space-time intensity function given the history $\mathcal{H}_t$:

$$\lambda^*(s, t) := \lim_{\Delta s \to 0,\, \Delta t \to 0} \frac{\mathbb{E}\left[N\big(B(s, \Delta s) \times (t, t + \Delta t]\big) \mid \mathcal{H}_t\right]}{|B(s, \Delta s)|\, \Delta t},$$

which is the probability of finding an event in an infinitesimal time interval $(t, t + \Delta t]$ and an infinitesimal spatial ball $B(s, \Delta s)$ centered at location $s$.

Example 1: Spatiotemporal Hawkes process (STH). The spatiotemporal Hawkes (or self-exciting) process assumes every past event has an additive, positive, decaying, and spatially local influence over future events. Such a pattern resembles neuronal firing and earthquakes. It is characterized by the following intensity function (Reinhart et al., 2018):

$$\lambda^*(s, t) := \mu g_0(s) + \sum_{i: t_i < t} g_1(t, t_i)\, g_2(s, s_i), \quad \mu > 0,$$

where $g_0(s)$ is the probability density of a distribution over $S$, $g_1$ is the triggering kernel, often implemented as the exponential decay function $g_1(\Delta t) := \alpha \exp(-\beta \Delta t)$ with $\Delta t = t - t_i$ and $\alpha, \beta > 0$, and $g_2(s, s_i)$ is the density of a unimodal distribution over $S$ centered at $s_i$.

Example 2: Spatiotemporal Self-Correcting process (STSC). The self-correcting spatiotemporal point process (Isham and Westcott, 1979) assumes that the background intensity increases with a varying speed at different locations, and the arrival of each event reduces the intensity nearby.
STSC can model certain regular event sequences, such as an alternating home-to-work travel sequence. It has the following intensity function:

$$\lambda^*(s, t) = \mu \exp\Big(g_0(s)\beta t - \sum_{i: t_i < t} \alpha\, g_2(s, s_i)\Big), \quad \alpha, \beta, \mu > 0.$$

Here $g_0(s)$ is the density of a distribution over $S$, and $g_2(s, s_i)$ is the density of a unimodal distribution over $S$ centered at location $s_i$.

Maximum likelihood estimation. Given a history of $n$ events $\mathcal{H}_t$, the joint log-likelihood function of the observed events for STPP is as follows:

$$\log p(\mathcal{H}_t) = \sum_{i=1}^{n} \log \lambda^*(s_i, t_i) - \int_S \int_0^t \lambda^*(u, \tau)\, d\tau\, du. \quad (4)$$

Here, the space-time intensity function $\lambda^*(s, t)$ plays a central role. Maximum likelihood estimation seeks the $\lambda^*(s, t)$ that maximizes Eqn. (4) on the data.

Predictive distribution. Denote the probability density function (PDF) for STPP as $f(s, t|\mathcal{H}_t)$, which represents the conditional probability that the next event will occur at location $s$ and time $t$, given the history. The PDF is closely related to the intensity function:

$$f(s, t|\mathcal{H}_t) = \frac{\lambda^*(s, t)}{1 - F^*(s, t|\mathcal{H}_t)} = \lambda^*(s, t) \exp\Big(-\int_S \int_{t_n}^{t} \lambda^*(u, \tau)\, d\tau\, du\Big),$$

where $F$ is the cumulative distribution function (CDF); see the derivation in Appendix A.1. This means the intensity function specifies the expected number of events in a region conditional on the past. The predicted time of the next event is the expected value of the predictive distribution for time $f(t)$ over the entire spatial domain:

$$\mathbb{E}[t_{n+1}|\mathcal{H}_t] = \int_{t_n}^{\infty} t \int_S f(s, t|\mathcal{H}_t)\, ds\, dt.$$

Similarly, the predicted location of the next event evaluates to:

$$\mathbb{E}[s_{n+1}|\mathcal{H}_t] = \int_S s \int_{t_n}^{\infty} f(s, t|\mathcal{H}_t)\, dt\, ds.$$

Unfortunately, Eqn. (4) is generally intractable. It requires either strong modeling assumptions or expensive Monte Carlo sampling. We propose the DeepSTPP model to simplify the learning.

We propose DeepSTPP, a simple and efficient approach for learning the space-time event dynamics. Our model (1) introduces a latent process to capture the uncertainty, (2) parametrizes the latent process with deep neural networks to increase model expressivity, and (3) approximates the intensity function with a set of spatial and temporal kernel functions.

Neural latent process. Given a sequence of $n$ events, we wish to model the conditional density of observing the next event given the history, $f(s, t|\mathcal{H}_t)$.
We introduce a latent process to capture the uncertainty of the event history and infer the latent process with amortized variational inference. The latent process dictates the parameters in the space-time intensity function. We sample from the latent process using the re-parameterization trick (Kingma and Welling, 2013). As shown in Figure 2, given the sequence with $n$ events $\mathcal{H}_t = \{(s_1, t_1), \ldots, (s_n, t_n)\}_{t_n \le t}$, we encode the entire sequence into a high-dimensional embedding. We use positional encoding to encode the sequence order. To capture the stochasticity in the temporal dynamics, we introduce a latent process $z = (z_1, \cdots, z_n)$ for the entire sequence.

Figure 2: Design of our DeepSTPP model. For a historical event sequence, we encode it with a Transformer network and map it to the latent process $(z_1, \cdots, z_n)$. We use a decoder to generate the parameters $(w_i, \gamma_i, \beta_i)$ for each event $i$ given the latent process. The estimated intensity is calculated using the kernel functions $k_s$ and $k_t$ and the decoded parameters.

We assume the latent process follows a multivariate Gaussian at each time step:

$$q_\phi(z_i \mid \mathcal{H}_t) = \mathcal{N}(\mu, \mathrm{Diag}(\sigma)),$$

where the mean $\mu$ and covariance $\mathrm{Diag}(\sigma)$ are the outputs of the embedding neural network. In our implementation, we found using a Transformer [...]

Non-parametric model. We take a non-parametric approach and model the space-time intensity function $\lambda^*(s, t)$ as:

$$\lambda^*(s, t \mid z) = \sum_{i=1}^{n+J} w_i\, k_s(s, s_i; \gamma_i)\, k_t(t, t_i; \beta_i). \quad (7)$$

Here $w_i(z)$, $\gamma_i(z)$, $\beta_i(z)$ are the parameters for each event, conditioned on the latent process. Specifically, $w_i$ represents the non-negative intensity magnitude, implemented with a softplus activation function. $J$ is the number of representative points that we will introduce later. $k_s(\cdot, \cdot)$ and $k_t(\cdot, \cdot)$ are the spatial and temporal kernel functions, respectively.
For both kernel functions, we parametrize them as a normalized RBF kernel:

$$k_s(s, s_i) = \alpha^{-1} \exp\big(-\gamma_i \|s - s_i\|\big), \qquad k_t(t, t_i) = \exp\big(-\beta_i (t - t_i)\big),$$

where the bandwidth parameter $\gamma_i$ controls an event's influence over the spatial domain. The parameter $\beta_i$ is the decay rate that represents the event's influence over time. $\alpha = \int_S \exp\big(-\gamma_i \|s - s_i\|\big)\, ds$ is the normalization constant. We use separate decoder networks to generate the parameters $\{w_i, \gamma_i, \beta_i\}$ given $z$, as shown in Figure 2. Each decoder is a 4-layer feed-forward network. We use a softplus activation function to ensure $w_i$ and $\gamma_i$ are positive. The decay rate $\beta_i$ can be any real number, such that an event can also have constant or increasing triggering intensity over time.

Representative Points. In addition to the $n$ historical events, we also randomly sample $J$ representative points from the spatial domain to approximate the background intensity. This accounts for the influence of unobserved events in the background, with varying rates at different locations. The model design in (7) enjoys closed-form integration, which gives the conditional PDF as:

$$f(s, t \mid \mathcal{H}_t, z) = \lambda^*(s, t \mid z) \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau \mid z)\, d\tau\Big) = \sum_{i=1}^{n+J} w_i\, k_s(s, s_i)\, k_t(t, t_i)\, \exp\Big(-\sum_{i=1}^{n+J} \frac{w_i}{\beta_i}\big[k_t(t_n, t_i) - k_t(t, t_i)\big]\Big).$$

See the derivation details in Appendix A.2. DeepSTPP circumvents numerical integration of the intensity function and enjoys fast inference in forecasting future events. In contrast, NSTPP (Chen et al., 2021) is relatively inefficient, as its ODE solver also requires additional numerical integration.

Parameter learning. Due to the latent process, the posterior becomes intractable. Instead, we use amortized inference by optimizing the evidence lower bound (ELBO) of the likelihood. In particular, given event history $\mathcal{H}_t$, the conditional log-likelihood of the next event is bounded as:

$$\log f(s, t \mid \mathcal{H}_t) \ge \mathbb{E}_{q_\phi(z \mid \mathcal{H}_t)}\big[\log f_\theta(s, t \mid \mathcal{H}_t, z)\big] - \mathrm{KL}\big(q_\phi(z \mid \mathcal{H}_t)\, \|\, p(z)\big), \quad (11)$$

where $\phi$ represents the parameters of the encoder network, and $\theta$ are the parameters of the decoder network. $p(z)$ is the prior distribution, which we assume to be Gaussian. $\mathrm{KL}(\cdot \| \cdot)$ is the Kullback-Leibler divergence between two distributions. We can optimize the objective function in Eqn. (11) w.r.t. the parameters $\phi$ and $\theta$ using back-propagation.
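To make the closed-form integration concrete, the following is a minimal NumPy sketch of the kernel-mixture intensity and the resulting conditional density. It is not the authors' implementation. It assumes the spatial domain is the whole plane, so that the normalization constant of the spatial kernel is $2\pi/\gamma_i^2$, and it assumes every decay rate $\beta_i$ is nonzero. All function names and parameter values are hypothetical. The sketch also checks the closed-form time integral against a trapezoidal approximation.

```python
import numpy as np

def k_s(s, s_i, gamma):
    """Spatial kernel exp(-gamma_i * ||s - s_i||) / alpha, one value per event.

    Assumption: S is the whole plane R^2, so alpha = 2 * pi / gamma_i**2.
    """
    dist = np.linalg.norm(s - s_i, axis=-1)
    return np.exp(-gamma * dist) * gamma**2 / (2.0 * np.pi)

def k_t(t, t_i, beta):
    """Temporal kernel exp(-beta_i * (t - t_i)), one value per event."""
    return np.exp(-beta * (t - t_i))

def intensity(s, t, s_i, t_i, w, gamma, beta):
    """Space-time intensity: sum_i w_i * k_s(s, s_i) * k_t(t, t_i)."""
    return float(np.sum(w * k_s(s, s_i, gamma) * k_t(t, t_i, beta)))

def time_integral(t, t_n, t_i, w, beta):
    """Closed-form integral of the temporal intensity over (t_n, t].

    Assumption: every beta_i is nonzero (beta_i = 0 needs a separate limit).
    """
    return float(np.sum(w / beta * (k_t(t_n, t_i, beta) - k_t(t, t_i, beta))))

def density(s, t, t_n, s_i, t_i, w, gamma, beta):
    """Conditional density: intensity times exp(-integrated temporal intensity)."""
    lam = intensity(s, t, s_i, t_i, w, gamma, beta)
    return lam * np.exp(-time_integral(t, t_n, t_i, w, beta))

# Hypothetical decoded parameters for three past events (illustration only).
s_i = np.array([[0.0, 0.0], [1.0, 0.5], [0.3, 0.8]])
t_i = np.array([0.2, 0.7, 1.0])
w = np.array([0.5, 1.0, 0.8])
gamma = np.array([2.0, 3.0, 1.5])
beta = np.array([1.0, 0.5, 2.0])
t_n, t_query = 1.0, 1.6
s_query = np.array([0.5, 0.5])

# Trapezoidal check of the closed-form time integral.
grid = np.linspace(t_n, t_query, 20001)
lam_t = np.array([np.sum(w * k_t(tau, t_i, beta)) for tau in grid])
numeric = float(np.sum((lam_t[1:] + lam_t[:-1]) / 2.0 * np.diff(grid)))
closed = time_integral(t_query, t_n, t_i, w, beta)
lam_value = intensity(s_query, t_query, s_i, t_i, w, gamma, beta)
pdf_value = density(s_query, t_query, t_n, s_i, t_i, w, gamma, beta)
```

Because the spatial kernels integrate to one, only the temporal kernels enter the exponent, which is why evaluating the density costs a single pass over the events rather than a numerical integral.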
Spatiotemporal Dynamics Learning. Modeling the spatiotemporal dynamics of a system in order to forecast the future is a fundamental task in many fields. Most work on spatiotemporal dynamics has focused on spatiotemporal data measured at regular space-time intervals, e.g., (Xingjian et al., 2015; Li et al., 2018; Yao et al., 2019; Fang et al., 2019; Geng et al., 2019). For discrete spatiotemporal events, statistical methods include space-time point processes; see (Moller and Waagepetersen, 2003; Mohler et al., 2011). Zhao et al. (2015) propose multi-task feature learning, whereas Yang et al. (2018) propose an RNN-based model to predict spatiotemporal check-in events. These discrete-time models assume data are sampled evenly and are thus unsuitable for our task.

2017) shows that there is no significant benefit of using continuous-time RNN for discrete event data. Special treatment is still needed for modeling unevenly sampled events.

We evaluate DeepSTPP for spatiotemporal prediction using both synthetic and real-world data.

Baselines. We compare DeepSTPP with state-of-the-art models, including:
• Spatiotemporal Hawkes Process (MLE) (Reinhart et al., 2018): it learns a spatiotemporal parametric intensity function using maximum likelihood estimation; see the derivation in Appendix A.3.
• Recurrent Marked Temporal Point Process (RMTPP) (Du et al., 2016): it uses a GRU to model the temporal intensity function. We modify this model to take spatial location as marks.
• Neural Spatiotemporal Point Process (NSTPP) (Chen et al., 2021): a state-of-the-art neural point process model that parameterizes the spatial PDF and temporal intensity with continuous-time normalizing flows. Specifically, we use Jump CNF as it is a better fit for Hawkes processes.

All models are implemented in PyTorch and trained using the Adam optimizer. We set the number of representative points to 100. The details of the implementation are deferred to Appendix C.1.
For the baselines, we use the authors' original repositories whenever possible.

Datasets. We simulated two types of STPPs: the spatiotemporal Hawkes process (STH) and the spatiotemporal self-correcting process (STSC). For both STPPs, we generate three synthetic datasets, each with a different parameter setting, denoted as DS1, DS2, and DS3 in the tables. We also derive and implement efficient algorithms for simulating STPPs based on Ogata's thinning algorithm (Ogata, 1981). We view the simulator construction as an independent contribution of this work. The details of the simulation can be found in Appendix B. We use two real-world spatiotemporal event datasets from NSTPP (Chen et al., 2021) to benchmark the performance.
• Earthquakes Japan: catalog earthquake data from the U.S. Geological Survey, including the location and time of all earthquakes in Japan from 1990 to 2020 with a magnitude of at least 2.5. There are 1,050 sequences in total. The number of events per sequence ranges from 19 to 545. (The statistics differ slightly from the original paper due to updates in the data source.)
• COVID-19: daily county-level COVID-19 case data in New Jersey published by The New York Times. There are 1,650 sequences, and the number of events per sequence ranges from 7 to 305.

For both synthetic and real-world data, we partition long event sequences into non-overlapping subsequences according to a fixed time range T. The target is the last event of each subsequence, and the input is the rest of the events. The number of input events varies across subsequences. We split each dataset into train/val/test sets with a ratio of 8:1:1. All results are the average of 3 runs.

For synthetic data, we know the ground truth intensity function. We compare our method with the best possible estimator, the maximum likelihood estimator (MLE), as well as the NSTPP model. The MLE is learned by optimizing the log-likelihood using the BFGS algorithm. RMTPP can only learn the temporal intensity and is thus not included in this comparison.

Predictive log-likelihood. Table 1 shows the comparison of the predictive distribution for space and time. We report the log-likelihood (LL) of $f(s, t|\mathcal{H}_t)$ and the Hellinger distance (HD) between the predictive distributions and the ground truth, averaged over time. On both the STH and STSC datasets with different parameter settings, DeepSTPP outperforms the baseline NSTPP in terms of LL and HD. This shows that DeepSTPP can estimate the spatiotemporal intensity more accurately for point processes with unknown parameters.

Temporal intensity estimate. Table 2 shows the mean absolute percentage error (MAPE) between the models' estimated temporal intensity and the ground truth $\lambda^*(t)$ over a short sampled range. On the STH datasets, since MLE has the correct parametric form, it is the theoretical optimum. Compared to the baselines, DeepSTPP generally obtains the same or lower MAPE. This shows that joint spatiotemporal modeling also improves the performance of temporal prediction.

Intensity visualization. Figure 3 visualizes the learned space-time intensity and the ground truth for STH and STSC, providing strong evidence that DeepSTPP can correctly learn the underlying dynamics of the spatiotemporal events. In particular, NSTPP has difficulty modeling the complex dynamics of a multimodal distribution such as the spatiotemporal Hawkes process. NSTPP sometimes produces overly smooth intensity surfaces and loses most of the details at the peaks. In contrast, DeepSTPP can better fit the multimodal distribution through its kernel summation form and obtains more accurate intensity functions.

Computational efficiency. Figure 4 provides the training run time comparison between DeepSTPP and NSTPP for 100 epochs. To ensure a fair comparison, all experiments are conducted on one GTX 1080 Ti with an Intel Core i7-4770 and 64 GB RAM. Our method is 100 times faster than NSTPP in training.
This is mainly because our spatiotemporal kernel formulation has a closed-form integral, which bypasses complex and cumbersome numerical integration.

For real-world data evaluation, we report the conditional spatial and temporal log-likelihoods, i.e., $\log f^*(s|t)$ and $\log f^*(t)$, of the final event given the input events. The total log-likelihood, $\log f^*(s, t)$, is the sum of the two values.

Predictive performance. As our model is probabilistic, we compare against the baseline models on the test predictive LL for space and time separately in Table 3. RMTPP can only produce the temporal intensity, so we only include its time likelihood. We observe that DeepSTPP outperforms NSTPP most of the time in terms of accuracy. It takes only half of the time to train, as shown in Figure 4. Furthermore, we see that the STPP models (first three rows) achieve higher LL than the model of time alone (RMTPP). This suggests that joint spatiotemporal modeling has the additional benefit of improving time prediction.

Ablation study. We conduct ablation studies on the model design. Our model assumes a global latent process $z$ that governs the parameters $\{w_i, \beta_i, \gamma_i\}$ with separate decoders. We examine alternative designs experimentally. (1) Shared decoders: we use one shared decoder to generate the model parameters. The sampled $z$ is fed to a single decoder, whose output is partitioned into the model parameters. (2) Separate process: we assume that each of $\{w_i, \beta_i, \gamma_i\}$ follows a separate latent process, and we use three sets of means and variances to sample them separately. (3) LSTM encoder: we replace the Transformer encoder with an LSTM module. As shown in Table 4, we see that (1) Shared decoders decreases the number of parameters but reduces the performance.
(2) Separate process greatly increases the number of parameters but has a negligible influence on the test log-likelihood. (3) LSTM encoder: changing the encoder from a Transformer to an LSTM also results in slightly worse performance. These results validate the design of DeepSTPP: we assume all distribution parameters are governed by one single hidden stochastic process, with separate decoders and a Transformer as the encoder.

We propose a family of deep dynamics models for irregularly sampled spatiotemporal events. Our model, Deep Spatiotemporal Point Process (DeepSTPP), integrates a principled spatiotemporal point process with deep neural networks. We derive a tractable inference procedure by modeling the space-time intensity function as a composition of kernel functions and a latent stochastic process. We infer the latent process with neural networks following the variational inference procedure. Using synthetic data from the spatiotemporal Hawkes process and self-correcting process, we show that our model can learn the spatiotemporal intensity accurately and efficiently. We demonstrate superior forecasting performance on real-world benchmark spatiotemporal event datasets. Future work includes further considering the mutual-exciting structure in the intensity function, as well as modeling multiple heterogeneous spatiotemporal processes simultaneously.

Conditional Density. The intensity function and the probability density function of an STPP are related:

$$f(s, t|\mathcal{H}_t) = \frac{\lambda^*(s, t)}{1 - F^*(s, t|\mathcal{H}_t)} = \lambda^*(s, t) \exp\Big(-\int_S \int_{t_n}^{t} \lambda^*(u, \tau)\, d\tau\, du\Big) = \lambda^*(s, t) \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big).$$

The last equality uses the relation $\lambda^*(s, t) = \lambda^*(t) f^*(s|t)$, according to Daley and Vere-Jones (2007) Chapter 2.3 (4). Here $\lambda^*(t)$ is the time intensity and $f^*(s|t) := f(s|t, \mathcal{H}_t)$ is the spatial PDF that the next event will be at location $s$ given time $t$. According to Daley and Vere-Jones (2007) Chapter 15.4, we can also view an STPP as a type of TPP with continuous (spatial) marks,

$$f(s, t|\mathcal{H}_t) = f^*(t)\, f^*(s|t), \qquad f^*(t) := \lambda^*(t) \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big).$$

Likelihood.
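Given an STPP, the log-likelihood of observing a sequence $\mathcal{H}_t = \{(s_1, t_1), (s_2, t_2), \ldots, (s_n, t_n)\}_{t_n \le t}$ is given by:

$$\log p(\mathcal{H}_t) = \sum_{i=1}^{n} \log \lambda^*(s_i, t_i) - \int_S \int_0^t \lambda^*(u, \tau)\, d\tau\, du.$$

Inference. With a trained STPP and a sequence of history events, we can predict the next event timing and location using their expectations. The predicted time evaluates to

$$\mathbb{E}[t_i|\mathcal{H}_{t_{i-1}}] = \int_{t_{i-1}}^{\infty} t \int_S f(s, t|\mathcal{H}_{t_{i-1}})\, ds\, dt = \int_{t_{i-1}}^{\infty} t \exp\Big(-\int_S \int_{t_{i-1}}^{t} \lambda^*(u, \tau)\, d\tau\, du\Big) \int_S \lambda^*(s, t)\, ds\, dt.$$

The predicted location for the next event is:

$$\mathbb{E}[s_i|\mathcal{H}_{t_{i-1}}] = \int_S s \int_{t_{i-1}}^{\infty} \exp\Big(-\int_S \int_{t_{i-1}}^{t} \lambda^*(u, \tau)\, d\tau\, du\Big) \lambda^*(s, t)\, dt\, ds.$$

Computational Complexity. It is worth noting that both learning and inference require the conditional intensity. If the conditional intensity has no analytic formula, then we need to compute a numerical integral over $S$. Evaluating the likelihood or either expectation then requires at least a triple integral. Note that $\mathbb{E}[t_i|\mathcal{H}_{t_{i-1}}]$ and $\mathbb{E}[s_i|\mathcal{H}_{t_{i-1}}]$ are actually sextuple integrals, but we can memoize all $\lambda^*(s, t)$ from $t = t_{i-1}$ to $t \gg t_{i-1}$ to avoid recomputing the intensities. However, memoization leads to high space complexity. As a result, we generally want to avoid an intractable conditional intensity in the model.

PDF Derivation. The model design of DeepSTPP enjoys a closed-form formula for the PDF. First recall that

$$f(s, t|\mathcal{H}_t) = \lambda^*(s, t) \exp\Big(-\int_{t_n}^{t} \lambda^*(\tau)\, d\tau\Big).$$

For DeepSTPP, the spatiotemporal intensity is

$$\lambda^*(s, t) = \sum_{i} w_i\, k_s(s, s_i)\, k_t(t, t_i).$$

The temporal intensity simply removes $k_s$ (which integrates to one), so the bandwidth does not matter:

$$\lambda^*(t) = \sum_{i} w_i\, k_t(t, t_i).$$

Replacing the integral in the original formula then yields

$$f(s, t|\mathcal{H}_t) = \sum_{i} w_i\, k_s(s, s_i)\, k_t(t, t_i) \exp\Big(-\sum_{i} w_i \int_{t_n}^{t} k_t(\tau, t_i)\, d\tau\Big).$$

With the temporal kernel function $k_t(t, t_i) = \exp(-\beta_i (t - t_i))$, we reach the closed-form formula

$$f(s, t|\mathcal{H}_t) = \sum_{i} w_i\, k_s(s, s_i)\, k_t(t, t_i) \exp\Big(-\sum_{i} \frac{w_i}{\beta_i}\big[k_t(t_n, t_i) - k_t(t, t_i)\big]\Big).$$

Inference. The expectation of the next event time is

$$\mathbb{E}[t_i|\mathcal{H}_{t_{i-1}}] = \int_{t_{i-1}}^{\infty} t\, \lambda^*(t) \exp\Big(-\int_{t_{i-1}}^{t} \lambda^*(\tau)\, d\tau\Big)\, dt,$$

where the inner integral has a closed form, so only 1D numerical integration is required. Given the predicted time $\hat{t}_i$, the expectation of the location can be efficiently approximated by

$$\mathbb{E}[s_i|\mathcal{H}_{t_{i-1}}] \approx \mathbb{E}[s_i|\hat{t}_i, \mathcal{H}_{t_{i-1}}] = \frac{\sum_{j} w_j\, k_t(\hat{t}_i, t_j)\, s_j}{\sum_{j} w_j\, k_t(\hat{t}_i, t_j)}.$$

Spatiotemporal Hawkes process (STHP). The spatiotemporal Hawkes (or self-exciting) process is one of the most well-known STPPs. It assumes every past event has an additive, positive, decaying, and spatially local influence over future events. Such a pattern resembles neuronal firing and earthquakes.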
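The spatiotemporal Hawkes process is characterized by the following intensity function (Reinhart et al., 2018):

$$\lambda^*(s, t) := \mu g_0(s) + \sum_{i: t_i < t} g_1(t, t_i)\, g_2(s, s_i), \quad \mu > 0,$$

where $g_0(s)$ is the probability density of a distribution over $S$, $g_1$ is the triggering kernel, often implemented as the exponential decay function $g_1(\Delta t) := \alpha \exp(-\beta \Delta t)$ with $\alpha, \beta > 0$, and $g_2(s, s_i)$ is the density of a unimodal distribution over $S$ centered at $s_i$.

Maximum Likelihood. For the spatiotemporal Hawkes process, we pre-specified the model kernels $g_0(s)$ and $g_2(s, s_j)$ to be Gaussian:

$$g_0(s) := \frac{1}{2\pi} |\Sigma_{g0}|^{-\frac{1}{2}} \exp\Big(-\frac{1}{2}(s - s_\mu)\, \Sigma_{g0}^{-1}\, (s - s_\mu)^T\Big), \qquad g_2(s, s_j) := \frac{1}{2\pi} |\Sigma_{g2}|^{-\frac{1}{2}} \exp\Big(-\frac{1}{2}(s - s_j)\, \Sigma_{g2}^{-1}\, (s - s_j)^T\Big).$$

Specifically for the STHP, the second term in the STPP likelihood evaluates to

$$\int_0^T \int_S \lambda^*(u, \tau)\, du\, d\tau = \mu T + \frac{\alpha}{\beta} \sum_{i=1}^{n} \Big(1 - e^{-\beta (T - t_i)}\Big).$$

Finally, the STHP log-likelihood is

$$\log p(\mathcal{H}_T) = \sum_{i=1}^{n} \log \lambda^*(s_i, t_i) - \mu T - \frac{\alpha}{\beta} \sum_{i=1}^{n} \Big(1 - e^{-\beta (T - t_i)}\Big).$$

This model has 11 scalar parameters: 2 for $s_\mu$, 3 for $\Sigma_{g0}$, 3 for $\Sigma_{g2}$, $\alpha$, $\beta$, and $\mu$. We directly estimate $s_\mu$ as the mean of $\{s_i\}_0^n$, and then estimate the other 9 parameters by minimizing the negative log-likelihood using the BFGS algorithm. $T$ in the likelihood function is treated as $t_n$.

Inference. Based on the general formulas in Appendix A.1, and noting that for an STHP

$$\int_S \lambda^*(s, t)\, ds = \mu + \sum_{j: t_j < t} g_1(t, t_j), \qquad \int_S s\, \lambda^*(s, t)\, ds = \mu s_\mu + \sum_{j: t_j < t} g_1(t, t_j)\, s_j,$$

both the expected time and the expected location require only 1D numerical integration.

Spatiotemporal Self-Correcting process (STSCP). A lesser-known example is the self-correcting spatiotemporal point process (Isham and Westcott, 1979). It assumes that the background intensity increases with a varying speed at different locations, and the arrival of each event reduces the intensity nearby. The next event is likely to be in a high-intensity region with no recent events. The spatiotemporal self-correcting process is capable of modeling some regular event sequences, such as an alternating home-to-work travel sequence. It has the following intensity function:

$$\lambda^*(s, t) = \mu \exp\Big(g_0(s)\beta t - \sum_{i: t_i < t} \alpha\, g_2(s, s_i)\Big), \quad \alpha, \beta, \mu > 0.$$

Here $g_0(s)$ is the density of a distribution over $S$, and $g_2(s, s_i)$ is the density of a unimodal distribution over $S$ centered at $s_i$.

In this appendix, we discuss a general algorithm for simulating any STPP, and a specialized algorithm for simulating an STHP. Both are based on an algorithm for simulating any TPP.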
The most widely used technique to simulate a temporal point process is Ogata's modified thinning algorithm, as shown in Algorithm 1 (Daley and Vere-Jones, 2007). It is a rejection technique: it samples points from a stationary Poisson process whose intensity is always higher than the ground truth intensity, and then randomly discards some samples to get back to the ground truth intensity. The algorithm requires picking the forms of $M^*(t)$ and $L^*(t)$ such that $\sup\{\lambda^*(t + \Delta t) : \Delta t \in [0, L^*(t)]\} \le M^*(t)$. In other words, $M^*(t)$ is an upper bound of the actual intensity in $[t, t + L^*(t)]$. It is noteworthy that if $M^*(t)$ is chosen to be too high, most sampled points are rejected, which leads to an inefficient simulation. When simulating a process with decreasing inter-event intensity, such as the Hawkes process, $M^*(t)$ and $L^*(t)$ can simply be chosen to be $\lambda^*(t)$ and $\infty$. When simulating a process with increasing inter-event intensity, such as the self-correcting process, $L^*(t)$ is often empirically chosen to be $2/\lambda^*(t)$, since the next event is very likely to arrive before twice the mean interval length at the beginning of the interval. $M^*(t)$ is therefore $\lambda^*(t + L^*(t))$. It has been mentioned in Section 2.1 that an STPP can be seen as attaching the locations sampled from $f^*(s|t)$ to the events generated by a TPP. Simulating an STPP therefore adds one step to Algorithm 1: sample a new location from $f^*(s|t)$ after retaining a new event at $t$. For a spatiotemporal self-correcting process, neither $f^*(s, t)$ nor $\lambda^*(t)$ has a closed form, so the process's spatial domain has to be discretized for simulation. $\lambda^*(t)$ can be approximated by $\sum_{s \in S} \lambda^*(s, t)/|S|$, where $S$ is the set of discretized coordinates. $L^*(t)$ and $M^*(t)$ are chosen to be $2/\lambda^*(t)$ and $\lambda^*(t + L^*(t))$.
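As an illustration of the thinning step described above for the decreasing inter-event intensity case, here is a minimal sketch for a purely temporal Hawkes process with an exponential triggering kernel. It is not the authors' simulator. The upper bound is the current intensity and the look-ahead window is unbounded, as stated in the text. The function names, the parameter values, and the use of a right-continuous intensity (so that the bound includes the jump of an event accepted at the current time) are assumptions made for this example.

```python
import numpy as np

def hawkes_intensity(t, events, mu, alpha, beta):
    """Right-continuous intensity: mu + sum over t_i <= t of alpha * exp(-beta * (t - t_i))."""
    past = np.asarray(events, dtype=float)
    past = past[past <= t]
    return mu + float(np.sum(alpha * np.exp(-beta * (t - past))))

def simulate_hawkes_by_thinning(mu, alpha, beta, t_end, rng):
    """Simulate event times on (0, t_end] by thinning.

    Between events the intensity only decays, so its value at the current
    time is a valid upper bound until the next accepted event.
    """
    events = []
    t = 0.0
    while True:
        upper = hawkes_intensity(t, events, mu, alpha, beta)
        # Candidate from a homogeneous Poisson process with rate `upper`.
        t += rng.exponential(1.0 / upper)
        if t > t_end:
            break
        # Keep the candidate with probability intensity(t) / upper.
        if rng.uniform() * upper <= hawkes_intensity(t, events, mu, alpha, beta):
            events.append(t)
    return np.array(events)

# Hypothetical parameters; alpha / beta < 1 keeps the process stable.
rng = np.random.default_rng(0)
events = simulate_hawkes_by_thinning(mu=1.0, alpha=0.5, beta=1.0, t_end=50.0, rng=rng)
```

Extending this sketch to an STPP would add one step after each accepted time, namely drawing a location from the conditional spatial density at that time.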
Since $f^*(s|t)$ is proportional to $\lambda^*(s, t)$, sampling a location from $f^*(s|t)$ is implemented as sampling from a multinomial distribution whose probability mass function is the normalized $\lambda^*(s, t)$.

To simulate a spatiotemporal Hawkes process with a Gaussian kernel, we mainly followed an efficient procedure proposed by Zhuang (2004), which makes use of the clustering structure of the Hawkes process and thus does not require repeated calculations of $\lambda^*(s, t)$.

Algorithm 2: Simulating a spatiotemporal Hawkes process with Gaussian kernel
1: Generate the background events $G^{(0)}$ with the intensity $\lambda^*(s, t) = \mu g_0(s)$, i.e., simulate a homogeneous Poisson process $\mathrm{Pois}(\mu)$ and sample each event's location from a bivariate Gaussian distribution $\mathcal{N}(s_\mu, \Sigma)$
2: $\ell = 0$, $S = G^{(\ell)}$
3: while $G^{(\ell)} \ne \emptyset$ do
4: for $i \in G^{(\ell)}$ do
5: Simulate event $i$'s offspring $O_i^{(\ell)}$ with the intensity $\lambda^*(s, t) = g_1(t, t_i)\, g_2(s, s_i)$, i.e., simulate a non-homogeneous Poisson process $\mathrm{Pois}(g_1(t, t_i))$ by Algorithm 1 and sample each event's location from a bivariate Gaussian distribution $\mathcal{N}(s_i, \Sigma)$

For the synthetic datasets, we pre-specified both the STSCP's and the STHP's kernels $g_0(s)$ and $g_2(s, s_j)$ to be Gaussian:

$$g_0(s) := \frac{1}{2\pi} |\Sigma_{g0}|^{-\frac{1}{2}} \exp\Big(-\frac{1}{2}(s - s_\mu)\, \Sigma_{g0}^{-1}\, (s - s_\mu)^T\Big), \qquad g_2(s, s_j) := \frac{1}{2\pi} |\Sigma_{g2}|^{-\frac{1}{2}} \exp\Big(-\frac{1}{2}(s - s_j)\, \Sigma_{g2}^{-1}\, (s - s_j)^T\Big).$$

The STSCP is defined on $S = [0, 1] \times [0, 1]$, while the STHP is defined on $S = \mathbb{R}^2$. The STSCP's kernel functions are normalized according to their cumulative probability on $S$. Table 5 shows the simulation parameters. The STSCP's spatial domain is discretized as a 101 × 101 grid during the simulation.

In this section, we include experiment configurations and some additional experiment results.
Recurrent neural networks for multivariate time series with missing values
Neural ordinary differential equations
Neural spatio-temporal point processes
Empirical evaluation of gated recurrent neural networks on sequence modeling
An introduction to the theory of point processes: volume II: general theory and structure
Gru-ode-bayes: Continuous modeling of sporadically-observed time series
Recurrent marked temporal point processes: Embedding event history to vector
Augmented neural odes
Gstnet: Global spatial-temporal network for traffic flow prediction
How to train your neural ode
Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting
Anode: Unconditionally accurate memory-efficient gradients for neural odes
Long short-term memory
A self-correcting point process. Stochastic processes and their applications
Neural jump stochastic differential equations
Neural controlled differential equations for irregular time series
Auto-encoding variational bayes
Diffusion convolutional recurrent neural network: Data-driven traffic forecasting
Deep sequential multi-task modeling for next check-in time and location prediction
Discovering latent network structure in point process data
The neural hawkes process: A neurally self-modulating multivariate point process
Self-exciting point process modeling of crime
Statistical inference and simulation for spatial point processes
Discrete event, continuous time rnns
Neural ode processes. ICLR
On lewis' simulation method for point processes
Deep mixture point processes: Spatio-temporal event prediction with rich contextual information
Comparison of correlation analysis techniques for irregularly sampled time series
A review of self-exciting spatio-temporal point processes and their applications
Sabiheh Sadat Faghih, and Bahman Moghimi. Spatio-temporal modeling of yellow taxi demands in new york city using generalized star models

This work was supported in part by Adobe Data Science Research Award, U.S. Department Of Energy, Office of Science, U.S. Army Research Office under Grant W911NF-20-1-0334, and NSF Grant #2134274.

For a better understanding of DeepSTPP, we list the detailed hyperparameter settings in Table 6. We use the same set of hyperparameters across all datasets.