key: cord-0843633-nir6alt1
authors: Lei, Yiyuan; Ozbay, Kaan; Xie, Kun
title: Safety analytics at a granular level using a Gaussian process modulated renewal model: A case study of the COVID-19 pandemic
date: 2022-05-23
journal: Accid Anal Prev
DOI: 10.1016/j.aap.2022.106715
sha: 364a7f682b463849a87664769c0c482ab21776e1
doc_id: 843633
cord_uid: nir6alt1

With the advance of intelligent transportation system technologies, contributing factors to crashes can be obtained in real time. Analyzing these factors can be critical in improving traffic safety. Despite many crash models having been successfully developed for safety analytics, most models associate crash observations and contributing factors at the aggregate level, resulting in potential information loss. This study proposes an efficient Gaussian process modulated renewal process model for safety analytics that does not suffer from information loss due to data aggregations. The proposed model can infer crash intensities in the continuous-time dimension so that they can be better associated with contributing factors that change over time. Moreover, the model can infer non-homogeneous intensities by relaxing the independent and identically distributed (i.i.d.) exponential assumption of the crash intervals. To demonstrate the validity and advantages of this proposed model, an empirical study examining the impacts of the COVID-19 pandemic on traffic safety at six interstate highway sections is performed. The accuracy of our proposed renewal model is verified by comparing the areas under the curve (AUC) of the inferred crash intensity function with the actual crash counts. Residual box plot shows that our proposed models have lower biases and variances compared with Poisson and Negative binomial models. Counterfactual crash intensities are then predicted conditioned on exogenous variables at the crash time. Time-varying safety impacts such as bimodal, unimodal, and parabolic patterns are observed at the selected highways. The case study shows the proposed model enables safety analytics at a granular level and provides a more detailed insight into the time-varying safety risk in a changing environment.

For traffic safety analytics, contributing factors to crashes such as weather conditions, traffic volumes, and traffic speed are usually collected at the aggregate level. Crash events, however, are discrete in the time dimension. The incompatibility between aggregated covariates and discrete crash data can result in problematic analysis when they are associated without care. Such incompatibility is known as the fundamental dilemma in the traffic safety analysis (Chang and Jovanis, 1990) . A common approach to avoid incompatibility is to model crash occurrences by aggregating crash events per month, quarter, or year. Continuous contributing factors are then usually represented by their averages over the time intervals. A limitation of this approach is that aggregations can lead to information loss and reduced sample size (Guo, 2010; Usman et al., 2011) . Depending on the level of aggregations, datasets that are further classified into refined subgroups are considered disaggregate data. For example, aggregations by each year can be considered as disaggregate data with respect to data aggregated per three years. Studies show that using disaggregate data can provide a more accurate parameter estimation (Sengupta et al., 2021) and a smaller standard error (Bae et al., 2021) than aggregate one. The granular level is reached when disaggregate crash datasets are broken into the time of the crash. As intelligent transportation systems (ITS) are being more widely deployed, some safety-related factors can now be obtained continuously through advanced data management technologies. Few existing crash models, however, associate crash observations and contributing factors at the granular level, to reduce information loss due to aggregation (homogeneous intensities at aggregated intervals fail to reveal fluctuations).

At the aggregate level, traditional Poisson regression models (Gustavsson and Svensson, 1976) which estimate the average crash occurrences in unit time or space are known to be prone to overdispersion. To address overdispersion caused by inequalities between mean and variance or excessive zero counts, models such as mixtures of negative binomial regression (or Poisson-Gamma mixture) models (Lord et al., 2005; Zou et al., 2013; 2014) , and zero-inflated regression models (Yan et al., 2012; Hall and Tarko, 2019) are employed for crash analysis. These models study the crash intensities at the aggregated time intervals. Instead of studying the crashes at the aggregate level, survival analysis models (or duration models) study the crash occurrences as a sequential process (Jovanis and Chang, 1989; Chang and Jovanis, 1990; Shankar et al., 2008; Chen and Guo, 2016; Xie et al., 2019) at the disaggregate level. Jovanis and Chang (1989) discussed the advantages of using survival theory principles to combine discrete crash data and aggregate exposure data. The widely used survival models such as Cox proportional hazard (semi-parametric) models, however, rely on the assumptions of proportional hazard functions over time, which can be hard to satisfy in practice. Xie et al. (2019) proposed a Bayesian survival analysis model which accounted for the crash hazard functions (Hensher and Mannering, 1994; Kalbfleisch and Prentice, 2011) during the consecutive crash times across sites. While this study modeled individual crashes, covariates such as traffic volumes, however, were aggregated during the crash intervals. Since crash events can be rare by nature, the duration between two crashes tends to be long. Aggregate covariates measuring the variations coarsely in a long interval can cause information loss. Additionally, another limitation is that observed crashes or crash intervals are assumed to be exponential independently and identically distributed (i.i.d.), which may not be satisfied in practice since the crash sites can be spatiotemporally correlated (Lord and Mannering, 2010; Mannering and Bhat, 2014; Mannering et al., 2016) . Many researchers applied hierarchical models (Park and Lord, 2007; El-Basyouny and Sayed, 2009; Usman et al., 2012; Demiroluk and Ozbay, 2014; Xie et al., 2014; Cui and Xie, 2021) to address the spatiotemporal correlations. For example, Demiroluk and Ozbay (2014) proposed a hierarchical Bayesian model to estimate the spatial correlations using traffic exposure data aggregated at the county level. Usman et al. (2012) applied a multi-level model for crash analysis during storm events using hourly aggregated contributing factors, which addressed correlations within storm events. While these hierarchical models can perform crash analysis well without relying on the i.i.d. assumption by accounting for the spatiotemporal correlations, time-varying crash patterns can be hard to capture. As an alternative approach to relax the i.i.d. assumption and capture the non-homogeneous arrival rates at continuous time, Lasko (2014) proposed a Gaussian-process modulated renewal process (GPMRP) model that drops the exponential i.i.d. assumption. The model has been validated to be at least twice accurate and more efficient than existing best-known methods using simulated data (Lasko, 2014) . Most importantly, GPMRP is not limited to model memoryless Poisson process, but also applies to a wider range from bursty events to refractory processes (Lasko, 2014) .

In this paper, we model the crash arrivals as a point stochastic process. We propose GPMRP to infer the distributions of crash intensities. We use the summary statistics such as means and 95 percentile intervals to describe the intensity distribution. The validity of GPMRP inference will be evaluated by integrating the crash intensity over time using the trapezoid method. Through comparing with Poisson regression and negative binomial regression models, we show the benefits of using GPMRP are as follows: 1) capturing the non-homogeneous crash events by dropping the i.i.d. assumption and temporal correlations by constructing a squared exponential covariance matrix; 2) inferring the crash intensities at a continuous time dimension to be combined with covariates, which discrete counting models fail to handle such scenarios (Brillinger, 1994) . As a case study, we apply the GPMRP model to study crashes at 6 different locations in the United States during the COVID-19 pandemic. Researchers in different countries such as Canada (Rapoport et al., 2021) , the United States (Doucette et al., 2021; Li et al., 2021) , Italy (Colonna and Intini, 2020) , Greece (Vandoros, 2021) , and Japan (Inada et al., 2021) have evaluated the impact of the pandemic on traffic safety. The time-varying variations of crash intensities during the pandemic however are rarely explored. Our case study aims to fill this gap and demonstrate how crash events and covariates can be combined at the granular level for time-dependent crash analysis.

This section introduces the methodology of crash intensity inference using the GPMRP model. The probability density function of crash events is summarized in section 2.1, and section 2.2 shows how posterior distributions of crash intensities are computed.

Given the crash site k, let T = {t 1 , t 2 , ⋯, t n } be the set of crash event time up to event n. Time intervals δ i = t i − t i− 1 (the sequential event order i is a positive integer starting from 1 to n), where t 0 represents the initial observation time. Instead of modeling crash event time T as directly a homogeneous gamma process, a modulated Gamma process which drops the i.i.d. assumption is adopted (Lasko, 2014) . Let the intensity function represent the number of crashes per unit time as λ(t). Denote Λ(t i ) as the number of events from 0 to t i , which can be obtained by the integration of intensity function over time:

In order to obtain a smooth function of λ(t), given two crash occurrence time t i and t j , a Gaussian process with zero mean and squared exponential covariance matrix Σ is put on the logarithmic intensity function f(t):

where a squared exponential matrix Σ with signal variance σ 2 and length scale l is calculated as:

To this end, the general form of crash events' probability density function f 1 for Bayesian inference can be described as:

where Γ stands for the gamma function, α stands for the shape parameter. Depending on the value of α with respect to 1, bursty events, memoryless Poisson events and regular events can be modelled (Lasko, 2014) . The warped interval ω i = Λ(t i ) − Λ(t i− 1 ) is modeled by homogeneous gamma process with shape α, inverse scale parameter β, and probability density function as f 2 :

The posterior distributions of intensity function λ(t), shape parameter α, signal variance and length scales are inferred by the Markov Chain Monte Carlo (MCMC) simulation. The objective is to maximize the likelihood equations f 1 (T; α, λ(t) ). To improve computational efficiency, Lasko (2014) applied equally spaced grid intervals T ∈ [0, t n + 1] and numerical integrations to obtain the intensity functions rather than directly computes the likelihood functions at all observed events n. The intensities at the time of the crash are then obtained from interpolations which are guaranteed by the smooth covariance matrix given by equation (3). To begin with, the prior distributions of shape parameter α, signal variance σ 2 , and length scale l are assumed to be uniform. Next, the log intensities f(t) given by equation (2) is computed at the given grid intervals using slice sampling. Then, the log intensity of f(t) is calculated by smooth interpolation of f(t). After that, the warped interval ω i is calculated using trapezoid integration method. Finally, the likelihood function given by f 1 (T; α, λ(t) ) is calculated so that the shape parameter α and likelihood function will be updated by Metropolis-Hasting algorithms. Notice that the inverse scale parameter β is set as 1 to avoid identifiability problem and ω 1 , ω n are modeled by f 2 (ω i ; 1, 1)

for computational efficiency (Lasko, 2014) .

As shown in Fig. 1 , the empirical study will demonstrate how GPMRP models can be applied for disaggregate crash analysis and to infer the time-varying safety impact of an evolving event. We modeled the interstate highway sections as unit points as an example. The objective is to perform crash analysis to evaluate the time-dependent impact of the pandemic on crash intensities. Let the intervention Z be the dates after March 13, 2020, when the United States declared a national emergency due to the outbreak of COVID-19. The impacts are measured as the difference between the intervention group (the observed crash intensity λ i ) and the missing counterfactual group that should have been without the intervention (crash intensity λ i ). At step I, crash data sources including both crash data, contributing factors and COVID-19 confirmed cases are collected and cleaned. To avoid information loss due to aggregated covariates during the crash intervals, continuous crash intensities are first inferred using the GPMRP model at step II. Direct comparison of the crash intensity functions before and after the pandemic can have hidden biases, as it ignores exogenous variables, which might affect crashes. Hence, at step III, individual crash intensity (λ i ) and the exogenous variables at the time of the crash (X i ) will be used to restore the counterfactual group. Let the historical crash data from January 1, 2017 to December 31, 2019 be the training dataset, crash data from January 1, 2020 to March 13, 2020 be the validating dataset, and crash events after March 13, 2020 be the counterfactual dataset to be predicted. The function mapping between the discrete λ i and X i will be learned by a supervised machine learning algorithm using the training dataset. Selected based on the minimum validation errors, the best machine learning model will be used to predict the expected counterfactual crash intensities conditional on X i of the intervention group. Finally, the crash intensity evaluations during the pandemic will be evaluated based on the difference between λ i and λ i .

The countrywide crash dataset is an open-source crash data augmented with multiple application programming interfaces (API) including Wunderground weather data, TimeAndDate light conditions, crash occurrence time, and locations 1 (Moosavi et al., 2019) . The crash data contains over 2 million crashes in the United States with a time span of over 4 years, which meets the threshold to achieve statistically meaningful results (Chang et al., 2017) . We modeled the crash occurrences at interstates as a point stochastic process. This case study focuses on the interstate sections with the highest crash counts from January 2017 to December 2020 in 6 districts: New For example, intersections such as I-5N CA, has experienced an increasing number of crashes starting from 2019 Q2 and a rapid decrease during 2020 the first half-year, before another drastic increase starting at Q3. The exact time when these changes occur, however, is lost at the process of aggregation. To prevent such information loss, crash intensities (the number of crashes per unit time) will be inferred by the GPMRP model at step II.

Due to the limitation of the data source lacking detailed exposure data such as traffic volumes, vehicle miles traveled, or vehicle hours traveled at granular level, the contributing factors are selected as exogenous variables at the time of the crash, such as temperature, humidity, visibility, wind speed, precipitation and light conditions, which were unaffected by the interventions. Usman et al. (2012) found that temperature and visibility can be significantly negatively correlated with crashes, and wind speed can be significantly positively correlated with crashes. Hourly precipitation, however, showed no evidence of significance. Durduran (2010) and Schlögl (2020) showed that air temperature, air pressure, and precipitation are very important weather covariates for crash analysis. The exogenous dataset has first been cleaned and preprocessed by removing the outliers or miscoded values and replacing the missing values. For example, a data record with the air pressure of 196 kPa in I-5N CA is removed because it is an unreasonable extreme value. The missing values for precipitation (<5%) have been filled with the mode value because the skewed frequency of crash events is the highest at zero precipitation. Table 1 summarizes the summary statistics of the numerical traffic weather data. From the table, we know that humidity has the largest standard deviations (about 22.11) while precipitation has the smallest deviations (about 1.07). There are mainly over 30 types of categorical weather conditions and two types of categorical light conditions as well as seven days of the week. The label encoding method is applied to address these categorical datasets.

Four years of crash events from 2017 to 2020 (t max = 1461)in 6 interstate sections are modeled by using the GPMRP model at Step II. Considering the trade-off between accuracy and computational efficiency, the number of burn-in and iteration times are set as 1000 respectively. Shape parameter α is assumed to be a uniform distribution However, this does not necessarily mean that those interstates did not experience significant impacts during the pandemic, due to the annual vehicle-miles travelled are estimations based on the proportions of the studied intersections over the total interstate lengths within their states. The proportions for I-94W IL and I-93N MA are about 3% and 8%, while other interstates comprised up about over 20% of the total interstate lengths in the states, the actual vehicle-miles travelled for I-94W IL and I-93N MA may be underestimated. Hence, the evaluations of the change in crash intensities change during the pandemic need further analysis. Compared with Fig. 2(b) , the model is able to capture the non-homogeneous crash intensity functions. Furthermore, the Gaussian Process covariance functions are able to obtain smoother curvatures which show the value of crash intensities at a continuous time-space so that crash data can be studied at the individual level. To sum up, step II results provide direct visualization of how non-homogeneous crash events are changed during the 4-year of observation.

To validate the accuracy of the GPMRP model results, a trapezoid numerical integration method is applied to obtain areas under curve (AUC) of crash times using equation (1). Table 2 , shows the inferenced crash times compared with actual crash events per year for the six interstates. The results show that the inferenced average crash intensities are generally close to the actual crash times. The relative errors between inferenced crash cases and actual cases have relatively low inference errors before 2020 (<5%), however, the inference error for I-5N WA, I-94W IL, and I-395N DC in 2020 have an estimated error rate up to 6.7%. The possible reason for this is that crash events during the pandemic are non-homogeneous. To reduce the computational cost, our prosed GPMRP model uses a grid time T to compute the intensity functions and then performs a smooth interpolation at the time of the crash T to infer the crash intensities. The equally spaced grid time can increase the inference error when the crash events present heterogeneous crash intensities. Considering the actual crash times fall in the 95% confidence intervals obtained by the GPMRP model, the accuracy of model inference results for crash intensities can be considered as acceptable.

In order to evaluate the information loss due to aggregation and traditional regression models, Poisson regression models and negative binomial regression models using quarterly aggregated crash events are performed as baseline comparisons. For Poisson regression models, the crash counts per quarter assumes to follow a Poisson distribution with the expected crash intensity (λ i ), which are linked with the exponential functions of the linearly combined covariates X i . The negative binomial models can be regarded as a generalized Poisson regression to compensate the overdispersion, where the variance is no less than the mean. It assumes that λ i follows a gamma distribution, so that the crash counts per quarter follows a negative binomial (Poisson-Gamma mixture) distribution. A stepwise forward selection has been adopted in order to include important weather covariates. The coefficients of the covariates are computed based on maximum likelihood estimation method. The regression coefficients, standard errors, the 95% confidence intervals as well as significance levels for negative binomial models are attached in Tables A.2.1-A.2.6 in Appendix. Fig. 4 compares the expected crash intensities based on actual counts, Gaussian Process modulated renewal models (GP), Poisson Fig. 5 . Overall, both biases and variances of GP are smaller than PR and NB, suggesting that the inferenced granular crash intensities exhibit less information loss in case of GP models. Outliers for most of the sections shown in Fig. 5 are found to occur during the second half year of 2020, after the stay-at-home policies in the six districts were imposed. The specific residual error records can be found in Tables A.3 .1-A.3.6 in Appendix.

In order to reduce the biases from exogenous variables such as weather conditions, lighting conditions, the inferred crash intensities at the time of the crash are combined with the exogenous variables and categorical variables such as day of the week. To evaluate the impacts on crash intensity (λ) during the pandemic, the common approach is to compare the difference between the observed intervention group E[λ|Z = 1]and the missing counterfactual group E[λ|Z = 0] (Rubin, 1974) , where Z is the binary intervention. We make an assumption that the historical training group and the counterfactual group E[λ|Z = 0] follows a same mapping function between the historical crash intensities and real-time exogenous variables. In other words, if the coronavirus did not exist and no national emergency was ever announced, the crash intensities in 2020 can be predicted by the previous years' crash intensities given the same traffic conditions such as weather conditions. This mapping assumption is supported by the observations from Fig. 2(b) and Fig. 3 : except I-5N CA, all other interstate sections did not experience a sharp increase in their crash intensity functions until March 2020. The fairness of the assumption will be justified by the validation errors from the validation dataset. Under this assumption, statistical models are applied to restore counterfactual crash intensity as a comparison to the intervention group. The restored counterfactual group will be predicted conditioned on exogenous variables X z=1 . Five machine learning models are selected as follows: historical mean model (Li et al., 2015) , support vector regression model (Wu et al., 2004; Ahn et al., 2016) , linear regression Model, XGBoost model (Dong et al., 2018; Mei et al., 2018; Shi et al., 2018) , random forest model (Wang et al., 2018) .

Historical Mean model (HM): this model takes the average of historical inferred crash intensities before March 13, 2020, and the predicted intensities will be the interpolated values at the corresponding time. Support Vector Regression model (SVM): A radial-based kernel function is used to find the best fitted nonlinear relationships between training attributes (hyperplane) and training crash intensities (boundary lines).

Linear Regression model (LR): Multiple linear regression model predicts the expected traffic intensities conditioned on the exogenous variables by minimizing the least square residuals.

XGBoost model (XGB): A gradient boosting tree model with higher accuracy and lower computational costs which contains tree-pruning and regularizations to prevent overfitting.

Random Forest model (RF): A tree-based prediction model with a subset of randomly selected attributes from the exogenous variables to prevent overfitting.

To justify our assumptions made earlier and evaluate the accuracy of statistical models' restorations, evaluation metrics needs to be picked. Considering crash events in I-395 N are sparse, thus the inferred crash intensity λ i at the denominator can be smaller than 1, resulting in a large numerical error if mean absolute percentage error (MAPE) is used. Hence, rooted mean square error (RMSE) is selected as the evaluation metrics, which is written as:

where n is the maximum number of crash events in the validation period, λ i is the inferred crash intensity during the validation period, λ i is the predicted crash intensity with training dataset from crash events before 2020. Table 3 shows validation results for the six interstate sections. As highlighted in black, models with the smallest validation error will be selected as the best model for counterfactual restoration. Counterfactual crash intensities after March 13, 2020, are restored as shown in Fig. 6 . The black solid line is the inferred crash intensities. The best prediction models are shown in the other line. The best prediction model for I-5N CA is the XGB model. Notice that I-5N CA crash patterns have higher variations starting from July 1st, 2019, which violates the assumption that crash intensity distributions are identical before March 2020. The training dataset has two distinct probability distributions, leading to dense spikes in its prediction results. The best statistical models for I-5N WA and I-385N DC are also XGB, different from I to 5N CA, no significant violation of our proposed assumptions found in their crash intensity distributions. The best prediction model for I-90E NY is the historical mean model with a RMSE around 0.0652, the restored counterfactual data is shown in the red solid line. RF model is the best prediction model for counterfactual construction with a RMSE around 0.4210, which is shown in the green solid line.

The GPMRP model allows crash data and covariates to be combined at the individual level. The usage of real-time covariates helps to reduce the estimation biases from weather conditions and light conditions. One of the benefits is that the time-varying impacts of crash intensity changes during the pandemic period can be obtained. Fig. 7(a) shows the impacts based on the difference between crash intensities at the intervention group λ i and restored counterfactual group λ i . The vertical lines are the marked important timestamps as a reference. The normalized Covid-19 case numbers at nationwide is included in Fig. 7 . for visual comparisons. While I-5N CA has increased crash intensities up to 50 in the later periods of 2020, other 5 interstate sections have relatively smaller changes in crash intensities (no more than 10/unit time). In order to compare the impacts of six interstate sections, the impacts are normalized to the same scale from − 1 to 1. Three distinct time-varying patterns can be observed: I-5N CA has a parabolic pattern; I-395N DC and I-5N WA have bimodal patterns, and I-93N MA and I-94W IL have unimodal patterns.

The empirical study section of the paper showed safety analytics performed at a granular level using GPMRP and insights of COVID-19 impacts on safety. The spread of COVID-19 together with stay-at-home policies raises challenges to crash frequency modeling and data aggregation, as using aggregation per quarter may fail to reveal important patterns of crash intensities. Existing aggregated crash modeling methods rely on the i.i.d. assumptions of crash events (see Mannering and Bhat, 2014; Abdulhafedh, 2016; Zheng et al., 2020 for a systematic review), crash occurrences, however, did not follow the identical patterns before the pandemic (see Fig. 2b ). In addition, models fitted by aggregated data can be insufficient to capture the non-homogeneous information. Following the work of Demiroluk and Ozbay (2015) ; Li et al. (2017) ; to model non-homogeneous crash intensity, we proposed to apply the GPMRP model, which is a modulated renewal stochastic process that relaxes the i.i.d. assumptions. It uses the Bayesian inference approach to estimate the distribution of crash intensity functions by maximizing the Gamma likelihood functions, which provided confidence intervals for the inferred crash intensity (Fig. 3) .

The crash intensities inferred by the GPMRP model present detailed inhomogeneous properties during the pandemic. For instance, from Fig. 2(b) in Section 3.2, one can observe that the number of crashes per unit mile for I-93N MA experienced its peak near Q2 in 2020, which was the time when one of the major cities, Boston, announced the emergency order due to COVID-19 (from 03/24/2020 to 04/07/2020). The detailed crash intensity variation information at the granular level was hidden due to aggregation per quarter. In contrast, based on the GPMRP inference results seen on Fig. 3 and Fig. 4 ., the crash intensity of I-93N MA first reduced slightly near mid-April 2020 before its intensity continued to climb up. Moreover, different from the coarsely aggregated crash occurrences, the application of the Gaussian Process with a squared exponential covariance matrix is found to be able to infer the crash intensities smoothly considering temporal correlations, presenting direct comparisons of historical intensity during different periods. As an example, the inferred average intensity of I-94W IL in Fig. 3 showed that there were four peaks on average each year from 2017 to 2019 before the spread of COVID-19. When it came to the year 2020, the crash intensities experienced one major peak from 03/01/2020 to 08/01/2020, which was higher than the ones in the previous three years. Similar phenomena can be observed in other studied interstate sections, suggesting that highway traffic safety hazards in 2020 have changed with respect to the previous years. These findings are consistent with the previous stay-at-home related traffic safety studies (NHTSA, 2020; Doucette et al., 2021) . Table 2 and Fig. 5 of the case study showed that our proposed model can accurately infer the four-year crash events based on AUC using equation (1). The results in this paper add to the existing literature on how the inhomogeneous crash intensities of the major interstate sections in continuous time.

Furthermore, the time-varying changes during the pandemic in six interstate sections were quantified. Five machine learning algorithms were employed to combine the crash intensities and exogenous variables under the assumption of each crash is a discrete event. The dashed black line in Fig. 7b was the date when the United States declared a national emergency and the dashed color lines represented the corresponding States that started to reopen. It is noted that an increased crash intensity reached its peak for I-5N WA, I-93N MA, and I-94W IL during the stay-athome policy periods compared with the restored counterfactual crash intensities. I-90 E NY, I-5N CA, and I-395 DC generally experienced a reduced crash intensity during Q2 and Q3 before their intensity differences started to become positive in Q4. The presence of bimodal patterns of I-395N DC and I-5N WA are shown in Fig. 7b , suggesting the presence of perturbation interrupted from the external field (Moslonka and Sekimoto, 2020) . The parabolic pattern of I-5N CA suggests that the crash intensities were increasing rapidly since the third quarter of 2020. This can be expected since different States were proposing "stay-athome" policies as a response to stop spreading the virus and the total traffic volumes across the country during this period were reported to decrease since March of 2020 (Li et al., 2021; USDOT, 2021) . Previous studies provided possible explanations of these COVID-19 impact patterns. Research shows that while the traffic counts in many regions such as California (Vanlaar et al., 2021) , Washington, D.C. (Lazo, 2021) , and Alabama (Adanu et al., 2021) were found to be reduced significantly over the stay-at-home policy compared with the same periods in previous years, speeding-related crash rates, however, increased significantly by nearly 4 times in New York City, approximately 8 times in Seattle (Liao and Lowry, 2021) . Moreover, more than 80% increase in speeding tickets over 100 mph were issued in California during the stay at home policy (Mcgreevy, 2020) .

Since the temperature variations in I-90E NY, I-93N MA, I-94W IL, and I-395N DC were found to be about twice larger than those in I-5N CA and I-5NWA (see Table 1 ). The weather conditions were considered when estimating the crash intensity differences given the observed fluctuations. The temporal effects such as day of the week were also included as categorical data in the model, however, seasonal trends and local trends were not included due to the irregularity of crash events. Moreover, important variables related to traffic exposure such as traffic volumes, traffic speed could not be included in this case study due to the limitation of datasets. The fitted machine learning models will be expected to have fewer variances (spikes) if a more important contributing factor dataset becomes available (see Fig. 6 ). It was demonstrated that the application of GPMRP can quantify time-varying impacts and can be used in future studies to analyze the emerging disaggregate datasets. When the inferred crash intensities are combined with covariates at the granular level, a wider range of information collected by ITS in real-time can be used for analysis. For example, vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) technologies provide nearby vehicles' numbers and speeds at the time of the crash. These emerging data sources can be used to study important contributing factors leading to crash events or near-crash events to reduce the hidden biases (Mannering and Bhat, 2014).

This paper proposed a Gaussian Process modulated renewal process model for the crash intensities estimations for disaggregate crash analysis. One of the benefits of our proposed model is that it can capture the non-homogeneous properties of crash events and allow crash data to be associated with contributing factors at the granular level to avoid information loss. In our case study, we used the GPMRP model to perform disaggregate analysis for interstate sections with the highest crash rates in 6 regions. Based on the normalized crash intensities from January 2017 to December 2020, we can observe that except I-5N CA, other sections (I-90E NY, I-5N WA, I-93N MA, I-94W IL, and I-395N DC) did not experience a visually drastic increase until 2020 March. Based on this observation, an assumption that the relationship between crash intensities and contributing factors can be modeled by the same mapping functions if COVID-19 has not had happened. We used five machine learning algorithms to model the mapping functions. The best-fitted models were selected based on RMSE. Except for I-5N CA, the counterfactual crash intensities can be properly predicted based on the training dataset, which is as expected because of the fact that I-5N CA violated our mapping assumption (see Section 3.3.3). It experienced a drastic increase from 2019 October. The inferred continuous intensity functions showed their usefulness in estimating the time-varying impacts of COVID-19. Three distinct patterns were found among the six interstate sections that were included in our study: bimodal patterns (I-395N DC, I-90E NY), parabolic pattern (I-5N CA), and unimodal patterns (I-93N MA, I-94W IL). We conclude that the GPMRP model can be applied for disaggregate analysis as it can help identify inhomogeneous properties at the continuous-time dimension when combined with real-time contributing factors.

One of the limitations, however, is that the time-lagging effects between crash events become hard to analyze as the intervals between crash events are unevenly spaced at the individual level. As a future research direction, the temporal correlations, seasonality, and time lagging effects for irregularly sparse crash events shall be explored. Another limitation is that the important real-time traffic exposure datasets and driver information such as age were not included in the contributing factors due to unavailability. With the proliferation of ITS technologies, emerging data sources can be created by advanced sensors and cameras continuously. In addition, with the help of V2V and V2I communication technologies, crash-related information can be stored and accessed efficiently. Better use of continuous-time datasets will help improve traffic safety analytics. One of the practical implications our proposed model is its potential ability to analyze the naturalistic trajectory data at the intersections considering the surrounding vehicles at the time of crash or near crash events. The proposed model is expected to further reduce the estimation biases should more real-time traffic exposure data become available before long. 

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. 

Crash frequency analysis

How did the covid-19 pandemic affect road crashes and crash outcomes in Alabama?

Highway traffic flow prediction using support vector regression and bayesian classifier

Identifying temporal aggregation effect on crash-frequency modeling

Time series, point processes, and hybrids

Proactive methods for road safety analysis

Formulating accident occurrence as a survival process

Evaluating the influence of crashes on driving risk using recurrent event models and naturalistic driving study data

Compensation effect between deaths from covid-19 and crashes: The italian case

An accelerated hierarchical bayesian crash frequency model with accommodation of spatiotemporal interactions

Spatial analysis of county level crash risk in new jersey using severity-based hierarchical bayesian models

A doubly stochastic point process model for modeling crashes along a corridor

Year. Short-term traffic flow prediction based on xgboost

Initial impact of covid-19's stay-at-home order on motor vehicle traffic and crash patterns in Connecticut: an interrupted time series analysis

A decision making system to automatic recognize of traffic accidents on the basis of a gis platform

Collision prediction models using multivariate poissonlognormal regression

Survival analysis

A Poisson regression model applied to classes of road accidents with small frequencies. Scand

Adequacy of negative binomial models for managing safety on rural local roads

Hazard-based duration models and their application to transport analysis

Covid-19 lockdown and fatal motor vehicle collisions due to speed-related traffic violations in japan: a time-series study

Disaggregate model of highway accident occurrence using survival theory

The Statistical Analysis of Failure Time Data

Efficient inference of gaussian-process-modulated renewal processes with application to medical event data

Traffic Counts Fell During the Coronavirus Pandemic, But Road Fatalities Still Increased. The Washington Post

Association between changes in social distancing policies in ohio and traffic volume and injuries

A non-parametric bayesian change-point method for recurrent events

Evaluation of risk change-point for novice teenage drivers

Traffic prediction in a bike-sharing system

Speeding and traffic-related injuries and fatalities during the 2020 covid-19 pandemic: The cases of seattle and new york city

The statistical analysis of crash-frequency data: a review and assessment of methodological alternatives

Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory

Analytic methods in accident research: Methodological frontier and future directions

Unobserved heterogeneity and the statistical analysis of highway accident data

Tickets for speeding in excess of 100 mph surge 87% amid coronavirus shutdown, chp says

Year. Short-term traffic flow prediction based on combination model of xgboost-lightgbm

A countrywide traffic accident dataset

Memory through a hidden martingale process in progressive quenching

Early estimates of motor vehicle traffic fatalities and fatality rate by subcategories through

Multivariate poisson-lognormal models for jointly modeling crash frequency by severity

Impact of covid-19 on motor vehicle injuries and fatalities in older adults in ontario

Estimating causal effects of treatments in randomized and nonrandomized studies

A multivariate analysis of environmental effects on road accident occurrence using a balanced bagging approach

Examining the impacts of crash data aggregation on spf estimation

Analysis of naturalistic driving data: Prospective view on methodological paradigms

Accident risk prediction based on driving behavior feature learning using cart and xgboost

Travel monitoring traffic volume trends

Accident prediction models for winter road safety: does temporal aggregation of data matter?

A disaggregate model for quantifying the safety effects of winter road maintenance activities at an operational level

Covid-19, lockdowns and motor vehicle collisions: Empirical evidence from greece

The impact of covid-19 on road safety in Canada and the United States

Random forest based hourly building energy prediction

Travel-time prediction with support vector regression

A new methodology for before-after safety assessment using survival analysis and longitudinal data

Crash frequency modeling for signalized intersections in a high-density urban road network

Distinguishing between rural and urban road segment traffic safety based on zero-inflated negative binomial regression models

Modeling traffic conflicts for use in road safety analysis: a review of analytic methods and future directions

Application of finite mixture of negative binomial regression models with varying weight parameters for vehicle crash data analysis

Analyzing different functional forms of the varying weight parameter for finite mixture of negative binomial regression models

This work was partially supported by the Connected Cities for Smart Mobility towards Accessible and Resilient Transportation (C2SMART), a Tier 1 U.S. Department of Transportation funded University Transportation Center (UTC) led by New York University. The contents of this paper only reflect the views of the authors who are responsible for the facts and do not represent any official views of any sponsoring organizations or agencies.

The summary statistics of crash events for the six interstate segments are represented in Table A .1. It contains the minimum (min.) maximum (max.), mean and standard deviation (sd.) of the monthly crash counts as shown in the black lines in Fig. 4 .Based on Deviance and Chi-square tests, we do not reject the null hypothesis H0 (H0: the negative binomial models are good fits) at the significance levels (α) of 0.01. To show the importance of the fitted covariates, the coefficients (Coef), standard error of the coefficients (SE), the 95% confidence intervals (95% CI), Z statistics (Z-score) as well as P-value are listed in Tables