Correcting for Reporting Delays in Cyber Incidents

Seema Sangari; Eric Dallal

2022-01-20

Abstract

In an ever evolving cyber domain, delays in reporting incidents are a well-known problem in the cyber insurance industry. Addressing this problem is a prerequisite to obtaining a true picture of cyber incident rates and to modeling them appropriately. The proposed algorithm addresses this problem by creating a model of the distribution of reporting delays and using the model to correct reported incident counts to account for the expected proportion of incidents that have occurred but not yet been reported. In particular, this correction shows an increase in the number of cyber events in recent months rather than the decline suggested by reported counts.

The cyber security domain is evolving rapidly, with new attack vectors emerging regularly, and even the most up-to-date data cannot be considered complete. Cyber incidents take a long time to become known and even longer to appear in online databases. While some cyber events are known immediately after they occur, most are reported many months or years after they actually occurred, resulting in biased data. Marriott's major cyber incident, for example, occurred in 2014 but was only reported in 2018. Major cyber events become headlines in leading newspapers when they are publicly reported rather than when they happened. Smaller cyber events may never be reported at all, or only after extreme delays, as only public companies and those holding personally identifiable information may be obligated to report. Sometimes reporting can take 5-10 years, for various intentional or unintentional reasons: failing to realize that a cyber incident happened, failing to immediately determine the extent of accessed or stolen data, or deciding not to publicize the incident for fear of reputational risk and the consequent financial impacts. As a result, reporting delays are pervasive in historical cyber event databases, and these databases show an apparent decrease in cyber incidents over the most recent few years. Coleman et al. (2021) raised the related concern that cyber incidents may remain undetected due to advanced threat techniques.

Cyber risk modeling firms rely upon historical data to build their models, which are in turn relied upon by cyber insurers for underwriting, portfolio management, and risk transfer. To build robust loss estimation models for today's evolving cyber world, the most recent and up-to-date information is required, with as little bias as possible. Correcting reporting delays in these databases is therefore a key requirement for trustworthy cyber insurance models. With the necessary corrections in place, one can then properly examine temporal trends in the targeting of industries or in attacker tactics.

The following terminology is used throughout this paper:

• i: an incident
• I: the set of all incidents
• delay, δ_i: the time between the incident date and the reporting date
• age, A_i: the time between the incident date and the last incident reporting date in the data
• f_∆: the probability density function of the delay distribution
• F_∆: the cumulative distribution function of the delay distribution
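As a concrete illustration of the delay and age quantities, the following minimal Python sketch computes them from incident and reporting dates. The column names incident_date and report_date are illustrative only, not the actual schema of the data set used in this paper.

```python
import pandas as pd

# Hypothetical incident records; the column names are illustrative only.
incidents = pd.DataFrame({
    "incident_date": pd.to_datetime(["2014-07-01", "2018-03-15", "2018-11-20"]),
    "report_date":   pd.to_datetime(["2018-11-30", "2018-06-01", "2019-01-10"]),
})

# Reference point for "age": the last reporting date observed in the data.
last_report_date = incidents["report_date"].max()

# delay (delta_i): time between the incident date and its reporting date, in days.
incidents["delay_days"] = (incidents["report_date"] - incidents["incident_date"]).dt.days

# age (A_i): time between the incident date and the last reporting date in the data, in days.
incidents["age_days"] = (last_report_date - incidents["incident_date"]).dt.days

print(incidents[["delay_days", "age_days"]])
```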
Most of the literature on reporting delays addresses the medical space. We are not aware of any existing papers that propose a methodology for correcting cyber incident counts to account for reporting delays. However, Coleman et al. (2021) does examine both the distribution of the number of days to discover cyber incidents and the number of days to disclose them.

Harris (1987) first described reporting delays as a statistical problem. Heisterkamp et al. (1988a,b, 1989) and Brookmeyer and Damiano (1989) made distributional assumptions and built linear/quadratic models, whereas Rosenberg (1990) and Cheng and Ford (1991) suggested Poisson models. These approaches are easy to implement but assume stationary reporting delays and do not capture trends. Morgan and Curran (1986), Downs et al. (1987, 1988), Healy and Tillett (1988) and Heisterkamp et al. (1988a,b, 1989) fitted exponential, integrated logistic and log-linear models to capture trends. Gail and Brookmeyer (1988), Brookmeyer and Liao (1990), Kalbfleisch and Lawless (1991) and Esbjerg et al. (1999) applied conditional probabilities to capture trends, but this resulted in over-fitting. Lawless (1994) proposed a multinomial model with distributed random effects based on Dirichlet/Poisson/Gamma distributions to capture trends in a timely fashion, but it fails to handle longer delays. Wang (1992) suggested maximum likelihood estimation (MLE) based non-parametric and semi-parametric approaches, but only with complete data. Harris (2020) suggested correcting COVID case counts with an expectation-maximization (EM) algorithm, training the model on complete data to correct test data. White et al. (2009) and Weinberger et al. (2020) proposed a simpler method based on proportions, but it also requires complete data to train. Wang (1992), Keiding and Moeschberger (1992) and Midthune et al. (2005) applied truncated models to avoid random effects, but these require stable reporting delays. Höhle and an der Heiden (2014), Bastos et al. (2019) and Chitwood et al. (2020) suggested Bayesian and hierarchical approaches with Poisson and negative binomial distributions; these are easy to implement but make distributional assumptions. Noufaily et al. (2015, 2016) suggested a log-likelihood approach with a truncation model that is data driven but sensitive to the choice of three reporting time steps. Bastos et al. (2019) also suggested a chain-ladder approach, but it is sensitive to outliers. Jewell (1989), Zhao et al. (2009), Zhao and Zhou (2010), and Avanzi et al. (2016) investigated claims data to account for reporting delays from a capital reserving perspective. That problem is different from the one being investigated here, since reporting delays in claims are due only to detection delays. As stated in Brookmeyer and Liao (1990), none of these approaches deal with delays longer than any previously observed.

The data used is a proprietary data set constructed by merging multiple source cyber event sets. The source cyber event sets contained differing fields that, at minimum, provided information as to the "what", "when" and "who" of an event. Concretely, the data sets included a description of the event and the name of the company to which the cyber event occurred (N.B.: aggregation events separately list each company known to be impacted), along with the date that the event occurred and the date that the event was reported. Because company names frequently differ from one data set to another, the data sets could not be de-duplicated based on direct string matching. Instead, the event data sets were matched to a firmographic data set containing approximately 55 million businesses in the US. This was done via a previously developed matching algorithm that examined company name, industry classification (e.g., via NAICS codes), address information, and any other fields common to both the cyber event data set and the firmographic data set. Events in distinct cyber event data sets were identified as the same when:

1. They were matched to the same company in the firmographic data set; and
2. They were listed as having occurred within 1 week of each other.
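The following is a minimal sketch of this de-duplication rule, assuming event records have already been matched to firmographic identifiers. The column names (firm_id, incident_date) are illustrative rather than the actual proprietary schema.

```python
import pandas as pd

def deduplicate_events(events: pd.DataFrame, window_days: int = 7) -> pd.DataFrame:
    """Collapse records matched to the same firm whose occurrence dates fall within one week."""
    events = events.sort_values(["firm_id", "incident_date"])
    kept_rows = []
    for _, group in events.groupby("firm_id"):
        last_kept_date = None
        for _, row in group.iterrows():
            if last_kept_date is None or (row["incident_date"] - last_kept_date).days > window_days:
                kept_rows.append(row)                 # first record of this event is kept
                last_kept_date = row["incident_date"]
            # otherwise: same firm, occurred within the window -> treated as the same event
    return pd.DataFrame(kept_rows)
```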
Limitations: Some events do not have an occurrence date listed and were therefore excluded from this analysis. There are also large spikes in event counts listed as having occurred on January 1st in most years (Fig. 1a); this was assumed to be a default value used when only the year of the event was known. These events were therefore re-distributed proportionally throughout the year, as shown in Fig. 1b, but were excluded while developing the approach.

The proposed approach consists of estimating the reporting delay distribution f_∆ from the empirical data. With this distribution, a corrected count of events with age a can be determined by dividing the raw counts by F_∆(a), since the latter represents the proportion of events that are reported with a delay δ of at most a or, equivalently, the proportion of events of age a that have been reported as of the present. There are four complications with estimating the delay distribution in the "obvious" way:

• The nature of reporting delays means that direct estimates from empirical data will be biased towards shorter delays, since recent events could only appear in the data set in the first place if the reporting delay is small.
• The estimates assume zero probability of any delay longer than the longest in the data set.
• The estimates for long reporting delays are based on few data points.
• The reporting delay distribution may not be stationary.

Mathematically, the first problem is that, for incidents with age a, the delay δ_i can only be observed if δ_i ≤ a. This problem can be resolved if one assumes that f_∆ is stationary, in which case the distribution for delays δ larger than a can be estimated from older events. This is the approach taken in Algorithm 1. To test for stationarity, the algorithm was run on two year windows of the data set centered at different times. (Note that the two year windows include all events that occur within the window, irrespective of when they are reported.) This showed that the delay distribution is non-stationary, as can be seen in Fig. 2.

To deal with this non-stationarity, the delay distribution was modeled via parametric distributions whose parameters were estimated over monthly rolling two year windows. For each two year window, optimal parameters were determined by matching cumulative distributions. Specifically, Algorithm 1 was run on each of the monthly two year windows to obtain an estimate of the delay distribution for that window. For a given two year window and modeled distribution, an optimization algorithm was then used to determine the parameters that gave the best match between:

• the delay distribution obtained by Algorithm 1; and
• the parametric distribution restricted to the domain where the delay is less than the window's maximum reporting delay.

The optimal distribution parameters for each window were computed using a derivative free optimizer. In what follows, the above two distributions will be referred to as the "debiased empirical delay distribution" and the "modeled delay distribution", respectively.
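Before turning to the algorithm itself, here is a small illustration of the basic correction step described above: each month's raw count is divided by the estimated reported proportion F_∆(a) for that month's age. The counts and CDF values below are made-up placeholders, not values from the paper's data.

```python
import numpy as np

# Hypothetical raw monthly counts, ordered from the oldest to the most recent month.
raw_counts = np.array([140, 150, 135, 120, 90, 60])

# Age of each month in months; the most recent month has the smallest age.
ages = np.arange(len(raw_counts) - 1, -1, -1)

# Placeholder values of F_Delta(age): the proportion of events of that age
# that are expected to have been reported by now.
F_delta = np.clip(0.15 + 0.14 * ages, 0.0, 1.0)

# Corrected counts: raw counts divided by the reported proportion.
corrected_counts = raw_counts / F_delta
print(np.round(corrected_counts, 1))
```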
Inspired by Brookmeyer and Liao (1990), the proposed algorithm works under the limitation that the delay δ is never observed beyond an event's age, and hence only a distribution conditioned on the delay being less than or equal to the age can be estimated directly. The algorithm corrects the incident counts with F_∆ estimated from the empirical data. It applies a top down approach that estimates the distribution "from the outside in", accounting for the estimated proportion of events that have occurred but not yet been reported as the distribution itself is being computed.

Let A_max := max_{i∈I} A_i be the maximal age of any event in the data set. Also let h_A(a) be the number of incidents of age a, and let h_∆(δ) be the number of incidents with delay δ. Formally,

$$h_A(a) = |\{i \in I : A_i = a\}| \tag{1}$$

$$h_\Delta(\delta) = |\{i \in I : \delta_i = \delta\}| \tag{2}$$

The equation used to estimate the delay distribution is then

$$f_\Delta(\delta) = \frac{h_\Delta(\delta)}{\sum_{a=\delta}^{A_{max}} h_A(a) / F_\Delta(a)} \tag{3}$$

Intuitively, the distribution is generated from the ratio of the number of events with the given delay, h_∆(δ), to the best estimate of the true number of events whose age is old enough for that delay to be observable in the incident data set (i.e., whose age is at least the delay under consideration), $\sum_{a=\delta}^{A_{max}} h_A(a)/F_\Delta(a)$. Algorithm 1 shows the algorithmic implementation used to generate the distribution: it takes as input the histograms h_A and h_∆, computed as in Eqs. (1) and (2), and returns the distribution f_∆. It iterates over delays from δ = A_max down to δ = 0, at each step computing the denominator of Eq. (3), where each F_∆(a) is obtained from the values of f_∆ already computed for delays greater than a; this is what makes the "outside in" order of computation well defined.
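A minimal Python sketch of Algorithm 1 as described above is given below. It reflects our reading that F_∆(a) in the denominator is reconstructed from the tail probabilities already computed, i.e. F_∆(a) = 1 − Σ_{δ'>a} f_∆(δ'), and it assumes the histograms are dense NumPy arrays indexed by delay/age in whole months or days.

```python
import numpy as np

def debiased_empirical_delay_distribution(h_A: np.ndarray, h_Delta: np.ndarray) -> np.ndarray:
    """Estimate f_Delta (Eq. 3) from the age histogram h_A and the delay histogram h_Delta.

    Both histograms are indexed 0..A_max and count incidents per age / per delay.
    """
    A_max = len(h_A) - 1
    f_Delta = np.zeros(A_max + 1)
    # Work "from the outside in": longest delay first.
    for delta in range(A_max, -1, -1):
        # tail[a] = sum of f_Delta(delta') for delta' >= a, using values computed so far.
        tail = np.cumsum(f_Delta[::-1])[::-1]
        # F_Delta(a) = 1 - sum of f_Delta(delta') for delta' > a.
        F_Delta = 1.0 - np.concatenate([tail[1:], [0.0]])
        # Denominator of Eq. (3): estimated true number of events old enough to show this delay.
        denominator = np.sum(h_A[delta:] / np.maximum(F_Delta[delta:], 1e-12))
        f_Delta[delta] = h_Delta[delta] / denominator if denominator > 0 else 0.0
    return f_Delta
```

At each iteration only values of f_Delta at strictly longer delays are used, so the computation proceeds without circularity; a cumulative sum of the returned array gives F_∆.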
As previously described, the modeled delay distribution is determined by an optimization that minimizes the difference between the cumulative distribution functions of the debiased empirical delay distribution and the modeled delay distribution, restricted to the domain [0, δ_max] (N.B.: δ_max is the maximum observed delay). The PDF and CDF plots shown in Fig. 2 suggest that a single distribution will not provide a good fit, as the delay distribution is bi-modal; rather, a mixture of distributions is required. Fig. 2 suggests a mix of exponential and normal distributions. Mathematically,

$$F(\delta) = \alpha\, F_{Exp}(\delta;\, \text{scale} = 1/\lambda) + (1 - \alpha)\, F_N(\delta;\, \mu, \sigma) \tag{4}$$

where F_Exp is the exponential CDF with scale parameter 1/λ and F_N is the normal CDF with parameters µ and σ.

A possible interpretation of the bi-modal nature of the reporting delay distribution is that there are two sub-populations: events that are discovered almost immediately (modeled by the exponential) and events whose delay is due to both discovery time and public disclosure time (modeled by the normal). The parameter α can therefore be interpreted as the proportion of events that are discovered right away by the organization.

Since the normal distribution is defined on (−∞, ∞) and reporting delays cannot be negative, the modeled delay distribution needs to be adjusted. The CDF in Eq. 4, truncated to [0, ∞) and re-normalized, can be expressed as

$$F^{full}_\theta(\delta) = \frac{F(\delta) - F(0)}{1 - F(0)}, \qquad 0 \le \delta < \infty \tag{5}$$

where θ = (α, λ, µ, σ). Since the debiased empirical delay distribution is only defined on [0, δ_max], which is the domain on which it is compared to the modeled delay distribution, the truncation of the modeled distribution to this domain is also defined, denoted by F_θ:

$$F_\theta(\delta) = \frac{F^{full}_\theta(\delta)}{F^{full}_\theta(\delta_{max})}, \qquad 0 \le \delta \le \delta_{max} \tag{6}$$

Defining the optimization function was challenging due to two factors in particular:

• There are many combinations of parameters that give approximately the same distribution when restricted to the domain [0, δ_max], but which differ substantially in how much of the total distribution's weight is contained in this domain.
• As two year windows closer to the present are considered, the quantity of data shrinks, resulting in increasingly unstable parameter estimates.

The optimization function used is shown in Eq. 7 below. The first term, $\|\log_{10} F_\theta - \log_{10} F_\Delta\|^2$, reduces the CDF difference between the debiased empirical delay distribution and the modeled delay distribution over the domain [0, δ_max]. The purpose of the log_10 transformation is to place more emphasis on a good fit for the initial months. As mentioned above, there are many different combinations of parameters that result in comparable errors in this first term but nevertheless differ substantially over the domain [0, ∞). This problem is relatively minor when the range [0, δ_max] contains the bulk of the distribution, which it does when δ_max is substantially greater than the second peak of the debiased empirical delay distribution. But as two year windows closer to the present are considered, this ceases to be the case. In order to avoid this problem, the optimization function must consider modeled delay distribution values beyond δ_max. The second term, $\|\log_{10} S_\theta - \log_{10} S_{\bar\theta}\|^2$, penalizes large differences between the CDFs of consecutive modeled distributions beyond δ_max. In order to further minimize parameter instability for recent two year windows, another set of weights is assigned to the first two terms, which effectively diminishes the importance of a good fit with the empirical data and increases the importance of parameter stability as the amount of available data diminishes. The third and fourth terms are penalization terms: the $F_N^2(0;\, \mu, \sigma)$ term penalizes negative delays introduced by the normal distribution (defined over (−∞, +∞)), whereas the $S_\theta^2(10\,\text{years})$ term penalizes delays beyond 10 years. Mathematically, the optimization function is defined as

$$\min_\theta \;\; \frac{\delta_{max}}{\delta_{Fix}} \left\| \log_{10} F_\theta - \log_{10} F_\Delta \right\|^2 + \left(1 - \frac{\delta_{max}}{\delta_{Fix}}\right) \left\| \log_{10} S_\theta - \log_{10} S_{\bar\theta} \right\|^2 + F_N^2(0;\, \mu, \sigma) + S_\theta^2(10\,\text{years}) \tag{7}$$

where δ_Fix is the maximum value of δ in the data set, F_θ is as defined in Eq. 6, and S_θ is the survival function, defined as the complement of the F^{full}_θ of Eq. 5:

$$S_\theta(\delta) = 1 - F^{full}_\theta(\delta) \tag{8}$$

Finally, $\bar\theta$ refers to the previous two year window's optimal parameters. Since no previous parameters are available for the first two year window, the second term is taken to be zero for that window:

$$\left\| \log_{10} S_\theta - \log_{10} S_{\bar\theta} \right\|^2 = 0 \tag{9}$$

The covariance matrix adaptation evolution strategy (CMA-ES) was applied to compute the modeled distribution parameters. It is a derivative free optimization algorithm, a type typically used when derivatives are difficult or costly to compute (Hansen, 2006, 2016, 2019). Once the modeled delay distributions (one for each two year window) have been obtained by optimization, event counts can be corrected based on the cumulative distribution function of the full modeled distribution (i.e., the distribution defined over [0, ∞) by Eq. 5). Specifically,

$$\text{corrected count}(m) = \frac{\text{reported count}(m)}{F^{full}_\theta(a_m)}$$

where a_m is the age of the given month m. From each modeled distribution, the correction factor is thus computed at only one point, the age corresponding to the month being corrected.

Fig. 3 shows two examples of plots comparing the PDFs of the fitted parametric modeled distribution, its truncation to the domain [0, δ_max] (where δ_max is computed for each window individually), and the debiased empirical delay distribution. Fig. 3a shows this comparison for the two year window starting from July 2012 until June 2016, and Fig. 3b shows it for the most recent window, from January 2017 until December 2018.
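A minimal sketch of the modeled delay CDF of Eqs. 4 and 5 and of the single-point correction, using scipy.stats; the parameter values below are placeholders for illustration, not fitted values from the paper.

```python
import numpy as np
from scipy.stats import expon, norm

def mixture_cdf(delta, alpha, lam, mu, sigma):
    """Eq. 4: mixture of an exponential CDF (scale = 1/lambda) and a normal CDF."""
    return alpha * expon.cdf(delta, scale=1.0 / lam) + (1.0 - alpha) * norm.cdf(delta, loc=mu, scale=sigma)

def full_modeled_cdf(delta, alpha, lam, mu, sigma):
    """Eq. 5: the mixture CDF truncated to [0, inf) and re-normalized."""
    f0 = mixture_cdf(0.0, alpha, lam, mu, sigma)
    return (mixture_cdf(delta, alpha, lam, mu, sigma) - f0) / (1.0 - f0)

def corrected_count(reported_count, age_days, params):
    """Divide a month's reported count by the modeled proportion reported so far."""
    return reported_count / full_modeled_cdf(age_days, *params)

# Placeholder parameters (alpha, lambda, mu, sigma); delays measured in days.
params = (0.15, 1.0 / 60.0, 400.0, 150.0)
print(corrected_count(reported_count=120, age_days=180, params=params))
```

In the full method, the parameter vector for each two year window would instead be obtained by minimizing the objective of Eq. 7 with a derivative free optimizer such as CMA-ES.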
Fig. 4 shows the parameter plots of the delay distribution generated for each monthly rolling two year window. The alpha plot (Fig. 4a) suggests that organizations discover 8-18% of cyber events right away (8% ≤ α ≤ 18%). The scale plot (Fig. 4b) suggests that the short delays modeled by the exponential distribution had a mean of less than 60 days until early 2016, but that this increased rapidly to around 140 days by early 2018. The plots of the normal distribution mean, µ (Fig. 4c), and standard deviation, σ (Fig. 4d), suggest that the longer delays modeled by the normal distribution remained consistent over time, varying within a ±10% range.

The approach was validated by estimating year ahead corrections, which use the correction factor

$$F_\theta(a, a + 1\,\text{year}) = \frac{F_\theta(a)}{F_\theta(a + 1\,\text{year})}$$

where a is the age of the event counts being corrected. Dividing the currently reported counts by this factor yields an estimate of the counts that will have been reported one year from now, which can then be compared against the counts actually observed a year later. Whereas the 2017 year ahead corrections (Fig. 5a) initially show close agreement with the 2018 counts, the more recent year ahead corrections underestimate the 2018 counts. On the other hand, the 2018 year ahead corrections (Fig. 5b) generally overestimate the 2019 counts, except for the most recent months, which show close agreement.
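A minimal sketch of this year ahead check, assuming F is the full modeled CDF from the previous sketch (any callable mapping an age in days to the proportion reported would do):

```python
def year_ahead_prediction(reported_count: float, age_days: float, F) -> float:
    """Predict the count that will have been reported one year from now.

    The year ahead correction factor is F(a) / F(a + 1 year); dividing the currently
    reported count by it is equivalent to scaling the count by F(a + 1 year) / F(a).
    """
    factor = F(age_days) / F(age_days + 365.0)
    return reported_count / factor
```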
As stated in section 4.3, the debiased empirical delay distribution has fewer data points for more recent two year windows, so weights of δ_max/δ_Fix and (1 − δ_max/δ_Fix) are used in the optimization function to dynamically adjust the weight given to the CDF before and after δ_max, respectively. By removing these weights, better estimates for recent months might be obtained, but this would come at the cost of more parameter instability and worse validation plots (overfitting).

In both Fig. 5a and Fig. 5b, the corrected counts (dashed lines) show a trend of increasing incident counts since 2016, which is contrary to the diminishing trend seen in the raw counts. The trend in the corrected counts is therefore much more in line with reports from insurers and other organizations that publish reports on cyber risk.

This work examined the long known problem of reporting delays in historical cyber event databases and proposed an algorithm to correct for these delays. Interestingly, the true distribution of reporting delays appears to be bi-modal, which we have interpreted as a mixture of two distributions: one for incidents that are discovered immediately, modeled by an exponential distribution, and one for incidents that are not immediately discovered, modeled by a normal distribution. With this form of reporting delay distribution, we obtained non-stationary modeled delay distributions via optimization. These modeled delay distributions were used to estimate, from the current counts, the total number of cyber incidents that will eventually be reported. The approach was validated by estimating year ahead corrections.

To understand the current cyber threat landscape and to create robust cyber risk models, one needs accurate historical data. While it is not possible to obtain the exact count of cyber events, the proposed algorithm corrects approximately for reporting delays. The reported cyber incident counts in recent times show a decreasing trend simply because many incidents have not yet been reported, even though they have already occurred. In reality, the rate of cyber incidents is increasing, and that is what the algorithm reveals. The general approach can be applied to any form of data with reporting delays, including other long tailed insurance claims (such as liability) and COVID-19 pandemic data. The current study corrects the overall counts for US cyber events; industry specific correction of cyber event counts is a future direction of this work.

References

A micro-level claim count model with overdispersion and reporting delays
A modelling approach for correcting reporting delays in disease surveillance data
Statistical methods for short-term projections of AIDS incidence
Minimum Size of the Acquired Immunodeficiency Syndrome (AIDS) Epidemic in the United States
The analysis of delays in disease reporting: Methods and results for the acquired immunodeficiency syndrome
Adjustment of AIDS surveillance data for reporting delay to the editor
Bayesian nowcasting with adjustment for delayed and incomplete reporting to estimate COVID-19 infections in the United States
Trends in Cybersecurity Breaches
AIDS in Europe: Current trends and short-term predictions estimated from surveillance data
The statistical estimation, from routine surveillance data, of past, present and future trends in AIDS incidence in Europe
Reporting delay and corrected incidence of multiple sclerosis
Methods for projecting course of acquired immunodeficiency syndrome epidemic
The CMA evolution strategy: a comparing review
The CMA Evolution Strategy: A Tutorial
CMA - Python Package
Delay in Reporting Acquired Immune Deficiency Syndrome (AIDS)
Overcoming Reporting Delays Is Critical to Timely Epidemic Monitoring: The Case of COVID-19 in New York City. medRxiv
Short-Term Extrapolation of the AIDS Epidemic
The use of Genstat in the estimation of expected numbers of AIDS cases adjusted for reporting delays
Statistical estimation of AIDS incidence from surveillance data and the link with modelling of trends
Correcting reported AIDS incidence: A statistical approach
Bayesian nowcasting during the STEC O104:H4 outbreak in Germany
Predicting IBNYR events and delays
Regression Models for Right Truncated Data With Applications to AIDS Incubation Times and Reporting Lags
Independent Delayed Entry
Adjustments for Reporting Delays and the Prediction of Occurred but Not Reported Events. The Canadian Journal of Statistics / La Revue Canadienne de Statistique
Modeling reporting delays and reporting corrections in cancer registry data
Acquired immunodeficiency syndrome: Current and future trends
Modelling reporting delays for outbreak detection in infectious disease data
Detection of Infectious Disease Outbreaks From Laboratory Data With Reporting Delays
A simple correction of AIDS surveillance data for reporting delays
The Analysis of Retrospectively Ascertained Data in the Presence of Reporting Delays
Estimation of Excess Deaths Associated with the COVID-19 Pandemic in the United States
Estimation of the reproductive number and the serial interval in early phase of the 2009 influenza A/H1N1 pandemic in the USA
Applying copula models to individual claim loss reserving methods
Semiparametric model for prediction of individual claim loss reserving

Acknowledgments

The study was done in collaboration with AIR Worldwide using their proprietary cyber data. The authors would like to thank Scott Stransky for his comments and suggestions that helped improve the approach.