key: cord-0846186-1jm7slb2
authors: Eryarsoy, Enes; Delen, Dursun; Davazdahemami, Behrooz; Topuz, Kazim
title: A Novel Diffusion-Based Model for Estimating Cases, and Fatalities in Epidemics: The Case of COVID-19
date: 2020-12-01
journal: J Bus Res
DOI: 10.1016/j.jbusres.2020.11.054
sha: abd40afb86223586a0ec7bd8f820279de7498213
doc_id: 846186
cord_uid: 1jm7slb2

While the COVID-19 pandemic is still ongoing in a majority of countries, a wealth of literature published in reputable journals has attempted to model the spread of the disease. A vast majority of these studies dealt with compartmental models such as the susceptible-infected-recovered (SIR) model. Although these models are rather simple, intuitive, and insightful, we argue that they do not necessarily provide a good enough fit to the reported data, which during pandemics usually take the form of daily fatalities and cases. This study proposes an alternative analytics approach that relies on diffusion models to predict the number of cases and fatalities in epidemics. After evaluating several well-known diffusion models widely used in the business literature, including the ADBUDG, Gompertz, and Bass models, we developed a modified version of the original Bass diffusion model to address the shortcomings of ordinary compartmental models such as SIR, and demonstrated its applicability to COVID-19 incident data. The proposed model differentiates itself from similar models by fitting the data without preprocessing, requiring no initial conditions or assumptions, avoiding heavy parameterization, and properly addressing pressing issues such as undocumented cases and the length of infectious or recovery periods.

Recently at the end of 2019, Wuhan, an emerging business hub of China, reported the first case of an unknown type of pneumonia to the World Health Organization (WHO). At the time, nobody thought that the new disease would shut down everyday life in more than 200 countries around the globe in around three months, reach 59 million confirmed cases and claim nearly 1.4 million deaths across the world (as of November 20, 2020). The Coronavirus Disease of 2019 was characterized as a pandemic by the World Health Organization (WHO) in early March of 2020. In a short time and in parallel to the vast amount of medical research aiming at finding a remedy for the disease, a large number of epidemiologists and data analysts began to study the spread of COVID infections using the limited data available. While individual countries often provide data at varying levels of detail, the globally reported data is limited to daily cases, recovery, and fatality numbers for each country.

Notably, multiple studies have simulated the spread of the disease relying on widely accepted epidemiologic models such as susceptible-infected-recovered (SIR) (Q. J. T. Wu, Leung, Bushman, et al., 2020; You et al., 2020), susceptible-exposed-infected-recovered (SEIR) (Cheng & Shan, 2020; Ferguson et al., 2020; Tang et al., 2020), and susceptible-infected-recovered-dead (SIRD) (Anastassopoulou et al., 2020; Fanelli & Piazza, 2020) models. These studies mainly aimed at estimating the key transmission parameters of the disease, such as the reproduction number ($R_0$) (i.e., the average number of people infected by each infected individual), the incubation rate (i.e., the rate at which symptoms show up in an infected individual), and the recovery rate (i.e., the rate at which symptoms go away after they emerge). However, as we will discuss in more detail, the results reported by these studies are often remarkably inconsistent. For instance, estimates of $R_0$ on the Diamond Princess cruise ship range from 2.28 to numbers as large as 11 and 14.8 (Rocklöv et al., 2020). This inconsistency is not surprising given the volatility and misleading nature of the data available at present, combined with the rigidity of the aforementioned compartmental models.

Diffusion models are widely used in various areas of research, from politics and social sciences (de Tarde, 1895; Givan et al., 2010) to marketing (Bass, 1969; Basu et al., 1995; Chang et al., 2015; Darrat, 2000; Y.-M. Li & Shiu, 2012; Little, 1970; Nguimkeu, 2014; Peres et al., 2010; Schramm et al., 2010; Sood et al., 2012; Van Den Bulte, 2000; C. Wu et al., 2012), and medicine (Lou & Zhao, 2011; W. Wang et al., 2016). While models such as SIR, SEIR, and their variations seem more capable than the traditional diffusion models from which they were derived, they are basically meant to be used by healthcare practitioners, as they simulate the whole course of the disease from exposure through recovery or death. Diffusion models, on the other hand, do not model stages and transitions of the disease and fit only to one quantity, such as the number of cases or the number of fatalities.

We argue that, at present, the reported case numbers are misleading and vary across (or even within) individual countries. For example, how many diagnostic tests are carried out, and how, will influence the number of cases reported. Therefore, reported cases may not necessarily reflect the actual spread of the virus. Reports affirming the possibility of entirely asymptomatic development of COVID-19 in some patients (Bai et al., 2020), as well as the unreliability of some of the test kits that are still being employed in developing countries, compound this issue. Furthermore, countries may differ in the speed, frequency, or even accuracy of reporting, leading to varying parameter values for individual country models.

That said, in the present study, we take a diffusion model approach and focus on the reported data, i.e., the number of recovered/deceased population rather than the whole transition process. Hence, instead of forcing our model to fit both inaccurately reported cases and fatality/recovery numbers simultaneously, we fit our model only to infections or fatalities. Specifically, we use data from individual countries to estimate the parameters of multiple well-known diffusion models and assess their fit. These models have been widely used in business literature. Following that, we utilize numbers from existing literature (e.g., mortality, recovery rate, and demographic breakdown) along with model parameters, in a backward manner, to estimate the number of symptomatic cases, hospitalizations, and deaths in the United States. A similar backward approach has recently been employed by Flaxman et al. (2020) to estimate the reproduction number of COVID-19 in different European countries.

Modeling product growth by forecasting adoption can be considered analogous to disease spread models. Analogies such as product-disease, infection-purchase, reinfection-repeat purchase, susceptible-market potential suggest using similar models for disease spreads may be viable. For example, Olson & Choi (1985) used a logistic-based diffusion model for forecasting the rate of diffusion with repeat purchases. This is analogous to forecasting the infection rates when individuals can be reinfected. However, unlike compartmental models such as SIR, the diffusion models work without compartments, focusing only on one compartment (fatality, in our case).

After illustrating how to fit several general diffusion models on the COVID-19 data, we then customize one of the top-performing ones (the Bass diffusion model) to address several shortcomings of both compartmental models and off-the-shelf diffusion models.

The contribution of the present study is two-fold: 1) compared to the majority of similar studies, we rely on a simple yet innovative approach with a considerably easier setup and fewer computational requirements, while providing comparable, possibly even better, estimates; 2) by customizing the Bass diffusion model, we build an easy-to-use model that fits the data well with reasonable parameterization.

The rest of this paper is organized as follows. In the following section, we discuss various methods, with an emphasis on SIR models, used by epidemiologists to study the spreading behavior of infectious diseases. Next, we elaborate on the diffusion models, as well as our modification, followed by an illustration of the results. We conclude the paper by discussing the findings and providing guidelines for future research.

There is a wide selection of forecasting models besides compartmental models such as SIR. Machine learning (ML) methods such as support vector machines or artificial neural networks are known to perform very well in various forecasting tasks. However, in the case of modeling an epidemic spread, off-the-shelf machine learning models may not perform adequately for several reasons. First, ML models do not assume any underlying distribution or trajectory beyond the available data. Therefore, they are traditionally better suited to predicting "between" observed situations, as in interpolation tasks. Estimating fatalities, however, requires making predictions beyond the observed range. For instance, a polynomial function may suggest exponential growth at the early stages of the spread and then change shape only after more daily data is observed. While ML methods excel in interpolation-related tasks, this is a time-series forecasting task with no calculable seasonality or trend components. This also rules out the off-the-shelf ARIMA-based family of models.

Diffusion models (a.k.a. growth curves) are widely adopted in various areas of research, from politics and social sciences ( de Tarde, 1895; Givan et al., 2010) to marketing (Basu et al., 1995; Chang et al., 2015; Darrat, 2000; Y.-M. Li & Shiu, 2012; Little, 1970; Nguimkeu, 2014; Peres et al., 2010; Schramm et al., 2010; Sood et al., 2012; Van Den Bulte, 2000; C. Wu et al., 2012) , information systems (Jeyaraj & Sabherwal, 2014; Ntwoku et al., 2017; Udo et al., 2018) , finance (Boratyńska & Grzegorzewska, 2018; Hongmei, 2012; Jibao et al., 2010) , economics (Yang, 2012; Yuhn et al., 2015) , and medicine (Lou & Zhao, 2011; W. Wang et al., 2016) . In his original work Tarde (1890) noted the power of imitation as one of the driving factors behind the spread of political ideas. These models have certain mathematical properties that help get insights into the diffusion process.

Generally, the diffusion models follow an S-shape curve representing the spread or growth. Analogous to cumulative distribution functions (cdf), the models map the independent variable (time) to the dependent variable, diffusion (see Figure 1 ).

[Insert Figure 1 here]

Mahajan (1977; 1985) generalized diffusion models as $\frac{dN(t)}{dt} = g(t)\,[M - N(t)]$, where $M$ is the limit for diffusion, and $\frac{dN(t)}{dt}$ and $N(t)$ are the diffusion rate and cumulative diffusion, respectively. $g(t)$, known as the "pressure function", is used to characterize the structure of the diffusion model.
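To make the role of the pressure function concrete, the following minimal sketch integrates the general form $dN/dt = g\,(M - N)$ with a forward-Euler scheme. The function name `integrate_diffusion`, the step size, and all numeric values are illustrative assumptions; the pressure is passed as a function of both $t$ and $N$ so that the same routine covers a constant pressure and a Bass-type pressure $p + qN/M$.

```python
import math

def integrate_diffusion(pressure, M, n0=0.0, t_end=30.0, dt=0.01):
    """Forward-Euler integration of dN/dt = pressure(t, N) * (M - N)."""
    steps = int(round(t_end / dt))
    N = n0
    for i in range(steps):
        t = i * dt
        N += dt * pressure(t, N) * (M - N)
    return N

M = 1000.0   # hypothetical diffusion limit
a = 0.1      # hypothetical constant pressure

# Constant pressure: the closed form is N(t) = M * (1 - exp(-a * t)).
n_constant = integrate_diffusion(lambda t, N: a, M)
n_closed_form = M * (1.0 - math.exp(-a * 30.0))

# Pressure that grows with cumulative diffusion (Bass-type), giving an S-curve.
p, q = 0.01, 0.4  # hypothetical coefficients
n_bass = integrate_diffusion(lambda t, N: p + q * N / M, M)

print(round(n_constant, 2), round(n_closed_form, 2), round(n_bass, 2))
```

With a constant pressure the curve is concave from the start, whereas a pressure that increases with $N$ produces the S-shape discussed above.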

There are a variety of models/functions that offer different kinds of pressure functions, such as Gompertz (1825), Weibull (1951), Chapman-Richards (1959), Bertalanffy (1957), and Bass (1969). Many of these models share similar characteristics (e.g., Bertalanffy is a special case of the generalized logistic function). Jaakkola (1996) offers a useful review of the existing diffusion models.

More recently, a new generation of modified forecasting diffusion models with enhanced capabilities has emerged. These models may use simulation techniques, dynamic programming, or feedback mechanisms for model fit, and may incorporate exogenous variables or perform multi-phased analyses. Some of them are already widely used in epidemics, such as susceptible-infected-recovered (SIR) (Q. J. T. Wu, Leung, Bushman, et al., 2020; You et al., 2020), susceptible-exposed-infected-recovered (SEIR) (Cheng & Shan, 2020; Ferguson et al., 2020; Tang et al., 2020), or susceptible-infected-recovered-dead (SIRD) (Anastassopoulou et al., 2020; Fanelli & Piazza, 2020). For example, SIR models (Figure 2) are often employed to model infectious diseases using three states (susceptible, infectious, and recovered). They employ three differential equations that manage the transitions from one state to another.

The number of susceptible, infectious, and recovered individuals are represented by S(t), I(t), and R(t), respectively. Denoting the population size as M, the corresponding fraction of the population can be obtained by dividing those three values by M. The quantities in each compartment depend on the inter-compartment transitions, which can be written as a set of differential equations (1)-(3). Solving SIR requires specialized algorithms such as Runge-Kutta method to estimate model parameters (Zeb et al., 2014) needed to simulate the disease's spread.

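$$\frac{dS(t)}{dt} = -\frac{\beta\, S(t)\, I(t)}{M} \tag{1}$$

$$\frac{dI(t)}{dt} = \frac{\beta\, S(t)\, I(t)}{M} - \gamma\, I(t) \tag{2}$$

$$\frac{dR(t)}{dt} = \gamma\, I(t) \tag{3}$$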
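As an illustration of how the SIR system is typically integrated, the sketch below applies a classical fourth-order Runge-Kutta step to the standard equations $dS/dt = -\beta S I / M$, $dI/dt = \beta S I / M - \gamma I$, $dR/dt = \gamma I$. It is a minimal sketch rather than the procedure used in any cited study: the function names, the step size, the population size, and the parameter values ($\gamma = 1/2.3$ and $\beta = 2.2\,\gamma$) are assumptions chosen only for demonstration.

```python
def sir_rhs(state, beta, gamma, M):
    """Right-hand side of the SIR equations for state = (S, I, R)."""
    S, I, R = state
    new_infections = beta * S * I / M
    return (-new_infections, new_infections - gamma * I, gamma * I)

def rk4_step(state, dt, beta, gamma, M):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))
    k1 = sir_rhs(state, beta, gamma, M)
    k2 = sir_rhs(shifted(state, k1, dt / 2), beta, gamma, M)
    k3 = sir_rhs(shifted(state, k2, dt / 2), beta, gamma, M)
    k4 = sir_rhs(shifted(state, k3, dt), beta, gamma, M)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate_sir(beta, gamma, M, I0, days, dt=0.1):
    """Return one (S, I, R) tuple per day, starting from I0 infectious."""
    state = (M - I0, float(I0), 0.0)
    trajectory = [state]
    steps_per_day = int(round(1 / dt))
    for _ in range(days):
        for _ in range(steps_per_day):
            state = rk4_step(state, dt, beta, gamma, M)
        trajectory.append(state)
    return trajectory

# Hypothetical setting: one million people, ten initial infections.
gamma = 1 / 2.3
beta = 2.2 * gamma
trajectory = simulate_sir(beta, gamma, M=1_000_000, I0=10, days=200)
final_S, final_I, final_R = trajectory[-1]
print(round(final_S / 1_000_000, 3))
```

Fitting the model to data then amounts to searching for the $\beta$ and $\gamma$ values whose simulated trajectory is closest to the observations, which is where the parameter instability discussed in this section arises.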

Perhaps the most critical parameter for a disease spread is the basic reproduction number, $R_0 = \beta/\gamma$. This number is also used to estimate the number of total infections. For seasonal flu, for example, this number hovers around 1.3, meaning each infected person infects 1.3 people on average (Biggerstaff et al., 2014). The infectious rate ($\beta$) and the recovery rate ($\gamma$) manage the transitions between states (Figure 2). The recovery rate ($\gamma$) is the inverse of the average infectious period. Using the Lambert W function to numerically solve Eq. (2) for S, the model can also be used to estimate the percentage of the susceptible population for a given $R_0$ (Figure 3).

The model indicates that even with an optimistic estimate of $R_0 = 2.2$ for COVID-19, 84% of the population is susceptible to infection.
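The 84% figure can be checked against the standard SIR final-size relation $z = 1 - e^{-R_0 z}$, where $z$ is the fraction of the population eventually infected. The sketch below assumes that this relation is what underlies the estimate, and it swaps the Lambert W solution for a plain fixed-point iteration so that it needs no special-function library; the function name `final_attack_rate` is hypothetical.

```python
import math

def final_attack_rate(r0, tol=1e-12, max_iter=10_000):
    """Solve z = 1 - exp(-r0 * z) for the non-trivial root by fixed-point iteration."""
    if r0 <= 1.0:
        return 0.0  # no large outbreak is expected when R0 <= 1
    z = 0.5
    for _ in range(max_iter):
        z_next = 1.0 - math.exp(-r0 * z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z

attack_covid = final_attack_rate(2.2)  # R0 value quoted in the text
attack_flu = final_attack_rate(1.3)    # seasonal-flu value quoted in the text
print(round(attack_covid, 3), round(attack_flu, 3))
```

For $R_0 = 2.2$ the iteration settles near 0.84, consistent with the percentage stated above.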

[Insert Figure 3 here]

Also, Figure 4 indicates the percentage of the population involved through the overall course of the COVID-19 progress, based on the SIR model and using the same parameter estimates. Again, the curve (S) does not drop down to zero, indicating a maximum susceptible fraction of near 84%.

[Insert Figure 4 here]

The SIR model requires three initial state values indicating (i) the size of the susceptible population (or population at time t), (ii) the number of active infections, and (iii) the number of currently recovered patients. While this model is very intuitive, the three model parameters for COVID-19 have been widely studied and reported in recent literature with significantly inconsistent results. One study, for example, estimated the basic reproduction number ($R_0$) as 2.2 and the recovery rate as $\gamma = 1/2.3$ (i.e., the average infectious period is 2.3 days). In another study, Wu et al. (2020) provided an estimate of 1.94 for $R_0$. Other studies have also provided estimates of $R_0$ ranging from 2.0 to 6.7 (Anastassopoulou et al., 2020; Fanelli & Piazza, 2020; Organization & Organization, 2020; Tang et al., 2020).

Perhaps one reason behind this inconsistency is that the reported incident data and the SIR model are not compatible. The SIR model estimates (and therefore fits to) the number of infectious and recovered individuals. Countries report only daily new cases, new fatalities, and recoveries. The active number of cases is, therefore, the difference between the cumulative figures of reported cases and recoveries. This figure indicates the number of cases that are under medical supervision as inpatients or outpatients. However, the model relies on the number of infectious individuals, who may be unattended but still infectious. So, it is important to estimate the active number of infections, given the number of cases. Within SIR this estimation relies on the parameter $\gamma$, using a sliding window with a length of $1/\gamma$. Delen et al. (2020) estimated the size of this window for the reported data as 8 days for COVID-19. Therefore $\gamma$ is typically invariant, and we can calculate the active infections by aggregating the reported number of daily cases over a sliding window. This, however, limits the capability of the SIR model, as only the parameter $\beta$ can be adjusted to fit the data throughout the analysis. We believe this is one of the reasons why SIR may not be the most suitable model for fitting to spread data. Below we summarize three principal reasons for using diffusion models besides SIR:

i. The parameter $\beta$ alone is not sufficient to mimic the spread: As mentioned above, if we consider the number of susceptible individuals in the population as fixed, the only remaining parameter to be estimated is $\beta$. However, $\beta$ alone may not be sufficient to fit the data tightly (see Figure 5). The figure suggests that changing $\beta$ values is not sufficient to fit the real data, and that there is a need for a time shift, at the very least.

[Insert Figure 5 here]

ii. Even without fixing the recovery rate the model may not fit: This is possibly due to undocumented cases making the data more difficult to fit. The model formulation, given in Eqs. (1)-(3), assumes that all infections are documented. However, there may be cases of infection that go undocumented. These cases may well be the key driving factor behind new infections.

iii. Using the reported number of cases and fatalities is more direct: In the case of COVID-19, standardized reports contain only daily cases and fatalities for individual countries.

In SIR, the transition between compartments is managed through the two parameters. These parameters are used to estimate the number of susceptible, infectious, and recovered individuals. The SIR model is fit by using the infectious compartment (Eq (2)) as it contains both parameters. Therefore, unfortunately, the model cannot be used to fit to, say, reported fatality numbers only. The model fits to fatality and infection numbers together. However, keeping track of the number of infectious is not straightforward. The number of reported cases varies from one country to another depending on their practices, such as the number of tests they conduct each day. The data available offers only two ways of calculating the number of infected: (i) computing the difference between the recovered compartment and the cumulative number of cases reported. This number can be used as the number of the currently infected population.

(ii) estimating infectious patients using a sliding window with a length of $1/\gamma$. This length can be estimated by calculating a lag parameter between the recovery and infection data. We argue that being able to use fatality numbers alone is more reliable, especially since the reported case numbers depend on the specific country's reporting style (such as testing frequency or reporting speed). This can be easily seen by looking at the significantly varying crude case fatality rates (CFR; number of deaths / number of cases) across individual countries. This may be due to highly varying levels of testing across countries. For example, denser testing may reveal more cases, including milder or asymptomatic ones, whereas less frequent testing may only be applied to patients with more obvious symptoms. Although fatality numbers might still be biased due to misclassification of the cause of some deaths or even intentional underreporting by the health authorities, they are expected to provide a more accurate image of reality than the infection numbers, given that fewer biasing factors are involved in their reporting process. Therefore, fitting a diffusion model on fatality data only might be easier.
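The two ways of deriving the number of currently infected individuals, together with the crude CFR, can be sketched as follows. This is a minimal illustration on made-up daily counts, not the procedure applied to the ECDC data: the function names are hypothetical, recoveries and deaths are lumped into a single "removed" series, and the window of 3 days is chosen only to keep the example readable.

```python
from itertools import accumulate

def active_by_difference(daily_cases, daily_removed):
    """Method (i): cumulative reported cases minus cumulative removed cases."""
    cum_cases = accumulate(daily_cases)
    cum_removed = accumulate(daily_removed)
    return [c - r for c, r in zip(cum_cases, cum_removed)]

def active_by_window(daily_cases, window):
    """Method (ii): sum of daily cases over the last `window` days (about 1/gamma)."""
    active, running = [], 0
    for t, new_cases in enumerate(daily_cases):
        running += new_cases
        if t >= window:
            running -= daily_cases[t - window]  # cases older than the window drop out
        active.append(running)
    return active

def crude_cfr(daily_cases, daily_deaths):
    """Crude case fatality rate: total deaths divided by total reported cases."""
    total_cases = sum(daily_cases)
    return sum(daily_deaths) / total_cases if total_cases else float("nan")

# Made-up daily counts for illustration only.
cases = [1, 2, 4, 8, 16, 8, 4, 2, 1, 0]
removed = [0, 0, 0, 1, 2, 4, 8, 16, 8, 4]   # every case removed exactly 3 days later
deaths = [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]

print(active_by_difference(cases, removed))
print(active_by_window(cases, window=3))
print(round(crude_cfr(cases, deaths), 4))
```

In this toy example the two methods coincide because every case is removed after exactly three days; with real reports, where removal times vary and recoveries are recorded unevenly, they diverge.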

We attempted to compute SIR model parameters that fit COVID-19 data using both methods. However, we were unable to compute a consistent set of parameters providing a good fit to the data reported ( Figure 5 ).

As mentioned earlier, a review of the literature on employing compartmental models such as SIR for estimating parameters of the COVID-19 spread also shows a lack of consensus among researchers. Of course, this inconsistency is not surprising given the volatility and misleading nature of available data, the number of tests carried out, the various interventions implemented in different countries to combat the spread of the disease, and the intensive parameter calculations required for these compartmental models. One way of fitting the SIR model is by modifying $\beta$ and letting it vary across time (Katriel & Stone, 2012). However, this kind of over-parametrization may make the model overfit the data, and still does not address the issue of underreporting. Diffusion models, as opposed to SIR, do not model stages and transitions and fit only to one quantity, such as the number of cases or the number of fatalities, and hence have fewer parameters to be estimated.

In this paper, we first take a diffusion model approach by employing a variety of well-accepted general diffusion models and focus only on the recovered/deceased population as opposed to modeling the whole transition process. Observing the early stages of the spread, experts may use their domain knowledge and make forecasts on, for example, the number of fatalities at the end of the infection cycle. Incorporating such experts' knowledge in the modeling efforts could be beneficial, given the high degree of uncertainty when a new disease emerges. By design, this additional domain information cannot be embedded in the SIR model. However, the diffusion models we use in this study let us incorporate such domain knowledge. For example, including expert opinions based on the total number of fatalities is possible by adjusting the upper-asymptotes in a diffusion model. Also, diffusion models can be used to fit fatality or infection data independently. We argue that such an approach may yield more reliable estimates. Furthermore, fitting a model to fatality numbers using diffusion models is algebraically more straightforward than a set of differential equations.

At the next stage, we modify the Bass diffusion model (one of the best-performing ones from the previous stage) and relax the domain knowledge required for setting upper-asymptotes and attempt to address some of the shortcomings of both diffusion and compartmental models. The model can be directly used on daily reported data. Our model is composed of two sub-models: (i) a modified Bass diffusion sub-model to fit reported cases data, and (ii) a sliding window sub-model to fit fatality data.

Seven weeks after curfew, on March 12 of 2020, the Chinese government declared a triumph over COVID-19 after curbing the spread of the virus. There are several different data providers on the latest COVID-19 spread, such as the European Centre for Disease Prevention and Control (ECDC), the World Health Organization (WHO), and Johns Hopkins University (JHU). For this study, we make use of the dataset maintained by ECDC. The dataset starts with the first cases reported by China, on December 31, 2019. It contains 210 countries and regions with their populations, continents, and country geographical IDs. Based on the dataset as of June 23, several other countries, including Italy and Spain, are reaching the end of the COVID-19 spread cycle.

As the spread dynamics are age-sensitive, we also make use of data on population pyramids of countries maintained by the United Nations. For diffusion models, we benchmark our analysis using US data. We then estimate age-adjusted incident expectations using conditional probabilities that help create higher quality upper asymptotes for the models. The reason for this adjustment is that age disparities can account for as much as a 22-fold difference in case fatality rate (CFR), depending on the specific country's population pyramid.

Diffusion models fit better when data spans wider time periods. At the time of this paper's writing, not all countries have reached the peak of disease spread. In order to be able to compare parameter values across countries, we first check spread maturities in these countries (whether the spread is still at its infancy, or maturing, or towards the ending stage). While we use data from the top 50 countries based on their reported number of fatalities, we discard those that are still experiencing the early stages of the spread (such as India, Bangladesh, and Panama). We indicate the maturity levels of spread for those that are included in our analysis.

We first compile a list of diffusion models to use on our dataset. A typical diffusion process for a disease involves huge amounts of data and a high number of interrelated variables. The diffusion models we use are simplified mathematical representations of the major aspects of the diffusion process. We use the following set of diffusion models to test on COVID-19 data: ADBUDG, Bass diffusion model, Gompertz, Weibull-Gamma, Logistic, and Chapman-Richards models. These models have been extensively used in business literature, specifically in marketing, information systems, and finance: ADBUDG (Basu et al., 1995; Dong et al., 2007; Little, 1970; Streukens et al., 2011; C. Wu et al., 2012), Bass (Fan et al., 2017; Hsiao et al., 2009; Jeyaraj & Sabherwal, 2014; Naseri & Elliott, 2013; Ntwoku et al., 2017), Gompertz (Darrat, 2000; Naseri & Elliott, 2013; Nguimkeu, 2014; Sood et al., 2012; Udo et al., 2018), Logistic (Boratyńska & Grzegorzewska, 2018; Naseri & Elliott, 2013; Nguimkeu, 2014; Van Den Bulte, 2000; Van Den Bulte & Joshi, 2007), and Weibull (Van Den Bulte & Joshi, 2007). Other diffusion models have also been used in the literature. Interested readers may refer to Mahajan et al. (1990) for an earlier but insightful discussion of using a variety of such models in marketing.

While these models have been widely used in especially business literature, they are not exclusively built to analyze the dynamics of the spread of infectious diseases. However, by nature, their formulations can accommodate such dynamics. The number of infected persons (inverted-S), recovered ones, and the number of fatalities typically follow an S-shaped distribution (see Figure 4 ). Below, we briefly introduce the diffusion models we used in our study.

The Advertising Budgeting (ADBUDG) model proposed by Little (1970) is basically designed as a marketing mix model focusing on advertising budget management. The original model is meant to fit advertising efforts to market responses. Hence it is not designed to map time to diffusion. In the context of disease spread over time, the model can be reformulated as follows: 

Developed by Frank Bass (1969) , the original Bass model describes the adoption process of new products over time by taking into account the interactions of potential adopters. Bass classifies new product adopters as innovators and imitators and implies that the speed and time of adoption depend on the degree of innovativeness (p) and imitation (q) among them, respectively, where the latter can be controlled by the decision-maker while the former is out of her control.

We adapt the Bass diffusion model in the context of the number of disease fatalities over time as follows:

Where:

$A$: diffusion upper asymptote

$t$: time period

number of periods during which the disease is infectious

$n_t$: number of infections during period $t$

$N_{t-1}$: cumulative number of fatalities, $N_{t-1} = \sum_{i=1}^{t-1} n_i$

$q$: Imitation coefficient. This manages the deaths due to disease transmission from one individual to another that can be controlled by the decision-maker.

$p$: Innovation coefficient. This coefficient is typically smaller than $q$, and it is the coefficient that the decision-maker cannot control. It reflects the transmissions that will happen regardless of the measures taken by governments. Typically, curve fitting yields a smaller number for $p$ than for $q$.

The model in the original form is essentially in the quadratic form where:
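For readers who want to experiment, a standard discrete-time statement of the Bass recursion is $n_t = \left(p + q\,N_{t-1}/A\right)\left(A - N_{t-1}\right)$, which expands to the quadratic $n_t = pA + (q - p)\,N_{t-1} - (q/A)\,N_{t-1}^2$. The sketch below assumes this textbook form; it may differ in detail from the adaptation used in this study, and the function names and parameter values are hypothetical.

```python
def bass_increments(A, p, q, periods):
    """Per-period increments n_t from the discrete Bass recursion."""
    cumulative = 0.0
    increments = []
    for _ in range(periods):
        n_t = (p + q * cumulative / A) * (A - cumulative)
        increments.append(n_t)
        cumulative += n_t
    return increments

def bass_quadratic(A, p, q, cumulative):
    """The same increment written as a quadratic in the cumulative count."""
    return p * A + (q - p) * cumulative - (q / A) * cumulative ** 2

# Hypothetical values: asymptote of 10,000 with p much smaller than q.
A, p, q = 10_000.0, 0.01, 0.3
increments = bass_increments(A, p, q, periods=60)
peak_period = increments.index(max(increments))
total = sum(increments)
print(peak_period, round(total, 1))
```

Because $p$ is small relative to $q$, the increments rise slowly at first, peak, and then decay as the cumulative count approaches the asymptote $A$.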

The Gompertz model (Gompertz, 1825), similar to the Weibull model, is one of the proportional hazard models used for diffusion. It is widely used in marketing for modeling new product adoption (Van den Bulte & Stremersch, 2004) as well as in information systems for modeling new technology adoption (Mann et al., 2011).

The original Gompertz curve is formulated as follows:

Shifted Gompertz distribution, first introduced by Bemmaor (1992) for modeling adoption of innovation, is a modified version of Gompertz curve where the cumulative distribution function is:

Where (in the context of disease fatalities):

$A$: upper asymptote/population size parameter

shape parameter (typically $> 1$ for an S-shaped curve)

scale parameter ($> 0$)

$N(t)$: cumulative number of fatalities

Weibull is a continuous probability distribution (Weibull, 1951) that is widely used in time-series forecasting (e.g., Kulkarni, Kannan, & Moe, 2012; Wright & Stern, 2015). The cumulative distribution function for Weibull is (for values $t > 0$):

Where:

$k$: shape parameter (typically $k > 1$ for S-shaped curves)

scale parameter

$N(t)$: cumulative number of fatalities

Logit models are discrete-time hazard models. We set the lower and upper asymptotes at 0 and 1, respectively, and used a generalized logistic growth model of the form:

where

growth range

$k$: growth rate

$t_0$: time shift parameter

$N(t)$: cumulative number of fatalities

The Chapman-Richards growth curve is a generalization of von Bertalanffy's growth model (Von Bertalanffy, 1957). It is a popular model in forestry science for modeling the height of trees over time. Its S-shape cumulative distribution function is as follows:

For a more in-depth discussion of the Gompertz, Logit and Probit, and ADBUDG models, readers can refer to Wierenga & van der Lans (2017). Figure 6 demonstrates the steps we followed to perform our analyses using basic diffusion models.

[Insert Figure 6 here]
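The general workflow of fitting a growth curve to cumulative data can be sketched as below. The curve definitions are common textbook parameterizations and are assumptions here; they may differ from the exact formulations used in this study. The coarse grid search stands in for a proper nonlinear least-squares routine, and the data are synthetic.

```python
import math
from itertools import product

def adbudg(t, A, c, d):
    """ADBUDG-style response with lower asymptote 0 and upper asymptote A."""
    return A * t ** c / (d + t ** c)

def gompertz(t, A, b, c):
    return A * math.exp(-b * math.exp(-c * t))

def weibull(t, A, lam, k):
    return A * (1.0 - math.exp(-((t / lam) ** k)))

def logistic(t, A, k, t0):
    return A / (1.0 + math.exp(-k * (t - t0)))

def chapman_richards(t, A, k, m):
    return A * (1.0 - math.exp(-k * t)) ** m

def sse(curve, params, ts, ys):
    """Sum of squared errors between a curve and the observations."""
    return sum((curve(t, *params) - y) ** 2 for t, y in zip(ts, ys))

def grid_search(curve, grids, ts, ys):
    """Return (best_params, best_error) over the Cartesian product of the grids."""
    best_params, best_err = None, float("inf")
    for params in product(*grids):
        err = sse(curve, params, ts, ys)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err

# Synthetic cumulative fatalities generated from a known logistic curve.
ts = list(range(60))
observed = [logistic(t, 1000.0, 0.2, 30.0) for t in ts]
grids = ([800.0, 1000.0, 1200.0], [0.1, 0.2, 0.3], [20.0, 30.0, 40.0])
best_params, best_err = grid_search(logistic, grids, ts, observed)
print(best_params, best_err)
```

In practice the grid would be replaced by a continuous optimizer, and the upper asymptote `A` could be fixed in advance when outside information about the expected total is available.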

While using diffusion models to predict cases or fatalities has its merits, it has several shortcomings as well, such as having to set initial conditions and upper-asymptotes to fit the data well. On the other hand, reported data include daily numbers of infections rather than the number of active infections. This is not well-suited for compartmental models, as they take the numbers of active infections as inputs. This is also suggested in the literature: Tuncer & Le (2018) suggested that if health agencies reported prevalence data rather than incidence data, compartmental models would fit the data better. While most of the diffusion models, as well as the SIR model, use two parameters, modeling complex processes such as epidemic disease spread will typically require more parameters; otherwise, the model may not fit the experimental data well. For example, compartmental models, as well as diffusion models, assume a fixed transmission rate throughout the spread. Letting the transmission rate parameter be time-variant will result in a tighter fit. However, it will also impede model identifiability and will cause overparametrization and, therefore, overfitting.

In this section, we present a modified version of the original Bass diffusion model that can work directly with incidence data and address some of the shortcomings mentioned above. Our model is given below:

Equation (4) is analogous to compartment I in the SIR model. The reported data do not reveal the number of infectious individuals. Therefore, one way of fitting the SIR model is by including a separate step to estimate the number of active infectious cases among the reported ones, considering the duration of being infectious. While this duration may seem to be disease-dependent or may even vary from one individual to another, in our model we assume this value to be fixed for each region/country and associate it with parameter $d$. We let this parameter be country-specific, as each country may have a different reporting system, so that the calculated $d$ values may differ across countries. For example, being quicker or slower in reporting, or conducting more or fewer tests on asymptomatic patients, may change the corresponding $d$ values.

Perhaps the major shortcoming of diffusion models is the need for defining an upper asymptote. This is mainly due to the fact that diffusion models are designed to converge to an upper asymptote as time progresses, and not all who catch the infection are reported. We address this shortcoming by introducing a scaling parameter $k$. The parameter $k$, $k \geq 1$, corresponds to the level of reporting. Larger $k$ values indicate more cases going unreported; $k = 1$ corresponds to perfect reporting. Equation (5) links reported case data to reported fatalities. It simply applies a lag parameter to the reported cases and multiplies them by a case fatality rate ($c$). We also let this parameter be country-specific due to the diverse reporting procedures of individual countries.
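The link between reported cases and fatalities described for Equation (5) can be sketched under the simplest possible assumption: predicted fatalities on day $t$ equal $c$ times the reported cases on day $t - \text{lag}$. The function names, the least-squares criterion, and the synthetic numbers below are illustrative assumptions and are not taken from the fitted models in this study.

```python
def lagged_fatalities(daily_cases, c, lag):
    """Predicted daily fatalities: c times the cases reported `lag` days earlier."""
    return [c * daily_cases[t - lag] if t >= lag else 0.0
            for t in range(len(daily_cases))]

def fit_lag_and_cfr(daily_cases, daily_deaths, max_lag):
    """Try each lag; for a fixed lag the least-squares c has a closed form."""
    best = None
    for lag in range(max_lag + 1):
        xs = daily_cases[: len(daily_cases) - lag]  # cases aligned with later deaths
        ys = daily_deaths[lag:]
        denom = sum(x * x for x in xs)
        if denom == 0:
            continue
        c = sum(x * y for x, y in zip(xs, ys)) / denom
        err = sum((c * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[2]:
            best = (lag, c, err)
    return best

# Synthetic example: deaths follow cases by 4 days with a 5% fatality rate.
cases = [10, 20, 40, 80, 160, 120, 60, 30, 10, 5, 0, 0, 0, 0]
deaths = lagged_fatalities(cases, c=0.05, lag=4)
fitted_lag, fitted_c, fitted_err = fit_lag_and_cfr(cases, deaths, max_lag=7)
print(fitted_lag, round(fitted_c, 4))
```

Both quantities are estimated per country in this reading, so differences in reporting speed show up in the lag and differences in testing intensity show up in $c$.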

One possible shortcoming of diffusion models is that, in the absence of sufficient empirical data (a short time period) or with an improper asymptote, they may not yield a stable model (a sensitivity problem). Therefore, upper-asymptote selection is crucial. We calculated asymptotes based on Eryarsoy et al. However, the reliability of these asymptotes depends on factors such as how well the country is managing the outbreak (yielding lower numbers than expected, for instance) or how accurately it is reporting. The calculations are briefly discussed in the following section. For more detail, readers may refer to Eryarsoy et al.

Calculating upper-asymptotes translates to finding a rough approximation of the total number of incidents. The total number of fatalities or the total disease spread may depend on age/gender disparities. For COVID-19, prior studies show that age is a critical factor in the recovery/death of infected patients (Novel, 2020; Z. Wu & McGoogan, 2020; Zhou et al., 2020). So, it may not be accurate to expect similar death tolls in Niger and Japan (countries with the youngest and oldest populations, respectively). We, therefore, make use of conditional probabilities, fatality estimates, and already existing parameters from the literature (e.g., recovery rate and demographic breakdown) to address these disparities. We propose two steps for calculating upper-asymptotes: (i) compute the upper-asymptotes of other countries using conditional probabilities, following the setting depicted in Figure 7; (ii) find good-quality upper-asymptote candidates for a benchmark country (the US, in our case).

The present study takes into account age standardization using the age pyramids of countries as a complementary piece of information to map parameters from other countries. Figure 8 visualizes this phenomenon. Also, Figure 7 visualizes the spread from our analysis point of view. While we include other stages in Table 2, for the sake of simplicity, we will only focus on the Reported cases (R) and Deaths (D) stages for this study. We also make use of age-dependent incident data from four different countries: Spain 4, Italy (Bignami & Ghio, 2020), United States 5, and South Korea (Shim et al., 2020). It can be seen that countries whose age pyramids are more stretched at the top experienced relatively higher numbers of infections and fatalities. The age standardization is done using simple conditional probability formulas, for example, for reported cases (R) as:

$$P(R \mid A) = \frac{P(A \mid R)\,P(R)}{P(A)}$$

where A denotes an age group. We use these conditional probabilities to compute the upper-asymptote for fatality for the US. We then estimate upper-asymptotes for other countries using their population pyramids and conditional probabilities. While this is not central to our study, we provide our estimation of age breakdowns for the US in Table 1.

[Insert Table 1 here]
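To make the age-standardization step concrete, the sketch below applies Bayes' rule to a benchmark country and then reweights the resulting age-specific rates by another country's age pyramid. It is an illustration under stated assumptions: the three age groups, all shares and rates, and the function names are hypothetical and do not come from Table 1 or from the cited data sources.

```python
from typing import Dict


def rate_given_age(
    age_share_of_cases: Dict[str, float],  # P(A | R) in the benchmark country
    overall_rate: float,                   # P(R) in the benchmark country
    age_share_of_pop: Dict[str, float],    # P(A) in the benchmark country
) -> Dict[str, float]:
    """Bayes' rule: P(R | A) = P(A | R) * P(R) / P(A) for each age group."""
    return {
        age: age_share_of_cases[age] * overall_rate / age_share_of_pop[age]
        for age in age_share_of_pop
    }


def standardized_rate(
    rate_by_age: Dict[str, float], target_age_share: Dict[str, float]
) -> float:
    """Expected overall rate in a target country: sum over A of P(R | A) * P'(A)."""
    return sum(rate_by_age[age] * target_age_share[age] for age in target_age_share)


# Hypothetical benchmark country with an older population.
benchmark_cases = {"young": 0.2, "middle": 0.4, "old": 0.4}
benchmark_pop = {"young": 0.4, "middle": 0.4, "old": 0.2}
by_age = rate_given_age(benchmark_cases, overall_rate=0.01, age_share_of_pop=benchmark_pop)

# Hypothetical target country with a younger population.
younger_pop = {"young": 0.6, "middle": 0.3, "old": 0.1}
target_rate = standardized_rate(by_age, younger_pop)
print(by_age, target_rate)
```

With these made-up inputs the younger country ends up with a lower expected rate than the benchmark, which mirrors the Niger versus Japan argument made earlier in the section.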

A recent study by Oxford scientists (Lourenco et al., 2020) estimated that as many as 64% of UK citizens may already have been infected by COVID-19 during the first 45 days post-introduction (using R0 = 2.25). However, the article assumed that a mere 0.1% of the population would require hospital care, and it contradicts another UK report by scientists at Imperial College London (Ferguson et al., 2020) that predicted approximately 510K fatalities for the UK (and 2.2M for the US), with the peak influx of patients during June (around 120 days after the first infection). Both of these studies used SIR-type models. As noted earlier, fitting an SIR model to reported COVID-19 cases, fatalities, and recoveries using the Runge-Kutta algorithm does not provide a good fit to the data. Table 2 gives different estimates for the SIR model parameters in different countries. The poor fit is possibly due to a significant number of undocumented cases. Therefore, using better estimates is crucial in model fitting.

Determining R0 and Susceptible Population: Different studies give different estimates of R0 for COVID-19 (Roques et al., 2020; Sugishita et al., 2020; Tang et al., 2020; Trilla, 2020; Zhang et al., 2020), ranging from 1.4 to as high as 6.7 (early studies usually suggest larger numbers). Government measures are also effective in reducing the R0 of the virus. Sugishita et al. (2020) studied the effect of voluntary event cancellation on R0 values and found that, in Japan, R0 was reduced from 2.29 to 1.99. Another study from South Korea (Hwang et al., 2020) revealed effective R0 values near 1 in some of the regions. However, as of the writing of this paper, keeping the R0 of a virus that has spread worldwide at 1 seems overly optimistic. Therefore, we select the interval [1.8, 2.2] for R0 to design our scenarios for this study.

Determining the probability of developing symptoms, infections being reported, and hospitalization: Individuals with asymptomatic infections are still infectious, and their CT images can still exhibit the features typical of infected COVID-19 patients. To compute different asymptote candidates, the probability of developing symptoms once infected is another parameter we need to consider (P(S|I) in Figure 7). One study assumed a 50% probability of developing symptoms after infection. Using the IFR and cCFR in Russell et al. (2020), this number is 45.5%. A more recent study suggested this number to be as low as 20% (Day, 2020). For our scenarios, we consider a probability interval of [0.20, 0.50] for developing symptoms given infection. In a similar manner, we consider a range of [0.11, 0.20] for P(R|S), the probability of being reported given developing symptoms. Also, P(H|R), the probability of hospitalization of reported cases, is taken to be 36% across all scenarios (Ministerio de Sanidad, 2020).

Determining underreporting: Many studies highlight under-reporting or undocumented cases. A peer-reviewed paper states that as much as 86% of the cases in Wuhan went undocumented. Another study put the same figure at around 55-68% (Russell et al., 2020). Also, an article by the Economist 6 visualized total death numbers for several cities in Europe. After accounting for the expected number of deaths and the deaths reported as due to the pandemic, the unexplained numbers suggest that undocumented deaths may actually be even higher than the COVID-19 fatalities reported by countries. Depending on the scenario, we assume the number of reported cases to be between 11% and 60% of the symptomatic cases.

Determining Case Fatality Rates: Crude estimates of case fatality rates are calculated by simply dividing the number of deaths (D) by the number of reported cases (R in Figure 7). Studies focusing on fatality rates of COVID-19 also report different findings. While earlier studies (Livingston & Bucher, 2020) estimated the case fatality risk (CFR) to be as high as 6-7%, the numbers were found to be inflated because the outcome (recovery or death) is not known for all cases (hence referred to as naïve or crude CFR). Studies typically report fatalities based on the number of reported cases and are therefore region-specific (Onder et al., 2020). Some studies analyze fatality rates by taking into account the median time from the onset of symptoms (case reporting) until death. Wilson et al. (2020) estimate this time at around 13 days. The length of this period is consistent across other studies as well (Linton et al., 2020; Russell et al., 2020). Two recent studies estimated CFR at 1.4% (J. T. Wu, Leung, Bushman, et al., 2020). The CFRs reported by countries that screen extensively are arguably a better indicator of the lethality of the virus. However, since even this level of screening can potentially miss some cases, we opt for a mild CFR of 1.5% for our scenarios. We acknowledge that the above approach is fairly simplistic, and we used it only to create different analysis scenarios.

Following the previous discussion on estimated ranges for R0, CFR (corrected), and the conditional probabilities of developing symptoms, infections being reported, and hospitalization, we decide on three scenarios for this study (Table 3). A published poll (of CDC experts) on four possible scenarios based on COVID-19 characteristics states that between 160 million (best-case scenario) and 240 million (worst-case scenario) people in the United States could be infected throughout the course of the epidemic, in which case between 2.4 and 21 million people may require hospitalization and between 200,000 and 1.7 million people could die 7. Another recent estimate by Murray (2020), as of April 12, 2020, forecasts the number of fatalities for the US to be between 26K and 155K. Table 3 indicates the breakdowns for our scenarios 1-3 and the expert estimates. The estimated numbers of susceptible people and fatalities yielded by our 1st and 2nd scenarios roughly match the CDC estimates, while the 2nd and 3rd scenarios also roughly match Murray (2020). Scenario 3 also uses the parameters that demonstrate a good fit with the Wuhan and Italy data. We decided to use the 2nd scenario, which matches both the CDC estimates and Murray (2020), to determine upper-asymptotes.

[Insert Table 3 here]
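The way a scenario turns the conditional probabilities into case and fatality counts can be sketched as a simple chain of multiplications. The inputs below combine values quoted in the text (160 million infected, P(S|I) = 0.50, P(R|S) = 0.20, CFR = 1.5%), but this particular combination is chosen for illustration only and should not be read as one of the three scenarios in Table 3; the function name is hypothetical.

```python
from typing import Dict


def scenario_counts(
    infected: float,
    p_symptoms_given_infected: float,  # P(S | I)
    p_reported_given_symptoms: float,  # P(R | S)
    cfr: float,                        # deaths per reported case
) -> Dict[str, float]:
    """Chain the conditional probabilities from infections down to deaths."""
    symptomatic = infected * p_symptoms_given_infected
    reported = symptomatic * p_reported_given_symptoms
    deaths = reported * cfr
    return {"symptomatic": symptomatic, "reported": reported, "deaths": deaths}


# Illustrative combination of values quoted in the text (not a Table 3 scenario).
counts = scenario_counts(
    infected=160e6,
    p_symptoms_given_infected=0.50,
    p_reported_given_symptoms=0.20,
    cfr=0.015,
)
print(counts)
```

Under these inputs the chain gives 80 million symptomatic infections, 16 million reported cases, and about 240,000 deaths, which is the kind of figure that can then serve as an upper-asymptote candidate.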

Computing the parameters of diffusion models is usually done by non-linear optimization with box constraints. For curve fitting, we minimized the sum of squared errors and used the Nelder-Mead (1965), Hooke-Jeeves (1961), and Subplex (the Nelder-Mead algorithm on a sequence of subspaces) (Rowan, 1991) algorithms to report the best set of parameters, corresponding to the lowest RMSE. All of the diffusion models converged and fit the data.
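The fitting procedure above can be sketched in a few lines. The example below is not the authors' R code: it is a hand-written, simplified Hooke-Jeeves pattern search in pure Python that minimizes the sum of squared errors of a standard cumulative Bass curve under box constraints. The synthetic data, the fixed upper asymptote M, the starting point, the step sizes, and the bounds are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Bounds = Sequence[Tuple[float, float]]
Objective = Callable[[Sequence[float]], float]


def bass_cumulative(t: float, m: float, p: float, q: float) -> float:
    """Cumulative Bass curve with upper asymptote m."""
    e = math.exp(-(p + q) * t)
    return m * (1.0 - e) / (1.0 + (q / p) * e)


def clip(x: Sequence[float], bounds: Bounds) -> List[float]:
    """Project a point back into the box constraints."""
    return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]


def explore(f: Objective, x: Sequence[float], fx: float,
            steps: Sequence[float], bounds: Bounds) -> Tuple[List[float], float]:
    """Exploratory move: try +/- one step along each coordinate in turn."""
    x = list(x)
    for i in range(len(x)):
        for sign in (1.0, -1.0):
            trial = list(x)
            trial[i] += sign * steps[i]
            trial = clip(trial, bounds)
            f_trial = f(trial)
            if f_trial < fx:
                x, fx = trial, f_trial
                break
    return x, fx


def hooke_jeeves(f: Objective, x0: Sequence[float], steps: Sequence[float],
                 bounds: Bounds, tol: float = 1e-9,
                 max_iter: int = 5000) -> Tuple[List[float], float]:
    """Simplified Hooke-Jeeves search; returns (best point, best value)."""
    base = clip(x0, bounds)
    f_base = f(base)
    steps = list(steps)
    for _ in range(max_iter):
        new, f_new = explore(f, base, f_base, steps, bounds)
        if f_new >= f_base:
            steps = [s / 2.0 for s in steps]  # no improvement: refine the mesh
            if max(steps) < tol:
                break
            continue
        while f_new < f_base:
            # Pattern move: jump along the improving direction, then explore.
            pattern = clip([2.0 * n - b for n, b in zip(new, base)], bounds)
            base, f_base = new, f_new
            new, f_new = explore(f, pattern, f(pattern), steps, bounds)
    return base, f_base


# Synthetic, noise-free "observations" from known (hypothetical) parameters.
M, P_TRUE, Q_TRUE = 10000.0, 0.005, 0.15
days = list(range(60))
observed = [bass_cumulative(t, M, P_TRUE, Q_TRUE) for t in days]


def sse(params: Sequence[float]) -> float:
    """Sum of squared errors between the Bass curve and the observations."""
    p, q = params
    return sum((bass_cumulative(t, M, p, q) - y) ** 2 for t, y in zip(days, observed))


bounds = [(1e-6, 0.1), (1e-3, 1.0)]  # box constraints on (p, q)
start = [0.01, 0.1]
start_sse = sse(start)
best, best_sse = hooke_jeeves(sse, start, steps=[0.001, 0.01], bounds=bounds)
rmse = math.sqrt(best_sse / len(days))
print(best, rmse)
```

In practice one would rely on tested library implementations of Nelder-Mead, Hooke-Jeeves, or Subplex and compare the RMSE they reach, as the text describes; the sketch only shows how the objective, the bounds, and the search fit together.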

We coded all our datasets in the R language. We performed all analyses on a personal laptop equipped with a 7th-generation Intel® i5 processor and 16 GB of memory.

We fit diffusion models to the data of the countries with the highest numbers of fatalities. For each country and model, we report the RMSE. The numbers suggest that the Chapman-Richards and Bass models fit the data better than the other diffusion models. They outperformed the others on more country datasets (Chapman-Richards on 27, Bass on 13), and the variance of the RMSE is small for both of these models, indicating consistency. Details are given in Table 5. The average RMSEs are given in Table 4. The table suggests that Scenario #2 enables the best fit for four of the diffusion models. Hence, the data on hand are arguably in line with Scenario #2 being the most likely scenario.

[Insert Table 4 here]

Also, Figure 9 depicts the fitted curve for each model in comparison to Figure 5, for Belgium. Even though the models had a good fit (low RMSEs), they pointed to a range of diffusion patterns.

The shape parameters of the fitted diffusion models suggested that, even though the median ages of China and the US are very similar (37.4 and 38.1, respectively), much faster diffusion took place in the US. For example, the innovation and imitation coefficients of the Bass diffusion model, both measures of the spread rate over a period, were (p = 0.0045, q = 0.14) for China and (p = 0.00001, q = 0.17314) for the US. While p, the innovation coefficient, is dominant during the first stages of Bass diffusion, the imitation coefficient q becomes much more dominant at later stages. These differences between the p and q coefficients for the US and China suggest that, even though the US had a slower pace at the beginning, the spread in the US later became significantly faster than that in China.
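One way to read the reported coefficients is through the time at which the standard Bass curve reaches its peak rate, t* = ln(q/p) / (p + q). The sketch below evaluates this textbook formula for the two (p, q) pairs quoted above. Interpreting the time unit as days since the start of each series is an assumption, and the function name is hypothetical.

```python
import math


def bass_peak_time(p: float, q: float) -> float:
    """Time of the peak rate in the standard Bass model: ln(q / p) / (p + q)."""
    if p <= 0 or q <= p:
        raise ValueError("an interior peak requires 0 < p < q")
    return math.log(q / p) / (p + q)


# Coefficients reported in the text for China and the US.
peak_china = bass_peak_time(p=0.0045, q=0.14)
peak_us = bass_peak_time(p=0.00001, q=0.17314)
print(round(peak_china, 1), round(peak_us, 1))
```

The China pair peaks after roughly 24 time units and the US pair after roughly 56, even though the US has the larger q. This is consistent with the reading above: a very small p delays the take-off, while a larger q makes the spread steeper once it is under way.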

The modified Bass diffusion model tends to require heavier parameterization. Its parameters are also computed using non-linear optimization methods with box constraints. While it is possible to use the same set of optimization methods that we used for the diffusion models, such as Nelder-Mead or Hooke-Jeeves, these algorithms did not converge for some instances. We therefore used Shuffled Complex Evolution (SCE) optimization, due to Duan et al. (1992), for curve fitting. The algorithm took less than 5 seconds of computing time per country. We also cross-checked the results by implementing a genetic algorithm (GA) search with BFGS due to Byrd et al. (1995). While the GA took significantly more time to converge, it was unable to outperform SCE on any country data. Both algorithms converged to very similar sets of parameter values across the countries we tested. Figure 10 illustrates the fit over the Belgium case data for comparison.

The last column of Table 5 indicates the RMSEs of our modified Bass model for different countries. As shown, for all countries, the proposed model outperforms all the regular diffusion models in fitting COVID-19 data.

[Insert Table 5 here]

Table 6 shows the parameter estimates for each country. These parameters should not be seen as parameters of the disease but rather as parameters of the reported spread. For instance, the results show that d, the duration of being infectious, is roughly the same (around 11 days) for the countries in the initial or middle stages of the spread, while it is shorter (8 days on average) for countries in the ending stages. Parameter d depends on the average time it takes for a case to actually test positive, so the variation could be due to testing and reporting differences. While our model is rather parsimonious, it still indicates a large number of unreported cases (k ≫ 1 for almost all countries except Sweden). This is in accordance with one of the most recent press releases by the CDC, which suggests that the number of infected Americans is at least 10 times higher than the reported 2.3 million 8 (captured by k in our model).

Submodel 2 works in tandem with Submodel 1. As all COVID-19 data resources (except for Johns Hopkins) omit daily recovery data, we set up Submodel 2 to predict the number of deaths rather than recoveries. Therefore, c corresponds to the time-adjusted CFR. The average of the time-adjustment parameter is around 7 days globally.

[Insert Table 6 here]

Compartmental models have been widely used in the literature and have been shown to be very effective: they do not require heavy parameterization and are quite intuitive. However, they have their shortcomings, especially in this application domain. In our opinion, these flaws surface because of the reporting styles and the inherent noise in the reported data. The top three flaws are: (i) Health organizations report incidence data on COVID-19, while the SIR model is better suited to prevalence data (Tuncer & Le, 2018). (ii) Incidences, such as the number of new infections, are not accurately reported. (iii) The transmission-rate parameter may be time-variant, depending on the rapidly changing government policies during a pandemic. In this study, we propose a model that helps mitigate the effects of (i) and (ii).

While the COVID-19 pandemic is in its mid-stages in many countries, several countries seem to have gone through the first wave of the spread. We fit COVID-19 infection and fatality data using multiple well-known diffusion models. These diffusion models required calculating upper-asymptotes for the diffusion, which has proven to be rather problematic. In this study, we provided our estimates based on three different scenarios from the early COVID-19 literature (optimistic, moderate, and pessimistic). Our results suggest that the moderate scenario fitted the data best. We noted that having to estimate upper-asymptotes is impractical and introduced our own model.

Our model consists of two sub-models working in tandem. Our first sub-model, the modified Bass diffusion model, does not require initial conditions, upper-asymptote or population adjustments, or the length of the infectious period to fit case and death data. The model can fit directly to the reported incidence data, while its parameters help us understand more about the spread. To this end, we relax the SIR model assumptions and address its shortcomings by introducing other parameters, such as d and k. In practice, each country may apply different measures to control the speed of the spread, which may influence the parameters p and q that mirror transmission. By letting these parameters vary along the time axis, we can make the model fit the data even more tightly. This can also help to model the second wave of the spread in some countries. However, heavier parameterization will also cause the model to overfit the data. To avoid overfitting, we assume fixed transmission rates (p and q) for each country throughout the spread. We argue that this is one of the key contributions of the present study, which is disregarded in many other fatality estimation studies.

While multi-modal pandemic spreads (such as second or third waves) cannot be scientifically defined, many countries have experienced spread patterns that go beyond a unimodal shape. One way of dealing with such spread patterns is to model each wave using a separate diffusion model. Our study can also be extended by relaxing our assumption of fixed transmission rates and letting the rates be time-variant. Although the model would become overly complicated and would potentially overfit, this approach may be used to handle multiple waves of spread.

The second sub-model simply applies a lag parameter and the CFR (c) to the reported data. The estimated CFR (c) is time-adjusted but does not take unreported cases into account. The actual fatality rate, which takes asymptomatic and unreported cases into account, can be obtained by dividing c by k, since k ≥ 1 scales the reported cases up to the total number of infections. Our results suggest the actual disease to be much less lethal than we believed.

Developing plans to better handle limited healthcare resources during epidemics depends heavily on reliable and accurate estimates of demand, yet providing such estimates using popular compartmental models (e.g., SIR or MSIR) is challenging at the early stages of the spread. This is particularly true given the complex nature of those models, which require large amounts of data to be trained. Our proposed methodology enables healthcare managers and decision-makers to achieve comparably good estimates with the smaller amount of data available at the early stages of the disease and to plan their resources accordingly.

While our model fits COVID-19 data well, there is a need for a follow-up study on model identifiability. Model identifiability is important to avoid over-parameterization, where "data does not provide enough information for uniquely estimating all parameters" (Kreutz, 2018). A recent paper by Tuncer & Le (2018) studied the identifiability of other outbreak models. Our study can potentially be extended by assessing the identifiability of the modified Bass model. Given that age has been recognized as a critical factor affecting infections and deaths, this study can also be extended by using age pyramids to readjust our model parameters. This can make model parameters more comparable across countries. Another extension is to modify the model to estimate the number of hospitalizations and intensive care unit requirements during the course of an epidemic. Finally, both our model and the SIR model assume life-long immunity and fit unimodal disease spread patterns.

Figure 1. A typical S-shape curve of a diffusion model.

Data-Based Analysis, Modelling and Forecasting of the novel Coronavirus (2019-nCoV) outbreak

Presumed asymptomatic carrier transmission of COVID-19

A new product growth for model consumer durables

Modeling the Response Pattern to Direct Marketing Campaigns

Modeling the diffusion of new durable goods: Word-of-mouth effect versus consumer heterogeneity

Estimates of the reproduction number for seasonal, pandemic, and zoonotic influenza: a systematic review of the literature

A demographic adjustment to improve measurement of COVID-19 severity at the developing stage of the pandemic

Bankruptcy prediction in the agribusiness sector: Lessons from quantitative and qualitative approaches

A Limited Memory Algorithm for Bound Constrained Optimization

Persuasive messages, popularity cohesion, and message diffusion in social media marketing

2019 Novel coronavirus: where we are and what we know

On the Gompertz Process and New Product Sales: Some Further Results from Cointegration Analysis

Covid-19: four fifths of cases are asymptomatic, China figures indicate

La logique sociale

No Place Like Home: A Cross-National Assessment of the Efficacy of Social Distancing during the COVID-19

JMIR Public Health and Surveillance

The role of channel quality in customer equity management

Effective and efficient global optimization for conceptual rainfall-runoff models

Adjusting COVID-19 Reports for Countries' Age Disparities: A Comparative Framework for Reporting Performances

Product sales forecasting using online reviews and historical sales data: A method combining the Bass model and sentiment analysis

Analysis and forecast of COVID-19 spreading in China, Italy and France

Estimating the number of infections and the impact of non-pharmaceutical interventions on COVID-19 in 11 European countries

The diffusion of social movements: Actors, mechanisms, and political effects

On the Nature of the Function Expressive of the Law of Human Mortality, and on a New Mode of Determining the Value of Life Contingencies

Clinical Characteristics of Coronavirus Disease 2019 in China

The medium and long term forecast of China's vehicle stock per 1000 person based on the gompertz model

"Direct Search" Solution of Numerical and Statistical Problems

Information diffusion and new product consumption: A bass model application to tourism facility management

Basic and effective reproduction numbers of COVID-19 cases in South Korea excluding Sincheonji cases. MedRxiv, 2020.03.19

Comparison and analysis of diffusion models

The bass model of diffusion: recommendations for use in information systems research and practice

Forecasting on China's civil automobile-owned based on gompertz model

Attack rates of seasonal epidemics

Using online search data to forecast new product sales

Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia

Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2). Science, eabb3221

A diffusion mechanism for social advertising over microblogs

Incubation Period and Other Epidemiological Characteristics of 2019 Novel Coronavirus Infections with Right Truncation: A Statistical Analysis of Publicly Available Case Data

Models and managers: The concept of a decision calculus

Coronavirus Disease 2019 (COVID-19) in Italy

A reaction-diffusion malaria model with incubation period in the vector population

Fundamental principles of epidemic spread highlight the immediate need for large-scale serological surveys to assess the stage of the SARS-CoV-2 epidemic

New Product Diffusion Models in Marketing: A Review and Directions for Research

Generalized model for the time pattern of the diffusion process

Innovation diffusion models of new product acceptance: A reexamination

Are there contagion effects in information technology and business process outsourcing? Decision Support Systems

Transmission potential of the novel coronavirus (COVID-19) onboard the diamond Princess Cruises Ship

Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship

The diffusion of online shopping in Australia: Comparing the Bass, Logistic and Gompertz growth models

A Simplex Method for Function Minimization

A simple selection test between the Gompertz and Logistic growth models

The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19) in China

ICT adoption in Cameroon SME: application of Bass diffusion model. Information Technology for Development

Case-Fatality Rate and Characteristics of Patients Dying in Relation to COVID-19 in Italy

Report of the WHO-China Joint Mission on Coronavirus Disease 2019 (COVID-19)

Innovation diffusion and new product growth models: A critical review and research directions

A flexible growth function for empirical use

COVID-19 outbreak on the Diamond Princess cruise ship: estimating the epidemic potential and effectiveness of public health countermeasures

Modèle SIR mécanisticostatistique pour l'estimation du nombre d'infectés et du taux de mortalité par COVID-19

Functional stability analysis of numerical algorithms

Using a delay

An agent-based diffusion model with consumer and brand agents

Transmission potential and severity of COVID-19 in South Korea

Predicting the path of technological innovation: SAW vs. Moore, Bass, Gompertz, and Kryder

Return on marketing investments in B2B customer relationships: A decision-making and optimization approach

Insignificant effect of counter measure for coronavirus infectious disease-19 in Japan

Estimation of the transmission risk of the 2019-nCoV and its implication for public health interventions

Les lois de l'imitation

One world, one health: The novel coronavirus COVID-19 epidemic

Analysis of the Growth of Security Breaches: A Multi-Growth Model Approach

New product diffusion acceleration: Measurement and analysis

New product diffusion with influential and imitators

Social contagion and income heterogeneity in new product diffusion: A meta-analytic test

Quantitative laws in metabolism and growth

Clinical Characteristics of 138 Hospitalized Patients with 2019 Novel Coronavirus-Infected Pneumonia in Wuhan, China

Suppressing disease spreading by using information diffusion on multiplex networks

Wide applicability

Marketing decision models: Progress and perspectives

Case-Fatality Risk Estimates for COVID-19 Calculated by Using a Lag Time for Fatality

Forecasting new product trial with analogous series

ADBUDG model optimization and practical analysis

Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China

Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study

Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention

Two-warehouse partial backlogging inventory models with three-parameter Weibull distribution deterioration under inflation

Estimation of the time-varying reproduction number of COVID-19 outbreak in China

Bubbles and the Weibull distribution: was there an explosive bubble in US stock prices before the global economic crisis?

Comparison of numerical methods of the SEIR epidemic model of fractional order

Estimation of the reproductive number of novel coronavirus (COVID-19) and the probable outbreak size on the Diamond Princess cruise ship: A data-driven analysis

Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan

He has recently published ten books/textbooks in the broad area of Business Intelligence and Business Analytics. He is often invited to national and international conferences and symposiums for keynote addresses, and to companies and government agencies for consultancy and professional education engagements on analytics and data science related topics. Dr. Delen served as the general co-chair for the 4th International Conference on Network Computing and Advanced Information Management (September 2-4, 2008, in Seoul, South Korea), and regularly chairs tracks and mini-tracks at various information systems and analytics conferences. He is currently serving as the editor-in-chief, senior editor, associate editor, and editorial board member of more than a dozen academic journals.