key: cord-351030-jqqxqjzf
authors: Rui, M.; Qi, D.; Yong, L.
title: A Sparse Gaussian Network Model for Prediction the Growth Trend of COVID-19 Overseas Import Case: When can Hong Kong Lift the International Traffic Blockad?
date: 2020-05-16
journal: nan
DOI: 10.1101/2020.05.13.20099978
sha: 
doc_id: 351030
cord_uid: jqqxqjzf

COVID-19 was first discovered in China and has since spread widely around the world. Compared with the rising trend of the overall international epidemic, China's domestic epidemic has been contained and shows a steady, improving trend. In this situation, imported cases from overseas have become the main source of new infections in China. How to track the international transmission channels of the epidemic and predict the growth of imported cases has therefore become an open research question. This study proposes a sparse Gaussian network model based on the Lasso and uses Hong Kong as an example. It explores COVID-19 from a network perspective and analyzes 75 consecutive days of COVID-19 data from 188 countries and regions around the world. We establish an epidemic transmission network between Hong Kong and countries and regions around the world, and build a regression model based on the network information to fit Hong Kong's COVID-19 epidemic growth data. The results show that the regression model based on the relationship network fits the existing cumulative case growth curve well. After combining it with the SEIJR model, we predict the future trend of cumulative cases in Hong Kong (without blocking international traffic). Based on the prediction results, we suggest that Hong Kong can lift the international traffic blockade from early to mid-June.

The COVID-19 virus was found in Wuhan, Hubei Province, China in December 2019.

According to evidence on early transmission dynamics, human-to-human transmission has occurred among close contacts since mid-December 2019 [1]. To control the spread of infection, Hubei and other provinces adopted measures such as city lockdowns and reduced inter-city mobility. Through a large number of public health interventions, the local epidemic in the provinces and cities of China has been largely controlled.

However, the international spread of the epidemic is inevitable. Therefore, for areas where local transmission has been largely controlled, preventing transmission from overseas has become the focus of current epidemic prevention work [2] [3] [4].

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. This version posted May 16, 2020. https://doi.org/10.1101/2020.05.13.20099978

This article takes Hong Kong as an example to discuss how to effectively predict the cumulative case growth curve of regions where imported cases are the main source of growth. Hong Kong is an important international transportation hub, and the movement of a large number of international passengers has had an important impact on the spread of the epidemic there. Foreigners entering through international transportation channels such as aircraft and shipping are the main source of new cases in the region. To reduce the possible transmission risk, Hong Kong announced that, from 0:00 on March 25, 2020, non-Hong Kong residents are prohibited from entering via Hong Kong Airport. However, as an international financial and transportation center, Hong Kong suffers heavy losses every day because of the international traffic blockade. It is therefore of great significance to predict the possible cumulative case growth of Hong Kong under an incomplete blockade, and to further determine a possible date for lifting the blockade. In the traditional SIR/SEIR model, however, the transmission rate is a constant, whereas in practice the transmission rate changes continuously.

For example, the growth rate of overseas imported cases is affected by changes in the international epidemic, etc. In this situation, it is difficult to use traditional infectious disease models to predict the future growth trend of Hong Kong [2, 5] . Therefore, finding a new model to solve this problem has become an open research question.

In our research, we used 75 days of real-time infection data from 188 countries and regions around the world and established a case transmission network between Hong Kong and other parts of the world using a sparse Gaussian network model based on the Lasso. The results show that the correlation between the epidemic trend in Hong Kong and that of several outbreak centers abroad is extremely high. We can therefore use the cumulative case growth data of the regions that are highly correlated with Hong Kong in the network to build a regression model that fits the cumulative case growth data of Hong Kong. By further using the SEIJR model to predict the case growth of these target regions (those related to Hong Kong), we can predict the number of COVID-19 cases in Hong Kong without a traffic blockade. Our findings can help Hong Kong adjust public interventions, estimate the time for lifting the blockade, and provide evidence to help avoid serious outbreaks and economic losses.

The real epidemic data set used in this article mainly comes from https://github.com/BlankerL/DXY-COVID-19-Data, including the cumulative number of confirmed cases and the cumulative number of cured cases from January 19 to April 2, 2020.

This article uses a Gaussian graphical model based on the Lasso to construct an international epidemic transmission network. We use a neighborhood selection strategy to solve the covariance selection problem. The specific model is as follows.

Consider a $p$-dimensional multivariate normally distributed random variable $X = (X_1, \dots, X_p) \sim \mathcal{N}(\mu, \Sigma)$. This setting includes Gaussian linear models in which, for example, $X_1$ is the response variable and the remaining variables are predictors. Neighborhood selection can then be cast as a standard regression problem and solved efficiently with the Lasso [7], as is done in this article.

For sparse high-dimensional graphs, neighborhood selection is consistent: the number of variables may grow like any power of the number of observations (high dimensionality), while the number of neighbors of any variable grows at most slightly slower than the number of observations (sparsity).

Neighborhood selection. The Lasso [7], proposed by Tibshirani, is known as basis pursuit in the context of wavelet regression [8] and has a parsimony property [8]. When a variable $X_a$ is predicted from all remaining variables $\{X_k;\ k \in \Gamma \setminus \{a\}\}$, where $\Gamma$ is the set of nodes of the graph, the vanishing Lasso coefficient estimates asymptotically identify the neighborhood of node $a$ in the graph, as shown below.

Let the $n \times p$ matrix $\mathbf{X}$ contain $n$ independent observations of $X$, so that for all $a \in \Gamma$ the column $\mathbf{X}_a$ is the vector of $n$ independent observations of $X_a$. Let $\langle \cdot, \cdot \rangle$ be the usual inner product on $\mathbb{R}^n$ and $\|\cdot\|_2$ the corresponding norm [9].

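The Lasso estimate $\hat\theta^{a,\lambda}$ of the coefficient vector $\theta^a$ for predicting $X_a$ is given by formula (1):

$$\hat\theta^{a,\lambda} = \arg\min_{\theta:\ \theta_a = 0} \left( n^{-1} \|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda \|\theta\|_1 \right) \quad (1)$$

where $\|\theta\|_1 = \sum_{b \in \Gamma} |\theta_b|$ is the $\ell_1$ norm of the coefficient vector. It is recommended to normalize all variables to a common empirical variance before solving formula (1). The solution of formula (1) is not necessarily unique. However, if uniqueness fails, the solution set is still convex, and all of our results on neighborhoods apply to any solution of formula (1).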

Other regression estimates based on the $\ell_q$ norm have been proposed, where $q$ is usually in the range [0, 2] [10]. A value of $q = 2$ gives the ridge estimate, and $q = 0$ corresponds to traditional model selection. It is well known that the estimate has a parsimony property (some components are exactly zero) only when $q \le 1$, whereas the optimization problem in formula (1) is convex only for $q \ge 1$. Minimization of the empirical risk under an $\ell_1$ constraint therefore occupies a unique position: $q = 1$ is the only value of $q$ for which variable selection takes place while the optimization problem remains convex, which makes it feasible for high-dimensional problems.
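A one-dimensional worked example may make the contrast between the $\ell_1$ and $\ell_2$ penalties concrete. It is a simplified illustration, not part of the original derivation: $z$ denotes a hypothetical unpenalized least-squares solution and $\theta$ a single coefficient.

```latex
\hat\theta_{\ell_1}
  = \arg\min_{\theta} \left\{ (z - \theta)^2 + \lambda |\theta| \right\}
  = \operatorname{sign}(z)\, \max\!\left( |z| - \tfrac{\lambda}{2},\ 0 \right),
\qquad
\hat\theta_{\ell_2}
  = \arg\min_{\theta} \left\{ (z - \theta)^2 + \lambda \theta^2 \right\}
  = \frac{z}{1 + \lambda}.
```

The $\ell_1$ solution is exactly zero whenever $|z| \le \lambda/2$, which is the parsimony property, whereas the $\ell_2$ (ridge) solution only shrinks $z$ and is zero only if $z = 0$.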

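The neighborhood estimate of node $a$ (parameterized by $\lambda$) is defined by the nonzero coefficient estimates of the $\ell_1$-penalized regression, as in formula (2):

$$\widehat{\mathrm{ne}}_a^{\lambda} = \{\, b \in \Gamma : \hat\theta_b^{a,\lambda} \neq 0 \,\} \quad (2)$$

where $\hat\theta^{a,\lambda}$ is the Lasso estimate of formula (1).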

Each choice of the penalty parameter $\lambda$ thus specifies an estimate of the neighborhood of node $a \in \Gamma$, and what remains is to choose an appropriate penalty parameter. A larger penalty tends to shrink the estimated set, while decreasing $\lambda$ usually adds more variables to the estimated neighborhood.
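The effect of the penalty on the estimated neighborhood can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this article: it minimizes the $\ell_1$-penalized least-squares objective $n^{-1}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1$ by coordinate descent, and the function names (`soft_threshold`, `lasso_cd`, `neighborhood`) and the four-row toy data set are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/n)*||y - X theta||_2^2 + lam*||theta||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    resid = list(y)  # current residual y - X theta
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # covariance of column j with the residual that excludes column j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * theta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new - theta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                theta[j] = new
    return theta


def neighborhood(data, a, lam):
    """Indices whose Lasso coefficient is nonzero when column a is regressed on the rest."""
    others = [k for k in range(len(data[0])) if k != a]
    X = [[row[k] for k in others] for row in data]
    y = [row[a] for row in data]
    theta = lasso_cd(X, y, lam)
    return {k for k, coef in zip(others, theta) if coef != 0.0}


# Hypothetical toy data: column 0 depends strongly on column 1, weakly on column 2.
data = [(1.05, 1.0, 1.0), (0.95, 1.0, -1.0), (-0.95, -1.0, 1.0), (-1.05, -1.0, -1.0)]
for lam in (0.02, 0.5, 3.0):
    print(lam, sorted(neighborhood(data, 0, lam)))
```

In this toy case the weakly related variable drops out of the neighborhood first as the penalty grows, and a sufficiently large penalty empties the neighborhood entirely.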

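The prediction-oracle solution. A seemingly useful choice of the penalty parameter is the (unavailable) prediction-oracle value, as in formula (3):

$$\lambda_{\mathrm{oracle}} = \arg\min_{\lambda} E\Big( X_a - \sum_{k \in \Gamma} \hat\theta_k^{a,\lambda} X_k \Big)^2 \quad (3)$$

The expectation is taken with respect to a new $X$ that is independent of the sample on which $\hat\theta^{a,\lambda}$ is estimated. The prediction-oracle penalty minimizes the prediction risk among all Lasso estimates. An estimate of $\lambda_{\mathrm{oracle}}$ is obtained by selecting $\lambda$ by cross-validation.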

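Shao [11] showed that, for $\ell_0$-penalized regression, the cross-validated choice of the penalty parameter is consistent for model selection under certain conditions on the size of the validation set. The prediction-oracle solution, however, does not lead to consistent model selection for the Lasso. Proposition 1: let $\Sigma_{ab} = \Sigma_{ba} = s$ for a pair of nodes $(a, b)$, for some $0 < s < 1$ and all $n$. Under the prediction-oracle penalty, the probability of choosing the wrong neighborhood for node $a$ converges to 1, as in formula (4):

$$P\big( \widehat{\mathrm{ne}}_a^{\lambda_{\mathrm{oracle}}} \neq \mathrm{ne}_a \big) \to 1 \quad \text{for } n \to \infty \quad (4)$$

where $\mathrm{ne}_a$ is the true neighborhood of node $a$.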

It follows from the proof of Proposition 1 that many noise variables are included in the neighborhood estimate under the prediction-oracle solution. In fact, for a fixed number of variables, the probability of including noise variables under the prediction-oracle solution does not even vanish asymptotically. If the penalty is chosen larger than the prediction-optimal value, the Lasso can be used for consistent neighborhood selection.

Like the perceptron, the Passive Aggressive Algorithm does not need a learning rate to be set. Unlike the perceptron, however, it has an additional regularization parameter C. As with most regression algorithms, the purpose of Passive Aggressive Algorithms is to use training samples to learn the parameters that minimize the value of the loss function [12]. Training samples $(x, y)$ are processed one at a time, and Passive Aggressive Algorithms use stochastic gradient updates for the parameters. First, the gradient $\nabla J$ of the loss $J$ associated with the newly input training sample $(x, y)$ is obtained. Then the parameter $\theta$ is updated in the direction of gradient descent, as in formula (5):

$$\theta \leftarrow \theta - \varepsilon \nabla J(\theta) \quad (5)$$

where $\varepsilon$ is a positive scalar that determines the step size of the gradient descent.

In the stochastic gradient descent method, when the gradient step is too large, the learning result tends to be unstable; when it is too small, convergence becomes slow.
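The sensitivity to the size of the gradient step can be seen on a toy problem. The sketch below is an illustration only: it applies plain gradient descent to the hypothetical one-dimensional loss $J(\theta) = \theta^2$, and the function name `gradient_descent` and the three step sizes are chosen for this example.

```python
def gradient_descent(step, n_steps=20, theta0=1.0):
    """Minimize J(theta) = theta**2 with a fixed step size, starting from theta0."""
    theta = theta0
    for _ in range(n_steps):
        grad = 2.0 * theta            # dJ/dtheta
        theta = theta - step * grad   # move against the gradient
    return theta


# A small step converges slowly, a moderate step converges fast, a large step diverges.
for step in (0.01, 0.4, 1.1):
    print(step, abs(gradient_descent(step)))
```

Each update multiplies $\theta$ by $(1 - 2 \cdot \text{step})$, so the iterate oscillates with growing amplitude once the step exceeds 1 and barely moves when the step is close to 0.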

If the loss function is chosen appropriately, the gradient can descend to the bottom of the valley quickly. For this purpose, a penalty term is generally introduced: the amount of gradient descent is adjusted according to how far the new solution deviates from the current solution $\tilde\theta$. This gives formula (6):

$$\hat\theta = \arg\min_{\theta} \Big[ J(\theta) + \frac{\lambda}{2} \|\theta - \tilde\theta\|^2 \Big] \quad (6)$$

where $\lambda$ is a positive scalar. Such a learning method can effectively suppress excessive gradient descent. This algorithm is called the Passive Aggressive Algorithm.

The specific steps of the Passive Aggressive Algorithm are:

1. Select the initial value of the parameter $\theta$ (for example, $\theta = 0$).
2. Using the newly input training sample $(x, y)$, update the parameter according to formula (6), as in formula (7).
3. Repeat the second step until convergence.
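The three steps above can be sketched for one concrete case. The sketch assumes a squared loss $J(\theta) = \tfrac{1}{2}(\theta^\top x - y)^2$, which is an assumption made here for illustration; under it, minimizing $J(\theta) + \tfrac{\lambda}{2}\|\theta - \tilde\theta\|^2$ has the closed-form solution $\theta = \tilde\theta - \dfrac{\tilde\theta^\top x - y}{\lambda + \|x\|^2}\, x$. The names `pa_update` and `pa_fit` and the data are hypothetical.

```python
def pa_update(theta, x, y, lam):
    """One passive-aggressive step for the squared loss 0.5*(theta.x - y)**2.

    Returns the minimizer of loss(theta) + (lam/2)*||theta - theta_old||^2.
    """
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    scale = residual / (lam + sum(xi * xi for xi in x))
    return [t - scale * xi for t, xi in zip(theta, x)]


def pa_fit(samples, lam=1.0, n_epochs=200):
    """Process samples one at a time, sweeping the data a fixed number of times."""
    theta = [0.0] * len(samples[0][0])   # step 1: initial value
    for _ in range(n_epochs):            # step 3: repeat
        for x, y in samples:             # step 2: update on each incoming sample
            theta = pa_update(theta, x, y, lam)
    return theta


# Hypothetical noise-free samples generated from y = 2*x1 - 1*x2.
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
theta = pa_fit(samples)
print([round(t, 4) for t in theta])
```

No learning rate appears in the update: the size of each step is set by the residual on the current sample and damped by $\lambda$, so a sample that is already fitted leaves the parameter unchanged (passive) while a badly fitted one triggers a large correction (aggressive).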

In the regression problem, the loss function $J(\theta)$ can take two different forms, based on the $\ell_1$ norm and on the $\ell_2$ norm, as given in formulas (8) and (9).

This article builds a regression model with each of the two norms and uses data from the countries and regions related to Hong Kong to fit Hong Kong's cumulative COVID-19 case data.

After establishing the regression models using known data, we found that most areas related to Hong Kong are still in the outbreak period of local transmission. In this situation, we can use a traditional infectious disease model to predict the growth trend of cases in these areas, and thereby predict the possible growth trend of Hong Kong (without an international traffic blockade). This article therefore first uses the SEIJR model to predict the growth curve of the number of locally diagnosed cases in the countries and regions related to Hong Kong. With these data, we can then fit and predict the future growth trend of Hong Kong.

The classic SEIR model divides the population into S (Susceptible), E (Exposed), I (Infected) and R (Recovered) [13]. The model also assumes that every individual in the population has the same chance of being infected [14, 15]. When an infected individual recovers, antibodies are produced, so the recovered population R will not be infected again.

In this study, the population is divided into susceptible S, latent E, infectious I, diagnosed J and recovered R. $S_1$ and $S_2$ denote two groups of the susceptible population with different susceptibility; the infection risk of $S_2$ is lower, with risk probability p. The probability of asymptomatic latent persons being infectious is q, the probability of latent persons turning into infected persons is k, the isolation rate is l, the confirmed rate of infectious persons is α, the recovery rate of infectious persons is γ1, the fatality rate of infectious persons is δ, the recovery rate of diagnosed persons is γ2, and the mortality rate of diagnosed persons is δ [16]. The transmission rate β is defined as the average number of infections per unit time caused by contact between a susceptible person and a person in class I. Based on the above parameters, the model is established as formulas (10)-(15).
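A compartment model with these classes and parameters can be simulated numerically. The sketch below is a hedged illustration under stated assumptions, not the article's formulas (10)-(15): it assumes the commonly used form in which the force of infection is $\beta (I + qE + lJ)/N$, the class $S_2$ is infected at $p$ times the rate of $S_1$, and $k$, $\alpha$, $\gamma_1$, $\gamma_2$, $\delta$ act as per-day rates. The function name `simulate_seijr`, the parameter values and the initial state are all hypothetical.

```python
def simulate_seijr(params, state, days, dt=0.1):
    """Integrate the assumed SEIJR equations with the forward Euler method.

    state is (S1, S2, E, I, J, R). Returns (curve, final): curve[d] is the
    cumulative number of diagnosed cases after day d + 1, and final is
    (S1, S2, E, I, J, R, D) with D the cumulative number of deaths.
    """
    beta, p, q, k, iso = (params[name] for name in ("beta", "p", "q", "k", "l"))
    alpha, g1, g2, delta = (params[name] for name in ("alpha", "gamma1", "gamma2", "delta"))
    S1, S2, E, I, J, R = state
    D = 0.0   # cumulative deaths
    C = J     # cumulative diagnosed cases
    steps_per_day = int(round(1.0 / dt))
    curve = []
    for _ in range(days):
        for _ in range(steps_per_day):
            N = S1 + S2 + E + I + J + R
            force = beta * (I + q * E + iso * J) / N   # force of infection
            inf1 = force * S1                          # new infections from S1
            inf2 = p * force * S2                      # S2 is less susceptible
            dE = inf1 + inf2 - k * E
            dI = k * E - (alpha + g1 + delta) * I
            dJ = alpha * I - (g2 + delta) * J
            dR = g1 * I + g2 * J
            dD = delta * (I + J)
            dC = alpha * I
            S1 -= dt * inf1
            S2 -= dt * inf2
            E += dt * dE
            I += dt * dI
            J += dt * dJ
            R += dt * dR
            D += dt * dD
            C += dt * dC
        curve.append(C)
    return curve, (S1, S2, E, I, J, R, D)


# Hypothetical parameter values and initial state, for illustration only.
params = {"beta": 0.5, "p": 0.1, "q": 0.1, "k": 0.2, "l": 0.3,
          "alpha": 1.0 / 3.0, "gamma1": 0.125, "gamma2": 0.2, "delta": 0.01}
initial = (4.0e5, 6.0e5, 10.0, 5.0, 0.0, 0.0)
curve, final = simulate_seijr(params, initial, days=60)
print(round(curve[-1], 1))
```

Tracking deaths separately makes it possible to check that the flows balance: the six compartments plus cumulative deaths should sum to the initial population throughout the simulation.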

(See Figure 1.)

The results are encouraging. From Table 1, we find that the United States, Hubei (China), Iran, Italy and other countries and regions are hub nodes in the network. These nodes have the most connections with other countries, which means that they are the main sources of international transmission. This result is consistent with existing general knowledge.

Next, we extracted the subnetwork of countries and regions related to Hong Kong. As shown in Figure 2, we found that 24 countries and regions are connected to Hong Kong. In addition to the hub nodes of the network, we also found South Korea, Russia and other countries that have close contacts with Hong Kong. We expect that data from these countries and regions, which are closely related to Hong Kong, can help us fit the existing cumulative case growth data in Hong Kong and predict future trends. We found that, among the areas related to Hong Kong, all but a few (such as Hubei in China) are at the peak of local transmission. This means that traditional infectious disease models can be used to predict epidemic trends in these areas, which is also the basis for using the SEIJR model in this article.
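Identifying hub nodes and extracting the nodes connected to Hong Kong are simple graph operations once the network edges are known. The sketch below is illustrative only: the edge list is a hypothetical toy example, not the network estimated in this article, the edges are assumed to be already symmetric, and the helper names (`build_adjacency`, `hub_nodes`, `neighbors_of`) are introduced here.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn a list of undirected edges into a node -> set-of-neighbors map."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def hub_nodes(adj, top=1):
    """Nodes with the most connections (ties broken alphabetically)."""
    return sorted(adj, key=lambda node: (-len(adj[node]), node))[:top]


def neighbors_of(adj, center):
    """Sorted list of nodes directly connected to `center`."""
    return sorted(adj[center])


# Hypothetical toy edges, for illustration only.
edges = [
    ("United States", "Italy"), ("United States", "Iran"),
    ("United States", "Hong Kong"), ("Italy", "Hong Kong"),
    ("Iran", "Italy"), ("South Korea", "Hong Kong"),
    ("United States", "South Korea"),
]
adj = build_adjacency(edges)
print(hub_nodes(adj))
print(neighbors_of(adj, "Hong Kong"))
```

The cumulative case series of the nodes returned for Hong Kong would then serve as the predictors of the regression model.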

After obtaining the Hong Kong relationship subnetwork, we used the Passive Aggressive Algorithm based on the $\ell_1$ and $\ell_2$ norms to establish two regression models. The experimental results are shown in Figure 3. As shown in Tables 2 and 3, the Passive Aggressive Algorithm fits the existing growth curve well, and the 5-fold cross-validation error is only -6.94. Of the two, Passive Aggressive Algorithm_L2 performs better, with a minimum error of only 0.12. The explained_variance and r2 indicators of both models are 0.99, and mean_absolute_error is lower than 13.3, which shows an extremely high degree of fit to the existing growth data. This result shows that data selected through the relationship network can fit the existing case growth in Hong Kong. It also means that we can use the 24 countries and regions related to Hong Kong to predict the future case growth curve of Hong Kong (with unblocked traffic).
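For readers unfamiliar with the three indicators, the following plain-Python sketch computes them from their usual definitions. It is an illustration, not the evaluation code used in this article; the function names mirror the indicator names, and the sample values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def _variance(values):
    """Population variance (divides by the number of values)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(observed); ignores a constant bias in the predictions."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - _variance(residuals) / _variance(y_true)


# Hypothetical observed and fitted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(mean_absolute_error(y_true, y_pred), r2_score(y_true, y_pred),
      explained_variance(y_true, y_pred))
```

The explained variance and r2 coincide when the residuals have zero mean, as in this example; they differ when the predictions are systematically shifted.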

Finally, we used the SEIJR model to simulate the cumulative case growth data of the 23 countries and regions related to Hong Kong. The parameters used in this article are shown in Table 4, where Beta is a floating value that is adjusted according to the specific conditions of each country. Since real data from other countries for April 2-19 could be collected, this article first uses the real data to predict the growth curve of Hong Kong's epidemic without a blockade of international traffic, in order to evaluate the necessity of the traffic blockade.

At present, researchers have done a lot of work to predict the trend of local transmission of COVID-19. However, in many countries and regions the growth of the outbreak is dominated by imported cases from overseas, which existing methods have difficulty predicting effectively. We therefore take Hong Kong as an example and propose an infectious disease model that predicts the growth trend of overseas imported cases.

In this article, we first proposed a sparse network model based on the Lasso, analyzed the data matrix of real case statistics, and drew the international COVID-19 epidemic network.

This article still has some shortcomings. For example, we assume that international tourists in Hong Kong will be quarantined and that all infected persons will be diagnosed. However, some asymptomatic infected persons may still be missed, causing local transmission in Hong Kong. In the next step, we therefore plan to estimate the asymptomatic infection rate in Hong Kong and further improve the model in this article.

No potential conflict of interest was reported by the author(s).

Figure 3. Fitting curve using the Passive Aggressive Algorithm based on the $\ell_1$ and $\ell_2$ norms.

Figure 4. Comparison between the predicted growth curve and the true growth curve without hindering international flows.

[1] Predicting the cumulative number of cases for the COVID-19 epidemic in China from early data

[2] The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak

[3] Preparedness and vulnerability of African countries against importations of COVID-19: a modelling study

[4] COVID-19 and Italy: what next? The Lancet

[5] Estimation of the transmission risk of the 2019-nCoV and its implication for public health interventions

[6] Gibbs and Markov properties of graphs

[7] Regression shrinkage and selection via the lasso: a retrospective

[8] Asymptotics for lasso-type estimators

[9] Pattern recognition and machine learning

[10] A statistical view of some chemometrics regression tools

[11] Linear model selection by cross-validation

[12] Online passive-aggressive algorithms

[13] Global stability for the SEIR model in epidemiology

[14] Statistical inference in a stochastic epidemic SEIR model with control intervention: Ebola as a case study

[15] How will country-based mitigation measures influence the course of the COVID-19 epidemic? The Lancet
