key: cord-0979060-zgi2sdc1
authors: YEŞİLKANAT, Cafer Mert
title: Spatio-temporal estimation of the daily cases of COVID-19 in worldwide using random forest machine learning algorithm
date: 2020-08-20
journal: Chaos Solitons Fractals
DOI: 10.1016/j.chaos.2020.110210
sha: 3d5d6d4d8722c9fc09a8a1ae30dc5eef198200dc
doc_id: 979060
cord_uid: zgi2sdc1

The Novel Coronavirus pandemic, which has negatively affected public health in social, psychological and economic terms, spread to the whole world in a short period of 6 months. However, the rate of increase in cases was not equal for every country. The measures implemented by the countries changed the daily spreading speed of the disease, as reflected in changes in the number of daily cases. In this study, the performance of the Random Forest (RF) machine learning algorithm was investigated in estimating the near-future case numbers for 190 countries, and the estimates were mapped in comparison with the actual confirmed cases. The numbers of confirmed cases between 23/01/2020 and 17/06/2020 were divided into 3 main sub-datasets for the random forest model: training sub-data, testing sub-data (interpolation data) and estimating sub-data (extrapolation data). For the testing sub-data, the R² values of the RF model estimates range between 0.843 and 0.995 (mean R² = 0.959), and the RMSE values between 141.76 and 526.18 (mean RMSE = 259.38). For the estimating sub-data, the R² values range between 0.690 and 0.968 (mean R² = 0.914), and the RMSE values between 549.73 and 2500.79 (mean RMSE = 909.37). These results show that the random forest machine learning algorithm performs well in estimating the number of cases for the near future in an epidemic like the Novel Coronavirus, which breaks out suddenly and spreads rapidly.

Novel Coronavirus disease 2019 (COVID-19), which first appeared in Wuhan, China in December 2019, had caused the death of more than 450 thousand people worldwide as of June 2020 [1]. In addition, COVID-19 quickly became a worldwide epidemic due to its high contagiousness and rapid spread [2]. For this reason, all countries are taking steps to prevent the spread of the COVID-19 outbreak.

In recent months, many medical studies have been conducted to examine and treat the disease caused by this new type of virus [3-7]. In addition, many studies have examined the social, psychological and economic effects of the COVID-19 outbreak and the changes it causes [8-13]. Epidemiological, statistical and mathematical models have also been introduced to predict the distribution of this epidemic, to observe its changes depending on meteorological conditions, and to examine its structure [14-21]. The performance of machine learning approaches for the diagnosis and treatment of the disease has also been studied [22-28]. All these studies reveal the general structure of an epidemic and disease that humanity has not encountered before, and its effects on society. For this reason, it is very important to research each individual and social impact of the COVID-19 pandemic, along with its causes, in different disciplines and with different methods. This idea has been the main motivation for this study. In this study, the performance of the Random Forest (RF) method, a machine learning algorithm, was analyzed in estimating the daily increase rates and the number of daily cases in the near future, with synchronous parallel computing carried out for 190 countries.

In recent years, machine learning algorithms and artificial intelligence approaches have been used successfully in many different fields [29-36]. One of the most important of these algorithms is the random forest [37, 38]. This method combines many specialized decision trees. The input-output relationship is learned by the machine, within certain confidence intervals, from experimental data. Once the machine has learned sufficiently, the success level of the estimation model is determined by testing it on validation data.

The main purpose of this study is to estimate the spread of the daily cases of the COVID-19 outbreak for the near future using the RF machine learning algorithm. To this end, the daily changes in the number of confirmed cases for 190 countries worldwide are used to estimate and map the spatio-temporal distribution of the outbreak. In addition, this study aims to reveal the performance of the RF algorithm both in determining the spread of the outbreak and in estimating cases for the near future.

The COVID-19 data used in the study were obtained from the data repository of the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) [39]. The updated numbers of confirmed cases for the 147 days between 23/01/2020 and 17/06/2020 in 190 countries worldwide were used. The entire study was conducted in the R programming environment [40, 41]. The following R packages were used: randomForest [42] for random forest calculations, covid19.analytics [43] for COVID-19 data, rnaturalearth [44] for mapping, ggplot2 [45] for data visualization, and caret [46] for data preparation and separation.
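The day counts quoted in this study (147 days in total, the "50th day", the "135th day" and so on) follow from simple date arithmetic. The sketch below is illustrative only and is written in Python rather than the R used in the study; it assumes that 23/01/2020 is counted as day 1, and the helper names `day_index` and `day_to_date` are hypothetical.

```python
from datetime import date, timedelta

START = date(2020, 1, 23)  # first day of the study period (assumed to be day 1)
END = date(2020, 6, 17)    # last day of the study period


def day_index(d: date) -> int:
    """1-based day number of a calendar date within the study period."""
    return (d - START).days + 1


def day_to_date(n: int) -> date:
    """Calendar date of the n-th day of the study period."""
    return START + timedelta(days=n - 1)


n_days = day_index(END)
print(n_days)                        # 147
print(day_to_date(50))               # 2020-03-12
print(day_index(date(2020, 6, 5)))   # 135
```

Under this convention, 5, 10 and 15 June 2020 correspond to the 135th, 140th and 145th days.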

The random forest approach proposed by Breiman [37] is a machine learning algorithm built from many decision trees. It is a combination of the Bagging [47] and Random Subspaces [48] methods. This method has proved its success in both regression and classification problems in recent years and is one of the best machine learning algorithms used in many different fields [25, 30, 34, 38, 49-51].

In the RF algorithm, many decision trees are grown, each on a "bootstrap sample" drawn randomly with replacement from the data set. For each tree, the sampled observations form the training data (the in-bag data) used for learning, and the observations left out form the validation data (the out-of-bag data) used for testing the learning level. Roughly 2/3 of the distinct observations are in-bag and about 1/3 are out-of-bag. The branching of each tree is determined by randomly selected predictors at the node points. The final RF estimate is the average of the results from all trees, so each individual tree contributes equally to the RF estimate.
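The in-bag / out-of-bag idea and the averaging of tree outputs can be illustrated with a short sketch. This is not the randomForest implementation used in the study; it is a minimal Python illustration that assumes sampling with replacement of the same size as the data set, and the function names are hypothetical.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement (in-bag); the rest are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def forest_predict(tree_predictions):
    """Final regression estimate: the plain average over all trees."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(42)
n = 1000
fractions = []
for _ in range(200):
    _, oob = bootstrap_split(n, rng)
    fractions.append(len(oob) / n)

# Share of observations left out of a bootstrap sample, averaged over 200 draws.
mean_oob_fraction = sum(fractions) / len(fractions)
print(round(mean_oob_fraction, 2))
print(forest_predict([10.0, 20.0, 30.0]))
```

The out-of-bag share comes out close to 1/3 (its theoretical limit is 1/e, about 0.37), which is where the approximate 2/3 to 1/3 division comes from.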

Since this method behaves as a "black box", the trees are not examined individually [52]. The RF algorithm is considered more robust than many other machine learning algorithms because it draws training data from random subsets and grows trees with randomized splits [53]. In addition, the random forest algorithm limits overfitting, as training is carried out on different randomly selected sub-datasets obtained by bootstrap sampling.

This study was carried out in 4 main stages. These process steps are shown in Figure 1 and explained below.

Step 1, data split process: the number of confirmed cases for 190 countries between 23/01/2020 and 17/06/2020 is divided into 3 main sub-datasets. The first is the training sub-dataset, covering 23/01/2020 to 31/05/2020. The second is the testing sub-dataset, consisting of 6 days (16-29, 18-29 April and 12-19 May) randomly selected from the days after the 50th day of the training period and separated from the training data set. This data set is different from the validation data, which is internal to the RF algorithm and amounts to 1/3 of the data. The third is the estimating sub-dataset, for which future predictions are made for the date range 01/06/2020 to 17/06/2020. Like the testing data set, it is kept apart from the training data and is not included in the RF learning process. The testing sub-data thus represents randomly selected days (after the 50th day) within the training date range (interpolation), while the estimating sub-data represents days after the end of the training data, i.e. the near future (extrapolation).
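The three-way split described in this step can be sketched as follows. The sketch is in Python rather than the R/caret workflow of the study. The held-out testing days are drawn here with an arbitrary random seed, so they are NOT the six days used in the study. The function names and the `seed` parameter are hypothetical.

```python
import random
from datetime import date, timedelta

START = date(2020, 1, 23)      # first day of the training window
TRAIN_END = date(2020, 5, 31)  # last day of the training window
END = date(2020, 6, 17)        # last day of the estimating window


def date_range(first, last):
    """All calendar days from first to last, inclusive."""
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


def split_days(n_test=6, min_day=50, seed=1):
    """Return (training, testing, estimating) lists of days."""
    window = date_range(START, TRAIN_END)
    # Only days after the min_day-th day may be held out for testing.
    candidates = window[min_day:]
    testing = sorted(random.Random(seed).sample(candidates, n_test))
    held_out = set(testing)
    training = [d for d in window if d not in held_out]
    # Days after the training window: near-future (extrapolation) targets.
    estimating = date_range(TRAIN_END + timedelta(days=1), END)
    return training, testing, estimating


training, testing, estimating = split_days()
print(len(training), len(testing), len(estimating))  # 124 6 17
```

The testing days fall inside the training date range (interpolation), whereas every estimating day lies after it (extrapolation).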

Step 2, RF training process: the machine learning process is performed at this stage by applying the RF algorithm to the training data. In this process, determining the number of trees to be grown and the number of splits at the node points of the trees is important for accurate predictions. The 1/3 validation sub-datasets created within the RF algorithm were used to optimize these values. The optimization gave 1500 trees and 3 splits at the nodes for the most suitable model (average R² = 0.952 and average RMSE = 354.74 in 10-fold cross-validation).
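Selecting hyperparameters by cross-validated error, as in this step, can be sketched generically. The example below is a Python illustration of a 10-fold cross-validated grid search. To stay short and dependency-free it uses a toy nearest-neighbour averaging model as a stand-in, NOT a random forest, and the synthetic data, the candidate grid and all function names are assumptions made for illustration.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_rmse(xs, ys, fit, k=10, seed=0):
    """Average held-out RMSE over k folds for a model built by fit(xs, ys)."""
    scores = []
    for fold in k_fold_indices(len(ys), k, seed):
        held = set(fold)
        train = [i for i in range(len(ys)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(predict(xs[i]) - ys[i]) ** 2 for i in fold]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(scores) / len(scores)


def make_knn_fit(n_neighbours):
    """Toy stand-in model (not a random forest): mean of the nearest targets."""
    def fit(xs, ys):
        def predict(x0):
            pairs = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))
            nearest = pairs[:n_neighbours]
            return sum(y for _, y in nearest) / len(nearest)
        return predict
    return fit


# Synthetic one-dimensional data, used only to exercise the search.
xs = [float(i) for i in range(60)]
ys = [x * x for x in xs]

grid = [1, 3, 15]  # candidate hyperparameter values (hypothetical)
cv_scores = {g: cv_rmse(xs, ys, make_knn_fit(g)) for g in grid}
best = min(cv_scores, key=cv_scores.get)
print(best, round(cv_scores[best], 2))
```

For a random forest, the grid would instead range over the number of trees and the number of predictors tried at each node, and the candidate with the lowest cross-validated RMSE would be kept.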

Step 3, RF testing process: after RF training with actual data, the resulting model is tested with the testing sub-dataset separated from the data set, and the results are shown in the cross-validation diagram. The performance of the model is determined by the mean error (ME, Eq. 1), the root mean square error (RMSE, Eq. 2) and the correlation coefficient (R², Eq. 3).

Figure 4 shows diagrams that demonstrate the estimation performance of the RF model for the near future. In these diagrams, the actual confirmed cases and the RF estimation results for 3 days (5, 10 and 15 June 2020) selected from the estimation data, which were never introduced to the machine as training data, are compared in both maps and cross-validation diagrams. Accordingly, R² has been calculated as 0.958, 0.910 and 0.938, and RMSE as 814.06, 1129.12 and 616.37, for the 135th, 140th and 145th days, respectively. In addition, Table 1 lists the performance indicators between actual confirmed cases and RF model estimates for each day from 1 to 17 June 2020. According to this table, the best RF model estimation for the near future was obtained for 6 June 2020, with the highest R² (0.968) and the lowest RMSE (549.73). The average R² for the 17 days between 1 and 17 June 2020 is 0.914 and the average RMSE is 909.37. These results show the success of the RF machine learning algorithm in estimating the number of COVID-19 daily cases in the near future. However, Table 1 also shows a significant decrease in RF estimation performance for 17 June 2020. The main reason is thought to be the unpredictably high number of daily cases (36,179) recorded in Chile that day. RF model estimation maps and cross-validation diagrams for 1-17 June 2020 are presented in Figure S1 in the Supplementary Material.
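The three performance measures named in this step (ME, RMSE and R²) can be computed as in the following Python sketch. Equations 1-3 are not reproduced in this text, so two conventions here are assumptions: ME is taken as estimate minus observation, and R² as the squared Pearson correlation between observed and estimated values. The sample numbers are made up for illustration and are not study data.

```python
import math


def mean_error(obs, pred):
    """ME: average signed difference (assumed convention: estimate - observed)."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """RMSE: square root of the mean squared difference."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """R^2 as the squared Pearson correlation (assumed definition)."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov * cov / (var_o * var_p)


# Made-up daily case counts for four countries (illustration only).
observed = [100, 200, 300, 400]
estimated = [110, 190, 320, 380]

me = mean_error(observed, estimated)
err = rmse(observed, estimated)
r2 = r_squared(observed, estimated)
print(me, round(err, 2), round(r2, 4))
```

Because RMSE squares each difference, a single country with a very large unexpected count on one day can dominate the value for that day, which is consistent with the drop in performance reported for 17 June 2020.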

Taken as a whole, the results of the study show that the random forest machine learning algorithm can produce appropriate estimates of the number of near-future cases in a suddenly emerging epidemic. It is thought that appropriate estimates could also be made for the more distant future by increasing the input data and by introducing other factors affecting the epidemic as parameters to the random forest learning algorithm. Based on these results, future studies could build a hybrid approach that draws on the advantages of other machine learning algorithms. Such an approach might also make it possible to estimate the spatio-temporal spread rate of a sudden epidemic and to identify potentially risky areas. However, it should be noted that the machine learning process and the estimation periods will be longer in this case. These issues should be taken into consideration in future studies.

Cafer Mert Yeşilkanat: Investigation, Data curation, Writing - original draft, Visualization, Conceptualization, Methodology, Software, Formal analysis, Writing - review & editing.

The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. All work was conducted by Cafer Mert Yeşilkanat.

Guidance for the care of neuromuscular patients during the COVID-19 pandemic outbreak from the French Rare Health Care for Neuromuscular Diseases Network

Plans to Reactivate Gastroenterology Practices Following the COVID-19 Pandemic: A Survey of North American Centers

Adult Cardiac Surgery and the COVID-19 Pandemic: Aggressive Infection Mitigation Strategies are Necessary in the Operating Room and Surgical Recovery

Prior and novel coronaviruses, Coronavirus Disease 2019 (COVID-19), and human reproduction: what is known?

Infection control practices in children during COVID-19 pandemic: differences from adults

Combating Heightened Social Isolation of Nursing Home Elders: The Telephone Outreach in the COVID-19 Outbreak Program

Inequality in Learning Opportunities during Covid-19: Evidence from Library Takeout

The socio-economic implications of the coronavirus pandemic (COVID-19): A review

Characterize health and economic vulnerabilities of workers to control the emergence of COVID-19 in an industrial zone in Vietnam

Who Loses Income During the COVID-19 Outbreak? Evidence from China

Show me a man or a woman alone and I'll show you a saint: Changes in the frequency of criminal incidents during the COVID-19 pandemic

Estimating the infection horizon of COVID-19 in eight countries with a data-driven approach

Estimation of COVID-19 dynamics "on a back-of-envelope": Does the simplest SIR model provide quantitative parameters and predictions?

A simple model for COVID-19

Development of new hybrid model of discrete wavelet decomposition and autoregressive integrated moving average (ARIMA) models in application to one month forecast the casualties cases of COVID-19

Mathematical modeling of the spread of the coronavirus disease 2019 (COVID-19) taking into account the undetected infections. The case of China

Impact of weather on COVID-19 pandemic in Turkey

Effects of temperature and humidity on the daily new cases and new deaths of COVID-19 in 166 countries

A mechanism-based parameterisation scheme to investigate the association between transmission rate of COVID-19 and meteorological factors on plains in China

Use of machine learning and artificial intelligence to predict SARS-CoV-2 infection from full blood counts in a population

A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2

Predicting the growth and trend of COVID-19 pandemic using machine learning and cloud computing

Spatial modelling, risk mapping, change detection, and outbreak trend analysis of coronavirus (COVID-19) in Iran (days between

Examining the effect of social distancing on the compound growth rate of COVID-19 at the county level (United States) using statistical analyses and a random forest machine learning model

COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios

Analysis on Novel Coronavirus (COVID-19) Using Machine Learning Methods

Artificial neural networks for infectious diarrhea prediction using meteorological factors in

Evaluation of random forest and regression tree methods for estimation of mass first flush ratio in urban catchments

Spatiotemporal patterns of PM 10 concentrations over China during 2005-2016: A satellite-based estimation using the random forests approach

Spatial interpolation of McArthur's Forest Fire Danger Index across Australia: Observational study

A comparative study of logistic model tree, random forest, and classification and regression tree models for spatial prediction of landslide susceptibility

An evaluation of Guided Regularized Random Forest for classification and regression tasks in remote sensing

Spatial interpolation and radiological mapping of ambient gamma dose rate by using artificial neural networks and fuzzy logic methods

Determination and mapping the spatial distribution of radioactivity of natural spring water in the Eastern Black Sea Region by using artificial neural network method

Random Forests

Random forest as a generic framework for predictive modeling of spatial and spatiotemporal variables

An interactive web-based dashboard to track COVID-19 in real time

R: A Language for Data Analysis and Graphics

R: A language and environment for statistical computing, reference index version 2.2.1. R Found Stat Comput

Classification and Regression by randomForest

covid19.analytics: Load and Analyze Live Data from the CoViD-19 Pandemic. R Packag Version 11

World Map Data from Natural Earth. R Packag Version 010 Https

Elegant Graphics for Data Analysis

Classification and Regression Training. R Packag Version

Bagging predictors

The random subspace method for constructing decision forests

Random forests for classification in ecology

Random forest regression prediction of solid particle Erosion in elbows

Modeling spatial patterns of fire occurrence in Mediterranean Europe using Multiple Regression and Random Forest

Newer classification and regression tree techniques: Bagging and random forests for ecological prediction

Combining Bagging and Random Subspaces to Create Better Ensembles