key: cord-286419-jyvj3mo2 authors: Rahaman Khan, Hasinur; Hossain, Ahmed title: Countries are Clustered but Number of Tests is not Vital to Predict Global COVID-19 Confirmed Cases: A Machine Learning Approach date: 2020-04-29 journal: nan DOI: 10.1101/2020.04.24.20078238 sha: doc_id: 286419 cord_uid: jyvj3mo2

COVID-19 is a global pandemic that affects every nation and territory on earth. This paper analyses global COVID-19 data with popular machine learning techniques to determine which covariates are importantly associated with the cumulative number of confirmed cases, whether the countries are clustered with respect to the covariates considered, and whether the variation in the covariates is explained by any latent factor. Regression tree, cluster analysis and principal component analysis are applied to global COVID-19 data for 133 countries obtained from the Worldometer website as reported on April 17, 2020. Our results suggest that there are four major clusters among the countries. The first cluster consists of 8 countries where cumulative infected cases and deaths are highest. The analysis also reveals two principal components. The first component, characterized by all variables except the rate variables, explains 60% of the total variation; the countries that play a vital role in it include USA, Spain, Italy, France, Germany, UK, and Iran. The remaining countries contribute to the second component, which is characterized by only the three rate variables and explains 20% of the total variation. We also found that, among the variables country, number of tests, number of active cases, number of deaths, number of recovered patients, number of serious cases, and number of new cases, the number of tests is an unimportant variable for predicting the cumulative number of confirmed cases.
Hence, the number of tests might play a vital role at the level of individual countries that are in the primary stage of virus spread, but not at the global level.

[...] are under potentially dangerous threat (Khan & Hossain, 2020). According to World Bank data (WB, 2020), Bangladesh had 0.8 hospital beds per 1,000 people in 2015, India 0.7 (2011), Pakistan 0.6 (2012), the US 2.9 (2012), and China 4.2 (2012). It is recommended that ICU practitioners, hospital administrators, governments, and policy makers prepare for a substantial increase in critical care bed capacity, with a focus not just on infrastructure and supplies but also on staff management (Phua et al., 2020).

Testing capability is not uniform across countries; it is heterogeneous between countries and even within a country. Although testing is one of our most important tools for slowing and reducing the spread and impact of the virus, it depends mainly on a country's financial capability, laboratory capacity, and access. Low and middle income countries may have to battle the COVID-19 pandemic within limited capability. Tests allow us to identify infected individuals, guiding the medical treatment that they receive. Testing enables the isolation of those infected and the tracing and quarantining of their contacts (Hellewell et al., 2020). As of 17 April 2020, USA had administered the highest number of tests, approximately 3.4 million, which is almost 20% of the global total, followed by Germany (over 1.7 million), Russia (over 1.6 million) and Italy (approximately 1.2 million). Figure 1 shows the scatter plot of cumulative cases against cumulative tests for 132 countries. USA was excluded from the graph because of its exceptionally high number of tests performed.
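The effect of a single extreme observation on a correlation coefficient, which is the reason USA is treated separately here, can be sketched in a few lines of Python. The numbers are made-up toy values, not the Worldometer data, and `pearson` is a hypothetical helper written only for this illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up toy values (NOT the Worldometer data), in thousands.
tests = [10, 20, 30, 40, 50, 60]
cases = [3, 1, 6, 2, 7, 5]

# One hypothetical country with far more tests and cases than the rest.
tests_all = tests + [3400]
cases_all = cases + [700]

r_without = pearson(tests, cases)
r_with = pearson(tests_all, cases_all)
print(f"r without the extreme point: {r_without:.2f}")
print(f"r with the extreme point:    {r_with:.2f}")
```

In this toy setting the coefficient rises sharply once the extreme point is included, so reporting the correlation both with and without such a country is a reasonable sensitivity check.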
We found that the correlation coefficient between these two variables for the 132 countries is 0.71, which indicates a strong positive correlation, while including USA it is 0.88, which indicates a very high positive correlation.

We found a number of research works in which machine learning tools have been used for global and local COVID-19 data. Recently, Chuanyu et al. (2020) used several machine learning tools, including elastic net, random forest, and bagged flexible discriminant analysis, for predicting the mortality risk of COVID-19 patients. Magdon-Ismail (2020) presented a robust data-driven machine learning analysis of the COVID-19 pandemic from its early infection dynamics. "COVID-19 and artificial intelligence: protecting health-care workers and curbing the spread" (2020) discussed how artificial intelligence is protecting health-care workers and curbing the spread of COVID-19. News (2020) discussed hunting the virus with technology, AI, and analytics. News (2020) used a deep learning method for reviewing and critically appraising published and preprint re- [...] including big data techniques applied to COVID-19 data to determine the spread of the disease and predict the risk of disease, the diagnosis of disease, the incidence, and health care facilities.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted April 29, 2020.

In this paper, we explore whether the global cumulative number of infected people can be predicted with the available data, collected as of 17 April 2020 from Worldometer. If so, we would like to know how vital the cumulative number of tests is for predicting the number of infections. We further investigate whether the countries are clustered on the basis of these covariates.
Finally, we examine whether the total variation can be explained by latent groups that are uncorrelated with each other.

The data used for the current study were collected from the real-time COVID-19 data on the Worldometer website (Max Roser & Ortiz-Ospina, 2020) up to 17 April 2020. Worldometer is a data repository and free reference website that is trusted by the UK Government, Johns Hopkins CSSE, and others. For the current study, we collated the information on the top 133 countries with at least 100 confirmed COVID-19 cases. For each country we collected information on total confirmed cases, new confirmed cases, total deaths, total recovered patients, total active cases, total seriously critical patients, infection rate per million, death rate per million, total tests conducted, and test rate per million. New confirmed cases are the confirmed cases reported on 17 April. The definitions of recovery and serious cases vary from country to country. According to Max Roser & Ortiz-Ospina (2020), the recovered number is not very accurate, as reporting can be missing, incomplete, incorrect, based on different definitions, or dated (or a combination of all of these) for many governments, both at the local and national level, sometimes with differences between states within the same country or counties within the same state. We included the rate variables (cases, deaths and tests per million) in our analysis since these are vital statistics and act as a proxy for the respective population size. Figure 2 shows that most of the COVID-19 cases and deaths are from USA and countries in Europe.
We found that USA and European countries such as Germany, Russia, Italy, Spain, UK and France administered a very high number of tests. The average number of tests among the countries with available data, out of the 133 countries, is nearly 156,500.

All our variables except the country are correlated. We standardized the data and imputed the missing values through the EM algorithm following Dray & Josse (2020).

[...] consisted of the highest number of countries, 68, which are clustered mainly on the test and case rate variables along with the other variables used in this study.

We implemented the regression tree using CART to predict the cumulative number of infected people. The main purpose of implementing the regression tree is to see whether the global cumulative number of infected people can be predicted well with the ten variables under study. The results are presented in Table 2, which shows the weights, including their percentage of importance, for all ten variables.
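As a rough illustration of how a regression tree can rank variables, the sketch below grows a small CART-style tree in plain Python and credits each variable with the reduction in the sum of squared errors (SSE) achieved by the splits made on it. This is one common way to score importance and is not necessarily the weighting behind Table 2. The data are made-up toy values, and the names `sse`, `best_split`, `grow`, `active_cases` and `total_tests` are assumptions introduced for this example.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, y):
    """Return (gain, feature index, threshold) of the best binary split."""
    parent = sse(y)
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        levels = sorted({r[j] for r in rows})
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2  # midpoint between adjacent observed values
            left = [yi for r, yi in zip(rows, y) if r[j] <= t]
            right = [yi for r, yi in zip(rows, y) if r[j] > t]
            gain = parent - sse(left) - sse(right)
            if gain > best[0]:
                best = (gain, j, t)
    return best

def grow(rows, y, depth, importance):
    """Recursively split, adding each split's SSE reduction to its feature."""
    if depth == 0 or len(y) < 2:
        return
    gain, j, t = best_split(rows, y)
    if j is None:  # no split reduces the SSE
        return
    importance[j] += gain
    left = [(r, yi) for r, yi in zip(rows, y) if r[j] <= t]
    right = [(r, yi) for r, yi in zip(rows, y) if r[j] > t]
    for part in (left, right):
        grow([r for r, _ in part], [yi for _, yi in part], depth - 1, importance)

# Made-up toy data (NOT the Worldometer data): [active_cases, total_tests].
names = ["active_cases", "total_tests"]
rows = [[1, 5], [2, 9], [3, 2], [10, 7], [11, 1], [12, 8], [30, 3], [31, 6]]
y = [2, 3, 4, 14, 15, 16, 40, 42]  # toy cumulative confirmed cases

importance = [0.0] * len(names)
grow(rows, y, depth=2, importance=importance)

total = sum(importance)
for name, value in zip(names, importance):
    print(f"{name}: {100 * value / total:.1f}% of total SSE reduction")
```

In this toy example the target closely tracks the first variable and is unrelated to the second, so every split is made on the first variable. Library implementations of CART may compute importance differently, for instance by also crediting surrogate splits.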
The results show that country and cumulative active cases are the most important variables for predicting the cumulative number of infected people, followed by cumulative deaths, cumulative recovered cases, new cases and cumulative serious cases. Most strikingly, however, we found that cumulative tests is one of the least important variables for predicting the cumulative number of infections.

In this paper, we demonstrated how to apply basic machine learning techniques (principal component analysis, cluster analysis and regression tree) to analyse global COVID-19 data extracted from the Worldometer website (Max Roser & Ortiz-Ospina, 2020) as reported on April 17, 2020. We considered 10 variables for each of 133 countries. The PCA shows that there are two latent variables characterized by the 10 variables we considered. The first principal component explains 60% of the total variation and is characterized mainly by 7 variables: total infected cases, deaths, active cases, recovered cases, serious cases, new cases and total tests. The source of the majority of the total variation is thus collectively all variables but the rate variables. The remaining three variables (case, death and test rates per million) characterize the second principal component, which accounts for 20% of the total variation. The latent factor behind this appears to be the country's population size, as all three variables are proxies for population size. Neither the populations nor the population densities of the 133 countries are uniform.
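The share of total variation attributed to a principal component can be sketched as follows. For standardized variables, each eigenvalue of the correlation matrix divided by the number of variables gives that component's share. The example uses made-up toy values for four variables, not the study data, and power iteration with deflation stands in for a library eigendecomposition so that the block needs only the standard library. All function and variable names are assumptions for this illustration.

```python
import math

def standardize(col):
    """Center a column and scale it to unit (population) variance."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd for v in col]

def correlation_matrix(columns):
    """Correlation matrix of the given columns (a list of lists)."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    p = len(matrix)
    v = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue.
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p))
    return value, v

def deflate(matrix, value, vector):
    """Remove an already-found component from a symmetric matrix."""
    p = len(matrix)
    return [[matrix[i][j] - value * vector[i] * vector[j] for j in range(p)]
            for i in range(p)]

# Made-up toy values (NOT the study data): two count and two rate variables.
cases = [1, 2, 3, 4, 5, 6]
deaths = [1, 3, 2, 5, 4, 6]
case_rate = [5, 1, 4, 2, 6, 3]
test_rate = [6, 2, 3, 1, 5, 4]

corr = correlation_matrix([cases, deaths, case_rate, test_rate])
val1, vec1 = top_eigenpair(corr)
val2, vec2 = top_eigenpair(deflate(corr, val1, vec1))

p = len(corr)  # the trace of a correlation matrix equals the number of variables
shares = [val1 / p, val2 / p]
print(f"PC1 share of total variation: {100 * shares[0]:.1f}%")
print(f"PC2 share of total variation: {100 * shares[1]:.1f}%")
```

The toy shares will not match the 60% and 20% reported here; the sketch only shows how such percentages arise from the eigenvalues. Inspecting the entries of `vec1` and `vec2` shows which variables characterize each component.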
We believe that a country's population size, or indirectly the associated population density, is responsible for this 20% of the total variation.

The cluster analysis found four major clusters among the countries but two clusters among the 11 variables. The analysis reveals that the countries are clustered based on the variation among the variables. We found that the 8 countries with the highest number of cases form one cluster, while 43 countries form another cluster based mainly on all the variables but the case rate and test rate. The eight countries are USA, Spain, Italy, France, Germany, UK, China and Iran, which are homogeneous in terms of cumulative cases, deaths, active cases and tests. Most of them were or are an epicenter of the pandemic. However, we found that 14 countries with a very low death rate form one cluster, and 68 countries with higher test and case rates, along with a significant effect of the other eight variables, form the fourth cluster. Countries with low death rates include Bahrain, Belgium, Channel Islands, Faeroe Islands, Gibraltar, Iceland, Ireland, Isle of Man, Luxembourg, Malta, San Marino, Switzerland, and UAE.

The regression tree results show that country, total active cases, total deaths, total recovered cases, new cases and total serious cases are very important variables for predicting the cumulative number of cases, whereas the number of tests and the three rate variables are not important. As stated, the global data analysis indicates that the cumulative number of tests is not significant for predicting cumulative cases, but it remains important to consider the situation or context of a specific country.
Besides, the policies on testing differ from country to country, region to region, or even city to city. They depend mainly on what stage the country or community has reached on the pandemic curve and, at the same time, on the level of preparedness in the specific context, such as the number of lab facilities, lab staff, and the sample collection strategy. When resources are limited and the healthcare system is overloaded, the widespread testing suggested by WHO may not be implemented. This is a reality for many low and middle income countries in the list of 133 countries in our study. Apparently, the number of tests is very important for many countries to limit the spread in the early stage, or indeed at any stage of spreading, by identifying cases and isolating them and their contacts. However, the global COVID-19 data analysis reveals that cumulative tests is not an important determinant for predicting the cumulative number of cases for a country.

The world grapples with the containment of the COVID-19 outbreak, and countries are trying to reduce virus spread by performing tests to detect and then isolate infected people and quarantine susceptible people. Besides, continuing the lockdown and social distancing is expected to help reduce the spread considerably. However, this paper found that the countries are clustered with respect to the underlying effects of the covariates, although the countries are fighting this virus independently. Similarly, the variables related to rates together form one cluster, while the other variables together form another cluster of variables.
Most strikingly, we found that cumulative tests appeared as an unimportant variable for predicting the cumulative number of infected people.

We declare that we have no competing interests. There is no funding for this study. MHRK carried out the statistical analysis and contributed to drafting the manuscript. AH arranged the datasets and contributed to finalizing the manuscript. The working data set used for this study has been submitted to the journal as an additional supporting file.

COVID-CAPS: A capsule network-based framework for identification of COVID-19 cases from X-ray images. arXiv
Classification and Regression Trees
Early prediction of mortality risk among severe COVID-19 patients using machine learning
Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
COVID-19 and artificial intelligence: protecting health-care workers and curbing the spread
Principal component analysis with missing values: a comparative survey of methods
Big data in the time of coronavirus (COVID-19)
An alternative workflow for molecular detection of SARS-CoV-2-escape from the NA extraction kit-shortage. medRxiv
Estimating uncertainty and interpretability in deep learning for coronavirus (COVID-19) detection. arXiv
Coronavirus detection and analysis on chest CT with deep learning. arXiv
Finding Covid-19 from chest X-rays using deep learning on a small dataset. arXiv
Understanding the COVID-19 pandemic as a big data analytics issue
Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts
Artificial intelligence forecasting of COVID-19 in China. arXiv
The continuing 2019-nCoV epidemic threat of novel coronaviruses to global health-the latest 2019 novel coronavirus outbreak in Wuhan, China
IBM releases novel AI-powered technologies to help health and research community accelerate the discovery of medical insights and treatments for COVID-19
COVID-19 Outbreak Situations in Bangladesh: An Empirical Analysis. medRxiv
caret: Classification And REgression Training
Within the Lack of Chest COVID-19 X-ray Dataset: A Novel Detection Model Based on
Machine Learning the Phenomenology of COVID-19 From Early Infection Dynamics. medRxiv
A novel AI-enabled framework to diagnose coronavirus COVID-19 using smartphone embedded sensors
Coronavirus Disease (COVID-19
Tracking COVID-19: Hunting the Virus with Technology, AI, and Analytics. Institute for Human-Centered Artificial
Intensive care management of coronavirus disease 2019 (COVID-19): challenges and recommendations
Machine learning-based CT radiomics model for predicting hospital stay in patients with pneumonia associated with SARS-CoV-2 infection: a multicenter study
Identification of COVID-19 can be quicker through artificial intelligence framework using a mobile phone-based survey in the populations when cities/towns are under quarantine
Real-time forecasts of the COVID-19 epidemic in China from
The World Bank data
Prediction of criticality in patients with severe Covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan