key: cord-0700156-oxi8ayh4 authors: Pasha, Deepro F.; Lundeen, Alex; Yeasmin, Dilruba; Pasha, M. Fayzul K. title: An analysis to identify the important variables for the spread of COVID-19 using numerical techniques and data science date: 2021-06-30 journal: Case Studies in Chemical and Environmental Engineering DOI: 10.1016/j.cscee.2020.100067 sha: 685f3b8f1d3a60f033873dedc2d337b58dd6b604 doc_id: 700156 cord_uid: oxi8ayh4

According to system theory, the socio-economic variables that constitute a society should be able to capture a system response such as the number of weekly COVID-19 cases. This paper presents a numerical approach to answer two vital questions: which variables are more important, and how many variables are needed to capture the dynamics of the spread. Using the theory of least squares regression, two types of problems were set up and solved: multilinear regression (MLR) and a nonlinear power function, referred to as NLR in this study. Numerical techniques were applied to pre- and post-process the data and the vast number of outputs. A total of 43 socio-economic and meteorological variables from 31 counties in California in the United States resulted in about 37.4 million combinations for the analysis. Results show that variables related to total population, household income, occupation, and transportation are more important than the others. The frequency with which a variable achieves a high correlation increases as more variables are combined with it. Similarly, correlation increases as the number of variables in a combination increases. Some 5-variable combinations can capture the dynamics of the spread with high accuracy, with a correlation coefficient as high as 0.985.

If society is considered to be a system, its variables should capture the system's response and behaviors. There are many variables under different categories including demography, economy, culture, education, transportation, health, weather, etc.
that constitute a society. Some of these variables may be more important than others in capturing system response and behavior. As coronavirus disease 2019 (COVID-19), caused by the virus SARS-CoV-2, was identified to be highly infectious [1, 2], scientists and researchers have been working to understand the dynamics of the spread of the virus using predictive epidemic models [3, 4] and system behaviors to aid in controlling the outbreaks [5]. According to the US Centers for Disease Control and Prevention [6], the primary method of COVID-19 spread is through respiratory droplets from human to human. Many researchers have looked at the different factors that might affect the transmission of the droplets [7]. These include demography [8, 9], social connectedness and travel [10, 11], economic, cultural, and financial conditions [12, 13], and meteorological characteristics [14-16]. Numerous relationships were found between these variables and the spread of COVID-19 via a variety of methods. These relationships include higher COVID-19 susceptibility in older and more intergenerational populations than in younger populations [8], and in lower-temperature areas than in higher-temperature areas [14]. However, despite these predictions, the rate of infection continued to increase in summer, affecting younger generations as well as the elderly [17], contradicting the rising average temperature hypothesis presented by many scientists. While many factors possibly associated with COVID-19 cases have been identified individually, a study is required in which all the main socio-economic and meteorological variables are considered comprehensively. A systematic approach can include the social variables in a way that quantifies the impact of each variable, and of their combinations, on the spread of COVID-19.
A comprehensive understanding of these combinations of variables and their magnitudes of influence on COVID-19 spread would greatly benefit decision-makers in controlling the spread of COVID-19. Data science and existing numerical techniques, including single linear regression (SLR), multilinear regression (MLR), and nonlinear regression (NLR), can be used to understand the relationships and their degrees of influence on COVID-19 spread. This paper outlines a numerical scheme to identify the important variables, separately and in combination, and their impacts on the spread of COVID-19 using SLR, MLR, and NLR and well-known statistical assessment parameters. Note that MLR with one independent variable is referred to as SLR in this study.

The methodology consists of four steps. The first step was data collection, followed by data normalization in the second step and correlation analysis using MLR and NLR in the third step. In the final step, all the results were combined to identify the important variables and their impacts on the spread of the virus.

A literature search was conducted first to identify the variables that are postulated to contribute to the spread of the virus. Since the number of weekly new COVID-19 cases, which is the dependent variable, was available throughout the study area, a weekly time step was used as the temporal resolution. Hence, the meteorological variables were also converted to a weekly basis. Data from different counties were used to observe the impact of variables that are quasi-static in nature, such as household income. The county level was chosen as the spatial resolution because different counties exhibit different weather and climatic behaviors under similar state regulations. To eliminate the bias in magnitude, the data were normalized (standardized) using the following equation,

$$Z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$

where $x_{ij}$, $\mu_j$, $\sigma_j$, and $Z_{ij}$ are, respectively, any value, the mean, the standard deviation, and the standardized value of variable $j$.
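The standardization step described above can be illustrated with a minimal Python sketch. It is not the authors' code (they report using MATLAB); the function name `standardize`, the toy values, and the choice of the sample standard deviation (`ddof=1`) are assumptions made for illustration, since the text does not say which form of the standard deviation was used.

```python
import numpy as np

def standardize(data):
    """Column-wise z-scores: Z_ij = (x_ij - mu_j) / sigma_j."""
    data = np.asarray(data, dtype=float)
    mu = data.mean(axis=0)              # mean of each variable j
    sigma = data.std(axis=0, ddof=1)    # sample std (assumption, see lead-in)
    return (data - mu) / sigma

# Hypothetical values: 4 counties (rows) x 2 variables (columns)
# with very different magnitudes.
raw = np.array([
    [1000.0, 50.0],
    [2000.0, 60.0],
    [3000.0, 70.0],
    [4000.0, 80.0],
])
z = standardize(raw)
```

After standardization both columns have zero mean and unit standard deviation, so a variable measured in the thousands no longer dominates one measured in the tens.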
Assuming a linear relationship between the observed independent variables $X$ $(x_1, x_2, x_3, \ldots, x_n)$ and the observed dependent variable vector $Y$ ($y$ for a single occurrence), the mathematical equation can be written as [18]

$$Y = XA + \varepsilon$$

where $A$ $(a_0, a_1, a_2, a_3, \ldots, a_n)$ is the regression coefficient vector and $\varepsilon$ is the error vector. As mentioned, if a single independent variable (i.e., only $x_1$) is used, the equation is referred to as SLR. If more than one independent variable is used, the equation is referred to as MLR in this paper. However, if the dependent variable is nonlinearly related to the independent variable(s) through a power equation, the following equation, referred to as NLR in this paper, can be used.

$$y = a_0 \, x_1^{a_1} x_2^{a_2} x_3^{a_3} \cdots x_n^{a_n}$$

A logarithmic transformation of the original data can be used to determine the coefficients of the power equation (Eq. (4)). Using the theory of least squares regression, the following equation can be used to solve for the regression coefficients $A$ [18],

$$A = (X^{t} X)^{-1} X^{t} Y$$

where $X$ is the matrix of observed values of the independent variables and $t$ denotes the transpose of the matrix. Two separate analyses, one for MLR and the other for NLR, were conducted to identify the important variables and their impacts in terms of frequency analysis and correlations. The following equation was used to calculate the correlation coefficient (CC).

The State of California offers a wide range of variability in weather, social, and economic data under similar state regulations. A total of 31 counties (Fig. 1) from different regions of the state were considered for the study. A total of 43 independent variables were selected under eight categories: weather, demography, education, household income, occupation, health, transportation, and recreation. The definition, symbol, and source of data for each of these indicators are presented in Table 1.
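The regression set-up described earlier in this section (a linear model solved by least squares, and a power model fitted after a logarithmic transformation) can be sketched in Python as follows. This is an illustrative sketch, not the authors' MATLAB code. The function names, the synthetic noise-free data, and the use of the Pearson correlation between observed and predicted values as the CC are all assumptions. The power model needs strictly positive inputs, so it is fitted here on untransformed (not standardized) values.

```python
import numpy as np

def fit_mlr(X, y):
    """Least squares via the normal equations, A = (X^t X)^-1 X^t Y.
    A column of ones is prepended so that A[0] is the intercept a_0."""
    Xd = np.column_stack([np.ones(len(y)), X])
    return np.linalg.solve(Xd.T @ Xd, Xd.T @ y)

def predict_mlr(A, X):
    return A[0] + X @ A[1:]

def fit_nlr(X, y):
    """Power model y = a_0 * x_1^a_1 * ... * x_n^a_n, fitted as a linear
    model in log space: log y = log a_0 + sum_j a_j log x_j."""
    B = fit_mlr(np.log(X), np.log(y))
    return np.concatenate([[np.exp(B[0])], B[1:]])

def predict_nlr(A, X):
    return A[0] * np.prod(X ** A[1:], axis=1)

def correlation_coefficient(observed, predicted):
    """Pearson correlation between observed and predicted values
    (assumed form of the CC; the source equation is not reproduced here)."""
    return np.corrcoef(observed, predicted)[0, 1]

# Synthetic, noise-free data for 31 hypothetical counties and 2 variables.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(31, 2))
y_lin = 3.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]
y_pow = 2.0 * X[:, 0] ** 1.5 * X[:, 1] ** -0.5

A_lin = fit_mlr(X, y_lin)
A_pow = fit_nlr(X, y_pow)
cc_lin = correlation_coefficient(y_lin, predict_mlr(A_lin, X))
cc_pow = correlation_coefficient(y_pow, predict_nlr(A_pow, X))
```

Because the synthetic data contain no noise, both fits recover the generating coefficients and the CC is essentially 1. For ill-conditioned problems, `np.linalg.lstsq` is usually a safer choice than forming the normal equations explicitly.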
A brute force method was applied to calculate the total number of combinations, assuming combinations of one (1) to five (5) independent variables, resulting in 1,099,295 (= $^{43}C_1 + ^{43}C_2 + ^{43}C_3 + ^{43}C_4 + ^{43}C_5$) combinations separately for MLR and NLR. These combinations were analyzed for each week separately for 17 weeks (from March 18 to July 15). Therefore, the total number of combinations to analyze was 37,376,030 (= 2 × 17 × 1,099,295). Considering the vastness of the combinations, MATLAB R2018b [19] was used to develop the code for the computations.

Frequency analysis includes identifying the number of weeks for which the correlation between a single independent variable and the dependent variable was at or above a threshold value (CC ≥ 0.8 in this case). Out of the total of 43 variables, 19 were found to be highly correlated with the number of weekly cases (Fig. 2). The main categories of these variables are total population, household income, occupation, and transportation. Both MLR and NLR identify the same variables but with different magnitudes of correlation. For example, in the household income category, VLI (i.e., very low income) has the highest impact and HI (i.e., high income) has the lowest impact on the spread of the virus. Similarly, in the occupation category, SH (i.e., service in health) has the highest impact.

Some variables may show better correlation when combined with other variables. Therefore, considering the presence of each variable in a combination (2- to 5-variable combinations), the maximum possible number of occurrences for a single variable was calculated and used in the frequency analysis. Table 2 shows that an independent variable combined with other variable(s) can capture the dynamics better. As seen, the general trend of the impact is similar to the one-independent-variable case (Fig. 2). However, variables with minimal or no impact in the single-variable case can have more impact when combined with other variables.
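The combination counts quoted above can be checked with a few lines of Python. This is a sketch for verification only; the constant names are illustrative, and the interpretation of the "maximum possible number of occurrences" of a single variable as the number of k-variable combinations containing it (choose the remaining k-1 variables from the other 42) is an assumption about the authors' procedure.

```python
from itertools import combinations
from math import comb

N_VARIABLES = 43   # independent variables
MAX_SIZE = 5       # largest combination size considered
N_WEEKS = 17       # March 18 to July 15
N_MODELS = 2       # MLR and NLR

# Number of k-variable combinations for k = 1..5.
per_size = {k: comb(N_VARIABLES, k) for k in range(1, MAX_SIZE + 1)}
total_per_model_week = sum(per_size.values())            # 1,099,295
grand_total = N_MODELS * N_WEEKS * total_per_model_week  # 37,376,030

# Assumed reading: a given variable appears in comb(42, k - 1)
# of the k-variable combinations.
max_occurrences = {k: comb(N_VARIABLES - 1, k - 1) for k in range(2, MAX_SIZE + 1)}

# Enumerating combinations explicitly, here for five symbols from the paper.
symbols = ["TP", "VLI", "SH", "PT", "CH"]
pairs = list(combinations(symbols, 2))
```

`itertools.combinations` yields each subset lazily, so a brute-force loop over all combinations does not need to hold them in memory at once.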
Considering the vastness of the combinations, the average CC for each combination size over all 17 weeks of results was used to assess the impact of the number of variables in a combination. The average CC represents the trend and average impact of the number of variables on the dynamics of the spread (Fig. 3). As seen, CC increases as the number of variables in a combination increases. This is expected, since one or a few variables may not be enough to represent the true dynamics of the COVID-19 spread. The slope of the curve is steeper at the beginning than towards the end, indicating smaller contributions from additional variables. As in Fig. 2 and Table 2, NLR assesses the impact of each combination separately, which makes its average CC smaller than that of MLR.

Another analysis was conducted to observe how the average CC varies by week. The average CC varies from 0.50 to 0.96 for 1- to 5-variable combinations for both MLR and NLR. The average CC for a particular combination fluctuates more in NLR than in MLR. The highest and the lowest CC are found for Week 2 (March 25 - March 31) and Week 11 (May 27 - June 2), respectively.

One of the combinations with the highest CC values (TP, VLI, SH, PT, and CH) was used to observe the dynamics and predictability of the regression coefficients for Week 3 (Fig. 4). As seen, both MLR and NLR can generally capture the pattern of the observed cases for all the counties. While MLR slightly over- and under-predicts, NLR predicts better for lower numbers of cases.

Table 2. Frequency of independent variable in multiple variable combination affecting the number of cases with CC greater than 0.80 for MLR and NLR.

Fig. 3. Impact of number of variables in a combination to predict the number of cases.
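The trend of average CC against the number of variables in a combination can be reproduced on a small synthetic problem. The Python sketch below is not the authors' analysis: the data are random, only 6 hypothetical variables are used instead of 43, the CC is assumed to be the Pearson correlation between observed and fitted values, and only a linear (MLR-style) fit with an intercept is shown.

```python
from itertools import combinations
import numpy as np

def multiple_cc(X, y):
    """CC between observed y and the least-squares fit on the columns of X
    (linear model with an intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.corrcoef(y, Xd @ coef)[0, 1]

# Synthetic stand-in data: 31 hypothetical counties, 6 hypothetical variables.
rng = np.random.default_rng(42)
n_counties, n_vars = 31, 6
X = rng.normal(size=(n_counties, n_vars))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n_counties)

# Average CC over all combinations of each size k (brute force).
avg_cc = {}
for k in range(1, 5):
    ccs = [multiple_cc(X[:, list(cols)], y)
           for cols in combinations(range(n_vars), k)]
    avg_cc[k] = float(np.mean(ccs))
```

For an in-sample linear fit with an intercept, adding a variable can never lower the CC of a given combination, so the average over all combinations of size k+1 is at least the average over size k. This is consistent with the rising curve described for Fig. 3, and it also means that a rising average CC alone does not show that the added variables are informative.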
A numerical approach consisting of the application of MLR and NLR has been presented to observe the correlation between different socio-economic and meteorological variables and the weekly number of COVID-19 cases. Results show that 8 independent variables under the total population, household income, occupation, and transportation categories are highly correlated with COVID-19 cases. As the number of independent variables in a combination increases, the frequency of occurrence of higher correlation coefficients also increases. Similarly, the average CC, which is calculated from the results of all weeks, increases with the number of independent variables in a combination. Some highly correlated combinations can capture and predict the system behavior with high accuracy. The correlation coefficients for these combinations can be as high as 0.985. These observations can be used to develop a predictive model.

WHO World Health Organization, World Health Organization Media Briefing. Director General's Opening Remarks at the Media Briefing on COVID-19 -11
COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE)
Contributions to the mathematical theory of epidemics. II - the problem of endemicity
Analysis and forecast of COVID-19 spreading in China
Environmental engineers and scientists have important roles to play in stemming outbreaks and pandemics caused by enveloped viruses
Novel Coronavirus Factsheet - what You Should Know about COVID-19 to Protect Yourself and Others
novel Coronavirus (COVID-19) pandemic: built environment considerations to reduce transmission
Demographic science aids in understanding the spread and fatality rates of COVID-19
Investigation of effective climatology parameters on COVID-19 outbreak in Iran
The Geographic Spread of COVID-19 Correlates with Structure of Social Networks as Measured by Facebook
COVID-19 and community mitigation strategies in a pandemic
The influence of social and economic ties to the spread of COVID-19 in Europe
Low-Income and Communities of Color at Higher Risk of Serious Illness if Infected with Coronavirus, 2020. Retrieved on
Impact of meteorological factors on the COVID-19 transmission: a multi-city study in China
Factors influencing the epidemiological characteristics of pandemic COVID 19: a TISM approach
Temperature, Humidity, and Latitude Analysis to Predict Potential Spread and Seasonality for COVID-19
Coronavirus and COVID-19: Younger Adults Are at Risk, Too
Numerical Methods for Engineers

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.