key: cord-0697314-dh1nvm0y
authors: Joshua, Vasna; Sylvia Grace, J.; Godwin Emmanuel, J.; Satish, S.; Elangovan, A.
title: Spatial mapping of COVID-19 for Indian states using Principal Component Analysis
date: 2021-01-05
journal: Clin Epidemiol Glob Health
DOI: 10.1016/j.cegh.2020.100690
sha: 66b85086279244b83dbfb97fa44edceef87cf428
doc_id: 697314
cord_uid: dh1nvm0y


The first case of Coronavirus disease (COVID-19) was reported in Wuhan, China, in November 2019. The outbreak spread very quickly to 210 countries and territories across the globe. 1 In India, the first case of COVID-19 was reported on January 30, 2020, originating from China. As of Oct 12, 2020, the Ministry of Health and Family Welfare had confirmed 7,011,388 cases, 6,149,535 recoveries, and 109,150 deaths in the country. 2 The infection rate of COVID-19 in India has slowed down, and the growth of infections has become more or less linear rather than exponential. 3 The outbreak has been declared an epidemic in more than a dozen states and union territories, where provisions of the Epidemic Diseases Act, 1897 have been invoked, and educational institutions, tourist places, shopping malls, recreational centres, foreign consulates, and many commercial establishments have been shut down. 4 According to the Centers for Disease Control and Prevention (CDC), persons at higher risk of severe illness from COVID-19 are older adults; persons of any age who have serious ailments or are under medication for them, such as asthma or HIV; pregnant people; people experiencing homelessness; and persons with disabilities. 5 The objective of the present study was to identify the regions of India at greater risk of developing the disease, using state-level COVID-19 data and risk-related factors analysed with the Principal Component Analysis technique.

Study population: We retrieved the latest data available from the official website of the Ministry of Health and Family Welfare (MoHFW), India 2 ; Census of India 6 ; National Institution for Transforming India Aayog 7 ; National AIDS Control Organization 8 ; National Health Mission 9 ; National Health Profile 2018 10 ; National Family Health Survey 4 11 ; Handbook of Social Welfare Statistics, Ministry of Social Justice and Empowerment 12 ; State of Forest Report 2019 13 ; and published articles. 14,15,16 The information on COVID-19 active cases, deaths, and confirmed cases was collected on Oct 12, 2020. 2 The selection of the risk-related factors of COVID-19 was based on a review of the literature and, essentially, on the availability of data. The factors were retrieved for 37 Indian states, including union territories. The risk-related factors extracted were population, percentage of geographical region, population density, number of households, proportion of males, average family size, persons per room, percentage of illiterates, percentage of the elderly population (60 or more years), percentage of the homeless population, percentage of slum population, net migration rate, persons below the poverty line (BPL), disability rate, the prevalence of diabetes, common cancers, and hypertension among those attending NCD clinics, and adult HIV prevalence.

Factor analysis was used to reduce the large data set to a smaller set of factors without losing much information, and the Principal Component Analysis (PCA) technique was used to achieve this. The objective of PCA is to take a larger number of variables, say N variables $X_1, X_2, \ldots, X_N$, and find combinations of these to produce principal components $Z_1, Z_2, \ldots, Z_N$ that are uncorrelated and ordered by importance, so as to describe the variation in the data. The ith principal component is the linear combination $Z_i = a_{i1} X_1 + a_{i2} X_2 + \cdots + a_{iN} X_N$. There are N of these components, and the coefficients $a_{ij}$ are given by the eigenvector $a_i$ corresponding to the ith largest eigenvalue $\lambda_i$ of the correlation matrix of the X variables. When doing so, it often happens that the variances of most of the principal components are negligible. In that case, most of the variation in the full data set can be adequately described by the few Z components whose variances are not negligible. The best results are obtained when the original variables are highly correlated, either positively or negatively. An original set of 20 or more variables can then be adequately represented by a few (three or four) principal components. The first principal component has the highest variance and is therefore the most important, followed by the two or three other retained components, for representing the variation in the measurements of the 20 or more variables. For better interpretability, the factors are improved using varimax rotation, a widely used method that maximizes the sum, over all retained factors, of the variances of the squared loadings.
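The eigen-decomposition described above can be sketched as follows. This is a minimal illustration, not the authors' SPSS procedure: it assumes NumPy is available, the function name `pca_from_correlation` is hypothetical, and the input is synthetic data generated only to have correlated columns.

```python
import numpy as np

def pca_from_correlation(X):
    """Return eigenvalues (descending), eigenvectors (as columns) and component scores.

    X is an (observations x variables) array. Each column is standardized first,
    so the decomposition is that of the correlation matrix of the variables.
    """
    X = np.asarray(X, dtype=float)
    Xstd = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # zero mean, unit variance
    R = np.corrcoef(Xstd, rowvar=False)                   # N x N correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)                  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                     # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xstd @ eigvecs                               # column i holds Z_i per observation
    return eigvals, eigvecs, scores

# Synthetic illustration only (not the study data): 37 rows, 4 variables,
# three of which share a common signal and are therefore highly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(37, 1))
X = np.hstack(
    [base + 0.3 * rng.normal(size=(37, 1)) for _ in range(3)]
    + [rng.normal(size=(37, 1))]
)
eigvals, eigvecs, scores = pca_from_correlation(X)
explained = eigvals / eigvals.sum()  # share of total variance per component
```

Because the variables are standardized, the eigenvalues sum to the number of variables, and the covariance matrix of `scores` is diagonal, which is the sense in which the components are uncorrelated.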

For further analysis, it is usual to use only the first few principal components, provided that the sum of their variances is a high percentage (e.g., 80% or more) of the sum of the variances of all N components. A factor score is obtained as a linear combination of the standardized variables, and the coefficients of this combination are called factor score coefficients. Using the percentages of variance explained as weights on the factor scores, the initial score is computed. 17,18,19,20,21,22
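As a rough sketch of this retention rule, the snippet below finds the smallest number of leading components whose cumulative share of variance reaches a threshold, and also counts eigenvalues greater than one. The function name `components_to_keep` and the eigenvalues are hypothetical, made up purely for illustration; they are not the values obtained in the study.

```python
def components_to_keep(eigenvalues, threshold=0.80):
    """Smallest number of leading components whose variance share reaches `threshold`.

    Returns (k, cumulative_share_of_first_k).
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value / total
        if cumulative >= threshold:
            return k, cumulative
    return len(ordered), cumulative

# Hypothetical eigenvalues of a 10-variable correlation matrix (they sum to 10).
example_eigenvalues = [3.6, 2.1, 1.5, 1.1, 0.6, 0.4, 0.3, 0.2, 0.1, 0.1]

k, share = components_to_keep(example_eigenvalues)
# Alternative rule: keep components with eigenvalue greater than one.
kaiser_k = sum(1 for value in example_eigenvalues if value > 1.0)
```

With these made-up numbers both rules retain four components, but in general the two criteria need not agree.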

Our ultimate aim was to reduce the original data set to relatively few independent factors and to estimate the factor scores. The original data set was an array of dimension 37 states x 21 factors. These factors were examined using the correlation matrix to obtain a meaningful representation. The risk-related factors had different units of measurement; hence they were standardized. The PCA reduced the 19 risk-related factors (after omission of net migration and prevalence of common cancers) to ten highly correlated factors. Hence the final data set used in the analysis had size 37 states x 10 factors (Table 1). Further, a smaller subset of four factors was extracted using the criterion of an eigenvalue greater than one. Varimax rotation was used to improve the factors, and finally, the factor scores were obtained. The percentages of variation were used as weights, and the initial score for each state was obtained. For the sake of comparison, the initial scores were standardized and listed. The above analysis was done using the SPSS software. 23
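The weighting step can be illustrated with a small sketch. Everything in it is hypothetical: the unit names, factor scores and variance percentages are invented, and the min-max rescaling to 0-100 is only an assumed form of the "standardized score", since the text does not state which standardization was applied.

```python
def composite_scores(factor_scores, variance_pct):
    """Weighted sum of factor scores per unit; weights are % variance per factor."""
    return [sum(w * f for w, f in zip(variance_pct, row)) for row in factor_scores]

def rescale_0_100(values):
    """Min-max rescaling to the range 0-100 (an ASSUMED standardization)."""
    lo, hi = min(values), max(values)
    return [100.0 * ((v - lo) / (hi - lo)) for v in values]

# Hypothetical factor scores for three units on four rotated factors.
factor_scores = {
    "Unit A": [1.2, 0.4, -0.1, 0.3],
    "Unit B": [0.1, -0.5, 0.8, -0.2],
    "Unit C": [-1.0, 0.2, -0.6, 0.1],
}
# Hypothetical percentages of variance explained by the four factors.
variance_pct = [30.0, 22.0, 18.0, 13.0]

initial = composite_scores(factor_scores.values(), variance_pct)
standardized = dict(zip(factor_scores, rescale_0_100(initial)))
ranking = sorted(standardized, key=standardized.get, reverse=True)  # highest score first
```

Sorting by the rescaled score gives the kind of descending list described in the text; a different standardization would change the scale but not the ordering.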

A simple spatial interpolation method, namely the Inverse Distance Weighting (IDW) method, 24 was applied to predict values at unmeasured locations using the available information from the measured locations. Here the information is in the form of derived scores for 37 locations (states), and the weights were assigned as the inverse of the distance between known and unknown locations. An IDW power coefficient of 2 with the 12 nearest neighbours was used for the analysis.
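A minimal sketch of IDW with a power of 2 and up to 12 nearest neighbours is given below. It is not the ArcGIS implementation: the function name `idw_predict` is hypothetical, the points are invented, and plain Euclidean distance on the coordinates is an assumption made to keep the example short.

```python
import math

def idw_predict(known, target, power=2, k=12):
    """Inverse Distance Weighting estimate at `target`.

    known  : list of (x, y, value) tuples for measured locations
    target : (x, y) location to estimate
    power  : exponent applied to the distance (2 in the text)
    k      : number of nearest neighbours used (12 in the text)
    """
    # Distance from the target to every known point, nearest first.
    nearest = sorted(
        (math.hypot(x - target[0], y - target[1]), value) for x, y, value in known
    )[:k]
    if nearest[0][0] == 0.0:  # target coincides with a measured point
        return nearest[0][1]
    weights = [1.0 / d ** power for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

# Hypothetical measured points (x, y, score); Euclidean distance is an assumption.
known = [(0.0, 0.0, 10.0), (2.0, 0.0, 30.0), (0.0, 2.0, 50.0), (2.0, 2.0, 70.0)]
centre = idw_predict(known, (1.0, 1.0))       # equidistant from all four points
near_origin = idw_predict(known, (0.1, 0.1))  # dominated by the point at the origin
```

At the centre all four points are equally distant, so the estimate is their plain average; close to a measured point the estimate is pulled strongly towards that point's value because the weights fall off with the square of the distance.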

The locations (longitude, latitude) of each state and the derived score were integrated into the ArcGIS version 10 software (ESRI, Redlands, CA, USA) 25 to predict values in the unmeasured locations.

The PCA identified four factors, which together explained about 83% of the total variation. All the factors selected for the analysis were examined and were found to be highly correlated, as required for factor analysis. The larger factor loadings (≥0.64) on the four factors are listed in Table 2. The first factor consists of the highly correlated COVID-19 statistics, namely active cases, number of deaths, and confirmed cases. The second factor consists of the illiterate population and the mean number of persons per room. The third factor consists of the residential population, homeless population, and elderly population aged 60 or more years, and the fourth factor consists of disability rate and slum population.

The initial scores for the states are listed in Table 3, wherein the last column gives the corresponding standardized scores in descending order.

The states of Maharashtra, Uttar Pradesh, Andhra Pradesh, Karnataka, and Tamil Nadu stood above the average. They had standardized scores of 50 or above, indicating that greater interventional care is needed there to bring down COVID-19 transmission in India.

NCT of Delhi, West Bengal, Bihar, Telangana, Madhya Pradesh, Odisha, Rajasthan, Chhattisgarh, Uttarakhand, Punjab, Gujarat, Jammu Kashmir, and Haryana, which had scores between 25 and 50, need the next priority of care, and the last nineteen states, which had scores of less than 25, needed less care as on the date of the investigation.

The map obtained (Fig. 1) showed an optimal unbiased representation of the multiple risk-related factors of COVID-19 transmission, based on the inverse-distance-weighted estimates. The figure shows the regional variation, the regions where high risk of the disease is concentrated (hot spots), and the regions at greater risk of developing the infection. The estimates showed the high-risk concentration in the central part of India, with hot spots in Maharashtra, Uttar Pradesh, Andhra Pradesh, Karnataka, and Tamil Nadu. Transmission appeared to be lower in the North-Eastern part of India, Himachal Pradesh, and Dadra & Nagar Haveli.

Zones/districts within the states have been classified as red if they have a sizeable number of COVID-19 cases or hotspots, green if they have had zero confirmed cases in the last 21 days, and orange otherwise (a limited number of cases), and people's movements are restricted accordingly. 26 Looking at the raw data and the magnitude of confirmed cases, Maharashtra stood first, followed in order by Andhra Pradesh, Karnataka, Tamil Nadu, Uttar Pradesh, Delhi, West Bengal, Kerala, and Odisha. The above exercise, which considers the COVID-19 risk-related factors in a multivariate setup, brings out the red-alert states in the order Maharashtra, Uttar Pradesh, Andhra Pradesh, Karnataka, and Tamil Nadu. The map shows the hot spot regions, mainly in the central part of India. It also shows a few cold spots in the 'Seven Sister' states. 27 Apart from the COVID-19 data, the study also brings out the proxy determinants: illiteracy and mean number of persons per room; followed by residential population, homeless population, and elderly population aged 60 years or more; and disability rate and slum population. Even though eighteen variables (including chronic disease rates) were included in the study, only the above seven variables (other than COVID-19 cases) showed a correlation of more than 0.5 with the infection cases. Directly or indirectly, all seven variables are a function of 'social distancing'. Public health officials emphasize social distancing, as it is considered an important measure for mitigating the COVID-19 pandemic. In a country like India, social distancing poses unique challenges. 28 The study used the state as the unit of analysis. If finer units such as districts or taluks had been considered, it would have been possible to pinpoint more precisely the pockets of the country needing remedial measures. The COVID-19 risk-related data were drawn from multiple sources relating to different years, which could be one of the study's limitations.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.cegh.2020.100690.

1. The original data set (37*21) was extracted from various sources (shown in the supplementary material).

2. The selection of 10 factors was based on the following.

(i) State-wise net migration rate was readily available only for the period 1991-2001 and hence was not included.

(ii) Prevalence of common cancers among those attending NCD clinics from 01.01.2017 to 31.12.2017 was missing for 5 states and hence was also not included in the further analysis.

Hence the original data set was reduced to (37*19), and it was further examined.

(iii) The basic assumption of factor analysis is that the observed factors are highly correlated. The analysis also brings out the number of factors required to represent the major portion of the variation provided by all the observed factors. It does so by expressing each observed factor as the best linear combination of a small number of unknown, unobserved common factors. The success of any factor analysis depends on obtaining really meaningful factors.

3. Based on the above assumption, the data set was reduced to (37*10 factors), and this final data set was used in the analysis.

The initial scores and standardized scores for the Indian states, 2020. Steps involved:

(i) As a first step, the factors were standardized to eliminate the effect of different scales of measurement. The standardized data set (shown in the supplementary material) was given as input to the factor analysis program in the SPSS software.

(ii) The next step was to examine the correlation matrix between the factors. The values ranged from 0.1 to 0.9 in absolute value. The majority of the correlations were ≥0.3.

(iii) The communality values for the factors (≥0.7) are shown in the last column of Table 2.

(iv) Further suitability of the data set for analysis was assessed using Bartlett's test of sphericity, the Kaiser-Meyer-Olkin measure of sampling adequacy, and inspection of residuals and rotated factor loadings.

(v) Bartlett's test of sphericity, which tests the hypothesis that the correlation matrix is the identity matrix under the assumption of multivariate normality, was found to be highly significant (P < 0.001).

(vi) The Kaiser-Meyer-Olkin measure of sampling adequacy was 0.61, which represents an acceptable value for factor analysis.
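The two suitability checks in steps (v) and (vi) can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the SPSS routine: the function names are hypothetical, the correlation matrix is made up, and only the chi-square statistic and its degrees of freedom are computed for Bartlett's test (the P value would come from the upper tail of a chi-square distribution with those degrees of freedom).

```python
import numpy as np

def bartlett_sphericity(R, n):
    """Chi-square statistic and degrees of freedom for Bartlett's test.

    R is a p x p correlation matrix estimated from n observations.
    Statistic: -(n - 1 - (2p + 5) / 6) * ln(det(R)), with p(p - 1) / 2 d.o.f.
    """
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)  # log-determinant, numerically stable
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return chi2, dof

def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.asarray(R, dtype=float)
    inv = np.linalg.inv(R)
    # Partial correlations from the inverse correlation matrix.
    partial = -inv / np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    off = ~np.eye(R.shape[0], dtype=bool)  # mask selecting off-diagonal entries
    r2 = np.sum(R[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)

# Hypothetical 3 x 3 correlation matrix, treated as estimated from 37 observations.
R_example = [[1.0, 0.6, 0.5],
             [0.6, 1.0, 0.4],
             [0.5, 0.4, 1.0]]
chi2_stat, dof = bartlett_sphericity(R_example, n=37)
kmo_value = kmo(R_example)
```

A large chi-square statistic relative to its degrees of freedom indicates that the correlation matrix differs from the identity, and a KMO value closer to 1 indicates that partial correlations are small compared with the ordinary correlations.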

(i) Further, a smaller subset of four factors was extracted using the criterion of an eigenvalue greater than one. Varimax rotation was used to improve the factors and make them readily identifiable, and finally, the factor scores were obtained (shown in the supplementary material). (ii) The percentage of variation (shown in the last column of Table 2) was used as weights, and the initial score for each state was obtained (shown in the last column of Table 3).
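For readers unfamiliar with varimax, the rotation step can be sketched as follows. This is one common SVD-based iteration for the raw varimax criterion (no Kaiser row normalization), written with NumPy; it is not claimed to match the SPSS implementation, and the function names and the loading matrix are hypothetical.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x factors) loading matrix by raw varimax.

    Returns (rotated_loadings, rotation_matrix).
    """
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = L @ rotation
        # Gradient direction of the varimax criterion with respect to the loadings.
        gradient = rotated ** 3 - rotated @ np.diag(np.sum(rotated ** 2, axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ gradient)
        rotation = u @ vt                      # closest orthogonal matrix
        current = s.sum()
        if previous != 0.0 and current < previous * (1.0 + tol):
            break                              # no meaningful improvement
        previous = current
    return L @ rotation, rotation

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return float(np.sum(np.var(np.asarray(loadings, dtype=float) ** 2, axis=0)))

# Hypothetical loadings: a simple two-factor structure mixed by a 30 degree rotation.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.6]])
theta = np.deg2rad(30.0)
mix = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
mixed = simple @ mix
rotated, rotation = varimax(mixed)
```

Because the rotation is orthogonal, each variable's communality (the row sum of squared loadings) is unchanged; only the distribution of loadings across factors changes, which is what makes the rotated factors easier to identify.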

Multivariate Statistical Methods: A Primer

Factor Analysis: An Introduction and Manual for the Psychologist and Social Scientist

Modern Factor Analysis

Development of an index of need for health resources for Indian States using factor analysis

Mapping co-variates of mortality up to age of five years for Indian states

Index based mapping of high-risk behaviors for HIV among female sex workers in India

Statistical Package for the Social Sciences (SPSS) for Windows (Version 18)

Improved weighting methods, deterministic and stochastic data-driven models for estimation of missing precipitation records

Environmental Systems Research Institute. Inc. ArcGIS Desktop: Release 10