key: cord-0321740-6h3pzuad authors: Spors, E.; Michael, S. title: Standardized incidence ratio of the COVID-19 pandemic: a case study in a Midwestern state date: 2021-09-30 journal: nan DOI: 10.1101/2021.09.28.21263671 sha: e71844b8d247af3464f9fa5b4618ce0fbd543333 doc_id: 321740 cord_uid: 6h3pzuad

Coronavirus disease 2019 (COVID-19) has made a dramatic impact around the world, with some communities facing harsher outcomes than others. We sought to understand how each county fared compared to what would be expected and what factors contributed to negative outcomes from the pandemic in South Dakota's counties. The standardized incidence ratios of all counties are computed using age-adjusted hospitalization and death rates. In addition, a penalized generalized linear regression model is used to identify factors that are associated with COVID-19 hospitalization and death rates. The results identified counties that had more severe outcomes than would be expected. In addition, race, education, and testing rate were some of the significant factors associated with the outcome.

diabetes, heart conditions, immunocompromised state, liver disease, overweight or obesity, pregnancy, and sickle cell disease [9]. In addition, lifestyle factors, such as smoking and substance abuse, may also place people at increased risk [9]. In this paper, COVID-19 cases, hospitalizations, and deaths at the county level were considered as response variables. Several social, economic, and demographic variables that have been reported in the literature as risk factors for infectious disease transmission were also considered at the county level. We considered penalized generalized linear models to identify factors that are associated with the disease spread [13]. The paper is organized as follows: In Section 2 we provide the data sources and methods used for data analysis. In Section 3 the results of the analysis are presented.
Section 4 discusses our main findings and the corresponding literature.

The South Dakota (SD) COVID-19 data were received directly from the SD Department of Health (SDDOH) [20]. This study specifically examined reported COVID-19 cases, hospitalizations, and deaths at the county level. Based on definitions from the SDDOH, cases included persons who met the national surveillance definition for COVID-19. A person who had a positive PCR test for SARS-CoV-2 was a confirmed case, and a person with a positive antigen test was a probable case. A person was listed as hospitalized if, at any point during the infection, they were hospitalized under transmission-based precautions. Deaths reflected the number of people who died with COVID-19 listed as the cause of death or as a significant contributor to death, as determined by the judgement of the healthcare provider or coroner completing the death certificate [3].

The dataset obtained from SDDOH included information on each case of COVID-19 in SD. Details included the date the positive test was reported to the SDDOH, a positive PCR test indicator (yes/no), a positive antigen test indicator (yes/no), recovery date, current county of residence, age category, and hospitalization and death indicators (yes/no). Where applicable, the dates of hospitalization, discharge, and death were included [20]. The age category was represented by a number from one through nine, corresponding to ten-year age groupings: 0 to 9, 10 to 19, 20 to 29, 30 to 39, 40 to 49, 50 to 59, 60 to 69, 70 to 79, and 80 years of age and older. An additional dataset included the daily count of PCR and antigen COVID-19 tests per county [21]. The testing data were not broken down by age group.
We selected the following socioeconomic factors at the county level for analysis: nonwhite percentage, educational attainment (percentage of the population with a bachelor's degree or higher), percentage of the population with total annual income at or below a specified poverty threshold (determined by household size and composition) as designated by the U.S. Census Bureau, median income, unemployment percentage, uninsured percentage, cardiovascular disease hospitalization rate per 1,000 population, diabetes percentage (Type I and II), obesity percentage defined by BMI, physical inactivity percentage, and rate of providers. The source of each factor is described in detail in the appendix (see Section 6).

In this paper, some of the results are presented for the top ten counties in SD. These counties were chosen based on the population of the main city in each county. In order of city size, they are: Minnehaha (Sioux Falls), Pennington (Rapid City), Brown (Aberdeen), Brookings (Brookings), Codington (Watertown), Davison (Mitchell), Yankton (Yankton), Hughes (Pierre), Beadle (Huron), and Lawrence (Spearfish). Several counties are also home to South Dakota Board of Regents schools. In order of size, these are South Dakota State University (Brookings), University of South Dakota (Clay), Black Hills State University (Lawrence), Northern State University (Brown), and South Dakota School of Mines and Technology (Pennington).

The initial dataset had several incongruities in the hospitalization dates, specifically dates that preceded the official arrival of the pandemic in South Dakota. Data cleaning was performed by changing the listed date to a more appropriate date based on available information from the specific case. For example, a hospitalization date listed as 01/20 was changed to 01/21 based on the date a positive test result was reported to SDDOH.
medRxiv preprint (not certified by peer review), this version posted September 30, 2021. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

After preparation, the data were transformed from individual-level records to counts of hospitalizations and deaths by county of residence, by age group (1 through 9), and by day (from Mar 10, 2020 through May 1, 2021). The age distribution for SD and two counties is shown in Figure 1. It is well known that severe illness from COVID-19, in the time frame this report covers (before the delta variant), disproportionately impacts older individuals. In college towns such as Brookings, the age distribution based on the 2019 census [?] is different from that of many other counties in SD (see Figure 1). Brookings County has a younger population than both Brown County and SD overall. Thus, crude rates of hospitalizations or deaths per 100,000 may be lower in Brookings County than in counties with an older population simply due to age differences. To account for this difference, we used the age-adjusted rate per 100,000, which standardizes the values across counties with respect to their age distribution. The daily age-adjusted rate per 100,000, R, was calculated as

$$R_{i,j} = \sum_{n=1}^{9} \frac{c_{n,i,j}}{p_{n,j}} \cdot \frac{u_n}{USPOP} \cdot 100000, \qquad (1)$$

where i indexes the days from 10-March-2020 to 5-May-2021, j = 1, ..., 66 indexes the counties in SD, n = 1, ..., 9 indexes the age groups, $c_{n,i,j}$ represents the count of a particular outcome (hospitalization or death) for age group n on day i in county j, $p_{n,j}$ is the population size for the nth age group in the jth county, USPOP represents the U.S. population based on the 2019 Census estimates, and $u_n$ is the U.S. population in age group n. The daily age-adjusted rates per 100,000 were calculated for hospitalizations and deaths.
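As a minimal sketch of the direct age standardization described above, the following Python code weights each age-specific rate by the standard population's share of that age group. It is an illustration only: the authors' analysis was carried out in R, the function names are hypothetical, and the example uses three made-up age groups instead of the nine used in the paper.

```python
def age_adjusted_rate(counts, county_pop, std_pop, per=100_000):
    """Directly age-standardized rate for one county.

    counts[n]     -- outcome count in age group n      (c_{n,j})
    county_pop[n] -- county population in age group n  (p_{n,j})
    std_pop[n]    -- standard population in age group n (u_n)
    """
    if not (len(counts) == len(county_pop) == len(std_pop)):
        raise ValueError("all inputs must cover the same age groups")
    total_std = sum(std_pop)  # plays the role of USPOP
    rate = 0.0
    for c, p, u in zip(counts, county_pop, std_pop):
        if p > 0:  # assumption: empty age strata contribute nothing
            rate += (c / p) * (u / total_std)
    return rate * per


def crude_rate(count, population, per=100_000):
    """Crude rate per `per` residents, ignoring age structure."""
    return count / population * per


# Hypothetical numbers: two counties with identical age-specific rates
# (0.001, 0.002, 0.01) but very different age structures.
std_pop = [400, 400, 200]
young_pop, young_counts = [6000, 3000, 1000], [6, 6, 10]
old_pop, old_counts = [2000, 3000, 5000], [2, 6, 50]

young_crude = crude_rate(sum(young_counts), sum(young_pop))
old_crude = crude_rate(sum(old_counts), sum(old_pop))
young_adj = age_adjusted_rate(young_counts, young_pop, std_pop)
old_adj = age_adjusted_rate(old_counts, old_pop, std_pop)
```

In this made-up example the crude rates differ only because of age structure, while the age-adjusted rates coincide, which is the effect the adjustment is meant to achieve for counties such as Brookings.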
Since the testing data were not broken down by age group, we calculated the daily crude testing rate per 100,000 as $T_{i,j} = \frac{c_{i,j}}{p_j} \cdot 100000$, where $c_{i,j}$ is the number of tests on day i in county j and $p_j$ is the total population of county j.

For linear models, cross-sectional data are needed. Therefore, cumulative counts at a specific date were considered. Similar to the daily counts, the cumulative age-adjusted rate per 100,000 was calculated for deaths and hospitalizations. After finding the cumulative hospitalizations and deaths for each county up to May 5, 2021, the cumulative rate was calculated as

$$R_j = \sum_{n=1}^{9} \frac{c_{n,j}}{p_{n,j}} \cdot \frac{u_n}{USPOP} \cdot 100000,$$

where $c_{n,j}$, $p_{n,j}$, and $u_n$ are defined as in Equation 1. Again, the testing rate was calculated from cumulative tests per 100,000 as $T_j = \frac{c_j}{p_j} \cdot 100000$.

A standardized incidence ratio (SIR) was used to compare how counties fared in terms of cumulative hospitalizations and deaths. This was done by comparing the observed value in a county to an expected value, calculated as $SIR_j = \frac{O_j}{E_j}$, where $O_j$ was the observed value from county j and $E_j$ was the expected value for the jth county. The confidence intervals for the SIR indices were calculated as

$$\left( \frac{\chi^2_{v_1,\,\alpha/2}}{2E_j},\ \frac{\chi^2_{v_2,\,1-\alpha/2}}{2E_j} \right),$$

where $v_1 = 2O_j$, $v_2 = 2(O_j + 1)$, and $\chi^2_{v,q}$ is the q-quantile of the chi-squared distribution with v degrees of freedom, with α = 0.05 [17]. To adjust for multiple comparisons, we used a Bonferroni correction, replacing α/2 with α/(2 · m), where m = 66 was the number of comparisons to be made between counties. An SIR value was calculated for the cumulative hospitalizations and deaths for all counties in South Dakota. Values greater than one indicate that the county experienced a higher rate of that measure than expected. Conversely, values lower than one indicate that the county experienced a lower rate of that measure than expected. If the confidence interval did not include 1, the county was significantly different from what was expected.
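The SIR and its confidence limits can be sketched in a few lines of Python. The chi-squared limits with $2O_j$ and $2(O_j + 1)$ degrees of freedom are equivalent to the exact Poisson limits for the observed count, so this sketch solves the Poisson tail equations by bisection using only the standard library. The function names are hypothetical and this is not the authors' R code; with SciPy available, `scipy.stats.chi2.ppf` could be used to evaluate the chi-squared quantiles directly.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), computed in log space for stability."""
    if k < 0:
        return 0.0
    if mu == 0:
        return 1.0
    logs = [x * math.log(mu) - mu - math.lgamma(x + 1) for x in range(k + 1)]
    top = max(logs)
    return min(1.0, math.exp(top) * sum(math.exp(v - top) for v in logs))


def _solve_mu(k, target):
    """Find mu with P(X <= k | mu) = target (the CDF decreases in mu)."""
    lo, hi = 0.0, max(1.0, float(k))
    while poisson_cdf(k, hi) > target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(k, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sir_with_ci(observed, expected, alpha=0.05, n_comparisons=1):
    """Return (SIR, lower, upper) with a Bonferroni-adjusted exact interval.

    Each tail gets probability alpha / (2 * n_comparisons).
    """
    tail = alpha / (2 * n_comparisons)
    sir = observed / expected
    lower = 0.0 if observed == 0 else _solve_mu(observed - 1, 1.0 - tail)
    upper = _solve_mu(observed, tail)
    return sir, lower / expected, upper / expected


# Hypothetical county: 10 observed events against 10 expected,
# without and with the Bonferroni correction for 66 counties.
plain = sir_with_ci(10, 10)
adjusted = sir_with_ci(10, 10, n_comparisons=66)
```

The Bonferroni-adjusted interval is wider than the unadjusted one, so fewer counties are flagged as significantly different from 1.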
To understand how socioeconomic factors impacted COVID-19 hospitalizations and deaths, we utilized generalized linear models (GLM) with Poisson and Gaussian families. In general, given responses $Y_1, \ldots, Y_n$ and predictor variables $X_1, \ldots, X_n$, and letting $\mu_i = E(Y_i)$, the GLM has the structure $g(\mu_i) = X_i \beta$, where g is called the link function. In a GLM we assume that $Y_i$ follows an exponential family distribution such as the Gaussian, Binomial, or Poisson [28]. The coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are estimated by maximum likelihood, that is, by finding the coefficients that maximize the likelihood function. In our data, however, there were a large number of potential predictors and several of them exhibited collinearity, so we chose to use the penalized GLM for feature selection. Specifically, the lasso GLM incorporates $\lambda \sum_{j=1}^{p} |\beta_j|$ as a penalty in the likelihood function [12, 28]. Lasso regression forces some of the coefficient estimates to be exactly zero as λ increases, effectively dropping those variables from the model.

All data preparation and statistical analysis were performed using R and RStudio [18, 19]. The package tidyverse was used for general data manipulation and all graphs were created with the ggplot2 package [26, 27]. Lasso regression was completed with the R package glmnet [12].

To better understand the data, several exploratory data analysis techniques were used. All our analysis henceforth uses the age-adjusted rates per 100,000 for cases, hospitalizations, and deaths and crude rates for testing data. We first explored the trends in the COVID-19 data.
The daily age-adjusted cases of the counties in SD are shown in Figure 2. In this plot we can see the major outbreak (peak) around Nov-Dec 2020. We also note the initial surge in Minnehaha County, where an outbreak at a pork processing plant (Smithfield Foods Incorporation) occurred [23]. The two highest daily rates corresponded to the mass testing at the facility in ... County [].

The map of South Dakota with age-adjusted cumulative hospitalizations and deaths is shown in Figure 3. As expected, we see a strong association between hospitalization and death rates among the counties. Also note that the west-central part of the state has higher rates of hospitalizations and deaths. Figure 4 shows the cumulative age-adjusted hospitalizations and deaths for SD counties. For hospitalizations, there seem to be two groups, shown in light gray and darker gray shades in the plot. For deaths, we can see three groups of counties: those separated at the top in a light gray shade, those in the middle in darker gray, and another group with close to zero deaths at the bottom in light gray shades.

The numerical summaries of the demographic and socioeconomic explanatory variables are given in Table 1. A more detailed graphical summary of the socioeconomic explanatory variables is given in Appendix 6. Figure 5 shows the correlation between the explanatory variables. Most of the variables are highly correlated, which explains the need for feature selection methods. For example, as expected, median income is highly correlated with education (0.6) and inversely correlated with poverty and uninsured rates (-0.7 and -0.6, respectively).
A GLM with a set of variables that excluded some of the highly correlated variables, and a penalized Poisson regression with a lasso penalty, were fitted for age-adjusted cumulative hospitalization and death rates on May 1, 2021. The λ for the lasso penalty was chosen by cross-validation. For cumulative hospitalizations, the lasso model selected the nonwhite percentage and educational attainment as important variables. After fitting the linear model, there was a positive association between the cumulative rate of hospitalizations and the nonwhite percentage. Conversely, there was a negative association between the cumulative rate of hospitalizations and educational attainment. That is, a county with a higher percentage of college-educated residents saw fewer hospitalizations. For cumulative deaths, lasso regression selected the nonwhite percentage. The linear model revealed a positive association between cumulative deaths and the nonwhite percentage.

After fitting the penalized GLMs to the data, we computed the expected cumulative hospitalization and death rates for each county. These were then used to compute new predicted SIR values after controlling for race and educational attainment in the case of hospitalizations and for race in the case of deaths. The results of both sets of SIRs, with age adjustment only and with adjustment for both age and other factors, are presented in Figures 6 and 7. For ease of presentation and interpretation of the results we present SIR − 1 values. Hence, values below zero indicate that the observed values are less than expected, and counties that had more than expected are above zero.
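To illustrate how a lasso penalty drives coefficients to exactly zero, the following is a minimal sketch of cyclic coordinate descent with soft-thresholding for the Gaussian case only, at a fixed λ. It is not the authors' procedure: they fitted a Poisson lasso with glmnet in R and chose λ by cross-validation. All names and numbers below are hypothetical, and the objective assumed here is $\frac{1}{2N}\sum_i (y_i - \beta_0 - x_i\beta)^2 + \lambda \sum_j |\beta_j|$ on standardized predictors.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardize(columns):
    """Center each predictor column and scale it to unit (population) variance."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        if sd == 0:
            raise ValueError("constant predictor cannot be standardized")
        out.append([(v - mean) / sd for v in col])
    return out


def lasso_gaussian(columns, y, lam, n_iter=500, tol=1e-10):
    """Return (intercept, coefficients on the standardized predictors)."""
    x = standardize(columns)
    n = len(y)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]  # residuals when every coefficient is 0
    beta = [0.0] * len(x)
    for _ in range(n_iter):
        max_change = 0.0
        for j, xj in enumerate(x):
            # covariance of x_j with the partial residual (beta_j added back)
            rho = sum(xj[i] * resid[i] for i in range(n)) / n + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * xj[i]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return y_mean, beta


# Hypothetical data: the response depends on the first predictor only.
signal = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [1, -1, -1, 1, 1, -1, -1, 1]
response = [2 * v for v in signal]
intercept, coefs = lasso_gaussian([signal, noise], response, lam=0.5)
```

In this toy example the coefficient of the unrelated predictor is set exactly to zero while the coefficient of the informative predictor is shrunk by λ, which mirrors how the lasso drops variables from the model.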
For age-adjusted rates of hospitalization, from the top ten counties in SD, Lawrence, Clay, Beadle, and Brookings had the lowest significant SIR values, with Lawrence County experiencing about 60 percent fewer hospitalizations and Brookings experiencing about 36 percent fewer hospitalizations than expected. Minnehaha had 17 percent more and Brown had 10 percent more cases than expected. For deaths, from the top ten counties in SD, Yankton, Brookings, Clay, and Lawrence had the lowest SIR values, with Yankton experiencing about 43 percent fewer deaths and the others about 35 percent fewer deaths. We note that similar counties tend to have more severe outcomes than expected.

After controlling for other factors using the GLM, the range of SIR values for counties with significant SIRs shifted from a maximum of around 3 to 1.25 for the hospitalization rate and from around 4.5 to 2.5 for the death rate. In particular, counties with a large nonwhite population moved towards the middle of the SIR graph or became non-significant and were dropped.

The SIR model shows how different parts of SD responded to the pandemic by demonstrating the spatial patterns, and it identifies the counties that had a larger share of severe outcomes and those that had a smaller share. We noted that in the top 11 counties, most counties had lower values than expected when compared to South Dakota as a baseline.
Consistently, Lawrence, Brookings, and Clay had significantly lower rates of severe outcomes than expected. In contrast, Minnehaha and Davison experienced significantly higher rates of severe outcomes than expected. Looking at specific counties at the worse end of the map, Dewey, Todd, Oglala Lakota, and Buffalo had two to four times more hospitalizations and deaths than what might be expected. We note that these counties are where some of the American Indian reservations in SD are located.

Throughout all the models, the nonwhite percentage was repeatedly selected as positively associated with hospitalizations and deaths. Based on the available literature, this is not surprising. Historical treatment and continuing racial differences have contributed to large disparities between minority and white populations in the United States. In South Dakota, American Indian and Alaska Native (AI/AN) is the largest minority at 9 percent, followed by two or more races (2.5 percent) and Black or African American (2.3 percent) [24]. According to the CDC, the ratio of AI/AN to white non-Hispanic persons was 3.4 for hospitalizations and 2.0 for deaths. For Black or African Americans, it was 2.8 for hospitalizations and 2.0 for deaths. That is, for all measures, both AI/AN and Black persons experienced higher rates than their white counterparts. In addition, AI/AN had the highest ratios among all minorities for hospitalizations and deaths [10]. One study confirmed the disproportionate burden of cases on AI/AN and cited reliance on shared transportation and household size as among several factors that may have increased community transmission and, in turn, led to higher rates of hospitalizations and deaths [14]. In addition, Indian Health Services reports that AI/AN have consistently experienced a decreased health status compared to other Americans [16].
AI/AN disproportionately suffer from high blood pressure, chronic liver disease, obesity, diabetes, heart disease, and chronic lower respiratory disease, among others [16]. These all contribute to an increased likelihood of hospitalization or death from a COVID-19 infection.

In the linear model for cumulative hospitalizations, educational attainment was also negatively associated with hospitalizations. Counties with more adults who had achieved a bachelor's degree or higher had fewer COVID-19 hospitalizations. Current literature proposes two explanations. Low educational attainment (such as a high school degree or less) has been associated with adverse health effects such as coronary artery disease []. Scientists are now understanding the overall relationship between education and health, finding that more schooling is linked to better health and longer life [29]. Adults with more years of education have access to higher paying, more stable jobs where they can accumulate wealth to invest in their health. In addition, individuals with fewer years of education tend to have more chronic health conditions, smoke, have a less healthy diet, and lack adequate exercise [29]. All of these factors can culminate in someone who is more likely to require hospitalization due to COVID-19 based on known risk factors [9]. In addition, research has shown that the pandemic has disproportionately impacted those with lower educational levels [11].

Some limitations of this study come from the data itself. Many of the socioeconomic data sets were based on self-reported information, which may introduce bias into those numbers. In addition, South Dakota is a rural state with many sparsely populated counties; the median county population was 5,430.
This led to some disproportionate rates. Take, for example, the cumulative age-adjusted death rates in Jones and Buffalo Counties. Jones had a population near 900 and reported no deaths as of May 1st. Buffalo had a population near 1,962 and reported 13 deaths. Jones County had the lowest age-adjusted death rate while Buffalo County had the highest. Because some counties are so small, a small difference in counts can lead to a big change in the adjusted rates.

This study used SIR and GLM models to compare age-adjusted COVID-19 hospitalization and death rates among SD counties before and after adjusting for demographic and socioeconomic factors. To account for differences in age and population among the counties, age-adjusted rates were used. For further research we suggest looking into COVID-19 positivity rates and hospitalization rates to determine whether certain counties were under-testing their population. This would help identify whether there were likely undocumented cases in the state. In addition, time series models could be developed and used to predict future levels of the pandemic, and a more flexible model that accounts for the heterogeneity among the counties of the state could be considered. Other mobility or social media related data could be incorporated to capture more variability in the response.

whom poverty status was determined. The poverty threshold is adjusted based on family size and composition. It is not adjusted geographically, but the thresholds are updated for inflation using the Consumer Price Index. If a family's combined pretax income does not meet the threshold, then the whole family is considered in poverty [6]. Median income represented the median income of households in the past 12 months in 2019 inflation-adjusted dollars [4]. [7].
For an individual to be insured, they must have been currently covered by insurance from a current or former employer or union, had insurance purchased directly from an insurance company, had Medicare, been on any kind of government-assistance plan for those with low income or disability, or had military health care, VA health care, and/or Indian Health Services (IHS) coverage. However, those covered only by IHS were considered uninsured because IHS was not comprehensive coverage [7]. This study specifically looked at the percentage of uninsured individuals living in each county.

(5) The rate of cardiovascular hospitalizations was from the Centers for Disease Control and Prevention's Division for Heart Disease and Stroke Prevention. These values represented the age-adjusted rate per 1,000 from 2016-2018 of county residents aged 65+ for whom a cardiovascular disease was the principal (or first-listed) diagnosis upon admission to the hospital [8].

(6) The United States Diabetes Surveillance System, under the guidance of the Division of Diabetes Translation, CDC, asked adults if "a doctor has ever told you that you have diabetes" to calculate the percentage of adults who had diabetes per county. They also determined the percentage of adults who were obese, that is, whose body mass index (BMI) was over 30 based on self-reported height and weight. The percentage of physically inactive adults was calculated from the number of adults who answered that they had not participated in physical activity in the last month [25].

(7) The rate of active doctors, including MDs and DOs, per county from 2019-2020 was provided by the Health Resources and Services Administration via the Area Health Resource Files [15].

The graphical summaries of some of the socioeconomic factors considered in this paper are presented below.
WHO declares COVID-19 a pandemic
Covid-19)-associated hospitalization: Covid-19-associated hospitalization surveillance network and behavioral risk factor surveillance system
Novel Coronavirus (COVID-19) Updates and Information
Median Income in the Past 12 Months, ACS 5-Year Estimates
American Community Survey.
Small Area Health Insurance Estimates
Centers for Disease Control and Prevention. Interactive Atlas for Heart Disease and Stroke
Assessing Risk Factors for Severe COVID-19 Illness
Introduction to COVID-19 Racial and Ethnic Health Disparities
The Unequal Impact of COVID-19: Why Education Matters
Regularization paths for generalized linear models via coordinate descent
The elements of statistical learning: Data mining, inference, and prediction
COVID-19 among American Indian and Alaska Native Persons - 23 states
Health Resources Services Administration. Area Health Resource Files
Indian Health Services: The Federal Health Program for American Indians and Alaska Natives
National Cancer Institute: Surveillance, Epidemiology, and End Results Program (SEER) Standardized Incidence Ratio and Confidence Limits
R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing
RStudio: Integrated Development Environment for R. RStudio, PBC
South Dakota Department of Health. COVID-19 data on confirmed cases, hospitalizations, deaths, and tests in South Dakota
South Dakota Department of Health. COVID-19 testing information in South Dakota
Covid-19 outbreak among employees at a meat processing facility - South Dakota
ggplot2: Elegant Graphics for Data Analysis
Welcome to the tidyverse
Generalized additive models: an introduction with R
The relationship between education and health: Reducing disparities through a contextual approach

Acknowledgement: We acknowledge the support of the South Dakota Department of Health in providing the COVID-19 data. This work is partially supported by the South Dakota State University's Presidential Research Project (PREP-21, SDSU). Additional expertise was provided by Bonny Specker, PhD, Professor Emerita from South Dakota State University. The contents are solely the responsibility of the authors. No copyrighted figures, surveys, instruments, or tools were used in this study.

(1) The US Census provided demographic information on population estimates broken down by race and age groups in 2019. Due to South Dakota's predominantly white population (with the largest minority being Native American or Alaska Native at 9 percent), the percentage of nonwhite individuals overall was calculated instead of being broken down by racial groups. The age group demographics were necessary to compute age-adjusted rates of COVID-19 cases, hospitalizations, and deaths [24].

(2) The American Community Survey via the United States Census Bureau provided information on educational attainment percentages, poverty percentages, and median income per county in SD in 2019 [4] [5] [6]. The American Community Survey is an ongoing survey that gathers information from the United States public for government and public use. For this study, educational attainment was measured as the percentage of a county's residents, aged 25 and older, who had completed a bachelor's degree, master's degree, doctoral degree, or professional degree [5]. The poverty statistic measured the percentage of county residents for