key: cord-287027-ahoo6j3o
authors: Lai, Yuan; Charpignon, Marie-Laure; Ebner, Daniel K.; Celi, Leo Anthony
title: Unsupervised Learning for County-Level Typological Classification for COVID-19 Research
date: 2020-08-30
journal: Intelligence-based medicine
DOI: 10.1016/j.ibmed.2020.100002
sha: 
doc_id: 287027
cord_uid: ahoo6j3o

The analysis of county-level COVID-19 pandemic data faces computational and analytic challenges, particularly when considering the heterogeneity of data sources with variation in geographic, demographic, and socioeconomic factors between counties. This study presents a method to join relevant data from different sources to investigate underlying typological effects and disparities across typologies. Both consistencies within and variations between urban and non-urban counties are demonstrated. When different county types were stratified by age group distribution, this method identifies significant community mobility differences occurring before, during, and after the shutdown. Counties with a larger proportion of young adults (age 20-24) have higher baseline mobility and had the least mobility reduction during the lockdown.

The COVID-19 pandemic has showcased the need for a multidisciplinary exploration, interpretation, and presentation of data. In comparison with the SARS-CoV-1 outbreak from 2002 to 2004, advances in cloud storage, analytic infrastructure, and platforms for dissemination of information have dramatically expanded the data resources available for studying virus transmission in communities, as well as the interplay between individual and geographical factors, including the socio-political landscape. Policy experts increasingly seek to leverage data, machine learning, and cloud computing in their response strategies. Unfortunately, data heterogeneity, a dearth of data standards, and poorly interoperable data-sharing platforms complicate the quality and availability of analyzable data, marring both data value and methodological reproducibility.

The New York Times (TNYT) developed a live data repository with daily county-level coronavirus cases and deaths (TNYT, 2020). The county has emerged as the primary geographical level of analysis: it is self-contained for reporting purposes and is also responsible for executing the epidemic policy response. Moreover, disaster funding is allocated at the county level. Analyzing data at the county level poses significant benchmarking challenges: counties differ fundamentally in geographic, demographic, political, and socioeconomic characteristics, which leads to distinct epidemiological trajectories that go uncaptured in a static pooled analysis. In response to this, the U.S. Centers for Disease Control and Prevention (CDC) created a Social Vulnerability Index (SVI) in 2011 aimed at quantifying the resilience of communities to disasters and disease outbreaks (CDC, 2011), an index that has been expanded throughout this pandemic. Based on these indicators, the CDC has identified 220 "most vulnerable" counties and other jurisdictions that are at highest risk for outbreaks, with consequent impact on federal resource distribution, aid, and policy.

However, without a deep understanding of the underlying variation across counties and states, modeling is prone to error, bias, and flawed interpretation, with deleterious downstream effects on the ability of a community, and the nation, to respond to this crisis. A recent paper from Bosancianu and colleagues (Bosancianu, 2020) found that a county's political leaning, social structures, and local government effectiveness also explain, in part, COVID-19 mortality. These findings cannot be explained solely by the urban/rural divide or by racial and ethnic disparities between counties (Bassett et al., 2020; Chen and Krieger, 2020). County-level analysis has similarly demonstrated a link between political beliefs and compliance with social distancing (Painter and Qiu, 2020), as well as connections between COVID-19 transmission and air pollution and other factors (Wu et al., 2020). A robust analytical system capable of identifying granular patterns and trends, tracking county-level case incidence, mortality, and excess mortality (CDC, 2020), and thereby disentangling causal, mitigative, and correlative effects (Knittel and Ozaltun, 2020) is critical for healthcare resource allocation during this and future pandemics.

This project introduces a methodology to specifically address the computational and analytical challenges of aggregating county-level heterogeneous data sources for research. This captures the first steps necessary to reliably frame and analyze county-level data, including incorporation of higher resolution, individual-level data in analysis. The purpose of this study is to summarize publicly available and relevant COVID-19 data sources, to address the benchmarking challenge from the data heterogeneity through clustering, and to classify counties based on their underlying variations. Through these methodologies, greater understanding of the spread of COVID-19 and future pandemics may be attained, leading to better data-driven policies.

We represent socioeconomic characteristics by integrating multiple county-level data sources (Table S1). These include baseline measures from population census data, geographical information systems data, business pattern censuses, and other sources that report relatively time-invariant variables. Spatial data were collected by quantifying geographical attributes per county and integrating them with the other datasets. County land area was computed from county geometry in TIGER/Line Shapefiles and then used to estimate county-level population density (1000 people per square km). The CDC publishes spatial data representing the boundaries of the top 500 cities ranked by population. Using spatial geometry, the intersection of county and city borders was evaluated to approximate the total urban area per county. Counties in which this urban area exceeded 25% were classified as "urban", while the rest were classified as "non-urban".
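The density and urban/non-urban rules above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the 25% cutoff is applied to the urban share of county land area, and the `County` fields, FIPS codes, and all numbers are hypothetical.

```python
from dataclasses import dataclass

URBAN_SHARE_THRESHOLD = 0.25  # the 25% cutoff stated in the text


@dataclass
class County:
    fips: str              # hypothetical county identifier
    population: int
    land_area_km2: float   # e.g. derived from county geometry
    urban_area_km2: float  # area overlapping city boundaries (assumed precomputed)


def population_density(c: County) -> float:
    """Population density in thousands of people per square km."""
    return c.population / 1000.0 / c.land_area_km2


def urban_label(c: County) -> str:
    """Assumption: 'urban' means the urban share of land area exceeds 25%."""
    share = c.urban_area_km2 / c.land_area_km2
    return "urban" if share > URBAN_SHARE_THRESHOLD else "non-urban"


# Made-up example counties
counties = [
    County("00001", 500_000, 1_000.0, 400.0),
    County("00002", 20_000, 2_000.0, 10.0),
]
labels = {c.fips: urban_label(c) for c in counties}
```

In practice the overlap area would come from a spatial intersection of county and city polygons; here it is taken as given to keep the sketch free of GIS dependencies.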

We calculated county-level total population, gender-, race-, and age group distribution using 2018 population estimates. Using data reported from the Small-Area Life Expectancy Estimates Project (USALEEP), county-level average life expectancy was estimated as a proxy for local quality-of-life differences (USALEEP, 2015). Further, education was represented as the percentage of adults with a bachelor's degree or higher (2014-2018) as reported by the U.S. Census Bureau.

We further aggregated the age groups † and computed underlying typologies using clustering techniques. K-means clustering is an unsupervised machine learning method that partitions observations into k groups (as clusters) based on their distance to the group means (as clusters' centroids) (Lloyd, 1982). It is one of the most common non-hierarchical clustering methods (Steinley, 2006). We first identified the optimal number of clusters, denoted by k, by computing the silhouette score in line with Lloyd et al., and then generated categorical variables as typology indicating different age distributions across counties.
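The selection of k by silhouette score followed by K-means labeling can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the deterministic farthest-point seeding is chosen here only for reproducibility, and the two-column age-share vectors are made-up stand-ins for the nine age-group shares per county.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_init(points, k):
    # Deterministic seeding (an assumption made for reproducibility)
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd-style iterations: assign to nearest centroid, then recompute means."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels


def silhouette(points, labels):
    """Mean silhouette score; singleton clusters contribute 0 by convention."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(dist(p, q) for q, lab in zip(points, labels) if lab == c) / labels.count(c)
                for c in clusters if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical (share aged 20-29, share aged 60+) per county
points = [(0.30, 0.10), (0.32, 0.09), (0.28, 0.11),   # younger counties
          (0.08, 0.35), (0.09, 0.33), (0.07, 0.36),   # older counties
          (0.13, 0.22), (0.14, 0.21), (0.12, 0.23)]   # "typical" counties
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
typology = kmeans(points, best_k)  # categorical typology label per county
```

A production analysis would more likely use an established library with multiple random restarts; the hand-rolled version is only meant to make the two steps explicit.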

Recent studies identify the importance of the timing of COVID-19 spread in different counties (Bialek, et al., 2020). A core analytical challenge is how to take these varying timelines into account when comparing virus transmission across different counties. COVID-19 case and death data were collected from the TNYT GitHub repository, which reports the county-level cumulative counts daily. Multiple measures were then quantified at the county-level, including: (Jia, et al., 2020). Finally, the slope of the growth in death rate over time was estimated via a linear fit for each county.

† Age group 1 = Age 0-9, group 2 = Age 10-19, group 3 = Age 20-29, group 4 = Age 30-39, group 5 = Age 40-49, group 6 = Age 50-59, group 7 = Age 60-69, group 8 = Age 70-79, group 9 = Age 80 and above.
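The per-county linear fit of death rate over time can be sketched as an ordinary least-squares slope. The details here are assumptions for illustration: the series is aligned to start on the day of the first reported death, the rate is expressed per 100,000 residents, and the counts are made up.

```python
def death_rate_series(cum_deaths, population, per=100_000):
    """Cumulative deaths per `per` residents, starting at the first local death.

    Assumption: days before the first reported death are dropped so that
    counties are compared on a common 'days since first death' timeline.
    """
    first = next((i for i, d in enumerate(cum_deaths) if d > 0), None)
    if first is None:
        return []
    return [d / population * per for d in cum_deaths[first:]]


def ols_slope(ys):
    """Least-squares slope of ys against the day index 0..n-1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations for a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx


# Hypothetical county: daily cumulative death counts and population
cum_deaths = [0, 0, 1, 2, 3, 4]
rates = death_rate_series(cum_deaths, population=100_000)
slope = ols_slope(rates)  # growth in death rate per day
```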

Human mobility was evaluated as both a dependent and an independent variable during the pandemic, with particular emphasis on how mobility changed in response to local policy and how it affected outbreak trajectory. County-level mobility change was quantified using exposure indices derived from PlaceIQ Movement Data based on mobile phone data (PlaceIQ, 2020). The county-level device exposure index (DEX) is a proxy for local human mobility, which reports the county-level average spatial-temporal co-existence of unique mobile devices. This index measures daily average exposure to other people and/or crowds, reflecting local social distancing policy and compliance. DEX measures the absolute change of mobility density, demonstrating both weekly patterns and county-level variations. To generate a less noisy measure that is comparable across counties, the raw county-level DEX time series were normalized.
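The text does not specify the normalization used for the DEX series. One plausible sketch, offered purely as an assumption, smooths the weekly pattern with a trailing moving average and then rescales each county's series to the range [0, 1]; the function names and sample values are hypothetical.

```python
def rolling_mean(values, window=7):
    """Trailing moving average; early days use the shorter available window."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def min_max(values):
    """Rescale to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def normalize_dex(raw, window=7):
    # Assumed pipeline: smooth first, then scale per county
    return min_max(rolling_mean(raw, window))


# Made-up raw DEX values for one county (short window for the toy series)
raw = [100, 120, 80, 60, 40, 50, 70]
normalized = normalize_dex(raw, window=3)
```

Scaling each county to its own minimum and maximum makes trajectories comparable in shape but discards differences in absolute exposure, so baseline comparisons would need the raw or differently scaled values.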

The mechanism with which urbanization impacts vulnerability to a pandemic and the subsequent health outcomes is not fully elucidated. Between the correlation matrices for urban and non-urban environments, consistency is seen but with subtle variation (Figure 1). Both matrices reveal a correlation between some baseline measures: counties with higher educational attainment have higher income levels and life expectancy. Race and sex have a weaker correlation with income, unemployment, and education in urban areas compared to non-urban areas. When looking at the correlations between baseline measures and pandemic outcome measures, counties with a comparatively larger population, higher income and education attainment, and/or life expectancy had the earliest cases. Consistent correlations were observed between case rate and population, density, unemployment, income, and education. (Colorado), Florida, and Gulf Coast. Evaluation of these geographical patterns suggests that urban areas may not be the "epicenters" but rather the "vanguards" of pandemic spread (Angel, et al., 2020). Figure 3a and 3b reveal the disparities between urban and non-urban counties in terms of variation in death rate over time, as well as in number of days from the first local death.

Notably, non-urban counties have steeper slopes than urban counties, are hit later in the total pandemic timeline, and experience death rates higher than in urban areas. Figure 3c bins the counties by death rate slope, highlighting that most counties are classified as non-urban areas, and that these had a long-tail distribution of death rate growth slope as compared to urban counties. Figure 3d compares the density curves of the two county types, demonstrating the more dispersed death rate slope variations in non-urban counties. 

The K-means clustering algorithm assigns all counties to three groups using age group distribution typology. As Figure 4 indicates, Type A (in red) represents counties with a predominantly young population, defined as in their 20s. Type B (in blue) represents counties with more older adults (age >= 60). Type C (in green) represents most counties, which contain relatively "typical" age patterns. Compared with traditional analytical methods, this method highlights dynamic patterns in county-level age distribution differences.

We identify three phases for each county according to its normalized human mobility changes (Figure 5). Phase one occurred prior to March 2020, during which most counties experienced increasing mobility density. Phase two occurred in March, when most counties witnessed drastically reduced local mobility density, reaching a nadir in early April. Finally, phase three began in early April, marking a slow return to pre-pandemic mobility. Counties with different age group distributions demonstrate various mobility changes before, during, and after the U.S. Federal Government announced the national emergency on March 13th. Counties with a largely young population (Type A in red) saw less mobility reduction (Figure 6). During the "shelter-in-place" policy implementation period, in which most places experienced a drastic decline in mobility, these counties had the smallest drop in mobility compared to other counties (in green and blue). Furthermore, in the third phase, as businesses started reopening, these counties demonstrated the largest return of mobility.

Figure 5. Normalized county-level human mobility changes. The group average changes (defined by the age pattern typology) are in bold-dash lines colored accordingly. Two vertical lines represent the median dates when counties experienced maximum and minimum human mobility.

Figure 6. Box plot of local mobility change grouped by age pattern type and time period (before, during, and after shutdown).
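The phase boundaries described above, marked in Figure 5 by the median dates of maximum and minimum mobility, can be sketched as follows. This is an assumption-laden illustration: the nadir is searched only after each county's peak so that the phases stay ordered, `median_low` is used so that the boundary falls on an observed day, and the three short series are made up. The start date follows the study period stated in the text.

```python
from datetime import date, timedelta
from statistics import median_low

START = date(2020, 1, 22)  # first day of the study period


def peak_and_nadir(series):
    """Day index of maximum mobility and of the later minimum."""
    peak = max(range(len(series)), key=series.__getitem__)
    nadir = min(range(peak, len(series)), key=series.__getitem__)
    return peak, nadir


def phase(day, peak_day, nadir_day):
    """1 = before the peak, 2 = decline to the nadir, 3 = recovery."""
    if day < peak_day:
        return 1
    return 2 if day < nadir_day else 3


# Hypothetical normalized mobility series for three counties
mobility = {
    "county_a": [0.6, 0.9, 1.0, 0.4, 0.0, 0.3],
    "county_b": [0.7, 1.0, 0.8, 0.0, 0.2, 0.5],
    "county_c": [0.5, 0.8, 1.0, 0.5, 0.0, 0.4],
}
extremes = [peak_and_nadir(s) for s in mobility.values()]
peak_day = median_low(p for p, _ in extremes)
nadir_day = median_low(n for _, n in extremes)
peak_date = START + timedelta(days=peak_day)
nadir_date = START + timedelta(days=nadir_day)
phases = [phase(d, peak_day, nadir_day) for d in range(6)]
```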

This study contributes to both data integration and analytical methods that are critical for pandemic research. Analyzing demographic, geographical, and socioeconomic characteristics can inform the local public health response and decision-making (Lai et al. 2020). However, such comprehensive insights require multi-disciplinary and long-term efforts to collect, integrate, and analyze data from heterogeneous sources. Limitations of data sources and quality muddy analysis and interpretation, since representativeness and quality depend on particular sources and collection methods. Such data variations bring challenges for integrating heterogeneous data relevant to this pandemic. For example, county-level demographic and socioeconomic censuses provide long-term baseline measures, but often lack high temporal frequency and spatial granularity. Mobile phone data, as another example, provide a nearly real-time digital representation of human mobility at high spatiotemporal granularity, but suffer from noise and underlying sampling bias. That said, our study extends the exploration of information sources and integration methods, considering there is no central source for all available data.

This study demonstrates the clustering technique using health-related data for pandemic research. Identifying the underlying county typology provides critical value in comparing health outcomes across counties (Wallace, Sharfstein, Kaminsky, & Lessler, 2019). Recent systematic review of K-means clustering in air pollution epidemiology-related literature has demonstrated significant utility for typology discovery and knowledge mining (Colin, Jabbar, & Osornio-Vargas, 2017). Further, K-means clustering is widely used for population segmentation analysis, classifying underlying subgroups with an eye toward evaluating specific healthcare demands and policy interventions (Shi, Kwan, Tan, Thumboo, & Low, 2018). Particularly at the county-level, previous studies have implemented clustering techniques to analyze various data sources relating to demographic, geographic, environment, and socioeconomic determinants of health and disease. Two use case applications of clustering include discovery of underlying patterns based on high-dimensional data (Cossman, et al., 2007; Chi, Grigsby-Toussaint, & Choi, 2013) and prediction of counterfactuals for population health policy intervention (Strutz, et al., 2020).

University on March 19th, 2020, people over 60 and those with chronic health conditions are at the highest risk for COVID-19 complications (Sharfstein, 2020). Though this simple measure evaluates the percentage of the population aged 60 and above, it may fail to capture more dynamic county-level age distribution differences. Clustering techniques may identify underlying county types defined by age group distributions. In the future, we plan to scale up the clustering method by integrating more variables to identify county typology at higher dimensions.

There is no singular source of human mobility data. Multiple digital product vendors, data brokers, and research institutes have published mobility data or processed metrics, including PlaceIQ, Safe Graph, Descartes Labs, Apple Mobility Trends Report, and Google Community Mobility Reports (PlaceIQ, 2020; Safe Graph, 2020; Descartes Labs, 2020; Apple, 2020; Google, 2020). Product provider-generated mobility measures, such as data shared by Apple and Google, are limited to data collected by their own digital product line (e.g., Google Maps or Apple Maps), customer segments, and user-product interactions. The DEX index from PlaceIQ data only represents a fraction of the actual population as samples. Even though such data sampling processes are randomly conducted for estimating human mobility, understanding sampling biases, population representativeness, and the resulting accuracy requires a more in-depth investigation, possibly with other human mobility-related data from different sources as validation. Moreover, integration of data between multiple sources is complicated by vendor-specific methods for data reporting, collecting, sharing, sampling, aggregation, and quantification. Further opportunities exist with regard to integration of mobility data with specific events, such as elections or protests (Cotti et al. 2020). The human mobility data presented here may not fully reflect the compliance (or lack thereof) with local stay-at-home orders and the effects of social distancing (Gao et al. 2020).

This study only evaluated data from January 22nd to May 15th. The results and interpretations only represent this specific period and may not necessarily translate to future resurgence of the pandemic. Because data are updated daily on the TNYT and PlaceIQ data portals, the descriptive summary, clustering results, and death growth rates change with each update. This raises questions on the trade-off between timeliness and accuracy, which is a core challenge in real-time or near real-time data analysis. We excluded New York City (NYC) from this analysis. We believe it would be more appropriate to examine NYC in a separate study for several reasons. TNYT's data reports NYC differently by treating it as one entity without specific counties including New York County (Manhattan), Kings County (Brooklyn), Bronx

This study presents integration of various data sources to investigate the drivers of the community spread of COVID-19 based on county typologies. Both similarities and variations between urban and non-urban counties are demonstrated by the methodology. While previous findings reveal possible geographical clusters of COVID-19 cases at the county-level, our study indicates this is from the underlying typology based on high-dimensional variables. Counties vary by geographic, demographic, and socioeconomic characteristics, with associated collective behavior during a pandemic.

COVID-19 has accelerated data sharing at scale to crowdsource knowledge generation that can inform national and international policy. We showcased a method for data integration to investigate the spread of the pandemic in the United States. The dissonance in presentation between urban and non-urban areas was highlighted, as well as the impact of population age and mobility during the lockdown. Just as policy occurs at levels from local to (inter)national, so too must data analysis: this study is a first step toward that end.

LAC is funded by the National Institutes of Health through NIBIB R01 EB017205.

YL led the data analysis and the drafting of the manuscript. All the authors discussed the interpretation of the findings and contributed to the writing.

American Communities Project (ACP). 2020. Retrieved from ACP

Apple Mobility Trends Report

The Coronavirus and the Cities: Variations in the Onset of Infection and in the Number of Reported Cases and Deaths

The unequal toll of COVID-19 mortality by age in the United States: Quantifying racial/ethnic disparities

Geographic differences in COVID-19 cases, deaths, and incidence -United States

Centers for Disease Control and Prevention (CDC). 2011. Social Vulnerability Index (SVI). Retrieved from CDC

The relationship between in-person voting

Revealing the unequal burden of COVID-19 by income, race/ethnicity, and household crowding: US county vs ZIP code analyses

Can geographically weighted regression improve our contextual understanding of obesity in the US? Findings from the USDA Food Atlas

A systematic review of data mining and machine learning for air pollution epidemiology

Persistent clusters of mortality in the United States

Descartes Labs. 2020. Retrieved from Descartes Labs

Mapping county-level mobility pattern changes in the United States in response to COVID-19

Google Community Mobility Reports

Probability of current COVID-19 outbreaks in all US counties. The University of Texas at Austin Technical Report

Population flow drives spatio-temporal distribution of COVID-19 in China

2020. COVID-19 United States Cases by County Dashboard

What does and does not correlate with COVID-19 death rates. medRxiv

Urban Intelligence for Pandemic Response. JMIR public health and surveillance

Least squares quantization in PCM

Political beliefs affect compliance with covid-19 social distancing orders

Exposure indices derived from PlaceIQ movement data

Contextualizing covid-19 spread: a county-level analysis, urban versus rural, and implications for preparing for the next wave. medRxiv.

Sharfstein, J. 2020. COVID-19 Situation Report & Public Health Guidance

The Atlantic. 2020. The COVID Tracking Project

The New York Times COVID-19 data

A systematic review of the clinical application of data-driven population segmentation analysis

Safe Graph. 2020. Social distancing metrics

K means clustering: a half century synthesis

Determining county-level counterfactuals for evaluation of population health interventions: A novel application of K-means cluster analysis

Retrieved from Surgo Foundation: https://precisionforcovid.org/ccvi

University of California San Francisco (UCSF). 2020. COVID-19 County Tracker

small-area life expectancy estimates project -USALEEP

Comparison of US county-level public health performance rankings with county cluster and national rankings: assessment based on prevalence rates of smoking and obesity and motor vehicle crash death rates

Exposure to air pollution and COVID-19 mortality in the United States

American Community Survey 5-year average county

None declared.

• This study presents a method to join relevant data from different sources to investigate underlying typological effects and disparities across typologies. Both consistencies within and variations between urban and non-urban counties are demonstrated.

• Significant community mobility differences occurring before, during, and after the shutdown, based on the types of age group distribution. Counties with a larger proportion of young adults have higher baseline mobility and the least mobility reduction during the lockdown.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.