key: cord-0211917-rndc2v2b authors: Killeen, Benjamin D.; Wu, Jie Ying; Shah, Kinjal; Zapaishchykova, Anna; Nikutta, Philipp; Tamhane, Aniruddha; Chakraborty, Shreya; Wei, Jinchi; Gao, Tiger; Thies, Mareike; Unberath, Mathias title: A County-level Dataset for Informing the United States' Response to COVID-19 date: 2020-04-01 journal: nan DOI: nan sha: 685c08621500a142543323739e043499414edf6c doc_id: 211917 cord_uid: rndc2v2b

As the coronavirus disease 2019 (COVID-19) becomes a global pandemic, policy makers must enact interventions to stop its spread. Data-driven approaches might supply information to support the implementation of mitigation and suppression strategies. To facilitate research in this direction, we present a machine-readable dataset that aggregates relevant data from governmental, journalistic, and academic sources on the county level. In addition to county-level time-series data from the JHU CSSE COVID-19 Dashboard, our dataset contains more than 300 variables that summarize population estimates, demographics, ethnicity, housing, education, employment and income, climate, transit scores, and healthcare system-related metrics. Furthermore, we present aggregated out-of-home activity information for various points of interest for each county, including grocery stores and hospitals, summarizing data from SafeGraph. By collecting these data, as well as providing tools to read them, we hope to aid researchers investigating how the disease spreads and which communities are best able to accommodate stay-at-home mitigation efforts. Our dataset and associated code are available at https://github.com/JieYingWu/COVID-19_US_County-level_Summaries.

COVID-19 has had a devastating impact on the United States' health care system, economy, and social wellbeing.
Despite early promises of an "American Resurrection" by April 12, 2020 [4], social distancing measures remain in effect through the month of April, and many scientists and public health experts speculate they may last much longer. As of the time of writing, restrictions in Hubei province, China, where the disease originated in December 2019, are only now gradually being lifted [5]. Confirmed COVID-19 cases, hospitalizations, and, unfortunately, deaths are increasing. However, no stimulus can offset the effects of an indefinite quarantine. Determining when and how to roll back non-pharmaceutical interventions in a manner that is safe and responsible is of the utmost importance. The initial quarantine period is necessary to avoid overwhelming our hospital systems. After this, we must balance reducing the risk of spread against the adverse economic consequences of millions of furloughed and unemployed people.

Please direct inquiries to Benjamin D. Killeen, Jie Ying Wu, and Mathias Unberath. An earlier version of this article appeared at https://link.medium.com/N2azyHrq94.

To inform this process, we have curated a machine-readable dataset that aggregates data from governmental, journalistic, and academic sources on the county level. While most of these sources are freely available, aligning them and putting them in a standard format that enables analysis requires significant work. In addition to time-series data from [1], which details COVID-19 per-county infections and deaths, our dataset contains more than 300 variables that summarize population estimates, demographics, ethnicity, housing, education, employment and income, climate, transit scores, and healthcare system-related metrics. Further, we source a significant number of journal articles detailing implementation dates of interventions, including stay-at-home orders, school closures, and restaurant and entertainment venue closures [10]-[48].
Finally, we aggregate out-of-home activity data from [2] in each county, possibly measuring compliance with the aforementioned restrictions. Fig. 1 shows a sample of out-of-home activity for selected counties. We hope that this dataset proves to be a useful resource to the community, facilitating important research on epidemiological forecasting. In particular, a machine learning approach to identify highly relevant factors may inform a graduated rollback of isolation measures and travel restrictions.

Because of the rapidly evolving nature of the COVID-19 pandemic, the response from the data science community is ongoing and in flux. Here, we review some related efforts available at the time of this writing. As new articles are published every day, this is by no means an exhaustive review. Despite significant public interest, government agencies have yet to publish a county-level data source for cases of COVID-19. As such, [1], [63], [64] constitute the most up-to-date and reputable collections of COVID-19 cases across the United States, hosted by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, the New York Times, and crowdsourced individuals, respectively. These efforts focus on current, hard data gathered from local government publications and reputable journalistic sources. Other efforts focus on gleaning related information from a variety of sources, including social media. [65] tracks COVID-19 related tweets in an effort to understand the conversation and possible misinformation surrounding the pandemic. Johns Hopkins University, University of Maryland, and George Washington University have also started a collaboration to track COVID-19 through social media [66]. In just the last week, people have launched 2290 Kaggle challenges related to the virus. Much recent work has focused on using machine learning and data science tools to understand the virus.
[69], [70] focus on understanding the current pandemic in its early stages, compensating for the inherent uncertainty of a novel disease.

We describe the structure of our dataset, which includes each component in its raw form as well as a narrowed-down, machine-readable form conducive to a machine-learning approach. Table I summarizes the sources and availability for each type of data, and a full description of each variable can be found in our repository.

We populate a CSV file with 348 variables for 3220 county-equivalent areas (as well as the fifty states, District of Columbia, and the whole United States) with numerous types of data, including population, education, economic, climate, housing, health care capacity, public transit, and crime statistics. Each area is uniquely identified by its Federal Information Processing Standard (FIPS) code, a five-digit number where the first two digits designate the state, and the last three digits describe the county-equivalent. Our sources include the United States Census Bureau [49], [50], [55], [56], the United States Department of Agriculture (USDA) Economic Research Service [51], [52], the National Oceanic and Atmospheric Administration (NOAA) [53], the Association of American Medical Colleges (AAMC) [57], the Henry J. Kaiser Family Foundation (KFF) [3], [58], [59], the Center for Neighborhood Technology (CNT) [61], and the Bureau of Justice Statistics, Department of Justice (DOJ) [62]. Perhaps most relevant to the ongoing effort to mitigate the effects of COVID-19 in the U.S. is county-level healthcare system capacity. The dataset includes detailed counts for each type of medical practitioner as well as the number of Intensive Care Unit beds in each county, shown in Fig. 2. For the most part, these basic descriptive variables are unaltered from their original state. Where appropriate, missing values have been imputed with the state-wide average, as detailed in Table I.
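The FIPS structure and the state-wide imputation described above can be illustrated with a minimal sketch. It is not the repository's code: the function names are hypothetical, the example values are invented for illustration, and "state-wide average" is assumed here to mean the unweighted mean over counties in the same state that have an observed value.

```python
from collections import defaultdict
from statistics import mean


def split_fips(fips):
    """Split a five-digit FIPS code into (state, county) parts."""
    code = str(fips).zfill(5)  # restore leading zeros lost if read as an integer
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a five-digit FIPS code: {fips!r}")
    return code[:2], code[2:]


def impute_with_state_mean(values):
    """Fill None entries with the mean of observed counties in the same state.

    `values` maps FIPS code -> number or None. Entries stay None when the
    state has no observed value at all.
    """
    by_state = defaultdict(list)
    for fips, value in values.items():
        if value is not None:
            by_state[split_fips(fips)[0]].append(value)

    filled = {}
    for fips, value in values.items():
        state = split_fips(fips)[0]
        if value is None and by_state[state]:
            value = mean(by_state[state])
        filled[fips] = value
    return filled


# Hypothetical values for illustration only; not taken from the dataset.
icu_beds = {"53033": 400, "53061": 100, "53077": None, "01001": 6}
filled = impute_with_state_mean(icu_beds)
```

Reading FIPS codes as strings (or zero-padding them as above) matters because states such as Alabama have codes beginning with 0, which are silently shortened when a CSV column is parsed as integers.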
Our dataset describes mitigation efforts taken at the state level, including stay-at-home advisories, banning large gatherings, public school closures, and restaurant and entertainment venue closures. For machine readability, we provide each date of implementation as a Gregorian ordinal, i.e. the integer number of days starting at January 1, Year 1 CE, consistent with standard software libraries. Moreover, these data are provided according to the same county-level row ordering as our county descriptor data (see Sec. III-A). Interventions made at the state level have been assigned to each county in that state, and we include county-level interventions wherever possible. An intervention is designated NA if the county or state has not yet enacted it.

We have aggregated point-of-interest location data gathered from users' smartphones to show out-of-home activity, using raw data from [2]. For privacy and IP reasons, our dataset does not include user location data in its raw form but rather in several time-series files summarizing county-level activity. Fig. 1 shows the time-series for selected counties which have a high incidence of COVID-19 cases. The decline in overall activity on March 12 corresponds to increased media attention and stay-at-home advisories in those areas. At the same time, a spike in grocery store visits points to a panic-buying spree which has since subsided.

Finally, we provide time-series data for the cumulative number of COVID-19 confirmed cases and related deaths, from [1]. These data begin on January 22, 2020. It should be noted that epidemiological modeling efforts may want to consider the uncertainty surrounding U.S. testing [71], on which these data are based. At the time of this writing, efforts to improve the availability of COVID-19 tests are ongoing, but the current strategies prioritize patients with severe symptoms.
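As an aside on the intervention dates described in this section, the Gregorian ordinal encoding can be decoded with Python's standard `datetime` module, whose `date.toordinal` and `date.fromordinal` count January 1 of year 1 as day 1. The sketch below is an assumption about how one might read such a column, not the authors' code: the helper name is hypothetical, and the handling of `NA` strings and float-typed cells is a guess at how the CSV may be parsed.

```python
from datetime import date


def ordinal_to_date(value):
    """Decode a Gregorian ordinal; return None for missing ('NA') entries."""
    if value is None or str(value).strip().upper() in {"", "NA", "NAN"}:
        return None
    # Cells may arrive as floats or numeric strings depending on the CSV reader.
    return date.fromordinal(int(float(value)))


# Hypothetical intervention date, encoded and decoded again.
encoded = date(2020, 3, 23).toordinal()
decoded = ordinal_to_date(encoded)

# Days elapsed between an intervention and a later observation date.
days_since = date(2020, 4, 1).toordinal() - encoded
```

Because ordinals are plain integers, differences between them are day counts, which makes it straightforward to align intervention dates with the daily case time series.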
Thus, modeling efforts may wish to take into account random subsampling of the true population, where untested individuals still spread the virus. This is especially true given that nearly half of all COVID-19 infections may be asymptomatic [72]. Fig. 3 shows the infections in King County, WA collected by [1]. King County had early confirmed cases of the virus, and the exponential curve illustrates the rapid growth currently taking place there as a result.

This extreme growth reinforces the need for constant vigilance and continued intervention efforts everywhere. Although the curve of measured infections in Fig. 3 would flatten as the population approaches herd immunity, this would require almost everyone in the U.S. to be infected. If that happens too quickly, it will completely overwhelm our healthcare system. The number of individuals who will ultimately be infected, and the number of deaths that will result, depend on the interventions reinforced now. At the same time, the economic impact of these interventions, which is not evenly distributed across counties, cannot be ignored. It depends on the characteristics of each area, which differ greatly between, for example, New York and Silicon Valley. The former has a large population in the entertainment and service industries, which will need financial support during quarantine, whereas the latter is dominated by large tech firms, whose employees can adapt to working from home. By providing the socioeconomic attributes of each county, the spread of COVID-19 confirmed cases, and the ongoing response in a machine-readable format, we hope to inform the decisions made to most effectively protect each area.

ACKNOWLEDGMENT
Thank you to all our sources, especially the JHU CSSE COVID-19 Dashboard for making their data public and SafeGraph for providing researchers their data for COVID-19 related work.
Coronavirus: NY, NJ, CT coordinate restrictions on restaurants, limit events to fewer than 50 people
Adds 'Stay-at-Home' Order on Same Day as Maryland and Virginia
Greg Abbott closes bars, restaurants and schools as he anticipates tens of thousands could test positive for coronavirus
Reynolds issues state of public health disaster emergency, closing Iowa businesses
In Missouri, no dining-in at restaurants, groups of 10 or more banned amid coronavirus
Kentucky Derby postponed, restaurants restricted as state tries to control spread of virus
LIST: Here's how the state and each island is responding to coronavirus
LIVE UPDATES: Here's the latest on the coronavirus in Forsyth County and Georgia
Local bars, restaurants, gyms, theaters react to coronavirus
Maine bars, restaurants ordered to close to dine-in customers, coronavirus cases increase
Map: Coronavirus and School Closures - Education Week
Mayors scramble to know: Does Gov. Reeves' coronavirus declaration clash with local orders
Montana Extends School, Restaurant Closures 2 More Weeks
Nevada orders all casinos, bars, restaurants closed as U.S. coronavirus cases surge
New Hampshire bans dine-in restaurant meals until April 7
New restrictions for New Mexico restaurants and bars to begin Monday
Tucson order closures of bars, restaurants
Restaurants and Bars Shuttered Across the U.S. in Light of Coronavirus Pandemic
RI Restaurants Closed Amid Community Spread Of Coronavirus
See Which States and Cities Have Told Residents to Stay at Home
State bans restaurant dining as Alaska's confirmed coronavirus cases grow to 6
State to Restrict Bars and Restaurants Statewide Starting at 8PM
States order bars and restaurants to close due to coronavirus
Tennessee governor orders restaurants, bars closed except for takeout and delivery; gyms closed over coronavirus
These states have implemented stay-at-home orders. Here's what that means for you
Utah Orders Restaurants, Bars to Close All Dining to Curb Coronavirus
Virginia Restaurants and Bars Close for Dine-In Service to Help Curb Coronavirus
Which states have closed restaurants and bars due to coronavirus
Wyoming cancellations and closures caused by coronavirus
What You Need to Know About Trump's European Travel Ban
Population Estimates
Selected social characteristics in the united states: 2018 acs 1 year data estimate profiles
Poverty estimates for the U.S., States and counties
Unemployment and median household income for the U.S., states and counties
NOAA's Climate Divisional Database
Population and housing unit counts: 2010
Selected Social Characteristics in the United States: Households By Type
Annual County Resident Population Estimates by Age, Sex, Race, and Hispanic Origin
American Association of Medical Colleges
Professionally active primary care physicians by field
Professionally active specialist physicians by field
Kaiser Family Foundation
Alltransit performance score
United states crime rates by county
We're Sharing Coronavirus Case Data for Every U.S. County
The covid tracking project
COVID-19: The First Public Coronavirus Twitter Dataset
Estimation of the reproductive number of novel coronavirus (COVID-19) and the probable outbreak size on the Diamond Princess cruise ship: A data-driven analysis
AI-Driven Tools for Coronavirus Outbreak: Need of Active Learning and Cross-Population Train/Test Models on Multitudinal/Multimodal Data
Composite Monte Carlo Decision Making under High Uncertainty of Novel Coronavirus Epidemic Using Hybridized Deep Learning and Fuzzy Rule Induction
Finding an Accurate Early Forecasting Model from Small Dataset: A Case of 2019-nCoV Novel Coronavirus Outbreak
The US is severely under-testing for coronavirus as death toll and new cases rise
Coronavirus Cases Without Symptoms Spur Call for Wider Tests