key: cord-0845605-37dadupn
authors: DeCaprio, Dave; Gartner, Joseph A; Burgess, Thadeus; Kothari, Sarthak; Sayed, Shaayaan; McCall, Carol J
title: Building a COVID-19 Vulnerability Index
date: 2020-03-21
journal: nan
DOI: 10.1101/2020.03.16.20036723
sha: b5a2bc9046ae7ae60d3ac2d98720bf7049902698
doc_id: 845605
cord_uid: 37dadupn

COVID-19 is an acute respiratory disease that has been classified as a pandemic by the World Health Organization. Information about the disease is limited; however, it is known to have high mortality rates, particularly among individuals with preexisting medical conditions. Models that identify the individuals at greatest risk of severe complications from COVID-19 will help outreach campaigns mitigate the disease's worst effects. While information specific to COVID-19 is limited, a model using complications due to other upper respiratory infections can be used as a proxy to help identify those individuals who are at the greatest risk. We present results for three models predicting such complications, with the models offering increasing predictive effectiveness at the expense of ease of implementation.

i. COVID-19 Virus

Coronaviruses (CoV) are a large family of viruses that cause illness ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS-CoV) and Severe Acute Respiratory Syndrome (SARS-CoV). CoV are zoonotic, meaning they are transmitted between animals and people. Coronavirus disease (COVID-19) is caused by a new strain that was discovered in 2019 and had not been previously identified in humans [1].

COVID-19 is a respiratory infection whose common signs include respiratory symptoms, fever, cough, shortness of breath, and breathing difficulties. In more severe cases, infection can cause pneumonia, severe acute respiratory syndrome, kidney failure, and death.

ii. Flattening the Curve

On March 11, 2020, the World Health Organization (WHO) declared COVID-19 to be a pandemic [2]. In its press conference, the WHO was clear that "pandemic" is not a word it uses lightly or carelessly, or to cause unreasonable fear. It also highlighted that this is the first pandemic ever to be caused by a coronavirus and that all countries can still act to change its course.

Public health and healthcare experts agree that mitigation is required in order to slow the spread of COVID-19 and prevent the collapse of healthcare systems. On any given day, health systems in the United States run close to capacity [3], so every transmission that can be avoided and every case that can be prevented has an enormous impact.

iii. Identifying Vulnerable People

The risk of severe complications from COVID-19 is higher for certain vulnerable populations, particularly people who are elderly, frail, or have multiple chronic conditions. The risk of death has been difficult to calculate [4], but a small study [5] of people who contracted COVID-19 in Wuhan suggests that the risk of death increases with age and is also higher for those who have diabetes, heart disease, or blood clotting problems, or who have shown signs of sepsis. Against an average death rate of 1%, the death rate rose to 6% for people with cancer, high blood pressure, and chronic respiratory disease, 7% for people with diabetes, and 10% for people with heart disease. There was also a steep age gradient; the death rate among people aged 80+ was 15%.

Identifying who is most vulnerable is not necessarily straightforward. More than 55% of Medicare beneficiaries meet at least one of the risk criteria listed by the CDC [6]. People with the same chronic condition do not all have the same risk, and simple rules can fail to capture complex factors like frailty [8], which makes people more vulnerable to severe infections.

Since real-world data on COVID-19 cases is not readily available, the CV19 Index was developed using close proxy events. A person's CV19 Index is measured in terms of their near-term risk of severe complications from respiratory infections (e.g. pneumonia, influenza). Specifically, four categories of diagnoses were chosen from the Clinical Classifications Software Refined (CCSR) [11] classification system:

• RSP002 - Pneumonia (except that caused by tuberculosis)
• RSP003 - Influenza
• RSP005 - Acute bronchitis
• RSP006 - Other specified upper respiratory infections

Machine learning models were created that use a patient's historical medical claims data to predict the likelihood that they will have an inpatient hospital stay due to one of the above conditions in the next 3 months. The data used was an anonymized 5% sample of the Medicare claims data from 2015 and 2016. This data spanned the transition from ICD-9 to ICD-10 on October 1, 2015. The data set used to create the model was built by identifying all living members above the age of 18 on 9/30/2016. Only fee-for-service members were included, because medical claims histories for other members are not reliably complete. We then excluded all members who had less than 6 months of continuous eligibility prior to 9/30/2016. We also excluded members who lost coverage within 3 months after 9/30/2016, except for those members who lost coverage due to death. Table 1 below summarizes the population selection.
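The selection rules above can be sketched as a simple filter. This is an illustrative sketch only, not the authors' code: the `Member` record and its field names are hypothetical, the eligibility month count is assumed to be precomputed from enrollment data, "above the age of 18" is assumed to mean 18 or older, and a coverage loss is assumed to be "due to death" whenever the member died inside the follow-up window.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

PREDICTION_DATE = date(2016, 9, 30)   # cohort snapshot date from the text
FOLLOW_UP_END = date(2016, 12, 31)    # end of the 3-month outcome window
MIN_MONTHS_ELIGIBLE = 6               # minimum continuous eligibility


@dataclass
class Member:  # hypothetical record layout
    birth_date: date
    fee_for_service: bool
    months_continuously_eligible: int      # counted back from PREDICTION_DATE
    coverage_end: Optional[date] = None    # None means coverage did not end
    death_date: Optional[date] = None


def age_on(birth: date, when: date) -> int:
    """Age in completed years on the given date."""
    had_birthday = (when.month, when.day) >= (birth.month, birth.day)
    return when.year - birth.year - (0 if had_birthday else 1)


def in_cohort(m: Member) -> bool:
    """Apply the population selection rules described in the text."""
    alive = m.death_date is None or m.death_date > PREDICTION_DATE
    adult = age_on(m.birth_date, PREDICTION_DATE) >= 18  # assumed inclusive
    enough_history = m.months_continuously_eligible >= MIN_MONTHS_ELIGIBLE
    lost_coverage = m.coverage_end is not None and m.coverage_end < FOLLOW_UP_END
    died_in_window = (m.death_date is not None
                      and PREDICTION_DATE < m.death_date <= FOLLOW_UP_END)
    # Losing coverage only disqualifies a member when death does not explain it.
    return (alive and adult and m.fee_for_service and enough_history
            and not (lost_coverage and not died_in_window))
```

Keeping members who died during the follow-up window matters here: dropping them would remove some of the most severe outcomes from the training data.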

The final data set is split 80%/20% into training and test sets, with 1,481,654 people in the training set and 369,865 in the test set. The prevalence of the proxy event within the final population was 0.23%.
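An 80%/20% split of this kind can be sketched as follows. The source does not say how members were assigned to the two sets, so the seeded random shuffle below is an assumption, and the function name `split_members` is hypothetical.

```python
import random


def split_members(member_ids, test_fraction=0.2, seed=0):
    """Randomly partition member IDs into (train, test) lists."""
    ids = sorted(member_ids)             # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)     # seeded shuffle, independent of global state
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


train_ids, test_ids = split_members(range(100))
```

Splitting by member rather than by claim keeps all of a person's history on one side of the split, which avoids leaking information from the training set into the test set.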

The copyright holder for this preprint (which was not certified by peer review), this version posted March 21, 2020, is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Table 1. Population selection. Only the final step of the table survives: "Exclude members who lose coverage in the next 3 months not due to death."

The labels for the prediction task were created by identifying all patients who had an inpatient visit with an admission date from 10/1/2016 through 12/31/2016 and a primary diagnosis from one of the listed categories. A 3-month delay was imposed on the input features to the model, so that no claims after 6/30/2016 were used to make the predictions. This delay simulates the lag in claims processing that usually occurs in practice and enables the model to be used in realistic scenarios.
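The labeling rule and the feature cutoff described above can be sketched as below. The dates and CCSR categories come from the text; the `Claim` record and its field names are hypothetical, and this is a minimal illustration rather than the authors' pipeline.

```python
from dataclasses import dataclass
from datetime import date

PROXY_CATEGORIES = {"RSP002", "RSP003", "RSP005", "RSP006"}
FEATURE_CUTOFF = date(2016, 6, 30)   # last claim date visible to the model
LABEL_START = date(2016, 10, 1)      # start of the outcome window
LABEL_END = date(2016, 12, 31)       # end of the outcome window


@dataclass
class Claim:  # hypothetical record layout
    service_date: date     # admission date for inpatient claims
    inpatient: bool
    primary_ccsr: str      # CCSR category of the primary diagnosis


def label_for(claims) -> int:
    """Return 1 if any inpatient admission in the window has a proxy diagnosis."""
    return int(any(
        c.inpatient
        and LABEL_START <= c.service_date <= LABEL_END
        and c.primary_ccsr in PROXY_CATEGORIES
        for c in claims
    ))


def feature_claims(claims):
    """Keep only claims the model may see, respecting the 3-month delay."""
    return [c for c in claims if c.service_date <= FEATURE_CUTOFF]
```

Claims dated between the cutoff and the start of the outcome window are used for neither features nor labels, which mirrors the claims-processing lag the text describes.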

We highlight three approaches to building models that identify individuals who are vulnerable to complications from respiratory infections. All three are machine learning methods built on the same data set, and together they represent a tradeoff between accuracy and ease of implementation. For those who have access to data but not the coding background to adopt our model, we hope that the simple model can be easily ported to other systems. For a more robust model, we create a gradient boosted tree model leveraging age, gender, and medical diagnosis history. This model has been open sourced and can be obtained from GitHub (https://github.com/closedloop-ai/cv19index). Finally, we have created a third model that uses an extensive feature set generated from Medicare claims data along with linked geographical and social determinants of health data. This model is being made freely available through our hosted platform. Information about accessing the platform can be found at https://cv19index.com.

The first approach aims to reproduce the high-level recommendations on the Centers for Disease Control and Prevention (CDC) website [7] for identifying individuals who are at risk. The CDC identifies the following risk features:

• Older adults
• Individuals with heart disease
• Individuals with diabetes
• Individuals with lung disease

To turn this into a model, we extract International Classification of Diseases, Tenth Revision (ICD-10) diagnosis codes from the claims and aggregate them using the Clinical Classifications Software Refined (CCSR) categories. We create indicator features for the presence of any code in each CCSR category. The mapping between the CDC risk factors and the CCSR codes is described in Table 2.
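The aggregation from diagnosis codes to CCSR indicator features can be sketched as follows. The four-entry lookup is a small illustrative excerpt that should be checked against the official CCSR mapping file before use; the choice of categories and the function name `indicator_features` are assumptions made for the example, not the authors' feature list.

```python
# Illustrative excerpt of an ICD-10-CM to CCSR lookup (verify against the
# official CCSR mapping file; a real mapping has tens of thousands of rows).
ICD10_TO_CCSR = {
    "J18.9": "RSP002",  # pneumonia, unspecified organism
    "J44.9": "RSP008",  # chronic obstructive pulmonary disease, unspecified
    "E11.9": "END002",  # type 2 diabetes mellitus without complications
    "I50.9": "CIR019",  # heart failure, unspecified
}

# Categories the example model uses as features (an assumed subset).
MODEL_CATEGORIES = ["RSP002", "RSP008", "END002", "CIR019"]


def indicator_features(icd10_codes):
    """Return one 0/1 feature per CCSR category present in a member's codes."""
    present = {ICD10_TO_CCSR[code] for code in icd10_codes if code in ICD10_TO_CCSR}
    return {f"CCSR:{cat}": int(cat in present) for cat in MODEL_CATEGORIES}


features = indicator_features(["J18.9", "E11.9", "Z00.00"])
```

Codes without a mapping (here `Z00.00`) are ignored, and repeated codes in the same category still yield a single indicator set to 1.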

Table 2. Features of the logistic regression model (the leading rows and the first feature name are missing from the source).

Feature | Weight | CCSR code
[...] | -0.020 | CCSR:RSP016
age X Pneumonia | 0.010 | n/a
age X Other and ill-defined heart disease | 0.003 | n/a
age X Heart failure | 0.009 | n/a
age X Acute rheumatic heart disease | 0.003 | n/a
age X Coronary atherosclerosis and other heart disease | 0.011 | n/a
age X Pulmonary heart disease | -0.000 | n/a
age X Chronic rheumatic heart disease | -0.001 | n/a
age X Diabetes mellitus with complication | 0.007 | n/a
age X Diabetes mellitus without complication | 0.009 | n/a
age X Chronic obstructive pulmonary disease and bronchiectasis | 0.013 | n/a
age X Other specified and unspecified lower respiratory disease | 0.006 | n/a

We start with these features because they let us quantify the portion of the at-risk population that is captured by the high-level CDC recommendations. In addition to the conditions taken from the CDC recommendations, we look at features that our other modeling efforts surfaced as important and make those features available to the model as well. We also provide gender, age in years, and interaction terms between age and the diagnostic features. This simple data set is used to train a logistic regression model [9]. In addition to the CCSR codes, Table 2 additionally includes [...]

iv. Gradient Boosted Trees

Our more robust approach uses gradient boosted trees, a machine learning method that uses an ensemble of simple models to create highly accurate predictions [9]. The resulting models demonstrate higher accuracy. The drawback is that they are significantly more complex, and "by hand" implementation of such models is impractical. Here, we create two variations of the model. The first is a model that leverages information similar to that used by our logistic regression model. A nice feature of gradient boosted trees is that they are fairly robust against learning features that are eccentricities of the training data, but do [...]
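To make the idea of "an ensemble of simple models" concrete, here is a minimal sketch of gradient boosting with one-split trees (stumps) under squared loss. It is a teaching example under stated assumptions, not the authors' model: production work would use a dedicated library with a classification loss, and the toy rows (age, chronic-condition flag) and targets below are invented for illustration.

```python
def fit_stump(rows, residuals):
    """Find the single (feature, threshold) split that best fits the residuals."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), 0, float("inf"), mean, mean)  # fallback: no split
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_boosted(rows, targets, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = sum(targets) / len(targets)
    predictions = [base] * len(targets)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(targets, predictions)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, rows)]
    return base, learning_rate, stumps


def boosted_predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


# Invented toy data: [age, has_chronic_condition] -> had the proxy event (0/1).
ROWS = [[50, 0], [60, 1], [70, 0], [80, 1], [85, 0], [90, 1]]
TARGETS = [0, 0, 0, 1, 0, 1]
MODEL = fit_boosted(ROWS, TARGETS)
```

No single stump can express "older and chronically ill" on this toy data, but the sum of stumps on age and on the condition flag can approximate it, which is the sense in which an ensemble of simple models becomes more accurate than any one of them.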

We additionally built a model within the ClosedLoop platform, a software system designed to enable rapid creation of machine learning models utilizing healthcare data. The full details of the platform are beyond the scope of this paper; however, using the platform allows us to leverage engineered features drawn from peer-reviewed studies. Examples are social determinants of health and the Charlson Comorbidity Index [12]. We chose not to include these features in the open-source model, because the purpose of the open-source version is to be as accessible as possible to the greater healthcare data science community.

We quantify the performance of our models using metrics that are standard within the data science community. In particular, we visualize model performance using a Receiver Operating Characteristic (ROC) graph; see Figure 1. The metrics quantifying the effectiveness of our models are in Table 3. The performance of the two gradient boosted tree models is very similar. The ROC curve demonstrates that even as the decision threshold increases, the percentage of the potentially affected population increases at roughly the same rate. The logistic regression model also performs similarly at low alert rates: at a 3% alert rate, the difference in sensitivity is only 0.02. At higher alert rates the logistic regression model is at a significant disadvantage; however, for most interventions these alert rates would be higher than is practical.
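Sensitivity at a fixed alert rate, as used in the comparison above, can be computed as sketched below: flag the highest-scoring fraction of the population and measure the share of true events that fall among those flagged. The function name and the toy scores are hypothetical, and rounding the alert count to the nearest whole person (with a minimum of one) is an assumption.

```python
def sensitivity_at_alert_rate(scores, labels, alert_rate):
    """Share of true events captured when the top `alert_rate` fraction is flagged."""
    n_alerts = max(1, round(alert_rate * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(label for _, label in ranked[:n_alerts])
    positives = sum(labels)
    return true_positives / positives if positives else float("nan")


# Invented toy example: 10 people, 3 of whom had the event.
SCORES = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
LABELS = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
top_20_percent = sensitivity_at_alert_rate(SCORES, LABELS, 0.2)
```

Framing the metric by alert rate rather than by score threshold matches how outreach programs are sized in practice: the number of people who can be contacted is usually fixed in advance.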


There are two ways of accessing the models that we are providing. The first is to use the open-source version of our model, available at https://github.com/closedloop-ai/cv19index. This model is written in the Python programming language. We have included synthetic data to walk users through the process of going from tabular diagnosis data to the input format specific to our models. We encourage the healthcare data science community to fork the repository and adapt it to their own purposes. We also welcome collaboration from the open-source community, and pull requests will be considered for inclusion in the main branch of the package. The second is to use our models within our platform, where we are providing access to the COVID-19 model free of charge. Please visit https://closedloop.ai/cv19index for instructions on how to gain access.


[1] World Health Organization

[2] www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-mediabriefing-on-covid

[3] Simple Math Offers Alarming Answers about Covid-19, Health Care

[4] Why Is It so Hard to Calculate How Many People Will Die from Covid-19?

[5] Coronavirus: Risk of Death Rises with Age, Diabetes and Heart Disease

[6] Centers for Disease Control and Prevention

[7] Centers for Disease Control and Prevention

[8] Frailty Status at Admission to Hospital Predicts Multiple Adverse Outcomes

[9] An Introduction to Statistical Learning, with Applications in R

[10] Countyhealthrankings.Org, University of Wisconsin Population Health Institute, 2019, www.countyhealthrankings.org/explore-health-rankings/measures-data-sources

[11] Clinical Classifications Software Refined (CCSR) for ICD-10-CM Diagnoses

[12] A New Method of Classifying Prognostic Comorbidity in Longitudinal Studies: Development and Validation

Full Feature List

We include a full list of features available within our platform. The majority of features are binary variables indicating whether a patient has had a given type of medical event in the 15 months prior to the date of prediction.

This pandemic has already claimed thousands of lives, and sadly, this number is sure to grow. As healthcare resources are subject to the same scarcity constraints that affect us all, it is important to empower intervention policy with the best information possible. We have provided several models and means of access for individuals with varying levels of technical expertise. It is our hope that by providing these models quickly to the healthcare data science community, widespread adoption will lead to more effective intervention strategies and, ultimately, help to curtail the worst effects of this pandemic.