key: cord-302336-zj3oixvk
authors: Clift, Ash K; Coupland, Carol A C; Keogh, Ruth H; Diaz-Ordaz, Karla; Williamson, Elizabeth; Harrison, Ewen M; Hayward, Andrew; Hemingway, Harry; Horby, Peter; Mehta, Nisha; Benger, Jonathan; Khunti, Kamlesh; Spiegelhalter, David; Sheikh, Aziz; Valabhji, Jonathan; Lyons, Ronan A; Robson, John; Semple, Malcolm G; Kee, Frank; Johnson, Peter; Jebb, Susan; Williams, Tony; Hippisley-Cox, Julia
title: Living risk prediction algorithm (QCOVID) for risk of hospital admission and mortality from coronavirus 19 in adults: national derivation and validation cohort study
date: 2020-10-21
journal: BMJ
DOI: 10.1136/bmj.m3731
sha: 
doc_id: 302336
cord_uid: zj3oixvk

OBJECTIVE: To derive and validate a risk prediction algorithm to estimate hospital admission and mortality outcomes from coronavirus disease 2019 (covid-19) in adults. DESIGN: Population based cohort study. SETTING AND PARTICIPANTS: QResearch database, comprising 1205 general practices in England with linkage to covid-19 test results, Hospital Episode Statistics, and death registry data. 6.08 million adults aged 19-100 years were included in the derivation dataset and 2.17 million in the validation dataset. The derivation and first validation cohort period was 24 January 2020 to 30 April 2020. The second temporal validation cohort covered the period 1 May 2020 to 30 June 2020. MAIN OUTCOME MEASURES: The primary outcome was time to death from covid-19, defined as death due to confirmed or suspected covid-19 as per the death certification or death occurring in a person with confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection in the period 24 January to 30 April 2020. The secondary outcome was time to hospital admission with confirmed SARS-CoV-2 infection. Models were fitted in the derivation cohort to derive risk equations using a range of predictor variables. Performance, including measures of discrimination and calibration, was evaluated in each validation time period. RESULTS: 4384 deaths from covid-19 occurred in the derivation cohort during follow-up and 1722 in the first validation cohort period and 621 in the second validation cohort period. The final risk algorithms included age, ethnicity, deprivation, body mass index, and a range of comorbidities. The algorithm had good calibration in the first validation cohort. For deaths from covid-19 in men, it explained 73.1% (95% confidence interval 71.9% to 74.3%) of the variation in time to death (R²); the D statistic was 3.37 (95% confidence interval 3.27 to 3.47), and Harrell’s C was 0.928 (0.919 to 0.938). 
Similar results were obtained for women, for both outcomes, and in both time periods. In the top 5% of patients with the highest predicted risks of death, the sensitivity for identifying deaths within 97 days was 75.7%. People in the top 20% of predicted risk of death accounted for 94% of all deaths from covid-19. CONCLUSION: The QCOVID population based risk algorithm performed well, showing very high levels of discrimination for deaths and hospital admissions due to covid-19. The absolute risks presented, however, will change over time in line with the prevailing SARS-CoV-2 infection rate and the extent of social distancing measures in place, so they should be interpreted with caution. The model can be recalibrated for different time periods, however, and has the potential to be dynamically updated as the pandemic evolves.


The first cases of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection were reported in the UK on 24 January 2020, with the first death from coronavirus disease 2019 (covid-19) on 28 February 2020. As of 18 August 2020, more than 41 000 deaths from covid-19 had occurred in the UK and more than 773 000 deaths globally. 1 In the initial absence of any vaccination or prophylactic or curative treatments, the UK government implemented social distancing and shielding measures to suppress the rate of infection and protect vulnerable people, thereby trying to minimise the risk of serious adverse outcomes. 2 3 Emerging evidence throughout the course of the pandemic, initially from case series and then from cohorts of patients with confirmed SARS-CoV-2 infection, has shown associations of age, sex, certain comorbidities, ethnicity, and obesity with adverse covid-19 outcomes such as hospital admission or death. [4] [5] [6] [7] [8] [9] [10] [11] The knowledge base regarding risk factors for severe covid-19 is growing. As many countries are cautiously attempting to ease "lockdown" measures or reintroduce measures if rates are rising, an opportunity exists to develop more nuanced guidance based on predictive algorithms to inform risk management decisions. 12 Better knowledge of individuals' risks could also help to guide decisions on mitigating occupational exposure and in targeting of vaccines to those most at risk. Although some prediction models have been developed, a recent systematic review found that they all have a high risk of bias and that their reported performance is optimistic. 13 The use of primary care datasets with linkage to registries such as death records, hospital admissions data, and covid-19 testing results represents a novel approach to clinical risk prediction modelling for covid-19. 
It provides accurately coded, individual level data for very large numbers of people representative of the national population. This approach draws on the rich phenotyping of individuals with demographic, medical, and pharmacological predictors to allow robust statistical modelling and evaluation. Such linked datasets have an established track record for developing and evaluating clinical risk models, including those for cardiovascular disease, diabetes, and mortality. [14] [15] [16] We aimed to develop and validate population based prediction models to estimate the risks of becoming infected with and subsequently dying from covid-19 and of becoming infected and subsequently admitted to hospital with covid-19. The model we have developed is designed to be applied across the adult population so that it can be used to enable risk stratification for public health purposes in the event of a "second wave" of the pandemic, to support shared management of risk and occupational exposure, and in early targeting of vaccines to people most at risk. An ongoing companion study will externally validate the models, using datasets across all four nations of the UK, and will be reported separately.

This study was commissioned by the Chief Medical Officer for England on behalf of the UK Government, who asked the New and Emerging Respiratory Virus Threats Advisory Group (NERVTAG) to establish whether a clinical risk prediction model for covid-19 could be developed in line with the emerging evidence. The protocol has been published. 17 The study was conducted in adherence with TRIPOD 18 and RECORD 19 guidelines and with input from our patient advisory group.

Study design and data sources

We did a cohort study of primary care patients using the QResearch database (version 44). QResearch was established in 2002 and has been extensively used for the development of risk prediction algorithms across the National Health Service (NHS) and for epidemiological research. By April 2020, 1205 practices in England were contributing to QResearch, covering a population of 10.5 million patients. The database is linked at individual patient level, using a project specific pseudonymised NHS number, to hospital admissions data (including intensive care unit data), positive results from covid-19 real time reverse transcriptase polymerase chain reaction tests held by Public Health England, cancer registrations (including detailed radiotherapy and systemic chemotherapy records), the national covid-19 shielded patient list in England, and mortality records held by NHS Digital.

We identified a cohort of people aged 19-100 years registered with participating general practices in England on 24 January 2020. We excluded patients (approximately 0.1%) who did not have a valid NHS number. Patients entered the cohort on 24 January 2020 (date of first confirmed case of covid-19 in the UK) and were followed up until they had the outcome of interest or the end of the first study period (30 April 2020), which was the date up to which linked data were available at the time of the derivation of the model, or the second time period (1 May 2020 until 30 June 2020) for the temporal cohort validation.

The primary outcome was time to death from covid-19 (either in hospital or outside hospital), defined as confirmed or suspected death from covid-19 as per the death certification or death occurring in an individual with confirmed SARS-CoV-2 infection at any time in the period 24 January to 30 April 2020. The secondary outcome was time to hospital admission with covid-19, defined as an ICD-10 (International Classification of Diseases, 10th revision) code for either confirmed or suspected covid-19 or new hospital admission associated with a confirmed SARS-CoV-2 infection in the study period.

We selected candidate predictor variables on the basis of existing clinical vulnerability group criteria (table 1), associations with outcomes in other respiratory diseases, or because they were hypothesised to be linked to adverse outcomes on grounds of clinical or biological plausibility and were likely to be available for implementation. They are summarised in box 1 and supplementary box A. We defined variables according to information recorded using Read Codes in general practices' electronic health records at the start of the study period. The exception to this was information on chemotherapy, radiotherapy, and transplants, which was based on linked hospital records.

We randomly allocated 75% of practices to the derivation dataset, which we used to develop the models. We evaluated the models' performance in the remaining 25% of practices (the validation set).
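The allocation described above is made at the level of whole practices, not individual patients, so that all patients from one practice fall in the same dataset. A minimal sketch of such a split follows; it is an illustration only, not the code used in the study. The function name `split_practices`, the seed, and the integer practice identifiers are hypothetical.

```python
import random


def split_practices(practice_ids, derivation_fraction=0.75, seed=0):
    """Randomly allocate whole practices to a derivation or validation set.

    Hypothetical helper: patients inherit the allocation of their practice,
    so no practice contributes patients to both datasets.
    """
    ids = sorted(set(practice_ids))   # de-duplicate and fix a starting order
    rng = random.Random(seed)         # seeded generator for reproducibility
    rng.shuffle(ids)
    n_derivation = round(len(ids) * derivation_fraction)
    return set(ids[:n_derivation]), set(ids[n_derivation:])


# Example with 1205 made-up practice identifiers.
derivation, validation = split_practices(range(1205))
```

Because the split is random, the realised proportions in any one allocation need not be exactly 75% and 25%.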

All models were fitted separately in men and women. The outcomes of interest are subject to competing risks. For the primary outcome of death from covid-19, the competing risk is death due to other causes. For the secondary outcome of hospital admission, the competing risk is death from any cause before admission. We fitted a sub-distribution hazard (Fine and Gray 21) model for each outcome to account for competing risks. Individuals who did not have the outcome of interest were censored at the study end date, including those who had a competing event.
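For orientation, the standard textbook form of a proportional sub-distribution hazards model can be written as below. The notation is generic and assumed here, not taken from the paper: T is the event time, ε = 1 marks the outcome of interest, x is the vector of predictors, and β the regression coefficients.

```latex
F_1(t \mid x) = \Pr(T \le t,\ \varepsilon = 1 \mid x),
\qquad
\lambda_1(t \mid x)
  = -\frac{d}{dt}\,\log\bigl\{1 - F_1(t \mid x)\bigr\}
  = \lambda_{1,0}(t)\,\exp\bigl(x^{\top}\beta\bigr)
```

In this formulation, people who have a competing event are not removed from the risk set when it occurs, which is consistent with censoring them at the study end date as described above.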

For all predictor variables, we used the most recently available value at the entry date (24 January 2020). We used second degree fractional polynomials to model non-linear relations for continuous variables (age, body mass index, and Townsend material deprivation score, an area level score based on postcode 20). Initially, we fitted a complete case analysis by using a model within the derivation data to derive the fractional polynomial terms. For indicators of comorbidities and medication use, we assumed the absence of recorded information to mean absence of the factor in question. Data were missing in four variables: ethnicity, Townsend score, body mass index, and smoking status. We used multiple imputation with chained equations under the missing at random assumption to replace missing values for these variables. For computational efficiency, we used a combined imputation model for both outcomes. The imputation model was fitted in the derivation data and included predictor variables, the Nelson-Aalen estimators of the baseline cumulative sub-distribution hazard, and the outcome indicators (death from covid-19 and hospital admission with covid-19). We carried out five imputations. Each analysis model was fitted in each of the five imputed datasets. We used Rubin's rules to combine the model parameter estimates and the baseline cumulative incidence estimates across the imputed datasets.
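Rubin's rules, mentioned above, pool one estimate per imputed dataset into a single estimate whose variance reflects both within-imputation and between-imputation uncertainty. The sketch below shows the standard formulas for a single scalar parameter. The function name `pool_rubin` and the example numbers are hypothetical and are not results from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool a scalar estimate across m imputed datasets using Rubin's rules.

    estimates: point estimates, one per imputed dataset (for example a log
               hazard ratio for one predictor)
    variances: the corresponding squared standard errors
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = mean(estimates)        # average of the point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up log hazard ratios from five imputations (illustrative only).
estimate, total_variance = pool_rubin(
    [0.52, 0.49, 0.55, 0.50, 0.53],
    [0.010, 0.011, 0.010, 0.012, 0.010],
)
```

The factor (1 + 1/m) inflates the between-imputation variance to account for using a finite number of imputations, here five.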

We initially sought to fit models using all predictor variables. Owing to sparse cells, some conditions were combined if clinically similar in nature (such as rare neurological disorders). We examined interactions between body mass index and ethnicity and interactions between predictor variables and age, focusing on predictor variables that apply across the age range (asthma, epilepsy, diabetes, severe mental illness). We explored the use of penalised models (LASSO) to screen variables for inclusion, but this retained all the predictor variables and most interaction terms. 17 In line with the protocol, we subsequently removed a small number of variables with low numbers of events and adjusted (sub-distribution) hazard ratios close to 1 (as these will have minimal effect on predicted risks) or with uncertain clinical credibility, defined as counterintuitive results in light of the emerging literature. Lastly, we combined regression coefficients from the final models with estimates of the baseline cumulative incidence function evaluated at 97 days to derive risk equations for each outcome. We used all the available data in the database.
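The last step above, combining regression coefficients with a baseline cumulative incidence at 97 days, can be illustrated with the standard relation for a proportional sub-distribution hazards model: risk = 1 - (1 - baseline)^exp(linear predictor). The sketch below assumes that form. The published QCOVID equations may differ in detail (for example, in how covariates are centred), and the function name and numbers are hypothetical.

```python
import math


def predicted_risk(baseline_cif, linear_predictor):
    """Absolute risk at a fixed horizon under a Fine and Gray type model.

    baseline_cif:     baseline cumulative incidence at the horizon (for
                      example 97 days) for a person whose linear predictor is 0
    linear_predictor: sum of coefficient * covariate value for the individual
    """
    if not 0.0 <= baseline_cif < 1.0:
        raise ValueError("baseline_cif must lie in [0, 1)")
    return 1.0 - (1.0 - baseline_cif) ** math.exp(linear_predictor)


# Illustrative values only: a baseline risk of 0.05% and a linear
# predictor of 1.2 (a sub-distribution hazard ratio of about 3.3).
risk = predicted_risk(0.0005, 1.2)
```

When the baseline risk is small, the predicted risk is close to the baseline multiplied by the hazard ratio, which is why the absolute risks scale with the prevailing incidence.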

We did all model evaluation using the validation data with two separate periods of follow-up. The first validation study period was the same as for the derivation cohort: 24 January to 30 April 2020. The second temporal validation covered the subsequent period of 1 May 2020 to 30 June 2020. This was carried out with the same validation cohort except for exclusion of patients who died during 24 January to 30 April 2020. In the validation cohort, we fitted an imputation model to replace missing values for ethnicity, body mass index, Townsend score, and smoking status. This excluded the outcome indicators and Nelson-Aalen terms, as the aim was to use covariate data to obtain a prediction as if the outcome had not been observed to reflect intended use.

We applied the final risk equations developed from the derivation dataset to men and women in the validation dataset and evaluated R² values, Brier scores, and measures of discrimination and calibration for the two time periods. 22 Brier scores measure predictive accuracy, where lower values indicate better accuracy. 25 D statistics (a discrimination measure that quantifies the separation in survival between patients with different levels of predicted risks) and Harrell's C statistics (a discrimination metric that quantifies the extent to which people with higher risk scores have earlier events) were evaluated at 97 days (the maximum follow-up period available at the time of the derivation of the model) and 60 days for the second temporal validation, with corresponding 95% confidence intervals. 26 We assessed model calibration by comparing mean predicted risks with observed risks by twentieths of predicted risk for each of the validation cohorts. Observed risks were derived in each of the 20 groups by using non-parametric estimates of the cumulative incidences. Additionally, we did a recalibration for the mortality outcome, using the method proposed by Booth et al, which updates the baseline survivor function based on the temporal validation cohort with the prognostic index as an offset term. 27 We also applied the algorithms to the validation cohort for the first time period to define the centile thresholds based on absolute risk. We also defined centiles of relative risk (defined as the ratio of the individual's predicted absolute risk to the predicted absolute risk for a person of the same age and sex with a white ethnicity, body mass index of 25, and mean deprivation score with no other risk factors). We calculated the performance metrics in the whole validation cohort and in pre-specified subgroups. 17 We also evaluated performance by calculating Harrell's C statistics in individual general practices and combining the results using a random effects meta-analysis. 28

Patient and public involvement

Patients were involved in setting the research question and in developing plans for design and implementation of the study. Patients were asked to aid in interpreting and disseminating the results.

Overall study population

Overall, 1205 practices in England met our inclusion criteria. Of these, 910 practices were randomly assigned to the derivation dataset and 295 to the validation cohort. The practices had 8 256 158 registered patients aged 19-100 years on 24 January 2020. We included 6 083 102 of these in the derivation cohort, and the validation dataset comprised 2 173 056 people. Table 2 shows the baseline characteristics of patients in the derivation cohort. Of these patients, 3 035 409 (49.9%) were men and 990 799 (16.3%) were of black, Asian, or other minority ethnic (BAME) background.

In the derivation cohort, 10 776 (0.18%) patients had a covid-19 related hospital admission and 4384 (0.07%) had a covid-19 related death during the 97 days' follow-up, of which 4265 (97.3%) were recorded on the death certificate and 119 (2.71%) were based only on a positive test (and of these <15 were based on a test more than 28 days before death). Admissions and deaths due to covid-19 occurred across all regions, with the greatest numbers in London, which accounted for 3799 (35.3%) of admissions and 1287 (29.4%) of deaths. Of those who died, 2517 (57.4%) were male, 732 (16.7%) were BAME, 3616 (82.5%) were aged 70 and over, 1417 (32.3%) had type 2 diabetes, 1311 (29.9%) had dementia, and 1033 (23.6%) were identified as living in a care home.

The characteristics of the validation cohort were similar to those of the derivation cohort, as shown in supplementary tables A and B. In the first validation period (24 January to 30 April 2020), 1722 deaths and 3703 hospital admissions due to covid-19 occurred. In the second validation period (1 May to 30 June 2020), 621 deaths and 1002 admissions due to covid-19 occurred.

The variables included in the final models were fractional polynomial terms for age and body mass index, Townsend score (linear), ethnic group, domicile (residential care, homeless, neither), and a range of conditions and treatments as shown in figure 1, figure 2, figure 3, and figure 4. These conditions and treatments were cardiovascular conditions (atrial fibrillation, heart failure, stroke, peripheral vascular disease, coronary heart disease, congenital heart disease), diabetes (type 1 and type 2 and interaction terms for type 2 diabetes with age), respiratory conditions (asthma, rare respiratory conditions (cystic fibrosis, bronchiectasis, or alveolitis), chronic obstructive pulmonary disease, pulmonary hypertension or pulmonary fibrosis), cancer (blood cancer, chemotherapy, lung or oral cancer, marrow transplant, radiotherapy), neurological conditions (cerebral palsy, Parkinson's disease, rare neurological conditions (motor neurone disease, multiple sclerosis, myasthenia, Huntington's chorea), epilepsy, dementia, learning disability, severe mental illness), other conditions (liver cirrhosis, osteoporotic fracture, rheumatoid arthritis or systemic lupus erythematosus, sickle cell disease, venous thromboembolism, solid organ transplant, renal failure (CKD3, CKD4, CKD5, with or without dialysis or transplant)), and medications (≥4 prescriptions from general practitioner in previous six months for oral steroids, long acting β agonists or leukotrienes, immunosuppressants). Figure 1 and figure 2 show the adjusted hazard ratios in the final models for covid-19 related death in the derivation cohort in women and men. Figure 3 and figure 4 show the adjusted hazard ratios for the final models for covid-19 related hospital admission in the derivation cohort.

Supplementary figures A and B show graphs of the adjusted hazard ratios for body mass index, age, and the interaction between age and type 2 diabetes for deaths and hospital admissions due to covid-19 (which showed higher risks associated with younger ages). Supplementary figures C and D show fully adjusted hazard ratios for variables for the full model, including variables that were not retained in the final model (for example, adjusted hazard ratios close to one or those which lacked clinical credibility). Other variables with too few events for inclusion were HIV, sphingolipidoses, short bowel syndrome, polymyositis, dermatomyositis, Ehlers-Danlos syndrome, biliary cirrhosis, hepatitis B and C, haemochromatosis, non-alcoholic fatty liver disease, chronic pancreatitis, drug misuse, asplenia, cholangitis, scleroderma, Sjogren's syndrome, and pregnancy. Supplementary figures E and F show fully adjusted hazard ratios for a combined outcome of either covid-19 related death or hospital admission. This gave very similar absolute risks to the hospital admission outcome.

Table 3 shows the performance of the risk equations in the validation cohort for women and men over 97 days for the main study period and for the temporal validation cohort evaluated from 1 May 2020 to 30 June 2020. Overall, the values for the R², D, and C statistics were similar in women and men. Values for the mortality outcome tended to be higher than those for the hospital admission outcome. For example, in the first validation period, the equation explained 74% of the variation in time to death from covid-19 in women; the D statistic was 3.46, and Harrell's C statistic was 0.933. The corresponding values in men were 73.1%, 3.37, and 0.928. 
The results for the second validation period were similar except for covid-19 related admissions in women, for which the explained variation and discrimination were lower than for the first period (explained variation 45.4%, D statistic 1.87, and Harrell's C statistic 0.776).
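To make the discrimination measure reported above concrete, the sketch below computes Harrell's C for right censored data by direct pairwise comparison. It is a simplified illustration: it ignores competing risks and confidence intervals, and it is not the software used in the study. The function name and toy data are hypothetical.

```python
def harrell_c(times, events, risks):
    """Harrell's C: proportion of comparable pairs ranked correctly.

    times:  follow-up time for each person
    events: 1 if the outcome occurred at that time, 0 if censored
    risks:  predicted risk (higher should mean an earlier event)
    A pair is comparable when the person with the shorter time had the
    event; tied predicted risks count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                      # censored people cannot anchor a pair
        for j in range(n):
            if times[i] < times[j]:       # i had the event before j's time
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four people, one censored at time 3.
c_statistic = harrell_c([1, 2, 3, 4], [1, 1, 0, 1], [0.9, 0.8, 0.3, 0.1])
```

A value of 0.5 corresponds to ranking no better than chance, and 1.0 to perfect ranking of comparable pairs. This pairwise loop is quadratic in the number of people, so cohorts of millions need a more efficient algorithm.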

Supplementary tables C-F show the corresponding results by region, age band, and fifth of deprivation and within each ethnic group in men and women in both validation periods. Performance was generally similar to the overall results except for age, for which the values were lower within individual age bands. Figure 5 shows funnel plots of Harrell's C statistic for each general practice in the validation cohort versus the number of deaths in each practice in men and women in the first validation period. The summary (average) C statistic for women was 0.916 (95% confidence interval 0.908 to 0.924) from a random effects meta-analysis. The corresponding summary C statistic for men was 0.919 (0.912 to 0.926). 
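The practice level summary above pools one C statistic per general practice. The text does not name the estimator or the scale on which pooling was done, so the sketch below assumes the widely used DerSimonian and Laird method applied to untransformed estimates. The function name and the example values are hypothetical.

```python
import math


def dersimonian_laird(estimates, std_errors):
    """Random effects pooled estimate (DerSimonian and Laird, assumed here).

    estimates:  one estimate per practice (for example Harrell's C)
    std_errors: the corresponding standard errors
    Returns (pooled_estimate, pooled_standard_error, tau_squared).
    """
    k = len(estimates)
    weights = [1.0 / se ** 2 for se in std_errors]   # inverse variance weights
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, estimates)) / sum_w
    # Cochran's Q measures heterogeneity around the fixed effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Re-weight each practice with the between-practice variance added.
    re_weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_rw = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum_rw
    return pooled, math.sqrt(1.0 / sum_rw), tau2


# Made-up practice level C statistics and standard errors.
pooled_c, pooled_se, tau2 = dersimonian_laird(
    [0.93, 0.90, 0.92, 0.88], [0.02, 0.03, 0.02, 0.04]
)
lower, upper = pooled_c - 1.96 * pooled_se, pooled_c + 1.96 * pooled_se
```

Practices with few deaths have large standard errors and so contribute little weight, which is the pattern a funnel plot is designed to show.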

We have developed and evaluated a novel clinical risk prediction model (QCOVID) to estimate risks of hospital admission and mortality due to covid-19. We have used national linked datasets from general practice and national SARS-CoV-2 testing, death registry, and hospital episode data for a sample of more than 8 million adults representative of the population of England. The risk models have excellent discrimination (Harrell's C statistics >0.9 for the primary outcome). Although the calibration for the hospital admission outcome was good in both time periods, some under-prediction existed for the mortality outcome in the second validation cohort, which improved after recalibration. The recalibration method could be used to transport the risk models to other settings or time periods with different absolute risks of covid-19. QCOVID represents a new approach for risk stratification in the population. It could also be deployed in several health and care applications, either during the current phase of the pandemic or in subsequent "waves" of infection (with recalibration as needed). These could include supporting targeted recruitment for clinical trials, prioritisation for vaccination, and discussions between patients and clinicians on workplace or health risk mitigation, for example through weight reduction, as obesity may be an important modifiable risk factor for serious complications of covid-19 if a causal association is established. 10 Although QCOVID has been specifically designed to inform UK health policy and interventions to manage covid-19 related risks, it also has international potential, subject to local validation. One of the variables in our model (the Townsend measure of deprivation) may need to be replaced with locally available equivalent measures, or some recalibration may be needed. 
Previous risk prediction models based [...] 29 30

Comparison with other studies

Although similarities exist between our study and the recently reported analysis of risk factors from another English general practice database using a different clinical computer system, our project had a different aim, namely to develop and evaluate a risk prediction model. We used a more comprehensive outcome (including deaths in patients with positive tests for SARS-CoV-2), a much wider range of predictors, and a more granular assessment of ethnicity and body mass index. Our C statistic for mortality (>0.92) is substantially higher than the previous study's reported value of 0.77. 31 Other prediction models have been reported, although these focus on other outcomes of covid-19, including risk of admission to intensive care or death following a positive test, or clinical decision tools that integrate biochemical and imaging parameters to aid diagnosis. 13 However, most such studies are at high risk of bias, as they have been developed in highly selected cohorts, have limited transparency, are likely to have optimistic reported performance, or did not use covid-19 specific data. 13 This study represents a substantial improvement on previously developed risk algorithms in terms of the size and representativeness of the study population, the richness of data linkages enabling accurate ascertainment of cases (including both in-hospital and out of hospital deaths) across the health network, and the breadth of candidate predictor variables considered. Importantly, it analyses risks at the population level, rather than risks in people with confirmed or suspected infection, and may have relevance for shielding or other policies that seek to mitigate risk of viral exposure.

Complexities of modelling

Several complexities of modelling adverse risks from covid-19 in the general population warrant discussion. 
We used a general population approach which, although not able to incorporate all determinants of being infected, offers an overall estimate of risk of adverse outcomes from covid-19 that could be used in discussions between clinicians and patients about adjustment of lifestyle or occupational and behavioural factors that could limit viral exposure. Our model predicts risks of "catching covid-19 and then having a severe outcome," on the basis of data collected during the first peak of the pandemic. The endpoint in this study examines a risk trajectory that comprises two elements: becoming infected, which is predominantly a function of behavioural/environmental factors including occupation, local infection rate, and numbers of social interactions; and risk of hospital admission and death due to the infection, which is arguably primarily driven by "vulnerability" (that is, biological/physiological factors including age, sex, body mass index, comorbidities, and medications). Although producing a prediction model for risk of "death if infected" is feasible in principle, this approach is not yet possible owing to the approach to testing in the UK and the context of an as yet incompletely quantified degree of asymptomatic background transmission. Limited covid-19 testing data are available, but the difficulty is that no systematic community testing was done in the UK during the study period, so only patients unwell enough to attend hospital were tested. This means that a risk score developed in those who tested positive would overestimate risks of severe outcomes. As more widespread testing is done and those data become available, we will be able to update the model to take background infection rates into account and also model regional differences. 
Although the absolute risk levels will of course change over time, depending on the incidence of the disease, our analysis over two validation time periods indicates that the relative risk measures and discrimination are likely to remain stable. Secondly, the model estimates the absolute risk for a non-infected individual in the general population of becoming infected and then dying (or needing to be admitted to hospital) from the virus over a 97 day period. Although many more than 40 000 people have died from covid-19 in the UK to date, when the denominator is a population of multi-millions, the absolute risk for most people may be low. Therefore, when conveying this type of risk score to an individual, due emphasis is needed on the different meanings of absolute and relative risk.

Thirdly, the absolute risk of catching covid-19 depends not only on the incidence of the infection but also on the number of people one gets close to. For this reason, non-pharmacological interventions such as social distancing and shielding were introduced in the UK during the study period. We have included some measures of multi-occupancy, as we have factored care homes into the analysis. The data generated during the study period will therefore be affected by the uptake of interventions such as social distancing and shielding, intended to mitigate the risks of SARS-CoV-2 infection. This could result in underestimation of some model coefficients and hence underestimation of absolute risk in people who were shielded. Also, as this is a prediction model derived from an observational study, the associations estimated for individual predictor variables should not be interpreted as causal effects.

[Figure residue: axis labels "Harrell's C statistic" and "Twentieth of predicted risk at 97 days"]

However, ethical questions must be considered regarding how the tools may be used. We have presented two ways of stratifying risk based on either absolute or relative risk measures with associated centile values, but the choice of whether to have a threshold (given that risk is a continuous measure), and if so what threshold, will depend on the purpose for which the risk assessment tool is to be used, the available resources, and the ethical framework for decision making. We have analysed this within the "four ethical principles" framework that is widely used in medical decision making. The four principles are autonomy, beneficence, justice, and non-maleficence. [32] The new risk equations, when implemented in clinical software, are designed to provide more accurate information for patients and clinicians on which to base decisions, thereby promoting shared decision making and patient autonomy. They are intended to result in clinical benefit by identifying where changes in management are likely to benefit patients, thereby promoting the principle of beneficence. Justice can be achieved by ensuring that the use of the risk equations results in fair and equitable access to health services that is commensurate with patients' level of risk. Lastly, the risk assessment must not be used in a way that causes harm either to the individual patient or to others (for example, by introducing or withdrawing treatments where this is not in the patient's best interest), thereby supporting the non-maleficence principle. How this applies in clinical practice will naturally depend on many factors, especially the patient's wishes, the evidence base for any interventions, the clinician's experience, national priorities, and the available resources. The risk assessment equations therefore supplement clinical decision making and do not replace it.
With these caveats, the predicted risk estimates can be used to identify people at higher risk, to inform shared decision making between healthcare professionals and service users, or for population level stratification.
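Stratification by centile of predicted risk can be sketched as follows. This is a minimal illustration under stated assumptions and not the study's implementation: the function names, the rank based grouping into twentieths, the 5% cut-off, and the example risks are all hypothetical, and any real threshold would depend on the purpose, resources, and ethical framework discussed above.

```python
def risk_twentieths(predicted_risks):
    """Assign each person to a twentieth of predicted risk by rank.

    Returns a list of group numbers (1 = lowest risk, 20 = highest).
    Ties are broken by position in the input, which is an arbitrary choice.
    """
    n = len(predicted_risks)
    if n == 0:
        return []
    order = sorted(range(n), key=lambda i: predicted_risks[i])
    groups = [0] * n
    for rank, idx in enumerate(order):
        groups[idx] = rank * 20 // n + 1
    return groups


def flag_top_fraction(predicted_risks, fraction):
    """Flag people whose rank puts them in the top `fraction` of predicted risk."""
    n = len(predicted_risks)
    n_flagged = int(n * fraction)
    order = sorted(range(n), key=lambda i: predicted_risks[i], reverse=True)
    flagged = [False] * n
    for idx in order[:n_flagged]:
        flagged[idx] = True
    return flagged


# Hypothetical predicted 97 day risks for 40 people (not study data).
example_risks = [(i * 7 % 40 + 1) / 10_000 for i in range(40)]
example_groups = risk_twentieths(example_risks)
example_flags = flag_top_fraction(example_risks, 0.05)

for risk, group, flag in zip(example_risks, example_groups, example_flags):
    if flag:
        print(f"risk={risk:.4f} twentieth={group} flagged as top 5%")
```

Because risk is continuous, the flag is only a convenience layered on top of the ranking; the underlying predicted risk is retained so that a different cut-off can be applied without refitting anything.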

Strengths and limitations of study

Our study has some major strengths, but some important limitations, which include the specific factors related to covid-19 along with others that are similar to those for a range of other widely used clinical risk prediction algorithms developed using the QResearch database. [14] [15] [16] Key strengths include the use of a very large validated data source that has been used to develop other risk prediction tools; the wealth of candidate risk predictors; the prospective recording of outcomes and their ascertainment using multiple national level database linkage; lack of selection, recall and respondent biases; and robust statistical analysis. We have used non-linear terms for body mass index and age. We examined interaction terms, which show increased risks at younger ages for adults with type 2 diabetes. We also established a new linkage to the systemic anti-cancer therapy (SACT) database for chemotherapy prescribed and administered in secondary care (which may not be recorded well in general practice software) to circumvent possible missing data for this important variable. Specific limitations include the occurrence of shielding during the study period and that the study was conducted during the first phase of the UK epidemic. We have accounted for many risk factors for covid-19 mortality, but risks may be conferred by some rare medical conditions or other factors such as occupation that have not yet been observed or are poorly recorded in general practice or hospital data. In particular, the model does not include two important predictors, namely prevailing infection rate and personal social distancing measures. A lack of comprehensive testing has led to some missing data on covid-19 admissions and/or deaths, which means that development of a valid model for predicting death in people infected with SARS-CoV-2 is not yet possible.
We acknowledge that absolute risks are changing during the course of the pandemic, so these should be interpreted with caution. However, we would expect predictors of risk, relative risk measures, and discrimination to be more stable over time, which is consistent with the results from our temporal validation. Although this tool was modelled on the best available data from the first wave of the pandemic, it will be updated as further testing and outcome data accrue, immunity levels change, and (potentially) a vaccine becomes available. Nevertheless, having a risk score available at this stage of the pandemic may be useful to identify people at high risk before a vaccine or treatment is available.
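The expectation that relative measures stay stable while absolute risks move can be illustrated with a generic proportional hazards style risk equation. This is a sketch under assumptions: it uses the common form in which predicted risk equals one minus the baseline event-free proportion raised to the power of the exponentiated linear predictor, the baseline risks and hazard ratio are hypothetical, and the published equations are not reproduced here.

```python
import math


def absolute_risk(baseline_risk: float, linear_predictor: float) -> float:
    """Predicted risk under a proportional hazards style model.

    baseline_risk: risk over the period for a person with linear predictor 0.
    linear_predictor: log hazard ratio relative to that baseline person.
    """
    return 1.0 - (1.0 - baseline_risk) ** math.exp(linear_predictor)


# Hypothetical values (assumptions for illustration, not study estimates).
log_hazard_ratio = math.log(3.0)  # a person with 3 times the baseline hazard
first_wave_baseline = 0.0005      # assumed baseline 97 day risk at high incidence
later_baseline = 0.0001           # assumed baseline risk once incidence has fallen

for label, baseline in [("high incidence", first_wave_baseline),
                        ("low incidence", later_baseline)]:
    risk = absolute_risk(baseline, log_hazard_ratio)
    print(f"{label}: baseline={baseline:.4%} predicted={risk:.4%} "
          f"ratio={risk / baseline:.2f}")
```

In this sketch the predicted absolute risk falls when the baseline falls, while the ratio to the baseline stays close to the assumed hazard ratio, so ranking and discrimination are unaffected; this is why updating the baseline as incidence changes can be enough to restore calibration of absolute risks.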

We have reported a validation in each of two time periods using practices from QResearch, but these practices were completely separate from those used to develop the model. We have used this approach previously to develop and validate other widely used prediction models. When these have been further externally validated on completely different clinical databases, by ourselves and others, the results have been very similar. [33] [34] [35] Work is already under way to evaluate the models in external datasets across all four nations of the UK and to integrate the algorithms within NHS clinical software systems.

This study presents robust risk prediction models that could be used to stratify risk in populations for public health purposes in the event of a "second wave" of the pandemic and support shared management of risk. We anticipate that the algorithms will be updated regularly as understanding of covid-19 increases, as more data become available, as behaviour in the population changes, or in response to new policy interventions. It is important for patients/carers and clinicians that a common, appropriately developed, evidence based model exists that is consistently implemented and is supported by the academic, clinical, and patient communities. This will then help to ensure consistent policy and clear national communication between policy makers, professionals, employers, and the public. 

Impact assessment of nonpharmaceutical interventions against coronavirus disease 2019 and influenza in Hong Kong: an observational study

Effects of non-pharmaceutical interventions on COVID-19 cases, deaths, and demand for hospital services in the UK: a modelling study

Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study

COVID-19 and African Americans

Clinical characteristics of 113 deceased patients with coronavirus disease 2019: retrospective study

Clinical course and mortality risk of severe COVID-19

Variation in COVID-19 Hospitalizations and Deaths Across New York City Boroughs

writing group from Obesity UK, Obesity Empowerment Network, UK Association for the Study of Obesity. Obesity and COVID-19: a call for action from people living with obesity

Obesity Is a Risk Factor for Severe COVID-19 Infection: Multiple Potential Mechanisms

Prevalence of co-morbidities and their association with mortality in patients with COVID-19: A systematic review and meta-analysis

Shielding from covid-19 should be stratified by risk

Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal

Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study

Development and validation of QDiabetes-2018 risk prediction algorithm to estimate future risk of type 2 diabetes: cohort study

Development and validation of QMortality risk prediction algorithm to estimate short term risk of death and assess frailty: cohort study

Protocol for the development and evaluation of a tool for predicting risk of short-term adverse outcomes due to COVID-19 in the general UK population

Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement

RECORD Working Committee. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement

The Black report. Penguin

A Proportional Hazards Model for the Subdistribution of a Competing Risk

Explained variation for survival models

A new measure of prognostic separation in survival data

Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors

Assessing the performance of prediction models: a framework for traditional and novel measures

Comparing the predictive powers of survival models using Harrell's C or Somers' D

Temporal recalibration for improving prognostic model development and risk predictions in settings where survival is improving over time

External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges

Improvement in Cardiovascular Risk Prediction with Electronic Health Records

Non-invasive risk scores for prediction of type 2 diabetes (EPIC-InterAct): a validation of existing models

Factors associated with COVID-19-related death using OpenSAFELY

Medical ethics: four principles plus attention to scope

An independent external validation and evaluation of QRISK cardiovascular risk prediction: a prospective open cohort study

An independent and external validation of QRISK2 cardiovascular disease risk score: a prospective open cohort study

Predicting the 10 year risk of cardiovascular disease in the United Kingdom: independent and external validation of an updated version of QRISK2

Web appendix: Supplementary materials

Imperial College London, London, UK

We acknowledge the contribution of EMIS practices who contribute to QResearch and EMIS Health and the Universities of Nottingham and Oxford for expertise in establishing, developing, or supporting the QResearch database. This project involves data derived from patient level information collected by the NHS, as part of the care and support of cancer patients. The data are collated, maintained, and quality assured by the National Cancer Registration and Analysis Service, which is part of Public Health England (PHE). Access to the data was facilitated by the PHE Office for Data Release. The Hospital Episode Statistics data used in this analysis are reused by permission from NHS Digital, which retains the copyright in that data. We thank the Office for National Statistics (ONS) for providing the mortality data. NHS Digital, PHE, and the ONS bear no responsibility for the analysis or interpretation of the data. We express our gratitude to Anne Rigg, Nisha Shaunak, Tom Charlton, Ana Montes, Claire Harrison, Susan Robinson, David Wrench, Matthew Streetly, Omer BenGal, Doraid Alrifai, and Rajjinder Nijjar for aiding the authors (notably PJ and JHC) with the classification of agents on the SACT dataset linkage used in this study and to David Coggon for general comments on the study protocol and interpretation.

Contributors: JHC, CC, AKC, RK, KDO, PH, and NM led study conceptualisation. All authors contributed to the development of the research question and study design, with development of advanced statistical aspects led by JHC, CC, RK, KDO, and AKC. JHC, AKC, CC, JB, and PJ were involved in data specification, curation, and collection. JHC and AKC developed, checked, or updated clinical code groups. JHC did the statistical analyses, which were checked by CC. JHC developed the software for the web calculator. All authors contributed to the interpretation of the results. AKC and JHC wrote the first draft of the paper. All authors contributed to the critical revision of the manuscript for important intellectual content and approved the final version of the manuscript. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. JHC is the guarantor.

Funding: This study is funded by a grant from the National Institute for Health Research (NIHR) following a commission by the Chief Medical Officer for England, whose office contributed to the development of the study question, facilitated access to relevant national datasets, and contributed to interpretation of data and drafting of the report.

produces open and closed source software to implement clinical risk algorithms (outside this work) into clinical computer systems; CC reports receiving personal fees from ClinRisk, outside this work; AH is a member of the New and Emerging Respiratory Virus Threats Advisory Group; PJ was employed by NHS England during the conduct of the study and has received grants from Epizyme and Janssen and personal fees from Takeda, Bristol-Myers-Squibb, Novartis, Celgene, Boehringer Ingelheim, Kite Therapeutics, Genmab, and Incyte, all outside the submitted work; AKC has previously received personal fees from Huma Therapeutics, outside of the scope of the submitted work; RL has received grants from Health Data Research UK outside the submitted work; AS has received grants from the Medical Research Council (MRC) and Health Data Research UK during the conduct of the study; CS has received grants from the DHSC National Institute of Health Research UK, MRC UK, and the Health Protection Unit in Emerging and Zoonotic Infections (University of Liverpool) during the conduct of the study and is a minority owner in Integrum Scientific LLC (Greensboro, NC, USA) outside of the submitted work; KK has received grants from NIHR, is the national lead for ethnicity and diversity for the National Institute for Health Applied Research Collaborations, is director of the University of Leicester Centre for Black Minority Ethnic Health, was a steering group member of the Risk reduction Framework for NHS staff (chair) and for Adult care Staff, is a member of Independent SAGE, and is supported by the NIHR Applied Research Collaboration East Midlands (ARC EM) and the NIHR Leicester Biomedical Research Centre (BRC); RHK was supported by a UKRI Future Leaders Fellowship (MR/S017968/1); KDO was supported by a grant from the Alan Turing Institute Health Programme (EP/T001569/1); no other relationships or activities that could appear to have influenced the submitted work. The views expressed are those of the author(s) and not necessarily those of the NIHR, the NHS, or the Department of Health and Social Care.

Data sharing: To guarantee the confidentiality of personal and health information, only the authors have had access to the data during the study in accordance with the relevant licence agreements. Access to the QResearch data is according to the information on the QResearch website (www.qresearch.org).

The lead author affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.

Dissemination to participants and related patient and public communities: Patient representatives from the QResearch Advisory Board have advised on dissemination of studies using QResearch data, including the use of lay summaries describing the research and its findings.

Provenance and peer review: Not commissioned; externally peer reviewed.

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/.