key: cord-155530-fz9l7rc7
authors: Pierson, Emma
title: Assessing racial inequality in COVID-19 testing with Bayesian threshold tests
date: 2020-11-02
journal: nan
DOI: nan
sha:
doc_id: 155530
cord_uid: fz9l7rc7

There are racial disparities in the COVID-19 test positivity rate, suggesting that minorities may be under-tested. Here, drawing on the literature on statistically assessing racial disparities in policing, we 1) illuminate a statistical flaw, known as infra-marginality, in using the positivity rate as a metric for assessing racial disparities in under-testing; 2) develop a new type of Bayesian threshold test to measure disparities in COVID-19 testing and 3) apply the test to measure racial disparities in testing thresholds in a real-world COVID-19 dataset.

A widely used metric in monitoring COVID-19 outbreaks is the positivity rate, defined as the fraction of COVID-19 tests which are positive (Johns Hopkins Coronavirus Resource Center, 2020; World Health Organization, 2020). A low positivity rate suggests that an area has enough testing to properly monitor its outbreak; a high positivity rate suggests under-testing. In the United States, there are large racial disparities in COVID-19 cases and deaths per capita (Oppel Jr. et al., 2020), prompting recommendations that the positivity rate be reported broken down by race (Wen and Sadeghi, 2020; Servick, 2020). While this data is not yet systematically available on a national level, the data that is available reveals large racial/ethnic disparities (Figure 1), consistent with prior work (Martinez et al., 2020; Cordes and Castro, 2020; Bilal et al., 2020). The Black positivity rate is higher in all states with data than the white positivity rate; similarly, the Hispanic positivity rate is higher in all states than the non-Hispanic positivity rate, suggesting under-testing of Black and Hispanic populations. Motivated by these racial disparities in the positivity rate, in this work we make three contributions.
First, we illuminate a statistical flaw, known as infra-marginality, in using the positivity rate as a metric for assessing racial disparities in under-testing, drawing on the literature on measuring racial disparities in policing. Second, we describe how a Bayesian threshold test approach, which has been used to measure racial disparities in policing, can be used to measure disparities in COVID-19 testing, and develop a version of the test suitable for the COVID-19 setting. Third, we use the test to measure racial disparities in testing thresholds in a real-world COVID-19 dataset. We conclude by discussing broader applications of threshold tests in medicine.

Breaking down the COVID-19 positivity rate by race is an example of an "outcome test", a widely used technique for measuring racial bias in decision-making by examining the outcomes of decisions (Ayres, 2002; Becker, 1993; Carr and Megbolugbe, 1993). Outcome tests have been applied in diverse domains from policing to lending. In policing, a frequently used outcome is whether a police search finds contraband: if searches of white drivers find contraband 90% of the time, but searches of Black drivers find contraband only 10% of the time, it suggests that the police are searching white drivers only when they're very likely to be carrying contraband.

But the previous literature on outcome tests also illuminates a problem with simply examining the positivity rate, called infra-marginality (Ayres, 2002), which we explain by adapting an example from Simoiu et al. (2017). Imagine that there are two races, white and Black, and within each race there are two equally sized groups: one that is very unlikely to have COVID-19, and one that is quite likely. Imagine these groups are easy to tell apart: one group is showing COVID-19 symptoms, for example, and one group is asymptomatic. 5% of the asymptomatic patients have COVID-19, regardless of their race.
50% of the white symptomatic patients have COVID-19, and 75% of the Black symptomatic patients have COVID-19. Finally, imagine there is no racial bias in who is tested: everyone who is more than 10% likely to have COVID-19 is tested, regardless of race, so the same probability threshold is applied to both races. All symptomatic patients will be tested, producing a positivity rate of 50% for white patients and 75% for Black patients. We will incorrectly conclude from the higher positivity rate among Black patients that they are being under-tested relative to whites (that is, tested only when they are more likely to have COVID-19). But in fact, in this hypothetical, everyone faces the same 10% testing threshold. We reach this misleading conclusion because the statistic we're measuring, the positivity rate, is not the same as the probability threshold at which patients are being tested. (We note that positivity rate analysis can also yield a misleading result in the opposite direction, where it fails to show racial disparities even though there are disparities in testing thresholds.)

In general, if two race groups have very different risk distributions (in the hypothetical example above, the Black risk distribution is right-shifted), simply looking at the positivity rate may yield misleading conclusions. Figure 2 illustrates this graphically for continuous risk distributions. In the case of COVID-19, infra-marginality is not a hypothetical concern: per capita infection rates are much higher in Black populations than white populations, so it is plausible that there might be dramatic differences in the risk distributions.

The threshold at which people are tested is hard to measure: unlike the positivity rate, it's not a simple fraction directly observable from the data, but a latent quantity that must be inferred.
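The arithmetic of the hypothetical example can be checked in a few lines. The Python sketch below is illustrative only: the `Subgroup` class and the `positivity_rate` function are names introduced here, and the inputs are just the numbers of the hypothetical (equal-sized subgroups, a shared 10% threshold).

```python
from dataclasses import dataclass


@dataclass
class Subgroup:
    share: float  # fraction of the race group that belongs to this subgroup
    risk: float   # probability that a member of the subgroup has COVID-19


def positivity_rate(subgroups, threshold):
    """Positivity rate among people whose risk exceeds the testing threshold."""
    tested = [s for s in subgroups if s.risk > threshold]
    tested_share = sum(s.share for s in tested)
    if tested_share == 0:
        return float("nan")  # nobody is tested, so the rate is undefined
    return sum(s.share * s.risk for s in tested) / tested_share


# Asymptomatic and symptomatic subgroups, as in the hypothetical example.
white = [Subgroup(share=0.5, risk=0.05), Subgroup(share=0.5, risk=0.50)]
black = [Subgroup(share=0.5, risk=0.05), Subgroup(share=0.5, risk=0.75)]

THRESHOLD = 0.10  # the same testing threshold for both groups

white_rate = positivity_rate(white, THRESHOLD)
black_rate = positivity_rate(black, THRESHOLD)
print(f"white positivity: {white_rate:.2f}, Black positivity: {black_rate:.2f}")
```

Both groups face an identical threshold, yet the positivity rates differ (0.50 versus 0.75), which is exactly the infra-marginality problem.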
Threshold tests attempt to infer both the thresholds and risk distributions and thereby circumvent the problem of infra-marginality. Threshold tests for policing, proposed in Simoiu et al. (2017) and applied in Pierson et al. (2018, 2020), use a Bayesian model to simultaneously infer the race- and location-specific risk distributions and thresholds. For brevity, we refer the reader to Simoiu et al. (2017) for a description of the original threshold test in the context of policing, and here describe only our adapted generative model for the COVID-19 context; Appendix A.2 details further how the two models differ.

Figure 2: Hypothetical example illustrating that if two racial groups (red and blue lines) have very different distributions of COVID-19 risk, the same testing threshold produces different distributions above the threshold, and therefore different positivity rates. (x-axis: probability of having COVID-19, 0% to 40%; curves: Group 1 risk, Group 2 risk; vertical line: threshold.)

Observed data. We assume that we observe three pieces of information for all race groups $r$ and counties $d$: the population of the race group in the county, $n_{rd}$; the cumulative number of COVID-19 tests for the race group in the county, $t_{rd}$; and the cumulative number of COVID-19 cases (positive tests) for the race group in the county, $c_{rd}$.

Generative model. On each day, the probability $p$ that a person of race $r$ in county $d$ has COVID-19 is drawn from a race- and county-specific risk distribution, a probability distribution on $[0, 1]$. $p$ represents the probability a person has COVID-19 given their relevant observable characteristics (for example, whether they are coughing and whether they have recently been to large gatherings). Each person gets tested if $p$ exceeds a race- and county-specific testing threshold, $z_{rd}$. We let $P_{rd}$ denote the random variable corresponding to the risk distribution for each race and county.
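To make the generative story concrete, here is a minimal person-level simulation sketch. It is not the fitted model: it assumes a Beta risk distribution purely for convenience (the model itself uses discriminant distributions), and it assumes that a tested person with risk p tests positive with probability p. The function name `simulate_group`, the Beta parameters, and the 10% threshold are all hypothetical.

```python
import random


def simulate_group(n_people, alpha, beta, threshold, seed=0):
    """Simulate one race/county group under a threshold testing rule.

    Each person's risk p is drawn from Beta(alpha, beta) (an assumed
    stand-in for the risk distribution). The person is tested if
    p > threshold, and a tested person is positive with probability p.
    Returns (number tested, number positive).
    """
    rng = random.Random(seed)
    n_tested = 0
    n_positive = 0
    for _ in range(n_people):
        p = rng.betavariate(alpha, beta)
        if p > threshold:
            n_tested += 1
            if rng.random() < p:
                n_positive += 1
    return n_tested, n_positive


N_PEOPLE = 100_000
THRESHOLD = 0.10  # identical threshold for both hypothetical groups

# Group B's risk distribution is shifted right relative to group A's.
tested_a, pos_a = simulate_group(N_PEOPLE, alpha=1.0, beta=9.0, threshold=THRESHOLD, seed=1)
tested_b, pos_b = simulate_group(N_PEOPLE, alpha=2.0, beta=8.0, threshold=THRESHOLD, seed=2)

print(f"group A: tested {tested_a / N_PEOPLE:.3f}, positivity {pos_a / tested_a:.3f}")
print(f"group B: tested {tested_b / N_PEOPLE:.3f}, positivity {pos_b / tested_b:.3f}")
```

Under these assumed parameters the two groups share a threshold but differ in both the fraction tested and the positivity rate, which is why the threshold has to be inferred jointly with the risk distribution rather than read off the positivity rate.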
The probability a person of a given race in a given county will get tested, $f_{rd}$, is the proportion of the risk distribution that lies above the threshold, $\Pr(P_{rd} > z_{rd})$, that is, the complementary cumulative distribution function of the risk distribution evaluated at the threshold. The probability a test will be positive, $g_{rd}$, is the expected value of the risk distribution conditional on being above the threshold: $E(P_{rd} \mid P_{rd} > z_{rd})$. The observed data are drawn as follows:

$$t_{rd} \sim \mathrm{Poisson}(n_{rd} f_{rd}), \qquad c_{rd} \sim \mathrm{Binomial}(t_{rd}, g_{rd}).$$

The latent parameters of the model are the thresholds $z_{rd}$ and the parameters of the risk distributions. Following Pierson et al. (2018, 2020), we parameterize the risk distributions as discriminant distributions, which are two-parameter distributions on $[0, 1]$ that facilitate fast inference in this setting. We allow the risk distributions to vary by race and by county to accommodate the fact that the true prevalence of COVID-19 can vary by race and county. To complete the Bayesian specification, we must place priors on the latent parameters, which we describe in our full model specification, available online. We infer posteriors over the latent parameters using Hamiltonian Monte Carlo (Neal, 1994), implemented in the probabilistic programming language Stan (Carpenter et al., 2017).

We fit the model to cumulative COVID-19 test and case count data through August 16, 2020 in the US state of Indiana, broken down by race and county (further data details in Appendix A). We infer testing thresholds for non-Hispanic Black, non-Hispanic white, and Hispanic populations.

The primary latent parameters of interest are the inferred testing thresholds $z_{rd}$ for each race and county; we plot these in Figure 3. Inferred thresholds for minorities are generally higher than those for whites in the same county, suggesting that minorities are under-tested relative to whites: that is, tested only when they have a higher probability of having COVID-19.
Consistent with this, the raw positivity rates (Figure A1) also show racial disparities, but they are less consistent, and this analysis is not robust to infra-marginality. (Appendix A includes additional model results. Figure A2 plots the inferred risk distributions, illustrating that there are indeed differences across race groups. Figure A3 plots posterior predictive checks, a standard check in Bayesian inference; Table A1 shows that our main results remain robust across alternate specifications.)

There are broader potential applications of threshold tests both in COVID-19 and in other medical conditions. While we focus here on racial disparities in COVID-19, one could assess COVID-19 under-testing across locations or across other demographic dimensions like age. Beyond COVID-19, threshold tests could be used to measure racial disparities in under-testing and under-diagnosis in medicine more broadly, an issue of known concern in cardiac conditions Schulman et al.

World Health Organization. Public health criteria to adjust public health and social measures in the context of COVID-19. 2020.

3. Subtract ethnicity. We use the raw counts for Black and Hispanic cases/tests, and for whites, subtract the number of Hispanic cases/tests. This method does not attempt to account for missing data.

We filter for counties with Black and Hispanic populations of at least 500 to ensure that they have large enough minority populations to be able to meaningfully assess disparities. This filter retains counties containing 87% of the Hispanic population and 98% of the Black population.

Our full model specification, which includes priors on all parameters and the parameterization of the thresholds and risk distributions, is available online, along with all the code to reproduce our results. Here, we briefly detail how our model differs from previous threshold models. The primary difference between the original threshold model for police searches detailed in Simoiu et al.
(2017), and our threshold model for COVID-19 tests, is that the policing model measures disparities only among stopped drivers, whereas the COVID-19 model measures disparities in the entire population, and must therefore model and make use of population information (e.g., from Census data). While the number of police searches cannot exceed the number of police stops (and the original authors therefore model the number of searches as a Binomial draw from the number of stops), the number of COVID-19 tests in a county can exceed the number of people in a location (since each person can be tested multiple times), so a Binomial model is unsuitable.

Pierson et al. (2018) propose a version of the threshold test which incorporates population information, but makes use of only the proportion rather than the absolute population of each race group in each location: e.g., the population information provided to their model is that "in County X, 40% of people are Hispanic, 20% are white, and 40% are Black". This is unsuitable for our setting, because intuitively our inferences about COVID-19 testing threshold and prevalence should be very different in a county with 100 people and 10 tests, compared to a county with 10,000 people and 10 tests, even if the relative fractions of each race group remain constant.

In the COVID-19 setting, we incorporate population information by modelling the number of tests for each race group in each county as a Poisson draw (whose rate parameter is proportional to the population of that race group in that county). Our use of a Poisson bears some similarity to the Poisson regression setting, in which a Poisson whose rate parameter depends on covariates is used to model rates in a population; a natural direction for future work is to extend our model to accommodate overdispersion via, e.g., a quasi-Poisson or negative binomial model (Gardner et al., 1995).
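The two points above (tests can exceed population, and absolute population matters) can be illustrated numerically. The sketch below is an illustration under stated assumptions, not the paper's Stan code: the helper names `poisson_log_pmf` and `binomial_log_pmf`, the per-capita testing probabilities, and the county sizes are hypothetical, with the sizes borrowed from the 100-versus-10,000 example.

```python
import math


def poisson_log_pmf(k, rate):
    """Log-probability of observing k events under Poisson(rate), rate > 0."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def binomial_log_pmf(k, n, prob):
    """Log-probability of k successes in n trials, for 0 < prob < 1."""
    if k > n:
        return float("-inf")  # impossible: more successes than trials
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(prob) + (n - k) * math.log1p(-prob)


# 1) Tests can exceed population because people are retested.
population, tests = 100, 120
binom_ll = binomial_log_pmf(tests, population, prob=0.5)  # -inf
pois_ll = poisson_log_pmf(tests, rate=population * 1.0)   # finite

# 2) The same 10 tests mean different things in counties of different size.
TESTING_PROB = 0.1  # hypothetical per-capita testing probability
small_county_ll = poisson_log_pmf(10, rate=100 * TESTING_PROB)
large_county_ll = poisson_log_pmf(10, rate=10_000 * TESTING_PROB)

print(f"Binomial log-lik, 120 tests in 100 people: {binom_ll}")
print(f"Poisson  log-lik, 120 tests in 100 people: {pois_ll:.2f}")
print(f"10 tests, 100 people:    {small_county_ll:.2f}")
print(f"10 tests, 10,000 people: {large_county_ll:.2f}")
```

With the rate tied to absolute population, 10 tests are plausible at this testing probability in the small county and wildly implausible in the large one, so the likelihood pushes the inferred testing probability down in the large county. A model that sees only race proportions cannot make that distinction.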
As a further robustness check (Table A1), we ensure that our primary results remain robust when we replace the Poisson with a Binomial distribution (which is similar to the original specification in Simoiu et al. (2017)), even though the latter makes less sense conceptually.

Figure A1: Positivity rates by county in Indiana. While, as with the thresholds, positivity rates are generally higher for minorities than for whites in the same county, this analysis is not robust to infra-marginality and disparities emerge slightly less consistently: for example, in 5 counties, the Black positivity rate is lower than the white positivity rate. (Panel label: Hispanic positivity rate.)

Figure A3: Posterior predictive checks, a standard check for model fit (Gelman et al., 1996). We compare observed and predicted quantities for tests per capita (top plot) and positivity rate (bottom plot). The x-axis plots the observed quantity, and the y-axis plots the error (i.e., the difference between the observed values and the model-predicted values). Points are sized proportional to the number of tests for the race group and county. Across all race groups, errors are small and there is a lack of systematic bias, validating model fit.

Table A1: Robustness checks on model and data processing. Each row reports inferred thresholds (weighted mean across counties, weighting by total tests conducted in the county) for one specification: one data processing method (first column) and one model (second column), with the primary specification which is used to generate all results in the paper in the first row. The next three columns report the inferred thresholds (reporting the mean of the posterior MCMC draws, followed by 95% confidence interval in parentheses). The final two columns provide the ratio of minority thresholds to white thresholds.
For all specifications, thresholds for Black and Hispanic populations are higher than those for whites. For more details, see Appendix A.

Outcome tests of racial disparities in police practices
Nobel lecture: The economic way of looking at behavior
Spatial Inequities in COVID-19 outcomes in Three US Cities. medRxiv

Thanks to Serina Chang, Irene Chen, Sam Corbett-Davies, Pang Wei Koh, Lester Mackey, Ziad Obermeyer, Nat Roth, Leah Pierson, Miriam Pierson, Jacob Steinhardt, and seminar attendees for helpful conversations, and to Jaline Gerardin, Alexis Madrigal, and Albert Sun for data assistance.

Appendix A. We fit the model to cumulative COVID-19 test and case count data through August 16, 2020 in the US state of Indiana, broken down by race and county. We chose Indiana because it was one of the few states which made the requisite data available. We infer county-specific COVID-19 testing thresholds for non-Hispanic white, non-Hispanic Black, and Hispanic populations.

Indiana reports data (test and case counts) aggregated by race (e.g., white or Black), and aggregated by ethnicity (Hispanic or non-Hispanic), but not data aggregated by both at once. A second caveat is that there is significant missing race/ethnicity data: the median county has ethnicity data for only about half of cases and tests, and race data for 80-90% of cases and tests. Due to these two caveats, there are multiple potential ways of processing the raw data to produce the data we use to actually fit the model. As a sensitivity analysis, we process the data three different ways, and verify that our main conclusion (that minorities face higher testing thresholds) remains robust across all three specifications (Table A1).

1. Original specification: We assume that race and ethnicity are independent (e.g., the fraction of whites who are Hispanic is the same as the fraction of Blacks who are Hispanic).
We define the number of non-Hispanic white cases as $w \, (1 - f_{\mathrm{hisp}})$, where $w$ is the number of white cases in the raw data, $f_{\mathrm{hisp}}$ is the proportion of