key: cord-0216093-4kkzi70v
authors: Manski, Charles F.; Molinari, Francesca
title: Estimating the COVID-19 Infection Rate: Anatomy of an Inference Problem
date: 2020-04-13
journal: nan
DOI: nan
sha: d9bf11dd9fdf075a26144b7c816440092200b4f0
doc_id: 216093
cord_uid: 4kkzi70v

As a consequence of missing data on tests for infection and imperfect accuracy of tests, reported rates of population infection by the SARS CoV-2 virus are lower than actual rates of infection. Hence, reported rates of severe illness conditional on infection are higher than actual rates. Understanding the time path of the COVID-19 pandemic has been hampered by the absence of bounds on infection rates that are credible and informative. This paper explains the logical problem of bounding these rates and reports illustrative findings, using data from Illinois, New York, and Italy. We combine the data with assumptions on the infection rate in the untested population and on the accuracy of the tests that appear credible in the current context. We find that the infection rate might be substantially higher than reported. We also find that the infection fatality rate in Italy is substantially lower than reported.

It is well appreciated that accurate characterization of the time path of the coronavirus pandemic has been hampered by a serious problem of missing data. Confirmed cases have commonly been measured by rates of positive findings among persons who have been tested for infection. Infection data are missing for persons who have not been tested. It is also well-appreciated that the persons who have been tested differ considerably from those who have not been tested. Criteria used to determine who is eligible for testing typically require demonstration of symptoms associated with presence of infection or close contact with infected persons. This gives considerable reason to believe that some fraction of untested persons are asymptomatic or pre-symptomatic carriers of the COVID-19 disease. Presuming this is correct, the actual rate of infection has been higher than the reported rate.

It is perhaps less appreciated that available measurement of confirmed cases is imperfect because the prevalent tests for infection are not fully accurate. There is basis to think that accuracy is highly asymmetric. Various sources suggest that the positive predictive value (the probability that, conditional on testing positive, an individual is indeed infected) of the tests in use is close to one. However, it appears that the negative predictive value (the probability that, conditional on testing negative, the individual is indeed not infected) may be substantially less than one. Presuming this asymmetry, the actual rate of infection has again been higher than the reported rate.

Combining the problems of missing data and imperfect test accuracy yields the conclusion that reported rates of infections are lower than actual rates. Reported rates of infection have been used as the denominator for computation of rates of severe disease conditional on infection, measured by rates of hospitalization, treatment in intensive care units (ICUs), and death. Presuming that the numerators in rates of severe illness conditional on infection have been measured accurately, reported rates of severe illness conditional on infection are higher than actual rates.

On March 3, 2020 the Director General of the World Health Organization (WHO) stated: 1 "Globally, about 3.4% of reported COVID-19 cases have died." It is tempting to interpret the 3.4% number as the actual case-fatality ratio (CFR). However, if deaths have been recorded accurately and if the actual rate of infection has been higher than the reported rate, the WHO statistic should be interpreted as an upper bound on the actual CFR on that date. Recognizing this, researchers have recommended random testing of populations as a potential future method to solve the missing data problem. 2 In the present absence of random testing, various researchers have put forward point estimates and forecasts for infection rates and rates of severe illness derived in various ways. Work performed by separate groups of epidemiologists at the Imperial College COVID-19 Response Team and the University of Washington's Institute for Health Metrics and Evaluation has received considerable public attention. 3 The available estimates and forecasts differ in the assumptions that they use to yield specific values. The assumptions vary substantially and so do the reported findings. To date, no particular assumption or resulting estimate has been thought sufficiently credible as to achieve consensus across researchers.

We think it misguided to report point estimates obtained under assumptions that are not well justified.

We think it more informative to determine the range of infection rates and rates of severe illness implied by a credible spectrum of assumptions. In some disciplines, research of this type is called sensitivity analysis. A common practice has been to obtain point estimates under alternative strong assumptions. A problem with sensitivity analysis as usually practiced is that, in many applications, none of the strong assumptions entertained has a good claim to realism.

Rather than perform traditional sensitivity analysis, this paper brings to bear econometric research on partial identification. Study of partial identification analysis removes the focus on point estimation obtained under strong assumptions. Instead it begins by posing relatively weak assumptions that should be highly credible in the applied context under consideration. Such weak assumptions generally imply set-valued estimates rather than point estimates. Strengthening the initial weak assumptions shrinks the size of the implied set estimate. The formal methodological problem is to determine the set estimate that logically results when available data are combined with specified assumptions. See Manski (1995, 2003, 2007) for monograph expositions at different technical levels. See Tamer (2010) and Molinari (2020) for review articles.

1 https://www.who.int/dg/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19---3-march-2020

2 See, for example, https://www.statnews.com/2020/03/17/a-fiasco-in-the-making-as-the-coronavirus-pandemic-takes-hold-we-are-making-decisions-without-reliable-data/

3 See https://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/covid-19/ and http://www.healthdata.org/ respectively.

Considering estimation of the infection rate for the coronavirus, combining available data with credible assumptions easily yields a lower bound on the infection rate. The harder problem is to determine an upper bound that is both credible and informative, so as to obtain an interval estimate that is credible and informative. This paper explains the logic of the identification problem, determines the identifying power of some credible assumptions, and reports illustrative set estimates.

We analyze data from Illinois, New York, and Italy, over the period March 16 to April 6, 2020. We impose weak monotonicity assumptions on the rate of infection in the untested sub-population to draw credible conclusions about the population infection rate. We find that the infection rates as of April 6, 2020, for Illinois, New York, and Italy are, respectively, bounded in the intervals [0.001, 0.517], [0.008, 0.645], and [0.003, 0.510]. Further analyzing the rates of hospitalization, treatment in an intensive care unit, and death in Italy, we find that as of April 6 the rates for these severe outcomes are bounded, respectively, in the intervals [0.001, 0.172], [0, 0.02], and [0.001, 0.086]. The upper bound on the fatality rate is substantially lower than that among confirmed infected individuals, which was 0.125 on April 6.

We first address the basic problem of bounding the infection rate as of a specified date. The analysis initially considers the problem abstractly and then derives bounds under particular credible assumptions.

We next show how bounding the infection rate yields a bound on the rate of severe illness conditional on infection. We then extend the analysis to bound the rates conditional on observed patient characteristics.

Knowledge of all of these rates is important to inform public health policy.

Consider a specified population of persons who are alive at the onset of the pandemic. This may, for example, be the population of a city, state, or nation. Let the objective be to determine the fraction of the population who are infected by the SARS-CoV-2 virus by a specified date d. Synonymously, this is the fraction of the population who experience onset of the COVID-19 disease by date d.

The present analysis assumes that a person can have at most one disease episode. If a person has been infected, the person either achieves immunity after recovery or dies. The assumption that immunity is achieved after recovery, at least for some period of time, is consistent with current knowledge of the disease.

We also assume that the size of the population is stable over the time period of interest. Thus, we abstract from the fact that, as time passes, deaths from the disease and other causes reduce the size of the population, births increase the size, and migration may reduce or increase the size on net.

Let Cd = 1 if a person has been infected by the coronavirus by date d and Cd = 0 otherwise. The objective is to determine P(Cd = 1), the probability that a member of the population has been infected by date d. Equivalently, P(Cd = 1) is the infection rate or the fraction of persons who have experienced disease onset. Our analysis of the problem of inference on P(Cd = 1) is a simple extension of ideas that have been used regularly in the literature on partial identification, beginning with study of inference with missing outcome data (Manski, 1989). P(Cd = 1) is not directly observable. However, population surveillance systems provide daily data on two quantities related to P(Cd = 1). These are the rate of testing for infection and the rate of positive results among those tested. To simplify analysis, we assume that a person is tested at most once by date d. This assumption may not be completely accurate, for reasons that will be explained later.

Let Td = 1 if a person has been tested by date d and Td = 0 otherwise. Let Rd = 1 if a person has received a positive test result by date d and Rd = 0 otherwise. Observe that Td = 0 ⇨ Rd = 0 and Rd = 1 ⇨ Td = 1. By the Law of Total Probability, the infection rate may be written as follows:

(1) P(Cd = 1) = P(Cd = 1|Rd = 1)P(Rd = 1) + P(Cd = 1|Rd = 0)P(Rd = 0),

where

(2) P(Rd = 1) = P(Rd = 1|Td = 1)P(Td = 1),

(3) P(Rd = 0) = P(Td = 0) + P(Rd = 0|Td = 1)P(Td = 1),

(4) P(Cd = 1|Rd = 0)P(Rd = 0) = P(Cd = 1, Rd = 0) = P(Cd = 1|Td = 0)P(Td = 0) + P(Cd = 1|Td = 1, Rd = 0)P(Rd = 0|Td = 1)P(Td = 1).

Now consider each of the component quantities that together determine the infection rate. Assuming that reporting of testing is accurate, daily surveillance reveals the testing rate and the rate of positive results among those tested. Thus, the quantities P(Td = 0), P(Td = 1), P(Rd = 0|Td = 1), and P(Rd = 1|Td = 1) are directly observable. The remaining quantities are not directly observable.

The quantities P(Cd = 1|Rd = 1) and P(Cd = 1|Td = 1, Rd = 0) are determined by the accuracy of testing.

The former is the positive predictive value (PPV) and the latter is one minus the negative predictive value (NPV). We note that medical researchers and clinicians often measure test accuracy in a different way, through test sensitivity and specificity. The sensitivity and specificity of tests for COVID-19 on the tested sub-population are P(Rd = 1|Td = 1, Cd = 1) and P(Rd = 0|Td = 1, Cd = 0) respectively. Sensitivity and specificity are related to PPV and NPV through Bayes Theorem, whose application generally requires knowledge of P(Cd = 1|Td = 1), the infection rate in the tested sub-population. An exception to this generalization is that PPV equals one if and only if specificity equals one, whenever P(Cd = 1|Td = 1) > 0.
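The Bayes Theorem relation between the two ways of measuring accuracy can be sketched numerically. The following minimal Python example is an illustration only and is not part of the original analysis. The function name `predictive_values` and all input values are hypothetical, and `prevalence` stands for P(Cd = 1|Td = 1).

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) in the tested sub-population via Bayes Theorem.

    sensitivity = P(Rd = 1|Td = 1, Cd = 1)
    specificity = P(Rd = 0|Td = 1, Cd = 0)
    prevalence  = P(Cd = 1|Td = 1), assumed to lie strictly between 0 and 1
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")

    # Joint probabilities of (test result, infection status) among the tested.
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)

    if true_pos + false_pos == 0.0 or true_neg + false_neg == 0.0:
        raise ValueError("PPV or NPV is undefined: a test result has probability zero")

    ppv = true_pos / (true_pos + false_pos)  # P(Cd = 1|Td = 1, Rd = 1)
    npv = true_neg / (true_neg + false_neg)  # P(Cd = 0|Td = 1, Rd = 0)
    return ppv, npv


# Hypothetical inputs: perfect specificity, imperfect sensitivity.
ppv, npv = predictive_values(sensitivity=0.7, specificity=1.0, prevalence=0.2)
print(ppv, npv)
```

With specificity equal to one there are no false positives, so the PPV equals one at any positive prevalence, while the NPV remains below one and varies with the prevalence.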

Medical experts believe that the PPV of the prevalent tests for COVID-19 is close to one, but that NPV may be considerably less than one. We have obtained this information in part from personal communication. We therefore find it credible to assume that P(Cd = 1|Rd = 1) = 1. It can be shown that this is equivalent to assuming that test specificity P(Rd = 0|Td = 1, Cd = 0) = 1. The final sentence of the Breining quote explains why it may not be completely accurate to assume that persons are tested at most once, but we maintain this assumption for simplicity.

There does not presently appear to be a firm basis to determine the precise NPV of the prevalent nasal-swab tests, but there may be a basis to determine a credible bound. Medical experts have been cited as believing that the rate of false-negative test findings is at least 0.3. However, it is not clear whether they have in mind one minus the NPV or one minus test sensitivity. 5 One may perhaps find it credible to extrapolate from experience testing for influenza to testing for COVID-19. For example, Peci et al. (2014) study the performance of rapid influenza diagnostic testing. They find a PPV of 0.995 and an NPV of 0.853.

It is not clear whether NPV has been constant over the short time period we study or, contrariwise, has varied as testing methods and the subpopulation of tested persons change over time. The NPV may also vary over longer periods if the virus mutates significantly. The illustrative results that we report later assume that NPV is in the range [0.6, 0.9], implying that P(Cd = 1|Td = 1, Rd = 0) ∊ [0.1, 0.4].

It remains to consider P(Cd = 1|Td = 0), the rate of infection among those who have not been tested. This quantity has been the subject of much discussion, with substantial uncertainty expressed about its value. It may be that the value changes over time as criteria for testing people evolve and testing becomes more common. The illustrative results that we report later show numerically how the conclusions one can draw about P(Cd = 1) depend on the available knowledge of P(Cd = 1|Td = 0).

To finalize the logical derivation of a bound on P(Cd = 1), let [Ld0, Ud0] and [Ld10, Ud10] denote credible lower and upper bounds on P(Cd = 1|Td = 0) and P(Cd = 1|Td = 1, Rd = 0) respectively. Now combine these bounds with the assumption that P(Cd = 1|Rd = 1) = 1 and with empirical knowledge of the testing rate and the rate of positive test results. Then equations (1)-(4) imply this bound on the population infection rate:

(5) P(Rd = 1) + Ld0P(Td = 0) + Ld10P(Rd = 0|Td = 1)P(Td = 1) ≤ P(Cd = 1) ≤ P(Rd = 1) + Ud0P(Td = 0) + Ud10P(Rd = 0|Td = 1)P(Td = 1).

The width of bound (5) is

(6) (Ud0 − Ld0)P(Td = 0) + (Ud10 − Ld10)P(Rd = 0|Td = 1)P(Td = 1).

Inspection of (6) shows that uncertainty about test accuracy and about the infection rate in the untested subpopulation, measured by Ud10 − Ld10 and Ud0 − Ld0, combine linearly to yield uncertainty about the population infection rate. The fractions P(Td = 1) and P(Td = 0) of the population who have and have not been tested linearly determine the relative contributions of the two sources of uncertainty.
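Bound (5) and its width can be sketched as a short computation. The Python function below is a minimal illustration under stated assumptions, not the authors' code. The function name `infection_rate_bound` and the numerical inputs are hypothetical; the inputs `l0`, `u0`, `l10`, `u10` stand for Ld0, Ud0, Ld10, Ud10.

```python
def infection_rate_bound(p_tested, p_pos_given_tested, l0, u0, l10, u10):
    """Return (lower, upper) for P(Cd = 1) as in bound (5).

    p_tested           = P(Td = 1)
    p_pos_given_tested = P(Rd = 1|Td = 1)
    [l0, u0]           = bound on P(Cd = 1|Td = 0)
    [l10, u10]         = bound on P(Cd = 1|Td = 1, Rd = 0)
    Assumes P(Cd = 1|Rd = 1) = 1, as maintained in the text.
    """
    p_untested = 1.0 - p_tested                   # P(Td = 0)
    p_neg_given_tested = 1.0 - p_pos_given_tested # P(Rd = 0|Td = 1)
    p_pos = p_pos_given_tested * p_tested         # P(Rd = 1), equation (2)

    lower = p_pos + l0 * p_untested + l10 * p_neg_given_tested * p_tested
    upper = p_pos + u0 * p_untested + u10 * p_neg_given_tested * p_tested
    return lower, upper


# Hypothetical inputs: 1.2% of the population tested, 20% of tests positive,
# no knowledge about the untested, and one minus NPV in [0.1, 0.4].
lower, upper = infection_rate_bound(
    p_tested=0.012, p_pos_given_tested=0.2, l0=0.0, u0=1.0, l10=0.1, u10=0.4
)
width = upper - lower
print(lower, upper, width)
```

With these inputs almost all of the width comes from the term for the untested sub-population, because P(Td = 0) is close to one.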

As of early April 2020, the fraction of the population who have been tested is very small in most locations. For example, the fraction who have been tested by April 6, 2020 was about 0.005 in Illinois, 0.017 in New York, and 0.012 in Italy; see Section 3 for details on the data sources. Hence, the present dominant concern is uncertainty about the infection rate in untested sub-populations. We now consider the problem of obtaining a credible bound on this quantity. We judge the current situation to be intermediate between the worst and best case scenarios. We are aware of no credible way to assign a precise value to P(Cd = 1|Td = 0), nor even to place a tight bound on the quantity. On the other hand, it is too pessimistic to view society as having no relevant information. Two monotonicity assumptions are highly credible in the current context.

Present criteria for testing persons for infection by the coronavirus commonly require the person to display symptoms of infection or to have been in close contact with someone who has tested positive. These criteria strongly suggest that, as of each date d, the infection rate among tested persons is higher than the rate among untested persons. This yields the testing-monotonicity assumption

(7) P(Cd = 1|Td = 0) ≤ P(Cd = 1|Td = 1).

Observe that if testing for infection were random rather than determined by the current criteria, it would be credible to impose a much stronger assumption, namely P(Cd = 1|Td = 0) = P(Cd = 1|Td = 1). However, testing clearly has not been random. Hence, we only impose assumption (7).

Research on partial identification has often exploited monotonicity assumptions similar to (7), beginning with Manski and Pepper (2000). To use the assumption in the present setting, consider the quantity P(Cd = 1|Td = 1). The Law of Total Probability, the maintained assumption that positive test results are always accurate, and the specified upper bound on P(Cd = 1|Td = 1, Rd = 0) yield

(8) P(Cd = 1|Td = 1) = P(Rd = 1|Td = 1) + P(Cd = 1|Td = 1, Rd = 0)P(Rd = 0|Td = 1) ≤ P(Rd = 1|Td = 1) + Ud10P(Rd = 0|Td = 1).

Combining (7) and (8) yields this upper bound on P(Cd = 1|Td = 0).

(9) Ud0 = P(Rd = 1|Td = 1) + Ud10P(Rd = 0|Td = 1) = Ud10 + (1 − Ud10)P(Rd = 1|Td = 1).

Bound (9) is methodologically interesting because Ud0 is now a function of Ud10 rather than a separate quantity. It thus enhances the importance of securing an informative upper bound on P(Cd = 1|Td = 1, Rd = 0). In particular, (9) implies that Ud0 ≥ Ud10, whatever the rate P(Rd = 1|Td = 1) of positive test outcomes may be.

The monotonicity assumption does not affect the lower bound Ld0, which is zero in the absence of other information. Hence, inserting Ld0 = 0 and (9) into the bound (5) on P(Cd = 1) yields

(10) P(Rd = 1) + Ld10P(Rd = 0|Td = 1)P(Td = 1) ≤ P(Cd = 1) ≤ P(Rd = 1) + Ud10P(Rd = 0|Td = 1)P(Td = 1) + [P(Rd = 1|Td = 1) + Ud10P(Rd = 0|Td = 1)]P(Td = 0).

The width of bound (10) is

(11) (Ud10 − Ld10)P(Rd = 0|Td = 1)P(Td = 1) + [P(Rd = 1|Td = 1) + Ud10P(Rd = 0|Td = 1)]P(Td = 0).

In the present context where P(Td = 1) is very small, the width of the bound approximately equals the sum of the rate P(Rd = 1|Td = 1) of positive test results plus the product of the rate P(Rd = 0|Td = 1) of negative test results and the upper bound on P(Cd = 1|Td = 1, Rd = 0).
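The bound under testing monotonicity can be sketched in the same way. The Python example below is a minimal illustration, not the authors' code. The function name `monotone_infection_bound` is hypothetical. The testing rate 0.005 is the value quoted above for Illinois, whereas the positive-test rate 0.2 is a hypothetical input chosen only for illustration; the range [0.1, 0.4] for one minus NPV follows the illustrative NPV range [0.6, 0.9].

```python
def monotone_infection_bound(p_tested, p_pos_given_tested, l10, u10):
    """Return (lower, upper) for P(Cd = 1) as in bound (10).

    Imposes Ld0 = 0 and the upper bound Ud0 from equation (9).
    """
    p_untested = 1.0 - p_tested                    # P(Td = 0)
    p_neg_given_tested = 1.0 - p_pos_given_tested  # P(Rd = 0|Td = 1)
    p_pos = p_pos_given_tested * p_tested          # P(Rd = 1)

    # Equation (9): upper bound on the infection rate among the untested.
    u0 = p_pos_given_tested + u10 * p_neg_given_tested

    lower = p_pos + l10 * p_neg_given_tested * p_tested
    upper = p_pos + u10 * p_neg_given_tested * p_tested + u0 * p_untested
    return lower, upper


lower, upper = monotone_infection_bound(
    p_tested=0.005, p_pos_given_tested=0.2, l10=0.1, u10=0.4
)
width = upper - lower

# Approximation for small P(Td = 1):
# P(Rd = 1|Td = 1) + Ud10 * P(Rd = 0|Td = 1).
approx_width = 0.2 + 0.4 * (1.0 - 0.2)
print(lower, upper, width, approx_width)
```

In this sketch the upper bound coincides with Ud0, since the tested and untested terms of (10) are both weighted by the same quantity in (9). The exact width and the small-P(Td = 1) approximation differ only by the lower bound, which is of the order of the testing rate.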

A second form of monotonicity holds logically rather than by assumption. Our analysis thus far has only considered the infection rate by a specified date. A person who has been infected by an early date necessarily has been infected by every later date. Hence, for two dates d and d', we have the temporal monotonicity condition

(12) d' < d ⇨ P(Cd' = 1) ≤ P(Cd = 1).

See Manski and Pepper (2000). Proposition 1 of that article shows that, given a set of date-specific lower and upper bounds on the infection rate for various dates, condition (12) implies that P(Cd = 1) must be greater than or equal to the maximum of the date-specific lower bounds for all d' ≤ d. Moreover, P(Cd = 1) must be less than or equal to the minimum of the date-specific upper bounds for all d' ≥ d. 7 Applying this result to the date-specific bounds (10) yields this result:

(13) max over d' ≤ d of [P(Rd' = 1) + Ld'10P(Rd' = 0|Td' = 1)P(Td' = 1)] ≤ P(Cd = 1) ≤ min over d' ≥ d of {P(Rd' = 1) + Ud'10P(Rd' = 0|Td' = 1)P(Td' = 1) + [P(Rd' = 1|Td' = 1) + Ud'10P(Rd' = 0|Td' = 1)]P(Td' = 0)}.

The lower bound in (13) is greater than the one in (10) only if some earlier date has a larger date-specific lower bound. Likewise, the upper bound in (13) is less than the one in (10) only if some later date has a smaller date-specific upper bound. Thus, the temporal monotonicity condition may or may not have identifying power, depending on the testing data. We find that it modestly improves lower bounds with the data we use.
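The way condition (12) tightens a sequence of date-specific bounds can be sketched as a running maximum of lower bounds and a reverse running minimum of upper bounds. The Python example below is a minimal illustration; the function name `apply_temporal_monotonicity` and the bound values are hypothetical and are not taken from the data.

```python
from itertools import accumulate


def apply_temporal_monotonicity(lowers, uppers):
    """Tighten date-specific bounds using temporal monotonicity.

    lowers and uppers hold the date-specific lower and upper bounds,
    ordered from the earliest date to the latest date.
    """
    # Lower bound at date d: maximum of the lower bounds over dates d' <= d.
    tightened_lowers = list(accumulate(lowers, max))
    # Upper bound at date d: minimum of the upper bounds over dates d' >= d,
    # computed as a running minimum over the reversed sequence.
    tightened_uppers = list(accumulate(reversed(uppers), min))[::-1]
    return tightened_lowers, tightened_uppers


# Hypothetical date-specific bounds for four consecutive dates.
lowers = [0.001, 0.003, 0.002, 0.004]
uppers = [0.50, 0.55, 0.52, 0.60]
new_lowers, new_uppers = apply_temporal_monotonicity(lowers, uppers)
print(new_lowers)
print(new_uppers)
```

In this hypothetical sequence the third lower bound rises to the second date's value, and the second upper bound falls to the third date's value; the other bounds are unchanged.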

7 Proposition 1 of Manski and Pepper (2000) shows that this bound is sharp. That is, it is the tightest bound achievable with the available information. Molinari (2020, Section 2.1) shows that it is a more complex matter to obtain sharp bounds for functions of the infection rate that vary with time.

We are presently unaware of other assumptions or logical conditions that enjoy credibility comparable to the above monotonicity assumptions and that have identifying power. One may, however, perhaps feel comfortable bringing to bear assumptions whose credibility stems from the judgement of respected medical and epidemiological experts. We provide an example here to illustrate how this may be done and the identifying power studied. We do not endorse the specific assumptions made here.

Consider the decomposition of COVID-19 episodes into those where the patient does and does not manifest discernible symptoms. Dr. Anthony Fauci, the director of the National Institute of Allergy and Infectious Diseases, has been quoted as saying that the fraction of cases in which the patient is infected but shows no symptoms is "somewhere between 25 and 50 percent." Fauci went on to say "And trust me, that is an estimate. I don't have any scientific data yet." 8

Supposing it to be correct, Fauci's bound has identifying power when combined with a further assumption. Let Ad = 1 or Sd = 1 if a person has respectively had an asymptomatic or symptomatic case of COVID-19 by date d. Let each quantity equal zero otherwise. The two categories of illness are mutually exclusive, so Cd = Ad + Sd. Hence, (14) P(Cd = 1) = P(Sd = 1) + P(Ad = 1).

Fauci imposes the assumption (15) P(Ad = 1) = αP(Cd = 1), for some α ∊ [0.25, 0.5].

Combining (14) and (15) yields P(Cd = 1) = P(Sd = 1)/(1 − α), where 1/(1 − α) ≥ 1/0.75. The resulting lower bound (17) is (0.75)^(−1) times that in (10), thus improving on it. If one finds bound (17) credible, the final lower bound on P(Cd = 1) is the maximum of the lower bounds in (17) across dates d' ≤ d, as in (13).

Surveillance systems may report several rates of severe illness (V), including hospitalization (H), ICU usage (U), and death (D). 9 The present discussion considers these reports to be accurate. Thus, one may have empirical knowledge of the rates P(Vd = 1) for V ∊ {H, U, D}.

Surveillance systems do not report rates of severe illness conditional on infection. These have the form (18) P(Vd = 1|Cd = 1) = P(Vd = 1, Cd = 1)/P(Cd = 1).

The numerator P(Vd = 1, Cd = 1) may logically differ from the reported rate P(Vd = 1). This may occur for H and U if some persons hospitalized for COVID-19 are mis-diagnosed. It may occur for D if some reported causes of death are inaccurate. For simplicity, we assume here that such errors do not occur. However, we caution that the assumption may not be realistic. 10

In the absence of reporting errors, Vd = 1 ⇨ Cd = 1. Hence,

(19) P(Vd = 1|Cd = 1) = P(Vd = 1)/P(Cd = 1).

Given (19), the bound obtained for P(Cd = 1) immediately yields a bound for P(Vd = 1|Cd = 1). The lower (upper) bound on P(Vd = 1|Cd = 1) is achieved when P(Cd = 1) takes its upper (lower) bound.
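The inversion of the infection-rate bound can be sketched as follows. The Python example is a minimal illustration, and the function name `severe_illness_bound` is hypothetical. The inputs are the rounded figures quoted in this paper for Italy on April 6 (death rate 0.00027 and infection-rate interval [0.003, 0.510]), so the output need not match the reported fatality interval exactly.

```python
def severe_illness_bound(p_severe, infection_lower, infection_upper):
    """Return (lower, upper) for P(Vd = 1|Cd = 1) = P(Vd = 1)/P(Cd = 1).

    p_severe is P(Vd = 1); [infection_lower, infection_upper] bounds P(Cd = 1).
    Assumes no reporting errors, so that Vd = 1 implies Cd = 1.
    """
    if not 0.0 < infection_lower <= infection_upper:
        raise ValueError("need 0 < infection_lower <= infection_upper")
    # The ratio is decreasing in P(Cd = 1): the lower bound on the severe-illness
    # rate uses the upper bound on the infection rate, and vice versa.
    return p_severe / infection_upper, p_severe / infection_lower


fatality_lower, fatality_upper = severe_illness_bound(
    p_severe=0.00027, infection_lower=0.003, infection_upper=0.510
)
print(fatality_lower, fatality_upper)
```

Because the lower bound on the infection rate is close to zero, small rounding differences in that input move the upper bound on the severe-illness rate noticeably.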

When interpreting rates of severe illness conditional on infection, one should keep in mind that severe cases of COVID-19 may not be apparent as of the date of infection. Many patients begin with mild or no symptoms and develop severe cases a week to two weeks after infection. Hence, the rate of severe illness computed as of a specified date may understate the rate of eventual severe illness.

The above derivations hold however one defines the population of interest. Application of the bound on the infection rate is possible if one has empirical knowledge of the testing rate and the rate of positive testing findings for the relevant population. Application of the bound on the rate of severe illness conditional on infection is possible if one additionally has knowledge of the rate of severe illness in the relevant population.

There are both clinical and public health reasons why one would like to know P(Cd = 1|X) and P(Vd = 1|X, Cd = 1) for persons with specified personal characteristics X. For example, it has been thought important to know these rates conditional on the demographic characteristics X = (age, gender, race).

Whatever X may be, the bound on P(Cd = 1|X) is this X-specific version of (5):

(20) P(Rd = 1|X) + LdX0P(Td = 0|X) + LdX10P(Rd = 0|X, Td = 1)P(Td = 1|X) ≤ P(Cd = 1|X)

≤ P(Rd = 1|X) + UdX0P(Td = 0|X) + UdX10P(Rd = 0|X, Td = 1)P(Td = 1|X).

When computing the bound, one should bring to bear credible X-specific bounds on P(Cd = 1|X, Td = 0) and P(Cd = 1|X, Td = 1, Rd = 0). If one imposes a monotonicity restriction conditional on X as in Section 2.2, the bound in (10) is updated in the same way that (20) updates the bound in (5). The bound on P(Vd = 1|X, Cd = 1) is computable if surveillance additionally reports P(Vd = 1|X).
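Applying the monotonicity bound separately within each value of X can be sketched as a loop over strata. The Python example below is a minimal illustration only. The function name, the two age strata, and all testing and positivity rates are hypothetical, and the same range [0.1, 0.4] for one minus NPV is assumed in every stratum.

```python
def monotone_infection_bound(p_tested, p_pos_given_tested, l10, u10):
    """Return (lower, upper) for the infection rate within one stratum,
    using the monotonicity bound with a zero lower bound for the untested."""
    p_untested = 1.0 - p_tested
    p_neg_given_tested = 1.0 - p_pos_given_tested
    p_pos = p_pos_given_tested * p_tested
    u0 = p_pos_given_tested + u10 * p_neg_given_tested
    lower = p_pos + l10 * p_neg_given_tested * p_tested
    upper = p_pos + u10 * p_neg_given_tested * p_tested + u0 * p_untested
    return lower, upper


# Hypothetical strata: X -> (P(Td = 1|X), P(Rd = 1|X, Td = 1)).
strata = {
    "age < 50": (0.004, 0.15),
    "age >= 50": (0.008, 0.25),
}

bounds_by_stratum = {
    x: monotone_infection_bound(p_tested, p_pos, l10=0.1, u10=0.4)
    for x, (p_tested, p_pos) in strata.items()
}
for x, (lower, upper) in bounds_by_stratum.items():
    print(x, lower, upper)
```

Stratum-specific bounds on one minus NPV could be passed in place of the common range if credible X-specific values were available.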

We analyze data from two states in the United States, Illinois and New York, and from Italy. Our data sources are Illinois Department of Public Health (2020) For Italy, Table 1 also reports, in columns 8-10, the rates of severe outcomes P(Hd = 1), P(Ud = 1), and P(Dd = 1). The data reveal that P(Hd = 1) increased from 0.00021 to 0.00054, and P(Ud = 1) from 0.00003 to 0.00006. The fact that these rates decrease towards the end of the period is due to the reduction in the number of new cases and the increase in the number of recovered individuals that Italy has experienced since April 4. The death rate P(Dd = 1) has increased from 0.00004 to 0.00027. 

This paper has used standard methods of partial identification analysis to study two key aspects of the uncertainty that has frustrated attempts to learn the COVID-19 infection rate and rates of severe illness conditional on infection. We have quantified the implications of uncertainty about the infection rate among non-tested persons and about the NPV of the tests in use. The simple analysis of Section 2 shows how available data and maintained assumptions combine to determine the inferences that can logically be drawn.

We have used monotonicity assumptions that have strong credibility in the current context. We also have used a conjecture bounding the rate of asymptomatic infection to illustrate how further assumptions having a less firm foundation may be brought to bear, should one find them credible.

We have used data for two American states and for Italy to illustrate application of the analysis. Given that the tested fraction of the population has been very low, one can barely draw any conclusion about the population infection rate without making assumptions that bound the rate of infection in the untested subpopulation. Imposing the monotonicity assumptions restricts the population infection rate to bounds of width about 0.5 in the current COVID-19 context.

One naturally may prefer bounds of narrower width. Given the available data, this is logically possible to achieve only if one imposes stronger assumptions with considerable identifying power. We have not reported narrower bounds because we do not immediately see a credible basis to add assumptions that would justify them. Readers who feel that they can motivate stronger assumptions may adapt our analysis to determine their implications.

Among the possibilities for narrowing the bounds, it has often been suggested that we can learn about the prevalence and severity of COVID-19 in one location by observing the experiences of populations in other locations. For example, it has been suggested that the United States can learn from the experience in China, South Korea, and Italy. In these locations the epidemic began earlier and has been handled in different ways. Bringing to bear data from different locations is not helpful per se. It may be helpful if the data are combined with assumptions that enable credible extrapolation across locations. Given such assumptions, the partial-identification sub-literature on intersection bounds shows how to proceed formally to tighten inference. See Manski (2020) and Molinari (2020) .

To simplify the presentation, we have intentionally abstracted from other potential sources of uncertainty that may further aggravate the inferential problem. We have assumed that persons who recover from the COVID-19 disease become immune and, hence, cannot be infected anew. We have assumed that persons who are tested and receive a negative result are not retested subsequently. We have assumed that hospitals correctly diagnose patients and that public records correctly code causes of death. We caution that these assumptions may not be completely accurate. The partial identification analysis performed in Section 2 may be extended to incorporate these and other further uncertainties.

Departing from conventional practice in applied econometric analysis, we do not refer to the empirical results in Section 4 as "estimates" and we do not provide measures of statistical precision. Instead, we view states and nations as the units of interest rather than as realizations from some sampling process.

Measurement of statistical precision requires specification of a sampling process that generates the available data. Yet we are unsure what type of sampling process would be reasonable to assume in this work.

The data we used are exact population counts of tests performed and their results in each location, not observations of samples drawn in the locations. To perform statistical inference, one would have to view the population of each location as the sampling realization of a random process defined on a superpopulation of alternative population sizes and compositions. See Manski and Pepper (2018) for extended discussion of this matter in a different applied context.

While the bounds we report can be narrowed by imposing stronger assumptions, a more satisfactory way to increase knowledge of the infection rate is to obtain better data. As has been widely recognized, random testing of populations would contribute enormously. Obtaining a firm understanding of the negative predictive value of the tests in use is also important. We urge efforts to progress in both directions.

[Table 1: by date, P(Td = 1) and P(Rd = 1|Td = 1) for each of the three locations, and P(Hd = 1), P(Ud = 1), and P(Dd = 1) for Italy. The first row, dated 3/16/2020, begins 0.000, 0.092.]

Census Bureau 2019. State Population Totals

Illinois Department of Public Health 2020. Covid 19 Statistics

Resident Population as of

Manski 1989. Anatomy of the Selection Problem

Manski 1995. Identification Problems in the Social Sciences

Manski 2003. Partial Identification of Probability Distributions

Manski 2007. Identification for Prediction and Decision

Manski 2020. Toward Credible Patient-Centered Meta-Analysis

Manski and Pepper 2000. Monotone Instrumental Variables: With an Application to the Returns to Schooling

Manski and Pepper 2018. How Do Right-to-Carry Laws Affect Crime Rates? Coping with Ambiguity Using Bounded-Variation Assumptions

Molinari 2020. Microeconometrics with Partial Identification

New York State Statewide COVID-19 Testing

Peci et al. 2014. Performance of Rapid Influenza Diagnostic Testing in Outbreak Settings

Tamer 2010. Partial Identification in Econometrics

Protezione Civile 2020. Emergenza Coronavirus: la risposta nazionale

Acknowledgements: We thank Yizhou Kuang for able research assistance. We thank Michael Gmeiner, Valentyn Litvin, and Jörg Stoye for helpful comments. We are grateful for the opportunity to present this work at an April 13, 2020 virtual seminar at the Institute for Policy Research, Northwestern University.