key: cord-0920045-c1rbnbgw
authors: Harris, J. E.
title: Overcoming Reporting Delays Is Critical to Timely Epidemic Monitoring: The Case of COVID-19 in New York City
date: 2020-08-04
journal: nan
DOI: 10.1101/2020.08.02.20159418
sha: 73a11c7ac393318d3462d2ea0c5ee02237c467c4
doc_id: 920045
cord_uid: c1rbnbgw

During a fast-moving epidemic, timely monitoring of case counts and other key indicators of disease spread is critical to an effective public policy response. We describe a nonparametric statistical method - originally applied to the reporting of AIDS cases in the 1980s - to estimate the distribution of reporting delays of confirmed COVID-19 cases in New York City. During June 21 - August 1, 2020, the estimated mean delay in reporting was 5 days, with 15 percent of cases reported after 10 or more days. Relying upon the estimated reporting-delay distribution, we project COVID-19 incidence during the most recent three weeks as if each case had instead been reported on the same day that the underlying diagnostic test had been performed. The statistical method described here overcomes the problem of reporting delays only at the population level. The method does not eliminate reporting delays at the individual level. That will require improvements in diagnostic technology, test availability, and specimen processing.

Timely surveillance of newly diagnosed cases is essential for effective control of the COVID-19 epidemic. When new infections are detected primarily through voluntary testing of symptomatic individuals, as is the case in the United States, there will be two main sources of delay in monitoring trends. First, there will be a testing delay between the actual date when an individual becomes infected and the date when that individual is ultimately tested. Second, unless test samples are very rapidly processed, there will be a further reporting delay between the date of testing and the date the test results are communicated by the reporting entity.
The present research addresses the latter source of delay. A statistical method for nonparametric or semiparametric estimation of the distribution of reporting delays was previously investigated in connection with delays in reporting of newly diagnosed AIDS cases during the 1980s (Harris 1990). The estimated distribution of delays allowed the analyst to predict the actual incidence of AIDS cases well before all cases were fully reported. That statistical method is adapted here to recent daily reports of newly diagnosed cases of COVID-19 by the New York City Department of Health and Mental Hygiene.

All data were downloaded from the New York City health department repository (New York Department of Health and Mental Hygiene 2020b). The data consisted of a series of daily updates of a data file named case-hosp-death.csv. In this report, we relied solely on the first two variables in each updated file, labeled DATE_OF_INTEREST and CASE_COUNT, which we interpreted, respectively, as the date of diagnosis and the cumulative number of confirmed COVID-19 cases diagnosed on that date and reported so far. We did not rely on data on hospitalizations or deaths in this study.

medRxiv preprint, this version posted August 4, 2020 (not certified by peer review). The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

[...] hosp-death.csv indicated that a total of 65 cases had been diagnosed on 7/21/2020 and reported by 7/22/2020. Thus, we have $y_{t0} = 9$ and $y_{t1} = 65 - 9 = 56$, where $t$ corresponds in this example to the diagnosis date 7/21/2020 and $y_{tu}$ denotes the number of cases diagnosed on date $t$ and reported with a delay of $u$ days. The very next day's version indicated that a total of 132 cases had been diagnosed on 7/21/2020 and reported by 7/23/2020. Thus, we have $y_{t2} = 132 - 65 = 67$. We used this method of successive differences to recover the underlying quantities $y_{tu}$, which formed the basic data for our analysis.
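The method of successive differences can be illustrated with a minimal Python sketch. The function name `delays_from_snapshots` and the list-based input are assumptions made for illustration and are not taken from the original analysis; the three input values are the counts quoted in the text for the diagnosis date 7/21/2020.

```python
def delays_from_snapshots(snapshots):
    """Recover the delay-specific counts for one diagnosis date.

    snapshots[u] is the cumulative CASE_COUNT shown for that diagnosis
    date in the file posted u days after the diagnosis date (hypothetical
    input layout). The returned list holds the successive differences,
    i.e. the number of cases reported with a delay of exactly u days.
    """
    counts = []
    previous = 0
    for cumulative in snapshots:
        counts.append(cumulative - previous)
        previous = cumulative
    return counts


# Cumulative counts for diagnosis date 7/21/2020 as quoted in the text
print(delays_from_snapshots([9, 65, 132]))  # [9, 56, 67]
```

A negative difference would signal a downward revision of an earlier count, which this sketch does not handle.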
Our statistical approach followed earlier work (Harris 1990). We considered the simplest model where the distribution of delays was independent of the date of diagnosis or any other observable, exogenous variable. That is, the probability that a case diagnosed at date $t$ will be reported with delay $u$ is $\alpha_u$, where $\sum_{u=0}^{U} \alpha_u = 1$ and $U$ is the maximum reporting delay. Let $\alpha$ denote the vector of all parameters $\alpha_u$. Extensions of this basic model, including a variation in which [...], have been developed elsewhere (Harris 1990).

The basic idea is to estimate the delay distribution from our observed data, and then use the estimate to project the total number of cases diagnosed on a given date, including diagnoses yet to be reported. In general, we define $Y_t = \sum_{u=0}^{\min(U,\,T-t)} y_{tu}$ as the total number of cases diagnosed on date $t$ that have so far been reported by the cutoff date $T$, where $y_{tu}$ is the number of cases diagnosed on date $t$ and reported with a delay of $u$ days.

In the early part of the sample, for any date of diagnosis $t \le T - U$, this marginal sum simplifies to $Y_t = \sum_{u=0}^{U} y_{tu}$ and represents the total number of cases diagnosed on that date. So, conditional on the marginal sums $Y_t$, we have the projected number of cases $\hat{N}_t = Y_t$, which is independent of $\alpha$. Since we have already observed all the cases diagnosed on date $t$, there is nothing unknown to project.

In the late part of the sample, for any date of diagnosis $t > T - U$, we can instead write the marginal sums as $Y_t = \sum_{u=0}^{T-t} y_{tu}$. The projected number of cases diagnosed at date $t$ will depend on the parameters $\alpha$ as $\hat{N}_t(\alpha) = Y_t / \sum_{u=0}^{T-t} \alpha_u$, where $\sum_{u=0}^{T-t} \alpha_u$ is the estimated probability that a case diagnosed at date $t$ will be reported by the cutoff date $T$. We assume that the counts $y_{tu}$ are the realizations of independent Poisson random variables.
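The projection step described above can be sketched in Python as follows. The function names, argument layout, and toy delay probabilities are assumptions chosen for illustration and are not values from the study. The sketch divides the reported count by the probability of having been reported by the cutoff date.

```python
def marginal_sum(y_t, t, T, U):
    """Total cases diagnosed on day t and reported by cutoff day T.

    y_t[u] is the number of cases diagnosed on day t and reported with a
    delay of u days; only delays 0..min(U, T - t) can have been observed.
    """
    return sum(y_t[: min(U, T - t) + 1])


def project(reported, delay_probs, t, T, U):
    """Project the eventual number of cases diagnosed on day t.

    delay_probs[u] is the probability of a delay of u days (summing to 1).
    For early dates (t <= T - U) the divisor is 1 and the projection
    equals the reported count.
    """
    prob_reported_by_cutoff = sum(delay_probs[: min(U, T - t) + 1])
    return reported / prob_reported_by_cutoff


# Toy example (hypothetical numbers): maximum delay U = 2, cutoff T = 4
toy_probs = [0.5, 0.3, 0.2]
print(project(16, toy_probs, t=3, T=4, U=2))  # about 20: only delays 0 and 1 observed
print(project(30, toy_probs, t=0, T=4, U=2))  # about 30: fully reported
```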
Given the marginal sums $Y_t$, the conditional likelihood of the delay probabilities $\alpha = (\alpha_0, \ldots, \alpha_U)$ is maximized by the following iterative procedure, which is equivalent to the EM algorithm (Dempster, Laird, and Rubin 1977). Let $Z_u = \sum_{t=0}^{T-u} y_{tu}$ denote the total number of cases reported with a delay of $u$ days, summed over all dates of diagnosis $t$. We start with initial estimates $\alpha_u^{(0)}$ for all $u$. At iteration $k$ with provisional parameters $\alpha^{(k)}$, we update our parameters to $\tilde{\alpha}_u^{(k+1)} = Z_u / \sum_{t=0}^{T-u} \hat{N}_t(\alpha^{(k)})$, where $\hat{N}_t(\alpha^{(k)})$ is the projected number of cases diagnosed on date $t$ and the denominator is the projected total number of diagnosed cases for which a delay $u$ has been observed. To complete the iteration, we normalize to get $\alpha_u^{(k+1)} = \tilde{\alpha}_u^{(k+1)} / \sum_{v=0}^{U} \tilde{\alpha}_v^{(k+1)}$. We continue to iterate until $\lVert \alpha^{(k+1)} - \alpha^{(k)} \rVert$ is arbitrarily small. Once we have converged on an estimate $\hat{\alpha}$, the projected case counts are $\hat{N}_t = \hat{N}_t(\hat{\alpha})$ for all $t$.

Initial scanning of the data indicated that reporting delays have been increasing over the course of the epidemic since the initial outbreak in early March 2020. For cases diagnosed on or after June 21, however, the reporting distribution appeared to be stable with essentially all reports received within 21 days of diagnosis. We therefore designated June 21 as diagnosis date $t = 0$ and $U = 21$ days as the maximum reporting delay. As a result, the early part of our sample, that is, the range of dates $t$ for which the observed case counts were complete, ran from June 21 through July 11. The late part of our sample, in which the observations on $y_{tu}$ were truncated, ran from July 12 through August 1, 2020.

Figure 1 shows the estimated distribution of reporting delays. Only 3.8 percent of confirmed COVID-19 cases were reported on the same day that the underlying diagnostic test was performed, that is, the estimated delay probability $\hat{\alpha}_0 = 0.038$.
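The iterative procedure described earlier in this section (update, normalize, repeat until the change is small) can be sketched in Python. This is a minimal illustration under stated assumptions and not the author's code: the function names, the uniform starting values, the maximum-absolute-change stopping rule, and the toy data are all hypothetical, and the sketch assumes $T \ge U$ and strictly positive counts.

```python
def project_all(Y, alpha, T, U):
    """Projected cases per diagnosis day: Y_t / P(reported by cutoff T)."""
    return [Y[t] / sum(alpha[: min(U, T - t) + 1]) for t in range(T + 1)]


def estimate_delay_distribution(y, U, tol=1e-10, max_iter=10_000):
    """Estimate delay probabilities from a truncated reporting table.

    y[t][u] = cases diagnosed on day t and reported with delay u, given
    for t = 0..T and u = 0..min(U, T - t). Returns (alpha, projected).
    """
    T = len(y) - 1
    # Marginal sums over the delays observable by the cutoff day T
    Y = [sum(y[t][: min(U, T - t) + 1]) for t in range(T + 1)]
    # Z_u: cases reported with delay u, over days where delay u is observable
    Z = [sum(y[t][u] for t in range(T - u + 1)) for u in range(U + 1)]
    alpha = [1.0 / (U + 1)] * (U + 1)  # assumed uniform starting values
    for _ in range(max_iter):
        N = project_all(Y, alpha, T, U)
        # Update: Z_u over projected cases for which delay u was observable
        raw = [Z[u] / sum(N[: T - u + 1]) for u in range(U + 1)]
        total = sum(raw)
        updated = [r / total for r in raw]  # normalize to sum to 1
        change = max(abs(a - b) for a, b in zip(updated, alpha))
        alpha = updated
        if change < tol:
            break
    return alpha, project_all(Y, alpha, T, U)


# Toy data (hypothetical): 100 cases per day, true delays (0.5, 0.3, 0.2)
toy = [[50, 30, 20]] * 4 + [[50, 30], [50]]
alpha_hat, projected = estimate_delay_distribution(toy, U=2)
print([round(a, 4) for a in alpha_hat])
print([round(n, 2) for n in projected])
```

In the toy data the last two diagnosis days are truncated, so their reported totals (80 and 50) understate the underlying 100 cases per day until the projection is applied.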
An additional 18.6 percent were reported on the following day, that is, the estimated delay probability $\hat{\alpha}_1 = 0.186$, and another 20.6 percent were reported two days later, that is, $\hat{\alpha}_2 = 0.206$. The mean reporting delay, based upon the assumption of full reporting by 21 days, was 4.96 days.

As indicated in the figure, an estimated 85.2 percent of diagnosed cases have been reported within 10 days from the date the underlying diagnostic test was performed.

[...] that the most recent three weeks are not expected to show an unusual deviation from the previous trend in case reporting. Additional notes on the analysis of the data are given in Appendix 1.

[Figure: Reported and Projected COVID-19 Diagnoses, New York City, June 21 - August 1, 2020]

During a fast-moving epidemic, timely monitoring of case counts and other key indicators of disease spread is critical to an effective public policy response (Harris 2020b). [...] individual level, something that can be done only through improvements in diagnostic technology, test availability, and specimen processing.
Many state and local health departments, including the New York City health department, have tabulated counts of COVID-19 cases according to the date the relevant diagnostic test was performed. In contrast to tabulating cases according to the date the test result was received, this reporting convention has the advantage of assigning each case as closely as possible to the actual date of infection. The problem, however, is that the reporting agency has to continually update past counts every time a new case report is received. What is worse, the most recent data points are invalid because not all of the cases have come in yet (Harris 2020a). That is precisely what we see in the last two to three weeks of gray points in Figure 3.

The usual workaround for the delayed reporting problem is to attach an advisory that the most recent data are to be ignored. In fact, the data dashboard of the New York City department of health advises readers, "Due to delays in reporting, recent data are incomplete" (New York Department of Health and Mental Hygiene 2020a). But this means that New York State's so-called Early Warning Monitoring Dashboard actually tracks an incidence for New York City that is two to three weeks behind (New York State 2020).

A key message of this article is that, so long as the distribution of reporting delays is stable, the most recently reported case counts need not be ignored. The resulting timely estimate of recently diagnosed cases could have a significant impact on policy decisions to relax or tighten social distancing measures.

We estimated a mean delay of 5 days from the date of diagnostic testing to the date of a positive case report by the New York City health department. The cumulative reporting delay distribution in Figure 2 shows that even 10 days after diagnostic testing, about 15 percent of positive test results have yet to be reported.
It is possible that individual patients are being informed of their positive test results before the data are entered into the health department's aggregate public tally. Still, any significant delay in notification of positive test results at the individual level can have a critical adverse impact on the timing of a decision to self-quarantine. We have no data on the distribution of delays in reporting negative test results. Delays in notification of negative test results can similarly have adverse consequences for the timing of an individual decision to continue working or to return to work.

Even with the proposed statistical correction for reporting delay, there remains the problem of testing delay. In the system of voluntary, symptom-motivated testing in the United States, testing delay has two components. The first is the incubation period between initial infection and first symptoms of illness, estimated to be about 5.1 days (Lauer et al. 2020). The second is the additional delay between the onset of symptoms and the date the test is performed. In a voluntary system, testing delays can be reduced by patient education and by enhanced availability of walk-in and drive-through testing. Alternative testing technologies that would permit rapid self-testing would go far to reduce these critical bottlenecks (Larremore et al. 2020).

The statistical method proposed here assumes that reports arrive according to an independent, homogeneous Poisson process. Reporting delays could vary according to laboratory of diagnosis, duration or severity of infection, and characteristics of the individual patient. Reports could also arrive in batches. Data to test these possibilities are currently unavailable.
While virtually all reports since June 21 were received within 3 weeks, there remains the possibility that reporting delays will further increase. If so, correction for delays will assume even greater importance for timely detection of epidemic trends.

As shown in Figure 3, the incidence of new confirmed COVID-19 cases in New York City has remained stable at under 500 cases per day. This observed flattening of the incidence curve may reflect a delicate balance between falling and rising incidence in different demographic or geographic groups, and thus may not remain stable. Further monitoring of newly diagnosed cases, aided by information from the reporting delay distribution as described here, will permit timely determination as to whether this apparent stability is persistent or fleeting.

The New York City department of health did not post a case-hosp-death.csv file for cases reported through June 28. The corresponding observations on $y_{tu}$, the number of cases diagnosed on date $t$ and reported with a delay of $u$ days, were therefore imputed from the following day's case-hosp-death.csv file and the estimated reporting delay distribution through June 27. In addition, in scattered instances, the computed value of $y_{tu}$ was negative, presumably due to correction of prior reporting errors. In those cases, we distributed the reduction uniformly across prior case counts for the same date of diagnosis $t$, setting $y_{tu} = 0$.
Dempster, Laird, and Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm.
Harris. 2020a. The Coronavirus Epidemic Curve is Already Flattening.
Harris. 2020b. Reopening Under COVID-19: What to Watch For.
Harris. 1990. Reporting delays and the incidence of AIDS.
Larremore et al. 2020. Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance.
Lauer et al. 2020. The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application.
New York Department of Health and Mental Hygiene. 2020a. COVID-19: Data.
New York Department of Health and Mental Hygiene. 2020b. nyc health/coronavirus-data.
New York State. 2020. Early Warning Monitoring Dashboard.