key: cord-0620575-9hn21kpr authors: Thakkar, Niket title: A modeling approach for estimating dynamic measles case detection rates date: 2022-02-22 journal: nan DOI: nan sha: becf7c0e98f9a358bcb6724e70de384127298133 doc_id: 620575 cord_uid: 9hn21kpr The main idea in this paper is that the age associated with reported measles cases can be used to estimate the number of undetected measles infections. Somewhat surprisingly, even with age only to the nearest year, estimates of underreporting can be generated at the much faster, 2 week time-scale associated with measles transmission. I describe this idea by focusing on the well-studied, 60 city United Kingdom data set, which covers the transition to universal healthcare in 1948, and is, as a result, an interesting case study in infectious disease surveillance. Finally, at the end of the paper, I comment briefly on how the approach can be modified for application to modern contexts. When we try to make decisions to mitigate infectious disease burden, whether that's in allocating vaccines or in applying other interventions, the key piece of data informing our epidemiological perspective is often reported infections over time. In many situations, this timeseries can be stratified across social dimensions, like age, or location, or sex, and differences in trends are then used to make directional statements regarding risk ("cases are rising faster in men than in women", for example). These risk assessments ultimately become one part of a broader decision-making conversation. It's almost always the case that a significant fraction of infections go undetected [1] . This reality often forces epidemiologists to assume underreporting happens uniformly over time and strata, facilitating at least some relative estimates of burden and risk [2] . While this assumption might be good in certain circumstances, it typically goes untested, and its validity is challenged by other realities of healthcare, like inequities and disruptions in access across populations [3, 4] . Estimating the degree of underreporting is a challenging problem. Surveillance data, despite being so central to our epidemiological understanding, usually comes from convenience samples and is rarely collected through formal, randomized surveys. Data quality is then, at best, estimated through independent serological surveys that measure population immunity, a point of comparison to the level of immunity implied by accumulated case reports and data on vaccination [5] . This approach can give some assessment of underreporting across social dimensions, but since serosurveys can't be performed continuously, it has fundamental limitations when it comes to estimating changes over time. Transmission modeling offers a complementary approach, but it comes with its own challenges. At a high level, transmission models consider interactions between a population's infectious and susceptible individuals over time [6] , offering a mathematical platform for interpreting surveillance data in the context of an underlying disease transmission process. In principle, this approach yields time-resolved estimates of the total infectious population which can be compared to case reports to dynamically assess underreporting. In practice, however, transmission models are under-constrained, and the same surveillance data can be explained by multiple, distinct combinations of reporting and transmission processes. Resolving this identifiability issue, specifically in the context of endemic diseases like measles, is the focus of this paper. Broadly speaking, I find that the year-to-year changes in the surveillance system can be constrained by the age distribution of reported cases and then used to regularize the behavior of a high time-resolution transmission model. This approach yields a model with sufficient structure to distinguish reporting and transmission volatility while avoiding assumptions about the fasttime-scale variation in reporting. As a result, it offers a general procedure for dynamic assessment of the surveillance system based on reported cases over time. To be concrete, I describe the method by way of example. In particular, I focus on the well-studied, 60 city United Kingdom (UK) dataset, which contains biweekly reports of measles cases from 1944 to 1966, covering the establishment of the National Health Service (NHS) in 1948 but before the introduction of the measles vaccine in 1967. The lack of vaccine, of course, limits direct applicability of the results to modern contexts, but it helps to simplify the discussion and focus on disease surveillance, which changed significantly with the nationalization of healthcare. As a result, this example proves to be an illustrative case study of the interactions between a variety of social forces as viewed through disease transmission. The key pieces of data for this study are visualized in Fig. 1 [7] . For the purposes of this paper, I added to this dataset Fine and Clarkson's distribution of the ages of measles cases circa 1950 (Fig. 1c) [8] and an estimate of the population pyramid for the same year ( Fig. 1d) [9]. In many ways, this is the classic measles dataset, and a lot of its features are very well understood. For example, transmission modeling has established a dynamic connection between the baby-boom after World War 2 (1946-48) and the shift from biannual to annual measles outbreaks from 1947 to 1953 [10] . Similar studies have also established a connection between seasonal variation in transmission rates and the holiday calendar for UK schools [7, 11] . Disaggregating the data across the 60 cities has facilitated understanding of the population size needed to maintain endemic transmission [12] and the wave-behavior associated with measles' diffusion from city-to-city [13] . In short, many researchers have brought a variety of perspectives to this data, teaching us a lot about measles epidemiology in the process. That said, to my knowledge, no study has focused on the time-variation in reporting rates associated with this data. 1 This is surprising since, with the establishment of the NHS in July 1948, the population's relationship to healthcare changed dramatically [14] . Given the level of This thought can be framed with more mathematical precision by defining a discrete stochastic process model of measles transmission and reporting. Along the lines of classic disease models [6, 7] , and mindful that measles is transmitted person-to-person, the time from exposure to rash onset is roughly 14 days, and those that survive the disease are immune for life, I assume (2) where, at time t in 2 week time steps, S t is the population susceptible to measles, I t is the infectious population, B t are births into the population (e.g., Fig. 1b , interpolated to the biweekly time scale), and C t are reported measles cases (e.g., Fig. 1a ). New infections are generated through a transmission process where β t is the average transmission rate and ε t models multiplicative transmission volatility (i.e., E[ln ε t ] = 0, V[ln ε t ] = σ 2 t ). These quantities, taken together, select a random fraction of possible susceptibleinfectious pairs to contribute to onward transmission and replace the previous infectious generation. Notice, however, that I t can be discounted by α ≤ 1 in the enumeration of pairs, modeling the idea that as I t grows, the likelihood of infection among relatively isolated children, who are incapable of onward transmission, like the very young, increases. The reporting process, r t , represents the probability that infections are reported as cases and is in principle free to vary such that 0 ≤ r t ≤ 1. Evaluating a measles surveillance system in the context of this model is therefore a statistical inference problem where the goal is to estimate unknowns r t , β t , ε t , α, and S 0 given C t and B t . As mentioned, this inference problem is poorly-posed in general. To illustrate the issue, consider solving the expected value of Eq. 3 for I t and inserting the result into Eq. 1 to write where I've set α = 1 for clarity. If, for example, C t increases relative to C t−1 so that C t /C t−1 > 1, that variation can either be explained by sufficient increase in r t or by a sufficiently high reproductive number, In other words, in the context of an imperfectly observed epidemiology, an increase in reported cases can either be a result of more people seeking care by chance or an unfortunately social infectious generation or both. Without additional model structure and information, we cannot tell these effects apart. Some qualitative considerations can help motivate a way forward. In general, we expect r t to have 2 distinct timescales: A fast timescale, comparable to the 14 day transmission scale, associated with random fluctuations in the health-seeking behavior of I t , and a much slower timescale associated with changes in the health system like the creation of new facilities. While it's difficult to say in advance how "slow" should be defined, we also expect β t to vary seasonally with environmental conditions, and as a result, at any sub-annual timescale, we should anticipate seeing both transmission and reporting variation overlaid. The year-to-year timescale is perhaps the highest resolution where we can reasonably expect systemic reporting variation to be isolated. The age-at-infection distribution in Fig. 1c characterizes the UK's measles transmission at the annual scale. The two peaked structure is consistent with school-driven transmission, the larger peak corresponding to infection at school entry at roughly 5 years old and the smaller peak likely including the 5 year-olds' younger siblings at home. While the distribution is based on reported cases in 1950, this mechanistic connection to the school system gives some confidence that it can be taken as representative of the infectious population as a whole for the entire 23 year period of interest. 2 More formally, I'll assume the distribution in Fig. 1c , π(a), is the probability of rash onset at age a, or equivalently, the probability of infection, i(t|s), during year t = a + s given birth year s. Marking annually aggregated quantities with a tilde, I can compute whereĨ B t are infections in children born during the T = 23 years of interest. This interpretation of the age-atinfection distribution, that it tells us how birth-cohorts are expected to appear as infections over time, is the key idea of this paper. It inspires a strategy to compute total expected burden, E[Ĩ t ], as a point of comparison toC t . Eq. 4 accounts for one of two possible sources of infections in our model. If susceptible individuals were not born in the years 0 to T − 1, they must have been part of S 0 , the initial susceptible population. In other words, calculating expected infections in S 0 , E[Ĩ 0 t ], would complete the estimate of total burden over time. The age-at-infection distribution offers valuable perspective again. If we approximate the transmission process as a simple random sample of the susceptible population, sometimes called a "well-mixed" assumption, and we further assume that π(a) is representative of years before year 0, then the initially susceptible population has age distribution p(a| ∈ S 0 ) = π(a) as well. Operating under this approximation, inspired by the logic of Eq. 4, S 0 can be thought of as a mixture of birth-cohorts born in years s < 0. This implies that where τ (t) is the convolution of S 0 's age-distribution with the age-at-infection distribution -the discrete selfconvolution of π(a) under our approximations. Moreover, the number S 0 can be constrained by the survival function, Π(a) = 1 − π(a ), estimating the probability of remaining susceptible at age a. Combined with the age pyramid, n(a), in Fig. 1d , this means that we expect S 0 to be mostly individuals less than 5 years old, roughly 7% of the total population. Taken together then, expected yearly burden, can be estimated entirely from yearly births, the age-at-infection distribution, and the age-pyramid. The annualized reporting rate then satisfies with additive noisew t . We can enforce the constraint 0 ≤ r t ≤ 1 by modelingr t = f (θ t ) where f (·) is the logistic function andθ t is a Gaussian process (see Appendix A). Then, modeling the variance inw t as a constant fully specifies a non-linear least squares problem that can be solved forθ t to estimater t . Fig. 2 visualizes the results of this approach applied to the data from the UK. In the top panel,r t E[Ĩ t ] (black) follows the trend inC t (blue) with uncertainty (95% interval in grey) driven largely by the biannual periodicity in outbreaks starting in 1954. Eq. 5 is clearly an incomplete epidemiological model, lacking the transmission process required to explain outbreaks, and somewhat reassuringly, it cannot capture key features of the data with reporting variation alone. But still, this is progress. The corresponding distribution forr t is visualized in the lower panel, with the 50% and 95% intervals in progressively lighter tints. The estimate is consistent with a constant-reporting-rate model (black) [7] , but captures an overall falling trend from 1944 to 1966. More practically, the range of probabler t values, which was initially only loosely constrained by the rules of probability, is dramatically reduced through implications of the age-at-infection distribution. Keeping these results in mind, we can return to the more general inference problem outlined in Section III. In somewhat abstract terms, completely defining Eqs. 1 to 3 requires us to calculate the posterior probability distribution p(β t , ε t , α, r t , S 0 |D), where D is the complete dataset {C t , B t ,C t ,B t , π(a), n(a)}, making clear the assumed separation of time scales. This formal statement of the problem is useful because it can be organized hierarchically to inspire an approachable inference algorithm. Specifically, we can write choosing this (exact but non-unique) separation to draw distinction between the parameters explicitly connected to the annual-scale (S 0 , r t ) and the parameters responsible for fast dynamics (β t , ε t , α). As we'll see, the two terms on the right-hand-side lend themselves to approximation more readily than the distribution as a whole. The first term can be used as a vehicle for the survival analysis and annual-scale regression underlying the results in Fig. 2 . Towards that end, we can approximate p(S 0 , r t |D) ≈ p (S 0 |π(a), n(a)) p(r t |π(a),C t ,B t , S 0 ) Here, the first line is the conditional independence assumption that S 0 and r t , in the absence of a transmis- the intuitive result that the susceptible population is the total balance of newborns and exposures every time step. Furthermore, taking the log of Eq. 1 implies ln I t − ln S t−1 = α ln I t−1 + ln β t−1 + ln ε t−1 . (7) And finally, through properties of the binomial distribution (see Appendix B), Eq. 3 implies that E[I t |C t , r t ] = [(C t + 1)/r t ] − 1. Thus, conditional on r t and S 0 , Eq. 7 is very nearly a well-defined linear regression for α, β t , and ε t , but with 26T equations and up to (52T −1) unknowns -too many for a unique solution. We can use measles' transmission seasonality as an epidemiologically reasonable way to reduce dimensionality. Along those lines, I'll assume that β t is a periodic Gaussian process with 1 year (26 time step) periodicity (see Appendix A for details). Furthermore, I'll assume that ln ε t has constant variance, σ 2 ε . Those assumptions dramatically reduce the number of unknowns, to 28 in total, making Eq. 7 a solvable linear regression. Returning to Eq. 6, the approximations just discussed give us a route to evaluate the entire right-hand-side, and as a result, with enough patience, we could calculate whatever properties of that distribution that we might be interested in. That said, since the inference problem remains high dimensional (26T + 29 unknowns), it's reasonable to look for a more digestible distribution. A standard approach is to construct a Gaussian approximation at the distribution's mode [15] . That procedure is facilitated by the negative log posterior, L, which can be minimized to find the mode and twice-differentiated at the minimum to estimate a covariance matrix. For our purposes, up to a constant, whereσ ε is the maximum-likelihood estimate of σ ε associated with the least-squares solution to Eq. 7. This equation nicely illustrates the balance we've achieved, with the first term corresponding to the model's ability to capture the fast dynamics and the next two terms enforcing consistency with the slow time scale. Eq. 8 also encapsulates a well-defined statistical inference algorithm. We first minimize L to construct a Gaussian approximation of Eq. 6, and then properties of Gaussians can be leveraged to calculate marginal distributions or draw sample parameter sets consistent with the data. 3 Finally, for any candidate set of parameters, Eqs. 1 to 3 allow us to produce time series of I t and S t and then estimate quantities of epidemiological interest. I apply this algorithm to the UK example in Fig. 3 , using 10,000 sample trajectories drawn from the fitted model to quantify overall uncertainty. In the top panel, the model (green) follows C t (black dots) closely, now able to explain outbreaks with a seasonal measles transmission process. Moreover, the uncertainty estimates have good empirical performance, with the 50% interval across trajectories capturing 53% of the data and the 95% interval capturing 96% of the data. We can also see good consistency with the results in Fig. 2. In particular, the initial susceptible population (blue) is roughly 7% of the total population, as expected based on the survival function. Even more clearly, in Fig. 3 's final panel, r t exhibits 2 distinct timescales, with step-to-step volatility associated with people's behavior and a year-to-year trend consistent withr t (overlaid in grey). Thus, speaking broadly, Fig. 3 demonstrates that we've created a measles transmission model consistent with both the dynamics and the age distribution of reported cases, and in doing so, we've uncovered the variation in the surveillance system. Before discussing the r t estimates, it's worth establishing Fig. 3 's consistency with past studies. Two key results to that effect are visualized in Fig. 4 . In Fig. 4a , I've plotted the model's estimate of the reproductive number (blue), which is proportional to β t , in comparison with Ref. 7's point estimate (black) and the UK school holiday calendar (grey). The correlation with the school calendar is visually apparent, with closures suppressing transmission rates and the return from holidays associated with transient increases. Our estimate is noticeably smoother than Ref. 7's: This is because of the periodic Gaussian process mentioned in the previous section, modeling the fact that biweekly intervals occur at slightly different times of year every year. But outside of that minor difference, our estimates clearly recapitulate this classic result. A second important result is the relationship between the post World War 2 baby boom (1946-48, see Fig. 1b) FIG. 3. Modeling measles transmission and surveillance dynamics. Forcing consistency with reported cases, both in terms of the biweekly dynamics (black dots) and age-distribution-based results in Fig. 2 (dashed grey) , yields a model (colors, 95% interval shaded) that can distinguish variation in population prevalence (peach) and susceptibility (blue) from the probability infections are reported (yellow). and the transition from biannual to annual outbreaks. Following the approaches in Refs. 7 and 10, we can verify that this phenomena is a deterministic, dynamical feature of the fitted model by estimating the equilibrium periodicity of I t as a function of S t in expectation. More concretely, I can compute long time, 200 year trajectories of the model mean as a function of a constant birth rate, using the first century to reach equilibrium and the second century to sample it. Then, by plotting realized (S t , I t ) pairs at specific times of year, I can construct phase portraits with geometric features that give insight into the model's dynamical properties. The results of this procedure are visualized in Fig. 4b . Choosing two representative times of year (biweek 10, in orange, in keeping with Ref. 7's choice, and biweek 15, in purple, because it's the peak of the low season), (S t , I t ) pairs fall into 2 distinct phases. At low and high susceptibility, I t takes a single value at the same time every year, indicating that all years are the same and the equilibrium has annual periodicity. Meanwhile, in the vicinity of 8% susceptibility, I t falls on one of two branches corresponding to low and high years -that is, biannual periodicity. Returning to the timeseries of Fig. 3 , we see that 8% susceptibility is indeed a critical, historically relevant threshold. In the post World War 2 years, increased birth rates pushed susceptibility to roughly 9% and I t exhibits annual periodicity. At other times from 1944 to 1966, susceptibility hovered near 8%, and outbreaks occurred every other year. Fig. 4b suggests that this transition was not due to chance, that is volatility in ε t . Instead, it is a deterministic, in some sense physical, feature of the UK's measles epidemiology, reproducing the insight first explored in Ref. 7. The model's incorporation of the age-at-infection distribution adds some texture to these famous results. As mentioned, the distribution's two-peaked structure (Fig. 1c ) also supports the school system's role in transmission, suggesting that the correlation in Fig. 4a can be interpreted causally. Meanwhile, that the 1946 to 1948 baby-boom raises susceptibility through 1952, that is for roughly 5 years, is what we might have expected based on the survival function, which vanishes after roughly 5 years. These elements of our inferences suggest that, even without directly modeling ageing, we've captured the dynamic implications of the age distribution. Now, with some confidence in the corresponding transmission process, we can return to r t . Fig. 4c visualizes the estimate (red) in the lead up and aftermath of the NHS's establishment with the model's mean I t estimate (grey) to contextualize the time of year. It's striking that 1948 emerges naturally as a critical year in measles surveillance. Major outbreaks in 1945 and 1947 are accompanied by suppression in r t , but this feature vanishes after 1948, and looking to Fig. 3 , it does not return for the remaining 19 years. A plausible hypothesis explaining the full 23 year estimate is that, up to 1947, large-scale outbreaks placed significant burden on populations without access to healthcare. Then, after 1948, universal healthcare resolved this problem; however, the steady year-to-year r t decrease from 1948 onward suggests that increased demand impacted quality overall. Indeed, in the early 1970's, just after the model period, the NHS received it's first round of reforms in response to surprising demand [14] . By virtue of being difficult to estimate, estimates of r t are difficult to validate. That said, the hypothesis above motivates a coarse prediction. If we assume that measles reporting is a product of two decisions, that an individual is sick enough to seek care and that they have access to care given their need, and we assign probabilities p N and p A|N to those two events, then post-1948, to a good approximation, p A|N = 1. Meanwhile, we can further assume that p N is the same pre-and post-1948, since we expect it to be dominated by biological features of measles infections. As a result, we should expect the ratio of pre-and post-1948 r t estimates to measure p A|N before healthcare reform. And so, with roughly 60% reporting after the NHS's establishment and roughly 40% at peak suppression in 1947, we estimate that roughly 30% of the UK's population lacked access to healthcare before reform. Looking to other historical records, this estimate is in good agreement with shortages of tuberculosis beds in 1947 (32,600 available with roughly 46,000 needed) [14] , and survey data taken at that time [16] might give additional validation -I'm currently looking into it. Zooming out to conclude: It's remarkable how disease transmission reflects society's layers with such clarity, from the school system to birth rates and finally to healthcare access. We've certainly seen this recently as well [1] , sometimes harshly [3] . But for our goals in measles control today, we should aspire to the level of detail with which we understand transmission and surveillance in the pre-vaccine-era UK. Towards that end, this paper's approach can be extended to incorporate vaccination. At a high level, the annual burden estimates of Sec. IV need to be adjusted for fractions of birth cohorts and of the initial susceptible population that are immunized before infection. Those adjustments can be informed by survey data on vaccination coverage, accounting for the administration age and, via the survival function, the probability of remaining susceptible at that age. Subtracting these estimates from E[Ĩ t ] in Eq. 5 then allowsr t to be constrained in essentially the same way. Inference at the fast timescale then proceeds as in Sec. V, with Eq. 2 modified to account for immunization [17] . More generally, this paper has emphasized the need to understand not only the drivers of disease transmission but also the components of disease surveillance, thinking of both as dynamic, epidemiological processes. I hope retrospective, quantitative inference is a step towards understanding how surveillance systems can be improved. Cryptic transmission of SARS-CoV-2 in Washington state Temporal rise in the proportion of younger adults and older adolescents among coronavirus disease (COVID-19) cases following the introduction of physical distancing measures SARS-CoV-2 positivity rate for Latinos in the The WHO, Health inequities and their causes Benefits and challenges in using seroprevalence data to inform models for measles and rubella elimination Infectious diseases of humans: dynamics and control Time series modelling of childhood diseases: a dynamical systems approach Measles in England and Wales-II: the impact of the measles vaccination programme on the distribution of immunity in the population Dynamics of measles epidemics: scaling noise, determinism, and predictability with the TSIR model Measles in England and Wales-I: an analysis of factors underlying seasonal patterns Disease extinction and community size: modeling the persistence of measles Travelling waves and spatial hierarchies in measles epidemics The history of the NHS Data analysis: a Bayesian tutorial The Hospital Surveys: the Domesday book of the hospital services (Nuffield Provincial Hospitals Trust Decreasing measles burden by optimizing campaign timing This work was done in conversation with many of my colleagues. I'd like to thank Safi Karmy-Jones for motivating the historical exploration, Edward Wenger for his intuition on the age-distribution's relationship to schools, and Mike Famulare for his sense of direction on the survival analysis. Edward, as well as Kevin McCarthy and Arie Voorman, also gave very useful feedback on this paper's first draft. The regression problems in Eqs. 5 and 7 rely on Gaussian processes to create smoothness in time. More precisely, when I write that θ t is a Gaussian process, I mean that p(θ t ) = N (θ t |0, (λD D) −1 ), where the matrix D is the finite-difference approximation to the second derivative. For periodic processes, D has periodic boundary conditions. Otherwise, the boundaries are handled by switching from centered to forward or backward approximations. In either case, this Gaussian distribution is taken as a prior in the relevant regression problems, leading to a penalty proportional to θ t 's second derivative and, in that way, enforcing smoothness.The constant λ can be thought of in terms of correlation time. Specifically, the total variation, ν = ||Dθ t || 2 , is Gamma distributed with shape T /2 and scale 2/λ, so E[ν] is inversely proportional to λ. Meanwhile, for a sine wave with period τ , the total variation goes as τ −4 . Thus, in specifying λ, I choose an expected timescale for θ t by setting λ ∝ τ 4 . For Eq. 5, I choose τ = 5 years, and for Eq. 7, I choose τ = 3 biweeks. That said, in sensitivity testing, none of the results of this paper change significantly for reasonable choices for λ. In specifying the model, I leverage one small theorem related to the binomial distribution, also discussed in Ref. 17 . Specifically, ifthen I can use Bayes' theorem with a uniform prior enforcing I t ≥ C t to computewhich is a distribution over I t , normalized by the additional factor of r t . The associated moment generating function iswhich implies (by Taylor expanding and picking off the coefficient linear in s) thatThis result relates Eq. 7 to the reported data, helping us along the way to a well-defined linear regression for the transmission process. Incidentally, the dependence on C t + 1 makes ln E[I t |C t , r t ] well-defined even at C t = 0 for any 0 < r t < 1, so this paper's methods are somewhat naturally capable, at least in this respect, of handling the sparse data typical of modern high-burden settings.