title: Policy evaluation in COVID-19: A graphical guide to common design issues
authors: Haber, Noah A; Clarke-Deelder, Emma; Salomon, Joshua A; Feller, Avi; Stuart, Elizabeth A
date: 2020-09-03

Policy responses to COVID-19, particularly those related to non-pharmaceutical interventions, are unprecedented in scale and scope. Researchers and policymakers are striving to understand the impact of these policies on a variety of outcomes. Policy impact evaluations always require a complex combination of circumstance, study design, data, statistics, and analysis. Beyond the issues that are faced for any policy, evaluation of COVID-19 policies is complicated by additional challenges related to infectious disease dynamics and lags, lack of direct observation of key outcomes, and a multiplicity of interventions occurring on an accelerated time scale. In this paper, we (1) introduce the basic suite of policy impact evaluation designs for observational data, including cross-sectional analyses, pre/post, interrupted time-series, and difference-in-differences analysis, (2) demonstrate key ways in which the requirements and assumptions underlying these designs are often violated in the context of COVID-19, and (3) provide decision-makers and reviewers with a conceptual and graphical guide to identifying these key violations. The overall goal of this paper is to help policy-makers, journal editors, journalists, researchers, and other research consumers understand and weigh the strengths and limitations of evidence that is essential to decision-making.

The response to the global COVID-19 pandemic has demanded urgent decision-making in the face of substantial uncertainties. Policies to arrest transmission, including stay-at-home orders and other non-pharmaceutical interventions (NPIs), have wide-reaching consequences that touch many aspects of well-being. Decision-making in the public interest requires evaluating and weighing the evidence on both intended and unintended consequences in order to best predict outcomes. 1-3 The wide range of policy interventions implemented by different jurisdictions may yield opportunities for learning from what has already happened to inform future policymaking, and we have observed a proliferation of studies aimed at such policy evaluations. 4 However, policy evaluation requires a complex combination of circumstance, data, study design, analysis, and interpretation in order to be informative.

Policy impact evaluation aims to answer questions about the extent to which the realized outcomes given a particular policy would have been different in the absence of that policy. Estimating the causal impact of the policy with observational data is challenging because what would have happened in the absence of the policy change (the "counterfactual") is, by definition, unobserved. Randomized controlled trials (RCTs) of policies related to COVID-19 interventions may not always be practical or ethical. 5 In this context, a large and growing number of studies have attempted to evaluate the impact of COVID-19 policies using observational data. There are many potential pitfalls in the use of observational data for evaluation generally, and some additional methodological design challenges relating to COVID-19 policies in particular.
This paper provides a graphical guide to policy impact evaluations for COVID-19, targeted to decision-makers, researchers, and evidence curators. Our aim is to provide a coherent framework for conceptualizing and identifying common pitfalls in COVID-19 policy evaluation. Importantly, this should not be taken either as a comprehensive guide to policy evaluation more broadly or as guidance on performing analysis, which may be found elsewhere. 6-10 Rather, we review relevant study designs for policy evaluations -including pre/post, interrupted time series, and difference-in-differences approaches -and provide guidance and tools for identifying key issues with each type of study as they relate to NPIs and other COVID-19 policy interventions. Improving our ability to identify key pitfalls will enhance our ability to produce and recognize valid and useful evidence for informing policymaking.

Identifying the type of design

Table 1.
Design | Units with intervention | Units without intervention | Pre-intervention time points | Post-intervention time points | Implied counterfactual
Pre/post (Figure 1A) | At least one | None | At least one (typically one) | At least one (typically one) | Outcome would have stayed the same from the pre period to the post period.
Interrupted time-series (ITS) (Figure 1B) | At least one | None | More than one | At least one (typically several) | Outcome slope and level* would have continued along the same modelled trajectory from the pre period to the post period.
Difference-in-differences (DiD) (Figure 1C) | At least one | At least one | At least one (typically one) | At least one (typically one) | Outcome in intervention units would have changed as much as (or in parallel with) the outcome in the non-intervention units.
Comparative interrupted time series (CITS) (Figure 1D) | At least one | At least one | More than one (typically several) | At least one (typically several) | Outcome slope and level* would have changed as much as the non-intervention group's slope and level* changed.
* Assessing both slope and level is only applicable if there are multiple data points during the post period.

Identifying the underlying design in a given analysis often requires using a combination of the methods as reported and evaluating the data structure that is used for the main analysis, as shown in Table 1. COVID-19-related policy evaluation analyses typically fall under these categories. In most cases, the design can be categorized using a combination of whether there are also units that did not receive the treatment (columns 2-3) and whether there are time points both before and after intervention for those units (columns 4-5). The final column describes the implied counterfactual, discussed further in subsequent sections.

Methods descriptions may not always provide a precise or reliable guide to which of the design approaches has been used. Some studies do not explicitly name these designs (or may classify them differently), and these are only a small fraction of the designs and frameworks that can be used for policy evaluation. 6, 11, 12 Studies may have data at multiple time points but are effectively cross-sectional. 13 DiD, ITS, and CITS designs based on repeated cross-sectional data are sometimes described as cross-sectional 14, 15 instead of longitudinal. The term "event study" is often used to refer to studies with a single unit and one change over time resembling ITS, 16-18 but may refer to other designs. Although ITS is often used to describe changes in one unit, it may also refer to settings in which many treated units adopt an intervention over time. 19-24
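To make the classification logic of Table 1 concrete, the sketch below is a minimal, hypothetical helper that maps the structure of an analytic dataset (how many treated and comparison units, and how many pre- and post-intervention time points) to the canonical design labels used in this paper. The function name and input format are illustrative assumptions rather than any published tool, and real studies often mix designs or defy these labels.

```python
def classify_design(n_treated_units: int,
                    n_comparison_units: int,
                    n_pre_periods: int,
                    n_post_periods: int) -> str:
    """Map a dataset's structure to the canonical designs in Table 1.

    Illustrative sketch only: staggered adoption, repeated cross-sections,
    and hybrid designs are not captured by these four labels.
    """
    if n_treated_units < 1 or n_pre_periods < 1 or n_post_periods < 1:
        return "not classifiable: need at least one treated unit and pre/post data"

    has_comparison = n_comparison_units >= 1
    multiple_pre = n_pre_periods > 1

    if not has_comparison and not multiple_pre:
        return "pre/post (implied counterfactual: outcome stays flat)"
    if not has_comparison and multiple_pre:
        return "interrupted time series (counterfactual: pre-trend continues)"
    if has_comparison and not multiple_pre:
        return "difference-in-differences (counterfactual: parallel change)"
    return "comparative interrupted time series (counterfactual: parallel change in slope and level)"


if __name__ == "__main__":
    # Example: one treated state, 10 comparison states, weekly data for
    # 8 weeks before and 6 weeks after a hypothetical policy date.
    print(classify_design(1, 10, 8, 6))
```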
Studies will also frequently employ multiple designs, 18, 25 while others use more complex methods of generating counterfactuals. 26 Definitions of these terms vary widely, and the definitions above should be considered as guidance only.

The simplest design is the cross-sectional analysis, which compares COVID-19 outcomes between units of observation (e.g., cities) at a single calendar time or time since an event, typically post-intervention. These studies are unlikely to be appropriate for COVID-19-related policy evaluations, but provide a useful starting point for reasoning about different designs. Just as with comparisons of non-randomized medical treatments, the localities that adopt a particular policy likely differ substantially from those that do not on a number of observed and unobserved dimensions, including epidemic status and timing.

Figure 1: This chart shows four canonical longitudinal designs. In all cases, the blue shading represents the underlying data trends, the solid vertical grey line represents the time of intervention, and the grey dashed lines represent the assumed counterfactual in the absence of the intervention, as discussed in the text. The impact estimate is obtained by comparing the outcomes observed for the treated unit in the post period (the solid line) with the implied counterfactual line (the dashed line). In the pre/post and difference-in-differences panels, the large black dots represent the times of measurement, connected by the grey dotted lines.

Given the challenges in a simple cross-sectional comparison, which compares post-intervention outcomes, it is important to consider longitudinal designs, which instead look at differences or trends across time, as summarized in Figure 1. These can be distinguished by the data used and the construction of the counterfactual. Pre/post, for example, has only one unit, measured at two time points. Two common strategies expand on the logic and data requirements of the pre/post design. Interrupted time series designs (Figure 1B) incorporate multiple time points before the intervention, and usually multiple time points after the intervention, to enable a more complete view of changes in levels and trends that are temporally related to the intervention. Difference-in-differences designs (Figure 1C) add a set of comparison points from a group or location that did not have the intervention. Another related design (comparative interrupted time-series, Figure 1D, discussed only briefly here) uses both aspects -a change over time and a comparison group -to compare the observed change in slopes for the intervention group with the change in slope for the comparison group.

Pre/post studies

The simplest longitudinal design is a pre/post analysis, where some outcome is observed before policy implementation, and again after, in a single group (Figure 1A). Pre/post studies are analogous to a single-arm trial 27 with no control and only a single follow-up observation after treatment. This effectively imposes the assumption that the counterfactual trend is completely flat (i.e., that the outcome in the post-period in the absence of the policy change is the same as the value of the outcome before the policy change), without accounting for pre-existing underlying trends, and attributes all outcome changes completely to the intervention of interest.
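As a minimal sketch of that implied counterfactual, the example below computes a pre/post estimate for hypothetical case counts in a single jurisdiction. The numbers are invented; the point is that the entire change between the two measurements is attributed to the policy, because the assumed counterfactual is simply the pre-period value carried forward.

```python
# Hypothetical weekly case counts in one jurisdiction (invented numbers).
cases_pre = 120.0   # observed in the week before the policy
cases_post = 180.0  # observed in a week after the policy

# The pre/post design implicitly assumes a flat counterfactual:
# without the policy, the post-period outcome would have equaled cases_pre.
counterfactual_post = cases_pre

# The entire observed change is therefore attributed to the policy,
# even though cases were almost certainly trending for other reasons.
prepost_effect = cases_post - counterfactual_post
print(f"Pre/post estimate of the policy effect: {prepost_effect:+.1f} cases/week")
```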
Just as the outcomes for an individual patient might be expected to change before and after treatment for reasons unrelated to the treatment, outcomes related to policy interventions will change for reasons not caused by the policy. Infection rates, for example, would not be expected to remain stationary except in very specific circumstances, but a pre/post measurement would assume that any changes in infection rates are attributable to the policy.

Interrupted time-series

Figure 2: This chart shows one canonical design for ITS (blue, Panel A) and four panels demonstrating common issues with ITS analysis (red, Panels B-E), discussed in the text. In all cases, the blue/red shading represents the underlying data trends, the vertical grey line represents the time of intervention, and the grey dashed lines represent the assumed counterfactual in the absence of the intervention. In Panel D, the dash-dot line represents the time at which the policy is expected to impact the outcome. In Panel E, the vertical dotted lines represent concurrent events and changes.

Interrupted time-series (ITS) is a strategy that uses a projection of the pre-policy outcome trend as a counterfactual for how the outcome would have changed if the policy had not been introduced. In other words, in the absence of the policy change, ITS assumes the outcome would have continued on its pre-policy trend during the study period. ITS can be a useful tool in policy evaluation because it allows researchers to account for underlying trends in the outcome and compares the treated unit (or location) to itself; it can therefore eliminate some of the confounding concerns that arise in cross-sectional or pre/post studies. However, the validity of ITS depends critically on how well counterfactual trends in the outcome are modelled, and whether the policy of interest is the only relevant change during the study period. In the canonical setting (Figure 2A), the pre-policy trend is stable and can be feasibly modelled with the available data; the researcher appropriately models the timing of the change in the slope and/or level of the outcome; and the researcher has sufficient information to conclude that there were no other changes during the study period that would be expected to influence the outcome. These elements are largely not satisfied in studies of COVID-related policy, as described below.

ITS relies critically on modelled trends of the outcome over time. Key components of ITS analyses include both visual and statistical examination of trends, preferably alongside a theoretical justification of the model used. At a minimum, analyses should provide a graphical representation of the data and model over time to examine whether pre-trend outcomes are stable, all trends are well fit to the data, "interrupted" at the appropriate time point, and sensibly modelled (Figure 2B). In the case where an ITS includes a large number of units (e.g., states), it can be difficult to display this information graphically.

One common pitfall in ITS is the adoption of inappropriate assumptions about the outcome trend (Figure 2C). The estimate of policy impact will be biased if a linear trend is assumed but the outcome and response to interventions instead follow nonlinear trends (either before or after the policy). In some cases, transformation of the outcome, for example using a log scale, may improve the suitability of a linear model, as in Palladino et al., 2020. 28
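The sketch below shows one common way an ITS counterfactual is operationalized: a segmented linear regression with a level-shift and slope-change term, fit here to log-transformed counts so that exponential-ish growth looks closer to linear. The data, intervention date, and model form are invented assumptions for illustration, statsmodels is assumed to be available, and none of this should be read as the analysis any cited study actually ran.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical daily case counts: exponential-ish growth before an
# intervention at day 30, slower growth afterward (invented data).
days = np.arange(60)
intervention_day = 30
true_growth = np.where(days < intervention_day, 0.05, 0.01)
log_cases = np.log(50) + np.cumsum(true_growth) + rng.normal(0, 0.05, size=days.size)

df = pd.DataFrame({
    "time": days,
    "log_cases": log_cases,
    # Level-shift indicator and slope-change term, both "interrupted" at the policy date.
    "post": (days >= intervention_day).astype(int),
    "time_since": np.clip(days - intervention_day, 0, None),
})

# Segmented regression on the log scale: intercept, pre-policy slope,
# level change at the interruption, and change in slope afterward.
its_fit = smf.ols("log_cases ~ time + post + time_since", data=df).fit()
print(its_fit.params)

# The implied counterfactual is the pre-policy trend projected forward,
# i.e., the fitted model with the post-policy terms switched off.
cf = df.assign(post=0, time_since=0)
df["counterfactual"] = its_fit.predict(cf)
print(df.loc[df.time >= intervention_day, ["log_cases", "counterfactual"]].tail())
```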
Imposing linearity inappropriately is a serious risk in the context of COVID-19, as trends in infectious disease dynamics are inherently non-linear. 29 For intuition, terms such as "exponential growth," "flattening," and "s-curves" all refer to non-linear infectious disease trends. Depending on the particular situation, non-linearity or other modelled trends can have complicated and counterintuitive impacts on policy impact estimates. Apparent linearity may also be temporary and an artifact of testing, which may give a misleading impression that linear models for infectious disease trends are appropriate indefinitely, as is the case for Zhang et al., 2020. 30 While some researchers use linear projections in order to avoid more complex infectious disease models, linear projections in fact impose strict and often unrealistic assumptions, generally resulting in an inappropriate counterfactual.

Researchers can easily misattribute the timing of the policy impact, resulting in spurious inference and bias (Figure 2D). Some public health policies can be expected to translate into immediate results (e.g., smoking bans and acute coronary events 16). In contrast, nearly every outcome of interest in COVID-19 exhibits complex and difficult-to-infer time lags, 31 typically in the realm of many weeks. The time between policy implementation and expected effect in the data can be large and highly variable. For example, in order to see the impact of a mask order, first the mask order takes effect, then people change their behaviors over time to comply with the order (or sometimes the reverse, in the case of anticipation effects), mask use behavior produces changes in infections, infections later result in symptoms, symptoms induce people to seek testing, the tests must then be processed in labs, and finally the results are reported in data monitoring efforts. Selection of lead/lag time should be justifiable a priori, as was done in Islam et al., 2020 22 and Auger et al., 2020, 32 or based on external data. Selecting a lag based on the data, as in Slavova et al., 2020, 33 risks issues comparable to p-hacking. 34

Finally, and perhaps most concerningly in the context of COVID-19, ITS fails when the policy of interest coincides in time with other changes that affect the outcome (Figure 2E). 35 For example, if both mask and bar closure orders are rolled out together as a package, ITS cannot isolate the impact of bar closures specifically. These changes do not need to have taken place exactly concurrently with the policy implementation date of interest; they merely need to have some effect within the time period of measurement to result in potentially serious bias in effect estimates if unaddressed. ITS will also likely be biased if, during the study period, there is a change in the way the outcome data are collected or measured. This might occur if the introduction of a COVID-19 control policy is combined with an effort to collect better data on infection or mortality cases. Analogously, if an RCT involves randomizing people to a group receiving both A and B vs. control, we typically cannot disentangle the effects of A from the effects of B unless we also have separate A-only and B-only arms. Ultimately, if multiple things are changing at the same time, ITS may not be an appropriate design for policy evaluation. COVID-19 policies rarely arrive alone; they are typically created alongside other policies, unofficial actions, and large-scale behavior changes, 36 which themselves impact COVID-19-related outcomes.
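Returning to the timing issue above, one pragmatic safeguard is to pre-specify the expected lag between a policy and its observable effect and shift the analysis threshold accordingly, rather than searching the outcome data for the lag that yields the largest effect. The short sketch below illustrates that idea only; the dates and the 14-day lag are invented placeholders, not recommendations.

```python
import datetime as dt

policy_date = dt.date(2020, 7, 1)  # hypothetical mask order enactment date
assumed_lag_days = 14              # pre-specified before looking at outcomes

# The "interruption" used in the analysis is the date by which the policy could
# plausibly show up in reported outcomes, not the enactment date itself.
# Choosing assumed_lag_days by scanning the outcome data for the "best" break
# risks the p-hacking-like problem described above.
analysis_threshold = policy_date + dt.timedelta(days=assumed_lag_days)
print(f"Policy enacted {policy_date}; outcomes modelled as interrupted on {analysis_threshold}")
```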
In some cases, anticipation of a policy may induce behavior change before the actual policy takes effect. The policies themselves may have been chosen due to the expectation of change in disease outcomes, which introduces additional biases related to "reverse" causality.

These issues are summarized as a checklist of questions to identify common pitfalls in Table 2.

Table 2: Checklist for identifying common pitfalls for ITS to evaluate COVID-19 policy. If any answer is "no," this analysis is unlikely to be appropriate or useful for estimating the impact of the intervention of interest.

Does the analysis provide graphical representation of the outcome over time?
- Check for a chart that shows the outcome over time, with the dates of interest. Outcomes may be aggregated for clarity (e.g., means and CIs at discrete time points).

Is there sufficient pre-intervention data to characterize pre-trends in the data?
- Check the chart(s) to see if there are several time points over a reasonable period of time over which to establish stability and curvature in the pre-trends.

Is the pre-trend stable?
- Check if there are sufficient data to reasonably determine a stable functional form for the pre-trends, and that they follow a modelable functional form.

Is the functional form of the counterfactual (e.g., linear) well-justified and appropriate?
- Check whether the authors explain and justify their choice of functional form.
- Check if there is any curvature in the pre-trend.
- Consider the nature of the outcome. Is it sensible to measure the trend of this outcome on the scale and form used? Note: infectious disease dynamics are rarely linear.
- Consider that while pre-trend fit is a necessary condition for an appropriate linear counterfactual model, it is not sufficient. Check if the authors provide justification for the counterfactual continuing to follow the same functional form (e.g., linear).

Is the date or time threshold set to the appropriate date or time (e.g., is there lag between the intervention and outcome)?
- Check whether the authors justify the use of the date threshold relative to the date of the intervention.
- Trace the process from the intervention being put in place to when observable effects in the outcome might appear over time.
- Consider whether there are anticipation effects (e.g., do people change behaviors before the date when the intervention begins?).
- Consider whether there are lag effects (e.g., does it take time for behaviors to change, behavior change to impact infections, infections to impact testing, and data to be collected?).
- Check if the authors appropriately and directly account for these time effects.

Is this policy the only thing to happen which could have impacted the outcome during the measurement period?
- Consider other policies or interventions which could impact the outcome during this time.
- Consider whether social behaviors changed in ways that could meaningfully impact the outcome during this time.
- Consider whether economic conditions changed in ways that could meaningfully impact the outcome during this time.
- Note that the actual concurrent changes do not need to happen during the period of measurement, just their effects.

Difference-in-differences

The difference-in-differences (DiD) approach uses concurrent non-intervention groups as a counterfactual. Typically, this consists of one set of units (e.g., regions) that had the intervention and one set that did not, with each measured before and after the intervention took place. DiD is more directly analogous to a non-randomized medical study with at least one treatment and control group but limited observation before and after treatment.
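As a minimal numerical sketch of this two-group, two-period logic, the example below uses invented outcome means for a treated and a comparison region before and after a hypothetical policy; it is not drawn from any real study.

```python
# Hypothetical mean weekly cases per 100,000 (invented numbers).
treated_pre, treated_post = 40.0, 55.0        # region that adopted the policy
comparison_pre, comparison_post = 38.0, 60.0  # region that did not

# Each group's simple pre/post change.
change_treated = treated_post - treated_pre            # +15
change_comparison = comparison_post - comparison_pre   # +22

# DiD attributes to the policy whatever part of the treated group's change
# is NOT matched by the comparison group's change, under the assumption that
# the two groups would otherwise have moved in parallel.
did_estimate = change_treated - change_comparison
print(f"Difference-in-differences estimate: {did_estimate:+.1f} cases per 100,000")
```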
In contrast to ITS, which compares a unit with itself over time, DiD compares differences between treatment arms or units at two observation points. In many analyses, a DiD approach is implied by comparing regions over time, without formally naming or modelling it. Other DiD approaches use interventions implemented at multiple time points; 37 see Goodman-Bacon and Marcus 2020 9 for further discussion of the nuances of using those methods to study COVID-19-related policies.

Figure 3: This chart shows one canonical design for DiD (blue, Panel A) and four panels demonstrating common issues with DiD analysis (red, Panels B-E). In all cases, the blue/red shading represents the underlying data trends, the vertical grey line represents the time of intervention, and the grey dashed lines represent the assumed counterfactual in the absence of the intervention. In Panel D, the dash-dot line represents the time at which the policy is expected to impact the outcome. In Panel E, the vertical dotted lines represent concurrent events and changes.

One key component of the standard DiD approach is the parallel counterfactual trends assumption: that the intervention and comparison groups would have had parallel trends over time in the absence of the intervention. In some cases, the parallel trends assumption may be referenced or examined implicitly but not named. 15 Ideally, pre-intervention trends would be shown to be clearly identifiable, stable, of a similar level, and parallel between groups, as in Hsiang et al., 2020. 19 With only one observation before and only one after the intervention, assessment of the plausibility of the parallel counterfactual trends assumption is not possible. Absent this confirmation, 38 the evaluation runs the risk of biased estimation due to differential pre-trends (Figure 3B). Pre-trends approaching the ceiling or floor 15 may also not be informative about stable and parallel pre-trends. Empirical assessment of whether pre-intervention trends were parallel and stable between groups is possible when multiple observations are available at multiple time points before the intervention, noting that this can begin to resemble a CITS design. In this scenario, pre-trend data should be visually and statistically established and documented. While parallel trends before the intervention (which we can observe and may be testable) do not guarantee parallel counterfactual trends in the post-intervention period (which we cannot observe and are generally untestable), examining pre-intervention parallel trends is a minimal requirement for DiD reliability.

It is also important to consider the scale and level on which the outcome is measured (Figure 3C). As with ITS, if the outcomes in the treatment and comparison groups are moving in parallel on a logged scale, they will not be moving in parallel on a natural scale. Level differences by themselves may be a problem for COVID-19 outcomes, as infectious disease transmission dynamics dictate that infection risks are related to the prevalence of infected people in a population, i.e., the rate of change is linked intrinsically to the level. A population with an extremely low prevalence will tend to have an inherently slower rise in infection rates than an otherwise identical population with merely a low prevalence. Just as importantly, large level differences in the outcome between intervention and comparison groups are often indicative of other important differences between comparators, which may result in other assumptions being violated.
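Connecting back to the parallel trends discussion above, the sketch below illustrates one simple way to examine pre-intervention trends when multiple pre-period observations are available: plot group means over time and check whether a group-by-time interaction is detectably different from zero in the pre-period. The panel data are simulated placeholders (pandas and statsmodels are assumed to be available), and, as noted above, parallel pre-trends are a minimal check rather than a guarantee of parallel counterfactual trends.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated pre-intervention panel: 8 weeks of outcomes for a treated and a
# comparison group of regions (invented data with slightly different slopes).
weeks = np.arange(8)
frames = []
for group, slope in [("treated", 2.0), ("comparison", 2.5)]:
    for region in range(10):
        frames.append(pd.DataFrame({
            "group": group,
            "week": weeks,
            "outcome": 30 + slope * weeks + rng.normal(0, 3, size=weeks.size),
        }))
panel = pd.concat(frames, ignore_index=True)

# Visual check: group means by week should track each other if pre-trends are parallel.
print(panel.groupby(["group", "week"])["outcome"].mean().unstack("week").round(1))

# Statistical check: the group-by-week interaction captures differential pre-trends.
pretrend_fit = smf.ols("outcome ~ week * C(group)", data=panel).fit()
print(pretrend_fit.summary().tables[1])
```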
While DiD is in some ways more robust to very specific kinds of timing effects (Figure 3D) and concurrent changes (Figure 3E), it also introduces additional risks. DiD effectively doubles the opportunity for concurrent changes to spuriously impact results, since they can occur in the treatment or comparison groups. As above, this can become even more problematic for DiD in the typical case where intervention groups enact more, or very contextually different, policies than non-intervention groups. Even cases where concurrent changes happen equally in both treatment and comparison groups can lead to overwhelming bias, particularly when approaching the maximum or minimum levels of the outcome. If either the treatment or control group is approaching the floor (e.g., 0% prevalence) or ceiling for an outcome of interest due to other policies concurrent in both places (e.g., national lockdowns, but region-level differences in mask policy), this can lead to bias when comparing changes between the two groups.

Table 3: Checklist for identifying common pitfalls for DiD to evaluate COVID-19 policy. If any answer is "no," this analysis is unlikely to be appropriate or useful for estimating the impact of the intervention of interest.

Does the analysis provide graphical representation of the outcome over time?
- Check for a graph that shows the outcome over time for all groups, with the dates of interest. Outcomes may be aggregated for clarity (e.g., means and CIs at discrete time points).

Is there sufficient pre-intervention data to observe both pre- and post-trends in the data?
- Check the chart(s) to see if there are several time points over a reasonable period of time over which to establish stability and curvature in the pre-trends.

Are the pre-trends stable?
- Check if there are sufficient graphical data to reasonably determine a stable functional form for the pre-trends, and that they follow a modelable functional form.

Are the pre-trends parallel?
- Observe whether the trends in the intervention and comparison groups appear to move together at the same rate at the same time.

Are the pre-trends at a similar level?
- Check if the trends in the intervention and comparison groups are at similar levels.
- Note that trends at different levels exacerbate other problems with the analysis, including linearity assumptions.

Are the intervention and non-intervention groups broadly comparable?
- Consider ways in which the comparison groups may be dissimilar beyond just the level of the outcome.

Is the date or time threshold set to the appropriate date or time (e.g., is there lag between the intervention and outcome)?
- Check whether the authors justify the use of the date threshold relative to the date of the intervention.
- Trace the process from the intervention being put in place to when observable effects in the outcome might appear over time.
- Consider whether there are anticipation effects (e.g., do people change behaviors before the date when the intervention begins?).
- Consider whether there are lag effects (e.g., does it take time for behaviors to change, behavior change to impact infections, infections to impact testing, and data to be collected?).
- Check if the authors appropriately and directly account for these time effects.

Is this policy the only thing to happen which could have impacted the outcome during the measurement period?
- Consider other policies or interventions which could impact the outcome during this time.
- Consider whether social behaviors changed in ways that could meaningfully impact the outcome during this time.
- Consider whether economic conditions changed in ways that could meaningfully impact the outcome during this time.
- Note that the actual concurrent changes do not need to happen during the period of measurement, just their effects.

As with the ITS section, these issues are summarized as a checklist of questions to identify common pitfalls in Table 3.

In recent months, there has been a proliferation of research evaluating policies related to the COVID-19 pandemic. As with other areas of COVID-19 research, quality has been highly variable, with low-quality studies resulting in poorly informed or misinformed policy decisions, wasted resources, and undermined trust in research. 39, 40 To support high-quality policy evaluations, in this paper we describe common approaches to evaluating policies using observational data, and describe key issues that can arise in applying these approaches. We hope that this guidance can help support researchers, editors, reviewers, and decision-makers in conducting high-quality policy evaluations and in assessing the strength of the evidence that has already been published.

Policy evaluation -far from a simple task in normal circumstances -is particularly challenging during a pandemic. Cross-sectional comparisons of states or countries are likely to be biased by selection into treatment: for example, countries with worse outbreaks may be more likely to implement policies such as mask requirements. In analyses of changes over time -such as single-unit studies using interrupted time-series or multi-unit comparisons using difference-in-differences or comparative interrupted time-series -it may not be possible to parse apart the effects of different policies implemented around the same time, such as mask mandates paired with limits on social gatherings. Analyses of changes over time may also be biased if disease or human behavioral dynamics are not modelled appropriately. This can be challenging because case counts typically do not grow linearly and there is often a lag between a policy change and a behavioral response.

This guidance should be considered minimal screening to identify low-quality policy impact evaluation in COVID-19, but it is in no way sufficient to identify high-quality evidence or actionability. Decision-makers and researchers should pay particular attention to how relevant the intervention, as it was evaluated, is to the decisions being made. The evaluated impact of a program encouraging mask use through messages might not be informative about mask requirement orders. Differences in level of aggregation may also be important, such as an ecological fallacy arising from a situation in which areas with higher overall mask use have higher transmission, but transmission is actually lower for individuals wearing masks. Policy impact evaluation is only as useful as the question it asks, the data it uses, and the way it is analyzed. Problems with measurement, spillover effects, generalizability, changes in measurement over time (e.g., varying test availability), statistics, testing robustness to alternative assumptions, and many other issues can undermine an otherwise robust evaluation, and are not discussed here. While this guidance is not comprehensive, it may help inform study designs not covered here. Issues with comparative interrupted time-series and synthetic control methods, 41 for example, are broadly similar to the issues with difference-in-differences analyses we discuss here.
Other approaches may include adjustment- and matching-based observational causal inference designs, 7 instrumental variables and related quasi-experimental approaches, 42 and randomized controlled trials. Each has its own set of practical, ethical, and inferential limitations. In the face of these challenges, we recommend careful scrutiny and attention to potential sources of bias in COVID-19-related policy evaluations, but we remain optimistic about the potential for robust evaluations to inform decision-making. Researchers and decision-makers should triangulate across a large variety of approaches from theory to evidence, invest in better data and more reliable and useful evidence wherever feasible, clearly acknowledge limitations and potential sources of bias, and acknowledge when actionable evidence is not feasible. 43 We anticipate increasing opportunities for better examining policies moving forward, particularly if policies and interventions are designed with policy impact evaluation and data collection in mind.

The COVID-19 pandemic requires urgent decisions about policies that affect millions of people's lives in significant ways. High-quality evidence on the effects of these policies is critical to informing decision-making, but is very hard to generate. Evidence-based decision-making depends on research that carefully considers potential sources of bias, and clearly communicates underlying assumptions and sources of uncertainty.

References

1. Making Decisions in a COVID-19 World
2. Defining high-value information for COVID-19 decision-making
3. Implementation science in times of Covid-19. Implementation Sci
4. Quarantine alone or in combination with other public health measures to control COVID-19: a rapid review
5. Which interventions work best in a pandemic?
6. Mostly Harmless Econometrics: An Empiricist's Companion
7. Causal Inference: What If
8. Evaluating the impact of healthcare interventions using routine data
9. Using Difference-in-Differences to Identify Causal Effects of COVID-19 Policies
10. Designing Difference in Difference Studies: Best Practices for Public Health Policy Research
11. Evaluating the Health Impact of Large-Scale Public Policy Changes: Classical and Novel Approaches
12. Quasi-experimental study designs series-paper 7: assessing the assumptions
13. Statewide Interventions and Covid-19 Mortality in the United States: An Observational Study
14. Mask Wearing and Control of SARS-CoV-2 Transmission in the United States
15. Comparison of Estimated Rates of Coronavirus Disease 2019 (COVID-19) in Border Counties in Iowa Without a Stay-at-Home Order and Border Counties in Illinois With a Stay-at-Home Order
16. Interrupted time series regression for the evaluation of public health interventions: a tutorial
17. Community Use Of Face Masks And COVID-19: Evidence From A Natural Experiment Of State Mandates In The US: Study examines impact on COVID-19 growth rates associated with state government mandates requiring face mask use in public
18. When Do Shelter-in-Place Orders Fight COVID-19 Best? Policy Heterogeneity Across States and Adoption Time
19. The effect of large-scale anti-contagion policies on the COVID-19 pandemic
20. Impact of National Containment Measures on Decelerating the Increase in Daily New Cases of COVID-19 in 54 Countries and 4 Epicenters of the Pandemic: Comparative Observational Study
21. Social distancing merely stabilized COVID-19 in the US
22. Physical distancing interventions and incidence of coronavirus disease 2019: natural experiment in 149 countries
23. Impact of social distancing during COVID-19 pandemic on crime in Los Angeles and Indianapolis
24. The Efficacy of Lockdown Against COVID-19: A Cross-Country Panel Analysis. Appl Health Econ Health Policy
25. Unemployment insurance and food insecurity among people who lost employment in the wake of COVID-19
26. Causal Impact of Masks, Policies, Behavior on Early Covid-19 Pandemic in the U.S.
27. Demystifying the Placebo Effect
28. Excess Deaths and Hospital Admissions for COVID-19 Due to a Late Implementation of the Lockdown in Italy
29. Mathematical models of infectious disease transmission
30. Identifying airborne transmission as the dominant route for the spread of COVID-19
31. Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions
32. Association Between Statewide School Closure and COVID-19 Incidence and Mortality in the US
33. Signal of increased opioid overdose during COVID-19 from emergency medical services data. Drug and Alcohol Dependence
34. The extent and consequences of p-hacking in science
35. Testing the Validity of the Single Interrupted Time Series Design. Report No.: w26080
36. Comparing the impact on COVID-19 mortality of self-imposed behavior change and of government regulations across 13 countries
37. Difference-in-Differences with Variation in Treatment Timing. Report No.: w25018
38. Social Distancing and Outdoor Physical Activity During the COVID-19 Outbreak in South Korea: Implications for Physical Distancing Strategies
39. Waste in covid-19 research
40. Too much information, too little evidence: is waste in research fuelling the covid-19 infodemic?
41. Synthetic control methodology as a tool for evaluating population-level health interventions
42. Causal Inference for Statistics, Social, and Biomedical Sciences
43. Will COVID-19 be evidence-based medicine's nemesis?

Acknowledgements

We would like to thank Dr. Sarah Wieten, Dr. Cathrine Axfors, and Dr. Mario Malicki for edits and suggestions on the manuscript. Critical feedback and support were provided by Dr. Steven Goodman and Dr. John Ioannidis. Kevin Hassett provided inspiration to use a cubic polynomial model for the infectious disease outcome curves in the figures. No funding was provided specifically for this article.