key: cord-0875744-f9uh3rkn
authors: Haber, Noah A; Clarke-Deelder, Emma; Salomon, Joshua A; Feller, Avi; Stuart, Elizabeth A
title: COVID-19 Policy Impact Evaluation: A guide to common design issues
date: 2021-06-28
journal: Am J Epidemiol
DOI: 10.1093/aje/kwab185
sha: 56387dd343f67d743bafb47c1933e5dd90bfb720
doc_id: 875744
cord_uid: f9uh3rkn

Policy responses to COVID-19, particularly those related to non-pharmaceutical interventions, are unprecedented in scale and scope. However, policy impact evaluations require a complex combination of circumstance, study design, data, statistics, and analysis. Beyond the issues that are faced for any policy, evaluation of COVID-19 policies is complicated by additional challenges related to infectious disease dynamics and a multiplicity of interventions. The methods needed for policy-level impact evaluation are not often used or taught in epidemiology, and differ in important ways that may not be obvious. Methodological complications of policy evaluations can make it difficult for decision-makers and researchers to synthesize and evaluate strength of evidence in COVID-19 health policy papers. We (1) introduce the basic suite of policy impact evaluation designs for observational data, including cross-sectional analyses, pre/post, interrupted time-series, and difference-in-differences analysis, (2) demonstrate key ways in which the requirements and assumptions underlying these designs are often violated in the context of COVID-19, and (3) provide decision-makers and reviewers a conceptual and graphical guide to identifying these key violations. The overall goal of this paper is to help epidemiologists, policy-makers, journal editors, journalists, researchers, and other research consumers understand and weigh the strengths and limitations of evidence.

Policy responses to COVID-19 have had profound health, social, and economic consequences. 1-3 Epidemiologists are increasingly being asked to participate in policy recommendations, evidence evaluation, and evidence generation. Given the importance of these problems, there has been a proliferation of studies aiming to evaluate interventions implemented by different jurisdictions to inform future policymaking. 4 To be informative, however, such evaluations require a complex combination of circumstance, data, study design, analysis, and interpretation. Although estimating the effects of policies has roots in epidemiology reaching back to John Snow, these methods are relatively rarely taught and practiced in contemporary epidemiology, 6,7 particularly with regard to the intricacies of policy evaluation. As Caniglia and Murray note, in epidemiology these methods have "fallen out of use given the field's shift to focus on questions of clinical relevance rather than population health relevance." 6 That is changing, with these types of methods increasingly being included in epidemiology curricula, and chapters being added to texts, such as Glymour and Swanson's recent chapter addition in the 4th edition of Modern Epidemiology. 8 These efforts are well-timed, as COVID-19 has highlighted the need for epidemiology to rapidly re-engage with policy evaluation methods. This paper provides a graphical guide to policy impact evaluations for COVID-19, targeted to decision-makers, researchers, and evidence curators. Our aim is to provide a coherent framework for conceptualizing and identifying common pitfalls in COVID-19 policy evaluation. Importantly, this should not be taken either as a comprehensive guide to policy evaluation more broadly or as guidance on performing analysis, which may be found elsewhere. 9-13
Rather, we review relevant study designs for policy evaluation, including pre/post, interrupted time-series, and difference-in-differences approaches, and provide guidance and tools for identifying key issues with each type of study as they relate to non-pharmaceutical interventions and other COVID-19 policy interventions. Improving our ability to identify key pitfalls will strengthen our capacity to recognize and produce valid and useful evidence for informing policymaking.

Identifying the type of design

For our purposes, we define impact evaluation as examining the quantitative causal effect of particular policies (already implemented) on outcomes, primarily using observational data. This is in contrast to models of hypothetical scenarios and/or analyses using assumed or estimated data (e.g., typical uses of mechanistic or curve-fitting infectious disease models, scenario projections, etc.), noting that the line between "models" and "impact evaluation" is blurry; models and impact evaluation methods are often used together as complements within the same article. Pre/post analysis compares outcomes in the same units before and after an intervention. Interrupted time-series (ITS) analysis extends this by modelling the pre-intervention trend and projecting it forward as the counterfactual, capturing change in trajectory rather than just change in levels. Difference-in-differences analysis compares the outcome change in units that received the intervention with those that did not (or have not yet), with at least one point before and one after the intervention. In scenarios with multiple periods, that may involve a comparison of the pre-policy period of one region with the post-period of a different region, even though all regions eventually receive the intervention. Comparative interrupted time-series uses both aspects, a change over time and a comparison group, to compare the observed change in slope for the intervention group with the change in slope for the comparison group.

Methods descriptions may not always provide a precise or reliable guide to which of the design approaches has been used. Some studies do not explicitly name these designs (or may classify them differently), and these designs are only a small fraction of the designs and frameworks that can be used for policy evaluation. 7,9,14 Studies may have data at multiple time points but be effectively cross-sectional. 15 DiD, ITS, and CITS designs based on repeated cross-sectional data are sometimes described as "cross-sectional" 16,17 instead of longitudinal. The term "event study" is often used to refer to studies with a single unit and one change over time resembling ITS, 18-20 but may refer to other designs. Although ITS is often used to describe changes in one unit, it may also refer to settings in which many treated units adopt an intervention over time. 21-26 Studies also frequently employ multiple designs, 20,27 while others use more complex methods of generating counterfactuals. 28 Definitions of these terms vary widely, and the definitions above should be considered guidance only.

The simplest design is the cross-sectional analysis, which compares COVID-19 outcomes between units of observation (e.g., cities) at a single calendar time or time since an event, typically post-intervention; this is often referred to as an ecological design. This design is unlikely to be appropriate for COVID-19-related policy evaluations, but it provides a useful starting point for reasoning about different designs. Just as with comparisons of non-randomized medical treatments, the localities that adopt a particular policy likely differ substantially from those that do not, in ways that also affect the outcome. Most evaluations of COVID-19 policies instead consider longitudinal designs, which look at differences or trends across time.
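As a minimal formalization of the designs just described (the notation here is introduced only for illustration and is not taken from the paper's figures or tables), the pre/post, segmented-regression ITS, and two-group, two-period DiD estimators can be written as:

\[
\hat{\tau}_{\text{pre/post}} = \bar{Y}_{\text{post}} - \bar{Y}_{\text{pre}},
\]
\[
Y_t = \beta_0 + \beta_1 t + \beta_2 \,\mathbf{1}[t \ge t_0] + \beta_3 (t - t_0)\,\mathbf{1}[t \ge t_0] + \varepsilon_t \quad \text{(ITS segmented regression)},
\]
\[
\hat{\tau}_{\text{DiD}} = \big(\bar{Y}^{\text{trt}}_{\text{post}} - \bar{Y}^{\text{trt}}_{\text{pre}}\big) - \big(\bar{Y}^{\text{ctrl}}_{\text{post}} - \bar{Y}^{\text{ctrl}}_{\text{pre}}\big),
\]

where t_0 is the policy date, beta_2 captures a level change and beta_3 a slope change at t_0, and the ITS counterfactual is the pre-period model beta_0 + beta_1 t extrapolated into the post-period. CITS applies the ITS-style model to both intervention and comparison units and contrasts their estimated changes in slope and level.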
These longitudinal designs can be distinguished by the data used and the construction of the counterfactual, represented graphically for comparison in Figure 1 and summarized in Table 1. The pre/post design assumes that, in the absence of the policy, the outcome would have remained unchanged from the pre-period to the post-period. However, this is unlikely to be true in the case of COVID-19 policies. Just as the outcomes for an individual patient might be expected to change before and after treatment for reasons unrelated to the treatment, outcomes related to policy interventions will change for reasons not caused by the policy. Infection rates, for example, would not be expected to remain stationary except in very specific circumstances, but a pre/post measurement would assume that any changes in infection rates are attributable to the policy.

Interrupted time-series designs improve on pre/post comparisons by using the pre-intervention trend, rather than only the pre-intervention level, to construct the counterfactual. However, the validity of ITS depends critically on how well the counterfactual model of the outcome is specified, and on whether conditional exchangeability holds, that is, whether the policy of interest is the only relevant change during the study period. In the canonical setting (Figure 2A), the pre-policy trend is stable and can be modelled with the available data; the researcher appropriately models the timing of the change in the slope and/or level of the outcome; and the researcher has sufficient information to conclude that there were no other changes during the study period that would be expected to influence the outcome. These elements are largely not satisfied in studies of COVID-related policy, as described below.

Visual and statistical examination of trends, preferably alongside a theoretical justification of the model used, is key to examining the assumptions of ITS. At a minimum, analyses should provide a graphical representation of the data and model over time to examine whether pre-trend outcomes are stable and whether all trends are well fit to the data, "interrupted" at the appropriate time point, and sensibly modelled (Figure 2B). Where an ITS includes a large number of units (e.g., states), it can be difficult to display this information graphically.

Model misspecification is a common pitfall in ITS (Figure 2C). The estimate of policy impact will be biased if a linear trend is assumed but the outcome and response to interventions instead follow nonlinear trends (either before or after the policy). In some cases, transformation of the outcome, for example to a log scale, may improve the suitability of a linear model, as in Palladino et al., 2020. 30 Imposing linearity inappropriately is a serious risk in the context of COVID-19, as trends in infectious disease dynamics are inherently non-linear. 31 For intuition, terms such as "exponential growth," "flattening," and "s-curves" all refer to non-linear infectious disease trends. Depending on the particular situation, non-linearity or other modelled trends can have complicated and counterintuitive effects on estimates of policy impact. Apparent linearity may also be temporary and an artifact of testing, which may give a misleading impression that linear models for infectious disease trends are appropriate indefinitely, as is the case for Zhang et al., 2020. 32 While some studies use linear projections in order to avoid more complex infectious disease models, linear projections in fact impose strict and often unrealistic assumptions, generally resulting in an inappropriate counterfactual. Functional form issues can often be mitigated through careful choice of models 18 and/or testing of alternative assumptions. 33 At a minimum, analyses should justify the use of their selected functional form, and preferably explore alternative specifications.
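As an illustrative sketch only (none of the code, data, or parameter values below come from the paper), the following simulates exponentially growing case counts with a policy-associated slowdown and fits a segmented-regression ITS on both the natural and log scales, the kind of functional-form comparison suggested above. All names, dates, and effect sizes are hypothetical.

```python
# Hypothetical illustration of an interrupted time-series (ITS) fit and of how
# an assumed-linear counterfactual can mislead when growth is exponential.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
days = np.arange(60)                      # 60 days of observation (simulated)
policy_day = 30                           # hypothetical policy start day
post = (days >= policy_day).astype(int)

# Simulated daily cases: exponential growth that slows after the policy.
growth = np.where(post == 0, 0.08, 0.03)  # daily log-growth rates (invented)
log_cases = 3 + np.cumsum(growth) + rng.normal(0, 0.05, size=days.size)
df = pd.DataFrame({
    "day": days,
    "post": post,
    "days_since_policy": np.clip(days - policy_day, 0, None),
    "cases": np.exp(log_cases),
})
df["log_cases"] = np.log(df["cases"])

# Segmented regression: 'post' captures a level change, 'days_since_policy'
# captures a slope change at the policy date.
linear_fit = smf.ols("cases ~ day + post + days_since_policy", data=df).fit()
log_fit = smf.ols("log_cases ~ day + post + days_since_policy", data=df).fit()

# Compare slope-change estimates; the linear-scale model misstates the
# dynamics because the underlying process is multiplicative.
print(linear_fit.params["days_since_policy"])
print(log_fit.params["days_since_policy"])
```

In practice, such a comparison would be accompanied by the graphical displays of the data and fitted counterfactual described above, not just coefficient estimates.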
Researchers can easily misattribute the timing of the policy impact, resulting in spurious inference and bias (Figure 2D). Some public health policies can be expected to translate into immediate results (e.g., smoking bans and acute coronary events 18 ). In contrast, nearly every outcome of interest in COVID-19 exhibits complex and difficult-to-infer time lags between a policy taking effect and observable changes in the outcome. 34 Selecting the lag after examining the results 36 risks issues comparable to p-hacking, 37 where the lag is selected for the point at which results are most extreme, rather than the point most appropriate for causal inference.

Finally, and perhaps most concerningly in the context of COVID-19, conditional exchangeability fails for ITS when the policy of interest coincides in time with other changes that affect the outcome (Figure 2E). 33 ITS will also likely be biased if, during the study period, there is a change in the way the outcome data are collected or measured. This might occur if the introduction of a COVID-19 control policy is combined with an effort to collect better data on infections or deaths. Anticipation of a policy may induce behavior change before the actual policy takes effect. The policies themselves may have been chosen due to the expectation of change in disease outcomes, which introduces additional biases related to "reverse" causality. These issues are summarized as a checklist of questions to identify common pitfalls in Table 2.

The difference-in-differences (DiD) approach uses concurrent non-intervention groups as a counterfactual. Typically, this consists of one set of units (e.g., regions) that had the intervention and one set that did not, with each measured before and after the intervention took place. DiD is more directly analogous to a non-randomized medical study with at least one treatment group and one control group, but with limited observation before and after treatment. In contrast to ITS, which compares a unit with itself over time, DiD compares differences between treatment arms or units at two observation points. The exchangeability assumption in DiD applies to the outcome trends rather than the outcome levels; in other words, it implies that the treated and control units would have followed the same outcome trends in the absence of the intervention. We refer readers elsewhere for further discussion of the nuances of using these methods to study COVID-19-related policies.

One key component of the standard DiD approach is the parallel counterfactual trends assumption: that the intervention and comparison groups would have had parallel trends over time in the absence of the intervention. In some cases, the parallel trends assumption may be referenced or examined implicitly but not named. 17 Ideally, pre-intervention trends would be shown to be clearly identifiable, stable, of a similar level, and parallel between groups, as in Hsiang et al., 2020. 21 With only one observation before and only one after the intervention, assessment of the plausibility of the parallel counterfactual trends assumption is not possible. Absent this confirmation, 40 the evaluation runs the risk of biased estimation due to differential pre-trends (Figure 3B). Pre-trends approaching a ceiling or floor 17 may also not be informative about stable and parallel pre-trends. Empirical assessment of whether pre-intervention trends were parallel and stable between groups is possible when observations are available at multiple time points before the intervention, noting that this can begin to resemble a CITS design. 41 In this scenario, pre-trend data should be visually and statistically established and documented.
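The following is a minimal, hypothetical sketch (not taken from the paper) of how a two-group DiD estimate and a crude pre-trend check might be computed when several pre-intervention time points are available. All data are simulated and all variable names are invented for illustration.

```python
# Hypothetical sketch of a difference-in-differences (DiD) estimate with a
# simple statistical pre-trend check; simulated data only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
periods = np.arange(8)          # periods 0-5 are pre-policy, 6-7 are post-policy
policy_period = 6
rows = []
for group, (level, trend, effect) in {"treated": (50, 2.0, -8.0),
                                      "control": (40, 2.0, 0.0)}.items():
    for t in periods:
        y = level + trend * t + (effect if t >= policy_period else 0.0)
        rows.append({"group": group, "period": t,
                     "treated": int(group == "treated"),
                     "post": int(t >= policy_period),
                     "outcome": y + rng.normal(0, 1)})
df = pd.DataFrame(rows)

# Two-way DiD: the interaction coefficient is the DiD estimate of the effect.
did = smf.ols("outcome ~ treated * post", data=df).fit()
print(did.params["treated:post"])

# Crude pre-trend check: do pre-period slopes differ between groups?
pre = df[df["post"] == 0]
pretrend = smf.ols("outcome ~ period * treated", data=pre).fit()
print(pretrend.params["period:treated"])   # near zero if pre-trends are parallel
```

A graphical display of group means over time would normally accompany this kind of check, as emphasized in the checklists below.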
While parallel trends before the intervention (which we can observe and may be able to test) do not guarantee parallel counterfactual trends in the post-intervention period (which we cannot observe and which are generally untestable), examining pre-intervention parallel trends is a minimal requirement for DiD reliability. Although DiD can provide some robustness to functional form, this can still be a major issue in many circumstances. 42 It is important to consider the scale and level on which the outcome is measured (Figure 3C). Similarly to the ITS section, these issues are summarized as a checklist of questions to identify common pitfalls in Table 3.

Comparative interrupted time-series is conceptually closely related to DiD, but uses control locations to model the counterfactual change in trajectory over time, rather than just in levels. The key difference lies in how pre-trends are modeled, which are not inherently assumed to be parallel in CITS. CITS allows controlling for differences in the overall trajectory, provided that the trends are stable and not otherwise affected by functional-form issues, concurrent events, and the other problems discussed above. In many cases, including the multiple-time-period case, DiD and CITS can be effectively indistinguishable. 41 As such, the same checks that apply to DiD also apply to CITS. As with any other causal inference design, problems with measurement, generalizability, changes in measurement over time (e.g., varying test availability), statistical models, data quality, and robustness to alternative assumptions, among many other issues, can undermine an otherwise robust analysis.

In recent months, there has been a proliferation of research evaluating policies related to the COVID-19 pandemic. As with other areas of COVID-19 research, quality has been highly variable, with low-quality studies resulting in poorly informed or misinformed policy decisions, poorly tuned and inaccurate projection models, wasted resources, and undermined trust in research. 44,45 To support high-quality policy evaluations, in this paper we describe common approaches to evaluating policies using observational data and key issues that can arise in applying these approaches. This guidance should be considered a minimal screen for identifying low-quality policy impact evaluation in COVID-19, but it is in no way sufficient to identify high-quality evidence or actionability. We hope that this guidance can help support researchers, editors, journalists, and other consumers of research in weighing the strengths and limitations of COVID-19 policy evaluations. This can be challenging because case counts typically do not grow linearly and there is often a lag between a policy change and a behavioral response.

While this guidance is not comprehensive, it may help inform assessment of study designs not covered here. Synthetic control methods 47 and non-randomized trials with regional interventions (e.g., policies) face issues broadly comparable to those discussed here for difference-in-differences analyses. Many causal inference methods for policy evaluation share much of their basic design and key assumptions with the designs in this guidance, although this may vary depending on how they are structured. Other approaches include adjustment- and matching-based observational causal inference designs, 10 instrumental variables and related quasi-experimental approaches, 7,9,48 and cluster-randomized controlled trials. Each has its own set of practical, ethical, and inferential limitations.
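To make the earlier point about the scale on which the outcome is measured (Figure 3C) concrete, here is a hypothetical sketch, not drawn from the paper, showing that two groups with identical multiplicative growth have parallel trends on the log scale but diverging trends on the level scale, so a DiD estimate can depend heavily on the chosen scale. All numbers are invented.

```python
# Hypothetical illustration that the parallel-trends assumption is scale
# dependent: equal multiplicative growth implies parallel log trends but
# diverging level trends. Simulated data only.
import numpy as np
import pandas as pd

periods = np.arange(6)
growth = 1.3                              # same multiplicative growth in both groups
treated_start, control_start = 1000.0, 100.0

df = pd.DataFrame({
    "period": np.tile(periods, 2),
    "group": ["treated"] * 6 + ["control"] * 6,
    "cases": np.concatenate([treated_start * growth ** periods,
                             control_start * growth ** periods]),
})

# Period-to-period changes on the level scale diverge between groups...
level_changes = df.groupby("group")["cases"].apply(
    lambda s: s.diff().dropna().round(1))
print(level_changes)

# ...while changes on the log scale are identical (parallel log trends).
log_changes = df.groupby("group")["cases"].apply(
    lambda s: np.log(s).diff().dropna().round(3))
print(log_changes)
```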
In the face of these challenges, we recommend careful scrutiny of and attention to potential sources of bias when producing and reviewing COVID-19 policy impact evaluations.

Table 1. Summary of the designs, the data they require, and the counterfactual assumption each relies on.

Pre/post (Figure 1A): at least one intervention unit; no comparison units; at least one (typically one) pre-period time point; at least one (typically one) post-period time point. Counterfactual assumption: the outcome would have stayed the same from the pre-period to the post-period. 15-22

Interrupted time-series (Figure 1B): at least one intervention unit; no comparison units; more than one pre-period time point; at least one (typically several) post-period time points. Counterfactual assumption: the outcome slope and level would have continued along the same modelled trajectory from the pre-period to the post-period.

Difference-in-differences (Figure 1C): at least one intervention unit; at least one comparison unit; at least one (typically one) pre-period time point; at least one (typically one) post-period time point. Counterfactual assumption: the outcome in intervention units would have changed as much as (or in parallel with) the outcome in the non-intervention units. 37-45

Comparative interrupted time-series (Figure 1D): at least one intervention unit; at least one comparison unit; more than one (typically several) pre-period time points; at least one (typically several) post-period time points. Counterfactual assumption: the outcome slope and level would have changed as much as the non-intervention group's slope and level changed.

Table 2. Checklist of questions to identify common pitfalls in interrupted time-series analyses. a

Does the analysis provide a graphical representation of the outcome over time?
•Check for a chart that shows the outcome over time, with the dates of interest. Outcomes may be aggregated for clarity (e.g., means and CIs at discrete time points).
Is there sufficient pre-intervention data to characterize pre-trends in the data?
•Check the chart(s) to see if there are several time points over a reasonable period of time over which to establish stability and curvature in the pre-trends.
Is the pre-trend stable?
•Check if there are sufficient data to reasonably determine a stable functional form for the pre-trends, and that they follow a modelable functional form.
Is the functional form of the counterfactual (e.g., linear) well justified and appropriate?

Table 3. Checklist of questions to identify common pitfalls in difference-in-differences and comparative interrupted time-series analyses. a

Does the analysis provide a graphical representation of the outcome over time?
•Check for a graph that shows the outcome over time for all groups, with the dates of interest. Outcomes may be aggregated for clarity (e.g., mean and CI at discrete time points).
Is there sufficient pre-intervention data to observe both pre- and post-trends in the data?
•Check the chart(s) to see if there are several time points over a reasonable period of time over which to establish stability and curvature in the pre- and post-trends.
Are the pre-trends stable?
•Check if there are sufficient graphical data to reasonably determine a stable functional form for the pre-trends, and that they follow a modelable functional form.
Are the pre-trends parallel?
•Observe if the trends in the intervention and comparison groups appear to move together at the same rate at the same time.
Are the pre-trends at a similar level?
•Check if the trends in the intervention and comparison groups are at similar levels.
•Note that non-level trends exacerbate other problems with the analysis, including linearity assumptions.
Are intervention and non-intervention groups broadly comparable?
•Consider areas where comparison groups may be dissimilar beyond just the level of the outcome.
Is the date or time threshold set to the appropriate date or time (e.g., is there lag between the intervention and outcome)?
•Check whether the authors justify the use of the date threshold relative to the date of the intervention.
•Trace the process between the intervention being put in place and when observable effects in the outcome might appear over time.
•Consider whether there are anticipation effects (e.g., do people change behaviors before the date when the intervention begins?).
•Consider whether there are lag effects (e.g., does it take time for behaviors to change, behavior change to impact infections, infections to impact testing, and data to be collected, etc.?).
•Check if the authors appropriately and directly account for these time effects.
Is this policy the only uncontrolled or unadjusted-for way in which the outcome could have changed during the measurement period, differently for policy and non-policy regions?
•Consider any uncontrolled factor which could have influenced the outcome during the measurement period.
•Did any factor(s) influence the outcome in different amounts in policy and non-policy regions?
•This may include (but is not limited to): other policies; social behaviors; economic conditions; spillover effects from other regions.
•Are these factors justified as having negligible impact?
•If justified, is the argument that these have negligible impact convincing?
•Note that the actual concurrent changes do not need to happen during the period of measurement, just their effects.

a If any answer is "no," this analysis is unlikely to be appropriate or useful for estimating the impact of the intervention of interest.

References

Making Decisions in a COVID-19 World
Defining high-value information for COVID-19 decision-making [preprint]. medRxiv
Implementation science in times of Covid
Quarantine alone or in combination with other public health measures to control COVID-19: a rapid review
Which interventions work best in a pandemic
Difference-in-Difference in the Time of Cholera: a Gentle Introduction for Epidemiologists
Quasi-experimental study designs series - paper 7: assessing the assumptions
Modern Epidemiology
Mostly Harmless Econometrics: An Empiricist's Companion. 1st ed.
Causal Inference: What If
Evaluating the impact of healthcare interventions using routine data
Using Difference-in-Differences to Identify Causal Effects of COVID-19 Policies
Designing Difference in Difference Studies: Best Practices for Public Health Policy Research
Statewide Interventions and Covid-19 Mortality in the United States: An Observational Study
Mask Wearing and Control of SARS-CoV-2 Transmission in the United States
Comparison of Estimated Rates of Coronavirus Disease (COVID-19) in Border Counties in Iowa Without a Stay-at-Home Order and Border Counties in Illinois With a Stay-at-Home Order
Interrupted time series regression for the evaluation of public health interventions: a tutorial
The Efficacy of Lockdown Against COVID-19: A Cross-Country Panel Analysis. Appl Health Econ Health Policy
Unemployment insurance and food insecurity among people who lost employment in the wake of COVID-19 [preprint]. medRxiv
Causal Impact of Masks, Policies
Demystifying the Placebo Effect
Excess Deaths and Hospital Admissions for COVID-19 Due to a Late Implementation of the Lockdown in Italy
Mathematical models of infectious disease transmission
Identifying airborne transmission as the dominant route for the spread of COVID-19
Testing the Validity of the Single Interrupted Time Series Design. Report No.: w26080
Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions
Association Between Statewide School Closure and COVID-19 Incidence and Mortality in the US
Signal of increased opioid overdose during COVID-19 from emergency medical services data. Drug and Alcohol Dependence
The extent and consequences of p-hacking in science
Goodman-Bacon A. Difference-in-Differences with Variation in Treatment Timing
Social Distancing and Outdoor Physical Activity During the COVID-19 Outbreak in South Korea: Implications for Physical Distancing Strategies
Do Methodological Birds of a Feather Flock Together?
When Is Parallel Trends Sensitive to Functional Form?
Invited Commentary: Selection Bias Without Colliders
Waste in covid-19 research
Too much information, too little evidence: is waste in research fuelling the covid-19 infodemic
Problems with Evidence Assessment in COVID-19 Health Policy Impact Evaluation (PEACHPIE): A systematic review of evidence strength
Synthetic control methodology as a tool for evaluating population-level health interventions
Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction
Will COVID-19 be evidence-based medicine's nemesis?
Empirical assessment of government policies and flattening of the COVID-19 curve
Revealing regional disparities in the transmission potential of SARS-CoV-2 from interventions in Southeast Asia
COVID-19 effective reproduction number dropped during Spain's nationwide dropdown, then spiked at lower-incidence regions
The effect of lockdown on the COVID-19 epidemic in Brazil: evidence from an interrupted time series design
Effect of mitigation measures on the spreading of COVID-19 in hard-hit states in the U.S.
COVID-19 transmission in the before vs. after relaxation of statewide social distancing measures
Fangcang shelter hospitals are a One Health approach for responding to the COVID-19 outbreak in Wuhan, China. One Health
Associations of Stay-at-Home Order and Face-Masking Recommendation with Trends in Daily New Cases and Deaths of Laboratory-Confirmed COVID-19 in the United States. Exploratory Research and Hypothesis in Medicine
County level analysis to determine if social distancing slowed the spread of COVID-19. Revista Panamericana de Salud Pública
Strong Social Distancing Measures In The United States Reduced The COVID-19 Growth Rate
Were Urban Cowboys Enough to Control COVID-19? Local Shelter-in-Place Orders and Coronavirus Case Growth
Shelter-In-Place Orders Reduced COVID-19 Mortality And Reduced The Rate Of Growth In Hospitalizations. Health Aff (Millwood)
Lessons Learnt from China: National Multidisciplinary Healthcare Assistance
All things equal? Heterogeneity in policy effectiveness against COVID-19 spread in Chile
The Effects of Border Shutdowns on the Spread of COVID-19
Association of State Stay-at-Home Orders and State-Level African American Population With COVID-19 Case Rates