title: The comparative interrupted time series design for assessment of diagnostic impact: methodological considerations and an example using point-of-care C-reactive protein testing
authors: Fanshawe, Thomas R.; Turner, Philip J.; Gillespie, Marjorie M.; Hayward, Gail N.
date: 2022-03-02
journal: Diagn Progn Res
DOI: 10.1186/s41512-022-00118-w

BACKGROUND: In diagnostic evaluation, it is necessary to assess the clinical impact of a new diagnostic as well as its diagnostic accuracy. The comparative interrupted time series design has been proposed as a quasi-experimental approach to evaluating interventions. We show how it can be used in the design of a study to evaluate a point-of-care diagnostic test for C-reactive protein in out-of-hours primary care services, to guide antibiotic prescribing among patients presenting with possible respiratory tract infection. This study consisted of a retrospective phase that used routinely collected monthly antibiotic prescribing data from different study sites, and a prospective phase in which antibiotic prescribing rates were monitored after the C-reactive protein diagnostic was introduced at some of the sites.

METHODS: Of 8 study sites, 3 were assigned to receive the diagnostic and 5 were assigned as controls. We obtained retrospective monthly time series of respiratory tract targeted antibiotic prescriptions at each site. Separate ARIMA models fitted at each site were used to forecast the monthly prescription counts that would be expected in the prospective phase, using simulation to obtain a set of 1-year predictions alongside their standard errors. We show how these forecasts can be combined to test for a change in prescription rates after introduction of the diagnostic, and to estimate the power to detect this change.

RESULTS: Fitted time series models at each site were stationary and showed second-order annual seasonality, with a clear December peak in prescriptions, although the timing and extent of the peak varied between sites and between years. Mean one-year predictions of antibiotic prescribing rates based on the retrospective time series analysis differed between sites assigned to receive the diagnostic and those assigned to control. Adjusting for the trend in the retrospective time series at each site removed these differences.

CONCLUSIONS: Quasi-experimental designs such as the comparative interrupted time series can be used in diagnostic evaluation to estimate effect sizes before conducting a full randomised controlled trial, or when a randomised trial is infeasible. In multi-site studies, existing retrospective data should be used, where possible, to adjust for underlying differences between sites so that outcome data from different sites are comparable.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s41512-022-00118-w.

The development of diagnostic tests is central to improving the timely diagnosis and subsequent treatment of disease. Before a new diagnostic test can become fully established in practice, it is necessary to demonstrate its diagnostic performance in clinical settings, its potential to improve patient outcomes, and its cost-effectiveness [1].
The evaluation cycle from demonstrating analytical performance to cost-effectiveness and broader impact can take a long time: the median time for point-of-care tests has been estimated as 9 years [2], with the range of evidence required necessitating a variety of studies with different designs [3, 4]. Diagnostic accuracy is only a single component in the comprehensive evaluation of a new diagnostic, as recognised in overviews of the field [5, 6], and so it is important to also consider downstream consequences, which might include effects on treatment prescribing, cost, patient outcomes and adverse events. These have been collectively termed 'clinical impact' [7].

The 2017 European Union regulation on In Vitro Diagnostic Medical Devices (Regulation (EU) 2017/746) specifies that evidence of clinical performance must be demonstrated in order for a CE mark to be gained, a change from the earlier Directive 98/79/EC [8]. As such, it has become necessary for studies of new diagnostic devices to include clinical impact measures as outcome variables.

Although the randomised controlled trial (RCT) has historically been regarded as the highest quality design for demonstrating the effectiveness of interventions [9], many diagnostic RCTs may be underpowered [10], and the time they require can delay the adoption of rapidly evolving technologies, suggesting that other designs should be considered. One such design is the 'controlled before/after' study, a quasi-experimental design that can be analysed using methods for comparative interrupted time series (CITS), such as segmented regression. In this design, the diagnostic device is introduced at a number of locations, and outcomes are compared both between locations that use the diagnostic and those that do not, and between the time periods before and after the diagnostic was introduced. As this design partly uses retrospective (so-called 'real-world') data, it can reduce the time and cost of conducting such a study, the aim being to provide a plausible estimate of the effect of the diagnostic that can subsequently be used in the design of a full randomised controlled trial of clinical impact.

In medical research, existing interrupted time series methodology primarily focuses on evaluations of treatments or public health interventions rather than diagnostics [11], and on a single time series from a population rather than multiple time series from different locations [12]. As the CITS design has rarely been used in the evaluation of diagnostics (one example is [13]), there is scope for this design and its associated analytic methods to be explored as a way to evaluate the impact of diagnostics and accelerate the adoption of new technologies into clinical practice. Point-of-care (POC) diagnostic devices are suitable candidates for evaluations of this form, as they can be introduced to different primary or secondary care services for which relevant clinical impact outcome measures are often already routinely collected.

In this paper, we describe the design of a study of a POC diagnostic for C-reactive protein (CRP) testing in out-of-hours primary care, and outline how this design affects analytical considerations. Results from the prospective phase of the study will be reported in a subsequent publication. This paper is structured as follows. First, we give details of the study evaluating POC CRP testing that motivated this work.
We then describe a general methodological approach that can be used to design evaluations of this nature, before showing how it was applied to the study in question. The paper concludes with a discussion of issues that arise when using these methods to design studies of diagnostic impact.

Example: point-of-care C-reactive protein testing

This work was motivated by the design of a study to assess the impact of introducing POC CRP machines to out-of-hours (OOH) primary care services under the governance of Practice Plus Group. The use of CRP to support antibiotic prescribing decisions for suspected lower respiratory tract infection was supported by the National Institute for Health and Care Excellence (NICE) Clinical Guideline CG191 (withdrawn after the start of the COVID-19 pandemic) and has been discussed elsewhere [14, 15]. A previous evaluation of out-of-hours primary care services found that as many as 15% of consultations resulted in the issuing of an antibiotic prescription [16], but a systematic review in 2013 estimated a reduction in antibiotic prescribing at consultation in primary care of around 25% when CRP testing had been used [17].

The study aimed to assess the short-term impact of introducing POC CRP machines on antibiotic prescribing in this healthcare setting, with the results potentially informing a longer-term follow-up study or a full cluster randomised controlled trial if there was an indication of improved prescribing decisions. An estimate of the likely effect size of the intervention is needed as the basis for designing such a randomised controlled trial.

The CRP study consisted of two phases: a retrospective phase that analysed historic antibiotic prescribing data, and a prospective phase that assessed prescribing data after the introduction of POC CRP machines at certain sites. These machines were provided to OOH clinicians with no restrictions on their clinical use. Guidance on the CRP thresholds above which antibiotics should be considered in patients with suspected lower respiratory tract infection followed NICE guidance at that time. Although many tests were linked to this indication, CRP testing was also used for decision-making in a wider array of clinical contexts, at the discretion of the clinician. The prospective phase used a parallel cluster design, with the periods of measurement at each base coinciding. Figure 1 shows a flow diagram of the whole study design.

Practice Plus Group is contracted to deliver out-of-hours services via a number of primary care 'bases' in several regions of England [18]. As the number of machines available for inclusion in the prospective phase of this study was limited to three, an important design decision was how to allocate primary care bases either to receive POC CRP machines (to perform the diagnostic test) or to act as comparators (controls) without machines. The choice of sites that received POC CRP machines was made in a non-randomised manner. This decision was informed by examination of the retrospective monthly time series of antibiotic prescription numbers, available separately for each base; more details are provided in the Results section. The retrospective time series were used in the design to determine the magnitude of change that could plausibly be attributed to the introduction of a POC CRP machine.
The main outcomes were the monthly numbers of respiratory tract targeted antibiotic prescriptions in adults, and the total number of antibiotic prescriptions issued. Therefore, all patients who attended one of the included primary care bases and who were considered for an antibiotic prescription could potentially contribute data. A list of the included respiratory tract targeted antibiotics appears in Additional file 1. Secondary outcomes (not discussed further in the current paper) included total non-topical antibiotic prescriptions, the proportion of patients requiring further general practitioner contact or hospital admission within 14 days, the time required for testing, and the test failure rate. A qualitative substudy, also to be reported elsewhere, aimed to explore clinicians' perspectives on the use of POC CRP tests in out-of-hours services.

The CITS design is an extension of the interrupted time series design that has been widely used as a quasi-experimental approach for the evaluation of health policies or other interventions for which randomisation may be infeasible, such as those in education settings [19-22]. Some papers have investigated sample size and power considerations for these types of designs. Cruz et al. described power considerations for interrupted time series models, but their model was aimed at change-point detection, which is less relevant when the time of introduction of a diagnostic test is known [23]. Zhang et al. examined the relationship between power and the number of time-points in the available time series, also for a single time series, and they restricted their model to be of autoregressive (AR and ARCH) form [24].

The general ARMA(p,q), or autoregressive moving average, model for the time series $(y_t : t = 1, \dots, n)$ has the form

$$y_t = \delta + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q}. \qquad (1)$$

In this equation, $\delta$ represents the mean level of the outcome $y$; $(\phi_1, \dots, \phi_p)$ are parameters that reflect its dependence on previous values of the time series (the autoregressive component); $(\epsilon_1, \dots, \epsilon_n)$ are random variables that are assumed to be independent errors; and $(\theta_1, \dots, \theta_q)$ are parameters that reflect the dependence of the time series on previous error terms (the moving average component). The parameter $\delta$ is assumed constant in the formulation above but can be supplemented with another functional form, such as a linear or non-linear time trend, if required.

In this paper, we use the more general ARIMA(p,d,q), or autoregressive integrated moving average, model, which extends (1) to allow for situations in which the time series is not stationary (i.e. when the assumption that its mean, variance and autocorrelation do not fluctuate over time does not hold). Further details of the models are provided in Additional file 1 and Sections 3.4.6 and 4.6 of the book by Chatfield [25]. Alternatives to these models include simpler linear models that may not allow for autocorrelation [19, 26], and dynamic models that capture this autocorrelation via correlated, temporally evolving random processes [27].

A flexible implementation of this class of models is provided by automatic ARIMA model selection in the R 'forecast' package, which selects a best-fitting model within the required class using the Akaike Information Criterion or Bayesian Information Criterion [28]. Methods of prediction from ARIMA models for forecasting individual values of $y_t$ for $t \geq n + 1$ using the Kalman filter have previously been described [29, 30] and are implemented in the simulate() methods (for ARIMA fits, simulate.Arima()) of the R 'forecast' package [28, 31].
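To make these steps concrete, the sketch below shows how a model of this class could be fitted to a single site's monthly prescription series and used to forecast the prospective phase, using the auto.arima() and forecast() functions of the R 'forecast' package referenced above. This is a minimal sketch: the data object and its values are invented for illustration and do not represent the study's actual series or model settings.

```r
# Minimal sketch: fit a seasonal ARIMA model to one site's monthly
# prescription counts by automatic order selection, then forecast the
# 12-month prospective phase. Data are simulated for illustration only.
library(forecast)

set.seed(1)
# Hypothetical 5-year monthly series with a winter peak in prescriptions
prescriptions <- ts(rpois(60, lambda = 100 + 30 * cos(2 * pi * (1:60) / 12)),
                    start = c(2014, 1), frequency = 12)

# auto.arima() chooses (p, d, q) and seasonal orders by information criterion
fit <- auto.arima(prescriptions, ic = "aic")
summary(fit)

# Point forecasts and prediction intervals for the next 12 months
fc <- forecast(fit, h = 12)
plot(fc)
```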
For the purpose of the present work, interest lies in simulating values of

$$S_k = \sum_{t=n+1}^{n+k} \hat{y}_t,$$

where the $\hat{y}_t$ are forecast values of the time series and, for example, $k = 12$ if $t$ represents time in months and the follow-up period is scheduled to last for 1 year. Thus $S_k$ represents the sum of forecast values over the subsequent year. In such a case, the $\hat{y}_t$ will typically be positively correlated, and using the mean and standard error of the predictive distribution of each $\hat{y}_t$ independently to estimate the standard error of $S_k$ will underestimate the latter if this correlation is not accounted for. Instead, the mean and standard error of the predictive distribution of $S_k$ can be estimated by repeated direct simulation: simulating a complete vector $(\hat{y}_t : t = n+1, \dots, n+k)$, using its sum as a single estimate of $S_k$, repeating, and then calculating the mean $\hat{m}_k$ and standard deviation $\hat{s}_k$ over all calculated estimates of $S_k$.

After observing the follow-up data values $(y_t : t = n+1, \dots, n+k)$, a standardised measure of the increase in observed values over the expected values based on the retrospective time series can be calculated as

$$Z = \frac{\sum_{t=n+1}^{n+k} y_t - \hat{m}_k}{\hat{s}_k}. \qquad (2)$$

Tests can be combined as a global z-test using standard methods [32], treating the individual test statistics as realisations from a Normal distribution with known mean 0 and variance 1. In a study with $n$ intervention regions and $m$ control regions, if $\bar{Z}_I$ and $\bar{Z}_C$ are the means of the Z-statistics in the intervention regions and the control regions, respectively, then a test statistic for the difference in means is

$$Z_{\mathrm{diff}} = \frac{\bar{Z}_I - \bar{Z}_C}{\sqrt{1/n + 1/m}}. \qquad (3)$$

Equations (2) and (3) allow estimation of the power to detect a change in the number of prescriptions relative to the trend in the retrospective time series. Consider a test for a single site, as given by (2). If $V = \sum_{t=n+1}^{n+k} y_t$ follows a Normal distribution with mean $m^*$ and standard deviation $s^*$, then a one-sided hypothesis test of size $\alpha$ based on (2) will detect a reduction from the trend based on the retrospective time series if $V < \hat{m}_k - z_{1-\alpha}\,\hat{s}_k$, where $z_{1-\alpha}$ denotes the $(1-\alpha)$ quantile of the standard Normal distribution.
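The following sketch illustrates this simulation-based procedure and the test statistics above, reusing the hypothetical fitted model 'fit' from the earlier sketch. The follow-up counts, site-level Z values and assumed effect size are all invented for illustration, and the construction follows equations (2) and (3) as reconstructed above rather than the authors' actual code.

```r
library(forecast)

k <- 12          # length of the prospective phase, in months
n_sims <- 10000  # number of simulated future sample paths

# Each call to simulate(..., future = TRUE) draws one k-month path from
# the predictive distribution of the fitted ARIMA model; summing the path
# gives a single draw of S_k
draws <- replicate(n_sims, sum(simulate(fit, nsim = k, future = TRUE)))
m_hat <- mean(draws)  # estimate of the predictive mean of S_k
s_hat <- sd(draws)    # estimate of its standard error

# Equation (2): standardised excess of observed over expected prescriptions
observed <- rpois(k, lambda = 95)  # hypothetical follow-up monthly counts
z_site <- (sum(observed) - m_hat) / s_hat

# Equation (3): combine site-level Z statistics across intervention and
# control sites (values here are purely illustrative)
z_I <- c(-1.2, -0.8, -1.5)
z_C <- c(0.1, -0.2, 0.3, 0.0, 0.2)
z_diff <- (mean(z_I) - mean(z_C)) / sqrt(1 / length(z_I) + 1 / length(z_C))
p_value <- pnorm(z_diff)  # one-sided p-value for a reduction

# Power for a single site: if V ~ Normal(m_star, s_star), the size-alpha
# one-sided test detects a reduction when V < m_hat - qnorm(1 - alpha) * s_hat
alpha <- 0.05
m_star <- 0.9 * m_hat  # e.g. a true 10% reduction in prescriptions
s_star <- s_hat
power <- pnorm((m_hat - qnorm(1 - alpha) * s_hat - m_star) / s_star)
```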