key: cord-0781581-goqo9lcf
title: Methods for Evaluation of medical prediction Models, Tests And Biomarkers (MEMTAB) 2020 Symposium: Virtual, 10-11 December 2020
date: 2021-04-01
journal: Diagn Progn Res
DOI: 10.1186/s41512-021-00094-7
sha: ee647699b115251da79e1bbb74518bc6a7032ffb
doc_id: 781581
cord_uid: goqo9lcf

… guideline developers and practising clinicians, in the hope of improving current understanding through knowledge exchange, and of forging our diverse experiences and perspectives to delineate the future direction of diagnostic test research. In this respect, it is the only conference in the world that provides a platform dedicated to the investigation of medical tests, markers, models and other devices used for diagnosis, prognosis and monitoring. This year's symposium focussed on the following conference themes:

- How to develop and apply prediction models and diagnostic tests
- High-dimensional data and genetic prediction
- Machine learning for evaluation of diagnostic tests, markers and prediction models
- Impact studies for diagnostic tests, markers and prediction models (including low-resource settings)
- Systematic review and meta-analysis (including individual participant data)
- Big data, electronic health records, dynamic prediction
- How to quantify overdiagnosis

With over 135 delegates and 88 accepted abstracts, we believe we were able to offer a very strong programme. It was our great pleasure to host this year's symposium, and we look forward to meeting you again at the next MEMTAB symposium!

Background: Heterogeneity of treatment effect (HTE) refers to the nonrandom variation in the magnitude of the absolute treatment effect ('benefit') across levels of covariates. For randomized controlled trials (RCTs), the PATH (Predictive Approaches to Treatment effect Heterogeneity) Statement suggests two categories of predictive HTE approaches: "risk modeling" approaches, which combine a multivariable model with a constant relative effect of treatment, and "effect modeling" approaches, which include interactions between treatment and baseline covariates [1]. We aimed to assess practical challenges in deriving estimates of absolute benefit based on risk modeling. We re-analyzed data from 30,510 patients with an acute myocardial infarction, as enrolled in the GUSTO-I trial [2]. The average mortality was 6.3% with tPA and 7.3% with streptokinase, an average benefit of 1.0% (p<0.001). A multivariable logistic regression model included 6 predictors of 30-day mortality, which occurred in 2,128 patients. The model provided a linear predictor (or risk score) that discriminated well between low-risk and high-risk patients, with an area under the ROC curve of 0.82. The benefit of tPA over streptokinase treatment increased from 0.2% to 2.4% from the lowest to the highest risk quarter (Figure 1). Proportionality of the treatment effect across predictors was not rejected in tests of interaction (overall test: p=0.30). Continuous benefit was estimated by subtracting the estimated risk under either treatment, with a spline transformation of the linear predictor (Figure 1). Sensitivity analyses showed similar results for different specifications of the risk model or of the continuous benefit modeling. Exploratory one-at-a-time subgroup analyses showed consistent relative effects of treatment.
Conclusions: Risk modeling should become part of the primary analysis of RCTs. One-at-a-time subgroup analyses should be abandoned as the primary approach to indicate any heterogeneity of treatment effect.
Keywords: Heterogeneity of treatment effect, regression model, spline functions
Fig. 1 (abstract 2). Benefit of treatment with tPA compared to streptokinase in the GUSTO-I trial
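For readers unfamiliar with the risk-modeling approach, the following minimal R sketch shows its core steps on hypothetical data (the variable names and data frame `d` are illustrative, not the authors' GUSTO-I code): a logistic model with a constant relative treatment effect is fitted on top of a multivariable risk score, and absolute benefit is then summarised by quarter of baseline risk.

```r
# Hypothetical data frame `d`: 30-day mortality `death`, treatment arm `tpa` (0/1),
# and baseline covariates x1-x6; constant relative treatment effect assumed
fit <- glm(death ~ tpa + x1 + x2 + x3 + x4 + x5 + x6, family = binomial, data = d)

# Predicted risk under each treatment for every patient
d0 <- transform(d, tpa = 0); d1 <- transform(d, tpa = 1)
risk0 <- predict(fit, newdata = d0, type = "response")  # risk under streptokinase
risk1 <- predict(fit, newdata = d1, type = "response")  # risk under tPA
benefit <- risk0 - risk1                                # absolute benefit of tPA

# Average absolute benefit by quarter of baseline risk
quarter <- cut(risk0, quantile(risk0, 0:4 / 4), include.lowest = TRUE)
tapply(benefit, quarter, mean)
```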
Background: In diagnostic accuracy studies, sensitivity and specificity are recommended as co-primary endpoints. For the sample size calculation, assumptions have to be made about the expected sensitivity and specificity of the index test, as well as about the minimal acceptable diagnostic accuracy or the expected diagnostic accuracy of the comparator test. However, the assumptions taken from previous studies are often uncertain [1]. As an example for the talk we chose the study by Yan et al., in which the estimated sensitivity was 75.8%, whereas the authors had expected 91% [2].
Methods: Because of this uncertainty, it is essential to develop methods for sample size re-estimation in diagnostic accuracy trials. While such adaptive designs are standard in interventional trials, they are uncommon in diagnostic trials [3]. Known approaches from interventional trials cannot be applied to diagnostic accuracy studies, or have to be modified, especially because a specific feature of diagnostic accuracy trials is the two co-primary endpoints, sensitivity and specificity.
Results: In this talk we propose an approach for an unblinded sample size re-estimation in diagnostic accuracy studies. We show that, with the adaptive design, the type I error is maintained and the desired power is achieved. Furthermore, the results of the example study are presented.
Conclusion: Using unblinded sample size re-estimation, diagnostic accuracy studies can be made more efficient.
Keywords: Diagnostic accuracy, adaptive design, unblinded interim analysis
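As context for why the initial assumptions drive the sample size, a simple fixed-design calculation is sketched below. This is a generic precision-based approach, not the authors' re-estimation procedure; cases and controls are sized separately because sensitivity and specificity are co-primary endpoints.

```r
# Number of cases (or controls) needed so that a 95% Wald CI around an
# anticipated sensitivity (or specificity) p has half-width at most d
n_precision <- function(p, d, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(p * (1 - p) * (z / d)^2)
}
n_precision(p = 0.91, d = 0.05)   # cases needed if sensitivity is truly 91%
n_precision(p = 0.758, d = 0.05)  # many more cases if it is only 75.8%
```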
Background: Systematic reviews include primary studies that differ in sample size, with larger studies contributing more to the meta-analysis. At present, study size is not considered in risk-of-bias evaluations. We aimed to develop an alternative way to present the contribution of individual studies to the total body of evidence on diagnostic accuracy, in terms of risk of bias and concerns about applicability, one that takes the effective sample size into account.
Methods: We used the results of a systematic review of diagnostic accuracy studies of the Enhanced Liver Fibrosis (ELF) test for diagnosing liver fibrosis among non-alcoholic fatty liver disease patients. We assessed the 11 studies identified from our systematic search of five databases with the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) tool. We first used the number of studies to show the proportion of studies at low, unclear and high risk of bias. We then developed an alternative version of the graph, which relies on the proportion of the total sample size contributed by studies at different levels of risk of bias.
Results: The risk-of-bias levels for each domain of the QUADAS-2 checklist changed after replacing the number of studies with the relative sample sizes of the individual studies. For instance, the risk of bias in the patient selection domain was high in 45% of the studies, and low and unclear in 27% of studies each (Figure 1A). The alternative graph using the sample sizes showed 25%, 41% and 34% of the included population at high, unclear and low risk of bias, respectively (Figure 1B).
Conclusion: A fair representation of the risk of bias and concerns about applicability in the available body of evidence from diagnostic accuracy studies should be based on the total sample size, not on the number of studies.
Keywords: Meta-analysis, accuracy studies, risk of bias assessment
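A minimal sketch of the proposed weighting, using hypothetical study data: proportions for one QUADAS-2 domain are computed once by study count and once by total sample size.

```r
# Hypothetical QUADAS-2 ratings for the patient selection domain
studies <- data.frame(
  n   = c(250, 80, 400, 120, 60),                    # study sample sizes
  rob = c("high", "low", "unclear", "high", "low")   # risk-of-bias rating
)
prop.table(table(studies$rob))                   # proportions by number of studies
prop.table(tapply(studies$n, studies$rob, sum))  # proportions by total sample size
```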
Background: Although a large number of resources have been invested in biomarker (BM) discovery, for both prognostic and diagnostic purposes, very few of those BMs have been clinically adopted. In an attempt to bridge the gap between BM discovery and clinical use, our previous study developed and retrospectively validated a checklist comprising 125 characteristics associated with the clinical implementation of cancer BMs. Despite validation, the complexity of implementing the full checklist might present a barrier. Therefore, this study aims to generate a user-friendly and concise consensus statement with literature-reported attributes associated with successful BM implementation.
Methods: A checklist of BM attributes was created using the Medline and Embase databases according to PRISMA guidelines. A qualitative approach was applied to validate the list, utilising semi-structured interviews (n=32). Thematic analysis was conducted until thematic saturation was achieved. Upon completion of the literature review and interviews, a 3-phase online Delphi survey was designed, aiming to develop a consensus document. The participants involved were grouped based on their expertise: clinicians, academics, and patient and industry representatives.
Results: The 125 previously identified attributes retrieved from the literature and reporting guidelines were included in the checklist. Upon thematic analysis of the interviews, the characteristics listed in the checklist were validated. The most commonly occurring theme focused on clinical utility. Interestingly, different groups focused on different themes, emphasising the importance of the participants' diverse backgrounds. Specifically, the recurrent themes for clinicians and laboratory personnel fell under clinical utility, whereas the recurrent themes for patient representatives and industry personnel focused on clinical and analytical validity, respectively.
Conclusions: This study generated a validated checklist of literature-reported attributes linked with successful BM implementation. Upon completion of the Delphi survey, a consensus statement will be generated which could be used to i) detect BMs with the highest potential of being clinically implemented and ii) shape how BM studies are designed and performed.
Keywords: Biomarkers, clinical implementation, checklist, Delphi survey, qualitative research

Developing Target Product Profiles for medical tests: a methodology review
Paola Cocco 1, Anam Ayaz-Shah 2, Michael Paul Messenger 3, Robert Michael West 4, Bethany Shinkins 1
Background: A Target Product Profile (TPP) is a strategic document which describes the necessary characteristics of an innovative product to address an unmet clinical need. TPPs provide valuable information to manufacturers for designing 'fit for purpose' tests. To our knowledge, there is no formal guidance as to best-practice methods for developing a TPP specific to medical tests. We aimed to review and summarise the methods currently used to develop TPPs for medical tests and to identify the test characteristics commonly reported.
Methods: We conducted a methodology systematic review of TPPs for medical tests. Database and website searches were carried out in November 2018. TPPs written in English for any medical test were included. Test characteristics were clustered into commonly recognised themes.
Results: Forty-four studies were identified, all of which focused on diagnostic tests for infectious diseases. Three core decision-making phases for developing TPPs were identified: scoping, drafting and consensus-building. Consultations with experts and the literature mostly informed the scoping and drafting of TPPs. All TPPs provided information on the unmet clinical need and the desirable analytical performance of the test, and the majority specified clinical validity characteristics. Few TPPs described specifications for clinical utility, and none included cost-effectiveness.

Background: The assessment of agreement in method comparison and observer variability studies of quantitative measurements is often done with Bland-Altman Limits of Agreement (BA LoA), for which the paired differences are implicitly assumed to follow a Normal distribution. Whenever this assumption does not hold, the respective 2.5% and 97.5% percentiles are often assessed by simple quantile estimation. Sample, subsampling, and kernel quantile estimators, as well as other methods for quantile estimation, have been proposed in the literature and were compared in this simulation study.
Methods: Given sample sizes between 30 and 150 and different distributions of the paired differences (Normal; Normal with 1%, 2%, and 5% outliers; Exponential; Lognormal), the performance of 14 estimators in generating prediction intervals for one newly generated observation was evaluated by their respective coverage probability.
Results: For n=30, the simplest sample quantile estimator (smallest and largest observation as estimates of the 2.5% and 97.5% percentiles) outperformed all other estimators. For sample sizes of n=50, 80, 100, and 150, only one other sample quantile estimator (a weighted average of two order statistics) complied with the nominal 95% level in all distributional scenarios. The Harrell-Davis subsampling estimator and estimators of the Sfakianakis-Verginis type achieved at least 95% coverage for all investigated distributions for sample sizes of at least n=80, apart from the Exponential distribution (at least 94%).
Conclusions: Simple sample quantile estimators based on one or two order statistics can be used for deriving nonparametric Limits of Agreement. For sample sizes exceeding 80 observations, more advanced quantile estimators of the Harrell-Davis and Sfakianakis-Verginis types, which make use of all observed differences, are equally applicable and may be considered intuitively more appealing than simple sample quantile estimators based on only two observations per quantile (Figure 1).
Keywords: Agreement, Bland-Altman plot, coverage, prediction, quantile estimation, repeatability, reproducibility
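To make the simplest estimator concrete, the sketch below (hypothetical paired measurements, not the study's simulation code) derives nonparametric Limits of Agreement from the extreme order statistics and, for comparison, from one of R's interpolation-based quantile types, which averages two order statistics (not necessarily the specific weighted estimator evaluated in the study).

```r
# Hypothetical paired differences between two measurement methods (non-Normal)
set.seed(1)
diffs <- rlnorm(30) - rlnorm(30)

# Simplest sample quantile estimator: smallest and largest observation
loa_minmax <- range(diffs)

# A weighted average of two order statistics (R's quantile type 6)
loa_wtd <- quantile(diffs, probs = c(0.025, 0.975), type = 6)
```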
Session chair: Laure Wynants

12. QUADAS-C: a tool for assessing risk of bias in comparative diagnostic accuracy studies
Bada Yang 1, Penny Whiting 2, Clare Davenport 3,4, Jonathan Deeks 3,4, Christopher Hyde 5, Susan Mallett 3, Yemisi Takwoingi 3,4 and Mariska Leeflang 1, for the QUADAS-C group
Background: Comparative diagnostic test accuracy studies assess the accuracy of multiple tests in the same study and compare their accuracy. While these studies have the potential to yield reliable evidence regarding comparative accuracy, shortcomings in their design, conduct and analysis may bias the results. The currently recommended quality assessment tool for diagnostic accuracy studies, QUADAS-2, is not designed for the assessment of test comparisons. We developed QUADAS-C as an extension to QUADAS-2 to assess the risk of bias in comparative diagnostic test accuracy studies.
Methods: Through a four-round Delphi study involving 24 international experts in test evaluation and a face-to-face consensus meeting, we developed a draft version of QUADAS-C, which will undergo piloting in ongoing systematic reviews of comparative diagnostic test accuracy.
Results: QUADAS-C retains the four-domain structure of QUADAS-2 (patient selection, index test, reference standard, flow and timing) and comprises additional questions for each QUADAS-2 domain.

Background: Once a clinical prediction model has been developed, its predictive performance should be examined in new data, independent of the data used for model development. This process is known as external validation. Many current external validation studies suffer from small sample sizes and, consequently, imprecise estimates of a model's predictive performance. To address this, in our talk we propose methods to determine the minimum sample size needed for external validation of a clinical prediction model with a continuous outcome.
Methods: Four criteria are proposed that target precise estimates of (i) R² (the proportion of variance explained), (ii) calibration-in-the-large (agreement between predicted and observed outcome values on average), (iii) the calibration slope (agreement between predicted and observed values across the range of predicted values), and (iv) the variance of observed outcome values. Closed-form sample size solutions are derived for each criterion, which require the user to specify anticipated values of the model's performance (in particular R²) and the outcome variance.
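The four quantities targeted by these criteria can be estimated on a validation dataset as sketched below. This is a generic illustration with hypothetical predicted and observed values; the closed-form sample size solutions themselves are given in the authors' work.

```r
# Hypothetical external validation data: predicted (yhat) and observed (y) values
set.seed(2)
yhat <- rnorm(200, 10, 2)
y    <- 1 + 0.9 * yhat + rnorm(200, 0, 2)

r2        <- cor(y, yhat)^2               # (i) proportion of variance explained
citl      <- mean(y) - mean(yhat)         # (ii) calibration-in-the-large
cal_slope <- coef(lm(y ~ yhat))["yhat"]   # (iii) calibration slope
var_y     <- var(y)                       # (iv) variance of observed outcomes
```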
Background: Clinical prediction models (CPMs) are of great interest to oncology clinicians, who can use past and current patient characteristics to inform assessments of current and future health status. However, systematic reviews show that CPMs are often developed and validated using inappropriate methodology and are poorly reported. Application of machine learning (ML) methods to develop CPMs has risen considerably, and ML is often portrayed as offering many advantages (over traditional statistical methods), especially when using 'big', non-linear and high-dimensional data. However, poor methodology and reporting continue to be barriers to their clinical use. To improve the usability of ML-CPMs, it is important to evaluate their methodological quality and adherence to reporting guidelines for prediction modelling. We aimed to evaluate the methodological conduct and reporting of author-defined ML-CPM studies within oncology.
Methods: We conducted a systematic review of oncology ML-CPMs published in 2019, using MEDLINE and Embase. We excluded studies using imaging or lab-based data. We extracted data on study design, outcome, sample size, ML methodology, and items for risk of bias [1]. The primary outcome was adherence to prediction model reporting guidelines [2].
Results: We identified 2,922 publications and excluded 2,843 based on the eligibility criteria, extracting data from 79 publications. Preliminary results show poor reporting and methodological conduct. Studies used inefficient validation methods (e.g., split-sample) and did not adequately address missing data. Sample size was not reported for most studies, and discrimination was emphasised over calibration. Studies were at increased risk of overfitting, leading to optimistic performance measures for their models.
Conclusions: Reporting and methodological conduct of oncology ML-CPMs need to be improved. Caution is needed when interpreting ML-CPMs, as performance may be over-optimistic.
Keywords: Machine learning, prediction modelling, reporting

Background: However, prediction models are commonly interpreted in a causal manner, for example by altering inputs to the model to demonstrate the hypothetical impact of an intervention. This can lead to biased causal effects being inferred, and thus to misinformed decision making. We aimed to collect examples of prediction models being used in a causal manner in practice, and to identify and interpret literature that provides methods for enriching prediction models with causal interpretations.
Methods: We systematically reviewed the literature to identify methods for prediction models with causal interpretations, by adapting a scoping review framework and considering the intersection of prediction modelling keywords and causal inference keywords. We included papers in which methods are developed or applied that enrich prediction with causal inference methods, specifically allowing some assessment of the causal impact of an intervention on predicted risk.
Results: Two broad categories of approach were identified: 1) enriching prediction models with externally estimated causal effects, such as from meta-analyses of clinical trials; and 2) estimating both a prediction model and causal effects from observational data. The latter category included methods such as marginal structural models and g-estimation, embedded within both statistical and machine learning frameworks.
Conclusions: There is a need for prediction models that allow 'counterfactual prediction', i.e. estimating the risk of outcomes under different hypothetical interventions, to support decision making. Methods exist but require development, particularly when triangulating data from different sources (e.g. observational data and randomised controlled trials). Techniques are also required to validate such models.
Keywords: Causal, counterfactual, prediction, model

Risk prediction with discrete ordinal outcomes: calibration and the impact of the proportional odds assumption
Michael Edlinger 1,2, Maarten van Smeden 3,4, Hannes F Alber 5,6, Ewout W Steyerberg 7, Ben Van Calster 1,7
Background: When evaluating the performance of risk prediction models, calibration is often underappreciated, and there is little research on calibration for discrete ordinal outcomes. We aimed to compare calibration measures for risk models that predict a discrete ordinal outcome (typically 3 to 6 categories) and to investigate the impact of assuming proportional odds on risk estimates and calibration.
Methods: We studied multinomial logistic, cumulative logit, adjacent-category logit, continuation-ratio logit, and stereotype logistic models. To assess calibration, we investigated calibration intercepts and slopes for every outcome level, for every dichotomised version of the outcome, and for every linear predictor (i.e. algorithm-specific calibration). Finally, we used the estimated calibration index as a single-number metric, and constructed calibration plots. We used large-sample simulations to study the behaviour of the logistic models in terms of risk estimates, and small-sample simulations to study overfitting. As a case study, we used data from 4,888 symptomatic patients to predict the degree of coronary artery disease (five levels, from no disease to three-vessel disease).
Results: Models assuming proportional odds easily resulted in incorrect risk estimates. Calibration slopes for specific outcome levels or for dichotomised outcomes often deviated from unity, even on the development data. Non-proportional odds models, however, suffered more from overfitting, because these models require more parameters. Algorithm-specific calibration for proportional odds models assumes that the proportionality assumption holds, and therefore did not fully evaluate calibration.
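One of the calibration checks described here can be sketched as follows, for hypothetical inputs: for each dichotomised version of the ordinal outcome, the calibration intercept and slope are obtained by logistic regression on the logit of the predicted cumulative risk (the study's own models are fitted elsewhere).

```r
# y: ordinal outcome coded 1..K; p_ge: matrix with column k holding predicted P(Y >= k)
calib_by_level <- function(y, p_ge, k) {
  event <- as.integer(y >= k)
  lp    <- qlogis(p_ge[, k])                           # logit of predicted risk
  int   <- glm(event ~ offset(lp), family = binomial)  # calibration intercept
  slope <- glm(event ~ lp, family = binomial)          # calibration slope
  c(intercept = coef(int)[1], slope = coef(slope)[2])
}
```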
Background: Reporting of clinical prediction models has been shown to be poor, with information on the intercept often missing. To allow application of a model for individualised risk prediction, information on the intercept is essential. We aimed to evaluate possible methods to estimate an unreported intercept of a logistic regression model.
Methods: Using existing data, we developed a logistic regression model with 6 predictors to predict the risk of operative delivery in pregnant women. We considered 4 scenarios in which the intercept was not reported, but in which different information was available: i) a web calculator, ii) a nomogram, iii) coefficients/odds ratios, and iv) a scoring table (i.e., three simplified categories and corresponding predicted probabilities). In scenarios i) and ii), the coefficient for each predictor was estimated by assessing the change in predicted probability that occurred with a change in the particular predictor; the intercept was then estimated by calculating the difference between the predicted probability and the estimated predictor coefficients. In scenario iii), the intercept was estimated based on the assumption that the mean risk estimated by the model would be close to the observed incidence of the outcome for a patient who had the mean value of each predictor. In scenario iv), the intercept was estimated from the association between score categories and the corresponding predicted probabilities.
Results: Among 5,667 labouring women, 1,590 (28.1%) had an operative delivery. While the true value of the intercept was -9.563, the estimated intercepts in the four scenarios were -9.552, -9.580, -9.308, and -8.940, respectively.
Conclusion: In scenarios i) and ii), where detailed information on predicted probabilities is available, the unreported intercept can be estimated accurately. In contrast, the estimation of the intercept can be unstable when only coefficients/odds ratios, or a simple score with corresponding predicted probabilities, are reported.
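Scenario iii) can be sketched as follows (hypothetical coefficients and summary data, not the study's model): the intercept is chosen so that the model's predicted risk at the mean predictor values matches the observed outcome incidence.

```r
# Hypothetical published information: coefficients, mean predictor values,
# and the observed incidence of the outcome
beta  <- c(0.8, -0.3, 1.2)   # reported coefficients (intercept unreported)
xbar  <- c(29.5, 0.4, 1.1)   # mean value of each predictor
p_obs <- 0.281               # observed outcome incidence

# Intercept such that the predicted risk at the mean covariates equals p_obs
intercept <- qlogis(p_obs) - sum(beta * xbar)
```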
Background: Given not only the long history of outbreaks and pandemics but also the increase in their frequency and diversity during the past decades, the ongoing Covid-19 pandemic was just a matter of time. A number of human-related and non-human-related factors are converging to drive these outbreaks. The causative viral agent, SARS-CoV-2, spread around the world in a matter of weeks, facilitated by transmission via the respiratory airways, including by people who do not display symptoms. The world still had to react predominantly in crisis mode, exposing gaps in pandemic preparedness on multiple fronts. It is a reminder that a problem in one part of the world can rapidly become a problem in every part of the world, with impacts felt beyond the medical and public health levels.
Methods: We therefore call for a quantum change in the world's approach, preparedness and response to pandemics, some of which pose existential threats to society. Proper preparation should include an international re-evaluation of the role of basic healthcare around the world. The growing threat of future 'unseen' enemies requires the adoption of a fundamentally new mindset, one with a longer time horizon and new technological tools, including more advanced diagnostics that can be deployed locally, generate high-quality data rapidly, and be mass-produced.
Results: To respond better to future threats, we should invest in and develop pan-viral therapies and more vaccine platforms that can deliver solutions much more rapidly.

Background: Screening programmes are evaluated using randomised controlled trials with lengthy follow-up for morbidity, mortality and overdiagnosis. The fast pace of technological advances means these trials are often based on outdated screening tests. A framework is needed to guide the evaluation of proposed changes to screening tests. We aimed to develop a practical framework to evaluate proposed changes to screening tests in established screening programmes, by synthesis of existing methods and development of new theory.
Methods: We identified published frameworks for the evaluation of tests and screening programmes (n=64), and existing methods for evaluating or comparing screening tests published on the websites of national screening organisations from 16 countries. We extracted principles relevant to the evaluation of screening tests. We then searched the same websites for reviews evaluating changes to screening tests (n=484). We analysed the pathways through which these changes to screening tests affected downstream health, and used these to adapt and extend our framework.
Results: We did not find an existing framework specifically designed for the evaluation of screening tests across screening programmes. Our proposed framework describes the pathways through which changing a screening test can affect downstream health. Some of these pathways are already included in test evaluation frameworks, e.g. test failures, test accuracy and incidental findings. Some are specific to screening, such as overdiagnosis. We recommend study designs to evaluate these pathways, and recommend a stepwise approach to ensure proportionate review, with the most intensive evaluation required when there is a change to the spectrum of disease detected.
Conclusions: We present a draft framework for evaluating changes to screening tests. This framework adapts principles from diagnostic test frameworks to the unique challenges of evaluating screening tests, including the complexity of estimating the benefit of earlier detection following screen detection, and the associated overdiagnosis.

Background: Diagnostic accuracy studies with small numbers of cases often use data-driven methods to simultaneously identify an optimal cutoff and estimate its accuracy. When data-driven optimal cutoffs diverge from standard or commonly used cutoffs, authors sometimes argue that sample characteristics influence accuracy and that different optimal cutoffs are therefore needed for particular population subgroups. We aimed to explore the variability in identified optimal cutoffs and diagnostic accuracy estimates across samples of different sizes, and to quantify bias in accuracy estimates for data-driven optimal cutoffs, using real participant data on Patient Health Questionnaire-9 (PHQ-9) diagnostic accuracy.
Methods: We conducted a simulation study using data from an individual participant data meta-analysis (IPDMA) of PHQ-9 diagnostic accuracy (58 studies, 17,436 participants, 2,322 cases). 1,000 samples of sizes 100, 200, 500 and 1,000 participants were drawn with replacement from the IPDMA database (Figure 1). Optimal cutoffs (based on Youden's J) and their accuracy estimates were compared to accuracy estimates for the standard, and optimal, cutoff of ≥10 in the full IPDMA database.
Results: Optimal cutoffs ranged from 3-19, 5-14, 5-13, and 6-12 in samples of 100, 200, 500, and 1,000 participants, respectively. Compared to estimates for a cutoff of ≥10 in the full IPDMA database, sensitivity was overestimated by 10%, 8%, 6% and 5% in samples of 100, 200, 500, and 1,000 participants, respectively. Specificity was underestimated by 4% across sample sizes.
Conclusions: Using data-driven methods to select optimal cutoffs in small samples leads to large variability in the identified optimal cutoffs and to exaggerated accuracy estimates, although cutoff variability and sensitivity exaggeration decrease as sample size increases. Researchers should report accuracy estimates for all cutoffs rather than just study-specific optimal cutoffs. Differences in accuracy and optimal cutoffs seen in small studies may be due to the small sample sizes rather than to participant characteristics.
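The core of such a simulation can be sketched as below (hypothetical continuous scores and reference diagnoses in a data frame `d`, not the IPDMA data): in each bootstrap sample the Youden-optimal cutoff is selected, and its apparent sensitivity is recorded for comparison against a fixed standard cutoff.

```r
# score: continuous test score; case: reference-standard diagnosis (0/1)
youden_cutoff <- function(score, case, cutoffs = sort(unique(score))) {
  j <- sapply(cutoffs, function(c) {
    sens <- mean(score[case == 1] >= c)
    spec <- mean(score[case == 0] < c)
    sens + spec - 1
  })
  cutoffs[which.max(j)]
}

# One bootstrap replicate of size 100 from a hypothetical dataset `d`
b   <- d[sample(nrow(d), 100, replace = TRUE), ]
opt <- youden_cutoff(b$score, b$case)
sens_opt <- mean(b$score[b$case == 1] >= opt)  # apparent (optimistic) sensitivity
```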
Background: Results of diagnostic test accuracy (DTA) meta-analyses are often presented in two ways: i) forest plots displaying meta-analysis results for sensitivity and specificity separately, and ii) summary receiver operating characteristic (SROC) curves providing a global summary of test performance. However, other relevant information on the included studies is often not presented graphically or in the context of the results. We aimed to develop graphical enhancements to SROC plots to address shortcomings in the current guidance on the graphical presentation of DTA meta-analysis results.
Methods: A critical review of guidelines for conducting DTA systematic reviews and meta-analyses was conducted to establish and critique current recommendations for best practice in producing plots. New plots addressing the shortcomings identified in the review were devised and implemented in MetaDTA [1].
Results: Two primary shortcomings were identified: i) lack of incorporation of quality assessment results into the main analysis; and ii) ambiguity in how the contributions of individual studies to the meta-analysis are represented on SROC curves. In response, two novel graphical displays were developed: i) a quality-assessment-enhanced SROC plot, which displays the results of the individual studies in the meta-analysis using glyphs to simultaneously represent the multiple dimensions of quality assessed with QUADAS-2; and ii) a percentage-study-weights-enhanced SROC plot, which uses ellipses to accurately portray the percentage contribution each study makes to both sensitivity and specificity simultaneously.
Conclusions: The proposed enhanced SROC curves facilitate the exploration of DTA data, leading to a deeper understanding of the primary studies, including identifying reasons for between-study heterogeneity and why specific study results may be divergent. Both plots can easily be produced in the free online interactive application MetaDTA [1].
Keywords: Diagnostic test accuracy, meta-analysis, visualisation

Background: Several biomarkers have been proposed for the diagnosis of sporadic Creutzfeldt-Jakob disease (sCJD), the most prevalent form of human prion disease. We identified and evaluated all relevant diagnostic studies of the biomarker-based differential diagnosis of sCJD (using serum or cerebrospinal fluid biomarkers), and combined direct and indirect evidence from these studies in a network meta-analysis.
Methods: We systematically searched Medline (via PubMed), Embase, and the Cochrane Library. To be eligible, studies had to use the established diagnostic criteria for sCJD, and established diagnostic criteria for other forms of dementia, as the reference standard. Studies had to provide sufficient information to construct the 2×2 contingency table (i.e., false and true positives and negatives). We registered the study protocol with PROSPERO, number CRD42019118830. Risk of bias was assessed with the QUADAS-2 tool. We used a bivariate model to conduct meta-analyses of individual biomarkers and to estimate the between-study variability in logit sensitivity and specificity. To investigate sources of heterogeneity, we performed subgroup analyses based on QUADAS-2 quality and clinical criteria. We used a Bayesian beta-binomial analysis-of-variance model for the network meta-analysis.
Results: We included eleven studies, which investigated 14-3-3 (n=11), NSE (n=1), RT-QuIC (n=3), S100B (n=3), and tau (n=9). Heterogeneity was high in the meta-analyses of individual biomarkers and differed depending on the level of certainty of the sCJD diagnosis. In the network meta-analysis, 14-3-3 was the most sensitive but among the least specific tests, while RT-QuIC was the most specific but among the least sensitive tests.
Conclusions: Our work shows the weaknesses of previous diagnostic accuracy studies. Subgroup analyses will reveal whether our results depend on the methodological quality of the studies or on the clinical criteria applied to the patients.
Keywords: Blood, cerebrospinal fluid, neurodegeneration, diagnosis, sporadic Creutzfeldt-Jakob disease
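For a single biomarker, a bivariate meta-analysis of sensitivity and specificity of the kind described here can be run with the mada package in R, as sketched below for hypothetical 2×2 counts (the network meta-analysis itself requires the authors' Bayesian model).

```r
library(mada)
# Hypothetical 2x2 counts from three studies of one biomarker
dat <- data.frame(TP = c(40, 55, 23), FN = c(5, 8, 4),
                  FP = c(12, 20, 9),  TN = c(88, 130, 60))
fit <- reitsma(dat)  # bivariate random-effects model (Reitsma et al.)
summary(fit)         # summary sensitivity/specificity and between-study variability
```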
The potential for seamless designs in traditional diagnostic research?
Werner Vach 1, Eric Bibiza-Freiwald 2, Oke Gerke 3, Tim Friede 4, Patrick M Bossuyt 5, Antonia Zapf 2
Background: New diagnostic tests to identify a well-established disease state have to undergo a series of scientific studies, from test construction to finally demonstrating a societal impact. Traditionally, these studies are performed with substantial time gaps in between. Seamless designs allow us to combine a sequence of studies in one protocol and may hence accelerate this process. We performed a systematic investigation of the potential of seamless designs in diagnostic research.
Methods: We summarised the major study types in diagnostic research and identified their basic characteristics with respect to applying seamless designs. This information was used to identify major hurdles and opportunities for seamless designs.
Results: 11 major study types were identified. The following basic characteristics were identified: type of recruitment (case-control vs population-based), application of a reference standard, inclusion of a comparator, paired or unpaired application of a comparator, assessment of patient-relevant outcomes, and the possibility of blinding test results. Two basic hurdles were identified: 1) accuracy studies are hard to combine with post-accuracy studies, as the former are required to justify the latter, and as application of a reference test in outcome studies is a threat to the study's integrity; 2) questions that can be clarified by other study designs should be clarified before performing a randomised diagnostic study. However, there is substantial potential for seamless designs, since all steps from test construction to comparison with the current standard can be combined in one protocol. This may include a switch from case-control to population-based recruitment, as well as a switch from a single-arm study to a comparative accuracy study. In addition, change-in-management studies can be combined with an outcome study in discordant pairs. Examples from the literature illustrate the feasibility of both approaches.
Conclusions: There is potential for seamless designs in diagnostic research.
Keywords: Test construction studies, accuracy studies, randomized diagnostic studies, seamless design, blinding

Background: Multivariate probit models are used to analyse correlated ordinal data. In the context of diagnostic test accuracy without a gold standard test, their use has been more limited. Multivariate probit models have been used for the analysis of dichotomous and categorical (>1 threshold) diagnostic tests in a single study, and for the meta-analysis of dichotomous tests. We aimed to (i) develop a model for the meta-analysis of multiple binary and categorical diagnostic tests without a gold standard, and (ii) extend the model to enable estimation of joint test accuracy.
Methods: We extended previously proposed multivariate probit models for the meta-analysis of diagnostic test accuracy, modelling the conditional within-study correlations between tests. Dichotomous tests use binary multivariate probit likelihoods, and categorical tests use ordered likelihoods. We also showed how the model can be extended to estimate joint test accuracy and to meta-analyse studies which report accuracy at distinct thresholds, and how to incorporate priors for the 'gold standard' tests based on inter-rater agreement information. We fitted the models using Stan, which uses a state-of-the-art Hamiltonian Monte Carlo algorithm.
Results: We applied the methods to a dataset in which studies evaluated the accuracy of tests for deep vein thrombosis, with two dichotomous tests and one categorical test per study. We compared our results to the original study, which assumed a perfect reference test. In Stan, we found estimation to be very slow for meta-analyses containing large studies with sparse data. We discuss these computational issues and possible ways to improve scalability by making use of recently proposed algorithms, such as calibrated data augmentation Gibbs sampling.
Conclusions: We developed a model for the meta-analysis of multiple, categorical diagnostic tests without a gold standard. Unlike latent class models, it can be extended to tackle a variety of problems without having to inappropriately simplify or discard data.
Keywords: Meta-analysis, diagnostic, test, accuracy, probit, imperfect, gold, reference, thresholds, interrater, agreement

Background: Most heart failure (HF) clinical prediction models (CPMs) have not been independently externally validated. We aimed to test the performance of HF CPMs using a systematic approach.
Methods: We performed a systematic review to identify CPMs predicting outcomes in HF, stratified into acute and chronic HF CPMs. External validations were performed using individual patient data from 8 large HF trials. CPM discrimination (c-statistic, % relative change in c-statistic) as well as calibration (Harrell's E, E90, net benefit) was estimated for each CPM, with and without recalibration.
Results: Of 135 HF CPMs screened, 24 (18%) were matched on population, predictors and outcomes to the trials, and 42 external validations were performed. The median derivation c-statistic of acute HF CPMs was 0.76 (IQR 0.75-0.8), the validation c-statistic was 0.67 (0.65-0.68) and the model-based c-statistic was 0.68 (0.66-0.76), demonstrating that most of the decrement in model performance was due to a narrower case-mix in the validation cohort compared with the development cohort. The median derivation c-statistic for chronic HF CPMs was 0.76 (0.74-0.8), the validation c-statistic 0.61 (0.6-0.63) and the model-based c-statistic 0.68 (0.62-0.71); thus the decrement in model performance was only partially due to case-mix heterogeneity. The median E (standardised by outcome rate) was 0.5 (0.3-2.2) for acute HF CPMs and 0.6 (0.3-0.7) for chronic HF CPMs. Updating the intercept alone led to a significant improvement in calibration in acute HF CPMs, but not in chronic HF CPMs. Net benefit analysis showed potential for harm in using CPMs when the decision threshold was not near the overall outcome rate, but this improved with model recalibration (Table).
Conclusions: A small minority of published CPMs could be matched to clinical trial datasets. For acute HF CPMs, discrimination is largely preserved after adjusting for case-mix; however, model updating is required for both acute and chronic HF CPMs.
Keywords: Clinical prediction model, heart failure, mortality
Table (abstract above). Effects of updating on net benefit by decision curve analysis. 'Threshold' refers to the decision threshold; 'Prev./2' refers to the net benefit when the decision threshold is half the outcome prevalence, 'Prevalence' when it equals the outcome prevalence, and 'Prev.*2' when it is twice the outcome prevalence.
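Net benefit at a decision threshold t, as used in the decision curve analysis here, can be computed as sketched below (the standard Vickers-Elkin formula; `p` and `y` are hypothetical predicted risks and observed outcomes).

```r
# Net benefit of treating patients whose predicted risk p exceeds threshold t
net_benefit <- function(p, y, t) {
  n  <- length(y)
  tp <- sum(p >= t & y == 1)  # true positives at threshold t
  fp <- sum(p >= t & y == 0)  # false positives at threshold t
  tp / n - fp / n * (t / (1 - t))
}
```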
Background: Clinical prediction models (CPMs) are increasingly developed and validated using electronic health records (EHRs), since these provide rich, longitudinal information on a patient's interactions with healthcare services. The analysis of such data is not, however, without challenges. Specifically, the observation process in EHRs depends on the underlying health status of the individual, which not only leads to irregularly collected information, but importantly means that the type, timing, and frequency of data collection can be informative with respect to a patient's health status. This is referred to as "informative presence" and "informative observation". Informative presence/observation may be an opportunity, as the additional information contained within the observation process could improve the accuracy of prediction models. This project aims to synthesise the existing analytical methodology that could allow CPMs to learn from informative presence and informative observation, and in doing so to identify remaining methodological challenges in this area.
Methods: A systematic literature search was conducted by two independent reviewers. Keywords were identified and used to search Embase, MEDLINE and Web of Science. Articles were screened on title and abstract at stage one, and on full text at stage two for the remaining papers.
Results: All methods identified in this review (across 37 papers) fall broadly into three categories: methods that use derived information about the observation process as model predictors (e.g. counts of observations or visits); methods that make indirect use of the observation process via a latent structure (e.g. through random effects in joint models); and methods that model under informative missingness.
Conclusions: Methodology to incorporate informative presence/observation in CPMs is beginning to emerge, and shows promise in improving the performance of prediction models. However, this is still an underdeveloped area, and further work should explore where each method improves predictive accuracy.

Background: Prediction models are often developed using a multivariable regression framework (e.g. logistic, survival, or linear regression), which provides an equation to estimate an individual's outcome probability (for binary or time-to-event outcomes) or outcome value (for continuous outcomes) conditional on the values of multiple variables ('predictors'). When estimating such equations in a particular dataset, standard estimation techniques are often used, in particular ordinary least squares or maximum likelihood estimation. However, when applied to new individuals, these fitted equations tend to produce optimistic (i.e. too extreme) predictions; that is, the predicted outcome probability (from logistic or time-to-event regression models) or the predicted outcome value (from linear regression) for new individuals is too far from the mean for some individuals. This is a particular concern when the number of predictors is large relative to the sample size, such that overfitting is a concern.
Methods: To address the issue of overfitting, penalised estimation techniques are increasingly being recommended, especially in situations where the effective sample size is low. These include uniform shrinkage (e.g. estimated via bootstrapping), the lasso, elastic net, and ridge regression. Many researchers believe such methods resolve the issue of overfitting entirely.
Results: In this talk we highlight that penalisation methods are no substitute for obtaining large sample sizes for model development. In particular, through examples, simulation and analytic reasoning, we show that shrinkage and penalty factors are typically estimated with large uncertainty, especially in small development datasets where the potential for overfitting is large.
Conclusion: We discuss and illustrate approaches to reduce this uncertainty for the lasso, elastic net and ridge regression, and reinforce guidance on how to derive the sample size needed to develop a model.
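A uniform shrinkage factor of the kind referred to here can be estimated by bootstrapping, as in the minimal sketch below (hypothetical development data `d` with binary outcome y and predictors x1-x3); the spread of the bootstrap estimates illustrates how uncertain the factor is in small samples.

```r
# Bootstrap estimate of a uniform (calibration slope) shrinkage factor
shrinkage <- replicate(200, {
  i   <- sample(nrow(d), replace = TRUE)
  fit <- glm(y ~ x1 + x2 + x3, family = binomial, data = d[i, ])
  lp  <- predict(fit, newdata = d)  # bootstrap model applied to original data
  coef(glm(d$y ~ lp, family = binomial))["lp"]  # calibration slope
})
mean(shrinkage)  # average shrinkage factor
sd(shrinkage)    # its uncertainty, typically large in small datasets
```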
Background: … (EstBB) cohort [1]. We analysed how well this polygenic risk score (PRS) performs in estimating women's future risk of developing breast cancer (BC). We aimed to estimate the cumulative BC incidence for women in the EstBB cohort, using the prevalence-based PRS [1] as a predictor, together with year of cohort entry, age, BMI, smoking status, educational level and prevalent co-morbidities, and to evaluate the prognostic performance of the PRS-based risk.
Methods: We included data on 30,312 women from the EstBB cohort, aged between 20 and 89 years and without a history of BC. We estimated absolute 3- and 5-year PRS-based risks with a Cox proportional hazards model, retaining PRS and age as covariates. Performance of the age-adjusted PRS-based risk was assessed in terms of cross-validated calibration, discrimination, and reclassification.
Results: Risks derived from the age-adjusted PRS model were consistent with the observed cross-validated 3-year cumulative incidence of 0.33% and 5-year cumulative incidence of 0.61% for the entire cohort. The AUC was 0.720 (95% CI 0.675 to 0.765) for 3 years and 0.704 (95% CI 0.670 to 0.737) for 5 years. Compared to an age-only model, this was just 0.022 higher for 3 years and 0.023 higher for 5 years. Reclassification analysis, using a 1% risk cut-off, showed that few, but overall more, women were correctly versus incorrectly reclassified (3-year NRI 0.094; 5-year NRI 0.0527).
Conclusion: Despite good calibration, we found only a modest incremental performance improvement of the PRS-based risk compared to age-based risk. A considerably larger study would be needed to assess whether the PRS could meaningfully contribute to the development of more efficient screening strategies.
Keywords: Prognostic accuracy, breast cancer, polygenic risk score, precision screening, risk stratification, medical test evaluation, biomarker evaluation, performance measures
References: [1]
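Reclassification at a fixed risk cut-off, as summarised here by the net reclassification improvement (NRI), can be computed as in the sketch below (the standard two-category NRI; `p_old` and `p_new` are hypothetical risk vectors from an age-only and an age-plus-PRS model).

```r
# Two-category NRI at a 1% risk cut-off; y is the 0/1 outcome
nri <- function(p_old, p_new, y, cut = 0.01) {
  up   <- p_new >= cut & p_old < cut
  down <- p_new < cut & p_old >= cut
  (mean(up[y == 1]) - mean(down[y == 1])) +   # events reclassified upwards
  (mean(down[y == 0]) - mean(up[y == 0]))     # non-events reclassified downwards
}
```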
Background: Prognostic models predict outcomes for people with an underlying medical condition. Many conditions are typified by recurrent events, such as seizures in epilepsy. Prognostic models for recurrent events can be used to predict an individual patient's risk of disease recurrence or outcome at certain time points.
Methods: Methods for analysing recurrent event data are not widely known or applied in prognostic research. Most analyses use survival analysis to consider the time until the first event, meaning subsequent events are not analysed and key information is lost. An alternative is to analyse the event count using Poisson or negative binomial regression; however, this ignores the timing of events.
Results: A systematic review of methodology for analysing recurrent event data in prognostic models is therefore ongoing. Results from this review will identify the methods commonly used in practice. Information such as the event rate of the underlying condition will be collected to determine whether model choice might be influenced by this factor.
Conclusions: Results from this review will be presented, including a summary of each method identified. The results will be the first step towards a toolkit for future analysis of recurrent event data.

Background: A gulf exists between the number of prognostic models developed and those effectively adopted as clinical tools. Bridging this gap requires recognition that a tool will not sit in isolation, but will be integrated into an often complex clinical system. Utility therefore rests both on providing accurate predictions in a target population and on understanding a tool's role and acceptability within a clinical system. The OxMIV prediction model for violence in severe mental illness has been internally and externally validated using Swedish population registers [1]. The current project aims to externally validate and develop OxMIV as a clinical tool for UK community mental health teams who engage individuals with first-episode psychosis. This study will examine how OxMIV can be used in a clinical setting, and specifically its clinical acceptability, to develop a framework for its implementation and wider evaluation.
Methods: Mixed methods are used to examine the process of integrating OxMIV into community mental health teams in two counties. Interviews with 20 multidisciplinary clinicians focus on acceptability and barriers to use. Approaches to examining uptake, reach and utility are piloted using structured data from electronic records.
Results: Pilot work demonstrated the feasibility of using routine data for validation, and showed that 1 in 10 individuals under the care of these services were arrested for violence within 12 months, yet no structured framework currently exists to determine risk. Preliminary findings from the work to develop OxMIV for this role will be discussed, focusing on transferable themes pertinent to the clinical translation of prediction models.
Conclusions: These will include service management perspectives, the interface with electronic systems, risk communication, decision pathways, and clinician views on the desirable properties of a usable tool.
Keywords: Prediction, model, clinical, psychiatry, psychosis, violence

Background: Measurement error in binary variables (misclassification) is a common problem in the analysis of multiple data sources, including individual participant data meta-analysis (IPD-MA). Misclassification may lead to biased parameter estimates, even when the misclassification is entirely random. Available methods for addressing misclassification do not account for between-study heterogeneity in an IPD-MA. We aimed to develop statistical methods that facilitate unbiased estimation of logistic regression models in a one-stage IPD-MA, where the extent and nature of misclassification may vary across studies. We focus on the estimation of predictor-outcome associations and between-study heterogeneity.
Methods: We present Bayesian methods that allow misclassification to depend on study-level and participant-level characteristics. We illustrate this with an example on the differential diagnosis of dengue using two predictors, where the gold standard measurement for one (muscle pain) is unavailable for some studies, which only measured a surrogate prone to misclassification. We present a simulation study assessing bias, root mean square error (RMSE), coverage and power in estimating the muscle pain-dengue association.
Results: In the example, our methods yielded estimates with less error than analyses naive to misclassification or based on gold standard measurements alone. Minor differences were observed in the estimates of the heterogeneity of the muscle pain-dengue association. In our simulations, adjusting the one-stage IPD-MA models for misclassification led to valid estimates of the adjusted predictor-outcome association, with lower RMSE, greater power and similar coverage compared to an analysis restricted to the available gold standard measurements.
Conclusion: Our proposed framework can account for the presence of predictor misclassification in IPD-MA. It requires that 1) some studies supply IPD for both the surrogate and the gold standard variables, and 2) misclassification is exchangeable across studies conditional on observed covariates (and outcome).

Results: SPIRIT-AI and CONSORT-AI include AI-specific recommendations, such as asking authors to provide clear descriptions of the AI system, including the instructions and skills required for its use, the operational environment in which the AI intervention is integrated, the handling of the input and output data of the AI system, the human-AI interaction, and the provision of an analysis of error cases.
Conclusion: SPIRIT-AI and CONSORT-AI will help promote transparency and completeness of studies in this area. They will assist editors and peer reviewers, as well as the general readership, to understand, interpret and critically appraise the quality of clinical trial design and the risk of bias in the reported outcomes.

Background: Guidelines recommend identifying, in early pregnancy, women at elevated risk of pre-eclampsia. Existing prediction tools perform poorly among nulliparous women. We aimed to 1) develop and validate a pre-eclampsia risk prediction model for nulliparous women, and 2) compare the model's performance against the existing NICE approach.
Methods: This retrospective cohort study included all nulliparous women who gave birth in three public hospitals in the Western Sydney Local Health District, Australia, in 2011-2014. Using births from 2011-2012, we performed multivariable logistic regression incorporating established maternal risk factors to develop, and internally validate, the "Western Sydney (WS) model". The WS model was externally validated using births from 2013-2014, assessing its discrimination and calibration. We fitted the final WS model to all births from 2011-2014, and compared its accuracy with the NICE approach.
Results: Among 12,395 births, 293 women (2.4%) had pre-eclampsia. The WS model included: maternal age, BMI, ethnicity, multiple pregnancy, family history of pre-eclampsia, autoimmune disease, chronic hypertension and chronic renal disease. In the validation sample (n=6,201), the model c-statistic was 0.70 (95% CI 0.65-0.75), suggesting good discrimination. The observed:expected ratio for pre-eclampsia was 0.91, with a Hosmer-Lemeshow p-value of 0.20, suggesting good calibration. In the entire sample (n=12,395), 374 (3.0%) women had a WS-model-estimated pre-eclampsia risk ≥8%, the risk threshold for considering aspirin prophylaxis. Of these, 54 (14.4%) developed pre-eclampsia (sensitivity 18% [14-23], specificity 97% [97-98]). Using the NICE approach, 1,173 (9.5%) women were classified as high risk, of whom 107 (9.1%) developed pre-eclampsia (sensitivity 37% [31-42], specificity 91% [91-92]). The final model showed similar accuracy to the NICE approach when using a lower risk threshold of ≥4%.
Conclusions: The WS risk model achieved modest performance for pre-eclampsia prediction in nulliparous women. Although not superior to the NICE approach, the WS model has the advantage of providing individualised risk estimates to inform decisions about pregnancy surveillance and aspirin prophylaxis.
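Classification performance at a clinical risk threshold, as reported here for the ≥8% aspirin-prophylaxis cut-off, can be derived from a vector of predicted risks as sketched below (hypothetical vectors `risk` and `outcome`, not the study's data).

```r
# Sensitivity, specificity, and the proportion flagged at a decision threshold
classify_at <- function(risk, outcome, threshold = 0.08) {
  flagged <- risk >= threshold
  c(sensitivity = mean(flagged[outcome == 1]),
    specificity = mean(!flagged[outcome == 0]),
    flagged_pct = mean(flagged))
}
```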
Background: Studies addressing diagnostic and prognostic prediction models are abundant in many clinical domains. At the same time, many systematic reviews have shown that the quality of reporting of prediction model studies is suboptimal [1]. Due to the increasing availability of large, routinely collected and complex data, and the rising application of artificial intelligence (AI) and machine learning (ML) techniques for clinical prediction, the number of prediction models is expected to increase even further. AI/ML-based prediction model studies are often labelled a "black box", and not much is yet known about their quality of reporting. The aim of this systematic review is to evaluate the reporting and methodological conduct of prediction model studies that applied AI/ML techniques for model development or validation.
Methods: Our protocol was registered in PROSPERO (CRD42019161764). A search was performed in January 2020 to identify primary studies developing and/or validating prediction models using any AI/ML methodology, across all medical fields. Studies were included if they predicted patient-related outcomes, used any study design, and were published in 2018-2019. We assessed (1) the quality of reporting, by measuring adherence to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guideline, and (2) the risk of bias in prediction model development or validation, using the Prediction model Risk Of Bias ASsessment Tool (PROBAST).
Results: Initial results from the review will be presented, stratified by medical field and prevalent AI/ML methods.
Conclusions: Emerging issues will be discussed, as well as the necessity of specific reporting (TRIPOD-AI/ML) and risk of bias (PROBAST-AI/ML) guidance for AI/ML-based prediction model studies.

Background: The Cox proportional hazards model is commonly used when developing prognostic prediction models with time-to-event data. In the presence of competing risks (events that might prevent the occurrence of the event of interest), using a Cox model leads to predicted probabilities that are too high. Thus, methods that account for competing risks, such as the Fine-Gray model, are preferred. However, fitting this model is computationally complex, particularly when used in combination with multiple imputation and fractional polynomials. This poses a significant challenge when developing prediction models in big databases. We aimed to describe prediction modelling approaches that minimise computation time, without compromising model validity, in a large dataset of electronic health records.
Data: Data from the Clinical Practice Research Datalink were used to create prognostic models for adverse events related to antihypertensive medication, treating death as a competing event (prevalence 10%). The dataset included 1,773,224 patients and 40 predictors.
Methods/Results: A multivariable competing risks model developed using the stcrreg command in Stata 16 required approximately two weeks to converge on an 8-core, 32 GB i9 PC. Computation time was reduced to less than one day when estimating regression coefficients using the R package fastcmprsk, which uses a forward-backward scan algorithm that is more efficient than the Newton-Raphson method. Robust bootstrap confidence intervals were estimated using the percentile method. Fractional polynomial transformations were computationally prohibitive, so variable transformations were modelled using Cox regression, providing a good approximation of the relationship.
Conclusions: Computational obstacles to correctly accounting for competing risks in clinical prediction models can be overcome by combining fast algorithms, robust bootstraps and approximate fractional polynomial transformations. We are currently investigating how to optimise the use of the Fine-Gray model in conjunction with multiple imputation.
Keywords: Prediction, competing risks, big data
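For orientation, a Fine-Gray model of the kind discussed here can be fitted with the standard cmprsk package, as sketched below on small hypothetical data (the authors used Stata's stcrreg and the faster fastcmprsk R package for their full-scale analysis).

```r
library(cmprsk)
# Hypothetical data: follow-up time, status (0 = censored, 1 = adverse event,
# 2 = death as competing event) and a covariate matrix
set.seed(3)
n      <- 500
ftime  <- rexp(n)
status <- sample(0:2, n, replace = TRUE, prob = c(0.5, 0.1, 0.4))
covs   <- cbind(age = rnorm(n, 60, 10), male = rbinom(n, 1, 0.5))

fit <- crr(ftime, status, cov1 = covs, failcode = 1, cencode = 0)
summary(fit)  # subdistribution hazard ratios for the event of interest
```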
Methods: Variation in measurements from patients over time includes three components: true differences at baseline between patients; true changes from baseline within patients ('signal'); and measurement error ('noise'). The variogram considers differences within patients computed between baseline and each follow-up point; the variances of these differences increase at a rate dependent on the magnitude of the within-patient variability. We grouped measurements by year and assigned weights according to the closeness of the actual measurement time to the midpoint, using a Gaussian kernel approach. Weighted variances of differences in serum albumin were calculated per time point. The variogram is a plot of the weighted variance of differences (y-axis) against time (x-axis), with a fitted line estimated by linear regression, weighted according to sample size. Extrapolation of the fitted line to its intersection with the y-axis was used to estimate the measurement error ('noise'). Results: The measurement error ('noise') estimate was 0.10 (g/dL)² (Figure 1). 'Signal' first surpassed 'noise' at five years; the variance of differences at one year was estimated to be almost entirely 'noise'. Such results from weighted variogram analyses could be used to help define optimal measurement timings for monitoring programmes. Conclusions: Weighted variogram analyses have potential for application where health status changes are unlikely; care should be exercised in implementation, particularly in relation to bias from dropout. Keywords: Variogram, variability, monitoring, measurement error

Background: Health care data often have a hierarchical structure, with observations at different group levels (e.g., regions, hospitals, patients). Most commonly applied statistical models either combine the data to estimate an average effect (complete pooling) or partition the data to estimate a separate effect for each group (no pooling). Such models implicitly treat the between-group variance as either zero (complete pooling) or infinite (no pooling). We seek statistical models that balance the trade-off between zero and infinite between-group variance and, as a result, incorporate both between- and within-group information in the group-level estimates. Methods: Bayesian hierarchical models account for the uncertainty in the estimate of the between-group variance through partial pooling. That is, group-level effects are estimated by taking into account the uncertainty about the estimates. For groups with few observations (or little information), the estimates are closer to the estimate from complete pooling. In contrast, for groups with many observations, the estimates are closer to the estimate from no pooling. This principle is called shrinkage and can be thought of as pulling group-level estimates towards the population mean when uncertainty in the estimate is high. We apply Bayesian hierarchical models in different contexts of personalized health care. These hierarchical models reveal that patient-specific effects can be estimated precisely for patients with many observations, while estimates for patients with few observations are pulled towards the patient average. Conclusions: Our applications demonstrate the advantages of Bayesian hierarchical models for personalized health care. Results from such models entail important implications for medical practitioners.
They inform physicians about patients for whom personalized treatment is more likely to be successful and, conversely, for whom a common treatment should rather be administered because the range of possible treatment effects is too large. Keywords: Bayesian hierarchical models, patient-specific effects, personalized health care

Background: Dealing with multiple thresholds in diagnostic accuracy meta-analysis can be challenging. We applied two modelling strategies to summarize the available evidence on the diagnostic accuracy of biomarkers for urinary tract infections in children. Methods: We performed a systematic review and meta-analysis of diagnostic test accuracy studies. We searched seven databases for relevant articles. Eligible studies were prospective or retrospective observational studies that reported the accuracy of urine or blood biomarkers for urinary tract infections in children. Statistical analyses were performed using R software. The bivariate random effects model by Reitsma et al. [1] ('mada' package) and the model by Steinhauser et al. [2] ('diagmeta' package), which takes into account multiple thresholds per study, were both used to calculate summary estimates for six biomarkers. We compared the output of the two modelling strategies and report the following characteristics: area under the curve (AUC) and clinical usability (the clinically relevant threshold providing a specificity of 0.90). For now, only results for C-reactive protein (CRP) are shown, with the other biomarkers to be presented at the MEMTAB 2020 symposium. Results: We screened 9975 studies, of which we included 62 in the review. For CRP, we found eight primary studies that reported on 1 to 6 thresholds, ranging from 5 to 200 mg/l. Using the model by Reitsma et al. [1] and Steinhauser et al. [2]

Background: Clinical prediction models (CPMs) can predict the risk of health outcomes, such as disease onset or progression, for individual patients. The majority of existing CPMs only harness cross-sectional patient information. Incorporating repeated measurements into CPMs may provide an opportunity to enhance their performance. We aimed to systematically review the literature to understand and summarise existing approaches for harnessing repeated measurements in the development of CPMs, and to empirically investigate the suitability of the identified methods on real-world data, using an illustrative example in rheumatoid arthritis (RA). Methods: Medline, Embase, and Web of Science were searched for articles reporting the development of a multivariable CPM for patient-level prediction that modelled repeated measurements of at least one predictor. Information was extracted on: the method, its specific aim, reported advantages and limitations, and the software available to apply the method.

Background: For prediction model development, specifying an optimal sample size in terms of predictive performance is an active area of research. It has been suggested that sample size depends on factors including events per variable (EPV), outcome prevalence and the prevalence of binary predictors. We introduce a flexible approach for sample size determination based on learning curves. Such curves monitor model performance as new data come in, allowing patient recruitment to be stopped when a pre-specified stopping criterion has been reached. We illustrate the approach using data for the diagnosis of obstructive coronary artery disease (n=4888, 44% event rate).
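Before the Methods, a minimal base-R sketch of what such a learning curve looks like: the model is refit on a growing sample and apparent AUC is tracked. The data are simulated, and the abstract's actual procedure additionally uses bootstrapping for optimism-corrected performance.

```r
set.seed(2)
N <- 3000
X <- matrix(rnorm(N * 4), ncol = 4)
y <- rbinom(N, 1, plogis(-0.3 + X %*% c(0.8, 0.5, 0.3, 0)))

auc <- function(lp, y) {                       # rank-based AUC, base R
  r <- rank(lp); n1 <- sum(y); n0 <- length(y) - n1
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

sizes <- seq(100, N, by = 50)                  # mimic sequential recruitment
perf  <- sapply(sizes, function(n) {
  fit <- glm(y[1:n] ~ X[1:n, ], family = binomial)
  auc(predict(fit), y[1:n])                    # apparent AUC at n patients
})
plot(sizes, perf, type = "l",
     xlab = "patients recruited", ylab = "apparent AUC")
## recruitment would stop once a pre-specified criterion on this curve is met
```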
Methods: We used logistic regression to develop prediction models consisting of a priori selected variables. We mimicked prospective patient recruitment as follows. First, we fitted the model on 100 randomly chosen patients and estimated model performance metrics using bootstrapping. Second, we sequentially added 50 random new patients until we reached 3000 patients, estimating model performance at each step. We repeated the procedure 500 times to investigate variability. We built models once without addressing nonlinear effects of continuous predictors (ML-LR), and once with restricted cubic splines (RCS). We examined the required sample size for the following possible stopping criteria: (1)

Background: Meta-analysis of individual participant data (IPD-MA) offers new opportunities for studying the generalizability of prediction models across different settings and populations. The interpretation of model performance estimates in IPD-MA is often challenging, because between-study heterogeneity may arise from invalid model coefficients and from differences in (the distribution of) population characteristics. Hence, the benefit of local model revisions may be unclear. We aimed to disentangle the effects of differences in case-mix and of invalid regression coefficients, to allow the reproducibility of model performance and predictor effects to be identified. Methods: We propose to standardize the c-statistic, calibration slope and calibration-in-the-large for case-mix differences between samples by applying propensity weighting. The propensity scores are derived using a (multinomial) membership model that predicts the originating sample of an individual in the IPD-MA. We illustrate our methods in a motivating example on the validation of eight diagnostic prediction models for detecting deep vein thrombosis (DVT), which may aid in the diagnosis of patients suspected of DVT, in 12 external validation data sets. We analyze the estimates of the prediction models' performance across the external validation sets with random effects meta-analysis. Results: In the meta-analysis of c-statistics, summary estimates were not much affected by standardization. However, standardization substantially reduced the between-study heterogeneity, indicating that variation in the models' discrimination across the validation studies can partially be attributed to differences in case-mix rather than to invalid model coefficients.

Background: In the UK, 1.4 million people live with undiagnosed and untreated obstructive sleep apnoea (OSA) and are at an increased risk of cardio-metabolic complications and diabetes. Polysomnography (PSG) is the gold standard for the diagnosis of OSA but is expensive, time-consuming and has long waiting lists. A questionnaire to identify patients at high risk of OSA requiring further investigation and treatment would be of great benefit. We aimed to determine the best questionnaire for identifying adults at high risk of OSA amongst different clinical cohorts, accounting for multiple questionnaires and multiple thresholds. Methods: 31 studies reporting the diagnostic accuracy of the Berlin, STOP or STOP-Bang questionnaires as screening tools for moderate-to-severe OSA were available for meta-analysis from two clinical cohorts of patients: sleep clinic and surgical. Within each cohort, random effects bivariate binomial models were fitted to each questionnaire. Differences in diagnostic ability between questionnaires were tested using meta-regression.
In the surgical cohort, we accounted for multiple thresholds using the methods of Steinhauser et al [1]. Results: In both the sleep clinic and surgical cohorts, meta-regression including questionnaire as a covariate identified statistically significant differences in sensitivity between STOP-Bang and Berlin. There was no evidence of differences in specificity. Due to the large number of parameters estimated when accounting for multiple thresholds, we were only able to fit two of the eight models proposed by Steinhauser et al [1]. Conclusions: Performing a coherent analysis under the frequentist framework that is able to incorporate multiple questionnaires and multiple thresholds across different clinical cohorts, whilst avoiding the well-known issues associated with multiple testing, can be

Background: In the age of personalized medicine, prediction models are becoming increasingly popular for risk stratification and informed treatment decisions. The accessibility of large routine data collections and observational cohorts facilitates the validation of existing prediction models and the development of new ones. We aimed to define the necessary steps of, and to stress the importance of, initial data analysis before running regression analysis (IDA-REG), assuming that a data set has already passed an initial data cleaning stage. Methods: Following a conceptual framework for IDA [1], we describe 3 mandatory and 3 optional steps of IDA-REG. Results: IDA-REG focuses on features in the data that should be known to the analyst in order to a) properly interpret the results of an analysis, b) make decisions on how to present the results of an analysis, and c) adapt the statistical analysis plan to avoid analysis errors. Mandatory steps include summaries of univariate distributions of predictors and the outcome variable, summaries of bi- and trivariate distributions of predictors, and summaries of patterns of missing values. Optional steps include investigation of measurement error, investigation of levels of measurement (hierarchies), and the exploration of unsupervised possibilities to reduce the dimensionality of regression models. The evaluation of associations of predictors with the outcome is explicitly not part of IDA-REG. We exemplify IDA-REG by means of simulated and real data. Conclusions: Appropriate graphical and analytical tools enable a researcher to perform IDA-REG in order to avoid misinterpretation, poor presentation and analysis errors. These necessary preparations are too often neglected by inexperienced data analysts.

Background: Pertussis, or whooping cough, is a highly contagious vaccine-preventable disease. The incidence of pertussis declined steadily after the introduction of pertussis vaccination; nevertheless, it has increased again over the past two decades in many countries. The analysis of serial serological survey data can improve our understanding of the dynamics of pertussis. However, the ongoing development of assays for the detection of IgG antibodies in sera means that different assays have been used in different survey years. Comparable sero-epidemiological results are needed for statistical and mathematical models that estimate time-varying epidemiological parameters. We aimed to investigate the consequences of the uncertainty related to the standardization of pertussis toxin IgG antibody results from three serological surveys conducted in Belgium (2002, 2006, 2013).
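As a stylized illustration of the standardization problem, ahead of the Methods that follow: a linear bridge between assays can be fitted on re-tested samples on the log scale. Everything below is simulated, and the abstract's handling of outliers and censored values is deliberately omitted.

```r
set.seed(3)
old <- exp(rnorm(150, 3, 1))                            # titres on an original assay
mia <- exp(0.4 + 0.9 * log(old) + rnorm(150, 0, 0.25))  # same samples re-tested on MIA

bridge <- lm(log(mia) ~ log(old))                       # standardization model

## harmonize a full survey that was only measured on the old assay
survey <- data.frame(old = exp(rnorm(3000, 3, 1)))
survey$mia_scale <- exp(predict(bridge, newdata = survey))
```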
Methods: In each survey, 150 samples were selected such that the range of the original values of IgG antibodies against pertussis toxin was covered as well as possible. All 450 samples were then tested using a magnetic bead-based multiplex immunoassay (MIA) [1]. We investigated different models for the log-transformed values and also considered different strategies for outliers and censored data. Results: The model choice for the standardization can be sensitive to the strategy applied for outliers and censored data. The 2013 survey had originally already been tested using MIA, but at different concentrations than in the current study. The comparison of the re-tested 150 samples from 2013 together with validation data can be used to investigate intra-assay variability. Conclusions: The uncertainty in the standardization of antibody titres needs to be reflected in models aimed at estimating time-varying epidemiological parameters, such as the force of infection, from serial serological survey data. Keywords: Sero-epidemiology, pertussis, assay comparison

Background: Warfarin remains the most used oral anticoagulant in sub-Saharan Africa. It has a narrow therapeutic index and a highly variable clinical response for a given dose, making optimal dose prediction difficult. We aimed to develop and validate a warfarin dose prediction model for use in sub-Saharan African populations. Methods: Multivariable linear regression models were fitted using data from 364 patients. Starting with a list of potential variables, all possible linear models were fitted, with the optimal models chosen with reference to mean absolute error (MAE), mean absolute percentage error (MAPE) and the logarithmic accuracy ratio. Bootstrap validation was applied to correct for overfitting, and the final models were externally validated in a cohort of 690 patients. In both the development and external validation cohorts, we compared our models with current warfarin initiation practice (a fixed dose of 35 mg/week) and two widely known dose prediction models.

Table 1 shows similar AUROC and Brier scores across methods, and similar calibration (in-the-large and slope) at cross-validation. PCA and non-negativity constraints stand out in terms of calibration at external validation. All methods reduce, but still contain, one or more negative coefficients for OAR dosage variables, except when using non-negativity constraints (β+). If accepted for presentation, we expect to also show results of combining PCA and non-negativity constraints, as these preliminary results suggest those methods are the most beneficial for this use case. Keywords: Collinearity, NTCP, radiotherapy, prediction, dysphagia

Background: Assessing the risk of bias and applicability (RoB) of included studies is critical for interpreting meta-analysis (MA) results. RoB tools for diagnostic, prognostic, and prediction studies include QUADAS-2 and PROBAST. However, individual participant data meta-analyses (IPD-MAs) differ from aggregate-data MAs: in an IPD-MA, datasets may include additional information, eligibility criteria may differ from those in the original publications, and definitions of index tests/predictors and reference standards/outcomes can be standardized across studies. Thus, tailored RoB tools may be needed. We aimed to review how RoB is currently assessed in IPD-MAs, and to examine QUADAS-2 and PROBAST, with the goal of developing IPD-MA extensions for each tool. Methods: We reviewed RoB assessments in IPD-MAs published in the last 12 months.
We then examined how QUADAS-2 (and its in-progress extensions) and PROBAST items might be evaluated in an IPD-MA context, noting which items might be removed, edited, or added, and hypothesized how results might be incorporated into IPD-MA analyses. Results: We observed that current IPD-MAs rarely and inconsistently evaluate RoB, and most do not incorporate RoB judgements into analyses. Our findings support using QUADAS-2 and PROBAST to assess the RoB of the IPD datasets themselves, rather than of the study publications. Certain items may need to be coded at the participant level (e.g., timing between index test/predictor and reference standard/outcome), whereas others (e.g., quality of the diagnostic tool) may apply uniformly to an included study. Most analysis items (e.g., prespecification of thresholds and variables for analysis) may not be relevant, as IPD-MA researchers perform the analyses themselves. RoB results may be incorporated into analyses by conducting subgroup analyses among studies and participants with overall low RoB, or by conducting formal interaction analyses with item-level RoB responses. Conclusions: Development and dissemination of IPD-MA extensions for QUADAS-2 and PROBAST will lead to improved RoB assessments in IPD-MAs of diagnostic, prognostic, and prediction studies.

Background: Selectively reporting accuracy results from only well-performing cutoffs in studies of diagnostic or screening tests may result in biased estimates when synthesized. The extent of bias may differ depending on the availability of a well-defined standard cutoff. We compared bias in accuracy estimates and cutoff reporting patterns for the Patient Health Questionnaire-9 (PHQ-9; well-defined standard cutoff ≥10) and the Edinburgh Postnatal Depression Scale (EPDS; no standard cutoff, common cutoffs ≥10 to ≥13). Methods: We analyzed subsets of datasets from two separate individual participant data meta-analyses (IPD-MAs) on PHQ-9 and EPDS accuracy. Separately for the PHQ-9 and EPDS, we used bivariate random effects meta-analysis to compare accuracy estimates based on published cutoffs only versus all cutoffs from all studies. We also compared the number of published cutoffs below and above the standard or common cutoffs in relation to study-specific "optimal" cutoffs. Results: For the PHQ-9 (30 studies, N = 11,773), published results underestimated sensitivity compared to results for all cutoffs for cutoffs below ≥10 (median difference: -0.06) and overestimated it for cutoffs above ≥10 (median difference: 0.07). EPDS (19 studies, N = 3,637) sensitivity estimates were similar for cutoffs below ≥10 (median difference: 0.01) but higher for published cutoffs above ≥13 (median difference: 0.14). The mean of all cutoffs reported among PHQ-9 studies with optimal cutoffs below ≥10 was 8.8, compared to 11.8 for studies with optimal cutoffs above ≥10. 18 of 19 EPDS studies had optimal cutoffs below ≥13; studies with optimal cutoffs below ≥10 did not report more cutoffs below ≥10 (mean cutoff: 9.9), but those with optimal cutoffs above ≥10 reported more cutoffs above ≥10 (mean cutoff: 11.8). Conclusions: Selective cutoff reporting and the resulting bias in accuracy estimates were more pronounced for the PHQ-9 than for the EPDS.

Database. An increase of 26.5 μmol/L creatinine between the baseline value and the ED value was used as the AKI definition (KDIGO). We analyzed four baseline definitions: the lowest, mean, median and most recent value from the patient's EHR. Multiple time intervals (≤365 days prior to ED presentation) were used to determine AKI prevalence.
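A minimal base-R sketch of the four baseline definitions just described, together with the KDIGO-style rise criterion; the function names and example values are invented for illustration.

```r
baseline <- function(creat, dates,
                     how = c("lowest", "mean", "median", "recent")) {
  how <- match.arg(how)
  switch(how,
         lowest = min(creat),
         mean   = mean(creat),
         median = median(creat),
         recent = creat[which.max(dates)])    # most recent value in the window
}
is_aki <- function(ed_value, base) (ed_value - base) >= 26.5  # µmol/L rise

creat <- c(70, 64, 81)                        # historical values within the window
dates <- as.Date(c("2020-01-10", "2020-03-02", "2020-06-15"))
is_aki(95, baseline(creat, dates, "lowest"))  # TRUE: lowest baseline inflates prevalence
is_aki(95, baseline(creat, dates, "recent"))  # FALSE with the most recent value
```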
Results: The longest interval (365 days prior to presentation), in combination with the lowest value as baseline, resulted in the highest AKI prevalence (12.65%), compared with the mean (4.23%), median (4.8%) and most recent value (4.5%). Iteratively reducing the time window for extracting the creatinine measurement showed extreme differences only when using the lowest value as baseline. Compared with the shortest interval (45 days), the longest interval increased the prevalence by 10.92% (5,151/47,190 additional AKI labels). Conclusions: Using a specific definition of baseline results in substantially different AKI prevalence in the ED. Adequate translation of guidelines for diagnosing disease is crucial for accurate patient labelling, to reduce misclassification by the model and to improve the accuracy of a clinical decision support system (CDSS) in supporting clinical decision making by treating physicians. Keywords: Acute kidney injury, electronic health records, outcome

Real-time handling of missing predictor values when implementing and using prediction models in daily practice Steven WJ Nijman 1, T Katrien J Groenhof 1, Jeroen Hoogland 1, Michiel L Bots 1, Menno Brandjes 2, John JL Jacobs 2, Folkert W Asselbergs 3,4,5, Karel GM Moons 1, Thomas PA Debray 1,4 Background: Using prediction models to calculate a patient's individual risk in clinical practice requires complete information on all predictors in the model. Unfortunately, routine care data are often incomplete for a variety of reasons. Although several methods for real-time imputation of missing predictor values exist, they often require immediate access to data from other, similar patients and are therefore not directly suitable for routine care. We aimed to develop and evaluate methods for real-time imputation of missing predictor values in routine clinical care when applying prediction models to individual patients. Methods: We describe (i) mean imputation (where missing values are replaced by the sample mean), (ii) joint modeling imputation (JMI, where we use a multivariate normal approximation to generate patient-specific imputations) and (iii) conditional modeling imputation (CMI, where a multivariable imputation model is derived for each predictor from a population). We compared the imputation methods by applying a previously developed prediction model (predicting 10-year risk of recurrent vascular disease) to a dataset of 3,880 participants from the Utrecht Cardiovascular Cohort in which missing predictor values were simulated. Furthermore, comparing true and imputed predictor values, we evaluated the root mean squared error (RMSE) and the coverage of the 95% confidence intervals (i.e. the proportion of confidence intervals that contain the true predictor value). Results: We found that the RMSE was lowest when adopting JMI or CMI, although imputation of individual predictors did not always lead to substantial accuracy improvements in RMSE compared to mean imputation. JMI and CMI appeared particularly useful when the values of multiple predictors of the model were missing.
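A minimal sketch of the JMI idea for a single new patient: with a multivariate normal approximation (mean vector and covariance matrix) stored alongside the model, missing predictors are replaced by their conditional expectation given the observed ones. The predictor names and numbers below are invented, and the abstract's full method also quantifies imputation uncertainty.

```r
impute_jmi <- function(x, mu, Sigma) {   # x: named vector, NA where missing,
  m <- is.na(x)                          # ordered exactly as in mu/Sigma
  if (!any(m)) return(x)
  ## conditional mean of the missing block given the observed block
  x[m] <- mu[m] + Sigma[m, !m, drop = FALSE] %*%
    solve(Sigma[!m, !m, drop = FALSE], x[!m] - mu[!m])
  x
}

mu    <- c(age = 60, sbp = 140, chol = 5.2)     # hypothetical population summary
Sigma <- matrix(c(100,  40, 2,
                   40, 225, 3,
                    2,   3, 1),
                3, 3, dimnames = list(names(mu), names(mu)))
impute_jmi(c(age = 72, sbp = NA, chol = 6.0), mu, Sigma)
```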
Background: Mortality from chronic liver disease (CLD) is rising. This is despite the 'early warning' from commonly requested liver function tests (LFTs), which are abnormal in around 20% of cases, providing a clear opportunity for earlier diagnosis and intervention. Intelligent liver function testing (iLFT) is a system which aims to increase the early diagnosis of CLD. The referring clinician provides information on alcohol intake and co-morbidities, allowing an automated algorithm to reflex relevant tests, without further venepuncture, when initial LFTs are abnormal. Recommended outcomes are then provided: secondary care referral; primary care follow-up; or further investigations and referral criteria. This replaces the current, protracted system in which tests are often repeated over many years before irreversible liver cirrhosis is diagnosed. iLFT is cost-effective and provides a window of opportunity for lifestyle modification and treatment. We aimed to improve healthcare by identifying an appropriate care pathway for individual patients, utilising the existing potential of equipment and working practices, and improving access to hepatology services, ensuring appropriate patients are seen by specialists. Methods: A retrospective analysis of iLFT requests and results in the first year was performed, and a user questionnaire was analysed. Results: 2362 iLFT requests were received over 12 months, identifying 509 patients with advanced CLD requiring secondary care review, and 1504 patients with early CLD in whom lifestyle modifications could prevent disease progression. The proportion of liver testing made up by iLFT increased month on month; iLFT now accounts for 3% of monthly LFTs. 98 of 100 local General Practitioners surveyed would recommend iLFT to colleagues. Conclusions: iLFT is a successful system which utilises currently available resources to increase the diagnosis of CLD and provide appropriate referral advice. This creates a means to manage the growing healthcare burden from CLD and allows access to specialist care for appropriate patients.

Background: There is no standardized terminology for describing diagnostic test accuracy (DTA) studies, which presents a barrier to clear and informative reporting of primary studies and hinders efforts towards valid evidence synthesis. In a previous project, we observed heterogeneous and sometimes confusing use of terminology for describing DTA study design features in reviews prepared for NICE guidelines [1]. We aimed to develop a coherent set of terms for describing DTA study design features. Methods: Based on data from our previous study, and newly collected data on features and terms, we are performing an iterative clarification, sorting, and categorization of all the terms and features we identified. These will be integrated into a coherent and complete set of terms, as a prototype. The strengths and limitations of this prototype are being evaluated through an electronic survey. Participants are experienced DTA researchers and non-academic stakeholders, and include health technology assessment groups, DTA guideline developers, and collaborators from industry. The survey responses are used to adapt and modify the set of terms. In the last phase, the set of terms will be piloted among end users with varying levels of DTA experience, to evaluate whether it facilitates informative descriptions of DTA study designs. Results: Our set of terms, developed with input from a large group of experts and stakeholders, can be used to describe a DTA study in sufficient detail, without ambiguity. Conclusion: We believe that having a standardized and agreed-upon set of terms can reduce the use of misleading, subjective, ambiguous and heterogeneous wording when describing DTA research.
This will eventually enable secondary researchers and health care decision-makers to better assess the validity and generalizability of DTA evidence. Keywords: Diagnostic test accuracy, study designs, terminology, labelling

Frequencies and patterns of microbiology test requests from general practice José M. Ordóñez-Mena 1,2, Thomas R. Fanshawe 1, Dona Foster 3, Sarah Walker 2, Gail Hayward 1 Background: Microbiological tests requested from primary care are currently almost entirely performed in a central NHS laboratory. New diagnostic technologies allowing results to be available at the point of prescription could contribute to antimicrobial stewardship. We aimed to quantify the demand for microbiology tests in primary care and to highlight the most important individual tests, combinations of tests, and pathogens, to inform the development of new single and multiplexed point-of-care tests. Methods: We studied a retrospective cohort of all Oxfordshire primary care patients for whom a microbiology test was requested between 2008 and 2018. We described test frequencies overall, positive test results, pathogens identified, and trends over time. We also investigated patterns of co-testing in the same and subsequent visits, with heat-maps and hierarchical cluster analysis, overall and by sex and age category. Results: 1,596,752 microbiology tests were requested for 393,905 patients, of whom 65.3% were women and 48.8% were aged 18-49 years. We organized individual tests into 19 microbiology test groups, 8 combined cultures and microscopies, and 11 related to individual pathogens. Urine cultures and microscopies (n=673,612) accounted for 42% of all microbiology tests and were mainly requested in isolation, but also in follow-up visits after 7 and 14 days. Of all urine cultures, 27% were positive and 26% had equivocal results. E. coli was the most prevalent pathogen in urine cultures (65.2%). Antenatal urine cultures and blood tests (hepatitis B, HIV, syphilis, and rubella) formed the most common combination of tests, particularly among women aged 18-49.

Background: Public and patient involvement (PPI) in medical research is defined as research carried out "with" or "by" members of the public rather than "to," "about" or "for" them [1]. PPI is a key part of medical research, and many national health and funding organisations, including bodies from the UK, the Netherlands, Canada and the USA, state that PPI is essential [2]. There is little information on how to integrate PPI into methodological research and on the stakeholders that should be considered as public contributors. We aimed to provide information for PPI in methodological research, including available resources, and to present a case study. Methods: As a case study, we describe a methodological research fellowship focusing on methods for determining the performance of diagnostic imaging tests by including information on interobserver variability and time to diagnosis. Within this project, we consulted with colleagues with experience of integrating PPI into their methodological research and with PPI leads from local hospitals and research centres, presented the research proposal to a PPI group for feedback, and developed an integrated PPI approach. Results: We present a description of the integrated PPI approach for a methodological research fellowship and a list of resources available for guidance. Some of the online resources include the INVOLVE National Standards for Public Involvement and cost calculator [3, 4].
Other resources include links to toolkits and useful papers on public involvement. Conclusions: Investigators should plan PPI in advance and research the help available locally, including colleagues, PPI leads, and online support. Keywords: PPI, methodological research

Background: The Observational Health Data Sciences and Informatics (OHDSI) collaborative has established an international network of databases mapped to the Observational Medical Outcomes Partnership (OMOP) Common Data Model [1], enabling large-scale analyses. We aimed to develop a framework for risk-based assessment of heterogeneity of treatment effect (HTE) within the OHDSI setting for the analysis of observational data. Methods: The steps required for the standardized analysis are: 1) definition of the problem, i.e. the treatment, the comparator and the outcome(s) of interest; 2) identification of the database(s) in which the framework will be applied; 3) development of the prediction model for the outcome(s) of interest in a propensity score matched subpopulation of the merged treatment and comparator cohorts, using a large set of standardized candidate predictor variables including demographics, conditions, drugs, measurements, procedures and observation concepts; 4) estimation of the propensity scores within strata of predicted risk using large-scale regularized regression, selecting from the same large set of candidate variables; 5) estimation of relative and absolute treatment effects within risk strata; matching or stratification on the propensity score, or inverse probability of treatment weighting, can be applied. Results: We compared angiotensin-converting enzyme (ACE) inhibitors (treatment) to beta blockers (comparator) with regard to a set of 9 outcomes in patients with hypertension across three observational databases. Conclusions: Reproducible risk-based assessment of HTE in observational data is made possible (Figure 1). The standardized nature of the process allows its implementation at scale, while the common data model enables collaboration across multiple sites with access to different databases.

Background: Sporadic Creutzfeldt-Jakob disease (sCJD) is the world's most common human prion disease; it is invariably fatal, with an incidence of 1-2 cases per million per year. Disease duration averages 5-6 months from diagnosis to death, but ranges from weeks to several years. We aimed to develop an individual prognostic prediction model based on cerebrospinal fluid (CSF) biomarkers and other proposed disease survival modifiers, which are easily obtainable in routine settings at the time of diagnosis. Methods: Probable or definite sCJD cases from a German surveillance study were included. Prognostic accuracy for predicting overall survival after sCJD diagnosis was measured by the c statistic of a model derived from multivariable Cox proportional hazards regression. Results: Complete information on age, sex, codon 129 genotype, presence of 14-3-3 protein in the CSF, and CSF tau concentration was available for 1,226 of 2,908 sCJD cases. The median age at diagnosis was 66 years (range 19-89 years). The male-to-female ratio was 1:1. A Cox proportional hazards model containing age, sex, genotype, CSF tau and the interaction terms age × tau, sex × tau, and sex × genotype was selected as the model with the highest c statistic (0.686, 95% CI 0.665-0.707) using cross-validation. This model was well calibrated. A score chart was derived to predict 6-month survival and median survival time (Figure 1).
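A minimal sketch of this type of model using the survival package in R; the data are simulated and only loosely shaped to the abstract (all events observed, invented effect sizes), so it illustrates the model form and the c-statistic computation, not the published model.

```r
library(survival)                            # concordance() needs survival >= 3.0
set.seed(6)
n    <- 500
age  <- rnorm(n, 66, 9)
sex  <- rbinom(n, 1, 0.5)
tau  <- rlnorm(n, 6, 1)                      # CSF tau, arbitrary units
lp   <- 0.03 * (age - 66) + 0.25 * (log(tau) - 6) -
        0.01 * (age - 66) * (log(tau) - 6)   # includes an age x tau interaction
time <- rexp(n, rate = exp(lp) / 6)          # months, roughly 6-month scale

fit <- coxph(Surv(time, rep(1, n)) ~ age * log(tau) + sex * log(tau))
concordance(fit)$concordance                 # c statistic (abstract reports 0.686)
```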
Background: Assessment of the incremental gain and impact of a novel marker for better predicting disease risk is an ongoing quest in many clinical disciplines. For binary and time-to-event outcomes, two popular metrics used to assess incremental gain are the difference in the C-index or area under the ROC curve (dAUC) and the Integrated Discrimination Improvement (IDI). However, inference for these two measures is complex because of their non-standard distributions, especially when comparing nested models that are built and evaluated on the same dataset. Methods and results: We propose an easy-to-implement permutation test for the dAUC and IDI to provide exact inference for the incremental gain. Via extensive simulation studies, we show that for small to moderate sample sizes, the type I error rate and power for the dAUC and IDI are comparable to those of the likelihood ratio test and Wald test for comparing nested logistic and Cox proportional hazards models. In addition, we assess the performance of the permutation test for classification trees. We demonstrate the approach in a real dataset in which the incremental value of time-to-first-cigarette for selecting ever-smokers for lung cancer screening was assessed.

Background: Putting data together from different sources into a homogeneous data resource would enable unprecedented opportunities to study human health. However, these disparate collections of data are inevitably heterogeneous, which makes aggregation a difficult challenge. We focus on the issue of content heterogeneity in data integration. Traditional approaches for resolving content heterogeneity map all source datasets to a common data model that includes only shared data items. Our focus is on the integration of structured data. We assume that each dataset to be integrated consists of a single table, and that the datasets describe disjoint sets of entities; record linkage is therefore not needed. Methods: We propose the development of improved, probabilistic approaches for data integration, capable of advancing the timely utilisation of large-scale biomedical data resources. Our approaches aim to forgo the need for perfect data standardisation by employing a probabilistic post-alignment of data items that is integrated with statistical inference. Using these approaches, missing or semantically ambiguous information is estimated from datasets potentially relevant for answering the research question.

(1) and (2) on specified outcomes. Simulation techniques have become a common approach over the past two decades; the most flexible method, the error model simulation approach, is based on the iterative application of bias and imprecision onto baseline "true" values. Whilst previous studies have focused on clinical performance (e.g. diagnostic accuracy), evaluations can feasibly be extended to clinical-utility and cost-effectiveness outcomes using decision analytic models. Conclusions: Various approaches are available for conducting indirect assessments to inform outcome-based APS and test evaluations. This study provides a useful overview of methods and key considerations for future research. Keywords: Measurement uncertainty, methodology review, analytical performance specifications, test evaluation

Background: Obstetric healthcare relies on adequate antepartum risk selection. Most guidelines used for risk stratification, however, do not assess absolute risks. In 2017, a prediction tool was implemented in a Dutch region.
This tool combines first-trimester prediction models with obstetric care paths tailored to the individual risk profile, enabling risk-based care (RBC). We aimed to assess the impact and cost-effectiveness of RBC compared to care-as-usual (CAU) in a general population. Methods: A before-after study was conducted using two multicenter prospective cohorts. The first cohort (2013-2015) received CAU; the second cohort (2017-2018) received RBC. Health outcomes were 1) a composite of adverse perinatal outcomes and 2) maternal quality-adjusted life years (QALYs). Costs were estimated from a healthcare perspective, from conception to six weeks after the due date. Mean costs per woman, cost differences between the two groups, and incremental cost-effectiveness ratios were calculated. Sensitivity analyses were performed to evaluate the robustness of the findings. Results: In total, 3,425 women were included (Figure 1

Background: Pre-eclampsia is the most predicted obstetric outcome, with more than 130 prognostic models developed [1]. A quarter of these have been externally validated and showed only modest predictive performance, reflecting methodological shortcomings in their development, including overfitting, small event numbers in development datasets, and predictors insufficiently varied to adequately capture the differences between women. Access to IPD from multiple studies provides an increased sample size with more outcomes, allowing several candidate prognostic factors to be evaluated beyond what would have been possible in a single study, and clinically relevant, robust models to be developed. It also enables the evaluation of any developed prediction model across different settings and population case-mixes. We aimed to develop and validate pre-eclampsia prediction models using IPD from multiple studies. Methods: We used logistic regression with a random intercept, to account for clustering by study, for model development, and internal-external cross-validation with random-effects meta-analysis to summarise performance measures across studies. Results: The International Prediction of Pregnancy Complications (IPPIC) network [2] is a group of 125 researchers contributing data on 3,674,684 pregnancies from 78 datasets. Twelve prediction models were developed: four each for any, early-onset and late-onset pre-eclampsia. Between 3 and 11 datasets were used to develop each model, depending on the availability of predictors within datasets. Average model discrimination was good (0.68-0.83); however, calibration performance was heterogeneous across datasets. The models showed the highest net-benefit in nulliparous women at predicted-probability thresholds above 5%. Conclusions: The IPPIC models showed, on average, promising predictive performance. However, before application in practice, recalibration of the model parameters to particular populations and settings may be needed. Additional predictors may improve the predictive performance of the models.

Background: With about 70 published prognostic models, pre-eclampsia is the most frequently predicted outcome in obstetrics, yet only 10% of these models have been externally validated [1], and none are recommended in national guidelines for routine clinical use, partly due to this paucity of external validation. Access to individual participant data (IPD) from multiple studies allows for external validation in different populations.
It saves costs by reusing existing data, thereby reducing research waste, and increases the sample size, with more outcomes than would have been possible in a single study, allowing for the evaluation of prediction models for rare conditions such as early-onset pre-eclampsia, which affects only 0.5% of all pregnancies. We aimed to assess the external predictive performance of existing prognostic models for pre-eclampsia within the UK healthcare setting. Methods: Systematic review and external validation of prognostic models using IPD meta-analysis. Performance was evaluated using measures of discrimination, calibration and net-benefit. Random-effects meta-analysis was used to summarise and estimate heterogeneity in model performance across studies.

Background: Guidelines exist for reporting the development and validation of prediction models (TRIPOD) and for reporting systematic reviews (PRISMA). However, no specific guidance exists for reporting systematic reviews of prediction models, which can have different aims, ranging from identifying models through to comparing the predictive performance of models. Existing reporting guidelines therefore require modification to be suitable for systematic reviews of prediction model studies. We aimed to develop an extension to TRIPOD specific to systematic reviews of prediction model studies. Methods: Existing reporting guidelines were reviewed. Relevant guideline items were combined and assessed for suitability by two researchers, considering the different aims of systematic reviews: i) identification of prediction models within a broad clinical field, ii) identification of prediction models for a target population, iii) identification of models for a particular outcome, iv) assessment of the performance of a particular model, and v) comparison of models (in terms of predictive performance). Item suitability and wording were discussed within the working group and a draft extension to TRIPOD was produced. An online Delphi survey was conducted, using researchers with experience in systematic reviews and prediction modelling to provide feedback on the proposed items. Results: PRISMA and TRIPOD-Cluster (in development) were identified as the most relevant reporting guidelines. They contained many overlapping items; while PRISMA contained some items specific to systematic reviews, TRIPOD-Cluster contained some items specific to prediction models. Items from both guidelines were combined, resulting in many items being merged and modified, while other items specific to model development or individual participant data were removed. Feedback from the Delphi survey was incorporated, and the draft extension will be presented, welcoming feedback before a second Delphi survey. Conclusions: TRIPOD-SR is an extension of existing reporting guidelines that is being developed to provide more tailored guidance for reporting systematic reviews of prediction models. Keywords: Reporting guidelines, systematic reviews, prediction models

Background: The slope of a calibration plot is often referred to as the "calibration slope". Methodology texts emphasize that the slope should not be used in isolation but accompanied by other metrics and graphs: poor calibration, by any definition, can occur even when the slope is perfect (equals 1). Method: We review recent usage of the calibration slope.
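For orientation before the Results: the quantity under review, and its companion calibration-in-the-large, can be estimated from a validation sample in a few lines of base R. Here lp is the model's linear predictor and the miscalibration is simulated on purpose.

```r
set.seed(7)
lp <- rnorm(1000, -2, 1.2)                     # model linear predictor (log-odds)
y  <- rbinom(1000, 1, plogis(0.3 + 0.7 * lp))  # outcomes from a miscalibrated truth

slope <- coef(glm(y ~ lp, family = binomial))["lp"]       # calibration slope (B)
citl  <- coef(glm(y ~ offset(lp), family = binomial))[1]  # calibration-in-the-large (A)
## a slope near 1 does not by itself establish good calibration:
## A, the calibration plot, and discrimination (C) must be examined as well
```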
Results: In 33 validation papers (24 external) published in 2017-2018, 25 papers identified the slope with calibration, 1 identified the calibration slope with discrimination, and 7 used the term calibration slope without explicitly interpreting it. In 17 papers (52%) the slope was used as the sole measure of calibration. We are currently reviewing papers from 2019 and 2020. Conclusions: The paper often cited as the origin of the "calibration slope" did not use the term calibration, but "spread". More recently, "spread" has been identified in some papers as an aspect of calibration and in others as an aspect of discrimination, sometimes by the same authors. We resolve this apparent paradox by proposing that calibration and discrimination are not a dichotomy. If we equate the A (calibration-in-the-large), B (calibration slope) and C (discrimination) of Steyerberg and Vergouwe's ABCD [1] with bias, spread and ordering, then we can see that good calibration-in-the-large equates to low bias; calibration as often defined equates to low bias and adequate spread; good discrimination requires correct spread and correct ordering; and moderate to strong calibration, as defined by Van Calster [2], requires low bias, adequate spread and correct ordering. Authors, reviewers and editors have a duty to discourage the perception that calibration is a unidimensional construct quantifiable by a single statistic, the slope.

Background: Biomarkers and tests are often used to diagnose or monitor a condition, or function as outcomes in clinical trials. Key questions arise about the measurement properties of biomarkers used for such purposes, as measurements are subject to analytical, biological, and intra-/inter-rater variability. Methods for meta-analysis are required in order to synthesize, in systematic reviews, the results of individual studies assessing the reliability of biomarkers or tests. We aimed to review the current state of methods used for meta-analysis of reliability estimates reported in biological variability studies. Methods: Published systematic reviews reporting the reliability of any test measuring the presence or progress of any pathological condition were identified by searches of Medline and Embase from 2010 to 2019. Detailed information was extracted regarding: the experimental test; the condition; the review methodology, including the literature search and the approach to quality assessment; the statistical methodology used to examine reliability; and the results each study reported. Results: 228 reviews were identified, of which only 23 performed a meta-analysis of the reported estimates. The most common meta-analytical estimate was the intraclass correlation coefficient (ICC; 61%), with 3 studies using Fisher's Z transformation to account for the non-normal distribution of ICC data. Other reported statistics included the kappa coefficient, the standard error of measurement, the coefficient of variation, limits of agreement, repeatability coefficients, linear-regression-based R², and correlation coefficients. The majority of studies (78%) constructed forest plots and used random effects models to account for differences between studies. One study used a fixed effects model, while the method was not specified in 2 studies. Other approaches included pooling the data and performing linear regression, Bland-Altman analysis of the test-retest values, and describing the distribution of the study results. Conclusions: Any limitations in the statistical estimates and meta-analysis methods used to date will be explored and presented.
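As a pointer to what such a meta-analysis involves, below is a minimal base-R sketch of random-effects pooling of ICCs on the Fisher-z scale (DerSimonian-Laird). The study values are invented, and 1/(n-3) is only an approximation to the variance of z when applied to ICCs.

```r
icc <- c(0.71, 0.80, 0.65, 0.77)               # study ICC estimates (invented)
n   <- c(40, 25, 60, 32)                       # study sample sizes (invented)

z <- atanh(icc)                                # Fisher's z transformation
v <- 1 / (n - 3)                               # approximate within-study variance
w <- 1 / v
Q <- sum(w * (z - sum(w * z) / sum(w))^2)
tau2 <- max(0, (Q - (length(z) - 1)) /
               (sum(w) - sum(w^2) / sum(w)))   # DL between-study variance
wr <- 1 / (v + tau2)
tanh(sum(wr * z) / sum(wr))                    # pooled ICC, back-transformed
```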
Background: Point of care blood testing to aid diagnosis is becoming increasingly common in acute ambulatory settings and enables timely investigation of a range of diagnostic markers. However, this testing allows scope for errors in the pre-analytical phase, which depends on the operator handling and transferring specimens correctly. The extent and nature of these pre-analytical errors in clinical settings have not been widely reported. Methods: We carried out a convergent parallel mixed-methods service evaluation to investigate pre-analytical errors leading to machine error reports in a large acute hospital trust in the UK. The quantitative component comprised a retrospective analysis of all recorded error codes from Abbott Point of Care i-STAT 1, i-STAT Alinity and Abbott Rapid Diagnostics Afinion devices, to summarise the error frequencies and reasons for error, focusing on those attributable to the operator. The qualitative component included a prospective ethnographic study and a secondary analysis of an existing ethnographic dataset, based in hospital ambulatory care and community ambulatory care respectively. Results: The i-STAT had the highest usage (113,266 tests, January 2016-December 2018). As a percentage of all tests attempted, its device-recorded overall error rate was 6.8% (95% confidence interval 6.6% to 6.9%), and in the period when reliable data could be obtained, the operator-attributable error rate was 2.3% (2.2% to 2.4%). Staff identified that the most difficult step was the filling of cartridges, but that this could be improved through practice, with a perception that cartridge wastage through errors was rare. Conclusions: In the observed settings, the rate of errors attributable to operators of the primary point-of-care device was less than 1 in 40. In some cases, errors may lead to a small increase in resource use or the time required, so adequate staff training is necessary to prevent an adverse impact on patient care. Keywords: Point of care, ambulatory care, pre-analytical error

Network meta-analysis methods for ranking the accuracy of multiple diagnostic tests Areti Angeliki Veroniki 1,2,3, Sofia Tsokani 1 Background: The diagnosis of a clinical condition is usually the first and most crucial step before initiating treatment. Diagnostic tests are routinely used for confirming or excluding a target condition. Although most diagnostic test accuracy (DTA) studies have focused on assessing a single index test, studies and systematic reviews are increasingly comparing the accuracy of multiple index tests to facilitate the selection of the best performing test(s) for patient care. For example, HPV DNA, HPV mRNA, and co-testing (Pap test + HPV DNA or mRNA test) can be used for cervical cancer diagnosis. But which test is the best? Since studies that directly compare test accuracy are not always available, and comparisons between multiple tests constitute a network, DTA network meta-analysis (DTA-NMA) has been proposed. We aimed to identify and assess DTA-NMA methods for comparing the accuracy of multiple diagnostic tests. Methods: We conducted a methodological review of statistical and empirical studies that performed, described, or evaluated a DTA-NMA of at least 3 diagnostic tests. We searched PubMed, JSTOR, and Web of Science. Studies of any design published in English were eligible for inclusion. We also included relevant unpublished material. Results: We included 38 relevant studies. The results will be presented at the Symposium. In particular, we will present the approaches that have been proposed, together with a critique of their strengths and limitations. In addition, using cervical cancer as a case study, we will present an application of DTA-NMA methods to determine the most promising test (in terms of sensitivity and specificity) for use as the primary screening test for cervical cancer and to identify which women need referral for colposcopy.

Fig.: Effects of updating on net benefit. Threshold is the decision threshold and is represented in relation to the outcome prevalence. N is the number of independent external validations. '% Above' refers to net benefit above the default strategy, '% neutral' refers to net benefit not different from the default strategy, and '% Below' refers to net benefit less than the default strategy (net harm).
References:
The Predictive Approaches to Treatment effect Heterogeneity (PATH) Statement
Development of an interactive web-based tool to conduct and interrogate meta-analysis of diagnostic test accuracy studies: MetaDTA
Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement
Modelling multiple thresholds in meta-analysis of diagnostic test accuracy studies
"Initial Data Analysis" of the STRATOS Initiative: a contemporary conceptual framework for initial data analysis
World Health Organization, Global tuberculosis report
Role of patient and public involvement in implementation research: a consensus study
Validation of a common data model for active safety surveillance research
Accuracy of clinical characteristics … using IPD meta-analysis
External validation of prognostic models to predict pre-eclampsia: an individual participant data meta-analysis
Prognostic models in obstetrics
External validation, update and development … the IPPIC pre-eclampsia Network protocol
On the use of comparison regions in visualizing stochastic uncertainty in some two-parameter estimation problems

We then (1) evaluated, in a separate study of 502 samples, the presence of a linear relationship between the results of the tests, (2) used the regression equation to obtain harmonized test results, and (3) performed a single meta-analysis combining the results from all nine studies. Results: Eight studies used one formula (Siemens) and two used another (Guha). The first meta-analysis of the eight studies resulted in an "optimal" threshold (maximum Youden) of 9
Conclusions: Our three-step method allows the combination of multiple tests of the same marker in a single meta-analysis, facilitating the interpretation of the accuracy of using specific thresholds. Keywords: Meta-analysis, accuracy studies, harmonization

85. Large-scale validation of the Prediction model Risk Of Bias ASsessment Tool (PROBAST) using a short form
The primary outcome was the change in the area under the receiver operating characteristic curve (dAUC, available for 1,147 validations) between the derivation and validation cohorts in low versus high risk of bias (ROB) CPMs. Results: The full PROBAST classified 98 of 102 CPMs as high ROB. The short form identified 96 of these 98 as high ROB (98% sensitivity), with perfect specificity. Perfect agreement with the full PROBAST could be achieved with re-review of only a small number of low ROB CPMs. In the full CPM registry, 529 of 556 CPMs (95%) were classified as high ROB, 20 (4%) as low ROB, and 7 (1%) as unclear ROB. The median change in discrimination was significantly smaller in low ROB models

Background: Analysis using random effects linear models is the established method in biological variability studies for attributing the observed variability to between-patient differences, within-patient differences, and measurement error. However, these models assume underlying normality and thus may not be applicable to biomarkers based on counts. We aimed to present methods for estimating sources of variability in count-based biomarkers, and to apply and compare approaches in a case study of patients with Sjogren's syndrome.
Methods: Both Poisson and negative binomial models are appropriate for the analysis of count data, and methods for obtaining between- and within-patient variance estimates are described in Leckie et al [1]. We analysed the biomarker data using random effects Poisson and negative binomial models and, for comparison, using a random effects linear regression model. The intraclass correlation (ICC) was calculated as the ratio of the between-patient variance to the total variance, and was compared across the different models. The AIC and BIC criteria were used to assess each model's performance. Data from 32 patients with Sjogren's syndrome were used as a case study, considering the focus score, calculated for each salivary gland observed in each biopsy as the number of foci over the glandular area, multiplied by 4. Between-patient and within-patient-between-gland sources of variability were estimated. Results: The ICC estimates obtained from the Poisson (0.323) and negative binomial (0.310) models were similar, and higher than that from the linear regression model (0.222). AIC and BIC values were similar for the Poisson (AIC=463.63, BIC=469.84) and negative binomial (AIC=465.55, BIC=474.87) models and indicated that both were a better fit than the linear regression model (AIC=632.69, BIC=642.01). Conclusions: It is important to properly model the distribution of biomarkers based on count data to correctly estimate sources of variability and measurement error. Keywords: Biomarkers, variability, count data
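A minimal sketch of the random-intercept Poisson model for such focus scores, assuming the lme4 package; the data are simulated, and the level-1 variance term below is one common latent-scale approximation rather than the exact formulas of Leckie et al.

```r
library(lme4)
set.seed(9)
npat   <- 32
glands <- sample(2:4, npat, replace = TRUE)    # glands observed per biopsy
d <- data.frame(pat  = factor(rep(seq_len(npat), glands)),
                area = runif(sum(glands), 1, 6))
u <- rnorm(npat, 0, 0.6)                       # between-patient variation
d$foci <- rpois(nrow(d), lambda = d$area / 4 * exp(0.3 + u[d$pat]))

m  <- glmer(foci ~ 1 + offset(log(area / 4)) + (1 | pat),
            data = d, family = poisson)
vu <- as.numeric(VarCorr(m)$pat)               # between-patient variance (log scale)
lam <- exp(fixef(m)[1] + vu / 2)               # approximate marginal mean
vu / (vu + log(1 / lam + 1))                   # approximate latent-scale ICC
```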
Background: Point-of-care testing depends on the operator handling and transferring specimens correctly. The extent and nature of these pre-analytical errors in clinical settings have not been widely reported. Methods: We carried out a convergent parallel mixed-methods service evaluation to investigate pre-analytical errors leading to machine error reports in a large acute hospital trust in the UK. The quantitative component comprised a retrospective analysis of all recorded error codes from Abbott Point of Care i-STAT 1, i-STAT Alinity and Abbott Rapid Diagnostics Afinion devices, to summarise the error frequencies and reasons for error, focusing on those attributable to the operator. The qualitative component included a prospective ethnographic study and a secondary analysis of an existing ethnographic dataset, based in hospital-based ambulatory care and community ambulatory care, respectively. Results: The i-STAT had the highest usage (113,266 tests, January 2016 to December 2018). As a percentage of all tests attempted, its device-recorded overall error rate was 6.8% (95% confidence interval 6.6% to 6.9%), and in the period for which reliable data could be obtained, the operator-attributable error rate was 2.3% (2.2% to 2.4%). Staff identified the filling of cartridges as the most difficult step, but felt that this could be improved through practice, and perceived cartridge wastage through errors to be rare. Conclusions: In the observed settings, the rate of errors attributable to operators of the primary point-of-care device was less than 1 in 40. Because errors may in some cases lead to a small increase in resource use or the time required, adequate staff training is necessary to prevent an adverse impact on patient care. Keywords: Point of care, ambulatory care, pre-analytical error

Background: Familial hypercholesterolaemia (FH) is a common, life-threatening genetic condition associated with long-term elevation of cholesterol levels in the blood. A diagnosis of FH can be confirmed by genetic testing; however, testing is expensive and can often be mistargeted because of limitations of the scoring systems used to refer patients. We aimed to develop a model using clinical data to improve the targeting of genetic testing by predicting the likelihood of a patient having a variant causing FH. Methods: Data were obtained from 243 patients referred for genetic testing on suspicion of having FH. Forward stepwise logistic regression was performed, with variant status (binary) as the dependent variable, and age, sex, individual components of the Dutch Lipid Clinic Network (DLCN) criteria, total cholesterol (TC), high-density lipoprotein cholesterol (HDL-C), triglycerides, and low-density lipoprotein cholesterol (LDL-C) as independent variables. Variables were added to the model until their inclusion was no longer significant (p>0.05) and the Bayesian information criterion (BIC) increased. Backward stepwise logistic regression was performed to verify the results.

Background: Familial hypercholesterolaemia (FH) is associated with long-term elevation of cholesterol levels in the blood. According to guidance from the National Institute for Health and Care Excellence (NICE), FH is suspected if total cholesterol exceeds 7.5 mmol/L in people under 30 years of age or 9 mmol/L in people over 30. The use of these cut-offs may overdiagnose older people whose cholesterol has risen because of lifestyle factors, and underdiagnose younger people who have not yet reached the threshold but may be at risk. We aimed to develop an interactive application that places a patient on a specific population-based cholesterol centile according to their age and sex, to improve the identification of people at risk of FH. Methods: Health Survey for England (HSE) data were obtained from NHS Digital, covering seven years between 2003 and 2014. Data on age, sex, high-density lipoprotein cholesterol (HDL-C), total cholesterol (TC), and use of lipid-lowering drugs were extracted. Centiles were derived at intervals of 0.1 between 0.5 and 99.5, for non-HDL cholesterol (non-HDL-C = TC - HDL-C) and TC, in patients not being treated with lipid-lowering drugs. Results: An interactive application was developed using Shiny that places a patient on a specific cholesterol centile based on their age and sex. Figure 1 shows the example of a 35-year-old (I) and a 55-year-old (II) male with a TC of 9 mmol/L, which places these example patients on the 98.7th and 96.9th centiles, respectively. Conclusions: When used in conjunction with current methods, age- and sex-adjusted cholesterol centiles could help improve the identification of patients with FH, and thereby refine the selection of index cases for targeted genetic testing. Keywords: Familial hypercholesterolaemia, cholesterol, identification, application.
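The centile lookup behind such an application is simple to sketch. The authors built theirs in Shiny; the Python version below illustrates the same idea on simulated data, with hypothetical column names (age_band, sex, tc) standing in for the HSE variables.

```python
# Sketch of the centile lookup behind such an application (the authors used
# Shiny/R; this is an illustrative Python version on simulated data, with
# hypothetical column names, not the HSE dataset itself).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated untreated survey sample: age band, sex, total cholesterol (mmol/L).
n = 20_000
df = pd.DataFrame({
    "age_band": rng.choice(["16-29", "30-44", "45-59", "60+"], size=n),
    "sex": rng.choice(["male", "female"], size=n),
    "tc": rng.normal(5.3, 1.1, size=n).clip(2.0),
})

# Centile grid at 0.1 intervals between 0.5 and 99.5, as in the Methods.
grid = np.arange(0.5, 99.5 + 1e-9, 0.1)

def centile_of(tc, age_band, sex):
    """Return the centile a patient's TC falls on within their age/sex stratum."""
    stratum = df[(df["age_band"] == age_band) & (df["sex"] == sex)]["tc"]
    cutoffs = np.percentile(stratum, grid)
    # Highest grid centile whose cut-off is still <= the patient's TC.
    idx = np.searchsorted(cutoffs, tc, side="right") - 1
    return grid[max(idx, 0)]

# TC = 9 mmol/L sits near the top of the simulated distribution.
print(centile_of(9.0, "30-44", "male"))
```

The patient's value is simply located within the empirical distribution of untreated people of the same age band and sex, using the 0.1-wide centile grid described in the Methods.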
Background: A conceptually oriented pre-processing of a large number of potential prognostic factors may improve the development of a prognostic model and hence may play an important role in this process. However, it is unclear whether this assumption holds and which form of pre-processing is optimal. This study investigated whether various forms of conceptually oriented pre-processing, or the preselection of established factors, were superior to using all factors as input. Methods: We made use of an existing project that developed two conceptually oriented subgroupings of low back pain patients without taking the outcome into account. Based on the prediction of six outcome variables by seven statistical methods, this type of pre-processing was compared with domain-specific principal component scores, with medical experts' preselection of established factors, and with using all 112 available baseline factors. Results: Subgrouping of patients was associated with low prognostic capacity. Applying Lasso-based variable selection to all factors, or to domain-specific principal component scores, performed best. The preselection of established factors offered a good compromise between model complexity and prognostic capacity.

Background: The results of a diagnostic accuracy study often comprise two parameters that must be interpreted together: sensitivity and specificity, false positive and true negative rate, positive and negative predictive value, test positive rate and sensitivity, change in false positive and true negative rate, change in sensitivity and specificity, and so on. For their interpretation, we often assign weights or utilities to each rate and consider a weighted average. However, different stakeholders may use different weights, and the weights may also vary with the intended application of the test. This raises the question of how to present the results of a diagnostic accuracy study (and in particular their uncertainty) such that different weights can be evaluated post hoc. Methods: Post hoc analyses of weighted averages require testing null hypotheses of the type that a weighted average is below a certain threshold. This can be approached by comparing the corresponding half space in the two-dimensional parameter space with a 95% confidence region; however, this is a very conservative approach. Results: As an alternative, we present so-called comparison regions, constructed such that no overlap between the half space and the comparison region is equivalent to rejecting the null hypothesis at the 5% level [1]. In this way we can test any hypothesis about any weighted average and, in addition, any hypothesis corresponding to the complement of a convex set. Figure 1 illustrates the point. Conclusions: The results of a diagnostic accuracy study can be presented in a way that allows post hoc testing of (linear) hypotheses about weighted averages of two diagnostic accuracy parameters. Keywords: Diagnostic accuracy studies, uncertainty, visualization, sensitivity and specificity, false positive and true negative rate. Reference 1. On the use of comparison regions in visualizing stochastic uncertainty in some two-parameter estimation problems.
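For contrast with the comparison-region approach of [1], the sketch below implements the much more limited alternative that applies only when the weights are fixed in advance: a one-sided Wald-type test of a single pre-specified weighted average of sensitivity and specificity, treating the two as independent binomial estimates. Unlike the comparison region, it cannot accommodate weights chosen post hoc; all counts are made up for illustration.

```python
# Baseline for comparison, not the method of [1]: a one-sided Wald-type test
# of H0: w1*Se + w2*Sp <= c for one pre-specified weight vector (w1, w2),
# treating Se and Sp as independent binomial estimates. Illustrative counts.
import math

def wald_weighted_average_test(tp, fn, tn, fp, w1, w2, c):
    """One-sided Wald test of H0: w1*Se + w2*Sp <= c."""
    se, sp = tp / (tp + fn), tn / (tn + fp)
    var_se = se * (1 - se) / (tp + fn)
    var_sp = sp * (1 - sp) / (tn + fp)
    est = w1 * se + w2 * sp
    stderr = math.sqrt(w1**2 * var_se + w2**2 * var_sp)
    z = (est - c) / stderr
    return est, z  # reject H0 at the 5% level if z > 1.645

est, z = wald_weighted_average_test(tp=80, fn=20, tn=180, fp=20,
                                    w1=0.5, w2=0.5, c=0.75)
print(f"weighted average = {est:.3f}, z = {z:.2f}")
```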
Questions focused on current care pathways in sepsis, the current availability and utility of tests for infection, and the unmet clinical needs in this pathway. Results: Responses were received from 265 individuals across 68 NHS Trusts. The strongest role identified for a point-of-care (POC) sepsis test was as a 'rule-out' test, which was favoured by doctors but not by nursing staff, who preferred a 'rule-in' test. 67% of respondents indicated that the major cause of delay in caring for patients with suspected sepsis was the initial identification and flagging of deterioration. Existing blood tests did not greatly increase the confidence of consultants diagnosing sepsis. The majority of those surveyed felt there was a role for a POC sepsis test because it would be quicker. Conclusions: There is a need for sepsis diagnostics that are quicker and more specific than existing tests, to inform the early identification and management of patients with sepsis. The development of sepsis diagnostics should focus on meeting these needs. Keywords: Survey, development, diagnostic, care pathway, sepsis, point of care

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.