key: cord-0165090-bqeiq8f5
authors: Robertson, David S.; Choodari-Oskooei, Babak; Dimairo, Munya; Flight, Laura; Pallmann, Philip; Jaki, Thomas
title: Point estimation for adaptive trial designs
date: 2021-05-18
journal: nan
DOI: nan
sha: 2e8f6340fdb126fd7ad1724ec76c1d9962fe70cf
doc_id: 165090
cord_uid: bqeiq8f5

Recent FDA guidance on adaptive clinical trial designs defines bias as"a systematic tendency for the estimate of treatment effect to deviate from its true value", and states that it is desirable to obtain and report estimates of treatment effects that reduce or remove this bias. In many adaptive designs, the conventional end-of-trial point estimates of the treatment effects are prone to bias, because they do not take into account the potential and realised trial adaptations. While much of the methodological developments on adaptive designs have tended to focus on control of type I error rates and power considerations, in contrast the question of biased estimation has received less attention. This article addresses this issue by providing a comprehensive overview of proposed approaches to remove or reduce the potential bias in point estimation of treatment effects in an adaptive design, as well as illustrating how to implement them. We first discuss how bias can affect standard estimators and critically assess the negative impact this can have. We then describe and compare proposed unbiased and bias-adjusted estimators of treatment effects for different types of adaptive designs. Furthermore, we illustrate the computation of different estimators in practice using a real trial example. Finally, we propose a set of guidelines for researchers around the choice of estimators and the reporting of estimates following an adaptive design.

Adaptive clinical trials allow for pre-planned opportunities to alter the course of the trial on the basis of accruing information [1] [2] [3] . This may include changes such as increasing the recruitment target (in sample size re-estimation designs), selecting the most promising treatment arms (in multi-arm multi-stage designs) or patient subpopulations (in population enrichment designs), shifting the randomisation ratio towards more promising arms (in adaptive randomisation designs), or terminating recruitment early for clear evidence of benefit or lack thereof (in group sequential designs) 4 . Despite some additional complexities when implementing such trials 5 , they are increasingly being used in practice due to their attractive features, adding flexibility to a trial design whilst maintaining scientific rigour [6] [7] [8] . The challenges of the COVID-19 pandemic have also recently accelerated their use 9, 10 .

To date most of the research undertaken has focused on the design of such studies and the associated question of maintaining desirable operating characteristics related to hypothesis testing (namely type I error and power) rather than estimation. General methods on the basis of p-value combination 11, 12 and conditional error functions 13, 14 have been proposed that are applicable to a wide range of adaptive designs as well as specialised methods for specific designs such as multi-arm multi-stage (MAMS) designs 15, 16 , enrichment designs 17, 18 , responseadaptive randomisation (RAR) designs 19, 20 , and sample size re-estimation 21, 22 .

In contrast, the question of estimation of treatment effects in an adaptive clinical trial has received comparatively less attention, as reflected in the recent FDA guidance on adaptive designs 23 which states "Biased estimation in adaptive design is currently a less well-studied phenomenon than Type I error probability inflation". In this paper, we review current methods for estimating treatment effects after an adaptive design and critically discuss different approaches. While equally important, the construction of related quantities for inference, such as confidence intervals or regions, is beyond the scope of this paper so we signpost the interested reader to related literature 24, 25 . Moreover, our focus is on bias in the statistical sense, and we do not consider issues such as operational bias in adaptive designs 2, 3, 26 or the impact of publication bias 27 ; see also recent reference articles 28, 29 for discussion of other types of biases in randomised controlled trials.

The issue with estimation after an adaptive trial is that traditional (maximum likelihood, ML) estimators tend to be biased either because of some selection that took place following an interim analysis [see Bauer et al. 30 for a detailed explanation of why selection results in bias] or other mechanisms utilized in an adaptive design, such as early stopping, which might affect the sampling distribution of the estimator (these depend on the nature of the design). For this reason, the usual ML estimator (MLE) is sometimes referred to as the 'naive' estimator for the trial.

Before considering the issue of estimation further, it is worth clarifying what we mean by a biased estimator. The FDA draft guidance on adaptive designs 23 defines bias as "a systematic tendency for the estimate of treatment effect to deviate from its true value", and states that "It is important that clinical trials produce sufficiently reliable treatment effect estimates to facilitate an evaluation of benefit-risk and to appropriately label new drugs, enabling the practice of evidence-based medicine". It is clear that (all else being equal) it is desirable to obtain estimators of treatment effects that are unbiased in order to make reliable conclusions about the effects of study treatments.

While it is relatively easy to define statistical bias, different definitions of an unbiased estimator are relevant in our context. To introduce these, let us denote the population parameter of interest, the treatment effect, by and an estimator thereof by ̂. We give concrete examples of these different types of estimators in our case study in Section 4.

An estimator ̂ is called mean unbiased if its expected value is the same as the true value of the parameter of interest, that is, (̂) = . This is the most commonly used definition of unbiasedness.

An estimator ̂ is called median unbiased if (̂< ) = (̂> ), that is, if the probability of overestimation is the same as the probability of underestimation. Note that for symmetric sampling distributions of ̂, a median unbiased estimator is also mean unbiased.

A further distinction of bias in estimators refers to whether they are conditionally or unconditionally unbiased. In our context, an estimator is unconditionally unbiased (also known as marginally unbiased), if it is unbiased when averaged across all possible realisations of an adaptive trial. In contrast, an estimator is conditionally unbiased if it is unbiased only conditional on the occurrence of a subset of trial realisations. For example, one might be interested in an estimator only conditional on a particular arm being selected at an interim analysis; as such, the focus becomes on a conditional unbiased estimator. We discuss the issue of conditional versus unconditional bias further throughout the rest of the paper, particularly in Sections 4 and 5.

When considering estimation after an adaptive trial one is often faced with the following conundrum: precise estimators can potentially be biased while unbiased estimators tend to be less precise, which reflects the classical bias-variance tradeoff. This means that in many instances the mean squared error (MSE), a measure of precision defined as E((̂− ) 2 ), of a biased estimator is often smaller than the MSE of an unbiased estimator. This makes it challenging to find the 'best' estimator that fits a particular trial design, and we return to this issue in our guidance in Section 5.

Since bias as defined above is an expectation (or probability) taken over possible realisations of data, it is inherently a frequentist concept. However, we can still evaluate the frequentist bias of a Bayesian point estimator, such as the posterior mean. Similarly, it is possible to use frequentist point estimators in the analysis of Bayesian trial designs (i.e. where the adaptations and decision rules are driven by Bayesian analyses). Hence we consider both Bayesian point estimators and Bayesian adaptive designs in this paper. However, in general we do restrict our attention to phase II and III trial designs, since these have a direct impact on health policy and the adoption of new treatments for wider public use.

In the following, we will describe the problem of bias in an adaptive design in more detail in Section 2, and review different approaches to remove or reduce the bias in an adaptive design in Section 3 before providing an exemplary case study in Section 4. We conclude with some guidance for researchers and discussion in Sections 5 and 6.

Adaptive designs play an important role in health care decision-making, increasingly providing evidence of the clinical effectiveness of an intervention as well as key secondary outcomes such as health economic ones. Failing to account for biases in point estimates could result in incorrect decision-making, potentially wasting limited resources, preventing patients from receiving effective treatments, or exposing patients to unnecessary risks. In the following subsections, we consider some of the potentially negative impacts of reporting biased estimates.

As highlighted by Dimairo et al. 2, 3 , the goal of clinical trials should be to provide reliable estimates of the treatment effect to inform accurate decision-making, but this can be compromised when an adaptive design is analysed with inappropriate statistical methods. Clearly, reporting substantially biased estimates for a primary outcome measure following an adaptive design can result in poor decisions. However, other negative impacts include the results of adaptive designs being viewed with scepticism amongst stakeholders who are aware of potential biases but do not feel they have been adequately addressed by researchers 31 . This could impede the uptake of results from an adaptive trial design or discourage research teams from using these designs in practice.

A further concern is the potential for over-or underestimation of treatment effects to affect further research. In a phase II trial, for example, ineffective treatments with exaggerated effects may be wrongly selected for further investigations in phase III trials or potentially effective treatments may not be pursued further when their effects are underestimated [32] [33] [34] [35] [36] . However, Wang et al. 37 and Goodman 38 suggest that group sequential trials that stop early do not produce materially biased estimates.

The consequences following biased estimates from a phase III trial could include treatments being made available to patients with an overstated benefit or treatments not being recommended for use in practice because of an understated benefit, see Briel et al. 32 for examples. Both scenarios can have a detrimental impact on patients, especially in resource limited healthcare systems such as the National Health Service (NHS) in the UK, where resources spent on a treatment with overstated benefit removes funding for alternative treatments elsewhere in the system. Mueller et al. 39 also argue that there are serious ethical problems when trialists fail to account for bias in estimates following an adaptive design, as this may violate the scientific validity of the research and social value when these estimates are used to inform clinical decision-making.

Clinical trials often collect information about a number of key secondary outcomes that may also require adjustment in an adaptive design. If these secondary outcomes are strongly correlated with the primary outcome used to inform the adaptations to the trial they will also be vulnerable to bias 40, 41 . This is highlighted in the FDA guidance on adaptive designs 23 , which states "It is widely understood that multiple analyses of the primary endpoint can [...] lead to biased estimation of treatment effects on that endpoint. Less well appreciated, however, is that [...] biased estimation can also apply to any endpoint correlated with the primary endpoint".

As highlighted in the benefit-risk analysis literature, there can often be a trade-off between different outcomes when developing and evaluating an intervention 42, 43 . In an example reported by Briel et al. 32 , a trial assessing the effectiveness of vitamin E in premature newborns was stopped early based on an interim analysis of approximately half of the total number of participants planned at the start of the trial. This early analysis showed a reduction in intracranial hemorrhage 44 . However, a later evidence synthesis showed that this trial failed to identify that vitamin E increases the risk of sepsis 45 . Failing to accurately estimate treatment effects on key secondary endpoints could result in an intervention being adopted whose safety is overestimated or whose side effects are underestimated.

Meta-analysis and evidence synthesis provide frameworks for combining the results from several trials 46 and are useful tools for providing an overall understanding of what the synthesised research has found 47 . In a review of 143 trials using adaptive designs that stopped early, Montori et al. 48 found that few evidence syntheses and meta-analyses that included these trials considered the possible biases resulting from using these designs.

Several authors have argued that failing to account for adaptive designs in a meta-analysis or evidence synthesis can introduce bias 49, 50, 33, 34 . Cameron et al. 51 explored the impact of adaptive designs in a network meta-analysis. The authors considered three alternative methods to convert outcome data derived from an adaptive design to non-adaptive designs and found that failing to account for different study designs could bias estimates of effect in a network metaanalysis. Additionally, Walter et al. 36 suggest that the estimate of treatment benefit can be calculated more accurately by applying weights to subgroups of studies with and without stopping rules.

However, there are several authors that suggest the biases are minimal 37,52-55 including Schou et al. 56 who argue that removing truncated trials from a meta-analysis leads to substantial bias, whereas including these trials does not introduce large biases. The authors therefore recommend that all studies regardless of whether they stop early should be included in metaanalyses. Finally, Marschner et al. 57 and Luchini et al. 58 provide guidance on how sensitivity analyses may be conducted to explore the impact of trials that stopped early for benefit in a meta-analysis in line with CONSORT and GRADE 59 reporting checklists.

Increasingly clinical trials are designed with health economic objectives in mind, so that related outcomes are collected to inform a health economic analysis following the trial 60 . This may include clinical data on primary and secondary outcomes to inform parameters in a health economic model or costs and quality of life data collected directly from participants during the trial 61 .

Marschner and Schou 62 discuss the underestimation of treatment effects in sequential clinical trials when they do not stop early for benefit. The authors highlight the importance of an unbiased estimate of the treatment effect for cost-effectiveness analyses using a reanalysis of the GUSTO study 63, 64 . They show that the treatment effect may have been underestimated and the experimental therapy appeared less cost-effective than it actually was. Flight 65 showed, using a simulation study, that when there are high levels of correlation between primary and health economic outcomes collected during a group sequential design, bias is introduced in the point estimates (and confidence intervals) of health economic outcomes. The levels of bias may be reduced in a model-based health economic analysis but this will depend on several factors such as the data structure, correlation, and adaptive design used.

A review by Flight et al. 66 found no health economic analyses were adjusted following an adaptive design where this was thought to be necessary by the review authors. This potentially compromises decision-making if a decision to fund a treatment is based on biased estimates of cost-effectiveness. Additionally, patients may be penalised when a treatment is not funded based on an underestimate of cost-effectiveness, or resources may be wasted based on an overestimate of cost-effectiveness. Flight 65 extended the bias-adjusted ML estimate approach proposed by Whitehead 67 to health economic outcomes and illustrated how this can reduce bias in a health economic analysis following an adaptive design.

In this subsection, we discuss the extent of the bias in point estimates of treatment effects as a result of interim monitoring or data-dependent stopping of an adaptive design. This might be due to the pre-specified treatment selection criteria or any other stopping rules, i.e. lack-ofbenefit (futility) and/or efficacy boundaries: for example, see Figures 1-4 in Walter et al. 36 . Therefore, the magnitude of treatment effect and any potential bias plays a critical role in the benefit-risk assessment of a treatment.

The average treatment effects estimated from trials can adversely be affected by the design features of the trial. Two such designs are the group sequential designs and drop-the-loser (or treatment selection) designs. In these designs, the degree of any potential bias depends on many different features of the design: sample size, selection times, selection rule, sample size reassessment rule, whether an intermediate outcome is used for selection, and the size of the unknown underlying treatment effect 68 . In the remainder of this subsection, we describe how these design features can affect the average treatment effect in different settings.

It has been shown that the average treatment effect from a group sequential design is biased when the trial terminates early 35, 36, 69 . The bias generally tends to be larger the 'earlier' the selection happens, that is, when the decision to stop the treatment arm or continue to the next stage is based on a relatively small amount of information. By information, we mean statistical information that is driven by the number of participants in trials with continuous and binary outcomes, and the number of primary outcome events in trials with time-to-event outcomes. However, the degree of any potential bias will depend on the stopping rules, i.e. how likely it is to stop the trial early on, as well as the underlying treatment effect, see Figures 1 and 2 in Walter et al. 36 .

For example, in a group sequential design with four equally spaced interim analyses with an O'Brien-Fleming stopping boundary 70 , there is a very small chance of stopping for overwhelming efficacy at the interim looks if the underlying treatment effect is zero. The chances of stopping for overwhelming efficacy at the first, second and third interim looks are 0.001%, 0.2% and 0.8%, respectively, under the null hypothesis 71 . In similar designs with Haybittle-Peto efficacy stopping boundaries 72 , the corresponding probabilities are 0.1% at all stages. Even under the alternative hypothesis of a hazard ratio of 0.75, the chances of stopping for efficacy at the first interim stage are 0.3% and 7.2% for the O'Brien-Fleming and Haybittle-Peto stopping boundaries, respectively. In all these cases, the average treatment effect for trials that cross the first interim stage efficacy boundary will be different from the underlying treatment effect. However, as we have seen, the probability of crossing the boundary will also be very small. For this reason, it can be argued that the bias in the estimate of treatment effect of trials that reach the final stage is a more major consideration than that in stopped trials. An effective experimental arm is more likely to reach the final stage of group sequential trials, and the results of such trials are more likely to be adopted into clinical practice.

Previous empirical studies 68 showed that in designs with lack-of-benefit stopping boundaries the size of the selection bias for trials that reach the final stage is generally small. In fact, the bias is negligible if the experimental arm is truly effective. For trials that stopped early for lack of benefit, by definition a claim that the study treatment is better than the control is not made. Therefore, the fact that the treatment effect estimate is biased may be of less importance though results are useful in evidence synthesis. Furthermore, in designs that utilise an intermediate outcome for treatment selection it has been shown that this reduces the selection bias in the estimates of treatment effects in both selected and dropped treatment arms. In these designs, the degree of bias depends on the correlation between the intermediate and definitive outcome measures and this bias is markedly reduced by further patient follow-up and reanalysis at the planned 'end' of the trial 68 .

In drop-the-loser designs, where treatment selection is done based on the relative performance (e.g. ranking) of research arms, the average treatment effect will be overestimated in the treatment arms that continue to the next stage, and will be underestimated in deselected treatment arms. In these designs, the degree of bias depends strongly on the true underlying treatment effects of the research arms, which is always unknown, the timing of treatment selection as well as the number of treatment arms to select from. Generally speaking though, in many scenarios there is a fundamental dilemma in that "the better the selection, the worse the bias" 30 .

It has been shown that in drop-the-loser designs, the bias tends to be largest, and confidence intervals have incorrect coverage, where the underlying treatment effects of the treatment arms are equal, for example when all arms are under the global null or global alternative 73, 74 . More generally, bias will be smaller where the underlying effects differ amongst the treatment arms than when they are similar. In contrast, when one treatment arm has a distinctly larger treatment effect than those it is competing against, bias in the final average treatment effect is minimal for the arm that is performing best when the selection takes place. Moreover, the number of treatment arms to select from was also found to increase the degree of bias in the average treatment effects in pick-the-winner designs 75 . Finally, early stopping rules that are binding increase bias compared to a design with no early stopping rules, particularly under the global null or where only one treatment arm had the target treatment effect 30 .

We reviewed the literature to understand how often methods for reducing or removing bias in point estimates following an adaptive design are used in practice. We focused on results reported from randomised trials that used an adaptive design with options for early trial stopping (i.e. in group sequential designs), treatment selection (i.e. in MAMS designs), population selection (i.e. adaptive enrichment designs) and to change treatment allocation based on the accrued response data (i.e. RAR designs). A review of group sequential trials was based on pre-existing reviews known to the authors by the 30th Sept 2020. For other adaptive designs, we systematically searched the MEDLINE database via PubMed (from 1st Jan 2000 to 30th Sept 2020) using the search strategy given in Appendix A.1.

For group sequential designs, Stevely et al. 31 identified 68 trials that were published in leading medical journals, of which 13 (19%) were multi-arm trials. A total of 46/68 (68%) were stopped early, primarily either for efficacy (10/46, 22%) or futility (28/46, 61%) . Of these trials, only 7% (3/46) disclosed the use of some form of bias correction. A subsequent review of 19 twoarm group sequential trials 76 in oncology that were stopped early found that none of these applied any bias correction to the estimated hazard ratios. Case studies also highlighted the routine lack of use of bias-corrected estimators in trials that are stopped early 77 . In summary, in trials that use group sequential designs, bias-adjusted methods are rarely used in practice and the implications are not well known 31, [76] [77] [78] .

The results of the systematic search for other types of adaptive designs are given in Table 1 below, with more details given in Appendix A.1. In summary, across the adaptive designs considered, unbiased or bias-adjusted treatment effect estimates are currently rarely used and reported. 

In this section, we summarise the results of a semi-systematic review of methodology to remove or reduce the potential bias in point estimates for adaptive trial designs. We divided the literature review into methods for unbiased estimation (which aim to completely remove bias) and bias-reduced estimation (which aim to reduce the magnitude of the bias, but not necessarily to eliminate it).

We conducted a database search of Scopus on 18th March 2021 of all available papers up to that date, using pre-defined search terms for different categories of estimators (see Appendix A.2). This resulted in 240 papers found in total, of which 137 could be excluded immediately (based on the title and abstract) as not being relevant. We then looked for additional potentially relevant papers citing or cited by these remaining 103, which added 57 papers. After completing a full text review of these papers, a total of 134 were deemed relevant, and information about the trial contexts, advantages/limitations, code availability and case studies was extracted for qualitative synthesis. A flow diagram of these different phases for each category of estimator is given in Appendix A.2.

We summarise the methods for unbiased estimation in Section 3.1 and methods for biasreduced estimation in Section 3.2, with Section 3.3 providing a comparative overview for different types of adaptive designs. Full results giving a summary of each paper extracted for qualitative analysis can be found in the supporting materials.

Unbiased estimation can be defined in terms of mean and median unbiasedness, as discussed in Section 1. We consider these two types of unbiased estimation in turn.

In order to achieve exact mean-unbiasedness, typically it is necessary to find an estimator whose distribution is independent of the trial adaptations. One key setting where this is possible is in classical group sequential trials. Given that there is a pre-specified number of observations at the time of the first interim analysis, then the sample mean (which corresponds to the MLE) at the end of the first stage is (unconditionally) unbiased as no adaptations to the trial have occurred. However, this estimator is also clearly inefficient for estimating the overall treatment effect, as it does not use any information that may arise from later stages of the trial.

Another setting where unbiased estimation is possible are multi-stage trials with ranking and selection from a subset of candidates, such as candidate treatments or patient subgroups. The sample mean (MLE) for the selected candidate(s) calculated using only the final stage data will be conditionally unbiased, conditional on the ranking and selection that has taken place in the previous stages. However, again this estimator is clearly inefficient as it ignores all the information about the selected candidate(s) from the previous stages of the trial. Moreover, unlike the first stage estimator described in the previous paragraph, it cannot be used to estimate the effect of a treatment or subpopulation that was deselected at an interim analysis.

One way to obtain a more efficient estimator that uses more of the information from the trial, while still maintaining exact mean unbiasedness, is to apply the Rao-Blackwell theorem. This theorem implies that if U is an unbiased estimator of the unknown parameter of interest θ, and T is a sufficient statistic, then the estimator Û = E(U|T) is unbiased and var(Û) ≤ var(U). In certain cases, this technique of 'Rao-Blackwellisation' allows the derivation of the unbiased estimator that has the smallest possible variance, which is known as the uniformly minimum variance unbiased estimator (UMVUE). The Lehmann-Scheffé theorem 79 states that if is a sufficient and complete statistic, and U is an unbiased estimator, then the Rao-Blackwellised estimator Û = E(U| ) is the unique UMVUE.

The derivation of UMVUEs has been a focus in the literature on group sequential trials, with work by Chang et al. 80 85 proved that the sufficient statistic for the latter setting is in fact not complete, but that the Rao-Blackwellised estimator is still UMVUE among all 'truncation-adaptable' unbiased estimators i.e. where inference following stopping early does not require knowledge of the future analyses. For simplicity, in what follows we do not make this technical distinction when describing UMVUEs. Similar, more general theoretical results can be found in Liu et al. 86 for group sequential tests for distributions in a one-parameter exponential family and Liu et al. 87 for multivariate normal group sequential tests. See also Liu et al. 88 and Liu and Pledger 89 for further theoretical results for general two-stage adaptive designs.

Methodology has also been developed to calculate the UMVUE for secondary parameters or endpoints in group sequential trials. Liu and Hall 90 derived the UMVUE in the context of correlated Brownian motions, while Gorfine 91 considered normally-distributed endpoints where the secondary parameter is the mean in a subgroup of subjects. Similarly, Liu et al. 92 derived the UMVUE for a secondary probability, such as the rate of toxicity, in group sequential trials with binary endpoints. Kunz and Kieser 93 also derived UMVUEs for secondary endpoints in two-stage trials with binary endpoints. Further settings include estimating the sensitivity and specificity of a diagnostic test using a group sequential design, where UMVUEs have been derived by Shu et al. 94 and Pepe et al. 95 .

Moving away from classical group sequential designs, Liu et al. 96 derived the UMVUE for two-stage adaptive designs with sample size adjustment based on conditional power, while Porcher and Desseaux 97 and Zhao et al. 98 focused on Simon's two-stage designs. These results were generalised by Kunzmann and Kieser 99 for trials with binary endpoints. Finally, Bowden and Trippa 100 showed how to calculate a Rao-Blackwellised estimator for trials using RAR with binary endpoints.

In some trial contexts, it may be more appropriate to find unbiased estimators conditional on the adaptations that have taken place. For example, in multi-stage trials with treatment selection, typically there would be greater interest in estimating the properties of the betterperforming treatments that are selected at the end of the trial, rather than those treatments that are dropped for futility. In addition, it may not be possible to find a complete statistic for the parameter of interest without additionally conditioning on the selection rule used (see e.g. Cohen and Sackrowitz 101 ). For both these reasons, in settings such as multi-stage trials with ranking and selection, there has been a focus on deriving uniformly minimum variance conditionally unbiased estimators (UMVCUEs, also known as CUMVUEs), where the conditioning is on the triggered adaptations.

One of the first papers to take this conditional perspective and calculate the UMVCUE was Cohen and Sackrowitz 101 , in the context of a two-stage 'drop-the-loser' trial with treatment selection and normally-distributed endpoints (see Tappin 102 for the setting with binary endpoints), where only the best-performing treatment is taken forward to the final stage. This was subsequently extended by Bowden and Glimm 103 to allow the calculation of the UMVCUE for treatments that were not the best-performing, Bowden and Glimm 104 for multi-stage dropthe-loser trials (this was only a Rao-Blackwellised estimator) and Robertson and Glimm 105 for unknown variances. Meanwhile Koopmeiners et al. 106 derived the UMVCUE for a two-stage trial evaluating continuous biomarkers. A similar line of work focused on deriving UMVCUEs for binary data, with Pepe et al. 95 looking at two-stage designs testing the sensitivity of a dichotomous biomarker (this UMVCUE can also be applied for Simon's two stage designs, see Porcher and Desseaux 97 , which was extended by Robertson et al. 107 ).

For seamless phase II/III trials with a normally-distributed endpoint, Kimani et al. 108 showed how to calculate a Rao-Blackwellised conditionally unbiased estimator, which was subsequently extended by Robertson et al. 109 . More recently, Stallard and Kimani 110 derived the UMVCUE for MAMS trials (again with normally-distributed endpoints) with treatment selection and early stopping for futility, conditional on any pre-specified selection or stopping rule.

A recent setting where UMVCUEs have been derived is two-stage adaptive enrichment designs, where biomarkers are used to select a patient population to investigate in the second stage. Kimani et al. 111 and Kunzmann et al. 112 derived the UMVCUE (among other estimators) for such designs with normally-distributed endpoints, which was extended by Kimani et al. 113 to allow the biomarker cut-off to be determined by the first stage data, and Kimani et al. 114 for time-to-event endpoints. Finally, Kunzmann and Kieser 99 and Broberg and Miller 115 considered two-stage sample size adjustable designs based on interim data, for normally-distributed and binary endpoints respectively.

While exact mean-unbiasedness is a common criterion to aim for, it is not always feasible or desirable to use mean-unbiased estimators. The main potential disadvantage is an increase in the MSE when compared with the usual MLE, due to the bias-variance tradeoff. As well, the calculation of Rao-Blackwellised estimators may become difficult or infeasible for more flexible trial designs, where adaptations are not fully pre-specified in advance. For both these reasons, median-unbiased estimators (MUEs) have been proposed that can have a lower MSE than mean-unbiased estimators in some settings, and can be derived for a wider class of adaptive trial designs.

Consider testing a hypothesis about some parameter of interest θ. One common approach to calculating MUEs is as follows:

1. Define an ordering of the trial design space with respect to evidence against H0. For example, in group sequential trials the predominant option is stage-wise ordering: this depends on the boundary crossed, the stage at which stopping occurs and the value of the standardised test statistic (in decreasing order of priority).

2. Using this ordering, define a p-value function P(θ) which gives the probability that, at the stage the trial stopped, more extreme evidence against H0 could have been observed.

3. At the point the trial stops, find the MUE ̂, which satisfies P(̂) = 0.5.

The last step is essentially finding a 50% confidence bound for θ, or equivalently the midpoint of a symmetric 100(1 -α)% confidence interval.

A number of papers have derived MUEs for group sequential trials, including Kim 82, 116 , Emerson and Fleming 83 , Todd et al. 117 and Troendle and Yu 118 . Hall and Liu 119 provided MUEs that account for overrunning, while Hall and Yakir 120 derived MUEs for secondary parameters. Note also early work by Woodroofe 121 showed how to calculate the MUE but in the context of sequential probability ratio tests.

Meanwhile, MUEs for Simon's two-stage design were derived by Koyama and Chen 122 . For adaptive group sequential trials, where (unlike for classical group sequential designs) the sample size of the next stage can be determined based on the treatment effect observed in previous stages, Wassmer 123 and Brannath et al. 124 showed how to calculate MUEs for survival and normally-distributed endpoints respectively. These results were subsequently extended by Gao et al. 125 , who presented general theory for point and interval estimation (see also Mehta et al. 126 ). Gao et al. 127 also derived a MUE for an adaptive group sequential design that tests for noninferiority followed by testing for superiority.

Moving away from group sequential trials, Wang et al. 128 and Liu et al. 96 focused on two-stage adaptive designs with sample size adjustments based on conditional power considerations, and showed how to calculate MUEs in these settings. More general results for two-stage trials with an adaptive sample size based on interim results are given by Kunzmann and Kieser 99 and Nhacolo and Brannath 129 .

Like for mean-unbiased estimation, the conditional perspective has also been used for medianunbiased estimation. The previously mentioned work by Hall and Yakir 120 for group sequential trials is an early example of this conditional perspective. For two-stage group sequential designs with survival endpoints, Shimura et al. 130 showed how to calculate conditional MUEs, for the cases where the study stops or does not stop early for either efficacy or futility. Meanwhile, Broberg and Miller 115 considered the general setting of a two-stage sample size adjustable design based on interim data, and derived the MUE conditional on continuing to the second stage.

Finally, MUEs can be derived for flexible adaptive designs, which allow arbitrary changes to the trial at each stage based on the interim results and/or external information. Bauer et al. 131 first described the construction of MUEs for two-stage designs based on combination tests and conditional error functions; see also Brannath et al. 132 . Lawrence and Hung 133 proposed a MUE for the setting where the maximum information of a trial is adjusted on the basis of unblinded data. Finally, Brannath et al. 134 gave a general method to calculate MUEs when using recursive combination tests for flexible multi-stage adaptive trials.

Whilst using an unbiased estimator whenever there is one available for a particular adaptive design may seem like a straightforward choice, this may not always be the best option, due to the inherent bias-variance tradeoff: reducing the bias of an estimator can come at the cost of increasing its variance (thus decreasing its precision) substantially. This can lead to unbiased treatment effect estimates being accompanied by large standard errors and wide confidence intervals, which may be of little use practically. Therefore, reducing the bias but not completely eliminating it sometimes proves to be a better solution than aiming for complete unbiasedness.

In practice, however, such bias-reduced or bias-adjusted estimators sometimes "overcorrect" thereby introducing bias in the opposite direction.

In the following, we discuss analytical and iterative methods in Subsection 3.2.1; empirical Bayes and fully Bayesian methods in 3.2.2; and methods based on resampling in 3.2.3.

Cox recognised as early as 1952 135 that in sequential test procedures the option to stop early introduces bias to the MLE. He proposed a modified MLE that can be used for repeated significance tests 136 but not more general group sequential tests. Whitehead 67 focused on the sequential probability ratio test 137 and the triangular test 138 , and derived an analytic expression to quantify the bias and proposed a bias-adjusted estimator ̃ obtainable by subtracting the bias from the MLE:

Evaluating the bias at ̃ leads to an iterative procedure, and this method can be used for various types of endpoints including continuous, binary, and time-to-event. Guo and Liu 139 argued that a simpler single-iteration version could be used, where the estimated bias of the MLE is subtracted rather than the bias at the unknown ̃:

They suggested that this approach could be extended from single-arm phase II trials to more general estimation problems involving designs with early stopping.

Chang et al. 80 and applied Whitehead's idea to group sequential designs with binary response data, while Tan and Xiong 140 proposed using the bias-adjusted MLE to estimate response rates in trials using fully sequential and group sequential conditional probability ratio tests. Li and DeMets 141 derived an exact analytical expression for the bias of the MLE for a group sequentially monitored Brownian motion process with a linear drift, and used it to construct a bias-adjusted estimator based on the arguments of Whitehead 67 . Todd et al. 117 studied the biasadjusted MLE further in simulations of group sequential designs with a normally-distributed endpoint using the O'Brien-Fleming or triangular test. Denne 142 postulated that the same bias-reduced MLE could also be used to estimate the treatment effect following a two-stage sample size re-estimation design on the basis of conditional power. Levin et al. 143 found that the biasadjusted MLE could be applied to a wider class of adaptive group sequential designs with predefined rules.

Whitehead 40 extended the idea of the bias-adjusted MLE to the estimation of secondary endpoints in sequential trials using two different approaches: conditional on the primary endpoint (in which case the MLE is unbiased) and unconditional. Liu et al. 92 further extended Whitehead's approach to estimate secondary probabilities after termination of a sequential phase II trial with response rate as the primary endpoint. The bias-adjusted MLE has also been proposed for a group sequential strategy to monitor toxicity incidents in clinical trials by Yu et al. 144 , for which they noted the bias of the usual MLE was "considerably large". Meanwhile, Flight 65 showed how to use the bias-adjusted MLE for health economic outcomes following a group sequential design.

For trials with treatment selection, early work by Coad 145 in the setting with two treatments and either a two or three-stage trial (where only one treatment is selected to be carried forward to the final stage), showed how to derive a bias-adjusted MLE. Finally, for response-adaptive trials, Coad 146 and Morgan 147 showed how to approximate the bias of the MLE, which then allows the construction of a bias-adjusted estimator following the idea of Whitehead 67 .

Troendle and Yu 118 developed methods for estimating treatment differences in group sequential designs with normally-distributed endpoints that condition on the stopping time T of the trial, thereby reducing the conditional bias, i.e. the discrepancy between the conditional expectation of the estimator and the parameter value:

They used a similar iterative approach as Whitehead 67 to compute the estimator. Coburger and Wassmer 148 took this one step further, by conditioning both on stopping time as well as the sample sizes of the stages up to that point; whereas Troendle and Yu had assumed equal sample sizes for all stages. They also considered adaptive group sequential designs, where the unconditional bias is no longer computable, but the conditional bias can be approximated. Tremmel 149 extended these considerations to stratified binomial and time-to-event endpoints. Coburger and Wassmer 150 suggested using a similar conditional bias-adjusted estimator not only when a trial stops but also during the course of an adaptive sample size re-estimation design, to prevent an overestimation of the sample sizes for subsequent stages.

Fan et al. 151 showed that the conditional bias-reduced estimator for group sequential designs as proposed by Troendle and Yu 118 is equivalent to a conditional MLE as well as to a conditional moment estimator. They also proposed two modified estimators which are hybrids of the conditional and unconditional MLE that have smaller variance and MSE. The similarity of the conditional MLE and the conditional bias-reduced estimator was also pointed out by Liu et al. 152 , who additionally derived the conditional MLE for secondary endpoints in a group sequential design and adaptive group sequential design. More general theoretical results for sequential trials focusing on the conditional MLE can be found in Molenberghs et al. 153 and Milanzi et al. 154, 155 . Meanwhile, Pepe et al. 95 and Koopmeiners et al. 106 considered conditional bias-adjusted estimation in the context of group sequential diagnostic biomarker studies with a single interim analysis for futility, conditional on not stopping early.

Shimura et al. 156 modified Troendle and Yu's conditional bias-reduced estimator for two-stage group sequential trials with normal endpoints by combining it with the shrinkage idea of Thompson 157 : in the calculation of the bias adjustment term, they replaced the MLE with a weighted average of the MLE at the interim analysis and a 'prior' estimate of the effect size, ̂ * =̂+ (1 − ) 0 with 0 the effect size assumed in the initial sample size calculation and shrinkage weights c determined as a function of the MLE as proposed by Thompson 157 .

Although the authors presented their idea in a frequentist framework, it can also be interpreted as Bayesian, with 0 the prior and ̂ * the posterior expectation.

Meanwhile, Cheng and Shen 158 derived an easily computable bias-adjusted estimator of the treatment effect in a 'self-designing' trial (where the sample size is sequentially determined to achieve a target power) with a normally-distributed endpoint 159 . They further proposed a modified estimator, involving a shrinkage factor, to reduce bias when block sizes are small. For 'self-designing' trials with censored time-to-event endpoints and group sequential sample size re-estimation 160 Moving on to trial designs with treatment selection, when multiple treatments are compared to a control and the most promising treatment is selected at the first interim analysis (with or without further early stopping opportunities at subsequent interim analyses), Stallard and Todd 162 developed a bias-adjusted estimator for the treatment effect. Kimani et al. 108 adapted this estimator for the difference between the selected and control treatments in seamless phase II/III designs with normally-distributed endpoints. Unlike Stallard and Todd's estimator, which aimed to be approximately unbiased conditional on the selected treatment, this new estimator aimed to be approximately unbiased conditional on both the selected treatment and continuation to the second stage. In the context of selecting the treatment dose with the highest response rate (using a normal approximation), Shen 163 also derived a stepwise bias overcorrection. When there are two treatments being compared with a control, Stallard et al. 164 proposed a family of approximately conditionally unbiased designs.

Brückner et al. 75 extended Stallard and Todd's bias-reduced estimator to two-stage multi-arm designs involving a common control and treatment selection at the interim analysis with timeto-event endpoints. Despite deriving analytical expressions for the selection bias of the MLE in the selected and stopped treatment arms, respectively, they experienced computational difficulties and convergence problems. They only defined the resulting bias-reduced estimator for the interim analysis, and proposed using a two-stage estimator for the final analysis, which is a weighted average of the bias-reduced estimator at the interim analysis and the MLE.

Luo et al. 165 used the method of conditional moments to derive a bias-adjusted estimator of the response rate following a two-stage drop-the-loser design with a binary endpoint. Adopting a stochastic process framework (as opposed to the conventional random variable viewpoint) for clinical trials with adaptive elements, Luo et al. 166 proposed approximating treatment effects based on a general estimating equation to match the first conditional moment. This generic approach can be used for a variety of adaptive designs with endpoints following any probability distribution. They illustrated it for two-stage treatment selection designs comparing multiple treatments against a control.

Kunzmann et al. 112 applied the conditional moment estimator of Luo et al. 165 to two-stage adaptive enrichment designs with a normally-distributed endpoint and binary biomarker with a pre-specified cut-off. Additionally, they studied two hybrid estimators, one between the UMVCUE and the conditional moment estimator, and the other one combining the UMVCUE and MLE. Kimani et al. 113 studied single-and multi-iteration bias-adjusted estimators for the treatment effect in the selected subpopulation of a two-stage adaptive threshold enrichment design allowing the cut-off of a predictive biomarker based on first stage data.

Finally, Bebu et al. 167 devised a bias-corrected conditional MLE for the mean of the selected treatment in two-stage designs and normally distributed endpoints. They focused on the case of two experimental treatments, but this was extended by Bebu et al. 168 to designs involving the selection of more than one treatment arm, selection rules based on both efficacy and safety, including a control arm, adjusting for covariates, and binomial endpoints. Bowden and Glimm 104 further extended the method to multi-stage drop-the-loser designs which must always proceed to the final stage (i.e. no early stopping rules).

Several Bayesian approaches to bias-reduced estimation in adaptive designs have been developed, typically for trials using frequentist frameworks. These differ from fully Bayesian adaptive designs in that they only use the accumulating data, rather than the posterior, to make decisions about adaptations (e.g. stopping or selecting arms) and rely primarily on classical frequentist inference such as hypothesis testing, but often require additional assumptions or information, such as the specification of a prior.

Hughes and Pocock 169 and Pocock and Hughes 170 proposed a Bayesian shrinkage method for estimating treatment effects in trials stopped at an interim analysis, whereby a prior quantifying the plausibility of different treatment effect sizes is specified at the outset of the trial and combined with the trial data using Bayes' rule, producing a posterior estimate of the treatment effect that is shrunk towards the median of the prior distribution. They acknowledged that this will be sensitive to the choice of prior and warned against using an overoptimistic prior as this would lead to little or no shrinkage, saying that their method "works best in the hands of prior specifiers who are realists by nature" 169 .

Hwang 171 proposed using Lindley's 172 shrinkage estimator for estimating the mean of the best treatment in single-stage multi-arm designs with 4 or more study treatments and a normallydistributed endpoint. This was extended by Carreras and Brannath 74 to multi-arm two-stage 'drop-the-losers' designs. Their two-stage version is a weighted mean of Lindley's estimator for the first stage and the MLE of the selected treatment for the second stage. It shrinks the MLE towards the mean across all arms, but only works with a minimum of 4 study treatment arms. For designs with 2 or 3 study treatments, they replaced Lindley's estimator with the best linear unbiased predictor of a random effects model for the first stage estimator. This method also allows for bias-reduced estimation of the second best treatment mean.

Of note, Lindley's estimator can be viewed as an empirical Bayes estimator of the posterior mean of the posterior distribution of a specific treatment mean given its MLE. This fact was exploited by Bowden et al. 173 , who adapted Carreras and Brannath's method and replaced their two-stage estimator with a single shrinkage equation for both stages. This, however, adds both practical and theoretical complications, necessitating further modifications such as estimating a between-arm heterogeneity parameter. The authors suggest this approach could be applied to other MAMS designs.

Along the lines of Hwang's approach 171 , Brückner et al. 75 derived two shrinkage estimators for two-stage multi-arm designs with a common control, treatment selection at the interim analysis, and time-to-event endpoints. They both shrink the MLE of the selected treatment towards the overall log-HR. The authors also defined a two-stage estimator similar to that of Carreras and Brannath 74 , as well as a bias-corrected Kaplan-Meier estimator for the selected treatment.

Bunouf and Lecoutre 174 derived two Bayesian estimators for multi-stage designs with binary endpoints where after every stage a decision is made whether to stop or continue recruiting: one estimator is the posterior mean based on a 'design-corrected' prior, the other the posterior mode based on a similarly corrected uniform prior, and for the latter they also provided an easy-to-use approximation for two-stage designs. Meanwhile, Kunzmann et al. 112 developed two empirical Bayes estimators (one using Carreras and Brannath's two-stage approach, one using Bowden et al.'s single-equation approach) for two-stage adaptive enrichment designs with a single pre-specified subgroup and pre-specified decision rule based on a binary biomarker and a normally-distributed endpoint with known variance. Kimani et al. 113 extended this work to adaptive threshold enrichment designs where the biomarker cut-off is determined based on first stage data.

Finally, Kunzmann and Kieser 99 developed an estimator for two-stage single-arm trials with a binary endpoint with sample size adjustment and early stopping for futility or efficacy. This estimator minimises the expected MSE subject to compatibility with the test decision and monotonicity conditions, and can be viewed as a constrained posterior mean estimate based on the non-informative Jeffrey's prior (although this can be replaced with an informative prior).

The methods for unbiased and biased-reduced estimation considered in the previous subsections require explicit formulae based on the population response model as well as the trial design and adaptations. For example, UMVUEs and UMVCUEs can often be given as closed-form expressions, while for MUEs a p-value function is often specified. Meanwhile, bias-reduced estimators tend to have explicit expressions for an estimate of the bias, leading to equations that can be solved numerically.

An alternative class of methods is based on resampling procedures, where the trial data is resampled or generated via a parametric bootstrap a large number of times, and the resulting trial replicates can then be used to give empirical estimates of the bias. One advantage of these methods is that essentially the same procedure can be used under a variety of different stopping rules and trial designs, including complex designs as detailed below.

In the group sequential setting, Wang and Leung 175 showed how to use a parametric bootstrap to calculate a bias-corrected estimate for the mean of normally-distributed endpoints with either known or unknown variance. This methodology was generalised by Leung and Wang 176 , who proposed a generic stochastic approximation approach based on a parametric bootstrap, which can be used for non-iid data and a large class of trial designs, including sequential ones. Cheng and Shen 177 proposed a variant of the Wang and Leung 175 procedure for complex group sequential monitoring rules that assess the predicted power and expected loss at each interim analysis. Magnusson and Turnbull 17 proposed a repeated bootstrap procedure for group sequential enrichment designs incorporating subgroup selection. Meanwhile, Pinheiro and DeMets 178 described a simulation approach to estimate the bias of the MLE and then proposed constructing a bias-reduced estimator for the treatment difference, although this is equivalent to Whitehead's bias-adjusted MLE 67 Kunzmann et al. 112 carried out further work in the context of adaptive enrichment trials. In the two-stage trial setting with a binary biomarker and normally-distributed endpoints, they considered a number of alternatives to the MLE, including a parametric bootstrap procedure. More generally, Simon and Simon 179 considered multi-stage enrichment designs that develop model-based predictive classifiers based on multiple markers, and showed how to use a parametric bootstrap method for bias correction for generic response distributions. Finally, in trials with treatment selection, Pickard and Chang 180 considered parametric bootstrap corrections for a two-stage drop-the-losers design, for normally and binomially distributed endpoints.

Thus far, all the methods presented above are fully parametric, i.e. assuming that the trial outcome data comes from a probability distribution with a fixed set of parameters. An alternative approach is to use non-parametric bootstrap procedures to correct for bias. An early example of this approach was given by Leblanc and Crowley 181 , who proposed a nonparametric bootstrap in the context of group sequential designs with censored survival data using log rank testing. Subsequently, Rosenkranz 182 described using non-parametric bootstrap to correct the bias of treatment effect estimates following selection (e.g. selecting the treatment with the maximum effect). More recently, Whitehead et al. 183 proposed an estimation approach based on the method of Rao-Blackwellisation (i.e. targeting unbiased estimation), but used a non-parametric bootstrap procedure to recreate replicate observations in a complicated group sequential trial setting comparing multiple treatments with a control.

In this section, we summarise the results of our semi-systematic review by classifying them according to the broad class of adaptive design used. For each design class, we give references to the relevant literature for the different types of estimators that have been proposed. We also give a summary of general pros and cons of the different estimators and the comparisons between them as given in the literature. Finally, we point out examples of the use of different estimators on trial data, as well as where software/code is available. Note that this list of software/code is by no means exhaustive, but only includes those that were specifically mentioned in the reviewed methodological literature or those that the authors were already aware of.

Group sequential Mean-unbiased estimation Chang et al. (1989) 80 , Kim (1989) 82 , Emerson and Fleming (1990) 83 , Emerson and Kittelson (1997) 84 Kim (1988 Kim ( , 1989 82 67 , Chang et al. (1989) 80 , Tan and Xiong (1996) 140 , Todd et al. (1996) 117 , Li and DeMets (1999) 141 

Two-stage designs Cohen and Sackrowitz (1989) 101 , Tappin (1992) 102 , Glimm (2008, 2014) 103 

Mean-unbiased estimation Bowden and Trippa (2017) 100 Naive unbiased estimator has a large MSE, and approaches to improve this can be very computationally intensive

Bias-reduced 146 , Morgan (2003) 147 No comparisons given None 

To illustrate the practical application of some of the estimators reviewed in Section 3, we use the phase III MUSEC (MUltiple Sclerosis and Extract of Cannabis) trial, as described by Bauer et al. 186 and Zajicek et al. 187 . This is an example of a two-stage group sequential design where the trial continued to the second stage. The MUSEC trial investigated the effect of a standardised oral cannabis extract (CE) on muscle stiffness for adult patients with stable multiple sclerosis. The primary endpoint was a binary outcome -whether or not a patient had 'relief from muscle stiffness' after 12 weeks of treatment (based on a dichotomised 11 point category rating scale).

A two-stage group sequential design with O'Brien-Fleming (OBF) efficacy stopping boundaries 70 was used, with a pre-planned maximum total sample size of 400 subjects (200 per arm) and an unblinded interim analysis planned after 200 subjects (100 per arm) had completed the 12 week treatment course. In the actual trial, an unblinded sample size re-estimation based on conditional power considerations 186 was also carried out at the interim analysis, which reduced the total sample size from 400 to 300. For simplicity and the purpose of illustrating the calculation of a larger range of adjusted estimators, we ignore this sample size re-estimation in what follows. If we were to take into account the sample size re-estimation then the methods for adaptive group sequential designs would apply (see Sections 3.2 and 3.3), but only median unbiased estimators are available in that setting. Table 2 below summarises the observed data from the trial at the interim and final analyses, as well as the Wald test statistics and the OBF stopping boundaries. As can be seen, at the interim analysis the boundary for early rejection of the null hypothesis (no difference in the proportion of subjects with relief from muscle stiffness between treatment arms) was almost reached, with the Wald test statistic being close to the stopping boundary. 

Using the observed data from the MUSEC trial, we now demonstrate how to calculate various unbiased and bias-adjusted estimators for the treatment difference, from both a conditional and unconditional perspective. More formally, letting pCE and p0 denote the response probability for patients on CE and the placebo respectively, we consider estimators of θ = pCE -p0. R code to obtain these estimators is provided in the supporting materials.

The standard estimator for the treatment difference, i.e. the overall MLE, is given by ̂=̂−̂0 , where ̂ and 0 are the observed proportions of successes on the CE and placebo arms respectively. In what follows, it is also useful for illustrative purposes to consider the MLE calculated just using the interim data (stage 1 data), denoted ̂1 , as well as the MLE calculated just using the stage 2 data (i.e. only the data from after the interim analysis), denoted ̂2 . These estimators are inefficient (and potentially unethical) since they 'discard' patient data, so we are not recommending that these are used in practice.

From an unconditional perspective, we want to estimate θ regardless of the stage that the trial stops, and are interested in the bias as averaged over all possible stopping times, weighted by the respective stage-wise stopping probabilities. More formally, letting T be the random variable denoting the stage that the trial eventually stops, we define the unconditional bias of an estimator ̂ as In the two-stage trial setting, Emerson 188 presented an analytical expression for this bias of the overall MLE, which depends on the unknown value of θ:

where e denotes the efficacy stopping boundary, I1 and I2 denote the information at stage 1 and stage 2 respectively and ϕ denotes the pdf of a standard normal distribution. Following Whitehead 67 and as described in Section 3.2.1, we can use this expression to calculate an unconditional bias-corrected MLE ̃ (UBC-MLE), which is the solution of the equation ̃=̂− (̃), where ̂ is the observed value of the overall MLE.

Alternatively, as mentioned in Section 3.1.1, the UMVUE can be calculated by using the Rao-Blackwell technique on the stage 1 MLE ̂1 , which is unconditionally unbiased. More formally, the UMVUE in our context is given by E[̂1| (T = 2, ̂) ] and the closed-form expression can be found in Appendix A.3.

A median unbiased estimator (MUE) can also be calculated, as described in Section 3.1.2. This depends on a choice of the ordering of the sample space with respect to evidence against H0.

In what follows, we use stagewise ordering, which has desirable properties described by Jennison and Turnbull 189 . This allows us to define a p-value function P(θ) and find the MUE, which is the solution to the equation P(̂) = 0.5. The formula for the p-value function is given in Appendix A.3.

From a conditional perspective, we are interested in estimation conditional on the trial continuing to stage 2. We define this conditional bias of an estimator ̂ as [ ̂ | = 2] − . In the context of group sequential trials, as argued by several authors 118, 130, 151, 190, 191 , the conditional bias of an estimator is also an important consideration: given that the study has in fact stopped with T = 2, we can use this knowledge in our bias calculations. As well, while the unconditionally unbiased estimators are unbiased overall, they tend to overestimate the treatment effect when there is early stopping and underestimate the effect when the trial continues to the end. The authors of the present paper see value in both the conditional and unconditional perspective. As the unconditional estimators tend to be biased once the stopping reason is known they do, however, have a slight preference for conditional estimators. However, there is no consensus in the literature and we provide a few example quotations illustrating this in Appendix A.3.

We can calculate an analytical expression for the conditional bias of the overall MLE, which is given below and again depends on the unknown true parameter θ:

where Φ represents the cdf of a standard normal distribution.

Using this expression, we can calculate a conditional bias-corrected MLE ̃ (CBC-MLE), which is the solution of the equation ̃=̂− (̃). As well, the UMVCUE can be calculated by again using the Rao-Blackwell technique, but this time applied to the stage 2 MLE ̂2 , which is conditionally unbiased. More formally, the UMVCUE is given by E[̂2| (T = 2, ̂) ], and the closed-form expression can be found in Appendix A.3. Table 3 gives the values of all of the estimators described above, calculated using the observed data and OBF stopping boundaries from the MUSEC trial. For illustration purposes, we also calculated the standard error (SE) for all estimators using a parametric bootstrap approach assuming that the true (unknown) difference in proportions is equal to 0.14, which should be treated with caution, as they will vary depending on this key assumption. 6 replicates, assuming that the true difference in proportions is equal to 0.14.

The overall MLE is 0.1370 (with a standard error of 0.054), and is the comparator for all the other estimators in Table 3 . Starting with the unconditionally unbiased and bias-adjusted estimators, the stage 1 MLE is slightly larger (0.1436), but this is based on only the stage 1 data and hence is slightly inefficient: it has an information fraction of 0.795 and a standard error of 0.057. The MUE, UBC-MLE and UMVUE are all slightly lower than the MLE, although they are all within 0.01 in absolute terms (i.e. within 7% in relative terms). This downward correction is intuitivewe would expect the MLE to overestimate the magnitude of θ averaged over the possible stopping times: if ̂1 is sufficiently larger than θ, the trial stops with T = 1 and the MLE equals ̂1 , whereas if ̂1 is lower than θ by a similar amount, the trial can continue, allowing the stage 2 data to reduce the negative bias of the overall MLE. The standard errors of these estimators are all very similar to that of the MLE under the assumption of a true difference in proportions of 0.14, perhaps reflecting the small corrections to the MLE. Finally, we see that MUE > UBC-MLE > UMVUE, which could reflect the fact that the MUE is not mean-unbiased, and the UBC-MLE will also be expected to have residual mean bias as it is not exactly unbiased.

Moving on to the conditionally unbiased and bias-adjusted estimators, the stage 2 MLE is substantially lower (0.1139) than the overall MLE (and indeed any of the other estimators conditional or unconditional). However, the information fraction for stage 2 was only 0.205 ignoring the 0.795 from stage 1, and hence this estimator has a substantially higher variability with a (conditional) standard error of 0.111. Both the CBC-MLE and the UMVCUE are noticeably larger than the overall MLE (in relative terms 39% and 26% larger respectively). An upward correction is intuitive from a conditional perspective: there is downward selection pressure on the stage 1 MLE ̂1 , since if ̂1 is sufficiently larger than θ then the trial does not continue to stage 2. Given that the stage 1 MLE was almost large enough for the Wald test statistic to cross the OBF stopping boundary (note that on the difference of proportion scale, the OBF stopping boundary at stage 1 was 0.1581), the relatively large correction to the overall MLE is not too surprising. Both estimators have substantially lower (conditional) standard error than the stage 2 MLE, since they are utilising all of the trial data. However, the conditional standard errors are larger than the unconditional ones. This is because unconditionally, most of the simulated trial replicates will stop at stage 1 and hence will simply be equal to the stage 1 MLE. Finally, we see that CBC-MLE > UMVCUE, which again could reflect the residual mean conditional bias in the CBC-MLE.

In summary, we have seen that the use of different estimators can give noticeably different values for the estimated treatment effect, particularly when considering a conditional versus unconditional perspective. This could influence the interpretation of the trial results in certain cases, and highlights the importance of pre-specifying which estimator(s) will be reported following an adaptive design. The choice of estimator(s) will depend on what the researchers wish to achieve regarding the estimand in question. There will be pros and cons for the different estimators, one key example being the bias-variance tradeoff. We explore these issues further in the following section. Finally, we note that there is a strong link between design and estimation -the estimated values above depend on the design of the trial, and would be different if (for example) the design had also included futility stopping boundaries.

In this section, we give guidance on the choice of estimators and the reporting of estimates for adaptive designs. This builds on the relevant parts of the FDA guidance for adaptive designs 23 and the Adaptive designs CONSORT Extension 2,3 . The issue of estimation and potential bias should be considered throughout the whole lifecycle of an adaptive trial, from the planning stages to the final reporting and interpretation of the results. Indeed, the design and analysis of an adaptive trial are closely linked, and one should not be considered without the other. In what follows, our focus is on the confirmatory setting where analyses are fully pre-specified, but some of the principles can also apply to more exploratory settings, particularly around the choice of estimators and the final reporting of trial results.

The context, aims and design of an adaptive trial should all inform the analysis strategy used, which includes the choice of estimators. These decisions should not only be left to trial statisticians, but also discussed with other trial investigators to ensure that it is consistent with what they want to achieve. Firstly, it is necessary to decide on what exactly is to be estimated (that is, the estimands of interest). Secondly, the desired characteristics of potential estimators should be decided. Two key considerations are as follows:

• Conditional versus unconditional perspective: The choice of whether to look at the conditional or unconditional bias of an estimator will depend on the trial design. For example, in a drop-the-losers trial where only a single candidate treatment is taken forward to the final stage, a conditional perspective reflects the interest being primarily in estimating the effect of the successful candidate. On the other hand, for group sequential trials, the unconditional perspective is recognised as being an important consideration (see the discussion in Appendix A.3).

• Bias-variance trade-off: As discussed throughout the paper, typically there will be a tradeoff between the bias and variance of different estimators. Depending on the context and aims of the trial, different relative importance may be given to the two. For example, in a phase II trial where a precise estimate of the treatment effect is needed to inform a followup confirmatory study, the variance of an estimator may be of greater concern, whereas in a definitive phase III trial an unbiased estimate of treatment effect is key for real-world decision-making, as discussed in Section 2.1. One proposal given in the literature is to use the MSE as a way of encompassing both bias and variance.

Potentially different criteria will be needed for different outcomes, e.g. when considering primary and secondary outcomes, which may then lead to using different estimators for different outcomes. As well, in some trial settings such as multi-arm trials (and drop-the-loser designs) where more than one arm reaches the final stage, the bias of each arm could be considered separately, but there may also be interest in calculating e.g. the average bias at the across all arms that are selected. In any case, once criteria for assessing estimators have been decided, the next step is to find potential estimators that can be used for the trial design in question. Section 3.3 is a starting point to find relevant methodological literature and code for implementation.

For more commonly-used adaptive designs, a review of the literature may be sufficient to compare the bias and variance of different estimators. Otherwise, we would recommend conducting simulations to explore the bias and variance of potential estimators given the adaptive trial design. In either case, we recommend assessing the estimators across a range of plausible parameter values and design scenarios, taking into account important factors such as the probability of early stopping or reaching the final stage of a trial. More generally, any simulations should follow the relevant FDA guidelines regarding simulation reports for adaptive designs 23(pp28-29) , see also guidance by Mayer et al. 192 .

The simulation-based approach can also be used when there are no proposed alternatives to the standard MLE for the trial design under consideration. Even in this setting, we would still encourage an exploration of the bias properties of the MLE. If there is a potential bias of a nonnegligible magnitude, then this can impact how the results of the trial are reported (see Section 5.4).

The statistical analysis plan (SAP) and health economic analysis plan (HEAP) should include a description of the estimators that are planned to be used to estimate treatment effects of interest when reporting the results of the trial, and a justification of the choice of estimators based on the investigations conducted during the planning stage. This reflects the FDA guidance 23(p28) , which states that there should be "prespecification of the statistical methods that will be used to […] As an example of what this looks like in practice, we point the reader to the TAO (Treatment of Acute Coronary Syndromes with Otamixaban) trial as described by Steg et al. 193 , particularly Appendix B, Section 9. There, the authors consider the MLE and a median-unbiased estimator (MUE), and explore the bias and MSE via simulations. They conclude that the MUE has a "consistently smaller" bias with no "noticeable difference in terms of MSE". Therefore, they propose to use the MUE as the point estimator in their trial.

When presenting interim results to DMCs, the issue of potential bias should also be considered. We would recommend that the sensitivity of the standard MLE (based on the interim data) to potential bias should be reported, for example based on simulations conducted during the planning stage. As recommended by Zhang et al. 190 and Shimura et al. 76 , when unbiased or bias-reduced estimators are available, these should also be presented to the DMC, as an additional tool for appropriately considering potential bias in the decision-making process of whether to stop a trial early (or to perform other trial adaptations such as modifying the sample size).

When reporting results following an adaptive design, there should be a clear description of the "statistical methods used to estimate measures of treatment effects" 2(p16) . Hence, when unbiased or bias-adjusted estimators are used this should be made clear, as well as any underlying assumptions made to calculate them (for example, being unbiased conditional on the observed stopping time). As reflected in the ACE guidance 2(p16) , "when conventional or naïve estimators derived from fixed design methods are used, it should be clearly stated" as well.

The FDA guidance on adaptive designs 23(p30) states that "treatment effect estimates should adequately take the design into account". Hence, we reiterate that adjusted estimators taking the trial design into account are to be preferred if available. The FDA guidance goes on to say that "if naive estimates such as unadjusted sample means are used, the extent of bias should be evaluated, and estimates should be presented with appropriate cautions regarding their interpretation" 23(p30) . Similarly, the ACE guidelines encourage researchers to discuss "Potential bias and imprecision of the treatment effects if naive estimation methods were used".

These discussions would naturally link back to the planning stage literature review and/or simulations (which could potentially be updated in light of the trial results and any unplanned adaptations that took place), taking into account important factors such as the probability of early stopping and plausible values of the unknown true treatment effect. For example, if the potential bias of the MLE is likely to be negligible, this would be a reassuring statement to make. On the other hand, in a setting where no adjusted estimators exist and there is the potential for non-negligible bias in the MLE, a statement flagging up this potential concern would allow appropriate caution to be taken when using the point estimate to inform clinical or policy decisions, future studies or meta-analyses.

If an unbiased or bias-adjusted estimator is also reported (as specified in the SAP for a confirmatory study), then it is useful to look at the similarity with the MLE. If the two estimates are very close to each other, then it is reassuring that the trial results appear to be somewhat robust to the estimation strategy used. Conversely, if the two estimates are substantially different, then this may indicate that more care is needed in interpreting the trial results and when using the point estimates for decision-making or further research. However, we caution that the observed difference between the MLE and an unbiased (or bias-adjusted) estimate is not necessarily a precise measure of the actual bias in the observed MLE. Firstly, the bias of the MLE depends on the true underlying treatment effect, which is unknown. Secondly, an unbiased estimator is only unbiased on average, and not necessarily in any particular trial realisation. To this end, a potentially more transparent way of reporting results is to show the plausible extent of bias of the MLE via a graphical representation for a practically reasonable range of the true treatment effect, in addition to considering the corresponding probabilities of stopping (or reaching the final stage). Again, this would build on the planning stage review and simulations.

Finally, the reporting of appropriate measures of uncertainty for the estimators, such as confidence or credible intervals, is also important. If methods exist for constructing confidence intervals associated with the adjusted estimator, then clearly these can be reported. However, for many adjusted estimators it is not clear how to construct valid confidence intervals, and hence the 'standard' confidence interval for the MLE may be the only one available. We briefly return to this issue in the Discussion.

Motivated by recent FDA guidance on estimation after adaptive designs, this paper critically assesses how bias can affect standard MLEs, the negative effects this can have, and the potential solutions that exist. We found that there is a growing body of methodological literature proposing and evaluating a range of unbiased and bias-adjusted estimators for a wide variety of adaptive trial designs. However, there has been little uptake of adjusted estimators in practice, with the vast majority of trials continuing to only report the MLE (if indeed it is made clear which estimation method is being used at all). There are a variety of reasons why this may be the case. Firstly, there is a common belief that the bias of the MLE will typically be negligible in realistic trial scenarios. This assumption is sometimes made without supporting evidence such as simulation studies for a variety of trial contexts. In theory, the bias can be very large in certain scenarios 194 . However, as discussed throughout, the magnitude of the bias depends on the type of design and whether we are interested in conditional or unconditional bias. Hence, this issue needs to be carefully considered for the specific trial in question. Secondly, there is perhaps a lack of awareness of the range of different unbiased and bias-adjusted estimators that exist in the methodological literature. Linked with this, statistical software and code to easily calculate adjusted estimators is relatively sparse, which is an obstacle to the uptake of methods in practice even if they exist. It also remains the case that for more complex or novel adaptive designs, adjusted estimators may not exist.

It is our hope that this paper will encourage the increased use and reporting of adjusted estimators in practice for trial settings where these are available. As described in our guidance section, estimation issues should be considered in the design stage of an adaptive trial. The paper by Bowden and Wason 195 is a good example of how this can be done in a principled way for two-stage trials with a binary endpoint. More generally, the estimation strategy should take the design of the trial into account, which motivates the use of adjusted estimators. In terms of trial reporting, statements about the potential bias of the reported estimates can indicate where more care is needed in interpretation of the results and the use of the point estimates for further research including evidence synthesis and health economic analyses.

From a methodological perspective, there remain important open questions regarding estimation after adaptive trials. In general, an estimator for a parameter of interest should ideally have all of the following properties:

1. Adequately reflects the adaptive design used; 2. No or small bias; 3. Low MSE (reflecting a favourable bias-variance trade-off); 4. A procedure for calculating an associated confidence interval that: a. Has the correct coverage; and b. Is consistent / compatible with the hypothesis test decisions (including early stopping); 5. Is easily computable.

Constructing an overarching framework for estimation after adaptive designs that has all these properties would be very useful, although challenging. Some initial steps in this direction have been taken by Kunzmann and Kieser 99, 196 in the context of adaptive two-stage single-arm trials with a binary endpoint. Even if such a framework is not feasible more generally, the issue of constructing confidence intervals remains an important question that has so far received less attention in the literature than point estimation.

Finally, to improve the uptake of unbiased and bias-adjusted estimators in practice, there is the need for the further development of user-friendly software and code to allow straightforward calculation of trial results and to aid in simulations. Ideally, the calculation of adjusted estimators could be added to existing widely used software for adaptive trial design and analysis. Otherwise, there is scope for stand-alone software packages or code (such as that provided for our case study) focusing on estimation after adaptive designs, particularly with simulation studies in mind.

A total of 654 records were retrieved and screened for MAMS trials. Only 11/654 (1.7%) reported results. These articles were supplemented by an additional 7 trials from related work 2 within the same search period. As a result, we reviewed 18 eligible MAMS trials; phase II (n=10), phase II/III (n=6), and phase III (n=2). All trials had at least one treatment arm that was dropped early but the trial continued beyond the first interim analysis and 88.9% (16/18) used frequentist methods. Only 1/18 (5.9%) of the trials reported a bias-adjusted point estimate of the treatment effect 197 .

Forty-five records were retrieved that used a RAR design; of which 14 were randomised trials. Of these 14, 10 were reporting interim (n=3) and final results (n=7); phase II (n=8), phase I/II (n=1), and phase III (n=1). Of the 3 that reported interim results, 1 was stopped early for futility, 1 had a treatment that graduated (i.e. was declared successful) for further evaluation at phase III, and the remaining was stopped early for efficacy. Most of the trials (80.0%, 8/10) used Bayesian methods. None of the trials used bias correction methods.

Only 2 of the screened 417 records were adaptive enrichment trials that used frequentist methods; however, one was a protocol and the other was a methodological publication. There were an additional 2 known trials not retrieved in the search that also did not use any bias correction methods.

In summary, across the adaptive designs considered, unbiased or bias-adjusted treatment effect estimates are currently rarely used and reported.

We conducted a "title, abstract, keywords" search using Scopus on 18th March 2021, with the following search terms of each category of estimator.

Unbiased ((((unbiased OR ((conditionally OR mean OR median OR "uniformly minimum variance" OR "uniformly minimum variance conditionally") AND unbiased)) AND estim*) OR "median adjusted" OR umvcue OR umvue) AND ( "adaptive design" OR "adaptive trial" OR "adaptive clinical trial" OR "group-sequential")) Bias-reduced ((("bias-reduced" OR "bias-adjusted" OR "conditional moment" OR "empirical Bayes" OR shrinkage) AND estimat*) OR ((corrected OR adjusted OR conditional) AND (mle OR "maximum likelihood"))) AND ("adaptive design" OR "adaptive trial" OR "adaptive clinical trial" OR "group-sequential")

Resampling-based methods (((bootstrap OR nonparametric OR non-parametric OR resampling OR "distribution free" OR distribution-free OR jackknife OR "Monte Carlo") AND estim*) AND ("adaptive design" OR "adaptive trial" OR "adaptive clinical trial" OR "group-sequential")) Flow diagram where Z2 is the (observed) overall Wald test statistic.

Since the trial continued to the second stage, the p-value function (using stage-wise ordering) is as follows:

( ) = ∫ ∫ 2 (( 1 , 2 ), ( √ 1 , √ 2 ), ( 1 √ 1 / 2

where 2 ( , , ) is the density of a bivariate normal distribution with mean and covariance matrix .

The UMVCUE is given by 

Below we give some quotations from the literature focusing on the issue of the conditional vs unconditional perspective in the context of group sequential designs. Troendle and Yu (1999) 118(pp1617-1618) :

"Suppose a group sequential clinical trial is undertaken to determine the effect of an experimental drug on the state of a certain disease. Now suppose it is known that the trial was stopped at the first interim analysis because of treatment efficacy, and that the estimated treatment effect was X1 -Y1, the difference in sample means from the two groups… Is X1 -Y1 a reasonable estimate of the effect size? Although X1 -Y1 is unbiased, the general estimator XT -YT, where T is the random stopping time, is known to be biased. Recently, an unbiased estimator ... and an essentially unbiased estimator ... have been developed for this problem. However, as will be shown later, these methods remain unbiased by overestimating the effect when there is early stopping while underestimating the effect when the trial stops later. The overall effect is an unbiased estimator, but does that leave the scientist, who knows T = 1 any happier? We propose conditioning on the stopping time in a group sequential trial to reduce the discrepancy between the conditional expectation of the estimator and the parameter value." "We also note that the bias referred to is the marginal or overall bias [i.e. the unconditional bias]. As much as the importance of the marginal bias, sometimes we will also face the question of what the potential bias is given the fact that the study is already stopped at this time, especially when it is a very early interim stage. To answer this question, we feel it is more relevant to investigate the bias conditioning on the actual stopping time. In this paper, we focus on the angle of the conditional bias and in the meanwhile also keep in mind the marginal bias."

"The conditional method is not meant to replace the unconditional methods because these two methods are developed to address different issues. Instead it is proposed as an addition and alternative means that we can take advantage of when the conditional bias is more concerning, rather than a replacement to the unconditional methods." "Although this article focuses on the bias conditional on the observed stopping time, we also recognize the importance of the marginal or unconditional bias … Evaluation of the unconditional bias is particularly helpful in the trial design stage; however, there is also value in assessing the potential bias given that the trial has already stopped (conditional bias), especially on the basis of a very early interim analysis. Fan and colleagues found that the conditional bias may be quite serious, even in situations in which the unconditional bias is acceptable 151 . Most of the available adjustment methods focus on the unconditional bias, which has little effect on the conditional bias."

Schoenbrot and Wagenmakers (2018) 191(p140) :

"Although sequential designs have negligible unconditional bias, it may nevertheless be desirable to provide a principled "correction" for the conditional bias at early terminations, in particular when the effect size of a single study is evaluated." 130(p2068) :

"A reduction in conditional bias is as important as a reduction in overall bias because, in practice, researchers can only obtain an estimate that is conditional on the stopping stages."

Adaptive designs in clinical trials: why use them, and how to run and report them

The Adaptive designs CONSORT Extension (ACE) statement: a checklist with explanation and elaboration guideline for reporting randomised trials that use an adaptive design

The adaptive designs CONSORT extension (ACE) statement: a checklist with explanation and elaboration guideline for reporting randomised trials that use an adaptive design

Adding flexibility to clinical trial designs: an example-based guide to the practical use of adaptive designs

Barriers and opportunities for implementation of adaptive designs in pharmaceutical product development

Adaptive designs undertaken in clinical research: a review of registered clinical trials

Adaptive Designs for Clinical Trials

Adaptive design clinical trials: a review of the literature and ClinicalTrials.gov

Efficient Adaptive Designs for Clinical Trials of Interventions for COVID-19

Clinical Trials Impacted by the COVID-19 Pandemic: Adaptive Designs to the Rescue?

Evaluation of Experiments with Adaptive Interim Analyses

Posch M, Bauer P. Adaptive Two Stage Designs and the Conditional Error Function

Adaptive group sequential designs for clinical trials: combining the advantages of adaptive and of classical group sequential approaches

A generalized Dunnett test for multi-arm multi-stage clinical studies with treatment selection

Novel designs for multi-arm clinical trials with survival outcomes with an application in ovarian cancer

Group sequential enrichment design incorporating subgroup selection

Design and estimation in clinical trials with subpopulation selection

Familywise error control in multi-armed responseadaptive trials

Response-adaptive randomization in clinical trials: from myths to practical considerations. ArXiv200500564 Stat

Re-calculating the sample size in internal pilot study designs with control of the type I error rate

Sample Size Reassessment and Hypothesis Testing in Adaptive Survival Trials

Adaptive Designs for Clinical Trials of Drugs and Biologics: Guidance for Industry

Group Sequential and Confirmatory Adaptive Designs in Clinical Trials

Simultaneous confidence intervals that are compatible with closed testing in adaptive designs

Bias, Operational Bias, and Generalizability in Phase II/III Trials

The effect of publication bias magnitude and direction on the certainty in evidence

Bias Control in Randomized Controlled Clinical Trials

Controlling Bias in Randomized Clinical Trials

Selection and bias -Two hostile brothers

An Investigation of the Shortcomings of the CONSORT 2010 Statement for the Reporting of Group Sequential Randomised Controlled Trials: A Methodological Systematic Review

The dangers of stopping a trial too early

Stopping randomized trials early for benefit and estimation of treatment effects: systematic review and meta-regression analysis

Reflections on meta-analyses involving trials stopped early for benefit: is there a problem and if so, what is it?

Quantifying the bias in the estimated treatment effect in randomized trials having interim analyses and a rule for early stopping for futility

Randomised trials with provision for early stopping for benefit (or harm): The impact on the estimated treatment effect

Quantifying over-estimation in early stopped clinical trials and the "freezing effect" on subsequent research

Stopping trials for efficacy: an almost unbiased view

Ethical issues in stopping randomized trials early because of apparent benefit

Supplementary analysis at the conclusion of a sequential clinical trial

The Design and Analysis of Sequential Clinical Trials

PROTECT Benefit-Risk Group. Recommendations for benefit-risk assessment methodologies and visual representations

Benefit-risk evaluation: the past, present and future

Effect of intramuscular vitamin E on mortality and intracranial hemorrhage in neonates of 1000 grams or less

Vitamin E supplementation for prevention of morbidity and mortality in preterm infants

Meta-Analysis of Controlled Clinical Trials

Randomized Trials Stopped Early for Benefit: A Systematic Review

The impact of stopping rules on heterogeneity of results in overviews of clinical trials

Systematic reviewers neglect bias that results from trials stopped early for benefit

The Importance of Considering Differences in Study Design in Network Meta-analysis: An Application Using Anti-Tumor Necrosis Factor Drugs for Ulcerative Colitis

Incorporation of sequential trials into a fixed effects meta-analysis

Stopping at nothing? Some dilemmas of data monitoring in clinical trials

Systematic reviews are not biased by results from trials stopped early for benefit

Bias and trials stopped early for benefit

Meta-analysis of clinical trials with early stopping: an investigation of potential bias

Sensitivity analyses assessing the impact of early stopping on systematic reviews: Recommendations for interpreting guidelines

Assessing the quality of studies in meta-research: Review/guidelines on the most important quality assessment tools

GRADE guidelines: 4. Rating the quality of evidence -study limitations (risk of bias)

The Inclusion of Economic Endpoints as Outcomes in Clinical Trials Reported to ClinicalTrials.gov

Methods for the Economic Evaluation of Health Care Programmes

Underestimation of treatment effects in sequentially monitored clinical trials that did not stop early for benefit

The GUSTO investigators. An International Randomized Trial Comparing Four Thrombolytic Strategies for Acute Myocardial Infarction

Cost Effectiveness of Thrombolytic Therapy with Tissue Plasminogen Activator as Compared with Streptokinase for Acute Myocardial Infarction

The use of health economics in the design and analysis of adaptive clinical trials

A Review of Clinical Trials With an Adaptive Design and Health Economic Analysis

On the bias of maximum likelihood estimation following a sequential test

Impact of lack-of-benefit stopping rules on treatment effect estimates of two-arm multi-stage (TAMS) trials with time to event outcome

Interpretation of a leukemia trial stopped early

A Multiple Testing Procedure for Clinical Trials

Stopping clinical trials early for benefit: impact on estimation

Repeated assessment of results in clinical trials of cancer treatment

Drop-the-Losers Design: Binomial Case

Shrinkage estimation in two-stage adaptive designs with midtrial treatment selection

Estimation in multi-arm two-stage trials with treatment selection and time-to-event endpoint

Assessment of Hazard Ratios in Oncology Clinical Trials Terminated Early for Superiority: A Systematic Review

Stopping a trial early -and then what?

A systematic review of the reporting of Data Monitoring Committees' roles, interim analysis and early termination in pediatric clinical trials

Statistical Inference

The bias of the sample proportion following a group sequential phase II clinical trial

On the estimation of the binomial probability in multistage clinical trials: estimation of binomial probability

Point Estimation Following Group Sequential Tests

Parameter estimation following group sequential hypothesis testing

A computationally simpler algorithm for the UMVUE of a normal mean following a group sequential trial

Unbiased estimation following a group sequential test

Estimation following a group sequential test for distributions in the one-parameter exponential family

Completeness and unbiased estimation of mean vector in the multivariate group sequential case

A Unified Theory of Two-Stage Adaptive Designs

On design and inference for two-stage adaptive clinical trials with dependent data

Unbiased estimation of secondary parameters following a sequential test

Estimation of a Secondary Parameter in a Group Sequential Clinical Trial

Supplementary analysis of probabilities at the termination of a group sequential phase II trial

Estimation of secondary endpoints in two-stage phase II oncology trials

Point and interval estimation of accuracies of a binary medical diagnostic test following group sequential testing

Conditional estimation of sensitivity and specificity from a phase 2 biomarker study allowing early termination for futility: Early termination of biomarker studies

On Efficient Two-Stage Adaptive Designs for Clinical Trials with Sample Size Adjustment

What inference for two-stage phase II trials

Statistical inference for extended or shortened phase II studies based on Simon's two-stage designs

Point estimation and p-values in phase II adaptive two-stage designs with a binary endpoint

Unbiased estimation for response adaptive clinical trials

Two stage conditionally unbiased estimators of the selected mean

Unbiased estimation of the parameter of a selected binomial population

Unbiased Estimation of Selected Treatment Means in Two-Stage Trials

Conditionally unbiased and near unbiased estimation of the selected treatment mean for multistage drop-the-losers trials

Conditionally unbiased estimation in the normal setting with unknown variances

Conditional estimation after a two-stage diagnostic biomarker study that allows early termination for futility

Correcting for bias in the selection and validation of informative diagnostic tests

Conditionally unbiased estimation in phase II/III clinical trials with early stopping for futility

Unbiased estimation in seamless phase II/III trials with unequal treatment effect variances and hypothesis-driven selection rules

Uniformly minimum variance conditionally unbiased estimation in multi-arm multi-stage clinical trials

Estimation after subpopulation selection in adaptive seamless trials

Point estimation in adaptive enrichment designs

Point estimation following two-stage adaptive threshold enrichment clinical trials: Estimators for adaptive threshold enrichment clinical trials

Point and interval estimation in two-stage adaptive designs with time to event data and biomarker-driven subpopulation selection

Conditional estimation in two-stage adaptive designs: Conditional Estimation in Two-Stage Adaptive Designs

Improved approximation for estimation following closed sequential tests

Point and interval estimation following a sequential clinical trial

Conditional estimation following a group sequential clinical trial

Sequential tests and estimators after overrunning based on maximumlikelihood ordering

Inference about a secondary process following a sequential trial

Estimation after sequential testing: A simple approach for a truncated sequential probability ratio test

Proper inference from Simon's two-stage designs

Planning and Analyzing Adaptive Group Sequential Survival Trials

Exact Confidence Bounds Following Adaptive Group Sequential Tests

Exact inference for adaptive group sequential designs

Exact Inference for Adaptive Group Sequential Designs

Adaptive designs for noninferiority trials

Estimation and Confidence Intervals for Two-Stage Sample-Size-Flexible Design with LSW Likelihood Approach

Interval and point estimation in adaptive Phase II trials with binary endpoint

Comparison of conditional bias-adjusted estimators for interim analysis in clinical trials with survival data: Comparison of conditional bias-adjusted estimators for survival data

Flexible two-stage designs: an overview

Estimation in flexible two stage designs

Estimation and Confidence Intervals after Adjusting the Maximum Information

Recursive Combination Tests

A note on the sequential estimation of means

Repeated Significance Tests on Accumulating Data

Sequential Analysis

The Design and Analysis of Sequential Clinical Trials

A Simple and Efficient Bias-Reduced Estimator of Response Probability Following a Group Sequential Phase II Trial

Continuous and group sequential conditional probability ratio tests for phase II clinical trials

On the bias of estimation of a brownian motion drift following group sequential tests

Estimation following extension of a study on the basis of conditional power

An evaluation of inferential procedures for adaptive clinical trial designs with pre-specified rules for modifying the sample size

Group sequential control of overall toxicity incidents in clinical trials -non-Bayesian and Bayesian approaches

Sequential estimation for two-stage and three-stage clinical trials

Estimation following sequential tests involving data-dependent treatment allocation

Estimation following group-sequential response-adaptive clinical trials

Conditional Point Estimation in Adaptive Group Sequential Test Designs

Statistical Inference after an Adaptive Group Sequential Design: A Case Study

Sample Size Reassessment in Adaptive Clinical Trials Using a Bias Corrected Estimate

Conditional bias of point estimates following a group sequential test

Conditional Maximum Likelihood Estimation Following a Group Sequential Test

On random sample size, ignorability, ancillarity, completeness, separability, and degeneracy: Sequential trials, random sample sizes, and missing data

Properties of Estimators in Exponential Family Settings with Observation-based Stopping Rules

Estimation After a Group Sequential Trial

Conditional estimation using prior information in 2-stage group sequential designs assuming asymptotic normality when the trial terminated early

Some Shrinkage Techniques for Estimating the Mean

Estimation of a parameter and its exact confidence interval following sequential sample size reestimation trials

Statistical inference for self-designing clinical trials with a one-sided hypothesis

Sample Size Reestimation for Clinical Trials with Censored Survival Data

Adaptive design: estimation and inference with censored data in a semiparametric model

Point estimates and confidence regions for sequential trials involving selection

An improved method of evaluating drug effect in a multiple dose clinical trial

Estimation following selection of the largest of two normal means

Parameter estimation following an adaptive treatment selection trial design

Estimation of treatment effect following a clinical trial with adaptive design

Likelihood inference for a two-stage design with treatment selection

Confidence intervals for confirmatory adaptive two-stage designs with treatment selection: Confidence intervals for adaptive two-stage designs

Stopping rules and estimation problems in clinical trials

Practical problems in interim analyses, with particular regard to estimation

Empirical Bayes Estimation for the Means of the Selected Populations

Discussion of Professor Stein's paper "Confidence sets for the mean of a multivariate normal distribution

Empirical Bayes estimation of the selected treatment mean for two-stage drop-the-loser trials: a meta-analytic approach

On Bayesian estimators in multistage binomial designs

Bias reduction via resampling for estimation following sequential tests

Bias Reduction using Stochastic Approximation

An efficient sequential design of clinical trials

Estimating and Reducing Bias in Group Sequential Designs with Gaussian Independent Increment Structure

Using Bayesian modeling in frequentist adaptive enrichment designs

A Flexible Method Using a Parametric Bootstrap for Reducing Bias in Adaptive Designs With Treatment Selection

Using the bootstrap for estimation in group sequential designs: an application to a clinical trial for nasopharyngeal cancer

Bootstrap corrections of treatment effect estimates following selection

Estimation of treatment effects following a sequential trial of multiple treatments

Flexible Interim Analyses in Clinical Trials Using Multistage Adaptive Test Designs

Viscum album [L.] extract therapy in patients with locally advanced or metastatic pancreatic cancer: a randomised clinical trial on overall survival

Twenty-five years of confirmatory adaptive designs: opportunities and pitfalls

Multiple sclerosis and extract of cannabis: results of the MUSEC trial

Parameter Estimation Following Group Sequential Hypothesis Testing

Group Sequential Methods with Applications to Clinical Trials

Overestimation of the effect size in group sequential trials

Bayes factor design analysis: Planning for compelling evidence

Simulation Practices for Adaptive Trial Designs in Drug and Device Development

Design and rationale of the treatment of acute coronary syndromes with otamixaban trial: a double-blind triple-dummy 2-stage randomized trial comparing otamixaban to unfractionated heparin and eptifibatide in non-STsegment elevation acute coronary syndromes with a planned early invasive strategy

Precision of maximum likelihood estimation in adaptive designs

Identifying combined design and analysis procedures in two-stage trials with a binary end point

Test-compatible confidence intervals for adaptive two-stage single-arm designs with binary endpoint

QALS Study Group. Phase II trial of CoQ10 for ALS finds insufficient evidence to justify phase III

All of the data that support the findings of this study are available within the paper itself.

We systematically searched the MEDLINE database via PubMed (from 1st Jan 2000 to 30th Sept 2020) using the following search terms:MAMS trials (("multi-stage") OR ("multi stage") OR ("multi-arm multi-stage") OR ("multi arm multi stage") OR (two-stage) OR ("two stage") OR ("pick the winner") OR ("pick-the-loser") OR ("drop the loser") OR ("drop-the-loser") OR ("dose selection") OR (seamless)) RAR designs (("response adaptive") OR ("response-adaptive") OR ("adaptive randomisation") OR ("adaptive randomization") OR ("outcome adaptive") OR ("outcome-adaptive")) Adaptive enrichment trials (("adaptive enrichment") OR ("population enrichment") OR ("patient enrichment") OR ("enrichment design") OR ("biomarker-adaptive") OR ("biomarker adaptive") OR ("subgroup selection") OR ("subpopulation selection") OR ("enrichment"))