key: cord-0953580-hj7rnng3
authors: Benkeser, David; Diaz, Ivan; Luedtke, Alex; Segal, Jodi; Scharfstein, Daniel; Rosenblum, Michael
title: Improving Precision and Power in Randomized Trials for COVID-19 Treatments Using Covariate Adjustment, for Ordinal or Time to Event Outcomes
date: 2020-04-23
journal: medRxiv
DOI: 10.1101/2020.04.19.20069922
sha: dc33316ec5f5e33b146b11c6250772c7a2cdb266
doc_id: 953580
cord_uid: hj7rnng3

We aim to help inform the choice of estimand (i.e., target of inference) and analysis method to be used in future COVID-19 treatment trials. To this end, we describe estimands for outcome types of particular interest in these trials (ordinal and time-to-event). When the outcome is ordinal, the estimands that we consider are the difference between study arms in the mean outcome, the Mann-Whitney (rank--based) estimand, and the average of the cumulative log odds ratios over the levels of the outcome. For time-to-event outcomes, we consider the difference between arms in the restricted mean survival time, the difference between arms in the cumulative incidence, and the relative risk. Advantageously, the interpretability of these estimands does not rely on a proportional odds or proportional hazards assumptions. For each estimand, we evaluate the potential value added by using estimators that leverage information in baseline variables to improve precision and reduce the required sample size to achieve a desired power. These are called covariate adjusted estimators. To evaluate the performance of the covariate adjusted and unadjusted estimators that we present, we simulate two-arm, randomized trials comparing a hypothetical COVID-19 treatment versus standard of care, where the primary outcome is ordinal or time-to-event. Our simulated distributions are derived from two sources: longitudinal data on over 500 patients hospitalized at Weill Cornell Medicine New York Presbyterian Hospital prior to March 28, 2020, and a CDC preliminary description of 2449 cases reported to the CDC from February 12 to March 16, 2020. We focus on hospitalized, COVID-19 positive patients and consider the following outcomes: intubation, ventilator use, and death. We conduct simulations using all three estimands when the outcome is ordinal, but only evaluate the restricted mean survival time when the outcome is time to event. Our simulations showed that, in trials with at least 200 participants, the precision gains due to covariate adjustment are equivalent to requiring 10-20% fewer participants to achieve the same power as a trial that uses the unadjusted estimator; this was the case for each outcome and estimand that we considered.

preliminary description of 2449 cases reported to the CDC from February 12 to March 16, 2020. We focus on hospitalized, COVID-19 positive patients and consider the following outcomes: intubation, ventilator use, and death. We conduct simulations using all three estimands when the outcome is ordinal, but only evaluate the restricted mean survival time when the outcome is time to event. Our simulations showed that, in trials with at least 200 participants, the precision gains due to covariate adjustment are equivalent to requiring 10-20% fewer participants to achieve the same power as a trial that uses the unadjusted estimator; this was the case for each outcome and estimand that we considered.

Paper Organization. Section 2 provides background on the use of covariate adjustment in randomized trials. Section 3 describes estimands and estimation strategies that we find compelling when the outcome is ordinal or a time to event. Section 4 describes the methods behind the simulation study that we conducted to evaluate these estimation strategies. Section 5 presents the results from this simulation. Section 6 presents our recommendations for the estimands and primary efficacy analysis methods that we recommend for future COVID-19 treatment trials.

The ICH E9 Guidance on Statistical Methods for Analyzing Clinical Trials (1998) states that "Pretrial deliberations should identify those covariates and factors expected to have an important influence on the primary variable(s), and should consider how to account for these in the analysis to improve precision and to compensate for any lack of balance between treatment groups" [10] . Adjusting for prognostic baseline variables is called covariate adjustment. Though there appears to be general agreement among regulators [8, 9] on how to adjust for baseline variables when the outcome is continuous valued (i.e., using analysis of covariance, abbreviated as ANCOVA), there is less specific guidance for several outcome types of interest in COVID-19 treatment trials, namely binary, ordinal, or time to event outcomes [1] . We evaluate the performance of covariate adjusted estimators for ordinal and time to event outcomes. Binary outcomes are a special case where the ordinal outcome has two levels, and the standardization approach of [13, 17] can be applied.

We consider three estimands when the outcome is ordinal. We comment on these estimands briefly here, and refer the reader to Appendix A.1 for analytic expressions for these quantities.

Estimand 1: Difference in means of a prespecified transformation of the ordinal outcome (DIM). In most settings, this transformation will be monotone so that larger values of the ordinal outcome (assumed to be preferable) will result in larger, and therefore preferable, transformed outcomes.

Estimand 2: Mann-Whitney (MW) estimand. This estimand reports the probability that a random individual assigned to treatment will have a better outcome than a random individual assigned to control, with ties broken at random.

Estimand 3: Log-odds ratio (LOR). We consider a nonparametric extension of the logodds ratio (LOR) [6] defined as the average of the cumulative log odds ratios over levels 1 to K − 1 of the outcome. In the case that the distribution of the outcome given treatment arm is accurately described by a proportional odds model [12] , this estimand corresponds to the coefficient in front of treatment in this model. Otherwise, it represents a projection onto this model.

It follows from Appendix A.1 that all of the above estimands can be written as smooth summaries of the arm-specific cumulative distribution functions (CDFs) of an ordinal outcome. To estimate these quantities, it suffices to estimate these two CDFs and then to evaluate these summaries. This results in so-called plug-in estimators.

We consider two CDF estimators. The first, which we term the unadjusted estimator, corresponds to using the arm-specific empirical CDFs, that is, corresponds to stratifying the sample based on treatment arm and then using the observed empirical distributions within each arm. The resulting estimator for the DIM corresponds precisely to taking the empirical differences of transformed mean outcomes between the two arms. The resulting estimator M for the MW estimand is closely related to the usual Mann-Whitney U-statistic U : in particular, U = n 0 n 1 M , where n 0 and n 1 are the total sample sizes in the two treatment groups.

The second CDF estimator, which we term the covariate adjusted estimator, leverages prognostic information that is available at baseline. It uses working models, i.e., models that are fit in the process of computing the estimator but which we do not assume to be correctly specified (and for which consistency and asymptotic normality still hold under arbitrary model misspecification). Specifically, this estimator makes use of a treatment-stratified proportional odds working model that assumes that, within each treatment arm, the CDF of the ordinal outcome satisfies the proportional odds assumption based on a linear combination of the baseline covariates. This yields two working covariate-conditional CDFs (one per treatment arm). Our final treatment-specific marginal CDFs are obtained by averaging these conditional CDFs across the pooled covariates from the two treatment arms. Importantly, the validity of this estimator (e.g., asymptotic unbiasedness, asymptotic normality) does not rely on correct specification of the working treatment-stratified proportional odds model 3 that this estimator uses. The fact that our CDF estimator is robust to misspecification of the outcome working model is reminiscent of the fact that ANCOVA estimator can be used for covariate adjustment when the outcome is continuous [24, 22] . We also refer the reader to the following references for further discussion of the benefits of covariate adjustment: for the DIM, see [22] ; for the MW estimand, see [21] ; for the LOR, see [6] ; in the case that the ordinal outcome takes only two values, see [14] , whose general discussion applies to all three estimands.

Because the covariate adjusted approach that we have described is based on a working parametric model, there are many possible approaches to developing valid confidence intervals for the estimand of interest. These approaches include the nonparametric BCa bootstrap [7] or Wald-type confidence intervals. When developing BCa intervals based on trial data, we recommend using a large number of bootstrap replicates to ensure that the final interval does not meaningfully depend on the random number generator.

We recommend handling missing covariates by imputing them based only on data from those covariates that were observed. Importantly, to ensure that treatment assignment can be treated as randomized even conditionally on the imputed covariates, we do not recommend using treatment or outcome information in this imputation. We recommend handling missing ordinal outcomes using doubly robust methods whose validity relies on the outcomes being missing at random conditionally on the covariates and treatment assignment. In terms of implementation, this approach involves simply applying the methods described in Appendix A.2, but with study arm assignment recoded as 0 to indicate that a patient was both randomized to study arm 0 (control) and had their outcome measured, 1 to indicate that a patient was both randomized to study arm 1 (treatment) and had their outcome measured, and -1 to indicate that the outcome is missing.

We consider three treatment effect estimands in the time-to-event setting, all of which are interpretable under violations of a proportional hazards assumption. We refer the reader to Appendix B for formal analytic expressions for these quantities.

Estimand 1: Difference in restricted mean survival times (RMSTs). Recall that the RMST corresponds to the expected value of a survival time that is truncated at a specified time [4, 15] . Alternatively, the RMST can be expressed as the area under the survival curve up to the truncation time.

Estimand 2: Cumulative incidence difference (CID). Additive difference in armspecific incidences of the event by a specified time.

Estimand 3: Relative risk (RR). Ratio of the arm-specific incidences of the event by a specified time A covariate adjusted targeted minimum loss-based estimator of the RMST is described in Díaz et al. [5] . Analogous to the ordinal outcome case, the estimator of the RMST relies 4 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted April 23, 2020.

on an estimate of the survival function at each time point. Estimators obtained through the Kaplan-Meier curve are referred to as unadjusted. Adjusted estimators are obtained by computing time-specific hazard probabilities which are transformed using the productlimit formula and marginalized over covariates. The adjusted approach has two key benefits relative to unadjusted alternatives, for example, those based on an unadjusted Kaplan-Meier estimator [11] . First, under regularity conditions, the adjusted estimator is at least as precise as the unadjusted estimator, asymptotically. Second, the adjusted estimator's consistency depends on an assumption of censoring being independent of the outcome given study arm and baseline covariates, rather than an assumption of censoring in each arm being independent of the outcome marginally; see Eqs. 3 and 4 in Appendix B. The former may be a more plausible assumption. Similar covariate adjusted estimators for the CID and RR are also available [2] .

Both [5] and [2] provide approaches that can be used to develop Wald-type confidence intervals and corresponding tests of the null of no effect.

We generated data based on [3] , which reported outcomes for hospitalized and non-hospitalized individuals with COVID-19. We focus on the hospitalized data and place additional results pertaining to the non-hospitalized population in the Appendix. There were three levels of the ordinal outcome, with level 1 representing death, level 2 representing that ventilation was required, and level 3 representing that ventilation was not required. We adjusted for the following patient age categories: 0-19, 20-44, 45-54, 55-64, 65-74, 75-84, and ≥85. Lower and upper estimates were given for probabilities of outcomes in each age group in [3] ; we used the mean of these within each age group.

For the hospitalized COVID-19 positive population, the resulting probabilities of being in each age group and of the outcomes: ventilator use and death are listed in Table 1 Table 1 : Hospitalized, COVID-19 positive population: Age and conditional outcome distributions used for the control arm in the ordinal outcome simulation study.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted April 23, 2020. .

We evaluated performance of the methods for both ineffective and effective treatments. We considered trials with 1:1 randomization probabilities and total enrollment of n = 100, 200, 500, and 1000. To simulate treatment effects, we assumed that a treatment would have no effect on the probability of death but would decrease the odds of ventilation by the same relative amount in each age category. This relative reduction was selected for each sample size, such that a two-sided t-test that the average outcome between treatment arms were equal rejected the null hypothesis with approximately 80% power (n=200,500,1000) and 50% power (n=100). The discrepancy between n = 100 and the other sample sizes occurred because there did not exist a relative reduction that achieved 80% power at a sample size of 100. For each setting, 1000 Monte Carlo replicates were performed.

The adjusted estimators that we implemented in our simulation were based on a proportional odds regression model that included age category as a linear term (0-19, 20-44, 45-54, 55-64, 65-74, 75-84, ≥85). We estimated the treatment propensity using the marginal empirical probability of treatment. We used 1000 replicates when developing BCa bootstrap confidence intervals.

In order to mimic key predictive features of clinical outcomes in COVID patients, our simulation data generating mechanism is based on the over 500 patients hospitalized at Weill Cornell Medicine New York Presbyterian Hospital prior to March 28, 2020. This ensures we have at least 14 days of follow-up for all individuals. The event of interest in this simulation is time from hospitalization to intubation, and the predictive variables used are sex, age, whether the patient required supplemental oxygen at ED presentation, dyspnea, hypertension, and the presence of bilateral infiltrates in the chest x-ray. We focus on the RMST 14 days after hospitalization. Patient data was re-sampled with replacement to generate 1000 datasets of each of the sizes n = 100, 200, 500, and 1000. For each dataset, a hypothetical treatment variable was drawn from a Bernoulli distribution with probability 0.5 independently of all other variables. A positive treatment effect of 1.06 was simulated by adding an independent random draw from a χ 2 distribution with 4 degrees of freedom to the treated group. Five percent of the patients were selected at random to be censored, and their censoring time was drawn from a uniform distribution.

We compare the performance of the implementation of the covariate adjusted estimator described in Section 6 of [5] to the unadjusted Kaplan-Meier-based estimator described in [5] . Wald-type confidence intervals and corresponding tests of the null of no effect are reported.

We compare the type I error and power of tests of the null hypothesis of no treatment effect based on unadjusted and adjusted estimators, both within and across estimands. For each estimand, we also compare the estimator bias, variance, and mean squared error of the unadjusted and the adjusted approaches. We report the relative efficiency of the unadjusted relative to the adjusted estimator (ratio of variance of adjusted estimator to variance of 6 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. . unadjusted estimator)-since both of these estimators are asymptotically unbiased, one minus the relative efficiency approximately has the interpretation as the proportion reduction in sample size needed for a covariate adjusted estimator to achieve the same statistical power as the unadjusted estimator, at local alternatives.

We present results based on bootstrap-based confidence intervals and hypothesis tests here, while results based on closed-form inference methods are presented in Appendix C. Tables 2, 3, and 4 display results for the difference in means, Mann-Whitney estimand, and log odds ratio, respectively. Type 1 error control of the covariate adjusted methods was comparable to or better than that of the unadjusted methods across most settings. The covariate adjusted methods achieved higher power across all settings. Absolute gains in power varied from 4% to 7% for the DIM, 1% to 7% for the MW estimand, and 3% to 5% for the LOR. The relative efficiency of covariate adjusted methods relative to unadjusted methods varied from 0.83 to 0.89 for the DIM, 0.82 to 0.91 for the MW estimand, and 0.83 to 0.88 for the LOR. 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. 

We present two sets of results. Table 5 displays the results for estimators that adjust for age and sex along with the four other variables described in Section 4.1.2. Type 1 error control of the covariate adjusted methods was comparable to or better than that of the unadjusted methods across most settings. The covariate adjusted methods achieved higher power across all settings, with absolute gains varying from 3% to 8%. The relative efficiency of covariate adjusted methods relative to unadjusted methods improved monotonically as a function of sample size, from approximately 0.95 when n = 100 to approximately 0.80 when n = 1000. Table 6 displays the results of the analysis that only adjusts for age and sex. In this case, the gains of the covariate adjusted methods relative to the unadjusted methods are more modest, with absolute gains in power of approximately 1% and relative efficiency ranging from 0.97 to 1.00. These results suggest that there can be a meaningful benefit from adjusting for prognostic covariates beyond just age and sex.

We found that adjusting for variables beyond just age and sex can add substantial value. Specifically, for a time-to-event intubation outcome among hospitalized patients at Weill Cornell Medicine, we saw substantial gains from additionally adjusting for hypertension, dyspnea, whether the patient required supplemental oxygen at ED presentation, and the presence of bilateral infiltrates in the chest x-ray. For ordinal outcomes, we saw that the log 8 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. Table 4 : Results for log-odds ratio (LOR) in hospitalized population using BCa bootstrap for inference. When n=100, the estimator exhibited non-robust behavior, as some samples included no observations in the treatment arm of at least one outcome level.

odds ratio estimator has undesirable performance when the sample size is small (n = 100).

Because the log odds ratio estimand is also hard to interpret clinically in the likely scenario that a proportional odds assumption fails to hold, we recommend the difference of means or Mann Whitney estimands when the outcome is ordinal.

6 Recommendations for Treatment Effect Definition and Primary Efficacy Analysis 1. Randomization scheme. Use stratified block randomization, with strata including site and calendar time (which may be strong predictors of the outcome and for which there may be too many to adjust for at the end of the trial). Prognostic baseline variables such as age, disease severity, and comorbidities will be adjusted for in the primary analysis (see point 3 below), rather than being stratified on in the randomization procedure.

Estimand when the outcome is ordinal. In selecting a treatment effect definition, if a utility function can be agreed upon to transform the outcome to a score with a clinically meaningful scale, then we recommend using the difference between the transformed means in the treatment and control arms. Otherwise, we recommend using the unweighted difference between means or the Mann-Whitney estimand. We recommend against estimating log odds ratios, since these are difficult to interpret.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. 3. Covariate adjustment. Adjust for prognostic baseline variables to improve precision and power. We expect improvements to be substantial since there are already several known prognostic baseline variables, e.g., age and co-morbidities. The baseline variables should be specified before the trial is started (or should be selected using only blinded data from the trial using a prespecified algorithm) and the number of variables adjusted for should be no more than (roughly) the total sample size divided by 20.

4. Confidence intervals and hypothesis testing. we recommend that the nonparametric bootstrap (BCa method) be used with 10000 replicates for constructing a confidence interval. The entire estimation procedure, including any model fitting, should be repeated in each replicate data set. Hypothesis tests can be conducted either by inverting the confidence interval or by permutation methods -the latter can be especially useful in smaller sample size trials in order to achieve the desired Type I error rate.

Early stopping for efficacy or futility. Rules for early stopping such as O'Brien-Fleming boundaries can be directly applied, except where z-statistics are constructed using the covariate adjusted estimator described above and the covariance between statistics at different analysis times is estimated using nonparametric bootstrap as described above. The timing of analyses can be based on the information accrued, which has the advantage of reducing the average sample size even in trials with no treatment effect, which results

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. Table 6 : Results for difference in restricted mean survival times (RMST) in hospitalized population using Wald-based inference when the estimator only adjusts for age and sex.

from earlier futility stopping.

6. Plotting the CDF when the outcome is ordinal. Regardless of which treatment effect definition is used in the primary efficacy analysis, we recommend that the estimated CDF of the primary outcome be plotted for each study arm when the outcome is ordinal. This is analogous to plotting Kaplan-Meier curves for time-to-event outcomes, which can help in interpreting the trial results. For example, if the observed probability of death is similar in the two arms but the probability of ICU admissions is smaller in the treatment arm, then this will be evident in the plots. We also recommend to plot simultaneous confidence bands to represent uncertainty about the estimated curves. To improve precision, covariate adjustment can be used when constructing the estimated CDFs and the corresponding confidence band.

7. Missing covariates. We recommend handling missing covariates by imputing these covariates based only on data from those covariates that were observed. Importantly, to ensure that treatment assignment is randomized conditionally on the imputed covariates, no treatment or outcome information should be used in this imputation.

8. Missing ordinal outcomes. We recommend handling missing ordinal outcomes using doubly robust methods whose validity relies on the outcomes are missing at random conditionally on the covariates and treatment assignment. One such doubly robust approach can be evaluated by applying the methods described in Appendix A.2, but with study 11 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. . https://doi.org/10.1101/2020.04.19.20069922 doi: medRxiv preprint arm recoded as 0 to indicate that a patient was both randomized to study arm 0 (control) and had their outcome measured, 1 to indicate that a patient was both randomized to study arm 1 (treatment) and had their outcome measured, and -1 to indicate that the outcome is missing. When study arm is recoded in this way and the outcome is not missing completely at random but is missing at random conditionally on covariates, it is important that the model used forπ described in Appendix A.2 actually depend on the baseline covariates, since this recoded treatment is not fully randomized.

9. Loss to follow up with time-to-event outcomes. We recommend accounting for loss-to-follow-up using doubly robust methods such as those described in [2, 5] . These methods rely on a potentially more plausible condition on the censoring distribution than do unadjusted methods. The covariate adjusted estimator for the restricted mean survival time in the time-to-event setting is double robust under censoring being independent of the outcome given baseline variables and arm assignment.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. .

[22] Wang, B., E. L. Ogburn, and M. Rosenblum (2019). Analysis of covariance in randomized trials: More precision and valid confidence intervals, without model assumptions.

Biometrics 75 (4), 1391-1400.

[23] Williamson, E. J., A. Forbes, and I. R. White (2014). Variance reduction in randomised trials by inverse probability weighting using the propensity score. Statistics in medicine 33 (5), 721-737.

[24] Yang, L. and A. A. Tsiatis (2001). Efficiency study of estimators for a treatment effect in a pretest-posttest trial. The American Statistician 55 (4), 314-321.

Let Y 0 denote a draw from the distribution of ordinal outcomes in treatment arm 0 and Y 1 denote an independent draw from the distribution of ordinal outcomes in treatment arm 1.

The three estimands write as

Above, u is a prespecified real-valued transformation of an outcome.

Consider a setting in which we observe n independent draws (X 1 , A 1 , Y 1 ), . . . , (X n , A n , Y n ), where each X i represents baseline covariates, each A i represents a treatment, and each Y i represents an outcome. We suppose that A ⊥ Y |X. We now derive an estimator for the CDF that is closely related to an estimator presented in Scharfstein et al. (1999) [16] and to targeted minimum loss-based estimators [20, 19] . For α := (α(j)) K−1 j=1 ∈ R K−1 and β ∈ R d , define the following R K−1 × R d function as follows:

logit m α,β (j, x) = α(j) + β x.

We will consider the treatment-stratified proportional odds working model for P {Y ≤ j|A = a, X = x} in which there exist (α 0 , β 0 ) and (α 1 , β 1 ) such that P {Y ≤ j|A = a, X = x} = m αa,βa (j, x) for all j, x, a. Importantly, we do not rely on this model being correct.

14 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. . Letπ(a|x) be an estimate of P (A = a|X = x). In the clinical trial setting that we consider, this quantity can be estimated with the empirical marginal treatment probability. At the end of this subsection, we describe alternative approaches for estimating this quantity. Suppose that, for a ∈ {0, 1},α(a) andβ(a) are chosen to minimize the following weighted empirical risk in (α, β):

Each of these a-specific optimizations can be solved by running a weighted logistic regression on a repeated measures dataset of size n × (K − 1). Alternatively, they can be fitted using software for a proportional odds model. For both levels of the treatment a, it can be shown thatα(a) 1 ≤α(a) 2 ≤ . . . ≤α(a) K−1 , and so, for any covariate value x, mα (a),β(a) (·, x) is a monotone nondecreasing function. Moreover, if our treatment-stratified proportional odds working model is correct, thenα(a) andβ(a) are asymptotically linear estimators of the true underlying parameters.

Our covariate adjusted estimate of the CDF ψ a (j) := P (Y ≤ j|A = a) is given bŷ

Because mα (a),β(a) (·, X i ) is monotone nondecreasing for all i = 1, . . . , n,ψ a (·) is also monotone nondecreasing. The above estimator also satisfies the known constraint thatψ a (j) ∈ [0, 1]. It can also be shown thatψ a (j) is (i) doubly robust and (ii) efficient if both the treatment mechanism and the stratified proportional odds working model are correctly specified. To show this, we now establish thatψ a (j) is in fact an augmented inverse probability weighted estimator. First, note that minimizing (1) to findα a andβ a implies that the following first-order condition is satisfied for all j ∈ {1, . . . , K − 1}:

Next, note that adding this to the right-hand side of (2) shows that

The above shows thatψ a (j) is an augmented inverse probability weighted estimator for ψ a (j), with the estimate of the outcome regression x → P {Y ≤ j|A = a, X = x} given by x → mα (a),β(a) (j, x).

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. .

We conclude by discussing our estimation of the treatment probability P (A = a|X = x). Though this quantity can always be estimated by the empirical treatment probability in our randomized trial setting, there are generally advantages to estimating this quantity within a richer model. For example, a main-terms logistic regression of treatment on covariates could be used -in a randomized trial setting, this model is correctly specified provided that it includes an intercept term. The advantage of estimating known treatment probabilities via correctly specified parametric models has been discussed elsewhere -see, for example, [23] or, for a general treatment, Section 2.3.7 in [18].

Let T be a time-to-event outcome, C be a right-censoring time, A be a treatment indicator, and X be a collection of baseline covariates. Let τ be an investigator-specified truncation time that will be used to define the RMST, and let t * be an investigator-specified time at which a comparison between the arm-specific cumulative incidences of the event is of interest. Let T 1 be a random variable with the distributon of T |A = 1 and let T 0 be a random variable with the distribution of T |A = 0. The three estimands write as

We conclude by describing the assumptions made by the adjusted and the unadjusted methods on the right censoring time C. Unadjusted methods assume that

The adjusted methods discussed in the main text assume that

which is often more plausible than (3).

C.1 Results for Wald-style inference for hospitalized patients

For the non-hospitalized COVID-19 positive population, the probabilities of being in each age group and of hospitalization and death are listed in Table 10 . For this simulation, adjustment 16 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. . via the proportional odds model resulted in numerical instabilities in many simulated data sets. Therefore, we considered adjusting for age using a stratified estimator. That is, the conditional probability of each outcome in each age group was estimated using the empirical proportion of the sample with each outcome in each age group. As with the hospitalized simulation, we assumed that a treatment would have no effect on the probability of death but would decrease the odds of ventilation by the same relative amount in each age category. The relative reduction was selected for each sample size, such that a two-sided t-test that the average outcome between treatment arms were equal rejected the null hypothesis with approximately 80% power (n=200,500,1000) and 50% power (n=100).

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. Table 9 : Results for log-odds ratio (LOR) in hospitalized population using Wald-based inference. When n=100, the estimator exhibited non-robust behavior, as some samples included no observations in the treatment arm of at least one outcome level.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 23, 2020. 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted April 23, 2020. Table 13 : Results for log-odds ratio (LOR) in non-hospitalized population using BCa bootstrap for inference. When n=100, the estimator exhibited non-robust behavior, as some samples included no observations in the treatment arm of at least one outcome level.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted April 23, 2020. .

A substantial and confusing variation exists in handling of baseline covariates in randomized controlled trials: a review of trials published in leading medical journals

Improved estimation of the cumulative incidence of rare outcomes

Severe outcomes among patients with coronavirus disease 2019 (covid-19) -united states

Causal inference on the difference of the restricted mean lifetime between two groups

Improved precision in the analysis of randomized trials with survival outcomes, without assuming proportional hazards

Enhanced precision in the analysis of randomized trials with ordinal outcomes

An introduction to the bootstrap

Guideline on adjustment for baseline covariates in clinical trials