key: cord-0433468-9daug4lh authors: Hejazi, Nima S.; Benkeser, David; Díaz, Iván; Laan, Mark J. van der title: Efficient estimation of modified treatment policy effects based on the generalized propensity score date: 2022-05-11 journal: nan DOI: nan sha: 5809fdb75b49cd15bb55878df45282ab300b0b99 doc_id: 433468 cord_uid: 9daug4lh Continuous treatments have posed a significant challenge for causal inference, both in the formulation and identification of scientifically meaningful effects and in their robust estimation. Traditionally, focus has been placed on techniques applicable to binary or categorical treatments with few levels, allowing for the application of propensity score-based methodology with relative ease. Efforts to accommodate continuous treatments introduced the generalized propensity score, yet estimators of this nuisance parameter commonly utilize parametric regression strategies that sharply limit the robustness and efficiency of inverse probability weighted estimators of causal effect parameters. We formulate and investigate a novel, flexible estimator of the generalized propensity score based on a nonparametric function estimator that provably converges at a suitably fast rate to the target functional so as to facilitate statistical inference. With this estimator, we demonstrate the construction of nonparametric inverse probability weighted estimators of a class of causal effect estimands tailored to continuous treatments. To ensure the asymptotic efficiency of our proposed estimators, we outline several non-restrictive selection procedures for utilizing a sieve estimation framework to undersmooth estimators of the generalized propensity score. We provide the first characterization of such inverse probability weighted estimators achieving the nonparametric efficiency bound in a setting with continuous treatments, demonstrating this in numerical experiments. We further evaluate the higher-order efficiency of our proposed estimators by deriving and numerically examining the second-order remainder of the corresponding efficient influence function in the nonparametric model. Open source software implementing our proposed estimation techniques, the haldensify R package, is briefly discussed. Across a range of scientific fields, research efforts often aim to quantify the causal effects of intervening on continuous or ordinal treatments. Examples include evaluating the impacts of increased physical exercise on aging (Díaz & van der Laan, 2012), reductions in surgical operating time on post-surgical health outcomes (Haneuse & Rotnitzky, 2013), changes in vaccine-induced immunologic response on disease risk (Hejazi et al., 2020b), the impact of nurse hours per patient on hospital readmission risk (McHugh et al., 2013), and mortality reduction from COVID-19 delay-in-intubation policies (Díaz et al., 2022). In these settings, causal effects are often quantified in terms of causal dose-response curves (Imbens, 2000; Kennedy et al., 2017). Unfortunately, the estimation of such dose-response curves for continuous treatments in infinite-dimensional statistical models is exceptionally challenging. Estimators of these curves generally do not enjoy $n^{1/2}$-rate asymptotic behavior and are incompatible with many standard estimation techniques. To circumvent these challenges, it is common in practice to discretize the treatment into categories, which enables practitioners to recover $n^{1/2}$-rate behavior with standard estimation techniques.
However, this strategy is subject to a host of limitations, the most fundamental of which is a potential violation of the causal identification assumption of consistency. This assumption stipulates that hypothetical interventions should be well-defined, which is rarely (if ever) the case when conceptualizing interventions that assign a continuous treatment to a set of (discrete) values. More practically, if the treatment is poorly discretized, perhaps due to limited subject matter knowledge, the resultant dose-response curve is prone to providing a poor approximation of the underlying curve. More recent work has developed asymptotic theory under a range of semiparametric assumptions. For example, Kennedy et al. (2017) proposed estimators based on locally linear smoothing; van der Laan et al. (2018a) considered cross-validated targeted minimum loss estimation of a general class of non-standard parameters, including the dose-response curve as an example; and Westling et al. (2020) developed estimators assuming monotonicity of the dose-response curve. These approaches represent valuable contributions in this area. However, each approach relies on assumptions that are difficult to justify across a range of settings. In particular, each posits a positivity assumption requiring common support of the covariate-conditional treatment density across all possible covariate values. This fundamental limitation has motivated the study of causal effects of stochastic interventions (Stock, 1989; Díaz & van der Laan, 2012; Young et al., 2014). These interventions consider setting each individual's treatment level to a random draw from an investigator-specified distribution. By carefully considering the candidate post-intervention distribution of the treatment, practitioners can avoid making unrealistic positivity assumptions. One popular approach draws treatment values from a modification of the naturally occurring treatment distribution. Counterfactuals defined in this way are better aligned with plausible future interventions that may be engineered by scientific or policy initiatives. Recent efforts have yielded several candidate approaches (e.g., Díaz & van der Laan, 2012; Young et al., 2014; Kennedy, 2019) for identifying and estimating the causal effects of stochastic interventions, including for longitudinal treatment regimens (Díaz et al., 2021, 2022) and mediation analysis (e.g., Hejazi et al., 2022b). Many estimation approaches for stochastic interventions center around estimation of the generalized propensity score (Hirano & Imbens, 2004), the conditional treatment density given covariates. Estimation of the generalized propensity score is a common problem in causal inference and has been studied in the context of, for example, marginal structural models (Robins et al., 2000) and covariate balancing (Hirano & Imbens, 2004). Despite its prevalence, most proposals rely on assumptions imposing restrictive parametric forms to facilitate estimation of this quantity, though flexible estimators leveraging advances in machine learning have emerged (e.g., Díaz & van der Laan, 2011; Zhu et al., 2015). Numerical studies examining the impact of generalized propensity score estimation on the evaluation of the causal effects of continuous treatments were contributed by Austin (2018). In the present work, we develop a flexible, nonparametric estimator of the generalized propensity score that is fully compatible with techniques for sieve estimation.
This estimator is formulated with the goal of evaluating the causal effects of continuous treatments, using the framework of stochastic interventions. Building upon this flexible nuisance estimator, we formulate and evaluate a unique class of nonparametric inverse probability weighted (IPW) estimators capable of asymptotically achieving a semiparametric analog of the efficiency bound. To the best of our knowledge, this represents two significant theoretical and practical advances: (1) successfully tailoring undersmoothing principles to conditional density functionals for the efficient estimation of low-dimensional statistical parameters, and (2) developing IPW estimators of continuous treatment effects that are both asymptotically efficient and competitive with modern doubly robust estimators. We additionally discuss the open source haldensify package (Hejazi et al., 2022a) for the R statistical programming language and environment (R Core Team, 2022), which implements our generalized propensity score and IPW estimators. The remainder of the present manuscript is organized as follows. Section 2 introduces notation and discusses existing theory on the causal effects of stochastic interventions and their efficient estimation. Section 3 outlines our proposed conditional density estimation procedure and goes on to detail its use in constructing a novel class of nonparametric IPW estimators, proposing several procedures for sieve estimation designed to allow these estimators to attain asymptotic efficiency. Section 4 investigates the relative performance of our proposed nonparametric IPW estimators in numerical experiments. Section 5 concludes with a discussion of future avenues of investigation.

2 Preliminaries

Let W ∈ W denote a vector of baseline covariates, A ∈ A a real-valued continuous (or ordinal) treatment, and Y ∈ Y an outcome of interest. To formalize the causal question of interest, we introduce a structural causal model (SCM) to describe the data-generating process (Pearl, 2009). Specifically, we assume the following system of structural equations generates the observed data for a single unit: W = f W (U W ), A = f A (W, U A ), and Y = f Y (A, W, U Y ), where {f W , f A , f Y } are deterministic functions and {U W , U A , U Y } are exogenous random variables. Importantly, the SCM implies a model for the distribution of counterfactual random variables, which are generated by specific interventions on the data-generating process. The standard identification assumptions of consistency (Y d(a,w) = Y in the event A = d(a, w)) and lack of interference (Cox, 1958) hold as derived properties of this SCM (for a discussion on the former, see Pearl, 2010). The complete (unobserved) data unit (Neyman, 1938) is X = (W, Y a : a ∈ A), where the counterfactuals Y a are the potential outcomes corresponding to the values the outcome would take under each possible value in the support of the treatment A. We will focus on the estimation of counterfactual treatment parameters that are functionals of the distribution of X. We consider the data on a single observational unit O, denoted O = (W, A, Y ), as arising from P 0 , the true generating distribution of O. Assuming access to n i.i.d. copies of O, we use P n to denote the empirical distribution of O 1 , . . . , O n . At times, it will prove convenient to use standard notation from empirical process theory; specifically, we will write $P_n f := n^{-1} \sum_{i=1}^n f(O_i)$ and, more generally, $Pf := \int f(o)\, dP(o)$ on occasion.
Assuming only that P 0 is an element of the nonparametric statistical model M, i.e., P 0 ∈ M, we strive to avoid placing undue restrictions on the form of P 0 . We use p 0 to denote the density of O, which, evaluated on a typical observation o, is $p_0(o) = q_{0,Y}(y \mid a, w)\, g_{0,A}(a \mid w)\, q_{0,W}(w)$, where q 0,Y denotes the conditional density of Y given {A, W } with respect to some dominating measure, g 0,A the conditional density of A given W with respect to dominating measure µ, and q 0,W the density of W with respect to dominating measure ν. Counterfactual quantities of interest may be defined by specific interventions that alter the structural equation f A and "surgically" insert post-intervention treatment values in place of those naturally generated by f A . Static interventions make for a familiar example: f A is merely replaced with a specific value a ∈ A, selected a priori. When |A|, the cardinality of A, is small, contrasts of the counterfactual means of static interventions under each a ∈ A can prove informative. On the other hand, when |A| is large or uncountable (e.g., when A ∈ R, i.e., is continuous), the evaluation of many such counterfactuals is both of questionable scientific relevance and theoretically challenging. A stochastic intervention modifies the value A would naturally assume, replacing it with a draw from a post-intervention distribution g̃ 0,A (· | W ) (n.b., the naught subscript emphasizes that this distribution may depend on the true, but unknown, data-generating distribution P 0 ). A stochastic intervention may be designed to collapse into a static intervention by selecting g̃ 0,A (· | W ) to be degenerate, placing all mass on a single point a ∈ A. The taxonomy of stochastic interventions is, as of yet, unsettled; the term is widely used to describe intervention regimes exhibiting randomness conditional on covariates W. From this perspective, static interventions are deterministic, and dynamic interventions, which assign treatment based on W, are deterministic conditional on W. Throughout, we focus on intervention regimes meeting this latter definition. Using the former definition, Díaz & van der Laan (2012) described a stochastic intervention that draws A from a distribution such that g̃ 0,A (a | W ) = g 0,A (a − δ(W ) | W ) for all a ∈ A. Shortly thereafter, Haneuse & Rotnitzky (2013), taking the latter definition, showed that estimation of the causal effect attributable to the stochastic intervention of Díaz & van der Laan (2012) is equivalent to the effect of an investigator-supplied hypothetical intervention modifying the value A would naturally assume according to a regime d(a, w; δ) = a + δ(w). These authors go on to characterize a broad class of interventions by introducing the assumption of piecewise smooth invertibility. A1 (Piecewise smooth invertibility). For each w ∈ W, assume that the interval I(w) = (l(w), u(w)) may be partitioned into subintervals {I δ,j (w) : j = 1, . . . , J(w)} such that the regime d(a, w; δ) is equal to some d j (a, w; δ) in I δ,j (w) and d j (·, w; δ) has inverse function b j (·, w; δ) with derivative b′ j (·, w; δ). Assumption A1 can be used to show that the intervention implied by the regime d(A, W ; δ) may be interpreted as acting on the individual level (Young et al., 2014), a view incompatible with the definition used by Díaz & van der Laan (2012). Importantly, the regime d(A, W ; δ) may depend on both covariates W and the treatment A that would be assigned in its absence (i.e., the treatment's natural value); consequently, such regimes have been termed modified treatment policies (MTPs).
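As a quick worked check of A1 (an illustration following the definitions above, not a result stated in the original), the additive regime of Díaz & van der Laan (2012) satisfies the assumption trivially with a single partition element:

$$d(a, w; \delta) = a + \delta(w), \qquad J(w) = 1, \qquad b_1(a, w; \delta) = a - \delta(w), \qquad b_1'(a, w; \delta) = 1,$$

so that the change-of-variable formula recovers the post-intervention density $\tilde g_{0,A}(a \mid w) = g_{0,A}(b_1(a, w; \delta) \mid w)\,|b_1'(a, w; \delta)| = g_{0,A}(a - \delta(w) \mid w)$, matching the stochastic intervention described earlier.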
Regimes obeying the MTP definition constitute an important class of stochastic interventions of the second variety (n.b., these are non-deterministic conditional on W, since they depend on A). Both Haneuse & Rotnitzky (2013) and Díaz & van der Laan (2018) were motivated by counterfactual questions of modifying an assigned treatment, the former seeking to evaluate the effect on patient health of reducing surgical operating time and the latter the effect of adjusting prescribed exercise regimens based on existing athletic habits. Conveniently, both sets of authors considered an MTP of the form

$$d(a, w; \delta) = a + \delta(w) \text{ if } a + \delta(w) \le u(w), \text{ and } d(a, w; \delta) = a \text{ otherwise}, \qquad (1)$$

where u(w) is the maximum value in the conditional support of g 0,A (· | W = w). While an intuitively useful MTP, in many settings, an additive alteration to the natural value of A may prove inadequate to formulate an actionable policy. This arises, for example, in environmental health, in which it is of interest to evaluate effects of multiplicative reductions to, say, pollutants. MTPs can be tailored to such use cases easily, e.g., via a multiplicative regime

$$d(a, w; \delta) = \delta a, \qquad (2)$$

for a user-specified multiplicative factor δ > 0. The MTPs in equations (1) and (2) can be seen as generating a counterfactual outcome $Y_{d(A,W;\delta)} := f_Y(d(A, W; \delta), W, U_Y)$. The goal, then, is to identify and estimate $\psi_{0,\delta} := \mathbb{E}_{P_0^\delta}\{Y_{d(A,W;\delta)}\}$, the mean of this counterfactual outcome. Using the population intervention effect (PIE) framework (Hubbard & van der Laan, 2008), Díaz & van der Laan (2012, 2018) introduced the PIE for stochastic interventions θ 0,δ := ψ 0,δ − E[Y]. As E[Y] is trivially estimable, their efforts focused on identifying and estimating ψ 0,δ . For the MTP (1), these authors showed ψ 0,δ to be identified by

$$\psi_{0,\delta} = \int_{\mathcal{W}} \int_{\mathcal{A}} Q_{0,Y}(d(a, w; \delta), w)\, g_{0,A}(a \mid w)\, q_{0,W}(w)\, d\mu(a)\, d\nu(w), \qquad (3)$$

where Q 0,Y (a, w) := E P0 {Y | A = a, W = w} is the conditional mean of Y given A = a and W = w (as implied by P 0 ), $Q^d_{0,Y}(a, w) := Q_{0,Y}(d(a, w; \delta), w)$ is this outcome mechanism evaluated under the MTP, and g 0,A (a | W = w) is the conditional density of the treatment. We stress that the equivalence in equation (3) holds only when the post-intervention treatment density g̃ 0,A arises from a regime d(A, W ; δ) adhering to Assumption A1, which facilitates use of the change-of-variable formula in the integral expressions. This mathematical convenience allows identification of ψ 0,δ by replacing evaluation of the expectation Q 0,Y (a, w) under the counterfactual treatment density g̃ 0,A with evaluation of the post-modification outcome mechanism $Q^d_{0,Y}(a, w)$ under the natural treatment density g 0,A . This has proven to be a significant simplification, essential for the formulation of efficient estimators that are computationally feasible (e.g., Díaz & van der Laan, 2018; Hejazi et al., 2020b). Considerations necessary for appropriately expressing g̃ 0,A in terms of g 0,A for the MTPs (1) and (2), whether A is continuous or discrete-valued, have been previously outlined (see, e.g., Díaz et al., 2022, equations (6) and (7)). Throughout, in a minor abuse of notation, we will use g̃ 0,A to refer to the appropriate post-intervention treatment density for the MTPs (1) and (2). For the statistical functional (3) to correspond to the causal estimand ψ 0,δ , several standard but untestable assumptions are required, including a no-unmeasured-confounding condition (A2) and the positivity condition (A3) considered next. The positivity assumption A3, required to establish equation (3), is unlike its analog for simpler (i.e., static or dynamic) intervention regimes. Instead of requiring positive mass to be placed across all treatment levels for all covariate strata w ∈ W, this positivity condition requires only that the post-intervention treatment mechanism be bounded, i.e., P P0 {g̃ 0,A (A | W )/g 0,A (A | W ) < ∞} = 1, which may generally be satisfied by a suitable choice of the parameter δ(W ) in an MTP d(A, W ; δ).
Importantly, this assumption ensures that the post-intervention treatment density g̃ 0,A does not place mass at points in the support at which g 0,A places no mass, leading to well-defined counterfactuals. Beyond their careful study of the identification of this causal effect, Díaz & van der Laan (2012, 2018) derived the efficient influence function (EIF), a quantity central to semiparametric efficiency theory, of ψ 0,δ in the nonparametric model M. Using the EIF, these authors proposed asymptotically efficient doubly robust estimators constructed based on the form of the EIF. When evaluated on a typical observation o, the EIF is

$$D^\star(P_0)(o) = H_0(a, w)\{y - Q_{0,Y}(a, w)\} + Q_{0,Y}(d(a, w; \delta), w) - \psi_{0,\delta}, \qquad (4)$$

where the auxiliary covariate H(a, w) takes the form H 0 (a, w) = g̃ 0,A (a | w)/g 0,A (a | w). The EIF characterizes the best possible asymptotic variance, or nonparametric efficiency bound, of all regular asymptotically linear estimators of ψ 0,δ and is thus usually a critical ingredient in the development of efficient estimation strategies. While the expression of the EIF appearing in equation (4) is suitable for the construction of efficient estimators, it arises from a von Mises expansion (von Mises, 1947; van der Laan & Robins, 2003) of the parameter functional Ψ δ : M → R at the distributions P, P 0 ∈ M:

$$\Psi_\delta(P) - \Psi_\delta(P_0) = (P_n - P_0)\, D^\star(P_0) - P_n D^\star(P) + (P_n - P_0)\{D^\star(P) - D^\star(P_0)\} + R_2(P, P_0). \qquad (5)$$

Equation (5) is critically important to characterizing the asymptotic behavior of regular and asymptotically linear estimators of Ψ δ (P 0 ) ≡ ψ 0,δ . The first term can be shown to converge readily to a zero-centered Gaussian distribution with variance equal to the variance of the EIF, i.e., $N(0, P_0\{D^\star(P_0)(O)\}^2)$, by the central limit theorem. The second term, which corresponds to "plug-in bias", generally fails to vanish asymptotically, motivating the development of specialized correction strategies for this express purpose (e.g., the targeted learning framework of van der Laan & Rose (2011)). Under standard empirical process conditions (on nuisance parameter estimators), the third term can be shown to converge to zero in probability; moreover, cross-validation can be used to circumvent even these conditions (Klaassen, 1987; Zheng & van der Laan, 2011). The last term, R 2 (P, P 0 ), is a second-order remainder associated with the scaled difference of Ψ δ (P ) and Ψ δ (P 0 ); furthermore, it is typical to assume necessary rates of nuisance parameter convergence such that $R_2(P, P_0) = o_P(n^{-1/2})$. Such an assumption would be satisfied, for example, if all nuisance estimators were to converge to their target functionals at $n^{-1/4}$-rate. Under standard conditions, the asymptotic efficiency of regular asymptotically linear estimators may be described via the EIF:

$$\psi_{n,\delta} - \psi_{0,\delta} = P_n D^\star(P_0) + o_P(n^{-1/2}),$$

where ψ n,δ ≡ Ψ δ (P n ) is an asymptotically linear estimator of the target parameter ψ 0,δ . The form of the expansion emphasizes that the EIF measures the quality of the estimator ψ n,δ , and that ψ n,δ must solve the EIF estimating equation to be asymptotically efficient. For estimation of ψ 0,δ , Díaz & van der Laan (2018) defined a direct (or substitution) estimator based on the G-computation formula. This estimator is

$$\psi^{\mathrm{SUB}}_{n,\delta} = \int Q_{n,Y}(d(a, w; \delta), w)\, dQ_{n,AW}(a, w), \qquad (6)$$

where Q n,AW (a, w) is merely an appropriate estimate of the joint distribution of (A, W ).
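As a schematic illustration, with the empirical distribution of (A, W) standing in for Q n,AW and an additive MTP, the direct estimator (6) reduces to averaging outcome-regression predictions at shifted treatment values. The sketch below is illustrative only (the object names are not the haldensify interface), assuming a fitted outcome regression with a predict method:

```r
# Sketch: the substitution (G-computation) estimator (6), taking the empirical
# distribution of (A, W) as Q_{n,AW} and the additive MTP d(a, w; delta) = a + delta.
# `Qbar_fit` is a hypothetical fitted outcome regression (e.g., from glm or HAL).
psi_sub <- function(Qbar_fit, data, delta) {
  shifted <- transform(data, A = A + delta)            # post-MTP treatment values
  preds <- predict(Qbar_fit, newdata = shifted, type = "response")
  mean(preds)                                          # average over empirical (A, W)
}
```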
An inverse probability weighted (IPW) estimator of ψ 0,δ takes the form

$$\psi^{\mathrm{IPW}}_{n,\delta} = P_n\{H_n(A, W)\, Y\} = n^{-1} \sum_{i=1}^n \frac{\tilde g_{n,A}(A_i \mid W_i)}{g_{n,A}(A_i \mid W_i)}\, Y_i. \qquad (7)$$

Popular in practice, the stabilized IPW estimator, an adaptation of the estimator of equation (7), uses standardization to reduce the impact of unstable inverse weights:

$$\psi^{\mathrm{IPW}}_{n,\delta} = \left\{\sum_{i=1}^n H_n(A_i, W_i)\right\}^{-1} \sum_{i=1}^n H_n(A_i, W_i)\, Y_i, \qquad (8)$$

where H n (a, w) := g̃ n,A (a | w)/g n,A (a | w). In equations (6), (7), (8), and in the sequel, the subscript n denotes quantities estimated based on the empirical distribution P n in lieu of their true counterparts (at P 0 ∈ M); that is, g n,A is an estimate of g 0,A , while Q n,Y is an estimate of Q 0,Y . Occasionally, subscripts are omitted entirely, in which case a given quantity should be taken as being estimated at an arbitrary distribution P ∈ M. Relevant to our discussion of H(A, W ), Díaz et al. (2021), attempting to avoid the laborious task of density estimation, proposed direct estimation of this ratio of inverse probability weights, utilizing a re-parameterization based on the odds of a variable in an artificial dataset constructed so as to include records for all units under both g̃ 0,A and g 0,A . Their approach adapted earlier ideas exploited by, among others, Qin (1998), Cheng & Chu (2004), and van der Laan et al. (2018b) in the context of causal inference and explored independently in the context of general machine learning (e.g., Sugiyama et al., 2010, 2012). Recently, Nugent & Balzer (2021, see page 8), in demonstrating causal inference with MTPs, praised the computational convenience of the direct ratio estimation approach, relative to direct estimation of the conditional density g 0,A . Yet, this approach is only applicable in limited settings. To see this, consider the following shortcomings of this approach. Firstly, direct ratio estimation is prone to producing biased estimated weights, which arise from the direct estimate taking nonzero values even when g̃ 0,A is truly zero. This amounts to an artificial (that is, algorithmically induced) violation of the positivity assumption A3, requiring manual correction even in standard software implementations. Secondly, there are numerous settings in which such a re-parameterization categorically fails, including in the estimation of weights for stochastic interventions that are not MTPs or in causal mediation analysis with and without MTPs (e.g., Hejazi et al., 2022b). Thirdly, and perhaps most critically, the ratio of inverse probability weights is not amenable to likelihood-based estimation or inference, as it is not itself a component of the underlying likelihood of the data-generating process. Lastly, such a representation leads to "artificial" score terms that are incompatible with scores based on g 0,A , severely complicating any efforts to derive semiparametric-efficient estimators within several well-studied frameworks. Both the direct estimator ψ SUB n,δ and the IPW estimator ψ IPW n,δ require estimation of only a single nuisance parameter (the outcome mechanism and the conditional treatment density, respectively), but are well-known to be irregular when the corresponding nuisance parameter is estimated flexibly, to suffer from residual asymptotic bias, and generally to fail to achieve the nonparametric efficiency bound. What's more, neither estimator is asymptotically linear for ψ 0,δ when flexible regression strategies are used for nuisance parameter estimation, significantly limiting the real-world scenarios in which these approaches may be successfully applied.
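For concreteness, a minimal sketch of the stabilized IPW estimator (8) under the additive MTP, assuming a fitted conditional density estimator with a predict method returning estimates of g n,A (a | w) at user-supplied pairs (an illustrative interface, ignoring the support-boundary handling in (1)):

```r
# Sketch: stabilized IPW estimator (8) for the additive MTP. `dens_fit` is a
# hypothetical conditional density estimator; for the shift A -> A + delta,
# the post-intervention density at A is g_n(A - delta | W).
psi_ipw_stab <- function(dens_fit, A, W, Y, delta) {
  g_nat   <- predict(dens_fit, new_A = A,         new_W = W)  # g_n(A | W)
  g_tilde <- predict(dens_fit, new_A = A - delta, new_W = W)  # g-tilde_n(A | W)
  H_n <- g_tilde / g_nat                   # estimated weights H_n(A, W)
  sum(H_n * Y) / sum(H_n)                  # standardization stabilizes the weights
}
```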
These notable issues, among others, have contributed to the rise in popularity of doubly robust estimators; accordingly, we briefly review two such estimators of ψ 0,δ in Section S4 of the Supplementary Materials. We focus instead on a nonparametric estimator g n,A of the generalized propensity score g 0,A , which facilitates construction of nonparametric, asymptotically efficient IPW estimators of ψ 0,δ when coupled with sieve estimation and semiparametric efficiency theory. In our subsequent developments, we make use of a nonparametric function estimator, the highly adaptive lasso (HAL) (van der Laan, 2017; van der Laan & Bibaut, 2017), which approximates a functional parameter of interest (e.g., conditional mean, hazard, density, conditional density) using a linear combination of basis functions. HAL requires that the target functional belong to the set of càdlàg (i.e., right-hand continuous with left-hand limits) functions with sectional variation norm bounded by a finite (but unknown) constant, a global constraint that limits the degree of variability that the target functional can exhibit. Similar nonparametric estimation approaches generally make use of local smoothness assumptions. While developments similar to ours may be possible under alternative assumptions, pursuing such a course of study lies outside our scope here. For a càdlàg function $f : [0, \tau] \subset \mathbb{R}^d \to \mathbb{R}$, the sectional variation norm may be expressed $\|f\|_v^\star = |f(0)| + \sum_{s \subset \{1, \ldots, d\}} \int_{(0_s, \tau_s]} |df_s(u_s)|$, where the sum is over all subsets s of {1, . . . , d} and $f_s(u_s) := f(u_1 \mathbb{I}(1 \in s), \ldots, u_d \mathbb{I}(d \in s))$. Thus, f s (u s ) is a section of f that sets the components not in s to zero, that is, allowing variation only along those components belonging to u s . Per Qiu et al. (2021), this definition of variation norm corresponds with the notion of Hardy-Krause variation (Owen, 2005). van der Laan (2015, 2017) proved that the HAL estimator converges to the target functional at a rate faster than $n^{-1/4}$, without any contribution from the dimensionality d of the problem at hand, so long as d remains fixed. Subsequent theoretical investigations improved this rate of convergence to $n^{-1/3} \log(n)^{d/2}$, with ongoing work yielding promising further improvements still (van der Laan, 2022). The HAL estimator proceeds in two steps. To start, a rich set of indicator (or higher-order, i.e., spline) basis functions is generated to represent the target functional; this step is a mapping of the covariate space in terms of the HAL basis. Lasso regression (Tibshirani, 1996) is then applied to a linear combination of these basis functions, minimizing the expected value of an appropriate loss function while constraining the L 1 -norm of the vector of coefficients (its approximate sectional variation norm) to be bounded by a finite constant. The utility of the HAL estimator has been demonstrated in an extensive series of numerical experiments. The HAL estimator is implemented in the free and open source hal9001 package (Hejazi et al., 2020a) for the R language and environment for statistical computing (R Core Team, 2022). When a nuisance parameter of interest is taken as the target functional, the HAL estimator may be applied to generate initial estimates, under the assumption that the true nuisance parameter functional (e.g., the generalized propensity score) is of finite sectional variation. When the nuisance parameter η, with arbitrary input Z, is real-valued, its HAL representation may be expressed $\eta(z) = \eta(0) + \sum_{s \subset \{1, \ldots, d\}} \int_{(0_s, z_s]} d\eta_s(u_s)$, which can be approximated by the use of a discrete measure placing mass on each observed Z s,i . When the range of η is the unit interval, an analogous approach, using instead logit(η), may be leveraged based on a representation of Gill et al. (1995); notably, Ertefaie et al. (2022) work with this representation in using HAL regression for estimation of the propensity score in the case of binary treatments.
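Because HAL regression recurs throughout what follows, a minimal example of fitting one with the hal9001 package may help fix ideas. This is a sketch on simulated data; the fit_hal/predict interface shown follows the package's documentation, and defaults may differ across versions:

```r
# Sketch: fitting a HAL regression for a conditional mean with hal9001.
library(hal9001)

set.seed(721)
n <- 500
W <- matrix(runif(n * 3), ncol = 3)                 # three baseline covariates
A <- rnorm(n, mean = 2 * W[, 1] - W[, 2], sd = 1)   # continuous "treatment"

# HAL fit of E[A | W]; the lasso penalty is chosen internally by cross-validation
hal_fit <- fit_hal(X = W, Y = A, family = "gaussian")
A_preds <- predict(hal_fit, new_data = W)           # fitted values on the HAL basis
```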
Now, take Z s,i to be support points of η s and let φ s,i (c s ) := I(Z s,i ≤ c s ); then

$$\eta_\beta(z) = \beta_0 + \sum_{s \subset \{1, \ldots, d\}} \sum_{i=1}^n \beta_{s,i}\, \phi_{s,i}(z_s),$$

where $|\beta_0| + \sum_{s \subset \{1, \ldots, d\}} \sum_{i=1}^n |\beta_{s,i}|$ is an approximation of the sectional variation norm of η. The loss-based HAL estimator β n may then be defined as

$$\beta_{n,\lambda} := \arg\min_{\beta : |\beta_0| + \sum_{s} \sum_{i} |\beta_{s,i}| \le \lambda} P_n L(\eta_\beta),$$

where L(·) is an appropriately chosen loss function; see Dudoit & van der Laan (2005) for an illuminating discussion of appropriate choices of loss function for a range of loss-based estimation problems. Salient to our goal of estimating the generalized propensity score, the negative log-density loss, L(η) = − log(p η ), is suitable for density estimation. Finally, the HAL estimate of η may be denoted $\eta_{n,\lambda} \equiv \eta_{\beta_{n,\lambda}}$. Each choice of the regularization term λ corresponds to a unique HAL estimator, though methods for the selection of λ must be tailored to the estimation goal in order to yield suitable candidate estimators. We now turn to procedures for estimation of the generalized propensity score g 0,A , from which we may subsequently develop efficient, nonparametric estimators of ψ 0,δ . For compatibility with our subsequent theoretical results, we present our conditional density estimation procedure in the context of the HAL estimator discussed in Section 2.5; this is chiefly for three reasons. Firstly, formal theoretical results guarantee a suitable convergence rate for estimator construction when the HAL estimator is used to approximate nuisance functionals (van der Laan, 2017). Secondly, contemporaneous efforts have made progress in developing efficient direct and IPW estimators of low-dimensional parameters in causal inference settings (e.g., van der Laan et al., 2019; Ertefaie et al., 2022). Thirdly, the HAL regression algorithm is readily available and accessible via the free and open source hal9001 R package (Hejazi et al., 2020a). In considering estimation of g 0,A , a straightforward strategy involves assuming a parametric working model for a few moments of the density functional, allowing the use of standard regression techniques to generate suitable estimates (Robins et al., 2000; Hirano & Imbens, 2004). For example, one could operate under the working assumption that the density of A given W follows a Gaussian distribution with homoscedastic variance and mean $\sum_{j=1}^p \beta_j \phi_j(W)$, where φ = (φ j : j) are user-specified basis functions and β = (β j : j) are unknown regression parameters. Under such a regime, a density estimate could be generated by fitting a linear regression of A on φ(W ) to estimate E[A | W ], paired with maximum likelihood estimation of the variance of A. In this case, the estimated conditional density would be given by the density of a Gaussian distribution evaluated at these estimates. Since such strategies make strong assumptions about the form of the conditional density functional, they are more prone to model misspecification than alternatives that strive to limit reliance on structural assumptions.
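A minimal sketch of this Gaussian working-model strategy (illustrative only; not part of haldensify):

```r
# Sketch: working-model conditional density estimator, using a linear model
# for E[A | W] and the MLE of the residual variance. `W` is a data.frame of
# covariates; the returned function evaluates the estimated g_n(a | w).
gauss_density_fit <- function(A, W) {
  fit <- lm(A ~ ., data = data.frame(A = A, W))
  sigma2_mle <- mean(residuals(fit)^2)   # MLE (not the unbiased) variance estimate
  function(a, w_new) {
    mu <- predict(fit, newdata = w_new)
    dnorm(a, mean = mu, sd = sqrt(sigma2_mle))
  }
}
```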
Constructing a flexible density estimator is a more challenging problem, as the set of available tools is limited. These limitations motivated our investigation of conditional density estimation approaches capable of handling arbitrary regression functions, which have been formulated and applied only sporadically in causal inference (e.g., van der Laan, 2010; Díaz & van der Laan, 2011; Sofrygin et al., 2019). We next describe a flexible class of nonparametric estimators, implemented in the haldensify R package (Hejazi et al., 2022a); two more restrictive classes of semiparametric conditional density estimators are outlined in Section S3 of the Supplementary Materials. Estimators that eschew any assumptions on the form of the conditional density are a rarity. Notably, van der Laan (2010) and Díaz & van der Laan (2011) proposed constructing estimators of this nuisance functional by a combination of pooled regression and exploitation of the relationship between the (conditional) hazard and density functions. Our proposal builds upon theirs, replacing their recommendation of an arbitrary binary regression model with HAL regression. This requires the key change of adjusting the penalization step of HAL regression to respect the use of a loss function appropriate for density estimation, namely, L(·) = − log(g n,A ) (Dudoit & van der Laan, 2005). To build an estimator of a conditional density, Díaz & van der Laan (2011) considered discretizing the observed A ∈ A based on a number of bins T and a binning procedure (e.g., T bins of equal length), where the choice of T corresponds conceptually to the choice of bandwidth in kernel density estimation. To take an example, an instantiation of this procedure would divide the observed support of A into, say, T = 7 bins of equal length. Such a partitioning would require T + 1 = 8 cutpoints α 1 < α 2 < · · · < α 8 along this support, yielding the T bins [α 1 , α 2 ), [α 2 , α 3 ), . . . , [α 7 , α 8 ). Next, a reformatted, repeated measures dataset would be generated by representing each observational unit by up to T records (using only as many records as needed), with the number of records for a given unit matching the rank of the bin into which the unit's A i falls. To clarify, consider an individual unit {A i , W i } for which the value A i falls in the fifth of the seven bins into which the support has been partitioned (i.e., [α 5 , α 6 )). Five distinct records would be used to represent the data on this single unit: each record pairs W i with a bin index j = 1, . . . , 5 and a binary indicator taking the value 1 only in the fifth record, the bin containing A i . The resultant repeated measures structure allows for the hazard probability of A i falling in a given bin along its support to be evaluated via pooled hazard regression (requiring only binary regression procedures). This proposal effectively reformulates the problem into a corresponding set of hazard regressions: the bin-membership probability P(A ∈ [α t−1 , α t ) | W ) may be directly estimated via a binary regression procedure, by re-expressing the corresponding likelihood in terms of the likelihood of a binary variable in a repeated measures dataset with this structure. The hazard estimates may then be mapped to density estimates by re-scaling the hazard estimates by the bin sizes. We formalize this procedure in Algorithm 1. Originally, a key element of this proposal was the flexibility to use any arbitrary binary regression procedure to estimate P(A ∈ [α t−1 , α t ) | W ), facilitating the incorporation of flexible regression techniques like ensemble modeling (e.g., van der Laan et al., 2007). We alter this proposal, replacing the arbitrary estimator of P(A ∈ [α t−1 , α t ) | W ) with HAL regression, making it possible for the resultant conditional density estimator to achieve a convergence rate with respect to a loss-based dissimilarity of $n^{-1/4}$ under global smoothness assumptions (van der Laan, 2017; van der Laan & Bibaut, 2017), as discussed in Section 2.5. We stress that this is a critical advance, essential to the asymptotic properties of our nonparametric estimators of ψ 0,δ .

Algorithm 1: Pooled hazards conditional density estimation
Result: Estimates of the conditional density of A, given W.
Input:
1 An observed data vector of the continuous treatment for n units: A
2 An observed data vector of the baseline covariates for n units: W
3 The number of bins into which the support of A is to be divided: T
4 A procedure to discretize the support of A into T bins: ω(·)
Steps:
1 Discretize the support of A into T bins by applying ω(·).
2 Expand the observed data into a repeated measures data structure, expressing each unit by up to T records (recording the unit ID with each). For a given unit i, the set of records is {A ij , W ij } for j = 1, . . . , T i , where the W ij are constant in j, A ij is a binary counting process jumping to 1 at T i , and T i ≤ T indicates the bin in which A i belongs.
3 Estimate, via HAL regression, the conditional hazard probability of bin membership, selecting λ n by cross-validated empirical risk minimization of an appropriate loss function (Dudoit & van der Laan, 2005).
4 Rescale the conditional hazard probability estimates to the conditional density scale by dividing the cumulative hazard by the bin widths for each A i .
Output: g n,A , an estimate of the generalized propensity score.
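To make steps 1-2 of Algorithm 1 concrete, consider the following minimal sketch of the discretization and repeated-measures expansion (illustrative only; haldensify implements this internally, with different function and argument names):

```r
# Sketch: expand (A, W) into the long-format data used for pooled hazard
# regression, with T equal-width bins and a binary indicator jumping to 1
# in the bin containing each unit's A_i.
expand_pooled_haz <- function(A, W, T_bins = 7) {
  W <- as.data.frame(W)
  breaks <- seq(min(A), max(A), length.out = T_bins + 1)   # T + 1 cutpoints
  bin_id <- cut(A, breaks = breaks, labels = FALSE, include.lowest = TRUE)
  do.call(rbind, lapply(seq_along(A), function(i) {
    data.frame(
      id = i,
      bin = seq_len(bin_id[i]),                            # up to T_i records
      in_bin = as.numeric(seq_len(bin_id[i]) == bin_id[i]),# jumps to 1 at T_i
      W[rep(i, bin_id[i]), , drop = FALSE]
    )
  }))
}
```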
We now consider the construction of estimators of ψ 0,δ that rely solely on nuisance estimation of the generalized propensity score g 0,A . Data adaptive estimators of nuisance functionals are generally incompatible with the direct (i.e., G-computation) and IPW estimators, as conditions necessary for achieving asymptotic desiderata (e.g., minimal bias, efficiency) are unattainable without imposing strong smoothness assumptions on the functional form of the nuisance parameter estimator. This theoretical impasse has, in part, fueled the considerable popularity enjoyed by doubly robust estimation methodologies, such as the one-step estimation or targeted minimum loss estimation frameworks (van der Laan & Rubin, 2006; van der Laan & Rose, 2011, 2018), both of which we review in the context of the estimation of ψ 0,δ in Section S4 of the Supplementary Materials. Briefly, doubly robust estimators require estimation of both the outcome mechanism Q 0,Y and the propensity score g 0,A ; moreover, such estimators are consistent for ψ 0,δ when either of the two nuisance parameter estimators converges to its target, but asymptotically efficient only when both converge quickly enough. When the outcome mechanism Q 0,Y proves challenging to estimate well, it is tempting to consider the use of IPW estimation, which eschews estimation of Q 0,Y altogether. The construction of consistent and asymptotically efficient IPW estimators capable of incorporating data adaptive estimation of g 0,A has been considered before. For example, Hirano et al. (2003) proposed a logistic series estimator of g 0,A , which requires fairly strong smoothness assumptions (i.e., k-times differentiability), while van der Laan (2014) investigated targeted (parametric) fluctuation of g n,A to construct efficient IPW estimators, which proved to be asymptotically linear but irregular. Using HAL to estimate g n,A , Ertefaie et al. (2022) developed undersmoothing strategies for asymptotically linear and efficient IPW estimation. All of these authors exclusively considered settings with binary treatments. We develop nonparametric-efficient IPW estimators, leveraging Algorithm 1 to construct g n,A , and build partially upon the contributions of Ertefaie et al. (2022), who demonstrated that undersmoothing strategies could be used to select a HAL estimator of g 0,A := P(A = 1 | W ), ultimately yielding IPW estimators (of the causal effects of static interventions) that achieve the nonparametric efficiency bound.
Ertefaie et al. (2022) formulate conditions under which g n,A estimated via HAL converges to g 0,A at a suitably fast rate for their nonparametric IPW estimator to be asymptotically efficient. Their methods rely upon undersmoothing of the HAL-estimated g n,A (relative to that selected by V-fold cross-validation) so as to include a richer set of basis functions than is required for optimal estimation of g 0,A . We propose two classes of selection procedures for undersmoothing HAL estimators of g 0,A . These strategies share a common first step: construction of a family of HAL-based conditional density estimators g n,A indexed by the regularization term λ, i.e., {g n,A,λ : λ 1 , . . . , λ K }. For this, we recommend applying Algorithm 1, altering it to use cross-validation only to select the strictest degree of regularization from which to begin the candidate sequence in λ for undersmoothing. Rather than returning a single estimator g n,A , the algorithm outputs the family {g n,A,λ : λ 1 , . . . , λ K }, where λ 1 ≡ λ CV denotes the degree of regularization chosen by the cross-validation selector and λ 1 > . . . > λ K uniformly in the index set. In the sequel, we assume access to a sequence of generalized propensity score estimators {g n,A,λ : λ 1 , . . . , λ K } and a corresponding sequence of IPW estimators {ψ n,δ,λ : λ 1 , . . . , λ K }. Then, what remains is to select an IPW estimator exhibiting desirable asymptotic properties from this sequence of candidates. With similar goals, Ertefaie et al. (2022) propose two undersmoothing criteria: (1) minimization of the mean of the EIF estimating equation up to a theoretically desirable degree, and (2) minimization of a score term (A − g n,A ) attributable to the treatment mechanism. Their first selector explicitly uses the form of the EIF and must be derived anew for any given intervention class (e.g., static vs. stochastic). We provide the first explicit re-characterization of the EIF (4) in a form suitable for IPW estimator selection based on this strategy. Their second selection procedure is incompatible with MTPs, as there is no explicit score term for the treatment mechanism in this setting. Intuitively, this is attributable to the fact that MTPs depend on the natural value of A, which is not the case for simpler static or dynamic regimes. As an alternative, we develop a class of selection procedures sensitive to changes along the sequence of IPW estimators {ψ n,δ,λ : λ 1 , . . . , λ K }. As part of these proposals, we provide, to the best of our knowledge, the first demonstration of conditional density estimator undersmoothing in causal inference. To select an IPW estimator from the sequence {ψ n,δ,λ : λ 1 , . . . , λ K } based on the EIF, it is necessary to re-express the EIF in a form incorporating the IPW estimating function, which can be done by projecting this object onto the space of all functions of A that are mean-zero conditional on W (van der Laan & Robins, 2003). The IPW mapping defining the estimating function of the stabilized IPW estimator (8) is

$$U(g_A, \psi)(o) := H(a, w)(y - \psi) = \frac{\tilde g_A(a \mid w)}{g_A(a \mid w)}\,(y - \psi). \qquad (10)$$

Note that the IPW estimator (8) is merely the solution to the estimating function (10) (i.e., obtained by setting P n U (g A , ψ) = 0). We next present Lemma 1, which characterizes the form of the EIF estimating equation suitable for IPW estimator selection. Lemma 1 (IPW representation of the EIF). Let D*(P ) denote the EIF (4) and let U (g A , ψ) denote the IPW estimating function (10).
Projecting U (g A , ψ) onto T CAR , the space of all functions of A that are mean-zero conditional on W, yields D CAR (P ). Using U (g A , ψ) and D CAR (P ), the EIF is D*(P ) = U (g A , ψ) − D CAR (P ) (van der Laan & Robins, 2003). Among a family of IPW estimators {ψ n,δ,λ : λ 1 , . . . , λ K }, an asymptotically efficient estimator minimizes |P n D CAR (g n,A,λ , Q n,Y )| for some Q n,Y , per Theorem 1, which we will formally present shortly. Such an IPW estimator is an approximate solution to the estimated EIF as per Lemma 1. The lemma's derivation is provided in the Supplementary Materials. Since IPW estimators are constructed as explicit solutions to P n U (g A , ψ) = 0, the first term of the EIF representation in the lemma is solved; thus, selection of an asymptotically efficient IPW estimator need only consider D CAR . Evaluating D CAR requires integration over A, which can prove computationally taxing on account of the need to combine techniques for numerical integration with those for conditional density estimation. When A is discrete with known bounds, this integral expression is easily simplified; however, when A is continuous, evaluation of D CAR as a basis for developing a selection criterion can be impractical. To resolve this, we introduce the following alternative expression as a practical innovation:

$$D_{\mathrm{CAR}}(P)(a, w) = H(a, w)\{Q_Y(a, w) - \psi\} - \{Q_Y(d(a, w; \delta), w) - \psi\},$$

which is obtained by combining D CAR with another score term appearing in the EIF; details are given in Section S1 of the Supplementary Materials. Via this simplification, D CAR can be made to serve reliably as a suitable criterion for both discrete and continuous A. Now, a targeted selector, making use of the form of D CAR , is

$$\lambda_n := \arg\min_{\lambda \in \{\lambda_1, \ldots, \lambda_K\}} \big| P_n D_{\mathrm{CAR}}(g_{n,A,\lambda}, Q_{n,Y}) \big|. \qquad (11)$$

By minimizing the empirical mean of the estimated EIF absolutely, this selector runs the risk of undersmoothing past the point at which asymptotic efficiency is attained. Another suitable selection criterion, proposed by Ertefaie et al. (2022), chooses the first λ n at which |P n D CAR (g n,A,λ , Q n,Y )| falls below {σ n,λCV / log(n)}, where σ n,λCV is the standard error estimate based on the EIF at the cross-validated choice of λ; this criterion balances asymptotic bias against standard error while prioritizing asymptotic efficiency.
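Both targeted criteria reduce to simple rules over the candidate grid once |P n D CAR | has been computed for each λ. A schematic sketch, with all inputs assumed precomputed and all names illustrative:

```r
# Sketch: targeted undersmoothing selectors, given `dcar_means`, a vector of
# |P_n D_CAR(g_{n,A,lambda}, Q_{n,Y})| across the lambda grid, and `se_cv`,
# the EIF-based standard error at the CV-selected lambda.
select_lambda_targeted <- function(lambda_grid, dcar_means, se_cv, n,
                                   type = c("min", "tolerance")) {
  type <- match.arg(type)
  if (type == "min") {
    # criterion (11): minimize |P_n D_CAR| over the grid
    lambda_grid[which.min(dcar_means)]
  } else {
    # Ertefaie et al. (2022)-style: first lambda at which |P_n D_CAR| falls
    # below se_cv / log(n), trading bias reduction against standard error
    lambda_grid[which(dcar_means < se_cv / log(n))[1]]
  }
}
```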
To characterize the efficiency of our proposed IPW estimators more completely, we examine the second-order remainder R 2 of the EIF expansion (5). Such higher-order representations provide insight into the behavior of efficient estimators, and an exact expression for R 2 can be used to study the degree to which candidate estimators achieve higher-order forms of efficiency. Lemma 2 provides an expression for this second-order term; a proof is given in Section S2 of the Supplementary Materials. Lemma 2 (Second-order expansion of the EIF). Denote by {g n,A , Q n,Y } the nuisance estimators of {g 0,A , Q 0,Y }, the target nuisance functionals; then, R 2 is

$$R_2(g_{n,A}, Q_{n,Y}, g_{0,A}, Q_{0,Y}) = \int_{\mathcal{W}} \int_{\mathcal{A}} \{Q_{0,Y}(a, w) - Q_{n,Y}(a, w)\} \left\{\frac{\tilde g_{n,A}(a \mid w)}{g_{n,A}(a \mid w)}\, g_{0,A}(a \mid w) - \tilde g_{0,A}(a \mid w)\right\} d\mu(a)\, dQ_{0,W}(w).$$

Careful examination of Lemma 2 reveals that the form of R 2 is intimately related to that of D CAR , which appears in the IPW representation of the EIF in Lemma 1. This remainder term characterizes the contributions to asymptotic efficiency of differences between the nuisance estimators {g n,A , Q n,Y } and the corresponding nuisance functionals {g 0,A , Q 0,Y }. As such, this second-order term may be used to deepen our understanding of how improved nuisance estimation can impact estimator efficiency in a manner that standard (first-order) EIF expansions are not equipped to capture. Notably, the term R 2 cannot be evaluated in real-world settings as it requires explicit knowledge of the forms of {g 0,A , Q 0,Y }, which are never available in practice; however, the form of R 2 itself can still be useful in (1) deriving higher-order efficient estimators (e.g., Robins et al., 2008) and (2) numerically benchmarking the relative efficiency of both established and novel estimators. We next characterize the asymptotic properties of our undersmoothed IPW estimators in Theorem 1, which essentially demonstrates that IPW estimators are asymptotically efficient under regularity and rate-convergence conditions considered standard in semiparametric estimation. We include its proof below since the rationale is straightforward. First, consider the following conditions, necessary to establish Theorem 1. C1 (Bounded sectional variation). The nuisance estimators g n,A,λn and Q n,Y are càdlàg with sectional variation norm bounded by a universal constant M. C2 (Solution of the D CAR estimating equation). $P_n D_{\mathrm{CAR}}(g_{n,A,\lambda_n}, Q_{n,Y}) = o_P(n^{-1/2})$. C3 (Vanishing remainder). $R_2(g_{n,A,\lambda_n}, Q_{n,Y}, g_{0,A}, Q_{0,Y}) = o_P(n^{-1/2})$. Theorem 1 (Asymptotic linearity and efficiency of undersmoothed IPW). Let λ n be an undersmoothed selection along the HAL regularization path in λ, e.g., as in expression (11), and let ψ n,δ,λn denote the solution to the estimating equation P n U (g n,A,λn , ψ) ≈ 0 in ψ. Then, under Conditions C1-C3, we have $\psi_{n,\delta,\lambda_n} - \psi_{0,\delta} = P_n D^\star(P_0) + o_P(n^{-1/2})$, so that ψ n,δ,λn is asymptotically efficient in the nonparametric model M. Proof. Define ψ̃ n,δ,λn as the solution in ψ to the EIF estimating equation P n D*(g n,A,λn , Q n,Y , ψ) = 0; under Conditions C1-C3, $P_n D^\star(g_{n,A,\lambda_n}, Q_{n,Y}, \tilde\psi_{n,\delta,\lambda_n}) = o_P(n^{-1/2})$. Since g n,A,λn and Q n,Y are HAL estimators, they are both of bounded sectional variation norm with universal bound M, fulfilling Condition C1 (van der Laan, 2017). Note that Condition C3 is standard in semiparametric estimation and Condition C2 is satisfied by the design of our selection procedures. Then, applying standard arguments for doubly robust estimation (e.g., van der Laan & Rose, 2011; Kennedy, 2016) yields $\tilde\psi_{n,\delta,\lambda_n} - \psi_{0,\delta} = P_n D^\star(P_0) + o_P(n^{-1/2})$. The result follows from noticing that, since both $P_n D_{\mathrm{CAR}}(g_{n,A,\lambda_n}, Q_{n,Y}) = o_P(n^{-1/2})$ and D*(P ) = U (g A , ψ) − D CAR (P ), it must also hold that $\psi_{n,\delta,\lambda_n} = \tilde\psi_{n,\delta,\lambda_n} + o_P(n^{-1/2})$. The proof establishes that an IPW estimator selected based on an undersmoothing criterion like (11) will be equivalent to an asymptotically efficient estimator under standard rate-convergence and entropy conditions on the nuisance estimators, which HAL has been shown to satisfy. Even the entropy conditions can be further relaxed by cross-validating the IPW estimator (Klaassen, 1987; Zheng & van der Laan, 2011). While the theorem's results require that the estimating equation $P_n D_{\mathrm{CAR}}(g_{n,A,\lambda_n}, Q_{n,Y}) = o_P(n^{-1/2})$, and, by extension, an estimating equation based on D*, be solved reasonably well, we may not explicitly need to use D CAR as part of a selection criterion, as we detail in the sequel. Explicit (re)characterization of the EIF for the selection of an IPW estimator can prove a challenging endeavor, requiring specialized and tedious mathematical manipulations. As such, it is often preferable to select an IPW estimator in a manner agnostic to the EIF's form. Firstly, such agnostic selection procedures are better aligned philosophically with IPW estimation, which avoids explicit modeling of the outcome mechanism (n.b., the EIF always contains a term for Q 0,Y ). Secondly, agnostic procedures may solve score equations beyond those explicitly appearing in the EIF expression, allowing them to automatically and blindly satisfy higher-order efficiency desiderata. We formulate two selection procedures that do not use the form of the EIF, instead considering properties of the IPW estimators {ψ n,δ,λ : λ 1 , . . . , λ K } along their regularization trajectory {λ 1 , . . . , λ K }.
In the context of nonparametric sieve estimation, such ideas were formulated in the seminal work of Lepskii (1992, 1993), with sporadic extensions since (e.g., Davies & van der Laan, 2014; Mukherjee et al., 2015; van der Laan et al., 2018a; Cai & van der Laan, 2020; Qiu et al., 2021). The aim of formulating such EIF-agnostic undersmoothing selection procedures is to produce selections similar to (or better than) those of the targeted selectors. For IPW estimators based on these agnostic selectors to asymptotically attain the nonparametric efficiency bound, these estimators must solve the EIF estimating equation. Our first proposal in this class balances changes along the regularization sequence {λ 1 , . . . , λ K } in the IPW estimator ψ n,δ,λ against changes in its estimated standard error σ n,λ . The selection λ n is merely the first element λ j in the sequence {λ 1 , . . . , λ K } for which the condition

$$\big|\psi_{n,\delta,\lambda_{j+1}} - \psi_{n,\delta,\lambda_j}\big| \le \frac{Z_{(1-\alpha/2)}}{\log(n)}\,\big(\sigma_{n,\lambda_{j+1}} - \sigma_{n,\lambda_j}\big) \qquad (12)$$

is met. Note that Z (1−α/2) is the (1 − α/2) th quantile of the standard Gaussian distribution. While we recommend the use of the stabilized IPW estimator (8) for ψ n,δ,λ , there are several candidate choices for the standard error estimator σ n,λ . We recommend basing this on the empirical variance of the estimated IPW estimating function, i.e., $n^{-1}\,\mathbb{V}[U(g_{n,A}, \psi_{n,\delta,\lambda})]$. While this variance estimator has been demonstrated to be anticonservative (contrary to naive theoretical expectations), this choice is made for the sake of computational ease, as only the successive differences in standard error estimates along the regularization path are necessary. An alternative variance estimate could be obtained via the nonparametric bootstrap, whose requisite functional smoothness conditions HAL has been shown to satisfy (Cai & van der Laan, 2020). Of course, σ n,λ could also be based on the empirical variance of the EIF, though this obviously undermines the selector's design goal of EIF-agnosticism. We stress that our choice of variance estimator is imperfect (possibly failing to reflect increases in variance that accompany the relaxation of regularization), but it provides a reasonable surrogate for variance changes that balances computational expedience and inferential stability. Examination of equation (12) reveals that this selector identifies λ n as the first point in the sequence {λ 1 , . . . , λ K } at which the relaxation of regularization impacts the IPW point estimates less than it affects their corresponding standard error estimates, indicating that a desirable bias-variance tradeoff has been achieved. The scaling of the standard error estimate by Z (1−α/2) / log(n) ensures that changes in the point estimate are weighed against changes in confidence interval width, a surrogate for inferential efficiency. A potential pitfall of the immediately preceding proposal is its requirement of standard error estimation, which can be computationally challenging (e.g., requiring the bootstrap), result in unstable or unreliable estimates (e.g., when the standard error estimator is itself poor), or require nuisance estimation beyond g 0,A (i.e., when the standard error estimate is based on the EIF). Thus, such an agnostic selector may, at times, fail to choose the optimal IPW estimator among a sequence, especially in scenarios in which standard error estimation proves challenging. It is possible to eschew such requirements altogether, relying instead entirely on the trajectory of {ψ n,δ,λ : λ 1 , . . . , λ K } alone.
Under such a strategy, the selection λ n would simply be the first term in the sequence {λ 1 , . . . , λ K } at which the trajectory taken by {ψ n,δ,λ : λ 1 , . . . , λ K } encounters a plateau. We formalize this second class of agnostic selection procedures in Algorithm 2. Such a selection procedure considers the "sharpness" of changes in the point estimates ψ n,δ,λ sequentially in {λ 1 , . . . , λ K } as the degree of regularization is relaxed. This selector aims to identify a value λ n at which the trajectory of the IPW point estimate ψ n,δ,λn encounters an inflection point, a phenomenon suggesting that further relaxations of the regularization parameter ought not contribute meaningfully to the goal of debiasing the IPW estimator through undersmoothing.

Algorithm 2: Agnostic selection via the IPW estimator trajectory
Input:
1 A sequence of candidate IPW estimates: {ψ n,δ,λ : λ 1 , . . . , λ K }
2 Corresponding standard error estimates: {σ n,λ : λ 1 , . . . , λ K }
3 The sequence of HAL estimators of the generalized propensity score: {g n,A,λ : λ 1 , . . . , λ K }
4 A positive scalar of the maximum multiplier of the CV-chosen L 1 -norm: K max
5 An algorithm to detect inflection points in a sequence: τ (·)
Steps:
1 Estimate the L 1 -norm for each IPW estimator in the given sequence by summing the coefficients of the nuisance estimators {g n,A,λ : λ 1 , . . . , λ K }, denoting these {M n,λ1 , . . . , M n,λK }. The L 1 -norm of the CV-selected estimator g n,A,λ1 is M n,λ1 .
2 Select a window of candidate IPW estimators anchored in the neighborhood of the CV-selected estimator ψ n,δ,λ1 by computing a confidence interval around it as {ψ n,δ,λ1 ± Z 1−α/2 σ n,λ1 }, with σ n,λ1 being the standard error estimate based on U (g n,A,λ1 , ψ) and Z 1−α/2 being a standard Gaussian quantile, and then setting the L 1 -norm's maximum allowed value as M n,max := K max M n,λ1 . Reduce the neighborhood of candidate estimators to {ψ n,δ,λ : λ 1 , . . . , λ J }, for J < K, falling within the confidence region and with L 1 -norms no greater than M n,max .
3 Construct a smoothed trajectory of the IPW estimators {ψ n,δ,λ : λ 1 , . . . , λ J } against the subsequence of L 1 -norms by applying LOESS (or similar).
4 Find an inflection point along the smoothed trajectory by applying τ (·). If this succeeds, denote the estimator at this index ψ n,δ . If τ (·) fails, let ψ n,δ := ψ n,δ,λ1 .
Output: ψ n,δ , a nonparametric IPW estimate of ψ 0,δ .

As this trajectory may be insufficiently smooth (e.g., when {λ 1 , . . . , λ K } is too coarse) to reveal such inflection points, LOESS can be leveraged to enforce a sufficient degree of smoothness, helping a given inflection-finding algorithm escape failure. This agnostic selection algorithm attempts to find an inflection point in the trajectory of the IPW estimators only within a limited neighborhood beyond the IPW estimator chosen by cross-validation, g n,A,λCV . This form of anchoring serves to narrow the set of allowable IPW estimators considered by the selector such that the degree of debiasing enforced by undersmoothing is not overly extreme, as undersmoothing uncontrollably could lead to the inclusion of HAL basis functions irrelevant for optimal estimation of ψ 0,δ . Intuitively, this form of anchoring ensures that this selection procedure operates in a neighborhood of L 1 -norm values roughly similar to those considered by the targeted selection procedures previously formulated, a claim inspired and corroborated by extensive numerical experiments conducted during this algorithm's development. In those rare instances in which an inflection point cannot be found within this neighborhood, the selection algorithm simply returns the IPW estimator based on g n,A,λCV .
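A schematic sketch of the core of Algorithm 2 (steps 3-4), smoothing the estimate trajectory with LOESS and locating an inflection point via discrete second differences; the inflection detector τ(·) shown here is a simple stand-in, and the windowing of step 2 is omitted:

```r
# Sketch: plateau/inflection-point selection over the IPW trajectory, given
# the sequence of IPW estimates `psi_seq` and the L1-norms `l1_norms` of the
# corresponding undersmoothed HAL fits. Illustrative only.
select_plateau <- function(psi_seq, l1_norms, fallback_idx = 1) {
  traj <- loess(psi_seq ~ l1_norms, span = 0.5)         # smoothed trajectory
  smoothed <- predict(traj, newdata = data.frame(l1_norms = l1_norms))
  curvature <- diff(smoothed, differences = 2)          # discrete 2nd derivative
  flips <- which(diff(sign(curvature)) != 0)            # sign changes = inflections
  idx <- if (length(flips) > 0) flips[1] + 1 else fallback_idx
  psi_seq[idx]                                          # fall back to CV choice
}
```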
We conjecture that an IPW estimator identified by this selection algorithm will suitably solve the EIF estimating equation and thus achieve desirable asymptotic properties, though we leave the theoretical formalization of this claim as a topic of future investigation.

We evaluated our proposed IPW estimators in simulation experiments using two distinct data-generating processes (DGPs) for the treatment mechanism. These experiments aimed to tease apart the impact of the treatment mechanism's form on the performance of our IPW estimators relative to an alternative class of doubly robust (DR) estimators popular in modern causal inference. Specifically, we compared our IPW estimators to targeted minimum loss estimators of ψ 0,δ (reviewed briefly in Section S4 of the Supplementary Materials), as these estimators can incorporate flexible regression strategies for nuisance estimation. We exclude from our comparison classical IPW estimators based on parametric estimation of g 0,A , as such alternatives are prone to model misspecification bias and generally fail to achieve the nonparametric efficiency bound (van der Laan & Robins, 2003; Ertefaie et al., 2022). Across all reported experiments, we consider IPW estimators over a regularization trajectory of 3000 preset candidate values of λ. We demonstrate that undersmoothing of g n,A (based on Algorithm 1) succeeds in debiasing our IPW estimators of ψ 0,δ , conferring benefits in terms of both bias and asymptotic efficiency. Throughout our experiments, we compare several variants of our IPW estimators ψ n,δ,λn , differing by the manner in which λ n is selected. We consider a total of six estimator variants, including (1) one in which λ n is selected by "global" cross-validation (CV), which is optimal for estimation of g 0,A but not of ψ 0,δ ; (2) two based on the targeted criteria of Lemma 1, including (11) and its tolerance-based alternative in the style of Ertefaie et al. (2022); (3) one based on the agnostic selection criterion (12); (4) another based on the agnostic selection approach of Algorithm 2; and (5) a hybrid selector combining the agnostic selection approach of Algorithm 2 with a targeted stopping point, which limits its candidates to regions along the regularization grid prior to the λ n selected by criterion (11). From among these six selectors, two, namely the "global" CV selector and the targeted selector (11) minimizing |P n D CAR (g n,A,λ , Q n,Y )|, are used as candidates for g n,A in the DR estimators, which use either HAL regression or a (misspecified) GLM to estimate Q n,Y . Here, we compare our estimators in terms of scaled bias (multiplied by $n^{1/2}$), scaled mean-squared error relative to the efficiency bound of the given DGP, and the degree to which an estimator solves the empirical mean of the EIF. In Section S5 of the Supplementary Materials, we examine estimator performance in terms of (1) the coverage of oracle confidence intervals, which use a Monte Carlo variance estimator (eschewing any problems in variance estimation); (2) the coverage of Wald-style confidence intervals, which use the empirical variance of the EIF (and may be prone to poor variance estimation); and (3) an approximated second-order remainder R 2 of the EIF. For each DGP, we consider a collection of observed data units O = (W 1 , W 2 , W 3 , A, Y ), where W 1 ∼ Bern(p = 0.6), W 2 ∼ Unif(min = 0.5, max = 1.5), and W 3 ∼ Pois(λ = 2); the generating functions for A and Y vary across the two scenarios.
For each simulation experiment, in a given scenario, we sample $n \in \{100, 200, 500\}$ units and apply the IPW and DR estimator variants to estimate $\psi_{0,\delta}$ at $\delta = 1$, repeating each experiment 300 times. We approximate $\psi_{0,\delta}$ in each scenario by applying the direct estimator (6), using the known $Q_{0,Y}$, in a single, large sample of $n = 1{,}000{,}000$. For the evaluation criteria considered, the scaled asymptotic bias, $n^{1/2}(\psi_{0,\delta} - \psi_{n,\delta,\lambda_n})$, is expected to decrease with increasing sample size for consistent estimators, while the scaled mean-squared error, $n\{(\psi_{0,\delta} - \psi_{n,\delta,\lambda_n})^2 + \sigma^2_{n,\lambda_n}\}$, relative to the efficiency bound, should converge to unity for regular efficient estimators. All experiments were performed with version 4.0.3 of the R language and environment for statistical computing (R Core Team, 2022). This setting is formulated based on a treatment mechanism governed by a Poisson distribution, in which the mean and variance of $A$ are both equally impacted by a subset of the baseline covariates $\{W_1, W_2\}$; the exact form of the treatment mechanism results in $A$ taking on discrete values in a large range. This design choice ensures that $A$ is incompatible with alternative (static or dynamic) intervention schemes but compatible with MTPs. For this scenario, the treatment mechanism takes the form
$$A \mid W \sim \mathrm{Poisson}\big(\lambda = (1 - W_1) + 0.25 W_2^3 + 2 W_1 W_2 + 4\big),$$
and the outcome mechanism is
$$Y \mid A, W \sim \mathrm{Bernoulli}\big(p = \operatorname{expit}(A + 2(1 - W_1) + 0.5 W_2 + 0.5 W_3 + 2 W_1 W_2 - 7)\big).$$
Numerical evaluation of the performance of our proposed IPW estimators of $\psi_{0,\delta}$ across the 300 simulation experiments is summarized in Figure 1. In terms of scaled bias, all of the undersmoothed IPW estimator variants (square and diamond markers) are comparable at $n = 200$ and $n = 500$, with performance indistinguishable from that of the better-performing DR estimators (triangle markers). At $n = 100$, the IPW estimators exhibit slightly more bias than their DR counterparts, though the difference is on the order of thousandths on the raw bias scale. The IPW estimator based on the "global" CV selector (circle marker) exhibits the highest degree of bias at $n = 100$ and $n = 500$, and its scaled bias does not appear to be converging to zero with increasing sample size. In terms of asymptotic efficiency, the two targeted undersmoothed IPW estimator variants (square markers) outperform the others at $n = 500$. Among the three agnostic undersmoothed IPW estimators (diamond markers), the two based on variations of Algorithm 2 outperform the IPW estimator based on Lepski's selection approach (12); however, none of these IPW estimators achieve the efficiency bound by $n = 500$. Notably, the IPW estimator based on criterion (12) displays inflated variance at $n = 500$, which we conjecture to be due to occasionally poor selections driven by unstable variance estimation, as noted previously. Interestingly, and in a manner unexpected by theory, the IPW estimator based on the "global" CV selector nearly achieves the efficiency bound at $n = 200$ and $n = 500$ but displays nontrivial bias at the larger sample size, suggesting that its efficiency arises from reduced variance rather than from the debiasing induced by undersmoothing. All of the estimator candidates appear to solve the empirical mean of the EIF quite well, suggesting that all of these estimators are asymptotically efficient by way of their being solutions to the EIF estimating equation (up to a reasonable degree).
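For concreteness, the following R sketch simulates this first scenario's DGP. Note that the cubic exponent on $W_2$ in the Poisson mean is our reading of the garbled source expression, and the function name is ours.

```r
# Sketch of the first simulation scenario's DGP, assuming the reconstructed
# mean model lambda = (1 - W1) + 0.25 * W2^3 + 2 * W1 * W2 + 4 (the exponent
# on W2 is our reading of the source), with the additive MTP d(a, w; delta)
# = a + delta at delta = 1. plogis() is R's expit.
simulate_scenario1 <- function(n) {
  W1 <- rbinom(n, 1, 0.6)
  W2 <- runif(n, 0.5, 1.5)
  W3 <- rpois(n, 2)
  A  <- rpois(n, lambda = (1 - W1) + 0.25 * W2^3 + 2 * W1 * W2 + 4)
  p  <- plogis(A + 2 * (1 - W1) + 0.5 * W2 + 0.5 * W3 + 2 * W1 * W2 - 7)
  data.frame(W1, W2, W3, A, Y = rbinom(n, 1, p))
}
set.seed(72439)
dat <- simulate_scenario1(500)
```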
Overall, the results of this set of numerical experiments speak in favor of our proposed estimators, suggesting that most, if not all, of our nonparametric IPW estimators are consistent and efficient at even moderate sample sizes. Notably, the DR estimators achieve better performance at $n = 100$ and should be preferred when the number of study units available is so limited. This experimental scenario was constructed using a treatment mechanism governed by a Negative Binomial distribution, in which the mean $\mu$ and dispersion $\nu$ are both impacted by a subset of the baseline covariates $\{W_1, W_2\}$. Recall that the variance of a Negative Binomial random variable is $\mu + \mu^2/\nu$, so that, unlike in the previous scenario, the mean and variance of $A$ need not match. In this scenario, the treatment mechanism is
$$A \mid W \sim \mathrm{NB}\big(\mu = (1 - W_1) + 0.25 W_2^3 + 2 W_1 W_2 + 4,\ \nu = 5 W_2 + 7\big),$$
while the outcome mechanism is
$$Y \mid A, W \sim \mathrm{Bernoulli}\big(p = \operatorname{expit}(A + 2(1 - W_1) + W_2 - W_3 + 1.5 W_1 W_2 - 5)\big).$$
Numerical evaluation of the performance of our proposed IPW estimators of $\psi_{0,\delta}$ across the 300 simulation experiments is summarized in Figure 2. In terms of scaled bias, the undersmoothed IPW estimators (square and diamond markers) outperform the IPW estimator based on "global" CV at $n = 100$ and $n = 200$, reflecting the debiasing enforced by undersmoothing. These novel IPW estimators display performance comparable to that of the DR estimators across all of the sample sizes considered (n.b., the unscaled bias is on the order of thousandths) and are indistinguishable from one another by $n = 500$. None of the estimator variants considered appear to show any sign of being asymptotically inconsistent. As regards asymptotic efficiency, the DR estimator variants outperform their IPW competitors at $n = 100$ but show comparable efficiency by $n = 200$; moreover, all estimators appear to be approaching the efficiency bound by $n = 500$, though the IPW estimators based on targeted undersmoothing seem to be converging more quickly (in $n$) than their counterparts based on agnostic undersmoothing criteria. Unlike in the preceding scenario, the IPW estimator using agnostic undersmoothing based on criterion (12) does not show an uptick in efficiency at $n = 500$ but instead appears to be converging to the efficiency bound. The IPW estimator based on "global" CV exhibits inflated bias and poorer efficiency than its counterparts at $n = 100$ and $n = 200$, though its performance vastly improves (to the point of being comparable) by $n = 500$; we stress that this is not expected from theory and should not be taken as a general property of IPW estimators based on such a selector. All of the IPW and DR estimator variants appear to solve the empirical mean of the EIF quite well, suggesting that all are solutions to the EIF estimating equation and, as such, capable of achieving asymptotic efficiency. Of note, with respect to this metric, the IPW estimators based on targeted undersmoothing criteria (square markers) exhibit performance closer to that of the DR estimators than do their IPW counterparts based on agnostic undersmoothing (diamond markers). As with the preceding scenario, the results of this set of numerical experiments suggest that all of our novel IPW estimators are asymptotically consistent and efficient, with performance comparable to that of DR estimators in sample sizes as low as $n = 500$.
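A hedged sketch of the corresponding treatment draw, reusing the covariates simulated above: R's rnbinom() offers a (mu, size) parameterization whose variance $\mu + \mu^2/\mathrm{size}$ matches the $\mu + \mu^2/\nu$ form given above, so size plays the role of $\nu$ here; the mean model again uses our reconstructed exponent on $W_2$.

```r
# Negative Binomial treatment draw of the second scenario, assuming R's
# rnbinom(mu, size) parameterization with size in the role of nu, so that
# Var(A | W) = mu + mu^2 / nu, as stated in the text.
A2 <- with(dat, rnbinom(nrow(dat),
                        mu   = (1 - W1) + 0.25 * W2^3 + 2 * W1 * W2 + 4,
                        size = 5 * W2 + 7))
```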
While the conclusion may remain that DR estimators ought to be preferred when sample sizes are very limited (e.g., $n = 100$), our novel IPW estimators are competitive, as measured by standard desiderata from semiparametric efficiency theory, with access to only modest sample sizes. Furthermore, as evidenced by the performance of the IPW estimators based on agnostic undersmoothing, our proposed estimators can achieve such desirable levels of performance without any explicit knowledge of semiparametric efficiency theory. We have developed nonparametric-efficient IPW estimators of the causal effects of MTPs by applying targeted or agnostic sieve estimation techniques to flexible estimators of the generalized propensity score. Our IPW estimators rely entirely on the accuracy and rate-convergence properties of the generalized propensity score estimator; moreover, our targeted selectors rely upon estimation of the outcome mechanism as well, which plays a role in the quality of the selection. There are several avenues along which our proposed estimators and selection procedures may be evaluated. For instance, further investigations may focus on evaluating the degree to which selection of an optimal undersmoothed estimator $g_{n,A,\lambda_n}$ is impacted by possibly poor estimation of $Q_{n,Y}$, as well as the degree to which the efficiency of our IPW estimators degrades under poor estimates of the latter quantity. An analytic strategy in this vein could lead to a sensitivity analysis for this class of IPW estimators. Examining how our successful application of sieve estimation principles in this setting could be expanded to other areas of causal inference reveals further avenues for potential progress, both theoretically and practically. Although the formulation and demonstration of our proposed methodology in the context of single timepoint interventions appears promising, extensions to longitudinal intervention regimes require significant and careful attention. Díaz et al. (2021) have already extended the theoretical underpinnings of MTPs to the longitudinal setting, though their estimation efforts focused exclusively on sequentially doubly robust estimators. As these authors have already derived the EIF with respect to the nonparametric model $\mathcal{M}$, our efforts could focus instead on deriving the relevant $D_{\mathrm{CAR}}$ projection term necessary for using undersmoothing techniques to construct nonparametric-efficient IPW estimators of the causal effects of longitudinal MTPs. The successful formulation of sieve-based estimation strategies in this setting promises unique challenges, as the generalized propensity score estimators at each timepoint would need to be jointly undersmoothed with the aim of solving the EIF adequately across all timepoints. No similar strategy appears to have been previously explored in the context of IPW estimation. Our examinations aimed only to develop nonparametric-efficient IPW estimators of the causal effects of MTPs, yet flexible and accurate generalized propensity score estimators are critical for formulating doubly robust estimators as well. In the Supplementary Materials, we presented evidence from numerical studies suggesting that our agnostic selectors, and one of our targeted selectors, lead to estimators that minimize the second-order remainder associated with the EIF just as well as conventional doubly robust techniques do.
An important area of future work concerns how these selectors can be combined with doubly robust estimators so as to develop procedures that are doubly robust (in terms of consistency) yet achieve improved efficiency and rate-convergence properties via undersmoothing of the generalized propensity score. Whether such estimators could be made to exhibit higher-order forms of asymptotic efficiency, and the degree to which our selection procedures would need to be adapted to this end, remains an open question, though investigations outside of the framework of sieve estimation have shown promise (e.g., Mukherjee et al., 2017; van der Laan et al., 2021).

We provide details on our theoretical arguments in Sections S1 and S2, throughout which we rely on notation and quantities introduced within the entirety of the main manuscript. In these sections, we will, at times, make use of a convenient identity for swapping $\tilde{g}_A$ for $g_A$, itself based on the change-of-variable formula for integration. For an arbitrary function $f(A, W)$ of $(A, W)$, this identity is
$$\int f(a, W)\, \tilde{g}_A(a \mid W)\, d\mu(a) = \int f(a, W)\, g_A(d^{-1}(a, W; \delta) \mid W)\, b(a, W; \delta)\, d\mu(a), \quad \mathrm{(S1)}$$
where $d(A, W)$ is an MTP following Assumption A1, as in equations (1) and (2), and, in an abuse of notation, the term $b(A, W; \delta)$ is the derivative of the inverse of $d(A, W; \delta)$, meant to refer to the appropriate inverse of its primitive, for $d(A, W; \delta)$ defined piecewise. Examination of equation (S1) reveals that the right-hand side of the equivalence is far easier to evaluate, since it requires only integration with respect to $g_A(A \mid W)$, the natural conditional treatment density, instead of integration with respect to $\tilde{g}_A(A \mid W)$, the counterfactual conditional treatment density. When the MTP is as in (1), the expression on the right-hand side further simplifies, as $b(A, W; \delta) = 1$ uniformly in the support of $d(A, W; \delta)$; however, the expression is not much more complicated when the MTP is as in (2), in which case $b(A, W; \delta) = \delta^{-1}$. Throughout the remainder of the present treatment, for simplicity of exposition, we will assume that $d(A, W; \delta)$ is the MTP (1), though similar arguments hold for alternative MTP definitions.

We now turn to deriving the projection component $D_{\mathrm{CAR}}(P_0)$ of the EIF of $\psi_{0,\delta}$, leading to Lemma 1 presented in Section 3.2 of the manuscript. We use this form of the projection $D_{\mathrm{CAR}}(P_0)$ to detail the construction of our pragmatic criterion $\bar{D}_{\mathrm{CAR}}(P_0)$. Recall that we consider MTPs of the forms (1) and (2), which imply differing post-intervention conditional treatment densities, both of which we denote as $\tilde{g}_{0,A}(A \mid W)$. These correspond to stochastic interventions of the second variety discussed in the main text, in which the MTP $d(A, W; \delta)$ shifts the natural value of the treatment $A$ by an investigator-supplied scalar $\delta(W)$, which may depend on baseline covariates $W$. To begin, we remind ourselves that $D^{\star}(P) = U(g_A, \psi) - \bar{D}_{\mathrm{CAR}}(P)$ at a distribution $P \in \mathcal{M}$, per a representation established in prior work. Specifically, in its general form, the EIF decomposes into the IPW estimating function $U(g_A, \psi)$, a score for a known stochastic intervention ($D_{\mathrm{CAR}}$), and a score for a learned stochastic intervention ($s_g$). Note that this general representation of the EIF accounts for the dependence of the post-intervention treatment density $\tilde{g}_A(A \mid W)$ on the distribution $P$, that is, the setting in which the target parameter is the counterfactual mean of $Y$ under a stochastic intervention of the first variety discussed in the main text.
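As a sanity check on the identity (S1) as reconstructed here, the following R snippet compares, by Monte Carlo, integration of a test function against $\tilde{g}_A$ (expressed via the density ratio $\tilde{g}_A/g_A$) with the substituted form that integrates against the natural density, for the additive MTP $d(a, w; \delta) = a + \delta$ (so $b \equiv 1$ and $d^{-1}(a, w; \delta) = a - \delta$); the natural density $A \mid W \sim N(W, 1)$ and the test function $f$ are arbitrary choices of ours.

```r
# Numeric spot-check of the change-of-variable identity for the additive MTP
# d(a, w; delta) = a + delta, under the assumed natural density A | W ~ N(W, 1).
set.seed(1)
n <- 1e5; delta <- 1
W <- runif(n); A <- rnorm(n, mean = W)
f <- function(a, w) a^2 + w                        # arbitrary test function
# left: integrate f against g-tilde, via the ratio g-tilde / g under g;
# here g-tilde(a | w) = g(a - delta | w), since b = 1 for the additive shift
lhs <- mean(f(A, W) * dnorm(A - delta, mean = W) / dnorm(A, mean = W))
# right: substituted form, integrating f(d(a, w; delta), w) against natural g
rhs <- mean(f(A + delta, W))
c(lhs, rhs)                                        # approximately equal
```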
While this general representation encompasses stochastic interventions that do not strictly satisfy Assumption A1, required of MTPs, it is challenging to work with on account of both the second term, which requires integration over $A$, and the third term, which involves a score based on the outcome mechanism $Q_Y$. Fortunately, when the stochastic intervention is an MTP (as in the scope of this manuscript), a significant simplification can be made. Namely, when the post-intervention conditional density is "known" (i.e., implied by an MTP $d(A, W; \delta)$), the dependence of the post-intervention density $\tilde{g}_{0,A}(A \mid W)$ on $P$ (i.e., the third term, for learning a stochastic intervention) may be dropped, and the EIF may be expressed without this score term. With this in mind, we need only segregate the IPW estimating function $U(g_A, \psi)$ from the remaining terms in this expression of the EIF $D^{\star}(P)$ in order to formulate a suitable term to be used as an objective for undersmoothing over a sequence of conditional density estimators $\{g_{n,A,\lambda} : \lambda_1, \ldots, \lambda_K\}$. Recalling the general expression for the EIF $D^{\star}(P)$ as a difference between the IPW estimating function $U(g_A, \psi)$ and any remaining terms, we denote this remainder expression $\bar{D}_{\mathrm{CAR}}$. By equivalence of these two preceding expressions for the EIF, we now define $\bar{D}_{\mathrm{CAR}} := D_{\mathrm{CAR}} - s_g$, where the two terms are as defined in the general expression of the EIF above. We stress that $\bar{D}_{\mathrm{CAR}}$ is the difference between the projection term in the case of a known stochastic intervention $\tilde{g}_A$ and the score term $s_g$ arising from the dependence of $\tilde{g}_A$ on $P$. By re-expressing $s_g$, we obtain an expression for $\bar{D}_{\mathrm{CAR}}$ featuring two integral expressions, one arising from the projection term and one from the score for a learned stochastic intervention ($s_g$). Under the MTP definition, and using Assumption A1 to apply the change-of-variable formula (per the identity (S1)) to re-express the first integral expression identically to the second, the two integral expressions can be made to cancel out. This resolves a significant practical obstacle to working with $\bar{D}_{\mathrm{CAR}}$ as a selection criterion, leaving us with a simplified expression for $\bar{D}_{\mathrm{CAR}}$. This projection term $\bar{D}_{\mathrm{CAR}}$ may serve as a suitable criterion for undersmoothing of a non-stabilized IPW estimator (7); however, it does not extend immediately to the stabilized IPW estimator (8). Next, we repeat the analysis above, abbreviating mathematical manipulations for brevity, to derive the equivalent expression for the stabilized IPW estimator (8). In the case of the stabilized IPW estimator, the IPW estimating function $U(g_A, \psi)$ at a distribution $P \in \mathcal{M}$ and its corresponding $D_{\mathrm{CAR}}$ projection term take slightly different forms than they do in the case of the non-stabilized IPW estimator; notably, the score term $s_g$ for the dependence of $\tilde{g}_A(A \mid W)$ on the distribution $P$ remains unaltered. A term appropriate as a target for undersmoothing of the stabilized IPW estimator then follows analogously to that for the non-stabilized IPW estimator. As per the main text, this projection of the EIF onto $\mathcal{F}_{\mathrm{CAR}}$, the space of functions of $A$ that are mean-zero conditionally on $W$, can be used in formulating a targeted undersmoothing criterion for selecting asymptotically efficient IPW estimators that solve the EIF estimating equation, as per (11).
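A hedged sketch of how the targeted criterion (11) operates in practice: given a hypothetical matrix dcar_mat whose $(i, k)$ entry is $\bar{D}_{\mathrm{CAR}}(g_{n,A,\lambda_k}, Q_{n,Y})(O_i)$ (toy values are generated below in its place), the selector simply picks the $\lambda$ minimizing the absolute empirical mean.

```r
# Sketch of the targeted selector (11): choose the lambda whose density
# estimate minimizes |P_n Dbar_CAR|. dcar_mat is a hypothetical n-by-K matrix
# of Dbar_CAR evaluations; toy values stand in for actual nuisance fits.
set.seed(1)
dcar_mat <- matrix(rnorm(500 * 50,
                         mean = rep(seq(0.1, 0, length.out = 50), each = 500)),
                   nrow = 500)
lambda_grid <- exp(seq(log(0.1), log(1e-4), length.out = 50))
crit <- abs(colMeans(dcar_mat))             # |P_n Dbar_CAR| for each lambda
lambda_sel <- lambda_grid[which.min(crit)]  # targeted undersmoothing choice
```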
Any IPW estimator sufficiently minimizing $|P_n \bar{D}_{\mathrm{CAR}}(g_{n,A,\lambda}, Q_{n,Y})|$ is necessarily a solution of $|P_n D^{\star}(g_{n,A,\lambda}, Q_{n,Y})| \approx 0$, given that all IPW estimators are constructed as solutions to $|P_n U(g_A, \psi)| \approx 0$. To formally characterize the exact second-order remainder $R_2$ of the EIF expansion (5), as presented in Lemma 2, we consider the EIF expansion at $P, P_0 \in \mathcal{M}$, in which, per the discussion in Section S1, the last term $P_0 D^{\star}(P)$ admits a convenient representation. Throughout the manipulations concerned, the rearrangements between (S2) and (S3) are straightforward, while the final equivalence (S4) is justified by use of the true distribution of $W$, under which the equivalence $P_0 \left[\int_{\mathcal{A}} Q_Y(a, W)\, \tilde{g}_A(a \mid W)\, d\mu(a)\right] - \Psi_{\delta}(P) = 0$ holds. In order to obtain an exact expression for the second-order remainder term, one need only combine (S4) with the difference term $\Psi_{\delta}(P) - \Psi_{\delta}(P_0)$, yielding $R_2(g_{n,A}, Q_{n,Y}, g_{0,A}, Q_{0,Y})$, per the definition above. Combining these expressions, and after mathematical simplifications omitted for brevity, one arrives immediately at the form of the remainder given in Lemma 2. This form of the second-order remainder provides insight into the mechanism by which undersmoothing of an estimator of $g_{0,A}(A \mid W)$ can endow an estimator $\Psi_{\delta}(P)$ of $\Psi_{\delta}(P_0)$ with desirable asymptotic efficiency properties. In the main manuscript, shortly after presenting Lemma 2, it is noted that this remainder term may be characterized in terms of $\bar{D}_{\mathrm{CAR}}$. This expression for the second-order remainder is based on re-expressing the EIF using the characterization introduced in Lemma 1, which explicitly states that $D^{\star}(P) = U(g_A, \psi) - \bar{D}_{\mathrm{CAR}}(P)$; straightforward manipulation of this expression of the EIF quickly yields the desired equivalence. With this equivalence established, it is now easier to notice that the second-order remainder is itself only a score of the treatment mechanism; thus, undersmoothing of an estimator of the treatment mechanism would be expected to solve $R_2$ to some nontrivial degree. Indeed, this expression provides the insight that undersmoothing a HAL estimator of $g_{0,A}(A \mid W)$ so as to include a richer set of basis functions (yielding a more nonparametric estimator) ought to better solve this second-order remainder. As noted in the main manuscript, $R_2$ itself, unlike $\bar{D}_{\mathrm{CAR}}$, cannot be used as an undersmoothing target, since its form relies on the true outcome mechanism $Q_{0,Y}(A, W)$, whose form is never known in practice. Yet, it may be possible to make use of $R_2$ as a diagnostic for the comparison of candidate estimators. To this end, in our numerical experiments (presented in Section 4 and reexamined in Section S5), we evaluate the performance of our proposed IPW estimators, alongside that of several competitors, using an approximation of $R_2$ (as well as standard metrics). To do so, we define $\widehat{R}_2(\cdot) := P_0 \bar{D}_{\mathrm{CAR}}(g_A, Q_Y - Q_{0,Y})$, which can be evaluated in any given simulated dataset $O_1, \ldots, O_n$ by estimating the nuisance quantities (e.g., with appropriate HAL estimators) and computing the differences of these estimates against the known forms of the nuisance parameters, themselves evaluated on the given dataset. In our examples, by applying the MTP directly to the given dataset, it is possible to numerically evaluate the integral expressions (over $\tilde{g}_A(A \mid W)$) that appear in the definition of the remainder term.
Evaluating $\widehat{R}_2$ across many numerical examples then allows approximation of $R_2$ as $R_2 = P_0 \widehat{R}_2$, where the expectation with respect to $P_0$ is evaluated by Monte Carlo integration. A recently developed family of flexible semiparametric conditional density estimators takes the general form $\rho((A - \mu(W))/\sigma(W))$, where $\rho$ is a given marginal density function. Conditional density estimation procedures falling within this framework may be characterized as belonging to a conditional location-scale family, i.e., $g_{n,A}(A \mid W) = \rho((A - \mu_n(W))/\sigma_n(W))$, where we stress that the marginal density mapping $\rho$ is selected a priori, leaving only the relevant moments $\mu$ and $\sigma$ to be estimated. Though the restriction to (conditional) location-scale families imposes some limitations on the form of the target functional, the strategy is made particularly flexible by its ability to incorporate arbitrary, data-adaptive regression strategies for the estimation of $\mu(W) = E[A \mid W]$ and, optionally, of the conditional variance $\sigma^2(W) = E[(A - \mu(W))^2 \mid W]$. In particular, in settings with limited data, the additional structure imposed by the assumed form of the target density functional (i.e., in the specified kernel function $\rho$) can prove beneficial, particularly when the true density is well approximated within the assumed family.

Note that the cumulative hazard function is merely the primitive of the conditional hazard function $\lambda(a \mid w)$; thus, we have $\Lambda(a \mid w) = \int_{a_l}^{a} \lambda(s \mid w)\, ds$, where $a_l \leq a$, and this final quantity is estimable via standard techniques for numerical integration. With an estimate of $\Lambda(a \mid w)$ in hand, an estimator of the conditional density may be constructed by the relation given in equation (S5), that is, $g_{n,A}(a \mid w) = \lambda_n(a \mid w) \exp\{-\Lambda_n(a \mid w)\}$. Although it is not without assumptions (most notably, that of proportional hazards), such an estimator of the conditional density is more flexible than its entirely parametric counterparts, especially when nonparametric regression (e.g., HAL) is used for estimation of the conditional hazard function. A variant of this strategy has been explored and successfully utilized by Rytgaard et al. (2022, Appendix B and C) in the context of conditional density estimation in longitudinal studies, using an intensity-based representation of the hazard based on the Poisson likelihood.

Two popular frameworks for efficient estimation, both taking advantage of the EIF estimating equation, include one-step estimation (Pfanzagl & Wefelmeyer, 1985) and targeted minimum loss (TML) estimation (van der Laan & Rubin, 2006; van der Laan & Rose, 2011). Importantly, both the one-step and TML estimators are doubly robust (DR), a property which affords two important conveniences. Firstly, both estimators are consistent for $\psi_{0,\delta}$ when either of the initial nuisance estimators $g_{n,A}$ or $Q_{n,Y}$ is consistent for its respective nuisance functional. While the protection afforded by this form of robustness is practically useful, it carries with it a disadvantage: these estimators can only achieve asymptotic efficiency when both initial nuisance estimates are consistent. Secondly, as a consequence of their explicit construction based on the EIF, both readily accommodate the use of flexible, data-adaptive estimation strategies for the initial estimation of nuisance parameters. This latter property provides these estimators with a distinct advantage over their direct and IPW estimator counterparts: two opportunities to avoid misspecification bias.
Invoking either estimation strategy proceeds first by constructing initial estimators, $g_{n,A}$ and $Q_{n,Y}$, of the conditional treatment density $g_{0,A}$ and the outcome mechanism $Q_{0,Y}$. These two DR frameworks diverge at their second stage, in which each implements a different form of bias correction. Within the one-step framework, this amounts to updating the initial estimate $\psi_{n,\delta}$ (itself obtained simply by "plugging in" the initial estimate of the outcome mechanism $Q_{n,Y}$ into the direct estimator (6)) by adding to it the empirical mean of the EIF evaluated at the initial nuisance estimates $\{g_{n,A}, Q_{n,Y}\}$. Using the EIF $D^{\star}_{n,i}$ evaluated at the nuisance estimates $\{g_{n,A}, Q_{n,Y}\}$, the one-step estimator is $\psi^{+}_{n,\delta} := \psi_{n,\delta} + n^{-1} \sum_{i=1}^{n} D^{\star}_{n,i}$. The one-step estimator $\psi^{+}_{n,\delta}$ of $\psi_{0,\delta}$ is both asymptotically linear (on account of the bias correction step) and capable of asymptotically achieving the nonparametric efficiency bound; moreover, it is compatible with initial nuisance estimates $\{g_{n,A}, Q_{n,Y}\}$ that are constructed from flexible, data-adaptive regression strategies. On account of its additive bias-correction procedure, the one-step estimator is not guaranteed to respect the bounds of the parameter space of $\psi_{0,\delta}$. In the TML estimation framework, a univariate (most often, logistic) parametric tilting model is used to update the initial estimate $Q_{n,Y}$ of the outcome mechanism in such a way that the EIF estimating equation is solved up to a theoretically desirable degree. Plugging this updated estimate into the direct estimator (6) yields a targeted estimator $\psi^{\star}_{n,\delta}$ of $\psi_{0,\delta}$. Specifically, the TML estimator may be constructed by updating the initial estimator $Q_{n,Y}$ to a tilted variant $Q^{\star}_{n,Y}$ through a one-dimensional logistic tilting model of the form $\operatorname{logit} Q^{\star}_{n,Y} = \operatorname{logit} Q_{n,Y} + \epsilon H_n$, in which the initial estimate $Q_{n,Y}$ is taken as an offset (i.e., a term whose coefficient is fixed a priori to be one) and only the fluctuation parameter $\epsilon$ need be estimated. This targeted estimator is then $\psi^{\star}_{n,\delta} := \int Q^{\star}_{n,Y}(d(a, w; \delta), w)\, dQ_{n,AW}(a, w)$. Like its one-step counterpart, the TML estimator $\psi^{\star}_{n,\delta}$ is both asymptotically linear (on account of the bias correction afforded by the tilting step) and capable of asymptotically achieving the nonparametric efficiency bound. It is also compatible with initial nuisance estimates $\{g_{n,A}, Q_{n,Y}\}$ that are constructed from flexible, data-adaptive regression strategies, and, unlike the one-step estimator, it is a plug-in estimator, incapable of lying outside the bounds of the parameter space of $\psi_{0,\delta}$. Additionally, numerical studies have shown that the TML estimator generally displays better finite-sample performance than the one-step estimator (van der Laan & Rose, 2011). Accordingly, in the numerical experiments presented in Section 4 of the manuscript, we use a TML estimator under differing nuisance parameter configurations. Both one-step and TML estimators of $\psi_{0,\delta}$ are implemented in the open source txshift R package (Hejazi & Benkeser, 2020, 2022), freely available on the Comprehensive R Archive Network. Here, we consider the performance of the same set of six IPW and four DR estimators across the two DGPs discussed in the main manuscript, though we assess performance in terms of three alternative metrics.
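For illustration, the tilting step can be carried out with a single glm() call, treating the logit of the initial outcome predictions as an offset. In this minimal sketch, the nuisance inputs are toy stand-ins (in practice they would come from the initial fits), and the auxiliary covariate H, which for MTPs is commonly a density-ratio term, is simply taken as given.

```r
# Sketch of the TML tilting step via a one-dimensional logistic fluctuation.
# Toy stand-ins: Qbar_n = Q_{n,Y}(A_i, W_i); Qbar_shift = Q_{n,Y}(d(A_i, W_i;
# delta), W_i); H_n, H_shift = auxiliary covariate at observed and shifted
# treatment values. In practice these come from the initial nuisance fits.
set.seed(1); n <- 500
Y <- rbinom(n, 1, 0.4)
Qbar_n <- runif(n, 0.2, 0.6); Qbar_shift <- runif(n, 0.3, 0.7)
H_n <- runif(n, 0.5, 2); H_shift <- runif(n, 0.5, 2)
qlogis_b <- function(p, eps = 1e-6) qlogis(pmin(pmax(p, eps), 1 - eps))
# initial estimate enters as an offset; only epsilon is estimated
tilt <- glm(Y ~ -1 + H_n + offset(qlogis_b(Qbar_n)), family = binomial())
eps_n <- unname(coef(tilt))                                   # fluctuation
Qstar_shift <- plogis(qlogis_b(Qbar_shift) + eps_n * H_shift) # tilted fit
psi_tmle <- mean(Qstar_shift)                                 # targeted plug-in
```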
These three metrics are the empirical coverage of oracle confidence intervals (CIs), which use a Monte Carlo variance estimate to avoid issues introduced by unstable variance estimation; the empirical coverage of Wald-style confidence intervals, which use the empirical variance of the estimated EIF and so may be prone to issues of poor variance estimation; and the magnitude of an approximation of the second-order remainder $R_2$ associated with each of the estimator variants. The manner in which this second-order remainder term is approximated is briefly discussed in Section S2. Since this approximation of the remainder term $R_2$ is prone to a small degree of instability, we compute the metric using a trimmed mean that removes the lowest and highest 5% of its realizations across the 300 simulations. We note that these metrics are similar to those considered in Section 4 of the main manuscript: performance in terms of the coverage of oracle CIs is a measure of bias; that in terms of the coverage of Wald-style CIs is a measure of efficiency (in that it accounts for both bias and the quality of variance estimation); and that in terms of the magnitude of $R_2$ is a measure of higher-order efficiency, that is, how well the estimators achieve efficiency beyond the first-order EIF approximation. Using the same DGP as considered in Section 4.1, we evaluate each of the estimator variants, in terms of the three alternative metrics discussed above, across 300 simulation experiments. Their performance is summarized in Figure S1.

Figure S1: Alternative numerical comparisons of our nonparametric IPW estimators and DR estimators.

Figure S1 quickly reveals that all but one of the estimators considered perform well in terms of the coverage of oracle CIs, indicating negligible bias for the vast majority of estimators. This assessment is generally in agreement with corresponding results on bias appearing in Figure 1. Notably, the IPW estimator based on "global" CV selection shows poor coverage (20% lower than the nominal rate) at $n = 100$ and coverage that degrades from $n = 200$ (at the nominal rate) to $n = 500$ (5% lower than the nominal rate); again, this is consistent with conclusions drawn based on other metrics. The coverage of Wald-style CIs reveals that the IPW estimator variants based on targeted undersmoothing perform comparably to the best variant among the DR estimators. The IPW estimators based on agnostic undersmoothing achieve coverage near the nominal rate, though they show a limited coverage loss (5% lower than the nominal rate) by $n = 500$; moreover, the IPW estimator using the agnostic criterion (12) shows significant loss of coverage (10% lower than the nominal rate) by $n = 500$, suggesting instability of this selector, possibly on account of poor standard error estimation. The IPW estimator based on the "global" CV selector appears to cover near the nominal rate, though this is likely attributable to the manner in which this selector prioritizes variance reduction over debiasing. Among the DR estimators, those basing the treatment mechanism estimate on the "global" CV selector achieve desirable coverage at $n = 100$ and $n = 200$, but the variant using a misspecified GLM to estimate the outcome mechanism shows degrading coverage by $n = 500$.
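The trimmed summary just described is available directly in base R; a toy illustration, with hypothetical heavy-tailed draws standing in for the 300 realizations of the approximate remainder:

```r
# Trimmed summary used for the approximate R2 metric: base R's trim argument
# drops the lowest and highest 5% of realizations before averaging.
set.seed(1)
r2_draws <- rcauchy(300, location = 0.01, scale = 0.005)  # toy unstable draws
r2_metric <- mean(r2_draws, trim = 0.05)
```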
Those DR estimators using criterion (11) for the treatment mechanism estimator exhibit poorer coverage at $n = 100$ and $n = 200$ (approximately 7% lower than the nominal rate) but appear to be steadily improving with growing sample size. Turning now to the metric approximating the second-order remainder $R_2$, most of the estimator variants display comparably good performance uniformly across the three sample sizes; however, the DR estimators and the undersmoothed IPW estimator that make use of criterion (11) for treatment mechanism estimation result in an inflated second-order remainder relative to their competitors (by a factor of three at $n = 100$ and by a factor of five at $n = 500$). This suggests that absolute minimization of $|P_n \bar{D}_{\mathrm{CAR}}(g_{n,A,\lambda}, Q_{n,Y})|$ carries with it a tradeoff as regards stability of the selected treatment mechanism estimator: the related IPW estimators exhibit low bias and reasonable efficiency but may achieve this by compromising higher-order efficiency. This is not necessarily at odds with theoretical expectations. Overall, the picture painted by examining our proposed IPW estimators and standard DR competitors along this alternative set of metrics speaks to the reliable performance of our undersmoothed IPW estimators, in terms of inferential bias (captured by the coverage of CIs) and higher-order efficiency criteria. In order to gain a degree of intuition regarding the choices of treatment mechanism estimator made by each of the selectors, we examine, in a single numerical example with sample size $n = 100$, all of the IPW estimator candidates along the regularization path defined by $\lambda$. Initially, this regularization path is set to include 3000 preset values of $\lambda$ (fixed across all of our numerical experiments), though the $\lambda_n$ chosen by the "global" CV selector is used as the starting point for the undersmoothing-based selectors. Figure S2 illustrates the trajectory taken by the IPW estimators along this grid in $\lambda$. In Figure S2, the six candidate IPW estimators generated by each of the selectors are displayed, with the choice made by the "global" CV selector being the first in the solution path; all values along the grid corresponding to stricter regularization than this are truncated. The true value of $\psi_{0,\delta}$, approximated in a very large sample, is indicated by a horizontal dashed line. While this is only a single numerical example, the IPW estimators chosen by the various selectors may be somewhat indicative of the degree of regularization preferred by each, and one may generally expect that the targeted undersmoothing selector based on criterion (11) will prefer far less extreme levels of regularization than the others. Notably, when the degree of regularization is relaxed too far, the solution path of the IPW estimators diverges (after $-\log(\lambda) \approx 10$) from the true value $\psi_{0,\delta}$ in an irrecoverable manner.

S5.2 Simulation #2: Treatment mechanism with unequal mean and variance

Using the same DGP as considered in Section 4.2, we evaluate each of the estimator variants, in terms of the three alternative metrics discussed above, across 300 simulation experiments. Their performance is summarized in Figure S3. Examining first the coverage of oracle CIs, all of the DR estimators achieve coverage at the nominal rate, irrespective of sample size, indicating unbiasedness of these estimators. The IPW estimators achieve comparable performance by $n = 500$, though they fare slightly worse at $n = 100$.
In terms of the coverage of Wald-style CIs, all of the IPW estimators, and half of the DR estimators, achieve coverage at or very near the nominal rate uniformly across the three sample sizes. The two DR estimators that base their treatment mechanism estimator on criterion (11) display poorer coverage, worse when the outcome mechanism estimator is misspecified but (predictably) better when the outcome mechanism estimator is modeled flexibly. Notably, all of the undersmoothing-based IPW estimators provide reliable coverage (at or near the nominal rate), indicating that they simultaneously achieve negligible bias and accurate variance estimation. Cursory examination of the approximate second-order remainder $R_2$ reveals that all but three of the estimator variants succeed in minimizing this metric. All of the IPW estimators based on agnostic undersmoothing fall in this majority. As with the preceding scenario, the DR estimators and the undersmoothed IPW estimator that make use of criterion (11) for treatment mechanism estimation struggle to appropriately minimize this metric, diverging with growing sample size. This provides further confirmation of the notion that absolute minimization of $|P_n \bar{D}_{\mathrm{CAR}}(g_{n,A,\lambda}, Q_{n,Y})|$ induces a compromise in terms of higher-order efficiency, as measured by this remainder term in the EIF expansion. This may prove a fruitful avenue for future theoretical and numerical investigation. Overall, the conclusion appears well-supported that our proposed IPW estimators fare comparably (well) to standard DR competitors in terms of both inferential bias and higher-order contributions to asymptotic efficiency. As with the previous scenario, we examine a single numerical example at sample size $n = 100$ to develop some degree of intuition on the choices of treatment mechanism estimator made by each of the selection criteria. Here, we visualize the trajectory of the IPW estimators along the regularization path of 3000 preset values of $\lambda$ (fixed across all of our numerical experiments), limiting the estimators displayed by truncating the regularization grid so as to begin at the $\lambda_n$ chosen by the "global" CV selector; the result appears in Figure S4.

Figure S4: Solution path in $\lambda$ of nonparametric IPW estimators in a single numerical example.

In Figure S4, the six candidate IPW estimators generated by each of the selectors are displayed, and the "global" CV selector's choice is the first in the solution path; all stricter regularization values are truncated. The true value of $\psi_{0,\delta}$, approximated in a very large sample, is indicated by a horizontal dashed line. Although this illustration arises only from a single numerical example, the IPW estimators chosen by the various undersmoothing selectors can be taken as somewhat representative of the degree of regularization preferred by each. For example, one notices, as with the previous example depicted in Figure S2, that the targeted undersmoothing selector based on criterion (11) prefers the most relaxed (least extreme) degree of regularization; furthermore, one should note that the agnostic undersmoothing selectors all appear to reliably choose the peak along the trajectory of IPW estimators, which corresponds to the point estimate closest to the true value of $\psi_{0,\delta}$, illustrating the debiasing induced by these undersmoothing selectors.
References

Assessing the performance of the generalized propensity score for estimating the effect of quantitative or continuous exposures on binary outcomes
The highly adaptive lasso estimator
Fast rates for empirical risk minimization with cadlag losses with bounded sectional variation norm
Efficient and Adaptive Estimation for Semiparametric Models
Stacked regressions
Nonparametric bootstrap inference for the targeted highly adaptive least absolute shrinkage and selection operator (LASSO) estimator
Semiparametric density estimation under a two-sample density ratio model
Planning of Experiments
hal9001: The scalable highly adaptive lasso
Sieve plateau variance estimators: A new approach to confidence interval estimation for dependent data
Causal mediation analysis for stochastic interventions
Non-parametric efficient causal mediation with intermediate confounders
Causal survival analysis under competing risks using longitudinal modified treatment policies
Super learner based conditional density estimation with application to marginal structural models
Population intervention causal effects based on stochastic interventions
Stochastic treatment regimes
Nonparametric causal effects based on longitudinal modified treatment policies
Asymptotics of cross-validated risk estimation in estimator selection and performance assessment
Nonparametric inverse probability weighted estimators based on the highly adaptive lasso
Inefficient estimators of the bivariate survival function for three models
Estimation of the effect of interventions that modify the received treatment
haldensify: Highly adaptive lasso conditional density estimation
hal9001: Scalable highly adaptive lasso regression in R
Nonparametric causal mediation analysis for stochastic interventional (in)direct effects
Efficient nonparametric inference on the effects of stochastic interventions under two-phase sampling, with applications to vaccine efficacy trials
The propensity score with continuous treatments. In: Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives
Efficient estimation of average treatment effects using the estimated propensity score
Population intervention models in causal inference
The role of the propensity score in estimating dose-response functions
Semiparametric theory and empirical processes in causal inference
Nonparametric causal effects based on incremental propensity score interventions
Non-parametric methods for doubly robust estimation of continuous treatment effects
Consistent estimation of the influence function of locally asymptotically linear estimators
Asymptotically minimax adaptive estimation I: Upper bounds and optimally adaptive estimates
Asymptotically minimax adaptive estimation II: Schemes without optimal adaptation: Adaptive estimators
Hospitals with higher nurse staffing had lower odds of readmissions penalties than hospitals with lower staffing
Semiparametric efficient empirical higher order influence function estimators
Lepski's method and adaptive estimation of nonlinear integral functionals of density
Contribution to the theory of sampling human populations
Evaluating shifts in mobility and COVID-19 case rates in US counties: A demonstration of modified treatment policies for causal inference with continuous exposures
Multidimensional variation for quasi-Monte Carlo. In: Contemporary Multivariate Analysis and Design of Experiments: In Celebration of Professor Kai-Tai Fang's 65th Birthday
Causality: Models, Reasoning, and Inference
Brief report: On the consistency rule in causal inference: "axiom, definition, assumption, or theorem?"
Inferences for case-control and semiparametric two-sample density ratio models
Universal sieve-based strategies for efficient estimation using machine learning tools
R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing
Higher order influence functions and minimax estimation of nonlinear functionals
Marginal structural models and causal inference in epidemiology
Estimation of regression coefficients when some regressors are not always observed
Targeted learning with daily EHR data
Nonparametric policy analysis
Density Ratio Estimation in Machine Learning
Conditional density estimation via least-squares density ratio estimation
Regression shrinkage and selection via the lasso

We thank Alan Hubbard and Nick Jewell for helpful comments on an early draft of the manuscript; Jeremy Coyle for discussions and early-stage collaborations on software implementations; and Ted Westling for bringing to our attention semiparametric conditional density estimation techniques for location-scale families. NSH's work was supported in part by the National Science Foundation (award no. DMS 2102840), and MJvdL was partially supported by a grant from the National Institute of Allergy and Infectious Diseases (award no. R01 AI074345).

References (Supplementary Materials)

The highly adaptive lasso estimator
Fast rates for empirical risk minimization with cadlag losses with bounded sectional variation norm
Efficient and Adaptive Estimation for Semiparametric Models
Stacked regressions
hal9001: The scalable highly adaptive lasso
glmnet: Lasso and elastic-net regularized generalized linear models
Estimation of the effect of interventions that modify the received treatment
txshift: Efficient estimation of the causal effects of stochastic interventions in R
txshift: Efficient estimation of the causal effects of stochastic interventions
hal9001: Scalable highly adaptive lasso regression in R
Contributions to a general asymptotic statistical theory
Estimation of regression coefficients when some regressors are not always observed
Continuous-time targeted minimum loss-based estimation of intervention-specific mean outcomes
A generally efficient targeted minimum loss based estimator based on the highly adaptive lasso
Targeted Learning: Causal Inference for Observational and Experimental Data
Targeted maximum likelihood learning

Though this family of conditional density estimators has seen use in applied work, we have been otherwise unable to ascertain a formal description of it; thus, we provide such a formalization in Algorithm S1, which sketches the construction of conditional density estimators of this family.

Algorithm S1: Location-scale conditional density estimation
Result: Estimates of the conditional density of $A$, given $W$.
Input:
1 An observed data vector of the continuous treatment for $n$ units: $A$
2 An observed data vector (or matrix) of the baseline covariates for $n$ units: $W$
3 A kernel function specification for the marginal density $\rho$
4 Regression procedures $f_\mu$ and $f_\sigma$ for the conditional mean and variance
1 Estimate the conditional mean $\mu(W) = E[A \mid W]$ by applying the regression estimator $f_\mu$, yielding $\hat{\mu}(W)$.
2 Estimate the conditional variance of $A$ given $W$ by applying the regression estimator $f_\sigma$, yielding $\hat{\sigma}^2(W)$.
3 Estimate the one-dimensional density of the standardized residuals $(A - \hat{\mu}(W))/\hat{\sigma}(W)$, using kernel smoothing to obtain $\hat{\rho}$.
4 Construct the estimated conditional density $g_{n,A}(A \mid W) = \hat{\rho}((A - \hat{\mu}(W))/\hat{\sigma}(W))$.
Output: $g_{n,A}$, an estimate of the generalized propensity score.
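To make Algorithm S1 concrete, the following is a minimal R sketch of its heteroscedastic variant, with loess() as a stand-in for the arbitrary regression procedures $f_\mu$ and $f_\sigma$; note that the division by $\hat{\sigma}(w)$ in the returned density, a Jacobian factor needed for the estimate to integrate to one in $a$, is made explicit here as our addition.

```r
# Minimal sketch of Algorithm S1 (heteroscedastic variant); loess() stands in
# for the arbitrary regressions f_mu and f_sigma. The final 1 / sigma-hat
# factor normalizes the density in a (our addition).
fit_locscale_density <- function(A, W) {
  dat <- data.frame(A = A, W = W)
  fit_mu <- loess(A ~ W, data = dat)            # step 1: conditional mean
  resid2 <- (A - predict(fit_mu))^2
  fit_sig <- loess(resid2 ~ W, data = dat)      # step 2: conditional variance
  sig <- sqrt(pmax(predict(fit_sig), 1e-6))
  z <- (A - predict(fit_mu)) / sig              # standardized residuals
  rho <- density(z)                             # step 3: kernel density of z
  function(a, w) {                              # step 4: assemble the estimate
    mu_w <- predict(fit_mu, newdata = data.frame(W = w))
    sig_w <- sqrt(pmax(predict(fit_sig, newdata = data.frame(W = w)), 1e-6))
    z_aw <- (a - mu_w) / sig_w
    approx(rho$x, rho$y, xout = z_aw, yleft = 0, yright = 0)$y / sig_w
  }
}
# toy usage: A | W ~ N(2 W, (0.5 + W)^2)
set.seed(1); W <- runif(500); A <- rnorm(500, 2 * W, 0.5 + W)
g_n <- fit_locscale_density(A, W)
g_n(a = 1, w = 0.5)   # estimated conditional density at (a, w)
```

In the homoscedastic variant, the loess() fit for the variance would simply be replaced by the marginal mean of the squared residuals.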
The sketch of the algorithm for constructing estimators $g_{n,A}(a \mid w)$ of the generalized propensity score may take two forms, which diverge at the second step above. Firstly, one may elect to estimate only the conditional mean $\mu(W)$ via a regression technique, leaving the variance to be taken as constant (estimated simply as the marginal mean $E[(A - \hat{\mu}(W))^2]$). Estimators of this form may be described as having homoscedastic error, based on the variance assumption made. Alternatively, one may additionally estimate the conditional variance $\sigma^2(W)$ via the residuals of the estimated conditional mean, that is, by estimating the conditional mean of the squared residuals, $E[(A - \hat{\mu}(W))^2 \mid W]$.

While the regression procedures $f_\mu$ and $f_\sigma$ used to estimate the conditional mean $\mu(W)$ and the conditional variance $\sigma^2(W)$, respectively, may be arbitrary, with candidates including, for example, regression ensembles (van der Laan et al., 2007), we recommend the use of HAL regression (van der Laan, 2017), as its use will ensure an enhanced rate of convergence (Bibaut & van der Laan, 2019) of the estimator $g_{n,A}$ to its target $g_{0,A}$. The data-adaptive nature of HAL regression affords a degree of flexibility that ought to limit opportunities for model misspecification to compromise estimation of $g_{0,A}$; moreover, its improved convergence rate will help to facilitate the construction of asymptotically linear and efficient estimators of $\psi_{0,\delta}$.

Note that a conditional density may be written in terms of the hazard and survival functions as
$$g_A(a \mid w) = \lambda(a \mid w) \exp\left(-\int_0^a \lambda(s \mid w)\, ds\right), \quad \mathrm{(S5)}$$
where $\lambda(a \mid w)$ is the conditional hazard function and the survival function is $\exp(-\int_0^a \lambda(s \mid w)\, ds)$, i.e., a rescaled cumulative hazard. Using the highly adaptive lasso, the conditional hazard may be estimated based on a Cox proportional hazards model of the form
$$\lambda(a \mid w) = \lambda_0(a) \exp\Big\{\sum_j \beta_j \phi_j(a, w)\Big\}, \quad \mathrm{(S6)}$$
where any given $\phi_j = \mathbb{I}(a > c_j)$ for some cutoff $c_j$, essentially a knot point when spline basis functions are used. The quantity in equation (S6) can be estimated using open source software, as available in the hal9001 and glmnet (Friedman et al., 2009) R packages. In the sequel, we focus on estimation of the quantity $\Lambda(a \mid w) = \int_0^a \lambda(s \mid w)\, ds$, the cumulative hazard.
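A small numerical sketch of this final step, under stated assumptions: given any fitted conditional hazard function (here a hypothetical lambda_hat, with a constant hazard used as a toy check), the cumulative hazard is accumulated by a Riemann sum over a grid of treatment values, and the density then follows from the relation (S5).

```r
# Recovering a conditional density from an estimated hazard via (S5):
# numerically integrate the hazard over a grid to form the cumulative hazard,
# then set g = lambda * exp(-Lambda). lambda_hat is a hypothetical fitted
# conditional hazard function (e.g., from a HAL-based proportional hazards fit).
density_from_hazard <- function(lambda_hat, w, a_grid) {
  lam <- vapply(a_grid, function(a) lambda_hat(a, w), numeric(1))
  d_a <- diff(c(a_grid[1], a_grid))      # grid spacing; zero mass at first point
  Lambda <- cumsum(lam * d_a)            # Riemann-sum cumulative hazard
  lam * exp(-Lambda)                     # conditional density on the grid
}
# toy check with a constant hazard of 0.5 (an exponential density)
a_grid <- seq(0, 10, by = 0.01)
g <- density_from_hazard(function(a, w) 0.5, w = NULL, a_grid = a_grid)
sum(g * 0.01)  # ~0.99, the exponential's mass below a = 10
```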