key: cord-0578096-mb3trlal authors: Malenica, Ivana; Phillips, Rachael V.; Pirracchio, Romain; Chambaz, Antoine; Hubbard, Alan; Laan, Mark J. van der title: Personalized Online Machine Learning date: 2021-09-21 journal: nan DOI: nan sha: 99cc90e31dcca5d45a867391eacd2043b823738b doc_id: 578096 cord_uid: mb3trlal In this work, we introduce the Personalized Online Super Learner (POSL) -- an online ensembling algorithm for streaming data whose optimization procedure accommodates varying degrees of personalization. Namely, POSL optimizes predictions with respect to baseline covariates, so personalization can vary from completely individualized (i.e., optimization with respect to baseline covariate subject ID) to many individuals (i.e., optimization with respect to common baseline covariates). As an online algorithm, POSL learns in real-time. POSL can leverage a diversity of candidate algorithms, including online algorithms with different training and update times, fixed algorithms that are never updated during the procedure, pooled algorithms that learn from many individuals' time-series, and individualized algorithms that learn from within a single time-series. POSL's ensembling of this hybrid of base learning strategies depends on the amount of data collected, the stationarity of the time-series, and the mutual characteristics of a group of time-series. In essence, POSL decides whether to learn across samples, through time, or both, based on the underlying (unknown) structure in the data. For a wide range of simulations that reflect realistic forecasting scenarios, and in a medical data application, we examine the performance of POSL relative to other current ensembling and online learning methods. We show that POSL is able to provide reliable predictions for time-series data and adjust to changing data-generating environments. We further cultivate POSL's practicality by extending it to settings where time-series enter/exit dynamically over chronological time. 
Predictive analytics with large data streams is a common task across many fields, including physics, medicine, engineering and finance. The insights drawn from these data typically come in the form of forecasts (predictions about the future), and inform subsequent action by the machine or user. The usefulness of the forecasts (interchangeably referred to as "predictions") often depends on their timeliness, accuracy, and uncertainty. For example, in the intensive care unit (ICU), it is imperative to quickly convert streams of data into predictions (Chan et al., 2020). The same is true for high-frequency trading, where computers need to rapidly make beneficial decisions from their forecasts. Drawing from the COVID-19 pandemic, obtaining accurate forecasts in a timely manner is crucial for making evolving policy decisions (Altieri et al., 2020). In these examples, and for real-world data streams in general, the observations are derived from dynamic environments, where time-series are ever-growing and evolve in possibly unforeseeable ways. In order for a machine to quickly adapt to the dynamic patterns in data streams, algorithmic strategies that regularly reassess the information learned from incoming data relative to historical data are crucial. The traditional machine learning and forecasting paradigm has been offline estimation, where a prediction algorithm (also referred to as a "learner" or simply an "algorithm") is updated with new batches of data by first adding the new data to all, or part, of the existing data, and then retraining the learner on the new training dataset. Because an offline algorithm's training data grows with each update, these strategies are generally not scalable when updates are frequent. Tools that assess the reliability of an algorithm, such as calibration diagnostics, can inform reactive updates for an offline algorithm.
However, when new patterns are expected to emerge quickly and often in the time-series, real-time (as opposed to reactionary) learning is essential to maintain the reliability of the system. Online estimation, where an algorithm is updated with a new batch of data without revisiting past training data, has become a promising technique for learning from data streams in real-time. There is a growing body of literature on online algorithms and software, including online implementations of canonical time-series algorithms (Anava et al., 2013; Hoi et al., 2018) . Some implementations are ensemble-based, combining forecasts from multiple algorithms as part of their procedure. Ensembling strategies have been shown to increase forecast accuracy; for instance, the most successful entries in the 2018 M4 Forecasting Competition were ensembling methods (Smyl, 2020; Gilliland, 2020; Shaub, 2020; Pawlikowski and Chorowska, 2020) . Hibon and Evgeniou (2005) showed empirically that the best combination of forecasts performed as well as the best individual forecast, and argued that choosing an individual algorithm from a set of available algorithms is more risky than choosing a combination. Many longstanding challenges exist in applying online learning strategies. Online learning strategies, just like all other machine learning, require data in order to perform well, and in some settings data accumulation over time is a luxury and is not guaranteed in practice. For instance, in order to forecast a patient's trajectory in the hospital, a purely online learning strategy would require following the patient for a long period of time beforehand, which is not practical for in-hospital forecasting applications. Another limitation of online learning is the phenomenon of "catastrophic forgetting/interference", in which the new information interferes with what the model has already learned. 
Catastrophic forgetting can result in sudden drops in performance and/or overwriting prior knowledge that could be informative again in the future (Lee and Lee, 2020). Constrained online learning/ensembling strategies offer the potential to reduce catastrophic forgetting events, as the degree to which an algorithm is allowed to adjust its parameters at each update can be restricted. Also, online ensembling of a hybrid of offline algorithms and online algorithms provides a means to address these limitations. Considering that specification error tends to propagate in online learning settings, a principled methodology for algorithm selection and ensembling is warranted (Hibon and Evgeniou, 2005; Shaub, 2020). However, despite the emerging popularity of online learning implementations and ensembling algorithms, literature at the intersection of these two fields is relatively scarce. Furthermore, a personalized online ensembling paradigm that is grounded in statistical optimality theory, to fit and evaluate the performance among multiple diverse algorithms under an individualized optimization strategy, has only been described in the commercial/proprietary realm and has not yet been formally defined in the literature, to the best of our knowledge. This article proposes such a principled paradigm, and builds on developments presented in Benkeser et al. (2018). Benkeser et al. (2018) generalized an existing oracle result for independent and identically distributed (i.i.d.) data to time-series data, showing that the online algorithm exhibiting the best cross-validated performance is asymptotically equivalent to the oracle; the oracle selects the algorithm in the library (the set of candidate algorithms) that has the best performance with respect to the true, unknown data-generating process (DGP) (Dudoit and van der Laan, 2005; van der Laan and Dudoit, 2003; van der Laan et al., 2007; Benkeser et al., 2018; Ecoto et al., 2021).
As such, the oracle cannot be computed in practice, but it serves as a useful benchmark. These oracle results were further extended to the best combination of individual online algorithms, showing that the ensembling online learner's performance is asymptotically equivalent to that of an optimal ensemble that performs best with respect to the DGP (Benkeser et al., 2018). Benkeser et al. (2018) proposed one form of cross-validation for identifying the best candidate online algorithm among a set of possible online learners, and applied it to training with batches of incoming data and with a single individual's data stream. In this work, we introduce a novel online ensembling algorithm - Personalized Online Super Learner (POSL) - that utilizes a diversity of time-series and ensembling methods, with the goal of optimizing baseline-covariate-level (including individual-level) forecasts. POSL leverages multiple candidate algorithms, training pooled (population-based) and online individualized learners at each time step, while allowing the ensembling to depend on the amount of data collected, the degree of stationarity, and the residual noise. As such, POSL is not hindered by the limitations of purely online learning or purely offline learning, as it draws on both strategies. Our proposed algorithm is able to adapt to the underlying structure in the data, allowing it to pick between relying on the structure through time (e.g., conditional stationarity as in Benkeser et al. (2018)) or the number of samples at each time point (when no structure in time is present, or not enough time points are collected). In addition, POSL allows for completely different data-generating distributions across time-series, with possibly only common baseline covariates - thus, it is constructed to provide optimal forecasts for a specific unit, as opposed to a collection of time-series.
Overall, POSL is able to provide accurate predictions for short time-series data and adjust to changing data-generating environments. Our main framework consists of observing n data structures over a finite number of time points, possibly from different data-generating distributions, where each observation is comprised of baseline and time-varying covariates and a response. We show that this setup is easily extended to settings where individual time-series enter/exit dynamically, with possibly different lengths and numbers of time-series observed at chronological time t. We propose several new online forms of cross-validation for single and multiple time-series, with varying sample dependence, all of which can be used to identify the best candidate algorithm in the current library. Building on theoretical foundations proposed by Benkeser et al. (2018), and under stronger assumptions than necessary for the setup we describe, we apply the online oracle inequalities to multiple time-series, showing that the performance of the online algorithm with the best cross-validated performance is asymptotically equivalent to that of the oracle. Lastly, we propose an adaptive ensembling step that bases the final ensemble weights on mutual characteristics of a group of time-series, allowing for a continuum of weight personalization that aims to increase the predictive power of forecasts for shorter time-series. We formulate the general statistical estimation problem in Section 2, including defining the target parameter and the loss-based paradigm for estimation. In Section 3, we present several online cross-validation schemes, all of which are valid for POSL; proposed extensions to the multiple time-series settings can be found in Appendix B.
In Section 4, we discuss how online cross-validation is used to identify the best performing individual online algorithm (discrete online super learner) and the best performing online ensemble of individual algorithms, and we define the oracle selector for POSL. In Section 5, we extend the current formulation of POSL to adaptive enrollment, with possibly different lengths of time-series and numbers of available subjects at each time point. In Section 6, we conduct multiple simulation studies, and compare POSL to various ensembling and online methods currently available in the literature. In Section 7, we provide a data analysis example for blood pressure forecasting. We conclude with a short discussion in Section 8. In the following, we formalize the prediction task as an estimation problem, identifying the statistical target parameter as the minimizer of the risk induced by a valid, problem-specific loss function. We model each data structure as a random variable O i = (O i (t) : t = 1, . . . , τ ), where O i (t) is a time t-specific p-dimensional variable for sample i. We focus on situations where the time-varying part O i (t) decomposes as O i (t) = (W i (t), Y i (t)), with W i (t) defining a vector of time-varying covariates occurring before the response variable Y i (t), both indexed by time t and sample id i. We denote by X i the vector of baseline covariates which, by definition, are initiated at t = 0 and do not depend on t. We view each X i and O i = (O i (t) : t = 1, . . . , τ ) as the sample i-specific baseline covariates and time-series, respectively. With that, we note that X i could simply be a function of i itself, or continuous/discrete covariates that allow one to smooth across the subjects. We observe n independent realizations of random variables denoted as (X 1 , O 1 ), . . . , (X n , O n ). For convenience, we also introduce X n = (X i : i = 1, . . . , n) and O n = (O i : i = 1, . . .
, n) as the collections consisting of the n units. Consequently, for each i = 1, . . . , n, O n (t) = (O i (t) : i = 1, . . . , n) reflects the collection of the n t-specific p-dimensional variables observed at time t. We denote by M the statistical model; that is, the set of laws from which (X n , O n ) can be drawn. The more we know, or are willing to assume, about the experiment that produces the data, the smaller the model; this will be discussed momentarily. Let P n 0 ∈ M be the true probability distribution of (X n , O n ). Moreover, let P 0,O i |X i be the conditional distribution of O i given X i for each i = 1, . . . , n. When conditioning on X i , we use the short notation P 0,X i instead of P 0,O i |X i (not to be confused with P 0,X , the marginal distribution over the baseline covariates). We emphasize that P 0,X i could be just unit i specific, as is the case when X i is simply a function of i itself; alternatively, P 0,X i could be a smooth function of X i , allowing one to smooth across the units. We let p n 0 denote the density of P n 0 with respect to (w.r.t.) a measure µ n that dominates all elements of M. The joint likelihood of (x n , o n ) can be factorized according to the time-ordering as p n 0 (x n , o n ) = ∏ n i=1 [ p 0,x (x i ) ∏ τ t=1 p 0,o i (t) (o i (t) | x i , ō i (t − 1)) ], (1) where p 0,x marks the probability density for the baseline covariates, and p 0,o i (t) is the conditional density of O i (t) given X i and all the observed past until time t for sample i. In particular, we define Ō i (t − 1) = (O i (1), . . . , O i (t − 1)) as the t-specific history of the time-series for sample i. In order to allow learning from a dependent process, we must make a few assumptions on the law of the data, P n 0 , through restrictions made on the statistical model M. In particular, we assume that each factor p 0,o i (t) (O i (t) | X i , Ō i (t − 1)) depends on the past through a fixed-dimensional summary measure Z i (t − 1). For some applications, the fixed-dimensional summary measure Z i (t) covers a limited history, that is, Z i (t) = (O i (t − m), . . . , O i (t)).
In words, the dependent process has a finite memory m, allowing us to learn through time. Similarly, we could have defined Z i (t) to be a function of a finite memory m of a finite number of other time-series in unit i's network, for example by averaging the past m observations of those series, with the means computed component-wise. The formulation of Z i is general enough to allow for different trends (e.g., seasonality), because the definition of Z i can involve i itself. Secondly, we define P 0,O i as the common conditional probability distribution of O i (t) given Z i (t − 1) and X i under P n 0 . As we will describe later, this assumption is not crucial for the algorithm itself - if not enough time points are collected, we rely on performance based on the number of trajectories. However, if online learning is to be useful, and achieve oracle results, some structure across time is necessary. We also stress that our formulation allows for P 0,O i to be a function of time t, making it possible for the proposed procedure to learn how much to rely on conditional stationarity over time. Thus, in light of (1) and the two above-mentioned assumptions, the joint likelihood of (x n , o n ) under any element P n of the (constrained) statistical model M decomposes as p n (x n , o n ) = ∏ n i=1 [ p x (x i ) ∏ τ t=1 p O i (o i (t) | x i , z i (t − 1)) ], (2) where we extend the previously described notation with the substitution of P n (p n ) for P n 0 (p n 0 ). Note that P 0,O i is subject specific, and we do not assume a distribution common across i; if all the time-series are drawn from the same distribution, we let the algorithm learn that. In the rest of the manuscript, we will deal explicitly with P 0,O i and P 0,X i . The above derivation of (2) hinges on independence across subjects. While we write the likelihood as a product over both the number of samples (n) and time points (t), we emphasize that for deriving our main results, dependent on the asymptotics in time, we do not need to assume anything about dependence among subjects. Our statistical model M is, in essence, a model for a single time-series.
With that, while independence across subjects allows us to have asymptotics in the total number of time points observed across the n subjects, dependence among subjects is also allowed under the described formulation. In particular, network dependence could be allowed simply by letting each Z i (t − 1) summarize the whole past Ō n (t − 1) = {Ō 1 (t − 1), . . . , Ō n (t − 1)} of O n at time (t − 1), or a subgroup-specific past of its network, as opposed to the i-specific past Ō i (t − 1). Most prediction-based literature focuses on parameters of the population distribution P n 0 or, as is the case for the time-series literature, on unit-specific forecasts. Our goal is not to understand the population distribution P n 0 . Instead, we focus on the parameters of the unit-specific conditional distribution P 0,X i . We define the relevant feature of the true data distribution we are interested in as the statistical target parameter. As in Section 2.1, we assume that P n 0 belongs to a statistical model M, defined as a collection of possible common conditional distributions P 0,O i and marginal P 0,X that could have given rise to the observed data. We define a parameter mapping, Ψ : M → D, from the model M into a space D; and a parameter value, ψ := Ψ(P ), of Ψ for a given P ∈ M. The parameter space, corresponding to the parameter mapping Ψ, is defined as Ψ := {Ψ(P ) : P ∈ M} ⊆ D. In some cases, we might be interested in learning the entire conditional distribution P 0,X i ; however, frequently the actual goal is to learn a particular feature of the true distribution that satisfies a scientific question of interest. In particular, we are interested in forecasting - hence, we define our estimand for the i th subject as ψ 0,i (X i , Z i (t − 1), t) = E P 0,X i [Y i (t) | X i , Z i (t − 1)], (3) where the expectation on the right-hand side is taken w.r.t. the conditional distribution P 0,X i , and ψ 0,i (X i , Z i (t − 1), t) is the prediction function evaluated at the truth for the i th subject at time t.
In particular, we want to learn (X, Z(t − 1), t) → Ψ(P n 0 )(X, Z(t − 1), t), where Z(t − 1) is fixed dimensional, and thereby obtain a prediction function for each unit i that predicts Y i (t) with Ψ(P )(X i , Z i (t − 1), t) for P ∈ M. We define L as a loss function; we emphasize that the chosen loss should be picked in accordance with the target parameter. Specifically, a valid loss function for a given parameter is defined as a function whose true conditional mean is minimized by the true value of the parameter. As such, let L be a loss function adapted to the problem, i.e., a function that maps every Ψ(P ) to L(Ψ(P )) : (X i , Y i (t), Z i (t − 1)) → L(Ψ(P ))(X i , Y i (t), Z i (t − 1)), interpreted as a time t- and subject i-specific loss for Ψ(P ). Note that we could equivalently define L as a function that maps every ψ to L(ψ) : (x i , y i (t), z i (t − 1)) → L(ψ)(x i , y i (t), z i (t − 1)), since ψ := Ψ(P ). As our parameter of interest is a conditional mean, we could use the squared error to define the loss; then we have that L(ψ)(X i , Y i (t), Z i (t − 1)) = c(i, t)(Y i (t) − ψ(X i , Z i (t − 1), t))^2 , where c(i, t) is the subject- and time-specific weight. Our emphasis on appropriate loss functions stems from their multiple uses within our framework - as a theoretical criterion for comparing the estimator and the truth, as well as a way to compare multiple estimators of the target parameter. We define the true risk as the expected value of L(ψ)(X i , Y i (t), Z i (t − 1)) w.r.t. the true conditional distribution P 0,O across all individuals and times: R(P n 0 , ψ) = (1/n) ∑ n i=1 (1/t) ∑ t s=1 E P 0,X i [L(ψ)(X i , Y i (s), Z i (s − 1))] = (1/n) ∑ n i=1 (1/t) ∑ t s=1 E P 0,X i [c(i, s)(Y i (s) − ψ(X i , Z i (s − 1), s))^2 ], (4) where the second equality holds only when the loss function is valid for the target parameter; we simply illustrate what R(P n 0 , ψ) would be with the squared error as a loss in (4). The notation for the true risk, R(P n 0 , ψ), emphasizes that ψ is evaluated w.r.t. the true data-generating distribution. Finally, we define ψ 0 as the minimizer of the true risk over all evaluated ψ in the parameter space: ψ 0 = argmin ψ∈Ψ R(P n 0 , ψ). The corresponding true risk is denoted as θ 0 = R(P n 0 , ψ 0 ). The true risk establishes a true measure of performance for ψ, optimizing over all times.
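As a concrete illustration of the weighted squared-error loss and the empirical counterpart of the risk above, the following sketch computes the subject- and time-specific loss and an empirical risk averaged over subjects and time points. This is illustrative only (not the POSL implementation), and the function names are our own:

```python
import numpy as np

def squared_error_loss(y, pred, weight=1.0):
    """Subject- and time-specific loss L(psi) = c(i, t) * (Y_i(t) - psi)^2."""
    return weight * (y - pred) ** 2

def empirical_risk(ys, preds, weights=None):
    """Empirical analogue of the risk: the weighted squared-error loss
    averaged over all subjects and time points."""
    ys = np.asarray(ys, dtype=float)
    preds = np.asarray(preds, dtype=float)
    if weights is None:
        weights = np.ones_like(ys)
    return float(np.mean(np.asarray(weights, dtype=float) * (ys - preds) ** 2))
```

Here `ys` and `preds` would hold the stacked observed outcomes and predictions over all (i, t) pairs; with unit weights this reduces to the ordinary mean squared error.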
We note, however, that we could also define a true i-specific risk, where the i-specific risk would measure the performance of ψ for individual i across all time points. Note that ψ 0 implies ψ 0,i by evaluating at X i , as ψ 0,i is a prediction function given X i . The i-specific expected loss measures the performance of the prediction function ψ for individual i across all time points; ψ 0,i minimizes this i-specific risk, with the optimal risk corresponding to ψ 0,i defined as θ 0,i = R(P n 0 , ψ 0,i ). The estimator mapping, Ψ̂, is a function from the empirical distribution to the parameter space Ψ. Let P n,t denote the empirical distribution of n time-series collected until time t. In particular, P n,t → Ψ̂(P n,t ) represents a mapping from P n,t , with n time-series collected until time t, into a predictive function Ψ̂(P n,t ). Further, the predictive function Ψ̂(P n,t ) maps (X i , Z i (t − 1), t) into a time- and subject-specific prediction of the outcome Y i (t). We emphasize that Ψ̂(P n,t ) can map any (X i , Z i (s − 1), s) into a time s prediction, even for s > t under stationarity conditions. We can write ψ n,t (X i , Z i (t − 1), t) := Ψ̂(P n,t )(X i , Z i (t − 1), t) as the predicted outcome for unit i of the estimator Ψ̂(P n,t ) at time t, based on (X i , Z i (t − 1), t). We define the conditional risk as the risk for ψ n,t with respect to the true, unknown data-generating distribution P n 0 , denoted as θ̄ n = R(P n 0 , ψ n,t ). The naive risk is defined as θ̂ n = R(P n,t , ψ n,t ). In order to obtain an unbiased estimate of the true conditional risk, we resort to appropriate cross-validation for dependent data, as described in the next section. Let C(i, s, ·) denote, at minimum, the time s- and unit i-specific record of the observed data.
The general formulation of C(i, s, ·) allows us to add identifying information (in addition to time and sample id) needed to construct a valid cross-validation scheme; for instance, for dynamic enrollment/exit dates, C(i, s, ·) might include the enrollment and exit times for a time-series as well. If no additional information is included, we write C(i, s, ·) = C(i, s). To derive a general representation for cross-validation (CV), we also define a time t-specific split vector B t , where t indicates the final time-point of the currently available data; for all 1 ≤ i ≤ n, B t (i, ·) ∈ {−1, 0, 1} t . Let v denote a particular fold, where v ranges from 1 to V . A realization of B t defines a particular split of the learning set into three corresponding disjoint subsets: B t (i, s) = −1 indicates that observation (i, s) is excluded from the fold, B t (i, s) = 0 that it belongs to the training set, and B t (i, s) = 1 that it belongs to the validation set. For each t, let P 0 n,t denote the empirical distribution of the training sample until time t. Similarly, we define P 1 n,t as the empirical distribution of the validation set. Let n 0 t,v = ∑ 1(B t = 0) and n 1 t,v = ∑ 1(B t = 1) denote the number of observations in the training and validation sets for split B t , respectively, over all folds v until time t. We let B 0 t,v denote all the (i, s, ·) indexes in the training set, and B 1 t,v all indexes in the validation set for fold v. In general, we use different time-series cross-validation schemes to evaluate how well an estimator trained on specific samples' past data is able to predict an outcome for specific samples in the future. We now give relevant cross-validation schemes that are supported by the theoretical results for our proposed algorithm. The rolling origin cross-validation scheme lends itself to online cross-validation-based ensemble learning, as described by Benkeser et al. (2018). In general, the rolling origin scheme defines an initial training set and, with each iteration, the size of the training set grows by m observations until we reach time t for split B t .
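The rolling origin splits just described can be sketched for a single time-series as follows. This is a schematic, not the POSL implementation; the parameter names are illustrative, with `n_initial` playing the role of the initial training size, `m` the growth increment, `h` the training/validation gap, and `n_val` the validation size:

```python
def rolling_origin_folds(t, n_initial, m, h, n_val):
    """Generate rolling-origin CV folds for a single series of length t.

    Each fold trains on time points 1..train_end and validates on n_val
    points starting h steps after the end of the training window; the
    training set grows by m observations with each fold.
    """
    folds = []
    train_end = n_initial
    while train_end + h + n_val <= t:
        train_idx = list(range(1, train_end + 1))
        val_start = train_end + h + 1
        val_idx = list(range(val_start, val_start + n_val))
        folds.append((train_idx, val_idx))
        train_end += m  # training set grows by m observations per fold
    return folds
```

For example, with t = 10, an initial training set of 4 points, m = 2, h = 0, and 2 validation points, this yields three folds whose training sets end at times 4, 6, and 8.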
Whether or not the samples in the training set are also present in the validation set is optional, but classically, rolling origin cross-validation represents the scenario where the training and validation points are evaluated on the same time-series. Regardless of which samples are included in the training and validation sets, time points included in the training set always occur before the validation set time points; in addition, there might be a gap between training and validation times of size h. We define cross-validation folds v = 1, . . . , V with B v t for a single unit in terms of n 0 t,v=1 , the size of the training set at v = 1, and n 1 t,v , the size of the validation set for all v. An example of rolling origin cross-validation is illustrated in Figure 1 for V = 3. A variant of the rolling origin scheme which accounts for sample dependence is the rolling-origin-V -fold cross-validation. In contrast to the canonical rolling origin CV, samples in the training and validation set are not the same, as this variant of the rolling origin CV incorporates V -fold CV in addition to the time-series setup. The predictions are evaluated on the future times of time-series units not seen during the training step, allowing for dependence in both samples and time. In order to characterize B v t , we add the sample id in addition to n 0 t,v=1 , n 1 t,v , m and h. We show an example with V = 2 sample folds (i.e., splitting across samples, which we denote as v') and V = 2 time-series CV folds (denoted as v) in Appendix B, Figure 7 . Instead of adding more time points to the training set with each iteration, the rolling window cross-validation scheme "rolls" the training sample forward by m time units.
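A minimal sketch of this fixed-width rolling window split, in the same illustrative notation as before (`window` for the fixed training size, `m` for the step, `h` for the gap, `n_val` for the validation size); again a schematic under assumed parameter names, not the actual implementation:

```python
def rolling_window_folds(t, window, m, h, n_val):
    """Rolling-window CV for a single series of length t: a training
    window of fixed size slides forward by m time units per fold, with
    n_val validation points starting h steps after the window ends."""
    folds = []
    train_start = 1
    while train_start + window - 1 + h + n_val <= t:
        train_idx = list(range(train_start, train_start + window))
        val_start = train_start + window + h
        val_idx = list(range(val_start, val_start + n_val))
        folds.append((train_idx, val_idx))
        train_start += m  # window slides forward; training size stays fixed
    return folds
```

In contrast to the rolling origin sketch, every fold here has the same training size, which is what makes the scheme attractive for streaming settings with bounded memory.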
The rolling window scheme might be considered in parametric settings when one wishes to guard against moment or parameter drift that is difficult to model explicitly; it is also more efficient for computationally demanding settings such as streaming data, in which large amounts of training data cannot be stored. In contrast to rolling origin CV, the training sample size for each iteration of the rolling window scheme is always n 0 t,v for all v. We emphasize that the rolling window CV could also be viewed as a subset of the rolling origin CV where only recent data are used for training; with that, we can incorporate a variant of the rolling window approach by using a learner that only trains on the recent past as one of the candidates in a library. In such a scenario, a rolling origin CV could be used to evaluate the final loss, but we might incorporate learners that train on fewer training time-points (as opposed to all training points from the start), thus adjusting to changes over time. In all the further sections and theoretical results, we treat the rolling window scheme as a member of the richer class of rolling origin CV schemes. Finally, we define cross-validation folds v = 1, . . . , V with B v t and a gap of size h for a single time-series analogously. We illustrate canonical rolling window cross-validation in Figure 2 , and the rolling-window-V -fold cross-validation which accounts for sample dependence in Figure 8 (allocated to Appendix B). Suppose we have K candidate estimators, Ψ̂ k , and recall the definition of an estimator from subsection 2.3 (page 9). In order to evaluate the performance of each Ψ̂ k , we use cross-validation for dependent data to estimate the average loss for each candidate. In particular, each Ψ̂ k is trained on the training set until time t, using P 0 n,t and resulting in a predictive function ψ 0 n,t,k := Ψ̂ k (P 0 n,t ) for k = 1, . . . , K.
We define the online cross-validated risk R CV (P 1 n,t , Ψ̂ k (·)) for each candidate estimator as the cumulative performance of Ψ̂ k trained on training sets and evaluated on the corresponding validation samples across all time points until time t. For instance, while Ψ̂ k (P 0 n,t ) is trained on the training set P 0 n,t , its performance is evaluated on the validation set P 1 n,t . Additionally, if Ψ̂ k is an online estimator, then the online cross-validated risk is also an online estimator. For the squared error loss mentioned in Section 2.3, where c(i, j) = 1 (and is thus omitted), the online cross-validated risk accumulates the squared residuals (Y i (s) − Ψ̂ k (P 0 n,s )(X i , Z i (s − 1), s))^2 over all validation points (i, s) until time t. The online cross-validated risk estimates the true online cross-validated risk, denoted as R CV (P n 0 , Ψ̂ k (·)), in which the loss over the validation points is replaced by its expectation under the truth. Note that R CV (P n 0 , Ψ̂ k (·)) reflects the true average loss for the candidate estimator with respect to the true conditional distribution P 0,O i . As opposed to the true online cross-validated risk, R CV (P 1 n,t , Ψ̂ k (·)) gives an empirical measure of performance for each candidate estimator k trained on training data until time t. In light of that, we define the discrete online cross-validation selector as k n,t := argmin k∈{1,...,K} R CV (P 1 n,t , Ψ̂ k (·)), which is the estimator that minimizes the online cross-validated risk. The discrete (online) super learner is the estimator that at each time point t uses the estimates from the discrete online cross-validation selector - for time t, we have ψ 0 n,t,k n,t := Ψ̂ k n,t (P 0 n,t ). We emphasize that the discrete super learner can switch from one learner to another as t progresses, in response to accumulating more data and detecting changes in the time-series. Finally, if all the candidate estimators are online estimators, the discrete (online) super learner is itself an online estimator. In order to study the performance of an estimator of ψ 0 , we construct loss-based dissimilarity measures.
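Schematically, under squared error loss with unit weights, the discrete online cross-validation selector reduces to an argmin over risks accumulated on validation points. A toy sketch (our own function names, not the POSL software's API), where `candidate_preds` holds each candidate's validation-set predictions collected up to the current time:

```python
import numpy as np

def discrete_selector(y_val, candidate_preds):
    """Discrete online CV selector: return the index k of the candidate
    minimizing the online cross-validated risk (here, the mean squared
    error over all validation points accumulated so far), plus the risks."""
    y_val = np.asarray(y_val, dtype=float)
    risks = [float(np.mean((y_val - np.asarray(p, dtype=float)) ** 2))
             for p in candidate_preds]
    return int(np.argmin(risks)), risks
```

In an online loop, `y_val` and each entry of `candidate_preds` simply grow as new validation points arrive, so the selector can switch candidates as t progresses.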
First, we define the i-specific loss-based dissimilarity d 0,t (ψ n,t,k,i , ψ 0,i ) for the k th estimator, Ψ̂ k , trained until time t, which compares the performance of the cross-validated estimator to the true parameter. Note that the training and validation sample is just sample i. We further define the measure d 0,t (ψ n,t,k , ψ 0 ) = (1/n) ∑ n i=1 d 0,t (ψ n,t,k,i , ψ 0,i ) as an average of the i-specific loss-based dissimilarities over all the samples until time t for the k th estimator; d 0,t (ψ n,t,k , ψ 0 ) reflects how far ψ n,t,k is from ψ 0 over all available times and samples in terms of the chosen loss. We define the time t oracle selector as the unknown estimator that uses the candidate closest to the truth in terms of the defined dissimilarity measure: k̄ n,t := argmin k∈{1,...,K} d 0,t (ψ n,t,k , ψ 0 ). Because it is a function of the true conditional mean, the oracle selector cannot be computed in practice. However, we can utilize it as a benchmark in order to describe the performance of the online cross-validation based estimator. In Appendix A, Theorem 1, assuming conditional stationarity (as proposed in Section 2.1; this assumption is not necessary for the algorithm to function), we adapt the work of Benkeser et al. (2018) to multiple time-series using the cross-validation schemes described above. In particular, Appendix A Theorem 1 shows that the performance of the discrete POSL is asymptotically equivalent to that of the oracle selector. The result relies on the martingale finite-sample inequality of van Handel (2011) to show that this equivalence holds as t → ∞, under conditional stationarity and additional conditions specified in Appendix A. In this section, we consider a more flexible online learner that generates a weighted combination of candidate estimators. Let Ψ̂ α be a function of the empirical distribution (P n,t , at any t) generating an ensemble of K estimators (Ψ̂ 1 , . . . , Ψ̂ K ) indexed by a vector of coefficients α, where α = (α 1 , . . . , α K ).
For example, Ψ̂_α could represent a convex linear combination, Ψ̂_α = Σ_{k=1}^{K} α_k Ψ̂_k, such that Σ_{k=1}^{K} α_k = 1 and α_k ≥ 0 for all k. We define conditional meta-learning by allowing the weight vector to depend on the baseline covariates X, where α(X) = {α_1(X), ..., α_K(X)} with Σ_{k=1}^{K} α_k(X) = 1 and α_k(X) ≥ 0 for all k. For example, we can define α(X) by considering a parametric family H = {α_β : β ∈ B} where α_{β,k}(X) = exp(β_{k,1} + β_{k,2} X) / Σ_{k'=1}^{K} exp(β_{k',1} + β_{k',2} X). To alleviate notation, we use α as a universal vector of coefficients (including conditional meta-learning) in further sections. Let Ψ̂_α = Σ_{k=1}^{K} α_k Ψ̂_k, so that the predictive function based on the training set P^0_{n,t} is given by ψ^0_{n,t,α} := Σ_{k=1}^{K} α_k Ψ̂_k(P^0_{n,t}) with α ∈ H. We define an H-specific online cross-validation selector for the ensemble as the α ∈ H that minimizes the online cross-validated risk, where the loss is defined as the mean squared error. We can likewise define an oracle selector for this class of estimators as the choice of weights that minimizes the true average of the loss-based dissimilarity. The results from Theorem 1 in Appendix A extend to all meta-learning: the performance of the online cross-validated ensemble is asymptotically equivalent to the oracle ensemble of candidate estimators as t goes to infinity, under conditional stationarity and the conditions outlined in Appendix A. We note that one could also define a sequence of H_m-specific online Super Learners, ranging from highly parametric to nonparametric for m = 1, ..., M, possibly stratified by subject. Then the online ensemble would have candidate algorithms Ψ̂_k, k = 1, ..., K, augmented with a collection of online Super Learners indexed by the weight classes H_m, m = 1, ..., M. In this way, the discrete online Super Learner would adaptively determine the optimal level of data adaptivity of the meta-learner, based on the many candidates for the online discrete Super Learner considered.
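The softmax-parametrized weight family above can be sketched directly; the `beta` coefficients (a list of per-candidate (intercept, slope) pairs) and the scalar baseline covariate x are illustrative assumptions, since in general X is a vector and β is fit by the meta-learner.

```python
import math

def conditional_weights(beta, x):
    """Parametric weight family alpha_beta(X): candidate k gets weight
    proportional to exp(beta[k][0] + beta[k][1] * x), normalized so the
    weights are non-negative and sum to one (a convex combination)."""
    scores = [math.exp(b0 + b1 * x) for b0, b1 in beta]
    total = sum(scores)
    return [s / total for s in scores]

# With equal coefficients the weights are uniform; a positive slope on the
# second candidate shifts weight toward it as x grows.
w = conditional_weights([(0.0, 0.0), (0.0, 1.0)], 0.0)
```

This makes explicit how personalization enters the meta-learning step: two subjects with different baseline covariates receive different convex combinations of the same K candidate fits.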
For example, depending on the number of subjects n and time t, the choice k_{n,t} of the online cross-validation selector might switch over time from a discrete online Super Learner based on K algorithms to a more aggressive online Super Learner indexed by a more flexible weight class. Due to the continuously updating procedure that allows the algorithm to evolve over time, POSL generalizes to a diversity of data streams. POSL accommodates varying degrees of personalization, such as within-covariate or within-subject, and it can handle multiple (potentially time-varying) dependence structures, from individual time-series to networks of connected individuals. We delineate one version of POSL in Algorithm 1, which can train online or in batches, and benefits from learning both from other subjects and from the history of the target individual's trajectory. We define Historical learners as K_H learners generating a pooled (across individuals and time before t) estimator Ψ̂_k(P^0_{n,t}) for each algorithm k ∈ K_H, trained on samples j = 1, ..., n in the training sample. The motivation behind Historical learners is to provide an initial estimate based on previously collected trajectories (or multiple concurrently collected time-series), convenient for forecasting early in the trajectory of individual i; here, we do not need conditional stationarity for POSL to do well. Historical learners can be trained on time-series data collected even before the trajectory of interest is sampled, and can be updated at specific time points t_s depending on computational efficiency. We note that the Historical learners can also generate a historical online Super Learner, which provides another candidate online estimator. Similarly, we define the Individual learners as K_I learners applied to P_{n,t} that stratify by subject when training. With that, we generate an individual estimator Ψ̂_{S,k}(P^0_{i,t}) for each algorithm k ∈ K_I, individualized to the target sample i.
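The distinction between pooled (Historical) and stratified (Individual) fitting can be sketched with a deliberately trivial "learner"; the mean-predicting learner, the dictionary layout keyed by subject id, and the toy data are all illustrative assumptions standing in for arbitrary candidate algorithms.

```python
def fit_mean(outcomes):
    """Toy 'learner' that predicts the mean outcome of its training data;
    stands in for any candidate algorithm in K_H or K_I."""
    return sum(outcomes) / len(outcomes)

def historical_fit(data_by_id):
    """Historical learner: pool outcomes across all subject ids and all
    past time points before fitting."""
    pooled = [y for ys in data_by_id.values() for y in ys]
    return fit_mean(pooled)

def individual_fit(data_by_id, target_id):
    """Individual learner: stratify by id and fit only on the target
    subject's own trajectory."""
    return fit_mean(data_by_id[target_id])

# Toy data: subject "a" differs systematically from subject "b".
data = {"a": [1.0, 1.0], "b": [3.0, 3.0]}
```

Early in subject "a"'s trajectory the pooled fit may be the only usable estimate; once enough of "a"'s own data accrue, the stratified fit becomes the better candidate, and POSL's cross-validation arbitrates between them.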
Individual learners train on the training data by stratifying by subject id, and predict the outcome in the future according to the forecast horizon. The K_I candidate learners are possibly time-series learners, and are frequently updated in an online or batch fashion in order to accommodate the continuously incoming data. As with the Historical learners, the Individual learners can also form an individual online Super Learner, which becomes another candidate in the POSL library. With this formulation, we allow POSL to leverage cross-validation in order to choose between pooled and individual fits at each time point t, in essence allowing the algorithm to choose an appropriate structure from the data: learning across samples if no conditional stationarity is present, or learning through time. This results in a natural adaptation to the amount of available data and the stationarity of the individual target time-series. The final discrete and full online super learner for POSL is generated based on all the samples until the specified time t. The cross-validation selector k_{n,t} reflects an optimized ensemble among all the available learners; the candidate learners reflect a collection of online algorithms that range in how much they use the current time-series (stratify by id) versus the historical data (pool across all available time-series). All simulations in Section 6 test the version of POSL described in Algorithm 1.

Algorithm 1 Personalized Online Super Learner
1: t_s: time steps at which the Historical learner fit is updated.
2: K_H: Historical candidate learners.
3: K_I: Individual candidate learners.
4: K: all candidate learners, K_H ∪ K_I.
5: k: any learner among the candidate learners.
procedure Historical Learner(n, t)
    Return Ψ̂(P^0_{n,t}), trained using all available units (i = 1, ..., n) for any t.
10: procedure Individual Learner(i, t)
    Return Ψ̂_S(P^0_{n,t}), for any t, where we stratify by sample i.
13: while t < τ do
14:     if t ∈ t_s then
16:         Run Historical Learner(n, t), return ψ^H_{n,t,k} = Ψ̂_k(P^0_{n,t}) for k ∈ K_H.
        else
18:         ψ^H_{n,t,k} = Ψ̂_k(P^0_{n,t_{s−1}}).
        for k ∈ K_I do
20:         Run Individual Learner(i, t), return ψ^I_{i,t,k} = Ψ̂_{S,k}(P^0_{n,t}).
        if Discrete Online Super Learner then
            Return ψ_{n,t,k_{n,t}}, where k_{n,t} = argmin_k R_CV(P^1_{n,t}, {ψ^H_{n,t,k}, ψ^I_{i,t,k}}).
        if Ensemble Online Super Learner then
24:         Return ψ_{n,t,α_{n,t}}, where α_{n,t} = argmin_α R_CV(P^1_{n,t}, Σ_k α_k {ψ^H_{n,t,k}, ψ^I_{i,t,k}}).

In most practical settings, time-series data exhibit a heterogeneous streaming profile, comprising series of varying lengths and diverse start and exit times. In order to accommodate a variety of different applications, including dynamic enrollment/exit (collectively referred to as dynamic streams), we extend the formulation of the estimation problem described in Section 2, along with the POSL algorithm, in the following sections. In particular, we redefine the observed data, statistical model, and target parameter in Section 5.1, taking into account the possibly dynamic and disparate tracking of each collected sample. In Section 5.2 we describe the appropriate cross-validation scheme for dynamic streams; redefine the loss, the online cross-validation selector, and the ensemble of candidate estimators; and define a new m-specific prediction function for dynamic enrollment streaming settings. In Figure 3 we provide examples of dynamic streams, introducing subject-specific time and its relationship to chronological time, and illustrate various ways in which this heterogeneous streaming profile can evolve. Let E_i be the entry time of each new time-series, corresponding to the chronological time domain t = 1, ..., τ, for i = 1, ..., n. We assume a natural ordering for all E_i, with 0 ≤ E_1 ≤ ... ≤ E_n, even if multiple samples enroll around the same time.
Our assumption that the entry times are strictly monotone increasing follows from the fluid definition of t; since we put no restrictions on the definition of time, for a sufficiently fine time scale no two subjects exhibit E_i = E_{i+1}, even when a group of units enrolls together. Let O_i(h_i(t)) be the observed data on subject i at chronological time t; note that, in order to define O_i(h_i(t)), we need the current time t and the subject's entry time E_i. As in Section 2, the time-varying part of the observed data has structure corresponding to O_i(m) = (W_i(m), Y_i(m)), equivalently written as O_i(h_i(t)) = (W_i(h_i(t)), Y_i(h_i(t))) in chronological time. Here, Y_i(h_i(t)) is a response variable and W_i(h_i(t)) is a vector of covariates for subject i at collected point h_i(t) in chronological time t. We impose an order constraint where, for all t, W(h_i(t)) occurs before Y(h_i(t)), and define X_i as the vector of baseline covariates collected at entry time E_i for subject i. We define the total number of subjects in the study at chronological time t as n(t) = Σ_{i=1}^{n} I(E_i ≤ t), reflecting all trajectories started before (or at) time t. Equivalently, we let n_m(t) denote the number of samples with m points up to time t, where n_m(t) = Σ_{i=1}^{n(t)} Σ_{s=1}^{t} I(h_i(s) = m). We can represent the observed data coming from dynamic streams as a single time-series through chronological time t by defining a process F(t) that collects the data observed at chronological time t; then (F_{n(0)}(0), ..., F_{n(τ)}(τ)) reflects a single time-series we can learn from. We emphasize that, for dynamic streaming settings, F(t) describes a collection of all observed time-series enrolled at or before time point t. Finally, in the previous sections we defined the time t- and sample i-specific history as O_i(t − 1) = (O_i(1), ..., O_i(t − 1)). For dynamic streams, we let the history of the i-th time-series until time t be defined as O_i(h_i(t − 1)) = (O_i(h_i(0)), ..., O_i(h_i(t − 1))).
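The bookkeeping between chronological and subject-specific time can be sketched as follows; the once-per-time-step sampling assumed by `subject_points` is an illustrative simplification, since the paper leaves h_i(·) general.

```python
def n_active(entry_times, t):
    """n(t) = sum_i I(E_i <= t): number of trajectories started at or
    before chronological time t."""
    return sum(1 for e in entry_times if e <= t)

def subject_points(entry_time, t):
    """Illustrative subject-specific clock h_i(t): number of points
    collected by chronological time t for a subject observed once per
    time step after entry (0 before entry). This once-per-step sampling
    is an assumption for the sketch."""
    return max(0, t - entry_time + 1)
```

For example, with entry times (1, 3, 5), two subjects are active at t = 3, and the subject entering at t = 3 has collected three points by t = 5.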
We define the complete history for all samples until chronological time t as O(h(t − 1)), which includes all trajectories observed by time t. Note that with this formulation, Z_i(h_i(t − 1)) could support both time and sample dependence, as discussed in previous sections. We define the estimand as a time-m prediction problem for the i-th subject, where the expectation is taken w.r.t. the conditional distribution P_{0,X_i}, and Ψ(P_0)(X_i, Z_i(m − 1), m) is the prediction function for the i-th subject at time-series time m (equivalent to chronological time h_i(t)).

Figure 3: Examples of dynamically streaming time-series data. The following scenarios are shown: settings where the subjects i have varying entry times E_i, exit times T_i, and observation periods M_i. In (A), a classic example of dynamic streams is displayed, in which start and exit times vary across subjects and occasionally overlap. In (B), dynamic streams consisting of completely non-overlapping time-series are displayed: an example of time-series that are observed one at a time (such as patients seen by a doctor). In (C), a dynamic setting with large amounts of historical information relative to recently entered time-series (upper-most subject 1) is shown; in this scenario, training algorithms on historical information might be particularly useful for making forecasts for subject 1's time-series, as not much data has been accumulated on this subject. In (D), an example of a network of dynamically streaming time-series is provided; in particular, the network for a single subject (middle line in D) is shown. This subject's network is constrained in that they can only interact with at most two other subjects at any given time, and this restriction is illustrated by the two colors for arrows and boxes that branch from the middle subject's time-series. In general, examples of dynamically streaming networks include processes that spread and evolve, like disease and social networks.
In particular, we want to learn the mapping (X, Z(m − 1), m) → Ψ(P_0)(X, Z(m − 1), m), and thereby obtain a prediction function for each unit i that predicts Y_i(m) with Ψ(P_0)(X_i, Z_i(m − 1), m). Further, let P_{n,t} → Ψ̂(P_{n,t}) represent a mapping from P_{n,t} into a predictive function Ψ̂(P_{n,t}). We define Ψ̂(P_{n,t})(X_i, Z_i(m − 1), m) as an estimator of Ψ(P_0)(X_i, Z_i(m − 1), m), and write Ψ̂(P_{n,t})(X_i, Z_i(m − 1), m) = ψ_{n,t}(X_i, Z_i(m − 1), m) as the predicted outcome for unit i of the estimator Ψ̂(P_{n,t}) based on (X_i, Z_i(m − 1), m). Let C_h(i, t, E_i, T_i) = (O_i(h_i(t − 1)), E_i, T_i) denote the subject-i and chronological time-t observed data, analogous to Section 3. As previously defined, B_t defines a time-specific split vector such that, for all 1 ≤ i ≤ n, B_t(i, ·, E_i, T_i) ∈ {−1, 0, 1}^t. Extending the formulation described in Section 3 further, we define a split of the learning set into three corresponding disjoint subsets, where our cross-validation scheme now takes into account whether we have yet to observe sample i (by chronological time t) and how long its trajectory is. Knowing E_i and T_i proves important shortly, as we define how the loss and the corresponding (online) cross-validated risk are defined and evaluated. Let L(ψ)(C_h(i, t, E_i, T_i)) denote the loss function for the data record C_h(i, t, E_i, T_i) of subject i. For example, given the prediction function ψ, we might want to evaluate its performance using the squared error loss, where c(i, h_i(t), E_i, T_i) represents a weight function dependent on sample i, time h_i(t), and the unit's entry and exit time. The weight might additionally down-weight the losses for points h_i(t) for which n_l(t) is small; with that, our prediction function would not be penalized for not having enough data collected up until certain times l.
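A minimal sketch of the weighted squared error loss follows; the all-or-nothing enrollment weight is one illustrative choice of the weight function c, not the paper's prescribed form.

```python
def weighted_sq_loss(pred, y, c):
    """Weighted squared error loss c * (y - psi)^2, with c playing the
    role of the weight function c(i, h_i(t), E_i, T_i)."""
    return c * (y - pred) ** 2

def active_weight(t, entry, exit_time):
    """Illustrative weight: contribute only while the unit is enrolled
    (E_i <= t <= T_i), so losses outside the observation window are
    zeroed out rather than left undefined."""
    return 1.0 if entry <= t <= exit_time else 0.0
```

More refined choices of c could additionally down-weight times at which few subjects have accrued the corresponding number of points.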
Let I = {i : E_i ≤ t, t ≤ T_i} denote the set of all samples with start date before the current chronological time t whose data are still being collected (t ≤ T_i). We define the true risk as the expected value of L(ψ)(C_h(i, t, E_i, T_i)) w.r.t. the true conditional distribution P_{0,O_i} for each sample i, summed over t = 1, ..., τ and i ∈ I, i.e., defined for all subjects whose start E_i falls before chronological time t and whose end date falls after t. Note that if a sample i with start date E_i ≤ t also had its end date before time t, then its loss would be undefined, unless an appropriate weighting is part of the loss definition (as discussed above). We note that R(P_0, ψ) is an average of all appropriate i-specific losses measuring the performance of ψ across all available time points. One might instead be interested in defining an m-specific prediction function up until time τ; in particular, the true risk of the m-specific prediction function reflects an average over all available active samples and times with m time points. Suppose we have K candidate estimators Ψ̂_k, where we denote ψ^0_{n,t,k} := Ψ̂_k(P^0_{n,t}) for k = 1, ..., K. In order to evaluate the time-specific performance of each Ψ̂_k, we use cross-validation for dependent dynamic streams (which takes into account E_i and T_i) in order to estimate the average loss for each candidate k over time. The online cross-validated risk of an online estimator is computed at each point in chronological time, with the mean squared error as the loss.
Similarly, one could define the online cross-validated risk R_{CV,m}(P^1_{n,t}, Ψ̂_k(·)) of Ψ̂_k, evaluating the cross-validated performance of the prediction function ψ^0_{n,t,k} at time m; we then define the total online cross-validated risk as the sum of the m-specific risks. The online cross-validated risk R_CV(P^1_{n,t}, Ψ̂_k(·)) gives an empirical measure of performance for candidate estimator k trained on training data until chronological time t. We define the time-t discrete online cross-validation selector as the minimizer of the online cross-validated risk, reflecting the discrete online super learner for all m. Instead, we could define a separate selector for the different time points m, with the discrete online super learner stratifying the selector by m: k_{n,t,m} = argmin_{k=1,...,K} R_{CV,m}(P^1_{n,t}, Ψ̂_k(·)). Finally, we consider a more flexible online learner that generates a weighted combination of candidate estimators at each time point. Let Ψ̂_α be a function of the empirical distribution generating an ensemble of K estimators {Ψ̂_1, ..., Ψ̂_K} indexed by a vector of coefficients α. Let H denote a class of weight functions, where α = α(X) = (α_1(X), ..., α_K(X)) is a collection of K weights that might depend on the baseline covariates X, with Σ_{k=1}^{K} α_k = 1 and α_k ≥ 0 for all k. Let Ψ̂_α = Σ_{k=1}^{K} α_k Ψ̂_k, so that the predictive function based on the training set P^0_{n,t} is given by ψ^0_{n,t,α} := Σ_{k=1}^{K} α_k Ψ̂_k(P^0_{n,t})(C_h(i, t, E_i, T_i)) with α ∈ H. We define an H-specific online cross-validation selector for the ensemble as the minimizer over α ∈ H of the online cross-validated risk, where the loss is defined as the mean squared error. Alternatively, we could compute an online cross-validation selector α_{n,t,m} for each m, where α_{n,t,m} = argmin_{α∈H} R_{CV,m}(P^1_{n,t}, Ψ̂_α(·)). We evaluate the Personalized Online Super Learner implementation described in Algorithm 1, testing its performance and pattern of adaptivity over time for several common time-series settings.
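Before turning to the simulations, the per-horizon selection rule above can be sketched in Python; the candidate names and risk values are hypothetical, standing in for cumulative m-specific cross-validated risks.

```python
def m_specific_selectors(risks_by_m):
    """Per-horizon discrete selectors: given cumulative CV risks
    {m: {candidate: R_CV,m}}, return k_{n,t,m} = argmin_k R_CV,m
    for each horizon m separately."""
    return {m: min(risks, key=risks.get) for m, risks in risks_by_m.items()}

def total_risk(risks_by_m, candidate):
    """Total online CV risk of one candidate: the sum of its
    m-specific risks across horizons."""
    return sum(risks[candidate] for risks in risks_by_m.values())

# Toy risks: candidate "a" is best at m = 1, candidate "b" at m = 2.
risks = {1: {"a": 0.1, "b": 0.5}, 2: {"a": 0.9, "b": 0.2}}
selectors = m_specific_selectors(risks)
```

Stratifying the selector by m allows different candidates to win at different horizons, whereas the total risk yields a single selection for all m.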
For all scenarios, we simulate a total of 31 time-series with τ = 540 time points, and repeat the entire procedure 50 times, for a total of 1550 trajectories. We use a random sample of 30 time-series to train the Historical learners, and the remaining time-series for the Individual learners. For all simulations described below, we use the same library, consisting of a grid of xgboost and glm learners for the Historical and Individual learners, implemented in the sl3 R package (Chen and Guestrin, 2016; Coyle et al., 2021; R Core Team, 2020). The Historical Super Learner is fit once on the data pooled across individuals, whereas the Individual learners, and with them POSL, are updated every 20 time points, resulting in possibly different fits and weights at different times. In particular, we train sequentially for sample sizes {10, 30, 50, ..., 470, 490, 510}, and evaluate the loss over the last 5 time points of the time-series for which we forecast. To report the performance, we present in Figure 4 the average (over the 50 simulations) of the convex non-negative least squares weights given to the K_H Historical and K_I Individual candidate learners over time. In addition, we compare the performance of POSL to online and non-online ensemble methods using the same library of algorithms, data, and test set. In particular, we compare POSL to the canonical online Super Learner algorithm described by Benkeser et al. (2018) and the V-fold cross-validated Super Learner presented by van der Laan et al. (2007); by V-fold, we refer to V-fold cross-validation and the original Super Learner for i.i.d. data. The online Super Learner was trained using all the samples in a sequential manner: the loss is evaluated over a future five-time-point window not seen by the learners, then the fit is updated with new data. The evolution of the mean squared error is shown in Figure 5.
We start with a simple scenario, verifying that POSL can learn the conditional mean under the correct data-generating distribution over time for the sample i we are interested in. In particular, in Simulation 1, the Historical and Individual data are sampled from different data-generating distributions. We sample 30 time-series from a fifth-order autoregressive process, ARIMA(5,0,0), and a single time-series i from a fifth-order moving average process, ARIMA(0,0,5). We are interested in optimizing predictions for the single ARIMA(0,0,5) time-series. From Figure 4(A), we can see that POSL gives more weight to the Historical learners in the beginning, as there are not enough time points to learn just from sample i. As we collect more data and the time-series progresses, POSL quickly starts to consistently give more weight to the Individual learners. The Personalized Online Super Learner demonstrates good forecasting performance in terms of the mean squared error at all training times, as shown in Figure 5(A), with the V-fold Super Learner a close second. In Simulation 2, we build on Simulation 1 by adding a common component to the Historical and Individual data-generating distributions, in order to investigate the performance and behavior of the POSL algorithm in situations where there is significant "overlap" in the data-generating process, but the Individual distribution is different enough that, asymptotically, one would expect POSL to give more weight to the Individual learners as t gets larger. We simulate baseline covariates X = {W_1, W_2, W_3} and define an X-dependent offset as a function of W_1, W_2, and W_3, with f(X_i) = 0.5 W_{1,i} + 0.02 W_{2,i} + 0.5 W_{3,i} being the offset for sample i. We sample 30 time-series from an f(X) + ARIMA(5,0,0) process, reflecting the Historical data-generating distribution. We generate sample i from f(X_i) + ARIMA(0,0,5), so the trajectory evolves as an MA process with offset f(X_i).
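For readers who wish to reproduce the flavor of these scenarios outside R, zero-mean AR(p) and MA(q) processes can be simulated with a few lines of standard-library Python; the coefficients, innovation variance, and seeds below are illustrative, not the paper's exact simulation settings (which use R's arima.sim-style machinery).

```python
import random

def simulate_ar(coefs, n, seed=0):
    """Simulate a zero-mean AR(p) process
    y_t = sum_j coefs[j] * y_{t-j-1} + e_t, with e_t ~ N(0, 1)."""
    rng = random.Random(seed)
    p = len(coefs)
    y = [0.0] * p  # zero initial conditions (burn-in not shown)
    for _ in range(n):
        y.append(sum(c * y[-(j + 1)] for j, c in enumerate(coefs))
                 + rng.gauss(0, 1))
    return y[p:]

def simulate_ma(coefs, n, seed=0):
    """Simulate a zero-mean MA(q) process
    y_t = e_t + sum_j coefs[j] * e_{t-j-1}, with e_t ~ N(0, 1)."""
    rng = random.Random(seed)
    q = len(coefs)
    e = [rng.gauss(0, 1) for _ in range(n + q)]
    return [e[q + t] + sum(c * e[q + t - (j + 1)] for j, c in enumerate(coefs))
            for t in range(n)]
```

An ARIMA(5,0,0) stream corresponds to `simulate_ar` with five coefficients, and an ARIMA(0,0,5) stream to `simulate_ma` with five; an `f(X)` offset can be added pointwise to either series.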
Figure 4(B) shows the evolution of weights for the Historical and Individual learners over time. As seen in Simulation 1, POSL gives more weight to the Historical fit in the beginning, due to the scarce number of time points collected for time-series i. As data becomes more abundant, we can characterize the conditional mean for sample i better, giving more weight to the Individual learners. However, from Figure 4(B), we can also see that, due to the common offset, POSL does not start giving more weight to the Individual fit until about 100 time points are included in the training set, much later than seen in Simulation 1. As shown in Figure 5(B), POSL demonstrates uniformly the best forecasting performance in terms of the mean squared error at all training times. We continue to build on Simulation 2 by generating an interrupted time-series as the data-generating distribution for the sample i we want to create forecasts for. The motivation for Simulation 3 is to test whether POSL can detect changes in the underlying stream of data and adjust accordingly. We sample 30 time-series from the f(X) + ARIMA(5,0,0) process, representing the Historical data-generating distribution. As in the previous section, we define f(X) as an X-dependent function with f(X_i) = 0.5 W_{1,i} + 0.02 W_{2,i} + 0.5 W_{3,i} characterizing the sample-i offset. In contrast, we sample trajectory i as an interrupted time-series, with the first half drawn from the f(X) + ARIMA(0,0,5) process and the second half drawn from the same process as the Historical data-generating distribution, f(X) + ARIMA(5,0,0). From Figure 4(C) we can see that POSL is able to detect changes in the time-series data for sample i as time progresses and more data is collected. As in Figure 4(B), POSL starts by giving more weight to the Historical learners (until about t = 100), but quickly learns to give more weight to the Individual learners as more data is collected.
At time t = 270, roughly half of the training time, we can see that POSL responds to the change in the data-generating process by giving more weight to the Historical learners again, which reflect the correct data-generating process for the second half of time-series i. The weight assigned to the Individual learners continues to decrease until the end of training, as demonstrated in Figure 4(C), showing that POSL is able to quickly adapt to changes in the time-series as time progresses. In Figure 5(C) we can see that POSL outperforms all other tested algorithms, except at rare points in the later part of time-series i where the V-fold Super Learner slightly outperforms (or performs as well as) POSL; this can be explained by the fact that the V-fold Super Learner is trained only on samples drawn from the f(X) + ARIMA(5,0,0) process, while POSL has to learn the correct current form with a slight delay due to the small batch sizes used for training. In Simulation 4, we simulate sets of time-series using the GRATIS package, developed to expedite simulation of dependent data with controllable features and to provide a basis for future time-series benchmarks (Kang et al., 2020). The general approach employed is based on Gaussian mixture autoregressive models, used to generate a wide range of non-Gaussian and nonlinear time-series. First developed by Le et al. (1996), mixture transition distribution models used to capture many non-Gaussian and nonlinear features were generalized to Gaussian mixture autoregressive models by Wong and Li (2000). In addition to supporting the generation of heterogeneous sets of time-series, GRATIS also provides options for simulating from a random population of mixture autoregressive models with specified features (Kang et al., 2020).
In particular, we specify common features of the Historical and Individual data-generating distributions, including the entropy and the smoothed trend component of the Seasonal and Trend decomposition using Loess (STL decomposition). We differentiate the series based on their stability, defined as the variance of non-overlapping window means, and on the largest mean shift between two consecutive windows. With that, the Historical and Individual series exhibit the same trend and amount of information, but different variance. From Figure 4(D), we can see that POSL once again starts by giving more weight to the Historical fit, but eventually switches completely to the Individual learners as more data on the sample in question is collected. In terms of the MSE, POSL shows uniformly the best forecasts among all algorithms across all tested times. We illustrate the POSL algorithm in an application for five-minute-ahead forecasting of an individual's mean arterial pressure (MAP), which is one of the most important vital signs in the intensive care unit (ICU). Data obtained from the MIMIC II database (Multiparameter Intelligent Monitoring in Intensive Care) included 370 subjects' baseline covariates (age, sex, body mass index, ICU subunit, SAPS II and SOFA mortality scores, ICU admission type), time-varying binary exposures (vasopressors, ventilation, sedation), and time-varying continuous vitals (pulse, heart rate, systolic and diastolic blood pressures, and the MAP outcome) (Goldberger et al., 2000; Saeed et al., 2011). Additional covariates were derived from this set of variables, most of them from the time-varying variables, including lagged values of the time-series and summary measures over at most one hour of history. A total of 368 subjects were used for training the Historical learners using the sl3 R package (Coyle et al., 2021; R Core Team, 2020).
The library of Historical learners included the following: multiple variations of gradient boosted decision trees (xgboost), random forests (ranger), and elastic net generalized linear models (glmnet); a discrete Bayesian additive regression trees model (dbarts), a Bayesian generalized linear model (bayesglm), and a linear regression (glm) (Chen and Guestrin, 2016; Wright and Ziegler, 2017; Friedman et al., 2010; Chipman et al., 2010; Gelman and Su, 2020) . This library was fit after reducing the number of time-varying covariates with a pre-screening step that selected the 200 "most important" time-varying covariates according to a ranger random forest variable importance metric, and then those 200 time-varying covariates and the baseline covariates were passed on to the library of Historical learners. The two patients that were not selected for training the Historical learners were used to train, separately, a library of Individual online learners; the selection of the two patients was random. Individual learners were updated with each batch of five observations (i.e., updated every five minutes), following accumulation of an initial training size consisting of 60 observations. The Individual learners included the following: multiple variations of nonlinear time-series models (tsDyn) and elastic net generalized linear models (glmnet); a linear regression (glm), a gradient-boosted decision tree model (xgboost), a random forest model (ranger), and an ARIMA model with automated tuning (auto.arima) (Narzo et al., 2020; Friedman et al., 2010; Chen and Guestrin, 2016; Wright and Ziegler, 2017; Hyndman et al., 2021) . The linear regression and ARIMA Individual learners were fit following a pre-screening step involving lasso regression, in which the variables with non-zero lasso regression coefficients were selected and then passed to these Individual learners. 
At each subject-specific 5-minute update, POSL selects the candidate with the lowest online cross-validated risk, where the set of candidates includes the Individual and Historical learners, as well as ensembles of them. For both subjects, POSL's risk function was the weighted mean squared error (the expectation of the weighted squared error loss), where the weights decreased as a function of time. For losses obtained 180 minutes or more from the subject's current time m, the weights were set to 0 when calculating the weighted mean loss. For losses obtained 30 minutes or less from the subject's current time m, the weights were set to 1. The weights assigned to losses obtained more than 30 minutes but less than 180 minutes from the subject's current time m decayed as (1 − 0.001)^{m − m_L}, where m_L is the time when the loss was measured, m_L = 0, ..., m, so the difference m − m_L is the lag from the loss's time to the current time. Letting w(m_L) denote the weight assigned to a loss measured at time m_L, this strategy can be expressed as w(m_L) = 1 if m − m_L ≤ 30; w(m_L) = (1 − 0.001)^{m − m_L} if 30 < m − m_L < 180; and w(m_L) = 0 if m − m_L ≥ 180. In Figure 6 we illustrate the application of POSL to the ICU data problem to obtain five-minute-ahead forecasts of an individual's mean arterial pressure, summarizing POSL's performance for the two subjects that were not used in training the Historical learners. In parts A and B of Figure 6, we show how POSL assigned weight to the Historical and Individualized candidate learners over time. For each subject, we identified the Individualized learner and the Historical learner that had the lowest MSE when averaged across the individual's time-series; these "best" Individualized and Historical learners varied across the subjects. We present POSL's forecasts and the "best" Historical and Individual learners' forecasts alongside the observed mean arterial pressure in parts C and D of Figure 6.
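The time-decay weighting scheme described above translates directly into code; the function below implements the stated 30/180-minute thresholds and (1 − 0.001)^lag decay, with the thresholds and rate exposed as parameters for clarity.

```python
def loss_weight(m, m_L, full=30, cutoff=180, rate=0.001):
    """Time-decay loss weights from the MAP application: weight 1 for
    losses within `full` minutes of the current time m, weight 0 for
    losses `cutoff` minutes old or more, and (1 - rate)^(m - m_L)
    decay in between."""
    lag = m - m_L
    if lag <= full:
        return 1.0
    if lag >= cutoff:
        return 0.0
    return (1.0 - rate) ** lag
```

This down-weighting makes the online cross-validated risk emphasize recent performance, which is what allows POSL to react to a drifting physiological signal rather than averaging over the full history.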
We present the mean squared error of all of the learners from Figure 6 in part E, which displays, for subject 2, the performance of the learners that performed best for subject 1, and vice versa. This table shows the variability of the candidate learners' performance across subjects and the stability of POSL's performance, highlighting POSL's ability to adapt to an individual's time-series by leveraging its candidates, and its ability to perform better, in terms of the specified loss, than any of its candidates.

Figure 6: Application of POSL for five-minute-ahead forecasting of mean arterial pressure with ICU data. In (A) and (B), we show how POSL assigned weights over time to the Historical and Individualized candidate learners for subjects 1 and 2, respectively. We present POSL's forecasts and the "best" (lowest mean squared error) Historical and Individual learners' forecasts alongside the observed mean arterial pressure in (C) for subject 1, where the best Individualized learner was a gradient-boosted regression tree (XGBoost) and the best Historical learner was an elastic net regression, and in (D) for subject 2, where the best Individualized learner was a Self-Exciting Threshold Autoregressive (SETAR) time-series model and the best Historical learner was a random forest. In (E), the mean squared error of all the learners in (C) and (D) is displayed for both subjects.

In this manuscript, we consider the problem of generating personalized forecasts in the data streaming setting with multiple time-series of unknown underlying structure. The Personalized Online Super Learner (POSL) is an online ensembling machine learning algorithm that utilizes multiple time-series and ensembling combination methods with the goal of optimizing individual forecasts.
POSL is regularly updated over time using batches of streaming data, and leverages both online pooled (learning across individuals) and individual (learning through time) learners at each time step, allowing the ensemble weights to depend on the amount of data collected, stationarity, and level of noise. While relying on both Historical (pooled) and Individual learners, the final forecasts are a product of optimizing predictions for each time-series individually at each time step. The general scenario studied consists of observing n observations comprising baseline, time-varying and response covariates collected over τ time points, possibly with unknown dependence structure among trajectories or trajectories sampled from different processes. In addition, we study how the proposed method can be adapted to dynamic enrollment and exit of samples over time. We present multiple cross-validation schemes relevant for different streaming settings, and advocate for an adaptive meta-learning step, where the final weights of the ensemble learner are either based on mutual characteristics of a group of time-series or completely individualized. Finally, under stronger conditions than necessary for the setup we describe, we apply the results established by Benkeser et al. (2018) in a more general time-series setting with multiple time-series. The established result shows that the performance of the cross-validation-based best algorithm is asymptotically equivalent to the performance of the best unknown candidate learner, providing a powerful way to optimally, and in a personalized way, combine multiple estimators in an online, dependent setting. We note that POSL can be used for estimation of any parameter of the conditional distribution of O_i, given its own past and the pasts of other (possibly different) time-series, that minimizes the empirical risk.
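A minimal sketch of the selection step, assuming losses for each candidate have already been accumulated on held-out validation points: the discrete selector picks the candidate minimizing the (time-weighted) online cross-validated risk, and here a simple exponential-weights rule stands in for the paper's more general convex meta-learning step. All function names are ours.

```python
import numpy as np

def online_cv_risks(loss_history, time_weights=None):
    """(Weighted) online cross-validated risk of each candidate.
    loss_history: (T, K) array; loss_history[t, k] is the validation loss of
    candidate k at validation round t."""
    L = np.asarray(loss_history, dtype=float)
    if time_weights is None:
        time_weights = np.ones(L.shape[0])
    return np.average(L, axis=0, weights=time_weights)

def discrete_selector(risks):
    """Discrete super learner: the candidate with the lowest online CV risk."""
    return int(np.argmin(risks))

def exp_weights(risks, eta=5.0):
    """Convex ensemble weights via exponential weighting of the online CV
    risks (a stand-in for a convex least-squares meta-learner)."""
    w = np.exp(-eta * (risks - np.min(risks)))  # shift for numerical stability
    return w / w.sum()
```

In POSL the weights are re-solved at each update, so candidates that deteriorate on recent data lose weight automatically.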
Thinking of all the trajectories as a single ordered time-series provides an interesting opportunity to study the asymptotic behavior of the proposed online super learner in a variety of settings, including dynamic streams. Depending on the number of time points, type of enrollment, and dependence across subjects, it is possible to consider asymptotics in time t, in the number of subjects n, or in a combination of the two. For example, one can study asymptotics in time t only for a fixed number of dependent subjects, or asymptotics in time t for time-series sampled from different data-generating distributions. Alternatively, we could rely on the number of samples only, when subjects are followed up for a limited time frame, when there is no common structure through time, or when the entry times are all concentrated in a finite chronological time interval. For low-dependence settings where samples are followed for a long period of time, it is possible to exploit asymptotics both in the total number of time points observed and across the n subjects. We emphasize that POSL is able to adapt to the underlying structure in the data in all the mentioned settings, allowing the proposed methodology to choose between relying on structure through time, across samples, or both, at each time point. As such, while we impose assumptions on our statistical model in Section 2.1 for the sake of obtaining oracle results, our true statistical model does not rely on conditional stationarity in order for POSL to perform well, in contrast to the canonical online Super Learner (Benkeser et al., 2018). Our proposed method is also constructed to provide optimal forecasts for unit i sampled from P_{0,O_i}, instead of for a collection of time-series. Finally, we emphasize that POSL is a theoretically grounded, flexible, open-source algorithm for many canonical and custom-made time-series prediction problems.
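To make the rolling-origin cross-validation scheme of Figure 7 concrete, here is an index-generating sketch; the default values mirror the illustration's settings (initial training size 15, validation size 10, batch size 10, gap 5), and the function name and interface are ours.

```python
def rolling_origin_splits(n_times, first_window=15, validation_size=10,
                          batch=10, gap=5):
    """Yield (train_idx, valid_idx) index lists for rolling-origin CV:
    the training set grows by `batch` time points at each split, and the
    validation set of size `validation_size` begins `gap` points after
    the training set ends."""
    t = first_window
    while t + gap + validation_size <= n_times:
        train = list(range(t))
        valid = list(range(t + gap, t + gap + validation_size))
        yield train, valid
        t += batch
```

In the multi-sample setting, these time-indexed splits are applied within each fold of samples (the v'-wise folds), so validation points are held out both across time and across subjects.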
While motivated by precision medicine, POSL has a wide range of possible applications, including infectious disease forecasting and financial data. The general algorithm described encompasses varying forecasting horizons, cross-validation schemes, dependence across time, varying enrollment/exit times, tailored ensembling methods, and combinations of individual time-series and pooled algorithms. Our simulation results show superior performance over current state-of-the-art online and ensembling algorithms in terms of MSE across a wide range of forecasting scenarios. In future work, we will explore similar formulations of the Personalized Online Super Learner that allow for peak detection and safe updates under possible data drifts.

A1. There exists an M_1 < ∞ for any valid loss function L and ψ ∈ Ψ such that
A2. There exists an M_2 < ∞ for ψ ∈ Ψ so that, with probability 1,
A3. There exists a slowly increasing sequence M_3 < ∞ such that, with probability tending to 1, we have
A4. Given that M_3 is a sequence that grows arbitrarily slowly to infinity,

Theorem 1. Let P^n_0 describe the true data-generating distribution, P^n_0 ∈ M, with the target parameter defined as Ψ : M → Ψ evaluated at a particular P ∈ M. We define the cross-validation selector k_{n,t} as the minimizer of the online cross-validated risk, and the oracle selector k̃_{n,t} as the minimizer of the true online cross-validated risk. Under assumptions A1-A4, there exists a constant C(M_1) < ∞ such that:

Proof. By Lemma 1, the proof is a direct generalization of the oracle inequality for a single time-series proved in Benkeser et al. (2018) to multiple time-series under the cross-validation schemes described in Section 3, assuming conditional stationarity.

Figure 7: Rolling-origin V-fold cross-validation illustration with V = 2 v'-wise folds (i.e., sample splitting) and V = 2 time-series folds (i.e., splitting across time), with initial training set size n^0_{t,v} = 15, validation set size n^1_{t,v} = 10, batch size m = 10, gap h = 5, and 2 unique IDs.

Z_i(t): time t-specific time-varying covariates for sample i.
Y_i(t): time t-specific response variable for sample i.
M: statistical model, such that the true data-generating distribution is an element of it (P^n_0 ∈ M).
P^n_0: true probability distribution of (X^n, O^n).
L(Ψ(P))(X_i, Y_i(t), Z_i(t−1)): loss function corresponding to Ψ(P) evaluated for sample i at time t; equivalently, we write L(ψ)(X_i, Y_i(t), Z_i(t−1)) for ψ := Ψ(P).
R(P^n_0, ψ): true risk of ψ, defined as the sum over time and samples of the expected value of the loss with respect to P^n_0; R(P^n_0, ψ) = (1/n) Σ_{i=1}^n Σ_{t=1}^τ E_{P_{0,O_i}}[L(ψ)(X_i, Y_i(t), Z_i(t−1)) | X_i, Z_i(t−1)].
ψ_0: predictive function corresponding to Ψ(P^n_0); minimizer of the true risk over all evaluated ψ in the parameter space.
ψ_{0,i}: minimizer of the true risk over all evaluated ψ in the parameter space for sample i.
Ψ: estimator mapping, a function from the empirical distribution (P_{n,t}, at any t) to the parameter space Ψ.
P_{n,t}: empirical distribution of n time-series collected until time t.
ψ_{n,t}: predictive function corresponding to Ψ(P_{n,t}); Ψ(P_{n,t}) maps (X_i, Z_i(t−1), t) into a time- and subject-specific outcome, Y_i(t).
C(i, s, ·): at minimum, time s- and unit i-specific record where C(i, s, ·) = (X_i, Z_i(s−1), Y_i(s), ·).
v: fold number corresponding to the used cross-validation scheme.
Ψ_k: estimator mapping for candidate algorithm k, a function from the empirical distribution (P_{n,t}, at any t) to the parameter space Ψ.
ψ^0_{n,t,k}: predictive function corresponding to Ψ_k(P^0_{n,t}), based on candidate algorithm k and trained on the training set P^0_{n,t} until time t.
R_CV(P^1_{n,t}, Ψ_k(·)): online cross-validated risk for candidate estimator k, defined as an empirical average of the loss with respect to P^1_{n,t}; in particular, R_CV(P^1_{n,t}, Ψ_k(·)) = Σ_{j=1}^t Σ_{v=1}^V Σ_{(i,s)∈B^1_{j,v}} L(ψ^0_{n,j,k})(C(i, s)).
R_CV(P^n_0, Ψ_k(·)): true online cross-validated risk for candidate estimator k, defined as the expected value of the loss with respect to P^n_0.
ψ^0_{n,t,k_{n,t}}: predictive function corresponding to Ψ_{k_{n,t}}(P^0_{n,t}), based on the discrete online cross-validation selector k_{n,t} and trained on the training set P^0_{n,t} until time t.
d_{0,t}(ψ_{n,t,k,i}, ψ_{0,i}): loss-based dissimilarity for sample i between the k-th estimator Ψ_k evaluated until time t and ψ_{0,i}, with respect to P_{0,O_i}.
d_{0,t}(ψ_{n,t,k}, ψ_0): average of the i-specific loss-based dissimilarities over all samples evaluated until time t.
k̃_{n,t}: oracle selector among the discrete set of candidate estimators Ψ_k based on P_{n,t}.
Ψ_α: estimator mapping, a function from the empirical distribution (P_{n,t}, at any t) generating an ensemble of K estimators (Ψ_1, . . . , Ψ_K) indexed by a vector of coefficients α = (α_1, . . . , α_K).
ψ^0_{n,t,α}: predictive function corresponding to Ψ_α(P^0_{n,t}), based on the ensemble of K candidate estimators with weights α, trained on the training set P^0_{n,t} until time t.
α_{n,t}: weight vector for the ensemble of candidate estimators Ψ_k based on P_{n,t} that minimizes the online cross-validated risk.
α̃_{n,t}: oracle selector among weights of the ensemble of candidate estimators Ψ_k based on P_{n,t}.
F_{n(t)}(t): collection of n(t) observed time-series enrolled at or before time point t; we write F_{n(t)}(t) ≡ {O_i(h_i(t)) : ∀i where E_i ≤ t}.
C^h(i, s, E_i, T_i): time h_i(s)- and unit i-specific record where C^h(i, s, E_i, T_i) = (X_i, Z_i(h_i(s−1)), Y_i(h(s)), E_i, T_i).
I(t): set of all samples i with start date before the current chronological time t (E_i ≤ t) whose data is still being collected (t ≤ T_i).
R_m(P_0, ψ): true risk of ψ for an m-time-point-specific prediction, defined as the sum over time and available samples of the expected value of the loss with respect to P^n_0.
R_{CV,m}(P^1_{n,t}, Ψ_k(·)): online cross-validated risk for candidate estimator k, defined as an empirical average of the loss with respect to P^1_{n,t} at individual time m.
k_{n,t,m}: learner that minimizes the online cross-validated risk for individual time m; the m-specific discrete online cross-validation selector.
α_{n,t,m}: weight vector for the ensemble of candidate estimators Ψ_k based on P_{n,t} for individual time m that minimizes the online cross-validated risk.

Curating a COVID-19 data repository and forecasting county-level death counts in the United States
Online learning for time series prediction
Online cross-validation-based ensemble learning
Generalizable deep temporal models for predicting episodes of sudden hypotension in critically ill patients: a personalized approach
XGBoost: A scalable tree boosting system
BART: Bayesian additive regression trees
Rachael Phillips, and Oleg Sofrygin. sl3: Modern Pipelines for Machine Learning and Super Learning
Asymptotics of cross-validated risk estimation in estimator selection and performance assessment
One-step ahead sequential super learning from short times series of many slightly dependent data, and anticipating the cost of natural disasters
Regularization paths for generalized linear models via coordinate descent
arm: Data Analysis Using Regression and Multilevel/Hierarchical Models
The value added by machine learning approaches in forecasting
PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation
To combine or not to combine: selecting among forecasts and their combinations
Online learning: A comprehensive survey. arXiv
forecast: Forecasting functions for time series and linear models
Gratis: Generating time series with diverse and controllable characteristics
Modeling flat stretches, bursts, and outliers in time series using mixture transition distribution models
Clinical applications of continual learning machine learning. The Lancet Digital Health
Nonlinear Time Series Models with Regime Switching
Weighted ensemble of statistical models
R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing
Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database
Fast and accurate yearly time series forecasting with forecast combinations
A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting
Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples
Super learner
The cross-validated adaptive epsilon-net estimator
Oracle inequalities for multi-fold cross validation
On the minimal penalty for Markov order estimation
On a mixture autoregressive model
ranger: A fast implementation of random forests for high dimensional data in C++ and R

The authors would like to thank Jeremy R. Coyle, Ph.D. for all of the valuable comments and insight.

Appendix A

Lemma 1. The difference between the online cross-validated risk (minimized by k_{n,t}) and the true online cross-validated risk (minimized by k̃_{n,t}) is a discrete martingale. That is, the difference between the centered cross-validated risk R_CV(P^1_{n,t}, Ψ_k(·)) and the true cross-validated risk R_CV(P_0, Ψ_k(·)), conditional on the filtration defined by the training set, is a discrete martingale: