key: cord-0212355-ryjywu68 authors: Chen, Yudong; Wang, Tengyao; Samworth, Richard J. title: Inference in high-dimensional online changepoint detection date: 2021-11-02 journal: nan DOI: nan sha: 7b6b55dde055ec9fe747ca625ef1f8fc69fb733f doc_id: 212355 cord_uid: ryjywu68 We introduce and study two new inferential challenges associated with the sequential detection of change in a high-dimensional mean vector. First, we seek a confidence interval for the changepoint, and second, we estimate the set of indices of coordinates in which the mean changes. We propose an online algorithm that produces an interval with guaranteed nominal coverage, and whose length is, with high probability, of the same order as the average detection delay, up to a logarithmic factor. The corresponding support estimate enjoys control of both false negatives and false positives. Simulations confirm the effectiveness of our methodology, and we also illustrate its applicability on both US excess deaths data from 2017--2020 and S&P 500 data from the 2007--2008 financial crisis. The real-time monitoring of evolving processes has become a characteristic feature of 21st century life. Watches and defibrillators track health data, Covid-19 case numbers are reported on a daily basis and financial decisions are made continuously based on the latest market movements. Given that changes in the dynamics of such processes are frequently of great interest, it is unsurprising that the area of changepoint detection has undergone a renaissance over the last 5-10 years. One of the features of modern datasets that has driven much of the recent research in changepoint analysis is high dimensionality, where we monitor many processes simultaneously, and seek to borrow strength across the different series to identify changepoints. The nature of series that are tracked in applications, as well as the desire to evade to the greatest extent possible the curse of dimensionality, means that it is commonly assumed that the signal of interest is relatively sparse, in the sense that only a small proportion of the constituent series undergo a change. Furthermore, the large majority of these works have focused on the retrospective (or offline) challenges of detecting and estimating changes after seeing all of the available data (e.g. Chan and Walther, 2015; Cho and Fryzlewicz, 2015; Jirak, 2015; Cho, 2016; Soh and Chandrasekaran, 2017; Wang and Samworth, 2018; Enikeeva and Harchaoui, 2019; Padilla et al., 2019; Kaul et al., 2021; Follain et al., 2021; Liu et al., 2021; Londschien et al., 2021; Rinaldo et al., 2021) . Nevertheless, the related problem where one observes data sequentially and seeks to declare changes as soon as possible after they have occurred, is nowadays receiving increasing attention (e.g. Kirch and Stoehr, 2019; Dette and Gösmann, 2020; Gösmann et al., 2020; Chen et al., 2021; Yu et al., 2021b) . Although the focus of our review here has been on recent developments, including finite-sample results in multivariate and high-dimensional settings, we also mention that changepoint analysis has a long history (e.g. Page, 1954) . Entry points to this classical literature include Csörgő and Horváth (1997) and Horváth and Rice (2014) . For univariate data, sequential changepoint detection is also studied under the banner of statistical process control (Duncan, 1952; Tartakovsky et al., 2014) . 
In the field of high-dimensional statistical inference more generally, uncertainty quantification has become a major theme over the last decade, originating with influential work on the debiased Lasso in (generalised) linear models (Javanmard and Montanari, 2014; van de Geer et al., 2014; Zhang and Zhang, 2014), and subsequently developed in other settings (e.g. Janková and van de Geer, 2015; Yu et al., 2021a). The aim of this paper is to propose methods to address two new inferential challenges associated with the high-dimensional, sequential detection of a sparse change in mean. The first is to provide a confidence interval for the location of the changepoint, while the second is to estimate the signal set, i.e. the set of indices of coordinates that undergo the change. Despite the importance of uncertainty quantification and signal support recovery in changepoint applications, neither of these problems has previously been studied in the multivariate sequential changepoint detection literature, to the best of our knowledge.

Of course, one option here would be to apply an offline confidence interval construction after a sequential procedure has declared a change. However, this would be to ignore the essential challenge of the sequential nature of the problem, whereby one wishes to avoid storing all historical data, to enable inference to be carried out in an online manner. By this we mean that the computational complexity for processing each new observation, as well as the storage requirements, should depend only on the number of bits needed to represent the new data point observed (here, we ignore the errors incurred in rounding real numbers to machine precision; thus, we do not distinguish between continuous random variables and quantised versions where the data have been rounded to machine precision). The online requirement turns out to impose severe restrictions on the class of algorithms available to the practitioner, and lies at the heart of the difficulty of the problem.

To give a brief outline of our construction of a confidence interval with guaranteed $(1-\alpha)$-level coverage, consider for simplicity the univariate setting, where $(X_n)_{n\in\mathbb{N}}$ form a sequence of independent random variables with $X_1,\ldots,X_z \sim N(0,1)$ and $X_{z+1}, X_{z+2},\ldots \sim N(\theta,1)$, all independent. Without loss of generality, we assume that $\theta > 0$. Suppose that $\theta$ is known to be at least $b > 0$ and, for $n \in \mathbb{N}$, let
$$t_{n,b} := \operatorname*{argmax}_{0 \le h \le n} \; \sum_{i=n-h+1}^{n} (X_i - b/2), \qquad (1)$$
where, in the case of a tie, we choose the smallest such $h$. Since $\sum_{i=n-h+1}^{n} (X_i - b/2)$ can be viewed as the likelihood ratio statistic for testing the null of $N(0,1)$ against the alternative of $N(b,1)$ using $X_{n-h+1},\ldots,X_n$, the quantity $t_{n,b}$ is the tail length for which the likelihood ratio statistic is maximised. If $N$ is the stopping time defining a good sequential changepoint detection procedure, then, intuitively, $N - t_{N,b}$ should be close to the true changepoint location $z$, and almost pivotal. This motivates the construction of a confidence interval of the form $\bigl[\max\{N - t_{N,b} - g(\alpha,b),\,0\},\, N\bigr]$, where we control the tail probability of the distribution of $N - t_{N,b}$ in order to choose $g(\alpha,b)$ so as to ensure the desired coverage. In the multivariate case, considerable care is required to handle the post-selection nature of the inferential problem, as well as to determine an appropriate left endpoint for the confidence interval.
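To make the univariate construction concrete, here is a minimal R sketch (our own illustration, not the ocd implementation). It maintains the pair $(R_n, t_n)$ online via the recursion $R_n = \max\{R_{n-1} + b(X_n - b/2), 0\}$ that appears in the proof of Proposition 4, where $t_n$, the number of observations since $R$ last hit zero, serves as the tail length $t_{n,b}$ above (agreeing with it outside ties). The simple threshold stopping rule is only a placeholder for a genuine detection procedure, and the left endpoint uses $g(\alpha,b) = 4\{\Phi^{-1}(1-\alpha/4)\}^2/b^2$ from Proposition 4.

```r
# Univariate sketch: online tail length t_{n,b} and confidence interval
# [max{N - t_{N,b} - g(alpha, b), 0}, N]. The stopping rule below is a
# placeholder, not the ocd procedure.
univariate_ci <- function(x, b, alpha = 0.05, thresh = 10) {
  R <- 0; t <- 0; N <- NA
  for (n in seq_along(x)) {
    R <- max(R + b * (x[n] - b / 2), 0)    # CUSUM-type recursion
    t <- if (R > 0) t + 1 else 0           # observations since R last hit zero
    if (R > thresh) { N <- n; break }      # declare a change (illustrative rule)
  }
  if (is.na(N)) return(NULL)
  g <- 4 * qnorm(1 - alpha / 4)^2 / b^2    # left-endpoint correction g(alpha, b)
  c(lower = max(N - t - g, 0), upper = N)
}

set.seed(1)
x <- c(rnorm(500), rnorm(200, mean = 0.5))   # true changepoint at z = 500
univariate_ci(x, b = 0.5)
```

The online nature of the update is the point of the sketch: each new observation is processed in constant time and only the current pair $(R_n, t_n)$ is stored.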
For this latter purpose, we only assume a lower bound on the Euclidean norm of the vector of mean change, and employ a delicate multivariate and multiscale aggregation scheme; see Section 2 for details. In terms of the base sequential changepoint detection procedures, we focus on the ocd algorithm (short for online changepoint detection) introduced by Chen et al. (2021), as well as its variant ocd′, which provides guarantees on both the average and worst-case detection delays, subject to a guarantee on the patience, or average false alarm rate, under the null hypothesis of no change. Crucially, these are both online algorithms. Our confidence intervals, which we correspondingly denote ocd CI and ocd′ CI, inherit this same online property, thereby making them applicable even in very high-dimensional settings and where changes may be rare, so that we may need to see many new data points before declaring a change.

In Section 3 we study the theoretical performance of the ocd′ CI procedure. In particular, we prove in Theorem 1 that, for a suitable choice of input parameters, the confidence interval has at least nominal coverage. Moreover, Theorem 2 ensures that, with high probability, its length is of the same order as the average detection delay for the base ocd′ procedure, up to a logarithmic factor. This is remarkable in view of the intrinsic challenge that the better such a changepoint detection procedure performs, the fewer post-change observations are available for inferential tasks. A very useful byproduct of our ocd CI methodology is that we obtain a natural estimate of the set of signal coordinates (i.e. those that undergo change). In Theorem 3, we prove that, with high probability, it is able both to recover the effective support of the signal (see Section 3.1 for a formal definition) and to avoid noise coordinates. Section 4 is devoted to a study of the numerical performance of our methodological proposals. Our simulations confirm that the ocd CI methodology attains the desired coverage level across a wide range of parameter settings, that the average confidence interval length is of comparable order to the average detection delay, and that our support recovery guarantees are validated empirically. Moreover, in Sections 4.4.1 and 4.4.2, we illustrate the practical utility of our methods by applying them to both excess death data during the Covid-19 pandemic in the US and S&P 500 data during the 2007-2008 financial crisis. Proofs are given in Section 5, with auxiliary results deferred to Section 6. An R implementation of our methodology is available at github.com/yudongchen88/ocd_CI.

We conclude this introduction with some notation used throughout the paper. We write $\mathbb{N}_0$ for the set of all non-negative integers. For $d \in \mathbb{N}$, we write $[d] := \{1,\ldots,d\}$. Given $a, b \in \mathbb{R}$, we denote $a \vee b := \max(a,b)$ and $a \wedge b := \min(a,b)$. For a set $S$, we use $\mathbb{1}_S$ and $|S|$ to denote its indicator function and cardinality respectively. For a real-valued function $f$ on a totally ordered set $S$, we write $\mathrm{sargmax}_{x\in S} f(x) := \min \mathrm{argmax}_{x\in S} f(x)$ and $\mathrm{largmax}_{x\in S} f(x) := \max \mathrm{argmax}_{x\in S} f(x)$ for the smallest and largest maximisers of $f$ in $S$, and define $\mathrm{sargmin}_{x\in S} f(x)$ and $\mathrm{largmin}_{x\in S} f(x)$ analogously. For a vector $v = (v^1,\ldots,v^p)^\top \in \mathbb{R}^p$, we write $\|v\|_2$ for its Euclidean norm. We use $\Phi(\cdot)$, $\bar\Phi(\cdot)$ and $\phi(\cdot)$ to denote the distribution function, survivor function and density function of the standard normal distribution respectively. For two real-valued random variables $U$ and $V$, we write $U \le_{\mathrm{st}} V$ if $\mathbb{P}(U > x) \le \mathbb{P}(V > x)$ for all $x \in \mathbb{R}$. We adopt the conventions that an empty sum is 0, and that $\min \emptyset := \infty$ and $\max \emptyset := -\infty$.
In the multivariate sequential changepoint detection problem, we observe $p$-variate observations $X_1, X_2, \ldots$ in turn, and seek to report a stopping time $N$ by which we believe a change has occurred. The focus of this work is on changes in the mean of the underlying process, and we denote the time of the changepoint by $z$. Moreover, since our primary interest is in high-dimensional settings, we will also seek to exploit sparsity in the vector of mean change. Given $\alpha \in (0,1)$, then, our primary goal is to construct a confidence interval $C \equiv C(X_1,\ldots,X_N,\alpha)$ with the property that $z \in C$ with probability at least $1-\alpha$.

For $i \in \mathbb{N}$ and $j \in [p]$, let $X^j_i$ denote the $j$th coordinate of $X_i$. The ocd algorithm of Chen et al. (2021), which forms part of Algorithm 1, relies on a lower bound $\beta > 0$ for the $\ell_2$-norm of the vector of mean change, and on sets of signed scales $\mathcal{B}$ and $\mathcal{B}_0$ defined in terms of $\beta$. From our perspective, the key aspects of this multiscale algorithm are that, in addition to returning a stopping time $N$ as output, it produces a matrix of residual tail lengths $(t^j_{N,b})$, indexed by coordinates $j \in [p]$ and signed scales $b$, an 'anchor' coordinate $\hat\jmath \in [p]$, a signed anchor scale $\hat b \in \mathcal{B}$ and a tail partial sum vector $A^{\cdot,\hat\jmath}_{N,\hat b} \in \mathbb{R}^p$. The intuition is that the anchor coordinate and signed scale are chosen so that the final $t^{\hat\jmath}_{N,\hat b}$ observations provide the best evidence among all of the residual tail lengths against the null hypothesis of no change. Meanwhile, $A^{\cdot,\hat\jmath}_{N,\hat b}$ aggregates the last $t^{\hat\jmath}_{N,\hat b}$ observations in each coordinate, thereby providing a measure of the strength of this evidence against the null.

The main idea of our confidence interval construction is to seek to identify coordinates with large post-change signal. To this end, for $j \in [p]$, write $E^{j,\hat\jmath}_{N,\hat b} := A^{j,\hat\jmath}_{N,\hat b}/(t^{\hat\jmath}_{N,\hat b})^{1/2}$, and observe that this quantity is approximately normally distributed around $\theta_j \min\bigl\{(N-z)(t^{\hat\jmath}_{N,\hat b})^{-1/2},\,(t^{\hat\jmath}_{N,\hat b})^{1/2}\bigr\}$ when $t^{\hat\jmath}_{N,\hat b}$ is not too much larger than $N-z$, with variance close to 1. Indeed, if $\hat\jmath$, $\hat b$, $N$ and $t^{\hat\jmath}_{N,\hat b}$ were fixed, and if $0 < t^{\hat\jmath}_{N,\hat b} \le N-z$, then the former quantity would be normally distributed around this centering value, with unit variance. The random nature of these quantities, however, introduces a post-selection inference aspect to the problem. Nevertheless, by choosing an appropriate threshold value $d_1 > 0$, we can ensure that, with high probability, when $j \ne \hat\jmath$ is a noise coordinate, we have $|E^{j,\hat\jmath}_{N,\hat b}| < d_1$, and when $j \ne \hat\jmath$ is a coordinate with sufficiently large signal, there exists a signed scale $b \in (\mathcal{B} \cup \mathcal{B}_0) \cap [-|\theta_j|, |\theta_j|]$, having the same sign as $\theta_j$, for which $E^{j,\hat\jmath}_{N,\hat b} - |b|(t^{\hat\jmath}_{N,\hat b})^{1/2} \ge d_1$. In fact, such a signed scale, if it exists, can always be chosen to be from $\mathcal{B}_0$. As a convenient byproduct, the set of indices $j$ for which the latter inequality holds, which we denote as $\hat S$, forms a natural estimate of the set of coordinates in which the mean change is large. For each $j \in \hat S$, we take the largest such scale; we denote its signed version, where the sign is chosen to agree with that of $E^{j,\hat\jmath}_{N,\hat b}$, by $\hat b_j$; this can be regarded as a shrunken estimate of $\theta_j$, and therefore plays the role of the lower bound $b$ from the univariate problem discussed in the introduction. Finally, then, our confidence interval can be constructed as the intersection over indices $j \in \hat S$ of the confidence interval from the univariate problem in coordinate $j$, with signed scale $\hat b_j$. Pseudo-code for this ocd CI confidence interval construction is given in Algorithm 1, where we suppress the $n$ dependence on quantities that are updated at each time step.

Algorithm 1: Pseudo-code for the confidence interval construction algorithm ocd CI.
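The selection and aggregation steps just described can be illustrated schematically. The R sketch below is our own simplification rather than the pseudo-code of Algorithm 1: it assumes a base detector has already returned the stopping time N, the anchor tail length t_hat, and the vector A of tail partial sums over the last t_hat observations in every coordinate; it uses the same tail length for every coordinate purely for simplicity, and the treatment of an empty support estimate is our own choice.

```r
# Illustrative sketch (not Algorithm 1 itself): combine per-coordinate evidence
# into a support estimate, shrunken scales and a confidence interval, given
# hypothetical outputs of a base online detector.
ocd_ci_sketch <- function(N, t_hat, A, scales, d1, d2) {
  E <- A / sqrt(t_hat)                       # normalised tail partial sums E_j
  b_hat <- rep(NA_real_, length(E))
  for (j in seq_along(E)) {
    ok <- scales[abs(E[j]) - scales * sqrt(t_hat) >= d1]  # scales with enough evidence
    if (length(ok) > 0) b_hat[j] <- sign(E[j]) * max(ok)  # shrunken signed scale estimate
  }
  S_hat <- which(!is.na(b_hat))              # estimated support
  if (length(S_hat) == 0) return(c(lower = max(N - t_hat, 0), upper = N))
  g <- d2 / b_hat[S_hat]^2                   # per-coordinate left-endpoint corrections
  c(lower = max(N - t_hat - min(g), 0), upper = N)  # intersection of univariate intervals
}
```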
The computational complexity per new observation, as well as the storage requirements, of this algorithm are $O\bigl(p^2\log(ep)\bigr)$ regardless of the observation history, so it satisfies the condition to be an online algorithm, as discussed in the introduction. While our experience is that ocd CI performs very well empirically, in our theoretical analysis it turns out to be easier to study a slight variant of this algorithm, denoted ocd′ CI. There are two main differences between the algorithms. First, in ocd CI, the base changepoint detection procedure is ocd, while in ocd′ CI, we use the ocd′ procedure of Chen et al. (2021) instead. This latter algorithm is designed to avoid difficulties caused by adversarial pre-change observations that may lead to lengthy response delays for the ocd procedure. In particular, for each $j \in [p]$ and $b \in \mathcal{B}$, instead of using the final $t^j_{n,b}$ observations at time $n$ to construct test statistics based on $A^{\cdot,j}_{n,b}$, the ocd′ procedure aggregates over a reduced number $\tau^j_{n,b}$ of observations to obtain test statistics based on $\Lambda^{\cdot,j}_{n,b}$, where $\tau^j_{n,b}$ is constructed in an online manner to lie in the interval $[t^j_{n,b}/2,\, 3t^j_{n,b}/4]$ for $t^j_{n,b} \ge 2$. Even though the reduced tail lengths may lead to a slight deterioration in empirical performance, provided no change has been declared by time $z$, they guarantee that from a later time of the form $z + O(b^{-2})$, the last $\tau^j_{n,b}$ observations consist entirely of post-change data. Second, in ocd′ CI, we allow the practitioner to observe a further $\ell$ observations after the time of changepoint declaration, before constructing the confidence interval. The additional observations are used to determine the anchor coordinate $\hat\jmath$ and scale $\hat b$, as well as the estimated support $\hat S$ and the estimated scale $\hat b_j$ for each $j \in \hat S$. Thus, the extra sampling is used to guard against an unusually early changepoint declaration that leaves very few post-change observations for inference. Nevertheless, we will see in Theorem 1 below that the ocd′ CI confidence interval has guaranteed nominal coverage even with $\ell = 0$, so that additional observations are only used to control the length of the interval. In fact, even for this latter aspect, the numerical evidence presented in Section 4 indicates that $\ell = 0$ provides confidence intervals of reasonable length in practice. Similarly, Theorem 3 ensures that, with high probability, our support estimate $\hat S$ contains no noise coordinates (i.e. has false positive control) even with $\ell = 0$, so that the extra sampling is only used to provide false negative control. Pseudo-code for the ocd′ CI algorithm is given in Algorithm 2; its computational complexity per new observation, and storage requirements, remain $O\bigl(p^2\log(ep)\bigr)$.

Throughout this section, we will assume that the sequential observations $X_1, X_2, \ldots$ are independent, and that there exist $z \in \mathbb{N}_0$ and $\theta = (\theta_1,\ldots,\theta_p)^\top \ne 0$ for which $X_1,\ldots,X_z \sim N_p(0, I_p)$ and $X_{z+1}, X_{z+2},\ldots \sim N_p(\theta, I_p)$. We let $\vartheta := \|\theta\|_2$, and write $\mathbb{P}_{z,\theta}$ for probabilities computed under this model, though in places we omit the subscripts for brevity. Define the effective sparsity of $\theta$, denoted $s(\theta)$, to be the smallest $s \in \{2^0, 2^1, \ldots, 2^{\lceil\log_2 p\rceil}\}$ such that the corresponding effective support $S(\theta) := \bigl\{j \in [p] : |\theta_j| \ge \|\theta\|_2/\sqrt{s\log_2(2p)}\bigr\}$ has cardinality at least $s$.
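To make this definition concrete, the following short R sketch (our own illustration) computes $s(\theta)$ and $S(\theta)$ by scanning the dyadic grid in the definition above.

```r
# Effective sparsity s(theta) and effective support S(theta): the smallest dyadic s
# such that at least s coordinates exceed ||theta||_2 / sqrt(s * log2(2p)).
effective_sparsity <- function(theta) {
  p <- length(theta)
  for (s in 2^(0:ceiling(log2(p)))) {
    S <- which(abs(theta) >= sqrt(sum(theta^2)) / sqrt(s * log2(2 * p)))
    if (length(S) >= s) return(list(s = s, S = S))
  }
  NULL  # defensive fallback; the dyadic grid is designed so that some s qualifies
}

effective_sparsity(c(3, 1, rep(0, 6)))  # p = 8: returns s = 1, S = {1}
```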
Thus, the sum of squares of coordinates in the effective support of θ has the same order of magnitude as θ 2 2 , up to logarithmic factors. Moreover, if at most s components of θ are non-zero, then s(θ) ≤ s, and the equality is attained when, for example, all non-zero coordinates have the same magnitude. The following theorem shows that the confidence interval constructed in the ocd CI algorithm has the desired coverage level. Theorem 1. Let p ≥ 2. Fix α ∈ (0, 1) and γ ≥ 1 and assume that z ≤ 2αγ. Then there exist universal constants C 1 , C 2 > 0, such that with inputs As mentioned in Section 2.1, our coverage guarantee in Theorem 1 holds even with = 0, i.e. with no additional sampling. The condition z ≤ 2αγ ensures that the probability of a false alarm is at most α/2, so that P z,θ (N ≤ z) ≤ α/2. We now provide a guarantee on the length of the ocd CI confidence interval. Theorem 2. Assume that θ has an effective sparsity of s := s(θ) ≥ 2. Fix α ∈ (0, 1) and γ ≥ 1, and assume that z ≤ 2αγ. Then there exist universal constants C 1 , Algorithm 2: Pseudo-code for the ocd CI algorithm, a slight variant of ocd CI Input: The main conclusion of Theorem 2 is that, with high probability, the length of the confidence interval is of the same order, up to a logarithmic factor, as the average detection delay guarantee for the ocd procedure (Chen et al., 2020, Theorem 4) . Note that the choices of inputs in Theorem 2 are identical to those in Theorem 1, except that we now ask for some additional observations after the changepoint declaration, the number of which is of the same order of magnitude as the length of the interval. Recall the definition of S(θ) from the beginning of this section, and denote S β (θ) := j ∈ [p] : |θ j | ≥ b min , where b min , defined in Algorithm 2, is the smallest positive scale in B ∪ B 0 , We will suppress the dependence on θ of both these quantities in this subsection. Theorem 3 below provides a support recovery guarantee forŜ, defined in Algorithm 2. Since neitherŜ nor the anchor coordinateĵ defined in the algorithm depend on d 2 , we omit its specification; the choices of other tuning parameters mimic those in Theorems 1 and 2. Theorem 3. Assume the conditions of Theorem 1. (a) There exist universal constants C 1 , C 2 > 0, such that with inputs (X t ) t∈N , 0 < β ≤ ϑ, a = C 1 log{pγ(β −2 ∨ 1)α −1 }, T diag = log{16pγ log 2 (4p)}, T off = 8 log{16pγ log 2 (2p)}, ≥ 0 and d 1 = C 2 a in Algorithm 2, we have (b) Assume further that θ has effective sparsity s := s(θ) ≥ 2. There exist universal constants Note that S ⊆ S β ⊆ {j ∈ [p] : θ j = 0}. Thus, part (a) of the theorem reveals that with high probability, our support estimateŜ does not contain any noise coordinates. Part (b) offers a complementary guarantee on the inclusion of all 'big' signal coordinates, provided we augment our support estimate with the anchor coordinateĵ. In this section, we study the empirical performance of the ocd CI algorithm. Recall that in ocd CI, the off-diagonal statistics Q j b are computed using tail partial sums of length t j b and that we do not have any extra sampling beyond the time of declaration that a change has occurred. Chen et al. (2021) found that the theoretical choices of thresholds T diag and T off for the ocd procedure were a little conservative, and therefore recommended determining these thresholds via Monte Carlo simulation; we replicate the method for choosing these thresholds described in their Section 4.1. Likewise, as in Chen et al. 
(2021), we take $a = \sqrt{2\log p}$ in our simulations. This leaves us with the choice of tuning parameters $d_1$ and $d_2$. As suggested by Theorems 1 and 2, we take $d_2 = 4d_1^2$. Finally, again as suggested by our theory, we take $d_1$ to be of the form $d_1 = c\log(p/\alpha)$, and then tune the parameter $c > 0$ through Monte Carlo simulation, as we now describe.

We considered the parameter settings $p \in \{100, 500\}$, $s \in \{2, \sqrt p, p\}$, $\vartheta \in \{2, 1, 1/2\}$, $\alpha = 0.05$, $\beta \in \{2\vartheta, \vartheta, \vartheta/2\}$, $\gamma = 30000$ and $z = 500$. Then, with $\theta$ generated as $\vartheta U$, where $U$ is uniformly distributed on the union of all $s$-sparse unit spheres in $\mathbb{R}^p$ (independent of our data), we studied the coverage probabilities, estimated over 2000 repetitions as $c$ varies, of the ocd CI confidence interval for data generated according to the Gaussian model defined at the beginning of Section 3. Figure 1 displays a subset of the results (the omitted curves were qualitatively similar). On this basis, we recommend $c = 0.5$ as a safe choice across a wide range of data generating mechanisms, and we used this value of $c$ throughout our confidence interval simulations.

In Table 1, we present the detection delay of the ocd procedure, as well as the coverage probabilities and average confidence interval lengths of the ocd CI procedure, all estimated over 2000 repetitions, with the same set of parameter choices and data generating mechanism as in Section 4.1. From this table, we see that the coverage probabilities are at least at the nominal level (up to Monte Carlo error) across all settings considered. Underspecification of $\beta$ means that the grid of scales that can be chosen for indices in $\hat S$ is shifted downwards, and therefore increases the probability that $\hat b_j$ will significantly underestimate $\theta_j$ for $j \in \hat S$. In turn, this leads to slight conservativeness in the coverage probability (and a correspondingly increased average confidence interval length). On the other hand, overspecification of $\beta$ yields a shorter interval on average, though these intervals were nevertheless able to retain the nominal coverage in all cases considered. Another interesting feature of Table 1 is the comparison of the average confidence interval lengths with the corresponding average detection delays. Theorem 2, as well as Chen et al. (2021, Theorem 4), indicate that both of these quantities are of order $(s/\beta^2) \vee 1$, up to polylogarithmic factors in $p$ and $\gamma$, but of course whenever the confidence interval includes the changepoint, its length must be at least as long as the detection delay. Nevertheless, in most settings, it is only 2 to 3 times longer on average, and in all cases considered it was less than 7 times longer on average. Moreover, we can also observe that the confidence interval length increases with $s$ and decreases with $\beta$, as anticipated by our theory.

We now turn our attention to the empirical support recovery properties of the quantity $\hat S$ (in combination with the anchor coordinate $\hat\jmath$) computed in the ocd CI algorithm. In Table 2, we present the probabilities, estimated over 500 repetitions, that $\hat S \subseteq S_\beta$ and that $\hat S \cup \{\hat\jmath\} \supseteq S$, for $p = 100$, $s \in \{5, 50\}$, $\vartheta \in \{1, 2\}$, and for three different signal shapes: in the uniform, inverse square root and harmonic cases, we took $\theta \propto (\mathbb{1}_{\{j\in[s]\}})_{j\in[p]}$, $\theta \propto (j^{-1/2}\mathbb{1}_{\{j\in[s]\}})_{j\in[p]}$ and $\theta \propto (j^{-1}\mathbb{1}_{\{j\in[s]\}})_{j\in[p]}$ respectively. As inputs to the algorithm, we set $a = \sqrt{2\log p}$, $\alpha = 0.05$, $d_1 = 2\log(p/\alpha)$, $\beta = \vartheta$, and, motivated by Theorem 3, took an additional $\ell = a^2 s\beta^{-2}\log_2(2p)$ post-declaration observations in constructing the support estimates.
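The data-generating mechanism used in these simulations is easy to reproduce. The following R sketch (our own illustration, with an arbitrary seed and function name) draws $\theta = \vartheta U$ with $U$ uniform on the union of all $s$-sparse unit spheres, and then generates pre- and post-change observations from the Gaussian model of Section 3.

```r
# Draw theta = vartheta * U, with U uniform on the union of s-sparse unit spheres,
# then simulate n observations with a mean change after time z.
simulate_changepoint_data <- function(p, s, vartheta, z, n) {
  support <- sample(p, s)                    # uniformly chosen s-sparse support
  u <- rnorm(s); u <- u / sqrt(sum(u^2))     # uniform direction on the unit sphere in R^s
  theta <- numeric(p); theta[support] <- vartheta * u
  X <- matrix(rnorm(n * p), n, p)            # N_p(0, I_p) noise
  X[(z + 1):n, ] <- sweep(X[(z + 1):n, , drop = FALSE], 2, theta, "+")  # add mean change
  list(X = X, theta = theta)
}

set.seed(42)
dat <- simulate_changepoint_data(p = 100, s = 2, vartheta = 1, z = 500, n = 700)
```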
The results reported in Table 2 provide empirical confirmation of the support recovery properties claimed in Theorem 3. Finally in this section, we consider the extent to which the additional observations are necessary in practice to provide satisfactory support recovery. In the left panel of Figure 2, we plot Receiver Operating Characteristic (ROC) curves to study the estimated support recovery probabilities with $\ell = 0$ (i.e. no additional sampling) as a function of the input parameter $d_1$, which can be thought of as controlling the trade-off between $\mathbb{P}(\hat S \cup \{\hat\jmath\} \supseteq S)$ and $\mathbb{P}(\hat S \subseteq S_\beta)$. The fact that the triangles in this plot are all to the left of the dotted vertical line confirms the theoretical guarantee provided in Theorem 3(a), which holds with $d_1 = 2\log(p/\alpha)$, and even with $\ell = 0$; the less conservative choice $d_1 = \sqrt{2\log p}$, which roughly corresponds to an average of one noise coordinate included in $\hat S$, allows us to capture a larger proportion of the signal. From this panel, we also see that additional sampling is needed to ensure that, with high probability, we recover all of the true signals. This is unsurprising: for instance, with a uniform signal shape and $s = 50$, it is very unlikely that all 50 signal coordinates will have accumulated such similar levels of evidence as to appear in $\hat S \cup \{\hat\jmath\}$ by the time of declaration. The right panel confirms that, with an inverse square root signal shape, the probability that we capture each signal increases with the signal magnitude, and that even small signals tend to be selected with higher probability than noise coordinates.

Table 2: Estimated support recovery probabilities (with standard errors in brackets). Other settings: $p = 100$, $a = \sqrt{2\log p}$, $\alpha = 0.05$, $d_1 = 2\log(p/\alpha)$, $\beta = \vartheta$, and with an additional $\ell = a^2 s\beta^{-2}\log_2(2p)$ post-declaration observations.

In this section, we apply ocd CI to a dataset of weekly deaths in the United States between January 2017 and June 2020, available at https://www.cdc.gov/nchs/nvss/vsrr/covid19/excess_deaths.htm. The data up to 29 June 2019 are treated as our training data. For each of the 50 states, as well as Washington, D.C. ($p = 51$), we pre-process the data as follows. To remove the seasonal effect, we first estimate the 'seasonal death curve', i.e. the mean death numbers for each day of the year, for each state. The seasonal death curve is estimated by first splitting the weekly death numbers evenly across the seven relevant days, and then estimating the average number of deaths on each day of the year from these derived daily death numbers using a Gaussian kernel with a bandwidth of 20 days. As the death numbers follow an approximate Poisson distribution, we apply a square-root transformation to stabilise the variance; more precisely, the transformed weekly excess deaths are computed as the difference of the square roots of the weekly deaths and the predicted weekly deaths from the seasonal death curve. Finally, we standardise the transformed weekly excess deaths using the mean and standard deviation of the transformed data over the training period. The standardised, transformed data are plotted in Figure 3 for 12 states. When applying ocd CI to these data, we take $a = \sqrt{2\log p}$, $T_{\mathrm{diag}} = \log\{16p\gamma\log_2(4p)\}$, $T_{\mathrm{off}} = 8\log\{16p\gamma\log_2(2p)\}$, $d_1 = 0.5\log(p/\alpha)$ and $d_2 = 4d_1^2$, with $\alpha = 0.05$, $\beta = 50$ and $\gamma = 1000$.
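The pre-processing just described is straightforward to sketch in R. The following is our own illustration, not the authors' code: it assumes `weekly_deaths` is the vector of weekly counts for a single state with corresponding week-ending dates `week_end`, and the treatment of the year boundary in the kernel smoothing and the summation of the daily seasonal curve into weekly predictions are our own choices.

```r
# Sketch of the pre-processing for one state's weekly death counts.
# Assumes `weekly_deaths` (numeric) and `week_end` (Date, one per week) are available.
preprocess_state <- function(weekly_deaths, week_end, train_end = as.Date("2019-06-29")) {
  # spread each weekly count evenly over its seven days
  days <- rep(week_end, each = 7) - rep(6:0, times = length(week_end))
  daily <- rep(weekly_deaths / 7, each = 7)
  # seasonal death curve: Gaussian-kernel average (bandwidth 20 days) by day of year
  doy <- as.integer(format(days, "%j"))
  seasonal <- sapply(1:366, function(d) {
    w <- dnorm(pmin(abs(doy - d), 366 - abs(doy - d)), sd = 20)  # circular distance in days
    sum(w * daily) / sum(w)
  })
  predicted_weekly <- sapply(seq_along(week_end), function(i)
    sum(seasonal[doy[(7 * i - 6):(7 * i)]]))
  # variance-stabilising square-root transform of weekly excess deaths
  excess <- sqrt(weekly_deaths) - sqrt(predicted_weekly)
  train <- week_end <= train_end
  (excess - mean(excess[train])) / sd(excess[train])  # standardise on the training period
}
```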
On the monitoring data (from 30 June 2019), the ocd CI algorithm declares a change on the week ending 28 March 2020, and provides a confidence interval from the week ending 21 March 2020 up to the week of declaration.

Figure 2: In the left panel, we plot ROC curves for three different signal shapes and for sparsity levels $s \in \{5, 50\}$. The triangles and circles correspond to points on the curves with $d_1 = 2\log(p/\alpha)$ (with $\alpha = 0.05$) and $d_1 = \sqrt{2\log p}$ respectively. The dotted vertical line corresponds to $\mathbb{P}(\hat S \subseteq S_\beta) = 1 - \alpha$. In the right panel, we plot the proportion of 500 repetitions for which each coordinate belongs to $\hat S \cup \{\hat\jmath\}$ with $d_1 = \sqrt{2\log p}$; here, the $s = 20$ signals have an inverse square root shape, and are plotted in red; noise coordinates are plotted in black. Other parameters for both panels: $p = 100$, $\beta = \vartheta = 2$, $\ell = 0$, $a = \sqrt{2\log p}$.

We now use ocd CI to study market movements leading up to the financial crisis of 2007-2008. We selected the $p = 254$ stocks that were both in the S&P 500 listing and were traded throughout the period from 1 January 2006 to 31 December 2007. The historical price data were downloaded from finance.yahoo.com using the quantmod R package (Ryan et al., 2020); a similar dataset was studied by Cai and Wang (2021). For each stock, we compute the daily logarithmic returns from the adjusted closing prices. We use the data from 2006 as the training data and standardise the entire dataset according to the mean and variance over the training period. Since these logarithmic returns have a heavy-tailed distribution, we clip the standardised data at ±3. When applying the ocd CI procedure to this dataset, we used the same input parameters as in Section 4.4.1. So as to be able to use ocd CI repeatedly to identify multiple changes, we also set a cool-down period of 10 trading days (i.e. the monitoring resets and restarts 10 trading days after a change is declared). This allows the market to recover from any loss (or gain) from the previous change, so that the same market movement is not identified as more than one changepoint. The first four changes were declared on 27 February 2007, 24 May 2007, 24 July 2007 and 8 August 2007, with corresponding confidence intervals shown in Figure 4. This figure also depicts the relative sector impact of each change by showing the percentage of stocks in each sector (according to the Global Industry Classification Standard) that belongs to the estimated support of a changepoint. In particular, the first and last identified changepoints are primarily associated with changes in Real Estate stocks; these correspond to an HSBC announcement indicating loan losses on subprime mortgages in February 2007, and American Home Mortgage Investment Corporation filing for bankruptcy in August 2007, respectively (Hausman and Johnston, 2014).

Proof of Theorem 1. Fix $n > z$, $j \in [p]$, $b \in \mathcal{B}$ and $j' \in [p]\setminus\{j\}$. We assume, without loss of generality, that $\theta_j \ge 0$. The case $\theta_j < 0$ can be analysed similarly. Recall that $b_{\min}$, defined in Algorithm 2, is the smallest positive scale in $\mathcal{B} \cup \mathcal{B}_0$. Thus, recalling the definition of $\hat S$ and $\hat b_j$ from Algorithm 2, we have (3). Moreover, by a similar argument to (13) in the proof of Proposition 4, for $b \in (0, \theta_j)$, we have (4). Combining (3) and (4), we have (5), where $r_0 := \bigl\{\tfrac{24T_{\mathrm{off}}}{\vartheta^2} \vee \tfrac{12(a^2 \vee 8\log 2)s}{\vartheta^2} \vee \tfrac{128(T_{\mathrm{diag}} + \log(8/\alpha))s}{\beta^2}\bigr\}\log_2(2p) + 2$.
By a union bound and Proposition 6, we have Therefore, for sufficiently large C 1 > 0 and C 2 > 0, the choice of d 1 and d 2 in the statement of the theorem ensures that Combining this with the fact that P(N ≤ z) ≤ z/(4γ) ≤ α/2, which follows from Lemma 7 when C 1 ≥ √ 8, we deduce the result. Proof of Theorem 2. Denote 0 := C 3 a 2 s log 2 (2p) β 2 +1 . Since the output of Algorithm 2 remains unchanged if we replace (X j t : t ∈ N) by (−X j t : t ∈ N) for any fixed j, we may assume without loss of generality that θ 1 ≥ θ 2 ≥ ϑ/ s log 2 (2p). For j ∈ {1, 2}, we denote b j := max{b ∈ B ∪ B 0 : b ≤ θ j } . Since ϑ ≥ β and s ≤ 2 log 2 (p) , we have b 1 ≥ b 2 ≥ β/ s log 2 (2p) ≥ √ 2b min . For C 5 > 0, let r := C 5 a 2 s log 2 (2p) β 2 + 2, u := 0 β 2 80s log 2 (2p) and δ := a 2 √ r + . Now define the following events: and all j ∈ [p] \ {j} with θ j ≤ δ , Finally, we denote event Ω 4 := Ω 4,1 ∪ Ω 4,2 , with Ω 4,1 := ĵ = 1, 1 ∈Ŝ,b 1 ≥ b 1 / √ 2 Ω 4,2 := ĵ = 1, 2 ∈Ŝ,b 2 ≥ b 2 / √ 2 . Henceforth, we will assume without loss of generality that C 2 ≥ 1/2. Then, on the event 4 k=0 Ω k , we have Let C 4 := C 2 1 (C 5 + C 3 40 + 8C 2 2 ). Then By choosing C 1 ≥ √ 8 and choosing C 5 to be a sufficiently large universal constant, we have log 2 (2p) + 2, so we may apply Proposition 6 and Lemma 7 to deduce that On Ω 0 , we have for any j ∈ [p] and b ∈ B ∪ B 0 that Thus, by Lemma 5 (taking µ = −b/2) and a union bound, we have Now observe that, for all z < n ≤ z + r, b ∈ B ∪ B 0 , j ∈ [p] and j ∈ [p] \ {j}, we have Hence, when θ j ≤ δ, we have that , 1) , and where the last inequality follows from the relation a = 2δ √ r + . Thus, by a union bound, we have By Chen et al. (2021, Lemma 9 in the supplement), we have t j n,b /2 ≤ τ j n,b ≤ t j n,b for all n ∈ N 0 , b ∈ B ∪ B 0 and j ∈ [p]. Moreover, when C 3 ≥ 80(C 5 ∨ 2), we have 0 ≥ 80r. Define b * := β/ s log 2 (2p) ∈ B, so that u = 0 b 2 * /80. We therefore have for any z < n ≤ z +r, j ∈ [p] and b ∈ B that where the final inequality follows from Lemma 8(a), applied with U = , φ 3 = + τ j n,b * , φ 4 = n − z − τ j n,b * and κ = /80. Observe thatQĵ n,b ≥Qĵ n,b * . Thus, by a union bound, we have When C 3 ≥ 144C 2 2 ∨ 80C 5 ∨ 160, we have 0 ≥ 80r and d 1 ≤ Hence, for any z < n ≤ z + r, j ∈ [p] \ {1} and b ∈ B, we have Here, in the final bound, we have used the facts that Ξ 1,j n,b | τ j n,b ∼ N θ 1 min{(n + − z)(τ j n,b + ) −1/2 , (τ j n,b + ) 1/2 }, 1 and that when τ j n,b ≤ /16, as well as the standard Gaussian tail bound used at the end of the proof of Lemma 5. By a similar argument, we also have for any z < n ≤ z + r and b ∈ B that P Ω 3 ∩ {N = n,ĵ = 1,b = b} ∩ Ω c 4,2 X 1 1 , X 1 2 , . . . Thus, by a union bound, we have Hence by substituting (7), (8), (9), (10) and (11) into (6), we conclude that, by increasing the universal constant C 1 > 0 if necessary, as required. Proof of Theorem 3. (a) For j ∈ S c β , we have |θ j | < b min , so the event {|b j | ≤ |θ j |} is empty. Thus by (3), we have, for n > z, j ∈ [p], b ∈ B and j ∈ S c β , that Hence, recalling the definition of r 0 from (5), by Lemma 7 applied with C 1 ≥ √ 8, a union bound and Proposition 6, we have where the final bound follows as in the proof of Theorem 1. (b) We use the events Ω 0 , Ω 1 , Ω 2 , Ω 3 defined in the proof of Theorem 2. Recall from the argument immediately below (10) that when C 3 ≥ 144C 2 2 ∨ 80C 5 ∨ 160, we have τĵ N,b ≤ /16 and d 1 ≤ min j ∈S |θ j | √ /12 on Ω 0 ∩ Ω 3 . Recall also the definition of Ξ j ,j n,b from Algorithm 2. 
Then, for any z < n ≤ z + r, j ∈ [p], j ∈ S \ {j} and b ∈ B, we have where, in the final bound, we have used the facts that Ξ j ,j n,b | τ j n,b ∼ N θ j min{(n+ −z)(τ j n,b + ) −1/2 , (τ j n,b + ) 1/2 }, 1 and that where the penultimate inequality follows from (7), (8), (9), (10) and (12), and the last inequality follows by choosing the universal constant C 1 > 0 to be sufficiently large. Proposition 4. Let X 1 , X 2 , . . . be independent random variables with X 1 , . . . , X z iid ∼ N (0, 1) and X z+1 , X z+2 , . . . iid ∼ N (θ, 1). Assume that 0 < b ≤ θ and let t n,b be defined as in (1) for n ∈ N. Then for any α ∈ (0, 1), and any stopping time N satisfying P(N < z) ≤ α/2, we have that the confidence interval Remark. We could also replace 4{Φ −1 (1 − α/4)} 2 /b 2 by 8 log(2/α)/b 2 in the confidence interval construction, if we apply the final bound from Lemma 5 in (13). Proof. For n ∈ N, define R n,b := max{R n−1,b + b(X n − b/2), 0}, with R 0,b = 0. By Chen et al. (2021, Lemma 2 where the last inequality follows from Lemma 5. Thus, if we choose y = 4{Φ −1 (1 − α/4)} 2 /b 2 , then we are guaranteed that P(N − t N,b − y > z) ≤ α/2. Combining this with the assumption that P(N < z) ≤ α/2, the desired result follows. Lemma 5. Let Y 1 , Y 2 , . . . iid ∼ N (µ, 1). Define U n := n i=1 Y i for n ∈ N 0 , and let ξ := sargmin n∈N 0 µU n . Then, for y ∈ [0, ∞), we have Proof. The first inequality holds since µU ξ ≤ µU 0 = 0. For the second and third inequalities, we may assume without loss of generality that µ > 0, since the result is clear when µ = 0, and if µ < 0 then the result will follow from the corresponding result with µ > 0 by setting Note that (U n − nµ) n∈N 0 is a standard Gaussian random walk starting at 0. Let (B t ) t∈[0,∞) denote a standard Brownian motion starting at 0. Then, we have for any y ∈ N 0 and u > 0 that P inf where the final inequality follows from Siegmund (1986, Proposition 2.4 and Equation (2.5)). Thus, for y ∈ [0, ∞), we have P inf n∈N 0 :n≥y where the first inequality follows from (14) and the fact that U y ∼ N ( y µ, y ) and the last inequality follows from the standard normal distribution tail boundΦ(x) ≤ e −x 2 /2 /2 for x ≥ 0. In Proposition 6 and Lemma 7 below, we assume the Gaussian data generating mechanism given at the beginning of Section 3. Proposition 6. Assume that θ has an effective sparsity of s := s(θ) ≥ 2. Then, the right endpoint N of the interval output from the ocd CI algorithm, with inputs (X t ) t∈N , 0 < β ≤ ϑ, a > 0, T diag = log{16pγ log 2 (4p)} and T off = 8 log{16pγ log 2 (2p)}, satisfies for all r ≥ 24T off log 2 (2p) ϑ 2 ∨ 12(a 2 ∨8 log 2)s log 2 (2p) Proof. For θ ∈ R p with effective sparsity s(θ), there is at most one coordinate in θ of magnitude larger than ϑ/ √ 2, so there exists b * ∈ β/ s(θ) log 2 (2p), −β/ s(θ) log 2 (2p) ⊆ B such that J := j ∈ [p] : θ j /b * ≥ 1 and |θ j | ≤ ϑ/ √ 2 has cardinality at least s(θ)/2. Note that the condition θ j /b * ≥ 1 above ensures that {θ j : j ∈ J } all have the same sign as b * . By Chen et al. (2021, Proposition 8) , we have on the event {N > z} that We now fix For j ∈ J , define the event Ω j r := t j z+ r ,b * > 2 r /3 . Chen et al. (2021, Lemma 2) to t j z+ r ,b * , we have for j ∈ J that t j z+ r ,b * = sargmax 0≤h≤z+ r z+ r Recall that X z+1 , X z+2 , . . . iid ∼ N p (θ, I p ). Hence, by applying Chen et al. (2021, Lemma 6(b)) with a = 0, b = |b * |/2 and c = r /3, we have We now work on the event Ω j r , for some fixed j ∈ J . 
We note that (17) guarantees that r ≥ 2, and thus t j z+ r ,b * ≥ 2 r /3 ≥ 2. Then, by (16) and (17), we have r 0 > 3t j z,b * , and hence by Chen et al. (2021, Lemma 9) , r 3 < t j z+ r ,b * 2 ≤ τ j z+ r ,b * ≤ 3t j z+ r ,b * 4 ≤ 3 t j z,b * + r 4 < r. We conclude that 2/3 ≤ r /3 < τ j z+ r ,b * ≤ r . Recall that Λ ·,j z+ r ,b * ∈ R p records the tail CUSUM statistics with tail length τ j z+ r ,b * . We observe by (19) that only post-change observations are included in Λ ·,j z+ r ,b * . Hence we have that Λ k,j z+ r ,b * τ j z+ r ,b * ind ∼ N θ k τ j z+ r ,b * , τ j z+ r ,b * for k ∈ [p] \ {j}. By the definition of the effective sparsity of θ, the set L j := j ∈ [p] : |θ j | ≥ ϑ s log 2 (2p) and j = j has cardinality at least s − 1. Hence, by (19), for all k ∈ L j , |θ k | τ j z+ r ,b * > ϑ 2 r 3s log 2 (2p) =:ã r . 1 |Λ j ,j z+ r ,b * |≥a τ j z+ r ,b * < T off . Combining (18), (22) and (23), we deduce that P z,θ N > z + r ≤ P z,θ N > z + r ≤ P z,θ j∈J (Ω j r ) c + j∈J P z,θ Ẽ j r ∩ Ω j r ≤ P z,θ j∈J (Ω j r ) c + j∈J P z,θ Ω j r ∩ U j ≥ |L j |/2 ≤ exp − sb 2 * (r − 1) 48 + p exp − ϑ 2 (r − 1) 128 log 2 (2p) , as desired. Lemma 7. Let p ≥ 2. Then the output N from ocd , with inputs (X t ) t∈N , 0 < β ≤ ϑ, a ≥ 8 log(p − 1), T diag = log{16pγ log 2 (4p)} and T off = 8 log{16pγ log 2 (2p)}, satisfies Proof. This follows from Chen et al. (2021, (16) in the proof of Theorem 1 and (31) in the proof of Theorem 3). Lemma 8. Let U ∼ N (0, φ 1 ), V ∼ N (0, φ 2 − φ 1 ), Y ∼ N (αφ 3 , φ 3 ) and Z ∼ N (αφ 4 , φ 4 ) be independent random variables. (a) Assume that min{φ 2 , φ 3 }/4 ≥ κ ≥ φ 1 ≥ 0 for some κ > 0. Then (b) Assume that min{φ 1 , φ 3 }/4 ≥ κ ≥ φ 4 ≥ 0 for some κ > 0. Then Proof. The case α = 0 is trivial in both cases, so without loss of generality, we may assume α > 0 throughout the rest of the proof. (a) Let Hence, by the standard Gaussian tail bound used at the end of the proof of Lemma 5, we have where w 1 := φ 1 +φ 3 Then where the first inequality holds because w 1 is increasing in φ 1 and decreasing in both φ 2 and φ 3 . Hence, using the fact that −(U + V + Y ) ≤ st U + V + Y , as well as (24) and (25), we as required. Note that the assumption guarantees that E(W 2 ) > 0. Hence, by the standard Gaussian tail bound used at the end of the proof of Lemma 5, we have where Calculating the partial derivatives of w 2 with respect to φ 1 , φ 3 and φ 4 and simplifying the expressions, we have Thus w 2 is increasing in φ 4 and decreasing in both φ 1 and φ 3 and hence w 2 ≤ 6 κ . Michigan and Louisiana as the estimated support of the change. Interestingly, if we run the ocd CI procedure from the beginning of the training data period (while still standardising as before, due to the lack of available data prior to 2017), it identifies a subtler change on the week ending 6 Estimation of high-dimensional change-points under a group sparsity structure Optimal detection of multi-sample aligned sparse signals 2020) ocd: high-dimensional, multiscale online changepoint detection. 
R package.
High-dimensional, multiscale online changepoint detection.
Change-point detection in panel data via double CUSUM statistic.
Multiple-change-point detection for high dimensional time series via sparsified binary segmentation.
Limit Theorems in Change-Point Analysis.
A likelihood ratio approach to sequential change point detection for a general class of parameters.
Quality Control and Industrial Statistics.
High-dimensional change-point detection under sparse alternatives.
High-dimensional changepoint estimation with heterogeneous missingness.
Sequential change point detection in high dimensional time series.
Timeline of a financial crisis: Introduction to the special issue.
Probability inequalities for sums of bounded random variables.
Extensions of some classical methods in change point analysis.
Confidence intervals for high-dimensional inverse covariance estimation.
Confidence intervals and hypothesis testing for high-dimensional regression.
Uniform change point tests in high dimension.
Inference on the change point under a high dimensional sparse mean shift.
Sequential change point tests based on U-statistics. arXiv preprint.
Minimax rates in sparse, high-dimensional changepoint detection.
Change-point detection for graphical models in the presence of missing values.
Optimal nonparametric multivariate change point detection and localization.
Continuous inspection schemes.
Localizing changes in high-dimensional regression models.
quantmod: quantitative financial modelling framework. R package (2020).
Boundary crossing probabilities and statistical applications.
High-dimensional change-point estimation: Combining filtering with convex optimization.
Sequential Analysis: Hypothesis Testing and Changepoint Detection.
On asymptotically optimal confidence regions and tests for high-dimensional models.
High dimensional change point estimation via sparse projection.
Confidence intervals for high-dimensional Cox models.
Optimal network online change point localisation.
Confidence intervals for low dimensional parameters in high dimensional linear models.

Acknowledgements: The research of TW was supported by Engineering and Physical Sciences Research Council (EPSRC) grant EP/T02772X/1, and that of RJS was supported by EPSRC grants EP/P031447/1 and EP/N031938/1, as well as European Research Council Advanced Grant 101019498.

We then observe, from (17), that ã_r > 2(a ∨ √(8 log 2)). Hence, from (20), we have for all k ∈ L_j that P_{z,θ}(Ω^j_r ∩ {|Λ^{k,j}_{z+ℓ_r,b_*}| < ã_r/2}) is controlled. We denote Then, by the Chernoff-Hoeffding binomial tail bound (Hoeffding, 1963, Equation (2.1)), we have where the penultimate inequality follows from (21). Now, on the event Ω^j_r ∩ {U_j < |L_j|/2}, we have where the penultimate inequality uses the fact that |L_j| ≥ s − 1 and the last inequality follows from (17). We now denote Hence, using the fact that −(U + Y + Z) ≤_st U + Y + Z, as well as (26) and (27), we have as required.