Title: An Asymptotic Theory of Joint Sequential Changepoint Detection and Identification for General Stochastic Models
Author: Alexander G. Tartakovsky
Date: 2021-02-02

Abstract: The paper addresses a joint sequential changepoint detection and identification/isolation problem for a general stochastic model, assuming that the observed data may be dependent and non-identically distributed, the prior distribution of the change point is arbitrary, and the post-change hypotheses are composite. The developed detection-identification theory generalizes the changepoint detection theory developed by Tartakovsky (2019) to the case of multiple composite post-change hypotheses, when one has not only to detect a change as quickly as possible but also to identify (or isolate) the true post-change distribution. We propose a multihypothesis change detection-identification rule and show that it is nearly optimal, minimizing moments of the delay to detection as the probability of a false alarm and the probabilities of misidentification go to zero.

I. INTRODUCTION

In many applications, one needs not only to detect an abrupt change as quickly as possible but also to provide a detailed diagnosis of the change that has occurred, that is, to determine which type of change is in effect. For example, the problem of detection and diagnosis is important for rapid detection and isolation of intrusions in large-scale distributed computer networks; target detection with radar, sonar, and optical sensors in a cluttered environment; detection of terrorists' malicious activity; fault detection and isolation in dynamic systems and networks; and integrity monitoring of navigation systems, to name a few (see [19, Ch. 10] for an overview and references).
In other words, there are several kinds of changes that can be associated with several different post-change distributions, and the goal is to detect the change and to identify which distribution corresponds to it. As a result, the problem of changepoint detection and diagnosis is a generalization of the quickest change detection problem [5], [12], [14], [15], [19] to the case of N ≥ 2 post-change hypotheses, and it can be formulated as a joint change detection and identification problem. In the literature, this problem is usually called change detection and isolation. The detection-isolation problem has been considered in both Bayesian and minimax settings. In 1995, Nikiforov [7] suggested a minimax approach to the change detection-isolation problem and showed that the multihypothesis version of the CUSUM rule is asymptotically optimal when the average run length (ARL) to a false alarm and the mean time to false isolation become large. Several versions of the multihypothesis CUSUM-type and SR-type procedures, which have minimax optimality properties in the classes of rules with constraints imposed on the ARL to a false alarm and the conditional probabilities of false isolation, were proposed by Nikiforov [8], [9] and Tartakovsky [21]. These rules asymptotically minimize the maximal expected delays to detection and isolation as the ARL to a false alarm becomes large and the probabilities of wrong isolation become small. Dayanik et al. [2] proposed an asymptotically optimal Bayesian detection-isolation rule assuming that the prior distribution of the change point is geometric. In all these papers, the optimality results were restricted to the case of independent and identically distributed (i.i.d.) observations (in pre- and post-change modes, with different distributions) and simple post-change hypotheses. In many practical applications, the i.i.d. assumption is too restrictive.
The observations may be either non-identically distributed or dependent or both, i.e., non-i.i.d. Also, in a variety of applications, the pre-change distribution is known but the post-change distribution is rarely known completely. A more realistic situation is parametric uncertainty, when the parameter of the post-change distribution is unknown, since a putative parameter value is rarely representative. Lai [4] provided a certain generalization for the non-i.i.d. case and composite hypotheses for a specific loss function. See Chapter 10 in Tartakovsky et al. [19] for a detailed overview. One of the most challenging and important versions of the change detection-isolation problem is the multidecision, multistream detection problem, in which it is necessary not only to detect a change as soon as possible but also to identify the streams where the change happens, with given probabilities of misidentification. Specifically, there are N data streams, and the change occurs in some of them at an unknown point in time. It is necessary to detect the change in distribution as soon as possible and to indicate which streams are "corrupted." Both the rates of false alarms and of misidentification should be controlled by given (usually low) levels. In the following, we will refer to this problem as the Multistream Sequential Change Detection-Identification problem. In this paper, we address a simplified multistream detection-identification scenario where the change can occur in only a single stream and we need to determine in which one. We assume that the observations in the streams can have a very general structure, i.e., they can be dependent and non-identically distributed. We focus on a semi-Bayesian setting, assuming that the change point is random with a given (prior) distribution. However, we do not suppose that there is a prior distribution on the post-change hypotheses.
We generalize the asymptotic Bayesian theory developed by Tartakovsky [23] for a single post-change hypothesis (for a single stream). Specifically, we show that under certain conditions (related to the law of large numbers for the log-likelihood processes) the proposed multihypothesis detection-identification rule asymptotically minimizes the trade-off between positive moments of the detection delay and the false alarm/misclassification rates expressed via the weighted probabilities of false alarm and false identification. The key assumption in the general asymptotic theory is the stability property of the log-likelihood ratio processes in the streams between the "change" and "no-change" hypotheses, which can be formulated in terms of the law of large numbers and rates of convergence of the properly normalized log-likelihood ratios and their adaptive versions in the vicinity of the true parameter values. The rest of the paper is organized as follows. In Section II, we describe the general stochastic model treated in the paper. In Section III, we introduce the mixture-based change detection-identification rule. In Section IV, we formulate the asymptotic optimization problems in the class of changepoint detection-identification rules with constraints imposed on the probabilities of false alarm and wrong identification. In Section V, we obtain upper bounds on the probabilities of false alarm and misidentification as functions of the thresholds. In Section VI, we derive asymptotic lower bounds for moments of the detection delay in the class of rules with given probabilities of false alarm and misidentification, and in Section VII, we prove asymptotic optimality of the proposed mixture detection-identification rule as the probabilities of false alarm and misidentification go to zero. In Section VIII, we consider an example that illustrates the general results. Section IX concludes. Suppose there are N independent data streams {X_n(i)}_{n≥1}, i = 1, . . .
, N, observed sequentially in time subject to a change at an unknown time ν ∈ {0, 1, 2, . . . }, so that X_1(i), . . . , X_ν(i) are generated by one stochastic model and X_{ν+1}(i), X_{ν+2}(i), . . . by another model when the change occurs in the ith stream. We will assume that the change in distributions may happen in only one stream and that it is not known which stream is affected, i.e., we are interested in a "multisample slippage" changepoint model (given ν and that the ith stream is affected with the parameter θ_i) for which the joint density p(X^n | H_{ν,i}, θ_i) of the data X^n = (X^n(1), . . . , X^n(N)), X^n(i) = (X_1(i), . . . , X_n(i)), observed up to time n is of the form

  p(X^n | H_{ν,i}, θ_i) = ∏_{t=1}^{n} ∏_{j∈N\{i}} g_j(X_t(j)|X^{t−1}(j)) × ∏_{t=1}^{min(ν,n)} g_i(X_t(i)|X^{t−1}(i)) × ∏_{t=ν+1}^{n} f_{i,θ_i}(X_t(i)|X^{t−1}(i)),   (1)

where H_{ν,i} denotes the hypothesis that the change occurs at time ν in stream i, g_i(X_t(i)|X^{t−1}(i)) and f_{i,θ_i}(X_t(i)|X^{t−1}(i)) are conditional pre- and post-change densities in the ith data stream, respectively (with respect to some sigma-finite measure), and N = {1, 2, . . . , N}. In other words, all components X_t(ℓ), ℓ ∈ N, have conditional densities g_ℓ(X_t(ℓ)|X^{t−1}(ℓ)) before the change occurs; after the change occurs in the ith stream, X_t(i) has conditional density f_{i,θ_i}(X_t(i)|X^{t−1}(i)), while the rest of the components X_t(j), j ∈ N \ {i}, retain conditional densities g_j(X_t(j)|X^{t−1}(j)). The parameters θ_i ∈ Θ_i, i = 1, . . . , N, of the post-change distributions are unknown. The event ν = ∞ and the corresponding hypothesis H_∞ : ν = ∞ mean that there never is a change. Notice that the model (1) implies that X_{ν+1}(i) is the first post-change observation under hypothesis H_{ν,i}. Regarding the change point ν, we assume that it is a random variable independent of the observations, with prior distribution π_k = P(ν = k), k = 0, 1, 2, . . . , where π_k > 0 for k ∈ {0, 1, 2, . . . } = Z_+. We will also assume that the change point may take negative values, which means that the change has occurred by the time the observations became available.
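To make the slippage model concrete, the sketch below evaluates the joint log-density in (1) for the simplest special case: i.i.d. observations in which every stream is N(0, 1) before the change and the affected stream becomes N(θ, 1) afterward. The function name `joint_loglik` and the Gaussian stand-in densities are illustrative assumptions, not part of the paper.

```python
import math

def log_norm_pdf(x, mean=0.0, sd=1.0):
    # Log density of the normal distribution N(mean, sd^2).
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def joint_loglik(X, k, i, theta):
    """Log of p(X^n | H_{k,i}, theta) for model (1) with i.i.d. Gaussian
    stand-in densities: all streams are N(0,1) pre-change; stream i becomes
    N(theta,1) from time k+1 onward.  X[j] is the list of observations of
    stream j; k is the assumed change point; i is the affected stream."""
    ll = 0.0
    for j, stream in enumerate(X):
        for t, x in enumerate(stream):          # t = 0 corresponds to time 1
            post = (j == i) and (t >= k)        # X_{k+1}(i), ... are post-change
            ll += log_norm_pdf(x, theta if post else 0.0)
    return ll
```

For instance, with two streams of one observation each, setting k = 0 places the single observation of stream 0 in the post-change regime, while k = 1 keeps both streams pre-change.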
However, the detailed structure of the distribution P(ν = k) for k = −1, −2, . . . is not important. The only value that matters is the total probability q = P(ν ≤ −1) of the change being in effect before the observations become available, so we set P(ν ≤ −1) = P(ν = −1) = π_{−1} = q, q ∈ [0, 1). A changepoint detection-identification rule is a pair δ = (d, T), where T is a stopping time (with respect to the filtration {F_n = σ(X^n)}_{n∈Z_+}) associated with the time of alarm on change, and d = d_T ∈ N is a decision, made at time T, on which stream is affected (or which post-change distribution is true). It follows from (1) that for an assumed value of the change point ν = k, stream i ∈ N, and post-change parameter value θ_i ∈ Θ_i in the ith stream, the likelihood ratio (LR) LR_{i,θ_i}(k, n) = p(X^n | H_{k,i}, θ_i)/p(X^n | H_∞) between the hypotheses H_{k,i} and H_∞ for observations accumulated by time n has the form

  LR_{i,θ_i}(k, n) = ∏_{t=k+1}^{n} [ f_{i,θ_i}(X_t(i)|X^{t−1}(i)) / g_i(X_t(i)|X^{t−1}(i)) ],   (2)

and LR_{i,θ_i}(k, n) = 1 for k ≥ n. We suppose that LR_{i,θ_i}(0, 0) = 1, so that LR_{i,θ_i}(−1, n) = LR_{i,θ_i}(0, n). Define the average (over the prior π_k) LR statistics

  Λ^π_{i,θ_i}(n) = Σ_{k=−1}^{∞} π_k LR_{i,θ_i}(k, n).   (3)

Let W_i(θ_i), ∫_{Θ_i} dW_i(θ_i) = 1, i ∈ N, be mixing measures and, for k < n and i ∈ N, define the LR-mixtures

  LR_{i,W}(k, n) = ∫_{Θ_i} LR_{i,θ_i}(k, n) dW_i(θ_i)   (4)

and the statistics

  Λ^π_{i,W}(n) = Σ_{k=−1}^{∞} π_k LR_{i,W}(k, n),   (5)

where in the statistic Λ^π_{i,W}(n) defined in (5) the index i = 0 corresponds to the hypothesis H_0 that there is no change (in the first n observations). Write N_0 = {0, 1, . . . , N}. For the set of positive thresholds A = (A_ij), j ∈ N_0 \ {i}, i ∈ N, the change detection-identification rule δ_A = (d_A, T_A) is defined as follows:

  T_A = min_{i∈N} T^(i)_A,  d_A = argmin_{i∈N} T^(i)_A,   (7)

where the Markov times T^(i)_A, i ∈ N, are given by

  T^(i)_A = inf{ n ≥ 1 : Λ^π_{i,W}(n) ≥ A_ij Λ^π_{j,W}(n) for all j ∈ N_0 \ {i} }.   (8)

In definitions of stopping times we always set inf{∅} = ∞, i.e., T^(i)_A = ∞ if no such n exists; if the minimum in (7) is attained for several values of i, then any of them can be taken. Let E_{k,i,θ_i} and E_∞ denote expectations under the probability measures P_{k,i,θ_i} and P_∞, respectively, where P_{k,i,θ_i} corresponds to model (1) with an assumed value of the parameter θ_i ∈ Θ_i, change point ν = k, and affected stream i ∈ N.
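The stopping-and-identification logic of the rule δ_A = (d_A, T_A) in (7)-(8) can be sketched as follows, with precomputed trajectories of the statistics Λ^π_{i,W}(n) supplied as plain lists (index 0 playing the role of the no-change hypothesis H_0). The function name and data layout are illustrative assumptions.

```python
def detection_identification_rule(stats, A):
    """Sketch of rule delta_A = (d_A, T_A) from (7)-(8): for each stream i,
    T_A^(i) is the first n at which stream i's statistic exceeds A[i][j]
    times the statistic of every rival hypothesis j (j = 0 is 'no change').
    The rule stops at T_A = min_i T_A^(i) and identifies d_A accordingly.
    `stats[i][n]` holds Lambda_i at time n+1; `stats[0]` is the no-change
    statistic; `A[i][j]` are the thresholds (A[i][i] is never used)."""
    N = len(stats) - 1
    n_max = len(stats[0])
    for n in range(n_max):
        for i in range(1, N + 1):
            if all(stats[i][n] >= A[i][j] * stats[j][n]
                   for j in range(N + 1) if j != i):
                return n + 1, i          # (T_A, d_A)
    return None, None                    # inf over the empty set: never stops
```

Scanning n in the outer loop realizes the minimum over i automatically: the first stream whose statistic clears all its thresholds determines both the alarm time and the identified stream.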
Define the probability measure P^π_{i,θ_i}(A × K) = Σ_{k∈K} π_k P_{k,i,θ_i}(A), under which the change point ν has distribution π = {π_k} and the model for the observations is of the form (1), and let E^π_{i,θ_i} denote the corresponding expectation. For r ≥ 1, ν = k ∈ Z_+, θ_i ∈ Θ_i, and i ∈ N, introduce the risk associated with the conditional rth moment of the detection delay

  R^r_{k,i,θ_i}(δ) = E_{k,i,θ_i}[(T − k)^r | T > k],   (9)

where for k = −1 we set T − k = T, not T + 1, as well as the integrated (over the prior π) risk associated with the moments of the delay to detection

  R̄^r_{i,θ_i}(δ) = E^π_{i,θ_i}[(T − ν)^r | T > ν],   (10)

where

  PFA^π(δ) = P^π(T ≤ ν) = Σ_{k=0}^{∞} π_k P_∞(T ≤ k)   (11)

is the weighted probability of false alarm. Note that in (10) and (11) we used the equality P_{k,i,θ_i}(T ≤ k) = P_∞(T ≤ k), since the event {T ≤ k} belongs to the sigma-algebra F_k = σ(X^k) and, hence, depends only on the first k observations, whose distribution corresponds to the measure P_∞. This implies, in particular, that the probability of a false alarm does not depend on the affected stream or on the post-change parameter. Also, introduce the weighted probability of false alarm on the event {d = i}, i.e., the probability of raising the alarm with the decision d = i that there is a change in the ith stream when there is no change:

  PFA^π_i(δ) = Σ_{k=0}^{∞} π_k P_∞(T ≤ k, d = i).   (12)

The loss associated with wrong identification is reasonably measured by the maximal probabilities of wrong decisions (misidentification)

  PMI^π_ij(δ) = sup_{θ_i∈Θ_i} P^π_{i,θ_i}(d = j, T < ∞), i, j = 1, . . . , N, i ≠ j.   (13)

Define the class of change detection-identification rules δ with constraints on the probabilities of false alarm PFA^π_i(δ) and the probabilities of misidentification PMI^π_ij(δ):

  C_π(α, β) = {δ : PFA^π_i(δ) ≤ α_i, PMI^π_ij(δ) ≤ β_ij, i, j ∈ N, i ≠ j},   (14)

where α = (α_1, . . . , α_N) and β = (β_ij)_{i,j∈N, i≠j} are the sets of prescribed probabilities α_i ∈ (0, 1) and β_ij ∈ (0, 1). Ideally, we would be interested in finding an optimal rule δ_opt = (d_opt, T_opt) that minimizes the risks (9) and (10) over the class C_π(α, β). However, this problem is intractable for arbitrary values of α_i ∈ (0, 1) and β_ij ∈ (0, 1). For this reason, we will focus on the asymptotic problem, assuming that the given probabilities α_i and β_ij approach zero.
To be more specific, we will be interested in proving that the proposed detection-identification rule δ_A = (d_A, T_A) defined in (7)-(8) is first-order uniformly asymptotically optimal in the following sense:

  R̄^r_{i,θ_i}(δ_A) ∼ inf_{δ∈C_π(α,β)} R̄^r_{i,θ_i}(δ) as α_max, β_max → 0, for all θ_i ∈ Θ_i and i ∈ N,   (15)

where A = A(α, β) is the set of suitably selected thresholds such that δ_A ∈ C_π(α, β). Hereafter α_max = max_{i∈N} α_i and β_max = max_{i,j∈N, i≠j} β_ij. In addition, we will prove that the rule δ_A = (d_A, T_A) is uniformly pointwise first-order asymptotically optimal in the sense of minimizing the conditional risk (9) for all change point values ν = k ∈ Z_+, i.e.,

  R^r_{k,i,θ_i}(δ_A) ∼ inf_{δ∈C_π(α,β)} R^r_{k,i,θ_i}(δ) as α_max, β_max → 0.   (16)

It is also of interest to consider the class of detection-identification rules

  C_π(ᾱ, β̄) = {δ : PFA^π(δ) ≤ ᾱ, PMI^π_i(δ) ≤ β̄_i, i ∈ N}   (17)

(β̄ = (β̄_1, . . . , β̄_N)) with constraints on the total probability of false alarm PFA^π(δ) (defined in (11)), regardless of the decision d = i made under hypothesis H_∞, and on the misidentification probabilities PMI^π_i(δ) = Σ_{j∈N\{i}} PMI^π_ij(δ). Obviously, PFA^π(δ) = Σ_{i=1}^{N} PFA^π_i(δ). In this paper, we consider only a fixed number of hypotheses N. The large-scale case where N → ∞ at a certain rate (which requires a different definition of false alarm and misidentification rates) will be considered elsewhere. In the following, we assume that the mixing measures W_i, i = 1, . . . , N, satisfy the condition

  W_i{ϑ ∈ Θ_i : |ϑ − θ_i| < κ} > 0 for any κ > 0 and any θ_i ∈ Θ_i.

By (2), for the assumed values of ν = k, i ∈ N, and θ_i ∈ Θ_i, write λ_{i,θ_i}(k, k+n) = log LR_{i,θ_i}(k, k+n); the LLR between the hypotheses H_{k,i} and H_{k,j} for observations accumulated by the time k + n is

  λ_{i,θ_i;j,θ_j}(k, k + n) = λ_{i,θ_i}(k, k + n) − λ_{j,θ_j}(k, k + n).

To study asymptotic optimality we need certain constraints imposed on the prior distribution π = {π_k} and on the asymptotic behavior of the decision statistics as the sample size increases (i.e., on the general stochastic model).
Regarding the model for the observations (1), we assume that the following two conditions are satisfied (for the local LLRs in the data streams). There exist positive and finite numbers I_i(θ_i) = I_{i0}(θ_i), θ_i ∈ Θ_i, i ∈ N, and I_ij(θ_i, θ_j), θ_i ∈ Θ_i, θ_j ∈ Θ_j, j ∈ N \ {i}, such that:

C_1 (right-tail condition). For any ε > 0, the normalized LLR n^{−1} λ_{i,θ_i;j,θ_j}(k, k + n) converges to I_ij(θ_i, θ_j) as n → ∞ in the sense of the right-tail relation (20).

C_2 (left-tail condition). For any ε > 0 and some r ≥ 1, the complementary left-tail rate-of-convergence relation (21) holds.

Note that condition C_1 holds whenever λ_{i,θ_i;j,θ_j}(k, k+n)/n converges almost surely (a.s.) to I_ij(θ_i, θ_j) under P_{k,i,θ_i}. Regarding the prior distribution π_k = P(ν = k), we assume that it is fully supported (i.e., π_k > 0 for all k ∈ Z_+ and π_∞ = 0) and that the following two conditions are satisfied:

CP_1. lim_{k→∞} |log P(ν > k)| / k = µ for some µ ≥ 0;   (23)

CP_2. Σ_{k=0}^{∞} π_k |log π_k|^r < ∞ for some r ≥ 1.   (24)

The class of prior distributions satisfying conditions CP_1 and CP_2 will be denoted by C(µ). Note that if µ > 0, then the prior distribution has an exponential right tail; in this case, condition (24) holds automatically. If µ = 0, the distribution has a heavy tail, i.e., belongs to the model with a vanishing hazard rate. However, we cannot allow this distribution to have a too heavy tail, which is guaranteed by condition CP_2. A typical heavy-tailed prior distribution that satisfies both condition CP_1 with µ = 0 and condition CP_2 for all r ≥ 1 is a discrete Weibull-type distribution with shape parameter 0 < κ < 1. Constraint (24) is often guaranteed by finiteness of the rth moment, Σ_{k=0}^{∞} k^r π_k < ∞. To obtain lower bounds for the moments of the detection delay we need only the right-tail condition (20); however, to establish the asymptotic optimality property of the rule δ_A, both the right-tail and left-tail conditions (20) and (21) are needed. Next, define the statistic Λ^{π,W}_{i,j,θ_j}(n) = Λ^π_{i,W}(n)/Λ^π_{j,θ_j}(n) and the measure P̃^{π,n}_{j,W}(A) = ∫_{Θ_j} P̃^{π,n}_{j,θ}(A) dW_j(θ). Denote by P|_{F_n} the restriction of the measure P to the sigma-algebra F_n. Obviously, Λ^π_{j,θ_j}(n) is the Radon-Nikodym derivative of P̃^{π,n}_{j,θ_j}|_{F_n} with respect to P_∞|_{F_n}, and hence the statistic Λ^{π,W}_{i,j,θ_j}(n) is a (P̃^{π,n}_{j,θ_j}, F_n)-martingale with unit expectation for all θ_j ∈ Θ_j. Therefore, by the Wald-Doob identity, for any stopping time T and all θ_j ∈ Θ_j,

  Ẽ^π_{j,θ_j}[Λ^{π,W}_{i,j,θ_j}(T) 1{T < ∞}] ≤ 1,

where Ẽ^π_{j,W} and Ẽ^π_{j,θ_j} stand for the operators of expectation under P̃^{π,T}_{j,W} and P̃^{π,T}_{j,θ_j}, respectively.
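The unit-expectation (martingale) property behind the Wald-Doob identity can be checked numerically. The sketch below simulates, under the no-change measure, the likelihood ratio for a mean shift from N(0, 1) to N(θ, 1) and verifies that its average is close to one. This i.i.d. Gaussian setup and the function name are simplified stand-ins for the general statistics of this section.

```python
import math
import random

def lr_martingale_mean(n=5, trials=200000, theta=0.5, seed=1):
    """Monte Carlo check that the likelihood ratio
    LR(n) = prod_t f_theta(X_t)/g(X_t), with g = N(0,1) and f = N(theta,1),
    has expectation 1 under the no-change measure P_infty -- the property
    used (via the Wald-Doob identity) to bound the PFA and PMI."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        llr = 0.0
        for _ in range(n):
            x = rng.gauss(0.0, 1.0)                  # sampled under P_infty
            llr += theta * x - theta * theta / 2.0   # log f_theta/g for a mean shift
        total += math.exp(llr)
    return total / trials
```

The sample mean fluctuates around 1 with standard error of order sqrt(exp(n θ²) − 1)/sqrt(trials), so moderate θ and n keep the check stable.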
The following theorem establishes upper bounds for the PFA and PMI of the proposed detection-identification rule δ_A. Note that these bounds are valid in the most general case: neither the conditions C_1, C_2 on the model nor the condition CP_1 on the prior distribution is required.

Theorem 1. Let δ_A be the changepoint detection-identification rule defined in (7)-(8). Then the upper bounds (26) on PFA^π_i(δ_A) and (27) on PFA^π(δ_A), as well as the upper bounds (28) on PMI^π_ij(δ_A) and (29) on PMI^π_i(δ_A), hold as functions of the thresholds. Thus, if α_max < 1 − π_{−1}, then the threshold choices (30) and (31) guarantee that the prescribed PFA and PMI levels are met.

Proof: Using the Bayes rule, notation (2)-(6), and the fact that LR_{i,θ_i}(k, n) = 1 for k ≥ n, we express the posterior odds of the hypotheses through the statistics Λ^π_{i,W}(n). Taking into account that P^π_{i,θ_i}(T^(i)_A ≤ ν, d_A = i) is controlled by the threshold A_i0 through the stopping inequality in (8), inequalities (26) follow. Inequality (27) follows immediately from the fact that PFA^π(δ) = Σ_{i=1}^{N} PFA^π_i(δ). To prove the upper bound (28), note that Λ̄^{π,W}_{ji}(T^(j)_A) ≥ A_ji on the event {T_A = T^(j)_A, d_A = j}; a change of measure combined with the Wald-Doob identity then yields the bound, and (28) follows. The upper bound (29) follows from (28) and the fact that PMI^π_i(δ) = Σ_{j∈N\{i}} PMI^π_ij(δ). Implications (30) and (31) are obvious.

Remark 1. Typically, the upper bounds (26)-(29) for the PFA and PMI are not tight but rather conservative, especially when the overshoots over the thresholds are large (i.e., when the hypotheses H_i and H_∞ are not close). Unfortunately, in the general non-i.i.d. case an improvement of these bounds is not possible. In the i.i.d. case, where the observations are independent and identically distributed with common pre-change density g_i(x) and common post-change density f_i(x) in the ith stream (i.e., when the post-change hypotheses are simple), it is possible to obtain asymptotically accurate approximations using renewal theory, similarly to the single-hypothesis i.i.d. case treated in the literature.

The following theorem establishes asymptotic lower bounds on the moments of the detection delay R^r_{k,i,θ_i}(δ) and R̄^r_{i,θ_i}(δ) (r ≥ 1) in the classes of detection-identification rules C_π(α, β) and C_π(ᾱ, β̄) defined in (14) and (17), respectively.
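The threshold choices that make the upper bounds of Theorem 1 match prescribed error levels (made explicit later, in the proof of Theorem 3: A_i0 = (1 − α_i)/α_i and A_ij = [(1 − α_j)β_ji]^{−1}) can be tabulated as follows. The helper name and index layout are illustrative assumptions.

```python
def thresholds_for_constraints(alpha, beta):
    """Threshold selection sketch matching the choices used with Theorem 1:
    A[i][0] = (1 - alpha_i)/alpha_i targets PFA_i <= alpha_i, and
    A[i][j] = 1/((1 - alpha_j) * beta_ji) targets PMI_ij <= beta_ij.
    `alpha` is a list over streams 1..N; `beta[j-1][i-1]` holds beta_ji.
    Returns an (N+1) x (N+1) table; index 0 is the no-change hypothesis."""
    N = len(alpha)
    A = [[None] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):
        A[i][0] = (1.0 - alpha[i - 1]) / alpha[i - 1]
        for j in range(1, N + 1):
            if j != i:
                # A_ij = [(1 - alpha_j) * beta_ji]^{-1}
                A[i][j] = 1.0 / ((1.0 - alpha[j - 1]) * beta[j - 1][i - 1])
    return A
```

For example, alpha_i = 0.01 gives A_i0 = 99, and beta_ji = 0.05 with alpha_j = 0.01 gives A_ij ≈ 20.2, reflecting that log A_i0 ∼ |log α_i| and log A_ij ∼ |log β_ji|.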
These bounds will be used in the next section for proving asymptotic optimality of the detection-identification rule δ_A with suitable thresholds.

Theorem 2. Let, for some µ ≥ 0, the prior distribution belong to class C(µ). Assume that for some positive and finite numbers I_i(θ_i) and I_ij(θ_i, θ_j) the right-tail condition C_1 is satisfied. Then the asymptotic lower bounds (34) and (35) hold in class C_π(α, β), and the lower bounds (36) and (37) hold in class C_π(ᾱ, β̄), where Ψ_i(α, β) and Ψ_i(ᾱ, β̄) are defined in (32) and (33), respectively.

Proof: We provide only the proof of the asymptotic lower bounds (34) and (35); the proof of (36) and (37) is essentially similar. Notice that the proof can be split into two parts: if we show that, on the one hand, inequality (38) holds for any rule δ ∈ C_π(α, β) and, on the other hand, inequality (40) holds, where o(1) → 0, then, obviously, combining inequalities (38) and (40) completes the argument.

The following proposition, whose proof is given in the Appendix, establishes first-order asymptotic approximations to the moments of the detection delay of the detection-identification rule δ_A when the thresholds A_ij go to infinity, regardless of the PFA and PMI constraints. Write A_min = min_{i∈N, j∈N_0\{i}} A_ij.

Proposition 1. Let r ≥ 1 and let the prior distribution of the change point belong to class C(µ). Assume that for some 0 < I_i(θ_i) < ∞, θ_i ∈ Θ_i, i ∈ N, and 0 < I_ij(θ_i, θ_j) < ∞, θ_i ∈ Θ_i, θ_j ∈ Θ_j, i ∈ N, j ∈ N \ {i}, the right-tail and left-tail conditions C_1 and C_2 are satisfied and that inf_{θ_j∈Θ_j} I_ij(θ_i, θ_j) > 0 for all j ∈ N \ {i}, i ∈ N. Then the asymptotic approximation (42) holds for all 0 < m ≤ r, θ_i ∈ Θ_i, and i ∈ N as A_min → ∞. Hereafter we use the standard notation x_a ∼ y_a as a → a_0 if lim_{a→a_0}(x_a/y_a) = 1. In order to prove this proposition we need the following lemma, whose proof is given in the Appendix. For i = 1, . . . , N, define the integer quantities entering the lemma, where ⌊y⌋ denotes the greatest integer not exceeding y.

Lemma 1. Let r ≥ 1 and let the prior distribution of the change point satisfy condition (23).
Then, for a sufficiently large A_min, any 0 < ε < J_ij(θ_i, µ), and all k ∈ Z_+, the corresponding bound holds, where x^+ = max(0, x). Theorem 1, Theorem 2, and Proposition 1 allow us to conclude that the detection-identification rule δ_A is asymptotically first-order optimal in the classes C_π(α, β) and C_π(ᾱ, β̄) as α_max, β_max → 0.

Theorem 3. Let r ≥ 1 and let the prior distribution of the change point belong to class C(µ). Assume that for some 0 < I_i(θ_i) < ∞, θ_i ∈ Θ_i, i ∈ N, and 0 < I_ij(θ_i, θ_j) < ∞, θ_i ∈ Θ_i, θ_j ∈ Θ_j, i ∈ N, j ∈ N \ {i}, the right-tail and left-tail conditions C_1 and C_2 are satisfied and that inf_{θ_j∈Θ_j} I_ij(θ_i, θ_j) > 0 for all j ∈ N \ {i}, i ∈ N.

(i) If the thresholds are selected so that δ_A ∈ C_π(α, β), log A_i0 ∼ |log α_i|, and log A_ij ∼ |log β_ji|, then δ_A is first-order asymptotically optimal as α_max, β_max → 0 in class C_π(α, β), minimizing moments of the detection delay up to order r: relations (47) and (48) hold for all 0 < m ≤ r, θ_i ∈ Θ_i, and i ∈ N.

(ii) If the thresholds are selected so that δ_A ∈ C_π(ᾱ, β̄), then δ_A is first-order asymptotically optimal as ᾱ, β̄_max → 0 in class C_π(ᾱ, β̄), minimizing moments of the detection delay up to order r: relations (49) and (50) hold for all 0 < m ≤ r, θ_i ∈ Θ_i, and i ∈ N.

Proof of (i): In particular, log A_i0 ∼ |log α_i| and log A_ij ∼ |log β_ji| if A_i0 = (1 − α_i)/α_i and A_ij = [(1 − α_j)β_ji]^{−1}, and by Theorem 1, PFA^π_i(δ_A) ≤ α_i and PMI^π_ij(δ_A) ≤ β_ij with this choice of thresholds (see (30)). Comparing the asymptotic approximations (51) with the lower bounds (34) in Theorem 2 completes the proof of (47). The proof of (48) is similar.

Proof of (ii): Setting log A_0 ∼ |log ᾱ| and log A_i ∼ |log β̄_j| in (42) yields (52) as ᾱ, β̄_max → 0. In particular, log A_0 ∼ |log ᾱ| and log A_i ∼ |log β̄_i|, and by Theorem 1, PFA^π(δ_A) ≤ ᾱ and PMI^π_i(δ_A) ≤ β̄_i with this choice of thresholds (see (31)). Comparing the asymptotic approximations (52) with the lower bounds (36) in Theorem 2 completes the proof of (49). The proof of (50) is similar.
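The first-order asymptotics of Theorem 3 suggest a back-of-the-envelope delay estimate: detection is driven by |log α_i|/(I_i + µ), identification by the analogous ratio with the worst-case information number inf_j I_ij, and the overall delay is of the order of the larger of the two. The sketch below is schematic (the exact constants are those of the quantities Ψ_i(α, β) in Theorem 2), and all names are illustrative.

```python
import math

def first_order_delay(alpha_i, beta_max, I_i, I_min_ij, mu=0.0):
    """Schematic first-order delay estimate for rule delta_A: the larger of
    the detection term |log alpha_i|/(I_i + mu) and the identification term
    |log beta_max|/(inf_j I_ij + mu).  A rough guide only; the rigorous
    expressions are the max-type quantities of Theorems 2 and 3."""
    detect = abs(math.log(alpha_i)) / (I_i + mu)
    isolate = abs(math.log(beta_max)) / (I_min_ij + mu)
    return max(detect, isolate)
```

For instance, with α_i = 10⁻⁴, β_max = 10⁻², I_i = 0.5, and inf_j I_ij = 0.25, the two terms coincide and the estimate is |log α_i|/I_i ≈ 18.4 observations.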
If the prior distribution π = π^{α_max,β_max} depends on the PFA and PMI constraints α_max, β_max and µ_{α_max,β_max} → 0 as α_max, β_max → 0, then a modification of the preceding argument can be used to show that the assertions of Theorem 3 hold with µ = 0. Note that conditions (20) are satisfied whenever the normalized LLRs obey a strong law of large numbers (see [17, p. 243]). Assume also that, for some positive and finite numbers I_{0,i}(θ_i), i ∈ N, the analogous conditions hold for the LLRs between the post-change and pre-change hypotheses; in particular, in the i.i.d. case these conditions hold with I_i(θ_i) and I_{0,i}(θ_i) being the Kullback-Leibler information numbers. Therefore, if the prior distribution of the change point is heavy-tailed (i.e., µ = 0) and the PFA is smaller than the PMI, α_i < β_ji and ᾱ < β̄_j, which is typical in many applications, then the asymptotics (48) and (50) reduce to (54) (as α_max, β_max → 0) and (55) (as ᾱ, β̄_max → 0), respectively.

Consider now the fully Bayesian setting where not only the prior distribution π = {π_k} of the change point ν is given but also the prior distribution p = {p_i}_{i∈N} of the hypotheses, P(H_i) = p_i, i ∈ N, is specified. Then, in place of the maximal probabilities of misidentification (13), one can consider the average probability of misidentification PMI_{π,W,p}(δ), and the risk associated with the detection delay is measured by R̄^r_{π,W,p}(δ) = E_{π,W,p}[(T − ν)^r | T > ν] (in place of (10)), where E_{π,W,p} is the expectation under the measure P_{π,W,p}. It follows from Theorem 1 that the rule δ_A satisfies corresponding upper bounds on PFA^π(δ_A) and PMI_{π,W,p}(δ_A). Introduce the class of detection-identification rules

  C̄_{π,W,p}(α, β) = {δ : PFA^π(δ) ≤ α and PMI_{π,W,p}(δ) ≤ β},

for which the weighted probability of false alarm does not exceed α ∈ (0, 1) and the average probability of misidentification does not exceed β ∈ (0, 1). Note that δ_A ∈ C̄_{π,W,p}(α, β) whenever the thresholds are selected in accordance with the upper bounds of Theorem 1. Using Theorem 3, it is easy to prove that the rule δ_A is first-order asymptotically optimal in the fully Bayesian setting in class C̄_{π,W,p}(α, β). Specifically, the following theorem holds.

Theorem 4.
Let r ≥ 1, let the prior distribution of the change point belong to class C(µ), and let p = {p_i}_{i∈N} be the prior distribution of the hypotheses that the change occurs in the ith data stream. Assume that for some 0 < I_i(θ_i) < ∞ and 0 < I_ij(θ_i, θ_j) < ∞, θ_i ∈ Θ_i, θ_j ∈ Θ_j, i ∈ N, j ∈ N \ {i}, the right-tail and left-tail conditions C_1 and C_2 are satisfied and that inf_{θ_j∈Θ_j} I_ij(θ_i, θ_j) > 0 for all j ∈ N \ {i}, i ∈ N. If the thresholds are suitably selected, then δ_A is first-order asymptotically optimal as α, β → 0 in class C̄_{π,W,p}(α, β), minimizing moments of the detection delay up to order r for all 0 < m ≤ r.

Suppose there is an N-channel sensor system and we are able to observe the output vector X_n = (X_n(1), . . . , X_n(N)), n = 1, 2, . . . . The observations X_n(i) in the ith channel have the form

  X_n(i) = θ_i S_{i,n} 1{n > ν} + ξ_{i,n}, n ≥ 1, i = 1, . . . , N,   (56)

where θ_i is an unknown intensity or amplitude (θ_i > 0) of a deterministic signal S_{i,n} (e.g., the signal S_{i,n} = cos(ω_i n)), and {ξ_{i,n}}_{n∈Z_+}, i ∈ N, are mutually independent noise processes which are stable Gaussian AR(p) processes obeying the recursions

  ξ_{i,n} = Σ_{t=1}^{p} ρ_{i,t} ξ_{i,n−t} + w_{i,n}, n ≥ 1.   (57)

Here {w_{i,n}}_{n≥1}, i ∈ N, are mutually independent i.i.d. Gaussian sequences with mean zero and standard deviation σ > 0. The coefficients ρ_{i,1}, . . . , ρ_{i,p} and the variance σ² are known. A signal may appear in only one channel and should be detected and isolated quickly, i.e., the number of the channel where the signal appears should be identified along with detection. Define the whitened signal

  S̃_{i,n} = S_{i,n} − Σ_{t=1}^{p_n} ρ_{i,t} S_{i,n−t},

where p_n = p if n > p and p_n = n if n ≤ p. The LLRs then have the form

  λ_{i,θ_i}(k, k + n) = (θ_i/σ²) Σ_{t=k+1}^{k+n} S̃_{i,t} X̃_t(i) − (θ_i²/(2σ²)) Σ_{t=k+1}^{k+n} (S̃_{i,t})²,   (58)

where X̃_t(i) denotes the analogously whitened observation in channel i. Under the measure P_{k,i,ϑ}, ϑ ∈ Θ_i, the LLR λ_{i,θ_i;j,θ_j}(k, k+n) is a Gaussian process (with independent, non-identically distributed increments) whose mean and variance are determined by the whitened signal energies Σ_{t=k+1}^{k+n} (S̃_{i,t})² and Σ_{t=k+1}^{k+n} (S̃_{j,t})². Let Θ_i = (0, ∞), i ∈ N, and assume that

  lim_{n→∞} n^{−1} Σ_{t=1}^{n} (S̃_{i,t})² = Q_i,

where 0 < Q_i < ∞. This is typically the case in most signal processing applications, e.g., for the sequence of sine pulses S_{i,n} = sin(ω_i n + φ_i) with frequency ω_i and phase φ_i. Then, for all k ∈ Z_+ and θ_i, θ_j ∈ (0, ∞),

  I_i(θ_i) = θ_i² Q_i/(2σ²) and I_ij(θ_i, θ_j) = θ_i² Q_i/(2σ²) + θ_j² Q_j/(2σ²),

so that condition C_1 holds.
Furthermore, since all moments of the LLR are finite, condition C_2 holds for all r ≥ 1. Indeed, using (58), we obtain that the centered and normalized LLR n^{−1/2}(λ_{i,θ_i}(k, k+n) − E_{k,i,θ_i}[λ_{i,θ_i}(k, k+n)]) is a sequence of normal random variables with mean zero and variance σ²_{i,n} = n^{−1} σ^{−2} θ_i² Σ_{t=k+1}^{k+n} (S̃_{i,t})², which is asymptotic to θ_i² Q_i/σ². Thus, for sufficiently large n there exists δ_0 > 0 such that σ²_{i,n} ≤ δ_0 + θ_i² Q_i/σ², and we obtain that for all large n the left-tail probabilities in condition C_2 are bounded by normal tail probabilities whose rth moments are finite for all r ≥ 1, due to the finiteness of all moments of the normal distribution; hence condition C_2 holds for all r ≥ 1. Obviously, inf_{θ_j∈(0,∞)} I_ij(θ_i, θ_j) = θ_i² Q_i/(2σ²) = I_i(θ_i) > 0. Therefore, by Theorem 3, the detection-identification rule δ_A is asymptotically first-order optimal with respect to all positive moments of the detection delay, and the asymptotic formulas (48) and (50) hold with inf_{θ_j∈(0,∞)} I_ij(θ_i, θ_j) = I_i(θ_i). If β_ji ≥ α_i for all j ≠ i, β̄_j ≥ ᾱ, and µ = 0, then the asymptotic formulas (54) and (55) hold. Note that by condition C_2 the rule δ_A is asymptotically optimal for almost arbitrary mixing distributions W_i(θ_i). In this example, it is most convenient to select the conjugate prior W_i(θ_i) = F(θ_i/v_i), where F(y) is the standard normal distribution function and v_i > 0, in which case the decision statistics can be computed explicitly. It is worth noting that this example arises in certain interesting practical applications, e.g., in multichannel/multisensor surveillance systems such as radars, sonars, and electro-optic/infrared sensor systems, which deal with detecting moving and maneuvering targets that appear at unknown times; it is necessary to detect a signal from a randomly appearing target in clutter and noise with the smallest possible delay, as well as to identify the channel where it appears. See [1], [6], [13], [18]. Another challenging application area where the multichannel model is useful is cyber-security [16], [20], [24].
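For the AR(p) example, the whitened signal S̃_{i,n} and the limit Q_i, which determine the Kullback-Leibler number I_i(θ_i) = θ_i²Q_i/(2σ²), can be computed directly. The helper names below are illustrative; `kl_number` estimates Q_i by the finite-sample average of the squared whitened signal.

```python
def whiten(signal, rho):
    """Whitened signal tilde S_n = S_n - sum_{t=1}^{p_n} rho_t * S_{n-t}
    for the AR(p) noise model (57), with p_n = min(p, n) at the start."""
    p = len(rho)
    out = []
    for n, s in enumerate(signal):
        acc = s
        for t in range(1, min(p, n) + 1):
            acc -= rho[t - 1] * signal[n - t]
        out.append(acc)
    return out

def kl_number(theta, signal, rho, sigma):
    """Approximate Kullback-Leibler number I_i(theta) = theta^2 Q_i/(2 sigma^2),
    with Q_i estimated by the sample average of the squared whitened signal."""
    w = whiten(signal, rho)
    Q = sum(x * x for x in w) / len(w)
    return theta * theta * Q / (2.0 * sigma * sigma)
```

For white noise (all AR coefficients zero) and a constant unit signal, Q_i = 1 and the number reduces to θ²/(2σ²), the familiar Gaussian mean-shift value.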
Malicious intrusion attempts in computer networks (spam campaigns, personal data theft, worms, distributed denial-of-service (DDoS) attacks, etc.) incur significant financial damage and severely harm the integrity of personal information. It is therefore essential to devise automated techniques to detect computer network intrusions as quickly as possible so that an appropriate response can be provided and the negative consequences for the users are eliminated. In particular, DDoS attacks typically involve many traffic streams, resulting in a large number of packets aimed at congesting the target's server or network. 

IX. CONCLUDING REMARKS 

1. Since we do not specify a class of models for the observations, such as Gaussian, Markov, or HMM, and build the decision statistics on the LLR processes, we restrict the behavior of the LLRs, which is expressed by conditions C_1 and C_2 related to the law of large numbers for the LLR and the rates of convergence in this law of large numbers. As the example in Section VIII shows, these conditions hold for additive changes (in the mean) of a Gaussian AR(p) process. These conditions also hold in a variety of non-additive examples (detection of changes in the spectrum of time series such as AR(p) and ARCH(p) processes), as well as for a large class of homogeneous Markov processes [10], [11], [17, Sec. 3.1, Ch. 4] and for hidden Markov models with a finite hidden state space [22]. 

2. While we focused on the multistream detection-identification problem (1), it should be noted that similar results also hold in the "scalar" detection-isolation problem, when the observations {X_n}_{n≥1} represent either a scalar process or a vector process in which all components change at time ν. Specifically, let {f_θ(X_t|X^{t−1}), θ ∈ Θ} be a parametric family of densities and, for i = 1, . . . , N and Θ_i ⊂ Θ, consider the model in which X_t has conditional density g(X_t|X^{t−1}) for t ≤ ν and conditional density f_θ(X_t|X^{t−1}), θ ∈ Θ_i, for t > ν, where g(X_t|X^{t−1}) and f_θ(X_t|X^{t−1}) are conditional pre- and post-change densities.
In other words, there are N types of change, and for the ith type the value of the post-change parameter θ belongs to a subset Θ_i of the parameter space Θ. It is necessary to detect and isolate a change as rapidly as possible, i.e., to identify which type of change has occurred. The change detection-identification rule δ_A = (d_A, T_A) is defined as in (7), with the statistics Λ̄^{π,W}_{ij}(n) modified accordingly. Write λ_θ(k, k+n) = log LR_θ(k, n) and λ_{θ,θ*}(k, k+n) = λ_θ(k, k+n) − λ_{θ*}(k, k+n), where λ_{θ*}(k, k+n) = 0 for θ* = θ_0, i.e., when there is no change. Conditions C_1 and C_2 are modified as follows: there exist positive and finite numbers I(θ, θ_0) = I(θ), θ ∈ Θ_i, i ∈ N, and I(θ, θ*), θ* ∈ Θ_j, j ∈ N \ {i}, θ ∈ Θ_i, i ∈ N, such that for any ε > 0 the corresponding right-tail condition holds, and for any ε > 0 and some r ≥ 1 the corresponding left-tail condition holds. Essentially the same argument shows that all previous results hold in this case too. In particular, the assertions of Theorem 3 remain correct as α_max, β_max → 0 for all θ ∈ Θ_i and i ∈ N, i.e., the detection-identification rule δ_A is asymptotically optimal to first order. Note also that, in general, these asymptotics do not reduce to (54) even when α_i = β_ji; everything depends on the configuration of the hypotheses. 

3. For independent observations, as well as for many Markov and certain hidden Markov models, the decision statistics Λ̄^{π,W}_{ij}(n) can be computed effectively, so implementation of the proposed detection-identification rule is not an issue. Still, in general, the computational complexity and memory requirements of the rule δ_A are high. To avoid this complication, the rule δ_A can be modified into a window-limited version in which the summation in the statistics Λ̄^{π,W}_{ij}(n) over potential change points k is restricted to a sliding window of size ℓ.
Following the guidelines of [17, Ch 3, Sec 3.10] (where asymptotic optimality of mixture window-limited rules was established in the single-stream case), it can be shown that the window-limited version also has first-order asymptotic optimality properties as long as the size of the window ℓ(A) goes to infinity as A → ∞ at such a rate that ℓ(A)/log A → ∞ but log ℓ(A)/log A → 0. The details are omitted.

4. If π ∈ C(μ = 0), or if π = π_{α,β} depends on α, β and μ_{α,β} → 0 as α_max, β_max → 0, then an alternative detection-identification rule δ*_A = (d*, T*_A), defined as in (7)-(8) with the statistics Λ^{π,W}_{ij}(n) in the definition of T_A replaced by suitably modified statistics, is also asymptotically optimal to first order. Specifically, with a suitable selection of thresholds, the asymptotic approximations (54) and (55) hold for δ*_A.

5. For practical purposes, it is more reasonable to consider a "frequentist" problem setup that does not use prior distributions of the change point π and of the hypotheses p. We believe that the most reasonable performance metric for false alarms is the maximal conditional local probability of a false alarm in a prespecified time window ℓ, sup_{1 ≤ k < ∞} P_∞(k ≤ T < k + ℓ | T > k) (see, e.g., [17], [19] for a detailed discussion). The optimality results obtained in this paper for the Bayesian problem are of importance in the frequentist (minimax and pointwise) problem, which can be embedded into the Bayesian criterion with an asymptotically improper uniform distribution of the change point. See Pergamenchtchikov and Tartakovsky [10], [11] and Tartakovsky [17, Ch 4] for the single-population case.

The author would like to thank the referees, whose comments improved the article.

APPENDIX

Proof of Theorem 2: The proof is split into two parts. Part 1: Proof of asymptotic inequalities (38) and (39).
To prove (38) and (39), note first that, by the Chebyshev inequality, for every ε ∈ (0, 1) and r > 0 the desired bounds hold whenever, for all ε ∈ (0, 1) and all fixed k ∈ Z_+, the limiting equality (A.1) holds as α_max, β_max → 0 uniformly over δ ∈ C_π(α, β); inequality (38) then follows since ε can be arbitrarily small. Analogously, inequality (39) holds whenever (A.2) is satisfied. Hence, we now focus on proving equalities (A.1) and (A.2).

For any δ ∈ C_π(α, β) and k ≥ 0, we have

PMI_{ij}(δ) = sup_{θ_i ∈ Θ_i} Σ_{s=−1}^{∞} π_s P_{s,i,θ_i}(d = j, T < ∞) ≤ β_{ij},

so that, for any δ ∈ C_π(α, β), to prove (A.1) it suffices to establish (A.6). For the sake of brevity, we write λ_{i,j}(k, k+n) for the LLR λ_{i,θ_i; j,θ_j}(k, k+n). Let A_{k,β} = {k < T ≤ k + M_{β_{ji}}}. Changing the measure P_{k,j,θ_j} → P_{k,i,θ_i}, for any C > 0 we obtain a bound whose last inequality follows from the trivial inequality P(A ∩ B) ≥ P(A) − P(B^c). Since, by (A.4), sup_{θ_j ∈ Θ_j} P_{k,j,θ_j}(d = i, T < ∞) ≤ β_{ji}/π_k, this together with (A.7) yields the inequality

sup_{δ ∈ C_π(α,β)} P_{k,i,θ_i}(0 < T − k ≤ M_{β_{ji}}, d = i) ≤ β_{ji}^{ε^2}/π_k + p_{M_{β_{ji}}, k}(ε; i, θ_i; j, θ_j).

The first term goes to zero for any fixed k, and the second term also goes to zero as β_max → 0 by condition C1, which implies equalities (A.6) and (A.1). Next, multiplying both sides of inequality (A.7) by π_k and summing over k ≥ 0, we obtain a three-term bound in which K_β is an arbitrary integer that goes to infinity as β_max → 0. Obviously, the first term goes to 0 as β_max → 0. The second term, P(ν > K_β), goes to 0 by conditions (23) and (24). The third term also goes to 0 by condition C1 and Lebesgue's dominated convergence theorem. Hence (A.2) holds for any δ ∈ C_π(α, β) as α_max, β_max → 0, which yields inequality (39).

Part 2: Proof of asymptotic inequalities (40) and (41).
Changing the measure P_∞ → P_{k,i,θ_i} and using an argument similar to that used in Part 1 to obtain (A.7), with M_{β_{ji}} replaced by N_{α_i}, we obtain, for all ε ∈ (0, 1), the bounds (A.9) and (A.10). Using (A.9) and (A.10), we find that, for every fixed k ∈ Z_+, the value of U_{α_i,k}(ε, ε_1) tends to zero, and also p_{N_{α_i},k}(ε; i, θ_i) → 0 as α_max → 0 by condition C1. Hence, for every fixed k ∈ Z_+, the limiting relation (A.11) holds as α_max → 0 uniformly over δ ∈ C_π(α, β).

Next, by the Chebyshev inequality and (A.11), the second term on the right-hand side goes to 0 for any fixed k ∈ Z_+. It follows that, for all fixed k ∈ Z_+, the corresponding lower bound over δ ∈ C_π(α, β) holds with ε and ε_1 arbitrarily small, which implies inequality (40).

Next, using inequalities (A.9) and (A.10): if μ > 0 then, by condition (23), log Π_{K_{α_i}} ∼ −μ K_{α_i} as α_max → 0, so Π_{K_{α_i}} → 0; if μ = 0, this probability goes to 0 as α_max → 0 as well, by condition (24). Obviously, the second term, U_{α_i,K_{α_i}}(ε, ε_1), goes to 0 as α_max → 0. By condition C1 and Lebesgue's dominated convergence theorem, the third term goes to 0, and therefore all three terms vanish as α_max, β_max → 0 for all ε, ε_1 > 0. Since, by (A.8), P^π_{i,θ_i}(T > ν, d = i) → 1 as α_max, β_max → 0 for any δ ∈ C_π(α, β), a final application of the Chebyshev inequality shows that the corresponding moment bound holds for any δ ∈ C_π(α, β) as α_max, β_max → 0, up to a factor 1 + o(1). Since ε and ε_1 can be arbitrarily small, inequality (41) follows.

Proof of Lemma 1: For k ∈ Z_+, define the exit times

τ^{(k)}_i(A) = inf{n ≥ 1 : λ_{i,W}(k, k+n) − λ^π_j(k+n) ≥ log(A_{ij}/π_k) for all j ∈ N_0 \ {i}}, i ∈ N,

where λ_{i,W}(k, k+n) = log Λ_{i,W}(k, k+n) and λ^π_0(k+n) = log P(ν ≥ k+n) = log Π_{k+n−1}.
Obviously, for any n > k and k ∈ Z_+,

log Λ^{π,W}_{ij}(n) ≥ log [ π_k LR_{i,W}(k, n) / Σ_{ℓ=−1}^{n−1} π_ℓ sup_{θ_j ∈ Θ_j} LR_{j,θ_j}(ℓ, n) ] = λ_{i,W}(k, n) − λ^π_j(n) + log π_k,

so for every set A = (A_{ij}) of positive thresholds A_{ij} we have (T_A − k)^+ ≤ τ^{(k)}_i(A), and P_{k,i,θ_i}(τ^{(k)}_i(A) > n) is bounded by the probability that λ_{i,W}(k, k+n) − λ^π_j(k+n) < log(A_{ij}/π_k) for some j ∈ N_0 \ {i}. Clearly, for all n ≥ M_i(A_{i0}), the last probability does not exceed the probability

P_{k,i,θ_i}( λ_{i,W}(k, k+n)/n < I_i(θ_i) + μ − ε − |log Π_{k+n−1}|/n ),

and, by condition CP1, for a sufficiently large value of A_{i0} there exists a small κ such that the required inequality holds for all sufficiently large n. Also, with Γ_{κ,θ_i} = {ϑ ∈ Θ_i : |ϑ − θ_i| < κ}, for all sufficiently large n and A_min for which κ + |log W(Γ_{κ,θ_i})|/n < ε/2, we have

P_{k,i,θ_i}(τ^{(k)}_i(A) > n) ≤ P_{k,i,θ_i}( (1/n) inf_{ϑ ∈ Γ_{κ,θ_i}} λ_{i,ϑ}(k, k+n) < I_i(θ_i) − ε + κ + (1/n)|log W(Γ_{κ,θ_i})| ) ≤ P_{k,i,θ_i}( (1/n) inf_{ϑ ∈ Γ_{κ,θ_i}} λ_{i,ϑ}(k, k+n) < I_i(θ_i) − ε/2 ). (A.14)

Using (A.13) and (A.14) yields inequality (46), and the proof is complete.

Proof of Proposition 1: By Theorem 1, the rule δ_A belongs to the class C_π(α, β) when the thresholds A_{ij} are selected appropriately, and hence Theorem 2 implies (under condition C1) the asymptotic (as A_min → ∞) lower bounds, which hold for all r > 0, θ_i ∈ Θ_i, and i ∈ N. Thus, to prove the validity of the asymptotic approximations (42) and (43), it suffices to show that, under the left-tail condition C2, for 0 < m ≤ r and all θ_i ∈ Θ_i and i ∈ N, the asymptotic upper bounds (A.17) and (A.18) hold as A_min → ∞.

It follows from inequality (46) in Lemma 1 that, for any 0 < ε < J_{ij}(θ_i, μ), the rth moment of the detection delay is bounded by

[1 + Ψ_i(A, π_k, θ_i, μ, ε)]^r + r 2^{r−1} Υ_r(κ, ε; i, θ_i), (A.19)

where Υ_r(κ, ε; i, θ_i) is defined in (19). Similarly to (A.3), we obtain a further bound and hence, combining it with inequality (A.19), we arrive at the bound (A.20), whose main term involves the factor (I_i(θ_i) + μ − ε)^{−r} and whose remainder is r 2^{r−1} Υ_r(κ, ε; i, θ_i). Since, by condition C2, Υ_r(κ, ε; i, θ_i) < ∞ for all θ_i ∈ Θ_i and i ∈ N, this implies the asymptotic upper bound (A.17). This completes the proof of the asymptotic approximation (42).
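The first-order character of approximation (42) can be probed by Monte Carlo in the simplest special case: a single simple post-change hypothesis, i.i.d. N(0, 1) to N(θ, 1) data, and a geometric prior, for which the mixture statistic reduces to the classical Shiryaev recursion. All parameter values below are illustrative assumptions, and the simulated average delay is only expected to agree with log A/(I(θ) + μ) to first order:

```python
import math
import random

random.seed(3)

theta = 1.0
I = theta ** 2 / 2             # Kullback-Leibler information I(theta)
rho = 0.05
mu = -math.log(1 - rho)        # exponential rate mu of the geometric prior
A = 1e4                        # detection threshold (assumed)

def shiryaev_delay(nu, max_n=100_000):
    """Detection delay of the Shiryaev statistic, computed recursively as
    R_n = (1 + R_{n-1}) * LR_n / (1 - rho), stopped at threshold A."""
    R = 0.0
    for n in range(1, max_n + 1):
        xt = random.gauss(theta if n > nu else 0.0, 1.0)
        lr = math.exp(theta * xt - theta ** 2 / 2)
        R = (1.0 + R) * lr / (1.0 - rho)
        if R >= A:
            return max(n - nu, 0)
    return None

delays = [shiryaev_delay(nu=50) for _ in range(500)]
avg = sum(delays) / len(delays)
first_order = math.log(A) / (I + mu)   # log A / (I(theta) + mu)
print(round(avg, 1), round(first_order, 1))
```

The Monte Carlo average differs from log A/(I(θ) + μ) by lower-order terms (threshold overshoot and the value of the statistic at the change point), consistent with a first-order approximation.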
Next, using inequality (A.19), we obtain a bound by the sum over k ≥ −1 of π_k [1 + Ψ_i(A, π_k, θ_i, μ, ε)]^r plus r 2^{r−1} Υ_r(κ, ε; i, θ_i). Recall that we set T_A − k = T_A for k = −1. Applying this inequality, we arrive at

Σ_{k=−1}^{∞} π_k [1 + Ψ_i(A, π_k, θ_i, μ, ε)]^r + r 2^{r−1} Υ_r(κ, ε; i, θ_i). (A.21)

By condition C2, Υ_r(κ, ε; i, θ_i) < ∞ for any ε > 0 and any θ_i ∈ Θ_i and, by condition (24), Σ_{k=0}^{∞} π_k |log π_k|^r < ∞. This implies that, as A_min → ∞, for all 0 < m ≤ r, all θ_i ∈ Θ_i, and all i ∈ N, the following upper bound holds: R^r_{i,θ_i}(δ_A) ≤ [Ψ_i(A, π_k = 1, θ_i, μ, ε)]^r (1 + o(1)). Since ε can be arbitrarily small and lim_{ε→0} Ψ_i(A, π_k = 1, θ_i, μ, ε) = Ψ_i(A, θ_i, μ), the upper bound (A.18) follows and the proof of the asymptotic approximation (43) is complete.

REFERENCES

[1] Statistical Radar Theory.
[2] Asymptotically optimal Bayesian sequential change detection and identification rules.
[3] Multihypothesis sequential probability ratio tests-Part II: Accurate asymptotic expansions for the expected sample size.
[4] Sequential multiple hypothesis testing and efficient fault detection-isolation in stochastic systems.
[5] Procedures for reacting to a change in distribution.
[6] Sonar and Underwater Acoustics.
[7] A generalized change detection problem.
[8] A simple recursive algorithm for diagnosis of abrupt changes in random signals.
[9] A lower bound for the detection/isolation delay in a class of sequential tests.
[10] Asymptotically optimal pointwise and minimax quickest change-point detection for dependent data.
[11] Asymptotically optimal pointwise and minimax change-point detection for general stochastic models with a composite post-change hypothesis.
[12] Optimal detection of a change in distribution.
[13] Fundamentals of Radar Signal Processing.
[14] On optimum methods in quickest detection problems.
[15] Optimal Stopping Rules, Series on Stochastic Modelling and Applied Probability.
[16] Rapid detection of attacks in computer networks by quickest changepoint detection methods.
[17] Sequential Change Detection and Hypothesis Testing: General Non-i.i.d. Stochastic Models and Asymptotically Optimal Rules, Monographs on Statistics and Applied Probability 165.
[18] Adaptive spatial-temporal filtering methods for clutter removal and target tracking.
[19] Sequential Analysis: Hypothesis Testing and Changepoint Detection, Monographs on Statistics and Applied Probability 136.
[20] Efficient computer network anomaly detection by changepoint detection methods.
[21] Multidecision quickest change-point detection: Previous achievements and open problems.
[22] Asymptotic Bayesian theory of quickest change detection for hidden Markov models.
[23] Asymptotic optimality of mixture rules for detecting changes in general stochastic models.
[24] Detection of intrusions in information systems by sequential changepoint methods.

Alexander G. Tartakovsky (M'01-SM'02). His research interests include theoretical and applied statistics; applied probability; sequential analysis; changepoint detection phenomena; and a variety of applications including statistical image and signal processing, video tracking, and detection and tracking of targets in radar and infrared search and track systems. During 1981-92, he was first a Senior Research Scientist and then a Department Head at the Institute of Radio Technology (Moscow, Russian Academy of Sciences), as well as a Professor at FizTech, working on the application of statistical methods to optimization and modeling of information systems. From 1993 to 1996, Dr. Tartakovsky worked at the University of California, Los Angeles (UCLA)