Discovering During-Temporal Patterns (DTPs) in Large Temporal Databases ∗

Li Zhang^a, Guoqing Chen^a,†, Tom Brijs^b, Xing Zhang^a

a School of Economics and Management, Tsinghua University, Beijing, 100084, P.R. China
b Transportation Research Institute, Hasselt University, Diepenbeek, B3920, Belgium

Abstract

Large temporal databases (TDBs) usually contain a wealth of data about temporal events. Aimed at discovering temporal patterns with the during relationship (during-temporal patterns, DTPs), which is deemed common and potentially valuable in real-world applications, this paper presents an approach to finding such DTPs by investigating some of their properties and incorporating them as pruning strategies into the corresponding algorithm, so as to optimize the mining process. Results from synthetic data reveal that the algorithm is efficient and linearly scalable with regard to the number of temporal events. Finally, we apply the algorithm to the weather forecast field and obtain effective results.

Keywords: data mining; during relationship; temporal pattern

1 Introduction

In recent years, the discovery of association rules [14] and sequential patterns [13] has been a major research issue in the area of data mining. While typical association rules usually reflect related events occurring at the same time, sequential patterns represent commonly occurring sequences of events in time order. However, real-world businesses often generate massive volumes of data in daily operations and decision-making processes that are of a richer temporal nature. For instance, a customer could buy a DVD player after buying a TV; the duration of an ERP project partially overlapped the duration of a BPR project; and a patient suffered from cough during a period of fever.
∗The work was partly supported by the National Natural Science Foundation of China (70231010/70321001), Tsinghua University's Research Center for Contemporary Management, and the Bilateral Scientific and Technological Cooperation between China and Flanders.
†Corresponding author. E-mail: chengq@em.tsinghua.edu.cn

Apparently, such temporal relationships (e.g., after, overlap, during, etc.) are kinds of real-world semantics that are, in many cases, considered meaningful and useful in practice. Usually, temporal relationships between events with different time stamps can be categorized into several types in the form of temporal comparison predicates such as after, meet, overlap, during, start, finish, and equal [10]. Though recent years have witnessed several efforts on discovering the after relationship [7,8,11,13], more in-depth investigations of this relationship are still badly needed, let alone explorations of the other types of temporal relationships. Furthermore, results from the studies on the after relationship can hardly be extended directly to relationships such as during or overlap. This may be attributed to the fact that in the after relationship, events can generally be treated as time points, whereas in the other relationships, events are of a time-interval nature. On the other hand, both Rainsford [3] and Höppner [6] have recently discussed the issue of finding temporal relationships between time-interval-based events using temporal comparison predicates [10], but with different mining approaches. Rainsford introduced temporal semantics into association rules, in the form X⇒Y∧P1∧P2∧...∧Pn (n≥0), where X and Y are itemsets, X ∩ Y = ∅, and P1∧P2∧...∧Pn is a conjunction of binary temporal predicates. While mining a database DT, a rule is accepted when its confidence factor 0≤c≤1 is equal to or larger than a given threshold.
Similarly, each predicate Pi is measured with a temporal confidence factor 0≤tcPi≤1. The algorithm first generates traditional association rules without considering temporal factors, and then finds all possible pairings of temporal items in each rule. Subsequently, these pairings are tested so that strong temporal relationships can be found. Obviously, the complexity of this sequentially executed algorithm rises rapidly as the number of typical rules grows. In contrast, Höppner proposed a technique for discovering temporal patterns in state sequences. He defined the support of a pattern as the total time during which the pattern can be observed within a sliding window, which must be predetermined by the user. However, a major concern with this technique is how to decide a proper size for the sliding window, since the window size affects the mining results. Furthermore, changes to the sliding window lead to a sub-pattern check, which requires a backtracking mechanism and is computationally expensive. Like many existing data mining algorithms, this algorithm needs to scan the database repeatedly, which significantly lowers its efficiency. This paper focuses on a particular type of temporal relationship, namely during, which represents that one event starts and ends within the duration of another event. Notably, this during relationship covers the temporal semantics of during, start, finish and equal described in [10]. An approach will be proposed to discover the so-called during-temporal patterns (DTPs) in large temporal databases, which are considered common and potentially valuable in real-world applications. One idea behind the approach is to design the corresponding algorithm so as to reduce the workload of scanning the database.
In doing so, the database is partitioned into disjoint datasets with two operations when calculating the support of each pattern, so that scanning the whole database can be avoided. Furthermore, some properties of DTPs are investigated and then incorporated into the algorithm as pruning strategies to optimize the mining process for efficiency purposes.

The remainder of this paper is organized as follows. Section 2 formulates the problem and introduces related notions. In Section 3, the algorithmic details are provided, along with some of the related properties. Experiments on synthetic data and real weather data are discussed in Section 4, and Section 5 concludes the paper.

2 The problem formulation

Let A={a1,a2,...,am} be a set of states, and DT a temporal database as shown in Table 1. DT contains N records, each of the form {a,(st,et)} with respect to an event e, i.e., e=(a,t), where a is the state involved in the event and t=(st,et) is the time interval indicating the starting time (st) and ending time (et) of state a. A specific event is denoted as el=(ai,tl) (1≤l≤N and 1≤i≤m) with tl=(stl,etl), i.e., S(el)=stl and E(el)=etl. For example, with a1=rain, e1=(a1,(1,20)) in Table 1 means that it began to rain at 1:00h and stopped raining at 20:00h.

Table 1: A Temporal Database

Event  State  Starting Time  Ending Time
e1     a1     1              20
e2     a3     1              4
e3     a4     5              7
e4     a1     22             28
e5     a2     2              8
e6     a3     10             13
e7     a5     25             35
e8     a3     23             28
e9     a4     25             27
e10    a6     25             26
e11    a1     30             40
e12    a3     30             38
e13    a4     34             38
e14    a6     37             37

Definition 1 Let el=(ai,tl) and ek=(aj,tk) be two events in DT.
We call el during ek (or ek contains el), denoted as el <d ek, if tl ∩ tk = tl, i.e., S(ek) ≤ S(el) and E(el) ≤ E(ek).

Figure 5: The lattice of frequent patterns (nodes, bottom-up: {}; a1, a3, a4, a6; a3⇒da1, a4⇒da1, a6⇒da1, a4⇒da3, a6⇒da3, a6⇒da4; a4⇒da3⇒da1, a6⇒da3⇒da1, a6⇒da4⇒da1, a6⇒da4⇒da3; a6⇒da4⇒da3⇒da1)

Next, the set g(α⇒dp(β,−1)) for a new candidate pattern α⇒dp(β,−1) is calculated and the corresponding h(α⇒dp(β,−1)) is obtained. We define the operation ∩d on the sets g(α) and g(β) as follows:

g(α) ∩d g(β) = {tl ∈ g(α) | ∃tk ∈ g(β) such that tl = tl ∩ tk}

That is, if a time interval tl ∈ g(α) is totally contained in a time interval tk ∈ g(β), then tl is an element of the set g(α) ∩d g(β).

Proposition 1 Let β be ak ⇒d ak−1 ⇒d ... ⇒d a2 ⇒d a1, α be ak ⇒d ak−1 ⇒d ... ⇒d a2 ⇒d ap, and γ be ak−1 ⇒d ... ⇒d a2 ⇒d ap ⇒d a1. The three patterns can generate a longer pattern σ: ak ⇒d ak−1 ⇒d ... ⇒d a2 ⇒d ap ⇒d a1. The set g for the longer pattern σ is g(σ) = g(α) ∩d g(γ) = g(β) ∩d g(γ).

Proof: According to the definition of g, we have

g(α) = g(ak ⇒d ak−1 ⇒d ... ⇒d a2 ⇒d ap) = {tk ∈ g(ak) | ∃tp ∈ g(ap), ti ∈ g(ai) (i = 2, ..., k−1), such that ti+1 ∩ ti = ti+1 for all i = 2, 3, ..., k−1, and t2 ∩ tp = t2}

g(γ) = g(ak−1 ⇒d ... ⇒d a2 ⇒d ap ⇒d a1) = {t^p_{k−1} ∈ g(ak−1) | ∃t^p_p ∈ g(ap), t^p_1 ∈ g(a1), t^p_i ∈ g(ai) (i = 2, ..., k−2), such that t^p_{i+1} ∩ t^p_i = t^p_{i+1} for all i = 2, 3, ..., k−2, t^p_2 ∩ t^p_p = t^p_2, and t^p_p ∩ t^p_1 = t^p_p}

g(σ) = g(ak ⇒d ak−1 ⇒d ... ⇒d a2 ⇒d ap ⇒d a1) = {t^q_k ∈ g(ak) | ∃t^q_p ∈ g(ap), t^q_1 ∈ g(a1), t^q_i ∈ g(ai) (i = 2, ..., k−1), such that t^q_{i+1} ∩ t^q_i = t^q_{i+1} for all i = 2, 3, ..., k−1, t^q_2 ∩ t^q_p = t^q_2, and t^q_p ∩ t^q_1 = t^q_p}

First, we prove g(σ) ⊆ g(α) ∩d g(γ). For any t^q_k ∈ g(σ), the conditions of g(σ) give t^q_{i+1} ∩ t^q_i = t^q_{i+1} and t^q_2 ∩ t^q_p = t^q_2 for t^q_i ∈ g(ai), i = 2, 3, ..., k−1. Letting tk = t^q_k, we see that t^q_k meets the conditions of g(α), so g(σ) ⊆ g(α). It then remains to prove that for any t^q_k ∈ g(σ) there exists t^p_{k−1} ∈ g(γ) such that t^q_k ∩ t^p_{k−1} = t^q_k.
For any t^q_k ∈ g(σ), there is a group of time intervals t^q_{k−1}, t^q_{k−2}, ..., t^q_2, t^q_1, t^q_p satisfying the conditions of g(σ). This group of time intervals also meets the conditions of g(γ), so t^q_{k−1} ∈ g(γ). Taking t^p_{k−1} = t^q_{k−1}, we have t^q_k ∩ t^p_{k−1} = t^q_k, since t^q_k ∩ t^q_{k−1} = t^q_k by the conditions of g(σ). Hence, g(σ) ⊆ g(α) ∩d g(γ).

Second, we prove g(α) ∩d g(γ) ⊆ g(σ). tk ∈ g(α) ∩d g(γ) means that tk ∈ g(α) and there exists t^p_{k−1} ∈ g(γ) such that tk ∩ t^p_{k−1} = tk. In turn, t^p_{k−1} ∈ g(γ) means that there exist corresponding t^p_{k−2}, ..., t^p_2, t^p_p, t^p_1 satisfying t^p_{i+1} ∩ t^p_i = t^p_{i+1} for all i = 2, 3, ..., k−2, t^p_2 ∩ t^p_p = t^p_2, and t^p_p ∩ t^p_1 = t^p_p. Taking t^q_k = tk, t^q_p = t^p_p, and t^q_i = t^p_i for all i = 1, 2, ..., k−1, we have

t^q_k ∩ t^q_{k−1} = tk ∩ t^p_{k−1} = tk;
t^q_{i+1} ∩ t^q_i = t^p_{i+1} ∩ t^p_i = t^p_{i+1} = t^q_{i+1} (i = 2, 3, ..., k−2);
t^q_2 ∩ t^q_p = t^p_2 ∩ t^p_p = t^p_2 = t^q_2;
t^q_p ∩ t^q_1 = t^p_p ∩ t^p_1 = t^p_p = t^q_p,

which means that tk ∈ g(σ). Thus, for any tk ∈ g(α) ∩d g(γ), tk ∈ g(σ); that is, g(α) ∩d g(γ) ⊆ g(σ). In the same way, we get g(σ) = g(β) ∩d g(γ). ∎

The proposition indicates that the set g for a new candidate can be computed from the frequent patterns' sets g(α) and g(γ), or g(β) and g(γ), while the set h for the new candidate pattern can be obtained from h(γ) alone:

h(α ⇒d p(β, −1)) = h(α ⇒d p(γ, −1)) = h(γ) − {ak}

Thus, we get the sets g and h for a new candidate pattern without scanning the original database and without recomputing from the g sets of the single states. The procedure of the join phase is shown in Figure 6.

CDTPk.Gen()
1. for all patterns β ∈ FDTPk−1 do
2.   for all aj ∈ h(β) do
3.     for all α ∈ FDTPk−1 with p(α,−1) = aj do
4.       if Property 4 is satisfied then
5.         CDTPk = CDTPk ∪ {α ⇒d p(β,−1)}
6.       end if
7.     end for
8.   end for
9. end for

Figure 6: The procedure used for CDTPk

(3) Pruning phase. In this phase, if the support of a candidate pattern in CDTPk is smaller than the user-specified threshold, it is pruned from CDTPk. Finally, FDTPk is obtained.
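As a concrete illustration, the join of two frequent patterns via the ∩d operation, followed by support pruning, can be sketched as follows. The tuple representation and helper names are ours, not the paper's; a pattern is written innermost state first.

```python
def contained(g1, g2):
    # The paper's g(alpha) ∩d g(beta): keep the intervals of g1 that are
    # totally contained in some interval of g2.
    return [t for t in g1 if any(s <= t[0] and t[1] <= e for (s, e) in g2)]

def next_level(frequent, minsup):
    """One join-and-prune iteration in the spirit of Figures 6-7 (a sketch).
    A pattern is a tuple such as ('a4', 'a3', 'a1') for a4 =>d a3 =>d a1;
    two patterns join when they overlap on all but one state, and by
    Proposition 1 the g-set of the longer candidate is computed from the
    parents' g-sets, without rescanning the database."""
    out = {}
    for alpha, g_alpha in frequent.items():
        for gamma, g_gamma in frequent.items():
            if alpha[1:] == gamma[:-1]:
                g_sigma = contained(g_alpha, g_gamma)  # g(sigma) = g(alpha) ∩d g(gamma)
                if len(g_sigma) >= minsup:             # pruning phase: |g| >= minsupport
                    out[alpha + (gamma[-1],)] = g_sigma
    return out

# Two level-1 patterns from the running example of Section 3.2:
fdtp1 = {('a4', 'a3'): [(25, 27), (34, 38)],
         ('a3', 'a1'): [(1, 4), (10, 13), (23, 28), (30, 38)]}
print(next_level(fdtp1, minsup=2))  # {('a4', 'a3', 'a1'): [(25, 27), (34, 38)]}
```

Joining a4 ⇒d a3 with a3 ⇒d a1 here reproduces the candidate a4 ⇒d a3 ⇒d a1 of the worked example, with its g-set obtained purely from the parents' g-sets.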
Steps 2 and 3 are carried out alternately until Property 2 can no longer be satisfied. The procedure of the pruning phase is shown in Figure 7.

FDTPk.Gen()
1. FDTP0 known; FDTPk = ∅ (k = 1, 2, ...)
2. for (k = 1; FDTPk−1 ≠ ∅; k++) do
3.   CDTPk = CDTPk.Gen(FDTPk−1);
4.   for all candidate patterns β ∈ CDTPk do
5.     FDTPk = {β ∈ CDTPk | |g(β)| ≥ minsupport}
6.   end for
7. end for
8. FDTP = ∪k FDTPk

Figure 7: The procedure used for FDTPk

(4) Generating valid DTPs. Given a pattern α: ak ⇒d ak−1 ⇒d ... ⇒d a2 ⇒d a1, it is necessary to know how frequent α is when a1 has occurred, when a2 ⇒d a1 has occurred, when a3 ⇒d a2 ⇒d a1 has occurred, and so on. Thus, we get the patterns β⇒dγ mentioned in Section 3:

(ak ⇒d ak−1 ⇒d ... ⇒d aj+1) ⇒d (aj ⇒d aj−1 ⇒d ... ⇒d a2 ⇒d a1)  (1 ≤ j ≤ k − 1)

That is, a frequent pattern can generate at most (k−1) valid DTPs. The confidence degree is calculated starting from the longest consequent pattern, i.e., from j = k−1, and the patterns meeting the confidence threshold are valid DTPs. Computation for the pattern α stops as soon as confidence(β⇒dγ) falls below the threshold; otherwise, the confidence degree of the next shorter consequent pattern, i.e., j = j−1, is computed.

In discovering DTPs, a temporal database as shown in Table 1 is usually needed, which can be obtained either directly or by converting a conventional database. Since a DTP is actually an event sequence in terms of time inclusion (i.e., the during relationship), the records of a conventional database need to be sorted by ascending start time primarily and descending end time secondarily. Consider the database in Table 1: in sorted form, DT = {e1,e2,e5,e3,e6,e4,e8,e7,e9,e10,e11,e12,e13,e14}, written DT = {e′1,e′2,e′3,...,e′12,e′13,e′14}. A set of consecutive events {e′i, e′i+1, ..., e′i+k} (k = 1, 2, ...) is called a during-sequence if e′i+j <d e′i+j−1 for all j = 1, 2, ..., k and e′i+k+1 ≮d e′i+k.
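The sort-and-split step just described can be sketched as follows. The split rule used here, starting a new during-sequence whenever an event is no longer during the head event of the current sequence, is our reading of the worked example in the text; the event encoding is illustrative.

```python
def during_sequences(events):
    """Sketch: order records by ascending start time, ties broken by
    descending end time, then cut a new during-sequence whenever the next
    event is not during the head event of the current sequence."""
    events = sorted(events, key=lambda e: (e[2], -e[3]))  # e = (name, state, st, et)
    sequences = []
    for ev in events:
        head = sequences[-1][0] if sequences else None
        if head is not None and head[2] <= ev[2] and ev[3] <= head[3]:
            sequences[-1].append(ev)   # ev occurs during the head event
        else:
            sequences.append([ev])     # ev opens a new during-sequence
    return sequences

# The records of Table 1:
table1 = [("e1", "a1", 1, 20), ("e2", "a3", 1, 4), ("e3", "a4", 5, 7),
          ("e4", "a1", 22, 28), ("e5", "a2", 2, 8), ("e6", "a3", 10, 13),
          ("e7", "a5", 25, 35), ("e8", "a3", 23, 28), ("e9", "a4", 25, 27),
          ("e10", "a6", 25, 26), ("e11", "a1", 30, 40), ("e12", "a3", 30, 38),
          ("e13", "a4", 34, 38), ("e14", "a6", 37, 37)]
print([[e[0] for e in s] for s in during_sequences(table1)])
# first sequence: ['e1', 'e2', 'e5', 'e3', 'e6'], as in the text
```

On Table 1 this reproduces the during-sequence {e1, e2, e5, e3, e6} mentioned in the following example.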
For example, in the sorted DT, {e1,e2,e5,e3,e6} is a during-sequence, since each of e2, e5, e3 and e6 occurs during e1 while the next event e4 does not. Suppose w > k and both ek and ew are events with the same state aj; then E(ek) < S(ew). Since ek is not during the period of el, we have E(ek) > E(el). Thus, S(ew) > E(el); that is, ew cannot occur during the period of el. ∎

3.2 An example

Let us take an example to explain the DTP algorithm. We execute the algorithm on the temporal database in Table 1 with a minimal support count of 2. From Figure 1 we know FDTP0 = {a1, a3, a4, a6}. Next, for each aj ∈ h(ai), add aj⇒dai into CDTP1. Thus, we get CDTP1 = {a3⇒da1, a4⇒da1, a6⇒da1, a4⇒da3, a6⇒da3, a6⇒da4}, which corresponds to the sets shown in Figure 8. Subsequently, Steps 2 and 3 of the DTP algorithm are carried out iteratively. For each aj ∈ h(β), search for the corresponding α and γ mentioned in Property 4. For example, a3 ∈ h(a4 ⇒d a1) and the support of a3 ⇒d a1 is not less than the support threshold, so we can join a3 ⇒d a1 and a4 ⇒d a3 if both patterns are frequent, and obtain the new candidate pattern a4⇒da3⇒da1 with the related sets g(a4⇒da3⇒da1) and h(a4⇒da3⇒da1) (as shown in Figure 9 (m)). Similarly, the sets g(a6⇒da4⇒da3⇒da1) and h(a6⇒da4⇒da3⇒da1) are obtained in Figure 10.
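Step (4) of the algorithm, splitting a frequent pattern into candidate rules starting from the longest consequent, can be sketched as follows. Taking confidence(β ⇒d γ) as support(α)/support(γ) is our assumption, and the support numbers in the usage example are illustrative.

```python
def valid_dtps(pattern, support, minconf):
    """Sketch of step (4): for a frequent pattern (a_k, ..., a_1), test the
    splits (a_k..a_{j+1}) =>d (a_j..a_1) starting from the longest consequent
    (j = k-1) and stop at the first split below the confidence threshold.
    Confidence is taken here as support(pattern) / support(consequent), an
    assumption about the paper's (not fully stated) definition."""
    rules = []
    for j in range(len(pattern) - 1, 0, -1):
        antecedent, consequent = pattern[:-j], pattern[-j:]
        conf = support[pattern] / support[consequent]
        if conf < minconf:
            break                      # shorter consequents are not examined
        rules.append((antecedent, consequent, conf))
    return rules

# Illustrative support counts for a4 =>d a3 =>d a1 and its sub-patterns:
support = {('a4', 'a3', 'a1'): 2, ('a3', 'a1'): 3, ('a1',): 3}
for ante, cons, conf in valid_dtps(('a4', 'a3', 'a1'), support, minconf=0.5):
    print(ante, '=>d', cons, round(conf, 2))
```

Starting from the longest consequent is sound here because a longer consequent has lower support and hence higher confidence, so the first failing split bounds all later ones.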
Figure 8: The sets of CDTP1
(i) generated by (a) and (f): g(a6 ⇒d a1): 1 (25,26); 2 (37,37). h = {a3, a4}
(j) generated by (c) and (d): g(a4 ⇒d a3): 1 (25,27); 2 (34,38). h = {a6}
(k) generated by (c) and (f): g(a6 ⇒d a3): 1 (25,26); 2 (37,37). h = {a4}
(l) generated by (d) and (f): g(a6 ⇒d a4): 1 (25,26); 2 (37,37). h = ∅

Figure 9: The sets of CDTP2
(m) generated by (g) and (j): g(a4⇒da3⇒da1): 1 (25,27); 2 (34,38). h = {a6}
(n) generated by (g) and (k): g(a6⇒da3⇒da1): 1 (25,26); 2 (37,37). h = {a4}
(o) generated by (h) and (l): g(a6⇒da4⇒da1): 1 (25,26); 2 (37,37). h = {a3}
(p) generated by (j) and (k): g(a6⇒da4⇒da3): 1 (25,26); 2 (37,37). h = ∅

Figure 10: The sets of CDTP3
(q) generated by (m) and (p): g(a6 ⇒d a4 ⇒d a3 ⇒d a1): 1 (25,26); 2 (37,37). h = ∅

Lastly, we calculate the confidence degrees of the patterns from the bottom of every sublattice. In Figure 5, there are three sublattices.

4 Experiments

To assess the relative performance of the two algorithms and study their scale-up properties, we performed several experiments on a computer with 512MB of RAM and a Pentium 4 2.6GHz processor, using several synthetic datasets and a real weather dataset stored on a local 20GB disk.

4.1 Generation of synthetic data

To evaluate the performance of the algorithms over a large volume of data, we generated synthetic temporal events that mimic events in the real world. We report the experimental results on synthetic data; the work relevant to data cleaning, which is application dependent and orthogonal to the mining technique proposed, is omitted for simplicity. To obtain reliable experimental results, the method we employed to generate synthetic data is similar to that used in [12]. Table 2 summarizes the meaning of the parameters used in the experiments. The number of input events in the temporal database depends on |Q| and |T|.
That is, the average number of events is |DT| = |Q| × |T|. The starting and ending times of each event in a during-sequence are generated randomly, subject to the during relationship. We generated the datasets with |L| = 5, N = 50 and P = 25. Table 3 summarizes the dataset parameter settings.

Table 2: Parameters
|Q|  Number of during-sequences
|T|  Average number of events per during-sequence
|L|  Average length of maximal potentially large patterns
N    Number of states
P    Number of maximal potentially large patterns

Table 3: Parameter settings (synthetic datasets)
Name           |Q|    |T|  |D|      size (MB)
Data1-Q10-T5   10000  5    50,000   0.83
Data2-Q10-T10  10000  10   100,000  1.79
Data3-Q20-T5   20000  5    100,000  1.82
Data4-Q20-T10  20000  10   200,000  3.72
Data5-Q30-T10  30000  10   300,000  5.65
Data6-Q40-T10  40000  10   400,000  7.58
Data7-Q50-T10  50000  10   500,000  9.51

4.2 Relative performance on synthetic datasets

Figure 11 shows the execution times on the first four synthetic datasets in Table 3 for decreasing values of minimum support. We did not plot the execution times of the Tree algorithm for some of the lower support values, since they are too large compared to those of the DTP algorithm. As the support threshold decreases, the execution times of both algorithms increase because of the growth in the total number of candidate and large patterns. When the support threshold is high, only a limited number of frequent patterns of length 2 are produced, so both algorithms consume little time. As the support threshold decreases, however, the performance difference becomes prominent: the DTP algorithm significantly outperforms the Tree algorithm.

4.3 The reduction of candidate patterns

As explained previously, the DTP algorithm substantially reduces the number of candidate patterns generated. The experimental results in Tables 4 and 5 show the numbers of candidate patterns of both algorithms for different supports on two of the datasets.
Figure 11: Execution times of the Tree and DTP algorithms on Data1-Q10-T5, Data2-Q10-T10, Data3-Q20-T5 and Data4-Q20-T10 (time in seconds vs. minimum support).

As shown in Table 5, the DTP algorithm achieves a 55%-75% candidate reduction rate compared to the Tree algorithm on Data3-Q20-T5. On the dataset Data4-Q20-T10, the DTP algorithm achieves a similarly large reduction in generated candidate DTPs, as shown in Table 4, where the Tree algorithm produces roughly two to four times as many candidates. Similar phenomena were observed on the other datasets. This feature of the DTP algorithm helps to reduce the execution time (as discussed in Section 4.2).

Table 4: Reduction of candidate patterns on Data4-Q20-T10
minsupport  DTP candidates  Tree candidates  frequent patterns
40%         34              112              27
30%         167             364              133
20%         552             1202             460
10%         2483            7074             1920
5%          10105           34046            8160
2.50%       35281           142009           29712

Table 5: Reduction of candidate patterns on Data3-Q20-T5
minsupport  DTP candidates  Tree candidates  frequent patterns
20%         16              53               11
10%         118             351              68
5%          454             1029             251
2.50%       1223            2827             699
1%          3166            9334             2237
0.75%       4901            12845            3022
0.50%       7202            19090            4482
0.25%       13662           39555            8691
0.10%       30964           113144           21500

4.4 Scale-up

Figure 12 shows how the DTP algorithm scales as the number of input events increases from 100,000 to 500,000, using the datasets Data2, Data4, Data5, Data6 and Data7 in Table 3. The minimum support level was set to 5%. As shown in Figure 12, the algorithm scales approximately linearly with the number of input events.
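The reduction rates quoted in Section 4.3 can be recomputed directly from the table entries; a quick check on a few Table 5 rows:

```python
# Candidate-reduction rate of DTP relative to Tree, recomputed from Table 5
# (Data3-Q20-T5): reduction = 1 - DTP_candidates / Tree_candidates.
table5 = [("20%", 16, 53), ("10%", 118, 351), ("5%", 454, 1029),
          ("2.50%", 1223, 2827), ("0.10%", 30964, 113144)]
for minsup, dtp, tree in table5:
    print(f"minsup {minsup}: {1 - dtp / tree:.0%} fewer candidates")
```

The resulting rates fall in the 55%-75% band reported in the text.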
Figure 12: Scale-up of the DTP algorithm (execution time vs. number of input events, minimum support 5%).

4.5 Experiments with a weather dataset

The DTP algorithm has also been applied to a weather dataset. The dataset was obtained from a weather station in The Netherlands in 2002 and contains weather records with 17 attributes for each hour of the year. The attributes include wind direction, average wind speed, maximum wind gust, average hourly temperature, percentage relative humidity, global hourly radiation, hourly sunshine duration, hourly precipitation duration, hourly precipitation amount, horizontal visibility, fog, snow, etc. Most of the values are continuous. We discretized the data and then converted the database into a temporal one, in a form similar to Table 1.

Figure 13 compares the execution times of the two algorithms on the real dataset for different minimum support degrees. From this figure, we can see that the DTP algorithm is more efficient than the Tree algorithm, especially at the lower support levels.

Figure 13: Execution times of the Tree and DTP algorithms on the weather dataset (time in seconds vs. minimum support).

Figure 14: The number of valid DTPs on the weather dataset at various support degrees.

Figure 14 shows the number of valid DTPs at various support degrees. Some interesting results were generated by applying the algorithm. First, we discovered significant temporal relationships between wind and rain, between humidity and radiation, and among temperature, radiation and sunshine duration. For example, in terms of wind and rain, strong wind often arose before rain and lasted until or beyond the end of the rain (rain during strong wind).
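The discretize-and-convert step mentioned above can be sketched as follows; the bin boundaries, state labels and helper names are illustrative assumptions, not the paper's actual coding.

```python
def discretize(value, bins):
    # Map a continuous reading to a state label using (upper_bound, label) bins.
    for bound, label in bins:
        if value < bound:
            return label
    return bins[-1][1]

def to_events(hourly, bins):
    """Merge consecutive hours that discretize to the same state into one
    (state, start_hour, end_hour) record, yielding a temporal database in
    the style of Table 1. Bins and labels are assumptions for illustration."""
    states = [discretize(v, bins) for v in hourly]
    events, start = [], 0
    for h in range(1, len(states) + 1):
        if h == len(states) or states[h] != states[start]:
            events.append((states[start], start, h - 1))
            start = h
    return events

wind_bins = [(5, "calm"), (10, "breeze"), (float("inf"), "strong_wind")]
hourly_wind = [3, 4, 7, 8, 12, 13, 11, 6]
print(to_events(hourly_wind, wind_bins))
# [('calm', 0, 1), ('breeze', 2, 3), ('strong_wind', 4, 6), ('breeze', 7, 7)]
```

Applying such a conversion per attribute produces interval-based events on which the during relationships between, e.g., rain and strong wind can be mined.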
Another example is that horizontal visibility was usually poor during times of heavy fog or snow (weak horizontal visibility during heavy fog or snow). Second, some more complex rules were also discovered. For instance, when the sunshine duration shortened, the global hourly radiation decreased and the horizontal visibility worsened. In fact, before the sunshine duration reached zero and night fell, the temperature had already begun to decrease. This rule can be expressed as weak horizontal visibility during no global hourly radiation during no sunshine duration during very low temperature.

5 Conclusions and future work

In this paper, we have studied the problem of discovering during-temporal patterns between events and proposed the DTP algorithm. By analyzing the properties of the during relationship, we have developed an optimization technique with pruning strategies that retrieves the patterns with minimal database scanning. The experimental results illustrate the effectiveness and efficiency of the algorithm. An ongoing effort centers on extending the algorithm to dynamic temporal databases, in light of the fact that the content of a TDB keeps growing and the discovered patterns need to be maintained periodically over time. While there are studies on mining dynamic databases and maintaining association rules [1,2,4,5,15], our focus is on the maintenance of temporal patterns, so as to reduce the overhead of rediscovering patterns in the presence of data updates.

References

[1] A. Veloso, W. Meira Jr., M. B. De Carvalho, S. Parthasarathy, and M. Zaki. Parallel, Incremental and Interactive Mining for Frequent Itemsets in Evolving Databases. In Proceedings of the Sixth SIAM Workshop on High Performance Data Mining, May 2003.

[2] C. Jin, W. Qian, C. Sha, J. Yu, and A. Zhou. Dynamically Maintaining Frequent Items over a Data Stream.
In Proceedings of the 12th ACM CIKM International Conference on Information and Knowledge Management, pages 287-294, 2003.

[3] C. P. Rainsford and J. F. Roddick. Adding Temporal Semantics to Association Rules. In Proceedings of the 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 504-509, 1999.

[4] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong. Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. In Proceedings of the International Conference on Data Engineering (ICDE'96), pages 106-114, 1996.

[5] D. W. Cheung, S. D. Lee, and B. Kao. A General Incremental Technique for Maintaining Discovered Association Rules. In Proceedings of the Fifth International Conference on Database Systems for Advanced Applications, 1997.

[6] F. Höppner. Discovery of Temporal Patterns: Learning Rules about the Qualitative Behaviour of Time Series. In PKDD'01, number 2168 of LNAI, pages 192-203, Freiburg, Germany, 2001.

[7] G. Chen, J. Ai, and W. Yu. Discovering Temporal Association Rules for Time-Lag Data. In Proceedings of the International Conference on E-Business (ICEB2002), Beijing, May 2002.

[8] G. Das, K. I. Lin, and H. Mannila. Rule Discovery from Time Series. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1998.

[9] J. Chang and W. Lee. Finding Recent Frequent Itemsets Adaptively over Online Data Streams. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 487-492, 2003.

[10] J. F. Allen. Maintaining Knowledge about Temporal Intervals. Communications of the ACM, 26(11):832-843, November 1983.

[11] M. J. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 42:31-60, 2001.

[12] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, September 1994.
[13] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proceedings of the International Conference on Data Engineering (ICDE), Taiwan, March 1995.

[14] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, DC, May 1993.

[15] Y. Chi, H. J. Wang, P. S. Yu, and R. R. Muntz. Catch the Moment: Maintaining Closed Frequent Itemsets. http://citeseer.ist.psu.edu, 2004.

Figure 1: The examples of the sets g and h
(a) g(a1): 1 (1,20); 2 (22,28); 3 (30,40). h(a1) = {a2, a3, a4, a6}
(b) g(a2): 1 (2,8). h(a2) = {a4}
(c) g(a3): 1 (1,4); 2 (10,13); 3 (23,28); 4 (30,38). h(a3) = {a4, a6}
(d) g(a4): 1 (5,7); 2 (25,27); 3 (34,38). h(a4) = {a6}
(e) g(a5): 1 (25,35). h(a5) = {a4, a6}
(f) g(a6): 1 (25,26); 2 (37,37). h(a6) = ∅
(g) g(a3 ⇒d a1): 1 (1,4), (10,13); 2 (23,28); 3 (30,38). h = {a4, a6}
(h) g(a4 ⇒d a1): 1 (5,7); 2 (25,27); 3 (34,38). h = {a3, a6}

Figure 2: A counterexample for pattern transitivity

Figure 3: Trees of frequent DTP1

FDTP0.Gen()
1. FDTP0 = ∅;
2. for all ai ∈ A do
3.   if |g(ai)| ≥ minsupport then
4.     FDTP0 = FDTP0 ∪ {ai};
5.   end if
6. end for

Figure 4: The procedure used for FDTP0