Dynamic Pattern Mining: An Incremental Data Clustering Approach
Seokkyung Chung and Dennis McLeod
Journal on Data Semantics II, 2005. DOI: 10.1007/978-3-540-30567-5_4

Abstract. We propose a mining framework that supports the identification of useful patterns based on incremental data clustering. Given the popularity of Web news services, we focus our attention on mining news streams. News articles are retrieved from Web news services and processed by data mining tools to produce useful higher-level knowledge, which is stored in a content description database. Instead of interacting with a Web news service directly, an information delivery agent can exploit the knowledge in the database to present an answer in response to a user request. A key challenge within news repository management is the high rate of document insertion. To address this problem, we present a sophisticated incremental hierarchical document clustering algorithm using a neighborhood search. The novelty of the proposed algorithm is its ability to identify meaningful patterns (e.g., news events and news topics) while reducing the amount of computation by maintaining the cluster structure incrementally. In addition, to overcome the lack of topical relations in conceptual ontologies, we propose a topic ontology learning framework that utilizes the obtained document hierarchy. Experimental results demonstrate that the proposed clustering algorithm produces high-quality clusters, and that a topic ontology provides interpretations of news topics at different levels of abstraction.

With the rapid growth of the World Wide Web, Internet users now experience overwhelming quantities of online information. Since manually analyzing the data is nearly impossible, the analysis should be performed by automatic data mining techniques that fulfill users' information needs quickly. On most Web pages, vast amounts of useful knowledge are embedded in text. Given the large size of such text datasets, mining tools that organize them into structured knowledge enhance efficient document access. This facilitates information search and, at the same time, provides an efficient framework for document repository management as the number of documents becomes extremely large. Because the Web has become a vehicle for the distribution of information, many news organizations provide newswire services through the Internet. Given this popularity of Web news services, we have focused our attention on mining patterns from news streams.

The simplest document access method within Web news services is keyword-based retrieval. Although this method seems effective, it has at least three drawbacks. First, if a user chooses irrelevant keywords (due to broad and vague information needs or unfamiliarity with the domain of interest), retrieval accuracy will be degraded. Second, since keyword-based retrieval relies on the syntactic properties of information (e.g., keyword counting), the semantic gap cannot be overcome. Third, only expected information can be retrieved, since the specified keywords are generated from the user's knowledge space. Thus, if users are unaware of the airplane crash that occurred yesterday, they cannot issue a query about that accident even though they might be interested.
The first two drawbacks stated above have been addressed by query expansion based on domain-independent ontologies [47]. However, it is well known that this approach leads to a degradation of precision: given that the terms introduced by term expansion may have more than one meaning, using additional terms can improve recall but decrease precision. Exploiting a manually developed ontology with a controlled vocabulary is helpful in this situation [27, 28, 29]. However, although ontology-authoring tools have been developed over the past decades, manually constructing ontologies whenever new domains are encountered is an error-prone and time-consuming process. Therefore, the integration of knowledge acquisition with data mining, which is referred to as ontology learning, becomes essential [32].

In this paper, we propose a mining framework that supports the identification of meaningful patterns (e.g., topical relations, topics, and events that are instances of topics) from news stream data. To build a novel framework for an intelligent news database management and navigation scheme, we utilize techniques from information retrieval, data mining, machine learning, and natural language processing. To facilitate information navigation and search on a news database, we first identify three key problems.

1. Vague information needs. Sometimes, defining keywords for a search is not an easy task, especially when a user has vague information needs. Thus, a reasonable starting point should be provided to assist the user.

2. Lack of topical relations in concept-based ontologies. In order to achieve semantically rich information retrieval, an ontology-based approach can be employed. However, as discussed in Agirre et al. [2], one of the main problems with concept-based ontologies is that topically related concepts and terms are not explicitly linked. That is, there is no relation between term pairs such as court and attorney, or kidnap and police. Thus, concept-based ontologies have a limitation in supporting a topical search. For example, consider the Sports domain ontology that we developed in our previous work [27, 28, 29]. In this ontology, "Kobe Bryant", who is an NBA basketball player, is related to terms/concepts in the Sports domain. However, for the purpose of query expansion, "Kobe Bryant" needs to be connected with a "court trial" concept if a user has "Kobe Bryant court trial" in mind. Therefore, it is essential to provide explicit links between topically related concepts/terms.

3. High rate of document insertion. As several hundred news articles are published every day at a single Web news site, triggering the whole mining process whenever a document is inserted into the database is computationally impractical. To cope with such a dynamic environment, efficient incremental data mining tools need to be developed.

The first of the three problems can be approached using clustering. A collection of documents is easy to skim if similar articles are grouped together. If the news articles are hierarchically classified according to their topics, then a query can be formulated while a user navigates a cluster hierarchy. Moreover, clustering can be used to identify and deal with near-duplicate articles. That is, when news feeds repeat stories with minor changes from hour to hour, presenting only the most recent articles is probably sufficient. To remedy the second problem, we present a topic ontology, which is defined as a collection of concepts and relations.
In a topic ontology, a concept is defined as a set of terms that characterize a topic. We define two generic kinds of relations, generalization and specialization. The former can be used when a query is generalized to increase recall or broaden the search. On the other hand, the latter is useful when refining the query. For example, when a user is interested in someone's court trial but cannot remember the name of the person, specialization can be used to narrow down the search. To address the third problem, we propose a sophisticated incremental hierarchical document clustering algorithm using a neighborhood search. The novelty of the proposed algorithm is its ability to identify news event clusters as well as news topic clusters while reducing the amount of computation by maintaining the cluster structure incrementally. Learning topic ontologies can then be performed on the obtained document hierarchy.

Figure 1 illustrates the main parts of the proposed framework. In the information gathering stage, a Web crawler retrieves a set of news documents from a news Web site (e.g., CNN). Developing an intelligent Web crawler is another research area, and it is not our main focus. Hence, we implement a simple Web spider, which downloads news articles from a news Web site on a daily basis. The retrieved documents are processed by data mining tools to produce useful higher-level knowledge (e.g., a document hierarchy, a topic ontology, etc.), which is stored in a content description database. Instead of interacting with a Web news service directly, an information delivery agent can exploit the knowledge in the database to present an answer in response to a user request.

The main contributions of our work are twofold. First, despite the huge body of research on document clustering [33, 30, 22, 31, 52], little work has been conducted in the context of incremental hierarchical news document clustering. To address the problem of frequent document insertions into a database, we have developed an incremental hierarchical clustering algorithm using a neighborhood search. Since the algorithm produces a document cluster hierarchy, it can identify event-level clusters as well as topic-level clusters. Second, to address the lack of topical relations in concept-based ontologies, we propose a topic ontology learning framework, which can interpret news topics at multiple levels of abstraction.

The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 discusses the information preprocessing step. In Section 4, we explain the information analysis component, which is the key focus of this paper. Section 5 presents experimental results. Finally, we conclude the paper and provide our future plans in Section 6.

The research areas most relevant to our work are Topic Detection and Tracking (TDT) and document clustering. Section 2.1 presents a brief overview of TDT work. In Section 2.2, a survey of previous document clustering work is provided. Finally, Section 2.3 introduces previous work on intelligent news services that utilize document clustering and TDT.

Over the past six years, the information retrieval community has developed a new research area called Topic Detection and Tracking (TDT) [4, 5, 10, 48, 49]. The main goal of TDT is to detect the occurrence of a novel event in a stream of news stories, and to track known events. In particular, there are three major components in TDT.

1. Story segmentation.
It segments a news stream (e.g., including transcribed speech) into topically cohesive stories. Since online Web news (in HTML format) is supplied in segmented form, this task only applies to audio or TV news.

2. First story detection. It detects whether an incoming news story belongs to an existing topic or a new topic (i.e., whether it reports a novel event).

3. Topic tracking. It tracks events of interest based on sample news stories, associating incoming news stories with related stories that were already discussed before. It can also be asked to monitor the news stream for further stories on the same topic.

In Allan et al. [4], the notion of an event is first defined. An event is defined as "some unique thing that happens at some point in time". Hence, an event is different from a topic. For example, "airplane crash" is a topic while "Chinese airplane crash in Korea in April 2002" is an event. Thus, there exists a many-to-one mapping between events and topics (i.e., multiple events can belong to the same topic). Note that it is important to identify events as well as topics. Although the user may not be interested in a flood topic in general, she may be interested in documents about a flood event in her home town. Thus, a news recommendation system must be able to distinguish different events within the same topic.

Yang et al. introduced an important property of news events, referred to as temporal locality [48]. That is, news articles discussing the same event tend to be temporally proximate. In addition, most events (e.g., flood, earthquake, wildfire, kidnapping) have short duration (e.g., 1 week to 1 month). They exploited these heuristics when computing the similarity between two news articles.

The most popular method in TDT is to use a simple incremental clustering algorithm, which is shown in Figure 2. Our work starts by addressing the limitations of this algorithm.

1. Initially, only one news article is available, and it forms a singleton cluster.
2. For an incoming document (d*), compute the similarity between d* and the pre-generated clusters. The similarity is computed as the distance between d* and the representative of each cluster.
3. Select the cluster (C_i) that has the maximum proximity to d*.
4. If the similarity between d* and C_i exceeds the pre-defined threshold, then all documents in C_i are considered as related stories to d* (topic tracking), and d* is assigned to C_i. Otherwise, d* is considered as a novel story (first story detection), and a new cluster for d* is created.
5. Repeat steps 2-4 whenever a new document appears in the stream.

Fig. 2. The simple incremental clustering algorithm used in TDT

In this section, we classify the widely used document clustering algorithms into two categories (partition-based clustering and hierarchical clustering), and provide a concise overview of each of them.

Partition-Based Clustering. Partition-based clustering decomposes a collection of documents into a set of disjoint clusters that is optimal with respect to some pre-defined criterion function. Typical methods in this category include center-based clustering, the Gaussian Mixture Model, etc. Center-based algorithms identify the clusters by partitioning the entire dataset into a pre-determined number of clusters (e.g., K-means clustering) or an automatically derived number of clusters (e.g., X-means clustering) [9, 23, 13, 16, 30, 37, 39]. The most popular and best understood clustering algorithm is K-means clustering [13]. The K-means algorithm is a simple but powerful iterative clustering method that partitions a dataset into K disjoint clusters, where K must be determined beforehand.
The idea of the algorithm is to assign points to clusters such that the sum of the squared distances of points to the center of their assigned cluster is minimized. While the K-means clustering approach works in a metric space, the medoid-based method works in a similarity space [23, 37]. It uses medoids (representative sample objects) instead of means (i.e., the centers of clusters), such that the sum of the distances of points to their closest medoid is minimized.

Although center-based clustering algorithms have been widely used in document clustering, they have at least five serious drawbacks. First, in many center-based clustering algorithms, the number of clusters (K) needs to be determined beforehand. Second, the algorithm is sensitive to the initial seed selection. Depending on the initial points, it is susceptible to a local optimum. Third, it can model only a spherical (K-means) or ellipsoidal (K-medoid) shape of clusters. Thus, non-convex shapes of clusters cannot be modeled in center-based clustering. Fourth, it is sensitive to outliers, since a small number of outliers can substantially influence the mean value. Finally, due to the iterative scheme used to produce clustering results, it is not suitable for incremental datasets.

Hierarchical Agglomerative Clustering. Hierarchical agglomerative clustering (HAC) finds clusters by initially assigning each document to its own cluster and then repeatedly merging pairs of clusters until a certain stopping condition is met [13, 18, 26, 19, 52]. Consequently, its result is in the form of a tree, which is referred to as a dendrogram. A dendrogram is represented as a tree with numeric levels associated with its branches. The main advantage of HAC lies in its ability to provide a view of the data at multiple levels of abstraction. However, since HAC builds a dendrogram, a user must determine where to cut the dendrogram to produce actual clusters. This step is usually done by human visual inspection, which is a time-consuming and subjective process. Moreover, the computational complexity of HAC is higher than that of partition-based clustering. In partition-based clustering, the computational complexity is O(nKI), where n is the number of documents, K is the number of clusters, and I is the number of iterations. In contrast, HAC takes O(n^3) if pairwise similarities between clusters are changed when two clusters are merged; however, the complexity can be reduced to O(n^2 log n) if we utilize a priority queue [52].

One of the most successful intelligent news services is NewsBlaster [34]. The basic idea of NewsBlaster is to group the articles on the same story using clustering, and to present one story using multi-document summarization. Thus, the main goal of NewsBlaster is similar to ours in that both aim to provide intelligent news analysis/delivery tools. However, the underlying methodology is different. For example, with respect to clustering, NewsBlaster is based on the clustering algorithm in Hatzivassiloglou et al. [22]. The main contribution of their work is to augment the document representation using linguistic features. However, rather than developing their own clustering algorithm, they used conventional HAC, which has the drawbacks discussed in Section 2.2. Recent attempts present other intelligent news services like NewsInEssence [41, 42] or QCS (Query, Cluster, Summarize) [14].
Both services utilize a similar approach to NewsBlaster in that they separate the retrieved documents into topic clusters and create a single summary for each topic cluster. However, their main focus does not lie in developing a novel clustering algorithm. For example, QCS utilizes generalized spherical K-means clustering, whose limitations have been addressed in Section 2.2. Therefore, it is worthwhile to develop a sophisticated document clustering algorithm that can overcome the drawbacks of previous document clustering work. In particular, the developed algorithm must address the special requirements of news clustering, such as a high rate of document insertion and the ability to identify event-level clusters as well as topic-level clusters.

The information preprocessing step extracts meaningful information from unstructured text data and transforms it into structured knowledge. As shown in Figure 1, this step is composed of the following standard IR tools.

- HTML preprocessing. Since downloaded news articles are in HTML format, we remove irrelevant HTML tags from each article and extract meaningful information.
- Tokenization. Its main task is to identify the boundaries of the terms.
- Stemming. There can be different forms of the same term (e.g., students and student, go and went). These different forms of the same term need to be converted to their roots. Toward this end, instead of solely relying on the Porter stemmer [40], we combine the Porter stemmer with a lexical database [35] in order to deal with irregular plurals and tenses.
- Stopword removal. Stopwords are terms that occur frequently in the text but do not carry useful information. For example, have, did, and get are not meaningful. Removing such stopwords provides us with a dimensionality reduction effect. We employ the stopword list that was used in the SMART project [44].

After preprocessing, a document is represented as a vector in an n-dimensional vector space [44]. A simple way to do this is to employ the Bag-Of-Words (BOW) approach. That is, all content-bearing terms in the document are kept, and any structure of the text or the term sequence is ignored. Thus, each term is treated as a feature, and each document is represented as a vector of weighted term frequencies in this feature space. There are several ways to determine the weight of a term in a document. However, most methods are based on the following two heuristics.

- Important terms occur more frequently within a document than unimportant terms do.
- The more times a term occurs throughout all documents, the weaker its discriminating power becomes.

The term frequency (TF) is based on the first heuristic. In addition, TF can be normalized to reflect different document lengths. Let freq_ij be the number of occurrences of t_i in a document j, and l_j be the length of document j. Then, the term frequency tf_ij of t_i in document j is defined as tf_ij = freq_ij / l_j.

(Table 1, referenced in Section 4.1, is a document×term matrix over the terms kidnap, abduct, child, boy, police, search, missing, investigate, suspect, return, and home for documents d1-d3.)

The document frequency (DF) of a term (the percentage of the documents that contain the term) is based on the second heuristic. A combination of TF and DF yields the TF-IDF ranking scheme, which is defined as w_ij = tf_ij × log(n / n_i), where w_ij is the weight of t_i in a document j, n is the total number of documents in the collection, and n_i is the number of documents where t_i occurs at least once. The above ranking scheme is referred to as static TF-IDF since it is based on a static document collection.
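To make the weighting scheme concrete, the following is a minimal sketch of the static TF-IDF computation described above, assuming documents that have already been tokenized, stemmed, and stripped of stopwords; the function and variable names are illustrative and not part of the original system.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Static TF-IDF over a fixed collection of token lists.

    tf_ij = freq_ij / l_j            (length-normalized term frequency)
    w_ij  = tf_ij * log(n / n_i)     (down-weight terms occurring in many documents)
    """
    n = len(docs)
    df = Counter()                   # n_i: number of documents containing term t_i
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        length = len(tokens)
        counts = Counter(tokens)
        vectors.append({term: (freq / length) * math.log(n / df[term])
                        for term, freq in counts.items()})
    return vectors

# Toy usage with already-preprocessed documents.
docs = [["kidnap", "boy", "police"],
        ["police", "search", "missing", "boy"],
        ["abduct", "child", "suspect"]]
print(tfidf_vectors(docs)[0])
```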
In our setting, however, documents are inserted incrementally. Therefore, IDF values are initialized using a sufficient amount of documents (i.e., the document frequencies are generated from a training corpus), and IDF is then incrementally updated as subsequent documents are processed. In particular, we employ the incremental update of IDF values proposed by Yang et al. [48]. Finally, to measure the closeness between two documents, we use the Cosine metric, which measures the similarity of two vectors according to the angle between them [44]. Thus, vectors pointing in similar directions are considered as representing similar concepts. The cosine of the angle between two m-dimensional vectors x and y is defined by Cosine(x, y) = (sum_{i=1..m} x_i y_i) / (sqrt(sum_{i=1..m} x_i^2) × sqrt(sum_{i=1..m} y_i^2)).

This section presents the information analysis component of Figure 1. Section 4.1 illustrates a motivating example for the proposed incremental clustering algorithm. In Section 4.2, a non-hierarchical incremental document clustering algorithm using a neighborhood search is presented. Section 4.3 explains how to extend the algorithm into a hierarchical version. Finally, Section 4.4 shows how to build a topic ontology based on the obtained document hierarchy.

As a simple example, consider three documents d1, d2, and d3 whose document×term matrix is shown in Table 1. Although d1 and d2 are similar, and d2 and d3 are similar, d1 and d3 are completely dissimilar since they share no terms. Consequently, the transitivity relation does not hold. Why does this happen? We provide explanations to this question from three different perspectives.

1. Fuzzy similarity relation. As discussed in fuzzy theory [50], the similarity relation does not satisfy transitivity. To make it satisfy transitivity, a fuzzy transitive closure approach was introduced. However, this approach is not scalable with the number of data points.

2. Inherent characteristic of news. As discussed in Allan et al. [4], an event is considered as an object that evolves through some time interval (i.e., the content of news articles on the same story changes over time). Hence, although documents belong to the same event, the terms the documents use can differ if they discuss different aspects of the event.

3. Language semantics. The diverse term usage for the same meaning (e.g., kidnap and abduct) needs to be considered. Using only a syntactic property (e.g., keyword counting) aggravates the problem.

Transitivity is related to the document insertion order in incremental clustering. Consider the TDT incremental clustering algorithm in Figure 2. If the order of document insertion is "d1 d2 d3", then one cluster ({d1, d2, d3}) is obtained. However, if the order is "d1 d3 d2", then two clusters ({d1, d2} and {d3}) are obtained. Although the order of document insertion is fixed (because a document is inserted whenever it is published), it is undesirable if the clustering result significantly depends on the insertion order. Regardless of the input order, a successful algorithm should produce a single cluster, {d1, d2, d3}.

Before we present a detailed discussion of the proposed clustering algorithm, definitions of basic terminology are provided first. In addition, Table 2 shows the notation that will be used throughout this paper. If Similarity(d_i, d_j) ≥ ε, then a document d_i is referred to as similar to a document d_j. The ε-neighborhood of a document d_i, denoted N(d_i), is defined as the set of documents that are similar to d_i.
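The Cosine metric and the ε-neighborhood just defined can be sketched as follows over sparse term-weight dictionaries; a brute-force scan is used here for clarity, whereas the algorithm described in the next section relies on an inverted index, and all names are illustrative.

```python
import math

def cosine(x, y):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

def epsilon_neighborhood(doc_vec, all_vecs, eps):
    """Indices of documents whose cosine similarity to doc_vec is at least eps."""
    return [i for i, v in enumerate(all_vecs) if cosine(doc_vec, v) >= eps]

# Toy usage: d1 and d2 overlap, d1 and d3 share no terms (cf. the motivating example).
d1 = {"kidnap": 0.5, "boy": 0.4, "police": 0.3}
d2 = {"police": 0.4, "search": 0.5, "boy": 0.2}
d3 = {"abduct": 0.6, "child": 0.5, "suspect": 0.4}
print(epsilon_neighborhood(d1, [d1, d2, d3], eps=0.2))   # -> [0, 1]
```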
In this paper, ε-neighborhood and neighborhood are used interchangeably. The proposed clustering algorithm is based on the observation that a property of an object is influenced by the attributes of its neighbors. Examples of such attributes are the properties of the neighbors, or the percentage of neighbors that fulfill a certain constraint. The above idea can be translated into a clustering perspective as follows: the cluster label of an object depends on the cluster labels of its neighbors. Recent data mining research has proposed density-based clustering such as Shared Nearest Neighbors (SNN) clustering [15, 24]. In SNN, the similarity between two objects is defined as the number of k-nearest neighbors they share. Thus, the basic motivation of SNN clustering is similar to ours; however, as we will explain in Section 4.3, the detailed approach is completely different.

Figure 3 shows the proposed incremental clustering algorithm. Initially, we assume that only one document is available. Thus, this document itself forms a singleton cluster. Adding a new document to the existing cluster structure proceeds in three phases: neighborhood search, identification of an appropriate cluster for the new document, and re-clustering based on local information. In what follows, these three steps are explained in detail.

Step 1. Initialization: Document d0 forms a singleton cluster C0.
Step 2. Neighborhood search: Given a new incoming document d*, obtain N(d*) by performing a neighborhood search.
Step 3. Identification of a cluster that can host the new document: Compute the similarity between d* and each cluster C_i ∈ C_d*. Based on the obtained values, if there exists a cluster C_j that can host d*, then add d* to that cluster and update DCF_j. Otherwise, create a new cluster for d* and create a corresponding DCF vector for this new cluster.
Step 4. Re-clustering: Let C_j be the cluster that hosts d*. If C_j is not a singleton cluster, then trigger the merge operation.
Step 5. Repeat Steps 2-4 whenever a new document appears in the stream.

Fig. 3. The incremental non-hierarchical document clustering algorithm

Achieving an efficient neighborhood search is important in the proposed clustering algorithm. Since we deal with documents in this research, we can rely on an inverted index for the purpose of the neighborhood search. In an inverted index [44], the index associates a set of documents with terms. That is, for each term t_i, we build a document list that contains all documents containing t_i. Given that a document d_i is composed of t_1, ..., t_k, to identify documents similar to d_i, instead of checking the whole document dataset, it is sufficient to examine the documents that contain any t_i. Thus, given a document d_i, identifying the neighborhood can be accomplished in O(|D_di|).

To assign an incoming document d* to the existing cluster structure, the cluster that can host d* needs to be identified using the neighborhood of d*. If there exists such a cluster, then d* is assigned to it. Otherwise, d* is identified as an outlier and forms a singleton cluster. Toward this end, the set of candidate clusters (C_d*) is identified by selecting the clusters that contain any document belonging to N(d*). Subsequently, the cluster that can host the new document is identified using one of the following three methods.

1. Considering the size of the overlapped region. Select the cluster that has the largest number of its members in N(d*).
This approach only considers the number of documents in the overlapped region, and ignores the proximity between the neighbors and d*.

2. Exploiting weighted voting. The similarities between each neighbor of d* and the candidate clusters are measured. Then, the similarity values are aggregated using weighted voting. That is, the weight is determined by the proximity of a neighbor to the new document. Thus, each neighbor can vote for its class with a weight proportional to its proximity to the new document. Let W_j be a weight representing the proximity of n_j to the new document (e.g., the Cosine similarity between n_j and the new document). Then, the most relevant cluster C* is selected by Equation (4), which aggregates the neighbors' votes weighted by W_j. Equation (4) mitigates the problem of the previous method by considering the weight W_j. Moreover, it still favors clusters with a large overlap with N(d*), since the weighted similarities are summed.

3. Exploiting a signature vector. While the weighted voting approach is effective, it is computationally inefficient since the similarities between all neighbors and all candidate clusters need to be computed. Instead, we employ a simple but effective approach, which measures the similarity between the signature vector of the neighborhood and that of each candidate cluster. The signature vector should be composed of terms that reflect the main characteristics of the documents within a set. For example, the center of a cluster would be a signature vector for the cluster. For each term t_i in a set A_j (e.g., a cluster or a neighborhood), we compute the weight for the signature vector using the following formula:

S_j(t_i) = (df_ij / |A_j|) × sum_{d_k ∈ A_j} w_ik,   (5)

where df_ij is the number of documents in A_j that contain t_i, |A_j| is the number of documents in A_j, and w_ik is the weight of t_i in document d_k. In Equation (5), the first factor measures the normalized document frequency within the set, and the second factor measures the sum of the weights of the term over the whole set of documents.

Next, the notion of the Document Cluster Feature (DCF) vector is presented. The DCF vector of a cluster C_i is defined as DCF_i = (N_i, DF_i, W_i), where N_i is the number of documents in C_i, DF_i is a document frequency vector for C_i, and W_i is a weight sum vector for C_i. DCF vectors are additive: given DCF_i = (N_i, DF_i, W_i) and DCF_j = (N_j, DF_j, W_j) for two disjoint clusters, the DCF vector of their union is (N_i + N_j, DF_i + DF_j, W_i + W_j). Proof. It is straightforward by simple linear algebra.

To compute the similarity between a document and a cluster, we only need the signature vectors of the cluster and the document. However, the signature vector does not need to be recomputed as a new document is inserted into the cluster. This property is based on the additivity of DCF. Since S_i (a signature vector for C_i) can be directly reconstructed from DCF_i, instead of recomputing S_i whenever a new document is inserted into C_i, only DCF_i needs to be updated using the additivity of DCF. In sum, if there exists a cluster C_i that can host the new document, then the new document is assigned to C_i and DCF_i is updated. Otherwise, a new cluster for d* and a DCF vector for this cluster are created.

If d* is assigned to C_i, then a merge operation needs to be triggered. This is based on a locality assumption [43]. Instead of re-clustering the whole dataset, we only need to focus on the clusters that are affected by the new document. That is, a new document is placed in the cluster, and a sequence of cluster re-structuring processes is performed only in regions that have been affected by the new document. Figure 4 illustrates this idea.

(Figure 4. Re-clustering based on local information. Step 1: check whether d1 can be added to cluster 1. Step 2: add d1 to cluster 1. Step 3: merge cluster 1 and cluster 2 if they satisfy the merge constraint.)
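The DCF bookkeeping, its additivity, and the signature-vector comparison described above could be organized roughly as in the sketch below; the signature weighting follows the reconstruction of Equation (5) given here, the host-cluster test compares the neighborhood signature with each candidate cluster signature via cosine similarity, and all class and function names are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter

class DCF:
    """Document Cluster Feature vector (N, DF, W) for one cluster.

    N  : number of documents in the cluster
    DF : per-term document frequency within the cluster
    W  : per-term sum of document weights over the cluster
    """
    def __init__(self):
        self.n = 0
        self.df = Counter()
        self.w = Counter()

    def add(self, doc_vec):
        """Insert one document; only counters are updated (additivity of DCF)."""
        self.n += 1
        for term, weight in doc_vec.items():
            self.df[term] += 1
            self.w[term] += weight

    def merge(self, other):
        """DCF_i + DCF_j = (N_i + N_j, DF_i + DF_j, W_i + W_j)."""
        self.n += other.n
        self.df += other.df
        self.w += other.w

    def signature(self):
        """Signature vector: (normalized in-set document frequency) * (weight sum)."""
        return {t: (self.df[t] / self.n) * self.w[t] for t in self.w}

def cosine(x, y):
    dot = sum(v * y.get(t, 0.0) for t, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def host_cluster(neighborhood_vecs, candidate_dcfs, threshold):
    """Pick the candidate cluster whose signature is closest to the neighborhood signature."""
    hood = DCF()
    for vec in neighborhood_vecs:
        hood.add(vec)
    sig = hood.signature()
    best, best_sim = None, threshold
    for cid, dcf in candidate_dcfs.items():
        sim = cosine(sig, dcf.signature())
        if sim >= best_sim:
            best, best_sim = cid, sim
    return best  # None means d* starts a new singleton cluster
```

Because add() and merge() only touch the counters, inserting a document or merging two clusters never requires revisiting the documents themselves, which is exactly what the additivity property buys.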
As shown in Figure 4, clusters that contain any document belonging to the neighborhood of a new document need to be considered.

When the algorithm in Figure 3 is applied to a news article dataset, different event clusters can be obtained. Since our goal is to generate a cluster hierarchy, all event clusters on the same topic need to be combined together. For example, to reflect a court trial topic, all court trial event clusters at level 1 should be merged into a single cluster at level 2. However, in many cases, this becomes a difficult task due to the extremely high term frequency of named entities within a document. Named entities are people/organizations, times/dates, and locations, which play a key role in defining the "who", "when", and "where" of a news event. Thus, although two different event clusters belong to the same topic, the similarity between the clusters can become extremely low; consequently, the task of merging different event clusters on the same topic is not simple. To address the above problem, we illustrate how to extend the algorithm in Figure 3 into a hierarchical version. Table 3 summarizes the notation that will be used in this section. Before presenting a detailed discussion, the necessary terminology is first defined.

A specific term for a cluster C_i is a term that frequently occurs within C_i but rarely occurs outside of C_i. The collection of specific terms for C_i is denoted by ST_Ci.

Let df_i(T) be the document frequency of a term t_i in the whole document dataset at time T. Then, the document frequency of t_i at time T+1 is defined as follows:

df_i(T+1) = df_i(T) + 1 if the document inserted at time T+1 contains t_i, and df_i(T) otherwise.   (6)

Let df_ij^IN(T) be the document frequency of a term t_i within a cluster C_j at time T. Then df_ij^IN(T+1) is recursively defined as follows:

df_ij^IN(T+1) = df_ij^IN(T) + 1 if the new document is inserted into C_j and contains t_i, and df_ij^IN(T) otherwise.   (7)

We denote by K(T) the number of clusters at level 1 at time T. Then, K(T+1) is defined as follows:

K(T+1) = K(T), if d* is inserted into an existing cluster at T; K(T) + 1, if d* itself forms a new cluster at T; K(T) - 1, if two clusters are merged at T.   (8)

Although df_i(T+1) − df_ij^IN could be considered for representing how much t_i occurs outside C_j at T+1, it is not sufficient if our goal is to quantify how informative t_i is for C_j. This is because the number of clusters can also affect how much t_i discriminates C_j from the other clusters. Thus, df_ij^OUT(T+1), which represents how much t_i occurs outside C_j at time T+1, is defined in Equation (9) from df_i(T+1) − df_ij^IN(T+1), taking the number of clusters K(T+1) into account. Finally, the selectivity of a term t_i for the cluster C_j at time T+1 is defined in Equation (10) from df_ij^IN(T+1) and df_ij^OUT(T+1). In sum, Equation (10) assigns more weight to terms occurring frequently within C_j and rarely outside of C_j. Therefore, a term with high selectivity for C_i can be a candidate for ST_Ci.

Based on the definition of ST, the proposed hierarchical clustering algorithm is described as follows. Clusters at level 1 are generated using the algorithm in Figure 3. If no more documents are inserted into a certain cluster at level 1 during a pre-defined time interval, then we assume that the event for the cluster has ended, and associate ST with this cluster at level 1. We then perform a neighborhood search for this cluster at level 2. Since ST reflects the most specific characteristics of the cluster, it is not helpful if two topically similar clusters (but different events) need to be merged. Hence, when we build a vector for a cluster C_i^j, the terms in ST for C_i^j are not included in the cluster vector. At this point, it is worthwhile to compare our algorithm with the SNN approach [24, 15].
The basic strategy of SNN clustering is as follows. It first constructs the nearest neighbor graph from the sparsified similarity matrix, which is obtained by keeping only the k-nearest neighbors of each entry. Next, it identifies representative points by choosing points that have high density, and removes noise points that have low density. Finally, it takes connected components of points to form clusters.

The key difference between SNN and our approach is that SNN is defined on static datasets while ours can deal with incremental datasets. The re-clustering phase and the special data structures (e.g., the DCF and signature vectors) make our algorithm more suitable for incremental clustering than SNN. The second distinction is how a neighborhood is defined. In SNN, a neighborhood is defined as a set of k-nearest neighbors, while we use the ε-neighborhood. Thus, as discussed in Han et al. [21], the neighborhood constructed from k-nearest neighbors is local in that the neighborhood is defined narrowly in dense regions and more widely in sparse regions. However, for document clustering, a global neighborhood approach produces more meaningful clusters. The third distinction is that we intend to build a cluster hierarchy incrementally. In contrast, SNN does not focus on hierarchical clustering. Finally, our algorithm can easily identify singleton clusters. This is especially important in our application, since an outlier document in a news stream may imply a valuable fact (e.g., a new event or technology that has not been mentioned in previous articles). In contrast, SNN overlooks the importance of singleton clusters.

A topic ontology is a collection of concepts and relations. One view of a concept is as a set of terms that characterize a topic. We define two generic kinds of relations, specialization and generalization. The former is useful when refining a query, while the latter can be used when generalizing a query to increase recall or broaden the search.

Table 4. Sample specific terms for the clusters at level 1. Terms in regular font denote named entities (NE); this supports the argument that NE plays a key role in defining the specific details of events.

Table 5. Sample specific terms for the clusters at level 2.
Court trial: attorney, court, defense, evidence, jury, kill, law, legal, murder, prosecutor, testify, trial
Kidnapping: abduct, disappear, enforce, family, girl, kidnap, miss, parent, police
Earthquake: body, collapse, damage, earthquake, fault, hit, injury, magnitude, quake, victim
Airplane crash: accident, air, aircraft, airline, aviate, boeing, collision, crash, dead, flight, passenger, pilot, safety, traffic, warn

Table 6. General terms for the court trial cluster 1 in Table 4.
Court trial 1: arm, arrest, camera, count, delay, drug, hill, injury, order, store, stand, target, victim

Table 4 and Table 5 illustrate sample specific terms for selected events and topics. As shown, with respect to a news event, the specific details are captured by the lower levels (e.g., level 1), while the higher levels (e.g., level 2) are more abstract. We can also generate general terms for a node, defined as follows: a general term for a cluster C_i is a term that frequently occurs within C_i and also frequently occurs outside of C_i. The collection of general terms for C_i is denoted by GT_Ci. Thus, in comparison with ST, the selectivity of GT is lower than that of ST. Together, ST and GT constitute the concepts of a topic ontology.
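Specific and general terms like those in Tables 4-6 could be extracted along the lines of the sketch below; since Equations (9) and (10) are not restated here, the scoring used is only an assumption that follows the stated intent (favor terms that are frequent inside a cluster and rare outside it, spreading the outside occurrences over the remaining clusters), and all names are illustrative.

```python
from collections import Counter

def term_scores(cluster_df, cluster_size, global_df, n_clusters):
    """Score how selective each term is for one cluster.

    cluster_df : per-term document frequency inside the cluster (df_in)
    global_df  : per-term document frequency over the whole stream (df)
    The out-of-cluster mass is spread over the remaining clusters; the exact
    formula is an assumption standing in for Equations (9) and (10).
    """
    scores = {}
    for term, df_in in cluster_df.items():
        df_out = (global_df[term] - df_in) / max(n_clusters - 1, 1)
        scores[term] = (df_in / cluster_size) / (1.0 + df_out)
    return scores

def specific_terms(scores, top_k=10):
    """Highest-selectivity terms (candidates for ST)."""
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

def general_terms(scores, cluster_df, cluster_size, top_k=10, min_support=0.5):
    """Terms frequent in the cluster but with lower selectivity (candidates for GT)."""
    frequent = [t for t in scores if cluster_df[t] / cluster_size >= min_support]
    return sorted(frequent, key=lambda t: scores[t])[:top_k]

# Toy usage: one "court trial" event cluster of 4 documents inside a 20-document stream.
cluster_df = Counter({"court": 4, "trial": 4, "jury": 3, "police": 1})
global_df = Counter({"court": 5, "trial": 5, "jury": 3, "police": 12})
s = term_scores(cluster_df, cluster_size=4, global_df=global_df, n_clusters=6)
print(specific_terms(s, top_k=3))   # -> ['court', 'trial', 'jury']
```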
Table 6 shows GT for the "court trial 1" cluster in Table 4. When the "Winona Ryder court trial" cluster (C1) is considered, ST_C1 represents the most specific information for the "Winona Ryder court trial" event, GT_C1 carries the next most specific information for the event, and the specific terms for the court trial cluster describe the general information for the event. Therefore, we can conclude that a topic ontology can characterize a news topic at multiple levels of abstraction.

Human-understandable information needs to be associated with the cluster structure such that clustering results are easily comprehensible to users. Since a topic ontology provides an interpretation of a news topic at multiple levels of detail, an important use of a topic ontology is automatic cluster labeling. In addition, a topic ontology can be effectively used for suggesting alternative queries in information retrieval. There exists research on the extraction of hierarchical relations between terms from a set of documents [17, 45] and on term associations [46]. However, our work is unique in that the topical relations are dynamically generated based on incremental hierarchical clustering rather than based on human-defined topics such as the Yahoo directory (http://www.yahoo.com).

In this section, we present experimental results that demonstrate the effectiveness of the information analysis component. Section 5.1 illustrates our experimental setup. Experimental results are presented in Section 5.2.

For the empirical evaluation of the proposed clustering algorithm, approximately 3,000 news articles downloaded from CNN (http://www.cnn.com) are used. The total numbers of topics and events used in this research are 15 and 180, respectively. Thus, the maximum possible number of clusters we can obtain (at level 1) is 180. Note that the number of documents per event ranges from 1 to 151. Table 7 illustrates sample topics and events.

The quality of a generated cluster hierarchy was determined by two metrics, precision and recall. Let T_r be the class of documents on topic/event r. Then, a cluster C_r is referred to as a topic-r cluster if and only if the majority of the subclusters of C_r belong to T_r. The precision and recall of the clustering at level i (where K_i is the number of clusters at level i) are then defined over the K_i clusters from the overlap between each cluster and its corresponding class. Thus, if there is a large topic overlap within a cluster, then the precision will drop. Precision and recall are relevant metrics in that they can measure a "meaningful theme". That is, if a cluster C is about "Turkey earthquake", then C should contain all documents about "Turkey earthquake". In addition, documents that do not talk about "Turkey earthquake" should not belong to C.

For the purpose of comparison, we decided to use K-means clustering. However, since K-means is not suitable for incremental clustering, K-means clustering was performed retrospectively on the datasets. In contrast, the proposed algorithm was tested on incremental datasets after learning the IDF values. Moreover, since we already knew the number of clusters at level 1 based on the ground-truth data, K could be fixed in advance. Furthermore, to overcome K-means' sensitivity to initial seed selection, seeds were selected with the condition that the chosen seeds are far from each other. Since we deal with document datasets, this intelligent seed selection can be easily achieved by using an inverted index.

Parameterization. The size of a neighborhood, which is determined by ε, influences the clustering results.
To observe the effect, we performed an experiment as follows. From the 3,000 documents, we organized sample datasets, each consisting of 500 documents in 50 clusters of different sizes. Then, while changing the value of ε, our clustering was conducted on the dataset. In Figure 5, the x-axis represents the value of ε, and the y-axis represents the number of clusters in the result (k1) over the number of clusters determined by the ground-truth data (k2). Thus, if the clustering algorithm guesses the exact number of clusters, then the value of y corresponds to one. As observed in Figure 5, we found the best result when ε varies between 0.1 and 0.25, i.e., the algorithm guessed the exact number of clusters. If the value of ε was too small, then the algorithm found a few large clusters. In contrast, many small clusters were identified if the value was too large. Thus, the proposed algorithm might be considered sensitive to the choice of ε. However, once the value of ε was fixed (i.e., ε = 0.2), approximately the right number of clusters was always obtained whenever we performed clustering on different datasets. Therefore, the number of clusters does not need to be given to our algorithm as an input parameter, which is a key advantage over partition-based clustering.

To illustrate a simple example of the shapes of document clusters with the same density, approximately the same number of documents were randomly chosen from two different events (a wildfire event and a court trial event), and the document×term matrix of this dataset was decomposed by Singular Value Decomposition (SVD). By keeping the two largest singular values, the dataset could be projected onto a 2D space corresponding to the principal components. Figure 6 illustrates the plot of the documents. As shown, since the shape of a document cluster can be arbitrary, the shape cannot be assumed in advance (e.g., a hyper-sphere as in K-means). To test the ability to identify different shapes of clusters, we organized datasets where each cluster consists of approximately the same number of documents (but, as illustrated in Figure 6, each document cluster has a different shape). As shown in Figure 7, the proposed algorithm outperforms the modified K-means algorithm in terms of precision and recall. This is because the proposed algorithm measures the similarity between a cluster and the neighborhood of a document, while K-means clustering measures the similarity between a cluster and a document. Note that a 10% increase in accuracy is significant considering that we provided the correct number of clusters (K) and chose the best initial seed points for K-means.

As illustrated in Figure 7, the recall of our algorithm decreases as the level increases. The main reason for this poor recall at level 2 is related to the characteristics of news articles. As discussed, a named entity (NE) plays a key role in defining the who/when/where of an event. Hence, NE contributes to high-quality clustering at level 1. However, at level 2, since topical terms are not as strong as named entities, it was not easy to merge different event clusters (belonging to the same topic) into the same topical cluster.

Since cluster sizes can be arbitrary, clustering algorithms must be able to identify clusters with a wide variance in size. To test the ability to identify clusters with different densities, we organized datasets where each dataset consists of document clusters with diverse densities.
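Before turning to the results of this experiment (Figure 8), note that the two-dimensional view of document clusters in Figure 6 can be reproduced, under the setup described above, by a truncated SVD of the document×term matrix that keeps only the two largest singular values; the sketch below assumes numpy and is illustrative rather than the authors' code.

```python
import numpy as np

def project_2d(doc_term_matrix):
    """Project documents onto the first two principal components via truncated SVD.

    doc_term_matrix : (n_docs, n_terms) array of TF-IDF weights.
    Returns an (n_docs, 2) array suitable for a scatter plot like Figure 6.
    """
    # Full SVD; for large matrices a truncated/randomized SVD would be preferable.
    u, s, vt = np.linalg.svd(doc_term_matrix, full_matrices=False)
    # Keep only the two largest singular values.
    return u[:, :2] * s[:2]

# Toy usage: three documents over four terms.
m = np.array([[0.5, 0.2, 0.0, 0.0],
              [0.4, 0.3, 0.1, 0.0],
              [0.0, 0.0, 0.6, 0.5]])
print(project_2d(m))
```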
As shown in Figure 8, when the density of each cluster is not uniform, the accuracy of the modified K-means clustering algorithm degrades. In contrast, the accuracy of our algorithm remains similar. Therefore, based on the experimental results on the two sets of datasets, we can conclude that our algorithm has a better ability to find arbitrary shapes of clusters with variable sizes than K-means clustering.

There are some events that we could not correctly separate. For example, on the wildfire topic, there exist different events, such as "Oregon wildfire", "Arizona wildfire", etc. However, at level 1, it was hard to separate those events into different clusters. Table 8 illustrates the reason for this event confusion at level 1. As shown, the term frequency of topical terms (e.g., fire, firefighter) is relatively higher than that of named entities (e.g., Colorado, Arizona). Similarly, for the airplane crash topic, it was difficult to separate different airplane crash events since distinguishing lexical features such as the plane number have extremely low term frequency. The capability of distinguishing different events on the same topic is important. One possible solution is to use temporal information. The rationale behind this approach is the assumption that news articles on the same event are temporally proximate. However, if two events occur during the same time interval, then this temporal information might not be helpful. Another approach is to use classification, i.e., a training dataset is composed of multiple topic classes, each of which is composed of multiple events; we then learn the weights of topic-specific terms and named entities [49]. However, this approach is not suitable since it cannot accommodate dynamically changing topics. Therefore, event confusion requires further study.

We presented a mining framework that supports intelligent information retrieval. An experimental prototype has been developed, implemented, and tested to demonstrate the effectiveness of the proposed framework. In order to accommodate topics that change over time, we developed an incremental document clustering algorithm based on a neighborhood search. The presented clustering algorithm can identify news event clusters as well as topic clusters incrementally. We also showed that the resulting topic ontologies can characterize news topics at multiple levels of abstraction.

We intend to extend this work in the following five directions. First, although a document hierarchy can be obtained using unsupervised clustering, as shown in Aggarwal et al. [1], the cluster quality can be enhanced if a pre-existing knowledge base is exploited. That is, based on this a priori knowledge, we can have some control while building a document hierarchy. Second, besides exploiting text data, we can utilize other information, since Web news articles are composed of text, hyperlinks, and multimedia data. For example, as described in [25], both terms and hyperlinks (which point to related news articles or Web pages) can be used for feature selection. Third, coupled with WordNet [36], we plan to extend the topic ontology learning framework to accommodate rich semantic information extraction. To this end, we will annotate a topic ontology within Protégé [38, 54]. Fourth, our clustering algorithm can be tested on other datasets such as the TDT corpus [53].
Finally, to strengthen our work in terms of generality, we are in the process of investigating the potential applicability of our method to earth science information streams.

References
1. On the merits of using supervised clustering for building categorization systems
2. Enriching very large ontologies using the WWW
3. Efficient similarity search in sequence databases
4. Topic detection and tracking: pilot study final report
5. First story detection in TDT is hard
6. The R*-tree: an efficient and robust access method for points and rectangles
7. The X-tree: an index structure for high dimensional data
8. Using linear algebra for intelligent information retrieval
9. Scaling clustering algorithms to large databases
10. A system for new event detection
11. Efficient time series matching by wavelets
12. Dynamic topic mining from news stream data
13. Pattern classification
14. QCS: a tool for querying, clustering, and summarizing documents
15. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data
16. Initialization of iterative refinement clustering algorithms
17. Inferring hierarchical descriptions
18. CURE: an efficient clustering algorithm for large databases
19. ROCK: a robust clustering algorithm for categorical attributes
20. R-trees: a dynamic index structure for spatial searching
21. Data mining: concepts and techniques
22. An investigation of linguistic features and clustering algorithms for topical document clustering
23. Robust statistics
24. Clustering using a similarity measure based on shared near neighbors
25. Composite kernels for hypertext categorisation
26. CHAMELEON: a hierarchical clustering algorithm using dynamic modeling
27. Effective retrieval of audio information from annotated text using ontologies
28. Disambiguation of annotated text of audio using ontologies
29. Retrieval effectiveness of an ontology-based model for information selection
30. Fast and effective text mining using linear-time document clustering
31. Document clustering with cluster refinement and model selection capabilities
32. Ontology learning for the Semantic Web
33. Efficient clustering of high-dimensional data sets with application to reference matching
34. Tracking and summarizing news on a daily basis with Columbia's Newsblaster
35. Automatic evaluation and uniform filter cascades for inducing n-best translation lexicons
36. WordNet: an on-line lexical database
37. Efficient and effective clustering methods for spatial data mining
38. Creating Semantic Web contents with Protégé-2000
39. X-means: extending K-means with efficient estimation of the number of clusters
40. An algorithm for suffix stripping
41. NewsInEssence: a system for domain-independent, real-time news clustering and multi-document summarization
42. Interactive, domain-independent identification and summarization of topically related news
43. Incremental support vector machine learning: a local approach
44. Introduction to modern information retrieval
45. Deriving concept hierarchies from text
46. Towards context sensitive information inference
47. Query expansion using lexical-semantic relations
48. Learning approaches for detecting and tracking news events
49. Topic-conditioned novelty detection
50. Similarity relations and fuzzy orderings
51. BIRCH: an efficient data clustering method for very large databases
52. Evaluations of hierarchical clustering algorithms for document datasets
53. NIST topic detection and tracking corpus

Acknowledgments. This paper is based on our previous work [12], which was presented at the Second International Conference on Ontologies, DataBases, and Applications of Semantics for Large Scale Information Systems (ODBASE 2003), Catania, Sicily, Italy, November 2003.
We would like to thank the audience and the anonymous reviewers of ODBASE 2003 for their helpful comments. We also thank the anonymous reviewers of this special issue for their valuable comments. Finally, we would like to thank Jongeun Jun for helpful discussions on the clustering algorithm. This research has been funded in part by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-9529152.