PII: S0957-4174(99)00028-7

Intelligent query agent for structural document databases☆

M.F. Jiang, S.S. Tseng*, C.J. Tsai

Department of Computer and Information Science, National Chiao Tung University, Hsinchu 300, Taiwan, ROC

Abstract

Querying a database for document retrieval is often a process close to querying an answering expert system. In this work, we apply expert system techniques to the construction of an intelligent query agent and regard the structural document database as the expertise that is the objective of knowledge acquisition. A new knowledge representation, named Structural Documents (SDs), is proposed as the basis of our model, and a transformation process from the raw data to the format of a database is applied. Based on the SDs, more suitable results can be inferred rapidly by the inference engine, and the flow of inference is also described. For implementation, an intelligent Chinese information retrieval system for personnel regulations that integrates knowledge-based and full-text searching techniques is proposed. In our experiments, the structural information of the documents can be acquired from the database using the knowledge extraction module. By observing the operating process of users, we found that the query process of users is simplified. © 1999 Published by Elsevier Science Ltd. All rights reserved.

Keywords: Information retrieval; Document database; Intelligent agent; Expert system; Hierarchy clustering

1. Introduction

Nowadays, database systems are useful in business and industrial environments due to their many applications. Many database systems are built for storing documents; these document databases deserve more attention. Recently, due to the rapid growth of the Internet and of hardware performance, research on digital libraries has become an important issue. A major part of a digital library is its electronic books, which have the advantage of saving a lot of retrieval time.
There has been a recent trend to publish electronic books rather than hard copy. Especially in professional fields, reference manuals are usually kept as document databases to make them more convenient to query. If there is a database system that can be modeled as shown in Fig. 1, users can formulate a query to retrieve documents, reformulate the query when they see the results, and so on, until they are satisfied with the answer (Navarro & Baeza-Yates, 1997). The query effectiveness depends upon the user's knowledge of the query language.

In order to improve the convenience of querying a traditional database system, we wanted to design a query agent that can transform the user's demand into a query format, coordinate with the user's request and revise the query interactively. For example, querying the personnel regulations for civil servants in Taiwan seems to be very complicated, and a lot of access time is required. Building a query agent is one solution to such problems. Moreover, we hope that the agent is adaptable to different document databases. Therefore, we built an intelligent query agent (IQA), illustrated in Fig. 2, to assist users in generating suitable queries and to adjust the queries according to the user's demand (Riecken, 1994).

Since querying a database for document retrieval is often a process close to querying an answering expert system (ES) (Celentano, Fugini, & Pozzi, 1995), in this work we apply ES techniques to the construction of the IQA. In building the IQA by the ES approach, we are concerned with the construction of the knowledge base, including the knowledge representation and the method of knowledge acquisition.

A document database consists of a large number of electronic books. In addition to the electronic form of the content stored in the database, some structural information, i.e. the chapter/section/paragraph hierarchy, may also be embedded in the database.
Classical information retrieval usually allows little structuring (Navarro & Baeza-Yates, 1997), since it retrieves information based only on the data. However, the structural information is useful in querying the document database; for instance, most people read books with a chapter-oriented concept. The document structure, including the index and the table of contents, can therefore be regarded as the expertise and also as the objective of the knowledge acquisition.

Expert Systems with Applications 17 (1999) 105–113. Pergamon. www.elsevier.com/locate/eswa
0957-4174/99/$ - see front matter © 1999 Published by Elsevier Science Ltd. All rights reserved. PII: S0957-4174(99)00028-7
☆ This work was supported by the National Science Council of the Republic of China under Grant No. NSC88-2213-E-009-078.
* Corresponding author. Tel.: +886-3-5712121, ext. 56658; fax: +886-3-5721490. E-mail address: sstseng@cis.nctu.edu.tw (S.S. Tseng)

In order to elicit the knowledge embedded in the document structure, we first propose a new knowledge representation, named Structural Documents (SD) (Jiang, Tseng, & Tsai, 1999), as the basis of the IQA. Second, we design a process for transforming the documents into a set of structural documents, which merges two documents with similarity greater than a given threshold into one structural document. Based on this idea, we developed a clustering-based approach to construct the SD in this work. Moreover, based on the SD, more suitable results can be inferred rapidly by the inference engine, and the flow of inference is also described. Besides, the architecture of the IQA and of the whole structural document database system is also proposed.

As we know, a sound legal system and complete regulations are usually of great importance for a government by law. Right now, the personnel regulations for civil servants in Taiwan seem to be very complicated.
Although several kinds of reference books about personnel regulations have been provided for the general public to consult, they are not easy to use and a lot of access time is required. Therefore, we design an intelligent Chinese information retrieval system for personnel regulations by integrating ES-based and full-text searching techniques. Our system may help users to retrieve the required personnel regulations. As mentioned above, a transformation process from the raw data to the format of a database is first applied. The embedded knowledge of the resulting data is then elicited by applying clustering techniques. In this way, the semantic indexes of the raw data can be established, and suitable results may be obtained. In our experiments, the structural information of the documents can be acquired from the database using the knowledge extraction module. By observing the operating process of users, we found that the query processes of users are simplified.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. The knowledge extraction process we propose is presented in Section 3. Section 4 presents the inference process of the IQA. Section 5 demonstrates the implementation of this approach. Finally, the conclusion and future work are discussed in Section 6.

2. Related works

An ES is a knowledge-based system that solves problems at the level of a human expert (Giarratano & Riley, 1993). It generally consists of three modules: user interface, knowledge base and inference engine. The user interface helps users to communicate easily with the ES, the knowledge base contains the knowledge that forms the basis of inference, and the inference engine generates conclusions according to the knowledge base. Generally, the process of building the knowledge base, which is called knowledge acquisition, consists of knowledge engineers interviewing experts.
However, this may cost a lot of time, as the domain expert may not be familiar with computer techniques or the knowledge engineer may not be familiar with the domain knowledge (Hwang & Tseng, 1990). In order to reduce the intervention of experts and to smooth the knowledge acquisition, we established an adaptive process to transfer domain knowledge into the format of a knowledge base.

The approach undertaken in IQA is based on the assumption that the structure of reference books is a hierarchy. Let us take the book structure shown in Fig. 3 as an example. In Fig. 3, there are nine chapters, where Chapter 1 consists of three sections and some sections may be composed of paragraphs. As most readers consult books following the tree-like hierarchy, the book's structure is likely to influence how queries are constructed. Intuitively, documents in the same chapter are likely to be more similar to each other than to documents in other chapters. Similarly, within a chapter, documents in the same section are likely to be more similar to each other than to documents in other sections. So, the information in the structural hierarchy may be useful in retrieving from document databases. Besides, the semantic meaning of the documents in the book is expressed in the contents, including words, phrases, etc. Therefore, we propose a new knowledge representation mixing the contents and the structure of the document database in building the IQA.

Fig. 1. The model of the database system.
Fig. 2. An intelligent query agent for the document database system.
Fig. 3. The chapter structure of a book.

3. System structure

In this section, we first introduce our system model and then describe the major modules of the system in detail.

3.1. System overview

Fig.
4 shows the overview of the system, which consists of three components: the IQA; the query transformer (QT) module and knowledge extractor (KE) module; and a traditional database system. The IQA and QT are used to transform the user's demand into an accessible query for the database system, and the KE module is used to extract the knowledge from the database and then store the extracted knowledge in the knowledge base of the IQA. Based on the knowledge stored in the format of SD, which is the new knowledge representation proposed in this work, the IQA module infers and revises a suitable query for the user's demand, and the QT module transforms the result of the IQA into an accessible query following the query syntax supported by the database system.

3.2. IQA and KE modules

The IQA module consists of a user interface, an inference engine and a knowledge base. The user interface helps users communicate easily with the IQA using a web-based approach. The inference engine derives the results based on the knowledge; this process is addressed in Section 4. The knowledge base stores the knowledge for inference and uses a new knowledge representation called SD, which is formally defined in Definition 3.1.

As mentioned above, the knowledge is extracted from the database by the KE module. The flow of knowledge extraction, shown in Fig. 5, is as follows. The content of the document database is first transformed into sets of words by partition and transformation. Besides, the index of the book is further considered as a knowledge source to identify the set of keywords for each document in the partition and transformation step. By computing the similarity of each pair of documents, the similarity matrix for the documents is obtained and used as the basis for the subsequent clustering, which finds similar pairs of documents. After clustering, the hierarchy of structural documents can be obtained.
In this figure, the two kinds of existing resources, the index and the table of contents of the books, are used as the knowledge source in the KE module. The three procedures in the KE module are described in detail as follows.

Fig. 4. Overview of the system.
Fig. 5. Flowchart of knowledge extracting.

3.2.1. Partition and transformation

The KE module is capable of processing English or Chinese document databases. As there is no obvious word boundary in Chinese text, a process for identifying each possible disyllabic word in the target database is needed. After the disyllabic words are identified from the sentence by applying the association measure, each document can be transferred to a set of keywords. That is, for two different Chinese characters, $c_i$ and $c_j$, the association (Liang, 1995; Sproat and Shih, 1990) of the two characters can be computed by:

$$\log_2 \frac{\mathrm{freq}(c_i c_j)/n}{\bigl(\mathrm{freq}(c_i)/n\bigr)\,\bigl(\mathrm{freq}(c_j)/n\bigr)},$$

where the value of the function freq( ) is the occurrence frequency of a character or of two successive characters, and $n$ is the number of all the characters in the target database. Since the association formula was originally designed for a large corpus, the obtained associations may not be good enough to identify words. Therefore, further manual tuning may be required. After the disyllabic words are identified from the sentence by applying the association measure, each document can be transferred to a set of words. For example, the sentence […] may be segmented into […] and […]. It should be noted here that there is no need to identify words in English text, so the above segmentation method is skipped for English text.

In our approach, the indexes of the book are also used to identify the set of keywords for each document. After the execution of the above two steps, each document is transformed into a set of keywords and words.
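The association measure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `association_scores` and `candidate_words`, the `cutoff` parameter and the toy string are all hypothetical, and the cutoff stands in for the manual tuning that the text says may be required.

```python
import math
from collections import Counter


def association_scores(text: str) -> dict[str, float]:
    """Score each pair of successive characters with the association measure.

    score(ab) = log2( (freq(ab)/n) / ((freq(a)/n) * (freq(b)/n)) ),
    where n is the number of characters in `text`.
    """
    n = len(text)
    char_freq = Counter(text)
    pair_freq = Counter(text[k:k + 2] for k in range(n - 1))
    scores = {}
    for pair, f_pair in pair_freq.items():
        p_pair = f_pair / n
        p_first = char_freq[pair[0]] / n
        p_second = char_freq[pair[1]] / n
        scores[pair] = math.log2(p_pair / (p_first * p_second))
    return scores


def candidate_words(scores: dict[str, float], cutoff: float) -> list[str]:
    """Keep pairs scoring above `cutoff` (hypothetical, hand-tuned value)."""
    return sorted(pair for pair, s in scores.items() if s > cutoff)


if __name__ == "__main__":
    toy = "abab"  # stand-in for a target database; not data from the paper
    scores = association_scores(toy)
    print(scores)
    print(candidate_words(scores, cutoff=0.5))
```

On a real Chinese database the same functions would be applied to the full character stream; pairs that occur together more often than their individual frequencies predict receive higher scores and become candidate disyllabic words.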
Therefore, the size of both sets varies depending on the document itself, because the words are identified by the association measure and the keywords are identified by comparison with the index of the reference books.

3.2.2. Similarity measure

To measure the similarity between two documents, we use the following heuristics. The similarity between two documents in the same chapter is higher than that between two documents in different chapters, and the similarity between two documents in the same section is higher than that between two documents in different sections. Without loss of generality, we assume the whole reference book to be divided into a three-tier hierarchy of chapter, section and paragraph. Based upon these heuristics, the Hierarchy Dependence (HD) between two documents can be easily computed by the following procedure:

Step 1: If the two documents are not in the same chapter, HD ← 0 and stop.
Step 2: If the two documents are not in the same section, HD ← $1/s$, where $s$ is the total number of sections in this chapter, and stop.
Step 3: If the two documents are not in the same paragraph, HD ← $0.5 + (1/p)$, where $p$ is the total number of paragraphs in this section, and stop.
Step 4: HD ← 1.

Let the two documents be denoted as $D_i$ and $D_j$. In the KE module, the similarity of $D_i$ and $D_j$, denoted by $S(i, j)$, is computed by the following formula:

$$S(i, j) = (1 - \delta) \times \mathrm{same}(i, j) + \delta \times \mathrm{HD}(i, j),$$

where $\mathrm{same}(i, j)$ is the number of words and keywords that appear in both documents $D_i$ and $D_j$, normalized by dividing by the total number of keywords in documents $D_i$ and $D_j$; $\mathrm{HD}(i, j)$ is the hierarchy dependence of documents $D_i$ and $D_j$; and $\delta$ is an adaptive weight with $0 \le \delta \le 1$. The default value of $\delta$ is 0.5, which can be adjusted according to the number of chapters of a given book. For example, when the number of chapters is close to 1 or to $n$ for a book divided into $n$ documents, we set $\delta$ to a value close to 0, as the structure then carries little meaning.

Example 3.1.
Let $D_i = \{$[…]$\}$, $|D_i| = 4$, and $D_j = \{$[…]$\}$, $|D_j| = 3$. The number of words that appear in both documents $D_i$ and $D_j$ is 2, so we have $\mathrm{same}(i, j) = |D_i \cap D_j| / (|D_i| + |D_j|) = 2/(4 + 3)$.

Example 3.2. Let $D_i$ = {'intelligent', 'query', 'agent'}, $|D_i| = 3$, and $D_j$ = {'database', 'query'}, $|D_j| = 2$. The number of words that appear in both documents $D_i$ and $D_j$ is 1, so we have $\mathrm{same}(i, j) = 1/(3 + 2) = 0.2$.

By computing the similarity measure for every pair of different documents $D_i$ and $D_j$, a similarity matrix $[S_{ij}]$ can be formed by letting $S_{ij} = S(i, j)$. To simplify our further discussion, we assign $S_{ij} = 0$ for $i \ge j$. The matrix then becomes an upper triangular matrix whose diagonal elements are 0.

3.2.3. Clustering and structuring documents

Clustering is an important step in the KE module for the purpose of structuring documents. To transfer the documents into a set of structural documents, the algorithm for similarity matrix updating in hierarchy clustering (Jain & Dubes, 1988) is used. In hierarchy clustering, several approaches to measuring the similarity of clusters have been considered, including the single-link and the complete-link methods. It seems that the clustered hierarchy generated by the complete-link method is more balanced than the one generated by the single-link method. In this work, the complete-link method is implemented. To represent the knowledge in the clustering process, the definition and notation of SD are formally defined.

Definition 3.1. The SD with level $l$ is defined recursively as follows:

• A document $D_i$ is a SD with level 0, denoted as $SD_i$, which can also be denoted as $(_0 D_i)_0$.
• A pair of SDs, denoted as $(_l SD_i, SD_j)_l$, with the greatest similarity measure among all the different pairs, where that measure is greater than a threshold $\theta$, is also a SD with level $l$, where $l = \mathrm{Maximum}(m, n) + 1$, $m$ is the level number of $SD_i$ and $n$ is the level number of $SD_j$.

Definition 3.2.
The similarity $S'$ between the structural documents $SD_i$ and $SD_j$ is defined as follows:

1. If $SD_i = (D_i)$ and $SD_j = (D_j)$, the similarity $S'(SD_i, SD_j) = S(i, j)$.
2. The similarity between $(SD_i, SD_j)$ and $SD_k$ is defined as $S'(SD_k, (SD_i, SD_j)) = Q\{S'(SD_k, SD_i), S'(SD_k, SD_j)\}$, where the operator $Q$ means 'minimum' for the complete-link method, or 'maximum' for the single-link method.

Assume we have $n$ documents; an $n \times n$ similarity matrix can then be generated by our method. First, each document is assigned to be a SD, i.e. $SD_i = (D_i)$ for document $D_i$. Moreover, the similarity matrix for documents is transferred to the initial similarity matrix for SDs. The elements of the similarity matrix $[S'_{ij}]$ are the similarity measures of the structural documents $SD_i$ and $SD_j$, i.e. $S'_{ij} = S_{ij}$. Let $C = \{SD_i\}$ be the set of all $SD_i$. Based upon Johnson's algorithms (1967) for hierarchy clustering (Jain & Dubes, 1988), we propose the following procedure for updating the similarity matrix.

Algorithm 3.1.

Step 1: Find the most similar pair of structural documents in the current similarity matrix, say the pair $\{p, q\}$, where $S'_{p,q} = \mathrm{Maximum}\{S'_{i,j}$, for any $i, j\}$.
Step 2: Merge structural documents $SD_p$ and $SD_q$ into a new single structural document $(SD_p, SD_q)$.
Step 3: Delete the structural documents $SD_p$ and $SD_q$ from the set $C$, and insert the new structural document $(SD_p, SD_q)$ into the set, i.e. $C \leftarrow C - \{SD_p\} - \{SD_q\} + \{(_l SD_p, SD_q)_l\}$, where $l = \mathrm{Maximum}(m, n) + 1$, $m$ is the level number of $SD_p$ and $n$ is the level number of $SD_q$.
Step 4: Update the similarity matrix by deleting the rows and columns related to structural documents $SD_p$ and $SD_q$, and adding a row and a column corresponding to the new structural document $(SD_p, SD_q)$.
Step 5: If there are no two structural documents with similarity greater than $\theta$, stop. Otherwise, go to Step 1.

After clustering, the hierarchy of structural documents can be obtained according to the set $C$.

3.3.
An example

Let $\theta = 0.1$ and assume we have six documents, $\{D_1, D_2, D_3, D_4, D_5, D_6\}$, with the similarity matrix:

$$
\begin{array}{c|cccccc}
 & D_1 & D_2 & D_3 & D_4 & D_5 & D_6 \\ \hline
D_1 & 0 & 0.8 & 0.4 & 0.5 & 0.4 & 0.1 \\
D_2 &   & 0   & 0.5 & 0.6 & 0.3 & 0.2 \\
D_3 &   &     & 0   & 0.7 & 0.8 & 0.1 \\
D_4 &   &     &     & 0   & 0.9 & 0.3 \\
D_5 &   &     &     &     & 0   & 0.2 \\
D_6 &   &     &     &     &     & 0
\end{array}
$$

Before clustering, each document is assigned to be a structural document, i.e. $SD_i = (_0 D_i)_0$ for document $D_i$. The set of all $SD_i$ is written as $C = \{SD_1, SD_2, SD_3, SD_4, SD_5, SD_6\}$, and the initial similarity matrix for SDs is generated as:

$$
\begin{array}{c|cccccc}
 & SD_1 & SD_2 & SD_3 & SD_4 & SD_5 & SD_6 \\ \hline
SD_1 & 0 & 0.8 & 0.4 & 0.5 & 0.4 & 0.1 \\
SD_2 &   & 0   & 0.5 & 0.6 & 0.3 & 0.2 \\
SD_3 &   &     & 0   & 0.7 & 0.8 & 0.1 \\
SD_4 &   &     &     & 0   & 0.9 & 0.3 \\
SD_5 &   &     &     &     & 0   & 0.2 \\
SD_6 &   &     &     &     &     & 0
\end{array}
$$

In the first iteration, the structural documents $SD_4$ and $SD_5$ are merged into $(SD_4, SD_5)$, as the element in the 4th row and the 5th column is the maximum. The set of all $SD_i$ is written as $C = \{SD_1, SD_2, SD_3, (_1 SD_4, SD_5)_1, SD_6\}$. After the similarity matrix is updated, we get a new similarity matrix after Step 4:

$$
\begin{array}{c|ccccc}
 & 1 & 2 & 3 & 4,5 & 6 \\ \hline
1   & 0 & 0.8 & 0.4 & 0.4 & 0.1 \\
2   &   & 0   & 0.5 & 0.3 & 0.2 \\
3   &   &     & 0   & 0.7 & 0.1 \\
4,5 &   &     &     & 0   & 0.2 \\
6   &   &     &     &     & 0
\end{array}
$$

In the second iteration, we merge the structural documents $SD_1$ and $SD_2$ into $(_1 SD_1, SD_2)_1$, as the element in the 1st row and 2nd column is the maximum. We have the new $C = \{(_1 SD_1, SD_2)_1, SD_3, (_1 SD_4, SD_5)_1, SD_6\}$. After the similarity matrix is updated, the new similarity matrix is transformed to:

$$
\begin{array}{c|cccc}
 & 1,2 & 3 & 4,5 & 6 \\ \hline
1,2 & 0 & 0.4 & 0.3 & 0.1 \\
3   &   & 0   & 0.7 & 0.1 \\
4,5 &   &     & 0   & 0.2 \\
6   &   &     &     & 0
\end{array}
$$

Similarly, we merge the structural documents $SD_3$ and $(_1 SD_4, SD_5)_1$ into $(_2 SD_3, (_1 SD_4, SD_5)_1)_2$ in the third iteration, since the element in the 2nd row and 3rd column is the maximum.
We have the new $C = \{(_1 SD_1, SD_2)_1, (_2 SD_3, (_1 SD_4, SD_5)_1)_2, SD_6\}$ and the new similarity matrix is transformed to:

$$
\begin{array}{c|ccc}
 & 1,2 & 3,4,5 & 6 \\ \hline
1,2   & 0 & 0.3 & 0.1 \\
3,4,5 &   & 0   & 0.1 \\
6     &   &     & 0
\end{array}
$$

In the last two iterations, we have the new $C = \{(_3 (_1 SD_1, SD_2)_1, (_2 SD_3, (_1 SD_4, SD_5)_1)_2)_3, SD_6\}$ and $\{(_4 (_3 (_1 SD_1, SD_2)_1, (_2 SD_3, (_1 SD_4, SD_5)_1)_2)_3, SD_6)_4\}$. According to the set $C$, the hierarchy of SDs can be obtained as shown in Fig. 6.

4. Inference process based on the structural documents

After the knowledge extraction process of the previous section, the set $C$ of SDs has been built for the documents to be retrieved. Based on the SD, more suitable results can be inferred rapidly by an inference engine. The flow of the inference process is described in Fig. 7. The query result $R$ is generated by some search engine. However, in most cases the result may not match the user's demand well: it may contain too many or too few documents. Two algorithms are designed to solve this problem. In this section, the details of the inference process are illustrated.

First, we state the notations and definitions for the following algorithms. To easily represent a set of documents, a data structure bit_map is defined as:

$$\mathrm{bit\_map} = b_1 b_2 \ldots b_n,$$

where $b_k = 1$ if document $D_k$ belongs to the set of documents, and $b_k = 0$ otherwise. Based on the data structure bit_map, a given query result, i.e. a set of documents generated by some search engine, can be written as:

$$R = b_1 b_2 \ldots b_n,$$

such that $b_k = 1$ if document $D_k$ belongs to the query result, and $b_k = 0$ otherwise.

In the format of SDs, a pair of structural documents $(SD_i, SD_j)$ means that the similarity between $SD_i$ and $SD_j$ is greater than that with the other SDs. For any structural document $SD_j$, the most similar structural document $SD_k$ can be found when $SD_i$, a pair of structural documents $(SD_j, SD_k)$, has been found. To describe this idea, the definitions of immediate successor and successor are given as follows.

Definition 4.1.
A structural document $SD_i$ is said to be an immediate successor of a structural document $SD_j$ if $SD_i = (_l SD_j, SD_k)_l$ or $SD_i = (_l SD_k, SD_j)_l$ for some structural document $SD_k$ and level $l$.

Definition 4.2. A structural document $SD_i$ is said to be a successor of a structural document $SD_j$ if there exists an immediate successor $SD_k$ of $SD_j$ such that $SD_i$ is also a successor of $SD_k$, or if $SD_i$ is an immediate successor of $SD_j$.

The key process, finding the immediate successor, directly scans the set $C$ of structural documents and is described in the following algorithm.

Fig. 6. The hierarchy of structural documents according to the clustering result.
Fig. 7. The flowchart of the inference process.

Algorithm 4.1 (Find_immediate_successor).

Input: $C$, $SD_i$
Output: The immediate successor of $SD_i$
Step 1: Find $SD_i$ in the set $C$.
Step 2: Check the next element after $SD_i$; two cases are possible.
  Case 1: The next element after $SD_i$ is ",". Find the structural document $SD_j$, for some $j > 0$, in the element following "$SD_i$," and return $(_k SD_i, SD_j)_k$, for $k = \mathrm{MAX}(i, j) + 1$.
  Case 2: The next element after $SD_i$ is "$)_l$" for some $l$. Find "$(_l$" in the elements preceding $SD_i$ and return $(_l SD_j, SD_i)_l$, for some structural document $SD_j$.

In Step 2 of Algorithm 4.1, two cases are possible. In Case 1, the output is $(SD_i, SD_j)$ for some $SD_j \in C$, and in Case 2, the output is $(SD_j, SD_i)$ for some $SD_j \in C$. By Definitions 3.1 and 4.1, it can easily be seen that the output is the immediate successor of $SD_i$.

Now, a function bit_set is defined to convert the structural document $SD_i$ into a string of bits, such that the $k$th bit is 1 if the document $D_k$ belongs to $SD_i$, and is 0 otherwise.

Example 4.1. Assume the number of all documents is 5; then bit_set$((((D_1), (D_2)), (D_3)))$ = "11100".

The $i$th cover for all $n$ documents is defined, following the definition of bit_map, as:

$$C_i = b_1 b_2 \ldots b_n$$

Algorithm 4.2.
Input: Any structural document SD_i, number k
Output: C_k, C_{k+1}, …, C_{k+m}, where the number m is the minimum integer such that R ≤ C_{k+m}
Procedure Cover(k, SD_i)
  SD_j ← Find_immediate_successor(C, SD_i)
  C_k ← bit_set(SD_j)
  If NOT_EQUAL(R, AND(R, C_k)) Then Call Cover(k + 1, SD_j)
End_Procedure

Example 4.2. Assume we have six documents, the hierarchy of SDs is as shown in Fig. 6 and $C = \{(((SD_1, SD_2), (SD_3, (SD_4, SD_5))), SD_6)\}$. Now, for a user's query, the database returns the result $\{D_2, D_3, D_4\}$, so $R = 011100$. When Cover(2, $(D_3)$) is called, we get $C_2 = 001110$ and $C_3 = 111110$.

Algorithm 4.3. /* The users want to get a set of documents which can be deleted from the query result. */

Input: The query result R, C
Output: The set of documents which can be deleted from the query result.
Procedure Less
  Select a D_i according to R
  SD_j ← (D_i)
  C_1 ← {D_i}
  Call Cover(2, SD_j)
  Case COUNT_ONE(AND(R, C_{k-1})) - COUNT_ONE(AND(R, SUB(C_k, C_{k-1})))
    < 0: RETURN AND(R, C_{k-1})
    > 0: RETURN AND(R, SUB(C_k, C_{k-1}))
    = 0: RETURN AND(R, C_{k-1}) or AND(R, SUB(C_k, C_{k-1})), determined by user
  End_Case
End_Procedure

Fig. 8. The architecture of CPRCS.
Fig. 9. The main window of the CPRCS.

Algorithm 4.4. /* The users want to get a set of documents which could offer more information. */

Input: The query result R, C
Output: A set of documents which could offer more information.
Procedure More
  Select a D_i according to R
  SD_j ← (D_i)
  C_1 ← {D_i}
  Call Cover(2, SD_j)
  While COUNT_ONE(AND(R, C_k)) ≠ 0 do
    Case COUNT_ONE(AND(R, C_{k-1})) - COUNT_ONE(AND(R, SUB(C_k, C_{k-1})))
      < 0: R ← AND(R, SUB(C_k, C_{k-1})) and Call More
      > 0: R ← SUB(R, AND(R, SUB(C_k, C_{k-1}))), k ← k - 1
      = 0: RETURN AND(R, C_{k-1}) or AND(R, SUB(C_k, C_{k-1})), determined by user, and stop.
    End_Case
    RETURN C_k
  End_While
End_Procedure

Example 4.3. Assume we have six documents, the hierarchy of structural documents is as shown in Fig. 6 and $C = \{(((SD_1, SD_2), (SD_3, (SD_4, SD_5))), SD_6)\}$.
Now, for a user's query, the database returns the result $\{D_3, D_4, D_6\}$. For this result, we have the corresponding structural documents $\{SD_3, SD_4, SD_6\}$ generated by the IQA. When the user feels that the result is insufficient, the IQA may generate the set $\{SD_3, SD_4, SD_5, SD_6\}$ and ask the database to return document $\{D_5\}$. When the user again feels that the result is insufficient, the IQA may generate the set $\{SD_1, SD_2, SD_3, SD_4, SD_5, SD_6\}$ and the documents $\{D_1, D_2\}$ are returned. On the other hand, if the user feels that the result $\{D_3, D_4, D_6\}$ contains too many documents, the IQA generates the set $\{SD_3, SD_4\}$ and deletes the document $\{D_6\}$ from the query result. If the user again feels that the result contains too many documents, the IQA generates the set $\{SD_4\}$ and deletes the document $\{D_3\}$ from the query result.

Example 4.4. For another user's query, the database returns the result $\{D_3, D_4, D_5\}$. If the user feels that the result contains too many documents, the IQA generates the set $\{SD_4, SD_5\}$ and deletes the document $\{D_3\}$ from the query result.

Example 4.5. For another user's query, the database returns the result $\{D_2, D_3, D_4\}$. If the user feels that the result contains too many documents, the IQA generates the set $\{SD_3, SD_4\}$ and deletes the document $\{D_2\}$ from the query result.

5. Implementation

As we know, a sound legal system and complete regulations are usually of great importance for a government by law. Right now, the personnel regulations (PR) for civil servants in Taiwan seem to be very complicated. Although several kinds of reference books about personnel regulations have been provided for the general public to consult, they are not easy to use and a lot of access time is required. Therefore, improving the methods of inquiry and annotation of the personnel regulations has become an important issue. Besides, by comparing the experimental results of the traditional database model and our new model, we may verify the performance of the model based on SDs for the PR document databases.
We implemented a database prototype, named CPRCS (Chinese Personnel Regulations Consultation System), for PR documents and used the IQA to assist the querying process. The CPRCS architecture follows the system structure of Fig. 4 and is shown in Fig. 8. In CPRCS, users operate the system through a web-page interface, which can easily be extended. Two different modes, with or without the IQA, are provided for users to choose from. By observing the operating process of users, we found that the query process of users is simplified. Fig. 9 shows the main window of the system.

Fig. 10. The number of structural documents at each similarity level. For instance, the number of structural documents is 224 when the similarity between any two documents in the same SD is greater than 0.7.

Furthermore, the relation between the chapter hierarchy and the number of SDs is explained in the following experiment. There are 365 documents in the target database for the experiment. All the documents are divided into 11 chapters, and the total number of sections is 171. After processing by the KE module, the number of SDs at different similarity levels is shown in Fig. 10. The figure shows the merging situation in the clustering process. The results observed from the figure are as follows. When the similarity value is between 0 and 0.2, the number of structural documents is about 11, which is the number of chapters. Similarly, when the similarity value is between 0.4 and 0.7, the number of structural documents is about 171, which is the total number of sections. This indicates that the hierarchy of structural documents is similar to the chapter/section hierarchy of the book. The structural information of the documents can thus be acquired from the database using the KE module.

6. Conclusion

Classical information retrieval usually allows little structuring.
However, the structural information is useful in querying the document database; for instance, most people read books with a chapter-oriented concept. The document structure, including the index and the table of contents, can therefore be regarded as the expertise and also as the objective of the knowledge acquisition. Since querying a database for document retrieval is often a process close to querying an answering expert system, in this work we apply ES techniques to the construction of the IQA. In building the IQA by the ES approach, we are concerned with the construction of the knowledge base, including the knowledge representation and the method of knowledge acquisition. Therefore, a new knowledge representation, named SDs, is defined to construct the acquisition model for the IQA, and the KE module is proposed to transform the data of the database into the knowledge stored in the IQA. To evaluate the convenience of the IQA, an intelligent retrieval system, CPRCS, is implemented. By observing the operating process of users, we found that the query processes of users are simplified. Besides, the experimental result has shown that the structural information of the documents can be acquired from the database using the KE module.

Future research will focus on several areas. First, better similarity measurements are necessary for increasing the performance of clustering. The analysis of the influence of different clustering methods is not covered in this work. The general formula proposed by Jain and Dubes (1988) includes most of the common hierarchical clustering methods and is the basis for future work. Another significant direction is to extend the SD-based model so that different kinds of documents can be merged into the same knowledge structure, in order to increase the practicability of the system.

References

Celentano, A., Fugini, M. G., & Pozzi, S. (1995).
Knowledge-based document retrieval in office environments: the Kabiria system. ACM Transactions on Information Systems, 13 (3), 237–268.
Giarratano, J., & Riley, G. (1993). Expert systems, 2. PWS Publishing Company.
Hwang, G. J., & Tseng, S. S. (1990). EMCUD: A knowledge acquisition method which captures embedded meanings under uncertainty. International Journal of Man-Machine Studies, 33, 431–451.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice Hall, pp. 58–86.
Jiang, M. F., Tseng, S. S., & Tsai, C. J. (1999). Discovering structure from document databases. The Third Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD-99, Beijing, China.
Liang, T. (1995). The Study of Character-based Signature Methods in Chinese Text Retrieval. PhD thesis, National Chiao Tung University, Taiwan.
Navarro, G., & Baeza-Yates, R. (1997). Proximal nodes: a model to query document database by content and structure. ACM Transactions on Information Systems, 15 (4), 400–435.
Riecken, D. (1994). Intelligent agent. Communications of the ACM, 37 (7), 18–21.
Sproat, R., & Shih, C. (1990). A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese and Oriental Languages, 4 (4), 336–351.
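As a supplementary illustration of Algorithm 3.1 (Section 3.2.3), the following minimal Python sketch applies complete-link merging to the six-document similarity matrix of Section 3.3 and reproduces the merge order and level numbers of that example. It is a sketch under stated assumptions, not the authors' implementation: the names `ROWS`, `SIM`, `link` and `cluster` are hypothetical, nested tuples stand in for the $(_l SD_i, SD_j)_l$ notation, and the link value is recomputed from the original pairwise similarities rather than by updating a matrix in place (for complete-link the two give the same minimum). One further assumption is flagged in the code: the worked example performs its final merge at similarity 0.1 with $\theta = 0.1$, so the sketch treats the threshold as inclusive; with a strict "greater than" test the last merge would not occur.

```python
from itertools import combinations

# Upper triangle of the six-document similarity matrix from Section 3.3.
ROWS = {
    1: [0.8, 0.4, 0.5, 0.4, 0.1],
    2: [0.5, 0.6, 0.3, 0.2],
    3: [0.7, 0.8, 0.1],
    4: [0.9, 0.3],
    5: [0.2],
}
SIM = {frozenset((i, i + k + 1)): s
       for i, row in ROWS.items() for k, s in enumerate(row)}


def link(a, b):
    """Complete-link similarity: the minimum over all cross-cluster pairs."""
    return min(SIM[frozenset((x, y))] for x in a[2] for y in b[2])


def cluster(doc_ids, theta):
    """Greedy merging in the spirit of Algorithm 3.1.

    Each cluster is a tuple (tree, level, members); a document starts at level 0.
    Returns the remaining clusters and the list of (tree, level, similarity) merges.
    """
    clusters = [(d, 0, (d,)) for d in doc_ids]
    merges = []
    while len(clusters) > 1:
        # Step 1: most similar pair among the current structural documents.
        a, b = max(combinations(clusters, 2), key=lambda ab: link(*ab))
        s = link(a, b)
        if s < theta:  # assumption: inclusive threshold, to match Section 3.3
            break
        # Steps 2-3: merge, with level = Maximum(m, n) + 1.
        left, right = sorted((a, b), key=lambda c: min(c[2]))
        merged = ((left[0], right[0]), max(a[1], b[1]) + 1, left[2] + right[2])
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(merged)
        merges.append(merged[:2] + (s,))
    return clusters, merges


clusters, merges = cluster(range(1, 7), theta=0.1)
for tree, level, s in merges:
    print(level, s, tree)
```

The final tree, `(((1, 2), (3, (4, 5))), 6)`, corresponds to the set $C$ given at the end of Section 3.3.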