PII: S0957-4174(99)00028-7


Intelligent query agent for structural document databases

M.F. Jiang, S.S. Tseng*, C.J. Tsai

Department of Computer and Information Science, National Chiao Tung University, Hsinchu 300, Taiwan, ROC

Abstract

Querying a database for document retrieval is often a process close to querying an answering expert system. In this work, we apply expert system techniques to build an intelligent query agent and regard the structural document database as the expertise that is the target of knowledge acquisition. A new knowledge representation, named Structural Documents (SDs), is proposed as the basis of our model, and a transformation process from the raw data to the format of a database is applied. Based on the SDs, more suitable results can be inferred rapidly by the inference engine, and the flow of inference is also described. For implementation, an intelligent Chinese information retrieval system for personnel regulations that integrates knowledge-based and full-text searching techniques is proposed. In our experiments, the structural information of the documents can be acquired from the database using the knowledge extraction module. By observing the operating process of users, we found that their query process is simplified. © 1999 Published by Elsevier Science Ltd. All rights reserved.

Keywords: Information retrieval; Document database; Intelligent agent; Expert system; Hierarchy clustering

1. Introduction

Nowadays, database systems are useful in business and industrial environments because of their many applications. Many database systems are built for storing documents; these document databases deserve more attention. Recently, owing to the rapid growth of the Internet and of hardware performance, research on digital libraries has become an important issue. A major portion of a digital library consists of electronic books, which have the advantage of saving a great deal of retrieval time. There has been a recent trend to publish electronic books as well as hard copy. Especially in professional fields, reference manuals are usually kept as document databases to make them more convenient to query.

In a database system modeled as shown in Fig. 1, users formulate a query to retrieve documents, reformulate it when they see the results, and so on, until they are satisfied with the answer (Navarro & Baeza-Yates, 1997). Query effectiveness therefore depends upon the user's knowledge of the query language.

To make querying a traditional database system more convenient, we set out to design a query agent that transforms the user's demand into a query format, coordinates with the user's request and revises the query interactively. For example, the personnel regulations for civil servants in Taiwan are very complicated, and querying them requires a lot of access time. Building a query agent is one solution to such problems. Moreover, we want the agent to be adaptable to different document databases. Therefore, we built an intelligent query agent (IQA), illustrated in Fig. 2, to assist users in generating suitable queries and to adjust the queries according to the user's demand (Riecken, 1994).

Since querying a database for document retrieval is often
a process close to querying an answering expert system (ES)
(Celentano, Fugini, & Pozzi, 1995), in this work, we apply
the ES techniques to the IQA establishment. In building
IQA by the ES approach, we are concerned about the
construction of the knowledge base, including the knowl-
edge representation and the method of knowledge acquisi-
tion.

A document database consists of a large number of electronic books. In addition to the electronic form of the content stored in the database, some structural information, i.e. the chapter/section/paragraph hierarchy, may also be embedded in the database. Classical information retrieval usually allows little structuring (Navarro & Baeza-Yates, 1997), since it retrieves information only on data. However, the structural information is useful in querying the document database; for instance, most people read books with a chapter-oriented concept. The document structure, including the index and the table of contents, can therefore be regarded as the expertise and also as the target of knowledge acquisition.

PERGAMON
Expert Systems with Applications 17 (1999) 105–113

0957-4174/99/$ - see front matter © 1999 Published by Elsevier Science Ltd. All rights reserved.

www.elsevier.com/locate/eswa

This work was supported by the National Science Council of the Republic of China under Grant No. NSC88-2213-E-009-078.

* Corresponding author. Tel.: +886-3-5712121, ext. 56658; fax: +886-3-5721490.

E-mail address: sstseng@cis.nctu.edu.tw (S.S. Tseng)

To elicit the knowledge embedded in the document structure, we first propose a new knowledge representation, named Structural Documents (SD) (Jiang, Tseng, & Tsai, 1999), as the basis of IQA. Second, we design a process that transforms the documents into a set of structural documents by merging two documents whose similarity is greater than a given threshold into one structural document. Based on this idea, we developed a clustering-based approach to construct the SD in this work. Moreover, based on the SD, more suitable results can be inferred rapidly by the inference engine, and the flow of inference is also described. The architecture of the IQA and of the whole structural document database system is also proposed.

A sound legal system and complete regulations are of great importance for a government ruled by law. At present, the personnel regulations for civil servants in Taiwan are very complicated. Although several kinds of reference books about personnel regulations have been provided for the general public to consult, they are not easy to use and a lot of access time is required.

Therefore, we designed an intelligent Chinese information retrieval system for personnel regulations by integrating ES-based and full-text searching techniques. Our system helps users retrieve the required personnel regulations. As mentioned above, a transformation process from the raw data to the format of a database is first applied. The knowledge embedded in the resulting data is then elicited by applying clustering techniques. In this way, the semantic indexes of the raw data can be established, and suitable results may be obtained. In our experiments, the structural information of the documents can be acquired from the database using the knowledge extraction module. By observing the operating process of users, we found that their query processes are simplified.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. The knowledge extraction process we propose is presented in Section 3. Section 4 presents the inference process of the IQA. Section 5 demonstrates the implementation of this approach. Finally, the conclusion and future work are discussed in Section 6.

2. Related works

An ES is a knowledge-based system that solves problems at the level of a human expert (Giarratano & Riley, 1993). It generally consists of three modules: user interface, knowledge base and inference engine. The user interface helps users communicate easily with the ES, the knowledge base contains the knowledge that forms the basis of inference, and the inference engine generates conclusions according to the knowledge base.

Generally, the process of building the knowledge base, called knowledge acquisition, consists of knowledge engineers interviewing experts. However, it may cost a lot of time, as the domain expert may not be familiar with computer techniques or the knowledge engineer may not be familiar with the domain knowledge (Hwang & Tseng, 1990). To reduce the intervention of experts and to smooth knowledge acquisition, we established an adaptive process to transfer domain knowledge to the format of a knowledge base.

The approach undertaken in IQA is based on the assumption that the structure of reference books is hierarchical. Let us take the book structure shown in Fig. 3 as an example.

In Fig. 3, there are nine chapters, where Chapter 1 consists of three sections and some sections may be composed of paragraphs. As most readers consult books by following this tree-like hierarchy, the book's structure is likely to influence how queries are constructed. Intuitively, documents in the same chapter are likely to be more similar to one another than to documents in other chapters. Similarly, within a chapter, documents in the same section are likely to be more similar to one another than to documents in other sections. So, the information in the structural hierarchy may be useful in retrieving from document databases.

Besides, the semantic meaning of the documents in the book is expressed in the contents, including words, phrases, etc. Therefore, we propose a new knowledge representation that combines the contents and the structure of the document database in building the IQA.

Fig. 1. The model of the database system.

Fig. 2. An intelligent query agent for the document database system.

Fig. 3. The chapter structure of a book.

3. System structure

In this section, we will first introduce our system model
and then the major modules of the system will be described
in detail.

3.1. System overview

Fig. 4 shows the overview of the system, which consists
of three components: IQA; query transformer (QT) module
and knowledge extractor (KE) module; and a traditional
database system. IQA and QT are used to transform the
user’s demand into an accessible query for the database
system, and the KE module is used to extract the knowledge
from the database and then store the extracted knowledge
into the knowledge base of IQA.

Based on the knowledge stored in the format of SD,
which is the new knowledge representation proposed in
this work, the IQA module infers and revises the suitable
query for user’s demand, and the QT module transforms the
result of IQA into an accessible query following the query
syntax supported by the database system.

3.2. IQA and KE modules

The IQA module consists of a user interface, an inference engine and a knowledge base. The user interface helps users communicate easily with the IQA using a web-based approach. The inference engine derives results from the knowledge; the process is addressed in Section 4. The knowledge base stores the knowledge for inference using a new knowledge representation called SD, which is formally defined in Definition 3.1. As mentioned above, the knowledge is extracted from the database by the KE module. The flow of knowledge extraction is shown in Fig. 5 and proceeds as follows. The content of the document database is first transformed into sets of words by partition and transformation. The index of the book is also used as a knowledge source to identify the set of keywords for each document in the partition and transformation step. By computing the similarity of each pair of documents, the similarity matrix for the documents is obtained and used as the basis for the subsequent clustering, which finds similar pairs of documents. After clustering, the hierarchy of structural documents can be obtained.

In this figure, two kinds of existing resources, the index and the table of contents of the books, are used as the knowledge sources in the KE module. The three procedures in the KE module are described in detail as follows.

Fig. 4. Overview of the system.

Fig. 5. Flowchart of knowledge extracting.

3.2.1. Partition and transformation
The KE module is capable of processing English or Chinese document databases. As there is no obvious word boundary in Chinese text, a process for identifying each possible disyllabic word in the target database is needed. After the disyllabic words are identified from the sentence by applying the association measure, each document can be transformed into a set of keywords. That is, for two different Chinese characters, ci and cj, the association (Liang, 1995; Sproat and Shih, 1990) of the two characters can be computed by:

$$\log_2 \frac{\mathrm{freq}(c_i c_j)/n}{\bigl(\mathrm{freq}(c_i)/n\bigr)\,\bigl(\mathrm{freq}(c_j)/n\bigr)},$$

where the value of the function freq( ) is the occurrence frequency of a character or of two successive characters, and n is the number of all the characters in the target database. Since the association formula was originally designed for a large corpus, the obtained associations may not be good enough to identify words. Therefore, further manual tuning may be required.
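As an illustration of the association measure, the following Python sketch scores every pair of successive, distinct characters in a string. It assumes that the whole target database is available as one string; the function name `association_scores` and the use of `collections.Counter` are illustrative choices and not part of the original system.

```python
import math
from collections import Counter

def association_scores(text):
    """Association of each pair of successive, distinct characters in `text`.

    Computes log2( (freq(ci cj)/n) / ((freq(ci)/n) * (freq(cj)/n)) ),
    where n is the total number of characters.
    """
    n = len(text)
    single = Counter(text)                                 # freq(ci)
    pairs = Counter(text[k:k + 2] for k in range(n - 1))   # freq(ci cj)
    scores = {}
    for pair, count in pairs.items():
        ci, cj = pair
        if ci == cj:
            continue  # the measure is stated for two different characters
        scores[pair] = math.log2(
            (count / n) / ((single[ci] / n) * (single[cj] / n))
        )
    return scores

# Toy input: "ab" occurs twice and "ba" once among four characters.
print(association_scores("abab"))
```

Pairs with a high score are candidates for disyllabic words; where to place the cut-off is left open here, in line with the remark that manual tuning may be required.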

After the disyllabic words are identified from the sentence by applying the association measure, each document can be transformed into a set of words. For example, the sentence [...] may be segmented into [...] and [...]. Note that there is no need to identify words in English text, so the above segmentation step is skipped for English text.

In our approach, the indexes of the book are also used to
identify the set of keywords for each document. After the
execution of the above two steps, each document is trans-
formed into a set of keywords and words. Therefore, the size
of both sets varies depending on the document itself,
because the words are identified by the association measure
and the keywords are identified by comparing with the index
of the reference books.

3.2.2. Similarity measure
To measure the similarity between two documents, we use the following heuristics. The similarity between two documents in the same chapter is higher than that between two documents in different chapters, and the similarity between two documents in the same section is higher than that between two documents in different sections. Without loss of generality, we assume the whole reference book to be divided into a three-tier hierarchy of chapter, section and paragraph. Based upon these heuristics, the Hierarchy Dependence (HD) between two documents can be easily computed by the following procedure:

Step 1: If the two documents are not in the same chapter, HD ← 0 and stop.
Step 2: If the two documents are not in the same section, HD ← 1/s, where s is the total number of sections in this chapter, and stop.
Step 3: If the two documents are not in the same paragraph, HD ← 0.5 + (1/p), where p is the total number of paragraphs in this section, and stop.
Step 4: HD ← 1.
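The four steps of the HD procedure can be sketched in Python as follows. The representation of a document's position as a `(chapter, section, paragraph)` tuple and the parameter names are assumptions made for this illustration only.

```python
def hierarchy_dependence(loc_a, loc_b, sections_in_chapter, paragraphs_in_section):
    """Hierarchy Dependence (HD) of two documents.

    loc_a, loc_b: hypothetical (chapter, section, paragraph) positions.
    sections_in_chapter: s, used only when the chapters match.
    paragraphs_in_section: p, used only when chapter and section match.
    """
    chap_a, sec_a, par_a = loc_a
    chap_b, sec_b, par_b = loc_b
    if chap_a != chap_b:                      # Step 1
        return 0.0
    if sec_a != sec_b:                        # Step 2
        return 1.0 / sections_in_chapter
    if par_a != par_b:                        # Step 3
        return 0.5 + 1.0 / paragraphs_in_section
    return 1.0                                # Step 4

# Same chapter and section, different paragraphs, with p = 4 paragraphs.
print(hierarchy_dependence((1, 2, 1), (1, 2, 3), 5, 4))
```

The example call falls into Step 3 and therefore evaluates 0.5 + 1/4.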
Let the two documents be denoted as Di and Dj. In the KE module, the similarity of Di and Dj, denoted by S(i, j), is computed by the following formula:

$$S(i,j) = (1 - d)\cdot \mathrm{same}(i,j) + d \cdot \mathrm{HD}(i,j),$$

where same(i, j) is the number of words and keywords that appear in both documents Di and Dj, normalized by dividing by the total number of keywords in documents Di and Dj; HD(i, j) is the hierarchy dependence of documents Di and Dj; and d is an adaptive weight value with 0 ≤ d ≤ 1.

The default value of d is 0.5, which can be adjusted according to the number of chapters of a given book. For example, when the number of chapters is near 1 or n for a book divided into n documents, we set d to a value close to 0, as the structure then carries little meaning.

Example 3.1. Let Di = {…}, |Di| = 4, and Dj = {…}, |Dj| = 3. The number of words that appear in both documents Di and Dj is 2, so we have same(i, j) = |Di ∩ Dj| / (|Di| + |Dj|) = 2/(4 + 3).

Example 3.2. Let Di = {‘intelligent’, ‘query’, ‘agent’}, |Di| = 3, and Dj = {‘database’, ‘query’}, |Dj| = 2. The number of words that appear in both documents Di and Dj is 1, so we have same(i, j) = 1/(3 + 2) = 0.2.

By computing the similarity measure for every pair of distinct documents Di and Dj, a similarity matrix [Sij] can be formed by letting Sij = S(i, j). To simplify further discussion, we assign Sij = 0 for i ≥ j. The matrix thus becomes an upper triangular matrix whose diagonal elements are 0.
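A minimal Python sketch of the similarity measure and of the upper triangular similarity matrix is given below. Documents are modeled as Python sets of words and keywords, and the HD value is supplied by the caller; the function names and the HD value 0.25 used in the example are assumptions for illustration.

```python
def same(doc_i, doc_j):
    """Shared words, normalized by |Di| + |Dj| as in Examples 3.1 and 3.2."""
    return len(doc_i & doc_j) / (len(doc_i) + len(doc_j))

def similarity(doc_i, doc_j, hd, d=0.5):
    """S(i, j) = (1 - d) * same(i, j) + d * HD(i, j), with 0 <= d <= 1."""
    return (1 - d) * same(doc_i, doc_j) + d * hd

def similarity_matrix(docs, hd_of, d=0.5):
    """Upper triangular matrix [Sij]; entries with i >= j stay 0.

    hd_of(i, j) is a caller-supplied function returning HD for documents i, j.
    """
    n = len(docs)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            S[i][j] = similarity(docs[i], docs[j], hd_of(i, j), d)
    return S

# The two documents of Example 3.2, with a hypothetical HD of 0.25.
di = {"intelligent", "query", "agent"}
dj = {"database", "query"}
print(same(di, dj))
print(similarity_matrix([di, dj], lambda i, j: 0.25))
```

With the default d = 0.5, the content term and the structure term contribute equally; setting d close to 0 reduces the measure to the word overlap alone.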

3.2.3. Clustering and structuring documents
Clustering is an important step in the KE module for the purpose of structuring documents. To transform the documents into a set of structural documents, the algorithm for similarity matrix updating in hierarchical clustering (Jain & Dubes, 1988) is used. In hierarchical clustering, several approaches to measuring the similarity of clusters are available, including the single-link and the complete-link methods. The clustered hierarchy generated by the complete-link method seems to be more balanced than the one generated by the single-link method, so the complete-link method is implemented in this work. To represent the knowledge in the clustering process, the SD and its notation are formally defined as follows.

Definition 3.1. The SD with level l is defined recursively as follows:

• A document Di is an SD with level 0, denoted as SDi, which can also be denoted as (₀Di)₀.

• A pair of SDs, denoted as (ₗSDi, SDj)ₗ, with the greatest similarity measure among all the different pairs, where that measure is greater than a threshold θ, is also an SD with level l, where l = Maximum(m, n) + 1, m is the level number of SDi and n is the level number of SDj.

Definition 3.2. The similarity S′ between the structural documents SDi and SDj is defined as follows:

1. If SDi = (Di) and SDj = (Dj), the similarity S′(SDi, SDj) = S(i, j).

2. The similarity between (SDi, SDj) and SDk is defined as S′(SDk, (SDi, SDj)) = Q{S′(SDk, SDi), S′(SDk, SDj)}, where the operator Q means ‘minimum’ for the complete-link method, or ‘maximum’ for the single-link method.

Assume we have n documents; an n × n similarity matrix can then be generated by our method. First, each document is assigned to be an SD, i.e. SDi = (Di) for document Di. The similarity matrix for documents is then taken as the initial similarity matrix for SDs: the element S′ij of the similarity matrix [S′ij] is the similarity measure of the structural documents SDi and SDj, i.e. S′ij = Sij. Let C = {SDi} be the set of all SDi. Based upon Johnson's algorithms (1967) for hierarchical clustering (Jain & Dubes, 1988), we propose the following procedure for updating the similarity matrix.

Algorithm 3.1.

Step 1: Find the most similar pair of structural documents in the current similarity matrix, say the pair {p, q}, where S′p,q = Maximum{S′i,j, for any i, j}.
Step 2: Merge structural documents SDp and SDq into a new single structural document (SDp, SDq).
Step 3: Delete the structural documents SDp and SDq from the set C, and insert the new structural document (SDp, SDq) into the set, i.e. C ← C − {SDp} − {SDq} + {(ₗSDp, SDq)ₗ}, where l = Maximum(m, n) + 1, m is the level number of SDp and n is the level number of SDq.
Step 4: Update the similarity matrix by deleting the rows and columns related to structural documents SDp and SDq, and adding a row and a column corresponding to the new structural document (SDp, SDq).
Step 5: If there are no two structural documents with similarity greater than θ, stop. Otherwise, go to Step 1.

After clustering, the hierarchy of structural documents can be obtained according to the set C.

3.3. An example

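Let θ = 0.1, and assume we have six documents, {D1, D2, D3, D4, D5, D6}, with the following similarity matrix:

$$
\begin{array}{c|cccccc}
 & D_1 & D_2 & D_3 & D_4 & D_5 & D_6 \\
\hline
D_1 & 0 & 0.8 & 0.4 & 0.5 & 0.4 & 0.1 \\
D_2 &   & 0   & 0.5 & 0.6 & 0.3 & 0.2 \\
D_3 &   &     & 0   & 0.7 & 0.8 & 0.1 \\
D_4 &   &     &     & 0   & 0.9 & 0.3 \\
D_5 &   &     &     &     & 0   & 0.2 \\
D_6 &   &     &     &     &     & 0
\end{array}
$$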

Before clustering, each document is assigned to be a structural document, i.e. SDi = (₀Di)₀ for document Di. The set of all SDi is written as C = {SD1, SD2, SD3, SD4, SD5, SD6}, and the initial similarity matrix for SDs is generated as:

$$
\begin{array}{c|cccccc}
 & SD_1 & SD_2 & SD_3 & SD_4 & SD_5 & SD_6 \\
\hline
SD_1 & 0 & 0.8 & 0.4 & 0.5 & 0.4 & 0.1 \\
SD_2 &   & 0   & 0.5 & 0.6 & 0.3 & 0.2 \\
SD_3 &   &     & 0   & 0.7 & 0.8 & 0.1 \\
SD_4 &   &     &     & 0   & 0.9 & 0.3 \\
SD_5 &   &     &     &     & 0   & 0.2 \\
SD_6 &   &     &     &     &     & 0
\end{array}
$$

In the first iteration, the structural documents SD4 and SD5 are merged into (SD4, SD5), as the element in the 4th row and 5th column is the maximum. The set of all SDi is now C = {SD1, SD2, SD3, (₁SD4, SD5)₁, SD6}. After the similarity matrix is updated in Step 4, we get the new similarity matrix:

$$
\begin{array}{c|ccccc}
 & 1 & 2 & 3 & 4,5 & 6 \\
\hline
1   & 0 & 0.8 & 0.4 & 0.4 & 0.1 \\
2   &   & 0   & 0.5 & 0.3 & 0.2 \\
3   &   &     & 0   & 0.7 & 0.1 \\
4,5 &   &     &     & 0   & 0.2 \\
6   &   &     &     &     & 0
\end{array}
$$

In the second iteration, we merge the structural documents SD1 and SD2 into (₁SD1, SD2)₁, as the element in the 1st row and 2nd column is the maximum. We have the new C = {(₁SD1, SD2)₁, SD3, (₁SD4, SD5)₁, SD6}. After the similarity matrix is updated, the new similarity matrix is:

$$
\begin{array}{c|cccc}
 & 1,2 & 3 & 4,5 & 6 \\
\hline
1,2 & 0 & 0.4 & 0.3 & 0.1 \\
3   &   & 0   & 0.7 & 0.1 \\
4,5 &   &     & 0   & 0.2 \\
6   &   &     &     & 0
\end{array}
$$

Similarly, we merge the structural documents SD3 and (₁SD4, SD5)₁ into (₂SD3, (₁SD4, SD5)₁)₂ in the third iteration, since the element in the 2nd row and 3rd column is the maximum. We have the new C = {(₁SD1, SD2)₁, (₂SD3, (₁SD4, SD5)₁)₂, SD6}, and the new similarity matrix is:

$$
\begin{array}{c|ccc}
 & 1,2 & 3,4,5 & 6 \\
\hline
1,2   & 0 & 0.3 & 0.1 \\
3,4,5 &   & 0   & 0.1 \\
6     &   &     & 0
\end{array}
$$

In the last two iterations, we obtain C = {(₃(₁SD1, SD2)₁, (₂SD3, (₁SD4, SD5)₁)₂)₃, SD6} and then C = {(₄(₃(₁SD1, SD2)₁, (₂SD3, (₁SD4, SD5)₁)₂)₃, SD6)₄}.

According to the set C, the hierarchy of SDs can be obtained as shown in Fig. 6.
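The clustering procedure of Algorithm 3.1 can be sketched in Python as follows, using the six-document similarity values of this example. The sketch makes several assumptions of its own: structural documents are represented as nested pairs of document numbers, Definition 3.2 is evaluated by taking the minimum (complete-link) or maximum (single-link) over all document pairs instead of updating a matrix in place, and the threshold test of Step 5 is applied before each merge. All function names are hypothetical.

```python
from itertools import combinations

def docs_of(sd):
    """Document numbers inside a structural document (int or nested pair)."""
    if isinstance(sd, int):
        return [sd]
    return docs_of(sd[0]) + docs_of(sd[1])

def level(sd):
    """Level per Definition 3.1: 0 for a document, max(m, n) + 1 for a pair."""
    if isinstance(sd, int):
        return 0
    return max(level(sd[0]), level(sd[1])) + 1

def sd_similarity(a, b, S, link=min):
    """Definition 3.2 unrolled: min is complete-link, max is single-link."""
    return link(S[(min(i, j), max(i, j))] for i in docs_of(a) for j in docs_of(b))

def cluster(n_docs, S, theta, link=min):
    """Merge the most similar pair while its similarity is greater than theta."""
    C = list(range(1, n_docs + 1))          # each document starts as a level-0 SD
    while len(C) > 1:
        # Step 1: most similar pair of positions (p < q) in the current set
        p, q = max(combinations(range(len(C)), 2),
                   key=lambda pq: sd_similarity(C[pq[0]], C[pq[1]], S, link))
        if sd_similarity(C[p], C[q], S, link) <= theta:
            break                            # Step 5, tested before merging
        C[p] = (C[p], C[q])                  # Steps 2-3: insert the merged pair
        del C[q]                             # ... and remove the old entry
    return C

S = {(1, 2): 0.8, (1, 3): 0.4, (1, 4): 0.5, (1, 5): 0.4, (1, 6): 0.1,
     (2, 3): 0.5, (2, 4): 0.6, (2, 5): 0.3, (2, 6): 0.2,
     (3, 4): 0.7, (3, 5): 0.8, (3, 6): 0.1,
     (4, 5): 0.9, (4, 6): 0.3,
     (5, 6): 0.2}

print(cluster(6, S, theta=0.05))
```

Because the sketch uses a strict "greater than" test, the call with `theta=0.05` yields a single level-4 structural document with the same nesting as the final set C above, whereas `theta=0.1` stops one merge earlier, since the similarity between SD6 and the rest is exactly 0.1.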

4. Inference process based on the structural documents

After the knowledge extraction process of the previous section, the set C of SDs has been built for retrieving documents. Based on the SDs, more suitable results can be inferred rapidly by an inference engine. The flow of the inference process is shown in Fig. 7.

The query result R is generated by some search engine. However, in most cases the result does not fit the user's demand well: it contains either too many or too few documents. Two algorithms are designed to solve this problem. In this section, the inference process is described in detail. First, we state the notation and definitions used in the following algorithms.

To represent a set of documents easily, a data structure bit_map is defined as:

bit_map = b1b2…bn,
where bk = 1 if document Dk belongs to the set of documents, and bk = 0 otherwise.

Based on the data structure bit_map, a query result, i.e. a set of documents generated by some search engine, can be written as:

R = b1b2…bn,
such that bk = 1 if document Dk belongs to the query result, and bk = 0 otherwise.

In the format of SDs, a pair of structural documents (SDi, SDj) means that the similarity between SDi and SDj is greater than that between either of them and the other SDs. Hence, for any structural document SDj, the most similar structural document SDk is found once the structural document SDi that is the pair (SDj, SDk) has been found. To describe this idea, immediate successor and successor are defined as follows.

Definition 4.1. A structural document SDi is said to be an immediate successor of a structural document SDj if SDi = (ₗSDj, SDk)ₗ or SDi = (ₗSDk, SDj)ₗ for some structural document SDk and level l.

Definition 4.2. A structural document SDi is said to be a successor of a structural document SDj if SDi is an immediate successor of SDj, or if there exists an immediate successor SDk of SDj such that SDi is a successor of SDk.

The key process, finding the immediate successor, directly scans the set C of structural documents and is described in the following algorithm.


Fig. 6. The hierarchy of structural documents according to the clustering
result.

Fig. 7. The flowchart of the inference process.


Algorithm 4.1 (Find_immediate_successor).

Input: C, SDi
Output: The immediate successor of SDi
Step 1: Find SDi in the set C.
Step 2: Check the next element after SDi; two cases are possible.

Case 1: The next element after SDi is “,”.
Find the structural document SDj, for some j > 0, in the element following “SDi,” and return (ₖSDi, SDj)ₖ, for k = MAX(i, j) + 1.

Case 2: The next element after SDi is “)ₗ” for some l.
Find “(ₗ” in the elements preceding SDi and return (ₗSDj, SDi)ₗ, for some structural document SDj.

In Step 2 of Algorithm 4.1, two cases are possible. In Case 1, the output is (SDi, SDj) for some SDj ∈ C, and in Case 2, the output is (SDj, SDi) for some SDj ∈ C. By Definitions 3.1 and 4.1, it can be easily seen that the output is the immediate successor of SDi.

Now, a function bit_set is defined to convert the structural document SDi into a string of bits, such that the kth bit is 1 if the document Dk belongs to SDi, and 0 otherwise.

Example 4.1. Assume the number of all documents is 5. Then bit_set((((D1), (D2)), (D3))) = “11100”.

The ith cover for all n documents is defined, following the definition of bit_map, as:

Ci = b1b2…bn

Algorithm 4.2.

Input: Any structural document SDi, number k
Output: Ck, Ck+1, …, Ck+m, where the number m is the minimum integer such that R ≤ Ck+m, i.e. every document in R is covered by Ck+m
Procedure Cover(k, SDi)

SDj ← Find_immediate_successor(C, SDi)
Ck ← bit_set(SDj)
If NOT_EQUAL(R, AND(R, Ck)) Then Call Cover(k + 1, SDj)

End_Procedure

Example 4.2. Assume we have six documents, the hierarchy of SDs is as shown in Fig. 6, and C = {(((SD1, SD2), (SD3, (SD4, SD5))), SD6)}. Now, for a user's query, the database returns the result {D2, D3, D4}, so R = 011100. When Cover(2, (D3)) is called, we get C2 = 001110 and C3 = 111110.
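The functions bit_set, Find_immediate_successor and Cover can be sketched in Python and checked against Examples 4.1 and 4.2. The sketch assumes that the hierarchy is held as nested pairs of document numbers, so the immediate successor is found by walking the nesting instead of scanning a textual form of C, and the recursion of Cover is written as a loop. These representation choices and the lowercase function names are illustrative only.

```python
def docs_of(sd):
    """Document numbers inside a structural document (int or nested pair)."""
    if isinstance(sd, int):
        return [sd]
    return docs_of(sd[0]) + docs_of(sd[1])

def bit_set(sd, n):
    """Bit string of length n whose kth bit is 1 if Dk belongs to sd."""
    members = set(docs_of(sd))
    return "".join("1" if k in members else "0" for k in range(1, n + 1))

def find_immediate_successor(root, sd):
    """Return the pair in `root` that directly contains `sd`, or None."""
    if isinstance(root, int):
        return None
    if sd in root:                      # sd is the left or right member
        return root
    for child in root:
        found = find_immediate_successor(child, sd)
        if found is not None:
            return found
    return None

def cover(root, sd, R, n, k=2):
    """Covers Ck, Ck+1, ... of sd, stopping at the first one that contains R."""
    covers = {}
    while True:
        sd = find_immediate_successor(root, sd)
        if sd is None:
            break                       # reached the top of the hierarchy
        covers[k] = bit_set(sd, n)
        if all(c == "1" for r, c in zip(R, covers[k]) if r == "1"):
            break                       # R equals AND(R, Ck)
        k += 1
    return covers

root = (((1, 2), (3, (4, 5))), 6)       # hierarchy of Fig. 6
print(bit_set(((1, 2), 3), 5))          # Example 4.1
print(cover(root, 3, "011100", 6))      # Example 4.2
```

Starting from document D3 with R = 011100, the loop visits (SD3, (SD4, SD5)) and then ((SD1, SD2), (SD3, (SD4, SD5))), which is the first successor covering all of R.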

Algorithm 4.3. /* The users want to get a set of documents which can be deleted from the query result. */

Input: The query result R, C
Output: The set of documents which can be deleted from the query result.
Procedure Less

Select a Di according to R

SDj ← (Di)
C1 ← {Di}

Call Cover(2, SDj)
Case COUNT_ONE(AND(R, Ck−1)) − COUNT_ONE(AND(R, SUB(Ck, Ck−1)))
< 0: RETURN AND(R, Ck−1)
> 0: RETURN AND(R, SUB(Ck, Ck−1))
= 0: RETURN AND(R, Ck−1) or AND(R, SUB(Ck, Ck−1)), determined by the user

End_Case

End_Procedure


Fig. 8. The architecture of CPRCS.

Fig. 9. The main window of the CPRCS.


Algorithm 4.4. /* The users want to get a set of documents which could offer more information. */

Input: The query result R, C
Output: A set of documents which could offer more information.
Procedure More

Select a Di according to R

SDj ← (Di)
C1 ← {Di}

Call Cover(2, SDj)
While COUNT_ONE(AND(R, Ck)) ≠ 0 do

Case COUNT_ONE(AND(R, Ck−1)) − COUNT_ONE(AND(R, SUB(Ck, Ck−1)))
< 0: R ← AND(R, SUB(Ck, Ck−1)) and Call More
> 0: R ← SUB(R, AND(R, SUB(Ck, Ck−1))), k ← k − 1
= 0: RETURN AND(R, Ck−1) or AND(R, SUB(Ck, Ck−1)), determined by the user, and stop.

End_Case
RETURN Ck

End_While

End_Procedure

Example 4.3. Assume we have six documents, the hierarchy of structural documents is as shown in Fig. 6, and C = {(((SD1, SD2), (SD3, (SD4, SD5))), SD6)}. Now, for a user's query, the database returns the result {D3, D4, D6}. For this result, the IQA generates the corresponding structural documents {SD3, SD4, SD6}. When the user finds the result insufficient, the IQA may generate the set {SD3, SD4, SD5, SD6} and ask the database to return the document {D5}. When the user again finds the result insufficient, the IQA may generate the set {SD1, SD2, SD3, SD4, SD5, SD6}, and the documents {D1, D2} are returned.

On the other hand, if the user feels that the result {D3, D4, D6} contains too many documents, the IQA generates the set {SD3, SD4} and deletes the document {D6} from the query result. If the user again feels that the result contains too many documents, the IQA generates the set {SD4} and deletes the document {D3} from the query result.

Example 4.4. For another user's query, the database returns the result {D3, D4, D5}. If the user feels that the result contains too many documents, the IQA generates the set {SD4, SD5} and deletes the document {D3} from the query result.

Example 4.5. For another user's query, the database returns the result {D2, D3, D4}. If the user feels that the result contains too many documents, the IQA generates the set {SD3, SD4} and deletes the document {D2} from the query result.
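The reduction cases of Examples 4.3 to 4.5 can be traced with a small Python sketch of the Less procedure (Algorithm 4.3). It rests on assumptions that the pseudocode leaves open: the document selected "according to R" is taken to be the lowest-numbered one, Python sets stand in for the bit strings, and in the tie case both candidate sets are returned so that the user can choose. The function names are hypothetical.

```python
def docs_of(sd):
    """Document numbers inside a structural document (int or nested pair)."""
    if isinstance(sd, int):
        return [sd]
    return docs_of(sd[0]) + docs_of(sd[1])

def parent_of(root, sd):
    """Immediate successor of sd inside the hierarchy `root`, or None."""
    if isinstance(root, int):
        return None
    if sd in root:
        return root
    for child in root:
        found = parent_of(child, sd)
        if found is not None:
            return found
    return None

def less(root, result):
    """Candidate sets of documents to delete from the query result.

    Assumes every document in `result` occurs in `root`. Returns one set,
    or two sets when both sides are equally large (user decides).
    """
    sd = min(result)                    # assumed selection rule for Di
    prev = {sd}                         # C1
    while True:
        sd = parent_of(root, sd)
        cur = set(docs_of(sd))          # Ck
        if result <= cur:               # Ck covers R: stop climbing
            break
        prev = cur                      # becomes Ck-1
    inner = result & prev               # AND(R, Ck-1)
    outer = result & (cur - prev)       # AND(R, SUB(Ck, Ck-1))
    if len(inner) < len(outer):
        return [inner]
    if len(inner) > len(outer):
        return [outer]
    return [inner, outer]

root = (((1, 2), (3, (4, 5))), 6)       # hierarchy of Fig. 6
print(less(root, {3, 4, 6}))            # first reduction in Example 4.3
print(less(root, {3, 4, 5}))            # Example 4.4
print(less(root, {2, 3, 4}))            # Example 4.5
```

Under these assumptions the three calls propose deleting {D6}, {D3} and {D2} respectively, in agreement with the examples; for the result {D3, D4} the two sides tie, and the sketch returns both {D3} and {D4} for the user to decide.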

5. Implementation

A sound legal system and complete regulations are of great importance for a government ruled by law. At present, the personnel regulations (PR) for civil servants in Taiwan are very complicated. Although several kinds of reference books about personnel regulations have been provided for the general public to consult, they are not easy to use and a lot of access time is required. Therefore, improving the methods of inquiry and annotation of the personnel regulations has become an important issue. Besides, by comparing the experimental results of the traditional database model and our new model, we may verify the performance of the model based on SDs for PR document databases. We implemented a database prototype, named CPRCS (Chinese Personnel Regulations Consultation System), for PR documents and used the IQA to assist the querying process. The CPRCS architecture follows the system structure of Fig. 4 and is shown in Fig. 8.

In CPRCS, users operate the system through a web-page interface, which can easily be extended. Two different modes, with or without IQA, are provided for users to choose from. By observing the operating process of users, we found that their query process is simplified. Fig. 9 shows the main window of the system.

Fig. 10. The number of structural documents versus similarity. For instance, the number of structural documents is 224 when the similarity between any two documents in the same SD is greater than 0.7.

Furthermore, the relation between the chapter hierarchy and the number of SDs is examined in the following experiment. There are 365 documents in the target database for the experiment. The documents are divided into 11 chapters, and the total number of sections is 171. After processing by the KE module, the number of SDs at different similarity values is shown in Fig. 10.

The figure shows the merging situation in the clustering process. The following results can be observed from the figure. When the similarity value is between 0 and 0.2, the number of structural documents is about 11, which is the number of chapters. Similarly, when the similarity value is between 0.4 and 0.7, the number of structural documents is about 171, which is the total number of sections. This indicates that the hierarchy of structural documents is similar to the chapter/section hierarchy of the book. The structural information of the documents can thus be acquired from the database using the KE module.

6. Conclusion

Classical information retrieval usually allows little structuring. However, structural information is useful in querying a document database; for instance, most people read books with a chapter-oriented concept. The document structure, including the index and the table of contents, can be regarded as the expertise and also as the target of knowledge acquisition. Since querying a database for document retrieval is often a process close to querying an answering expert system, in this work we apply ES techniques to build the IQA. In building the IQA by the ES approach, we are concerned with the construction of the knowledge base, including the knowledge representation and the method of knowledge acquisition. Therefore, a new knowledge representation, named SDs, is defined to construct the acquisition model for IQA, and the KE module is proposed to transform the data of the database into the knowledge stored in IQA. To evaluate the convenience of IQA, an intelligent retrieval system, CPRCS, is implemented. By observing the operating process of users, we found that their query processes are simplified. Besides, the experimental result has shown that the structural information of the documents can be acquired from the database using the KE module.

Future research will focus on several areas. First, better similarity measures are necessary for increasing the performance of clustering. The analysis of the influence of different clustering methods is not covered in this work; the general formula proposed by Jain and Dubes (1988), which includes most of the common hierarchical clustering methods, is the basis for future work. Another significant direction is to extend the model based on SD so that different kinds of documents can be merged into the same knowledge structure, in order to increase the practicability of the system.

References

Celentano, A., Fugini, M. G., & Pozzi, S. (1995). Knowledge-based document retrieval in office environments: the Kabiria system. ACM Transactions on Information Systems, 13 (3), 237–268.

Giarratano, J., & Riley, G. (1993). Expert systems, 2. PWS Publishing Company.

Hwang, G. J., & Tseng, S. S. (1990). EMCUD: A knowledge acquisition method which captures embedded meanings under uncertainty. International Journal of Man-Machine Studies, 33, 431–451.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice Hall, pp. 58–86.

Jiang, M. F., Tseng, S. S., & Tsai, C. J. (1999). Discovering structure from document databases. The Third Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD-99, Beijing, China.

Liang, T. (1995). The study of character-based signature methods in Chinese text retrieval. PhD thesis, National Chiao Tung University, Taiwan.

Navarro, G., & Baeza-Yates, R. (1997). Proximal nodes: a model to query document databases by content and structure. ACM Transactions on Information Systems, 15 (4), 400–435.

Riecken, D. (1994). Intelligent agent. Communications of the ACM, 37 (7), 18–21.

Sproat, R., & Shih, C. (1990). A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese and Oriental Languages, 4 (4), 336–351.
