College and Research Libraries

Government Information Expert Systems: A Quantitative Evaluation

John V. Richardson Jr. and Rex B. Reyes

In this article-the first published quantitative evaluation of knowledge-based systems (KBS) or so-called expert systems-the authors quantitatively compare and contrast two systems: POINTER and Government Documents Reference Aid (GDRA). In a test based on fifteen typical U.S. government document reference questions about the federal level of government, POINTER answered 65 percent of the questions correctly while GDRA answered only 37 percent correctly. An analysis of keystroke efficiency revealed that POINTER required 120 strokes in the reference interview and 60 for the question negotiation phase while GDRA needed 120 keystrokes in the reference interview but only 45 during its question negotiation. The discussion and implication section should help developers of knowledge-based computer systems focus their future activities in this area and reassure human reference librarians who work with government information that these systems still have a way to go before they are truly competent systems. Nonetheless, the first generation of expert systems for depository libraries could already be playing a widespread, if modest, role in assisting with federal level reference questions.

Writing in 1964, Jesse Shera argued that the "fullest utilization of the potential of automation [such as expert systems in reference work] necessitates a thorough study of the total reference process-from the problems that prompt the asking of a question to the evaluation of the response."1 Hence, the overarching goal of the following study is to contribute to the profession's understanding of the total process by evaluating reference question responses in the field of government information.

There are several microcomputer knowledge-based or so-called expert systems whose coverage includes the field of government information. Of the systems that specifically emphasize this area of specialization, the best known is POINTER, which was developed by Karen F. Smith at SUNY, Buffalo, in 1984. Four years later on the West Coast, Bruce Harley and Patricia Knobloch developed Government Documents Reference Aid (a.k.a. GDRA) at Stanford University. In each case, the computer system attempts to answer reference questions much the way a government documents specialist might-by referring the user to a single source or even several sources that are thought likely to contain the answer.

John V. Richardson Jr. is Associate Professor at the Graduate School of Library and Information Science at the University of California, Los Angeles, 300 Circle Drive North, Suite 204, 405 Hilgard Avenue, Los Angeles, California 90024. He can be reached at (310) 206-9369 or via Internet at IBQ1JVR@MVS.OAC.UCLA.EDU. Rex B. Reyes is Reference Librarian at Western State University, College of Law, Fullerton, California 92631. The authors wish to thank Terry Crowley of San Jose State University for informally discussing the methods, results, and implications of this study as well as reviewing an early draft of the article; Zorana Ercegovac of UCLA for her discussion of response scoring; and Matthew Schall of Tulane University's Department of Psychology (formerly of UCLA's Office of Academic Computing) for statistical consulting.
To the best of the authors' knowledge, these are the only available systems in government information.2 Although the first of these computer systems has been extant for ten years, no one has examined systematically the quality or accuracy of these systems. Harley and Knobloch imply that their system is expert while Smith is careful to qualify user expectations of her system: "POINTER is not an expert system. POINTER is a computer-assisted reference program-inspired by expert system developments of the recent past, and aspiring to be upgraded to a real expert system in the future."3 Nonetheless, it is not clear how much assistance users can expect from these systems nor how much future development work may be necessary for these systems to be truly expert in the human sense.

Hence, the authors believe that a quantitative evaluation of the quality of these extant systems needs to be undertaken. To the best of their knowledge, no such published study exists; hence, this article is an original contribution to understanding the nature of expertise in these systems. As a result, the authors have established a method for benchmarking these systems for the first time. Using this methodology, readers can judge for themselves the technological promise of expert systems.

SYSTEMS, EVALUATION ISSUES, AND GENERAL REFERENCE STUDIES

To keep this research project within manageable limits, the scope of this study involves the following three dimensions: (1) the extant microcomputer systems, (2) evaluation issues, and (3) thirty years of reference quality studies.

Extant Microcomputer Systems

An expert system can be defined as: "a program that relies on a body of knowledge to perform a somewhat difficult task usually performed only by a human expert. The principal power of an expert system is derived from the knowledge the system embodies rather than from search algorithms and specific reasoning methods. An expert system successfully deals with problems for which clear algorithmic solutions do not exist."4 The basic assumption underlying expert systems is the idea that "Knowledge Is Power."5

As mentioned above, there are two expert systems that focus on the field of government information: (1) POINTER and (2) Government Documents Reference Aid (GDRA). For the purposes of subsequent discussion, the authors prefer the phrase knowledge-based systems because it more accurately describes the state of the art at this point.

POINTER.6 Developed between 1984 and 1987 by Karen F. Smith, documents librarian at the Lockwood Library of SUNY, Buffalo; Stuart Shapiro, a SUNY Buffalo faculty member; and Sandra Peters, a computer science student, this system was originally written in LISP for a VAX minicomputer. It now runs on an IBM PC with a minimum of 256K RAM and a disk drive; there are 6,000 lines of BASIC code. The program is menu-driven and includes about 130 screens of text.
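To make this menu-driven design concrete, the following is a minimal, hypothetical sketch (in Python rather than POINTER's BASIC) of how a screen-and-menu referral program of this general kind can be organized. The screen wording, menu options, recommended titles, and keystroke counting below are illustrative placeholders, not POINTER's actual content or code.

```python
# Hypothetical sketch of a menu-driven referral aid in the spirit of POINTER/GDRA;
# every screen, option, and source title here is an invented example.

SCREENS = {
    "start": {
        "text": "Do you have a SuDoc call number? (yes/no/unsure)",
        "options": {"yes": "shelves", "no": "menu", "unsure": "menu"},
    },
    "shelves": {"text": "Take the call number directly to the documents collection.",
                "options": {}},                     # terminal screen = a referral
    "menu": {
        "text": ("What kind of question do you have?\n"
                 "  1. Who represents me in Congress?\n"
                 "  2. I need a statistic.\n"
                 "  3. I am looking for a known report."),
        "options": {"1": "directory", "2": "statistics", "3": "catalog"},
    },
    "directory": {"text": "Try: Official Congressional Directory.", "options": {}},
    "statistics": {"text": "Try: Statistical Abstract; American Statistics Index.",
                   "options": {}},
    "catalog": {"text": "Try: Monthly Catalog of U.S. Government Publications.",
                "options": {}},
}

def consult(answers):
    """Walk the screen tree using a scripted list of user replies."""
    screen, keystrokes = "start", 0
    while True:
        node = SCREENS[screen]
        print(node["text"])
        if not node["options"]:        # a referral has been reached
            return keystrokes
        reply = answers.pop(0).strip().lower()
        keystrokes += len(reply) + 1   # crude count: typed characters plus Enter
        screen = node["options"].get(reply, screen)

if __name__ == "__main__":
    print("keystrokes used:", consult(["no", "1"]))
```

In such a sketch the terminal screens play the role of the final list of suggested sources; everything before them corresponds to the preliminary phase and question negotiation described next.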
The developers' work is based on an analysis of 1,071 queries in the university library's documents department; the program took four months to develop with a $6,000 investment, including a $3,000 Council on Library Resources' Faculty and Librarian Cooperative Research Grant.7 The perceived benefits of this particular system are twofold: (1) a solution to lack of staff, and (2) a training tool for student assistants and clerical staff.

User interaction (i.e., the reference interview) with POINTER begins with a welcome screen, asking whether the user wants to continue. If the user types yes or y, the next screen describes the organization of the collection according to the SuDoc classification scheme. The system asks the user if s/he has a SuDoc number; the three acceptable responses are: yes, no, or unsure. "Yes" refers the user directly to the collection. "No" or "Unsure" provides more information about SuDoc numbers to help the user decide if s/he has a SuDoc number. Strictly speaking, the interaction thus far is not part of question negotiation. Nonetheless, the system forces the user to answer these questions as part of the preliminary phase of the reference interview.

Next, the user progresses to a menu screen with four options. This screen is the first in the real question negotiation phase. For the purposes of this study, the authors assume that the user does not have a SuDoc classification number. Question negotiation ends when the last screen of sources appears. The reference interview concludes after the user responds to the prompt "Do you have another question?" To exit the program, the user must press "control-break."

Government Documents Reference Aid (GDRA).8 Created in 1988 by Bruce Harley and Patricia Knobloch, then of Stanford University Libraries (SUL), GDRA was developed on an IBM AT using Level5 shell software, which employs production rules and backward chaining logic. The Payson J. Treat Fund provided $1,265 and its development required about four weeks. Its library environment is SUL, and it covers U.S. federal, state, and local as well as foreign and international (including United Nations) government publications.9 Special features provide that "both ASCII text files and an external program are directly accessed. The external program, Samson, provides the telecommunications link to Socrates, SUL's online catalog, activated from within GDRA's rule structure."10 There are four perceived benefits of GDRA: it "solve[s] the problem of increasing workload; contributes to the mainstreaming of government documents with SUL; helps train staff providing government documents reference service; and supplements existing government documents reference service."11

User interaction with GDRA progresses according to the following pattern: the Level5 shell screen appears; six options, selected by using the arrow keys, are presented. Novices will not know where to start; however, Info, Intro, or Main Menu are the most obvious choices. The correct place is the introduction module; the authors consider the reference interview to start here. The Main Menu is where question negotiation starts. During the question negotiation phase, there is a compulsory information screen about "U.S. Federal Documents" that discusses the SuDoc shelving arrangement at Stanford. Question negotiation ends when all the sources appear (i.e., the screen labeled "Subject, Author, Title").
The reference interview ends when the user presses the "F2" function key to return to the welcome screen. The "F10" function key allows the user to exit GDRA.

Evaluation Issues

Naturally, the question arises: What constitutes a good system? The quality of a knowledge-based system, much like human reference work, can be measured by a variety of factors, such as speed of response, subjective short- or long-term user satisfaction, the interface design (paralleling the question negotiation phase of the reference transaction), and, of course, the accuracy of responses. In this study, the authors investigate two aspects of quality: efficiency and accuracy. Further, the authors defined efficiency as the number of keystrokes that the user has to type. Operationally, the authors defined accuracy as the percentage of questions correctly answered out of a set of fifteen test questions. In this respect, since the authors presume that these systems are serving as surrogates for real reference librarians, it seems reasonable that the competence of such systems should be addressed in this manner.

Thirty Years of Reference Quality Studies: The Theoretical Bridge

The authors knew only what the general nature and extent of the extant microcomputer-based systems in government information were, and they wanted to know more about their quality. What the authors needed was a link between the known and the unknown. Thus, they propose to model this study of machine-based reference work on the prior thirty years of human reference quality studies. Of course, there have been some difficulties in undertaking such studies of accuracy; notably, the literature does not report the most frequently asked government information questions.12 Rather than undertake that subject as the focus of their work, the authors will adopt those questions that have already been worked out by other researchers. They assume that there is a kind of comparability with the studies of general reference quality because imbedded in many of their test questions are questions that, in fact, can be answered in documents departments and are documents-type questions.

KEY OBJECTIVES AND RESEARCH QUESTIONS

To be explicit, the three key research objectives of this article are: (1) to depict how well each knowledge-based system performs; (2) to compare and contrast each system; and (3) to test the null hypotheses laid out below. Logically, three research questions flow from these objectives: first, can the user get a correct answer from either POINTER or GDRA? Second, compared to each other, how well do POINTER and GDRA perform in percentage terms? Lastly, and most importantly, how well do they perform against reports of human reference experts? Answers to these questions can help knowledge-based system developers focus their activities and provide a method of benchmarking the state of the art in knowledge-based systems for government information.

PROVISIONAL HYPOTHESES

The authors propose the following two hypotheses, one about the system's accuracy and the other which addresses its efficiency. Together these hypotheses address the quality issue of a knowledge-based system for answering requests for government information.

The Accuracy Hypothesis

There is no difference between the performance of these two knowledge-based systems and the reported literature accuracy rate of 52 to 65 percent success in real reference settings.
The 13 percent variability occurs because unobtrusive studies have reported lower levels of success than obtrusive ones. Given the existence of this range, the authors contemplated establishing a similar confidence interval for these two knowledge-based systems under study by using strict or more liberal responses during each system's question negotiation session (see method section below).

More fundamentally, the authors believe that the present state of the art in this new technology is still first generation. While the authors are optimistic about the long-term future of this technology, they suspect that, at present, there is a serious need for further development work (essentially, more time and money needs to be spent in this area) for real results in knowledge-based systems that can deal with question answering.

The Efficiency Hypothesis

There is no difference in the efficiency between the two systems during the reference interview or question negotiation phase. As mentioned above, the authors defined efficiency as the number of keystrokes that the user had to type. By reference interview the authors mean the entire interaction with the knowledge-based system. The question negotiation phase is just the interaction which addresses the inquiry (i.e., not, as in POINTER, the background information on the SuDoc classification scheme or, as in GDRA, the description of the Stanford collection).

METHOD

This section addresses four concerns: (1) defining the population of test questions, (2) selecting a training set of three questions and drawing a representative, random sample of test questions, (3) modeling the obtrusive nature of prior studies, and (4) evaluating system response and scoring.

Population of Test Questions

As mentioned above, the authors based their own evaluation of these two knowledge-based systems upon the more than thirty years of general reference quality studies.13 From these numerous studies, the authors selected those that actually reported the real questions they used in measuring the quality of human reference service: Charles Bunge (1968), Thomas Childers (1971), Terry Crowley (1971), Jassim Jirjees (1983), Marcia Myers (1983), Charles McClure and Peter Hernon (1983 and 1987), and Kathy Way (1987).14 Interestingly, only the reported studies of McClure and Hernon focused solely upon federal government publication-type questions. From the remaining studies, the authors identified just those questions which could be answered using federal level government publications. The final pool consisted of eighty questions.

Training Set and Random Sample

The authors trained together on a set of three randomly selected test questions (i.e., Hernon and McClure's number 1; Jirjees' number 3; and Myers' number 2; see appendix A). The design was to act as two independent judges, reviewing the quality of each system. The authors worked independently with each system and then came back together to compare their findings. When differences emerged in recording the results, the authors reached a consensus by discussing how they interpreted each system's prompts and then the authors agreed on the proper path (those questions are marked in appendix B with an asterisk to indicate their initial disagreement). Next, the authors randomly selected a total of fifteen test questions devoted to the U.S. federal level based on each of the seven studies with each study proportionately represented.15,16
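The proportionate draw can be illustrated with a short sketch. The per-study pool sizes below come from note 16; the largest-remainder rounding rule and the random seed are assumptions added for illustration, so the allocation printed may differ slightly from the authors' actual selection (two questions from Bunge, two from Childers, one from Crowley, and so on), and the drawn numbers are positions within each study's pool of eligible questions rather than the original question numbers.

```python
import math
import random

# Eligible federal-level questions contributed by each source study (note 16).
pools = {
    "Bunge (1968)": 8,
    "Childers (1971)": 8,
    "Crowley (1971)": 4,
    "Jirjees (1983)": 9,
    "Myers (1983)": 4,
    "Way (1987)": 12,
    "McClure & Hernon (1983)": 20,
    "McClure & Hernon (1987)": 15,
}
SAMPLE_SIZE = 15
total = sum(pools.values())                     # the final pool of eighty questions

# Largest-remainder apportionment of the fifteen slots across the eight studies.
quotas = {s: SAMPLE_SIZE * n / total for s, n in pools.items()}
alloc = {s: math.floor(q) for s, q in quotas.items()}
for s in sorted(quotas, key=lambda s: quotas[s] - alloc[s],
                reverse=True)[: SAMPLE_SIZE - sum(alloc.values())]:
    alloc[s] += 1                               # hand out the leftover slots

# Simple random sampling without replacement within each study's pool.
random.seed(1995)                               # arbitrary seed, for repeatability
sample = {s: sorted(random.sample(range(1, pools[s] + 1), k))
          for s, k in alloc.items()}
for study, picks in sample.items():
    print(f"{study}: {len(picks)} question(s) -> {picks}")
```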
In terms of difficulty (i.e., time to answer a question), the authors assumed that each question was of equal difficulty.17

Modelling Prior Studies: Liberal versus Conservative Approach

The authors considered the obtrusive versus unobtrusive nature of the previous reference studies. They finally adopted one approach to the knowledge-based system interface and its question negotiation. Using the set of fifteen test questions, the authors were generous in their analysis. This liberal approach would be similar to the way a familiar user of government publications would respond to the environment. Such a user is willing to use a computer, read an entire screen full of information, and thoughtfully select menu items after considering all the options. This approach is the best case scenario. It more closely models the obtrusive nature of the previous research on reference quality. The authors wanted to see how capable these knowledge-based systems are in answering questions accurately.

Evaluating System Response and Scoring

At the outset the authors reviewed the accuracy scoring methods that have traditionally been used. Historically, many of the previous studies of reference quality have scored the results as a dichotomous variable-either the question was answered or not (i.e., most report the percentage of correct answers).18 Arguably, the ideal response for a fact-type question is a single source which contains the complete and correct answer. In this case, previous investigators often gave one point for the correct answer and no points for an incorrect one. Further, some used a test set of ten questions to make the math involved more straightforward. Obviously though, the real world of reference work is more complex than that-a range of responses is possible and extreme values can occasionally occur.19 So more recent investigators such as Cheryl Elzy, Alan Nourie, Wilf Lancaster, and Kurt Joseph (1991) have reconsidered this response variable; they implicitly recognize it as continuous.20 In this study, the authors explicitly recognized the response range as continuous in developing their own scoring method (see table 1).

Next, the authors assigned point values, creating an eight-point response scheme, and added qualitative judgments related to the level of service provided. In the authors' estimation, this scheme more adequately reflects reality. In fact, the above-named investigators agree with the authors that it would be appropriate to "give minus values to inappropriate referrals ... ,"21 but they did not do so in their particular study. The present authors do so because they believe that wrong answers significantly penalize users and create ill will. Hence, the authors' method does not artificially restrict the range of responses and takes into consideration the possibility of extreme values as well.

Finally, to measure efficiency, the authors counted keystrokes for both systems. They counted the total number of keystrokes from the beginning to the end of the interaction as the "reference interview." They counted the prompt "Do you have another question?" as the end of the reference interview for POINTER; for GDRA, the reference interview ended when the F2/F3 option appears, allowing the user to start at the beginning or just at the Main Menu. For question negotiation, the authors started from the numbered menu option in POINTER and from the Main Menu in GDRA.
They did not count the compulsory information screen in GDRA nor did the authors count the offer of help with the SuDoc classification scheme in POINTER. Statistical analysis was supported by SAS, Version 6.08, running on an IBM Series 9000/900 mainframe. During data screening, a univariate analysis confirmed: (1) data points are not missing, (2) data are not demonstrably nonnormal (as measured by skewness and kurtosis of less than two for the experimental variables), and (3) no data outliers except as discussed below.22

FINDINGS

The authors can confidently answer their first research question straight away-yes, the user can get a correct answer some of the time. However, the systems vary in their ability to do so.

TABLE 1
TAXONOMY OF SYSTEM'S POTENTIAL RESPONSES

Score   Range of System's Response                                        Service Quality
 5.0    Referred to a single source, complete and correct answer          Excellent
 4.0    Referred to several sources, one of which gave complete and       Very good
        correct answer
 3.0    Referred to a single source which does not lead directly to an    Good
        answer but serves as a preliminary source
 2.0    Referred to several sources, none of which leads directly to an   Satisfactory
        answer but one of which serves as a preliminary source
 1.0    No direct answer; referred to specific person/institution         Fair/poor
 0.0    No answer; no referral (e.g., I don't know)                       Failure
-1.0    Referred to a single inappropriate source                         Unsatisfactory
-2.0    Referred to several sources, none of which answers                Most unsatisfactory

Source: Suggested by Gers and Seward (1985) and Elzy, Nourie, Lancaster, and Joseph (1991).

POINTER Does a Better than Satisfactory Job

Overall, POINTER scored a total of 49 out of 75 possible points (or 65 percent of the federal level fact-type questions asked of it). The average score was 3.3 points per question. Based on table 1, that means POINTER is doing a good job in the authors' qualitative judgment. Parenthetically, see table 2 for the actual scores on each question. An analysis of efficiency (defined as the number of keystrokes) reveals that POINTER required 120 strokes during the reference interview and 60 for the question negotiation phase (see table 3). A Pearsonian correlation between POINTER's accuracy score and the total number of keystrokes for POINTER's question negotiation was -.237 (t = .88, df = 13, and p = .39). In other words, there is no significant correlation between more extensive question negotiation and higher accuracy in this knowledge-based system.
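For readers who want to verify the correlations, the following sketch (Python, assuming SciPy is available) recomputes the Pearson r and its t-test from the per-question scores and question-negotiation keystrokes transcribed from tables 2 and 3 below. It reproduces the reported coefficients up to rounding, although the exact two-tailed p for GDRA comes out smaller than the .0001 quoted from the SAS output.

```python
from math import sqrt
from scipy import stats

# Accuracy scores (table 2) and question-negotiation keystrokes (table 3), Q1-Q15.
pointer_scores = [4, 2, 2, 4, 4, 3, 2, 2, 4, 4, 2, 4, 4, 4, 4]
pointer_qn     = [4, 5, 2, 3, 4, 4, 7, 4, 4, 3, 4, 4, 3, 3, 6]
gdra_scores    = [2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, -1, 2, 2, 2]
gdra_qn        = [3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 4, 3, 3, 3]

def correlate(scores, keystrokes, label):
    """Pearson r between accuracy score and QN keystrokes, with its t statistic."""
    n = len(scores)
    r, p = stats.pearsonr(scores, keystrokes)   # p is the two-tailed value
    t = r * sqrt(n - 2) / sqrt(1 - r * r)       # t on n - 2 = 13 degrees of freedom
    print(f"{label}: {sum(scores)}/75 points ({100 * sum(scores) / 75:.2f}%), "
          f"r = {r:.3f}, t = {t:.2f}, p = {p:.4f}")

correlate(pointer_scores, pointer_qn, "POINTER")  # text reports r = -.237, t = .88, p = .39
correlate(gdra_scores, gdra_qn, "GDRA")           # text reports r = -.91, t = 7.7, p = .0001
```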
TABLE 2
SCORING OF POINTER AND GDRA ON THE FIFTEEN TEST QUESTIONS

Question        POINTER's Score   GDRA's Score   Total Possible
 1                     4                2               5
 2                     2                2               5
 3                     2                2               5
 4                     4                2               5
 5                     4                2               5
 6                     3                3               5
 7                     2                2               5
 8                     2                2               5
 9                     4                2               5
10                     4                2               5
11                     2                2               5
12                     4               -1               5
13                     4                2               5
14                     4                2               5
15                     4                2               5
Grand total           49               28              75
                 (65.33%)         (37.33%)          (100%)
Mean score per
question           3.266            1.866

TABLE 3
KEYSTROKE EFFICIENCY OF POINTER AND GDRA ON THE FIFTEEN TEST QUESTIONS

                     POINTER              GDRA
Question            RI      QN          RI      QN
 1                   8       4           8       3
 2                   9       5           8       3
 3                   6       2           8       3
 4                   7       3           8       3
 5                   8       4           8       3
 6                   8       4           7       2
 7                  11       7           8       3
 8                   8       4           8       3
 9                   8       4           8       3
10                   7       3           8       3
11                   8       4           8       3
12                   8       4           9       4
13                   7       3           8       3
14                   7       3           8       3
15                  10       6           8       3
Grand total        120      60         120      45
Mean               8.0     4.0         8.0     3.0
Median             8.0     4.0         8.0     3.0
Standard
deviation         1.25    1.25         .37     .37

Note: RI = reference interview; QN = question negotiation

GDRA Is Doing an Almost Satisfactory Job

GDRA scored a total of 28 out of 75 possible points (or 37 percent of the federal level fact-type questions asked of it). The average score was 1.9 points per question. Based on table 1, that means that GDRA is doing a nearly satisfactory job in the authors' qualitative judgment. For a detailed analysis of scoring by question, see table 2. GDRA needed 120 keystrokes in the reference interview but only 45 during its question negotiation. A Pearsonian correlation between GDRA's accuracy score and the total number of keystrokes for GDRA's question negotiation was -.91 (t = 7.7, df = 13, and p = .0001). This time, there is a significant correlation between more question negotiation and a lower score.

Comparison of the Two Systems

The second research question asked how these systems compared or contrasted. Neither system does an excellent job (i.e., earning five points in the scoring system), meaning that the user was referred to a single source that provided the complete and correct answer. Overall, though, POINTER is a better system for answering federal-level, fact-type government publication questions.

It may be useful to discuss particular questions where one system did much better or worse than the other. GDRA scored very poorly on question 12 (see Appendix B) because it recommended an inappropriate source and took more keystrokes in the reference interview as well as the question negotiation to achieve the wrong answer. The reason for this situation appears to be that the designers of GDRA did not anticipate users asking retrospective questions, specifically historical ones from the nineteenth century.

Hypotheses Testing

The first hypothesis proposed that there was no difference between the performance of these two knowledge-based systems and the reported literature rate of 52 to 65 percent success in real reference settings. The authors rejected the first part of this hypothesis. POINTER answered 65 percent of the test questions completely and accurately while GDRA answered only 37 percent of them. The second part of the hypothesis related their findings to the reported literature. POINTER matched the higher end of the reference studies while GDRA happened to match McClure and Hernon's 1983 reported findings about the performance of documents librarians.

Similarly, the authors rejected the second hypothesis that there is no difference in the efficiency between the two systems during the reference interview or question negotiation phase. POINTER required a total of 120 keystrokes (or 60 in the question negotiation phase) before recommending a source(s). On the other hand, GDRA also required 120 total keystrokes to answer the 15 test questions but only 45 in the question negotiation phase. In addition, there is an annoying inconsistency in the use of keystrokes during GDRA's interaction (e.g., sometimes one uses the function key while at other times it is the enter key that is used).

To test their qualitative observation that a modest increase in question negotiation doubles accuracy (i.e., POINTER scores 65 percent accuracy with 60 keystrokes versus GDRA's 37 percent with 45), the authors ran a logistic regression to model accuracy as a function of the knowledge-based system and the amount of question negotiation.23 The chi-square for model fit with 2 degrees of freedom is 13.24, p = .001. The association of predicted probabilities and observed responses is concordant 86.6 percent, discordant 8.6 percent, and ties 4.8 percent. The chi-square suggests the model does not fit the data very well while the association of predicted probabilities suggests it does. However, the power to detect significant differences is low and a larger N of test questions would be desirable in the future.
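Note 23 spells out how the logistic regression was set up: AVGSCR (actual score divided by potential score) is dichotomized into SCR, and SCR is then modeled on MACHINE (dummy coded 1 for POINTER, 0 for GDRA) and QN (question-negotiation keystrokes). The sketch below re-expresses that setup in Python with statsmodels rather than SAS PROC LOGISTIC; it follows the coding in note 23, but its fitted statistics should only be expected to approximate the published SAS values, and statsmodels does not report the concordant/discordant percentages at all.

```python
import numpy as np
import statsmodels.api as sm

# Per-question accuracy scores (table 2) and QN keystrokes (table 3).
scores  = {"POINTER": [4, 2, 2, 4, 4, 3, 2, 2, 4, 4, 2, 4, 4, 4, 4],
           "GDRA":    [2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, -1, 2, 2, 2]}
qn_keys = {"POINTER": [4, 5, 2, 3, 4, 4, 7, 4, 4, 3, 4, 4, 3, 3, 6],
           "GDRA":    [3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 4, 3, 3, 3]}

rows, outcome = [], []
for system in ("POINTER", "GDRA"):
    machine = 1 if system == "POINTER" else 0      # MACHINE dummy from note 23
    for score, qn in zip(scores[system], qn_keys[system]):
        avgscr = score / 5.0                       # actual score / potential score
        outcome.append(1 if avgscr > 0.5 else 0)   # SCR: high vs. low accuracy
        rows.append([machine, qn])

X = sm.add_constant(np.array(rows, dtype=float))   # columns: const, MACHINE, QN
fit = sm.Logit(np.array(outcome), X).fit(disp=False)
print(fit.summary(xname=["const", "MACHINE", "QN"]))
# Likelihood-ratio chi-square for the two-predictor model (compare 13.24 in the text):
print("model chi-square:", round(2 * (fit.llf - fit.llnull), 2))
```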
DISCUSSION AND IMPLICATIONS

Much of the preceding section treats the two knowledge-based systems (KBS) as a black box-i.e., mere input and output. More attention needs to be focused on the diagnostic issues; for example, why do these systems fail to perform at higher levels? Either system could score higher if it recommended fewer titles at the end of question negotiation. In an extreme case, POINTER recommended nine potentially relevant sources (for question numbers 1 and 15). The authors speculate that the naive user's confidence in the system's knowledge may be lessened by the large number of recommended titles. The authors' scoring system explicitly assumed that users want the single best source which completely and accurately answers their fact-type question.

Obviously, the two systems are still performing at a modest level, that is, they serve as reference systems (i.e., only referrals are given) rather than information systems (i.e., direct answers to the specific questions are given). Ideally, these systems should be able to give the user a direct answer to their question; this situation will most likely occur when these systems have a knowledge base similar to that of humans.

For the moment POINTER has a greater depth of knowledge about the federal level than does GDRA. To be a fully comprehensive system, POINTER ought to have GDRA's greater breadth of coverage. And, of course, in both of the systems under review, there is a substantial burden on the user rather than on the system.

Future Work

Subsequent investigations could take several directions in the future. One possibility is to make a more user-oriented evaluation of the knowledge-based systems. By that the authors mean that the typical user's accuracy as well as satisfaction with the interaction could be measured, either immediately or for the longer term; the authors hypothesize that it would be more in line with what the authors called a conservative approach (see above discussion).

Second, other useful work might involve the identification of the user's model of government information seeking or simply the user's model of the knowledge-based system.
Then, one could compare and contrast their model with others such as the one presented by the government information textbook authors.24

Third, Cherie Weil's pioneering work at the University of Chicago also raises questions about the relationship of a knowledge-based system and the human reference expert.25 Using 234 biographical sources, Weil found that while her knowledge-based system answered 10 out of 14 questions (71 percent) correctly and the human expert answered 11 out of 14 (79 percent) correctly, working together the human expert and the knowledge-based system could answer more questions correctly than either one working independently. Could the two KBS in this study serve a similar complementary support role for practitioners, especially general reference librarians who only occasionally answer government-publication-type questions?

A narrowly conceived line of future work would be a second pass through the fifteen test questions, taking a more strict or conservative approach, much as a naive user might. A naive user (i.e., one who knows relatively little about government publications or computer systems generally) might be willing to use a computer, but may not understand technical terms related to government information. Hence, the naive user might select, from a long menu, the first item that even looks applicable. In other words, s/he may not be willing to read an entire screen full of information. Such an approach may be said to emulate the unobtrusive approach.

Finally, the scope of analysis could be extended to other levels of government such as state, local, foreign, and international/UN.
At the present state of development, GDRA would excel POINTER at these other levels of government since POINTER only addresses the federal level.

CONCLUSIONS

This study has demonstrated that there is a need for improvement of knowledge-based systems in the government information field. For the purposes of subsequent research and discussion, the phrase knowledge-based systems should be used because it more accurately describes the present state of the art. The question of what role these systems should play needs to be examined in greater detail. Will knowledge-based systems be expected to serve the user in place of the reference librarian, or will they merely be used as supplementary help? The answer will depend on future study.

Whatever the case may be, there is certainly a need to improve aspects of these systems, such as the breadth and depth of the knowledge base. The authors' method of evaluating GDRA and POINTER can be replicated to judge the effectiveness of other knowledge-based systems, either in government information or in general question answering. The authors realize that there is still more research to be done regarding scoring techniques because quality and effectiveness may mean different things to different people. Because this study builds on the definitive studies of reference work, the authors believe their scoring method is a move in the right direction.

The authors believe that these knowledge-based systems have a place in the reference environment, especially in a time of budgetary constraints and staff shortages. In addition, at least one previous study demonstrates that the combination of a reference librarian and a KBS results in more accurate answers than either by themselves. When an overwhelming number of studies reveal that reference accuracy rates fall between 52 percent and 65 percent, automated solutions for the improvement of reference service certainly deserve further exploration.

REFERENCES AND NOTES

1. Jesse Shera, "Automation and the Reference Librarian," RQ 3 (July 1964): 3.
2. John Richardson, Knowledge-based Systems for General Reference Work: Applications, Problems, and Progress (San Diego: Academic, 1995).
3. Karen F. Smith, "POINTER: The Microcomputer Reference Program for Federal Documents," in Expert Systems in Libraries, ed. Rao Aluri and Donald E. Riggs (Norwood, N.J.: Ablex, 1990), 41.
4. Kamran Parsaye and Mark Chignell, Expert Systems for Experts (New York: Wiley, 1988), 1.
5. See Eliot Freidson, Professional Powers: A Study of the Institutionalization of Formal Knowledge (Chicago: Univ. of Chicago Pr., 1986) as well as Dennis H. Wrong, Power: Its Forms, Bases, and Uses (New York: Harper, 1979).
6. Karen F. Smith, Stuart Shapiro, and Sandra Peters, Final Report on the Development of a Computer Assisted Government Documents Reference Capability: First Phase (Buffalo: SUNY at Buffalo, 1984); Smith, "Robot at the Reference Desk?" College and Research Libraries 47 (Sept. 1986): 486-90; and "POINTER vs. Using Government Publications: Where's the Advantage?" Reference Librarian 23 (1988): 191-205. It is available for $30 from Karen Smith.
7. According to the sixth edition of the Directory of Government Documents Collections and Librarians (Bethesda, Md.: CIS, 1991), SUNY reports an extensive collection of federal and state materials and limited collections of local, international, and foreign documents.
8. Bruce L. Harley and Patricia J. Knobloch, "Government Documents Reference Aid: An Expert Systems Development Project," Government Publications Review 19 (Jan./Feb. 1991): 15-33.
9. According to the sixth edition of the Directory of Government Documents Collections and Librarians (Bethesda, Md.: CIS, 1991), Stanford holds an extensive collection of federal, international, and foreign documents and a moderate collection of state and local materials.
10. John Richardson, Knowledge-based Systems for General Reference Work (San Diego: Academic, 1995).
11. Ibid.
12. The first author consulted with Peter Hernon who has worked extensively in this area as well. He agreed that "there has never been a reported study done on the types of questions asked. There have been studies-with no reliability and validity indicators-of the questions asked at a general reference desk," correspondence dated Apr. 30, 1993.
13. Kenny Crews, "The Accuracy of Reference Services: Variables for Research and Implementation," Library and Information Science Research 10 (July/Sept. 1988): 331-56.
14. Charles A. Bunge, Professional Education and Reference Efficiency, Research Series No. 11 (Springfield, Ill.: Illinois State Library, 1968); abridged version of "Professional Education and Reference Efficiency" (Ph.D. diss., University of Illinois, 1967); Terence Crowley and Thomas Childers, Information Service in Public Libraries: Two Studies (Metuchen, N.J.: Scarecrow, 1971); Jassim M. Jirjees, "Telephone Reference/Information Services in Selected Northeastern College Libraries," in The Accuracy of Telephone Reference/Information Services in Academic Libraries: Two Studies (Metuchen, N.J.: Scarecrow, 1983);
Peter Hernon and Charles R. McClure, Improving the Quality of Reference Service for Government Publications, ALA Studies in Librarianship, No. 10 (Chicago: ALA, 1983) and Unobtrusive Testing and Library Reference Services (Norwood, N.J.: Ablex, 1987); Marcia J. Myers, "Telephone Reference/Information Services in Selected Northeastern College Libraries," in The Accuracy of Telephone Reference/Information Services in Academic Libraries: Two Studies (Metuchen, N.J.: Scarecrow, 1983); and Kathy A. Way, "Quality Reference Service in Law School Depository Libraries: A Cause for Action," Government Publications Review 14 (1987): 207-19.
15. In the history of reference quality, most studies have asked as few as ten questions while only a few have asked as many as twenty. Future work should consider the implication of small Ns; generally as N increases, so does sensitivity.
16. Because almost any question could be answered using a government publication, we tried to select only those obviously requiring such a source: i.e., questions requiring an official version, an authoritative source, or reliable statistical information. From Bunge's Appendix C, which lists eight government documents questions (i.e., 1, 2, 4, 7, 18, 20, 23, and 28), we randomly selected numbers 8 and 9; Childers had eight (i.e., 2, 5, 9, 11, 18, 22, 25, and 26), we selected number 4 and 5; Crowley had four (i.e., 1, 2-4, 7, and 8), we selected number 2; Jirjees' nine (i.e., 1, 6, 7, 12, 16, 17, 24, 28, and 34), we selected number 8 and 9; Way's twelve (i.e., 1, 3, 4, 5, 6, 8, 11, 12, 17, 18, 19, and 20), we took number 3 and 6; Myers' four (i.e., 1, 5, 7, and 13), number 7; McClure and Hernon's (1983, appendix A) listed twenty (i.e., all of them), we selected number 3, 9 and 17; and from McClure and Hernon (1987, appendix B), fifteen (i.e., all of them), we selected number 1 and 12.
17. We need more studies on the degree of difficulty issue. In 1967 Bunge asked 47 librarians, of whom 37 responded, to rate questions as "easier, average, or harder" than normal (see Professional Education, appendix B).
18. Crews, "The Accuracy of Reference Services," 331-56.
19. The issue of multiple sources is vexing. A user validation of the response scheme is highly desirable. For instance, we need to know the answer to the following questions: (1) Is the user more confident when he has more sources in hand, or (2) Is the user more satisfied when he has more sources in hand?
20. Cheryl Elzy, Alan Nourie, F. W. Lancaster, and Kurt M. Joseph, "Evaluating Reference Service in a Large Academic Library," College and Research Libraries 52 (Sept. 1991): 454-65.
21. Ibid., p. 458. One consequence of negative values is that if the data screening reveals that the distribution of this variable is not normal, then a constant may be added before undertaking logarithmic transformations. A. A. Afifi and Virginia Clark provide a clear discussion of this point as well as the "effect on the statistical properties of the transformed variable" in their Computer-Aided Multivariate Analysis, 2d ed. (New York: Van Nostrand Reinhold, 1990), 53.
22. The standard discussion of such statistical matters is covered in Vic Barnett and Toby Lewis, Outliers in Statistical Data, 2d ed. (New York: Wiley, 1984) or R. D. Cook, "Influential Observations in Linear Regression," Journal of the American Statistical Association 74 (1979): 169-74.
23. When the dependent variable is dichotomous (i.e., high accuracy versus low accuracy), a logistic regression is appropriate; see David W. Hosmer Jr. and Stanley Lemeshow, Applied Logistic Regression (New York: John Wiley and Sons, Inc., 1989). For our analysis, SCR = MACHINE QN, where AVGSCR = actual score/potential score and if AVGSCR > .5 then SCR = 1 and ELSE = 0. MACHINE is dummy coded 1 for POINTER and 0 for GDRA.
24. John Richardson Jr., "Paradigmatic Shifts in the Teaching of Government Publications, 1895-1985," Journal of Education for Library and Information Science 26 (Spring 1986): 249-66; reprint ed., Encyclopedia of Library and Information Science, vol. 44: 242-58.
25. Cherie B. Weil, "Classification and Automatic Retrieval of Biographical Reference Books" (master's thesis, University of Chicago, 1967); idem, "Automatic Retrieval of Biographical Reference Books," Journal of Library Automation 1 (Dec. 1968): 239-49. In fact, Weil said that she could not answer three of the four questions because she had exhausted her resources; her knowledge-based system found answers to those three questions in the same sources to which Weil had access.

APPENDIX A
Set of Three Training Questions

1. For a term paper in history, I am studying the Army's use of camels in the nineteenth century. It is my understanding that there is a government document, from the 1850s, on the topic. Please help me find it. (Hernon and McClure, 1983, #1)
2. I would like to know the name of a general who was forced to retire from the Army after twice publicly criticizing President Carter's military policies. I think the incident took place sometime around the middle of 1977. (Jirjees, #3)
3. When was George Washington given the title of General of the Armies of the United States? (Myers, #2)

APPENDIX B
Fifteen Test Questions

1. I would like the names and office addresses of the senators and representatives representing me in the federal legislature. I live in the downtown area of this city. (Bunge, #1)
   POINTER: Y, N, N, 3, 3 = Government Manual, Official Congressional Directory, FED, Congressional Staff Directory, and Government Documents Catalog.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
*2. How much more or less expensive is it for an average family to live in Chicago than it is in Atlanta? (Bunge, #18)
   POINTER: Y, N, N, 3, 2, N, 2 = American Statistics Index and Statistical Abstract.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
*3. Where is the nearest commercial airport to Rio Grande, Ohio? (Childers, #11)
   POINTER: Y, N, N, 4 = Maps (referral to same institution, but different department) plus Using Government Publications and Monthly Catalog and Government Documents Catalog.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
4. What is the salary of the President of the United States? (Childers, #22; assumptions: federal law)
   POINTER: Y, N, N, 3, 5 = United States Code.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
5. What is the name of the secretary of commerce? (Crowley, #2-4)
   POINTER: Y, N, N, 3, 3 = United States Government Manual.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
6. I need to know the percentage of persons below the poverty line in Colorado for the year 1975. (Jirjees, #28)
   POINTER: Y, N, N, 3, 2, Y = 1980 Census.
   GDRA: INTRO, F2, F2, STATS, US STATS, F2 = American Statistics Index.
*7. In 1977 the U.S.
Commission on Civil Rights released a report called Window Dressing on the Set. It's about the treatment of women and minorities on TV. Has the commission published any study to update that report since then? (Jirjees, #34; assumptions: subject approach; report, when do you stop-after checking every year since 1977)
   POINTER: Y, N, N, 3, 14, 7, Y, and 8 = Monthly Catalog and Cumulative Index 1981-85, 1976-1980.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
*8. I understand that the Caffeine Study Review Panel submitted its final report to the Food and Drug Administration on May 15, 1981. The report contains information pertinent to the FDA's review of the safety of added caffeine. I would like to know if the final report is available. (McClure and Hernon, 1983, #3; the authors deleted the remainder of this question.)
   POINTER: Y, N, N, 3, 14, 7 = Monthly Catalog and Cumulative Index.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
9. In February 1978, there was an FTC (Federal Trade Commission) staff report on television advertising to children, by Ellis M. Ratner and others. It recommended the elimination of "harms arising out of television advertising to children." Is it still in print? What is the cost? (McClure and Hernon, 1983, #9; assumptions: current date)
   POINTER: Y, N, N, 3, 14, 8 = Government Documents Catalog; Publications Reference File.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
10. Where can I get a detailed breakdown of the distribution of federal funds for research and development by agency? (McClure and Hernon, 1983, #17)
   POINTER: Y, N, N, 3, 4 = Catalog of Federal Domestic Assistance.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
11. What is the zip code for Behrend College in Erie, Pennsylvania? (Myers, #7; assumptions: inquirer does not want address and the college is not a government organization)
   POINTER: Y, N, N, 3, 14, 8 = Government Documents Catalog; Publications Reference File; Cumulative Index; Monthly Catalog.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
12. For a term paper in history, I am studying the laws on the imprisonment of free black seamen in the South prior to the Civil War. It is my understanding that the government published a report on the topic in the 1840s. (McClure and Hernon, 1987, #1; assumptions: laws = Congress)
   POINTER: Y, N, N, 3, 14, 1 = CIS Index or CIS US Serial Set Index.
   GDRA: INTRO, F2, F2, INFO, US, F2, F, CONG = CIS Index.
13. In 1980 a public law was enacted that provided universities and small business with the right to obtain patents for inventions which their faculties and staff created with the use of Federal funds. Please help me locate a copy of the law. (McClure and Hernon, 1987, #12)
   POINTER: Y, N, N, 3, 5 = U.S. Code and other titles.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
14. How many years must a U.S. magistrate have been a member of the bar prior to appointment? (Way, #4; assumptions: federal law)
   POINTER: Y, N, N, 3, 5 = United States Code or 6 = Code of Federal Regulations.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.
15. Did former President Ford appoint Barbara Walters and Katherine Hepburn to the National Commission on the Observance of International Women's Year? (Way, #17; assumptions: done by Executive Order)
   POINTER: Y, N, N, 3, 8 = Weekly Compilation of Presidential Documents or Public Papers of the Presidents.
   GDRA: INTRO, F2, F2, INFO, US, F2, T = Monthly Catalog or CIS Index.

Note: For POINTER, N = No, Y = Yes, and numbers are responses required at menu options. For GDRA, INTRO = Introduction, F2 = Continue, INFO = Information, US = United States, STATS = Statistics, and T = True.
* Indicates initial disagreement in interpreting the appropriate response to the system's question. Consensus, as reported in the appendix, was achieved after discussion.