Issues in Science and Technology Librarianship | Summer 2001
DOI:10.5062/F43F4MK3
Mariella Di Giacomo
Senior Software Engineer
mariella@lanl.gov
Library Without Walls Project
Los Alamos National Laboratory Research Library
Patrons of the Los Alamos National Laboratory Research Library asked for a single interface for searching the bibliographic and full-text databases made available by the library. Software developers created FlashPoint, a web-based interface that relays search queries to disparate databases. FlashPoint was written in Perl and searches eight different web-based bibliographic databases in parallel.
The Library Without Walls project at the Los Alamos National Laboratory (LANL) Research Library (RL) provides numerous locally housed bibliographic databases available via a web interface. Prior to 1997 two of these databases resided on one computer system while the others resided on another computer system. These different systems provided different interfaces and search capabilities. It was decided to migrate all of these databases to a similar system and search interface. In the meantime still another database system arrived at Los Alamos. This system housed the massive electronic content from Elsevier Science, Academic Press, and IEEE. The content of this system consisted of electronic journal articles with full-text searching. This search interface and its capability were different from all the existing databases.
As part of its ongoing customer initiative, the RL discovered that the migration of databases to a similar interface was prominent among customer requests. Another request was to have one place to search all the databases. One means of achieving this request was to centralize all bibliographic information in one location. LANL was fortunate enough to house the bibliographic information at its own location. The time frame required for such a solution remained undecided.
After 1998 the majority of databases at LANL RL shared a similar web-based interface. Although the interfaces were similar, the databases had searching incongruities. In two of the databases, the authors were indexed by "lastname", first initial, and middle initial with no spaces between. In other databases, the author field was indexed by "lastname", first initial followed by a period, and middle initial followed by a period. Still other databases would have authors' "lastname," a space, full first name, and then a middle initial. Searchers had to be aware of these slight differences in order to increase retrievability. The slight error of leaving out the space between first and middle initials would lead to a failure to retrieve any records.
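The three author-indexing conventions described above can be illustrated with a short sketch. Python is used here purely for illustration (FlashPoint itself was written in Perl). The function name `author_variants`, the key names, and the exact separators between the last name and the initials are assumptions, since the article does not give the literal index strings.

```python
def author_variants(last, first, middle=""):
    """Return one author string per indexing convention described in the text.

    The separators used here are assumptions made for illustration.
    """
    fi = first[0].upper()
    mi = middle[0].upper() if middle else ""
    return {
        # last name, then initials run together with no space between them
        "compact": f"{last} {fi}{mi}",
        # last name, then each initial followed by a period
        "dotted": f"{last} {fi}." + (f" {mi}." if mi else ""),
        # last name, a space, full first name, then the middle initial
        "full_first": f"{last} {first}" + (f" {mi}" if mi else ""),
    }
```

A searcher who knows only one of these forms would miss records in databases that use the others, which is the retrieval problem the paragraph describes.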
An alternative approach to centralizing all the databases was considered. Instead of building the database from the bottom up, could one build a search interface from the top down? This idea came from metasearch engines such as Dogpile and Profusion, which would take input from their search form and relay the searches to various other search engines.
The problem was whether a web-based bibliographic search form could mimic the searching capability of a metasearch engine by relaying a search query to a number of other remote databases. The solution was found in the programming language Perl and its libwww-perl module maintained by Gisle Aas and Martijn Koster (http://cpan.perl.org/authors/id/G/GA/GAAS/). A Perl module is a collection of reusable code written for a particular purpose. The libwww-perl module helps developers write programs that simulate web browser functions such as retrieving a web page, submitting a search, and retrieving its results.
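The idea of a program acting as a web browser can be sketched as follows. This is not libwww-perl and not FlashPoint's code; it uses Python's standard `urllib` to show the same pattern of building a search request and reading the response. The URL `http://db.example.org/search`, the parameter names `field` and `q`, and the User-Agent string are all hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_search_request(base_url, field, terms):
    """Build a GET request that mimics submitting a search form."""
    query = urlencode({"field": field, "q": terms})
    return Request(
        f"{base_url}?{query}",
        headers={"User-Agent": "metasearch-sketch/0.1"},  # hypothetical agent name
    )


def fetch_results(request, timeout=30):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the request does not touch the network; fetch_results() would.
req = build_search_request("http://db.example.org/search", "author", "Doe J")
```

Separating request construction from fetching keeps the query-building step easy to inspect without contacting a remote database.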
The idea behind the development of FlashPoint was to provide novice users with more accurate and precise results by overcoming the subtle idiosyncrasies of the various databases. The name FlashPoint is defined as "a point at which someone or something bursts suddenly into action or being." The search form began as one simple text input box that provided the choice of searching author, or title/abstract/keywords, or journal title, and also offered a selection of databases. The program makes slight adjustments to search queries in order to maximize retrieval in different databases.
For author searches, these adjustments meant adding or removing spaces between the last name, first initial, and middle initial, depending on both the user's exact input and the format required by the target database. Search fields were translated as well. For example, a title/subject/abstract search in one database would be converted to a title/abstract/keyword search in another. In still another database the one-field search would be split into three distinct field searches joined by an "or" operator.
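The field translation described above might be sketched as a lookup table plus a small rewriting function. The database names (`db_a`, `db_b`, `db_c`), the field key `tsa`, and the `field=(terms)` query syntax are hypothetical; the article does not give FlashPoint's actual mappings or the query syntax of the target databases.

```python
# Hypothetical per-database field maps, for illustration only.
FIELD_MAPS = {
    "db_a": {"tsa": ["title/subject/abstract"]},
    "db_b": {"tsa": ["title/abstract/keyword"]},
    "db_c": {"tsa": ["title", "subject", "abstract"]},
}


def translate_query(database, field, terms):
    """Rewrite a one-field query into the field names a given database expects.

    When a database has no combined field, the query is expanded into one
    clause per field and the clauses are joined with "or".
    """
    targets = FIELD_MAPS[database][field]
    clauses = [f"{target}=({terms})" for target in targets]
    return " or ".join(clauses)
```

For example, `translate_query("db_c", "tsa", "plasma")` expands the single-field search into three field searches joined by "or", while the same call for `db_b` only renames the field.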
The database selection on the form was provided by checkboxes, the search fields by radio buttons. (See Figure 1.) Within an HTML form one can select multiple responses if they are associated with checkboxes, while radio buttons are mutually exclusive. For the database selection, the default option was to search all the databases. If a user made a selection other than "ALL", the "ALL" checkbox would be automatically deselected. If the "ALL" checkbox were then reselected, the other selected boxes would be automatically deselected. This was accomplished by using JavaScript. Located next to each database name was a link to more information regarding that database. More often than not, all the databases were searched.
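The checkbox rules can be modeled as a small state function. FlashPoint implemented this behavior in JavaScript on the form itself; the Python sketch below only models the logic. The function name `toggle`, the database labels, and the fallback to "ALL" when nothing remains selected are assumptions not stated in the article.

```python
def toggle(selected, clicked, all_label="ALL"):
    """Return the new set of checked boxes after the user clicks one checkbox."""
    selected = set(selected)
    if clicked == all_label:
        # Selecting ALL clears every individual database.
        return {all_label}
    # Clicking an individual database toggles it and deselects ALL.
    selected.discard(all_label)
    selected ^= {clicked}
    # Assumption: with nothing left checked, fall back to the default (ALL).
    return selected or {all_label}
```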
After a search was submitted, the user was presented with a "search-in-progress" screen reiterating the search query. The progress screen would refresh approximately every ten seconds until the answers appeared. The search-in-progress screen was followed by the results screen, which would repeat the search query and display a three-column table consisting of the link to the results, labeled by the Database Name, the number of results returned, and another link to open the results in a separate browser window. Beneath the table of results was a link reminding the user to reload the page, since the queries from some databases would often return more slowly depending on the specific system load. Initially, the searches were done sequentially. Shortly thereafter, the searches were done in parallel, which dramatically reduced the wait time. One major consequence of this parallelism was that some results returned out of order. This necessitated a link to refresh the page (javascript:location=location link).
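The move from sequential to parallel searching can be sketched with a thread pool. The article does not say how FlashPoint achieved parallelism in Perl, so threads are a stand-in technique here. The backend function, database names, and delays are hypothetical, and the sleep simulates a slow remote system.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def search_backend(name, delay):
    """Stand-in for one database-specific backend; sleeps to simulate load."""
    time.sleep(delay)
    return name, len(name)  # (database name, placeholder hit count)


def search_all(backends):
    """Run every backend concurrently and collect results as each one finishes."""
    results = []
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = [pool.submit(search_backend, n, d) for n, d in backends.items()]
        for future in as_completed(futures):
            results.append(future.result())
    return results


results = search_all({"slow_db": 0.2, "fast_db": 0.05})
```

Because results are collected in completion order rather than submission order, a faster database typically appears first, which mirrors the out-of-order returns the paragraph mentions. The total wait is close to the slowest backend rather than the sum of all of them.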
Behind the scenes, the main front-end program would send the search queries to smaller, database-specific backend programs. These backend programs would adjust the queries appropriately and submit them to the individual databases. Each backend program would then wait for the results and parse out the number of hits and the name of the temporary file that held the citation data.
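The parsing step performed by each backend might look like the sketch below. The response wording ("N records found", "results file: ...") is invented for illustration; each real database would return its own page layout and need its own patterns.

```python
import re

# Hypothetical response format; not taken from any actual database.
HITS_RE = re.compile(r"(\d+)\s+records? found")
FILE_RE = re.compile(r"results file:\s*(\S+)")


def parse_response(page):
    """Extract the hit count and temporary results file name from a response page.

    Returns (0, None) when neither pattern matches.
    """
    hits = HITS_RE.search(page)
    path = FILE_RE.search(page)
    return (int(hits.group(1)) if hits else 0, path.group(1) if path else None)
```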
Testing of the FlashPoint simple-search beta was intensive. The lead tester ran nearly a thousand tests over a four-month period. Problems were abundant. The developer originally designed the form with only one input box for the novice searcher. Matters were complicated even further when testers embedded bracketed Boolean operators such as <or>, <and>, and <near> to form complex compound queries within the one search input box. Testers using all imaginable combinations of names revealed the need for numerous rewrites of the parser, which resulted in a more stable product. The testing also prodded the developer to research how often complex queries were encountered by developers of other local databases. The database queries were analyzed based on the number of Boolean operators that appeared in each query. A majority of the bibliographic search forms had three entry fields. Each field had a drop-down menu for selecting the fields of the various databases. The three fields could then be coupled with Boolean operators. In addition to these fields, there were more search qualifiers such as year, format, and language.
After analyzing eighteen months' worth of various search queries in the other locally housed databases, definite patterns arose. More than half of the queries used no Boolean operator. One-third used one Boolean operator, fourteen percent used two Boolean operators and 2 to 3 percent used all fields and additional qualifiers. This search analysis also provided insight into what searches returned no hits at all from the many databases. The "zero returns" searches were inverted authors such as "John Doe", full name searches such as "Doe, John" and multiple word phrases such as "fundamentals of gas particle flow".
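An analysis of this kind can be sketched as counting Boolean operators per query and tallying the distribution. The log format (one plain-text query per entry, operators written as bare words) is an assumption, and the sample queries are made up for illustration; they are not data from the study.

```python
from collections import Counter

OPERATORS = {"and", "or", "not"}


def count_operators(query):
    """Count Boolean operator words in a single query string."""
    return sum(1 for token in query.lower().split() if token in OPERATORS)


def operator_distribution(queries):
    """Return the share of queries for each operator count."""
    if not queries:
        return {}
    counts = Counter(count_operators(q) for q in queries)
    return {n: counts[n] / len(queries) for n in sorted(counts)}


# Illustrative queries only.
sample = ["plasma physics", "laser and fusion", "doe j", "gas or particle and flow"]
distribution = operator_distribution(sample)
```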
This review of searching behavior provided an even better product. Since a majority of the authors indexed did not use a full name, the program would parse the author query and if it did include a full name, the search-in-progress screen would provide the user with a suggestion to use initials if no results were returned. If the search was a title/subject/abstract query that consisted of four or more words without a Boolean operator, the user was prompted to use either a Boolean "or" operator or to enclose the phrase within quotes on the search-in-progress screen. Incorporating some intelligent feedback and user training into the interface enhanced the searching experience.
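The two feedback rules can be sketched as simple heuristics. The function names, hint wording, and the test for a "full name" (any token after the first that is longer than one character and not all uppercase) are assumptions; the article does not describe how FlashPoint's parser made these decisions.

```python
OPERATORS = {"and", "or", "not", "near"}


def author_hint(query):
    """Suggest initials when an author query appears to contain a full name."""
    parts = query.replace(",", " ").replace(".", " ").split()
    # Heuristic: tokens such as "J" or "JA" are treated as initials.
    full_name = any(len(p) > 1 and not p.isupper() for p in parts[1:])
    if full_name:
        return "No results? Try initials instead of a full first name."
    return None


def phrase_hint(query):
    """Suggest "or" or quotes for four or more words with no Boolean operator."""
    stripped = query.strip()
    words = stripped.split()
    has_operator = any(w.lower().strip("<>") in OPERATORS for w in words)
    quoted = len(stripped) > 1 and stripped.startswith('"') and stripped.endswith('"')
    if len(words) >= 4 and not has_operator and not quoted:
        return 'Try joining terms with "or", or enclose the phrase in quotes.'
    return None
```

Both functions return `None` when no hint applies, so the progress screen can display a message only when one is produced.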
The redesigned product was introduced to users. Initially it was nothing more than a simple search with one input box and a choice of author, title/subject/abstract, or source. The next version, introduced a few months later, included three input boxes. Instead of three fields to search there were now seven. These added fields were Conference, Institution, Report Number and Title. This staggered introduction of different versions provided more time for planning and testing. Designing and releasing the advanced version was more difficult, because more field translation between databases was necessary. The advanced module also provided a wide range of publishing years, from 1940 to 2001. If the requested range of years was not available for a specific database, the database would not be searched and a message regarding the inconsistency of the years would be noted on the results screen. If a multiple-field search was used and a particular field type was not available in a certain database, a notice would be presented on the results page. The idea was not to interrupt the search but rather to instruct the users on the variances of the databases.
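The year-range and field-availability checks could be sketched as below. The coverage table, database names, and notice wording are hypothetical, and treating a year range as unavailable when any part of it falls outside a database's coverage is an assumption about how the rule was applied.

```python
# Hypothetical coverage table; actual database coverage is not given in the article.
COVERAGE = {
    "db_a": {"years": (1940, 2001), "fields": {"author", "title", "conference"}},
    "db_b": {"years": (1970, 2001), "fields": {"author", "title"}},
}


def plan_search(database, fields, year_from, year_to):
    """Decide whether to search a database and collect notices for the results page.

    Returns (should_search, notices).
    """
    info = COVERAGE[database]
    first, last = info["years"]
    if year_from < first or year_to > last:
        # Years not covered: skip this database but tell the user why.
        return False, [f"{database}: not searched, years {year_from}-{year_to} not covered"]
    # Missing fields do not stop the search; they only produce a notice.
    missing = sorted(set(fields) - info["fields"])
    return True, [f"{database}: field '{f}' not available" for f in missing]
```

Returning notices instead of raising an error matches the stated goal of instructing users without interrupting the search.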
The FlashPoint product continues to evolve. Currently in its fourth revision, it searches nine bibliographic databases and one full-text database. A recent enhancement was the inclusion of the local online catalog. The majority of searches are sent to all the databases. This necessitated some infrastructural changes to move one of the larger databases to its own machine. Upcoming enhancements include searching more external databases and providing subject search categories in addition to database names. Users provided many comments on FlashPoint. One user felt it was the right next step to take. Another said, "FlashPoint is the best way to search you have come up with yet. It's wonderful. I found hundreds more articles." To some people, that must be a good thing!