Multi-Media, Multi-Cultural, and Multi-Lingual Digital Libraries
Or How Do We Exchange Data In 400 Languages?
Christine L. Borgman
Professor and Chair
Department of Library and Information Science
Graduate School of Education & Information Studies
University of California, Los Angeles
Los Angeles, California
cborgman@ucla.edu
D-Lib Magazine, June 1997
ISSN 1082-9873
Introduction
Medium, Culture, and Language
From Local Systems to Global Systems
Design Tradeoffs
Representation in Digital Form
Language and Character Sets
Transliteration and Other Forms of Data Loss
Character Encoding
Mono-lingual, Multi-lingual, and Universal Character Sets
Library Community Approaches
Summary and Conclusions
References
Introduction
The Internet would not be very useful if communication were limited
to textual exchanges between speakers of English located in the
United States. Rather, its value lies in its ability to enable
people from multiple nations, speaking multiple languages, to
employ multiple media in interacting with each other. While computer
networks broke through national boundaries long ago, they remain
much more effective for textual communication than for exchanges
of sound, images, or mixed media -- and more effective for communication
in English than for exchanges in most other languages, much less
interactions involving multiple languages.
Supporting searching and display in multiple languages is an increasingly
important issue for all digital libraries accessible on the Internet.
Even if a digital library contains materials in only one language,
the content needs to be searchable and displayable on computers
in countries speaking other languages. We need to exchange data
between digital libraries, whether in a single language or in
multiple languages. Data exchanges may be large batch updates
or interactive hyperlinks. In any of these cases, character sets
must be represented in a consistent manner if exchanges are to
succeed. Issues of interoperability, portability, and data exchange
(Libicki, 1995) related to multi-lingual
character sets have received surprisingly little attention in
the digital library community or in discussions of standards for
information infrastructure, except in Europe. The landmark collection
of papers on Standards Policy for Information Infrastructure
(Kahin & Abbate, 1995), for example,
contains no discussion of multi-lingual issues except for a passing
reference to the Unicode standard (Libicki, 1995,
p. 63).
The goal of this short essay is to draw attention to the multi-lingual
issues involved in designing digital libraries accessible on the
Internet. Many of the multi-lingual design issues parallel those
of multi-media digital libraries, a topic more familiar to most
readers of D-Lib Magazine. This essay draws examples from
multi-media DLs to illustrate some of the urgent design challenges
in creating a globally distributed network serving people who
speak many languages other than English.
First we introduce some general issues of medium, culture, and
language, then discuss the design challenges in the transition
from local to global systems, and finally address technical matters.
The technical issues involve the choice of character sets to represent
languages, similar to the choices made in representing images
or sound. However, the scale of the language problem is far greater.
Standards for multi-media representation are being adopted fairly
rapidly, in parallel with the availability of multi-media content
in electronic form. By contrast, we have hundreds (and sometimes
thousands) of years worth of textual materials in hundreds of
languages, created long before data encoding standards existed.
Textual content from past and present is being encoded in language
and application-specific representations that are difficult to
exchange without losing data -- if they exchange at all. We illustrate
the multi-language DL challenge with examples drawn from the research
library community, which typically handles collections of materials
in 400 or so languages. These are problems faced not only by developers
of digital libraries, but by those who develop and manage any
communication technology that crosses national or linguistic boundaries.
Medium, Culture, and Language
Speaking is different from writing, and still images are different
from moving images; verbal and graphical communication are yet
more different from each other. Speaking in one's native language
to people who understand that language is different from speaking
through a translator. Language translations, whether oral or written,
manual or automatic, cannot be true equivalents due to subtle
differences between languages and the cultures in which they originate.
Thus the content and effect of messages are inseparable from the
form of communication and the language in which they are communicated.
For all of these reasons, we wish to capture DL content in the
richest forms possible to assure the maximum potential for communication.
We want accurate representations of the original form and minimal
distortion of the creators' (author, artist, film maker, engineer,
etc.) intentions. At the same time, we want to provide the widest
array of searching, manipulation, display, and capture capabilities
to those who seek the content, for the searchers or users of these
digital libraries may come from different cultures and speak different
languages than those of the creators.
Herein lies the paradox of information retrieval: the need to
describe the information that one does not have. We have spent
decades designing mechanisms to match up the expressions of searchers
with those of the creators of textual documents (centuries, if
manual retrieval systems are considered). This is an inherently
unsolvable problem due to the richness of human communication.
People express themselves in distinctive ways, and their terms
often do not match those of the creators and indexers of the information
sought, whether human or machine. Conversely, the same terms may
have multiple meanings in multiple contexts. In addition, the
same text string may retrieve words in multiple languages, adding
yet more variance to the results. Better retrieval techniques
will narrow the gap between searchers and creators of content,
but will never close that gap completely.
Searching for information in multi-media digital libraries is
more complex than text-only searching. Consider the many options
for describing sounds, images, numeric data sets, and mixed-media
objects. We might describe sounds with words, or with other sounds
(e.g., playing a tune and finding one like it); we might describe
an image with words, by drawing a similar object, or by providing
or selecting an exemplar. As Croft (1995)
notes in an earlier D-Lib issue, general solutions to multi-media
indexing are very difficult, and those that do exist tend to be
of limited utility. The most progress is being made in well-defined
applications in a single medium, such as searching for music or
for photographs of faces.
Cultural issues pervade digital library applications, whether
viewing culture at the application level, such as variations in
approaches to image retrieval by the art, museum, library, scientific,
and public school communities, or on a multi-national scale, such
as the differing policies on information access between the United
States and Central and Eastern Europe. Designing digital libraries
for distributed environments involves complex tradeoffs between
tailoring to local cultures and meeting the standards and practices
necessary for interoperability with other systems and services
(Borgman, et al., 1996).
From Local Systems to Global Systems
The easiest systems to design are those for well-defined applications
and well-defined user populations. Under these conditions, designers
can build closed systems tailored to a community of users, iteratively
testing and refining capabilities. These are rare conditions today,
however. More often, we are designing open systems that serve
not only a local population, but also remote and perhaps unknown
populations. Examples include digital libraries of scholarly materials
built by and for one university, then later made openly available
on the Internet; business assets databases developed and tested
at a local site and then provided to corporate sites around the
world; scientific databases designed for research applications,
later made available for educational purposes; and library catalogs
designed for a local university, later incorporated into national
and international databases for resource sharing. Any of these
applications could involve content in multiple media and multiple
languages.
Design Tradeoffs
Consider how the design issues change from local to distributed
systems. In local systems, designers can tailor user interfaces,
representation of content, and functional capabilities to the
local culture and to the available hardware and software. Input
and output parameters are easily specified. If users need to create
sounds or to draw, these capabilities can be provided, along with
display, capture, and printing capabilities that conform to the matching standards.
Keyboards can be set to support the local language(s) of input;
screens and printers can be set to support the proper display
of the local languages as well.
Designers have far less control over digital libraries destined
for use in globally distributed environments. Users' hardware
and software platforms are typically diverse and rapidly changing.
Designers often must specify a minimum configuration or require
a minimum version of client software, making tradeoffs between
lowering the requirements to reach a larger population and raising
requirements to provide more sophisticated capabilities. The more
sophisticated the multi-media or multi-lingual searching capabilities,
the higher the requirements are likely to be, and the fewer people
that are likely to be served.
While good design includes employing applicable standards, determining
which standards are appropriate in the rapidly evolving global
information infrastructure involves tradeoffs as well. The use
of some standards may be legislated by the parent organization
or funding agency, and the use of other standards may be a matter
of judging which are most stable and which are most likely to
be employed in other applications with which the current system
needs to exchange data. In the case of character sets for representing
text in digital libraries, designers sometimes face a choice between
a standard employed within their country to represent their national
language and a universal character set in which their national
language is more commonly represented in other countries. At present,
massive amounts of textual data are being generated in digital
form, and represented in formats specific to applications, language,
and countries. The sooner the digital library community confronts
this tidal wave of "legacy data" in incompatible representations,
the more easily this interoperability problem may be solved.
Representation in Digital Form
Although we have been capturing text, images, and sounds in machine-readable
forms for several decades, issues of representation became urgent
only when we began to access, maintain, exchange, and preserve
data in digital form. In information technologies such as film,
phonograph, CD-ROM, and printing, electronic data often served only
as an intermediate format. Once the final product was published or
produced, the electronic data were often destroyed, and the medium
(disks, tapes, etc.) reused.
In digital libraries, the perspective changes in two important
ways: (1) from static output to dynamic data exchange; and (2)
from a transfer mechanism to a permanent archival form. In sound
or print recordings, for example, once the record is issued or
the book printed, it no longer matters how the content was represented
in machine-readable form. In a digital library, the representation
matters because the content must be continuously searched, processed,
and displayed, and often must be exchanged with other applications
on the same and other computers.
When electronic media were viewed only as transfer mechanisms,
we made little attempt to preserve the content. Many print publications
exist only in paper form, the typesetting tapes used to generate
them long since overwritten. Much of the early years of television
broadcasts were lost, as the recording media were reused or allowed
to decay. Now we recognize that digital data must be viewed as
a permanent form of representation, requiring means to store content
in complete and authoritative forms, and to migrate content to
new technologies as they appear.
Language and Character Sets
Character set representation is a problem similar to that of representing
multi-media objects in digital libraries, yet is more significant
due to the massive volume of textual communication and data exchange
that takes place on computer networks. Culture plays a role here
as well: speakers of all languages wish to preserve their language
in its complete and authoritative form. Incomplete or incorrect
data exchange results in failures to find information, in failures
to authenticate identities or content, and in the permanent loss
of information. Handling character sets for multiple languages
is a pervasive problem in automation, and one of great concern
to libraries, network developers, government agencies, banks,
multi-national companies, and others exchanging information over
computer networks.
Much to the dismay of the rest of the world, computer keyboards
were initially designed for the character set of the English language,
containing only 26 letters, 10 digits, and a few special symbols.
While variations on the typical English-language keyboard are
used to create words in most other languages, doing so often results
in either (1) a loss of data, or (2) encoding characters in a
language-specific or application-specific format that is not readily
transferable to other systems. We briefly discuss the problems
involved in data loss and character encoding, then discuss some
potential solutions.
Transliteration and Other Forms of Data Loss
Languages written in non-Roman scripts, such as Japanese, Arabic,
Chinese, Korean, Persian (Farsi), Hebrew, and Yiddish (the "JACKPHY"
languages), and Russian, are transliterated into Roman characters
in many applications. Transliteration matches characters or sounds
from one language into another; it does not translate meaning.
Considerable data loss occurs in transliteration. The process
may be irreversible, as variations occur due to multiple transliteration
systems for a given language (e.g., Peking vs. Beijing, Mao Tse-tung
vs. Mao Zedong (Chinese), Tchaikovsky vs. Chaikovskii (Russian)),
and the transliterated forms may be unfamiliar to speakers of
that language. Languages written in extensions of the Roman character
set, such as French, Spanish, German, Hungarian, Czech, and Polish,
are maintained in incomplete form in some applications by omitting
diacritics (accents, umlauts, and other language-specific marks)
that distinguish their additional characters.
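The loss described above can be sketched in a few lines of Python (a present-day illustration, not part of the library systems the essay discusses). Decomposing each character into a base letter plus combining marks and then discarding the marks is a one-way transformation: distinct words collapse into the same string.

```python
import unicodedata

def strip_diacritics(text):
    # Decompose each character into a base letter plus combining
    # marks (NFD), then discard the combining marks -- irreversibly.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Two distinct Hungarian words collapse into the same ASCII string,
# so the original distinction cannot be recovered:
print(strip_diacritics("kerek"))  # "round" -> kerek
print(strip_diacritics("kerék"))  # "wheel" -> kerek
```

Once both words are stored as "kerek", no amount of later processing can tell which was meant -- the same irreversibility the essay notes for transliteration.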
These forms of data loss are similar to those of "lossy"
compression of images, in which data are discarded to save storage
costs and transmission time while maintaining an acceptable reproduction
of the image. Any kind of data loss creates problems in digital
libraries. Variant forms of words will not match and sort properly,
incomplete words will not exchange properly with digital libraries
using complete forms, and incomplete forms may not be adequate
for authoritative or archival purposes. The amount of acceptable
loss varies by application. Far more data loss is acceptable in
applications such as email, where rapid communication is valued
over authoritative form, than in financial or legal records, where
authentication is essential.
Character Encoding
The creation of characters in electronic form involves hardware
and software to support input, storing, processing, sorting, displaying,
and printing. The internal representation of each character determines
how it is treated by the hardware (keyboard, printer, VDT, etc.)
and the application software. Two characters may appear the same
on a screen but be represented differently due to their different
sorting positions in multiple languages, for example. Conversely,
the same key sequence on two different keyboards may produce two
different characters, depending upon the internal representation
that is generated. Character encoding for digital libraries includes
all of these aspects:
The keyboard commands used to generate characters, especially
characters with diacritics, for building the digital library content;
The keyboard commands used to generate characters to search
the digital library;
Rules for sorting characters in correct alphabetic sequence,
which are dependent on the internal representation of the character;
the correct sequence varies by language;
Display of characters on computer screens; and
Output of characters on printers and other devices.
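The sorting point above can be demonstrated with a small Python sketch (a modern illustration; the words are my own examples). Sorting by raw code point ignores language-specific alphabetic order: the Hungarian word "öt" lands after "zebra" because ö has a higher code point than z, although the Hungarian alphabet places ö immediately after o.

```python
words = ["zebra", "öt", "orvos"]

# Naive code-point sort: ö (U+00F6) falls after z (U+007A),
# so "öt" lands at the end instead of between "orvos" and "zebra".
print(sorted(words))  # ['orvos', 'zebra', 'öt']

# Hungarian collation would expect: ['orvos', 'öt', 'zebra']
```

Correct ordering therefore requires a collation table tied to the language, not just the internal character codes.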
Numerous possibilities exist for mismatches and errors in access
to digital libraries in distributed environments, considering
the vast array of hardware and software employed by DLs and their
users and the variety of languages and character encoding systems
that may be involved.
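The observation that two characters may look identical on screen yet differ internally can be shown directly. In this Python sketch (a present-day illustration using Unicode normalization, which postdates most of the systems discussed here), the precomposed character é and the sequence e plus a combining acute accent render the same but compare as unequal until normalized:

```python
import unicodedata

precomposed = "\u00e9"  # é as one code point (LATIN SMALL LETTER E WITH ACUTE)
combining = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Visually identical, internally different:
print(precomposed == combining)  # False

# Normalizing to composed form (NFC) reconciles the two:
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```

A digital library that stores one form and searches for the other will report a mismatch unless both sides normalize first.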
Mono-lingual, Multi-lingual, and Universal Character Sets
Many standards and practices exist for encoding characters. Some
are language-specific, others are script-specific (e.g., Latin
or Roman, Arabic, Cyrillic), and "universal" standards
that support most of the world's written languages are now available.
Exchanging data among digital libraries that employ different
character encoding formats is the crux of the problem.
If mono-lingual DLs all use the same encoding format, such as
ASCII for English, data exchange should be straightforward. If
mono-lingual DLs use different formats, such as the three encoding
formats approved for the Hungarian language by the Hungarian standards
office (Számítástechnikai karakterkódok. A grafikus karakter magyar referenciakészlete, 1992),
then data exchange encounters problems. Characters generated by
a keyboard that is set for one encoding system may not match characters
stored under another encoding system; characters with diacritics
may display or print incorrectly or not display at all.
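Such a mismatch can be made concrete with a short Python sketch (an after-the-fact illustration; the particular encodings are examples). Bytes written under one national 8-bit encoding and read back under another are silently reinterpreted as different characters:

```python
word = "idő"  # Hungarian for "time"; ő is U+0151

raw = word.encode("iso-8859-2")   # stored under Latin-2
wrong = raw.decode("iso-8859-1")  # read back assuming Latin-1

# The byte 0xF5 denotes ő in Latin-2 but õ in Latin-1:
print(wrong)  # idõ
```

No error is raised at any step -- the corruption surfaces only when a reader notices the wrong character, which is precisely what makes such exchanges hazardous.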
DLs using the same script-specific formats, such as Latin-2 extended
ASCII that encompasses the major European languages, should be
able to exchange data with each other. When DLs using Latin-2
attempt to exchange data in those same languages with DLs using
language-specific formats, mismatches may occur. Similarly, mismatches
may occur when DLs that employ Latin-2 for European languages
exchange data with DLs that employ a different multi-lingual set
such as the American Library Association character set (The American Library Association Character Set, 1989)
commonly used in the United States.
After many years of international discussion on the topic, Unicode
appears to be emerging as the preferred standard to support most
of the world's written languages. A universal character set offers
great promise for solving the data exchange problem. If data in
all written languages are encoded in the same format, then data
can be exchanged between mono-lingual and multi-lingual digital
libraries. Just as the networked world is moving toward hardware
platform-independent solutions, adopting Unicode widely would
move us toward language-independent solutions to distributed digital
libraries and to universal data exchange. Techniques for automatic
language translation would be assisted by a common character set
standard as well.
Any solution that appears too simple probably is. Major hardware
and software vendors are beginning to support Unicode, but it
is not embedded in much application software yet. Unicode requires
16 bits to store each character -- twice as much as ASCII and its
8-bit extensions. However, Unicode requires only half as much space as the
earlier version of ISO 10646 (32 bits), the competing and more
comprehensive universal character set. Unicode emerged as the
winner in a long standards battle, eventually merging with ISO
10646, because it was seen as easier to implement and thus more
likely to be adopted widely. As storage costs continue to decline,
the storage requirements of Unicode will be less of an issue for
new applications. Massive amounts of text continue to be generated
not only in language-specific and script-specific encoding standards,
but in local and proprietary formats. Any of this text maintained
in digital libraries may become "legacy data" that has
to be mapped to Unicode or some other standard in the future.
At present, digital library designers face difficult tradeoffs
between the character set standards in use by current exchange
partners, and the standard likely to be in international use in
the future for a broader variety of applications.
Library Community Approaches
The international library community began developing large, multi-language
digital libraries in the 1960s. Standards for record structure
and character sets were established long before the Internet was
created, much less Unicode. Hundreds of millions of bibliographic
records exist around the world in variations of the MARC (MAchine
Readable Cataloging) standard, although in multiple character
set encoding formats. OCLC Online Computer Library Center, the
world's largest cataloging cooperative, serves more than 17,000
libraries in 52 countries and contains over 30 million bibliographic
records with over 500 million records of ownership attached in
more than 370 languages (Mitchell, 1994;
OCLC Annual Report, 1993; Smith, 1994).
OCLC uses the American Library Association (ALA) character set
standard, which extends the English-language keyboard to include
diacritics from major languages (Agenbroad, 1992;
The ALA Character Set, 1989). Text in most
other languages is maintained in transliterated form.
The Library of Congress, which contributes its records in digital
form to OCLC, RLIN (Research Libraries Information Network, the
other major U.S.-based bibliographic utility), and other cooperatives,
also does original-script cataloging for the JACKPHY languages mentioned
earlier. RLIN pioneered the ability to encode the JACKPHY languages
in their original script form for bibliographic records, using
available script-specific standards (Aliprand, 1992).
Records encoded in full script form are exchanged between the
Library of Congress, RLIN, OCLC, other bibliographic utilities
in the U.S. and elsewhere, and many digital libraries maintained
by research libraries. Catalog cards are printed in script and
Romanized forms from these databases, but direct use of the records
in script form requires special equipment to create and display
characters properly. Records from OCLC, RLIN, and other sources
are loaded into the online catalogs of individual libraries, where
they usually are searchable only in transliterated forms. Some
online catalogs support searching with diacritics, while others
support only ASCII characters. Regardless of the local input
and output capabilities, if characters are represented internally
in their fullest form, they will be available for more sophisticated
uses in the future when the search and display technologies become
more widely available.
Libraries always have taken a long-term perspective on preserving
and providing access to information. They manage content in many
languages and cooperate as an international community to exchange
data in digital form. Thus it is not surprising that libraries
were among the first institutions to tackle the multi-lingual
character set problem. Over the last 30 years, libraries have
created massive stores of digital data. Not only do libraries
create and maintain new bibliographic records in digital form,
a growing number of the world's major research libraries have
converted all of their historical records -- sometimes dating
back several hundred years -- into a common record structure.
By now, libraries have the expertise and influence to affect future
developments in standards for character sets and other factors
in data exchange.
The library world is changing, however, as new regions of the
world come online. The European Union is promoting Unicode and
funding projects to support Unicode implementation in library
automation (Brickell, 1997). Automation
in Central and Eastern Europe (CEE) has advanced quickly since
1990 (Borgman, in press). A survey
of research libraries in six CEE countries, each with its own
national language and character set, indicates that a variety
of coding systems are in use. As of late 1994, more than half
used ASCII Latin2, one used Unicode, and the rest used a national
or system-specific format; none used the ALA character set
(Borgman, 1996).
The national libraries in these countries are responsible for
preserving the cultural heritage of their countries that appears
in published form, and thus require that their language be preserved
in its most complete and authoritative digital form. Transliterated
text or characters stripped of diacritics are not acceptable.
Several of these national libraries are now working closely with
OCLC, toward the goal of exchanging data in authoritative forms.
As libraries, archives, museums, and other cultural institutions
throughout the world become more aware of the need to preserve
digital data in archival forms, character set representation becomes
a political as well as technical issue. Many agencies are supporting
projects to ensure preservation of bibliographic data in digital
forms that can be readily exchanged, including the Commission
of the European Communities, International Federation of Library
Associations, Soros Foundation Open Society Institute Regional
Library Program, and the Mellon Foundation
(Segbert &
Burnett, 1997).
Summary and Conclusions
Massive volumes of text in many languages are becoming available
online, whether created initially in digital form or converted
from other media. Much of this data will be stored in digital
libraries, whether alone or in combination with sounds and images.
Digital formats are no longer viewed as an intermediate mechanism
for transferring data to print, film, tape, or other media. Rather,
they have become permanent archival forms for many applications,
including digital libraries. DL content is used directly in digital
form -- searched, processed, and often reformatted for reuse in
other applications. Data are exchanged between DLs, whether in
large batch transfers -- such as tape loads between bibliographic
utilities and online catalogs, or electronic funds transfers between
financial institutions -- or as hyperlinks between DLs distributed
across the Internet. In networked environments, searchers speaking
many different languages, with many different local hardware and
software platforms, may access a single digital library. For all
of these reasons, we need to encode characters in a standard form
that can support most of the world's written languages.
The first step is for designers of digital libraries to recognize
that the multi-lingual character set problem exists. The goal
of this essay, and the choice of publication venue, is to bring
the problem to the attention of a wider audience than the technical
elite who have been grappling with it for many years now. The
second step is to take action. The solution will not come overnight,
but given the great strides already taken toward platform-independent
network applications, and toward standards for exchanging sounds
and images, the foundation for progress has been laid.
Designers of networked applications are more aware of interoperability,
portability, and data exchange issues than in the past. Experience
in migrating data from one application to another provides object
lessons in the need to encode data in standard formats. Unicode
appears to be the answer for new applications and for mapping
legacy data from older applications. However, designers still
must weigh factors such as the amount of data currently existing
in other formats, the standards in use by other systems with which
they must exchange data regularly, the availability of application
software that supports Unicode and other universal standards for
encoding character sets, and the pace at which conversion will
occur. The sooner that the digital library community becomes involved
in these discussions, the sooner we will find a multi-media, multi-cultural,
and multi-lingual solution to exchanging data in all written languages.
References
Agenbroad, J. E. (1992). Nonromanization:
Prospects for Improving Automated Cataloging of Items in Other
Writing Systems. Cataloging Forum, Opinion Papers, No. 3.
Washington, DC: Library of Congress.
The ALA character set and other solutions for
processing the world's information. (1989). Library Technology
Reports, 25(2), 253-273.
Aliprand, J.M. (1992). Arabic script on
RLIN. Library Hi Tech, 10(4), Issue 40, 59-80.
Borgman, C.L. (1996). Automation is the
answer, but what is the question? Progress and prospects for Central
And Eastern European Libraries. Journal of Documentation, 52(3),
252-295.
Borgman, C.L. (In press). From acting
locally to thinking globally: A brief history of library automation.
Library Quarterly. (To appear July, 1997)
Borgman, C.L.; Bates, M.J.; Cloonan,
M.V.; Efthimiadis, E.N.; Gilliland-Swetland, A.; Kafai, Y.; Leazer,
G.L.; Maddox, A. (1996). Social Aspects Of Digital Libraries.
Final Report to the National Science Foundation; Computer,
Information Science, and Engineering Directorate; Division of
Information, Robotics, and Intelligent Systems; Information Technology
and Organizations Program. Award number 95-28808.
Bossmeyer, C.; Massil, S.W. (Eds.). (1987).
Automated systems for access to multilingual and multiscript
library materials : problems and solutions : papers from the
pre-conference held at Nihon Daigaku Kaikan Tokyo, Japan, August
21-22, 1986. International Federation of Library Associations
and Institutions, Section on Library Services to Multicultural
Populations and Section on Information Technology. Munich and
New York: K.G. Saur.
Brickell, A. (1997). Unicode/ISO 10646
and the CHASE project. In M. Segbert & P. Burnett (eds.).
Proceedings of the Conference on Library Automation in Central
and Eastern Europe, Budapest, Hungary, April 10-13, 1996.
Soros Foundation Open Society Institute Regional Library Program
and Commission of the European Communities, Directorate General
XIII, Telecommunications, Information Market and Exploitation
of Research, Libraries Programme (DG XIII/E-4). Budapest: Open
Society Institute.
Croft, W.B. (1995). What do people want from
information retrieval? (The Top 10 Research Issues for Companies
that Use and Sell IR Systems). D-Lib Magazine, November.
Kahin, B.; & Abbate, J. (eds.). (1995).
Standards policy for information infrastructure. Cambridge,
MA: MIT Press.
Libicki, M.C. (1995). Standards: The rough
road to the common byte. B. Kahin & J. Abbate (eds.), Standards
policy for information infrastructure. MIT Press: Cambridge,
MA, 35-78.
Számítástechnikai
karakterkódok. A grafikus karakter magyar referenciakészlete.
(1992). Budapest: Magyar Szabványügyi Hivatal. (Character
sets and single control characters for information processing.
Hungarian Reference version of graphic characters. Budapest: Hungarian
Standards Office.)
Mitchell, J. (1994). OCLC Europe: Progress
report, March, 1994. European Library Automation Group Annual
Meeting, Budapest.
OCLC Online Computer Library Center, Inc. (1993).
Furthering access to the world's information (Annual Report
1992/93). Dublin, OH: Author.
Segbert, M. & Burnett, P. (eds.). (1997). Proceedings of the Conference on Library Automation in Central and Eastern Europe, Budapest, Hungary, April 10-13, 1996. Soros Foundation Open Society Institute Regional Library Program and Commission of the European Communities, Directorate General XIII, Telecommunications, Information Market and Exploitation of Research, Libraries Programme (DG XIII/E-4). Budapest: Open Society Institute.
Smith, K.W. (1994). Toward a global library
network. OCLC Newsletter, 208, 3.
Copyright ©1997 Christine L. Borgman
cnri.dlib/june97-borgman