1 Introduction
The pursuit of knowledge by scholars, scientists, government agencies, and ordinary
citizens requires that the seeker be familiar with the diverse information resources
available. They must be able to identify those information resources that relate to the
goals of their inquiry, and must have the knowledge and skills required to navigate those
resources, once identified, and extract the salient data that are relevant to their
inquiry. The widespread distribution of recorded knowledge across the emerging networked
landscape is only the beginning of the problem. The reality is that the repositories of
recorded knowledge are only a small part of an environment with a bewildering variety of
search engines, metadata, and protocols of very different kinds and of varying degrees of
completeness and incompatibility. The challenge is not only to decide how to mix,
match, and combine one or more search engines with one or more knowledge repositories for
any given inquiry, but also to have detailed understanding of the endless complexities of
largely incompatible metadata, transfer protocols, and so on.
This proposal describes an information access system that provides a new paradigm for
information discovery and retrieval by exploiting the fundamental interconnections between
diverse information resources including textual and bibliographic information, numerical
databases, and geo-spatial information systems. This system will provide an object-oriented
architecture and framework for integrating knowledge and software capabilities
for enhanced access to diverse distributed information resources.
1.1 Overview
The purpose of this proposal is twofold, involving both practical application of
existing technology and theoretical examination and evaluation of next-generation
designs for systems architecture and for intelligent assistance in the information
retrieval task. For the first purpose we propose to develop and make ready for production
a next-generation information retrieval system based on international standards (Z39.50
and SGML) which will be used for cross-domain searching, using the Arts and Humanities
Data Service (AHDS), the Consortium of University Research Libraries (CURL), the Online
Archive of California (OAC) and the Making of America II (MOA2) databases as principal
repositories. This system will serve as a model for developing efficient paradigms for
information retrieval in a cross-domain, distributed environment. The second purpose will
be addressed in the design, development, and evaluation of the distributed information
retrieval system architecture and its client-side systems that aid the user in exploiting
distributed resources, and in the design and evaluation of protocols for efficient and
effective retrieval in an internationally distributed multi-database environment.
The aim is to produce a robust, fully operational system (Cheshire) within
a three-year period which would facilitate searching on the internet across collections of
original materials (i.e., early printed books, records, archives, medieval
and literary manuscripts, and museum objects), statistical databases, full-text, geo-spatial
and multi-media data resources. This system will be based on the work done with
the Cheshire II system in the UC Berkeley Digital Library Initiative project, extended with
additional capabilities and redesigned with a new system architecture. This standards-based
client/server system will have important economies for libraries, museums,
universities, and other information providers, and the system produced will be made
available without charge to non-profit, government and educational institutions.
The new extensions to this system will provide a platform and protocols to integrate
databases with fundamentally different content and structure into a common retrieval,
display, and analysis environment. These different database types, and some examples to be
used in this project, include:
- Document databases which describe information about various topics ranging from news
reports and library catalogue entries to full-text articles from academic journals
including text, images and multimedia elements. (Oxford Text Archive, Performing Arts Data
Service, California Sheet Music Project, CURL database, the Digital Archive of California
and the Making of America II (MOA) database).
- Numeric statistical databases which assemble facts about a wide variety of social,
economic, and natural phenomena (History Data Service, NESSTAR and UC Data).
- Geographic databases derived from geographic information systems, digitized maps, and
other resource types which have assembled georeferenced views of the geographic features
and boundaries including georeferenced information derived from place names (Archaeology
Data Service, History Data Service, the UC Berkeley Digital Library database and the MOA
database).
This proposal draws upon the continuing three-year working relationship between the
Division of Special Collections and Archives at the University of Liverpool Library and
the Digital Libraries Program at the University of California, Berkeley, with the aim of
extending it to other research-based repositories in the EU and USA. In addition to the
development platforms of the AHDS, CURL, OAC and MOA2 databases, other repositories
expressing an interest in the Cheshire project and in using the Cheshire software include
Glasgow University (contact: Lesley Richmond), Oxford University (contact: David Price),
Durham University (contact: John Hall), the Public Record Office (contact: Meg Sweet), the
British Library (contact: Richard Masters) in the UK, and the Archive Working Group at Yale
University (contact: Richard Szary) and the Institute for Advanced Technology in the
Humanities at the University of Virginia (contact: Daniel Pitti) in the US.
2 System Description
The system that we are proposing will build upon international standards and existing
work in probabilistic information retrieval, and on the experience of the researchers in
applying advanced retrieval methods to full-scale realistic databases.
2.1 A Functional Model
The continuing development of the Cheshire client/server system is based on a
particular vision of how information access tools will develop and, in particular, how they
must respond to the requirements of a large population of users scattered around the globe
who wish simultaneously to access the complete contents of thousands of archives, museums,
and libraries, containing a mixture of text, images, digital maps, and sound recordings.
Such a virtual library must be a network-based distribution system with local servers
responsible for maintaining individual collections of digital documents, which will
conform to a specific set of standards for documentation description, representation, and
communications protocols. We believe, based on the current directions of research and
adoption of standards by libraries, museums and other institutions, that a major portion
of this emerging global virtual library will be based on SGML (Standard Generalized Markup
Language), and especially its XML subset, and the Z39.50 information retrieval protocol
for resource discovery and cross-database searching. (We also assume that the forthcoming
versions of the HTTP protocol will continue to provide document delivery and hypertext
linking services, and that SQL3, when finalized, will provide the low-level retrieval and
data manipulation semantics for relational and object-relational databases.)
The Cheshire retrieval system, in supporting Z39.50 Explain semantics for
navigating digital collections, will allow users to locate and retrieve information about
collections that are organized hierarchically and distributed across servers. It will
enable coherent expressions of relationships among objects and collections, showing for
any given collection superior, subordinate, related, and context collections. These are
essential prerequisites for the development of cross-domain resource discovery tools,
which will enable users to access diverse collections through a single interface.
In its functionality, the proposed system will satisfy all of the recommendations of
the National Resource Discovery Workshop (MODELS 4), and will address many of the research
and development topics noted in the most recent JISC strategy review, as well as the
development programmes of CEI and TASC. (1)
It specifically addresses the critical issue of vocabulary control by
supporting probabilistic best-match ranked searching (as discussed below) and
Entry Vocabulary Modules (EVMs) that provide a mapping between a
searcher's natural language and the controlled vocabularies used in the description of
digital objects and collections. It also allows users to navigate collections
(the drilling down approach) through distributed Z39.50 Explain
databases and through the use of SGML as the primary database format, particularly for
collection-level descriptions such as the EAD DTD. The system will follow the
recommendations of the Third National Resource Discovery Workshop by providing fully
distributed access to existing catalogues, and is designed to support cross-domain
clumps to facilitate resource discovery. Finally, the proposed server
anticipates the critical issue of displaying non-western character sets in its ability to
handle UNICODE (in addition to the standard ASCII/ISO8859 character sets).
3 Background
3.1 Development History
The development of the Cheshire system began in the early 1990s at the University of
California, Berkeley, as a means of testing the use of probabilistic information
retrieval methods upon MARC bibliographic data. It was found that these advanced
retrieval methods developed at Berkeley were far more effective than traditional Boolean
methods (or vector space model methods) in accessing records from a bibliographic
database. The deployment of these probabilistic retrieval
algorithms offers important economies, particularly in the searching of databases or
documents, such as EAD finding aids, which normally do not use a controlled vocabulary.
The second version of Cheshire, currently deployed at both the University of Liverpool
and the University of California, Berkeley, was designed to extend the format of the
server to include SGML-encoded data. Because SGML is increasingly becoming the mark-up
language of choice for research institutions, it was critical to extend Cheshire's
capabilities to support the kinds of SGML metadata which are likely to be included in
national bibliographies. These are: TEI (Text Encoding Initiative), EAD (Encoded Archival
Description), DDI (for Social Science Data Services), CIMI (Consortium for the Interchange
of Museum Information) records, as well as the SGML version of USMARC released by the
Library of Congress (based on the USMARC DTD developed by Jerome McDonough for the
Cheshire project).
The third version extends the use of SGML handling capabilities for these search
indexes. This version was developed by Berkeley and Liverpool for the Arts and Humanities
Data Service, enabling GRS-1 syntax conversion for nested SGML data, component indexing
and retrieval of SGML formatted documents, and automatic generation of Z39.50 Explain
databases from system configuration files. The current version of the server is now able
to include an element in an SGML record that is a reference to an external digital object
(such as a file name, URL or URN) that contains full text to be parsed and indexed; these
can be local files or URL- and URN-referenced files anywhere on the internet. It also
enhances the user's ability to perform somewhat less directed searching by providing
Boolean and probabilistic search capabilities that can be combined at the user's
direction. This version of Cheshire can display a number of data types ranging from
full-text documents and structured bibliographic records to complex hypertext and
multi-media documents. At its current stage of development, Cheshire forms a bridge between the
realms of purely bibliographic information and the rapidly expanding full-text and
multimedia collections available on-line.
The system proposed here will support nested SGML DTDs, promote cross platform use, and
support a far broader range of SGML document types, specified below. Specifically, this
stage will involve the development of the Cheshire client (currently undertaken by
researchers at Liverpool), the provision of a CORBA-based distributed object version of
both client and server software (undertaken by researchers at UC Berkeley), the
development of the Multi-Valent Document model by researchers at Berkeley and Liverpool
to accommodate delivery of multi-media and GIS information without the need for plug-ins
or helper applications, and application of Entry Vocabulary Modules (EVMs) enabling search
support for unfamiliar metadata vocabularies including cross-language retrieval
(http://sims.berkeley.edu/research/metadata/), to facilitate retrieval of information for
large-scale, distributed information services.
The development and integration of these tools will enable extensive testing and
comparative evaluation on the service providers that comprise the Arts and Humanities Data
Service (AHDS), the CURL database, the Online Archive of California and the Making of
America II database, with the aim of providing a production service for these entities and
delivering production-quality versions of the Cheshire software for Higher Education
Institutions in the United Kingdom and the United States.
4 Research Issues: Integrating Access to Resources Across Domains
The development of the Cheshire system outlined above has been driven by the belief
that many of the recommendations of the National Resource Discovery Workshops (MODELS 3
and 4) can be catered for by a standards-based information retrieval system that will
provide a bridge between existing on-line catalogue technology and databases and the
explosively growing realm of network-based digital libraries with information resources
including full-text, geo-spatial data, numerical data and multimedia. In the following
sections, we will discuss each of these recommendations and the active research issues and
design elements particular to the proposed Cheshire system that are associated with them.
These sections address the following research issues: 1) Distributed Object Retrieval
Architecture; 2) Management of Vocabulary Control in a Cross-Domain Context; 3)
Distributed Access to Existing Metadata Resources; 4) Navigating Collections; and 5)
Support for Cross-Domain Clumps to Facilitate Resource Discovery.
4.1 Distributed Object Retrieval Architecture
We see the architecture for the evolution of distributed information access systems as
a highly flexible and dynamic system. In such a system both the data (digital objects
instantiating information resources) and the programs that operate on that data (methods)
to achieve the needs and desires of the users of the system for display and manipulation
of the data (behaviours) will be implemented in a distributed object environment using
CORBA for object management.
The basic architecture is a three-tiered division of data and functionality. The tiers
are:
- The Client. The basic client for the distributed Cheshire
system can be any JAVA-enabled WWW Browser. The primary data delivery format will be
XML (or HTML for initial versions), and the methods for manipulating and navigating within
the data will be implemented as CORBA-enabled JAVA applets, delivered on demand to the
browser.
- The Application Tier. Applications for search and manipulation
of data are distributed between the client and network servers (including the
repositories) to provide distributed functionality (and to provide new behaviours to
clients on demand from any compliant network server). The application tier or layer would
provide both JAVA applets for execution on the client and server-side
methods invoked directly on objects in the repository, either via direct CORBA invocations
or indirectly via requests from other protocols (e.g. Z39.50 or Open Geospatial Datastore
Interface (OGDI) for network access to heterogeneous geographic data held in multiple GIS
formats and spatial reference systems (Gardels, 1996, 1997)). For example, a client
browser might download an applet that can display MARC records, and invoke a server-side
method to convert repository objects in XML to MARC format. We expect, for performance
reasons, that many operations on stored objects will be server-side methods, with
primarily display functions on the client side.
- The Repository. Digital objects and metadata describing them
will reside in the repositories tier or layer. Repositories can be implemented in a
variety of ways, ranging from conventional Relational, Object-Relational, or Object-Oriented
database systems and Text retrieval engines, to metadata repositories referencing
physical collections in libraries.
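The division of labour among the three tiers can be illustrated with a minimal sketch. It is written in Python purely for brevity (the proposal itself envisages CORBA and JAVA), and every class name, method name and sample record in it is a hypothetical placeholder rather than part of the Cheshire software. The point shown is only that conversion happens in a server-side method while the client merely requests and displays the result.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Repository tier: stores digital objects keyed by identifier."""
    objects: dict[str, dict[str, str]] = field(default_factory=dict)

    def get(self, object_id: str) -> dict[str, str]:
        return self.objects[object_id]


class ApplicationTier:
    """Application tier: server-side methods invoked on repository objects."""

    def __init__(self, repository: Repository) -> None:
        self.repository = repository

    def as_labelled_text(self, object_id: str) -> str:
        # Hypothetical server-side conversion: the stored object is
        # rendered into a display format before it reaches the client.
        record = self.repository.get(object_id)
        return "\n".join(f"{name}: {value}" for name, value in sorted(record.items()))


class Client:
    """Client tier: requests a rendering and displays it, nothing more."""

    def __init__(self, application: ApplicationTier) -> None:
        self.application = application

    def show(self, object_id: str) -> str:
        return self.application.as_labelled_text(object_id)


# Made-up sample object for illustration only.
repository = Repository({"rec1": {"title": "Sample finding aid", "creator": "Unknown"}})
client = Client(ApplicationTier(repository))
print(client.show("rec1"))
```

In a distributed deployment the call from Client to ApplicationTier would be a remote invocation, which is why the sketch keeps the client free of any knowledge of how objects are stored.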
4.2 Management of Vocabulary Control in a Cross-Domain Context
This is a key issue in the integration of resources across domains, as brought forward
in MODELS 4. The underlying problem, recognized for over a decade, is that the current
generation of on-line catalogues in most libraries does not do a very good job of providing
topical or subject access to the collections (Matthews et al., 1983). The common result
of many subject searches (up to 50%) is search failure, or zero results. In a
distributed environment, this is compounded by the lack of vocabulary control
across domains, added to which is the tendency for users to use general wording or terms
in subject queries, rather than specific ones.
The Cheshire system was initially designed to overcome these
difficulties, and provide users with the tools to formulate effective queries. In its
current configuration, the server is able to map the searcher's notion of a topic to
the terms or subject headings actually used to describe that topic in the database. This
is provided for by a variety of search and browsing capabilities (2), but the primary distinguishing feature is the support for
probabilistic searching on any indexed element of the database. This enables the use
of natural language queries to retrieve the most relevant entries in one or more
databases, even though there may be no exact Boolean matches.
The result set of a probabilistic search is ranked in order of estimated relevance to
the user's query. The search engine also supports relevance feedback, as well as
automatically generated hypertext links that will allow the user to follow dynamically
established linkages between associated records.
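The contrast between exact Boolean matching and best-match ranking can be sketched as follows. This is not the Cheshire retrieval algorithm, which is not specified in this section; a simple term-frequency and inverse-document-frequency weighting is substituted here as a stand-in, and the function names and sample records are assumptions made for the illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case the text and strip simple punctuation from each word."""
    words = (word.strip(".,;:()").lower() for word in text.split())
    return [word for word in words if word]


def rank(query: str, documents: dict[str, str]) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs, best match first, omitting zero scores."""
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    # Number of documents in which each term occurs.
    doc_freq = Counter(term for terms in doc_terms.values() for term in terms)
    scores: dict[str, float] = {}
    for doc_id, terms in doc_terms.items():
        length = sum(terms.values())
        score = 0.0
        for term in set(tokenize(query)):
            if term in terms:
                # Rare terms count for more than common ones.
                idf = math.log(1 + n_docs / doc_freq[term])
                score += (terms[term] / length) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Made-up sample records for illustration only.
documents = {
    "a": "medieval manuscripts of England",
    "b": "census statistics for England and Wales",
    "c": "aerial photography",
}
for doc_id, score in rank("medieval manuscripts from England", documents):
    print(doc_id, round(score, 3))
```

A strict Boolean AND of the four query words would return nothing here, because no record contains "from"; the ranked search still returns the partially matching records, with the closest match first.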
Support for probabilistic retrieval is critical to the success of a cross-domain
server, insofar as it allows users to make effective queries even when there is no
controlled vocabulary. In the case of the Social Science documents, for example, a user
can make a successful probabilistic search on a given subject, where a
traditional Boolean search would fail. The deployment of these algorithms is only a
preliminary step in managing vocabulary control. There are, naturally, many research
issues to be addressed in finding the optimal mappings of user and document vocabulary to
the controlled vocabularies used in descriptive metadata; this is discussed further under
Search Support for Unfamiliar Metadata Vocabularies below.
4.3 Distributed Access to Existing Metadata Resources
4.3.1 Data Mining: Dublin Core Metadata
One approach to semantic interoperability of distributed systems is to use a
standardized set of metadata, such as the Dublin Core, for the description and retrieval
of electronic resources from disparate data. For example, the Arts and Humanities Data
Service (AHDS) currently operates on a model in which extended Dublin Core elements are
used as the means of retrieving information from five different service providers. In
practice, this has proved to be an inefficient way of exploiting data from
these services, since the data providers often interpret Dublin Core elements differently
and also because Dublin Core elements comprise only a small part of the complex, rich data
resources which could be available as a means of search and retrieval. To take one
example, a full-text search of the existing AHDS gateway (http://prospero.ahds.ac.uk:8080/ahds live)
for the term "England" in the Dublin Core element set for the
Oxford Text Archive produces only one hit, whereas a Cheshire version of the
TEI-header information of the same service (http://sherlock.berkeley.edu/OTA/)
produces 258 hits, ranked in order of relevance.
This is why many of the fundamental retrieval algorithms being developed as part of the
Cheshire project, described below, are based on the premise that front-end prototyping
will involve entire information resources, not simply restricted subsets based on Dublin
Core metadata. These will provide a much richer platform for the development of retrieval
strategies among large and complex data sets, and the inclusion of the Arts and Humanities
Data Service (AHDS) in this project will bring particular expertise in this area, since
this service currently is a practical implementation of the Dublin Core and the service
providers are already experienced in a production environment in the use of Dublin Core
for resource discovery (Miller & Greenstein, 1997).
4.3.2 Search Support for Unfamiliar Metadata Vocabularies
The next step beyond simple shared category lists like Dublin Core will be to provide
support for enhanced retrieval of unfamiliar metadata (as distinct from Dublin Core
Metadata), extending the findings of the DARPA-funded research project on Search Support
for Unfamiliar Metadata Vocabularies
(http://sims.berkeley.edu/research/metadata). This research, based on work from
the Cheshire research projects, is attempting to go substantially beyond the state of
the art in developing systems that can construct linkages between natural language
expressions of topical information and controlled vocabularies automatically. Today most
such systems depend on the expensive human crafting of links within and between
vocabularies.
For the purposes of this project we propose continued development of the Cheshire
client and application layer middleware to provide sets of entry vocabulary
modules based on the controlled vocabularies of our testbed databases. These
EVMs will accept natural language expressions of users' queries and
will generate a ranked list of controlled vocabulary headings most likely to be useful
for that search. This will have three uses:
- as a prompt when searching an unfamiliar vocabulary.
- as computer-aided or automatic indexing of data resources using existing controlled
vocabularies.
- to extend searches, using derived information of found records as a basis for finding
similar records in another database.
When used in conjunction with the existing Cheshire algorithms for probabilistic
indexing and retrieval, these EVMs provide descriptive surrogates that can be used to match
user or document terminology to corresponding controlled vocabulary terms.
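The behaviour expected of an entry vocabulary module, taking free-text wording in and returning ranked controlled headings, can be sketched in a few lines. The association measure used below (the share of a word's occurrences that co-occur with each heading in already-indexed records) is a deliberately simple assumption and is not claimed to be the method used in the Cheshire or DARPA-funded work; the function names and the training records are likewise hypothetical.

```python
from collections import Counter, defaultdict


def build_evm(training):
    """Count how often each word co-occurs with each assigned heading.

    `training` is an iterable of (free_text, [controlled_headings]) pairs
    taken from records that have already been indexed by hand.
    """
    word_heading = defaultdict(Counter)
    for text, headings in training:
        for word in set(text.lower().split()):
            for heading in headings:
                word_heading[word][heading] += 1
    return word_heading


def suggest_headings(evm, query, limit=3):
    """Return up to `limit` headings, most strongly associated first."""
    scores = Counter()
    for word in set(query.lower().split()):
        counts = evm.get(word)
        if not counts:
            continue  # word never seen in the training records
        total = sum(counts.values())
        for heading, count in counts.items():
            scores[heading] += count / total
    # Sort by descending score, then alphabetically for a stable order.
    ordered = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [heading for heading, _ in ordered[:limit]]


# Made-up training records for illustration only.
training = [
    ("steam engines and railways", ["Locomotives"]),
    ("railways timetable history", ["Railroads"]),
    ("history of steam locomotion", ["Locomotives"]),
]
evm = build_evm(training)
print(suggest_headings(evm, "steam railways"))
```

The same ranked list serves all three uses listed above: it can be shown to a searcher as a prompt, offered to an indexer as candidate headings, or used to reformulate a query for a database that uses a different vocabulary.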
4.4 Navigating Collections (the Drilling Down Approach)
One of the primary considerations brought forward in the discussion of search models
during the MODELS workshop 4 is the use of Z39.50 to support a drilling down
approach, which would permit users to drill down between generic and domain-specific
descriptive information. The difficulties of this in the context of Z39.50 are
cited in the MODELS 4 recommendation for further work [item 2.3].
In designing the second and subsequent versions of the Cheshire system, we faced the
question of how to provide a search engine that could support a navigational record
schema, that could be used on both simple text and complex structured records, and also
support complex multimedia documents and databases. In answer to this, it was decided to
adopt SGML as the fundamental data storage type for the Z39.50 client/server. Virtually
all data manipulation for the database has been generalized as processes acting on SGML
tags or sets of tags. Instead of having to develop new routines to manipulate each
subelement of a new datatype, the developer only needs to provide a DTD and a conversion
routine to convert the new data type to SGML. The built-in file manipulation and indexing
routines can then extract and index any tagged sub-elements of the data type for access (3).
In the proposed distributed system architecture, this functionality will be part of the
application layer and will be available to both client and repository manipulation of the
SGML/XML data via CORBA distributed methods.
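The idea that one generic routine can index any tagged sub-element, whatever the document type, can be illustrated with a short sketch. It assumes well-formed XML input, because the Python standard library has no DTD-driven SGML parser, and it is not the Cheshire indexing code. The function name and the sample records are hypothetical; unittitle and unitdate are EAD element names used only as examples.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_elements(records: dict[str, str], tags: set[str]) -> dict[str, dict[str, set[str]]]:
    """Build one inverted index per tag: index[tag][word] -> set of record ids."""
    index = {tag: defaultdict(set) for tag in tags}
    for record_id, markup in records.items():
        root = ET.fromstring(markup)
        for element in root.iter():
            if element.tag in tags:
                # itertext() also collects text inside nested sub-elements.
                text = "".join(element.itertext())
                for word in text.lower().split():
                    index[element.tag][word].add(record_id)
    return index


# Made-up sample records for illustration only.
records = {
    "r1": "<ead><unittitle>Example shipping <emph>company</emph> papers</unittitle>"
          "<unitdate>1840</unitdate></ead>",
    "r2": "<ead><unittitle>Example family papers</unittitle></ead>",
}
index = index_elements(records, {"unittitle"})
print(sorted(index["unittitle"]["papers"]))
```

Supporting a new document type in this scheme means supplying different tag names, not new indexing code, which is the economy the paragraph above describes.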
In using SGML tagging for all data in the database and by adopting the SGML DTD
language to define the structure of each data file, it is possible to use a common format
for data types ranging from full-text documents and structured bibliographic records to
complex hypertext and multimedia documents (using the HTML DTD that defines the elements
of the WWW pages). This has important economies in the delivery of resources
across domains.
We propose to support a far broader range of SGML document types, and to provide JAVA
methods for display of SGML documents on the client. The obvious candidate for this is
DSSSL (Document Style Semantics and Specification Language, international standard ISO/IEC
10179), although the use of XML (Extensible Markup Language), a restricted subset of SGML,
together with XSL (Extensible Stylesheet Language) for additional formatting capabilities, will
also be supported. At a practical level, by creating style sheets for the most commonly
used SGML data types (EAD, CIMI, DDI, TEI), it will be possible to deliver visual
representations of nested data using multiple DTDs. In order to achieve this, the
participants will have to agree on some common visual representation of data, requiring
consultation among institutions. The University of Liverpool has already begun by
developing and distributing a prototype conversion programme which will format and index
archival finding aids encoded in EAD. The functionality of this programme will become part
of the client-side methods for the Cheshire client
(http://gondolin.hist.liv.ac.uk/azaroth/ead2html.html).
4.5 Support of Cross-Domain Resource Discovery
Cross-Domain Resource Discovery is the area of primary concern for this proposal. Its
importance has been emphasized by both MODELS workshops and addressed in phase 3 of the
JISC eLib programme in the UK and has been an important area of research in the NSF/NASA/ARPA
Digital Library Initiative projects in the US. Indeed, the idea of a National Scale
Resource Discovery system for the UK is based on groupings or clumps of OPACs, whether
they be physical or virtual clumps, predicated on the use of Z39.50.
Although the sheer variety of data types produced for indexes, catalogues and archival
listings across domains remains a challenge, evidence demonstrates that most institutions
are now using SGML as the mark-up language of choice. The recent report for BIBLINK, for
example, documents the widespread use of SGML as the encoding format for the kinds of
metadata which are likely to be included in national bibliographies. These include TEI,
EAD, CIMI, and DDI, as well as the SGML version of USMARC released by the Library of
Congress (and originally developed as part of the Cheshire project).
Many of the principal international bibliographic data carriers such as Research
Libraries Group (RLG) and OCLC are now developing strategic initiatives predicated on the
use and development of SGML formatted applications. There is currently a movement to
resolve interoperability issues around the CIMI, EAD, TEI, DDI, and MARC DTDs.
Despite this, relatively little work has been undertaken on the use of production-level
SGML and XML-based search engines using the Z39.50 information retrieval protocol.
This makes the development, testing, and implementation of a Z39.50 client/server using
SGML, as described in this proposal, a critical priority for development. It is, in our
view, an absolute prerequisite if the recommendations coming out of the MODELS workshop 4
are to be implemented.
The proposed Cheshire system, with its potential to display hierarchically organized
information about digital collections distributed across servers, is probably the only
working model for cross-domain resource discovery that is entirely standards-based. In
its use of SGML, it allows institutions to localize their own descriptive information,
while permitting remote users distributed access via a structured information retrieval
protocol (Z39.50). The use of DSSSL/XSL permits users to navigate consistently across
collections, which may not have conventionally defined structures, provided they can be
expressed in SGML.
In particular, it is proposed to extend the SGML/XML handling capabilities of Cheshire
to exploit an extraordinary range of documents, providing CORBA-based methods for
extracting and indexing their contents available to any client, application or repository
that conforms to our overall architecture for distributed information retrieval systems.
5 Testbed Development
To help develop the appropriate technologies we propose to use two large-scale
information services sponsored by JISC in the UK which offer complementary data formats
(the Arts and Humanities Data Service (AHDS) and the Consortium of University
Research Libraries (CURL) databases) and two large-scale distributed object databases in the US
(the Online Archive of California and the Making of America II databases). Although these
will be the focus of the present proposal, we will also be bringing together a consortium
of related data providers who may wish to test data using the proposed Cheshire system:
these include government agencies (the Public Record Office); universities (Glasgow,
Durham, Liverpool, and Oxford); hybrid library projects sponsored as part of
the eLib 3 programme in the UK; and, in the US, the Archive Working Group at Yale University (including
other participants in the US, UK and Australia) and the Institute for Advanced Technology in
the Humanities at the University of Virginia (contact: Daniel Pitti). We will
also be providing the Cheshire technology to and participating in the development of the
NESSTAR (Networked Social Science Tools and Resources) project.
The NESSTAR project (http://dawww.essex.ac.uk/projects/nesstar.html) is
combining the skills and knowledge of the three main partners, The Data Archive in the UK,
Danish Data Archives and Norwegian Social Science Data Services with assistance in
significant areas of user analysis, usability, user validation, evaluation and quality
assurance from the Institute of Journalism in Norway, ASEP and JD Systems in Spain,
Central Statistics Office in Ireland and Aarhus University in Denmark. The Council of
European Social Science Data Archives is a sponsoring partner for the project. The project
is funded by the European Commission under the Information Engineering sector of the
Telematics Applications programme.
It is our intention that the technology developed as part of this research proposal
will serve as the basis for full-scale information systems of international prominence.
We chose the AHDS, CURL, OAC, and MOA2 databases as the focus of our work for several
reasons: the data sets are large and of a diverse nature; users of these services
represent a broad range of technical expertise; all have a well-founded administrative
structure with existing user-evaluation mechanisms (thus reducing research overhead
costs); and finally, the proposed system would give considerable added value to the
repositories themselves, which already comprise valuable national and international
resources.
Both PIs have a long-standing connection with the AHDS: an earlier version of Cheshire
was installed by the PIs for the History Data Service, one of the AHDS providers, and now
forms part of the AHDS gateway. In addition, one of the principal investigators, Paul
Watry, is a member of the CURL RDD Committee, which includes development of the COPAC
service in its remit.
A brief description of the data services making up the core testbed for this project
follows:
The CURL Database : The Consortium of University Research Libraries (CURL)
currently gives access to its bibliographic database via COPAC, a Z39.50 service funded by
JISC and supported by Manchester Information Datasets and Associated Services (MIDAS). The
COPAC service currently consists of some 3.5 million MARC records held on a central server
at MIDAS, but there are plans to extend this database to non-bibliographic data
resources, such as full-text and EAD-encoded documents.
Arts and Humanities Data Service : The Arts and Humanities Data Service
(AHDS) is a national service funded by JISC to collect, describe, and preserve the
electronic resources which result from research and teaching in the humanities. This
research project will focus on the current production services of four AHDS data
services. The targets for AHDS include:
- Archaeology Data Service (ADS): The ADS uses a proprietary DBMS (Fretwell-Downing's
VDX system) to store extensive data incorporated as part of the resource, such as
geospatial images, aerial photography, and CAD images.
- Performing Arts Data Service (PADS): PADS is currently using Hyperwave, an
object-oriented information retrieval system, to locally store and retrieve information
retained in a variety of formats.
- Oxford Text Archive (OTA): The OTA holds its entire corpus as SGML documents.
- History Data Service (HDS): The History Data Service holds its data as SGML documents
which point to a range of numeric and alphanumeric data, text, digitized boundary data,
and images converted from historic source documents into computer-readable form.
These four services are available via the AHDS gateway with access points determined by
an extended version of Dublin Core. In addition, there are full-text versions of the
History Data Service and the Oxford Text Archive (TEI header information only) available
via Cheshire clients and servers.
- We intend, first, to convert the metadata from the ADS and PADS databases to SGML.
(Full text from OTA and metadata from HDS are already encoded in SGML and require no
conversion.)
- Then, in order to further extend the capabilities of these databases, we intend to
develop Cheshire client access methods for all primary data types (text, image, and
map-oriented data), indexing each document by as many methods as are applicable. For
example, photographs will be indexed not only by the content of their images, but also by
their text, their pre-assigned categories, their location, and so forth. This will
necessitate development of a Cheshire client that supports formatting of SGML
documents using DSSSL or XSL. These indexed documents will provide a basis for testing
retrieval algorithms as well as for pre-processing and post-processing retrieval result
sets.
- A Cheshire query for these data resources may require interaction with one or more
interoperable clients that display a different document data type: for example,
geographic information system (GIS) data sets. A GIS-oriented browser delivered in
response to a Cheshire query will make it easy to ask for information pertaining to a
geographic region. In devising such a system, we will integrate the access tools
developed for the DARPA-funded Berkeley Digital Libraries project into the Cheshire
environment. For example, a related client is being developed by researchers at UC
Berkeley to display a union of aerial photographs from UC Santa Barbara (Project
Alexandria) and UC Berkeley databases.
- Finally, we intend to use the HDS data as a testbed for integrating numerical
statistical databases and geographic databases within the Cheshire system.
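The multi-method indexing described in the list above (one photograph reachable by its text, its pre-assigned categories, and its location) can be illustrated with a minimal sketch. This is not Cheshire code: the `Photo` record, the `MultiIndex` class, and the simple latitude/longitude bounding box are assumptions made for illustration, and image-content indexing is left out entirely.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Photo:
    """Hypothetical record for one photograph in a collection."""
    doc_id: str
    caption: str
    categories: Tuple[str, ...]
    lat: float
    lon: float


class MultiIndex:
    """Index each document by caption terms, categories, and location."""

    def __init__(self) -> None:
        self.by_term: Dict[str, Set[str]] = defaultdict(set)
        self.by_category: Dict[str, Set[str]] = defaultdict(set)
        self.locations: Dict[str, Tuple[float, float]] = {}

    def add(self, photo: Photo) -> None:
        for term in photo.caption.lower().split():
            self.by_term[term].add(photo.doc_id)
        for category in photo.categories:
            self.by_category[category].add(photo.doc_id)
        self.locations[photo.doc_id] = (photo.lat, photo.lon)

    def in_region(self, south: float, west: float,
                  north: float, east: float) -> Set[str]:
        # Plain bounding-box test; boxes crossing the antimeridian are not handled.
        return {doc_id for doc_id, (lat, lon) in self.locations.items()
                if south <= lat <= north and west <= lon <= east}

    def search(self, term: Optional[str] = None,
               category: Optional[str] = None,
               region: Optional[Tuple[float, float, float, float]] = None
               ) -> List[str]:
        # Start from every document and intersect with each criterion supplied.
        results = set(self.locations)
        if term is not None:
            results &= self.by_term.get(term.lower(), set())
        if category is not None:
            results &= self.by_category.get(category, set())
        if region is not None:
            results &= self.in_region(*region)
        return sorted(results)
```

A query such as `index.search(term="aerial", region=(37.0, -123.0, 38.0, -122.0))` would then combine a text criterion with a geographic one, which is the kind of cross-type access the GIS-oriented browser is meant to expose.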
Online Archive of California : The Online Archive of California
Project is a two-year pilot project to develop a UC-wide prototype union database of
30,000 pages of archival finding aid data encoded using the Encoded Archival Description
(EAD) SGML document type definition. This database will serve as the foundation for the
development of a full-scale digital archive for the University of California (UC) system,
available via the Internet to diverse user communities.
Making of America II : The Making of America II is a Digital Library
Federation project to continue and extend research and demonstration projects that have
begun to develop best practices for the encoding of intellectual, structural, and
administrative data about primary resources housed in research libraries. The Making of
America II Testbed Project collection will cover Transportation, 1869-1900,
particularly the development of the railroads and their relationship to the cultural,
economic, and political development of the United States. It will comprise multimedia
information coordinated by SGML metadata and hub documents.
In addition to using data from the data services cited above, we will also focus on
identifying additional text and bibliographic resources to interact with the numerical and
geospatial data sources. We plan to include, for example, NESSTAR (discussed above), the
UC Data archive, and the existing Digital Library Initiative database at Berkeley as part
of these extended resources.
6 Project Development Details
6.1 Systems Development and Integration Design and Development Methods
The specific aims of the development work over the three-year duration of the project
are as follows:
- To encode the client/server library in Java to promote cross-platform use.
Specifically, the system will require a Z39.50 communications class library and an SGML
parser class library, all written in Java, or, minimally, a graphical user interface (GUI)
encoded in Java making calls to the C code. We intend to proceed incrementally, encoding
the GUI first before progressing to the other class libraries. This implies developing
the entity management functions of opening, closing, and editing files. We intend to use
Native Methods to tie in other functionality, including the Z39.50 libraries, the SP SGML
parser, and the Jade DSSSL engine.
- The client software will have to support a far broader range of SGML document types than
it currently does. For this to prove practical, some convention for display of SGML
documents needs to be implemented for the client. The obvious candidate is DSSSL, which
would imply also implementing a DSSSL engine in Java. The Jade system from James Clark is
implemented in C++ and would have to be ported to Java. This would be available as a CORBA
service for any SGML/DSSSL data.
- The collections proposed will produce some very complex SGML documents, with some
containing binary data. This will require the delivery of multiple files to the client for
display of a single document, possibly including separate files for the SGML declaration,
the DTD, a DSSSL style sheet, the basic document, and multiple referenced files containing
binary data. At a practical level, this means that for a client to display a document
retrieved in response to a query, it will have to obtain the SGML declaration, the DTD,
any DSSSL/XSL specifications, and the base document, and invoke appropriate services
(methods) to parse them. It will also have to resolve SGML entity references from the
parsed base document to other files comprising the document and retrieve those additional
files (we plan to use extensions of Z39.50 to accomplish this initially), and finally use
the additional retrieved document components (including any DSSSL style specifications) to
generate a display of the document for the user. The mechanisms used both to refer to the
different files composing a single SGML document (URNs, etc.) and to locate and retrieve
the component parts will need further specification. It is acknowledged that specifying
these mechanisms will be both a technical and a political issue, as control of and access
to files from various institutions will have to be arranged within a single technical
framework.
- We are assuming that the clients will be Java-based and will employ Java's support
for UNICODE to enable the display of character sets outside the limited repertoire of
ASCII and ISO 8859 Latin-1. The indexing and retrieval methods of the applications
layer (servers) will also need to be recoded to handle UNICODE character sets as well as
the standard ASCII/ISO 8859 sets. Font sets for all languages present in the various
collections will need to be either located or produced for inclusion in the client
software.
- Given the wide variety of SGML input mechanisms over distributed sites (including SGML
editors, word processors, and DBMSs, involving varying degrees of detail in a variety of
media), we intend to examine encoding practices under SGML with a view to cross-domain
standardization. For example, we plan to consult with appropriate authorities and
participants in the Yale Archives Group on research issues relevant to distributed
on-line access to archives for EAD and MARC-AMC encoded materials. These include: ISAAR
(CPF); management of vocabulary control; metadata; and content mapping between ISAD(G),
EAD, MARC-AMC, and the MARC-AMC DTD.
- Throughout this project, we will use the SGML handling capabilities of the Cheshire
system to exploit different document types, to extract and index their contents, and to
create Entry Vocabulary Modules (EVMs) for intelligent mapping from natural language to
the controlled vocabulary used in databases. This will include the implementation of EVMs
for access to standardized classification systems (such as the Library of Congress
Classification) and other controlled vocabularies used in participating databases.
- One particularly relevant extension to this work will be the DDI DTD, used by the
History Data Service (HDS) and by NESSTAR to provide electronic codebooks for
numerical databases. In this respect, we will seek to integrate numeric statistical
databases, which assemble facts about a wide variety of social, economic, and natural
phenomena, with document databases, which describe information about various topics and
range from news reports and library catalogue entries to full-text articles from academic
journals. Both Berkeley and Liverpool have been collaborating with the History Data
Service and the NESSTAR project in Europe to use the Cheshire system for storage and
retrieval of DDI-encoded data about numerical databases.
- Given the Archaeology Data Service's (ADS) focus on environmental information
(archaeological digs), an area of particular interest in terms of this project is
geographic information systems (GIS). A primary development goal will be to integrate
geographic databases (geo-data) with text and numeric data. These constructs are not only
of a conceptual nature, but will form a working prototype within the AHDS itself.
- We plan to convert the metadata associated with the multimedia resources of PADS and
ADS to SGML. The converted version of this metadata will be used for testing planned
extensions for CORBA. We intend to draw on the work of researchers at UC Berkeley on
ZQL for SQL access to object-relational DBMSs via Z39.50.
- The integration of diverse information resources in a production environment that
extends beyond the participating databases described above will require research into the
design of a metaprotocol for describing the protocols needed to interact with the
distributed system components and databases, which might also be used by databases and
systems outside the project. The metaprotocol could be used to describe the data objects
transferred by a given protocol and the expected interactions between distributed
components. Information in the metaprotocol might be used to allow a new protocol to be
dynamically incorporated into the system. We plan for an initial version of this protocol
to be based on the Z39.50 Explain service, which can provide detailed information about
the databases, search capabilities, and data attributes of Z39.50 servers. This will need
to be extended with support for CORBA method descriptions (e.g., IDL and the CORBA
Interface Repository).
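The document-assembly step described in the list above (fetch the supporting files and the base document, resolve entity references, then retrieve the referenced components) can be sketched roughly as follows. This is an illustrative sketch only, not the planned Java client: it uses regular expressions rather than a real SGML parser such as SP, the `entityref` attribute name is an assumption about the DTD, and the `fetch` callable is a hypothetical stand-in for whatever Z39.50 extension eventually delivers the files.

```python
import re
from typing import Callable, Dict, List

# External entity declarations, e.g.
#   <!ENTITY map1 SYSTEM "maps/map1.gif" NDATA gif>
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+(?P<name>[\w.-]+)\s+SYSTEM\s+"(?P<system_id>[^"]+)"'
    r'(?:\s+NDATA\s+(?P<notation>[\w.-]+))?\s*>'
)
# General entity references in content, e.g. &chap2;
ENTITY_REF = re.compile(r'&(?P<name>[\w.-]+);')
# Entity names used as attribute values; the attribute name is assumed.
ENTITY_ATTR = re.compile(r'\bentityref\s*=\s*"(?P<name>[\w.-]+)"')


def external_entities(base_document: str) -> Dict[str, str]:
    """Map each declared external entity name to its system identifier."""
    return {m.group("name"): m.group("system_id")
            for m in ENTITY_DECL.finditer(base_document)}


def referenced_files(base_document: str) -> List[str]:
    """System identifiers of declared entities the document actually uses."""
    declared = external_entities(base_document)
    used = {m.group("name") for m in ENTITY_REF.finditer(base_document)}
    used |= {m.group("name") for m in ENTITY_ATTR.finditer(base_document)}
    return [system_id for name, system_id in declared.items() if name in used]


def assemble(base_id: str, support_ids: List[str],
             fetch: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Collect every file needed to display one document.

    support_ids names the SGML declaration, DTD, and style sheet;
    fetch is a hypothetical retrieval function supplied by the caller.
    """
    parts = {support_id: fetch(support_id) for support_id in support_ids}
    parts[base_id] = fetch(base_id)
    base_text = parts[base_id].decode("utf-8")
    for system_id in referenced_files(base_text):
        parts[system_id] = fetch(system_id)
    return parts
```

Passing `fetch` in as a parameter keeps the open question of how components are named and located (URNs, etc.) outside the assembly logic, so that mechanism can change without touching the rest of the client.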
7 Performance Goals and Research Plan
Over the three years of this project there are several major objectives, each with a
number of research tasks and milestones that will be critical to the success of the
project as a whole. The systems development work will be divided between the University of
California, Berkeley and the University of Liverpool. In general, client design and
development will be done at Liverpool; server design and development will be done at
Berkeley.
The project is designed in three phases:
- Year One: Design of prototypes for access to each type of data;
- Year Two: Design of preliminary prototype for traversing between the types;
- Year Three: Demonstration of production prototype for accessing each and traversing
between them.
Year 1 Tasks
- Design and implement a preliminary CORBA-based version of the Cheshire system on the
existing C code base, including preliminary design for the distributed component protocols
(Berkeley).
- Preliminary design of user interface; encoding Z39.50 client library in Java, including
support for UNICODE (Liverpool). 18 months for the first pass, and 36 months for
production-level code.
- Implementation of DSSSL engine in Java (Liverpool). 12-18 months, using the Jade
system ported from C++.
- Implement prototype Entry Vocabulary Modules.
- Design and implement preliminary support for geographic coordinate search and retrieval
in the Cheshire system.
- Develop a matrix of data types and of hardware/software environments to ensure a full
range of problems; formalize existing in-house data and net-based resources to create a
defined testbed for this research.
Year 2 Tasks
- Design and implement a Java, C, and C++ CORBA-based distributed component version of
the Cheshire III server (Berkeley).
- Testing and performance evaluation of the distributed component protocols for the
distributed Cheshire system (Berkeley).
- DSSSL specifications for EAD, TEI, DDI; Z39.50 attribute profile for EAD (Liverpool).
- Cross-institutional standardization of SGML encoding practices (Liverpool).
- Test and evaluate improved Entry Vocabulary Modules. Integrate with the Cheshire system
(Liverpool and Berkeley).
- Modify current OGDI/gltp software to accept input from the indexing/searching engine
and to acquire metadata, geospatial objects, and attribute information from remote
geographic datastores; work to inform the OGC abstract model definition process for
catalogue services.
- Design the MetaProtocol for description of, and dynamic adaptation to, new
telecommunications protocols.
- Test Cheshire interoperability with other OGDI/gltp programmes.
Year 3 Tasks
- Design and implement an improved CORBA-based version of the Cheshire system,
incorporating support for the MetaProtocol as well as full support of the component
architecture.
- Implement and test improved user interface with users.
- Evaluate use of the system and its effects on users' work.
- Document and report on the system, its use and value.
- Demonstrate interoperability across multiple systems and protocols.
7.1 Deliverables
- The production of a robust, scalable on-line information retrieval system based on
international standards (Z39.50 and SGML) which can be used as a delivery mechanism for
cross-domain resource discovery projects. This system will be portable to a range of
system platforms, and will support the navigation of digital collections, as specified in
the Z39.50 Profile for Access to Digital Collections. Navigational support will enable
users to retrieve lists of related collections, including parent, superior, and context
collections, brief descriptions of these collections, and descriptions of their
relationship to the subject collection.
- The Arts and Humanities Data Service has agreed to serve as host for the code, to be
made available via ftp. In addition, UC Berkeley will continue to be a host in the United
States and Liverpool will continue to be a host in the United Kingdom. Installation will
be possible for any institution running a Unix or Wintel NT-based system, and the source
code will be self-documenting.
- Consortium participants (AHDS, CURL, OAC, and MOA2) will participate in a trial service
involving a wide range of databases. The demonstration of this system, extended over a
number of sites in the United Kingdom, will show how a standards-based information
retrieval system can support distributed access to existing catalogues, statistical
databases, multimedia, GIS, and full-text data.
- Z39.50 formatting for commonly used SGML formats will be produced for server/client use
based on DSSSL or XSL.
- Consortium members will document progressive stages of the project, demonstrating the
development of each of the constituent components, taking a model, standard, or prototype
element and extending and adapting that element to production level, and showing how the
overall structure conforms to a particular vision of the next generation of on-line
catalogues and similar on-line information systems. It is intended to publish further
findings on retrieval algorithms, the use of SGML structured documents as database
objects, management of distributed objects for information access and retrieval, user
interfaces, and user reactions to the advanced on-line catalogue.
8 Project Management Arrangements
The Principal Investigators will be Professor Ray R. Larson, School of Information
Management and Systems, University of California, Berkeley, for the US/NSF-sponsored half
of the project and Dr. Paul Watry, Special Collections and Archives, University of
Liverpool, for the UK/JISC-sponsored half. The host US institution will be the University
of California, Berkeley, and the host UK institution will be Liverpool University. The
additional consortium partners will include the Arts and Humanities Data Service Executive
(Director: Daniel Greenstein), the History Data Service (Sheila Anderson), the Oxford Text
Archive (Michael Popham), the Performing Arts Data Service (Celia Duffy and Carona Boehm),
the Archaeology Data Service (Julian Richards), and the Consortium of University
Research Libraries (contacts: Ric Collins and Julia Chruszcz). It is proposed to create a
management committee, chaired by a University Librarian, that will meet at regular
intervals with project site representatives.
Prior to making a formal proposal, all participants will attend a project meeting
chaired by a University Librarian to draw up a memorandum of agreement avoiding any
potential conflicts. All participants will be formally committed to implementing and
testing the system, and to participating in the design of the DSSSL/XSL formats.
9 Project Evaluation and Dissemination
Following the Tavistock report (Stern & Kelleher, 1995), the project will
deploy formative and summative evaluation strategies as outlined below:
9.0.1 Formative Evaluation Activities
We intend to use two methods of data collection, transaction monitoring and on-line
questionnaires, to obtain information on how the system is used, and on the reactions and
opinions of users about system features and capabilities. The data collected by
transaction monitoring can be used to reconstruct users' interactions with the
system for detailed analysis of the types of searches conducted and their results.
Transaction monitoring will be supplemented by an on-line questionnaire, which can
provide insight into what users feel about certain features and the user interface. The
questionnaire will be based on the user questionnaire developed for the Council on Library
Resources' national survey of on-line catalogue use and users (4).
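As a rough illustration of how transaction-monitoring data might be used to reconstruct user interactions, the sketch below groups log records into time-ordered sessions and tallies search types and zero-hit searches. The `LogRecord` fields are hypothetical; the proposal does not specify the actual layout of the Cheshire transaction logs.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, NamedTuple


class LogRecord(NamedTuple):
    """One logged search; the field names are assumptions for illustration."""
    session_id: str
    timestamp: float   # seconds since logging began
    search_type: str   # e.g. "title", "subject", "topic"
    hits: int          # number of records returned


def reconstruct_sessions(records: Iterable[LogRecord]) -> Dict[str, List[LogRecord]]:
    """Group records by session and order each session chronologically."""
    sessions: Dict[str, List[LogRecord]] = defaultdict(list)
    for record in records:
        sessions[record.session_id].append(record)
    for session_records in sessions.values():
        session_records.sort(key=lambda r: r.timestamp)
    return dict(sessions)


def search_type_counts(records: Iterable[LogRecord]) -> Counter:
    """Count how often each type of search was conducted."""
    return Counter(record.search_type for record in records)


def zero_hit_rate(records: Iterable[LogRecord]) -> float:
    """Fraction of searches that retrieved nothing (0.0 for an empty log)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for record in records if record.hits == 0) / len(records)
```

Comparing `search_type_counts` across time periods would be one simple way to look for shifts in searching behaviour, alongside the questionnaire responses.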
We intend to provide a continuous assessment of the project and dissemination of
information about it in publications and in workshops. Articles have already appeared
in the Journal of the Society of Archivists and in JASIS. The role of the
AHDS in ongoing information dissemination will be critical to our strategy of broadening
the relevance and use of the access tools described in this proposal.
9.1 Summative Evaluation
The summative evaluation will focus on the use and usability of the system, and its
effectiveness in the context of system performance. This project will support valuable
analysis of information systems, retrieval algorithms, and data structuring in a
production environment. The conclusions reached, by both internal and external evaluators,
may impact on future investment and resource allocation decisions, as well as the design
criteria for future information access systems.
10 Intellectual Advances
A primary focus of the research in our project is the development of better
content-based access methods, along with better paradigms of user interaction with content
once located. The research will, consequently, advance our basic understanding of
information science in the following ways:
- The current classification of information into categories of text, numeric data,
multimedia, and geospatial information will be largely dissolved. This will in turn
greatly strengthen plain-language access to otherwise hidden information.
- The system will provide a new paradigm for information discovery and retrieval,
exploiting the fundamental interconnections between diverse information resources.
- The system and the bridges built as part of the project will provide a tool
that will foster the serendipitous discovery of new interdisciplinary knowledge by the
users of the system, including teachers, researchers, students, and ordinary citizens.
The originality of the proposed work is twofold:
- In designing ways to traverse between very heterogeneous data environments;
- In adopting a component-level approach to information retrieval systems.
We believe that the architecture briefly outlined above represents a fundamental and
inevitable step in the design of information access and retrieval systems. Each aspect of
the indexing and retrieval process carried out by the system can be implemented in a
shared and distributed fashion, facilitating the application of new combinations and
sequences of retrieval and indexing algorithms for the differing needs of particular data
sources or user requirements. This will provide a platform for systematic analysis and
exploration of new information retrieval methods and permit examination of retrieval
algorithms at the component instead of the system level.
11 Conclusions
The provision of a production-level version of this retrieval system will yield
important economies in the development and implementation of digital libraries in
receipt of funding. More than this, addressing the issues of navigation, indexing, and
searching raised as a result of this funding allocation would be an important step in
determining how standards-based, large-scale network resources can become comprehensible
to a diverse user community with relatively little in the way of resources, guidance, or
assistance.
In today's ever-broadening landscape of on-line databases, web sites, search
engines, and protocols, it has become increasingly difficult for users to know where, how,
and what to search to satisfy their needs for information and knowledge. Today's
information seekers must have extensive knowledge and skills to navigate these on-line
resources and to extract the knowledge relevant to their needs.
We have proposed a standards-based, next-generation on-line information system to
aid the user in exploiting the fundamental interconnections between diverse information
resources, including textual and bibliographic information, multimedia, numerical
databases, and geospatial information resources. It can provide a platform and protocols
to integrate databases with fundamentally different content and structure into a coherent
common retrieval, display, and analysis environment to promote the discovery and use of
new knowledge.
A Prior NSF Support
The PI (Larson) was a faculty investigator on the NSF/NASA/ARPA Digital Library
Initiative Project (IRI9411334) entitled The Environmental Electronic Library: A
Prototype of a Scalable Intelligent Distributed Electronic Library. PIs: Robert
Wilensky and Michael Stonebraker. $4 million, 9/94-8/98.
This project (involving many faculty and graduate student researchers) developed a
prototype digital library focused on the California environment. The Library and its
contents are accessible via http://elib.cs.berkeley.edu. The project explored many
areas ranging from document decoding and scanning to information categorization and
retrieval. The project showed that it was possible to build an effective and useful
large-scale digital library of scanned document images, and to enliven those images through
developments in Optical Character Recognition and the development of a new model for
online documents called Multivalent Documents. The project also showed that
effective retrieval from OCR data (with relatively high misinterpretation of characters
and words) is possible (the Cheshire II system is the primary text search and
retrieval engine for the project). The project also made several breakthroughs in computer
vision research and in the design of scalable information retrieval methods for image
retrieval (also using Cheshire II in conjunction with selection and matching methods
developed by computer vision researchers). The project resulted in many publications, many
of which are accessible from the project web site. Publications specifically relevant to
this project are:
M. K. Buckland & C. Plaunt. On the construction of selection systems.
Library Hi Tech, 48 (1994): 15-28.
M. K. Buckland & C. Plaunt. Selecting Libraries, Selecting Documents,
Selecting Data. In Proceedings of the International Symposium on Research,
Development & Practice in Digital Libraries 1997, ISDL 97, Nov. 18-21, 1997, Tsukuba,
Japan. Tsukuba, Japan: University of Library and Information Science, 1997. Pp. 85-91.
(http://bliss.berkeley.edu/papers/isdl97/isdl97.html)
Ray R. Larson. Geographic Information Retrieval and Spatial Browsing. In L.
Smith and M. Gluck, Eds. GIS and Libraries: Patrons, Maps and Spatial Information.
Urbana-Champaign: University of Illinois, 1996. Pp. 81-124.
Ray R. Larson, Jerome McDonough, Lucy Kuntz, Paul O'Leary, and Ralph Moon.
Cheshire II: Designing a Next-Generation Online Catalog. Journal of the
American Society for Information Science, 47(7) (July 1996): 555-567.
Ray R. Larson and Jerome McDonough. Cheshire II at TREC 6: Interactive
Probabilistic Retrieval. In E. M. Voorhees and D. K. Harman, Eds. Information
Technology: The Sixth Text REtrieval Conference.
Gaithersburg, MD: NIST, 1998. (NIST Special Publication 500-240)
Ray R. Larson. Interactive Probabilistic Retrieval: Cheshire II at TREC 7. In TREC 7
notebook (to appear in TREC 7 proceedings).
1 MODELS 4: Integrating Access to Resources Across Multiple Domains
(December 1996). http://www.ukoln.ac.uk/models/models4.html. See also MODELS 3:
National Resource Discovery Workshop: Organizing Access to Printed Scholarly Material, by
Lorcan Dempsey and R. Russell. http://www.ukoln.ac.uk/models/models3.html.
2 Among the mechanisms used for translating a searcher's query into the
terms used in the databases are: elimination of unused words using field-specific
stopword lists; particular field-specific query-to-key conversion or
normalization functions; algorithms for reducing significant words to their
roots or stems by converting suffix variations, such as plural forms of a word, to a single
form; and support for mapping database and query text words to single forms.
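The query-translation steps listed in this note can be sketched as a small pipeline: normalize the query text, drop field-specific stopwords, and reduce the remaining words to a single form. The stopword lists and the plural-stripping `stem` function below are simplified placeholders chosen for illustration; they are not the lists or stemming algorithm used by Cheshire.

```python
import re
from typing import Dict, FrozenSet, List

# Hypothetical field-specific stopword lists.
STOPWORDS: Dict[str, FrozenSet[str]] = {
    "title": frozenset({"a", "an", "and", "of", "the"}),
    "subject": frozenset({"a", "an", "and", "in", "of", "the"}),
}


def normalize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word: str) -> str:
    """Reduce common English plural suffixes to a single form.

    A deliberately naive stand-in for a real stemming algorithm.
    """
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("es") and word[-3] in "sxz":
        return word[:-2]
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def query_to_keys(query: str, field: str) -> List[str]:
    """Convert a searcher's query into index keys for the given field."""
    stopwords = STOPWORDS.get(field, frozenset())
    return [stem(word) for word in normalize(query) if word not in stopwords]
```

The same `query_to_keys` function would be applied to database text at indexing time, so that query words and indexed words map to the same forms.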
3 A basic configuration file, itself an SGML document, defines the physical
database elements, including the locations of data files, which SGML DTD describes the
file, and information on which indexes to create and the elements they should contain.
4 The on-line questionnaire and transaction logs
will allow for analysis of the different types of searches conducted on the system. One
goal is to determine if trends in searching shift from type to type (e.g., title to
subject or topic) with the use of a probabilistic catalogue. For the user questionnaire
developed for the Council on Library Resources see (Matthews et al., 1983).